From patchwork Wed May 20 14:37:12 2020
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Chris Down <chris@chrisdown.name>
X-Patchwork-Id: 11560561
Date: Wed, 20 May 2020 15:37:12 +0100
From: Chris Down <chris@chrisdown.name>
To: Andrew Morton <akpm@linux-foundation.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>, Tejun Heo <tj@kernel.org>,
	Michal Hocko <mhocko@kernel.org>, linux-mm@kvack.org,
	cgroups@vger.kernel.org, linux-kernel@vger.kernel.org,
	kernel-team@fb.com
Subject: [PATCH] mm, memcg: reclaim more aggressively before high allocator
 throttling
Message-ID: <20200520143712.GA749486@chrisdown.name>
MIME-Version: 1.0
Content-Disposition: inline
List-ID: <linux-mm.kvack.org>

In Facebook production, we've seen cases where cgroups have been put
into allocator throttling even when they appear to have a lot of slack
file caches which should be trivially reclaimable.

Looking more closely, the problem is that we only try a single cgroup
reclaim walk for each return to usermode before calculating whether or
not we should throttle. For cgroups whose file caches are growing
rapidly, this single attempt doesn't produce enough pressure to shrink
them before we enter allocator throttling.

As an example, we see that threads in an affected cgroup are stuck in
allocator throttling:

    # for i in $(cat cgroup.threads); do
    >     grep over_high "/proc/$i/stack"
    > done
    [<0>] mem_cgroup_handle_over_high+0x10b/0x150
    [<0>] mem_cgroup_handle_over_high+0x10b/0x150
    [<0>] mem_cgroup_handle_over_high+0x10b/0x150

...however, there is no I/O pressure reported by PSI, despite a lot of
slack file pages:

    # cat memory.pressure
    some avg10=78.50 avg60=84.99 avg300=84.53 total=5702440903
    full avg10=78.50 avg60=84.99 avg300=84.53 total=5702116959
    # cat io.pressure
    some avg10=0.00 avg60=0.00 avg300=0.00 total=78051391
    full avg10=0.00 avg60=0.00 avg300=0.00 total=78049640
    # grep _file memory.stat
    inactive_file 1370939392
    active_file 661635072

This patch changes the behaviour to retry reclaim until either the
current task's throttling penalty falls within the 10ms grace period, or
we have used up MEM_CGROUP_RECLAIM_RETRIES retries on attempts that
reclaimed nothing. In the latter case, we enter allocator throttling as
before.

To a user, there's no intuitive reason for the reclaim behaviour to
differ between hitting memory.high as part of a new allocation and
hitting memory.high because someone lowered its value. As such this
also brings an added benefit: it unifies the reclaim behaviour between
the two.

There's precedent for this behaviour: we already do reclaim retries when
writing to memory.{high,max}, in max reclaim, and in the page allocator
itself.

Signed-off-by: Chris Down <chris@chrisdown.name>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Michal Hocko <mhocko@kernel.org>
---
 mm/memcontrol.c | 28 +++++++++++++++++++++++-----
 1 file changed, 23 insertions(+), 5 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 2df9510b7d64..b040951ccd6b 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -73,6 +73,7 @@ EXPORT_SYMBOL(memory_cgrp_subsys);
 
 struct mem_cgroup *root_mem_cgroup __read_mostly;
 
+/* The number of times we should retry reclaim failures before giving up. */
 #define MEM_CGROUP_RECLAIM_RETRIES	5
 
 /* Socket memory accounting disabled? */
@@ -2228,17 +2229,22 @@ static int memcg_hotplug_cpu_dead(unsigned int cpu)
 	return 0;
 }
 
-static void reclaim_high(struct mem_cgroup *memcg,
-			 unsigned int nr_pages,
-			 gfp_t gfp_mask)
+static unsigned long reclaim_high(struct mem_cgroup *memcg,
+				  unsigned int nr_pages,
+				  gfp_t gfp_mask)
 {
+	unsigned long nr_reclaimed = 0;
+
 	do {
 		if (page_counter_read(&memcg->memory) <= READ_ONCE(memcg->high))
 			continue;
 		memcg_memory_event(memcg, MEMCG_HIGH);
-		try_to_free_mem_cgroup_pages(memcg, nr_pages, gfp_mask, true);
+		nr_reclaimed += try_to_free_mem_cgroup_pages(memcg, nr_pages,
+							     gfp_mask, true);
 	} while ((memcg = parent_mem_cgroup(memcg)) &&
 		 !mem_cgroup_is_root(memcg));
+
+	return nr_reclaimed;
 }
 
 static void high_work_func(struct work_struct *work)
@@ -2378,16 +2384,20 @@ void mem_cgroup_handle_over_high(void)
 {
 	unsigned long penalty_jiffies;
 	unsigned long pflags;
+	unsigned long nr_reclaimed;
 	unsigned int nr_pages = current->memcg_nr_pages_over_high;
+	int nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
 	struct mem_cgroup *memcg;
 
 	if (likely(!nr_pages))
 		return;
 
 	memcg = get_mem_cgroup_from_mm(current->mm);
-	reclaim_high(memcg, nr_pages, GFP_KERNEL);
 	current->memcg_nr_pages_over_high = 0;
 
+retry_reclaim:
+	nr_reclaimed = reclaim_high(memcg, nr_pages, GFP_KERNEL);
+
 	/*
 	 * memory.high is breached and reclaim is unable to keep up. Throttle
 	 * allocators proactively to slow down excessive growth.
@@ -2403,6 +2413,14 @@ void mem_cgroup_handle_over_high(void)
 	if (penalty_jiffies <= HZ / 100)
 		goto out;
 
+	/*
+	 * If reclaim is making forward progress but we're still over
+	 * memory.high, keep reclaiming rather than throttling the allocator.
+	 * Without progress, give up once nr_retries is exhausted.
+	 */
+	if (nr_reclaimed || nr_retries--)
+		goto retry_reclaim;
+
 	/*
 	 * If we exit early, we're guaranteed to die (since
 	 * schedule_timeout_killable sets TASK_KILLABLE). This means we don't