[2/2] mm: memcontrol: try harder to set a new memory.high


Message ID

20191022201518.341216-2-hannes@cmpxchg.org (mailing list archive)

State

New, archived

Headers

DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 2488521783
From: Johannes Weiner <hannes@cmpxchg.org>
To: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@suse.com>,
	linux-mm@kvack.org,
	cgroups@vger.kernel.org,
	linux-kernel@vger.kernel.org,
	kernel-team@fb.com
Subject: [PATCH 2/2] mm: memcontrol: try harder to set a new memory.high
Date: Tue, 22 Oct 2019 16:15:18 -0400
Message-Id: <20191022201518.341216-2-hannes@cmpxchg.org>
In-Reply-To: <20191022201518.341216-1-hannes@cmpxchg.org>
References: <20191022201518.341216-1-hannes@cmpxchg.org>
MIME-Version: 1.0
Content-Transfer-Encoding: quoted-printable
Sender: owner-linux-mm@kvack.org
Precedence: bulk

Series

[1/2] mm: memcontrol: remove dead code from memory_max_write()
[2/2] mm: memcontrol: try harder to set a new memory.high

Commit Message

Johannes Weiner Oct. 22, 2019, 8:15 p.m. UTC

Setting a memory.high limit below the usage makes almost no effort to
shrink the cgroup to the new target size.

While memory.high is a "soft" limit that isn't supposed to cause OOM
situations, we should still try harder to meet a user request through
persistent reclaim.

For example, after setting a 10M memory.high on an 800M cgroup full of
file cache, the usage shrinks to about 350M:

+ cat /cgroup/workingset/memory.current
841568256
+ echo 10M
+ cat /cgroup/workingset/memory.current
355729408

This isn't exactly what the user would expect to happen. Setting the
value a few more times eventually whittles the usage down to what we
are asking for:

+ echo 10M
+ cat /cgroup/workingset/memory.current
104181760
+ echo 10M
+ cat /cgroup/workingset/memory.current
31801344
+ echo 10M
+ cat /cgroup/workingset/memory.current
10440704

To improve this, add reclaim retry loops to the memory.high write()
callback, similar to what we do for memory.max, to make a reasonable
effort that the usage meets the requested size after the call returns.

Afterwards, a single write() to memory.high is enough in all but
extreme cases:

+ cat /cgroup/workingset/memory.current
841609216
+ echo 10M
+ cat /cgroup/workingset/memory.current
10182656

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/memcontrol.c | 30 ++++++++++++++++++++++++------
 1 file changed, 24 insertions(+), 6 deletions(-)

Comments

Michal Hocko Oct. 23, 2019, 6:59 a.m. UTC | #1

On Tue 22-10-19 16:15:18, Johannes Weiner wrote:
> To improve this, add reclaim retry loops to the memory.high write()
> callback, similar to what we do for memory.max, to make a reasonable
> effort that the usage meets the requested size after the call returns.

That suggests that the reclaim couldn't meet the given reclaim target
but later attempts just made it through. Is this due to the amount of
dirty pages, or what prevented the reclaim from doing its job?

While I am not against the reclaim retry loop, I would like to understand
the underlying issue. Because if this is really about dirty memory, then
we should probably be more proactive in flushing it. Otherwise the
retry might not be of any help.


Johannes Weiner Oct. 23, 2019, 5:57 p.m. UTC | #2

On Wed, Oct 23, 2019 at 08:59:49AM +0200, Michal Hocko wrote:
> That suggests that the reclaim couldn't meet the given reclaim target
> but later attempts just made it through. Is this due to the amount of
> dirty pages, or what prevented the reclaim from doing its job?
> 
> While I am not against the reclaim retry loop, I would like to understand
> the underlying issue. Because if this is really about dirty memory, then
> we should probably be more proactive in flushing it. Otherwise the
> retry might not be of any help.

All the pages in my test case are clean cache. But they are active,
and they need to go through the inactive list before reclaiming. The
inactive list size is designed to pre-age just enough pages for
regular reclaim targets, i.e. pages in the SWAP_CLUSTER_MAX ballpark.
In this case, the reclaim goal for a single invocation is 790M and the
inactive list is a small funnel to put all that through, and we need
several iterations to accomplish that.

But 790M is not a reasonable reclaim target to ask of a single reclaim
invocation. And it wouldn't be reasonable to optimize the reclaim code
for it. So asking for the full size but retrying is not a bad choice
here: we express our intent, and benefit if reclaim becomes better at
handling larger requests, but we also acknowledge that some of the
deltas we can encounter in memory_high_write() are just too
ridiculously big for a single reclaim invocation to manage.
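
The funnel effect described above can be illustrated with a toy userspace model. This is only a sketch under loud assumptions: the struct, the function names, and the rule that each pass ages half of the active list are invented for illustration and do not reflect how the kernel actually balances its LRU lists.

```c
#include <stdio.h>

#define TOY_RECLAIM_RETRIES 5	/* stand-in for MEM_CGROUP_RECLAIM_RETRIES */

/* Toy cgroup: pages are either active or already aged (inactive). */
struct toy_memcg {
	unsigned long active;
	unsigned long inactive;
};

static unsigned long toy_usage(const struct toy_memcg *cg)
{
	return cg->active + cg->inactive;
}

/*
 * One toy reclaim invocation: free at most nr_to_reclaim pages, but only
 * from the inactive list, then age half of the active pages (rounded up).
 * The halving is an arbitrary assumption, chosen to mimic a narrow funnel.
 */
static unsigned long toy_reclaim(struct toy_memcg *cg,
				 unsigned long nr_to_reclaim)
{
	unsigned long freed = cg->inactive < nr_to_reclaim ?
			      cg->inactive : nr_to_reclaim;
	unsigned long aged = (cg->active + 1) / 2;

	cg->inactive -= freed;
	cg->active -= aged;
	cg->inactive += aged;
	return freed;
}

/*
 * Ask for the full delta on every pass and retry, in the spirit of the
 * loop in the patch. Returns the final usage; *passes counts invocations.
 */
static unsigned long toy_set_high(struct toy_memcg *cg, unsigned long high,
				  unsigned int *passes)
{
	unsigned int nr_retries = TOY_RECLAIM_RETRIES;

	*passes = 0;
	for (;;) {
		unsigned long usage = toy_usage(cg);
		unsigned long reclaimed;

		if (usage <= high)
			break;

		reclaimed = toy_reclaim(cg, usage - high);
		(*passes)++;

		if (!reclaimed && !nr_retries--)
			break;
	}
	return toy_usage(cg);
}
```

In this model, a single toy_reclaim() call on a mostly active cgroup frees only what was already on the inactive list, however large the request, while toy_set_high() reaches the target after several passes.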

Michal Hocko Oct. 24, 2019, 8:24 a.m. UTC | #3

On Wed 23-10-19 13:57:24, Johannes Weiner wrote:
> All the pages in my test case are clean cache. But they are active,
> and they need to go through the inactive list before reclaiming. The
> inactive list size is designed to pre-age just enough pages for
> regular reclaim targets, i.e. pages in the SWAP_CLUSTER_MAX ballpark.
> In this case, the reclaim goal for a single invocation is 790M and the
> inactive list is a small funnel to put all that through, and we need
> several iterations to accomplish that.

Thanks for the clarification.

> But 790M is not a reasonable reclaim target to ask of a single reclaim
> invocation. And it wouldn't be reasonable to optimize the reclaim code
> for it. So asking for the full size but retrying is not a bad choice
> here: we express our intent, and benefit if reclaim becomes better at
> handling larger requests, but we also acknowledge that some of the
> deltas we can encounter in memory_high_write() are just too
> ridiculously big for a single reclaim invocation to manage.

Yes, that makes sense, and I think it should be a part of the changelog.

Acked-by: Michal Hocko <mhocko@suse.com>

Thanks!

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index ff90d4e7df37..8090b4c99ac7 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -6074,7 +6074,8 @@  static ssize_t memory_high_write(struct kernfs_open_file *of,
 				 char *buf, size_t nbytes, loff_t off)
 {
 	struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
-	unsigned long nr_pages;
+	unsigned int nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
+	bool drained = false;
 	unsigned long high;
 	int err;
 
@@ -6085,12 +6086,29 @@  static ssize_t memory_high_write(struct kernfs_open_file *of,
 
 	memcg->high = high;
 
-	nr_pages = page_counter_read(&memcg->memory);
-	if (nr_pages > high)
-		try_to_free_mem_cgroup_pages(memcg, nr_pages - high,
-					     GFP_KERNEL, true);
+	for (;;) {
+		unsigned long nr_pages = page_counter_read(&memcg->memory);
+		unsigned long reclaimed;
+
+		if (nr_pages <= high)
+			break;
+
+		if (signal_pending(current))
+			break;
+
+		if (!drained) {
+			drain_all_stock(memcg);
+			drained = true;
+			continue;
+		}
+
+		reclaimed = try_to_free_mem_cgroup_pages(memcg, nr_pages - high,
+							 GFP_KERNEL, true);
+
+		if (!reclaimed && !nr_retries--)
+			break;
+	}
 
-	memcg_wb_domain_size_changed(memcg);
 	return nbytes;
 }
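
The termination rule in the loop above is compact, so here is a userspace sketch of just the control flow. The callback struct, the stub cgroup, and the retry count of 5 (assumed to mirror MEM_CGROUP_RECLAIM_RETRIES) are illustrative assumptions, not kernel API. The point it tries to show: the stock is drained once before any reclaim, and the retry budget is only consumed by passes that reclaim nothing, so a cgroup that never shrinks sees one more attempt than the budget.

```c
#include <stdbool.h>

#define RECLAIM_RETRIES 5	/* assumed value of MEM_CGROUP_RECLAIM_RETRIES */

/* Hypothetical callbacks standing in for the kernel helpers. */
struct loop_ops {
	unsigned long (*usage)(void *ctx);
	unsigned long (*reclaim)(void *ctx, unsigned long nr);
	void (*drain)(void *ctx);
	bool (*interrupted)(void *ctx);
};

/* Same control flow as the patched loop; returns the number of reclaim calls. */
static unsigned int shrink_to(const struct loop_ops *ops, void *ctx,
			      unsigned long high)
{
	unsigned int nr_retries = RECLAIM_RETRIES;
	unsigned int attempts = 0;
	bool drained = false;

	for (;;) {
		unsigned long nr = ops->usage(ctx);

		if (nr <= high)
			break;
		if (ops->interrupted(ctx))
			break;
		if (!drained) {
			/* Drain first, then re-read the usage. */
			ops->drain(ctx);
			drained = true;
			continue;
		}

		attempts++;
		/* Post-decrement: only zero-progress passes use up the budget. */
		if (!ops->reclaim(ctx, nr - high) && !nr_retries--)
			break;
	}
	return attempts;
}

/* Stub cgroup whose usage never shrinks, to exercise the give-up path. */
struct stuck_cg {
	unsigned long usage;
	unsigned int drains;
};

static unsigned long stuck_usage(void *ctx)
{
	return ((struct stuck_cg *)ctx)->usage;
}

static unsigned long stuck_reclaim(void *ctx, unsigned long nr)
{
	(void)ctx;
	(void)nr;
	return 0;	/* never makes progress */
}

static void stuck_drain(void *ctx)
{
	((struct stuck_cg *)ctx)->drains++;
}

static bool never_interrupted(void *ctx)
{
	(void)ctx;
	return false;
}

static const struct loop_ops stuck_ops = {
	.usage = stuck_usage,
	.reclaim = stuck_reclaim,
	.drain = stuck_drain,
	.interrupted = never_interrupted,
};
```

With the stub, shrink_to(&stuck_ops, &cg, high) gives up after RECLAIM_RETRIES + 1 reclaim calls and exactly one drain.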
