
mm, vmscan: guarantee drop_slab_node() termination

Message ID 20210818152239.25502-1-vbabka@suse.cz (mailing list archive)
State New
Series mm, vmscan: guarantee drop_slab_node() termination

Commit Message

Vlastimil Babka Aug. 18, 2021, 3:22 p.m. UTC
drop_slab_node() is called as part of the echo 2 > /proc/sys/vm/drop_caches
operation. It iterates over all memcgs and calls shrink_slab(), which in turn
iterates over all slab shrinkers. Freed objects are counted, and as long as the
total number of freed objects from all memcgs and shrinkers is higher than 10,
drop_slab_node() loops for another full memcgs*shrinkers iteration.

This arbitrary constant threshold of 10 can result in an effectively infinite
loop on a system with a large number of memcgs and/or parallel activity that
allocates new objects. This has been reported previously by Chunxin Zang [1]
and recently by our customer.

The previous report [1] resulted in commit 069c411de40a ("mm/vmscan: fix
infinite loop in drop_slab_node"), which added a check for pending signals,
allowing the user to terminate the command writing to drop_caches. At the time
it was also considered to make the threshold grow with each iteration to
guarantee termination, but such a patch hasn't been formally proposed yet.

This patch implements the dynamically growing threshold. In the first iteration
it's enough to free a single object to continue, and the threshold then
effectively doubles with each iteration. Our customer's feedback was positive.
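
As an illustration of the backoff, here is a minimal userspace sketch (not the
kernel code itself; fake_shrink() is a hypothetical stand-in for shrink_slab())
showing that the loop terminates even if parallel activity keeps reintroducing
a constant number of objects every pass:

	#include <stdio.h>

	/* pretend parallel activity keeps reintroducing ~1000 objects per pass */
	static unsigned long fake_shrink(void)
	{
		return 1000;
	}

	int main(void)
	{
		unsigned long freed;
		int shift = 0, passes = 0;

		do {
			freed = fake_shrink();
			passes++;
		} while ((freed >> shift++) > 0);

		/* 1000 >> 10 == 0, so this prints 11; the old "freed > 10"
		 * check would have looped forever in this scenario */
		printf("terminated after %d passes\n", passes);
		return 0;
	}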

There is always a risk that on some system this change will cause a previously
terminating drop_caches operation to terminate sooner and free fewer objects.
Ideally the semantics would guarantee freeing all freeable objects that existed
at the moment the operation started, while not looping forever over newly
allocated objects, but that's not feasible to track. In the less ideal solution
based on thresholds, arguably the termination guarantee is more important than
the exhaustiveness guarantee. If there are reports of a large regression in
exhaustiveness, we can tune how fast the threshold grows.

[1] https://lore.kernel.org/lkml/20200909152047.27905-1-zangchunxin@bytedance.com/T/#u

Reported-by: Chunxin Zang <zangchunxin@bytedance.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
 mm/vmscan.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

Comments

Chris Down Aug. 18, 2021, 9:48 p.m. UTC | #1
Vlastimil Babka writes:
>@@ -948,7 +949,7 @@ void drop_slab_node(int nid)
> 		do {
> 			freed += shrink_slab(GFP_KERNEL, nid, memcg, 0);
> 		} while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)) != NULL);
>-	} while (freed > 10);
>+	} while ((freed >> shift++) > 0);

I think this is a good idea, thanks for bringing it up :-)

I'm not sure about the bitshift idea, though. It certainly makes sure that even 
large, continuous periods of reclaim eventually terminate, but I find it hard 
to reason about -- for example, if there's a lot of parallel activity, that 
might result in 10 constantly reintroduced pages, or 1000 pages, and it's not 
immediately obvious that we should treat those differently.

What about using MAX_RECLAIM_RETRIES? There's already precedent for using it in 
non-OOM scenarios, like mem_cgroup_handle_over_high.
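
For concreteness, a hedged userspace sketch of that alternative (not an actual
patch; it keeps the fixed threshold of 10 but caps the number of passes, with
MAX_RECLAIM_RETRIES being 16 in mm/internal.h at the time of writing):

	#include <stdio.h>

	#define MAX_RECLAIM_RETRIES 16	/* value from mm/internal.h */

	int main(void)
	{
		unsigned long freed;
		int retries = 0;

		do {
			freed = 1000;	/* constant reintroduction, as in the reports */
		} while (freed > 10 && ++retries < MAX_RECLAIM_RETRIES);

		/* prints 16: the retry cap is what provides the termination point */
		printf("capped after %d passes\n", retries);
		return 0;
	}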
Kefeng Wang Aug. 19, 2021, 2:55 a.m. UTC | #2
On 2021/8/19 5:48, Chris Down wrote:
> Vlastimil Babka writes:
>> @@ -948,7 +949,7 @@ void drop_slab_node(int nid)
>>         do {
>>             freed += shrink_slab(GFP_KERNEL, nid, memcg, 0);
>>         } while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)) != NULL);
>> -    } while (freed > 10);
>> +    } while ((freed >> shift++) > 0);
>
> I think this is a good idea, thanks for bringing it up :-)
>
> I'm not sure about the bitshift idea, though. It certainly makes sure 
> that even large, continuous periods of reclaim eventually terminate, 
> but I find it hard to reason about -- for example, if there's a lot of 
> parallel activity, that might result in 10 constantly reintroduced 
> pages, or 1000 pages, and it's not immediately obvious that we should 
> treat those differently.
>
> What about using MAX_RECLAIM_RETRIES? There's already precedent for 
> using it in non-OOM scenarios, like mem_cgroup_handle_over_high.

Yes, we met this issue too, and we added a max loop limit in 
drop_slab_node() in our kernel, which can also be reconfigured via 
sysctl ;)

Vlastimil Babka Aug. 19, 2021, 7:01 a.m. UTC | #3
On 8/19/21 4:55 AM, Kefeng Wang wrote:
> 
> On 2021/8/19 5:48, Chris Down wrote:
>> Vlastimil Babka writes:
>>
>> I think this is a good idea, thanks for bringing it up :-)
>>
>> I'm not sure about the bitshift idea, though. It certainly makes sure 
>> that even large, continuous periods of reclaim eventually terminate, 
>> but I find it hard to reason about -- for example, if there's a lot of 
>> parallel activity, that might result in 10 constantly reintroduced 
>> pages, or 1000 pages, and it's not immediately obvious that we should 
>> treat those differently.
>>
>> What about using MAX_RECLAIM_RETRIES? There's already precedent for 
>> using it in non-OOM scenarios, like mem_cgroup_handle_over_high.

It's an option, but then (together with a fixed threshold) it ignores how
large the 'freed' value is, as long as it's above the threshold? Although the
end result will probably not be much different.

> Yes, we met this issue too, and we added a max loop limit in 
> drop_slab_node() in our kernel, which can also be reconfigured via 
> sysctl ;)

Sysctl sounds like overkill. How do you tune the limit? Any
experience with what scenarios need what limit?
Kefeng Wang Aug. 19, 2021, 9:38 a.m. UTC | #4
On 2021/8/19 15:01, Vlastimil Babka wrote:
> On 8/19/21 4:55 AM, Kefeng Wang wrote:
>> On 2021/8/19 5:48, Chris Down wrote:
>>> Vlastimil Babka writes:
>>>
>>> I think this is a good idea, thanks for bringing it up :-)
>>>
>>> I'm not sure about the bitshift idea, though. It certainly makes sure
>>> that even large, continuous periods of reclaim eventually terminate,
>>> but I find it hard to reason about -- for example, if there's a lot of
>>> parallel activity, that might result in 10 constantly reintroduced
>>> pages, or 1000 pages, and it's not immediately obvious that we should
>>> treat those differently.
>>>
>>> What about using MAX_RECLAIM_RETRIES? There's already precedent for
>>> using it in non-OOM scenarios, like mem_cgroup_handle_over_high.
> It's an option, but then (together with a fixed threshold) it ignores how
> large the 'freed' value is, as long as it's above the threshold? Although the
> end result will probably not be much different.
>
>> Yes, we met this issue too, and we added a max loop limit in
>> drop_slab_node() in our kernel, which can also be reconfigured via
>> sysctl ;)
> Sysctl sounds like overkill. How do you tune the limit? Any
> experience with what scenarios need what limit?

There is no clear limit. We did some testing and then chose a large limit that
is tolerable for our users, and the user can change it dynamically via the
sysctl interface.

The dynamically growing threshold is great; I very much agree with this patch.

As Chris said, one option is to also add a max limit and let the user make a
decision according to their scenario, but that is just an option.

Chris Down Aug. 19, 2021, 1:21 p.m. UTC | #5
Vlastimil Babka writes:
>On 8/19/21 4:55 AM, Kefeng Wang wrote:
>>
>> On 2021/8/19 5:48, Chris Down wrote:
>>> Vlastimil Babka writes:
>>>
>>> I think this is a good idea, thanks for bringing it up :-)
>>>
>>> I'm not sure about the bitshift idea, though. It certainly makes sure
>>> that even large, continuous periods of reclaim eventually terminate,
>>> but I find it hard to reason about -- for example, if there's a lot of
>>> parallel activity, that might result in 10 constantly reintroduced
>>> pages, or 1000 pages, and it's not immediately obvious that we should
>>> treat those differently.
>>>
>>> What about using MAX_RECLAIM_RETRIES? There's already precedent for
>>> using it in non-OOM scenarios, like mem_cgroup_handle_over_high.
>
>It's an option, but then (together with a fixed threshold) it ignores how
>large the 'freed' value is, as long as it's above the threshold? Although the
>end result will probably not be much different.

Yeah, but we already draw the line at 10 right now. `freed > 10 && retries < 
MAX_RECLAIM_RETRIES` seems easier to reason about, at least to me, and 
stays closer to the current behaviour while providing a definitive point of 
loop termination.

>> Yes, we met this issue too, and we added a max loop limit in
>> drop_slab_node() in our kernel, which can also be reconfigured via
>> sysctl ;)
>
>Sysctl sounds like overkill. How do you tune the limit? Any
>experience with what scenarios need what limit?

sysctls in mm tend to mean we didn't think hard enough about how things should 
work and just punted it to the user :-)

So I agree with you that I'd rather not have a sysctl, and just go the 
MAX_RECLAIM_RETRIES route. After that I'll happily add my Acked-by.
Michal Hocko Aug. 19, 2021, 2:16 p.m. UTC | #6
On Thu 19-08-21 14:21:08, Chris Down wrote:
> Vlastimil Babka writes:
> > On 8/19/21 4:55 AM, Kefeng Wang wrote:
> > > 
> > > On 2021/8/19 5:48, Chris Down wrote:
> > > > Vlastimil Babka writes:
> > > > 
> > > > I think this is a good idea, thanks for bringing it up :-)
> > > > 
> > > > I'm not sure about the bitshift idea, though. It certainly makes sure
> > > > that even large, continuous periods of reclaim eventually terminate,
> > > > but I find it hard to reason about -- for example, if there's a lot of
> > > > parallel activity, that might result in 10 constantly reintroduced
> > > > pages, or 1000 pages, and it's not immediately obvious that we should
> > > > treat those differently.
> > > > 
> > > > What about using MAX_RECLAIM_RETRIES? There's already precedent for
> > > > using it in non-OOM scenarios, like mem_cgroup_handle_over_high.
> > 
> > It's an option, but then (together with fixed threshold) it ignores how
> > large the 'freed' value is, as long it's above threshold? Although the
> > end result will probably not be much different.
> 
> Yeah, but we already draw the line at 10 right now. `freed > 10 && retries <
> MAX_RECLAIM_RETRIES` seems easier to reason about, at least to me, and stays
> closer to the current behaviour while providing a definitive point of loop
> termination.

I have to say that I am not really a fan of the MAX_RECLAIM_RETRIES approach,
especially for user interfaces. Any limit on retries has come back to bite us
before (e.g. offlining for memory hotplug, to mention just one of those).
drop_caches can take a long time on its own even without retrying. We
should teach people to interrupt those operations if they really need them to
finish early (e.g. timeout $TIMEOUT echo > /proc/sys/vm/drop_caches)
rather than trying to be extra clever here.

I am not against the patch Vlastimil is proposing, because it replaces an
ad-hoc threshold on the number of reclaimed objects with something that is
less "random" - a sort of backoff instead, which seems like an improvement to
me. But I would still be worried that this could regress for some users,
so in an ideal world the existing bail-out on signals should be enough.
Vlastimil Babka Aug. 24, 2021, 9:33 a.m. UTC | #7
On 8/19/21 16:16, Michal Hocko wrote:
> On Thu 19-08-21 14:21:08, Chris Down wrote:
>> 
>> Yeah, but we already draw the line at 10 right now. `freed > 10 && retries <
>> MAX_RECLAIM_RETRIES` seems easier to reason about, at least to me, and stays
>> closer to the current behaviour while providing a definitive point of loop
>> termination.

The doubling threshold is also definitive :) At most 64 iterations on 64-bit
systems, 32 on 32-bit. In practice it will be far fewer - e.g. even if every pass
somehow freed a million objects, 1000000 >> 20 == 0, so the loop would stop
after at most 21 passes.

> I have to say that I am not really a fan of MAX_RECLAIM_RETRIES approach
> especially for user interfaces. Any limit on retries has kicked us back
> (e.g. offlining for the memory hotplug just to mention one of those).
> drop_caches can take a long time on its own even without retrying. We
> should teach people to interrupt those operations if they should really
> finish early (e.g. timeout $TIMEOUT echo > /proc/sys/vm/drop_caches)
> rather than trying to be extra clever here.

I'm afraid the retries vs. threshold question has more potential for
bikeshedding than actual difference in practice. I can change it if more
people prefer it that way.

> I am not against the patch Vlastimil is proposing because it replaces an
> ad-hoc limit on the reclaimed objects threshold with something that is
> less "random" - sort of a backoff instead seems like an improvement to
> me. But I would still be worried that this could regress for some users
> so in an ideal world the existing bail on signal should be enough.

My point of view on this is:

- first, let's not forget this is an interface intended for debugging, one whose
regular use we generally discourage. Yet it exists, so people will inevitably
use it from automated scripts, cron and whatnot.

- as such, there are no guarantees how much it reclaims; it's a best effort
really. If a user thinks it hasn't reclaimed enough, they can always implement
their own userspace loop that checks e.g. meminfo or slabinfo to decide whether
to do another iteration of drop_caches (a rough sketch follows at the end of
this mail).

- while the userspace loop might be convoluted, the idea of running with a
timeout is not obvious. One might see the operation finish fine in initial
testing, put it in cron etc. without a timeout, and one day the infinite loop
kicks in and there's trouble.

So I think the safer default that avoids surprises is to guarantee termination
rather than aggressive reclaim.
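
A rough sketch of such a userspace loop (illustrative only, not from the patch;
it assumes root, parses /proc/meminfo naively, and stops once a drop_caches pass
frees less than ~1 MB of reclaimable slab):

	#include <stdio.h>

	/* read SReclaimable (in kB) from /proc/meminfo, or -1 on error */
	static long sreclaimable_kb(void)
	{
		char line[256];
		long kb = -1;
		FILE *f = fopen("/proc/meminfo", "r");

		if (!f)
			return -1;
		while (fgets(line, sizeof(line), f))
			if (sscanf(line, "SReclaimable: %ld kB", &kb) == 1)
				break;
		fclose(f);
		return kb;
	}

	int main(void)
	{
		long before, after;
		FILE *f;

		do {
			before = sreclaimable_kb();
			f = fopen("/proc/sys/vm/drop_caches", "w");
			if (!f)
				return 1;
			fputs("2", f);	/* "2" == drop reclaimable slab objects */
			fclose(f);
			after = sreclaimable_kb();
		} while (before >= 0 && after >= 0 && before - after > 1024);

		return 0;
	}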
Matthew Wilcox (Oracle) Aug. 24, 2021, 10:02 a.m. UTC | #8
On Wed, Aug 18, 2021 at 05:22:39PM +0200, Vlastimil Babka wrote:
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 403a175a720f..ef3554314b47 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -936,6 +936,7 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
>  void drop_slab_node(int nid)
>  {
>  	unsigned long freed;
> +	int shift = 0;
>  
>  	do {
>  		struct mem_cgroup *memcg = NULL;
> @@ -948,7 +949,7 @@ void drop_slab_node(int nid)
>  		do {
>  			freed += shrink_slab(GFP_KERNEL, nid, memcg, 0);
>  		} while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)) != NULL);
> -	} while (freed > 10);
> +	} while ((freed >> shift++) > 0);

This can, if you're really unlucky, produce UB.  If you free 2^63 items
when shift is 63, then 2^63 >> 63 is 1 and shift becomes 64, producing
UB on the next iteration.  We could do:

	} while (shift < BITS_PER_LONG && (freed >> shift++) > 0);

but honestly, that feels silly.  How about:

	} while ((freed >> shift++) > 1);

almost exactly as arbitrary, but guarantees no UB.
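
A quick userspace check of the boundary being described (it assumes a 64-bit
unsigned long; the demo itself never performs the undefined shift):

	#include <stdio.h>

	int main(void)
	{
		unsigned long freed = 1UL << 63;	/* 2^63 objects "freed" */
		int shift = 63;

		printf("%lu\n", freed >> shift);	/* 1 */
		/* with "> 0" this continues, and the next pass shifts by 64 (UB) */
		printf("%d\n", (freed >> shift) > 0);	/* 1 */
		/* with "> 1" the loop stops here, before the undefined shift */
		printf("%d\n", (freed >> shift) > 1);	/* 0 */
		return 0;
	}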
Vlastimil Babka Aug. 24, 2021, 2:04 p.m. UTC | #9
On 8/24/21 12:02, Matthew Wilcox wrote:
> On Wed, Aug 18, 2021 at 05:22:39PM +0200, Vlastimil Babka wrote:
>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>> index 403a175a720f..ef3554314b47 100644
>> --- a/mm/vmscan.c
>> +++ b/mm/vmscan.c
>> @@ -936,6 +936,7 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
>>  void drop_slab_node(int nid)
>>  {
>>  	unsigned long freed;
>> +	int shift = 0;
>>  
>>  	do {
>>  		struct mem_cgroup *memcg = NULL;
>> @@ -948,7 +949,7 @@ void drop_slab_node(int nid)
>>  		do {
>>  			freed += shrink_slab(GFP_KERNEL, nid, memcg, 0);
>>  		} while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)) != NULL);
>> -	} while (freed > 10);
>> +	} while ((freed >> shift++) > 0);
> 
> This can, if you're really unlucky, produce UB.  If you free 2^63 items
> when shift is 63, then 2^63 >> 63 is 1 and shift becomes 64, producing
> UB on the next iteration.  We could do:
> 
> 	} while (shift < BITS_PER_LONG && (freed >> shift++) > 0);
> 
> but honestly, that feels silly.  How about:
> 
> 	} while ((freed >> shift++) > 1);
> 
> almost exactly as arbitrary, but guarantees no UB.

Hey, zero is not arbitrary :P
But thanks, here's a fix up.

From 88189bf16406c5910400193422b3f18f859f18d8 Mon Sep 17 00:00:00 2001
From: Vlastimil Babka <vbabka@suse.cz>
Date: Tue, 24 Aug 2021 14:08:53 +0200
Subject: [PATCH] mm, vmscan: guarantee drop_slab_node() termination-fix

Matthew reports [1] that if we free enough objects, we can eventually
right-shift by BITS_PER_LONG, which is undefined behavior. Raise the
threshold from 0 to 1, which means we will shift only by up to BITS_PER_LONG-1.

[1] https://lore.kernel.org/linux-mm/YSTDnqKgQLvziyQI@casper.infradead.org/

Reported-by: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
 mm/vmscan.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 4ffaa7970904..f08aef08c351 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -952,7 +952,7 @@ void drop_slab_node(int nid)
 		do {
 			freed += shrink_slab(GFP_KERNEL, nid, memcg, 0);
 		} while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)) != NULL);
-	} while ((freed >> shift++) > 0);
+	} while ((freed >> shift++) > 1);
 }
 
 void drop_slab(void)

Patch

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 403a175a720f..ef3554314b47 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -936,6 +936,7 @@  static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
 void drop_slab_node(int nid)
 {
 	unsigned long freed;
+	int shift = 0;
 
 	do {
 		struct mem_cgroup *memcg = NULL;
@@ -948,7 +949,7 @@  void drop_slab_node(int nid)
 		do {
 			freed += shrink_slab(GFP_KERNEL, nid, memcg, 0);
 		} while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)) != NULL);
-	} while (freed > 10);
+	} while ((freed >> shift++) > 0);
 }
 
 void drop_slab(void)