Message ID | 20210818152239.25502-1-vbabka@suse.cz
---|---
State | New
Series | mm, vmscan: guarantee drop_slab_node() termination
Vlastimil Babka writes:
> @@ -948,7 +949,7 @@ void drop_slab_node(int nid)
>  		do {
>  			freed += shrink_slab(GFP_KERNEL, nid, memcg, 0);
>  		} while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)) != NULL);
> -	} while (freed > 10);
> +	} while ((freed >> shift++) > 0);

I think this is a good idea, thanks for bringing it up :-)

I'm not sure about the bitshift idea, though. It certainly makes sure
that even large, continuous periods of reclaim eventually terminate,
but I find it hard to reason about -- for example, if there's a lot of
parallel activity, that might result in 10 constantly reintroduced
pages, or 1000 pages, and it's not immediately obvious that we should
treat those differently.

What about using MAX_RECLAIM_RETRIES? There's already precedent for
using it in non-OOM scenarios, like mem_cgroup_handle_over_high.
On 2021/8/19 5:48, Chris Down wrote:
> Vlastimil Babka writes:
>> @@ -948,7 +949,7 @@ void drop_slab_node(int nid)
>>  		do {
>>  			freed += shrink_slab(GFP_KERNEL, nid, memcg, 0);
>>  		} while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)) != NULL);
>> -	} while (freed > 10);
>> +	} while ((freed >> shift++) > 0);
>
> I think this is a good idea, thanks for bringing it up :-)
>
> I'm not sure about the bitshift idea, though. It certainly makes sure
> that even large, continuous periods of reclaim eventually terminate,
> but I find it hard to reason about -- for example, if there's a lot of
> parallel activity, that might result in 10 constantly reintroduced
> pages, or 1000 pages, and it's not immediately obvious that we should
> treat those differently.
>
> What about using MAX_RECLAIM_RETRIES? There's already precedent for
> using it in non-OOM scenarios, like mem_cgroup_handle_over_high.

Yes, we met this issue too, and we added a max loop limit in
drop_slab_node() in our kernel, which can also be reconfigured via
sysctl ;)
On 8/19/21 4:55 AM, Kefeng Wang wrote:
> On 2021/8/19 5:48, Chris Down wrote:
>> Vlastimil Babka writes:
>>
>> I think this is a good idea, thanks for bringing it up :-)
>>
>> I'm not sure about the bitshift idea, though. It certainly makes sure
>> that even large, continuous periods of reclaim eventually terminate,
>> but I find it hard to reason about -- for example, if there's a lot of
>> parallel activity, that might result in 10 constantly reintroduced
>> pages, or 1000 pages, and it's not immediately obvious that we should
>> treat those differently.
>>
>> What about using MAX_RECLAIM_RETRIES? There's already precedent for
>> using it in non-OOM scenarios, like mem_cgroup_handle_over_high.

It's an option, but then (together with a fixed threshold) it ignores
how large the 'freed' value is, as long as it's above the threshold?
Although the end result will probably not be much different.

> Yes, we met this issue too, and we added a max loop limit in
> drop_slab_node() in our kernel, which can also be reconfigured via
> sysctl ;)

Sysctl sounds like overkill. How do you tune the limit? Any experience
with what scenarios need what limit?
On 2021/8/19 15:01, Vlastimil Babka wrote:
> On 8/19/21 4:55 AM, Kefeng Wang wrote:
>> On 2021/8/19 5:48, Chris Down wrote:
>>> Vlastimil Babka writes:
>>>
>>> I think this is a good idea, thanks for bringing it up :-)
>>>
>>> I'm not sure about the bitshift idea, though. It certainly makes
>>> sure that even large, continuous periods of reclaim eventually
>>> terminate, but I find it hard to reason about -- for example, if
>>> there's a lot of parallel activity, that might result in 10
>>> constantly reintroduced pages, or 1000 pages, and it's not
>>> immediately obvious that we should treat those differently.
>>>
>>> What about using MAX_RECLAIM_RETRIES? There's already precedent for
>>> using it in non-OOM scenarios, like mem_cgroup_handle_over_high.
>
> It's an option, but then (together with a fixed threshold) it ignores
> how large the 'freed' value is, as long as it's above the threshold?
> Although the end result will probably not be much different.
>
>> Yes, we met this issue too, and we added a max loop limit in
>> drop_slab_node() in our kernel, which can also be reconfigured via
>> sysctl ;)
>
> Sysctl sounds like overkill. How do you tune the limit? Any experience
> with what scenarios need what limit?

There is no clear limit; we did some testing, then chose a large limit
that our users could tolerate, and the user can change it dynamically
via the sysctl interface.

The dynamically growing threshold is great; I very much agree with this
patch. Like Chris said, one option is that we could also add a max
limit and let the user make a decision according to their scenario, but
this is just an option.
Vlastimil Babka writes:
> On 8/19/21 4:55 AM, Kefeng Wang wrote:
>> On 2021/8/19 5:48, Chris Down wrote:
>>> Vlastimil Babka writes:
>>>
>>> I think this is a good idea, thanks for bringing it up :-)
>>>
>>> I'm not sure about the bitshift idea, though. It certainly makes
>>> sure that even large, continuous periods of reclaim eventually
>>> terminate, but I find it hard to reason about -- for example, if
>>> there's a lot of parallel activity, that might result in 10
>>> constantly reintroduced pages, or 1000 pages, and it's not
>>> immediately obvious that we should treat those differently.
>>>
>>> What about using MAX_RECLAIM_RETRIES? There's already precedent for
>>> using it in non-OOM scenarios, like mem_cgroup_handle_over_high.
>
> It's an option, but then (together with a fixed threshold) it ignores
> how large the 'freed' value is, as long as it's above the threshold?
> Although the end result will probably not be much different.

Yeah, but we already draw the line at 10 right now. `freed > 10 &&
retries < MAX_RECLAIM_RETRIES` seems easier to reason about, at least
to me, and stays closer to the current behaviour while providing a
definitive point of loop termination.

>> Yes, we met this issue too, and we added a max loop limit in
>> drop_slab_node() in our kernel, which can also be reconfigured via
>> sysctl ;)
>
> Sysctl sounds like overkill. How do you tune the limit? Any experience
> with what scenarios need what limit?

sysctls in mm tend to mean we didn't think hard enough about how things
should work and just punted the decision to the user :-) So I agree
with you that I'd rather not have a sysctl, and would just go the
MAX_RECLAIM_RETRIES route. After that I'll happily add my Acked-by.
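[Illustration, not part of the thread: a minimal sketch of the variant
Chris suggests, assuming MAX_RECLAIM_RETRIES (defined in mm/internal.h)
were made visible to drop_slab_node(). This is a guess at the shape of
such a change, not a posted patch.]

	/* Sketch: bounded drop_slab_node(), per Chris's suggestion. */
	void drop_slab_node(int nid)
	{
		unsigned long freed;
		int retries = 0;

		do {
			struct mem_cgroup *memcg = NULL;

			if (fatal_signal_pending(current))
				return;

			freed = 0;
			do {
				freed += shrink_slab(GFP_KERNEL, nid, memcg, 0);
			} while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)) != NULL);
			/* keep the old threshold, but cap the number of passes */
		} while (freed > 10 && ++retries < MAX_RECLAIM_RETRIES);
	}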
On Thu 19-08-21 14:21:08, Chris Down wrote:
> Vlastimil Babka writes:
> > On 8/19/21 4:55 AM, Kefeng Wang wrote:
> > > On 2021/8/19 5:48, Chris Down wrote:
> > > > Vlastimil Babka writes:
> > > >
> > > > I think this is a good idea, thanks for bringing it up :-)
> > > >
> > > > I'm not sure about the bitshift idea, though. It certainly makes
> > > > sure that even large, continuous periods of reclaim eventually
> > > > terminate, but I find it hard to reason about -- for example, if
> > > > there's a lot of parallel activity, that might result in 10
> > > > constantly reintroduced pages, or 1000 pages, and it's not
> > > > immediately obvious that we should treat those differently.
> > > >
> > > > What about using MAX_RECLAIM_RETRIES? There's already precedent
> > > > for using it in non-OOM scenarios, like
> > > > mem_cgroup_handle_over_high.
> >
> > It's an option, but then (together with a fixed threshold) it
> > ignores how large the 'freed' value is, as long as it's above the
> > threshold? Although the end result will probably not be much
> > different.
>
> Yeah, but we already draw the line at 10 right now. `freed > 10 &&
> retries < MAX_RECLAIM_RETRIES` seems easier to reason about, at least
> to me, and stays closer to the current behaviour while providing a
> definitive point of loop termination.

I have to say that I am not really a fan of the MAX_RECLAIM_RETRIES
approach, especially for user interfaces. Any limit on retries has come
back to bite us before (e.g. offlining for memory hotplug, to mention
just one). drop_caches can take a long time on its own even without
retrying. We should teach people to interrupt those operations if they
really need them to finish early (e.g. timeout $TIMEOUT echo >
/proc/sys/vm/drop_caches) rather than trying to be extra clever here.

I am not against the patch Vlastimil is proposing, because it replaces
an ad-hoc limit on the reclaimed objects threshold with something that
is less "random" - sort of a backoff instead seems like an improvement
to me. But I would still be worried that this could regress for some
users, so in an ideal world the existing bail on signal should be
enough.
On 8/19/21 16:16, Michal Hocko wrote:
> On Thu 19-08-21 14:21:08, Chris Down wrote:
>>
>> Yeah, but we already draw the line at 10 right now. `freed > 10 &&
>> retries < MAX_RECLAIM_RETRIES` seems easier to reason about, at least
>> to me, and stays closer to the current behaviour while providing a
>> definitive point of loop termination.

The doubling threshold is also definitive :) At most 64 iterations on
64-bit systems, 32 on 32-bit. In practice it will be much less.

> I have to say that I am not really a fan of the MAX_RECLAIM_RETRIES
> approach, especially for user interfaces. Any limit on retries has
> come back to bite us before (e.g. offlining for memory hotplug, to
> mention just one). drop_caches can take a long time on its own even
> without retrying. We should teach people to interrupt those operations
> if they really need them to finish early (e.g. timeout $TIMEOUT echo >
> /proc/sys/vm/drop_caches) rather than trying to be extra clever here.

I'm afraid the retries-vs-threshold choice has more potential for
bikeshedding than actual difference in practice. I can change it if
more people prefer it that way.

> I am not against the patch Vlastimil is proposing, because it replaces
> an ad-hoc limit on the reclaimed objects threshold with something that
> is less "random" - sort of a backoff instead seems like an improvement
> to me. But I would still be worried that this could regress for some
> users, so in an ideal world the existing bail on signal should be
> enough.

My point of view on this is:

- First, let's not forget this is an interface intended for debugging,
  one we generally discourage from regular use. Yet it exists, so
  people will inevitably use it from automated scripts and cron and
  whatnot.

- As such, there are no guarantees of how much it reclaims; it's really
  a best effort. If a user thinks it hasn't reclaimed enough, they can
  always implement their own userspace loop that checks e.g. meminfo or
  slabinfo to decide whether to do another iteration of drop_caches
  (see the sketch below).

- While the userspace loop might be convoluted, the idea of running
  with a timeout is not obvious. One might see the operation finish
  fine in initial testing, put it in cron etc. without a timeout, and
  one day the infinite loop kicks in and there's trouble. So I think
  the safer default, the one that avoids surprises, is to guarantee
  termination rather than aggressive reclaim.
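[Illustration, not part of the thread: one possible shape of the
userspace loop mentioned above - drop slab caches repeatedly while
SReclaimable in /proc/meminfo keeps shrinking. The 1 MiB progress
threshold and the 10-pass cap are arbitrary assumptions; running it
requires root.]

	#include <stdio.h>

	/* Read SReclaimable (in kB) from /proc/meminfo. */
	static long sreclaimable_kb(void)
	{
		char line[128];
		long kb = -1;
		FILE *f = fopen("/proc/meminfo", "r");

		if (!f)
			return -1;
		while (fgets(line, sizeof(line), f))
			if (sscanf(line, "SReclaimable: %ld kB", &kb) == 1)
				break;
		fclose(f);
		return kb;
	}

	/* Equivalent of: echo 2 > /proc/sys/vm/drop_caches */
	static int drop_slab_caches(void)
	{
		FILE *f = fopen("/proc/sys/vm/drop_caches", "w");

		if (!f)
			return -1;
		fputs("2", f);
		fclose(f);
		return 0;
	}

	int main(void)
	{
		long before, after = sreclaimable_kb();
		int pass = 0;

		do {
			before = after;
			if (drop_slab_caches())
				return 1;
			after = sreclaimable_kb();
			printf("pass %d: %ld kB -> %ld kB\n", ++pass, before, after);
		} while (before - after > 1024 && pass < 10);	/* >1 MiB progress */
		return 0;
	}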
On Wed, Aug 18, 2021 at 05:22:39PM +0200, Vlastimil Babka wrote:
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 403a175a720f..ef3554314b47 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -936,6 +936,7 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
>  void drop_slab_node(int nid)
>  {
>  	unsigned long freed;
> +	int shift = 0;
>
>  	do {
>  		struct mem_cgroup *memcg = NULL;
> @@ -948,7 +949,7 @@ void drop_slab_node(int nid)
>  		do {
>  			freed += shrink_slab(GFP_KERNEL, nid, memcg, 0);
>  		} while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)) != NULL);
> -	} while (freed > 10);
> +	} while ((freed >> shift++) > 0);

This can, if you're really unlucky, produce UB. If you free 2^63 items
when shift is 63, then 2^63 >> 63 is 1 and shift becomes 64, producing
UB on the next iteration. We could do:

	} while (shift < BITS_PER_LONG && (freed >> shift++) > 0);

but honestly, that feels silly. How about:

	} while ((freed >> shift++) > 1);

almost exactly as arbitrary, but guarantees no UB.
On 8/24/21 12:02, Matthew Wilcox wrote:
> On Wed, Aug 18, 2021 at 05:22:39PM +0200, Vlastimil Babka wrote:
>> @@ -948,7 +949,7 @@ void drop_slab_node(int nid)
>>  		do {
>>  			freed += shrink_slab(GFP_KERNEL, nid, memcg, 0);
>>  		} while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)) != NULL);
>> -	} while (freed > 10);
>> +	} while ((freed >> shift++) > 0);
>
> This can, if you're really unlucky, produce UB. If you free 2^63 items
> when shift is 63, then 2^63 >> 63 is 1 and shift becomes 64, producing
> UB on the next iteration. We could do:
>
> 	} while (shift < BITS_PER_LONG && (freed >> shift++) > 0);
>
> but honestly, that feels silly. How about:
>
> 	} while ((freed >> shift++) > 1);
>
> almost exactly as arbitrary, but guarantees no UB.

Hey, zero is not arbitrary :P But thanks, here's a fix-up:

From 88189bf16406c5910400193422b3f18f859f18d8 Mon Sep 17 00:00:00 2001
From: Vlastimil Babka <vbabka@suse.cz>
Date: Tue, 24 Aug 2021 14:08:53 +0200
Subject: [PATCH] mm, vmscan: guarantee drop_slab_node() termination-fix

Matthew reports [1] that if we free enough objects, we can eventually
right-shift by BITS_PER_LONG, which is undefined behavior. Raise the
threshold from 0 to 1, which means we will shift only up to
BITS_PER_LONG-1.

[1] https://lore.kernel.org/linux-mm/YSTDnqKgQLvziyQI@casper.infradead.org/

Reported-by: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
 mm/vmscan.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 4ffaa7970904..f08aef08c351 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -952,7 +952,7 @@ void drop_slab_node(int nid)
 		do {
 			freed += shrink_slab(GFP_KERNEL, nid, memcg, 0);
 		} while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)) != NULL);
-	} while ((freed >> shift++) > 0);
+	} while ((freed >> shift++) > 1);
 }

 void drop_slab(void)
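[Illustration, not part of the thread: a small userspace harness that
checks the bound Matthew describes. The shrinker machinery is stubbed
out; only the termination condition is simulated, with a worst-case
'freed' value. The assert documents that a shift count equal to the
type width would be UB.]

	#include <assert.h>
	#include <limits.h>
	#include <stdio.h>

	#define BITS_PER_LONG ((int)(sizeof(unsigned long) * CHAR_BIT))

	/* Count loop passes for a given per-pass 'freed' and threshold. */
	static int passes(unsigned long freed, unsigned long threshold)
	{
		int shift = 0;

		do {
			/* the kernel would run a full memcg*shrinker pass here */
			assert(shift < BITS_PER_LONG);	/* shift >= width is UB */
		} while ((freed >> shift++) > threshold);

		return shift;
	}

	int main(void)
	{
		/* worst case: every pass frees ULONG_MAX objects */
		printf("%d passes\n", passes(ULONG_MAX, 1));	/* 64 on 64-bit */
		/*
		 * passes(ULONG_MAX, 0) would let shift reach BITS_PER_LONG
		 * and trip the assert - exactly the UB case above.
		 */
		return 0;
	}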
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 403a175a720f..ef3554314b47 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -936,6 +936,7 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
 void drop_slab_node(int nid)
 {
 	unsigned long freed;
+	int shift = 0;

 	do {
 		struct mem_cgroup *memcg = NULL;
@@ -948,7 +949,7 @@ void drop_slab_node(int nid)
 		do {
 			freed += shrink_slab(GFP_KERNEL, nid, memcg, 0);
 		} while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)) != NULL);
-	} while (freed > 10);
+	} while ((freed >> shift++) > 0);
 }

 void drop_slab(void)
drop_slab_node() is called as part of the echo 2 >
/proc/sys/vm/drop_caches operation. It iterates over all memcgs and
calls shrink_slab(), which in turn iterates over all slab shrinkers.
Freed objects are counted, and as long as the total number of freed
objects from all memcgs and shrinkers is higher than 10,
drop_slab_node() loops for another full memcgs*shrinkers iteration.

This arbitrary constant threshold of 10 can result in effectively an
infinite loop on a system with a large number of memcgs and/or parallel
activity that allocates new objects. This has been reported previously
by Chunxin Zang [1] and recently by our customer.

The previous report [1] resulted in commit 069c411de40a ("mm/vmscan:
fix infinite loop in drop_slab_node"), which added a check for signals,
allowing the user to terminate the command writing to drop_caches. At
the time it was also considered to make the threshold grow with each
iteration to guarantee termination, but such a patch hadn't been
formally proposed yet.

This patch implements the dynamically growing threshold. At the first
iteration it's enough to free one object to continue, and this
threshold effectively doubles with each iteration. Our customer's
feedback was positive.

There is always a risk that on some system this change will cause a
previously terminating drop_caches operation to terminate sooner and
free fewer objects. Ideally the semantics would guarantee freeing all
freeable objects that existed at the moment of starting the operation,
while not looping forever over newly allocated objects, but that's not
feasible to track. In the less ideal solution based on thresholds,
arguably the termination guarantee is more important than the
exhaustiveness guarantee. If there are reports of a large regression
wrt being exhaustive, we can tune how fast the threshold grows.

[1] https://lore.kernel.org/lkml/20200909152047.27905-1-zangchunxin@bytedance.com/T/#u

Reported-by: Chunxin Zang <zangchunxin@bytedance.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
 mm/vmscan.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)
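[Illustration, not part of the patch: the changelog's "effectively
doubles with each iteration" can be seen by computing the smallest
'freed' value that keeps the (freed >> shift++) > 0 loop going.]

	#include <stdio.h>

	int main(void)
	{
		for (int shift = 0; shift < 8; shift++) {
			/* smallest freed with (freed >> shift) > 0 */
			unsigned long min_freed = 1UL << shift;
			printf("after pass %d, loop continues only if freed >= %lu\n",
			       shift + 1, min_freed);
		}
		return 0;
	}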