Message ID | 20211202150614.22440-1-mgorman@techsingularity.net (mailing list archive)
---|---
State | New, archived
Series | [v4,1/1] mm: vmscan: Reduce throttling due to a failure to make progress
Hi Mel, On Thu, Dec 2, 2021 at 7:07 AM Mel Gorman <mgorman@techsingularity.net> wrote: > > Mike Galbraith, Alexey Avramov and Darrick Wong all reported similar > problems due to reclaim throttling for excessive lengths of time. > In Alexey's case, a memory hog that should go OOM quickly stalls for > several minutes before stalling. In Mike and Darrick's cases, a small > memcg environment stalled excessively even though the system had enough > memory overall. > > Commit 69392a403f49 ("mm/vmscan: throttle reclaim when no progress is being > made") introduced the problem although commit a19594ca4a8b ("mm/vmscan: > increase the timeout if page reclaim is not making progress") made it > worse. Systems at or near an OOM state that cannot be recovered must > reach OOM quickly and memcg should kill tasks if a memcg is near OOM. > Is there a reason we can't simply revert 69392a403f49 instead of adding more code/heuristics? Looking more into 69392a403f49, I don't think the code and commit message are in sync. For the memcg reclaim, instead of just removing congestion_wait or replacing it with schedule_timeout in mem_cgroup_force_empty(), why change the behavior of all memcg reclaim. Also this patch effectively reverts that behavior of 69392a403f49. For direct reclaimers under global pressure, why is page allocator a bad place for stalling on no progress reclaim? IMHO the callers of the reclaim should decide what to do if reclaim is not making progress. thanks, Shakeel
On Thu, Dec 02, 2021 at 08:30:51AM -0800, Shakeel Butt wrote: > Hi Mel, > > On Thu, Dec 2, 2021 at 7:07 AM Mel Gorman <mgorman@techsingularity.net> wrote: > > > > Mike Galbraith, Alexey Avramov and Darrick Wong all reported similar > > problems due to reclaim throttling for excessive lengths of time. > > In Alexey's case, a memory hog that should go OOM quickly stalls for > > several minutes before stalling. In Mike and Darrick's cases, a small > > memcg environment stalled excessively even though the system had enough > > memory overall. > > > > Commit 69392a403f49 ("mm/vmscan: throttle reclaim when no progress is being > > made") introduced the problem although commit a19594ca4a8b ("mm/vmscan: > > increase the timeout if page reclaim is not making progress") made it > > worse. Systems at or near an OOM state that cannot be recovered must > > reach OOM quickly and memcg should kill tasks if a memcg is near OOM. > > > > Is there a reason we can't simply revert 69392a403f49 instead of adding > more code/heuristics? Looking more into 69392a403f49, I don't think the > code and commit message are in sync. > > For the memcg reclaim, instead of just removing congestion_wait or > replacing it with schedule_timeout in mem_cgroup_force_empty(), why > change the behavior of all memcg reclaim. Also this patch effectively > reverts that behavior of 69392a403f49. > It doesn't fully revert it but I did consider reverting it. The reason why I preserved it is because the intent originally was to throttle somewhat when progress is not being made to avoid a premature OOM and I wanted to preserve that characteristic. Right now, this is the least harmful way of doing it. As for memcg, I removed the NOTHROTTLE because the primary reason why a memcg might fail to make progress is excessive writeback and that should still throttle. Completely failing to make progress in a memcg is most likely due to a memcg-OOM. > For direct reclaimers under global pressure, why is page allocator a bad > place for stalling on no progress reclaim? IMHO the callers of the > reclaim should decide what to do if reclaim is not making progress. Because it's a layering violation and the caller has little direct control over the reclaim retry logic. The page allocator has no visibility on why reclaim failed, only that it did fail.
On Thu, Dec 2, 2021 at 8:52 AM Mel Gorman <mgorman@techsingularity.net> wrote: > > On Thu, Dec 02, 2021 at 08:30:51AM -0800, Shakeel Butt wrote: > > Hi Mel, > > > > On Thu, Dec 2, 2021 at 7:07 AM Mel Gorman <mgorman@techsingularity.net> wrote: > > > > > > Mike Galbraith, Alexey Avramov and Darrick Wong all reported similar > > > problems due to reclaim throttling for excessive lengths of time. > > > In Alexey's case, a memory hog that should go OOM quickly stalls for > > > several minutes before stalling. In Mike and Darrick's cases, a small > > > memcg environment stalled excessively even though the system had enough > > > memory overall. > > > > > > Commit 69392a403f49 ("mm/vmscan: throttle reclaim when no progress is being > > > made") introduced the problem although commit a19594ca4a8b ("mm/vmscan: > > > increase the timeout if page reclaim is not making progress") made it > > > worse. Systems at or near an OOM state that cannot be recovered must > > > reach OOM quickly and memcg should kill tasks if a memcg is near OOM. > > > > > > > Is there a reason we can't simply revert 69392a403f49 instead of adding > > more code/heuristics? Looking more into 69392a403f49, I don't think the > > code and commit message are in sync. > > > > For the memcg reclaim, instead of just removing congestion_wait or > > replacing it with schedule_timeout in mem_cgroup_force_empty(), why > > change the behavior of all memcg reclaim. Also this patch effectively > > reverts that behavior of 69392a403f49. > > > > It doesn't fully revert it but I did consider reverting it. The reason > why I preserved it because the intent originally was to throttle somewhat > when progress is not being made to avoid a premature OOM and I wanted to > preserve that charactersistic. Right now, this is the least harmful way > of doing it. If I understand correctly, the original intent of 69392a403f49 which you want to preserve is "avoid premature OOMs when reclaim is not making progress". Were there any complaints or bug reports on these premature OOMs? > > As more memcg, I removed the NOTHROTTLE because the primary reason why a > memcg might fail to make progress is excessive writeback and that should > still throttle. Completely failing to make progress in a memcg is most > likely due to a memcg-OOM. > > > For direct reclaimers under global pressure, why is page allocator a bad > > place for stalling on no progress reclaim? IMHO the callers of the > > reclaim should decide what to do if reclaim is not making progress. > > Because it's a layering violation and the caller has little direct control > over the reclaim retry logic. The page allocator has no visibility on > why reclaim failed only that it did fail. > Isn't it better that the reclaim returns why it is failing instead of littering the reclaim code with 'is this global reclaim', 'is this memcg reclaim', 'am I kswapd' which is also a layering violation. IMO this is the direction we should be going towards though not asking to do this now. Regarding this patch and 69392a403f49, I am still confused on the main motivation behind 69392a403f49 to change the behavior of 'direct reclaimers from page allocator'. thanks, Shakeel
On Thu, Dec 02, 2021 at 09:41:04AM -0800, Shakeel Butt wrote: > On Thu, Dec 2, 2021 at 8:52 AM Mel Gorman <mgorman@techsingularity.net> wrote: > > > > On Thu, Dec 02, 2021 at 08:30:51AM -0800, Shakeel Butt wrote: > > > Hi Mel, > > > > > > On Thu, Dec 2, 2021 at 7:07 AM Mel Gorman <mgorman@techsingularity.net> wrote: > > > > > > > > Mike Galbraith, Alexey Avramov and Darrick Wong all reported similar > > > > problems due to reclaim throttling for excessive lengths of time. > > > > In Alexey's case, a memory hog that should go OOM quickly stalls for > > > > several minutes before stalling. In Mike and Darrick's cases, a small > > > > memcg environment stalled excessively even though the system had enough > > > > memory overall. > > > > > > > > Commit 69392a403f49 ("mm/vmscan: throttle reclaim when no progress is being > > > > made") introduced the problem although commit a19594ca4a8b ("mm/vmscan: > > > > increase the timeout if page reclaim is not making progress") made it > > > > worse. Systems at or near an OOM state that cannot be recovered must > > > > reach OOM quickly and memcg should kill tasks if a memcg is near OOM. > > > > > > > > > > Is there a reason we can't simply revert 69392a403f49 instead of adding > > > more code/heuristics? Looking more into 69392a403f49, I don't think the > > > code and commit message are in sync. > > > > > > For the memcg reclaim, instead of just removing congestion_wait or > > > replacing it with schedule_timeout in mem_cgroup_force_empty(), why > > > change the behavior of all memcg reclaim. Also this patch effectively > > > reverts that behavior of 69392a403f49. > > > > > > > It doesn't fully revert it but I did consider reverting it. The reason > > why I preserved it because the intent originally was to throttle somewhat > > when progress is not being made to avoid a premature OOM and I wanted to > > preserve that charactersistic. Right now, this is the least harmful way > > of doing it. > > If I understand correctly, the original intent of 69392a403f49 which > you want to preserve is "avoid premature OOMs when reclaim is not > making progress". Were there any complaints or bug reports on these > premature OOMs? > Not recently that I'm aware of but historically reclaim has been plagued by at least two classes of problems -- premature OOM and excessive CPU usage churning through the LRU. Going back, the solution was basically to sleep something like "disable kswapd if it fails to make progress for too long". Commit 69392a403f49 addressed a case where calling congestion_wait might as well have been schedule_timeout_uninterruptible(HZ/10) because congestion is no longer tracked by the block layer. Hence 69392a403f49 allows reclaim to throttle on NOPROGRESS but if another task makes progress, the throttled tasks can be woken before the timeout. The flaw was throttling too easily or for too long delaying OOM being properly detected. > > > > As more memcg, I removed the NOTHROTTLE because the primary reason why a > > memcg might fail to make progress is excessive writeback and that should > > still throttle. Completely failing to make progress in a memcg is most > > likely due to a memcg-OOM. > > > > > For direct reclaimers under global pressure, why is page allocator a bad > > > place for stalling on no progress reclaim? IMHO the callers of the > > > reclaim should decide what to do if reclaim is not making progress. > > > > Because it's a layering violation and the caller has little direct control > > over the reclaim retry logic. 
The page allocator has no visibility on > > why reclaim failed only that it did fail. > > > > Isn't it better that the reclaim returns why it is failing instead of > littering the reclaim code with 'is this global reclaim', 'is this > memcg reclaim', 'am I kswapd' which is also a layering violation. IMO > this is the direction we should be going towards though not asking to > do this now. > It's not clear why you think the page allocator can make better decisions about reclaim than reclaim can. It might make sense if callers were returned enough information to make a decision but even if they could, it would not be popular as the API would be difficult to use properly. Is your primary objection the cgroup_reclaim(sc) check? If so, I can remove it. While there is a mild risk that OOM would be delayed, it's very unlikely because a memcg failing to make progress in the local case will probably call cond_resched() if there are not lots of pages pending writes globally. > Regarding this patch and 69392a403f49, I am still confused on the main > motivation behind 69392a403f49 to change the behavior of 'direct > reclaimers from page allocator'. > The main motivation of the series overall was to remove the reliance on congestion_wait and wait_iff_congested because both are fundamentally broken when congestion is not tracked by the block layer. Replacing with schedule_timeout_uninterruptible() would be silly because where possible decisions on whether to pause or throttle should be based on events, not time. For example, if there are too many pages waiting on writeback then throttle but if writeback completes, wake the throttled tasks instead of "sleep some time and hope for the best".
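Roughly, the event-based scheme being described maps to the per-node reclaim_wait[] waitqueues used by reclaim_throttle() in the patch below. A simplified sketch of the pattern follows (not the exact upstream code; the standalone waitqueue and the helper names are made up for illustration):

#include <linux/wait.h>
#include <linux/sched.h>
#include <linux/atomic.h>

/* One waitqueue per node and throttle reason, as with pgdat->reclaim_wait[]. */
static DECLARE_WAIT_QUEUE_HEAD(reclaim_wait_writeback);
static atomic_t nr_writeback_throttled = ATOMIC_INIT(0);

/* Throttle: sleep for at most 'timeout' jiffies, but wake early on the event. */
static void throttle_on_writeback(long timeout)
{
        DEFINE_WAIT(wait);

        atomic_inc(&nr_writeback_throttled);
        /* Instead of schedule_timeout_uninterruptible(timeout)... */
        prepare_to_wait(&reclaim_wait_writeback, &wait, TASK_UNINTERRUPTIBLE);
        schedule_timeout(timeout);
        finish_wait(&reclaim_wait_writeback, &wait);
        atomic_dec(&nr_writeback_throttled);
}

/* Called from the writeback completion path: wake throttled tasks early. */
static void writeback_made_progress(void)
{
        if (atomic_read(&nr_writeback_throttled))
                wake_up(&reclaim_wait_writeback);
}

The point of the design is that the timeout only bounds the worst case; the common case is woken by the completion event rather than sleeping a fixed period.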
On Fri, Dec 3, 2021 at 1:01 AM Mel Gorman <mgorman@techsingularity.net> wrote: > [...] > > Not recently that I'm aware of but historically reclaim has been plagued by > at least two classes of problems -- premature OOM and excessive CPU usage > churning through the LRU. Going back, the solution was basically to sleep > something like "disable kswapd if it fails to make progress for too long". > Commit 69392a403f49 addressed a case where calling congestion_wait might as > well have been schedule_timeout_uninterruptible(HZ/10) because congestion > is no longer tracked by the block layer. > > Hence 69392a403f49 allows reclaim to throttle on NOPROGRESS but if > another task makes progress, the throttled tasks can be woken before the > timeout. The flaw was throttling too easily or for too long delaying OOM > being properly detected. > To remove congestion_wait of mem_cgroup_force_empty_write(), the commit 69392a403f49 has changed the behavior of all memcg reclaim codepaths as well as direct global reclaimers. Were there other congestion_wait() instances which commit 69392a403f49 was targeting but those congestion_wait() were replaced/removed by different commits? [...] > > > > Isn't it better that the reclaim returns why it is failing instead of > > littering the reclaim code with 'is this global reclaim', 'is this > > memcg reclaim', 'am I kswapd' which is also a layering violation. IMO > > this is the direction we should be going towards though not asking to > > do this now. > > > > It's not clear why you think the page allocator can make better decisions > about reclaim than reclaim can. It might make sense if callers were > returned enough information to make a decision but even if they could, > it would not be popular as the API would be difficult to use properly. > The above is a separate discussion for later. > Is your primary objection the cgroup_reclaim(sc) check? No, I am of the opinion that we should revert 69392a403f49 and we should have just replaced congestion_wait in mem_cgroup_force_empty_write with a simple schedule_timeout_interruptible. The memory.force_empty is a cgroup v1 interface (to be deprecated) and it is very normal to expect that the user will trigger that interface multiple times. We should not change the behavior of all the memcg reclaimers and direct global reclaimers so that we can remove congestion_wait from mem_cgroup_force_empty_write. > If so, I can > remove it. While there is a mild risk that OOM would be delayed, it's very > unlikely because a memcg failing to make progress in the local case will > probably call cond_resched() if there are not lots of of pages pending > writes globally. > > > Regarding this patch and 69392a403f49, I am still confused on the main > > motivation behind 69392a403f49 to change the behavior of 'direct > > reclaimers from page allocator'. > > > > The main motivation of the series overall was to remove the reliance on > congestion_wait and wait_iff_congested because both are fundamentally > broken when congestion is not tracked by the block layer. Replacing with > schedule_timeout_uninterruptible() would be silly because where possible > decisions on whether to pause or throttle should be based on events, > not time. For example, if there are too many pages waiting on writeback > then throttle but if writeback completes, wake the throttled tasks > instead of "sleep some time and hope for the best". > I am in agreement with the motivation of the whole series. 
I am just making sure that the motivation of VMSCAN_THROTTLE_NOPROGRESS based throttle is more than just the congestion_wait of mem_cgroup_force_empty_write. thanks, Shakeel
On Fri, Dec 03, 2021 at 09:50:51AM -0800, Shakeel Butt wrote: > On Fri, Dec 3, 2021 at 1:01 AM Mel Gorman <mgorman@techsingularity.net> wrote: > > > [...] > > > > Not recently that I'm aware of but historically reclaim has been plagued by > > at least two classes of problems -- premature OOM and excessive CPU usage > > churning through the LRU. Going back, the solution was basically to sleep > > something like "disable kswapd if it fails to make progress for too long". > > Commit 69392a403f49 addressed a case where calling congestion_wait might as > > well have been schedule_timeout_uninterruptible(HZ/10) because congestion > > is no longer tracked by the block layer. > > > > Hence 69392a403f49 allows reclaim to throttle on NOPROGRESS but if > > another task makes progress, the throttled tasks can be woken before the > > timeout. The flaw was throttling too easily or for too long delaying OOM > > being properly detected. > > > > To remove congestion_wait of mem_cgroup_force_empty_write(), the > commit 69392a403f49 has changed the behavior of all memcg reclaim > codepaths as well as direct global reclaimers. Well, yes, it moved it to stalling on writeback if it's in progress and waking up if writeback makes forward progress instead of a schedule_timeout_interruptible(). > Were there other > congestion_wait() instances which commit 69392a403f49 was targeting > but those congestion_wait() were replaced/removed by different > commits? > Yes, the series removed congestion_wait from other places because the interface is broken and has been for a long time. > [...] > > > > > > > Isn't it better that the reclaim returns why it is failing instead of > > > littering the reclaim code with 'is this global reclaim', 'is this > > > memcg reclaim', 'am I kswapd' which is also a layering violation. IMO > > > this is the direction we should be going towards though not asking to > > > do this now. > > > > > > > It's not clear why you think the page allocator can make better decisions > > about reclaim than reclaim can. It might make sense if callers were > > returned enough information to make a decision but even if they could, > > it would not be popular as the API would be difficult to use properly. > > > > The above is a separate discussion for later. > > > Is your primary objection the cgroup_reclaim(sc) check? > > No, I am of the opinion that we should revert 69392a403f49 and we > should have just replaced congestion_wait in > mem_cgroup_force_empty_write with a simple > schedule_timeout_interruptible. That is a bit weak. Depending on the type of storage, writeback may completes in microseconds or seconds. The event used to be "sleep until congestion clears" which is no longer an event that can be waited upon in the vast majority of cases (NFS being an obvious exception). Now, it may throttle writeback on a waitqueue and if enough writeback completes, the task will be woken before the timeout to minimise the stall. schedule_timeout_interruptible() always waits for a fixed duration regardless of what else happens in the meantime. > The memory.force_empty is a cgroup v1 > interface (to be deprecated) and it is very normal to expect that the > user will trigger that interface multiple times. We should not change > the behavior of all the memcg reclaimers and direct global reclaimers > so that we can remove congestion_wait from > mem_cgroup_force_empty_write. 
> The mem_cgroup_force_empty_write() path will throttle on writeback in the same way that global reclaim does at this point /* * If kswapd scans pages marked for immediate * reclaim and under writeback (nr_immediate), it * implies that pages are cycling through the LRU * faster than they are written so forcibly stall * until some pages complete writeback. */ if (sc->nr.immediate) reclaim_throttle(pgdat, VMSCAN_THROTTLE_WRITEBACK); With this patch, memcg does not stall on NOPROGRESS. > > If so, I can > > remove it. While there is a mild risk that OOM would be delayed, it's very > > unlikely because a memcg failing to make progress in the local case will > > probably call cond_resched() if there are not lots of of pages pending > > writes globally. > > > > > Regarding this patch and 69392a403f49, I am still confused on the main > > > motivation behind 69392a403f49 to change the behavior of 'direct > > > reclaimers from page allocator'. > > > > > > > The main motivation of the series overall was to remove the reliance on > > congestion_wait and wait_iff_congested because both are fundamentally > > broken when congestion is not tracked by the block layer. Replacing with > > schedule_timeout_uninterruptible() would be silly because where possible > > decisions on whether to pause or throttle should be based on events, > > not time. For example, if there are too many pages waiting on writeback > > then throttle but if writeback completes, wake the throttled tasks > > instead of "sleep some time and hope for the best". > > > > I am in agreement with the motivation of the whole series. I am just > making sure that the motivation of VMSCAN_THROTTLE_NOPROGRESS based > throttle is more than just the congestion_wait of > mem_cgroup_force_empty_write. > The commit that primarily targets congestion_wait is 8cd7c588decf ("mm/vmscan: throttle reclaim until some writeback completes if congested"). The series recognises that there are other reasons why reclaim can fail to make progress that is not directly writeback related.
On Fri, Dec 3, 2021 at 11:08 AM Mel Gorman <mgorman@techsingularity.net> wrote: > [...] > > I am in agreement with the motivation of the whole series. I am just > > making sure that the motivation of VMSCAN_THROTTLE_NOPROGRESS based > > throttle is more than just the congestion_wait of > > mem_cgroup_force_empty_write. > > > > The commit that primarily targets congestion_wait is 8cd7c588decf > ("mm/vmscan: throttle reclaim until some writeback completes if > congested"). The series recognises that there are other reasons why > reclaim can fail to make progress that is not directly writeback related. > I agree with throttling for VMSCAN_THROTTLE_[WRITEBACK|ISOLATED] reasons. Please explain why we should throttle for VMSCAN_THROTTLE_NOPROGRESS? Also 69392a403f49 claims "Direct reclaim primarily is throttled in the page allocator if it is failing to make progress.", can you please explain how?
On Sun, Dec 05, 2021 at 10:06:27PM -0800, Shakeel Butt wrote: > On Fri, Dec 3, 2021 at 11:08 AM Mel Gorman <mgorman@techsingularity.net> wrote: > > > [...] > > > I am in agreement with the motivation of the whole series. I am just > > > making sure that the motivation of VMSCAN_THROTTLE_NOPROGRESS based > > > throttle is more than just the congestion_wait of > > > mem_cgroup_force_empty_write. > > > > > > > The commit that primarily targets congestion_wait is 8cd7c588decf > > ("mm/vmscan: throttle reclaim until some writeback completes if > > congested"). The series recognises that there are other reasons why > > reclaim can fail to make progress that is not directly writeback related. > > > > I agree with throttling for VMSCAN_THROTTLE_[WRITEBACK|ISOLATED] > reasons. Please explain why we should throttle for > VMSCAN_THROTTLE_NOPROGRESS? Also 69392a403f49 claims "Direct reclaim > primarily is throttled in the page allocator if it is failing to make > progress.", can you please explain how? It could happen if the pages on the LRU are being reactivated continually or holding an elevated reference count for some reason (e.g. gup, page migration etc). The event is probably transient, hence the short throttling.
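For context on the elevated-reference-count case: reclaim can only free a page whose reference count can be frozen to the value it expects. A loose sketch of that check, modelled on __remove_mapping() (not the exact upstream code; can_reclaim_free_page() is a made-up name):

#include <linux/mm.h>
#include <linux/page_ref.h>

/*
 * A page isolated for reclaim is expected to hold one reference for the
 * isolation plus the page cache references. If a GUP pin, page migration
 * or any other transient user still holds an extra reference, the freeze
 * fails, the page is kept, and that reclaim pass makes no progress on it.
 */
static bool can_reclaim_free_page(struct page *page)
{
        int expected = 1 + compound_nr(page);

        /* Succeeds only if the refcount is exactly 'expected'. */
        return page_ref_freeze(page, expected);
}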
On Mon, Dec 6, 2021 at 3:25 AM Mel Gorman <mgorman@techsingularity.net> wrote: > > On Sun, Dec 05, 2021 at 10:06:27PM -0800, Shakeel Butt wrote: > > On Fri, Dec 3, 2021 at 11:08 AM Mel Gorman <mgorman@techsingularity.net> wrote: > > > > > [...] > > > > I am in agreement with the motivation of the whole series. I am just > > > > making sure that the motivation of VMSCAN_THROTTLE_NOPROGRESS based > > > > throttle is more than just the congestion_wait of > > > > mem_cgroup_force_empty_write. > > > > > > > > > > The commit that primarily targets congestion_wait is 8cd7c588decf > > > ("mm/vmscan: throttle reclaim until some writeback completes if > > > congested"). The series recognises that there are other reasons why > > > reclaim can fail to make progress that is not directly writeback related. > > > > > > > I agree with throttling for VMSCAN_THROTTLE_[WRITEBACK|ISOLATED] > > reasons. Please explain why we should throttle for > > VMSCAN_THROTTLE_NOPROGRESS? Also 69392a403f49 claims "Direct reclaim > > primarily is throttled in the page allocator if it is failing to make > > progress.", can you please explain how? > > It could happen if the pages on the LRU are being reactivated continually > or holding an elevated reference count for some reason (e.g. gup, > page migration etc). The event is probably transient, hence the short > throttling. > What's the worst that can happen if the kernel doesn't throttle at all for these transient scenarios? Premature oom-kills? The kernel already has some protection against such situations with retries, i.e. 16 consecutive reclaim tries have to fail before reclaim is given up. Anyways, I have shared my view which is 'no need to throttle at all for no-progress reclaims for now and course correct if there are complaints in future' but will not block the patch. thanks, Shakeel
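The retry protection referred to here lives in the page allocator's slow path, where MAX_RECLAIM_RETRIES is 16. A heavily condensed sketch of the accounting (not the actual __alloc_pages_slowpath()/should_reclaim_retry() code; keep_retrying_reclaim() is a made-up name):

#define MAX_RECLAIM_RETRIES 16  /* matches the definition in mm/internal.h */

/*
 * Condensed view of the allocator's no-progress accounting: an attempt
 * that reclaimed something resets the counter, an attempt that reclaimed
 * nothing bumps it, and only after MAX_RECLAIM_RETRIES consecutive
 * failures does the allocator stop retrying and move towards OOM.
 */
static bool keep_retrying_reclaim(unsigned long did_some_progress,
                                  int *no_progress_loops)
{
        if (did_some_progress)
                *no_progress_loops = 0;
        else
                (*no_progress_loops)++;

        return *no_progress_loops <= MAX_RECLAIM_RETRIES;
}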
On Mon, Dec 06, 2021 at 11:14:58PM -0800, Shakeel Butt wrote: > On Mon, Dec 6, 2021 at 3:25 AM Mel Gorman <mgorman@techsingularity.net> wrote: > > > > On Sun, Dec 05, 2021 at 10:06:27PM -0800, Shakeel Butt wrote: > > > On Fri, Dec 3, 2021 at 11:08 AM Mel Gorman <mgorman@techsingularity.net> wrote: > > > > > > > [...] > > > > > I am in agreement with the motivation of the whole series. I am just > > > > > making sure that the motivation of VMSCAN_THROTTLE_NOPROGRESS based > > > > > throttle is more than just the congestion_wait of > > > > > mem_cgroup_force_empty_write. > > > > > > > > > > > > > The commit that primarily targets congestion_wait is 8cd7c588decf > > > > ("mm/vmscan: throttle reclaim until some writeback completes if > > > > congested"). The series recognises that there are other reasons why > > > > reclaim can fail to make progress that is not directly writeback related. > > > > > > > > > > I agree with throttling for VMSCAN_THROTTLE_[WRITEBACK|ISOLATED] > > > reasons. Please explain why we should throttle for > > > VMSCAN_THROTTLE_NOPROGRESS? Also 69392a403f49 claims "Direct reclaim > > > primarily is throttled in the page allocator if it is failing to make > > > progress.", can you please explain how? > > > > It could happen if the pages on the LRU are being reactivated continually > > or holding an elevated reference count for some reason (e.g. gup, > > page migration etc). The event is probably transient, hence the short > > throttling. > > > > What's the worst that can happen if the kernel doesn't throttle at all > for these transient scenarios? Premature oom-kills? Excessive CPU usage in reclaim, potential premature OOM kills. > The kernel already > has some protection against such situations with retries i.e. > consecutive 16 unsuccessful reclaim tries have to fail to give up the > reclaim. > The retries mitigate the premature OOM kills but not the excessive CPU usage. > Anyways, I have shared my view which is 'no need to throttle at all > for no-progress reclaims for now and course correct if there are > complaints in future' but will not block the patch. > We've gone through periods of bugs that had either direct reclaim or kswapd pegged at 100% CPU usage. While kswapd now just stops, the patch still minimises the risk of excessive CPU usage bugs due to direct reclaim.
On Thu, 2 Dec 2021, Mel Gorman wrote: ... > --- a/mm/vmscan.c > +++ b/mm/vmscan.c ... > @@ -3478,14 +3520,18 @@ static void shrink_zones(struct zonelist *zonelist, struct scan_control *sc) > /* need some check for avoid more shrink_zone() */ > } > > + if (!first_pgdat) > + first_pgdat = zone->zone_pgdat; > + > /* See comment about same check for global reclaim above */ > if (zone->zone_pgdat == last_pgdat) > continue; > last_pgdat = zone->zone_pgdat; > shrink_node(zone->zone_pgdat, sc); > - consider_reclaim_throttle(zone->zone_pgdat, sc); > } > > + consider_reclaim_throttle(first_pgdat, sc); My tmpfs swapping load (tweaked to use huge pages more heavily than in real life) is far from being a realistic load: but it was notably slowed down by your throttling mods in 5.16-rc, and this patch makes it well again - thanks. But: it very quickly hit NULL pointer until I changed that last line to if (first_pgdat) consider_reclaim_throttle(first_pgdat, sc); I've given no thought as to whether that is the correct fix, or if first_pgdat should be set earlier in the loop above. Hugh
On Wed, Dec 08, 2021 at 10:20:47PM -0800, Hugh Dickins wrote: > On Thu, 2 Dec 2021, Mel Gorman wrote: > ... > > --- a/mm/vmscan.c > > +++ b/mm/vmscan.c > ... > > @@ -3478,14 +3520,18 @@ static void shrink_zones(struct zonelist *zonelist, struct scan_control *sc) > > /* need some check for avoid more shrink_zone() */ > > } > > > > + if (!first_pgdat) > > + first_pgdat = zone->zone_pgdat; > > + > > /* See comment about same check for global reclaim above */ > > if (zone->zone_pgdat == last_pgdat) > > continue; > > last_pgdat = zone->zone_pgdat; > > shrink_node(zone->zone_pgdat, sc); > > - consider_reclaim_throttle(zone->zone_pgdat, sc); > > } > > > > + consider_reclaim_throttle(first_pgdat, sc); > > My tmpfs swapping load (tweaked to use huge pages more heavily than > in real life) is far from being a realistic load: but it was notably > slowed down by your throttling mods in 5.16-rc, and this patch makes > it well again - thanks. > > But: it very quickly hit NULL pointer until I changed that last line to > > if (first_pgdat) > consider_reclaim_throttle(first_pgdat, sc); > > I've given no thought as to whether that is the correct fix, > or if first_pgdat should be set earlier in the loop above. > It's the right fix, first_pgdat may be NULL if compaction can run for each zone in the zonelist which could be the case for a tmpfs swapping load that is huge page intensive. Thanks Hugh.
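A condensed control-flow sketch of the patched shrink_zones() loop (simplified, not the exact code) shows how first_pgdat can stay NULL when every zone of a costly-order request is already compaction_ready(), which is why the call after the loop needs Hugh's check:

static void shrink_zones_sketch(struct zonelist *zonelist, struct scan_control *sc)
{
        pg_data_t *first_pgdat = NULL;
        struct zoneref *z;
        struct zone *zone;

        for_each_zone_zonelist_nodemask(zone, z, zonelist,
                                        sc->reclaim_idx, sc->nodemask) {
                if (IS_ENABLED(CONFIG_COMPACTION) &&
                    sc->order > PAGE_ALLOC_COSTLY_ORDER &&
                    compaction_ready(zone, sc)) {
                        sc->compaction_ready = true;
                        continue;       /* skips the assignment below */
                }

                if (!first_pgdat)
                        first_pgdat = zone->zone_pgdat;

                shrink_node(zone->zone_pgdat, sc);
        }

        if (first_pgdat)        /* Hugh's fix: NULL if every zone was skipped */
                consider_reclaim_throttle(first_pgdat, sc);
}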
Hi, this is your Linux kernel regression tracker speaking. On 02.12.21 16:06, Mel Gorman wrote: > Mike Galbraith, Alexey Avramov and Darrick Wong all reported similar > problems due to reclaim throttling for excessive lengths of time. > In Alexey's case, a memory hog that should go OOM quickly stalls for > several minutes before stalling. In Mike and Darrick's cases, a small > memcg environment stalled excessively even though the system had enough > memory overall. Just wondering: this patch afaics is now in -mm and Linux next for nearly two weeks. Is that intentional? I had expected it to be mainlined with the batch of patches Andrew mailed to Linus last week, but it wasn't among them. Or am I missing something? Ciao, Thorsten (wearing his 'Linux kernel regression tracker' hat) P.S.: As a Linux kernel regression tracker I'm getting a lot of reports on my table. I can only look briefly into most of them. Unfortunately therefore I sometimes will get things wrong or miss something important. I hope that's not the case here; if you think it is, don't hesitate to tell me about it in a public reply, that's in everyone's interest. BTW, I have no personal interest in this issue, which is tracked using regzbot, my Linux kernel regression tracking bot (https://linux-regtracking.leemhuis.info/regzbot/). I'm only posting this mail to get things rolling again and hence don't need to be CC on all further activities wrt to this regression. > Commit 69392a403f49 ("mm/vmscan: throttle reclaim when no progress is being > made") introduced the problem although commit a19594ca4a8b ("mm/vmscan: > increase the timeout if page reclaim is not making progress") made it > worse. Systems at or near an OOM state that cannot be recovered must > reach OOM quickly and memcg should kill tasks if a memcg is near OOM. > > To address this, only stall for the first zone in the zonelist, reduce > the timeout to 1 tick for VMSCAN_THROTTLE_NOPROGRESS and only stall if > the scan control nr_reclaimed is 0, kswapd is still active and there were > excessive pages pending for writeback. If kswapd has stopped reclaiming due > to excessive failures, do not stall at all so that OOM triggers relatively > quickly. Similarly, if an LRU is simply congested, only lightly throttle > similar to NOPROGRESS. > > Alexey's original case was the most straight forward > > for i in {1..3}; do tail /dev/zero; done > > On vanilla 5.16-rc1, this test stalled heavily, after the patch the test > completes in a few seconds similar to 5.15. > > Alexey's second test case added watching a youtube video while tail runs > 10 times. On 5.15, playback only jitters slightly, 5.16-rc1 stalls a lot > with lots of frames missing and numerous audio glitches. With this patch > applies, the video plays similarly to 5.15. > > Link: https://lore.kernel.org/r/99e779783d6c7fce96448a3402061b9dc1b3b602.camel@gmx.de > Link: https://lore.kernel.org/r/20211124011954.7cab9bb4@mail.inbox.lv > Link: https://lore.kernel.org/r/20211022144651.19914-1-mgorman@techsingularity.net > > [lkp@intel.com: Fix W=1 build warning] > Reported-and-tested-by: Alexey Avramov <hakavlad@inbox.lv> > Reported-and-tested-by: Mike Galbraith <efault@gmx.de> > Reported-and-tested-by: Darrick J. 
Wong <djwong@kernel.org> > Reported-by: kernel test robot <lkp@intel.com> > Fixes: 69392a403f49 ("mm/vmscan: throttle reclaim when no progress is being made") > Signed-off-by: Mel Gorman <mgorman@techsingularity.net> > --- > include/linux/mmzone.h | 1 + > include/trace/events/vmscan.h | 4 ++- > mm/vmscan.c | 64 ++++++++++++++++++++++++++++++----- > 3 files changed, 59 insertions(+), 10 deletions(-) > > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h > index 58e744b78c2c..936dc0b6c226 100644 > --- a/include/linux/mmzone.h > +++ b/include/linux/mmzone.h > @@ -277,6 +277,7 @@ enum vmscan_throttle_state { > VMSCAN_THROTTLE_WRITEBACK, > VMSCAN_THROTTLE_ISOLATED, > VMSCAN_THROTTLE_NOPROGRESS, > + VMSCAN_THROTTLE_CONGESTED, > NR_VMSCAN_THROTTLE, > }; > > diff --git a/include/trace/events/vmscan.h b/include/trace/events/vmscan.h > index f25a6149d3ba..ca2e9009a651 100644 > --- a/include/trace/events/vmscan.h > +++ b/include/trace/events/vmscan.h > @@ -30,12 +30,14 @@ > #define _VMSCAN_THROTTLE_WRITEBACK (1 << VMSCAN_THROTTLE_WRITEBACK) > #define _VMSCAN_THROTTLE_ISOLATED (1 << VMSCAN_THROTTLE_ISOLATED) > #define _VMSCAN_THROTTLE_NOPROGRESS (1 << VMSCAN_THROTTLE_NOPROGRESS) > +#define _VMSCAN_THROTTLE_CONGESTED (1 << VMSCAN_THROTTLE_CONGESTED) > > #define show_throttle_flags(flags) \ > (flags) ? __print_flags(flags, "|", \ > {_VMSCAN_THROTTLE_WRITEBACK, "VMSCAN_THROTTLE_WRITEBACK"}, \ > {_VMSCAN_THROTTLE_ISOLATED, "VMSCAN_THROTTLE_ISOLATED"}, \ > - {_VMSCAN_THROTTLE_NOPROGRESS, "VMSCAN_THROTTLE_NOPROGRESS"} \ > + {_VMSCAN_THROTTLE_NOPROGRESS, "VMSCAN_THROTTLE_NOPROGRESS"}, \ > + {_VMSCAN_THROTTLE_CONGESTED, "VMSCAN_THROTTLE_CONGESTED"} \ > ) : "VMSCAN_THROTTLE_NONE" > > > diff --git a/mm/vmscan.c b/mm/vmscan.c > index fb9584641ac7..4c4d5f6cd8a3 100644 > --- a/mm/vmscan.c > +++ b/mm/vmscan.c > @@ -1021,6 +1021,39 @@ static void handle_write_error(struct address_space *mapping, > unlock_page(page); > } > > +static bool skip_throttle_noprogress(pg_data_t *pgdat) > +{ > + int reclaimable = 0, write_pending = 0; > + int i; > + > + /* > + * If kswapd is disabled, reschedule if necessary but do not > + * throttle as the system is likely near OOM. > + */ > + if (pgdat->kswapd_failures >= MAX_RECLAIM_RETRIES) > + return true; > + > + /* > + * If there are a lot of dirty/writeback pages then do not > + * throttle as throttling will occur when the pages cycle > + * towards the end of the LRU if still under writeback. 
> + */ > + for (i = 0; i < MAX_NR_ZONES; i++) { > + struct zone *zone = pgdat->node_zones + i; > + > + if (!populated_zone(zone)) > + continue; > + > + reclaimable += zone_reclaimable_pages(zone); > + write_pending += zone_page_state_snapshot(zone, > + NR_ZONE_WRITE_PENDING); > + } > + if (2 * write_pending <= reclaimable) > + return true; > + > + return false; > +} > + > void reclaim_throttle(pg_data_t *pgdat, enum vmscan_throttle_state reason) > { > wait_queue_head_t *wqh = &pgdat->reclaim_wait[reason]; > @@ -1056,8 +1089,16 @@ void reclaim_throttle(pg_data_t *pgdat, enum vmscan_throttle_state reason) > } > > break; > + case VMSCAN_THROTTLE_CONGESTED: > + fallthrough; > case VMSCAN_THROTTLE_NOPROGRESS: > - timeout = HZ/2; > + if (skip_throttle_noprogress(pgdat)) { > + cond_resched(); > + return; > + } > + > + timeout = 1; > + > break; > case VMSCAN_THROTTLE_ISOLATED: > timeout = HZ/50; > @@ -3321,7 +3362,7 @@ static void shrink_node(pg_data_t *pgdat, struct scan_control *sc) > if (!current_is_kswapd() && current_may_throttle() && > !sc->hibernation_mode && > test_bit(LRUVEC_CONGESTED, &target_lruvec->flags)) > - reclaim_throttle(pgdat, VMSCAN_THROTTLE_WRITEBACK); > + reclaim_throttle(pgdat, VMSCAN_THROTTLE_CONGESTED); > > if (should_continue_reclaim(pgdat, sc->nr_reclaimed - nr_reclaimed, > sc)) > @@ -3386,16 +3427,16 @@ static void consider_reclaim_throttle(pg_data_t *pgdat, struct scan_control *sc) > } > > /* > - * Do not throttle kswapd on NOPROGRESS as it will throttle on > - * VMSCAN_THROTTLE_WRITEBACK if there are too many pages under > - * writeback and marked for immediate reclaim at the tail of > - * the LRU. > + * Do not throttle kswapd or cgroup reclaim on NOPROGRESS as it will > + * throttle on VMSCAN_THROTTLE_WRITEBACK if there are too many pages > + * under writeback and marked for immediate reclaim at the tail of the > + * LRU. > */ > - if (current_is_kswapd()) > + if (current_is_kswapd() || cgroup_reclaim(sc)) > return; > > /* Throttle if making no progress at high prioities. */ > - if (sc->priority < DEF_PRIORITY - 2) > + if (sc->priority == 1 && !sc->nr_reclaimed) > reclaim_throttle(pgdat, VMSCAN_THROTTLE_NOPROGRESS); > } > > @@ -3415,6 +3456,7 @@ static void shrink_zones(struct zonelist *zonelist, struct scan_control *sc) > unsigned long nr_soft_scanned; > gfp_t orig_mask; > pg_data_t *last_pgdat = NULL; > + pg_data_t *first_pgdat = NULL; > > /* > * If the number of buffer_heads in the machine exceeds the maximum > @@ -3478,14 +3520,18 @@ static void shrink_zones(struct zonelist *zonelist, struct scan_control *sc) > /* need some check for avoid more shrink_zone() */ > } > > + if (!first_pgdat) > + first_pgdat = zone->zone_pgdat; > + > /* See comment about same check for global reclaim above */ > if (zone->zone_pgdat == last_pgdat) > continue; > last_pgdat = zone->zone_pgdat; > shrink_node(zone->zone_pgdat, sc); > - consider_reclaim_throttle(zone->zone_pgdat, sc); > } > > + consider_reclaim_throttle(first_pgdat, sc); > + > /* > * Restore to original mask to avoid the impact on the caller if we > * promoted it to __GFP_HIGHMEM.
On Tue, 28 Dec 2021 11:04:18 +0100 Thorsten Leemhuis <regressions@leemhuis.info> wrote: > Hi, this is your Linux kernel regression tracker speaking. > > On 02.12.21 16:06, Mel Gorman wrote: > > Mike Galbraith, Alexey Avramov and Darrick Wong all reported similar > > problems due to reclaim throttling for excessive lengths of time. > > In Alexey's case, a memory hog that should go OOM quickly stalls for > > several minutes before stalling. In Mike and Darrick's cases, a small > > memcg environment stalled excessively even though the system had enough > > memory overall. > > Just wondering: this patch afaics is now in -mm and Linux next for > nearly two weeks. Is that intentional? I had expected it to be mainlined > with the batch of patches Andrew mailed to Linus last week, but it > wasn't among them. I have it queued for 5.17-rc1. There is still time to squeeze it into 5.16, just, with a cc:stable. Alternatively we could merge it into 5.17-rc1 with a cc:stable, so it will trickle back with less risk to the 5.17 release. What do people think?
On 30.12.21 00:45, Andrew Morton wrote: > On Tue, 28 Dec 2021 11:04:18 +0100 Thorsten Leemhuis <regressions@leemhuis.info> wrote: > >> Hi, this is your Linux kernel regression tracker speaking. >> >> On 02.12.21 16:06, Mel Gorman wrote: >>> Mike Galbraith, Alexey Avramov and Darrick Wong all reported similar >>> problems due to reclaim throttling for excessive lengths of time. >>> In Alexey's case, a memory hog that should go OOM quickly stalls for >>> several minutes before stalling. In Mike and Darrick's cases, a small >>> memcg environment stalled excessively even though the system had enough >>> memory overall. >> >> Just wondering: this patch afaics is now in -mm and Linux next for >> nearly two weeks. Is that intentional? I had expected it to be mainlined >> with the batch of patches Andrew mailed to Linus last week, but it >> wasn't among them. > > I have it queued for 5.17-rc1. > > There is still time to squeeze it into 5.16, just, with a cc:stable. > > Alternatively we could merge it into 5.17-rc1 with a cc:stable, so it > will trickle back with less risk to the 5.17 release. > > What do people think? CCing Linus, to make sure he's aware of this. Maybe I'm totally missing something, but I'm a bit confused by what you wrote, as the regression afaik was introduced between v5.15..v5.16-rc1. So I assume this is what you meant: ``` I have it queued for 5.17-rc1. There is still time to squeeze it into 5.16. Alternatively we could merge it into 5.17-rc1 with a cc:stable, so it will trickle back with less risk to the 5.16 release. What do people think? ``` I'll leave the individual risk evaluation of the patch to others. If the fix is risky, waiting for 5.17 is fine for me. But hmmm, regarding the "could merge it into 5.17-rc1 with a cc:stable" idea a remark: is that really "less risk", as you stated? If we get it into rc8 (which is still possible, even if a bit hard due to the new year festivities), it will get at least one week of testing. If the fix waits for the next merge window, it all depends on how the timing works out. But it's easy to picture a worst case: the fix is only merged on the Friday evening before Linus releases 5.17-rc1 and right after it's out makes it into a stable-rc (say a day or two after 5.17-rc1 is out) and from there into a 5.16.y release on Thursday. That IMHO would mean fewer days of testing in the end (and there is a weekend in this period as well). Waiting obviously will also mean that users of 5.16 and 5.16.y will likely have to face this regression for at least two and a half weeks, unless you send the fix early and Greg backports it before rc1 (which he afaics does if there are good reasons). Yes, it's `just` a performance regression, so it might not stop anyone from running Linux 5.16 -- but it's one that three people separately reported in the 5.16 devel cycle, so others will likely encounter it as well if we leave it unfixed in 5.16. This will likely annoy some people, especially if they invest time in bisecting it, only to find out that the fourth iteration of the fix for the regression is already available since December the 2nd. Ciao, Thorsten
On Fri, 31 Dec 2021, Thorsten Leemhuis wrote: > On 30.12.21 00:45, Andrew Morton wrote: > > On Tue, 28 Dec 2021 11:04:18 +0100 Thorsten Leemhuis <regressions@leemhuis.info> wrote: > > > >> Hi, this is your Linux kernel regression tracker speaking. > >> > >> On 02.12.21 16:06, Mel Gorman wrote: > >>> Mike Galbraith, Alexey Avramov and Darrick Wong all reported similar > >>> problems due to reclaim throttling for excessive lengths of time. > >>> In Alexey's case, a memory hog that should go OOM quickly stalls for > >>> several minutes before stalling. In Mike and Darrick's cases, a small > >>> memcg environment stalled excessively even though the system had enough > >>> memory overall. > >> > >> Just wondering: this patch afaics is now in -mm and Linux next for > >> nearly two weeks. Is that intentional? I had expected it to be mainlined > >> with the batch of patches Andrew mailed to Linus last week, but it > >> wasn't among them. > > > > I have it queued for 5.17-rc1. > > > > There is still time to squeeze it into 5.16, just, with a cc:stable. > > > > Alternatively we could merge it into 5.17-rc1 with a cc:stable, so it > > will trickle back with less risk to the 5.17 release. > > > > What do people think? > > CCing Linus, to make sure he's aware of this. > > Maybe I'm totally missing something, but I'm a bit confused by what you > wrote, as the regression afaik was introduced between v5.15..v5.16-rc1. > So I assume this is what you meant: > > ``` > I have it queued for 5.17-rc1. > > There is still time to squeeze it into 5.16. > > Alternatively we could merge it into 5.17-rc1 with a cc:stable, so it > will trickle back with less risk to the 5.16 release. > > What do people think? > ``` > > I'll leave the individual risk evaluation of the patch to others. If the > fix is risky, waiting for 5.17 is fine for me. > > But hmmm, regarding the "could merge it into 5.17-rc1 with a cc:stable" > idea a remark: is that really "less risk", as your stated? > > If we get it into rc8 (which is still possible, even if a bit hard due > to the new year festivities), it will get at least one week of testing. My vote is for it to go into rc8: for me, 5.16-rc reclaim behaves too oddly without it, so I've simply added it into whatever testing I do ever since Mel posted - no regressions noticed with it in (aside from needing the -fix.patch you already added a few weeks ago). Hugh > > If the fix waits for the next merge window, it all depends on the how > the timing works out. But it's easy to picture a worst case: the fix is > only merged on the Friday evening before Linus releases 5.17-rc1 and > right after it's out makes it into a stable-rc (say a day or two after > 5.17-rc1 is out) and from there into a 5.16.y release on Thursday. That > IMHO would mean less days of testing in the end (and there is a weekend > in this period as well). > > Waiting obviously will also mean that users of 5.16 and 5.16.y will > likely have to face this regression for at least two and a half weeks, > unless you send the fix early and Greg backports it before rc1 (which he > afaics does if there are good reasons). Yes, it's `just` a performance > regression, so it might not stop anyone from running Linux 5.16 -- but > it's one that three people separately reported in the 5.16 devel cycle, > so others will likely encounter it as well if we leave it unfixed in > 5.16. 
This will likely annoy some people, especially if they invest time > in bisecting it, only to find out that the forth iteration of the fix > for the regression is already available since December the 2nd. > > Ciao, Thorsten
On Fri, Dec 31, 2021 at 6:24 AM Thorsten Leemhuis <regressions@leemhuis.info> wrote: > > If we get it into rc8 (which is still possible, even if a bit hard due > to the new year festivities), it will get at least one week of testing. I took it with Hugh's ack from his reply to this, so it should be in rc8. I'm not sure how much testing a week in rc8 will actually get, but next week is likely to be at least a _bit_ more active than the current week has been. And it's been tested by the people reporting it (and apparently Hugh too). Linus
On Fri, Dec 31, 2021 at 11:14 AM Linus Torvalds <torvalds@linux-foundation.org> wrote: > > On Fri, Dec 31, 2021 at 6:24 AM Thorsten Leemhuis > <regressions@leemhuis.info> wrote: > > > > If we get it into rc8 (which is still possible, even if a bit hard due > > to the new year festivities), it will get at least one week of testing. > > I took it with Hugh's ack from his reply to this, so it should be in rc8. Pushed out as 1b4e3f26f9f7 ("mm: vmscan: Reduce throttling due to a failure to make progress") Linus
On Fri, Dec 31, 2021 at 11:21 AM Linus Torvalds <torvalds@linux-foundation.org> wrote: > > Pushed out as 1b4e3f26f9f7 ("mm: vmscan: Reduce throttling due to a > failure to make progress") .. and I _think_ this empties the regzbot queue for this release, Thorsten. No? Linus
On Fri, 31 Dec 2021 11:21:14 -0800 Linus Torvalds <torvalds@linux-foundation.org> wrote: > On Fri, Dec 31, 2021 at 11:14 AM Linus Torvalds > <torvalds@linux-foundation.org> wrote: > > > > On Fri, Dec 31, 2021 at 6:24 AM Thorsten Leemhuis > > <regressions@leemhuis.info> wrote: > > > > > > If we get it into rc8 (which is still possible, even if a bit hard due > > > to the new year festivities), it will get at least one week of testing. > > > > I took it with Hugh's ack from his reply to this, so it should be in rc8. > > Pushed out as 1b4e3f26f9f7 ("mm: vmscan: Reduce throttling due to a > failure to make progress") Needs this fixup, which I shall tweak a bit then send formally in a few minutes. From: Mel Gorman <mgorman@techsingularity.net> Subject: mm: vmscan: reduce throttling due to a failure to make progress -fix Hugh Dickins reported the following My tmpfs swapping load (tweaked to use huge pages more heavily than in real life) is far from being a realistic load: but it was notably slowed down by your throttling mods in 5.16-rc, and this patch makes it well again - thanks. But: it very quickly hit NULL pointer until I changed that last line to if (first_pgdat) consider_reclaim_throttle(first_pgdat, sc); The likely issue is that huge pages are a major component of the test workload. When this is the case, first_pgdat may never get set if compaction is ready to continue due to this check if (IS_ENABLED(CONFIG_COMPACTION) && sc->order > PAGE_ALLOC_COSTLY_ORDER && compaction_ready(zone, sc)) { sc->compaction_ready = true; continue; } If this was true for every zone in the zonelist, first_pgdat would never get set resulting in a NULL pointer exception. This is a fix to the mmotm patch mm-vmscan-reduce-throttling-due-to-a-failure-to-make-progress.patch Link: https://lkml.kernel.org/r/20211209095453.GM3366@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Reported-by: Hugh Dickins <hughd@google.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Rik van Riel <riel@surriel.com> Cc: Mike Galbraith <efault@gmx.de> Cc: Darrick J. Wong <djwong@kernel.org> Cc: Shakeel Butt <shakeelb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> --- mm/vmscan.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) --- a/mm/vmscan.c~mm-vmscan-reduce-throttling-due-to-a-failure-to-make-progress-fix +++ a/mm/vmscan.c @@ -3530,7 +3530,8 @@ static void shrink_zones(struct zonelist shrink_node(zone->zone_pgdat, sc); } - consider_reclaim_throttle(first_pgdat, sc); + if (first_pgdat) + consider_reclaim_throttle(first_pgdat, sc); /* * Restore to original mask to avoid the impact on the caller if we
On Fri, Dec 31, 2021 at 1:04 PM Andrew Morton <akpm@linux-foundation.org> wrote: > > Needs this fixup, which I shall tweak a bit then send formally > in a few minutes. Thanks, applied. Linus
On 31.12.21 20:22, Linus Torvalds wrote: > On Fri, Dec 31, 2021 at 11:21 AM Linus Torvalds > <torvalds@linux-foundation.org> wrote: >> >> Pushed out as 1b4e3f26f9f7 ("mm: vmscan: Reduce throttling due to a >> failure to make progress") Thx. > .. and I _think_ this empties the regzbot queue for this release, Thorsten. No? Well, it was the regression that bothered me most. But there are still a few out there that got introduced this cycle.

There is a regression in RDMA/mlx5 introduced in v5.16-rc5: https://lore.kernel.org/lkml/f298db4ec5fdf7a2d1d166ca2f66020fd9397e5c.1640079962.git.leonro@nvidia.com/ https://lore.kernel.org/all/EEBA2D1C-F29C-4237-901C-587B60CEE113@oracle.com/ A fix is available, but got stuck afaics: https://lore.kernel.org/lkml/f298db4ec5fdf7a2d1d166ca2f66020fd9397e5c.1640079962.git.leonro@nvidia.com/ And I only noticed just now: a revert was also discussed, but not performed: https://lore.kernel.org/all/20211222101312.1358616-1-maorg@nvidia.com/ Will let Greg know, seems the commit got backported to 5.15.

s0ix suspend broke for some AMD machines, Alex and Mario are busy looking into it: https://gitlab.freedesktop.org/drm/amd/-/issues/1821 https://bugzilla.kernel.org/show_bug.cgi?id=215436

Alex is also dealing with another issue where the screen contents now get restored after some input events: https://bugzilla.kernel.org/show_bug.cgi?id=215203

There still seems to be a performance regression that Josef and Valentin have been trying hard to pin down without much success for weeks now: https://lore.kernel.org/lkml/87tuf07hdk.mognet@arm.com/

And there is one more report, but it might be a follow-up error due to another regression: https://lore.kernel.org/linux-pm/52933493.dBzk7ret6Y@geek500/
Hi Mel, Mel Gorman wrote: > > Mike Galbraith, Alexey Avramov and Darrick Wong all reported similar > problems due to reclaim throttling for excessive lengths of time. > In Alexey's case, a memory hog that should go OOM quickly stalls for > several minutes before stalling. In Mike and Darrick's cases, a small > memcg environment stalled excessively even though the system had enough > memory overall. >

I recently found a regression when I tested MGLRU with fio on Linux 5.16-rc6 [1]. After this patch was applied, I re-ran the test with Linux 5.16, but the regression has not been fixed yet. The workload is to let fio perform random access on files with buffered IO. The total file size is 2x the memory size. Files are stored on pmem. For each configuration, I ran fio 10 times and reported the average and the standard deviation.

Fio command
===========

$ numactl --cpubind=0 --membind=0 fio --name=randread \
    --directory=/mnt/pmem/ --size={10G, 5G} --io_size=1000TB \
    --time_based --numjobs={40, 80} --ioengine=io_uring \
    --ramp_time=20m --runtime=10m --iodepth=128 \
    --iodepth_batch_submit=32 --iodepth_batch_complete=32 \
    --rw=randread --random_distribution=random \
    --direct=0 --norandommap --group_reporting

Results in throughput (MB/s):
=============================

+------------+------+-------+------+-------+----------+-------+
| Jobs / CPU | 5.15 | stdev | 5.16 | stdev | 5.17-rc3 | stdev |
+------------+------+-------+------+-------+----------+-------+
|     1      | 8411 |  75   | 7459 |  38   |   7331   |  36   |
+------------+------+-------+------+-------+----------+-------+
|     2      | 8417 |  54   | 7491 |  41   |   7383   |  15   |
+------------+------+-------+------+-------+----------+-------+

[1] https://lore.kernel.org/linux-mm/20220105024423.26409-1-szhai2@cs.rochester.edu/

Thanks! Shuang
On Mon, Feb 14, 2022 at 04:10:50PM -0500, Shuang Zhai wrote: > Hi Mel, > > Mel Gorman wrote: > > > > Mike Galbraith, Alexey Avramov and Darrick Wong all reported similar > > problems due to reclaim throttling for excessive lengths of time. > > In Alexey's case, a memory hog that should go OOM quickly stalls for > > several minutes before stalling. In Mike and Darrick's cases, a small > > memcg environment stalled excessively even though the system had enough > > memory overall. > > > > I recently found a regression when I tested MGLRU with fio on Linux > 5.16-rc6 [1]. After this patch was applied, I re-ran the test with Linux > 5.16, but the regression has not been fixed yet. > Am I correct in thinking that this only happens with MGLRU?
Mel Gorman wrote: > On Mon, Feb 14, 2022 at 04:10:50PM -0500, Shuang Zhai wrote: > > Hi Mel, > > > > Mel Gorman wrote: > > > > > > Mike Galbraith, Alexey Avramov and Darrick Wong all reported similar > > > problems due to reclaim throttling for excessive lengths of time. > > > In Alexey's case, a memory hog that should go OOM quickly stalls for > > > several minutes before stalling. In Mike and Darrick's cases, a small > > > memcg environment stalled excessively even though the system had enough > > > memory overall. > > > > > > > I recently found a regression when I tested MGLRU with fio on Linux > > 5.16-rc6 [1]. After this patch was applied, I re-ran the test with Linux > > 5.16, but the regression has not been fixed yet. > > > > Am I correct in thinging that this only happens with MGLRU? Sorry about the confusion and let me clarify on this. The regression happens on upstream Linux with the default page replacement mechanism.
On Tue, Feb 22, 2022 at 12:27:31PM -0500, Shuang Zhai wrote: > Mel Gorman wrote: > > On Mon, Feb 14, 2022 at 04:10:50PM -0500, Shuang Zhai wrote: > > > Hi Mel, > > > > > > Mel Gorman wrote: > > > > > > > > Mike Galbraith, Alexey Avramov and Darrick Wong all reported similar > > > > problems due to reclaim throttling for excessive lengths of time. > > > > In Alexey's case, a memory hog that should go OOM quickly stalls for > > > > several minutes before stalling. In Mike and Darrick's cases, a small > > > > memcg environment stalled excessively even though the system had enough > > > > memory overall. > > > > > > > > > > I recently found a regression when I tested MGLRU with fio on Linux > > > 5.16-rc6 [1]. After this patch was applied, I re-ran the test with Linux > > > 5.16, but the regression has not been fixed yet. > > > > > > > Am I correct in thinging that this only happens with MGLRU? > > Sorry about the confusion and let me clarify on this. The regression happens > on upstream Linux with the default page replacement mechanism. Ok, the fio command for me simply exits with an error and even if it didn't, the test machine I have with persistent memory does not have enough pmem to trigger memory reclaim issues with fio. Can you do the following please?

# echo 1 > vmscan/mm_vmscan_throttled/enable
# cat /sys/kernel/debug/tracing/trace_pipe > trace.out

and run the test? Compress trace.out with xz and send it to me by mail. If the trace is too large, send as much as you can.
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 58e744b78c2c..936dc0b6c226 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -277,6 +277,7 @@ enum vmscan_throttle_state {
 	VMSCAN_THROTTLE_WRITEBACK,
 	VMSCAN_THROTTLE_ISOLATED,
 	VMSCAN_THROTTLE_NOPROGRESS,
+	VMSCAN_THROTTLE_CONGESTED,
 	NR_VMSCAN_THROTTLE,
 };
 
diff --git a/include/trace/events/vmscan.h b/include/trace/events/vmscan.h
index f25a6149d3ba..ca2e9009a651 100644
--- a/include/trace/events/vmscan.h
+++ b/include/trace/events/vmscan.h
@@ -30,12 +30,14 @@
 #define _VMSCAN_THROTTLE_WRITEBACK	(1 << VMSCAN_THROTTLE_WRITEBACK)
 #define _VMSCAN_THROTTLE_ISOLATED	(1 << VMSCAN_THROTTLE_ISOLATED)
 #define _VMSCAN_THROTTLE_NOPROGRESS	(1 << VMSCAN_THROTTLE_NOPROGRESS)
+#define _VMSCAN_THROTTLE_CONGESTED	(1 << VMSCAN_THROTTLE_CONGESTED)
 
 #define show_throttle_flags(flags) \
 	(flags) ? __print_flags(flags, "|", \
 		{_VMSCAN_THROTTLE_WRITEBACK, "VMSCAN_THROTTLE_WRITEBACK"}, \
 		{_VMSCAN_THROTTLE_ISOLATED, "VMSCAN_THROTTLE_ISOLATED"}, \
-		{_VMSCAN_THROTTLE_NOPROGRESS, "VMSCAN_THROTTLE_NOPROGRESS"} \
+		{_VMSCAN_THROTTLE_NOPROGRESS, "VMSCAN_THROTTLE_NOPROGRESS"}, \
+		{_VMSCAN_THROTTLE_CONGESTED, "VMSCAN_THROTTLE_CONGESTED"} \
 		) : "VMSCAN_THROTTLE_NONE"
 
diff --git a/mm/vmscan.c b/mm/vmscan.c
index fb9584641ac7..4c4d5f6cd8a3 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1021,6 +1021,39 @@ static void handle_write_error(struct address_space *mapping,
 	unlock_page(page);
 }
 
+static bool skip_throttle_noprogress(pg_data_t *pgdat)
+{
+	int reclaimable = 0, write_pending = 0;
+	int i;
+
+	/*
+	 * If kswapd is disabled, reschedule if necessary but do not
+	 * throttle as the system is likely near OOM.
+	 */
+	if (pgdat->kswapd_failures >= MAX_RECLAIM_RETRIES)
+		return true;
+
+	/*
+	 * If there are a lot of dirty/writeback pages then do not
+	 * throttle as throttling will occur when the pages cycle
+	 * towards the end of the LRU if still under writeback.
+	 */
+	for (i = 0; i < MAX_NR_ZONES; i++) {
+		struct zone *zone = pgdat->node_zones + i;
+
+		if (!populated_zone(zone))
+			continue;
+
+		reclaimable += zone_reclaimable_pages(zone);
+		write_pending += zone_page_state_snapshot(zone,
+						  NR_ZONE_WRITE_PENDING);
+	}
+	if (2 * write_pending <= reclaimable)
+		return true;
+
+	return false;
+}
+
 void reclaim_throttle(pg_data_t *pgdat, enum vmscan_throttle_state reason)
 {
 	wait_queue_head_t *wqh = &pgdat->reclaim_wait[reason];
@@ -1056,8 +1089,16 @@ void reclaim_throttle(pg_data_t *pgdat, enum vmscan_throttle_state reason)
 		}
 
 		break;
+	case VMSCAN_THROTTLE_CONGESTED:
+		fallthrough;
 	case VMSCAN_THROTTLE_NOPROGRESS:
-		timeout = HZ/2;
+		if (skip_throttle_noprogress(pgdat)) {
+			cond_resched();
+			return;
+		}
+
+		timeout = 1;
+
 		break;
 	case VMSCAN_THROTTLE_ISOLATED:
 		timeout = HZ/50;
@@ -3321,7 +3362,7 @@ static void shrink_node(pg_data_t *pgdat, struct scan_control *sc)
 	if (!current_is_kswapd() && current_may_throttle() &&
 	    !sc->hibernation_mode &&
 	    test_bit(LRUVEC_CONGESTED, &target_lruvec->flags))
-		reclaim_throttle(pgdat, VMSCAN_THROTTLE_WRITEBACK);
+		reclaim_throttle(pgdat, VMSCAN_THROTTLE_CONGESTED);
 
 	if (should_continue_reclaim(pgdat, sc->nr_reclaimed - nr_reclaimed,
 				    sc))
@@ -3386,16 +3427,16 @@ static void consider_reclaim_throttle(pg_data_t *pgdat, struct scan_control *sc)
 	}
 
 	/*
-	 * Do not throttle kswapd on NOPROGRESS as it will throttle on
-	 * VMSCAN_THROTTLE_WRITEBACK if there are too many pages under
-	 * writeback and marked for immediate reclaim at the tail of
-	 * the LRU.
+	 * Do not throttle kswapd or cgroup reclaim on NOPROGRESS as it will
+	 * throttle on VMSCAN_THROTTLE_WRITEBACK if there are too many pages
+	 * under writeback and marked for immediate reclaim at the tail of the
+	 * LRU.
 	 */
-	if (current_is_kswapd())
+	if (current_is_kswapd() || cgroup_reclaim(sc))
 		return;
 
 	/* Throttle if making no progress at high prioities. */
-	if (sc->priority < DEF_PRIORITY - 2)
+	if (sc->priority == 1 && !sc->nr_reclaimed)
 		reclaim_throttle(pgdat, VMSCAN_THROTTLE_NOPROGRESS);
 }
 
@@ -3415,6 +3456,7 @@ static void shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
 	unsigned long nr_soft_scanned;
 	gfp_t orig_mask;
 	pg_data_t *last_pgdat = NULL;
+	pg_data_t *first_pgdat = NULL;
 
 	/*
 	 * If the number of buffer_heads in the machine exceeds the maximum
@@ -3478,14 +3520,18 @@ static void shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
 			/* need some check for avoid more shrink_zone() */
 		}
 
+		if (!first_pgdat)
+			first_pgdat = zone->zone_pgdat;
+
 		/* See comment about same check for global reclaim above */
 		if (zone->zone_pgdat == last_pgdat)
 			continue;
 		last_pgdat = zone->zone_pgdat;
 		shrink_node(zone->zone_pgdat, sc);
-		consider_reclaim_throttle(zone->zone_pgdat, sc);
 	}
 
+	consider_reclaim_throttle(first_pgdat, sc);
+
 	/*
 	 * Restore to original mask to avoid the impact on the caller if we
 	 * promoted it to __GFP_HIGHMEM.
Mike Galbraith, Alexey Avramov and Darrick Wong all reported similar
problems due to reclaim throttling for excessive lengths of time. In
Alexey's case, a memory hog that should go OOM quickly stalls for
several minutes before going OOM. In Mike and Darrick's cases, a small
memcg environment stalled excessively even though the system had enough
memory overall.

Commit 69392a403f49 ("mm/vmscan: throttle reclaim when no progress is
being made") introduced the problem although commit a19594ca4a8b
("mm/vmscan: increase the timeout if page reclaim is not making
progress") made it worse. Systems at or near an OOM state that cannot be
recovered must reach OOM quickly and memcg should kill tasks if a memcg
is near OOM.

To address this, only stall for the first zone in the zonelist, reduce
the timeout to 1 tick for VMSCAN_THROTTLE_NOPROGRESS and only stall if
the scan control nr_reclaimed is 0, kswapd is still active and there
were excessive pages pending for writeback. If kswapd has stopped
reclaiming due to excessive failures, do not stall at all so that OOM
triggers relatively quickly. Similarly, if an LRU is simply congested,
only lightly throttle similar to NOPROGRESS.

Alexey's original case was the most straightforward:

	for i in {1..3}; do tail /dev/zero; done

On vanilla 5.16-rc1, this test stalled heavily; after the patch the test
completes in a few seconds, similar to 5.15.

Alexey's second test case added watching a youtube video while tail runs
10 times. On 5.15, playback only jitters slightly; 5.16-rc1 stalls a lot
with lots of frames missing and numerous audio glitches. With this patch
applied, the video plays similarly to 5.15.

Link: https://lore.kernel.org/r/99e779783d6c7fce96448a3402061b9dc1b3b602.camel@gmx.de
Link: https://lore.kernel.org/r/20211124011954.7cab9bb4@mail.inbox.lv
Link: https://lore.kernel.org/r/20211022144651.19914-1-mgorman@techsingularity.net
[lkp@intel.com: Fix W=1 build warning]
Reported-and-tested-by: Alexey Avramov <hakavlad@inbox.lv>
Reported-and-tested-by: Mike Galbraith <efault@gmx.de>
Reported-and-tested-by: Darrick J. Wong <djwong@kernel.org>
Reported-by: kernel test robot <lkp@intel.com>
Fixes: 69392a403f49 ("mm/vmscan: throttle reclaim when no progress is being made")
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 include/linux/mmzone.h        |  1 +
 include/trace/events/vmscan.h |  4 ++-
 mm/vmscan.c                   | 64 ++++++++++++++++++++++++++++++-----
 3 files changed, 59 insertions(+), 10 deletions(-)
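As a rough way to connect the reproducer above with the throttling
instrumentation, a sketch along the following lines could be used to count
how often, and for which reason, direct reclaimers were throttled while the
memory hog runs. The tracefs path is an assumption based on a standard
layout, and the reason strings are the ones printed by show_throttle_flags()
in the patch:

# Capture mm_vmscan_throttled events while Alexey's reproducer runs, then
# summarise throttle events per reason. The expectation with this patch is
# far fewer VMSCAN_THROTTLE_NOPROGRESS entries before the hogs are
# OOM-killed.
cd /sys/kernel/debug/tracing
echo 1 > events/vmscan/mm_vmscan_throttled/enable
cat trace_pipe > /tmp/throttle.trace &

for i in {1..3}; do tail /dev/zero; done    # expected to be OOM-killed quickly

kill %1
echo 0 > events/vmscan/mm_vmscan_throttled/enable
grep -o 'VMSCAN_THROTTLE_[A-Z]*' /tmp/throttle.trace | sort | uniq -c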