Message ID: 20211019090108.25501-1-mgorman@techsingularity.net
Series:     Remove dependency on congestion_wait in mm/

On Tue, 19 Oct 2021 10:01:00 +0100 Mel Gorman <mgorman@techsingularity.net> wrote:

> Changelog since v3
> o Count writeback completions for NR_THROTTLED_WRITTEN only
> o Use IRQ-safe inc_node_page_state
> o Remove redundant throttling
>
> This series is also available at
>
> git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git mm-reclaimcongest-v4r2
>
> This series removes all calls to congestion_wait in mm/ and deletes
> wait_iff_congested. It's not a clever implementation but
> congestion_wait has been broken for a long time
> (https://lore.kernel.org/linux-mm/45d8b7a6-8548-65f5-cccf-9f451d4ae3d4@kernel.dk/).

The block layer doesn't call clear_bdi_congested() at all. I never
knew this until recent discussions :(

So this means that congestion_wait() will always time out, yes?

> Even if congestion throttling worked, it was never a great idea.

Well. It was a good idea until things like isolation got added!

> While excessive dirty/writeback pages at the tail of the LRU is one
> possibility that reclaim may be slow, there is also the problem of too
> many pages being isolated and reclaim failing for other reasons
> (elevated references, too many pages isolated, excessive LRU
> contention etc).
>
> This series replaces the "congestion" throttling with 3 different types.
>
> o If there are too many dirty/writeback pages, sleep until a timeout
>   or enough pages get cleaned
> o If too many pages are isolated, sleep until enough isolated pages
>   are either reclaimed or put back on the LRU
> o If no progress is being made, direct reclaim tasks sleep until
>   another task makes progress with acceptable efficiency.
>
> This was initially tested with a mix of workloads that used to trigger
> corner cases that no longer work.

Mix of workloads is nice, but a mix of devices is more important here.

I trust some testing was done on plain old spinning disks? And USB
storage, please! And NFS plays with BDI congestion. Ceph and FUSE also.

We've had complaints about this stuff forever. Usually of the form of
interactive tasks getting horridly stalled by writeout/swap activity.
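For context, congestion_wait() boils down to a timed sleep on a global
waitqueue that only clear_bdi_congested() ever wakes. The following is a
simplified, paraphrased sketch of that pre-series logic, not a quote of
the real mm/backing-dev.c code (which also tracks per-bdi congested
bits):

#include <linux/wait.h>
#include <linux/sched.h>

struct backing_dev_info;

static wait_queue_head_t congestion_wqh[2] = {
	__WAIT_QUEUE_HEAD_INITIALIZER(congestion_wqh[0]),
	__WAIT_QUEUE_HEAD_INITIALIZER(congestion_wqh[1])
};

long congestion_wait(int sync, long timeout)
{
	long ret;
	DEFINE_WAIT(wait);
	wait_queue_head_t *wqh = &congestion_wqh[sync];

	/* Sleep until woken or until the timeout expires... */
	prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
	ret = io_schedule_timeout(timeout);
	finish_wait(wqh, &wait);
	return ret;
}

void clear_bdi_congested(struct backing_dev_info *bdi, int sync)
{
	/* ...but the only wakeup comes from here. */
	if (waitqueue_active(&congestion_wqh[sync]))
		wake_up(&congestion_wqh[sync]);
}

If nothing in the I/O path ever calls clear_bdi_congested(), the
wake_up() never fires and every congestion_wait() sleeps for the full
timeout, which is what Mel confirms below.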
On Tue, Oct 19, 2021 at 03:00:25PM -0700, Andrew Morton wrote:
> On Tue, 19 Oct 2021 10:01:00 +0100 Mel Gorman <mgorman@techsingularity.net> wrote:
>
> > Changelog since v3
> > o Count writeback completions for NR_THROTTLED_WRITTEN only
> > o Use IRQ-safe inc_node_page_state
> > o Remove redundant throttling
> >
> > This series is also available at
> >
> > git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git mm-reclaimcongest-v4r2
> >
> > This series removes all calls to congestion_wait in mm/ and deletes
> > wait_iff_congested. It's not a clever implementation but
> > congestion_wait has been broken for a long time
> > (https://lore.kernel.org/linux-mm/45d8b7a6-8548-65f5-cccf-9f451d4ae3d4@kernel.dk/).
>
> The block layer doesn't call clear_bdi_congested() at all. I never
> knew this until recent discussions :(
>
> So this means that congestion_wait() will always time out, yes?
>

Unfortunately, yes, except for filesystems that call
[set|clear]_bdi_congested. For the test case in the series leader,
congestion_wait always hit the full timeout.

> > Even if congestion throttling worked, it was never a great idea.
>
> Well. It was a good idea until things like isolation got added!
>

Well, true to an extent, although it was always the case that reclaim
could fail to make progress for reasons other than pages under
writeback. But you're right, saying it was "never a great idea" is
overkill. congestion_wait used to work and I expect it was particularly
helpful before IO-less write throttling, accurate dirty page tracking
and immediate reclaim existed.

> > While excessive dirty/writeback pages at the tail of the LRU is one
> > possibility that reclaim may be slow, there is also the problem of
> > too many pages being isolated and reclaim failing for other reasons
> > (elevated references, too many pages isolated, excessive LRU
> > contention etc).
> >
> > This series replaces the "congestion" throttling with 3 different types.
> >
> > o If there are too many dirty/writeback pages, sleep until a timeout
> >   or enough pages get cleaned
> > o If too many pages are isolated, sleep until enough isolated pages
> >   are either reclaimed or put back on the LRU
> > o If no progress is being made, direct reclaim tasks sleep until
> >   another task makes progress with acceptable efficiency.
> >
> > This was initially tested with a mix of workloads that used to trigger
> > corner cases that no longer work.
>
> Mix of workloads is nice, but a mix of devices is more important here.

I tested as much as I could but, as well as storage devices, different
memory sizes are also relevant.

> I trust some testing was done on plain old spinning disks? And USB
> storage, please! And NFS plays with BDI congestion. Ceph and FUSE also.
>

Plain old spinning disk was tested. Basic USB testing didn't show many
problems, although given that it was my desktop machine, it might have
had too much memory as no amount of IO to the USB key triggered a
problem where reclaim failed to make progress and got throttled. There
was basic NFS testing although I didn't try running stutterp over NFS.
Given that the original thread motivating this was NFS-related and the
people involved are cc'd, I'm hoping they'll give it a realistic kick.
I don't have a realistic setup for ceph and didn't try fuse.

> We've had complaints about this stuff forever. Usually of the form of
> interactive tasks getting horridly stalled by writeout/swap activity.

I know, and there is no guarantee it won't happen again.
The main problem I was trying to solve was that congestion-based
throttling is not suitable for mm/. From reclaim context, there isn't a
good way of detecting "interactive" tasks. At best, under reclaim
pressure we could try tracking allocation rates and throttle heavy
allocators more than light allocators, but I didn't want to introduce
complexity prematurely.
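To make the review comments that follow easier to place, here is a
rough sketch of the shape of the replacement: a per-node waitqueue per
throttling reason, a timed sleep, and an explicit wakeup when the
condition clears. The VMSCAN_THROTTLE_* and reclaim_wait names come up
later in the thread; the example_* helpers, the example_pgdat container
and the HZ/10 timeout are illustrative assumptions, not code taken from
the series:

#include <linux/wait.h>
#include <linux/sched.h>
#include <linux/atomic.h>

enum vmscan_throttle_state {
	VMSCAN_THROTTLE_WRITEBACK,
	VMSCAN_THROTTLE_ISOLATED,
	VMSCAN_THROTTLE_NOPROGRESS,
	NR_VMSCAN_THROTTLE,
};

/* Stand-in for the per-node state the series hangs off the pgdat. */
struct example_pgdat {
	wait_queue_head_t reclaim_wait[NR_VMSCAN_THROTTLE];
	atomic_t nr_writeback_throttled;	/* tasks sleeping on writeback */
};

/* Direct reclaim calls something like this instead of congestion_wait(). */
static void example_reclaim_throttle(struct example_pgdat *pgdat,
				     enum vmscan_throttle_state reason)
{
	wait_queue_head_t *wqh = &pgdat->reclaim_wait[reason];
	DEFINE_WAIT(wait);

	if (reason == VMSCAN_THROTTLE_WRITEBACK)
		atomic_inc(&pgdat->nr_writeback_throttled);

	/* The timeout guarantees forward progress even with no wakeup. */
	prepare_to_wait(wqh, &wait, TASK_INTERRUPTIBLE);
	schedule_timeout(HZ / 10);
	finish_wait(wqh, &wait);

	if (reason == VMSCAN_THROTTLE_WRITEBACK)
		atomic_dec(&pgdat->nr_writeback_throttled);
}

/* Called as writeback completes; wakes every throttled task. */
static void example_writeback_completed(struct example_pgdat *pgdat)
{
	if (atomic_read(&pgdat->nr_writeback_throttled))
		wake_up_all(&pgdat->reclaim_wait[VMSCAN_THROTTLE_WRITEBACK]);
}

The wake_up_all() in the last helper is the point Neil questions below:
because the sleepers use prepare_to_wait() rather than
prepare_to_wait_exclusive(), a plain wake_up() would wake them all just
the same.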
On Tue, 19 Oct 2021, Mel Gorman wrote:
> Changelog since v3
> o Count writeback completions for NR_THROTTLED_WRITTEN only
> o Use IRQ-safe inc_node_page_state
> o Remove redundant throttling
>
> This series is also available at
>
> git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git mm-reclaimcongest-v4r2
>
> This series removes all calls to congestion_wait
> in mm/ and deletes wait_iff_congested.

Thanks for this.
I don't have sufficient expertise for a positive review, but it seems to
make sense with one exception which I have commented on separately.

In general, I still don't like the use of wake_up_all(), though it won't
cause incorrect behaviour.

I would prefer the first patch would:
 - define NR_VMSCAN_THROTTLE
 - make reclaim_wait an array
 - spell nr_reclaim_throttled as nr_writeback_throttled

rather than leaving those changes for the second patch. I think that
would make review easier. But these are minor and I'll not mention them
again.

Thanks,
NeilBrown
On Fri, Oct 22, 2021 at 12:15:10PM +1100, NeilBrown wrote:
> On Tue, 19 Oct 2021, Mel Gorman wrote:
> > Changelog since v3
> > o Count writeback completions for NR_THROTTLED_WRITTEN only
> > o Use IRQ-safe inc_node_page_state
> > o Remove redundant throttling
> >
> > This series is also available at
> >
> > git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git mm-reclaimcongest-v4r2
> >
> > This series removes all calls to congestion_wait
> > in mm/ and deletes wait_iff_congested.
>
> Thanks for this.
> I don't have sufficient expertise for a positive review, but it seems to
> make sense with one exception which I have commented on separately.
>

A test battering NFS would still be nice!

> In general, I still don't like the use of wake_up_all(), though it won't
> cause incorrect behaviour.
>

Removing wake_up_all would be tricky. Ideally it would be prioritised
but, more importantly, some sort of guarantee should exist that enough
wakeup events trigger to wake tasks before the timeout. That would need
careful thinking about each reclaim reason. For example, if N tasks
throttle on NOPROGRESS, there is no guarantee that N tasks are currently
in reclaim that would wake each sleeping task as progress is made. It's
similar for writeback: are enough pages under writeback to trigger each
wakeup? A more subtle issue is whether each reason should be strict if
waking tasks one at a time. For example, a task sleeping on writeback
might make progress for other reasons such as the working set changing
during reclaim or a large task exiting. Of course, the same concerns
exist for the series as it stands, but the worst-case scenarios are
mitigated by wake_up_all.

> I would prefer the first patch would:
>  - define NR_VMSCAN_THROTTLE
>  - make reclaim_wait an array
>  - spell nr_reclaim_throttled as nr_writeback_throttled
>
> rather than leaving those changes for the second patch. I think that
> would make review easier.
>

I can do this. Normally I try to structure a series from least-to-most
controversial so that it can be cut at any point and still make sense,
so the array was defined in the second patch because that's when it is
required. However, I had already defined the enum in patch 1 for the
tracepoint so I might as well make it an array too.
On Fri, 22 Oct 2021, Mel Gorman wrote:
> On Fri, Oct 22, 2021 at 12:15:10PM +1100, NeilBrown wrote:
>
> > In general, I still don't like the use of wake_up_all(), though it won't
> > cause incorrect behaviour.
> >
>
> Removing wake_up_all would be tricky.

I think there is a misunderstanding. Removing wake_up_all() is as
simple as
  s/wake_up_all/wake_up/

If you used prepare_to_wait_exclusive(), then wake_up() would only wake
one waiter, while wake_up_all() would wake all of them.
As you use prepare_to_wait(), wake_up() will wake all waiters - as will
wake_up_all().

When I see "wake_up_all()" I assume it is an exclusive wait, and that
for some reason this particular wake_up needs to wake up all waiters.
That is not the case here.

I suspect it would be clearer if "wake_up" always woke everything, and
"wake_up_one" was the special case - but unfortunately that isn't what
we have.

There are other non-exclusive waiters which use wake_up_all(), but the
vast majority of wakeups use wake_up(), and most of those are for
non-exclusive waiters.

Thanks,
NeilBrown
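For readers less familiar with the waitqueue API, the behaviour Neil
describes hinges on which side marks the wait as exclusive, not on the
wake_up variant alone. A minimal illustrative sketch (not code from the
series):

#include <linux/wait.h>
#include <linux/sched.h>

static DECLARE_WAIT_QUEUE_HEAD(example_wqh);

/* Non-exclusive waiter: wake_up() and wake_up_all() both wake it. */
static void example_wait_nonexclusive(void)
{
	DEFINE_WAIT(wait);

	prepare_to_wait(&example_wqh, &wait, TASK_UNINTERRUPTIBLE);
	schedule();
	finish_wait(&example_wqh, &wait);
}

/* Exclusive waiter: wake_up() releases at most one of these per call. */
static void example_wait_exclusive(void)
{
	DEFINE_WAIT(wait);

	prepare_to_wait_exclusive(&example_wqh, &wait, TASK_UNINTERRUPTIBLE);
	schedule();
	finish_wait(&example_wqh, &wait);
}

/*
 * wake_up(&example_wqh)     -> all non-exclusive waiters, plus at most
 *                              one exclusive waiter
 * wake_up_all(&example_wqh) -> every waiter, exclusive or not
 */

So with the series using prepare_to_wait(), switching wake_up_all() to
wake_up() changes nothing; the behaviour only diverges once the waiters
become exclusive.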
On Fri, Oct 22, 2021 at 10:26:30PM +1100, NeilBrown wrote:
> On Fri, 22 Oct 2021, Mel Gorman wrote:
> > On Fri, Oct 22, 2021 at 12:15:10PM +1100, NeilBrown wrote:
> >
> > > In general, I still don't like the use of wake_up_all(), though it won't
> > > cause incorrect behaviour.
> > >
> >
> > Removing wake_up_all would be tricky.
>
> I think there is a misunderstanding. Removing wake_up_all() is as
> simple as
>   s/wake_up_all/wake_up/
>
> If you used prepare_to_wait_exclusive(), then wake_up() would only wake
> one waiter, while wake_up_all() would wake all of them.
> As you use prepare_to_wait(), wake_up() will wake all waiters - as will
> wake_up_all().
>

Ok, yes, there was a misunderstanding. I thought you were suggesting a
move to exclusive wakeups. I felt that wake_up_all was explicit in
terms of intent and that I really meant for all tasks to wake instead
of one at a time.
On Sat, 23 Oct 2021, Mel Gorman wrote:
> On Fri, Oct 22, 2021 at 10:26:30PM +1100, NeilBrown wrote:
> > On Fri, 22 Oct 2021, Mel Gorman wrote:
> > > On Fri, Oct 22, 2021 at 12:15:10PM +1100, NeilBrown wrote:
> > >
> > > > In general, I still don't like the use of wake_up_all(), though it won't
> > > > cause incorrect behaviour.
> > > >
> > >
> > > Removing wake_up_all would be tricky.
> >
> > I think there is a misunderstanding. Removing wake_up_all() is as
> > simple as
> >   s/wake_up_all/wake_up/
> >
> > If you used prepare_to_wait_exclusive(), then wake_up() would only wake
> > one waiter, while wake_up_all() would wake all of them.
> > As you use prepare_to_wait(), wake_up() will wake all waiters - as will
> > wake_up_all().
> >
>
> Ok, yes, there was a misunderstanding. I thought you were suggesting a
> move to exclusive wakeups. I felt that wake_up_all was explicit in
> terms of intent and that I really meant for all tasks to wake instead
> of one at a time.

Fair enough. Thanks for changing it :-)

But this prompts me to wonder if exclusive wakeups would be a good idea
- which is a useful springboard to try to understand the code better.

For VMSCAN_THROTTLE_ISOLATED they probably are.
One pattern for reliable exclusive wakeups is for any thread that
received a wake-up to then consider sending a wake-up.

Two places receive VMSCAN_THROTTLE_ISOLATED wakeups and both then call
too_many_isolated() which - on success - sends another wakeup - before
the caller has had a chance to isolate anything. If, instead, the
wakeup was sent sometime later, after pages were isolated but before
the caller (isolate_migratepages_block() or shrink_inactive_list())
returned, then we would get an orderly progression of threads running
through that code.

VMSCAN_THROTTLE_WRITEBACK is a little less straightforward.
There are two different places that wait for the wakeup, and a wake_up
is sent to all waiters after a time proportional to the number of
waiters. It might make sense to wake one thread per time unit?
That might work well for do_writepages - every SWAP_CLUSTER_MAX writes
triggers one wakeup.
I'm less sure that it would work for shrink_node(). Maybe the
shrink_node() waiters could be non-exclusive so they get woken as soon
as SWAP_CLUSTER_MAX writes complete, while do_writepages waiters are
exclusive and get woken one at a time.

For VMSCAN_THROTTLE_NOPROGRESS .... I don't understand.
If one zone isn't making "enough" progress, we throttle before moving
on to the next zone. So we delay processing of the next zone, and only
indirectly delay re-processing of the current congested zone.
Maybe it makes sense, but I don't see it yet. I note that the commit
message says "it's messy". I can't argue with that!

I'll follow up with patches to clarify what I am thinking about the
first two. I'm not proposing the patches, just presenting them as part
of improving my understanding.

Thanks,
NeilBrown
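The "woken thread passes the wakeup on" pattern Neil sketches for
VMSCAN_THROTTLE_ISOLATED might look roughly like this, reusing the
hypothetical example_pgdat layout from the earlier sketch;
example_too_many_isolated() is a stand-in for the real
too_many_isolated() check:

/* Stub standing in for the real too_many_isolated() test. */
static bool example_too_many_isolated(struct example_pgdat *pgdat)
{
	return false;
}

static void example_throttle_isolated(struct example_pgdat *pgdat)
{
	wait_queue_head_t *wqh = &pgdat->reclaim_wait[VMSCAN_THROTTLE_ISOLATED];
	DEFINE_WAIT(wait);

	/* Exclusive wait: a plain wake_up() releases only one of us. */
	prepare_to_wait_exclusive(wqh, &wait, TASK_INTERRUPTIBLE);
	schedule_timeout(HZ / 10);	/* the timeout still bounds the stall */
	finish_wait(wqh, &wait);
}

static void example_isolate_pages(struct example_pgdat *pgdat)
{
	while (example_too_many_isolated(pgdat))
		example_throttle_isolated(pgdat);

	/* ... isolate a batch of pages and process it ... */

	/*
	 * Only wake the next exclusive waiter once this task has done its
	 * own isolation, giving the orderly one-at-a-time progression
	 * described above, rather than waking from too_many_isolated()
	 * before any progress has been made.
	 */
	wake_up(&pgdat->reclaim_wait[VMSCAN_THROTTLE_ISOLATED]);
}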
On Wed, Oct 27, 2021 at 11:43:22AM +1100, NeilBrown wrote:
> On Sat, 23 Oct 2021, Mel Gorman wrote:
> > On Fri, Oct 22, 2021 at 10:26:30PM +1100, NeilBrown wrote:
> > > On Fri, 22 Oct 2021, Mel Gorman wrote:
> > > > On Fri, Oct 22, 2021 at 12:15:10PM +1100, NeilBrown wrote:
> > > >
> > > > > In general, I still don't like the use of wake_up_all(), though it won't
> > > > > cause incorrect behaviour.
> > > > >
> > > >
> > > > Removing wake_up_all would be tricky.
> > >
> > > I think there is a misunderstanding. Removing wake_up_all() is as
> > > simple as
> > >   s/wake_up_all/wake_up/
> > >
> > > If you used prepare_to_wait_exclusive(), then wake_up() would only wake
> > > one waiter, while wake_up_all() would wake all of them.
> > > As you use prepare_to_wait(), wake_up() will wake all waiters - as will
> > > wake_up_all().
> > >
> >
> > Ok, yes, there was a misunderstanding. I thought you were suggesting a
> > move to exclusive wakeups. I felt that wake_up_all was explicit in
> > terms of intent and that I really meant for all tasks to wake instead
> > of one at a time.
>
> Fair enough. Thanks for changing it :-)
>
> But this prompts me to wonder if exclusive wakeups would be a good idea
> - which is a useful springboard to try to understand the code better.
>
> For VMSCAN_THROTTLE_ISOLATED they probably are.
> One pattern for reliable exclusive wakeups is for any thread that
> received a wake-up to then consider sending a wake-up.
>
> Two places receive VMSCAN_THROTTLE_ISOLATED wakeups and both then call
> too_many_isolated() which - on success - sends another wakeup - before
> the caller has had a chance to isolate anything. If, instead, the
> wakeup was sent sometime later, after pages were isolated but before
> the caller (isolate_migratepages_block() or shrink_inactive_list())
> returned, then we would get an orderly progression of threads running
> through that code.
>

That should work, as the throttling condition is straightforward. It
might even reduce a race condition where all of the woken throttled
tasks then trigger the same "too many isolated" condition.

> VMSCAN_THROTTLE_WRITEBACK is a little less straightforward.
> There are two different places that wait for the wakeup, and a wake_up
> is sent to all waiters after a time proportional to the number of
> waiters. It might make sense to wake one thread per time unit?

I'd avoid time as a wakeup condition other than the timeout, which is
there to guarantee forward progress. I assume you mean "one thread per
SWAP_CLUSTER_MAX completions".

> That might work well for do_writepages - every SWAP_CLUSTER_MAX writes
> triggers one wakeup.
> I'm less sure that it would work for shrink_node(). Maybe the
> shrink_node() waiters could be non-exclusive so they get woken as soon
> as SWAP_CLUSTER_MAX writes complete, while do_writepages waiters are
> exclusive and get woken one at a time.
>

It should work for either, with the slight caveat that the last waiter
may not see SWAP_CLUSTER_MAX completions.

> For VMSCAN_THROTTLE_NOPROGRESS .... I don't understand.
> If one zone isn't making "enough" progress, we throttle before moving
> on to the next zone. So we delay processing of the next zone, and only
> indirectly delay re-processing of the current congested zone.
> Maybe it makes sense, but I don't see it yet. I note that the commit
> message says "it's messy". I can't argue with that!
>

Yes, we delay the processing of the next zone when a given zone cannot
make progress.
The thinking is that circumstances that cause one zone to fail to make
progress could spill over to other zones in the absence of any
throttling. Where it might cause problems is where the preferred zone
is very small. If a bug showed up like that, a potential fix would be
to avoid throttling if the preferred zone is very small relative to the
total amount of memory local to the node or total memory (preferably
local node).

> I'll follow up with patches to clarify what I am thinking about the
> first two. I'm not proposing the patches, just presenting them as part
> of improving my understanding.
>

If I'm cc'd, I'll review and, if I think they're promising, I'll run
them through the same tests and machines.
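Mel's reading of "one thread per SWAP_CLUSTER_MAX completions" could be
implemented as a completion counter on the writeback side. The struct
and helper names below are illustrative assumptions rather than
anything from the series, and the timeout remains the backstop for the
last waiter that never sees a full batch:

#include <linux/wait.h>
#include <linux/sched.h>
#include <linux/atomic.h>
#include <linux/swap.h>		/* SWAP_CLUSTER_MAX */

struct example_wb_throttle {
	wait_queue_head_t wqh;
	atomic_t nr_throttled;	/* tasks currently sleeping */
	atomic_t nr_completed;	/* writeback completions observed */
};

static void example_throttle_on_writeback(struct example_wb_throttle *wt)
{
	DEFINE_WAIT(wait);

	atomic_inc(&wt->nr_throttled);
	prepare_to_wait_exclusive(&wt->wqh, &wait, TASK_INTERRUPTIBLE);
	schedule_timeout(HZ / 10);	/* backstop if wakeups never come */
	finish_wait(&wt->wqh, &wait);
	atomic_dec(&wt->nr_throttled);
}

/* Called once per page whose writeback completes. */
static void example_writeback_page_done(struct example_wb_throttle *wt)
{
	if (!atomic_read(&wt->nr_throttled))
		return;

	/* Every SWAP_CLUSTER_MAX completions, release one exclusive waiter. */
	if (atomic_inc_return(&wt->nr_completed) % SWAP_CLUSTER_MAX == 0)
		wake_up(&wt->wqh);
}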