Message ID | 87tv2b7q72.fsf@notabene.neil.brown.name (mailing list archive) |
---|---|
State | New, archived |
Series | [PATCH/RFC] MM: fix writeback for NFS |
Please ignore my previous patch (which this is in reply to); it was flawed in various ways. I now understand the code a bit better and have a somewhat simpler patch which appears to address the same problem.

The problem is that writeback to NFS often produces lots of small writes (tens of kilobytes) rather than fewer large writes (1M). This pattern can often hurt throughput, but in certain circumstances it can hurt NFS throughput more than expected. Each nfs_writepages() call results in an NFS COMMIT being sent to the server. If writeback triggers lots of smaller nfs_writepages() calls, this means lots of COMMITs. If the server is slow to handle the COMMIT (I've seen the Ganesha NFS server take over 200ms per commit), these COMMITs can overlap, queue up, choke the NFS server, and cause an order-of-magnitude drop in throughput. So we really want to call nfs_writepages() only when there is a largish number of pages to be written - i.e. pages that are 'dirty'.

For historical reasons that I didn't thoroughly research, but which I'm confident are no longer relevant, pages that have been written to the NFS server but have not yet been the subject of a COMMIT - so-called "unstable" pages - are effectively accounted the same as "dirty" pages (sometimes called "reclaimable"). This can result in writeback thinking there are lots of "dirty" pages to reclaim, while nfs_writepages() can only find a few that it can write out.

The second patch following changes the accounting for these "unstable" pages. They are now always accounted exactly the same as writeback pages. Conceptually they can be thought of as still in writeback, but the writeback is now happening on the server. A COMMIT will always automatically follow the writes generated by nfs_writepages(), so from the perspective of the VM there really is no difference: it has scheduled the write and there is nothing else it can do except wait.

Testing this patch showed that loop-back NFS is prone to deadlocks again.
I cannot see exactly how the change to 'unstable' accounting affected this, but I can see that the old +25% heuristic can no longer be justified given the complexity of writeback calculations. So the first patch following changes how writeback is handled for NFS servers handling loop-back requests (and other similar services) so that it is more obviously safe against excessive dirty pages scheduled for other devices.

Thanks,
NeilBrown
On Thu, 02 Apr 2020 10:53:20 +1100 NeilBrown wrote:
>
> PF_LESS_THROTTLE exists for loop-back nfsd, and a similar need in the
> loop block driver, where a daemon needs to write to one bdi in
> order to free up writes queued to another bdi.
>
> The daemon sets PF_LESS_THROTTLE and gets a larger allowance of dirty
> pages, so that it can still dirty pages after other processes have been
> throttled.
>
> This approach was designed when all threads were blocked equally,
> independently of which device they were writing to, or how fast it was.
> Since that time the writeback algorithm has changed substantially, with
> different threads getting different allowances based on non-trivial
> heuristics. This means the simple "add 25%" heuristic is no longer
> reliable.
>
> This patch changes the heuristic to ignore the global limits and
> consider only the limit relevant to the bdi being written to. This
> approach is already available for BDI_CAP_STRICTLIMIT users (fuse) and
> should not introduce surprises. This has the desired result of
> protecting the task from the consequences of large amounts of dirty data
> queued for other devices.
>
> This approach of "only consider the target bdi" is consistent with the
> other use of PF_LESS_THROTTLE in current_may_throttle(), where it causes
> attention to be focussed only on the target bdi.
>
> So this patch
> - renames PF_LESS_THROTTLE to PF_LOCAL_THROTTLE,
> - removes the 25% bonus that that flag gives, and
> - imposes 'strictlimit' handling for any process with PF_LOCAL_THROTTLE
>   set.

/*
 * The strictlimit feature is a tool preventing mistrusted filesystems
 * from growing a large number of dirty pages before throttling. For

Based on the comment snippet, I suspect it is applicable to IO flushers
unless they are likely generating tons of dirty pages. If they are,
however, cutting their bonuses seems questionable.

> Note that previously realtime threads were treated the same as
> PF_LESS_THROTTLE threads.
> This patch does *not* change the behaviour for
> real-time threads, so it is now different from the behaviour of nfsd and
> loop tasks. I don't know what is wanted for realtime.
>
> Signed-off-by: NeilBrown <neilb@suse.de>
> =2D--

Hrm corrupted delivery?

> drivers/block/loop.c  |  2 +-
> fs/nfsd/vfs.c         |  9 +++++----
> include/linux/sched.h |  2 +-
> kernel/sys.c          |  2 +-
> mm/page-writeback.c   | 10 ++++++----
> mm/vmscan.c           |  4 ++--
> 6 files changed, 16 insertions(+), 13 deletions(-)
On Thu, Apr 02 2020, Hillf Danton wrote:

> On Thu, 02 Apr 2020 10:53:20 +1100 NeilBrown wrote:
>>
>> PF_LESS_THROTTLE exists for loop-back nfsd, and a similar need in the
>> loop block driver, where a daemon needs to write to one bdi in
>> order to free up writes queued to another bdi.
>>
>> The daemon sets PF_LESS_THROTTLE and gets a larger allowance of dirty
>> pages, so that it can still dirty pages after other processes have been
>> throttled.
>>
>> This approach was designed when all threads were blocked equally,
>> independently of which device they were writing to, or how fast it was.
>> Since that time the writeback algorithm has changed substantially, with
>> different threads getting different allowances based on non-trivial
>> heuristics. This means the simple "add 25%" heuristic is no longer
>> reliable.
>>
>> This patch changes the heuristic to ignore the global limits and
>> consider only the limit relevant to the bdi being written to. This
>> approach is already available for BDI_CAP_STRICTLIMIT users (fuse) and
>> should not introduce surprises. This has the desired result of
>> protecting the task from the consequences of large amounts of dirty data
>> queued for other devices.
>>
>> This approach of "only consider the target bdi" is consistent with the
>> other use of PF_LESS_THROTTLE in current_may_throttle(), where it causes
>> attention to be focussed only on the target bdi.
>>
>> So this patch
>> - renames PF_LESS_THROTTLE to PF_LOCAL_THROTTLE,
>> - removes the 25% bonus that that flag gives, and
>> - imposes 'strictlimit' handling for any process with PF_LOCAL_THROTTLE
>>    set.
>
> /*
>  * The strictlimit feature is a tool preventing mistrusted filesystems
>  * from growing a large number of dirty pages before throttling. For
>
> Based on the comment snippet, I suspect it is applicable to IO flushers
> unless they are likely generating tons of dirty pages. If they are,
> however, cutting their bonuses seems questionable.
The purpose of the strictlimit feature was to isolate one filesystem (bdi) from all others, so that the one cannot create dirty pages which unfairly disadvantage the others - this is what that comment says. But the implementation appears to focus on the isolation, not the specific purpose, and isolation works both ways. It protects the others from the one, and the one from the others.

fuse needs to be isolated so it doesn't harm others.
nfsd and loop need to be isolated so they aren't harmed by others.

I'm less familiar with IO flushers but I suspect they have exactly the same need as nfsd and loop - they need to be isolated from dirty pages other than on the device they are writing to. The 25% bonus was never about giving them a bonus because they need it. It was about protecting them from excess usage elsewhere. I strongly suspect that my change will provide a conceptually better service for IO flushers. (Whether it is better in a practical, measurable sense I cannot say, but I'd be surprised if it was worse.)

One possible problem with strictlimit isolation is suggested by the comment

 *
 * In strictlimit case make decision based on the wb counters
 * and limits. Small writeouts when the wb limits are ramping
 * up are the price we consciously pay for strictlimit-ing.
 *

This suggests that starting transients may be worse in some cases. I haven't noticed any problems in my (limited) testing, so while I suspect there is a genuine difference here, I don't expect it to be problematic.

>
>> Note that previously realtime threads were treated the same as
>> PF_LESS_THROTTLE threads. This patch does *not* change the behaviour for
>> real-time threads, so it is now different from the behaviour of nfsd and
>> loop tasks. I don't know what is wanted for realtime.
>>
>> Signed-off-by: NeilBrown <neilb@suse.de>
>> =2D--
>
> Hrm corrupted delivery?
No, that's just normal email quotation for a multi-part message (needed for crypto-signing, which I do by default). 'git am', for example, is quite capable of coping with it.

Thanks,
NeilBrown

>
>> drivers/block/loop.c  |  2 +-
>> fs/nfsd/vfs.c         |  9 +++++----
>> include/linux/sched.h |  2 +-
>> kernel/sys.c          |  2 +-
>> mm/page-writeback.c   | 10 ++++++----
>> mm/vmscan.c           |  4 ++--
>> 6 files changed, 16 insertions(+), 13 deletions(-)
On Thu, 02 Apr 2020 15:57:56 +1100 NeilBrown wrote:
>
> On Thu, Apr 02 2020, Hillf Danton wrote:
>
> > On Thu, 02 Apr 2020 10:53:20 +1100 NeilBrown wrote:
> >>
> >> PF_LESS_THROTTLE exists for loop-back nfsd, and a similar need in the
> >> loop block driver, where a daemon needs to write to one bdi in
> >> order to free up writes queued to another bdi.
> >>
> >> The daemon sets PF_LESS_THROTTLE and gets a larger allowance of dirty
> >> pages, so that it can still dirty pages after other processes have been
> >> throttled.
> >>
> >> This approach was designed when all threads were blocked equally,
> >> independently of which device they were writing to, or how fast it was.
> >> Since that time the writeback algorithm has changed substantially, with
> >> different threads getting different allowances based on non-trivial
> >> heuristics. This means the simple "add 25%" heuristic is no longer
> >> reliable.
> >>
> >> This patch changes the heuristic to ignore the global limits and
> >> consider only the limit relevant to the bdi being written to. This
> >> approach is already available for BDI_CAP_STRICTLIMIT users (fuse) and
> >> should not introduce surprises. This has the desired result of
> >> protecting the task from the consequences of large amounts of dirty data
> >> queued for other devices.
> >>
> >> This approach of "only consider the target bdi" is consistent with the
> >> other use of PF_LESS_THROTTLE in current_may_throttle(), where it causes
> >> attention to be focussed only on the target bdi.
> >>
> >> So this patch
> >> - renames PF_LESS_THROTTLE to PF_LOCAL_THROTTLE,
> >> - removes the 25% bonus that that flag gives, and
> >> - imposes 'strictlimit' handling for any process with PF_LOCAL_THROTTLE
> >>    set.
> >
> > /*
> >  * The strictlimit feature is a tool preventing mistrusted filesystems
> >  * from growing a large number of dirty pages before throttling. For
> >
> > Based on the comment snippet, I suspect it is applicable to IO flushers
> > unless they are likely generating tons of dirty pages. If they are,
> > however, cutting their bonuses seems questionable.
>
> The purpose of the strictlimit feature was to isolate one filesystem
> (bdi) from all others, so that the one cannot create dirty pages which
> unfairly disadvantage the others - this is what that comment says.
> But the implementation appears to focus on the isolation, not the
> specific purpose, and isolation works both ways. It protects the others
> from the one, and the one from the others.
>
> fuse needs to be isolated so it doesn't harm others.
> nfsd and loop need to be isolated so they aren't harmed by others.

For those working in emergency services, extra N95 face masks and
Covid-19 testing kits, say 25%, would be preserved, too, if isolation
doesn't help them.

> I'm less familiar with IO flushers but I suspect they have exactly the
> same need as nfsd and loop - they need to be isolated from dirty pages
> other than on the device they are writing to.
> The 25% bonus was never about giving them a bonus because they need it.
> It was about protecting them from excess usage elsewhere.

For example,

> I strongly
> suspect that my change will provide a conceptually better service for IO
> flushers. (Whether it is better in a practical, measurable sense I cannot
> say, but I'd be surprised if it was worse.)
This is a second version of my patches to improve writeback over NFS. This was prompted by a customer report of very slow throughput to a server running the Ganesha NFS server, which was taking about 200ms to handle COMMIT requests. While slow COMMIT requests do exacerbate the problem, there are still problems when COMMIT is comparatively fast. The latest test results from the customer are that these changes provide substantial improvements.

Changes over the first version include improvements to the explanation in the commit message, and preservation of the "NFS Unstable" counts in various statistics files, despite the fact that the internal accounting is completely gone (merged with writeback accounting).

Thanks,
NeilBrown
This is version 3 (I think) of my patches to improve NFS writeback.

Changes:
- The code for adding legacy values to /proc/vmstat was broken. I haven't taken the approach that Michal and Jan discussed, but a simpler (I hope) approach that just seq_puts() the lines at an appropriate place.
- I've modified the handling of PF_LOCAL_THROTTLE - sufficiently that I dropped Jan's reviewed-by. Rather than invoking the same behaviour as BDI_CAP_STRICTLIMIT, I now just use the part I needed. So if the global threshold is not exceeded, PF_LOCAL_THROTTLE tasks are not throttled. This is the case for normal processes, but not when BDI_CAP_STRICTLIMIT is in effect. If the global threshold *is* exceeded, only then do we check the local per-bdi threshold. If that is not exceeded, then PF_LOCAL_THROTTLE again avoids any throttling. Only if both thresholds are exceeded are these tasks throttled. I think this is more reflective of what we actually need.

Thanks,
NeilBrown
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 2caf780a42e7..99419cba1e7a 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -1593,10 +1593,11 @@ static void balance_dirty_pages(struct bdi_writeback *wb,
		 * written to the server's write cache, but has not yet
		 * been flushed to permanent storage.
		 */
-		nr_reclaimable = global_node_page_state(NR_FILE_DIRTY) +
-					global_node_page_state(NR_UNSTABLE_NFS);
+		nr_reclaimable = global_node_page_state(NR_FILE_DIRTY);
		gdtc->avail = global_dirtyable_memory();
-		gdtc->dirty = nr_reclaimable + global_node_page_state(NR_WRITEBACK);
+		gdtc->dirty = nr_reclaimable +
+			global_node_page_state(NR_WRITEBACK) +
+			global_node_page_state(NR_UNSTABLE_NFS);

		domain_dirty_limits(gdtc);

@@ -1664,7 +1665,8 @@ static void balance_dirty_pages(struct bdi_writeback *wb,
			break;
		}

-		if (unlikely(!writeback_in_progress(wb)))
+		if (nr_reclaimable > gdtc->thresh &&
+		    unlikely(!writeback_in_progress(wb)))
			wb_start_background_writeback(wb);

		mem_cgroup_flush_foreign(wb);
Page accounting happens a little differently for NFS, and writeback is aware of this, but handles it wrongly. With NFS, pages go through three states which are accounted separately:

 NR_FILE_DIRTY - page is dirty and unwritten
 NR_WRITEBACK - page is being written back
 NR_UNSTABLE_NFS - page has been written, but might need to be
     written again if the server restarts. Once an NFS COMMIT
     completes, the page becomes freeable.

writeback adds the NR_FILE_DIRTY and NR_UNSTABLE_NFS counters together, which doesn't make a lot of sense. NR_UNSTABLE_NFS is more like NR_WRITEBACK in that there is nothing that can be done for these pages except wait.

The current behaviour is that when there are a lot of NR_UNSTABLE_NFS pages, balance_dirty_pages() repeatedly schedules the writeback thread, even when the number of dirty pages is below the threshold. The result is that each ->writepages() call into NFS finds relatively few DIRTY pages, so it creates requests that are well below the maximum size. Depending on the performance characteristics of the server, this can substantially reduce throughput.

There are two places where balance_dirty_pages() schedules the writeback thread, and each of them schedules writeback too often.

1/ When balance_dirty_pages() determines that it must wait for the number of dirty pages to shrink, it unconditionally schedules writeback. This is probably not ideal for any filesystem, but I measured it to be harmful for NFS. This writeback should only be scheduled if the number of dirty pages is above the threshold. While it is below, waiting for the pending writes to complete (and be committed) is a more appropriate behaviour.

2/ When balance_dirty_pages() finishes, it again schedules writeback if nr_reclaimable is above the background threshold. This is conceptually correct, but wrong because nr_reclaimable is wrong. The implication of this usage is that nr_reclaimable is the number of pages that can be reclaimed if writeback is triggered.
In fact it is NR_FILE_DIRTY plus NR_UNSTABLE_NFS, which is not a meaningful number.

So this patch removes NR_UNSTABLE_NFS from nr_reclaimable, and only schedules writeback when the new nr_reclaimable is greater than ->thresh or ->bg_thresh, depending on context.

The effect of this can be measured by looking at the count of nfs_writepages calls while performing a large streaming write (larger than the dirty threshold), e.g.

 mount -t nfs server:/path /mnt
 dd if=/dev/zero of=/mnt/zeros bs=1M count=1000
 awk '/events:/ {print $13}' /proc/self/mountstats

Each writepages call should normally submit writes for 4M (1024 pages) or more (MIN_WRITEBACK_PAGES) (unless memory size is tiny). With the patch I measure typically 200-300 writepages calls with the above, which suggests an average of about 850 pages being made available to each writepages call. This is less than MIN_WRITEBACK_PAGES, but enough to assemble large 1M WRITE requests fairly often. Without the patch, the numbers range from about 2000 to over 10000, meaning an average of maybe 100 pages per call, which means we rarely get large 1M requests and often get small requests.

Note that there are other places where NR_FILE_DIRTY and NR_UNSTABLE_NFS are combined (wb_over_bg_thresh() and get_nr_dirty_pages()). I suspect they are wrong too. I also suspect that unstable-NFS pages should not be included in the WB_RECLAIMABLE wb_stat, but I haven't explored that in detail yet.

Note that the current behaviour of NFS w.r.t. sending COMMIT requests is that as soon as a batch of writes all complete, a COMMIT is scheduled. Prior to commit 919e3bd9a875 ("NFS: Ensure we commit after writeback is complete"), other mechanisms triggered the COMMIT. It is possible that the current writeback code made more sense before that commit.

Signed-off-by: NeilBrown <neilb@suse.de>
---
 mm/page-writeback.c | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)