Message ID | 87sghmyd8v.fsf@notabene.neil.brown.name (mailing list archive) |
---|---|
State | New, archived |
Series | [1/2] MM: replace PF_LESS_THROTTLE with PF_LOCAL_THROTTLE |
On Thu 02-04-20 10:53:20, NeilBrown wrote:
>
> PF_LESS_THROTTLE exists for loop-back nfsd, and a similar need in the
> loop block driver, where a daemon needs to write to one bdi in
> order to free up writes queued to another bdi.
>
> The daemon sets PF_LESS_THROTTLE and gets a larger allowance of dirty
> pages, so that it can still dirty pages after other processes have been
> throttled.
>
> This approach was designed when all threads were blocked equally,
> independently of which device they were writing to, or how fast it was.
> Since that time the writeback algorithm has changed substantially, with
> different threads getting different allowances based on non-trivial
> heuristics. This means the simple "add 25%" heuristic is no longer
> reliable.
>
> This patch changes the heuristic to ignore the global limits and
> consider only the limit relevant to the bdi being written to. This
> approach is already available for BDI_CAP_STRICTLIMIT users (fuse) and
> should not introduce surprises. This has the desired result of
> protecting the task from the consequences of large amounts of dirty data
> queued for other devices.
>
> This approach of "only consider the target bdi" is consistent with the
> other use of PF_LESS_THROTTLE in current_may_throttle(), where it causes
> attention to be focussed only on the target bdi.
>
> So this patch
> - renames PF_LESS_THROTTLE to PF_LOCAL_THROTTLE,
> - removes the 25% bonus that that flag gives, and
> - imposes 'strictlimit' handling for any process with PF_LOCAL_THROTTLE
>   set.
>
> Note that previously realtime threads were treated the same as
> PF_LESS_THROTTLE threads. This patch does *not* change the behaviour for
> real-time threads, so it is now different from the behaviour of nfsd and
> loop tasks. I don't know what is wanted for realtime.
>
> Signed-off-by: NeilBrown <neilb@suse.de>

This makes sense to me and the patch looks good. You can add:

Reviewed-by: Jan Kara <jack@suse.cz>

Thanks.
								Honza
> ---
>  drivers/block/loop.c  |  2 +-
>  fs/nfsd/vfs.c         |  9 +++++----
>  include/linux/sched.h |  2 +-
>  kernel/sys.c          |  2 +-
>  mm/page-writeback.c   | 10 ++++++----
>  mm/vmscan.c           |  4 ++--
>  6 files changed, 16 insertions(+), 13 deletions(-)
>
> diff --git a/drivers/block/loop.c b/drivers/block/loop.c
> index 739b372a5112..2c59371ce936 100644
> --- a/drivers/block/loop.c
> +++ b/drivers/block/loop.c
> @@ -897,7 +897,7 @@ static void loop_unprepare_queue(struct loop_device *lo)
>
>  static int loop_kthread_worker_fn(void *worker_ptr)
>  {
> -	current->flags |= PF_LESS_THROTTLE | PF_MEMALLOC_NOIO;
> +	current->flags |= PF_LOCAL_THROTTLE | PF_MEMALLOC_NOIO;
>  	return kthread_worker_fn(worker_ptr);
>  }
>
> diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
> index 0aa02eb18bd3..c3fbab1753ec 100644
> --- a/fs/nfsd/vfs.c
> +++ b/fs/nfsd/vfs.c
> @@ -979,12 +979,13 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp, struct nfsd_file *nf,
>
>  	if (test_bit(RQ_LOCAL, &rqstp->rq_flags))
>  		/*
> -		 * We want less throttling in balance_dirty_pages()
> -		 * and shrink_inactive_list() so that nfs to
> +		 * We want throttling in balance_dirty_pages()
> +		 * and shrink_inactive_list() to only consider
> +		 * the backingdev we are writing to, so that nfs to
>  		 * localhost doesn't cause nfsd to lock up due to all
>  		 * the client's dirty pages or its congested queue.
>  		 */
> -		current->flags |= PF_LESS_THROTTLE;
> +		current->flags |= PF_LOCAL_THROTTLE;
>
>  	exp = fhp->fh_export;
>  	use_wgather = (rqstp->rq_vers == 2) && EX_WGATHER(exp);
> @@ -1037,7 +1038,7 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp, struct nfsd_file *nf,
>  		nfserr = nfserrno(host_err);
>  	}
>  	if (test_bit(RQ_LOCAL, &rqstp->rq_flags))
> -		current_restore_flags(pflags, PF_LESS_THROTTLE);
> +		current_restore_flags(pflags, PF_LOCAL_THROTTLE);
>  	return nfserr;
>  }
>
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 04278493bf15..5dcd27abc8cd 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1473,7 +1473,7 @@ extern struct pid *cad_pid;
>  #define PF_KSWAPD		0x00020000	/* I am kswapd */
>  #define PF_MEMALLOC_NOFS	0x00040000	/* All allocation requests will inherit GFP_NOFS */
>  #define PF_MEMALLOC_NOIO	0x00080000	/* All allocation requests will inherit GFP_NOIO */
> -#define PF_LESS_THROTTLE	0x00100000	/* Throttle me less: I clean memory */
> +#define PF_LOCAL_THROTTLE	0x00100000	/* Throttle me less: I clean memory */
>  #define PF_KTHREAD		0x00200000	/* I am a kernel thread */
>  #define PF_RANDOMIZE		0x00400000	/* Randomize virtual address space */
>  #define PF_SWAPWRITE		0x00800000	/* Allowed to write to swap */
> diff --git a/kernel/sys.c b/kernel/sys.c
> index d325f3ab624a..180a2fa33f7f 100644
> --- a/kernel/sys.c
> +++ b/kernel/sys.c
> @@ -2262,7 +2262,7 @@ int __weak arch_prctl_spec_ctrl_set(struct task_struct *t, unsigned long which,
>  	return -EINVAL;
>  }
>
> -#define PR_IO_FLUSHER		(PF_MEMALLOC_NOIO | PF_LESS_THROTTLE)
> +#define PR_IO_FLUSHER		(PF_MEMALLOC_NOIO | PF_LOCAL_THROTTLE)
>
>  SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
>  		unsigned long, arg4, unsigned long, arg5)
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index 2caf780a42e7..2afb09fa2fe0 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -387,8 +387,7 @@ static unsigned long global_dirtyable_memory(void)
>   * Calculate @dtc->thresh and ->bg_thresh considering
>   * vm_dirty_{bytes|ratio} and dirty_background_{bytes|ratio}. The caller
>   * must ensure that @dtc->avail is set before calling this function. The
> - * dirty limits will be lifted by 1/4 for PF_LESS_THROTTLE (ie. nfsd) and
> - * real-time tasks.
> + * dirty limits will be lifted by 1/4 for real-time tasks.
>   */
>  static void domain_dirty_limits(struct dirty_throttle_control *dtc)
>  {
> @@ -436,7 +435,7 @@ static void domain_dirty_limits(struct dirty_throttle_control *dtc)
>  	if (bg_thresh >= thresh)
>  		bg_thresh = thresh / 2;
>  	tsk = current;
> -	if (tsk->flags & PF_LESS_THROTTLE || rt_task(tsk)) {
> +	if (rt_task(tsk)) {
>  		bg_thresh += bg_thresh / 4 + global_wb_domain.dirty_limit / 32;
>  		thresh += thresh / 4 + global_wb_domain.dirty_limit / 32;
>  	}
> @@ -486,7 +485,7 @@ static unsigned long node_dirty_limit(struct pglist_data *pgdat)
>  	else
>  		dirty = vm_dirty_ratio * node_memory / 100;
>
> -	if (tsk->flags & PF_LESS_THROTTLE || rt_task(tsk))
> +	if (rt_task(tsk))
>  		dirty += dirty / 4;
>
>  	return dirty;
> @@ -1580,6 +1579,9 @@ static void balance_dirty_pages(struct bdi_writeback *wb,
>  	bool strictlimit = bdi->capabilities & BDI_CAP_STRICTLIMIT;
>  	unsigned long start_time = jiffies;
>
> +	if (current->flags & PF_LOCAL_THROTTLE)
> +		/* This task must only be throttled by its own writeback */
> +		strictlimit = true;
>  	for (;;) {
>  		unsigned long now = jiffies;
>  		unsigned long dirty, thresh, bg_thresh;
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 876370565455..c5cf25938c56 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1880,13 +1880,13 @@ static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec,
>
>  /*
>   * If a kernel thread (such as nfsd for loop-back mounts) services
> - * a backing device by writing to the page cache it sets PF_LESS_THROTTLE.
> + * a backing device by writing to the page cache it sets PF_LOCAL_THROTTLE.
>   * In that case we should only throttle if the backing device it is
>   * writing to is congested. In other cases it is safe to throttle.
>   */
>  static int current_may_throttle(void)
>  {
> -	return !(current->flags & PF_LESS_THROTTLE) ||
> +	return !(current->flags & PF_LOCAL_THROTTLE) ||
>  		current->backing_dev_info == NULL ||
>  		bdi_write_congested(current->backing_dev_info);
>  }
> --
> 2.26.0
>
On Thu 02-04-20 10:53:20, Neil Brown wrote:
>
> PF_LESS_THROTTLE exists for loop-back nfsd, and a similar need in the
> loop block driver, where a daemon needs to write to one bdi in
> order to free up writes queued to another bdi.
>
> The daemon sets PF_LESS_THROTTLE and gets a larger allowance of dirty
> pages, so that it can still dirty pages after other processes have been
> throttled.
>
> This approach was designed when all threads were blocked equally,
> independently of which device they were writing to, or how fast it was.
> Since that time the writeback algorithm has changed substantially, with
> different threads getting different allowances based on non-trivial
> heuristics. This means the simple "add 25%" heuristic is no longer
> reliable.
>
> This patch changes the heuristic to ignore the global limits and
> consider only the limit relevant to the bdi being written to. This
> approach is already available for BDI_CAP_STRICTLIMIT users (fuse) and
> should not introduce surprises. This has the desired result of
> protecting the task from the consequences of large amounts of dirty data
> queued for other devices.

While I understand that you want to have per-bdi throttling for those
"special" files I am still missing how this is going to provide the
additional room that the additional 25% gave them previously. I might
misremember or things have changed (what you mention as non-trivial
heuristics) but PF_LESS_THROTTLE really needed that room to guarantee
forward progress. Care to expand some more on how this is handled now?
Maybe we do not need it anymore but calling that out explicitly would be
really helpful.
On Fri, Apr 03 2020, Michal Hocko wrote:

> On Thu 02-04-20 10:53:20, Neil Brown wrote:
>>
>> PF_LESS_THROTTLE exists for loop-back nfsd, and a similar need in the
>> loop block driver, where a daemon needs to write to one bdi in
>> order to free up writes queued to another bdi.
>>
>> The daemon sets PF_LESS_THROTTLE and gets a larger allowance of dirty
>> pages, so that it can still dirty pages after other processes have been
>> throttled.
>>
>> This approach was designed when all threads were blocked equally,
>> independently of which device they were writing to, or how fast it was.
>> Since that time the writeback algorithm has changed substantially, with
>> different threads getting different allowances based on non-trivial
>> heuristics. This means the simple "add 25%" heuristic is no longer
>> reliable.
>>
>> This patch changes the heuristic to ignore the global limits and
>> consider only the limit relevant to the bdi being written to. This
>> approach is already available for BDI_CAP_STRICTLIMIT users (fuse) and
>> should not introduce surprises. This has the desired result of
>> protecting the task from the consequences of large amounts of dirty data
>> queued for other devices.
>
> While I understand that you want to have per-bdi throttling for those
> "special" files I am still missing how this is going to provide the
> additional room that the additional 25% gave them previously. I might
> misremember or things have changed (what you mention as non-trivial
> heuristics) but PF_LESS_THROTTLE really needed that room to guarantee
> forward progress. Care to expand some more on how this is handled now?
> Maybe we do not need it anymore but calling that out explicitly would be
> really helpful.

The 25% was a means to an end, not an end in itself.

The problem is that the NFS server needs to be able to write to the
backing filesystem when the dirty memory limits have been reached by
being totally consumed by dirty pages on the NFS filesystem.

The 25% was just a way of giving an allowance of dirty pages to nfsd
that could not be consumed by processes writing to an NFS filesystem.
i.e. it doesn't need 25% MORE, it needs 25% PRIVATELY. Actually it only
really needs 1 page privately, but a few pages give better throughput
and 25% seemed like a good idea at the time.

per-bdi throttling focuses on the "PRIVATELY" (the important bit) and
de-emphasises the 25% (the irrelevant detail).

Thanks,
NeilBrown
On Sat 04-04-20 08:40:17, Neil Brown wrote:
> On Fri, Apr 03 2020, Michal Hocko wrote:
>
> > On Thu 02-04-20 10:53:20, Neil Brown wrote:
> >>
> >> PF_LESS_THROTTLE exists for loop-back nfsd, and a similar need in the
> >> loop block driver, where a daemon needs to write to one bdi in
> >> order to free up writes queued to another bdi.
> >>
> >> The daemon sets PF_LESS_THROTTLE and gets a larger allowance of dirty
> >> pages, so that it can still dirty pages after other processes have been
> >> throttled.
> >>
> >> This approach was designed when all threads were blocked equally,
> >> independently of which device they were writing to, or how fast it was.
> >> Since that time the writeback algorithm has changed substantially, with
> >> different threads getting different allowances based on non-trivial
> >> heuristics. This means the simple "add 25%" heuristic is no longer
> >> reliable.
> >>
> >> This patch changes the heuristic to ignore the global limits and
> >> consider only the limit relevant to the bdi being written to. This
> >> approach is already available for BDI_CAP_STRICTLIMIT users (fuse) and
> >> should not introduce surprises. This has the desired result of
> >> protecting the task from the consequences of large amounts of dirty data
> >> queued for other devices.
> >
> > While I understand that you want to have per-bdi throttling for those
> > "special" files I am still missing how this is going to provide the
> > additional room that the additional 25% gave them previously. I might
> > misremember or things have changed (what you mention as non-trivial
> > heuristics) but PF_LESS_THROTTLE really needed that room to guarantee
> > forward progress. Care to expand some more on how this is handled now?
> > Maybe we do not need it anymore but calling that out explicitly would be
> > really helpful.
>
> The 25% was a means to an end, not an end in itself.
>
> The problem is that the NFS server needs to be able to write to the
> backing filesystem when the dirty memory limits have been reached by
> being totally consumed by dirty pages on the NFS filesystem.
>
> The 25% was just a way of giving an allowance of dirty pages to nfsd
> that could not be consumed by processes writing to an NFS filesystem.
> i.e. it doesn't need 25% MORE, it needs 25% PRIVATELY. Actually it only
> really needs 1 page privately, but a few pages give better throughput
> and 25% seemed like a good idea at the time.

Yes this part is clear to me.

> per-bdi throttling focuses on the "PRIVATELY" (the important bit) and
> de-emphasises the 25% (the irrelevant detail).

It is still not clear to me how this patch is going to behave when the
global dirty throttling is essentially equal to the per-bdi - e.g. there
is only a single bdi and now the PF_LOCAL_THROTTLE process doesn't have
anything private.
On Mon 06-04-20 09:44:53, Michal Hocko wrote:
> On Sat 04-04-20 08:40:17, Neil Brown wrote:
> > On Fri, Apr 03 2020, Michal Hocko wrote:
> >
> > > On Thu 02-04-20 10:53:20, Neil Brown wrote:
> > >>
> > >> PF_LESS_THROTTLE exists for loop-back nfsd, and a similar need in the
> > >> loop block driver, where a daemon needs to write to one bdi in
> > >> order to free up writes queued to another bdi.
> > >>
> > >> The daemon sets PF_LESS_THROTTLE and gets a larger allowance of dirty
> > >> pages, so that it can still dirty pages after other processes have been
> > >> throttled.
> > >>
> > >> This approach was designed when all threads were blocked equally,
> > >> independently of which device they were writing to, or how fast it was.
> > >> Since that time the writeback algorithm has changed substantially, with
> > >> different threads getting different allowances based on non-trivial
> > >> heuristics. This means the simple "add 25%" heuristic is no longer
> > >> reliable.
> > >>
> > >> This patch changes the heuristic to ignore the global limits and
> > >> consider only the limit relevant to the bdi being written to. This
> > >> approach is already available for BDI_CAP_STRICTLIMIT users (fuse) and
> > >> should not introduce surprises. This has the desired result of
> > >> protecting the task from the consequences of large amounts of dirty data
> > >> queued for other devices.
> > >
> > > While I understand that you want to have per-bdi throttling for those
> > > "special" files I am still missing how this is going to provide the
> > > additional room that the additional 25% gave them previously. I might
> > > misremember or things have changed (what you mention as non-trivial
> > > heuristics) but PF_LESS_THROTTLE really needed that room to guarantee
> > > forward progress. Care to expand some more on how this is handled now?
> > > Maybe we do not need it anymore but calling that out explicitly would be
> > > really helpful.
> >
> > The 25% was a means to an end, not an end in itself.
> >
> > The problem is that the NFS server needs to be able to write to the
> > backing filesystem when the dirty memory limits have been reached by
> > being totally consumed by dirty pages on the NFS filesystem.
> >
> > The 25% was just a way of giving an allowance of dirty pages to nfsd
> > that could not be consumed by processes writing to an NFS filesystem.
> > i.e. it doesn't need 25% MORE, it needs 25% PRIVATELY. Actually it only
> > really needs 1 page privately, but a few pages give better throughput
> > and 25% seemed like a good idea at the time.
>
> Yes this part is clear to me.
>
> > per-bdi throttling focuses on the "PRIVATELY" (the important bit) and
> > de-emphasises the 25% (the irrelevant detail).
>
> It is still not clear to me how this patch is going to behave when the
> global dirty throttling is essentially equal to the per-bdi - e.g. there
> is only a single bdi and now the PF_LOCAL_THROTTLE process doesn't have
> anything private.

Let me think out loud to see whether I understand this properly. There are
two BDIs involved in NFS loop mount - the NFS virtual BDI (let's call it
simply NFS-bdi) and the bdi of the real filesystem that is backing NFS
(let's call this real-bdi). The case we are concerned about is when NFS-bdi
is full of dirty pages so that global dirty limit of the machine is
exceeded. Then flusher thread will take dirty pages from NFS-bdi and send
them over localhost to nfsd. Nfsd, which has PF_LOCAL_THROTTLE set, will take
these pages and write them to real-bdi. Now because PF_LOCAL_THROTTLE is
set for nfsd, the fact that we are over global limit does not take effect
and nfsd is still able to write to real-bdi until dirty limit on real-bdi
is reached. So things should work as Neil writes AFAIU.

								Honza
On Mon 06-04-20 11:36:01, Jan Kara wrote:
> On Mon 06-04-20 09:44:53, Michal Hocko wrote:
> > On Sat 04-04-20 08:40:17, Neil Brown wrote:
> > > On Fri, Apr 03 2020, Michal Hocko wrote:
> > >
> > > > On Thu 02-04-20 10:53:20, Neil Brown wrote:
> > > >>
> > > >> PF_LESS_THROTTLE exists for loop-back nfsd, and a similar need in the
> > > >> loop block driver, where a daemon needs to write to one bdi in
> > > >> order to free up writes queued to another bdi.
> > > >>
> > > >> The daemon sets PF_LESS_THROTTLE and gets a larger allowance of dirty
> > > >> pages, so that it can still dirty pages after other processes have been
> > > >> throttled.
> > > >>
> > > >> This approach was designed when all threads were blocked equally,
> > > >> independently of which device they were writing to, or how fast it was.
> > > >> Since that time the writeback algorithm has changed substantially, with
> > > >> different threads getting different allowances based on non-trivial
> > > >> heuristics. This means the simple "add 25%" heuristic is no longer
> > > >> reliable.
> > > >>
> > > >> This patch changes the heuristic to ignore the global limits and
> > > >> consider only the limit relevant to the bdi being written to. This
> > > >> approach is already available for BDI_CAP_STRICTLIMIT users (fuse) and
> > > >> should not introduce surprises. This has the desired result of
> > > >> protecting the task from the consequences of large amounts of dirty data
> > > >> queued for other devices.
> > > >
> > > > While I understand that you want to have per-bdi throttling for those
> > > > "special" files I am still missing how this is going to provide the
> > > > additional room that the additional 25% gave them previously. I might
> > > > misremember or things have changed (what you mention as non-trivial
> > > > heuristics) but PF_LESS_THROTTLE really needed that room to guarantee
> > > > forward progress. Care to expand some more on how this is handled now?
> > > > Maybe we do not need it anymore but calling that out explicitly would be
> > > > really helpful.
> > >
> > > The 25% was a means to an end, not an end in itself.
> > >
> > > The problem is that the NFS server needs to be able to write to the
> > > backing filesystem when the dirty memory limits have been reached by
> > > being totally consumed by dirty pages on the NFS filesystem.
> > >
> > > The 25% was just a way of giving an allowance of dirty pages to nfsd
> > > that could not be consumed by processes writing to an NFS filesystem.
> > > i.e. it doesn't need 25% MORE, it needs 25% PRIVATELY. Actually it only
> > > really needs 1 page privately, but a few pages give better throughput
> > > and 25% seemed like a good idea at the time.
> >
> > Yes this part is clear to me.
> >
> > > per-bdi throttling focuses on the "PRIVATELY" (the important bit) and
> > > de-emphasises the 25% (the irrelevant detail).
> >
> > It is still not clear to me how this patch is going to behave when the
> > global dirty throttling is essentially equal to the per-bdi - e.g. there
> > is only a single bdi and now the PF_LOCAL_THROTTLE process doesn't have
> > anything private.
>
> Let me think out loud to see whether I understand this properly. There are
> two BDIs involved in NFS loop mount - the NFS virtual BDI (let's call it
> simply NFS-bdi) and the bdi of the real filesystem that is backing NFS
> (let's call this real-bdi). The case we are concerned about is when NFS-bdi
> is full of dirty pages so that global dirty limit of the machine is
> exceeded. Then flusher thread will take dirty pages from NFS-bdi and send
> them over localhost to nfsd. Nfsd, which has PF_LOCAL_THROTTLE set, will take
> these pages and write them to real-bdi. Now because PF_LOCAL_THROTTLE is
> set for nfsd, the fact that we are over global limit does not take effect
> and nfsd is still able to write to real-bdi until dirty limit on real-bdi
> is reached. So things should work as Neil writes AFAIU.

Thanks for the clarification. I was not aware of the 2 bdi situation.
This makes more sense now. Maybe this is a trivial fact for everybody
who is more familiar with nfs internals but it would be so much easier
to follow if it was explicit in the changelog.
On Mon, Apr 06 2020, Jan Kara wrote:

> On Mon 06-04-20 09:44:53, Michal Hocko wrote:
>> On Sat 04-04-20 08:40:17, Neil Brown wrote:
>> > On Fri, Apr 03 2020, Michal Hocko wrote:
>> >
>> > > On Thu 02-04-20 10:53:20, Neil Brown wrote:
>> > >>
>> > >> PF_LESS_THROTTLE exists for loop-back nfsd, and a similar need in the
>> > >> loop block driver, where a daemon needs to write to one bdi in
>> > >> order to free up writes queued to another bdi.
>> > >>
>> > >> The daemon sets PF_LESS_THROTTLE and gets a larger allowance of dirty
>> > >> pages, so that it can still dirty pages after other processes have been
>> > >> throttled.
>> > >>
>> > >> This approach was designed when all threads were blocked equally,
>> > >> independently of which device they were writing to, or how fast it was.
>> > >> Since that time the writeback algorithm has changed substantially, with
>> > >> different threads getting different allowances based on non-trivial
>> > >> heuristics. This means the simple "add 25%" heuristic is no longer
>> > >> reliable.
>> > >>
>> > >> This patch changes the heuristic to ignore the global limits and
>> > >> consider only the limit relevant to the bdi being written to. This
>> > >> approach is already available for BDI_CAP_STRICTLIMIT users (fuse) and
>> > >> should not introduce surprises. This has the desired result of
>> > >> protecting the task from the consequences of large amounts of dirty data
>> > >> queued for other devices.
>> > >
>> > > While I understand that you want to have per-bdi throttling for those
>> > > "special" files I am still missing how this is going to provide the
>> > > additional room that the additional 25% gave them previously. I might
>> > > misremember or things have changed (what you mention as non-trivial
>> > > heuristics) but PF_LESS_THROTTLE really needed that room to guarantee
>> > > forward progress. Care to expand some more on how this is handled now?
>> > > Maybe we do not need it anymore but calling that out explicitly would be
>> > > really helpful.
>> >
>> > The 25% was a means to an end, not an end in itself.
>> >
>> > The problem is that the NFS server needs to be able to write to the
>> > backing filesystem when the dirty memory limits have been reached by
>> > being totally consumed by dirty pages on the NFS filesystem.
>> >
>> > The 25% was just a way of giving an allowance of dirty pages to nfsd
>> > that could not be consumed by processes writing to an NFS filesystem.
>> > i.e. it doesn't need 25% MORE, it needs 25% PRIVATELY. Actually it only
>> > really needs 1 page privately, but a few pages give better throughput
>> > and 25% seemed like a good idea at the time.
>>
>> Yes this part is clear to me.
>>
>> > per-bdi throttling focuses on the "PRIVATELY" (the important bit) and
>> > de-emphasises the 25% (the irrelevant detail).
>>
>> It is still not clear to me how this patch is going to behave when the
>> global dirty throttling is essentially equal to the per-bdi - e.g. there
>> is only a single bdi and now the PF_LOCAL_THROTTLE process doesn't have
>> anything private.
>
> Let me think out loud to see whether I understand this properly. There are
> two BDIs involved in NFS loop mount - the NFS virtual BDI (let's call it
> simply NFS-bdi) and the bdi of the real filesystem that is backing NFS
> (let's call this real-bdi). The case we are concerned about is when NFS-bdi
> is full of dirty pages so that global dirty limit of the machine is
> exceeded. Then flusher thread will take dirty pages from NFS-bdi and send
> them over localhost to nfsd. Nfsd, which has PF_LOCAL_THROTTLE set, will take
> these pages and write them to real-bdi. Now because PF_LOCAL_THROTTLE is
> set for nfsd, the fact that we are over global limit does not take effect
> and nfsd is still able to write to real-bdi until dirty limit on real-bdi
> is reached. So things should work as Neil writes AFAIU.

Exactly.

The 'loop' block device follows a similar pattern - there is the 'loop'
bdi that might consume all the allowed dirty pages, and the backing bdi
that we need to write to so those dirty pages can be cleaned.

The intention for PR_SET_IO_FLUSHER as described in 'man 2 prctl' is
much the same.  The thread that sets this is expected to be working on
behalf of a "block layer or filesystem" such as "FUSE daemons, SCSI
device emulation daemons" - each of these would be serving a bdi "above"
by writing to a bdi "below".

I'll add some more text to the changelog to make this clearer.

Thanks,
NeilBrown
diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 739b372a5112..2c59371ce936 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -897,7 +897,7 @@ static void loop_unprepare_queue(struct loop_device *lo)
 
 static int loop_kthread_worker_fn(void *worker_ptr)
 {
-	current->flags |= PF_LESS_THROTTLE | PF_MEMALLOC_NOIO;
+	current->flags |= PF_LOCAL_THROTTLE | PF_MEMALLOC_NOIO;
 	return kthread_worker_fn(worker_ptr);
 }
 
diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index 0aa02eb18bd3..c3fbab1753ec 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -979,12 +979,13 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp, struct nfsd_file *nf,
 
 	if (test_bit(RQ_LOCAL, &rqstp->rq_flags))
 		/*
-		 * We want less throttling in balance_dirty_pages()
-		 * and shrink_inactive_list() so that nfs to
+		 * We want throttling in balance_dirty_pages()
+		 * and shrink_inactive_list() to only consider
+		 * the backingdev we are writing to, so that nfs to
 		 * localhost doesn't cause nfsd to lock up due to all
 		 * the client's dirty pages or its congested queue.
 		 */
-		current->flags |= PF_LESS_THROTTLE;
+		current->flags |= PF_LOCAL_THROTTLE;
 
 	exp = fhp->fh_export;
 	use_wgather = (rqstp->rq_vers == 2) && EX_WGATHER(exp);
@@ -1037,7 +1038,7 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp, struct nfsd_file *nf,
 		nfserr = nfserrno(host_err);
 	}
 	if (test_bit(RQ_LOCAL, &rqstp->rq_flags))
-		current_restore_flags(pflags, PF_LESS_THROTTLE);
+		current_restore_flags(pflags, PF_LOCAL_THROTTLE);
 	return nfserr;
 }
 
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 04278493bf15..5dcd27abc8cd 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1473,7 +1473,7 @@ extern struct pid *cad_pid;
 #define PF_KSWAPD		0x00020000	/* I am kswapd */
 #define PF_MEMALLOC_NOFS	0x00040000	/* All allocation requests will inherit GFP_NOFS */
 #define PF_MEMALLOC_NOIO	0x00080000	/* All allocation requests will inherit GFP_NOIO */
-#define PF_LESS_THROTTLE	0x00100000	/* Throttle me less: I clean memory */
+#define PF_LOCAL_THROTTLE	0x00100000	/* Throttle me less: I clean memory */
 #define PF_KTHREAD		0x00200000	/* I am a kernel thread */
 #define PF_RANDOMIZE		0x00400000	/* Randomize virtual address space */
 #define PF_SWAPWRITE		0x00800000	/* Allowed to write to swap */
diff --git a/kernel/sys.c b/kernel/sys.c
index d325f3ab624a..180a2fa33f7f 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -2262,7 +2262,7 @@ int __weak arch_prctl_spec_ctrl_set(struct task_struct *t, unsigned long which,
 	return -EINVAL;
 }
 
-#define PR_IO_FLUSHER		(PF_MEMALLOC_NOIO | PF_LESS_THROTTLE)
+#define PR_IO_FLUSHER		(PF_MEMALLOC_NOIO | PF_LOCAL_THROTTLE)
 
 SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
 		unsigned long, arg4, unsigned long, arg5)
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 2caf780a42e7..2afb09fa2fe0 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -387,8 +387,7 @@ static unsigned long global_dirtyable_memory(void)
  * Calculate @dtc->thresh and ->bg_thresh considering
  * vm_dirty_{bytes|ratio} and dirty_background_{bytes|ratio}. The caller
  * must ensure that @dtc->avail is set before calling this function. The
- * dirty limits will be lifted by 1/4 for PF_LESS_THROTTLE (ie. nfsd) and
- * real-time tasks.
+ * dirty limits will be lifted by 1/4 for real-time tasks.
  */
 static void domain_dirty_limits(struct dirty_throttle_control *dtc)
 {
@@ -436,7 +435,7 @@ static void domain_dirty_limits(struct dirty_throttle_control *dtc)
 	if (bg_thresh >= thresh)
 		bg_thresh = thresh / 2;
 	tsk = current;
-	if (tsk->flags & PF_LESS_THROTTLE || rt_task(tsk)) {
+	if (rt_task(tsk)) {
 		bg_thresh += bg_thresh / 4 + global_wb_domain.dirty_limit / 32;
 		thresh += thresh / 4 + global_wb_domain.dirty_limit / 32;
 	}
@@ -486,7 +485,7 @@ static unsigned long node_dirty_limit(struct pglist_data *pgdat)
 	else
 		dirty = vm_dirty_ratio * node_memory / 100;
 
-	if (tsk->flags & PF_LESS_THROTTLE || rt_task(tsk))
+	if (rt_task(tsk))
 		dirty += dirty / 4;
 
 	return dirty;
@@ -1580,6 +1579,9 @@ static void balance_dirty_pages(struct bdi_writeback *wb,
 	bool strictlimit = bdi->capabilities & BDI_CAP_STRICTLIMIT;
 	unsigned long start_time = jiffies;
 
+	if (current->flags & PF_LOCAL_THROTTLE)
+		/* This task must only be throttled by its own writeback */
+		strictlimit = true;
 	for (;;) {
 		unsigned long now = jiffies;
 		unsigned long dirty, thresh, bg_thresh;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 876370565455..c5cf25938c56 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1880,13 +1880,13 @@ static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec,
 
 /*
  * If a kernel thread (such as nfsd for loop-back mounts) services
- * a backing device by writing to the page cache it sets PF_LESS_THROTTLE.
+ * a backing device by writing to the page cache it sets PF_LOCAL_THROTTLE.
  * In that case we should only throttle if the backing device it is
  * writing to is congested. In other cases it is safe to throttle.
  */
 static int current_may_throttle(void)
 {
-	return !(current->flags & PF_LESS_THROTTLE) ||
+	return !(current->flags & PF_LOCAL_THROTTLE) ||
 		current->backing_dev_info == NULL ||
 		bdi_write_congested(current->backing_dev_info);
 }
PF_LESS_THROTTLE exists for loop-back nfsd, and a similar need in the
loop block driver, where a daemon needs to write to one bdi in
order to free up writes queued to another bdi.

The daemon sets PF_LESS_THROTTLE and gets a larger allowance of dirty
pages, so that it can still dirty pages after other processes have been
throttled.

This approach was designed when all threads were blocked equally,
independently of which device they were writing to, or how fast it was.
Since that time the writeback algorithm has changed substantially, with
different threads getting different allowances based on non-trivial
heuristics. This means the simple "add 25%" heuristic is no longer
reliable.

This patch changes the heuristic to ignore the global limits and
consider only the limit relevant to the bdi being written to. This
approach is already available for BDI_CAP_STRICTLIMIT users (fuse) and
should not introduce surprises. This has the desired result of
protecting the task from the consequences of large amounts of dirty data
queued for other devices.

This approach of "only consider the target bdi" is consistent with the
other use of PF_LESS_THROTTLE in current_may_throttle(), where it causes
attention to be focussed only on the target bdi.

So this patch
- renames PF_LESS_THROTTLE to PF_LOCAL_THROTTLE,
- removes the 25% bonus that that flag gives, and
- imposes 'strictlimit' handling for any process with PF_LOCAL_THROTTLE
  set.

Note that previously realtime threads were treated the same as
PF_LESS_THROTTLE threads. This patch does *not* change the behaviour for
real-time threads, so it is now different from the behaviour of nfsd and
loop tasks. I don't know what is wanted for realtime.

Signed-off-by: NeilBrown <neilb@suse.de>
---
 drivers/block/loop.c  |  2 +-
 fs/nfsd/vfs.c         |  9 +++++----
 include/linux/sched.h |  2 +-
 kernel/sys.c          |  2 +-
 mm/page-writeback.c   | 10 ++++++----
 mm/vmscan.c           |  4 ++--
 6 files changed, 16 insertions(+), 13 deletions(-)