Message ID | 20231204014042.6754-3-neilb@suse.de (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | Move all file-close work for nfsd into nfsd threads | expand |
On Mon, Dec 04, 2023 at 12:36:42PM +1100, NeilBrown wrote: > The work of closing a file can have non-trivial cost. Doing it in a > separate work queue thread means that cost isn't imposed on the nfsd > threads and an imbalance can be created. > > I have evidence from a customer site when nfsd is being asked to modify > many millions of files which causes sufficient memory pressure that some > cache (in XFS I think) gets cleaned earlier than would be ideal. When > __dput (from the workqueue) calls __dentry_kill, xfs_fs_destroy_inode() > needs to synchronously read back previously cached info from storage. > This slows down the single thread that is making all the final __dput() > calls for all the nfsd threads with the net result that files are added > to the delayed_fput_list faster than they are removed, and the system > eventually runs out of memory. > > To avoid this work imbalance that exhausts memory, this patch moves all > work for closing files into the nfsd threads. This means that when the > work imposes a cost, that cost appears where it would be expected - in > the work of the nfsd thread. Thanks for pursuing this next step in the evolution of the NFSD file cache. Your problem statement should mention whether you have observed the issue with an NFSv3 or an NFSv4 workload or if you see this issue with both, since those two use cases are handled very differently within the file cache implementation. > There are two changes to achieve this. > > 1/ PF_RUNS_TASK_WORK is set and task_work_run() is called, so that the > final __dput() for any file closed by a given nfsd thread is handled > by that thread. This ensures that the number of files that are > queued for a final close is limited by the number of threads and > cannot grow without bound. > > 2/ Files opened for NFSv3 are never explicitly closed by the client and are > kept open by the server in the "filecache", which responds to memory > pressure, is garbage collected even when there is no pressure, and > sometimes closes files when there is particular need such as for > rename. There is a good reason for close-on-rename: IIRC we want to avoid triggering a silly-rename on NFS re-exports. Also, I think we do want to close cached garbage-collected files quickly, even without memory pressure. Files left open in this way can conflict with subsequent NFSv4 OPENs that might hand out a delegation as long as no other clients are using them. Files held open by the file cache will interfere with that. > These files currently have filp_close() called in a dedicated > work queue, so their __dput() can have no effect on nfsd threads. > > This patch discards the work queue and instead has each nfsd thread > call flip_close() on as many as 8 files from the filecache each time > it acts on a client request (or finds there are no pending client > requests). If there are more to be closed, more threads are woken. > This spreads the work of __dput() over multiple threads and imposes > any cost on those threads. > > The number 8 is somewhat arbitrary. It needs to be greater than 1 to > ensure that files are closed more quickly than they can be added to > the cache. It needs to be small enough to limit the per-request > delays that will be imposed on clients when all threads are busy > closing files. IMO we want to explicitly separate the mechanisms of handling garbage-collected files and non-garbage-collected files. In the non-garbage-collected (NFSv4) case, the kthread can wait for everything it has opened to be closed. task_work seems appropriate for that IIUC. The problem with handling a limited number of garbage-collected items is that once the RPC workload stops, any remaining open files will remain open because garbage collection has effectively stopped. We really need those files closed out within a couple of seconds. We used to have watermarks in the nfsd_file_put() path to kick garbage-collection if there were too many open files. Instead, waiting for the GC thread to make progress before recycling the kthread might be beneficial. And, as we discussed in a previous thread, replacing the per- namespace worker with a parallel mechanism would help GC proceed more quickly to reduce the flush/close backlog for NFSv3. > Signed-off-by: NeilBrown <neilb@suse.de> > --- > fs/nfsd/filecache.c | 62 ++++++++++++++++++--------------------------- > fs/nfsd/filecache.h | 1 + > fs/nfsd/nfssvc.c | 6 +++++ > 3 files changed, 32 insertions(+), 37 deletions(-) > > diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c > index ee9c923192e0..55268b7362d4 100644 > --- a/fs/nfsd/filecache.c > +++ b/fs/nfsd/filecache.c > @@ -39,6 +39,7 @@ > #include <linux/fsnotify.h> > #include <linux/seq_file.h> > #include <linux/rhashtable.h> > +#include <linux/task_work.h> > > #include "vfs.h" > #include "nfsd.h" > @@ -61,13 +62,10 @@ static DEFINE_PER_CPU(unsigned long, nfsd_file_total_age); > static DEFINE_PER_CPU(unsigned long, nfsd_file_evictions); > > struct nfsd_fcache_disposal { > - struct work_struct work; > spinlock_t lock; > struct list_head freeme; > }; > > -static struct workqueue_struct *nfsd_filecache_wq __read_mostly; > - > static struct kmem_cache *nfsd_file_slab; > static struct kmem_cache *nfsd_file_mark_slab; > static struct list_lru nfsd_file_lru; > @@ -421,10 +419,31 @@ nfsd_file_dispose_list_delayed(struct list_head *dispose) > spin_lock(&l->lock); > list_move_tail(&nf->nf_lru, &l->freeme); > spin_unlock(&l->lock); > - queue_work(nfsd_filecache_wq, &l->work); > + svc_wake_up(nn->nfsd_serv); > } > } > > +/** > + * nfsd_file_dispose_some This needs a short description and: * @nn: namespace to check Or something more enlightening than that. Also, the function name exposes mechanism; I think I'd prefer a name that is more abstract, such as nfsd_file_net_release() ? > + * > + */ > +void nfsd_file_dispose_some(struct nfsd_net *nn) > +{ > + struct nfsd_fcache_disposal *l = nn->fcache_disposal; > + LIST_HEAD(dispose); > + int i; > + > + if (list_empty(&l->freeme)) > + return; > + spin_lock(&l->lock); > + for (i = 0; i < 8 && !list_empty(&l->freeme); i++) > + list_move(l->freeme.next, &dispose); > + spin_unlock(&l->lock); > + if (!list_empty(&l->freeme)) > + svc_wake_up(nn->nfsd_serv); > + nfsd_file_dispose_list(&dispose); > +} > + > /** > * nfsd_file_lru_cb - Examine an entry on the LRU list > * @item: LRU entry to examine > @@ -635,28 +654,8 @@ nfsd_file_close_inode_sync(struct inode *inode) > list_del_init(&nf->nf_lru); > nfsd_file_free(nf); > } > - flush_delayed_fput(); > -} > - > -/** > - * nfsd_file_delayed_close - close unused nfsd_files > - * @work: dummy > - * > - * Scrape the freeme list for this nfsd_net, and then dispose of them > - * all. > - */ > -static void > -nfsd_file_delayed_close(struct work_struct *work) > -{ > - LIST_HEAD(head); > - struct nfsd_fcache_disposal *l = container_of(work, > - struct nfsd_fcache_disposal, work); > - > - spin_lock(&l->lock); > - list_splice_init(&l->freeme, &head); > - spin_unlock(&l->lock); > - > - nfsd_file_dispose_list(&head); > + /* Flush any delayed fput */ > + task_work_run(); > } > > static int > @@ -721,10 +720,6 @@ nfsd_file_cache_init(void) > return ret; > > ret = -ENOMEM; > - nfsd_filecache_wq = alloc_workqueue("nfsd_filecache", 0, 0); > - if (!nfsd_filecache_wq) > - goto out; > - > nfsd_file_slab = kmem_cache_create("nfsd_file", > sizeof(struct nfsd_file), 0, 0, NULL); > if (!nfsd_file_slab) { > @@ -739,7 +734,6 @@ nfsd_file_cache_init(void) > goto out_err; > } > > - > ret = list_lru_init(&nfsd_file_lru); > if (ret) { > pr_err("nfsd: failed to init nfsd_file_lru: %d\n", ret); > @@ -782,8 +776,6 @@ nfsd_file_cache_init(void) > nfsd_file_slab = NULL; > kmem_cache_destroy(nfsd_file_mark_slab); > nfsd_file_mark_slab = NULL; > - destroy_workqueue(nfsd_filecache_wq); > - nfsd_filecache_wq = NULL; > rhltable_destroy(&nfsd_file_rhltable); > goto out; > } > @@ -829,7 +821,6 @@ nfsd_alloc_fcache_disposal(void) > l = kmalloc(sizeof(*l), GFP_KERNEL); > if (!l) > return NULL; > - INIT_WORK(&l->work, nfsd_file_delayed_close); > spin_lock_init(&l->lock); > INIT_LIST_HEAD(&l->freeme); > return l; > @@ -838,7 +829,6 @@ nfsd_alloc_fcache_disposal(void) > static void > nfsd_free_fcache_disposal(struct nfsd_fcache_disposal *l) > { > - cancel_work_sync(&l->work); > nfsd_file_dispose_list(&l->freeme); > kfree(l); > } > @@ -907,8 +897,6 @@ nfsd_file_cache_shutdown(void) > fsnotify_wait_marks_destroyed(); > kmem_cache_destroy(nfsd_file_mark_slab); > nfsd_file_mark_slab = NULL; > - destroy_workqueue(nfsd_filecache_wq); > - nfsd_filecache_wq = NULL; > rhltable_destroy(&nfsd_file_rhltable); > > for_each_possible_cpu(i) { > diff --git a/fs/nfsd/filecache.h b/fs/nfsd/filecache.h > index e54165a3224f..bc8c3363bbdf 100644 > --- a/fs/nfsd/filecache.h > +++ b/fs/nfsd/filecache.h > @@ -56,6 +56,7 @@ void nfsd_file_cache_shutdown_net(struct net *net); > void nfsd_file_put(struct nfsd_file *nf); > struct nfsd_file *nfsd_file_get(struct nfsd_file *nf); > void nfsd_file_close_inode_sync(struct inode *inode); > +void nfsd_file_dispose_some(struct nfsd_net *nn); > bool nfsd_file_is_cached(struct inode *inode); > __be32 nfsd_file_acquire_gc(struct svc_rqst *rqstp, struct svc_fh *fhp, > unsigned int may_flags, struct nfsd_file **nfp); > diff --git a/fs/nfsd/nfssvc.c b/fs/nfsd/nfssvc.c > index c7af1095f6b5..02ea16636b54 100644 > --- a/fs/nfsd/nfssvc.c > +++ b/fs/nfsd/nfssvc.c > @@ -13,6 +13,7 @@ > #include <linux/fs_struct.h> > #include <linux/swap.h> > #include <linux/siphash.h> > +#include <linux/task_work.h> > > #include <linux/sunrpc/stats.h> > #include <linux/sunrpc/svcsock.h> > @@ -949,6 +950,7 @@ nfsd(void *vrqstp) > } > > current->fs->umask = 0; > + current->flags |= PF_RUNS_TASK_WORK; > > atomic_inc(&nfsdstats.th_cnt); > > @@ -963,6 +965,10 @@ nfsd(void *vrqstp) > > svc_recv(rqstp); > validate_process_creds(); > + > + nfsd_file_dispose_some(nn); > + if (task_work_pending(current)) > + task_work_run(); I'd prefer that these task_work details reside inside nfsd_file_dispose_some(), or whatever we want to call to call it ... > } > > atomic_dec(&nfsdstats.th_cnt); > -- > 2.43.0 >
On Tue, 05 Dec 2023, Chuck Lever wrote: > On Mon, Dec 04, 2023 at 12:36:42PM +1100, NeilBrown wrote: > > The work of closing a file can have non-trivial cost. Doing it in a > > separate work queue thread means that cost isn't imposed on the nfsd > > threads and an imbalance can be created. > > > > I have evidence from a customer site when nfsd is being asked to modify > > many millions of files which causes sufficient memory pressure that some > > cache (in XFS I think) gets cleaned earlier than would be ideal. When > > __dput (from the workqueue) calls __dentry_kill, xfs_fs_destroy_inode() > > needs to synchronously read back previously cached info from storage. > > This slows down the single thread that is making all the final __dput() > > calls for all the nfsd threads with the net result that files are added > > to the delayed_fput_list faster than they are removed, and the system > > eventually runs out of memory. > > > > To avoid this work imbalance that exhausts memory, this patch moves all > > work for closing files into the nfsd threads. This means that when the > > work imposes a cost, that cost appears where it would be expected - in > > the work of the nfsd thread. > > Thanks for pursuing this next step in the evolution of the NFSD > file cache. > > Your problem statement should mention whether you have observed the > issue with an NFSv3 or an NFSv4 workload or if you see this issue > with both, since those two use cases are handled very differently > within the file cache implementation. I have added: ============= The customer was using NFSv4. I can demonstrate the same problem using NFSv3 or NFSv4 (which close files in different ways) by adding msleep(25) to for FMODE_WRITE files in __fput(). This simulates slowness in the final close and when writing through nfsd it causes /proc/sys/fs/file-nr to grow without bound. ============== > > > > There are two changes to achieve this. > > > > 1/ PF_RUNS_TASK_WORK is set and task_work_run() is called, so that the > > final __dput() for any file closed by a given nfsd thread is handled > > by that thread. This ensures that the number of files that are > > queued for a final close is limited by the number of threads and > > cannot grow without bound. > > > > 2/ Files opened for NFSv3 are never explicitly closed by the client and are > > kept open by the server in the "filecache", which responds to memory > > pressure, is garbage collected even when there is no pressure, and > > sometimes closes files when there is particular need such as for > > rename. > > There is a good reason for close-on-rename: IIRC we want to avoid > triggering a silly-rename on NFS re-exports. > > Also, I think we do want to close cached garbage-collected files > quickly, even without memory pressure. Files left open in this way > can conflict with subsequent NFSv4 OPENs that might hand out a > delegation as long as no other clients are using them. Files held > open by the file cache will interfere with that. Yes - I agree all this behaviour is appropriate. I was just setting out the current behaviour of the filecache so that effect of the proposed changes would be easier to understand. > > > > These files currently have filp_close() called in a dedicated > > work queue, so their __dput() can have no effect on nfsd threads. > > > > This patch discards the work queue and instead has each nfsd thread > > call flip_close() on as many as 8 files from the filecache each time > > it acts on a client request (or finds there are no pending client > > requests). If there are more to be closed, more threads are woken. > > This spreads the work of __dput() over multiple threads and imposes > > any cost on those threads. > > > > The number 8 is somewhat arbitrary. It needs to be greater than 1 to > > ensure that files are closed more quickly than they can be added to > > the cache. It needs to be small enough to limit the per-request > > delays that will be imposed on clients when all threads are busy > > closing files. > > IMO we want to explicitly separate the mechanisms of handling > garbage-collected files and non-garbage-collected files. I think we already have explicit separation. garbage-collected files are handled to nfsd_file_display_list_delayed(), either when they fall off the lru or through nfsd_file_close_inode() - which is used by lease and fsnotify callbacks. non-garbage collected files are closed directly by nfsd_file_put(). > > In the non-garbage-collected (NFSv4) case, the kthread can wait > for everything it has opened to be closed. task_work seems > appropriate for that IIUC. Agreed. The task_work change is all that we need for NFSv4. > > The problem with handling a limited number of garbage-collected > items is that once the RPC workload stops, any remaining open > files will remain open because garbage collection has effectively > stopped. We really need those files closed out within a couple of > seconds. Why would garbage collection stop? nfsd_filecache_laundrette is still running on the system_wq. It will continue to garbage collect and queue files using nfsd_file_display_list_delayed(). That will wake up an nfsd thread if none is running. The thread will close a few, but will first wake another thread if there was more than it was willing to manage. So the closing of files should proceed promptly, and if any close operation takes a non-trivial amount of time, more threads will be woken and work will proceed in parallel. > > We used to have watermarks in the nfsd_file_put() path to kick > garbage-collection if there were too many open files. Instead, > waiting for the GC thread to make progress before recycling the > kthread might be beneficial. "too many" is only meaningful in the context of memory usage. Having the shrinker callback is exactly the right way to address this - nothing else is needed. The GC thread is expected to be CPU intensive. The main cause of delay is skipping over lots of files that cannot be closed yet - looking for files that can. This could delay the closing of files, but not nearly as much as the delays I saw caused by synchronous IO. We might be able to improve the situation a bit by queuing files as soon as list_lru_walk finds them, rather than gathering them all into a list and the queuing them one by one from that list. It isn't clear to me that there is an issue here that needs fixing. > > And, as we discussed in a previous thread, replacing the per- > namespace worker with a parallel mechanism would help GC proceed > more quickly to reduce the flush/close backlog for NFSv3. This patch discards the per-namespace worker. The GC step (searching the LRU list for "garbage") is still single-threaded. The filecache is shared by all net-namespaces and there is a single GC thread for the filecache. Files that are found *were* filp_close()ed by per-net-fs work-items. With this patch the filp_close() is called by the nfsd threads. The file __fput of those files *was* handled by a single system-wide work-item. With this patch they are called by the nfsd thread which called the filp_close(). > > > > Signed-off-by: NeilBrown <neilb@suse.de> > > --- > > fs/nfsd/filecache.c | 62 ++++++++++++++++++--------------------------- > > fs/nfsd/filecache.h | 1 + > > fs/nfsd/nfssvc.c | 6 +++++ > > 3 files changed, 32 insertions(+), 37 deletions(-) > > > > diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c > > index ee9c923192e0..55268b7362d4 100644 > > --- a/fs/nfsd/filecache.c > > +++ b/fs/nfsd/filecache.c > > @@ -39,6 +39,7 @@ > > #include <linux/fsnotify.h> > > #include <linux/seq_file.h> > > #include <linux/rhashtable.h> > > +#include <linux/task_work.h> > > > > #include "vfs.h" > > #include "nfsd.h" > > @@ -61,13 +62,10 @@ static DEFINE_PER_CPU(unsigned long, nfsd_file_total_age); > > static DEFINE_PER_CPU(unsigned long, nfsd_file_evictions); > > > > struct nfsd_fcache_disposal { > > - struct work_struct work; > > spinlock_t lock; > > struct list_head freeme; > > }; > > > > -static struct workqueue_struct *nfsd_filecache_wq __read_mostly; > > - > > static struct kmem_cache *nfsd_file_slab; > > static struct kmem_cache *nfsd_file_mark_slab; > > static struct list_lru nfsd_file_lru; > > @@ -421,10 +419,31 @@ nfsd_file_dispose_list_delayed(struct list_head *dispose) > > spin_lock(&l->lock); > > list_move_tail(&nf->nf_lru, &l->freeme); > > spin_unlock(&l->lock); > > - queue_work(nfsd_filecache_wq, &l->work); > > + svc_wake_up(nn->nfsd_serv); > > } > > } > > > > +/** > > + * nfsd_file_dispose_some > > This needs a short description and: > > * @nn: namespace to check > > Or something more enlightening than that. > > Also, the function name exposes mechanism; I think I'd prefer a name > that is more abstract, such as nfsd_file_net_release() ? Sometimes exposing mechanism is a good thing. It means the casual reader can get a sense of what the function does without having to look at the function. So I still prefer my name, but I changed to nfsd_file_net_dispose() so as suit your preference, but follow the established pattern of using the word "dispose". "release" usually just drops a reference. "dispose" makes it clear that the thing is going away now. /** * nfsd_file_net_dispose - deal with nfsd_files wait to be disposed. * @nn: nfsd_net in which to find files to be disposed. * * When files held open for nfsv3 are removed from the filecache, whether * due to memory pressure or garbage collection, they are queued to * a per-net-ns queue. This function completes the disposal, either * directly or by waking another nfsd thread to help with the work. */ > > > + * > > + */ > > +void nfsd_file_dispose_some(struct nfsd_net *nn) > > +{ > > + struct nfsd_fcache_disposal *l = nn->fcache_disposal; > > + LIST_HEAD(dispose); > > + int i; > > + > > + if (list_empty(&l->freeme)) > > + return; > > + spin_lock(&l->lock); > > + for (i = 0; i < 8 && !list_empty(&l->freeme); i++) > > + list_move(l->freeme.next, &dispose); > > + spin_unlock(&l->lock); > > + if (!list_empty(&l->freeme)) > > + svc_wake_up(nn->nfsd_serv); > > + nfsd_file_dispose_list(&dispose); .. > > @@ -949,6 +950,7 @@ nfsd(void *vrqstp) > > } > > > > current->fs->umask = 0; > > + current->flags |= PF_RUNS_TASK_WORK; > > > > atomic_inc(&nfsdstats.th_cnt); > > > > @@ -963,6 +965,10 @@ nfsd(void *vrqstp) > > > > svc_recv(rqstp); > > validate_process_creds(); > > + > > + nfsd_file_dispose_some(nn); > > + if (task_work_pending(current)) > > + task_work_run(); > > I'd prefer that these task_work details reside inside > nfsd_file_dispose_some(), or whatever we want to call to call it ... I don't agree. They are performing quite separate tasks. nfsd_file_net_dispose() is disposing files queued for this net. task_run_work() is finalising the close of any file closed by this thread, including those used for NFSv4 that are not touched by nfsd_file_dispose_some(). I don't think they belong in the same function. Thanks, NeilBrown
On Tue, Dec 05, 2023 at 09:21:08AM +1100, NeilBrown wrote: > On Tue, 05 Dec 2023, Chuck Lever wrote: > > On Mon, Dec 04, 2023 at 12:36:42PM +1100, NeilBrown wrote: > > > The work of closing a file can have non-trivial cost. Doing it in a > > > separate work queue thread means that cost isn't imposed on the nfsd > > > threads and an imbalance can be created. > > > > > > I have evidence from a customer site when nfsd is being asked to modify > > > many millions of files which causes sufficient memory pressure that some > > > cache (in XFS I think) gets cleaned earlier than would be ideal. When > > > __dput (from the workqueue) calls __dentry_kill, xfs_fs_destroy_inode() > > > needs to synchronously read back previously cached info from storage. > > > This slows down the single thread that is making all the final __dput() > > > calls for all the nfsd threads with the net result that files are added > > > to the delayed_fput_list faster than they are removed, and the system > > > eventually runs out of memory. > > > > > > To avoid this work imbalance that exhausts memory, this patch moves all > > > work for closing files into the nfsd threads. This means that when the > > > work imposes a cost, that cost appears where it would be expected - in > > > the work of the nfsd thread. > > > > Thanks for pursuing this next step in the evolution of the NFSD > > file cache. > > > > Your problem statement should mention whether you have observed the > > issue with an NFSv3 or an NFSv4 workload or if you see this issue > > with both, since those two use cases are handled very differently > > within the file cache implementation. > > I have added: > > ============= > The customer was using NFSv4. I can demonstrate the same problem using > NFSv3 or NFSv4 (which close files in different ways) by adding > msleep(25) to for FMODE_WRITE files in __fput(). This simulates > slowness in the final close and when writing through nfsd it causes > /proc/sys/fs/file-nr to grow without bound. > ============== > > > > > > > > There are two changes to achieve this. > > > > > > 1/ PF_RUNS_TASK_WORK is set and task_work_run() is called, so that the > > > final __dput() for any file closed by a given nfsd thread is handled > > > by that thread. This ensures that the number of files that are > > > queued for a final close is limited by the number of threads and > > > cannot grow without bound. > > > > > > 2/ Files opened for NFSv3 are never explicitly closed by the client and are > > > kept open by the server in the "filecache", which responds to memory > > > pressure, is garbage collected even when there is no pressure, and > > > sometimes closes files when there is particular need such as for > > > rename. > > > > There is a good reason for close-on-rename: IIRC we want to avoid > > triggering a silly-rename on NFS re-exports. > > > > Also, I think we do want to close cached garbage-collected files > > quickly, even without memory pressure. Files left open in this way > > can conflict with subsequent NFSv4 OPENs that might hand out a > > delegation as long as no other clients are using them. Files held > > open by the file cache will interfere with that. > > Yes - I agree all this behaviour is appropriate. I was just setting out > the current behaviour of the filecache so that effect of the proposed > changes would be easier to understand. Ok, I misread "when there is particular need" as "when there is no particular need." My bad. > > > These files currently have filp_close() called in a dedicated > > > work queue, so their __dput() can have no effect on nfsd threads. > > > > > > This patch discards the work queue and instead has each nfsd thread > > > call flip_close() on as many as 8 files from the filecache each time > > > it acts on a client request (or finds there are no pending client > > > requests). If there are more to be closed, more threads are woken. > > > This spreads the work of __dput() over multiple threads and imposes > > > any cost on those threads. > > > > > > The number 8 is somewhat arbitrary. It needs to be greater than 1 to > > > ensure that files are closed more quickly than they can be added to > > > the cache. It needs to be small enough to limit the per-request > > > delays that will be imposed on clients when all threads are busy > > > closing files. > > > > IMO we want to explicitly separate the mechanisms of handling > > garbage-collected files and non-garbage-collected files. > > I think we already have explicit separation. > garbage-collected files are handled to nfsd_file_display_list_delayed(), > either when they fall off the lru or through nfsd_file_close_inode() - > which is used by lease and fsnotify callbacks. > > non-garbage collected files are closed directly by nfsd_file_put(). The separation is more clear to me now. Building this all into a single patch kind of blurred the edges between the two. > > In the non-garbage-collected (NFSv4) case, the kthread can wait > > for everything it has opened to be closed. task_work seems > > appropriate for that IIUC. > > Agreed. The task_work change is all that we need for NFSv4. > > > The problem with handling a limited number of garbage-collected > > items is that once the RPC workload stops, any remaining open > > files will remain open because garbage collection has effectively > > stopped. We really need those files closed out within a couple of > > seconds. > > Why would garbage collection stop? Because with your patch GC now appears to be driven through nfsd_file_dispose_some(). I see now that there is a hidden recursion that wakes more nfsd threads if there's more GC to be done. So file disposal is indeed not dependent on more ingress RPC traffic. The "If there are more to be closed" remark above in the patch description was ambiguous to me, but I think I get it now. > nfsd_filecache_laundrette is still running on the system_wq. It will > continue to garbage collect and queue files using > nfsd_file_display_list_delayed(). > That will wake up an nfsd thread if none is running. The thread will > close a few, but will first wake another thread if there was more than > it was willing to manage. So the closing of files should proceed > promptly, and if any close operation takes a non-trivial amount of time, > more threads will be woken and work will proceed in parallel. OK, that is what the svc_wake_up()s are doing. > > And, as we discussed in a previous thread, replacing the per- > > namespace worker with a parallel mechanism would help GC proceed > > more quickly to reduce the flush/close backlog for NFSv3. > > This patch discards the per-namespace worker. > > The GC step (searching the LRU list for "garbage") is still > single-threaded. The filecache is shared by all net-namespaces and > there is a single GC thread for the filecache. Agreed. > Files that are found *were* filp_close()ed by per-net-fs work-items. > With this patch the filp_close() is called by the nfsd threads. > > The file __fput of those files *was* handled by a single system-wide > work-item. With this patch they are called by the nfsd thread which > called the filp_close(). Fwiw, none of that is obvious to me when looking at the diff. > > > Signed-off-by: NeilBrown <neilb@suse.de> > > > --- > > > fs/nfsd/filecache.c | 62 ++++++++++++++++++--------------------------- > > > fs/nfsd/filecache.h | 1 + > > > fs/nfsd/nfssvc.c | 6 +++++ > > > 3 files changed, 32 insertions(+), 37 deletions(-) > > > > > > diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c > > > index ee9c923192e0..55268b7362d4 100644 > > > --- a/fs/nfsd/filecache.c > > > +++ b/fs/nfsd/filecache.c > > > @@ -39,6 +39,7 @@ > > > #include <linux/fsnotify.h> > > > #include <linux/seq_file.h> > > > #include <linux/rhashtable.h> > > > +#include <linux/task_work.h> > > > > > > #include "vfs.h" > > > #include "nfsd.h" > > > @@ -61,13 +62,10 @@ static DEFINE_PER_CPU(unsigned long, nfsd_file_total_age); > > > static DEFINE_PER_CPU(unsigned long, nfsd_file_evictions); > > > > > > struct nfsd_fcache_disposal { > > > - struct work_struct work; > > > spinlock_t lock; > > > struct list_head freeme; > > > }; > > > > > > -static struct workqueue_struct *nfsd_filecache_wq __read_mostly; > > > - > > > static struct kmem_cache *nfsd_file_slab; > > > static struct kmem_cache *nfsd_file_mark_slab; > > > static struct list_lru nfsd_file_lru; > > > @@ -421,10 +419,31 @@ nfsd_file_dispose_list_delayed(struct list_head *dispose) > > > spin_lock(&l->lock); > > > list_move_tail(&nf->nf_lru, &l->freeme); > > > spin_unlock(&l->lock); > > > - queue_work(nfsd_filecache_wq, &l->work); > > > + svc_wake_up(nn->nfsd_serv); > > > } > > > } > > > > > > +/** > > > + * nfsd_file_dispose_some > > > > This needs a short description and: > > > > * @nn: namespace to check > > > > Or something more enlightening than that. > > > > Also, the function name exposes mechanism; I think I'd prefer a name > > that is more abstract, such as nfsd_file_net_release() ? > > Sometimes exposing mechanism is a good thing. It means the casual reader > can get a sense of what the function does without having to look at the > function. > So I still prefer my name, but I changed to nfsd_file_net_dispose() so > as suit your preference, but follow the established pattern of using the > word "dispose". "release" usually just drops a reference. "dispose" > makes it clear that the thing is going away now. > > /** > * nfsd_file_net_dispose - deal with nfsd_files wait to be disposed. > * @nn: nfsd_net in which to find files to be disposed. > * > * When files held open for nfsv3 are removed from the filecache, whether This comment is helpful. But note that we quite purposely do not refer to NFS versions in filecache.c -- it's either garbage- collected or not garbage-collected files. IIRC on occasion NFSv3 wants to use a non-garbage-collected file, and NFSv4 might sometimes use a GC-d file. I've forgotten the details. > * due to memory pressure or garbage collection, they are queued to > * a per-net-ns queue. This function completes the disposal, either > * directly or by waking another nfsd thread to help with the work. > */ I understand why you want to keep this name: this function handles only garbage-collected files. I would still like nfsd() to call a wrapper function to handle the details of closing both types of files rather than open-coding calling nfsd_file_net_dispose() and task_run_work(), especially because there is no code comment explaining why the task_run_work() call is needed. That level of filecache implementation detail doesn't belong in nfsd().
Hi NeilBrown, kernel test robot noticed the following build warnings: [auto build test WARNING on tip/sched/core] [also build test WARNING on linus/master v6.7-rc4] [cannot apply to brauner-vfs/vfs.all next-20231205] [If your patch is applied to the wrong git tree, kindly drop us a note. And when submitting patch, we suggest to use '--base' as documented in https://git-scm.com/docs/git-format-patch#_base_tree_information] url: https://github.com/intel-lab-lkp/linux/commits/NeilBrown/Allow-a-kthread-to-declare-that-it-calls-task_work_run/20231204-094248 base: tip/sched/core patch link: https://lore.kernel.org/r/20231204014042.6754-3-neilb%40suse.de patch subject: [PATCH 2/2] nfsd: Don't leave work of closing files to a work queue. config: sh-allmodconfig (https://download.01.org/0day-ci/archive/20231205/202312051443.02Ddb5pd-lkp@intel.com/config) compiler: sh4-linux-gcc (GCC) 13.2.0 reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20231205/202312051443.02Ddb5pd-lkp@intel.com/reproduce) If you fix the issue in a separate patch/commit (i.e. not just a new version of the same patch/commit), kindly add following tags | Reported-by: kernel test robot <lkp@intel.com> | Closes: https://lore.kernel.org/oe-kbuild-all/202312051443.02Ddb5pd-lkp@intel.com/ All warnings (new ones prefixed by >>): >> fs/nfsd/filecache.c:431: warning: Function parameter or member 'nn' not described in 'nfsd_file_dispose_some' vim +431 fs/nfsd/filecache.c 425 426 /** 427 * nfsd_file_dispose_some 428 * 429 */ 430 void nfsd_file_dispose_some(struct nfsd_net *nn) > 431 { 432 struct nfsd_fcache_disposal *l = nn->fcache_disposal; 433 LIST_HEAD(dispose); 434 int i; 435 436 if (list_empty(&l->freeme)) 437 return; 438 spin_lock(&l->lock); 439 for (i = 0; i < 8 && !list_empty(&l->freeme); i++) 440 list_move(l->freeme.next, &dispose); 441 spin_unlock(&l->lock); 442 if (!list_empty(&l->freeme)) 443 svc_wake_up(nn->nfsd_serv); 444 nfsd_file_dispose_list(&dispose); 445 } 446
diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c index ee9c923192e0..55268b7362d4 100644 --- a/fs/nfsd/filecache.c +++ b/fs/nfsd/filecache.c @@ -39,6 +39,7 @@ #include <linux/fsnotify.h> #include <linux/seq_file.h> #include <linux/rhashtable.h> +#include <linux/task_work.h> #include "vfs.h" #include "nfsd.h" @@ -61,13 +62,10 @@ static DEFINE_PER_CPU(unsigned long, nfsd_file_total_age); static DEFINE_PER_CPU(unsigned long, nfsd_file_evictions); struct nfsd_fcache_disposal { - struct work_struct work; spinlock_t lock; struct list_head freeme; }; -static struct workqueue_struct *nfsd_filecache_wq __read_mostly; - static struct kmem_cache *nfsd_file_slab; static struct kmem_cache *nfsd_file_mark_slab; static struct list_lru nfsd_file_lru; @@ -421,10 +419,31 @@ nfsd_file_dispose_list_delayed(struct list_head *dispose) spin_lock(&l->lock); list_move_tail(&nf->nf_lru, &l->freeme); spin_unlock(&l->lock); - queue_work(nfsd_filecache_wq, &l->work); + svc_wake_up(nn->nfsd_serv); } } +/** + * nfsd_file_dispose_some + * + */ +void nfsd_file_dispose_some(struct nfsd_net *nn) +{ + struct nfsd_fcache_disposal *l = nn->fcache_disposal; + LIST_HEAD(dispose); + int i; + + if (list_empty(&l->freeme)) + return; + spin_lock(&l->lock); + for (i = 0; i < 8 && !list_empty(&l->freeme); i++) + list_move(l->freeme.next, &dispose); + spin_unlock(&l->lock); + if (!list_empty(&l->freeme)) + svc_wake_up(nn->nfsd_serv); + nfsd_file_dispose_list(&dispose); +} + /** * nfsd_file_lru_cb - Examine an entry on the LRU list * @item: LRU entry to examine @@ -635,28 +654,8 @@ nfsd_file_close_inode_sync(struct inode *inode) list_del_init(&nf->nf_lru); nfsd_file_free(nf); } - flush_delayed_fput(); -} - -/** - * nfsd_file_delayed_close - close unused nfsd_files - * @work: dummy - * - * Scrape the freeme list for this nfsd_net, and then dispose of them - * all. - */ -static void -nfsd_file_delayed_close(struct work_struct *work) -{ - LIST_HEAD(head); - struct nfsd_fcache_disposal *l = container_of(work, - struct nfsd_fcache_disposal, work); - - spin_lock(&l->lock); - list_splice_init(&l->freeme, &head); - spin_unlock(&l->lock); - - nfsd_file_dispose_list(&head); + /* Flush any delayed fput */ + task_work_run(); } static int @@ -721,10 +720,6 @@ nfsd_file_cache_init(void) return ret; ret = -ENOMEM; - nfsd_filecache_wq = alloc_workqueue("nfsd_filecache", 0, 0); - if (!nfsd_filecache_wq) - goto out; - nfsd_file_slab = kmem_cache_create("nfsd_file", sizeof(struct nfsd_file), 0, 0, NULL); if (!nfsd_file_slab) { @@ -739,7 +734,6 @@ nfsd_file_cache_init(void) goto out_err; } - ret = list_lru_init(&nfsd_file_lru); if (ret) { pr_err("nfsd: failed to init nfsd_file_lru: %d\n", ret); @@ -782,8 +776,6 @@ nfsd_file_cache_init(void) nfsd_file_slab = NULL; kmem_cache_destroy(nfsd_file_mark_slab); nfsd_file_mark_slab = NULL; - destroy_workqueue(nfsd_filecache_wq); - nfsd_filecache_wq = NULL; rhltable_destroy(&nfsd_file_rhltable); goto out; } @@ -829,7 +821,6 @@ nfsd_alloc_fcache_disposal(void) l = kmalloc(sizeof(*l), GFP_KERNEL); if (!l) return NULL; - INIT_WORK(&l->work, nfsd_file_delayed_close); spin_lock_init(&l->lock); INIT_LIST_HEAD(&l->freeme); return l; @@ -838,7 +829,6 @@ nfsd_alloc_fcache_disposal(void) static void nfsd_free_fcache_disposal(struct nfsd_fcache_disposal *l) { - cancel_work_sync(&l->work); nfsd_file_dispose_list(&l->freeme); kfree(l); } @@ -907,8 +897,6 @@ nfsd_file_cache_shutdown(void) fsnotify_wait_marks_destroyed(); kmem_cache_destroy(nfsd_file_mark_slab); nfsd_file_mark_slab = NULL; - destroy_workqueue(nfsd_filecache_wq); - nfsd_filecache_wq = NULL; rhltable_destroy(&nfsd_file_rhltable); for_each_possible_cpu(i) { diff --git a/fs/nfsd/filecache.h b/fs/nfsd/filecache.h index e54165a3224f..bc8c3363bbdf 100644 --- a/fs/nfsd/filecache.h +++ b/fs/nfsd/filecache.h @@ -56,6 +56,7 @@ void nfsd_file_cache_shutdown_net(struct net *net); void nfsd_file_put(struct nfsd_file *nf); struct nfsd_file *nfsd_file_get(struct nfsd_file *nf); void nfsd_file_close_inode_sync(struct inode *inode); +void nfsd_file_dispose_some(struct nfsd_net *nn); bool nfsd_file_is_cached(struct inode *inode); __be32 nfsd_file_acquire_gc(struct svc_rqst *rqstp, struct svc_fh *fhp, unsigned int may_flags, struct nfsd_file **nfp); diff --git a/fs/nfsd/nfssvc.c b/fs/nfsd/nfssvc.c index c7af1095f6b5..02ea16636b54 100644 --- a/fs/nfsd/nfssvc.c +++ b/fs/nfsd/nfssvc.c @@ -13,6 +13,7 @@ #include <linux/fs_struct.h> #include <linux/swap.h> #include <linux/siphash.h> +#include <linux/task_work.h> #include <linux/sunrpc/stats.h> #include <linux/sunrpc/svcsock.h> @@ -949,6 +950,7 @@ nfsd(void *vrqstp) } current->fs->umask = 0; + current->flags |= PF_RUNS_TASK_WORK; atomic_inc(&nfsdstats.th_cnt); @@ -963,6 +965,10 @@ nfsd(void *vrqstp) svc_recv(rqstp); validate_process_creds(); + + nfsd_file_dispose_some(nn); + if (task_work_pending(current)) + task_work_run(); } atomic_dec(&nfsdstats.th_cnt);
The work of closing a file can have non-trivial cost. Doing it in a separate work queue thread means that cost isn't imposed on the nfsd threads and an imbalance can be created. I have evidence from a customer site when nfsd is being asked to modify many millions of files which causes sufficient memory pressure that some cache (in XFS I think) gets cleaned earlier than would be ideal. When __dput (from the workqueue) calls __dentry_kill, xfs_fs_destroy_inode() needs to synchronously read back previously cached info from storage. This slows down the single thread that is making all the final __dput() calls for all the nfsd threads with the net result that files are added to the delayed_fput_list faster than they are removed, and the system eventually runs out of memory. To avoid this work imbalance that exhausts memory, this patch moves all work for closing files into the nfsd threads. This means that when the work imposes a cost, that cost appears where it would be expected - in the work of the nfsd thread. There are two changes to achieve this. 1/ PF_RUNS_TASK_WORK is set and task_work_run() is called, so that the final __dput() for any file closed by a given nfsd thread is handled by that thread. This ensures that the number of files that are queued for a final close is limited by the number of threads and cannot grow without bound. 2/ Files opened for NFSv3 are never explicitly closed by the client and are kept open by the server in the "filecache", which responds to memory pressure, is garbage collected even when there is no pressure, and sometimes closes files when there is particular need such as for rename. These files currently have filp_close() called in a dedicated work queue, so their __dput() can have no effect on nfsd threads. This patch discards the work queue and instead has each nfsd thread call flip_close() on as many as 8 files from the filecache each time it acts on a client request (or finds there are no pending client requests). If there are more to be closed, more threads are woken. This spreads the work of __dput() over multiple threads and imposes any cost on those threads. The number 8 is somewhat arbitrary. It needs to be greater than 1 to ensure that files are closed more quickly than they can be added to the cache. It needs to be small enough to limit the per-request delays that will be imposed on clients when all threads are busy closing files. Signed-off-by: NeilBrown <neilb@suse.de> --- fs/nfsd/filecache.c | 62 ++++++++++++++++++--------------------------- fs/nfsd/filecache.h | 1 + fs/nfsd/nfssvc.c | 6 +++++ 3 files changed, 32 insertions(+), 37 deletions(-)