Message ID | 20231204014042.6754-2-neilb@suse.de (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | Move all file-close work for nfsd into nfsd threads | expand |
On 12/3/23 6:36 PM, NeilBrown wrote: > diff --git a/fs/namespace.c b/fs/namespace.c > index e157efc54023..46d640b70ca9 100644 > --- a/fs/namespace.c > +++ b/fs/namespace.c > @@ -1328,7 +1328,7 @@ static void mntput_no_expire(struct mount *mnt) > > if (likely(!(mnt->mnt.mnt_flags & MNT_INTERNAL))) { > struct task_struct *task = current; > - if (likely(!(task->flags & PF_KTHREAD))) { > + if (likely((task->flags & PF_RUNS_TASK_WORK))) { Extraneous parens here. > diff --git a/kernel/task_work.c b/kernel/task_work.c > index 95a7e1b7f1da..aec19876e121 100644 > --- a/kernel/task_work.c > +++ b/kernel/task_work.c > @@ -183,3 +183,4 @@ void task_work_run(void) > } while (work); > } > } > +EXPORT_SYMBOL(task_work_run); If we're exporting this, then I think that function needs a big disclaimer on exactly when it is safe to call it. And it most certainly needs to be a _GPL export.
On Mon, Dec 04, 2023 at 12:36:41PM +1100, NeilBrown wrote: > +++ b/fs/namespace.c > @@ -1328,7 +1328,7 @@ static void mntput_no_expire(struct mount *mnt) > > if (likely(!(mnt->mnt.mnt_flags & MNT_INTERNAL))) { > struct task_struct *task = current; > - if (likely(!(task->flags & PF_KTHREAD))) { > + if (likely((task->flags & PF_RUNS_TASK_WORK))) { You could lose one set of parens here ... if (likely(task->flags & PF_RUNS_TASK_WORK)) { > #define PF_RANDOMIZE 0x00400000 /* Randomize virtual address space */ > -#define PF__HOLE__00800000 0x00800000 > +#define PF_RUNS_TASK_WORK 0x00800000 /* Will call task_work_run() periodically */ And you could lose "Will" here: #define PF_RUNS_TASK_WORK 0x00800000 /* Calls task_work_run() periodically */ > diff --git a/kernel/task_work.c b/kernel/task_work.c > index 95a7e1b7f1da..aec19876e121 100644 > --- a/kernel/task_work.c > +++ b/kernel/task_work.c > @@ -183,3 +183,4 @@ void task_work_run(void) > } while (work); > } > } > +EXPORT_SYMBOL(task_work_run); _GPL?
On Mon, Dec 04, 2023 at 12:36:41PM +1100, NeilBrown wrote: > This means that any cost for doing the work is not imposed on the kernel > thread, and importantly excessive amounts of work cannot apply > back-pressure to reduce the amount of new work queued. It also means that a stuck ->release() won't end up with stuck kernel thread... > earlier than would be ideal. When __dput (from the workqueue) calls WTF is that __dput thing? __fput, perhaps? > This patch adds a new process flag PF_RUNS_TASK_WORK which is now used > instead of PF_KTHREAD to determine whether it is sensible to queue > something to task_works. This flag is always set for non-kernel threads. *ugh* What's that flag for? task_work_add() always can fail; any caller must have a fallback to cope with that possibility; fput() certainly does. Just have the kernel threads born with ->task_works set to &work_exited and provide a primitive that would flip it from that to NULL. > @@ -1328,7 +1328,7 @@ static void mntput_no_expire(struct mount *mnt) > > if (likely(!(mnt->mnt.mnt_flags & MNT_INTERNAL))) { > struct task_struct *task = current; > - if (likely(!(task->flags & PF_KTHREAD))) { > + if (likely((task->flags & PF_RUNS_TASK_WORK))) { > init_task_work(&mnt->mnt_rcu, __cleanup_mnt); > if (!task_work_add(task, &mnt->mnt_rcu, TWA_RESUME)) > return; Now, *that* is something I have much stronger objections to. Stuck filesystem shutdown is far more likely than stuck ->release(). You are seriously asking for trouble here. Why would you want to have nfsd block on that?
I am sick and can't read emails, just one note On 12/04, Al Viro wrote: > > Just have the kernel threads born with ->task_works set to &work_exited Then irq_thread()->task_work_add() will silently fail, > and provide a primitive that would flip it from that to NULL. OK, so this change should update irq_thread(). But what else can fail? And what if another kthread uses task_work_add(current) to add the desctructor and does fput() without task_work_run() ? Oleg.
On Mon, 04 Dec 2023, Jens Axboe wrote: > On 12/3/23 6:36 PM, NeilBrown wrote: > > diff --git a/fs/namespace.c b/fs/namespace.c > > index e157efc54023..46d640b70ca9 100644 > > --- a/fs/namespace.c > > +++ b/fs/namespace.c > > @@ -1328,7 +1328,7 @@ static void mntput_no_expire(struct mount *mnt) > > > > if (likely(!(mnt->mnt.mnt_flags & MNT_INTERNAL))) { > > struct task_struct *task = current; > > - if (likely(!(task->flags & PF_KTHREAD))) { > > + if (likely((task->flags & PF_RUNS_TASK_WORK))) { > > Extraneous parens here. Thanks - and thanks to Matthew Wilcox too. Fixed. > > > diff --git a/kernel/task_work.c b/kernel/task_work.c > > index 95a7e1b7f1da..aec19876e121 100644 > > --- a/kernel/task_work.c > > +++ b/kernel/task_work.c > > @@ -183,3 +183,4 @@ void task_work_run(void) > > } while (work); > > } > > } > > +EXPORT_SYMBOL(task_work_run); > > If we're exporting this, then I think that function needs a big > disclaimer on exactly when it is safe to call it. And it most certainly > needs to be a _GPL export. I've added * Can be used by a kernel thread but only when no locks are held and the * thread won't be waited for by other code that might hold locks. It * can be useful in the top-level loop of a file-serving thread to ensure * files get closed promptly. to the documentation comment. It isn't clear to me what _GPL is appropriate, but maybe the rules changed since last I looked..... are there rules? My reasoning was that the call is effectively part of the user-space ABI. A user-space process can call this trivially by invoking any system call. The user-space ABI is explicitly a boundary which the GPL does not cross. So it doesn't seem appropriate to prevent non-GPL kernel code from doing something that non-GPL user-space code can trivially do. But if there are other strong opinions, or clearly documented rules that contradict my opinion, I have not problem with adding _GPL. Thanks, NeilBrown > > -- > Jens Axboe > >
On Mon, 04 Dec 2023, Matthew Wilcox wrote: > On Mon, Dec 04, 2023 at 12:36:41PM +1100, NeilBrown wrote: > > +++ b/fs/namespace.c > > @@ -1328,7 +1328,7 @@ static void mntput_no_expire(struct mount *mnt) > > > > if (likely(!(mnt->mnt.mnt_flags & MNT_INTERNAL))) { > > struct task_struct *task = current; > > - if (likely(!(task->flags & PF_KTHREAD))) { > > + if (likely((task->flags & PF_RUNS_TASK_WORK))) { > > You could lose one set of parens here ... > > if (likely(task->flags & PF_RUNS_TASK_WORK)) { Done. > > > #define PF_RANDOMIZE 0x00400000 /* Randomize virtual address space */ > > -#define PF__HOLE__00800000 0x00800000 > > +#define PF_RUNS_TASK_WORK 0x00800000 /* Will call task_work_run() periodically */ > > And you could lose "Will" here: > > #define PF_RUNS_TASK_WORK 0x00800000 /* Calls task_work_run() periodically */ Better - thanks. > > > diff --git a/kernel/task_work.c b/kernel/task_work.c > > index 95a7e1b7f1da..aec19876e121 100644 > > --- a/kernel/task_work.c > > +++ b/kernel/task_work.c > > @@ -183,3 +183,4 @@ void task_work_run(void) > > } while (work); > > } > > } > > +EXPORT_SYMBOL(task_work_run); > > _GPL? Justification? Thanks, NeilBrown
On Mon, 04 Dec 2023, Al Viro wrote: > On Mon, Dec 04, 2023 at 12:36:41PM +1100, NeilBrown wrote: > > > This means that any cost for doing the work is not imposed on the kernel > > thread, and importantly excessive amounts of work cannot apply > > back-pressure to reduce the amount of new work queued. > > It also means that a stuck ->release() won't end up with stuck > kernel thread... Is a stuck kernel thread any worse than a stuck user-space thread? > > > earlier than would be ideal. When __dput (from the workqueue) calls > > WTF is that __dput thing? __fput, perhaps? Either __fput or dput :-) ->release isn't the problem that I am seeing. The call trace that I see causing problems is __fput -> dput -> dentry_kill -> destroy_inode -> xfs_fs_destroy_inode so both __fput and dput are there, but most of the code is dput related. So both "put"s were swimming in by brain and the wrong combination came out. I changed it to __fput - thanks. > > > This patch adds a new process flag PF_RUNS_TASK_WORK which is now used > > instead of PF_KTHREAD to determine whether it is sensible to queue > > something to task_works. This flag is always set for non-kernel threads. > > *ugh* > > What's that flag for? task_work_add() always can fail; any caller must > have a fallback to cope with that possibility; fput() certainly does. As Oleg pointed out, all threads including kernel threads call task_work_run() at exit, and some kernel threads depend on this. So disabling task_work_add() early for all kernel threads would break things. Currently task_work_add() fails only once the process has started exiting. Only code that might run during the exit handling need check. > > Just have the kernel threads born with ->task_works set to &work_exited > and provide a primitive that would flip it from that to NULL. > > > @@ -1328,7 +1328,7 @@ static void mntput_no_expire(struct mount *mnt) > > > > if (likely(!(mnt->mnt.mnt_flags & MNT_INTERNAL))) { > > struct task_struct *task = current; > > - if (likely(!(task->flags & PF_KTHREAD))) { > > + if (likely((task->flags & PF_RUNS_TASK_WORK))) { > > init_task_work(&mnt->mnt_rcu, __cleanup_mnt); > > if (!task_work_add(task, &mnt->mnt_rcu, TWA_RESUME)) > > return; > > Now, *that* is something I have much stronger objections to. > Stuck filesystem shutdown is far more likely than stuck > ->release(). You are seriously asking for trouble here. > > Why would you want to have nfsd block on that? > I don't *want* nfsd block on that, but I don't care if it does. nfsd will only call task_work_run() at a safe time. This is no different to user-space processes only calling task_work_run() at a safe time. The new flag isn't "I_AM_NFSD" or "QUEUE_FPUT_WORK_TO_TASK". It is "RUNS_TASK_WORK". So any code that would prefer to call task_work_add() but has a fall-back for tasks that don't call run_task_work() should test the new flag. Doing otherwise would be inconsistent and potentially confusing. I don't think that nfsd getting stuck would be any worse than systemd getting stuck, or automount getting stuck, or udiskd getting stuck. Thanks, NeilBrown
On 12/4/23 2:02 PM, NeilBrown wrote: > It isn't clear to me what _GPL is appropriate, but maybe the rules > changed since last I looked..... are there rules? > > My reasoning was that the call is effectively part of the user-space > ABI. A user-space process can call this trivially by invoking any > system call. The user-space ABI is explicitly a boundary which the GPL > does not cross. So it doesn't seem appropriate to prevent non-GPL > kernel code from doing something that non-GPL user-space code can > trivially do. By that reasoning, basically everything in the kernel should be non-GPL marked. And while task_work can get used by the application, it happens only indirectly or implicitly. So I don't think this reasoning is sound at all, it's not an exported ABI or API by itself. For me, the more core of an export it is, the stronger the reason it should be GPL. FWIW, I don't think exporting task_work functionality is a good idea in the first place, but if there's a strong reason to do so, it should most certainly not be accessible to non-GPL modules. Basically NO new export should be non-GPL.
On Tue, 05 Dec 2023, Jens Axboe wrote: > On 12/4/23 2:02 PM, NeilBrown wrote: > > It isn't clear to me what _GPL is appropriate, but maybe the rules > > changed since last I looked..... are there rules? > > > > My reasoning was that the call is effectively part of the user-space > > ABI. A user-space process can call this trivially by invoking any > > system call. The user-space ABI is explicitly a boundary which the GPL > > does not cross. So it doesn't seem appropriate to prevent non-GPL > > kernel code from doing something that non-GPL user-space code can > > trivially do. > > By that reasoning, basically everything in the kernel should be non-GPL > marked. And while task_work can get used by the application, it happens > only indirectly or implicitly. So I don't think this reasoning is sound > at all, it's not an exported ABI or API by itself. > > For me, the more core of an export it is, the stronger the reason it > should be GPL. FWIW, I don't think exporting task_work functionality is > a good idea in the first place, but if there's a strong reason to do so, > it should most certainly not be accessible to non-GPL modules. Basically > NO new export should be non-GPL. An alternate to exporting task_work_run() might be to call it from try_to_freeze(). I think that too should only be called from a context where no locks are held etc. Obviously try_to_freeze would only call task_work_run() if PF_RUNS_TASK_WORK were set. I'm not sure this is a *good* idea, but it is an idea that would avoid the export. For now I change the export to _GPL. Thanks, NeilBrown
On Tue, Dec 05, 2023 at 08:20:31AM +1100, NeilBrown wrote: > On Mon, 04 Dec 2023, Al Viro wrote: > > On Mon, Dec 04, 2023 at 12:36:41PM +1100, NeilBrown wrote: > > > > > This means that any cost for doing the work is not imposed on the kernel > > > thread, and importantly excessive amounts of work cannot apply > > > back-pressure to reduce the amount of new work queued. > > > > It also means that a stuck ->release() won't end up with stuck > > kernel thread... > > Is a stuck kernel thread any worse than a stuck user-space thread? > > > > > > earlier than would be ideal. When __dput (from the workqueue) calls > > > > WTF is that __dput thing? __fput, perhaps? > > Either __fput or dput :-) > ->release isn't the problem that I am seeing. > The call trace that I see causing problems is > __fput -> dput -> dentry_kill -> destroy_inode -> xfs_fs_destroy_inode What problem, exactly, are you having with xfs_fs_destroy_inode()? -Dave.
On Mon, Dec 04, 2023 at 12:36:41PM +1100, NeilBrown wrote: > User-space processes always call task_work_run() as needed when > returning from a system call. Kernel-threads generally do not. > Because of this some work that is best run in the task_works context > (guaranteed that no locks are held) cannot be queued to task_works from > kernel threads and so are queued to a (single) work_time to be managed > on a work queue. > > This means that any cost for doing the work is not imposed on the kernel > thread, and importantly excessive amounts of work cannot apply > back-pressure to reduce the amount of new work queued. > > I have evidence from a customer site when nfsd (which runs as kernel > threads) is being asked to modify many millions of files which causes > sufficient memory pressure that some cache (in XFS I think) gets cleaned > earlier than would be ideal. When __dput (from the workqueue) calls > __dentry_kill, xfs_fs_destroy_inode() needs to synchronously read back > previously cached info from storage. We fixed that specific XFS problem in 5.9. https://lore.kernel.org/linux-xfs/20200622081605.1818434-1-david@fromorbit.com/ Can you reproduce these issues on a current TOT kernel? If not, there's no bugs to fix in the upstream kernel. If you can, then we've got more XFS issues to work through and fix. Fundamentally, though, we should not be papering over an XFS issue by changing how core task_work infrastructure is used. So let's deal with the XFS issue first.... -Dave.
On Tue, 05 Dec 2023, Dave Chinner wrote: > On Mon, Dec 04, 2023 at 12:36:41PM +1100, NeilBrown wrote: > > User-space processes always call task_work_run() as needed when > > returning from a system call. Kernel-threads generally do not. > > Because of this some work that is best run in the task_works context > > (guaranteed that no locks are held) cannot be queued to task_works from > > kernel threads and so are queued to a (single) work_time to be managed > > on a work queue. > > > > This means that any cost for doing the work is not imposed on the kernel > > thread, and importantly excessive amounts of work cannot apply > > back-pressure to reduce the amount of new work queued. > > > > I have evidence from a customer site when nfsd (which runs as kernel > > threads) is being asked to modify many millions of files which causes > > sufficient memory pressure that some cache (in XFS I think) gets cleaned > > earlier than would be ideal. When __dput (from the workqueue) calls > > __dentry_kill, xfs_fs_destroy_inode() needs to synchronously read back > > previously cached info from storage. > > We fixed that specific XFS problem in 5.9. > > https://lore.kernel.org/linux-xfs/20200622081605.1818434-1-david@fromorbit.com/ Good to know - thanks. > > Can you reproduce these issues on a current TOT kernel? I haven't tried. I don't know if I know enough details of the work load to attempt it. > > If not, there's no bugs to fix in the upstream kernel. If you can, > then we've got more XFS issues to work through and fix. > > Fundamentally, though, we should not be papering over an XFS issue > by changing how core task_work infrastructure is used. So let's deal > with the XFS issue first.... I disagree. This customer experience has demonstrated both a bug in XFS and bug in the interaction between fput, task_work, and nfsd. If a bug in a filesystem that only causes a modest performance impact when used through the syscall API can bring the system to its knees through memory exhaustion when used by nfsd, then that is a robustness issue for nfsd. I want to fix that robustness issue so that unusual behaviour in filesystems does not cause out-of-proportion bad behaviour in nfsd. I highlighted this in the cover letter to the first version of my patch: https://lore.kernel.org/all/170112272125.7109.6245462722883333440@noble.neil.brown.name/ While this might point to a problem with the filesystem not handling the final close efficiently, such problems should only hurt throughput, not lead to memory exhaustion. Thanks, NeilBrown > > -Dave. > -- > Dave Chinner > david@fromorbit.com >
On Mon, Dec 04, 2023 at 03:09:44PM -0700, Jens Axboe wrote: > On 12/4/23 2:02 PM, NeilBrown wrote: > > It isn't clear to me what _GPL is appropriate, but maybe the rules > > changed since last I looked..... are there rules? > > > > My reasoning was that the call is effectively part of the user-space > > ABI. A user-space process can call this trivially by invoking any > > system call. The user-space ABI is explicitly a boundary which the GPL > > does not cross. So it doesn't seem appropriate to prevent non-GPL > > kernel code from doing something that non-GPL user-space code can > > trivially do. > > By that reasoning, basically everything in the kernel should be non-GPL > marked. And while task_work can get used by the application, it happens > only indirectly or implicitly. So I don't think this reasoning is sound > at all, it's not an exported ABI or API by itself. > > For me, the more core of an export it is, the stronger the reason it > should be GPL. FWIW, I don't think exporting task_work functionality is > a good idea in the first place, but if there's a strong reason to do so, Yeah, I'm not too fond of that part as well. I don't think we want to give modules the ability to mess with task work. This is just asking for trouble.
On Mon, Dec 04, 2023 at 12:36:41PM +1100, NeilBrown wrote: > User-space processes always call task_work_run() as needed when > returning from a system call. Kernel-threads generally do not. > Because of this some work that is best run in the task_works context > (guaranteed that no locks are held) cannot be queued to task_works from > kernel threads and so are queued to a (single) work_time to be managed > on a work queue. > > This means that any cost for doing the work is not imposed on the kernel > thread, and importantly excessive amounts of work cannot apply > back-pressure to reduce the amount of new work queued. > > I have evidence from a customer site when nfsd (which runs as kernel > threads) is being asked to modify many millions of files which causes > sufficient memory pressure that some cache (in XFS I think) gets cleaned > earlier than would be ideal. When __dput (from the workqueue) calls > __dentry_kill, xfs_fs_destroy_inode() needs to synchronously read back > previously cached info from storage. This slows down the single thread > that is making all the final __dput() calls for all the nfsd threads > with the net result that files are added to the delayed_fput_list faster > than they are removed, and the system eventually runs out of memory. > > This happens because there is no back-pressure: the nfsd isn't forced to > slow down when __dput() is slow for any reason. To fix this we can > change the nfsd threads to call task_work_run() regularly (much like > user-space processes do) and allow it to declare this so that work does > get queued to task_works rather than to a work queue. > > This patch adds a new process flag PF_RUNS_TASK_WORK which is now used > instead of PF_KTHREAD to determine whether it is sensible to queue > something to task_works. This flag is always set for non-kernel threads. > > task_work_run() is also exported so that it can be called from a module > such as nfsd. > > Signed-off-by: NeilBrown <neilb@suse.de> > --- The thing that bugs me the most about this is that we expose task work infrastructure to modules which I think is a really bad idea. File handling code brings so many driver to their knees and now we're handing them another footgun. I'm not per se opposed to all of this but is this really what the other NFS maintainers want to switch to as well? And is this really that badly needed and that common that we want to go down that road? I wouldn't mind not having to do all this if we can get by via other means. > fs/file_table.c | 3 ++- > fs/namespace.c | 2 +- > include/linux/sched.h | 2 +- > kernel/fork.c | 2 ++ > kernel/task_work.c | 1 + > 5 files changed, 7 insertions(+), 3 deletions(-) > > diff --git a/fs/file_table.c b/fs/file_table.c > index ee21b3da9d08..d36cade6e366 100644 > --- a/fs/file_table.c > +++ b/fs/file_table.c > @@ -435,7 +435,8 @@ void fput(struct file *file) > if (atomic_long_dec_and_test(&file->f_count)) { > struct task_struct *task = current; > > - if (likely(!in_interrupt() && !(task->flags & PF_KTHREAD))) { > + if (likely(!in_interrupt() && > + (task->flags & PF_RUNS_TASK_WORK))) { > init_task_work(&file->f_rcuhead, ____fput); > if (!task_work_add(task, &file->f_rcuhead, TWA_RESUME)) > return; > diff --git a/fs/namespace.c b/fs/namespace.c > index e157efc54023..46d640b70ca9 100644 > --- a/fs/namespace.c > +++ b/fs/namespace.c > @@ -1328,7 +1328,7 @@ static void mntput_no_expire(struct mount *mnt) > > if (likely(!(mnt->mnt.mnt_flags & MNT_INTERNAL))) { > struct task_struct *task = current; > - if (likely(!(task->flags & PF_KTHREAD))) { > + if (likely((task->flags & PF_RUNS_TASK_WORK))) { > init_task_work(&mnt->mnt_rcu, __cleanup_mnt); > if (!task_work_add(task, &mnt->mnt_rcu, TWA_RESUME)) > return; > diff --git a/include/linux/sched.h b/include/linux/sched.h > index 77f01ac385f7..e4eebac708e7 100644 > --- a/include/linux/sched.h > +++ b/include/linux/sched.h > @@ -1747,7 +1747,7 @@ extern struct pid *cad_pid; > * I am cleaning dirty pages from some other bdi. */ > #define PF_KTHREAD 0x00200000 /* I am a kernel thread */ > #define PF_RANDOMIZE 0x00400000 /* Randomize virtual address space */ > -#define PF__HOLE__00800000 0x00800000 > +#define PF_RUNS_TASK_WORK 0x00800000 /* Will call task_work_run() periodically */ The flag seems better to me than just relying on exit_work as itt's easier to reason about. > #define PF__HOLE__01000000 0x01000000 > #define PF__HOLE__02000000 0x02000000 > #define PF_NO_SETAFFINITY 0x04000000 /* Userland is not allowed to meddle with cpus_mask */ > diff --git a/kernel/fork.c b/kernel/fork.c > index 3b6d20dfb9a8..d612d8f14861 100644 > --- a/kernel/fork.c > +++ b/kernel/fork.c > @@ -2330,6 +2330,8 @@ __latent_entropy struct task_struct *copy_process( > p->flags &= ~PF_KTHREAD; > if (args->kthread) > p->flags |= PF_KTHREAD; > + else > + p->flags |= PF_RUNS_TASK_WORK; > if (args->user_worker) { > /* > * Mark us a user worker, and block any signal that isn't > diff --git a/kernel/task_work.c b/kernel/task_work.c > index 95a7e1b7f1da..aec19876e121 100644 > --- a/kernel/task_work.c > +++ b/kernel/task_work.c > @@ -183,3 +183,4 @@ void task_work_run(void) > } while (work); > } > } > +EXPORT_SYMBOL(task_work_run); > -- > 2.43.0 >
On Tue, Dec 05, 2023 at 07:48:20PM +1100, NeilBrown wrote: > On Tue, 05 Dec 2023, Dave Chinner wrote: > > On Mon, Dec 04, 2023 at 12:36:41PM +1100, NeilBrown wrote: > > > User-space processes always call task_work_run() as needed when > > > returning from a system call. Kernel-threads generally do not. > > > Because of this some work that is best run in the task_works context > > > (guaranteed that no locks are held) cannot be queued to task_works from > > > kernel threads and so are queued to a (single) work_time to be managed > > > on a work queue. > > > > > > This means that any cost for doing the work is not imposed on the kernel > > > thread, and importantly excessive amounts of work cannot apply > > > back-pressure to reduce the amount of new work queued. > > > > > > I have evidence from a customer site when nfsd (which runs as kernel > > > threads) is being asked to modify many millions of files which causes > > > sufficient memory pressure that some cache (in XFS I think) gets cleaned > > > earlier than would be ideal. When __dput (from the workqueue) calls > > > __dentry_kill, xfs_fs_destroy_inode() needs to synchronously read back > > > previously cached info from storage. > > > > We fixed that specific XFS problem in 5.9. > > > > https://lore.kernel.org/linux-xfs/20200622081605.1818434-1-david@fromorbit.com/ > > Good to know - thanks. > > > > > Can you reproduce these issues on a current TOT kernel? > > I haven't tried. I don't know if I know enough details of the work load > to attempt it. > > > > > If not, there's no bugs to fix in the upstream kernel. If you can, > > then we've got more XFS issues to work through and fix. > > > > Fundamentally, though, we should not be papering over an XFS issue > > by changing how core task_work infrastructure is used. So let's deal > > with the XFS issue first.... > > I disagree. This customer experience has demonstrated both a bug in XFS > and bug in the interaction between fput, task_work, and nfsd. > > If a bug in a filesystem that only causes a modest performance impact > when used through the syscall API can bring the system to its knees > through memory exhaustion when used by nfsd, then that is a robustness > issue for nfsd. > > I want to fix that robustness issue so that unusual behaviour in > filesystems does not cause out-of-proportion bad behaviour in nfsd. > > I highlighted this in the cover letter to the first version of my patch: > > https://lore.kernel.org/all/170112272125.7109.6245462722883333440@noble.neil.brown.name/ > > While this might point to a problem with the filesystem not handling the > final close efficiently, such problems should only hurt throughput, not > lead to memory exhaustion. I'm still confused about this memory exhaustion claim? If this is a filesystem problem it's pretty annoying that we have to work around it by exposing task work to random modules.
On Tue, 2023-12-05 at 12:14 +0100, Christian Brauner wrote: > On Mon, Dec 04, 2023 at 03:09:44PM -0700, Jens Axboe wrote: > > On 12/4/23 2:02 PM, NeilBrown wrote: > > > It isn't clear to me what _GPL is appropriate, but maybe the rules > > > changed since last I looked..... are there rules? > > > > > > My reasoning was that the call is effectively part of the user-space > > > ABI. A user-space process can call this trivially by invoking any > > > system call. The user-space ABI is explicitly a boundary which the GPL > > > does not cross. So it doesn't seem appropriate to prevent non-GPL > > > kernel code from doing something that non-GPL user-space code can > > > trivially do. > > > > By that reasoning, basically everything in the kernel should be non-GPL > > marked. And while task_work can get used by the application, it happens > > only indirectly or implicitly. So I don't think this reasoning is sound > > at all, it's not an exported ABI or API by itself. > > > > For me, the more core of an export it is, the stronger the reason it > > should be GPL. FWIW, I don't think exporting task_work functionality is > > a good idea in the first place, but if there's a strong reason to do so, > > Yeah, I'm not too fond of that part as well. I don't think we want to > give modules the ability to mess with task work. This is just asking for > trouble. The fact that nfsd has to queue all of the delayed fput activity to a workqueue has always been a horrible hack though. We export all kinds of functionality to modules that you can screw up. I think that nfsd's use-case is legitimate. ksmbd may also want to follow suit.
On Tue, Dec 05, 2023 at 12:25:40PM +0100, Christian Brauner wrote: > On Mon, Dec 04, 2023 at 12:36:41PM +1100, NeilBrown wrote: > > User-space processes always call task_work_run() as needed when > > returning from a system call. Kernel-threads generally do not. > > Because of this some work that is best run in the task_works context > > (guaranteed that no locks are held) cannot be queued to task_works from > > kernel threads and so are queued to a (single) work_time to be managed > > on a work queue. > > > > This means that any cost for doing the work is not imposed on the kernel > > thread, and importantly excessive amounts of work cannot apply > > back-pressure to reduce the amount of new work queued. > > > > I have evidence from a customer site when nfsd (which runs as kernel > > threads) is being asked to modify many millions of files which causes > > sufficient memory pressure that some cache (in XFS I think) gets cleaned > > earlier than would be ideal. When __dput (from the workqueue) calls > > __dentry_kill, xfs_fs_destroy_inode() needs to synchronously read back > > previously cached info from storage. This slows down the single thread > > that is making all the final __dput() calls for all the nfsd threads > > with the net result that files are added to the delayed_fput_list faster > > than they are removed, and the system eventually runs out of memory. > > > > This happens because there is no back-pressure: the nfsd isn't forced to > > slow down when __dput() is slow for any reason. To fix this we can > > change the nfsd threads to call task_work_run() regularly (much like > > user-space processes do) and allow it to declare this so that work does > > get queued to task_works rather than to a work queue. > > > > This patch adds a new process flag PF_RUNS_TASK_WORK which is now used > > instead of PF_KTHREAD to determine whether it is sensible to queue > > something to task_works. This flag is always set for non-kernel threads. > > > > task_work_run() is also exported so that it can be called from a module > > such as nfsd. > > > > Signed-off-by: NeilBrown <neilb@suse.de> > > --- > > The thing that bugs me the most about this is that we expose task work > infrastructure to modules which I think is a really bad idea. File > handling code brings so many driver to their knees and now we're handing > them another footgun. > > I'm not per se opposed to all of this but is this really what the other > NFS maintainers want to switch to as well? And is this really that badly > needed and that common that we want to go down that road? I wouldn't > mind not having to do all this if we can get by via other means. The problem of slow flushing during close is not limited to XFS or any particular underlying file system. Sometimes it is due to performance or scalability bugs, but sometimes it's just a slow storage stack on the NFS server (eg NFS re-export). One slow synchronous flush in a single-threaded queue will result in head-of-queue blocking. That is something that needs to be addressed (IMHO, first). Adding back pressure on NFS clients when NFSD is not able to get dirty data onto durable storage fast enough is a long term solution, but it's probably a heavier lift. I'm not wedded to using task_work to do that, but it does seem to fit the problem at hand. > > fs/file_table.c | 3 ++- > > fs/namespace.c | 2 +- > > include/linux/sched.h | 2 +- > > kernel/fork.c | 2 ++ > > kernel/task_work.c | 1 + > > 5 files changed, 7 insertions(+), 3 deletions(-) > > > > diff --git a/fs/file_table.c b/fs/file_table.c > > index ee21b3da9d08..d36cade6e366 100644 > > --- a/fs/file_table.c > > +++ b/fs/file_table.c > > @@ -435,7 +435,8 @@ void fput(struct file *file) > > if (atomic_long_dec_and_test(&file->f_count)) { > > struct task_struct *task = current; > > > > - if (likely(!in_interrupt() && !(task->flags & PF_KTHREAD))) { > > + if (likely(!in_interrupt() && > > + (task->flags & PF_RUNS_TASK_WORK))) { > > init_task_work(&file->f_rcuhead, ____fput); > > if (!task_work_add(task, &file->f_rcuhead, TWA_RESUME)) > > return; > > diff --git a/fs/namespace.c b/fs/namespace.c > > index e157efc54023..46d640b70ca9 100644 > > --- a/fs/namespace.c > > +++ b/fs/namespace.c > > @@ -1328,7 +1328,7 @@ static void mntput_no_expire(struct mount *mnt) > > > > if (likely(!(mnt->mnt.mnt_flags & MNT_INTERNAL))) { > > struct task_struct *task = current; > > - if (likely(!(task->flags & PF_KTHREAD))) { > > + if (likely((task->flags & PF_RUNS_TASK_WORK))) { > > init_task_work(&mnt->mnt_rcu, __cleanup_mnt); > > if (!task_work_add(task, &mnt->mnt_rcu, TWA_RESUME)) > > return; > > diff --git a/include/linux/sched.h b/include/linux/sched.h > > index 77f01ac385f7..e4eebac708e7 100644 > > --- a/include/linux/sched.h > > +++ b/include/linux/sched.h > > @@ -1747,7 +1747,7 @@ extern struct pid *cad_pid; > > * I am cleaning dirty pages from some other bdi. */ > > #define PF_KTHREAD 0x00200000 /* I am a kernel thread */ > > #define PF_RANDOMIZE 0x00400000 /* Randomize virtual address space */ > > -#define PF__HOLE__00800000 0x00800000 > > +#define PF_RUNS_TASK_WORK 0x00800000 /* Will call task_work_run() periodically */ > > The flag seems better to me than just relying on exit_work as itt's > easier to reason about. > > > #define PF__HOLE__01000000 0x01000000 > > #define PF__HOLE__02000000 0x02000000 > > #define PF_NO_SETAFFINITY 0x04000000 /* Userland is not allowed to meddle with cpus_mask */ > > diff --git a/kernel/fork.c b/kernel/fork.c > > index 3b6d20dfb9a8..d612d8f14861 100644 > > --- a/kernel/fork.c > > +++ b/kernel/fork.c > > @@ -2330,6 +2330,8 @@ __latent_entropy struct task_struct *copy_process( > > p->flags &= ~PF_KTHREAD; > > if (args->kthread) > > p->flags |= PF_KTHREAD; > > + else > > + p->flags |= PF_RUNS_TASK_WORK; > > if (args->user_worker) { > > /* > > * Mark us a user worker, and block any signal that isn't > > diff --git a/kernel/task_work.c b/kernel/task_work.c > > index 95a7e1b7f1da..aec19876e121 100644 > > --- a/kernel/task_work.c > > +++ b/kernel/task_work.c > > @@ -183,3 +183,4 @@ void task_work_run(void) > > } while (work); > > } > > } > > +EXPORT_SYMBOL(task_work_run); > > -- > > 2.43.0 > >
On Tue, 05 Dec 2023, Christian Brauner wrote: > On Mon, Dec 04, 2023 at 03:09:44PM -0700, Jens Axboe wrote: > > On 12/4/23 2:02 PM, NeilBrown wrote: > > > It isn't clear to me what _GPL is appropriate, but maybe the rules > > > changed since last I looked..... are there rules? > > > > > > My reasoning was that the call is effectively part of the user-space > > > ABI. A user-space process can call this trivially by invoking any > > > system call. The user-space ABI is explicitly a boundary which the GPL > > > does not cross. So it doesn't seem appropriate to prevent non-GPL > > > kernel code from doing something that non-GPL user-space code can > > > trivially do. > > > > By that reasoning, basically everything in the kernel should be non-GPL > > marked. And while task_work can get used by the application, it happens > > only indirectly or implicitly. So I don't think this reasoning is sound > > at all, it's not an exported ABI or API by itself. > > > > For me, the more core of an export it is, the stronger the reason it > > should be GPL. FWIW, I don't think exporting task_work functionality is > > a good idea in the first place, but if there's a strong reason to do so, > > Yeah, I'm not too fond of that part as well. I don't think we want to > give modules the ability to mess with task work. This is just asking for > trouble. > Ok, maybe we need to reframe the problem then. Currently fput(), and hence filp_close(), take control away from kernel threads in that they cannot be sure that a "close" has actually completed. This is already a problem for nfsd. When renaming a file, nfsd needs to ensure any cached "open" that it has on the file is closed (else when re-exporting an NFS filesystem it can result in a silly-rename). nfsd currently handles this case by calling flush_delayed_fput(). I suspect you are no more happy about exporting that than you are about exporting task_work_run(), but this solution isn't actually 100% reliable. If some other thread calls flush_delayed_fput() between nfsd calling filp_close() and that same nfsd calling flush_delayed_fput(), then the second flush can return before the first flush (in the other thread) completes all the work it took on. What we really need - both for handling renames and for avoiding possible memory exhaustion - is for nfsd to be able to reliably wait for any fput() that it initiated to complete. How would you like the VFS to provide that service? Thanks, NeilBrown
On 12/5/23 2:28 PM, NeilBrown wrote: > On Tue, 05 Dec 2023, Christian Brauner wrote: >> On Mon, Dec 04, 2023 at 03:09:44PM -0700, Jens Axboe wrote: >>> On 12/4/23 2:02 PM, NeilBrown wrote: >>>> It isn't clear to me what _GPL is appropriate, but maybe the rules >>>> changed since last I looked..... are there rules? >>>> >>>> My reasoning was that the call is effectively part of the user-space >>>> ABI. A user-space process can call this trivially by invoking any >>>> system call. The user-space ABI is explicitly a boundary which the GPL >>>> does not cross. So it doesn't seem appropriate to prevent non-GPL >>>> kernel code from doing something that non-GPL user-space code can >>>> trivially do. >>> >>> By that reasoning, basically everything in the kernel should be non-GPL >>> marked. And while task_work can get used by the application, it happens >>> only indirectly or implicitly. So I don't think this reasoning is sound >>> at all, it's not an exported ABI or API by itself. >>> >>> For me, the more core of an export it is, the stronger the reason it >>> should be GPL. FWIW, I don't think exporting task_work functionality is >> >> Yeah, I'm not too fond of that part as well. I don't think we want to >> give modules the ability to mess with task work. This is just asking for >> trouble. >> > > Ok, maybe we need to reframe the problem then. > > Currently fput(), and hence filp_close(), take control away from kernel > threads in that they cannot be sure that a "close" has actually > completed. > > This is already a problem for nfsd. When renaming a file, nfsd needs to > ensure any cached "open" that it has on the file is closed (else when > re-exporting an NFS filesystem it can result in a silly-rename). > > nfsd currently handles this case by calling flush_delayed_fput(). I > suspect you are no more happy about exporting that than you are about > exporting task_work_run(), but this solution isn't actually 100% > reliable. If some other thread calls flush_delayed_fput() between nfsd > calling filp_close() and that same nfsd calling flush_delayed_fput(), > then the second flush can return before the first flush (in the other > thread) completes all the work it took on. > > What we really need - both for handling renames and for avoiding > possible memory exhaustion - is for nfsd to be able to reliably wait for > any fput() that it initiated to complete. > > How would you like the VFS to provide that service? Since task_work happens in the context of your task already, why not just have a way to get it stashed into a list when final fput is done? This avoids all of this "let's expose task_work" and using the task list for that, which seems kind of pointless as you're just going to run it later on manually anyway. In semi pseudo code: bool fput_put_ref(struct file *file) { return atomic_dec_and_test(&file->f_count); } void fput(struct file *file) { if (fput_put_ref(file)) { ... } } and then your nfsd_file_free() could do: ret = filp_flush(file, id); if (fput_put_ref(file)) llist_add(&file->f_llist, &l->to_free_llist); or something like that, where l->to_free_llist is where ever you'd otherwise punt the actual freeing to.
On 12/5/23 2:58 PM, Jens Axboe wrote: > On 12/5/23 2:28 PM, NeilBrown wrote: >> On Tue, 05 Dec 2023, Christian Brauner wrote: >>> On Mon, Dec 04, 2023 at 03:09:44PM -0700, Jens Axboe wrote: >>>> On 12/4/23 2:02 PM, NeilBrown wrote: >>>>> It isn't clear to me what _GPL is appropriate, but maybe the rules >>>>> changed since last I looked..... are there rules? >>>>> >>>>> My reasoning was that the call is effectively part of the user-space >>>>> ABI. A user-space process can call this trivially by invoking any >>>>> system call. The user-space ABI is explicitly a boundary which the GPL >>>>> does not cross. So it doesn't seem appropriate to prevent non-GPL >>>>> kernel code from doing something that non-GPL user-space code can >>>>> trivially do. >>>> >>>> By that reasoning, basically everything in the kernel should be non-GPL >>>> marked. And while task_work can get used by the application, it happens >>>> only indirectly or implicitly. So I don't think this reasoning is sound >>>> at all, it's not an exported ABI or API by itself. >>>> >>>> For me, the more core of an export it is, the stronger the reason it >>>> should be GPL. FWIW, I don't think exporting task_work functionality is > >>> >>> Yeah, I'm not too fond of that part as well. I don't think we want to >>> give modules the ability to mess with task work. This is just asking for >>> trouble. >>> >> >> Ok, maybe we need to reframe the problem then. >> >> Currently fput(), and hence filp_close(), take control away from kernel >> threads in that they cannot be sure that a "close" has actually >> completed. >> >> This is already a problem for nfsd. When renaming a file, nfsd needs to >> ensure any cached "open" that it has on the file is closed (else when >> re-exporting an NFS filesystem it can result in a silly-rename). >> >> nfsd currently handles this case by calling flush_delayed_fput(). I >> suspect you are no more happy about exporting that than you are about >> exporting task_work_run(), but this solution isn't actually 100% >> reliable. If some other thread calls flush_delayed_fput() between nfsd >> calling filp_close() and that same nfsd calling flush_delayed_fput(), >> then the second flush can return before the first flush (in the other >> thread) completes all the work it took on. >> >> What we really need - both for handling renames and for avoiding >> possible memory exhaustion - is for nfsd to be able to reliably wait for >> any fput() that it initiated to complete. >> >> How would you like the VFS to provide that service? > > Since task_work happens in the context of your task already, why not > just have a way to get it stashed into a list when final fput is done? > This avoids all of this "let's expose task_work" and using the task list > for that, which seems kind of pointless as you're just going to run it > later on manually anyway. > > In semi pseudo code: > > bool fput_put_ref(struct file *file) > { > return atomic_dec_and_test(&file->f_count); > } > > void fput(struct file *file) > { > if (fput_put_ref(file)) { > ... > } > } > > and then your nfsd_file_free() could do: > > ret = filp_flush(file, id); > if (fput_put_ref(file)) > llist_add(&file->f_llist, &l->to_free_llist); > > or something like that, where l->to_free_llist is where ever you'd > otherwise punt the actual freeing to. Should probably have the put_ref or whatever helper also init the task_work, and then reuse the list in the callback_head there. Then whoever flushes it has to call ->func() and avoid exposing ____fput() to random users. But you get the idea.
On Wed, 06 Dec 2023, Jens Axboe wrote: > On 12/5/23 2:58 PM, Jens Axboe wrote: > > On 12/5/23 2:28 PM, NeilBrown wrote: > >> On Tue, 05 Dec 2023, Christian Brauner wrote: > >>> On Mon, Dec 04, 2023 at 03:09:44PM -0700, Jens Axboe wrote: > >>>> On 12/4/23 2:02 PM, NeilBrown wrote: > >>>>> It isn't clear to me what _GPL is appropriate, but maybe the rules > >>>>> changed since last I looked..... are there rules? > >>>>> > >>>>> My reasoning was that the call is effectively part of the user-space > >>>>> ABI. A user-space process can call this trivially by invoking any > >>>>> system call. The user-space ABI is explicitly a boundary which the GPL > >>>>> does not cross. So it doesn't seem appropriate to prevent non-GPL > >>>>> kernel code from doing something that non-GPL user-space code can > >>>>> trivially do. > >>>> > >>>> By that reasoning, basically everything in the kernel should be non-GPL > >>>> marked. And while task_work can get used by the application, it happens > >>>> only indirectly or implicitly. So I don't think this reasoning is sound > >>>> at all, it's not an exported ABI or API by itself. > >>>> > >>>> For me, the more core of an export it is, the stronger the reason it > >>>> should be GPL. FWIW, I don't think exporting task_work functionality is > > > >>> > >>> Yeah, I'm not too fond of that part as well. I don't think we want to > >>> give modules the ability to mess with task work. This is just asking for > >>> trouble. > >>> > >> > >> Ok, maybe we need to reframe the problem then. > >> > >> Currently fput(), and hence filp_close(), take control away from kernel > >> threads in that they cannot be sure that a "close" has actually > >> completed. > >> > >> This is already a problem for nfsd. When renaming a file, nfsd needs to > >> ensure any cached "open" that it has on the file is closed (else when > >> re-exporting an NFS filesystem it can result in a silly-rename). > >> > >> nfsd currently handles this case by calling flush_delayed_fput(). I > >> suspect you are no more happy about exporting that than you are about > >> exporting task_work_run(), but this solution isn't actually 100% > >> reliable. If some other thread calls flush_delayed_fput() between nfsd > >> calling filp_close() and that same nfsd calling flush_delayed_fput(), > >> then the second flush can return before the first flush (in the other > >> thread) completes all the work it took on. > >> > >> What we really need - both for handling renames and for avoiding > >> possible memory exhaustion - is for nfsd to be able to reliably wait for > >> any fput() that it initiated to complete. > >> > >> How would you like the VFS to provide that service? > > > > Since task_work happens in the context of your task already, why not > > just have a way to get it stashed into a list when final fput is done? > > This avoids all of this "let's expose task_work" and using the task list > > for that, which seems kind of pointless as you're just going to run it > > later on manually anyway. > > > > In semi pseudo code: > > > > bool fput_put_ref(struct file *file) > > { > > return atomic_dec_and_test(&file->f_count); > > } > > > > void fput(struct file *file) > > { > > if (fput_put_ref(file)) { > > ... > > } > > } > > > > and then your nfsd_file_free() could do: > > > > ret = filp_flush(file, id); > > if (fput_put_ref(file)) > > llist_add(&file->f_llist, &l->to_free_llist); > > > > or something like that, where l->to_free_llist is where ever you'd > > otherwise punt the actual freeing to. > > Should probably have the put_ref or whatever helper also init the > task_work, and then reuse the list in the callback_head there. Then > whoever flushes it has to call ->func() and avoid exposing ____fput() to > random users. But you get the idea. Interesting ideas - thanks. So maybe the new API would be fput_queued(struct file *f, struct llist_head *q) and flush_fput_queue(struct llist_head *q) with the meaning being that fput_queued() is just like fput() except that any file needing __fput() is added to the 'q'; and that flush_fput_queue() calls __fput() on any files in 'q'. So to close a file nfsd would: fget(f); flip_close(f); fput_queued(f, &my_queue); though possibly we could have a filp_close_queued(f, q) as well. I'll try that out - but am happy to hear alternate suggestions for names :-) Thanks, NeilBrown
On Wed, 06 Dec 2023, NeilBrown wrote: > On Wed, 06 Dec 2023, Jens Axboe wrote: > > On 12/5/23 2:58 PM, Jens Axboe wrote: > > > On 12/5/23 2:28 PM, NeilBrown wrote: > > >> On Tue, 05 Dec 2023, Christian Brauner wrote: > > >>> On Mon, Dec 04, 2023 at 03:09:44PM -0700, Jens Axboe wrote: > > >>>> On 12/4/23 2:02 PM, NeilBrown wrote: > > >>>>> It isn't clear to me what _GPL is appropriate, but maybe the rules > > >>>>> changed since last I looked..... are there rules? > > >>>>> > > >>>>> My reasoning was that the call is effectively part of the user-space > > >>>>> ABI. A user-space process can call this trivially by invoking any > > >>>>> system call. The user-space ABI is explicitly a boundary which the GPL > > >>>>> does not cross. So it doesn't seem appropriate to prevent non-GPL > > >>>>> kernel code from doing something that non-GPL user-space code can > > >>>>> trivially do. > > >>>> > > >>>> By that reasoning, basically everything in the kernel should be non-GPL > > >>>> marked. And while task_work can get used by the application, it happens > > >>>> only indirectly or implicitly. So I don't think this reasoning is sound > > >>>> at all, it's not an exported ABI or API by itself. > > >>>> > > >>>> For me, the more core of an export it is, the stronger the reason it > > >>>> should be GPL. FWIW, I don't think exporting task_work functionality is > > > > > >>> > > >>> Yeah, I'm not too fond of that part as well. I don't think we want to > > >>> give modules the ability to mess with task work. This is just asking for > > >>> trouble. > > >>> > > >> > > >> Ok, maybe we need to reframe the problem then. > > >> > > >> Currently fput(), and hence filp_close(), take control away from kernel > > >> threads in that they cannot be sure that a "close" has actually > > >> completed. > > >> > > >> This is already a problem for nfsd. When renaming a file, nfsd needs to > > >> ensure any cached "open" that it has on the file is closed (else when > > >> re-exporting an NFS filesystem it can result in a silly-rename). > > >> > > >> nfsd currently handles this case by calling flush_delayed_fput(). I > > >> suspect you are no more happy about exporting that than you are about > > >> exporting task_work_run(), but this solution isn't actually 100% > > >> reliable. If some other thread calls flush_delayed_fput() between nfsd > > >> calling filp_close() and that same nfsd calling flush_delayed_fput(), > > >> then the second flush can return before the first flush (in the other > > >> thread) completes all the work it took on. > > >> > > >> What we really need - both for handling renames and for avoiding > > >> possible memory exhaustion - is for nfsd to be able to reliably wait for > > >> any fput() that it initiated to complete. > > >> > > >> How would you like the VFS to provide that service? > > > > > > Since task_work happens in the context of your task already, why not > > > just have a way to get it stashed into a list when final fput is done? > > > This avoids all of this "let's expose task_work" and using the task list > > > for that, which seems kind of pointless as you're just going to run it > > > later on manually anyway. > > > > > > In semi pseudo code: > > > > > > bool fput_put_ref(struct file *file) > > > { > > > return atomic_dec_and_test(&file->f_count); > > > } > > > > > > void fput(struct file *file) > > > { > > > if (fput_put_ref(file)) { > > > ... > > > } > > > } > > > > > > and then your nfsd_file_free() could do: > > > > > > ret = filp_flush(file, id); > > > if (fput_put_ref(file)) > > > llist_add(&file->f_llist, &l->to_free_llist); > > > > > > or something like that, where l->to_free_llist is where ever you'd > > > otherwise punt the actual freeing to. > > > > Should probably have the put_ref or whatever helper also init the > > task_work, and then reuse the list in the callback_head there. Then > > whoever flushes it has to call ->func() and avoid exposing ____fput() to > > random users. But you get the idea. > > Interesting ideas - thanks. > > So maybe the new API would be > > fput_queued(struct file *f, struct llist_head *q) > and > flush_fput_queue(struct llist_head *q) > > with the meaning being that fput_queued() is just like fput() except > that any file needing __fput() is added to the 'q'; and that > flush_fput_queue() calls __fput() on any files in 'q'. > > So to close a file nfsd would: > > fget(f); > flip_close(f); > fput_queued(f, &my_queue); > > though possibly we could have a > filp_close_queued(f, q) > as well. > > I'll try that out - but am happy to hear alternate suggestions for names :-) > Actually .... I'm beginning to wonder if we should just use __fput_sync() in nfsd. It has a big warning about not doing that blindly, but the detail in the warning doesn't seem to apply to nfsd... NeilBrown
On 12/5/23 4:23 PM, NeilBrown wrote: > On Wed, 06 Dec 2023, NeilBrown wrote: >> On Wed, 06 Dec 2023, Jens Axboe wrote: >>> On 12/5/23 2:58 PM, Jens Axboe wrote: >>>> On 12/5/23 2:28 PM, NeilBrown wrote: >>>>> On Tue, 05 Dec 2023, Christian Brauner wrote: >>>>>> On Mon, Dec 04, 2023 at 03:09:44PM -0700, Jens Axboe wrote: >>>>>>> On 12/4/23 2:02 PM, NeilBrown wrote: >>>>>>>> It isn't clear to me what _GPL is appropriate, but maybe the rules >>>>>>>> changed since last I looked..... are there rules? >>>>>>>> >>>>>>>> My reasoning was that the call is effectively part of the user-space >>>>>>>> ABI. A user-space process can call this trivially by invoking any >>>>>>>> system call. The user-space ABI is explicitly a boundary which the GPL >>>>>>>> does not cross. So it doesn't seem appropriate to prevent non-GPL >>>>>>>> kernel code from doing something that non-GPL user-space code can >>>>>>>> trivially do. >>>>>>> >>>>>>> By that reasoning, basically everything in the kernel should be non-GPL >>>>>>> marked. And while task_work can get used by the application, it happens >>>>>>> only indirectly or implicitly. So I don't think this reasoning is sound >>>>>>> at all, it's not an exported ABI or API by itself. >>>>>>> >>>>>>> For me, the more core of an export it is, the stronger the reason it >>>>>>> should be GPL. FWIW, I don't think exporting task_work functionality is >>>> >>>>>> >>>>>> Yeah, I'm not too fond of that part as well. I don't think we want to >>>>>> give modules the ability to mess with task work. This is just asking for >>>>>> trouble. >>>>>> >>>>> >>>>> Ok, maybe we need to reframe the problem then. >>>>> >>>>> Currently fput(), and hence filp_close(), take control away from kernel >>>>> threads in that they cannot be sure that a "close" has actually >>>>> completed. >>>>> >>>>> This is already a problem for nfsd. When renaming a file, nfsd needs to >>>>> ensure any cached "open" that it has on the file is closed (else when >>>>> re-exporting an NFS filesystem it can result in a silly-rename). >>>>> >>>>> nfsd currently handles this case by calling flush_delayed_fput(). I >>>>> suspect you are no more happy about exporting that than you are about >>>>> exporting task_work_run(), but this solution isn't actually 100% >>>>> reliable. If some other thread calls flush_delayed_fput() between nfsd >>>>> calling filp_close() and that same nfsd calling flush_delayed_fput(), >>>>> then the second flush can return before the first flush (in the other >>>>> thread) completes all the work it took on. >>>>> >>>>> What we really need - both for handling renames and for avoiding >>>>> possible memory exhaustion - is for nfsd to be able to reliably wait for >>>>> any fput() that it initiated to complete. >>>>> >>>>> How would you like the VFS to provide that service? >>>> >>>> Since task_work happens in the context of your task already, why not >>>> just have a way to get it stashed into a list when final fput is done? >>>> This avoids all of this "let's expose task_work" and using the task list >>>> for that, which seems kind of pointless as you're just going to run it >>>> later on manually anyway. >>>> >>>> In semi pseudo code: >>>> >>>> bool fput_put_ref(struct file *file) >>>> { >>>> return atomic_dec_and_test(&file->f_count); >>>> } >>>> >>>> void fput(struct file *file) >>>> { >>>> if (fput_put_ref(file)) { >>>> ... >>>> } >>>> } >>>> >>>> and then your nfsd_file_free() could do: >>>> >>>> ret = filp_flush(file, id); >>>> if (fput_put_ref(file)) >>>> llist_add(&file->f_llist, &l->to_free_llist); >>>> >>>> or something like that, where l->to_free_llist is where ever you'd >>>> otherwise punt the actual freeing to. >>> >>> Should probably have the put_ref or whatever helper also init the >>> task_work, and then reuse the list in the callback_head there. Then >>> whoever flushes it has to call ->func() and avoid exposing ____fput() to >>> random users. But you get the idea. >> >> Interesting ideas - thanks. >> >> So maybe the new API would be >> >> fput_queued(struct file *f, struct llist_head *q) >> and >> flush_fput_queue(struct llist_head *q) >> >> with the meaning being that fput_queued() is just like fput() except >> that any file needing __fput() is added to the 'q'; and that >> flush_fput_queue() calls __fput() on any files in 'q'. >> >> So to close a file nfsd would: >> >> fget(f); >> flip_close(f); >> fput_queued(f, &my_queue); >> >> though possibly we could have a >> filp_close_queued(f, q) >> as well. >> >> I'll try that out - but am happy to hear alternate suggestions for names :-) >> > > Actually .... I'm beginning to wonder if we should just use > __fput_sync() in nfsd. > It has a big warning about not doing that blindly, but the detail in the > warning doesn't seem to apply to nfsd... If you can do it from the context where you do the filp_close() right now, then yeah there's no reason to over-complicate this at all... FWIW, the reason task_work exists is just to ensure a clean context to perform these operations from the task itself. The more I think about it, it doesn't make a lot of sense to utilize it for this purpose, which is where my alternate suggestion came from. But if you can just call it directly, then that makes everything much easier.
On Tue, Dec 05, 2023 at 12:14:29PM +0100, Christian Brauner wrote: > > For me, the more core of an export it is, the stronger the reason it > > should be GPL. FWIW, I don't think exporting task_work functionality is > > a good idea in the first place, but if there's a strong reason to do so, > > Yeah, I'm not too fond of that part as well. I don't think we want to > give modules the ability to mess with task work. This is just asking for > trouble. It just seems like a really bad idea. At the same time it fixes a real problem. If we go a step back how could we fix it in a better way? Do we even need the task_run based delay for file usage from kernel threads?
On Tue, Dec 05, 2023 at 04:31:51PM -0700, Jens Axboe wrote: > On 12/5/23 4:23 PM, NeilBrown wrote: > > On Wed, 06 Dec 2023, NeilBrown wrote: > >> On Wed, 06 Dec 2023, Jens Axboe wrote: > >>> On 12/5/23 2:58 PM, Jens Axboe wrote: > >>>> On 12/5/23 2:28 PM, NeilBrown wrote: > >>>>> On Tue, 05 Dec 2023, Christian Brauner wrote: > >>>>>> On Mon, Dec 04, 2023 at 03:09:44PM -0700, Jens Axboe wrote: > >>>>>>> On 12/4/23 2:02 PM, NeilBrown wrote: > >>>>>>>> It isn't clear to me what _GPL is appropriate, but maybe the rules > >>>>>>>> changed since last I looked..... are there rules? > >>>>>>>> > >>>>>>>> My reasoning was that the call is effectively part of the user-space > >>>>>>>> ABI. A user-space process can call this trivially by invoking any > >>>>>>>> system call. The user-space ABI is explicitly a boundary which the GPL > >>>>>>>> does not cross. So it doesn't seem appropriate to prevent non-GPL > >>>>>>>> kernel code from doing something that non-GPL user-space code can > >>>>>>>> trivially do. > >>>>>>> > >>>>>>> By that reasoning, basically everything in the kernel should be non-GPL > >>>>>>> marked. And while task_work can get used by the application, it happens > >>>>>>> only indirectly or implicitly. So I don't think this reasoning is sound > >>>>>>> at all, it's not an exported ABI or API by itself. > >>>>>>> > >>>>>>> For me, the more core of an export it is, the stronger the reason it > >>>>>>> should be GPL. FWIW, I don't think exporting task_work functionality is > >>>> > >>>>>> > >>>>>> Yeah, I'm not too fond of that part as well. I don't think we want to > >>>>>> give modules the ability to mess with task work. This is just asking for > >>>>>> trouble. > >>>>>> > >>>>> > >>>>> Ok, maybe we need to reframe the problem then. > >>>>> > >>>>> Currently fput(), and hence filp_close(), take control away from kernel > >>>>> threads in that they cannot be sure that a "close" has actually > >>>>> completed. > >>>>> > >>>>> This is already a problem for nfsd. When renaming a file, nfsd needs to > >>>>> ensure any cached "open" that it has on the file is closed (else when > >>>>> re-exporting an NFS filesystem it can result in a silly-rename). > >>>>> > >>>>> nfsd currently handles this case by calling flush_delayed_fput(). I > >>>>> suspect you are no more happy about exporting that than you are about > >>>>> exporting task_work_run(), but this solution isn't actually 100% > >>>>> reliable. If some other thread calls flush_delayed_fput() between nfsd > >>>>> calling filp_close() and that same nfsd calling flush_delayed_fput(), > >>>>> then the second flush can return before the first flush (in the other > >>>>> thread) completes all the work it took on. > >>>>> > >>>>> What we really need - both for handling renames and for avoiding > >>>>> possible memory exhaustion - is for nfsd to be able to reliably wait for > >>>>> any fput() that it initiated to complete. > >>>>> > >>>>> How would you like the VFS to provide that service? > >>>> > >>>> Since task_work happens in the context of your task already, why not > >>>> just have a way to get it stashed into a list when final fput is done? > >>>> This avoids all of this "let's expose task_work" and using the task list > >>>> for that, which seems kind of pointless as you're just going to run it > >>>> later on manually anyway. > >>>> > >>>> In semi pseudo code: > >>>> > >>>> bool fput_put_ref(struct file *file) > >>>> { > >>>> return atomic_dec_and_test(&file->f_count); > >>>> } > >>>> > >>>> void fput(struct file *file) > >>>> { > >>>> if (fput_put_ref(file)) { > >>>> ... > >>>> } > >>>> } > >>>> > >>>> and then your nfsd_file_free() could do: > >>>> > >>>> ret = filp_flush(file, id); > >>>> if (fput_put_ref(file)) > >>>> llist_add(&file->f_llist, &l->to_free_llist); > >>>> > >>>> or something like that, where l->to_free_llist is where ever you'd > >>>> otherwise punt the actual freeing to. > >>> > >>> Should probably have the put_ref or whatever helper also init the > >>> task_work, and then reuse the list in the callback_head there. Then > >>> whoever flushes it has to call ->func() and avoid exposing ____fput() to > >>> random users. But you get the idea. > >> > >> Interesting ideas - thanks. > >> > >> So maybe the new API would be > >> > >> fput_queued(struct file *f, struct llist_head *q) > >> and > >> flush_fput_queue(struct llist_head *q) > >> > >> with the meaning being that fput_queued() is just like fput() except > >> that any file needing __fput() is added to the 'q'; and that > >> flush_fput_queue() calls __fput() on any files in 'q'. > >> > >> So to close a file nfsd would: > >> > >> fget(f); > >> flip_close(f); > >> fput_queued(f, &my_queue); > >> > >> though possibly we could have a > >> filp_close_queued(f, q) > >> as well. > >> > >> I'll try that out - but am happy to hear alternate suggestions for names :-) > >> > > > > Actually .... I'm beginning to wonder if we should just use > > __fput_sync() in nfsd. > > It has a big warning about not doing that blindly, but the detail in the > > warning doesn't seem to apply to nfsd... > > If you can do it from the context where you do the filp_close() right > now, then yeah there's no reason to over-complicate this at all... FWIW, As long as nfsd doesn't care that it may get stuck on umount or ->release... > the reason task_work exists is just to ensure a clean context to perform > these operations from the task itself. The more I think about it, it > doesn't make a lot of sense to utilize it for this purpose, which is > where my alternate suggestion came from. But if you can just call it > directly, then that makes everything much easier. And for better or worse we already expose __fput_sync(). We've recently switched close(2) over to it as well as it was needlessly punting to task work.
On Wed, Dec 06, 2023 at 08:28:15AM +1100, NeilBrown wrote: > On Tue, 05 Dec 2023, Christian Brauner wrote: > > On Mon, Dec 04, 2023 at 03:09:44PM -0700, Jens Axboe wrote: > > > On 12/4/23 2:02 PM, NeilBrown wrote: > > > > It isn't clear to me what _GPL is appropriate, but maybe the rules > > > > changed since last I looked..... are there rules? > > > > > > > > My reasoning was that the call is effectively part of the user-space > > > > ABI. A user-space process can call this trivially by invoking any > > > > system call. The user-space ABI is explicitly a boundary which the GPL > > > > does not cross. So it doesn't seem appropriate to prevent non-GPL > > > > kernel code from doing something that non-GPL user-space code can > > > > trivially do. > > > > > > By that reasoning, basically everything in the kernel should be non-GPL > > > marked. And while task_work can get used by the application, it happens > > > only indirectly or implicitly. So I don't think this reasoning is sound > > > at all, it's not an exported ABI or API by itself. > > > > > > For me, the more core of an export it is, the stronger the reason it > > > should be GPL. FWIW, I don't think exporting task_work functionality is > > > a good idea in the first place, but if there's a strong reason to do so, > > > > Yeah, I'm not too fond of that part as well. I don't think we want to > > give modules the ability to mess with task work. This is just asking for > > trouble. > > > > Ok, maybe we need to reframe the problem then. > > Currently fput(), and hence filp_close(), take control away from kernel > threads in that they cannot be sure that a "close" has actually > completed. > > This is already a problem for nfsd. When renaming a file, nfsd needs to > ensure any cached "open" that it has on the file is closed (else when > re-exporting an NFS filesystem it can result in a silly-rename). > > nfsd currently handles this case by calling flush_delayed_fput(). I > suspect you are no more happy about exporting that than you are about > exporting task_work_run(), but this solution isn't actually 100% > reliable. If some other thread calls flush_delayed_fput() between nfsd > calling filp_close() and that same nfsd calling flush_delayed_fput(), > then the second flush can return before the first flush (in the other > thread) completes all the work it took on. > > What we really need - both for handling renames and for avoiding > possible memory exhaustion - is for nfsd to be able to reliably wait for > any fput() that it initiated to complete. Yeah, I acknowledge the problem and I said I'm not opposed to your solution I just would like to do it differently if we can. > > How would you like the VFS to provide that service? If you really don't care about getting stuck on an fput() somehow then what Jens suggested might actually be fine. And the proposal - queue the files on a list - isn't that already what nfsd is kinda doing already for the file cache. That's at least the impression I got from reading over nfsd_file_free(). It's just that it doesn't free directly but used that flush_delayed_fput(). So really, taking this on step further and doing the fput synchronously might work.
On Thu, 07 Dec 2023, Christian Brauner wrote: > On Tue, Dec 05, 2023 at 04:31:51PM -0700, Jens Axboe wrote: > > On 12/5/23 4:23 PM, NeilBrown wrote: > > > On Wed, 06 Dec 2023, NeilBrown wrote: > > >> On Wed, 06 Dec 2023, Jens Axboe wrote: > > >>> On 12/5/23 2:58 PM, Jens Axboe wrote: > > >>>> On 12/5/23 2:28 PM, NeilBrown wrote: > > >>>>> On Tue, 05 Dec 2023, Christian Brauner wrote: > > >>>>>> On Mon, Dec 04, 2023 at 03:09:44PM -0700, Jens Axboe wrote: > > >>>>>>> On 12/4/23 2:02 PM, NeilBrown wrote: > > >>>>>>>> It isn't clear to me what _GPL is appropriate, but maybe the rules > > >>>>>>>> changed since last I looked..... are there rules? > > >>>>>>>> > > >>>>>>>> My reasoning was that the call is effectively part of the user-space > > >>>>>>>> ABI. A user-space process can call this trivially by invoking any > > >>>>>>>> system call. The user-space ABI is explicitly a boundary which the GPL > > >>>>>>>> does not cross. So it doesn't seem appropriate to prevent non-GPL > > >>>>>>>> kernel code from doing something that non-GPL user-space code can > > >>>>>>>> trivially do. > > >>>>>>> > > >>>>>>> By that reasoning, basically everything in the kernel should be non-GPL > > >>>>>>> marked. And while task_work can get used by the application, it happens > > >>>>>>> only indirectly or implicitly. So I don't think this reasoning is sound > > >>>>>>> at all, it's not an exported ABI or API by itself. > > >>>>>>> > > >>>>>>> For me, the more core of an export it is, the stronger the reason it > > >>>>>>> should be GPL. FWIW, I don't think exporting task_work functionality is > > >>>> > > >>>>>> > > >>>>>> Yeah, I'm not too fond of that part as well. I don't think we want to > > >>>>>> give modules the ability to mess with task work. This is just asking for > > >>>>>> trouble. > > >>>>>> > > >>>>> > > >>>>> Ok, maybe we need to reframe the problem then. > > >>>>> > > >>>>> Currently fput(), and hence filp_close(), take control away from kernel > > >>>>> threads in that they cannot be sure that a "close" has actually > > >>>>> completed. > > >>>>> > > >>>>> This is already a problem for nfsd. When renaming a file, nfsd needs to > > >>>>> ensure any cached "open" that it has on the file is closed (else when > > >>>>> re-exporting an NFS filesystem it can result in a silly-rename). > > >>>>> > > >>>>> nfsd currently handles this case by calling flush_delayed_fput(). I > > >>>>> suspect you are no more happy about exporting that than you are about > > >>>>> exporting task_work_run(), but this solution isn't actually 100% > > >>>>> reliable. If some other thread calls flush_delayed_fput() between nfsd > > >>>>> calling filp_close() and that same nfsd calling flush_delayed_fput(), > > >>>>> then the second flush can return before the first flush (in the other > > >>>>> thread) completes all the work it took on. > > >>>>> > > >>>>> What we really need - both for handling renames and for avoiding > > >>>>> possible memory exhaustion - is for nfsd to be able to reliably wait for > > >>>>> any fput() that it initiated to complete. > > >>>>> > > >>>>> How would you like the VFS to provide that service? > > >>>> > > >>>> Since task_work happens in the context of your task already, why not > > >>>> just have a way to get it stashed into a list when final fput is done? > > >>>> This avoids all of this "let's expose task_work" and using the task list > > >>>> for that, which seems kind of pointless as you're just going to run it > > >>>> later on manually anyway. > > >>>> > > >>>> In semi pseudo code: > > >>>> > > >>>> bool fput_put_ref(struct file *file) > > >>>> { > > >>>> return atomic_dec_and_test(&file->f_count); > > >>>> } > > >>>> > > >>>> void fput(struct file *file) > > >>>> { > > >>>> if (fput_put_ref(file)) { > > >>>> ... > > >>>> } > > >>>> } > > >>>> > > >>>> and then your nfsd_file_free() could do: > > >>>> > > >>>> ret = filp_flush(file, id); > > >>>> if (fput_put_ref(file)) > > >>>> llist_add(&file->f_llist, &l->to_free_llist); > > >>>> > > >>>> or something like that, where l->to_free_llist is where ever you'd > > >>>> otherwise punt the actual freeing to. > > >>> > > >>> Should probably have the put_ref or whatever helper also init the > > >>> task_work, and then reuse the list in the callback_head there. Then > > >>> whoever flushes it has to call ->func() and avoid exposing ____fput() to > > >>> random users. But you get the idea. > > >> > > >> Interesting ideas - thanks. > > >> > > >> So maybe the new API would be > > >> > > >> fput_queued(struct file *f, struct llist_head *q) > > >> and > > >> flush_fput_queue(struct llist_head *q) > > >> > > >> with the meaning being that fput_queued() is just like fput() except > > >> that any file needing __fput() is added to the 'q'; and that > > >> flush_fput_queue() calls __fput() on any files in 'q'. > > >> > > >> So to close a file nfsd would: > > >> > > >> fget(f); > > >> flip_close(f); > > >> fput_queued(f, &my_queue); > > >> > > >> though possibly we could have a > > >> filp_close_queued(f, q) > > >> as well. > > >> > > >> I'll try that out - but am happy to hear alternate suggestions for names :-) > > >> > > > > > > Actually .... I'm beginning to wonder if we should just use > > > __fput_sync() in nfsd. > > > It has a big warning about not doing that blindly, but the detail in the > > > warning doesn't seem to apply to nfsd... > > > > If you can do it from the context where you do the filp_close() right > > now, then yeah there's no reason to over-complicate this at all... FWIW, > > As long as nfsd doesn't care that it may get stuck on umount or > ->release... I think we do *care* about getting stuck. But I don't think we would *expect* to get stuck.. I had a look at varous ->release function. Quite few do fsync or similar which isn't a problem. nfsd often waits for writes to complete. Some lock the inode, which again is something that nfsd threads often do. Is there something special that ->release might do but that other filesystem operation don't do? I'd really like to understand why __fput is so special that we often queue it to a separate thread. > > > the reason task_work exists is just to ensure a clean context to perform > > these operations from the task itself. The more I think about it, it > > doesn't make a lot of sense to utilize it for this purpose, which is > > where my alternate suggestion came from. But if you can just call it > > directly, then that makes everything much easier. > > And for better or worse we already expose __fput_sync(). We've recently > switched close(2) over to it as well as it was needlessly punting to > task work. > exit_files() would be another good candidate for using __fput_sync(). Oleg Nesterov has reported problems when a process which a large number of files exits - this currently puts lots of entries on the task_works lists. If task_work_cancel is then called before those are all dealt with, it can have a long list to search while holding a hot lock. (I hope I got that description right). Thanks, NeilBrown
diff --git a/fs/file_table.c b/fs/file_table.c index ee21b3da9d08..d36cade6e366 100644 --- a/fs/file_table.c +++ b/fs/file_table.c @@ -435,7 +435,8 @@ void fput(struct file *file) if (atomic_long_dec_and_test(&file->f_count)) { struct task_struct *task = current; - if (likely(!in_interrupt() && !(task->flags & PF_KTHREAD))) { + if (likely(!in_interrupt() && + (task->flags & PF_RUNS_TASK_WORK))) { init_task_work(&file->f_rcuhead, ____fput); if (!task_work_add(task, &file->f_rcuhead, TWA_RESUME)) return; diff --git a/fs/namespace.c b/fs/namespace.c index e157efc54023..46d640b70ca9 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -1328,7 +1328,7 @@ static void mntput_no_expire(struct mount *mnt) if (likely(!(mnt->mnt.mnt_flags & MNT_INTERNAL))) { struct task_struct *task = current; - if (likely(!(task->flags & PF_KTHREAD))) { + if (likely((task->flags & PF_RUNS_TASK_WORK))) { init_task_work(&mnt->mnt_rcu, __cleanup_mnt); if (!task_work_add(task, &mnt->mnt_rcu, TWA_RESUME)) return; diff --git a/include/linux/sched.h b/include/linux/sched.h index 77f01ac385f7..e4eebac708e7 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1747,7 +1747,7 @@ extern struct pid *cad_pid; * I am cleaning dirty pages from some other bdi. */ #define PF_KTHREAD 0x00200000 /* I am a kernel thread */ #define PF_RANDOMIZE 0x00400000 /* Randomize virtual address space */ -#define PF__HOLE__00800000 0x00800000 +#define PF_RUNS_TASK_WORK 0x00800000 /* Will call task_work_run() periodically */ #define PF__HOLE__01000000 0x01000000 #define PF__HOLE__02000000 0x02000000 #define PF_NO_SETAFFINITY 0x04000000 /* Userland is not allowed to meddle with cpus_mask */ diff --git a/kernel/fork.c b/kernel/fork.c index 3b6d20dfb9a8..d612d8f14861 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -2330,6 +2330,8 @@ __latent_entropy struct task_struct *copy_process( p->flags &= ~PF_KTHREAD; if (args->kthread) p->flags |= PF_KTHREAD; + else + p->flags |= PF_RUNS_TASK_WORK; if (args->user_worker) { /* * Mark us a user worker, and block any signal that isn't diff --git a/kernel/task_work.c b/kernel/task_work.c index 95a7e1b7f1da..aec19876e121 100644 --- a/kernel/task_work.c +++ b/kernel/task_work.c @@ -183,3 +183,4 @@ void task_work_run(void) } while (work); } } +EXPORT_SYMBOL(task_work_run);
User-space processes always call task_work_run() as needed when returning from a system call. Kernel-threads generally do not. Because of this some work that is best run in the task_works context (guaranteed that no locks are held) cannot be queued to task_works from kernel threads and so are queued to a (single) work_time to be managed on a work queue. This means that any cost for doing the work is not imposed on the kernel thread, and importantly excessive amounts of work cannot apply back-pressure to reduce the amount of new work queued. I have evidence from a customer site when nfsd (which runs as kernel threads) is being asked to modify many millions of files which causes sufficient memory pressure that some cache (in XFS I think) gets cleaned earlier than would be ideal. When __dput (from the workqueue) calls __dentry_kill, xfs_fs_destroy_inode() needs to synchronously read back previously cached info from storage. This slows down the single thread that is making all the final __dput() calls for all the nfsd threads with the net result that files are added to the delayed_fput_list faster than they are removed, and the system eventually runs out of memory. This happens because there is no back-pressure: the nfsd isn't forced to slow down when __dput() is slow for any reason. To fix this we can change the nfsd threads to call task_work_run() regularly (much like user-space processes do) and allow it to declare this so that work does get queued to task_works rather than to a work queue. This patch adds a new process flag PF_RUNS_TASK_WORK which is now used instead of PF_KTHREAD to determine whether it is sensible to queue something to task_works. This flag is always set for non-kernel threads. task_work_run() is also exported so that it can be called from a module such as nfsd. Signed-off-by: NeilBrown <neilb@suse.de> --- fs/file_table.c | 3 ++- fs/namespace.c | 2 +- include/linux/sched.h | 2 +- kernel/fork.c | 2 ++ kernel/task_work.c | 1 + 5 files changed, 7 insertions(+), 3 deletions(-)