diff mbox series

[1/2] Allow a kthread to declare that it calls task_work_run()

Message ID 20231204014042.6754-2-neilb@suse.de (mailing list archive)
State New, archived
Headers show
Series Move all file-close work for nfsd into nfsd threads | expand

Commit Message

NeilBrown Dec. 4, 2023, 1:36 a.m. UTC
User-space processes always call task_work_run() as needed when
returning from a system call.  Kernel-threads generally do not.
Because of this some work that is best run in the task_works context
(guaranteed that no locks are held) cannot be queued to task_works from
kernel threads and so are queued to a (single) work_time to be managed
on a work queue.

This means that any cost for doing the work is not imposed on the kernel
thread, and importantly excessive amounts of work cannot apply
back-pressure to reduce the amount of new work queued.

I have evidence from a customer site when nfsd (which runs as kernel
threads) is being asked to modify many millions of files which causes
sufficient memory pressure that some cache (in XFS I think) gets cleaned
earlier than would be ideal.  When __dput (from the workqueue) calls
__dentry_kill, xfs_fs_destroy_inode() needs to synchronously read back
previously cached info from storage.  This slows down the single thread
that is making all the final __dput() calls for all the nfsd threads
with the net result that files are added to the delayed_fput_list faster
than they are removed, and the system eventually runs out of memory.

This happens because there is no back-pressure: the nfsd isn't forced to
slow down when __dput() is slow for any reason.  To fix this we can
change the nfsd threads to call task_work_run() regularly (much like
user-space processes do) and allow it to declare this so that work does
get queued to task_works rather than to a work queue.

This patch adds a new process flag PF_RUNS_TASK_WORK which is now used
instead of PF_KTHREAD to determine whether it is sensible to queue
something to task_works.  This flag is always set for non-kernel threads.

task_work_run() is also exported so that it can be called from a module
such as nfsd.

Signed-off-by: NeilBrown <neilb@suse.de>
---
 fs/file_table.c       | 3 ++-
 fs/namespace.c        | 2 +-
 include/linux/sched.h | 2 +-
 kernel/fork.c         | 2 ++
 kernel/task_work.c    | 1 +
 5 files changed, 7 insertions(+), 3 deletions(-)

Comments

Jens Axboe Dec. 4, 2023, 2:13 a.m. UTC | #1
On 12/3/23 6:36 PM, NeilBrown wrote:
> diff --git a/fs/namespace.c b/fs/namespace.c
> index e157efc54023..46d640b70ca9 100644
> --- a/fs/namespace.c
> +++ b/fs/namespace.c
> @@ -1328,7 +1328,7 @@ static void mntput_no_expire(struct mount *mnt)
>  
>  	if (likely(!(mnt->mnt.mnt_flags & MNT_INTERNAL))) {
>  		struct task_struct *task = current;
> -		if (likely(!(task->flags & PF_KTHREAD))) {
> +		if (likely((task->flags & PF_RUNS_TASK_WORK))) {

Extraneous parens here.

> diff --git a/kernel/task_work.c b/kernel/task_work.c
> index 95a7e1b7f1da..aec19876e121 100644
> --- a/kernel/task_work.c
> +++ b/kernel/task_work.c
> @@ -183,3 +183,4 @@ void task_work_run(void)
>  		} while (work);
>  	}
>  }
> +EXPORT_SYMBOL(task_work_run);

If we're exporting this, then I think that function needs a big
disclaimer on exactly when it is safe to call it. And it most certainly
needs to be a _GPL export.
Matthew Wilcox Dec. 4, 2023, 2:25 a.m. UTC | #2
On Mon, Dec 04, 2023 at 12:36:41PM +1100, NeilBrown wrote:
> +++ b/fs/namespace.c
> @@ -1328,7 +1328,7 @@ static void mntput_no_expire(struct mount *mnt)
>  
>  	if (likely(!(mnt->mnt.mnt_flags & MNT_INTERNAL))) {
>  		struct task_struct *task = current;
> -		if (likely(!(task->flags & PF_KTHREAD))) {
> +		if (likely((task->flags & PF_RUNS_TASK_WORK))) {

You could lose one set of parens here ...

		if (likely(task->flags & PF_RUNS_TASK_WORK)) {

>  #define PF_RANDOMIZE		0x00400000	/* Randomize virtual address space */
> -#define PF__HOLE__00800000	0x00800000
> +#define PF_RUNS_TASK_WORK	0x00800000	/* Will call task_work_run() periodically */

And you could lose "Will" here:

#define PF_RUNS_TASK_WORK    0x00800000      /* Calls task_work_run() periodically */

> diff --git a/kernel/task_work.c b/kernel/task_work.c
> index 95a7e1b7f1da..aec19876e121 100644
> --- a/kernel/task_work.c
> +++ b/kernel/task_work.c
> @@ -183,3 +183,4 @@ void task_work_run(void)
>  		} while (work);
>  	}
>  }
> +EXPORT_SYMBOL(task_work_run);

_GPL?
Al Viro Dec. 4, 2023, 2:40 a.m. UTC | #3
On Mon, Dec 04, 2023 at 12:36:41PM +1100, NeilBrown wrote:

> This means that any cost for doing the work is not imposed on the kernel
> thread, and importantly excessive amounts of work cannot apply
> back-pressure to reduce the amount of new work queued.

It also means that a stuck ->release() won't end up with stuck
kernel thread...

> earlier than would be ideal.  When __dput (from the workqueue) calls

WTF is that __dput thing?  __fput, perhaps?

> This patch adds a new process flag PF_RUNS_TASK_WORK which is now used
> instead of PF_KTHREAD to determine whether it is sensible to queue
> something to task_works.  This flag is always set for non-kernel threads.

*ugh*

What's that flag for?  task_work_add() always can fail; any caller must
have a fallback to cope with that possibility; fput() certainly does.

Just have the kernel threads born with ->task_works set to &work_exited
and provide a primitive that would flip it from that to NULL.

> @@ -1328,7 +1328,7 @@ static void mntput_no_expire(struct mount *mnt)
>  
>  	if (likely(!(mnt->mnt.mnt_flags & MNT_INTERNAL))) {
>  		struct task_struct *task = current;
> -		if (likely(!(task->flags & PF_KTHREAD))) {
> +		if (likely((task->flags & PF_RUNS_TASK_WORK))) {
>  			init_task_work(&mnt->mnt_rcu, __cleanup_mnt);
>  			if (!task_work_add(task, &mnt->mnt_rcu, TWA_RESUME))
>  				return;

Now, *that* is something I have much stronger objections to.
Stuck filesystem shutdown is far more likely than stuck
->release().  You are seriously asking for trouble here.

Why would you want to have nfsd block on that?
Oleg Nesterov Dec. 4, 2023, 4:12 p.m. UTC | #4
I am sick and can't read emails, just one note

On 12/04, Al Viro wrote:
>
> Just have the kernel threads born with ->task_works set to &work_exited

Then irq_thread()->task_work_add() will silently fail,

> and provide a primitive that would flip it from that to NULL.

OK, so this change should update irq_thread(). But what else can fail?

And what if another kthread uses task_work_add(current) to add the
desctructor and does fput() without task_work_run() ?

Oleg.
NeilBrown Dec. 4, 2023, 9:02 p.m. UTC | #5
On Mon, 04 Dec 2023, Jens Axboe wrote:
> On 12/3/23 6:36 PM, NeilBrown wrote:
> > diff --git a/fs/namespace.c b/fs/namespace.c
> > index e157efc54023..46d640b70ca9 100644
> > --- a/fs/namespace.c
> > +++ b/fs/namespace.c
> > @@ -1328,7 +1328,7 @@ static void mntput_no_expire(struct mount *mnt)
> >  
> >  	if (likely(!(mnt->mnt.mnt_flags & MNT_INTERNAL))) {
> >  		struct task_struct *task = current;
> > -		if (likely(!(task->flags & PF_KTHREAD))) {
> > +		if (likely((task->flags & PF_RUNS_TASK_WORK))) {
> 
> Extraneous parens here.

Thanks - and thanks to Matthew Wilcox too.   Fixed.

> 
> > diff --git a/kernel/task_work.c b/kernel/task_work.c
> > index 95a7e1b7f1da..aec19876e121 100644
> > --- a/kernel/task_work.c
> > +++ b/kernel/task_work.c
> > @@ -183,3 +183,4 @@ void task_work_run(void)
> >  		} while (work);
> >  	}
> >  }
> > +EXPORT_SYMBOL(task_work_run);
> 
> If we're exporting this, then I think that function needs a big
> disclaimer on exactly when it is safe to call it. And it most certainly
> needs to be a _GPL export.

I've added

 * Can be used by a kernel thread but only when no locks are held and the
 * thread won't be waited for by other code that might hold locks.  It
 * can be useful in the top-level loop of a file-serving thread to ensure
 * files get closed promptly.

to the documentation comment.
It isn't clear to me what _GPL is appropriate, but maybe the rules
changed since last I looked..... are there rules?

My reasoning was that the call is effectively part of the user-space
ABI.  A user-space process can call this trivially by invoking any
system call.  The user-space ABI is explicitly a boundary which the GPL
does not cross.  So it doesn't seem appropriate to prevent non-GPL
kernel code from doing something that non-GPL user-space code can
trivially do.

But if there are other strong opinions, or clearly documented rules that
contradict my opinion, I have not problem with adding _GPL.

Thanks,
NeilBrown


> 
> -- 
> Jens Axboe
> 
>
NeilBrown Dec. 4, 2023, 9:04 p.m. UTC | #6
On Mon, 04 Dec 2023, Matthew Wilcox wrote:
> On Mon, Dec 04, 2023 at 12:36:41PM +1100, NeilBrown wrote:
> > +++ b/fs/namespace.c
> > @@ -1328,7 +1328,7 @@ static void mntput_no_expire(struct mount *mnt)
> >  
> >  	if (likely(!(mnt->mnt.mnt_flags & MNT_INTERNAL))) {
> >  		struct task_struct *task = current;
> > -		if (likely(!(task->flags & PF_KTHREAD))) {
> > +		if (likely((task->flags & PF_RUNS_TASK_WORK))) {
> 
> You could lose one set of parens here ...
> 
> 		if (likely(task->flags & PF_RUNS_TASK_WORK)) {

Done.

> 
> >  #define PF_RANDOMIZE		0x00400000	/* Randomize virtual address space */
> > -#define PF__HOLE__00800000	0x00800000
> > +#define PF_RUNS_TASK_WORK	0x00800000	/* Will call task_work_run() periodically */
> 
> And you could lose "Will" here:
> 
> #define PF_RUNS_TASK_WORK    0x00800000      /* Calls task_work_run() periodically */

Better - thanks.


> 
> > diff --git a/kernel/task_work.c b/kernel/task_work.c
> > index 95a7e1b7f1da..aec19876e121 100644
> > --- a/kernel/task_work.c
> > +++ b/kernel/task_work.c
> > @@ -183,3 +183,4 @@ void task_work_run(void)
> >  		} while (work);
> >  	}
> >  }
> > +EXPORT_SYMBOL(task_work_run);
> 
> _GPL?

Justification?

Thanks,
NeilBrown
NeilBrown Dec. 4, 2023, 9:20 p.m. UTC | #7
On Mon, 04 Dec 2023, Al Viro wrote:
> On Mon, Dec 04, 2023 at 12:36:41PM +1100, NeilBrown wrote:
> 
> > This means that any cost for doing the work is not imposed on the kernel
> > thread, and importantly excessive amounts of work cannot apply
> > back-pressure to reduce the amount of new work queued.
> 
> It also means that a stuck ->release() won't end up with stuck
> kernel thread...

Is a stuck kernel thread any worse than a stuck user-space thread?

> 
> > earlier than would be ideal.  When __dput (from the workqueue) calls
> 
> WTF is that __dput thing?  __fput, perhaps?

Either __fput or dput :-)
->release isn't the problem that I am seeing.
The call trace that I see causing problems is
__fput -> dput -> dentry_kill -> destroy_inode -> xfs_fs_destroy_inode

so both __fput and dput are there, but most of the code is dput related.
So both "put"s were swimming in by brain and the wrong combination came
out.
I changed it to __fput - thanks.

> 
> > This patch adds a new process flag PF_RUNS_TASK_WORK which is now used
> > instead of PF_KTHREAD to determine whether it is sensible to queue
> > something to task_works.  This flag is always set for non-kernel threads.
> 
> *ugh*
> 
> What's that flag for?  task_work_add() always can fail; any caller must
> have a fallback to cope with that possibility; fput() certainly does.

As Oleg pointed out, all threads including kernel threads call
task_work_run() at exit, and some kernel threads depend on this.  So
disabling task_work_add() early for all kernel threads would break
things.

Currently task_work_add() fails only once the process has started
exiting.  Only code that might run during the exit handling need check.

> 
> Just have the kernel threads born with ->task_works set to &work_exited
> and provide a primitive that would flip it from that to NULL.
> 
> > @@ -1328,7 +1328,7 @@ static void mntput_no_expire(struct mount *mnt)
> >  
> >  	if (likely(!(mnt->mnt.mnt_flags & MNT_INTERNAL))) {
> >  		struct task_struct *task = current;
> > -		if (likely(!(task->flags & PF_KTHREAD))) {
> > +		if (likely((task->flags & PF_RUNS_TASK_WORK))) {
> >  			init_task_work(&mnt->mnt_rcu, __cleanup_mnt);
> >  			if (!task_work_add(task, &mnt->mnt_rcu, TWA_RESUME))
> >  				return;
> 
> Now, *that* is something I have much stronger objections to.
> Stuck filesystem shutdown is far more likely than stuck
> ->release().  You are seriously asking for trouble here.
> 
> Why would you want to have nfsd block on that?
> 

I don't *want* nfsd block on that, but I don't care if it does.  nfsd
will only call task_work_run() at a safe time.  This is no different to
user-space processes only calling task_work_run() at a safe time.

The new flag isn't "I_AM_NFSD" or "QUEUE_FPUT_WORK_TO_TASK".  It is
"RUNS_TASK_WORK".  So any code that would prefer to call task_work_add()
but has a fall-back for tasks that don't call run_task_work() should
test the new flag.  Doing otherwise would be inconsistent and
potentially confusing.

I don't think that nfsd getting stuck would be any worse than systemd
getting stuck, or automount getting stuck, or udiskd getting stuck.

Thanks,
NeilBrown
Jens Axboe Dec. 4, 2023, 10:09 p.m. UTC | #8
On 12/4/23 2:02 PM, NeilBrown wrote:
> It isn't clear to me what _GPL is appropriate, but maybe the rules
> changed since last I looked..... are there rules?
> 
> My reasoning was that the call is effectively part of the user-space
> ABI.  A user-space process can call this trivially by invoking any
> system call.  The user-space ABI is explicitly a boundary which the GPL
> does not cross.  So it doesn't seem appropriate to prevent non-GPL
> kernel code from doing something that non-GPL user-space code can
> trivially do.

By that reasoning, basically everything in the kernel should be non-GPL
marked. And while task_work can get used by the application, it happens
only indirectly or implicitly. So I don't think this reasoning is sound
at all, it's not an exported ABI or API by itself.

For me, the more core of an export it is, the stronger the reason it
should be GPL. FWIW, I don't think exporting task_work functionality is
a good idea in the first place, but if there's a strong reason to do so,
it should most certainly not be accessible to non-GPL modules. Basically
NO new export should be non-GPL.
NeilBrown Dec. 4, 2023, 10:27 p.m. UTC | #9
On Tue, 05 Dec 2023, Jens Axboe wrote:
> On 12/4/23 2:02 PM, NeilBrown wrote:
> > It isn't clear to me what _GPL is appropriate, but maybe the rules
> > changed since last I looked..... are there rules?
> > 
> > My reasoning was that the call is effectively part of the user-space
> > ABI.  A user-space process can call this trivially by invoking any
> > system call.  The user-space ABI is explicitly a boundary which the GPL
> > does not cross.  So it doesn't seem appropriate to prevent non-GPL
> > kernel code from doing something that non-GPL user-space code can
> > trivially do.
> 
> By that reasoning, basically everything in the kernel should be non-GPL
> marked. And while task_work can get used by the application, it happens
> only indirectly or implicitly. So I don't think this reasoning is sound
> at all, it's not an exported ABI or API by itself.
> 
> For me, the more core of an export it is, the stronger the reason it
> should be GPL. FWIW, I don't think exporting task_work functionality is
> a good idea in the first place, but if there's a strong reason to do so,
> it should most certainly not be accessible to non-GPL modules. Basically
> NO new export should be non-GPL.

An alternate to exporting task_work_run() might be to call it from
try_to_freeze().  I think that too should only be called from a context
where no locks are held etc.  Obviously try_to_freeze would only call
task_work_run() if PF_RUNS_TASK_WORK were set.
I'm not sure this is a *good* idea, but it is an idea that would avoid
the export.

For now I change the export to _GPL.

Thanks,
NeilBrown
Dave Chinner Dec. 5, 2023, 6:27 a.m. UTC | #10
On Tue, Dec 05, 2023 at 08:20:31AM +1100, NeilBrown wrote:
> On Mon, 04 Dec 2023, Al Viro wrote:
> > On Mon, Dec 04, 2023 at 12:36:41PM +1100, NeilBrown wrote:
> > 
> > > This means that any cost for doing the work is not imposed on the kernel
> > > thread, and importantly excessive amounts of work cannot apply
> > > back-pressure to reduce the amount of new work queued.
> > 
> > It also means that a stuck ->release() won't end up with stuck
> > kernel thread...
> 
> Is a stuck kernel thread any worse than a stuck user-space thread?
> 
> > 
> > > earlier than would be ideal.  When __dput (from the workqueue) calls
> > 
> > WTF is that __dput thing?  __fput, perhaps?
> 
> Either __fput or dput :-)
> ->release isn't the problem that I am seeing.
> The call trace that I see causing problems is
> __fput -> dput -> dentry_kill -> destroy_inode -> xfs_fs_destroy_inode

What problem, exactly, are you having with xfs_fs_destroy_inode()?

-Dave.
Dave Chinner Dec. 5, 2023, 6:41 a.m. UTC | #11
On Mon, Dec 04, 2023 at 12:36:41PM +1100, NeilBrown wrote:
> User-space processes always call task_work_run() as needed when
> returning from a system call.  Kernel-threads generally do not.
> Because of this some work that is best run in the task_works context
> (guaranteed that no locks are held) cannot be queued to task_works from
> kernel threads and so are queued to a (single) work_time to be managed
> on a work queue.
> 
> This means that any cost for doing the work is not imposed on the kernel
> thread, and importantly excessive amounts of work cannot apply
> back-pressure to reduce the amount of new work queued.
> 
> I have evidence from a customer site when nfsd (which runs as kernel
> threads) is being asked to modify many millions of files which causes
> sufficient memory pressure that some cache (in XFS I think) gets cleaned
> earlier than would be ideal.  When __dput (from the workqueue) calls
> __dentry_kill, xfs_fs_destroy_inode() needs to synchronously read back
> previously cached info from storage.

We fixed that specific XFS problem in 5.9.

https://lore.kernel.org/linux-xfs/20200622081605.1818434-1-david@fromorbit.com/

Can you reproduce these issues on a current TOT kernel?

If not, there's no bugs to fix in the upstream kernel. If you can,
then we've got more XFS issues to work through and fix. 

Fundamentally, though, we should not be papering over an XFS issue
by changing how core task_work infrastructure is used. So let's deal
with the XFS issue first....

-Dave.
NeilBrown Dec. 5, 2023, 8:48 a.m. UTC | #12
On Tue, 05 Dec 2023, Dave Chinner wrote:
> On Mon, Dec 04, 2023 at 12:36:41PM +1100, NeilBrown wrote:
> > User-space processes always call task_work_run() as needed when
> > returning from a system call.  Kernel-threads generally do not.
> > Because of this some work that is best run in the task_works context
> > (guaranteed that no locks are held) cannot be queued to task_works from
> > kernel threads and so are queued to a (single) work_time to be managed
> > on a work queue.
> > 
> > This means that any cost for doing the work is not imposed on the kernel
> > thread, and importantly excessive amounts of work cannot apply
> > back-pressure to reduce the amount of new work queued.
> > 
> > I have evidence from a customer site when nfsd (which runs as kernel
> > threads) is being asked to modify many millions of files which causes
> > sufficient memory pressure that some cache (in XFS I think) gets cleaned
> > earlier than would be ideal.  When __dput (from the workqueue) calls
> > __dentry_kill, xfs_fs_destroy_inode() needs to synchronously read back
> > previously cached info from storage.
> 
> We fixed that specific XFS problem in 5.9.
> 
> https://lore.kernel.org/linux-xfs/20200622081605.1818434-1-david@fromorbit.com/

Good to know - thanks.

> 
> Can you reproduce these issues on a current TOT kernel?

I haven't tried.  I don't know if I know enough details of the work load
to attempt it.

> 
> If not, there's no bugs to fix in the upstream kernel. If you can,
> then we've got more XFS issues to work through and fix. 
> 
> Fundamentally, though, we should not be papering over an XFS issue
> by changing how core task_work infrastructure is used. So let's deal
> with the XFS issue first....

I disagree.  This customer experience has demonstrated both a bug in XFS
and bug in the interaction between fput, task_work, and nfsd.

If a bug in a filesystem that only causes a modest performance impact
when used through the syscall API can bring the system to its knees
through memory exhaustion when used by nfsd, then that is a robustness
issue for nfsd.

I want to fix that robustness issue so that unusual behaviour in
filesystems does not cause out-of-proportion bad behaviour in nfsd.

I highlighted this in the cover letter to the first version of my patch:

https://lore.kernel.org/all/170112272125.7109.6245462722883333440@noble.neil.brown.name/

  While this might point to a problem with the filesystem not handling the
  final close efficiently, such problems should only hurt throughput, not
  lead to memory exhaustion.

Thanks,
NeilBrown


> 
> -Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
>
Christian Brauner Dec. 5, 2023, 11:14 a.m. UTC | #13
On Mon, Dec 04, 2023 at 03:09:44PM -0700, Jens Axboe wrote:
> On 12/4/23 2:02 PM, NeilBrown wrote:
> > It isn't clear to me what _GPL is appropriate, but maybe the rules
> > changed since last I looked..... are there rules?
> > 
> > My reasoning was that the call is effectively part of the user-space
> > ABI.  A user-space process can call this trivially by invoking any
> > system call.  The user-space ABI is explicitly a boundary which the GPL
> > does not cross.  So it doesn't seem appropriate to prevent non-GPL
> > kernel code from doing something that non-GPL user-space code can
> > trivially do.
> 
> By that reasoning, basically everything in the kernel should be non-GPL
> marked. And while task_work can get used by the application, it happens
> only indirectly or implicitly. So I don't think this reasoning is sound
> at all, it's not an exported ABI or API by itself.
> 
> For me, the more core of an export it is, the stronger the reason it
> should be GPL. FWIW, I don't think exporting task_work functionality is
> a good idea in the first place, but if there's a strong reason to do so,

Yeah, I'm not too fond of that part as well. I don't think we want to
give modules the ability to mess with task work. This is just asking for
trouble.
Christian Brauner Dec. 5, 2023, 11:25 a.m. UTC | #14
On Mon, Dec 04, 2023 at 12:36:41PM +1100, NeilBrown wrote:
> User-space processes always call task_work_run() as needed when
> returning from a system call.  Kernel-threads generally do not.
> Because of this some work that is best run in the task_works context
> (guaranteed that no locks are held) cannot be queued to task_works from
> kernel threads and so are queued to a (single) work_time to be managed
> on a work queue.
> 
> This means that any cost for doing the work is not imposed on the kernel
> thread, and importantly excessive amounts of work cannot apply
> back-pressure to reduce the amount of new work queued.
> 
> I have evidence from a customer site when nfsd (which runs as kernel
> threads) is being asked to modify many millions of files which causes
> sufficient memory pressure that some cache (in XFS I think) gets cleaned
> earlier than would be ideal.  When __dput (from the workqueue) calls
> __dentry_kill, xfs_fs_destroy_inode() needs to synchronously read back
> previously cached info from storage.  This slows down the single thread
> that is making all the final __dput() calls for all the nfsd threads
> with the net result that files are added to the delayed_fput_list faster
> than they are removed, and the system eventually runs out of memory.
> 
> This happens because there is no back-pressure: the nfsd isn't forced to
> slow down when __dput() is slow for any reason.  To fix this we can
> change the nfsd threads to call task_work_run() regularly (much like
> user-space processes do) and allow it to declare this so that work does
> get queued to task_works rather than to a work queue.
> 
> This patch adds a new process flag PF_RUNS_TASK_WORK which is now used
> instead of PF_KTHREAD to determine whether it is sensible to queue
> something to task_works.  This flag is always set for non-kernel threads.
> 
> task_work_run() is also exported so that it can be called from a module
> such as nfsd.
> 
> Signed-off-by: NeilBrown <neilb@suse.de>
> ---

The thing that bugs me the most about this is that we expose task work
infrastructure to modules which I think is a really bad idea. File
handling code brings so many driver to their knees and now we're handing
them another footgun.

I'm not per se opposed to all of this but is this really what the other
NFS maintainers want to switch to as well? And is this really that badly
needed and that common that we want to go down that road? I wouldn't
mind not having to do all this if we can get by via other means.

>  fs/file_table.c       | 3 ++-
>  fs/namespace.c        | 2 +-
>  include/linux/sched.h | 2 +-
>  kernel/fork.c         | 2 ++
>  kernel/task_work.c    | 1 +
>  5 files changed, 7 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/file_table.c b/fs/file_table.c
> index ee21b3da9d08..d36cade6e366 100644
> --- a/fs/file_table.c
> +++ b/fs/file_table.c
> @@ -435,7 +435,8 @@ void fput(struct file *file)
>  	if (atomic_long_dec_and_test(&file->f_count)) {
>  		struct task_struct *task = current;
>  
> -		if (likely(!in_interrupt() && !(task->flags & PF_KTHREAD))) {
> +		if (likely(!in_interrupt() &&
> +			   (task->flags & PF_RUNS_TASK_WORK))) {
>  			init_task_work(&file->f_rcuhead, ____fput);
>  			if (!task_work_add(task, &file->f_rcuhead, TWA_RESUME))
>  				return;
> diff --git a/fs/namespace.c b/fs/namespace.c
> index e157efc54023..46d640b70ca9 100644
> --- a/fs/namespace.c
> +++ b/fs/namespace.c
> @@ -1328,7 +1328,7 @@ static void mntput_no_expire(struct mount *mnt)
>  
>  	if (likely(!(mnt->mnt.mnt_flags & MNT_INTERNAL))) {
>  		struct task_struct *task = current;
> -		if (likely(!(task->flags & PF_KTHREAD))) {
> +		if (likely((task->flags & PF_RUNS_TASK_WORK))) {
>  			init_task_work(&mnt->mnt_rcu, __cleanup_mnt);
>  			if (!task_work_add(task, &mnt->mnt_rcu, TWA_RESUME))
>  				return;
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 77f01ac385f7..e4eebac708e7 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1747,7 +1747,7 @@ extern struct pid *cad_pid;
>  						 * I am cleaning dirty pages from some other bdi. */
>  #define PF_KTHREAD		0x00200000	/* I am a kernel thread */
>  #define PF_RANDOMIZE		0x00400000	/* Randomize virtual address space */
> -#define PF__HOLE__00800000	0x00800000
> +#define PF_RUNS_TASK_WORK	0x00800000	/* Will call task_work_run() periodically */

The flag seems better to me than just relying on exit_work as itt's
easier to reason about.

>  #define PF__HOLE__01000000	0x01000000
>  #define PF__HOLE__02000000	0x02000000
>  #define PF_NO_SETAFFINITY	0x04000000	/* Userland is not allowed to meddle with cpus_mask */
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 3b6d20dfb9a8..d612d8f14861 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -2330,6 +2330,8 @@ __latent_entropy struct task_struct *copy_process(
>  	p->flags &= ~PF_KTHREAD;
>  	if (args->kthread)
>  		p->flags |= PF_KTHREAD;
> +	else
> +		p->flags |= PF_RUNS_TASK_WORK;
>  	if (args->user_worker) {
>  		/*
>  		 * Mark us a user worker, and block any signal that isn't
> diff --git a/kernel/task_work.c b/kernel/task_work.c
> index 95a7e1b7f1da..aec19876e121 100644
> --- a/kernel/task_work.c
> +++ b/kernel/task_work.c
> @@ -183,3 +183,4 @@ void task_work_run(void)
>  		} while (work);
>  	}
>  }
> +EXPORT_SYMBOL(task_work_run);
> -- 
> 2.43.0
>
Christian Brauner Dec. 5, 2023, 11:29 a.m. UTC | #15
On Tue, Dec 05, 2023 at 07:48:20PM +1100, NeilBrown wrote:
> On Tue, 05 Dec 2023, Dave Chinner wrote:
> > On Mon, Dec 04, 2023 at 12:36:41PM +1100, NeilBrown wrote:
> > > User-space processes always call task_work_run() as needed when
> > > returning from a system call.  Kernel-threads generally do not.
> > > Because of this some work that is best run in the task_works context
> > > (guaranteed that no locks are held) cannot be queued to task_works from
> > > kernel threads and so are queued to a (single) work_time to be managed
> > > on a work queue.
> > > 
> > > This means that any cost for doing the work is not imposed on the kernel
> > > thread, and importantly excessive amounts of work cannot apply
> > > back-pressure to reduce the amount of new work queued.
> > > 
> > > I have evidence from a customer site when nfsd (which runs as kernel
> > > threads) is being asked to modify many millions of files which causes
> > > sufficient memory pressure that some cache (in XFS I think) gets cleaned
> > > earlier than would be ideal.  When __dput (from the workqueue) calls
> > > __dentry_kill, xfs_fs_destroy_inode() needs to synchronously read back
> > > previously cached info from storage.
> > 
> > We fixed that specific XFS problem in 5.9.
> > 
> > https://lore.kernel.org/linux-xfs/20200622081605.1818434-1-david@fromorbit.com/
> 
> Good to know - thanks.
> 
> > 
> > Can you reproduce these issues on a current TOT kernel?
> 
> I haven't tried.  I don't know if I know enough details of the work load
> to attempt it.
> 
> > 
> > If not, there's no bugs to fix in the upstream kernel. If you can,
> > then we've got more XFS issues to work through and fix. 
> > 
> > Fundamentally, though, we should not be papering over an XFS issue
> > by changing how core task_work infrastructure is used. So let's deal
> > with the XFS issue first....
> 
> I disagree.  This customer experience has demonstrated both a bug in XFS
> and bug in the interaction between fput, task_work, and nfsd.
> 
> If a bug in a filesystem that only causes a modest performance impact
> when used through the syscall API can bring the system to its knees
> through memory exhaustion when used by nfsd, then that is a robustness
> issue for nfsd.
> 
> I want to fix that robustness issue so that unusual behaviour in
> filesystems does not cause out-of-proportion bad behaviour in nfsd.
> 
> I highlighted this in the cover letter to the first version of my patch:
> 
> https://lore.kernel.org/all/170112272125.7109.6245462722883333440@noble.neil.brown.name/
> 
>   While this might point to a problem with the filesystem not handling the
>   final close efficiently, such problems should only hurt throughput, not
>   lead to memory exhaustion.

I'm still confused about this memory exhaustion claim?
If this is a filesystem problem it's pretty annoying that we have to
work around it by exposing task work to random modules.
Jeff Layton Dec. 5, 2023, 2:06 p.m. UTC | #16
On Tue, 2023-12-05 at 12:14 +0100, Christian Brauner wrote:
> On Mon, Dec 04, 2023 at 03:09:44PM -0700, Jens Axboe wrote:
> > On 12/4/23 2:02 PM, NeilBrown wrote:
> > > It isn't clear to me what _GPL is appropriate, but maybe the rules
> > > changed since last I looked..... are there rules?
> > > 
> > > My reasoning was that the call is effectively part of the user-space
> > > ABI.  A user-space process can call this trivially by invoking any
> > > system call.  The user-space ABI is explicitly a boundary which the GPL
> > > does not cross.  So it doesn't seem appropriate to prevent non-GPL
> > > kernel code from doing something that non-GPL user-space code can
> > > trivially do.
> > 
> > By that reasoning, basically everything in the kernel should be non-GPL
> > marked. And while task_work can get used by the application, it happens
> > only indirectly or implicitly. So I don't think this reasoning is sound
> > at all, it's not an exported ABI or API by itself.
> > 
> > For me, the more core of an export it is, the stronger the reason it
> > should be GPL. FWIW, I don't think exporting task_work functionality is
> > a good idea in the first place, but if there's a strong reason to do so,
> 
> Yeah, I'm not too fond of that part as well. I don't think we want to
> give modules the ability to mess with task work. This is just asking for
> trouble.

The fact that nfsd has to queue all of the delayed fput activity to a
workqueue has always been a horrible hack though. We export all kinds of
functionality to modules that you can screw up.

I think that nfsd's use-case is legitimate. ksmbd may also want to
follow suit.
Chuck Lever Dec. 5, 2023, 2:23 p.m. UTC | #17
On Tue, Dec 05, 2023 at 12:25:40PM +0100, Christian Brauner wrote:
> On Mon, Dec 04, 2023 at 12:36:41PM +1100, NeilBrown wrote:
> > User-space processes always call task_work_run() as needed when
> > returning from a system call.  Kernel-threads generally do not.
> > Because of this some work that is best run in the task_works context
> > (guaranteed that no locks are held) cannot be queued to task_works from
> > kernel threads and so are queued to a (single) work_time to be managed
> > on a work queue.
> > 
> > This means that any cost for doing the work is not imposed on the kernel
> > thread, and importantly excessive amounts of work cannot apply
> > back-pressure to reduce the amount of new work queued.
> > 
> > I have evidence from a customer site when nfsd (which runs as kernel
> > threads) is being asked to modify many millions of files which causes
> > sufficient memory pressure that some cache (in XFS I think) gets cleaned
> > earlier than would be ideal.  When __dput (from the workqueue) calls
> > __dentry_kill, xfs_fs_destroy_inode() needs to synchronously read back
> > previously cached info from storage.  This slows down the single thread
> > that is making all the final __dput() calls for all the nfsd threads
> > with the net result that files are added to the delayed_fput_list faster
> > than they are removed, and the system eventually runs out of memory.
> > 
> > This happens because there is no back-pressure: the nfsd isn't forced to
> > slow down when __dput() is slow for any reason.  To fix this we can
> > change the nfsd threads to call task_work_run() regularly (much like
> > user-space processes do) and allow it to declare this so that work does
> > get queued to task_works rather than to a work queue.
> > 
> > This patch adds a new process flag PF_RUNS_TASK_WORK which is now used
> > instead of PF_KTHREAD to determine whether it is sensible to queue
> > something to task_works.  This flag is always set for non-kernel threads.
> > 
> > task_work_run() is also exported so that it can be called from a module
> > such as nfsd.
> > 
> > Signed-off-by: NeilBrown <neilb@suse.de>
> > ---
> 
> The thing that bugs me the most about this is that we expose task work
> infrastructure to modules which I think is a really bad idea. File
> handling code brings so many driver to their knees and now we're handing
> them another footgun.
> 
> I'm not per se opposed to all of this but is this really what the other
> NFS maintainers want to switch to as well? And is this really that badly
> needed and that common that we want to go down that road? I wouldn't
> mind not having to do all this if we can get by via other means.

The problem of slow flushing during close is not limited to XFS or
any particular underlying file system. Sometimes it is due to
performance or scalability bugs, but sometimes it's just a slow
storage stack on the NFS server (eg NFS re-export).

One slow synchronous flush in a single-threaded queue will result
in head-of-queue blocking. That is something that needs to be
addressed (IMHO, first).

Adding back pressure on NFS clients when NFSD is not able to get
dirty data onto durable storage fast enough is a long term solution,
but it's probably a heavier lift. I'm not wedded to using task_work
to do that, but it does seem to fit the problem at hand.


> >  fs/file_table.c       | 3 ++-
> >  fs/namespace.c        | 2 +-
> >  include/linux/sched.h | 2 +-
> >  kernel/fork.c         | 2 ++
> >  kernel/task_work.c    | 1 +
> >  5 files changed, 7 insertions(+), 3 deletions(-)
> > 
> > diff --git a/fs/file_table.c b/fs/file_table.c
> > index ee21b3da9d08..d36cade6e366 100644
> > --- a/fs/file_table.c
> > +++ b/fs/file_table.c
> > @@ -435,7 +435,8 @@ void fput(struct file *file)
> >  	if (atomic_long_dec_and_test(&file->f_count)) {
> >  		struct task_struct *task = current;
> >  
> > -		if (likely(!in_interrupt() && !(task->flags & PF_KTHREAD))) {
> > +		if (likely(!in_interrupt() &&
> > +			   (task->flags & PF_RUNS_TASK_WORK))) {
> >  			init_task_work(&file->f_rcuhead, ____fput);
> >  			if (!task_work_add(task, &file->f_rcuhead, TWA_RESUME))
> >  				return;
> > diff --git a/fs/namespace.c b/fs/namespace.c
> > index e157efc54023..46d640b70ca9 100644
> > --- a/fs/namespace.c
> > +++ b/fs/namespace.c
> > @@ -1328,7 +1328,7 @@ static void mntput_no_expire(struct mount *mnt)
> >  
> >  	if (likely(!(mnt->mnt.mnt_flags & MNT_INTERNAL))) {
> >  		struct task_struct *task = current;
> > -		if (likely(!(task->flags & PF_KTHREAD))) {
> > +		if (likely((task->flags & PF_RUNS_TASK_WORK))) {
> >  			init_task_work(&mnt->mnt_rcu, __cleanup_mnt);
> >  			if (!task_work_add(task, &mnt->mnt_rcu, TWA_RESUME))
> >  				return;
> > diff --git a/include/linux/sched.h b/include/linux/sched.h
> > index 77f01ac385f7..e4eebac708e7 100644
> > --- a/include/linux/sched.h
> > +++ b/include/linux/sched.h
> > @@ -1747,7 +1747,7 @@ extern struct pid *cad_pid;
> >  						 * I am cleaning dirty pages from some other bdi. */
> >  #define PF_KTHREAD		0x00200000	/* I am a kernel thread */
> >  #define PF_RANDOMIZE		0x00400000	/* Randomize virtual address space */
> > -#define PF__HOLE__00800000	0x00800000
> > +#define PF_RUNS_TASK_WORK	0x00800000	/* Will call task_work_run() periodically */
> 
> The flag seems better to me than just relying on exit_work as itt's
> easier to reason about.
> 
> >  #define PF__HOLE__01000000	0x01000000
> >  #define PF__HOLE__02000000	0x02000000
> >  #define PF_NO_SETAFFINITY	0x04000000	/* Userland is not allowed to meddle with cpus_mask */
> > diff --git a/kernel/fork.c b/kernel/fork.c
> > index 3b6d20dfb9a8..d612d8f14861 100644
> > --- a/kernel/fork.c
> > +++ b/kernel/fork.c
> > @@ -2330,6 +2330,8 @@ __latent_entropy struct task_struct *copy_process(
> >  	p->flags &= ~PF_KTHREAD;
> >  	if (args->kthread)
> >  		p->flags |= PF_KTHREAD;
> > +	else
> > +		p->flags |= PF_RUNS_TASK_WORK;
> >  	if (args->user_worker) {
> >  		/*
> >  		 * Mark us a user worker, and block any signal that isn't
> > diff --git a/kernel/task_work.c b/kernel/task_work.c
> > index 95a7e1b7f1da..aec19876e121 100644
> > --- a/kernel/task_work.c
> > +++ b/kernel/task_work.c
> > @@ -183,3 +183,4 @@ void task_work_run(void)
> >  		} while (work);
> >  	}
> >  }
> > +EXPORT_SYMBOL(task_work_run);
> > -- 
> > 2.43.0
> >
NeilBrown Dec. 5, 2023, 9:28 p.m. UTC | #18
On Tue, 05 Dec 2023, Christian Brauner wrote:
> On Mon, Dec 04, 2023 at 03:09:44PM -0700, Jens Axboe wrote:
> > On 12/4/23 2:02 PM, NeilBrown wrote:
> > > It isn't clear to me what _GPL is appropriate, but maybe the rules
> > > changed since last I looked..... are there rules?
> > > 
> > > My reasoning was that the call is effectively part of the user-space
> > > ABI.  A user-space process can call this trivially by invoking any
> > > system call.  The user-space ABI is explicitly a boundary which the GPL
> > > does not cross.  So it doesn't seem appropriate to prevent non-GPL
> > > kernel code from doing something that non-GPL user-space code can
> > > trivially do.
> > 
> > By that reasoning, basically everything in the kernel should be non-GPL
> > marked. And while task_work can get used by the application, it happens
> > only indirectly or implicitly. So I don't think this reasoning is sound
> > at all, it's not an exported ABI or API by itself.
> > 
> > For me, the more core of an export it is, the stronger the reason it
> > should be GPL. FWIW, I don't think exporting task_work functionality is
> > a good idea in the first place, but if there's a strong reason to do so,
> 
> Yeah, I'm not too fond of that part as well. I don't think we want to
> give modules the ability to mess with task work. This is just asking for
> trouble.
> 

Ok, maybe we need to reframe the problem then.

Currently fput(), and hence filp_close(), take control away from kernel
threads in that they cannot be sure that a "close" has actually
completed.

This is already a problem for nfsd.  When renaming a file, nfsd needs to
ensure any cached "open" that it has on the file is closed (else when
re-exporting an NFS filesystem it can result in a silly-rename).

nfsd currently handles this case by calling flush_delayed_fput().  I
suspect you are no more happy about exporting that than you are about
exporting task_work_run(), but this solution isn't actually 100%
reliable.  If some other thread calls flush_delayed_fput() between nfsd
calling filp_close() and that same nfsd calling flush_delayed_fput(),
then the second flush can return before the first flush (in the other
thread) completes all the work it took on.

What we really need - both for handling renames and for avoiding
possible memory exhaustion - is for nfsd to be able to reliably wait for
any fput() that it initiated to complete.

How would you like the VFS to provide that service?

Thanks,
NeilBrown
Jens Axboe Dec. 5, 2023, 9:58 p.m. UTC | #19
On 12/5/23 2:28 PM, NeilBrown wrote:
> On Tue, 05 Dec 2023, Christian Brauner wrote:
>> On Mon, Dec 04, 2023 at 03:09:44PM -0700, Jens Axboe wrote:
>>> On 12/4/23 2:02 PM, NeilBrown wrote:
>>>> It isn't clear to me what _GPL is appropriate, but maybe the rules
>>>> changed since last I looked..... are there rules?
>>>>
>>>> My reasoning was that the call is effectively part of the user-space
>>>> ABI.  A user-space process can call this trivially by invoking any
>>>> system call.  The user-space ABI is explicitly a boundary which the GPL
>>>> does not cross.  So it doesn't seem appropriate to prevent non-GPL
>>>> kernel code from doing something that non-GPL user-space code can
>>>> trivially do.
>>>
>>> By that reasoning, basically everything in the kernel should be non-GPL
>>> marked. And while task_work can get used by the application, it happens
>>> only indirectly or implicitly. So I don't think this reasoning is sound
>>> at all, it's not an exported ABI or API by itself.
>>>
>>> For me, the more core of an export it is, the stronger the reason it
>>> should be GPL. FWIW, I don't think exporting task_work functionality is

>>
>> Yeah, I'm not too fond of that part as well. I don't think we want to
>> give modules the ability to mess with task work. This is just asking for
>> trouble.
>>
> 
> Ok, maybe we need to reframe the problem then.
> 
> Currently fput(), and hence filp_close(), take control away from kernel
> threads in that they cannot be sure that a "close" has actually
> completed.
> 
> This is already a problem for nfsd.  When renaming a file, nfsd needs to
> ensure any cached "open" that it has on the file is closed (else when
> re-exporting an NFS filesystem it can result in a silly-rename).
> 
> nfsd currently handles this case by calling flush_delayed_fput().  I
> suspect you are no more happy about exporting that than you are about
> exporting task_work_run(), but this solution isn't actually 100%
> reliable.  If some other thread calls flush_delayed_fput() between nfsd
> calling filp_close() and that same nfsd calling flush_delayed_fput(),
> then the second flush can return before the first flush (in the other
> thread) completes all the work it took on.
> 
> What we really need - both for handling renames and for avoiding
> possible memory exhaustion - is for nfsd to be able to reliably wait for
> any fput() that it initiated to complete.
> 
> How would you like the VFS to provide that service?

Since task_work happens in the context of your task already, why not
just have a way to get it stashed into a list when final fput is done?
This avoids all of this "let's expose task_work" and using the task list
for that, which seems kind of pointless as you're just going to run it
later on manually anyway.

In semi pseudo code:

bool fput_put_ref(struct file *file)
{
	return atomic_dec_and_test(&file->f_count);
}

void fput(struct file *file)
{
	if (fput_put_ref(file)) {
		...
	}
}

and then your nfsd_file_free() could do:

ret = filp_flush(file, id);
if (fput_put_ref(file))
	llist_add(&file->f_llist, &l->to_free_llist);

or something like that, where l->to_free_llist is where ever you'd
otherwise punt the actual freeing to.
Jens Axboe Dec. 5, 2023, 10:03 p.m. UTC | #20
On 12/5/23 2:58 PM, Jens Axboe wrote:
> On 12/5/23 2:28 PM, NeilBrown wrote:
>> On Tue, 05 Dec 2023, Christian Brauner wrote:
>>> On Mon, Dec 04, 2023 at 03:09:44PM -0700, Jens Axboe wrote:
>>>> On 12/4/23 2:02 PM, NeilBrown wrote:
>>>>> It isn't clear to me what _GPL is appropriate, but maybe the rules
>>>>> changed since last I looked..... are there rules?
>>>>>
>>>>> My reasoning was that the call is effectively part of the user-space
>>>>> ABI.  A user-space process can call this trivially by invoking any
>>>>> system call.  The user-space ABI is explicitly a boundary which the GPL
>>>>> does not cross.  So it doesn't seem appropriate to prevent non-GPL
>>>>> kernel code from doing something that non-GPL user-space code can
>>>>> trivially do.
>>>>
>>>> By that reasoning, basically everything in the kernel should be non-GPL
>>>> marked. And while task_work can get used by the application, it happens
>>>> only indirectly or implicitly. So I don't think this reasoning is sound
>>>> at all, it's not an exported ABI or API by itself.
>>>>
>>>> For me, the more core of an export it is, the stronger the reason it
>>>> should be GPL. FWIW, I don't think exporting task_work functionality is
> 
>>>
>>> Yeah, I'm not too fond of that part as well. I don't think we want to
>>> give modules the ability to mess with task work. This is just asking for
>>> trouble.
>>>
>>
>> Ok, maybe we need to reframe the problem then.
>>
>> Currently fput(), and hence filp_close(), take control away from kernel
>> threads in that they cannot be sure that a "close" has actually
>> completed.
>>
>> This is already a problem for nfsd.  When renaming a file, nfsd needs to
>> ensure any cached "open" that it has on the file is closed (else when
>> re-exporting an NFS filesystem it can result in a silly-rename).
>>
>> nfsd currently handles this case by calling flush_delayed_fput().  I
>> suspect you are no more happy about exporting that than you are about
>> exporting task_work_run(), but this solution isn't actually 100%
>> reliable.  If some other thread calls flush_delayed_fput() between nfsd
>> calling filp_close() and that same nfsd calling flush_delayed_fput(),
>> then the second flush can return before the first flush (in the other
>> thread) completes all the work it took on.
>>
>> What we really need - both for handling renames and for avoiding
>> possible memory exhaustion - is for nfsd to be able to reliably wait for
>> any fput() that it initiated to complete.
>>
>> How would you like the VFS to provide that service?
> 
> Since task_work happens in the context of your task already, why not
> just have a way to get it stashed into a list when final fput is done?
> This avoids all of this "let's expose task_work" and using the task list
> for that, which seems kind of pointless as you're just going to run it
> later on manually anyway.
> 
> In semi pseudo code:
> 
> bool fput_put_ref(struct file *file)
> {
> 	return atomic_dec_and_test(&file->f_count);
> }
> 
> void fput(struct file *file)
> {
> 	if (fput_put_ref(file)) {
> 		...
> 	}
> }
> 
> and then your nfsd_file_free() could do:
> 
> ret = filp_flush(file, id);
> if (fput_put_ref(file))
> 	llist_add(&file->f_llist, &l->to_free_llist);
> 
> or something like that, where l->to_free_llist is where ever you'd
> otherwise punt the actual freeing to.

Should probably have the put_ref or whatever helper also init the
task_work, and then reuse the list in the callback_head there. Then
whoever flushes it has to call ->func() and avoid exposing ____fput() to
random users. But you get the idea.
NeilBrown Dec. 5, 2023, 10:16 p.m. UTC | #21
On Wed, 06 Dec 2023, Jens Axboe wrote:
> On 12/5/23 2:58 PM, Jens Axboe wrote:
> > On 12/5/23 2:28 PM, NeilBrown wrote:
> >> On Tue, 05 Dec 2023, Christian Brauner wrote:
> >>> On Mon, Dec 04, 2023 at 03:09:44PM -0700, Jens Axboe wrote:
> >>>> On 12/4/23 2:02 PM, NeilBrown wrote:
> >>>>> It isn't clear to me what _GPL is appropriate, but maybe the rules
> >>>>> changed since last I looked..... are there rules?
> >>>>>
> >>>>> My reasoning was that the call is effectively part of the user-space
> >>>>> ABI.  A user-space process can call this trivially by invoking any
> >>>>> system call.  The user-space ABI is explicitly a boundary which the GPL
> >>>>> does not cross.  So it doesn't seem appropriate to prevent non-GPL
> >>>>> kernel code from doing something that non-GPL user-space code can
> >>>>> trivially do.
> >>>>
> >>>> By that reasoning, basically everything in the kernel should be non-GPL
> >>>> marked. And while task_work can get used by the application, it happens
> >>>> only indirectly or implicitly. So I don't think this reasoning is sound
> >>>> at all, it's not an exported ABI or API by itself.
> >>>>
> >>>> For me, the more core of an export it is, the stronger the reason it
> >>>> should be GPL. FWIW, I don't think exporting task_work functionality is
> > 
> >>>
> >>> Yeah, I'm not too fond of that part as well. I don't think we want to
> >>> give modules the ability to mess with task work. This is just asking for
> >>> trouble.
> >>>
> >>
> >> Ok, maybe we need to reframe the problem then.
> >>
> >> Currently fput(), and hence filp_close(), take control away from kernel
> >> threads in that they cannot be sure that a "close" has actually
> >> completed.
> >>
> >> This is already a problem for nfsd.  When renaming a file, nfsd needs to
> >> ensure any cached "open" that it has on the file is closed (else when
> >> re-exporting an NFS filesystem it can result in a silly-rename).
> >>
> >> nfsd currently handles this case by calling flush_delayed_fput().  I
> >> suspect you are no more happy about exporting that than you are about
> >> exporting task_work_run(), but this solution isn't actually 100%
> >> reliable.  If some other thread calls flush_delayed_fput() between nfsd
> >> calling filp_close() and that same nfsd calling flush_delayed_fput(),
> >> then the second flush can return before the first flush (in the other
> >> thread) completes all the work it took on.
> >>
> >> What we really need - both for handling renames and for avoiding
> >> possible memory exhaustion - is for nfsd to be able to reliably wait for
> >> any fput() that it initiated to complete.
> >>
> >> How would you like the VFS to provide that service?
> > 
> > Since task_work happens in the context of your task already, why not
> > just have a way to get it stashed into a list when final fput is done?
> > This avoids all of this "let's expose task_work" and using the task list
> > for that, which seems kind of pointless as you're just going to run it
> > later on manually anyway.
> > 
> > In semi pseudo code:
> > 
> > bool fput_put_ref(struct file *file)
> > {
> > 	return atomic_dec_and_test(&file->f_count);
> > }
> > 
> > void fput(struct file *file)
> > {
> > 	if (fput_put_ref(file)) {
> > 		...
> > 	}
> > }
> > 
> > and then your nfsd_file_free() could do:
> > 
> > ret = filp_flush(file, id);
> > if (fput_put_ref(file))
> > 	llist_add(&file->f_llist, &l->to_free_llist);
> > 
> > or something like that, where l->to_free_llist is where ever you'd
> > otherwise punt the actual freeing to.
> 
> Should probably have the put_ref or whatever helper also init the
> task_work, and then reuse the list in the callback_head there. Then
> whoever flushes it has to call ->func() and avoid exposing ____fput() to
> random users. But you get the idea.

Interesting ideas - thanks.

So maybe the new API would be

 fput_queued(struct file *f, struct llist_head *q)
and
 flush_fput_queue(struct llist_head *q)

with the meaning being that fput_queued() is just like fput() except
that any file needing __fput() is added to the 'q'; and that
flush_fput_queue() calls __fput() on any files in 'q'.

So to close a file nfsd would:

  fget(f);
  flip_close(f);
  fput_queued(f, &my_queue);

though possibly we could have a
  filp_close_queued(f, q)
as well.

I'll try that out - but am happy to hear alternate suggestions for names :-)

Thanks,
NeilBrown
NeilBrown Dec. 5, 2023, 11:23 p.m. UTC | #22
On Wed, 06 Dec 2023, NeilBrown wrote:
> On Wed, 06 Dec 2023, Jens Axboe wrote:
> > On 12/5/23 2:58 PM, Jens Axboe wrote:
> > > On 12/5/23 2:28 PM, NeilBrown wrote:
> > >> On Tue, 05 Dec 2023, Christian Brauner wrote:
> > >>> On Mon, Dec 04, 2023 at 03:09:44PM -0700, Jens Axboe wrote:
> > >>>> On 12/4/23 2:02 PM, NeilBrown wrote:
> > >>>>> It isn't clear to me what _GPL is appropriate, but maybe the rules
> > >>>>> changed since last I looked..... are there rules?
> > >>>>>
> > >>>>> My reasoning was that the call is effectively part of the user-space
> > >>>>> ABI.  A user-space process can call this trivially by invoking any
> > >>>>> system call.  The user-space ABI is explicitly a boundary which the GPL
> > >>>>> does not cross.  So it doesn't seem appropriate to prevent non-GPL
> > >>>>> kernel code from doing something that non-GPL user-space code can
> > >>>>> trivially do.
> > >>>>
> > >>>> By that reasoning, basically everything in the kernel should be non-GPL
> > >>>> marked. And while task_work can get used by the application, it happens
> > >>>> only indirectly or implicitly. So I don't think this reasoning is sound
> > >>>> at all, it's not an exported ABI or API by itself.
> > >>>>
> > >>>> For me, the more core of an export it is, the stronger the reason it
> > >>>> should be GPL. FWIW, I don't think exporting task_work functionality is
> > > 
> > >>>
> > >>> Yeah, I'm not too fond of that part as well. I don't think we want to
> > >>> give modules the ability to mess with task work. This is just asking for
> > >>> trouble.
> > >>>
> > >>
> > >> Ok, maybe we need to reframe the problem then.
> > >>
> > >> Currently fput(), and hence filp_close(), take control away from kernel
> > >> threads in that they cannot be sure that a "close" has actually
> > >> completed.
> > >>
> > >> This is already a problem for nfsd.  When renaming a file, nfsd needs to
> > >> ensure any cached "open" that it has on the file is closed (else when
> > >> re-exporting an NFS filesystem it can result in a silly-rename).
> > >>
> > >> nfsd currently handles this case by calling flush_delayed_fput().  I
> > >> suspect you are no more happy about exporting that than you are about
> > >> exporting task_work_run(), but this solution isn't actually 100%
> > >> reliable.  If some other thread calls flush_delayed_fput() between nfsd
> > >> calling filp_close() and that same nfsd calling flush_delayed_fput(),
> > >> then the second flush can return before the first flush (in the other
> > >> thread) completes all the work it took on.
> > >>
> > >> What we really need - both for handling renames and for avoiding
> > >> possible memory exhaustion - is for nfsd to be able to reliably wait for
> > >> any fput() that it initiated to complete.
> > >>
> > >> How would you like the VFS to provide that service?
> > > 
> > > Since task_work happens in the context of your task already, why not
> > > just have a way to get it stashed into a list when final fput is done?
> > > This avoids all of this "let's expose task_work" and using the task list
> > > for that, which seems kind of pointless as you're just going to run it
> > > later on manually anyway.
> > > 
> > > In semi pseudo code:
> > > 
> > > bool fput_put_ref(struct file *file)
> > > {
> > > 	return atomic_dec_and_test(&file->f_count);
> > > }
> > > 
> > > void fput(struct file *file)
> > > {
> > > 	if (fput_put_ref(file)) {
> > > 		...
> > > 	}
> > > }
> > > 
> > > and then your nfsd_file_free() could do:
> > > 
> > > ret = filp_flush(file, id);
> > > if (fput_put_ref(file))
> > > 	llist_add(&file->f_llist, &l->to_free_llist);
> > > 
> > > or something like that, where l->to_free_llist is where ever you'd
> > > otherwise punt the actual freeing to.
> > 
> > Should probably have the put_ref or whatever helper also init the
> > task_work, and then reuse the list in the callback_head there. Then
> > whoever flushes it has to call ->func() and avoid exposing ____fput() to
> > random users. But you get the idea.
> 
> Interesting ideas - thanks.
> 
> So maybe the new API would be
> 
>  fput_queued(struct file *f, struct llist_head *q)
> and
>  flush_fput_queue(struct llist_head *q)
> 
> with the meaning being that fput_queued() is just like fput() except
> that any file needing __fput() is added to the 'q'; and that
> flush_fput_queue() calls __fput() on any files in 'q'.
> 
> So to close a file nfsd would:
> 
>   fget(f);
>   flip_close(f);
>   fput_queued(f, &my_queue);
> 
> though possibly we could have a
>   filp_close_queued(f, q)
> as well.
> 
> I'll try that out - but am happy to hear alternate suggestions for names :-)
> 

Actually ....  I'm beginning to wonder if we should just use
__fput_sync() in nfsd.
It has a big warning about not doing that blindly, but the detail in the
warning doesn't seem to apply to nfsd...

NeilBrown
Jens Axboe Dec. 5, 2023, 11:31 p.m. UTC | #23
On 12/5/23 4:23 PM, NeilBrown wrote:
> On Wed, 06 Dec 2023, NeilBrown wrote:
>> On Wed, 06 Dec 2023, Jens Axboe wrote:
>>> On 12/5/23 2:58 PM, Jens Axboe wrote:
>>>> On 12/5/23 2:28 PM, NeilBrown wrote:
>>>>> On Tue, 05 Dec 2023, Christian Brauner wrote:
>>>>>> On Mon, Dec 04, 2023 at 03:09:44PM -0700, Jens Axboe wrote:
>>>>>>> On 12/4/23 2:02 PM, NeilBrown wrote:
>>>>>>>> It isn't clear to me what _GPL is appropriate, but maybe the rules
>>>>>>>> changed since last I looked..... are there rules?
>>>>>>>>
>>>>>>>> My reasoning was that the call is effectively part of the user-space
>>>>>>>> ABI.  A user-space process can call this trivially by invoking any
>>>>>>>> system call.  The user-space ABI is explicitly a boundary which the GPL
>>>>>>>> does not cross.  So it doesn't seem appropriate to prevent non-GPL
>>>>>>>> kernel code from doing something that non-GPL user-space code can
>>>>>>>> trivially do.
>>>>>>>
>>>>>>> By that reasoning, basically everything in the kernel should be non-GPL
>>>>>>> marked. And while task_work can get used by the application, it happens
>>>>>>> only indirectly or implicitly. So I don't think this reasoning is sound
>>>>>>> at all, it's not an exported ABI or API by itself.
>>>>>>>
>>>>>>> For me, the more core of an export it is, the stronger the reason it
>>>>>>> should be GPL. FWIW, I don't think exporting task_work functionality is
>>>>
>>>>>>
>>>>>> Yeah, I'm not too fond of that part as well. I don't think we want to
>>>>>> give modules the ability to mess with task work. This is just asking for
>>>>>> trouble.
>>>>>>
>>>>>
>>>>> Ok, maybe we need to reframe the problem then.
>>>>>
>>>>> Currently fput(), and hence filp_close(), take control away from kernel
>>>>> threads in that they cannot be sure that a "close" has actually
>>>>> completed.
>>>>>
>>>>> This is already a problem for nfsd.  When renaming a file, nfsd needs to
>>>>> ensure any cached "open" that it has on the file is closed (else when
>>>>> re-exporting an NFS filesystem it can result in a silly-rename).
>>>>>
>>>>> nfsd currently handles this case by calling flush_delayed_fput().  I
>>>>> suspect you are no more happy about exporting that than you are about
>>>>> exporting task_work_run(), but this solution isn't actually 100%
>>>>> reliable.  If some other thread calls flush_delayed_fput() between nfsd
>>>>> calling filp_close() and that same nfsd calling flush_delayed_fput(),
>>>>> then the second flush can return before the first flush (in the other
>>>>> thread) completes all the work it took on.
>>>>>
>>>>> What we really need - both for handling renames and for avoiding
>>>>> possible memory exhaustion - is for nfsd to be able to reliably wait for
>>>>> any fput() that it initiated to complete.
>>>>>
>>>>> How would you like the VFS to provide that service?
>>>>
>>>> Since task_work happens in the context of your task already, why not
>>>> just have a way to get it stashed into a list when final fput is done?
>>>> This avoids all of this "let's expose task_work" and using the task list
>>>> for that, which seems kind of pointless as you're just going to run it
>>>> later on manually anyway.
>>>>
>>>> In semi pseudo code:
>>>>
>>>> bool fput_put_ref(struct file *file)
>>>> {
>>>> 	return atomic_dec_and_test(&file->f_count);
>>>> }
>>>>
>>>> void fput(struct file *file)
>>>> {
>>>> 	if (fput_put_ref(file)) {
>>>> 		...
>>>> 	}
>>>> }
>>>>
>>>> and then your nfsd_file_free() could do:
>>>>
>>>> ret = filp_flush(file, id);
>>>> if (fput_put_ref(file))
>>>> 	llist_add(&file->f_llist, &l->to_free_llist);
>>>>
>>>> or something like that, where l->to_free_llist is where ever you'd
>>>> otherwise punt the actual freeing to.
>>>
>>> Should probably have the put_ref or whatever helper also init the
>>> task_work, and then reuse the list in the callback_head there. Then
>>> whoever flushes it has to call ->func() and avoid exposing ____fput() to
>>> random users. But you get the idea.
>>
>> Interesting ideas - thanks.
>>
>> So maybe the new API would be
>>
>>  fput_queued(struct file *f, struct llist_head *q)
>> and
>>  flush_fput_queue(struct llist_head *q)
>>
>> with the meaning being that fput_queued() is just like fput() except
>> that any file needing __fput() is added to the 'q'; and that
>> flush_fput_queue() calls __fput() on any files in 'q'.
>>
>> So to close a file nfsd would:
>>
>>   fget(f);
>>   flip_close(f);
>>   fput_queued(f, &my_queue);
>>
>> though possibly we could have a
>>   filp_close_queued(f, q)
>> as well.
>>
>> I'll try that out - but am happy to hear alternate suggestions for names :-)
>>
> 
> Actually ....  I'm beginning to wonder if we should just use
> __fput_sync() in nfsd.
> It has a big warning about not doing that blindly, but the detail in the
> warning doesn't seem to apply to nfsd...

If you can do it from the context where you do the filp_close() right
now, then yeah there's no reason to over-complicate this at all... FWIW,
the reason task_work exists is just to ensure a clean context to perform
these operations from the task itself. The more I think about it, it
doesn't make a lot of sense to utilize it for this purpose, which is
where my alternate suggestion came from. But if you can just call it
directly, then that makes everything much easier.
Christoph Hellwig Dec. 6, 2023, 5:44 a.m. UTC | #24
On Tue, Dec 05, 2023 at 12:14:29PM +0100, Christian Brauner wrote:
> > For me, the more core of an export it is, the stronger the reason it
> > should be GPL. FWIW, I don't think exporting task_work functionality is
> > a good idea in the first place, but if there's a strong reason to do so,
> 
> Yeah, I'm not too fond of that part as well. I don't think we want to
> give modules the ability to mess with task work. This is just asking for
> trouble.

It just seems like a really bad idea.  At the same time it fixes a real
problem.  If we go a step back how could we fix it in a better way?
Do we even need the task_run based delay for file usage from kernel
threads?
Christian Brauner Dec. 6, 2023, 2:24 p.m. UTC | #25
On Tue, Dec 05, 2023 at 04:31:51PM -0700, Jens Axboe wrote:
> On 12/5/23 4:23 PM, NeilBrown wrote:
> > On Wed, 06 Dec 2023, NeilBrown wrote:
> >> On Wed, 06 Dec 2023, Jens Axboe wrote:
> >>> On 12/5/23 2:58 PM, Jens Axboe wrote:
> >>>> On 12/5/23 2:28 PM, NeilBrown wrote:
> >>>>> On Tue, 05 Dec 2023, Christian Brauner wrote:
> >>>>>> On Mon, Dec 04, 2023 at 03:09:44PM -0700, Jens Axboe wrote:
> >>>>>>> On 12/4/23 2:02 PM, NeilBrown wrote:
> >>>>>>>> It isn't clear to me what _GPL is appropriate, but maybe the rules
> >>>>>>>> changed since last I looked..... are there rules?
> >>>>>>>>
> >>>>>>>> My reasoning was that the call is effectively part of the user-space
> >>>>>>>> ABI.  A user-space process can call this trivially by invoking any
> >>>>>>>> system call.  The user-space ABI is explicitly a boundary which the GPL
> >>>>>>>> does not cross.  So it doesn't seem appropriate to prevent non-GPL
> >>>>>>>> kernel code from doing something that non-GPL user-space code can
> >>>>>>>> trivially do.
> >>>>>>>
> >>>>>>> By that reasoning, basically everything in the kernel should be non-GPL
> >>>>>>> marked. And while task_work can get used by the application, it happens
> >>>>>>> only indirectly or implicitly. So I don't think this reasoning is sound
> >>>>>>> at all, it's not an exported ABI or API by itself.
> >>>>>>>
> >>>>>>> For me, the more core of an export it is, the stronger the reason it
> >>>>>>> should be GPL. FWIW, I don't think exporting task_work functionality is
> >>>>
> >>>>>>
> >>>>>> Yeah, I'm not too fond of that part as well. I don't think we want to
> >>>>>> give modules the ability to mess with task work. This is just asking for
> >>>>>> trouble.
> >>>>>>
> >>>>>
> >>>>> Ok, maybe we need to reframe the problem then.
> >>>>>
> >>>>> Currently fput(), and hence filp_close(), take control away from kernel
> >>>>> threads in that they cannot be sure that a "close" has actually
> >>>>> completed.
> >>>>>
> >>>>> This is already a problem for nfsd.  When renaming a file, nfsd needs to
> >>>>> ensure any cached "open" that it has on the file is closed (else when
> >>>>> re-exporting an NFS filesystem it can result in a silly-rename).
> >>>>>
> >>>>> nfsd currently handles this case by calling flush_delayed_fput().  I
> >>>>> suspect you are no more happy about exporting that than you are about
> >>>>> exporting task_work_run(), but this solution isn't actually 100%
> >>>>> reliable.  If some other thread calls flush_delayed_fput() between nfsd
> >>>>> calling filp_close() and that same nfsd calling flush_delayed_fput(),
> >>>>> then the second flush can return before the first flush (in the other
> >>>>> thread) completes all the work it took on.
> >>>>>
> >>>>> What we really need - both for handling renames and for avoiding
> >>>>> possible memory exhaustion - is for nfsd to be able to reliably wait for
> >>>>> any fput() that it initiated to complete.
> >>>>>
> >>>>> How would you like the VFS to provide that service?
> >>>>
> >>>> Since task_work happens in the context of your task already, why not
> >>>> just have a way to get it stashed into a list when final fput is done?
> >>>> This avoids all of this "let's expose task_work" and using the task list
> >>>> for that, which seems kind of pointless as you're just going to run it
> >>>> later on manually anyway.
> >>>>
> >>>> In semi pseudo code:
> >>>>
> >>>> bool fput_put_ref(struct file *file)
> >>>> {
> >>>> 	return atomic_dec_and_test(&file->f_count);
> >>>> }
> >>>>
> >>>> void fput(struct file *file)
> >>>> {
> >>>> 	if (fput_put_ref(file)) {
> >>>> 		...
> >>>> 	}
> >>>> }
> >>>>
> >>>> and then your nfsd_file_free() could do:
> >>>>
> >>>> ret = filp_flush(file, id);
> >>>> if (fput_put_ref(file))
> >>>> 	llist_add(&file->f_llist, &l->to_free_llist);
> >>>>
> >>>> or something like that, where l->to_free_llist is where ever you'd
> >>>> otherwise punt the actual freeing to.
> >>>
> >>> Should probably have the put_ref or whatever helper also init the
> >>> task_work, and then reuse the list in the callback_head there. Then
> >>> whoever flushes it has to call ->func() and avoid exposing ____fput() to
> >>> random users. But you get the idea.
> >>
> >> Interesting ideas - thanks.
> >>
> >> So maybe the new API would be
> >>
> >>  fput_queued(struct file *f, struct llist_head *q)
> >> and
> >>  flush_fput_queue(struct llist_head *q)
> >>
> >> with the meaning being that fput_queued() is just like fput() except
> >> that any file needing __fput() is added to the 'q'; and that
> >> flush_fput_queue() calls __fput() on any files in 'q'.
> >>
> >> So to close a file nfsd would:
> >>
> >>   fget(f);
> >>   flip_close(f);
> >>   fput_queued(f, &my_queue);
> >>
> >> though possibly we could have a
> >>   filp_close_queued(f, q)
> >> as well.
> >>
> >> I'll try that out - but am happy to hear alternate suggestions for names :-)
> >>
> > 
> > Actually ....  I'm beginning to wonder if we should just use
> > __fput_sync() in nfsd.
> > It has a big warning about not doing that blindly, but the detail in the
> > warning doesn't seem to apply to nfsd...
> 
> If you can do it from the context where you do the filp_close() right
> now, then yeah there's no reason to over-complicate this at all... FWIW,

As long as nfsd doesn't care that it may get stuck on umount or
->release...

> the reason task_work exists is just to ensure a clean context to perform
> these operations from the task itself. The more I think about it, it
> doesn't make a lot of sense to utilize it for this purpose, which is
> where my alternate suggestion came from. But if you can just call it
> directly, then that makes everything much easier.

And for better or worse we already expose __fput_sync(). We've recently
switched close(2) over to it as well as it was needlessly punting to
task work.
Christian Brauner Dec. 6, 2023, 2:29 p.m. UTC | #26
On Wed, Dec 06, 2023 at 08:28:15AM +1100, NeilBrown wrote:
> On Tue, 05 Dec 2023, Christian Brauner wrote:
> > On Mon, Dec 04, 2023 at 03:09:44PM -0700, Jens Axboe wrote:
> > > On 12/4/23 2:02 PM, NeilBrown wrote:
> > > > It isn't clear to me what _GPL is appropriate, but maybe the rules
> > > > changed since last I looked..... are there rules?
> > > > 
> > > > My reasoning was that the call is effectively part of the user-space
> > > > ABI.  A user-space process can call this trivially by invoking any
> > > > system call.  The user-space ABI is explicitly a boundary which the GPL
> > > > does not cross.  So it doesn't seem appropriate to prevent non-GPL
> > > > kernel code from doing something that non-GPL user-space code can
> > > > trivially do.
> > > 
> > > By that reasoning, basically everything in the kernel should be non-GPL
> > > marked. And while task_work can get used by the application, it happens
> > > only indirectly or implicitly. So I don't think this reasoning is sound
> > > at all, it's not an exported ABI or API by itself.
> > > 
> > > For me, the more core of an export it is, the stronger the reason it
> > > should be GPL. FWIW, I don't think exporting task_work functionality is
> > > a good idea in the first place, but if there's a strong reason to do so,
> > 
> > Yeah, I'm not too fond of that part as well. I don't think we want to
> > give modules the ability to mess with task work. This is just asking for
> > trouble.
> > 
> 
> Ok, maybe we need to reframe the problem then.
> 
> Currently fput(), and hence filp_close(), take control away from kernel
> threads in that they cannot be sure that a "close" has actually
> completed.
> 
> This is already a problem for nfsd.  When renaming a file, nfsd needs to
> ensure any cached "open" that it has on the file is closed (else when
> re-exporting an NFS filesystem it can result in a silly-rename).
> 
> nfsd currently handles this case by calling flush_delayed_fput().  I
> suspect you are no more happy about exporting that than you are about
> exporting task_work_run(), but this solution isn't actually 100%
> reliable.  If some other thread calls flush_delayed_fput() between nfsd
> calling filp_close() and that same nfsd calling flush_delayed_fput(),
> then the second flush can return before the first flush (in the other
> thread) completes all the work it took on.
> 
> What we really need - both for handling renames and for avoiding
> possible memory exhaustion - is for nfsd to be able to reliably wait for
> any fput() that it initiated to complete.

Yeah, I acknowledge the problem and I said I'm not opposed to your
solution I just would like to do it differently if we can.

> 
> How would you like the VFS to provide that service?

If you really don't care about getting stuck on an fput() somehow then
what Jens suggested might actually be fine.

And the proposal - queue the files on a list - isn't that already what
nfsd is kinda doing already for the file cache. That's at least the
impression I got from reading over nfsd_file_free(). It's just that it
doesn't free directly but used that flush_delayed_fput(). So really,
taking this on step further and doing the fput synchronously might work.
NeilBrown Dec. 8, 2023, 1:40 a.m. UTC | #27
On Thu, 07 Dec 2023, Christian Brauner wrote:
> On Tue, Dec 05, 2023 at 04:31:51PM -0700, Jens Axboe wrote:
> > On 12/5/23 4:23 PM, NeilBrown wrote:
> > > On Wed, 06 Dec 2023, NeilBrown wrote:
> > >> On Wed, 06 Dec 2023, Jens Axboe wrote:
> > >>> On 12/5/23 2:58 PM, Jens Axboe wrote:
> > >>>> On 12/5/23 2:28 PM, NeilBrown wrote:
> > >>>>> On Tue, 05 Dec 2023, Christian Brauner wrote:
> > >>>>>> On Mon, Dec 04, 2023 at 03:09:44PM -0700, Jens Axboe wrote:
> > >>>>>>> On 12/4/23 2:02 PM, NeilBrown wrote:
> > >>>>>>>> It isn't clear to me what _GPL is appropriate, but maybe the rules
> > >>>>>>>> changed since last I looked..... are there rules?
> > >>>>>>>>
> > >>>>>>>> My reasoning was that the call is effectively part of the user-space
> > >>>>>>>> ABI.  A user-space process can call this trivially by invoking any
> > >>>>>>>> system call.  The user-space ABI is explicitly a boundary which the GPL
> > >>>>>>>> does not cross.  So it doesn't seem appropriate to prevent non-GPL
> > >>>>>>>> kernel code from doing something that non-GPL user-space code can
> > >>>>>>>> trivially do.
> > >>>>>>>
> > >>>>>>> By that reasoning, basically everything in the kernel should be non-GPL
> > >>>>>>> marked. And while task_work can get used by the application, it happens
> > >>>>>>> only indirectly or implicitly. So I don't think this reasoning is sound
> > >>>>>>> at all, it's not an exported ABI or API by itself.
> > >>>>>>>
> > >>>>>>> For me, the more core of an export it is, the stronger the reason it
> > >>>>>>> should be GPL. FWIW, I don't think exporting task_work functionality is
> > >>>>
> > >>>>>>
> > >>>>>> Yeah, I'm not too fond of that part as well. I don't think we want to
> > >>>>>> give modules the ability to mess with task work. This is just asking for
> > >>>>>> trouble.
> > >>>>>>
> > >>>>>
> > >>>>> Ok, maybe we need to reframe the problem then.
> > >>>>>
> > >>>>> Currently fput(), and hence filp_close(), take control away from kernel
> > >>>>> threads in that they cannot be sure that a "close" has actually
> > >>>>> completed.
> > >>>>>
> > >>>>> This is already a problem for nfsd.  When renaming a file, nfsd needs to
> > >>>>> ensure any cached "open" that it has on the file is closed (else when
> > >>>>> re-exporting an NFS filesystem it can result in a silly-rename).
> > >>>>>
> > >>>>> nfsd currently handles this case by calling flush_delayed_fput().  I
> > >>>>> suspect you are no more happy about exporting that than you are about
> > >>>>> exporting task_work_run(), but this solution isn't actually 100%
> > >>>>> reliable.  If some other thread calls flush_delayed_fput() between nfsd
> > >>>>> calling filp_close() and that same nfsd calling flush_delayed_fput(),
> > >>>>> then the second flush can return before the first flush (in the other
> > >>>>> thread) completes all the work it took on.
> > >>>>>
> > >>>>> What we really need - both for handling renames and for avoiding
> > >>>>> possible memory exhaustion - is for nfsd to be able to reliably wait for
> > >>>>> any fput() that it initiated to complete.
> > >>>>>
> > >>>>> How would you like the VFS to provide that service?
> > >>>>
> > >>>> Since task_work happens in the context of your task already, why not
> > >>>> just have a way to get it stashed into a list when final fput is done?
> > >>>> This avoids all of this "let's expose task_work" and using the task list
> > >>>> for that, which seems kind of pointless as you're just going to run it
> > >>>> later on manually anyway.
> > >>>>
> > >>>> In semi pseudo code:
> > >>>>
> > >>>> bool fput_put_ref(struct file *file)
> > >>>> {
> > >>>> 	return atomic_dec_and_test(&file->f_count);
> > >>>> }
> > >>>>
> > >>>> void fput(struct file *file)
> > >>>> {
> > >>>> 	if (fput_put_ref(file)) {
> > >>>> 		...
> > >>>> 	}
> > >>>> }
> > >>>>
> > >>>> and then your nfsd_file_free() could do:
> > >>>>
> > >>>> ret = filp_flush(file, id);
> > >>>> if (fput_put_ref(file))
> > >>>> 	llist_add(&file->f_llist, &l->to_free_llist);
> > >>>>
> > >>>> or something like that, where l->to_free_llist is where ever you'd
> > >>>> otherwise punt the actual freeing to.
> > >>>
> > >>> Should probably have the put_ref or whatever helper also init the
> > >>> task_work, and then reuse the list in the callback_head there. Then
> > >>> whoever flushes it has to call ->func() and avoid exposing ____fput() to
> > >>> random users. But you get the idea.
> > >>
> > >> Interesting ideas - thanks.
> > >>
> > >> So maybe the new API would be
> > >>
> > >>  fput_queued(struct file *f, struct llist_head *q)
> > >> and
> > >>  flush_fput_queue(struct llist_head *q)
> > >>
> > >> with the meaning being that fput_queued() is just like fput() except
> > >> that any file needing __fput() is added to the 'q'; and that
> > >> flush_fput_queue() calls __fput() on any files in 'q'.
> > >>
> > >> So to close a file nfsd would:
> > >>
> > >>   fget(f);
> > >>   flip_close(f);
> > >>   fput_queued(f, &my_queue);
> > >>
> > >> though possibly we could have a
> > >>   filp_close_queued(f, q)
> > >> as well.
> > >>
> > >> I'll try that out - but am happy to hear alternate suggestions for names :-)
> > >>
> > > 
> > > Actually ....  I'm beginning to wonder if we should just use
> > > __fput_sync() in nfsd.
> > > It has a big warning about not doing that blindly, but the detail in the
> > > warning doesn't seem to apply to nfsd...
> > 
> > If you can do it from the context where you do the filp_close() right
> > now, then yeah there's no reason to over-complicate this at all... FWIW,
> 
> As long as nfsd doesn't care that it may get stuck on umount or
> ->release...

I think we do *care* about getting stuck.  But I don't think we would
*expect* to get stuck..

I had a look at varous ->release function.  Quite few do fsync or
similar which isn't a problem.  nfsd often waits for writes to complete.
Some lock the inode, which again is something that nfsd threads often
do.

Is there something special that ->release might do but that other
filesystem operation don't do?

I'd really like to understand why __fput is so special that we often
queue it to a separate thread.

> 
> > the reason task_work exists is just to ensure a clean context to perform
> > these operations from the task itself. The more I think about it, it
> > doesn't make a lot of sense to utilize it for this purpose, which is
> > where my alternate suggestion came from. But if you can just call it
> > directly, then that makes everything much easier.
> 
> And for better or worse we already expose __fput_sync(). We've recently
> switched close(2) over to it as well as it was needlessly punting to
> task work.
> 

exit_files() would be another good candidate for using __fput_sync().
Oleg Nesterov has reported problems when a process which a large number
of files exits - this currently puts lots of entries on the task_works
lists.  If task_work_cancel is then called before those are all dealt
with, it can have a long list to search while holding a hot lock.  (I
hope I got that description right).

Thanks,
NeilBrown
diff mbox series

Patch

diff --git a/fs/file_table.c b/fs/file_table.c
index ee21b3da9d08..d36cade6e366 100644
--- a/fs/file_table.c
+++ b/fs/file_table.c
@@ -435,7 +435,8 @@  void fput(struct file *file)
 	if (atomic_long_dec_and_test(&file->f_count)) {
 		struct task_struct *task = current;
 
-		if (likely(!in_interrupt() && !(task->flags & PF_KTHREAD))) {
+		if (likely(!in_interrupt() &&
+			   (task->flags & PF_RUNS_TASK_WORK))) {
 			init_task_work(&file->f_rcuhead, ____fput);
 			if (!task_work_add(task, &file->f_rcuhead, TWA_RESUME))
 				return;
diff --git a/fs/namespace.c b/fs/namespace.c
index e157efc54023..46d640b70ca9 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -1328,7 +1328,7 @@  static void mntput_no_expire(struct mount *mnt)
 
 	if (likely(!(mnt->mnt.mnt_flags & MNT_INTERNAL))) {
 		struct task_struct *task = current;
-		if (likely(!(task->flags & PF_KTHREAD))) {
+		if (likely((task->flags & PF_RUNS_TASK_WORK))) {
 			init_task_work(&mnt->mnt_rcu, __cleanup_mnt);
 			if (!task_work_add(task, &mnt->mnt_rcu, TWA_RESUME))
 				return;
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 77f01ac385f7..e4eebac708e7 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1747,7 +1747,7 @@  extern struct pid *cad_pid;
 						 * I am cleaning dirty pages from some other bdi. */
 #define PF_KTHREAD		0x00200000	/* I am a kernel thread */
 #define PF_RANDOMIZE		0x00400000	/* Randomize virtual address space */
-#define PF__HOLE__00800000	0x00800000
+#define PF_RUNS_TASK_WORK	0x00800000	/* Will call task_work_run() periodically */
 #define PF__HOLE__01000000	0x01000000
 #define PF__HOLE__02000000	0x02000000
 #define PF_NO_SETAFFINITY	0x04000000	/* Userland is not allowed to meddle with cpus_mask */
diff --git a/kernel/fork.c b/kernel/fork.c
index 3b6d20dfb9a8..d612d8f14861 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2330,6 +2330,8 @@  __latent_entropy struct task_struct *copy_process(
 	p->flags &= ~PF_KTHREAD;
 	if (args->kthread)
 		p->flags |= PF_KTHREAD;
+	else
+		p->flags |= PF_RUNS_TASK_WORK;
 	if (args->user_worker) {
 		/*
 		 * Mark us a user worker, and block any signal that isn't
diff --git a/kernel/task_work.c b/kernel/task_work.c
index 95a7e1b7f1da..aec19876e121 100644
--- a/kernel/task_work.c
+++ b/kernel/task_work.c
@@ -183,3 +183,4 @@  void task_work_run(void)
 		} while (work);
 	}
 }
+EXPORT_SYMBOL(task_work_run);