Message ID | 20250218135423.1487309-1-lilingfeng3@huawei.com (mailing list archive) |
---|---|
State | Under Review |
Delegated to: | Chuck Lever |
Series | nfsd: decrease cl_cb_inflight if fail to queue cb_work
On Tue, 2025-02-18 at 21:54 +0800, Li Lingfeng wrote: > In nfsd4_run_cb, cl_cb_inflight is increased before attempting to queue > cb_work to callback_wq. This count can be decreased in three situations: > 1) If queuing fails in nfsd4_run_cb, the count will be decremented > accordingly. > 2) After cb_work is running, the count is decreased in the exception > branch of nfsd4_run_cb_work via nfsd41_destroy_cb. > 3) The count is decreased in the release callback of rpc_task — either > directly calling nfsd41_cb_inflight_end in nfsd4_cb_probe_release, or > calling nfsd41_destroy_cb in nfsd4_cb_release. > > However, in nfsd4_cb_release, if the current cb_work needs to restart, the > count will not be decreased, with the expectation that it will be reduced > once cb_work is running. > If queuing fails here, then the count will leak, ultimately causing the > nfsd service to be unable to exit as shown below: > [root@nfs_test2 ~]# cat /proc/2271/stack > [<0>] nfsd4_shutdown_callback+0x22b/0x290 > [<0>] __destroy_client+0x3cd/0x5c0 > [<0>] nfs4_state_destroy_net+0xd2/0x330 > [<0>] nfs4_state_shutdown_net+0x2ad/0x410 > [<0>] nfsd_shutdown_net+0xb7/0x250 > [<0>] nfsd_last_thread+0x15f/0x2a0 > [<0>] nfsd_svc+0x388/0x3f0 > [<0>] write_threads+0x17e/0x2b0 > [<0>] nfsctl_transaction_write+0x91/0xf0 > [<0>] vfs_write+0x1c4/0x750 > [<0>] ksys_write+0xcb/0x170 > [<0>] do_syscall_64+0x70/0x120 > [<0>] entry_SYSCALL_64_after_hwframe+0x78/0xe2 > [root@nfs_test2 ~]# > > Fix this by decreasing cl_cb_inflight if the restart fails. > > Fixes: cba5f62b1830 ("nfsd: fix callback restarts") > Signed-off-by: Li Lingfeng <lilingfeng3@huawei.com> > --- > fs/nfsd/nfs4callback.c | 10 +++++++--- > 1 file changed, 7 insertions(+), 3 deletions(-) > > diff --git a/fs/nfsd/nfs4callback.c b/fs/nfsd/nfs4callback.c > index 484077200c5d..8a7d24efdd08 100644 > --- a/fs/nfsd/nfs4callback.c > +++ b/fs/nfsd/nfs4callback.c > @@ -1459,12 +1459,16 @@ static void nfsd4_cb_done(struct rpc_task *task, void *calldata) > static void nfsd4_cb_release(void *calldata) > { > struct nfsd4_callback *cb = calldata; > + struct nfs4_client *clp = cb->cb_clp; > + int queued; > > trace_nfsd_cb_rpc_release(cb->cb_clp); > > - if (cb->cb_need_restart) > - nfsd4_queue_cb(cb); > - else > + if (cb->cb_need_restart) { > + queued = nfsd4_queue_cb(cb); > + if (!queued) > + nfsd41_cb_inflight_end(clp); > + } else > nfsd41_destroy_cb(cb); > > } Good catch! Reviewed-by: Jeff Layton <jlayton@kernel.org>
From: Chuck Lever <chuck.lever@oracle.com>

On Tue, 18 Feb 2025 21:54:23 +0800, Li Lingfeng wrote:
> In nfsd4_run_cb, cl_cb_inflight is increased before attempting to queue
> cb_work to callback_wq. This count can be decreased in three situations:
> 1) If queuing fails in nfsd4_run_cb, the count will be decremented
> accordingly.
> 2) After cb_work is running, the count is decreased in the exception
> branch of nfsd4_run_cb_work via nfsd41_destroy_cb.
> 3) The count is decreased in the release callback of rpc_task — either
> directly calling nfsd41_cb_inflight_end in nfsd4_cb_probe_release, or
> calling nfsd41_destroy_cb in nfsd4_cb_release.
>
> [...]

Applied to nfsd-testing, thanks!

[1/1] nfsd: decrease cl_cb_inflight if fail to queue cb_work
      commit: bcf482e3b1f4e622c6a1bfa971cd8035227b8b56

--
Chuck Lever
On Tue, 2025-02-18 at 08:58 -0500, Jeff Layton wrote: > On Tue, 2025-02-18 at 21:54 +0800, Li Lingfeng wrote: > > In nfsd4_run_cb, cl_cb_inflight is increased before attempting to queue > > cb_work to callback_wq. This count can be decreased in three situations: > > 1) If queuing fails in nfsd4_run_cb, the count will be decremented > > accordingly. > > 2) After cb_work is running, the count is decreased in the exception > > branch of nfsd4_run_cb_work via nfsd41_destroy_cb. > > 3) The count is decreased in the release callback of rpc_task — either > > directly calling nfsd41_cb_inflight_end in nfsd4_cb_probe_release, or > > calling nfsd41_destroy_cb in . > > > > However, in nfsd4_cb_release, if the current cb_work needs to restart, the > > count will not be decreased, with the expectation that it will be reduced > > once cb_work is running. > > If queuing fails here, then the count will leak, ultimately causing the > > nfsd service to be unable to exit as shown below: > > [root@nfs_test2 ~]# cat /proc/2271/stack > > [<0>] nfsd4_shutdown_callback+0x22b/0x290 > > [<0>] __destroy_client+0x3cd/0x5c0 > > [<0>] nfs4_state_destroy_net+0xd2/0x330 > > [<0>] nfs4_state_shutdown_net+0x2ad/0x410 > > [<0>] nfsd_shutdown_net+0xb7/0x250 > > [<0>] nfsd_last_thread+0x15f/0x2a0 > > [<0>] nfsd_svc+0x388/0x3f0 > > [<0>] write_threads+0x17e/0x2b0 > > [<0>] nfsctl_transaction_write+0x91/0xf0 > > [<0>] vfs_write+0x1c4/0x750 > > [<0>] ksys_write+0xcb/0x170 > > [<0>] do_syscall_64+0x70/0x120 > > [<0>] entry_SYSCALL_64_after_hwframe+0x78/0xe2 > > [root@nfs_test2 ~]# > > > > Fix this by decreasing cl_cb_inflight if the restart fails. > > > > Fixes: cba5f62b1830 ("nfsd: fix callback restarts") > > Signed-off-by: Li Lingfeng <lilingfeng3@huawei.com> > > --- > > fs/nfsd/nfs4callback.c | 10 +++++++--- > > 1 file changed, 7 insertions(+), 3 deletions(-) > > > > diff --git a/fs/nfsd/nfs4callback.c b/fs/nfsd/nfs4callback.c > > index 484077200c5d..8a7d24efdd08 100644 > > --- a/fs/nfsd/nfs4callback.c > > +++ b/fs/nfsd/nfs4callback.c > > @@ -1459,12 +1459,16 @@ static void nfsd4_cb_done(struct rpc_task *task, void *calldata) > > static void nfsd4_cb_release(void *calldata) > > { > > struct nfsd4_callback *cb = calldata; > > + struct nfs4_client *clp = cb->cb_clp; > > + int queued; > > > > trace_nfsd_cb_rpc_release(cb->cb_clp); > > > > - if (cb->cb_need_restart) > > - nfsd4_queue_cb(cb); > > - else > > + if (cb->cb_need_restart) { > > + queued = nfsd4_queue_cb(cb); > > + if (!queued) > > + nfsd41_cb_inflight_end(clp); > > + } else > > nfsd41_destroy_cb(cb); > > > > } > > Good catch! > > Reviewed-by: Jeff Layton <jlayton@kernel.org> > Actually, I think this is not quite right. It's a bit more subtle than it first appears. The problem of course is that the callback workqueue jobs run in a different task than the RPC workqueue jobs, so they can race. cl_cb_inflight gets bumped when the callback is first queued, and only gets released in nfsd41_destroy_cb(). If it fails to be queued, it's because something else has queued the workqueue job in the meantime. There are two places that can occur: nfsd4_cb_release() and nfsd4_run_cb(). Since this is occurring in nfsd4_cb_release(), the only other option is that something raced in and queued it via nfsd4_run_cb(). That will have incremented cl_cb_inflight() an extra time and so your patch will make sense for that. Unfortunately, the slot may leak in that case if nothing released it earlier. I think this probably needs to call nfsd41_destroy_cb() if the job can't be queued. 
That might, however, race with the callback workqueue job running.

I think we might need to consider adding some sort of "this callback is
running" atomic flag: do a test_and_set on the flag in nfsd4_run_cb() and
only queue the workqueue job if that comes back false. Then, we can clear
the bit in nfsd41_destroy_cb().

That should ensure that you never fail to queue the workqueue job from
nfsd4_cb_release().

Thoughts?
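A minimal sketch of the flag idea described above. This is not a tested patch: cb_flags and NFSD4_CALLBACK_RUNNING are hypothetical names for a new bit in struct nfsd4_callback, and exactly where the clear happens is an assumption for illustration.

```c
#define NFSD4_CALLBACK_RUNNING	0	/* hypothetical flag bit */

void nfsd4_run_cb(struct nfsd4_callback *cb)
{
	struct nfs4_client *clp = cb->cb_clp;

	/* A second caller finds the bit already set and backs off, so
	 * at most one inflight reference is taken per callback. */
	if (test_and_set_bit(NFSD4_CALLBACK_RUNNING, &cb->cb_flags))
		return;

	nfsd41_cb_inflight_begin(clp);
	if (!nfsd4_queue_cb(cb))
		nfsd41_cb_inflight_end(clp);
}

static void nfsd41_destroy_cb(struct nfsd4_callback *cb)
{
	/* ... existing teardown: release the backchannel slot, invoke
	 * ->release, drop the inflight reference ... */

	/* Last step: let nfsd4_run_cb() queue this callback again. */
	clear_bit(NFSD4_CALLBACK_RUNNING, &cb->cb_flags);
}
```

With something like that in place, nfsd4_cb_release() could requeue the restarted work without a concurrent nfsd4_run_cb() having slipped in, which is the guarantee being asked for here.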
On 2/18/25 9:29 AM, Jeff Layton wrote: > On Tue, 2025-02-18 at 08:58 -0500, Jeff Layton wrote: >> On Tue, 2025-02-18 at 21:54 +0800, Li Lingfeng wrote: >>> In nfsd4_run_cb, cl_cb_inflight is increased before attempting to queue >>> cb_work to callback_wq. This count can be decreased in three situations: >>> 1) If queuing fails in nfsd4_run_cb, the count will be decremented >>> accordingly. >>> 2) After cb_work is running, the count is decreased in the exception >>> branch of nfsd4_run_cb_work via nfsd41_destroy_cb. >>> 3) The count is decreased in the release callback of rpc_task — either >>> directly calling nfsd41_cb_inflight_end in nfsd4_cb_probe_release, or >>> calling nfsd41_destroy_cb in . >>> >>> However, in nfsd4_cb_release, if the current cb_work needs to restart, the >>> count will not be decreased, with the expectation that it will be reduced >>> once cb_work is running. >>> If queuing fails here, then the count will leak, ultimately causing the >>> nfsd service to be unable to exit as shown below: >>> [root@nfs_test2 ~]# cat /proc/2271/stack >>> [<0>] nfsd4_shutdown_callback+0x22b/0x290 >>> [<0>] __destroy_client+0x3cd/0x5c0 >>> [<0>] nfs4_state_destroy_net+0xd2/0x330 >>> [<0>] nfs4_state_shutdown_net+0x2ad/0x410 >>> [<0>] nfsd_shutdown_net+0xb7/0x250 >>> [<0>] nfsd_last_thread+0x15f/0x2a0 >>> [<0>] nfsd_svc+0x388/0x3f0 >>> [<0>] write_threads+0x17e/0x2b0 >>> [<0>] nfsctl_transaction_write+0x91/0xf0 >>> [<0>] vfs_write+0x1c4/0x750 >>> [<0>] ksys_write+0xcb/0x170 >>> [<0>] do_syscall_64+0x70/0x120 >>> [<0>] entry_SYSCALL_64_after_hwframe+0x78/0xe2 >>> [root@nfs_test2 ~]# >>> >>> Fix this by decreasing cl_cb_inflight if the restart fails. >>> >>> Fixes: cba5f62b1830 ("nfsd: fix callback restarts") >>> Signed-off-by: Li Lingfeng <lilingfeng3@huawei.com> >>> --- >>> fs/nfsd/nfs4callback.c | 10 +++++++--- >>> 1 file changed, 7 insertions(+), 3 deletions(-) >>> >>> diff --git a/fs/nfsd/nfs4callback.c b/fs/nfsd/nfs4callback.c >>> index 484077200c5d..8a7d24efdd08 100644 >>> --- a/fs/nfsd/nfs4callback.c >>> +++ b/fs/nfsd/nfs4callback.c >>> @@ -1459,12 +1459,16 @@ static void nfsd4_cb_done(struct rpc_task *task, void *calldata) >>> static void nfsd4_cb_release(void *calldata) >>> { >>> struct nfsd4_callback *cb = calldata; >>> + struct nfs4_client *clp = cb->cb_clp; >>> + int queued; >>> >>> trace_nfsd_cb_rpc_release(cb->cb_clp); >>> >>> - if (cb->cb_need_restart) >>> - nfsd4_queue_cb(cb); >>> - else >>> + if (cb->cb_need_restart) { >>> + queued = nfsd4_queue_cb(cb); >>> + if (!queued) >>> + nfsd41_cb_inflight_end(clp); >>> + } else >>> nfsd41_destroy_cb(cb); >>> >>> } >> >> Good catch! >> >> Reviewed-by: Jeff Layton <jlayton@kernel.org> >> > > Actually, I think this is not quite right. It's a bit more subtle than > it first appears. The problem of course is that the callback workqueue > jobs run in a different task than the RPC workqueue jobs, so they can > race. > > cl_cb_inflight gets bumped when the callback is first queued, and only > gets released in nfsd41_destroy_cb(). If it fails to be queued, it's > because something else has queued the workqueue job in the meantime. > > There are two places that can occur: nfsd4_cb_release() and > nfsd4_run_cb(). Since this is occurring in nfsd4_cb_release(), the only > other option is that something raced in and queued it via > nfsd4_run_cb(). What would be the "something" that raced in? > That will have incremented cl_cb_inflight() an extra > time and so your patch will make sense for that. 
> > Unfortunately, the slot may leak in that case if nothing released it > earlier. I think this probably needs to call nfsd41_destroy_cb() if the > job can't be queued. That might, however, race with the callback > workqueue job running. > > I think we might need to consider adding some sort of "this callback is > running" atomic flag: do a test_and_set on the flag in nfsd4_run_cb() > and only queue the workqueue job if that comes back false. Then, we can > clear the bit in nfsd41_destroy_cb(). > > That should ensure that you never fail to queue the workqueue job from > nfsd4_cb_release(). > > Thoughts?
On Tue, 2025-02-18 at 09:31 -0500, Chuck Lever wrote: > On 2/18/25 9:29 AM, Jeff Layton wrote: > > On Tue, 2025-02-18 at 08:58 -0500, Jeff Layton wrote: > > > On Tue, 2025-02-18 at 21:54 +0800, Li Lingfeng wrote: > > > > In nfsd4_run_cb, cl_cb_inflight is increased before attempting to queue > > > > cb_work to callback_wq. This count can be decreased in three situations: > > > > 1) If queuing fails in nfsd4_run_cb, the count will be decremented > > > > accordingly. > > > > 2) After cb_work is running, the count is decreased in the exception > > > > branch of nfsd4_run_cb_work via nfsd41_destroy_cb. > > > > 3) The count is decreased in the release callback of rpc_task — either > > > > directly calling nfsd41_cb_inflight_end in nfsd4_cb_probe_release, or > > > > calling nfsd41_destroy_cb in . > > > > > > > > However, in nfsd4_cb_release, if the current cb_work needs to restart, the > > > > count will not be decreased, with the expectation that it will be reduced > > > > once cb_work is running. > > > > If queuing fails here, then the count will leak, ultimately causing the > > > > nfsd service to be unable to exit as shown below: > > > > [root@nfs_test2 ~]# cat /proc/2271/stack > > > > [<0>] nfsd4_shutdown_callback+0x22b/0x290 > > > > [<0>] __destroy_client+0x3cd/0x5c0 > > > > [<0>] nfs4_state_destroy_net+0xd2/0x330 > > > > [<0>] nfs4_state_shutdown_net+0x2ad/0x410 > > > > [<0>] nfsd_shutdown_net+0xb7/0x250 > > > > [<0>] nfsd_last_thread+0x15f/0x2a0 > > > > [<0>] nfsd_svc+0x388/0x3f0 > > > > [<0>] write_threads+0x17e/0x2b0 > > > > [<0>] nfsctl_transaction_write+0x91/0xf0 > > > > [<0>] vfs_write+0x1c4/0x750 > > > > [<0>] ksys_write+0xcb/0x170 > > > > [<0>] do_syscall_64+0x70/0x120 > > > > [<0>] entry_SYSCALL_64_after_hwframe+0x78/0xe2 > > > > [root@nfs_test2 ~]# > > > > > > > > Fix this by decreasing cl_cb_inflight if the restart fails. > > > > > > > > Fixes: cba5f62b1830 ("nfsd: fix callback restarts") > > > > Signed-off-by: Li Lingfeng <lilingfeng3@huawei.com> > > > > --- > > > > fs/nfsd/nfs4callback.c | 10 +++++++--- > > > > 1 file changed, 7 insertions(+), 3 deletions(-) > > > > > > > > diff --git a/fs/nfsd/nfs4callback.c b/fs/nfsd/nfs4callback.c > > > > index 484077200c5d..8a7d24efdd08 100644 > > > > --- a/fs/nfsd/nfs4callback.c > > > > +++ b/fs/nfsd/nfs4callback.c > > > > @@ -1459,12 +1459,16 @@ static void nfsd4_cb_done(struct rpc_task *task, void *calldata) > > > > static void nfsd4_cb_release(void *calldata) > > > > { > > > > struct nfsd4_callback *cb = calldata; > > > > + struct nfs4_client *clp = cb->cb_clp; > > > > + int queued; > > > > > > > > trace_nfsd_cb_rpc_release(cb->cb_clp); > > > > > > > > - if (cb->cb_need_restart) > > > > - nfsd4_queue_cb(cb); > > > > - else > > > > + if (cb->cb_need_restart) { > > > > + queued = nfsd4_queue_cb(cb); > > > > + if (!queued) > > > > + nfsd41_cb_inflight_end(clp); > > > > + } else > > > > nfsd41_destroy_cb(cb); > > > > > > > > } > > > > > > Good catch! > > > > > > Reviewed-by: Jeff Layton <jlayton@kernel.org> > > > > > > > Actually, I think this is not quite right. It's a bit more subtle than > > it first appears. The problem of course is that the callback workqueue > > jobs run in a different task than the RPC workqueue jobs, so they can > > race. > > > > cl_cb_inflight gets bumped when the callback is first queued, and only > > gets released in nfsd41_destroy_cb(). If it fails to be queued, it's > > because something else has queued the workqueue job in the meantime. 
> > > > There are two places that can occur: nfsd4_cb_release() and > > nfsd4_run_cb(). Since this is occurring in nfsd4_cb_release(), the only > > other option is that something raced in and queued it via > > nfsd4_run_cb(). > > What would be the "something" that raced in? > I think we may be able to get there via multiple __break_lease() calls on the same layout or delegation. That could mean multiple calls to the ->lm_break operation on the same object. > > > That will have incremented cl_cb_inflight() an extra > > time and so your patch will make sense for that. > > > > Unfortunately, the slot may leak in that case if nothing released it > > earlier. I think this probably needs to call nfsd41_destroy_cb() if the > > job can't be queued. That might, however, race with the callback > > workqueue job running. > > > > I think we might need to consider adding some sort of "this callback is > > running" atomic flag: do a test_and_set on the flag in nfsd4_run_cb() > > and only queue the workqueue job if that comes back false. Then, we can > > clear the bit in nfsd41_destroy_cb(). > > > > That should ensure that you never fail to queue the workqueue job from > > nfsd4_cb_release(). > > > > Thoughts? > >
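Concretely, the interleaving being described looks something like the sketch below. The behaviour of nfsd4_run_cb() is taken from the patch description; exactly which task drives the second ->lm_break is an assumption for illustration.

    rpciod: nfsd4_cb_release(cb)        another task: second ->lm_break on the
    (cb->cb_need_restart is set)        same delegation -> nfsd4_run_cb(cb)
    ----------------------------        ------------------------------------
                                        nfsd41_cb_inflight_begin(clp)
                                        nfsd4_queue_cb(cb) succeeds;
                                          cb_work is now pending
    nfsd4_queue_cb(cb) fails because
      the work is already pending
    before this fix: return without
      dropping the inflight reference
                                        cb_work runs once and its normal
                                        completion drops only one reference

    result: cl_cb_inflight never returns to zero and
    nfsd4_shutdown_callback() blocks forever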
On 2/18/25 9:40 AM, Jeff Layton wrote: > On Tue, 2025-02-18 at 09:31 -0500, Chuck Lever wrote: >> On 2/18/25 9:29 AM, Jeff Layton wrote: >>> On Tue, 2025-02-18 at 08:58 -0500, Jeff Layton wrote: >>>> On Tue, 2025-02-18 at 21:54 +0800, Li Lingfeng wrote: >>>>> In nfsd4_run_cb, cl_cb_inflight is increased before attempting to queue >>>>> cb_work to callback_wq. This count can be decreased in three situations: >>>>> 1) If queuing fails in nfsd4_run_cb, the count will be decremented >>>>> accordingly. >>>>> 2) After cb_work is running, the count is decreased in the exception >>>>> branch of nfsd4_run_cb_work via nfsd41_destroy_cb. >>>>> 3) The count is decreased in the release callback of rpc_task — either >>>>> directly calling nfsd41_cb_inflight_end in nfsd4_cb_probe_release, or >>>>> calling nfsd41_destroy_cb in . >>>>> >>>>> However, in nfsd4_cb_release, if the current cb_work needs to restart, the >>>>> count will not be decreased, with the expectation that it will be reduced >>>>> once cb_work is running. >>>>> If queuing fails here, then the count will leak, ultimately causing the >>>>> nfsd service to be unable to exit as shown below: >>>>> [root@nfs_test2 ~]# cat /proc/2271/stack >>>>> [<0>] nfsd4_shutdown_callback+0x22b/0x290 >>>>> [<0>] __destroy_client+0x3cd/0x5c0 >>>>> [<0>] nfs4_state_destroy_net+0xd2/0x330 >>>>> [<0>] nfs4_state_shutdown_net+0x2ad/0x410 >>>>> [<0>] nfsd_shutdown_net+0xb7/0x250 >>>>> [<0>] nfsd_last_thread+0x15f/0x2a0 >>>>> [<0>] nfsd_svc+0x388/0x3f0 >>>>> [<0>] write_threads+0x17e/0x2b0 >>>>> [<0>] nfsctl_transaction_write+0x91/0xf0 >>>>> [<0>] vfs_write+0x1c4/0x750 >>>>> [<0>] ksys_write+0xcb/0x170 >>>>> [<0>] do_syscall_64+0x70/0x120 >>>>> [<0>] entry_SYSCALL_64_after_hwframe+0x78/0xe2 >>>>> [root@nfs_test2 ~]# >>>>> >>>>> Fix this by decreasing cl_cb_inflight if the restart fails. >>>>> >>>>> Fixes: cba5f62b1830 ("nfsd: fix callback restarts") >>>>> Signed-off-by: Li Lingfeng <lilingfeng3@huawei.com> >>>>> --- >>>>> fs/nfsd/nfs4callback.c | 10 +++++++--- >>>>> 1 file changed, 7 insertions(+), 3 deletions(-) >>>>> >>>>> diff --git a/fs/nfsd/nfs4callback.c b/fs/nfsd/nfs4callback.c >>>>> index 484077200c5d..8a7d24efdd08 100644 >>>>> --- a/fs/nfsd/nfs4callback.c >>>>> +++ b/fs/nfsd/nfs4callback.c >>>>> @@ -1459,12 +1459,16 @@ static void nfsd4_cb_done(struct rpc_task *task, void *calldata) >>>>> static void nfsd4_cb_release(void *calldata) >>>>> { >>>>> struct nfsd4_callback *cb = calldata; >>>>> + struct nfs4_client *clp = cb->cb_clp; >>>>> + int queued; >>>>> >>>>> trace_nfsd_cb_rpc_release(cb->cb_clp); >>>>> >>>>> - if (cb->cb_need_restart) >>>>> - nfsd4_queue_cb(cb); >>>>> - else >>>>> + if (cb->cb_need_restart) { >>>>> + queued = nfsd4_queue_cb(cb); >>>>> + if (!queued) >>>>> + nfsd41_cb_inflight_end(clp); >>>>> + } else >>>>> nfsd41_destroy_cb(cb); >>>>> >>>>> } >>>> >>>> Good catch! >>>> >>>> Reviewed-by: Jeff Layton <jlayton@kernel.org> >>>> >>> >>> Actually, I think this is not quite right. It's a bit more subtle than >>> it first appears. The problem of course is that the callback workqueue >>> jobs run in a different task than the RPC workqueue jobs, so they can >>> race. >>> >>> cl_cb_inflight gets bumped when the callback is first queued, and only >>> gets released in nfsd41_destroy_cb(). If it fails to be queued, it's >>> because something else has queued the workqueue job in the meantime. >>> >>> There are two places that can occur: nfsd4_cb_release() and >>> nfsd4_run_cb(). 
Since this is occurring in nfsd4_cb_release(), the only >>> other option is that something raced in and queued it via >>> nfsd4_run_cb(). >> >> What would be the "something" that raced in? >> > > I think we may be able to get there via multiple __break_lease() calls > on the same layout or delegation. That could mean multiple calls to the > ->lm_break operation on the same object. Makes sense. Out of curiosity, what would be the complexity/performance cost of serializing the lm_break calls, or having each lm_break call register an interest in the CB_RECALL callback? Maybe not worth it. >>> That will have incremented cl_cb_inflight() an extra >>> time and so your patch will make sense for that. >>> >>> Unfortunately, the slot may leak in that case if nothing released it >>> earlier. I think this probably needs to call nfsd41_destroy_cb() if the >>> job can't be queued. That might, however, race with the callback >>> workqueue job running. >>> >>> I think we might need to consider adding some sort of "this callback is >>> running" atomic flag: do a test_and_set on the flag in nfsd4_run_cb() >>> and only queue the workqueue job if that comes back false. Then, we can >>> clear the bit in nfsd41_destroy_cb(). >>> >>> That should ensure that you never fail to queue the workqueue job from >>> nfsd4_cb_release(). >>> >>> Thoughts? >> >> >
On Tue, 2025-02-18 at 10:02 -0500, Chuck Lever wrote: > On 2/18/25 9:40 AM, Jeff Layton wrote: > > On Tue, 2025-02-18 at 09:31 -0500, Chuck Lever wrote: > > > On 2/18/25 9:29 AM, Jeff Layton wrote: > > > > On Tue, 2025-02-18 at 08:58 -0500, Jeff Layton wrote: > > > > > On Tue, 2025-02-18 at 21:54 +0800, Li Lingfeng wrote: > > > > > > In nfsd4_run_cb, cl_cb_inflight is increased before attempting to queue > > > > > > cb_work to callback_wq. This count can be decreased in three situations: > > > > > > 1) If queuing fails in nfsd4_run_cb, the count will be decremented > > > > > > accordingly. > > > > > > 2) After cb_work is running, the count is decreased in the exception > > > > > > branch of nfsd4_run_cb_work via nfsd41_destroy_cb. > > > > > > 3) The count is decreased in the release callback of rpc_task — either > > > > > > directly calling nfsd41_cb_inflight_end in nfsd4_cb_probe_release, or > > > > > > calling nfsd41_destroy_cb in . > > > > > > > > > > > > However, in nfsd4_cb_release, if the current cb_work needs to restart, the > > > > > > count will not be decreased, with the expectation that it will be reduced > > > > > > once cb_work is running. > > > > > > If queuing fails here, then the count will leak, ultimately causing the > > > > > > nfsd service to be unable to exit as shown below: > > > > > > [root@nfs_test2 ~]# cat /proc/2271/stack > > > > > > [<0>] nfsd4_shutdown_callback+0x22b/0x290 > > > > > > [<0>] __destroy_client+0x3cd/0x5c0 > > > > > > [<0>] nfs4_state_destroy_net+0xd2/0x330 > > > > > > [<0>] nfs4_state_shutdown_net+0x2ad/0x410 > > > > > > [<0>] nfsd_shutdown_net+0xb7/0x250 > > > > > > [<0>] nfsd_last_thread+0x15f/0x2a0 > > > > > > [<0>] nfsd_svc+0x388/0x3f0 > > > > > > [<0>] write_threads+0x17e/0x2b0 > > > > > > [<0>] nfsctl_transaction_write+0x91/0xf0 > > > > > > [<0>] vfs_write+0x1c4/0x750 > > > > > > [<0>] ksys_write+0xcb/0x170 > > > > > > [<0>] do_syscall_64+0x70/0x120 > > > > > > [<0>] entry_SYSCALL_64_after_hwframe+0x78/0xe2 > > > > > > [root@nfs_test2 ~]# > > > > > > > > > > > > Fix this by decreasing cl_cb_inflight if the restart fails. > > > > > > > > > > > > Fixes: cba5f62b1830 ("nfsd: fix callback restarts") > > > > > > Signed-off-by: Li Lingfeng <lilingfeng3@huawei.com> > > > > > > --- > > > > > > fs/nfsd/nfs4callback.c | 10 +++++++--- > > > > > > 1 file changed, 7 insertions(+), 3 deletions(-) > > > > > > > > > > > > diff --git a/fs/nfsd/nfs4callback.c b/fs/nfsd/nfs4callback.c > > > > > > index 484077200c5d..8a7d24efdd08 100644 > > > > > > --- a/fs/nfsd/nfs4callback.c > > > > > > +++ b/fs/nfsd/nfs4callback.c > > > > > > @@ -1459,12 +1459,16 @@ static void nfsd4_cb_done(struct rpc_task *task, void *calldata) > > > > > > static void nfsd4_cb_release(void *calldata) > > > > > > { > > > > > > struct nfsd4_callback *cb = calldata; > > > > > > + struct nfs4_client *clp = cb->cb_clp; > > > > > > + int queued; > > > > > > > > > > > > trace_nfsd_cb_rpc_release(cb->cb_clp); > > > > > > > > > > > > - if (cb->cb_need_restart) > > > > > > - nfsd4_queue_cb(cb); > > > > > > - else > > > > > > + if (cb->cb_need_restart) { > > > > > > + queued = nfsd4_queue_cb(cb); > > > > > > + if (!queued) > > > > > > + nfsd41_cb_inflight_end(clp); > > > > > > + } else > > > > > > nfsd41_destroy_cb(cb); > > > > > > > > > > > > } > > > > > > > > > > Good catch! > > > > > > > > > > Reviewed-by: Jeff Layton <jlayton@kernel.org> > > > > > > > > > > > > > Actually, I think this is not quite right. It's a bit more subtle than > > > > it first appears. 
The problem of course is that the callback workqueue > > > > jobs run in a different task than the RPC workqueue jobs, so they can > > > > race. > > > > > > > > cl_cb_inflight gets bumped when the callback is first queued, and only > > > > gets released in nfsd41_destroy_cb(). If it fails to be queued, it's > > > > because something else has queued the workqueue job in the meantime. > > > > > > > > There are two places that can occur: nfsd4_cb_release() and > > > > nfsd4_run_cb(). Since this is occurring in nfsd4_cb_release(), the only > > > > other option is that something raced in and queued it via > > > > nfsd4_run_cb(). > > > > > > What would be the "something" that raced in? > > > > > > > I think we may be able to get there via multiple __break_lease() calls > > on the same layout or delegation. That could mean multiple calls to the > > ->lm_break operation on the same object. > > Makes sense. > > Out of curiosity, what would be the complexity/performance cost of > serializing the lm_break calls, or having each lm_break call register > an interest in the CB_RECALL callback? Maybe not worth it. > I'd probably resist trying to put that logic in generic code. I think for now I'd rather just add this logic in the nfsd4_callback handling. Longer term, I'm wondering if we really need the callback workqueue at all. It basically exists to just queue async RPC jobs (which are themselves workqueue tasks). Maybe it'd be simpler to just queue the RPC tasks directly at the points where we call nfsd4_run_cb() today? That's a bit more of an overhaul though, particularly in the callback channel probing codepath. > > > > > That will have incremented cl_cb_inflight() an extra > > > > time and so your patch will make sense for that. > > > > > > > > Unfortunately, the slot may leak in that case if nothing released it > > > > earlier. I think this probably needs to call nfsd41_destroy_cb() if the > > > > job can't be queued. That might, however, race with the callback > > > > workqueue job running. > > > > > > > > I think we might need to consider adding some sort of "this callback is > > > > running" atomic flag: do a test_and_set on the flag in nfsd4_run_cb() > > > > and only queue the workqueue job if that comes back false. Then, we can > > > > clear the bit in nfsd41_destroy_cb(). > > > > > > > > That should ensure that you never fail to queue the workqueue job from > > > > nfsd4_cb_release(). > > > > > > > > Thoughts? > > > > > > > > > >
On 2/18/25 10:51 AM, Jeff Layton wrote: > On Tue, 2025-02-18 at 10:02 -0500, Chuck Lever wrote: >> On 2/18/25 9:40 AM, Jeff Layton wrote: >>> On Tue, 2025-02-18 at 09:31 -0500, Chuck Lever wrote: >>>> On 2/18/25 9:29 AM, Jeff Layton wrote: >>>>> On Tue, 2025-02-18 at 08:58 -0500, Jeff Layton wrote: >>>>>> On Tue, 2025-02-18 at 21:54 +0800, Li Lingfeng wrote: >>>>>>> In nfsd4_run_cb, cl_cb_inflight is increased before attempting to queue >>>>>>> cb_work to callback_wq. This count can be decreased in three situations: >>>>>>> 1) If queuing fails in nfsd4_run_cb, the count will be decremented >>>>>>> accordingly. >>>>>>> 2) After cb_work is running, the count is decreased in the exception >>>>>>> branch of nfsd4_run_cb_work via nfsd41_destroy_cb. >>>>>>> 3) The count is decreased in the release callback of rpc_task — either >>>>>>> directly calling nfsd41_cb_inflight_end in nfsd4_cb_probe_release, or >>>>>>> calling nfsd41_destroy_cb in . >>>>>>> >>>>>>> However, in nfsd4_cb_release, if the current cb_work needs to restart, the >>>>>>> count will not be decreased, with the expectation that it will be reduced >>>>>>> once cb_work is running. >>>>>>> If queuing fails here, then the count will leak, ultimately causing the >>>>>>> nfsd service to be unable to exit as shown below: >>>>>>> [root@nfs_test2 ~]# cat /proc/2271/stack >>>>>>> [<0>] nfsd4_shutdown_callback+0x22b/0x290 >>>>>>> [<0>] __destroy_client+0x3cd/0x5c0 >>>>>>> [<0>] nfs4_state_destroy_net+0xd2/0x330 >>>>>>> [<0>] nfs4_state_shutdown_net+0x2ad/0x410 >>>>>>> [<0>] nfsd_shutdown_net+0xb7/0x250 >>>>>>> [<0>] nfsd_last_thread+0x15f/0x2a0 >>>>>>> [<0>] nfsd_svc+0x388/0x3f0 >>>>>>> [<0>] write_threads+0x17e/0x2b0 >>>>>>> [<0>] nfsctl_transaction_write+0x91/0xf0 >>>>>>> [<0>] vfs_write+0x1c4/0x750 >>>>>>> [<0>] ksys_write+0xcb/0x170 >>>>>>> [<0>] do_syscall_64+0x70/0x120 >>>>>>> [<0>] entry_SYSCALL_64_after_hwframe+0x78/0xe2 >>>>>>> [root@nfs_test2 ~]# >>>>>>> >>>>>>> Fix this by decreasing cl_cb_inflight if the restart fails. >>>>>>> >>>>>>> Fixes: cba5f62b1830 ("nfsd: fix callback restarts") >>>>>>> Signed-off-by: Li Lingfeng <lilingfeng3@huawei.com> >>>>>>> --- >>>>>>> fs/nfsd/nfs4callback.c | 10 +++++++--- >>>>>>> 1 file changed, 7 insertions(+), 3 deletions(-) >>>>>>> >>>>>>> diff --git a/fs/nfsd/nfs4callback.c b/fs/nfsd/nfs4callback.c >>>>>>> index 484077200c5d..8a7d24efdd08 100644 >>>>>>> --- a/fs/nfsd/nfs4callback.c >>>>>>> +++ b/fs/nfsd/nfs4callback.c >>>>>>> @@ -1459,12 +1459,16 @@ static void nfsd4_cb_done(struct rpc_task *task, void *calldata) >>>>>>> static void nfsd4_cb_release(void *calldata) >>>>>>> { >>>>>>> struct nfsd4_callback *cb = calldata; >>>>>>> + struct nfs4_client *clp = cb->cb_clp; >>>>>>> + int queued; >>>>>>> >>>>>>> trace_nfsd_cb_rpc_release(cb->cb_clp); >>>>>>> >>>>>>> - if (cb->cb_need_restart) >>>>>>> - nfsd4_queue_cb(cb); >>>>>>> - else >>>>>>> + if (cb->cb_need_restart) { >>>>>>> + queued = nfsd4_queue_cb(cb); >>>>>>> + if (!queued) >>>>>>> + nfsd41_cb_inflight_end(clp); >>>>>>> + } else >>>>>>> nfsd41_destroy_cb(cb); >>>>>>> >>>>>>> } >>>>>> >>>>>> Good catch! >>>>>> >>>>>> Reviewed-by: Jeff Layton <jlayton@kernel.org> >>>>>> >>>>> >>>>> Actually, I think this is not quite right. It's a bit more subtle than >>>>> it first appears. The problem of course is that the callback workqueue >>>>> jobs run in a different task than the RPC workqueue jobs, so they can >>>>> race. >>>>> >>>>> cl_cb_inflight gets bumped when the callback is first queued, and only >>>>> gets released in nfsd41_destroy_cb(). 
If it fails to be queued, it's >>>>> because something else has queued the workqueue job in the meantime. >>>>> >>>>> There are two places that can occur: nfsd4_cb_release() and >>>>> nfsd4_run_cb(). Since this is occurring in nfsd4_cb_release(), the only >>>>> other option is that something raced in and queued it via >>>>> nfsd4_run_cb(). >>>> >>>> What would be the "something" that raced in? >>>> >>> >>> I think we may be able to get there via multiple __break_lease() calls >>> on the same layout or delegation. That could mean multiple calls to the >>> ->lm_break operation on the same object. >> >> Makes sense. >> >> Out of curiosity, what would be the complexity/performance cost of >> serializing the lm_break calls, or having each lm_break call register >> an interest in the CB_RECALL callback? Maybe not worth it. >> > > I'd probably resist trying to put that logic in generic code. Hrm, I consider the callback code itself as generic, since all callback operations use it. It makes sense for example to specifically constrain the number of CB_RECALLs on a single inode to one. > I think > for now I'd rather just add this logic in the nfsd4_callback handling. Fair enough. Let's see a specific change proposal before letting our imaginations run wild. ;-) > Longer term, I'm wondering if we really need the callback workqueue at > all. It basically exists to just queue async RPC jobs (which are > themselves workqueue tasks). Maybe it'd be simpler to just queue the > RPC tasks directly at the points where we call nfsd4_run_cb() today? The workqueue is single-threaded, so it serializes access to various data structures, and prevents multiple concurrent attempts to start or shutdown a backchannel. The WQ can be removed, but only after alternative serialization has been added. Not impossible, but not a dev priority either. > That's a bit more of an overhaul though, particularly in the callback > channel probing codepath. > >> >>>>> That will have incremented cl_cb_inflight() an extra >>>>> time and so your patch will make sense for that. >>>>> >>>>> Unfortunately, the slot may leak in that case if nothing released it >>>>> earlier. I think this probably needs to call nfsd41_destroy_cb() if the >>>>> job can't be queued. That might, however, race with the callback >>>>> workqueue job running. >>>>> >>>>> I think we might need to consider adding some sort of "this callback is >>>>> running" atomic flag: do a test_and_set on the flag in nfsd4_run_cb() >>>>> and only queue the workqueue job if that comes back false. Then, we can >>>>> clear the bit in nfsd41_destroy_cb(). >>>>> >>>>> That should ensure that you never fail to queue the workqueue job from >>>>> nfsd4_cb_release(). >>>>> >>>>> Thoughts? >>>> >>>> >>> >> >> >
diff --git a/fs/nfsd/nfs4callback.c b/fs/nfsd/nfs4callback.c
index 484077200c5d..8a7d24efdd08 100644
--- a/fs/nfsd/nfs4callback.c
+++ b/fs/nfsd/nfs4callback.c
@@ -1459,12 +1459,16 @@ static void nfsd4_cb_done(struct rpc_task *task, void *calldata)
 static void nfsd4_cb_release(void *calldata)
 {
 	struct nfsd4_callback *cb = calldata;
+	struct nfs4_client *clp = cb->cb_clp;
+	int queued;
 
 	trace_nfsd_cb_rpc_release(cb->cb_clp);
 
-	if (cb->cb_need_restart)
-		nfsd4_queue_cb(cb);
-	else
+	if (cb->cb_need_restart) {
+		queued = nfsd4_queue_cb(cb);
+		if (!queued)
+			nfsd41_cb_inflight_end(clp);
+	} else
 		nfsd41_destroy_cb(cb);
 
 }
In nfsd4_run_cb, cl_cb_inflight is increased before attempting to queue
cb_work to callback_wq. This count can be decreased in three situations:
1) If queuing fails in nfsd4_run_cb, the count will be decremented
accordingly.
2) After cb_work is running, the count is decreased in the exception
branch of nfsd4_run_cb_work via nfsd41_destroy_cb.
3) The count is decreased in the release callback of rpc_task — either
directly calling nfsd41_cb_inflight_end in nfsd4_cb_probe_release, or
calling nfsd41_destroy_cb in nfsd4_cb_release.

However, in nfsd4_cb_release, if the current cb_work needs to restart, the
count will not be decreased, with the expectation that it will be reduced
once cb_work is running.
If queuing fails here, then the count will leak, ultimately causing the
nfsd service to be unable to exit as shown below:

[root@nfs_test2 ~]# cat /proc/2271/stack
[<0>] nfsd4_shutdown_callback+0x22b/0x290
[<0>] __destroy_client+0x3cd/0x5c0
[<0>] nfs4_state_destroy_net+0xd2/0x330
[<0>] nfs4_state_shutdown_net+0x2ad/0x410
[<0>] nfsd_shutdown_net+0xb7/0x250
[<0>] nfsd_last_thread+0x15f/0x2a0
[<0>] nfsd_svc+0x388/0x3f0
[<0>] write_threads+0x17e/0x2b0
[<0>] nfsctl_transaction_write+0x91/0xf0
[<0>] vfs_write+0x1c4/0x750
[<0>] ksys_write+0xcb/0x170
[<0>] do_syscall_64+0x70/0x120
[<0>] entry_SYSCALL_64_after_hwframe+0x78/0xe2
[root@nfs_test2 ~]#

Fix this by decreasing cl_cb_inflight if the restart fails.

Fixes: cba5f62b1830 ("nfsd: fix callback restarts")
Signed-off-by: Li Lingfeng <lilingfeng3@huawei.com>
---
 fs/nfsd/nfs4callback.c | 10 +++++++---
 1 file changed, 7 insertions(+), 3 deletions(-)
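For context on why the leak shows up as a hung shutdown: client teardown waits for cl_cb_inflight to reach zero, so a single unbalanced increment parks nfsd4_shutdown_callback() forever, which is the stack trace above. The counter helpers work roughly as sketched below; this is paraphrased from memory of fs/nfsd/nfs4callback.c and may differ in detail.

```c
static void nfsd41_cb_inflight_begin(struct nfs4_client *clp)
{
	atomic_inc(&clp->cl_cb_inflight);
}

static void nfsd41_cb_inflight_end(struct nfs4_client *clp)
{
	/* Wake the shutdown path once the last callback completes. */
	if (atomic_dec_and_test(&clp->cl_cb_inflight))
		wake_up_var(&clp->cl_cb_inflight);
}

/* Used on the client-destroy path (see nfsd4_shutdown_callback in the
 * stack trace): sleeps until every in-flight callback has finished. */
static void nfsd41_cb_inflight_wait_complete(struct nfs4_client *clp)
{
	wait_var_event(&clp->cl_cb_inflight,
		       !atomic_read(&clp->cl_cb_inflight));
}
```

So an nfsd4_cb_release() whose requeue fails leaves one increment with no matching decrement, and the wait above never returns.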