Message ID | 20250218135423.1487309-1-lilingfeng3@huawei.com (mailing list archive) |
---|---|
State | Under Review |
Delegated to: | Chuck Lever |
Series | nfsd: decrease cl_cb_inflight if fail to queue cb_work
On Tue, 2025-02-18 at 21:54 +0800, Li Lingfeng wrote: > In nfsd4_run_cb, cl_cb_inflight is increased before attempting to queue > cb_work to callback_wq. This count can be decreased in three situations: > 1) If queuing fails in nfsd4_run_cb, the count will be decremented > accordingly. > 2) After cb_work is running, the count is decreased in the exception > branch of nfsd4_run_cb_work via nfsd41_destroy_cb. > 3) The count is decreased in the release callback of rpc_task — either > directly calling nfsd41_cb_inflight_end in nfsd4_cb_probe_release, or > calling nfsd41_destroy_cb in nfsd4_cb_release. > > However, in nfsd4_cb_release, if the current cb_work needs to restart, the > count will not be decreased, with the expectation that it will be reduced > once cb_work is running. > If queuing fails here, then the count will leak, ultimately causing the > nfsd service to be unable to exit as shown below: > [root@nfs_test2 ~]# cat /proc/2271/stack > [<0>] nfsd4_shutdown_callback+0x22b/0x290 > [<0>] __destroy_client+0x3cd/0x5c0 > [<0>] nfs4_state_destroy_net+0xd2/0x330 > [<0>] nfs4_state_shutdown_net+0x2ad/0x410 > [<0>] nfsd_shutdown_net+0xb7/0x250 > [<0>] nfsd_last_thread+0x15f/0x2a0 > [<0>] nfsd_svc+0x388/0x3f0 > [<0>] write_threads+0x17e/0x2b0 > [<0>] nfsctl_transaction_write+0x91/0xf0 > [<0>] vfs_write+0x1c4/0x750 > [<0>] ksys_write+0xcb/0x170 > [<0>] do_syscall_64+0x70/0x120 > [<0>] entry_SYSCALL_64_after_hwframe+0x78/0xe2 > [root@nfs_test2 ~]# > > Fix this by decreasing cl_cb_inflight if the restart fails. > > Fixes: cba5f62b1830 ("nfsd: fix callback restarts") > Signed-off-by: Li Lingfeng <lilingfeng3@huawei.com> > --- > fs/nfsd/nfs4callback.c | 10 +++++++--- > 1 file changed, 7 insertions(+), 3 deletions(-) > > diff --git a/fs/nfsd/nfs4callback.c b/fs/nfsd/nfs4callback.c > index 484077200c5d..8a7d24efdd08 100644 > --- a/fs/nfsd/nfs4callback.c > +++ b/fs/nfsd/nfs4callback.c > @@ -1459,12 +1459,16 @@ static void nfsd4_cb_done(struct rpc_task *task, void *calldata) > static void nfsd4_cb_release(void *calldata) > { > struct nfsd4_callback *cb = calldata; > + struct nfs4_client *clp = cb->cb_clp; > + int queued; > > trace_nfsd_cb_rpc_release(cb->cb_clp); > > - if (cb->cb_need_restart) > - nfsd4_queue_cb(cb); > - else > + if (cb->cb_need_restart) { > + queued = nfsd4_queue_cb(cb); > + if (!queued) > + nfsd41_cb_inflight_end(clp); > + } else > nfsd41_destroy_cb(cb); > > } Good catch! Reviewed-by: Jeff Layton <jlayton@kernel.org>
From: Chuck Lever <chuck.lever@oracle.com>

On Tue, 18 Feb 2025 21:54:23 +0800, Li Lingfeng wrote:
> In nfsd4_run_cb, cl_cb_inflight is increased before attempting to queue
> cb_work to callback_wq. This count can be decreased in three situations:
> 1) If queuing fails in nfsd4_run_cb, the count will be decremented
> accordingly.
> 2) After cb_work is running, the count is decreased in the exception
> branch of nfsd4_run_cb_work via nfsd41_destroy_cb.
> 3) The count is decreased in the release callback of rpc_task — either
> directly calling nfsd41_cb_inflight_end in nfsd4_cb_probe_release, or
> calling nfsd41_destroy_cb in nfsd4_cb_release.
>
> [...]

Applied to nfsd-testing, thanks!

[1/1] nfsd: decrease cl_cb_inflight if fail to queue cb_work
      commit: bcf482e3b1f4e622c6a1bfa971cd8035227b8b56

--
Chuck Lever
On Tue, 2025-02-18 at 08:58 -0500, Jeff Layton wrote: > On Tue, 2025-02-18 at 21:54 +0800, Li Lingfeng wrote: > > In nfsd4_run_cb, cl_cb_inflight is increased before attempting to queue > > cb_work to callback_wq. This count can be decreased in three situations: > > 1) If queuing fails in nfsd4_run_cb, the count will be decremented > > accordingly. > > 2) After cb_work is running, the count is decreased in the exception > > branch of nfsd4_run_cb_work via nfsd41_destroy_cb. > > 3) The count is decreased in the release callback of rpc_task — either > > directly calling nfsd41_cb_inflight_end in nfsd4_cb_probe_release, or > > calling nfsd41_destroy_cb in . > > > > However, in nfsd4_cb_release, if the current cb_work needs to restart, the > > count will not be decreased, with the expectation that it will be reduced > > once cb_work is running. > > If queuing fails here, then the count will leak, ultimately causing the > > nfsd service to be unable to exit as shown below: > > [root@nfs_test2 ~]# cat /proc/2271/stack > > [<0>] nfsd4_shutdown_callback+0x22b/0x290 > > [<0>] __destroy_client+0x3cd/0x5c0 > > [<0>] nfs4_state_destroy_net+0xd2/0x330 > > [<0>] nfs4_state_shutdown_net+0x2ad/0x410 > > [<0>] nfsd_shutdown_net+0xb7/0x250 > > [<0>] nfsd_last_thread+0x15f/0x2a0 > > [<0>] nfsd_svc+0x388/0x3f0 > > [<0>] write_threads+0x17e/0x2b0 > > [<0>] nfsctl_transaction_write+0x91/0xf0 > > [<0>] vfs_write+0x1c4/0x750 > > [<0>] ksys_write+0xcb/0x170 > > [<0>] do_syscall_64+0x70/0x120 > > [<0>] entry_SYSCALL_64_after_hwframe+0x78/0xe2 > > [root@nfs_test2 ~]# > > > > Fix this by decreasing cl_cb_inflight if the restart fails. > > > > Fixes: cba5f62b1830 ("nfsd: fix callback restarts") > > Signed-off-by: Li Lingfeng <lilingfeng3@huawei.com> > > --- > > fs/nfsd/nfs4callback.c | 10 +++++++--- > > 1 file changed, 7 insertions(+), 3 deletions(-) > > > > diff --git a/fs/nfsd/nfs4callback.c b/fs/nfsd/nfs4callback.c > > index 484077200c5d..8a7d24efdd08 100644 > > --- a/fs/nfsd/nfs4callback.c > > +++ b/fs/nfsd/nfs4callback.c > > @@ -1459,12 +1459,16 @@ static void nfsd4_cb_done(struct rpc_task *task, void *calldata) > > static void nfsd4_cb_release(void *calldata) > > { > > struct nfsd4_callback *cb = calldata; > > + struct nfs4_client *clp = cb->cb_clp; > > + int queued; > > > > trace_nfsd_cb_rpc_release(cb->cb_clp); > > > > - if (cb->cb_need_restart) > > - nfsd4_queue_cb(cb); > > - else > > + if (cb->cb_need_restart) { > > + queued = nfsd4_queue_cb(cb); > > + if (!queued) > > + nfsd41_cb_inflight_end(clp); > > + } else > > nfsd41_destroy_cb(cb); > > > > } > > Good catch! > > Reviewed-by: Jeff Layton <jlayton@kernel.org> > Actually, I think this is not quite right. It's a bit more subtle than it first appears. The problem of course is that the callback workqueue jobs run in a different task than the RPC workqueue jobs, so they can race. cl_cb_inflight gets bumped when the callback is first queued, and only gets released in nfsd41_destroy_cb(). If it fails to be queued, it's because something else has queued the workqueue job in the meantime. There are two places that can occur: nfsd4_cb_release() and nfsd4_run_cb(). Since this is occurring in nfsd4_cb_release(), the only other option is that something raced in and queued it via nfsd4_run_cb(). That will have incremented cl_cb_inflight() an extra time and so your patch will make sense for that. Unfortunately, the slot may leak in that case if nothing released it earlier. I think this probably needs to call nfsd41_destroy_cb() if the job can't be queued. 
That might, however, race with the callback workqueue job running.

I think we might need to consider adding some sort of "this callback is
running" atomic flag: do a test_and_set on the flag in nfsd4_run_cb() and
only queue the workqueue job if that comes back false. Then, we can clear
the bit in nfsd41_destroy_cb().

That should ensure that you never fail to queue the workqueue job from
nfsd4_cb_release().

Thoughts?
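A minimal sketch of the flag idea described above. This is not a tested patch: cb_flags and NFSD4_CALLBACK_RUNNING are hypothetical names for a new bit in struct nfsd4_callback, and exactly where the clear happens is an assumption for illustration.

```c
#define NFSD4_CALLBACK_RUNNING	0	/* hypothetical flag bit */

void nfsd4_run_cb(struct nfsd4_callback *cb)
{
	struct nfs4_client *clp = cb->cb_clp;

	/* A second caller finds the bit already set and backs off, so
	 * at most one inflight reference is taken per callback. */
	if (test_and_set_bit(NFSD4_CALLBACK_RUNNING, &cb->cb_flags))
		return;

	nfsd41_cb_inflight_begin(clp);
	if (!nfsd4_queue_cb(cb))
		nfsd41_cb_inflight_end(clp);
}

static void nfsd41_destroy_cb(struct nfsd4_callback *cb)
{
	/* ... existing teardown: release the backchannel slot, invoke
	 * ->release, drop the inflight reference ... */

	/* Last step: let nfsd4_run_cb() queue this callback again. */
	clear_bit(NFSD4_CALLBACK_RUNNING, &cb->cb_flags);
}
```

With something like that in place, nfsd4_cb_release() could requeue the restarted work without a concurrent nfsd4_run_cb() having slipped in, which is the guarantee being asked for here.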
On 2/18/25 9:29 AM, Jeff Layton wrote: > On Tue, 2025-02-18 at 08:58 -0500, Jeff Layton wrote: >> On Tue, 2025-02-18 at 21:54 +0800, Li Lingfeng wrote: >>> In nfsd4_run_cb, cl_cb_inflight is increased before attempting to queue >>> cb_work to callback_wq. This count can be decreased in three situations: >>> 1) If queuing fails in nfsd4_run_cb, the count will be decremented >>> accordingly. >>> 2) After cb_work is running, the count is decreased in the exception >>> branch of nfsd4_run_cb_work via nfsd41_destroy_cb. >>> 3) The count is decreased in the release callback of rpc_task — either >>> directly calling nfsd41_cb_inflight_end in nfsd4_cb_probe_release, or >>> calling nfsd41_destroy_cb in . >>> >>> However, in nfsd4_cb_release, if the current cb_work needs to restart, the >>> count will not be decreased, with the expectation that it will be reduced >>> once cb_work is running. >>> If queuing fails here, then the count will leak, ultimately causing the >>> nfsd service to be unable to exit as shown below: >>> [root@nfs_test2 ~]# cat /proc/2271/stack >>> [<0>] nfsd4_shutdown_callback+0x22b/0x290 >>> [<0>] __destroy_client+0x3cd/0x5c0 >>> [<0>] nfs4_state_destroy_net+0xd2/0x330 >>> [<0>] nfs4_state_shutdown_net+0x2ad/0x410 >>> [<0>] nfsd_shutdown_net+0xb7/0x250 >>> [<0>] nfsd_last_thread+0x15f/0x2a0 >>> [<0>] nfsd_svc+0x388/0x3f0 >>> [<0>] write_threads+0x17e/0x2b0 >>> [<0>] nfsctl_transaction_write+0x91/0xf0 >>> [<0>] vfs_write+0x1c4/0x750 >>> [<0>] ksys_write+0xcb/0x170 >>> [<0>] do_syscall_64+0x70/0x120 >>> [<0>] entry_SYSCALL_64_after_hwframe+0x78/0xe2 >>> [root@nfs_test2 ~]# >>> >>> Fix this by decreasing cl_cb_inflight if the restart fails. >>> >>> Fixes: cba5f62b1830 ("nfsd: fix callback restarts") >>> Signed-off-by: Li Lingfeng <lilingfeng3@huawei.com> >>> --- >>> fs/nfsd/nfs4callback.c | 10 +++++++--- >>> 1 file changed, 7 insertions(+), 3 deletions(-) >>> >>> diff --git a/fs/nfsd/nfs4callback.c b/fs/nfsd/nfs4callback.c >>> index 484077200c5d..8a7d24efdd08 100644 >>> --- a/fs/nfsd/nfs4callback.c >>> +++ b/fs/nfsd/nfs4callback.c >>> @@ -1459,12 +1459,16 @@ static void nfsd4_cb_done(struct rpc_task *task, void *calldata) >>> static void nfsd4_cb_release(void *calldata) >>> { >>> struct nfsd4_callback *cb = calldata; >>> + struct nfs4_client *clp = cb->cb_clp; >>> + int queued; >>> >>> trace_nfsd_cb_rpc_release(cb->cb_clp); >>> >>> - if (cb->cb_need_restart) >>> - nfsd4_queue_cb(cb); >>> - else >>> + if (cb->cb_need_restart) { >>> + queued = nfsd4_queue_cb(cb); >>> + if (!queued) >>> + nfsd41_cb_inflight_end(clp); >>> + } else >>> nfsd41_destroy_cb(cb); >>> >>> } >> >> Good catch! >> >> Reviewed-by: Jeff Layton <jlayton@kernel.org> >> > > Actually, I think this is not quite right. It's a bit more subtle than > it first appears. The problem of course is that the callback workqueue > jobs run in a different task than the RPC workqueue jobs, so they can > race. > > cl_cb_inflight gets bumped when the callback is first queued, and only > gets released in nfsd41_destroy_cb(). If it fails to be queued, it's > because something else has queued the workqueue job in the meantime. > > There are two places that can occur: nfsd4_cb_release() and > nfsd4_run_cb(). Since this is occurring in nfsd4_cb_release(), the only > other option is that something raced in and queued it via > nfsd4_run_cb(). What would be the "something" that raced in? > That will have incremented cl_cb_inflight() an extra > time and so your patch will make sense for that. 
> > Unfortunately, the slot may leak in that case if nothing released it > earlier. I think this probably needs to call nfsd41_destroy_cb() if the > job can't be queued. That might, however, race with the callback > workqueue job running. > > I think we might need to consider adding some sort of "this callback is > running" atomic flag: do a test_and_set on the flag in nfsd4_run_cb() > and only queue the workqueue job if that comes back false. Then, we can > clear the bit in nfsd41_destroy_cb(). > > That should ensure that you never fail to queue the workqueue job from > nfsd4_cb_release(). > > Thoughts?
On Tue, 2025-02-18 at 09:31 -0500, Chuck Lever wrote: > On 2/18/25 9:29 AM, Jeff Layton wrote: > > On Tue, 2025-02-18 at 08:58 -0500, Jeff Layton wrote: > > > On Tue, 2025-02-18 at 21:54 +0800, Li Lingfeng wrote: > > > > In nfsd4_run_cb, cl_cb_inflight is increased before attempting to queue > > > > cb_work to callback_wq. This count can be decreased in three situations: > > > > 1) If queuing fails in nfsd4_run_cb, the count will be decremented > > > > accordingly. > > > > 2) After cb_work is running, the count is decreased in the exception > > > > branch of nfsd4_run_cb_work via nfsd41_destroy_cb. > > > > 3) The count is decreased in the release callback of rpc_task — either > > > > directly calling nfsd41_cb_inflight_end in nfsd4_cb_probe_release, or > > > > calling nfsd41_destroy_cb in . > > > > > > > > However, in nfsd4_cb_release, if the current cb_work needs to restart, the > > > > count will not be decreased, with the expectation that it will be reduced > > > > once cb_work is running. > > > > If queuing fails here, then the count will leak, ultimately causing the > > > > nfsd service to be unable to exit as shown below: > > > > [root@nfs_test2 ~]# cat /proc/2271/stack > > > > [<0>] nfsd4_shutdown_callback+0x22b/0x290 > > > > [<0>] __destroy_client+0x3cd/0x5c0 > > > > [<0>] nfs4_state_destroy_net+0xd2/0x330 > > > > [<0>] nfs4_state_shutdown_net+0x2ad/0x410 > > > > [<0>] nfsd_shutdown_net+0xb7/0x250 > > > > [<0>] nfsd_last_thread+0x15f/0x2a0 > > > > [<0>] nfsd_svc+0x388/0x3f0 > > > > [<0>] write_threads+0x17e/0x2b0 > > > > [<0>] nfsctl_transaction_write+0x91/0xf0 > > > > [<0>] vfs_write+0x1c4/0x750 > > > > [<0>] ksys_write+0xcb/0x170 > > > > [<0>] do_syscall_64+0x70/0x120 > > > > [<0>] entry_SYSCALL_64_after_hwframe+0x78/0xe2 > > > > [root@nfs_test2 ~]# > > > > > > > > Fix this by decreasing cl_cb_inflight if the restart fails. > > > > > > > > Fixes: cba5f62b1830 ("nfsd: fix callback restarts") > > > > Signed-off-by: Li Lingfeng <lilingfeng3@huawei.com> > > > > --- > > > > fs/nfsd/nfs4callback.c | 10 +++++++--- > > > > 1 file changed, 7 insertions(+), 3 deletions(-) > > > > > > > > diff --git a/fs/nfsd/nfs4callback.c b/fs/nfsd/nfs4callback.c > > > > index 484077200c5d..8a7d24efdd08 100644 > > > > --- a/fs/nfsd/nfs4callback.c > > > > +++ b/fs/nfsd/nfs4callback.c > > > > @@ -1459,12 +1459,16 @@ static void nfsd4_cb_done(struct rpc_task *task, void *calldata) > > > > static void nfsd4_cb_release(void *calldata) > > > > { > > > > struct nfsd4_callback *cb = calldata; > > > > + struct nfs4_client *clp = cb->cb_clp; > > > > + int queued; > > > > > > > > trace_nfsd_cb_rpc_release(cb->cb_clp); > > > > > > > > - if (cb->cb_need_restart) > > > > - nfsd4_queue_cb(cb); > > > > - else > > > > + if (cb->cb_need_restart) { > > > > + queued = nfsd4_queue_cb(cb); > > > > + if (!queued) > > > > + nfsd41_cb_inflight_end(clp); > > > > + } else > > > > nfsd41_destroy_cb(cb); > > > > > > > > } > > > > > > Good catch! > > > > > > Reviewed-by: Jeff Layton <jlayton@kernel.org> > > > > > > > Actually, I think this is not quite right. It's a bit more subtle than > > it first appears. The problem of course is that the callback workqueue > > jobs run in a different task than the RPC workqueue jobs, so they can > > race. > > > > cl_cb_inflight gets bumped when the callback is first queued, and only > > gets released in nfsd41_destroy_cb(). If it fails to be queued, it's > > because something else has queued the workqueue job in the meantime. 
> > > > There are two places that can occur: nfsd4_cb_release() and > > nfsd4_run_cb(). Since this is occurring in nfsd4_cb_release(), the only > > other option is that something raced in and queued it via > > nfsd4_run_cb(). > > What would be the "something" that raced in? > I think we may be able to get there via multiple __break_lease() calls on the same layout or delegation. That could mean multiple calls to the ->lm_break operation on the same object. > > > That will have incremented cl_cb_inflight() an extra > > time and so your patch will make sense for that. > > > > Unfortunately, the slot may leak in that case if nothing released it > > earlier. I think this probably needs to call nfsd41_destroy_cb() if the > > job can't be queued. That might, however, race with the callback > > workqueue job running. > > > > I think we might need to consider adding some sort of "this callback is > > running" atomic flag: do a test_and_set on the flag in nfsd4_run_cb() > > and only queue the workqueue job if that comes back false. Then, we can > > clear the bit in nfsd41_destroy_cb(). > > > > That should ensure that you never fail to queue the workqueue job from > > nfsd4_cb_release(). > > > > Thoughts? > >
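Concretely, the interleaving being described looks something like the sketch below. The behaviour of nfsd4_run_cb() is taken from the patch description; exactly which task drives the second ->lm_break is an assumption for illustration.

    rpciod: nfsd4_cb_release(cb)        another task: second ->lm_break on the
    (cb->cb_need_restart is set)        same delegation -> nfsd4_run_cb(cb)
    ----------------------------        ------------------------------------
                                        nfsd41_cb_inflight_begin(clp)
                                        nfsd4_queue_cb(cb) succeeds;
                                          cb_work is now pending
    nfsd4_queue_cb(cb) fails because
      the work is already pending
    before this fix: return without
      dropping the inflight reference
                                        cb_work runs once and its normal
                                        completion drops only one reference

    result: cl_cb_inflight never returns to zero and
    nfsd4_shutdown_callback() blocks forever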
On 2/18/25 9:40 AM, Jeff Layton wrote: > On Tue, 2025-02-18 at 09:31 -0500, Chuck Lever wrote: >> On 2/18/25 9:29 AM, Jeff Layton wrote: >>> On Tue, 2025-02-18 at 08:58 -0500, Jeff Layton wrote: >>>> On Tue, 2025-02-18 at 21:54 +0800, Li Lingfeng wrote: >>>>> In nfsd4_run_cb, cl_cb_inflight is increased before attempting to queue >>>>> cb_work to callback_wq. This count can be decreased in three situations: >>>>> 1) If queuing fails in nfsd4_run_cb, the count will be decremented >>>>> accordingly. >>>>> 2) After cb_work is running, the count is decreased in the exception >>>>> branch of nfsd4_run_cb_work via nfsd41_destroy_cb. >>>>> 3) The count is decreased in the release callback of rpc_task — either >>>>> directly calling nfsd41_cb_inflight_end in nfsd4_cb_probe_release, or >>>>> calling nfsd41_destroy_cb in . >>>>> >>>>> However, in nfsd4_cb_release, if the current cb_work needs to restart, the >>>>> count will not be decreased, with the expectation that it will be reduced >>>>> once cb_work is running. >>>>> If queuing fails here, then the count will leak, ultimately causing the >>>>> nfsd service to be unable to exit as shown below: >>>>> [root@nfs_test2 ~]# cat /proc/2271/stack >>>>> [<0>] nfsd4_shutdown_callback+0x22b/0x290 >>>>> [<0>] __destroy_client+0x3cd/0x5c0 >>>>> [<0>] nfs4_state_destroy_net+0xd2/0x330 >>>>> [<0>] nfs4_state_shutdown_net+0x2ad/0x410 >>>>> [<0>] nfsd_shutdown_net+0xb7/0x250 >>>>> [<0>] nfsd_last_thread+0x15f/0x2a0 >>>>> [<0>] nfsd_svc+0x388/0x3f0 >>>>> [<0>] write_threads+0x17e/0x2b0 >>>>> [<0>] nfsctl_transaction_write+0x91/0xf0 >>>>> [<0>] vfs_write+0x1c4/0x750 >>>>> [<0>] ksys_write+0xcb/0x170 >>>>> [<0>] do_syscall_64+0x70/0x120 >>>>> [<0>] entry_SYSCALL_64_after_hwframe+0x78/0xe2 >>>>> [root@nfs_test2 ~]# >>>>> >>>>> Fix this by decreasing cl_cb_inflight if the restart fails. >>>>> >>>>> Fixes: cba5f62b1830 ("nfsd: fix callback restarts") >>>>> Signed-off-by: Li Lingfeng <lilingfeng3@huawei.com> >>>>> --- >>>>> fs/nfsd/nfs4callback.c | 10 +++++++--- >>>>> 1 file changed, 7 insertions(+), 3 deletions(-) >>>>> >>>>> diff --git a/fs/nfsd/nfs4callback.c b/fs/nfsd/nfs4callback.c >>>>> index 484077200c5d..8a7d24efdd08 100644 >>>>> --- a/fs/nfsd/nfs4callback.c >>>>> +++ b/fs/nfsd/nfs4callback.c >>>>> @@ -1459,12 +1459,16 @@ static void nfsd4_cb_done(struct rpc_task *task, void *calldata) >>>>> static void nfsd4_cb_release(void *calldata) >>>>> { >>>>> struct nfsd4_callback *cb = calldata; >>>>> + struct nfs4_client *clp = cb->cb_clp; >>>>> + int queued; >>>>> >>>>> trace_nfsd_cb_rpc_release(cb->cb_clp); >>>>> >>>>> - if (cb->cb_need_restart) >>>>> - nfsd4_queue_cb(cb); >>>>> - else >>>>> + if (cb->cb_need_restart) { >>>>> + queued = nfsd4_queue_cb(cb); >>>>> + if (!queued) >>>>> + nfsd41_cb_inflight_end(clp); >>>>> + } else >>>>> nfsd41_destroy_cb(cb); >>>>> >>>>> } >>>> >>>> Good catch! >>>> >>>> Reviewed-by: Jeff Layton <jlayton@kernel.org> >>>> >>> >>> Actually, I think this is not quite right. It's a bit more subtle than >>> it first appears. The problem of course is that the callback workqueue >>> jobs run in a different task than the RPC workqueue jobs, so they can >>> race. >>> >>> cl_cb_inflight gets bumped when the callback is first queued, and only >>> gets released in nfsd41_destroy_cb(). If it fails to be queued, it's >>> because something else has queued the workqueue job in the meantime. >>> >>> There are two places that can occur: nfsd4_cb_release() and >>> nfsd4_run_cb(). 
Since this is occurring in nfsd4_cb_release(), the only >>> other option is that something raced in and queued it via >>> nfsd4_run_cb(). >> >> What would be the "something" that raced in? >> > > I think we may be able to get there via multiple __break_lease() calls > on the same layout or delegation. That could mean multiple calls to the > ->lm_break operation on the same object. Makes sense. Out of curiosity, what would be the complexity/performance cost of serializing the lm_break calls, or having each lm_break call register an interest in the CB_RECALL callback? Maybe not worth it. >>> That will have incremented cl_cb_inflight() an extra >>> time and so your patch will make sense for that. >>> >>> Unfortunately, the slot may leak in that case if nothing released it >>> earlier. I think this probably needs to call nfsd41_destroy_cb() if the >>> job can't be queued. That might, however, race with the callback >>> workqueue job running. >>> >>> I think we might need to consider adding some sort of "this callback is >>> running" atomic flag: do a test_and_set on the flag in nfsd4_run_cb() >>> and only queue the workqueue job if that comes back false. Then, we can >>> clear the bit in nfsd41_destroy_cb(). >>> >>> That should ensure that you never fail to queue the workqueue job from >>> nfsd4_cb_release(). >>> >>> Thoughts? >> >> >
On Tue, 2025-02-18 at 10:02 -0500, Chuck Lever wrote: > On 2/18/25 9:40 AM, Jeff Layton wrote: > > On Tue, 2025-02-18 at 09:31 -0500, Chuck Lever wrote: > > > On 2/18/25 9:29 AM, Jeff Layton wrote: > > > > On Tue, 2025-02-18 at 08:58 -0500, Jeff Layton wrote: > > > > > On Tue, 2025-02-18 at 21:54 +0800, Li Lingfeng wrote: > > > > > > In nfsd4_run_cb, cl_cb_inflight is increased before attempting to queue > > > > > > cb_work to callback_wq. This count can be decreased in three situations: > > > > > > 1) If queuing fails in nfsd4_run_cb, the count will be decremented > > > > > > accordingly. > > > > > > 2) After cb_work is running, the count is decreased in the exception > > > > > > branch of nfsd4_run_cb_work via nfsd41_destroy_cb. > > > > > > 3) The count is decreased in the release callback of rpc_task — either > > > > > > directly calling nfsd41_cb_inflight_end in nfsd4_cb_probe_release, or > > > > > > calling nfsd41_destroy_cb in . > > > > > > > > > > > > However, in nfsd4_cb_release, if the current cb_work needs to restart, the > > > > > > count will not be decreased, with the expectation that it will be reduced > > > > > > once cb_work is running. > > > > > > If queuing fails here, then the count will leak, ultimately causing the > > > > > > nfsd service to be unable to exit as shown below: > > > > > > [root@nfs_test2 ~]# cat /proc/2271/stack > > > > > > [<0>] nfsd4_shutdown_callback+0x22b/0x290 > > > > > > [<0>] __destroy_client+0x3cd/0x5c0 > > > > > > [<0>] nfs4_state_destroy_net+0xd2/0x330 > > > > > > [<0>] nfs4_state_shutdown_net+0x2ad/0x410 > > > > > > [<0>] nfsd_shutdown_net+0xb7/0x250 > > > > > > [<0>] nfsd_last_thread+0x15f/0x2a0 > > > > > > [<0>] nfsd_svc+0x388/0x3f0 > > > > > > [<0>] write_threads+0x17e/0x2b0 > > > > > > [<0>] nfsctl_transaction_write+0x91/0xf0 > > > > > > [<0>] vfs_write+0x1c4/0x750 > > > > > > [<0>] ksys_write+0xcb/0x170 > > > > > > [<0>] do_syscall_64+0x70/0x120 > > > > > > [<0>] entry_SYSCALL_64_after_hwframe+0x78/0xe2 > > > > > > [root@nfs_test2 ~]# > > > > > > > > > > > > Fix this by decreasing cl_cb_inflight if the restart fails. > > > > > > > > > > > > Fixes: cba5f62b1830 ("nfsd: fix callback restarts") > > > > > > Signed-off-by: Li Lingfeng <lilingfeng3@huawei.com> > > > > > > --- > > > > > > fs/nfsd/nfs4callback.c | 10 +++++++--- > > > > > > 1 file changed, 7 insertions(+), 3 deletions(-) > > > > > > > > > > > > diff --git a/fs/nfsd/nfs4callback.c b/fs/nfsd/nfs4callback.c > > > > > > index 484077200c5d..8a7d24efdd08 100644 > > > > > > --- a/fs/nfsd/nfs4callback.c > > > > > > +++ b/fs/nfsd/nfs4callback.c > > > > > > @@ -1459,12 +1459,16 @@ static void nfsd4_cb_done(struct rpc_task *task, void *calldata) > > > > > > static void nfsd4_cb_release(void *calldata) > > > > > > { > > > > > > struct nfsd4_callback *cb = calldata; > > > > > > + struct nfs4_client *clp = cb->cb_clp; > > > > > > + int queued; > > > > > > > > > > > > trace_nfsd_cb_rpc_release(cb->cb_clp); > > > > > > > > > > > > - if (cb->cb_need_restart) > > > > > > - nfsd4_queue_cb(cb); > > > > > > - else > > > > > > + if (cb->cb_need_restart) { > > > > > > + queued = nfsd4_queue_cb(cb); > > > > > > + if (!queued) > > > > > > + nfsd41_cb_inflight_end(clp); > > > > > > + } else > > > > > > nfsd41_destroy_cb(cb); > > > > > > > > > > > > } > > > > > > > > > > Good catch! > > > > > > > > > > Reviewed-by: Jeff Layton <jlayton@kernel.org> > > > > > > > > > > > > > Actually, I think this is not quite right. It's a bit more subtle than > > > > it first appears. 
The problem of course is that the callback workqueue > > > > jobs run in a different task than the RPC workqueue jobs, so they can > > > > race. > > > > > > > > cl_cb_inflight gets bumped when the callback is first queued, and only > > > > gets released in nfsd41_destroy_cb(). If it fails to be queued, it's > > > > because something else has queued the workqueue job in the meantime. > > > > > > > > There are two places that can occur: nfsd4_cb_release() and > > > > nfsd4_run_cb(). Since this is occurring in nfsd4_cb_release(), the only > > > > other option is that something raced in and queued it via > > > > nfsd4_run_cb(). > > > > > > What would be the "something" that raced in? > > > > > > > I think we may be able to get there via multiple __break_lease() calls > > on the same layout or delegation. That could mean multiple calls to the > > ->lm_break operation on the same object. > > Makes sense. > > Out of curiosity, what would be the complexity/performance cost of > serializing the lm_break calls, or having each lm_break call register > an interest in the CB_RECALL callback? Maybe not worth it. > I'd probably resist trying to put that logic in generic code. I think for now I'd rather just add this logic in the nfsd4_callback handling. Longer term, I'm wondering if we really need the callback workqueue at all. It basically exists to just queue async RPC jobs (which are themselves workqueue tasks). Maybe it'd be simpler to just queue the RPC tasks directly at the points where we call nfsd4_run_cb() today? That's a bit more of an overhaul though, particularly in the callback channel probing codepath. > > > > > That will have incremented cl_cb_inflight() an extra > > > > time and so your patch will make sense for that. > > > > > > > > Unfortunately, the slot may leak in that case if nothing released it > > > > earlier. I think this probably needs to call nfsd41_destroy_cb() if the > > > > job can't be queued. That might, however, race with the callback > > > > workqueue job running. > > > > > > > > I think we might need to consider adding some sort of "this callback is > > > > running" atomic flag: do a test_and_set on the flag in nfsd4_run_cb() > > > > and only queue the workqueue job if that comes back false. Then, we can > > > > clear the bit in nfsd41_destroy_cb(). > > > > > > > > That should ensure that you never fail to queue the workqueue job from > > > > nfsd4_cb_release(). > > > > > > > > Thoughts? > > > > > > > > > >
On 2/18/25 10:51 AM, Jeff Layton wrote: > On Tue, 2025-02-18 at 10:02 -0500, Chuck Lever wrote: >> On 2/18/25 9:40 AM, Jeff Layton wrote: >>> On Tue, 2025-02-18 at 09:31 -0500, Chuck Lever wrote: >>>> On 2/18/25 9:29 AM, Jeff Layton wrote: >>>>> On Tue, 2025-02-18 at 08:58 -0500, Jeff Layton wrote: >>>>>> On Tue, 2025-02-18 at 21:54 +0800, Li Lingfeng wrote: >>>>>>> In nfsd4_run_cb, cl_cb_inflight is increased before attempting to queue >>>>>>> cb_work to callback_wq. This count can be decreased in three situations: >>>>>>> 1) If queuing fails in nfsd4_run_cb, the count will be decremented >>>>>>> accordingly. >>>>>>> 2) After cb_work is running, the count is decreased in the exception >>>>>>> branch of nfsd4_run_cb_work via nfsd41_destroy_cb. >>>>>>> 3) The count is decreased in the release callback of rpc_task — either >>>>>>> directly calling nfsd41_cb_inflight_end in nfsd4_cb_probe_release, or >>>>>>> calling nfsd41_destroy_cb in . >>>>>>> >>>>>>> However, in nfsd4_cb_release, if the current cb_work needs to restart, the >>>>>>> count will not be decreased, with the expectation that it will be reduced >>>>>>> once cb_work is running. >>>>>>> If queuing fails here, then the count will leak, ultimately causing the >>>>>>> nfsd service to be unable to exit as shown below: >>>>>>> [root@nfs_test2 ~]# cat /proc/2271/stack >>>>>>> [<0>] nfsd4_shutdown_callback+0x22b/0x290 >>>>>>> [<0>] __destroy_client+0x3cd/0x5c0 >>>>>>> [<0>] nfs4_state_destroy_net+0xd2/0x330 >>>>>>> [<0>] nfs4_state_shutdown_net+0x2ad/0x410 >>>>>>> [<0>] nfsd_shutdown_net+0xb7/0x250 >>>>>>> [<0>] nfsd_last_thread+0x15f/0x2a0 >>>>>>> [<0>] nfsd_svc+0x388/0x3f0 >>>>>>> [<0>] write_threads+0x17e/0x2b0 >>>>>>> [<0>] nfsctl_transaction_write+0x91/0xf0 >>>>>>> [<0>] vfs_write+0x1c4/0x750 >>>>>>> [<0>] ksys_write+0xcb/0x170 >>>>>>> [<0>] do_syscall_64+0x70/0x120 >>>>>>> [<0>] entry_SYSCALL_64_after_hwframe+0x78/0xe2 >>>>>>> [root@nfs_test2 ~]# >>>>>>> >>>>>>> Fix this by decreasing cl_cb_inflight if the restart fails. >>>>>>> >>>>>>> Fixes: cba5f62b1830 ("nfsd: fix callback restarts") >>>>>>> Signed-off-by: Li Lingfeng <lilingfeng3@huawei.com> >>>>>>> --- >>>>>>> fs/nfsd/nfs4callback.c | 10 +++++++--- >>>>>>> 1 file changed, 7 insertions(+), 3 deletions(-) >>>>>>> >>>>>>> diff --git a/fs/nfsd/nfs4callback.c b/fs/nfsd/nfs4callback.c >>>>>>> index 484077200c5d..8a7d24efdd08 100644 >>>>>>> --- a/fs/nfsd/nfs4callback.c >>>>>>> +++ b/fs/nfsd/nfs4callback.c >>>>>>> @@ -1459,12 +1459,16 @@ static void nfsd4_cb_done(struct rpc_task *task, void *calldata) >>>>>>> static void nfsd4_cb_release(void *calldata) >>>>>>> { >>>>>>> struct nfsd4_callback *cb = calldata; >>>>>>> + struct nfs4_client *clp = cb->cb_clp; >>>>>>> + int queued; >>>>>>> >>>>>>> trace_nfsd_cb_rpc_release(cb->cb_clp); >>>>>>> >>>>>>> - if (cb->cb_need_restart) >>>>>>> - nfsd4_queue_cb(cb); >>>>>>> - else >>>>>>> + if (cb->cb_need_restart) { >>>>>>> + queued = nfsd4_queue_cb(cb); >>>>>>> + if (!queued) >>>>>>> + nfsd41_cb_inflight_end(clp); >>>>>>> + } else >>>>>>> nfsd41_destroy_cb(cb); >>>>>>> >>>>>>> } >>>>>> >>>>>> Good catch! >>>>>> >>>>>> Reviewed-by: Jeff Layton <jlayton@kernel.org> >>>>>> >>>>> >>>>> Actually, I think this is not quite right. It's a bit more subtle than >>>>> it first appears. The problem of course is that the callback workqueue >>>>> jobs run in a different task than the RPC workqueue jobs, so they can >>>>> race. >>>>> >>>>> cl_cb_inflight gets bumped when the callback is first queued, and only >>>>> gets released in nfsd41_destroy_cb(). 
If it fails to be queued, it's >>>>> because something else has queued the workqueue job in the meantime. >>>>> >>>>> There are two places that can occur: nfsd4_cb_release() and >>>>> nfsd4_run_cb(). Since this is occurring in nfsd4_cb_release(), the only >>>>> other option is that something raced in and queued it via >>>>> nfsd4_run_cb(). >>>> >>>> What would be the "something" that raced in? >>>> >>> >>> I think we may be able to get there via multiple __break_lease() calls >>> on the same layout or delegation. That could mean multiple calls to the >>> ->lm_break operation on the same object. >> >> Makes sense. >> >> Out of curiosity, what would be the complexity/performance cost of >> serializing the lm_break calls, or having each lm_break call register >> an interest in the CB_RECALL callback? Maybe not worth it. >> > > I'd probably resist trying to put that logic in generic code. Hrm, I consider the callback code itself as generic, since all callback operations use it. It makes sense for example to specifically constrain the number of CB_RECALLs on a single inode to one. > I think > for now I'd rather just add this logic in the nfsd4_callback handling. Fair enough. Let's see a specific change proposal before letting our imaginations run wild. ;-) > Longer term, I'm wondering if we really need the callback workqueue at > all. It basically exists to just queue async RPC jobs (which are > themselves workqueue tasks). Maybe it'd be simpler to just queue the > RPC tasks directly at the points where we call nfsd4_run_cb() today? The workqueue is single-threaded, so it serializes access to various data structures, and prevents multiple concurrent attempts to start or shutdown a backchannel. The WQ can be removed, but only after alternative serialization has been added. Not impossible, but not a dev priority either. > That's a bit more of an overhaul though, particularly in the callback > channel probing codepath. > >> >>>>> That will have incremented cl_cb_inflight() an extra >>>>> time and so your patch will make sense for that. >>>>> >>>>> Unfortunately, the slot may leak in that case if nothing released it >>>>> earlier. I think this probably needs to call nfsd41_destroy_cb() if the >>>>> job can't be queued. That might, however, race with the callback >>>>> workqueue job running. >>>>> >>>>> I think we might need to consider adding some sort of "this callback is >>>>> running" atomic flag: do a test_and_set on the flag in nfsd4_run_cb() >>>>> and only queue the workqueue job if that comes back false. Then, we can >>>>> clear the bit in nfsd41_destroy_cb(). >>>>> >>>>> That should ensure that you never fail to queue the workqueue job from >>>>> nfsd4_cb_release(). >>>>> >>>>> Thoughts? >>>> >>>> >>> >> >> >
diff --git a/fs/nfsd/nfs4callback.c b/fs/nfsd/nfs4callback.c
index 484077200c5d..8a7d24efdd08 100644
--- a/fs/nfsd/nfs4callback.c
+++ b/fs/nfsd/nfs4callback.c
@@ -1459,12 +1459,16 @@ static void nfsd4_cb_done(struct rpc_task *task, void *calldata)
 static void nfsd4_cb_release(void *calldata)
 {
 	struct nfsd4_callback *cb = calldata;
+	struct nfs4_client *clp = cb->cb_clp;
+	int queued;
 
 	trace_nfsd_cb_rpc_release(cb->cb_clp);
 
-	if (cb->cb_need_restart)
-		nfsd4_queue_cb(cb);
-	else
+	if (cb->cb_need_restart) {
+		queued = nfsd4_queue_cb(cb);
+		if (!queued)
+			nfsd41_cb_inflight_end(clp);
+	} else
 		nfsd41_destroy_cb(cb);
 
 }
In nfsd4_run_cb, cl_cb_inflight is increased before attempting to queue
cb_work to callback_wq. This count can be decreased in three situations:
1) If queuing fails in nfsd4_run_cb, the count will be decremented
accordingly.
2) After cb_work is running, the count is decreased in the exception
branch of nfsd4_run_cb_work via nfsd41_destroy_cb.
3) The count is decreased in the release callback of rpc_task — either
directly calling nfsd41_cb_inflight_end in nfsd4_cb_probe_release, or
calling nfsd41_destroy_cb in nfsd4_cb_release.

However, in nfsd4_cb_release, if the current cb_work needs to restart, the
count will not be decreased, with the expectation that it will be reduced
once cb_work is running.
If queuing fails here, then the count will leak, ultimately causing the
nfsd service to be unable to exit as shown below:

[root@nfs_test2 ~]# cat /proc/2271/stack
[<0>] nfsd4_shutdown_callback+0x22b/0x290
[<0>] __destroy_client+0x3cd/0x5c0
[<0>] nfs4_state_destroy_net+0xd2/0x330
[<0>] nfs4_state_shutdown_net+0x2ad/0x410
[<0>] nfsd_shutdown_net+0xb7/0x250
[<0>] nfsd_last_thread+0x15f/0x2a0
[<0>] nfsd_svc+0x388/0x3f0
[<0>] write_threads+0x17e/0x2b0
[<0>] nfsctl_transaction_write+0x91/0xf0
[<0>] vfs_write+0x1c4/0x750
[<0>] ksys_write+0xcb/0x170
[<0>] do_syscall_64+0x70/0x120
[<0>] entry_SYSCALL_64_after_hwframe+0x78/0xe2
[root@nfs_test2 ~]#

Fix this by decreasing cl_cb_inflight if the restart fails.

Fixes: cba5f62b1830 ("nfsd: fix callback restarts")
Signed-off-by: Li Lingfeng <lilingfeng3@huawei.com>
---
 fs/nfsd/nfs4callback.c | 10 +++++++---
 1 file changed, 7 insertions(+), 3 deletions(-)
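For context on why the leak shows up as a hung shutdown: client teardown waits for cl_cb_inflight to reach zero, so a single unbalanced increment parks nfsd4_shutdown_callback() forever, which is the stack trace above. The counter helpers work roughly as sketched below; this is paraphrased from memory of fs/nfsd/nfs4callback.c and may differ in detail.

```c
static void nfsd41_cb_inflight_begin(struct nfs4_client *clp)
{
	atomic_inc(&clp->cl_cb_inflight);
}

static void nfsd41_cb_inflight_end(struct nfs4_client *clp)
{
	/* Wake the shutdown path once the last callback completes. */
	if (atomic_dec_and_test(&clp->cl_cb_inflight))
		wake_up_var(&clp->cl_cb_inflight);
}

/* Used on the client-destroy path (see nfsd4_shutdown_callback in the
 * stack trace): sleeps until every in-flight callback has finished. */
static void nfsd41_cb_inflight_wait_complete(struct nfs4_client *clp)
{
	wait_var_event(&clp->cl_cb_inflight,
		       !atomic_read(&clp->cl_cb_inflight));
}
```

So an nfsd4_cb_release() whose requeue fails leaves one increment with no matching decrement, and the wait above never returns.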