Message ID | 1351188163-10067-1-git-send-email-dros@netapp.com (mailing list archive) |
---|---|
State | New, archived |
On Thu, 2012-10-25 at 14:02 -0400, Weston Andros Adamson wrote:
> Calling nfs_kill_super from an RPC task callback would result in a deadlock
> where nfs_free_server (via rpc_shutdown_client) tries to kill all
> RPC tasks associated with that connection - including itself!
>
> Instead of calling nfs_kill_super directly, queue a job on the nfsiod
> workqueue.
>
> Signed-off-by: Weston Andros Adamson <dros@netapp.com>
> ---
>
> This fixes the current incarnation of the lockup I've been tracking down for
> some time now. I still have to go back and see why the reproducer changed
> behavior a few weeks ago - tasks used to get stuck in rpc_prepare_task, but
> now (before this patch) are stuck in rpc_exit.
>
> The reproducer works against a server with write delegations:
>
>  ./nfsometer.py -m v4 server:/path dd_100m_100k
>
> which is basically:
>  - mount
>  - dd if=/dev/zero of=./dd_file.100m_100k bs=102400 count=1024
>  - umount
>  - break if /proc/fs/nfsfs/servers still has an entry after 5 seconds (in this
>    case it NEVER goes away)
>
> There are clearly other ways to trigger this deadlock, like a v4.1 CLOSE - the
> done handler calls nfs_sb_deactive...
>
> I've tested this approach with 10 runs X 3 nfs versions X 5 workloads
> (dd_100m_100k, dd_100m_1k, python, kernel, cthon), so I'm pretty confident
> it's correct.
>
> One question for the list: should nfs_free_server *always* be scheduled on
> the nfsiod workqueue? It's called in error paths in several locations.
> After looking at them, I don't think my approach would break anything, but
> some might have style objections.

This doesn't add up. There should be nothing calling nfs_sb_deactive()
from an rpc_call_done() callback. If so, then that would be the bug.

All calls to things like rpc_put_task(), put_nfs_open_context(), dput(),
or nfs_sb_deactive() should occur in the rpc_call_release() callback if
they can't be done in a process context. In both those cases, the
rpc_task will be invisible to rpc_killall_tasks and rpc_shutdown_client.

--
Trond Myklebust
Linux NFS client maintainer

NetApp
Trond.Myklebust@netapp.com
www.netapp.com
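For readers following along, here is a minimal, hypothetical sketch of the convention Trond describes: interpret the RPC status in .rpc_call_done and do all reference-dropping in .rpc_release. The foo_* names and the calldata layout are invented for illustration, and the sketch assumes it sits under fs/nfs/ so that nfs_sb_deactive() from internal.h is in scope; it is not code from this thread or from the patch.

```c
#include <linux/slab.h>
#include <linux/sunrpc/sched.h>
#include <linux/nfs_fs.h>
#include "internal.h"		/* assumed location of nfs_sb_deactive() */

struct foo_calldata {
	struct nfs_open_context *ctx;
	struct super_block *sb;
	int status;
};

static void foo_call_done(struct rpc_task *task, void *data)
{
	struct foo_calldata *calldata = data;

	/* Interpret the RPC result here; drop no references yet. */
	calldata->status = task->tk_status;
}

static void foo_call_release(void *data)
{
	struct foo_calldata *calldata = data;

	/*
	 * Per the description above, by the time this runs the task is no
	 * longer visible to rpc_killall_tasks()/rpc_shutdown_client().
	 */
	put_nfs_open_context(calldata->ctx);
	nfs_sb_deactive(calldata->sb);
	kfree(calldata);
}

static const struct rpc_call_ops foo_call_ops = {
	.rpc_call_done	= foo_call_done,
	.rpc_release	= foo_call_release,
};
```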
On Oct 25, 2012, at 2:17 PM, "Myklebust, Trond" <Trond.Myklebust@netapp.com> wrote:

> On Thu, 2012-10-25 at 14:02 -0400, Weston Andros Adamson wrote:
>> Calling nfs_kill_super from an RPC task callback would result in a deadlock
>> where nfs_free_server (via rpc_shutdown_client) tries to kill all
>> RPC tasks associated with that connection - including itself!
>>
>> Instead of calling nfs_kill_super directly, queue a job on the nfsiod
>> workqueue.
>>
>> Signed-off-by: Weston Andros Adamson <dros@netapp.com>
>> ---
>>
>> This fixes the current incarnation of the lockup I've been tracking down for
>> some time now. I still have to go back and see why the reproducer changed
>> behavior a few weeks ago - tasks used to get stuck in rpc_prepare_task, but
>> now (before this patch) are stuck in rpc_exit.
>>
>> The reproducer works against a server with write delegations:
>>
>>  ./nfsometer.py -m v4 server:/path dd_100m_100k
>>
>> which is basically:
>>  - mount
>>  - dd if=/dev/zero of=./dd_file.100m_100k bs=102400 count=1024
>>  - umount
>>  - break if /proc/fs/nfsfs/servers still has an entry after 5 seconds (in this
>>    case it NEVER goes away)
>>
>> There are clearly other ways to trigger this deadlock, like a v4.1 CLOSE - the
>> done handler calls nfs_sb_deactive...
>>
>> I've tested this approach with 10 runs X 3 nfs versions X 5 workloads
>> (dd_100m_100k, dd_100m_1k, python, kernel, cthon), so I'm pretty confident
>> it's correct.
>>
>> One question for the list: should nfs_free_server *always* be scheduled on
>> the nfsiod workqueue? It's called in error paths in several locations.
>> After looking at them, I don't think my approach would break anything, but
>> some might have style objections.
>
> This doesn't add up. There should be nothing calling nfs_sb_deactive()
> from an rpc_call_done() callback. If so, then that would be the bug.
>
> All calls to things like rpc_put_task(), put_nfs_open_context(), dput(),
> or nfs_sb_deactive() should occur in the rpc_call_release() callback if
> they can't be done in a process context. In both those cases, the
> rpc_task will be invisible to rpc_killall_tasks and rpc_shutdown_client.

Ah, I misunderstood what was going on here. nfs_kill_super *is* being
called by the rpc_release_calldata callback:

The kworker stuck in rpc_killall_tasks forever:

[ 34.552600] [<ffffffffa00868e6>] rpc_killall_tasks+0x2d/0xcd [sunrpc]
[ 34.552608] [<ffffffffa00883e4>] rpc_shutdown_client+0x4a/0xec [sunrpc]
[ 34.552615] [<ffffffffa002b973>] nfs_free_server+0xcf/0x133 [nfs]
[ 34.552625] [<ffffffffa0033193>] nfs_kill_super+0x37/0x3c [nfs]
[ 34.552629] [<ffffffff81136c68>] deactivate_locked_super+0x37/0x63
[ 34.552633] [<ffffffff8113785f>] deactivate_super+0x37/0x3b
[ 34.552642] [<ffffffffa0034fc1>] nfs_sb_deactive+0x23/0x25 [nfs]
[ 34.552649] [<ffffffffa00dbba2>] nfs4_free_closedata+0x53/0x63 [nfsv4]
[ 34.552661] [<ffffffffa008f997>] rpc_release_calldata+0x17/0x19 [sunrpc]
[ 34.552671] [<ffffffffa008f9f5>] rpc_free_task+0x5c/0x65 [sunrpc]
[ 34.552680] [<ffffffffa008fe07>] rpc_async_release+0x15/0x17 [sunrpc]
[ 34.552684] [<ffffffff810632b7>] process_one_work+0x192/0x2a0
[ 34.552693] [<ffffffffa008fdf2>] ? rpc_async_schedule+0x33/0x33 [sunrpc]
[ 34.552697] [<ffffffff81064169>] worker_thread+0x140/0x1d7
[ 34.552700] [<ffffffff81064029>] ? manage_workers+0x23b/0x23b
[ 34.552704] [<ffffffff81067d21>] kthread+0x8d/0x95
[ 34.552708] [<ffffffff81067c94>] ? kthread_freezable_should_stop+0x43/0x43
[ 34.552713] [<ffffffff814ef1ac>] ret_from_fork+0x7c/0xb0
[ 34.552717] [<ffffffff81067c94>] ? kthread_freezable_should_stop+0x43/0x43

And the client's task list:

[ 174.574006] -pid- flgs status -client- --rqstp- -timeout ---ops--
[ 174.574019] 1664 0181 -5 ffff880226474600 (null) 0 ffffffffa00f7ce0 nfsv4 DELEGRETURN a:rpc_exit_task [sunrpc] q:none

So it looks like a CLOSE's rpc_release_calldata is triggering the
nfs_kill_super, and this is stuck trying to kill the DELEGRETURN task -
which never gets run.

I've debugged this from the workqueue side and the DELEGRETURN work is
scheduled, but ends up having insert_wq_barrier called on it. It seems to me
that this means the workqueue is enforcing queue ordering such that the CLOSE
work must complete before the DELEGRETURN work can proceed -- and *that* is
the deadlock (CLOSE waits until DELEGRETURN is dead, and DELEGRETURN can't run
until CLOSE is complete).

This would also explain our (Trond's and my) failed attempts at canceling /
rescheduling jobs in killall_tasks -- insert_wq_barrier's comment states:

 * Currently, a queued barrier can't be canceled. This is because
 * try_to_grab_pending() can't determine whether the work to be
 * grabbed is at the head of the queue and thus can't clear LINKED
 * flag of the previous work while there must be a valid next work
 * after a work with LINKED flag set.

Now that I have a better understanding of what's happening, I'll go back to
the drawing board. Thanks!

-dros
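To illustrate the ordering problem Dros describes, here is a minimal, hypothetical sketch (the demo_* names are invented, and it uses an ordered workqueue to force the same serialization that the barrier imposes; the real case involves insert_wq_barrier on the rpciod workqueue, as quoted above). Work A queues work B on the same queue and then waits for it, so B can never start and A never returns -- A plays the role of the CLOSE release work calling rpc_shutdown_client, and B the role of DELEGRETURN.

```c
#include <linux/module.h>
#include <linux/workqueue.h>
#include <linux/completion.h>

static struct workqueue_struct *demo_wq;	/* stand-in for rpciod */
static struct work_struct demo_a;		/* stand-in for the CLOSE release work */
static struct work_struct demo_b;		/* stand-in for DELEGRETURN */
static DECLARE_COMPLETION(demo_b_done);

static void demo_b_fn(struct work_struct *work)
{
	complete(&demo_b_done);			/* never reached in the deadlock */
}

static void demo_a_fn(struct work_struct *work)
{
	queue_work(demo_wq, &demo_b);
	/* Analogous to waiting for the other task to die: */
	wait_for_completion(&demo_b_done);	/* deadlocks: B is queued behind A */
}

static int __init demo_init(void)
{
	/* max_active == 1, so queued work runs strictly in order */
	demo_wq = alloc_ordered_workqueue("demo_wq", 0);
	if (!demo_wq)
		return -ENOMEM;
	INIT_WORK(&demo_a, demo_a_fn);
	INIT_WORK(&demo_b, demo_b_fn);
	queue_work(demo_wq, &demo_a);
	return 0;
}
module_init(demo_init);
MODULE_LICENSE("GPL");
```

The posted patch (shown in full below) tries to sidestep this by deferring nfs_free_server() to the nfsiod workqueue, so the wait in rpc_shutdown_client() no longer runs on the queue whose work it is waiting for.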
diff --git a/fs/nfs/client.c b/fs/nfs/client.c
index c285e0a..9186a96 100644
--- a/fs/nfs/client.c
+++ b/fs/nfs/client.c
@@ -1010,9 +1010,11 @@ EXPORT_SYMBOL_GPL(nfs_alloc_server);
 /*
  * Free up a server record
  */
-void nfs_free_server(struct nfs_server *server)
+static void nfs_free_server_schedule_work(struct work_struct *work)
 {
-	dprintk("--> nfs_free_server()\n");
+	struct nfs_server *server = container_of(work, struct nfs_server, work);
+
+	dprintk("--> %s\n", __func__);
 
 	nfs_server_remove_lists(server);
 
@@ -1032,7 +1034,19 @@ void nfs_free_server(struct nfs_server *server)
 	bdi_destroy(&server->backing_dev_info);
 	kfree(server);
 	nfs_release_automount_timer();
-	dprintk("<-- nfs_free_server()\n");
+	dprintk("<-- %s\n", __func__);
+}
+
+/*
+ * Queue work on nfsiod workqueue to free up a server record.
+ * This avoids a deadlock when an RPC task scheduled from the rpciod
+ * workqueue tries to kill itself.
+ */
+void nfs_free_server(struct nfs_server *server)
+{
+	WARN_ON_ONCE(work_pending(&server->work));
+	INIT_WORK(&server->work, nfs_free_server_schedule_work);
+	queue_work(nfsiod_workqueue, &server->work);
 }
 EXPORT_SYMBOL_GPL(nfs_free_server);
 
diff --git a/include/linux/nfs_fs_sb.h b/include/linux/nfs_fs_sb.h
index a9e76ee..a607886 100644
--- a/include/linux/nfs_fs_sb.h
+++ b/include/linux/nfs_fs_sb.h
@@ -171,6 +171,7 @@ struct nfs_server {
 	void (*destroy)(struct nfs_server *);
 
 	atomic_t active; /* Keep trace of any activity to this server */
+	struct work_struct work; /* used to schedule free */
 
 	/* mountd-related mount options */
 	struct sockaddr_storage mountd_address;
Calling nfs_kill_super from an RPC task callback would result in a deadlock
where nfs_free_server (via rpc_shutdown_client) tries to kill all
RPC tasks associated with that connection - including itself!

Instead of calling nfs_kill_super directly, queue a job on the nfsiod
workqueue.

Signed-off-by: Weston Andros Adamson <dros@netapp.com>
---

This fixes the current incarnation of the lockup I've been tracking down for
some time now. I still have to go back and see why the reproducer changed
behavior a few weeks ago - tasks used to get stuck in rpc_prepare_task, but
now (before this patch) are stuck in rpc_exit.

The reproducer works against a server with write delegations:

 ./nfsometer.py -m v4 server:/path dd_100m_100k

which is basically:
 - mount
 - dd if=/dev/zero of=./dd_file.100m_100k bs=102400 count=1024
 - umount
 - break if /proc/fs/nfsfs/servers still has an entry after 5 seconds (in this
   case it NEVER goes away)

There are clearly other ways to trigger this deadlock, like a v4.1 CLOSE - the
done handler calls nfs_sb_deactive...

I've tested this approach with 10 runs X 3 nfs versions X 5 workloads
(dd_100m_100k, dd_100m_1k, python, kernel, cthon), so I'm pretty confident
it's correct.

One question for the list: should nfs_free_server *always* be scheduled on
the nfsiod workqueue? It's called in error paths in several locations.
After looking at them, I don't think my approach would break anything, but
some might have style objections.

-dros

 fs/nfs/client.c           | 20 +++++++++++++++++---
 include/linux/nfs_fs_sb.h |  1 +
 2 files changed, 18 insertions(+), 3 deletions(-)