Message ID: 1442491104-30080-1-git-send-email-jeff.layton@primarydata.com (mailing list archive)
State: New, archived
On Thu, Sep 17, 2015 at 07:58:24AM -0400, Jeff Layton wrote:
> In order to allow the client to make a sane determination of what
> happened with racing LAYOUTGET/LAYOUTRETURN/CB_LAYOUTRECALL calls, we
> must ensure that the seqids return accurately represent the order of
> operations. The simplest way to do that is to ensure that operations on
> a single stateid are serialized.
>
> This patch adds a mutex to the layout stateid, and locks it when
> checking the layout stateid's seqid. The mutex is held over the entire
> operation and released after the seqid is bumped.
>
> Note that in the case of CB_LAYOUTRECALL we must move the increment of
> the seqid and setting into a new cb "prepare" operation. The lease
> infrastructure will call the lm_break callback with a spinlock held, so
> and we can't take the mutex in that codepath.

I can't say I like the long running mutex all that much. What kinds
of reproducers do you have where the current behavior causes problems?
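For context, a minimal sketch of the serialization scheme under discussion (illustrative only; hypothetical names, not the nfsd code). Without a per-stateid lock held across the whole operation, the seqid check, the layout change, and the seqid bump of two racing operations on the same stateid can interleave, so the seqids handed back to the client need not match the order in which the layout state actually changed; holding the lock across all three steps is what the patch relies on:

/* Illustrative sketch only -- hypothetical names, not the nfsd implementation. */
#include <linux/mutex.h>
#include <linux/types.h>

struct demo_layout_stateid {
        struct mutex    mutex;          /* what the patch adds (ls_mutex) */
        u32             seqid;          /* stateid sequence number        */
};

/*
 * Unserialized: between the seqid check and the bump, a racing operation
 * (say a LAYOUTRETURN racing with this LAYOUTGET) can check, modify the
 * layout list and bump the seqid itself, so the two replies may carry
 * seqids that do not reflect the order of the actual layout changes.
 */
static u32 demo_layout_op_racy(struct demo_layout_stateid *ls)
{
        /* 1. validate the seqid presented by the client */
        /* 2. add or remove layout segments              */
        return ++ls->seqid;     /* 3. bump and encode in the reply */
}

/*
 * Serialized, as in the patch: the per-stateid mutex is taken before the
 * seqid check and only dropped after the bump, so steps 1-3 of racing
 * operations on the same stateid cannot interleave.
 */
static u32 demo_layout_op_serialized(struct demo_layout_stateid *ls)
{
        u32 seq;

        mutex_lock(&ls->mutex);
        /* validate seqid, modify layouts ... */
        seq = ++ls->seqid;
        mutex_unlock(&ls->mutex);
        return seq;
}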
On Sun, 11 Oct 2015 15:15:27 +0200
Christoph Hellwig <hch@lst.de> wrote:

> On Thu, Sep 17, 2015 at 07:58:24AM -0400, Jeff Layton wrote:
> > In order to allow the client to make a sane determination of what
> > happened with racing LAYOUTGET/LAYOUTRETURN/CB_LAYOUTRECALL calls, we
> > must ensure that the seqids return accurately represent the order of
> > operations. The simplest way to do that is to ensure that operations on
> > a single stateid are serialized.
> >
> > This patch adds a mutex to the layout stateid, and locks it when
> > checking the layout stateid's seqid. The mutex is held over the entire
> > operation and released after the seqid is bumped.
> >
> > Note that in the case of CB_LAYOUTRECALL we must move the increment of
> > the seqid and setting into a new cb "prepare" operation. The lease
> > infrastructure will call the lm_break callback with a spinlock held, so
> > and we can't take the mutex in that codepath.
>
> I can't say I like the long running mutex all that much. What kinds
> of reproducers do you have where the current behavior causes problems?

I'm not thrilled with it either, though it is per-stateid. It should
only ever be contended when there are concurrent operations that specify
the same stateid.

We did have a report of this problem with open stateids that came about
after the client started parallelizing opens more aggressively. It was
pretty clear that you could hit similar races with lock stateids as
well, given a client that didn't serialize things properly.

I don't have any reports of this problem with layout stateids though.
It may be that the client is good enough at serializing these
operations (or slow enough) that it's never an issue. But, if that's
the case then the mutex is harmless anyway since it'd never be
contended.

In hindsight I probably should have sent this as a RFC patch, I'm fine
with dropping this for now if you think it's potentially harmful.
On Sun, Oct 11, 2015 at 04:51:21PM -0400, Jeff Layton wrote:
> On Sun, 11 Oct 2015 15:15:27 +0200
> Christoph Hellwig <hch@lst.de> wrote:
>
> > On Thu, Sep 17, 2015 at 07:58:24AM -0400, Jeff Layton wrote:
> > > In order to allow the client to make a sane determination of what
> > > happened with racing LAYOUTGET/LAYOUTRETURN/CB_LAYOUTRECALL calls, we
> > > must ensure that the seqids return accurately represent the order of
> > > operations. The simplest way to do that is to ensure that operations on
> > > a single stateid are serialized.
> > >
> > > This patch adds a mutex to the layout stateid, and locks it when
> > > checking the layout stateid's seqid. The mutex is held over the entire
> > > operation and released after the seqid is bumped.
> > >
> > > Note that in the case of CB_LAYOUTRECALL we must move the increment of
> > > the seqid and setting into a new cb "prepare" operation. The lease
> > > infrastructure will call the lm_break callback with a spinlock held, so
> > > and we can't take the mutex in that codepath.
> >
> > I can't say I like the long running mutex all that much. What kinds
> > of reproducers do you have where the current behavior causes problems?
>
> I'm not thrilled with it either, though it is per-stateid. It should
> only ever be contended when there are concurrent operations that specify
> the same stateid.
>
> We did have a report of this problem with open stateids that came about
> after the client started parallelizing opens more aggressively. It was
> pretty clear that you could hit similar races with lock stateids as
> well, given a client that didn't serialize things properly.
>
> I don't have any reports of this problem with layout stateids though.
> It may be that the client is good enough at serializing these
> operations (or slow enough) that it's never an issue. But, if that's
> the case then the mutex is harmless anyway since it'd never be
> contended.
>
> In hindsight I probably should have sent this as a RFC patch, I'm fine
> with dropping this for now if you think it's potentially harmful.

Still looks to me like the code's incorrect without your patch, so I
intend to apply it. I'll fully admit that I haven't looked at how
layout stateids are supposed to work closely enough, so it's possible
Christoph can persuade me otherwise.

--b.
I hit two problems with this patch.

1. A dump_stack message is printed from process_one_work() at
kernel/workqueue.c:

BUG: workqueue leaked lock or atomic: kworker/u2:4/0x00000000/98
     last function: nfsd4_run_cb_work [nfsd]
1 lock held by kworker/u2:4/98:
 #0: (&ls->ls_mutex){......}, at: [<ffffffffa0250d34>] nfsd4_cb_layout_prepare+0x24/0x40 [nfsd]
CPU: 0 PID: 98 Comm: kworker/u2:4 Not tainted 4.4.0-rc2+ #333
Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/02/2015
Workqueue: nfsd4_callbacks nfsd4_run_cb_work [nfsd]
 ffff8800362b9e40 000000007fe9394f ffff880036353d58 ffffffff8136dc64
 ffff880036353dd8 ffffffff810a3f12 ffffffff810a3cbd 000000000000000a
 ffffffffa0261d78 ffffffff82902e20 0000000000000000 ffffffffa0259241
Call Trace:
 [<ffffffff8136dc64>] dump_stack+0x19/0x25
 [<ffffffff810a3f12>] process_one_work+0x3c2/0x4c0
 [<ffffffff810a3cbd>] ? process_one_work+0x16d/0x4c0
 [<ffffffff810a405a>] worker_thread+0x4a/0x440
 [<ffffffff810a4010>] ? process_one_work+0x4c0/0x4c0
 [<ffffffff810a4010>] ? process_one_work+0x4c0/0x4c0
 [<ffffffff810a91e5>] kthread+0xf5/0x110
 [<ffffffff810a90f0>] ? kthread_create_on_node+0x240/0x240
 [<ffffffff81738d0f>] ret_from_fork+0x3f/0x70
 [<ffffffff810a90f0>] ? kthread_create_on_node+0x240/0x240

2. A mutex race between layoutrecall and layoutcommit:

callback thread,
nfsd4_cb_layout_prepare ---> mutex_lock(&ls->ls_mutex);

layoutcommit thread,
nfsd4_layoutcommit ---> nfsd4_preprocess_layout_stateid --> mutex_lock(&ls->ls_mutex);  <---------------- hang

[ 600.645142] INFO: task nfsd:11623 blocked for more than 120 seconds.
[ 600.646337]       Not tainted 4.4.0-rc2+ #332
[ 600.647404] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 600.648546] nfsd            D ffff880064277b80     0 11623      2 0x00000000
[ 600.649803]  ffff880064277b80 ffff880064278000 00000000ffffffff ffff88005dd241a8
[ 600.651021]  ffffffffa025e77c 0000000000000246 ffff880064277b98 ffffffff81733103
[ 600.652255]  ffff880063d7e100 ffff880064277ba8 ffffffff8173330e ffff880064277c28
[ 600.653512] Call Trace:
[ 600.654765]  [<ffffffffa025e77c>] ? nfsd4_preprocess_layout_stateid+0x37c/0x400 [nfsd]
[ 600.656084]  [<ffffffff81733103>] schedule+0x33/0x80
[ 600.657405]  [<ffffffff8173330e>] schedule_preempt_disabled+0xe/0x10
[ 600.658741]  [<ffffffff817357f5>] mutex_lock_nested+0x145/0x330
[ 600.660094]  [<ffffffffa025e77c>] ? nfsd4_preprocess_layout_stateid+0x37c/0x400 [nfsd]
[ 600.661696]  [<ffffffffa025e77c>] nfsd4_preprocess_layout_stateid+0x37c/0x400 [nfsd]
[ 600.663129]  [<ffffffffa025e405>] ? nfsd4_preprocess_layout_stateid+0x5/0x400 [nfsd]
[ 600.664558]  [<ffffffff81173b8f>] ? printk+0x56/0x72
[ 600.665990]  [<ffffffffa023e3ec>] nfsd4_layoutcommit+0x13c/0x200 [nfsd]
[ 600.667365]  [<ffffffffa023fb98>] nfsd4_proc_compound+0x388/0x660 [nfsd]
[ 600.668835]  [<ffffffffa022c148>] nfsd_dispatch+0xb8/0x200 [nfsd]
[ 600.670323]  [<ffffffffa0093d89>] svc_process_common+0x409/0x650 [sunrpc]
[ 600.671836]  [<ffffffffa0094e04>] svc_process+0xf4/0x190 [sunrpc]
[ 600.673328]  [<ffffffffa022bb05>] nfsd+0x135/0x1a0 [nfsd]
[ 600.674825]  [<ffffffffa022b9d5>] ? nfsd+0x5/0x1a0 [nfsd]
[ 600.676388]  [<ffffffffa022b9d0>] ? nfsd_destroy+0xb0/0xb0 [nfsd]
[ 600.677884]  [<ffffffff810a9175>] kthread+0xf5/0x110
[ 600.679373]  [<ffffffff810a9080>] ? kthread_create_on_node+0x240/0x240
[ 600.680874]  [<ffffffff81738ccf>] ret_from_fork+0x3f/0x70
[ 600.682398]  [<ffffffff810a9080>] ? kthread_create_on_node+0x240/0x240
[ 600.683893] 1 lock held by nfsd/11623:
[ 600.685449]  #0: (&ls->ls_mutex){......}, at: [<ffffffffa025e77c>] nfsd4_preprocess_layout_stateid+0x37c/0x400 [nfsd]
[ 600.688778] Sending NMI to all CPUs:
[ 600.690854] NMI backtrace for cpu 0
[ 600.691909] CPU: 0 PID: 11 Comm: khungtaskd Not tainted 4.4.0-rc2+ #332
[ 600.692523] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/02/2015
[ 600.693821] task: ffff88007b900000 ti: ffff88007b8fc000 task.ti: ffff88007b8fc000
[ 600.694496] RIP: 0010:[<ffffffff81053aca>]  [<ffffffff81053aca>] native_write_msr_safe+0xa/0x10
[ 600.695185] RSP: 0018:ffff88007b8ffd70  EFLAGS: 00000046
[ 600.695861] RAX: 0000000000000400 RBX: 0000000000000286 RCX: 0000000000000830
[ 600.696539] RDX: 0000000000000000 RSI: 0000000000000400 RDI: 0000000000000830
[ 600.697204] RBP: ffff88007b8ffd70 R08: 0000000000000400 R09: 0000000000000000
[ 600.697862] R10: 00000000000000e4 R11: 0000000000000001 R12: ffff880063d7e100
[ 600.698513] R13: 00000000003fff3c R14: ffff880063d7e308 R15: 0000000000000004
[ 600.699156] FS:  0000000000000000(0000) GS:ffffffff81c27000(0000) knlGS:0000000000000000
[ 600.699823] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 600.700459] CR2: 00007f366afd4000 CR3: 000000005dd56000 CR4: 00000000001406f0
[ 600.701106] Stack:
[ 600.701745]  ffff88007b8ffd88 ffffffff8104a860 ffffffff81047340 ffff88007b8ffd98
[ 600.702404]  ffffffff8104a885 ffff88007b8ffda8 ffffffff8104735b ffff88007b8ffdd8
[ 600.703058]  ffffffff813723ad 0000000000000078 ffff880063d7e100 00000000003fff3c
[ 600.703712] Call Trace:
[ 600.704355]  [<ffffffff8104a860>] __x2apic_send_IPI_mask.isra.2+0x60/0x70
[ 600.705017]  [<ffffffff81047340>] ? setup_vector_irq+0x130/0x130
[ 600.705676]  [<ffffffff8104a885>] x2apic_send_IPI_mask+0x15/0x20
[ 600.706335]  [<ffffffff8104735b>] nmi_raise_cpu_backtrace+0x1b/0x20
[ 600.706989]  [<ffffffff813723ad>] nmi_trigger_all_cpu_backtrace+0x14d/0x1c0
[ 600.707693]  [<ffffffff810473b9>] arch_trigger_all_cpu_backtrace+0x19/0x20
[ 600.708362]  [<ffffffff8112c4cf>] watchdog+0x32f/0x370
[ 600.709031]  [<ffffffff8112c221>] ? watchdog+0x81/0x370
[ 600.709725]  [<ffffffff8112c1a0>] ? reset_hung_task_detector+0x20/0x20
[ 600.710398]  [<ffffffff810a9175>] kthread+0xf5/0x110
[ 600.711067]  [<ffffffff810a9080>] ? kthread_create_on_node+0x240/0x240
[ 600.711739]  [<ffffffff81738ccf>] ret_from_fork+0x3f/0x70
[ 600.712405]  [<ffffffff810a9080>] ? kthread_create_on_node+0x240/0x240
[ 600.713073] Code: 00 55 89 f9 48 89 e5 0f 32 45 31 c0 48 c1 e2 20 44 89 06 48 09 d0 5d c3 66 0f 1f 84 00 00 00 00 00 55 89 f0 89 f9 48 89 e5 0f 30 <31> c0 5d c3 66 90 55 89 f9 48 89 e5 0f 33 48 c1 e2 20 48 09 d0
[ 600.715196] Kernel panic - not syncing: hung_task: blocked tasks
[ 600.715889] CPU: 0 PID: 11 Comm: khungtaskd Not tainted 4.4.0-rc2+ #332
[ 600.716540] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/02/2015
[ 600.717910]  ffffffff81a4d19c 00000000480750be ffff88007b8ffd60 ffffffff8136dbf4
[ 600.718610]  ffff88007b8ffde8 ffffffff81173559 ffff880000000008 ffff88007b8ffdf8
[ 600.719302]  ffff88007b8ffd90 00000000480750be 0000000000000001 0000000000000001
[ 600.719984] Call Trace:
[ 600.720646]  [<ffffffff8136dbf4>] dump_stack+0x19/0x25
[ 600.721330]  [<ffffffff81173559>] panic+0xd3/0x212
[ 600.722009]  [<ffffffff8112c4db>] watchdog+0x33b/0x370
[ 600.722686]  [<ffffffff8112c221>] ? watchdog+0x81/0x370
[ 600.723213]  [<ffffffff8112c1a0>] ? reset_hung_task_detector+0x20/0x20
[ 600.723674]  [<ffffffff810a9175>] kthread+0xf5/0x110
[ 600.724107]  [<ffffffff810a9080>] ? kthread_create_on_node+0x240/0x240
[ 600.724509]  [<ffffffff81738ccf>] ret_from_fork+0x3f/0x70
[ 600.724903]  [<ffffffff810a9080>] ? kthread_create_on_node+0x240/0x240

thanks,
Kinglong Mee

On 9/17/2015 19:58, Jeff Layton wrote:
> In order to allow the client to make a sane determination of what
> happened with racing LAYOUTGET/LAYOUTRETURN/CB_LAYOUTRECALL calls, we
> must ensure that the seqids return accurately represent the order of
> operations. The simplest way to do that is to ensure that operations on
> a single stateid are serialized.
>
> This patch adds a mutex to the layout stateid, and locks it when
> checking the layout stateid's seqid. The mutex is held over the entire
> operation and released after the seqid is bumped.
>
> Note that in the case of CB_LAYOUTRECALL we must move the increment of
> the seqid and setting into a new cb "prepare" operation. The lease
> infrastructure will call the lm_break callback with a spinlock held, so
> and we can't take the mutex in that codepath.
>
> Cc: Christoph Hellwig <hch@lst.de>
> Signed-off-by: Jeff Layton <jeff.layton@primarydata.com>
> ---
>  fs/nfsd/nfs4layouts.c | 25 +++++++++++++++++++++----
>  fs/nfsd/nfs4proc.c    |  4 ++++
>  fs/nfsd/state.h       |  1 +
>  3 files changed, 26 insertions(+), 4 deletions(-)
>
> diff --git a/fs/nfsd/nfs4layouts.c b/fs/nfsd/nfs4layouts.c
> index ebf90e487c75..4a68ab901b4b 100644
> --- a/fs/nfsd/nfs4layouts.c
> +++ b/fs/nfsd/nfs4layouts.c
> @@ -201,6 +201,7 @@ nfsd4_alloc_layout_stateid(struct nfsd4_compound_state *cstate,
>          INIT_LIST_HEAD(&ls->ls_perfile);
>          spin_lock_init(&ls->ls_lock);
>          INIT_LIST_HEAD(&ls->ls_layouts);
> +        mutex_init(&ls->ls_mutex);
>          ls->ls_layout_type = layout_type;
>          nfsd4_init_cb(&ls->ls_recall, clp, &nfsd4_cb_layout_ops,
>                          NFSPROC4_CLNT_CB_LAYOUT);
> @@ -262,19 +263,23 @@ nfsd4_preprocess_layout_stateid(struct svc_rqst *rqstp,
>                  status = nfserr_jukebox;
>                  if (!ls)
>                          goto out;
> +                mutex_lock(&ls->ls_mutex);
>          } else {
>                  ls = container_of(stid, struct nfs4_layout_stateid, ls_stid);
>
>                  status = nfserr_bad_stateid;
> +                mutex_lock(&ls->ls_mutex);
>                  if (stateid->si_generation > stid->sc_stateid.si_generation)
> -                        goto out_put_stid;
> +                        goto out_unlock_stid;
>                  if (layout_type != ls->ls_layout_type)
> -                        goto out_put_stid;
> +                        goto out_unlock_stid;
>          }
>
>          *lsp = ls;
>          return 0;
>
> +out_unlock_stid:
> +        mutex_unlock(&ls->ls_mutex);
>  out_put_stid:
>          nfs4_put_stid(stid);
>  out:
> @@ -296,8 +301,6 @@ nfsd4_recall_file_layout(struct nfs4_layout_stateid *ls)
>          trace_layout_recall(&ls->ls_stid.sc_stateid);
>
>          atomic_inc(&ls->ls_stid.sc_count);
> -        update_stateid(&ls->ls_stid.sc_stateid);
> -        memcpy(&ls->ls_recall_sid, &ls->ls_stid.sc_stateid, sizeof(stateid_t));
>          nfsd4_run_cb(&ls->ls_recall);
>
>  out_unlock:
> @@ -494,6 +497,7 @@ nfsd4_return_file_layouts(struct svc_rqst *rqstp,
>          }
>          spin_unlock(&ls->ls_lock);
>
> +        mutex_unlock(&ls->ls_mutex);
>          nfs4_put_stid(&ls->ls_stid);
>          nfsd4_free_layouts(&reaplist);
>          return nfs_ok;
> @@ -608,6 +612,17 @@ nfsd4_cb_layout_fail(struct nfs4_layout_stateid *ls)
>          }
>  }
>
> +static void
> +nfsd4_cb_layout_prepare(struct nfsd4_callback *cb)
> +{
> +        struct nfs4_layout_stateid *ls =
> +                container_of(cb, struct nfs4_layout_stateid, ls_recall);
> +
> +        mutex_lock(&ls->ls_mutex);
> +        update_stateid(&ls->ls_stid.sc_stateid);
> +        memcpy(&ls->ls_recall_sid, &ls->ls_stid.sc_stateid, sizeof(stateid_t));
> +}
> +
>  static int
>  nfsd4_cb_layout_done(struct nfsd4_callback *cb, struct rpc_task *task)
>  {
> @@ -649,12 +664,14 @@ nfsd4_cb_layout_release(struct nfsd4_callback *cb)
>
>          trace_layout_recall_release(&ls->ls_stid.sc_stateid);
>
> +        mutex_unlock(&ls->ls_mutex);
>          nfsd4_return_all_layouts(ls, &reaplist);
>          nfsd4_free_layouts(&reaplist);
>          nfs4_put_stid(&ls->ls_stid);
>  }
>
>  static struct nfsd4_callback_ops nfsd4_cb_layout_ops = {
> +        .prepare = nfsd4_cb_layout_prepare,
>          .done = nfsd4_cb_layout_done,
>          .release = nfsd4_cb_layout_release,
>  };
> diff --git a/fs/nfsd/nfs4proc.c b/fs/nfsd/nfs4proc.c
> index 4ce6b97b31ad..a9f096c7e99f 100644
> --- a/fs/nfsd/nfs4proc.c
> +++ b/fs/nfsd/nfs4proc.c
> @@ -1309,6 +1309,7 @@ nfsd4_layoutget(struct svc_rqst *rqstp,
>          nfserr = nfsd4_insert_layout(lgp, ls);
>
>  out_put_stid:
> +        mutex_unlock(&ls->ls_mutex);
>          nfs4_put_stid(&ls->ls_stid);
>  out:
>          return nfserr;
> @@ -1362,6 +1363,9 @@ nfsd4_layoutcommit(struct svc_rqst *rqstp,
>                  goto out;
>          }
>
> +        /* LAYOUTCOMMIT does not require any serialization */
> +        mutex_unlock(&ls->ls_mutex);
> +
>          if (new_size > i_size_read(inode)) {
>                  lcp->lc_size_chg = 1;
>                  lcp->lc_newsize = new_size;
> diff --git a/fs/nfsd/state.h b/fs/nfsd/state.h
> index 31bde12feefe..1fa0f3848d4e 100644
> --- a/fs/nfsd/state.h
> +++ b/fs/nfsd/state.h
> @@ -562,6 +562,7 @@ struct nfs4_layout_stateid {
>          struct nfsd4_callback ls_recall;
>          stateid_t ls_recall_sid;
>          bool ls_recalled;
> +        struct mutex ls_mutex;
>  };
>
>  static inline struct nfs4_layout_stateid *layoutstateid(struct nfs4_stid *s)
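For context on the first report: process_one_work() in kernel/workqueue.c checks, after a work function returns, whether the worker still holds any locks, and prints the "BUG: workqueue leaked lock or atomic" message if it does. Below is a minimal sketch of the pattern that trips it (illustrative only; hypothetical names, not the nfsd callback code): the work handler takes a mutex that is only released later from a different context, which is effectively what happens once ->prepare takes ls_mutex in nfsd4_run_cb_work and ->release drops it only after the recall RPC completes. The second report is the same long hold seen from the other side: while the recall callback holds ls_mutex, the nfsd thread handling LAYOUTCOMMIT sleeps on that mutex in nfsd4_preprocess_layout_stateid(), which is what the hung-task backtrace shows.

/* Illustrative sketch only -- hypothetical names, not nfsd code. */
#include <linux/mutex.h>
#include <linux/workqueue.h>

static DEFINE_MUTEX(demo_mutex);

/*
 * Work handler that returns with demo_mutex still held; the unlock is
 * expected to happen later, from another context.  When this function
 * returns to process_one_work(), lockdep still sees one lock held and
 * the workqueue core reports "workqueue leaked lock or atomic".
 */
static void demo_work_fn(struct work_struct *work)
{
        mutex_lock(&demo_mutex);
        /* start an asynchronous operation here ... */
        /* ... and return without unlocking */
}

static DECLARE_WORK(demo_work, demo_work_fn);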
diff --git a/fs/nfsd/nfs4layouts.c b/fs/nfsd/nfs4layouts.c
index ebf90e487c75..4a68ab901b4b 100644
--- a/fs/nfsd/nfs4layouts.c
+++ b/fs/nfsd/nfs4layouts.c
@@ -201,6 +201,7 @@ nfsd4_alloc_layout_stateid(struct nfsd4_compound_state *cstate,
         INIT_LIST_HEAD(&ls->ls_perfile);
         spin_lock_init(&ls->ls_lock);
         INIT_LIST_HEAD(&ls->ls_layouts);
+        mutex_init(&ls->ls_mutex);
         ls->ls_layout_type = layout_type;
         nfsd4_init_cb(&ls->ls_recall, clp, &nfsd4_cb_layout_ops,
                         NFSPROC4_CLNT_CB_LAYOUT);
@@ -262,19 +263,23 @@ nfsd4_preprocess_layout_stateid(struct svc_rqst *rqstp,
                 status = nfserr_jukebox;
                 if (!ls)
                         goto out;
+                mutex_lock(&ls->ls_mutex);
         } else {
                 ls = container_of(stid, struct nfs4_layout_stateid, ls_stid);
 
                 status = nfserr_bad_stateid;
+                mutex_lock(&ls->ls_mutex);
                 if (stateid->si_generation > stid->sc_stateid.si_generation)
-                        goto out_put_stid;
+                        goto out_unlock_stid;
                 if (layout_type != ls->ls_layout_type)
-                        goto out_put_stid;
+                        goto out_unlock_stid;
         }
 
         *lsp = ls;
         return 0;
 
+out_unlock_stid:
+        mutex_unlock(&ls->ls_mutex);
 out_put_stid:
         nfs4_put_stid(stid);
 out:
@@ -296,8 +301,6 @@ nfsd4_recall_file_layout(struct nfs4_layout_stateid *ls)
         trace_layout_recall(&ls->ls_stid.sc_stateid);
 
         atomic_inc(&ls->ls_stid.sc_count);
-        update_stateid(&ls->ls_stid.sc_stateid);
-        memcpy(&ls->ls_recall_sid, &ls->ls_stid.sc_stateid, sizeof(stateid_t));
         nfsd4_run_cb(&ls->ls_recall);
 
 out_unlock:
@@ -494,6 +497,7 @@ nfsd4_return_file_layouts(struct svc_rqst *rqstp,
         }
         spin_unlock(&ls->ls_lock);
 
+        mutex_unlock(&ls->ls_mutex);
         nfs4_put_stid(&ls->ls_stid);
         nfsd4_free_layouts(&reaplist);
         return nfs_ok;
@@ -608,6 +612,17 @@ nfsd4_cb_layout_fail(struct nfs4_layout_stateid *ls)
         }
 }
 
+static void
+nfsd4_cb_layout_prepare(struct nfsd4_callback *cb)
+{
+        struct nfs4_layout_stateid *ls =
+                container_of(cb, struct nfs4_layout_stateid, ls_recall);
+
+        mutex_lock(&ls->ls_mutex);
+        update_stateid(&ls->ls_stid.sc_stateid);
+        memcpy(&ls->ls_recall_sid, &ls->ls_stid.sc_stateid, sizeof(stateid_t));
+}
+
 static int
 nfsd4_cb_layout_done(struct nfsd4_callback *cb, struct rpc_task *task)
 {
@@ -649,12 +664,14 @@ nfsd4_cb_layout_release(struct nfsd4_callback *cb)
 
         trace_layout_recall_release(&ls->ls_stid.sc_stateid);
 
+        mutex_unlock(&ls->ls_mutex);
         nfsd4_return_all_layouts(ls, &reaplist);
         nfsd4_free_layouts(&reaplist);
         nfs4_put_stid(&ls->ls_stid);
 }
 
 static struct nfsd4_callback_ops nfsd4_cb_layout_ops = {
+        .prepare = nfsd4_cb_layout_prepare,
         .done = nfsd4_cb_layout_done,
         .release = nfsd4_cb_layout_release,
 };
diff --git a/fs/nfsd/nfs4proc.c b/fs/nfsd/nfs4proc.c
index 4ce6b97b31ad..a9f096c7e99f 100644
--- a/fs/nfsd/nfs4proc.c
+++ b/fs/nfsd/nfs4proc.c
@@ -1309,6 +1309,7 @@ nfsd4_layoutget(struct svc_rqst *rqstp,
         nfserr = nfsd4_insert_layout(lgp, ls);
 
 out_put_stid:
+        mutex_unlock(&ls->ls_mutex);
         nfs4_put_stid(&ls->ls_stid);
 out:
         return nfserr;
@@ -1362,6 +1363,9 @@ nfsd4_layoutcommit(struct svc_rqst *rqstp,
                 goto out;
         }
 
+        /* LAYOUTCOMMIT does not require any serialization */
+        mutex_unlock(&ls->ls_mutex);
+
         if (new_size > i_size_read(inode)) {
                 lcp->lc_size_chg = 1;
                 lcp->lc_newsize = new_size;
diff --git a/fs/nfsd/state.h b/fs/nfsd/state.h
index 31bde12feefe..1fa0f3848d4e 100644
--- a/fs/nfsd/state.h
+++ b/fs/nfsd/state.h
@@ -562,6 +562,7 @@ struct nfs4_layout_stateid {
         struct nfsd4_callback ls_recall;
         stateid_t ls_recall_sid;
         bool ls_recalled;
+        struct mutex ls_mutex;
 };
 
 static inline struct nfs4_layout_stateid *layoutstateid(struct nfs4_stid *s)
In order to allow the client to make a sane determination of what
happened with racing LAYOUTGET/LAYOUTRETURN/CB_LAYOUTRECALL calls, we
must ensure that the seqids returned accurately represent the order of
operations. The simplest way to do that is to ensure that operations on
a single stateid are serialized.

This patch adds a mutex to the layout stateid, and locks it when
checking the layout stateid's seqid. The mutex is held over the entire
operation and released after the seqid is bumped.

Note that in the case of CB_LAYOUTRECALL we must move the increment of
the seqid and the setting of the recall stateid into a new cb "prepare"
operation. The lease infrastructure will call the lm_break callback with
a spinlock held, so we can't take the mutex in that codepath.

Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jeff Layton <jeff.layton@primarydata.com>
---
 fs/nfsd/nfs4layouts.c | 25 +++++++++++++++++++++----
 fs/nfsd/nfs4proc.c    |  4 ++++
 fs/nfsd/state.h       |  1 +
 3 files changed, 26 insertions(+), 4 deletions(-)
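To illustrate the constraint in the last paragraph of the description (a sketch only, with made-up names; the real hooks are lm_break and the nfsd4_callback ->prepare operation): mutex_lock() can sleep, so it must not be called while a spinlock is held. The lease code invokes lm_break under a spinlock, so the recall path can only schedule the callback there, and the mutex acquisition plus seqid bump have to be deferred to the callback's prepare stage, which runs later in process context.

/* Illustrative sketch only -- hypothetical names, not the nfsd code. */
#include <linux/mutex.h>
#include <linux/spinlock.h>
#include <linux/workqueue.h>

static DEFINE_SPINLOCK(demo_lease_lock);        /* stands in for the lease code's lock */
static DEFINE_MUTEX(demo_stateid_mutex);        /* stands in for ls_mutex              */

/*
 * Runs later from the callback workqueue, in process context, where it is
 * safe to sleep: take the per-stateid mutex and bump/copy the stateid here.
 * This is the role the patch gives to the new ->prepare operation (the real
 * code keeps the mutex held until ->release rather than dropping it here).
 */
static void demo_recall_prepare(struct work_struct *work)
{
        mutex_lock(&demo_stateid_mutex);
        /* update_stateid() and the ls_recall_sid copy would happen here */
        mutex_unlock(&demo_stateid_mutex);
}

static DECLARE_WORK(demo_recall_work, demo_recall_prepare);

/*
 * The lease-break path runs under a spinlock (as lm_break does): taking the
 * mutex inside the locked region would mean sleeping under a spinlock, so
 * only the recall work is scheduled from there.
 */
static void demo_break_lease(void)
{
        spin_lock(&demo_lease_lock);
        schedule_work(&demo_recall_work);
        spin_unlock(&demo_lease_lock);
}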