Message ID: 4DDE536E.8020805@panasas.com
State: New, archived
On Thu, 2011-05-26 at 16:19 +0300, Boaz Harrosh wrote:
> Everything was ready in pnfs_roc(). The segments are released
> and the LO state is blocked until after the close is done. All that
> is needed is to send the actual layoutreturn synchronously.

Why would we want to do this?

Return-on-close was initially considered useful only for debugging.

At the interim IETF meeting in Sunnyvale, we also discussed the case
where the forgetful client has forgotten the layout: in this case the
server may decide to forget the layout too. There is no controversy in
doing this, since both the client and the server know that any
outstanding layout is supposed to be returned (and if there is a
problem, then the server always has the option of sending a
CB_LAYOUTRECALL).

Adding a synchronous call to close is in any case a bug, since close
can on occasion be sent in situations where we don't allow sleeping.

Cheers
  Trond
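For concreteness, the server-side policy described above amounts to
roughly the following. Every name here is hypothetical, a sketch of the
idea rather than knfsd code:

/* Hypothetical server-side state for one granted layout. */
struct layout_state {
        int return_on_close;    /* granted with the ROC bit set */
        int outstanding;        /* client may still hold segments */
};

/* On the last CLOSE of the file, a server may simply drop ROC layout
 * state: both sides already agree it counts as returned. Otherwise it
 * can always fall back to a CB_LAYOUTRECALL. The two callees are
 * hypothetical as well. */
static void layout_close_cleanup(struct layout_state *ls, int last_close)
{
        if (!last_close)
                return;
        if (ls->return_on_close)
                free_layout_state(ls);          /* forget, no recall needed */
        else if (ls->outstanding)
                send_cb_layoutrecall(ls);       /* the server's fallback */
}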
On 05/26/2011 05:16 PM, Trond Myklebust wrote:
> On Thu, 2011-05-26 at 16:19 +0300, Boaz Harrosh wrote:
>> Everything was ready in pnfs_roc(). The segments are released
>> and the LO state is blocked until after the close is done. All that
>> is needed is to send the actual layoutreturn synchronously.
>
> Why would we want to do this?
>
> Return-on-close was initially considered useful only for debugging.

What??

> At the interim IETF meeting in Sunnyvale, we also discussed the case
> where the forgetful client has forgotten the layout: in this case the
> server may decide to forget the layout too. There is no controversy in
> doing this, since both the client and the server know that any
> outstanding layout is supposed to be returned (and if there is a
> problem, then the server always has the option of sending a
> CB_LAYOUTRECALL).

OK, I didn't know that. So what you are saying is that if the server
sees a final close, it can go and proactively free all segments marked
with ROC?

If so then someone should fix the Linux server, because currently it
never frees them. On a modest machine like the UML I use, a few tens
of "git checkout linux" runs crash the machine with OOM. Today they
are only freed on client umount.

> Adding a synchronous call to close is in any case a bug, since close
> can on occasion be sent in situations where we don't allow sleeping.

This is done only on the final close. Isn't the very final call sync?

OK, re-inspecting the code I can see that nfs4_do_close also takes a
wait flag. I thought that the last close should always wait for all
operations to end before proceeding with the close. That's how it is
at the VFS level, but I guess life is hard. So the only possible
solution is within the same compound as the close. (Not that we need
it, as you say.)

> Cheers
>   Trond

Thanks
Boaz
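For reference, the sunrpc idiom behind such a wait flag looks roughly
like this; a sketch against the 2011-era API, not the actual
nfs4_do_close() body:

/* The setup must carry RPC_TASK_ASYNC so that rpc_run_task() dispatches
 * the call without waiting for the reply; only the optional completion
 * wait may sleep, so each call site chooses whether it can block. */
static int run_rpc_maybe_wait(struct rpc_task_setup *setup, int wait)
{
        struct rpc_task *task;
        int status = 0;

        setup->flags |= RPC_TASK_ASYNC;
        task = rpc_run_task(setup);
        if (IS_ERR(task))
                return PTR_ERR(task);
        if (wait)
                status = rpc_wait_for_completion_task(task); /* may sleep */
        rpc_put_task(task);
        return status;
}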
On Thu, 2011-05-26 at 17:45 +0300, Boaz Harrosh wrote:
> On 05/26/2011 05:16 PM, Trond Myklebust wrote:
> > On Thu, 2011-05-26 at 16:19 +0300, Boaz Harrosh wrote:
> >> Everything was ready in pnfs_roc(). The segments are released
> >> and the LO state is blocked until after the close is done. All that
> >> is needed is to send the actual layoutreturn synchronously.
> >
> > Why would we want to do this?
> >
> > Return-on-close was initially considered useful only for debugging.
>
> What??
>
> > At the interim IETF meeting in Sunnyvale, we also discussed the case
> > where the forgetful client has forgotten the layout: in this case the
> > server may decide to forget the layout too. There is no controversy in
> > doing this, since both the client and the server know that any
> > outstanding layout is supposed to be returned (and if there is a
> > problem, then the server always has the option of sending a
> > CB_LAYOUTRECALL).
>
> OK, I didn't know that. So what you are saying is that if the server
> sees a final close, it can go and proactively free all segments marked
> with ROC?
>
> If so then someone should fix the Linux server, because currently it
> never frees them. On a modest machine like the UML I use, a few tens
> of "git checkout linux" runs crash the machine with OOM. Today they
> are only freed on client umount.

That would be a bug. The server must either free or recall in this
situation.

> > Adding a synchronous call to close is in any case a bug, since close
> > can on occasion be sent in situations where we don't allow sleeping.
>
> This is done only on the final close. Isn't the very final call sync?
>
> OK, re-inspecting the code I can see that nfs4_do_close also takes a
> wait flag. I thought that the last close should always wait for all
> operations to end before proceeding with the close. That's how it is
> at the VFS level, but I guess life is hard. So the only possible
> solution is within the same compound as the close. (Not that we need
> it, as you say.)

The problem is that LAYOUTRETURN may have to be sent with a _different_
credential than the CLOSE itself.

See the description of the EXCHGID4_FLAG_BIND_PRINC_STATEID EXCHANGE_ID
flag. Although we don't set that flag in our EXCHANGE_ID requests, the
_server_ may still set it in its reply, in which case we are supposed
to obey it.

This is also a reason why sending OPEN and LAYOUTGET in the same
compound can be problematic.

Cheers
  Trond
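To make the credential point concrete: a COMPOUND goes out under a
single RPC credential, so if the server has bound the layout stateid to
a different principal than the open owner's, CLOSE and LAYOUTRETURN
cannot share one request. A sketch; layout_stateid_cred() is a
hypothetical stand-in for "whatever principal the layout stateid is
bound to", not an existing helper:

/* Hypothetical: illustrates why the two calls may need separate
 * compounds under BIND_PRINC_STATEID. Each rpc_message carries exactly
 * one credential. */
static void setup_messages(struct nfs4_state *state,
                           struct pnfs_layout_hdr *lo,
                           struct rpc_message *close_msg,
                           struct rpc_message *lrt_msg)
{
        close_msg->rpc_cred = state->owner->so_cred;    /* open owner */
        lrt_msg->rpc_cred = layout_stateid_cred(lo);    /* may differ */
}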
On Thu, 2011-05-26 at 11:04 -0400, Trond Myklebust wrote:
> See the description of the EXCHGID4_FLAG_BIND_PRINC_STATEID EXCHANGE_ID
> flag. Although we don't set that flag in our EXCHANGE_ID requests, the
> _server_ may still set it in its reply, in which case we are supposed
> to obey it.
>
> This is also a reason why sending OPEN and LAYOUTGET in the same
> compound can be problematic.

BTW: We have yet to implement support for BIND_PRINC_STATEID in the
client. One of the more "interesting" issues, I think, is that we
probably want to handle EKEYEXPIRED differently: instead of just
retrying, we should really clear out the layouts (forgetfully, since we
don't have a credential) and reget them.

Note that SP4_MACH_CRED and SP4_SSV might help here, but we're not
guaranteed that the server supports those even if it requires
BIND_PRINC_STATEID.
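A rough sketch of that EKEYEXPIRED idea. pnfs_destroy_layout() is the
existing "forget everything" helper; the dispatch function itself is
illustrative, not the real nfs4_layoutreturn_done():

/* Sketch only: with BIND_PRINC_STATEID in force we may hold no
 * credential that can retry the LAYOUTRETURN, so forget the layouts
 * instead and let a later LAYOUTGET reacquire them. */
static int layoutreturn_handle_error(struct inode *inode, int status)
{
        switch (status) {
        case -EKEYEXPIRED:
                pnfs_destroy_layout(NFS_I(inode)); /* forget, don't retry */
                return 0;       /* next I/O will LAYOUTGET afresh */
        default:
                return status;
        }
}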
diff --git a/fs/nfs/nfs4proc.c b/fs/nfs/nfs4proc.c
index 5734009..f32cc46 100644
--- a/fs/nfs/nfs4proc.c
+++ b/fs/nfs/nfs4proc.c
@@ -5745,6 +5745,7 @@ int nfs4_proc_layoutreturn(struct nfs4_layoutreturn *lrp)
 		.rpc_message = &msg,
 		.callback_ops = &nfs4_layoutreturn_call_ops,
 		.callback_data = lrp,
+		.flags = RPC_TASK_ASYNC,
 	};
 	int status;
 
@@ -5753,8 +5754,15 @@ int nfs4_proc_layoutreturn(struct nfs4_layoutreturn *lrp)
 	if (IS_ERR(task))
 		return PTR_ERR(task);
 	status = task->tk_status;
-	dprintk("<-- %s status=%d\n", __func__, status);
+
+	if (!status && lrp->args.sync) {
+		status = nfs4_wait_for_completion_rpc_task(task);
+		if (status == 0)
+			status = task->tk_status;
+	}
+
 	rpc_put_task(task);
+	dprintk("<-- %s status=%d\n", __func__, status);
 	return status;
 }
diff --git a/fs/nfs/pnfs.c b/fs/nfs/pnfs.c
index 9b749f2..7dc15d8 100644
--- a/fs/nfs/pnfs.c
+++ b/fs/nfs/pnfs.c
@@ -620,6 +620,31 @@ out_err_free:
 	return NULL;
 }
 
+static int __send_layoutreturn(struct inode *ino, nfs4_stateid *stateid,
+			       bool sync)
+{
+	struct nfs4_layoutreturn *lrp = NULL;
+	int status;
+
+	/* lrp is freed in nfs4_layoutreturn_release */
+	lrp = kzalloc(sizeof(*lrp), GFP_KERNEL);
+	if (unlikely(!lrp)) {
+		put_layout_hdr(NFS_I(ino)->layout);
+		status = -ENOMEM;
+		goto out;
+	}
+	lrp->args.sync = sync;
+	lrp->args.stateid = *stateid;
+	lrp->args.reclaim = 0;
+	lrp->args.layout_type = NFS_SERVER(ino)->pnfs_curr_ld->id;
+	lrp->args.inode = ino;
+	lrp->clp = NFS_SERVER(ino)->nfs_client;
+
+	status = nfs4_proc_layoutreturn(lrp);
+out:
+	dprintk("%s: ==> %d\n", __func__, status);
+	return status;
+}
 /* Initiates a LAYOUTRETURN(FILE) */
 int
 _pnfs_return_layout(struct inode *ino)
@@ -627,7 +652,6 @@ _pnfs_return_layout(struct inode *ino)
 	struct pnfs_layout_hdr *lo = NULL;
 	struct nfs_inode *nfsi = NFS_I(ino);
 	LIST_HEAD(tmp_list);
-	struct nfs4_layoutreturn *lrp = NULL;
 	nfs4_stateid stateid;
 	int status = 0;
 
@@ -647,37 +671,23 @@ _pnfs_return_layout(struct inode *ino)
 	pnfs_free_lseg_list(&tmp_list);
 
 	WARN_ON(test_bit(NFS_INO_LAYOUTCOMMIT, &nfsi->flags));
-
-	/* lrp is freed in nfs4_layoutreturn_release */
-	lrp = kzalloc(sizeof(*lrp), GFP_KERNEL);
-	if (unlikely(!lrp)) {
-		put_layout_hdr(NFS_I(ino)->layout);
-		status = -ENOMEM;
-		goto out;
-	}
-	lrp->args.stateid = stateid;
-	lrp->args.reclaim = 0;
-	lrp->args.layout_type = NFS_SERVER(ino)->pnfs_curr_ld->id;
-	lrp->args.inode = ino;
-	lrp->clp = NFS_SERVER(ino)->nfs_client;
-
-	status = nfs4_proc_layoutreturn(lrp);
+	status = __send_layoutreturn(ino, &stateid, false);
 out:
-	if (unlikely(status))
-		kfree(lrp);
 	dprintk("<-- %s status: %d\n", __func__, status);
 	return status;
 }
 
 bool pnfs_roc(struct inode *ino)
 {
+	struct nfs_inode *nfsi = NFS_I(ino);
 	struct pnfs_layout_hdr *lo;
 	struct pnfs_layout_segment *lseg, *tmp;
 	LIST_HEAD(tmp_list);
 	bool found = false;
+	nfs4_stateid stateid;
 
 	spin_lock(&ino->i_lock);
-	lo = NFS_I(ino)->layout;
+	lo = nfsi->layout;
 	if (!lo || !test_and_clear_bit(NFS_LAYOUT_ROC, &lo->plh_flags) ||
 	    test_bit(NFS_LAYOUT_BULK_RECALL, &lo->plh_flags))
 		goto out_nolayout;
@@ -689,9 +699,13 @@ bool pnfs_roc(struct inode *ino)
 	if (!found)
 		goto out_nolayout;
 	lo->plh_block_lgets++;
+	get_layout_hdr(lo); /* matched in nfs4_layoutreturn_release */
 	get_layout_hdr(lo); /* matched in pnfs_roc_release */
+	stateid = nfsi->layout->plh_stateid;
 	spin_unlock(&ino->i_lock);
 	pnfs_free_lseg_list(&tmp_list);
+	/* We ignore the return code from layoutreturn, layout is forgotten */
+	__send_layoutreturn(ino, &stateid, true);
 	return true;
 
 out_nolayout:
diff --git a/include/linux/nfs_xdr.h b/include/linux/nfs_xdr.h
index df99102..00b391a 100644
--- a/include/linux/nfs_xdr.h
+++ b/include/linux/nfs_xdr.h
@@ -272,6 +272,7 @@ struct nfs4_layoutcommit_data {
 };
 
 struct nfs4_layoutreturn_args {
+	bool sync;
 	__u32 reclaim;
 	__u32 layout_type;
 	struct inode *inode;
Everything was ready in pnfs_roc(). The segments are released and the
LO state is blocked until after the close is done. All that is needed
is to send the actual layoutreturn synchronously.

RFC1: It looks like the close that comes afterwards is also sync, so I
assumed it would always be safe to wait here. Is that true?

RFC2: The ideal would be to send the layoutreturn together with the
close, in the same compound. The only little problem with that is that
at error handling we would need to analyse whether the error came from
the layoutreturn part of the compound, then re-encode the close without
the layoutreturn and resend. I mean, what stupid server returns an
error on layoutreturn?! Really, why is it useful information to the
client? What can it do differently? Anyway, I'm too lazy to do the
right thing right now. I promise to put it on my TODO and do it soon.

TODO: mark_lseg_invalid() should receive and collect the union *range*
of all the segments that were ROC, and only send a layoutreturn on that
collected range. (If lseg_list is empty, do an all-file return.) A
sketch of such a union helper follows after the diffstat below.

Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
---
 fs/nfs/nfs4proc.c       |   10 ++++++++-
 fs/nfs/pnfs.c           |   52 +++++++++++++++++++++++++++++-----------------
 include/linux/nfs_xdr.h |    1 +
 3 files changed, 43 insertions(+), 20 deletions(-)
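A rough sketch of the helper that TODO implies; hypothetical code, not
part of this patch (field names follow struct pnfs_layout_range, where
a length of NFS4_MAX_UINT64 means "to EOF"):

/* Widen *u to also cover *r, so the ROC layoutreturn can be sent for
 * just the union of the invalidated segments. */
static void pnfs_range_union(struct pnfs_layout_range *u,
			     const struct pnfs_layout_range *r)
{
	u64 u_end = (u->length == NFS4_MAX_UINT64) ?
			NFS4_MAX_UINT64 : u->offset + u->length;
	u64 r_end = (r->length == NFS4_MAX_UINT64) ?
			NFS4_MAX_UINT64 : r->offset + r->length;

	if (r->offset < u->offset)
		u->offset = r->offset;
	if (r_end > u_end)
		u_end = r_end;
	u->length = (u_end == NFS4_MAX_UINT64) ?
			NFS4_MAX_UINT64 : u_end - u->offset;
}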