diff mbox

[2/3] NFS: Allow for asynchronous WRITE_PLUS calls

Message ID 1382972418-2249-3-git-send-email-bjschuma@netapp.com (mailing list archive)
State New, archived
Headers show

Commit Message

Bryan Schumaker Oct. 28, 2013, 3 p.m. UTC
Clients are required to support CB_OFFLOAD for the NFS4_CONTENT_HOLE arm
of the WRITE_PLUS union.

Signed-off-by: Anna Schumaker <bjschuma@netapp.com>
---
 fs/nfs/callback.h      | 13 ++++++++++++
 fs/nfs/callback_proc.c |  8 ++++++++
 fs/nfs/callback_xdr.c  | 54 +++++++++++++++++++++++++++++++++++++++++++++++---
 fs/nfs/nfs4_fs.h       |  3 +++
 fs/nfs/nfs4file.c      | 54 +++++++++++++++++++++++++++++++++++++++++++++++++-
 fs/nfs/nfs4xdr.c       |  2 +-
 6 files changed, 129 insertions(+), 5 deletions(-)

Comments

Christoph Hellwig Oct. 29, 2013, 7:39 a.m. UTC | #1
On Mon, Oct 28, 2013 at 11:00:17AM -0400, Anna Schumaker wrote:
> Clients are required to support CB_OFFLOAD for the NFS4_CONTENT_HOLE arm
> of the WRITE_PLUS union.

This sounds pretty stupid.  Just curious, who got this braindamage into
the standard?

--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Trond Myklebust Oct. 29, 2013, 12:44 p.m. UTC | #2
On Oct 29, 2013, at 3:39 AM, Christoph Hellwig <hch@infradead.org> wrote:

> On Mon, Oct 28, 2013 at 11:00:17AM -0400, Anna Schumaker wrote:
>> Clients are required to support CB_OFFLOAD for the NFS4_CONTENT_HOLE arm
>> of the WRITE_PLUS union.
> 
> This sounds pretty stupid.  Just curious, who got this braindamage into
> the standard?
> 

It exists because the server vendors were worried that operations such as preallocation and/or hole punching can take a more or less unbounded amount of time due to the 64-bit size. By using an (optional) callback method, the server can free up the RPC slot that would otherwise be kept waiting in the synchronous case.

Cheers
  Trond--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Christoph Hellwig Oct. 29, 2013, 12:48 p.m. UTC | #3
On Tue, Oct 29, 2013 at 12:44:50PM +0000, Myklebust, Trond wrote:
> It exists because the server vendors were worried that operations such as preallocation and/or hole punching can take a more or less unbounded amount of time due to the 64-bit size. By using an (optional) callback method, the server can free up the RPC slot that would otherwise be kept waiting in the synchronous case.

Must be some horried filesystems on those vendors servers.  Both
operations should be O(nr_extents), where nr_extents should be extremely
low for the preallocation and still reasonably low for hole punches,
although users control the allocastion pattern.

There's a reason why these aren't aio operations locally either.

--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Trond Myklebust Oct. 29, 2013, 1:05 p.m. UTC | #4
On Oct 29, 2013, at 8:48 AM, Hellwig Christoph <hch@infradead.org> wrote:

> On Tue, Oct 29, 2013 at 12:44:50PM +0000, Myklebust, Trond wrote:
>> It exists because the server vendors were worried that operations such as preallocation and/or hole punching can take a more or less unbounded amount of time due to the 64-bit size. By using an (optional) callback method, the server can free up the RPC slot that would otherwise be kept waiting in the synchronous case.
> 
> Must be some horried filesystems on those vendors servers.  Both
> operations should be O(nr_extents), where nr_extents should be extremely
> low for the preallocation and still reasonably low for hole punches,
> although users control the allocastion pattern.
> 
> There's a reason why these aren't aio operations locally either.
> 

Imagine someone wanting to punch a 50TB hole in NTFS, for instance. It doesn't have real holes, so you'd end up needing to zero out the existing extents in that region of the file. That will take time, even with an O(nr_extents) algorithm.


--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Christoph Hellwig Oct. 29, 2013, 1:08 p.m. UTC | #5
On Tue, Oct 29, 2013 at 01:05:44PM +0000, Myklebust, Trond wrote:
> Imagine someone wanting to punch a 50TB hole in NTFS, for instance. It doesn't have real holes, so you'd end up needing to zero out the existing extents in that region of the file. That will take time, even with an O(nr_extents) algorithm.

That behaviour is not a hole punch, and should not be multiplexed onto
a whole punch on the wire command!  If such a use case is important
enough there should be an equivanet of the SCSI WRITE SAME command which
might make some sense to be implemented async.

We'd surely not support it in the Linux nfsd, though.
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Trond Myklebust Oct. 29, 2013, 1:21 p.m. UTC | #6
On Oct 29, 2013, at 9:08 AM, Hellwig Christoph <hch@infradead.org> wrote:

> On Tue, Oct 29, 2013 at 01:05:44PM +0000, Myklebust, Trond wrote:
>> Imagine someone wanting to punch a 50TB hole in NTFS, for instance. It doesn't have real holes, so you'd end up needing to zero out the existing extents in that region of the file. That will take time, even with an O(nr_extents) algorithm.
> 
> That behaviour is not a hole punch, and should not be multiplexed onto
> a whole punch on the wire command!  If such a use case is important
> enough there should be an equivanet of the SCSI WRITE SAME command which
> might make some sense to be implemented async.
> 
> We'd surely not support it in the Linux nfsd, though.

The current draft spec even allows the client to specify a "pattern" to be written to the "hole". It is mainly there for applications like Oracle, that initialise empty blocks with a non-trivial pattern so that they can do the on-disk equivalent of memory poisoning. By doing it in this way, it allows the more powerful storage arrays to just implement that patterned region as a set of deduplicated blocks instead of actually allocating all the blocks.

You can indeed argue that filesystems that don't have holes and/or deduplication should not implement this operation at all, however the functionality is there in case they do...

--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Christoph Hellwig Oct. 29, 2013, 1:24 p.m. UTC | #7
On Tue, Oct 29, 2013 at 01:21:26PM +0000, Myklebust, Trond wrote:
> The current draft spec even allows the client to specify a "pattern" to be written to the "hole".

That's indeed the WRITE SAME lookalike then.  At least the SCSI people
were smart enough to define an unmap bit for "hole punching" and allow
the target to reject all other versions if they don't want to support
it.

Given that NFS v4.2 isn't finalized I'd very much recommend to

 a) properly separate those case s
 b) do not make the async version mandatory
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Trond Myklebust Oct. 29, 2013, 1:41 p.m. UTC | #8
On Oct 29, 2013, at 9:24 AM, Hellwig Christoph <hch@infradead.org> wrote:

> On Tue, Oct 29, 2013 at 01:21:26PM +0000, Myklebust, Trond wrote:
>> The current draft spec even allows the client to specify a "pattern" to be written to the "hole".
> 
> That's indeed the WRITE SAME lookalike then.  At least the SCSI people
> were smart enough to define an unmap bit for "hole punching" and allow
> the target to reject all other versions if they don't want to support
> it.
> 
> Given that NFS v4.2 isn't finalized I'd very much recommend to
> 
> a) properly separate those case s
> b) do not make the async version mandatory

   union write_plus_arg4 switch (data_content4 wpa_content) {
   case NFS4_CONTENT_DATA:
           data4           wpa_data;
   case NFS4_CONTENT_APP_DATA_HOLE:
           app_data_hole4  wpa_adh;
   case NFS4_CONTENT_HOLE:
           data_info4      wpa_hole;
   default:
           void;
   };

I suppose we could add cases: NFS4_CONTENT_APP_DATA_HOLE_SYNC, NFS4_CONTENT_HOLE_SYNC to allow the client to specify that it doesn't want asynchronous behaviour (and add error cases to allow the server to specify that it cannot safely do so).

There is already a 'di_allocated' boolean for the NFS4_CONTENT_HOLE case (which is used to distinguish between holes and zeroed extents). The only thing missing there is an error case, AFAICS, to allow the server to return that it doesn't support holes.

Tom, any thoughts on whether or not this kind of change to the spec is doable at this time?

--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
J. Bruce Fields Oct. 29, 2013, 1:44 p.m. UTC | #9
On Tue, Oct 29, 2013 at 06:24:21AM -0700, Hellwig Christoph wrote:
> On Tue, Oct 29, 2013 at 01:21:26PM +0000, Myklebust, Trond wrote:
> > The current draft spec even allows the client to specify a "pattern" to be written to the "hole".
> 
> That's indeed the WRITE SAME lookalike then.  At least the SCSI people
> were smart enough to define an unmap bit for "hole punching" and allow
> the target to reject all other versions if they don't want to support
> it.
> 
> Given that NFS v4.2 isn't finalized I'd very much recommend to
> 
>  a) properly separate those case s
>  b) do not make the async version mandatory

That sounds reasonable to me.

By the way, source is available from https://github.com/loghyr/NFSv4.2,
and Tom takes patches.....

(Or maybe Anna can be talked into that.)

--b.
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/fs/nfs/callback.h b/fs/nfs/callback.h
index 84326e9..1653958 100644
--- a/fs/nfs/callback.h
+++ b/fs/nfs/callback.h
@@ -187,6 +187,19 @@  extern __be32 nfs4_callback_devicenotify(
 	void *dummy, struct cb_process_state *cps);
 
 #endif /* CONFIG_NFS_V4_1 */
+
+#ifdef CONFIG_NFS_V4_2
+struct cb_offloadargs {
+	struct nfs_fh			dst_fh;
+	nfs4_stateid			stateid;
+	struct nfs42_write_res		write_res;
+};
+
+extern __be32 nfs4_callback_offload(struct cb_offloadargs *, void *,
+				    struct cb_process_state *);
+int nfs_wake_offload(struct cb_offloadargs *);
+#endif /* CONFIG_NFS_V4_2 */
+
 extern int check_gss_callback_principal(struct nfs_client *, struct svc_rqst *);
 extern __be32 nfs4_callback_getattr(struct cb_getattrargs *args,
 				    struct cb_getattrres *res,
diff --git a/fs/nfs/callback_proc.c b/fs/nfs/callback_proc.c
index ae2e87b..8ff6400 100644
--- a/fs/nfs/callback_proc.c
+++ b/fs/nfs/callback_proc.c
@@ -536,3 +536,11 @@  out:
 	return status;
 }
 #endif /* CONFIG_NFS_V4_1 */
+
+#ifdef CONFIG_NFS_V4_2
+__be32 nfs4_callback_offload(struct cb_offloadargs *args, void *dummy,
+			     struct cb_process_state *cps)
+{
+	return htonl(nfs_wake_offload(args));
+}
+#endif /* CONFIG_NFS_V4_2 */
diff --git a/fs/nfs/callback_xdr.c b/fs/nfs/callback_xdr.c
index f4ccfe6..843e074 100644
--- a/fs/nfs/callback_xdr.c
+++ b/fs/nfs/callback_xdr.c
@@ -35,6 +35,14 @@ 
 #define CB_OP_RECALLSLOT_RES_MAXSZ	(CB_OP_HDR_RES_MAXSZ)
 #endif /* CONFIG_NFS_V4_1 */
 
+#if defined(CONFIG_NFS_V4_2)
+#define CB_OP_OFFLOAD_RES_MAXSZ		(CB_OP_HDR_RES_MAXSZ + \
+					 1 + XDR_QUADLEN(NFS4_FHSIZE) + \
+					 XDR_QUADLEN(NFS4_STATEID_SIZE) + \
+					 1 + XDR_QUADLEN(NFS4_STATEID_SIZE) + \
+					 2 + 1 + XDR_QUADLEN(NFS4_VERIFIER_SIZE))
+#endif /* CONFIG_NFS_V4_2 */
+
 #define NFSDBG_FACILITY NFSDBG_CALLBACK
 
 /* Internal error code */
@@ -527,6 +535,37 @@  static __be32 decode_recallslot_args(struct svc_rqst *rqstp,
 
 #endif /* CONFIG_NFS_V4_1 */
 
+#ifdef CONFIG_NFS_V4_2
+static inline __be32 decode_write_res(struct xdr_stream *xdr,
+				      struct nfs42_write_res *write_res)
+{
+	__be32 status = decode_write_response(xdr, write_res);
+	if (status == -EIO)
+		return htonl(NFS4ERR_RESOURCE);
+	return htonl(status);
+}
+
+static __be32 decode_offload_args(struct svc_rqst *rqstp,
+				  struct xdr_stream *xdr,
+				  struct cb_offloadargs *args)
+{
+	__be32 status;
+
+	status = decode_fh(xdr, &args->dst_fh);
+	if (unlikely(status != 0))
+		goto out;
+
+	status = decode_stateid(xdr, &args->stateid);
+	if (unlikely(status != 0))
+		goto out;
+
+	status = decode_write_res(xdr, &args->write_res);
+out:
+	dprintk("%s: exit with status = %d\n", __func__, ntohl(status));
+	return status;
+}
+#endif /* CONFIG_NFS_V4_2 */
+
 static __be32 encode_string(struct xdr_stream *xdr, unsigned int len, const char *str)
 {
 	__be32 *p;
@@ -794,9 +833,11 @@  preprocess_nfs42_op(int nop, unsigned int op_nr, struct callback_op **op)
 	if (status != htonl(NFS4ERR_OP_ILLEGAL))
 		return status;
 
-	if (op_nr == OP_CB_OFFLOAD)
-		return htonl(NFS4ERR_NOTSUPP);
-	return htonl(NFS4ERR_OP_ILLEGAL);
+	if (op_nr != OP_CB_OFFLOAD)
+		return htonl(NFS4ERR_OP_ILLEGAL);
+
+	*op = &callback_ops[op_nr];
+	return htonl(NFS4_OK);
 }
 #else /* CONFIG_NFS_V4_2 */
 static __be32
@@ -991,6 +1032,13 @@  static struct callback_op callback_ops[] = {
 		.res_maxsize = CB_OP_RECALLSLOT_RES_MAXSZ,
 	},
 #endif /* CONFIG_NFS_V4_1 */
+#if defined(CONFIG_NFS_V4_2)
+	[OP_CB_OFFLOAD] = {
+		.process_op = (callback_process_op_t)nfs4_callback_offload,
+		.decode_args = (callback_decode_arg_t)decode_offload_args,
+		.res_maxsize = CB_OP_OFFLOAD_RES_MAXSZ,
+	},
+#endif
 };
 
 /*
diff --git a/fs/nfs/nfs4_fs.h b/fs/nfs/nfs4_fs.h
index 1557f15..f52fc5f 100644
--- a/fs/nfs/nfs4_fs.h
+++ b/fs/nfs/nfs4_fs.h
@@ -464,6 +464,9 @@  static inline void nfs4_unregister_sysctl(void)
 
 /* nfs4xdr.c */
 extern struct rpc_procinfo nfs4_procedures[];
+#if defined(CONFIG_NFS_V4_2)
+int decode_write_response(struct xdr_stream *, struct nfs42_write_res *);
+#endif /* CONFIG_NFS_V4_2 */
 
 struct nfs4_mount_data;
 
diff --git a/fs/nfs/nfs4file.c b/fs/nfs/nfs4file.c
index ab2fbe0..caf0658 100644
--- a/fs/nfs/nfs4file.c
+++ b/fs/nfs/nfs4file.c
@@ -4,6 +4,7 @@ 
  *  Copyright (C) 1992  Rick Sladkey
  */
 #include <linux/nfs_fs.h>
+#include "callback.h"
 #include "internal.h"
 #include "fscache.h"
 #include "pnfs.h"
@@ -119,6 +120,50 @@  nfs4_file_fsync(struct file *file, loff_t start, loff_t end, int datasync)
 }
 
 #ifdef CONFIG_NFS_V4_2
+static LIST_HEAD(nfs_offload_waitlist);
+static DEFINE_SPINLOCK(nfs_offload_lock);
+
+static unsigned int nfs_wait_for_offload(struct nfs42_write_res *res)
+{
+	spin_lock(&nfs_offload_lock);
+	list_add(&res->wait_list, &nfs_offload_waitlist);
+	spin_unlock(&nfs_offload_lock);
+
+	wait_for_completion(&res->completion);
+
+	spin_lock(&nfs_offload_lock);
+	list_del(&res->wait_list);
+	spin_unlock(&nfs_offload_lock);
+
+	return res->wr_status;
+}
+
+static struct nfs42_write_res *nfs_find_waiting_offload(nfs4_stateid *stateid)
+{
+	struct nfs42_write_res *cur;
+
+	list_for_each_entry(cur, &nfs_offload_waitlist, wait_list) {
+		if (memcmp(stateid, &cur->wr_stateid, sizeof(nfs4_stateid)) == 0)
+			return cur;
+	}
+	return NULL;
+}
+
+int nfs_wake_offload(struct cb_offloadargs *offload)
+{
+	struct nfs42_write_res *write;
+
+	spin_lock(&nfs_offload_lock);
+	write = nfs_find_waiting_offload(&offload->stateid);
+	spin_unlock(&nfs_offload_lock);
+	if (write == NULL)
+		return NFS4ERR_BAD_STATEID;
+
+	write->wr_bytes_copied = offload->write_res.wr_bytes_copied;
+	complete(&write->completion);
+	return NFS4_OK;
+}
+
 static int nfs42_select_stateid(struct file *file, nfs4_stateid *stateid,
 				fmode_t mode, struct nfs_open_context **ctx)
 {
@@ -164,7 +209,14 @@  static long nfs42_fallocate(struct file *file, int mode, loff_t offset, loff_t l
 	if (err < 0)
 		return err;
 
-	return nfs42_proc_fallocate(server, &args, &res, ctx->cred);
+	init_completion(&res.completion);
+	err = nfs42_proc_fallocate(server, &args, &res, ctx->cred);
+	if (err)
+		return err;
+
+	if (res.wr_async == true)
+		err = nfs_wait_for_offload(&res);
+	return err;
 }
 #endif /* CONFIG_NFS_V4_2 */
 
diff --git a/fs/nfs/nfs4xdr.c b/fs/nfs/nfs4xdr.c
index 4ffecbe..28e2fad 100644
--- a/fs/nfs/nfs4xdr.c
+++ b/fs/nfs/nfs4xdr.c
@@ -6041,7 +6041,7 @@  static int decode_free_stateid(struct xdr_stream *xdr,
 #endif /* CONFIG_NFS_V4_1 */
 
 #ifdef CONFIG_NFS_V4_2
-static int decode_write_response(struct xdr_stream *xdr,
+int decode_write_response(struct xdr_stream *xdr,
 				 struct nfs42_write_res *write_res)
 {
 	__be32 *p;