[RFC] backchannel overflows

Message ID 20150430143731.GA22038@infradead.org
State New, archived

Commit Message

Christoph Hellwig April 30, 2015, 2:37 p.m. UTC
On Thu, Apr 30, 2015 at 10:34:02AM -0400, Chuck Lever wrote:
> I've been discussing the possibility of adding more session slots on
> the Linux NFS client with jlayton. We think it would be straightforward,
> once the workqueue-based NFSD patches are in, to make the backchannel
> service into a workqueue. Then it would be a simple matter to increase
> the number of session slots.
> 
> We haven't discussed what would be needed on the server side of this
> equation, but it sounds like it has some deeper problems if it is not
> obeying the session slot table limits advertised by the client.

No, the client isn't obeying its own slot limits.

The problem is that when the client responds to a callback it still
holds a reference on the rpc_rqst for a while.  If the server
sends the next callback fast enough to hit that race window, the
client incorrectly rejects it.  Note that we never even get
to the nfs code that checks the slot id in this case; it's the low-level
sunrpc code that is the problem.

Note that with my patches the server can recover from this client
problem by resetting the connection, but that's not optimal
behavior.
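
To make the window concrete, here is a toy model of the race in plain
userspace C (illustrative names only, not the actual sunrpc code):

#include <stdbool.h>
#include <stdio.h>

struct bc_rqst { bool in_use; };

static struct bc_rqst pool = { false };		/* the one preallocated rpc_rqst */

static struct bc_rqst *bc_lookup(void)
{
	return pool.in_use ? NULL : &pool;
}

int main(void)
{
	struct bc_rqst *req = bc_lookup();	/* callback #1 arrives */

	req->in_use = true;
	/* the reply to callback #1 is transmitted here, but ... */
	if (bc_lookup() == NULL)		/* callback #2 arrives */
		printf("callback #2 rejected: rpc_rqst still held\n");
	req->in_use = false;			/* ... the release only happens now */
	return 0;
}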

My current patch, which includes the explanation, is below.

---
commit 7e995048c9917766e0ae96889ef48733c2229b7e
Author: Christoph Hellwig <hch@lst.de>
Date:   Thu Apr 30 11:53:11 2015 +0200

    nfs: overallocate backchannel requests
    
    The RPC code releases the rpc_rqst much later than sending the reply
    to the server.  To avoid low-level errors that lead to connection
    resets under load, allocate twice as many rpc_rqst structures as
    callback slots to allow for double buffering.
    
    Signed-off-by: Christoph Hellwig <hch@lst.de>


Comments

Trond Myklebust April 30, 2015, 3:02 p.m. UTC | #1
Apologies for the resend...

On Thu, Apr 30, 2015 at 11:01 AM, Trond Myklebust
<trond.myklebust@primarydata.com> wrote:
>
>
> On Thu, Apr 30, 2015 at 10:37 AM, Christoph Hellwig <hch@infradead.org>
> wrote:
>>
>> On Thu, Apr 30, 2015 at 10:34:02AM -0400, Chuck Lever wrote:
>> > I've been discussing the possibility of adding more session slots on
>> > the Linux NFS client with jlayton. We think it would be straightforward,
>> > once the workqueue-based NFSD patches are in, to make the backchannel
>> > service into a workqueue. Then it would be a simple matter to increase
>> > the number of session slots.
>> >
>> > We haven't discussed what would be needed on the server side of this
>> > equation, but it sounds like it has some deeper problems if it is not
>> > obeying the session slot table limits advertised by the client.
>>
>> No, the client isn't obeying its own slot limits.
>>
>> The problem is that when the client responds to a callback it still
>> holds a reference on the rpc_rqst for a while.  If the server
>> sends the next callback fast enough to hit that race window, the
>> client incorrectly rejects it.  Note that we never even get
>> to the nfs code that checks the slot id in this case; it's the low-level
>> sunrpc code that is the problem.
>
>
> We can add dynamic allocation of a new slot as part of the backchannel reply
> transmit workload. That way we close the race without opening the door to
> violations of the session limits.
>
> Cheers
>   Trond
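
A rough sketch of that allocate-on-transmit idea, in plain C with
hypothetical names (the real change would live in the sunrpc
backchannel transmit path):

#include <stdio.h>
#include <stdlib.h>

#define BC_SLOTS 1			/* slots the client advertises */

struct bc_rqst { int id; };

static struct bc_rqst *bc_free_list[BC_SLOTS];
static int bc_free_count;

/* Reply transmit path: provision the replacement before the old
 * rpc_rqst is retired, so a back-to-back callback always finds a
 * free request, yet the pool never exceeds the session limit. */
static void bc_send_reply(struct bc_rqst *old)
{
	struct bc_rqst *fresh = malloc(sizeof(*fresh));

	if (fresh && bc_free_count < BC_SLOTS) {
		fresh->id = old->id + 1;
		bc_free_list[bc_free_count++] = fresh;	/* usable immediately */
	} else {
		free(fresh);		/* stay within the advertised slots */
	}
	printf("reply for rqst %d sent\n", old->id);
	free(old);			/* the old request is retired last */
}

int main(void)
{
	struct bc_rqst *first = malloc(sizeof(*first));

	if (!first)
		return 1;
	first->id = 0;
	bc_send_reply(first);
	printf("%d request(s) free for the next callback\n", bc_free_count);
	return 0;
}

Because the replacement is created on the send path rather than on
demand, growth is bounded by the slot count instead of by how fast the
server can issue callbacks.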
Chuck Lever April 30, 2015, 3:11 p.m. UTC | #2
On Apr 30, 2015, at 11:01 AM, Trond Myklebust <trond.myklebust@primarydata.com> wrote:

>> 
>> 
>> On Thu, Apr 30, 2015 at 10:37 AM, Christoph Hellwig <hch@infradead.org> wrote:
>> On Thu, Apr 30, 2015 at 10:34:02AM -0400, Chuck Lever wrote:
>> > I've been discussing the possibility of adding more session slots on
>> > the Linux NFS client with jlayton. We think it would be straightforward,
>> > once the workqueue-based NFSD patches are in, to make the backchannel
>> > service into a workqueue. Then it would be a simple matter to increase
>> > the number of session slots.
>> >
>> > We haven't discussed what would be needed on the server side of this
>> > equation, but it sounds like it has some deeper problems if it is not
>> > obeying the session slot table limits advertised by the client.
>> 
>> No, the client isn't obeying its own slot limits.
>> 
>> The problem is that when the client responds to a callback it still
>> holds a reference on the rpc_rqst for a while.  If the server
>> sends the next callback fast enough to hit that race window, the
>> client incorrectly rejects it.  Note that we never even get
>> to the nfs code that checks the slot id in this case; it's the low-level
>> sunrpc code that is the problem.
> 
> We can add dynamic allocation of a new slot as part of the backchannel reply transmit workload. That way we close the race without opening the door to violations of the session limits.

I’ll have to think about how that would affect the RPC/RDMA backchannel.
Transport resources are allocated when the transport is created, and
can’t be dynamically added. (It certainly wouldn’t be a problem to
overprovision, as Christoph has done here.)

I was thinking that using a local copy of the rpc_rqst for sending
the backchannel reply, and freeing the original before the send, might
close the window.
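
A minimal sketch of that local-copy idea, again in plain C with
hypothetical names rather than the real sunrpc types:

#include <stdio.h>
#include <string.h>

struct bc_rqst {
	char reply[64];		/* stands in for the XDR send buffer */
	int free;
};

static void transmit(const char *buf)
{
	printf("sending: %s\n", buf);
}

static void bc_send_reply(struct bc_rqst *req)
{
	struct bc_rqst copy;

	memcpy(&copy, req, sizeof(copy));	/* local copy of the reply state */
	req->free = 1;				/* the rqst is reusable right away */
	transmit(copy.reply);			/* the send works from the copy */
}

int main(void)
{
	struct bc_rqst req = { .reply = "CB_SEQUENCE reply", .free = 0 };

	bc_send_reply(&req);
	printf("rqst free before the send completed: %d\n", req.free);
	return 0;
}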

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com



Chuck Lever April 30, 2015, 5:41 p.m. UTC | #3
On Apr 30, 2015, at 11:11 AM, Chuck Lever <chuck.lever@oracle.com> wrote:

> 
> On Apr 30, 2015, at 11:01 AM, Trond Myklebust <trond.myklebust@primarydata.com> wrote:
> 
>>> 
>>> 
>>> On Thu, Apr 30, 2015 at 10:37 AM, Christoph Hellwig <hch@infradead.org> wrote:
>>> On Thu, Apr 30, 2015 at 10:34:02AM -0400, Chuck Lever wrote:
>>>> I've been discussing the possibility of adding more session slots on
>>>> the Linux NFS client with jlayton. We think it would be straightforward,
>>>> once the workqueue-based NFSD patches are in, to make the backchannel
>>>> service into a workqueue. Then it would be a simple matter to increase
>>>> the number of session slots.
>>>> 
>>>> We haven't discussed what would be needed on the server side of this
>>>> equation, but it sounds like it has some deeper problems if it is not
>>>> obeying the session slot table limits advertised by the client.
>>> 
>>> No, the client isn't obeying its own slot limits.
>>> 
>>> The problem is that when the client responds to a callback it still
>>> holds a reference on the rpc_rqst for a while.  If the server
>>> sends the next callback fast enough to hit that race window, the
>>> client incorrectly rejects it.  Note that we never even get
>>> to the nfs code that checks the slot id in this case; it's the low-level
>>> sunrpc code that is the problem.
>> 
>> We can add dynamic allocation of a new slot as part of the backchannel reply transmit workload. That way we close the race without opening the door to violations of the session limits.
> 
> I’ll have to think about how that would affect the RPC/RDMA backchannel.
> Transport resources are allocated when the transport is created, and
> can’t be dynamically added. (It certainly wouldn’t be a problem to
> overprovision, as Christoph has done here.)

We discussed this briefly during the Linux NFS town hall meeting.
I agree using dynamic slot allocation for TCP is fine, and RPC/RDMA
can use simple overprovisioning.

This way the upper layer (NFSv4.1 client) doesn’t have to be aware of
limitations in the RPC layer mechanism.

Trond may have an additional concern that I didn’t capture.

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com



Christoph Hellwig May 1, 2015, 5:23 p.m. UTC | #4
On Thu, Apr 30, 2015 at 01:41:02PM -0400, Chuck Lever wrote:
> We discussed this briefly during the Linux NFS town hall meeting.
> I agree using dynamic slot allocation for TCP is fine, and RPC/RDMA
> can use simple overprovisioning.
> 
> This way the upper layer (NFSv4.1 client) doesn't have to be aware of
> limitations in the RPC layer mechanism.
> 
> Trond may have an additional concern that I didn't capture.

The other option would be to simply overallocate in the transport layer,
as that is the layer which causes the problem to start with.

That being said, what is the argument for doing any sort of static
allocation here?  I'm fine with doing fully dynamic allocation if that
works out fine, but a mixed static / dynamic allocation sounds like a
nightmare.
Trond Myklebust May 1, 2015, 5:28 p.m. UTC | #5
On Fri, May 1, 2015 at 1:23 PM, Christoph Hellwig <hch@infradead.org> wrote:
> On Thu, Apr 30, 2015 at 01:41:02PM -0400, Chuck Lever wrote:
>> We discussed this briefly during the Linux NFS town hall meeting.
>> I agree using dynamic slot allocation for TCP is fine, and RPC/RDMA
>> can use simple overprovisioning.
>>
>> This way the upper layer (NFSv4.1 client) doesn't have to be aware of
>> limitations in the RPC layer mechanism.
>>
>> Trond may have an additional concern that I didn't capture.
>
> The other option would be to simply overallocate in the transport layer,
> as that is the layer which causes the problem to start with.
>
> That being said, what is the argument for doing any sort of static
> allocation here?  I'm fine with doing fully dynamic allocation if that
> works out fine, but a mixed static / dynamic allocation sounds like a
> nightmare.

The concern is not so much static vs dynamic. The concern is limiting
incoming RPC calls to the number allowed by the NFSv4.1 session. Right
now, the static allocation enforces the limit of 1 slot that the
client offers to the server (albeit with the race) and so I want any
replacement to meet the same requirement.

Cheers
  Trond
Chuck Lever May 1, 2015, 5:31 p.m. UTC | #6
On May 1, 2015, at 1:23 PM, Christoph Hellwig <hch@infradead.org> wrote:

> On Thu, Apr 30, 2015 at 01:41:02PM -0400, Chuck Lever wrote:
>> We discussed this briefly during the Linux NFS town hall meeting.
>> I agree using dynamic slot allocation for TCP is fine, and RPC/RDMA
>> can use simple overprovisioning.
>> 
>> This way the upper layer (NFSv4.1 client) doesn't have to be aware of
>> limitations in the RPC layer mechanism.
>> 
>> Trond may have an additional concern that I didn't capture.
> 
> The other option would be to simply overallocate in the transport layer,
> as that is the layer which causes the problem to start with.

That’s exactly what the RDMA backchannel will do.

> That being said, what is the argument for doing any sort of static
> allocation here?  I'm fine with doing fully dynamic allocation if that
> works out fine, but a mixed static / dynamic allocation sounds like a
> nightmare.

RDMA resources must be allocated and pre-registered up front. The
RDMA transport can’t support dynamic slot allocation without adding
a lot more complexity.

The RDMA transport will need to have separate backchannel setup and
destroy methods anyway. So doing dynamic for TCP and overprovision for
RDMA will be simple.

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com



Christoph Hellwig May 1, 2015, 5:37 p.m. UTC | #7
On Fri, May 01, 2015 at 01:28:12PM -0400, Trond Myklebust wrote:
> The concern is not so much static vs dynamic. The concern is limiting
> incoming RPC calls to the number allowed by the NFSv4.1 session. Right
> now, the static allocation enforces the limit of 1 slot that the
> client offers to the server (albeit with the race) and so I want any
> replacement to meet the same requirement.

Either variant will allow accepting more than one backchannel
request at the RPC layer; that's the whole point.  With the simple
patch I posted it will accept a second one; with dynamic allocation
the number would potentially be unbounded.

But given that the NFS client already enforces the slot limit
properly in validate_seqid, and even returns the proper nfs error
for this case, I don't really see what enforcing it at a lower level,
where we can't even report proper errors, buys us.
Trond Myklebust May 1, 2015, 5:47 p.m. UTC | #8
On Fri, May 1, 2015 at 1:37 PM, Christoph Hellwig <hch@infradead.org> wrote:
> On Fri, May 01, 2015 at 01:28:12PM -0400, Trond Myklebust wrote:
>> The concern is not so much static vs dynamic. The concern is limiting
>> incoming RPC calls to the number allowed by the NFSv4.1 session. Right
>> now, the static allocation enforces the limit of 1 slot that the
>> client offers to the server (albeit with the race) and so I want any
>> replacement to meet the same requirement.
>
> Either variant will allow accepting more than one backchannel
> request at the RPC layer; that's the whole point.  With the simple
> patch I posted it will accept a second one; with dynamic allocation
> the number would potentially be unbounded.

If you do it wrong, yes. The point is that if you do the allocation as
part of the send process, you prevent unbounded growth.

> But given that the NFS client already enforces the slot limit
> properly in validate_seqid, and even returns the proper nfs error
> for this case, I don't really see what enforcing it at a lower level,
> where we can't even report proper errors, buys us.

It buys us better congestion control. We don't even need to process the request.

Cheers
  Trond

Patch

diff --git a/fs/nfs/callback.h b/fs/nfs/callback.h
index 84326e9..7f79345 100644
--- a/fs/nfs/callback.h
+++ b/fs/nfs/callback.h
@@ -200,13 +200,18 @@  extern int nfs4_validate_delegation_stateid(struct nfs_delegation *delegation,
 					    const nfs4_stateid *stateid);
 extern int nfs4_set_callback_sessionid(struct nfs_client *clp);
 #endif /* CONFIG_NFS_V4 */
+
 /*
  * nfs41: Callbacks are expected to not cause substantial latency,
  * so we limit their concurrency to 1 by setting up the maximum number
  * of slots for the backchannel.
+ *
+ * Note that we allocate a second RPC request structure because the
+ * RPC code holds on to the buffer for a while after sending a reply
+ * to the server.
  */
-#define NFS41_BC_MIN_CALLBACKS 1
-#define NFS41_BC_MAX_CALLBACKS 1
+#define NFS41_BC_SLOTS 1
+#define NFS41_BC_REQUESTS (NFS41_BC_SLOTS * 2)
 
 extern unsigned int nfs_callback_set_tcpport;
 extern unsigned short nfs_callback_tcpport;
diff --git a/fs/nfs/callback_proc.c b/fs/nfs/callback_proc.c
index 197806f..d733f04 100644
--- a/fs/nfs/callback_proc.c
+++ b/fs/nfs/callback_proc.c
@@ -320,7 +320,7 @@  validate_seqid(struct nfs4_slot_table *tbl, struct cb_sequenceargs * args)
 	dprintk("%s enter. slotid %u seqid %u\n",
 		__func__, args->csa_slotid, args->csa_sequenceid);
 
-	if (args->csa_slotid >= NFS41_BC_MAX_CALLBACKS)
+	if (args->csa_slotid >= NFS41_BC_SLOTS)
 		return htonl(NFS4ERR_BADSLOT);
 
 	slot = tbl->slots + args->csa_slotid;
@@ -465,8 +465,8 @@  __be32 nfs4_callback_sequence(struct cb_sequenceargs *args,
 	       sizeof(res->csr_sessionid));
 	res->csr_sequenceid = args->csa_sequenceid;
 	res->csr_slotid = args->csa_slotid;
-	res->csr_highestslotid = NFS41_BC_MAX_CALLBACKS - 1;
-	res->csr_target_highestslotid = NFS41_BC_MAX_CALLBACKS - 1;
+	res->csr_highestslotid = NFS41_BC_SLOTS - 1;
+	res->csr_target_highestslotid = NFS41_BC_SLOTS - 1;
 
 out:
 	cps->clp = clp; /* put in nfs4_callback_compound */
diff --git a/fs/nfs/nfs4client.c b/fs/nfs/nfs4client.c
index e42be52..e36717c 100644
--- a/fs/nfs/nfs4client.c
+++ b/fs/nfs/nfs4client.c
@@ -248,7 +248,7 @@  static int nfs4_init_callback(struct nfs_client *clp)
 	xprt = rcu_dereference_raw(clp->cl_rpcclient->cl_xprt);
 
 	if (nfs4_has_session(clp)) {
-		error = xprt_setup_backchannel(xprt, NFS41_BC_MIN_CALLBACKS);
+		error = xprt_setup_backchannel(xprt, NFS41_BC_REQUESTS);
 		if (error < 0)
 			return error;
 	}
diff --git a/fs/nfs/nfs4session.c b/fs/nfs/nfs4session.c
index e23366e..3ad4da4 100644
--- a/fs/nfs/nfs4session.c
+++ b/fs/nfs/nfs4session.c
@@ -500,7 +500,7 @@  void nfs4_destroy_session(struct nfs4_session *session)
 	rcu_read_unlock();
 	dprintk("%s Destroy backchannel for xprt %p\n",
 		__func__, xprt);
-	xprt_destroy_backchannel(xprt, NFS41_BC_MIN_CALLBACKS);
+	xprt_destroy_backchannel(xprt, NFS41_BC_REQUESTS);
 	nfs4_destroy_session_slot_tables(session);
 	kfree(session);
 }