Message ID | 20150428202157.GA23972@infradead.org (mailing list archive) |
---|---|
State | New, archived |
On 4/29/2015 4:21 AM, Christoph Hellwig wrote:
> Currently the client will just crap out if a CB_NULL comes in at the
> same time as a slot-controlled CB_COMPOUND that includes a CB_SEQUENCE.
>
> I don't really understand how the spec wants to treat CB_NULL in
> relation to the available backchannel slots, but given that it's
> not part of the sequences in CB_SEQUENCE it somehow needs to bypass them.
>
> If we make sure to overallocate the rpc-level buffers so that we
> have more than the available NFS-level slots, we should be safe from
> this condition, which makes a 4.1 server doing heavy recalls under
> load very unhappy by not returning an NFS-level reply to its layout
> recalls.
>
> I don't really like this patch much, so any ideas for a better solution
> would be highly welcome!
>
> diff --git a/fs/nfs/callback.h b/fs/nfs/callback.h
> index 84326e9..7afb3ef 100644
> --- a/fs/nfs/callback.h
> +++ b/fs/nfs/callback.h
> @@ -205,7 +205,7 @@ extern int nfs4_set_callback_sessionid(struct nfs_client *clp);
>   * so we limit their concurrency to 1 by setting up the maximum number
>   * of slots for the backchannel.
>   */
> -#define NFS41_BC_MIN_CALLBACKS 1
> +#define NFS41_BC_MIN_CALLBACKS 2
>  #define NFS41_BC_MAX_CALLBACKS 1

Are you sure that's okay without updating NFS41_BC_MAX_CALLBACKS?

thanks,
Kinglong Mee
On Wed, Apr 29, 2015 at 08:08:29PM +0800, Kinglong Mee wrote:
> > diff --git a/fs/nfs/callback.h b/fs/nfs/callback.h
> > index 84326e9..7afb3ef 100644
> > --- a/fs/nfs/callback.h
> > +++ b/fs/nfs/callback.h
> > @@ -205,7 +205,7 @@ extern int nfs4_set_callback_sessionid(struct nfs_client *clp);
> >   * so we limit their concurrency to 1 by setting up the maximum number
> >   * of slots for the backchannel.
> >   */
> > -#define NFS41_BC_MIN_CALLBACKS 1
> > +#define NFS41_BC_MIN_CALLBACKS 2
> >  #define NFS41_BC_MAX_CALLBACKS 1
>
> Are you sure that's okay without updating NFS41_BC_MAX_CALLBACKS?

Yes. NFS41_BC_MIN_CALLBACKS is the number of slots we tell the sunrpc
layer to allocate, while NFS41_BC_MAX_CALLBACKS is used for nfs-internal
accounting, so having them differ is intentional for this RFC.

If we have to merge this instead of a better fix, we definitely should
fix up the naming, though.
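[The distinction can be made concrete with a minimal user-space model.
This is only an illustration of the two limits as described above, not
the actual sunrpc/nfs code; all names and the pool/slot mechanics are
hypothetical. The rpc layer preallocates receive buffers, the nfs layer
enforces CB_SEQUENCE slots, and a CB_NULL consumes a buffer but never a
slot, which is why the two limits can legitimately differ.]

/*
 * Toy model of the two backchannel limits; illustrative only,
 * not the kernel code.  Build with: cc -o bc bc.c
 */
#include <stdbool.h>
#include <stdio.h>

#define BC_MIN_CALLBACKS 2	/* rpc layer: preallocated receive buffers */
#define BC_MAX_CALLBACKS 1	/* nfs layer: CB_SEQUENCE session slots */

static int rpc_buffers_free = BC_MIN_CALLBACKS;
static int nfs_slots_free = BC_MAX_CALLBACKS;

/* Every callback RPC, CB_NULL included, consumes an rpc-level buffer. */
static bool rpc_recv(const char *proc)
{
	if (rpc_buffers_free == 0) {
		printf("%s: callback slot table overflowed\n", proc);
		return false;
	}
	rpc_buffers_free--;
	printf("%s: accepted\n", proc);
	return true;
}

/* Only slot-controlled CB_COMPOUNDs also consume an nfs-level slot. */
static bool cb_compound(void)
{
	if (!rpc_recv("CB_COMPOUND") || nfs_slots_free == 0)
		return false;
	nfs_slots_free--;
	return true;
}

int main(void)
{
	/*
	 * A slot-controlled CB_COMPOUND in flight plus a concurrent
	 * CB_NULL: with only one rpc buffer the CB_NULL would overflow;
	 * with two it fits.
	 */
	cb_compound();
	rpc_recv("CB_NULL");
	return 0;
}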
On Apr 28, 2015, at 4:21 PM, Christoph Hellwig <hch@infradead.org> wrote:

> Currently the client will just crap out if a CB_NULL comes in at the
> same time as a slot-controlled CB_COMPOUND that includes a CB_SEQUENCE.

Under what circumstances does the server send a CB_NULL while a
CB_COMPOUND is in flight?

> I don't really understand how the spec wants to treat CB_NULL in
> relation to the available backchannel slots, but given that it's
> not part of the sequences in CB_SEQUENCE it somehow needs to bypass them.
>
> If we make sure to overallocate the rpc-level buffers so that we
> have more than the available NFS-level slots, we should be safe from
> this condition, which makes a 4.1 server doing heavy recalls under
> load very unhappy by not returning an NFS-level reply to its layout
> recalls.
>
> I don't really like this patch much, so any ideas for a better solution
> would be highly welcome!
>
> diff --git a/fs/nfs/callback.h b/fs/nfs/callback.h
> index 84326e9..7afb3ef 100644
> --- a/fs/nfs/callback.h
> +++ b/fs/nfs/callback.h
> @@ -205,7 +205,7 @@ extern int nfs4_set_callback_sessionid(struct nfs_client *clp);
>   * so we limit their concurrency to 1 by setting up the maximum number
>   * of slots for the backchannel.
>   */
> -#define NFS41_BC_MIN_CALLBACKS 1
> +#define NFS41_BC_MIN_CALLBACKS 2
>  #define NFS41_BC_MAX_CALLBACKS 1
>
>  extern unsigned int nfs_callback_set_tcpport;

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com
On Wed, Apr 29, 2015 at 10:55 AM, Chuck Lever <chuck.lever@oracle.com> wrote:
>
> On Apr 28, 2015, at 4:21 PM, Christoph Hellwig <hch@infradead.org> wrote:
>
>> Currently the client will just crap out if a CB_NULL comes in at the
>> same time as a slot-controlled CB_COMPOUND that includes a CB_SEQUENCE.
>
> Under what circumstances does the server send a CB_NULL while a
> CB_COMPOUND is in flight?

I agree with Chuck. Why does knfsd send a CB_NULL at all if the
intention is to send CB_COMPOUND?

Cheers
Trond
On Wed, Apr 29, 2015 at 10:55:10AM -0400, Chuck Lever wrote:
>
> On Apr 28, 2015, at 4:21 PM, Christoph Hellwig <hch@infradead.org> wrote:
>
> > Currently the client will just crap out if a CB_NULL comes in at the
> > same time as a slot-controlled CB_COMPOUND that includes a CB_SEQUENCE.
>
> Under what circumstances does the server send a CB_NULL while a
> CB_COMPOUND is in flight?

When a client is under heavy load from fsx or aio-stress and we lose
the connection (nfsd4_conn_lost is called) while doing heavy recalls.

xfstests against a server offering pnfs layouts for which the client
can't reach the storage devices is an easy reproducer.
On Wed, Apr 29, 2015 at 11:14 AM, Christoph Hellwig <hch@infradead.org> wrote:
> On Wed, Apr 29, 2015 at 10:55:10AM -0400, Chuck Lever wrote:
>>
>> On Apr 28, 2015, at 4:21 PM, Christoph Hellwig <hch@infradead.org> wrote:
>>
>> > Currently the client will just crap out if a CB_NULL comes in at the
>> > same time as a slot-controlled CB_COMPOUND that includes a CB_SEQUENCE.
>>
>> Under what circumstances does the server send a CB_NULL while a
>> CB_COMPOUND is in flight?
>
> When a client is under heavy load from fsx or aio-stress and we lose
> the connection (nfsd4_conn_lost is called) while doing heavy recalls.
>
> xfstests against a server offering pnfs layouts for which the client
> can't reach the storage devices is an easy reproducer.

Why does it need to do this? If the client has sent the
BIND_CONN_TO_SESSION (which I believe knfsd asks for), then the
server knows that this is a bi-directional connection.

The difference between NFSv4 and NFSv4.1 is that the CB_NULL should
almost always be redundant, because the client initiates the
connection and explicitly tells the server whether or not it is to
be used for the callback channel.

The CB_NULL should always be redundant.

Trond
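[Trond's argument amounts to a simple predicate on the server side. The
following is only a rough sketch of that reasoning — the identifiers
are hypothetical and this is not knfsd's actual callback state machine:
the probe is only ever justified when nothing has confirmed that the
connection carries the backchannel.]

/*
 * Sketch of the server-side decision described above; hypothetical
 * names, not knfsd's real code.
 */
#include <stdbool.h>

enum cb_state { CB_UNKNOWN, CB_UP, CB_DOWN };

struct cb_conn {
	enum cb_state state;
	bool bound_by_client;	/* client sent BIND_CONN_TO_SESSION, or
				 * flagged the connection for the
				 * backchannel at session creation */
};

static bool need_cb_null_probe(const struct cb_conn *conn)
{
	/*
	 * NFSv4.0: the server dials back to the client, so a CB_NULL
	 * ping is the only way to learn whether the path works.
	 *
	 * NFSv4.1: the client created the connection and told the
	 * server explicitly that it carries the backchannel; probing
	 * with CB_NULL is redundant and can race with slot-controlled
	 * CB_COMPOUNDs.
	 */
	if (conn->bound_by_client)
		return false;
	return conn->state == CB_UNKNOWN;
}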
On Wed, Apr 29, 2015 at 11:24:02AM -0400, Trond Myklebust wrote:
> Why does it need to do this? If the client has sent the
> BIND_CONN_TO_SESSION (which I believe knfsd asks for), then the
> server knows that this is a bi-directional connection.
> The difference between NFSv4 and NFSv4.1 is that the CB_NULL should
> almost always be redundant, because the client initiates the
> connection and explicitly tells the server whether or not it is to
> be used for the callback channel.
>
> The CB_NULL should always be redundant.

I'd be fine with suppressing it. I think I actually intended to but
screwed it up. (Chuck or somebody convinced me the
NFSD4_CB_UP/UNKNOWN/DOWN logic is totally broken, but I never got
around to fixing it.)

--b.
On Wed, Apr 29, 2015 at 01:34:54PM -0400, J. Bruce Fields wrote:
> > Why does it need to do this? If the client has sent the
> > BIND_CONN_TO_SESSION (which I believe knfsd asks for), then the
> > server knows that this is a bi-directional connection.
> > The difference between NFSv4 and NFSv4.1 is that the CB_NULL should
> > almost always be redundant, because the client initiates the
> > connection and explicitly tells the server whether or not it is to
> > be used for the callback channel.
> >
> > The CB_NULL should always be redundant.
>
> I'd be fine with suppressing it. I think I actually intended to but
> screwed it up. (Chuck or somebody convinced me the
> NFSD4_CB_UP/UNKNOWN/DOWN logic is totally broken, but I never got
> around to fixing it.)

I've dived into removing CB_NULL and fixed various major breakage in
the nfsd callback path, for which I will send you an RFC series ASAP.

However, even with that I see the "Callback slot table overflowed"
message from the client under load. I think the problem is the
following:

Between sending the callback response in call_bc_transmit ->
xprt_transmit and actually releasing the request from rpc_exit_task ->
xprt_release -> xprt_free_bc_request there is a race window, and
between an overloaded client and a fast connection we can hit it
easily.

My patch to increase the number of buffers for the backchannel ensures
this doesn't happen in my setup, but of course I could envision a
theoretical setup where the client is so slow that multiple
already-processed requests might not have been returned yet.
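[The window can be pictured with a toy model. This is a sketch under
the assumptions stated above — the buffer count is illustrative and
only the call-chain names in the comments come from the thread; it is
not the sunrpc code. The NFS-level slot is effectively free the moment
the reply is transmitted, but the rpc-level buffer only returns to the
pool when the request is released, so anything the server sends in
between needs a spare buffer.]

/*
 * Toy timeline of the race described above; not the sunrpc code.
 */
#include <stdio.h>

#define BC_PREALLOC_BUFFERS 2	/* the overallocation from the patch */

static int buffers_free = BC_PREALLOC_BUFFERS;

static void recv_callback(const char *what)
{
	if (buffers_free == 0) {
		printf("%s: callback slot table overflowed\n", what);
		return;
	}
	buffers_free--;
	printf("%s: ok, %d buffer(s) left\n", what, buffers_free);
}

int main(void)
{
	recv_callback("CB_COMPOUND #1");

	/*
	 * call_bc_transmit -> xprt_transmit: the reply to #1 is on the
	 * wire, so the server's NFS-level slot is free again and it may
	 * immediately send the next callback ...
	 */
	recv_callback("CB_COMPOUND #2");

	/*
	 * ... but only now does rpc_exit_task -> xprt_release ->
	 * xprt_free_bc_request return #1's buffer to the pool.  With a
	 * single preallocated buffer, #2 above would have overflowed.
	 */
	buffers_free++;
	return 0;
}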
On Apr 30, 2015, at 2:25 AM, Christoph Hellwig <hch@infradead.org> wrote:

> On Wed, Apr 29, 2015 at 01:34:54PM -0400, J. Bruce Fields wrote:
>>> Why does it need to do this? If the client has sent the
>>> BIND_CONN_TO_SESSION (which I believe knfsd asks for), then the
>>> server knows that this is a bi-directional connection.
>>> The difference between NFSv4 and NFSv4.1 is that the CB_NULL should
>>> almost always be redundant, because the client initiates the
>>> connection and explicitly tells the server whether or not it is to
>>> be used for the callback channel.
>>>
>>> The CB_NULL should always be redundant.
>>
>> I'd be fine with suppressing it. I think I actually intended to but
>> screwed it up. (Chuck or somebody convinced me the
>> NFSD4_CB_UP/UNKNOWN/DOWN logic is totally broken, but I never got
>> around to fixing it.)
>
> I've dived into removing CB_NULL and fixed various major breakage in
> the nfsd callback path, for which I will send you an RFC series ASAP.
>
> However, even with that I see the "Callback slot table overflowed"
> message from the client under load. I think the problem is the
> following:
>
> Between sending the callback response in call_bc_transmit ->
> xprt_transmit and actually releasing the request from rpc_exit_task ->
> xprt_release -> xprt_free_bc_request there is a race window, and
> between an overloaded client and a fast connection we can hit it
> easily.
>
> My patch to increase the number of buffers for the backchannel ensures
> this doesn't happen in my setup, but of course I could envision a
> theoretical setup where the client is so slow that multiple
> already-processed requests might not have been returned yet.

I've been discussing the possibility of adding more session slots on
the Linux NFS client with jlayton. We think it would be
straightforward, once the workqueue-based NFSD patches are in, to make
the backchannel service into a workqueue. Then it would be a simple
matter to increase the number of session slots.

We haven't discussed what would be needed on the server side of this
equation, but it sounds like the server has some deeper problems if it
is not obeying the session slot table limits advertised by the client.

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com
diff --git a/fs/nfs/callback.h b/fs/nfs/callback.h
index 84326e9..7afb3ef 100644
--- a/fs/nfs/callback.h
+++ b/fs/nfs/callback.h
@@ -205,7 +205,7 @@ extern int nfs4_set_callback_sessionid(struct nfs_client *clp);
  * so we limit their concurrency to 1 by setting up the maximum number
  * of slots for the backchannel.
  */
-#define NFS41_BC_MIN_CALLBACKS 1
+#define NFS41_BC_MIN_CALLBACKS 2
 #define NFS41_BC_MAX_CALLBACKS 1
 
 extern unsigned int nfs_callback_set_tcpport;