diff mbox

Need help with NFS Server SUNRPC performance issue

Message ID 20131105195810.GC23329@fieldses.org (mailing list archive)
State New, archived
Headers show

Commit Message

J. Bruce Fields Nov. 5, 2013, 7:58 p.m. UTC
On Tue, Nov 05, 2013 at 07:14:50PM +0530, Shyam Kaushik wrote:
> Hi Bruce,
> 
> You are spot on this issue. I did a quicker option of just fixing
> 
> fs/nfsd/nfs4proc.c
> 
> nfsd_procedures4[]
> 
> NFSPROC4_COMPOUND
> instead of
> .pc_xdrressize = NFSD_BUFSIZE/4
> 
> I made it by /8 & I got double the IOPs. I moved it /16 & now I see
> that 30 NFSD threads out of 32 that I have configured are doing the
> nfsd_write() job. So yes this is the exact problematic area.

Yes, that looks like good evidence we're on the right track, thanks very
much for the testing.

> Now for a permanent fixture for this issue, what do you suggest? Is it
> that before processing the compound we adjust svc_reserve()?

I think decode_compound() needs to do some estimate of the maximum total
reply size and call svc_reserve() with that new estimate.

And for the current code I think it really could be as simple as
checking whether the compound includes a READ op.

That's because that's all the current xdr encoding handles.  We need to
fix that: people need to be able to fetch ACLs larger than 4k, and
READDIR would be faster if it could return more than 4k of data at a go.

After we do that, we'll need to know more than just the list of ops,
we'll need to e.g.  know which attributes exactly a GETATTR requested.
And we don't have any automatic way to figure that out so it'll all be a
lot of manual arithmetic.  On the other hand the good news is we only
need a rough upper bound, so this will may be doable.

Beyond that it would also be good to think about whether using
worst-case reply sizes to decide when to accept requests is really
right.

Anyway here's the slightly improved hack--totally untested except to fix
some compile errors.

--b.

--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Shyam Kaushik Nov. 6, 2013, 7:27 a.m. UTC | #1
Hi Bruce,

This hack works great. All 32 of configured NFSD threads end up doing
nfsd_write() which is great & I get higher IOPs/bandwidth from NFS
client side.

What do you think if we vary the signature of
typedef __be32(*nfsd4_dec)(struct nfsd4_compoundargs *argp, void *);

to include an additional return argument of the size estimate. This
way we get size estimate from the decoders (like nfsd4_decode_read
could return this estimate as rd_length + overhead) & in the worst
case if decoder says cant estimate (like a special return code -1) we
dont do svc_reserve() & leave it like it is. So when we run through
the compound we have a sum of size estimate & just do svc_reserve() at
the end of nfsd4_decode_compound() like your hack has.

Does this sound reasonable to you? If not, perhaps can we just use the
hack for now & worry about readdir etc when they support >4K buffer?

--Shyam

On Wed, Nov 6, 2013 at 1:28 AM, J. Bruce Fields <bfields@fieldses.org> wrote:
> On Tue, Nov 05, 2013 at 07:14:50PM +0530, Shyam Kaushik wrote:
>> Hi Bruce,
>>
>> You are spot on this issue. I did a quicker option of just fixing
>>
>> fs/nfsd/nfs4proc.c
>>
>> nfsd_procedures4[]
>>
>> NFSPROC4_COMPOUND
>> instead of
>> .pc_xdrressize = NFSD_BUFSIZE/4
>>
>> I made it by /8 & I got double the IOPs. I moved it /16 & now I see
>> that 30 NFSD threads out of 32 that I have configured are doing the
>> nfsd_write() job. So yes this is the exact problematic area.
>
> Yes, that looks like good evidence we're on the right track, thanks very
> much for the testing.
>
>> Now for a permanent fixture for this issue, what do you suggest? Is it
>> that before processing the compound we adjust svc_reserve()?
>
> I think decode_compound() needs to do some estimate of the maximum total
> reply size and call svc_reserve() with that new estimate.
>
> And for the current code I think it really could be as simple as
> checking whether the compound includes a READ op.
>
> That's because that's all the current xdr encoding handles.  We need to
> fix that: people need to be able to fetch ACLs larger than 4k, and
> READDIR would be faster if it could return more than 4k of data at a go.
>
> After we do that, we'll need to know more than just the list of ops,
> we'll need to e.g.  know which attributes exactly a GETATTR requested.
> And we don't have any automatic way to figure that out so it'll all be a
> lot of manual arithmetic.  On the other hand the good news is we only
> need a rough upper bound, so this will may be doable.
>
> Beyond that it would also be good to think about whether using
> worst-case reply sizes to decide when to accept requests is really
> right.
>
> Anyway here's the slightly improved hack--totally untested except to fix
> some compile errors.
>
> --b.
>
> diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c
> index d9454fe..947f268 100644
> --- a/fs/nfsd/nfs4xdr.c
> +++ b/fs/nfsd/nfs4xdr.c
> @@ -1617,6 +1617,7 @@ nfsd4_decode_compound(struct nfsd4_compoundargs *argp)
>         struct nfsd4_op *op;
>         struct nfsd4_minorversion_ops *ops;
>         bool cachethis = false;
> +       bool foundread = false;
>         int i;
>
>         READ_BUF(4);
> @@ -1667,10 +1668,15 @@ nfsd4_decode_compound(struct nfsd4_compoundargs *argp)
>                  * op in the compound wants to be cached:
>                  */
>                 cachethis |= nfsd4_cache_this_op(op);
> +
> +               foundread |= op->opnum == OP_READ;
>         }
>         /* Sessions make the DRC unnecessary: */
>         if (argp->minorversion)
>                 cachethis = false;
> +       if (!foundread)
> +               /* XXX: use tighter estimates, and svc_reserve_auth: */
> +               svc_reserve(argp->rqstp, PAGE_SIZE);
>         argp->rqstp->rq_cachetype = cachethis ? RC_REPLBUFF : RC_NOCACHE;
>
>         DECODE_TAIL;
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
J. Bruce Fields Nov. 13, 2013, 4:24 p.m. UTC | #2
On Wed, Nov 06, 2013 at 12:57:38PM +0530, Shyam Kaushik wrote:
> Hi Bruce,
> 
> This hack works great. All 32 of configured NFSD threads end up doing
> nfsd_write() which is great & I get higher IOPs/bandwidth from NFS
> client side.
> 
> What do you think if we vary the signature of
> typedef __be32(*nfsd4_dec)(struct nfsd4_compoundargs *argp, void *);
> 
> to include an additional return argument of the size estimate. This
> way we get size estimate from the decoders (like nfsd4_decode_read
> could return this estimate as rd_length + overhead) & in the worst
> case if decoder says cant estimate (like a special return code -1) we
> dont do svc_reserve() & leave it like it is. So when we run through
> the compound we have a sum of size estimate & just do svc_reserve() at
> the end of nfsd4_decode_compound() like your hack has.
> 
> Does this sound reasonable to you? If not, perhaps can we just use the
> hack for now & worry about readdir etc when they support >4K buffer?

Yep.  Actually looking at it again I think it needs a couple more
special cases (for readlink, readdir), but that should be good enough
for now.

For the future.... I'd rather not add an extra argument to every encoder
but maybe that is the simplest thing to do.

--b.
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c
index d9454fe..947f268 100644
--- a/fs/nfsd/nfs4xdr.c
+++ b/fs/nfsd/nfs4xdr.c
@@ -1617,6 +1617,7 @@  nfsd4_decode_compound(struct nfsd4_compoundargs *argp)
 	struct nfsd4_op *op;
 	struct nfsd4_minorversion_ops *ops;
 	bool cachethis = false;
+	bool foundread = false;
 	int i;
 
 	READ_BUF(4);
@@ -1667,10 +1668,15 @@  nfsd4_decode_compound(struct nfsd4_compoundargs *argp)
 		 * op in the compound wants to be cached:
 		 */
 		cachethis |= nfsd4_cache_this_op(op);
+
+		foundread |= op->opnum == OP_READ;
 	}
 	/* Sessions make the DRC unnecessary: */
 	if (argp->minorversion)
 		cachethis = false;
+	if (!foundread)
+		/* XXX: use tighter estimates, and svc_reserve_auth: */
+		svc_reserve(argp->rqstp, PAGE_SIZE);
 	argp->rqstp->rq_cachetype = cachethis ? RC_REPLBUFF : RC_NOCACHE;
 
 	DECODE_TAIL;