Message ID | 20230324104720.2498-1-fw@strlen.de (mailing list archive)
---|---
State | New, archived
Series | [1/1] nbd/server: push pending frames after sending reply

Am 24.03.2023 um 11:47 hat Florian Westphal geschrieben:
> qemu-nbd doesn't set TCP_NODELAY on the tcp socket.
>
> Kernel waits for more data and avoids transmission of small packets.
> Without TLS this is barely noticeable, but with TLS this really shows.
>
> Booting a VM via qemu-nbd on localhost (with tls) takes more than
> 2 minutes on my system. tcpdump shows frequent wait periods, where no
> packets get sent for a 40ms period.
>
> Add explicit (un)corking when processing (and responding to) requests.
> "TCP_CORK, &zero" after earlier "CORK, &one" will flush pending data.
>
> VM Boot time:
> main:    no tls: 23s, with tls: 2m45s
> patched: no tls: 14s, with tls: 15s
>
> VM Boot time, qemu-nbd via network (same lan):
> main:    no tls: 18s, with tls: 1m50s
> patched: no tls: 17s, with tls: 18s
>
> Future optimization: if we could detect if there is another pending
> request we could defer the uncork operation because more data would be
> appended.
>
> Signed-off-by: Florian Westphal <fw@strlen.de>
> ---
>  nbd/server.c | 3 +++
>  1 file changed, 3 insertions(+)
>
> diff --git a/nbd/server.c b/nbd/server.c
> index a4750e41880a..848836d41405 100644
> --- a/nbd/server.c
> +++ b/nbd/server.c
> @@ -2667,6 +2667,8 @@ static coroutine_fn void nbd_trip(void *opaque)
>          goto disconnect;
>      }
>
> +    qio_channel_set_cork(client->ioc, true);
> +
>      if (ret < 0) {
>          /* It wasn't -EIO, so, according to nbd_co_receive_request()
>           * semantics, we should return the error to the client. */
> @@ -2692,6 +2694,7 @@ static coroutine_fn void nbd_trip(void *opaque)
>          goto disconnect;
>      }
>
> +    qio_channel_set_cork(client->ioc, false);
>  done:
>      nbd_request_put(req);
>      nbd_client_put(client);

In the error paths, we never call set_cork(false) again. I suppose the
reason that this is okay is because the next thing is actually that we
close the socket?

Kevin

Kevin Wolf <kwolf@redhat.com> wrote:
> Am 24.03.2023 um 11:47 hat Florian Westphal geschrieben:
> > +    qio_channel_set_cork(client->ioc, true);
> > +
> >      if (ret < 0) {
> >          /* It wasn't -EIO, so, according to nbd_co_receive_request()
> >           * semantics, we should return the error to the client. */
> > @@ -2692,6 +2694,7 @@ static coroutine_fn void nbd_trip(void *opaque)
> >          goto disconnect;
> >      }
> >
> > +    qio_channel_set_cork(client->ioc, false);
> >  done:
> >      nbd_request_put(req);
> >      nbd_client_put(client);
>
> In the error paths, we never call set_cork(false) again. I suppose the
> reason that this is okay is because the next thing is actually that we
> close the socket?

Yes, no need to uncork before close.

On Fri, Mar 24, 2023 at 11:47:20AM +0100, Florian Westphal wrote:
> qemu-nbd doesn't set TCP_NODELAY on the tcp socket.
>
> Kernel waits for more data and avoids transmission of small packets.
> Without TLS this is barely noticeable, but with TLS this really shows.
>
> Booting a VM via qemu-nbd on localhost (with tls) takes more than
> 2 minutes on my system. tcpdump shows frequent wait periods, where no
> packets get sent for a 40ms period.

Thank you for this analysis.

>
> Add explicit (un)corking when processing (and responding to) requests.
> "TCP_CORK, &zero" after earlier "CORK, &one" will flush pending data.
>
> VM Boot time:
> main:    no tls: 23s, with tls: 2m45s
> patched: no tls: 14s, with tls: 15s
>
> VM Boot time, qemu-nbd via network (same lan):
> main:    no tls: 18s, with tls: 1m50s
> patched: no tls: 17s, with tls: 18s

And the timings bear proof that it matters.

>
> Future optimization: if we could detect if there is another pending
> request we could defer the uncork operation because more data would be
> appended.

nbdkit and libnbd do this with the MSG_MORE flag (plaintext) and TLS
corking (tls); when building up a message to the other side, a flag is
set any time we know we are likely to send more data very shortly.
nbdkit wraps it under a flag SEND_MORE, which applies to both plaintext:
https://gitlab.com/nbdkit/nbdkit/-/blob/master/server/connections.c#L415
and to TLS connections:
https://gitlab.com/nbdkit/nbdkit/-/blob/master/server/crypto.c#L396
while libnbd uses MSG_MORE a bit more directly for the same purpose for
plaintext, but isn't (yet) doing TLS corking:
https://gitlab.com/nbdkit/libnbd/-/blob/master/generator/states-issue-command.c#L53
https://gitlab.com/nbdkit/libnbd/-/blob/master/lib/internal.h#L57

I would love to see a future patch to qio_channel code to support
MSG_MORE in the same way as nbdkit is using its SEND_MORE flag, as the
caller often has more info on whether it is sending a short prefix or
is done with a conceptual message and ready to uncork, and where the
use of a flag can be more efficient than separate passes through
cork/uncork calls.  But even your initial work at properly corking is
a good step in the right direction.

And surprisingly, qemu IS using corking on the client side:
https://gitlab.com/qemu-project/qemu/-/blob/master/block/nbd.c#L525
just not on the server side, before your patch.

>
> Signed-off-by: Florian Westphal <fw@strlen.de>
> ---
>  nbd/server.c | 3 +++
>  1 file changed, 3 insertions(+)
>
> diff --git a/nbd/server.c b/nbd/server.c
> index a4750e41880a..848836d41405 100644
> --- a/nbd/server.c
> +++ b/nbd/server.c
> @@ -2667,6 +2667,8 @@ static coroutine_fn void nbd_trip(void *opaque)
>          goto disconnect;
>      }
>
> +    qio_channel_set_cork(client->ioc, true);
> +
>      if (ret < 0) {
>          /* It wasn't -EIO, so, according to nbd_co_receive_request()
>           * semantics, we should return the error to the client. */
> @@ -2692,6 +2694,7 @@ static coroutine_fn void nbd_trip(void *opaque)
>          goto disconnect;
>      }
>
> +    qio_channel_set_cork(client->ioc, false);

Reviewed-by: Eric Blake <eblake@redhat.com>

>  done:
>      nbd_request_put(req);
>      nbd_client_put(client);
> --
> 2.39.2
>

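For readers unfamiliar with the technique Eric describes: the idea is to tell the kernel, per send, whether more data is about to follow, so a reply header and its payload can share a TCP segment even with TCP_NODELAY enabled. Below is a minimal, hypothetical sketch of that pattern in plain C; it is not nbdkit's SEND_MORE code or QEMU's qio_channel API, and `send_hint` is an invented helper name.

```c
/* Hypothetical sketch of the MSG_MORE hint pattern (Linux-specific);
 * not nbdkit or QEMU code. */
#include <stdbool.h>
#include <sys/socket.h>
#include <sys/types.h>

static ssize_t send_hint(int fd, const void *buf, size_t len, bool more)
{
    /* MSG_NOSIGNAL: don't raise SIGPIPE if the peer already closed.
     * MSG_MORE: ask the kernel to hold this data until the next send. */
    int flags = MSG_NOSIGNAL | (more ? MSG_MORE : 0);
    return send(fd, buf, len, flags);
}
```

A caller would send the reply header with `more = true` and the final payload chunk with `more = false`, so the last send flushes everything that was held back, without an explicit cork/uncork pair.
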
On Fri, Mar 24, 2023 at 07:20:17PM +0100, Florian Westphal wrote:
> Kevin Wolf <kwolf@redhat.com> wrote:
> > Am 24.03.2023 um 11:47 hat Florian Westphal geschrieben:
> > > +    qio_channel_set_cork(client->ioc, true);
> > > +
> > >      if (ret < 0) {
> > >          /* It wasn't -EIO, so, according to nbd_co_receive_request()
> > >           * semantics, we should return the error to the client. */
> > > @@ -2692,6 +2694,7 @@ static coroutine_fn void nbd_trip(void *opaque)
> > >          goto disconnect;
> > >      }
> > >
> > > +    qio_channel_set_cork(client->ioc, false);
> > >  done:
> > >      nbd_request_put(req);
> > >      nbd_client_put(client);
> >
> > In the error paths, we never call set_cork(false) again. I suppose the
> > reason that this is okay is because the next thing is actually that we
> > close the socket?
>
> Yes, no need to uncork before close.

Uncorking may let the close be cleaner (especially in TLS, it is nicer
to let the connection send the necessary handshaking that the
connection is intentionally going down, rather than being abruptly
ripped out by a man-in-the-middle attacker), but NBD already has to
gracefully handle abrupt connection teardown, so we aren't losing
anything if the cork delays the close or if corked data that would
have given a clean shutdown is lost.  If we teach qio_channel to honor
MSG_MORE as a way to do auto-corking without explicit cork/uncork
calls, that may take care of itself.  But that's a bigger project,
while this patch seems safe enough as-is.

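As a point of reference for the "necessary handshaking" mentioned above: a clean TLS shutdown means sending a close_notify alert before closing the socket. A rough GnuTLS-level sketch follows, for illustration only; QEMU's actual teardown goes through its qio_channel wrappers rather than direct GnuTLS calls, and `tls_close_clean` is a made-up name.

```c
/* Illustrative only: a "clean" TLS teardown at the GnuTLS level.
 * Skipping the alert is the "abrupt teardown" case discussed above,
 * which NBD peers must tolerate anyway. */
#include <gnutls/gnutls.h>
#include <unistd.h>

static void tls_close_clean(gnutls_session_t session, int fd)
{
    /* close_notify lets the peer distinguish an intentional shutdown
     * from a connection truncated by an attacker or a crash. */
    gnutls_bye(session, GNUTLS_SHUT_WR);
    close(fd);
}
```
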
On Fri, Mar 24, 2023 at 02:41:20PM -0500, Eric Blake wrote:
> On Fri, Mar 24, 2023 at 11:47:20AM +0100, Florian Westphal wrote:
> > qemu-nbd doesn't set TCP_NODELAY on the tcp socket.

Replying to myself, WHY aren't we setting TCP_NODELAY on the socket?

>
> And surprisingly, qemu IS using corking on the client side:
> https://gitlab.com/qemu-project/qemu/-/blob/master/block/nbd.c#L525
> just not on the server side, before your patch.

Corking matters more when TCP_NODELAY is enabled.  The entire reason
Nagle's algorithm exists (and is default on unless you enable
TCP_NODELAY) is that merging smaller piecemeal packets into larger
traffic is a lot easier to do in a common location for code that isn't
super-sensitive to latency and message boundaries.  But once you get
to the point where corking or MSG_MORE makes a difference, then it is
obvious you know your message boundaries, and will benefit from
TCP_NODELAY, at the expense of potentially more network traffic
overhead.

One more code search, and I find that we use TCP_NODELAY in all of:

qemu client:
https://gitlab.com/qemu-project/qemu/-/blob/master/nbd/client-connection.c#L143

nbdkit:
https://gitlab.com/nbdkit/nbdkit/-/blob/master/server/sockets.c#L430

libnbd:
https://gitlab.com/nbdkit/libnbd/-/blob/master/generator/states-connect.c#L41

so I think we _should_ be calling qio_channel_set_delay(false) for
qemu-nbd as well.  That doesn't negate your patch, but rather argues
that we can go for even better performance with TCP_NODELAY also
turned on.

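For illustration, disabling Nagle on a TCP socket is a one-line setsockopt; qio_channel_set_delay(ioc, false) is QEMU's portable wrapper for roughly this effect. The sketch below is not QEMU's implementation, and `disable_nagle` is a hypothetical helper.

```c
/* Sketch: turn off Nagle's algorithm so small writes are transmitted
 * immediately instead of waiting for more data to coalesce. */
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

static int disable_nagle(int fd)
{
    int on = 1;
    /* Returns 0 on success, -1 with errno set on failure. */
    return setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &on, sizeof(on));
}
```
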
Eric Blake <eblake@redhat.com> wrote:
> On Fri, Mar 24, 2023 at 02:41:20PM -0500, Eric Blake wrote:
> > On Fri, Mar 24, 2023 at 11:47:20AM +0100, Florian Westphal wrote:
> > > qemu-nbd doesn't set TCP_NODELAY on the tcp socket.
>
> Replying to myself, WHY aren't we setting TCP_NODELAY on the socket?

If the application protocol is strictly or mostly request/response
then I agree that qemu-nbd should turn nagle off as well.

Am 24.03.2023 um 11:47 hat Florian Westphal geschrieben:
> qemu-nbd doesn't set TCP_NODELAY on the tcp socket.
>
> Kernel waits for more data and avoids transmission of small packets.
> Without TLS this is barely noticeable, but with TLS this really shows.
>
> Booting a VM via qemu-nbd on localhost (with tls) takes more than
> 2 minutes on my system. tcpdump shows frequent wait periods, where no
> packets get sent for a 40ms period.
>
> Add explicit (un)corking when processing (and responding to) requests.
> "TCP_CORK, &zero" after earlier "CORK, &one" will flush pending data.
>
> VM Boot time:
> main:    no tls: 23s, with tls: 2m45s
> patched: no tls: 14s, with tls: 15s
>
> VM Boot time, qemu-nbd via network (same lan):
> main:    no tls: 18s, with tls: 1m50s
> patched: no tls: 17s, with tls: 18s
>
> Future optimization: if we could detect if there is another pending
> request we could defer the uncork operation because more data would be
> appended.
>
> Signed-off-by: Florian Westphal <fw@strlen.de>

Thanks, applied to the block branch.

Kevin

diff --git a/nbd/server.c b/nbd/server.c
index a4750e41880a..848836d41405 100644
--- a/nbd/server.c
+++ b/nbd/server.c
@@ -2667,6 +2667,8 @@ static coroutine_fn void nbd_trip(void *opaque)
         goto disconnect;
     }
 
+    qio_channel_set_cork(client->ioc, true);
+
     if (ret < 0) {
         /* It wasn't -EIO, so, according to nbd_co_receive_request()
          * semantics, we should return the error to the client. */
@@ -2692,6 +2694,7 @@ static coroutine_fn void nbd_trip(void *opaque)
         goto disconnect;
     }
 
+    qio_channel_set_cork(client->ioc, false);
 done:
     nbd_request_put(req);
     nbd_client_put(client);

qemu-nbd doesn't set TCP_NODELAY on the tcp socket.

Kernel waits for more data and avoids transmission of small packets.
Without TLS this is barely noticeable, but with TLS this really shows.

Booting a VM via qemu-nbd on localhost (with tls) takes more than
2 minutes on my system. tcpdump shows frequent wait periods, where no
packets get sent for a 40ms period.

Add explicit (un)corking when processing (and responding to) requests.
"TCP_CORK, &zero" after earlier "CORK, &one" will flush pending data.

VM Boot time:
main:    no tls: 23s, with tls: 2m45s
patched: no tls: 14s, with tls: 15s

VM Boot time, qemu-nbd via network (same lan):
main:    no tls: 18s, with tls: 1m50s
patched: no tls: 17s, with tls: 18s

Future optimization: if we could detect if there is another pending
request we could defer the uncork operation because more data would be
appended.

Signed-off-by: Florian Westphal <fw@strlen.de>
---
 nbd/server.c | 3 +++
 1 file changed, 3 insertions(+)
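The commit message's `"TCP_CORK, &zero" after earlier "CORK, &one"` wording refers to the Linux socket-level pattern sketched below. The patch itself uses qio_channel_set_cork(), QEMU's portable wrapper; the raw setsockopt calls and the `tcp_set_cork` helper name here are for illustration only.

```c
/* Sketch of the Linux TCP_CORK pattern behind qio_channel_set_cork().
 * While corked, the kernel holds back partial frames; setting the
 * option back to 0 (or a roughly 200 ms timeout) flushes whatever is
 * pending. */
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

static void tcp_set_cork(int fd, int corked)
{
    /* corked = 1 before assembling a reply, 0 once it is complete. */
    setsockopt(fd, IPPROTO_TCP, TCP_CORK, &corked, sizeof(corked));
}
```
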