[v6,00/11] remove msize limit in virtio transport

Message ID cover.1657920926.git.linux_oss@crudebyte.com (mailing list archive)


Christian Schoenebeck July 15, 2022, 9:35 p.m. UTC
This series aims to get rid of the current 500k 'msize' limitation in
the 9p virtio transport, which is currently a performance bottleneck
for 9p mounts.

To avoid confusion: this series does remove the msize limit for the virtio
transport; at the 9p client level, however, the anticipated milestone for
this series is now a max. 'msize' of 4 MB. See patch 5 for the reason why.

This is a follow-up of the following series and discussion:
https://lore.kernel.org/all/cover.1657636554.git.linux_oss@crudebyte.com/

Latest version of this series:
https://github.com/cschoenebeck/linux/commits/9p-virtio-drop-msize-cap


OVERVIEW OF PATCHES:

* Patches 1..6 remove the msize limitation from the 'virtio' transport
  (i.e. the 9p 'virtio' transport itself actually supports >4MB now, tested
  successfully with an experimental QEMU version and some dirty 9p Linux
  client hacks up to msize=128MB).

* Patches 7..11 tremendously reduce unnecessarily huge 9p message sizes and
  therefore provide a performance gain as well. So far, almost all 9p
  messages simply allocated message buffers exactly msize large, even for
  messages that actually needed just a few bytes. So these patches make
  sense by themselves, independent of this overall series; for this series
  they matter even more, because the larger msize gets, the more this issue
  would otherwise have hurt.


PREREQUISITES:

If you are testing with QEMU then please either use QEMU 6.2 or higher, or
at least apply the following patch on QEMU side:

  https://lore.kernel.org/qemu-devel/E1mT2Js-0000DW-OH@lizzy.crudebyte.com/

That QEMU patch is required if you are using a user space app that
automatically retrieves an optimum I/O block size by obeying stat's
st_blksize, which 'cat', for instance, does, e.g.:

	time cat test_rnd.dat > /dev/null

Otherwise, please use a user space app for performance testing that allows
you to force a large block size, like 'dd' for instance; that avoids the
QEMU issue, so you don't need to patch QEMU in that case, e.g.:

	time dd if=test_rnd.dat of=/dev/null bs=1M


STILL TO DO:

  1. Negotiating virtio "Queue Indirect Size" (recommended):

    This series assumes a maximum of 1024 virtio indirect descriptors
    to be supported by the 9p server, which is true for QEMU, but probably
    not for other 9p servers, and should therefore be addressed by
    negotiating the maximum length of virtio indirect descriptor tables on
    virtio device initialization instead. That would not only avoid issues
    with other 9p servers, but would also allow an msize of >4MB in the
    future. Before that change can be done on the Linux and QEMU sides
    though, it first requires a change to the virtio spec. Work on that
    virtio spec change is still in progress:

    https://github.com/oasis-tcs/virtio-spec/issues/122
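
    The 4 MB ceiling mentioned above follows directly from that
    1024-descriptor assumption. As a rough userspace illustration (the
    function and parameter names are made up for this sketch, and a 4 KiB
    page size is assumed; the real constants live in net/9p/trans_virtio.c
    and in QEMU):

```c
/* Hypothetical helper deriving the virtio transport's msize ceiling from
 * the assumed number of indirect descriptors and the page size. */
static long transport_msize_ceiling(long max_indirect_descs, long page_sz)
{
	/* Each indirect descriptor references at most one page, so a
	 * single request moves at most max_indirect_descs * page_sz
	 * bytes. The usable msize ends up slightly below that, because
	 * a few descriptors are needed for 9p header/metadata. */
	return max_indirect_descs * page_sz;
}
```

    With the assumed QEMU limit of 1024 descriptors and 4 KiB pages this
    yields 1024 * 4096 = 4194304 bytes, i.e. the 4 MB milestone; a
    negotiated, larger descriptor table limit would raise it accordingly.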

  2. Reduce readdir buffer sizes (optional - maybe later):

    This series already reduced the message buffers for most 9p message
    types. It does not yet include Treaddir though, which still simply
    uses msize. It would make sense to benchmark first whether this is
    actually an issue that hurts. If it does, then one might use already
    existing vfs knowledge to estimate the Treaddir size, or start with
    some reasonably small hard coded Treaddir size first and increase
    it only on the 2nd Treaddir request if there are more directory
    entries to fetch.
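
    The second approach could be sketched roughly like this (plain C; the
    function name and the 8 KiB starting size are made up for illustration
    and not part of the series):

```c
#include <stddef.h>

/* Hypothetical sketch: use a small buffer for the first Treaddir request
 * on a directory, and fall back to a full msize-sized buffer only for
 * follow-up requests, i.e. when the first reply showed that there are
 * more directory entries left to fetch. */
static size_t treaddir_bufsize(int request_no, size_t msize)
{
	const size_t small = 8 * 1024;	/* assumed starting size */

	if (request_no <= 1 && small < msize)
		return small;	/* 1st request: cheap small buffer */
	return msize;		/* follow-up requests: full msize */
}
```

    A directory that fits into the small buffer would then never allocate
    an msize-sized buffer at all, while large directories pay the small
    first buffer only once.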

  3. Add more buffer caches (optional - maybe later):

    p9_fcall_init() uses kmem_cache_alloc() instead of kmalloc() for very
    large buffers to reduce the latency of waiting for memory allocation to
    complete. Currently it does that only if the requested buffer size is
    exactly msize large. As patch 10 already divided the 9p message types
    into a few message size categories, maybe it would make sense to use
    e.g. 4 separate caches for those size categories (e.g. 4k, 8k, msize/2,
    msize). Might be worth a benchmark test.
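
    Rounding a requested size up to one of those four buckets could look
    roughly like this (userspace sketch; cache_bucket() and its bucket
    table are hypothetical, merely mirroring the 4k/8k/msize/2/msize
    categories suggested above):

```c
#include <stddef.h>

/* Hypothetical: map a needed buffer size to one of four fixed cache
 * sizes, so that a dedicated kmem_cache could serve each bucket instead
 * of using kmem_cache_alloc() only for exactly msize-sized buffers. */
static size_t cache_bucket(size_t need, size_t msize)
{
	const size_t buckets[4] = { 4096, 8192, msize / 2, msize };
	size_t i;

	for (i = 0; i < 4; i++)
		if (need <= buckets[i])
			return buckets[i];
	return msize;	/* requests are capped at msize anyway */
}
```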

Testing and feedback appreciated!

v5 -> v6:

  * Fix BUG_ON(node >= VIRTQUEUE_SG_NSGL_MAX) ->
        BUG_ON(node >= vq_sg->nsgl) in vq_sg_page().
    [patch 3]

  * Limit virtio transport to (slightly below) 4MB msize by assuming
    VIRTIO_MAX_DESCRIPTORS = 1024. [patch 5]

  * Drop previous patch "net/9p: limit 'msize' to KMALLOC_MAX_SIZE for all".
    [OLD patch 7]

  * Add static_assert(NAME_MAX <= 4k) to p9_msg_buf_size().
    [patch 9]

  * Make it more clear from the commit log how individual 9p message types
    were distributed over the 2 hard coded message size categories.
    [patch 9]

  * Drop max() call in p9_msg_buf_size() which was superfluous.
    [patch 9]

  * Compare protocol version directly against 'p9_proto_legacy' instead of
    inverted comparison against 'p9_proto_2000u' and 'p9_proto_2000l'.
    [patch 9]

  * Fix code style aesthetics by getting rid of an unnecessary block
    indention. [patch 9]

  * Dynamically calculate size for P9_TSYMLINK instead of using a hard
    coded size. [patch 9]

  * Fix s/P9_TSTAT/P9_TWSTAT/.
    [patch 9]

  * Add missing P9_TMKNOD, P9_TRENAME, P9_TLOCK to hard coded 8k size
    category. [patch 9]

  * Use a dedicated new bool field 'pooled_rbuffers' in struct
    p9_trans_module instead of strcmp(c->trans_mod->name, "rdma").
    [NEW patch 10], [patch 11]

  * Allow dynamic request buffer size for RDMA transport as well.
    [patch 11]

Christian Schoenebeck (11):
  9p/trans_virtio: separate allocation of scatter gather list
  9p/trans_virtio: turn amount of sg lists into runtime info
  9p/trans_virtio: introduce struct virtqueue_sg
  net/9p: add trans_maxsize to struct p9_client
  9p/trans_virtio: support larger msize values
  9p/trans_virtio: resize sg lists to whatever is possible
  net/9p: split message size argument into 't_size' and 'r_size' pair
  9p: add P9_ERRMAX for 9p2000 and 9p2000.u
  net/9p: add p9_msg_buf_size()
  net/9p: add 'pooled_rbuffers' flag to struct p9_trans_module
  net/9p: allocate appropriate reduced message buffers

 include/net/9p/9p.h        |   3 +
 include/net/9p/client.h    |   2 +
 include/net/9p/transport.h |   5 +
 net/9p/client.c            |  60 +++++--
 net/9p/protocol.c          | 167 +++++++++++++++++++
 net/9p/protocol.h          |   2 +
 net/9p/trans_fd.c          |   1 +
 net/9p/trans_rdma.c        |   1 +
 net/9p/trans_virtio.c      | 321 ++++++++++++++++++++++++++++++++-----
 net/9p/trans_xen.c         |   1 +
 10 files changed, 514 insertions(+), 49 deletions(-)

Comments

Christian Schoenebeck July 16, 2022, 9:54 a.m. UTC | #1
On Saturday, July 16, 2022 01:28:51 CEST Dominique Martinet wrote:
> Dominique Martinet wrote on Sat, Jul 16, 2022 at 07:30:45AM +0900:
> > Christian Schoenebeck wrote on Fri, Jul 15, 2022 at 11:35:26PM +0200:
> > > * Patches 7..11 tremendously reduce unnecessarily huge 9p message sizes
> > >   and therefore provide performance gain as well. So far, almost all 9p
> > >   messages simply allocated message buffers exactly msize large, even
> > >   for messages that actually just needed few bytes. So these patches
> > >   make sense by themselves, independent of this overall series, however
> > >   for this series even more, because the larger msize, the more this
> > >   issue would have hurt otherwise.
> > 
> > Unless they got stuck somewhere the mails are missing patches 10 and 11,
> > one too many 0s to git send-email ?
> 
> nevermind, they just got in after 1h30... I thought it'd been 1h since
> the first mails because the first ones were already 50 mins late and I
> hadn't noticed! I wonder where they're stuck, that's the time
> lizzy.crudebyte.com received them and it filters earlier headers so
> probably between you and it?

Certainly an outbound SMTP greylisting delay, i.e. lack of karma. Sometimes my 
patches make it to lists after 3 hours. I haven't figured out though why some 
patches within the same series arrive significantly faster than certain other 
ones, which is especially weird when that happens not in the order they were 
sent.

> ohwell.
> 
> > I'll do a quick review from github commit meanwhile
> 
> Looks good to me, I'll try to get some tcp/rdma testing done this
> weekend and stash them up to next

Great, thanks!

> --
> Dominique
Dominique Martinet July 16, 2022, 11:54 a.m. UTC | #2
Christian Schoenebeck wrote on Sat, Jul 16, 2022 at 11:54:29AM +0200:
> > Looks good to me, I'll try to get some tcp/rdma testing done this
> > weekend and stash them up to next
> 
> Great, thanks!

Quick update on this: tcp seems to work fine, I need to let it run a bit
longer but not expecting any trouble.

RDMA is... complicated.
I was certain an adapter in loopback mode ought to work so I just
bought a cheap card alone, but I couldn't get it to work (ipoib works
but I think that's just the linux tcp stack cheating, I'm getting unable
to resolve route (rdma_resolve_route) errors when trying real rdma
applications...)


OTOH, linux got softiwarp merged in as RDMA_SIW which works perfectly
with my rdma applications, after fixing/working around a couple of bugs
on the server I'm getting hangs that I can't reproduce with debug on
current master so this isn't exactly great, not sure where it goes
wrong :|
At least with debug still enabled I'm not getting any new hang with your
patches, so let's call it ok...?

I'll send a mail to ex-colleagues who might care about it (and
investigate a bit more if so), and a more open mail if that falls
short...

--
Dominique
Christian Schoenebeck July 16, 2022, 12:10 p.m. UTC | #3
On Saturday, July 16, 2022 13:54:08 CEST Dominique Martinet wrote:
> Christian Schoenebeck wrote on Sat, Jul 16, 2022 at 11:54:29AM +0200:
> > > Looks good to me, I'll try to get some tcp/rdma testing done this
> > > weekend and stash them up to next
> > 
> > Great, thanks!
> 
> Quick update on this: tcp seems to work fine, I need to let it run a bit
> longer but not expecting any trouble.
> 
> RDMA is... complicated.
> I was certain an adapter in loopback mode ought to work so I just
> bought a cheap card alone, but I couldn't get it to work (ipoib works
> but I think that's just the linux tcp stack cheating, I'm getting unable
> to resolve route (rdma_resolve_route) errors when trying real rdma
> applications...)
> 
> 
> OTOH, linux got softiwarp merged in as RDMA_SIW which works perfectly
> with my rdma applications, after fixing/working around a couple of bugs
> on the server I'm getting hangs that I can't reproduce with debug on
> current master so this isn't exactly great, not sure where it goes
> wrong :|
> At least with debug still enabled I'm not getting any new hang with your
> patches, so let's call it ok...?

Well, I would need more info to judge or resolve that, like which patch 
exactly broke RDMA behaviour for you?

> I'll send a mail to ex-colleagues who might care about it (and
> investigate a bit more if so), and a more open mail if that falls
> short...
> 
> --
> Dominique
Dominique Martinet July 16, 2022, 12:44 p.m. UTC | #4
Christian Schoenebeck wrote on Sat, Jul 16, 2022 at 02:10:05PM +0200:
> > OTOH, linux got softiwarp merged in as RDMA_SIW which works perfectly
> > with my rdma applications, after fixing/working around a couple of bugs
> > on the server I'm getting hangs that I can't reproduce with debug on
> > current master so this isn't exactly great, not sure where it goes
> > wrong :|
> > At least with debug still enabled I'm not getting any new hang with your
> > patches, so let's call it ok...?
> 
> Well, I would need more info to judge or resolve that, like which patch 
> exactly broke RDMA behaviour for you?

I wouldn't have trouble if I knew that; I don't have access to the
hardware I last used 9p/rdma on, so it might very well be a softiwarp
compatibility problem, a server version issue, or anything else.

At the very least I'm not getting new errors and the server does receive
everything we sent, so as far as these patches are concerned I don't
think we're making anything worse.


I'll get back to you once I hear back from my former employer (if they can
have someone run some tests, confirm it works and/or bisect that); I
really spent too much time already trying to get the old adapter I bought
working...

All I can say is that there's no error anywhere. I've finally reproduced
it once with debug, and I can confirm the server sent the reply and
didn't get any error from ibv_post_send(), so the message should have
been sent, but the client just never processed it.
The next step would be to add/enable some logs on the client to see if it
actually received something or not and go from there, but I'd like to
see something that works first...

--
Dominique