Message ID: cover.1640870037.git.linux_oss@crudebyte.com
Series: remove msize limit in virtio transport
Thanks for the patches. I've applied them on top of the 5.16.2 kernel and it works for msize=1048576. Performance-wise, same throughput as the previous patches; basically the limiting factor is the backend block storage.

However, when I mount with msize=4194304, the system locks up upon the first try to traverse the directory structure, i.e. 'ls'. The only solution is to 'poweroff' the guest. Nothing in the logs.

Qemu 6.0.0 on the host has the following patches:
01-fix-wrong-io-block-size-Rgetattr.patch
02-dedupe-iounit-code.patch
03-9pfs-simplify-blksize_to_iounit.patch

The kernel patches were applied on the guest kernel only. I've generated them with the following command:
git diff 783ba37c1566dd715b9a67d437efa3b77e3cd1a7^..8c305df4646b65218978fc6474aa0f5f29b216a0 > /tmp/kernel-5.16-9p-virtio-drop-msize-cap.patch

The host runs a 5.15.4 kernel.

Cheers,
-N

On Thu, 2021-12-30 at 14:23 +0100, Christian Schoenebeck wrote:
> This series aims to get rid of the current 500k 'msize' limitation in
> the 9p virtio transport, which is currently a bottleneck for performance
> of 9p mounts.
>
> To avoid confusion: it does remove the msize limit for the virtio transport;
> on 9p client level though the anticipated milestone for this series is now
> a max. 'msize' of 4 MB. See patch 8 for the reason why.
>
> This is a follow-up of the following series and discussion:
> https://lore.kernel.org/netdev/cover.1632327421.git.linux_oss@crudebyte.com/
>
> Latest version of this series:
> https://github.com/cschoenebeck/linux/commits/9p-virtio-drop-msize-cap
>
> OVERVIEW OF PATCHES:
>
> * Patch 1 is just a trivial info message for the user to know why his msize
>   option got ignored by the 9p client in case the value was larger than what
>   is supported by this 9p driver.
>
> * Patches 2..7 remove the msize limitation from the 'virtio' transport
>   (i.e. the 9p 'virtio' transport itself actually supports >4MB now, tested
>   successfully with an experimental QEMU version and some dirty 9p Linux
>   client hacks up to msize=128MB).
>
> * Patch 8 limits msize for all transports to 4 MB for now, as >4MB would
>   need more work on 9p client level (see commit log of patch 8 for details).
>
> * Patches 9..12 tremendously reduce unnecessarily huge 9p message sizes and
>   therefore provide performance gain as well. So far, almost all 9p messages
>   simply allocated message buffers exactly msize large, even for messages
>   that actually just needed a few bytes. So these patches make sense by
>   themselves, independent of this overall series, however for this series
>   even more, because the larger msize, the more this issue would have hurt
>   otherwise.
>
> PREREQUISITES:
>
> If you are testing with QEMU then please either use the latest QEMU 6.2
> release or higher, or at least apply the following patch on QEMU side:
>
> https://lore.kernel.org/qemu-devel/E1mT2Js-0000DW-OH@lizzy.crudebyte.com/
>
> That QEMU patch is required if you are using a user space app that
> automatically retrieves an optimum I/O block size by obeying stat's
> st_blksize, which 'cat' for instance is doing, e.g.:
>
>     time cat test_rnd.dat > /dev/null
>
> Otherwise please use a user space app for performance testing that allows
> you to force a large block size and to avoid that QEMU issue, like 'dd'
> for instance; in that case you don't need to patch QEMU.
>
> KNOWN LIMITATION:
>
> With this series applied I can run
>
>     QEMU host <-> 9P virtio <-> Linux guest
>
> with up to slightly below 4 MB msize [4186112 = (1024-2) * 4096]. If I try
> to run it with exactly 4 MB (4194304) it currently hits a limitation on
> QEMU side:
>
>     qemu-system-x86_64: virtio: too many write descriptors in indirect table
>
> That's because QEMU currently has a hard-coded limit of max. 1024 virtio
> descriptors per vring slot (i.e. per virtio message), see STILL TO DO (1.)
> below.
>
> STILL TO DO:
>
> 1. Negotiating virtio "Queue Indirect Size" (MANDATORY):
>
>    The QEMU issue described above must be addressed by negotiating the
>    maximum length of virtio indirect descriptor tables on virtio device
>    initialization. This would not only avoid the QEMU error above, but
>    would also allow msize of >4MB in future. Before that change can be done
>    on Linux and QEMU sides though, it first requires a change to the virtio
>    specs. Work on the virtio specs is in progress:
>
>    https://github.com/oasis-tcs/virtio-spec/issues/122
>
>    This is not really an issue for testing this series. Just stick to max.
>    msize=4186112 as described above and you will be fine. However for the
>    final PR this should obviously be addressed in a clean way.
>
> 2. Reduce readdir buffer sizes (optional - maybe later):
>
>    This series already reduced the message buffers for most 9p message
>    types. This does not include Treaddir yet though, which still simply
>    uses msize. It would make sense to benchmark first whether this is
>    actually an issue that hurts. If it does, then one might use already
>    existing vfs knowledge to estimate the Treaddir size, or start with
>    some reasonable hard-coded small Treaddir size first and then increase
>    it just on the 2nd Treaddir request if there are more directory entries
>    to fetch.
>
> 3. Add more buffer caches (optional - maybe later):
>
>    p9_fcall_init() uses kmem_cache_alloc() instead of kmalloc() for very
>    large buffers to reduce latency waiting for memory allocation to
>    complete. Currently it does that only if the requested buffer size is
>    exactly msize large. As patch 11 already divided the 9p message types
>    into a few message size categories, maybe it would make sense to use
>    e.g. 4 separate caches for those memory categories (e.g. 4k, 8k,
>    msize/2, msize). Might be worth a benchmark test.
>
> Testing and feedback appreciated!
>
> v3 -> v4:
>
> * Limit msize to 4 MB for all transports [NEW patch 8].
>
> * Avoid unnecessarily huge 9p message buffers
>   [NEW patch 9] .. [NEW patch 12].
>
> Christian Schoenebeck (12):
>   net/9p: show error message if user 'msize' cannot be satisfied
>   9p/trans_virtio: separate allocation of scatter gather list
>   9p/trans_virtio: turn amount of sg lists into runtime info
>   9p/trans_virtio: introduce struct virtqueue_sg
>   net/9p: add trans_maxsize to struct p9_client
>   9p/trans_virtio: support larger msize values
>   9p/trans_virtio: resize sg lists to whatever is possible
>   net/9p: limit 'msize' to KMALLOC_MAX_SIZE for all transports
>   net/9p: split message size argument into 't_size' and 'r_size' pair
>   9p: add P9_ERRMAX for 9p2000 and 9p2000.u
>   net/9p: add p9_msg_buf_size()
>   net/9p: allocate appropriate reduced message buffers
>
>  include/net/9p/9p.h     |   3 +
>  include/net/9p/client.h |   2 +
>  net/9p/client.c         |  67 +++++++--
>  net/9p/protocol.c       | 154 ++++++++++++++++++++
>  net/9p/protocol.h       |   2 +
>  net/9p/trans_virtio.c   | 304 +++++++++++++++++++++++++++++++++++-----
>  6 files changed, 483 insertions(+), 49 deletions(-)
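The "STILL TO DO" item (3.) in the quoted cover letter only describes the per-size-class cache idea in prose. The kernel-style sketch below illustrates one possible shape of it; all names here (p9_buf_cache, p9_buf_cache_init, p9_buf_alloc) and the concrete size classes are hypothetical and not code from the series.

```c
#include <linux/errno.h>
#include <linux/slab.h>
#include <linux/types.h>

/* Hypothetical per-size-class buffer caches as floated in "STILL TO DO" (3.):
 * instead of one kmem cache used only for exactly msize-sized buffers,
 * create one cache per message size category (4k, 8k, msize/2, msize).
 */
static struct kmem_cache *p9_buf_cache[4];

static int p9_buf_cache_init(size_t msize)
{
	static const char * const names[4] = {
		"9p-buf-4k", "9p-buf-8k", "9p-buf-half-msize", "9p-buf-msize",
	};
	const size_t sizes[4] = { 4096, 8192, msize / 2, msize };
	int i;

	for (i = 0; i < 4; i++) {
		p9_buf_cache[i] = kmem_cache_create(names[i], sizes[i], 0, 0, NULL);
		if (!p9_buf_cache[i])
			return -ENOMEM;
	}
	return 0;
}

/* Allocation would then pick the smallest class that fits the request. */
static void *p9_buf_alloc(size_t size, size_t msize, gfp_t gfp)
{
	const size_t sizes[4] = { 4096, 8192, msize / 2, msize };
	int i;

	for (i = 0; i < 4; i++)
		if (size <= sizes[i])
			return kmem_cache_alloc(p9_buf_cache[i], gfp);
	return NULL; /* larger than msize: caller would fall back or error out */
}
```

In the series itself, patch 11's p9_msg_buf_size() already establishes such size categories, so caches like these would only need to mirror them; whether that pays off is, as the cover letter says, a matter for benchmarking.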
On Thursday, 20 January 2022 23:43:46 CET Nikolay Kichukov wrote:
> Thanks for the patches. I've applied them on top of the 5.16.2 kernel and it
> works for msize=1048576. Performance-wise, same throughput as the previous
> patches; basically the limiting factor is the backend block storage.

Depends on how you were testing exactly. I assume you just booted a guest and then mounted a humble 9p directory in the guest to perform some isolated I/O throughput tests on a single file. In that test scenario, yes, you would not see much of a difference between v3 and v4 of this series.

However, in my tests I went much further than that by running an entire guest on top of 9p as its root filesystem:
https://wiki.qemu.org/Documentation/9p_root_fs

With this 9p rootfs setup you get a completely different picture. For instance, you'll notice with v3 that guest boot time *increases* with rising msize, whereas with v4 it shrinks. And when you benchmark throughput on a file in this 9p rootfs setup, v3 gives worse results than v4, sometimes even worse than without the patches at all. With v4 applied, though, it clearly outperforms any other kernel version in all aspects.

I highly recommend this 9p rootfs setup as a heterogeneous 9p test environment, as it is a very good real-world test scenario for all kinds of aspects.

> However, when I mount with msize=4194304, the system locks up upon the first
> try to traverse the directory structure, i.e. 'ls'. The only solution is to
> 'poweroff' the guest. Nothing in the logs.

I've described this in detail in the cover letter under "KNOWN LIMITATION" already. Use max. msize 4186112.

> Qemu 6.0.0 on the host has the following patches:
>
> 01-fix-wrong-io-block-size-Rgetattr.patch
> 02-dedupe-iounit-code.patch
> 03-9pfs-simplify-blksize_to_iounit.patch

I recommend just using QEMU 6.2. It is not worth patching that old QEMU version; e.g. you would have lousy readdir performance with that QEMU version, among other things.

You don't need to install QEMU. You can run it directly from the build directory.

> The kernel patches were applied on the guest kernel only.
>
> I've generated them with the following command:
> git diff 783ba37c1566dd715b9a67d437efa3b77e3cd1a7^..8c305df4646b65218978fc6474aa0f5f29b216a0 > /tmp/kernel-5.16-9p-virtio-drop-msize-cap.patch
>
> The host runs a 5.15.4 kernel.

The host kernel version currently does not matter for this series.

Best regards,
Christian Schoenebeck
Thanks Christian,

It works, sorry for overlooking the 'known limitations' in the first place. When do we expect these patches to be merged upstream?

Cheers,
-N

On Sat, 2022-01-22 at 14:34 +0100, Christian Schoenebeck wrote:
> I've described this in detail in the cover letter under "KNOWN LIMITATION"
> already. Use max. msize 4186112.
> [...]
Nikolay Kichukov wrote on Mon, Jan 24, 2022 at 11:21:08AM +0100:
> It works, sorry for overlooking the 'known limitations' in the first
> place. When do we expect these patches to be merged upstream?

We're just starting a new development cycle for 5.18 while 5.17 is stabilizing, so this mostly depends on the ability to check whether an msize given as parameter is valid, as described in the first "STILL TO DO" point listed in the cover letter.

I personally would be happy considering this series for this cycle with just a max msize of 4MB-8k, and leaving that further bump for later once we're sure qemu will handle it. We're still seeing a boost for that, and the smaller buffers for small messages will benefit all transport types, so that would get in in roughly two months for 5.18-rc1, then another two months for 5.18 to actually be released and start hitting production code.

I'm not sure when exactly, but I'll run some tests with it as well and redo a proper code review within the next few weeks, so we can get this in -next for a little while before the merge window.
On Monday, 24 January 2022 12:07:20 CET Dominique Martinet wrote:
> Nikolay Kichukov wrote on Mon, Jan 24, 2022 at 11:21:08AM +0100:
> > It works, sorry for overlooking the 'known limitations' in the first
> > place. When do we expect these patches to be merged upstream?
>
> We're just starting a new development cycle for 5.18 while 5.17 is
> stabilizing, so this mostly depends on the ability to check whether an
> msize given as parameter is valid, as described in the first "STILL TO DO"
> point listed in the cover letter.

I will ping the Red Hat guys on the open virtio spec issue this week. If you want, I can CC you, Dominique, on the discussion regarding the virtio spec changes. It's a somewhat dry topic though.

> I personally would be happy considering this series for this cycle with
> just a max msize of 4MB-8k, and leaving that further bump for later once
> we're sure qemu will handle it.

I haven't actually checked whether there was any old QEMU version that did not support exceeding the virtio queue size. So a very ancient QEMU version might error out if msize > (128 * 4096 = 512k).

Besides QEMU, what other 9p server implementations are actually out there, and how would they behave on this? A test on their side would definitely be a good idea.

> We're still seeing a boost for that, and the smaller buffers for small
> messages will benefit all transport types, so that would get in in
> roughly two months for 5.18-rc1, then another two months for 5.18 to
> actually be released and start hitting production code.
>
> I'm not sure when exactly, but I'll run some tests with it as well and
> redo a proper code review within the next few weeks, so we can get this
> in -next for a little while before the merge window.

Especially the buffer size reduction patches need a proper review. Those changes can be tricky. So far I have not encountered any issues in tests, at least. OTOH these patches could be pushed through separately already, no matter what the decision regarding the virtio issue will be.

Best regards,
Christian Schoenebeck
Christian Schoenebeck wrote on Mon, Jan 24, 2022 at 12:57:35PM +0100:
> > We're just starting a new development cycle for 5.18 while 5.17 is
> > stabilizing, so this mostly depends on the ability to check whether an
> > msize given as parameter is valid, as described in the first "STILL TO DO"
> > point listed in the cover letter.
>
> I will ping the Red Hat guys on the open virtio spec issue this week. If you
> want, I can CC you, Dominique, on the discussion regarding the virtio spec
> changes. It's a somewhat dry topic though.

I don't have much expertise on virtio stuff, so I don't think I'll bring much to the discussion, but I'm always happy to fill my inbox :)
It's always good to keep an eye on things at least.

> > I personally would be happy considering this series for this cycle with
> > just a max msize of 4MB-8k, and leaving that further bump for later once
> > we're sure qemu will handle it.
>
> I haven't actually checked whether there was any old QEMU version that did
> not support exceeding the virtio queue size. So a very ancient QEMU version
> might error out if msize > (128 * 4096 = 512k).

Even if the spec gets implemented, we need the default msize to work for reasonably older versions of qemu (at least a few years, e.g. supported versions of debian/rhel can go quite a while back), and ideally have a somewhat sensible error if we go above some max...

> Besides QEMU, what other 9p server implementations are actually out there,
> and how would they behave on this? A test on their side would definitely be
> a good idea.

9p virtio would only be qemu as far as I know.

For tcp/fd there are a few:
- https://github.com/chaos/diod (also supports rdma iirc; I don't have any hardware for rdma tests anymore though)
- https://github.com/nfs-ganesha/nfs-ganesha (also rdma)
- I was pointed at https://github.com/lionkov/go9p in a recent bug report
- http://repo.cat-v.org/libixp/ is also a server implementation; I haven't tested it with the linux client in a while but iirc it used to work

I normally run some tests with qemu (virtio) and ganesha (tcp) before pushing to my linux-next branch, so we hopefully don't make too many assumptions that are specific to a server.

> > We're still seeing a boost for that, and the smaller buffers for small
> > messages will benefit all transport types, so that would get in in
> > roughly two months for 5.18-rc1, then another two months for 5.18 to
> > actually be released and start hitting production code.
> >
> > I'm not sure when exactly, but I'll run some tests with it as well and
> > redo a proper code review within the next few weeks, so we can get this
> > in -next for a little while before the merge window.
>
> Especially the buffer size reduction patches need a proper review. Those
> changes can be tricky. So far I have not encountered any issues in tests, at
> least. OTOH these patches could be pushed through separately already, no
> matter what the decision regarding the virtio issue will be.

Yes, I've had a first look and it's quite different from what I'd have done, but it didn't look bad and I just wanted to spend a bit more time on it.

On a very high level I'm not fond of the logical duplication brought by deciding the size in a different function (it duplicates format strings for checks and brings in a huge switch case with all formats) when we already have one function per call which could take the size decision directly without going through the format varargs. But it's not like the protocol has evolved over the past ten years, so it's not really a problem -- I just need to get down to it and check it all matches up.

I also agree it's totally orthogonal to the virtio size extension, so if you want to wait for the new virtio standard I'll focus on this part first.
On Monday, 24 January 2022 13:56:17 CET Dominique Martinet wrote:
> Christian Schoenebeck wrote on Mon, Jan 24, 2022 at 12:57:35PM +0100:
> > I will ping the Red Hat guys on the open virtio spec issue this week. If
> > you want, I can CC you, Dominique, on the discussion regarding the virtio
> > spec changes. It's a somewhat dry topic though.
>
> I don't have much expertise on virtio stuff, so I don't think I'll bring
> much to the discussion, but I'm always happy to fill my inbox :)
> It's always good to keep an eye on things at least.

OK, I'll CC you from now on on the virtio spec front; if it gets too noisy, just raise your hand.

> > I haven't actually checked whether there was any old QEMU version that
> > did not support exceeding the virtio queue size. So a very ancient QEMU
> > version might error out if msize > (128 * 4096 = 512k).
>
> Even if the spec gets implemented, we need the default msize to work for
> reasonably older versions of qemu (at least a few years, e.g. supported
> versions of debian/rhel can go quite a while back), and ideally have a
> somewhat sensible error if we go above some max...

Once the virtio spec changes are accepted and implemented, that would not be an issue at all; virtio changes are always made with backward compatibility in mind. The plan is to negotiate the new virtio feature on virtio subsystem level: if either side does not support it (either too old QEMU or too old kernel), then msize would automatically be limited to the old virtio size/behaviour (a.k.a. virtio "queue size"), and with QEMU as 9p server that would be max. msize 500k.

Therefore I suggest just waiting for the virtio spec changes to be complete and implemented. People who care about performance should then simply use an updated kernel *and* an updated QEMU version to achieve msize > 500k. IMO there is no need to risk breaking some old kernel/QEMU combination if nobody asked for it anyway, and if somebody does, we could still add some kind of --force-at-your-own-risk switch later on.

> > Besides QEMU, what other 9p server implementations are actually out
> > there, and how would they behave on this? A test on their side would
> > definitely be a good idea.
>
> 9p virtio would only be qemu as far as I know.
>
> For tcp/fd there are a few:
> - https://github.com/chaos/diod (also supports rdma iirc; I don't have
>   any hardware for rdma tests anymore though)
> - https://github.com/nfs-ganesha/nfs-ganesha (also rdma)
> - I was pointed at https://github.com/lionkov/go9p in a recent bug report
> - http://repo.cat-v.org/libixp/ is also a server implementation; I haven't
>   tested it with the linux client in a while but iirc it used to work
>
> I normally run some tests with qemu (virtio) and ganesha (tcp) before
> pushing to my linux-next branch, so we hopefully don't make too many
> assumptions that are specific to a server.

Good to know, thanks!

> [...]
> > Especially the buffer size reduction patches need a proper review. Those
> > changes can be tricky. So far I have not encountered any issues in tests,
> > at least. OTOH these patches could be pushed through separately already,
> > no matter what the decision regarding the virtio issue will be.
>
> Yes, I've had a first look and it's quite different from what I'd have
> done, but it didn't look bad and I just wanted to spend a bit more time
> on it.
>
> On a very high level I'm not fond of the logical duplication brought by
> deciding the size in a different function (it duplicates format strings for
> checks and brings in a huge switch case with all formats) when we already
> have one function per call which could take the size decision directly
> without going through the format varargs. But it's not like the protocol
> has evolved over the past ten years, so it's not really a problem -- I
> just need to get down to it and check it all matches up.

Yeah, I know. The advantage though is that this separate function/switch-case approach merges many message types, so it is actually less code. And I tried to automate code sanity with various BUG_ON() calls to prevent the two from accidentally drifting apart with future changes.

> I also agree it's totally orthogonal to the virtio size extension, so if
> you want to wait for the new virtio standard I'll focus on this part
> first.

IMO it would make sense to give these message size reduction patches priority for now, as long as the virtio spec changes are incomplete.

One more thing: so far I have just concentrated on behavioural aspects and testing. What I completely neglected so far was code style. If you want, I can send a v5 this week with code style (and only code style) fixed, if that helps you keep diff noise low for your review.

Best regards,
Christian Schoenebeck
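For readers without the patches at hand, here is a deliberately simplified sketch of the two shapes being weighed in the exchange above: a single helper deciding the buffer size per message type (the approach the series takes with p9_msg_buf_size(), which in reality also distinguishes request 't_size' from reply 'r_size' and covers every message type), versus each p9_client_*() call choosing its size directly. Everything below, including the stand-in type names, is illustrative only and not the actual patch code.

```c
#include <stddef.h>

/* Stand-ins for the real P9_T* message type constants from 9p.h. */
enum { EX_TCLUNK, EX_TGETATTR, EX_TREAD, EX_TWRITE, EX_TREADDIR };

/* Central approach: one switch maps message types to buffer sizes,
 * collapsing many header-only types into a small bucket.
 */
static size_t example_msg_buf_size(int type, size_t msize)
{
	switch (type) {
	case EX_TCLUNK:          /* tiny, header-only requests */
	case EX_TGETATTR:
		return 4096;     /* hypothetical small bucket */
	case EX_TREAD:           /* data-carrying messages stay msize-bound */
	case EX_TWRITE:
	case EX_TREADDIR:
	default:
		return msize;
	}
}
```

The trade-off as stated in the thread: the central switch merges many message types and ends up as less code overall, but it duplicates knowledge about each message's format in a second place, which then has to be kept in sync with the per-call encoders (hence the BUG_ON() sanity checks Christian mentions).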
On Mon, 2022-01-24 at 12:57 +0100, Christian Schoenebeck wrote:
> Besides QEMU, what other 9p server implementations are actually out there,
> and how would they behave on this? A test on their side would definitely
> be a good idea.

diod is a 9p network server; if these patches are purely virtio transport specific, I believe they should not affect it.

Here is the source code for diod:
https://github.com/chaos/diod

Jim Garlick maintains it, added to CC here.
Hello Dominique,

On Mon, 2022-01-24 at 20:07 +0900, Dominique Martinet wrote:
> [...]
> I'm not sure when exactly, but I'll run some tests with it as well and
> redo a proper code review within the next few weeks, so we can get this
> in -next for a little while before the merge window.

Did these patches make it into 5.18? I see it just got released...
On Tuesday, 24 May 2022 10:10:31 CEST Nikolay Kichukov wrote:
> Hello Dominique,
>
> On Mon, 2022-01-24 at 20:07 +0900, Dominique Martinet wrote:
> > [...]
> > I'm not sure when exactly, but I'll run some tests with it as well and
> > redo a proper code review within the next few weeks, so we can get this
> > in -next for a little while before the merge window.
>
> Did these patches make it into 5.18? I see it just got released...

No, not yet. I can send a v5 as outlined above, including the opt-out for the RDMA transport as Dominique noted; that wouldn't be much work for me. However, ATM I fear it would probably not help anybody, as we currently have two far more serious issues [1] regarding 9p's cache support, introduced by the netfs changes:

1. 9p with cache enabled now performs worse than without any cache.

2. There are misbehaviours once in a while, as the 9p cache opens FIDs in read-only mode and then occasionally tries to write with those read-only FIDs, which causes EBADF errors.

[1] https://lore.kernel.org/lkml/9591612.lsmsJCMaJN@silver/

Issue (2.) can probably be fixed by just opening the FIDs in RW mode, as suggested by Dominique. But for performance issue (1.) nobody has had an idea yet.

Best regards,
Christian Schoenebeck
On Thursday, 30 December 2021 14:23:18 CET Christian Schoenebeck wrote:
> This series aims to get rid of the current 500k 'msize' limitation in
> the 9p virtio transport, which is currently a bottleneck for performance
> of 9p mounts.
[...]
> KNOWN LIMITATION:
>
> With this series applied I can run
>
>     QEMU host <-> 9P virtio <-> Linux guest
>
> with up to slightly below 4 MB msize [4186112 = (1024-2) * 4096]. If I try
> to run it with exactly 4 MB (4194304) it currently hits a limitation on
> QEMU side:
>
>     qemu-system-x86_64: virtio: too many write descriptors in indirect table
>
> That's because QEMU currently has a hard-coded limit of max. 1024 virtio
> descriptors per vring slot (i.e. per virtio message), see STILL TO DO (1.)
> below.
>
> STILL TO DO:
>
> 1. Negotiating virtio "Queue Indirect Size" (MANDATORY):
>
>    The QEMU issue described above must be addressed by negotiating the
>    maximum length of virtio indirect descriptor tables on virtio device
>    initialization. This would not only avoid the QEMU error above, but
>    would also allow msize of >4MB in future. Before that change can be done
>    on Linux and QEMU sides though, it first requires a change to the virtio
>    specs. Work on the virtio specs is in progress:
>
>    https://github.com/oasis-tcs/virtio-spec/issues/122
>
>    This is not really an issue for testing this series. Just stick to max.
>    msize=4186112 as described above and you will be fine. However for the
>    final PR this should obviously be addressed in a clean way.

As there is no progress on the virtio spec side in sight, I'm considering handling this issue in the upcoming v5 by simply assuming (hard coding) that the 9p server supports exactly up to 1024 virtio descriptors (memory segments) per round-trip message. That's maybe a bit unclean, but it is what other virtio drivers in the Linux kernel have done for many years as well, so I am not expecting a negative issue in practice.

And I mean, when we talk about 9p + virtio, that actually implies QEMU being the 9p server, right? At least I am not aware of another 9p server implementation supporting the virtio transport (nor of any QEMU version that ever supported fewer than 1024 virtio descriptors). Maybe Microsoft WSL? Not sure.

Best regards,
Christian Schoenebeck
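To make the 4186112 figure from the quoted cover letter concrete: with the hard-coded 1024-descriptor limit discussed above, 4 KiB pages, and two descriptors apparently set aside for non-payload parts of the message (the exact reason for the "-2" is not spelled out in the thread, so treat it as an assumption here), the ceiling falls just short of 4 MB. A small standalone check of that arithmetic:

```c
#include <stdio.h>

int main(void)
{
	const long max_desc  = 1024; /* QEMU's current per-message virtio descriptor limit */
	const long page_size = 4096; /* bytes covered by one payload descriptor */
	const long reserved  = 2;    /* assumed non-payload descriptors per message */

	long max_msize = (max_desc - reserved) * page_size;
	printf("max msize = %ld bytes\n", max_msize); /* prints 4186112 */
	return 0;
}
```

This is also where the "4MB-8k" shorthand used earlier in the thread comes from: 4 MiB minus two 4 KiB pages is exactly 4186112 bytes.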
Eric Van Hensbergen wrote on Fri, Jul 08, 2022 at 10:44:45AM +1000:
> there are other 9p virtio servers - several emulation platforms support it
> sans qemu.

Would you happen to have any concrete example? I'd be curious if there are some that would be easy to set up for testing, for example; my current validation setup lacks a bit of diversity...

I found https://github.com/moby/hyperkit for OSX, but that doesn't really help me, and I can't see much else relevant in a quick search.

--
Dominique
On Friday, 8 July 2022 04:26:40 CEST Eric Van Hensbergen wrote:
> kvmtool might be the easiest I guess - I'm traveling right now but I can
> try and find some others. The arm fast models have free versions that are
> downloadable as well. I know I've seen some other less-traditional uses of
> virtio, particularly in libos deployments, but it will take some time to
> rattle those from my memory.

Some examples would indeed be useful, thanks!

> On Fri, Jul 8, 2022 at 11:16 AM Dominique Martinet <asmadeus@codewreck.org> wrote:
> > Eric Van Hensbergen wrote on Fri, Jul 08, 2022 at 10:44:45AM +1000:
> > > there are other 9p virtio servers - several emulation platforms
> > > support it sans qemu.
> >
> > Would you happen to have any concrete example?
> > I'd be curious if there are some that would be easy to set up for testing,
> > for example; my current validation setup lacks a bit of diversity...
> >
> > I found https://github.com/moby/hyperkit for OSX, but that doesn't really
> > help me, and I can't see much else relevant in a quick search.

So that appears to be a 9p (@virtio-PCI) client for xhyve, with max. 256kB buffers <=> max. 68 virtio descriptors (memory segments) [1]:

    /* XXX issues with larger buffers elsewhere in stack */
    #define BUFSIZE (1 << 18)
    #define MAXDESC (BUFSIZE / 4096 + 4)
    #define VT9P_RINGSZ (BUFSIZE / 4096 * 4)

[1] https://github.com/moby/hyperkit/blob/master/src/lib/pci_virtio_9p.c#L27

But on the xhyve side I don't see any 9p server implementation:
https://github.com/machyve/xhyve/search?q=9p

Maybe a 9p server is already implemented by Apple's Hypervisor framework. I don't find this documented anywhere though.

Best regards,
Christian Schoenebeck
Christian Schoenebeck wrote on Fri, Jul 08, 2022 at 01:18:40PM +0200:
> On Friday, 8 July 2022 04:26:40 CEST Eric Van Hensbergen wrote:
> > kvmtool might be the easiest I guess - I'm traveling right now but I can
> > try and find some others. The arm fast models have free versions that are
> > downloadable as well. I know I've seen some other less-traditional uses
> > of virtio, particularly in libos deployments, but it will take some time
> > to rattle those from my memory.
>
> Some examples would indeed be useful, thanks!

https://github.com/kvmtool/kvmtool indeed has a 9p server; I think I used to run it ages ago. I'll give it a fresh spin, thanks for the reminder.

For this one it defines VIRTQUEUE_NUM to 128, so not quite 1024.

> > > I found https://github.com/moby/hyperkit for OSX, but that doesn't
> > > really help me, and I can't see much else relevant in a quick search.
>
> So that appears to be a 9p (@virtio-PCI) client for xhyve,

Oh, the 9p part is client code? The readme says it's a server:
"It includes a complete hypervisor, based on xhyve/bhyve"
but I can't run it anyway, so I didn't check very hard.

> with max. 256kB buffers <=> max. 68 virtio descriptors (memory segments) [1]:

Huh... Well, as long as msize is set I assume it'll work out anyway?
How does virtio queue size work with e.g. parallel messages?

Anyway, even if the negotiation part gets done, servers won't all get implemented in a day, so we need to think of other servers a bit..

--
Dominique
On Friday, 8 July 2022 13:40:36 CEST Dominique Martinet wrote:
> Christian Schoenebeck wrote on Fri, Jul 08, 2022 at 01:18:40PM +0200:
> > On Friday, 8 July 2022 04:26:40 CEST Eric Van Hensbergen wrote:
[...]
> https://github.com/kvmtool/kvmtool indeed has a 9p server; I think I
> used to run it ages ago. I'll give it a fresh spin, thanks for the
> reminder.
>
> For this one it defines VIRTQUEUE_NUM to 128, so not quite 1024.

Yes, and it does *not* limit the client-supplied 'msize' either. It just always sends the same 'msize' value as-is back to the client. :/ So I would expect it to error (or worse) if the client tries msize > 512kB.

> > > > I found https://github.com/moby/hyperkit for OSX, but that doesn't
> > > > really help me, and I can't see much else relevant in a quick search.
> >
> > So that appears to be a 9p (@virtio-PCI) client for xhyve,
>
> Oh, the 9p part is client code? The readme says it's a server:
> "It includes a complete hypervisor, based on xhyve/bhyve"
> but I can't run it anyway, so I didn't check very hard.

Hmm, I actually just interpreted it as a client because of this:

    fprintf(stderr, "virtio-9p: unexpected EOF writing to server-- did the 9P server crash?\n");

But looking at it again, it seems you are right; at least I see that it also handles even 9p message type numbers, but only Twrite and Tflush? I don't see any Tversion or msize handling in general. [shrug]

> > with max. 256kB buffers <=> max. 68 virtio descriptors (memory
> > segments) [1]:
>
> Huh... Well, as long as msize is set I assume it'll work out anyway?

If the server limited 'msize' appropriately, yes. But as the kvmtool example shows, that should probably not be taken for granted.

> How does virtio queue size work with e.g. parallel messages?

Simple question, complicated answer. From the virtio spec's PoV (and current virtio <= v1.2), the negotiated virtio queue size defines both the max. amount of parallel (round-trip) messages *and* the max. amount of virtio descriptors (memory segments) of *all* currently active/parallel messages in total. I "think" because of virtio's origin in virtualized network devices?

So yes, if you are very strict about what the virtio spec <= v1.2 says, and say you have a virtio queue size of 128 (e.g. hard coded by QEMU, kvmtool), and the client sends out a first 9p request with 128 memory segments, then the next (i.e. second) parallel 9p request sent to the server would already exceed the theoretically allowed max. amount of virtio descriptors.

But in practice, I don't see that this theoretical limitation exists in actual 9p virtio server implementations. At least all the server implementations I have seen so far seem to handle the max. virtio descriptor amount for each request separately.

> Anyway, even if the negotiation part gets done, servers won't all get
> implemented in a day, so we need to think of other servers a bit..

OTOH the kernel should have the name of the hypervisor/emulator somewhere, right? So the Linux client's max. virtio descriptors could probably be made dependent on that name?

Best regards,
Christian Schoenebeck