
[2/2] net/9p: increase default msize to 128k

Message ID 61ea0f0faaaaf26dd3c762eabe4420306ced21b9.1630770829.git.linux_oss@crudebyte.com (mailing list archive)
State Not Applicable
Delegated to: Netdev Maintainers
Series net/9p: increase default msize

Checks

Context Check Description
netdev/cover_letter success
netdev/fixes_present success
netdev/patch_count success
netdev/tree_selection success Guessed tree name to be net-next
netdev/subject_prefix warning Target tree name not specified in the subject
netdev/cc_maintainers warning 2 maintainers not CCed: davem@davemloft.net kuba@kernel.org
netdev/source_inline success Was 0 now: 0
netdev/verify_signedoff success
netdev/module_param success Was 0 now: 0
netdev/build_32bit success Errors and warnings before: 4 this patch: 4
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/verify_fixes success
netdev/checkpatch success total: 0 errors, 0 warnings, 0 checks, 8 lines checked
netdev/build_allmodconfig_warn success Errors and warnings before: 4 this patch: 4
netdev/header_inline success

Commit Message

Christian Schoenebeck Sept. 4, 2021, 3:12 p.m. UTC
Let's raise the default msize value to 128k.

The 'msize' option defines the maximum message size allowed for any
message being transmitted (in both directions) between 9p server and 9p
client during a 9p session.

Currently the default 'msize' is just 8k, which is far too conservative.
Such a small 'msize' value has quite a negative performance impact,
because individual 9p messages have to be split up far too often into
numerous smaller messages to fit within this message size limit. As a
rough example, a 1 MiB read with msize=8k takes on the order of 128
Tread/Rread round trips (the payload per message is msize minus a small
header), whereas msize=128k needs only about 8.

A default value of just 8k also has a much higher probability of hitting
short-read issues like: https://gitlab.com/qemu-project/qemu/-/issues/409

Unfortunately, user feedback showed that many 9p users are not even aware
that this option exists, nor of the negative impact it can have when it is
set too low.

Link: https://lists.gnu.org/archive/html/qemu-devel/2021-03/msg01003.html
Signed-off-by: Christian Schoenebeck <linux_oss@crudebyte.com>
---
 net/9p/client.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Comments

Dominique Martinet Sept. 4, 2021, 11:31 p.m. UTC | #1
Hi Christian,

thanks for the follow-up, this has been on my todo list forever and
despite how simple it is I never took the time for it...


I've applied both patches to my -next branch... We're a bit tight for
this merge window (v5.15) but I'll send it to linux mid next week anyway.


I've also added this patch (sorry for laziness) for TCP, other transports
are OK iirc:

From 657e35583c70bed86526cb8eb207abe3d55ea4ea Mon Sep 17 00:00:00 2001
From: Dominique Martinet <asmadeus@codewreck.org>
Date: Sun, 5 Sep 2021 08:29:22 +0900
Subject: [PATCH] net/9p: increase tcp max msize to 1MB

Historically TCP has been limited to 64K buffers, but increasing msize provides
huge performance benefits, especially as latency increases, so allow for bigger buffers.

Ideally further improvements could change the allocation from the current contiguous chunk
in slab (kmem_cache) to some scatter-gather compatible API...

Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>

diff --git a/net/9p/trans_fd.c b/net/9p/trans_fd.c
index f4dd0456beaf..007bbcc68010 100644
--- a/net/9p/trans_fd.c
+++ b/net/9p/trans_fd.c
@@ -34,7 +34,7 @@
 #include <linux/syscalls.h> /* killme */
 
 #define P9_PORT 564
-#define MAX_SOCK_BUF (64*1024)
+#define MAX_SOCK_BUF (1024*1024)
 #define MAXPOLLWADDR   2
 
 static struct p9_trans_module p9_tcp_trans;
Christian Schoenebeck Sept. 5, 2021, 1:33 p.m. UTC | #2
On Sunday, 5 September 2021 01:31:50 CEST Dominique Martinet wrote:
> Hi Christian,
> 
> thanks for the follow up, this has been on my todolist forever and
> despite how simple it is I never took the time for it...
> 
> 
> I've applied both patches to my -next branch... We're a bit tight for
> this merge window (v5.15) but I'll send it to linux mid next week anyway.

That would be great!

> I've also added this patch (sorry for laziness) for TCP, other transports
> are OK iirc:
> 
> From 657e35583c70bed86526cb8eb207abe3d55ea4ea Mon Sep 17 00:00:00 2001
> From: Dominique Martinet <asmadeus@codewreck.org>
> Date: Sun, 5 Sep 2021 08:29:22 +0900
> Subject: [PATCH] net/9p: increase tcp max msize to 1MB
> 
> Historically TCP has been limited to 64K buffers, but increasing msize
> provides huge performance benefits especially as latency increases so allow
> for bigger buffers.
> 
> Ideally further improvements could change the allocation from the current
> contiguous chunk in slab (kmem_cache) to some scatter-gather compatible
> API...
> 
> Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>
> 
> diff --git a/net/9p/trans_fd.c b/net/9p/trans_fd.c
> index f4dd0456beaf..007bbcc68010 100644
> --- a/net/9p/trans_fd.c
> +++ b/net/9p/trans_fd.c
> @@ -34,7 +34,7 @@
>  #include <linux/syscalls.h> /* killme */
> 
>  #define P9_PORT 564
> -#define MAX_SOCK_BUF (64*1024)
> +#define MAX_SOCK_BUF (1024*1024)
>  #define MAXPOLLWADDR   2
> 
>  static struct p9_trans_module p9_tcp_trans;

Yes, makes sense.

Is the TCP transport driver of the Linux 9p client roughly as mature as the
virtio transport driver? Because I still have macOS support for 9p in QEMU
on my TODO list and accordingly a decision for a specific transport type for
macOS will be needed.

For the next series mentioned by me (getting rid of the 512k msize capping) I 
first need to readup on the Linux kernel code. According to the discussion we 
already had about that issue, the reason for that capping was that the amount 
of virtio elements is currently hard coded in the Linux client's virtio 
transport:

On Saturday, 27 February 2021 01:03:40 CEST Dominique Martinet wrote:
> Christian Schoenebeck wrote on Fri, Feb 26, 2021 at 02:49:12PM +0100:
> > Right now the client uses a hard coded amount of 128 elements. So what
> > about replacing VIRTQUEUE_NUM by a variable which is initialized with a
> > value according to the user's requested 'msize' option at init time?
> > 
> > According to the virtio specs the max. amount of elements in a virtqueue
> > is
> > 32768. So 32768 * 4k = 128M as new upper limit would already be a
> > significant improvement and would not require too many changes to the
> > client code, right?
> The current code inits the chan->sg at probe time (when driver is
> loaded) and not mount time, and it is currently embedded in the chan
> struct, so that would need allocating at mount time (p9_client_create ;
> either resizing if required or not sharing) but it doesn't sound too
> intrusive yes.
> 
> I don't see more adherences to VIRTQUEUE_NUM that would hurt trying.

So probably as a first step I would only re-allocate the virtio elements
according to the user-supplied (i.e. large) 'msize' value if the 9p driver is
not in use at that point, and stick to the capping otherwise. That should
probably be fine for many users and would avoid the need for synchronization
measures and the traps they might bring. But again, I still need to read more
of the Linux client code first.

BTW just in case I have not mentioned it before: there are some developer docs 
for the QEMU 9p server implementation now:
https://wiki.qemu.org/Documentation/9p

Best regards,
Christian Schoenebeck
Dominique Martinet Sept. 5, 2021, 9:48 p.m. UTC | #3
Christian Schoenebeck wrote on Sun, Sep 05, 2021 at 03:33:11PM +0200:
> > Subject: [PATCH] net/9p: increase tcp max msize to 1MB
> 
> Yes, makes sense.
> 
> Is the TCP transport driver of the Linux 9p client somewhat equally mature to 
> the virtio transport driver? Because I still have macOS support for 9p in QEMU 
> on my TODO list and accordingly a decision for a specific transport type for 
> macOS will be needed.

I'd say it should be just about as stable as virtio... It's definitely
getting a lot of testing from things like syzkaller, as it's easy to open an
arbitrary fd and pass it to the kernel for fuzzing.

Performance-wise you won't have zero-copy, and the monolithic memory
block requirement is the same, so it won't be as good as virtio but it
can probably be used. The problem will be more one of setup -- for
user networking we can always use qemu's networking implementation, but
for passthrough or tap, qemu won't easily be discoverable from the VM;
I'm not sure about that.
Does AF_VSOCK work on macOS? That could probably easily be added to
trans_fd...
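
Purely as a hypothetical sketch of that last idea (assuming AF_VSOCK were usable on the macOS side at all, which is the open question here): one could open the vsock connection in user space and hand the connected fd to v9fs via trans=fd. The helper name and the vsock port below are made up for illustration only.

#include <sys/socket.h>
#include <unistd.h>
#include <linux/vm_sockets.h>

/* Hypothetical: connect to a host that (by assumption) serves 9p on a
 * vsock port, then hand the fd to the kernel client with
 * mount -t 9p -o trans=fd,rfdno=<fd>,wfdno=<fd> ... */
static int open_9p_vsock_fd(unsigned int port)
{
	struct sockaddr_vm addr = {
		.svm_family = AF_VSOCK,
		.svm_cid = VMADDR_CID_HOST,
		.svm_port = port,
	};
	int fd = socket(AF_VSOCK, SOCK_STREAM, 0);

	if (fd < 0)
		return -1;
	if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
		close(fd);
		return -1;
	}
	return fd;
}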

> For the next series mentioned by me (getting rid of the 512k msize capping) I 
> first need to readup on the Linux kernel code. According to the discussion we 
> already had about that issue, the reason for that capping was that the amount 
> of virtio elements is currently hard coded in the Linux client's virtio 
> transport:
> 
> On Saturday, 27 February 2021 01:03:40 CEST Dominique Martinet wrote:
> > Christian Schoenebeck wrote on Fri, Feb 26, 2021 at 02:49:12PM +0100:
> > > Right now the client uses a hard coded amount of 128 elements. So what
> > > about replacing VIRTQUEUE_NUM by a variable which is initialized with a
> > > value according to the user's requested 'msize' option at init time?
> > > 
> > > According to the virtio specs the max. amount of elements in a virtqueue
> > > is
> > > 32768. So 32768 * 4k = 128M as new upper limit would already be a
> > > significant improvement and would not require too many changes to the
> > > client code, right?
> > The current code inits the chan->sg at probe time (when driver is
> > loaded) and not mount time, and it is currently embedded in the chan
> > struct, so that would need allocating at mount time (p9_client_create ;
> > either resizing if required or not sharing) but it doesn't sound too
> > intrusive yes.
> > 
> > I don't see more adherences to VIRTQUEUE_NUM that would hurt trying.
> 
> So probably as a first step I would only re-allocate the virtio elements 
> according to the user supplied (i.e. large) 'msize' value if the 9p driver is 
> not in use at that point, and stick to capping otherwise. That should probably 
> be fine for many users and would avoid need for some synchronization measures 
> and the traps it might bring. But again, I still need to read more of the 
> Linux client code first.

Right, looking at it again, p9_virtio_create would already allow that, so
you just need to pick the most appropriate channel or create a new one
if required; synchronization shouldn't be too much of a problem.

The 9p code is sometimes impressive in its foresight ;)

> BTW just in case I have not mentioned it before: there are some developer docs 
> for the QEMU 9p server implementation now:
> https://wiki.qemu.org/Documentation/9p

Wow, that's an impressive wiki page.
I've never been good with documentation (be it writing or using it), but
hopefully it'll help people make the first step -- good work!
Dominique Martinet Sept. 6, 2021, 12:07 a.m. UTC | #4
Eric Van Hensbergen wrote on Sun, Sep 05, 2021 at 06:44:13PM -0500:
> there will likely be a tradeoff with tcp in terms of latency to first
> message so while
> absolute bw may be higher processing time may suffer.  8k was default msize
> to more closely match it to jumbo frames on an ethernet.  of course all
> that intuition is close to 30 years out of date….

It's not because the max size is 128k (or 1MB) that this much is sent
over the wire every time -- if a message used to fit in 8KB, then its
on-the-wire size won't change and speed/latency won't be affected for
these.

For messages that do require more than 8KB (read/write/readdir), you
can fit more data per message, so for a given userspace request (feed me
xyz amount of data) you'll have fewer client-server round-trips, and the
final user-reflected latency will be better as well -- that's why
e.g. NFS has been setting a max size of 1MB by default for a while now,
and they allow even more (32MB iirc? not sure)

I only did these tests years ago and no longer have access to
the notes written back then, but TCP also definitely benefits
from >64k msize as long as there's enough memory available.


The downside (because it's not free) is there though: you need more
memory for 9p with big buffers even if we didn't need that much in the
first place.
The code now using a slab means that the memory is no longer locked per
mount as it used to be, but it also means allocations can fail if there is
big memory pressure after the buffers have been released. OTOH, as long as
it's used consistently the buffers will be recycled, so it's not necessarily
too bad performance-wise in hot periods.
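
To make that trade-off concrete, here is a minimal sketch (not the actual net/9p client code; the cache name and flags are placeholders) of what an msize-sized slab buys and costs:

#include <linux/slab.h>

/* Illustrative only: a slab sized to msize recycles buffers while the
 * cache is warm, but a fresh allocation can still fail under memory
 * pressure once cached objects have been reclaimed. */
static int fcall_cache_example(unsigned int msize)
{
	struct kmem_cache *cache;
	void *buf;

	cache = kmem_cache_create("9p-fcall-example", msize, 0, 0, NULL);
	if (!cache)
		return -ENOMEM;

	buf = kmem_cache_alloc(cache, GFP_NOFS);	/* may fail under pressure */
	if (buf)
		kmem_cache_free(cache, buf);

	kmem_cache_destroy(cache);
	return buf ? 0 : -ENOMEM;
}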

Ideally we'd need to rework transports to allow scatter-gather (iovec or
similar API), and work with smaller allocations batched together on
send, but I don't have time for something like that... If we do that we
can probably get the best of both worlds -- and could consider >1MB, but
that's unrealistic as of now, regardless of the transport.
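
To give a rough idea of that direction (nothing like this exists in the tree; the struct and field names are made up purely for illustration), such a scatter-gather-aware transport interface might carry a set of small buffers instead of one contiguous msize-sized allocation:

#include <linux/uio.h>

/* Made-up illustration of an iovec-style request descriptor; not an
 * existing interface in net/9p. */
struct p9_req_segments_example {
	struct kvec *out_iov;	/* request payload, possibly many small chunks */
	int out_nr;
	struct kvec *in_iov;	/* reply buffers */
	int in_nr;
};
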
Christian Schoenebeck Sept. 10, 2021, 2:04 p.m. UTC | #5
On Sunday, 5 September 2021 23:48:04 CEST Dominique Martinet wrote:
> Christian Schoenebeck wrote on Sun, Sep 05, 2021 at 03:33:11PM +0200:
> > > Subject: [PATCH] net/9p: increase tcp max msize to 1MB
> > 
> > Yes, makes sense.
> > 
> > Is the TCP transport driver of the Linux 9p client somewhat equally mature
> > to the virtio transport driver? Because I still have macOS support for 9p
> > in QEMU on my TODO list and accordingly a decision for a specific
> > transport type for macOS will be needed.
> 
> I'd say it should be just about as stable as virtio.. It's definitely
> getting a lot of tests like syzkaller as it's easy to open an arbitrary
> fd and pass that to kernel for fuzzing.
> 
> Performance-wise you won't have zero-copy, and the monolithic memory
> blocks requirement is the same, so it won't be as good as virtio but it
> can probably be used. The problem will more be one of setting -- for
> user networking we can always use qemu's networking implementation, but
> for passthrough or tap qemu won't easily be discoverable from the VM,
> I'm not sure about that.
> Does AF_VSOCK work on macOS? that could probably easily be added to
> trans_fd...

I haven't yet looked into what the latest options are on macOS. It will
probably take a while before I have time for that, as it is not a high
priority for me.

With macOS 11 Apple added many new restrictions on what can still be done
beyond user space. For virtualization, for instance, you would now use the
macOS-provided hypervisor. Apple sent a bunch of patches last year to support
their HVF (Hypervisor.framework) in QEMU.

> > On Saturday, 27 February 2021 01:03:40 CEST Dominique Martinet wrote:
> > > Christian Schoenebeck wrote on Fri, Feb 26, 2021 at 02:49:12PM +0100:
> > > > Right now the client uses a hard coded amount of 128 elements. So what
> > > > about replacing VIRTQUEUE_NUM by a variable which is initialized with
> > > > a
> > > > value according to the user's requested 'msize' option at init time?
> > > > 
> > > > According to the virtio specs the max. amount of elements in a
> > > > virtqueue
> > > > is
> > > > 32768. So 32768 * 4k = 128M as new upper limit would already be a
> > > > significant improvement and would not require too many changes to the
> > > > client code, right?
> > > 
> > > The current code inits the chan->sg at probe time (when driver is
> > > loaded) and not mount time, and it is currently embedded in the chan
> > > struct, so that would need allocating at mount time (p9_client_create ;
> > > either resizing if required or not sharing) but it doesn't sound too
> > > intrusive yes.
> > > 
> > > I don't see more adherences to VIRTQUEUE_NUM that would hurt trying.
> > 
> > So probably as a first step I would only re-allocate the virtio elements
> > according to the user supplied (i.e. large) 'msize' value if the 9p driver
> > is not in use at that point, and stick to capping otherwise. That should
> > probably be fine for many users and would avoid need for some
> > synchronization measures and the traps it might bring. But again, I still
> > need to read more of the Linux client code first.
> 
> Right, looking at it again p9_virtio_create would already allow that so
> you just need to pick the most appropriate channel or create a new one
> if required, synchronization shouldn't be too much of a problem.
> 
> The 9p code is sometimes impressive in its foresight ;)

I just realized this is going to be much more work than I expected. Apparently
a single scatterlist allocation is limited to SG_MAX_SINGLE_ALLOC entries,
which is around 128 or slightly more:

/*
 * Maximum number of entries that will be allocated in one piece, if
 * a list larger than this is required then chaining will be utilized.
 */
#define SG_MAX_SINGLE_ALLOC		(PAGE_SIZE / sizeof(struct scatterlist))

/*
 * The maximum number of SG segments that we will put inside a
 * scatterlist (unless chaining is used). Should ideally fit inside a
 * single page, to avoid a higher order allocation.  We could define this
 * to SG_MAX_SINGLE_ALLOC to pack correctly at the highest order.  The
 * minimum value is 32
 */
#define SG_CHUNK_SIZE	128

Which explains the current hard-coded value of 128 sglist elements in the 9p
virtio transport (and, at 4k per page, also the current 512k msize capping):

#define VIRTQUEUE_NUM	128

So struct virtio_chan would need to switch from a plain scatterlist array to a
different type, probably sg_table. The latter API automatically chains sg
lists if the required number of entries exceeds SG_MAX_SINGLE_ALLOC; it
repurposes the last sglist element of each chunk as a link to the next chunk.
However, chaining is not supported on some architectures:

int __sg_alloc_table(struct sg_table *table, unsigned int nents,
		     unsigned int max_ents, struct scatterlist *first_chunk,
		     unsigned int nents_first_chunk, gfp_t gfp_mask,
		     sg_alloc_fn *alloc_fn)
{
	...
#ifdef CONFIG_ARCH_NO_SG_CHAIN
	if (WARN_ON_ONCE(nents > max_ents))
		return -EINVAL;
#endif
	...
}

So those architectures would still end up with the current limitation of 128
sglist elements in the 9p virtio transport.

And it is also not clear to me whether the Linux sg_table API allows growing
a table that has already been allocated; say the virtio transport retains the
current pre-allocation of the hard-coded 128 elements in early
p9_virtio_probe() and would then just append more sglists in
p9_virtio_create() if needed because of a large msize option provided by the
user. It "looks" like sg_alloc_table() might allow appending, but I am not sure.
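
For reference, just to illustrate sizing an sg_table from msize (this is not how the transport does it today, and it says nothing about growing an already allocated table), a minimal allocation sketch could look like:

#include <linux/kernel.h>
#include <linux/scatterlist.h>

/* Illustrative only: allocate a (possibly chained) sg_table with one entry
 * per page of the requested msize. sg_alloc_table() chains further chunks
 * transparently, except with CONFIG_ARCH_NO_SG_CHAIN; free it again with
 * sg_free_table(). */
static int alloc_sgt_for_msize(struct sg_table *sgt, unsigned int msize)
{
	unsigned int nents = DIV_ROUND_UP(msize, PAGE_SIZE);

	return sg_alloc_table(sgt, nents, GFP_KERNEL);
}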

Another candidate type would be struct sg_append_table, but its API wants a
complete list of pages to be provided up front when the table is allocated:

/**
 * sg_alloc_append_table_from_pages - Allocate and initialize an append sg
 *                                    table from an array of pages
 * @sgt_append:  The sg append table to use
 * @pages:       Pointer to an array of page pointers
 * @n_pages:     Number of pages in the pages array
 * @offset:      Offset from start of the first page to the start of a buffer
 * @size:        Number of valid bytes in the buffer (after offset)
 * @max_segment: Maximum size of a scatterlist element in bytes
 * @left_pages:  Left pages caller have to set after this call
 * @gfp_mask:	 GFP allocation mask
 *
 * Description:
 *    In the first call it allocate and initialize an sg table from a list of
 *    pages, else reuse the scatterlist from sgt_append. Contiguous ranges of
 *    the pages are squashed into a single scatterlist entry up to the maximum
 *    size specified in @max_segment.  A user may provide an offset at a start
 *    and a size of valid data in a buffer specified by the page array. The
 *    returned sg table is released by sg_free_append_table
 *
 * Returns:
 *   0 on success, negative error on failure
 *
 * Notes:
 *   If this function returns non-0 (eg failure), the caller must call
 *   sg_free_append_table() to cleanup any leftover allocations.
 *
 *   In the fist call, sgt_append must by initialized.
 */
int sg_alloc_append_table_from_pages(struct sg_append_table *sgt_append,
		struct page **pages, unsigned int n_pages, unsigned int offset,
		unsigned long size, unsigned int max_segment,
		unsigned int left_pages, gfp_t gfp_mask)
{
	...
}

Which does not really fit here (ATM), as the 9p virtio transport apparently
maps the pages on demand, not already when the client/transport is created.

Either way, the functions pack_sg_list() and pack_sg_list_p() in the 9p
virtio transport would additionally need to be adjusted to run like:

	int pack_sg_list() {
		FOR_EACH(sglist in sgtable) {
			for (i in 0..127) {
				... sglist[i] ...
			}
		}
	}

Which is not really a trivial change set. :/
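
For what it's worth, a chained sg_table can at least be walked with a single for_each_sg() loop, since sg_next() follows chain links transparently; a rough sketch (chan_sgt_example and its sgt field are hypothetical, not the current chan->sg array):

#include <linux/scatterlist.h>

/* Hypothetical: a channel that carries an sg_table instead of a fixed
 * scatterlist array. */
struct chan_sgt_example {
	struct sg_table sgt;
};

/* for_each_sg() hops across chained sglist chunks, so one flat loop
 * replaces the nested loops of the pseudocode above. */
static void walk_chan_sgt(struct chan_sgt_example *chan)
{
	struct scatterlist *sg;
	int i;

	for_each_sg(chan->sgt.sgl, sg, chan->sgt.nents, i) {
		/* pack or consume one sg entry here */
	}
}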

On Monday, 6 September 2021 02:07:56 CEST Dominique Martinet wrote:
> Eric Van Hensbergen wrote on Sun, Sep 05, 2021 at 06:44:13PM -0500:
> > there will likely be a tradeoff with tcp in terms of latency to first
> > message so while
> > absolute bw may be higher processing time may suffer.  8k was default
> > msize
> > to more closely match it to jumbo frames on an ethernet.  of course all
> > that intuition is close to 30 years out of date….
> 
> It's not because the max size is 128k (or 1MB) that this much is sent
> over the wire every time -- if a message used to fit in 8KB, then its
> on-the-wire size won't change and speed/latency won't be affected for
> these.
> 
> For messages that do require more than 8KB (read/write/readdir) then you
> can fit more data per message, so for a given userspace request (feed me
> xyz amount of data) you'll have fewer client-server round-trips, and the
> final user-reflected latency will be better as well -- that's why
> e.g. NFS has been setting a max size of 1MB by default for a while now,
> and they allow even more (32MB iirc? not sure)

I can only speak for the QEMU server <-> virtio <-> Linux client setup
for now:

In this setup there is definitely no latency disadvantage in raising msize.
It is actually the exact opposite: low msize values significantly increase
overall latency, because for each 9p message there is not only the latency of
the message transfer between server and client, but also latency added on the
server side when handling each request, as the server has to dispatch between
the fs driver worker thread(s) and the server main thread at least twice per
9p request, in some cases even more often.

That's actually something I'm working on: reducing this to its theoretical
limit of 2 thread hops for every 9p request type on the QEMU server side (WIP).

The only bottleneck situation I can think of when raising msize to a giant
value is a guest app constantly reading/writing with the maximum chunk size
allowed by msize (which 'cat' for instance always does, as it uses stat's
st_blksize). In that case 9p requests from other threads/processes would have
to wait, as that single 9p consumer would occupy all available pages with a
single, huge 9p request.

That could maybe be addressed by multiplying msize by the number of available
CPU cores, or something like that, when allocating the sg table on the client
transport side.

> I only did these tests years ago and no longer have access to
> the note that was written back then, but TCP also definitely benefits
> from > 64k msize as long as there's enough memory available.
> 
> The downside (because it's not free) is there though, you need more
> memory for 9p with big buffers even if we didn't need so much in the
> first place.
> The code using a slab now means that the memory is not locked per mount
> as it used to, but that also means allocations can fail if there is a
> big pressure after not having been released. OTOH as long as it's
> consistently used the buffers will be recycled so it's not necessarily
> too bad performance-wise in hot periods.
> 
> Ideally we'd need to rework transports to allow scatter-gather (iovec or
> similar API), and work with smaller allocations batched together on
> send, but I don't have time for something like that... If we do that we
> can probably get the best of both worlds -- and could consider >1MB, but
> that's unrealistic as of now, regardless of the transport.

Ok, but that's not an issue for the virtio transport, is it? The virtio
transport already uses scatter-gather. AFAICS, in the virtio case the issue
might rather be that it claims pages on demand instead of pinning them when
the client/virtio channel is created. So it could run out of free pages
during its lifetime, I guess.
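
For context, the "on demand" part refers to how the virtio transport builds the sg list per request, roughly like this (a simplified sketch from memory, not a verbatim excerpt of trans_virtio.c, with all locking and error handling omitted):

	/* Simplified sketch of the request path: buffers are packed into
	 * chan->sg only when a request is submitted, i.e. pages are claimed
	 * on demand rather than pinned when the channel is created. */
	struct scatterlist *sgs[2];
	int out, in;

	out = pack_sg_list(chan->sg, 0, VIRTQUEUE_NUM,
			   req->tc.sdata, req->tc.size);
	in = pack_sg_list(chan->sg, out, VIRTQUEUE_NUM,
			  req->rc.sdata, req->rc.capacity);

	sgs[0] = chan->sg;		/* out (request) buffers */
	sgs[1] = chan->sg + out;	/* in (reply) buffers */
	virtqueue_add_sgs(chan->vq, sgs, 1, 1, req, GFP_ATOMIC);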

Best regards,
Christian Schoenebeck

Patch

diff --git a/net/9p/client.c b/net/9p/client.c
index 1cb255587fff..213f12ed76cd 100644
--- a/net/9p/client.c
+++ b/net/9p/client.c
@@ -30,7 +30,7 @@ 
 #define CREATE_TRACE_POINTS
 #include <trace/events/9p.h>
 
-#define DEFAULT_MSIZE 8192
+#define DEFAULT_MSIZE (128 * 1024)
 
 /*
   * Client Option Parsing (code inspired by NFS code)