Message ID | 20221108050521.3198458-1-jonathan.lemon@gmail.com (mailing list archive) |
---|---|
Series | zero-copy RX for io_uring |
On Mon, Nov 07, 2022 at 09:05:06PM -0800, Jonathan Lemon wrote:
>This series introduces network RX zerocopy for io_uring.
>
>This is an evolution of the earlier zctap work, re-targeted to use
>io_uring as the userspace API. The code intends to provide a
>ZC RX path for upper-level networking protocols (aka TCP and UDP),
>with a focus on host-provided memory (not GPU memory).
>
>This patch series contains the upper-level core code required for
>operation, but does not contain the network driver side changes
>required for true zero-copy operation. The io_uring RECV_ZC opcode
>will work without hardware support, albeit in copy mode.
>
>The intent is to use a network driver which provides header/data
>splitting, so the frame header which is processed by the networking
>stack is not placed in user memory.
>
>The code is successfully receiving a zero-copy TCP stream from a
>remote sender.
>
>There is a liburing fork providing the needed wrappers:
>
>  https://github.com/jlemon/liburing/tree/zctap
>
>which contains an examples/io_uring-net test application exercising
>these features. A sample run:
>
>  # ./io_uring-net -i eth1 -q 20 -p 9999 -r 3000
>  copy bytes: 1938872
>  ZC bytes: 996683008
>  Total bytes: 998621880, nsec:1025219375
>  Rate: 7.79 Gb/s
>
>If no queue is specified, then non-zc mode is used:
>
>  # ./io_uring-net -p 9999
>  copy bytes: 998621880
>  ZC bytes: 0
>  Total bytes: 998621880, nsec:1051515726
>  Rate: 7.60 Gb/s

I haven't dug into your test case yet, but the performance data looks
disappointing. I don't know why we need zerocopy if we can't get a big
performance gain.

Have you tested large messages with jumbo frames or LRO enabled?

Thanks

>
>There is also an iperf3 fork:
>
>  https://github.com/jlemon/iperf/tree/io_uring
>
>This allows running single tests with either:
> * select (normal iperf3)
> * io_uring READ
> * io_uring RECV_ZC copy mode
> * io_uring RECV_ZC hardware mode
>
>Current testing shows similar BW between RECV_ZC and READ modes
>(running at 22 Gbit/sec), but a reduction of ~50% in MemBW.
>
>High level description:
>
>The application allocates a frame backing store and provides it
>to the kernel for use. An interface queue is requested from the
>networking device, and incoming frames are deposited into the provided
>memory region. The NIC should provide a header splitting feature, so
>only the frame payload is placed in the user space area.
>
>Responsibility for correctly steering incoming frames to the queue
>is outside the scope of this work - it is assumed that the user
>has set steering rules up separately.
>
>Incoming frames are sent up the stack as skb's and eventually
>land in the application's socket receive queue. This differs
>from AF_XDP, which delivers raw frames directly to userspace,
>without protocol processing.
>
>The RECV_ZC opcode then returns an iov[] style vector which points
>to the data in userspace memory. When the application has completed
>processing of the data, the buffers are returned back to the kernel
>through a fill ring for reuse.
>
>Jonathan Lemon (15):
>  io_uring: add zctap ifq definition
>  netdevice: add SETUP_ZCTAP to the netdev_bpf structure
>  io_uring: add register ifq opcode
>  io_uring: create a zctap region for a mapped buffer
>  io_uring: mark pages in ifq region with zctap information.
>  io_uring: Provide driver API for zctap packet buffers.
>  io_uring: Allocate zctap device buffers and dma map them.
>  io_uring: Add zctap buffer get/put functions and refcounting.
>  skbuff: Introduce SKBFL_FIXED_FRAG and skb_fixed()
>  io_uring: Allocate a uarg for use by the ifq RX
>  io_uring: Define the zctap iov[] returned to the user.
>  io_uring: add OP_RECV_ZC command.
>  io_uring: Make remove_ifq_region a delayed work call
>  io_uring: Add a buffer caching mechanism for zctap.
>  io_uring: Notify the application as the fillq is drained.
>
> include/linux/io_uring.h       |   47 ++
> include/linux/io_uring_types.h |   12 +
> include/linux/netdevice.h      |    6 +
> include/linux/skbuff.h         |   10 +-
> include/uapi/linux/io_uring.h  |   24 +
> io_uring/Makefile              |    3 +-
> io_uring/io_uring.c            |    8 +
> io_uring/kbuf.c                |   13 +
> io_uring/kbuf.h                |    2 +
> io_uring/net.c                 |  123 ++++
> io_uring/opdef.c               |   15 +
> io_uring/zctap.c               | 1001 ++++++++++++++++++++++++++++++++
> io_uring/zctap.h               |   31 +
> 13 files changed, 1293 insertions(+), 2 deletions(-)
> create mode 100644 io_uring/zctap.c
> create mode 100644 io_uring/zctap.h
>
>--
>2.30.2
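
For orientation, here is a rough sketch of the receive flow the cover
letter describes: register a user memory region and interface queue,
submit RECV_ZC, walk the returned iov[] entries, and recycle each buffer
through the fill ring. The zctap-specific names below (struct zctap_iov,
io_uring_register_ifq, io_uring_prep_recv_zc, io_uring_zctap_recycle)
and the completion semantics are illustrative placeholders, not the
actual API exported by the liburing fork; only the stock liburing calls
(io_uring_get_sqe, io_uring_submit, io_uring_wait_cqe, io_uring_cqe_seen)
are real.

/*
 * Sketch of the described RECV_ZC flow.  The zctap wrapper names are
 * placeholders; only the overall sequence follows the cover letter.
 */
#include <liburing.h>
#include <stddef.h>
#include <sys/mman.h>

struct zctap_iov {                      /* placeholder for a returned iov[] entry */
	void         *base;
	unsigned int  len;
};

/* Placeholder prototypes standing in for the fork's wrappers. */
int  io_uring_register_ifq(struct io_uring *ring, const char *ifname,
			   int queue_id, void *region, size_t region_len);
void io_uring_prep_recv_zc(struct io_uring_sqe *sqe, int sockfd,
			   struct zctap_iov *iov, unsigned int nr_iov);
void io_uring_zctap_recycle(struct io_uring *ring, void *buf);

static void consume_payload(const void *data, size_t len)
{
	/* Application-specific processing of the zero-copy payload. */
	(void)data;
	(void)len;
}

static int recv_zc_loop(struct io_uring *ring, int sockfd,
			const char *ifname, int queue_id)
{
	size_t region_len = 64UL * 1024 * 1024;
	struct zctap_iov iov[16];
	void *region;

	/* Frame backing store that the NIC queue will fill with payload. */
	region = mmap(NULL, region_len, PROT_READ | PROT_WRITE,
		      MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
	if (region == MAP_FAILED)
		return -1;

	/* Hand the region and the device queue to the kernel. */
	if (io_uring_register_ifq(ring, ifname, queue_id, region, region_len))
		return -1;

	for (;;) {
		struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
		struct io_uring_cqe *cqe;
		int i, nr;

		if (!sqe)
			break;
		io_uring_prep_recv_zc(sqe, sockfd, iov, 16);
		io_uring_submit(ring);

		if (io_uring_wait_cqe(ring, &cqe))
			break;
		nr = cqe->res;	/* assumed: count of iov[] entries filled */
		io_uring_cqe_seen(ring, cqe);
		if (nr <= 0)
			break;

		for (i = 0; i < nr; i++) {
			/* Payload already sits in user memory; no copy. */
			consume_payload(iov[i].base, iov[i].len);
			/* Return the buffer via the fill ring for reuse. */
			io_uring_zctap_recycle(ring, iov[i].base);
		}
	}
	return 0;
}

The point, matching the cover letter, is that the iov[] entries point
straight into the registered region, so the application consumes payload
in place and its only per-buffer obligation is returning the buffer to
the fill ring.
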
On Wed, Nov 09, 2022 at 02:37:42PM +0800, Dust Li wrote:
>
> I haven't dug into your test case yet, but the performance data
> looks disappointing.
>
> I don't know why we need zerocopy if we can't get a big performance
> gain.

The cited numbers are intended to show that there is no network
bandwidth performance drop from using RECV_ZC. For the systems under
test (25Gb, 50Gb), we are getting line rate already, so there wouldn't
be any BW improvement.

The zerocopy gains come from a reduction in memory bandwidth, removing
the user memcpy overhead and reducing CPU usage. This is why I put up
the changes for iperf3, so CPU and membw usage metrics can be collected
while under network load.

> Have you tested large messages with jumbo frames or LRO enabled?

Haven't tested jumbo frames, but these are with GRO support.