Message ID | 20241204172204.4180482-17-dw@davidwei.uk (mailing list archive) |
---|---|
State | New |
Series | io_uring zero copy rx |
On Wed, Dec 4, 2024 at 9:23 AM David Wei <dw@davidwei.uk> wrote:
>
> Add documentation for io_uring zero copy Rx that explains requirements
> and the user API.
>
> Signed-off-by: David Wei <dw@davidwei.uk>
> ---
>  Documentation/networking/iou-zcrx.rst | 201 ++++++++++++++++++++++++++
>  1 file changed, 201 insertions(+)
>  create mode 100644 Documentation/networking/iou-zcrx.rst
>
> diff --git a/Documentation/networking/iou-zcrx.rst b/Documentation/networking/iou-zcrx.rst
> new file mode 100644
> index 000000000000..0a3af8c08c7e
> --- /dev/null
> +++ b/Documentation/networking/iou-zcrx.rst
> @@ -0,0 +1,201 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +
> +=====================
> +io_uring zero copy Rx
> +=====================
> +
> +Introduction
> +============
> +
> +io_uring zero copy Rx (ZC Rx) is a feature that removes kernel-to-user copy on
> +the network receive path, allowing packet data to be received directly into
> +userspace memory. This feature is different to TCP_ZEROCOPY_RECEIVE in that
> +there are no strict alignment requirements and no need to mmap()/munmap().
> +Compared to kernel bypass solutions such as e.g. DPDK, the packet headers are
> +processed by the kernel TCP stack as normal.
> +
> +NIC HW Requirements
> +===================
> +
> +Several NIC HW features are required for io_uring ZC Rx to work. For now the
> +kernel API does not configure the NIC and it must be done by the user.
> +
> +Header/data split
> +-----------------
> +
> +Required to split packets at the L4 boundary into a header and a payload.
> +Headers are received into kernel memory as normal and processed by the TCP
> +stack as normal. Payloads are received into userspace memory directly.
> +
> +Flow steering
> +-------------
> +
> +Specific HW Rx queues are configured for this feature, but modern NICs
> +typically distribute flows across all HW Rx queues. Flow steering is required
> +to ensure that only desired flows are directed towards HW queues that are
> +configured for io_uring ZC Rx.
> +
> +RSS
> +---
> +
> +In addition to flow steering above, RSS is required to steer all other non-zero
> +copy flows away from queues that are configured for io_uring ZC Rx.
> +
> +Usage
> +=====
> +
> +Setup NIC
> +---------
> +
> +Must be done out of band for now.

I would remove any 'for now' instances in the docs. Uapis are going to
be maintained as-is for posterity. Even if you in the future add new
APIs which auto-configure headersplit/flow steering/rss, I'm guessing
the current API would live on for backward compatibility reasons.

> +
> +Ensure there are enough queues::

Was not clear to me what are enough queues. Technically you only need
2 queues, right? (one for iozcrx and one for normal traffic).

> +
> +  ethtool -L eth0 combined 32
> +
> +Enable header/data split::
> +
> +  ethtool -G eth0 tcp-data-split on
> +
> +Carve out half of the HW Rx queues for zero copy using RSS::
> +
> +  ethtool -X eth0 equal 16
> +
> +Set up flow steering::
> +
> +  ethtool -N eth0 flow-type tcp6 ... action 16
> +
> +Setup io_uring
> +--------------
> +
> +This section describes the low level io_uring kernel API. Please refer to
> +liburing documentation for how to use the higher level API.
> +
> +Create an io_uring instance with the following required setup flags::
> +
> +  IORING_SETUP_SINGLE_ISSUER
> +  IORING_SETUP_DEFER_TASKRUN
> +  IORING_SETUP_CQE32
> +
> +Create memory area
> +------------------
> +
> +Allocate userspace memory area for receiving zero copy data::
> +
> +  void *area_ptr = mmap(NULL, area_size,
> +                        PROT_READ | PROT_WRITE,
> +                        MAP_ANONYMOUS | MAP_PRIVATE,
> +                        0, 0);
> +
> +Create refill ring
> +------------------
> +
> +Allocate memory for a shared ringbuf used for returning consumed buffers::
> +
> +  void *ring_ptr = mmap(NULL, ring_size,
> +                        PROT_READ | PROT_WRITE,
> +                        MAP_ANONYMOUS | MAP_PRIVATE,
> +                        0, 0);
> +
> +This refill ring consists of some space for the header, followed by an array of
> +``struct io_uring_zcrx_rqe``::
> +
> +  size_t rq_entries = 4096;
> +  size_t ring_size = rq_entries * sizeof(struct io_uring_zcrx_rqe) + PAGE_SIZE;
> +  /* align to page size */
> +  ring_size = (ring_size + (PAGE_SIZE - 1)) & ~(PAGE_SIZE - 1);
> +
> +Register ZC Rx
> +--------------
> +
> +Fill in registration structs::
> +
> +  struct io_uring_zcrx_area_reg area_reg = {
> +    .addr = (__u64)(unsigned long)area_ptr,
> +    .len = area_size,
> +    .flags = 0,
> +  };
> +
> +  struct io_uring_region_desc region_reg = {
> +    .user_addr = (__u64)(unsigned long)ring_ptr,
> +    .size = ring_size,
> +    .flags = IORING_MEM_REGION_TYPE_USER,
> +  };
> +
> +  struct io_uring_zcrx_ifq_reg reg = {
> +    .if_idx = if_nametoindex("eth0"),
> +    /* this is the HW queue with desired flow steered into it */
> +    .if_rxq = 16,
> +    .rq_entries = rq_entries,
> +    .area_ptr = (__u64)(unsigned long)&area_reg,
> +    .region_ptr = (__u64)(unsigned long)&region_reg,
> +  };
> +
> +Register with kernel::
> +
> +  io_uring_register_ifq(ring, &reg);
> +
> +Map refill ring
> +---------------
> +
> +The kernel fills in fields for the refill ring in the registration ``struct
> +io_uring_zcrx_ifq_reg``. Map it into userspace::
> +
> +  struct io_uring_zcrx_rq refill_ring;
> +
> +  refill_ring.khead = (unsigned *)((char *)ring_ptr + reg.offsets.head);
> +  refill_ring.khead = (unsigned *)((char *)ring_ptr + reg.offsets.tail);
> +  refill_ring.rqes =
> +    (struct io_uring_zcrx_rqe *)((char *)ring_ptr + reg.offsets.rqes);
> +  refill_ring.rq_tail = 0;
> +  refill_ring.ring_ptr = ring_ptr;
> +
> +Receiving data
> +--------------
> +
> +Prepare a zero copy recv request::
> +
> +  struct io_uring_sqe *sqe;
> +
> +  sqe = io_uring_get_sqe(ring);
> +  io_uring_prep_rw(IORING_OP_RECV_ZC, sqe, fd, NULL, 0, 0);
> +  sqe->ioprio |= IORING_RECV_MULTISHOT;
> +
> +Now, submit and wait::
> +
> +  io_uring_submit_and_wait(ring, 1);
> +
> +Finally, process completions::
> +
> +  struct io_uring_cqe *cqe;
> +  unsigned int count = 0;
> +  unsigned int head;
> +
> +  io_uring_for_each_cqe(ring, head, cqe) {
> +    struct io_uring_zcrx_cqe *rcqe = (struct io_uring_zcrx_cqe *)(cqe + 1);
> +
> +    unsigned char *data = area_ptr + (rcqe->off & IORING_ZCRX_AREA_MASK);
> +    /* do something with the data */
> +
> +    count++;
> +  }
> +  io_uring_cq_advance(ring, count);
> +
> +Recycling buffers
> +-----------------
> +
> +Return buffers back to the kernel to be used again::
> +
> +  struct io_uring_zcrx_rqe *rqe;
> +  unsigned mask = refill_ring.ring_entries - 1;
> +  rqe = &refill_ring.rqes[refill_ring.rq_tail & mask];
> +
> +  area_offset = rcqe->off & IORING_ZCRX_AREA_MASK;
> +  rqe->off = area_offset | area_reg.rq_area_token;
> +  rqe->len = cqe->res;
> +  IO_URING_WRITE_ONCE(*refill_ring.ktail, ++refill_ring.rq_tail);
> +

Ah, I see why it's difficult for you to napi_pp_put_page() refilled
pages. In part it's because the refill code in the userspace never
traps into the kernel. Got it.

I guess I read from the other thread that there are some (significant?)
changes in the next version to incorporate feedback from Jakub, so I
guess I'll wait for that before commenting further.

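The observation above, that the refill path is a plain shared-memory write with no system call, can be illustrated with a small sketch. It is not the patch's code: `struct rqe_sketch`, `struct rq_sketch` and `refill_push()` are hypothetical stand-ins, the ring-full check is an addition that the documented snippet does not show, and C11 atomics are swapped in for `IO_URING_WRITE_ONCE` so that the example compiles without liburing.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical stand-in for struct io_uring_zcrx_rqe; the field layout is
 * an assumption made only so that the sketch is self-contained. */
struct rqe_sketch {
	uint64_t off;
	uint32_t len;
	uint32_t pad;
};

/* Hypothetical stand-in for the userspace view of the refill ring. */
struct rq_sketch {
	_Atomic uint32_t *khead;   /* consumer index, advanced by the kernel */
	_Atomic uint32_t *ktail;   /* producer index, advanced by userspace */
	struct rqe_sketch *rqes;   /* array of ring_entries slots */
	uint32_t rq_tail;          /* userspace's private copy of the tail */
	uint32_t ring_entries;     /* assumed to be a power of two */
};

/* Queue one consumed buffer for reuse. Returns 0 on success or -1 when
 * the ring is full (the full check is an assumption of this sketch). */
static int refill_push(struct rq_sketch *rq, uint64_t area_offset,
		       uint64_t area_token, uint32_t len)
{
	uint32_t head = atomic_load_explicit(rq->khead, memory_order_acquire);
	uint32_t mask = rq->ring_entries - 1;
	struct rqe_sketch *rqe;

	if (rq->rq_tail - head >= rq->ring_entries)
		return -1;

	rqe = &rq->rqes[rq->rq_tail & mask];
	rqe->off = area_offset | area_token;
	rqe->len = len;

	/* Publish the entry: the release store orders the rqe writes
	 * before the new tail value becomes visible to the consumer. */
	atomic_store_explicit(rq->ktail, ++rq->rq_tail, memory_order_release);
	return 0;
}
```

Nothing in `refill_push()` enters the kernel; the consumer only learns about returned buffers when it next reads the shared tail, which is the property the comment above points at.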
On Wed, Dec 4, 2024 at 9:23 AM David Wei <dw@davidwei.uk> wrote:
>
> Add documentation for io_uring zero copy Rx that explains requirements
> and the user API.
>
> Signed-off-by: David Wei <dw@davidwei.uk>
> ---
>  Documentation/networking/iou-zcrx.rst | 201 ++++++++++++++++++++++++++
>  1 file changed, 201 insertions(+)
>  create mode 100644 Documentation/networking/iou-zcrx.rst
>
> diff --git a/Documentation/networking/iou-zcrx.rst b/Documentation/networking/iou-zcrx.rst

I think you need a link to Documentation/networking/index.rst to point
to your new docs.

....

> +Testing
> +=======
> +
> +See ``tools/testing/selftests/net/iou-zcrx.c``

Link is wrong I think. The path in this series is
tools/testing/selftests/drivers/net/hw/iou-zcrx.c.

On 2024-12-09 09:51, Mina Almasry wrote:
> On Wed, Dec 4, 2024 at 9:23 AM David Wei <dw@davidwei.uk> wrote:
>>
>> Add documentation for io_uring zero copy Rx that explains requirements
>> and the user API.
>>
>> Signed-off-by: David Wei <dw@davidwei.uk>
>> ---
>>  Documentation/networking/iou-zcrx.rst | 201 ++++++++++++++++++++++++++
>>  1 file changed, 201 insertions(+)
>>  create mode 100644 Documentation/networking/iou-zcrx.rst
>>
>> diff --git a/Documentation/networking/iou-zcrx.rst b/Documentation/networking/iou-zcrx.rst
>> new file mode 100644
>> index 000000000000..0a3af8c08c7e
>> --- /dev/null
...
>> +Usage
>> +=====
>> +
>> +Setup NIC
>> +---------
>> +
>> +Must be done out of band for now.
>
> I would remove any 'for now' instances in the docs. Uapis are going to
> be maintained as-is for posterity. Even if you in the future add new
> APIs which auto-configure headersplit/flow steering/rss, I'm guessing
> the current API would live on for backward compatibility reasons.

The UAPI will be extended in a way that does not affect backwards
compatibility e.g. extending io_uring_zcrx_ifq_reg. Such that any of the
following will work:

1. Configure NIC using ethtool
2. Call io_uring_register_ifq()

1. Configure NIC using new io_uring UAPI
2. Call io_uring_register_ifq()

1. Configure NIC by setting new fields in io_uring_zcrx_ifq_reg
2. Call io_uring_register_ifq()

Therefore the "for now" is intended.

>
>> +
>> +Ensure there are enough queues::
>
> Was not clear to me what are enough queues. Technically you only need
> 2 queues, right? (one for iozcrx and one for normal traffic).

I'll change it to "ensure there are at least two queues".

On 2024-12-09 09:52, Mina Almasry wrote:
> On Wed, Dec 4, 2024 at 9:23 AM David Wei <dw@davidwei.uk> wrote:
>>
>> Add documentation for io_uring zero copy Rx that explains requirements
>> and the user API.
>>
>> Signed-off-by: David Wei <dw@davidwei.uk>
>> ---
>>  Documentation/networking/iou-zcrx.rst | 201 ++++++++++++++++++++++++++
>>  1 file changed, 201 insertions(+)
>>  create mode 100644 Documentation/networking/iou-zcrx.rst
>>
>> diff --git a/Documentation/networking/iou-zcrx.rst b/Documentation/networking/iou-zcrx.rst
>
> I think you need a link to Documentation/networking/index.rst to point
> to your new docs.
>
> ....
>
>> +Testing
>> +=======
>> +
>> +See ``tools/testing/selftests/net/iou-zcrx.c``
>
> Link is wrong I think. The path in this series is
> tools/testing/selftests/drivers/net/hw/iou-zcrx.c.
>

Thanks, will fix both.

diff --git a/Documentation/networking/iou-zcrx.rst b/Documentation/networking/iou-zcrx.rst
new file mode 100644
index 000000000000..0a3af8c08c7e
--- /dev/null
+++ b/Documentation/networking/iou-zcrx.rst
@@ -0,0 +1,201 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=====================
+io_uring zero copy Rx
+=====================
+
+Introduction
+============
+
+io_uring zero copy Rx (ZC Rx) is a feature that removes kernel-to-user copy on
+the network receive path, allowing packet data to be received directly into
+userspace memory. This feature is different to TCP_ZEROCOPY_RECEIVE in that
+there are no strict alignment requirements and no need to mmap()/munmap().
+Compared to kernel bypass solutions such as e.g. DPDK, the packet headers are
+processed by the kernel TCP stack as normal.
+
+NIC HW Requirements
+===================
+
+Several NIC HW features are required for io_uring ZC Rx to work. For now the
+kernel API does not configure the NIC and it must be done by the user.
+
+Header/data split
+-----------------
+
+Required to split packets at the L4 boundary into a header and a payload.
+Headers are received into kernel memory as normal and processed by the TCP
+stack as normal. Payloads are received into userspace memory directly.
+
+Flow steering
+-------------
+
+Specific HW Rx queues are configured for this feature, but modern NICs
+typically distribute flows across all HW Rx queues. Flow steering is required
+to ensure that only desired flows are directed towards HW queues that are
+configured for io_uring ZC Rx.
+
+RSS
+---
+
+In addition to flow steering above, RSS is required to steer all other non-zero
+copy flows away from queues that are configured for io_uring ZC Rx.
+
+Usage
+=====
+
+Setup NIC
+---------
+
+Must be done out of band for now.
+
+Ensure there are enough queues::
+
+  ethtool -L eth0 combined 32
+
+Enable header/data split::
+
+  ethtool -G eth0 tcp-data-split on
+
+Carve out half of the HW Rx queues for zero copy using RSS::
+
+  ethtool -X eth0 equal 16
+
+Set up flow steering::
+
+  ethtool -N eth0 flow-type tcp6 ... action 16
+
+Setup io_uring
+--------------
+
+This section describes the low level io_uring kernel API. Please refer to
+liburing documentation for how to use the higher level API.
+
+Create an io_uring instance with the following required setup flags::
+
+  IORING_SETUP_SINGLE_ISSUER
+  IORING_SETUP_DEFER_TASKRUN
+  IORING_SETUP_CQE32
+
+Create memory area
+------------------
+
+Allocate userspace memory area for receiving zero copy data::
+
+  void *area_ptr = mmap(NULL, area_size,
+                        PROT_READ | PROT_WRITE,
+                        MAP_ANONYMOUS | MAP_PRIVATE,
+                        0, 0);
+
+Create refill ring
+------------------
+
+Allocate memory for a shared ringbuf used for returning consumed buffers::
+
+  void *ring_ptr = mmap(NULL, ring_size,
+                        PROT_READ | PROT_WRITE,
+                        MAP_ANONYMOUS | MAP_PRIVATE,
+                        0, 0);
+
+This refill ring consists of some space for the header, followed by an array of
+``struct io_uring_zcrx_rqe``::
+
+  size_t rq_entries = 4096;
+  size_t ring_size = rq_entries * sizeof(struct io_uring_zcrx_rqe) + PAGE_SIZE;
+  /* align to page size */
+  ring_size = (ring_size + (PAGE_SIZE - 1)) & ~(PAGE_SIZE - 1);
+
+Register ZC Rx
+--------------
+
+Fill in registration structs::
+
+  struct io_uring_zcrx_area_reg area_reg = {
+    .addr = (__u64)(unsigned long)area_ptr,
+    .len = area_size,
+    .flags = 0,
+  };
+
+  struct io_uring_region_desc region_reg = {
+    .user_addr = (__u64)(unsigned long)ring_ptr,
+    .size = ring_size,
+    .flags = IORING_MEM_REGION_TYPE_USER,
+  };
+
+  struct io_uring_zcrx_ifq_reg reg = {
+    .if_idx = if_nametoindex("eth0"),
+    /* this is the HW queue with desired flow steered into it */
+    .if_rxq = 16,
+    .rq_entries = rq_entries,
+    .area_ptr = (__u64)(unsigned long)&area_reg,
+    .region_ptr = (__u64)(unsigned long)&region_reg,
+  };
+
+Register with kernel::
+
+  io_uring_register_ifq(ring, &reg);
+
+Map refill ring
+---------------
+
+The kernel fills in fields for the refill ring in the registration ``struct
+io_uring_zcrx_ifq_reg``. Map it into userspace::
+
+  struct io_uring_zcrx_rq refill_ring;
+
+  refill_ring.khead = (unsigned *)((char *)ring_ptr + reg.offsets.head);
+  refill_ring.khead = (unsigned *)((char *)ring_ptr + reg.offsets.tail);
+  refill_ring.rqes =
+    (struct io_uring_zcrx_rqe *)((char *)ring_ptr + reg.offsets.rqes);
+  refill_ring.rq_tail = 0;
+  refill_ring.ring_ptr = ring_ptr;
+
+Receiving data
+--------------
+
+Prepare a zero copy recv request::
+
+  struct io_uring_sqe *sqe;
+
+  sqe = io_uring_get_sqe(ring);
+  io_uring_prep_rw(IORING_OP_RECV_ZC, sqe, fd, NULL, 0, 0);
+  sqe->ioprio |= IORING_RECV_MULTISHOT;
+
+Now, submit and wait::
+
+  io_uring_submit_and_wait(ring, 1);
+
+Finally, process completions::
+
+  struct io_uring_cqe *cqe;
+  unsigned int count = 0;
+  unsigned int head;
+
+  io_uring_for_each_cqe(ring, head, cqe) {
+    struct io_uring_zcrx_cqe *rcqe = (struct io_uring_zcrx_cqe *)(cqe + 1);
+
+    unsigned char *data = area_ptr + (rcqe->off & IORING_ZCRX_AREA_MASK);
+    /* do something with the data */
+
+    count++;
+  }
+  io_uring_cq_advance(ring, count);
+
+Recycling buffers
+-----------------
+
+Return buffers back to the kernel to be used again::
+
+  struct io_uring_zcrx_rqe *rqe;
+  unsigned mask = refill_ring.ring_entries - 1;
+  rqe = &refill_ring.rqes[refill_ring.rq_tail & mask];
+
+  area_offset = rcqe->off & IORING_ZCRX_AREA_MASK;
+  rqe->off = area_offset | area_reg.rq_area_token;
+  rqe->len = cqe->res;
+  IO_URING_WRITE_ONCE(*refill_ring.ktail, ++refill_ring.rq_tail);
+
+Testing
+=======
+
+See ``tools/testing/selftests/net/iou-zcrx.c``

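The refill ring sizing and mapping steps in the patch can be checked with a small standalone sketch. The struct names below (`rqe_sketch`, `offsets_sketch`, `rq_sketch`) and both helper functions are hypothetical stand-ins rather than the real UAPI definitions, and the 16-byte entry layout is an assumption. Note that the "Map refill ring" snippet in the patch assigns `refill_ring.khead` twice; the sketch assumes the second assignment was meant to set `ktail`, since the recycling step later writes through `refill_ring.ktail`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct io_uring_zcrx_rqe (layout assumed). */
struct rqe_sketch {
	uint64_t off;
	uint32_t len;
	uint32_t pad;
};

/* Hypothetical stand-in for the offsets the kernel fills in at
 * registration time (reg.offsets in the patch). */
struct offsets_sketch {
	uint32_t head;
	uint32_t tail;
	uint32_t rqes;
};

/* Hypothetical stand-in for the userspace view of the refill ring. */
struct rq_sketch {
	uint32_t *khead;
	uint32_t *ktail;
	struct rqe_sketch *rqes;
	uint32_t rq_tail;
};

/* One page of header space plus the entry array, rounded up to a whole
 * number of pages. page_size must be a power of two. */
static size_t refill_ring_size(size_t rq_entries, size_t page_size)
{
	size_t size = rq_entries * sizeof(struct rqe_sketch) + page_size;

	return (size + (page_size - 1)) & ~(page_size - 1);
}

/* Resolve the head, tail and entry array pointers inside the ring
 * memory from the byte offsets reported by the kernel. */
static void map_refill_ring(struct rq_sketch *rq, void *ring_ptr,
			    const struct offsets_sketch *off)
{
	char *base = ring_ptr;

	rq->khead = (uint32_t *)(base + off->head);
	rq->ktail = (uint32_t *)(base + off->tail); /* distinct from khead */
	rq->rqes = (struct rqe_sketch *)(base + off->rqes);
	rq->rq_tail = 0;
}
```

With the values from the patch (4096 entries) and assuming 4 KiB pages and 16-byte entries, `refill_ring_size(4096, 4096)` comes to 17 pages: 16 for the entry array and one for the header.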
Add documentation for io_uring zero copy Rx that explains requirements
and the user API.

Signed-off-by: David Wei <dw@davidwei.uk>
---
 Documentation/networking/iou-zcrx.rst | 201 ++++++++++++++++++++++++++
 1 file changed, 201 insertions(+)
 create mode 100644 Documentation/networking/iou-zcrx.rst