[RFC,RESEND,00/11] Zero copy network RX using io_uring

Message ID 20230826011954.1801099-1-dw@davidwei.uk (mailing list archive)

Message

David Wei Aug. 26, 2023, 1:19 a.m. UTC
From: David Wei <davidhwei@meta.com>

This patchset is a proposal that adds zero copy network RX to io_uring.
With it, userspace can register a region of host memory for receiving
data directly from a NIC using DMA, without needing a kernel-to-user
copy.
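
As a rough illustration of the userspace side, the sketch below only
allocates a page-aligned region of host memory of the kind that could be
handed to the kernel. The names rx_region, rx_region_alloc() and
rx_region_free() are hypothetical and are not part of liburing or of
this patchset; the registration call itself is deliberately left out,
since the uAPI is defined by the patches.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper type, not part of liburing or this patchset. */
struct rx_region {
	void  *base;	/* page-aligned start of the region */
	size_t len;	/* length in bytes, a multiple of the page size */
};

/*
 * Round len up to a whole number of pages and map anonymous memory.
 * Returns 0 on success, -1 on failure.
 */
static int rx_region_alloc(struct rx_region *r, size_t len)
{
	long page = sysconf(_SC_PAGESIZE);
	size_t pg, rounded;
	void *p;

	if (page <= 0 || len == 0)
		return -1;

	pg = (size_t)page;
	rounded = (len + pg - 1) / pg * pg;

	p = mmap(NULL, rounded, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return -1;

	/* mmap() always returns page-aligned addresses. */
	assert(((uintptr_t)p % pg) == 0);

	r->base = p;
	r->len = rounded;
	return 0;
}

/* Unmap the region and reset the descriptor. */
static void rx_region_free(struct rx_region *r)
{
	if (r->base)
		munmap(r->base, r->len);
	r->base = NULL;
	r->len = 0;
}
```

A caller would fill a struct rx_region with rx_region_alloc(), register
it through whatever interface the patchset exposes, and release it with
rx_region_free() once the ring is torn down.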

Software support is added to the Broadcom BNXT driver. Hardware support
for receive flow steering and header splitting is required.

On the userspace side, a sample server is added in this branch of
liburing:
https://github.com/spikeh/liburing/tree/zcrx2

Build liburing as normal, and run examples/zcrx. Then, set flow steering
rules using ethtool. A sample shell script is included in
examples/zcrx_flow.sh, but you need to change the source IP. Finally,
connect a client using e.g. netcat and send data.
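
To show why the flow steering step matters, here is a toy model of what
a steering rule does: packets matching a rule are delivered to a chosen
RX queue, everything else goes to a default queue. This is a sketch
under assumptions, not the NIC's or ethtool's actual behaviour; the
names toy_rule and toy_steer() are made up for illustration, and real
hardware typically falls back to RSS rather than a single default queue.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical steering rule: match on source IP and destination port. */
struct toy_rule {
	uint32_t src_ip;	/* IPv4 source address, host byte order */
	uint16_t dst_port;	/* TCP destination port */
	unsigned queue;		/* RX queue to deliver matching packets to */
};

/*
 * Return the RX queue for a packet: the queue of the first matching
 * rule, or default_queue when no rule matches.
 */
static unsigned toy_steer(const struct toy_rule *rules, size_t n,
			  uint32_t src_ip, uint16_t dst_port,
			  unsigned default_queue)
{
	size_t i;

	for (i = 0; i < n; i++) {
		if (rules[i].src_ip == src_ip &&
		    rules[i].dst_port == dst_port)
			return rules[i].queue;
	}
	return default_queue;
}
```

In this model, only the client whose source IP matches the rule reaches
the queue set up for zero copy, which is presumably why the sample
script needs the source IP changed.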

This patchset + userspace code was tested on an Intel Xeon Platinum
8321HC CPU and Broadcom BCM57504 NIC.

Early benchmarks using this prototype, with iperf3 as a load generator,
showed a ~50% reduction in overall system memory bandwidth as measured
using perf counters. Note that DDIO must be disabled on Intel systems.

Mina et al. from Google and Kuba are collaborating on a similar proposal
for ZC from NIC to devmem. There is a lot of shared functionality in
netdev that we can collaborate on, e.g.:
* Page pool memory provider backend and resource registration
* Page pool refcounted iov/buf representation and lifecycle
* Setting receive flow steering

As mentioned earlier, this is an early prototype. It is brittle, some
functionality is missing and there's little optimisation. We're looking
for feedback on the overall approach and points of collaboration in
netdev. Known limitations:
* No copy fallback: if the payload ends up in the linear part of the
  skb, the code will not work
* No way to pin an RX queue to a specific CPU
* Only one ifq, one pool region, one RX queue...
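
The first limitation can be illustrated with a toy model of a received
packet. The struct below is NOT the kernel's struct sk_buff, and the
names toy_pkt, toy_linear_payload() and toy_can_zero_copy() are
hypothetical. The sketch only assumes that, with header splitting,
headers land in the linear area and payload lands in page fragments.

```c
#include <stddef.h>

/* Toy model of a received packet; not the kernel's struct sk_buff. */
struct toy_pkt {
	size_t hdr_len;		/* protocol headers, e.g. Ethernet + IP + TCP */
	size_t linear_len;	/* bytes held in the linear (head) area */
	size_t frag_len;	/* bytes held in page fragments */
};

/* Number of payload bytes that ended up in the linear area. */
static size_t toy_linear_payload(const struct toy_pkt *p)
{
	return p->linear_len > p->hdr_len ? p->linear_len - p->hdr_len : 0;
}

/*
 * Without a copy fallback, delivery only works when every payload
 * byte sits in the fragments. Returns 1 if so, 0 otherwise.
 */
static int toy_can_zero_copy(const struct toy_pkt *p)
{
	return toy_linear_payload(p) == 0;
}
```

With a 54-byte header, a header-split packet {54, 54, 1448} passes the
check, while an unsplit packet {54, 1502, 0} leaves 1448 payload bytes
in the linear area and would need the missing copy fallback.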

This patchset is based on the work by Jonathan Lemon
<jonathan.lemon@gmail.com>:
https://lore.kernel.org/io-uring/20221108050521.3198458-1-jonathan.lemon@gmail.com/

David Wei (11):
  io_uring: add interface queue
  io_uring: add mmap support for shared ifq ringbuffers
  netdev: add XDP_SETUP_ZC_RX command
  io_uring: setup ZC for an RX queue when registering an ifq
  io_uring: add ZC buf and pool
  io_uring: add ZC pool API
  skbuff: add SKBFL_FIXED_FRAG and skb_fixed()
  io_uring: allocate a uarg for freeing zero copy skbs
  io_uring: delay ZC pool destruction
  netdev/bnxt: add data pool and use it in BNXT driver
  io_uring: add io_recvzc request

 drivers/net/ethernet/broadcom/bnxt/bnxt.c     |  59 +-
 drivers/net/ethernet/broadcom/bnxt/bnxt.h     |   4 +
 drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c |   3 +
 include/linux/io_uring.h                      |  31 +
 include/linux/io_uring_types.h                |   6 +
 include/linux/netdevice.h                     |  11 +
 include/linux/skbuff.h                        |  10 +-
 include/net/data_pool.h                       |  96 +++
 include/uapi/linux/io_uring.h                 |  53 ++
 io_uring/Makefile                             |   3 +-
 io_uring/io_uring.c                           |  13 +
 io_uring/kbuf.c                               |  30 +
 io_uring/kbuf.h                               |   5 +
 io_uring/net.c                                |  83 +-
 io_uring/opdef.c                              |  16 +
 io_uring/zc_rx.c                              | 723 ++++++++++++++++++
 io_uring/zc_rx.h                              |  42 +
 17 files changed, 1168 insertions(+), 20 deletions(-)
 create mode 100644 include/net/data_pool.h
 create mode 100644 io_uring/zc_rx.c
 create mode 100644 io_uring/zc_rx.h

Comments

Gal Pressman Oct. 22, 2023, 7:06 p.m. UTC | #1
Hello David,

This work looks interesting. Is there anywhere I can read more about
it? Maybe it was presented (and hopefully recorded) at a recent
conference?
Maybe something geared towards adding support for more drivers?

I took a brief look at the bnxt patch and saw you converted the page
pool allocation to data pool allocation. I assume this is done for data
pages only, right? Headers are still allocated on page pool pages?

Thanks
David Wei Oct. 23, 2023, 3:35 a.m. UTC | #2
On 2023-10-22 12:06, Gal Pressman wrote:
> Hello David,
> 
> This work looks interesting. Is there anywhere I can read more about
> it? Maybe it was presented (and hopefully recorded) at a recent
> conference?
> Maybe something geared towards adding support for more drivers?
> 

Hi Gal,

Thank you for your interest in our work! We will be publishing a paper
and presenting this work at the NetDev conference on 1 Nov.

Support for more drivers (e.g. mlx5) is definitely on our radar. We are
collaborating with Mina and others from Google who are working on a
similar proposal, but targeting ZC RX from the NIC into GPU memory. Both
efforts need shared infrastructure, e.g. page pool memory providers,
which will replace the one-off data_pool used in this patchset. This
should minimise the driver changes needed to support this feature.

> I took a brief look at the bnxt patch and saw you converted the page
> pool allocation to data pool allocation. I assume this is done for data
> pages only, right? Headers are still allocated on page pool pages?
> 
> Thanks

Yes, that's right.

David