Message ID | 20241007221603.1703699-14-dw@davidwei.uk (mailing list archive)
---|---
State | New
Series | io_uring zero copy rx
On 10/07, David Wei wrote:
> From: Pavel Begunkov <asml.silence@gmail.com>
>
> There are scenarios in which the zerocopy path might get a normal
> in-kernel buffer, it could be a mis-steered packet or simply the linear
> part of an skb. Another use case is to allow the driver to allocate
> kernel pages when it's out of zc buffers, which makes it more resilient
> to spikes in load and allow the user to choose the balance between the
> amount of memory provided and performance.

Tangential: should there be some clear way for the users to discover that
(some counter or some entry on cq about copy fallback)?

Or the expectation is that somebody will run bpftrace to diagnose
(supposedly) poor ZC performance when it falls back to copy?
On 10/8/24 16:58, Stanislav Fomichev wrote:
> On 10/07, David Wei wrote:
>> From: Pavel Begunkov <asml.silence@gmail.com>
>>
>> There are scenarios in which the zerocopy path might get a normal
>> in-kernel buffer, it could be a mis-steered packet or simply the linear
>> part of an skb. Another use case is to allow the driver to allocate
>> kernel pages when it's out of zc buffers, which makes it more resilient
>> to spikes in load and allow the user to choose the balance between the
>> amount of memory provided and performance.
>
> Tangential: should there be some clear way for the users to discover that
> (some counter or some entry on cq about copy fallback)?
>
> Or the expectation is that somebody will run bpftrace to diagnose
> (supposedly) poor ZC performance when it falls back to copy?

We had some notification for testing before, but that's been left out of
the series and deferred to follow-up patches to keep it simple. We can
post a special CQE to notify the user about that from time to time, which
should be just fine as it's a slow path.
On 2024-10-08 08:58, Stanislav Fomichev wrote:
> On 10/07, David Wei wrote:
>> From: Pavel Begunkov <asml.silence@gmail.com>
>>
>> There are scenarios in which the zerocopy path might get a normal
>> in-kernel buffer, it could be a mis-steered packet or simply the linear
>> part of an skb. Another use case is to allow the driver to allocate
>> kernel pages when it's out of zc buffers, which makes it more resilient
>> to spikes in load and allow the user to choose the balance between the
>> amount of memory provided and performance.
>
> Tangential: should there be some clear way for the users to discover that
> (some counter or some entry on cq about copy fallback)?
>
> Or the expectation is that somebody will run bpftrace to diagnose
> (supposedly) poor ZC performance when it falls back to copy?

Yeah there definitely needs to be a way to notify the user that copy
fallback happened. Right now I'm relying on bpftrace hooking into
io_zcrx_copy_chunk(). Doing it per cqe (which is emitted per frag) is
too much. I can think of two other options:

1. Send a final cqe at the end of a number of frag cqes with a count of
   the number of copies.
2. Register a secondary area just for handling copies.

Other suggestions are also very welcome.
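[Editor's note: as a rough illustration of the bpftrace approach mentioned
above, a one-liner along these lines could tally fallbacks. It is a sketch,
not from the thread: it assumes io_zcrx_copy_chunk() is not inlined and is
visible as a kprobe symbol, and that arg4 maps to the function's len
parameter per the signature in the patch below.]

# Hedged sketch: count copy-fallback invocations and sum the bytes copied.
# arg4 corresponds to the `len` argument of io_zcrx_copy_chunk().
bpftrace -e 'kprobe:io_zcrx_copy_chunk { @fallbacks = count(); @bytes = sum(arg4); }'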
On 10/08, David Wei wrote:
> On 2024-10-08 08:58, Stanislav Fomichev wrote:
> > On 10/07, David Wei wrote:
> >> From: Pavel Begunkov <asml.silence@gmail.com>
> >>
> >> There are scenarios in which the zerocopy path might get a normal
> >> in-kernel buffer, it could be a mis-steered packet or simply the linear
> >> part of an skb. Another use case is to allow the driver to allocate
> >> kernel pages when it's out of zc buffers, which makes it more resilient
> >> to spikes in load and allow the user to choose the balance between the
> >> amount of memory provided and performance.
> >
> > Tangential: should there be some clear way for the users to discover that
> > (some counter or some entry on cq about copy fallback)?
> >
> > Or the expectation is that somebody will run bpftrace to diagnose
> > (supposedly) poor ZC performance when it falls back to copy?
>
> Yeah there definitely needs to be a way to notify the user that copy
> fallback happened. Right now I'm relying on bpftrace hooking into
> io_zcrx_copy_chunk(). Doing it per cqe (which is emitted per frag) is
> too much. I can think of two other options:
>
> 1. Send a final cqe at the end of a number of frag cqes with a count of
>    the number of copies.
> 2. Register a secondary area just for handling copies.
>
> Other suggestions are also very welcome.

SG, thanks. Up to you and Pavel on the mechanism and whether to follow
up separately. Maybe even move this fallback (this patch) into that separate
series as well? Will be easier to review/accept the rest.
On 10/7/24 4:16 PM, David Wei wrote:
> From: Pavel Begunkov <asml.silence@gmail.com>
>
> There are scenarios in which the zerocopy path might get a normal
> in-kernel buffer, it could be a mis-steered packet or simply the linear
> part of an skb. Another use case is to allow the driver to allocate
> kernel pages when it's out of zc buffers, which makes it more resilient
> to spikes in load and allow the user to choose the balance between the
> amount of memory provided and performance.
>
> At the moment we fail such requests. Instead, grab a buffer from the
> page pool, copy data there, and return back to user in the usual way.
> Because the refill ring is private to the napi our page pool is running
> from, it's done by stopping the napi via napi_execute() helper. It grabs
> only one buffer, which is inefficient, and improving it is left for
> follow up patches.

This also looks fine to me. Agree with the sentiment that it'd be nice
to propagate back if copies are happening, and to what extent. But also
agree that this can wait.
On 10/9/24 17:30, Stanislav Fomichev wrote:
> On 10/08, David Wei wrote:
>> On 2024-10-08 08:58, Stanislav Fomichev wrote:
>>> On 10/07, David Wei wrote:
>>>> From: Pavel Begunkov <asml.silence@gmail.com>
>>>>
>>>> There are scenarios in which the zerocopy path might get a normal
>>>> in-kernel buffer, it could be a mis-steered packet or simply the linear
>>>> part of an skb. Another use case is to allow the driver to allocate
>>>> kernel pages when it's out of zc buffers, which makes it more resilient
>>>> to spikes in load and allow the user to choose the balance between the
>>>> amount of memory provided and performance.
>>>
>>> Tangential: should there be some clear way for the users to discover that
>>> (some counter or some entry on cq about copy fallback)?
>>>
>>> Or the expectation is that somebody will run bpftrace to diagnose
>>> (supposedly) poor ZC performance when it falls back to copy?
>>
>> Yeah there definitely needs to be a way to notify the user that copy
>> fallback happened. Right now I'm relying on bpftrace hooking into
>> io_zcrx_copy_chunk(). Doing it per cqe (which is emitted per frag) is
>> too much. I can think of two other options:
>>
>> 1. Send a final cqe at the end of a number of frag cqes with a count of
>>    the number of copies.
>> 2. Register a secondary area just for handling copies.
>>
>> Other suggestions are also very welcome.
>
> SG, thanks. Up to you and Pavel on the mechanism and whether to follow
> up separately. Maybe even move this fallback (this patch) into that separate
> series as well? Will be easier to review/accept the rest.

I think it's fine to leave it? It shouldn't be particularly
interesting to the net folks to review, and without it any skb
with the linear part would break it, but perhaps it's not such
a concern for bnxt.
On 2024-10-09 16:05, Pavel Begunkov wrote:
> On 10/9/24 17:30, Stanislav Fomichev wrote:
>> On 10/08, David Wei wrote:
>>> On 2024-10-08 08:58, Stanislav Fomichev wrote:
>>>> On 10/07, David Wei wrote:
>>>>> From: Pavel Begunkov <asml.silence@gmail.com>
>>>>>
>>>>> There are scenarios in which the zerocopy path might get a normal
>>>>> in-kernel buffer, it could be a mis-steered packet or simply the linear
>>>>> part of an skb. Another use case is to allow the driver to allocate
>>>>> kernel pages when it's out of zc buffers, which makes it more resilient
>>>>> to spikes in load and allow the user to choose the balance between the
>>>>> amount of memory provided and performance.
>>>>
>>>> Tangential: should there be some clear way for the users to discover that
>>>> (some counter or some entry on cq about copy fallback)?
>>>>
>>>> Or the expectation is that somebody will run bpftrace to diagnose
>>>> (supposedly) poor ZC performance when it falls back to copy?
>>>
>>> Yeah there definitely needs to be a way to notify the user that copy
>>> fallback happened. Right now I'm relying on bpftrace hooking into
>>> io_zcrx_copy_chunk(). Doing it per cqe (which is emitted per frag) is
>>> too much. I can think of two other options:
>>>
>>> 1. Send a final cqe at the end of a number of frag cqes with a count of
>>>    the number of copies.
>>> 2. Register a secondary area just for handling copies.
>>>
>>> Other suggestions are also very welcome.
>>
>> SG, thanks. Up to you and Pavel on the mechanism and whether to follow
>> up separately. Maybe even move this fallback (this patch) into that separate
>> series as well? Will be easier to review/accept the rest.
>
> I think it's fine to leave it? It shouldn't be particularly
> interesting to the net folks to review, and without it any skb
> with the linear part would break it, but perhaps it's not such
> a concern for bnxt.

My preference is to leave it. Actually from real workloads, fully
linearized skbs are not uncommon due to the minimum size for HDS to kick
in for bnxt. Taking this out would imo make this patchset functionally
broken. Since we're all in agreement here, let's defer the improvements
as a follow-up.
On 10/10, David Wei wrote:
> On 2024-10-09 16:05, Pavel Begunkov wrote:
> > On 10/9/24 17:30, Stanislav Fomichev wrote:
> >> On 10/08, David Wei wrote:
> >>> On 2024-10-08 08:58, Stanislav Fomichev wrote:
> >>>> On 10/07, David Wei wrote:
> >>>>> From: Pavel Begunkov <asml.silence@gmail.com>
> >>>>>
> >>>>> There are scenarios in which the zerocopy path might get a normal
> >>>>> in-kernel buffer, it could be a mis-steered packet or simply the linear
> >>>>> part of an skb. Another use case is to allow the driver to allocate
> >>>>> kernel pages when it's out of zc buffers, which makes it more resilient
> >>>>> to spikes in load and allow the user to choose the balance between the
> >>>>> amount of memory provided and performance.
> >>>>
> >>>> Tangential: should there be some clear way for the users to discover that
> >>>> (some counter or some entry on cq about copy fallback)?
> >>>>
> >>>> Or the expectation is that somebody will run bpftrace to diagnose
> >>>> (supposedly) poor ZC performance when it falls back to copy?
> >>>
> >>> Yeah there definitely needs to be a way to notify the user that copy
> >>> fallback happened. Right now I'm relying on bpftrace hooking into
> >>> io_zcrx_copy_chunk(). Doing it per cqe (which is emitted per frag) is
> >>> too much. I can think of two other options:
> >>>
> >>> 1. Send a final cqe at the end of a number of frag cqes with a count of
> >>>    the number of copies.
> >>> 2. Register a secondary area just for handling copies.
> >>>
> >>> Other suggestions are also very welcome.
> >>
> >> SG, thanks. Up to you and Pavel on the mechanism and whether to follow
> >> up separately. Maybe even move this fallback (this patch) into that separate
> >> series as well? Will be easier to review/accept the rest.
> >
> > I think it's fine to leave it? It shouldn't be particularly
> > interesting to the net folks to review, and without it any skb
> > with the linear part would break it, but perhaps it's not such
> > a concern for bnxt.
>
> My preference is to leave it. Actually from real workloads, fully
> linearized skbs are not uncommon due to the minimum size for HDS to kick
> in for bnxt. Taking this out would imo make this patchset functionally
> broken. Since we're all in agreement here, let's defer the improvements
> as a follow-up.

Sounds good!
diff --git a/io_uring/zcrx.c b/io_uring/zcrx.c
index 8166d8a2656e..d21e7017deb3 100644
--- a/io_uring/zcrx.c
+++ b/io_uring/zcrx.c
@@ -5,6 +5,8 @@
 #include <linux/nospec.h>
 #include <linux/netdevice.h>
 #include <linux/io_uring.h>
+#include <linux/skbuff_ref.h>
+#include <net/busy_poll.h>
 #include <net/page_pool/helpers.h>
 #include <trace/events/page_pool.h>
 #include <net/tcp.h>
@@ -28,6 +30,11 @@ struct io_zcrx_args {
         struct socket *sock;
 };
 
+struct io_zc_refill_data {
+        struct io_zcrx_ifq *ifq;
+        struct net_iov *niov;
+};
+
 static inline struct io_zcrx_area *io_zcrx_iov_to_area(const struct net_iov *niov)
 {
         struct net_iov_area *owner = net_iov_owner(niov);
@@ -35,6 +42,13 @@ static inline struct io_zcrx_area *io_zcrx_iov_to_area(const struct net_iov *nio
         return container_of(owner, struct io_zcrx_area, nia);
 }
 
+static inline struct page *io_zcrx_iov_page(const struct net_iov *niov)
+{
+        struct io_zcrx_area *area = io_zcrx_iov_to_area(niov);
+
+        return area->pages[net_iov_idx(niov)];
+}
+
 static int io_allocate_rbuf_ring(struct io_zcrx_ifq *ifq,
                                  struct io_uring_zcrx_ifq_reg *reg)
 {
@@ -475,6 +489,34 @@ const struct memory_provider_ops io_uring_pp_zc_ops = {
         .scrub = io_pp_zc_scrub,
 };
 
+static void io_napi_refill(void *data)
+{
+        struct io_zc_refill_data *rd = data;
+        struct io_zcrx_ifq *ifq = rd->ifq;
+        netmem_ref netmem;
+
+        if (WARN_ON_ONCE(!ifq->pp))
+                return;
+
+        netmem = page_pool_alloc_netmem(ifq->pp, GFP_ATOMIC | __GFP_NOWARN);
+        if (!netmem)
+                return;
+        if (WARN_ON_ONCE(!netmem_is_net_iov(netmem)))
+                return;
+
+        rd->niov = netmem_to_net_iov(netmem);
+}
+
+static struct net_iov *io_zc_get_buf_task_safe(struct io_zcrx_ifq *ifq)
+{
+        struct io_zc_refill_data rd = {
+                .ifq = ifq,
+        };
+
+        napi_execute(ifq->napi_id, io_napi_refill, &rd);
+        return rd.niov;
+}
+
 static bool io_zcrx_queue_cqe(struct io_kiocb *req, struct net_iov *niov,
                               struct io_zcrx_ifq *ifq, int off, int len)
 {
@@ -498,6 +540,45 @@ static bool io_zcrx_queue_cqe(struct io_kiocb *req, struct net_iov *niov,
         return true;
 }
 
+static ssize_t io_zcrx_copy_chunk(struct io_kiocb *req, struct io_zcrx_ifq *ifq,
+                                  void *data, unsigned int offset, size_t len)
+{
+        size_t copy_size, copied = 0;
+        int ret = 0, off = 0;
+        struct page *page;
+        u8 *vaddr;
+
+        do {
+                struct net_iov *niov;
+
+                niov = io_zc_get_buf_task_safe(ifq);
+                if (!niov) {
+                        ret = -ENOMEM;
+                        break;
+                }
+
+                page = io_zcrx_iov_page(niov);
+                vaddr = kmap_local_page(page);
+                copy_size = min_t(size_t, PAGE_SIZE, len);
+                memcpy(vaddr, data + offset, copy_size);
+                kunmap_local(vaddr);
+
+                if (!io_zcrx_queue_cqe(req, niov, ifq, off, copy_size)) {
+                        napi_pp_put_page(net_iov_to_netmem(niov));
+                        return -ENOSPC;
+                }
+
+                io_zcrx_get_buf_uref(niov);
+                napi_pp_put_page(net_iov_to_netmem(niov));
+
+                offset += copy_size;
+                len -= copy_size;
+                copied += copy_size;
+        } while (offset < len);
+
+        return copied ? copied : ret;
+}
+
 static int io_zcrx_recv_frag(struct io_kiocb *req, struct io_zcrx_ifq *ifq,
                              const skb_frag_t *frag, int off, int len)
 {
@@ -505,8 +586,24 @@ static int io_zcrx_recv_frag(struct io_kiocb *req, struct io_zcrx_ifq *ifq,
 
         off += skb_frag_off(frag);
 
-        if (unlikely(!skb_frag_is_net_iov(frag)))
-                return -EOPNOTSUPP;
+        if (unlikely(!skb_frag_is_net_iov(frag))) {
+                struct page *page = skb_frag_page(frag);
+                u32 p_off, p_len, t, copied = 0;
+                u8 *vaddr;
+                int ret = 0;
+
+                skb_frag_foreach_page(frag, off, len,
+                                      page, p_off, p_len, t) {
+                        vaddr = kmap_local_page(page);
+                        ret = io_zcrx_copy_chunk(req, ifq, vaddr, p_off, p_len);
+                        kunmap_local(vaddr);
+
+                        if (ret < 0)
+                                return copied ? copied : ret;
+                        copied += ret;
+                }
+                return copied;
+        }
 
         niov = netmem_to_net_iov(frag->netmem);
         if (niov->pp->mp_ops != &io_uring_pp_zc_ops ||
@@ -527,15 +624,29 @@ io_zcrx_recv_skb(read_descriptor_t *desc, struct sk_buff *skb,
         struct io_zcrx_ifq *ifq = args->ifq;
         struct io_kiocb *req = args->req;
         struct sk_buff *frag_iter;
-        unsigned start, start_off;
+        unsigned start, start_off = offset;
         int i, copy, end, off;
         int ret = 0;
 
-        start = skb_headlen(skb);
-        start_off = offset;
+        if (unlikely(offset < skb_headlen(skb))) {
+                ssize_t copied;
+                size_t to_copy;
 
-        if (offset < start)
-                return -EOPNOTSUPP;
+                to_copy = min_t(size_t, skb_headlen(skb) - offset, len);
+                copied = io_zcrx_copy_chunk(req, ifq, skb->data, offset, to_copy);
+                if (copied < 0) {
+                        ret = copied;
+                        goto out;
+                }
+                offset += copied;
+                len -= copied;
+                if (!len)
+                        goto out;
+                if (offset != skb_headlen(skb))
+                        goto out;
+        }
+
+        start = skb_headlen(skb);
 
         for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) {
                 const skb_frag_t *frag;