Message ID | 20241016185252.3746190-3-dw@davidwei.uk
---|---
State | New
Series | io_uring zero copy rx
On Wed, Oct 16, 2024 at 11:52:39AM -0700, David Wei wrote:
> From: Pavel Begunkov <asml.silence@gmail.com>
>
> Currently net_iov stores a pointer to struct dmabuf_genpool_chunk_owner,
> which serves as a useful abstraction to share data and provide a
> context. However, it's too devmem specific, and we want to reuse it for
> other memory providers, and for that we need to decouple net_iov from
> devmem. Make net_iov point to a new base structure called
> net_iov_area, which dmabuf_genpool_chunk_owner extends.

We've been there before. Instead of reinventing your own memory
provider, please enhance dmabufs for your use case. We don't really
need to build a memory buffer abstraction over a memory buffer
abstraction.
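The change the commit message describes is the kernel's usual embedded-base-struct idiom: net_iov points at a generic net_iov_area, the devmem-specific owner embeds that area as a member, and devmem code recovers its own type with container_of(). A condensed sketch, abridged from the patch quoted at the end of this page (the struct and helper names follow the patch; the elided fields are trimmed for brevity):

```c
#include <linux/container_of.h>
#include <linux/types.h>

/* Generic base: everything a net_iov needs to know about its area. */
struct net_iov_area {
	struct net_iov	*niovs;		/* array of net_iovs for this area */
	size_t		num_niovs;
	unsigned long	base_virtual;	/* offset where this area starts */
};

struct net_iov {
	/* ... pp_magic, pp, dma_addr, pp_ref_count ... */
	struct net_iov_area *owner;	/* was: struct dmabuf_genpool_chunk_owner * */
};

/* The devmem-specific owner extends the base by embedding it. */
struct dmabuf_genpool_chunk_owner {
	struct net_iov_area area;
	/* devmem-only state: base_dma_addr, binding, ... */
};

/* Devmem code recovers its private type from the generic pointer. */
static inline struct dmabuf_genpool_chunk_owner *
net_devmem_iov_to_chunk_owner(const struct net_iov *niov)
{
	return container_of(niov->owner, struct dmabuf_genpool_chunk_owner, area);
}
```

Other providers can embed the same base with their own private state, which is the decoupling the thread below argues about.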
On 10/23/24 08:20, Christoph Hellwig wrote:
> On Wed, Oct 16, 2024 at 11:52:39AM -0700, David Wei wrote:
>> From: Pavel Begunkov <asml.silence@gmail.com>
>>
>> Currently net_iov stores a pointer to struct dmabuf_genpool_chunk_owner,
>> which serves as a useful abstraction to share data and provide a
>> context. However, it's too devmem specific, and we want to reuse it for
>> other memory providers, and for that we need to decouple net_iov from
>> devmem. Make net_iov point to a new base structure called
>> net_iov_area, which dmabuf_genpool_chunk_owner extends.
>
> We've been there before. Instead of reinventing your own memory
> provider, please enhance dmabufs for your use case. We don't really
> need to build a memory buffer abstraction over a memory buffer
> abstraction.

It doesn't care much what kind of memory it is, nor is it important
to the internals how it's imported; it's user addresses -> pages for
the user's convenience. All the net_iov setup code is in the page pool
core code. What it does, however, is implement the user API, so
there is no real relevance to dmabufs here.
On Wed, Oct 23, 2024 at 03:34:53PM +0100, Pavel Begunkov wrote:
> It doesn't care much what kind of memory it is, nor is it important
> to the internals how it's imported; it's user addresses -> pages for
> the user's convenience. All the net_iov setup code is in the page pool
> core code. What it does, however, is implement the user API, so

That's not what this series does. It adds the new memory_provider_ops
set of hooks, with one implementation for dmabufs, and one for
io_uring zero copy. So you are precluding zero copy RX into anything
but your magic io_uring buffers, and using an odd abstraction for
that.

The right way would be to support zero copy RX into every designated
dmabuf, and make io_uring work with udmabuf or, if absolutely needed,
its own kind of dmabuf. Instead we create a maze of incompatible
abstractions here. The use case of e.g. doing zero copy receive into
an NVMe CMB using PCIe P2P transactions is anything but made up, so
this does create a problem.
On 10/24/24 10:23, Christoph Hellwig wrote:
> On Wed, Oct 23, 2024 at 03:34:53PM +0100, Pavel Begunkov wrote:
>> It doesn't care much what kind of memory it is, nor is it important
>> to the internals how it's imported; it's user addresses -> pages for
>> the user's convenience. All the net_iov setup code is in the page pool
>> core code. What it does, however, is implement the user API, so
>
> That's not what this series does. It adds the new memory_provider_ops
> set of hooks, with one implementation for dmabufs, and one for
> io_uring zero copy.

First, it's not a _new_ abstraction over a buffer as you called it
before; the abstraction (net_iov) is already merged. Second, you
mention devmem TCP, and it's not just a page pool with "dmabufs", it's
a user API to use it plus other memory-agnostic allocation logic. And
yes, dmabufs are the least technically important part there. Just
having a dmabuf handle solves absolutely nothing.

> So you are precluding zero copy RX into anything but your magic
> io_uring buffers, and using an odd abstraction for that.

Right, the io_uring zero copy RX API expects the transfer to happen
into io_uring controlled buffers, and that's the entire idea. Buffers
that are based on an existing network specific abstraction, which are
not restricted to pages or anything specific in the long run, but the
flow of which from the net stack to the user and back is controlled by
io_uring. If you worry about abuse, io_uring can't even sanely
initialise those buffers itself and therefore asks the page pool code
to do that.

> The right way would be to support zero copy RX into every
> designated dmabuf, and make io_uring work with udmabuf or, if

I have no idea what you mean, but shoving dmabufs into every single
place, regardless of whether it makes sense or not, is hardly a good
way forward.

> absolutely needed, its own kind of dmabuf. Instead we create

I'm even more confused about how that would help. The user API has to
be implemented, and adding a new dmabuf gives nothing, not even
mentioning that it's not clear what the semantics of that beast are
supposed to be.

> a maze of incompatible abstractions here. The use case of e.g.
> doing zero copy receive into an NVMe CMB using PCIe P2P transactions
> is anything but made up, so this does create a problem.

That's some kind of confusion again; there is no reason why it can't
be supported, transparently to the non-setup code at that. That's
left, like other bits, to further iterations to keep this set simpler.
On Thu, Oct 24, 2024 at 03:23:06PM +0100, Pavel Begunkov wrote:
> > That's not what this series does. It adds the new memory_provider_ops
> > set of hooks, with one implementation for dmabufs, and one for
> > io_uring zero copy.
>
> First, it's not a _new_ abstraction over a buffer as you called it
> before; the abstraction (net_iov) is already merged.

Umm, it is a new ops vector.

> Second, you mention devmem TCP, and it's not just a page pool with
> "dmabufs", it's a user API to use it plus other memory-agnostic
> allocation logic. And yes, dmabufs are the least technically
> important part there. Just having a dmabuf handle solves absolutely
> nothing.

It solves a lot, because it provides a proper abstraction.

> > So you are precluding zero copy RX into anything but your magic
> > io_uring buffers, and using an odd abstraction for that.
>
> Right, the io_uring zero copy RX API expects the transfer to happen
> into io_uring controlled buffers, and that's the entire idea. Buffers
> that are based on an existing network specific abstraction, which are
> not restricted to pages or anything specific in the long run, but the
> flow of which from the net stack to the user and back is controlled by
> io_uring. If you worry about abuse, io_uring can't even sanely
> initialise those buffers itself and therefore asks the page pool code
> to do that.

No, I worry about trying to use io_uring for no good reason. This
precludes in-kernel uses, which would be extremely useful for network
storage drivers, and it precludes device memory of all kinds.

> I'm even more confused about how that would help. The user API has to
> be implemented, and adding a new dmabuf gives nothing, not even
> mentioning that it's not clear what the semantics of that beast are
> supposed to be.

The dma-buf maintainers already explained to you last time that there
is absolutely no need to use the dmabuf UAPI; you can use dma-bufs
through in-kernel interfaces just fine.
On 10/24/24 17:06, Christoph Hellwig wrote:
> On Thu, Oct 24, 2024 at 03:23:06PM +0100, Pavel Begunkov wrote:
>>> That's not what this series does. It adds the new memory_provider_ops
>>> set of hooks, with one implementation for dmabufs, and one for
>>> io_uring zero copy.
>>
>> First, it's not a _new_ abstraction over a buffer as you called it
>> before; the abstraction (net_iov) is already merged.
>
> Umm, it is a new ops vector.

I don't understand what you mean. Callback?

>> Second, you mention devmem TCP, and it's not just a page pool with
>> "dmabufs", it's a user API to use it plus other memory-agnostic
>> allocation logic. And yes, dmabufs are the least technically
>> important part there. Just having a dmabuf handle solves absolutely
>> nothing.
>
> It solves a lot, because it provides a proper abstraction.

Then please go ahead and take a look at the patchset in question and
see how much dmabuf handling is there compared to the pure networking
changes. The point stands that it's a new set of APIs and lots of
changes not directly related to dmabufs. A dmabuf is useful there as
an abstraction, but it's a very long stretch to say the series is all
about it.

>>> So you are precluding zero copy RX into anything but your magic
>>> io_uring buffers, and using an odd abstraction for that.
>>
>> Right, the io_uring zero copy RX API expects the transfer to happen
>> into io_uring controlled buffers, and that's the entire idea. Buffers
>> that are based on an existing network specific abstraction, which are
>> not restricted to pages or anything specific in the long run, but the
>> flow of which from the net stack to the user and back is controlled by
>> io_uring. If you worry about abuse, io_uring can't even sanely
>> initialise those buffers itself and therefore asks the page pool code
>> to do that.
>
> No, I worry about trying to use io_uring for no good reason. This

It sounds like the argument is that you just don't want any io_uring
APIs; I don't think I'd be able to help you with that.

> precludes in-kernel uses, which would be extremely useful for

Uses of what? devmem TCP is merged, I'm not removing it, and the
net_iov abstraction is in there, which can potentially be reused by
other in-kernel users if that would even make sense.

> network storage drivers, and it precludes device memory of all
> kinds.

You can't use page pools to allocate for a storage device; it's a
network specific allocator. You can get a dmabuf around that device's
memory and zero copy into it, and there is no problem with that.
Either use devmem TCP or wait until io_uring adds support for dmabufs,
which is, again, trivial.

>> I'm even more confused about how that would help. The user API has to
>> be implemented, and adding a new dmabuf gives nothing, not even
>> mentioning that it's not clear what the semantics of that beast are
>> supposed to be.
>
> The dma-buf maintainers already explained to you last time that there
> is absolutely no need to use the dmabuf UAPI; you can use dma-bufs
> through in-kernel interfaces just fine.

You can, even though it's not needed and I don't see how it'd be
useful, but you're missing the point. A new dmabuf implementation
doesn't implement the uapi we need, nor does it help to talk to the
net layer.
On Thu, Oct 24, 2024 at 05:40:02PM +0100, Pavel Begunkov wrote:
> On 10/24/24 17:06, Christoph Hellwig wrote:
> > On Thu, Oct 24, 2024 at 03:23:06PM +0100, Pavel Begunkov wrote:
> > > > That's not what this series does. It adds the new memory_provider_ops
> > > > set of hooks, with one implementation for dmabufs, and one for
> > > > io_uring zero copy.
> > >
> > > First, it's not a _new_ abstraction over a buffer as you called it
> > > before; the abstraction (net_iov) is already merged.
> >
> > Umm, it is a new ops vector.
>
> I don't understand what you mean. Callback?

struct memory_provider_ops. It's a method table or ops vector, no
callbacks involved.

> Then please go ahead and take a look at the patchset in question and
> see how much dmabuf handling is there compared to the pure networking
> changes. The point stands that it's a new set of APIs and lots of
> changes not directly related to dmabufs. A dmabuf is useful there as
> an abstraction, but it's a very long stretch to say the series is all
> about it.

I did take a look, that's why I replied.

> > > on an existing network specific abstraction, which are not restricted to
> > > pages or anything specific in the long run, but the flow of which from
> > > the net stack to the user and back is controlled by io_uring. If you
> > > worry about abuse, io_uring can't even sanely initialise those buffers
> > > itself and therefore asks the page pool code to do that.
> >
> > No, I worry about trying to use io_uring for no good reason. This
>
> It sounds like the argument is that you just don't want any io_uring
> APIs; I don't think I'd be able to help you with that.

No, that's completely misinterpreting what I'm saying. Of course an
io_uring API is fine. But tying low-level implementation details to it
is not.

> > precludes in-kernel uses, which would be extremely useful for
>
> Uses of what? devmem TCP is merged, I'm not removing it, and the
> net_iov abstraction is in there, which can potentially be reused by
> other in-kernel users if that would even make sense.

How, when you are hardcoding io_uring memory registrations instead of
making them a generic dmabuf? Which btw would also really help with
pre-registering the memory with the IOMMU to get good performance in
IOMMU-enabled setups.
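For reference, the "ops vector" being debated is a per-provider hook table the page pool calls through. The sketch below shows its rough shape as discussed in the series; the hook names and signatures here are an approximation for illustration, not a verbatim copy of the final API:

```c
#include <net/page_pool/types.h>

/* Approximate shape of the provider hook table from the series;
 * member names and signatures are a sketch, not the merged API.
 */
struct memory_provider_ops {
	/* one-time setup/teardown for a pool bound to this provider */
	int	(*init)(struct page_pool *pool);
	void	(*destroy)(struct page_pool *pool);

	/* hand out and take back provider-owned net_iov buffers */
	netmem_ref (*alloc_netmems)(struct page_pool *pool, gfp_t gfp);
	bool	(*release_netmem)(struct page_pool *pool, netmem_ref netmem);
};
```

Devmem TCP fills one such table with hooks backed by its dmabuf genpool; io_uring zero copy RX fills another backed by user-registered memory. That per-provider indirection, rather than routing every provider through dma-buf, is the design point Christoph objects to.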
On 10/28/24 12:11, Christoph Hellwig wrote:
> On Thu, Oct 24, 2024 at 05:40:02PM +0100, Pavel Begunkov wrote:
>> On 10/24/24 17:06, Christoph Hellwig wrote:
>>> On Thu, Oct 24, 2024 at 03:23:06PM +0100, Pavel Begunkov wrote:
>>>>> That's not what this series does. It adds the new memory_provider_ops
>>>>> set of hooks, with one implementation for dmabufs, and one for
>>>>> io_uring zero copy.
>>>>
>>>> First, it's not a _new_ abstraction over a buffer as you called it
>>>> before; the abstraction (net_iov) is already merged.
>>>
>>> Umm, it is a new ops vector.
>>
>> I don't understand what you mean. Callback?
>
> struct memory_provider_ops. It's a method table or ops vector, no
> callbacks involved.

I see; the reply was about your earlier phrase about additional memory
abstractions: "... don't really need to build a memory buffer
abstraction over a memory buffer abstraction."

>> Then please go ahead and take a look at the patchset in question and
>> see how much dmabuf handling is there compared to the pure networking
>> changes. The point stands that it's a new set of APIs and lots of
>> changes not directly related to dmabufs. A dmabuf is useful there as
>> an abstraction, but it's a very long stretch to say the series is all
>> about it.
>
> I did take a look, that's why I replied.
>
>>>> on an existing network specific abstraction, which are not restricted to
>>>> pages or anything specific in the long run, but the flow of which from
>>>> the net stack to the user and back is controlled by io_uring. If you
>>>> worry about abuse, io_uring can't even sanely initialise those buffers
>>>> itself and therefore asks the page pool code to do that.
>>>
>>> No, I worry about trying to use io_uring for no good reason. This
>>
>> It sounds like the argument is that you just don't want any io_uring
>> APIs; I don't think I'd be able to help you with that.
>
> No, that's completely misinterpreting what I'm saying. Of course an
> io_uring API is fine. But tying low-level implementation details to it
> is not.

It works with low level concepts, i.e. private NIC queues, but it does
that through well established abstractions (the page pool) already
extended for such cases. There is no going directly into a driver /
hardware and hardcoding queue allocation, memory injection or anything
similar. The user API has to embrace the hardware limitations, right,
there is no way around that without completely changing the approach
and the performance and/or applicability. And queues as first class
citizens is not a new concept in general.

>>> precludes in-kernel uses, which would be extremely useful for
>>
>> Uses of what? devmem TCP is merged, I'm not removing it, and the
>> net_iov abstraction is in there, which can potentially be reused by
>> other in-kernel users if that would even make sense.
>
> How, when you are hardcoding io_uring memory registrations instead of
> making them a generic dmabuf? Which btw would also really help

If you mean the internals, making up a dmabuf that has never existed
in the picture in the first place is not cleaner or easier in any way.
If that changes, e.g. there is more code to reuse in the future, we can
unify it then. If that's about the user API, you've mentioned before
that it can be pages / user pointers.

As to why it goes through io_uring: I explained it before, but in
short, it gives a better API for io_uring users; we can avoid creating
yet another file (a netlink socket) and keeping it around, so we don't
need to synchronise with the nl socket and/or try to steal memory from
it. The devmem API is also too monolithic for such purposes, so even
that would need to change, i.e. splitting queue and memory
registration.

> with pre-registering the memory with the IOMMU to get good performance
> in IOMMU-enabled setups.

The page pool already does that, just like it handles the normal path
without providers.
On Tue, Oct 29, 2024 at 04:35:16PM +0000, Pavel Begunkov wrote:
> I see; the reply was about your earlier phrase about additional memory
> abstractions:
>
> "... don't really need to build a memory buffer abstraction over
> a memory buffer abstraction."

Yes, over the existing memory buffer abstraction (dma_buf).

> If you mean the internals, making up a dmabuf that has never existed
> in the picture in the first place is not cleaner or easier in any way.
> If that changes, e.g. there is more code to reuse in the future, we can
> unify it then.

I'm not sure what "making up" means here, they are all made up :)

> > with pre-registering the memory with the IOMMU to get good performance
> > in IOMMU-enabled setups.
>
> The page pool already does that, just like it handles the normal path
> without providers.

In which case it basically is a dma-buf. If you'd expose it as such,
we could actually use it to communicate between subsystems in the
kernel.
```diff
diff --git a/include/net/netmem.h b/include/net/netmem.h
index 8a6e20be4b9d..3795ded30d2c 100644
--- a/include/net/netmem.h
+++ b/include/net/netmem.h
@@ -24,11 +24,20 @@ struct net_iov {
 	unsigned long __unused_padding;
 	unsigned long pp_magic;
 	struct page_pool *pp;
-	struct dmabuf_genpool_chunk_owner *owner;
+	struct net_iov_area *owner;
 	unsigned long dma_addr;
 	atomic_long_t pp_ref_count;
 };
 
+struct net_iov_area {
+	/* Array of net_iovs for this area. */
+	struct net_iov *niovs;
+	size_t num_niovs;
+
+	/* Offset into the dma-buf where this chunk starts.  */
+	unsigned long base_virtual;
+};
+
 /* These fields in struct page are used by the page_pool and net stack:
  *
  *        struct {
@@ -54,6 +63,16 @@ NET_IOV_ASSERT_OFFSET(dma_addr, dma_addr);
 NET_IOV_ASSERT_OFFSET(pp_ref_count, pp_ref_count);
 #undef NET_IOV_ASSERT_OFFSET
 
+static inline struct net_iov_area *net_iov_owner(const struct net_iov *niov)
+{
+	return niov->owner;
+}
+
+static inline unsigned int net_iov_idx(const struct net_iov *niov)
+{
+	return niov - net_iov_owner(niov)->niovs;
+}
+
 /* netmem */
 
 /**
diff --git a/net/core/devmem.c b/net/core/devmem.c
index 858982858f81..5c10cf0e2a18 100644
--- a/net/core/devmem.c
+++ b/net/core/devmem.c
@@ -32,14 +32,15 @@ static void net_devmem_dmabuf_free_chunk_owner(struct gen_pool *genpool,
 {
 	struct dmabuf_genpool_chunk_owner *owner = chunk->owner;
 
-	kvfree(owner->niovs);
+	kvfree(owner->area.niovs);
 	kfree(owner);
 }
 
 static dma_addr_t net_devmem_get_dma_addr(const struct net_iov *niov)
 {
-	struct dmabuf_genpool_chunk_owner *owner = net_iov_owner(niov);
+	struct dmabuf_genpool_chunk_owner *owner;
 
+	owner = net_devmem_iov_to_chunk_owner(niov);
 	return owner->base_dma_addr +
 	       ((dma_addr_t)net_iov_idx(niov) << PAGE_SHIFT);
 }
@@ -82,7 +83,7 @@ net_devmem_alloc_dmabuf(struct net_devmem_dmabuf_binding *binding)
 
 	offset = dma_addr - owner->base_dma_addr;
 	index = offset / PAGE_SIZE;
-	niov = &owner->niovs[index];
+	niov = &owner->area.niovs[index];
 
 	niov->pp_magic = 0;
 	niov->pp = NULL;
@@ -250,9 +251,9 @@ net_devmem_bind_dmabuf(struct net_device *dev, unsigned int dmabuf_fd,
 		goto err_free_chunks;
 	}
 
-	owner->base_virtual = virtual;
+	owner->area.base_virtual = virtual;
 	owner->base_dma_addr = dma_addr;
-	owner->num_niovs = len / PAGE_SIZE;
+	owner->area.num_niovs = len / PAGE_SIZE;
 	owner->binding = binding;
 
 	err = gen_pool_add_owner(binding->chunk_pool, dma_addr,
@@ -264,17 +265,17 @@ net_devmem_bind_dmabuf(struct net_device *dev, unsigned int dmabuf_fd,
 		goto err_free_chunks;
 	}
 
-	owner->niovs = kvmalloc_array(owner->num_niovs,
-				      sizeof(*owner->niovs),
-				      GFP_KERNEL);
-	if (!owner->niovs) {
+	owner->area.niovs = kvmalloc_array(owner->area.num_niovs,
+					   sizeof(*owner->area.niovs),
+					   GFP_KERNEL);
+	if (!owner->area.niovs) {
 		err = -ENOMEM;
 		goto err_free_chunks;
 	}
 
-	for (i = 0; i < owner->num_niovs; i++) {
-		niov = &owner->niovs[i];
-		niov->owner = owner;
+	for (i = 0; i < owner->area.num_niovs; i++) {
+		niov = &owner->area.niovs[i];
+		niov->owner = &owner->area;
 		page_pool_set_dma_addr_netmem(net_iov_to_netmem(niov),
 					      net_devmem_get_dma_addr(niov));
 	}
diff --git a/net/core/devmem.h b/net/core/devmem.h
index 99782ddeca40..a2b9913e9a17 100644
--- a/net/core/devmem.h
+++ b/net/core/devmem.h
@@ -10,6 +10,8 @@
 #ifndef _NET_DEVMEM_H
 #define _NET_DEVMEM_H
 
+#include <net/netmem.h>
+
 struct netlink_ext_ack;
 
 struct net_devmem_dmabuf_binding {
@@ -51,17 +53,11 @@ struct net_devmem_dmabuf_binding {
  * allocations from this chunk.
  */
 struct dmabuf_genpool_chunk_owner {
-	/* Offset into the dma-buf where this chunk starts.  */
-	unsigned long base_virtual;
+	struct net_iov_area area;
+	struct net_devmem_dmabuf_binding *binding;
 
 	/* dma_addr of the start of the chunk. */
 	dma_addr_t base_dma_addr;
-
-	/* Array of net_iovs for this chunk. */
-	struct net_iov *niovs;
-	size_t num_niovs;
-
-	struct net_devmem_dmabuf_binding *binding;
 };
 
 void __net_devmem_dmabuf_binding_free(struct net_devmem_dmabuf_binding *binding);
@@ -75,20 +71,17 @@ int net_devmem_bind_dmabuf_to_queue(struct net_device *dev, u32 rxq_idx,
 void dev_dmabuf_uninstall(struct net_device *dev);
 
 static inline struct dmabuf_genpool_chunk_owner *
-net_iov_owner(const struct net_iov *niov)
+net_devmem_iov_to_chunk_owner(const struct net_iov *niov)
 {
-	return niov->owner;
-}
+	struct net_iov_area *owner = net_iov_owner(niov);
 
-static inline unsigned int net_iov_idx(const struct net_iov *niov)
-{
-	return niov - net_iov_owner(niov)->niovs;
+	return container_of(owner, struct dmabuf_genpool_chunk_owner, area);
 }
 
 static inline struct net_devmem_dmabuf_binding *
 net_devmem_iov_binding(const struct net_iov *niov)
 {
-	return net_iov_owner(niov)->binding;
+	return net_devmem_iov_to_chunk_owner(niov)->binding;
 }
 
 static inline u32 net_devmem_iov_binding_id(const struct net_iov *niov)
@@ -98,7 +91,7 @@ static inline u32 net_devmem_iov_binding_id(const struct net_iov *niov)
 
 static inline unsigned long net_iov_virtual_addr(const struct net_iov *niov)
 {
-	struct dmabuf_genpool_chunk_owner *owner = net_iov_owner(niov);
+	struct net_iov_area *owner = net_iov_owner(niov);
 
 	return owner->base_virtual +
 	       ((unsigned long)net_iov_idx(niov) << PAGE_SHIFT);
```
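To make the decoupling concrete: once net_iov only knows about net_iov_area, any provider can embed the same base alongside its own private state. A hypothetical sketch of how the io_uring provider from this series might do so; apart from net_iov_area, net_iov_owner() and the container_of() pattern shown in the patch above, the names and fields here are illustrative, not lifted from the actual io_uring patches:

```c
/* Hypothetical second user of the generic base. */
struct io_zcrx_area {
	struct net_iov_area nia;	/* generic base seen by the net core */

	/* io_uring-specific state, e.g. the user pages backing the area */
	struct page **pages;
	u32 area_id;
};

static inline struct io_zcrx_area *
io_zcrx_iov_to_area(const struct net_iov *niov)
{
	struct net_iov_area *owner = net_iov_owner(niov);

	/* Mirrors net_devmem_iov_to_chunk_owner() in the patch above. */
	return container_of(owner, struct io_zcrx_area, nia);
}
```

The page pool and net stack only ever touch the embedded net_iov_area; each provider recovers its outer type on its own paths, which is why the devmem-specific fields could move out of the shared struct.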