Message ID | 20240403002053.2376017-3-almasrymina@google.com (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | Device Memory TCP | expand |
Still NAK to creating aⅺbitrary hooks here. This should be a page or dmabuf pool and not an indirect call abstraction allowing random crap to hook into it.
Sorry for the late reply. On Wed, May 1, 2024 at 12:55 AM Christoph Hellwig <hch@infradead.org> wrote: > > Still NAK to creating aⅺbitrary hooks here. Is the concern still that folks may be able to hook proprietary stuff into this like you mentioned before[1]? I don't see how that can be done as currently written. The page_pool grabs the memory_provider_ops from the netdev_rx_queue struct managed by core net stack and not really overridable by external modules. When the netdev creates the page_pool, it gets the core-managed netdev_rx_queue via something like __netif_get_rx_queue() and passes that to page_pool_create(). We could make the memory_provider_ops even more opaque by only allowing the device to only pass in the netdev + queue num to the page_pool_create, and have the page_pool_create query the netdev_rx_queue struct, to make sure we're getting the one managed by core. Long story short is that as currently written I think it's pretty much impossible for someone to plug in a proprietary out-of-tree memory provider using these hooks, and if desired I can change the code slightly to make it even more difficult (but maybe that's pointless, I don't think it's possible even in the current iteration). The only way to get a memory_provider_ops in is to seek to merge it as part of the kernel with community approval. Is there something I'm missing here? > This should be a page or > dmabuf pool and not an indirect call abstraction allowing random > crap to hook into it. > What is the suggested fix here? I do something like: cp net/core/page_pool.c net/core/dmabuf_pool.c and then modify it such that the net stack maintains 2 page_pools? There are a lot of cons to that: 1. Code duplication/maintenance (page_pool.c + dmabuf_pool.c will look very similar). 2. The hooks enable more use cases than dmabuf_pool + standard pages. In addition to those, I'm thinking of (but not working on): a. Limited memory pools. I.e. a page_pool limited to a certain amount of memory (for overcommited VMs). b. dmabuf pools with GPU virtual addresses. Currently we seek to support dmabuf memory where the virtual address is an offset into the dmabuf for CPU access. For GPU memory accessible to the GPU we need dmabuf memory where the virtual address is the GPU virtual address. 3. Support for multiple page_pools is actually more proprietary friendly IMO. Currently the page_pool is internal to core. If we start adding additional pools we need to have some uniform behavior between all the pools so core can operate on memory that originated from any one of them. In that case it becomes actually easier for someone to develop an out of tree pool and use it from their out-of-tree driver and as long as their out of tree page_pool behaves similarly enough to the decided uniform behavior, it may be able to fool core into thinking it's an in-tree pool... [1] https://lore.kernel.org/linux-kernel/ZfegzB341oNc_Ocz@infradead.org/ -- Thanks, Mina
On Fri, May 03, 2024 at 01:10:44PM -0700, Mina Almasry wrote: > Is the concern still that folks may be able to hook proprietary stuff > into this like you mentioned before[1]? That is on concern. The other is that people will do stupid stuff even in tree if you give them enough rope, and they should not have that rope when the only sensible options are page/folio based kernel memory (incuding large/huge folios) and dmabuf. > cp net/core/page_pool.c net/core/dmabuf_pool.c > > and then modify it such that the net stack maintains 2 page_pools? > There are a lot of cons to that: No. Just have branches for page based vs dmabuf in a few places.
On 5/6/24 13:04, Christoph Hellwig wrote: > On Fri, May 03, 2024 at 01:10:44PM -0700, Mina Almasry wrote: >> Is the concern still that folks may be able to hook proprietary stuff >> into this like you mentioned before[1]? > > That is on concern. The other is that people will do stupid stuff We're not now banning all virtual function tables because there is a chance someone might probably conceivably do perhaps something proprietary, aren't we? Let's just patch up all ways they might use it if there is any left. > even in tree if you give them enough rope, and they should not have > that rope when the only sensible options are page/folio based kernel > memory (incuding large/huge folios) and dmabuf. I believe there is at least one deep confusion here, considering you previously mentioned Keith's pre-mapping patches. The "hooks" are not that about in what format you pass memory, it's arguably the least interesting part for page pool, more or less it'd circulate whatever is given. It's more of how to have a better control over buffer lifetime and implement a buffer pool passing data to users and empty buffers back. It's a different from storage where a user passes a buffer before it initiates a read, data is sent by the other socket end. TCP devmem uses dmabuf, io_uring keeps pages (of normal user memory). >> cp net/core/page_pool.c net/core/dmabuf_pool.c >> >> and then modify it such that the net stack maintains 2 page_pools? >> There are a lot of cons to that: > > No. Just have branches for page based vs dmabuf in a few places. >
On Tue, May 07, 2024 at 05:05:12PM +0100, Pavel Begunkov wrote: > > even in tree if you give them enough rope, and they should not have > > that rope when the only sensible options are page/folio based kernel > > memory (incuding large/huge folios) and dmabuf. > > I believe there is at least one deep confusion here, considering you > previously mentioned Keith's pre-mapping patches. The "hooks" are not > that about in what format you pass memory, it's arguably the least > interesting part for page pool, more or less it'd circulate whatever > is given. It's more of how to have a better control over buffer lifetime > and implement a buffer pool passing data to users and empty buffers > back. Isn't that more or less exactly what dmabuf is? Why do you need another almost dma-buf thing for another project? Jason
On Tue, May 07, 2024 at 01:18:57PM -0300, Jason Gunthorpe wrote: > On Tue, May 07, 2024 at 05:05:12PM +0100, Pavel Begunkov wrote: > > > even in tree if you give them enough rope, and they should not have > > > that rope when the only sensible options are page/folio based kernel > > > memory (incuding large/huge folios) and dmabuf. > > > > I believe there is at least one deep confusion here, considering you > > previously mentioned Keith's pre-mapping patches. The "hooks" are not > > that about in what format you pass memory, it's arguably the least > > interesting part for page pool, more or less it'd circulate whatever > > is given. It's more of how to have a better control over buffer lifetime > > and implement a buffer pool passing data to users and empty buffers > > back. > > Isn't that more or less exactly what dmabuf is? Why do you need > another almost dma-buf thing for another project? That's the exact point I've been making since the last round of the series. We don't need to reinvent dmabuf poorly in every subsystem, but instead fix the odd parts in it and make it suitable for everyone.
On Tue, May 7, 2024 at 9:24 AM Christoph Hellwig <hch@infradead.org> wrote: > > On Tue, May 07, 2024 at 01:18:57PM -0300, Jason Gunthorpe wrote: > > On Tue, May 07, 2024 at 05:05:12PM +0100, Pavel Begunkov wrote: > > > > even in tree if you give them enough rope, and they should not have > > > > that rope when the only sensible options are page/folio based kernel > > > > memory (incuding large/huge folios) and dmabuf. > > > > > > I believe there is at least one deep confusion here, considering you > > > previously mentioned Keith's pre-mapping patches. The "hooks" are not > > > that about in what format you pass memory, it's arguably the least > > > interesting part for page pool, more or less it'd circulate whatever > > > is given. It's more of how to have a better control over buffer lifetime > > > and implement a buffer pool passing data to users and empty buffers > > > back. > > > > Isn't that more or less exactly what dmabuf is? Why do you need > > another almost dma-buf thing for another project? > > That's the exact point I've been making since the last round of > the series. We don't need to reinvent dmabuf poorly in every > subsystem, but instead fix the odd parts in it and make it suitable > for everyone. > FWIW the change Christoph is requesting is straight forward from my POV and doesn't really hurt the devmem use case. I'd basically remove the ops and add an if statement in the slow path where the ops are being used to alloc/free from dmabuf instead of alloc_pages(). Something like (very rough, doesn't compile): diff --git a/net/core/page_pool.c b/net/core/page_pool.c index 92be1aaf18ccc..2cc986455bce6 100644 --- a/net/core/page_pool.c +++ b/net/core/page_pool.c @@ -557,8 +557,8 @@ netmem_ref page_pool_alloc_netmem(struct page_pool *pool, gfp_t gfp) return netmem; /* Slow-path: cache empty, do real allocation */ - if (static_branch_unlikely(&page_pool_mem_providers) && pool->mp_ops) - netmem = pool->mp_ops->alloc_pages(pool, gfp); + if (page_pool_is_dmabuf(pool)) + netmem = mp_dmabuf_devmem_alloc_pages(): else netmem = __page_pool_alloc_pages_slow(pool, gfp); return netmem; The folks that will be negatively impacted by this are Jakub/Pavel/David. I think all were planning to extend the hooks for io_uring or other memory types. Pavel/David, AFAICT you have these options here (but maybe you can think of more): 1. Align with devmem TCP to use udmabuf for your io_uring memory. I think in the past you said it's a uapi you don't link but in the face of this pushback you may want to reconsider. 2. Follow the example of devmem TCP and add another if statement to alloc from io_uring, so something like: diff --git a/net/core/page_pool.c b/net/core/page_pool.c index 92be1aaf18ccc..3545bb82c7d05 100644 --- a/net/core/page_pool.c +++ b/net/core/page_pool.c @@ -557,8 +557,10 @@ netmem_ref page_pool_alloc_netmem(struct page_pool *pool, gfp_t gfp) return netmem; /* Slow-path: cache empty, do real allocation */ - if (static_branch_unlikely(&page_pool_mem_providers) && pool->mp_ops) - netmem = pool->mp_ops->alloc_pages(pool, gfp); + if (page_pool_is_dmabuf(pool)) + netmem = mp_dmabuf_devmem_alloc_pages(): + else if (page_pool_is_io_uring(pool)) + netmem = mp_io_uring_alloc_pages(): else netmem = __page_pool_alloc_pages_slow(pool, gfp); return netmem; Note that Christoph/Jason may not like you adding non-dmabuf io_uring backing memory in the first place, so there may be pushback against this approach. 3. Pushback on the nack on this thread. It seems you're already discussing this. I'll see what happens. To be honest the GVE queue-API has just been merged I think, so I'm now unblocked on sending non-RFCs of this work and I'm hoping to send the next version soon. I may apply these changes on the next version for more discussion or leave as is and carry the nack until the conversation converges.
On Tue, May 07, 2024 at 09:42:05AM -0700, Mina Almasry wrote: > 1. Align with devmem TCP to use udmabuf for your io_uring memory. I > think in the past you said it's a uapi you don't link but in the face > of this pushback you may want to reconsider. dmabuf does not force a uapi, you can acquire your pages however you want and wrap them up in a dmabuf. No uapi at all. The point is that dmabuf already provides ops that do basically what is needed here. We don't need ops calling ops just because dmabuf's ops are not understsood or not perfect. Fixup dmabuf. If io_uring wants to take its existing memory pre-registration it can wrap that in a dmbauf, and somehow pass it to the netstack. Userspace doesn't need to know a dmabuf is being used in the background. Jason
On 5/7/24 17:23, Christoph Hellwig wrote: > On Tue, May 07, 2024 at 01:18:57PM -0300, Jason Gunthorpe wrote: >> On Tue, May 07, 2024 at 05:05:12PM +0100, Pavel Begunkov wrote: >>>> even in tree if you give them enough rope, and they should not have >>>> that rope when the only sensible options are page/folio based kernel >>>> memory (incuding large/huge folios) and dmabuf. >>> >>> I believe there is at least one deep confusion here, considering you >>> previously mentioned Keith's pre-mapping patches. The "hooks" are not >>> that about in what format you pass memory, it's arguably the least >>> interesting part for page pool, more or less it'd circulate whatever >>> is given. It's more of how to have a better control over buffer lifetime >>> and implement a buffer pool passing data to users and empty buffers >>> back. >> >> Isn't that more or less exactly what dmabuf is? Why do you need >> another almost dma-buf thing for another project? > > That's the exact point I've been making since the last round of > the series. We don't need to reinvent dmabuf poorly in every > subsystem, but instead fix the odd parts in it and make it suitable > for everyone. Someone would need to elaborate how dma-buf is like that addition to page pool infra. The granularity here is usually 4K and less (hw dictated), what user receives cannot be guaranteed to be contiguous in memory. Having thousands of dma-buf instances is not an option, so a completion would need to include a range where data sits. Then who controls lifetime of buffers? If it's dma-buf, then at least it needs to track what sub-buffers are handed to user and what are currently in the kernel. How it would be accounted? ioctl_return_subrange(dmabuf, [range]), sounds like a bad idea for performance. To cover user memory it'd also need to be read from userspace, ioctl here wouldn't be an option, but let's say it's somehow done in the kernel. That's not all the list, but in short, even though I haven't been following dma-buf developments too closely, I have hard time seeing how it can be a replacement here.
On Tue, May 7, 2024 at 9:55 AM Pavel Begunkov <asml.silence@gmail.com> wrote: > > On 5/7/24 17:23, Christoph Hellwig wrote: > > On Tue, May 07, 2024 at 01:18:57PM -0300, Jason Gunthorpe wrote: > >> On Tue, May 07, 2024 at 05:05:12PM +0100, Pavel Begunkov wrote: > >>>> even in tree if you give them enough rope, and they should not have > >>>> that rope when the only sensible options are page/folio based kernel > >>>> memory (incuding large/huge folios) and dmabuf. > >>> > >>> I believe there is at least one deep confusion here, considering you > >>> previously mentioned Keith's pre-mapping patches. The "hooks" are not > >>> that about in what format you pass memory, it's arguably the least > >>> interesting part for page pool, more or less it'd circulate whatever > >>> is given. It's more of how to have a better control over buffer lifetime > >>> and implement a buffer pool passing data to users and empty buffers > >>> back. > >> > >> Isn't that more or less exactly what dmabuf is? Why do you need > >> another almost dma-buf thing for another project? > > > > That's the exact point I've been making since the last round of > > the series. We don't need to reinvent dmabuf poorly in every > > subsystem, but instead fix the odd parts in it and make it suitable > > for everyone. > > Someone would need to elaborate how dma-buf is like that addition > to page pool infra. I think I understand what Jason is requesting here, and I'll take a shot at elaborating. AFAICT what he's saying is technically feasible and addresses the nack while giving you the uapi you want. It just requires a bit (a lot?) of work on your end unfortunately. CONFIG_UDMABUF takes in a memfd, converts it to a dmabuf, and returns it to userspace. See udmabuf_create(). I think what Jason is saying here, is that you can write similar code to udmabuf_creat() that takes in a io_uring memory region, and converts it to a dmabuf inside the kernel. I haven't looked at your series yet too closely (sorry!), but I assume you currently have a netlink API that binds an io_uring memory region to the NIC rx-queue page_pool, right? That netlink API would need to be changed to: 1. Take in the io_uring memory. 2. Convert it to a dmabuf like udmabuf_create() does. 3. Bind the resulting dmabuf to the rx-queue page_pool. There would be more changes needed vis-a-vis the clean up path and lifetime management, but I think this is the general idea. This would give you the uapi you want, while the page_pool never seen non-dmabuf memory (addresses the nack, I think).
On 5/7/24 17:42, Mina Almasry wrote: > On Tue, May 7, 2024 at 9:24 AM Christoph Hellwig <hch@infradead.org> wrote: >> >> On Tue, May 07, 2024 at 01:18:57PM -0300, Jason Gunthorpe wrote: >>> On Tue, May 07, 2024 at 05:05:12PM +0100, Pavel Begunkov wrote: >>>>> even in tree if you give them enough rope, and they should not have >>>>> that rope when the only sensible options are page/folio based kernel >>>>> memory (incuding large/huge folios) and dmabuf. >>>> >>>> I believe there is at least one deep confusion here, considering you >>>> previously mentioned Keith's pre-mapping patches. The "hooks" are not >>>> that about in what format you pass memory, it's arguably the least >>>> interesting part for page pool, more or less it'd circulate whatever >>>> is given. It's more of how to have a better control over buffer lifetime >>>> and implement a buffer pool passing data to users and empty buffers >>>> back. >>> >>> Isn't that more or less exactly what dmabuf is? Why do you need >>> another almost dma-buf thing for another project? >> >> That's the exact point I've been making since the last round of >> the series. We don't need to reinvent dmabuf poorly in every >> subsystem, but instead fix the odd parts in it and make it suitable >> for everyone. >> > > > FWIW the change Christoph is requesting is straight forward from my > POV and doesn't really hurt the devmem use case. I'd basically remove > the ops and add an if statement in the slow path where the ops are > being used to alloc/free from dmabuf instead of alloc_pages(). > Something like (very rough, doesn't compile): > > diff --git a/net/core/page_pool.c b/net/core/page_pool.c > index 92be1aaf18ccc..2cc986455bce6 100644 > --- a/net/core/page_pool.c > +++ b/net/core/page_pool.c > @@ -557,8 +557,8 @@ netmem_ref page_pool_alloc_netmem(struct page_pool > *pool, gfp_t gfp) > return netmem; > > /* Slow-path: cache empty, do real allocation */ > - if (static_branch_unlikely(&page_pool_mem_providers) && pool->mp_ops) > - netmem = pool->mp_ops->alloc_pages(pool, gfp); > + if (page_pool_is_dmabuf(pool)) > + netmem = mp_dmabuf_devmem_alloc_pages(): > else > netmem = __page_pool_alloc_pages_slow(pool, gfp); > return netmem; > > > The folks that will be negatively impacted by this are > Jakub/Pavel/David. I think all were planning to extend the hooks for > io_uring or other memory types. > > Pavel/David, AFAICT you have these options here (but maybe you can > think of more): > > 1. Align with devmem TCP to use udmabuf for your io_uring memory. I > think in the past you said it's a uapi you don't link but in the face > of this pushback you may want to reconsider. If the argument would be that we have to switch to a less efficient and less consistent api for io_uring (fast path handling used buffers back to kernel) just because it has to has dmabuf and without direct relation to dmabuf, then no, it's not the way anything can be sanely developed. > 2. Follow the example of devmem TCP and add another if statement to > alloc from io_uring, so something like: > > diff --git a/net/core/page_pool.c b/net/core/page_pool.c > index 92be1aaf18ccc..3545bb82c7d05 100644 > --- a/net/core/page_pool.c > +++ b/net/core/page_pool.c > @@ -557,8 +557,10 @@ netmem_ref page_pool_alloc_netmem(struct > page_pool *pool, gfp_t gfp) > return netmem; > > /* Slow-path: cache empty, do real allocation */ > - if (static_branch_unlikely(&page_pool_mem_providers) && pool->mp_ops) > - netmem = pool->mp_ops->alloc_pages(pool, gfp); > + if (page_pool_is_dmabuf(pool)) > + netmem = mp_dmabuf_devmem_alloc_pages(): > + else if (page_pool_is_io_uring(pool)) > + netmem = mp_io_uring_alloc_pages(): > else > netmem = __page_pool_alloc_pages_slow(pool, gfp); I don't see why we'd do that instead instead of having a well made function table, which is equivalent. > return netmem; > > Note that Christoph/Jason may not like you adding non-dmabuf io_uring > backing memory in the first place, so there may be pushback against > this approach. Christoph mentioned pages, we're using pages, I don't think it's too fancy. I don't believe that's it, which would be equivalent to "let's remove user pointers from the kernel and mandate passing dmabuf only". > 3. Pushback on the nack on this thread. It seems you're already > discussing this. I'll see what happens. > > To be honest the GVE queue-API has just been merged I think, so I'm > now unblocked on sending non-RFCs of this work and I'm hoping to send > the next version soon. I may apply these changes on the next version > for more discussion or leave as is and carry the nack until the > conversation converges. >
On Tue, May 07, 2024 at 01:48:38PM -0300, Jason Gunthorpe wrote: > On Tue, May 07, 2024 at 09:42:05AM -0700, Mina Almasry wrote: > > > 1. Align with devmem TCP to use udmabuf for your io_uring memory. I > > think in the past you said it's a uapi you don't link but in the face > > of this pushback you may want to reconsider. > > dmabuf does not force a uapi, you can acquire your pages however you > want and wrap them up in a dmabuf. No uapi at all. > > The point is that dmabuf already provides ops that do basically what > is needed here. We don't need ops calling ops just because dmabuf's > ops are not understsood or not perfect. Fixup dmabuf. > > If io_uring wants to take its existing memory pre-registration it can > wrap that in a dmbauf, and somehow pass it to the netstack. Userspace > doesn't need to know a dmabuf is being used in the background. So roughly the current dma-buf design considerations for the users of the dma-api interfaces: - It's a memory buffer of fixed length. - Either that memory is permanently nailed into place with dma_buf_pin (and if we add more users than just drm display then we should probably figure out the mlock account question for these). For locking hierarchy dma_buf_pin uses dma_resv_lock which nests within mmap_sem/vma locks but outside of any reclaim/alloc contexts. - Or the memory is more dynamic, in which case case you need to be able to dma_resv_lock when you need the memory and make a promise (as a dma_fence) that you'll release the memory within finite time and without any further allocations once you've unlocked the dma_buf (because dma_fence is in GFP_NORECLAIM). That promise can be waiting for memory access to finish, but it can also be a pte invalidate+tlb flush, or some kind of preemption, or whatever your hw can do really. Also, if you do this dynamic model and need to atomically reserve more than one dma_buf, you get to do the wait/wound mutex dance, but that's really just a bunch of funny looking error handling code and not really impacting the overall design or locking hierarchy. Everything else we can adjust, but I think the above three are not really changeable or dma-buf becomes unuseable for gpu drivers. Note that exporters of dma-buf can pretty much do whatever they feel like, including rejecting all the generic interfaces/ops, because we also use dma-buf as userspace handles for some really special memory. -Sima
On 5/7/24 17:48, Jason Gunthorpe wrote: > On Tue, May 07, 2024 at 09:42:05AM -0700, Mina Almasry wrote: > >> 1. Align with devmem TCP to use udmabuf for your io_uring memory. I >> think in the past you said it's a uapi you don't link but in the face >> of this pushback you may want to reconsider. > > dmabuf does not force a uapi, you can acquire your pages however you > want and wrap them up in a dmabuf. No uapi at all. > > The point is that dmabuf already provides ops that do basically what > is needed here. We don't need ops calling ops just because dmabuf's > ops are not understsood or not perfect. Fixup dmabuf. Those ops, for example, are used to efficiently return used buffers back to the kernel, which is uapi, I don't see how dmabuf can be fixed up to cover it. > If io_uring wants to take its existing memory pre-registration it can > wrap that in a dmbauf, and somehow pass it to the netstack. Userspace > doesn't need to know a dmabuf is being used in the background. io_uring's pre-registered memory is just pages, but even that is going to be replaced with just a normal user buffer pointer. Regardless, io_uring can wrap pages into a dmabuf, but it's not a direct replacement for the ops, it'd mandate uapi change in a not desirable way.
On 5/7/24 18:15, Mina Almasry wrote: > On Tue, May 7, 2024 at 9:55 AM Pavel Begunkov <asml.silence@gmail.com> wrote: >> >> On 5/7/24 17:23, Christoph Hellwig wrote: >>> On Tue, May 07, 2024 at 01:18:57PM -0300, Jason Gunthorpe wrote: >>>> On Tue, May 07, 2024 at 05:05:12PM +0100, Pavel Begunkov wrote: >>>>>> even in tree if you give them enough rope, and they should not have >>>>>> that rope when the only sensible options are page/folio based kernel >>>>>> memory (incuding large/huge folios) and dmabuf. >>>>> >>>>> I believe there is at least one deep confusion here, considering you >>>>> previously mentioned Keith's pre-mapping patches. The "hooks" are not >>>>> that about in what format you pass memory, it's arguably the least >>>>> interesting part for page pool, more or less it'd circulate whatever >>>>> is given. It's more of how to have a better control over buffer lifetime >>>>> and implement a buffer pool passing data to users and empty buffers >>>>> back. >>>> >>>> Isn't that more or less exactly what dmabuf is? Why do you need >>>> another almost dma-buf thing for another project? >>> >>> That's the exact point I've been making since the last round of >>> the series. We don't need to reinvent dmabuf poorly in every >>> subsystem, but instead fix the odd parts in it and make it suitable >>> for everyone. >> >> Someone would need to elaborate how dma-buf is like that addition >> to page pool infra. > > I think I understand what Jason is requesting here, and I'll take a > shot at elaborating. AFAICT what he's saying is technically feasible > and addresses the nack while giving you the uapi you want. It just > requires a bit (a lot?) of work on your end unfortunately. > > CONFIG_UDMABUF takes in a memfd, converts it to a dmabuf, and returns > it to userspace. See udmabuf_create(). > > I think what Jason is saying here, is that you can write similar code > to udmabuf_creat() that takes in a io_uring memory region, and > converts it to a dmabuf inside the kernel. > > I haven't looked at your series yet too closely (sorry!), but I assume > you currently have a netlink API that binds an io_uring memory region > to the NIC rx-queue page_pool, right? That netlink API would need to > be changed to: No, it's different, I'll skip details, but the main problem is that those callbacks are used to implement the user api returning buffers via a ring, where the callback grabs them (in napi context) and feeds into page pool. That replaces SO_DEVMEM_DONTNEED and the need for ioctl/setsockopt. > 1. Take in the io_uring memory. > 2. Convert it to a dmabuf like udmabuf_create() does. > 3. Bind the resulting dmabuf to the rx-queue page_pool. > > There would be more changes needed vis-a-vis the clean up path and > lifetime management, but I think this is the general idea. > > This would give you the uapi you want, while the page_pool never seen > non-dmabuf memory (addresses the nack, I think).
On Tue, May 07, 2024 at 06:25:52PM +0100, Pavel Begunkov wrote: > On 5/7/24 17:48, Jason Gunthorpe wrote: > > On Tue, May 07, 2024 at 09:42:05AM -0700, Mina Almasry wrote: > > > > > 1. Align with devmem TCP to use udmabuf for your io_uring memory. I > > > think in the past you said it's a uapi you don't link but in the face > > > of this pushback you may want to reconsider. > > > > dmabuf does not force a uapi, you can acquire your pages however you > > want and wrap them up in a dmabuf. No uapi at all. > > > > The point is that dmabuf already provides ops that do basically what > > is needed here. We don't need ops calling ops just because dmabuf's > > ops are not understsood or not perfect. Fixup dmabuf. > > Those ops, for example, are used to efficiently return used buffers > back to the kernel, which is uapi, I don't see how dmabuf can be > fixed up to cover it. Sure, but that doesn't mean you can't use dma buf for the other parts of the flow. The per-page lifetime is a different topic than the refcounting and access of the entire bulk of memory. Jason
On 5/7/24 18:56, Jason Gunthorpe wrote: > On Tue, May 07, 2024 at 06:25:52PM +0100, Pavel Begunkov wrote: >> On 5/7/24 17:48, Jason Gunthorpe wrote: >>> On Tue, May 07, 2024 at 09:42:05AM -0700, Mina Almasry wrote: >>> >>>> 1. Align with devmem TCP to use udmabuf for your io_uring memory. I >>>> think in the past you said it's a uapi you don't link but in the face >>>> of this pushback you may want to reconsider. >>> >>> dmabuf does not force a uapi, you can acquire your pages however you >>> want and wrap them up in a dmabuf. No uapi at all. >>> >>> The point is that dmabuf already provides ops that do basically what >>> is needed here. We don't need ops calling ops just because dmabuf's >>> ops are not understsood or not perfect. Fixup dmabuf. >> >> Those ops, for example, are used to efficiently return used buffers >> back to the kernel, which is uapi, I don't see how dmabuf can be >> fixed up to cover it. > > Sure, but that doesn't mean you can't use dma buf for the other parts > of the flow. The per-page lifetime is a different topic than the > refcounting and access of the entire bulk of memory. Ok, so if we're leaving uapi (and ops) and keep per page/sub-buffer as is, the rest is resolving uptr -> pages, and passing it to page pool in a convenient to page pool format (net_iov). I don't see how dmabuf would help here. Adding dmabuf in the middle (internally wrapping pages) would add more setup code with the same final result, that is a format that page pool can work with. And for io_uring it's normal user memory. We'll have to use dmabuf when we'd want to extend to peer-to-peer and all that fun, but that's a small fraction of it, and we'll hopefully reuse some setup helpers from devmem tcp.
On Tue, May 07, 2024 at 08:35:37PM +0100, Pavel Begunkov wrote: > On 5/7/24 18:56, Jason Gunthorpe wrote: > > On Tue, May 07, 2024 at 06:25:52PM +0100, Pavel Begunkov wrote: > > > On 5/7/24 17:48, Jason Gunthorpe wrote: > > > > On Tue, May 07, 2024 at 09:42:05AM -0700, Mina Almasry wrote: > > > > > > > > > 1. Align with devmem TCP to use udmabuf for your io_uring memory. I > > > > > think in the past you said it's a uapi you don't link but in the face > > > > > of this pushback you may want to reconsider. > > > > > > > > dmabuf does not force a uapi, you can acquire your pages however you > > > > want and wrap them up in a dmabuf. No uapi at all. > > > > > > > > The point is that dmabuf already provides ops that do basically what > > > > is needed here. We don't need ops calling ops just because dmabuf's > > > > ops are not understsood or not perfect. Fixup dmabuf. > > > > > > Those ops, for example, are used to efficiently return used buffers > > > back to the kernel, which is uapi, I don't see how dmabuf can be > > > fixed up to cover it. > > > > Sure, but that doesn't mean you can't use dma buf for the other parts > > of the flow. The per-page lifetime is a different topic than the > > refcounting and access of the entire bulk of memory. > > Ok, so if we're leaving uapi (and ops) and keep per page/sub-buffer as > is, the rest is resolving uptr -> pages, and passing it to page pool in > a convenient to page pool format (net_iov). I'm not going to pretend to know about page pool details, but dmabuf is the way to get the bulk of pages into a pool within the net stack's allocator and keep that bulk properly refcounted while. An object like dmabuf is needed for the general case because there are not going to be per-page references or otherwise available. What you seem to want is to alter how the actual allocation flow works from that bulk of memory and delay the free. It seems like a different topic to me, and honestly hacking into the allocator free function seems a bit weird.. Jason
On Tue, May 07, 2024 at 08:32:47PM -0300, Jason Gunthorpe wrote: > On Tue, May 07, 2024 at 08:35:37PM +0100, Pavel Begunkov wrote: > > On 5/7/24 18:56, Jason Gunthorpe wrote: > > > On Tue, May 07, 2024 at 06:25:52PM +0100, Pavel Begunkov wrote: > > > > On 5/7/24 17:48, Jason Gunthorpe wrote: > > > > > On Tue, May 07, 2024 at 09:42:05AM -0700, Mina Almasry wrote: > > > > > > > > > > > 1. Align with devmem TCP to use udmabuf for your io_uring memory. I > > > > > > think in the past you said it's a uapi you don't link but in the face > > > > > > of this pushback you may want to reconsider. > > > > > > > > > > dmabuf does not force a uapi, you can acquire your pages however you > > > > > want and wrap them up in a dmabuf. No uapi at all. > > > > > > > > > > The point is that dmabuf already provides ops that do basically what > > > > > is needed here. We don't need ops calling ops just because dmabuf's > > > > > ops are not understsood or not perfect. Fixup dmabuf. > > > > > > > > Those ops, for example, are used to efficiently return used buffers > > > > back to the kernel, which is uapi, I don't see how dmabuf can be > > > > fixed up to cover it. > > > > > > Sure, but that doesn't mean you can't use dma buf for the other parts > > > of the flow. The per-page lifetime is a different topic than the > > > refcounting and access of the entire bulk of memory. > > > > Ok, so if we're leaving uapi (and ops) and keep per page/sub-buffer as > > is, the rest is resolving uptr -> pages, and passing it to page pool in > > a convenient to page pool format (net_iov). > > I'm not going to pretend to know about page pool details, but dmabuf > is the way to get the bulk of pages into a pool within the net stack's > allocator and keep that bulk properly refcounted while. > > An object like dmabuf is needed for the general case because there are > not going to be per-page references or otherwise available. > > What you seem to want is to alter how the actual allocation flow works > from that bulk of memory and delay the free. It seems like a different > topic to me, and honestly hacking into the allocator free function > seems a bit weird.. Also I don't see how it's an argument against dma-buf as the interface for all these, because e.g. ttm internally does have a page pool because depending upon allocator, that's indeed beneficial. Other drm drivers have more buffer-based concepts for opportunistically memory around, usually by marking buffers that are just kept as cache as purgeable (which is a concept that goes all the way to opengl/vulkan). But these are all internals of the dma-buf exporter, the dma-buf api users don't ever need to care. -Sima
On 5/8/24 00:32, Jason Gunthorpe wrote: > On Tue, May 07, 2024 at 08:35:37PM +0100, Pavel Begunkov wrote: >> On 5/7/24 18:56, Jason Gunthorpe wrote: >>> On Tue, May 07, 2024 at 06:25:52PM +0100, Pavel Begunkov wrote: >>>> On 5/7/24 17:48, Jason Gunthorpe wrote: >>>>> On Tue, May 07, 2024 at 09:42:05AM -0700, Mina Almasry wrote: >>>>> >>>>>> 1. Align with devmem TCP to use udmabuf for your io_uring memory. I >>>>>> think in the past you said it's a uapi you don't link but in the face >>>>>> of this pushback you may want to reconsider. >>>>> >>>>> dmabuf does not force a uapi, you can acquire your pages however you >>>>> want and wrap them up in a dmabuf. No uapi at all. >>>>> >>>>> The point is that dmabuf already provides ops that do basically what >>>>> is needed here. We don't need ops calling ops just because dmabuf's >>>>> ops are not understsood or not perfect. Fixup dmabuf. >>>> >>>> Those ops, for example, are used to efficiently return used buffers >>>> back to the kernel, which is uapi, I don't see how dmabuf can be >>>> fixed up to cover it. >>> >>> Sure, but that doesn't mean you can't use dma buf for the other parts >>> of the flow. The per-page lifetime is a different topic than the >>> refcounting and access of the entire bulk of memory. >> >> Ok, so if we're leaving uapi (and ops) and keep per page/sub-buffer as >> is, the rest is resolving uptr -> pages, and passing it to page pool in >> a convenient to page pool format (net_iov). > > I'm not going to pretend to know about page pool details, but dmabuf > is the way to get the bulk of pages into a pool within the net stack's > allocator and keep that bulk properly refcounted while.> > An object like dmabuf is needed for the general case because there are > not going to be per-page references or otherwise available. They are already pinned, memory is owned by the provider, io_uring in this case, and it should not be freed circumventing io_uring, and at this stage calling release_pages() is not such a hassle, especially comparing to introducing an additional object. My question is how having an intermediary dmabuf benefits the net stack or io_uring ? For now IMO it doesn't solve anything but adds extra complexity. Adding dmabuf for the sake of adding dmabuf is not a great choice. > What you seem to want is to alter how the actual allocation flow works > from that bulk of memory and delay the free. It seems like a different For people who jumped here without looking what this patchset is about, that's the entire point of the io_uring zero copy approach as well as this set. Instead of using kernel private pages that you have no other option but to copy/mmap (and then free), it hands buffers to the user while using memory accessible/visible in some way by the user. That "delay free" is taking a reference while user is reading data (slightly different for devmem tcp). And note, it's not a page/dmabuf reference, kernel can forcibly take it back and release pages. > topic to me, and honestly hacking into the allocator free function > seems a bit weird.. Do you also think that DMA_BUF_IOCTL_SYNC is a weird hack, because it "delays free" by pinning the dmabuf object and letting the user read memory instead of copying it? I can find many examples
On 5/8/24 08:16, Daniel Vetter wrote: > On Tue, May 07, 2024 at 08:32:47PM -0300, Jason Gunthorpe wrote: >> On Tue, May 07, 2024 at 08:35:37PM +0100, Pavel Begunkov wrote: >>> On 5/7/24 18:56, Jason Gunthorpe wrote: >>>> On Tue, May 07, 2024 at 06:25:52PM +0100, Pavel Begunkov wrote: >>>>> On 5/7/24 17:48, Jason Gunthorpe wrote: >>>>>> On Tue, May 07, 2024 at 09:42:05AM -0700, Mina Almasry wrote: >>>>>> >>>>>>> 1. Align with devmem TCP to use udmabuf for your io_uring memory. I >>>>>>> think in the past you said it's a uapi you don't link but in the face >>>>>>> of this pushback you may want to reconsider. >>>>>> >>>>>> dmabuf does not force a uapi, you can acquire your pages however you >>>>>> want and wrap them up in a dmabuf. No uapi at all. >>>>>> >>>>>> The point is that dmabuf already provides ops that do basically what >>>>>> is needed here. We don't need ops calling ops just because dmabuf's >>>>>> ops are not understsood or not perfect. Fixup dmabuf. >>>>> >>>>> Those ops, for example, are used to efficiently return used buffers >>>>> back to the kernel, which is uapi, I don't see how dmabuf can be >>>>> fixed up to cover it. >>>> >>>> Sure, but that doesn't mean you can't use dma buf for the other parts >>>> of the flow. The per-page lifetime is a different topic than the >>>> refcounting and access of the entire bulk of memory. >>> >>> Ok, so if we're leaving uapi (and ops) and keep per page/sub-buffer as >>> is, the rest is resolving uptr -> pages, and passing it to page pool in >>> a convenient to page pool format (net_iov). >> >> I'm not going to pretend to know about page pool details, but dmabuf >> is the way to get the bulk of pages into a pool within the net stack's >> allocator and keep that bulk properly refcounted while. >> >> An object like dmabuf is needed for the general case because there are >> not going to be per-page references or otherwise available. >> >> What you seem to want is to alter how the actual allocation flow works >> from that bulk of memory and delay the free. It seems like a different >> topic to me, and honestly hacking into the allocator free function >> seems a bit weird.. > > Also I don't see how it's an argument against dma-buf as the interface for It's not, neither I said it is, but it is an argument against removing the network's page pool ops. > all these, because e.g. ttm internally does have a page pool because > depending upon allocator, that's indeed beneficial. Other drm drivers have > more buffer-based concepts for opportunistically memory around, usually > by marking buffers that are just kept as cache as purgeable (which is a > concept that goes all the way to opengl/vulkan). Because in this case it solves nothing and helps with nothing, quite the opposite. Just as well we can ask why NVMe doesn't wrap user pages into a dmabuf while doing IO. > But these are all internals of the dma-buf exporter, the dma-buf api users > don't ever need to care. > -Sima
On Wed, May 08, 2024 at 12:30:07PM +0100, Pavel Begunkov wrote: > > I'm not going to pretend to know about page pool details, but dmabuf > > is the way to get the bulk of pages into a pool within the net stack's > > allocator and keep that bulk properly refcounted while.> An object like > > dmabuf is needed for the general case because there are > > not going to be per-page references or otherwise available. > > They are already pinned, memory is owned by the provider, io_uring > in this case, and it should not be freed circumventing io_uring, > and at this stage calling release_pages() is not such a hassle, > especially comparing to introducing an additional object. Something needs to co-ordinate when the net stack's allocator is done with the bulk of pages and when io_uring and do the final put_user_page() to free it. DMABUF is not an unreasonable choice for this. > > topic to me, and honestly hacking into the allocator free function > > seems a bit weird.. > > Do you also think that DMA_BUF_IOCTL_SYNC is a weird hack, because > it "delays free" by pinning the dmabuf object and letting the user > read memory instead of copying it? I can find many examples It seems to me the flow you want is for the driver to allocate a page, put it on a rx ring, process it through the netstack, and deliver it to io_uring. io_uring would then sit on the allocation until userspace it done and return it back to the netstack allocator. Hooking the free of the netstack allocator and then defering it seems like a weird and indirect way to get there. Why can't io_uring just be the entity that does the final free and not mess with the logic allocator? Jason
On Wed, May 08, 2024 at 12:35:52PM +0100, Pavel Begunkov wrote: > On 5/8/24 08:16, Daniel Vetter wrote: > > On Tue, May 07, 2024 at 08:32:47PM -0300, Jason Gunthorpe wrote: > > > On Tue, May 07, 2024 at 08:35:37PM +0100, Pavel Begunkov wrote: > > > > On 5/7/24 18:56, Jason Gunthorpe wrote: > > > > > On Tue, May 07, 2024 at 06:25:52PM +0100, Pavel Begunkov wrote: > > > > > > On 5/7/24 17:48, Jason Gunthorpe wrote: > > > > > > > On Tue, May 07, 2024 at 09:42:05AM -0700, Mina Almasry wrote: > > > > > > > > > > > > > > > 1. Align with devmem TCP to use udmabuf for your io_uring memory. I > > > > > > > > think in the past you said it's a uapi you don't link but in the face > > > > > > > > of this pushback you may want to reconsider. > > > > > > > > > > > > > > dmabuf does not force a uapi, you can acquire your pages however you > > > > > > > want and wrap them up in a dmabuf. No uapi at all. > > > > > > > > > > > > > > The point is that dmabuf already provides ops that do basically what > > > > > > > is needed here. We don't need ops calling ops just because dmabuf's > > > > > > > ops are not understsood or not perfect. Fixup dmabuf. > > > > > > > > > > > > Those ops, for example, are used to efficiently return used buffers > > > > > > back to the kernel, which is uapi, I don't see how dmabuf can be > > > > > > fixed up to cover it. > > > > > > > > > > Sure, but that doesn't mean you can't use dma buf for the other parts > > > > > of the flow. The per-page lifetime is a different topic than the > > > > > refcounting and access of the entire bulk of memory. > > > > > > > > Ok, so if we're leaving uapi (and ops) and keep per page/sub-buffer as > > > > is, the rest is resolving uptr -> pages, and passing it to page pool in > > > > a convenient to page pool format (net_iov). > > > > > > I'm not going to pretend to know about page pool details, but dmabuf > > > is the way to get the bulk of pages into a pool within the net stack's > > > allocator and keep that bulk properly refcounted while. > > > > > > An object like dmabuf is needed for the general case because there are > > > not going to be per-page references or otherwise available. > > > > > > What you seem to want is to alter how the actual allocation flow works > > > from that bulk of memory and delay the free. It seems like a different > > > topic to me, and honestly hacking into the allocator free function > > > seems a bit weird.. > > > > Also I don't see how it's an argument against dma-buf as the interface for > > It's not, neither I said it is, but it is an argument against removing > the network's page pool ops. > > > all these, because e.g. ttm internally does have a page pool because > > depending upon allocator, that's indeed beneficial. Other drm drivers have > > more buffer-based concepts for opportunistically memory around, usually > > by marking buffers that are just kept as cache as purgeable (which is a > > concept that goes all the way to opengl/vulkan). > > Because in this case it solves nothing and helps with nothing, quite > the opposite. Just as well we can ask why NVMe doesn't wrap user pages > into a dmabuf while doing IO. Because the rules around memory reclaim, gfp nesting and guaranteed forward progress don't match up for block i/o. I looked quite a bit into gluing direct i/o into dma-buf because there's vulkan extensions for that, and it's an absolute mess. -Sima > > > But these are all internals of the dma-buf exporter, the dma-buf api users > > don't ever need to care. > > -Sima > > -- > Pavel Begunkov
On 5/8/24 15:25, Jason Gunthorpe wrote: > On Wed, May 08, 2024 at 12:30:07PM +0100, Pavel Begunkov wrote: > >>> I'm not going to pretend to know about page pool details, but dmabuf >>> is the way to get the bulk of pages into a pool within the net stack's >>> allocator and keep that bulk properly refcounted while.> An object like >>> dmabuf is needed for the general case because there are >>> not going to be per-page references or otherwise available. >> >> They are already pinned, memory is owned by the provider, io_uring >> in this case, and it should not be freed circumventing io_uring, >> and at this stage calling release_pages() is not such a hassle, >> especially comparing to introducing an additional object. > > Something needs to co-ordinate when the net stack's allocator is done > with the bulk of pages and when io_uring and do the final > put_user_page() to free it. DMABUF is not an unreasonable choice for > this. When a page pool dies notifies io_uring via the ->destroy callback. Vise versa, when io_uring wants to terminate zerocopy, it releases the interface queue, which kills the page pool, ending in ->destroy again. >>> topic to me, and honestly hacking into the allocator free function >>> seems a bit weird.. >> >> Do you also think that DMA_BUF_IOCTL_SYNC is a weird hack, because >> it "delays free" by pinning the dmabuf object and letting the user >> read memory instead of copying it? I can find many examples > > It seems to me the flow you want is for the driver to allocate a page, > put it on a rx ring, process it through the netstack, and deliver it > to io_uring. io_uring would then sit on the allocation until userspace > it done and return it back to the netstack allocator. That's right, with a note that "driver allocating a page" is not a alloc_page() but grabbing a user page/frag the user registered beforehand. > Hooking the free of the netstack allocator and then defering it seems FWIW, it's not about page pool's ->release_page, it's a slow path and in an ideal world wouldn't be called outside of tear down. > like a weird and indirect way to get there. Why can't io_uring just be > the entity that does the final free and not mess with the logic > allocator? Then the user has to do a syscall (e.g. via io_uring) to return pages, and there we'd need to care how to put the pages efficiently, i.e. hitting the page pool's fast path, e.g. by hoping napi is scheduled and scheduled for the CPU we're running on, or maybe transferring the pages to the right CPU first. Compare it with userspace putting pages into a ring, and the allocator taking from there when needed without any extra synchronisation and hassle just because it's a sole consumer.
On Wed, May 08, 2024 at 12:35:52PM +0100, Pavel Begunkov wrote: > > all these, because e.g. ttm internally does have a page pool because > > depending upon allocator, that's indeed beneficial. Other drm drivers have > > more buffer-based concepts for opportunistically memory around, usually > > by marking buffers that are just kept as cache as purgeable (which is a > > concept that goes all the way to opengl/vulkan). > > Because in this case it solves nothing and helps with nothing, quite > the opposite. Just as well we can ask why NVMe doesn't wrap user pages > into a dmabuf while doing IO. You seem to confused totally unrelated things. For short-term pins, that is pin_user_pages without FOLL_LONGTERM there would never be any point in using a dmabuf as the pin is transient. For long-term pin dmabufs in the block layer absolutely make sense, and I wish io_uring would have just implemented them from the start instead of the current fixed buffers that are not quite as useful by not pre-mapping DMA and not supporting P2P.
On Wed, May 08, 2024 at 04:44:32PM +0100, Pavel Begunkov wrote: > > like a weird and indirect way to get there. Why can't io_uring just be > > the entity that does the final free and not mess with the logic > > allocator? > > Then the user has to do a syscall (e.g. via io_uring) to return pages, > and there we'd need to care how to put the pages efficiently, i.e. > hitting the page pool's fast path, e.g. by hoping napi is scheduled and > scheduled for the CPU we're running on, or maybe transferring the pages > to the right CPU first. > > Compare it with userspace putting pages into a ring, and the allocator > taking from there when needed without any extra synchronisation and > hassle just because it's a sole consumer. Wow, that sounds a bit terrifying for security, but I guess I can see your point. You are replacing the whole allocator logic if you are effectively putting the free list in userspace memory. Jason
On 5/8/24 16:58, Jason Gunthorpe wrote: > On Wed, May 08, 2024 at 04:44:32PM +0100, Pavel Begunkov wrote: > >>> like a weird and indirect way to get there. Why can't io_uring just be >>> the entity that does the final free and not mess with the logic >>> allocator? >> >> Then the user has to do a syscall (e.g. via io_uring) to return pages, >> and there we'd need to care how to put the pages efficiently, i.e. >> hitting the page pool's fast path, e.g. by hoping napi is scheduled and >> scheduled for the CPU we're running on, or maybe transferring the pages >> to the right CPU first. >> >> Compare it with userspace putting pages into a ring, and the allocator >> taking from there when needed without any extra synchronisation and >> hassle just because it's a sole consumer. > > Wow, that sounds a bit terrifying for security, but I guess I can see > your point. Mind elaborating about security? "No synchronisation" is for grabbing from the ring, it's napi exclusive, but it does refcounting to make sure there are no previous net users left and the userspace doesn't try anything funny like returning a page twice. And it's not even a page but rather a separately refcounted buffer represented by an offset from the userspace POV. It doesn't even have to be page sized, hw benefits from smaller chunks. > You are replacing the whole allocator logic if you are effectively > putting the free list in userspace memory. > > Jason
On 5/8/24 16:51, Christoph Hellwig wrote: > On Wed, May 08, 2024 at 12:35:52PM +0100, Pavel Begunkov wrote: >>> all these, because e.g. ttm internally does have a page pool because >>> depending upon allocator, that's indeed beneficial. Other drm drivers have >>> more buffer-based concepts for opportunistically memory around, usually >>> by marking buffers that are just kept as cache as purgeable (which is a >>> concept that goes all the way to opengl/vulkan). >> >> Because in this case it solves nothing and helps with nothing, quite >> the opposite. Just as well we can ask why NVMe doesn't wrap user pages >> into a dmabuf while doing IO. > > You seem to confused totally unrelated things. > > For short-term pins, that is pin_user_pages without FOLL_LONGTERM there > would never be any point in using a dmabuf as the pin is transient. For > long-term pin dmabufs in the block layer absolutely make sense, and I Well, the example fell flat, but you don't use dmabuf when there are no upsides from using it. For instance, when you already have pinned pages, you're going to use pages, and there are no other refcounting concerns. Unless there is an advantage of dmabufs over FOLL_LONGTERM that I don't know about when used with normal user pages. > wish io_uring would have just implemented them from the start instead of > the current fixed buffers that are not quite as useful by not > pre-mapping DMA and not supporting P2P. fdget(dmabuf) would be horrible, I assume that's not the suggestion. But then it's really about kernel internals, and theoretically can be patched to wrap user pages into a dmabuf and pass it in. The tricky part is how that "pass it in" should looks like. Keith tried to cover both pre-mapping and p2p by registering dmabuf and passing the mapped addresses in an iter IIRC. Anyway, this discussion should better move from to block/fs lists, if there is any interest.
On Wed, May 08, 2024 at 06:02:14PM +0100, Pavel Begunkov wrote: > Well, the example fell flat, but you don't use dmabuf when there are > no upsides from using it. For instance, when you already have pinned > pages, you're going to use pages, and there are no other refcounting > concerns. Sure. > Unless there is an advantage of dmabufs over FOLL_LONGTERM > that I don't know about when used with normal user pages. The advantages of using a dma-buf over FOLL_LONGTERM are: a) you pre-dma map, which is a significant performance advantage for IOMMU-based setups b) you support any dma-buf exported and not just user memory. This is primarily important for PCIe P2P, but there might be other useful exporters as well > > wish io_uring would have just implemented them from the start instead of > > the current fixed buffers that are not quite as useful by not > > pre-mapping DMA and not supporting P2P. > > fdget(dmabuf) would be horrible, I assume that's not the suggestion. I'm not even sure what you mean with that.
diff --git a/include/net/page_pool/types.h b/include/net/page_pool/types.h index 5e43a08d3231..ffe5f31fb0da 100644 --- a/include/net/page_pool/types.h +++ b/include/net/page_pool/types.h @@ -52,6 +52,7 @@ struct pp_alloc_cache { * @dev: device, for DMA pre-mapping purposes * @netdev: netdev this pool will serve (leave as NULL if none or multiple) * @napi: NAPI which is the sole consumer of pages, otherwise NULL + * @queue: struct netdev_rx_queue this page_pool is being created for. * @dma_dir: DMA mapping direction * @max_len: max DMA sync memory size for PP_FLAG_DMA_SYNC_DEV * @offset: DMA sync address offset for PP_FLAG_DMA_SYNC_DEV @@ -64,6 +65,7 @@ struct page_pool_params { int nid; struct device *dev; struct napi_struct *napi; + struct netdev_rx_queue *queue; enum dma_data_direction dma_dir; unsigned int max_len; unsigned int offset; @@ -126,6 +128,13 @@ struct page_pool_stats { }; #endif +struct memory_provider_ops { + int (*init)(struct page_pool *pool); + void (*destroy)(struct page_pool *pool); + struct page *(*alloc_pages)(struct page_pool *pool, gfp_t gfp); + bool (*release_page)(struct page_pool *pool, struct page *page); +}; + struct page_pool { struct page_pool_params_fast p; @@ -176,6 +185,9 @@ struct page_pool { */ struct ptr_ring ring; + void *mp_priv; + const struct memory_provider_ops *mp_ops; + #ifdef CONFIG_PAGE_POOL_STATS /* recycle stats are per-cpu to avoid locking */ struct page_pool_recycle_stats __percpu *recycle_stats; diff --git a/net/core/page_pool.c b/net/core/page_pool.c index dd364d738c00..795b7ff1c01f 100644 --- a/net/core/page_pool.c +++ b/net/core/page_pool.c @@ -25,6 +25,8 @@ #include "page_pool_priv.h" +static DEFINE_STATIC_KEY_FALSE(page_pool_mem_providers); + #define DEFER_TIME (msecs_to_jiffies(1000)) #define DEFER_WARN_INTERVAL (60 * HZ) @@ -177,6 +179,7 @@ static int page_pool_init(struct page_pool *pool, int cpuid) { unsigned int ring_qsize = 1024; /* Default */ + int err; memcpy(&pool->p, ¶ms->fast, sizeof(pool->p)); memcpy(&pool->slow, ¶ms->slow, sizeof(pool->slow)); @@ -248,10 +251,25 @@ static int page_pool_init(struct page_pool *pool, /* Driver calling page_pool_create() also call page_pool_destroy() */ refcount_set(&pool->user_cnt, 1); + if (pool->mp_ops) { + err = pool->mp_ops->init(pool); + if (err) { + pr_warn("%s() mem-provider init failed %d\n", __func__, + err); + goto free_ptr_ring; + } + + static_branch_inc(&page_pool_mem_providers); + } + if (pool->p.flags & PP_FLAG_DMA_MAP) get_device(pool->p.dev); return 0; + +free_ptr_ring: + ptr_ring_cleanup(&pool->ring, NULL); + return err; } static void page_pool_uninit(struct page_pool *pool) @@ -546,7 +564,10 @@ struct page *page_pool_alloc_pages(struct page_pool *pool, gfp_t gfp) return page; /* Slow-path: cache empty, do real allocation */ - page = __page_pool_alloc_pages_slow(pool, gfp); + if (static_branch_unlikely(&page_pool_mem_providers) && pool->mp_ops) + page = pool->mp_ops->alloc_pages(pool, gfp); + else + page = __page_pool_alloc_pages_slow(pool, gfp); return page; } EXPORT_SYMBOL(page_pool_alloc_pages); @@ -603,10 +624,13 @@ void __page_pool_release_page_dma(struct page_pool *pool, struct page *page) void page_pool_return_page(struct page_pool *pool, struct page *page) { int count; + bool put; - __page_pool_release_page_dma(pool, page); - - page_pool_clear_pp_info(page); + put = true; + if (static_branch_unlikely(&page_pool_mem_providers) && pool->mp_ops) + put = pool->mp_ops->release_page(pool, page); + else + __page_pool_release_page_dma(pool, page); /* This may be the last page returned, releasing the pool, so * it is not safe to reference pool afterwards. @@ -614,7 +638,10 @@ void page_pool_return_page(struct page_pool *pool, struct page *page) count = atomic_inc_return_relaxed(&pool->pages_state_release_cnt); trace_page_pool_state_release(pool, page, count); - put_page(page); + if (put) { + page_pool_clear_pp_info(page); + put_page(page); + } /* An optimization would be to call __free_pages(page, pool->p.order) * knowing page is not part of page-cache (thus avoiding a * __page_cache_release() call). @@ -889,6 +916,12 @@ static void __page_pool_destroy(struct page_pool *pool) page_pool_unlist(pool); page_pool_uninit(pool); + + if (pool->mp_ops) { + pool->mp_ops->destroy(pool); + static_branch_dec(&page_pool_mem_providers); + } + kfree(pool); }