| Message ID | 20240305020153.2787423-10-almasrymina@google.com |
|---|---|
| State | New, archived |
| Series | Device Memory TCP |
On 2024-03-04 18:01, Mina Almasry wrote:
> +        if (pool->p.queue)
> +                binding = READ_ONCE(pool->p.queue->binding);
> +
> +        if (binding) {
> +                pool->mp_ops = &dmabuf_devmem_ops;
> +                pool->mp_priv = binding;
> +        }

This is specific to TCP devmem. For ZC Rx we will need something more
generic to let us pass our own memory provider backend down to the page
pool.

What about storing ops and priv void ptr in struct netdev_rx_queue
instead? Then we can both use it.
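For illustration only, the shape David is suggesting might look roughly like the sketch below. The field names are hypothetical and do not come from either series; the point is just that the rx queue would carry a backend-agnostic pair rather than a dmabuf-specific binding.

```c
/* Hypothetical sketch of David's suggestion: the rx queue carries a
 * backend-agnostic pair instead of a dmabuf-specific binding. Field
 * names here are made up for illustration.
 */
struct netdev_rx_queue {
        /* ... existing fields ... */
        const struct memory_provider_ops *mp_ops;       /* NULL when no provider is configured */
        void *mp_priv;                                   /* provider-specific state */
};
```

page_pool_init() could then copy that pair into pool->mp_ops / pool->mp_priv without knowing which backend is behind it; the rest of the thread discusses where exactly such a pair should live.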
On Tue, Mar 5, 2024 at 6:28 PM David Wei <dw@davidwei.uk> wrote:
>
> On 2024-03-04 18:01, Mina Almasry wrote:
> > +        if (pool->p.queue)
> > +                binding = READ_ONCE(pool->p.queue->binding);
> > +
> > +        if (binding) {
> > +                pool->mp_ops = &dmabuf_devmem_ops;
> > +                pool->mp_priv = binding;
> > +        }
>
> This is specific to TCP devmem. For ZC Rx we will need something more
> generic to let us pass our own memory provider backend down to the page
> pool.
>
> What about storing ops and priv void ptr in struct netdev_rx_queue
> instead? Then we can both use it.

Yes, this is dmabuf specific, I was thinking you'd define your own
member of netdev_rx_queue, and then add something like this to
page_pool_init:

+        if (pool->p.queue)
+                io_uring_metadata = READ_ONCE(pool->p.queue->io_uring_metadata);
+
+        /* We don't support rx-queues that are configured for both io_uring & dmabuf binding */
+        BUG_ON(io_uring_metadata && binding);
+
+        if (io_uring_metadata) {
+                pool->mp_ops = &io_uring_ops;
+                pool->mp_priv = io_uring_metadata;
+        }

I.e., we share the pool->mp_ops and the pool->mp_priv but we don't
really need to share the same netdev_rx_queue member. For me it's a
dma-buf specific data structure (netdev_dmabuf_binding) and for you
it's something else.

page_pool_init() probably needs to validate that the queue is
configured for dma-buf or io_uring but not both. If it's configured
for both then the user is doing something funky we shouldn't support.

Perhaps I can make the intention clearer by renaming 'binding' to
something more specific to dma-buf like queue->dmabuf_binding, to make
it clear that this is the dma-buf binding and not some other binding
like io_uring?
On 2024-03-05 18:42, Mina Almasry wrote:
> On Tue, Mar 5, 2024 at 6:28 PM David Wei <dw@davidwei.uk> wrote:
>>
>> On 2024-03-04 18:01, Mina Almasry wrote:
>>> +        if (pool->p.queue)
>>> +                binding = READ_ONCE(pool->p.queue->binding);
>>> +
>>> +        if (binding) {
>>> +                pool->mp_ops = &dmabuf_devmem_ops;
>>> +                pool->mp_priv = binding;
>>> +        }
>>
>> This is specific to TCP devmem. For ZC Rx we will need something more
>> generic to let us pass our own memory provider backend down to the page
>> pool.
>>
>> What about storing ops and priv void ptr in struct netdev_rx_queue
>> instead? Then we can both use it.
>
> Yes, this is dmabuf specific, I was thinking you'd define your own
> member of netdev_rx_queue, and then add something like this to
> page_pool_init:
>
> +        if (pool->p.queue)
> +                io_uring_metadata = READ_ONCE(pool->p.queue->io_uring_metadata);
> +
> +        /* We don't support rx-queues that are configured for both io_uring & dmabuf binding */
> +        BUG_ON(io_uring_metadata && binding);
> +
> +        if (io_uring_metadata) {
> +                pool->mp_ops = &io_uring_ops;
> +                pool->mp_priv = io_uring_metadata;
> +        }
>
> I.e., we share the pool->mp_ops and the pool->mp_priv but we don't
> really need to share the same netdev_rx_queue member. For me it's a
> dma-buf specific data structure (netdev_dmabuf_binding) and for you
> it's something else.

This adds size to struct netdev_rx_queue and requires checks on whether
both are set. There can be thousands of these structs at any one time so
if we don't need to add size unnecessarily then that would be best.

We can disambiguate by comparing &mp_ops and then cast the void ptr to
our impl specific objects.

What do you not like about this approach?

>
> page_pool_init() probably needs to validate that the queue is
> configured for dma-buf or io_uring but not both. If it's configured
> for both then the user is doing something funky we shouldn't support.
>
> Perhaps I can make the intention clearer by renaming 'binding' to
> something more specific to dma-buf like queue->dmabuf_binding, to make
> it clear that this is the dma-buf binding and not some other binding
> like io_uring?
>
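For illustration, the disambiguation David describes might look like the sketch below. `io_uring_ops` and `struct io_zc_rx_ifq` are placeholder names for the io_uring side, not identifiers from any posted patch; only `dmabuf_devmem_ops` and `netdev_dmabuf_binding` come from this series.

```c
/* Sketch only: with a single shared {mp_ops, mp_priv} pair, a consumer
 * that needs backend-specific handling compares the ops pointer and
 * casts mp_priv accordingly.
 */
static void example_handle_provider(struct page_pool *pool)
{
        if (pool->mp_ops == &dmabuf_devmem_ops) {
                struct netdev_dmabuf_binding *binding = pool->mp_priv;

                /* ... devmem TCP specific handling of 'binding' ... */
        } else if (pool->mp_ops == &io_uring_ops) {
                struct io_zc_rx_ifq *ifq = pool->mp_priv;

                /* ... io_uring ZC Rx specific handling of 'ifq' ... */
        }
}
```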
On Tue, Mar 5, 2024 at 6:47 PM David Wei <dw@davidwei.uk> wrote:
>
> On 2024-03-05 18:42, Mina Almasry wrote:
> > On Tue, Mar 5, 2024 at 6:28 PM David Wei <dw@davidwei.uk> wrote:
> >>
> >> On 2024-03-04 18:01, Mina Almasry wrote:
> >>> +        if (pool->p.queue)
> >>> +                binding = READ_ONCE(pool->p.queue->binding);
> >>> +
> >>> +        if (binding) {
> >>> +                pool->mp_ops = &dmabuf_devmem_ops;
> >>> +                pool->mp_priv = binding;
> >>> +        }
> >>
> >> This is specific to TCP devmem. For ZC Rx we will need something more
> >> generic to let us pass our own memory provider backend down to the page
> >> pool.
> >>
> >> What about storing ops and priv void ptr in struct netdev_rx_queue
> >> instead? Then we can both use it.
> >
> > Yes, this is dmabuf specific, I was thinking you'd define your own
> > member of netdev_rx_queue, and then add something like this to
> > page_pool_init:
> >
> > +        if (pool->p.queue)
> > +                io_uring_metadata = READ_ONCE(pool->p.queue->io_uring_metadata);
> > +
> > +        /* We don't support rx-queues that are configured for both io_uring & dmabuf binding */
> > +        BUG_ON(io_uring_metadata && binding);
> > +
> > +        if (io_uring_metadata) {
> > +                pool->mp_ops = &io_uring_ops;
> > +                pool->mp_priv = io_uring_metadata;
> > +        }
> >
> > I.e., we share the pool->mp_ops and the pool->mp_priv but we don't
> > really need to share the same netdev_rx_queue member. For me it's a
> > dma-buf specific data structure (netdev_dmabuf_binding) and for you
> > it's something else.
>
> This adds size to struct netdev_rx_queue and requires checks on whether
> both are set. There can be thousands of these structs at any one time so
> if we don't need to add size unnecessarily then that would be best.
>
> We can disambiguate by comparing &mp_ops and then cast the void ptr to
> our impl specific objects.
>
> What do you not like about this approach?
>

I was thinking it leaks page_pool specifics into a generic struct
unrelated to the page pool like netdev_rx_queue. My mental model is
that the rx-queue just says that it's bound to a dma-buf/io_uring
unaware of page_pool internals, and the page pool internals figure out
what to do from there.

Currently netdev_rx_queue.h doesn't include net/page_pool/types.h for
example because there is no dependency between netdev_rx_queue &
page_pool, I think this change would add a dependency.

But I concede it does not matter much AFAICT, I can certainly change
the netdev_rx_queue to hold the mp_priv & mp_ops directly and include
net/page_pool/types.h if you prefer that. I'll look into applying this
change in the next iteration if there are no objections.

> >
> > page_pool_init() probably needs to validate that the queue is
> > configured for dma-buf or io_uring but not both. If it's configured
> > for both then the user is doing something funky we shouldn't support.
> >
> > Perhaps I can make the intention clearer by renaming 'binding' to
> > something more specific to dma-buf like queue->dmabuf_binding, to make
> > it clear that this is the dma-buf binding and not some other binding
> > like io_uring?
> >
On 3/6/24 02:42, Mina Almasry wrote:
> On Tue, Mar 5, 2024 at 6:28 PM David Wei <dw@davidwei.uk> wrote:
>>
>> On 2024-03-04 18:01, Mina Almasry wrote:
>>> +        if (pool->p.queue)
>>> +                binding = READ_ONCE(pool->p.queue->binding);
>>> +
>>> +        if (binding) {
>>> +                pool->mp_ops = &dmabuf_devmem_ops;
>>> +                pool->mp_priv = binding;
>>> +        }
>>
>> This is specific to TCP devmem. For ZC Rx we will need something more
>> generic to let us pass our own memory provider backend down to the page
>> pool.
>>
>> What about storing ops and priv void ptr in struct netdev_rx_queue
>> instead? Then we can both use it.
>
> Yes, this is dmabuf specific, I was thinking you'd define your own
> member of netdev_rx_queue, and then add something like this to
> page_pool_init:

That would be quite annoying, there are 3 expected users together
with huge pages, each would need a field and check all others are
disabled as you mentioned and so on. It should be cleaner to pass
a generic {pp_ops, pp_private} pair instead.

If header dependencies are a problem, it can probably be

struct pp_provider_params {
        struct pp_ops ops;
        void *private;
};

# netdev_rx_queue.h

// definition is not included here
struct pp_provider_params;

struct netdev_rx_queue {
        ...
        struct pp_provider_params *pp_params;
};
On Wed, Mar 6, 2024 at 6:59 AM Pavel Begunkov <asml.silence@gmail.com> wrote:
>
> On 3/6/24 02:42, Mina Almasry wrote:
> > On Tue, Mar 5, 2024 at 6:28 PM David Wei <dw@davidwei.uk> wrote:
> >>
> >> On 2024-03-04 18:01, Mina Almasry wrote:
> >>> +        if (pool->p.queue)
> >>> +                binding = READ_ONCE(pool->p.queue->binding);
> >>> +
> >>> +        if (binding) {
> >>> +                pool->mp_ops = &dmabuf_devmem_ops;
> >>> +                pool->mp_priv = binding;
> >>> +        }
> >>
> >> This is specific to TCP devmem. For ZC Rx we will need something more
> >> generic to let us pass our own memory provider backend down to the page
> >> pool.
> >>
> >> What about storing ops and priv void ptr in struct netdev_rx_queue
> >> instead? Then we can both use it.
> >
> > Yes, this is dmabuf specific, I was thinking you'd define your own
> > member of netdev_rx_queue, and then add something like this to
> > page_pool_init:
>
> That would be quite annoying, there are 3 expected users together
> with huge pages, each would need a field and check all others are
> disabled as you mentioned and so on. It should be cleaner to pass
> a generic {pp_ops, pp_private} pair instead.
>
> If header dependencies are a problem, it can probably be
>
> struct pp_provider_params {
>         struct pp_ops ops;
>         void *private;
> };
>
> # netdev_rx_queue.h
>
> // definition is not included here
> struct pp_provider_params;
>
> struct netdev_rx_queue {
>         ...
>         struct pp_provider_params *pp_params;
> };
>

Seems very reasonable, will do! Thanks!

> --
> Pavel Begunkov
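In concrete terms, the direction being agreed on here would presumably turn the page_pool_init() hunk in the patch below into something like the following sketch. It is based on Pavel's pp_provider_params idea, not on code from a posted revision, and the helper name is made up.

```c
/* Sketch of page_pool_init() consuming a generic provider pair stored
 * on the rx queue, replacing the dmabuf-specific 'binding' lookup in
 * the patch below. Names loosely follow Pavel's sketch above.
 */
struct pp_provider_params {
        const struct memory_provider_ops *ops;
        void *private;
};

static void page_pool_init_provider(struct page_pool *pool)
{
        const struct pp_provider_params *pp_params = NULL;

        if (pool->p.queue)
                pp_params = READ_ONCE(pool->p.queue->pp_params);

        if (pp_params) {
                pool->mp_ops = pp_params->ops;
                pool->mp_priv = pp_params->private;
        }
}
```

With this shape, the dmabuf binding code and an io_uring backend would both install themselves by filling in one pp_provider_params on the queue, and page_pool_init() stays backend-agnostic.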
```diff
diff --git a/include/net/netmem.h b/include/net/netmem.h
index 8699788d587d..a2de9411025d 100644
--- a/include/net/netmem.h
+++ b/include/net/netmem.h
@@ -127,6 +127,20 @@ static inline struct page *netmem_to_page(netmem_ref netmem)
 	return (__force struct page *)netmem;
 }
 
+static inline struct net_iov *netmem_to_net_iov(netmem_ref netmem)
+{
+	if (netmem_is_net_iov(netmem))
+		return (struct net_iov *)((__force unsigned long)netmem & ~NET_IOV);
+
+	DEBUG_NET_WARN_ON_ONCE(true);
+	return NULL;
+}
+
+static inline netmem_ref net_iov_to_netmem(struct net_iov *niov)
+{
+	return (__force netmem_ref)((unsigned long)niov | NET_IOV);
+}
+
 static inline netmem_ref page_to_netmem(struct page *page)
 {
 	return (__force netmem_ref)page;
diff --git a/include/net/page_pool/helpers.h b/include/net/page_pool/helpers.h
index c6a55eddefae..00682b4de6e8 100644
--- a/include/net/page_pool/helpers.h
+++ b/include/net/page_pool/helpers.h
@@ -453,4 +453,25 @@ static inline void page_pool_nid_changed(struct page_pool *pool, int new_nid)
 		page_pool_update_nid(pool, new_nid);
 }
 
+static inline void page_pool_set_pp_info(struct page_pool *pool, netmem_ref netmem)
+{
+	netmem_set_pp(netmem, pool);
+	netmem_or_pp_magic(netmem, PP_SIGNATURE);
+
+	/* Ensuring all pages have been split into one fragment initially:
+	 * page_pool_set_pp_info() is only called once for every page when it
+	 * is allocated from the page allocator and page_pool_fragment_page()
+	 * is dirtying the same cache line as the page->pp_magic above, so
+	 * the overhead is negligible.
+	 */
+	page_pool_fragment_netmem(netmem, 1);
+	if (pool->has_init_callback)
+		pool->slow.init_callback(netmem, pool->slow.init_arg);
+}
+
+static inline void page_pool_clear_pp_info(netmem_ref netmem)
+{
+	netmem_clear_pp_magic(netmem);
+	netmem_set_pp(netmem, NULL);
+}
 #endif /* _NET_PAGE_POOL_HELPERS_H */
diff --git a/include/net/page_pool/types.h b/include/net/page_pool/types.h
index e29e77f7934e..096cd2455b2c 100644
--- a/include/net/page_pool/types.h
+++ b/include/net/page_pool/types.h
@@ -136,6 +136,8 @@ struct memory_provider_ops {
 	bool (*release_page)(struct page_pool *pool, netmem_ref netmem);
 };
 
+extern const struct memory_provider_ops dmabuf_devmem_ops;
+
 struct page_pool {
 	struct page_pool_params_fast p;
 
diff --git a/net/core/devmem.c b/net/core/devmem.c
index 57d3a1f223ef..3ced312f7860 100644
--- a/net/core/devmem.c
+++ b/net/core/devmem.c
@@ -329,3 +329,85 @@ int netdev_bind_dmabuf(struct net_device *dev, unsigned int dmabuf_fd,
 	return err;
 }
 #endif
+
+/*** "Dmabuf devmem memory provider" ***/
+
+static int mp_dmabuf_devmem_init(struct page_pool *pool)
+{
+	struct netdev_dmabuf_binding *binding = pool->mp_priv;
+
+	if (!binding)
+		return -EINVAL;
+
+	if (!(pool->p.flags & PP_FLAG_DMA_MAP))
+		return -EOPNOTSUPP;
+
+	if (pool->p.flags & PP_FLAG_DMA_SYNC_DEV)
+		return -EOPNOTSUPP;
+
+	if (pool->p.order != 0)
+		return -E2BIG;
+
+	netdev_dmabuf_binding_get(binding);
+	return 0;
+}
+
+static netmem_ref mp_dmabuf_devmem_alloc_pages(struct page_pool *pool,
+					       gfp_t gfp)
+{
+	struct netdev_dmabuf_binding *binding = pool->mp_priv;
+	netmem_ref netmem;
+	struct net_iov *niov;
+	dma_addr_t dma_addr;
+
+	niov = netdev_alloc_dmabuf(binding);
+	if (!niov)
+		return 0;
+
+	dma_addr = net_iov_dma_addr(niov);
+
+	netmem = net_iov_to_netmem(niov);
+
+	page_pool_set_pp_info(pool, netmem);
+
+	if (page_pool_set_dma_addr_netmem(netmem, dma_addr))
+		goto err_free;
+
+	pool->pages_state_hold_cnt++;
+	trace_page_pool_state_hold(pool, netmem, pool->pages_state_hold_cnt);
+	return netmem;
+
+err_free:
+	netdev_free_dmabuf(niov);
+	return 0;
+}
+
+static void mp_dmabuf_devmem_destroy(struct page_pool *pool)
+{
+	struct netdev_dmabuf_binding *binding = pool->mp_priv;
+
+	netdev_dmabuf_binding_put(binding);
+}
+
+static bool mp_dmabuf_devmem_release_page(struct page_pool *pool,
+					  netmem_ref netmem)
+{
+	WARN_ON_ONCE(!netmem_is_net_iov(netmem));
+	WARN_ON_ONCE(atomic_long_read(netmem_get_pp_ref_count_ref(netmem))
+		     != 1);
+
+	page_pool_clear_pp_info(netmem);
+
+	netdev_free_dmabuf(netmem_to_net_iov(netmem));
+
+	/* We don't want the page pool put_page()ing our net_iovs. */
+	return false;
+}
+
+const struct memory_provider_ops dmabuf_devmem_ops = {
+	.init		= mp_dmabuf_devmem_init,
+	.destroy	= mp_dmabuf_devmem_destroy,
+	.alloc_pages	= mp_dmabuf_devmem_alloc_pages,
+	.release_page	= mp_dmabuf_devmem_release_page,
+};
+EXPORT_SYMBOL(dmabuf_devmem_ops);
diff --git a/net/core/page_pool.c b/net/core/page_pool.c
index 22e3d439da18..2cee7d9f6ca6 100644
--- a/net/core/page_pool.c
+++ b/net/core/page_pool.c
@@ -12,6 +12,7 @@
 
 #include <net/page_pool/helpers.h>
 #include <net/xdp.h>
+#include <net/netdev_rx_queue.h>
 
 #include <linux/dma-direction.h>
 #include <linux/dma-mapping.h>
@@ -20,12 +21,15 @@
 #include <linux/poison.h>
 #include <linux/ethtool.h>
 #include <linux/netdevice.h>
+#include <linux/genalloc.h>
+#include <net/devmem.h>
 
 #include <trace/events/page_pool.h>
 
 #include "page_pool_priv.h"
 
 DEFINE_STATIC_KEY_FALSE(page_pool_mem_providers);
+EXPORT_SYMBOL(page_pool_mem_providers);
 
 #define DEFER_TIME (msecs_to_jiffies(1000))
 #define DEFER_WARN_INTERVAL (60 * HZ)
@@ -178,6 +182,7 @@ static int page_pool_init(struct page_pool *pool,
 			  const struct page_pool_params *params,
 			  int cpuid)
 {
+	struct netdev_dmabuf_binding *binding = NULL;
 	unsigned int ring_qsize = 1024; /* Default */
 	int err;
 
@@ -251,6 +256,14 @@ static int page_pool_init(struct page_pool *pool,
 	/* Driver calling page_pool_create() also call page_pool_destroy() */
 	refcount_set(&pool->user_cnt, 1);
 
+	if (pool->p.queue)
+		binding = READ_ONCE(pool->p.queue->binding);
+
+	if (binding) {
+		pool->mp_ops = &dmabuf_devmem_ops;
+		pool->mp_priv = binding;
+	}
+
 	if (pool->mp_ops) {
 		err = pool->mp_ops->init(pool);
 		if (err) {
@@ -444,28 +457,6 @@ static bool page_pool_dma_map(struct page_pool *pool, netmem_ref netmem)
 	return false;
 }
 
-static void page_pool_set_pp_info(struct page_pool *pool, netmem_ref netmem)
-{
-	netmem_set_pp(netmem, pool);
-	netmem_or_pp_magic(netmem, PP_SIGNATURE);
-
-	/* Ensuring all pages have been split into one fragment initially:
-	 * page_pool_set_pp_info() is only called once for every page when it
-	 * is allocated from the page allocator and page_pool_fragment_page()
-	 * is dirtying the same cache line as the page->pp_magic above, so
-	 * the overhead is negligible.
-	 */
-	page_pool_fragment_netmem(netmem, 1);
-	if (pool->has_init_callback)
-		pool->slow.init_callback(netmem, pool->slow.init_arg);
-}
-
-static void page_pool_clear_pp_info(netmem_ref netmem)
-{
-	netmem_clear_pp_magic(netmem);
-	netmem_set_pp(netmem, NULL);
-}
-
 static struct page *__page_pool_alloc_page_order(struct page_pool *pool,
 						 gfp_t gfp)
 {
```
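For readers following along, the flow this patch appears to set up is: netdev_bind_dmabuf() (seen in the devmem.c hunk above) records the binding on the rx queue, the driver then recreates its page pool for that queue, page_pool_init() sees queue->binding and installs dmabuf_devmem_ops, and subsequent allocations come from the dma-buf rather than the page allocator. A rough, hypothetical driver-side illustration follows; the function name, queue id handling, and parameter values are made up for the example and are not part of the patch.

```c
/* Hypothetical driver-side sketch: if the rx queue was previously bound
 * to a dma-buf, page_pool_init() above picks dmabuf_devmem_ops
 * automatically; the driver only needs to pass the queue in its
 * page_pool_params.
 */
static struct page_pool *example_create_rx_pool(struct net_device *dev,
						unsigned int qid)
{
	struct page_pool_params pp = {
		.order		= 0,			/* the provider rejects higher orders (-E2BIG) */
		.flags		= PP_FLAG_DMA_MAP,	/* required by mp_dmabuf_devmem_init() */
		.pool_size	= 1024,
		.nid		= NUMA_NO_NODE,
		.dev		= &dev->dev,
		.dma_dir	= DMA_FROM_DEVICE,
		.queue		= __netif_get_rx_queue(dev, qid),
	};

	return page_pool_create(&pp);
}
```

Note that PP_FLAG_DMA_SYNC_DEV is deliberately not set here, since mp_dmabuf_devmem_init() rejects it.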