
[net-next,v3,3/3] page_pool: fix IOMMU crash when driver has already unbound

Message ID 20241022032214.3915232-4-linyunsheng@huawei.com (mailing list archive)
State New
Series None

Commit Message

Yunsheng Lin Oct. 22, 2024, 3:22 a.m. UTC
Networking drivers with page_pool support may hand over pages that
are still DMA-mapped to the network stack, and reuse those pages
after the network stack is done with them and returns them to the
page_pool, to avoid the penalty of repeated DMA mapping/unmapping.
With all the caching in the network stack, some pages may be held
there without being returned to the page_pool soon enough, and when
a VF disable causes the driver to be unbound, the page_pool does not
stall the driver's unbinding work. Instead, the page_pool uses a
workqueue to periodically check whether pages have come back from
the network stack, and if so, it performs the DMA-unmapping-related
cleanup work.

As mentioned in [1], attempting DMA unmaps after the driver has
already been unbound may leak resources or at worst corrupt
memory. Fundamentally, the page pool code cannot allow DMA
mappings to outlive the driver they belong to.

Currently there appear to be at least two cases where a page is not
released fast enough, causing the DMA unmapping to be done after the
driver has already been unbound:
1. ipv4 packet defragmentation timeout: this seems to cause a delay
   of up to 30 secs.
2. skb_defer_free_flush(): this may cause an infinite delay if there
   is nothing to trigger net_rx_action().

In order to avoid doing the DMA unmapping after the driver has
already been unbound, and to avoid stalling the unloading of the
networking driver, add the pool->items array to record all the
pages, including the ones handed over to the network stack, so the
page_pool can do the DMA unmapping for those pages when
page_pool_destroy() is called. As pool->items needs to be large
enough to avoid performance degradation, add an 'item_full' stat to
indicate allocation failures caused by pool->items having no free
entry.

Note, the devmem patchset seems to make the bug harder to fix, and
may make backporting harder too. As there is no actual user of
devmem yet and the fix for the devmem case is unclear for now, this
patch does not attempt to fix it.

1. https://lore.kernel.org/lkml/8067f204-1380-4d37-8ffd-007fc6f26738@kernel.org/T/

Fixes: f71fec47c2df ("page_pool: make sure struct device is stable")
Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
Tested-by: Yonglong Liu <liuyonglong@huawei.com>
CC: Robin Murphy <robin.murphy@arm.com>
CC: Alexander Duyck <alexander.duyck@gmail.com>
CC: IOMMU <iommu@lists.linux.dev>
---
 include/linux/mm_types.h        |   2 +-
 include/linux/skbuff.h          |   1 +
 include/net/netmem.h            |  10 +-
 include/net/page_pool/helpers.h |   4 +-
 include/net/page_pool/types.h   |  17 ++-
 net/core/devmem.c               |   4 +-
 net/core/netmem_priv.h          |   5 +-
 net/core/page_pool.c            | 213 ++++++++++++++++++++++++++------
 net/core/page_pool_priv.h       |  10 +-
 9 files changed, 210 insertions(+), 56 deletions(-)

Comments

Simon Horman Oct. 22, 2024, 4:40 p.m. UTC | #1
On Tue, Oct 22, 2024 at 11:22:13AM +0800, Yunsheng Lin wrote:
> Networking driver with page_pool support may hand over page
> still with dma mapping to network stack and try to reuse that
> page after network stack is done with it and passes it back
> to page_pool to avoid the penalty of dma mapping/unmapping.
> With all the caching in the network stack, some pages may be
> held in the network stack without returning to the page_pool
> soon enough, and with VF disable causing the driver unbound,
> the page_pool does not stop the driver from doing it's
> unbounding work, instead page_pool uses workqueue to check
> if there is some pages coming back from the network stack
> periodically, if there is any, it will do the dma unmmapping
> related cleanup work.
> 
> As mentioned in [1], attempting DMA unmaps after the driver
> has already unbound may leak resources or at worst corrupt
> memory. Fundamentally, the page pool code cannot allow DMA
> mappings to outlive the driver they belong to.
> 
> Currently it seems there are at least two cases that the page
> is not released fast enough causing dma unmmapping done after
> driver has already unbound:
> 1. ipv4 packet defragmentation timeout: this seems to cause
>    delay up to 30 secs.
> 2. skb_defer_free_flush(): this may cause infinite delay if
>    there is no triggering for net_rx_action().
> 
> In order not to do the dma unmmapping after driver has already
> unbound and stall the unloading of the networking driver, add
> the pool->items array to record all the pages including the ones
> which are handed over to network stack, so the page_pool can
> do the dma unmmapping for those pages when page_pool_destroy()
> is called. As the pool->items need to be large enough to avoid
> performance degradation, add a 'item_full' stat to indicate the
> allocation failure due to unavailability of pool->items.
> 
> Note, the devmem patchset seems to make the bug harder to fix,
> and may make backporting harder too. As there is no actual user
> for the devmem and the fixing for devmem is unclear for now,
> this patch does not consider fixing the case for devmem yet.
> 
> 1. https://lore.kernel.org/lkml/8067f204-1380-4d37-8ffd-007fc6f26738@kernel.org/T/
> 
> Fixes: f71fec47c2df ("page_pool: make sure struct device is stable")
> Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
> Tested-by: Yonglong Liu <liuyonglong@huawei.com>
> CC: Robin Murphy <robin.murphy@arm.com>
> CC: Alexander Duyck <alexander.duyck@gmail.com>
> CC: IOMMU <iommu@lists.linux.dev>

...

> diff --git a/include/net/page_pool/types.h b/include/net/page_pool/types.h
> index c022c410abe3..194006d2930f 100644
> --- a/include/net/page_pool/types.h
> +++ b/include/net/page_pool/types.h
> @@ -102,6 +102,7 @@ struct page_pool_params {
>   * @refill:	an allocation which triggered a refill of the cache
>   * @waive:	pages obtained from the ptr ring that cannot be added to
>   *		the cache due to a NUMA mismatch
> + * @item_full	items array is full

No need to block progress because of this, but the correct syntax is:

  * @item_full: ...

Flagged by ./scripts/kernel-doc -none

...
Jesper Dangaard Brouer Oct. 22, 2024, 6:14 p.m. UTC | #2
On 22/10/2024 05.22, Yunsheng Lin wrote:
> Networking driver with page_pool support may hand over page
> still with dma mapping to network stack and try to reuse that
> page after network stack is done with it and passes it back
> to page_pool to avoid the penalty of dma mapping/unmapping.
> With all the caching in the network stack, some pages may be
> held in the network stack without returning to the page_pool
> soon enough, and with VF disable causing the driver unbound,
> the page_pool does not stop the driver from doing it's
> unbounding work, instead page_pool uses workqueue to check
> if there is some pages coming back from the network stack
> periodically, if there is any, it will do the dma unmmapping
> related cleanup work.
> 
> As mentioned in [1], attempting DMA unmaps after the driver
> has already unbound may leak resources or at worst corrupt
> memory. Fundamentally, the page pool code cannot allow DMA
> mappings to outlive the driver they belong to.
> 
> Currently it seems there are at least two cases that the page
> is not released fast enough causing dma unmmapping done after
> driver has already unbound:
> 1. ipv4 packet defragmentation timeout: this seems to cause
>     delay up to 30 secs.
> 2. skb_defer_free_flush(): this may cause infinite delay if
>     there is no triggering for net_rx_action().
> 
> In order not to do the dma unmmapping after driver has already
> unbound and stall the unloading of the networking driver, add
> the pool->items array to record all the pages including the ones
> which are handed over to network stack, so the page_pool can

I really really dislike this approach!

Nacked-by: Jesper Dangaard Brouer <hawk@kernel.org>

Having to keep an array recording all the pages, including the ones
handed over to the network stack, goes against the very principle
behind page_pool. We added members to struct page precisely so that
pages could be "outstanding".

The page_pool already has a system for waiting for these outstanding /
inflight packets to get returned.  As I suggested before, the page_pool
should simply take over the responsibility (from the net_device) of
freeing the struct device (after inflight reaches zero), which AFAIK is
what the DMA device is attached to.

The alternative is what Kuba suggested (and proposed an RFC for): that
the net_device teardown waits for the page_pool inflight packets.
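
As a rough sketch of the first suggestion (purely illustrative, not code
from this series; it reuses the get_device()/put_device() pairing that
this patch removes, but defers the put until inflight has reached zero):

/* Illustrative sketch only: pin the struct device for as long as the
 * pool may still hold DMA mappings, instead of tracking every page.
 */
static int page_pool_init_sketch(struct page_pool *pool)
{
	if (pool->dma_map)
		get_device(pool->p.dev);	/* keep the device alive */
	return 0;
}

/* Called from the existing deferred-release path, but only once
 * page_pool_inflight() has dropped to zero.
 */
static void page_pool_free_sketch(struct page_pool *pool)
{
	if (pool->dma_map)
		put_device(pool->p.dev);	/* all mappings are unmapped */
}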


> do the dma unmmapping for those pages when page_pool_destroy()
> is called. As the pool->items need to be large enough to avoid
> performance degradation, add a 'item_full' stat to indicate the
> allocation failure due to unavailability of pool->items.

Are you saying page_pool will return allocation failures based on the size
of this pool->items array?

Production workloads will have a very large amount of outstanding packet
memory, e.g. in TCP sockets.  In production I'm seeing TCP memory
reach 24 GiB (on machines with 384 GiB of memory).


> Note, the devmem patchset seems to make the bug harder to fix,
> and may make backporting harder too. As there is no actual user
> for the devmem and the fixing for devmem is unclear for now,
> this patch does not consider fixing the case for devmem yet.
> 
> 1. https://lore.kernel.org/lkml/8067f204-1380-4d37-8ffd-007fc6f26738@kernel.org/T/
> 
> Fixes: f71fec47c2df ("page_pool: make sure struct device is stable")
> Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
> Tested-by: Yonglong Liu <liuyonglong@huawei.com>
> CC: Robin Murphy <robin.murphy@arm.com>
> CC: Alexander Duyck <alexander.duyck@gmail.com>
> CC: IOMMU <iommu@lists.linux.dev>
> ---
>   include/linux/mm_types.h        |   2 +-
>   include/linux/skbuff.h          |   1 +
>   include/net/netmem.h            |  10 +-
>   include/net/page_pool/helpers.h |   4 +-
>   include/net/page_pool/types.h   |  17 ++-
>   net/core/devmem.c               |   4 +-
>   net/core/netmem_priv.h          |   5 +-
>   net/core/page_pool.c            | 213 ++++++++++++++++++++++++++------
>   net/core/page_pool_priv.h       |  10 +-
>   9 files changed, 210 insertions(+), 56 deletions(-)
> 
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 6e3bdf8e38bc..d3e30dcfa021 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -120,7 +120,7 @@ struct page {
>   			 * page_pool allocated pages.
>   			 */
>   			unsigned long pp_magic;
> -			struct page_pool *pp;
> +			struct page_pool_item *pp_item;

Why do we need to change the pp pointer?

Why can't we access the "items" array like pp->items?


>   			unsigned long _pp_mapping_pad;
>   			unsigned long dma_addr;
>   			atomic_long_t pp_ref_count;
> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> index 48f1e0fa2a13..44f3707ec3e3 100644
> --- a/include/linux/skbuff.h
> +++ b/include/linux/skbuff.h
> @@ -38,6 +38,7 @@
>   #include <net/net_debug.h>
>   #include <net/dropreason-core.h>
>   #include <net/netmem.h>
> +#include <net/page_pool/types.h>
>   
>   /**
>    * DOC: skb checksums
> diff --git a/include/net/netmem.h b/include/net/netmem.h
> index 8a6e20be4b9d..5e7b4d1c1c44 100644
> --- a/include/net/netmem.h
> +++ b/include/net/netmem.h
> @@ -23,7 +23,7 @@ DECLARE_STATIC_KEY_FALSE(page_pool_mem_providers);
>   struct net_iov {
>   	unsigned long __unused_padding;
>   	unsigned long pp_magic;
> -	struct page_pool *pp;
> +	struct page_pool_item *pp_item;
>   	struct dmabuf_genpool_chunk_owner *owner;
>   	unsigned long dma_addr;
>   	atomic_long_t pp_ref_count;
> @@ -33,7 +33,7 @@ struct net_iov {
>    *
>    *        struct {
>    *                unsigned long pp_magic;
> - *                struct page_pool *pp;
> + *                struct page_pool_item *pp_item;
>    *                unsigned long _pp_mapping_pad;
>    *                unsigned long dma_addr;
>    *                atomic_long_t pp_ref_count;
> @@ -49,7 +49,7 @@ struct net_iov {
>   	static_assert(offsetof(struct page, pg) == \
>   		      offsetof(struct net_iov, iov))
>   NET_IOV_ASSERT_OFFSET(pp_magic, pp_magic);
> -NET_IOV_ASSERT_OFFSET(pp, pp);
> +NET_IOV_ASSERT_OFFSET(pp_item, pp_item);
>   NET_IOV_ASSERT_OFFSET(dma_addr, dma_addr);
>   NET_IOV_ASSERT_OFFSET(pp_ref_count, pp_ref_count);
>   #undef NET_IOV_ASSERT_OFFSET
> @@ -127,9 +127,9 @@ static inline struct net_iov *__netmem_clear_lsb(netmem_ref netmem)
>   	return (struct net_iov *)((__force unsigned long)netmem & ~NET_IOV);
>   }
>   
> -static inline struct page_pool *netmem_get_pp(netmem_ref netmem)
> +static inline struct page_pool_item *netmem_get_pp_item(netmem_ref netmem)
>   {
> -	return __netmem_clear_lsb(netmem)->pp;
> +	return __netmem_clear_lsb(netmem)->pp_item;
>   }
>   
>   static inline atomic_long_t *netmem_get_pp_ref_count_ref(netmem_ref netmem)
> diff --git a/include/net/page_pool/helpers.h b/include/net/page_pool/helpers.h
> index 1659f1995985..f781c81f8aa9 100644
> --- a/include/net/page_pool/helpers.h
> +++ b/include/net/page_pool/helpers.h
> @@ -85,7 +85,9 @@ static inline u64 *page_pool_ethtool_stats_get(u64 *data, const void *stats)
>   
>   static inline struct page_pool *page_pool_to_pp(struct page *page)
>   {
> -	return page->pp;
> +	struct page_pool_item *item = page->pp_item;
> +
> +	return container_of(item, struct page_pool, items[item->pp_idx]);
>   }
>   
>   /**
> diff --git a/include/net/page_pool/types.h b/include/net/page_pool/types.h
> index c022c410abe3..194006d2930f 100644
> --- a/include/net/page_pool/types.h
> +++ b/include/net/page_pool/types.h
> @@ -102,6 +102,7 @@ struct page_pool_params {
>    * @refill:	an allocation which triggered a refill of the cache
>    * @waive:	pages obtained from the ptr ring that cannot be added to
>    *		the cache due to a NUMA mismatch
> + * @item_full	items array is full
>    */
>   struct page_pool_alloc_stats {
>   	u64 fast;
> @@ -110,6 +111,7 @@ struct page_pool_alloc_stats {
>   	u64 empty;
>   	u64 refill;
>   	u64 waive;
> +	u64 item_full;
>   };
>   
>   /**
> @@ -142,6 +144,11 @@ struct page_pool_stats {
>   };
>   #endif
>   
> +struct page_pool_item {
> +	netmem_ref pp_netmem;
> +	unsigned int pp_idx;
> +};
> +
>   /* The whole frag API block must stay within one cacheline. On 32-bit systems,
>    * sizeof(long) == sizeof(int), so that the block size is ``3 * sizeof(long)``.
>    * On 64-bit systems, the actual size is ``2 * sizeof(long) + sizeof(int)``.
> @@ -161,6 +168,8 @@ struct page_pool {
>   
>   	int cpuid;
>   	u32 pages_state_hold_cnt;
> +	unsigned int item_mask;
> +	unsigned int item_idx;
>   
>   	bool has_init_callback:1;	/* slow::init_callback is set */
>   	bool dma_map:1;			/* Perform DMA mapping */
> @@ -228,7 +237,11 @@ struct page_pool {
>   	 */
>   	refcount_t user_cnt;
>   
> -	u64 destroy_cnt;
> +	/* Lock to avoid doing dma unmapping concurrently when
> +	 * destroy_cnt > 0.
> +	 */
> +	spinlock_t destroy_lock;
> +	unsigned int destroy_cnt;
>   
>   	/* Slow/Control-path information follows */
>   	struct page_pool_params_slow slow;
> @@ -239,6 +252,8 @@ struct page_pool {
>   		u32 napi_id;
>   		u32 id;
>   	} user;
> +
> +	struct page_pool_item items[] ____cacheline_aligned_in_smp;
>   };
>   
>   struct page *page_pool_alloc_pages(struct page_pool *pool, gfp_t gfp);
> diff --git a/net/core/devmem.c b/net/core/devmem.c
> index 11b91c12ee11..09c5aa83f12a 100644
> --- a/net/core/devmem.c
> +++ b/net/core/devmem.c
> @@ -85,7 +85,7 @@ net_devmem_alloc_dmabuf(struct net_devmem_dmabuf_binding *binding)
>   	niov = &owner->niovs[index];
>   
>   	niov->pp_magic = 0;
> -	niov->pp = NULL;
> +	niov->pp_item = NULL;
>   	atomic_long_set(&niov->pp_ref_count, 0);
>   
>   	return niov;
> @@ -380,7 +380,7 @@ bool mp_dmabuf_devmem_release_page(struct page_pool *pool, netmem_ref netmem)
>   	if (WARN_ON_ONCE(refcount != 1))
>   		return false;
>   
> -	page_pool_clear_pp_info(netmem);
> +	page_pool_clear_pp_info(pool, netmem);
>   
>   	net_devmem_free_dmabuf(netmem_to_net_iov(netmem));
>   
> diff --git a/net/core/netmem_priv.h b/net/core/netmem_priv.h
> index 7eadb8393e00..3173f6070cf7 100644
> --- a/net/core/netmem_priv.h
> +++ b/net/core/netmem_priv.h
> @@ -18,9 +18,10 @@ static inline void netmem_clear_pp_magic(netmem_ref netmem)
>   	__netmem_clear_lsb(netmem)->pp_magic = 0;
>   }
>   
> -static inline void netmem_set_pp(netmem_ref netmem, struct page_pool *pool)
> +static inline void netmem_set_pp_item(netmem_ref netmem,
> +				      struct page_pool_item *item)
>   {
> -	__netmem_clear_lsb(netmem)->pp = pool;
> +	__netmem_clear_lsb(netmem)->pp_item = item;
>   }
>   
>   static inline void netmem_set_dma_addr(netmem_ref netmem,
> diff --git a/net/core/page_pool.c b/net/core/page_pool.c
> index dd497f5c927d..fa4a98b4fc67 100644
> --- a/net/core/page_pool.c
> +++ b/net/core/page_pool.c
> @@ -61,6 +61,7 @@ static const char pp_stats[][ETH_GSTRING_LEN] = {
>   	"rx_pp_alloc_empty",
>   	"rx_pp_alloc_refill",
>   	"rx_pp_alloc_waive",
> +	"rx_pp_alloc_item_full",
>   	"rx_pp_recycle_cached",
>   	"rx_pp_recycle_cache_full",
>   	"rx_pp_recycle_ring",
> @@ -94,6 +95,7 @@ bool page_pool_get_stats(const struct page_pool *pool,
>   	stats->alloc_stats.empty += pool->alloc_stats.empty;
>   	stats->alloc_stats.refill += pool->alloc_stats.refill;
>   	stats->alloc_stats.waive += pool->alloc_stats.waive;
> +	stats->alloc_stats.item_full += pool->alloc_stats.item_full;
>   
>   	for_each_possible_cpu(cpu) {
>   		const struct page_pool_recycle_stats *pcpu =
> @@ -139,6 +141,7 @@ u64 *page_pool_ethtool_stats_get(u64 *data, const void *stats)
>   	*data++ = pool_stats->alloc_stats.empty;
>   	*data++ = pool_stats->alloc_stats.refill;
>   	*data++ = pool_stats->alloc_stats.waive;
> +	*data++ = pool_stats->alloc_stats.item_full;
>   	*data++ = pool_stats->recycle_stats.cached;
>   	*data++ = pool_stats->recycle_stats.cache_full;
>   	*data++ = pool_stats->recycle_stats.ring;
> @@ -267,14 +270,12 @@ static int page_pool_init(struct page_pool *pool,
>   		return -ENOMEM;
>   	}
>   
> +	spin_lock_init(&pool->destroy_lock);
>   	atomic_set(&pool->pages_state_release_cnt, 0);
>   
>   	/* Driver calling page_pool_create() also call page_pool_destroy() */
>   	refcount_set(&pool->user_cnt, 1);
>   
> -	if (pool->dma_map)
> -		get_device(pool->p.dev);
> -
>   	if (pool->slow.flags & PP_FLAG_ALLOW_UNREADABLE_NETMEM) {
>   		/* We rely on rtnl_lock()ing to make sure netdev_rx_queue
>   		 * configuration doesn't change while we're initializing
> @@ -312,15 +313,93 @@ static void page_pool_uninit(struct page_pool *pool)
>   {
>   	ptr_ring_cleanup(&pool->ring, NULL);
>   
> -	if (pool->dma_map)
> -		put_device(pool->p.dev);
> -
>   #ifdef CONFIG_PAGE_POOL_STATS
>   	if (!pool->system)
>   		free_percpu(pool->recycle_stats);
>   #endif
>   }
>   
> +static void page_pool_item_init(struct page_pool *pool, unsigned int item_cnt)
> +{
> +	struct page_pool_item *items = pool->items;
> +	unsigned int i;
> +
> +	WARN_ON_ONCE(!is_power_of_2(item_cnt));
> +
> +	for (i = 0; i < item_cnt; i++)
> +		items[i].pp_idx = i;
> +
> +	pool->item_mask = item_cnt - 1;
> +}
> +
> +static void page_pool_item_uninit(struct page_pool *pool)
> +{
> +	struct page_pool_item *items = pool->items;
> +	unsigned int mask = pool->item_mask;
> +	unsigned int i, unmapped = 0;
> +
> +	if (!pool->dma_map || pool->mp_priv)
> +		return;
> +
> +	spin_lock_bh(&pool->destroy_lock);
> +
> +	for (i = 0; i <= mask; i++) {
> +		struct page *page;
> +
> +		page = netmem_to_page(READ_ONCE(items[i].pp_netmem));
> +		if (!page)
> +			continue;
> +
> +		unmapped++;
> +		dma_unmap_page_attrs(pool->p.dev, page_pool_get_dma_addr(page),
> +				     PAGE_SIZE << pool->p.order,
> +				     pool->p.dma_dir, DMA_ATTR_SKIP_CPU_SYNC |
> +				     DMA_ATTR_WEAK_ORDERING);
> +		page_pool_set_dma_addr(page, 0);
> +	}
> +
> +	WARN_ONCE(page_pool_inflight(pool, false) != unmapped,
> +		  "page_pool(%u): unmapped(%u) != inflight page(%d)\n",
> +		  pool->user.id, unmapped, page_pool_inflight(pool, false));
> +
> +	pool->dma_map = false;
> +	spin_unlock_bh(&pool->destroy_lock);
> +}
> +
> +static bool page_pool_item_add(struct page_pool *pool, netmem_ref netmem)
> +{
> +	struct page_pool_item *items = pool->items;
> +	unsigned int mask = pool->item_mask;
> +	unsigned int idx = pool->item_idx;
> +	unsigned int i;
> +
> +	for (i = 0; i <= mask; i++) {
> +		unsigned int mask_idx = idx++ & mask;
> +
> +		if (!READ_ONCE(items[mask_idx].pp_netmem)) {
> +			WRITE_ONCE(items[mask_idx].pp_netmem, netmem);
> +			netmem_set_pp_item(netmem, &items[mask_idx]);
> +			pool->item_idx = idx;
> +			return true;
> +		}
> +	}
> +
> +	pool->item_idx = idx;
> +	alloc_stat_inc(pool, item_full);
> +	return false;
> +}
> +
> +static void page_pool_item_del(struct page_pool *pool, netmem_ref netmem)
> +{
> +	struct page_pool_item *item = netmem_to_page(netmem)->pp_item;
> +	struct page_pool_item *items = pool->items;
> +	unsigned int idx = item->pp_idx;
> +
> +	DEBUG_NET_WARN_ON_ONCE(items[idx].pp_netmem != netmem);
> +	WRITE_ONCE(items[idx].pp_netmem, (netmem_ref)NULL);
> +	netmem_set_pp_item(netmem, NULL);
> +}
> +
>   /**
>    * page_pool_create_percpu() - create a page pool for a given cpu.
>    * @params: parameters, see struct page_pool_params
> @@ -329,10 +408,15 @@ static void page_pool_uninit(struct page_pool *pool)
>   struct page_pool *
>   page_pool_create_percpu(const struct page_pool_params *params, int cpuid)
>   {
> +#define PAGE_POOL_MIN_INFLIGHT_ITEMS		512
> +	unsigned int item_cnt = (params->pool_size ? : 1024) +
> +				PP_ALLOC_CACHE_SIZE + PAGE_POOL_MIN_INFLIGHT_ITEMS;
>   	struct page_pool *pool;
>   	int err;
>   
> -	pool = kzalloc_node(sizeof(*pool), GFP_KERNEL, params->nid);
> +	item_cnt = roundup_pow_of_two(item_cnt);
> +	pool = kvzalloc_node(struct_size(pool, items, item_cnt), GFP_KERNEL,
> +			     params->nid);
>   	if (!pool)
>   		return ERR_PTR(-ENOMEM);
>   
> @@ -340,6 +424,8 @@ page_pool_create_percpu(const struct page_pool_params *params, int cpuid)
>   	if (err < 0)
>   		goto err_free;
>   
> +	page_pool_item_init(pool, item_cnt);
> +
>   	err = page_pool_list(pool);
>   	if (err)
>   		goto err_uninit;
> @@ -350,7 +436,7 @@ page_pool_create_percpu(const struct page_pool_params *params, int cpuid)
>   	page_pool_uninit(pool);
>   err_free:
>   	pr_warn("%s() gave up with errno %d\n", __func__, err);
> -	kfree(pool);
> +	kvfree(pool);
>   	return ERR_PTR(err);
>   }
>   EXPORT_SYMBOL(page_pool_create_percpu);
> @@ -365,7 +451,7 @@ struct page_pool *page_pool_create(const struct page_pool_params *params)
>   }
>   EXPORT_SYMBOL(page_pool_create);
>   
> -static void page_pool_return_page(struct page_pool *pool, netmem_ref netmem);
> +static void __page_pool_return_page(struct page_pool *pool, netmem_ref netmem);
>   
>   static noinline netmem_ref page_pool_refill_alloc_cache(struct page_pool *pool)
>   {
> @@ -403,7 +489,7 @@ static noinline netmem_ref page_pool_refill_alloc_cache(struct page_pool *pool)
>   			 * (2) break out to fallthrough to alloc_pages_node.
>   			 * This limit stress on page buddy alloactor.
>   			 */
> -			page_pool_return_page(pool, netmem);
> +			__page_pool_return_page(pool, netmem);
>   			alloc_stat_inc(pool, waive);
>   			netmem = 0;
>   			break;
> @@ -436,26 +522,32 @@ static netmem_ref __page_pool_get_cached(struct page_pool *pool)
>   	return netmem;
>   }
>   
> -static void __page_pool_dma_sync_for_device(const struct page_pool *pool,
> -					    netmem_ref netmem,
> -					    u32 dma_sync_size)
> +static __always_inline void
> +__page_pool_dma_sync_for_device(const struct page_pool *pool, netmem_ref netmem,
> +				u32 dma_sync_size)
>   {
>   #if defined(CONFIG_HAS_DMA) && defined(CONFIG_DMA_NEED_SYNC)
> -	dma_addr_t dma_addr = page_pool_get_dma_addr_netmem(netmem);
> +	if (pool->dma_sync && dma_dev_need_sync(pool->p.dev)) {
> +		dma_addr_t dma_addr = page_pool_get_dma_addr_netmem(netmem);
>   
> -	dma_sync_size = min(dma_sync_size, pool->p.max_len);
> -	__dma_sync_single_for_device(pool->p.dev, dma_addr + pool->p.offset,
> -				     dma_sync_size, pool->p.dma_dir);
> +		dma_sync_size = min(dma_sync_size, pool->p.max_len);
> +		__dma_sync_single_for_device(pool->p.dev,
> +					     dma_addr + pool->p.offset,
> +					     dma_sync_size, pool->p.dma_dir);
> +	}
>   #endif
>   }
>   
> +/* Called from page_pool_put_*() path, need to synchronizated with
> + * page_pool_destory() path.
> + */
>   static __always_inline void
> -page_pool_dma_sync_for_device(const struct page_pool *pool,
> -			      netmem_ref netmem,
> +page_pool_dma_sync_for_device(const struct page_pool *pool, netmem_ref netmem,
>   			      u32 dma_sync_size)
>   {
> -	if (pool->dma_sync && dma_dev_need_sync(pool->p.dev))
> -		__page_pool_dma_sync_for_device(pool, netmem, dma_sync_size);
> +	rcu_read_lock();
> +	__page_pool_dma_sync_for_device(pool, netmem, dma_sync_size);
> +	rcu_read_unlock();
>   }
>   
>   static bool page_pool_dma_map(struct page_pool *pool, netmem_ref netmem)
> @@ -477,7 +569,7 @@ static bool page_pool_dma_map(struct page_pool *pool, netmem_ref netmem)
>   	if (page_pool_set_dma_addr_netmem(netmem, dma))
>   		goto unmap_failed;
>   
> -	page_pool_dma_sync_for_device(pool, netmem, pool->p.max_len);
> +	__page_pool_dma_sync_for_device(pool, netmem, pool->p.max_len);
>   
>   	return true;
>   
> @@ -499,19 +591,24 @@ static struct page *__page_pool_alloc_page_order(struct page_pool *pool,
>   	if (unlikely(!page))
>   		return NULL;
>   
> -	if (pool->dma_map && unlikely(!page_pool_dma_map(pool, page_to_netmem(page)))) {
> -		put_page(page);
> -		return NULL;
> -	}
> +	if (unlikely(!page_pool_set_pp_info(pool, page_to_netmem(page))))
> +		goto err_alloc;
> +
> +	if (pool->dma_map && unlikely(!page_pool_dma_map(pool, page_to_netmem(page))))
> +		goto err_set_info;
>   
>   	alloc_stat_inc(pool, slow_high_order);
> -	page_pool_set_pp_info(pool, page_to_netmem(page));
>   
>   	/* Track how many pages are held 'in-flight' */
>   	pool->pages_state_hold_cnt++;
>   	trace_page_pool_state_hold(pool, page_to_netmem(page),
>   				   pool->pages_state_hold_cnt);
>   	return page;
> +err_set_info:
> +	page_pool_clear_pp_info(pool, page_to_netmem(page));
> +err_alloc:
> +	put_page(page);
> +	return NULL;
>   }
>   
>   /* slow path */
> @@ -546,12 +643,18 @@ static noinline netmem_ref __page_pool_alloc_pages_slow(struct page_pool *pool,
>   	 */
>   	for (i = 0; i < nr_pages; i++) {
>   		netmem = pool->alloc.cache[i];
> +
> +		if (unlikely(!page_pool_set_pp_info(pool, netmem))) {
> +			put_page(netmem_to_page(netmem));
> +			continue;
> +		}
> +
>   		if (dma_map && unlikely(!page_pool_dma_map(pool, netmem))) {
> +			page_pool_clear_pp_info(pool, netmem);
>   			put_page(netmem_to_page(netmem));
>   			continue;
>   		}
>   
> -		page_pool_set_pp_info(pool, netmem);
>   		pool->alloc.cache[pool->alloc.count++] = netmem;
>   		/* Track how many pages are held 'in-flight' */
>   		pool->pages_state_hold_cnt++;
> @@ -623,9 +726,11 @@ s32 page_pool_inflight(const struct page_pool *pool, bool strict)
>   	return inflight;
>   }
>   
> -void page_pool_set_pp_info(struct page_pool *pool, netmem_ref netmem)
> +bool page_pool_set_pp_info(struct page_pool *pool, netmem_ref netmem)
>   {
> -	netmem_set_pp(netmem, pool);
> +	if (unlikely(!page_pool_item_add(pool, netmem)))
> +		return false;
> +
>   	netmem_or_pp_magic(netmem, PP_SIGNATURE);
>   
>   	/* Ensuring all pages have been split into one fragment initially:
> @@ -637,12 +742,14 @@ void page_pool_set_pp_info(struct page_pool *pool, netmem_ref netmem)
>   	page_pool_fragment_netmem(netmem, 1);
>   	if (pool->has_init_callback)
>   		pool->slow.init_callback(netmem, pool->slow.init_arg);
> +
> +	return true;
>   }
>   
> -void page_pool_clear_pp_info(netmem_ref netmem)
> +void page_pool_clear_pp_info(struct page_pool *pool, netmem_ref netmem)
>   {
>   	netmem_clear_pp_magic(netmem);
> -	netmem_set_pp(netmem, NULL);
> +	page_pool_item_del(pool, netmem);
>   }
>   
>   static __always_inline void __page_pool_release_page_dma(struct page_pool *pool,
> @@ -670,7 +777,7 @@ static __always_inline void __page_pool_release_page_dma(struct page_pool *pool,
>    * a regular page (that will eventually be returned to the normal
>    * page-allocator via put_page).
>    */
> -void page_pool_return_page(struct page_pool *pool, netmem_ref netmem)
> +void __page_pool_return_page(struct page_pool *pool, netmem_ref netmem)
>   {
>   	int count;
>   	bool put;
> @@ -688,7 +795,7 @@ void page_pool_return_page(struct page_pool *pool, netmem_ref netmem)
>   	trace_page_pool_state_release(pool, netmem, count);
>   
>   	if (put) {
> -		page_pool_clear_pp_info(netmem);
> +		page_pool_clear_pp_info(pool, netmem);
>   		put_page(netmem_to_page(netmem));
>   	}
>   	/* An optimization would be to call __free_pages(page, pool->p.order)
> @@ -697,6 +804,27 @@ void page_pool_return_page(struct page_pool *pool, netmem_ref netmem)
>   	 */
>   }
>   
> +/* Called from page_pool_put_*() path, need to synchronizated with
> + * page_pool_destory() path.
> + */
> +static void page_pool_return_page(struct page_pool *pool, netmem_ref netmem)
> +{
> +	unsigned int destroy_cnt;
> +
> +	rcu_read_lock();
> +
> +	destroy_cnt = READ_ONCE(pool->destroy_cnt);
> +	if (unlikely(destroy_cnt)) {
> +		spin_lock_bh(&pool->destroy_lock);
> +		__page_pool_return_page(pool, netmem);
> +		spin_unlock_bh(&pool->destroy_lock);
> +	} else {
> +		__page_pool_return_page(pool, netmem);
> +	}
> +
> +	rcu_read_unlock();
> +}
> +
>   static bool page_pool_recycle_in_ring(struct page_pool *pool, netmem_ref netmem)
>   {
>   	int ret;
> @@ -920,11 +1048,11 @@ static netmem_ref page_pool_drain_frag(struct page_pool *pool,
>   		return 0;
>   
>   	if (__page_pool_page_can_be_recycled(netmem)) {
> -		page_pool_dma_sync_for_device(pool, netmem, -1);
> +		__page_pool_dma_sync_for_device(pool, netmem, -1);
>   		return netmem;
>   	}
>   
> -	page_pool_return_page(pool, netmem);
> +	__page_pool_return_page(pool, netmem);
>   	return 0;
>   }
>   
> @@ -938,7 +1066,7 @@ static void page_pool_free_frag(struct page_pool *pool)
>   	if (!netmem || page_pool_unref_netmem(netmem, drain_count))
>   		return;
>   
> -	page_pool_return_page(pool, netmem);
> +	__page_pool_return_page(pool, netmem);
>   }
>   
>   netmem_ref page_pool_alloc_frag_netmem(struct page_pool *pool,
> @@ -1022,7 +1150,7 @@ static void __page_pool_destroy(struct page_pool *pool)
>   		static_branch_dec(&page_pool_mem_providers);
>   	}
>   
> -	kfree(pool);
> +	kvfree(pool);
>   }
>   
>   static void page_pool_empty_alloc_cache_once(struct page_pool *pool)
> @@ -1045,7 +1173,7 @@ static void page_pool_empty_alloc_cache_once(struct page_pool *pool)
>   static void page_pool_scrub(struct page_pool *pool)
>   {
>   	page_pool_empty_alloc_cache_once(pool);
> -	pool->destroy_cnt++;
> +	WRITE_ONCE(pool->destroy_cnt, pool->destroy_cnt + 1);
>   
>   	/* No more consumers should exist, but producers could still
>   	 * be in-flight.
> @@ -1133,12 +1261,17 @@ void page_pool_destroy(struct page_pool *pool)
>   	if (!page_pool_release(pool))
>   		return;
>   
> +	/* Skip dma sync to avoid calling dma API after driver has unbound */
> +	pool->dma_sync = false;
> +
>   	/* Paired with rcu lock in page_pool_napi_local() to enable clearing
>   	 * of pool->p.napi in page_pool_disable_direct_recycling() is seen
>   	 * before returning to driver to free the napi instance.
>   	 */
>   	synchronize_rcu();
>   
> +	page_pool_item_uninit(pool);
> +
>   	page_pool_detached(pool);
>   	pool->defer_start = jiffies;
>   	pool->defer_warn  = jiffies + DEFER_WARN_INTERVAL;
> @@ -1159,7 +1292,7 @@ void page_pool_update_nid(struct page_pool *pool, int new_nid)
>   	/* Flush pool alloc cache, as refill will check NUMA node */
>   	while (pool->alloc.count) {
>   		netmem = pool->alloc.cache[--pool->alloc.count];
> -		page_pool_return_page(pool, netmem);
> +		__page_pool_return_page(pool, netmem);
>   	}
>   }
>   EXPORT_SYMBOL(page_pool_update_nid);
> diff --git a/net/core/page_pool_priv.h b/net/core/page_pool_priv.h
> index 57439787b9c2..5d85f862a30a 100644
> --- a/net/core/page_pool_priv.h
> +++ b/net/core/page_pool_priv.h
> @@ -36,16 +36,18 @@ static inline bool page_pool_set_dma_addr(struct page *page, dma_addr_t addr)
>   }
>   
>   #if defined(CONFIG_PAGE_POOL)
> -void page_pool_set_pp_info(struct page_pool *pool, netmem_ref netmem);
> -void page_pool_clear_pp_info(netmem_ref netmem);
> +bool page_pool_set_pp_info(struct page_pool *pool, netmem_ref netmem);
> +void page_pool_clear_pp_info(struct page_pool *pool, netmem_ref netmem);
>   int page_pool_check_memory_provider(struct net_device *dev,
>   				    struct netdev_rx_queue *rxq);
>   #else
> -static inline void page_pool_set_pp_info(struct page_pool *pool,
> +static inline bool page_pool_set_pp_info(struct page_pool *pool,
>   					 netmem_ref netmem)
>   {
> +	return true;
>   }
> -static inline void page_pool_clear_pp_info(netmem_ref netmem)
> +static inline void page_pool_clear_pp_info(struct page_pool *pool,
> +					   netmem_ref netmem)
>   {
>   }
>   static inline int page_pool_check_memory_provider(struct net_device *dev,
Yunsheng Lin Oct. 23, 2024, 8:59 a.m. UTC | #3
On 2024/10/23 2:14, Jesper Dangaard Brouer wrote:
> 
> 
> On 22/10/2024 05.22, Yunsheng Lin wrote:
>> Networking driver with page_pool support may hand over page
>> still with dma mapping to network stack and try to reuse that
>> page after network stack is done with it and passes it back
>> to page_pool to avoid the penalty of dma mapping/unmapping.
>> With all the caching in the network stack, some pages may be
>> held in the network stack without returning to the page_pool
>> soon enough, and with VF disable causing the driver unbound,
>> the page_pool does not stop the driver from doing it's
>> unbounding work, instead page_pool uses workqueue to check
>> if there is some pages coming back from the network stack
>> periodically, if there is any, it will do the dma unmmapping
>> related cleanup work.
>>
>> As mentioned in [1], attempting DMA unmaps after the driver
>> has already unbound may leak resources or at worst corrupt
>> memory. Fundamentally, the page pool code cannot allow DMA
>> mappings to outlive the driver they belong to.
>>
>> Currently it seems there are at least two cases that the page
>> is not released fast enough causing dma unmmapping done after
>> driver has already unbound:
>> 1. ipv4 packet defragmentation timeout: this seems to cause
>>     delay up to 30 secs.
>> 2. skb_defer_free_flush(): this may cause infinite delay if
>>     there is no triggering for net_rx_action().
>>
>> In order not to do the dma unmmapping after driver has already
>> unbound and stall the unloading of the networking driver, add
>> the pool->items array to record all the pages including the ones
>> which are handed over to network stack, so the page_pool can
> 
> I really really dislike this approach!
> 
> Nacked-by: Jesper Dangaard Brouer <hawk@kernel.org>
> 
> Having to keep an array to record all the pages including the ones
> which are handed over to network stack, goes against the very principle
> behind page_pool. We added members to struct page, such that pages could
> be "outstanding".

Before and after this patch both support "outstanding", the difference is
how many "outstanding" pages do they support.

The question seems to be do we really need unlimited inflight page for
page_pool to work as mentioned in [1]?

1. https://lore.kernel.org/all/5d9ea7bd-67bb-4a9d-a120-c8f290c31a47@huawei.com/

> 
> The page_pool already have a system for waiting for these outstanding /
> inflight packets to get returned.  As I suggested before, the page_pool
> should simply take over the responsability (from net_device) to free the
> struct device (after inflight reach zero), where AFAIK the DMA device is
> connected via.

You mentioned a similar suggestion for a previous version; it would be
good to show some code for the idea you have in mind. I am sure that
Yonglong Liu (Cc'ed) will be happy to test it if some code like a
POC/RFC is provided.

I should mention that the DMA device seems to no longer be valid once
the remove() function of the device driver returns, as mentioned in [2],
which means the DMA API is not allowed to be called after the driver's
remove() function returns.

2. https://www.spinics.net/lists/netdev/msg1030641.html

> 
> The alternative is what Kuba suggested (and proposed an RFC for),  that
> the net_device teardown waits for the page_pool inflight packets.

As above, the question is how long the waiting takes here.
Yonglong tested Kuba's RFC, see [3]; the waiting took forever due to the
reason mentioned in the commit log:
"skb_defer_free_flush(): this may cause an infinite delay if there is
nothing to trigger net_rx_action()."

And there seems to be a device lock held during the waiting, causing the
hung_task log in [3] too, as it is the VF disabling that causes the
driver unbind.

3. https://lore.kernel.org/netdev/758b4d47-c980-4f66-b4a4-949c3fc4b040@huawei.com/

> 
> 
>> do the dma unmmapping for those pages when page_pool_destroy()
>> is called. As the pool->items need to be large enough to avoid
>> performance degradation, add a 'item_full' stat to indicate the
>> allocation failure due to unavailability of pool->items.
> 
> Are you saying page_pool will do allocation failures, based on the size
> of this pool->items array?

Yes.

> 
> Production workloads will have very large number of outstanding packet
> memory in e.g. in TCP sockets.  In production I'm seeing TCP memory
> reach 24 GiB (on machines with 384GiB memory).

It would be good to provide more information here:
1. How much of the '24 GiB' above is rx memory?
2. What is the value of /proc/sys/net/ipv4/tcp_rmem?
3. How many network ports is the machine using?
4. How many queues, and what queue depth, does each network port have?

Just as a NIC hw queue uses a specific number of entries, I believe
there needs to be a specific number of inflight pages for page_pool to
reach its best performance; it just cannot be unlimited, otherwise it
will eat up all of the '384GiB memory'.

If the number of inflight pages needed is so dynamic and hard to
calculate, a linked list could always be used instead of a fixed array
here, but I am not sure it is worth the complexity yet; a rough sketch
of what that could look like is below.
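
Something like the following, illustrative only (the 'inflight_lock' and
'inflight_list' fields and the function names are made up for this
example, and allocation of the tracking entries is left out):

/* Track inflight netmems on a locked list instead of a fixed
 * power-of-two array: a per-page tracking entry and some lock
 * contention, in exchange for no hard cap on inflight pages.
 */
struct pp_inflight_item {
	struct list_head node;
	netmem_ref netmem;
};

static void pp_inflight_add(struct page_pool *pool,
			    struct pp_inflight_item *item,
			    netmem_ref netmem)
{
	item->netmem = netmem;
	spin_lock_bh(&pool->inflight_lock);
	list_add_tail(&item->node, &pool->inflight_list);
	spin_unlock_bh(&pool->inflight_lock);
}

static void pp_inflight_del(struct page_pool *pool,
			    struct pp_inflight_item *item)
{
	spin_lock_bh(&pool->inflight_lock);
	list_del(&item->node);
	spin_unlock_bh(&pool->inflight_lock);
}

page_pool_item_uninit() would then walk pool->inflight_list under the
same lock instead of scanning a fixed array.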

> 
> 
>> Note, the devmem patchset seems to make the bug harder to fix,
>> and may make backporting harder too. As there is no actual user
>> for the devmem and the fixing for devmem is unclear for now,
>> this patch does not consider fixing the case for devmem yet.
>>
>> 1. https://lore.kernel.org/lkml/8067f204-1380-4d37-8ffd-007fc6f26738@kernel.org/T/
>>
>> Fixes: f71fec47c2df ("page_pool: make sure struct device is stable")
>> Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
>> Tested-by: Yonglong Liu <liuyonglong@huawei.com>
>> CC: Robin Murphy <robin.murphy@arm.com>
>> CC: Alexander Duyck <alexander.duyck@gmail.com>
>> CC: IOMMU <iommu@lists.linux.dev>
>> ---
>>   include/linux/mm_types.h        |   2 +-
>>   include/linux/skbuff.h          |   1 +
>>   include/net/netmem.h            |  10 +-
>>   include/net/page_pool/helpers.h |   4 +-
>>   include/net/page_pool/types.h   |  17 ++-
>>   net/core/devmem.c               |   4 +-
>>   net/core/netmem_priv.h          |   5 +-
>>   net/core/page_pool.c            | 213 ++++++++++++++++++++++++++------
>>   net/core/page_pool_priv.h       |  10 +-
>>   9 files changed, 210 insertions(+), 56 deletions(-)
>>
>> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
>> index 6e3bdf8e38bc..d3e30dcfa021 100644
>> --- a/include/linux/mm_types.h
>> +++ b/include/linux/mm_types.h
>> @@ -120,7 +120,7 @@ struct page {
>>                * page_pool allocated pages.
>>                */
>>               unsigned long pp_magic;
>> -            struct page_pool *pp;
>> +            struct page_pool_item *pp_item;
> 
> Why do we need to change the pp pointer?
> 
> Why can't we access the "items" array like pp->items?

The info about which entry of pp->items a specific 'struct page' is
using needs to be in the 'struct page', so that that entry of pp->items
can be reused for another new page when the old page is returned back
to the page allocator.

And we still need the info about which page_pool a specific page
belongs to; that info is available through the helper below:

static inline struct page_pool *page_pool_to_pp(struct page *page)
{
	struct page_pool_item *item = page->pp_item;

	return container_of(item, struct page_pool, items[item->pp_idx]);
}
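
As a standalone illustration of the container_of() trick above
(simplified, made-up types rather than the kernel structs): the pool
embeds a flexible items[] array and each item records its own index,
so an item pointer alone is enough to recover the enclosing pool.

#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>

struct item {
	unsigned int idx;
};

struct pool {
	int dummy_state;
	struct item items[];
};

static struct pool *item_to_pool(struct item *it)
{
	/* it - it->idx points at items[0]; subtracting the offset of the
	 * items[] member then yields the enclosing struct pool.
	 */
	return (struct pool *)((char *)(it - it->idx) -
			       offsetof(struct pool, items));
}

int main(void)
{
	struct pool *p = malloc(sizeof(*p) + 4 * sizeof(struct item));
	unsigned int i;

	if (!p)
		return 1;

	for (i = 0; i < 4; i++)
		p->items[i].idx = i;

	printf("%d\n", item_to_pool(&p->items[3]) == p);	/* prints 1 */
	free(p);
	return 0;
}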

> 
> 
>>               unsigned long _pp_mapping_pad;
>>               unsigned long dma_addr;
>>               atomic_long_t pp_ref_count;
Toke Høiland-Jørgensen Oct. 24, 2024, 2:40 p.m. UTC | #4
Yunsheng Lin <linyunsheng@huawei.com> writes:

> On 2024/10/23 2:14, Jesper Dangaard Brouer wrote:
>> 
>> 
>> On 22/10/2024 05.22, Yunsheng Lin wrote:
>>> Networking driver with page_pool support may hand over page
>>> still with dma mapping to network stack and try to reuse that
>>> page after network stack is done with it and passes it back
>>> to page_pool to avoid the penalty of dma mapping/unmapping.
>>> With all the caching in the network stack, some pages may be
>>> held in the network stack without returning to the page_pool
>>> soon enough, and with VF disable causing the driver unbound,
>>> the page_pool does not stop the driver from doing it's
>>> unbounding work, instead page_pool uses workqueue to check
>>> if there is some pages coming back from the network stack
>>> periodically, if there is any, it will do the dma unmmapping
>>> related cleanup work.
>>>
>>> As mentioned in [1], attempting DMA unmaps after the driver
>>> has already unbound may leak resources or at worst corrupt
>>> memory. Fundamentally, the page pool code cannot allow DMA
>>> mappings to outlive the driver they belong to.
>>>
>>> Currently it seems there are at least two cases that the page
>>> is not released fast enough causing dma unmmapping done after
>>> driver has already unbound:
>>> 1. ipv4 packet defragmentation timeout: this seems to cause
>>>     delay up to 30 secs.
>>> 2. skb_defer_free_flush(): this may cause infinite delay if
>>>     there is no triggering for net_rx_action().
>>>
>>> In order not to do the dma unmmapping after driver has already
>>> unbound and stall the unloading of the networking driver, add
>>> the pool->items array to record all the pages including the ones
>>> which are handed over to network stack, so the page_pool can
>> 
>> I really really dislike this approach!
>> 
>> Nacked-by: Jesper Dangaard Brouer <hawk@kernel.org>
>> 
>> Having to keep an array to record all the pages including the ones
>> which are handed over to network stack, goes against the very principle
>> behind page_pool. We added members to struct page, such that pages could
>> be "outstanding".
>
> Before and after this patch both support "outstanding", the difference is
> how many "outstanding" pages do they support.
>
> The question seems to be do we really need unlimited inflight page for
> page_pool to work as mentioned in [1]?
>
> 1. https://lore.kernel.org/all/5d9ea7bd-67bb-4a9d-a120-c8f290c31a47@huawei.com/

Well, yes? Imposing an arbitrary limit on the number of in-flight
packets (especially such a low one as in this series) is a complete
non-starter. Servers have hundreds of gigs of memory these days, and if
someone wants to use that for storing in-flight packets, the kernel
definitely shouldn't impose some (hard-coded!) limit on that.

>> 
>> The page_pool already have a system for waiting for these outstanding /
>> inflight packets to get returned.  As I suggested before, the page_pool
>> should simply take over the responsability (from net_device) to free the
>> struct device (after inflight reach zero), where AFAIK the DMA device is
>> connected via.
>
> It seems you mentioned some similar suggestion in previous version,
> it would be good to show some code about the idea in your mind, I am sure
> that Yonglong Liu Cc'ed will be happy to test it if there some code like
> POC/RFC is provided.

I believe Jesper is basically referring to Jakub's RFC that you
mentioned below.

> I should mention that it seems that DMA device is not longer vaild when
> remove() function of the device driver returns, as mentioned in [2], which
> means dma API is not allowed to called after remove() function of the device
> driver returns.
>
> 2. https://www.spinics.net/lists/netdev/msg1030641.html
>
>> 
>> The alternative is what Kuba suggested (and proposed an RFC for),  that
>> the net_device teardown waits for the page_pool inflight packets.
>
> As above, the question is how long does the waiting take here?
> Yonglong tested Kuba's RFC, see [3], the waiting took forever due to
> reason as mentioned in commit log:
> "skb_defer_free_flush(): this may cause infinite delay if there is no
> triggering for net_rx_action()."

Honestly, this just seems like a bug (the "no triggering of
net_rx_action()") that should be root caused and fixed; not a reason
that waiting can't work.

-Toke
Yunsheng Lin Oct. 25, 2024, 3:20 a.m. UTC | #5
On 2024/10/24 22:40, Toke Høiland-Jørgensen wrote:

...

>>>
>>> I really really dislike this approach!
>>>
>>> Nacked-by: Jesper Dangaard Brouer <hawk@kernel.org>
>>>
>>> Having to keep an array to record all the pages including the ones
>>> which are handed over to network stack, goes against the very principle
>>> behind page_pool. We added members to struct page, such that pages could
>>> be "outstanding".
>>
>> Before and after this patch both support "outstanding", the difference is
>> how many "outstanding" pages do they support.
>>
>> The question seems to be do we really need unlimited inflight page for
>> page_pool to work as mentioned in [1]?
>>
>> 1. https://lore.kernel.org/all/5d9ea7bd-67bb-4a9d-a120-c8f290c31a47@huawei.com/
> 
> Well, yes? Imposing an arbitrary limit on the number of in-flight
> packets (especially such a low one as in this series) is a complete
> non-starter. Servers have hundreds of gigs of memory these days, and if
> someone wants to use that for storing in-flight packets, the kernel
> definitely shouldn't impose some (hard-coded!) limit on that.

You and Jesper seem to be suggesting that 'hundreds of gigs of memory'
might be needed for inflight pages; it would be nice to provide more
info or reasoning about why 'hundreds of gigs of memory' is needed here,
so that we don't over-design things to support recording an unlimited
number of in-flight pages, in case stalling the driver unbind turns out
to be impossible and the inflight pages do need to be recorded.

I guess it is common sense to start with the easy option until someone
complains, with a test case and detailed reasoning, that we need to go
the hard way, given that you and Jesper also prefer waiting over having
to record the inflight pages.

More details about why we might need to go the hard way of recording
the inflight pages are below.

> 
>>>
>>> The page_pool already have a system for waiting for these outstanding /
>>> inflight packets to get returned.  As I suggested before, the page_pool
>>> should simply take over the responsability (from net_device) to free the
>>> struct device (after inflight reach zero), where AFAIK the DMA device is
>>> connected via.
>>
>> It seems you mentioned some similar suggestion in previous version,
>> it would be good to show some code about the idea in your mind, I am sure
>> that Yonglong Liu Cc'ed will be happy to test it if there some code like
>> POC/RFC is provided.
> 
> I believe Jesper is basically referring to Jakub's RFC that you
> mentioned below.
> 
>> I should mention that it seems that DMA device is not longer vaild when
>> remove() function of the device driver returns, as mentioned in [2], which
>> means dma API is not allowed to called after remove() function of the device
>> driver returns.
>>
>> 2. https://www.spinics.net/lists/netdev/msg1030641.html
>>
>>>
>>> The alternative is what Kuba suggested (and proposed an RFC for),  that
>>> the net_device teardown waits for the page_pool inflight packets.
>>
>> As above, the question is how long does the waiting take here?
>> Yonglong tested Kuba's RFC, see [3], the waiting took forever due to
>> reason as mentioned in commit log:
>> "skb_defer_free_flush(): this may cause infinite delay if there is no
>> triggering for net_rx_action()."
> 
> Honestly, this just seems like a bug (the "no triggering of
> net_rx_action()") that should be root caused and fixed; not a reason
> that waiting can't work.

I would prefer the waiting too, if simple waiting had fixed the test
cases that Yonglong and Haiqing were reporting and I had not looked into
the rabbit hole of possible caching in networking.

As mentioned in the commit log and [1]:
1. ipv4 packet defragmentation timeout: this seems to cause a delay of
   up to 30 secs, which was reported by Haiqing.
2. skb_defer_free_flush(): this may cause an infinite delay if there is
   nothing to trigger net_rx_action(), which was reported by Yonglong.

For case 1, is it really ok to stall the driver unbind for up to 30 secs
with the default setting of the defragmentation timeout?

For case 2, it is possible to add a timeout for that kind of caching,
like the defragmentation timeout, too, but as mentioned in [2], this
kind of caching seems to be a normal thing in networking:
"Eric pointed out/predicted there's no guarantee that applications will
read / close their sockets so a page pool page may be stuck in a socket
(but not leaked) forever."

1. https://lore.kernel.org/all/2c5ccfff-6ab4-4aea-bff6-3679ff72cc9a@huawei.com/
2. https://www.spinics.net/lists/netdev/msg952604.html

> 
> -Toke
> 
>
Toke Høiland-Jørgensen Oct. 25, 2024, 11:16 a.m. UTC | #6
Yunsheng Lin <linyunsheng@huawei.com> writes:

> On 2024/10/24 22:40, Toke Høiland-Jørgensen wrote:
>
> ...
>
>>>>
>>>> I really really dislike this approach!
>>>>
>>>> Nacked-by: Jesper Dangaard Brouer <hawk@kernel.org>
>>>>
>>>> Having to keep an array to record all the pages including the ones
>>>> which are handed over to network stack, goes against the very principle
>>>> behind page_pool. We added members to struct page, such that pages could
>>>> be "outstanding".
>>>
>>> Before and after this patch both support "outstanding", the difference is
>>> how many "outstanding" pages do they support.
>>>
>>> The question seems to be do we really need unlimited inflight page for
>>> page_pool to work as mentioned in [1]?
>>>
>>> 1. https://lore.kernel.org/all/5d9ea7bd-67bb-4a9d-a120-c8f290c31a47@huawei.com/
>> 
>> Well, yes? Imposing an arbitrary limit on the number of in-flight
>> packets (especially such a low one as in this series) is a complete
>> non-starter. Servers have hundreds of gigs of memory these days, and if
>> someone wants to use that for storing in-flight packets, the kernel
>> definitely shouldn't impose some (hard-coded!) limit on that.
>
> You and Jesper seems to be mentioning a possible fact that there might
> be 'hundreds of gigs of memory' needed for inflight pages, it would be nice
> to provide more info or reasoning above why 'hundreds of gigs of memory' is
> needed here so that we don't do a over-designed thing to support recording
> unlimited in-flight pages if the driver unbound stalling turns out impossible
> and the inflight pages do need to be recorded.

I don't have a concrete example of a use case that will blow the limit
you are setting (but maybe Jesper does); I am simply objecting to
imposing any arbitrary limit at all. It smells a lot like "640k ought
to be enough for anyone".

> I guess it is common sense to start with easy one until someone complains
> with some testcase and detailed reasoning if we need to go the hard way as
> you and Jesper are also prefering waiting over having to record the inflight
> pages.

AFAIU Jakub's comment on his RFC patch for waiting, he was suggesting
exactly this: Add the wait, and see if the cases where it can stall turn
out to be problems in practice.

> More detailed about why we might need to go the hard way of having to record
> the inflight pages as below.
>
>> 
>>>>
>>>> The page_pool already have a system for waiting for these outstanding /
>>>> inflight packets to get returned.  As I suggested before, the page_pool
>>>> should simply take over the responsability (from net_device) to free the
>>>> struct device (after inflight reach zero), where AFAIK the DMA device is
>>>> connected via.
>>>
>>> It seems you mentioned some similar suggestion in previous version,
>>> it would be good to show some code about the idea in your mind, I am sure
>>> that Yonglong Liu Cc'ed will be happy to test it if there some code like
>>> POC/RFC is provided.
>> 
>> I believe Jesper is basically referring to Jakub's RFC that you
>> mentioned below.
>> 
>>> I should mention that it seems that DMA device is not longer vaild when
>>> remove() function of the device driver returns, as mentioned in [2], which
>>> means dma API is not allowed to called after remove() function of the device
>>> driver returns.
>>>
>>> 2. https://www.spinics.net/lists/netdev/msg1030641.html
>>>
>>>>
>>>> The alternative is what Kuba suggested (and proposed an RFC for),  that
>>>> the net_device teardown waits for the page_pool inflight packets.
>>>
>>> As above, the question is how long does the waiting take here?
>>> Yonglong tested Kuba's RFC, see [3], the waiting took forever due to
>>> reason as mentioned in commit log:
>>> "skb_defer_free_flush(): this may cause infinite delay if there is no
>>> triggering for net_rx_action()."
>> 
>> Honestly, this just seems like a bug (the "no triggering of
>> net_rx_action()") that should be root caused and fixed; not a reason
>> that waiting can't work.
>
> I would prefer the waiting too if simple waiting fixed the test cases that
> Youglong and Haiqing were reporting and I did not look into the rabbit hole
> of possible caching in networking.
>
> As mentioned in commit log and [1]:
> 1. ipv4 packet defragmentation timeout: this seems to cause delay up to 30
>    secs, which was reported by Haiqing.
> 2. skb_defer_free_flush(): this may cause infinite delay if there is no
>    triggering for net_rx_action(), which was reported by Yonglong.
>
> For case 1, is it really ok to stall the driver unbound up to 30 secs for the
> default setting of defragmentation timeout?
>
> For case 2, it is possible to add timeout for those kind of caching like the
> defragmentation timeout too, but as mentioned in [2], it seems to be a normal
> thing for this kind of caching in networking:

Both 1 and 2 seem to be cases where the netdev teardown code can just
make sure to kick the respective queues and make sure there's nothing
outstanding (for (1), walk the defrag cache and clear out anything
related to the netdev going away, for (2) make sure to kick
net_rx_action() as part of the teardown).

> "Eric pointed out/predicted there's no guarantee that applications will
> read / close their sockets so a page pool page may be stuck in a socket
> (but not leaked) forever."

As for this one, I would put that in the "well, let's see if this
becomes a problem in practice" bucket.

-Toke
Jesper Dangaard Brouer Oct. 25, 2024, 2:07 p.m. UTC | #7
On 25/10/2024 13.16, Toke Høiland-Jørgensen wrote:
> Yunsheng Lin <linyunsheng@huawei.com> writes:
> 
>> On 2024/10/24 22:40, Toke Høiland-Jørgensen wrote:
>>
>> ...
>>
>>>>>
>>>>> I really really dislike this approach!
>>>>>
>>>>> Nacked-by: Jesper Dangaard Brouer <hawk@kernel.org>
>>>>>
>>>>> Having to keep an array to record all the pages including the ones
>>>>> which are handed over to network stack, goes against the very principle
>>>>> behind page_pool. We added members to struct page, such that pages could
>>>>> be "outstanding".
>>>>
>>>> Before and after this patch both support "outstanding", the difference is
>>>> how many "outstanding" pages do they support.
>>>>
>>>> The question seems to be do we really need unlimited inflight page for
>>>> page_pool to work as mentioned in [1]?
>>>>
>>>> 1. https://lore.kernel.org/all/5d9ea7bd-67bb-4a9d-a120-c8f290c31a47@huawei.com/
>>>
>>> Well, yes? Imposing an arbitrary limit on the number of in-flight
>>> packets (especially such a low one as in this series) is a complete
>>> non-starter. Servers have hundreds of gigs of memory these days, and if
>>> someone wants to use that for storing in-flight packets, the kernel
>>> definitely shouldn't impose some (hard-coded!) limit on that.
>>

I agree this limit is a non-starter.

>> You and Jesper seems to be mentioning a possible fact that there might
>> be 'hundreds of gigs of memory' needed for inflight pages, it would be nice
>> to provide more info or reasoning above why 'hundreds of gigs of memory' is
>> needed here so that we don't do a over-designed thing to support recording
>> unlimited in-flight pages if the driver unbound stalling turns out impossible
>> and the inflight pages do need to be recorded.
> 
> I don't have a concrete example of a use that will blow the limit you
> are setting (but maybe Jesper does), I am simply objecting to the
> arbitrary imposing of any limit at all. It smells a lot of "640k ought
> to be enough for anyone".
> 

As I wrote before, in *production* I'm seeing TCP memory reach 24 GiB
(on machines with 384 GiB of memory). I have attached a grafana
screenshot to prove what I'm saying.

As my co-worker Mike Freemon has explained to me (with more details in
the blog post [1]), it is no coincidence that the graph has a strange
"ceiling" close to 24 GiB (on machines with 384 GiB total memory). This
is because the TCP network stack goes into a memory "under pressure"
state when 6.25% of total memory is used by the TCP stack. (Detail: the
system will stay in that mode until allocated TCP memory falls below
4.68% of total memory.)

  [1] 
https://blog.cloudflare.com/unbounded-memory-usage-by-tcp-for-receive-buffers-and-how-we-fixed-it/
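
To put numbers on those fractions (assuming they map directly onto the
tcp_mem thresholds for these machines):

  384 GiB * 6.25%  = 24 GiB   (TCP enters the "under pressure" state)
  384 GiB * 4.68% ~= 18 GiB   (TCP leaves the "under pressure" state)

which matches the ~24 GiB ceiling visible in the graph.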


>> I guess it is common sense to start with easy one until someone complains
>> with some testcase and detailed reasoning if we need to go the hard way as
>> you and Jesper are also prefering waiting over having to record the inflight
>> pages.
> 
> AFAIU Jakub's comment on his RFC patch for waiting, he was suggesting
> exactly this: Add the wait, and see if the cases where it can stall turn
> out to be problems in practice.

+1

I like Jakub's approach.

--Jesper
Yunsheng Lin Oct. 26, 2024, 7:32 a.m. UTC | #8
On 2024/10/25 19:16, Toke Høiland-Jørgensen wrote:

...

>>
>> You and Jesper seems to be mentioning a possible fact that there might
>> be 'hundreds of gigs of memory' needed for inflight pages, it would be nice
>> to provide more info or reasoning above why 'hundreds of gigs of memory' is
>> needed here so that we don't do a over-designed thing to support recording
>> unlimited in-flight pages if the driver unbound stalling turns out impossible
>> and the inflight pages do need to be recorded.
> 
> I don't have a concrete example of a use that will blow the limit you
> are setting (but maybe Jesper does), I am simply objecting to the
> arbitrary imposing of any limit at all. It smells a lot of "640k ought
> to be enough for anyone".

From Jesper's link, it seems there is still some limit for that memory,
and the limit seems to be configured dynamically.

Currently the number of supported in-flight pages depends on the rx
queue count and queue depth in this patch; as mentioned before, it can
always be made dynamic at some cost in complexity and maybe some
overhead.

> 
>> I guess it is common sense to start with easy one until someone complains
>> with some testcase and detailed reasoning if we need to go the hard way as
>> you and Jesper are also prefering waiting over having to record the inflight
>> pages.
> 
> AFAIU Jakub's comment on his RFC patch for waiting, he was suggesting
> exactly this: Add the wait, and see if the cases where it can stall turn
> out to be problems in practice.

I should mention that Jakub seemed to prefer only checking some
condition to decide if the waiting is needed, see [1].

If the 'kick' below needs to be done during the waiting, it might be
going the hard way instead of the easy way here compared to recording
the inflight pages, IMHO.

1. https://lore.kernel.org/all/20241014171406.43e730c9@kernel.org/

> 
>> More detailed about why we might need to go the hard way of having to record
>> the inflight pages as below.
>>
>>>
>>>>>
>>>>> The page_pool already have a system for waiting for these outstanding /
>>>>> inflight packets to get returned.  As I suggested before, the page_pool
>>>>> should simply take over the responsability (from net_device) to free the
>>>>> struct device (after inflight reach zero), where AFAIK the DMA device is
>>>>> connected via.
>>>>
>>>> It seems you mentioned some similar suggestion in previous version,
>>>> it would be good to show some code about the idea in your mind, I am sure
>>>> that Yonglong Liu Cc'ed will be happy to test it if there some code like
>>>> POC/RFC is provided.
>>>
>>> I believe Jesper is basically referring to Jakub's RFC that you
>>> mentioned below.
>>>
>>>> I should mention that it seems that DMA device is not longer vaild when
>>>> remove() function of the device driver returns, as mentioned in [2], which
>>>> means dma API is not allowed to called after remove() function of the device
>>>> driver returns.
>>>>
>>>> 2. https://www.spinics.net/lists/netdev/msg1030641.html
>>>>
>>>>>
>>>>> The alternative is what Kuba suggested (and proposed an RFC for),  that
>>>>> the net_device teardown waits for the page_pool inflight packets.
>>>>
>>>> As above, the question is how long does the waiting take here?
>>>> Yonglong tested Kuba's RFC, see [3], the waiting took forever due to
>>>> reason as mentioned in commit log:
>>>> "skb_defer_free_flush(): this may cause infinite delay if there is no
>>>> triggering for net_rx_action()."
>>>
>>> Honestly, this just seems like a bug (the "no triggering of
>>> net_rx_action()") that should be root caused and fixed; not a reason
>>> that waiting can't work.
>>
>> I would prefer the waiting too if simple waiting fixed the test cases that
>> Youglong and Haiqing were reporting and I did not look into the rabbit hole
>> of possible caching in networking.
>>
>> As mentioned in commit log and [1]:
>> 1. ipv4 packet defragmentation timeout: this seems to cause delay up to 30
>>    secs, which was reported by Haiqing.
>> 2. skb_defer_free_flush(): this may cause infinite delay if there is no
>>    triggering for net_rx_action(), which was reported by Yonglong.
>>
>> For case 1, is it really ok to stall the driver unbound up to 30 secs for the
>> default setting of defragmentation timeout?
>>
>> For case 2, it is possible to add timeout for those kind of caching like the
>> defragmentation timeout too, but as mentioned in [2], it seems to be a normal
>> thing for this kind of caching in networking:
> 
> Both 1 and 2 seem to be cases where the netdev teardown code can just
> make sure to kick the respective queues and make sure there's nothing
> outstanding (for (1), walk the defrag cache and clear out anything
> related to the netdev going away, for (2) make sure to kick
> net_rx_action() as part of the teardown).

It would be good to be more specific about the 'kick' here. Does it
mean taking the lock and doing one of the actions below for each cache
instance?
1. Flush all the caches of each cache instance.
2. Scan for page_pool owned pages and do fine-grained flushing.

> 
>> "Eric pointed out/predicted there's no guarantee that applications will
>> read / close their sockets so a page pool page may be stuck in a socket
>> (but not leaked) forever."
> 
> As for this one, I would put that in the "well, let's see if this
> becomes a problem in practice" bucket.

As per the commit log in [2], it seems it is already happening.

Those caches are mostly per-CPU and per-socket, and there may be
hundreds of CPUs and thousands of sockets in one system. Are you really
sure we need to take the lock of each cache instance, which may number
in the thousands, and do the flushing/scanning of memory used in
networking, which may be as large as the '24 GiB' mentioned by Jesper?

2. https://lore.kernel.org/all/20231126230740.2148636-13-kuba@kernel.org/

> 
> -Toke
> 
>
Yunsheng Lin Oct. 26, 2024, 7:33 a.m. UTC | #9
On 2024/10/25 22:07, Jesper Dangaard Brouer wrote:

...

> 
>>> You and Jesper seems to be mentioning a possible fact that there might
>>> be 'hundreds of gigs of memory' needed for inflight pages, it would be nice
>>> to provide more info or reasoning above why 'hundreds of gigs of memory' is
>>> needed here so that we don't do a over-designed thing to support recording
>>> unlimited in-flight pages if the driver unbound stalling turns out impossible
>>> and the inflight pages do need to be recorded.
>>
>> I don't have a concrete example of a use that will blow the limit you
>> are setting (but maybe Jesper does), I am simply objecting to the
>> arbitrary imposing of any limit at all. It smells a lot of "640k ought
>> to be enough for anyone".
>>
> 
> As I wrote before. In *production* I'm seeing TCP memory reach 24 GiB
> (on machines with 384GiB memory). I have attached a grafana screenshot
> to prove what I'm saying.
> 
> As my co-worker Mike Freemon, have explain to me (and more details in
> blogposts[1]). It is no coincident that graph have a strange "sealing"
> close to 24 GiB (on machines with 384GiB total memory).  This is because
> TCP network stack goes into a memory "under pressure" state when 6.25%
> of total memory is used by TCP-stack. (Detail: The system will stay in
> that mode until allocated TCP memory falls below 4.68% of total memory).
> 
>  [1] https://blog.cloudflare.com/unbounded-memory-usage-by-tcp-for-receive-buffers-and-how-we-fixed-it/

Thanks for the info.

> 
> 
>>> I guess it is common sense to start with easy one until someone complains
>>> with some testcase and detailed reasoning if we need to go the hard way as
>>> you and Jesper are also prefering waiting over having to record the inflight
>>> pages.
>>
>> AFAIU Jakub's comment on his RFC patch for waiting, he was suggesting
>> exactly this: Add the wait, and see if the cases where it can stall turn
>> out to be problems in practice.
> 
> +1
> 
> I like Jakub's approach.

As mentioned in Toke's comment, I am still not convinced that there is
an easy way of waiting here. Doing the kick in kernel space is hard
enough; I am not even sure whether a kick needs to be done, or how it
could be done, in user space if some page_pool owned page is held from
user space in the zero-copy rx, io_uring and devmem TCP cases. Killing
the userspace app?

If you and Toke still think waiting is the way out of the problem here,
maybe we should wait for Jakub's opinion and let him decide if he wants
to proceed with his waiting patch.

> 
> --Jesper
Toke Høiland-Jørgensen Oct. 29, 2024, 1:58 p.m. UTC | #10
Yunsheng Lin <linyunsheng@huawei.com> writes:

>>> I would prefer the waiting too if simple waiting fixed the test cases that
>>> Youglong and Haiqing were reporting and I did not look into the rabbit hole
>>> of possible caching in networking.
>>>
>>> As mentioned in commit log and [1]:
>>> 1. ipv4 packet defragmentation timeout: this seems to cause delay up to 30
>>>    secs, which was reported by Haiqing.
>>> 2. skb_defer_free_flush(): this may cause infinite delay if there is no
>>>    triggering for net_rx_action(), which was reported by Yonglong.
>>>
>>> For case 1, is it really ok to stall the driver unbound up to 30 secs for the
>>> default setting of defragmentation timeout?
>>>
>>> For case 2, it is possible to add timeout for those kind of caching like the
>>> defragmentation timeout too, but as mentioned in [2], it seems to be a normal
>>> thing for this kind of caching in networking:
>> 
>> Both 1 and 2 seem to be cases where the netdev teardown code can just
>> make sure to kick the respective queues and make sure there's nothing
>> outstanding (for (1), walk the defrag cache and clear out anything
>> related to the netdev going away, for (2) make sure to kick
>> net_rx_action() as part of the teardown).
>
> It would be good to be more specific about the 'kick' here, does it mean
> taking the lock and doing one of below action for each cache instance?
> 1. flush all the cache of each cache instance.
> 2. scan for the page_pool owned page and do the finegrained flushing.

Depends on the context. The page pool is attached to a device, so it
should be possible to walk the skb frags queue and just remove any skbs
that refer to that netdevice, or something like that.

As for the lack of net_rx_action(), this is related to the deferred
freeing of skbs, so it seems like just calling skb_defer_free_flush() on
teardown could be an option.

>>> "Eric pointed out/predicted there's no guarantee that applications will
>>> read / close their sockets so a page pool page may be stuck in a socket
>>> (but not leaked) forever."
>> 
>> As for this one, I would put that in the "well, let's see if this
>> becomes a problem in practice" bucket.
>
> As the commit log in [2], it seems it is already happening.
>
> Those cache are mostly per-cpu and per-socket, and there may be hundreds of
> CPUs and thousands of sockets in one system, are you really sure we need
> to take the lock of each cache instance, which may be thousands of them,
> and do the flushing/scaning of memory used in networking, which may be as
> large as '24 GiB' mentioned by Jesper?

Well, as above, the two issues you mentioned are per-netns (or possibly
per-CPU), so those seem to be manageable to do on device teardown if the
wait is really a problem.

But, well, I'm not sure it is? You seem to be taking it as axiomatic
that the wait in itself is bad. Why? It's just a bit of memory being
held on to while it is still in use, and so what?

-Toke
Yunsheng Lin Oct. 30, 2024, 11:30 a.m. UTC | #11
On 2024/10/29 21:58, Toke Høiland-Jørgensen wrote:
> Yunsheng Lin <linyunsheng@huawei.com> writes:
> 
>>>> I would prefer the waiting too if simple waiting fixed the test cases that
>>>> Youglong and Haiqing were reporting and I did not look into the rabbit hole
>>>> of possible caching in networking.
>>>>
>>>> As mentioned in commit log and [1]:
>>>> 1. ipv4 packet defragmentation timeout: this seems to cause delay up to 30
>>>>    secs, which was reported by Haiqing.
>>>> 2. skb_defer_free_flush(): this may cause infinite delay if there is no
>>>>    triggering for net_rx_action(), which was reported by Yonglong.
>>>>
>>>> For case 1, is it really ok to stall the driver unbound up to 30 secs for the
>>>> default setting of defragmentation timeout?
>>>>
>>>> For case 2, it is possible to add timeout for those kind of caching like the
>>>> defragmentation timeout too, but as mentioned in [2], it seems to be a normal
>>>> thing for this kind of caching in networking:
>>>
>>> Both 1 and 2 seem to be cases where the netdev teardown code can just
>>> make sure to kick the respective queues and make sure there's nothing
>>> outstanding (for (1), walk the defrag cache and clear out anything
>>> related to the netdev going away, for (2) make sure to kick
>>> net_rx_action() as part of the teardown).
>>
>> It would be good to be more specific about the 'kick' here, does it mean
>> taking the lock and doing one of below action for each cache instance?
>> 1. flush all the cache of each cache instance.
>> 2. scan for the page_pool owned page and do the finegrained flushing.
> 
> Depends on the context. The page pool is attached to a device, so it
> should be possible to walk the skb frags queue and just remove any skbs
> that refer to that netdevice, or something like that.

I am not sure the netdevice is still the same after the skb has passed
through all sorts of software netdevices; checking whether a page is a
page_pool owned page seems safer?

The scanning/flushing seems complicated and hard to get right if it
depends on the internal details of other subsystems' cache
implementations.

> 
> As for the lack of net_rx_action(), this is related to the deferred
> freeing of skbs, so it seems like just calling skb_defer_free_flush() on
> teardown could be an option.

That was my initial thinking about the above case too, if we knew which
percpu sd to pass to skb_defer_free_flush() or which cpu to trigger
net_rx_action() on.
But it seems hard to tell which cpu a napi is running on before the
napi is disabled, which means skb_defer_free_flush() might need to be
called for every cpu with softirq disabled (as skb_defer_free_flush()
calls napi_consume_skb() with a budget of 1), or
kick_defer_list_purge() might need to be called for each CPU.
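
Something like the rough sketch below is what that would look like; it
is untested and only for illustration, and it assumes
kick_defer_list_purge() is made callable from the teardown path:

/* Illustrative sketch only: ask every CPU to purge its skb defer_list
 * during netdev teardown.  Assumes kick_defer_list_purge() is exported
 * from net/core/dev.c for this purpose; synchronisation with CPU
 * hotplug and napi is glossed over.
 */
static void purge_all_defer_lists(void)
{
	unsigned int cpu;

	cpus_read_lock();
	for_each_online_cpu(cpu)
		kick_defer_list_purge(&per_cpu(softnet_data, cpu), cpu);
	cpus_read_unlock();
}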

> 
>>>> "Eric pointed out/predicted there's no guarantee that applications will
>>>> read / close their sockets so a page pool page may be stuck in a socket
>>>> (but not leaked) forever."
>>>
>>> As for this one, I would put that in the "well, let's see if this
>>> becomes a problem in practice" bucket.
>>
>> As the commit log in [2], it seems it is already happening.
>>
>> Those cache are mostly per-cpu and per-socket, and there may be hundreds of
>> CPUs and thousands of sockets in one system, are you really sure we need
>> to take the lock of each cache instance, which may be thousands of them,
>> and do the flushing/scaning of memory used in networking, which may be as
>> large as '24 GiB' mentioned by Jesper?
> 
> Well, as above, the two issues you mentioned are per-netns (or possibly
> per-CPU), so those seem to be manageable to do on device teardown if the
> wait is really a problem.

As above, I am not sure it is still the same netns after the skb has
passed through all sorts of software netdevices?

> 
> But, well, I'm not sure it is? You seem to be taking it as axiomatic
> that the wait in itself is bad. Why? It's just a bit memory being held
> on to while it is still in use, and so what?

Actually, I thought about adding some sort of timeout or kicking based
on Jakub's waiting patch too.

But after looking at more of the caching in the networking stack,
waiting and kicking/flushing seem harder than recording the inflight
pages, mainly because kicking/flushing needs every subsystem using
page_pool owned pages to provide a kicking/flushing mechanism for it to
work, not to mention how much time it would take to do all the
kicking/flushing.

It seems the rdma subsystem uses a similar mechanism:
https://lwn.net/Articles/989087/

> 
> -Toke
> 
> 
>
Toke Høiland-Jørgensen Oct. 30, 2024, 11:57 a.m. UTC | #12
Yunsheng Lin <linyunsheng@huawei.com> writes:

>> But, well, I'm not sure it is? You seem to be taking it as axiomatic
>> that the wait in itself is bad. Why? It's just a bit memory being held
>> on to while it is still in use, and so what?
>
> Actually, I thought about adding some sort of timeout or kicking based on
> jakub's waiting patch too.
>
> But after looking at more caching in the networking, waiting and kicking/flushing
> seems harder than recording the inflight pages, mainly because kicking/flushing
> need very subsystem using page_pool owned page to provide a kicking/flushing
> mechanism for it to work, not to mention how much time does it take to do all
> the kicking/flushing.

Eliding the details above, but yeah, you're right, there are probably
some pernicious details to get right if we want to flush all caches. So
I wouldn't do that to start with. Instead, just add the waiting to
start with, then wait and see if this actually turns out to be a
problem in practice. And if it is, identify the source of that problem,
deal with it, rinse and repeat :)

-Toke
Yunsheng Lin Oct. 31, 2024, 12:17 p.m. UTC | #13
On 2024/10/30 19:57, Toke Høiland-Jørgensen wrote:
> Yunsheng Lin <linyunsheng@huawei.com> writes:
> 
>>> But, well, I'm not sure it is? You seem to be taking it as axiomatic
>>> that the wait in itself is bad. Why? It's just a bit memory being held
>>> on to while it is still in use, and so what?
>>
>> Actually, I thought about adding some sort of timeout or kicking based on
>> jakub's waiting patch too.
>>
>> But after looking at more caching in the networking, waiting and kicking/flushing
>> seems harder than recording the inflight pages, mainly because kicking/flushing
>> need very subsystem using page_pool owned page to provide a kicking/flushing
>> mechanism for it to work, not to mention how much time does it take to do all
>> the kicking/flushing.
> 
> Eliding the details above, but yeah, you're right, there are probably
> some pernicious details to get right if we want to flush all caches. S
> I wouldn't do that to start with. Instead, just add the waiting to start
> with, then wait and see if this actually turns out to be a problem in
> practice. And if it is, identify the source of that problem, deal with
> it, rinse and repeat :)

I am not sure if I have mentioned to you that Jakub had an RFC for the
waiting, see [1]. Yonglong (Cc'ed) tested it, and the waiting caused
the driver unload to stall forever and some tasks to hang, see [2].

The root cause for the above case is skb_defer_free_flush() not being called
as mentioned before.

I am not sure I understand the reasoning behind the above suggestion to
'wait and see if this actually turns out to be a problem' when we
already know that there are some cases which need cache
kicking/flushing for the waiting to work, that the kicking/flushing may
not be easy and may take an indefinite time too, and that there might
be other cases needing kicking/flushing that we don't know about yet.

Is there any reason not to consider recording the inflight pages so
that unmapping can be done for them before the driver is unbound,
supposing a dynamic number of inflight pages can be supported?

IOW, is there any reason you and Jesper are taking it as axiomatic that
recording the inflight pages is bad, supposing the number of inflight
pages can be unlimited and the recording can be done with minimal
performance overhead?

Or is there any better idea, other than recording the inflight pages or
doing the kicking/flushing during the waiting?

1. https://lore.kernel.org/netdev/20240806151618.1373008-1-kuba@kernel.org/
2. https://lore.kernel.org/netdev/758b4d47-c980-4f66-b4a4-949c3fc4b040@huawei.com/

> 
> -Toke
> 
>
Toke Høiland-Jørgensen Oct. 31, 2024, 4:18 p.m. UTC | #14
Yunsheng Lin <linyunsheng@huawei.com> writes:

> On 2024/10/30 19:57, Toke Høiland-Jørgensen wrote:
>> Yunsheng Lin <linyunsheng@huawei.com> writes:
>> 
>>>> But, well, I'm not sure it is? You seem to be taking it as axiomatic
>>>> that the wait in itself is bad. Why? It's just a bit memory being held
>>>> on to while it is still in use, and so what?
>>>
>>> Actually, I thought about adding some sort of timeout or kicking based on
>>> jakub's waiting patch too.
>>>
>>> But after looking at more caching in the networking, waiting and kicking/flushing
>>> seems harder than recording the inflight pages, mainly because kicking/flushing
>>> need very subsystem using page_pool owned page to provide a kicking/flushing
>>> mechanism for it to work, not to mention how much time does it take to do all
>>> the kicking/flushing.
>> 
>> Eliding the details above, but yeah, you're right, there are probably
>> some pernicious details to get right if we want to flush all caches. S
>> I wouldn't do that to start with. Instead, just add the waiting to start
>> with, then wait and see if this actually turns out to be a problem in
>> practice. And if it is, identify the source of that problem, deal with
>> it, rinse and repeat :)
>
> I am not sure if I have mentioned to you that jakub had a RFC for the waiting,
> see [1]. And Yonglong Cc'ed had tested it, the waiting caused the driver unload
> stalling forever and some task hung, see [2].
>
> The root cause for the above case is skb_defer_free_flush() not being called
> as mentioned before.

Well, let's fix that, then! We already have logic to flush backlogs
when a netdevice is going away, so AFAICT all that's needed is to add
skb_defer_free_flush() to that logic. Totally untested patch below; we
should maybe consider applying it in any case.

> I am not sure if I understand the reasoning behind the above suggestion to 'wait
> and see if this actually turns out to be a problem' when we already know that there
> are some cases which need cache kicking/flushing for the waiting to work and those
> kicking/flushing may not be easy and may take indefinite time too, not to mention
> there might be other cases that need kicking/flushing that we don't know yet.
>
> Is there any reason not to consider recording the inflight pages so that unmapping
> can be done for inflight pages before driver unbound supposing dynamic number of
> inflight pages can be supported?
>
> IOW, Is there any reason you and jesper taking it as axiomatic that recording the
> inflight pages is bad supposing the inflight pages can be unlimited and recording
> can be done with least performance overhead?

Well, page pool is a memory allocator, and it already has a mechanism to
handle returning of memory to it. You're proposing to add a second,
orthogonal, mechanism to do this, one that adds both overhead and
complexity, yet doesn't handle all cases (cf your comment about devmem).

And even if it did handle all cases, force-releasing pages in this way
really feels like it's just papering over the issue. If there are pages
being leaked (or that are outstanding forever, which basically amounts
to the same thing), that is something we should be fixing the root cause
of, not just working around it like this series does.

-Toke


Patch to flush the deferred free list when taking down a netdevice;
compile-tested only:



diff --git a/net/core/dev.c b/net/core/dev.c
index ea5fbcd133ae..6e64e24ad6fa 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -5955,6 +5955,27 @@ EXPORT_SYMBOL(netif_receive_skb_list);
 
 static DEFINE_PER_CPU(struct work_struct, flush_works);
 
+static void skb_defer_free_flush(struct softnet_data *sd)
+{
+	struct sk_buff *skb, *next;
+
+	/* Paired with WRITE_ONCE() in skb_attempt_defer_free() */
+	if (!READ_ONCE(sd->defer_list))
+		return;
+
+	spin_lock(&sd->defer_lock);
+	skb = sd->defer_list;
+	sd->defer_list = NULL;
+	sd->defer_count = 0;
+	spin_unlock(&sd->defer_lock);
+
+	while (skb != NULL) {
+		next = skb->next;
+		napi_consume_skb(skb, 1);
+		skb = next;
+	}
+}
+
 /* Network device is going away, flush any packets still pending */
 static void flush_backlog(struct work_struct *work)
 {
@@ -5964,6 +5985,8 @@ static void flush_backlog(struct work_struct *work)
 	local_bh_disable();
 	sd = this_cpu_ptr(&softnet_data);
 
+	skb_defer_free_flush(sd);
+
 	backlog_lock_irq_disable(sd);
 	skb_queue_walk_safe(&sd->input_pkt_queue, skb, tmp) {
 		if (skb->dev->reg_state == NETREG_UNREGISTERING) {
@@ -6001,6 +6024,9 @@ static bool flush_required(int cpu)
 		   !skb_queue_empty_lockless(&sd->process_queue);
 	backlog_unlock_irq_enable(sd);
 
+	if (!do_flush && READ_ONCE(sd->defer_list))
+		do_flush = true;
+
 	return do_flush;
 #endif
 	/* without RPS we can't safely check input_pkt_queue: during a
@@ -6298,27 +6324,6 @@ struct napi_struct *napi_by_id(unsigned int napi_id)
 	return NULL;
 }
 
-static void skb_defer_free_flush(struct softnet_data *sd)
-{
-	struct sk_buff *skb, *next;
-
-	/* Paired with WRITE_ONCE() in skb_attempt_defer_free() */
-	if (!READ_ONCE(sd->defer_list))
-		return;
-
-	spin_lock(&sd->defer_lock);
-	skb = sd->defer_list;
-	sd->defer_list = NULL;
-	sd->defer_count = 0;
-	spin_unlock(&sd->defer_lock);
-
-	while (skb != NULL) {
-		next = skb->next;
-		napi_consume_skb(skb, 1);
-		skb = next;
-	}
-}
-
 #if defined(CONFIG_NET_RX_BUSY_POLL)
 
 static void __busy_poll_stop(struct napi_struct *napi, bool skip_schedule)
Yunsheng Lin Nov. 1, 2024, 11:11 a.m. UTC | #15
On 2024/11/1 0:18, Toke Høiland-Jørgensen wrote:

...

>>>
>>> Eliding the details above, but yeah, you're right, there are probably
>>> some pernicious details to get right if we want to flush all caches. S
>>> I wouldn't do that to start with. Instead, just add the waiting to start
>>> with, then wait and see if this actually turns out to be a problem in
>>> practice. And if it is, identify the source of that problem, deal with
>>> it, rinse and repeat :)
>>
>> I am not sure if I have mentioned to you that jakub had a RFC for the waiting,
>> see [1]. And Yonglong Cc'ed had tested it, the waiting caused the driver unload
>> stalling forever and some task hung, see [2].
>>
>> The root cause for the above case is skb_defer_free_flush() not being called
>> as mentioned before.
> 
> Well, let's fix that, then! We already logic to flush backlogs when a
> netdevice is going away, so AFAICT all that's needed is to add the

Is there a possibility that a page_pool owned page might still be
handled/cached somewhere in the networking stack if netif_rx_internal()
has already been called for the corresponding skb and
skb_attempt_defer_free() is called after the skb_defer_free_flush()
added in the patch below has run?

Maybe add a timeout mechanism, like a timer, to call
kick_defer_list_purge() if you treat 'outstanding forever' as leaked? I
actually thought about this, but have not found an elegant way to add
the timeout.

> skb_defer_free_flush() to that logic. Totally untested patch below, that
> we should maybe consider applying in any case.

I am not sure about that because of the above-mentioned timing window,
but it does seem we might need to do something similar in
dev_cpu_dead().
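
For dev_cpu_dead(), something along the lines of the untested sketch
below is what I mean; it mirrors the skb_defer_free_flush() logic from
your patch, and whether the lock is still needed at that point is an
open question:

/* Illustrative sketch: drain a (dead) CPU's skb defer_list; could be
 * called from dev_cpu_dead() with that CPU's softnet_data.
 */
static void drain_defer_list(struct softnet_data *sd)
{
	struct sk_buff *skb, *next;

	spin_lock_bh(&sd->defer_lock);
	skb = sd->defer_list;
	sd->defer_list = NULL;
	sd->defer_count = 0;
	spin_unlock_bh(&sd->defer_lock);

	while (skb) {
		next = skb->next;
		napi_consume_skb(skb, 0); /* budget 0: safe in any context */
		skb = next;
	}
}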

> 
>> I am not sure if I understand the reasoning behind the above suggestion to 'wait
>> and see if this actually turns out to be a problem' when we already know that there
>> are some cases which need cache kicking/flushing for the waiting to work and those
>> kicking/flushing may not be easy and may take indefinite time too, not to mention
>> there might be other cases that need kicking/flushing that we don't know yet.
>>
>> Is there any reason not to consider recording the inflight pages so that unmapping
>> can be done for inflight pages before driver unbound supposing dynamic number of
>> inflight pages can be supported?
>>
>> IOW, Is there any reason you and jesper taking it as axiomatic that recording the
>> inflight pages is bad supposing the inflight pages can be unlimited and recording
>> can be done with least performance overhead?
> 
> Well, page pool is a memory allocator, and it already has a mechanism to
> handle returning of memory to it. You're proposing to add a second,
> orthogonal, mechanism to do this, one that adds both overhead and

I would call it a replacement/improvement for the old one rather than
'a second, orthogonal' mechanism, as the old one doesn't really exist
after this patch.

> complexity, yet doesn't handle all cases (cf your comment about devmem).

I am not sure yet whether unmapping only needs to be done using its own
version of the DMA API for devmem, but it seems waiting might also need
its own version of kicking/flushing for devmem, as devmem might be held
from user space?

> 
> And even if it did handle all cases, force-releasing pages in this way
> really feels like it's just papering over the issue. If there are pages
> being leaked (or that are outstanding forever, which basically amounts
> to the same thing), that is something we should be fixing the root cause
> of, not just working around it like this series does.

If there were a definite time for the waiting, I would probably agree
with the above. From the previous discussion, it seems the time to do
the kicking/flushing would be indefinite, depending on how much cache
needs to be scanned/flushed.

For the 'papering over' part, it seems to be a question of whether we
want to paper over it with different kicking/flushing or paper over it
with unmapping using a different DMA API.

Also, page_pool is not really an allocator; instead it is more like a
pool on top of a different allocator, such as the buddy allocator or
the devmem allocator. I am not sure it makes much sense to do the
flushing when page_pool_destroy() is called if the buddy allocator
behind the page_pool is not under memory pressure yet.

For the 'leaked' part mentioned above, I agree that it should be fixed
if we have a clear and unified definition of 'leaked'. For example, is
it allowed to keep the cache outstanding forever if the allocator is
not under memory pressure and does not ask for its memory to be
released?

Doesn't it make more sense to use something like the
shrinker_register() mechanism to decide whether to do the flushing?
IOW, maybe it makes more sense for the allocator behind the page_pool
to decide whether to do the kicking/flushing, and maybe page_pool
should also use the shrinker_register() mechanism to empty its cache
when necessary, instead of page_pool deciding whether to do the
kicking/flushing itself.
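
A rough, untested sketch of what I mean is below; it only illustrates
the shrinker_register() idea, and pp_idle_pages()/pp_release_idle() are
hypothetical helpers standing in for the real accounting of the pool's
caches:

/* Illustrative only: hook a page_pool up to the shrinker
 * infrastructure so the MM core decides when cached pages should be
 * given back.  pp_idle_pages() and pp_release_idle() are hypothetical
 * helpers that would count/free pages sitting idle in the pool's
 * caches.
 */
static unsigned long pp_shrink_count(struct shrinker *s,
				     struct shrink_control *sc)
{
	struct page_pool *pool = s->private_data;

	return pp_idle_pages(pool);
}

static unsigned long pp_shrink_scan(struct shrinker *s,
				    struct shrink_control *sc)
{
	struct page_pool *pool = s->private_data;

	return pp_release_idle(pool, sc->nr_to_scan);
}

static int pp_register_shrinker(struct page_pool *pool)
{
	struct shrinker *s = shrinker_alloc(0, "page_pool");

	if (!s)
		return -ENOMEM;

	s->count_objects = pp_shrink_count;
	s->scan_objects = pp_shrink_scan;
	s->private_data = pool;
	shrinker_register(s);
	return 0;
}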

So I am not even sure it is appropriate to do the cache
kicking/flushing during the waiting, not to mention the indefinite time
it might take.

> 
> -Toke
> 
> 
> Patch to flush the deferred free list when taking down a netdevice;
> compile-tested only:

As mentioned above, isn't there the time window below?

            CPU 0                           CPU1
      unregister_netdev()                    .
             .                               .
     flush_all_backlogs()                    .
             .                               .
             .                     skb_attempt_defer_free()
             .                               .
             .                               .
             .                               .
             .                               .
Jesper Dangaard Brouer Nov. 5, 2024, 8:11 p.m. UTC | #16
On 01/11/2024 12.11, Yunsheng Lin wrote:
> On 2024/11/1 0:18, Toke Høiland-Jørgensen wrote:
> 
> ...
> 
>>>>
>>>> Eliding the details above, but yeah, you're right, there are probably
>>>> some pernicious details to get right if we want to flush all caches. S
>>>> I wouldn't do that to start with. Instead, just add the waiting to start
>>>> with, then wait and see if this actually turns out to be a problem in
>>>> practice. And if it is, identify the source of that problem, deal with
>>>> it, rinse and repeat :)
>>>
>>> I am not sure if I have mentioned to you that jakub had a RFC for the waiting,
>>> see [1]. And Yonglong Cc'ed had tested it, the waiting caused the driver unload
>>> stalling forever and some task hung, see [2].
>>>
>>> The root cause for the above case is skb_defer_free_flush() not being called
>>> as mentioned before.
>>
>> Well, let's fix that, then! We already logic to flush backlogs when a
>> netdevice is going away, so AFAICT all that's needed is to add the
> 
> Is there a possiblity that the page_pool owned page might be still handled/cached
> in somewhere of networking if netif_rx_internal() is already called for the
> corresponding skb and skb_attempt_defer_free() is called after skb_defer_free_flush()
> added in below patch is called?
> 
> Maybe add a timeout thing like timer to call kick_defer_list_purge() if you treat
> 'outstanding forever' as leaked? I actually thought about this, but had not found
> out an elegant way to add the timeout.
> 
>> skb_defer_free_flush() to that logic. Totally untested patch below, that
>> we should maybe consider applying in any case.
> 
> I am not sure about that as the above mentioned timing window, but it does seem we
> might need to do something similar in dev_cpu_dead().
> 
>>
>>> I am not sure if I understand the reasoning behind the above suggestion to 'wait
>>> and see if this actually turns out to be a problem' when we already know that there
>>> are some cases which need cache kicking/flushing for the waiting to work and those
>>> kicking/flushing may not be easy and may take indefinite time too, not to mention
>>> there might be other cases that need kicking/flushing that we don't know yet.
>>>
>>> Is there any reason not to consider recording the inflight pages so that unmapping
>>> can be done for inflight pages before driver unbound supposing dynamic number of
>>> inflight pages can be supported?
>>>
>>> IOW, Is there any reason you and jesper taking it as axiomatic that recording the
>>> inflight pages is bad supposing the inflight pages can be unlimited and recording
>>> can be done with least performance overhead?
>>
>> Well, page pool is a memory allocator, and it already has a mechanism to
>> handle returning of memory to it. You're proposing to add a second,
>> orthogonal, mechanism to do this, one that adds both overhead and
> 
> I would call it as a replacement/improvement for the old one instead of
> 'a second, orthogonal' as the old one doesn't really exist after this patch.
> 

Yes, you are proposing a very radical change to the page_pool design.
And this is getting proposed as a fix patch for an IOMMU issue.

It is a very radical change that page_pool needs to keep track of *ALL*
in-flight pages.

The DMA issue is a lifetime issue of the DMA object associated with the
struct device.  Then why are you not looking at extending the lifetime
of the DMA object, or at least detecting when the DMA object goes away,
such that we can change a setting in page_pool to stop calling DMA
unmap for the in-flight pages once they get returned (which we have an
existing mechanism for)?
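
Roughly like the untested sketch below; the dma_unmap_enabled field and
both function names are made up for illustration, not existing
page_pool API:

/* Untested sketch of the idea: when the driver is about to unbind,
 * flip a flag so that pages returning to the pool skip the DMA unmap.
 */
void page_pool_disable_dma_unmap(struct page_pool *pool)
{
	WRITE_ONCE(pool->dma_unmap_enabled, false);
}

/* ...and in the page release path, something like: */
static void __page_pool_release_dma(struct page_pool *pool,
				    struct page *page)
{
	if (!READ_ONCE(pool->dma_unmap_enabled))
		return;		/* device gone; skip the unmap */

	dma_unmap_page_attrs(pool->p.dev, page_pool_get_dma_addr(page),
			     PAGE_SIZE << pool->p.order, pool->p.dma_dir,
			     DMA_ATTR_SKIP_CPU_SYNC);
}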


>> complexity, yet doesn't handle all cases (cf your comment about devmem).
> 
> I am not sure if unmapping only need to be done using its own version DMA API
> for devmem yet, but it seems waiting might also need to use its own version
> of kicking/flushing for devmem as devmem might be held from the user space?
> 
>>
>> And even if it did handle all cases, force-releasing pages in this way
>> really feels like it's just papering over the issue. If there are pages
>> being leaked (or that are outstanding forever, which basically amounts
>> to the same thing), that is something we should be fixing the root cause
>> of, not just working around it like this series does.
> 
> If there is a definite time for waiting, I am probably agreed with the above.
>  From the previous discussion, it seems the time to do the kicking/flushing
> would be indefinite depending how much cache to be scaned/flushed.
> 
> For the 'papering over' part, it seems it is about if we want to paper over
> different kicking/flushing or paper over unmapping using different DMA API.
> 
> Also page_pool is not really a allocator, instead it is more like a pool
> based on different allocator, such as buddy allocator or devmem allocator.
> I am not sure it makes much to do the flushing when page_pool_destroy() is
> called if the buddy allocator behind the page_pool is not under memory
> pressure yet.
> 

I still see page_pool as an allocator like the SLUB/SLAB allocators,
where slab allocators are created (and can be destroyed again), which
we can allocate slab objects from.  SLAB allocators also use the buddy
allocator as their backing allocator.

The page_pool is of course evolving with the addition of the devmem
allocator as a different "backing" allocator type.


> For the 'leaked' part mentioned above, I am agreed that it should be fixed
> if we have a clear and unified definition of 'leaked', for example, is it
> allowed to keep the cache outstanding forever if the allocator is not under
> memory pressure and not ask for the releasing of its memory?
> 

It seems wrong to me if page_pool needs to dictate how long API users
are allowed to hold the page.

> Doesn't it make more sense to use something like shrinker_register() mechanism
> to decide whether to do the flushing?
> 
> IOW, maybe it makes more sense that the allocator behind the page_pool should
> be deciding whether to do the kicking/flushing, and maybe page_pool should also
> use the shrinker_register() mechanism to empty its cache when necessary instead
> of deciding whether to do the kicking/flushing.
> 

Sure, I've argued before that page_pool should use shrinker_register(),
but only when used with the normal buddy allocator.
BUT you need to realize that bad things can happen when the network
stack fails to allocate memory for packets, because that can block
connections from making forward progress and those connections can be
holding on to memory (that is part of the memory pressure problem).


> So I am not even sure if it is appropriate to do the cache kicking/flushing
> during waiting, not to mention the indefinite time to do the kicking/flushing.

--Jesper
Yunsheng Lin Nov. 6, 2024, 10:56 a.m. UTC | #17
+cc Christoph & Marek

On 2024/11/6 4:11, Jesper Dangaard Brouer wrote:

...

>>>
>>>> I am not sure if I understand the reasoning behind the above suggestion to 'wait
>>>> and see if this actually turns out to be a problem' when we already know that there
>>>> are some cases which need cache kicking/flushing for the waiting to work and those
>>>> kicking/flushing may not be easy and may take indefinite time too, not to mention
>>>> there might be other cases that need kicking/flushing that we don't know yet.
>>>>
>>>> Is there any reason not to consider recording the inflight pages so that unmapping
>>>> can be done for inflight pages before driver unbound supposing dynamic number of
>>>> inflight pages can be supported?
>>>>
>>>> IOW, Is there any reason you and jesper taking it as axiomatic that recording the
>>>> inflight pages is bad supposing the inflight pages can be unlimited and recording
>>>> can be done with least performance overhead?
>>>
>>> Well, page pool is a memory allocator, and it already has a mechanism to
>>> handle returning of memory to it. You're proposing to add a second,
>>> orthogonal, mechanism to do this, one that adds both overhead and
>>
>> I would call it as a replacement/improvement for the old one instead of
>> 'a second, orthogonal' as the old one doesn't really exist after this patch.
>>
> 
> Yes, are proposing doing a very radical change to the page_pool design.
> And this is getting proposed as a fix patch for IOMMU.
> 
> It is a very radical change that page_pool needs to keep track of *ALL* in-flight pages.

I agree that it is a radical change; that is why it is targeting the
net-next tree instead of the net tree even though there is a Fixes tag
for it.

If there is a proper and non-radical way to fix this, I would prefer
the non-radical way too.

> 
> The DMA issue is a life-time issue of DMA object associated with the
> struct device.  Then, why are you not looking at extending the life-time

It seems it is not really about the lifetime of a DMA object being tied
to the lifetime of the 'struct device'; rather, the lifetime of DMA API
usage is tied to the lifetime of the driver bound to the 'struct
device', going by the opinion of the experts from the IOMMU/DMA
subsystem in [1] & [2].

I am not sure what the reasoning behind the above is, but the
implementation seems to confirm it, as mentioned in [3]:
__device_release_driver -> device_unbind_cleanup -> arch_teardown_dma_ops

1. https://lkml.org/lkml/2024/8/6/632
2. https://lore.kernel.org/all/20240923175226.GC9634@ziepe.ca/
3. https://lkml.org/lkml/2024/10/15/686

> of the DMA object, or at least detect when DMA object goes away, such
> that we can change a setting in page_pool to stop calling DMA unmap for
> the pages in-flight once they get returned (which we have en existing
> mechanism for).

To be honest, I was mostly depending on the opinion of the experts from
the IOMMU/DMA subsystem for the correct DMA API usage, as mentioned
above. So I am not sure whether skipping DMA unmapping for the inflight
pages is correct DMA API usage?
If it is correct DMA API usage, how do we detect whether DMA unmapping
can be skipped?

From the previous discussion, skipping DMA unmapping may cause some
resource leaking, like the iova resource behind the IOMMU and the
bounce buffer memory behind the swiotlb.

Anyway, I may be wrong; CC'ing more experts to see if we can get some
clarification from them.

> 
> 
>>> complexity, yet doesn't handle all cases (cf your comment about devmem).
>>
>> I am not sure if unmapping only need to be done using its own version DMA API
>> for devmem yet, but it seems waiting might also need to use its own version
>> of kicking/flushing for devmem as devmem might be held from the user space?
>>
>>>
>>> And even if it did handle all cases, force-releasing pages in this way
>>> really feels like it's just papering over the issue. If there are pages
>>> being leaked (or that are outstanding forever, which basically amounts
>>> to the same thing), that is something we should be fixing the root cause
>>> of, not just working around it like this series does.
>>
>> If there is a definite time for waiting, I am probably agreed with the above.
>>  From the previous discussion, it seems the time to do the kicking/flushing
>> would be indefinite depending how much cache to be scaned/flushed.
>>
>> For the 'papering over' part, it seems it is about if we want to paper over
>> different kicking/flushing or paper over unmapping using different DMA API.
>>
>> Also page_pool is not really a allocator, instead it is more like a pool
>> based on different allocator, such as buddy allocator or devmem allocator.
>> I am not sure it makes much to do the flushing when page_pool_destroy() is
>> called if the buddy allocator behind the page_pool is not under memory
>> pressure yet.
>>
> 
> I still see page_pool as an allocator like the SLUB/SLAB allocators,
> where slab allocators are created (and can be destroyed again), which we
> can allocate slab objects from.  SLAB allocators also use buddy
> allocator as their backing allocator.

I am not sure SLUB/SLAB is that similar to page_pool for the specific
problem here. At least SLUB/SLAB doesn't seem to support DMA mapping in
its core and doesn't seem to allow inflight objects when
kmem_cache_destroy() is called, as its alloc API doesn't seem to take a
reference on s->refcount and doesn't do the inflight accounting that
page_pool does:
https://elixir.bootlin.com/linux/v6.12-rc6/source/mm/slab_common.c#L512


> 
> The page_pool is of-cause evolving with the addition of the devmem
> allocator as a different "backing" allocator type.
Jesper Dangaard Brouer Nov. 6, 2024, 1:25 p.m. UTC | #18
On 26/10/2024 09.33, Yunsheng Lin wrote:
> On 2024/10/25 22:07, Jesper Dangaard Brouer wrote:
> 
> ...
> 
>>
>>>> You and Jesper seems to be mentioning a possible fact that there might
>>>> be 'hundreds of gigs of memory' needed for inflight pages, it would be nice
>>>> to provide more info or reasoning above why 'hundreds of gigs of memory' is
>>>> needed here so that we don't do a over-designed thing to support recording
>>>> unlimited in-flight pages if the driver unbound stalling turns out impossible
>>>> and the inflight pages do need to be recorded.
>>>
>>> I don't have a concrete example of a use that will blow the limit you
>>> are setting (but maybe Jesper does), I am simply objecting to the
>>> arbitrary imposing of any limit at all. It smells a lot of "640k ought
>>> to be enough for anyone".
>>>
>>
>> As I wrote before. In *production* I'm seeing TCP memory reach 24 GiB
>> (on machines with 384GiB memory). I have attached a grafana screenshot
>> to prove what I'm saying.
>>
>> As my co-worker Mike Freemon, have explain to me (and more details in
>> blogposts[1]). It is no coincident that graph have a strange "sealing"
>> close to 24 GiB (on machines with 384GiB total memory).  This is because
>> TCP network stack goes into a memory "under pressure" state when 6.25%
>> of total memory is used by TCP-stack. (Detail: The system will stay in
>> that mode until allocated TCP memory falls below 4.68% of total memory).
>>
>>   [1] https://blog.cloudflare.com/unbounded-memory-usage-by-tcp-for-receive-buffers-and-how-we-fixed-it/
> 
> Thanks for the info.

Some more info from production servers.

(I'm amazed what we can do with a simple bpftrace script, Cc Viktor)

In the bpftrace script/one-liner below I'm extracting the inflight
count for all page_pools in the system and storing that in a histogram
hash.

sudo bpftrace -e '
  rawtracepoint:page_pool_state_release { @cnt[probe]=count();
   @cnt_total[probe]=count();
   $pool=(struct page_pool*)arg0;
   $release_cnt=(uint32)arg2;
   $hold_cnt=$pool->pages_state_hold_cnt;
   $inflight_cnt=(int32)($hold_cnt - $release_cnt);
   @inflight=hist($inflight_cnt);
  }
  interval:s:1 {time("\n%H:%M:%S\n");
   print(@cnt); clear(@cnt);
   print(@inflight);
   print(@cnt_total);
  }'

The page_pool behavior depends on how the NIC driver uses it, so I've
run this on two prod servers with the bnxt and mlx5 drivers, on a
6.6.51 kernel.

Driver: bnxt_en
- kernel 6.6.51

@cnt[rawtracepoint:page_pool_state_release]: 8447
@inflight:
[0]             507 |                                        |
[1]             275 |                                        |
[2, 4)          261 |                                        |
[4, 8)          215 |                                        |
[8, 16)         259 |                                        |
[16, 32)        361 |                                        |
[32, 64)        933 |                                        |
[64, 128)      1966 |                                        |
[128, 256)   937052 |@@@@@@@@@                               |
[256, 512)  5178744 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[512, 1K)     73908 |                                        |
[1K, 2K)    1220128 |@@@@@@@@@@@@                            |
[2K, 4K)    1532724 |@@@@@@@@@@@@@@@                         |
[4K, 8K)    1849062 |@@@@@@@@@@@@@@@@@@                      |
[8K, 16K)   1466424 |@@@@@@@@@@@@@@                          |
[16K, 32K)   858585 |@@@@@@@@                                |
[32K, 64K)   693893 |@@@@@@                                  |
[64K, 128K)  170625 |@                                       |

Driver: mlx5_core
  - Kernel: 6.6.51

@cnt[rawtracepoint:page_pool_state_release]: 1975
@inflight:
[128, 256)         28293 |@@@@                               |
[256, 512)        184312 |@@@@@@@@@@@@@@@@@@@@@@@@@@@        |
[512, 1K)              0 |                                   |
[1K, 2K)            4671 |                                   |
[2K, 4K)          342571 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[4K, 8K)          180520 |@@@@@@@@@@@@@@@@@@@@@@@@@@@        |
[8K, 16K)          96483 |@@@@@@@@@@@@@@                     |
[16K, 32K)         25133 |@@@                                |
[32K, 64K)          8274 |@                                  |


The key thing to notice is that we have up to 128,000 pages in flight
on these random production servers. The NICs have 64 RX queues
configured, thus also 64 page_pool objects.
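
To put that in perspective, assuming standard 4 KiB pages, the top
bucket alone corresponds to roughly:

  128K pages * 4 KiB ~= 512 MiB

of packet memory in flight for a single page_pool at the high end.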

--Jesper
Robin Murphy Nov. 6, 2024, 2:17 p.m. UTC | #19
On 2024-11-06 10:56 am, Yunsheng Lin wrote:
> +cc Christoph & Marek
> 
> On 2024/11/6 4:11, Jesper Dangaard Brouer wrote:
> 
> ...
> 
>>>>
>>>>> I am not sure if I understand the reasoning behind the above suggestion to 'wait
>>>>> and see if this actually turns out to be a problem' when we already know that there
>>>>> are some cases which need cache kicking/flushing for the waiting to work and those
>>>>> kicking/flushing may not be easy and may take indefinite time too, not to mention
>>>>> there might be other cases that need kicking/flushing that we don't know yet.
>>>>>
>>>>> Is there any reason not to consider recording the inflight pages so that unmapping
>>>>> can be done for inflight pages before driver unbound supposing dynamic number of
>>>>> inflight pages can be supported?
>>>>>
>>>>> IOW, Is there any reason you and jesper taking it as axiomatic that recording the
>>>>> inflight pages is bad supposing the inflight pages can be unlimited and recording
>>>>> can be done with least performance overhead?
>>>>
>>>> Well, page pool is a memory allocator, and it already has a mechanism to
>>>> handle returning of memory to it. You're proposing to add a second,
>>>> orthogonal, mechanism to do this, one that adds both overhead and
>>>
>>> I would call it as a replacement/improvement for the old one instead of
>>> 'a second, orthogonal' as the old one doesn't really exist after this patch.
>>>
>>
>> Yes, are proposing doing a very radical change to the page_pool design.
>> And this is getting proposed as a fix patch for IOMMU.
>>
>> It is a very radical change that page_pool needs to keep track of *ALL* in-flight pages.
> 
> I am agreed that it is a radical change, that is why it is targetting net-next
> tree instead of net tree even when there is a Fixes tag for it.
> 
> If there is a proper and non-radical way to fix that, I would prefer the
> non-radical way too.
> 
>>
>> The DMA issue is a life-time issue of DMA object associated with the
>> struct device.  Then, why are you not looking at extending the life-time
> 
> It seems it is not really about the life-time of DMA object associated with the
> life-time of 'struct device', it seems to be the life-time of DMA API associated
> with the life-time of the driver for the 'struct device' from the the opinion of
> experts from IOMMU/DMA subsystem in [1] & [2].

There is no "DMA object". The DMA API expects to be called with a valid 
device bound to a driver. There are parts in many different places all 
built around that expectation to varying degrees. Looking again, it 
seems dma_debug_device_change() has existed for way longer than the 
page_pool code, so frankly I'm a little surprised that this case is only 
coming up now in this context...

Even if one tries to handwave past that with a bogus argument that 
technically these DMA mappings belong to the subsystem rather than the 
driver itself, it is clearly unrealistic to imagine that once a device 
is torn down by device_del() it's still valid for anything. In fact, 
before even that point, it is explicitly documented that a device which 
is merely offlined prior to potential removal "cannot be used for any 
purpose", per Documentation/ABI/testing/sysfs-devices-online.

Holding a refcount so that the memory backing the struct device can 
still be accessed without a literal use-after-free does not represent 
the device being conceptually valid in any API-level sense. Even if the 
device isn't removed, as soon as its driver is unbound its DMA ops can 
change; the driver could then be re-bound, and the device valid for 
*new* DMA mappings again, but it's still bogus to attempt to unmap 
outstanding old mappings through the new ops (which is just as likely to 
throw an error/crash/corrupt memory/whatever). The page pool DMA mapping 
design is just fundamentally incorrect with respect to the device/driver 
model lifecycle.

> I am not sure what is reasoning behind the above, but the implementation seems
> to be the case as mentioned in [3]:
> __device_release_driver -> device_unbind_cleanup -> arch_teardown_dma_ops
> 
> 1. https://lkml.org/lkml/2024/8/6/632
> 2. https://lore.kernel.org/all/20240923175226.GC9634@ziepe.ca/
> 3. https://lkml.org/lkml/2024/10/15/686
> 
>> of the DMA object, or at least detect when DMA object goes away, such
>> that we can change a setting in page_pool to stop calling DMA unmap for
>> the pages in-flight once they get returned (which we have en existing
>> mechanism for).
> 
> To be honest, I was mostly depending on the opinion of the experts from
> IOMMU/DMA subsystem for the correct DMA API usage as mentioned above.
> So I am not sure if skipping DMA unmapping for the inflight pages is the
> correct DMA API usage?
> If it is the correct DMA API usage, how to detect that if DMA unmapping
> can be skipped?

Once page_pool has allowed a driver to unbind from a device without 
cleaning up all outstanding DMA mappings made via that device, then it 
has already leaked those mappings and the damage is done, regardless of 
whether the effects are visible yet. If you'd really rather play 
whack-a-mole trying to paper over secondary symptoms of that issue than 
actually fix it, then fine, but don't expect any driver core/DMA API 
changes to be acceptable for that purpose. However, if people are 
hitting those symptoms now, then I'd imagine they're eventually going to 
come back asking about the ones which can't be papered over, like 
dma-debug reporting the leaks, or just why their system ends up burning 
gigabytes on IOMMU pagetables and IOVA kmem_caches.

Thanks,
Robin.
Jesper Dangaard Brouer Nov. 6, 2024, 3:57 p.m. UTC | #20
On 06/11/2024 14.25, Jesper Dangaard Brouer wrote:
> 
> On 26/10/2024 09.33, Yunsheng Lin wrote:
>> On 2024/10/25 22:07, Jesper Dangaard Brouer wrote:
>>
>> ...
>>
>>>
>>>>> You and Jesper seems to be mentioning a possible fact that there might
>>>>> be 'hundreds of gigs of memory' needed for inflight pages, it would 
>>>>> be nice
>>>>> to provide more info or reasoning above why 'hundreds of gigs of 
>>>>> memory' is
>>>>> needed here so that we don't do a over-designed thing to support 
>>>>> recording
>>>>> unlimited in-flight pages if the driver unbound stalling turns out 
>>>>> impossible
>>>>> and the inflight pages do need to be recorded.
>>>>
>>>> I don't have a concrete example of a use that will blow the limit you
>>>> are setting (but maybe Jesper does), I am simply objecting to the
>>>> arbitrary imposing of any limit at all. It smells a lot of "640k ought
>>>> to be enough for anyone".
>>>>
>>>
>>> As I wrote before, in *production* I'm seeing TCP memory reach 24 GiB
>>> (on machines with 384GiB memory). I have attached a grafana screenshot
>>> to prove what I'm saying.
>>>
>>> As my co-worker Mike Freemon has explained to me (with more details in
>>> the blogpost[1]), it is no coincidence that the graph has a strange
>>> "ceiling" close to 24 GiB (on machines with 384GiB total memory). This
>>> is because the TCP network stack goes into a memory "under pressure"
>>> state when 6.25% of total memory is used by the TCP stack. (Detail: The
>>> system will stay in that mode until allocated TCP memory falls below
>>> 4.68% of total memory.)
>>>
>>>   [1] 
>>> https://blog.cloudflare.com/unbounded-memory-usage-by-tcp-for-receive-buffers-and-how-we-fixed-it/
>>
>> Thanks for the info.
> 
> Some more info from production servers.
> 
> (I'm amazed what we can do with a simple bpftrace script, Cc Viktor)
> 
> In the bpftrace script/one-liner below I'm extracting the inflight count for
> all page_pools in the system, and storing that in a histogram hash.
> 
> sudo bpftrace -e '
>   rawtracepoint:page_pool_state_release { @cnt[probe]=count();
>    @cnt_total[probe]=count();
>    $pool=(struct page_pool*)arg0;
>    $release_cnt=(uint32)arg2;
>    $hold_cnt=$pool->pages_state_hold_cnt;
>    $inflight_cnt=(int32)($hold_cnt - $release_cnt);
>    @inflight=hist($inflight_cnt);
>   }
>   interval:s:1 {time("\n%H:%M:%S\n");
>    print(@cnt); clear(@cnt);
>    print(@inflight);
>    print(@cnt_total);
>   }'
> 
> The page_pool behavior depends on how the NIC driver uses it, so I've run this
> on two prod servers with bnxt and mlx5 drivers, on a 6.6.51 kernel.
> 
> Driver: bnxt_en
> - kernel 6.6.51
> 
> @cnt[rawtracepoint:page_pool_state_release]: 8447
> @inflight:
> [0]             507 |                                        |
> [1]             275 |                                        |
> [2, 4)          261 |                                        |
> [4, 8)          215 |                                        |
> [8, 16)         259 |                                        |
> [16, 32)        361 |                                        |
> [32, 64)        933 |                                        |
> [64, 128)      1966 |                                        |
> [128, 256)   937052 |@@@@@@@@@                               |
> [256, 512)  5178744 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> [512, 1K)     73908 |                                        |
> [1K, 2K)    1220128 |@@@@@@@@@@@@                            |
> [2K, 4K)    1532724 |@@@@@@@@@@@@@@@                         |
> [4K, 8K)    1849062 |@@@@@@@@@@@@@@@@@@                      |
> [8K, 16K)   1466424 |@@@@@@@@@@@@@@                          |
> [16K, 32K)   858585 |@@@@@@@@                                |
> [32K, 64K)   693893 |@@@@@@                                  |
> [64K, 128K)  170625 |@                                       |
> 
> Driver: mlx5_core
>   - Kernel: 6.6.51
> 
> @cnt[rawtracepoint:page_pool_state_release]: 1975
> @inflight:
> [128, 256)         28293 |@@@@                               |
> [256, 512)        184312 |@@@@@@@@@@@@@@@@@@@@@@@@@@@        |
> [512, 1K)              0 |                                   |
> [1K, 2K)            4671 |                                   |
> [2K, 4K)          342571 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> [4K, 8K)          180520 |@@@@@@@@@@@@@@@@@@@@@@@@@@@        |
> [8K, 16K)          96483 |@@@@@@@@@@@@@@                     |
> [16K, 32K)         25133 |@@@                                |
> [32K, 64K)          8274 |@                                  |
> 
> 
> The key thing to notice is that we have up to 128,000 pages in flight on
> these random production servers. The NICs have 64 RX queues configured,
> thus also 64 page_pool objects.
> 

I realized that we primarily want to know the maximum number of in-flight pages.

So, I modified the bpftrace oneliner to track the max for each page_pool 
in the system.

sudo bpftrace -e '
  rawtracepoint:page_pool_state_release { @cnt[probe]=count();
   @cnt_total[probe]=count();
   $pool=(struct page_pool*)arg0;
   $release_cnt=(uint32)arg2;
   $hold_cnt=$pool->pages_state_hold_cnt;
   $inflight_cnt=(int32)($hold_cnt - $release_cnt);
   $cur=@inflight_max[$pool];
   if ($inflight_cnt > $cur) {
     @inflight_max[$pool]=$inflight_cnt;}
  }
  interval:s:1 {time("\n%H:%M:%S\n");
   print(@cnt); clear(@cnt);
   print(@inflight_max);
   print(@cnt_total);
  }'

I've attached the output from the script.
For unknown reasons this system had 199 page_pool objects.

The 20 top users:

$ cat out02.inflight-max | grep inflight_max | tail -n 20
@inflight_max[0xffff88829133d800]: 26473
@inflight_max[0xffff888293c3e000]: 27042
@inflight_max[0xffff888293c3b000]: 27709
@inflight_max[0xffff8881076f2800]: 29400
@inflight_max[0xffff88818386e000]: 29690
@inflight_max[0xffff8882190b1800]: 29813
@inflight_max[0xffff88819ee83800]: 30067
@inflight_max[0xffff8881076f4800]: 30086
@inflight_max[0xffff88818386b000]: 31116
@inflight_max[0xffff88816598f800]: 36970
@inflight_max[0xffff8882190b7800]: 37336
@inflight_max[0xffff888293c38800]: 39265
@inflight_max[0xffff888293c3c800]: 39632
@inflight_max[0xffff888293c3b800]: 43461
@inflight_max[0xffff888293c3f000]: 43787
@inflight_max[0xffff88816598f000]: 44557
@inflight_max[0xffff888132ce9000]: 45037
@inflight_max[0xffff888293c3f800]: 51843
@inflight_max[0xffff888183869800]: 62612
@inflight_max[0xffff888113d08000]: 73203

Adding all values together:

  grep inflight_max out02.inflight-max | awk 'BEGIN {tot=0} {tot+=$2; 
printf "total:" tot "\n"}' | tail -n 1

total:1707129

Worst case we need a data structure holding 1,707,129 pages.
Fortunately, we don't need a single data structure, as this will be split
between 199 page_pools.
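
(As a quick sanity check on that split, assuming it were even: 1,707,129 / 199
≈ 8,579 pages per page_pool on average, although the per-pool maxima above
show the heaviest single pool peaking at 73,203.)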

--Jesper
15:07:05
@cnt[rawtracepoint:page_pool_state_release]: 6344
@inflight_max[0xffff88a1512d9800]: 318
@inflight_max[0xffff88a071703800]: 318
@inflight_max[0xffff88a151367800]: 318
@inflight_max[0xffff88a1512fc000]: 318
@inflight_max[0xffff88a151365000]: 318
@inflight_max[0xffff88a1512d8000]: 318
@inflight_max[0xffff88a1512fe800]: 318
@inflight_max[0xffff88a1512dd800]: 318
@inflight_max[0xffff88a1512de800]: 318
@inflight_max[0xffff88a071707800]: 318
@inflight_max[0xffff88a1512ff000]: 318
@inflight_max[0xffff88a1512dd000]: 318
@inflight_max[0xffff88a1512dc800]: 318
@inflight_max[0xffff88a151366800]: 318
@inflight_max[0xffff88a071701000]: 318
@inflight_max[0xffff88a1512f9800]: 318
@inflight_max[0xffff88a071706000]: 318
@inflight_max[0xffff88a1512fb000]: 318
@inflight_max[0xffff88a071700000]: 318
@inflight_max[0xffff88a151360800]: 318
@inflight_max[0xffff88a1512fa800]: 318
@inflight_max[0xffff88a151360000]: 318
@inflight_max[0xffff88a151361800]: 318
@inflight_max[0xffff88a1512da000]: 318
@inflight_max[0xffff88a1512de000]: 318
@inflight_max[0xffff88a1512db800]: 318
@inflight_max[0xffff88a1512fb800]: 318
@inflight_max[0xffff88a1512fe000]: 318
@inflight_max[0xffff88a151364800]: 318
@inflight_max[0xffff88a071706800]: 318
@inflight_max[0xffff88a151364000]: 318
@inflight_max[0xffff88a1512fd000]: 318
@inflight_max[0xffff88a151366000]: 318
@inflight_max[0xffff88a071701800]: 318
@inflight_max[0xffff88a1512da800]: 318
@inflight_max[0xffff88a071700800]: 318
@inflight_max[0xffff88a1512fd800]: 318
@inflight_max[0xffff88a071702000]: 318
@inflight_max[0xffff88a151365800]: 318
@inflight_max[0xffff88a151361000]: 318
@inflight_max[0xffff88a1512f8000]: 318
@inflight_max[0xffff88a071705000]: 318
@inflight_max[0xffff88a151363800]: 318
@inflight_max[0xffff88a151362000]: 318
@inflight_max[0xffff88a1512d8800]: 318
@inflight_max[0xffff88a071704800]: 318
@inflight_max[0xffff88a1512db000]: 318
@inflight_max[0xffff88a1512fc800]: 318
@inflight_max[0xffff88a1512df000]: 318
@inflight_max[0xffff88a1512f8800]: 318
@inflight_max[0xffff88a1512df800]: 318
@inflight_max[0xffff88a071707000]: 318
@inflight_max[0xffff88a1512dc000]: 318
@inflight_max[0xffff88a071704000]: 318
@inflight_max[0xffff88a071702800]: 318
@inflight_max[0xffff88a071703000]: 318
@inflight_max[0xffff88a1512d9000]: 318
@inflight_max[0xffff88a151362800]: 318
@inflight_max[0xffff88a151367000]: 318
@inflight_max[0xffff88a1512f9000]: 318
@inflight_max[0xffff88a151363000]: 318
@inflight_max[0xffff88a1512fa000]: 318
@inflight_max[0xffff88a071705800]: 318
@inflight_max[0xffff88991d969000]: 331
@inflight_max[0xffff8899ca7b6800]: 336
@inflight_max[0xffff88991d96c000]: 339
@inflight_max[0xffff8899ca7b6000]: 340
@inflight_max[0xffff8899ca7b3000]: 342
@inflight_max[0xffff8899ca7a1800]: 342
@inflight_max[0xffff8899ca7b2800]: 342
@inflight_max[0xffff8899ca7a7800]: 343
@inflight_max[0xffff88991d96e000]: 343
@inflight_max[0xffff88991d96a000]: 344
@inflight_max[0xffff88991d96b000]: 344
@inflight_max[0xffff8898b3c92800]: 345
@inflight_max[0xffff8899ca7a4800]: 345
@inflight_max[0xffff8899ca7b4000]: 346
@inflight_max[0xffff8898b3c93800]: 347
@inflight_max[0xffff88991d968000]: 347
@inflight_max[0xffff88991d96f800]: 348
@inflight_max[0xffff8898b3c92000]: 348
@inflight_max[0xffff88991d96b800]: 350
@inflight_max[0xffff8898b3c90800]: 350
@inflight_max[0xffff8898b3c96800]: 351
@inflight_max[0xffff8899ca7b5000]: 351
@inflight_max[0xffff8898b3c97800]: 351
@inflight_max[0xffff8899ca7b4800]: 352
@inflight_max[0xffff8898b3c93000]: 353
@inflight_max[0xffff88991d969800]: 353
@inflight_max[0xffff8899ca7b5800]: 354
@inflight_max[0xffff8899ca7a2000]: 357
@inflight_max[0xffff8898b3c91000]: 357
@inflight_max[0xffff8898b3c97000]: 357
@inflight_max[0xffff8899ca7b1000]: 358
@inflight_max[0xffff8898b3c94000]: 359
@inflight_max[0xffff88991d96d800]: 362
@inflight_max[0xffff8899ca7b0800]: 363
@inflight_max[0xffff8899ca7a1000]: 364
@inflight_max[0xffff8899ca7a0000]: 364
@inflight_max[0xffff8898b3c95800]: 364
@inflight_max[0xffff8899ca7b7800]: 364
@inflight_max[0xffff8898b3c94800]: 365
@inflight_max[0xffff88991d96d000]: 365
@inflight_max[0xffff88991d968800]: 365
@inflight_max[0xffff8898b3c90000]: 365
@inflight_max[0xffff8899ca7a5800]: 365
@inflight_max[0xffff8899ca7a5000]: 366
@inflight_max[0xffff8899ca7a2800]: 366
@inflight_max[0xffff8899ca7b0000]: 366
@inflight_max[0xffff8899ca7a7000]: 367
@inflight_max[0xffff88991d96a800]: 368
@inflight_max[0xffff88991d96c800]: 368
@inflight_max[0xffff8899ca7a0800]: 368
@inflight_max[0xffff8899ca7a3800]: 370
@inflight_max[0xffff88991d96f000]: 371
@inflight_max[0xffff88991d96e800]: 372
@inflight_max[0xffff8899ca7b3800]: 373
@inflight_max[0xffff8899ca7b2000]: 373
@inflight_max[0xffff8899ca7b7000]: 373
@inflight_max[0xffff8899ca7a6800]: 373
@inflight_max[0xffff8899ca7a4000]: 374
@inflight_max[0xffff8899ca7a6000]: 377
@inflight_max[0xffff8899ca7a3000]: 378
@inflight_max[0xffff8898b3c95000]: 379
@inflight_max[0xffff8898b3c91800]: 379
@inflight_max[0xffff8898b3c96000]: 389
@inflight_max[0xffff888111079800]: 4201
@inflight_max[0xffff888111205000]: 4203
@inflight_max[0xffff888111079000]: 4393
@inflight_max[0xffff8881134fb800]: 4519
@inflight_max[0xffff88811107d800]: 4520
@inflight_max[0xffff8881134fc800]: 4586
@inflight_max[0xffff88811107a000]: 4650
@inflight_max[0xffff8881111b1800]: 5674
@inflight_max[0xffff88811107d000]: 6314
@inflight_max[0xffff888293c3d800]: 11714
@inflight_max[0xffff88818386c000]: 12302
@inflight_max[0xffff888293c3c000]: 12393
@inflight_max[0xffff888132cea800]: 12500
@inflight_max[0xffff888165966000]: 12940
@inflight_max[0xffff888113d0d800]: 13370
@inflight_max[0xffff88818386c800]: 13510
@inflight_max[0xffff888113d0c800]: 14027
@inflight_max[0xffff8882190b0800]: 15149
@inflight_max[0xffff888132ceb800]: 15405
@inflight_max[0xffff888132ceb000]: 15633
@inflight_max[0xffff888183869000]: 15684
@inflight_max[0xffff88818386b800]: 16142
@inflight_max[0xffff888132ced000]: 16450
@inflight_max[0xffff888165964800]: 17007
@inflight_max[0xffff8882190b5800]: 17879
@inflight_max[0xffff8882190b6000]: 17915
@inflight_max[0xffff88819ee80800]: 17977
@inflight_max[0xffff88819ee84000]: 18132
@inflight_max[0xffff888118680000]: 18204
@inflight_max[0xffff888118680800]: 18514
@inflight_max[0xffff88819ee83000]: 18546
@inflight_max[0xffff888183868000]: 18552
@inflight_max[0xffff8881076f1800]: 18706
@inflight_max[0xffff88819ee87000]: 18801
@inflight_max[0xffff888165965800]: 19556
@inflight_max[0xffff888293c3d000]: 20675
@inflight_max[0xffff88818386d000]: 20749
@inflight_max[0xffff88818386a800]: 21226
@inflight_max[0xffff888183868800]: 21559
@inflight_max[0xffff8881076f3000]: 21933
@inflight_max[0xffff888293c3a000]: 22086
@inflight_max[0xffff88819ee82800]: 22975
@inflight_max[0xffff88818386a000]: 23600
@inflight_max[0xffff88816598c000]: 24092
@inflight_max[0xffff888293c39800]: 24093
@inflight_max[0xffff88818386f000]: 24438
@inflight_max[0xffff888113d0e800]: 24882
@inflight_max[0xffff888293c39000]: 25218
@inflight_max[0xffff88818386d800]: 25276
@inflight_max[0xffff888293c3a800]: 25292
@inflight_max[0xffff888293c3e800]: 25429
@inflight_max[0xffff888293c38000]: 25794
@inflight_max[0xffff8881076f6800]: 26030
@inflight_max[0xffff88829133d800]: 26473
@inflight_max[0xffff888293c3e000]: 27042
@inflight_max[0xffff888293c3b000]: 27709
@inflight_max[0xffff8881076f2800]: 29400
@inflight_max[0xffff88818386e000]: 29690
@inflight_max[0xffff8882190b1800]: 29813
@inflight_max[0xffff88819ee83800]: 30067
@inflight_max[0xffff8881076f4800]: 30086
@inflight_max[0xffff88818386b000]: 31116
@inflight_max[0xffff88816598f800]: 36970
@inflight_max[0xffff8882190b7800]: 37336
@inflight_max[0xffff888293c38800]: 39265
@inflight_max[0xffff888293c3c800]: 39632
@inflight_max[0xffff888293c3b800]: 43461
@inflight_max[0xffff888293c3f000]: 43787
@inflight_max[0xffff88816598f000]: 44557
@inflight_max[0xffff888132ce9000]: 45037
@inflight_max[0xffff888293c3f800]: 51843
@inflight_max[0xffff888183869800]: 62612
@inflight_max[0xffff888113d08000]: 73203
@cnt_total[rawtracepoint:page_pool_state_release]: 67263129
Alexander Duyck Nov. 6, 2024, 7:55 p.m. UTC | #21
On Wed, Nov 6, 2024 at 7:57 AM Jesper Dangaard Brouer <hawk@kernel.org> wrote:
>
>
>
> On 06/11/2024 14.25, Jesper Dangaard Brouer wrote:
> >
> > On 26/10/2024 09.33, Yunsheng Lin wrote:
> >> On 2024/10/25 22:07, Jesper Dangaard Brouer wrote:
> >>
> >> ...
> >>
> > ...
> >
> Worst case we need a data structure holding 1,707,129 pages.
> Fortunately, we don't need a single data structure as this will be split
> between 199 page_pool's.
>
> --Jesper

Is there any specific reason why we need to store the pages
instead of just scanning the page tables to look for them? We should
already know how many we need to look for and free. If we were to just
scan the page structs and identify the page pool pages that are
pointing to our pool we should be able to go through and clean them
up. It won't be the fastest approach, but this should be an
exceptional case to handle things like a hot plug removal of a device
where we can essentially run this in the background before we free the
device.

Then it would just be a matter of modifying the pool so that it will
drop support for doing DMA unmapping and essentially just become a
place for the freed pages to go to die.
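
(For illustration, a rough sketch of what such a scan could look like,
assuming a plain PFN walk with pfn_to_online_page() plus the existing
pp_magic/pp markers; the function name and details here are mine, not a
tested implementation, and page_pool_set_dma_addr() is the private helper
from net/core/page_pool_priv.h:)

#include <linux/dma-mapping.h>
#include <linux/memblock.h>
#include <linux/memory_hotplug.h>
#include <linux/mm.h>
#include <linux/poison.h>
#include <linux/sched.h>
#include <net/page_pool/helpers.h>

/* Walk every online page struct and unmap the pages still belonging to
 * @pool, so the pool can stop doing DMA unmaps for returning pages afterwards.
 */
static void page_pool_scan_and_unmap(struct page_pool *pool)
{
	unsigned long pfn;

	for (pfn = 0; pfn < max_pfn; pfn++) {
		struct page *page = pfn_to_online_page(pfn);

		if ((pfn & 0xffff) == 0)
			cond_resched();

		if (!page)
			continue;

		/* Only page_pool-owned pages carry PP_SIGNATURE in pp_magic. */
		if ((page->pp_magic & ~0x3UL) != PP_SIGNATURE)
			continue;

		if (page->pp != pool || !page_pool_get_dma_addr(page))
			continue;

		dma_unmap_page_attrs(pool->p.dev, page_pool_get_dma_addr(page),
				     PAGE_SIZE << pool->p.order, pool->p.dma_dir,
				     DMA_ATTR_SKIP_CPU_SYNC);
		page_pool_set_dma_addr(page, 0);	/* from net/core/page_pool_priv.h */
	}
}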
Christoph Hellwig Nov. 7, 2024, 8:41 a.m. UTC | #22
On Wed, Nov 06, 2024 at 06:56:34PM +0800, Yunsheng Lin wrote:
> > It is a very radical change that page_pool needs to keep track of *ALL* in-flight pages.
> 
> I agree that it is a radical change; that is why it is targeting the net-next
> tree instead of the net tree even though there is a Fixes tag for it.
> 
> If there is a proper and non-radical way to fix that, I would prefer the
> non-radical way too.

As Robin already correctly pointed out, DMA mappings fundamentally can't
outlive the devices they were performed on.  So I don't think there will
be much hope for a non-radical fix for this design fault.
Yunsheng Lin Nov. 7, 2024, 11:09 a.m. UTC | #23
On 2024/11/6 23:57, Jesper Dangaard Brouer wrote:

...

>>
>> Some more info from production servers.
>>
>> (I'm amazed what we can do with a simple bpftrace script, Cc Viktor)
>>
>> In below bpftrace script/oneliner I'm extracting the inflight count, for
>> all page_pool's in the system, and storing that in a histogram hash.
>>
>> sudo bpftrace -e '
>>   rawtracepoint:page_pool_state_release { @cnt[probe]=count();
>>    @cnt_total[probe]=count();
>>    $pool=(struct page_pool*)arg0;
>>    $release_cnt=(uint32)arg2;
>>    $hold_cnt=$pool->pages_state_hold_cnt;
>>    $inflight_cnt=(int32)($hold_cnt - $release_cnt);
>>    @inflight=hist($inflight_cnt);
>>   }
>>   interval:s:1 {time("\n%H:%M:%S\n");
>>    print(@cnt); clear(@cnt);
>>    print(@inflight);
>>    print(@cnt_total);
>>   }'
>>
>> The page_pool behavior depend on how NIC driver use it, so I've run this on two prod servers with drivers bnxt and mlx5, on a 6.6.51 kernel.
>>
>> Driver: bnxt_en
>> - kernel 6.6.51
>>
>> @cnt[rawtracepoint:page_pool_state_release]: 8447
>> @inflight:
>> [0]             507 |                                        |
>> [1]             275 |                                        |
>> [2, 4)          261 |                                        |
>> [4, 8)          215 |                                        |
>> [8, 16)         259 |                                        |
>> [16, 32)        361 |                                        |
>> [32, 64)        933 |                                        |
>> [64, 128)      1966 |                                        |
>> [128, 256)   937052 |@@@@@@@@@                               |
>> [256, 512)  5178744 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
>> [512, 1K)     73908 |                                        |
>> [1K, 2K)    1220128 |@@@@@@@@@@@@                            |
>> [2K, 4K)    1532724 |@@@@@@@@@@@@@@@                         |
>> [4K, 8K)    1849062 |@@@@@@@@@@@@@@@@@@                      |
>> [8K, 16K)   1466424 |@@@@@@@@@@@@@@                          |
>> [16K, 32K)   858585 |@@@@@@@@                                |
>> [32K, 64K)   693893 |@@@@@@                                  |
>> [64K, 128K)  170625 |@                                       |
>>
>> Driver: mlx5_core
>>   - Kernel: 6.6.51
>>
>> @cnt[rawtracepoint:page_pool_state_release]: 1975
>> @inflight:
>> [128, 256)         28293 |@@@@                               |
>> [256, 512)        184312 |@@@@@@@@@@@@@@@@@@@@@@@@@@@        |
>> [512, 1K)              0 |                                   |
>> [1K, 2K)            4671 |                                   |
>> [2K, 4K)          342571 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
>> [4K, 8K)          180520 |@@@@@@@@@@@@@@@@@@@@@@@@@@@        |
>> [8K, 16K)          96483 |@@@@@@@@@@@@@@                     |
>> [16K, 32K)         25133 |@@@                                |
>> [32K, 64K)          8274 |@                                  |
>>
>>
>> The key thing to notice that we have up-to 128,000 pages in flight on
>> these random production servers. The NIC have 64 RX queue configured,
>> thus also 64 page_pool objects.
>>
> 
> I realized that we primarily want to know the maximum in-flight pages.
> 
> So, I modified the bpftrace oneliner to track the max for each page_pool in the system.
> 
> sudo bpftrace -e '
>  rawtracepoint:page_pool_state_release { @cnt[probe]=count();
>   @cnt_total[probe]=count();
>   $pool=(struct page_pool*)arg0;
>   $release_cnt=(uint32)arg2;
>   $hold_cnt=$pool->pages_state_hold_cnt;
>   $inflight_cnt=(int32)($hold_cnt - $release_cnt);
>   $cur=@inflight_max[$pool];
>   if ($inflight_cnt > $cur) {
>     @inflight_max[$pool]=$inflight_cnt;}
>  }
>  interval:s:1 {time("\n%H:%M:%S\n");
>   print(@cnt); clear(@cnt);
>   print(@inflight_max);
>   print(@cnt_total);
>  }'
> 
> I've attached the output from the script.
> For unknown reason this system had 199 page_pool objects.

Perhaps some of those page_pool objects are per_cpu page_pool
objects from net_page_pool_create()?

It would be good if the pool_size for those page_pool objects
were printed too.

> 
> The 20 top users:
> 
> $ cat out02.inflight-max | grep inflight_max | tail -n 20
> @inflight_max[0xffff88829133d800]: 26473
> @inflight_max[0xffff888293c3e000]: 27042
> @inflight_max[0xffff888293c3b000]: 27709
> @inflight_max[0xffff8881076f2800]: 29400
> @inflight_max[0xffff88818386e000]: 29690
> @inflight_max[0xffff8882190b1800]: 29813
> @inflight_max[0xffff88819ee83800]: 30067
> @inflight_max[0xffff8881076f4800]: 30086
> @inflight_max[0xffff88818386b000]: 31116
> @inflight_max[0xffff88816598f800]: 36970
> @inflight_max[0xffff8882190b7800]: 37336
> @inflight_max[0xffff888293c38800]: 39265
> @inflight_max[0xffff888293c3c800]: 39632
> @inflight_max[0xffff888293c3b800]: 43461
> @inflight_max[0xffff888293c3f000]: 43787
> @inflight_max[0xffff88816598f000]: 44557
> @inflight_max[0xffff888132ce9000]: 45037
> @inflight_max[0xffff888293c3f800]: 51843
> @inflight_max[0xffff888183869800]: 62612
> @inflight_max[0xffff888113d08000]: 73203
> 
> Adding all values together:
> 
>  grep inflight_max out02.inflight-max | awk 'BEGIN {tot=0} {tot+=$2; printf "total:" tot "\n"}' | tail -n 1
> 
> total:1707129
> 
> Worst case we need a data structure holding 1,707,129 pages.

For a 64-bit system, that means about 54MB of memory overhead for tracking those
inflight pages if 16 bytes of metadata are needed for each page; I guess
that is OK for those large systems.

> Fortunately, we don't need a single data structure as this will be split
> between 199 page_pool's.

It would be good to have an average value for the number of inflight pages,
so that we might be able to have statically allocated memory to satisfy
the common case, and use dynamically allocated memory if/when
necessary.

> 
> --Jesper
Yunsheng Lin Nov. 7, 2024, 11:10 a.m. UTC | #24
On 2024/11/7 3:55, Alexander Duyck wrote:
                         |
...

> 
> Is there any specific reason for why we need to store the pages
> instead of just scanning the page tables to look for them? We should
> already know how many we need to look for and free. If we were to just
> scan the page structs and identify the page pool pages that are
> pointing to our pool we should be able to go through and clean them
> up. It won't be the fastest approach, but this should be an
> exceptional case to handle things like a hot plug removal of a device
> where we can essentially run this in the background before we free the
> device.

Does 'scanning the page tables' mean scanning the array of 'struct page *',
like vmemmap/memmap?

I am not sure if there is any existing pattern or API to scan that array.
Does it fall under the category of poking into the internals of the mm subsystem?
For example, there seem to be different implementations of that array depending
on the CONFIG_SPARSEMEM* config.

Also, I am not sure how much time it may take if we have to scan the array
of 'struct page *' for all the memory in the system.

> 
> Then it would just be a matter of modifying the pool so that it will
> drop support for doing DMA unmapping and essentially just become a
> place for the freed pages to go to die.
> 
If I understand it correctly, the above seems to be what this patch is
trying to do by clearing pool->dma_map after doing the dma unmapping for
the inflight pages.
Yunsheng Lin Nov. 11, 2024, 11:31 a.m. UTC | #25
On 2024/10/26 15:33, Yunsheng Lin wrote:

...

>>>
>>> AFAIU Jakub's comment on his RFC patch for waiting, he was suggesting
>>> exactly this: Add the wait, and see if the cases where it can stall turn
>>> out to be problems in practice.
>>
>> +1
>>
>> I like Jakub's approach.
> 
> As mentioned in Toke's comment, I am still not convinced that there is some
> easy way of waiting here; doing the kick in kernel space is hard enough, and
> I am not even sure if the kick needs to be done, or how it can be done, in
> user space if some page_pool owned page is held from user space for the cases
> of zero rx copy, io_uring and devmem tcp. Kill the userspace app?
> 
> If you and Toke still think the waiting is the way out for the problem here, maybe
> we should wait for Jakub's opinion and let him decide if he wants to proceed with
> his waiting patch.

Is there any other suggestion/concern about how to fix the problem here?

From the previous discussion, it seems the main concern about tracking the
inflight pages is how many inflight pages need to be tracked.

If there is no other suggestion/concern, it seems the above concern might be
addressed by using pre-allocated memory to satisfy the common case, and
using dynamically allocated memory if/when necessary.
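
(For context, a minimal sketch of what "pre-allocated plus dynamic fallback"
could look like; the names, sizes, linear probing and GFP context here are
assumptions for illustration, not part of the patch:)

#include <linux/slab.h>
#include <net/netmem.h>

#define PP_ITEM_INLINE		512	/* covers the common case */
#define PP_ITEM_OVERFLOW	4096	/* grown lazily when needed */

struct pp_item_block {
	unsigned int	cnt;
	netmem_ref	items[];
};

struct pp_item_tracker {
	netmem_ref		inline_items[PP_ITEM_INLINE];
	struct pp_item_block	*overflow;	/* NULL until first needed */
};

/* Record one in-flight netmem; linear probing kept for simplicity. */
static bool pp_item_track(struct pp_item_tracker *t, netmem_ref nm)
{
	unsigned int i;

	for (i = 0; i < PP_ITEM_INLINE; i++) {
		if (!t->inline_items[i]) {
			t->inline_items[i] = nm;
			return true;
		}
	}

	if (!t->overflow) {
		/* allocation context glossed over here */
		t->overflow = kvzalloc(struct_size(t->overflow, items,
						   PP_ITEM_OVERFLOW), GFP_KERNEL);
		if (!t->overflow)
			return false;
		t->overflow->cnt = PP_ITEM_OVERFLOW;
	}

	for (i = 0; i < t->overflow->cnt; i++) {
		if (!t->overflow->items[i]) {
			t->overflow->items[i] = nm;
			return true;
		}
	}

	return false;
}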
Toke Høiland-Jørgensen Nov. 11, 2024, 6:51 p.m. UTC | #26
Yunsheng Lin <linyunsheng@huawei.com> writes:

> On 2024/10/26 15:33, Yunsheng Lin wrote:
>
> ...
>
>>>>
>>>> AFAIU Jakub's comment on his RFC patch for waiting, he was suggesting
>>>> exactly this: Add the wait, and see if the cases where it can stall turn
>>>> out to be problems in practice.
>>>
>>> +1
>>>
>>> I like Jakub's approach.
>> 
>> As mentioned in Toke's comment, I am still not convinced that there is some
>> easy way of waiting here, doing the kick in the kernel space is hard enough,
>> I am not even sure if kick need to be done or how it can be done in the user
>> space if some page_pool owned page is held from user space for the cases of zero
>> rx copy, io_uring and devmem tcp? killing the userspace app?
>> 
>> If you and Toke still think the waiting is the way out for the problem here, maybe
>> we should wait for jakub's opinion and let him decide if he want to proceed with
>> his waiting patch.
>
> Is there any other suggestion/concern about how to fix the problem here?
>
> From the previous discussion, it seems the main concern about tracking the
> inflight pages is about how many inflight pages it is needed.

Yeah, my hardest objection was against putting a hard limit on the
number of outstanding pages.

> If there is no other suggestion/concern , it seems the above concern might be
> addressed by using pre-allocated memory to satisfy the mostly used case, and
> use the dynamically allocated memory if/when necessary.

For this, my biggest concern would be performance.

In general, doing extra work in rarely used code paths (such as device
teardown) is much preferred to adding extra tracking in the fast path.
Which would be an argument for Alexander's suggestion of just scanning
the entire system page table to find pages to unmap. Don't know enough
about mm system internals to have an opinion on whether this is
feasible, though.

In any case, we'll need some numbers to really judge the overhead in
practice. So benchmarking would be the logical next step in any case :)

-Toke
Yunsheng Lin Nov. 12, 2024, 12:22 p.m. UTC | #27
On 2024/11/12 2:51, Toke Høiland-Jørgensen wrote:

...

>>
>> Is there any other suggestion/concern about how to fix the problem here?
>>
>> From the previous discussion, it seems the main concern about tracking the
>> inflight pages is about how many inflight pages it is needed.
> 
> Yeah, my hardest objection was against putting a hard limit on the
> number of outstanding pages.
> 
>> If there is no other suggestion/concern , it seems the above concern might be
>> addressed by using pre-allocated memory to satisfy the mostly used case, and
>> use the dynamically allocated memory if/when necessary.
> 
> For this, my biggest concern would be performance.
> 
> In general, doing extra work in rarely used code paths (such as device
> teardown) is much preferred to adding extra tracking in the fast path.
> Which would be an argument for Alexander's suggestion of just scanning
> the entire system page table to find pages to unmap. Don't know enough
> about mm system internals to have an opinion on whether this is
> feasible, though.

Yes, there seem to be many MM internals involved, like the CONFIG_SPARSEMEM*
configs, memory offline/online and other MM-specific optimizations, so it
is hard to tell whether it is feasible.

It would be good if MM experts could clarify this.

> 
> In any case, we'll need some numbers to really judge the overhead in
> practice. So benchmarking would be the logical next step in any case :)

Using POC code shows that dynamic memory allocation does not seem to add
much more overhead than the pre-allocated memory allocation in this patch;
the overhead is about 10~20ns, which seems to be similar to the overhead
added by this patch.

> 
> -Toke
> 
>
Jesper Dangaard Brouer Nov. 12, 2024, 2:19 p.m. UTC | #28
On 12/11/2024 13.22, Yunsheng Lin wrote:
> On 2024/11/12 2:51, Toke Høiland-Jørgensen wrote:
> 
> ...
> 
>>>
>>> Is there any other suggestion/concern about how to fix the problem here?
>>>
>>>  From the previous discussion, it seems the main concern about tracking the
>>> inflight pages is about how many inflight pages it is needed.
>>
>> Yeah, my hardest objection was against putting a hard limit on the
>> number of outstanding pages.
>>
>>> If there is no other suggestion/concern , it seems the above concern might be
>>> addressed by using pre-allocated memory to satisfy the mostly used case, and
>>> use the dynamically allocated memory if/when necessary.
>>
>> For this, my biggest concern would be performance.
>>
>> In general, doing extra work in rarely used code paths (such as device
>> teardown) is much preferred to adding extra tracking in the fast path.
>> Which would be an argument for Alexander's suggestion of just scanning
>> the entire system page table to find pages to unmap. Don't know enough
>> about mm system internals to have an opinion on whether this is
>> feasible, though.
> 
> Yes, there seems to be many MM system internals, like the CONFIG_SPARSEMEM*
> config, memory offline/online and other MM specific optimization that it
> is hard to tell it is feasible.
> 
> It would be good if MM experts can clarify on this.
>

Yes, please.  Can Alex Duyck or MM experts point me at some code walking
the entire system page table?

Then I'll write some kernel code (maybe a module) so that I can benchmark how
long it takes on my machine with 384GiB. I do like Alex's suggestion,
but I want to assess the overhead of doing this on modern hardware.

>>
>> In any case, we'll need some numbers to really judge the overhead in
>> practice. So benchmarking would be the logical next step in any case :)
> 
> Using POC code show that using the dynamic memory allocation does not
> seems to be adding much overhead than the pre-allocated memory allocation
> in this patch, the overhead is about 10~20ns, which seems to be similar to
> the overhead of added overhead in the patch.
> 

Overhead around 10~20ns is too large for page_pool, because the XDP DDoS
use-case has a very small time budget (which is what page_pool was
designed for).

[1] 
https://github.com/xdp-project/xdp-project/blob/master/areas/hints/traits01_bench_kmod.org#benchmark-basics

  | Link speed | Packet rate           | Time-budget   |
  |            | at smallest pkts size | per packet    |
  |------------+-----------------------+---------------|
  |  10 Gbit/s |  14,880,952 pps       | 67.2 nanosec  |
  |  25 Gbit/s |  37,202,381 pps       | 26.88 nanosec |
  | 100 Gbit/s | 148,809,523 pps       |  6.72 nanosec |
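
(For reference, the time budget is just the inverse of the packet rate: a
minimum-size Ethernet frame occupies 84 bytes on the wire (64B frame + 20B
preamble/IFG) = 672 bits, so 10 Gbit/s / 672 bits ≈ 14,880,952 pps, and
1 s / 14,880,952 ≈ 67.2 ns per packet. An extra 10~20ns per packet is
therefore 15-30% of the whole budget even at 10 Gbit/s.)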


--Jesper
Yunsheng Lin Nov. 13, 2024, 12:21 p.m. UTC | #29
On 2024/11/12 22:19, Jesper Dangaard Brouer wrote:

...

>>>
>>> In any case, we'll need some numbers to really judge the overhead in
>>> practice. So benchmarking would be the logical next step in any case :)
>>
>> Using POC code show that using the dynamic memory allocation does not
>> seems to be adding much overhead than the pre-allocated memory allocation
>> in this patch, the overhead is about 10~20ns, which seems to be similar to
>> the overhead of added overhead in the patch.
>>
> 
> Overhead around 10~20ns is too large for page_pool, because XDP DDoS
> use-case have a very small time budget (which is what page_pool was
> designed for).

I should have mentioned that the above 10~20ns overhead is from the
test case of time_bench_page_pool03_slow() in bench_page_pool_simple.

More detailed test results are below:

After:
root@(none)$ taskset -c 0 insmod bench_page_pool_simple.ko
[   50.359865] bench_page_pool_simple: Loaded
[   50.440982] time_bench: Type:for_loop Per elem: 0 cycles(tsc) 0.769 ns (step:0) - (measurement period time:0.076980410 sec time_interval:76980410) - (invoke count:100000000 tsc_interval:7698030)
[   52.497915] time_bench: Type:atomic_inc Per elem: 2 cycles(tsc) 20.396 ns (step:0) - (measurement period time:2.039650210 sec time_interval:2039650210) - (invoke count:100000000 tsc_interval:203965016)
[   52.665872] time_bench: Type:lock Per elem: 1 cycles(tsc) 15.006 ns (step:0) - (measurement period time:0.150067780 sec time_interval:150067780) - (invoke count:10000000 tsc_interval:15006773)
[   53.337133] time_bench: Type:rcu Per elem: 0 cycles(tsc) 6.541 ns (step:0) - (measurement period time:0.654153620 sec time_interval:654153620) - (invoke count:100000000 tsc_interval:65415355)
[   53.354152] bench_page_pool_simple: time_bench_page_pool01_fast_path(): Cannot use page_pool fast-path
[   53.647814] time_bench: Type:no-softirq-page_pool01 Per elem: 2 cycles(tsc) 28.436 ns (step:0) - (measurement period time:0.284369800 sec time_interval:284369800) - (invoke count:10000000 tsc_interval:28436974)
[   53.666482] bench_page_pool_simple: time_bench_page_pool02_ptr_ring(): Cannot use page_pool fast-path
[   54.264789] time_bench: Type:no-softirq-page_pool02 Per elem: 5 cycles(tsc) 58.910 ns (step:0) - (measurement period time:0.589102240 sec time_interval:589102240) - (invoke count:10000000 tsc_interval:58910216)
[   54.283459] bench_page_pool_simple: time_bench_page_pool03_slow(): Cannot use page_pool fast-path
[   56.202440] time_bench: Type:no-softirq-page_pool03 Per elem: 19 cycles(tsc) 191.012 ns (step:0) - (measurement period time:1.910122260 sec time_interval:1910122260) - (invoke count:10000000 tsc_interval:191012216)
[   56.221463] bench_page_pool_simple: pp_tasklet_handler(): in_serving_softirq fast-path
[   56.229367] bench_page_pool_simple: time_bench_page_pool01_fast_path(): in_serving_softirq fast-path
[   56.521551] time_bench: Type:tasklet_page_pool01_fast_path Per elem: 2 cycles(tsc) 28.306 ns (step:0) - (measurement period time:0.283066000 sec time_interval:283066000) - (invoke count:10000000 tsc_interval:28306590)
[   56.540827] bench_page_pool_simple: time_bench_page_pool02_ptr_ring(): in_serving_softirq fast-path
[   57.203988] time_bench: Type:tasklet_page_pool02_ptr_ring Per elem: 6 cycles(tsc) 65.412 ns (step:0) - (measurement period time:0.654129240 sec time_interval:654129240) - (invoke count:10000000 tsc_interval:65412917)
[   57.223177] bench_page_pool_simple: time_bench_page_pool03_slow(): in_serving_softirq fast-path
[   59.297677] time_bench: Type:tasklet_page_pool03_slow Per elem: 20 cycles(tsc) 206.581 ns (step:0) - (measurement period time:2.065816850 sec time_interval:2065816850) - (invoke count:10000000 tsc_interval:206581679)


Before:
root@(none)$ taskset -c 0 insmod bench_page_pool_simple.ko
[  519.020980] bench_page_pool_simple: Loaded
[  519.102080] time_bench: Type:for_loop Per elem: 0 cycles(tsc) 0.769 ns (step:0) - (measurement period time:0.076979320 sec time_interval:76979320) - (invoke count:100000000 tsc_interval:7697917)
[  520.466133] time_bench: Type:atomic_inc Per elem: 1 cycles(tsc) 13.467 ns (step:0) - (measurement period time:1.346763300 sec time_interval:1346763300) - (invoke count:100000000 tsc_interval:134676325)
[  520.634079] time_bench: Type:lock Per elem: 1 cycles(tsc) 15.005 ns (step:0) - (measurement period time:0.150054340 sec time_interval:150054340) - (invoke count:10000000 tsc_interval:15005430)
[  521.190881] time_bench: Type:rcu Per elem: 0 cycles(tsc) 5.396 ns (step:0) - (measurement period time:0.539696370 sec time_interval:539696370) - (invoke count:100000000 tsc_interval:53969632)
[  521.207901] bench_page_pool_simple: time_bench_page_pool01_fast_path(): Cannot use page_pool fast-path
[  521.514478] time_bench: Type:no-softirq-page_pool01 Per elem: 2 cycles(tsc) 29.728 ns (step:0) - (measurement period time:0.297282500 sec time_interval:297282500) - (invoke count:10000000 tsc_interval:29728246)
[  521.533148] bench_page_pool_simple: time_bench_page_pool02_ptr_ring(): Cannot use page_pool fast-path
[  522.117048] time_bench: Type:no-softirq-page_pool02 Per elem: 5 cycles(tsc) 57.469 ns (step:0) - (measurement period time:0.574694970 sec time_interval:574694970) - (invoke count:10000000 tsc_interval:57469491)
[  522.135717] bench_page_pool_simple: time_bench_page_pool03_slow(): Cannot use page_pool fast-path
[  523.962813] time_bench: Type:no-softirq-page_pool03 Per elem: 18 cycles(tsc) 181.823 ns (step:0) - (measurement period time:1.818238850 sec time_interval:1818238850) - (invoke count:10000000 tsc_interval:181823878)
[  523.981837] bench_page_pool_simple: pp_tasklet_handler(): in_serving_softirq fast-path
[  523.989742] bench_page_pool_simple: time_bench_page_pool01_fast_path(): in_serving_softirq fast-path
[  524.296961] time_bench: Type:tasklet_page_pool01_fast_path Per elem: 2 cycles(tsc) 29.810 ns (step:0) - (measurement period time:0.298100890 sec time_interval:298100890) - (invoke count:10000000 tsc_interval:29810083)
[  524.316236] bench_page_pool_simple: time_bench_page_pool02_ptr_ring(): in_serving_softirq fast-path
[  524.852783] time_bench: Type:tasklet_page_pool02_ptr_ring Per elem: 5 cycles(tsc) 52.751 ns (step:0) - (measurement period time:0.527516430 sec time_interval:527516430) - (invoke count:10000000 tsc_interval:52751638)
[  524.871972] bench_page_pool_simple: time_bench_page_pool03_slow(): in_serving_softirq fast-path
[  526.710040] time_bench: Type:tasklet_page_pool03_slow Per elem: 18 cycles(tsc) 182.938 ns (step:0) - (measurement period time:1.829384610 sec time_interval:1829384610) - (invoke count:10000000 tsc_interval:182938456)
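
(Comparing the page_pool03 rows of the two runs: tasklet_page_pool03_slow goes
from 182.938 ns to 206.581 ns (~23.6 ns more) and no-softirq-page_pool03 goes
from 181.823 ns to 191.012 ns (~9.2 ns more), roughly matching the 10~20ns
estimate above.)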


> 
> [1] https://github.com/xdp-project/xdp-project/blob/master/areas/hints/traits01_bench_kmod.org#benchmark-basics
> 
>  | Link speed | Packet rate           | Time-budget   |
>  |            | at smallest pkts size | per packet    |
>  |------------+-----------------------+---------------|
>  |  10 Gbit/s |  14,880,952 pps       | 67.2 nanosec  |
>  |  25 Gbit/s |  37,202,381 pps       | 26.88 nanosec |
>  | 100 Gbit/s | 148,809,523 pps       |  6.72 nanosec |
> 
> 
> --Jesper
Jesper Dangaard Brouer Nov. 18, 2024, 3:11 p.m. UTC | #30
On 18/11/2024 10.08, Yunsheng Lin wrote:
> On 2024/11/12 22:19, Jesper Dangaard Brouer wrote:
>>>
>>> Yes, there seems to be many MM system internals, like the CONFIG_SPARSEMEM*
>>> config, memory offline/online and other MM specific optimization that it
>>> is hard to tell it is feasible.
>>>
>>> It would be good if MM experts can clarify on this.
>>>
>>
>> Yes, please.  Can Alex Duyck or MM-experts point me at some code walking
>> entire system page table?
>>
>> Then I'll write some kernel code (maybe module) that I can benchmark how
>> long it takes on my machine with 384GiB. I do like Alex'es suggestion,
>> but I want to assess the overhead of doing this on modern hardware.
>>
> 
> After looking more closely into the MM subsystem, it seems there is an existing
> pattern/API to walk all the pages from the buddy allocator subsystem; see
> kmemleak_scan() in mm/kmemleak.c:
> https://elixir.bootlin.com/linux/v6.12/source/mm/kmemleak.c#L1680
> 
> I used that to walk the pages on an arm64 system with over 300GB of memory;
> it took about 1.3 sec to do the walking, which seems acceptable?

Yes, that seems acceptable to me.

I'll also do a test on one of my 384 GB systems.
  - It took approx 0.391661 seconds.

I just dereferenced page->pp_magic and counted the pages; not many pages were
in use (page_count(page) > 0), as the machine had just been rebooted into
this kernel:
  - pages=100592572 in-use:2079607
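
(Roughly what such a counting walk could look like, modelled on a
kmemleak-style node/PFN iteration; this is a reconstruction for illustration,
not the code that was actually run:)

#include <linux/ktime.h>
#include <linux/memory_hotplug.h>
#include <linux/mm.h>
#include <linux/nodemask.h>
#include <linux/sched.h>

static void pp_count_walk(void)
{
	unsigned long pages = 0, in_use = 0;
	ktime_t start = ktime_get();
	int nid;

	for_each_online_node(nid) {
		unsigned long pfn;

		for (pfn = node_start_pfn(nid); pfn < node_end_pfn(nid); pfn++) {
			struct page *page = pfn_to_online_page(pfn);

			if (!page)
				continue;

			READ_ONCE(page->pp_magic);	/* the deref being timed */
			pages++;
			if (page_count(page) > 0)
				in_use++;
		}
		cond_resched();
	}

	pr_info("pages=%lu in-use:%lu walk took %lld us\n",
		pages, in_use, ktime_us_delta(ktime_get(), start));
}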

--Jesper
diff mbox series

Patch

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 6e3bdf8e38bc..d3e30dcfa021 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -120,7 +120,7 @@  struct page {
 			 * page_pool allocated pages.
 			 */
 			unsigned long pp_magic;
-			struct page_pool *pp;
+			struct page_pool_item *pp_item;
 			unsigned long _pp_mapping_pad;
 			unsigned long dma_addr;
 			atomic_long_t pp_ref_count;
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 48f1e0fa2a13..44f3707ec3e3 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -38,6 +38,7 @@ 
 #include <net/net_debug.h>
 #include <net/dropreason-core.h>
 #include <net/netmem.h>
+#include <net/page_pool/types.h>
 
 /**
  * DOC: skb checksums
diff --git a/include/net/netmem.h b/include/net/netmem.h
index 8a6e20be4b9d..5e7b4d1c1c44 100644
--- a/include/net/netmem.h
+++ b/include/net/netmem.h
@@ -23,7 +23,7 @@  DECLARE_STATIC_KEY_FALSE(page_pool_mem_providers);
 struct net_iov {
 	unsigned long __unused_padding;
 	unsigned long pp_magic;
-	struct page_pool *pp;
+	struct page_pool_item *pp_item;
 	struct dmabuf_genpool_chunk_owner *owner;
 	unsigned long dma_addr;
 	atomic_long_t pp_ref_count;
@@ -33,7 +33,7 @@  struct net_iov {
  *
  *        struct {
  *                unsigned long pp_magic;
- *                struct page_pool *pp;
+ *                struct page_pool_item *pp_item;
  *                unsigned long _pp_mapping_pad;
  *                unsigned long dma_addr;
  *                atomic_long_t pp_ref_count;
@@ -49,7 +49,7 @@  struct net_iov {
 	static_assert(offsetof(struct page, pg) == \
 		      offsetof(struct net_iov, iov))
 NET_IOV_ASSERT_OFFSET(pp_magic, pp_magic);
-NET_IOV_ASSERT_OFFSET(pp, pp);
+NET_IOV_ASSERT_OFFSET(pp_item, pp_item);
 NET_IOV_ASSERT_OFFSET(dma_addr, dma_addr);
 NET_IOV_ASSERT_OFFSET(pp_ref_count, pp_ref_count);
 #undef NET_IOV_ASSERT_OFFSET
@@ -127,9 +127,9 @@  static inline struct net_iov *__netmem_clear_lsb(netmem_ref netmem)
 	return (struct net_iov *)((__force unsigned long)netmem & ~NET_IOV);
 }
 
-static inline struct page_pool *netmem_get_pp(netmem_ref netmem)
+static inline struct page_pool_item *netmem_get_pp_item(netmem_ref netmem)
 {
-	return __netmem_clear_lsb(netmem)->pp;
+	return __netmem_clear_lsb(netmem)->pp_item;
 }
 
 static inline atomic_long_t *netmem_get_pp_ref_count_ref(netmem_ref netmem)
diff --git a/include/net/page_pool/helpers.h b/include/net/page_pool/helpers.h
index 1659f1995985..f781c81f8aa9 100644
--- a/include/net/page_pool/helpers.h
+++ b/include/net/page_pool/helpers.h
@@ -85,7 +85,9 @@  static inline u64 *page_pool_ethtool_stats_get(u64 *data, const void *stats)
 
 static inline struct page_pool *page_pool_to_pp(struct page *page)
 {
-	return page->pp;
+	struct page_pool_item *item = page->pp_item;
+
+	return container_of(item, struct page_pool, items[item->pp_idx]);
 }
 
 /**
diff --git a/include/net/page_pool/types.h b/include/net/page_pool/types.h
index c022c410abe3..194006d2930f 100644
--- a/include/net/page_pool/types.h
+++ b/include/net/page_pool/types.h
@@ -102,6 +102,7 @@  struct page_pool_params {
  * @refill:	an allocation which triggered a refill of the cache
  * @waive:	pages obtained from the ptr ring that cannot be added to
  *		the cache due to a NUMA mismatch
+ * @item_full:	items array is full
  */
 struct page_pool_alloc_stats {
 	u64 fast;
@@ -110,6 +111,7 @@  struct page_pool_alloc_stats {
 	u64 empty;
 	u64 refill;
 	u64 waive;
+	u64 item_full;
 };
 
 /**
@@ -142,6 +144,11 @@  struct page_pool_stats {
 };
 #endif
 
+struct page_pool_item {
+	netmem_ref pp_netmem;
+	unsigned int pp_idx;
+};
+
 /* The whole frag API block must stay within one cacheline. On 32-bit systems,
  * sizeof(long) == sizeof(int), so that the block size is ``3 * sizeof(long)``.
  * On 64-bit systems, the actual size is ``2 * sizeof(long) + sizeof(int)``.
@@ -161,6 +168,8 @@  struct page_pool {
 
 	int cpuid;
 	u32 pages_state_hold_cnt;
+	unsigned int item_mask;
+	unsigned int item_idx;
 
 	bool has_init_callback:1;	/* slow::init_callback is set */
 	bool dma_map:1;			/* Perform DMA mapping */
@@ -228,7 +237,11 @@  struct page_pool {
 	 */
 	refcount_t user_cnt;
 
-	u64 destroy_cnt;
+	/* Lock to avoid doing dma unmapping concurrently when
+	 * destroy_cnt > 0.
+	 */
+	spinlock_t destroy_lock;
+	unsigned int destroy_cnt;
 
 	/* Slow/Control-path information follows */
 	struct page_pool_params_slow slow;
@@ -239,6 +252,8 @@  struct page_pool {
 		u32 napi_id;
 		u32 id;
 	} user;
+
+	struct page_pool_item items[] ____cacheline_aligned_in_smp;
 };
 
 struct page *page_pool_alloc_pages(struct page_pool *pool, gfp_t gfp);
diff --git a/net/core/devmem.c b/net/core/devmem.c
index 11b91c12ee11..09c5aa83f12a 100644
--- a/net/core/devmem.c
+++ b/net/core/devmem.c
@@ -85,7 +85,7 @@  net_devmem_alloc_dmabuf(struct net_devmem_dmabuf_binding *binding)
 	niov = &owner->niovs[index];
 
 	niov->pp_magic = 0;
-	niov->pp = NULL;
+	niov->pp_item = NULL;
 	atomic_long_set(&niov->pp_ref_count, 0);
 
 	return niov;
@@ -380,7 +380,7 @@  bool mp_dmabuf_devmem_release_page(struct page_pool *pool, netmem_ref netmem)
 	if (WARN_ON_ONCE(refcount != 1))
 		return false;
 
-	page_pool_clear_pp_info(netmem);
+	page_pool_clear_pp_info(pool, netmem);
 
 	net_devmem_free_dmabuf(netmem_to_net_iov(netmem));
 
diff --git a/net/core/netmem_priv.h b/net/core/netmem_priv.h
index 7eadb8393e00..3173f6070cf7 100644
--- a/net/core/netmem_priv.h
+++ b/net/core/netmem_priv.h
@@ -18,9 +18,10 @@  static inline void netmem_clear_pp_magic(netmem_ref netmem)
 	__netmem_clear_lsb(netmem)->pp_magic = 0;
 }
 
-static inline void netmem_set_pp(netmem_ref netmem, struct page_pool *pool)
+static inline void netmem_set_pp_item(netmem_ref netmem,
+				      struct page_pool_item *item)
 {
-	__netmem_clear_lsb(netmem)->pp = pool;
+	__netmem_clear_lsb(netmem)->pp_item = item;
 }
 
 static inline void netmem_set_dma_addr(netmem_ref netmem,
diff --git a/net/core/page_pool.c b/net/core/page_pool.c
index dd497f5c927d..fa4a98b4fc67 100644
--- a/net/core/page_pool.c
+++ b/net/core/page_pool.c
@@ -61,6 +61,7 @@  static const char pp_stats[][ETH_GSTRING_LEN] = {
 	"rx_pp_alloc_empty",
 	"rx_pp_alloc_refill",
 	"rx_pp_alloc_waive",
+	"rx_pp_alloc_item_full",
 	"rx_pp_recycle_cached",
 	"rx_pp_recycle_cache_full",
 	"rx_pp_recycle_ring",
@@ -94,6 +95,7 @@  bool page_pool_get_stats(const struct page_pool *pool,
 	stats->alloc_stats.empty += pool->alloc_stats.empty;
 	stats->alloc_stats.refill += pool->alloc_stats.refill;
 	stats->alloc_stats.waive += pool->alloc_stats.waive;
+	stats->alloc_stats.item_full += pool->alloc_stats.item_full;
 
 	for_each_possible_cpu(cpu) {
 		const struct page_pool_recycle_stats *pcpu =
@@ -139,6 +141,7 @@  u64 *page_pool_ethtool_stats_get(u64 *data, const void *stats)
 	*data++ = pool_stats->alloc_stats.empty;
 	*data++ = pool_stats->alloc_stats.refill;
 	*data++ = pool_stats->alloc_stats.waive;
+	*data++ = pool_stats->alloc_stats.item_full;
 	*data++ = pool_stats->recycle_stats.cached;
 	*data++ = pool_stats->recycle_stats.cache_full;
 	*data++ = pool_stats->recycle_stats.ring;
@@ -267,14 +270,12 @@  static int page_pool_init(struct page_pool *pool,
 		return -ENOMEM;
 	}
 
+	spin_lock_init(&pool->destroy_lock);
 	atomic_set(&pool->pages_state_release_cnt, 0);
 
 	/* Driver calling page_pool_create() also call page_pool_destroy() */
 	refcount_set(&pool->user_cnt, 1);
 
-	if (pool->dma_map)
-		get_device(pool->p.dev);
-
 	if (pool->slow.flags & PP_FLAG_ALLOW_UNREADABLE_NETMEM) {
 		/* We rely on rtnl_lock()ing to make sure netdev_rx_queue
 		 * configuration doesn't change while we're initializing
@@ -312,15 +313,93 @@  static void page_pool_uninit(struct page_pool *pool)
 {
 	ptr_ring_cleanup(&pool->ring, NULL);
 
-	if (pool->dma_map)
-		put_device(pool->p.dev);
-
 #ifdef CONFIG_PAGE_POOL_STATS
 	if (!pool->system)
 		free_percpu(pool->recycle_stats);
 #endif
 }
 
+static void page_pool_item_init(struct page_pool *pool, unsigned int item_cnt)
+{
+	struct page_pool_item *items = pool->items;
+	unsigned int i;
+
+	WARN_ON_ONCE(!is_power_of_2(item_cnt));
+
+	for (i = 0; i < item_cnt; i++)
+		items[i].pp_idx = i;
+
+	pool->item_mask = item_cnt - 1;
+}
+
+static void page_pool_item_uninit(struct page_pool *pool)
+{
+	struct page_pool_item *items = pool->items;
+	unsigned int mask = pool->item_mask;
+	unsigned int i, unmapped = 0;
+
+	if (!pool->dma_map || pool->mp_priv)
+		return;
+
+	spin_lock_bh(&pool->destroy_lock);
+
+	for (i = 0; i <= mask; i++) {
+		struct page *page;
+
+		page = netmem_to_page(READ_ONCE(items[i].pp_netmem));
+		if (!page)
+			continue;
+
+		unmapped++;
+		dma_unmap_page_attrs(pool->p.dev, page_pool_get_dma_addr(page),
+				     PAGE_SIZE << pool->p.order,
+				     pool->p.dma_dir, DMA_ATTR_SKIP_CPU_SYNC |
+				     DMA_ATTR_WEAK_ORDERING);
+		page_pool_set_dma_addr(page, 0);
+	}
+
+	WARN_ONCE(page_pool_inflight(pool, false) != unmapped,
+		  "page_pool(%u): unmapped(%u) != inflight page(%d)\n",
+		  pool->user.id, unmapped, page_pool_inflight(pool, false));
+
+	pool->dma_map = false;
+	spin_unlock_bh(&pool->destroy_lock);
+}
+
+static bool page_pool_item_add(struct page_pool *pool, netmem_ref netmem)
+{
+	struct page_pool_item *items = pool->items;
+	unsigned int mask = pool->item_mask;
+	unsigned int idx = pool->item_idx;
+	unsigned int i;
+
+	for (i = 0; i <= mask; i++) {
+		unsigned int mask_idx = idx++ & mask;
+
+		if (!READ_ONCE(items[mask_idx].pp_netmem)) {
+			WRITE_ONCE(items[mask_idx].pp_netmem, netmem);
+			netmem_set_pp_item(netmem, &items[mask_idx]);
+			pool->item_idx = idx;
+			return true;
+		}
+	}
+
+	pool->item_idx = idx;
+	alloc_stat_inc(pool, item_full);
+	return false;
+}
+
+static void page_pool_item_del(struct page_pool *pool, netmem_ref netmem)
+{
+	struct page_pool_item *item = netmem_to_page(netmem)->pp_item;
+	struct page_pool_item *items = pool->items;
+	unsigned int idx = item->pp_idx;
+
+	DEBUG_NET_WARN_ON_ONCE(items[idx].pp_netmem != netmem);
+	WRITE_ONCE(items[idx].pp_netmem, (netmem_ref)NULL);
+	netmem_set_pp_item(netmem, NULL);
+}
+
 /**
  * page_pool_create_percpu() - create a page pool for a given cpu.
  * @params: parameters, see struct page_pool_params
@@ -329,10 +408,15 @@  static void page_pool_uninit(struct page_pool *pool)
 struct page_pool *
 page_pool_create_percpu(const struct page_pool_params *params, int cpuid)
 {
+#define PAGE_POOL_MIN_INFLIGHT_ITEMS		512
+	unsigned int item_cnt = (params->pool_size ? : 1024) +
+				PP_ALLOC_CACHE_SIZE + PAGE_POOL_MIN_INFLIGHT_ITEMS;
 	struct page_pool *pool;
 	int err;
 
-	pool = kzalloc_node(sizeof(*pool), GFP_KERNEL, params->nid);
+	item_cnt = roundup_pow_of_two(item_cnt);
+	pool = kvzalloc_node(struct_size(pool, items, item_cnt), GFP_KERNEL,
+			     params->nid);
 	if (!pool)
 		return ERR_PTR(-ENOMEM);
 
@@ -340,6 +424,8 @@  page_pool_create_percpu(const struct page_pool_params *params, int cpuid)
 	if (err < 0)
 		goto err_free;
 
+	page_pool_item_init(pool, item_cnt);
+
 	err = page_pool_list(pool);
 	if (err)
 		goto err_uninit;
@@ -350,7 +436,7 @@  page_pool_create_percpu(const struct page_pool_params *params, int cpuid)
 	page_pool_uninit(pool);
 err_free:
 	pr_warn("%s() gave up with errno %d\n", __func__, err);
-	kfree(pool);
+	kvfree(pool);
 	return ERR_PTR(err);
 }
 EXPORT_SYMBOL(page_pool_create_percpu);
@@ -365,7 +451,7 @@  struct page_pool *page_pool_create(const struct page_pool_params *params)
 }
 EXPORT_SYMBOL(page_pool_create);
 
-static void page_pool_return_page(struct page_pool *pool, netmem_ref netmem);
+static void __page_pool_return_page(struct page_pool *pool, netmem_ref netmem);
 
 static noinline netmem_ref page_pool_refill_alloc_cache(struct page_pool *pool)
 {
@@ -403,7 +489,7 @@  static noinline netmem_ref page_pool_refill_alloc_cache(struct page_pool *pool)
 			 * (2) break out to fallthrough to alloc_pages_node.
 			 * This limit stress on page buddy alloactor.
 			 */
-			page_pool_return_page(pool, netmem);
+			__page_pool_return_page(pool, netmem);
 			alloc_stat_inc(pool, waive);
 			netmem = 0;
 			break;
@@ -436,26 +522,32 @@  static netmem_ref __page_pool_get_cached(struct page_pool *pool)
 	return netmem;
 }
 
-static void __page_pool_dma_sync_for_device(const struct page_pool *pool,
-					    netmem_ref netmem,
-					    u32 dma_sync_size)
+static __always_inline void
+__page_pool_dma_sync_for_device(const struct page_pool *pool, netmem_ref netmem,
+				u32 dma_sync_size)
 {
 #if defined(CONFIG_HAS_DMA) && defined(CONFIG_DMA_NEED_SYNC)
-	dma_addr_t dma_addr = page_pool_get_dma_addr_netmem(netmem);
+	if (pool->dma_sync && dma_dev_need_sync(pool->p.dev)) {
+		dma_addr_t dma_addr = page_pool_get_dma_addr_netmem(netmem);
 
-	dma_sync_size = min(dma_sync_size, pool->p.max_len);
-	__dma_sync_single_for_device(pool->p.dev, dma_addr + pool->p.offset,
-				     dma_sync_size, pool->p.dma_dir);
+		dma_sync_size = min(dma_sync_size, pool->p.max_len);
+		__dma_sync_single_for_device(pool->p.dev,
+					     dma_addr + pool->p.offset,
+					     dma_sync_size, pool->p.dma_dir);
+	}
 #endif
 }
 
+/* Called from the page_pool_put_*() path, needs to be synchronized with
+ * the page_pool_destroy() path.
+ */
 static __always_inline void
-page_pool_dma_sync_for_device(const struct page_pool *pool,
-			      netmem_ref netmem,
+page_pool_dma_sync_for_device(const struct page_pool *pool, netmem_ref netmem,
 			      u32 dma_sync_size)
 {
-	if (pool->dma_sync && dma_dev_need_sync(pool->p.dev))
-		__page_pool_dma_sync_for_device(pool, netmem, dma_sync_size);
+	rcu_read_lock();
+	__page_pool_dma_sync_for_device(pool, netmem, dma_sync_size);
+	rcu_read_unlock();
 }
 
 static bool page_pool_dma_map(struct page_pool *pool, netmem_ref netmem)
@@ -477,7 +569,7 @@  static bool page_pool_dma_map(struct page_pool *pool, netmem_ref netmem)
 	if (page_pool_set_dma_addr_netmem(netmem, dma))
 		goto unmap_failed;
 
-	page_pool_dma_sync_for_device(pool, netmem, pool->p.max_len);
+	__page_pool_dma_sync_for_device(pool, netmem, pool->p.max_len);
 
 	return true;
 
@@ -499,19 +591,24 @@  static struct page *__page_pool_alloc_page_order(struct page_pool *pool,
 	if (unlikely(!page))
 		return NULL;
 
-	if (pool->dma_map && unlikely(!page_pool_dma_map(pool, page_to_netmem(page)))) {
-		put_page(page);
-		return NULL;
-	}
+	if (unlikely(!page_pool_set_pp_info(pool, page_to_netmem(page))))
+		goto err_alloc;
+
+	if (pool->dma_map && unlikely(!page_pool_dma_map(pool, page_to_netmem(page))))
+		goto err_set_info;
 
 	alloc_stat_inc(pool, slow_high_order);
-	page_pool_set_pp_info(pool, page_to_netmem(page));
 
 	/* Track how many pages are held 'in-flight' */
 	pool->pages_state_hold_cnt++;
 	trace_page_pool_state_hold(pool, page_to_netmem(page),
 				   pool->pages_state_hold_cnt);
 	return page;
+err_set_info:
+	page_pool_clear_pp_info(pool, page_to_netmem(page));
+err_alloc:
+	put_page(page);
+	return NULL;
 }
 
 /* slow path */
@@ -546,12 +643,18 @@  static noinline netmem_ref __page_pool_alloc_pages_slow(struct page_pool *pool,
 	 */
 	for (i = 0; i < nr_pages; i++) {
 		netmem = pool->alloc.cache[i];
+
+		if (unlikely(!page_pool_set_pp_info(pool, netmem))) {
+			put_page(netmem_to_page(netmem));
+			continue;
+		}
+
 		if (dma_map && unlikely(!page_pool_dma_map(pool, netmem))) {
+			page_pool_clear_pp_info(pool, netmem);
 			put_page(netmem_to_page(netmem));
 			continue;
 		}
 
-		page_pool_set_pp_info(pool, netmem);
 		pool->alloc.cache[pool->alloc.count++] = netmem;
 		/* Track how many pages are held 'in-flight' */
 		pool->pages_state_hold_cnt++;
@@ -623,9 +726,11 @@  s32 page_pool_inflight(const struct page_pool *pool, bool strict)
 	return inflight;
 }
 
-void page_pool_set_pp_info(struct page_pool *pool, netmem_ref netmem)
+bool page_pool_set_pp_info(struct page_pool *pool, netmem_ref netmem)
 {
-	netmem_set_pp(netmem, pool);
+	if (unlikely(!page_pool_item_add(pool, netmem)))
+		return false;
+
 	netmem_or_pp_magic(netmem, PP_SIGNATURE);
 
 	/* Ensuring all pages have been split into one fragment initially:
@@ -637,12 +742,14 @@  void page_pool_set_pp_info(struct page_pool *pool, netmem_ref netmem)
 	page_pool_fragment_netmem(netmem, 1);
 	if (pool->has_init_callback)
 		pool->slow.init_callback(netmem, pool->slow.init_arg);
+
+	return true;
 }
 
-void page_pool_clear_pp_info(netmem_ref netmem)
+void page_pool_clear_pp_info(struct page_pool *pool, netmem_ref netmem)
 {
 	netmem_clear_pp_magic(netmem);
-	netmem_set_pp(netmem, NULL);
+	page_pool_item_del(pool, netmem);
 }
 
 static __always_inline void __page_pool_release_page_dma(struct page_pool *pool,
@@ -670,7 +777,7 @@  static __always_inline void __page_pool_release_page_dma(struct page_pool *pool,
  * a regular page (that will eventually be returned to the normal
  * page-allocator via put_page).
  */
-void page_pool_return_page(struct page_pool *pool, netmem_ref netmem)
+void __page_pool_return_page(struct page_pool *pool, netmem_ref netmem)
 {
 	int count;
 	bool put;
@@ -688,7 +795,7 @@  void page_pool_return_page(struct page_pool *pool, netmem_ref netmem)
 	trace_page_pool_state_release(pool, netmem, count);
 
 	if (put) {
-		page_pool_clear_pp_info(netmem);
+		page_pool_clear_pp_info(pool, netmem);
 		put_page(netmem_to_page(netmem));
 	}
 	/* An optimization would be to call __free_pages(page, pool->p.order)
@@ -697,6 +804,27 @@  void page_pool_return_page(struct page_pool *pool, netmem_ref netmem)
 	 */
 }
 
+/* Called from the page_pool_put_*() path, needs to be synchronized with
+ * the page_pool_destroy() path.
+ */
+static void page_pool_return_page(struct page_pool *pool, netmem_ref netmem)
+{
+	unsigned int destroy_cnt;
+
+	rcu_read_lock();
+
+	destroy_cnt = READ_ONCE(pool->destroy_cnt);
+	if (unlikely(destroy_cnt)) {
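+		/* page_pool_destroy() may be unmapping inflight pages; serialize
+		 * with it so the page is not unmapped twice.
+		 */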
+		spin_lock_bh(&pool->destroy_lock);
+		__page_pool_return_page(pool, netmem);
+		spin_unlock_bh(&pool->destroy_lock);
+	} else {
+		__page_pool_return_page(pool, netmem);
+	}
+
+	rcu_read_unlock();
+}
+
 static bool page_pool_recycle_in_ring(struct page_pool *pool, netmem_ref netmem)
 {
 	int ret;
@@ -920,11 +1048,11 @@  static netmem_ref page_pool_drain_frag(struct page_pool *pool,
 		return 0;
 
 	if (__page_pool_page_can_be_recycled(netmem)) {
-		page_pool_dma_sync_for_device(pool, netmem, -1);
+		__page_pool_dma_sync_for_device(pool, netmem, -1);
 		return netmem;
 	}
 
-	page_pool_return_page(pool, netmem);
+	__page_pool_return_page(pool, netmem);
 	return 0;
 }
 
@@ -938,7 +1066,7 @@  static void page_pool_free_frag(struct page_pool *pool)
 	if (!netmem || page_pool_unref_netmem(netmem, drain_count))
 		return;
 
-	page_pool_return_page(pool, netmem);
+	__page_pool_return_page(pool, netmem);
 }
 
 netmem_ref page_pool_alloc_frag_netmem(struct page_pool *pool,
@@ -1022,7 +1150,7 @@  static void __page_pool_destroy(struct page_pool *pool)
 		static_branch_dec(&page_pool_mem_providers);
 	}
 
-	kfree(pool);
+	kvfree(pool);
 }
 
 static void page_pool_empty_alloc_cache_once(struct page_pool *pool)
@@ -1045,7 +1173,7 @@  static void page_pool_empty_alloc_cache_once(struct page_pool *pool)
 static void page_pool_scrub(struct page_pool *pool)
 {
 	page_pool_empty_alloc_cache_once(pool);
-	pool->destroy_cnt++;
+	WRITE_ONCE(pool->destroy_cnt, pool->destroy_cnt + 1);
 
 	/* No more consumers should exist, but producers could still
 	 * be in-flight.
@@ -1133,12 +1261,17 @@  void page_pool_destroy(struct page_pool *pool)
 	if (!page_pool_release(pool))
 		return;
 
+	/* Skip dma sync to avoid calling the DMA API after the driver has been unbound */
+	pool->dma_sync = false;
+
 	/* Paired with rcu lock in page_pool_napi_local() to enable clearing
 	 * of pool->p.napi in page_pool_disable_direct_recycling() is seen
 	 * before returning to driver to free the napi instance.
 	 */
 	synchronize_rcu();
 
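+	/* Unmap any pages still held by the network stack so their DMA
+	 * mappings do not outlive the unbound driver.
+	 */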
+	page_pool_item_uninit(pool);
+
 	page_pool_detached(pool);
 	pool->defer_start = jiffies;
 	pool->defer_warn  = jiffies + DEFER_WARN_INTERVAL;
@@ -1159,7 +1292,7 @@  void page_pool_update_nid(struct page_pool *pool, int new_nid)
 	/* Flush pool alloc cache, as refill will check NUMA node */
 	while (pool->alloc.count) {
 		netmem = pool->alloc.cache[--pool->alloc.count];
-		page_pool_return_page(pool, netmem);
+		__page_pool_return_page(pool, netmem);
 	}
 }
 EXPORT_SYMBOL(page_pool_update_nid);
diff --git a/net/core/page_pool_priv.h b/net/core/page_pool_priv.h
index 57439787b9c2..5d85f862a30a 100644
--- a/net/core/page_pool_priv.h
+++ b/net/core/page_pool_priv.h
@@ -36,16 +36,18 @@  static inline bool page_pool_set_dma_addr(struct page *page, dma_addr_t addr)
 }
 
 #if defined(CONFIG_PAGE_POOL)
-void page_pool_set_pp_info(struct page_pool *pool, netmem_ref netmem);
-void page_pool_clear_pp_info(netmem_ref netmem);
+bool page_pool_set_pp_info(struct page_pool *pool, netmem_ref netmem);
+void page_pool_clear_pp_info(struct page_pool *pool, netmem_ref netmem);
 int page_pool_check_memory_provider(struct net_device *dev,
 				    struct netdev_rx_queue *rxq);
 #else
-static inline void page_pool_set_pp_info(struct page_pool *pool,
+static inline bool page_pool_set_pp_info(struct page_pool *pool,
 					 netmem_ref netmem)
 {
+	return true;
 }
-static inline void page_pool_clear_pp_info(netmem_ref netmem)
+static inline void page_pool_clear_pp_info(struct page_pool *pool,
+					   netmem_ref netmem)
 {
 }
 static inline int page_pool_check_memory_provider(struct net_device *dev,