[RFC,v4,2/3] page_pool: fix IOMMU crash when driver has already unbound

Message ID 20241120103456.396577-3-linyunsheng@huawei.com (mailing list archive)
State RFC
Delegated to: Netdev Maintainers
Series [RFC,v4,1/3] page_pool: fix timing for checking and disabling napi_local

Checks

Context Check Description
netdev/series_format warning Series does not have a cover letter; Target tree name not specified in the subject
netdev/tree_selection success Guessed tree name to be net-next
netdev/ynl success Generated files up to date; no warnings/errors; no diff in generated;
netdev/fixes_present success Fixes tag not required for -next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 3 this patch: 3
netdev/build_tools success Errors and warnings before: 0 (+0) this patch: 0 (+0)
netdev/cc_maintainers success CCed 7 of 7 maintainers
netdev/build_clang success Errors and warnings before: 3 this patch: 3
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success Fixes tag looks correct
netdev/build_allmodconfig_warn success Errors and warnings before: 80 this patch: 80
netdev/checkpatch success total: 0 errors, 0 warnings, 0 checks, 179 lines checked
netdev/build_clang_rust success No Rust files in patch. Skipping build
netdev/kdoc success Errors and warnings before: 3 this patch: 3
netdev/source_inline success Was 0 now: 0

Commit Message

Yunsheng Lin Nov. 20, 2024, 10:34 a.m. UTC
A networking driver with page_pool support may hand over a page that
still has a DMA mapping to the network stack and reuse that page after
the network stack is done with it and passes it back to the page_pool,
to avoid the penalty of DMA mapping/unmapping. With all the caching in
the network stack, some pages may be held in the network stack without
being returned to the page_pool soon enough, and when a VF disable
causes the driver to be unbound, the page_pool does not stop the driver
from doing its unbinding work; instead page_pool uses a workqueue to
periodically check whether some pages have come back from the network
stack, and if so, it does the DMA unmapping related cleanup work.

As mentioned in [1], attempting DMA unmaps after the driver
has already unbound may leak resources or at worst corrupt
memory. Fundamentally, the page pool code cannot allow DMA
mappings to outlive the driver they belong to.

Currently there seem to be at least two cases where pages are not
released fast enough, causing the DMA unmapping to be done after the
driver has already unbound:
1. IPv4 packet defragmentation timeout: this seems to cause delays of
   up to 30 secs.
2. skb_defer_free_flush(): this may cause an infinite delay if
   net_rx_action() is never triggered.

To avoid calling DMA APIs to do DMA unmapping after the driver has
already unbound, and to avoid stalling the unloading of the networking
driver, scan the inflight pages using some MM APIs and do the DMA
unmapping for those pages when page_pool_destroy() is called.

The max time for scanning the inflight pages is about 1.3 sec for a
system with over 300GB of memory, as mentioned in [3], which seems
acceptable as the scanning is only done when there are indeed some
inflight pages and is done in the slow path.

Note, the devmem patchset seems to make the bug harder to fix and may
make backporting harder too. As there is no actual user of devmem yet
and how to fix this for devmem is unclear for now, this patch does not
yet consider fixing the devmem case.

1. https://lore.kernel.org/lkml/8067f204-1380-4d37-8ffd-007fc6f26738@kernel.org/T/
2. https://github.com/netoptimizer/prototype-kernel
3. https://lore.kernel.org/all/17a24d69-7bf0-412c-a32a-b25d82bb4159@kernel.org/
CC: Robin Murphy <robin.murphy@arm.com>
CC: Alexander Duyck <alexander.duyck@gmail.com>
CC: IOMMU <iommu@lists.linux.dev>
Fixes: f71fec47c2df ("page_pool: make sure struct device is stable")
Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
Tested-by: Yonglong Liu <liuyonglong@huawei.com>
---
 include/net/page_pool/types.h |  6 ++-
 net/core/page_pool.c          | 95 ++++++++++++++++++++++++++++++-----
 2 files changed, 87 insertions(+), 14 deletions(-)

Comments

Jesper Dangaard Brouer Nov. 20, 2024, 3:10 p.m. UTC | #1
On 20/11/2024 11.34, Yunsheng Lin wrote:
> A networking driver with page_pool support may hand over a page that
> still has a DMA mapping to the network stack and reuse that page after
> the network stack is done with it and passes it back to the page_pool,
> to avoid the penalty of DMA mapping/unmapping. With all the caching in
> the network stack, some pages may be held in the network stack without
> being returned to the page_pool soon enough, and when a VF disable
> causes the driver to be unbound, the page_pool does not stop the driver
> from doing its unbinding work; instead page_pool uses a workqueue to
> periodically check whether some pages have come back from the network
> stack, and if so, it does the DMA unmapping related cleanup work.
> 
> As mentioned in [1], attempting DMA unmaps after the driver
> has already unbound may leak resources or at worst corrupt
> memory. Fundamentally, the page pool code cannot allow DMA
> mappings to outlive the driver they belong to.
> 
> Currently there seem to be at least two cases where pages are not
> released fast enough, causing the DMA unmapping to be done after the
> driver has already unbound:
> 1. IPv4 packet defragmentation timeout: this seems to cause delays of
>     up to 30 secs.
> 2. skb_defer_free_flush(): this may cause an infinite delay if
>     net_rx_action() is never triggered.
> 
> To avoid calling DMA APIs to do DMA unmapping after the driver has
> already unbound, and to avoid stalling the unloading of the networking
> driver, scan the inflight pages using some MM APIs and do the DMA
> unmapping for those pages when page_pool_destroy() is called.
> 
> The max time for scanning the inflight pages is about 1.3 sec for a
> system with over 300GB of memory, as mentioned in [3], which seems
> acceptable as the scanning is only done when there are indeed some
> inflight pages and is done in the slow path.
> 
> Note, the devmem patchset seems to make the bug harder to fix and may
> make backporting harder too. As there is no actual user of devmem yet
> and how to fix this for devmem is unclear for now, this patch does not
> yet consider fixing the devmem case.
> 
> 1. https://lore.kernel.org/lkml/8067f204-1380-4d37-8ffd-007fc6f26738@kernel.org/T/
> 2. https://github.com/netoptimizer/prototype-kernel
> 3. https://lore.kernel.org/all/17a24d69-7bf0-412c-a32a-b25d82bb4159@kernel.org/
> CC: Robin Murphy <robin.murphy@arm.com>
> CC: Alexander Duyck <alexander.duyck@gmail.com>
> CC: IOMMU <iommu@lists.linux.dev>
> Fixes: f71fec47c2df ("page_pool: make sure struct device is stable")
> Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
> Tested-by: Yonglong Liu <liuyonglong@huawei.com>
> ---
>   include/net/page_pool/types.h |  6 ++-
>   net/core/page_pool.c          | 95 ++++++++++++++++++++++++++++++-----
>   2 files changed, 87 insertions(+), 14 deletions(-)
> 
> diff --git a/include/net/page_pool/types.h b/include/net/page_pool/types.h
> index c022c410abe3..7393fd45bc47 100644
> --- a/include/net/page_pool/types.h
> +++ b/include/net/page_pool/types.h
> @@ -228,7 +228,11 @@ struct page_pool {
>   	 */
>   	refcount_t user_cnt;
>   
> -	u64 destroy_cnt;
> +	/* Lock to avoid doing dma unmapping concurrently when
> +	 * destroy_cnt > 0.
> +	 */
> +	spinlock_t destroy_lock;
> +	unsigned int destroy_cnt;
>   
>   	/* Slow/Control-path information follows */
>   	struct page_pool_params_slow slow;
> diff --git a/net/core/page_pool.c b/net/core/page_pool.c
> index b3dae671eb26..33a314abbba4 100644
> --- a/net/core/page_pool.c
> +++ b/net/core/page_pool.c
> @@ -272,9 +272,6 @@ static int page_pool_init(struct page_pool *pool,
>   	/* Driver calling page_pool_create() also call page_pool_destroy() */
>   	refcount_set(&pool->user_cnt, 1);
>   
> -	if (pool->dma_map)
> -		get_device(pool->p.dev);
> -
>   	if (pool->slow.flags & PP_FLAG_ALLOW_UNREADABLE_NETMEM) {
>   		/* We rely on rtnl_lock()ing to make sure netdev_rx_queue
>   		 * configuration doesn't change while we're initializing
> @@ -312,9 +309,6 @@ static void page_pool_uninit(struct page_pool *pool)
>   {
>   	ptr_ring_cleanup(&pool->ring, NULL);
>   
> -	if (pool->dma_map)
> -		put_device(pool->p.dev);
> -
>   #ifdef CONFIG_PAGE_POOL_STATS
>   	if (!pool->system)
>   		free_percpu(pool->recycle_stats);
> @@ -365,7 +359,7 @@ struct page_pool *page_pool_create(const struct page_pool_params *params)
>   }
>   EXPORT_SYMBOL(page_pool_create);
>   
> -static void page_pool_return_page(struct page_pool *pool, netmem_ref netmem);
> +static void __page_pool_return_page(struct page_pool *pool, netmem_ref netmem);
>   
>   static noinline netmem_ref page_pool_refill_alloc_cache(struct page_pool *pool)
>   {
> @@ -403,7 +397,7 @@ static noinline netmem_ref page_pool_refill_alloc_cache(struct page_pool *pool)
>   			 * (2) break out to fallthrough to alloc_pages_node.
>   			 * This limit stress on page buddy alloactor.
>   			 */
> -			page_pool_return_page(pool, netmem);
> +			__page_pool_return_page(pool, netmem);
>   			alloc_stat_inc(pool, waive);
>   			netmem = 0;
>   			break;
> @@ -670,7 +664,7 @@ static __always_inline void __page_pool_release_page_dma(struct page_pool *pool,
>    * a regular page (that will eventually be returned to the normal
>    * page-allocator via put_page).
>    */
> -void page_pool_return_page(struct page_pool *pool, netmem_ref netmem)
> +void __page_pool_return_page(struct page_pool *pool, netmem_ref netmem)
>   {
>   	int count;
>   	bool put;
> @@ -697,6 +691,27 @@ void page_pool_return_page(struct page_pool *pool, netmem_ref netmem)
>   	 */
>   }
>   
> +/* Called from the page_pool_put_*() path, needs to be synchronized with
> + * the page_pool_destroy() path.
> + */
> +static void page_pool_return_page(struct page_pool *pool, netmem_ref netmem)
> +{
> +	unsigned int destroy_cnt;
> +
> +	rcu_read_lock();
> +
> +	destroy_cnt = READ_ONCE(pool->destroy_cnt);
> +	if (unlikely(destroy_cnt)) {
> +		spin_lock_bh(&pool->destroy_lock);
> +		__page_pool_return_page(pool, netmem);
> +		spin_unlock_bh(&pool->destroy_lock);
> +	} else {
> +		__page_pool_return_page(pool, netmem);
> +	}
> +
> +	rcu_read_unlock();
> +}
> +
>   static bool page_pool_recycle_in_ring(struct page_pool *pool, netmem_ref netmem)
>   {
>   	int ret;
> @@ -924,7 +939,7 @@ static netmem_ref page_pool_drain_frag(struct page_pool *pool,
>   		return netmem;
>   	}
>   
> -	page_pool_return_page(pool, netmem);
> +	__page_pool_return_page(pool, netmem);
>   	return 0;
>   }
>   
> @@ -938,7 +953,7 @@ static void page_pool_free_frag(struct page_pool *pool)
>   	if (!netmem || page_pool_unref_netmem(netmem, drain_count))
>   		return;
>   
> -	page_pool_return_page(pool, netmem);
> +	__page_pool_return_page(pool, netmem);
>   }
>   
>   netmem_ref page_pool_alloc_frag_netmem(struct page_pool *pool,
> @@ -1045,7 +1060,7 @@ static void page_pool_empty_alloc_cache_once(struct page_pool *pool)
>   static void page_pool_scrub(struct page_pool *pool)
>   {
>   	page_pool_empty_alloc_cache_once(pool);
> -	pool->destroy_cnt++;
> +	WRITE_ONCE(pool->destroy_cnt, pool->destroy_cnt + 1);
>   
>   	/* No more consumers should exist, but producers could still
>   	 * be in-flight.
> @@ -1119,6 +1134,58 @@ void page_pool_disable_direct_recycling(struct page_pool *pool)
>   }
>   EXPORT_SYMBOL(page_pool_disable_direct_recycling);
>   
> +static void page_pool_inflight_unmap(struct page_pool *pool)
> +{
> +	unsigned int unmapped = 0;
> +	struct zone *zone;
> +	int inflight;
> +
> +	if (!pool->dma_map || pool->mp_priv)
> +		return;
> +
> +	get_online_mems();
> +	spin_lock_bh(&pool->destroy_lock);
> +
> +	inflight = page_pool_inflight(pool, false);
> +	for_each_populated_zone(zone) {
> +		unsigned long end_pfn = zone_end_pfn(zone);
> +		unsigned long pfn;
> +
> +		for (pfn = zone->zone_start_pfn; pfn < end_pfn; pfn++) {
> +			struct page *page = pfn_to_online_page(pfn);
> +
> +			if (!page || !page_count(page) ||
> +			    (page->pp_magic & ~0x3UL) != PP_SIGNATURE ||
> +			    page->pp != pool)
> +				continue;
> +
> +			dma_unmap_page_attrs(pool->p.dev,
> +					     page_pool_get_dma_addr(page),
> +					     PAGE_SIZE << pool->p.order,
> +					     pool->p.dma_dir,
> +					     DMA_ATTR_SKIP_CPU_SYNC |
> +					     DMA_ATTR_WEAK_ORDERING);
> +			page_pool_set_dma_addr(page, 0);
> +

I feel this belongs in a helper function call.

Previously we had a function called: page_pool_release_page().

This was used to convert the page into a normal page again, releasing
page_pool dependencies.  It was used when packets transitioned into the
netstack, but it was removed when we decided it was safe to let the
netstack return pp pages.

Removed in commits:
  - 535b9c61bdef ("net: page_pool: hide page_pool_release_page()")
  - 07e0c7d3179d ("net: page_pool: merge page_pool_release_page() with page_pool_return_page()")
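
E.g. something along these lines (just a sketch, the helper name is a
placeholder):

static void page_pool_release_page_dma_scan(struct page_pool *pool,
					    struct page *page)
{
	/* same unmap sequence as the open-coded lines above, factored out */
	dma_unmap_page_attrs(pool->p.dev, page_pool_get_dma_addr(page),
			     PAGE_SIZE << pool->p.order, pool->p.dma_dir,
			     DMA_ATTR_SKIP_CPU_SYNC |
			     DMA_ATTR_WEAK_ORDERING);
	page_pool_set_dma_addr(page, 0);
}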


> +			unmapped++;
> +
> +			/* Skip scanning all pages when debug is disabled */
> +			if (!IS_ENABLED(CONFIG_DEBUG_NET) &&
> +			    inflight == unmapped)
> +				goto out;
> +		}
> +	}
> +
> +out:
> +	WARN_ONCE(page_pool_inflight(pool, false) != unmapped,
> +		  "page_pool(%u): unmapped(%u) != inflight pages(%d)\n",
> +		  pool->user.id, unmapped, inflight);
> +
> +	pool->dma_map = false;
> +	spin_unlock_bh(&pool->destroy_lock);
> +	put_online_mems();
> +}
> +
>   void page_pool_destroy(struct page_pool *pool)
>   {
>   	if (!pool)
> @@ -1139,6 +1206,8 @@ void page_pool_destroy(struct page_pool *pool)
>   	 */
>   	synchronize_rcu();
>   
> +	page_pool_inflight_unmap(pool);
> +

Reaching here means we have detected in-flight packets/pages.

In "page_pool_inflight_unmap" we scan and find those in-flight pages to
DMA unmap them. Then below we wait for these in-flight pages again.
Why don't we just "release" (page_pool_release_page) those in-flight
pages from belonging to the page_pool when we find them during scanning?

If we do so, we can hopefully remove the periodic checking code below.
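
Rough sketch of the idea (ignoring how the outstanding pp_ref_count gets
turned into a normal page refcount, which is the tricky part):

	/* in the scan loop, right after the dma_unmap above
	 * (illustrative only):
	 */
	page->pp_magic = 0;	/* page no longer carries PP_SIGNATURE */
	page->pp = NULL;
	/* ... plus converting the outstanding pp_ref_count into a plain
	 * page refcount
	 */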

>   	page_pool_detached(pool);
>   	pool->defer_start = jiffies;
>   	pool->defer_warn  = jiffies + DEFER_WARN_INTERVAL;
> @@ -1159,7 +1228,7 @@ void page_pool_update_nid(struct page_pool *pool, int new_nid)
>   	/* Flush pool alloc cache, as refill will check NUMA node */
>   	while (pool->alloc.count) {
>   		netmem = pool->alloc.cache[--pool->alloc.count];
> -		page_pool_return_page(pool, netmem);
> +		__page_pool_return_page(pool, netmem);
>   	}
>   }
>   EXPORT_SYMBOL(page_pool_update_nid);

Thanks for continuing to work on this :-)

--Jesper
Yunsheng Lin Nov. 21, 2024, 8:03 a.m. UTC | #2
On 2024/11/20 23:10, Jesper Dangaard Brouer wrote:

...

>>         /* No more consumers should exist, but producers could still
>>        * be in-flight.
>> @@ -1119,6 +1134,58 @@ void page_pool_disable_direct_recycling(struct page_pool *pool)
>>   }
>>   EXPORT_SYMBOL(page_pool_disable_direct_recycling);
>>   +static void page_pool_inflight_unmap(struct page_pool *pool)
>> +{
>> +    unsigned int unmapped = 0;
>> +    struct zone *zone;
>> +    int inflight;
>> +
>> +    if (!pool->dma_map || pool->mp_priv)
>> +        return;
>> +
>> +    get_online_mems();
>> +    spin_lock_bh(&pool->destroy_lock);
>> +
>> +    inflight = page_pool_inflight(pool, false);
>> +    for_each_populated_zone(zone) {
>> +        unsigned long end_pfn = zone_end_pfn(zone);
>> +        unsigned long pfn;
>> +
>> +        for (pfn = zone->zone_start_pfn; pfn < end_pfn; pfn++) {
>> +            struct page *page = pfn_to_online_page(pfn);
>> +
>> +            if (!page || !page_count(page) ||
>> +                (page->pp_magic & ~0x3UL) != PP_SIGNATURE ||
>> +                page->pp != pool)
>> +                continue;
>> +
>> +            dma_unmap_page_attrs(pool->p.dev,
>> +                         page_pool_get_dma_addr(page),
>> +                         PAGE_SIZE << pool->p.order,
>> +                         pool->p.dma_dir,
>> +                         DMA_ATTR_SKIP_CPU_SYNC |
>> +                         DMA_ATTR_WEAK_ORDERING);
>> +            page_pool_set_dma_addr(page, 0);
>> +
> 
> I feel this belongs in a helper function call.
> 
> Previously we had a function called: page_pool_release_page().

Previously I was going to reuse __page_pool_release_page_dma() for the above,
but it would duplicate the pool->dma_map check for each page, as
page_pool_inflight_unmap() already checks that before calling
__page_pool_release_page_dma().
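
For reference, reusing it would make the scan loop body look roughly like
the below, at the cost of that redundant per-page check (assuming
page_to_netmem() is the right conversion here):

			/* instead of the open-coded dma_unmap_page_attrs() +
			 * page_pool_set_dma_addr() pair
			 */
			__page_pool_release_page_dma(pool, page_to_netmem(page));
			unmapped++;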

> 
> This was used to convert the page into a normal page again, releasing
> page_pool dependencies.  It was used when packet transitioned into the
> netstack, but it was removed when we decided it was safe to let netstack
> return pp pages.
> 
> Removed in commits:
>  - 535b9c61bdef ("net: page_pool: hide page_pool_release_page()")
>  - 07e0c7d3179d ("net: page_pool: merge page_pool_release_page() with page_pool_return_page()")
> 
> 
>> +            unmapped++;
>> +
>> +            /* Skip scanning all pages when debug is disabled */
>> +            if (!IS_ENABLED(CONFIG_DEBUG_NET) &&
>> +                inflight == unmapped)
>> +                goto out;
>> +        }
>> +    }
>> +
>> +out:
>> +    WARN_ONCE(page_pool_inflight(pool, false) != unmapped,
>> +          "page_pool(%u): unmapped(%u) != inflight pages(%d)\n",
>> +          pool->user.id, unmapped, inflight);
>> +
>> +    pool->dma_map = false;
>> +    spin_unlock_bh(&pool->destroy_lock);
>> +    put_online_mems();
>> +}
>> +
>>   void page_pool_destroy(struct page_pool *pool)
>>   {
>>       if (!pool)
>> @@ -1139,6 +1206,8 @@ void page_pool_destroy(struct page_pool *pool)
>>        */
>>       synchronize_rcu();
>>   +    page_pool_inflight_unmap(pool);
>> +
> 
> Reaching here means we have detected in-flight packets/pages.
> 
> In "page_pool_inflight_unmap" we scan and find those in-flight pages to
> DMA unmap them. Then below we wait for these in-flight pages again.
> Why don't we just "release" (page_pool_release_page) those in-flight
> pages from belonging to the page_pool, when we found them during scanning?
> 
> If doing so, we can hopefully remove the periodic checking code below.

I thought about that too, but it means more complicated work than just
calling page_pool_release_page(), as page->pp_ref_count needs to be
converted into page->_refcount for the above to work, and it seems hard
to do that with minimal performance degradation while racing against
page_pool_put_page() being called concurrently.

> 
>>       page_pool_detached(pool);
>>       pool->defer_start = jiffies;
>>       pool->defer_warn  = jiffies + DEFER_WARN_INTERVAL;
>> @@ -1159,7 +1228,7 @@ void page_pool_update_nid(struct page_pool *pool, int new_nid)
>>       /* Flush pool alloc cache, as refill will check NUMA node */
>>       while (pool->alloc.count) {
>>           netmem = pool->alloc.cache[--pool->alloc.count];
>> -        page_pool_return_page(pool, netmem);
>> +        __page_pool_return_page(pool, netmem);
>>       }
>>   }
>>   EXPORT_SYMBOL(page_pool_update_nid);
> 
> Thanks for continuing to work on this :-)

I am not sure how scalable the scanning is going to be as memory sizes become
bigger, which is one of the reasons I am posting this version as RFC.

From some quick searching, it seems there are servers with a max RAM capacity
of 12.3TB, which means the scanning might take up to about 10 secs on those
systems. The spin_lock is used to avoid concurrency as the page_pool_put_page()
API might be called from softirq context, which might mean spinning for up to
12 secs in softirq context.

And it seems hard to call cond_resched() when the scanning and unmapping take
a lot of time, as page_pool_put_page() might be called concurrently whenever
pool->destroy_lock is released, which means page_pool_get_dma_addr() would
need to be checked for each page to decide whether the unmapping has already
been done.

Also, I am not sure it is appropriate to stall the driver unbind for up to
10 secs on those large memory systems.

https://www.broadberry.com/12tb-ram-supermicro-servers?srsltid=AfmBOorCPCZQBSv91mOGH3WTg9Cq0MhksnVYL_eXxOHtHJyuYzjyvwgH
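
For reference, periodically dropping the lock would need something roughly
like the below in the scan loop (illustrative only, not what this patch does):

			/* recheck, as the page may have been returned and
			 * unmapped via page_pool_put_page() while
			 * pool->destroy_lock was temporarily released
			 */
			if (!page_pool_get_dma_addr(page))
				continue;

			dma_unmap_page_attrs(pool->p.dev,
					     page_pool_get_dma_addr(page),
					     PAGE_SIZE << pool->p.order,
					     pool->p.dma_dir,
					     DMA_ATTR_SKIP_CPU_SYNC |
					     DMA_ATTR_WEAK_ORDERING);
			page_pool_set_dma_addr(page, 0);
			unmapped++;

			if (need_resched()) {
				spin_unlock_bh(&pool->destroy_lock);
				cond_resched();
				spin_lock_bh(&pool->destroy_lock);
			}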

> 
> --Jesper

Patch

diff --git a/include/net/page_pool/types.h b/include/net/page_pool/types.h
index c022c410abe3..7393fd45bc47 100644
--- a/include/net/page_pool/types.h
+++ b/include/net/page_pool/types.h
@@ -228,7 +228,11 @@  struct page_pool {
 	 */
 	refcount_t user_cnt;
 
-	u64 destroy_cnt;
+	/* Lock to avoid doing dma unmapping concurrently when
+	 * destroy_cnt > 0.
+	 */
+	spinlock_t destroy_lock;
+	unsigned int destroy_cnt;
 
 	/* Slow/Control-path information follows */
 	struct page_pool_params_slow slow;
diff --git a/net/core/page_pool.c b/net/core/page_pool.c
index b3dae671eb26..33a314abbba4 100644
--- a/net/core/page_pool.c
+++ b/net/core/page_pool.c
@@ -272,9 +272,6 @@  static int page_pool_init(struct page_pool *pool,
 	/* Driver calling page_pool_create() also call page_pool_destroy() */
 	refcount_set(&pool->user_cnt, 1);
 
-	if (pool->dma_map)
-		get_device(pool->p.dev);
-
 	if (pool->slow.flags & PP_FLAG_ALLOW_UNREADABLE_NETMEM) {
 		/* We rely on rtnl_lock()ing to make sure netdev_rx_queue
 		 * configuration doesn't change while we're initializing
@@ -312,9 +309,6 @@  static void page_pool_uninit(struct page_pool *pool)
 {
 	ptr_ring_cleanup(&pool->ring, NULL);
 
-	if (pool->dma_map)
-		put_device(pool->p.dev);
-
 #ifdef CONFIG_PAGE_POOL_STATS
 	if (!pool->system)
 		free_percpu(pool->recycle_stats);
@@ -365,7 +359,7 @@  struct page_pool *page_pool_create(const struct page_pool_params *params)
 }
 EXPORT_SYMBOL(page_pool_create);
 
-static void page_pool_return_page(struct page_pool *pool, netmem_ref netmem);
+static void __page_pool_return_page(struct page_pool *pool, netmem_ref netmem);
 
 static noinline netmem_ref page_pool_refill_alloc_cache(struct page_pool *pool)
 {
@@ -403,7 +397,7 @@  static noinline netmem_ref page_pool_refill_alloc_cache(struct page_pool *pool)
 			 * (2) break out to fallthrough to alloc_pages_node.
 			 * This limit stress on page buddy alloactor.
 			 */
-			page_pool_return_page(pool, netmem);
+			__page_pool_return_page(pool, netmem);
 			alloc_stat_inc(pool, waive);
 			netmem = 0;
 			break;
@@ -670,7 +664,7 @@  static __always_inline void __page_pool_release_page_dma(struct page_pool *pool,
  * a regular page (that will eventually be returned to the normal
  * page-allocator via put_page).
  */
-void page_pool_return_page(struct page_pool *pool, netmem_ref netmem)
+void __page_pool_return_page(struct page_pool *pool, netmem_ref netmem)
 {
 	int count;
 	bool put;
@@ -697,6 +691,27 @@  void page_pool_return_page(struct page_pool *pool, netmem_ref netmem)
 	 */
 }
 
+/* Called from the page_pool_put_*() path, needs to be synchronized with
+ * the page_pool_destroy() path.
+ */
+static void page_pool_return_page(struct page_pool *pool, netmem_ref netmem)
+{
+	unsigned int destroy_cnt;
+
+	rcu_read_lock();
+
+	destroy_cnt = READ_ONCE(pool->destroy_cnt);
+	if (unlikely(destroy_cnt)) {
+		spin_lock_bh(&pool->destroy_lock);
+		__page_pool_return_page(pool, netmem);
+		spin_unlock_bh(&pool->destroy_lock);
+	} else {
+		__page_pool_return_page(pool, netmem);
+	}
+
+	rcu_read_unlock();
+}
+
 static bool page_pool_recycle_in_ring(struct page_pool *pool, netmem_ref netmem)
 {
 	int ret;
@@ -924,7 +939,7 @@  static netmem_ref page_pool_drain_frag(struct page_pool *pool,
 		return netmem;
 	}
 
-	page_pool_return_page(pool, netmem);
+	__page_pool_return_page(pool, netmem);
 	return 0;
 }
 
@@ -938,7 +953,7 @@  static void page_pool_free_frag(struct page_pool *pool)
 	if (!netmem || page_pool_unref_netmem(netmem, drain_count))
 		return;
 
-	page_pool_return_page(pool, netmem);
+	__page_pool_return_page(pool, netmem);
 }
 
 netmem_ref page_pool_alloc_frag_netmem(struct page_pool *pool,
@@ -1045,7 +1060,7 @@  static void page_pool_empty_alloc_cache_once(struct page_pool *pool)
 static void page_pool_scrub(struct page_pool *pool)
 {
 	page_pool_empty_alloc_cache_once(pool);
-	pool->destroy_cnt++;
+	WRITE_ONCE(pool->destroy_cnt, pool->destroy_cnt + 1);
 
 	/* No more consumers should exist, but producers could still
 	 * be in-flight.
@@ -1119,6 +1134,58 @@  void page_pool_disable_direct_recycling(struct page_pool *pool)
 }
 EXPORT_SYMBOL(page_pool_disable_direct_recycling);
 
+static void page_pool_inflight_unmap(struct page_pool *pool)
+{
+	unsigned int unmapped = 0;
+	struct zone *zone;
+	int inflight;
+
+	if (!pool->dma_map || pool->mp_priv)
+		return;
+
+	get_online_mems();
+	spin_lock_bh(&pool->destroy_lock);
+
+	inflight = page_pool_inflight(pool, false);
+	for_each_populated_zone(zone) {
+		unsigned long end_pfn = zone_end_pfn(zone);
+		unsigned long pfn;
+
+		for (pfn = zone->zone_start_pfn; pfn < end_pfn; pfn++) {
+			struct page *page = pfn_to_online_page(pfn);
+
+			if (!page || !page_count(page) ||
+			    (page->pp_magic & ~0x3UL) != PP_SIGNATURE ||
+			    page->pp != pool)
+				continue;
+
+			dma_unmap_page_attrs(pool->p.dev,
+					     page_pool_get_dma_addr(page),
+					     PAGE_SIZE << pool->p.order,
+					     pool->p.dma_dir,
+					     DMA_ATTR_SKIP_CPU_SYNC |
+					     DMA_ATTR_WEAK_ORDERING);
+			page_pool_set_dma_addr(page, 0);
+
+			unmapped++;
+
+			/* Skip scanning all pages when debug is disabled */
+			if (!IS_ENABLED(CONFIG_DEBUG_NET) &&
+			    inflight == unmapped)
+				goto out;
+		}
+	}
+
+out:
+	WARN_ONCE(page_pool_inflight(pool, false) != unmapped,
+		  "page_pool(%u): unmapped(%u) != inflight pages(%d)\n",
+		  pool->user.id, unmapped, inflight);
+
+	pool->dma_map = false;
+	spin_unlock_bh(&pool->destroy_lock);
+	put_online_mems();
+}
+
 void page_pool_destroy(struct page_pool *pool)
 {
 	if (!pool)
@@ -1139,6 +1206,8 @@  void page_pool_destroy(struct page_pool *pool)
 	 */
 	synchronize_rcu();
 
+	page_pool_inflight_unmap(pool);
+
 	page_pool_detached(pool);
 	pool->defer_start = jiffies;
 	pool->defer_warn  = jiffies + DEFER_WARN_INTERVAL;
@@ -1159,7 +1228,7 @@  void page_pool_update_nid(struct page_pool *pool, int new_nid)
 	/* Flush pool alloc cache, as refill will check NUMA node */
 	while (pool->alloc.count) {
 		netmem = pool->alloc.cache[--pool->alloc.count];
-		page_pool_return_page(pool, netmem);
+		__page_pool_return_page(pool, netmem);
 	}
 }
 EXPORT_SYMBOL(page_pool_update_nid);