Message ID | 20240925075707.3970187-2-linyunsheng@huawei.com (mailing list archive)
---|---
State | Changes Requested
Delegated to: | Netdev Maintainers
Series | fix two bugs related to page_pool
On Wed, Sep 25, 2024 at 03:57:06PM +0800, Yunsheng Lin wrote:
> page_pool page may be freed from skb_defer_free_flush() to
> softirq context, it may cause concurrent access problem for
> pool->alloc cache due to the below time window, as below,
> both CPU0 and CPU1 may access the pool->alloc cache
> concurrently in page_pool_empty_alloc_cache_once() and
> page_pool_recycle_in_cache():
>
>            CPU 0                           CPU1
>      page_pool_destroy()          skb_defer_free_flush()
>             .                               .
>             .                   page_pool_put_unrefed_page()
>             .                               .
>             .               allow_direct = page_pool_napi_local()
>             .                               .
> page_pool_disable_direct_recycling()        .
>             .                               .
> page_pool_empty_alloc_cache_once()  page_pool_recycle_in_cache()
>
> Use rcu mechanism to avoid the above concurrent access problem.
>
> Note, the above was found during code reviewing on how to fix
> the problem in [1].
>
> 1. https://lore.kernel.org/lkml/8067f204-1380-4d37-8ffd-007fc6f26738@kernel.org/T/
>
> Fixes: dd64b232deb8 ("page_pool: unlink from napi during destroy")
> Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
> CC: Alexander Lobakin <aleksander.lobakin@intel.com>

Sorry for the noise, but I hit an assert in page_pool_unref_netmem
and I am trying to figure out if it is related to what you all are
debugging? I thought it might be, but if not, my apologies.

Just in case it is, I've put the backtrace on github [1]. I
triggered this while testing an RFC [2] I've been working on. Please
note, the RFC posted publicly does not currently apply cleanly to
net-next and has some bugs I've fixed in my v4. I had planned to
send the v4 early next week and mention the page pool issue I am
hitting.

After triggering the assert in [1], I tried applying the patches of
this series and retesting the RFC v4 I have queued locally. When I
did that, I hit a new assertion in page_pool_destroy [3].

There are a few possibilities:
1. I am hitting the same issue you are hitting
2. I am hitting a different issue caused by a bug I introduced
3. I am hitting a different page pool issue entirely

In cases 2 and 3, my apologies for the noise.

In case of 1: If you think I am hitting the same issue as you are
trying to solve, I can reliably reproduce this with my RFC v4 and
would be happy to test any patches meant to fix the issue.

[1]: https://gist.githubusercontent.com/jdamato-fsly/eb628c8bf4e4d1c8158441644cdb7e52/raw/96dcf422303d9e64b5060f2fb0f1d71e04ab048e/warning1.txt
[2]: https://lore.kernel.org/all/20240912100738.16567-1-jdamato@fastly.com/#r
[3]: https://gist.githubusercontent.com/jdamato-fsly/eb628c8bf4e4d1c8158441644cdb7e52/raw/96dcf422303d9e64b5060f2fb0f1d71e04ab048e/warning2.txt
On 2024/9/27 4:06, Joe Damato wrote:
> On Wed, Sep 25, 2024 at 03:57:06PM +0800, Yunsheng Lin wrote:
>> page_pool page may be freed from skb_defer_free_flush() to
>> softirq context, it may cause concurrent access problem for
>> pool->alloc cache due to the below time window, ...
>
> [...]
>
> Sorry for the noise, but I hit an assert in page_pool_unref_netmem
> and I am trying to figure out if it is related to what you all are
> debugging? I thought it might be, but if not, my apologies.
>
> Just in case it is, I've put the backtrace on github [1]. I
> triggered this while testing an RFC [2] I've been working on.
>
> [...]
>
> [1]: https://gist.githubusercontent.com/jdamato-fsly/eb628c8bf4e4d1c8158441644cdb7e52/raw/96dcf422303d9e64b5060f2fb0f1d71e04ab048e/warning1.txt

It seems there may be some concurrent access of page->pp_ref_count /
page_pool->frag_users, or imbalanced incrementing and decrementing of
page->pp_ref_count.

For the concurrent access part, you may refer to the below patch for
debugging, but it seems the mlx5 driver doesn't use the frag API directly,
as far as I recall, so you can't use it directly:

https://lore.kernel.org/all/20240903122208.3379182-1-linyunsheng@huawei.com/

> [2]: https://lore.kernel.org/all/20240912100738.16567-1-jdamato@fastly.com/#r
> [3]: https://gist.githubusercontent.com/jdamato-fsly/eb628c8bf4e4d1c8158441644cdb7e52/raw/96dcf422303d9e64b5060f2fb0f1d71e04ab048e/warning2.txt

The warning is only added to see if there is any inflight page that needs
dma unmapping when page_pool_destroy() is called; you can ignore it or
remove that WARN_ONCE() in page_pool_item_uninit() when testing.
On 9/25/24 09:57, Yunsheng Lin wrote:
> page_pool page may be freed from skb_defer_free_flush() to
> softirq context, it may cause concurrent access problem for
> pool->alloc cache due to the below time window, ...

[...]

> diff --git a/net/core/page_pool.c b/net/core/page_pool.c
> index a813d30d2135..bec6e717cd22 100644
> --- a/net/core/page_pool.c
> +++ b/net/core/page_pool.c
> @@ -818,8 +818,17 @@ static bool page_pool_napi_local(const struct page_pool *pool)
>  void page_pool_put_unrefed_netmem(struct page_pool *pool, netmem_ref netmem,
>                                    unsigned int dma_sync_size, bool allow_direct)
>  {
> -        if (!allow_direct)
> +        bool allow_direct_orig = allow_direct;
> +
> +        /* page_pool_put_unrefed_netmem() is not supposed to be called with
> +         * allow_direct being true after page_pool_destroy() is called, so
> +         * the allow_direct being true case doesn't need synchronization.
> +         */
> +        DEBUG_NET_WARN_ON_ONCE(allow_direct && pool->destroy_cnt);
> +        if (!allow_direct_orig) {
> +                rcu_read_lock();
>                  allow_direct = page_pool_napi_local(pool);
> +        }
>
>          netmem =
>                  __page_pool_put_page(pool, netmem, dma_sync_size, allow_direct);
> @@ -828,6 +837,9 @@ void page_pool_put_unrefed_netmem(struct page_pool *pool, netmem_ref netmem,
>                  recycle_stat_inc(pool, ring_full);
>                  page_pool_return_page(pool, netmem);
>          }
> +
> +        if (!allow_direct_orig)
> +                rcu_read_unlock();

What about always acquiring the rcu lock? Would that impact performance
negatively?

If not, I think it's preferable, as it would make the static checker happy.

>  }
>  EXPORT_SYMBOL(page_pool_put_unrefed_netmem);

[...]

> @@ -1121,6 +1140,12 @@ void page_pool_destroy(struct page_pool *pool)
>                  return;
>
>          page_pool_disable_direct_recycling(pool);
> +
> +        /* Wait for the freeing side to see the disabled direct recycling setting
> +         * to avoid concurrent access to the pool->alloc cache.
> +         */
> +        synchronize_rcu();

When turning on/off a device with a lot of queues, the above could
introduce a lot of long waits under the RTNL lock, right?

What about moving the trailing part of this function into a separate
helper and using call_rcu() instead?

Thanks!

Paolo

> +
>          page_pool_free_frag(pool);
>
>          if (!page_pool_release(pool))
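For concreteness, a minimal sketch of what Paolo's call_rcu() suggestion might look like is below. It assumes an rcu_head field were added to struct page_pool, which the posted patch does not do, and it only illustrates the shape of the refactor, not a tested implementation:

/* Sketch only: assumes "struct rcu_head rcu" is added to struct page_pool. */
static void page_pool_destroy_rcu(struct rcu_head *head)
{
        struct page_pool *pool = container_of(head, struct page_pool, rcu);

        /* Runs after a grace period, so every concurrent put path has
         * already observed the disabled direct recycling.
         */
        page_pool_free_frag(pool);

        if (!page_pool_release(pool))
                return;

        /* ... the rest of the original page_pool_destroy() tail ... */
}

void page_pool_destroy(struct page_pool *pool)
{
        /* ... existing early-return checks ... */

        page_pool_disable_direct_recycling(pool);

        /* Defer the tail instead of blocking under the RTNL lock. */
        call_rcu(&pool->rcu, page_pool_destroy_rcu);
}

As the follow-up reply notes, this shape works for this patch but not for patch 2, since the dma unmapping there must not run after the driver has unbound.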
On 10/1/2024 7:30 PM, Paolo Abeni wrote:

...

>> @@ -828,6 +837,9 @@ void page_pool_put_unrefed_netmem(struct page_pool *pool, netmem_ref netmem,
>>                  recycle_stat_inc(pool, ring_full);
>>                  page_pool_return_page(pool, netmem);
>>          }
>> +
>> +        if (!allow_direct_orig)
>> +                rcu_read_unlock();
>
> What about always acquiring the rcu lock? Would that impact performance
> negatively?
>
> If not, I think it's preferable, as it would make the static checker happy.

As mentioned in the cover letter, the overhead is about ~2ns.

I guess it is the 'if' check before rcu_read_unlock() that the static
checker is not happy about; there is also an 'if' check before the
'destroy_lock' introduced in patch 2. Maybe '__cond_acquires' can be used
to make the static checker happy?

>
>>  }
>>  EXPORT_SYMBOL(page_pool_put_unrefed_netmem);
>
> [...]
>
>> @@ -1121,6 +1140,12 @@ void page_pool_destroy(struct page_pool *pool)
>>                  return;
>>
>>          page_pool_disable_direct_recycling(pool);
>> +
>> +        /* Wait for the freeing side to see the disabled direct recycling setting
>> +         * to avoid concurrent access to the pool->alloc cache.
>> +         */
>> +        synchronize_rcu();
>
> When turning on/off a device with a lot of queues, the above could
> introduce a lot of long waits under the RTNL lock, right?
>
> What about moving the trailing part of this function into a separate
> helper and using call_rcu() instead?

For this patch, yes, it can be done. But patch 2 also relies on the rcu
lock in this patch to ensure that the free side is synchronized with the
destroy path, and the dma unmapping done for the inflight pages in
page_pool_item_uninit() can not be done in call_rcu(), as the driver might
have unbound by the time the RCU callback is called, which might defeat
the purpose of patch 2.

Maybe an optimization here is to only call synchronize_rcu() when there
are some inflight pages and pool->dma_map is set.
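As a rough illustration of that last idea, the destroy path might look something like the sketch below; page_pool_inflight_pages() is a hypothetical stand-in for whatever inflight accounting the series ends up using, not an existing helper:

void page_pool_destroy(struct page_pool *pool)
{
        /* ... existing early-return checks ... */

        page_pool_disable_direct_recycling(pool);

        /* Sketch of the optimization mentioned above: only pay for a
         * grace period when the destroy path actually has inflight pages
         * whose dma mappings must be torn down.
         * page_pool_inflight_pages() is hypothetical, not an existing API.
         */
        if (pool->dma_map && page_pool_inflight_pages(pool))
                synchronize_rcu();

        page_pool_free_frag(pool);

        /* ... */
}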
On Wed, 25 Sep 2024 15:57:06 +0800 Yunsheng Lin wrote:
> Use rcu mechanism to avoid the above concurrent access problem.
>
> Note, the above was found during code reviewing on how to fix
> the problem in [1].

The driver must make sure NAPI cannot be running while
page_pool_destroy() is called. There's even a WARN() checking this..
if you know what to look for.
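For readers following along, the ordering Jakub is describing is roughly the one sketched below for a driver's teardown path; the my_rxq structure and my_driver_free_rx_buffers() are made-up placeholders, and the point is only that NAPI is stopped before page_pool_destroy() runs:

/* Hypothetical driver teardown path, sketched to show the expected ordering. */
static void my_driver_close_rxq(struct my_rxq *rxq)
{
        /* 1. Make sure the NAPI poll loop can no longer run, so nothing
         *    can recycle pages directly into pool->alloc anymore.
         */
        napi_disable(&rxq->napi);
        netif_napi_del(&rxq->napi);

        /* 2. Return any pages still held by the RX ring. */
        my_driver_free_rx_buffers(rxq);

        /* 3. Only now destroy the page pool: no NAPI context can race
         *    with page_pool_destroy() draining pool->alloc.
         */
        page_pool_destroy(rxq->page_pool);
}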
On 2024/10/9 8:40, Jakub Kicinski wrote:
> On Wed, 25 Sep 2024 15:57:06 +0800 Yunsheng Lin wrote:
>> Use rcu mechanism to avoid the above concurrent access problem.
>>
>> Note, the above was found during code reviewing on how to fix
>> the problem in [1].
>
> The driver must make sure NAPI cannot be running while
> page_pool_destroy() is called. There's even a WARN() checking this..
> if you know what to look for.

I am guessing you are referring to the WARN() in
page_pool_disable_direct_recycling(), right? If yes, I am aware of
that WARN().

The problem is that, at least in the skb_defer_free_flush() case, the
freeing is not tied to any specific napi instance. When
skb_attempt_defer_free() calls kick_defer_list_purge() to trigger running
of net_rx_action(), skb_defer_free_flush() can be called without being
tied to any specific napi instance, as I understand it:

https://elixir.bootlin.com/linux/v6.7-rc8/source/net/core/dev.c#L6719

Or am I missing something obvious here?

I even used the below diff to verify that, and it did trigger without any
napi in the sd->poll_list:

@@ -6313,6 +6313,9 @@ static void skb_defer_free_flush(struct softnet_data *sd)
         spin_unlock(&sd->defer_lock);
 
         while (skb != NULL) {
+                if (list_empty(&sd->poll_list))
+                        pr_err("defer freeing: %px with empty napi list\n", skb);
+
                 next = skb->next;
                 napi_consume_skb(skb, 1);
                 skb = next;
On Wed, 9 Oct 2024 11:33:02 +0800 Yunsheng Lin wrote:
> Or am I missing something obvious here?
Seemingly the entire logic of how the safety of the lockless caching
is ensured.
But I don't think you trust my opinion, so I'll let others explain
this to you..
On 2024/10/9 23:13, Jakub Kicinski wrote:
> On Wed, 9 Oct 2024 11:33:02 +0800 Yunsheng Lin wrote:
>> Or am I missing something obvious here?
>
> Seemingly the entire logic of how the safety of the lockless caching
> is ensured.

I looked at it more closely; it seems the ensuring you mean is done by
setting napi->list_owner to -1 when disabling NAPI, right?

But letting skb_defer_free_flush() hold on to the napi instance to check
napi->list_owner without synchronizing with page_pool_destroy() seems a
little surprising to me, as page_pool_destroy() may return and let the
driver free the napi, even if it is a very small time window, causing a
possible use-after-free problem:

           CPU 0                                   CPU1
             .                                       .
             .                             skb_defer_free_flush()
             .                                       .
             .                          napi = READ_ONCE(pool->p.napi);
             .                                       .
page_pool_disable_direct_recycling()                 .
    driver frees napi memory                         .
             .                                       .
             .                 napi && READ_ONCE(napi->list_owner) == cpuid
             .                                       .

> But I don't think you trust my opinion, so I'll let others explain
> this to you..
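To make the synchronization argument explicit, the pairing the patch relies on can be sketched as below, using the function names that appear in this thread; this is a simplified illustration, not the literal patch code:

/* Freeing / softirq side: the napi check and the possible direct recycle
 * both happen inside one RCU read-side section.
 */
static void put_side(struct page_pool *pool, netmem_ref netmem)
{
        struct napi_struct *napi;
        bool allow_direct;

        rcu_read_lock();
        napi = READ_ONCE(pool->p.napi);
        allow_direct = napi && READ_ONCE(napi->list_owner) == smp_processor_id();
        /* may recycle into pool->alloc while still inside the section */
        __page_pool_put_page(pool, netmem, -1, allow_direct);
        rcu_read_unlock();
}

/* Destroy side: once synchronize_rcu() returns, no reader can still be
 * recycling directly, so the napi may be freed by the driver and
 * pool->alloc drained without a racing direct recycle.
 */
static void destroy_side(struct page_pool *pool)
{
        page_pool_disable_direct_recycling(pool);       /* clears pool->p.napi */
        synchronize_rcu();
        page_pool_empty_alloc_cache_once(pool);
}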
diff --git a/net/core/page_pool.c b/net/core/page_pool.c
index a813d30d2135..bec6e717cd22 100644
--- a/net/core/page_pool.c
+++ b/net/core/page_pool.c
@@ -818,8 +818,17 @@ static bool page_pool_napi_local(const struct page_pool *pool)
 void page_pool_put_unrefed_netmem(struct page_pool *pool, netmem_ref netmem,
                                   unsigned int dma_sync_size, bool allow_direct)
 {
-        if (!allow_direct)
+        bool allow_direct_orig = allow_direct;
+
+        /* page_pool_put_unrefed_netmem() is not supposed to be called with
+         * allow_direct being true after page_pool_destroy() is called, so
+         * the allow_direct being true case doesn't need synchronization.
+         */
+        DEBUG_NET_WARN_ON_ONCE(allow_direct && pool->destroy_cnt);
+        if (!allow_direct_orig) {
+                rcu_read_lock();
                 allow_direct = page_pool_napi_local(pool);
+        }
 
         netmem =
                 __page_pool_put_page(pool, netmem, dma_sync_size, allow_direct);
@@ -828,6 +837,9 @@ void page_pool_put_unrefed_netmem(struct page_pool *pool, netmem_ref netmem,
                 recycle_stat_inc(pool, ring_full);
                 page_pool_return_page(pool, netmem);
         }
+
+        if (!allow_direct_orig)
+                rcu_read_unlock();
 }
 EXPORT_SYMBOL(page_pool_put_unrefed_netmem);
 
@@ -861,6 +873,7 @@ void page_pool_put_page_bulk(struct page_pool *pool, void **data,
         bool allow_direct;
         bool in_softirq;
 
+        rcu_read_lock();
         allow_direct = page_pool_napi_local(pool);
 
         for (i = 0; i < count; i++) {
@@ -876,8 +889,10 @@ void page_pool_put_page_bulk(struct page_pool *pool, void **data,
                 data[bulk_len++] = (__force void *)netmem;
         }
 
-        if (!bulk_len)
+        if (!bulk_len) {
+                rcu_read_unlock();
                 return;
+        }
 
         /* Bulk producer into ptr_ring page_pool cache */
         in_softirq = page_pool_producer_lock(pool);
@@ -892,14 +907,18 @@ void page_pool_put_page_bulk(struct page_pool *pool, void **data,
         page_pool_producer_unlock(pool, in_softirq);
 
         /* Hopefully all pages was return into ptr_ring */
-        if (likely(i == bulk_len))
+        if (likely(i == bulk_len)) {
+                rcu_read_unlock();
                 return;
+        }
 
         /* ptr_ring cache full, free remaining pages outside producer lock
          * since put_page() with refcnt == 1 can be an expensive operation
          */
         for (; i < bulk_len; i++)
                 page_pool_return_page(pool, (__force netmem_ref)data[i]);
+
+        rcu_read_unlock();
 }
 EXPORT_SYMBOL(page_pool_put_page_bulk);
 
@@ -1121,6 +1140,12 @@ void page_pool_destroy(struct page_pool *pool)
                 return;
 
         page_pool_disable_direct_recycling(pool);
+
+        /* Wait for the freeing side to see the disabled direct recycling setting
+         * to avoid concurrent access to the pool->alloc cache.
+         */
+        synchronize_rcu();
+
         page_pool_free_frag(pool);
 
         if (!page_pool_release(pool))
page_pool page may be freed from skb_defer_free_flush() to
softirq context, it may cause concurrent access problem for
pool->alloc cache due to the below time window, as below,
both CPU0 and CPU1 may access the pool->alloc cache
concurrently in page_pool_empty_alloc_cache_once() and
page_pool_recycle_in_cache():

           CPU 0                           CPU1
     page_pool_destroy()          skb_defer_free_flush()
            .                               .
            .                   page_pool_put_unrefed_page()
            .                               .
            .               allow_direct = page_pool_napi_local()
            .                               .
page_pool_disable_direct_recycling()        .
            .                               .
page_pool_empty_alloc_cache_once()  page_pool_recycle_in_cache()

Use rcu mechanism to avoid the above concurrent access problem.

Note, the above was found during code reviewing on how to fix
the problem in [1].

1. https://lore.kernel.org/lkml/8067f204-1380-4d37-8ffd-007fc6f26738@kernel.org/T/

Fixes: dd64b232deb8 ("page_pool: unlink from napi during destroy")
Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
CC: Alexander Lobakin <aleksander.lobakin@intel.com>
---
 net/core/page_pool.c | 31 ++++++++++++++++++++++++++++---
 1 file changed, 28 insertions(+), 3 deletions(-)