diff mbox series

[RFC,net-next] page_pool: Track DMA-mapped pages and unmap them when destroying the pool

Message ID 20250308145500.14046-1-toke@redhat.com (mailing list archive)
State New
Headers show
Series [RFC,net-next] page_pool: Track DMA-mapped pages and unmap them when destroying the pool | expand

Commit Message

Toke Høiland-Jørgensen March 8, 2025, 2:54 p.m. UTC
When enabling DMA mapping in page_pool, pages are kept DMA mapped until
they are released from the pool, to avoid the overhead of re-mapping the
pages every time they are used. This causes problems when a device is
torn down, because the page pool can't unmap the pages until they are
returned to the pool. This causes resource leaks and/or crashes when
there are pages still outstanding while the device is torn down, because
page_pool will attempt an unmap of a non-existent DMA device on the
subsequent page return.

To fix this, implement a simple tracking of outstanding dma-mapped pages
in page pool using an xarray. This was first suggested by Mina[0], and
turns out to be fairly straight forward: We simply store pointers to
pages directly in the xarray with xa_alloc() when they are first DMA
mapped, and remove them from the array on unmap. Then, when a page pool
is torn down, it can simply walk the xarray and unmap all pages still
present there before returning, which also allows us to get rid of the
get/put_device() calls in page_pool. Using xa_cmpxchg(), no additional
synchronisation is needed, as a page will only ever be unmapped once.

To avoid having to walk the entire xarray on unmap to find the page
reference, we stash the ID assigned by xa_alloc() into the page
structure itself, in the field previously called '_pp_mapping_pad' in
the page_pool struct inside struct page. This field overlaps with the
page->mapping pointer, which may turn out to be problematic, so an
alternative is probably needed. Sticking the ID into some of the upper
bits of page->pp_magic may work as an alternative, but that requires
further investigation. Using the 'mapping' field works well enough as
a demonstration for this RFC, though.

Since all the tracking is performed on DMA map/unmap, no additional code
is needed in the fast path, meaning the performance overhead of this
tracking is negligible. The extra memory needed to track the pages is
neatly encapsulated inside xarray, which uses the 'struct xa_node'
structure to track items. This structure is 576 bytes long, with slots
for 64 items, meaning that a full node occurs only 9 bytes of overhead
per slot it tracks (in practice, it probably won't be this efficient,
but in any case it should be an acceptable overhead).

[0] https://lore.kernel.org/all/CAHS8izPg7B5DwKfSuzz-iOop_YRbk3Sd6Y4rX7KBG9DcVJcyWg@mail.gmail.com/

Fixes: ff7d6b27f894 ("page_pool: refurbish version of page_pool code")
Reported-by: Yonglong Liu <liuyonglong@huawei.com>
Suggested-by: Mina Almasry <almasrymina@google.com>
Reviewed-by: Jesper Dangaard Brouer <hawk@kernel.org>
Tested-by: Jesper Dangaard Brouer <hawk@kernel.org>
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
---
This is an alternative to Yunsheng's series. Yunsheng requested I send
this as an RFC to better be able to discuss the different approaches; see
some initial discussion in[1], also regarding where to store the ID as
alluded to above.

-Toke

[1] https://lore.kernel.org/r/40b33879-509a-4c4a-873b-b5d3573b6e14@gmail.com

 include/linux/mm_types.h      |  2 +-
 include/net/page_pool/types.h |  2 ++
 net/core/netmem_priv.h        | 17 +++++++++++++
 net/core/page_pool.c          | 46 +++++++++++++++++++++++++++++------
 4 files changed, 58 insertions(+), 9 deletions(-)

Comments

Mina Almasry March 8, 2025, 7:22 p.m. UTC | #1
On Sat, Mar 8, 2025 at 6:55 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>
> When enabling DMA mapping in page_pool, pages are kept DMA mapped until
> they are released from the pool, to avoid the overhead of re-mapping the
> pages every time they are used. This causes problems when a device is
> torn down, because the page pool can't unmap the pages until they are
> returned to the pool. This causes resource leaks and/or crashes when
> there are pages still outstanding while the device is torn down, because
> page_pool will attempt an unmap of a non-existent DMA device on the
> subsequent page return.
>
> To fix this, implement a simple tracking of outstanding dma-mapped pages
> in page pool using an xarray. This was first suggested by Mina[0], and
> turns out to be fairly straight forward: We simply store pointers to
> pages directly in the xarray with xa_alloc() when they are first DMA
> mapped, and remove them from the array on unmap. Then, when a page pool
> is torn down, it can simply walk the xarray and unmap all pages still
> present there before returning, which also allows us to get rid of the
> get/put_device() calls in page_pool.

THANK YOU!! I had been looking at the other proposals to fix this here
and there and I had similar feelings to you. They add lots of code
changes and the code changes themselves were hard for me to
understand. I hope we can make this simpler approach work.

> Using xa_cmpxchg(), no additional
> synchronisation is needed, as a page will only ever be unmapped once.
>

Very clever. I had been wondering how to handle the concurrency. I
also think this works.

> To avoid having to walk the entire xarray on unmap to find the page
> reference, we stash the ID assigned by xa_alloc() into the page
> structure itself, in the field previously called '_pp_mapping_pad' in
> the page_pool struct inside struct page. This field overlaps with the
> page->mapping pointer, which may turn out to be problematic, so an
> alternative is probably needed. Sticking the ID into some of the upper
> bits of page->pp_magic may work as an alternative, but that requires
> further investigation. Using the 'mapping' field works well enough as
> a demonstration for this RFC, though.
>

I'm unsure about this. I think page->mapping may be used when we map
the page to the userspace in TCP zerocopy, but I'm really not sure.
Yes, finding somewhere else to put the id would be ideal. Do we really
need a full unsigned long for the pp_magic?

> Since all the tracking is performed on DMA map/unmap, no additional code
> is needed in the fast path, meaning the performance overhead of this
> tracking is negligible. The extra memory needed to track the pages is
> neatly encapsulated inside xarray, which uses the 'struct xa_node'
> structure to track items. This structure is 576 bytes long, with slots
> for 64 items, meaning that a full node occurs only 9 bytes of overhead
> per slot it tracks (in practice, it probably won't be this efficient,
> but in any case it should be an acceptable overhead).
>

Yes, I think I also saw in another thread that this version actually
produces better perf results than the more complicated version,
because the page_pool benchmark actually does no mapping and this
version doesn't add overhead when there is no mapping.

As an aside I have it in my todo list to put the page_pool benchmark
in the upstream kernel so we don't have to run out-of-tree versions.

> [0] https://lore.kernel.org/all/CAHS8izPg7B5DwKfSuzz-iOop_YRbk3Sd6Y4rX7KBG9DcVJcyWg@mail.gmail.com/
>
> Fixes: ff7d6b27f894 ("page_pool: refurbish version of page_pool code")
> Reported-by: Yonglong Liu <liuyonglong@huawei.com>
> Suggested-by: Mina Almasry <almasrymina@google.com>
> Reviewed-by: Jesper Dangaard Brouer <hawk@kernel.org>
> Tested-by: Jesper Dangaard Brouer <hawk@kernel.org>
> Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>

I'm only holding back my Reviewed-by for the page->mapping issue.
Other than that, I think this is great. Thank you very much for
looking.

Pavel, David, as an aside, I think we need to propagate this fix to
memory providers as a follow up. We probably need a new op in the
provider to unmap. Then, in page_pool_scrub, where this patch does an
xa_for_each, we need to call that unmap op.

For the io_uring memory provider, I think it needs to unmap its memory
(and record it did so, so it doesn't re-unmap it later).

For the dmabuf memory provider, I think maybe we need to call
dma_buf_unmap_attachment (and record we did that, so we don't re-do it
later).

I don't think propagating this fix to memory providers should block
merging the fix for pages, IMO.

> ---
> This is an alternative to Yunsheng's series. Yunsheng requested I send
> this as an RFC to better be able to discuss the different approaches; see
> some initial discussion in[1], also regarding where to store the ID as
> alluded to above.
>
> -Toke
>
> [1] https://lore.kernel.org/r/40b33879-509a-4c4a-873b-b5d3573b6e14@gmail.com
>
>  include/linux/mm_types.h      |  2 +-
>  include/net/page_pool/types.h |  2 ++
>  net/core/netmem_priv.h        | 17 +++++++++++++
>  net/core/page_pool.c          | 46 +++++++++++++++++++++++++++++------
>  4 files changed, 58 insertions(+), 9 deletions(-)
>
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 0234f14f2aa6..d2c7a7b04bea 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -121,7 +121,7 @@ struct page {
>                          */
>                         unsigned long pp_magic;
>                         struct page_pool *pp;
> -                       unsigned long _pp_mapping_pad;
> +                       unsigned long pp_dma_index;
>                         unsigned long dma_addr;
>                         atomic_long_t pp_ref_count;
>                 };
> diff --git a/include/net/page_pool/types.h b/include/net/page_pool/types.h
> index 36eb57d73abc..13597a77aa36 100644
> --- a/include/net/page_pool/types.h
> +++ b/include/net/page_pool/types.h
> @@ -221,6 +221,8 @@ struct page_pool {
>         void *mp_priv;
>         const struct memory_provider_ops *mp_ops;
>
> +       struct xarray dma_mapped;
> +
>  #ifdef CONFIG_PAGE_POOL_STATS
>         /* recycle stats are per-cpu to avoid locking */
>         struct page_pool_recycle_stats __percpu *recycle_stats;
> diff --git a/net/core/netmem_priv.h b/net/core/netmem_priv.h
> index 7eadb8393e00..59679406a7b7 100644
> --- a/net/core/netmem_priv.h
> +++ b/net/core/netmem_priv.h
> @@ -28,4 +28,21 @@ static inline void netmem_set_dma_addr(netmem_ref netmem,
>  {
>         __netmem_clear_lsb(netmem)->dma_addr = dma_addr;
>  }
> +
> +static inline unsigned long netmem_get_dma_index(netmem_ref netmem)
> +{
> +       if (WARN_ON_ONCE(netmem_is_net_iov(netmem)))
> +               return 0;
> +
> +       return netmem_to_page(netmem)->pp_dma_index;
> +}
> +
> +static inline void netmem_set_dma_index(netmem_ref netmem,
> +                                       unsigned long id)
> +{
> +       if (WARN_ON_ONCE(netmem_is_net_iov(netmem)))
> +               return;
> +
> +       netmem_to_page(netmem)->pp_dma_index = id;
> +}
>  #endif
> diff --git a/net/core/page_pool.c b/net/core/page_pool.c
> index acef1fcd8ddc..d5530f29bf62 100644
> --- a/net/core/page_pool.c
> +++ b/net/core/page_pool.c
> @@ -226,6 +226,8 @@ static int page_pool_init(struct page_pool *pool,
>                         return -EINVAL;
>
>                 pool->dma_map = true;
> +
> +               xa_init_flags(&pool->dma_mapped, XA_FLAGS_ALLOC1);
>         }
>
>         if (pool->slow.flags & PP_FLAG_DMA_SYNC_DEV) {
> @@ -275,9 +277,6 @@ static int page_pool_init(struct page_pool *pool,
>         /* Driver calling page_pool_create() also call page_pool_destroy() */
>         refcount_set(&pool->user_cnt, 1);
>
> -       if (pool->dma_map)
> -               get_device(pool->p.dev);
> -
>         if (pool->slow.flags & PP_FLAG_ALLOW_UNREADABLE_NETMEM) {
>                 /* We rely on rtnl_lock()ing to make sure netdev_rx_queue
>                  * configuration doesn't change while we're initializing
> @@ -325,7 +324,7 @@ static void page_pool_uninit(struct page_pool *pool)
>         ptr_ring_cleanup(&pool->ring, NULL);
>
>         if (pool->dma_map)
> -               put_device(pool->p.dev);
> +               xa_destroy(&pool->dma_mapped);
>
>  #ifdef CONFIG_PAGE_POOL_STATS
>         if (!pool->system)
> @@ -470,9 +469,11 @@ page_pool_dma_sync_for_device(const struct page_pool *pool,
>                 __page_pool_dma_sync_for_device(pool, netmem, dma_sync_size);
>  }
>
> -static bool page_pool_dma_map(struct page_pool *pool, netmem_ref netmem)
> +static bool page_pool_dma_map(struct page_pool *pool, netmem_ref netmem, gfp_t gfp)
>  {
>         dma_addr_t dma;
> +       int err;
> +       u32 id;
>
>         /* Setup DMA mapping: use 'struct page' area for storing DMA-addr
>          * since dma_addr_t can be either 32 or 64 bits and does not always fit
> @@ -486,9 +487,19 @@ static bool page_pool_dma_map(struct page_pool *pool, netmem_ref netmem)
>         if (dma_mapping_error(pool->p.dev, dma))
>                 return false;
>
> +       if (in_softirq())
> +               err = xa_alloc(&pool->dma_mapped, &id, netmem_to_page(netmem),
> +                              XA_LIMIT(1, UINT_MAX), gfp);
> +       else
> +               err = xa_alloc_bh(&pool->dma_mapped, &id, netmem_to_page(netmem),
> +                                 XA_LIMIT(1, UINT_MAX), gfp);

nit: I think xa_limit_32b works here, because XA_FLAGS_ALLOC1 avoids 1 anyway.
Toke Høiland-Jørgensen March 9, 2025, 12:42 p.m. UTC | #2
Mina Almasry <almasrymina@google.com> writes:

> On Sat, Mar 8, 2025 at 6:55 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>>
>> When enabling DMA mapping in page_pool, pages are kept DMA mapped until
>> they are released from the pool, to avoid the overhead of re-mapping the
>> pages every time they are used. This causes problems when a device is
>> torn down, because the page pool can't unmap the pages until they are
>> returned to the pool. This causes resource leaks and/or crashes when
>> there are pages still outstanding while the device is torn down, because
>> page_pool will attempt an unmap of a non-existent DMA device on the
>> subsequent page return.
>>
>> To fix this, implement a simple tracking of outstanding dma-mapped pages
>> in page pool using an xarray. This was first suggested by Mina[0], and
>> turns out to be fairly straight forward: We simply store pointers to
>> pages directly in the xarray with xa_alloc() when they are first DMA
>> mapped, and remove them from the array on unmap. Then, when a page pool
>> is torn down, it can simply walk the xarray and unmap all pages still
>> present there before returning, which also allows us to get rid of the
>> get/put_device() calls in page_pool.
>
> THANK YOU!! I had been looking at the other proposals to fix this here
> and there and I had similar feelings to you. They add lots of code
> changes and the code changes themselves were hard for me to
> understand. I hope we can make this simpler approach work.

You're welcome :)
And yeah, me too!

>> Using xa_cmpxchg(), no additional
>> synchronisation is needed, as a page will only ever be unmapped once.
>>
>
> Very clever. I had been wondering how to handle the concurrency. I
> also think this works.

Thanks!

>> To avoid having to walk the entire xarray on unmap to find the page
>> reference, we stash the ID assigned by xa_alloc() into the page
>> structure itself, in the field previously called '_pp_mapping_pad' in
>> the page_pool struct inside struct page. This field overlaps with the
>> page->mapping pointer, which may turn out to be problematic, so an
>> alternative is probably needed. Sticking the ID into some of the upper
>> bits of page->pp_magic may work as an alternative, but that requires
>> further investigation. Using the 'mapping' field works well enough as
>> a demonstration for this RFC, though.
>>
>
> I'm unsure about this. I think page->mapping may be used when we map
> the page to the userspace in TCP zerocopy, but I'm really not sure.
> Yes, finding somewhere else to put the id would be ideal. Do we really
> need a full unsigned long for the pp_magic?

No, pp_magic was also my backup plan (see the other thread). Tried
actually doing that now, and while there's a bit of complication due to
the varying definitions of POISON_POINTER_DELTA across architectures,
but it seems that this can be defined at compile time. I'll send a v2
RFC with this change.

-Toke
Yunsheng Lin March 9, 2025, 1:27 p.m. UTC | #3
On 3/8/2025 10:54 PM, Toke Høiland-Jørgensen wrote:
> When enabling DMA mapping in page_pool, pages are kept DMA mapped until
> they are released from the pool, to avoid the overhead of re-mapping the
> pages every time they are used. This causes problems when a device is
> torn down, because the page pool can't unmap the pages until they are
> returned to the pool. This causes resource leaks and/or crashes when
> there are pages still outstanding while the device is torn down, because
> page_pool will attempt an unmap of a non-existent DMA device on the
> subsequent page return.
> 
> To fix this, implement a simple tracking of outstanding dma-mapped pages
> in page pool using an xarray. This was first suggested by Mina[0], and
> turns out to be fairly straight forward: We simply store pointers to
> pages directly in the xarray with xa_alloc() when they are first DMA
> mapped, and remove them from the array on unmap. Then, when a page pool
> is torn down, it can simply walk the xarray and unmap all pages still
> present there before returning, which also allows us to get rid of the
> get/put_device() calls in page_pool. Using xa_cmpxchg(), no additional
> synchronisation is needed, as a page will only ever be unmapped once.

The implementation of xa_cmpxchg() seems to take the xa_lock, which
seems to be a per-Xarray spin_lock.
Yes, if if we were to take a per-Xarray lock unconditionaly, additional
synchronisation like rcu doesn't seems to be needed. But it seems an
excessive overhead for normal packet processing when page_pool_destroy()
is not called yet?

Also, we might need a similar locking or synchronisation for the dma
sync API in order to skip the dma sync API when page_pool_destroy() is
called too.

> 
> To avoid having to walk the entire xarray on unmap to find the page
> reference, we stash the ID assigned by xa_alloc() into the page
> structure itself, in the field previously called '_pp_mapping_pad' in
> the page_pool struct inside struct page. This field overlaps with the
> page->mapping pointer, which may turn out to be problematic, so an
> alternative is probably needed. Sticking the ID into some of the upper
> bits of page->pp_magic may work as an alternative, but that requires
> further investigation. Using the 'mapping' field works well enough as
> a demonstration for this RFC, though.
page->pp_magic seems to only have 16 bits space available at most when
trying to reuse some unused bits in page->pp_magic, as BPF_PTR_POISON
and STACK_DEPOT_POISON seems to already use 16 bits, so:
1. For 32 bits system, it seems there is only 16 bits left even if the
    ILLEGAL_POINTER_VALUE is defined as 0x0.
2. For 64 bits system, it seems there is only 12 bits left for powerpc
    as ILLEGAL_POINTER_VALUE is defined as 0x5deadbeef0000000, see
    arch/powerpc/Kconfig.

So it seems we might need to limit the dma mapping count to 4096 or
65536?

And to be honest, I am not sure if those 'unused' 12/16 bits can really 
be reused or not, I guess we might need suggestion from mm and arch
experts here.

> 
> Since all the tracking is performed on DMA map/unmap, no additional code
> is needed in the fast path, meaning the performance overhead of this
> tracking is negligible. The extra memory needed to track the pages is
> neatly encapsulated inside xarray, which uses the 'struct xa_node'
> structure to track items. This structure is 576 bytes long, with slots
> for 64 items, meaning that a full node occurs only 9 bytes of overhead
> per slot it tracks (in practice, it probably won't be this efficient,
> but in any case it should be an acceptable overhead).

Even if items is stored sequentially in xa_node at first, is it possible
that there may be fragmentation in those xa_node when pages are released
and allocated many times during packet processing? If yes, is there any
fragmentation info about those xa_node?

> 
> [0] https://lore.kernel.org/all/CAHS8izPg7B5DwKfSuzz-iOop_YRbk3Sd6Y4rX7KBG9DcVJcyWg@mail.gmail.com/
> 
> Fixes: ff7d6b27f894 ("page_pool: refurbish version of page_pool code")
> Reported-by: Yonglong Liu <liuyonglong@huawei.com>
> Suggested-by: Mina Almasry <almasrymina@google.com>
> Reviewed-by: Jesper Dangaard Brouer <hawk@kernel.org>
> Tested-by: Jesper Dangaard Brouer <hawk@kernel.org>
> Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
> ---
> This is an alternative to Yunsheng's series. Yunsheng requested I send
> this as an RFC to better be able to discuss the different approaches; see
> some initial discussion in[1], also regarding where to store the ID as
> alluded to above.

As mentioned before, I am not really convinced there is still any
space left in 'struct page' yet, otherwise we might already use that
space to fix the DMA address > 32 bits problem in 32 bits system, see
page_pool_set_dma_addr_netmem().

Also, Using the more space in 'struct page' for the page_pool seems to
make page_pool more coupled to the mm subsystem, which seems to not
align with the folios work that is trying to decouple non-mm subsystem
from the mm subsystem by avoid other subsystem using more of the 'struct
page' as metadata from the long term point of view.

> 
> -Toke
> 
> [1] https://lore.kernel.org/r/40b33879-509a-4c4a-873b-b5d3573b6e14@gmail.com
>
Pavel Begunkov March 10, 2025, 7:17 a.m. UTC | #4
On 3/8/25 19:22, Mina Almasry wrote:
> On Sat, Mar 8, 2025 at 6:55 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>>
>> When enabling DMA mapping in page_pool, pages are kept DMA mapped until
>> they are released from the pool, to avoid the overhead of re-mapping the
>> pages every time they are used. This causes problems when a device is
>> torn down, because the page pool can't unmap the pages until they are
>> returned to the pool. This causes resource leaks and/or crashes when
>> there are pages still outstanding while the device is torn down, because
>> page_pool will attempt an unmap of a non-existent DMA device on the
>> subsequent page return.
>>
>> To fix this, implement a simple tracking of outstanding dma-mapped pages
>> in page pool using an xarray. This was first suggested by Mina[0], and
>> turns out to be fairly straight forward: We simply store pointers to
>> pages directly in the xarray with xa_alloc() when they are first DMA
>> mapped, and remove them from the array on unmap. Then, when a page pool
>> is torn down, it can simply walk the xarray and unmap all pages still
>> present there before returning, which also allows us to get rid of the
>> get/put_device() calls in page_pool.
> 
>> Using xa_cmpxchg(), no additional
>> synchronisation is needed, as a page will only ever be unmapped once.
>>
>> To avoid having to walk the entire xarray on unmap to find the page
>> reference, we stash the ID assigned by xa_alloc() into the page
>> structure itself, in the field previously called '_pp_mapping_pad' in
>> the page_pool struct inside struct page. This field overlaps with the
>> page->mapping pointer, which may turn out to be problematic, so an
>> alternative is probably needed. Sticking the ID into some of the upper
>> bits of page->pp_magic may work as an alternative, but that requires
>> further investigation. Using the 'mapping' field works well enough as
>> a demonstration for this RFC, though.
>>
>> Since all the tracking is performed on DMA map/unmap, no additional code
>> is needed in the fast path, meaning the performance overhead of this
>> tracking is negligible. The extra memory needed to track the pages is
>> neatly encapsulated inside xarray, which uses the 'struct xa_node'
>> structure to track items. This structure is 576 bytes long, with slots
>> for 64 items, meaning that a full node occurs only 9 bytes of overhead
>> per slot it tracks (in practice, it probably won't be this efficient,
>> but in any case it should be an acceptable overhead).
...
> 
> Pavel, David, as an aside, I think we need to propagate this fix to
> memory providers as a follow up. We probably need a new op in the
> provider to unmap. Then, in page_pool_scrub, where this patch does an
> xa_for_each, we need to call that unmap op.

Sounds like it, which is the easy part since mps already hold the
full list of pages available. We just need to be careful unmapping
all netmems in presense of multiple pools, but that should be fine.
Toke Høiland-Jørgensen March 10, 2025, 9:13 a.m. UTC | #5
Yunsheng Lin <yunshenglin0825@gmail.com> writes:

> On 3/8/2025 10:54 PM, Toke Høiland-Jørgensen wrote:
>> When enabling DMA mapping in page_pool, pages are kept DMA mapped until
>> they are released from the pool, to avoid the overhead of re-mapping the
>> pages every time they are used. This causes problems when a device is
>> torn down, because the page pool can't unmap the pages until they are
>> returned to the pool. This causes resource leaks and/or crashes when
>> there are pages still outstanding while the device is torn down, because
>> page_pool will attempt an unmap of a non-existent DMA device on the
>> subsequent page return.
>> 
>> To fix this, implement a simple tracking of outstanding dma-mapped pages
>> in page pool using an xarray. This was first suggested by Mina[0], and
>> turns out to be fairly straight forward: We simply store pointers to
>> pages directly in the xarray with xa_alloc() when they are first DMA
>> mapped, and remove them from the array on unmap. Then, when a page pool
>> is torn down, it can simply walk the xarray and unmap all pages still
>> present there before returning, which also allows us to get rid of the
>> get/put_device() calls in page_pool. Using xa_cmpxchg(), no additional
>> synchronisation is needed, as a page will only ever be unmapped once.
>
> The implementation of xa_cmpxchg() seems to take the xa_lock, which
> seems to be a per-Xarray spin_lock.
> Yes, if if we were to take a per-Xarray lock unconditionaly, additional
> synchronisation like rcu doesn't seems to be needed. But it seems an
> excessive overhead for normal packet processing when page_pool_destroy()
> is not called yet?

I dunno, maybe? It's only done on DMA map/unmap, so it may be
acceptable? We should get some benchmark results to assess :)

> Also, we might need a similar locking or synchronisation for the dma
> sync API in order to skip the dma sync API when page_pool_destroy() is
> called too.

Good point, but that seems a separate issue? And simpler to solve (just
set pool->dma_sync to false when destroying?).

>> To avoid having to walk the entire xarray on unmap to find the page
>> reference, we stash the ID assigned by xa_alloc() into the page
>> structure itself, in the field previously called '_pp_mapping_pad' in
>> the page_pool struct inside struct page. This field overlaps with the
>> page->mapping pointer, which may turn out to be problematic, so an
>> alternative is probably needed. Sticking the ID into some of the upper
>> bits of page->pp_magic may work as an alternative, but that requires
>> further investigation. Using the 'mapping' field works well enough as
>> a demonstration for this RFC, though.
> page->pp_magic seems to only have 16 bits space available at most when
> trying to reuse some unused bits in page->pp_magic, as BPF_PTR_POISON
> and STACK_DEPOT_POISON seems to already use 16 bits, so:
> 1. For 32 bits system, it seems there is only 16 bits left even if the
>     ILLEGAL_POINTER_VALUE is defined as 0x0.
> 2. For 64 bits system, it seems there is only 12 bits left for powerpc
>     as ILLEGAL_POINTER_VALUE is defined as 0x5deadbeef0000000, see
>     arch/powerpc/Kconfig.
>
> So it seems we might need to limit the dma mapping count to 4096 or
> 65536?
>
> And to be honest, I am not sure if those 'unused' 12/16 bits can really 
> be reused or not, I guess we might need suggestion from mm and arch
> experts here.

Why do we need to care about BPF_PTR_POISON and STACK_DEPOT_POISON?
AFAICT, we only need to make sure we preserve the PP_SIGNATURE value.
See v2 of the RFC patch, the bit arithmetic there gives me:

- 24 bits on 32-bit architectures
- 21 bits on PPC64 (because of the definition of ILLEGAL_POINTER_VALUE)
- 32 bits on other 64-bit architectures

Which seems to be plenty?

>> Since all the tracking is performed on DMA map/unmap, no additional code
>> is needed in the fast path, meaning the performance overhead of this
>> tracking is negligible. The extra memory needed to track the pages is
>> neatly encapsulated inside xarray, which uses the 'struct xa_node'
>> structure to track items. This structure is 576 bytes long, with slots
>> for 64 items, meaning that a full node occurs only 9 bytes of overhead
>> per slot it tracks (in practice, it probably won't be this efficient,
>> but in any case it should be an acceptable overhead).
>
> Even if items is stored sequentially in xa_node at first, is it possible
> that there may be fragmentation in those xa_node when pages are released
> and allocated many times during packet processing? If yes, is there any
> fragmentation info about those xa_node?

Some (that's what I mean with "not as efficient"). AFAICT, xa_array does
do some rebalancing of the underlying radix tree, freeing nodes when
they are no longer used. However, I am not too familiar with the xarray
code, so I don't know exactly how efficient this is in practice.

>> [0] https://lore.kernel.org/all/CAHS8izPg7B5DwKfSuzz-iOop_YRbk3Sd6Y4rX7KBG9DcVJcyWg@mail.gmail.com/
>> 
>> Fixes: ff7d6b27f894 ("page_pool: refurbish version of page_pool code")
>> Reported-by: Yonglong Liu <liuyonglong@huawei.com>
>> Suggested-by: Mina Almasry <almasrymina@google.com>
>> Reviewed-by: Jesper Dangaard Brouer <hawk@kernel.org>
>> Tested-by: Jesper Dangaard Brouer <hawk@kernel.org>
>> Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
>> ---
>> This is an alternative to Yunsheng's series. Yunsheng requested I send
>> this as an RFC to better be able to discuss the different approaches; see
>> some initial discussion in[1], also regarding where to store the ID as
>> alluded to above.
>
> As mentioned before, I am not really convinced there is still any
> space left in 'struct page' yet, otherwise we might already use that
> space to fix the DMA address > 32 bits problem in 32 bits system, see
> page_pool_set_dma_addr_netmem().

See above.

> Also, Using the more space in 'struct page' for the page_pool seems to
> make page_pool more coupled to the mm subsystem, which seems to not
> align with the folios work that is trying to decouple non-mm subsystem
> from the mm subsystem by avoid other subsystem using more of the 'struct
> page' as metadata from the long term point of view.

This seems a bit theoretical; any future changes of struct page would
have to shuffle things around so we still have the ID available,
obviously :)

-Toke
Yunsheng Lin March 10, 2025, 12:35 p.m. UTC | #6
On 2025/3/10 17:13, Toke Høiland-Jørgensen wrote:

...

> 
>> Also, we might need a similar locking or synchronisation for the dma
>> sync API in order to skip the dma sync API when page_pool_destroy() is
>> called too.
> 
> Good point, but that seems a separate issue? And simpler to solve (just

If I understand the comment from DMA experts correctly, the dma_sync API
should not be allowed to be called after the dma_unmap API.

> set pool->dma_sync to false when destroying?).

Without locking or synchronisation, there is some small window between
pool->dma_sync checking and dma sync API calling, during which the driver
might have already unbound.

> 
>>> To avoid having to walk the entire xarray on unmap to find the page
>>> reference, we stash the ID assigned by xa_alloc() into the page
>>> structure itself, in the field previously called '_pp_mapping_pad' in
>>> the page_pool struct inside struct page. This field overlaps with the
>>> page->mapping pointer, which may turn out to be problematic, so an
>>> alternative is probably needed. Sticking the ID into some of the upper
>>> bits of page->pp_magic may work as an alternative, but that requires
>>> further investigation. Using the 'mapping' field works well enough as
>>> a demonstration for this RFC, though.
>> page->pp_magic seems to only have 16 bits space available at most when
>> trying to reuse some unused bits in page->pp_magic, as BPF_PTR_POISON
>> and STACK_DEPOT_POISON seems to already use 16 bits, so:
>> 1. For 32 bits system, it seems there is only 16 bits left even if the
>>     ILLEGAL_POINTER_VALUE is defined as 0x0.
>> 2. For 64 bits system, it seems there is only 12 bits left for powerpc
>>     as ILLEGAL_POINTER_VALUE is defined as 0x5deadbeef0000000, see
>>     arch/powerpc/Kconfig.
>>
>> So it seems we might need to limit the dma mapping count to 4096 or
>> 65536?
>>
>> And to be honest, I am not sure if those 'unused' 12/16 bits can really 
>> be reused or not, I guess we might need suggestion from mm and arch
>> experts here.
> 
> Why do we need to care about BPF_PTR_POISON and STACK_DEPOT_POISON?
> AFAICT, we only need to make sure we preserve the PP_SIGNATURE value.
> See v2 of the RFC patch, the bit arithmetic there gives me:
> 
> - 24 bits on 32-bit architectures
> - 21 bits on PPC64 (because of the definition of ILLEGAL_POINTER_VALUE)
> - 32 bits on other 64-bit architectures
> 
> Which seems to be plenty?

I am really doubtful it is that simple, but we always can hear from the
experts if it isn't:)

> 
>>> Since all the tracking is performed on DMA map/unmap, no additional code
>>> is needed in the fast path, meaning the performance overhead of this
>>> tracking is negligible. The extra memory needed to track the pages is
>>> neatly encapsulated inside xarray, which uses the 'struct xa_node'
>>> structure to track items. This structure is 576 bytes long, with slots
>>> for 64 items, meaning that a full node occurs only 9 bytes of overhead
>>> per slot it tracks (in practice, it probably won't be this efficient,
>>> but in any case it should be an acceptable overhead).
>>
>> Even if items is stored sequentially in xa_node at first, is it possible
>> that there may be fragmentation in those xa_node when pages are released
>> and allocated many times during packet processing? If yes, is there any
>> fragmentation info about those xa_node?
> 
> Some (that's what I mean with "not as efficient"). AFAICT, xa_array does
> do some rebalancing of the underlying radix tree, freeing nodes when
> they are no longer used. However, I am not too familiar with the xarray
> code, so I don't know exactly how efficient this is in practice.

I guess that is one of the disadvantages that an advanced struct like
Xarray is used:(

>
Toke Høiland-Jørgensen March 10, 2025, 3:24 p.m. UTC | #7
Yunsheng Lin <linyunsheng@huawei.com> writes:

> On 2025/3/10 17:13, Toke Høiland-Jørgensen wrote:
>
> ...
>
>> 
>>> Also, we might need a similar locking or synchronisation for the dma
>>> sync API in order to skip the dma sync API when page_pool_destroy() is
>>> called too.
>> 
>> Good point, but that seems a separate issue? And simpler to solve (just
>
> If I understand the comment from DMA experts correctly, the dma_sync API
> should not be allowed to be called after the dma_unmap API.

Ah, right, I see what you mean; will add a check for this.

>> set pool->dma_sync to false when destroying?).
>
> Without locking or synchronisation, there is some small window between
> pool->dma_sync checking and dma sync API calling, during which the driver
> might have already unbound.
>
>> 
>>>> To avoid having to walk the entire xarray on unmap to find the page
>>>> reference, we stash the ID assigned by xa_alloc() into the page
>>>> structure itself, in the field previously called '_pp_mapping_pad' in
>>>> the page_pool struct inside struct page. This field overlaps with the
>>>> page->mapping pointer, which may turn out to be problematic, so an
>>>> alternative is probably needed. Sticking the ID into some of the upper
>>>> bits of page->pp_magic may work as an alternative, but that requires
>>>> further investigation. Using the 'mapping' field works well enough as
>>>> a demonstration for this RFC, though.
>>> page->pp_magic seems to only have 16 bits space available at most when
>>> trying to reuse some unused bits in page->pp_magic, as BPF_PTR_POISON
>>> and STACK_DEPOT_POISON seems to already use 16 bits, so:
>>> 1. For 32 bits system, it seems there is only 16 bits left even if the
>>>     ILLEGAL_POINTER_VALUE is defined as 0x0.
>>> 2. For 64 bits system, it seems there is only 12 bits left for powerpc
>>>     as ILLEGAL_POINTER_VALUE is defined as 0x5deadbeef0000000, see
>>>     arch/powerpc/Kconfig.
>>>
>>> So it seems we might need to limit the dma mapping count to 4096 or
>>> 65536?
>>>
>>> And to be honest, I am not sure if those 'unused' 12/16 bits can really 
>>> be reused or not, I guess we might need suggestion from mm and arch
>>> experts here.
>> 
>> Why do we need to care about BPF_PTR_POISON and STACK_DEPOT_POISON?
>> AFAICT, we only need to make sure we preserve the PP_SIGNATURE value.
>> See v2 of the RFC patch, the bit arithmetic there gives me:
>> 
>> - 24 bits on 32-bit architectures
>> - 21 bits on PPC64 (because of the definition of ILLEGAL_POINTER_VALUE)
>> - 32 bits on other 64-bit architectures
>> 
>> Which seems to be plenty?
>
> I am really doubtful it is that simple, but we always can hear from the
> experts if it isn't:)

Do you have any specific reason to think so? :)

>>>> Since all the tracking is performed on DMA map/unmap, no additional code
>>>> is needed in the fast path, meaning the performance overhead of this
>>>> tracking is negligible. The extra memory needed to track the pages is
>>>> neatly encapsulated inside xarray, which uses the 'struct xa_node'
>>>> structure to track items. This structure is 576 bytes long, with slots
>>>> for 64 items, meaning that a full node occurs only 9 bytes of overhead
>>>> per slot it tracks (in practice, it probably won't be this efficient,
>>>> but in any case it should be an acceptable overhead).
>>>
>>> Even if items is stored sequentially in xa_node at first, is it possible
>>> that there may be fragmentation in those xa_node when pages are released
>>> and allocated many times during packet processing? If yes, is there any
>>> fragmentation info about those xa_node?
>> 
>> Some (that's what I mean with "not as efficient"). AFAICT, xa_array does
>> do some rebalancing of the underlying radix tree, freeing nodes when
>> they are no longer used. However, I am not too familiar with the xarray
>> code, so I don't know exactly how efficient this is in practice.
>
> I guess that is one of the disadvantages that an advanced struct like
> Xarray is used:(

Sure, there will be some overhead from using xarray, but I think the
simplicity makes up for it; especially since we can limit this to the
cases where it's absolutely needed.

-Toke
Matthew Wilcox March 10, 2025, 3:42 p.m. UTC | #8
On Mon, Mar 10, 2025 at 10:13:32AM +0100, Toke Høiland-Jørgensen wrote:
> Yunsheng Lin <yunshenglin0825@gmail.com> writes:
> > Also, Using the more space in 'struct page' for the page_pool seems to
> > make page_pool more coupled to the mm subsystem, which seems to not
> > align with the folios work that is trying to decouple non-mm subsystem
> > from the mm subsystem by avoid other subsystem using more of the 'struct
> > page' as metadata from the long term point of view.
> 
> This seems a bit theoretical; any future changes of struct page would
> have to shuffle things around so we still have the ID available,
> obviously :)

See https://kernelnewbies.org/MatthewWilcox/Memdescs
and more immediately
https://kernelnewbies.org/MatthewWilcox/Memdescs/Path

pagepool is going to be renamed "bump" because it's a bump allocator and
"pagepool" is a nonsense name.  I haven't looked into it in a lot of
detail yet, but in the not-too-distant future, struct page will look
like this (from your point of view):

struct page {
	unsigned long flags;
	unsigned long memdesc;
	int _refcount;	// 0 for bump
	union {
		unsigned long private;
		atomic_t _mapcount; // maybe used by bump?  not sure
	};
};

'memdesc' will be a pointer to struct bump with the bottom four bits of
that pointer indicating that it's a struct bump pointer (and not, say, a
folio or a slab).

So if you allocate a multi-page bump, you'll get N of these pages,
and they'll all point to the same struct bump where you'll maintain
your actual refcount.  And you'll be able to grow struct bump to your
heart's content.  I don't know exactly what struct bump looks like,
but the core mm will have no requirements on you.
Toke Høiland-Jørgensen March 10, 2025, 5:26 p.m. UTC | #9
Matthew Wilcox <willy@infradead.org> writes:

> On Mon, Mar 10, 2025 at 10:13:32AM +0100, Toke Høiland-Jørgensen wrote:
>> Yunsheng Lin <yunshenglin0825@gmail.com> writes:
>> > Also, Using the more space in 'struct page' for the page_pool seems to
>> > make page_pool more coupled to the mm subsystem, which seems to not
>> > align with the folios work that is trying to decouple non-mm subsystem
>> > from the mm subsystem by avoid other subsystem using more of the 'struct
>> > page' as metadata from the long term point of view.
>> 
>> This seems a bit theoretical; any future changes of struct page would
>> have to shuffle things around so we still have the ID available,
>> obviously :)
>
> See https://kernelnewbies.org/MatthewWilcox/Memdescs
> and more immediately
> https://kernelnewbies.org/MatthewWilcox/Memdescs/Path
>
> pagepool is going to be renamed "bump" because it's a bump allocator and
> "pagepool" is a nonsense name.  I haven't looked into it in a lot of
> detail yet, but in the not-too-distant future, struct page will look
> like this (from your point of view):
>
> struct page {
> 	unsigned long flags;
> 	unsigned long memdesc;
> 	int _refcount;	// 0 for bump
> 	union {
> 		unsigned long private;
> 		atomic_t _mapcount; // maybe used by bump?  not sure
> 	};
> };
>
> 'memdesc' will be a pointer to struct bump with the bottom four bits of
> that pointer indicating that it's a struct bump pointer (and not, say, a
> folio or a slab).
>
> So if you allocate a multi-page bump, you'll get N of these pages,
> and they'll all point to the same struct bump where you'll maintain
> your actual refcount.  And you'll be able to grow struct bump to your
> heart's content.  I don't know exactly what struct bump looks like,
> but the core mm will have no requirements on you.

Ah, excellent, thanks for the pointer!

Out of curiosity, why "bump"? Is that a term of art somewhere?

And in the meantime (until those patches land), do you see any reason
why we can't squat on the middle bits of page->pp_magic (AKA page->lru)
like I'm doing in v2[0] of this patch?

-Toke

[0] https://lore.kernel.org/r/20250309124719.21285-1-toke@redhat.com
Yunsheng Lin March 11, 2025, 12:19 p.m. UTC | #10
On 2025/3/10 23:24, Toke Høiland-Jørgensen wrote:

>>
>> I guess that is one of the disadvantages that an advanced struct like
>> Xarray is used:(
> 
> Sure, there will be some overhead from using xarray, but I think the
> simplicity makes up for it; especially since we can limit this to the

As my understanding, it is more complicated, it is just that complexity
is hidden before xarray now.

Even if there is no space in 'struct page' to store the id, the
'struct page' pointer itself can be used as id if the xarray can
use pointer as id. But it might mean the memory utilization might
not be as efficient as it should be, and performance hurts too if
there is more memory to be allocated and freed.

It seems it is just a matter of choices between using tailor-made
page_pool specific optimization and using some generic advanced
struct like xarray.

I chose the tailor-made one because it ensure least overhead as
much as possibe from performance and memory utilization perspective,
for example, the 'single producer, multiple consumer' guarantee
offered by NAPI context can avoid some lock and atomic operation.

> cases where it's absolutely needed.

The above can also be done for using page_pool_item too as the
lower 2 bits can be used to indicate the pointer in 'struct page'
is 'page_pool_item' or 'page_pool', I just don't think it is
necessary yet as it might add more checking in the fast path.

> 
> -Toke
> 
>
Yunsheng Lin March 11, 2025, 12:25 p.m. UTC | #11
On 2025/3/10 23:42, Matthew Wilcox wrote:
> On Mon, Mar 10, 2025 at 10:13:32AM +0100, Toke Høiland-Jørgensen wrote:
>> Yunsheng Lin <yunshenglin0825@gmail.com> writes:
>>> Also, Using the more space in 'struct page' for the page_pool seems to
>>> make page_pool more coupled to the mm subsystem, which seems to not
>>> align with the folios work that is trying to decouple non-mm subsystem
>>> from the mm subsystem by avoid other subsystem using more of the 'struct
>>> page' as metadata from the long term point of view.
>>
>> This seems a bit theoretical; any future changes of struct page would
>> have to shuffle things around so we still have the ID available,
>> obviously :)
> 
> See https://kernelnewbies.org/MatthewWilcox/Memdescs
> and more immediately
> https://kernelnewbies.org/MatthewWilcox/Memdescs/Path
> 
> pagepool is going to be renamed "bump" because it's a bump allocator and
> "pagepool" is a nonsense name.  I haven't looked into it in a lot of
> detail yet, but in the not-too-distant future, struct page will look
> like this (from your point of view):
> 
> struct page {
> 	unsigned long flags;
> 	unsigned long memdesc;

It seems there may be memory behind the above 'memdesc' with different size
and layout for different subsystem?

I am not sure if I understand the case of the same page might be handle in
two subsystems concurrently or a page is allocated in one subsystem and
then passed to be handled in other subsystem, for examlpe:
page_pool owned page is mmap'ed into user space through tcp zero copy,
see tcp_zerocopy_vm_insert_batch(), it seems the same page is handled in
both networking/page_pool and vm subsystem?

And page->mapping seems to have been moved into 'memdesc' as there is no
'mapping' field in 'struct page' you list here? Does we need a similar
field like 'mapping' in the 'memdesc' for page_pool subsystem to support
tcp zero copy?

> 	int _refcount;	// 0 for bump
> 	union {
> 		unsigned long private;
> 		atomic_t _mapcount; // maybe used by bump?  not sure
> 	};
> };
> 
> 'memdesc' will be a pointer to struct bump with the bottom four bits of
> that pointer indicating that it's a struct bump pointer (and not, say, a
> folio or a slab).

The above seems similar as what I was doing, the difference seems to be
that memory behind the above pointer is managed by page_pool itself
instead of mm subsystem allocating 'memdesc' memory from a slab cache?

> 
> So if you allocate a multi-page bump, you'll get N of these pages,
> and they'll all point to the same struct bump where you'll maintain
> your actual refcount.  And you'll be able to grow struct bump to your
> heart's content.  I don't know exactly what struct bump looks like,
> but the core mm will have no requirements on you.
>
Pavel Begunkov March 11, 2025, 1:25 p.m. UTC | #12
On 3/9/25 12:42, Toke Høiland-Jørgensen wrote:
> Mina Almasry <almasrymina@google.com> writes:
> 
>> On Sat, Mar 8, 2025 at 6:55 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>>>
>>> When enabling DMA mapping in page_pool, pages are kept DMA mapped until
>>> they are released from the pool, to avoid the overhead of re-mapping the
>>> pages every time they are used. This causes problems when a device is
>>> torn down, because the page pool can't unmap the pages until they are
>>> returned to the pool. This causes resource leaks and/or crashes when
>>> there are pages still outstanding while the device is torn down, because
>>> page_pool will attempt an unmap of a non-existent DMA device on the
>>> subsequent page return.
>>>
>>> To fix this, implement a simple tracking of outstanding dma-mapped pages
>>> in page pool using an xarray. This was first suggested by Mina[0], and
>>> turns out to be fairly straight forward: We simply store pointers to
>>> pages directly in the xarray with xa_alloc() when they are first DMA
>>> mapped, and remove them from the array on unmap. Then, when a page pool
>>> is torn down, it can simply walk the xarray and unmap all pages still
>>> present there before returning, which also allows us to get rid of the
>>> get/put_device() calls in page_pool.
>>
>> THANK YOU!! I had been looking at the other proposals to fix this here
>> and there and I had similar feelings to you. They add lots of code
>> changes and the code changes themselves were hard for me to
>> understand. I hope we can make this simpler approach work.
> 
> You're welcome :)
> And yeah, me too!
> 
>>> Using xa_cmpxchg(), no additional
>>> synchronisation is needed, as a page will only ever be unmapped once.
>>>
>>
>> Very clever. I had been wondering how to handle the concurrency. I
>> also think this works.
> 
> Thanks!
> 
>>> To avoid having to walk the entire xarray on unmap to find the page
>>> reference, we stash the ID assigned by xa_alloc() into the page
>>> structure itself, in the field previously called '_pp_mapping_pad' in
>>> the page_pool struct inside struct page. This field overlaps with the
>>> page->mapping pointer, which may turn out to be problematic, so an
>>> alternative is probably needed. Sticking the ID into some of the upper
>>> bits of page->pp_magic may work as an alternative, but that requires
>>> further investigation. Using the 'mapping' field works well enough as
>>> a demonstration for this RFC, though.
>>>
>>
>> I'm unsure about this. I think page->mapping may be used when we map
>> the page to the userspace in TCP zerocopy, but I'm really not sure.
>> Yes, finding somewhere else to put the id would be ideal. Do we really
>> need a full unsigned long for the pp_magic?
> 
> No, pp_magic was also my backup plan (see the other thread). Tried
> actually doing that now, and while there's a bit of complication due to
> the varying definitions of POISON_POINTER_DELTA across architectures,
> but it seems that this can be defined at compile time. I'll send a v2
> RFC with this change.

FWIW, personally I like this one much more than an extra indirection
to pp.

If we're out of space in the page, why can't we use struct page *
as indices into the xarray? Ala

struct page *p = ...;
xa_store(xarray, index=(unsigned long)p, p);

Indices wouldn't be nicely packed, but it's still a map. Is there
a problem with that I didn't consider?
Toke Høiland-Jørgensen March 11, 2025, 1:26 p.m. UTC | #13
Yunsheng Lin <linyunsheng@huawei.com> writes:

> On 2025/3/10 23:24, Toke Høiland-Jørgensen wrote:
>
>>>
>>> I guess that is one of the disadvantages that an advanced struct like
>>> Xarray is used:(
>> 
>> Sure, there will be some overhead from using xarray, but I think the
>> simplicity makes up for it; especially since we can limit this to the
>
> As my understanding, it is more complicated, it is just that
> complexity is hidden before xarray now.

Yes, which encapsulates the complexity into a shared abstraction that is
widely used in the kernel, so it does not add new complexity to the
kernel as a whole. Whereas your series adds a whole bunch of new
complexity to the kernel in the form of a new slab allocator.

> Even if there is no space in 'struct page' to store the id, the
> 'struct page' pointer itself can be used as id if the xarray can
> use pointer as id. But it might mean the memory utilization might
> not be as efficient as it should be, and performance hurts too if
> there is more memory to be allocated and freed.

I don't think it can. But sure, there can be other ways around this.

FWIW, I don't think your idea of allocating page_pool_items to use as an
indirection is totally crazy, but all the complexity around it (the
custom slab allocator etc) is way too much. And if we can avoid the item
indirection that is obviously better.

> It seems it is just a matter of choices between using tailor-made
> page_pool specific optimization and using some generic advanced
> struct like xarray.

Yup, basically.

> I chose the tailor-made one because it ensure least overhead as
> much as possibe from performance and memory utilization perspective,
> for example, the 'single producer, multiple consumer' guarantee
> offered by NAPI context can avoid some lock and atomic operation.

Right, and my main point is that the complexity of this is not worth it :)

>> cases where it's absolutely needed.
>
> The above can also be done for using page_pool_item too as the
> lower 2 bits can be used to indicate the pointer in 'struct page'
> is 'page_pool_item' or 'page_pool', I just don't think it is
> necessary yet as it might add more checking in the fast path.

Yup, did think about using the lower bits to distinguish if it does turn
out that we can't avoid an indirection. See above; it's not actually the
page_pool_item concept that is my main issue with your series.

-Toke
Toke Høiland-Jørgensen March 11, 2025, 1:44 p.m. UTC | #14
Pavel Begunkov <asml.silence@gmail.com> writes:

> On 3/9/25 12:42, Toke Høiland-Jørgensen wrote:
>> Mina Almasry <almasrymina@google.com> writes:
>> 
>>> On Sat, Mar 8, 2025 at 6:55 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>>>>
>>>> When enabling DMA mapping in page_pool, pages are kept DMA mapped until
>>>> they are released from the pool, to avoid the overhead of re-mapping the
>>>> pages every time they are used. This causes problems when a device is
>>>> torn down, because the page pool can't unmap the pages until they are
>>>> returned to the pool. This causes resource leaks and/or crashes when
>>>> there are pages still outstanding while the device is torn down, because
>>>> page_pool will attempt an unmap of a non-existent DMA device on the
>>>> subsequent page return.
>>>>
>>>> To fix this, implement a simple tracking of outstanding dma-mapped pages
>>>> in page pool using an xarray. This was first suggested by Mina[0], and
>>>> turns out to be fairly straight forward: We simply store pointers to
>>>> pages directly in the xarray with xa_alloc() when they are first DMA
>>>> mapped, and remove them from the array on unmap. Then, when a page pool
>>>> is torn down, it can simply walk the xarray and unmap all pages still
>>>> present there before returning, which also allows us to get rid of the
>>>> get/put_device() calls in page_pool.
>>>
>>> THANK YOU!! I had been looking at the other proposals to fix this here
>>> and there and I had similar feelings to you. They add lots of code
>>> changes and the code changes themselves were hard for me to
>>> understand. I hope we can make this simpler approach work.
>> 
>> You're welcome :)
>> And yeah, me too!
>> 
>>>> Using xa_cmpxchg(), no additional
>>>> synchronisation is needed, as a page will only ever be unmapped once.
>>>>
>>>
>>> Very clever. I had been wondering how to handle the concurrency. I
>>> also think this works.
>> 
>> Thanks!
>> 
>>>> To avoid having to walk the entire xarray on unmap to find the page
>>>> reference, we stash the ID assigned by xa_alloc() into the page
>>>> structure itself, in the field previously called '_pp_mapping_pad' in
>>>> the page_pool struct inside struct page. This field overlaps with the
>>>> page->mapping pointer, which may turn out to be problematic, so an
>>>> alternative is probably needed. Sticking the ID into some of the upper
>>>> bits of page->pp_magic may work as an alternative, but that requires
>>>> further investigation. Using the 'mapping' field works well enough as
>>>> a demonstration for this RFC, though.
>>>>
>>>
>>> I'm unsure about this. I think page->mapping may be used when we map
>>> the page to the userspace in TCP zerocopy, but I'm really not sure.
>>> Yes, finding somewhere else to put the id would be ideal. Do we really
>>> need a full unsigned long for the pp_magic?
>> 
>> No, pp_magic was also my backup plan (see the other thread). Tried
>> actually doing that now, and while there's a bit of complication due to
>> the varying definitions of POISON_POINTER_DELTA across architectures,
>> but it seems that this can be defined at compile time. I'll send a v2
>> RFC with this change.
>
> FWIW, personally I like this one much more than an extra indirection
> to pp.
>
> If we're out of space in the page, why can't we use struct page *
> as indices into the xarray? Ala
>
> struct page *p = ...;
> xa_store(xarray, index=(unsigned long)p, p);
>
> Indices wouldn't be nicely packed, but it's still a map. Is there
> a problem with that I didn't consider?

Huh. As I just replied to Yunsheng, I was under the impression that this
was not supported. But since you're now the second person to suggest
this, I looked again, and it looks like I was wrong. There does indeed
seem to be other places in the kernel that does this.

As you say the indices won't be as densely packed, though. So I'm
wondering if using the bits in pp_magic would be better in any case to
get the better packing? I guess we can try benchmarking both approaches
and see if there's a measurable difference.

-Toke
Pavel Begunkov March 11, 2025, 1:56 p.m. UTC | #15
On 3/11/25 13:44, Toke Høiland-Jørgensen wrote:
> Pavel Begunkov <asml.silence@gmail.com> writes:
> 
>> On 3/9/25 12:42, Toke Høiland-Jørgensen wrote:
>>> Mina Almasry <almasrymina@google.com> writes:
...
>>> No, pp_magic was also my backup plan (see the other thread). Tried
>>> actually doing that now, and while there's a bit of complication due to
>>> the varying definitions of POISON_POINTER_DELTA across architectures,
>>> but it seems that this can be defined at compile time. I'll send a v2
>>> RFC with this change.
>>
>> FWIW, personally I like this one much more than an extra indirection
>> to pp.
>>
>> If we're out of space in the page, why can't we use struct page *
>> as indices into the xarray? Ala
>>
>> struct page *p = ...;
>> xa_store(xarray, index=(unsigned long)p, p);
>>
>> Indices wouldn't be nicely packed, but it's still a map. Is there
>> a problem with that I didn't consider?
> 
> Huh. As I just replied to Yunsheng, I was under the impression that this
> was not supported. But since you're now the second person to suggest
> this, I looked again, and it looks like I was wrong. There does indeed
> seem to be other places in the kernel that does this.

And I just noticed there is an entire discussion my email
client didn't pull :)

At least that's likely the easiest solution. Depends on how
complicated it is to fit the index in, but there is an option
to just go with it and continue the discussion on how to
improve it on top.

> As you say the indices won't be as densely packed, though. So I'm
> wondering if using the bits in pp_magic would be better in any case to
> get the better packing? I guess we can try benchmarking both approaches
> and see if there's a measurable difference.
Matthew Wilcox March 11, 2025, 3:03 p.m. UTC | #16
On Mon, Mar 10, 2025 at 06:26:23PM +0100, Toke Høiland-Jørgensen wrote:
> Matthew Wilcox <willy@infradead.org> writes:
> > See https://kernelnewbies.org/MatthewWilcox/Memdescs
> > and more immediately
> > https://kernelnewbies.org/MatthewWilcox/Memdescs/Path
> >
> > pagepool is going to be renamed "bump" because it's a bump allocator and
> > "pagepool" is a nonsense name.  I haven't looked into it in a lot of
> > detail yet, but in the not-too-distant future, struct page will look
> > like this (from your point of view):
> >
> > struct page {
> > 	unsigned long flags;
> > 	unsigned long memdesc;
> > 	int _refcount;	// 0 for bump
> > 	union {
> > 		unsigned long private;
> > 		atomic_t _mapcount; // maybe used by bump?  not sure
> > 	};
> > };
> >
> > 'memdesc' will be a pointer to struct bump with the bottom four bits of
> > that pointer indicating that it's a struct bump pointer (and not, say, a
> > folio or a slab).
> >
> > So if you allocate a multi-page bump, you'll get N of these pages,
> > and they'll all point to the same struct bump where you'll maintain
> > your actual refcount.  And you'll be able to grow struct bump to your
> > heart's content.  I don't know exactly what struct bump looks like,
> > but the core mm will have no requirements on you.
> 
> Ah, excellent, thanks for the pointer!
> 
> Out of curiosity, why "bump"? Is that a term of art somewhere?

https://en.wikipedia.org/wiki/Region-based_memory_management

(and the term "bump allocator" has a number of hits in your favourite
search engine)

> And in the meantime (until those patches land), do you see any reason
> why we can't squat on the middle bits of page->pp_magic (AKA page->lru)
> like I'm doing in v2[0] of this patch?

I haven't had time to dig into this series.  I'm trying to get a bunch
of things finished before LSFMM.
Matthew Wilcox March 11, 2025, 3:11 p.m. UTC | #17
On Tue, Mar 11, 2025 at 08:25:25PM +0800, Yunsheng Lin wrote:
> > struct page {
> > 	unsigned long flags;
> > 	unsigned long memdesc;
> 
> It seems there may be memory behind the above 'memdesc' with different size
> and layout for different subsystem?

Yes.

> I am not sure if I understand the case of the same page might be handle in
> two subsystems concurrently or a page is allocated in one subsystem and
> then passed to be handled in other subsystem, for examlpe:
> page_pool owned page is mmap'ed into user space through tcp zero copy,
> see tcp_zerocopy_vm_insert_batch(), it seems the same page is handled in
> both networking/page_pool and vm subsystem?

It's not that arbitrary.  I mean, you could read all the documentation
I've written about this concept, listen to the talks I've given.  But
sure, you're a special fucking snowflake and deserve your own unique
explanation.
Toke Høiland-Jørgensen March 11, 2025, 3:32 p.m. UTC | #18
Matthew Wilcox <willy@infradead.org> writes:

> On Mon, Mar 10, 2025 at 06:26:23PM +0100, Toke Høiland-Jørgensen wrote:
>> Matthew Wilcox <willy@infradead.org> writes:
>> > See https://kernelnewbies.org/MatthewWilcox/Memdescs
>> > and more immediately
>> > https://kernelnewbies.org/MatthewWilcox/Memdescs/Path
>> >
>> > pagepool is going to be renamed "bump" because it's a bump allocator and
>> > "pagepool" is a nonsense name.  I haven't looked into it in a lot of
>> > detail yet, but in the not-too-distant future, struct page will look
>> > like this (from your point of view):
>> >
>> > struct page {
>> > 	unsigned long flags;
>> > 	unsigned long memdesc;
>> > 	int _refcount;	// 0 for bump
>> > 	union {
>> > 		unsigned long private;
>> > 		atomic_t _mapcount; // maybe used by bump?  not sure
>> > 	};
>> > };
>> >
>> > 'memdesc' will be a pointer to struct bump with the bottom four bits of
>> > that pointer indicating that it's a struct bump pointer (and not, say, a
>> > folio or a slab).
>> >
>> > So if you allocate a multi-page bump, you'll get N of these pages,
>> > and they'll all point to the same struct bump where you'll maintain
>> > your actual refcount.  And you'll be able to grow struct bump to your
>> > heart's content.  I don't know exactly what struct bump looks like,
>> > but the core mm will have no requirements on you.
>> 
>> Ah, excellent, thanks for the pointer!
>> 
>> Out of curiosity, why "bump"? Is that a term of art somewhere?
>
> https://en.wikipedia.org/wiki/Region-based_memory_management
>
> (and the term "bump allocator" has a number of hits in your favourite
> search engine)

Right, fair point, I was being lazy by just asking. Thanks for taking
the time to provide a link :)

>> And in the meantime (until those patches land), do you see any reason
>> why we can't squat on the middle bits of page->pp_magic (AKA page->lru)
>> like I'm doing in v2[0] of this patch?
>
> I haven't had time to dig into this series.  I'm trying to get a bunch
> of things finished before LSFMM.

Alright, well if/when you get a chance (before of after LSFMM), I'd
appreciate if you could take a look!

-Toke
Toke Høiland-Jørgensen March 11, 2025, 3:49 p.m. UTC | #19
Pavel Begunkov <asml.silence@gmail.com> writes:

> On 3/11/25 13:44, Toke Høiland-Jørgensen wrote:
>> Pavel Begunkov <asml.silence@gmail.com> writes:
>> 
>>> On 3/9/25 12:42, Toke Høiland-Jørgensen wrote:
>>>> Mina Almasry <almasrymina@google.com> writes:
> ...
>>>> No, pp_magic was also my backup plan (see the other thread). Tried
>>>> actually doing that now, and while there's a bit of complication due to
>>>> the varying definitions of POISON_POINTER_DELTA across architectures,
>>>> but it seems that this can be defined at compile time. I'll send a v2
>>>> RFC with this change.
>>>
>>> FWIW, personally I like this one much more than an extra indirection
>>> to pp.
>>>
>>> If we're out of space in the page, why can't we use struct page *
>>> as indices into the xarray? Ala
>>>
>>> struct page *p = ...;
>>> xa_store(xarray, index=(unsigned long)p, p);
>>>
>>> Indices wouldn't be nicely packed, but it's still a map. Is there
>>> a problem with that I didn't consider?
>> 
>> Huh. As I just replied to Yunsheng, I was under the impression that this
>> was not supported. But since you're now the second person to suggest
>> this, I looked again, and it looks like I was wrong. There does indeed
>> seem to be other places in the kernel that does this.
>
> And I just noticed there is an entire discussion my email
> client didn't pull :)
>
> At least that's likely the easiest solution. Depends on how
> complicated it is to fit the index in, but there is an option
> to just go with it and continue the discussion on how to
> improve it on top.

Didn't seem to be too complicated, assuming no problems appear with
using the middle bits of the pp_magic field. See v2 of the RFC, or here
for the latest version:

https://git.kernel.org/toke/c/df6248a71f85

-Toke
Matthew Wilcox March 11, 2025, 4:46 p.m. UTC | #20
On Tue, Mar 11, 2025 at 02:44:15PM +0100, Toke Høiland-Jørgensen wrote:
> Pavel Begunkov <asml.silence@gmail.com> writes:
> > If we're out of space in the page, why can't we use struct page *
> > as indices into the xarray? Ala
> >
> > struct page *p = ...;
> > xa_store(xarray, index=(unsigned long)p, p);
> >
> > Indices wouldn't be nicely packed, but it's still a map. Is there
> > a problem with that I didn't consider?
> 
> Huh. As I just replied to Yunsheng, I was under the impression that this
> was not supported. But since you're now the second person to suggest
> this, I looked again, and it looks like I was wrong. There does indeed
> seem to be other places in the kernel that does this.
> 
> As you say the indices won't be as densely packed, though. So I'm
> wondering if using the bits in pp_magic would be better in any case to
> get the better packing? I guess we can try benchmarking both approaches
> and see if there's a measurable difference.

This is an absolutely terrible idea, only proposed by those who have no
understanding of how the XArray works.  It could not be more wasteful.
diff mbox series

Patch

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 0234f14f2aa6..d2c7a7b04bea 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -121,7 +121,7 @@  struct page {
 			 */
 			unsigned long pp_magic;
 			struct page_pool *pp;
-			unsigned long _pp_mapping_pad;
+			unsigned long pp_dma_index;
 			unsigned long dma_addr;
 			atomic_long_t pp_ref_count;
 		};
diff --git a/include/net/page_pool/types.h b/include/net/page_pool/types.h
index 36eb57d73abc..13597a77aa36 100644
--- a/include/net/page_pool/types.h
+++ b/include/net/page_pool/types.h
@@ -221,6 +221,8 @@  struct page_pool {
 	void *mp_priv;
 	const struct memory_provider_ops *mp_ops;
 
+	struct xarray dma_mapped;
+
 #ifdef CONFIG_PAGE_POOL_STATS
 	/* recycle stats are per-cpu to avoid locking */
 	struct page_pool_recycle_stats __percpu *recycle_stats;
diff --git a/net/core/netmem_priv.h b/net/core/netmem_priv.h
index 7eadb8393e00..59679406a7b7 100644
--- a/net/core/netmem_priv.h
+++ b/net/core/netmem_priv.h
@@ -28,4 +28,21 @@  static inline void netmem_set_dma_addr(netmem_ref netmem,
 {
 	__netmem_clear_lsb(netmem)->dma_addr = dma_addr;
 }
+
+static inline unsigned long netmem_get_dma_index(netmem_ref netmem)
+{
+	if (WARN_ON_ONCE(netmem_is_net_iov(netmem)))
+		return 0;
+
+	return netmem_to_page(netmem)->pp_dma_index;
+}
+
+static inline void netmem_set_dma_index(netmem_ref netmem,
+					unsigned long id)
+{
+	if (WARN_ON_ONCE(netmem_is_net_iov(netmem)))
+		return;
+
+	netmem_to_page(netmem)->pp_dma_index = id;
+}
 #endif
diff --git a/net/core/page_pool.c b/net/core/page_pool.c
index acef1fcd8ddc..d5530f29bf62 100644
--- a/net/core/page_pool.c
+++ b/net/core/page_pool.c
@@ -226,6 +226,8 @@  static int page_pool_init(struct page_pool *pool,
 			return -EINVAL;
 
 		pool->dma_map = true;
+
+		xa_init_flags(&pool->dma_mapped, XA_FLAGS_ALLOC1);
 	}
 
 	if (pool->slow.flags & PP_FLAG_DMA_SYNC_DEV) {
@@ -275,9 +277,6 @@  static int page_pool_init(struct page_pool *pool,
 	/* Driver calling page_pool_create() also call page_pool_destroy() */
 	refcount_set(&pool->user_cnt, 1);
 
-	if (pool->dma_map)
-		get_device(pool->p.dev);
-
 	if (pool->slow.flags & PP_FLAG_ALLOW_UNREADABLE_NETMEM) {
 		/* We rely on rtnl_lock()ing to make sure netdev_rx_queue
 		 * configuration doesn't change while we're initializing
@@ -325,7 +324,7 @@  static void page_pool_uninit(struct page_pool *pool)
 	ptr_ring_cleanup(&pool->ring, NULL);
 
 	if (pool->dma_map)
-		put_device(pool->p.dev);
+		xa_destroy(&pool->dma_mapped);
 
 #ifdef CONFIG_PAGE_POOL_STATS
 	if (!pool->system)
@@ -470,9 +469,11 @@  page_pool_dma_sync_for_device(const struct page_pool *pool,
 		__page_pool_dma_sync_for_device(pool, netmem, dma_sync_size);
 }
 
-static bool page_pool_dma_map(struct page_pool *pool, netmem_ref netmem)
+static bool page_pool_dma_map(struct page_pool *pool, netmem_ref netmem, gfp_t gfp)
 {
 	dma_addr_t dma;
+	int err;
+	u32 id;
 
 	/* Setup DMA mapping: use 'struct page' area for storing DMA-addr
 	 * since dma_addr_t can be either 32 or 64 bits and does not always fit
@@ -486,9 +487,19 @@  static bool page_pool_dma_map(struct page_pool *pool, netmem_ref netmem)
 	if (dma_mapping_error(pool->p.dev, dma))
 		return false;
 
+	if (in_softirq())
+		err = xa_alloc(&pool->dma_mapped, &id, netmem_to_page(netmem),
+			       XA_LIMIT(1, UINT_MAX), gfp);
+	else
+		err = xa_alloc_bh(&pool->dma_mapped, &id, netmem_to_page(netmem),
+				  XA_LIMIT(1, UINT_MAX), gfp);
+	if (err)
+		goto unmap_failed;
+
 	if (page_pool_set_dma_addr_netmem(netmem, dma))
 		goto unmap_failed;
 
+	netmem_set_dma_index(netmem, id);
 	page_pool_dma_sync_for_device(pool, netmem, pool->p.max_len);
 
 	return true;
@@ -511,7 +522,7 @@  static struct page *__page_pool_alloc_page_order(struct page_pool *pool,
 	if (unlikely(!page))
 		return NULL;
 
-	if (pool->dma_map && unlikely(!page_pool_dma_map(pool, page_to_netmem(page)))) {
+	if (pool->dma_map && unlikely(!page_pool_dma_map(pool, page_to_netmem(page), gfp))) {
 		put_page(page);
 		return NULL;
 	}
@@ -557,7 +568,7 @@  static noinline netmem_ref __page_pool_alloc_pages_slow(struct page_pool *pool,
 	 */
 	for (i = 0; i < nr_pages; i++) {
 		netmem = pool->alloc.cache[i];
-		if (dma_map && unlikely(!page_pool_dma_map(pool, netmem))) {
+		if (dma_map && unlikely(!page_pool_dma_map(pool, netmem, gfp))) {
 			put_page(netmem_to_page(netmem));
 			continue;
 		}
@@ -659,6 +670,8 @@  void page_pool_clear_pp_info(netmem_ref netmem)
 static __always_inline void __page_pool_release_page_dma(struct page_pool *pool,
 							 netmem_ref netmem)
 {
+	struct page *old, *page = netmem_to_page(netmem);
+	unsigned long id;
 	dma_addr_t dma;
 
 	if (!pool->dma_map)
@@ -667,6 +680,17 @@  static __always_inline void __page_pool_release_page_dma(struct page_pool *pool,
 		 */
 		return;
 
+	id = netmem_get_dma_index(netmem);
+	if (!id)
+		return;
+
+	if (in_softirq())
+		old = xa_cmpxchg(&pool->dma_mapped, id, page, NULL, 0);
+	else
+		old = xa_cmpxchg_bh(&pool->dma_mapped, id, page, NULL, 0);
+	if (old != page)
+		return;
+
 	dma = page_pool_get_dma_addr_netmem(netmem);
 
 	/* When page is unmapped, it cannot be returned to our pool */
@@ -674,6 +698,7 @@  static __always_inline void __page_pool_release_page_dma(struct page_pool *pool,
 			     PAGE_SIZE << pool->p.order, pool->p.dma_dir,
 			     DMA_ATTR_SKIP_CPU_SYNC | DMA_ATTR_WEAK_ORDERING);
 	page_pool_set_dma_addr_netmem(netmem, 0);
+	netmem_set_dma_index(netmem, 0);
 }
 
 /* Disconnects a page (from a page_pool).  API users can have a need
@@ -1083,8 +1108,13 @@  static void page_pool_empty_alloc_cache_once(struct page_pool *pool)
 
 static void page_pool_scrub(struct page_pool *pool)
 {
+	unsigned long id;
+	void *ptr;
+
 	page_pool_empty_alloc_cache_once(pool);
-	pool->destroy_cnt++;
+	if (!pool->destroy_cnt++)
+		xa_for_each(&pool->dma_mapped, id, ptr)
+			__page_pool_release_page_dma(pool, page_to_netmem(ptr));
 
 	/* No more consumers should exist, but producers could still
 	 * be in-flight.