[00/24] Split page pools from struct page

Message ID: 20221130220803.3657490-1-willy@infradead.org
Matthew Wilcox (Oracle) Nov. 30, 2022, 10:07 p.m. UTC
The MM subsystem is trying to reduce struct page to a single pointer.
The first step towards that is splitting struct page by its individual
users, as has already been done with folio and slab.  This attempt chooses
'netmem' as a name, but I am not even slightly committed to that name,
and will happily use another.
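For readers unfamiliar with the folio/slab pattern: the split works by declaring a new type that overlays the same memory as struct page, with casting helpers between the two views. The sketch below is illustrative only, using simplified userspace stand-ins rather than the actual patch; the field names and helper names here are assumptions in the style of the series, not its real definitions.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified userspace stand-in for struct page; the real layout differs. */
struct page {
	unsigned long flags;
	unsigned long pp_magic;
	void *pp;		/* page_pool back-pointer */
	unsigned long dma_addr;
};

/* Hypothetical netmem: a network-specific view of the same memory. */
struct netmem {
	unsigned long flags;
	unsigned long pp_magic;
	void *pp;
	unsigned long dma_addr;
};

/* The overlay only works if the two layouts agree, so pin that down. */
_Static_assert(offsetof(struct page, pp) == offsetof(struct netmem, pp),
	       "netmem must mirror the page layout");

/* Casting helpers in the folio style: same pointer, different type. */
static inline struct netmem *page_netmem(struct page *page)
{
	return (struct netmem *)page;
}

static inline struct page *netmem_page(struct netmem *nmem)
{
	return (struct page *)nmem;
}
```

The compiler can then check that page_pool code only touches the network-relevant fields, while the underlying allocation is still an ordinary page.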

There are some relatively significant reductions in kernel text
size from these changes.  I'm not qualified to judge how they
might affect performance, but every call to put_page() includes
a call to compound_head(), which is now rather more complex
than it once was (at least in a distro config which enables
CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP).
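To see where that cost comes from: put_page() must first find the head page of a potentially compound page. The toy model below shows the classic fast path, in which tail pages store the head page's address with bit 0 set; with CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP enabled, the kernel additionally has to check for "fake" head pages on hugetlb tail mappings, which is the extra complexity referred to above. This is a simplified userspace sketch, not the kernel code.

```c
#include <assert.h>
#include <stdint.h>

/* Toy struct page: tail pages encode (head_address | 1) in compound_head. */
struct page {
	unsigned long compound_head;
};

static struct page *compound_head(struct page *page)
{
	unsigned long head = page->compound_head;	/* READ_ONCE() in the kernel */

	if (head & 1)		/* bit 0 set: this is a tail page */
		return (struct page *)(head - 1);
	return page;		/* head page (or an order-0 page) */
}
```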

I've only converted one user of the page_pool APIs to use the new netmem
APIs, all the others continue to use the page based ones.

Uh, I see I left netmem_to_virt() as its own commit instead of squashing
it into "netmem: Add utility functions".  I'll fix that in the next
version, because I'm sure you'll want some changes anyway.

Happy to answer questions.

Matthew Wilcox (Oracle) (24):
  netmem: Create new type
  netmem: Add utility functions
  page_pool: Add netmem_set_dma_addr() and netmem_get_dma_addr()
  page_pool: Convert page_pool_release_page() to
    page_pool_release_netmem()
  page_pool: Start using netmem in allocation path.
  page_pool: Convert page_pool_return_page() to
    page_pool_return_netmem()
  page_pool: Convert __page_pool_put_page() to __page_pool_put_netmem()
  page_pool: Convert pp_alloc_cache to contain netmem
  page_pool: Convert page_pool_defrag_page() to
    page_pool_defrag_netmem()
  page_pool: Convert page_pool_put_defragged_page() to netmem
  page_pool: Convert page_pool_empty_ring() to use netmem
  page_pool: Convert page_pool_alloc_pages() to page_pool_alloc_netmem()
  page_pool: Convert page_pool_dma_sync_for_device() to take a netmem
  page_pool: Convert page_pool_recycle_in_cache() to netmem
  page_pool: Remove page_pool_defrag_page()
  page_pool: Use netmem in page_pool_drain_frag()
  page_pool: Convert page_pool_return_skb_page() to use netmem
  page_pool: Convert frag_page to frag_nmem
  xdp: Convert to netmem
  mm: Remove page pool members from struct page
  netmem_to_virt
  page_pool: Pass a netmem to init_callback()
  net: Add support for netmem in skb_frag
  mvneta: Convert to netmem
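Judging from the commit titles above, the driver-facing change is essentially a mechanical rename plus a type change. The stubs below are a hypothetical, userspace-only sketch of the shape of the new API; the signatures are simplified and the bodies invented (the real functions take gfp flags and draw from the pool's caches):

```c
#include <assert.h>
#include <stdlib.h>

/* Minimal stand-ins so the sketch is self-contained. */
struct netmem { void *vaddr; };
struct page_pool { int unused; };

/* Sketch of the allocation API named in the series (simplified signature). */
static struct netmem *page_pool_alloc_netmem(struct page_pool *pool)
{
	struct netmem *nmem = calloc(1, sizeof(*nmem));

	(void)pool;		/* the real pool supplies cached pages */
	if (nmem)
		nmem->vaddr = calloc(1, 4096);	/* stand-in for a page */
	return nmem;
}

/* The series adds this accessor (the "netmem_to_virt" commit). */
static void *netmem_to_virt(struct netmem *nmem)
{
	return nmem->vaddr;
}

static void page_pool_put_netmem(struct page_pool *pool, struct netmem *nmem)
{
	(void)pool;
	free(nmem->vaddr);
	free(nmem);
}
```

A driver conversion then replaces `struct page *page = page_pool_alloc_pages(...)` with `struct netmem *nmem = page_pool_alloc_netmem(...)` and uses the netmem accessors from there on.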

 drivers/net/ethernet/marvell/mvneta.c |  48 ++---
 include/linux/mm_types.h              |  22 ---
 include/linux/skbuff.h                |  11 ++
 include/net/page_pool.h               | 181 ++++++++++++++---
 include/trace/events/page_pool.h      |  28 +--
 net/bpf/test_run.c                    |   4 +-
 net/core/page_pool.c                  | 274 +++++++++++++-------------
 net/core/xdp.c                        |   7 +-
 8 files changed, 344 insertions(+), 231 deletions(-)


base-commit: 13ee7ef407cfcf63f4f047460ac5bb6ba5a3447d

Comments

Jesper Dangaard Brouer Dec. 5, 2022, 3:34 p.m. UTC | #1
On 30/11/2022 23.07, Matthew Wilcox (Oracle) wrote:
> The MM subsystem is trying to reduce struct page to a single pointer.
> The first step towards that is splitting struct page by its individual
> users, as has already been done with folio and slab.  This attempt chooses
> 'netmem' as a name, but I am not even slightly committed to that name,
> and will happily use another.

I've not been able to come up with a better name, so I'm okay with
'netmem'.  Others are of course free to bikeshed this ;-)

> There are some relatively significant reductions in kernel text
> size from these changes.  I'm not qualified to judge how they
> might affect performance, but every call to put_page() includes
> a call to compound_head(), which is now rather more complex
> than it once was (at least in a distro config which enables
> CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP).
> 

I have a micro-benchmark [1][2] that I want to run on this patchset.
Reducing the asm code 'text' size is less likely to improve a
microbenchmark. The 100Gbit mlx5 driver uses page_pool, so perhaps I can
run a packet benchmark that can show the (expected) performance improvement.

[1] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/bench_page_pool_simple.c
[2] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/bench_page_pool_cross_cpu.c

> I've only converted one user of the page_pool APIs to use the new netmem
> APIs, all the others continue to use the page based ones.
> 

I guess we netdev developers need to update the NIC drivers that use page_pool.

> Uh, I see I left netmem_to_virt() as its own commit instead of squashing
> it into "netmem: Add utility functions".  I'll fix that in the next
> version, because I'm sure you'll want some changes anyway.
> 
> Happy to answer questions.
> 
> Matthew Wilcox (Oracle) (24):
>    netmem: Create new type
>    netmem: Add utility functions
>    page_pool: Add netmem_set_dma_addr() and netmem_get_dma_addr()
>    page_pool: Convert page_pool_release_page() to
>      page_pool_release_netmem()
>    page_pool: Start using netmem in allocation path.
>    page_pool: Convert page_pool_return_page() to
>      page_pool_return_netmem()
>    page_pool: Convert __page_pool_put_page() to __page_pool_put_netmem()
>    page_pool: Convert pp_alloc_cache to contain netmem
>    page_pool: Convert page_pool_defrag_page() to
>      page_pool_defrag_netmem()
>    page_pool: Convert page_pool_put_defragged_page() to netmem
>    page_pool: Convert page_pool_empty_ring() to use netmem
>    page_pool: Convert page_pool_alloc_pages() to page_pool_alloc_netmem()
>    page_pool: Convert page_pool_dma_sync_for_device() to take a netmem
>    page_pool: Convert page_pool_recycle_in_cache() to netmem
>    page_pool: Remove page_pool_defrag_page()
>    page_pool: Use netmem in page_pool_drain_frag()
>    page_pool: Convert page_pool_return_skb_page() to use netmem
>    page_pool: Convert frag_page to frag_nmem
>    xdp: Convert to netmem
>    mm: Remove page pool members from struct page
>    netmem_to_virt
>    page_pool: Pass a netmem to init_callback()
>    net: Add support for netmem in skb_frag
>    mvneta: Convert to netmem
> 
>   drivers/net/ethernet/marvell/mvneta.c |  48 ++---
>   include/linux/mm_types.h              |  22 ---
>   include/linux/skbuff.h                |  11 ++
>   include/net/page_pool.h               | 181 ++++++++++++++---
>   include/trace/events/page_pool.h      |  28 +--
>   net/bpf/test_run.c                    |   4 +-
>   net/core/page_pool.c                  | 274 +++++++++++++-------------
>   net/core/xdp.c                        |   7 +-
>   8 files changed, 344 insertions(+), 231 deletions(-)
> 
> 
> base-commit: 13ee7ef407cfcf63f4f047460ac5bb6ba5a3447d
Ilias Apalodimas Dec. 5, 2022, 3:44 p.m. UTC | #2
Hi Jesper,

On Mon, Dec 05, 2022 at 04:34:10PM +0100, Jesper Dangaard Brouer wrote:
> 
> On 30/11/2022 23.07, Matthew Wilcox (Oracle) wrote:
> > The MM subsystem is trying to reduce struct page to a single pointer.
> > The first step towards that is splitting struct page by its individual
> > users, as has already been done with folio and slab.  This attempt chooses
> > 'netmem' as a name, but I am not even slightly committed to that name,
> > and will happily use another.
> 
> I've not been able to come-up with a better name, so I'm okay with
> 'netmem'.  Others are of-cause free to bikesheet this ;-)

Same here. But if anyone has a better name please shout.

> 
> > There are some relatively significant reductions in kernel text
> > size from these changes.  I'm not qualified to judge how they
> > might affect performance, but every call to put_page() includes
> > a call to compound_head(), which is now rather more complex
> > than it once was (at least in a distro config which enables
> > CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP).
> > 
> 
> I have a micro-benchmark [1][2], that I want to run on this patchset.
> Reducing the asm code 'text' size is less likely to improve a
> microbenchmark. The 100Gbit mlx5 driver uses page_pool, so perhaps I can
> run a packet benchmark that can show the (expected) performance improvement.
> 
> [1] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/bench_page_pool_simple.c
> [2] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/bench_page_pool_cross_cpu.c
> 

If you could give it a spin, that would be great.  I did apply the patchset
and it was running fine on my Arm box.  I was about to run these tests, but
then I remembered that they only work on x86.  I don't have any cards
supported by page_pool around.

> > I've only converted one user of the page_pool APIs to use the new netmem
> > APIs, all the others continue to use the page based ones.
> > 
> 
> I guess we/netdev-devels need to update the NIC drivers that uses page_pool.
> 
 
[...]

Regards
/Ilias
Matthew Wilcox (Oracle) Dec. 5, 2022, 4:31 p.m. UTC | #3
On Mon, Dec 05, 2022 at 04:34:10PM +0100, Jesper Dangaard Brouer wrote:
> I have a micro-benchmark [1][2], that I want to run on this patchset.
> Reducing the asm code 'text' size is less likely to improve a
> microbenchmark. The 100Gbit mlx5 driver uses page_pool, so perhaps I can
> run a packet benchmark that can show the (expected) performance improvement.
> 
> [1] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/bench_page_pool_simple.c
> [2] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/bench_page_pool_cross_cpu.c

Appreciate it!  I'm not expecting any performance change outside noise,
but things do surprise me.  I'd appreciate it if you'd test with a
"distro" config, ie enabling CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP so
we show the most expensive case.

> > I've only converted one user of the page_pool APIs to use the new netmem
> > APIs, all the others continue to use the page based ones.
> > 
> 
> I guess we/netdev-devels need to update the NIC drivers that uses page_pool.

Oh, it's not a huge amount of work, and I don't mind doing it.  I only
did one in order to show the kinds of changes that are needed.  I can
do the mlx5 conversion now ...
Jesper Dangaard Brouer Dec. 6, 2022, 9:43 a.m. UTC | #4
On 05/12/2022 17.31, Matthew Wilcox wrote:
> On Mon, Dec 05, 2022 at 04:34:10PM +0100, Jesper Dangaard Brouer wrote:
>> I have a micro-benchmark [1][2], that I want to run on this patchset.
>> Reducing the asm code 'text' size is less likely to improve a
>> microbenchmark. The 100Gbit mlx5 driver uses page_pool, so perhaps I can
>> run a packet benchmark that can show the (expected) performance improvement.
>>
>> [1] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/bench_page_pool_simple.c
>> [2] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/bench_page_pool_cross_cpu.c
> 
> Appreciate it!  I'm not expecting any performance change outside noise,
> but things do surprise me.  I'd appreciate it if you'd test with a
> "distro" config, ie enabling CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP so
> we show the most expensive case.
> 

I have CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP=y, but it isn't enabled at
runtime by default.

Should I also enable CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP_DEFAULT_ON,
or enable it via sysctl?

  $ grep -H . /proc/sys/vm/hugetlb_optimize_vmemmap
  /proc/sys/vm/hugetlb_optimize_vmemmap:0

--Jesper
Matthew Wilcox (Oracle) Dec. 6, 2022, 4:08 p.m. UTC | #5
On Tue, Dec 06, 2022 at 10:43:05AM +0100, Jesper Dangaard Brouer wrote:
> 
> 
> On 05/12/2022 17.31, Matthew Wilcox wrote:
> > On Mon, Dec 05, 2022 at 04:34:10PM +0100, Jesper Dangaard Brouer wrote:
> > > I have a micro-benchmark [1][2], that I want to run on this patchset.
> > > Reducing the asm code 'text' size is less likely to improve a
> > > microbenchmark. The 100Gbit mlx5 driver uses page_pool, so perhaps I can
> > > run a packet benchmark that can show the (expected) performance improvement.
> > > 
> > > [1] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/bench_page_pool_simple.c
> > > [2] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/bench_page_pool_cross_cpu.c
> > 
> > Appreciate it!  I'm not expecting any performance change outside noise,
> > but things do surprise me.  I'd appreciate it if you'd test with a
> > "distro" config, ie enabling CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP so
> > we show the most expensive case.
> > 
> 
> I have CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP=y BUT it isn't default
> runtime enabled.

That's fine.  I think the vast majority of machines won't actually have
it enabled.  It's mostly useful for hosting setups where allocating 1GB
pages for VMs is common.

The mlx5 driver was straightforward, but showed some gaps in the API.
You'd already got the majority of the wins by using page_ref_inc()
instead of get_page(), but I did find one put_page() ;-)
Jesper Dangaard Brouer Dec. 8, 2022, 3:33 p.m. UTC | #6
On 06/12/2022 17.08, Matthew Wilcox wrote:
> On Tue, Dec 06, 2022 at 10:43:05AM +0100, Jesper Dangaard Brouer wrote:
>>
>> On 05/12/2022 17.31, Matthew Wilcox wrote:
>>> On Mon, Dec 05, 2022 at 04:34:10PM +0100, Jesper Dangaard Brouer wrote:
>>>> I have a micro-benchmark [1][2], that I want to run on this patchset.
>>>> Reducing the asm code 'text' size is less likely to improve a
>>>> microbenchmark. The 100Gbit mlx5 driver uses page_pool, so perhaps I can
>>>> run a packet benchmark that can show the (expected) performance improvement.
>>>>
>>>> [1] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/bench_page_pool_simple.c
>>>> [2] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/bench_page_pool_cross_cpu.c
>>>
>>> Appreciate it!  I'm not expecting any performance change outside noise,
>>> but things do surprise me.  I'd appreciate it if you'd test with a
>>> "distro" config, ie enabling CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP so
>>> we show the most expensive case.

I've tested with [1] and [2] and the performance numbers are the same.

Microbench [1] is easiest to compare; the numbers below were basically
the same both with and without the patchset.

  Type:tasklet_page_pool01_fast_path Per elem: 16 cycles(tsc) 4.484 ns
  Type:tasklet_page_pool02_ptr_ring Per elem: 47 cycles(tsc) 13.147 ns
  Type:tasklet_page_pool03_slow Per elem: 173 cycles(tsc) 48.278 ns

The last line (173 cycles) is when pages are not recycled, but instead
returned to the system's page allocator.  To relate this to something:
allocating order-0 pages via the normal page allocator API costs approx
282 cycles(tsc) / 78.385 ns on this system (with this .config).  I
believe page_pool is faster because we leverage the bulk page allocator.
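The refill pattern referred to above can be sketched as follows: when the pool's cache runs dry, a single bulk call fills the whole array, so the allocator's fixed entry costs are amortized over many pages. This is a simplified userspace model of page_pool's slow path (the kernel refills pp_alloc_cache with alloc_pages_bulk_array()), with malloc() standing in for real page allocation:

```c
#include <assert.h>
#include <stdlib.h>

#define PP_ALLOC_CACHE_REFILL 64	/* batch size, as in page_pool */

struct pp_alloc_cache {
	unsigned int count;
	void *cache[PP_ALLOC_CACHE_REFILL];
};

/* Stand-in for alloc_pages_bulk_array(): fill as many slots as possible. */
static unsigned int alloc_bulk(unsigned int nr, void **array)
{
	unsigned int i;

	for (i = 0; i < nr; i++) {
		array[i] = malloc(4096);
		if (!array[i])
			break;
	}
	return i;	/* number of pages actually allocated */
}

/* Slow path: one bulk refill, then serve allocations from the cache. */
static void *cache_refill_and_pop(struct pp_alloc_cache *c)
{
	if (!c->count)
		c->count = alloc_bulk(PP_ALLOC_CACHE_REFILL, c->cache);
	if (!c->count)
		return NULL;
	return c->cache[--c->count];
}
```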

--Jesper