Message ID | 1643237300-44904-1-git-send-email-jdamato@fastly.com
---|---
Series | net: page_pool: Add page_pool stat counters
On 26/01/2022 23.48, Joe Damato wrote:
> Greetings:
>
> This series adds some stat counters for the page_pool allocation path which
> help to track:
>
> - fast path allocations
> - slow path order-0 allocations
> - slow path high order allocations
> - refills which failed due to an empty ptr ring, forcing a slow
>   path allocation
> - allocations fulfilled via successful refill
> - pages which cannot be added to the cache because of numa mismatch
>   (i.e. waived)
>
> Some static inline wrappers are provided for accessing these stats. The
> intention is that drivers which use the page_pool API can, if they choose,
> use this stats API.

You are adding (always-on) counters to a critical fast path that drivers
use for the XDP_DROP use-case. I want to see performance measurements as
documentation, showing this is not causing a slow-down.

I have some performance tests here[1]:
[1] https://github.com/netoptimizer/prototype-kernel/tree/master/kernel/lib

Look at:
- bench_page_pool_simple.c and
- bench_page_pool_cross_cpu.c

How to use + build this[2]:
[2] https://prototype-kernel.readthedocs.io/en/latest/prototype-kernel/build-process.html

> It is assumed that the API consumer will ensure the page_pool is not
> destroyed during calls to the stats API.
>
> If this series is accepted, I'll submit a follow-up patch which will export
> these stats per RX-ring via ethtool in a driver which uses the page_pool
> API.
>
> Joe Damato (6):
>   net: page_pool: Add alloc stats and fast path stat
>   net: page_pool: Add a stat for the slow alloc path
>   net: page_pool: Add a high order alloc stat
>   net: page_pool: Add stat tracking empty ring
>   net: page_pool: Add stat tracking cache refills.
>   net: page_pool: Add a stat tracking waived pages.
>
>  include/net/page_pool.h | 82 +++++++++++++++++++++++++++++++++++++++++++++++++
>  net/core/page_pool.c    | 15 +++++++--
>  2 files changed, 94 insertions(+), 3 deletions(-)
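For readers without the patches open, a minimal sketch of the kind of
per-pool counters and static inline accessor the cover letter describes
might look like the following; the field, struct, and function names here
are assumptions for illustration, not the actual contents of the series,
and the alloc_stats member is a hypothetical addition to struct page_pool:

	#include <linux/types.h>
	#include <net/page_pool.h>

	/* Hypothetical per-pool allocation counters (names are assumptions). */
	struct page_pool_alloc_stats {
		u64 fast;		/* fast-path allocations */
		u64 slow;		/* slow-path order-0 allocations */
		u64 slow_high_order;	/* slow-path high-order allocations */
		u64 empty;		/* ptr ring was empty, forced slow path */
		u64 refill;		/* allocations fulfilled via refill */
		u64 waived;		/* pages not cached due to NUMA mismatch */
	};

	/* Caller must keep the pool alive for the duration of the call. */
	static inline u64 page_pool_stats_get_fast(const struct page_pool *pool)
	{
		return pool->alloc_stats.fast;	/* assumed new member */
	}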
Hi Joe,

On Wed, Jan 26, 2022 at 02:48:14PM -0800, Joe Damato wrote:
> Greetings:
>
> This series adds some stat counters for the page_pool allocation path which
> help to track:
>
> - fast path allocations
> - slow path order-0 allocations
> - slow path high order allocations
> - refills which failed due to an empty ptr ring, forcing a slow
>   path allocation
> - allocations fulfilled via successful refill
> - pages which cannot be added to the cache because of numa mismatch
>   (i.e. waived)
>

Thanks for the patch. Stats are something that's indeed missing from the
API. The patch should work for Rx-based allocations (which is what you
currently cover), since the RX side is usually protected by NAPI. However,
we've added a few features recently which we would like to have stats on.

Commit 6a5bcd84e886 ("page_pool: Allow drivers to hint on SKB recycling")
introduces recycling capabilities on the API. I think it would be far more
interesting to be able to extend the statistics to recycled/non-recycled
packets as well in the future. But the recycling is asynchronous, and we
can't add locks just for the sake of accurate statistics. Can we instead
convert that to a per-cpu structure for producers?

Thanks!
/Ilias

> Some static inline wrappers are provided for accessing these stats. The
> intention is that drivers which use the page_pool API can, if they choose,
> use this stats API.
>
> It is assumed that the API consumer will ensure the page_pool is not
> destroyed during calls to the stats API.
>
> If this series is accepted, I'll submit a follow-up patch which will export
> these stats per RX-ring via ethtool in a driver which uses the page_pool
> API.
>
> Joe Damato (6):
>   net: page_pool: Add alloc stats and fast path stat
>   net: page_pool: Add a stat for the slow alloc path
>   net: page_pool: Add a high order alloc stat
>   net: page_pool: Add stat tracking empty ring
>   net: page_pool: Add stat tracking cache refills.
>   net: page_pool: Add a stat tracking waived pages.
>
>  include/net/page_pool.h | 82 +++++++++++++++++++++++++++++++++++++++++++++++++
>  net/core/page_pool.c    | 15 +++++++--
>  2 files changed, 94 insertions(+), 3 deletions(-)
>
> --
> 2.7.4
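As a rough illustration of the lock-free per-cpu idea floated above for the
asynchronous recycling (producer) path, a minimal sketch under the
assumption of a hypothetical '__percpu' counter block attached to the pool;
the struct, member, and function names are invented for the example and are
not code from this series:

	#include <linux/percpu.h>
	#include <linux/types.h>
	#include <net/page_pool.h>

	/* Hypothetical per-pool, per-cpu recycle counters. */
	struct page_pool_recycle_stats {
		u64 cached;	/* pages returned to the lockless per-pool cache */
		u64 ring;	/* pages returned to the ptr ring */
	};

	/*
	 * Producer (recycling) side: each CPU bumps its own copy, so no lock
	 * is needed even though recycling is asynchronous. Assumes a
	 * 'struct page_pool_recycle_stats __percpu *recycle_stats' member
	 * in struct page_pool, which does not exist in the posted series.
	 */
	static inline void page_pool_recycle_ring_stat(struct page_pool *pool)
	{
		this_cpu_inc(pool->recycle_stats->ring);
	}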
On Thu, Jan 27, 2022 at 12:51 AM Jesper Dangaard Brouer
<jbrouer@redhat.com> wrote:
>
> On 26/01/2022 23.48, Joe Damato wrote:
> > Greetings:
> >
> > This series adds some stat counters for the page_pool allocation path which
> > help to track:
> >
> > - fast path allocations
> > - slow path order-0 allocations
> > - slow path high order allocations
> > - refills which failed due to an empty ptr ring, forcing a slow
> >   path allocation
> > - allocations fulfilled via successful refill
> > - pages which cannot be added to the cache because of numa mismatch
> >   (i.e. waived)
> >
> > Some static inline wrappers are provided for accessing these stats. The
> > intention is that drivers which use the page_pool API can, if they choose,
> > use this stats API.
>
> You are adding (always-on) counters to a critical fast path that
> drivers use for the XDP_DROP use-case.

If you prefer requiring users to explicitly enable these stats, I am
happy to add a kernel config option (e.g. CONFIG_PAGE_POOL_DEBUG or
similar) in a v2.

> I want to see performance measurements as documentation, showing this is
> not causing a slow-down.
>
> I have some performance tests here[1]:
> [1]
> https://github.com/netoptimizer/prototype-kernel/tree/master/kernel/lib
>
> Look at:
> - bench_page_pool_simple.c and
> - bench_page_pool_cross_cpu.c
>
> How to use + build this[2]:
> [2]
> https://prototype-kernel.readthedocs.io/en/latest/prototype-kernel/build-process.html

Thanks for the pointers to the benchmarks.

In general, I noted that the benchmark results varied fairly
substantially between repeated runs on the same system.

Results below suggest that:
- bench_page_pool_simple is faster on the test kernel, and
- bench_page_pool_cross_cpu is faster on the control.

Subsequent runs of bench_page_pool_cross_cpu on the control, however,
reveal *much* slower results than shown below.

Test system:
- 2 x Intel(R) Xeon(R) Gold 6140 CPU @ 2.30GHz
- 2 NUMA nodes, with 18 cores per node and 2 threads per core

Control kernel: built from net-next at commit e2cf07654efb ("ptp:
replace snprintf with sysfs_emit").
Test kernel: this series applied on top of the control kernel mentioned above.

Raw output from dmesg for the control [1] and test [2] kernels is
summarized below:

bench_page_pool_simple
- run with default options (i.e. "sudo modprobe bench_page_pool_simple").
Control:

Type:for_loop Per elem: 0 cycles(tsc) 0.334 ns (step:0)
Type:atomic_inc Per elem: 13 cycles(tsc) 6.021 ns (step:0)
Type:lock Per elem: 31 cycles(tsc) 13.514 ns (step:0)

Type:no-softirq-page_pool01 Per elem: 44 cycles(tsc) 19.549 ns (step:0)
Type:no-softirq-page_pool02 Per elem: 45 cycles(tsc) 19.658 ns (step:0)
Type:no-softirq-page_pool03 Per elem: 118 cycles(tsc) 51.638 ns (step:0)

Type:tasklet_page_pool01_fast_path Per elem: 17 cycles(tsc) 7.472 ns (step:0)
Type:tasklet_page_pool02_ptr_ring Per elem: 42 cycles(tsc) 18.585 ns (step:0)
Type:tasklet_page_pool03_slow Per elem: 109 cycles(tsc) 47.807 ns (step:0)

Test:

Type:for_loop Per elem: 0 cycles(tsc) 0.334 ns (step:0)
Type:atomic_inc Per elem: 14 cycles(tsc) 6.195 ns (step:0)
Type:lock Per elem: 31 cycles(tsc) 13.827 ns (step:0)

Type:no-softirq-page_pool01 Per elem: 44 cycles(tsc) 19.561 ns (step:0)
Type:no-softirq-page_pool02 Per elem: 45 cycles(tsc) 19.700 ns (step:0)
Type:no-softirq-page_pool03 Per elem: 108 cycles(tsc) 47.186 ns (step:0)

Type:tasklet_page_pool01_fast_path Per elem: 12 cycles(tsc) 5.447 ns (step:0)
Type:tasklet_page_pool02_ptr_ring Per elem: 42 cycles(tsc) 18.501 ns (step:0)
Type:tasklet_page_pool03_slow Per elem: 106 cycles(tsc) 46.313 ns (step:0)

bench_page_pool_cross_cpu
- run with default options (i.e. "sudo modprobe bench_page_pool_cross_cpu").

Control:
Type:page_pool_cross_cpu CPU(0) 1795 cycles(tsc) 782.567 ns (step:2)
Type:page_pool_cross_cpu CPU(1) 1921 cycles(tsc) 837.435 ns (step:2)
Type:page_pool_cross_cpu CPU(2) 960 cycles(tsc) 418.758 ns (step:2)
Sum Type:page_pool_cross_cpu Average: 1558 cycles(tsc) CPUs:3 step:2

Test:
Type:page_pool_cross_cpu CPU(0) 2411 cycles(tsc) 1051.037 ns (step:2)
Type:page_pool_cross_cpu CPU(1) 2467 cycles(tsc) 1075.204 ns (step:2)
Type:page_pool_cross_cpu CPU(2) 1233 cycles(tsc) 537.629 ns (step:2)
Type:page_pool_cross_cpu Average: 2037 cycles(tsc) CPUs:3 step:2

[1]: https://gist.githubusercontent.com/jdamato-fsly/385806f06cb95c61ff8cecf7a3645e75/raw/886e3208f5b9c47abdd59bdaa7ecf27994f477b1/page_pool_bench_control
[2]: https://gist.githubusercontent.com/jdamato-fsly/385806f06cb95c61ff8cecf7a3645e75/raw/886e3208f5b9c47abdd59bdaa7ecf27994f477b1/page_pool_bench_TESTKERNEL

> > It is assumed that the API consumer will ensure the page_pool is not
> > destroyed during calls to the stats API.
> >
> > If this series is accepted, I'll submit a follow-up patch which will export
> > these stats per RX-ring via ethtool in a driver which uses the page_pool
> > API.
> >
> > Joe Damato (6):
> >   net: page_pool: Add alloc stats and fast path stat
> >   net: page_pool: Add a stat for the slow alloc path
> >   net: page_pool: Add a high order alloc stat
> >   net: page_pool: Add stat tracking empty ring
> >   net: page_pool: Add stat tracking cache refills.
> >   net: page_pool: Add a stat tracking waived pages.
> >
> >  include/net/page_pool.h | 82 +++++++++++++++++++++++++++++++++++++++++++++++++
> >  net/core/page_pool.c    | 15 +++++++--
> >  2 files changed, 94 insertions(+), 3 deletions(-)
> >
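On the CONFIG_PAGE_POOL_DEBUG idea offered at the top of the message above,
one way the counters could be made to compile out entirely when the option
is off is a small wrapper macro. This is only an illustration: the Kconfig
symbol and macro names are made up, and it assumes the hypothetical
'alloc_stats' member sketched earlier in the thread:

	#ifdef CONFIG_PAGE_POOL_DEBUG
	#define alloc_stat_inc(pool, __stat)	((pool)->alloc_stats.__stat++)
	#else
	#define alloc_stat_inc(pool, __stat)	do { } while (0)
	#endif

	/* Usage in the allocation fast path would then be a single line,
	 * e.g.:  alloc_stat_inc(pool, fast);
	 * which becomes a no-op when the option is disabled. */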
On Thu, Jan 27, 2022 at 1:08 AM Ilias Apalodimas
<ilias.apalodimas@linaro.org> wrote:
>
> Hi Joe,
>
> On Wed, Jan 26, 2022 at 02:48:14PM -0800, Joe Damato wrote:
> > Greetings:
> >
> > This series adds some stat counters for the page_pool allocation path which
> > help to track:
> >
> > - fast path allocations
> > - slow path order-0 allocations
> > - slow path high order allocations
> > - refills which failed due to an empty ptr ring, forcing a slow
> >   path allocation
> > - allocations fulfilled via successful refill
> > - pages which cannot be added to the cache because of numa mismatch
> >   (i.e. waived)
> >
>
> Thanks for the patch. Stats are something that's indeed missing from the
> API. The patch should work for Rx-based allocations (which is what you
> currently cover), since the RX side is usually protected by NAPI. However
> we've added a few features recently, which we would like to have stats on.

Thanks for taking a look at the patch.

> Commit 6a5bcd84e886 ("page_pool: Allow drivers to hint on SKB recycling")
> introduces recycling capabilities on the API. I think it would be far more
> interesting to be able to extend the statistics to recycled/non-recycled
> packets as well in the future.

I agree. Tracking recycling events would be both helpful and
interesting, indeed.

> But the recycling is asynchronous and we
> can't add locks just for the sake of accurate statistics.

Agreed.

> Can we instead
> convert that to a per-cpu structure for producers?

If my understanding of your proposal is accurate, moving the stats
structure to a per-cpu structure (instead of per-pool) would add
ambiguity as to the performance of a specific driver's page pool. In
exchange for the ambiguity, though, we'd get stats for additional
events, which could be interesting.

It seems like under load it might be very useful to know that a
particular driver's page pool is adding pressure to the buddy
allocator in the slow path. I suppose that a user could move softirqs
around on their system to alleviate some of the ambiguity, and perhaps
that is good enough.

> Thanks!
> /Ilias
>
> > Some static inline wrappers are provided for accessing these stats. The
> > intention is that drivers which use the page_pool API can, if they choose,
> > use this stats API.
> >
> > It is assumed that the API consumer will ensure the page_pool is not
> > destroyed during calls to the stats API.
> >
> > If this series is accepted, I'll submit a follow-up patch which will export
> > these stats per RX-ring via ethtool in a driver which uses the page_pool
> > API.
> >
> > Joe Damato (6):
> >   net: page_pool: Add alloc stats and fast path stat
> >   net: page_pool: Add a stat for the slow alloc path
> >   net: page_pool: Add a high order alloc stat
> >   net: page_pool: Add stat tracking empty ring
> >   net: page_pool: Add stat tracking cache refills.
> >   net: page_pool: Add a stat tracking waived pages.
> >
> >  include/net/page_pool.h | 82 +++++++++++++++++++++++++++++++++++++++++++++++++
> >  net/core/page_pool.c    | 15 +++++++--
> >  2 files changed, 94 insertions(+), 3 deletions(-)
> >
> > --
> > 2.7.4
> >
Hi Joe!

On Thu, Jan 27, 2022 at 03:55:03PM -0800, Joe Damato wrote:
> On Thu, Jan 27, 2022 at 1:08 AM Ilias Apalodimas
> <ilias.apalodimas@linaro.org> wrote:
> >
> > Hi Joe,
> >
> > On Wed, Jan 26, 2022 at 02:48:14PM -0800, Joe Damato wrote:
> > > Greetings:
> > >
> > > This series adds some stat counters for the page_pool allocation path which
> > > help to track:
> > >
> > > - fast path allocations
> > > - slow path order-0 allocations
> > > - slow path high order allocations
> > > - refills which failed due to an empty ptr ring, forcing a slow
> > >   path allocation
> > > - allocations fulfilled via successful refill
> > > - pages which cannot be added to the cache because of numa mismatch
> > >   (i.e. waived)
> > >
> >
> > Thanks for the patch. Stats are something that's indeed missing from the
> > API. The patch should work for Rx-based allocations (which is what you
> > currently cover), since the RX side is usually protected by NAPI. However
> > we've added a few features recently, which we would like to have stats on.
>
> Thanks for taking a look at the patch.
>

yw

> > Commit 6a5bcd84e886 ("page_pool: Allow drivers to hint on SKB recycling")
> > introduces recycling capabilities on the API. I think it would be far more
> > interesting to be able to extend the statistics to recycled/non-recycled
> > packets as well in the future.
>
> I agree. Tracking recycling events would be both helpful and
> interesting, indeed.
>
> > But the recycling is asynchronous and we
> > can't add locks just for the sake of accurate statistics.
>
> Agreed.
>
> > Can we instead
> > convert that to a per-cpu structure for producers?
>
> If my understanding of your proposal is accurate, moving the stats
> structure to a per-cpu structure (instead of per-pool) would add
> ambiguity as to the performance of a specific driver's page pool. In
> exchange for the ambiguity, though, we'd get stats for additional
> events, which could be interesting.

I was mostly thinking per pool, using either 'struct percpu_counter' or
allocated __percpu variables, but I haven't really checked if that's
doable or which of those is better suited for our case.

> It seems like under load it might be very useful to know that a
> particular driver's page pool is adding pressure to the buddy
> allocator in the slow path. I suppose that a user could move softirqs
> around on their system to alleviate some of the ambiguity, and perhaps
> that is good enough.

[...]

Cheers
/Ilias
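For reference, the two options mentioned above differ mainly in how the sum
is maintained; a minimal sketch of each, with names invented purely for the
example and no connection to the posted series:

	#include <linux/cpumask.h>
	#include <linux/errno.h>
	#include <linux/gfp.h>
	#include <linux/percpu.h>
	#include <linux/percpu_counter.h>
	#include <linux/types.h>

	/* Option 1: struct percpu_counter -- per-cpu updates, summed read. */
	static struct percpu_counter pp_recycle_ring;

	static int pp_counter_init(void)
	{
		return percpu_counter_init(&pp_recycle_ring, 0, GFP_KERNEL);
	}

	static void pp_recycle_ring_hit(void)
	{
		percpu_counter_inc(&pp_recycle_ring);	/* lock-free fast path */
	}

	/* Option 2: plain __percpu storage -- reader walks the CPUs itself. */
	struct pp_recycle_stats {
		u64 ring;
	};
	static struct pp_recycle_stats __percpu *pp_stats;

	static int pp_percpu_init(void)
	{
		pp_stats = alloc_percpu(struct pp_recycle_stats);
		return pp_stats ? 0 : -ENOMEM;
	}

	static void pp_recycle_ring_hit2(void)
	{
		this_cpu_inc(pp_stats->ring);
	}

	static u64 pp_recycle_ring_total(void)
	{
		u64 sum = 0;
		int cpu;

		for_each_possible_cpu(cpu)
			sum += per_cpu_ptr(pp_stats, cpu)->ring;
		return sum;
	}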
On Sat, Jan 29, 2022 at 6:07 AM Ilias Apalodimas
<ilias.apalodimas@linaro.org> wrote:
>
> Hi Joe!
>
> On Thu, Jan 27, 2022 at 03:55:03PM -0800, Joe Damato wrote:
> > On Thu, Jan 27, 2022 at 1:08 AM Ilias Apalodimas
> > <ilias.apalodimas@linaro.org> wrote:
> > >
> > > Hi Joe,
> > >
> > > On Wed, Jan 26, 2022 at 02:48:14PM -0800, Joe Damato wrote:
> > > > Greetings:
> > > >
> > > > This series adds some stat counters for the page_pool allocation path which
> > > > help to track:
> > > >
> > > > - fast path allocations
> > > > - slow path order-0 allocations
> > > > - slow path high order allocations
> > > > - refills which failed due to an empty ptr ring, forcing a slow
> > > >   path allocation
> > > > - allocations fulfilled via successful refill
> > > > - pages which cannot be added to the cache because of numa mismatch
> > > >   (i.e. waived)
> > > >
> > >
> > > Thanks for the patch. Stats are something that's indeed missing from the
> > > API. The patch should work for Rx-based allocations (which is what you
> > > currently cover), since the RX side is usually protected by NAPI. However
> > > we've added a few features recently, which we would like to have stats on.
> >
> > Thanks for taking a look at the patch.
> >
>
> yw
>
> > > Commit 6a5bcd84e886 ("page_pool: Allow drivers to hint on SKB recycling")
> > > introduces recycling capabilities on the API. I think it would be far more
> > > interesting to be able to extend the statistics to recycled/non-recycled
> > > packets as well in the future.
> >
> > I agree. Tracking recycling events would be both helpful and
> > interesting, indeed.
> >
> > > But the recycling is asynchronous and we
> > > can't add locks just for the sake of accurate statistics.
> >
> > Agreed.
> >
> > > Can we instead
> > > convert that to a per-cpu structure for producers?
> >
> > If my understanding of your proposal is accurate, moving the stats
> > structure to a per-cpu structure (instead of per-pool) would add
> > ambiguity as to the performance of a specific driver's page pool. In
> > exchange for the ambiguity, though, we'd get stats for additional
> > events, which could be interesting.
>
> I was mostly thinking per pool, using either 'struct percpu_counter' or
> allocated __percpu variables, but I haven't really checked if that's
> doable or which of those is better suited for our case.

I wrote up a v2 last night that allocates and exports a page_pool_stats
structure per CPU (but not per pool). The data can be accessed by users
in the file /proc/net/page_pool_stats. The approach is similar to the
way softnet_stat is implemented.

The main advantage of this approach is that no driver modifications are
needed and no additional APIs are exposed that will need to be
maintained. Adding new stats in the future would be much simpler with
this approach.

I've also moved all the code behind a kernel config flag so users can
opt in to get these stats.

I'll send the v2 shortly.

Thanks,
Joe
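A minimal sketch of the per-CPU (not per-pool) layout described above,
modelled loosely on how softnet_stat is kept; the struct name, field set,
and /proc output format are assumptions about the upcoming v2 rather than
its actual contents:

	#include <linux/percpu.h>
	#include <linux/types.h>

	/* One instance per CPU for the whole system, not per page_pool. */
	struct page_pool_stats {
		u64 fast;	/* fast-path allocations */
		u64 slow;	/* slow-path allocations */
		u64 empty;	/* ptr ring empty on refill */
		u64 refill;	/* allocations fulfilled via refill */
		u64 waived;	/* NUMA-mismatched pages not cached */
	};

	DEFINE_PER_CPU(struct page_pool_stats, pp_stats);

	/* Allocation path: bump the local CPU's copy; no locking required. */
	static inline void pp_stat_fast_inc(void)
	{
		this_cpu_inc(pp_stats.fast);
	}

	/* A /proc/net/page_pool_stats reader would then print one line per
	 * CPU, e.g. via per_cpu_ptr(&pp_stats, cpu) in a seq_file handler. */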
On 27/01/2022 22.08, Joe Damato wrote:
> On Thu, Jan 27, 2022 at 12:51 AM Jesper Dangaard Brouer
> <jbrouer@redhat.com> wrote:
>>
>> On 26/01/2022 23.48, Joe Damato wrote:
>>> Greetings:
>>>
>>> This series adds some stat counters for the page_pool allocation path which
>>> help to track:
>>>
>>> - fast path allocations
>>> - slow path order-0 allocations
>>> - slow path high order allocations
>>> - refills which failed due to an empty ptr ring, forcing a slow
>>>   path allocation
>>> - allocations fulfilled via successful refill
>>> - pages which cannot be added to the cache because of numa mismatch
>>>   (i.e. waived)
>>>
>>> Some static inline wrappers are provided for accessing these stats. The
>>> intention is that drivers which use the page_pool API can, if they choose,
>>> use this stats API.
>>
>> You are adding (always-on) counters to a critical fast path that
>> drivers use for the XDP_DROP use-case.
>
> If you prefer requiring users to explicitly enable these stats, I am
> happy to add a kernel config option (e.g. CONFIG_PAGE_POOL_DEBUG or
> similar) in a v2.
>
>> I want to see performance measurements as documentation, showing this is
>> not causing a slow-down.
>>
>> I have some performance tests here[1]:
>> [1]
>> https://github.com/netoptimizer/prototype-kernel/tree/master/kernel/lib
>>
>> Look at:
>> - bench_page_pool_simple.c and
>> - bench_page_pool_cross_cpu.c
>>
>> How to use + build this[2]:
>> [2]
>> https://prototype-kernel.readthedocs.io/en/latest/prototype-kernel/build-process.html
>
> Thanks for the pointers to the benchmarks.
>
> In general, I noted that the benchmark results varied fairly
> substantially between repeated runs on the same system.
>
> Results below suggest that:
> - bench_page_pool_simple is faster on the test kernel, and
> - bench_page_pool_cross_cpu is faster on the control.
>
> Subsequent runs of bench_page_pool_cross_cpu on the control, however,
> reveal *much* slower results than shown below.
>
> Test system:
> - 2 x Intel(R) Xeon(R) Gold 6140 CPU @ 2.30GHz
> - 2 NUMA nodes, with 18 cores per node and 2 threads per core
>
> Control kernel: built from net-next at commit e2cf07654efb ("ptp:
> replace snprintf with sysfs_emit").
> Test kernel: this series applied on top of the control kernel mentioned above.
>
> Raw output from dmesg for the control [1] and test [2] kernels is
> summarized below:
>
> bench_page_pool_simple
> - run with default options (i.e. "sudo modprobe bench_page_pool_simple").
>
> Control:
>
> Type:for_loop Per elem: 0 cycles(tsc) 0.334 ns (step:0)
> Type:atomic_inc Per elem: 13 cycles(tsc) 6.021 ns (step:0)
> Type:lock Per elem: 31 cycles(tsc) 13.514 ns (step:0)
>
> Type:no-softirq-page_pool01 Per elem: 44 cycles(tsc) 19.549 ns (step:0)
> Type:no-softirq-page_pool02 Per elem: 45 cycles(tsc) 19.658 ns (step:0)
> Type:no-softirq-page_pool03 Per elem: 118 cycles(tsc) 51.638 ns (step:0)
>
> Type:tasklet_page_pool01_fast_path Per elem: 17 cycles(tsc) 7.472 ns (step:0)
> Type:tasklet_page_pool02_ptr_ring Per elem: 42 cycles(tsc) 18.585 ns (step:0)
> Type:tasklet_page_pool03_slow Per elem: 109 cycles(tsc) 47.807 ns (step:0)
>
> Test:
>
> Type:for_loop Per elem: 0 cycles(tsc) 0.334 ns (step:0)
> Type:atomic_inc Per elem: 14 cycles(tsc) 6.195 ns (step:0)
> Type:lock Per elem: 31 cycles(tsc) 13.827 ns (step:0)
>
> Type:no-softirq-page_pool01 Per elem: 44 cycles(tsc) 19.561 ns (step:0)
> Type:no-softirq-page_pool02 Per elem: 45 cycles(tsc) 19.700 ns (step:0)
> Type:no-softirq-page_pool03 Per elem: 108 cycles(tsc) 47.186 ns (step:0)
>
> Type:tasklet_page_pool01_fast_path Per elem: 12 cycles(tsc) 5.447 ns (step:0)

Watch out for the:
  measurement period time:0.054478253
(taken from [2]).

If the measurement period becomes too small, you/we cannot use the
results. Perhaps I've set the default 'loops' variable too low for these
modern systems. Hint: it is adjustable via the module parameter 'loops'.

> Type:tasklet_page_pool02_ptr_ring Per elem: 42 cycles(tsc) 18.501 ns (step:0)
> Type:tasklet_page_pool03_slow Per elem: 106 cycles(tsc) 46.313 ns (step:0)
>
> bench_page_pool_cross_cpu
> - run with default options (i.e. "sudo modprobe bench_page_pool_cross_cpu").
>
> Control:
> Type:page_pool_cross_cpu CPU(0) 1795 cycles(tsc) 782.567 ns (step:2)
> Type:page_pool_cross_cpu CPU(1) 1921 cycles(tsc) 837.435 ns (step:2)
> Type:page_pool_cross_cpu CPU(2) 960 cycles(tsc) 418.758 ns (step:2)
> Sum Type:page_pool_cross_cpu Average: 1558 cycles(tsc) CPUs:3 step:2
>
> Test:
> Type:page_pool_cross_cpu CPU(0) 2411 cycles(tsc) 1051.037 ns (step:2)
> Type:page_pool_cross_cpu CPU(1) 2467 cycles(tsc) 1075.204 ns (step:2)
> Type:page_pool_cross_cpu CPU(2) 1233 cycles(tsc) 537.629 ns (step:2)
> Type:page_pool_cross_cpu Average: 2037 cycles(tsc) CPUs:3 step:2

I think the effect you are seeing here is because you placed your stats
struct on a cache line that is also used by remote CPUs freeing/recycling
pages back to the page_pool.

> [1]: https://gist.githubusercontent.com/jdamato-fsly/385806f06cb95c61ff8cecf7a3645e75/raw/886e3208f5b9c47abdd59bdaa7ecf27994f477b1/page_pool_bench_control
> [2]: https://gist.githubusercontent.com/jdamato-fsly/385806f06cb95c61ff8cecf7a3645e75/raw/886e3208f5b9c47abdd59bdaa7ecf27994f477b1/page_pool_bench_TESTKERNEL
>
>
>>> It is assumed that the API consumer will ensure the page_pool is not
>>> destroyed during calls to the stats API.
>>>
>>> If this series is accepted, I'll submit a follow-up patch which will export
>>> these stats per RX-ring via ethtool in a driver which uses the page_pool
>>> API.
>>>
>>> Joe Damato (6):
>>>   net: page_pool: Add alloc stats and fast path stat
>>>   net: page_pool: Add a stat for the slow alloc path
>>>   net: page_pool: Add a high order alloc stat
>>>   net: page_pool: Add stat tracking empty ring
>>>   net: page_pool: Add stat tracking cache refills.
>>>   net: page_pool: Add a stat tracking waived pages.
>>>
>>>  include/net/page_pool.h | 82 +++++++++++++++++++++++++++++++++++++++++++++++++
>>>  net/core/page_pool.c    | 15 +++++++--
>>>  2 files changed, 94 insertions(+), 3 deletions(-)
>>>
>>
>
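To illustrate the false-sharing concern raised above (the stats landing on
a cache line that remote CPUs dirty when returning pages), a generic sketch
of keeping the counters on their own line. This is not the real struct
page_pool layout; the struct and field names are invented for the example:

	#include <linux/atomic.h>
	#include <linux/cache.h>
	#include <linux/types.h>

	struct pool_like {
		/* hot fields read/written only by the allocating (NAPI) CPU */
		u32 cache_count;

		/* allocator-side stats kept on their own cache line */
		struct {
			u64 fast;
			u64 slow;
		} alloc_stats ____cacheline_aligned_in_smp;

		/* fields touched by remote CPUs freeing/recycling pages */
		atomic_t pages_in_flight ____cacheline_aligned_in_smp;
	};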