Message ID | 1643237300-44904-1-git-send-email-jdamato@fastly.com
---|---
Series | net: page_pool: Add page_pool stat counters
On 26/01/2022 23.48, Joe Damato wrote:
> Greetings:
>
> This series adds some stat counters for the page_pool allocation path which
> help to track:
>
> - fast path allocations
> - slow path order-0 allocations
> - slow path high order allocations
> - refills which failed due to an empty ptr ring, forcing a slow
>   path allocation
> - allocations fulfilled via successful refill
> - pages which cannot be added to the cache because of numa mismatch
>   (i.e. waived)
>
> Some static inline wrappers are provided for accessing these stats. The
> intention is that drivers which use the page_pool API can, if they choose,
> use this stats API.

You are adding (always-on) counters to a critical fast path that drivers
use for the XDP_DROP use-case. I want to see performance measurements as
documentation, showing this is not causing a slow-down.

I have some performance tests here[1]:
[1] https://github.com/netoptimizer/prototype-kernel/tree/master/kernel/lib

Look at:
- bench_page_pool_simple.c and
- bench_page_pool_cross_cpu.c

How to use + build this[2]:
[2] https://prototype-kernel.readthedocs.io/en/latest/prototype-kernel/build-process.html

> It is assumed that the API consumer will ensure the page_pool is not
> destroyed during calls to the stats API.
>
> If this series is accepted, I'll submit a follow-up patch which will export
> these stats per RX-ring via ethtool in a driver which uses the page_pool
> API.
>
> Joe Damato (6):
>   net: page_pool: Add alloc stats and fast path stat
>   net: page_pool: Add a stat for the slow alloc path
>   net: page_pool: Add a high order alloc stat
>   net: page_pool: Add stat tracking empty ring
>   net: page_pool: Add stat tracking cache refills.
>   net: page_pool: Add a stat tracking waived pages.
>
>  include/net/page_pool.h | 82 +++++++++++++++++++++++++++++++++++++++++++++++++
>  net/core/page_pool.c    | 15 +++++++--
>  2 files changed, 94 insertions(+), 3 deletions(-)
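For readers without the patches open, a minimal sketch of the kind of
per-pool counters and static inline accessor the cover letter describes
might look like the following; the field, struct, and function names here
are assumptions for illustration, not the actual contents of the series,
and the alloc_stats member is a hypothetical addition to struct page_pool:

	#include <linux/types.h>
	#include <net/page_pool.h>

	/* Hypothetical per-pool allocation counters (names are assumptions). */
	struct page_pool_alloc_stats {
		u64 fast;		/* fast-path allocations */
		u64 slow;		/* slow-path order-0 allocations */
		u64 slow_high_order;	/* slow-path high-order allocations */
		u64 empty;		/* ptr ring was empty, forced slow path */
		u64 refill;		/* allocations fulfilled via refill */
		u64 waived;		/* pages not cached due to NUMA mismatch */
	};

	/* Caller must keep the pool alive for the duration of the call. */
	static inline u64 page_pool_stats_get_fast(const struct page_pool *pool)
	{
		return pool->alloc_stats.fast;	/* assumed new member */
	}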
Hi Joe,

On Wed, Jan 26, 2022 at 02:48:14PM -0800, Joe Damato wrote:
> Greetings:
>
> This series adds some stat counters for the page_pool allocation path which
> help to track:
>
> - fast path allocations
> - slow path order-0 allocations
> - slow path high order allocations
> - refills which failed due to an empty ptr ring, forcing a slow
>   path allocation
> - allocations fulfilled via successful refill
> - pages which cannot be added to the cache because of numa mismatch
>   (i.e. waived)
>

Thanks for the patch. Stats are something that's indeed missing from the
API. The patch should work for Rx-based allocations (which is what you
currently cover), since the RX side is usually protected by NAPI. However,
we've added a few features recently which we would like to have stats on.

Commit 6a5bcd84e886 ("page_pool: Allow drivers to hint on SKB recycling")
introduces recycling capabilities on the API. I think it would be far more
interesting to be able to extend the statistics to recycled/non-recycled
packets as well in the future. But the recycling is asynchronous, and we
can't add locks just for the sake of accurate statistics. Can we instead
convert that to a per-cpu structure for producers?

Thanks!
/Ilias

> Some static inline wrappers are provided for accessing these stats. The
> intention is that drivers which use the page_pool API can, if they choose,
> use this stats API.
>
> It is assumed that the API consumer will ensure the page_pool is not
> destroyed during calls to the stats API.
>
> If this series is accepted, I'll submit a follow-up patch which will export
> these stats per RX-ring via ethtool in a driver which uses the page_pool
> API.
>
> Joe Damato (6):
>   net: page_pool: Add alloc stats and fast path stat
>   net: page_pool: Add a stat for the slow alloc path
>   net: page_pool: Add a high order alloc stat
>   net: page_pool: Add stat tracking empty ring
>   net: page_pool: Add stat tracking cache refills.
>   net: page_pool: Add a stat tracking waived pages.
>
>  include/net/page_pool.h | 82 +++++++++++++++++++++++++++++++++++++++++++++++++
>  net/core/page_pool.c    | 15 +++++++--
>  2 files changed, 94 insertions(+), 3 deletions(-)
>
> --
> 2.7.4
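As a rough illustration of the lock-free per-cpu idea floated above for the
asynchronous recycling (producer) path, a minimal sketch under the
assumption of a hypothetical '__percpu' counter block attached to the pool;
the struct, member, and function names are invented for the example and are
not code from this series:

	#include <linux/percpu.h>
	#include <linux/types.h>
	#include <net/page_pool.h>

	/* Hypothetical per-pool, per-cpu recycle counters. */
	struct page_pool_recycle_stats {
		u64 cached;	/* pages returned to the lockless per-pool cache */
		u64 ring;	/* pages returned to the ptr ring */
	};

	/*
	 * Producer (recycling) side: each CPU bumps its own copy, so no lock
	 * is needed even though recycling is asynchronous. Assumes a
	 * 'struct page_pool_recycle_stats __percpu *recycle_stats' member
	 * in struct page_pool, which does not exist in the posted series.
	 */
	static inline void page_pool_recycle_ring_stat(struct page_pool *pool)
	{
		this_cpu_inc(pool->recycle_stats->ring);
	}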
On Thu, Jan 27, 2022 at 12:51 AM Jesper Dangaard Brouer
<jbrouer@redhat.com> wrote:
>
> On 26/01/2022 23.48, Joe Damato wrote:
> > Greetings:
> >
> > This series adds some stat counters for the page_pool allocation path which
> > help to track:
> >
> > - fast path allocations
> > - slow path order-0 allocations
> > - slow path high order allocations
> > - refills which failed due to an empty ptr ring, forcing a slow
> >   path allocation
> > - allocations fulfilled via successful refill
> > - pages which cannot be added to the cache because of numa mismatch
> >   (i.e. waived)
> >
> > Some static inline wrappers are provided for accessing these stats. The
> > intention is that drivers which use the page_pool API can, if they choose,
> > use this stats API.
>
> You are adding (always-on) counters to a critical fast path that
> drivers use for the XDP_DROP use-case.

If you prefer requiring users to explicitly enable these stats, I am
happy to add a kernel config option (e.g. CONFIG_PAGE_POOL_DEBUG or
similar) in a v2.

> I want to see performance measurements as documentation, showing this is
> not causing a slow-down.
>
> I have some performance tests here[1]:
> [1]
> https://github.com/netoptimizer/prototype-kernel/tree/master/kernel/lib
>
> Look at:
> - bench_page_pool_simple.c and
> - bench_page_pool_cross_cpu.c
>
> How to use + build this[2]:
> [2]
> https://prototype-kernel.readthedocs.io/en/latest/prototype-kernel/build-process.html

Thanks for the pointers to the benchmarks.

In general, I noted that the benchmark results varied fairly
substantially between repeated runs on the same system.

Results below suggest that:
- bench_page_pool_simple is faster on the test kernel, and
- bench_page_pool_cross_cpu is faster on the control.

Subsequent runs of bench_page_pool_cross_cpu on the control, however,
reveal *much* slower results than shown below.

Test system:
- 2 x Intel(R) Xeon(R) Gold 6140 CPU @ 2.30GHz
- 2 NUMA nodes, with 18 cores per node and 2 threads per core

Control kernel: built from net-next at commit e2cf07654efb ("ptp:
replace snprintf with sysfs_emit").
Test kernel: this series applied on top of the control kernel mentioned above.

Raw output from dmesg for the control [1] and test [2] kernels is
summarized below:

bench_page_pool_simple
- run with default options (i.e. "sudo modprobe bench_page_pool_simple").
Control:

Type:for_loop Per elem: 0 cycles(tsc) 0.334 ns (step:0)
Type:atomic_inc Per elem: 13 cycles(tsc) 6.021 ns (step:0)
Type:lock Per elem: 31 cycles(tsc) 13.514 ns (step:0)

Type:no-softirq-page_pool01 Per elem: 44 cycles(tsc) 19.549 ns (step:0)
Type:no-softirq-page_pool02 Per elem: 45 cycles(tsc) 19.658 ns (step:0)
Type:no-softirq-page_pool03 Per elem: 118 cycles(tsc) 51.638 ns (step:0)

Type:tasklet_page_pool01_fast_path Per elem: 17 cycles(tsc) 7.472 ns (step:0)
Type:tasklet_page_pool02_ptr_ring Per elem: 42 cycles(tsc) 18.585 ns (step:0)
Type:tasklet_page_pool03_slow Per elem: 109 cycles(tsc) 47.807 ns (step:0)

Test:

Type:for_loop Per elem: 0 cycles(tsc) 0.334 ns (step:0)
Type:atomic_inc Per elem: 14 cycles(tsc) 6.195 ns (step:0)
Type:lock Per elem: 31 cycles(tsc) 13.827 ns (step:0)

Type:no-softirq-page_pool01 Per elem: 44 cycles(tsc) 19.561 ns (step:0)
Type:no-softirq-page_pool02 Per elem: 45 cycles(tsc) 19.700 ns (step:0)
Type:no-softirq-page_pool03 Per elem: 108 cycles(tsc) 47.186 ns (step:0)

Type:tasklet_page_pool01_fast_path Per elem: 12 cycles(tsc) 5.447 ns (step:0)
Type:tasklet_page_pool02_ptr_ring Per elem: 42 cycles(tsc) 18.501 ns (step:0)
Type:tasklet_page_pool03_slow Per elem: 106 cycles(tsc) 46.313 ns (step:0)

bench_page_pool_cross_cpu
- run with default options (i.e. "sudo modprobe bench_page_pool_cross_cpu").

Control:
Type:page_pool_cross_cpu CPU(0) 1795 cycles(tsc) 782.567 ns (step:2)
Type:page_pool_cross_cpu CPU(1) 1921 cycles(tsc) 837.435 ns (step:2)
Type:page_pool_cross_cpu CPU(2) 960 cycles(tsc) 418.758 ns (step:2)
Sum Type:page_pool_cross_cpu Average: 1558 cycles(tsc) CPUs:3 step:2

Test:
Type:page_pool_cross_cpu CPU(0) 2411 cycles(tsc) 1051.037 ns (step:2)
Type:page_pool_cross_cpu CPU(1) 2467 cycles(tsc) 1075.204 ns (step:2)
Type:page_pool_cross_cpu CPU(2) 1233 cycles(tsc) 537.629 ns (step:2)
Type:page_pool_cross_cpu Average: 2037 cycles(tsc) CPUs:3 step:2

[1]: https://gist.githubusercontent.com/jdamato-fsly/385806f06cb95c61ff8cecf7a3645e75/raw/886e3208f5b9c47abdd59bdaa7ecf27994f477b1/page_pool_bench_control
[2]: https://gist.githubusercontent.com/jdamato-fsly/385806f06cb95c61ff8cecf7a3645e75/raw/886e3208f5b9c47abdd59bdaa7ecf27994f477b1/page_pool_bench_TESTKERNEL

> > It is assumed that the API consumer will ensure the page_pool is not
> > destroyed during calls to the stats API.
> >
> > If this series is accepted, I'll submit a follow-up patch which will export
> > these stats per RX-ring via ethtool in a driver which uses the page_pool
> > API.
> >
> > Joe Damato (6):
> >   net: page_pool: Add alloc stats and fast path stat
> >   net: page_pool: Add a stat for the slow alloc path
> >   net: page_pool: Add a high order alloc stat
> >   net: page_pool: Add stat tracking empty ring
> >   net: page_pool: Add stat tracking cache refills.
> >   net: page_pool: Add a stat tracking waived pages.
> >
> >  include/net/page_pool.h | 82 +++++++++++++++++++++++++++++++++++++++++++++++++
> >  net/core/page_pool.c    | 15 +++++++--
> >  2 files changed, 94 insertions(+), 3 deletions(-)
> >
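On the CONFIG_PAGE_POOL_DEBUG idea offered at the top of the message above,
one way the counters could be made to compile out entirely when the option
is off is a small wrapper macro. This is only an illustration: the Kconfig
symbol and macro names are made up, and it assumes the hypothetical
'alloc_stats' member sketched earlier in the thread:

	#ifdef CONFIG_PAGE_POOL_DEBUG
	#define alloc_stat_inc(pool, __stat)	((pool)->alloc_stats.__stat++)
	#else
	#define alloc_stat_inc(pool, __stat)	do { } while (0)
	#endif

	/* Usage in the allocation fast path would then be a single line,
	 * e.g.:  alloc_stat_inc(pool, fast);
	 * which becomes a no-op when the option is disabled. */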
On Thu, Jan 27, 2022 at 1:08 AM Ilias Apalodimas
<ilias.apalodimas@linaro.org> wrote:
>
> Hi Joe,
>
> On Wed, Jan 26, 2022 at 02:48:14PM -0800, Joe Damato wrote:
> > Greetings:
> >
> > This series adds some stat counters for the page_pool allocation path which
> > help to track:
> >
> > - fast path allocations
> > - slow path order-0 allocations
> > - slow path high order allocations
> > - refills which failed due to an empty ptr ring, forcing a slow
> >   path allocation
> > - allocations fulfilled via successful refill
> > - pages which cannot be added to the cache because of numa mismatch
> >   (i.e. waived)
> >
>
> Thanks for the patch. Stats are something that's indeed missing from the
> API. The patch should work for Rx-based allocations (which is what you
> currently cover), since the RX side is usually protected by NAPI. However
> we've added a few features recently, which we would like to have stats on.

Thanks for taking a look at the patch.

> Commit 6a5bcd84e886 ("page_pool: Allow drivers to hint on SKB recycling")
> introduces recycling capabilities on the API. I think it would be far more
> interesting to be able to extend the statistics to recycled/non-recycled
> packets as well in the future.

I agree. Tracking recycling events would be both helpful and
interesting, indeed.

> But the recycling is asynchronous and we
> can't add locks just for the sake of accurate statistics.

Agreed.

> Can we instead
> convert that to a per-cpu structure for producers?

If my understanding of your proposal is accurate, moving the stats
structure to a per-cpu structure (instead of per-pool) would add
ambiguity as to the performance of a specific driver's page pool. In
exchange for the ambiguity, though, we'd get stats for additional
events, which could be interesting.

It seems like under load it might be very useful to know that a
particular driver's page pool is adding pressure to the buddy
allocator in the slow path. I suppose that a user could move softirqs
around on their system to alleviate some of the ambiguity, and perhaps
that is good enough.

> Thanks!
> /Ilias
>
> > Some static inline wrappers are provided for accessing these stats. The
> > intention is that drivers which use the page_pool API can, if they choose,
> > use this stats API.
> >
> > It is assumed that the API consumer will ensure the page_pool is not
> > destroyed during calls to the stats API.
> >
> > If this series is accepted, I'll submit a follow-up patch which will export
> > these stats per RX-ring via ethtool in a driver which uses the page_pool
> > API.
> >
> > Joe Damato (6):
> >   net: page_pool: Add alloc stats and fast path stat
> >   net: page_pool: Add a stat for the slow alloc path
> >   net: page_pool: Add a high order alloc stat
> >   net: page_pool: Add stat tracking empty ring
> >   net: page_pool: Add stat tracking cache refills.
> >   net: page_pool: Add a stat tracking waived pages.
> >
> >  include/net/page_pool.h | 82 +++++++++++++++++++++++++++++++++++++++++++++++++
> >  net/core/page_pool.c    | 15 +++++++--
> >  2 files changed, 94 insertions(+), 3 deletions(-)
> >
> > --
> > 2.7.4
> >
Hi Joe!

On Thu, Jan 27, 2022 at 03:55:03PM -0800, Joe Damato wrote:
> On Thu, Jan 27, 2022 at 1:08 AM Ilias Apalodimas
> <ilias.apalodimas@linaro.org> wrote:
> >
> > Hi Joe,
> >
> > On Wed, Jan 26, 2022 at 02:48:14PM -0800, Joe Damato wrote:
> > > Greetings:
> > >
> > > This series adds some stat counters for the page_pool allocation path which
> > > help to track:
> > >
> > > - fast path allocations
> > > - slow path order-0 allocations
> > > - slow path high order allocations
> > > - refills which failed due to an empty ptr ring, forcing a slow
> > >   path allocation
> > > - allocations fulfilled via successful refill
> > > - pages which cannot be added to the cache because of numa mismatch
> > >   (i.e. waived)
> > >
> >
> > Thanks for the patch. Stats are something that's indeed missing from the
> > API. The patch should work for Rx-based allocations (which is what you
> > currently cover), since the RX side is usually protected by NAPI. However
> > we've added a few features recently, which we would like to have stats on.
>
> Thanks for taking a look at the patch.
>

yw

> > Commit 6a5bcd84e886 ("page_pool: Allow drivers to hint on SKB recycling")
> > introduces recycling capabilities on the API. I think it would be far more
> > interesting to be able to extend the statistics to recycled/non-recycled
> > packets as well in the future.
>
> I agree. Tracking recycling events would be both helpful and
> interesting, indeed.
>
> > But the recycling is asynchronous and we
> > can't add locks just for the sake of accurate statistics.
>
> Agreed.
>
> > Can we instead
> > convert that to a per-cpu structure for producers?
>
> If my understanding of your proposal is accurate, moving the stats
> structure to a per-cpu structure (instead of per-pool) would add
> ambiguity as to the performance of a specific driver's page pool. In
> exchange for the ambiguity, though, we'd get stats for additional
> events, which could be interesting.

I was mostly thinking per pool, using either 'struct percpu_counter' or
allocated __percpu variables, but I haven't really checked if that's
doable or which of those is better suited for our case.

> It seems like under load it might be very useful to know that a
> particular driver's page pool is adding pressure to the buddy
> allocator in the slow path. I suppose that a user could move softirqs
> around on their system to alleviate some of the ambiguity, and perhaps
> that is good enough.

[...]

Cheers
/Ilias
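For reference, the two options mentioned above differ mainly in how the sum
is maintained; a minimal sketch of each, with names invented purely for the
example and no connection to the posted series:

	#include <linux/cpumask.h>
	#include <linux/errno.h>
	#include <linux/gfp.h>
	#include <linux/percpu.h>
	#include <linux/percpu_counter.h>
	#include <linux/types.h>

	/* Option 1: struct percpu_counter -- per-cpu updates, summed read. */
	static struct percpu_counter pp_recycle_ring;

	static int pp_counter_init(void)
	{
		return percpu_counter_init(&pp_recycle_ring, 0, GFP_KERNEL);
	}

	static void pp_recycle_ring_hit(void)
	{
		percpu_counter_inc(&pp_recycle_ring);	/* lock-free fast path */
	}

	/* Option 2: plain __percpu storage -- reader walks the CPUs itself. */
	struct pp_recycle_stats {
		u64 ring;
	};
	static struct pp_recycle_stats __percpu *pp_stats;

	static int pp_percpu_init(void)
	{
		pp_stats = alloc_percpu(struct pp_recycle_stats);
		return pp_stats ? 0 : -ENOMEM;
	}

	static void pp_recycle_ring_hit2(void)
	{
		this_cpu_inc(pp_stats->ring);
	}

	static u64 pp_recycle_ring_total(void)
	{
		u64 sum = 0;
		int cpu;

		for_each_possible_cpu(cpu)
			sum += per_cpu_ptr(pp_stats, cpu)->ring;
		return sum;
	}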
On Sat, Jan 29, 2022 at 6:07 AM Ilias Apalodimas
<ilias.apalodimas@linaro.org> wrote:
>
> Hi Joe!
>
> On Thu, Jan 27, 2022 at 03:55:03PM -0800, Joe Damato wrote:
> > On Thu, Jan 27, 2022 at 1:08 AM Ilias Apalodimas
> > <ilias.apalodimas@linaro.org> wrote:
> > >
> > > Hi Joe,
> > >
> > > On Wed, Jan 26, 2022 at 02:48:14PM -0800, Joe Damato wrote:
> > > > Greetings:
> > > >
> > > > This series adds some stat counters for the page_pool allocation path which
> > > > help to track:
> > > >
> > > > - fast path allocations
> > > > - slow path order-0 allocations
> > > > - slow path high order allocations
> > > > - refills which failed due to an empty ptr ring, forcing a slow
> > > >   path allocation
> > > > - allocations fulfilled via successful refill
> > > > - pages which cannot be added to the cache because of numa mismatch
> > > >   (i.e. waived)
> > > >
> > >
> > > Thanks for the patch. Stats are something that's indeed missing from the
> > > API. The patch should work for Rx-based allocations (which is what you
> > > currently cover), since the RX side is usually protected by NAPI. However
> > > we've added a few features recently, which we would like to have stats on.
> >
> > Thanks for taking a look at the patch.
> >
>
> yw
>
> > > Commit 6a5bcd84e886 ("page_pool: Allow drivers to hint on SKB recycling")
> > > introduces recycling capabilities on the API. I think it would be far more
> > > interesting to be able to extend the statistics to recycled/non-recycled
> > > packets as well in the future.
> >
> > I agree. Tracking recycling events would be both helpful and
> > interesting, indeed.
> >
> > > But the recycling is asynchronous and we
> > > can't add locks just for the sake of accurate statistics.
> >
> > Agreed.
> >
> > > Can we instead
> > > convert that to a per-cpu structure for producers?
> >
> > If my understanding of your proposal is accurate, moving the stats
> > structure to a per-cpu structure (instead of per-pool) would add
> > ambiguity as to the performance of a specific driver's page pool. In
> > exchange for the ambiguity, though, we'd get stats for additional
> > events, which could be interesting.
>
> I was mostly thinking per pool, using either 'struct percpu_counter' or
> allocated __percpu variables, but I haven't really checked if that's
> doable or which of those is better suited for our case.

I wrote up a v2 last night that allocates and exports a page_pool_stats
structure per CPU (but not per pool). The data can be accessed by users
in the file /proc/net/page_pool_stats. The approach is similar to the
way softnet_stat is implemented.

The main advantage of this approach is that no driver modifications are
needed and no additional APIs are exposed that will need to be
maintained. Adding new stats in the future would be much simpler with
this approach.

I've also moved all the code behind a kernel config flag so users can
opt in to get these stats.

I'll send the v2 shortly.

Thanks,
Joe
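A minimal sketch of the per-CPU (not per-pool) layout described above,
modelled loosely on how softnet_stat is kept; the struct name, field set,
and /proc output format are assumptions about the upcoming v2 rather than
its actual contents:

	#include <linux/percpu.h>
	#include <linux/types.h>

	/* One instance per CPU for the whole system, not per page_pool. */
	struct page_pool_stats {
		u64 fast;	/* fast-path allocations */
		u64 slow;	/* slow-path allocations */
		u64 empty;	/* ptr ring empty on refill */
		u64 refill;	/* allocations fulfilled via refill */
		u64 waived;	/* NUMA-mismatched pages not cached */
	};

	DEFINE_PER_CPU(struct page_pool_stats, pp_stats);

	/* Allocation path: bump the local CPU's copy; no locking required. */
	static inline void pp_stat_fast_inc(void)
	{
		this_cpu_inc(pp_stats.fast);
	}

	/* A /proc/net/page_pool_stats reader would then print one line per
	 * CPU, e.g. via per_cpu_ptr(&pp_stats, cpu) in a seq_file handler. */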
On 27/01/2022 22.08, Joe Damato wrote:
> On Thu, Jan 27, 2022 at 12:51 AM Jesper Dangaard Brouer
> <jbrouer@redhat.com> wrote:
>>
>> On 26/01/2022 23.48, Joe Damato wrote:
>>> Greetings:
>>>
>>> This series adds some stat counters for the page_pool allocation path which
>>> help to track:
>>>
>>> - fast path allocations
>>> - slow path order-0 allocations
>>> - slow path high order allocations
>>> - refills which failed due to an empty ptr ring, forcing a slow
>>>   path allocation
>>> - allocations fulfilled via successful refill
>>> - pages which cannot be added to the cache because of numa mismatch
>>>   (i.e. waived)
>>>
>>> Some static inline wrappers are provided for accessing these stats. The
>>> intention is that drivers which use the page_pool API can, if they choose,
>>> use this stats API.
>>
>> You are adding (always-on) counters to a critical fast path that
>> drivers use for the XDP_DROP use-case.
>
> If you prefer requiring users to explicitly enable these stats, I am
> happy to add a kernel config option (e.g. CONFIG_PAGE_POOL_DEBUG or
> similar) in a v2.
>
>> I want to see performance measurements as documentation, showing this is
>> not causing a slow-down.
>>
>> I have some performance tests here[1]:
>> [1]
>> https://github.com/netoptimizer/prototype-kernel/tree/master/kernel/lib
>>
>> Look at:
>> - bench_page_pool_simple.c and
>> - bench_page_pool_cross_cpu.c
>>
>> How to use + build this[2]:
>> [2]
>> https://prototype-kernel.readthedocs.io/en/latest/prototype-kernel/build-process.html
>
> Thanks for the pointers to the benchmarks.
>
> In general, I noted that the benchmark results varied fairly
> substantially between repeated runs on the same system.
>
> Results below suggest that:
> - bench_page_pool_simple is faster on the test kernel, and
> - bench_page_pool_cross_cpu is faster on the control.
>
> Subsequent runs of bench_page_pool_cross_cpu on the control, however,
> reveal *much* slower results than shown below.
>
> Test system:
> - 2 x Intel(R) Xeon(R) Gold 6140 CPU @ 2.30GHz
> - 2 NUMA nodes, with 18 cores per node and 2 threads per core
>
> Control kernel: built from net-next at commit e2cf07654efb ("ptp:
> replace snprintf with sysfs_emit").
> Test kernel: this series applied on top of the control kernel mentioned above.
>
> Raw output from dmesg for the control [1] and test [2] kernels is
> summarized below:
>
> bench_page_pool_simple
> - run with default options (i.e. "sudo modprobe bench_page_pool_simple").
>
> Control:
>
> Type:for_loop Per elem: 0 cycles(tsc) 0.334 ns (step:0)
> Type:atomic_inc Per elem: 13 cycles(tsc) 6.021 ns (step:0)
> Type:lock Per elem: 31 cycles(tsc) 13.514 ns (step:0)
>
> Type:no-softirq-page_pool01 Per elem: 44 cycles(tsc) 19.549 ns (step:0)
> Type:no-softirq-page_pool02 Per elem: 45 cycles(tsc) 19.658 ns (step:0)
> Type:no-softirq-page_pool03 Per elem: 118 cycles(tsc) 51.638 ns (step:0)
>
> Type:tasklet_page_pool01_fast_path Per elem: 17 cycles(tsc) 7.472 ns (step:0)
> Type:tasklet_page_pool02_ptr_ring Per elem: 42 cycles(tsc) 18.585 ns (step:0)
> Type:tasklet_page_pool03_slow Per elem: 109 cycles(tsc) 47.807 ns (step:0)
>
> Test:
>
> Type:for_loop Per elem: 0 cycles(tsc) 0.334 ns (step:0)
> Type:atomic_inc Per elem: 14 cycles(tsc) 6.195 ns (step:0)
> Type:lock Per elem: 31 cycles(tsc) 13.827 ns (step:0)
>
> Type:no-softirq-page_pool01 Per elem: 44 cycles(tsc) 19.561 ns (step:0)
> Type:no-softirq-page_pool02 Per elem: 45 cycles(tsc) 19.700 ns (step:0)
> Type:no-softirq-page_pool03 Per elem: 108 cycles(tsc) 47.186 ns (step:0)
>
> Type:tasklet_page_pool01_fast_path Per elem: 12 cycles(tsc) 5.447 ns (step:0)

Watch out for the:
  measurement period time:0.054478253
(taken from [2]).

If the measurement period becomes too small, you/we cannot use the
results. Perhaps I've set the default 'loops' variable too low for these
modern systems. Hint: it is adjustable via the module parameter 'loops'.

> Type:tasklet_page_pool02_ptr_ring Per elem: 42 cycles(tsc) 18.501 ns (step:0)
> Type:tasklet_page_pool03_slow Per elem: 106 cycles(tsc) 46.313 ns (step:0)
>
> bench_page_pool_cross_cpu
> - run with default options (i.e. "sudo modprobe bench_page_pool_cross_cpu").
>
> Control:
> Type:page_pool_cross_cpu CPU(0) 1795 cycles(tsc) 782.567 ns (step:2)
> Type:page_pool_cross_cpu CPU(1) 1921 cycles(tsc) 837.435 ns (step:2)
> Type:page_pool_cross_cpu CPU(2) 960 cycles(tsc) 418.758 ns (step:2)
> Sum Type:page_pool_cross_cpu Average: 1558 cycles(tsc) CPUs:3 step:2
>
> Test:
> Type:page_pool_cross_cpu CPU(0) 2411 cycles(tsc) 1051.037 ns (step:2)
> Type:page_pool_cross_cpu CPU(1) 2467 cycles(tsc) 1075.204 ns (step:2)
> Type:page_pool_cross_cpu CPU(2) 1233 cycles(tsc) 537.629 ns (step:2)
> Type:page_pool_cross_cpu Average: 2037 cycles(tsc) CPUs:3 step:2

I think the effect you are seeing here is because you placed your stats
struct on a cache line that is also used by remote CPUs freeing/recycling
pages back to the page_pool.

> [1]: https://gist.githubusercontent.com/jdamato-fsly/385806f06cb95c61ff8cecf7a3645e75/raw/886e3208f5b9c47abdd59bdaa7ecf27994f477b1/page_pool_bench_control
> [2]: https://gist.githubusercontent.com/jdamato-fsly/385806f06cb95c61ff8cecf7a3645e75/raw/886e3208f5b9c47abdd59bdaa7ecf27994f477b1/page_pool_bench_TESTKERNEL
>
>
>>> It is assumed that the API consumer will ensure the page_pool is not
>>> destroyed during calls to the stats API.
>>>
>>> If this series is accepted, I'll submit a follow-up patch which will export
>>> these stats per RX-ring via ethtool in a driver which uses the page_pool
>>> API.
>>>
>>> Joe Damato (6):
>>>   net: page_pool: Add alloc stats and fast path stat
>>>   net: page_pool: Add a stat for the slow alloc path
>>>   net: page_pool: Add a high order alloc stat
>>>   net: page_pool: Add stat tracking empty ring
>>>   net: page_pool: Add stat tracking cache refills.
>>>   net: page_pool: Add a stat tracking waived pages.
>>>
>>>  include/net/page_pool.h | 82 +++++++++++++++++++++++++++++++++++++++++++++++++
>>>  net/core/page_pool.c    | 15 +++++++--
>>>  2 files changed, 94 insertions(+), 3 deletions(-)
>>>
>>
>
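To illustrate the false-sharing concern raised above (the stats landing on
a cache line that remote CPUs dirty when returning pages), a generic sketch
of keeping the counters on their own line. This is not the real struct
page_pool layout; the struct and field names are invented for the example:

	#include <linux/atomic.h>
	#include <linux/cache.h>
	#include <linux/types.h>

	struct pool_like {
		/* hot fields read/written only by the allocating (NAPI) CPU */
		u32 cache_count;

		/* allocator-side stats kept on their own cache line */
		struct {
			u64 fast;
			u64 slow;
		} alloc_stats ____cacheline_aligned_in_smp;

		/* fields touched by remote CPUs freeing/recycling pages */
		atomic_t pages_in_flight ____cacheline_aligned_in_smp;
	};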