
[net-next] page_pool: Clamp ring size to 32K

Message ID: 20230807034932.4000598-1-rkannoth@marvell.com (mailing list archive)
State: Changes Requested
Delegated to: Netdev Maintainers
Series: [net-next] page_pool: Clamp ring size to 32K

Checks

Context Check Description
netdev/series_format success Single patches do not need cover letters
netdev/tree_selection success Clearly marked for net-next
netdev/fixes_present success Fixes tag not required for -next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 1332 this patch: 1332
netdev/cc_maintainers warning 2 maintainers not CCed: hawk@kernel.org ilias.apalodimas@linaro.org
netdev/build_clang success Errors and warnings before: 1353 this patch: 1353
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn success Errors and warnings before: 1355 this patch: 1355
netdev/checkpatch success total: 0 errors, 0 warnings, 0 checks, 11 lines checked
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0

Commit Message

Ratheesh Kannoth Aug. 7, 2023, 3:49 a.m. UTC
https://lore.kernel.org/netdev/20230804133512.4dbbbc16@kernel.org/T/
Cap the recycle ring size to 32k instead of returning an error.

Suggested-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Ratheesh Kannoth <rkannoth@marvell.com>
---
 net/core/page_pool.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

Comments

Jesper Dangaard Brouer Aug. 7, 2023, 11:42 a.m. UTC | #1
On 07/08/2023 05.49, Ratheesh Kannoth wrote:
> https://lore.kernel.org/netdev/20230804133512.4dbbbc16@kernel.org/T/
> Cap the recycle ring size to 32k instead of returning an error.
> 

Page pool (PP) is just a cache of pages.  The octeontx2 driver (in the
link) is creating an excessively large cache of pages.  The driver's RX
descriptor ring size should be independent of the PP ptr_ring size, as
the latter is just a cache that grows as a function of the in-flight
packet workload; it functions as a "shock absorber".

32768 pages (4KiB) is approx 128 MiB, and this will be per RX-queue.

The RX-desc ring (obviously) pins down these pages (immediately), but
the PP ring starts empty.  As the workload varies, the "shock absorber"
effect will let more pages into the system, which will travel through
the PP ptr_ring.  As all pages originating from the same PP instance
get recycled, the in-flight pages in the "system" (PP ptr_ring) will
grow over time.

The PP design has the problem that it never releases or reduces pages
in this "closed" shock-absorber system.  (Cc. PP people/devel) We
should consider implementing an MM shrinker callback
(include/linux/shrinker.h).
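
A very rough sketch of what such a hook could look like.  Everything
below is illustrative only: struct page_pool has no 'shrinker' or
'ring_cnt' member today and the pp_shrink_* names are made up;
ptr_ring_consume_bh() and the static page_pool_return_page() helper in
net/core/page_pool.c do exist.

/* Sketch only -- page_pool has no shrinker today.  The 'shrinker' and
 * 'ring_cnt' members of struct page_pool are hypothetical.
 */
static unsigned long pp_shrink_scan(struct shrinker *shrink,
				    struct shrink_control *sc)
{
	struct page_pool *pool = container_of(shrink, struct page_pool,
					      shrinker);
	unsigned long freed = 0;
	struct page *page;

	/* Hand pages sitting idle in the ptr_ring back to the allocator. */
	while (freed < sc->nr_to_scan &&
	       (page = ptr_ring_consume_bh(&pool->ring))) {
		page_pool_return_page(pool, page);
		freed++;
	}
	return freed ? freed : SHRINK_STOP;
}

static unsigned long pp_shrink_count(struct shrinker *shrink,
				     struct shrink_control *sc)
{
	struct page_pool *pool = container_of(shrink, struct page_pool,
					      shrinker);

	/* ptr_ring has no size accessor, so the pool would have to track
	 * its own fill level (the hypothetical 'ring_cnt').
	 */
	return READ_ONCE(pool->ring_cnt) ?: SHRINK_EMPTY;
}

	/* in page_pool_init(): */
	pool->shrinker.count_objects = pp_shrink_count;
	pool->shrinker.scan_objects  = pp_shrink_scan;
	pool->shrinker.seeks = DEFAULT_SEEKS;
	err = register_shrinker(&pool->shrinker, "page_pool");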

Are the systems using the octeontx2 driver ready to handle 128 MiB of
memory per RX-queue getting pinned down over time?  (This could lead to
some hard-to-debug situations if the memory is not sufficient.)

--Jesper

> Suggested-by: Jakub Kicinski <kuba@kernel.org>
> Signed-off-by: Ratheesh Kannoth <rkannoth@marvell.com>
> ---
>   net/core/page_pool.c | 4 ++--
>   1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/net/core/page_pool.c b/net/core/page_pool.c
> index 5d615a169718..404f835a94be 100644
> --- a/net/core/page_pool.c
> +++ b/net/core/page_pool.c
> @@ -182,9 +182,9 @@ static int page_pool_init(struct page_pool *pool,
>   	if (pool->p.pool_size)
>   		ring_qsize = pool->p.pool_size;
>   
> -	/* Sanity limit mem that can be pinned down */
> +	/* Clamp to 32K */
>   	if (ring_qsize > 32768)
> -		return -E2BIG;
> +		ring_qsize = 32768;
>   
>   	/* DMA direction is either DMA_FROM_DEVICE or DMA_BIDIRECTIONAL.
>   	 * DMA_BIDIRECTIONAL is for allowing page used for DMA sending,
Alexander Duyck Aug. 7, 2023, 2:18 p.m. UTC | #2
On Mon, 2023-08-07 at 13:42 +0200, Jesper Dangaard Brouer wrote:
> 
> On 07/08/2023 05.49, Ratheesh Kannoth wrote:
> > https://lore.kernel.org/netdev/20230804133512.4dbbbc16@kernel.org/T/
> > Cap the recycle ring size to 32k instead of returning an error.
> > 
> 
> Page pool (PP) is just a cache of pages.  The octeontx2 driver (in the
> link) is creating an excessively large cache of pages.  The driver's RX
> descriptor ring size should be independent of the PP ptr_ring size, as
> the latter is just a cache that grows as a function of the in-flight
> packet workload; it functions as a "shock absorber".
> 
> 32768 pages (4KiB) is approx 128 MiB, and this will be per RX-queue.
> 
> The RX-desc ring (obviously) pins down these pages (immediately), but
> the PP ring starts empty.  As the workload varies, the "shock absorber"
> effect will let more pages into the system, which will travel through
> the PP ptr_ring.  As all pages originating from the same PP instance
> get recycled, the in-flight pages in the "system" (PP ptr_ring) will
> grow over time.
> 
> The PP design has the problem that it never releases or reduces pages
> in this "closed" shock-absorber system.  (Cc. PP people/devel) We
> should consider implementing an MM shrinker callback
> (include/linux/shrinker.h).
> 
> Are the systems using the octeontx2 driver ready to handle 128 MiB of
> memory per RX-queue getting pinned down over time?  (This could lead to
> some hard-to-debug situations if the memory is not sufficient.)
> 
> --Jesper

I'm with Jesper on this. It doesn't make sense to be tying the
page_pool size strictly to the ring size. The amount of recycling you
get will depend on how long the packets are on the stack, not in the
driver.

For example, in the case of something like a software router or bridge
that is just taking the Rx packets and routing them to Tx you could
theoretically get away with a multiple of NAPI_POLL_WEIGHT since you
would likely never need much more than that as the Tx would likely be
cleaned about as fast as the Rx can consume the pages.
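
Purely as an illustration of that sizing approach (the 4x factor, the
'dev' variable and the field values are assumptions, not something from
the driver in question):

	/* Sketch: size the page_pool cache as a small multiple of the
	 * NAPI budget instead of the RX descriptor count.
	 */
	struct page_pool_params pp_params = {
		.order		= 0,
		.pool_size	= 4 * NAPI_POLL_WEIGHT,	/* 256, not 32768 */
		.nid		= dev_to_node(dev),
		.dev		= dev,
		.dma_dir	= DMA_FROM_DEVICE,
	};
	struct page_pool *pool = page_pool_create(&pp_params);

	if (IS_ERR(pool))
		return PTR_ERR(pool);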

Rather than overriding the size here wouldn't it make more sense to do
it in the octeontx2 driver? With that at least you would know that you
were the one that limited the size instead of having the value modified
out from underneath you.

That said, one change that might help to enable this kind of change
would be to look at adding a #define so that this value wouldn't be so
much of a magic number and would be visible to the drivers should it
ever be changed in the future.
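
Something along these lines, presumably (the macro name and its exact
placement in the header are only a guess):

/* include/net/page_pool.h (name is illustrative) */
#define PAGE_POOL_RING_SIZE_MAX	32768

/* net/core/page_pool.c, page_pool_init() */
	/* Clamp to the published maximum instead of a bare 32768 */
	if (ring_qsize > PAGE_POOL_RING_SIZE_MAX)
		ring_qsize = PAGE_POOL_RING_SIZE_MAX;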
Jakub Kicinski Aug. 7, 2023, 5:20 p.m. UTC | #3
On Mon, 07 Aug 2023 07:18:21 -0700 Alexander H Duyck wrote:
> > Page pool (PP) is just a cache of pages.  The octeontx2 driver (in the
> > link) is creating an excessively large cache of pages.  The driver's RX
> > descriptor ring size should be independent of the PP ptr_ring size, as
> > the latter is just a cache that grows as a function of the in-flight
> > packet workload; it functions as a "shock absorber".
> > 
> > 32768 pages (4KiB) is approx 128 MiB, and this will be per RX-queue.
> > 
> > The RX-desc ring (obviously) pins down these pages (immediately), but
> > the PP ring starts empty.  As the workload varies, the "shock absorber"
> > effect will let more pages into the system, which will travel through
> > the PP ptr_ring.  As all pages originating from the same PP instance
> > get recycled, the in-flight pages in the "system" (PP ptr_ring) will
> > grow over time.
> > 
> > The PP design has the problem that it never releases or reduces pages
> > in this "closed" shock-absorber system.  (Cc. PP people/devel) We
> > should consider implementing an MM shrinker callback
> > (include/linux/shrinker.h).
> > 
> > Are the systems using the octeontx2 driver ready to handle 128 MiB of
> > memory per RX-queue getting pinned down over time?  (This could lead to
> > some hard-to-debug situations if the memory is not sufficient.)
>
> I'm with Jesper on this. It doesn't make sense to be tying the
> page_pool size strictly to the ring size. The amount of recycling you
> get will depend on how long the packets are on the stack, not in the
> driver.
> 
> For example, in the case of something like a software router or bridge
> that is just taking the Rx packets and routing them to Tx you could
> theoretically get away with a multiple of NAPI_POLL_WEIGHT since you
> would likely never need much more than that as the Tx would likely be
> cleaned about as fast as the Rx can consume the pages.
> 
> Rather than overriding the size here wouldn't it make more sense to do
> it in the octeontx2 driver? With that at least you would know that you
> were the one that limited the size instead of having the value modified
> out from underneath you.
> 
> That said, one change that might help to enable this kind of change
> would be to look at adding a #define so that this value wouldn't be
> so much of a magic number and would be visible to the drivers should
> it ever be changed in the future.

All the points y'all making are valid, sizing the cache is a hard
problem. But the proposed solution goes in the wrong direction, IMO.
The driver doesn't know. I started hacking together page pool control
over netlink. I think that the pool size selection logic should be in
the core, with inputs taken from user space / workload (via netlink).

If it wasn't for the fact that I'm working on that API I'd probably
side with you. And 64k descriptors is impractically large.

Copy / pasting from the discussion on previous version:

  Tuning this in the driver relies on the assumption that the HW /
  driver is the thing that matters. I'd think that the workload,
  platform (CPU) and config (e.g. is IOMMU enabled?) will matter at
  least as much. While driver developers will end up tuning to whatever
  servers they have, random single config and most likely.. iperf.

  IMO it's much better to re-purpose "pool_size" and treat it as the ring
  size, because that's what most drivers end up putting there. 
  Defer tuning of the effective ring size to the core and user input 
  (via the "it will be added any minute now" netlink API for configuring
  page pools)...

  So capping the recycle ring to 32k instead of returning the error seems
  like an okay solution for now.
Jesper Dangaard Brouer Aug. 7, 2023, 8:11 p.m. UTC | #4
On 07/08/2023 19.20, Jakub Kicinski wrote:
> On Mon, 07 Aug 2023 07:18:21 -0700 Alexander H Duyck wrote:
>>> Page pool (PP) is just a cache of pages.  The octeontx2 driver (in the
>>> link) is creating an excessively large cache of pages.  The driver's
>>> RX descriptor ring size should be independent of the PP ptr_ring size,
>>> as the latter is just a cache that grows as a function of the
>>> in-flight packet workload; it functions as a "shock absorber".
>>>
>>> 32768 pages (4KiB) is approx 128 MiB, and this will be per RX-queue.
>>>
>>> The RX-desc ring (obviously) pins down these pages (immediately), but
>>> the PP ring starts empty.  As the workload varies, the "shock
>>> absorber" effect will let more pages into the system, which will
>>> travel through the PP ptr_ring.  As all pages originating from the
>>> same PP instance get recycled, the in-flight pages in the "system"
>>> (PP ptr_ring) will grow over time.
>>>
>>> The PP design has the problem that it never releases or reduces pages
>>> in this "closed" shock-absorber system.  (Cc. PP people/devel) We
>>> should consider implementing an MM shrinker callback
>>> (include/linux/shrinker.h).
>>>
>>> Are the systems using the octeontx2 driver ready to handle 128 MiB of
>>> memory per RX-queue getting pinned down over time?  (This could lead
>>> to some hard-to-debug situations if the memory is not sufficient.)
>>
>> I'm with Jesper on this. It doesn't make sense to be tying the
>> page_pool size strictly to the ring size. The amount of recycling you
>> get will depend on how long the packets are on the stack, not in the
>> driver.
>>

Thanks for agreeing with me, and I agree with you :-)

>> For example, in the case of something like a software router or bridge
>> that is just taking the Rx packets and routing them to Tx you could
>> theoretically get away with a multiple of NAPI_POLL_WEIGHT since you
>> would likely never need much more than that as the Tx would likely be
>> cleaned about as fast as the Rx can consume the pages.
>>

I agree.

>> Rather than overriding the size here wouldn't it make more sense to do
>> it in the octeontx2 driver? With that at least you would know that you
>> were the one that limited the size instead of having the value modified
>> out from underneath you.
>>

I'm not fully agreeing here.  I don't think we can expect driver
developers to be experts on page_pool cache dynamics.  I'm more on
Jakub's side here, as perhaps we/net-core can come up with some control
system, even if this means we change this underneath the drivers.


>> That said, one change that might help to enable this kind of change
>> would be to look at adding a #define so that this value wouldn't be
>> so much of a magic number and would be visible to the drivers should
>> it ever be changed in the future.
> 
> All the points y'all making are valid, sizing the cache is a hard
> problem. But the proposed solution goes in the wrong direction, IMO.
> The driver doesn't know. I started hacking together page pool control
> over netlink. I think that the pool size selection logic should be in
> the core, with inputs taken from user space / workload (via netlink).
> 
> If it wasn't for the fact that I'm working on that API I'd probably
> side with you. And 64k descriptors is impractically large.
> 
> Copy / pasting from the discussion on previous version:
> 
>    Tuning this in the driver relies on the assumption that the HW /
>    driver is the thing that matters. I'd think that the workload,
>    platform (CPU) and config (e.g. is IOMMU enabled?) will matter at
>    least as much. While driver developers will end up tuning to whatever
>    servers they have, random single config and most likely.. iperf.
> 
>    IMO it's much better to re-purpose "pool_size" and treat it as the ring
>    size, because that's what most drivers end up putting there.

I disagree here, as driver developers should not treat "pool_size" as
the ring size.  It seems to be a copy-paste-programming scheme without
understanding PP dynamics.

>    Defer tuning of the effective ring size to the core and user input
>    (via the "it will be added any minute now" netlink API for configuring
>    page pools)...
> 

I agree here that tuning the ring size is a hard problem, and this is
better handled in the core.  Happy to hear that/if Jakub is working on
this.

>    So capping the recycle ring to 32k instead of returning the error seems
>    like an okay solution for now.

As a temporary solution, I'm actually fine with capping at 32k.
Driver developers lose some feedback control, but perhaps that is okay
if we can agree that the net-core should control tuning this anyhow.

--Jesper
Ratheesh Kannoth Aug. 8, 2023, 2:26 a.m. UTC | #5
> From: Jesper Dangaard Brouer <hawk@kernel.org>
> Sent: Tuesday, August 8, 2023 1:42 AM
> As a temporary solution, I'm actually fine with capping at 32k.
> Driver developers lose some feedback control, but perhaps that is okay
> if we can agree that the net-core should control tuning this anyhow.


Capping will never let the user know that memory is unnecessarily wasted, as there is not much
correlation between ring size and page pool size.  I would prefer not setting pool->p.pool_size in the
octeontx2 driver (i.e. pool->p.pool_size = 0) and letting the page pool infra decide on that.  Is this change acceptable?
A 100G link's burst traffic can be handled with a page pool size of 1024; do we have any data from any other driver?
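
For reference, a pool_size of 0 falls back to the existing default in
page_pool_init() (quoted, lightly commented, from net/core/page_pool.c
a few lines above the hunk this patch touches):

	unsigned int ring_qsize = 1024; /* Default */

	/* Drivers that leave pool->p.pool_size at 0 get the 1024 default. */
	if (pool->p.pool_size)
		ring_qsize = pool->p.pool_size;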

-Ratheesh
Alexander Lobakin Aug. 8, 2023, 1:29 p.m. UTC | #6
From: Jesper Dangaard Brouer <hawk@kernel.org>
Date: Mon, 7 Aug 2023 22:11:35 +0200

> 
> 
> On 07/08/2023 19.20, Jakub Kicinski wrote:
>> On Mon, 07 Aug 2023 07:18:21 -0700 Alexander H Duyck wrote:
>>>> Page pool (PP) is just a cache of pages.  The octeontx2 driver (in
>>>> the link) is creating an excessively large cache of pages.  The
>>>> driver's RX descriptor ring size should be independent of the PP
>>>> ptr_ring size, as the latter is just a cache that grows as a
>>>> function of the in-flight packet workload; it functions as a
>>>> "shock absorber".
>>>>
>>>> 32768 pages (4KiB) is approx 128 MiB, and this will be per RX-queue.
>>>>
>>>> The RX-desc ring (obviously) pins down these pages (immediately),
>>>> but the PP ring starts empty.  As the workload varies, the "shock
>>>> absorber" effect will let more pages into the system, which will
>>>> travel through the PP ptr_ring.  As all pages originating from the
>>>> same PP instance get recycled, the in-flight pages in the "system"
>>>> (PP ptr_ring) will grow over time.
>>>>
>>>> The PP design has the problem that it never releases or reduces
>>>> pages in this "closed" shock-absorber system.  (Cc. PP people/devel)
>>>> We should consider implementing an MM shrinker callback
>>>> (include/linux/shrinker.h).
>>>>
>>>> Are the systems using the octeontx2 driver ready to handle 128 MiB
>>>> of memory per RX-queue getting pinned down over time?  (This could
>>>> lead to some hard-to-debug situations if the memory is not
>>>> sufficient.)
>>>
>>> I'm with Jesper on this. It doesn't make sense to be tying the
>>> page_pool size strictly to the ring size. The amount of recycling you
>>> get will depend on how long the packets are on the stack, not in the
>>> driver.
>>>
> 
> Thanks for agreeing with me, and I agree with you :-)
> 
>>> For example, in the case of something like a software router or bridge
>>> that is just taking the Rx packets and routing them to Tx you could
>>> theoretically get away with a multiple of NAPI_POLL_WEIGHT since you
>>> would likely never need much more than that as the Tx would likely be
>>> cleaned about as fast as the Rx can consume the pages.
>>>
> 
> I agree.
> 
>>> Rather than overriding the size here wouldn't it make more sense to do
>>> it in the octeontx2 driver? With that at least you would know that you
>>> were the one that limited the size instead of having the value modified
>>> out from underneath you.
>>>
> 
> I'm not fully agreeing here.  I don't think we can expect driver
> developers to be experts on page_pool cache dynamics.  I'm more on
> Jakub's side here, as perhaps we/net-core can come up with some control
> system, even if this means we change this underneath the drivers.
> 
> 
>>> That said, one change that might help to enable this kind of change
>>> would be to look at adding a #define so that this value wouldn't be
>>> so much of a magic number and would be visible to the drivers should
>>> it ever be changed in the future.
>>
>> All the points y'all making are valid, sizing the cache is a hard
>> problem. But the proposed solution goes in the wrong direction, IMO.
>> The driver doesn't know. I started hacking together page pool control
>> over netlink. I think that the pool size selection logic should be in
>> the core, with inputs taken from user space / workload (via netlink).
>>
>> If it wasn't for the fact that I'm working on that API I'd probably
>> side with you. And 64k descriptors is impractically large.
>>
>> Copy / pasting from the discussion on previous version:
>>
>>    Tuning this in the driver relies on the assumption that the HW /
>>    driver is the thing that matters. I'd think that the workload,
>>    platform (CPU) and config (e.g. is IOMMU enabled?) will matter at
>>    least as much. While driver developers will end up tuning to whatever
>>    servers they have, random single config and most likely.. iperf.
>>
>>    IMO it's much better to re-purpose "pool_size" and treat it as the
>> ring
>>    size, because that's what most drivers end up putting there.
> 
> I disagree here, as driver developers should not treat "pool_size" as
> the ring size.  It seems to be a copy-paste-programming scheme without
> understanding PP dynamics.

+1. That's why I wrote in the previous thread that pool_size must be the
minimum value which gives optimal performance. I don't believe Otx2 HW
needs 32k entries in PP's ptr_ring to have optimal performance.
That's why I wrote that developers must check whether there's any
benefit in using bigger pool_size values. Values bigger than 2k don't
seem reasonable to me, especially now that we use direct recycling way
more aggressively -- oftentimes the ptr_ring is not used at all.
Jakub thought that my "pls test whether bigger sizes make sense" meant
"please tune Page Pool to your servers", which is not exactly what I
wanted to say =\ I only said "please keep pool_size reasonable; it's
your right to have 2^32 descriptors on the ring, but don't do that with
Page Pool".

> 
>>    Defer tuning of the effective ring size to the core and user input
>>    (via the "it will be added any minute now" netlink API for configuring
>>    page pools)...
>>
> 
> I agree here that tuning the ring size is a hard problem, and this is
> better handled in the core.  Happy to hear that/if Jakub is working on
> this.
> 
>>    So capping the recycle ring to 32k instead of returning the error
>> seems
>>    like an okay solution for now.
> 
> As a temporary solution, I'm actually fine with capping at 32k.
> Driver developers lose some feedback control, but perhaps that is okay
> if we can agree that the net-core should control tuning this anyhow.
> 
> --Jesper

Thanks,
Olek

Patch

diff --git a/net/core/page_pool.c b/net/core/page_pool.c
index 5d615a169718..404f835a94be 100644
--- a/net/core/page_pool.c
+++ b/net/core/page_pool.c
@@ -182,9 +182,9 @@  static int page_pool_init(struct page_pool *pool,
 	if (pool->p.pool_size)
 		ring_qsize = pool->p.pool_size;
 
-	/* Sanity limit mem that can be pinned down */
+	/* Clamp to 32K */
 	if (ring_qsize > 32768)
-		return -E2BIG;
+		ring_qsize = 32768;
 
 	/* DMA direction is either DMA_FROM_DEVICE or DMA_BIDIRECTIONAL.
 	 * DMA_BIDIRECTIONAL is for allowing page used for DMA sending,