Message ID | 20230807034932.4000598-1-rkannoth@marvell.com (mailing list archive) |
---|---|
State | Changes Requested |
Delegated to: | Netdev Maintainers |
Headers | show |
Series | [net-next] page_pool: Clamp ring size to 32K | expand |
On 07/08/2023 05.49, Ratheesh Kannoth wrote: > https://lore.kernel.org/netdev/20230804133512.4dbbbc16@kernel.org/T/ > Capping the recycle ring to 32k instead of returning the error. > Page pool (PP) is just a cache of pages. The driver octeontx2 (in link) is creating an excessive large cache of pages. The drivers RX descriptor ring size should be independent of the PP ptr_ring size, as it is just a cache that grows as a functions of the in-flight packet workload, it functions as a "shock absorber". 32768 pages (4KiB) is approx 128 MiB, and this will be per RX-queue. The RX-desc ring (obviously) pins down these pages (immediately), but PP ring starts empty. As the workload varies the "shock absorber" effect will let more pages into the system, that will travel the PP ptr_ring. As all pages originating from the same PP instance will get recycled, the in-flight pages in the "system" (PP ptr_ring) will grow over time. The PP design have the problem that it never releases or reduces pages in this shock absorber "closed" system. (Cc. PP people/devel) we should consider implementing a MM shrinker callback (include/linux/shrinker.h). Are the systems using driver octeontx2 ready to handle 128MiB memory per RX-queue getting pinned down overtime? (this could lead to some strange do debug situation if the memory is not sufficient) --Jesper > Suggested-by: Jakub Kicinski <kuba@kernel.org> > Signed-off-by: Ratheesh Kannoth <rkannoth@marvell.com> > --- > net/core/page_pool.c | 4 ++-- > 1 file changed, 2 insertions(+), 2 deletions(-) > > diff --git a/net/core/page_pool.c b/net/core/page_pool.c > index 5d615a169718..404f835a94be 100644 > --- a/net/core/page_pool.c > +++ b/net/core/page_pool.c > @@ -182,9 +182,9 @@ static int page_pool_init(struct page_pool *pool, > if (pool->p.pool_size) > ring_qsize = pool->p.pool_size; > > - /* Sanity limit mem that can be pinned down */ > + /* Clamp to 32K */ > if (ring_qsize > 32768) > - return -E2BIG; > + ring_qsize = 32768; > > /* DMA direction is either DMA_FROM_DEVICE or DMA_BIDIRECTIONAL. > * DMA_BIDIRECTIONAL is for allowing page used for DMA sending,
On Mon, 2023-08-07 at 13:42 +0200, Jesper Dangaard Brouer wrote: > > On 07/08/2023 05.49, Ratheesh Kannoth wrote: > > https://lore.kernel.org/netdev/20230804133512.4dbbbc16@kernel.org/T/ > > Capping the recycle ring to 32k instead of returning the error. > > > > Page pool (PP) is just a cache of pages. The driver octeontx2 (in link) > is creating an excessive large cache of pages. The drivers RX > descriptor ring size should be independent of the PP ptr_ring size, as > it is just a cache that grows as a functions of the in-flight packet > workload, it functions as a "shock absorber". > > 32768 pages (4KiB) is approx 128 MiB, and this will be per RX-queue. > > The RX-desc ring (obviously) pins down these pages (immediately), but PP > ring starts empty. As the workload varies the "shock absorber" effect > will let more pages into the system, that will travel the PP ptr_ring. > As all pages originating from the same PP instance will get recycled, > the in-flight pages in the "system" (PP ptr_ring) will grow over time. > > The PP design have the problem that it never releases or reduces pages > in this shock absorber "closed" system. (Cc. PP people/devel) we should > consider implementing a MM shrinker callback (include/linux/shrinker.h). > > Are the systems using driver octeontx2 ready to handle 128MiB memory per > RX-queue getting pinned down overtime? (this could lead to some strange > do debug situation if the memory is not sufficient) > > --Jesper I'm with Jesper on this. It doesn't make sense to be tying the page_pool size strictly to the ring size. The amount of recycling you get will depend on how long the packets are on the stack, not in the driver. For example, in the case of something like a software router or bridge that is just taking the Rx packets and routing them to Tx you could theoretically get away with a multiple of NAPI_POLL_WEIGHT since you would likely never need much more than that as the Tx would likely be cleaned about as fast as the Rx can consume the pages. Rather than overriding the size here wouldn't it make more sense to do it in the octeontx2 driver? With that at least you would know that you were the one that limited the size instead of having the value modified out from underneath you. That said, one change that might help to enable this kind of change would be look at adding a #define so that this value wouldn't be so much a magic number and would be visible to the drivers should it ever be changed in the future.
On Mon, 07 Aug 2023 07:18:21 -0700 Alexander H Duyck wrote: > > Page pool (PP) is just a cache of pages. The driver octeontx2 (in link) > > is creating an excessive large cache of pages. The drivers RX > > descriptor ring size should be independent of the PP ptr_ring size, as > > it is just a cache that grows as a functions of the in-flight packet > > workload, it functions as a "shock absorber". > > > > 32768 pages (4KiB) is approx 128 MiB, and this will be per RX-queue. > > > > The RX-desc ring (obviously) pins down these pages (immediately), but PP > > ring starts empty. As the workload varies the "shock absorber" effect > > will let more pages into the system, that will travel the PP ptr_ring. > > As all pages originating from the same PP instance will get recycled, > > the in-flight pages in the "system" (PP ptr_ring) will grow over time. > > > > The PP design have the problem that it never releases or reduces pages > > in this shock absorber "closed" system. (Cc. PP people/devel) we should > > consider implementing a MM shrinker callback (include/linux/shrinker.h). > > > > Are the systems using driver octeontx2 ready to handle 128MiB memory per > > RX-queue getting pinned down overtime? (this could lead to some strange > > do debug situation if the memory is not sufficient) > > I'm with Jesper on this. It doesn't make sense to be tying the > page_pool size strictly to the ring size. The amount of recycling you > get will depend on how long the packets are on the stack, not in the > driver. > > For example, in the case of something like a software router or bridge > that is just taking the Rx packets and routing them to Tx you could > theoretically get away with a multiple of NAPI_POLL_WEIGHT since you > would likely never need much more than that as the Tx would likely be > cleaned about as fast as the Rx can consume the pages. > > Rather than overriding the size here wouldn't it make more sense to do > it in the octeontx2 driver? With that at least you would know that you > were the one that limited the size instead of having the value modified > out from underneath you. > > That said, one change that might help to enable this kind of change > would be look at adding a #define so that this value wouldn't be so > much a magic number and would be visible to the drivers should it ever > be changed in the future. All the points y'all making are valid, sizing the cache is a hard problem. But the proposed solution goes in the wrong direction, IMO. The driver doesn't know. I started hacking together page pool control over netlink. I think that the pool size selection logic should be in the core, with inputs taken from user space / workload (via netlink). If it wasn't for the fact that I'm working on that API I'd probably side with you. And 64k descriptors is impractically large. Copy / pasting from the discussion on previous version: Tuning this in the driver relies on the assumption that the HW / driver is the thing that matters. I'd think that the workload, platform (CPU) and config (e.g. is IOMMU enabled?) will matter at least as much. While driver developers will end up tuning to whatever servers they have, random single config and most likely.. iperf. IMO it's much better to re-purpose "pool_size" and treat it as the ring size, because that's what most drivers end up putting there. Defer tuning of the effective ring size to the core and user input (via the "it will be added any minute now" netlink API for configuring page pools)... So capping the recycle ring to 32k instead of returning the error seems like an okay solution for now.
On 07/08/2023 19.20, Jakub Kicinski wrote: > On Mon, 07 Aug 2023 07:18:21 -0700 Alexander H Duyck wrote: >>> Page pool (PP) is just a cache of pages. The driver octeontx2 (in link) >>> is creating an excessive large cache of pages. The drivers RX >>> descriptor ring size should be independent of the PP ptr_ring size, as >>> it is just a cache that grows as a functions of the in-flight packet >>> workload, it functions as a "shock absorber". >>> >>> 32768 pages (4KiB) is approx 128 MiB, and this will be per RX-queue. >>> >>> The RX-desc ring (obviously) pins down these pages (immediately), but PP >>> ring starts empty. As the workload varies the "shock absorber" effect >>> will let more pages into the system, that will travel the PP ptr_ring. >>> As all pages originating from the same PP instance will get recycled, >>> the in-flight pages in the "system" (PP ptr_ring) will grow over time. >>> >>> The PP design have the problem that it never releases or reduces pages >>> in this shock absorber "closed" system. (Cc. PP people/devel) we should >>> consider implementing a MM shrinker callback (include/linux/shrinker.h). >>> >>> Are the systems using driver octeontx2 ready to handle 128MiB memory per >>> RX-queue getting pinned down overtime? (this could lead to some strange >>> do debug situation if the memory is not sufficient) >> >> I'm with Jesper on this. It doesn't make sense to be tying the >> page_pool size strictly to the ring size. The amount of recycling you >> get will depend on how long the packets are on the stack, not in the >> driver. >> Thanks for agreeing with me, and I agree with you :-) >> For example, in the case of something like a software router or bridge >> that is just taking the Rx packets and routing them to Tx you could >> theoretically get away with a multiple of NAPI_POLL_WEIGHT since you >> would likely never need much more than that as the Tx would likely be >> cleaned about as fast as the Rx can consume the pages. >> I agree. >> Rather than overriding the size here wouldn't it make more sense to do >> it in the octeontx2 driver? With that at least you would know that you >> were the one that limited the size instead of having the value modified >> out from underneath you. >> I'm not fully agreeing here. I don't think we can expect driver developer to be experts on page_pool cache dynamics. I'm more on Jakub's side here, as perhaps we/net-core can come up with some control system, even if this means we change this underneath drivers. >> That said, one change that might help to enable this kind of change >> would be look at adding a #define so that this value wouldn't be so >> much a magic number and would be visible to the drivers should it ever >> be changed in the future. > > All the points y'all making are valid, sizing the cache is a hard > problem. But the proposed solution goes in the wrong direction, IMO. > The driver doesn't know. I started hacking together page pool control > over netlink. I think that the pool size selection logic should be in > the core, with inputs taken from user space / workload (via netlink). > > If it wasn't for the fact that I'm working on that API I'd probably > side with you. And 64k descriptors is impractically large. > > Copy / pasting from the discussion on previous version: > > Tuning this in the driver relies on the assumption that the HW / > driver is the thing that matters. I'd think that the workload, > platform (CPU) and config (e.g. is IOMMU enabled?) will matter at > least as much. While driver developers will end up tuning to whatever > servers they have, random single config and most likely.. iperf. > > IMO it's much better to re-purpose "pool_size" and treat it as the ring > size, because that's what most drivers end up putting there. I disagree here, as driver developers should not treat "pool_size" as the ring size. It seems to be a copy-paste-programming scheme without understanding PP dynamics. > Defer tuning of the effective ring size to the core and user input > (via the "it will be added any minute now" netlink API for configuring > page pools)... > I agree here, that tuning ring size is a hard problem, and this is better handled in the core. Happy to hear, that/if Jakub is working on this. > So capping the recycle ring to 32k instead of returning the error seems > like an okay solution for now. As a temporary solution, I'm actually fine with capping at 32k. Driver developer loose some feedback control, but perhaps that is okay, if we can agree that the net-core should control tuning this anyhow. --Jesper
> From: Jesper Dangaard Brouer <hawk@kernel.org> > Sent: Tuesday, August 8, 2023 1:42 AM > As a temporary solution, I'm actually fine with capping at 32k. > Driver developer loose some feedback control, but perhaps that is okay, if > we can agree that the net-core should control tuning this anyhow. Capping will never let user know that memory is unnecessarily wasted as there is not much Correlation between ring size and page pool size. I would prefer not setting pool->p.pool_size in Octeontx2 (pool->p.pool_size = 0 ) driver and let page pool infra decide on that. Is this change acceptable ? 100G link burst traffic would be able to handle with 1024 page pool size, do we have any data with any other driver ? -Ratheesh
From: Jesper Dangaard Brouer <hawk@kernel.org> Date: Mon, 7 Aug 2023 22:11:35 +0200 > > > On 07/08/2023 19.20, Jakub Kicinski wrote: >> On Mon, 07 Aug 2023 07:18:21 -0700 Alexander H Duyck wrote: >>>> Page pool (PP) is just a cache of pages. The driver octeontx2 (in >>>> link) >>>> is creating an excessive large cache of pages. The drivers RX >>>> descriptor ring size should be independent of the PP ptr_ring size, as >>>> it is just a cache that grows as a functions of the in-flight packet >>>> workload, it functions as a "shock absorber". >>>> >>>> 32768 pages (4KiB) is approx 128 MiB, and this will be per RX-queue. >>>> >>>> The RX-desc ring (obviously) pins down these pages (immediately), >>>> but PP >>>> ring starts empty. As the workload varies the "shock absorber" effect >>>> will let more pages into the system, that will travel the PP ptr_ring. >>>> As all pages originating from the same PP instance will get recycled, >>>> the in-flight pages in the "system" (PP ptr_ring) will grow over time. >>>> >>>> The PP design have the problem that it never releases or reduces pages >>>> in this shock absorber "closed" system. (Cc. PP people/devel) we should >>>> consider implementing a MM shrinker callback >>>> (include/linux/shrinker.h). >>>> >>>> Are the systems using driver octeontx2 ready to handle 128MiB memory >>>> per >>>> RX-queue getting pinned down overtime? (this could lead to some strange >>>> do debug situation if the memory is not sufficient) >>> >>> I'm with Jesper on this. It doesn't make sense to be tying the >>> page_pool size strictly to the ring size. The amount of recycling you >>> get will depend on how long the packets are on the stack, not in the >>> driver. >>> > > Thanks for agreeing with me, and I agree with you :-) > >>> For example, in the case of something like a software router or bridge >>> that is just taking the Rx packets and routing them to Tx you could >>> theoretically get away with a multiple of NAPI_POLL_WEIGHT since you >>> would likely never need much more than that as the Tx would likely be >>> cleaned about as fast as the Rx can consume the pages. >>> > > I agree. > >>> Rather than overriding the size here wouldn't it make more sense to do >>> it in the octeontx2 driver? With that at least you would know that you >>> were the one that limited the size instead of having the value modified >>> out from underneath you. >>> > > I'm not fully agreeing here. I don't think we can expect driver > developer to be experts on page_pool cache dynamics. I'm more on > Jakub's side here, as perhaps we/net-core can come up with some control > system, even if this means we change this underneath drivers. > > >>> That said, one change that might help to enable this kind of change >>> would be look at adding a #define so that this value wouldn't be so >>> much a magic number and would be visible to the drivers should it ever >>> be changed in the future. >> >> All the points y'all making are valid, sizing the cache is a hard >> problem. But the proposed solution goes in the wrong direction, IMO. >> The driver doesn't know. I started hacking together page pool control >> over netlink. I think that the pool size selection logic should be in >> the core, with inputs taken from user space / workload (via netlink). >> >> If it wasn't for the fact that I'm working on that API I'd probably >> side with you. And 64k descriptors is impractically large. >> >> Copy / pasting from the discussion on previous version: >> >> Tuning this in the driver relies on the assumption that the HW / >> driver is the thing that matters. I'd think that the workload, >> platform (CPU) and config (e.g. is IOMMU enabled?) will matter at >> least as much. While driver developers will end up tuning to whatever >> servers they have, random single config and most likely.. iperf. >> >> IMO it's much better to re-purpose "pool_size" and treat it as the >> ring >> size, because that's what most drivers end up putting there. > > I disagree here, as driver developers should not treat "pool_size" as > the ring size. It seems to be a copy-paste-programming scheme without > understanding PP dynamics. +1. That's why I wrote in the previous thread that pool_size must be the minimum value which gives optimal performance. I don't believe Otx2 HW needs 32k entries in PP's ptr_ring to have optimal performance. That's why I wrote that developers must check whether there's any benefit in using bigger pool_size values. Values bigger than 2k don't seem reasonable to me, especially now that we use direct recycling way more aggressively -- often times ptr_ring is left unused at all. Jakub thought that my "pls test whether bigger sizes make sense" meant "please tune Page Pool to your servers", not exactly what I wanted to say =\ I said only "please keep pool_size reasonable, it's your right to have 2^32 descriptors on the ring, but don't do that with Page Pool". > >> Defer tuning of the effective ring size to the core and user input >> (via the "it will be added any minute now" netlink API for configuring >> page pools)... >> > > I agree here, that tuning ring size is a hard problem, and this is > better handled in the core. Happy to hear, that/if Jakub is working on > this. > >> So capping the recycle ring to 32k instead of returning the error >> seems >> like an okay solution for now. > > As a temporary solution, I'm actually fine with capping at 32k. > Driver developer loose some feedback control, but perhaps that is okay, > if we can agree that the net-core should control tuning this anyhow. > > --Jesper Thanks, Olek
diff --git a/net/core/page_pool.c b/net/core/page_pool.c index 5d615a169718..404f835a94be 100644 --- a/net/core/page_pool.c +++ b/net/core/page_pool.c @@ -182,9 +182,9 @@ static int page_pool_init(struct page_pool *pool, if (pool->p.pool_size) ring_qsize = pool->p.pool_size; - /* Sanity limit mem that can be pinned down */ + /* Clamp to 32K */ if (ring_qsize > 32768) - return -E2BIG; + ring_qsize = 32768; /* DMA direction is either DMA_FROM_DEVICE or DMA_BIDIRECTIONAL. * DMA_BIDIRECTIONAL is for allowing page used for DMA sending,
https://lore.kernel.org/netdev/20230804133512.4dbbbc16@kernel.org/T/ Capping the recycle ring to 32k instead of returning the error. Suggested-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Ratheesh Kannoth <rkannoth@marvell.com> --- net/core/page_pool.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-)