Message ID | 6b6f65957c59f86a353fc09a5127e83a32ab5999.1664350652.git.pabeni@redhat.com
---|---
State | Accepted
Commit | dbae2b062824fc2d35ae2d5df2f500626c758e80
Delegated to | Netdev Maintainers
Series | [net-next,v4] net: skb: introduce and use a single page frag cache
On Wed, Sep 28, 2022 at 1:43 AM Paolo Abeni <pabeni@redhat.com> wrote:
>
> After commit 3226b158e67c ("net: avoid 32 x truesize under-estimation
> for tiny skbs") we are observing 10-20% regressions in performance
> tests with small packets. The perf trace points to high pressure on
> the slab allocator.
>
> [...]
>
> Suggested-by: Eric Dumazet <eric.dumazet@gmail.com>
> Suggested-by: Alexander H Duyck <alexanderduyck@fb.com>
> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
> ---

Reviewed-by: Eric Dumazet <edumazet@google.com>

Thanks !
On Wed, 2022-09-28 at 10:43 +0200, Paolo Abeni wrote:
> After commit 3226b158e67c ("net: avoid 32 x truesize under-estimation
> for tiny skbs") we are observing 10-20% regressions in performance
> tests with small packets. The perf trace points to high pressure on
> the slab allocator.
>
> [...]
>
> v1 -> v2:
> - better page_frag_alloc_1k() (Alex & Eric)
> - avoid duplicate code and gfp_flags misuse in __napi_alloc_skb() (Alex)
> ---

Looks good to me.

Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
Hello:

This patch was applied to netdev/net-next.git (master)
by Jakub Kicinski <kuba@kernel.org>:

On Wed, 28 Sep 2022 10:43:09 +0200 you wrote:
> After commit 3226b158e67c ("net: avoid 32 x truesize under-estimation
> for tiny skbs") we are observing 10-20% regressions in performance
> tests with small packets. The perf trace points to high pressure on
> the slab allocator.
>
> This change tries to improve the allocation schema for small packets
> using an idea originally suggested by Eric: a new per CPU page frag is
> introduced and used in __napi_alloc_skb to cope with small allocation
> requests.
>
> [...]

Here is the summary with links:
  - [net-next,v4] net: skb: introduce and use a single page frag cache
    https://git.kernel.org/netdev/net-next/c/dbae2b062824

You are awesome, thank you!
On Thu, Sep 29, 2022 at 7:21 PM <patchwork-bot+netdevbpf@kernel.org> wrote:
>
> Hello:
>
> This patch was applied to netdev/net-next.git (master)
> by Jakub Kicinski <kuba@kernel.org>:
>
> [...]
>
> Here is the summary with links:
>   - [net-next,v4] net: skb: introduce and use a single page frag cache
>     https://git.kernel.org/netdev/net-next/c/dbae2b062824

Paolo, this patch adds a regression for TCP RPC workloads (aka TCP_RR)

Before the patch, cpus servicing NIC interrupts were allocating
SLAB/SLUB objects for incoming packets, but they were also freeing
skbs from TCP rtx queues when ACK packets were processed. SLAB/SLUB
caches were efficient (hit ratio close to 100%)

After the patch, these CPUs only free skbs from TCP rtx queues and
constantly have to drain their alien caches, thus competing with the
mm spinlocks. RX skbs allocations being done by page frag allocation
only left kfree(~1KB) calls.

One way to avoid the asymmetric behavior would be to switch TCP to
also use page frags for TX skbs, allocated from tcp_stream_alloc_skb()
Hello,

On Fri, 2022-09-30 at 09:43 -0700, Eric Dumazet wrote:
> On Thu, Sep 29, 2022 at 7:21 PM <patchwork-bot+netdevbpf@kernel.org> wrote:
> >
> > [...]
> >
> > Here is the summary with links:
> >   - [net-next,v4] net: skb: introduce and use a single page frag cache
> >     https://git.kernel.org/netdev/net-next/c/dbae2b062824
>
> Paolo, this patch adds a regression for TCP RPC workloads (aka TCP_RR)
>
> Before the patch, cpus servicing NIC interrupts were allocating
> SLAB/SLUB objects for incoming packets, but they were also freeing
> skbs from TCP rtx queues when ACK packets were processed. SLAB/SLUB
> caches were efficient (hit ratio close to 100%)

Thank you for the report. Is that reproducible with netperf TCP_RR and
CONFIG_DEBUG_SLAB, I guess? Do I need specific request/response sizes?

Do you think a revert will be needed for 6.1?

> After the patch, these CPUs only free skbs from TCP rtx queues and
> constantly have to drain their alien caches, thus competing with the
> mm spinlocks. RX skbs allocations being done by page frag allocation
> only left kfree(~1KB) calls.
>
> One way to avoid the asymmetric behavior would be to switch TCP to
> also use page frags for TX skbs, allocated from tcp_stream_alloc_skb()

I guess we should have:

	if (<alloc size is small and NAPI_HAS_SMALL_PAGE>)
		<use small page frag>
	else
		<use current allocator>

right in tcp_stream_alloc_skb()? or all the way down to __alloc_skb()?

Thanks!

Paolo
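For illustration, a minimal C sketch of the dispatch Paolo outlines above; the helper names and the 1K threshold mirror the RFC patch he posts later in this thread, while the runtime NAPI_HAS_SMALL_PAGE_FRAG check (instead of the compile-time #if the RFC uses) is a simplification made here for brevity, not the applied code:

```c
/* Hypothetical dispatch helper following the pseudocode above:
 * small TX skb heads would come from the per-CPU 1K page frag cache
 * (only meaningful on 4K-page builds), everything else keeps using
 * the existing fclone allocator.
 */
static inline struct sk_buff *alloc_skb_fclone_frag(unsigned int size,
						    gfp_t priority)
{
	/* SKB_WITH_OVERHEAD(1024): payload that still fits a 1K fragment
	 * once skb_shared_info has been accounted for.
	 */
	if (NAPI_HAS_SMALL_PAGE_FRAG && size <= SKB_WITH_OVERHEAD(1024))
		return __alloc_skb_fclone_frag(size, priority);

	/* fall back to the current allocator for larger requests */
	return alloc_skb_fclone(size, priority);
}
```

In the RFC below the same decision is taken at build time, with the small-frag variant only compiled in when PAGE_SIZE == SZ_4K.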
On Fri, Sep 30, 2022 at 10:30 AM Paolo Abeni <pabeni@redhat.com> wrote:
>
> Hello,
>
> On Fri, 2022-09-30 at 09:43 -0700, Eric Dumazet wrote:
> > [...]
> >
> > Paolo, this patch adds a regression for TCP RPC workloads (aka TCP_RR)
> >
> > Before the patch, cpus servicing NIC interrupts were allocating
> > SLAB/SLUB objects for incoming packets, but they were also freeing
> > skbs from TCP rtx queues when ACK packets were processed. SLAB/SLUB
> > caches were efficient (hit ratio close to 100%)
>
> Thank you for the report. Is that reproducible with netperf TCP_RR and
> CONFIG_DEBUG_SLAB, I guess? Do I need specific request/response sizes?

No CONFIG_DEBUG_SLAB, simply standard SLAB, and tcp_rr tests on an AMD
host with 256 cpus...

> Do you think a revert will be needed for 6.1?

No need for a revert, I am sure we can add a followup.

> > After the patch, these CPUs only free skbs from TCP rtx queues and
> > constantly have to drain their alien caches, thus competing with the
> > mm spinlocks. RX skbs allocations being done by page frag allocation
> > only left kfree(~1KB) calls.
> >
> > One way to avoid the asymmetric behavior would be to switch TCP to
> > also use page frags for TX skbs, allocated from tcp_stream_alloc_skb()
>
> I guess we should have:

Note that typical skb allocated from tcp sendmsg() have size==0 (all
payload is put in skb frag, not in skb->head)

>	if (<alloc size is small and NAPI_HAS_SMALL_PAGE>)
>		<use small page frag>
>	else
>		<use current allocator>
>
> right in tcp_stream_alloc_skb()? or all the way down to __alloc_skb()?

We could first try in tcp_stream_alloc_skb()
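As context for Eric's size == 0 remark, this is roughly how the TCP sendmsg path requested skbs around the v6.0/v6.1 kernels discussed here; the wrapper and the exact argument list are reconstructed from memory and should be treated as an assumption, not a quote of the kernel source:

```c
/* Condensed rendition of the tcp_sendmsg_locked() allocation step,
 * ~v6.0 era. With size == 0, tcp_stream_alloc_skb() only reserves
 * MAX_TCP_HEADER headroom in skb->head; the user payload is attached
 * later as page frags, not copied into the linear area.
 */
static struct sk_buff *demo_tcp_tx_alloc(struct sock *sk, bool first_skb)
{
	return tcp_stream_alloc_skb(sk, 0, sk->sk_allocation, first_skb);
}
```

That is why the small-frag path only has to cover the MAX_TCP_HEADER-sized linear area of TX skbs, which fits comfortably in a 1K fragment.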
On Fri, 2022-09-30 at 10:45 -0700, Eric Dumazet wrote:
> On Fri, Sep 30, 2022 at 10:30 AM Paolo Abeni <pabeni@redhat.com> wrote:
> > On Fri, 2022-09-30 at 09:43 -0700, Eric Dumazet wrote:
> > > Paolo, this patch adds a regression for TCP RPC workloads (aka TCP_RR)
> > >
> > > Before the patch, cpus servicing NIC interrupts were allocating
> > > SLAB/SLUB objects for incoming packets, but they were also freeing
> > > skbs from TCP rtx queues when ACK packets were processed. SLAB/SLUB
> > > caches were efficient (hit ratio close to 100%)
> >
> > Thank you for the report. Is that reproducible with netperf TCP_RR and
> > CONFIG_DEBUG_SLAB, I guess? Do I need specific request/response sizes?
>
> No CONFIG_DEBUG_SLAB, simply standard SLAB, and tcp_rr tests on an AMD
> host with 256 cpus...

The most similar host I can easily grab is a 2 numa nodes AMD with 16
cores/32 threads each.

I tried tcp_rr with different number of flows in the (1-2048) range with
both slub (I tried that first because it is the default allocator in my
builds) and slab, but so far I can't reproduce it: results differences
between the pre-patch and post-patch kernel are within noise and numastat
never shows misses.

I'm likely missing some ingredient in the recipe. I'll try next to pin
the netperf/netserver processes on a numa node different from the NIC's
one and to increase the number of concurrent flows.

I'm also wondering, after commit 68822bdf76f10 ("net: generalize skb
freeing deferral to per-cpu lists") the CPUs servicing the NIC
interrupts both allocate and (defer) free the memory for the incoming
packets, so they should not have to access remote caches ?!? Or at
least isn't the allocator behavior always asymmetric, with rx alloc, rx
free, tx free on the same core and tx alloc possibly on a different
one?

> We could first try in tcp_stream_alloc_skb()

I'll try to test something like the following - after reproducing the
issue.
---
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index f15d5b62539b..d5e9be98e8bd 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -1308,6 +1308,30 @@ static inline struct sk_buff *alloc_skb_fclone(unsigned int size,
         return __alloc_skb(size, priority, SKB_ALLOC_FCLONE, NUMA_NO_NODE);
 }
 
+#if PAGE_SIZE == SZ_4K
+
+#define NAPI_HAS_SMALL_PAGE_FRAG 1
+
+struct sk_buff *__alloc_skb_fclone_frag(unsigned int size, gfp_t priority);
+
+static inline struct sk_buff *alloc_skb_fclone_frag(unsigned int size, gfp_t priority)
+{
+        if (size <= SKB_WITH_OVERHEAD(1024))
+                return __alloc_skb_fclone_frag(size, priority);
+
+        return alloc_skb_fclone(size, priority);
+}
+
+#else
+#define NAPI_HAS_SMALL_PAGE_FRAG 0
+
+static inline struct sk_buff *alloc_skb_fclone_frag(unsigned int size, gfp_t priority)
+{
+        return alloc_skb_fclone(size, priority);
+}
+
+#endif
+
 struct sk_buff *skb_morph(struct sk_buff *dst, struct sk_buff *src);
 void skb_headers_offset_update(struct sk_buff *skb, int off);
 int skb_copy_ubufs(struct sk_buff *skb, gfp_t gfp_mask);
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 81d63f95e865..0c63653c9951 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -134,9 +134,8 @@ static void skb_under_panic(struct sk_buff *skb, unsigned int sz, void *addr)
 #define NAPI_SKB_CACHE_BULK     16
 #define NAPI_SKB_CACHE_HALF     (NAPI_SKB_CACHE_SIZE / 2)
 
-#if PAGE_SIZE == SZ_4K
+#if NAPI_HAS_SMALL_PAGE_FRAG
 
-#define NAPI_HAS_SMALL_PAGE_FRAG 1
 #define NAPI_SMALL_PAGE_PFMEMALLOC(nc) ((nc).pfmemalloc)
 
 /* specialized page frag allocator using a single order 0 page
@@ -173,12 +172,12 @@ static void *page_frag_alloc_1k(struct page_frag_1k *nc, gfp_t gfp)
         nc->offset = offset;
         return nc->va + offset;
 }
+
 #else
 
 /* the small page is actually unused in this build; add dummy helpers
  * to please the compiler and avoid later preprocessor's conditionals
  */
-#define NAPI_HAS_SMALL_PAGE_FRAG 0
 #define NAPI_SMALL_PAGE_PFMEMALLOC(nc) false
 
 struct page_frag_1k {
@@ -543,6 +542,52 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
 }
 EXPORT_SYMBOL(__alloc_skb);
 
+#if NAPI_HAS_SMALL_PAGE_FRAG
+/* optimized skb fast-clone allocation variant, using the small
+ * page frag cache
+ */
+struct sk_buff *__alloc_skb_fclone_frag(unsigned int size, gfp_t gfp_mask)
+{
+        struct sk_buff_fclones *fclones;
+        struct napi_alloc_cache *nc;
+        struct sk_buff *skb;
+        bool pfmemalloc;
+        void *data;
+
+        /* Get the HEAD */
+        skb = kmem_cache_alloc_node(skbuff_fclone_cache, gfp_mask & ~GFP_DMA, NUMA_NO_NODE);
+        if (unlikely(!skb))
+                return NULL;
+        prefetchw(skb);
+
+        /* We can access the napi cache only in BH context */
+        nc = this_cpu_ptr(&napi_alloc_cache);
+        local_bh_disable();
+        data = page_frag_alloc_1k(&nc->page_small, gfp_mask);
+        pfmemalloc = nc->page_small.pfmemalloc;
+        local_bh_enable();
+        if (unlikely(!data))
+                goto nodata;
+
+        prefetchw(data + SKB_WITH_OVERHEAD(SZ_1K));
+
+        memset(skb, 0, offsetof(struct sk_buff, tail));
+        __build_skb_around(skb, data, SZ_1K);
+        skb->pfmemalloc = pfmemalloc;
+        skb->fclone = SKB_FCLONE_ORIG;
+        skb->head_frag = 1;
+
+        fclones = container_of(skb, struct sk_buff_fclones, skb1);
+        refcount_set(&fclones->fclone_ref, 1);
+
+        return skb;
+
+nodata:
+        kmem_cache_free(skbuff_fclone_cache, skb);
+        return NULL;
+}
+#endif
+
 /**
  *      __netdev_alloc_skb - allocate an skbuff for rx on a specific device
  *      @dev: network device to receive on
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 829beee3fa32..3bcc3e1d9b19 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -858,7 +858,7 @@ struct sk_buff *tcp_stream_alloc_skb(struct sock *sk, int size, gfp_t gfp,
 {
         struct sk_buff *skb;
 
-        skb = alloc_skb_fclone(size + MAX_TCP_HEADER, gfp);
+        skb = alloc_skb_fclone_frag(size + MAX_TCP_HEADER, gfp);
         if (likely(skb)) {
                 bool mem_scheduled;
On Sun, 2022-10-02 at 16:55 +0200, Paolo Abeni wrote:
> On Fri, 2022-09-30 at 10:45 -0700, Eric Dumazet wrote:
> > On Fri, Sep 30, 2022 at 10:30 AM Paolo Abeni <pabeni@redhat.com> wrote:
> > > On Fri, 2022-09-30 at 09:43 -0700, Eric Dumazet wrote:
> > > > Paolo, this patch adds a regression for TCP RPC workloads (aka TCP_RR)
> > > >
> > > > [...]
> >
> > No CONFIG_DEBUG_SLAB, simply standard SLAB, and tcp_rr tests on an AMD
> > host with 256 cpus...
>
> The most similar host I can easily grab is a 2 numa nodes AMD with 16
> cores/32 threads each.
>
> I tried tcp_rr with different number of flows in the (1-2048) range with
> both slub (I tried that first because it is the default allocator in my
> builds) and slab, but so far I can't reproduce it: results differences
> between the pre-patch and post-patch kernel are within noise and numastat
> never shows misses.
>
> I'm likely missing some ingredient in the recipe. I'll try next to pin
> the netperf/netserver processes on a numa node different from the NIC's
> one and to increase the number of concurrent flows.

I'm still stuck trying to reproduce the issue. I tried pinning and
increasing the flows numbers, but I could not find a scenario with a
clear regression. I see some contention, but it's related to the timer
wheel spinlock, and independent from my patch.

What kind of delta should I observe? Could you please share any
additional setup hints?

> I'm also wondering, after commit 68822bdf76f10 ("net: generalize skb
> freeing deferral to per-cpu lists") the CPUs servicing the NIC
> interrupts both allocate and (defer) free the memory for the incoming
> packets, so they should not have to access remote caches ?!? Or at
> least isn't the allocator behavior always asymmetric, with rx alloc, rx
> free, tx free on the same core and tx alloc possibly on a different
> one?
>
> > We could first try in tcp_stream_alloc_skb()
>
> I'll try to test something like the following - after reproducing the
> issue.
> ---
> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> index f15d5b62539b..d5e9be98e8bd 100644
>
> [...]
>
> +struct sk_buff *__alloc_skb_fclone_frag(unsigned int size, gfp_t gfp_mask)
> +{
>
> [...]
>
> +        /* We can access the napi cache only in BH context */
> +        nc = this_cpu_ptr(&napi_alloc_cache);
> +        local_bh_disable();

For the record, the above is wrong, this_cpu_ptr must be after the
local_bh_disable() call.

Thanks,

Paolo
On Mon, Oct 3, 2022 at 2:40 PM Paolo Abeni <pabeni@redhat.com> wrote:
>
> On Sun, 2022-10-02 at 16:55 +0200, Paolo Abeni wrote:
> > On Fri, 2022-09-30 at 10:45 -0700, Eric Dumazet wrote:
> > > No CONFIG_DEBUG_SLAB, simply standard SLAB, and tcp_rr tests on an AMD
> > > host with 256 cpus...
> >
> > The most similar host I can easily grab is a 2 numa nodes AMD with 16
> > cores/32 threads each.
> >
> > [...]
> >
> > I'm likely missing some ingredient in the recipe. I'll try next to pin
> > the netperf/netserver processes on a numa node different from the NIC's
> > one and to increase the number of concurrent flows.
>
> I'm still stuck trying to reproduce the issue. I tried pinning and
> increasing the flows numbers, but I could not find a scenario with a
> clear regression. I see some contention, but it's related to the timer
> wheel spinlock, and independent from my patch.

I was using neper ( https://github.com/google/neper ), with 100
threads and 4000 flows.

I suspect contention is hard to see unless you use a host with at
least 128 cores.

Also you need a lot of NIC receive queues (or use RPS to spread
incoming packets to many cores)

> What kind of delta should I observe? Could you please share any
> additional setup hints?
>
> > I'm also wondering, after commit 68822bdf76f10 ("net: generalize skb
> > freeing deferral to per-cpu lists") the CPUs servicing the NIC
> > interrupts both allocate and (defer) free the memory for the incoming
> > packets, so they should not have to access remote caches ?!? Or at
> > least isn't the allocator behavior always asymmetric, with rx alloc, rx
> > free, tx free on the same core and tx alloc possibly on a different
> > one?
> >
> > > We could first try in tcp_stream_alloc_skb()
> >
> > I'll try to test something like the following - after reproducing the
> > issue.
> > ---
> > diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> > index f15d5b62539b..d5e9be98e8bd 100644
> >
> > [...]
> >
> > +        /* We can access the napi cache only in BH context */
> > +        nc = this_cpu_ptr(&napi_alloc_cache);
> > +        local_bh_disable();
>
> For the record, the above is wrong, this_cpu_ptr must be after the
> local_bh_disable() call.

Can you send me (privately maybe) an updated patch ?

Thanks.
On Mon, 2022-10-03 at 15:30 -0700, Eric Dumazet wrote:
> I was using neper ( https://github.com/google/neper ), with 100
> threads and 4000 flows.
>
> I suspect contention is hard to see unless you use a host with at
> least 128 cores.
> Also you need a lot of NIC receive queues (or use RPS to spread
> incoming packets to many cores)

Nice tool! It looks like it can load the host very easily.

I still can't observe the regression on my H/W, but I see some minor
improvement with the proposed patch, FWIW.

I'll try to look for a host with more cores, but that can take a long -
possibly unbounded - amount of time.

Cheers,

Paolo
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 9f42fc871c3b..a1938560192a 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -3822,6 +3822,7 @@ void netif_receive_skb_list(struct list_head *head);
 gro_result_t napi_gro_receive(struct napi_struct *napi, struct sk_buff *skb);
 void napi_gro_flush(struct napi_struct *napi, bool flush_old);
 struct sk_buff *napi_get_frags(struct napi_struct *napi);
+void napi_get_frags_check(struct napi_struct *napi);
 gro_result_t napi_gro_frags(struct napi_struct *napi);
 struct packet_offload *gro_find_receive_by_type(__be16 type);
 struct packet_offload *gro_find_complete_by_type(__be16 type);
diff --git a/net/core/dev.c b/net/core/dev.c
index d66c73c1c734..fa53830d0683 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -6358,23 +6358,6 @@ int dev_set_threaded(struct net_device *dev, bool threaded)
 }
 EXPORT_SYMBOL(dev_set_threaded);
 
-/* Double check that napi_get_frags() allocates skbs with
- * skb->head being backed by slab, not a page fragment.
- * This is to make sure bug fixed in 3226b158e67c
- * ("net: avoid 32 x truesize under-estimation for tiny skbs")
- * does not accidentally come back.
- */
-static void napi_get_frags_check(struct napi_struct *napi)
-{
-        struct sk_buff *skb;
-
-        local_bh_disable();
-        skb = napi_get_frags(napi);
-        WARN_ON_ONCE(skb && skb->head_frag);
-        napi_free_frags(napi);
-        local_bh_enable();
-}
-
 void netif_napi_add_weight(struct net_device *dev, struct napi_struct *napi,
                            int (*poll)(struct napi_struct *, int), int weight)
 {
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index f1b8b20fc20b..81d63f95e865 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -134,8 +134,66 @@ static void skb_under_panic(struct sk_buff *skb, unsigned int sz, void *addr)
 #define NAPI_SKB_CACHE_BULK     16
 #define NAPI_SKB_CACHE_HALF     (NAPI_SKB_CACHE_SIZE / 2)
 
+#if PAGE_SIZE == SZ_4K
+
+#define NAPI_HAS_SMALL_PAGE_FRAG 1
+#define NAPI_SMALL_PAGE_PFMEMALLOC(nc) ((nc).pfmemalloc)
+
+/* specialized page frag allocator using a single order 0 page
+ * and slicing it into 1K sized fragment. Constrained to systems
+ * with a very limited amount of 1K fragments fitting a single
+ * page - to avoid excessive truesize underestimation
+ */
+
+struct page_frag_1k {
+        void *va;
+        u16 offset;
+        bool pfmemalloc;
+};
+
+static void *page_frag_alloc_1k(struct page_frag_1k *nc, gfp_t gfp)
+{
+        struct page *page;
+        int offset;
+
+        offset = nc->offset - SZ_1K;
+        if (likely(offset >= 0))
+                goto use_frag;
+
+        page = alloc_pages_node(NUMA_NO_NODE, gfp, 0);
+        if (!page)
+                return NULL;
+
+        nc->va = page_address(page);
+        nc->pfmemalloc = page_is_pfmemalloc(page);
+        offset = PAGE_SIZE - SZ_1K;
+        page_ref_add(page, offset / SZ_1K);
+
+use_frag:
+        nc->offset = offset;
+        return nc->va + offset;
+}
+#else
+
+/* the small page is actually unused in this build; add dummy helpers
+ * to please the compiler and avoid later preprocessor's conditionals
+ */
+#define NAPI_HAS_SMALL_PAGE_FRAG 0
+#define NAPI_SMALL_PAGE_PFMEMALLOC(nc) false
+
+struct page_frag_1k {
+};
+
+static void *page_frag_alloc_1k(struct page_frag_1k *nc, gfp_t gfp_mask)
+{
+        return NULL;
+}
+
+#endif
+
 struct napi_alloc_cache {
         struct page_frag_cache page;
+        struct page_frag_1k page_small;
         unsigned int skb_count;
         void *skb_cache[NAPI_SKB_CACHE_SIZE];
 };
@@ -143,6 +201,23 @@ struct napi_alloc_cache {
 static DEFINE_PER_CPU(struct page_frag_cache, netdev_alloc_cache);
 static DEFINE_PER_CPU(struct napi_alloc_cache, napi_alloc_cache);
 
+/* Double check that napi_get_frags() allocates skbs with
+ * skb->head being backed by slab, not a page fragment.
+ * This is to make sure bug fixed in 3226b158e67c
+ * ("net: avoid 32 x truesize under-estimation for tiny skbs")
+ * does not accidentally come back.
+ */
+void napi_get_frags_check(struct napi_struct *napi)
+{
+        struct sk_buff *skb;
+
+        local_bh_disable();
+        skb = napi_get_frags(napi);
+        WARN_ON_ONCE(!NAPI_HAS_SMALL_PAGE_FRAG && skb && skb->head_frag);
+        napi_free_frags(napi);
+        local_bh_enable();
+}
+
 void *__napi_alloc_frag_align(unsigned int fragsz, unsigned int align_mask)
 {
         struct napi_alloc_cache *nc = this_cpu_ptr(&napi_alloc_cache);
@@ -561,6 +636,7 @@ struct sk_buff *__napi_alloc_skb(struct napi_struct *napi, unsigned int len,
 {
         struct napi_alloc_cache *nc;
         struct sk_buff *skb;
+        bool pfmemalloc;
         void *data;
 
         DEBUG_NET_WARN_ON_ONCE(!in_softirq());
@@ -568,8 +644,10 @@ struct sk_buff *__napi_alloc_skb(struct napi_struct *napi, unsigned int len,
 
         /* If requested length is either too small or too big,
          * we use kmalloc() for skb->head allocation.
+         * When the small frag allocator is available, prefer it over kmalloc
+         * for small fragments
          */
-        if (len <= SKB_WITH_OVERHEAD(1024) ||
+        if ((!NAPI_HAS_SMALL_PAGE_FRAG && len <= SKB_WITH_OVERHEAD(1024)) ||
             len > SKB_WITH_OVERHEAD(PAGE_SIZE) ||
             (gfp_mask & (__GFP_DIRECT_RECLAIM | GFP_DMA))) {
                 skb = __alloc_skb(len, gfp_mask, SKB_ALLOC_RX | SKB_ALLOC_NAPI,
@@ -580,13 +658,33 @@ struct sk_buff *__napi_alloc_skb(struct napi_struct *napi, unsigned int len,
         }
 
         nc = this_cpu_ptr(&napi_alloc_cache);
-        len += SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
-        len = SKB_DATA_ALIGN(len);
 
         if (sk_memalloc_socks())
                 gfp_mask |= __GFP_MEMALLOC;
 
-        data = page_frag_alloc(&nc->page, len, gfp_mask);
+        if (NAPI_HAS_SMALL_PAGE_FRAG && len <= SKB_WITH_OVERHEAD(1024)) {
+                /* we are artificially inflating the allocation size, but
+                 * that is not as bad as it may look like, as:
+                 * - 'len' less than GRO_MAX_HEAD makes little sense
+                 * - On most systems, larger 'len' values lead to fragment
+                 *   size above 512 bytes
+                 * - kmalloc would use the kmalloc-1k slab for such values
+                 * - Builds with smaller GRO_MAX_HEAD will very likely do
+                 *   little networking, as that implies no WiFi and no
+                 *   tunnels support, and 32 bits arches.
+                 */
+                len = SZ_1K;
+
+                data = page_frag_alloc_1k(&nc->page_small, gfp_mask);
+                pfmemalloc = NAPI_SMALL_PAGE_PFMEMALLOC(nc->page_small);
+        } else {
+                len += SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
+                len = SKB_DATA_ALIGN(len);
+
+                data = page_frag_alloc(&nc->page, len, gfp_mask);
+                pfmemalloc = nc->page.pfmemalloc;
+        }
+
         if (unlikely(!data))
                 return NULL;
 
@@ -596,7 +694,7 @@ struct sk_buff *__napi_alloc_skb(struct napi_struct *napi, unsigned int len,
                 return NULL;
         }
 
-        if (nc->page.pfmemalloc)
+        if (pfmemalloc)
                 skb->pfmemalloc = 1;
         skb->head_frag = 1;
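The "smart page refcount schema" credited to Alex in the commit message is the page_ref_add(page, offset / SZ_1K) pre-charge visible in page_frag_alloc_1k() above. Below is a self-contained user-space model of that idea, with a plain int standing in for the struct page reference count; the names are illustrative, not kernel API:

```c
#include <assert.h>
#include <stdlib.h>

#define PAGE_SZ 4096
#define FRAG_SZ 1024

/* Stand-in for struct page: a buffer plus a plain reference count. */
struct page_model {
	int refcount;
	char data[PAGE_SZ];
};

struct frag_cache {
	struct page_model *page;
	int offset;	/* allocation cursor, counts down to 0 */
};

/* Hand out 1K slices of the current page. A fresh page starts with one
 * reference per slice (1 from the allocator plus a single pre-charge of
 * PAGE_SZ/FRAG_SZ - 1), so the per-allocation fast path performs no
 * reference-count update at all.
 */
static void *frag_alloc_1k(struct frag_cache *nc)
{
	if (!nc->page || nc->offset < FRAG_SZ) {
		nc->page = calloc(1, sizeof(*nc->page));
		if (!nc->page)
			return NULL;
		nc->page->refcount = PAGE_SZ / FRAG_SZ;	/* pre-charged */
		nc->offset = PAGE_SZ;
	}
	nc->offset -= FRAG_SZ;
	return nc->page->data + nc->offset;
}

/* Freeing a slice drops one reference; the last owner frees the page.
 * (The kernel recovers the page from the fragment address via
 * virt_to_page(); here the caller passes it explicitly.)
 */
static void frag_free_1k(struct page_model *page)
{
	if (--page->refcount == 0)
		free(page);
}

int main(void)
{
	struct frag_cache nc = { 0 };
	void *a = frag_alloc_1k(&nc);
	void *b = frag_alloc_1k(&nc);

	assert(a && b);
	frag_free_1k(nc.page);	/* release slice 'a' */
	frag_free_1k(nc.page);	/* release slice 'b'; two pre-charged refs
				 * remain for the slices not yet handed out
				 */
	return 0;
}
```

With the pre-charge, handing out a fragment touches no shared counter; only the eventual free of each 1K slice decrements the page reference, and the last free releases the whole page, which also keeps pfmemalloc pages correctly accounted in the kernel version.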
After commit 3226b158e67c ("net: avoid 32 x truesize under-estimation
for tiny skbs") we are observing 10-20% regressions in performance
tests with small packets. The perf trace points to high pressure on
the slab allocator.

This change tries to improve the allocation schema for small packets
using an idea originally suggested by Eric: a new per CPU page frag is
introduced and used in __napi_alloc_skb to cope with small allocation
requests.

To ensure that the above does not lead to excessive truesize
underestimation, the frag size for small allocation is inflated to 1K
and all the above is restricted to build with 4K page size.

Note that we need to update accordingly the run-time check introduced
with commit fd9ea57f4e95 ("net: add napi_get_frags_check() helper").

Alex suggested a smart page refcount schema to reduce the number
of atomic operations and deal properly with pfmemalloc pages.

Under small packet UDP flood, I measure a 15% peak tput increases.

Suggested-by: Eric Dumazet <eric.dumazet@gmail.com>
Suggested-by: Alexander H Duyck <alexanderduyck@fb.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
v3 -> v4:
 - NAPI_HAS_SMALL_PAGE <-> 4K page (Eric)
 - fixed typos in comments (Eric)

v2 -> v3:
 - updated Alex email address
 - fixed build with !NAPI_HAS_SMALL_PAGE_FRAG

v1 -> v2:
 - better page_frag_alloc_1k() (Alex & Eric)
 - avoid duplicate code and gfp_flags misuse in __napi_alloc_skb() (Alex)
---
 include/linux/netdevice.h |   1 +
 net/core/dev.c            |  17 ------
 net/core/skbuff.c         | 108 ++++++++++++++++++++++++++++++++++++--
 3 files changed, 104 insertions(+), 22 deletions(-)
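To see where the new cache ends up being exercised, here is a hedged sketch of a generic NAPI poll handler; the demo_* structures and helpers are made up for illustration, while napi_alloc_skb(), skb_put_data(), eth_type_trans(), napi_gro_receive() and napi_complete_done() are the standard kernel entry points. On 4K-page builds, any napi_alloc_skb() request up to SKB_WITH_OVERHEAD(1024) bytes is now served from the per-CPU 1K frag cache instead of the kmalloc-512/kmalloc-1k slabs:

```c
/* Hypothetical copy-break rx path: small frames copied into a
 * napi_alloc_skb() skb now get skb->head from page_frag_alloc_1k().
 */
static int demo_rx_poll(struct napi_struct *napi, int budget)
{
	struct demo_ring *ring = container_of(napi, struct demo_ring, napi);
	int done = 0;

	while (done < budget) {
		struct demo_desc *desc = demo_next_completed(ring);
		struct sk_buff *skb;

		if (!desc)
			break;

		/* len <= SKB_WITH_OVERHEAD(1024) hits the new small
		 * page frag path inside __napi_alloc_skb()
		 */
		skb = napi_alloc_skb(napi, desc->len);
		if (unlikely(!skb))
			break;

		skb_put_data(skb, desc->buf, desc->len);
		skb->protocol = eth_type_trans(skb, ring->netdev);
		napi_gro_receive(napi, skb);
		done++;
	}

	if (done < budget)
		napi_complete_done(napi, done);

	return done;
}
```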