[net-next,v4] net: skb: introduce and use a single page frag cache

Message ID 6b6f65957c59f86a353fc09a5127e83a32ab5999.1664350652.git.pabeni@redhat.com (mailing list archive)
State Accepted
Commit dbae2b062824fc2d35ae2d5df2f500626c758e80
Delegated to: Netdev Maintainers
Headers show
Series [net-next,v4] net: skb: introduce and use a single page frag cache | expand

Checks

Context Check Description
netdev/tree_selection success Clearly marked for net-next, async
netdev/fixes_present success Fixes tag not required for -next series
netdev/subject_prefix success Link
netdev/cover_letter success Single patches do not need cover letters
netdev/patch_count success Link
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 4377 this patch: 4377
netdev/cc_maintainers warning 1 maintainers not CCed: petrm@nvidia.com
netdev/build_clang success Errors and warnings before: 1057 this patch: 1057
netdev/module_param success Was 0 now: 0
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn success Errors and warnings before: 4566 this patch: 4566
netdev/checkpatch success total: 0 errors, 0 warnings, 0 checks, 181 lines checked
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0

Commit Message

Paolo Abeni Sept. 28, 2022, 8:43 a.m. UTC
After commit 3226b158e67c ("net: avoid 32 x truesize under-estimation
for tiny skbs") we are observing 10-20% regressions in performance
tests with small packets. The perf trace points to high pressure on
the slab allocator.

This change tries to improve the allocation schema for small packets
using an idea originally suggested by Eric: a new per CPU page frag is
introduced and used in __napi_alloc_skb to cope with small allocation
requests.

To ensure that the above does not lead to excessive truesize
underestimation, the frag size for small allocations is inflated to 1K
and all the above is restricted to builds with a 4K page size.

Note that we need to update the run-time check introduced with
commit fd9ea57f4e95 ("net: add napi_get_frags_check() helper") accordingly.

Alex suggested a smart page refcount schema to reduce the number
of atomic operations and deal properly with pfmemalloc pages.

Under small packet UDP flood, I measured a 15% peak tput increase.

Suggested-by: Eric Dumazet <eric.dumazet@gmail.com>
Suggested-by: Alexander H Duyck <alexanderduyck@fb.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
v3 -> v4:
 - NAPI_HAS_SMALL_PAGE <-> 4K page (Eric)
 - fixed typos in comments (Eric)

v2 -> v3:
 - updated Alex email address
 - fixed build with !NAPI_HAS_SMALL_PAGE_FRAG

v1 -> v2:
 - better page_frag_alloc_1k() (Alex & Eric)
 - avoid duplicate code and gfp_flags misuse in __napi_alloc_skb() (Alex)
---
 include/linux/netdevice.h |   1 +
 net/core/dev.c            |  17 ------
 net/core/skbuff.c         | 108 ++++++++++++++++++++++++++++++++++++--
 3 files changed, 104 insertions(+), 22 deletions(-)
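
The "smart page refcount schema" mentioned above corresponds to
page_frag_alloc_1k() in the patch at the bottom of this page. As a minimal
sketch of the refill path, with names taken from that patch:

	/* Charge the page refcount up front for all the 1K fragments of a
	 * freshly allocated order-0 page (the page comes with one reference;
	 * add one more per extra fragment). The per-fragment fast path then
	 * only moves nc->offset and never touches the refcount.
	 */
	page = alloc_pages_node(NUMA_NO_NODE, gfp, 0);
	nc->va = page_address(page);
	nc->pfmemalloc = page_is_pfmemalloc(page);
	offset = PAGE_SIZE - SZ_1K;
	page_ref_add(page, offset / SZ_1K);
	nc->offset = offset;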

Comments

Eric Dumazet Sept. 28, 2022, 2:11 p.m. UTC | #1
On Wed, Sep 28, 2022 at 1:43 AM Paolo Abeni <pabeni@redhat.com> wrote:
>
> After commit 3226b158e67c ("net: avoid 32 x truesize under-estimation
> for tiny skbs") we are observing 10-20% regressions in performance
> tests with small packets. The perf trace points to high pressure on
> the slab allocator.
>
> This change tries to improve the allocation schema for small packets
> using an idea originally suggested by Eric: a new per CPU page frag is
> introduced and used in __napi_alloc_skb to cope with small allocation
> requests.
>
> To ensure that the above does not lead to excessive truesize
> underestimation, the frag size for small allocation is inflated to 1K
> and all the above is restricted to build with 4K page size.
>
> Note that we need to update accordingly the run-time check introduced
> with commit fd9ea57f4e95 ("net: add napi_get_frags_check() helper").
>
> Alex suggested a smart page refcount schema to reduce the number
> of atomic operations and deal properly with pfmemalloc pages.
>
> Under small packet UDP flood, I measure a 15% peak tput increases.
>
> Suggested-by: Eric Dumazet <eric.dumazet@gmail.com>
> Suggested-by: Alexander H Duyck <alexanderduyck@fb.com>
> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
> ---

Reviewed-by: Eric Dumazet <edumazet@google.com>

Thanks !
Alexander Duyck Sept. 28, 2022, 2:24 p.m. UTC | #2
On Wed, 2022-09-28 at 10:43 +0200, Paolo Abeni wrote:
> After commit 3226b158e67c ("net: avoid 32 x truesize under-estimation
> for tiny skbs") we are observing 10-20% regressions in performance
> tests with small packets. The perf trace points to high pressure on
> the slab allocator.
> 
> This change tries to improve the allocation schema for small packets
> using an idea originally suggested by Eric: a new per CPU page frag is
> introduced and used in __napi_alloc_skb to cope with small allocation
> requests.
> 
> To ensure that the above does not lead to excessive truesize
> underestimation, the frag size for small allocation is inflated to 1K
> and all the above is restricted to build with 4K page size.
> 
> Note that we need to update accordingly the run-time check introduced
> with commit fd9ea57f4e95 ("net: add napi_get_frags_check() helper").
> 
> Alex suggested a smart page refcount schema to reduce the number
> of atomic operations and deal properly with pfmemalloc pages.
> 
> Under small packet UDP flood, I measure a 15% peak tput increases.
> 
> Suggested-by: Eric Dumazet <eric.dumazet@gmail.com>
> Suggested-by: Alexander H Duyck <alexanderduyck@fb.com>
> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
> ---
> v3 -> v4:
>  - NAPI_HAS_SMALL_PAGE <-> 4K page (Eric)
>  - fixed typos in comments (Eric)
> 
> v2 -> v3:
>  - updated Alex email address
>  - fixed build with !NAPI_HAS_SMALL_PAGE_FRAG
> 
> v1 -> v2:
>  - better page_frag_alloc_1k() (Alex & Eric)
>  - avoid duplicate code and gfp_flags misuse in __napi_alloc_skb() (Alex)
> ---


Looks good to me.

Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
patchwork-bot+netdevbpf@kernel.org Sept. 30, 2022, 2:21 a.m. UTC | #3
Hello:

This patch was applied to netdev/net-next.git (master)
by Jakub Kicinski <kuba@kernel.org>:

On Wed, 28 Sep 2022 10:43:09 +0200 you wrote:
> After commit 3226b158e67c ("net: avoid 32 x truesize under-estimation
> for tiny skbs") we are observing 10-20% regressions in performance
> tests with small packets. The perf trace points to high pressure on
> the slab allocator.
> 
> This change tries to improve the allocation schema for small packets
> using an idea originally suggested by Eric: a new per CPU page frag is
> introduced and used in __napi_alloc_skb to cope with small allocation
> requests.
> 
> [...]

Here is the summary with links:
  - [net-next,v4] net: skb: introduce and use a single page frag cache
    https://git.kernel.org/netdev/net-next/c/dbae2b062824

You are awesome, thank you!
Eric Dumazet Sept. 30, 2022, 4:43 p.m. UTC | #4
On Thu, Sep 29, 2022 at 7:21 PM <patchwork-bot+netdevbpf@kernel.org> wrote:
>
> Hello:
>
> This patch was applied to netdev/net-next.git (master)
> by Jakub Kicinski <kuba@kernel.org>:
>
> On Wed, 28 Sep 2022 10:43:09 +0200 you wrote:
> > After commit 3226b158e67c ("net: avoid 32 x truesize under-estimation
> > for tiny skbs") we are observing 10-20% regressions in performance
> > tests with small packets. The perf trace points to high pressure on
> > the slab allocator.
> >
> > This change tries to improve the allocation schema for small packets
> > using an idea originally suggested by Eric: a new per CPU page frag is
> > introduced and used in __napi_alloc_skb to cope with small allocation
> > requests.
> >
> > [...]
>
> Here is the summary with links:
>   - [net-next,v4] net: skb: introduce and use a single page frag cache
>     https://git.kernel.org/netdev/net-next/c/dbae2b062824
>

Paolo, this patch introduces a regression for TCP RPC workloads (aka TCP_RR).

Before the patch, CPUs servicing NIC interrupts were allocating
SLAB/SLUB objects for incoming packets, but they were also freeing
skbs from TCP rtx queues when ACK packets were processed, so the
SLAB/SLUB caches were efficient (hit ratio close to 100%).

After the patch, these CPUs only free skbs from TCP rtx queues and
constantly have to drain their alien caches, thus competing with the
mm spinlocks. With RX skb allocations now served by the page frag
allocator, only the kfree(~1KB) calls are left.

One way to avoid the asymmetric behavior would be to switch TCP to
also use page frags for TX skbs, allocated from tcp_stream_alloc_skb().
Paolo Abeni Sept. 30, 2022, 5:30 p.m. UTC | #5
Hello,

On Fri, 2022-09-30 at 09:43 -0700, Eric Dumazet wrote:
> On Thu, Sep 29, 2022 at 7:21 PM <patchwork-bot+netdevbpf@kernel.org> wrote:
> > 
> > Hello:
> > 
> > This patch was applied to netdev/net-next.git (master)
> > by Jakub Kicinski <kuba@kernel.org>:
> > 
> > On Wed, 28 Sep 2022 10:43:09 +0200 you wrote:
> > > After commit 3226b158e67c ("net: avoid 32 x truesize under-estimation
> > > for tiny skbs") we are observing 10-20% regressions in performance
> > > tests with small packets. The perf trace points to high pressure on
> > > the slab allocator.
> > > 
> > > This change tries to improve the allocation schema for small packets
> > > using an idea originally suggested by Eric: a new per CPU page frag is
> > > introduced and used in __napi_alloc_skb to cope with small allocation
> > > requests.
> > > 
> > > [...]
> > 
> > Here is the summary with links:
> >   - [net-next,v4] net: skb: introduce and use a single page frag cache
> >     https://git.kernel.org/netdev/net-next/c/dbae2b062824
> > 
> 
> Paolo, this patch adds a regression for TCP RPC workloads (aka TCP_RR)
> 
> Before the patch, cpus servicing NIC interrupts were allocating
> SLAB/SLUB objects for incoming packets,
> but they were also freeing skbs from TCP rtx queues when ACK packets
> were processed. SLAB/SLUB caches
> were efficient (hit ratio close to 100%)

Thank you for the report. Is that reproducible with netperf TCP_RR and
CONFIG_DEBUG_SLAB, I guess? Do I need specific request/response sizes?

Do you think a revert will be needed for 6.1?

> After the patch, these CPU only free skbs from TCP rtx queues and
> constantly have to drain their alien caches,
> thus competing with the mm spinlocks. RX skbs allocations being done
> by page frag allocation only left kfree(~1KB) calls.
> 
> One way to avoid the asymmetric behavior would be to switch TCP to
> also use page frags for TX skbs,
> allocated from tcp_stream_alloc_skb()

I guess we should have:

	if (<alloc size is small and NAPI_HAS_SMALL_PAGE>)
		<use small page frag>
	else
		<use current allocator>

right in tcp_stream_alloc_skb()? or all the way down to __alloc_skb()?

Thanks!

Paolo



>
Eric Dumazet Sept. 30, 2022, 5:45 p.m. UTC | #6
On Fri, Sep 30, 2022 at 10:30 AM Paolo Abeni <pabeni@redhat.com> wrote:
>
> Hello,
>
> On Fri, 2022-09-30 at 09:43 -0700, Eric Dumazet wrote:
> > On Thu, Sep 29, 2022 at 7:21 PM <patchwork-bot+netdevbpf@kernel.org> wrote:
> > >
> > > Hello:
> > >
> > > This patch was applied to netdev/net-next.git (master)
> > > by Jakub Kicinski <kuba@kernel.org>:
> > >
> > > On Wed, 28 Sep 2022 10:43:09 +0200 you wrote:
> > > > After commit 3226b158e67c ("net: avoid 32 x truesize under-estimation
> > > > for tiny skbs") we are observing 10-20% regressions in performance
> > > > tests with small packets. The perf trace points to high pressure on
> > > > the slab allocator.
> > > >
> > > > This change tries to improve the allocation schema for small packets
> > > > using an idea originally suggested by Eric: a new per CPU page frag is
> > > > introduced and used in __napi_alloc_skb to cope with small allocation
> > > > requests.
> > > >
> > > > [...]
> > >
> > > Here is the summary with links:
> > >   - [net-next,v4] net: skb: introduce and use a single page frag cache
> > >     https://git.kernel.org/netdev/net-next/c/dbae2b062824
> > >
> >
> > Paolo, this patch adds a regression for TCP RPC workloads (aka TCP_RR)
> >
> > Before the patch, cpus servicing NIC interrupts were allocating
> > SLAB/SLUB objects for incoming packets,
> > but they were also freeing skbs from TCP rtx queues when ACK packets
> > were processed. SLAB/SLUB caches
> > were efficient (hit ratio close to 100%)
>
> Thank you for the report. Is that reproducible with netperf TCP_RR and
> CONFIG_DEBUG_SLAB, I guess? Do I need specific request/response sizes?

No CONFIG_DEBUG_SLAB, simply standard SLAB, and tcp_rr tests on an AMD
host with 256 CPUs...


>
> Do you think a revert will be needed for 6.1?

No need for a revert, I am sure we can add a followup.

>
> > After the patch, these CPU only free skbs from TCP rtx queues and
> > constantly have to drain their alien caches,
> > thus competing with the mm spinlocks. RX skbs allocations being done
> > by page frag allocation only left kfree(~1KB) calls.
> >
> > One way to avoid the asymmetric behavior would be to switch TCP to
> > also use page frags for TX skbs,
> > allocated from tcp_stream_alloc_skb()
>
> I guess we should have:
>

Note that typical skbs allocated from tcp_sendmsg() have size == 0 (all
the payload is put in skb frags, not in skb->head).
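
As a purely illustrative sketch of that point (not the exact caller code;
the last argument name, force_schedule, is an assumption here):

	/* TX skbs built by tcp_sendmsg() request no payload room in
	 * skb->head: size == 0, so the head only has to fit MAX_TCP_HEADER.
	 */
	skb = tcp_stream_alloc_skb(sk, 0, sk->sk_allocation, force_schedule);
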

>         if (<alloc size is small and NAPI_HAS_SMALL_PAGE>)
>                 <use small page frag>
>         else
>                 <use current allocator>
>
> right in tcp_stream_alloc_skb()? or all the way down to __alloc_skb()?

We could first try in tcp_stream_alloc_skb()

>
> Thanks!
>
> Paolo
>
>
>
> >
>
Paolo Abeni Oct. 2, 2022, 2:55 p.m. UTC | #7
On Fri, 2022-09-30 at 10:45 -0700, Eric Dumazet wrote:
> On Fri, Sep 30, 2022 at 10:30 AM Paolo Abeni <pabeni@redhat.com> wrote:
> > On Fri, 2022-09-30 at 09:43 -0700, Eric Dumazet wrote:
> > > Paolo, this patch adds a regression for TCP RPC workloads (aka TCP_RR)
> > > 
> > > Before the patch, cpus servicing NIC interrupts were allocating
> > > SLAB/SLUB objects for incoming packets,
> > > but they were also freeing skbs from TCP rtx queues when ACK packets
> > > were processed. SLAB/SLUB caches
> > > were efficient (hit ratio close to 100%)
> > 
> > Thank you for the report. Is that reproducible with netperf TCP_RR and
> > CONFIG_DEBUG_SLAB, I guess? Do I need specific request/response sizes?
> 
> No CONFIG_DEBUG_SLAB, simply standard SLAB, and tcp_rr tests on an AMD
> host with 256 cpus...

The most similar host I can easily grab is an AMD with 2 NUMA nodes and
16 cores/32 threads each.

I tried tcp_rr with different numbers of flows in the 1-2048 range with
both slub (I tried that first because it is the default allocator in my
builds) and slab, but so far I can't reproduce it: the differences
between the pre-patch and post-patch kernels are within noise and
numastat never shows misses.

I'm likely missing some ingredient in the recipe. I'll try next to pin
the netperf/netserver processes on a NUMA node different from the NIC's
one and to increase the number of concurrent flows.

I'm also wondering: after commit 68822bdf76f10 ("net: generalize skb
freeing deferral to per-cpu lists") the CPUs servicing the NIC
interrupts both allocate and (defer) free the memory for the incoming
packets, so they should not have to access remote caches?! Or at least,
isn't the allocator behavior always asymmetric, with rx alloc, rx free
and tx free on the same core and tx alloc possibly on a different one?

> 
> We could first try in tcp_stream_alloc_skb()

I'll try to test something like the following - after reproducing the
issue.
---
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index f15d5b62539b..d5e9be98e8bd 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -1308,6 +1308,30 @@ static inline struct sk_buff *alloc_skb_fclone(unsigned int size,
 	return __alloc_skb(size, priority, SKB_ALLOC_FCLONE, NUMA_NO_NODE);
 }
 
+#if PAGE_SIZE == SZ_4K
+
+#define NAPI_HAS_SMALL_PAGE_FRAG	1
+
+struct sk_buff *__alloc_skb_fclone_frag(unsigned int size, gfp_t priority);
+
+static inline struct sk_buff *alloc_skb_fclone_frag(unsigned int size, gfp_t priority)
+{
+	if (size <= SKB_WITH_OVERHEAD(1024))
+		return __alloc_skb_fclone_frag(size, priority);
+
+	return alloc_skb_fclone(size, priority);
+}
+
+#else
+#define NAPI_HAS_SMALL_PAGE_FRAG	0
+
+static inline struct sk_buff *alloc_skb_fclone_frag(unsigned int size, gfp_t priority)
+{
+	return alloc_skb_fclone(size, priority);
+}
+
+#endif
+
 struct sk_buff *skb_morph(struct sk_buff *dst, struct sk_buff *src);
 void skb_headers_offset_update(struct sk_buff *skb, int off);
 int skb_copy_ubufs(struct sk_buff *skb, gfp_t gfp_mask);
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 81d63f95e865..0c63653c9951 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -134,9 +134,8 @@ static void skb_under_panic(struct sk_buff *skb, unsigned int sz, void *addr)
 #define NAPI_SKB_CACHE_BULK	16
 #define NAPI_SKB_CACHE_HALF	(NAPI_SKB_CACHE_SIZE / 2)
 
-#if PAGE_SIZE == SZ_4K
+#if NAPI_HAS_SMALL_PAGE_FRAG
 
-#define NAPI_HAS_SMALL_PAGE_FRAG	1
 #define NAPI_SMALL_PAGE_PFMEMALLOC(nc)	((nc).pfmemalloc)
 
 /* specialized page frag allocator using a single order 0 page
@@ -173,12 +172,12 @@ static void *page_frag_alloc_1k(struct page_frag_1k *nc, gfp_t gfp)
 	nc->offset = offset;
 	return nc->va + offset;
 }
+
 #else
 
 /* the small page is actually unused in this build; add dummy helpers
  * to please the compiler and avoid later preprocessor's conditionals
  */
-#define NAPI_HAS_SMALL_PAGE_FRAG	0
 #define NAPI_SMALL_PAGE_PFMEMALLOC(nc)	false
 
 struct page_frag_1k {
@@ -543,6 +542,52 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
 }
 EXPORT_SYMBOL(__alloc_skb);
 
+#if NAPI_HAS_SMALL_PAGE_FRAG
+/* optimized skb fast-clone allocation variant, using the small
+ * page frag cache
+ */
+struct sk_buff *__alloc_skb_fclone_frag(unsigned int size, gfp_t gfp_mask)
+{
+	struct sk_buff_fclones *fclones;
+	struct napi_alloc_cache *nc;
+	struct sk_buff *skb;
+	bool pfmemalloc;
+	void *data;
+
+	/* Get the HEAD */
+	skb = kmem_cache_alloc_node(skbuff_fclone_cache, gfp_mask & ~GFP_DMA, NUMA_NO_NODE);
+	if (unlikely(!skb))
+		return NULL;
+	prefetchw(skb);
+
+	/* We can access the napi cache only in BH context */
+	nc = this_cpu_ptr(&napi_alloc_cache);
+	local_bh_disable();
+	data = page_frag_alloc_1k(&nc->page_small, gfp_mask);
+	pfmemalloc = nc->page_small.pfmemalloc;
+	local_bh_enable();
+	if (unlikely(!data))
+		goto nodata;
+
+	prefetchw(data + SKB_WITH_OVERHEAD(SZ_1K));
+
+	memset(skb, 0, offsetof(struct sk_buff, tail));
+	__build_skb_around(skb, data, SZ_1K);
+	skb->pfmemalloc = pfmemalloc;
+	skb->fclone = SKB_FCLONE_ORIG;
+	skb->head_frag = 1;
+
+	fclones = container_of(skb, struct sk_buff_fclones, skb1);
+	refcount_set(&fclones->fclone_ref, 1);
+
+	return skb;
+
+nodata:
+	kmem_cache_free(skbuff_fclone_cache, skb);
+	return NULL;
+}
+#endif
+
 /**
  *	__netdev_alloc_skb - allocate an skbuff for rx on a specific device
  *	@dev: network device to receive on
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 829beee3fa32..3bcc3e1d9b19 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -858,7 +858,7 @@ struct sk_buff *tcp_stream_alloc_skb(struct sock *sk, int size, gfp_t gfp,
 {
 	struct sk_buff *skb;
 
-	skb = alloc_skb_fclone(size + MAX_TCP_HEADER, gfp);
+	skb = alloc_skb_fclone_frag(size + MAX_TCP_HEADER, gfp);
 	if (likely(skb)) {
 		bool mem_scheduled;
Paolo Abeni Oct. 3, 2022, 9:40 p.m. UTC | #8
On Sun, 2022-10-02 at 16:55 +0200, Paolo Abeni wrote:
> On Fri, 2022-09-30 at 10:45 -0700, Eric Dumazet wrote:
> > On Fri, Sep 30, 2022 at 10:30 AM Paolo Abeni <pabeni@redhat.com> wrote:
> > > On Fri, 2022-09-30 at 09:43 -0700, Eric Dumazet wrote:
> > > > Paolo, this patch adds a regression for TCP RPC workloads (aka TCP_RR)
> > > > 
> > > > Before the patch, cpus servicing NIC interrupts were allocating
> > > > SLAB/SLUB objects for incoming packets,
> > > > but they were also freeing skbs from TCP rtx queues when ACK packets
> > > > were processed. SLAB/SLUB caches
> > > > were efficient (hit ratio close to 100%)
> > > 
> > > Thank you for the report. Is that reproducible with netperf TCP_RR and
> > > CONFIG_DEBUG_SLAB, I guess? Do I need specific request/response sizes?
> > 
> > No CONFIG_DEBUG_SLAB, simply standard SLAB, and tcp_rr tests on an AMD
> > host with 256 cpus...
> 
> The most similar host I can easily grab is a 2 numa nodes AMD with 16
> cores/32 threads each.
> 
> I tried tcp_rr with different number of flows in (1-2048) range with
> both slub (I tried first that because is the default allocator in my
> builds) and slab but so far I can't reproduce it: results differences
> between pre-patch and post-patch kernel are within noise and numastat
> never show misses.
> 
> I'm likely missing some incredient to the recipe. I'll try next to pin
> the netperf/netserver processes on a numa node different from the NIC's
> one and to increase the number of concurrent flows.

I'm still stuck trying to reproduce the issue. I tried pinning and
increasing the number of flows, but I could not find a scenario with a
clear regression. I see some contention, but it's related to the timer
wheel spinlock and independent of my patch.

What kind of delta should I observe? Could you please share any
additional setup hints?

> I'm also wondering, after commit 68822bdf76f10 ("net: generalize skb
> freeing deferral to per-cpu lists") the CPUs servicing the NIC
> interrupts both allocate and (defer) free the memory for the incoming
> packets,  so they should not have to access remote caches ?!? Or at
> least isn't the allocator behavior always asymmetric, with rx alloc, rx
> free, tx free on the same core and tx alloc possibly on a different
> one?
> 
> > 
> > We could first try in tcp_stream_alloc_skb()
> 
> I'll try to test something alike the following - after reproducing the
> issue.
> ---
> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> index f15d5b62539b..d5e9be98e8bd 100644
> --- a/include/linux/skbuff.h
> +++ b/include/linux/skbuff.h
> @@ -1308,6 +1308,30 @@ static inline struct sk_buff *alloc_skb_fclone(unsigned int size,
>  	return __alloc_skb(size, priority, SKB_ALLOC_FCLONE, NUMA_NO_NODE);
>  }
>  
> +#if PAGE_SIZE == SZ_4K
> +
> +#define NAPI_HAS_SMALL_PAGE_FRAG	1
> +
> +struct sk_buff *__alloc_skb_fclone_frag(unsigned int size, gfp_t priority);
> +
> +static inline struct sk_buff *alloc_skb_fclone_frag(unsigned int size, gfp_t priority)
> +{
> +	if (size <= SKB_WITH_OVERHEAD(1024))
> +		return __alloc_skb_fclone_frag(size, priority);
> +
> +	return alloc_skb_fclone(size, priority);
> +}
> +
> +#else
> +#define NAPI_HAS_SMALL_PAGE_FRAG	0
> +
> +static inline struct sk_buff *alloc_skb_fclone_frag(unsigned int size, gfp_t priority)
> +{
> +	return alloc_skb_fclone(size, priority);
> +}
> +
> +#endif
> +
>  struct sk_buff *skb_morph(struct sk_buff *dst, struct sk_buff *src);
>  void skb_headers_offset_update(struct sk_buff *skb, int off);
>  int skb_copy_ubufs(struct sk_buff *skb, gfp_t gfp_mask);
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index 81d63f95e865..0c63653c9951 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -134,9 +134,8 @@ static void skb_under_panic(struct sk_buff *skb, unsigned int sz, void *addr)
>  #define NAPI_SKB_CACHE_BULK	16
>  #define NAPI_SKB_CACHE_HALF	(NAPI_SKB_CACHE_SIZE / 2)
>  
> -#if PAGE_SIZE == SZ_4K
> +#if NAPI_HAS_SMALL_PAGE_FRAG
>  
> -#define NAPI_HAS_SMALL_PAGE_FRAG	1
>  #define NAPI_SMALL_PAGE_PFMEMALLOC(nc)	((nc).pfmemalloc)
>  
>  /* specialized page frag allocator using a single order 0 page
> @@ -173,12 +172,12 @@ static void *page_frag_alloc_1k(struct page_frag_1k *nc, gfp_t gfp)
>  	nc->offset = offset;
>  	return nc->va + offset;
>  }
> +
>  #else
>  
>  /* the small page is actually unused in this build; add dummy helpers
>   * to please the compiler and avoid later preprocessor's conditionals
>   */
> -#define NAPI_HAS_SMALL_PAGE_FRAG	0
>  #define NAPI_SMALL_PAGE_PFMEMALLOC(nc)	false
>  
>  struct page_frag_1k {
> @@ -543,6 +542,52 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
>  }
>  EXPORT_SYMBOL(__alloc_skb);
>  
> +#if NAPI_HAS_SMALL_PAGE_FRAG
> +/* optimized skb fast-clone allocation variant, using the small
> + * page frag cache
> + */
> +struct sk_buff *__alloc_skb_fclone_frag(unsigned int size, gfp_t gfp_mask)
> +{
> +	struct sk_buff_fclones *fclones;
> +	struct napi_alloc_cache *nc;
> +	struct sk_buff *skb;
> +	bool pfmemalloc;
> +	void *data;
> +
> +	/* Get the HEAD */
> +	skb = kmem_cache_alloc_node(skbuff_fclone_cache, gfp_mask & ~GFP_DMA, NUMA_NO_NODE);
> +	if (unlikely(!skb))
> +		return NULL;
> +	prefetchw(skb);
> +
> +	/* We can access the napi cache only in BH context */
> +	nc = this_cpu_ptr(&napi_alloc_cache);
> +	local_bh_disable();

For the record, the above is wrong: this_cpu_ptr() must be called after
the local_bh_disable() call.
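
A minimal sketch of the corrected ordering, using the same names as in
the draft above:

	/* We can access the napi cache only in BH context: disable BH
	 * first, then resolve the per-CPU pointer, so it cannot refer to
	 * the cache of a different CPU.
	 */
	local_bh_disable();
	nc = this_cpu_ptr(&napi_alloc_cache);
	data = page_frag_alloc_1k(&nc->page_small, gfp_mask);
	pfmemalloc = nc->page_small.pfmemalloc;
	local_bh_enable();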


Thanks,

Paolo
Eric Dumazet Oct. 3, 2022, 10:30 p.m. UTC | #9
On Mon, Oct 3, 2022 at 2:40 PM Paolo Abeni <pabeni@redhat.com> wrote:
>
> On Sun, 2022-10-02 at 16:55 +0200, Paolo Abeni wrote:
> > On Fri, 2022-09-30 at 10:45 -0700, Eric Dumazet wrote:
> > > On Fri, Sep 30, 2022 at 10:30 AM Paolo Abeni <pabeni@redhat.com> wrote:
> > > > On Fri, 2022-09-30 at 09:43 -0700, Eric Dumazet wrote:
> > > > > Paolo, this patch adds a regression for TCP RPC workloads (aka TCP_RR)
> > > > >
> > > > > Before the patch, cpus servicing NIC interrupts were allocating
> > > > > SLAB/SLUB objects for incoming packets,
> > > > > but they were also freeing skbs from TCP rtx queues when ACK packets
> > > > > were processed. SLAB/SLUB caches
> > > > > were efficient (hit ratio close to 100%)
> > > >
> > > > Thank you for the report. Is that reproducible with netperf TCP_RR and
> > > > CONFIG_DEBUG_SLAB, I guess? Do I need specific request/response sizes?
> > >
> > > No CONFIG_DEBUG_SLAB, simply standard SLAB, and tcp_rr tests on an AMD
> > > host with 256 cpus...
> >
> > The most similar host I can easily grab is a 2 numa nodes AMD with 16
> > cores/32 threads each.
> >
> > I tried tcp_rr with different number of flows in (1-2048) range with
> > both slub (I tried first that because is the default allocator in my
> > builds) and slab but so far I can't reproduce it: results differences
> > between pre-patch and post-patch kernel are within noise and numastat
> > never show misses.
> >
> > I'm likely missing some incredient to the recipe. I'll try next to pin
> > the netperf/netserver processes on a numa node different from the NIC's
> > one and to increase the number of concurrent flows.
>
> I'm still stuck trying to reproduce the issue. I tried pinning and
> increasing the flows numbers, but I could not find a scenario with a
> clear regression. I see some contention, but it's related to the timer
> wheel spinlock, and independent from my patch.

I was using neper (https://github.com/google/neper), with 100
threads and 4000 flows.

I suspect the contention is hard to see unless you use a host with at
least 128 cores. You also need a lot of NIC receive queues (or use RPS
to spread incoming packets to many cores).


>
> Which kind of delta should I observe? Could you please share any
> additional setup hints?
>
> > I'm also wondering, after commit 68822bdf76f10 ("net: generalize skb
> > freeing deferral to per-cpu lists") the CPUs servicing the NIC
> > interrupts both allocate and (defer) free the memory for the incoming
> > packets,  so they should not have to access remote caches ?!? Or at
> > least isn't the allocator behavior always asymmetric, with rx alloc, rx
> > free, tx free on the same core and tx alloc possibly on a different
> > one?
> >
> > >
> > > We could first try in tcp_stream_alloc_skb()
> >
> > I'll try to test something alike the following - after reproducing the
> > issue.
> > ---
> > diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> > index f15d5b62539b..d5e9be98e8bd 100644
> > --- a/include/linux/skbuff.h
> > +++ b/include/linux/skbuff.h
> > @@ -1308,6 +1308,30 @@ static inline struct sk_buff *alloc_skb_fclone(unsigned int size,
> >       return __alloc_skb(size, priority, SKB_ALLOC_FCLONE, NUMA_NO_NODE);
> >  }
> >
> > +#if PAGE_SIZE == SZ_4K
> > +
> > +#define NAPI_HAS_SMALL_PAGE_FRAG     1
> > +
> > +struct sk_buff *__alloc_skb_fclone_frag(unsigned int size, gfp_t priority);
> > +
> > +static inline struct sk_buff *alloc_skb_fclone_frag(unsigned int size, gfp_t priority)
> > +{
> > +     if (size <= SKB_WITH_OVERHEAD(1024))
> > +             return __alloc_skb_fclone_frag(size, priority);
> > +
> > +     return alloc_skb_fclone(size, priority);
> > +}
> > +
> > +#else
> > +#define NAPI_HAS_SMALL_PAGE_FRAG     0
> > +
> > +static inline struct sk_buff *alloc_skb_fclone_frag(unsigned int size, gfp_t priority)
> > +{
> > +     return alloc_skb_fclone(size, priority);
> > +}
> > +
> > +#endif
> > +
> >  struct sk_buff *skb_morph(struct sk_buff *dst, struct sk_buff *src);
> >  void skb_headers_offset_update(struct sk_buff *skb, int off);
> >  int skb_copy_ubufs(struct sk_buff *skb, gfp_t gfp_mask);
> > diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> > index 81d63f95e865..0c63653c9951 100644
> > --- a/net/core/skbuff.c
> > +++ b/net/core/skbuff.c
> > @@ -134,9 +134,8 @@ static void skb_under_panic(struct sk_buff *skb, unsigned int sz, void *addr)
> >  #define NAPI_SKB_CACHE_BULK  16
> >  #define NAPI_SKB_CACHE_HALF  (NAPI_SKB_CACHE_SIZE / 2)
> >
> > -#if PAGE_SIZE == SZ_4K
> > +#if NAPI_HAS_SMALL_PAGE_FRAG
> >
> > -#define NAPI_HAS_SMALL_PAGE_FRAG     1
> >  #define NAPI_SMALL_PAGE_PFMEMALLOC(nc)       ((nc).pfmemalloc)
> >
> >  /* specialized page frag allocator using a single order 0 page
> > @@ -173,12 +172,12 @@ static void *page_frag_alloc_1k(struct page_frag_1k *nc, gfp_t gfp)
> >       nc->offset = offset;
> >       return nc->va + offset;
> >  }
> > +
> >  #else
> >
> >  /* the small page is actually unused in this build; add dummy helpers
> >   * to please the compiler and avoid later preprocessor's conditionals
> >   */
> > -#define NAPI_HAS_SMALL_PAGE_FRAG     0
> >  #define NAPI_SMALL_PAGE_PFMEMALLOC(nc)       false
> >
> >  struct page_frag_1k {
> > @@ -543,6 +542,52 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
> >  }
> >  EXPORT_SYMBOL(__alloc_skb);
> >
> > +#if NAPI_HAS_SMALL_PAGE_FRAG
> > +/* optimized skb fast-clone allocation variant, using the small
> > + * page frag cache
> > + */
> > +struct sk_buff *__alloc_skb_fclone_frag(unsigned int size, gfp_t gfp_mask)
> > +{
> > +     struct sk_buff_fclones *fclones;
> > +     struct napi_alloc_cache *nc;
> > +     struct sk_buff *skb;
> > +     bool pfmemalloc;
> > +     void *data;
> > +
> > +     /* Get the HEAD */
> > +     skb = kmem_cache_alloc_node(skbuff_fclone_cache, gfp_mask & ~GFP_DMA, NUMA_NO_NODE);
> > +     if (unlikely(!skb))
> > +             return NULL;
> > +     prefetchw(skb);
> > +
> > +     /* We can access the napi cache only in BH context */
> > +     nc = this_cpu_ptr(&napi_alloc_cache);
> > +     local_bh_disable();
>
> For the record, the above is wrong, this_cpu_ptr must be after the
> local_bh_disable() call.

Can you send me (privately, maybe) an updated patch?
Thanks.
Paolo Abeni Oct. 5, 2022, 7:33 a.m. UTC | #10
On Mon, 2022-10-03 at 15:30 -0700, Eric Dumazet wrote:
> I was using neper ( https://github.com/google/neper ), with 100
> threads and 4000 flows.
> 
> I suspect contention is hard to see unless you use a host with at
> least 128 cores.
> Also you need a lot of NIC receive queues (or use RPS to spread
> incoming packets to many cores)

Nice tool! It looks like it can load the host very easily.

I still can't observe the regression on my H/W but I see some minor
improvement with the proposed patch, FWIW.

I'll try to look for a host with more cores, but that could take a
long, potentially unbounded, amount of time.

> 
Cheers,

Paolo
Patch

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 9f42fc871c3b..a1938560192a 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -3822,6 +3822,7 @@  void netif_receive_skb_list(struct list_head *head);
 gro_result_t napi_gro_receive(struct napi_struct *napi, struct sk_buff *skb);
 void napi_gro_flush(struct napi_struct *napi, bool flush_old);
 struct sk_buff *napi_get_frags(struct napi_struct *napi);
+void napi_get_frags_check(struct napi_struct *napi);
 gro_result_t napi_gro_frags(struct napi_struct *napi);
 struct packet_offload *gro_find_receive_by_type(__be16 type);
 struct packet_offload *gro_find_complete_by_type(__be16 type);
diff --git a/net/core/dev.c b/net/core/dev.c
index d66c73c1c734..fa53830d0683 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -6358,23 +6358,6 @@  int dev_set_threaded(struct net_device *dev, bool threaded)
 }
 EXPORT_SYMBOL(dev_set_threaded);
 
-/* Double check that napi_get_frags() allocates skbs with
- * skb->head being backed by slab, not a page fragment.
- * This is to make sure bug fixed in 3226b158e67c
- * ("net: avoid 32 x truesize under-estimation for tiny skbs")
- * does not accidentally come back.
- */
-static void napi_get_frags_check(struct napi_struct *napi)
-{
-	struct sk_buff *skb;
-
-	local_bh_disable();
-	skb = napi_get_frags(napi);
-	WARN_ON_ONCE(skb && skb->head_frag);
-	napi_free_frags(napi);
-	local_bh_enable();
-}
-
 void netif_napi_add_weight(struct net_device *dev, struct napi_struct *napi,
 			   int (*poll)(struct napi_struct *, int), int weight)
 {
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index f1b8b20fc20b..81d63f95e865 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -134,8 +134,66 @@  static void skb_under_panic(struct sk_buff *skb, unsigned int sz, void *addr)
 #define NAPI_SKB_CACHE_BULK	16
 #define NAPI_SKB_CACHE_HALF	(NAPI_SKB_CACHE_SIZE / 2)
 
+#if PAGE_SIZE == SZ_4K
+
+#define NAPI_HAS_SMALL_PAGE_FRAG	1
+#define NAPI_SMALL_PAGE_PFMEMALLOC(nc)	((nc).pfmemalloc)
+
+/* specialized page frag allocator using a single order 0 page
+ * and slicing it into 1K sized fragment. Constrained to systems
+ * with a very limited amount of 1K fragments fitting a single
+ * page - to avoid excessive truesize underestimation
+ */
+
+struct page_frag_1k {
+	void *va;
+	u16 offset;
+	bool pfmemalloc;
+};
+
+static void *page_frag_alloc_1k(struct page_frag_1k *nc, gfp_t gfp)
+{
+	struct page *page;
+	int offset;
+
+	offset = nc->offset - SZ_1K;
+	if (likely(offset >= 0))
+		goto use_frag;
+
+	page = alloc_pages_node(NUMA_NO_NODE, gfp, 0);
+	if (!page)
+		return NULL;
+
+	nc->va = page_address(page);
+	nc->pfmemalloc = page_is_pfmemalloc(page);
+	offset = PAGE_SIZE - SZ_1K;
+	page_ref_add(page, offset / SZ_1K);
+
+use_frag:
+	nc->offset = offset;
+	return nc->va + offset;
+}
+#else
+
+/* the small page is actually unused in this build; add dummy helpers
+ * to please the compiler and avoid later preprocessor's conditionals
+ */
+#define NAPI_HAS_SMALL_PAGE_FRAG	0
+#define NAPI_SMALL_PAGE_PFMEMALLOC(nc)	false
+
+struct page_frag_1k {
+};
+
+static void *page_frag_alloc_1k(struct page_frag_1k *nc, gfp_t gfp_mask)
+{
+	return NULL;
+}
+
+#endif
+
 struct napi_alloc_cache {
 	struct page_frag_cache page;
+	struct page_frag_1k page_small;
 	unsigned int skb_count;
 	void *skb_cache[NAPI_SKB_CACHE_SIZE];
 };
@@ -143,6 +201,23 @@  struct napi_alloc_cache {
 static DEFINE_PER_CPU(struct page_frag_cache, netdev_alloc_cache);
 static DEFINE_PER_CPU(struct napi_alloc_cache, napi_alloc_cache);
 
+/* Double check that napi_get_frags() allocates skbs with
+ * skb->head being backed by slab, not a page fragment.
+ * This is to make sure bug fixed in 3226b158e67c
+ * ("net: avoid 32 x truesize under-estimation for tiny skbs")
+ * does not accidentally come back.
+ */
+void napi_get_frags_check(struct napi_struct *napi)
+{
+	struct sk_buff *skb;
+
+	local_bh_disable();
+	skb = napi_get_frags(napi);
+	WARN_ON_ONCE(!NAPI_HAS_SMALL_PAGE_FRAG && skb && skb->head_frag);
+	napi_free_frags(napi);
+	local_bh_enable();
+}
+
 void *__napi_alloc_frag_align(unsigned int fragsz, unsigned int align_mask)
 {
 	struct napi_alloc_cache *nc = this_cpu_ptr(&napi_alloc_cache);
@@ -561,6 +636,7 @@  struct sk_buff *__napi_alloc_skb(struct napi_struct *napi, unsigned int len,
 {
 	struct napi_alloc_cache *nc;
 	struct sk_buff *skb;
+	bool pfmemalloc;
 	void *data;
 
 	DEBUG_NET_WARN_ON_ONCE(!in_softirq());
@@ -568,8 +644,10 @@  struct sk_buff *__napi_alloc_skb(struct napi_struct *napi, unsigned int len,
 
 	/* If requested length is either too small or too big,
 	 * we use kmalloc() for skb->head allocation.
+	 * When the small frag allocator is available, prefer it over kmalloc
+	 * for small fragments
 	 */
-	if (len <= SKB_WITH_OVERHEAD(1024) ||
+	if ((!NAPI_HAS_SMALL_PAGE_FRAG && len <= SKB_WITH_OVERHEAD(1024)) ||
 	    len > SKB_WITH_OVERHEAD(PAGE_SIZE) ||
 	    (gfp_mask & (__GFP_DIRECT_RECLAIM | GFP_DMA))) {
 		skb = __alloc_skb(len, gfp_mask, SKB_ALLOC_RX | SKB_ALLOC_NAPI,
@@ -580,13 +658,33 @@  struct sk_buff *__napi_alloc_skb(struct napi_struct *napi, unsigned int len,
 	}
 
 	nc = this_cpu_ptr(&napi_alloc_cache);
-	len += SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
-	len = SKB_DATA_ALIGN(len);
 
 	if (sk_memalloc_socks())
 		gfp_mask |= __GFP_MEMALLOC;
 
-	data = page_frag_alloc(&nc->page, len, gfp_mask);
+	if (NAPI_HAS_SMALL_PAGE_FRAG && len <= SKB_WITH_OVERHEAD(1024)) {
+		/* we are artificially inflating the allocation size, but
+		 * that is not as bad as it may look like, as:
+		 * - 'len' less than GRO_MAX_HEAD makes little sense
+		 * - On most systems, larger 'len' values lead to fragment
+		 *   size above 512 bytes
+		 * - kmalloc would use the kmalloc-1k slab for such values
+		 * - Builds with smaller GRO_MAX_HEAD will very likely do
+		 *   little networking, as that implies no WiFi and no
+		 *   tunnels support, and 32 bits arches.
+		 */
+		len = SZ_1K;
+
+		data = page_frag_alloc_1k(&nc->page_small, gfp_mask);
+		pfmemalloc = NAPI_SMALL_PAGE_PFMEMALLOC(nc->page_small);
+	} else {
+		len += SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
+		len = SKB_DATA_ALIGN(len);
+
+		data = page_frag_alloc(&nc->page, len, gfp_mask);
+		pfmemalloc = nc->page.pfmemalloc;
+	}
+
 	if (unlikely(!data))
 		return NULL;
 
@@ -596,7 +694,7 @@  struct sk_buff *__napi_alloc_skb(struct napi_struct *napi, unsigned int len,
 		return NULL;
 	}
 
-	if (nc->page.pfmemalloc)
+	if (pfmemalloc)
 		skb->pfmemalloc = 1;
 	skb->head_frag = 1;