
[net] net: avoid 32 x truesize under-estimation for tiny skbs

Message ID 20210113161819.1155526-1-eric.dumazet@gmail.com (mailing list archive)
State Accepted
Commit 3226b158e67cfaa677fd180152bfb28989cb2fac
Delegated to: Netdev Maintainers
Series [net] net: avoid 32 x truesize under-estimation for tiny skbs

Checks

Context Check Description
netdev/cover_letter success Link
netdev/fixes_present success Link
netdev/patch_count success Link
netdev/tree_selection success Clearly marked for net
netdev/subject_prefix success Link
netdev/cc_maintainers warning 9 maintainers not CCed: elver@google.com linmiaohe@huawei.com linyunsheng@huawei.com dseok.yi@samsung.com willemb@google.com steffen.klassert@secunet.com gnault@redhat.com viro@zeniv.linux.org.uk kyk.segfault@gmail.com
netdev/source_inline success Was 0 now: 0
netdev/verify_signedoff success Link
netdev/module_param success Was 0 now: 0
netdev/build_32bit success Errors and warnings before: 1 this patch: 1
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/verify_fixes success Link
netdev/checkpatch success total: 0 errors, 0 warnings, 0 checks, 26 lines checked
netdev/build_allmodconfig_warn success Errors and warnings before: 1 this patch: 1
netdev/header_inline success Link
netdev/stable success Stable not CCed

Commit Message

Eric Dumazet Jan. 13, 2021, 4:18 p.m. UTC
From: Eric Dumazet <edumazet@google.com>

Both virtio net and napi_get_frags() allocate skbs
with a very small skb->head

While using page fragments instead of a kmalloc backed skb->head might give
a small performance improvement in some cases, there is a huge risk of
underestimating memory usage.

For both GOOD_COPY_LEN and GRO_MAX_HEAD, we can fit at least 32 allocations
per page (order-3 page in x86), or even 64 on PowerPC

We have been tracking OOM issues on GKE hosts hitting tcp_mem limits
but consuming far more memory for TCP buffers than instructed in tcp_mem[2]

Even if we force napi_alloc_skb() to only use order-0 pages, the issue
would still be there on arches with PAGE_SIZE >= 32768

This patch makes sure that small skb heads are kmalloc backed, so that
other objects in the slab page can be reused instead of being held as long
as skbs are sitting in socket queues.

Note that we might in the future use the sk_buff napi cache,
instead of going through a more expensive __alloc_skb()

Another idea would be to use separate page sizes depending
on the allocated length (to never have more than 4 frags per page)

I would like to thank Greg Thelen for his precious help on this matter;
analysing crash dumps is always a time-consuming task.

Fixes: fd11a83dd363 ("net: Pull out core bits of __netdev_alloc_skb and add __napi_alloc_skb")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Alexander Duyck <alexanderduyck@fb.com>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Greg Thelen <gthelen@google.com>
---
 net/core/skbuff.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)
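To illustrate the "32 x" figure in the title: before this patch, napi_alloc_skb() served even tiny head allocations from the per-CPU order-3 page-frag cache, so one skb whose truesize accounts for roughly 1 KB can pin a whole 32 KB page while it sits in a socket queue. A rough userspace sketch of that arithmetic (the 32 KB cache size and the ~1 KB rounded head size are illustrative assumptions for x86_64, not exact kernel constants):

#include <stdio.h>

int main(void)
{
	unsigned int frag_cache = 8 * 4096;	/* order-3 page backing the napi page-frag cache */
	unsigned int tiny_head  = 1024;		/* assumed rounded size of one small skb->head */

	printf("tiny heads per order-3 page : %u\n", frag_cache / tiny_head);
	printf("truesize charged per skb    : ~%u bytes\n", tiny_head);
	printf("memory pinned if that skb is the page's last user: %u bytes (%ux)\n",
	       frag_cache, frag_cache / tiny_head);
	return 0;
}

With the patch, requests up to SKB_WITH_OVERHEAD(1024) fall back to kmalloc(), so other slab objects in the same page can still be reused while the skb is queued.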

Comments

Alexander Duyck Jan. 13, 2021, 6 p.m. UTC | #1
On Wed, Jan 13, 2021 at 8:20 AM Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
> From: Eric Dumazet <edumazet@google.com>
>
> Both virtio net and napi_get_frags() allocate skbs
> with a very small skb->head
>
> While using page fragments instead of a kmalloc backed skb->head might give
> a small performance improvement in some cases, there is a huge risk of
> under estimating memory usage.
>
> For both GOOD_COPY_LEN and GRO_MAX_HEAD, we can fit at least 32 allocations
> per page (order-3 page in x86), or even 64 on PowerPC
>
> We have been tracking OOM issues on GKE hosts hitting tcp_mem limits
> but consuming far more memory for TCP buffers than instructed in tcp_mem[2]
>
> Even if we force napi_alloc_skb() to only use order-0 pages, the issue
> would still be there on arches with PAGE_SIZE >= 32768
>
> This patch makes sure that small skb head are kmalloc backed, so that
> other objects in the slab page can be reused instead of being held as long
> as skbs are sitting in socket queues.
>
> Note that we might in the future use the sk_buff napi cache,
> instead of going through a more expensive __alloc_skb()
>
> Another idea would be to use separate page sizes depending
> on the allocated length (to never have more than 4 frags per page)
>
> I would like to thank Greg Thelen for his precious help on this matter,
> analysing crash dumps is always a time consuming task.
>
> Fixes: fd11a83dd363 ("net: Pull out core bits of __netdev_alloc_skb and add __napi_alloc_skb")
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Cc: Alexander Duyck <alexanderduyck@fb.com>
> Cc: Paolo Abeni <pabeni@redhat.com>
> Cc: Michael S. Tsirkin <mst@redhat.com>
> Cc: Greg Thelen <gthelen@google.com>
> ---
>  net/core/skbuff.c | 9 +++++++--
>  1 file changed, 7 insertions(+), 2 deletions(-)
>
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index 7626a33cce590e530f36167bd096026916131897..3a8f55a43e6964344df464a27b9b1faa0eb804f3 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -501,13 +501,17 @@ EXPORT_SYMBOL(__netdev_alloc_skb);
>  struct sk_buff *__napi_alloc_skb(struct napi_struct *napi, unsigned int len,
>                                  gfp_t gfp_mask)
>  {
> -       struct napi_alloc_cache *nc = this_cpu_ptr(&napi_alloc_cache);
> +       struct napi_alloc_cache *nc;
>         struct sk_buff *skb;
>         void *data;
>
>         len += NET_SKB_PAD + NET_IP_ALIGN;
>
> -       if ((len > SKB_WITH_OVERHEAD(PAGE_SIZE)) ||
> +       /* If requested length is either too small or too big,
> +        * we use kmalloc() for skb->head allocation.
> +        */
> +       if (len <= SKB_WITH_OVERHEAD(1024) ||
> +           len > SKB_WITH_OVERHEAD(PAGE_SIZE) ||
>             (gfp_mask & (__GFP_DIRECT_RECLAIM | GFP_DMA))) {
>                 skb = __alloc_skb(len, gfp_mask, SKB_ALLOC_RX, NUMA_NO_NODE);
>                 if (!skb)
> @@ -515,6 +519,7 @@ struct sk_buff *__napi_alloc_skb(struct napi_struct *napi, unsigned int len,
>                 goto skb_success;
>         }
>
> +       nc = this_cpu_ptr(&napi_alloc_cache);
>         len += SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
>         len = SKB_DATA_ALIGN(len);
>

The fix here looks good to me.
Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>

I think at some point in the future we may need to follow up and do a
rework of a bunch of this code. One thing I am wondering is if we
should look at doing some sort of memory accounting per napi_struct.
Maybe it is something we could work on tying into the page pool work
that Jesper did earlier.
Michael S. Tsirkin Jan. 13, 2021, 7:19 p.m. UTC | #2
On Wed, Jan 13, 2021 at 08:18:19AM -0800, Eric Dumazet wrote:
> From: Eric Dumazet <edumazet@google.com>
> 
> Both virtio net and napi_get_frags() allocate skbs
> with a very small skb->head
> 
> While using page fragments instead of a kmalloc backed skb->head might give
> a small performance improvement in some cases, there is a huge risk of
> under estimating memory usage.
> 
> For both GOOD_COPY_LEN and GRO_MAX_HEAD, we can fit at least 32 allocations
> per page (order-3 page in x86), or even 64 on PowerPC
> 
> We have been tracking OOM issues on GKE hosts hitting tcp_mem limits
> but consuming far more memory for TCP buffers than instructed in tcp_mem[2]
> 
> Even if we force napi_alloc_skb() to only use order-0 pages, the issue
> would still be there on arches with PAGE_SIZE >= 32768
> 
> This patch makes sure that small skb head are kmalloc backed, so that
> other objects in the slab page can be reused instead of being held as long
> as skbs are sitting in socket queues.
> 
> Note that we might in the future use the sk_buff napi cache,
> instead of going through a more expensive __alloc_skb()
> 
> Another idea would be to use separate page sizes depending
> on the allocated length (to never have more than 4 frags per page)
> 
> I would like to thank Greg Thelen for his precious help on this matter,
> analysing crash dumps is always a time consuming task.
> 
> Fixes: fd11a83dd363 ("net: Pull out core bits of __netdev_alloc_skb and add __napi_alloc_skb")
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Cc: Alexander Duyck <alexanderduyck@fb.com>
> Cc: Paolo Abeni <pabeni@redhat.com>
> Cc: Michael S. Tsirkin <mst@redhat.com>
> Cc: Greg Thelen <gthelen@google.com>

Better than tweaking virtio code.

Acked-by: Michael S. Tsirkin <mst@redhat.com>

I do hope the sk_buff napi cache idea materializes in the future.

> ---
>  net/core/skbuff.c | 9 +++++++--
>  1 file changed, 7 insertions(+), 2 deletions(-)
> 
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index 7626a33cce590e530f36167bd096026916131897..3a8f55a43e6964344df464a27b9b1faa0eb804f3 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -501,13 +501,17 @@ EXPORT_SYMBOL(__netdev_alloc_skb);
>  struct sk_buff *__napi_alloc_skb(struct napi_struct *napi, unsigned int len,
>  				 gfp_t gfp_mask)
>  {
> -	struct napi_alloc_cache *nc = this_cpu_ptr(&napi_alloc_cache);
> +	struct napi_alloc_cache *nc;
>  	struct sk_buff *skb;
>  	void *data;
>  
>  	len += NET_SKB_PAD + NET_IP_ALIGN;
>  
> -	if ((len > SKB_WITH_OVERHEAD(PAGE_SIZE)) ||
> +	/* If requested length is either too small or too big,
> +	 * we use kmalloc() for skb->head allocation.
> +	 */
> +	if (len <= SKB_WITH_OVERHEAD(1024) ||
> +	    len > SKB_WITH_OVERHEAD(PAGE_SIZE) ||
>  	    (gfp_mask & (__GFP_DIRECT_RECLAIM | GFP_DMA))) {
>  		skb = __alloc_skb(len, gfp_mask, SKB_ALLOC_RX, NUMA_NO_NODE);
>  		if (!skb)
> @@ -515,6 +519,7 @@ struct sk_buff *__napi_alloc_skb(struct napi_struct *napi, unsigned int len,
>  		goto skb_success;
>  	}
>  
> +	nc = this_cpu_ptr(&napi_alloc_cache);
>  	len += SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
>  	len = SKB_DATA_ALIGN(len);
>  
> -- 
> 2.30.0.284.gd98b1dd5eaa7-goog
David Laight Jan. 13, 2021, 10:23 p.m. UTC | #3
From: Eric Dumazet
> Sent: 13 January 2021 16:18
> 
> From: Eric Dumazet <edumazet@google.com>
> 
> Both virtio net and napi_get_frags() allocate skbs
> with a very small skb->head
> 
> While using page fragments instead of a kmalloc backed skb->head might give
> a small performance improvement in some cases, there is a huge risk of
> under estimating memory usage.

There is (or was last time I looked) also a problem with
some of the USB ethernet drivers.

IIRC one of the ASXnnnnnn (???) USB3 ones allocates 64k skb to pass
to the USB stack and then just lies about skb->truesize when passing
them into the network stack.
The USB hardware will merge TCP receives and put multiple ethernet
packets into a single USB message.
But single frames can end up in very big kernel memory buffers.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
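As a side note on the truesize-lying pattern David describes, the usual honest alternatives are either to copy each frame out of the large URB buffer into a right-sized skb, or to keep referencing the buffer and inflate skb->truesize to match. A hypothetical sketch of the copy approach (the function name and parameters are illustrative, not the actual asix driver code):

#include <linux/etherdevice.h>
#include <linux/netdevice.h>
#include <linux/skbuff.h>

/* Illustrative only: build an rx skb for one Ethernet frame found inside a
 * large USB bulk-in buffer. Copying into a right-sized allocation keeps
 * skb->truesize accurate; pointing the skb into the 64k buffer instead would
 * require adding that buffer's size to truesize rather than leaving it small.
 */
static struct sk_buff *frame_to_skb(struct net_device *dev,
				    const void *frame, unsigned int len)
{
	struct sk_buff *skb = netdev_alloc_skb_ip_align(dev, len);

	if (!skb)
		return NULL;
	skb_put_data(skb, frame, len);
	skb->protocol = eth_type_trans(skb, dev);
	return skb;
}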
Eric Dumazet Jan. 14, 2021, 5:16 a.m. UTC | #4
On Wed, Jan 13, 2021 at 11:23 PM David Laight <David.Laight@aculab.com> wrote:
>
> From: Eric Dumazet
> > Sent: 13 January 2021 16:18
> >
> > From: Eric Dumazet <edumazet@google.com>
> >
> > Both virtio net and napi_get_frags() allocate skbs
> > with a very small skb->head
> >
> > While using page fragments instead of a kmalloc backed skb->head might give
> > a small performance improvement in some cases, there is a huge risk of
> > under estimating memory usage.
>
> There is (or was last time I looked) also a problem with
> some of the USB ethernet drivers.
>
> IIRC one of the ASXnnnnnn (???) USB3 ones allocates 64k skb to pass
> to the USB stack and then just lies about skb->truesize when passing
> them into the network stack.

You sure ? I think I have fixed this at some point

https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net.git/commit/?id=a9e0aca4b37885b5599e52211f098bd7f565e749

> The USB hardware will merge TCP receives and put multiple ethernet
> packets into a single USB message.
> But single frames can end up in very big kernel memory buffers.
>

Yeah, this is a known problem.

Since 2009 I have sent numerous patches addressing truesize issues,
your help will be welcomed especially if you own the hardware and test
the patches.

git log --author dumazet --grep truesize --oneline --reverse
2b85a34e911bf483c27cfdd124aeb1605145dc80 net: No more expensive
sock_hold()/sock_put() on each tx
d361fd599a991ff6c1d522a599c635b35d61ef30 net: sock_free() optimizations
daebbca3ab41031666ee27f991b223d2bc0415e9 qlcnic: dont set skb->truesize
8df8fd27123054b02007361bd5483775db84b4a8 qlcnic: dont set skb->truesize
7e96dc7045bff8758804b047c0dfb6868f182500 netxen: dont set skb->truesize
3d13008e7345fa7a79d8f6438150dc15d6ba6e9d ip: fix truesize mismatch in
ip fragmentation
7a91b434e2bad554b709265db7603b1aa52dd92e net: update SOCK_MIN_RCVBUF
87fb4b7b533073eeeaed0b6bf7c2328995f6c075 net: more accurate skb truesize
bdb28a97f46b5307e6e9351de52a9dd03e711a2f be2net: fix truesize errors
a1f4e8bcbccf50cf1894c263af4d677d4f566533 bnx2: fix skb truesize underestimation
ed64b3cc11502f50e1401f12e33d021592800bca e1000: fix skb truesize underestimation
95b9c1dfb7b929f5f3b203ed95c28bdfd069d122 igb: fix skb truesize underestimation
98130646770db42cd14c44ba0d7f2d0eb8078820 ixgbe: fix skb truesize underestimation
98a045d7e4a59db0865a59eea2140fe36bc7c220 e1000e: fix skb truesize
underestimation
7ae60b3f3b297b7f04025c93f1cb2275c3a1dfcd sky2: fix skb truesize underestimation
5935f81c595897d213afcf756e3e41af7c704f0e ftgmac100: fix skb truesize
underestimation
5e6c355c47e75314fd2282d087616069d4093ecf vmxnet3: fix skb truesize
underestimation
e7e5a4033f765e2a37095cd0a73261c99840f77e niu: fix skb truesize underestimation
96cd8951684adaa5fd72952adef532d0b42f70e1 ftmac100: fix skb truesize
underestimation
9e903e085262ffbf1fc44a17ac06058aca03524a net: add skb frag size accessors
90278c9ffb8a92672d60a618a58a99e2370a98ac mlx4_en: fix skb truesize
underestimation
7b8b59617ead5acc6ff041a9ad2ea1fe7a58094f igbvf: fix truesize underestimation
924a4c7d2e962b4e6a8e9ab3aabfd2bb29e5ada9 myri10ge: fix truesize underestimation
e1ac50f64691de9a095ac5d73cb8ac73d3d17dba bnx2x: fix skb truesize underestimation
4b727361f0bc7ee7378298941066d8aa15023ffb virtio_net: fix truesize
underestimation
e52fcb2462ac484e6dd6e68869536609f0216938 bnx2x: uses build_skb() in receive path
dd2bc8e9c0685d8eaaaf06e65919e31d60478411 bnx2: switch to build_skb()
infrastructure
9205fd9ccab8ef51ad771c1917eed7b2f2225d45 tg3: switch to build_skb()
infrastructure
570e57bcbcc4df5581b1e9c806ab2b16e96ea7d3 atm: use SKB_TRUESIZE() in
atm_guess_pdu2truesize()
f07d960df33c5aef8f513efce0fd201f962f94a1 tcp: avoid frag allocation
for small frames
0fd7bac6b6157eed6cf0cb86a1e88ba29e57c033 net: relax rcvbuf limits
de8261c2fa364397ed872fad1244d75364689168 gro: fix truesize underestimation
19c6c8f58b5840fd4757233b4849f42687d2ef3a ppp: fix truesize underestimation
a9e0aca4b37885b5599e52211f098bd7f565e749 asix: asix_rx_fixup surgery
to reduce skb truesizes
c8628155ece363487b57d33441ea0359018c0fa7 tcp: reduce out_of_order memory use
50269e19ad990e79eeda101fc6df80cffd5d4831 net: add a truesize parameter
to skb_add_rx_frag()
21dcda6083a0573686acabca39b3f92ba032d333 f_phonet: fix skb truesize
underestimation
094b5855bf37eae4b297bc47bb5bc5454f1f6fab cdc-phonet: fix skb truesize
underestimation
da882c1f2ecadb0ed582628ec1585e36b137c0f0 tcp: sk_add_backlog() is too
agressive for TCP
1402d366019fedaa2b024f2bac06b7cc9a8782e1 tcp: introduce tcp_try_coalesce
d3836f21b0af5513ef55701dd3f50b8c42e44c7a net: allow skb->head to be a
page fragment
b49960a05e32121d29316cfdf653894b88ac9190 tcp: change tcp_adv_win_scale
and tcp_rmem[2]
ed90542b0ce5415050c6fbfca324bccaafa69f2f iwlwifi: fix skb truesize
underestimation
715dc1f342713816d1be1c37643a2c9e6ee181a7 net: Fix truesize accounting
in skb_gro_receive()
3cc4949269e01f39443d0fcfffb5bc6b47878d45 ipv4: use skb coalescing in
defragmentation
ec16439e173aaf56f62bd8e175e976fbd452497b ipv6: use skb coalescing in reassembly
313b037cf054ec908de92fb4c085403ffd7420d4 gianfar: fix potential
sk_wmem_alloc imbalance
b28ba72665356438e3a6e3be365c3c3071496840 IPoIB: fix skb truesize underestimatiom
9936a7bbe56df432838fef658aea6bcfdd994021 igb: reduce Rx header size
87c084a980325d877dc7e388b8f2f26d5d3b4d01 l2tp: dont play with skb->truesize
6ff50cd55545d922f5c62776fe1feb38a9846168 tcp: gso: do not generate out
of order packets
9eb5bf838d06aa6ddebe4aca6b5cedcf2eb53b86 net: sock: fix TCP_SKB_MIN_TRUESIZE
45fe142cefa864b685615bcb930159f6749c3667 iwl3945: better skb
management in rx path
4e4f1fc226816905c937f9b29dabe351075dfe0f tcp: properly increase
rcv_ssthresh for ofo packets
400dfd3ae899849b27d398ca7894e1b44430887f net: refactor sk_page_frag_refill()
0d08c42cf9a71530fef5ebcfe368f38f2dd0476f tcp: gso: fix truesize tracking
e33d0ba8047b049c9262fdb1fcafb93cb52ceceb net-gro: reset skb->truesize
in napi_reuse_skb()
b2532eb9abd88384aa586169b54a3e53574f29f8 tcp: fix ooo_okay setting vs
Small Queues
f2d9da1a8375cbe53df5b415d059429013a3a79f bna: fix skb->truesize underestimation
9878196578286c5ed494778ada01da094377a686 tcp: do not pace pure ack packets
0cef6a4c34b56a9a6894f2dad2fad4be789990e1 tcp: give prequeue mode some care
95b58430abe74f5e50970c57d27380bd5b8be324 fq_codel: add memory
limitation per queue
008830bc321c0fc22c0db8d5b0b56f854ed90a5c net_sched: fq_codel: cache
skb->truesize into skb->cb
c9c3321257e1b95be9b375f811fb250162af8d39 tcp: add tcp_add_backlog()
a297569fe00a8fae18547061d355c45ef191b483 net/udp: do not touch
skb->peeked unless really needed
c8c8b127091b758f5768f906bcdeeb88bc9951ca udp: under rx pressure, try
to condense skbs
c84d949057cab262b4d3110ead9a42a58c2958f7 udp: copy skb->truesize in
the first cache line
158f323b9868b59967ad96957c4ca388161be321 net: adjust skb->truesize in
pskb_expand_head()
48cac18ecf1de82f76259a54402c3adb7839ad01 ipv6: orphan skbs in reassembly unit
60c7f5ae5416a8491216bcccf6b3b3d842d69fa4 mlx4: removal of frag_sizes[]
b5a54d9a313645ec9607dc557b67d9325c28884c mlx4: use order-0 pages for RX
7162fb242cb8322beb558828fd26b33c3e9fc805 tcp: do not underestimate
skb->truesize in tcp_trim_head()
c21b48cc1bbf2f5af3ef54ada559f7fadf8b508b net: adjust skb->truesize in
___pskb_trim()
d1f496fd8f34a40458d0eda6be0655926559e546 bpf: restore skb->sk before
pskb_trim() call
f6ba8d33cfbb46df569972e64dbb5bb7e929bfd9 netem: fix skb_orphan_partial()
7ec318feeed10a64c0359ec4d10889cb4defa39a tcp: gso: avoid refcount_t
warning from tcp_gso_segment()
72cd43ba64fc172a443410ce01645895850844c8 tcp: free batches of packets
in tcp_prune_ofo_queue()
4672694bd4f1aebdab0ad763ae4716e89cb15221 ipv4: frags: handle possible
skb truesize change
50ce163a72d817a99e8974222dcf2886d5deb1ae tcp: tcp_grow_window() needs
to respect tcp_space()
d7cc399e1227e74e44f78847d9732a228b46cc91 tcp: properly reset
skb->truesize for tx recycling
24adbc1676af4e134e709ddc7f34cf2adc2131e4 tcp: fix SO_RCVLOWAT hangs
with fat skbs
David Laight Jan. 14, 2021, 9:29 a.m. UTC | #5
From: Eric Dumazet
> Sent: 14 January 2021 05:17
> 
> On Wed, Jan 13, 2021 at 11:23 PM David Laight <David.Laight@aculab.com> wrote:
> >
> > From: Eric Dumazet
> > > Sent: 13 January 2021 16:18
> > >
> > > From: Eric Dumazet <edumazet@google.com>
> > >
> > > Both virtio net and napi_get_frags() allocate skbs
> > > with a very small skb->head
> > >
> > > While using page fragments instead of a kmalloc backed skb->head might give
> > > a small performance improvement in some cases, there is a huge risk of
> > > under estimating memory usage.
> >
> > There is (or was last time I looked) also a problem with
> > some of the USB ethernet drivers.
> >
> > IIRC one of the ASXnnnnnn (???) USB3 ones allocates 64k skb to pass
> > to the USB stack and then just lies about skb->truesize when passing
> > them into the network stack.
> 
> You sure ? I think I have fixed this at some point
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net.git/commit/?id=a9e0aca4b37885b5599e52211f098bd7f565e749

I might have forgotten that patch :-)
Or possibly only remembered it changing small packets.

> > The USB hardware will merge TCP receives and put multiple ethernet
> > packets into a single USB message.
> > But single frames can end up in very big kernel memory buffers.
> >
> 
> Yeah, this is a known problem.

The whole USB ethernet block is somewhat horrid and inefficient
especially for XHCI/USB3 - which could have high speed ethernet.
It really needs to either directly interface to the XHCI ring
(like a normal ethernet driver) or be given the sequence of
USB rx packets to split/join into ethernet frames.

However I don't have the time to make those changes.
When I was looking at that driver 'dayjob' was actually
trying to make it work.
They dropped that idea later.
I've not got the ethernet dongle any more either.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
patchwork-bot+netdevbpf@kernel.org Jan. 14, 2021, 7 p.m. UTC | #6
Hello:

This patch was applied to netdev/net.git (refs/heads/master):

On Wed, 13 Jan 2021 08:18:19 -0800 you wrote:
> From: Eric Dumazet <edumazet@google.com>
> 
> Both virtio net and napi_get_frags() allocate skbs
> with a very small skb->head
> 
> While using page fragments instead of a kmalloc backed skb->head might give
> a small performance improvement in some cases, there is a huge risk of
> under estimating memory usage.
> 
> [...]

Here is the summary with links:
  - [net] net: avoid 32 x truesize under-estimation for tiny skbs
    https://git.kernel.org/netdev/net/c/3226b158e67c

You are awesome, thank you!
--
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html
Eric Dumazet March 29, 2021, 9:06 a.m. UTC | #7
On Mon, Mar 29, 2021 at 10:52 AM Xuan Zhuo <xuanzhuo@linux.alibaba.com> wrote:
>
> On Wed, 13 Jan 2021 08:18:19 -0800, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > From: Eric Dumazet <edumazet@google.com>
> >
> > Both virtio net and napi_get_frags() allocate skbs
> > with a very small skb->head
> >
> > While using page fragments instead of a kmalloc backed skb->head might give
> > a small performance improvement in some cases, there is a huge risk of
> > under estimating memory usage.
> >
> > For both GOOD_COPY_LEN and GRO_MAX_HEAD, we can fit at least 32 allocations
> > per page (order-3 page in x86), or even 64 on PowerPC
> >
> > We have been tracking OOM issues on GKE hosts hitting tcp_mem limits
> > but consuming far more memory for TCP buffers than instructed in tcp_mem[2]
> >
> > Even if we force napi_alloc_skb() to only use order-0 pages, the issue
> > would still be there on arches with PAGE_SIZE >= 32768
> >
> > This patch makes sure that small skb head are kmalloc backed, so that
> > other objects in the slab page can be reused instead of being held as long
> > as skbs are sitting in socket queues.
> >
> > Note that we might in the future use the sk_buff napi cache,
> > instead of going through a more expensive __alloc_skb()
> >
> > Another idea would be to use separate page sizes depending
> > on the allocated length (to never have more than 4 frags per page)
> >
> > I would like to thank Greg Thelen for his precious help on this matter,
> > analysing crash dumps is always a time consuming task.
>
>
> This patch causes a performance degradation of about 10% in the scenario of
> virtio-net + GRO.
>
> For GRO, there is no way to merge skbs based on frags with this patch, only
> frag_list can be used to link skbs. The problem that this cause are that compared
> to the GRO package merged into the frags way, the current skb needs to call
> kfree_skb_list to release each skb, resulting in performance degradation.
>
> virtio-net will store some data onto the linear space after receiving it. In
> addition to the header, there are also some payloads, so "headlen <= offset"
> fails. And skb->head_frag is failing when use kmalloc() for skb->head allocation.
>

Thanks for the report.

There is no way we can make things both fast for existing strategies
used by _insert_your_driver
and malicious usages of data that can sit for seconds/minutes in socket queues.

I think that if you want to gain this 10% back, you have to change
virtio_net to meet optimal behavior.

Normal drivers make sure to not pull payload in skb->head, only headers.

Optimal GRO packets are when payload is in page fragments.

(I am speaking not only for raw performance, but ability for systems
to cope with network outages and sudden increase of memory usage in
out of order queues)

This has been quite clearly stated in my changelog.

Thanks.


> int skb_gro_receive(struct sk_buff *p, struct sk_buff *skb)
> {
>         struct skb_shared_info *pinfo, *skbinfo = skb_shinfo(skb);
>         unsigned int offset = skb_gro_offset(skb);
>         unsigned int headlen = skb_headlen(skb);
>
>     .......
>
>         if (headlen <= offset) {         // virtio-net will fail
>         ........ // merge by frags
>                 goto done;
>         } else if (skb->head_frag) {     // skb->head_frag is fail when use kmalloc() for skb->head allocation
>         ........ // merge by frags
>                 goto done;
>         }
>
> merge:
>     ......
>
>         if (NAPI_GRO_CB(p)->last == p)
>                 skb_shinfo(p)->frag_list = skb;
>         else
>                 NAPI_GRO_CB(p)->last->next = skb;
>
>     ......
>         return 0;
> }
>
>
> test cmd:
>  for i in $(seq 1 4)
>  do
>     redis-benchmark -r 10000000 -n 10000000 -t set -d 1024 -c 8 -P 32 -h  <ip> -p 6379 2>&1 | grep 'per second'  &
>  done
>
> Reported-by: su-lifan@linux.alibaba.com
>
> >
> > Fixes: fd11a83dd363 ("net: Pull out core bits of __netdev_alloc_skb and add __napi_alloc_skb")
> > Signed-off-by: Eric Dumazet <edumazet@google.com>
> > Cc: Alexander Duyck <alexanderduyck@fb.com>
> > Cc: Paolo Abeni <pabeni@redhat.com>
> > Cc: Michael S. Tsirkin <mst@redhat.com>
> > Cc: Greg Thelen <gthelen@google.com>
> > ---
> >  net/core/skbuff.c | 9 +++++++--
> >  1 file changed, 7 insertions(+), 2 deletions(-)
> >
> > diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> > index 7626a33cce590e530f36167bd096026916131897..3a8f55a43e6964344df464a27b9b1faa0eb804f3 100644
> > --- a/net/core/skbuff.c
> > +++ b/net/core/skbuff.c
> > @@ -501,13 +501,17 @@ EXPORT_SYMBOL(__netdev_alloc_skb);
> >  struct sk_buff *__napi_alloc_skb(struct napi_struct *napi, unsigned int len,
> >                                gfp_t gfp_mask)
> >  {
> > -     struct napi_alloc_cache *nc = this_cpu_ptr(&napi_alloc_cache);
> > +     struct napi_alloc_cache *nc;
> >       struct sk_buff *skb;
> >       void *data;
> >
> >       len += NET_SKB_PAD + NET_IP_ALIGN;
> >
> > -     if ((len > SKB_WITH_OVERHEAD(PAGE_SIZE)) ||
> > +     /* If requested length is either too small or too big,
> > +      * we use kmalloc() for skb->head allocation.
> > +      */
> > +     if (len <= SKB_WITH_OVERHEAD(1024) ||
> > +         len > SKB_WITH_OVERHEAD(PAGE_SIZE) ||
> >           (gfp_mask & (__GFP_DIRECT_RECLAIM | GFP_DMA))) {
> >               skb = __alloc_skb(len, gfp_mask, SKB_ALLOC_RX, NUMA_NO_NODE);
> >               if (!skb)
> > @@ -515,6 +519,7 @@ struct sk_buff *__napi_alloc_skb(struct napi_struct *napi, unsigned int len,
> >               goto skb_success;
> >       }
> >
> > +     nc = this_cpu_ptr(&napi_alloc_cache);
> >       len += SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
> >       len = SKB_DATA_ALIGN(len);
> >
> > --
> > 2.30.0.284.gd98b1dd5eaa7-goog
> >
Michael S. Tsirkin March 31, 2021, 8:11 a.m. UTC | #8
On Mon, Mar 29, 2021 at 11:06:09AM +0200, Eric Dumazet wrote:
> On Mon, Mar 29, 2021 at 10:52 AM Xuan Zhuo <xuanzhuo@linux.alibaba.com> wrote:
> >
> > On Wed, 13 Jan 2021 08:18:19 -0800, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > > From: Eric Dumazet <edumazet@google.com>
> > >
> > > Both virtio net and napi_get_frags() allocate skbs
> > > with a very small skb->head
> > >
> > > While using page fragments instead of a kmalloc backed skb->head might give
> > > a small performance improvement in some cases, there is a huge risk of
> > > under estimating memory usage.
> > >
> > > For both GOOD_COPY_LEN and GRO_MAX_HEAD, we can fit at least 32 allocations
> > > per page (order-3 page in x86), or even 64 on PowerPC
> > >
> > > We have been tracking OOM issues on GKE hosts hitting tcp_mem limits
> > > but consuming far more memory for TCP buffers than instructed in tcp_mem[2]
> > >
> > > Even if we force napi_alloc_skb() to only use order-0 pages, the issue
> > > would still be there on arches with PAGE_SIZE >= 32768
> > >
> > > This patch makes sure that small skb head are kmalloc backed, so that
> > > other objects in the slab page can be reused instead of being held as long
> > > as skbs are sitting in socket queues.
> > >
> > > Note that we might in the future use the sk_buff napi cache,
> > > instead of going through a more expensive __alloc_skb()
> > >
> > > Another idea would be to use separate page sizes depending
> > > on the allocated length (to never have more than 4 frags per page)
> > >
> > > I would like to thank Greg Thelen for his precious help on this matter,
> > > analysing crash dumps is always a time consuming task.
> >
> >
> > This patch causes a performance degradation of about 10% in the scenario of
> > virtio-net + GRO.
> >
> > For GRO, there is no way to merge skbs based on frags with this patch, only
> > frag_list can be used to link skbs. The problem that this cause are that compared
> > to the GRO package merged into the frags way, the current skb needs to call
> > kfree_skb_list to release each skb, resulting in performance degradation.
> >
> > virtio-net will store some data onto the linear space after receiving it. In
> > addition to the header, there are also some payloads, so "headlen <= offset"
> > fails. And skb->head_frag is failing when use kmalloc() for skb->head allocation.
> >
> 
> Thanks for the report.
> 
> There is no way we can make things both fast for existing strategies
> used by _insert_your_driver
> and malicious usages of data that can sit for seconds/minutes in socket queues.
> 
> I think that if you want to gain this 10% back, you have to change
> virtio_net to meet optimal behavior.
> 
> Normal drivers make sure to not pull payload in skb->head, only headers.

Hmm we do have hdr_len field, but seem to ignore it on RX.
Jason do you see any issues with using it for the head len?


> Optimal GRO packets are when payload is in page fragments.
> 
> (I am speaking not only for raw performance, but ability for systems
> to cope with network outages and sudden increase of memory usage in
> out of order queues)
> 
> This has been quite clearly stated in my changelog.
> 
> Thanks.
> 
> 
> > int skb_gro_receive(struct sk_buff *p, struct sk_buff *skb)
> > {
> >         struct skb_shared_info *pinfo, *skbinfo = skb_shinfo(skb);
> >         unsigned int offset = skb_gro_offset(skb);
> >         unsigned int headlen = skb_headlen(skb);
> >
> >     .......
> >
> >         if (headlen <= offset) {         // virtio-net will fail
> >         ........ // merge by frags
> >                 goto done;
> >         } else if (skb->head_frag) {     // skb->head_frag is fail when use kmalloc() for skb->head allocation
> >         ........ // merge by frags
> >                 goto done;
> >         }
> >
> > merge:
> >     ......
> >
> >         if (NAPI_GRO_CB(p)->last == p)
> >                 skb_shinfo(p)->frag_list = skb;
> >         else
> >                 NAPI_GRO_CB(p)->last->next = skb;
> >
> >     ......
> >         return 0;
> > }
> >
> >
> > test cmd:
> >  for i in $(seq 1 4)
> >  do
> >     redis-benchmark -r 10000000 -n 10000000 -t set -d 1024 -c 8 -P 32 -h  <ip> -p 6379 2>&1 | grep 'per second'  &
> >  done
> >
> > Reported-by: su-lifan@linux.alibaba.com
> >
> > >
> > > Fixes: fd11a83dd363 ("net: Pull out core bits of __netdev_alloc_skb and add __napi_alloc_skb")
> > > Signed-off-by: Eric Dumazet <edumazet@google.com>
> > > Cc: Alexander Duyck <alexanderduyck@fb.com>
> > > Cc: Paolo Abeni <pabeni@redhat.com>
> > > Cc: Michael S. Tsirkin <mst@redhat.com>
> > > Cc: Greg Thelen <gthelen@google.com>
> > > ---
> > >  net/core/skbuff.c | 9 +++++++--
> > >  1 file changed, 7 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> > > index 7626a33cce590e530f36167bd096026916131897..3a8f55a43e6964344df464a27b9b1faa0eb804f3 100644
> > > --- a/net/core/skbuff.c
> > > +++ b/net/core/skbuff.c
> > > @@ -501,13 +501,17 @@ EXPORT_SYMBOL(__netdev_alloc_skb);
> > >  struct sk_buff *__napi_alloc_skb(struct napi_struct *napi, unsigned int len,
> > >                                gfp_t gfp_mask)
> > >  {
> > > -     struct napi_alloc_cache *nc = this_cpu_ptr(&napi_alloc_cache);
> > > +     struct napi_alloc_cache *nc;
> > >       struct sk_buff *skb;
> > >       void *data;
> > >
> > >       len += NET_SKB_PAD + NET_IP_ALIGN;
> > >
> > > -     if ((len > SKB_WITH_OVERHEAD(PAGE_SIZE)) ||
> > > +     /* If requested length is either too small or too big,
> > > +      * we use kmalloc() for skb->head allocation.
> > > +      */
> > > +     if (len <= SKB_WITH_OVERHEAD(1024) ||
> > > +         len > SKB_WITH_OVERHEAD(PAGE_SIZE) ||
> > >           (gfp_mask & (__GFP_DIRECT_RECLAIM | GFP_DMA))) {
> > >               skb = __alloc_skb(len, gfp_mask, SKB_ALLOC_RX, NUMA_NO_NODE);
> > >               if (!skb)
> > > @@ -515,6 +519,7 @@ struct sk_buff *__napi_alloc_skb(struct napi_struct *napi, unsigned int len,
> > >               goto skb_success;
> > >       }
> > >
> > > +     nc = this_cpu_ptr(&napi_alloc_cache);
> > >       len += SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
> > >       len = SKB_DATA_ALIGN(len);
> > >
> > > --
> > > 2.30.0.284.gd98b1dd5eaa7-goog
> > >
Eric Dumazet March 31, 2021, 8:36 a.m. UTC | #9
On Wed, Mar 31, 2021 at 10:11 AM Michael S. Tsirkin <mst@redhat.com> wrote:
>
> On Mon, Mar 29, 2021 at 11:06:09AM +0200, Eric Dumazet wrote:
> > On Mon, Mar 29, 2021 at 10:52 AM Xuan Zhuo <xuanzhuo@linux.alibaba.com> wrote:
> > >
> > > On Wed, 13 Jan 2021 08:18:19 -0800, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > > > From: Eric Dumazet <edumazet@google.com>
> > > >
> > > > Both virtio net and napi_get_frags() allocate skbs
> > > > with a very small skb->head
> > > >
> > > > While using page fragments instead of a kmalloc backed skb->head might give
> > > > a small performance improvement in some cases, there is a huge risk of
> > > > under estimating memory usage.
> > > >
> > > > For both GOOD_COPY_LEN and GRO_MAX_HEAD, we can fit at least 32 allocations
> > > > per page (order-3 page in x86), or even 64 on PowerPC
> > > >
> > > > We have been tracking OOM issues on GKE hosts hitting tcp_mem limits
> > > > but consuming far more memory for TCP buffers than instructed in tcp_mem[2]
> > > >
> > > > Even if we force napi_alloc_skb() to only use order-0 pages, the issue
> > > > would still be there on arches with PAGE_SIZE >= 32768
> > > >
> > > > This patch makes sure that small skb head are kmalloc backed, so that
> > > > other objects in the slab page can be reused instead of being held as long
> > > > as skbs are sitting in socket queues.
> > > >
> > > > Note that we might in the future use the sk_buff napi cache,
> > > > instead of going through a more expensive __alloc_skb()
> > > >
> > > > Another idea would be to use separate page sizes depending
> > > > on the allocated length (to never have more than 4 frags per page)
> > > >
> > > > I would like to thank Greg Thelen for his precious help on this matter,
> > > > analysing crash dumps is always a time consuming task.
> > >
> > >
> > > This patch causes a performance degradation of about 10% in the scenario of
> > > virtio-net + GRO.
> > >
> > > For GRO, there is no way to merge skbs based on frags with this patch, only
> > > frag_list can be used to link skbs. The problem that this cause are that compared
> > > to the GRO package merged into the frags way, the current skb needs to call
> > > kfree_skb_list to release each skb, resulting in performance degradation.
> > >
> > > virtio-net will store some data onto the linear space after receiving it. In
> > > addition to the header, there are also some payloads, so "headlen <= offset"
> > > fails. And skb->head_frag is failing when use kmalloc() for skb->head allocation.
> > >
> >
> > Thanks for the report.
> >
> > There is no way we can make things both fast for existing strategies
> > used by _insert_your_driver
> > and malicious usages of data that can sit for seconds/minutes in socket queues.
> >
> > I think that if you want to gain this 10% back, you have to change
> > virtio_net to meet optimal behavior.
> >
> > Normal drivers make sure to not pull payload in skb->head, only headers.
>
> Hmm we do have hdr_len field, but seem to ignore it on RX.
> Jason do you see any issues with using it for the head len?
>

I was looking at this code (page_to_skb())  a few minutes ago ;)

pulling payload would make sense only if we can pull all of it (to free the page)
(This is what some drivers implement and call copybreak)

Even if we do not have an accurate knowledge of header sizes,
it would be better to pull only the Ethernet header and let GRO do the
rest during its dissection.

Once fixed, virtio_net will reduce by 2x number of frags per skb,
compared to the situation before "net: avoid 32 x truesize
under-estimation for tiny skbs"


>
> > Optimal GRO packets are when payload is in page fragments.
> >
> > (I am speaking not only for raw performance, but ability for systems
> > to cope with network outages and sudden increase of memory usage in
> > out of order queues)
> >
> > This has been quite clearly stated in my changelog.
> >
> > Thanks.
> >
> >
> > > int skb_gro_receive(struct sk_buff *p, struct sk_buff *skb)
> > > {
> > >         struct skb_shared_info *pinfo, *skbinfo = skb_shinfo(skb);
> > >         unsigned int offset = skb_gro_offset(skb);
> > >         unsigned int headlen = skb_headlen(skb);
> > >
> > >     .......
> > >
> > >         if (headlen <= offset) {         // virtio-net will fail
> > >         ........ // merge by frags
> > >                 goto done;
> > >         } else if (skb->head_frag) {     // skb->head_frag is fail when use kmalloc() for skb->head allocation
> > >         ........ // merge by frags
> > >                 goto done;
> > >         }
> > >
> > > merge:
> > >     ......
> > >
> > >         if (NAPI_GRO_CB(p)->last == p)
> > >                 skb_shinfo(p)->frag_list = skb;
> > >         else
> > >                 NAPI_GRO_CB(p)->last->next = skb;
> > >
> > >     ......
> > >         return 0;
> > > }
> > >
> > >
> > > test cmd:
> > >  for i in $(seq 1 4)
> > >  do
> > >     redis-benchmark -r 10000000 -n 10000000 -t set -d 1024 -c 8 -P 32 -h  <ip> -p 6379 2>&1 | grep 'per second'  &
> > >  done
> > >
> > > Reported-by: su-lifan@linux.alibaba.com
> > >
> > > >
> > > > Fixes: fd11a83dd363 ("net: Pull out core bits of __netdev_alloc_skb and add __napi_alloc_skb")
> > > > Signed-off-by: Eric Dumazet <edumazet@google.com>
> > > > Cc: Alexander Duyck <alexanderduyck@fb.com>
> > > > Cc: Paolo Abeni <pabeni@redhat.com>
> > > > Cc: Michael S. Tsirkin <mst@redhat.com>
> > > > Cc: Greg Thelen <gthelen@google.com>
> > > > ---
> > > >  net/core/skbuff.c | 9 +++++++--
> > > >  1 file changed, 7 insertions(+), 2 deletions(-)
> > > >
> > > > diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> > > > index 7626a33cce590e530f36167bd096026916131897..3a8f55a43e6964344df464a27b9b1faa0eb804f3 100644
> > > > --- a/net/core/skbuff.c
> > > > +++ b/net/core/skbuff.c
> > > > @@ -501,13 +501,17 @@ EXPORT_SYMBOL(__netdev_alloc_skb);
> > > >  struct sk_buff *__napi_alloc_skb(struct napi_struct *napi, unsigned int len,
> > > >                                gfp_t gfp_mask)
> > > >  {
> > > > -     struct napi_alloc_cache *nc = this_cpu_ptr(&napi_alloc_cache);
> > > > +     struct napi_alloc_cache *nc;
> > > >       struct sk_buff *skb;
> > > >       void *data;
> > > >
> > > >       len += NET_SKB_PAD + NET_IP_ALIGN;
> > > >
> > > > -     if ((len > SKB_WITH_OVERHEAD(PAGE_SIZE)) ||
> > > > +     /* If requested length is either too small or too big,
> > > > +      * we use kmalloc() for skb->head allocation.
> > > > +      */
> > > > +     if (len <= SKB_WITH_OVERHEAD(1024) ||
> > > > +         len > SKB_WITH_OVERHEAD(PAGE_SIZE) ||
> > > >           (gfp_mask & (__GFP_DIRECT_RECLAIM | GFP_DMA))) {
> > > >               skb = __alloc_skb(len, gfp_mask, SKB_ALLOC_RX, NUMA_NO_NODE);
> > > >               if (!skb)
> > > > @@ -515,6 +519,7 @@ struct sk_buff *__napi_alloc_skb(struct napi_struct *napi, unsigned int len,
> > > >               goto skb_success;
> > > >       }
> > > >
> > > > +     nc = this_cpu_ptr(&napi_alloc_cache);
> > > >       len += SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
> > > >       len = SKB_DATA_ALIGN(len);
> > > >
> > > > --
> > > > 2.30.0.284.gd98b1dd5eaa7-goog
> > > >
>
Eric Dumazet March 31, 2021, 8:46 a.m. UTC | #10
On Wed, Mar 31, 2021 at 10:36 AM Eric Dumazet <edumazet@google.com> wrote:
>
> On Wed, Mar 31, 2021 at 10:11 AM Michael S. Tsirkin <mst@redhat.com> wrote:
> >
> > On Mon, Mar 29, 2021 at 11:06:09AM +0200, Eric Dumazet wrote:
> > > On Mon, Mar 29, 2021 at 10:52 AM Xuan Zhuo <xuanzhuo@linux.alibaba.com> wrote:
> > > >
> > > > On Wed, 13 Jan 2021 08:18:19 -0800, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > > > > From: Eric Dumazet <edumazet@google.com>
> > > > >
> > > > > Both virtio net and napi_get_frags() allocate skbs
> > > > > with a very small skb->head
> > > > >
> > > > > While using page fragments instead of a kmalloc backed skb->head might give
> > > > > a small performance improvement in some cases, there is a huge risk of
> > > > > under estimating memory usage.
> > > > >
> > > > > For both GOOD_COPY_LEN and GRO_MAX_HEAD, we can fit at least 32 allocations
> > > > > per page (order-3 page in x86), or even 64 on PowerPC
> > > > >
> > > > > We have been tracking OOM issues on GKE hosts hitting tcp_mem limits
> > > > > but consuming far more memory for TCP buffers than instructed in tcp_mem[2]
> > > > >
> > > > > Even if we force napi_alloc_skb() to only use order-0 pages, the issue
> > > > > would still be there on arches with PAGE_SIZE >= 32768
> > > > >
> > > > > This patch makes sure that small skb head are kmalloc backed, so that
> > > > > other objects in the slab page can be reused instead of being held as long
> > > > > as skbs are sitting in socket queues.
> > > > >
> > > > > Note that we might in the future use the sk_buff napi cache,
> > > > > instead of going through a more expensive __alloc_skb()
> > > > >
> > > > > Another idea would be to use separate page sizes depending
> > > > > on the allocated length (to never have more than 4 frags per page)
> > > > >
> > > > > I would like to thank Greg Thelen for his precious help on this matter,
> > > > > analysing crash dumps is always a time consuming task.
> > > >
> > > >
> > > > This patch causes a performance degradation of about 10% in the scenario of
> > > > virtio-net + GRO.
> > > >
> > > > For GRO, there is no way to merge skbs based on frags with this patch, only
> > > > frag_list can be used to link skbs. The problem that this cause are that compared
> > > > to the GRO package merged into the frags way, the current skb needs to call
> > > > kfree_skb_list to release each skb, resulting in performance degradation.
> > > >
> > > > virtio-net will store some data onto the linear space after receiving it. In
> > > > addition to the header, there are also some payloads, so "headlen <= offset"
> > > > fails. And skb->head_frag is failing when use kmalloc() for skb->head allocation.
> > > >
> > >
> > > Thanks for the report.
> > >
> > > There is no way we can make things both fast for existing strategies
> > > used by _insert_your_driver
> > > and malicious usages of data that can sit for seconds/minutes in socket queues.
> > >
> > > I think that if you want to gain this 10% back, you have to change
> > > virtio_net to meet optimal behavior.
> > >
> > > Normal drivers make sure to not pull payload in skb->head, only headers.
> >
> > Hmm we do have hdr_len field, but seem to ignore it on RX.
> > Jason do you see any issues with using it for the head len?
> >
>
> I was looking at this code (page_to_skb())  a few minutes ago ;)
>
> pulling payload would make sense only if can pull of of it (to free the page)
> (This is what some drivers implement and call copybreak)
>
> Even if we do not have an accurate knowledge of header sizes,
> it would be better to pull only the Ethernet header and let GRO do the
> rest during its dissection.
>
> Once fixed, virtio_net will reduce by 2x number of frags per skb,
> compared to the situation before "net: avoid 32 x truesize
> under-estimation for tiny skbs"

Ie I suspect the simple way to fix this would be :

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index bb4ea9dbc16bcb19c5969fc8247478aa66c63fce..a5500bf6ac01051be949edf9fead934a90335f4f 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -409,9 +409,7 @@ static struct sk_buff *page_to_skb(struct virtnet_info *vi,
        offset += hdr_padded_len;
        p += hdr_padded_len;

-       copy = len;
-       if (copy > skb_tailroom(skb))
-               copy = skb_tailroom(skb);
+       copy = min_t(int, len, ETH_HLEN);
        skb_put_data(skb, p, copy);

        if (metasize) {
Eric Dumazet March 31, 2021, 8:49 a.m. UTC | #11
On Wed, Mar 31, 2021 at 10:46 AM Eric Dumazet <edumazet@google.com> wrote:
>
> On Wed, Mar 31, 2021 at 10:36 AM Eric Dumazet <edumazet@google.com> wrote:
> >

> >
> > I was looking at this code (page_to_skb())  a few minutes ago ;)
> >
> > pulling payload would make sense only if can pull of of it (to free the page)
> > (This is what some drivers implement and call copybreak)
> >
> > Even if we do not have an accurate knowledge of header sizes,
> > it would be better to pull only the Ethernet header and let GRO do the
> > rest during its dissection.
> >
> > Once fixed, virtio_net will reduce by 2x number of frags per skb,
> > compared to the situation before "net: avoid 32 x truesize
> > under-estimation for tiny skbs"
>
> Ie I suspect the simple way to fix this would be :
>
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index bb4ea9dbc16bcb19c5969fc8247478aa66c63fce..a5500bf6ac01051be949edf9fead934a90335f4f
> 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -409,9 +409,7 @@ static struct sk_buff *page_to_skb(struct virtnet_info *vi,
>         offset += hdr_padded_len;
>         p += hdr_padded_len;
>
> -       copy = len;
> -       if (copy > skb_tailroom(skb))
> -               copy = skb_tailroom(skb);
> +       copy = min_t(int, len, ETH_HLEN);
>         skb_put_data(skb, p, copy);
>
>         if (metasize) {

A  'copybreak' aware version would be :

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index bb4ea9dbc16bcb19c5969fc8247478aa66c63fce..dd58b075ca53643231bc1795c7283fcd8609547b 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -409,9 +409,13 @@ static struct sk_buff *page_to_skb(struct virtnet_info *vi,
        offset += hdr_padded_len;
        p += hdr_padded_len;

-       copy = len;
-       if (copy > skb_tailroom(skb))
-               copy = skb_tailroom(skb);
+       /* Copy all frame if it fits skb->head,
+        * otherwise we let GRO pull headers as needed.
+        */
+       if (len <= skb_tailroom(skb))
+               copy = len;
+       else
+               copy = min_t(int, len, ETH_HLEN);
        skb_put_data(skb, p, copy);

        if (metasize) {
Eric Dumazet March 31, 2021, 8:54 a.m. UTC | #12
On Wed, Mar 31, 2021 at 10:49 AM Eric Dumazet <edumazet@google.com> wrote:
>
> On Wed, Mar 31, 2021 at 10:46 AM Eric Dumazet <edumazet@google.com> wrote:
> >
> > On Wed, Mar 31, 2021 at 10:36 AM Eric Dumazet <edumazet@google.com> wrote:
> > >
>
> > >
> > > I was looking at this code (page_to_skb())  a few minutes ago ;)
> > >
> > > pulling payload would make sense only if can pull of of it (to free the page)
> > > (This is what some drivers implement and call copybreak)
> > >
> > > Even if we do not have an accurate knowledge of header sizes,
> > > it would be better to pull only the Ethernet header and let GRO do the
> > > rest during its dissection.
> > >
> > > Once fixed, virtio_net will reduce by 2x number of frags per skb,
> > > compared to the situation before "net: avoid 32 x truesize
> > > under-estimation for tiny skbs"
> >
> > Ie I suspect the simple way to fix this would be :
> >
> > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> > index bb4ea9dbc16bcb19c5969fc8247478aa66c63fce..a5500bf6ac01051be949edf9fead934a90335f4f
> > 100644
> > --- a/drivers/net/virtio_net.c
> > +++ b/drivers/net/virtio_net.c
> > @@ -409,9 +409,7 @@ static struct sk_buff *page_to_skb(struct virtnet_info *vi,
> >         offset += hdr_padded_len;
> >         p += hdr_padded_len;
> >
> > -       copy = len;
> > -       if (copy > skb_tailroom(skb))
> > -               copy = skb_tailroom(skb);
> > +       copy = min_t(int, len, ETH_HLEN);
> >         skb_put_data(skb, p, copy);
> >
> >         if (metasize) {
>
> A  'copybreak' aware version would be :
>
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index bb4ea9dbc16bcb19c5969fc8247478aa66c63fce..dd58b075ca53643231bc1795c7283fcd8609547b
> 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -409,9 +409,13 @@ static struct sk_buff *page_to_skb(struct virtnet_info *vi,
>         offset += hdr_padded_len;
>         p += hdr_padded_len;
>
> -       copy = len;
> -       if (copy > skb_tailroom(skb))
> -               copy = skb_tailroom(skb);
> +       /* Copy all frame if it fits skb->head,
> +        * otherwise we let GRO pull headers as needed.
> +        */
> +       if (len <= skb_tailroom(skb))
> +               copy = len;
> +       else
> +               copy = min_t(int, len, ETH_HLEN);
>         skb_put_data(skb, p, copy);
>
>         if (metasize) {

Note that we might need to include 'metasize' in the picture.

maybe :

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index bb4ea9dbc16bcb19c5969fc8247478aa66c63fce..f5a3cecd18eada32694714ecb85c205af7108aae 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -409,9 +409,13 @@ static struct sk_buff *page_to_skb(struct virtnet_info *vi,
        offset += hdr_padded_len;
        p += hdr_padded_len;

-       copy = len;
-       if (copy > skb_tailroom(skb))
-               copy = skb_tailroom(skb);
+       /* Copy all frame if it fits skb->head,
+        * otherwise we let GRO pull headers as needed.
+        */
+       if (len <= skb_tailroom(skb))
+               copy = len;
+       else
+               copy = min_t(int, len, ETH_HLEN + metasize);
        skb_put_data(skb, p, copy);

        if (metasize) {
Eric Dumazet March 31, 2021, 12:08 p.m. UTC | #13
On Wed, Mar 31, 2021 at 1:34 PM Xuan Zhuo <xuanzhuo@linux.alibaba.com> wrote:
>
> On Wed, 31 Mar 2021 04:11:12 -0400, Michael S. Tsirkin <mst@redhat.com> wrote:
> > On Mon, Mar 29, 2021 at 11:06:09AM +0200, Eric Dumazet wrote:
> > > On Mon, Mar 29, 2021 at 10:52 AM Xuan Zhuo <xuanzhuo@linux.alibaba.com> wrote:
> > > >
> > > > On Wed, 13 Jan 2021 08:18:19 -0800, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > > > > From: Eric Dumazet <edumazet@google.com>
> > > > >
> > > > > Both virtio net and napi_get_frags() allocate skbs
> > > > > with a very small skb->head
> > > > >
> > > > > While using page fragments instead of a kmalloc backed skb->head might give
> > > > > a small performance improvement in some cases, there is a huge risk of
> > > > > under estimating memory usage.
> > > > >
> > > > > For both GOOD_COPY_LEN and GRO_MAX_HEAD, we can fit at least 32 allocations
> > > > > per page (order-3 page in x86), or even 64 on PowerPC
> > > > >
> > > > > We have been tracking OOM issues on GKE hosts hitting tcp_mem limits
> > > > > but consuming far more memory for TCP buffers than instructed in tcp_mem[2]
> > > > >
> > > > > Even if we force napi_alloc_skb() to only use order-0 pages, the issue
> > > > > would still be there on arches with PAGE_SIZE >= 32768
> > > > >
> > > > > This patch makes sure that small skb head are kmalloc backed, so that
> > > > > other objects in the slab page can be reused instead of being held as long
> > > > > as skbs are sitting in socket queues.
> > > > >
> > > > > Note that we might in the future use the sk_buff napi cache,
> > > > > instead of going through a more expensive __alloc_skb()
> > > > >
> > > > > Another idea would be to use separate page sizes depending
> > > > > on the allocated length (to never have more than 4 frags per page)
> > > > >
> > > > > I would like to thank Greg Thelen for his precious help on this matter,
> > > > > analysing crash dumps is always a time consuming task.
> > > >
> > > >
> > > > This patch causes a performance degradation of about 10% in the scenario of
> > > > virtio-net + GRO.
> > > >
> > > > For GRO, there is no way to merge skbs based on frags with this patch, only
> > > > frag_list can be used to link skbs. The problem that this cause are that compared
> > > > to the GRO package merged into the frags way, the current skb needs to call
> > > > kfree_skb_list to release each skb, resulting in performance degradation.
> > > >
> > > > virtio-net will store some data onto the linear space after receiving it. In
> > > > addition to the header, there are also some payloads, so "headlen <= offset"
> > > > fails. And skb->head_frag is failing when use kmalloc() for skb->head allocation.
> > > >
> > >
> > > Thanks for the report.
> > >
> > > There is no way we can make things both fast for existing strategies
> > > used by _insert_your_driver
> > > and malicious usages of data that can sit for seconds/minutes in socket queues.
> > >
> > > I think that if you want to gain this 10% back, you have to change
> > > virtio_net to meet optimal behavior.
> > >
> > > Normal drivers make sure to not pull payload in skb->head, only headers.
> >
> > Hmm we do have hdr_len field, but seem to ignore it on RX.
> > Jason do you see any issues with using it for the head len?
> >
>
> Why not add a struct skb_shared_info space when adding merge/big buffer?
> So that we can use build_skb to create skb. For skb with little data, we can
> copy directly to save pages.
>

This does not matter. build_skb() is a distraction here.

Ultimately, you want to receive fat GRO packets with 17 MSS per sk_buff

The patch I gave earlier will combine :

1) Advantage of having skb->head being backed by slab allocations,
   to solve the OOM issue my patch already coped with.

2)  Up to 17 page frags attached to the sk_buff, storing 17 MSS.

In the past, virtio_net had to use 2 slots in shared_info per MSS
 - One slot for the first ~200 bytes of payload
 - One slot for the last part of the MSS.

Please try my patch, so that we confirm the issue is really in page_to_skb()

Thanks
Eric Dumazet April 1, 2021, 5:06 a.m. UTC | #14
On Thu, Apr 1, 2021 at 5:40 AM Xuan Zhuo <xuanzhuo@linux.alibaba.com> wrote:

>
> [  144.405067] eth1: bad gso: type: 1, size: 1452
> [  180.473565] skbuff: bad partial csum: csum=34/16 headroom=64 headlen=14
> [  180.474698] eth1: bad gso: type: 1, size: 1452
> [  180.519980] skbuff: bad partial csum: csum=34/16 headroom=64 headlen=14
> [  180.521076] eth1: bad gso: type: 1, size: 1452
> [  180.559409] skbuff: bad partial csum: csum=34/16 headroom=64 headlen=14
> [  180.560501] eth1: bad gso: type: 1, size: 1452
> [  180.576476] skbuff: bad partial csum: csum=34/16 headroom=64 headlen=14
> [  180.577562] eth1: bad gso: type: 1, size: 1452
> [  183.284265] skbuff: bad partial csum: csum=34/16 headroom=64 headlen=14
> [  183.285372] eth1: bad gso: type: 1, size: 1452
> [  216.831138] net_ratelimit: 2 callbacks suppressed
> [  216.832009] skbuff: bad partial csum: csum=34/16 headroom=64 headlen=14
> [  216.833103] eth1: bad gso: type: 1, size: 1452
> [  216.892180] skbuff: bad partial csum: csum=34/16 headroom=64 headlen=14
> [  216.893280] eth1: bad gso: type: 1, size: 1452
> [  216.949833] skbuff: bad partial csum: csum=34/16 headroom=64 headlen=14
> [  216.950922] eth1: bad gso: type: 1, size: 1452
> [  291.962999] skbuff: bad partial csum: csum=34/6 headroom=64 headlen=14
> [  291.964092] eth1: bad gso: type: 0, size: 0
> [  311.972597] skbuff: bad partial csum: csum=34/6 headroom=64 headlen=14
> [  311.973691] eth1: bad gso: type: 0, size: 0
>
> I got this display after the machine started, I think just copy eth may not be a
> good way.

This is great, thanks for testing.
This means virtio_net_hdr_to_skb() needs to be a little bit smarter to
pull needed bytes (not assuming 128 bytes were pre-pulled)

Can you try this ?
Thanks !

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index bb4ea9dbc16bcb19c5969fc8247478aa66c63fce..f5a3cecd18eada32694714ecb85c205af7108aae 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -409,9 +409,13 @@ static struct sk_buff *page_to_skb(struct virtnet_info *vi,
        offset += hdr_padded_len;
        p += hdr_padded_len;

-       copy = len;
-       if (copy > skb_tailroom(skb))
-               copy = skb_tailroom(skb);
+       /* Copy all frame if it fits skb->head,
+        * otherwise we let GRO pull headers as needed.
+        */
+       if (len <= skb_tailroom(skb))
+               copy = len;
+       else
+               copy = min_t(int, len, ETH_HLEN + metasize);
        skb_put_data(skb, p, copy);

        if (metasize) {
diff --git a/include/linux/virtio_net.h b/include/linux/virtio_net.h
index 6b5fcfa1e5553576b0e853ae31a2df655c04204b..2ee8f3ba76a548d54e0b21321a67da958c9984a0 100644
--- a/include/linux/virtio_net.h
+++ b/include/linux/virtio_net.h
@@ -63,8 +63,12 @@ static inline int virtio_net_hdr_to_skb(struct sk_buff *skb,
        }

        if (hdr->flags & VIRTIO_NET_HDR_F_NEEDS_CSUM) {
-               u16 start = __virtio16_to_cpu(little_endian, hdr->csum_start);
-               u16 off = __virtio16_to_cpu(little_endian, hdr->csum_offset);
+               u32 start = __virtio16_to_cpu(little_endian, hdr->csum_start);
+               u32 off = __virtio16_to_cpu(little_endian, hdr->csum_offset);
+               u32 needed = start + max_t(u32, thlen, off + sizeof(__sum16));
+
+               if (pskb_may_pull(skb, needed))
+                       return -EINVAL;

                if (!skb_partial_csum_set(skb, start, off))
                        return -EINVAL;
@@ -100,14 +104,14 @@ static inline int virtio_net_hdr_to_skb(struct sk_buff *skb,
                        }

                        p_off = keys.control.thoff + thlen;
-                       if (p_off > skb_headlen(skb) ||
+                       if (pskb_may_pull(skb, p_off) ||
                            keys.basic.ip_proto != ip_proto)
                                return -EINVAL;

                        skb_set_transport_header(skb, keys.control.thoff);
                } else if (gso_type) {
                        p_off = thlen;
-                       if (p_off > skb_headlen(skb))
+                       if (pskb_may_pull(skb, p_off))
                                return -EINVAL;
                }
        }
Jason Wang April 1, 2021, 7:14 a.m. UTC | #15
在 2021/3/31 下午4:11, Michael S. Tsirkin 写道:
> On Mon, Mar 29, 2021 at 11:06:09AM +0200, Eric Dumazet wrote:
>> On Mon, Mar 29, 2021 at 10:52 AM Xuan Zhuo <xuanzhuo@linux.alibaba.com> wrote:
>>> On Wed, 13 Jan 2021 08:18:19 -0800, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>>>> From: Eric Dumazet <edumazet@google.com>
>>>>
>>>> Both virtio net and napi_get_frags() allocate skbs
>>>> with a very small skb->head
>>>>
>>>> While using page fragments instead of a kmalloc backed skb->head might give
>>>> a small performance improvement in some cases, there is a huge risk of
>>>> under estimating memory usage.
>>>>
>>>> For both GOOD_COPY_LEN and GRO_MAX_HEAD, we can fit at least 32 allocations
>>>> per page (order-3 page in x86), or even 64 on PowerPC
>>>>
>>>> We have been tracking OOM issues on GKE hosts hitting tcp_mem limits
>>>> but consuming far more memory for TCP buffers than instructed in tcp_mem[2]
>>>>
>>>> Even if we force napi_alloc_skb() to only use order-0 pages, the issue
>>>> would still be there on arches with PAGE_SIZE >= 32768
>>>>
>>>> This patch makes sure that small skb head are kmalloc backed, so that
>>>> other objects in the slab page can be reused instead of being held as long
>>>> as skbs are sitting in socket queues.
>>>>
>>>> Note that we might in the future use the sk_buff napi cache,
>>>> instead of going through a more expensive __alloc_skb()
>>>>
>>>> Another idea would be to use separate page sizes depending
>>>> on the allocated length (to never have more than 4 frags per page)
>>>>
>>>> I would like to thank Greg Thelen for his precious help on this matter,
>>>> analysing crash dumps is always a time consuming task.
>>>
>>> This patch causes a performance degradation of about 10% in the scenario of
>>> virtio-net + GRO.
>>>
>>> For GRO, there is no way to merge skbs based on frags with this patch, only
>>> frag_list can be used to link skbs. The problem that this cause are that compared
>>> to the GRO package merged into the frags way, the current skb needs to call
>>> kfree_skb_list to release each skb, resulting in performance degradation.
>>>
>>> virtio-net will store some data onto the linear space after receiving it. In
>>> addition to the header, there are also some payloads, so "headlen <= offset"
>>> fails. And skb->head_frag is failing when use kmalloc() for skb->head allocation.
>>>
>> Thanks for the report.
>>
>> There is no way we can make things both fast for existing strategies
>> used by _insert_your_driver
>> and malicious usages of data that can sit for seconds/minutes in socket queues.
>>
>> I think that if you want to gain this 10% back, you have to change
>> virtio_net to meet optimal behavior.
>>
>> Normal drivers make sure to not pull payload in skb->head, only headers.
> Hmm we do have hdr_len field, but seem to ignore it on RX.
> Jason do you see any issues with using it for the head len?


This might work only if the device sets a correct hdr_len. I'm not sure
all of the devices can do this properly. E.g. for tap, we use
skb_headlen() in virtio_net_hdr_from_skb(), which depends highly on the
behaviour of the underlying layers (device driver or GRO). And we only
set this hint for GSO packets, but virtio-net may try to do GRO for non
GSO packets.
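
For readers following along, the hdr_len idea would look roughly like the
sketch below in the copy path (purely hypothetical: 'vnet_hdr_len' is an
assumed local holding the device-provided hdr_len, and this is not the
actual driver code):

	/* Hypothetical sketch: bound the copy into skb->head by the
	 * device-provided hdr_len when it is non-zero, so that payload
	 * stays in page frags.
	 */
	copy = len;
	if (vnet_hdr_len && vnet_hdr_len < copy)
		copy = vnet_hdr_len;	/* copy headers only */
	if (copy > skb_tailroom(skb))
		copy = skb_tailroom(skb);
	skb_put_data(skb, p, copy);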

Thanks


>
>
>> Optimal GRO packets are when payload is in page fragments.
>>
>> (I am speaking not only for raw performance, but ability for systems
>> to cope with network outages and sudden increase of memory usage in
>> out of order queues)
>>
>> This has been quite clearly stated in my changelog.
>>
>> Thanks.
>>
>>
>>> int skb_gro_receive(struct sk_buff *p, struct sk_buff *skb)
>>> {
>>>          struct skb_shared_info *pinfo, *skbinfo = skb_shinfo(skb);
>>>          unsigned int offset = skb_gro_offset(skb);
>>>          unsigned int headlen = skb_headlen(skb);
>>>
>>>      .......
>>>
>>>          if (headlen <= offset) {         // virtio-net will fail
>>>          ........ // merge by frags
>>>                  goto done;
>>>          } else if (skb->head_frag) {     // skb->head_frag is fail when use kmalloc() for skb->head allocation
>>>          ........ // merge by frags
>>>                  goto done;
>>>          }
>>>
>>> merge:
>>>      ......
>>>
>>>          if (NAPI_GRO_CB(p)->last == p)
>>>                  skb_shinfo(p)->frag_list = skb;
>>>          else
>>>                  NAPI_GRO_CB(p)->last->next = skb;
>>>
>>>      ......
>>>          return 0;
>>> }
>>>
>>>
>>> test cmd:
>>>   for i in $(seq 1 4)
>>>   do
>>>      redis-benchmark -r 10000000 -n 10000000 -t set -d 1024 -c 8 -P 32 -h  <ip> -p 6379 2>&1 | grep 'per second'  &
>>>   done
>>>
>>> Reported-by: su-lifan@linux.alibaba.com
>>>
>>>> Fixes: fd11a83dd363 ("net: Pull out core bits of __netdev_alloc_skb and add __napi_alloc_skb")
>>>> Signed-off-by: Eric Dumazet <edumazet@google.com>
>>>> Cc: Alexander Duyck <alexanderduyck@fb.com>
>>>> Cc: Paolo Abeni <pabeni@redhat.com>
>>>> Cc: Michael S. Tsirkin <mst@redhat.com>
>>>> Cc: Greg Thelen <gthelen@google.com>
>>>> ---
>>>>   net/core/skbuff.c | 9 +++++++--
>>>>   1 file changed, 7 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
>>>> index 7626a33cce590e530f36167bd096026916131897..3a8f55a43e6964344df464a27b9b1faa0eb804f3 100644
>>>> --- a/net/core/skbuff.c
>>>> +++ b/net/core/skbuff.c
>>>> @@ -501,13 +501,17 @@ EXPORT_SYMBOL(__netdev_alloc_skb);
>>>>   struct sk_buff *__napi_alloc_skb(struct napi_struct *napi, unsigned int len,
>>>>                                 gfp_t gfp_mask)
>>>>   {
>>>> -     struct napi_alloc_cache *nc = this_cpu_ptr(&napi_alloc_cache);
>>>> +     struct napi_alloc_cache *nc;
>>>>        struct sk_buff *skb;
>>>>        void *data;
>>>>
>>>>        len += NET_SKB_PAD + NET_IP_ALIGN;
>>>>
>>>> -     if ((len > SKB_WITH_OVERHEAD(PAGE_SIZE)) ||
>>>> +     /* If requested length is either too small or too big,
>>>> +      * we use kmalloc() for skb->head allocation.
>>>> +      */
>>>> +     if (len <= SKB_WITH_OVERHEAD(1024) ||
>>>> +         len > SKB_WITH_OVERHEAD(PAGE_SIZE) ||
>>>>            (gfp_mask & (__GFP_DIRECT_RECLAIM | GFP_DMA))) {
>>>>                skb = __alloc_skb(len, gfp_mask, SKB_ALLOC_RX, NUMA_NO_NODE);
>>>>                if (!skb)
>>>> @@ -515,6 +519,7 @@ struct sk_buff *__napi_alloc_skb(struct napi_struct *napi, unsigned int len,
>>>>                goto skb_success;
>>>>        }
>>>>
>>>> +     nc = this_cpu_ptr(&napi_alloc_cache);
>>>>        len += SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
>>>>        len = SKB_DATA_ALIGN(len);
>>>>
>>>> --
>>>> 2.30.0.284.gd98b1dd5eaa7-goog
>>>>
Eric Dumazet April 1, 2021, 9:58 a.m. UTC | #16
On Thu, Apr 1, 2021 at 11:04 AM Xuan Zhuo <xuanzhuo@linux.alibaba.com> wrote:
>
> On Thu, 1 Apr 2021 15:14:18 +0800, Jason Wang <jasowang@redhat.com> wrote:
> >
> > 在 2021/3/31 下午4:11, Michael S. Tsirkin 写道:
> > > On Mon, Mar 29, 2021 at 11:06:09AM +0200, Eric Dumazet wrote:
> > >> On Mon, Mar 29, 2021 at 10:52 AM Xuan Zhuo <xuanzhuo@linux.alibaba.com> wrote:
> > >>> On Wed, 13 Jan 2021 08:18:19 -0800, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > >>>> From: Eric Dumazet <edumazet@google.com>
> > >>>>
> > >>>> Both virtio net and napi_get_frags() allocate skbs
> > >>>> with a very small skb->head
> > >>>>
> > >>>> While using page fragments instead of a kmalloc backed skb->head might give
> > >>>> a small performance improvement in some cases, there is a huge risk of
> > >>>> under estimating memory usage.
> > >>>>
> > >>>> For both GOOD_COPY_LEN and GRO_MAX_HEAD, we can fit at least 32 allocations
> > >>>> per page (order-3 page in x86), or even 64 on PowerPC
> > >>>>
> > >>>> We have been tracking OOM issues on GKE hosts hitting tcp_mem limits
> > >>>> but consuming far more memory for TCP buffers than instructed in tcp_mem[2]
> > >>>>
> > >>>> Even if we force napi_alloc_skb() to only use order-0 pages, the issue
> > >>>> would still be there on arches with PAGE_SIZE >= 32768
> > >>>>
> > >>>> This patch makes sure that small skb head are kmalloc backed, so that
> > >>>> other objects in the slab page can be reused instead of being held as long
> > >>>> as skbs are sitting in socket queues.
> > >>>>
> > >>>> Note that we might in the future use the sk_buff napi cache,
> > >>>> instead of going through a more expensive __alloc_skb()
> > >>>>
> > >>>> Another idea would be to use separate page sizes depending
> > >>>> on the allocated length (to never have more than 4 frags per page)
> > >>>>
> > >>>> I would like to thank Greg Thelen for his precious help on this matter,
> > >>>> analysing crash dumps is always a time consuming task.
> > >>>
> > >>> This patch causes a performance degradation of about 10% in the scenario of
> > >>> virtio-net + GRO.
> > >>>
> > >>> For GRO, there is no way to merge skbs based on frags with this patch, only
> > >>> frag_list can be used to link skbs. The problem that this cause are that compared
> > >>> to the GRO package merged into the frags way, the current skb needs to call
> > >>> kfree_skb_list to release each skb, resulting in performance degradation.
> > >>>
> > >>> virtio-net will store some data onto the linear space after receiving it. In
> > >>> addition to the header, there are also some payloads, so "headlen <= offset"
> > >>> fails. And skb->head_frag is failing when use kmalloc() for skb->head allocation.
> > >>>
> > >> Thanks for the report.
> > >>
> > >> There is no way we can make things both fast for existing strategies
> > >> used by _insert_your_driver
> > >> and malicious usages of data that can sit for seconds/minutes in socket queues.
> > >>
> > >> I think that if you want to gain this 10% back, you have to change
> > >> virtio_net to meet optimal behavior.
> > >>
> > >> Normal drivers make sure to not pull payload in skb->head, only headers.
> > > Hmm we do have hdr_len field, but seem to ignore it on RX.
> > > Jason do you see any issues with using it for the head len?
> >
> >
> > This might work only if the device sets a correct hdr_len. I'm not sure
> > all of the devices can do this properly. E.g for tap, we use
> > skb_headlen() in virtio_net_hdr_from_skb() which depends highly on the
> > behaviour of the underlayer layers (device driver or GRO). And we only
> > set this hint for GSO packet but virtio-net may tries to do GRO for non
> > GSO packets.
> >
> > Thanks
>
> hi, Jason
>
> I personally prefer to use build_skb to create skb, so the problem here is
> actually gone.
>
> The premise of this is that the buffer added by add_recvbuf_mergeable must
> retain a skb_shared_info. Of course, then rx frags coalescing won't
> work. But I consider that suppose the size of the mrg_avg_pkt_len 1500, so we
> can still store 17 * 1500 = 24k packets in a skb. If the packet is really big,
> the mrg_avg_pkt_len will also increase, and the buffer allocated later will
> increase. When the mrg_avg_pkt_len is greater than PAGE_SIZE/2, rx frags
> coalesce is no longer needed. Because we can't allocate two bufs with a value of
> mrg_avg_pkt_len on the same page.
>

For the record I implemented build_skb() 10 years ago, so you can
trust me when I
am saying this will not help.

Using build_skb() will waste additional skb_shared_info per MSS.
That's an increase of 20% of memory, for nothing at all.

Also there are cases when this won't be possible, say if you use an MTU of 4000
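
For a rough sense of the numbers behind that estimate (both figures below
are assumptions for illustration: an aligned skb_shared_info of about 320
bytes and ~1500-byte receive buffers):

/* Illustrative arithmetic only. */
static unsigned int shinfo_overhead_pct(void)
{
	unsigned int shinfo = 320;	/* assumed SKB_DATA_ALIGN(sizeof(struct skb_shared_info)) */
	unsigned int bufsz  = 1500;	/* assumed per-MSS receive buffer */

	return 100 * shinfo / bufsz;	/* ~21%, in line with the ~20% above */
}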





> Thanks.
>
> >
> >
> > >
> > >
> > >> Optimal GRO packets are when payload is in page fragments.
> > >>
> > >> (I am speaking not only for raw performance, but ability for systems
> > >> to cope with network outages and sudden increase of memory usage in
> > >> out of order queues)
> > >>
> > >> This has been quite clearly stated in my changelog.
> > >>
> > >> Thanks.
> > >>
> > >>
> > >>> int skb_gro_receive(struct sk_buff *p, struct sk_buff *skb)
> > >>> {
> > >>>          struct skb_shared_info *pinfo, *skbinfo = skb_shinfo(skb);
> > >>>          unsigned int offset = skb_gro_offset(skb);
> > >>>          unsigned int headlen = skb_headlen(skb);
> > >>>
> > >>>      .......
> > >>>
> > >>>          if (headlen <= offset) {         // virtio-net will fail
> > >>>          ........ // merge by frags
> > >>>                  goto done;
> > >>>          } else if (skb->head_frag) {     // skb->head_frag is fail when use kmalloc() for skb->head allocation
> > >>>          ........ // merge by frags
> > >>>                  goto done;
> > >>>          }
> > >>>
> > >>> merge:
> > >>>      ......
> > >>>
> > >>>          if (NAPI_GRO_CB(p)->last == p)
> > >>>                  skb_shinfo(p)->frag_list = skb;
> > >>>          else
> > >>>                  NAPI_GRO_CB(p)->last->next = skb;
> > >>>
> > >>>      ......
> > >>>          return 0;
> > >>> }
> > >>>
> > >>>
> > >>> test cmd:
> > >>>   for i in $(seq 1 4)
> > >>>   do
> > >>>      redis-benchmark -r 10000000 -n 10000000 -t set -d 1024 -c 8 -P 32 -h  <ip> -p 6379 2>&1 | grep 'per second'  &
> > >>>   done
> > >>>
> > >>> Reported-by: su-lifan@linux.alibaba.com
> > >>>
> > >>>> Fixes: fd11a83dd363 ("net: Pull out core bits of __netdev_alloc_skb and add __napi_alloc_skb")
> > >>>> Signed-off-by: Eric Dumazet <edumazet@google.com>
> > >>>> Cc: Alexander Duyck <alexanderduyck@fb.com>
> > >>>> Cc: Paolo Abeni <pabeni@redhat.com>
> > >>>> Cc: Michael S. Tsirkin <mst@redhat.com>
> > >>>> Cc: Greg Thelen <gthelen@google.com>
> > >>>> ---
> > >>>>   net/core/skbuff.c | 9 +++++++--
> > >>>>   1 file changed, 7 insertions(+), 2 deletions(-)
> > >>>>
> > >>>> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> > >>>> index 7626a33cce590e530f36167bd096026916131897..3a8f55a43e6964344df464a27b9b1faa0eb804f3 100644
> > >>>> --- a/net/core/skbuff.c
> > >>>> +++ b/net/core/skbuff.c
> > >>>> @@ -501,13 +501,17 @@ EXPORT_SYMBOL(__netdev_alloc_skb);
> > >>>>   struct sk_buff *__napi_alloc_skb(struct napi_struct *napi, unsigned int len,
> > >>>>                                 gfp_t gfp_mask)
> > >>>>   {
> > >>>> -     struct napi_alloc_cache *nc = this_cpu_ptr(&napi_alloc_cache);
> > >>>> +     struct napi_alloc_cache *nc;
> > >>>>        struct sk_buff *skb;
> > >>>>        void *data;
> > >>>>
> > >>>>        len += NET_SKB_PAD + NET_IP_ALIGN;
> > >>>>
> > >>>> -     if ((len > SKB_WITH_OVERHEAD(PAGE_SIZE)) ||
> > >>>> +     /* If requested length is either too small or too big,
> > >>>> +      * we use kmalloc() for skb->head allocation.
> > >>>> +      */
> > >>>> +     if (len <= SKB_WITH_OVERHEAD(1024) ||
> > >>>> +         len > SKB_WITH_OVERHEAD(PAGE_SIZE) ||
> > >>>>            (gfp_mask & (__GFP_DIRECT_RECLAIM | GFP_DMA))) {
> > >>>>                skb = __alloc_skb(len, gfp_mask, SKB_ALLOC_RX, NUMA_NO_NODE);
> > >>>>                if (!skb)
> > >>>> @@ -515,6 +519,7 @@ struct sk_buff *__napi_alloc_skb(struct napi_struct *napi, unsigned int len,
> > >>>>                goto skb_success;
> > >>>>        }
> > >>>>
> > >>>> +     nc = this_cpu_ptr(&napi_alloc_cache);
> > >>>>        len += SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
> > >>>>        len = SKB_DATA_ALIGN(len);
> > >>>>
> > >>>> --
> > >>>> 2.30.0.284.gd98b1dd5eaa7-goog
> > >>>>
> >
Michael S. Tsirkin April 1, 2021, 1:36 p.m. UTC | #17
On Wed, Mar 31, 2021 at 07:30:39PM +0800, Xuan Zhuo wrote:
> On Wed, 31 Mar 2021 04:11:12 -0400, Michael S. Tsirkin <mst@redhat.com> wrote:
> > On Mon, Mar 29, 2021 at 11:06:09AM +0200, Eric Dumazet wrote:
> > > On Mon, Mar 29, 2021 at 10:52 AM Xuan Zhuo <xuanzhuo@linux.alibaba.com> wrote:
> > > >
> > > > On Wed, 13 Jan 2021 08:18:19 -0800, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > > > > From: Eric Dumazet <edumazet@google.com>
> > > > >
> > > > > Both virtio net and napi_get_frags() allocate skbs
> > > > > with a very small skb->head
> > > > >
> > > > > While using page fragments instead of a kmalloc backed skb->head might give
> > > > > a small performance improvement in some cases, there is a huge risk of
> > > > > under estimating memory usage.
> > > > >
> > > > > For both GOOD_COPY_LEN and GRO_MAX_HEAD, we can fit at least 32 allocations
> > > > > per page (order-3 page in x86), or even 64 on PowerPC
> > > > >
> > > > > We have been tracking OOM issues on GKE hosts hitting tcp_mem limits
> > > > > but consuming far more memory for TCP buffers than instructed in tcp_mem[2]
> > > > >
> > > > > Even if we force napi_alloc_skb() to only use order-0 pages, the issue
> > > > > would still be there on arches with PAGE_SIZE >= 32768
> > > > >
> > > > > This patch makes sure that small skb head are kmalloc backed, so that
> > > > > other objects in the slab page can be reused instead of being held as long
> > > > > as skbs are sitting in socket queues.
> > > > >
> > > > > Note that we might in the future use the sk_buff napi cache,
> > > > > instead of going through a more expensive __alloc_skb()
> > > > >
> > > > > Another idea would be to use separate page sizes depending
> > > > > on the allocated length (to never have more than 4 frags per page)
> > > > >
> > > > > I would like to thank Greg Thelen for his precious help on this matter,
> > > > > analysing crash dumps is always a time consuming task.
> > > >
> > > >
> > > > This patch causes a performance degradation of about 10% in the scenario of
> > > > virtio-net + GRO.
> > > >
> > > > For GRO, there is no way to merge skbs based on frags with this patch, only
> > > > frag_list can be used to link skbs. The problem that this cause are that compared
> > > > to the GRO package merged into the frags way, the current skb needs to call
> > > > kfree_skb_list to release each skb, resulting in performance degradation.
> > > >
> > > > virtio-net will store some data onto the linear space after receiving it. In
> > > > addition to the header, there are also some payloads, so "headlen <= offset"
> > > > fails. And skb->head_frag is failing when use kmalloc() for skb->head allocation.
> > > >
> > >
> > > Thanks for the report.
> > >
> > > There is no way we can make things both fast for existing strategies
> > > used by _insert_your_driver
> > > and malicious usages of data that can sit for seconds/minutes in socket queues.
> > >
> > > I think that if you want to gain this 10% back, you have to change
> > > virtio_net to meet optimal behavior.
> > >
> > > Normal drivers make sure to not pull payload in skb->head, only headers.
> >
> > Hmm we do have hdr_len field, but seem to ignore it on RX.
> > Jason do you see any issues with using it for the head len?
> >
> 
> Why not add a struct skb_shared_info space when adding merge/big buffer?
> So that we can use build_skb to create skb. For skb with little data, we can
> copy directly to save pages.
> 
> Thanks.

Not 100% sure what do you mean ... want to post the patch?

> >
> > > Optimal GRO packets are when payload is in page fragments.
> > >
> > > (I am speaking not only for raw performance, but ability for systems
> > > to cope with network outages and sudden increase of memory usage in
> > > out of order queues)
> > >
> > > This has been quite clearly stated in my changelog.
> > >
> > > Thanks.
> > >
> > >
> > > > int skb_gro_receive(struct sk_buff *p, struct sk_buff *skb)
> > > > {
> > > >         struct skb_shared_info *pinfo, *skbinfo = skb_shinfo(skb);
> > > >         unsigned int offset = skb_gro_offset(skb);
> > > >         unsigned int headlen = skb_headlen(skb);
> > > >
> > > >     .......
> > > >
> > > >         if (headlen <= offset) {         // virtio-net will fail
> > > >         ........ // merge by frags
> > > >                 goto done;
> > > >         } else if (skb->head_frag) {     // skb->head_frag is fail when use kmalloc() for skb->head allocation
> > > >         ........ // merge by frags
> > > >                 goto done;
> > > >         }
> > > >
> > > > merge:
> > > >     ......
> > > >
> > > >         if (NAPI_GRO_CB(p)->last == p)
> > > >                 skb_shinfo(p)->frag_list = skb;
> > > >         else
> > > >                 NAPI_GRO_CB(p)->last->next = skb;
> > > >
> > > >     ......
> > > >         return 0;
> > > > }
> > > >
> > > >
> > > > test cmd:
> > > >  for i in $(seq 1 4)
> > > >  do
> > > >     redis-benchmark -r 10000000 -n 10000000 -t set -d 1024 -c 8 -P 32 -h  <ip> -p 6379 2>&1 | grep 'per second'  &
> > > >  done
> > > >
> > > > Reported-by: su-lifan@linux.alibaba.com
> > > >
> > > > >
> > > > > Fixes: fd11a83dd363 ("net: Pull out core bits of __netdev_alloc_skb and add __napi_alloc_skb")
> > > > > Signed-off-by: Eric Dumazet <edumazet@google.com>
> > > > > Cc: Alexander Duyck <alexanderduyck@fb.com>
> > > > > Cc: Paolo Abeni <pabeni@redhat.com>
> > > > > Cc: Michael S. Tsirkin <mst@redhat.com>
> > > > > Cc: Greg Thelen <gthelen@google.com>
> > > > > ---
> > > > >  net/core/skbuff.c | 9 +++++++--
> > > > >  1 file changed, 7 insertions(+), 2 deletions(-)
> > > > >
> > > > > diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> > > > > index 7626a33cce590e530f36167bd096026916131897..3a8f55a43e6964344df464a27b9b1faa0eb804f3 100644
> > > > > --- a/net/core/skbuff.c
> > > > > +++ b/net/core/skbuff.c
> > > > > @@ -501,13 +501,17 @@ EXPORT_SYMBOL(__netdev_alloc_skb);
> > > > >  struct sk_buff *__napi_alloc_skb(struct napi_struct *napi, unsigned int len,
> > > > >                                gfp_t gfp_mask)
> > > > >  {
> > > > > -     struct napi_alloc_cache *nc = this_cpu_ptr(&napi_alloc_cache);
> > > > > +     struct napi_alloc_cache *nc;
> > > > >       struct sk_buff *skb;
> > > > >       void *data;
> > > > >
> > > > >       len += NET_SKB_PAD + NET_IP_ALIGN;
> > > > >
> > > > > -     if ((len > SKB_WITH_OVERHEAD(PAGE_SIZE)) ||
> > > > > +     /* If requested length is either too small or too big,
> > > > > +      * we use kmalloc() for skb->head allocation.
> > > > > +      */
> > > > > +     if (len <= SKB_WITH_OVERHEAD(1024) ||
> > > > > +         len > SKB_WITH_OVERHEAD(PAGE_SIZE) ||
> > > > >           (gfp_mask & (__GFP_DIRECT_RECLAIM | GFP_DMA))) {
> > > > >               skb = __alloc_skb(len, gfp_mask, SKB_ALLOC_RX, NUMA_NO_NODE);
> > > > >               if (!skb)
> > > > > @@ -515,6 +519,7 @@ struct sk_buff *__napi_alloc_skb(struct napi_struct *napi, unsigned int len,
> > > > >               goto skb_success;
> > > > >       }
> > > > >
> > > > > +     nc = this_cpu_ptr(&napi_alloc_cache);
> > > > >       len += SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
> > > > >       len = SKB_DATA_ALIGN(len);
> > > > >
> > > > > --
> > > > > 2.30.0.284.gd98b1dd5eaa7-goog
> > > > >
> >
Michael S. Tsirkin April 1, 2021, 1:51 p.m. UTC | #18
On Wed, Mar 31, 2021 at 10:36:35AM +0200, Eric Dumazet wrote:
> On Wed, Mar 31, 2021 at 10:11 AM Michael S. Tsirkin <mst@redhat.com> wrote:
> >
> > On Mon, Mar 29, 2021 at 11:06:09AM +0200, Eric Dumazet wrote:
> > > On Mon, Mar 29, 2021 at 10:52 AM Xuan Zhuo <xuanzhuo@linux.alibaba.com> wrote:
> > > >
> > > > On Wed, 13 Jan 2021 08:18:19 -0800, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > > > > From: Eric Dumazet <edumazet@google.com>
> > > > >
> > > > > Both virtio net and napi_get_frags() allocate skbs
> > > > > with a very small skb->head
> > > > >
> > > > > While using page fragments instead of a kmalloc backed skb->head might give
> > > > > a small performance improvement in some cases, there is a huge risk of
> > > > > under estimating memory usage.
> > > > >
> > > > > For both GOOD_COPY_LEN and GRO_MAX_HEAD, we can fit at least 32 allocations
> > > > > per page (order-3 page in x86), or even 64 on PowerPC
> > > > >
> > > > > We have been tracking OOM issues on GKE hosts hitting tcp_mem limits
> > > > > but consuming far more memory for TCP buffers than instructed in tcp_mem[2]
> > > > >
> > > > > Even if we force napi_alloc_skb() to only use order-0 pages, the issue
> > > > > would still be there on arches with PAGE_SIZE >= 32768
> > > > >
> > > > > This patch makes sure that small skb head are kmalloc backed, so that
> > > > > other objects in the slab page can be reused instead of being held as long
> > > > > as skbs are sitting in socket queues.
> > > > >
> > > > > Note that we might in the future use the sk_buff napi cache,
> > > > > instead of going through a more expensive __alloc_skb()
> > > > >
> > > > > Another idea would be to use separate page sizes depending
> > > > > on the allocated length (to never have more than 4 frags per page)
> > > > >
> > > > > I would like to thank Greg Thelen for his precious help on this matter,
> > > > > analysing crash dumps is always a time consuming task.
> > > >
> > > >
> > > > This patch causes a performance degradation of about 10% in the scenario of
> > > > virtio-net + GRO.
> > > >
> > > > For GRO, there is no way to merge skbs based on frags with this patch, only
> > > > frag_list can be used to link skbs. The problem that this cause are that compared
> > > > to the GRO package merged into the frags way, the current skb needs to call
> > > > kfree_skb_list to release each skb, resulting in performance degradation.
> > > >
> > > > virtio-net will store some data onto the linear space after receiving it. In
> > > > addition to the header, there are also some payloads, so "headlen <= offset"
> > > > fails. And skb->head_frag is failing when use kmalloc() for skb->head allocation.
> > > >
> > >
> > > Thanks for the report.
> > >
> > > There is no way we can make things both fast for existing strategies
> > > used by _insert_your_driver
> > > and malicious usages of data that can sit for seconds/minutes in socket queues.
> > >
> > > I think that if you want to gain this 10% back, you have to change
> > > virtio_net to meet optimal behavior.
> > >
> > > Normal drivers make sure to not pull payload in skb->head, only headers.
> >
> > Hmm we do have hdr_len field, but seem to ignore it on RX.
> > Jason do you see any issues with using it for the head len?
> >
> 
> I was looking at this code (page_to_skb())  a few minutes ago ;)
> 
> pulling payload would make sense only if we can pull all of it (to free the page)
> (This is what some drivers implement and call copybreak)

right.. I wonder whether it's preferable to always copy as much as fits,
or to use hdr_len if it's there?

> Even if we do not have an accurate knowledge of header sizes,
> it would be better to pull only the Ethernet header and let GRO do the
> rest during its dissection.

So IIUC what you are saying is we should do more or less
     if (hdr_len != 0)
                        copy hdr_len
?


> Once fixed, virtio_net will reduce by 2x number of frags per skb,
> compared to the situation before "net: avoid 32 x truesize
> under-estimation for tiny skbs"
> 
> 
> >
> > > Optimal GRO packets are when payload is in page fragments.
> > >
> > > (I am speaking not only for raw performance, but ability for systems
> > > to cope with network outages and sudden increase of memory usage in
> > > out of order queues)
> > >
> > > This has been quite clearly stated in my changelog.
> > >
> > > Thanks.
> > >
> > >
> > > > int skb_gro_receive(struct sk_buff *p, struct sk_buff *skb)
> > > > {
> > > >         struct skb_shared_info *pinfo, *skbinfo = skb_shinfo(skb);
> > > >         unsigned int offset = skb_gro_offset(skb);
> > > >         unsigned int headlen = skb_headlen(skb);
> > > >
> > > >     .......
> > > >
> > > >         if (headlen <= offset) {         // virtio-net will fail
> > > >         ........ // merge by frags
> > > >                 goto done;
> > > >         } else if (skb->head_frag) {     // skb->head_frag is fail when use kmalloc() for skb->head allocation
> > > >         ........ // merge by frags
> > > >                 goto done;
> > > >         }
> > > >
> > > > merge:
> > > >     ......
> > > >
> > > >         if (NAPI_GRO_CB(p)->last == p)
> > > >                 skb_shinfo(p)->frag_list = skb;
> > > >         else
> > > >                 NAPI_GRO_CB(p)->last->next = skb;
> > > >
> > > >     ......
> > > >         return 0;
> > > > }
> > > >
> > > >
> > > > test cmd:
> > > >  for i in $(seq 1 4)
> > > >  do
> > > >     redis-benchmark -r 10000000 -n 10000000 -t set -d 1024 -c 8 -P 32 -h  <ip> -p 6379 2>&1 | grep 'per second'  &
> > > >  done
> > > >
> > > > Reported-by: su-lifan@linux.alibaba.com
> > > >
> > > > >
> > > > > Fixes: fd11a83dd363 ("net: Pull out core bits of __netdev_alloc_skb and add __napi_alloc_skb")
> > > > > Signed-off-by: Eric Dumazet <edumazet@google.com>
> > > > > Cc: Alexander Duyck <alexanderduyck@fb.com>
> > > > > Cc: Paolo Abeni <pabeni@redhat.com>
> > > > > Cc: Michael S. Tsirkin <mst@redhat.com>
> > > > > Cc: Greg Thelen <gthelen@google.com>
> > > > > ---
> > > > >  net/core/skbuff.c | 9 +++++++--
> > > > >  1 file changed, 7 insertions(+), 2 deletions(-)
> > > > >
> > > > > diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> > > > > index 7626a33cce590e530f36167bd096026916131897..3a8f55a43e6964344df464a27b9b1faa0eb804f3 100644
> > > > > --- a/net/core/skbuff.c
> > > > > +++ b/net/core/skbuff.c
> > > > > @@ -501,13 +501,17 @@ EXPORT_SYMBOL(__netdev_alloc_skb);
> > > > >  struct sk_buff *__napi_alloc_skb(struct napi_struct *napi, unsigned int len,
> > > > >                                gfp_t gfp_mask)
> > > > >  {
> > > > > -     struct napi_alloc_cache *nc = this_cpu_ptr(&napi_alloc_cache);
> > > > > +     struct napi_alloc_cache *nc;
> > > > >       struct sk_buff *skb;
> > > > >       void *data;
> > > > >
> > > > >       len += NET_SKB_PAD + NET_IP_ALIGN;
> > > > >
> > > > > -     if ((len > SKB_WITH_OVERHEAD(PAGE_SIZE)) ||
> > > > > +     /* If requested length is either too small or too big,
> > > > > +      * we use kmalloc() for skb->head allocation.
> > > > > +      */
> > > > > +     if (len <= SKB_WITH_OVERHEAD(1024) ||
> > > > > +         len > SKB_WITH_OVERHEAD(PAGE_SIZE) ||
> > > > >           (gfp_mask & (__GFP_DIRECT_RECLAIM | GFP_DMA))) {
> > > > >               skb = __alloc_skb(len, gfp_mask, SKB_ALLOC_RX, NUMA_NO_NODE);
> > > > >               if (!skb)
> > > > > @@ -515,6 +519,7 @@ struct sk_buff *__napi_alloc_skb(struct napi_struct *napi, unsigned int len,
> > > > >               goto skb_success;
> > > > >       }
> > > > >
> > > > > +     nc = this_cpu_ptr(&napi_alloc_cache);
> > > > >       len += SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
> > > > >       len = SKB_DATA_ALIGN(len);
> > > > >
> > > > > --
> > > > > 2.30.0.284.gd98b1dd5eaa7-goog
> > > > >
> >
Eric Dumazet April 1, 2021, 2:08 p.m. UTC | #19
On Thu, Apr 1, 2021 at 3:51 PM Michael S. Tsirkin <mst@redhat.com> wrote:

> So IIUC what you are saying is we should do more or less
>      if (hdr_len != 0)
>                         copy hdr_len
> ?

This part is not pulling bytes into skb->head, but into
skb_vnet_hdr(skb) (which is basically skb->cb)

I suggest the following patch

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index bb4ea9dbc16bcb19c5969fc8247478aa66c63fce..f5a3cecd18eada32694714ecb85c205af7108aae 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -409,9 +409,13 @@ static struct sk_buff *page_to_skb(struct virtnet_info *vi,
        offset += hdr_padded_len;
        p += hdr_padded_len;

-       copy = len;
-       if (copy > skb_tailroom(skb))
-               copy = skb_tailroom(skb);
+       /* Copy all frame if it fits skb->head,
+        * otherwise we let GRO pull headers as needed.
+        */
+       if (len <= skb_tailroom(skb))
+               copy = len;
+       else
+               copy =  ETH_HLEN + metasize;
        skb_put_data(skb, p, copy);

        if (metasize) {
diff --git a/include/linux/virtio_net.h b/include/linux/virtio_net.h
index 6b5fcfa1e5553576b0e853ae31a2df655c04204b..2ee8f3ba76a548d54e0b21321a67da958c9984a0 100644
--- a/include/linux/virtio_net.h
+++ b/include/linux/virtio_net.h
@@ -63,8 +63,12 @@ static inline int virtio_net_hdr_to_skb(struct sk_buff *skb,
        }

        if (hdr->flags & VIRTIO_NET_HDR_F_NEEDS_CSUM) {
-               u16 start = __virtio16_to_cpu(little_endian, hdr->csum_start);
-               u16 off = __virtio16_to_cpu(little_endian, hdr->csum_offset);
+               u32 start = __virtio16_to_cpu(little_endian, hdr->csum_start);
+               u32 off = __virtio16_to_cpu(little_endian, hdr->csum_offset);
+               u32 needed = start + max_t(u32, thlen, off + sizeof(__sum16));
+
+               if (pskb_may_pull(skb, needed))
+                       return -EINVAL;

                if (!skb_partial_csum_set(skb, start, off))
                        return -EINVAL;
@@ -100,14 +104,14 @@ static inline int virtio_net_hdr_to_skb(struct sk_buff *skb,
                        }

                        p_off = keys.control.thoff + thlen;
-                       if (p_off > skb_headlen(skb) ||
+                       if (pskb_may_pull(skb, p_off) ||
                            keys.basic.ip_proto != ip_proto)
                                return -EINVAL;

                        skb_set_transport_header(skb, keys.control.thoff);
                } else if (gso_type) {
                        p_off = thlen;
-                       if (p_off > skb_headlen(skb))
+                       if (pskb_may_pull(skb, p_off))
                                return -EINVAL;
                }
        }
Jason Wang April 2, 2021, 2:52 a.m. UTC | #20
在 2021/4/1 下午5:58, Eric Dumazet 写道:
> On Thu, Apr 1, 2021 at 11:04 AM Xuan Zhuo <xuanzhuo@linux.alibaba.com> wrote:
>> On Thu, 1 Apr 2021 15:14:18 +0800, Jason Wang <jasowang@redhat.com> wrote:
>>> 在 2021/3/31 下午4:11, Michael S. Tsirkin 写道:
>>>> On Mon, Mar 29, 2021 at 11:06:09AM +0200, Eric Dumazet wrote:
>>>>> On Mon, Mar 29, 2021 at 10:52 AM Xuan Zhuo <xuanzhuo@linux.alibaba.com> wrote:
>>>>>> On Wed, 13 Jan 2021 08:18:19 -0800, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>>>>>>> From: Eric Dumazet <edumazet@google.com>
>>>>>>>
>>>>>>> Both virtio net and napi_get_frags() allocate skbs
>>>>>>> with a very small skb->head
>>>>>>>
>>>>>>> While using page fragments instead of a kmalloc backed skb->head might give
>>>>>>> a small performance improvement in some cases, there is a huge risk of
>>>>>>> under estimating memory usage.
>>>>>>>
>>>>>>> For both GOOD_COPY_LEN and GRO_MAX_HEAD, we can fit at least 32 allocations
>>>>>>> per page (order-3 page in x86), or even 64 on PowerPC
>>>>>>>
>>>>>>> We have been tracking OOM issues on GKE hosts hitting tcp_mem limits
>>>>>>> but consuming far more memory for TCP buffers than instructed in tcp_mem[2]
>>>>>>>
>>>>>>> Even if we force napi_alloc_skb() to only use order-0 pages, the issue
>>>>>>> would still be there on arches with PAGE_SIZE >= 32768
>>>>>>>
>>>>>>> This patch makes sure that small skb head are kmalloc backed, so that
>>>>>>> other objects in the slab page can be reused instead of being held as long
>>>>>>> as skbs are sitting in socket queues.
>>>>>>>
>>>>>>> Note that we might in the future use the sk_buff napi cache,
>>>>>>> instead of going through a more expensive __alloc_skb()
>>>>>>>
>>>>>>> Another idea would be to use separate page sizes depending
>>>>>>> on the allocated length (to never have more than 4 frags per page)
>>>>>>>
>>>>>>> I would like to thank Greg Thelen for his precious help on this matter,
>>>>>>> analysing crash dumps is always a time consuming task.
>>>>>> This patch causes a performance degradation of about 10% in the scenario of
>>>>>> virtio-net + GRO.
>>>>>>
>>>>>> For GRO, there is no way to merge skbs based on frags with this patch, only
>>>>>> frag_list can be used to link skbs. The problem that this cause are that compared
>>>>>> to the GRO package merged into the frags way, the current skb needs to call
>>>>>> kfree_skb_list to release each skb, resulting in performance degradation.
>>>>>>
>>>>>> virtio-net will store some data onto the linear space after receiving it. In
>>>>>> addition to the header, there are also some payloads, so "headlen <= offset"
>>>>>> fails. And skb->head_frag is failing when use kmalloc() for skb->head allocation.
>>>>>>
>>>>> Thanks for the report.
>>>>>
>>>>> There is no way we can make things both fast for existing strategies
>>>>> used by _insert_your_driver
>>>>> and malicious usages of data that can sit for seconds/minutes in socket queues.
>>>>>
>>>>> I think that if you want to gain this 10% back, you have to change
>>>>> virtio_net to meet optimal behavior.
>>>>>
>>>>> Normal drivers make sure to not pull payload in skb->head, only headers.
>>>> Hmm we do have hdr_len field, but seem to ignore it on RX.
>>>> Jason do you see any issues with using it for the head len?
>>>
>>> This might work only if the device sets a correct hdr_len. I'm not sure
>>> all of the devices can do this properly. E.g for tap, we use
>>> skb_headlen() in virtio_net_hdr_from_skb() which depends highly on the
>>> behaviour of the underlayer layers (device driver or GRO). And we only
>>> set this hint for GSO packet but virtio-net may tries to do GRO for non
>>> GSO packets.
>>>
>>> Thanks
>> hi, Jason
>>
>> I personally prefer to use build_skb to create skb, so the problem here is
>> actually gone.
>>
>> The premise of this is that the buffer added by add_recvbuf_mergeable must
>> retain a skb_shared_info. Of course, then rx frags coalescing won't
>> work. But I consider that suppose the size of the mrg_avg_pkt_len 1500, so we
>> can still store 17 * 1500 = 24k packets in a skb. If the packet is really big,
>> the mrg_avg_pkt_len will also increase, and the buffer allocated later will
>> increase. When the mrg_avg_pkt_len is greater than PAGE_SIZE/2, rx frags
>> coalesce is no longer needed. Because we can't allocate two bufs with a value of
>> mrg_avg_pkt_len on the same page.
>>
> For the record I implemented build_skb() 10 years ago, so you can
> trust me when I
> am saying this will not help.
>
> Using build_skb() will waste additional skb_shared_info per MSS.
> That's an increase of 20% of memory, for nothing at all.


So I wonder whether something like the following would help. We know the
frag size, which means that if we know there's sufficient tailroom we can use
build_skb() without reserving dedicated room for skb_shared_info.
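
A minimal sketch of that idea (the surrounding receive-path details and
the 'buf'/'headroom'/'truesize' locals are assumptions, not the actual
virtio_net code):

	/* Hypothetical sketch: use build_skb() only when the received
	 * buffer already has tailroom for skb_shared_info, so no extra
	 * per-buffer room needs to be reserved up front.
	 */
	unsigned int shinfo_size = SKB_DATA_ALIGN(sizeof(struct skb_shared_info));

	if (headroom + len + shinfo_size <= truesize) {
		skb = build_skb(buf, truesize);	/* no copy, frame stays in place */
		if (unlikely(!skb))
			return NULL;
		skb_reserve(skb, headroom);
		skb_put(skb, len);
	} else {
		/* fall back to the existing copy path (page_to_skb()) */
	}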

Thanks


>
> Also there are cases when this won't be possible, say if you use an MTU of 4000
>
>
>
>
>
>> Thanks.
>>
>>>
>>>>
>>>>> Optimal GRO packets are when payload is in page fragments.
>>>>>
>>>>> (I am speaking not only for raw performance, but ability for systems
>>>>> to cope with network outages and sudden increase of memory usage in
>>>>> out of order queues)
>>>>>
>>>>> This has been quite clearly stated in my changelog.
>>>>>
>>>>> Thanks.
>>>>>
>>>>>
>>>>>> int skb_gro_receive(struct sk_buff *p, struct sk_buff *skb)
>>>>>> {
>>>>>>           struct skb_shared_info *pinfo, *skbinfo = skb_shinfo(skb);
>>>>>>           unsigned int offset = skb_gro_offset(skb);
>>>>>>           unsigned int headlen = skb_headlen(skb);
>>>>>>
>>>>>>       .......
>>>>>>
>>>>>>           if (headlen <= offset) {         // virtio-net will fail
>>>>>>           ........ // merge by frags
>>>>>>                   goto done;
>>>>>>           } else if (skb->head_frag) {     // skb->head_frag is fail when use kmalloc() for skb->head allocation
>>>>>>           ........ // merge by frags
>>>>>>                   goto done;
>>>>>>           }
>>>>>>
>>>>>> merge:
>>>>>>       ......
>>>>>>
>>>>>>           if (NAPI_GRO_CB(p)->last == p)
>>>>>>                   skb_shinfo(p)->frag_list = skb;
>>>>>>           else
>>>>>>                   NAPI_GRO_CB(p)->last->next = skb;
>>>>>>
>>>>>>       ......
>>>>>>           return 0;
>>>>>> }
>>>>>>
>>>>>>
>>>>>> test cmd:
>>>>>>    for i in $(seq 1 4)
>>>>>>    do
>>>>>>       redis-benchmark -r 10000000 -n 10000000 -t set -d 1024 -c 8 -P 32 -h  <ip> -p 6379 2>&1 | grep 'per second'  &
>>>>>>    done
>>>>>>
>>>>>> Reported-by: su-lifan@linux.alibaba.com
>>>>>>
>>>>>>> Fixes: fd11a83dd363 ("net: Pull out core bits of __netdev_alloc_skb and add __napi_alloc_skb")
>>>>>>> Signed-off-by: Eric Dumazet <edumazet@google.com>
>>>>>>> Cc: Alexander Duyck <alexanderduyck@fb.com>
>>>>>>> Cc: Paolo Abeni <pabeni@redhat.com>
>>>>>>> Cc: Michael S. Tsirkin <mst@redhat.com>
>>>>>>> Cc: Greg Thelen <gthelen@google.com>
>>>>>>> ---
>>>>>>>    net/core/skbuff.c | 9 +++++++--
>>>>>>>    1 file changed, 7 insertions(+), 2 deletions(-)
>>>>>>>
>>>>>>> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
>>>>>>> index 7626a33cce590e530f36167bd096026916131897..3a8f55a43e6964344df464a27b9b1faa0eb804f3 100644
>>>>>>> --- a/net/core/skbuff.c
>>>>>>> +++ b/net/core/skbuff.c
>>>>>>> @@ -501,13 +501,17 @@ EXPORT_SYMBOL(__netdev_alloc_skb);
>>>>>>>    struct sk_buff *__napi_alloc_skb(struct napi_struct *napi, unsigned int len,
>>>>>>>                                  gfp_t gfp_mask)
>>>>>>>    {
>>>>>>> -     struct napi_alloc_cache *nc = this_cpu_ptr(&napi_alloc_cache);
>>>>>>> +     struct napi_alloc_cache *nc;
>>>>>>>         struct sk_buff *skb;
>>>>>>>         void *data;
>>>>>>>
>>>>>>>         len += NET_SKB_PAD + NET_IP_ALIGN;
>>>>>>>
>>>>>>> -     if ((len > SKB_WITH_OVERHEAD(PAGE_SIZE)) ||
>>>>>>> +     /* If requested length is either too small or too big,
>>>>>>> +      * we use kmalloc() for skb->head allocation.
>>>>>>> +      */
>>>>>>> +     if (len <= SKB_WITH_OVERHEAD(1024) ||
>>>>>>> +         len > SKB_WITH_OVERHEAD(PAGE_SIZE) ||
>>>>>>>             (gfp_mask & (__GFP_DIRECT_RECLAIM | GFP_DMA))) {
>>>>>>>                 skb = __alloc_skb(len, gfp_mask, SKB_ALLOC_RX, NUMA_NO_NODE);
>>>>>>>                 if (!skb)
>>>>>>> @@ -515,6 +519,7 @@ struct sk_buff *__napi_alloc_skb(struct napi_struct *napi, unsigned int len,
>>>>>>>                 goto skb_success;
>>>>>>>         }
>>>>>>>
>>>>>>> +     nc = this_cpu_ptr(&napi_alloc_cache);
>>>>>>>         len += SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
>>>>>>>         len = SKB_DATA_ALIGN(len);
>>>>>>>
>>>>>>> --
>>>>>>> 2.30.0.284.gd98b1dd5eaa7-goog
>>>>>>>
Eric Dumazet April 2, 2021, 12:52 p.m. UTC | #21
On Fri, Apr 2, 2021 at 12:35 PM Xuan Zhuo <xuanzhuo@linux.alibaba.com> wrote:
>
>
> Great!!
>
> test cmd:
>         sockperf tp --tcp --ip <ip> -m 8192 -t 300
>
>                 server softirq cpu | server sockperf cpu | rxpck/s | rxkB/s
> page_frag_alloc 100%                 86.1%                 2211785   3270145
> kmalloc         100%                 100%                  2226950   3292571
> copy ETH        100%                 69.3%                 2140552   3163408
>
> Compared with page_frag_alloc, the performance has also increased by 19.4%. The
> reason for such a big improvement I think is that although merge is turned on in
> my test environment, there is only one buffer for each packet received.
>
> So in the page_frag_alloc scenario, when performing gro merge, one skb will
> occupy two frags, one is used to store the data in the linear space, and the
> other is a frag in the skb frags. Now, the linear space only has the header, so
> one skb only occupies one frag space.
>
> In the case of using page_frag_alloc, an skb can only contain the original 8
> packets. Now a skb merges the original 17 packets

Exactly. Never put payload in skb->head, unless you are sure that:
1) All payload can be put there (or it is filling a full page,
followed by remaining payload in further frags for big MTU)
2) skb->head is backed by a page frag as well (skb->head_frag == 1)

>
> It's really exciting.
>
> I fixed the retval check of pskb_may_pull and added a pskb_may_pull call.
> Here is the patch I used for testing.
>
> Thanks very much.

Excellent, I am going to submit this patch officially, since it is not
risky for stable kernels.
I will give you credits as Reporter, and co-developer.

Thanks for your help !
Eric Dumazet April 2, 2021, 12:53 p.m. UTC | #22
On Fri, Apr 2, 2021 at 1:08 PM Xuan Zhuo <xuanzhuo@linux.alibaba.com> wrote:
>
> On Fri, 2 Apr 2021 10:52:12 +0800, Jason Wang <jasowang@redhat.com> wrote:
> >
> > So I wonder something like the following like this help. We know the
> > frag size, that means, if we know there's sufficient tailroom we can use
> > build_skb() without reserving dedicated room for skb_shared_info.
> >
> > Thanks
> >
> >
>
> Do you mean so?
>
> I have also considered this scenario, although build_skb is not always used, but
> it is also very good for the right situation.
>
> Thanks.
>

I prefer doing such experiments for net-next kernels.

Let's keep a simple patch for stable kernels.
Paolo Abeni Sept. 7, 2022, 8:19 p.m. UTC | #23
Hello,

reviving an old thread...
On Wed, 2021-01-13 at 08:18 -0800, Eric Dumazet wrote:
> While using page fragments instead of a kmalloc backed skb->head might give
> a small performance improvement in some cases, there is a huge risk of
> under estimating memory usage.

[...]

> Note that we might in the future use the sk_buff napi cache,
> instead of going through a more expensive __alloc_skb()
> 
> Another idea would be to use separate page sizes depending
> on the allocated length (to never have more than 4 frags per page)

I'm investigating a couple of performance regressions pointing to this
change and I'd like to have a try at the 2nd suggestion above.

If I read correctly, it means:
- extend the page_frag_cache alloc API to allow forcing max order==0
- add a 2nd page_frag_cache into napi_alloc_cache (say page_order0 or
page_small)
- in __napi_alloc_skb(), when len <= SKB_WITH_OVERHEAD(1024), use the
page_small cache with order 0 allocation.
(all the above constrained to hosts with 4K pages)
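
A rough sketch of the resulting __napi_alloc_skb() branch (the page_small
cache and page_frag_alloc_order0() are made-up names for what would need
to be added; everything else only loosely follows the current code):

	struct napi_alloc_cache *nc = this_cpu_ptr(&napi_alloc_cache);
	void *data;

	if (PAGE_SIZE == SZ_4K && len <= SKB_WITH_OVERHEAD(1024))
		/* hypothetical order-0-only cache and helper */
		data = page_frag_alloc_order0(&nc->page_small, len, gfp_mask);
	else
		data = page_frag_alloc(&nc->page, len, gfp_mask);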

I'm not quite sure about the "never have more than 4 frags per page"
part.

What's outlined above will allow for 10 min-size frags in page_order0, as
SKB_DATA_ALIGN(0) + SKB_DATA_ALIGN(sizeof(struct skb_shared_info)) == 384. I'm
not sure that anything will allocate such small frags.
With a more reasonable GRO_MAX_HEAD, there will be 6 frags per page. 

The maximum truesize underestimation in both cases will be lower than
what we can get with the current code in the worst case (almost 32x
AFAICS). 

Is the above scheme safe enough, or should the requested size be
artificially inflated to fit at most 4 allocations per page_order0?
Am I missing something else? Apart from omitting a good deal of testing in
the above list ;)

Thanks!

Paolo
Eric Dumazet Sept. 7, 2022, 8:40 p.m. UTC | #24
On 9/7/22 13:19, Paolo Abeni wrote:
> Hello,
>
> reviving an old thread...
> On Wed, 2021-01-13 at 08:18 -0800, Eric Dumazet wrote:
>> While using page fragments instead of a kmalloc backed skb->head might give
>> a small performance improvement in some cases, there is a huge risk of
>> under estimating memory usage.
> [...]
>
>> Note that we might in the future use the sk_buff napi cache,
>> instead of going through a more expensive __alloc_skb()
>>
>> Another idea would be to use separate page sizes depending
>> on the allocated length (to never have more than 4 frags per page)
> I'm investigating a couple of performance regressions pointing to this
> change and I'd like to have a try to the 2nd suggestion above.
>
> If I read correctly, it means:
> - extend the page_frag_cache alloc API to allow forcing max order==0
> - add a 2nd page_frag_cache into napi_alloc_cache (say page_order0 or
> page_small)
> - in __napi_alloc_skb(), when len <= SKB_WITH_OVERHEAD(1024), use the
> page_small cache with order 0 allocation.
> (all the above constrained to host with 4K pages)
>
> I'm not quite sure about the "never have more than 4 frags per page"
> part.
>
> What outlined above will allow for 10 min size frags in page_order0, as
> (SKB_DATA_ALIGN(0) + SKB_DATA_ALIGN(struct skb_shared_info) == 384. I'm
> not sure that anything will allocate such small frags.
> With a more reasonable GRO_MAX_HEAD, there will be 6 frags per page.


Well, some arches have PAGE_SIZE=65536 :/


>
> The maximum truesize underestimation in both cases will be lower than
> what we can get with the current code in the worst case (almost 32x
> AFAICS).
>
> Is the above schema safe enough or should the requested size
> artificially inflatted to fit at most 4 allocations per page_order0?
> Am I miss something else? Apart from omitting a good deal of testing in
> the above list ;)
>
> Thanks!
>
> Paolo
>
Alexander Duyck Sept. 7, 2022, 9:36 p.m. UTC | #25
On Wed, 2022-09-07 at 22:19 +0200, Paolo Abeni wrote:
> Hello,
> 
> reviving an old thread...
> On Wed, 2021-01-13 at 08:18 -0800, Eric Dumazet wrote:
> > While using page fragments instead of a kmalloc backed skb->head might give
> > a small performance improvement in some cases, there is a huge risk of
> > under estimating memory usage.
> 
> [...]
> 
> > Note that we might in the future use the sk_buff napi cache,
> > instead of going through a more expensive __alloc_skb()
> > 
> > Another idea would be to use separate page sizes depending
> > on the allocated length (to never have more than 4 frags per page)
> 
> I'm investigating a couple of performance regressions pointing to this
> change and I'd like to have a try to the 2nd suggestion above. 
> 
> If I read correctly, it means:
> - extend the page_frag_cache alloc API to allow forcing max order==0
> - add a 2nd page_frag_cache into napi_alloc_cache (say page_order0 or
> page_small)
> - in __napi_alloc_skb(), when len <= SKB_WITH_OVERHEAD(1024), use the
> page_small cache with order 0 allocation.
> (all the above constrained to host with 4K pages)
> 
> I'm not quite sure about the "never have more than 4 frags per page"
> part.
> 
> What outlined above will allow for 10 min size frags in page_order0, as
> (SKB_DATA_ALIGN(0) + SKB_DATA_ALIGN(struct skb_shared_info) == 384. I'm
> not sure that anything will allocate such small frags.
> With a more reasonable GRO_MAX_HEAD, there will be 6 frags per page. 

That doesn't account for any headroom though. Most of the time you have
to reserve some space for headroom so that if this buffer ends up
getting routed off somewhere to be tunneled there is room for adding to
the header. I think the default ends up being NET_SKB_PAD, though many
NICs use larger values. So adding any data onto that will push you up
to a minimum of 512 per skb for the first 64B for header data.

With that said it would probably put you in the range of 8 or fewer
skbs per page assuming at least 1 byte for data:
  512 =	SKB_DATA_ALIGN(NET_SKB_PAD + 1) +
	SKB_DATA_ALIGN(struct skb_shared_info)

> The maximum truesize underestimation in both cases will be lower than
> what we can get with the current code in the worst case (almost 32x
> AFAICS). 
> 
> Is the above schema safe enough or should the requested size
> artificially inflatted to fit at most 4 allocations per page_order0?
> Am I miss something else? Apart from omitting a good deal of testing in
> the above list ;) 

If we are working with an order 0 page we may just want to split it up
into fixed 1K fragments and not bother with a variable pagecnt bias.
Doing that we would likely simplify this quite a bit and avoid having
to do as much page count manipulation which could get expensive if we
are not getting many uses out of the page. An added advantage is that
we can get rid of the pagecnt_bias and just work based on the page
offset.
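
For illustration, that fixed-1K-chunk idea could look roughly like the
sketch below (struct and function names are made up, and the refcounting
shown is only a sketch):

struct small_frag_cache {		/* made-up name */
	struct page	*page;
	unsigned int	offset;
};

static void *small_frag_alloc(struct small_frag_cache *c, gfp_t gfp)
{
	void *data;

	if (!c->page || c->offset + SZ_1K > PAGE_SIZE) {
		if (c->page)
			put_page(c->page);	/* drop the cache's own reference */
		c->page = alloc_page(gfp);
		if (!c->page)
			return NULL;
		c->offset = 0;
	}
	data = page_address(c->page) + c->offset;
	c->offset += SZ_1K;	/* fixed 1K chunks, tracked only by offset */
	get_page(c->page);	/* consumer releases it when the skb is freed */
	return data;
}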

As such I am not sure the page frag cache would really be that good of
a fit since we have quite a bit of overhead in terms of maintaining the
pagecnt_bias which assumes the page is a bit longer lived so the ratio
of refcnt updates vs pagecnt_bias updates is better.
Paolo Abeni Sept. 8, 2022, 10:48 a.m. UTC | #26
On Wed, 2022-09-07 at 13:40 -0700, Eric Dumazet wrote:
> On 9/7/22 13:19, Paolo Abeni wrote:
> > Hello,
> > 
> > reviving an old thread...
> > On Wed, 2021-01-13 at 08:18 -0800, Eric Dumazet wrote:
> > > While using page fragments instead of a kmalloc backed skb->head might give
> > > a small performance improvement in some cases, there is a huge risk of
> > > under estimating memory usage.
> > [...]
> > 
> > > Note that we might in the future use the sk_buff napi cache,
> > > instead of going through a more expensive __alloc_skb()
> > > 
> > > Another idea would be to use separate page sizes depending
> > > on the allocated length (to never have more than 4 frags per page)
> > I'm investigating a couple of performance regressions pointing to this
> > change and I'd like to have a try to the 2nd suggestion above.
> > 
> > If I read correctly, it means:
> > - extend the page_frag_cache alloc API to allow forcing max order==0
> > - add a 2nd page_frag_cache into napi_alloc_cache (say page_order0 or
> > page_small)
> > - in __napi_alloc_skb(), when len <= SKB_WITH_OVERHEAD(1024), use the
> > page_small cache with order 0 allocation.
> > (all the above constrained to host with 4K pages)
> > 
> > I'm not quite sure about the "never have more than 4 frags per page"
> > part.
> > 
> > What outlined above will allow for 10 min size frags in page_order0, as
> > (SKB_DATA_ALIGN(0) + SKB_DATA_ALIGN(struct skb_shared_info) == 384. I'm
> > not sure that anything will allocate such small frags.
> > With a more reasonable GRO_MAX_HEAD, there will be 6 frags per page.
> 
> Well, some arches have PAGE_SIZE=65536 :/

Yes, the idea is to implement all the above only for arches with
PAGE_SIZE==4K. Would that be reasonable? 

Thanks!

Paolo
Paolo Abeni Sept. 8, 2022, 11 a.m. UTC | #27
On Wed, 2022-09-07 at 14:36 -0700, Alexander H Duyck wrote:
> On Wed, 2022-09-07 at 22:19 +0200, Paolo Abeni wrote:
> > What outlined above will allow for 10 min size frags in page_order0, as
> > (SKB_DATA_ALIGN(0) + SKB_DATA_ALIGN(struct skb_shared_info) == 384. I'm
> > not sure that anything will allocate such small frags.
> > With a more reasonable GRO_MAX_HEAD, there will be 6 frags per page. 
> 
> That doesn't account for any headroom though. 

Yes, the 0-size data packet was just a theoretical example to construct the
absolute worst-case scenario.

> Most of the time you have
> to reserve some space for headroom so that if this buffer ends up
> getting routed off somewhere to be tunneled there is room for adding to
> the header. I think the default ends up being NET_SKB_PAD, though many
> NICs use larger values. So adding any data onto that will push you up
> to a minimum of 512 per skb for the first 64B for header data.
> 
> With that said it would probably put you in the range of 8 or fewer
> skbs per page assuming at least 1 byte for data:
>   512 =	SKB_DATA_ALIGN(NET_SKB_PAD + 1) +
> 	SKB_DATA_ALIGN(struct skb_shared_info)

In most builds GRO_MAX_HEAD packets are even larger (should be 640)

> > The maximum truesize underestimation in both cases will be lower than
> > what we can get with the current code in the worst case (almost 32x
> > AFAICS). 
> > 
> > Is the above scheme safe enough, or should the requested size be
> > artificially inflated to fit at most 4 allocations per page_order0?
> > Am I missing something else? Apart from omitting a good deal of testing in
> > the above list ;)
> 
> If we are working with an order 0 page we may just want to split it up
> into a fixed 1K fragments and not bother with a variable pagecnt bias.
> Doing that we would likely simplify this quite a bit and avoid having
> to do as much page count manipulation which could get expensive if we
> are not getting many uses out of the page. An added advantage is that
> we can get rid of the pagecnt_bias and just work based on the page
> offset.
> 
> As such I am not sure the page frag cache would really be that good of
> a fit since we have quite a bit of overhead in terms of maintaining the
> pagecnt_bias which assumes the page is a bit longer lived so the ratio
> of refcnt updates vs pagecnt_bias updates is better.

I see. With the above scheme there will be 4-6 frags per packet. My wild
guess is that the pagecnt_bias optimization still gives some gain in that
case, but I really should collect some data points.

If the pagecnt optimization should be dropped, it would probably be more
straightforward to use/adapt 'page_frag' for the page_order0 allocator.

BTW it's quite strange/confusing having two very similar APIs (page_frag
and page_frag_cache) with very similar names and no references between
them.

Thanks!

Paolo
Eric Dumazet Sept. 8, 2022, 12:20 p.m. UTC | #28
On Thu, Sep 8, 2022 at 3:48 AM Paolo Abeni <pabeni@redhat.com> wrote:
>
> On Wed, 2022-09-07 at 13:40 -0700, Eric Dumazet wrote:
> > On 9/7/22 13:19, Paolo Abeni wrote:
> > > Hello,
> > >
> > > reviving an old thread...
> > > On Wed, 2021-01-13 at 08:18 -0800, Eric Dumazet wrote:
> > > > While using page fragments instead of a kmalloc backed skb->head might give
> > > > a small performance improvement in some cases, there is a huge risk of
> > > > under estimating memory usage.
> > > [...]
> > >
> > > > Note that we might in the future use the sk_buff napi cache,
> > > > instead of going through a more expensive __alloc_skb()
> > > >
> > > > Another idea would be to use separate page sizes depending
> > > > on the allocated length (to never have more than 4 frags per page)
> > > I'm investigating a couple of performance regressions pointing to this
> > > change and I'd like to have a try to the 2nd suggestion above.
> > >
> > > If I read correctly, it means:
> > > - extend the page_frag_cache alloc API to allow forcing max order==0
> > > - add a 2nd page_frag_cache into napi_alloc_cache (say page_order0 or
> > > page_small)
> > > - in __napi_alloc_skb(), when len <= SKB_WITH_OVERHEAD(1024), use the
> > > page_small cache with order 0 allocation.
> > > (all the above constrained to host with 4K pages)
> > >
> > > I'm not quite sure about the "never have more than 4 frags per page"
> > > part.
> > >
> > > What is outlined above will allow for 10 min size frags in page_order0, as
> > > SKB_DATA_ALIGN(0) + SKB_DATA_ALIGN(struct skb_shared_info) == 384. I'm
> > > not sure that anything will allocate such small frags.
> > > With a more reasonable GRO_MAX_HEAD, there will be 6 frags per page.
> >
> > Well, some arches have PAGE_SIZE=65536 :/
>
> Yes, the idea is to implement all the above only for arches with
> PAGE_SIZE==4K. Would that be reasonable?

Well, we also have changed MAX_SKB_FRAGS from 17 to 45 for BIG TCP.

And locally we have

#define GRO_MAX_HEAD 192

Reference:

commit fd9ea57f4e9514f9d0f0dec505eefd99a8faa148
Author: Eric Dumazet <edumazet@google.com>
Date:   Wed Jun 8 09:04:38 2022 -0700

    net: add napi_get_frags_check() helper


>
> Thanks!
>
> Paolo
>
Paolo Abeni Sept. 8, 2022, 2:26 p.m. UTC | #29
On Thu, 2022-09-08 at 05:20 -0700, Eric Dumazet wrote:
> On Thu, Sep 8, 2022 at 3:48 AM Paolo Abeni <pabeni@redhat.com> wrote:
> > On Wed, 2022-09-07 at 13:40 -0700, Eric Dumazet wrote:
> > > On 9/7/22 13:19, Paolo Abeni wrote:
> > > > reviving an old thread...
> > > > On Wed, 2021-01-13 at 08:18 -0800, Eric Dumazet wrote:
> > > > > While using page fragments instead of a kmalloc backed skb->head might give
> > > > > a small performance improvement in some cases, there is a huge risk of
> > > > > under estimating memory usage.
> > > > [...]
> > > > 
> > > > > Note that we might in the future use the sk_buff napi cache,
> > > > > instead of going through a more expensive __alloc_skb()
> > > > > 
> > > > > Another idea would be to use separate page sizes depending
> > > > > on the allocated length (to never have more than 4 frags per page)
> > > > I'm investigating a couple of performance regressions pointing to this
> > > > change and I'd like to have a try to the 2nd suggestion above.
> > > > 
> > > > If I read correctly, it means:
> > > > - extend the page_frag_cache alloc API to allow forcing max order==0
> > > > - add a 2nd page_frag_cache into napi_alloc_cache (say page_order0 or
> > > > page_small)
> > > > - in __napi_alloc_skb(), when len <= SKB_WITH_OVERHEAD(1024), use the
> > > > page_small cache with order 0 allocation.
> > > > (all the above constrained to host with 4K pages)
> > > > 
> > > > I'm not quite sure about the "never have more than 4 frags per page"
> > > > part.
> > > > 
> > > > What is outlined above will allow for 10 min size frags in page_order0, as
> > > > SKB_DATA_ALIGN(0) + SKB_DATA_ALIGN(struct skb_shared_info) == 384. I'm
> > > > not sure that anything will allocate such small frags.
> > > > With a more reasonable GRO_MAX_HEAD, there will be 6 frags per page.
> > > 
> > > Well, some arches have PAGE_SIZE=65536 :/
> > 
> > Yes, the idea is to implement all the above only for arches with
> > PAGE_SIZE==4K. Would that be reasonable?
> 
> Well, we also have changed MAX_SKB_FRAGS from 17 to 45 for BIG TCP.
> 
> And locally we have
> 
> #define GRO_MAX_HEAD 192

The default allocation size for napi_get_frags() is ~960b in the Google
kernel, right? It looks like it should fit the above quite nicely, with 4
frags per page?!?

The vanilla kernel may hit a larger number of fragments per page, even if
very likely not as high as the theoretical maximum mentioned in my
previous email (as noted by Alex). 

If, in that case, excessive truesize underestimation would still be
problematic (with an order-0 4K page), __napi_alloc_skb() could be patched
to increase smaller sizes to some reasonable minimum. 
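
Something along these lines, purely illustrative and untested, in the
frag-backed path of __napi_alloc_skb():

	/* hypothetical clamp: never carve less than 1/4 of a 4K page,
	 * so the truesize under-estimation is bounded to ~4x at worst
	 */
	len = max_t(unsigned int, len, SKB_WITH_OVERHEAD(1024));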

Likely there is some point in your reply I did not get. Luckily LPC is
coming :) 

> Reference:
> 
> commit fd9ea57f4e9514f9d0f0dec505eefd99a8faa148
> Author: Eric Dumazet <edumazet@google.com>
> Date:   Wed Jun 8 09:04:38 2022 -0700
> 
>     net: add napi_get_frags_check() helper

I guess such a check should be revisited with all the above. 

Thanks,

Paolo
Alexander Duyck Sept. 8, 2022, 2:53 p.m. UTC | #30
On Thu, 2022-09-08 at 13:00 +0200, Paolo Abeni wrote:
> On Wed, 2022-09-07 at 14:36 -0700, Alexander H Duyck wrote:
> > On Wed, 2022-09-07 at 22:19 +0200, Paolo Abeni wrote:
> > > What is outlined above will allow for 10 min size frags in page_order0, as
> > > SKB_DATA_ALIGN(0) + SKB_DATA_ALIGN(struct skb_shared_info) == 384. I'm
> > > not sure that anything will allocate such small frags.
> > > With a more reasonable GRO_MAX_HEAD, there will be 6 frags per page.
> > 
> > That doesn't account for any headroom though. 
> 
> Yes, the 0-size data packet was just a theoretical example to make the
> really worst case scenario.
> 
> > Most of the time you have
> > to reserve some space for headroom so that if this buffer ends up
> > getting routed off somewhere to be tunneled there is room for adding to
> > the header. I think the default ends up being NET_SKB_PAD, though many
> > NICs use larger values. So adding any data onto that will push you up
> > to a minimum of 512 per skb for the first 64B for header data.
> > 
> > With that said it would probably put you in the range of 8 or fewer
> > skbs per page assuming at least 1 byte for data:
> >   512 =	SKB_DATA_ALIGN(NET_SKB_PAD + 1) +
> > 	SKB_DATA_ALIGN(struct skb_shared_info)
> 
> In most build GRO_MAX_HEAD packets are even larger (should be 640)

Right, which is why I am thinking we may want to default to a 1K slice.

> > > The maximum truesize underestimation in both cases will be lower than
> > > what we can get with the current code in the worst case (almost 32x
> > > AFAICS). 
> > > 
> > > Is the above scheme safe enough, or should the requested size be
> > > artificially inflated to fit at most 4 allocations per page_order0?
> > > Am I missing something else? Apart from omitting a good deal of testing in
> > > the above list ;)
> > 
> > If we are working with an order 0 page we may just want to split it up
> > into a fixed 1K fragments and not bother with a variable pagecnt bias.
> > Doing that we would likely simplify this quite a bit and avoid having
> > to do as much page count manipulation which could get expensive if we
> > are not getting many uses out of the page. An added advantage is that
> > we can get rid of the pagecnt_bias and just work based on the page
> > offset.
> > 
> > As such I am not sure the page frag cache would really be that good of
> > a fit since we have quite a bit of overhead in terms of maintaining the
> > pagecnt_bias which assumes the page is a bit longer lived so the ratio
> > of refcnt updates vs pagecnt_bias updates is better.
> 
> I see. With the above scheme there will be 4-6 frags per packet. My wild
> guess is that the pagecnt_bias optimization still gives some gain in that
> case, but I really should collect some data points.

As I recall one of the big advantages of the 32K page was that we were
reducing the atomic ops by nearly half. Essentially we did a
page_ref_add at the start and a page_ref_sub_and_test when we were out
of space. Whereas a single 4K allocation would cost 2 atomic ops per
allocation, we were averaging only ~1.13 per 2K slice. With the Intel
NICs I was able to get even closer to 1, since I was able to do the 2K
flip/flop setup and could get up to 64K uses off of a single page.

Then again, though, I am not sure how much the atomic ops penalty will
matter for your use case. Where it came into play is that MMIO writes to
the device will block atomic ops until they can be completed, so in a
device driver atomic ops become very expensive and we want to batch them
as much as possible.
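
(Roughly where those numbers come from, assuming one put_page() per
consumed slice: 32K / 2K = 16 slices; one page_ref_add when the page is
installed, plus one page_ref_sub_and_test when it is exhausted, plus 16
puts is 18 atomic ops for 16 slices, i.e. ~1.13 per slice, versus one
reference at allocation plus one put on free, i.e. 2, for a dedicated
page per skb.)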

> If the pagecnt optimization should be dropped, it would be probably
> more straight-forward to use/adapt 'page_frag' for the page_order0
> allocator.

That would make sense. Basically we could get rid of the pagecnt bias
and add the fixed number of slices to the page count at allocation, so we
would just need to track the offset to decide when we need to allocate
a new page. In addition, if we are flushing the page when it is depleted
we don't have to mess with the pfmemalloc logic.
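
Just to make it concrete, a rough sketch of what I have in mind
(completely untested, struct/function names invented for the example):

#include <linux/gfp.h>
#include <linux/mm.h>

struct napi_small_frag_cache {	/* hypothetical, not existing code */
	struct page	*page;
	unsigned int	offset;
};

#define SMALL_FRAG_SZ		1024
#define SMALL_FRAGS_PER_PAGE	(PAGE_SIZE / SMALL_FRAG_SZ)

static void *small_frag_alloc(struct napi_small_frag_cache *c, gfp_t gfp)
{
	void *va;

	if (!c->page) {
		c->page = alloc_page(gfp);
		if (!c->page)
			return NULL;
		/* one reference per 1K slice, taken up front;
		 * alloc_page() already gave us the first one
		 */
		page_ref_add(c->page, SMALL_FRAGS_PER_PAGE - 1);
		c->offset = 0;
	}

	va = page_address(c->page) + c->offset;
	c->offset += SMALL_FRAG_SZ;

	/* page fully sliced: forget it, the skbs now own all the
	 * references and the last put_page() will free it
	 */
	if (c->offset >= PAGE_SIZE)
		c->page = NULL;

	return va;
}

That is one allocator-side atomic op per page plus one put per slice,
i.e. 1.25 atomic ops per frag with 4 slices, and no pagecnt_bias to
maintain.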

> > BTW it's quite strange/confusing having two very similar APIs (page_frag
> and page_frag_cache) with very similar names and no references between
> them.

I'm not sure what you are getting at here. There are plenty of
references between them, they just aren't direct. I viewed it as being
more like the get_free_pages() type logic where you end up allocating a
page but what you get is an address to the page fragment instead of the
page_frag struct.
Eric Dumazet Sept. 8, 2022, 4 p.m. UTC | #31
On Thu, Sep 8, 2022 at 7:26 AM Paolo Abeni <pabeni@redhat.com> wrote:
>
> On Thu, 2022-09-08 at 05:20 -0700, Eric Dumazet wrote:
> > On Thu, Sep 8, 2022 at 3:48 AM Paolo Abeni <pabeni@redhat.com> wrote:
> > > On Wed, 2022-09-07 at 13:40 -0700, Eric Dumazet wrote:
> > > > On 9/7/22 13:19, Paolo Abeni wrote:
> > > > > reviving an old thread...
> > > > > On Wed, 2021-01-13 at 08:18 -0800, Eric Dumazet wrote:
> > > > > > While using page fragments instead of a kmalloc backed skb->head might give
> > > > > > a small performance improvement in some cases, there is a huge risk of
> > > > > > under estimating memory usage.
> > > > > [...]
> > > > >
> > > > > > Note that we might in the future use the sk_buff napi cache,
> > > > > > instead of going through a more expensive __alloc_skb()
> > > > > >
> > > > > > Another idea would be to use separate page sizes depending
> > > > > > on the allocated length (to never have more than 4 frags per page)
> > > > > I'm investigating a couple of performance regressions pointing to this
> > > > > change and I'd like to have a try to the 2nd suggestion above.
> > > > >
> > > > > If I read correctly, it means:
> > > > > - extend the page_frag_cache alloc API to allow forcing max order==0
> > > > > - add a 2nd page_frag_cache into napi_alloc_cache (say page_order0 or
> > > > > page_small)
> > > > > - in __napi_alloc_skb(), when len <= SKB_WITH_OVERHEAD(1024), use the
> > > > > page_small cache with order 0 allocation.
> > > > > (all the above constrained to host with 4K pages)
> > > > >
> > > > > I'm not quite sure about the "never have more than 4 frags per page"
> > > > > part.
> > > > >
> > > > > What is outlined above will allow for 10 min size frags in page_order0, as
> > > > > SKB_DATA_ALIGN(0) + SKB_DATA_ALIGN(struct skb_shared_info) == 384. I'm
> > > > > not sure that anything will allocate such small frags.
> > > > > With a more reasonable GRO_MAX_HEAD, there will be 6 frags per page.
> > > >
> > > > Well, some arches have PAGE_SIZE=65536 :/
> > >
> > > Yes, the idea is to implement all the above only for arches with
> > > PAGE_SIZE==4K. Would that be reasonable?
> >
> > Well, we also have changed MAX_SKB_FRAGS from 17 to 45 for BIG TCP.
> >
> > And locally we have
> >
> > #define GRO_MAX_HEAD 192
>
> default allocation size for napi_get_frags() is ~960b in google kernel,
> right? It looks like it should fit the above quite nicely with 4 frags
> per page?!?
>

Yes, using order-0 pages on x86 would avoid problems.
But if this adds yet another test in the fast path, increasing icache
pressure, I am unsure.
So I will comment when I see actual code/implementation.

("Extending" page_frag_cache alloc API seems overkill to me. Just use
separate code maybe ?)

> Vanilla kernel may hit a larger number of fragments per page, even if
> very likely not as high as the theoretical maximum mentioned in my
> previous email (as noted by Alex).
>
> If in that case excessive truesize underestimation would still be
> problematic (with a order0 4k page) __napi_alloc_skb() could be patched
> to increase smaller sizes to some reasonable minimum.
>
> Likely there is some point in your reply I did not get. Luckily LPC is
> coming :)
>
> > Reference:
> >
> > commit fd9ea57f4e9514f9d0f0dec505eefd99a8faa148
> > Author: Eric Dumazet <edumazet@google.com>
> > Date:   Wed Jun 8 09:04:38 2022 -0700
> >
> >     net: add napi_get_frags_check() helper
>
> I guess such check should be revisited with all the above.
>
> Thanks,
>
> Paolo
>
Paolo Abeni Sept. 8, 2022, 6:01 p.m. UTC | #32
On Thu, 2022-09-08 at 07:53 -0700, Alexander H Duyck wrote:
> On Thu, 2022-09-08 at 13:00 +0200, Paolo Abeni wrote:
> > In most build GRO_MAX_HEAD packets are even larger (should be 640)
> 
> Right, which is why I am thinking we may want to default to a 1K slice.

Ok it looks like there is agreement to force a minimum frag size of 1K.
Side note: that should not cause a memory usage increase compared to
the slab allocator as kmalloc(640) should use the kmalloc-1k slab.

[...]

> > > 
> > If the pagecnt optimization should be dropped, it would be probably
> > more straight-forward to use/adapt 'page_frag' for the page_order0
> > allocator.
> 
> That would make sense. Basically we could get rid of the pagecnt bias
> and add the fixed number of slices to the count at allocation so we
> would just need to track the offset to decide when we need to allocate
> a new page. In addition, if we are flushing the page when it is depleted
> we don't have to mess with the pfmemalloc logic.

Uhmm... it looks like the existing page_frag allocator does not
always flush the depleted page:

bool skb_page_frag_refill(unsigned int sz, struct page_frag *pfrag, gfp_t gfp)
{
        if (pfrag->page) {
                if (page_ref_count(pfrag->page) == 1) {
                        pfrag->offset = 0;
                        return true;
                }

so I'll try adding some separate/specialized code and see if the
overall complexity would be reasonable.

> > > BTW it's quite strange/confusing having two very similar APIs (page_frag
> > and page_frag_cache) with very similar names and no references between
> > them.
> 
> I'm not sure what you are getting at here. There are plenty of
> references between them, they just aren't direct.

Looking at/grepping the tree I could not trivially understand when 'struct
page_frag' should be preferred over 'struct page_frag_cache' and/or
vice versa; I had to look at the respective implementation details.

Thanks,

Paolo
Alexander Duyck Sept. 8, 2022, 7:26 p.m. UTC | #33
On Thu, Sep 8, 2022 at 11:01 AM Paolo Abeni <pabeni@redhat.com> wrote:
>
> On Thu, 2022-09-08 at 07:53 -0700, Alexander H Duyck wrote:
> > On Thu, 2022-09-08 at 13:00 +0200, Paolo Abeni wrote:
> > > In most build GRO_MAX_HEAD packets are even larger (should be 640)
> >
> > Right, which is why I am thinking we may want to default to a 1K slice.
>
> Ok it looks like there is agreement to force a minimum frag size of 1K.
> Side note: that should not cause a memory usage increase compared to
> the slab allocator as kmalloc(640) should use the kmalloc-1k slab.
>
> [...]
>
> > > >
> > > If the pagecnt optimization should be dropped, it would be probably
> > > more straight-forward to use/adapt 'page_frag' for the page_order0
> > > allocator.
> >
> > That would make sense. Basically we could get rid of the pagecnt bias
> > and add the fixed number of slices to the count at allocation so we
> > would just need to track the offset to decide when we need to allocate
> > a new page. In addition, if we are flushing the page when it is depleted
> > we don't have to mess with the pfmemalloc logic.
>
> Uhmm... it looks like that the existing page_frag allocator does not
> always flush the depleted page:
>
> bool skb_page_frag_refill(unsigned int sz, struct page_frag *pfrag, gfp_t gfp)
> {
>         if (pfrag->page) {
>                 if (page_ref_count(pfrag->page) == 1) {
>                         pfrag->offset = 0;
>                         return true;
>                 }

Right, we have an option to reuse the page if the page count has dropped
back to 1. However, in the 4K page with 1K slices scenario it means
you are having to bump the count back up every 3 frags. So you
would be looking at ~1.3 atomic accesses per frag. Just doing the bump
once at the start and using all 4 slices would give you 1.25 atomic
accesses per frag. That is why I assumed it would be better to just
let it go.
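
(Back-of-the-envelope, assuming the cache keeps one reference for itself
while recycling: each pass hands out 3 slices and costs one atomic re-add
plus 3 consumer puts, i.e. 4/3 ~= 1.33 per frag; handing out all 4 slices
and then forgetting the page costs one add plus 4 puts, 5/4 = 1.25.)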

> so I'll try adding some separate/specialized code and see if the
> overall complexity would be reasonable.

The other thing to keep in mind is that once you start adding the
recycling you will have best-case and worst-case scenarios to
consider. The code above, it seems, is for recycling the frag in place
or allocating a new one in its place.

> > > BTW it's quite strange/confusing having two very similar APIs (page_frag
> > > and page_frag_cache) with very similar names and no references between
> > > them.
> >
> > I'm not sure what you are getting at here. There are plenty of
> > references between them, they just aren't direct.
>
> Looking/greping the tree I could not trivially understand when 'struct
> page_frag' should be preferred over 'struct page_frag_cache' and/or
> vice versa, I had to look at the respective implementation details.

The page_frag_cache is mostly there to store a higher-order page to
slice up into page fragments that can then be described by a
page_frag struct. Honestly I am surprised we still have page_frag
floating around. I thought we had replaced that with bio_vec some time
ago. At least that is the structure that skb_frag_t is typedef-ed as.
diff mbox series

Patch

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 7626a33cce590e530f36167bd096026916131897..3a8f55a43e6964344df464a27b9b1faa0eb804f3 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -501,13 +501,17 @@  EXPORT_SYMBOL(__netdev_alloc_skb);
 struct sk_buff *__napi_alloc_skb(struct napi_struct *napi, unsigned int len,
 				 gfp_t gfp_mask)
 {
-	struct napi_alloc_cache *nc = this_cpu_ptr(&napi_alloc_cache);
+	struct napi_alloc_cache *nc;
 	struct sk_buff *skb;
 	void *data;
 
 	len += NET_SKB_PAD + NET_IP_ALIGN;
 
-	if ((len > SKB_WITH_OVERHEAD(PAGE_SIZE)) ||
+	/* If requested length is either too small or too big,
+	 * we use kmalloc() for skb->head allocation.
+	 */
+	if (len <= SKB_WITH_OVERHEAD(1024) ||
+	    len > SKB_WITH_OVERHEAD(PAGE_SIZE) ||
 	    (gfp_mask & (__GFP_DIRECT_RECLAIM | GFP_DMA))) {
 		skb = __alloc_skb(len, gfp_mask, SKB_ALLOC_RX, NUMA_NO_NODE);
 		if (!skb)
@@ -515,6 +519,7 @@  struct sk_buff *__napi_alloc_skb(struct napi_struct *napi, unsigned int len,
 		goto skb_success;
 	}
 
+	nc = this_cpu_ptr(&napi_alloc_cache);
 	len += SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
 	len = SKB_DATA_ALIGN(len);