
[v2,net-next] net: generalize skb freeing deferral to per-cpu lists

Message ID 20220422201237.416238-1-eric.dumazet@gmail.com (mailing list archive)
State Accepted
Commit 68822bdf76f10c3dc80609d4e2cdc1e847429086
Delegated to: Netdev Maintainers
Series [v2,net-next] net: generalize skb freeing deferral to per-cpu lists

Checks

Context Check Description
netdev/tree_selection success Clearly marked for net-next, async
netdev/fixes_present success Fixes tag not required for -next series
netdev/subject_prefix success Link
netdev/cover_letter success Single patches do not need cover letters
netdev/patch_count success Link
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 5837 this patch: 5837
netdev/cc_maintainers warning 8 maintainers not CCed: petrm@nvidia.com borisp@nvidia.com yoshfuji@linux-ipv6.org dsahern@kernel.org daniel@iogearbox.net imagedong@tencent.com keescook@chromium.org john.fastabend@gmail.com
netdev/build_clang success Errors and warnings before: 1151 this patch: 1151
netdev/module_param success Was 0 now: 0
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn success Errors and warnings before: 5976 this patch: 5976
netdev/checkpatch warning CHECK: Comparison to NULL could be written "skb"
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0

Commit Message

Eric Dumazet April 22, 2022, 8:12 p.m. UTC
From: Eric Dumazet <edumazet@google.com>

Logic added in commit f35f821935d8 ("tcp: defer skb freeing after socket
lock is released") helped bulk TCP flows move the cost of skb frees
outside of the critical section where the socket lock was held.

But for RPC traffic, or hosts with RFS enabled, the solution is far from
being ideal.

For RPC traffic, recvmsg() has to return to user space right after
the skb payload has been consumed, meaning that the BH handler has no
chance to pick up the skb before the recvmsg() thread does. This issue
is more visible with BIG TCP, as more RPCs fit in one skb.

For RFS, even if the BH handler picks up the skbs, it still does so
on the cpu on which the user thread is running.

Ideally, it is better to free the skbs (and associated page frags)
on the cpu that originally allocated them.

This patch removes the per-socket anchor (sk->defer_list) and
instead uses a per-cpu list, which will hold more skbs per round.

This new per-cpu list is drained at the end of net_rx_action(),
after incoming packets have been processed, to lower latencies.

In normal conditions, skbs are added to the per-cpu list with
no further action. In the (unlikely) case where the cpu does not
run the net_rx_action() handler fast enough, we use an IPI to raise
NET_RX_SOFTIRQ on the remote cpu.

Also, we do not bother draining the per-cpu list from dev_cpu_dead(),
because skbs in this list have no requirement on how fast they should
be freed.

Note that we can add in the future a small per-cpu cache
if we see any contention on sd->defer_lock.
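
For reference, the heart of the mechanism is the following function,
condensed from the patch hunks at the bottom of this page; the allocating
cpu drains its list in skb_defer_free_flush(), called from net_rx_action():

void skb_attempt_defer_free(struct sk_buff *skb)
{
	int cpu = skb->alloc_cpu;	/* new skb field: cpu that built the skb */
	struct softnet_data *sd;
	unsigned long flags;
	bool kick;

	/* Free locally when deferral is impossible or pointless. */
	if (WARN_ON_ONCE(cpu >= nr_cpu_ids) ||
	    !cpu_online(cpu) ||
	    cpu == raw_smp_processor_id()) {
		__kfree_skb(skb);
		return;
	}

	sd = &per_cpu(softnet_data, cpu);
	spin_lock_irqsave(&sd->defer_lock, flags);
	skb->next = sd->defer_list;
	/* Paired with READ_ONCE() in skb_defer_free_flush() */
	WRITE_ONCE(sd->defer_list, skb);
	sd->defer_count++;

	/* Normally no signal is sent: the remote cpu flushes its list at the
	 * end of its next net_rx_action(). Only kick it with an IPI if the
	 * list reaches 128 skbs.
	 */
	kick = sd->defer_count == 128;
	spin_unlock_irqrestore(&sd->defer_lock, flags);

	if (unlikely(kick))
		smp_call_function_single_async(cpu, &sd->defer_csd);
}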

Tested on a pair of hosts with 100Gbit NICs, RFS enabled,
and /proc/sys/net/ipv4/tcp_rmem[2] tuned to 16MB to work around
the page recycling strategy used by the NIC driver (its page pool
capacity being too small compared to the number of skbs/pages held
in socket receive queues).

Note that this tuning was only done to demonstrate worse
conditions for skb freeing for this particular test.
These conditions can happen in more general production workloads.

10 runs of one TCP_STREAM flow

Before:
Average throughput: 49685 Mbit.

Kernel profiles on the cpu running the recvmsg() user thread show a high
cost in skb-freeing related functions (*)

    57.81%  [kernel]       [k] copy_user_enhanced_fast_string
(*) 12.87%  [kernel]       [k] skb_release_data
(*)  4.25%  [kernel]       [k] __free_one_page
(*)  3.57%  [kernel]       [k] __list_del_entry_valid
     1.85%  [kernel]       [k] __netif_receive_skb_core
     1.60%  [kernel]       [k] __skb_datagram_iter
(*)  1.59%  [kernel]       [k] free_unref_page_commit
(*)  1.16%  [kernel]       [k] __slab_free
     1.16%  [kernel]       [k] _copy_to_iter
(*)  1.01%  [kernel]       [k] kfree
(*)  0.88%  [kernel]       [k] free_unref_page
     0.57%  [kernel]       [k] ip6_rcv_core
     0.55%  [kernel]       [k] ip6t_do_table
     0.54%  [kernel]       [k] flush_smp_call_function_queue
(*)  0.54%  [kernel]       [k] free_pcppages_bulk
     0.51%  [kernel]       [k] llist_reverse_order
     0.38%  [kernel]       [k] process_backlog
(*)  0.38%  [kernel]       [k] free_pcp_prepare
     0.37%  [kernel]       [k] tcp_recvmsg_locked
(*)  0.37%  [kernel]       [k] __list_add_valid
     0.34%  [kernel]       [k] sock_rfree
     0.34%  [kernel]       [k] _raw_spin_lock_irq
(*)  0.33%  [kernel]       [k] __page_cache_release
     0.33%  [kernel]       [k] tcp_v6_rcv
(*)  0.33%  [kernel]       [k] __put_page
(*)  0.29%  [kernel]       [k] __mod_zone_page_state
     0.27%  [kernel]       [k] _raw_spin_lock

After patch:
Average throughput: 73076 Mbit.

Kernel profiles on the cpu running the recvmsg() user thread look better:

    81.35%  [kernel]       [k] copy_user_enhanced_fast_string
     1.95%  [kernel]       [k] _copy_to_iter
     1.95%  [kernel]       [k] __skb_datagram_iter
     1.27%  [kernel]       [k] __netif_receive_skb_core
     1.03%  [kernel]       [k] ip6t_do_table
     0.60%  [kernel]       [k] sock_rfree
     0.50%  [kernel]       [k] tcp_v6_rcv
     0.47%  [kernel]       [k] ip6_rcv_core
     0.45%  [kernel]       [k] read_tsc
     0.44%  [kernel]       [k] _raw_spin_lock_irqsave
     0.37%  [kernel]       [k] _raw_spin_lock
     0.37%  [kernel]       [k] native_irq_return_iret
     0.33%  [kernel]       [k] __inet6_lookup_established
     0.31%  [kernel]       [k] ip6_protocol_deliver_rcu
     0.29%  [kernel]       [k] tcp_rcv_established
     0.29%  [kernel]       [k] llist_reverse_order

v2: kdoc issue (kernel bots)
    do not defer if (alloc_cpu == smp_processor_id()) (Paolo)
    replace the sk_buff_head with a single-linked list (Jakub)
    add a READ_ONCE()/WRITE_ONCE() for the lockless read of sd->defer_list

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 include/linux/netdevice.h |  5 ++++
 include/linux/skbuff.h    |  3 +++
 include/net/sock.h        |  2 --
 include/net/tcp.h         | 12 ---------
 net/core/dev.c            | 31 ++++++++++++++++++++++++
 net/core/skbuff.c         | 51 ++++++++++++++++++++++++++++++++++++++-
 net/core/sock.c           |  3 ---
 net/ipv4/tcp.c            | 25 +------------------
 net/ipv4/tcp_ipv4.c       |  1 -
 net/ipv6/tcp_ipv6.c       |  1 -
 net/tls/tls_sw.c          |  2 --
 11 files changed, 90 insertions(+), 46 deletions(-)

Comments

Paolo Abeni April 26, 2022, 7:38 a.m. UTC | #1
Hello,

I'm sorry for the late feedback. I have only a possibly relevant point
below.

On Fri, 2022-04-22 at 13:12 -0700, Eric Dumazet wrote:
[...]
> @@ -6571,6 +6577,28 @@ static int napi_threaded_poll(void *data)
>  	return 0;
>  }
>  
> +static void skb_defer_free_flush(struct softnet_data *sd)
> +{
> +	struct sk_buff *skb, *next;
> +	unsigned long flags;
> +
> +	/* Paired with WRITE_ONCE() in skb_attempt_defer_free() */
> +	if (!READ_ONCE(sd->defer_list))
> +		return;
> +
> +	spin_lock_irqsave(&sd->defer_lock, flags);
> +	skb = sd->defer_list;

I *think* that this read can possibly be fused with the previous one,
and another READ_ONCE() should avoid that.

BTW it looks like this version gives slightly better results than the
previous one, perhaps due to the single-linked list usage?

Thanks!

Paolo
Eric Dumazet April 26, 2022, 1:11 p.m. UTC | #2
On Tue, Apr 26, 2022 at 12:38 AM Paolo Abeni <pabeni@redhat.com> wrote:
>
>
> Hello,
>
> I'm sorry for the late feedback. I have only a possibly relevant point
> below.
>
> On Fri, 2022-04-22 at 13:12 -0700, Eric Dumazet wrote:
> [...]
> > @@ -6571,6 +6577,28 @@ static int napi_threaded_poll(void *data)
> >       return 0;
> >  }
> >
> > +static void skb_defer_free_flush(struct softnet_data *sd)
> > +{
> > +     struct sk_buff *skb, *next;
> > +     unsigned long flags;
> > +
> > +     /* Paired with WRITE_ONCE() in skb_attempt_defer_free() */
> > +     if (!READ_ONCE(sd->defer_list))
> > +             return;
> > +
> > +     spin_lock_irqsave(&sd->defer_lock, flags);
> > +     skb = sd->defer_list;
>
> I *think* that this read can possibly be fused with the previous one,
> and another READ_ONCE() should avoid that.

Only the lockless read needs READ_ONCE()

For the one after spin_lock_irqsave(&sd->defer_lock, flags),
there is no need for any additional barrier.

If the compiler really wants to use multiple one-byte-at-a-time loads,
we are not going to fight it; there might be good reasons for that.

(We do not want to spread READ_ONCE / WRITE_ONCE for all
loads/stores, as this has performance implications)
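
Concretely, the pattern in skb_defer_free_flush() is (a trimmed sketch,
with the reasoning spelled out as comments):

static void skb_defer_free_flush(struct softnet_data *sd)
{
	struct sk_buff *skb, *next;
	unsigned long flags;

	/* Lockless peek: skb_attempt_defer_free() on another cpu may be
	 * writing sd->defer_list concurrently, hence READ_ONCE(), paired
	 * with its WRITE_ONCE().
	 */
	if (!READ_ONCE(sd->defer_list))
		return;

	spin_lock_irqsave(&sd->defer_lock, flags);
	/* All writers hold defer_lock, so a plain load is fine here, and
	 * taking the lock already acts as a compiler barrier: this load
	 * cannot be fused with the lockless one above.
	 */
	skb = sd->defer_list;
	sd->defer_list = NULL;
	sd->defer_count = 0;
	spin_unlock_irqrestore(&sd->defer_lock, flags);

	while (skb != NULL) {
		next = skb->next;
		__kfree_skb(skb);
		skb = next;
	}
}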

>
> BTW it looks like this version gives slightly better results than the
> previous one, perhaps due to the single-linked list usage?

Yes, this could be the case, or maybe it is because 10 runs are not enough
in a host with 32 RX queues, with a 50/50 split between the two NUMA nodes.

When reaching high throughput, every detail matters, like background usage on
the network, from monitoring and machine health daemons.

Thanks.
Paolo Abeni April 26, 2022, 3:28 p.m. UTC | #3
On Tue, 2022-04-26 at 06:11 -0700, Eric Dumazet wrote:
> On Tue, Apr 26, 2022 at 12:38 AM Paolo Abeni <pabeni@redhat.com> wrote:
> > On Fri, 2022-04-22 at 13:12 -0700, Eric Dumazet wrote:
> > [...]
> > > @@ -6571,6 +6577,28 @@ static int napi_threaded_poll(void *data)
> > >       return 0;
> > >  }
> > > 
> > > +static void skb_defer_free_flush(struct softnet_data *sd)
> > > +{
> > > +     struct sk_buff *skb, *next;
> > > +     unsigned long flags;
> > > +
> > > +     /* Paired with WRITE_ONCE() in skb_attempt_defer_free() */
> > > +     if (!READ_ONCE(sd->defer_list))
> > > +             return;
> > > +
> > > +     spin_lock_irqsave(&sd->defer_lock, flags);
> > > +     skb = sd->defer_list;
> > 
> > I *think* that this read can possibly be fused with the previous one,
> > and another READ_ONCE() should avoid that.
> 
> Only the lockless read needs READ_ONCE()
> 
> For the one after spin_lock_irqsave(&sd->defer_lock, flags),
> there is no need for any additional barrier.
> 
> If the compiler really wants to use multiple one-byte-at-a-time loads,
> we are not going to fight, there might be good reasons for that.

I'm unsure I explained my doubt in a clear way: what I fear is that the
compiler could emit a single read instruction, corresponding to the
READ_ONCE() outside the lock, so that the spin-locked section will
operate on "old" defer_list. 

If that happens we could end up with 'defer_count' underestimating the
list length. It looks like that is tolerable, as we will still be
protected against defer_list growing too much.

Acked-by: Paolo Abeni <pabeni@redhat.com>
Eric Dumazet April 26, 2022, 4:13 p.m. UTC | #4
On Tue, Apr 26, 2022 at 8:28 AM Paolo Abeni <pabeni@redhat.com> wrote:
>

> I'm unsure I explained my doubt in a clear way: what I fear is that the
> compiler could emit a single read instruction, corresponding to the
> READ_ONCE() outside the lock, so that the spin-locked section will
> operate on "old" defer_list.
>
> If that happens we could end up with 'defer_count' underestimating the
> list length. It looks like that is tolerable, as we will still be
> protected against defer_list growing too much.

defer_count is always read/written under the protection of the spinlock.
It must be very precise, unless I am mistaken.
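
Both touch points are inside the defer_lock section (from the patch):

	/* skb_attempt_defer_free() */
	spin_lock_irqsave(&sd->defer_lock, flags);
	skb->next = sd->defer_list;
	WRITE_ONCE(sd->defer_list, skb);
	sd->defer_count++;
	kick = sd->defer_count == 128;
	spin_unlock_irqrestore(&sd->defer_lock, flags);

	/* skb_defer_free_flush() */
	spin_lock_irqsave(&sd->defer_lock, flags);
	skb = sd->defer_list;
	sd->defer_list = NULL;
	sd->defer_count = 0;
	spin_unlock_irqrestore(&sd->defer_lock, flags);

A reader holding the lock therefore always sees an exact defer_count;
the lockless READ_ONCE() only decides whether taking the lock is worth it.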

>
> Acked-by: Paolo Abeni <pabeni@redhat.com>
>
>

Thanks !
patchwork-bot+netdevbpf@kernel.org April 27, 2022, 12:50 a.m. UTC | #5
Hello:

This patch was applied to netdev/net-next.git (master)
by Jakub Kicinski <kuba@kernel.org>:

On Fri, 22 Apr 2022 13:12:37 -0700 you wrote:
> From: Eric Dumazet <edumazet@google.com>
> 
> Logic added in commit f35f821935d8 ("tcp: defer skb freeing after socket
> lock is released") helped bulk TCP flows to move the cost of skbs
> frees outside of critical section where socket lock was held.
> 
> But for RPC traffic, or hosts with RFS enabled, the solution is far from
> being ideal.
> 
> [...]

Here is the summary with links:
  - [v2,net-next] net: generalize skb freeing deferral to per-cpu lists
    https://git.kernel.org/netdev/net-next/c/68822bdf76f1

You are awesome, thank you!
Ido Schimmel April 27, 2022, 3:34 p.m. UTC | #6
On Fri, Apr 22, 2022 at 01:12:37PM -0700, Eric Dumazet wrote:
> From: Eric Dumazet <edumazet@google.com>
> 
> Logic added in commit f35f821935d8 ("tcp: defer skb freeing after socket
> lock is released") helped bulk TCP flows to move the cost of skbs
> frees outside of critical section where socket lock was held.
> 
> But for RPC traffic, or hosts with RFS enabled, the solution is far from
> being ideal.
> 
> For RPC traffic, recvmsg() has to return to user space right after
> skb payload has been consumed, meaning that BH handler has no chance
> to pick the skb before recvmsg() thread. This issue is more visible
> with BIG TCP, as more RPC fit one skb.
> 
> For RFS, even if BH handler picks the skbs, they are still picked
> from the cpu on which user thread is running.
> 
> Ideally, it is better to free the skbs (and associated page frags)
> on the cpu that originally allocated them.
> 
> This patch removes the per socket anchor (sk->defer_list) and
> instead uses a per-cpu list, which will hold more skbs per round.
> 
> This new per-cpu list is drained at the end of net_rx_action(),
> after incoming packets have been processed, to lower latencies.
> 
> In normal conditions, skbs are added to the per-cpu list with
> no further action. In the (unlikely) cases where the cpu does not
> run net_rx_action() handler fast enough, we use an IPI to raise
> NET_RX_SOFTIRQ on the remote cpu.
> 
> Also, we do not bother draining the per-cpu list from dev_cpu_dead()
> This is because skbs in this list have no requirement on how fast
> they should be freed.
> 
> Note that we can add in the future a small per-cpu cache
> if we see any contention on sd->defer_lock.
> 
> Tested on a pair of hosts with 100Gbit NIC, RFS enabled,
> and /proc/sys/net/ipv4/tcp_rmem[2] tuned to 16MB to work around
> page recycling strategy used by NIC driver (its page pool capacity
> being too small compared to number of skbs/pages held in sockets
> receive queues)
> 
> Note that this tuning was only done to demonstrate worse
> conditions for skb freeing for this particular test.
> These conditions can happen in more general production workload.

[...]

> Signed-off-by: Eric Dumazet <edumazet@google.com>

Eric, with this patch I'm seeing memory leaks such as these [1][2] after
boot. The system is using the igb driver for its management interface
[3]. The leaks disappear after reverting the patch.

Any ideas?

Let me know if you need more info. I can easily test a patch.

Thanks

[1]
# cat /sys/kernel/debug/kmemleak
unreferenced object 0xffff888170143740 (size 216):
  comm "softirq", pid 0, jiffies 4294825261 (age 95.244s)
  hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
    00 00 17 0f 81 88 ff ff 00 00 00 00 00 00 00 00  ................
  backtrace:
    [<ffffffff82571fc0>] napi_skb_cache_get+0xf0/0x180
    [<ffffffff8257206a>] __napi_build_skb+0x1a/0x60
    [<ffffffff8257b0f3>] napi_build_skb+0x23/0x350
    [<ffffffffa0469592>] igb_poll+0x2b72/0x5880 [igb]
    [<ffffffff825f9584>] __napi_poll.constprop.0+0xb4/0x480
    [<ffffffff825f9d5a>] net_rx_action+0x40a/0xc60
    [<ffffffff82e00295>] __do_softirq+0x295/0x9fe
    [<ffffffff81185bcc>] __irq_exit_rcu+0x11c/0x180
    [<ffffffff8118622a>] irq_exit_rcu+0xa/0x20
    [<ffffffff82bbed39>] common_interrupt+0xa9/0xc0
    [<ffffffff82c00b5e>] asm_common_interrupt+0x1e/0x40
    [<ffffffff824a186e>] cpuidle_enter_state+0x27e/0xcb0
    [<ffffffff824a236f>] cpuidle_enter+0x4f/0xa0
    [<ffffffff8126a290>] do_idle+0x3b0/0x4b0
    [<ffffffff8126a869>] cpu_startup_entry+0x19/0x20
    [<ffffffff810f4725>] start_secondary+0x265/0x340

[2]
# cat /sys/kernel/debug/kmemleak
unreferenced object 0xffff88810ce3aac0 (size 216):
  comm "softirq", pid 0, jiffies 4294861408 (age 64.607s)
  hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
    00 c0 7b 07 81 88 ff ff 00 00 00 00 00 00 00 00  ..{.............
  backtrace:
    [<ffffffff82575539>] __alloc_skb+0x229/0x360
    [<ffffffff8290bd3c>] __tcp_send_ack.part.0+0x6c/0x760
    [<ffffffff8291a062>] tcp_send_ack+0x82/0xa0
    [<ffffffff828cb6db>] __tcp_ack_snd_check+0x15b/0xa00
    [<ffffffff828f17fe>] tcp_rcv_established+0x198e/0x2120
    [<ffffffff829363b5>] tcp_v4_do_rcv+0x665/0x9a0
    [<ffffffff8293d8ae>] tcp_v4_rcv+0x2c1e/0x32f0
    [<ffffffff828610b3>] ip_protocol_deliver_rcu+0x53/0x2c0
    [<ffffffff828616eb>] ip_local_deliver+0x3cb/0x620
    [<ffffffff8285e66f>] ip_sublist_rcv_finish+0x9f/0x2c0
    [<ffffffff82860895>] ip_list_rcv_finish.constprop.0+0x525/0x6f0
    [<ffffffff82861f88>] ip_list_rcv+0x318/0x460
    [<ffffffff825f5e61>] __netif_receive_skb_list_core+0x541/0x8f0
    [<ffffffff825f8043>] netif_receive_skb_list_internal+0x763/0xdc0
    [<ffffffff826c3025>] napi_gro_complete.constprop.0+0x5a5/0x700
    [<ffffffff826c44ed>] dev_gro_receive+0xf2d/0x23f0
unreferenced object 0xffff888175e1afc0 (size 216):
  comm "sshd", pid 1024, jiffies 4294861424 (age 64.591s)
  hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
    00 c0 7b 07 81 88 ff ff 00 00 00 00 00 00 00 00  ..{.............
  backtrace:
    [<ffffffff82575539>] __alloc_skb+0x229/0x360
    [<ffffffff8258201c>] alloc_skb_with_frags+0x9c/0x720
    [<ffffffff8255f333>] sock_alloc_send_pskb+0x7b3/0x940
    [<ffffffff82876af4>] __ip_append_data+0x1874/0x36d0
    [<ffffffff8287f283>] ip_make_skb+0x263/0x2e0
    [<ffffffff82978afa>] udp_sendmsg+0x1c8a/0x29d0
    [<ffffffff829af94e>] inet_sendmsg+0x9e/0xe0
    [<ffffffff8255082d>] __sys_sendto+0x23d/0x360
    [<ffffffff82550a31>] __x64_sys_sendto+0xe1/0x1b0
    [<ffffffff82bbde05>] do_syscall_64+0x35/0x80
    [<ffffffff82c00068>] entry_SYSCALL_64_after_hwframe+0x44/0xae

[3]
# ethtool -i enp8s0
driver: igb
version: 5.18.0-rc3-custom-91743-g481c1b
firmware-version: 3.25, 0x80000708, 1.1824.0
expansion-rom-version: 
bus-info: 0000:08:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: yes
Eric Dumazet April 27, 2022, 4:53 p.m. UTC | #7
On Wed, Apr 27, 2022 at 8:34 AM Ido Schimmel <idosch@idosch.org> wrote:
>

>
> Eric, with this patch I'm seeing memory leaks such as these [1][2] after
> boot. The system is using the igb driver for its management interface
> [3]. The leaks disappear after reverting the patch.
>
> Any ideas?
>

No idea; skbs allocated to send an ACK cannot be stored in the receive
queue, so I guess this is a kmemleak false positive.

Stress your host for hours, and check if there are real kmemleaks, as
in memory being depleted.

> Let me know if you need more info. I can easily test a patch.
>
> Thanks
>
> [1]
> # cat /sys/kernel/debug/kmemleak
> unreferenced object 0xffff888170143740 (size 216):
>   comm "softirq", pid 0, jiffies 4294825261 (age 95.244s)
>   hex dump (first 32 bytes):
>     00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
>     00 00 17 0f 81 88 ff ff 00 00 00 00 00 00 00 00  ................
>   backtrace:
>     [<ffffffff82571fc0>] napi_skb_cache_get+0xf0/0x180
>     [<ffffffff8257206a>] __napi_build_skb+0x1a/0x60
>     [<ffffffff8257b0f3>] napi_build_skb+0x23/0x350
>     [<ffffffffa0469592>] igb_poll+0x2b72/0x5880 [igb]
>     [<ffffffff825f9584>] __napi_poll.constprop.0+0xb4/0x480
>     [<ffffffff825f9d5a>] net_rx_action+0x40a/0xc60
>     [<ffffffff82e00295>] __do_softirq+0x295/0x9fe
>     [<ffffffff81185bcc>] __irq_exit_rcu+0x11c/0x180
>     [<ffffffff8118622a>] irq_exit_rcu+0xa/0x20
>     [<ffffffff82bbed39>] common_interrupt+0xa9/0xc0
>     [<ffffffff82c00b5e>] asm_common_interrupt+0x1e/0x40
>     [<ffffffff824a186e>] cpuidle_enter_state+0x27e/0xcb0
>     [<ffffffff824a236f>] cpuidle_enter+0x4f/0xa0
>     [<ffffffff8126a290>] do_idle+0x3b0/0x4b0
>     [<ffffffff8126a869>] cpu_startup_entry+0x19/0x20
>     [<ffffffff810f4725>] start_secondary+0x265/0x340
>
> [2]
> # cat /sys/kernel/debug/kmemleak
> unreferenced object 0xffff88810ce3aac0 (size 216):
>   comm "softirq", pid 0, jiffies 4294861408 (age 64.607s)
>   hex dump (first 32 bytes):
>     00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
>     00 c0 7b 07 81 88 ff ff 00 00 00 00 00 00 00 00  ..{.............
>   backtrace:
>     [<ffffffff82575539>] __alloc_skb+0x229/0x360
>     [<ffffffff8290bd3c>] __tcp_send_ack.part.0+0x6c/0x760
>     [<ffffffff8291a062>] tcp_send_ack+0x82/0xa0
>     [<ffffffff828cb6db>] __tcp_ack_snd_check+0x15b/0xa00
>     [<ffffffff828f17fe>] tcp_rcv_established+0x198e/0x2120
>     [<ffffffff829363b5>] tcp_v4_do_rcv+0x665/0x9a0
>     [<ffffffff8293d8ae>] tcp_v4_rcv+0x2c1e/0x32f0
>     [<ffffffff828610b3>] ip_protocol_deliver_rcu+0x53/0x2c0
>     [<ffffffff828616eb>] ip_local_deliver+0x3cb/0x620
>     [<ffffffff8285e66f>] ip_sublist_rcv_finish+0x9f/0x2c0
>     [<ffffffff82860895>] ip_list_rcv_finish.constprop.0+0x525/0x6f0
>     [<ffffffff82861f88>] ip_list_rcv+0x318/0x460
>     [<ffffffff825f5e61>] __netif_receive_skb_list_core+0x541/0x8f0
>     [<ffffffff825f8043>] netif_receive_skb_list_internal+0x763/0xdc0
>     [<ffffffff826c3025>] napi_gro_complete.constprop.0+0x5a5/0x700
>     [<ffffffff826c44ed>] dev_gro_receive+0xf2d/0x23f0
> unreferenced object 0xffff888175e1afc0 (size 216):
>   comm "sshd", pid 1024, jiffies 4294861424 (age 64.591s)
>   hex dump (first 32 bytes):
>     00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
>     00 c0 7b 07 81 88 ff ff 00 00 00 00 00 00 00 00  ..{.............
>   backtrace:
>     [<ffffffff82575539>] __alloc_skb+0x229/0x360
>     [<ffffffff8258201c>] alloc_skb_with_frags+0x9c/0x720
>     [<ffffffff8255f333>] sock_alloc_send_pskb+0x7b3/0x940
>     [<ffffffff82876af4>] __ip_append_data+0x1874/0x36d0
>     [<ffffffff8287f283>] ip_make_skb+0x263/0x2e0
>     [<ffffffff82978afa>] udp_sendmsg+0x1c8a/0x29d0
>     [<ffffffff829af94e>] inet_sendmsg+0x9e/0xe0
>     [<ffffffff8255082d>] __sys_sendto+0x23d/0x360
>     [<ffffffff82550a31>] __x64_sys_sendto+0xe1/0x1b0
>     [<ffffffff82bbde05>] do_syscall_64+0x35/0x80
>     [<ffffffff82c00068>] entry_SYSCALL_64_after_hwframe+0x44/0xae
>
> [3]
> # ethtool -i enp8s0
> driver: igb
> version: 5.18.0-rc3-custom-91743-g481c1b
> firmware-version: 3.25, 0x80000708, 1.1824.0
> expansion-rom-version:
> bus-info: 0000:08:00.0
> supports-statistics: yes
> supports-test: yes
> supports-eeprom-access: yes
> supports-register-dump: yes
> supports-priv-flags: yes
Eric Dumazet April 27, 2022, 5:11 p.m. UTC | #8
On Wed, Apr 27, 2022 at 9:53 AM Eric Dumazet <edumazet@google.com> wrote:
>
> On Wed, Apr 27, 2022 at 8:34 AM Ido Schimmel <idosch@idosch.org> wrote:
> >
>
> >
> > Eric, with this patch I'm seeing memory leaks such as these [1][2] after
> > boot. The system is using the igb driver for its management interface
> > [3]. The leaks disappear after reverting the patch.
> >
> > Any ideas?
> >
>
> No idea, skbs allocated to send an ACK can not be stored in receive
> queue, I guess this is a kmemleak false positive.
>
> Stress your host for hours, and check if there are real kmemleaks, as
> in memory being depleted.

At first when I saw your report I wondered if the following was needed,
but I do not think so. Nothing in __kfree_skb(skb) cares about skb->next.

But you might give it a try, maybe I am wrong.

diff --git a/net/core/dev.c b/net/core/dev.c
index 611bd719706412723561c27753150b27e1dc4e7a..9dc3443649b962586cc059899ac3d71e9c7a3559
100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -6594,6 +6594,7 @@ static void skb_defer_free_flush(struct softnet_data *sd)

        while (skb != NULL) {
                next = skb->next;
+               skb_mark_not_on_list(skb);
                __kfree_skb(skb);
                skb = next;
        }
Eric Dumazet April 27, 2022, 5:59 p.m. UTC | #9
On Wed, Apr 27, 2022 at 10:11 AM Eric Dumazet <edumazet@google.com> wrote:
>
> On Wed, Apr 27, 2022 at 9:53 AM Eric Dumazet <edumazet@google.com> wrote:
> >
> > On Wed, Apr 27, 2022 at 8:34 AM Ido Schimmel <idosch@idosch.org> wrote:
> > >
> >
> > >
> > > Eric, with this patch I'm seeing memory leaks such as these [1][2] after
> > > boot. The system is using the igb driver for its management interface
> > > [3]. The leaks disappear after reverting the patch.
> > >
> > > Any ideas?
> > >
> >
> > No idea, skbs allocated to send an ACK can not be stored in receive
> > queue, I guess this is a kmemleak false positive.
> >
> > Stress your host for hours, and check if there are real kmemleaks, as
> > in memory being depleted.
>
> At first when I saw your report I wondered if the following was needed,
> but I do not think so. Nothing in __kfree_skb(skb) cares about skb->next.
>
> But you might give it a try, maybe I am wrong.
>
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 611bd719706412723561c27753150b27e1dc4e7a..9dc3443649b962586cc059899ac3d71e9c7a3559
> 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -6594,6 +6594,7 @@ static void skb_defer_free_flush(struct softnet_data *sd)
>
>         while (skb != NULL) {
>                 next = skb->next;
> +               skb_mark_not_on_list(skb);
>                 __kfree_skb(skb);
>                 skb = next;
>         }

Oh well, there is definitely a leak, sorry for that.
Eric Dumazet April 27, 2022, 6:54 p.m. UTC | #10
On Wed, Apr 27, 2022 at 10:59 AM Eric Dumazet <edumazet@google.com> wrote:
>
> On Wed, Apr 27, 2022 at 10:11 AM Eric Dumazet <edumazet@google.com> wrote:
> >
> > On Wed, Apr 27, 2022 at 9:53 AM Eric Dumazet <edumazet@google.com> wrote:
> > >
> > > On Wed, Apr 27, 2022 at 8:34 AM Ido Schimmel <idosch@idosch.org> wrote:
> > > >
> > >
> > > >
> > > > Eric, with this patch I'm seeing memory leaks such as these [1][2] after
> > > > boot. The system is using the igb driver for its management interface
> > > > [3]. The leaks disappear after reverting the patch.
> > > >
> > > > Any ideas?
> > > >
> > >
> > > No idea, skbs allocated to send an ACK can not be stored in receive
> > > queue, I guess this is a kmemleak false positive.
> > >
> > > Stress your host for hours, and check if there are real kmemleaks, as
> > > in memory being depleted.
> >
> > At first when I saw your report I wondered if the following was needed,
> > but I do not think so. Nothing in __kfree_skb(skb) cares about skb->next.
> >
> > But you might give it a try, maybe I am wrong.
> >
> > diff --git a/net/core/dev.c b/net/core/dev.c
> > index 611bd719706412723561c27753150b27e1dc4e7a..9dc3443649b962586cc059899ac3d71e9c7a3559
> > 100644
> > --- a/net/core/dev.c
> > +++ b/net/core/dev.c
> > @@ -6594,6 +6594,7 @@ static void skb_defer_free_flush(struct softnet_data *sd)
> >
> >         while (skb != NULL) {
> >                 next = skb->next;
> > +               skb_mark_not_on_list(skb);
> >                 __kfree_skb(skb);
> >                 skb = next;
> >         }
>
> Oh well, there is definitely a leak, sorry for that.

Ido, can you test if the following patch solves your issue?
It definitely looks needed.

Thanks !

diff --git a/net/core/dev.c b/net/core/dev.c
index 611bd719706412723561c27753150b27e1dc4e7a..e09cd202fc579dfe2313243e20def8044aafafa2
100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -6617,7 +6617,7 @@ static __latent_entropy void net_rx_action(struct softirq_action *h)

                if (list_empty(&list)) {
                        if (!sd_has_rps_ipi_waiting(sd) && list_empty(&repoll))
-                               return;
+                               goto end;
                        break;
                }

@@ -6644,6 +6644,7 @@ static __latent_entropy void net_rx_action(struct softirq_action *h)
                __raise_softirq_irqoff(NET_RX_SOFTIRQ);

        net_rps_action_and_irq_enable(sd);
+end:
        skb_defer_free_flush(sd);
 }
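
For context, the problem appears to be the early return in net_rx_action()
when its poll list is empty (for instance when the softirq was raised only
by the deferral IPI): that path never reached skb_defer_free_flush(), so
skbs queued on sd->defer_list could sit there indefinitely on an otherwise
idle cpu. Roughly, the pre-fix flow (trimmed):

static __latent_entropy void net_rx_action(struct softirq_action *h)
{
	struct softnet_data *sd = this_cpu_ptr(&softnet_data);
	...
	for (;;) {
		...
		if (list_empty(&list)) {
			if (!sd_has_rps_ipi_waiting(sd) && list_empty(&repoll))
				return;	/* sd->defer_list never flushed */
			break;
		}
		...
	}
	...
	net_rps_action_and_irq_enable(sd);
	skb_defer_free_flush(sd);	/* only reached when there was napi work */
}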
Qian Cai April 29, 2022, 4:18 p.m. UTC | #11
On Fri, Apr 22, 2022 at 01:12:37PM -0700, Eric Dumazet wrote:
> From: Eric Dumazet <edumazet@google.com>
> 
> Logic added in commit f35f821935d8 ("tcp: defer skb freeing after socket
> lock is released") helped bulk TCP flows to move the cost of skbs
> frees outside of critical section where socket lock was held.
> 
> But for RPC traffic, or hosts with RFS enabled, the solution is far from
> being ideal.
> 
> For RPC traffic, recvmsg() has to return to user space right after
> skb payload has been consumed, meaning that BH handler has no chance
> to pick the skb before recvmsg() thread. This issue is more visible
> with BIG TCP, as more RPC fit one skb.
> 
> For RFS, even if BH handler picks the skbs, they are still picked
> from the cpu on which user thread is running.
> 
> Ideally, it is better to free the skbs (and associated page frags)
> on the cpu that originally allocated them.
> 
> This patch removes the per socket anchor (sk->defer_list) and
> instead uses a per-cpu list, which will hold more skbs per round.
> 
> This new per-cpu list is drained at the end of net_rx_action(),
> after incoming packets have been processed, to lower latencies.
> 
> In normal conditions, skbs are added to the per-cpu list with
> no further action. In the (unlikely) cases where the cpu does not
> run net_rx_action() handler fast enough, we use an IPI to raise
> NET_RX_SOFTIRQ on the remote cpu.
> 
> Also, we do not bother draining the per-cpu list from dev_cpu_dead()
> This is because skbs in this list have no requirement on how fast
> they should be freed.
> 
> Note that we can add in the future a small per-cpu cache
> if we see any contention on sd->defer_lock.

Hmm, we started to see some memory leak reports from kmemleak that have
been around for hours without being freed since yesterday's linux-next
tree which included this commit. Any thoughts?

unreferenced object 0xffff400610f55cc0 (size 216):
  comm "git-remote-http", pid 781180, jiffies 4314091475 (age 4323.740s)
  hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
    00 c0 7e 87 ff 3f ff ff 00 00 00 00 00 00 00 00  ..~..?..........
  backtrace:
     kmem_cache_alloc_node
     __alloc_skb
     __tcp_send_ack.part.0
     tcp_send_ack
     tcp_cleanup_rbuf
     tcp_recvmsg_locked
     tcp_recvmsg
     inet_recvmsg
     __sys_recvfrom
     __arm64_sys_recvfrom
     invoke_syscall
     el0_svc_common.constprop.0
     do_el0_svc
     el0_svc
     el0t_64_sync_handler
     el0t_64_sync
unreferenced object 0xffff4001e58f0c40 (size 216):
  comm "git-remote-http", pid 781180, jiffies 4314091483 (age 4323.968s)
  hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
    00 c0 7e 87 ff 3f ff ff 00 00 00 00 00 00 00 00  ..~..?..........
  backtrace:
     kmem_cache_alloc_node
     __alloc_skb
     __tcp_send_ack.part.0
     tcp_send_ack
     tcp_cleanup_rbuf
     tcp_recvmsg_locked
     tcp_recvmsg
     inet_recvmsg
     __sys_recvfrom
     __arm64_sys_recvfrom
     invoke_syscall
     el0_svc_common.constprop.0
     do_el0_svc
     el0_svc
     el0t_64_sync_handler
     el0t_64_sync
Eric Dumazet April 29, 2022, 4:23 p.m. UTC | #12
On Fri, Apr 29, 2022 at 9:18 AM Qian Cai <quic_qiancai@quicinc.com> wrote:
>
> On Fri, Apr 22, 2022 at 01:12:37PM -0700, Eric Dumazet wrote:
> > From: Eric Dumazet <edumazet@google.com>
> >
> > Logic added in commit f35f821935d8 ("tcp: defer skb freeing after socket
> > lock is released") helped bulk TCP flows to move the cost of skbs
> > frees outside of critical section where socket lock was held.
> >
> > But for RPC traffic, or hosts with RFS enabled, the solution is far from
> > being ideal.
> >
> > For RPC traffic, recvmsg() has to return to user space right after
> > skb payload has been consumed, meaning that BH handler has no chance
> > to pick the skb before recvmsg() thread. This issue is more visible
> > with BIG TCP, as more RPC fit one skb.
> >
> > For RFS, even if BH handler picks the skbs, they are still picked
> > from the cpu on which user thread is running.
> >
> > Ideally, it is better to free the skbs (and associated page frags)
> > on the cpu that originally allocated them.
> >
> > This patch removes the per socket anchor (sk->defer_list) and
> > instead uses a per-cpu list, which will hold more skbs per round.
> >
> > This new per-cpu list is drained at the end of net_rx_action(),
> > after incoming packets have been processed, to lower latencies.
> >
> > In normal conditions, skbs are added to the per-cpu list with
> > no further action. In the (unlikely) cases where the cpu does not
> > run net_rx_action() handler fast enough, we use an IPI to raise
> > NET_RX_SOFTIRQ on the remote cpu.
> >
> > Also, we do not bother draining the per-cpu list from dev_cpu_dead()
> > This is because skbs in this list have no requirement on how fast
> > they should be freed.
> >
> > Note that we can add in the future a small per-cpu cache
> > if we see any contention on sd->defer_lock.
>
> Hmm, we started to see some memory leak reports from kmemleak that have
> been around for hours without being freed since yesterday's linux-next
> tree which included this commit. Any thoughts?
>
> unreferenced object 0xffff400610f55cc0 (size 216):
>   comm "git-remote-http", pid 781180, jiffies 4314091475 (age 4323.740s)
>   hex dump (first 32 bytes):
>     00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
>     00 c0 7e 87 ff 3f ff ff 00 00 00 00 00 00 00 00  ..~..?..........
>   backtrace:
>      kmem_cache_alloc_node
>      __alloc_skb
>      __tcp_send_ack.part.0
>      tcp_send_ack
>      tcp_cleanup_rbuf
>      tcp_recvmsg_locked
>      tcp_recvmsg
>      inet_recvmsg
>      __sys_recvfrom
>      __arm64_sys_recvfrom
>      invoke_syscall
>      el0_svc_common.constprop.0
>      do_el0_svc
>      el0_svc
>      el0t_64_sync_handler
>      el0t_64_sync
> unreferenced object 0xffff4001e58f0c40 (size 216):
>   comm "git-remote-http", pid 781180, jiffies 4314091483 (age 4323.968s)
>   hex dump (first 32 bytes):
>     00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
>     00 c0 7e 87 ff 3f ff ff 00 00 00 00 00 00 00 00  ..~..?..........
>   backtrace:
>      kmem_cache_alloc_node
>      __alloc_skb
>      __tcp_send_ack.part.0
>      tcp_send_ack
>      tcp_cleanup_rbuf
>      tcp_recvmsg_locked
>      tcp_recvmsg
>      inet_recvmsg
>      __sys_recvfrom
>      __arm64_sys_recvfrom
>      invoke_syscall
>      el0_svc_common.constprop.0
>      do_el0_svc
>      el0_svc
>      el0t_64_sync_handler
>      el0t_64_sync

Which tree are you using?

Ido said the leak has been fixed in
https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git/commit/?id=f3412b3879b4f7c4313b186b03940d4791345534
Qian Cai April 29, 2022, 8:44 p.m. UTC | #13
On Fri, Apr 29, 2022 at 09:23:13AM -0700, Eric Dumazet wrote:
> Which tree are you using?

linux-next tree, tag next-20220428

> Ido said the leak has been fixed in
> https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git/commit/?id=f3412b3879b4f7c4313b186b03940d4791345534

Cool, next-20220429 is running fine so far which included the patch.
kernel test robot May 6, 2022, 6:44 a.m. UTC | #14
Greeting,

FYI, we noticed the following commit (built with gcc-11):

commit: 72fd55c0dbb1ad5ba283ca80abc9546702815a33 ("[PATCH v2 net-next] net: generalize skb freeing deferral to per-cpu lists")
url: https://github.com/intel-lab-lkp/linux/commits/Eric-Dumazet/net-generalize-skb-freeing-deferral-to-per-cpu-lists/20220423-060710
base: https://git.kernel.org/cgit/linux/kernel/git/davem/net-next.git c78c5a660439d4d341a03b651541fda3ebe76160
patch link: https://lore.kernel.org/netdev/20220422201237.416238-1-eric.dumazet@gmail.com

in testcase: xfstests
version: xfstests-x86_64-46e1b83-1_20220414
with following parameters:

	disk: 4HDD
	fs: ext4
	fs2: smbv3
	test: generic-group-10
	ucode: 0xec

test-description: xfstests is a regression test suite for xfs and other filesystems.
test-url: git://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git


on test machine: 8 threads Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz with 16G memory

caused below changes (please refer to attached dmesg/kmsg for entire log/backtrace):



If you fix the issue, kindly add following tag
Reported-by: kernel test robot <oliver.sang@intel.com>



[   80.428226][ T1836] Attempt to set a LOCK_MAND lock via flock(2). This support has been removed and the request ignored.
[   80.739189][  T370] generic/207	 0s
[   80.739198][  T370]
[   80.785302][ T1696] run fstests generic/208 at 2022-05-06 02:42:26
[   81.143444][ T1836] Attempt to set a LOCK_MAND lock via flock(2). This support has been removed and the request ignored.
[   89.609627][   T58] kworker/u16:5 invoked oom-killer: gfp_mask=0xcd0(GFP_KERNEL|__GFP_RECLAIMABLE), order=0, oom_score_adj=0
[   89.620805][   T58] CPU: 0 PID: 58 Comm: kworker/u16:5 Not tainted 5.18.0-rc3-00568-g72fd55c0dbb1 #1
[   89.629899][   T58] Hardware name: HP HP Z240 SFF Workstation/802E, BIOS N51 Ver. 01.63 10/05/2017
[   89.638822][   T58] Workqueue: writeback wb_workfn (flush-8:16)
[   89.644727][   T58] Call Trace:
[   89.647863][   T58]  <TASK>
[ 89.650649][ T58] dump_stack_lvl (kbuild/src/consumer/lib/dump_stack.c:107 (discriminator 1)) 
[ 89.654992][ T58] dump_header (kbuild/src/consumer/mm/oom_kill.c:73 kbuild/src/consumer/mm/oom_kill.c:461) 
[ 89.659250][ T58] oom_kill_process.cold (kbuild/src/consumer/mm/oom_kill.c:979) 
[ 89.664110][ T58] out_of_memory (kbuild/src/consumer/mm/oom_kill.c:1119 (discriminator 4)) 


To reproduce:

        git clone https://github.com/intel/lkp-tests.git
        cd lkp-tests
        sudo bin/lkp install job.yaml           # job file is attached in this email
        bin/lkp split-job --compatible job.yaml # generate the yaml file for lkp run
        sudo bin/lkp run generated-yaml-file

        # if come across any failure that blocks the test,
        # please remove ~/.lkp and /lkp dir to run from a clean state.
Eric Dumazet May 6, 2022, 8:15 a.m. UTC | #15
On Thu, May 5, 2022 at 11:44 PM kernel test robot <oliver.sang@intel.com> wrote:
>
>
>
> Greeting,
>
> FYI, we noticed the following commit (built with gcc-11):
>
> commit: 72fd55c0dbb1ad5ba283ca80abc9546702815a33 ("[PATCH v2 net-next] net: generalize skb freeing deferral to per-cpu lists")
> url: https://github.com/intel-lab-lkp/linux/commits/Eric-Dumazet/net-generalize-skb-freeing-deferral-to-per-cpu-lists/20220423-060710
> base: https://git.kernel.org/cgit/linux/kernel/git/davem/net-next.git c78c5a660439d4d341a03b651541fda3ebe76160
> patch link: https://lore.kernel.org/netdev/20220422201237.416238-1-eric.dumazet@gmail.com
>

I think this commit had two follow-up fixes.
Make sure to test the tree with those fixes included, otherwise
this just adds unneeded noise.
Thank you.

commit f3412b3879b4f7c4313b186b03940d4791345534
Author: Eric Dumazet <edumazet@google.com>
Date:   Wed Apr 27 13:41:47 2022 -0700

    net: make sure net_rx_action() calls skb_defer_free_flush()

And:

commit 783d108dd71d97e4cac5fe8ce70ca43ed7dc7bb7
Author: Eric Dumazet <edumazet@google.com>
Date:   Fri Apr 29 18:15:23 2022 -0700

    tcp: drop skb dst in tcp_rcv_established()


> in testcase: xfstests
> version: xfstests-x86_64-46e1b83-1_20220414
> with following parameters:
>
>         disk: 4HDD
>         fs: ext4
>         fs2: smbv3
>         test: generic-group-10
>         ucode: 0xec
>
> test-description: xfstests is a regression test suite for xfs and other filesystems.
> test-url: git://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git
>
>
> on test machine: 8 threads Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz with 16G memory
>
> caused below changes (please refer to attached dmesg/kmsg for entire log/backtrace):
>
>
>
> If you fix the issue, kindly add following tag
> Reported-by: kernel test robot <oliver.sang@intel.com>
>
>
>
> [   80.428226][ T1836] Attempt to set a LOCK_MAND lock via flock(2). This support has been removed and the request ignored.
> [   80.739189][  T370] generic/207       0s
> [   80.739198][  T370]
> [   80.785302][ T1696] run fstests generic/208 at 2022-05-06 02:42:26
> [   81.143444][ T1836] Attempt to set a LOCK_MAND lock via flock(2). This support has been removed and the request ignored.
> [   89.609627][   T58] kworker/u16:5 invoked oom-killer: gfp_mask=0xcd0(GFP_KERNEL|__GFP_RECLAIMABLE), order=0, oom_score_adj=0
> [   89.620805][   T58] CPU: 0 PID: 58 Comm: kworker/u16:5 Not tainted 5.18.0-rc3-00568-g72fd55c0dbb1 #1
> [   89.629899][   T58] Hardware name: HP HP Z240 SFF Workstation/802E, BIOS N51 Ver. 01.63 10/05/2017
> [   89.638822][   T58] Workqueue: writeback wb_workfn (flush-8:16)
> [   89.644727][   T58] Call Trace:
> [   89.647863][   T58]  <TASK>
> [ 89.650649][ T58] dump_stack_lvl (kbuild/src/consumer/lib/dump_stack.c:107 (discriminator 1))
> [ 89.654992][ T58] dump_header (kbuild/src/consumer/mm/oom_kill.c:73 kbuild/src/consumer/mm/oom_kill.c:461)
> [ 89.659250][ T58] oom_kill_process.cold (kbuild/src/consumer/mm/oom_kill.c:979)
> [ 89.664110][ T58] out_of_memory (kbuild/src/consumer/mm/oom_kill.c:1119 (discriminator 4))
>
>
> To reproduce:
>
>         git clone https://github.com/intel/lkp-tests.git
>         cd lkp-tests
>         sudo bin/lkp install job.yaml           # job file is attached in this email
>         bin/lkp split-job --compatible job.yaml # generate the yaml file for lkp run
>         sudo bin/lkp run generated-yaml-file
>
>         # if come across any failure that blocks the test,
>         # please remove ~/.lkp and /lkp dir to run from a clean state.
>
>
>
> --
> 0-DAY CI Kernel Test Service
> https://01.org/lkp
>
>

Patch

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 7dccbfd1bf5635c27514c70b4a06d3e6f74395dd..ac8a5f71220a999aebabd73d8df2c8e2b1325ad4 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -3081,6 +3081,11 @@  struct softnet_data {
 	struct sk_buff_head	input_pkt_queue;
 	struct napi_struct	backlog;
 
+	/* Another possibly contended cache line */
+	spinlock_t		defer_lock ____cacheline_aligned_in_smp;
+	int			defer_count;
+	struct sk_buff		*defer_list;
+	call_single_data_t	defer_csd;
 };
 
 static inline void input_queue_head_incr(struct softnet_data *sd)
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 84d78df60453955a8eaf05847f6e2145176a727a..5cbc184ca685d886306ccff70b82cd409082c229 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -888,6 +888,7 @@  typedef unsigned char *sk_buff_data_t;
  *		delivery_time at egress.
  *	@napi_id: id of the NAPI struct this skb came from
  *	@sender_cpu: (aka @napi_id) source CPU in XPS
+ *	@alloc_cpu: CPU which did the skb allocation.
  *	@secmark: security marking
  *	@mark: Generic packet mark
  *	@reserved_tailroom: (aka @mark) number of bytes of free space available
@@ -1080,6 +1081,7 @@  struct sk_buff {
 		unsigned int	sender_cpu;
 	};
 #endif
+	u16			alloc_cpu;
 #ifdef CONFIG_NETWORK_SECMARK
 	__u32		secmark;
 #endif
@@ -1321,6 +1323,7 @@  struct sk_buff *__build_skb(void *data, unsigned int frag_size);
 struct sk_buff *build_skb(void *data, unsigned int frag_size);
 struct sk_buff *build_skb_around(struct sk_buff *skb,
 				 void *data, unsigned int frag_size);
+void skb_attempt_defer_free(struct sk_buff *skb);
 
 struct sk_buff *napi_build_skb(void *data, unsigned int frag_size);
 
diff --git a/include/net/sock.h b/include/net/sock.h
index a01d6c421aa2caad4032167284eed9565d4bd545..f9f8ecae0f8decb3e0e74c6efaff5b890e3685ea 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -292,7 +292,6 @@  struct sk_filter;
   *	@sk_pacing_shift: scaling factor for TCP Small Queues
   *	@sk_lingertime: %SO_LINGER l_linger setting
   *	@sk_backlog: always used with the per-socket spinlock held
-  *	@defer_list: head of llist storing skbs to be freed
   *	@sk_callback_lock: used with the callbacks in the end of this struct
   *	@sk_error_queue: rarely used
   *	@sk_prot_creator: sk_prot of original sock creator (see ipv6_setsockopt,
@@ -417,7 +416,6 @@  struct sock {
 		struct sk_buff	*head;
 		struct sk_buff	*tail;
 	} sk_backlog;
-	struct llist_head defer_list;
 
 #define sk_rmem_alloc sk_backlog.rmem_alloc
 
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 679b1964d49414fcb1c361778fd0ac664e8c8ea5..94a52ad1101c12e13c2957e8b028b697742c451f 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -1375,18 +1375,6 @@  static inline bool tcp_checksum_complete(struct sk_buff *skb)
 bool tcp_add_backlog(struct sock *sk, struct sk_buff *skb,
 		     enum skb_drop_reason *reason);
 
-#ifdef CONFIG_INET
-void __sk_defer_free_flush(struct sock *sk);
-
-static inline void sk_defer_free_flush(struct sock *sk)
-{
-	if (llist_empty(&sk->defer_list))
-		return;
-	__sk_defer_free_flush(sk);
-}
-#else
-static inline void sk_defer_free_flush(struct sock *sk) {}
-#endif
 
 int tcp_filter(struct sock *sk, struct sk_buff *skb);
 void tcp_set_state(struct sock *sk, int state);
diff --git a/net/core/dev.c b/net/core/dev.c
index 4a77ebda4fb155581a5f761a864446a046987f51..611bd719706412723561c27753150b27e1dc4e7a 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4545,6 +4545,12 @@  static void rps_trigger_softirq(void *data)
 
 #endif /* CONFIG_RPS */
 
+/* Called from hardirq (IPI) context */
+static void trigger_rx_softirq(void *data __always_unused)
+{
+	__raise_softirq_irqoff(NET_RX_SOFTIRQ);
+}
+
 /*
  * Check if this softnet_data structure is another cpu one
  * If yes, queue it to our IPI list and return 1
@@ -6571,6 +6577,28 @@  static int napi_threaded_poll(void *data)
 	return 0;
 }
 
+static void skb_defer_free_flush(struct softnet_data *sd)
+{
+	struct sk_buff *skb, *next;
+	unsigned long flags;
+
+	/* Paired with WRITE_ONCE() in skb_attempt_defer_free() */
+	if (!READ_ONCE(sd->defer_list))
+		return;
+
+	spin_lock_irqsave(&sd->defer_lock, flags);
+	skb = sd->defer_list;
+	sd->defer_list = NULL;
+	sd->defer_count = 0;
+	spin_unlock_irqrestore(&sd->defer_lock, flags);
+
+	while (skb != NULL) {
+		next = skb->next;
+		__kfree_skb(skb);
+		skb = next;
+	}
+}
+
 static __latent_entropy void net_rx_action(struct softirq_action *h)
 {
 	struct softnet_data *sd = this_cpu_ptr(&softnet_data);
@@ -6616,6 +6644,7 @@  static __latent_entropy void net_rx_action(struct softirq_action *h)
 		__raise_softirq_irqoff(NET_RX_SOFTIRQ);
 
 	net_rps_action_and_irq_enable(sd);
+	skb_defer_free_flush(sd);
 }
 
 struct netdev_adjacent {
@@ -11326,6 +11355,8 @@  static int __init net_dev_init(void)
 		INIT_CSD(&sd->csd, rps_trigger_softirq, sd);
 		sd->cpu = i;
 #endif
+		INIT_CSD(&sd->defer_csd, trigger_rx_softirq, NULL);
+		spin_lock_init(&sd->defer_lock);
 
 		init_gro_hash(&sd->backlog);
 		sd->backlog.poll = process_backlog;
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 30b523fa4ad2e9be30bdefdc61f70f989c345bbf..028a280fbabd5b69770ddd6bf0e00eae7651bbf1 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -204,7 +204,7 @@  static void __build_skb_around(struct sk_buff *skb, void *data,
 	skb_set_end_offset(skb, size);
 	skb->mac_header = (typeof(skb->mac_header))~0U;
 	skb->transport_header = (typeof(skb->transport_header))~0U;
-
+	skb->alloc_cpu = raw_smp_processor_id();
 	/* make sure we initialize shinfo sequentially */
 	shinfo = skb_shinfo(skb);
 	memset(shinfo, 0, offsetof(struct skb_shared_info, dataref));
@@ -1037,6 +1037,7 @@  static void __copy_skb_header(struct sk_buff *new, const struct sk_buff *old)
 #ifdef CONFIG_NET_RX_BUSY_POLL
 	CHECK_SKB_FIELD(napi_id);
 #endif
+	CHECK_SKB_FIELD(alloc_cpu);
 #ifdef CONFIG_XPS
 	CHECK_SKB_FIELD(sender_cpu);
 #endif
@@ -6486,3 +6487,51 @@  void __skb_ext_put(struct skb_ext *ext)
 }
 EXPORT_SYMBOL(__skb_ext_put);
 #endif /* CONFIG_SKB_EXTENSIONS */
+
+/**
+ * skb_attempt_defer_free - queue skb for remote freeing
+ * @skb: buffer
+ *
+ * Put @skb in a per-cpu list, using the cpu which
+ * allocated the skb/pages to reduce false sharing
+ * and memory zone spinlock contention.
+ */
+void skb_attempt_defer_free(struct sk_buff *skb)
+{
+	int cpu = skb->alloc_cpu;
+	struct softnet_data *sd;
+	unsigned long flags;
+	bool kick;
+
+	if (WARN_ON_ONCE(cpu >= nr_cpu_ids) ||
+	    !cpu_online(cpu) ||
+	    cpu == raw_smp_processor_id()) {
+		__kfree_skb(skb);
+		return;
+	}
+
+	sd = &per_cpu(softnet_data, cpu);
+	/* We do not send an IPI or any signal.
+	 * Remote cpu will eventually call skb_defer_free_flush()
+	 */
+	spin_lock_irqsave(&sd->defer_lock, flags);
+	skb->next = sd->defer_list;
+	/* Paired with READ_ONCE() in skb_defer_free_flush() */
+	WRITE_ONCE(sd->defer_list, skb);
+	sd->defer_count++;
+
+	/* kick every time queue length reaches 128.
+	 * This should avoid blocking in smp_call_function_single_async().
+	 * This condition should hardly be bit under normal conditions,
+	 * unless cpu suddenly stopped to receive NIC interrupts.
+	 */
+	kick = sd->defer_count == 128;
+
+	spin_unlock_irqrestore(&sd->defer_lock, flags);
+
+	/* Make sure to trigger NET_RX_SOFTIRQ on the remote CPU
+	 * if we are unlucky enough (this seems very unlikely).
+	 */
+	if (unlikely(kick))
+		smp_call_function_single_async(cpu, &sd->defer_csd);
+}
diff --git a/net/core/sock.c b/net/core/sock.c
index 29abec3eabd8905f2671e0b5789878a129453ef6..a0f3989de3d62456665e8b6382a4681fba17d60c 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -2082,9 +2082,6 @@  void sk_destruct(struct sock *sk)
 {
 	bool use_call_rcu = sock_flag(sk, SOCK_RCU_FREE);
 
-	WARN_ON_ONCE(!llist_empty(&sk->defer_list));
-	sk_defer_free_flush(sk);
-
 	if (rcu_access_pointer(sk->sk_reuseport_cb)) {
 		reuseport_detach_sock(sk);
 		use_call_rcu = true;
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index e20b87b3bf907a9b04b7531936129fb729e96c52..db55af9eb37b56bf0ec3b47212240c0302b86a1f 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -843,7 +843,6 @@  ssize_t tcp_splice_read(struct socket *sock, loff_t *ppos,
 	}
 
 	release_sock(sk);
-	sk_defer_free_flush(sk);
 
 	if (spliced)
 		return spliced;
@@ -1589,20 +1588,6 @@  void tcp_cleanup_rbuf(struct sock *sk, int copied)
 		tcp_send_ack(sk);
 }
 
-void __sk_defer_free_flush(struct sock *sk)
-{
-	struct llist_node *head;
-	struct sk_buff *skb, *n;
-
-	head = llist_del_all(&sk->defer_list);
-	llist_for_each_entry_safe(skb, n, head, ll_node) {
-		prefetch(n);
-		skb_mark_not_on_list(skb);
-		__kfree_skb(skb);
-	}
-}
-EXPORT_SYMBOL(__sk_defer_free_flush);
-
 static void tcp_eat_recv_skb(struct sock *sk, struct sk_buff *skb)
 {
 	__skb_unlink(skb, &sk->sk_receive_queue);
@@ -1610,11 +1595,7 @@  static void tcp_eat_recv_skb(struct sock *sk, struct sk_buff *skb)
 		sock_rfree(skb);
 		skb->destructor = NULL;
 		skb->sk = NULL;
-		if (!skb_queue_empty(&sk->sk_receive_queue) ||
-		    !llist_empty(&sk->defer_list)) {
-			llist_add(&skb->ll_node, &sk->defer_list);
-			return;
-		}
+		return skb_attempt_defer_free(skb);
 	}
 	__kfree_skb(skb);
 }
@@ -2453,7 +2434,6 @@  static int tcp_recvmsg_locked(struct sock *sk, struct msghdr *msg, size_t len,
 			__sk_flush_backlog(sk);
 		} else {
 			tcp_cleanup_rbuf(sk, copied);
-			sk_defer_free_flush(sk);
 			sk_wait_data(sk, &timeo, last);
 		}
 
@@ -2571,7 +2551,6 @@  int tcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len, int flags,
 	lock_sock(sk);
 	ret = tcp_recvmsg_locked(sk, msg, len, flags, &tss, &cmsg_flags);
 	release_sock(sk);
-	sk_defer_free_flush(sk);
 
 	if (cmsg_flags && ret >= 0) {
 		if (cmsg_flags & TCP_CMSG_TS)
@@ -3096,7 +3075,6 @@  int tcp_disconnect(struct sock *sk, int flags)
 		sk->sk_frag.page = NULL;
 		sk->sk_frag.offset = 0;
 	}
-	sk_defer_free_flush(sk);
 	sk_error_report(sk);
 	return 0;
 }
@@ -4225,7 +4203,6 @@  static int do_tcp_getsockopt(struct sock *sk, int level,
 		err = BPF_CGROUP_RUN_PROG_GETSOCKOPT_KERN(sk, level, optname,
 							  &zc, &len, err);
 		release_sock(sk);
-		sk_defer_free_flush(sk);
 		if (len >= offsetofend(struct tcp_zerocopy_receive, msg_flags))
 			goto zerocopy_rcv_cmsg;
 		switch (len) {
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 2c2d421425557c188c4bcf3dc113baea62e915c7..918816ec5dd49abe321f0179a2a64ca9a989a01c 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -2065,7 +2065,6 @@  int tcp_v4_rcv(struct sk_buff *skb)
 
 	sk_incoming_cpu_update(sk);
 
-	sk_defer_free_flush(sk);
 	bh_lock_sock_nested(sk);
 	tcp_segs_in(tcp_sk(sk), skb);
 	ret = 0;
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index 54277de7474b78f1fea033b7978acebc0647f3ad..60bdec257ba7220d6c05b48208a587c7be2b4087 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -1728,7 +1728,6 @@  INDIRECT_CALLABLE_SCOPE int tcp_v6_rcv(struct sk_buff *skb)
 
 	sk_incoming_cpu_update(sk);
 
-	sk_defer_free_flush(sk);
 	bh_lock_sock_nested(sk);
 	tcp_segs_in(tcp_sk(sk), skb);
 	ret = 0;
diff --git a/net/tls/tls_sw.c b/net/tls/tls_sw.c
index ddbe05ec5489dd352dee832e038884339f338b43..bc54f6c5b1a4cabbfe1e3eff1768128b2730c730 100644
--- a/net/tls/tls_sw.c
+++ b/net/tls/tls_sw.c
@@ -1911,7 +1911,6 @@  int tls_sw_recvmsg(struct sock *sk,
 
 end:
 	release_sock(sk);
-	sk_defer_free_flush(sk);
 	if (psock)
 		sk_psock_put(sk, psock);
 	return copied ? : err;
@@ -1983,7 +1982,6 @@  ssize_t tls_sw_splice_read(struct socket *sock,  loff_t *ppos,
 
 splice_read_end:
 	release_sock(sk);
-	sk_defer_free_flush(sk);
 	return copied ? : err;
 }