diff mbox series

[net-next] net: generalize skb freeing deferral to per-cpu lists

Message ID 20220421153920.3637792-1-eric.dumazet@gmail.com (mailing list archive)
State Superseded
Delegated to: Netdev Maintainers
Headers show
Series [net-next] net: generalize skb freeing deferral to per-cpu lists | expand

Checks

Context Check Description
netdev/tree_selection success Clearly marked for net-next, async
netdev/fixes_present success Fixes tag not required for -next series
netdev/subject_prefix success Link
netdev/cover_letter success Single patches do not need cover letters
netdev/patch_count success Link
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 5839 this patch: 5839
netdev/cc_maintainers warning 8 maintainers not CCed: petrm@nvidia.com borisp@nvidia.com yoshfuji@linux-ipv6.org dsahern@kernel.org daniel@iogearbox.net imagedong@tencent.com keescook@chromium.org john.fastabend@gmail.com
netdev/build_clang success Errors and warnings before: 1151 this patch: 1151
netdev/module_param success Was 0 now: 0
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn success Errors and warnings before: 5978 this patch: 5978
netdev/checkpatch success total: 0 errors, 0 warnings, 0 checks, 271 lines checked
netdev/kdoc fail Errors and warnings before: 0 this patch: 1
netdev/source_inline success Was 0 now: 0

Commit Message

Eric Dumazet April 21, 2022, 3:39 p.m. UTC
From: Eric Dumazet <edumazet@google.com>

Logic added in commit f35f821935d8 ("tcp: defer skb freeing after socket
lock is released") helped bulk TCP flows to move the cost of skbs
frees outside of critical section where socket lock was held.

But for RPC traffic, or hosts with RFS enabled, the solution is far from
being ideal.

For RPC traffic, recvmsg() has to return to user space right after
skb payload has been consumed, meaning that BH handler has no chance
to pick the skb before recvmsg() thread. This issue is more visible
with BIG TCP, as more RPC fit one skb.

For RFS, even if BH handler picks the skbs, they are still picked
from the cpu on which user thread is running.

Ideally, it is better to free the skbs (and associated page frags)
on the cpu that originally allocated them.

This patch removes the per socket anchor (sk->defer_list) and
instead uses a per-cpu list, which will hold more skbs per round.

This new per-cpu list is drained at the end of net_action_rx(),
after incoming packets have been processed, to lower latencies.

In normal conditions, skbs are added to the per-cpu list with
no further action. In the (unlikely) cases where the cpu does not
run net_action_rx() handler fast enough, we use an IPI to raise
NET_RX_SOFTIRQ on the remote cpu.

Also, we do not bother draining the per-cpu list from dev_cpu_dead()
This is because skbs in this list have no requirement on how fast
they should be freed.

Tested on a pair of hosts with 100Gbit NIC, RFS enabled,
and /proc/sys/net/ipv4/tcp_rmem[2] tuned to 16MB to work around
page recycling strategy used by NIC driver (its page pool capacity
being too small compared to number of skbs/pages held in sockets
receive queues)

Note that this tuning was only done to demonstrate worse
conditions for skb freeing for this particular test.
These conditions can happen in more general production workload.

10 runs of one TCP_STREAM flow

Before:
Average throughput: 49685 Mbit.

Kernel profiles on cpu running user thread recvmsg() show high cost for
skb freeing related functions (*)

    57.81%  [kernel]       [k] copy_user_enhanced_fast_string
(*) 12.87%  [kernel]       [k] skb_release_data
(*)  4.25%  [kernel]       [k] __free_one_page
(*)  3.57%  [kernel]       [k] __list_del_entry_valid
     1.85%  [kernel]       [k] __netif_receive_skb_core
     1.60%  [kernel]       [k] __skb_datagram_iter
(*)  1.59%  [kernel]       [k] free_unref_page_commit
(*)  1.16%  [kernel]       [k] __slab_free
     1.16%  [kernel]       [k] _copy_to_iter
(*)  1.01%  [kernel]       [k] kfree
(*)  0.88%  [kernel]       [k] free_unref_page
     0.57%  [kernel]       [k] ip6_rcv_core
     0.55%  [kernel]       [k] ip6t_do_table
     0.54%  [kernel]       [k] flush_smp_call_function_queue
(*)  0.54%  [kernel]       [k] free_pcppages_bulk
     0.51%  [kernel]       [k] llist_reverse_order
     0.38%  [kernel]       [k] process_backlog
(*)  0.38%  [kernel]       [k] free_pcp_prepare
     0.37%  [kernel]       [k] tcp_recvmsg_locked
(*)  0.37%  [kernel]       [k] __list_add_valid
     0.34%  [kernel]       [k] sock_rfree
     0.34%  [kernel]       [k] _raw_spin_lock_irq
(*)  0.33%  [kernel]       [k] __page_cache_release
     0.33%  [kernel]       [k] tcp_v6_rcv
(*)  0.33%  [kernel]       [k] __put_page
(*)  0.29%  [kernel]       [k] __mod_zone_page_state
     0.27%  [kernel]       [k] _raw_spin_lock

After patch:
Average throughput: 71874 Mbit.

Kernel profiles on cpu running user thread recvmsg() looks better:

    81.35%  [kernel]       [k] copy_user_enhanced_fast_string
     1.95%  [kernel]       [k] _copy_to_iter
     1.95%  [kernel]       [k] __skb_datagram_iter
     1.27%  [kernel]       [k] __netif_receive_skb_core
     1.03%  [kernel]       [k] ip6t_do_table
     0.60%  [kernel]       [k] sock_rfree
     0.50%  [kernel]       [k] tcp_v6_rcv
     0.47%  [kernel]       [k] ip6_rcv_core
     0.45%  [kernel]       [k] read_tsc
     0.44%  [kernel]       [k] _raw_spin_lock_irqsave
     0.37%  [kernel]       [k] _raw_spin_lock
     0.37%  [kernel]       [k] native_irq_return_iret
     0.33%  [kernel]       [k] __inet6_lookup_established
     0.31%  [kernel]       [k] ip6_protocol_deliver_rcu
     0.29%  [kernel]       [k] tcp_rcv_established
     0.29%  [kernel]       [k] llist_reverse_order

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 include/linux/netdevice.h |  3 +++
 include/linux/skbuff.h    |  2 ++
 include/net/sock.h        |  2 --
 include/net/tcp.h         | 12 ----------
 net/core/dev.c            | 27 +++++++++++++++++++++++
 net/core/skbuff.c         | 46 ++++++++++++++++++++++++++++++++++++++-
 net/core/sock.c           |  3 ---
 net/ipv4/tcp.c            | 25 +--------------------
 net/ipv4/tcp_ipv4.c       |  1 -
 net/ipv6/tcp_ipv6.c       |  1 -
 net/tls/tls_sw.c          |  2 --
 11 files changed, 78 insertions(+), 46 deletions(-)

Comments

Paolo Abeni April 22, 2022, 9:02 a.m. UTC | #1
Hi,

Looks great! I have a few questions below mostly to understand better
how it works...

On Thu, 2022-04-21 at 08:39 -0700, Eric Dumazet wrote:
> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> index 84d78df60453955a8eaf05847f6e2145176a727a..2fe311447fae5e860eee95f6e8772926d4915e9f 100644
> --- a/include/linux/skbuff.h
> +++ b/include/linux/skbuff.h
> @@ -1080,6 +1080,7 @@ struct sk_buff {
>  		unsigned int	sender_cpu;
>  	};
>  #endif
> +	u16			alloc_cpu;

I *think* we could in theory fetch the CPU that allocated the skb from
the napi_id - adding a cpu field to napi_struct and implementing an
helper to fetch it. Have you considered that option? or the napi lookup
would be just too expensive?

[...]

> diff --git a/net/core/dev.c b/net/core/dev.c
> index 4a77ebda4fb155581a5f761a864446a046987f51..4136d9c0ada6870ea0f7689702bdb5f0bbf29145 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -4545,6 +4545,12 @@ static void rps_trigger_softirq(void *data)
>  
>  #endif /* CONFIG_RPS */
>  
> +/* Called from hardirq (IPI) context */
> +static void trigger_rx_softirq(void *data)

Perhaps '__always_unused' ? (But the compiler doesn't complain here)

> @@ -6486,3 +6487,46 @@ void __skb_ext_put(struct skb_ext *ext)
>  }
>  EXPORT_SYMBOL(__skb_ext_put);
>  #endif /* CONFIG_SKB_EXTENSIONS */
> +
> +/**
> + * skb_attempt_defer_free - queue skb for remote freeing
> + * @skb: buffer
> + *
> + * Put @skb in a per-cpu list, using the cpu which
> + * allocated the skb/pages to reduce false sharing
> + * and memory zone spinlock contention.
> + */
> +void skb_attempt_defer_free(struct sk_buff *skb)
> +{
> +	int cpu = skb->alloc_cpu;
> +	struct softnet_data *sd;
> +	unsigned long flags;
> +	bool kick;
> +
> +	if (WARN_ON_ONCE(cpu >= nr_cpu_ids) || !cpu_online(cpu)) {
> +		__kfree_skb(skb);
> +		return;
> +	}

I'm wondering if we should skip even when cpu == smp_processor_id()? 

> +
> +	sd = &per_cpu(softnet_data, cpu);
> +	/* We do not send an IPI or any signal.
> +	 * Remote cpu will eventually call skb_defer_free_flush()
> +	 */
> +	spin_lock_irqsave(&sd->skb_defer_list.lock, flags);
> +	__skb_queue_tail(&sd->skb_defer_list, skb);
> +
> +	/* kick every time queue length reaches 128.
> +	 * This should avoid blocking in smp_call_function_single_async().
> +	 * This condition should hardly be bit under normal conditions,
> +	 * unless cpu suddenly stopped to receive NIC interrupts.
> +	 */
> +	kick = skb_queue_len(&sd->skb_defer_list) == 128;

Out of sheer curiosity why 128? I guess it's should be larger then
NAPI_POLL_WEIGHT, to cope with with maximum theorethical burst len?

Thanks!

Paolo
Eric Dumazet April 22, 2022, 3:59 p.m. UTC | #2
On Fri, Apr 22, 2022 at 2:02 AM Paolo Abeni <pabeni@redhat.com> wrote:
>
> Hi,
>
> Looks great! I have a few questions below mostly to understand better
> how it works...
>

Hi Paolo, thanks for the review :)

> On Thu, 2022-04-21 at 08:39 -0700, Eric Dumazet wrote:
> > diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> > index 84d78df60453955a8eaf05847f6e2145176a727a..2fe311447fae5e860eee95f6e8772926d4915e9f 100644
> > --- a/include/linux/skbuff.h
> > +++ b/include/linux/skbuff.h
> > @@ -1080,6 +1080,7 @@ struct sk_buff {
> >               unsigned int    sender_cpu;
> >       };
> >  #endif
> > +     u16                     alloc_cpu;
>
> I *think* we could in theory fetch the CPU that allocated the skb from
> the napi_id - adding a cpu field to napi_struct and implementing an
> helper to fetch it. Have you considered that option? or the napi lookup
> would be just too expensive?

I have considered that, but a NAPI is not guaranteed to be
owned/serviced from a single cpu.

(In fact, I realized recently about the fact that commit
01770a166165 "tcp: fix race condition when creating child sockets from
syncookies"
has not been backported to stable kernels.

This tcp bug would not happen in normal cases, where all packets from
a particular 4-tuple
are handled by a single cpu.

>
> [...]
>
> > diff --git a/net/core/dev.c b/net/core/dev.c
> > index 4a77ebda4fb155581a5f761a864446a046987f51..4136d9c0ada6870ea0f7689702bdb5f0bbf29145 100644
> > --- a/net/core/dev.c
> > +++ b/net/core/dev.c
> > @@ -4545,6 +4545,12 @@ static void rps_trigger_softirq(void *data)
> >
> >  #endif /* CONFIG_RPS */
> >
> > +/* Called from hardirq (IPI) context */
> > +static void trigger_rx_softirq(void *data)
>
> Perhaps '__always_unused' ? (But the compiler doesn't complain here)

Sure I will add this.

>
> > @@ -6486,3 +6487,46 @@ void __skb_ext_put(struct skb_ext *ext)
> >  }
> >  EXPORT_SYMBOL(__skb_ext_put);
> >  #endif /* CONFIG_SKB_EXTENSIONS */
> > +
> > +/**
> > + * skb_attempt_defer_free - queue skb for remote freeing
> > + * @skb: buffer
> > + *
> > + * Put @skb in a per-cpu list, using the cpu which
> > + * allocated the skb/pages to reduce false sharing
> > + * and memory zone spinlock contention.
> > + */
> > +void skb_attempt_defer_free(struct sk_buff *skb)
> > +{
> > +     int cpu = skb->alloc_cpu;
> > +     struct softnet_data *sd;
> > +     unsigned long flags;
> > +     bool kick;
> > +
> > +     if (WARN_ON_ONCE(cpu >= nr_cpu_ids) || !cpu_online(cpu)) {
> > +             __kfree_skb(skb);
> > +             return;
> > +     }
>
> I'm wondering if we should skip even when cpu == smp_processor_id()?

Yes, although we would have to use the raw_smp_processor_id() form I guess.

>
> > +
> > +     sd = &per_cpu(softnet_data, cpu);
> > +     /* We do not send an IPI or any signal.
> > +      * Remote cpu will eventually call skb_defer_free_flush()
> > +      */
> > +     spin_lock_irqsave(&sd->skb_defer_list.lock, flags);
> > +     __skb_queue_tail(&sd->skb_defer_list, skb);
> > +
> > +     /* kick every time queue length reaches 128.
> > +      * This should avoid blocking in smp_call_function_single_async().
> > +      * This condition should hardly be bit under normal conditions,
> > +      * unless cpu suddenly stopped to receive NIC interrupts.
> > +      */
> > +     kick = skb_queue_len(&sd->skb_defer_list) == 128;
>
> Out of sheer curiosity why 128? I guess it's should be larger then
> NAPI_POLL_WEIGHT, to cope with with maximum theorethical burst len?

Yes, I needed a value there, but was not sure which precise one.
In my tests I had no IPI ever sent with 128.

>
> Thanks!
>
> Paolo
>
Jakub Kicinski April 22, 2022, 4:40 p.m. UTC | #3
On Thu, 21 Apr 2022 08:39:20 -0700 Eric Dumazet wrote:
> 10 runs of one TCP_STREAM flow

Was the test within a NUMA node or cross-node?

For my learning - could this change cause more cache line bouncing 
than individual per-socket lists for non-RFS setups. Multiple CPUs 
may try to queue skbs for freeing on one remove node.

> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index 7dccbfd1bf5635c27514c70b4a06d3e6f74395dd..0162a9bdc9291e7aae967a044788d09bd2ef2423 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -3081,6 +3081,9 @@ struct softnet_data {
>  	struct sk_buff_head	input_pkt_queue;
>  	struct napi_struct	backlog;
>  
> +	/* Another possibly contended cache line */
> +	struct sk_buff_head	skb_defer_list ____cacheline_aligned_in_smp;

If so maybe we can avoid some dirtying and use a single-linked list? 
No point modifying the cache line of the skb already on the list.

> +	call_single_data_t  csd_defer;
>  };
Jakub Kicinski April 22, 2022, 4:40 p.m. UTC | #4
On Thu, 21 Apr 2022 08:39:20 -0700 Eric Dumazet wrote:
> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> index 84d78df60453955a8eaf05847f6e2145176a727a..2fe311447fae5e860eee95f6e8772926d4915e9f 100644
> --- a/include/linux/skbuff.h
> +++ b/include/linux/skbuff.h
> @@ -1080,6 +1080,7 @@ struct sk_buff {
>  		unsigned int	sender_cpu;
>  	};
>  #endif
> +	u16			alloc_cpu;
>  #ifdef CONFIG_NETWORK_SECMARK
>  	__u32		secmark;
>  #endif

nit: kdoc missing
Eric Dumazet April 22, 2022, 4:50 p.m. UTC | #5
On Fri, Apr 22, 2022 at 9:40 AM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Thu, 21 Apr 2022 08:39:20 -0700 Eric Dumazet wrote:
> > 10 runs of one TCP_STREAM flow
>
> Was the test within a NUMA node or cross-node?

This was a NUMA host, but nothing done to force anything (no pinning,
both for sender and receiver)

>
> For my learning - could this change cause more cache line bouncing
> than individual per-socket lists for non-RFS setups. Multiple CPUs
> may try to queue skbs for freeing on one remove node.

I did tests as well in non-RFS setups, and got nice improvement as well
I could post them in v2 if you want.

The thing is that with a typical number of RX queues (typically 16 or
32 queues on a 100Gbit NIC),
there is enough sharding for this spinlock to be a non-issue.

Also, we could quite easily add some batching in a future patch, for
the cases where the number of RX queues
is too small.

(Each cpu could hold up to 8 or 16 skbs in a per-cpu cache, before
giving them back to alloc_cpu(s))


>
> > diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> > index 7dccbfd1bf5635c27514c70b4a06d3e6f74395dd..0162a9bdc9291e7aae967a044788d09bd2ef2423 100644
> > --- a/include/linux/netdevice.h
> > +++ b/include/linux/netdevice.h
> > @@ -3081,6 +3081,9 @@ struct softnet_data {
> >       struct sk_buff_head     input_pkt_queue;
> >       struct napi_struct      backlog;
> >
> > +     /* Another possibly contended cache line */
> > +     struct sk_buff_head     skb_defer_list ____cacheline_aligned_in_smp;
>
> If so maybe we can avoid some dirtying and use a single-linked list?
> No point modifying the cache line of the skb already on the list.

Good idea, I can think about it.

>
> > +     call_single_data_t  csd_defer;
> >  };
Eric Dumazet April 22, 2022, 4:50 p.m. UTC | #6
On Fri, Apr 22, 2022 at 9:41 AM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Thu, 21 Apr 2022 08:39:20 -0700 Eric Dumazet wrote:
> > diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> > index 84d78df60453955a8eaf05847f6e2145176a727a..2fe311447fae5e860eee95f6e8772926d4915e9f 100644
> > --- a/include/linux/skbuff.h
> > +++ b/include/linux/skbuff.h
> > @@ -1080,6 +1080,7 @@ struct sk_buff {
> >               unsigned int    sender_cpu;
> >       };
> >  #endif
> > +     u16                     alloc_cpu;
> >  #ifdef CONFIG_NETWORK_SECMARK
> >       __u32           secmark;
> >  #endif
>
> nit: kdoc missing

Yep, I had this covered for v2 already, thanks.
Jakub Kicinski April 22, 2022, 5:10 p.m. UTC | #7
On Fri, 22 Apr 2022 09:50:33 -0700 Eric Dumazet wrote:
> The thing is that with a typical number of RX queues (typically 16 or
> 32 queues on a 100Gbit NIC),
> there is enough sharding for this spinlock to be a non-issue.
> 
> Also, we could quite easily add some batching in a future patch, for
> the cases where the number of RX queues
> is too small.
> 
> (Each cpu could hold up to 8 or 16 skbs in a per-cpu cache, before
> giving them back to alloc_cpu(s))

I was wondering if we want to keep the per-socket queue for the
batching but you're right, per CPU batch is better anyway if needed.
Eric Dumazet April 22, 2022, 5:25 p.m. UTC | #8
On Fri, Apr 22, 2022 at 9:50 AM Eric Dumazet <edumazet@google.com> wrote:
>
> On Fri, Apr 22, 2022 at 9:40 AM Jakub Kicinski <kuba@kernel.org> wrote:

> >
> > If so maybe we can avoid some dirtying and use a single-linked list?
> > No point modifying the cache line of the skb already on the list.
>
> Good idea, I can think about it.
>


My first implementation was using an llist (as current per-socket llist),
but then I needed the count as well, so I converted to standard sk_buff_head

It seems we can hand code to:

     spinlock_t  lock;
     struct sk_buff *skb_head;
     int count;

We also could keep an llist,  and an atomic_t for the count, but that
would require two atomic ops, so no good.
diff mbox series

Patch

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 7dccbfd1bf5635c27514c70b4a06d3e6f74395dd..0162a9bdc9291e7aae967a044788d09bd2ef2423 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -3081,6 +3081,9 @@  struct softnet_data {
 	struct sk_buff_head	input_pkt_queue;
 	struct napi_struct	backlog;
 
+	/* Another possibly contended cache line */
+	struct sk_buff_head	skb_defer_list ____cacheline_aligned_in_smp;
+	call_single_data_t  csd_defer;
 };
 
 static inline void input_queue_head_incr(struct softnet_data *sd)
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 84d78df60453955a8eaf05847f6e2145176a727a..2fe311447fae5e860eee95f6e8772926d4915e9f 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -1080,6 +1080,7 @@  struct sk_buff {
 		unsigned int	sender_cpu;
 	};
 #endif
+	u16			alloc_cpu;
 #ifdef CONFIG_NETWORK_SECMARK
 	__u32		secmark;
 #endif
@@ -1321,6 +1322,7 @@  struct sk_buff *__build_skb(void *data, unsigned int frag_size);
 struct sk_buff *build_skb(void *data, unsigned int frag_size);
 struct sk_buff *build_skb_around(struct sk_buff *skb,
 				 void *data, unsigned int frag_size);
+void skb_attempt_defer_free(struct sk_buff *skb);
 
 struct sk_buff *napi_build_skb(void *data, unsigned int frag_size);
 
diff --git a/include/net/sock.h b/include/net/sock.h
index a01d6c421aa2caad4032167284eed9565d4bd545..f9f8ecae0f8decb3e0e74c6efaff5b890e3685ea 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -292,7 +292,6 @@  struct sk_filter;
   *	@sk_pacing_shift: scaling factor for TCP Small Queues
   *	@sk_lingertime: %SO_LINGER l_linger setting
   *	@sk_backlog: always used with the per-socket spinlock held
-  *	@defer_list: head of llist storing skbs to be freed
   *	@sk_callback_lock: used with the callbacks in the end of this struct
   *	@sk_error_queue: rarely used
   *	@sk_prot_creator: sk_prot of original sock creator (see ipv6_setsockopt,
@@ -417,7 +416,6 @@  struct sock {
 		struct sk_buff	*head;
 		struct sk_buff	*tail;
 	} sk_backlog;
-	struct llist_head defer_list;
 
 #define sk_rmem_alloc sk_backlog.rmem_alloc
 
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 679b1964d49414fcb1c361778fd0ac664e8c8ea5..94a52ad1101c12e13c2957e8b028b697742c451f 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -1375,18 +1375,6 @@  static inline bool tcp_checksum_complete(struct sk_buff *skb)
 bool tcp_add_backlog(struct sock *sk, struct sk_buff *skb,
 		     enum skb_drop_reason *reason);
 
-#ifdef CONFIG_INET
-void __sk_defer_free_flush(struct sock *sk);
-
-static inline void sk_defer_free_flush(struct sock *sk)
-{
-	if (llist_empty(&sk->defer_list))
-		return;
-	__sk_defer_free_flush(sk);
-}
-#else
-static inline void sk_defer_free_flush(struct sock *sk) {}
-#endif
 
 int tcp_filter(struct sock *sk, struct sk_buff *skb);
 void tcp_set_state(struct sock *sk, int state);
diff --git a/net/core/dev.c b/net/core/dev.c
index 4a77ebda4fb155581a5f761a864446a046987f51..4136d9c0ada6870ea0f7689702bdb5f0bbf29145 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4545,6 +4545,12 @@  static void rps_trigger_softirq(void *data)
 
 #endif /* CONFIG_RPS */
 
+/* Called from hardirq (IPI) context */
+static void trigger_rx_softirq(void *data)
+{
+	__raise_softirq_irqoff(NET_RX_SOFTIRQ);
+}
+
 /*
  * Check if this softnet_data structure is another cpu one
  * If yes, queue it to our IPI list and return 1
@@ -6571,6 +6577,24 @@  static int napi_threaded_poll(void *data)
 	return 0;
 }
 
+static void skb_defer_free_flush(struct softnet_data *sd)
+{
+	struct sk_buff_head list;
+	struct sk_buff *skb;
+	unsigned long flags;
+
+	if (skb_queue_empty_lockless(&sd->skb_defer_list))
+		return;
+	__skb_queue_head_init(&list);
+
+	spin_lock_irqsave(&sd->skb_defer_list.lock, flags);
+	skb_queue_splice_init(&sd->skb_defer_list, &list);
+	spin_unlock_irqrestore(&sd->skb_defer_list.lock, flags);
+
+	while ((skb = __skb_dequeue(&list)) != NULL)
+		__kfree_skb(skb);
+}
+
 static __latent_entropy void net_rx_action(struct softirq_action *h)
 {
 	struct softnet_data *sd = this_cpu_ptr(&softnet_data);
@@ -6616,6 +6640,7 @@  static __latent_entropy void net_rx_action(struct softirq_action *h)
 		__raise_softirq_irqoff(NET_RX_SOFTIRQ);
 
 	net_rps_action_and_irq_enable(sd);
+	skb_defer_free_flush(sd);
 }
 
 struct netdev_adjacent {
@@ -11326,6 +11351,8 @@  static int __init net_dev_init(void)
 		INIT_CSD(&sd->csd, rps_trigger_softirq, sd);
 		sd->cpu = i;
 #endif
+		INIT_CSD(&sd->csd_defer, trigger_rx_softirq, NULL);
+		skb_queue_head_init(&sd->skb_defer_list);
 
 		init_gro_hash(&sd->backlog);
 		sd->backlog.poll = process_backlog;
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 30b523fa4ad2e9be30bdefdc61f70f989c345bbf..c0ab436401b5d1d985e6eccc393ffa4ad007843b 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -204,7 +204,7 @@  static void __build_skb_around(struct sk_buff *skb, void *data,
 	skb_set_end_offset(skb, size);
 	skb->mac_header = (typeof(skb->mac_header))~0U;
 	skb->transport_header = (typeof(skb->transport_header))~0U;
-
+	skb->alloc_cpu = raw_smp_processor_id();
 	/* make sure we initialize shinfo sequentially */
 	shinfo = skb_shinfo(skb);
 	memset(shinfo, 0, offsetof(struct skb_shared_info, dataref));
@@ -1037,6 +1037,7 @@  static void __copy_skb_header(struct sk_buff *new, const struct sk_buff *old)
 #ifdef CONFIG_NET_RX_BUSY_POLL
 	CHECK_SKB_FIELD(napi_id);
 #endif
+	CHECK_SKB_FIELD(alloc_cpu);
 #ifdef CONFIG_XPS
 	CHECK_SKB_FIELD(sender_cpu);
 #endif
@@ -6486,3 +6487,46 @@  void __skb_ext_put(struct skb_ext *ext)
 }
 EXPORT_SYMBOL(__skb_ext_put);
 #endif /* CONFIG_SKB_EXTENSIONS */
+
+/**
+ * skb_attempt_defer_free - queue skb for remote freeing
+ * @skb: buffer
+ *
+ * Put @skb in a per-cpu list, using the cpu which
+ * allocated the skb/pages to reduce false sharing
+ * and memory zone spinlock contention.
+ */
+void skb_attempt_defer_free(struct sk_buff *skb)
+{
+	int cpu = skb->alloc_cpu;
+	struct softnet_data *sd;
+	unsigned long flags;
+	bool kick;
+
+	if (WARN_ON_ONCE(cpu >= nr_cpu_ids) || !cpu_online(cpu)) {
+		__kfree_skb(skb);
+		return;
+	}
+
+	sd = &per_cpu(softnet_data, cpu);
+	/* We do not send an IPI or any signal.
+	 * Remote cpu will eventually call skb_defer_free_flush()
+	 */
+	spin_lock_irqsave(&sd->skb_defer_list.lock, flags);
+	__skb_queue_tail(&sd->skb_defer_list, skb);
+
+	/* kick every time queue length reaches 128.
+	 * This should avoid blocking in smp_call_function_single_async().
+	 * This condition should hardly be bit under normal conditions,
+	 * unless cpu suddenly stopped to receive NIC interrupts.
+	 */
+	kick = skb_queue_len(&sd->skb_defer_list) == 128;
+
+	spin_unlock_irqrestore(&sd->skb_defer_list.lock, flags);
+
+	/* Make sure to trigger NET_RX_SOFTIRQ on the remote CPU
+	 * if we are unlucky enough (this seems very unlikely).
+	 */
+	if (unlikely(kick))
+		smp_call_function_single_async(cpu, &sd->csd_defer);
+}
diff --git a/net/core/sock.c b/net/core/sock.c
index 29abec3eabd8905f2671e0b5789878a129453ef6..a0f3989de3d62456665e8b6382a4681fba17d60c 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -2082,9 +2082,6 @@  void sk_destruct(struct sock *sk)
 {
 	bool use_call_rcu = sock_flag(sk, SOCK_RCU_FREE);
 
-	WARN_ON_ONCE(!llist_empty(&sk->defer_list));
-	sk_defer_free_flush(sk);
-
 	if (rcu_access_pointer(sk->sk_reuseport_cb)) {
 		reuseport_detach_sock(sk);
 		use_call_rcu = true;
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index e20b87b3bf907a9b04b7531936129fb729e96c52..db55af9eb37b56bf0ec3b47212240c0302b86a1f 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -843,7 +843,6 @@  ssize_t tcp_splice_read(struct socket *sock, loff_t *ppos,
 	}
 
 	release_sock(sk);
-	sk_defer_free_flush(sk);
 
 	if (spliced)
 		return spliced;
@@ -1589,20 +1588,6 @@  void tcp_cleanup_rbuf(struct sock *sk, int copied)
 		tcp_send_ack(sk);
 }
 
-void __sk_defer_free_flush(struct sock *sk)
-{
-	struct llist_node *head;
-	struct sk_buff *skb, *n;
-
-	head = llist_del_all(&sk->defer_list);
-	llist_for_each_entry_safe(skb, n, head, ll_node) {
-		prefetch(n);
-		skb_mark_not_on_list(skb);
-		__kfree_skb(skb);
-	}
-}
-EXPORT_SYMBOL(__sk_defer_free_flush);
-
 static void tcp_eat_recv_skb(struct sock *sk, struct sk_buff *skb)
 {
 	__skb_unlink(skb, &sk->sk_receive_queue);
@@ -1610,11 +1595,7 @@  static void tcp_eat_recv_skb(struct sock *sk, struct sk_buff *skb)
 		sock_rfree(skb);
 		skb->destructor = NULL;
 		skb->sk = NULL;
-		if (!skb_queue_empty(&sk->sk_receive_queue) ||
-		    !llist_empty(&sk->defer_list)) {
-			llist_add(&skb->ll_node, &sk->defer_list);
-			return;
-		}
+		return skb_attempt_defer_free(skb);
 	}
 	__kfree_skb(skb);
 }
@@ -2453,7 +2434,6 @@  static int tcp_recvmsg_locked(struct sock *sk, struct msghdr *msg, size_t len,
 			__sk_flush_backlog(sk);
 		} else {
 			tcp_cleanup_rbuf(sk, copied);
-			sk_defer_free_flush(sk);
 			sk_wait_data(sk, &timeo, last);
 		}
 
@@ -2571,7 +2551,6 @@  int tcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len, int flags,
 	lock_sock(sk);
 	ret = tcp_recvmsg_locked(sk, msg, len, flags, &tss, &cmsg_flags);
 	release_sock(sk);
-	sk_defer_free_flush(sk);
 
 	if (cmsg_flags && ret >= 0) {
 		if (cmsg_flags & TCP_CMSG_TS)
@@ -3096,7 +3075,6 @@  int tcp_disconnect(struct sock *sk, int flags)
 		sk->sk_frag.page = NULL;
 		sk->sk_frag.offset = 0;
 	}
-	sk_defer_free_flush(sk);
 	sk_error_report(sk);
 	return 0;
 }
@@ -4225,7 +4203,6 @@  static int do_tcp_getsockopt(struct sock *sk, int level,
 		err = BPF_CGROUP_RUN_PROG_GETSOCKOPT_KERN(sk, level, optname,
 							  &zc, &len, err);
 		release_sock(sk);
-		sk_defer_free_flush(sk);
 		if (len >= offsetofend(struct tcp_zerocopy_receive, msg_flags))
 			goto zerocopy_rcv_cmsg;
 		switch (len) {
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 157265aecbedd1761de2d892b5a54f4a0cfe1912..f5eb40e6ec48824bb684175c0dc920653c72a6d8 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -2066,7 +2066,6 @@  int tcp_v4_rcv(struct sk_buff *skb)
 
 	sk_incoming_cpu_update(sk);
 
-	sk_defer_free_flush(sk);
 	bh_lock_sock_nested(sk);
 	tcp_segs_in(tcp_sk(sk), skb);
 	ret = 0;
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index 782df529ff69e0b2fc5f9d12fc72538b7041a392..09031561fc0e74f2a92069aa5ebd78df9c39cd5a 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -1728,7 +1728,6 @@  INDIRECT_CALLABLE_SCOPE int tcp_v6_rcv(struct sk_buff *skb)
 
 	sk_incoming_cpu_update(sk);
 
-	sk_defer_free_flush(sk);
 	bh_lock_sock_nested(sk);
 	tcp_segs_in(tcp_sk(sk), skb);
 	ret = 0;
diff --git a/net/tls/tls_sw.c b/net/tls/tls_sw.c
index ddbe05ec5489dd352dee832e038884339f338b43..bc54f6c5b1a4cabbfe1e3eff1768128b2730c730 100644
--- a/net/tls/tls_sw.c
+++ b/net/tls/tls_sw.c
@@ -1911,7 +1911,6 @@  int tls_sw_recvmsg(struct sock *sk,
 
 end:
 	release_sock(sk);
-	sk_defer_free_flush(sk);
 	if (psock)
 		sk_psock_put(sk, psock);
 	return copied ? : err;
@@ -1983,7 +1982,6 @@  ssize_t tls_sw_splice_read(struct socket *sock,  loff_t *ppos,
 
 splice_read_end:
 	release_sock(sk);
-	sk_defer_free_flush(sk);
 	return copied ? : err;
 }