Message ID | 20220422201237.416238-1-eric.dumazet@gmail.com (mailing list archive) |
---|---|
State | Accepted |
Commit | 68822bdf76f10c3dc80609d4e2cdc1e847429086 |
Delegated to: | Netdev Maintainers |
Series | [v2,net-next] net: generalize skb freeing deferral to per-cpu lists |
Hello,

I'm sorry for the late feedback. I have only a possibly relevant point
below.

On Fri, 2022-04-22 at 13:12 -0700, Eric Dumazet wrote:
[...]
> @@ -6571,6 +6577,28 @@ static int napi_threaded_poll(void *data)
>  	return 0;
>  }
>  
> +static void skb_defer_free_flush(struct softnet_data *sd)
> +{
> +	struct sk_buff *skb, *next;
> +	unsigned long flags;
> +
> +	/* Paired with WRITE_ONCE() in skb_attempt_defer_free() */
> +	if (!READ_ONCE(sd->defer_list))
> +		return;
> +
> +	spin_lock_irqsave(&sd->defer_lock, flags);
> +	skb = sd->defer_list;

I *think* that this read can possibly be fused with the previous one,
and another READ_ONCE() should avoid that.

BTW it looks like this version gives slightly better results than the
previous one, perhaps due to the singly-linked list usage?

Thanks!

Paolo
On Tue, Apr 26, 2022 at 12:38 AM Paolo Abeni <pabeni@redhat.com> wrote:
>
> Hello,
>
> I'm sorry for the late feedback. I have only a possibly relevant point
> below.
>
> On Fri, 2022-04-22 at 13:12 -0700, Eric Dumazet wrote:
> [...]
> > @@ -6571,6 +6577,28 @@ static int napi_threaded_poll(void *data)
> >  	return 0;
> >  }
> >  
> > +static void skb_defer_free_flush(struct softnet_data *sd)
> > +{
> > +	struct sk_buff *skb, *next;
> > +	unsigned long flags;
> > +
> > +	/* Paired with WRITE_ONCE() in skb_attempt_defer_free() */
> > +	if (!READ_ONCE(sd->defer_list))
> > +		return;
> > +
> > +	spin_lock_irqsave(&sd->defer_lock, flags);
> > +	skb = sd->defer_list;
>
> I *think* that this read can possibly be fused with the previous one,
> and another READ_ONCE() should avoid that.

Only the lockless read needs READ_ONCE().

For the one after spin_lock_irqsave(&sd->defer_lock, flags),
there is no need for any additional barrier.

If the compiler really wants to use multiple one-byte-at-a-time loads,
we are not going to fight, there might be good reasons for that.

(We do not want to spread READ_ONCE / WRITE_ONCE for all loads/stores,
as this has performance implications.)

> BTW it looks like this version gives slightly better results than the
> previous one, perhaps due to the singly-linked list usage?

Yes, this could be the case, or maybe it is because 10 runs are not
enough in a host with 32 RX queues, with a 50/50 split between the two
NUMA nodes.

When reaching high throughput, every detail matters, like background
usage on the network, from monitoring and machine health daemons.

Thanks.
On Tue, 2022-04-26 at 06:11 -0700, Eric Dumazet wrote:
> On Tue, Apr 26, 2022 at 12:38 AM Paolo Abeni <pabeni@redhat.com> wrote:
> > On Fri, 2022-04-22 at 13:12 -0700, Eric Dumazet wrote:
> > [...]
> > > @@ -6571,6 +6577,28 @@ static int napi_threaded_poll(void *data)
> > >  	return 0;
> > >  }
> > >  
> > > +static void skb_defer_free_flush(struct softnet_data *sd)
> > > +{
> > > +	struct sk_buff *skb, *next;
> > > +	unsigned long flags;
> > > +
> > > +	/* Paired with WRITE_ONCE() in skb_attempt_defer_free() */
> > > +	if (!READ_ONCE(sd->defer_list))
> > > +		return;
> > > +
> > > +	spin_lock_irqsave(&sd->defer_lock, flags);
> > > +	skb = sd->defer_list;
> >
> > I *think* that this read can possibly be fused with the previous one,
> > and another READ_ONCE() should avoid that.
>
> Only the lockless read needs READ_ONCE().
>
> For the one after spin_lock_irqsave(&sd->defer_lock, flags),
> there is no need for any additional barrier.
>
> If the compiler really wants to use multiple one-byte-at-a-time loads,
> we are not going to fight, there might be good reasons for that.

I'm unsure I explained my doubt in a clear way: what I fear is that the
compiler could emit a single read instruction, corresponding to the
READ_ONCE() outside the lock, so that the spin-locked section will
operate on an "old" defer_list.

If that happens we could end up with 'defer_count' underestimating the
list length. It looks like that is tolerable, as we will still be
protected against defer_list growing too much.

Acked-by: Paolo Abeni <pabeni@redhat.com>
On Tue, Apr 26, 2022 at 8:28 AM Paolo Abeni <pabeni@redhat.com> wrote:
>
> I'm unsure I explained my doubt in a clear way: what I fear is that the
> compiler could emit a single read instruction, corresponding to the
> READ_ONCE() outside the lock, so that the spin-locked section will
> operate on an "old" defer_list.
>
> If that happens we could end up with 'defer_count' underestimating the
> list length. It looks like that is tolerable, as we will still be
> protected against defer_list growing too much.

defer_count is always read/written under the protection of the spinlock.

It must be very precise, unless I am mistaken.

> Acked-by: Paolo Abeni <pabeni@redhat.com>

Thanks !
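A minimal user-space sketch of the pattern being debated above, for
readers following along (illustrative names only; a pthread mutex stands
in for sd->defer_lock, and READ_ONCE() is shown with a simplified
version of its kernel definition, a volatile load):

#include <pthread.h>
#include <stddef.h>

#define READ_ONCE(x) (*(const volatile __typeof__(x) *)&(x))

static pthread_mutex_t defer_lock = PTHREAD_MUTEX_INITIALIZER;
static void *defer_list;	/* only written under defer_lock */

void *detach_all(void)
{
	void *head;

	/* Lockless fast path: READ_ONCE() forces one real load here,
	 * so the compiler can neither tear it nor reuse a stale value.
	 */
	if (!READ_ONCE(defer_list))
		return NULL;

	pthread_mutex_lock(&defer_lock);
	/* Eric's point: the lock acquisition is an opaque call and thus
	 * a compiler barrier, so this plain read cannot be fused with
	 * the READ_ONCE() above; defer_list must be re-read here.
	 */
	head = defer_list;
	defer_list = NULL;
	pthread_mutex_unlock(&defer_lock);
	return head;
}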
Hello:

This patch was applied to netdev/net-next.git (master)
by Jakub Kicinski <kuba@kernel.org>:

On Fri, 22 Apr 2022 13:12:37 -0700 you wrote:
> From: Eric Dumazet <edumazet@google.com>
>
> Logic added in commit f35f821935d8 ("tcp: defer skb freeing after socket
> lock is released") helped bulk TCP flows to move the cost of skb
> frees outside of the critical section where the socket lock was held.
>
> But for RPC traffic, or hosts with RFS enabled, the solution is far from
> being ideal.
>
> [...]

Here is the summary with links:
  - [v2,net-next] net: generalize skb freeing deferral to per-cpu lists
    https://git.kernel.org/netdev/net-next/c/68822bdf76f1

You are awesome, thank you!
On Fri, Apr 22, 2022 at 01:12:37PM -0700, Eric Dumazet wrote:
> From: Eric Dumazet <edumazet@google.com>
>
> Logic added in commit f35f821935d8 ("tcp: defer skb freeing after socket
> lock is released") helped bulk TCP flows to move the cost of skb
> frees outside of the critical section where the socket lock was held.
>
> But for RPC traffic, or hosts with RFS enabled, the solution is far from
> being ideal.
>
> For RPC traffic, recvmsg() has to return to user space right after
> the skb payload has been consumed, meaning that the BH handler has no
> chance to pick the skb before the recvmsg() thread. This issue is more
> visible with BIG TCP, as more RPCs fit one skb.
>
> For RFS, even if the BH handler picks the skbs, they are still picked
> from the cpu on which the user thread is running.
>
> Ideally, it is better to free the skbs (and associated page frags)
> on the cpu that originally allocated them.
>
> This patch removes the per-socket anchor (sk->defer_list) and
> instead uses a per-cpu list, which will hold more skbs per round.
>
> This new per-cpu list is drained at the end of net_rx_action(),
> after incoming packets have been processed, to lower latencies.
>
> In normal conditions, skbs are added to the per-cpu list with
> no further action. In the (unlikely) cases where the cpu does not
> run the net_rx_action() handler fast enough, we use an IPI to raise
> NET_RX_SOFTIRQ on the remote cpu.
>
> Also, we do not bother draining the per-cpu list from dev_cpu_dead().
> This is because skbs in this list have no requirement on how fast
> they should be freed.
>
> Note that we can add in the future a small per-cpu cache
> if we see any contention on sd->defer_lock.
>
> Tested on a pair of hosts with 100Gbit NICs, RFS enabled,
> and /proc/sys/net/ipv4/tcp_rmem[2] tuned to 16MB to work around
> the page recycling strategy used by the NIC driver (its page pool
> capacity being too small compared to the number of skbs/pages held
> in socket receive queues).
>
> Note that this tuning was only done to demonstrate worse
> conditions for skb freeing for this particular test.
> These conditions can happen in more general production workloads.
>
> [...]
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>

Eric, with this patch I'm seeing memory leaks such as these [1][2] after
boot. The system is using the igb driver for its management interface
[3]. The leaks disappear after reverting the patch.

Any ideas?

Let me know if you need more info. I can easily test a patch.

Thanks

[1]
# cat /sys/kernel/debug/kmemleak
unreferenced object 0xffff888170143740 (size 216):
  comm "softirq", pid 0, jiffies 4294825261 (age 95.244s)
  hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
    00 00 17 0f 81 88 ff ff 00 00 00 00 00 00 00 00  ................
  backtrace:
    [<ffffffff82571fc0>] napi_skb_cache_get+0xf0/0x180
    [<ffffffff8257206a>] __napi_build_skb+0x1a/0x60
    [<ffffffff8257b0f3>] napi_build_skb+0x23/0x350
    [<ffffffffa0469592>] igb_poll+0x2b72/0x5880 [igb]
    [<ffffffff825f9584>] __napi_poll.constprop.0+0xb4/0x480
    [<ffffffff825f9d5a>] net_rx_action+0x40a/0xc60
    [<ffffffff82e00295>] __do_softirq+0x295/0x9fe
    [<ffffffff81185bcc>] __irq_exit_rcu+0x11c/0x180
    [<ffffffff8118622a>] irq_exit_rcu+0xa/0x20
    [<ffffffff82bbed39>] common_interrupt+0xa9/0xc0
    [<ffffffff82c00b5e>] asm_common_interrupt+0x1e/0x40
    [<ffffffff824a186e>] cpuidle_enter_state+0x27e/0xcb0
    [<ffffffff824a236f>] cpuidle_enter+0x4f/0xa0
    [<ffffffff8126a290>] do_idle+0x3b0/0x4b0
    [<ffffffff8126a869>] cpu_startup_entry+0x19/0x20
    [<ffffffff810f4725>] start_secondary+0x265/0x340

[2]
# cat /sys/kernel/debug/kmemleak
unreferenced object 0xffff88810ce3aac0 (size 216):
  comm "softirq", pid 0, jiffies 4294861408 (age 64.607s)
  hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
    00 c0 7b 07 81 88 ff ff 00 00 00 00 00 00 00 00  ..{.............
  backtrace:
    [<ffffffff82575539>] __alloc_skb+0x229/0x360
    [<ffffffff8290bd3c>] __tcp_send_ack.part.0+0x6c/0x760
    [<ffffffff8291a062>] tcp_send_ack+0x82/0xa0
    [<ffffffff828cb6db>] __tcp_ack_snd_check+0x15b/0xa00
    [<ffffffff828f17fe>] tcp_rcv_established+0x198e/0x2120
    [<ffffffff829363b5>] tcp_v4_do_rcv+0x665/0x9a0
    [<ffffffff8293d8ae>] tcp_v4_rcv+0x2c1e/0x32f0
    [<ffffffff828610b3>] ip_protocol_deliver_rcu+0x53/0x2c0
    [<ffffffff828616eb>] ip_local_deliver+0x3cb/0x620
    [<ffffffff8285e66f>] ip_sublist_rcv_finish+0x9f/0x2c0
    [<ffffffff82860895>] ip_list_rcv_finish.constprop.0+0x525/0x6f0
    [<ffffffff82861f88>] ip_list_rcv+0x318/0x460
    [<ffffffff825f5e61>] __netif_receive_skb_list_core+0x541/0x8f0
    [<ffffffff825f8043>] netif_receive_skb_list_internal+0x763/0xdc0
    [<ffffffff826c3025>] napi_gro_complete.constprop.0+0x5a5/0x700
    [<ffffffff826c44ed>] dev_gro_receive+0xf2d/0x23f0
unreferenced object 0xffff888175e1afc0 (size 216):
  comm "sshd", pid 1024, jiffies 4294861424 (age 64.591s)
  hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
    00 c0 7b 07 81 88 ff ff 00 00 00 00 00 00 00 00  ..{.............
  backtrace:
    [<ffffffff82575539>] __alloc_skb+0x229/0x360
    [<ffffffff8258201c>] alloc_skb_with_frags+0x9c/0x720
    [<ffffffff8255f333>] sock_alloc_send_pskb+0x7b3/0x940
    [<ffffffff82876af4>] __ip_append_data+0x1874/0x36d0
    [<ffffffff8287f283>] ip_make_skb+0x263/0x2e0
    [<ffffffff82978afa>] udp_sendmsg+0x1c8a/0x29d0
    [<ffffffff829af94e>] inet_sendmsg+0x9e/0xe0
    [<ffffffff8255082d>] __sys_sendto+0x23d/0x360
    [<ffffffff82550a31>] __x64_sys_sendto+0xe1/0x1b0
    [<ffffffff82bbde05>] do_syscall_64+0x35/0x80
    [<ffffffff82c00068>] entry_SYSCALL_64_after_hwframe+0x44/0xae

[3]
# ethtool -i enp8s0
driver: igb
version: 5.18.0-rc3-custom-91743-g481c1b
firmware-version: 3.25, 0x80000708, 1.1824.0
expansion-rom-version:
bus-info: 0000:08:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: yes
On Wed, Apr 27, 2022 at 8:34 AM Ido Schimmel <idosch@idosch.org> wrote:
>
> Eric, with this patch I'm seeing memory leaks such as these [1][2] after
> boot. The system is using the igb driver for its management interface
> [3]. The leaks disappear after reverting the patch.
>
> Any ideas?

No idea, skbs allocated to send an ACK cannot be stored in the receive
queue, I guess this is a kmemleak false positive.

Stress your host for hours, and check if there are real kmemleaks, as
in memory being depleted.

> Let me know if you need more info. I can easily test a patch.
>
> Thanks
>
> [...]
On Wed, Apr 27, 2022 at 9:53 AM Eric Dumazet <edumazet@google.com> wrote:
>
> On Wed, Apr 27, 2022 at 8:34 AM Ido Schimmel <idosch@idosch.org> wrote:
> >
> > Eric, with this patch I'm seeing memory leaks such as these [1][2] after
> > boot. The system is using the igb driver for its management interface
> > [3]. The leaks disappear after reverting the patch.
> >
> > Any ideas?
>
> No idea, skbs allocated to send an ACK cannot be stored in the receive
> queue, I guess this is a kmemleak false positive.
>
> Stress your host for hours, and check if there are real kmemleaks, as
> in memory being depleted.

At first when I saw your report I wondered if the following was needed,
but I do not think so. Nothing in __kfree_skb(skb) cares about skb->next.

But you might give it a try, maybe I am wrong.

diff --git a/net/core/dev.c b/net/core/dev.c
index 611bd719706412723561c27753150b27e1dc4e7a..9dc3443649b962586cc059899ac3d71e9c7a3559 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -6594,6 +6594,7 @@ static void skb_defer_free_flush(struct softnet_data *sd)
 
 	while (skb != NULL) {
 		next = skb->next;
+		skb_mark_not_on_list(skb);
 		__kfree_skb(skb);
 		skb = next;
 	}
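For reference, skb_mark_not_on_list() is this one-line helper from
include/linux/skbuff.h; it detaches an skb from a list by clearing its
->next pointer:

static inline void skb_mark_not_on_list(struct sk_buff *skb)
{
	skb->next = NULL;
}

which is why, as noted above, adding it makes no difference to
__kfree_skb().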
On Wed, Apr 27, 2022 at 10:11 AM Eric Dumazet <edumazet@google.com> wrote:
>
> At first when I saw your report I wondered if the following was needed,
> but I do not think so. Nothing in __kfree_skb(skb) cares about skb->next.
>
> But you might give it a try, maybe I am wrong.
>
> [...]

Oh well, there is definitely a leak, sorry for that.
On Wed, Apr 27, 2022 at 10:59 AM Eric Dumazet <edumazet@google.com> wrote:
>
> [...]
>
> Oh well, there is definitely a leak, sorry for that.

Ido, can you test if the following patch solves your issue?

It definitely looks needed.

Thanks!

diff --git a/net/core/dev.c b/net/core/dev.c
index 611bd719706412723561c27753150b27e1dc4e7a..e09cd202fc579dfe2313243e20def8044aafafa2 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -6617,7 +6617,7 @@ static __latent_entropy void net_rx_action(struct softirq_action *h)
 
 		if (list_empty(&list)) {
 			if (!sd_has_rps_ipi_waiting(sd) && list_empty(&repoll))
-				return;
+				goto end;
 			break;
 		}
 
@@ -6644,6 +6644,7 @@ static __latent_entropy void net_rx_action(struct softirq_action *h)
 		__raise_softirq_irqoff(NET_RX_SOFTIRQ);
 
 	net_rps_action_and_irq_enable(sd);
+end:
 	skb_defer_free_flush(sd);
 }
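The shape of this bug is easy to see outside the kernel. A
self-contained schematic of the fixed control flow (every name below is
an illustrative stand-in, not a kernel symbol):

#include <stdio.h>

/* Stand-in for the defer_list/defer_count pair in struct softnet_data. */
struct sd_sketch {
	int pending_napi;	/* work for the main poll loop */
	int defer_len;		/* items queued by remote "cpus" */
};

static void defer_free_flush(struct sd_sketch *sd)
{
	sd->defer_len = 0;	/* the real code walks and frees sd->defer_list */
}

static void rx_action(struct sd_sketch *sd)
{
	for (;;) {
		if (sd->pending_napi == 0) {
			/* Before the fix this was "return", so the flush
			 * below never ran on the common no-work path and
			 * remotely queued skbs accumulated forever.
			 */
			goto end;
		}
		sd->pending_napi--;	/* stand-in for polling one napi */
	}
end:
	defer_free_flush(sd);
}

int main(void)
{
	struct sd_sketch sd = { .pending_napi = 0, .defer_len = 3 };

	rx_action(&sd);
	printf("defer_len after rx_action: %d\n", sd.defer_len);	/* 0 */
	return 0;
}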
On Fri, Apr 22, 2022 at 01:12:37PM -0700, Eric Dumazet wrote:
> From: Eric Dumazet <edumazet@google.com>
>
> Logic added in commit f35f821935d8 ("tcp: defer skb freeing after socket
> lock is released") helped bulk TCP flows to move the cost of skb
> frees outside of the critical section where the socket lock was held.
>
> [...]
>
> Note that we can add in the future a small per-cpu cache
> if we see any contention on sd->defer_lock.

Hmm, we started to see some memory leak reports from kmemleak that have
been around for hours without being freed since yesterday's linux-next
tree, which included this commit. Any thoughts?

unreferenced object 0xffff400610f55cc0 (size 216):
  comm "git-remote-http", pid 781180, jiffies 4314091475 (age 4323.740s)
  hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
    00 c0 7e 87 ff 3f ff ff 00 00 00 00 00 00 00 00  ..~..?..........
  backtrace:
    kmem_cache_alloc_node
    __alloc_skb
    __tcp_send_ack.part.0
    tcp_send_ack
    tcp_cleanup_rbuf
    tcp_recvmsg_locked
    tcp_recvmsg
    inet_recvmsg
    __sys_recvfrom
    __arm64_sys_recvfrom
    invoke_syscall
    el0_svc_common.constprop.0
    do_el0_svc
    el0_svc
    el0t_64_sync_handler
    el0t_64_sync
unreferenced object 0xffff4001e58f0c40 (size 216):
  comm "git-remote-http", pid 781180, jiffies 4314091483 (age 4323.968s)
  hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
    00 c0 7e 87 ff 3f ff ff 00 00 00 00 00 00 00 00  ..~..?..........
  backtrace:
    kmem_cache_alloc_node
    __alloc_skb
    __tcp_send_ack.part.0
    tcp_send_ack
    tcp_cleanup_rbuf
    tcp_recvmsg_locked
    tcp_recvmsg
    inet_recvmsg
    __sys_recvfrom
    __arm64_sys_recvfrom
    invoke_syscall
    el0_svc_common.constprop.0
    do_el0_svc
    el0_svc
    el0t_64_sync_handler
    el0t_64_sync
On Fri, Apr 29, 2022 at 9:18 AM Qian Cai <quic_qiancai@quicinc.com> wrote:
>
> Hmm, we started to see some memory leak reports from kmemleak that have
> been around for hours without being freed since yesterday's linux-next
> tree, which included this commit. Any thoughts?
>
> [...]

Which tree are you using?

Ido said the leak has been fixed in
https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git/commit/?id=f3412b3879b4f7c4313b186b03940d4791345534
On Fri, Apr 29, 2022 at 09:23:13AM -0700, Eric Dumazet wrote:
> Which tree are you using?

linux-next tree, tag next-20220428

> Ido said the leak has been fixed in
> https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git/commit/?id=f3412b3879b4f7c4313b186b03940d4791345534

Cool, next-20220429 is running fine so far, which included the patch.
Greeting,

FYI, we noticed the following commit (built with gcc-11):

commit: 72fd55c0dbb1ad5ba283ca80abc9546702815a33 ("[PATCH v2 net-next] net: generalize skb freeing deferral to per-cpu lists")
url: https://github.com/intel-lab-lkp/linux/commits/Eric-Dumazet/net-generalize-skb-freeing-deferral-to-per-cpu-lists/20220423-060710
base: https://git.kernel.org/cgit/linux/kernel/git/davem/net-next.git c78c5a660439d4d341a03b651541fda3ebe76160
patch link: https://lore.kernel.org/netdev/20220422201237.416238-1-eric.dumazet@gmail.com

in testcase: xfstests
version: xfstests-x86_64-46e1b83-1_20220414
with following parameters:

	disk: 4HDD
	fs: ext4
	fs2: smbv3
	test: generic-group-10
	ucode: 0xec

test-description: xfstests is a regression test suite for xfs and other filesystems.
test-url: git://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git

on test machine: 8 threads Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz with 16G memory

caused below changes (please refer to attached dmesg/kmsg for entire log/backtrace):

If you fix the issue, kindly add following tag
Reported-by: kernel test robot <oliver.sang@intel.com>

[   80.428226][ T1836] Attempt to set a LOCK_MAND lock via flock(2). This support has been removed and the request ignored.
[   80.739189][  T370] generic/207	0s
[   80.739198][  T370]
[   80.785302][ T1696] run fstests generic/208 at 2022-05-06 02:42:26
[   81.143444][ T1836] Attempt to set a LOCK_MAND lock via flock(2). This support has been removed and the request ignored.
[   89.609627][   T58] kworker/u16:5 invoked oom-killer: gfp_mask=0xcd0(GFP_KERNEL|__GFP_RECLAIMABLE), order=0, oom_score_adj=0
[   89.620805][   T58] CPU: 0 PID: 58 Comm: kworker/u16:5 Not tainted 5.18.0-rc3-00568-g72fd55c0dbb1 #1
[   89.629899][   T58] Hardware name: HP HP Z240 SFF Workstation/802E, BIOS N51 Ver. 01.63 10/05/2017
[   89.638822][   T58] Workqueue: writeback wb_workfn (flush-8:16)
[   89.644727][   T58] Call Trace:
[   89.647863][   T58]  <TASK>
[   89.650649][   T58] dump_stack_lvl (kbuild/src/consumer/lib/dump_stack.c:107 (discriminator 1))
[   89.654992][   T58] dump_header (kbuild/src/consumer/mm/oom_kill.c:73 kbuild/src/consumer/mm/oom_kill.c:461)
[   89.659250][   T58] oom_kill_process.cold (kbuild/src/consumer/mm/oom_kill.c:979)
[   89.664110][   T58] out_of_memory (kbuild/src/consumer/mm/oom_kill.c:1119 (discriminator 4))

To reproduce:

	git clone https://github.com/intel/lkp-tests.git
	cd lkp-tests
	sudo bin/lkp install job.yaml           # job file is attached in this email
	bin/lkp split-job --compatible job.yaml # generate the yaml file for lkp run
	sudo bin/lkp run generated-yaml-file

	# if come across any failure that blocks the test,
	# please remove ~/.lkp and /lkp dir to run from a clean state.
On Thu, May 5, 2022 at 11:44 PM kernel test robot <oliver.sang@intel.com> wrote:
>
> Greeting,
>
> FYI, we noticed the following commit (built with gcc-11):
>
> commit: 72fd55c0dbb1ad5ba283ca80abc9546702815a33 ("[PATCH v2 net-next] net: generalize skb freeing deferral to per-cpu lists")
> url: https://github.com/intel-lab-lkp/linux/commits/Eric-Dumazet/net-generalize-skb-freeing-deferral-to-per-cpu-lists/20220423-060710
> base: https://git.kernel.org/cgit/linux/kernel/git/davem/net-next.git c78c5a660439d4d341a03b651541fda3ebe76160
> patch link: https://lore.kernel.org/netdev/20220422201237.416238-1-eric.dumazet@gmail.com
>
> [...]

I think this commit had two follow-up fixes.

Make sure to test the tree after the fixes are included, otherwise this
is adding unneeded noise.

Thank you.

commit f3412b3879b4f7c4313b186b03940d4791345534
Author: Eric Dumazet <edumazet@google.com>
Date:   Wed Apr 27 13:41:47 2022 -0700

    net: make sure net_rx_action() calls skb_defer_free_flush()

And:

commit 783d108dd71d97e4cac5fe8ce70ca43ed7dc7bb7
Author: Eric Dumazet <edumazet@google.com>
Date:   Fri Apr 29 18:15:23 2022 -0700

    tcp: drop skb dst in tcp_rcv_established()
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 7dccbfd1bf5635c27514c70b4a06d3e6f74395dd..ac8a5f71220a999aebabd73d8df2c8e2b1325ad4 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -3081,6 +3081,11 @@ struct softnet_data {
 	struct sk_buff_head	input_pkt_queue;
 	struct napi_struct	backlog;
 
+	/* Another possibly contended cache line */
+	spinlock_t		defer_lock ____cacheline_aligned_in_smp;
+	int			defer_count;
+	struct sk_buff		*defer_list;
+	call_single_data_t	defer_csd;
 };
 
 static inline void input_queue_head_incr(struct softnet_data *sd)
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 84d78df60453955a8eaf05847f6e2145176a727a..5cbc184ca685d886306ccff70b82cd409082c229 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -888,6 +888,7 @@ typedef unsigned char *sk_buff_data_t;
  *		delivery_time at egress.
  *	@napi_id: id of the NAPI struct this skb came from
  *	@sender_cpu: (aka @napi_id) source CPU in XPS
+ *	@alloc_cpu: CPU which did the skb allocation.
  *	@secmark: security marking
  *	@mark: Generic packet mark
  *	@reserved_tailroom: (aka @mark) number of bytes of free space available
@@ -1080,6 +1081,7 @@ struct sk_buff {
 		unsigned int	sender_cpu;
 	};
 #endif
+	u16			alloc_cpu;
 #ifdef CONFIG_NETWORK_SECMARK
 	__u32		secmark;
 #endif
@@ -1321,6 +1323,7 @@ struct sk_buff *__build_skb(void *data, unsigned int frag_size);
 struct sk_buff *build_skb(void *data, unsigned int frag_size);
 struct sk_buff *build_skb_around(struct sk_buff *skb,
 				 void *data, unsigned int frag_size);
+void skb_attempt_defer_free(struct sk_buff *skb);
 
 struct sk_buff *napi_build_skb(void *data, unsigned int frag_size);
 
diff --git a/include/net/sock.h b/include/net/sock.h
index a01d6c421aa2caad4032167284eed9565d4bd545..f9f8ecae0f8decb3e0e74c6efaff5b890e3685ea 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -292,7 +292,6 @@ struct sk_filter;
  *	@sk_pacing_shift: scaling factor for TCP Small Queues
  *	@sk_lingertime: %SO_LINGER l_linger setting
  *	@sk_backlog: always used with the per-socket spinlock held
- *	@defer_list: head of llist storing skbs to be freed
  *	@sk_callback_lock: used with the callbacks in the end of this struct
  *	@sk_error_queue: rarely used
  *	@sk_prot_creator: sk_prot of original sock creator (see ipv6_setsockopt,
@@ -417,7 +416,6 @@ struct sock {
 		struct sk_buff	*head;
 		struct sk_buff	*tail;
 	} sk_backlog;
-	struct llist_head defer_list;
 
 #define sk_rmem_alloc sk_backlog.rmem_alloc
 
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 679b1964d49414fcb1c361778fd0ac664e8c8ea5..94a52ad1101c12e13c2957e8b028b697742c451f 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -1375,18 +1375,6 @@ static inline bool tcp_checksum_complete(struct sk_buff *skb)
 bool tcp_add_backlog(struct sock *sk, struct sk_buff *skb,
 		     enum skb_drop_reason *reason);
 
-#ifdef CONFIG_INET
-void __sk_defer_free_flush(struct sock *sk);
-
-static inline void sk_defer_free_flush(struct sock *sk)
-{
-	if (llist_empty(&sk->defer_list))
-		return;
-	__sk_defer_free_flush(sk);
-}
-#else
-static inline void sk_defer_free_flush(struct sock *sk) {}
-#endif
 
 int tcp_filter(struct sock *sk, struct sk_buff *skb);
 void tcp_set_state(struct sock *sk, int state);
diff --git a/net/core/dev.c b/net/core/dev.c
index 4a77ebda4fb155581a5f761a864446a046987f51..611bd719706412723561c27753150b27e1dc4e7a 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4545,6 +4545,12 @@ static void rps_trigger_softirq(void *data)
 
 #endif /* CONFIG_RPS */
 
+/* Called from hardirq (IPI) context */
+static void trigger_rx_softirq(void *data __always_unused)
+{
+	__raise_softirq_irqoff(NET_RX_SOFTIRQ);
+}
+
 /*
  * Check if this softnet_data structure is another cpu one
  * If yes, queue it to our IPI list and return 1
@@ -6571,6 +6577,28 @@ static int napi_threaded_poll(void *data)
 	return 0;
 }
 
+static void skb_defer_free_flush(struct softnet_data *sd)
+{
+	struct sk_buff *skb, *next;
+	unsigned long flags;
+
+	/* Paired with WRITE_ONCE() in skb_attempt_defer_free() */
+	if (!READ_ONCE(sd->defer_list))
+		return;
+
+	spin_lock_irqsave(&sd->defer_lock, flags);
+	skb = sd->defer_list;
+	sd->defer_list = NULL;
+	sd->defer_count = 0;
+	spin_unlock_irqrestore(&sd->defer_lock, flags);
+
+	while (skb != NULL) {
+		next = skb->next;
+		__kfree_skb(skb);
+		skb = next;
+	}
+}
+
 static __latent_entropy void net_rx_action(struct softirq_action *h)
 {
 	struct softnet_data *sd = this_cpu_ptr(&softnet_data);
@@ -6616,6 +6644,7 @@ static __latent_entropy void net_rx_action(struct softirq_action *h)
 		__raise_softirq_irqoff(NET_RX_SOFTIRQ);
 
 	net_rps_action_and_irq_enable(sd);
+	skb_defer_free_flush(sd);
 }
 
 struct netdev_adjacent {
@@ -11326,6 +11355,8 @@ static int __init net_dev_init(void)
 		INIT_CSD(&sd->csd, rps_trigger_softirq, sd);
 		sd->cpu = i;
 #endif
+		INIT_CSD(&sd->defer_csd, trigger_rx_softirq, NULL);
+		spin_lock_init(&sd->defer_lock);
 
 		init_gro_hash(&sd->backlog);
 		sd->backlog.poll = process_backlog;
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 30b523fa4ad2e9be30bdefdc61f70f989c345bbf..028a280fbabd5b69770ddd6bf0e00eae7651bbf1 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -204,7 +204,7 @@ static void __build_skb_around(struct sk_buff *skb, void *data,
 	skb_set_end_offset(skb, size);
 	skb->mac_header = (typeof(skb->mac_header))~0U;
 	skb->transport_header = (typeof(skb->transport_header))~0U;
-
+	skb->alloc_cpu = raw_smp_processor_id();
 	/* make sure we initialize shinfo sequentially */
 	shinfo = skb_shinfo(skb);
 	memset(shinfo, 0, offsetof(struct skb_shared_info, dataref));
@@ -1037,6 +1037,7 @@ static void __copy_skb_header(struct sk_buff *new, const struct sk_buff *old)
 #ifdef CONFIG_NET_RX_BUSY_POLL
 	CHECK_SKB_FIELD(napi_id);
 #endif
+	CHECK_SKB_FIELD(alloc_cpu);
 #ifdef CONFIG_XPS
 	CHECK_SKB_FIELD(sender_cpu);
 #endif
@@ -6486,3 +6487,51 @@ void __skb_ext_put(struct skb_ext *ext)
 }
 EXPORT_SYMBOL(__skb_ext_put);
 #endif /* CONFIG_SKB_EXTENSIONS */
+
+/**
+ * skb_attempt_defer_free - queue skb for remote freeing
+ * @skb: buffer
+ *
+ * Put @skb in a per-cpu list, using the cpu which
+ * allocated the skb/pages to reduce false sharing
+ * and memory zone spinlock contention.
+ */
+void skb_attempt_defer_free(struct sk_buff *skb)
+{
+	int cpu = skb->alloc_cpu;
+	struct softnet_data *sd;
+	unsigned long flags;
+	bool kick;
+
+	if (WARN_ON_ONCE(cpu >= nr_cpu_ids) ||
+	    !cpu_online(cpu) ||
+	    cpu == raw_smp_processor_id()) {
+		__kfree_skb(skb);
+		return;
+	}
+
+	sd = &per_cpu(softnet_data, cpu);
+	/* We do not send an IPI or any signal.
+	 * Remote cpu will eventually call skb_defer_free_flush()
+	 */
+	spin_lock_irqsave(&sd->defer_lock, flags);
+	skb->next = sd->defer_list;
+	/* Paired with READ_ONCE() in skb_defer_free_flush() */
+	WRITE_ONCE(sd->defer_list, skb);
+	sd->defer_count++;
+
+	/* kick every time queue length reaches 128.
+	 * This should avoid blocking in smp_call_function_single_async().
+	 * This condition should hardly be hit under normal conditions,
+	 * unless cpu suddenly stopped to receive NIC interrupts.
+	 */
+	kick = sd->defer_count == 128;
+
+	spin_unlock_irqrestore(&sd->defer_lock, flags);
+
+	/* Make sure to trigger NET_RX_SOFTIRQ on the remote CPU
+	 * if we are unlucky enough (this seems very unlikely).
+	 */
+	if (unlikely(kick))
+		smp_call_function_single_async(cpu, &sd->defer_csd);
+}
diff --git a/net/core/sock.c b/net/core/sock.c
index 29abec3eabd8905f2671e0b5789878a129453ef6..a0f3989de3d62456665e8b6382a4681fba17d60c 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -2082,9 +2082,6 @@ void sk_destruct(struct sock *sk)
 {
 	bool use_call_rcu = sock_flag(sk, SOCK_RCU_FREE);
 
-	WARN_ON_ONCE(!llist_empty(&sk->defer_list));
-	sk_defer_free_flush(sk);
-
 	if (rcu_access_pointer(sk->sk_reuseport_cb)) {
 		reuseport_detach_sock(sk);
 		use_call_rcu = true;
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index e20b87b3bf907a9b04b7531936129fb729e96c52..db55af9eb37b56bf0ec3b47212240c0302b86a1f 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -843,7 +843,6 @@ ssize_t tcp_splice_read(struct socket *sock, loff_t *ppos,
 	}
 
 	release_sock(sk);
-	sk_defer_free_flush(sk);
 
 	if (spliced)
 		return spliced;
@@ -1589,20 +1588,6 @@ void tcp_cleanup_rbuf(struct sock *sk, int copied)
 		tcp_send_ack(sk);
 }
 
-void __sk_defer_free_flush(struct sock *sk)
-{
-	struct llist_node *head;
-	struct sk_buff *skb, *n;
-
-	head = llist_del_all(&sk->defer_list);
-	llist_for_each_entry_safe(skb, n, head, ll_node) {
-		prefetch(n);
-		skb_mark_not_on_list(skb);
-		__kfree_skb(skb);
-	}
-}
-EXPORT_SYMBOL(__sk_defer_free_flush);
-
 static void tcp_eat_recv_skb(struct sock *sk, struct sk_buff *skb)
 {
 	__skb_unlink(skb, &sk->sk_receive_queue);
@@ -1610,11 +1595,7 @@ static void tcp_eat_recv_skb(struct sock *sk, struct sk_buff *skb)
 		sock_rfree(skb);
 		skb->destructor = NULL;
 		skb->sk = NULL;
-		if (!skb_queue_empty(&sk->sk_receive_queue) ||
-		    !llist_empty(&sk->defer_list)) {
-			llist_add(&skb->ll_node, &sk->defer_list);
-			return;
-		}
+		return skb_attempt_defer_free(skb);
 	}
 	__kfree_skb(skb);
 }
@@ -2453,7 +2434,6 @@ static int tcp_recvmsg_locked(struct sock *sk, struct msghdr *msg, size_t len,
 			__sk_flush_backlog(sk);
 		} else {
 			tcp_cleanup_rbuf(sk, copied);
-			sk_defer_free_flush(sk);
 			sk_wait_data(sk, &timeo, last);
 		}
 
@@ -2571,7 +2551,6 @@ int tcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len, int flags,
 	lock_sock(sk);
 	ret = tcp_recvmsg_locked(sk, msg, len, flags, &tss, &cmsg_flags);
 	release_sock(sk);
-	sk_defer_free_flush(sk);
 
 	if (cmsg_flags && ret >= 0) {
 		if (cmsg_flags & TCP_CMSG_TS)
@@ -3096,7 +3075,6 @@ int tcp_disconnect(struct sock *sk, int flags)
 		sk->sk_frag.page = NULL;
 		sk->sk_frag.offset = 0;
 	}
-	sk_defer_free_flush(sk);
 	sk_error_report(sk);
 	return 0;
 }
@@ -4225,7 +4203,6 @@ static int do_tcp_getsockopt(struct sock *sk, int level,
 		err = BPF_CGROUP_RUN_PROG_GETSOCKOPT_KERN(sk, level, optname,
 							  &zc, &len, err);
 		release_sock(sk);
-		sk_defer_free_flush(sk);
 		if (len >= offsetofend(struct tcp_zerocopy_receive, msg_flags))
 			goto zerocopy_rcv_cmsg;
 		switch (len) {
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 2c2d421425557c188c4bcf3dc113baea62e915c7..918816ec5dd49abe321f0179a2a64ca9a989a01c 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -2065,7 +2065,6 @@ int tcp_v4_rcv(struct sk_buff *skb)
 
 	sk_incoming_cpu_update(sk);
 
-	sk_defer_free_flush(sk);
 	bh_lock_sock_nested(sk);
 	tcp_segs_in(tcp_sk(sk), skb);
 	ret = 0;
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index 54277de7474b78f1fea033b7978acebc0647f3ad..60bdec257ba7220d6c05b48208a587c7be2b4087 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -1728,7 +1728,6 @@ INDIRECT_CALLABLE_SCOPE int tcp_v6_rcv(struct sk_buff *skb)
 
 	sk_incoming_cpu_update(sk);
 
-	sk_defer_free_flush(sk);
 	bh_lock_sock_nested(sk);
 	tcp_segs_in(tcp_sk(sk), skb);
 	ret = 0;
diff --git a/net/tls/tls_sw.c b/net/tls/tls_sw.c
index ddbe05ec5489dd352dee832e038884339f338b43..bc54f6c5b1a4cabbfe1e3eff1768128b2730c730 100644
--- a/net/tls/tls_sw.c
+++ b/net/tls/tls_sw.c
@@ -1911,7 +1911,6 @@ int tls_sw_recvmsg(struct sock *sk,
 
 end:
 	release_sock(sk);
-	sk_defer_free_flush(sk);
 	if (psock)
 		sk_psock_put(sk, psock);
 	return copied ? : err;
@@ -1983,7 +1982,6 @@ ssize_t tls_sw_splice_read(struct socket *sock,  loff_t *ppos,
 
 splice_read_end:
 	release_sock(sk);
-	sk_defer_free_flush(sk);
 	return copied ? : err;
 }
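To see the whole mechanism end to end, here is a compact user-space
analogue of the scheme the patch introduces (pthread mutexes standing in
for sd->defer_lock, malloc/free for skb allocation; all names are
illustrative and the IPI kick is elided):

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

#define READ_ONCE(x)	 (*(const volatile __typeof__(x) *)&(x))
#define WRITE_ONCE(x, v) (*(volatile __typeof__(x) *)&(x) = (v))

struct item {
	struct item *next;
	int payload;
};

/* Per-"cpu" state: the defer_lock/defer_list/defer_count trio the patch
 * adds to struct softnet_data.
 */
struct percpu_defer {
	pthread_mutex_t lock;
	struct item *list;
	int count;
};

/* Analogue of skb_attempt_defer_free(): instead of freeing locally, push
 * the item onto the owning "cpu"'s singly-linked list under its lock.
 */
static void attempt_defer_free(struct percpu_defer *owner, struct item *it)
{
	pthread_mutex_lock(&owner->lock);
	it->next = owner->list;
	WRITE_ONCE(owner->list, it);	/* paired with READ_ONCE() in flush */
	owner->count++;
	pthread_mutex_unlock(&owner->lock);
	/* the kernel kicks an IPI when count reaches 128; elided here */
}

/* Analogue of skb_defer_free_flush(): the owner detaches the whole list
 * under the lock, then frees it outside the lock.
 */
static void defer_free_flush(struct percpu_defer *sd)
{
	struct item *head, *next;

	if (!READ_ONCE(sd->list))	/* lockless empty check */
		return;

	pthread_mutex_lock(&sd->lock);
	head = sd->list;
	sd->list = NULL;
	sd->count = 0;
	pthread_mutex_unlock(&sd->lock);

	while (head) {
		next = head->next;
		free(head);
		head = next;
	}
}

int main(void)
{
	struct percpu_defer sd = { .lock = PTHREAD_MUTEX_INITIALIZER };

	for (int i = 0; i < 4; i++) {
		struct item *it = malloc(sizeof(*it));

		it->payload = i;
		attempt_defer_free(&sd, it);
	}
	defer_free_flush(&sd);
	printf("count after flush: %d\n", sd.count);	/* prints 0 */
	return 0;
}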