
[net] tcp: drop skb extensions before skb_attempt_defer_free

Message ID 879a4592e4e4bd0c30dbe29ca189e224ec1739a5.1739201151.git.sd@queasysnail.net (mailing list archive)
State Changes Requested
Delegated to: Netdev Maintainers
Series [net] tcp: drop skb extensions before skb_attempt_defer_free

Checks

Context Check Description
netdev/series_format success Single patches do not need cover letters
netdev/tree_selection success Clearly marked for net
netdev/ynl success Generated files up to date; no warnings/errors; no diff in generated;
netdev/fixes_present success Fixes tag present in non-next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 0 this patch: 0
netdev/build_tools success No tools touched, skip
netdev/cc_maintainers fail 1 blamed authors not CCed: kuba@kernel.org; 2 maintainers not CCed: horms@kernel.org kuba@kernel.org
netdev/build_clang success Errors and warnings before: 3 this patch: 3
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success Fixes tag looks correct
netdev/build_allmodconfig_warn success Errors and warnings before: 0 this patch: 0
netdev/checkpatch success total: 0 errors, 0 warnings, 0 checks, 14 lines checked
netdev/build_clang_rust success No Rust files in patch. Skipping build
netdev/kdoc success Errors and warnings before: 55 this patch: 55
netdev/source_inline success Was 0 now: 0
netdev/contest success net-next-2025-02-11--00-00 (tests: 889)

Commit Message

Sabrina Dubroca Feb. 10, 2025, 4:01 p.m. UTC
Xiumei reported hitting the WARN in xfrm6_tunnel_net_exit while
running tests that boil down to:
 - create a pair of netns
 - run a basic TCP test over ipcomp6
 - delete the pair of netns

The xfrm_state found on spi_byaddr was not deleted at the time we
delete the netns, because we still have a reference on it. This
lingering reference comes from a secpath (which holds a ref on the
xfrm_state), which is still attached to an skb. This skb is not
leaked, it ends up on sk_receive_queue and then gets defer-free'd by
skb_attempt_defer_free.

The problem happens when we defer freeing an skb (push it on one CPU's
defer_list), and don't flush that list before the netns is deleted. In
that case, we still have a reference on the xfrm_state that we don't
expect at this point.

tcp_eat_recv_skb is currently the only caller of skb_attempt_defer_free,
so I'm fixing it here. This patch also adds a DEBUG_NET_WARN_ON_ONCE
in skb_attempt_defer_free, to make sure we don't re-introduce this
problem.

Fixes: 68822bdf76f1 ("net: generalize skb freeing deferral to per-cpu lists")
Reported-by: Xiumei Mu <xmu@redhat.com>
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
---
A few comments:
 - AFAICT this could not happen before 68822bdf76f1, since we would
   have emptied the (per-socket) defer_list before getting to ->exit()
   for the netns
 - I thought about dropping the extensions at the same time as we
   already drop the dst, but Paolo said this is probably not correct due
   to IP_CMSG_PASSSEC
 - I'm planning to rework the "synchronous" removal of xfrm_states
   (commit f75a2804da39 ("xfrm: destroy xfrm_state synchronously on
   net exit path")), which may also be able to fix this problem, but
   it is going to be a lot more complex than this patch

 net/core/skbuff.c | 1 +
 net/ipv4/tcp.c    | 1 +
 2 files changed, 2 insertions(+)
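
For readers less familiar with the deferred-free path, the lifetime problem described above can be modelled with the rough, self-contained toy program below. This is plain userspace C, not kernel code; all names (toy_state, toy_skb, defer_list, and the helpers) are made up for illustration. It shows why an skb parked on a remote CPU's defer list with its secpath still attached keeps the state's refcount elevated at "netns exit", and why dropping the secpath before deferring releases that reference immediately.

#include <stdio.h>
#include <stdlib.h>

struct toy_state {                /* stands in for an xfrm_state */
	int refcnt;
};

struct toy_skb {                  /* stands in for an sk_buff */
	struct toy_state *sec;    /* stands in for the secpath reference */
	struct toy_skb *next;
};

static struct toy_skb *defer_list;   /* stands in for one CPU's defer_list */

static void state_put(struct toy_state *st)
{
	if (--st->refcnt == 0)
		free(st);
}

static void drop_secpath(struct toy_skb *skb)
{
	if (skb->sec) {
		state_put(skb->sec);
		skb->sec = NULL;
	}
}

/* Deferring the free keeps every reference the skb still holds alive
 * until some later flush on the remote CPU.
 */
static void attempt_defer_free(struct toy_skb *skb)
{
	skb->next = defer_list;
	defer_list = skb;
}

static void flush_defer_list(void)
{
	while (defer_list) {
		struct toy_skb *skb = defer_list;

		defer_list = skb->next;
		drop_secpath(skb);
		free(skb);
	}
}

int main(void)
{
	struct toy_state *st = calloc(1, sizeof(*st));
	struct toy_skb *skb = calloc(1, sizeof(*skb));

	st->refcnt = 1;           /* reference held by the "netns" tables */
	skb->sec = st;
	st->refcnt++;             /* reference held by the skb's secpath */

	/* Without the fix: defer with the secpath still attached. */
	attempt_defer_free(skb);
	printf("refcnt at 'netns exit': %d (expected 1)\n", st->refcnt);

	/* With the fix, drop_secpath(skb) would run before
	 * attempt_defer_free(skb) and the count above would already be 1,
	 * matching what xfrm6_tunnel_net_exit expects.
	 */
	flush_defer_list();
	state_put(st);
	return 0;
}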

Comments

Eric Dumazet Feb. 10, 2025, 4:24 p.m. UTC | #1
On Mon, Feb 10, 2025 at 5:02 PM Sabrina Dubroca <sd@queasysnail.net> wrote:
>
> Xiumei reported hitting the WARN in xfrm6_tunnel_net_exit while
> running tests that boil down to:
>  - create a pair of netns
>  - run a basic TCP test over ipcomp6
>  - delete the pair of netns
>
> The xfrm_state found on spi_byaddr was not deleted at the time we
> delete the netns, because we still have a reference on it. This
> lingering reference comes from a secpath (which holds a ref on the
> xfrm_state), which is still attached to an skb. This skb is not
> leaked, it ends up on sk_receive_queue and then gets defer-free'd by
> skb_attempt_defer_free.
>
> The problem happens when we defer freeing an skb (push it on one CPU's
> defer_list), and don't flush that list before the netns is deleted. In
> that case, we still have a reference on the xfrm_state that we don't
> expect at this point.
>
> tcp_eat_recv_skb is currently the only caller of skb_attempt_defer_free,
> so I'm fixing it here. This patch also adds a DEBUG_NET_WARN_ON_ONCE
> in skb_attempt_defer_free, to make sure we don't re-introduce this
> problem.
>
> Fixes: 68822bdf76f1 ("net: generalize skb freeing deferral to per-cpu lists")
> Reported-by: Xiumei Mu <xmu@redhat.com>
> Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
> ---
> A few comments:
>  - AFAICT this could not happen before 68822bdf76f1, since we would
>    have emptied the (per-socket) defer_list before getting to ->exit()
>    for the netns
>  - I thought about dropping the extensions at the same time as we
>    already drop the dst, but Paolo said this is probably not correct due
>    to IP_CMSG_PASSSEC

I think we discussed this issue in the past.

Are you sure IP_CMSG_PASSSEC  is ever used by TCP ?

Many layers in TCP can aggregate packets, are they aware of XFRM yet ?

>  - I'm planning to rework the "synchronous" removal of xfrm_states
>    (commit f75a2804da39 ("xfrm: destroy xfrm_state synchronously on
>    net exit path")), which may also be able to fix this problem, but
>    it is going to be a lot more complex than this patch
>
>  net/core/skbuff.c | 1 +
>  net/ipv4/tcp.c    | 1 +
>  2 files changed, 2 insertions(+)
>
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index 6a99c453397f..abd0371bc51a 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -7047,6 +7047,7 @@ nodefer:  kfree_skb_napi_cache(skb);
>
>         DEBUG_NET_WARN_ON_ONCE(skb_dst(skb));
>         DEBUG_NET_WARN_ON_ONCE(skb->destructor);
> +       DEBUG_NET_WARN_ON_ONCE(skb_has_extensions(skb));
>
>         sd = &per_cpu(softnet_data, cpu);
>         defer_max = READ_ONCE(net_hotdata.sysctl_skb_defer_max);
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index 0d704bda6c41..e60f642485ee 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -1524,6 +1524,7 @@ static void tcp_eat_recv_skb(struct sock *sk, struct sk_buff *skb)
>                 sock_rfree(skb);
>                 skb->destructor = NULL;
>                 skb->sk = NULL;
> +               skb_ext_reset(skb);

If we think about it, storing thousands of packets in TCP sockets receive queues
with XFRM state is consuming memory for absolutely no purpose.

It is worth noting MPTCP  calls skb_ext_reset(skb) after
commit 4e637c70b503b686aae45716a25a94dc3a434f3a ("mptcp: attempt
coalescing when moving skbs to mptcp rx queue")

I would suggest calling secpath_reset() earlier in TCP, from BH
handler, while cpu caches are hot,
instead of waiting for recvmsg() to drain the receive queue much later ?

Sabrina Dubroca Feb. 11, 2025, 6:51 p.m. UTC | #2
2025-02-10, 17:24:43 +0100, Eric Dumazet wrote:
> On Mon, Feb 10, 2025 at 5:02 PM Sabrina Dubroca <sd@queasysnail.net> wrote:
> >
> > Xiumei reported hitting the WARN in xfrm6_tunnel_net_exit while
> > running tests that boil down to:
> >  - create a pair of netns
> >  - run a basic TCP test over ipcomp6
> >  - delete the pair of netns
> >
> > The xfrm_state found on spi_byaddr was not deleted at the time we
> > delete the netns, because we still have a reference on it. This
> > lingering reference comes from a secpath (which holds a ref on the
> > xfrm_state), which is still attached to an skb. This skb is not
> > leaked, it ends up on sk_receive_queue and then gets defer-free'd by
> > skb_attempt_defer_free.
> >
> > The problem happens when we defer freeing an skb (push it on one CPU's
> > defer_list), and don't flush that list before the netns is deleted. In
> > that case, we still have a reference on the xfrm_state that we don't
> > expect at this point.
> >
> > tcp_eat_recv_skb is currently the only caller of skb_attempt_defer_free,
> > so I'm fixing it here. This patch also adds a DEBUG_NET_WARN_ON_ONCE
> > in skb_attempt_defer_free, to make sure we don't re-introduce this
> > problem.
> >
> > Fixes: 68822bdf76f1 ("net: generalize skb freeing deferral to per-cpu lists")
> > Reported-by: Xiumei Mu <xmu@redhat.com>
> > Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
> > ---
> > A few comments:
> >  - AFAICT this could not happen before 68822bdf76f1, since we would
> >    have emptied the (per-socket) defer_list before getting to ->exit()
> >    for the netns
> >  - I thought about dropping the extensions at the same time as we
> >    already drop the dst, but Paolo said this is probably not correct due
> >    to IP_CMSG_PASSSEC
> 
> I think we discussed this issue in the past.
> 
> Are you sure IP_CMSG_PASSSEC  is ever used by TCP ?

After checking, I don't think so. The only way TCP can get to
IP_CMSG_PASSSEC is through the error queue, so it shouldn't matter.

The original commit (2c7946a7bf45 ("[SECURITY]: TCP/UDP getpeersec"))
also says that TCP should be using SO_PEERSEC for that purpose
(although likely based on the secpath as well, but not packet per
packet).

Based on the chat you had with Paul Moore back in November, it seems
any point after tcp_filter should be fine:

https://lore.kernel.org/netdev/CAHC9VhS3yuwrOPcH5_iRy50O_TtBCh_OVWHZgzfFTYqyfrw_zQ@mail.gmail.com


> Many layers in TCP can aggregate packets, are they aware of XFRM yet ?

I'm not so familiar with the depths of TCP, but with what you're
suggesting below, AFAIU the cleanup should happen before any
aggregation attempt (well, there's GRO...).


[...]
> If we think about it, storing thousands of packets in TCP sockets receive queues
> with XFRM state is consuming memory for absolutely no purpose.

True. I went with the simpler (less likely to break things
unexpectedly) fix for v1.

> It is worth noting MPTCP  calls skb_ext_reset(skb) after
> commit 4e637c70b503b686aae45716a25a94dc3a434f3a ("mptcp: attempt
> coalescing when moving skbs to mptcp rx queue")
> 
> I would suggest calling secpath_reset() earlier in TCP, from BH
> handler, while cpu caches are hot,
> instead of waiting for recvmsg() to drain the receive queue much later ?

Ok. So in the end it would look a lot like what you proposed in a
discussion with Ilya:
https://lore.kernel.org/netdev/CANn89i+JdDukwEhZ%3D41FxY-w63eER6JVixkwL+s2eSOjo6aWEQ@mail.gmail.com/

(as Paolo noticed, we can't just do skb_ext_reset because at least in
tcp_data_queue the MPTCP extension has just been attached)
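
For context, the distinction Paolo is pointing at is roughly the following; this is a simplified sketch of the two helpers, not necessarily the exact definitions in the tree this discussion targets. skb_ext_reset() throws away the whole extension area (secpath, MPTCP and anything else attached), while secpath_reset() removes only the SKB_EXT_SEC_PATH extension, which is why it stays safe even after tcp_data_queue() has attached the MPTCP extension.

/* Simplified sketch, may differ slightly from the upstream definitions. */

static inline void skb_ext_reset(struct sk_buff *skb)
{
	/* Drops ALL extensions attached to the skb, including the
	 * MPTCP extension that tcp_data_queue() may have just added.
	 */
	if (unlikely(skb->active_extensions)) {
		__skb_ext_put(skb->extensions);
		skb->active_extensions = 0;
	}
}

static inline void secpath_reset(struct sk_buff *skb)
{
	/* Drops only the secpath extension, and with it the references
	 * it holds on xfrm_states, leaving other extensions alone.
	 */
#ifdef CONFIG_XFRM
	skb_ext_del(skb, SKB_EXT_SEC_PATH);
#endif
}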

An additional patch could maybe add DEBUG_NET_WARN_ON_ONCE at the time
we add skbs to sk_receive_queue, to check we didn't miss (or remove in
the future) places where the dst or secpath should have been dropped?

-------- 8< --------
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 5b2b04835688..87c1e98d76cf 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -683,6 +683,12 @@ void tcp_fin(struct sock *sk);
 void tcp_check_space(struct sock *sk);
 void tcp_sack_compress_send_ack(struct sock *sk);
 
+static inline void tcp_cleanup_skb(struct sk_buff *skb)
+{
+	skb_dst_drop(skb);
+	secpath_reset(skb);
+}
+
 /* tcp_timer.c */
 void tcp_init_xmit_timers(struct sock *);
 static inline void tcp_clear_xmit_timers(struct sock *sk)
diff --git a/net/ipv4/tcp_fastopen.c b/net/ipv4/tcp_fastopen.c
index 0f523cbfe329..b815b9fc604c 100644
--- a/net/ipv4/tcp_fastopen.c
+++ b/net/ipv4/tcp_fastopen.c
@@ -178,7 +178,7 @@ void tcp_fastopen_add_skb(struct sock *sk, struct sk_buff *skb)
 	if (!skb)
 		return;
 
-	skb_dst_drop(skb);
+	tcp_cleanup_skb(skb);
 	/* segs_in has been initialized to 1 in tcp_create_openreq_child().
 	 * Hence, reset segs_in to 0 before calling tcp_segs_in()
 	 * to avoid double counting.  Also, tcp_segs_in() expects
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index eb82e01da911..bb0811c38908 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -5245,7 +5245,7 @@ static void tcp_data_queue(struct sock *sk, struct sk_buff *skb)
 		__kfree_skb(skb);
 		return;
 	}
-	skb_dst_drop(skb);
+	tcp_cleanup_skb(skb);
 	__skb_pull(skb, tcp_hdr(skb)->doff * 4);
 
 	reason = SKB_DROP_REASON_NOT_SPECIFIED;
@@ -6226,7 +6226,7 @@ void tcp_rcv_established(struct sock *sk, struct sk_buff *skb)
 			NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPHPHITS);
 
 			/* Bulk data transfer: receiver */
-			skb_dst_drop(skb);
+			tcp_cleanup_skb(skb);
 			__skb_pull(skb, tcp_header_len);
 			eaten = tcp_queue_rcv(sk, skb, &fragstolen);
 
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index cc2b5194a18d..2632844d2c35 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -2027,7 +2027,7 @@ bool tcp_add_backlog(struct sock *sk, struct sk_buff *skb,
 	 */
 	skb_condense(skb);
 
-	skb_dst_drop(skb);
+	tcp_cleanup_skb(skb);
 
 	if (unlikely(tcp_checksum_complete(skb))) {
 		bh_unlock_sock(sk);
Eric Dumazet Feb. 11, 2025, 7:17 p.m. UTC | #3
On Tue, Feb 11, 2025 at 7:51 PM Sabrina Dubroca <sd@queasysnail.net> wrote:
>
> 2025-02-10, 17:24:43 +0100, Eric Dumazet wrote:
> > On Mon, Feb 10, 2025 at 5:02 PM Sabrina Dubroca <sd@queasysnail.net> wrote:
> > >
> > > Xiumei reported hitting the WARN in xfrm6_tunnel_net_exit while
> > > running tests that boil down to:
> > >  - create a pair of netns
> > >  - run a basic TCP test over ipcomp6
> > >  - delete the pair of netns
> > >
> > > The xfrm_state found on spi_byaddr was not deleted at the time we
> > > delete the netns, because we still have a reference on it. This
> > > lingering reference comes from a secpath (which holds a ref on the
> > > xfrm_state), which is still attached to an skb. This skb is not
> > > leaked, it ends up on sk_receive_queue and then gets defer-free'd by
> > > skb_attempt_defer_free.
> > >
> > > The problem happens when we defer freeing an skb (push it on one CPU's
> > > defer_list), and don't flush that list before the netns is deleted. In
> > > that case, we still have a reference on the xfrm_state that we don't
> > > expect at this point.
> > >
> > > tcp_eat_recv_skb is currently the only caller of skb_attempt_defer_free,
> > > so I'm fixing it here. This patch also adds a DEBUG_NET_WARN_ON_ONCE
> > > in skb_attempt_defer_free, to make sure we don't re-introduce this
> > > problem.
> > >
> > > Fixes: 68822bdf76f1 ("net: generalize skb freeing deferral to per-cpu lists")
> > > Reported-by: Xiumei Mu <xmu@redhat.com>
> > > Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
> > > ---
> > > A few comments:
> > >  - AFAICT this could not happen before 68822bdf76f1, since we would
> > >    have emptied the (per-socket) defer_list before getting to ->exit()
> > >    for the netns
> > >  - I thought about dropping the extensions at the same time as we
> > >    already drop the dst, but Paolo said this is probably not correct due
> > >    to IP_CMSG_PASSSEC
> >
> > I think we discussed this issue in the past.
> >
> > Are you sure IP_CMSG_PASSSEC  is ever used by TCP ?
>
> After checking, I don't think so. The only way TCP can get to
> IP_CMSG_PASSSEC is through the error queue, so it shouldn't matter.
>
> The original commit (2c7946a7bf45 ("[SECURITY]: TCP/UDP getpeersec"))
> also says that TCP should be using SO_PEERSEC for that purpose
> (although likely based on the secpath as well, but not packet per
> packet).
>
> Based on the chat you had with Paul Moore back in November, it seems
> any point after tcp_filter should be fine:
>
> https://lore.kernel.org/netdev/CAHC9VhS3yuwrOPcH5_iRy50O_TtBCh_OVWHZgzfFTYqyfrw_zQ@mail.gmail.com
>
>
> > Many layers in TCP can aggregate packets, are they aware of XFRM yet ?
>
> I'm not so familiar with the depths of TCP, but with what you're
> suggesting below, AFAIU the cleanup should happen before any
> aggregation attempt (well, there's GRO...).
>
>
> [...]
> > If we think about it, storing thousands of packets in TCP sockets receive queues
> > with XFRM state is consuming memory for absolutely no purpose.
>
> True. I went with the simpler (less likely to break things
> unexpectedly) fix for v1.
>
> > It is worth noting MPTCP  calls skb_ext_reset(skb) after
> > commit 4e637c70b503b686aae45716a25a94dc3a434f3a ("mptcp: attempt
> > coalescing when moving skbs to mptcp rx queue")
> >
> > I would suggest calling secpath_reset() earlier in TCP, from BH
> > handler, while cpu caches are hot,
> > instead of waiting for recvmsg() to drain the receive queue much later ?
>
> Ok. So in the end it would look a lot like what you proposed in a
> discussion with Ilya:
> https://lore.kernel.org/netdev/CANn89i+JdDukwEhZ%3D41FxY-w63eER6JVixkwL+s2eSOjo6aWEQ@mail.gmail.com/
>
> (as Paolo noticed, we can't just do skb_ext_reset because at least in
> tcp_data_queue the MPTCP extension has just been attached)
>
> An additional patch could maybe add DEBUG_NET_WARN_ON_ONCE at the time
> we add skbs to sk_receive_queue, to check we didn't miss (or remove in
> the future) places where the dst or secpath should have been dropped?

Sure, adding the  DEBUG_NET_WARN_ON_ONCE() is absolutely fine.

>
> -------- 8< --------
> diff --git a/include/net/tcp.h b/include/net/tcp.h
> index 5b2b04835688..87c1e98d76cf 100644
> --- a/include/net/tcp.h
> +++ b/include/net/tcp.h
> @@ -683,6 +683,12 @@ void tcp_fin(struct sock *sk);
>  void tcp_check_space(struct sock *sk);
>  void tcp_sack_compress_send_ack(struct sock *sk);
>
> +static inline void tcp_cleanup_skb(struct sk_buff *skb)
> +{
> +       skb_dst_drop(skb);
> +       secpath_reset(skb);
> +}
> +
>  /* tcp_timer.c */
>  void tcp_init_xmit_timers(struct sock *);
>  static inline void tcp_clear_xmit_timers(struct sock *sk)
> diff --git a/net/ipv4/tcp_fastopen.c b/net/ipv4/tcp_fastopen.c
> index 0f523cbfe329..b815b9fc604c 100644
> --- a/net/ipv4/tcp_fastopen.c
> +++ b/net/ipv4/tcp_fastopen.c
> @@ -178,7 +178,7 @@ void tcp_fastopen_add_skb(struct sock *sk, struct sk_buff *skb)
>         if (!skb)
>                 return;
>
> -       skb_dst_drop(skb);
> +       tcp_cleanup_skb(skb);
>         /* segs_in has been initialized to 1 in tcp_create_openreq_child().
>          * Hence, reset segs_in to 0 before calling tcp_segs_in()
>          * to avoid double counting.  Also, tcp_segs_in() expects
> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> index eb82e01da911..bb0811c38908 100644
> --- a/net/ipv4/tcp_input.c
> +++ b/net/ipv4/tcp_input.c
> @@ -5245,7 +5245,7 @@ static void tcp_data_queue(struct sock *sk, struct sk_buff *skb)
>                 __kfree_skb(skb);
>                 return;
>         }
> -       skb_dst_drop(skb);
> +       tcp_cleanup_skb(skb);
>         __skb_pull(skb, tcp_hdr(skb)->doff * 4);
>
>         reason = SKB_DROP_REASON_NOT_SPECIFIED;
> @@ -6226,7 +6226,7 @@ void tcp_rcv_established(struct sock *sk, struct sk_buff *skb)
>                         NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPHPHITS);
>
>                         /* Bulk data transfer: receiver */
> -                       skb_dst_drop(skb);
> +                       tcp_cleanup_skb(skb);
>                         __skb_pull(skb, tcp_header_len);
>                         eaten = tcp_queue_rcv(sk, skb, &fragstolen);
>
> diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
> index cc2b5194a18d..2632844d2c35 100644
> --- a/net/ipv4/tcp_ipv4.c
> +++ b/net/ipv4/tcp_ipv4.c
> @@ -2027,7 +2027,7 @@ bool tcp_add_backlog(struct sock *sk, struct sk_buff *skb,
>          */
>         skb_condense(skb);
>
> -       skb_dst_drop(skb);
> +       tcp_cleanup_skb(skb);
>
>         if (unlikely(tcp_checksum_complete(skb))) {
>                 bh_unlock_sock(sk);
> --

This looks much better to me ;)
Sabrina Dubroca Feb. 12, 2025, 12:38 p.m. UTC | #4
2025-02-11, 20:17:17 +0100, Eric Dumazet wrote:
> On Tue, Feb 11, 2025 at 7:51 PM Sabrina Dubroca <sd@queasysnail.net> wrote:
> > An additional patch could maybe add DEBUG_NET_WARN_ON_ONCE at the time
> > we add skbs to sk_receive_queue, to check we didn't miss (or remove in
> > the future) places where the dst or secpath should have been dropped?
> 
> Sure, adding the  DEBUG_NET_WARN_ON_ONCE() is absolutely fine.

Something like this would be ok?
(on top of the previous diff)

The main drawback is that we can't just look for "sk_receive_queue" in
net/ipv4/tcp*.

-------- 8< --------
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 4d106d13db22..930cda5b5eb9 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -690,6 +690,13 @@ static inline void tcp_cleanup_skb(struct sk_buff *skb)
 	secpath_reset(skb);
 }
 
+static inline void tcp_add_receive_queue(struct sock *sk, struct sk_buff *skb)
+{
+	DEBUG_NET_WARN_ON_ONCE(skb_dst(skb));
+	DEBUG_NET_WARN_ON_ONCE(secpath_exists(skb));
+	__skb_queue_tail(&sk->sk_receive_queue, skb);
+}
+
 /* tcp_timer.c */
 void tcp_init_xmit_timers(struct sock *);
 static inline void tcp_clear_xmit_timers(struct sock *sk)
diff --git a/net/ipv4/tcp_fastopen.c b/net/ipv4/tcp_fastopen.c
index b815b9fc604c..32b28fc21b63 100644
--- a/net/ipv4/tcp_fastopen.c
+++ b/net/ipv4/tcp_fastopen.c
@@ -195,7 +195,7 @@ void tcp_fastopen_add_skb(struct sock *sk, struct sk_buff *skb)
 	TCP_SKB_CB(skb)->tcp_flags &= ~TCPHDR_SYN;
 
 	tp->rcv_nxt = TCP_SKB_CB(skb)->end_seq;
-	__skb_queue_tail(&sk->sk_receive_queue, skb);
+	tcp_add_receive_queue(sk, skb);
 	tp->syn_data_acked = 1;
 
 	/* u64_stats_update_begin(&tp->syncp) not needed here,
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index bb0811c38908..6821e5540a53 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -4970,7 +4970,7 @@ static void tcp_ofo_queue(struct sock *sk)
 		tcp_rcv_nxt_update(tp, TCP_SKB_CB(skb)->end_seq);
 		fin = TCP_SKB_CB(skb)->tcp_flags & TCPHDR_FIN;
 		if (!eaten)
-			__skb_queue_tail(&sk->sk_receive_queue, skb);
+			tcp_add_receive_queue(sk, skb);
 		else
 			kfree_skb_partial(skb, fragstolen);
 
@@ -5162,7 +5162,7 @@ static int __must_check tcp_queue_rcv(struct sock *sk, struct sk_buff *skb,
 				  skb, fragstolen)) ? 1 : 0;
 	tcp_rcv_nxt_update(tcp_sk(sk), TCP_SKB_CB(skb)->end_seq);
 	if (!eaten) {
-		__skb_queue_tail(&sk->sk_receive_queue, skb);
+		tcp_add_receive_queue(sk, skb);
 		skb_set_owner_r(skb, sk);
 	}
 	return eaten;
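
One note on the cost of these checks in the receive fast path: DEBUG_NET_WARN_ON_ONCE() only turns into a real WARN_ON_ONCE() when the kernel is built with CONFIG_DEBUG_NET; otherwise it compiles away. Roughly, as a sketch of include/net/net_debug.h (which may differ slightly from the exact tree):

#if defined(CONFIG_DEBUG_NET)
#define DEBUG_NET_WARN_ON_ONCE(cond) ((void)WARN_ON_ONCE(cond))
#else
/* Still type-checks 'cond' but generates no code in production builds. */
#define DEBUG_NET_WARN_ON_ONCE(cond) BUILD_BUG_ON_INVALID(cond)
#endif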
Paul Moore Feb. 12, 2025, 4:15 p.m. UTC | #5
On Tue, Feb 11, 2025 at 1:51 PM Sabrina Dubroca <sd@queasysnail.net> wrote:
> 2025-02-10, 17:24:43 +0100, Eric Dumazet wrote:
> > On Mon, Feb 10, 2025 at 5:02 PM Sabrina Dubroca <sd@queasysnail.net> wrote:
> > >
> > > Xiumei reported hitting the WARN in xfrm6_tunnel_net_exit while
> > > running tests that boil down to:
> > >  - create a pair of netns
> > >  - run a basic TCP test over ipcomp6
> > >  - delete the pair of netns
> > >
> > > The xfrm_state found on spi_byaddr was not deleted at the time we
> > > delete the netns, because we still have a reference on it. This
> > > lingering reference comes from a secpath (which holds a ref on the
> > > xfrm_state), which is still attached to an skb. This skb is not
> > > leaked, it ends up on sk_receive_queue and then gets defer-free'd by
> > > skb_attempt_defer_free.
> > >
> > > The problem happens when we defer freeing an skb (push it on one CPU's
> > > defer_list), and don't flush that list before the netns is deleted. In
> > > that case, we still have a reference on the xfrm_state that we don't
> > > expect at this point.
> > >
> > > tcp_eat_recv_skb is currently the only caller of skb_attempt_defer_free,
> > > so I'm fixing it here. This patch also adds a DEBUG_NET_WARN_ON_ONCE
> > > in skb_attempt_defer_free, to make sure we don't re-introduce this
> > > problem.
> > >
> > > Fixes: 68822bdf76f1 ("net: generalize skb freeing deferral to per-cpu lists")
> > > Reported-by: Xiumei Mu <xmu@redhat.com>
> > > Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
> > > ---
> > > A few comments:
> > >  - AFAICT this could not happen before 68822bdf76f1, since we would
> > >    have emptied the (per-socket) defer_list before getting to ->exit()
> > >    for the netns
> > >  - I thought about dropping the extensions at the same time as we
> > >    already drop the dst, but Paolo said this is probably not correct due
> > >    to IP_CMSG_PASSSEC
> >
> > I think we discussed this issue in the past.
> >
> > Are you sure IP_CMSG_PASSSEC  is ever used by TCP ?
>
> After checking, I don't think so. The only way TCP can get to
> IP_CMSG_PASSSEC is through the error queue, so it shouldn't matter.
>
> The original commit (2c7946a7bf45 ("[SECURITY]: TCP/UDP getpeersec"))
> also says that TCP should be using SO_PEERSEC for that purpose
> (although likely based on the secpath as well, but not packet per
> packet).
>
> Based on the chat you had with Paul Moore back in November, it seems
> any point after tcp_filter should be fine:
> https://lore.kernel.org/netdev/CAHC9VhS3yuwrOPcH5_iRy50O_TtBCh_OVWHZgzfFTYqyfrw_zQ@mail.gmail.com

[NOTE: CC'ing the LSM list so others are aware of this discussion]

I can't say I've got the full context for this discussion, but my
comments from last November should still hold true: if a packet/skb is
still "alive" and able to be accessed by a task then we need to
preserve it from a LSM perspective.

As far as IP_CMSG_PASSSEC is concerned, while I would normally expect
TCP/stream connections to use the SO_PEERSEC method to determine the
peer security information, unless there is something in the stack or
socket API that explicitly prevents the use of IP_CMSG_PASSSEC on
TCP/stream sockets, I would imagine it is possible that there is an
application somewhere which makes use of it.  The LSM framework, and
all of the LSMs that implement the IP_CMSG_PASSSEC hook callback,
should work just fine regardless of if the packet is TCP or UDP.
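
For reference, the userspace side Paul is describing looks roughly like the sketch below: a connected stream socket queries the peer label once via getsockopt(SO_PEERSEC), instead of receiving a per-packet SCM_SECURITY cmsg the way IP_CMSG_PASSSEC delivers it for datagrams. Illustrative sketch only; error handling is trimmed and the label format depends on the LSM in use.

#include <stdio.h>
#include <string.h>
#include <sys/socket.h>

#ifndef SO_PEERSEC
#define SO_PEERSEC 31	/* value from <asm-generic/socket.h> */
#endif

/* Print the peer security label of an already-connected TCP socket. */
int print_peer_label(int fd)
{
	char label[256];
	socklen_t len = sizeof(label);

	memset(label, 0, sizeof(label));
	if (getsockopt(fd, SOL_SOCKET, SO_PEERSEC, label, &len) < 0) {
		perror("getsockopt(SO_PEERSEC)");
		return -1;
	}
	printf("peer label: %.*s\n", (int)len, label);
	return 0;
}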

Patch

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 6a99c453397f..abd0371bc51a 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -7047,6 +7047,7 @@  nodefer:	kfree_skb_napi_cache(skb);
 
 	DEBUG_NET_WARN_ON_ONCE(skb_dst(skb));
 	DEBUG_NET_WARN_ON_ONCE(skb->destructor);
+	DEBUG_NET_WARN_ON_ONCE(skb_has_extensions(skb));
 
 	sd = &per_cpu(softnet_data, cpu);
 	defer_max = READ_ONCE(net_hotdata.sysctl_skb_defer_max);
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 0d704bda6c41..e60f642485ee 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1524,6 +1524,7 @@  static void tcp_eat_recv_skb(struct sock *sk, struct sk_buff *skb)
 		sock_rfree(skb);
 		skb->destructor = NULL;
 		skb->sk = NULL;
+		skb_ext_reset(skb);
 		return skb_attempt_defer_free(skb);
 	}
 	__kfree_skb(skb);