diff mbox series

[net,v2] tcp: drop secpath at the same time as we currently drop dst

Message ID 5055ba8f8f72bdcb602faa299faca73c280b7735.1739743613.git.sd@queasysnail.net (mailing list archive)
State New
Headers show
Series [net,v2] tcp: drop secpath at the same time as we currently drop dst | expand

Commit Message

Sabrina Dubroca Feb. 17, 2025, 10:23 a.m. UTC
Xiumei reported hitting the WARN in xfrm6_tunnel_net_exit while
running tests that boil down to:
 - create a pair of netns
 - run a basic TCP test over ipcomp6
 - delete the pair of netns

The xfrm_state found on spi_byaddr was not deleted at the time we
delete the netns, because we still have a reference on it. This
lingering reference comes from a secpath (which holds a ref on the
xfrm_state), which is still attached to an skb. This skb is not
leaked, it ends up on sk_receive_queue and then gets defer-free'd by
skb_attempt_defer_free.

The problem happens when we defer freeing an skb (push it on one CPU's
defer_list), and don't flush that list before the netns is deleted. In
that case, we still have a reference on the xfrm_state that we don't
expect at this point.

We already drop the skb's dst in the TCP receive path when it's no
longer needed, so let's also drop the secpath. At this point,
tcp_filter has already called into the LSM hooks that may require the
secpath, so it should not be needed anymore. However, in some of those
places, the MPTCP extension has just been attached to the skb, so we
cannot simply drop all extensions.

Fixes: 68822bdf76f1 ("net: generalize skb freeing deferral to per-cpu lists")
Reported-by: Xiumei Mu <xmu@redhat.com>
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
---
v1: drop all extensions just before calling skb_attempt_defer_free
    https://lore.kernel.org/netdev/879a4592e4e4bd0c30dbe29ca189e224ec1739a5.1739201151.git.sd@queasysnail.net/
v2: - drop only secpath, as soon as possible - per Eric's feedback
    - add debug warns if trying to add to sk_receive_queue an skb with
      a dst or a secpath

@Eric feel free to add some tags (Suggested-by? sign-off?) for the
code I adapted from
https://lore.kernel.org/netdev/CANn89i+JdDukwEhZ%3D41FxY-w63eER6JVixkwL+s2eSOjo6aWEQ@mail.gmail.com/

 include/net/tcp.h       | 14 ++++++++++++++
 net/ipv4/tcp_fastopen.c |  4 ++--
 net/ipv4/tcp_input.c    |  8 ++++----
 net/ipv4/tcp_ipv4.c     |  2 +-
 4 files changed, 21 insertions(+), 7 deletions(-)

Comments

Eric Dumazet Feb. 17, 2025, 11:58 a.m. UTC | #1
On Mon, Feb 17, 2025 at 11:23 AM Sabrina Dubroca <sd@queasysnail.net> wrote:
>
> Xiumei reported hitting the WARN in xfrm6_tunnel_net_exit while
> running tests that boil down to:
>  - create a pair of netns
>  - run a basic TCP test over ipcomp6
>  - delete the pair of netns
>
> The xfrm_state found on spi_byaddr was not deleted at the time we
> delete the netns, because we still have a reference on it. This
> lingering reference comes from a secpath (which holds a ref on the
> xfrm_state), which is still attached to an skb. This skb is not
> leaked, it ends up on sk_receive_queue and then gets defer-free'd by
> skb_attempt_defer_free.
>
> The problem happens when we defer freeing an skb (push it on one CPU's
> defer_list), and don't flush that list before the netns is deleted. In
> that case, we still have a reference on the xfrm_state that we don't
> expect at this point.
>
> We already drop the skb's dst in the TCP receive path when it's no
> longer needed, so let's also drop the secpath. At this point,
> tcp_filter has already called into the LSM hooks that may require the
> secpath, so it should not be needed anymore. However, in some of those
> places, the MPTCP extension has just been attached to the skb, so we
> cannot simply drop all extensions.
>
> Fixes: 68822bdf76f1 ("net: generalize skb freeing deferral to per-cpu lists")
> Reported-by: Xiumei Mu <xmu@redhat.com>
> Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
> ---
> v1: drop all extensions just before calling skb_attempt_defer_free
>     https://lore.kernel.org/netdev/879a4592e4e4bd0c30dbe29ca189e224ec1739a5.1739201151.git.sd@queasysnail.net/
> v2: - drop only secpath, as soon as possible - per Eric's feedback
>     - add debug warns if trying to add to sk_receive_queue an skb with
>       a dst or a secpath
>
> @Eric feel free to add some tags (Suggested-by? sign-off?) for the
> code I adapted from
> https://lore.kernel.org/netdev/CANn89i+JdDukwEhZ%3D41FxY-w63eER6JVixkwL+s2eSOjo6aWEQ@mail.gmail.com/

Reviewed-by: Eric Dumazet <edumazet@google.com>

Thanks !
Paul Moore Feb. 17, 2025, 10:35 p.m. UTC | #2
On Mon, Feb 17, 2025 at 5:23 AM Sabrina Dubroca <sd@queasysnail.net> wrote:
>
> Xiumei reported hitting the WARN in xfrm6_tunnel_net_exit while
> running tests that boil down to:
>  - create a pair of netns
>  - run a basic TCP test over ipcomp6
>  - delete the pair of netns
>
> The xfrm_state found on spi_byaddr was not deleted at the time we
> delete the netns, because we still have a reference on it. This
> lingering reference comes from a secpath (which holds a ref on the
> xfrm_state), which is still attached to an skb. This skb is not
> leaked, it ends up on sk_receive_queue and then gets defer-free'd by
> skb_attempt_defer_free.
>
> The problem happens when we defer freeing an skb (push it on one CPU's
> defer_list), and don't flush that list before the netns is deleted. In
> that case, we still have a reference on the xfrm_state that we don't
> expect at this point.
>
> We already drop the skb's dst in the TCP receive path when it's no
> longer needed, so let's also drop the secpath. At this point,
> tcp_filter has already called into the LSM hooks that may require the
> secpath, so it should not be needed anymore.

I don't recall seeing any follow up in the v1 patchset regarding
IP_CMSG_PASSSEC/security_socket_getpeersec_dgram(), can you confirm
that the secpath is preserved for that code path?

https://lore.kernel.org/linux-security-module/CAHC9VhQZ+k1J0UidJ-bgdBGBuVX9M18tQ+a+fuqXQM_L-PFvzA@mail.gmail.com

> However, in some of those
> places, the MPTCP extension has just been attached to the skb, so we
> cannot simply drop all extensions.
>
> Fixes: 68822bdf76f1 ("net: generalize skb freeing deferral to per-cpu lists")
> Reported-by: Xiumei Mu <xmu@redhat.com>
> Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
> ---
> v1: drop all extensions just before calling skb_attempt_defer_free
>     https://lore.kernel.org/netdev/879a4592e4e4bd0c30dbe29ca189e224ec1739a5.1739201151.git.sd@queasysnail.net/
> v2: - drop only secpath, as soon as possible - per Eric's feedback
>     - add debug warns if trying to add to sk_receive_queue an skb with
>       a dst or a secpath
>
> @Eric feel free to add some tags (Suggested-by? sign-off?) for the
> code I adapted from
> https://lore.kernel.org/netdev/CANn89i+JdDukwEhZ%3D41FxY-w63eER6JVixkwL+s2eSOjo6aWEQ@mail.gmail.com/
>
>  include/net/tcp.h       | 14 ++++++++++++++
>  net/ipv4/tcp_fastopen.c |  4 ++--
>  net/ipv4/tcp_input.c    |  8 ++++----
>  net/ipv4/tcp_ipv4.c     |  2 +-
>  4 files changed, 21 insertions(+), 7 deletions(-)
Sabrina Dubroca Feb. 17, 2025, 11:14 p.m. UTC | #3
2025-02-17, 17:35:32 -0500, Paul Moore wrote:
> On Mon, Feb 17, 2025 at 5:23 AM Sabrina Dubroca <sd@queasysnail.net> wrote:
> >
> > Xiumei reported hitting the WARN in xfrm6_tunnel_net_exit while
> > running tests that boil down to:
> >  - create a pair of netns
> >  - run a basic TCP test over ipcomp6
> >  - delete the pair of netns
> >
> > The xfrm_state found on spi_byaddr was not deleted at the time we
> > delete the netns, because we still have a reference on it. This
> > lingering reference comes from a secpath (which holds a ref on the
> > xfrm_state), which is still attached to an skb. This skb is not
> > leaked, it ends up on sk_receive_queue and then gets defer-free'd by
> > skb_attempt_defer_free.
> >
> > The problem happens when we defer freeing an skb (push it on one CPU's
> > defer_list), and don't flush that list before the netns is deleted. In
> > that case, we still have a reference on the xfrm_state that we don't
> > expect at this point.
> >
> > We already drop the skb's dst in the TCP receive path when it's no
> > longer needed, so let's also drop the secpath. At this point,
> > tcp_filter has already called into the LSM hooks that may require the
> > secpath, so it should not be needed anymore.
> 
> I don't recall seeing any follow up in the v1 patchset regarding
> IP_CMSG_PASSSEC/security_socket_getpeersec_dgram(), can you confirm
> that the secpath is preserved for that code path?
> 
> https://lore.kernel.org/linux-security-module/CAHC9VhQZ+k1J0UidJ-bgdBGBuVX9M18tQ+a+fuqXQM_L-PFvzA@mail.gmail.com

Sorry, I thought we'd addressed this in the v1 discussion with Eric.

IP_CMSG_PASSSEC is not blocked for TCP sockets, but it will only
process skbs that came from the error queue (ip_recv_error ->
ip_cmsg_recv -> ip_cmsg_recv_offset -> ip_cmsg_recv_security ->
security_socket_getpeersec_dgram), which don't go through those code
paths at all. So AFAICT IP_CMSG_PASSSEC for TCP isn't affected by
dropping the secpath early.
Paul Moore Feb. 17, 2025, 11:56 p.m. UTC | #4
On Mon, Feb 17, 2025 at 6:14 PM Sabrina Dubroca <sd@queasysnail.net> wrote:
> 2025-02-17, 17:35:32 -0500, Paul Moore wrote:
> > On Mon, Feb 17, 2025 at 5:23 AM Sabrina Dubroca <sd@queasysnail.net> wrote:
> > >
> > > Xiumei reported hitting the WARN in xfrm6_tunnel_net_exit while
> > > running tests that boil down to:
> > >  - create a pair of netns
> > >  - run a basic TCP test over ipcomp6
> > >  - delete the pair of netns
> > >
> > > The xfrm_state found on spi_byaddr was not deleted at the time we
> > > delete the netns, because we still have a reference on it. This
> > > lingering reference comes from a secpath (which holds a ref on the
> > > xfrm_state), which is still attached to an skb. This skb is not
> > > leaked, it ends up on sk_receive_queue and then gets defer-free'd by
> > > skb_attempt_defer_free.
> > >
> > > The problem happens when we defer freeing an skb (push it on one CPU's
> > > defer_list), and don't flush that list before the netns is deleted. In
> > > that case, we still have a reference on the xfrm_state that we don't
> > > expect at this point.
> > >
> > > We already drop the skb's dst in the TCP receive path when it's no
> > > longer needed, so let's also drop the secpath. At this point,
> > > tcp_filter has already called into the LSM hooks that may require the
> > > secpath, so it should not be needed anymore.
> >
> > I don't recall seeing any follow up in the v1 patchset regarding
> > IP_CMSG_PASSSEC/security_socket_getpeersec_dgram(), can you confirm
> > that the secpath is preserved for that code path?
> >
> > https://lore.kernel.org/linux-security-module/CAHC9VhQZ+k1J0UidJ-bgdBGBuVX9M18tQ+a+fuqXQM_L-PFvzA@mail.gmail.com
>
> Sorry, I thought we'd addressed this in the v1 discussion with Eric.
>
> IP_CMSG_PASSSEC is not blocked for TCP sockets, but it will only
> process skbs that came from the error queue (ip_recv_error ->
> ip_cmsg_recv -> ip_cmsg_recv_offset -> ip_cmsg_recv_security ->
> security_socket_getpeersec_dgram), which don't go through those code
> paths at all. So AFAICT IP_CMSG_PASSSEC for TCP isn't affected by
> dropping the secpath early.

Great, thanks for clearing that up.
patchwork-bot+netdevbpf@kernel.org Feb. 20, 2025, 8:40 a.m. UTC | #5
Hello:

This patch was applied to netdev/net.git (main)
by Paolo Abeni <pabeni@redhat.com>:

On Mon, 17 Feb 2025 11:23:35 +0100 you wrote:
> Xiumei reported hitting the WARN in xfrm6_tunnel_net_exit while
> running tests that boil down to:
>  - create a pair of netns
>  - run a basic TCP test over ipcomp6
>  - delete the pair of netns
> 
> The xfrm_state found on spi_byaddr was not deleted at the time we
> delete the netns, because we still have a reference on it. This
> lingering reference comes from a secpath (which holds a ref on the
> xfrm_state), which is still attached to an skb. This skb is not
> leaked, it ends up on sk_receive_queue and then gets defer-free'd by
> skb_attempt_defer_free.
> 
> [...]

Here is the summary with links:
  - [net,v2] tcp: drop secpath at the same time as we currently drop dst
    https://git.kernel.org/netdev/net/c/9b6412e6979f

You are awesome, thank you!
diff mbox series

Patch

diff --git a/include/net/tcp.h b/include/net/tcp.h
index 5b2b04835688..930cda5b5eb9 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -41,6 +41,7 @@ 
 #include <net/inet_ecn.h>
 #include <net/dst.h>
 #include <net/mptcp.h>
+#include <net/xfrm.h>
 
 #include <linux/seq_file.h>
 #include <linux/memcontrol.h>
@@ -683,6 +684,19 @@  void tcp_fin(struct sock *sk);
 void tcp_check_space(struct sock *sk);
 void tcp_sack_compress_send_ack(struct sock *sk);
 
+static inline void tcp_cleanup_skb(struct sk_buff *skb)
+{
+	skb_dst_drop(skb);
+	secpath_reset(skb);
+}
+
+static inline void tcp_add_receive_queue(struct sock *sk, struct sk_buff *skb)
+{
+	DEBUG_NET_WARN_ON_ONCE(skb_dst(skb));
+	DEBUG_NET_WARN_ON_ONCE(secpath_exists(skb));
+	__skb_queue_tail(&sk->sk_receive_queue, skb);
+}
+
 /* tcp_timer.c */
 void tcp_init_xmit_timers(struct sock *);
 static inline void tcp_clear_xmit_timers(struct sock *sk)
diff --git a/net/ipv4/tcp_fastopen.c b/net/ipv4/tcp_fastopen.c
index 0f523cbfe329..32b28fc21b63 100644
--- a/net/ipv4/tcp_fastopen.c
+++ b/net/ipv4/tcp_fastopen.c
@@ -178,7 +178,7 @@  void tcp_fastopen_add_skb(struct sock *sk, struct sk_buff *skb)
 	if (!skb)
 		return;
 
-	skb_dst_drop(skb);
+	tcp_cleanup_skb(skb);
 	/* segs_in has been initialized to 1 in tcp_create_openreq_child().
 	 * Hence, reset segs_in to 0 before calling tcp_segs_in()
 	 * to avoid double counting.  Also, tcp_segs_in() expects
@@ -195,7 +195,7 @@  void tcp_fastopen_add_skb(struct sock *sk, struct sk_buff *skb)
 	TCP_SKB_CB(skb)->tcp_flags &= ~TCPHDR_SYN;
 
 	tp->rcv_nxt = TCP_SKB_CB(skb)->end_seq;
-	__skb_queue_tail(&sk->sk_receive_queue, skb);
+	tcp_add_receive_queue(sk, skb);
 	tp->syn_data_acked = 1;
 
 	/* u64_stats_update_begin(&tp->syncp) not needed here,
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index eb82e01da911..6821e5540a53 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -4970,7 +4970,7 @@  static void tcp_ofo_queue(struct sock *sk)
 		tcp_rcv_nxt_update(tp, TCP_SKB_CB(skb)->end_seq);
 		fin = TCP_SKB_CB(skb)->tcp_flags & TCPHDR_FIN;
 		if (!eaten)
-			__skb_queue_tail(&sk->sk_receive_queue, skb);
+			tcp_add_receive_queue(sk, skb);
 		else
 			kfree_skb_partial(skb, fragstolen);
 
@@ -5162,7 +5162,7 @@  static int __must_check tcp_queue_rcv(struct sock *sk, struct sk_buff *skb,
 				  skb, fragstolen)) ? 1 : 0;
 	tcp_rcv_nxt_update(tcp_sk(sk), TCP_SKB_CB(skb)->end_seq);
 	if (!eaten) {
-		__skb_queue_tail(&sk->sk_receive_queue, skb);
+		tcp_add_receive_queue(sk, skb);
 		skb_set_owner_r(skb, sk);
 	}
 	return eaten;
@@ -5245,7 +5245,7 @@  static void tcp_data_queue(struct sock *sk, struct sk_buff *skb)
 		__kfree_skb(skb);
 		return;
 	}
-	skb_dst_drop(skb);
+	tcp_cleanup_skb(skb);
 	__skb_pull(skb, tcp_hdr(skb)->doff * 4);
 
 	reason = SKB_DROP_REASON_NOT_SPECIFIED;
@@ -6226,7 +6226,7 @@  void tcp_rcv_established(struct sock *sk, struct sk_buff *skb)
 			NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPHPHITS);
 
 			/* Bulk data transfer: receiver */
-			skb_dst_drop(skb);
+			tcp_cleanup_skb(skb);
 			__skb_pull(skb, tcp_header_len);
 			eaten = tcp_queue_rcv(sk, skb, &fragstolen);
 
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index cc2b5194a18d..2632844d2c35 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -2027,7 +2027,7 @@  bool tcp_add_backlog(struct sock *sk, struct sk_buff *skb,
 	 */
 	skb_condense(skb);
 
-	skb_dst_drop(skb);
+	tcp_cleanup_skb(skb);
 
 	if (unlikely(tcp_checksum_complete(skb))) {
 		bh_unlock_sock(sk);