
[net-next,2/2] tcp: replace TCP_SKB_CB(skb)->tcp_tw_isn with a per-cpu field

Message ID 20240407093322.3172088-3-edumazet@google.com (mailing list archive)
State Accepted
Commit 41eecbd712b73f0d5dcf1152b9a1c27b1f238028
Delegated to: Netdev Maintainers
Series tcp: fix ISN selection in TIMEWAIT -> SYN_RECV

Checks

Context Check Description
netdev/series_format success Posting correctly formatted
netdev/tree_selection success Clearly marked for net-next, async
netdev/ynl success Generated files up to date; no warnings/errors; no diff in generated;
netdev/fixes_present success Fixes tag not required for -next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 1846 this patch: 1846
netdev/build_tools success Errors and warnings before: 0 this patch: 0
netdev/cc_maintainers warning 1 maintainers not CCed: dsahern@kernel.org
netdev/build_clang success Errors and warnings before: 973 this patch: 973
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success Fixes tag looks correct
netdev/build_allmodconfig_warn success Errors and warnings before: 1882 this patch: 1882
netdev/checkpatch success total: 0 errors, 0 warnings, 0 checks, 158 lines checked
netdev/build_clang_rust success No Rust files in patch. Skipping build
netdev/kdoc success Errors and warnings before: 4 this patch: 4
netdev/source_inline success Was 0 now: 0
netdev/contest success net-next-2024-04-08--03-00 (tests: 956)

Commit Message

Eric Dumazet April 7, 2024, 9:33 a.m. UTC
TCP can transform a TIMEWAIT socket into a SYN_RECV one from
a SYN packet, and the ISN of the SYNACK packet is normally
generated using TIMEWAIT tw_snd_nxt :

tcp_timewait_state_process()
...
    u32 isn = tcptw->tw_snd_nxt + 65535 + 2;
    if (isn == 0)
        isn++;
    TCP_SKB_CB(skb)->tcp_tw_isn = isn;
    return TCP_TW_SYN;

This SYN packet also bypasses the normal checks on whether the listen
queue is full.

tcp_conn_request()
...
       __u32 isn = TCP_SKB_CB(skb)->tcp_tw_isn;
...
        /* TW buckets are converted to open requests without
         * limitations, they conserve resources and peer is
         * evidently real one.
         */
        if ((syncookies == 2 || inet_csk_reqsk_queue_is_full(sk)) && !isn) {
                want_cookie = tcp_syn_flood_action(sk, rsk_ops->slab_name);
                if (!want_cookie)
                        goto drop;
        }

This mechanism used the TCP_SKB_CB(skb)->tcp_tw_isn field in the skb.

Unfortunately, this field was accidentally cleared after the call to
tcp_timewait_state_process() returned TCP_TW_SYN.

Using a field in TCP_SKB_CB(skb) for a temporary state
is overkill.

Switch instead to a per-cpu variable.

As a bonus, we no longer have to clear tcp_tw_isn in the TCP receive
fast path; it is temporarily set, then cleared, only in the TCP_TW_SYN dance.

Fixes: 4ad19de8774e ("net: tcp6: fix double call of tcp_v6_fill_cb()")
Fixes: eeea10b83a13 ("tcp: add tcp_v4_fill_cb()/tcp_v4_restore_cb()")
Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 include/net/tcp.h        | 10 +++++-----
 net/ipv4/tcp.c           |  3 +++
 net/ipv4/tcp_input.c     | 26 ++++++++++++++++----------
 net/ipv4/tcp_ipv4.c      |  5 +++--
 net/ipv4/tcp_minisocks.c |  4 ++--
 net/ipv6/tcp_ipv6.c      |  5 +++--
 6 files changed, 32 insertions(+), 21 deletions(-)
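
In short, the hand-off introduced by the patch works like this: the TIME_WAIT
receive path writes the chosen ISN into the per-cpu variable just before
re-dispatching the SYN to the listener, and tcp_conn_request() consumes it and
always leaves it cleared. The following minimal, self-contained sketch models
that discipline with a thread-local variable standing in for the kernel's
per-cpu slot; the function and variable names are illustrative, not the kernel
code itself.

#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Stand-in for DEFINE_PER_CPU(u32, tcp_tw_isn): one slot per CPU,
 * modeled here as a thread-local variable.
 */
static _Thread_local uint32_t tw_isn_slot;

/* Producer: the TIME_WAIT path picks the ISN and parks it in the
 * slot right before the SYN is re-dispatched to the listener.
 */
static void timewait_handoff(uint32_t tw_snd_nxt)
{
	uint32_t isn = tw_snd_nxt + 65535 + 2;

	if (isn == 0)
		isn++;
	tw_isn_slot = isn;		/* __this_cpu_write(tcp_tw_isn, isn) */
}

/* Consumer: tcp_conn_request() reads the slot and always leaves it
 * cleared, so a stale value can never reach an unrelated SYN.
 */
static uint32_t conn_request_isn(void)
{
	uint32_t isn = tw_isn_slot;	/* __this_cpu_read(tcp_tw_isn) */

	if (isn)
		tw_isn_slot = 0;	/* __this_cpu_write(tcp_tw_isn, 0) */
	return isn;
}

int main(void)
{
	timewait_handoff(0x12345678);
	printf("ISN handed off: %" PRIu32 "\n", conn_request_isn());

	/* The next, unrelated SYN sees 0 and takes the normal
	 * syncookie / accept-queue checks.
	 */
	assert(conn_request_isn() == 0);
	return 0;
}

In the common case the write and the read run back to back on the same CPU in
BH context, so the per-cpu slot behaves like this thread-local stand-in; the
rare backlogged-SYN case is discussed in the thread below.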

Comments

Paolo Abeni April 9, 2024, 9:58 a.m. UTC | #1
On Sun, 2024-04-07 at 09:33 +0000, Eric Dumazet wrote:
> TCP can transform a TIMEWAIT socket into a SYN_RECV one from
> a SYN packet, and the ISN of the SYNACK packet is normally
> generated using TIMEWAIT tw_snd_nxt :
> 
> tcp_timewait_state_process()
> ...
>     u32 isn = tcptw->tw_snd_nxt + 65535 + 2;
>     if (isn == 0)
>         isn++;
>     TCP_SKB_CB(skb)->tcp_tw_isn = isn;
>     return TCP_TW_SYN;
> 
> This SYN packet also bypasses normal checks against listen queue
> being full or not.
> 
> tcp_conn_request()
> ...
>        __u32 isn = TCP_SKB_CB(skb)->tcp_tw_isn;
> ...
>         /* TW buckets are converted to open requests without
>          * limitations, they conserve resources and peer is
>          * evidently real one.
>          */
>         if ((syncookies == 2 || inet_csk_reqsk_queue_is_full(sk)) && !isn) {
>                 want_cookie = tcp_syn_flood_action(sk, rsk_ops->slab_name);
>                 if (!want_cookie)
>                         goto drop;
>         }
> 
> This was using TCP_SKB_CB(skb)->tcp_tw_isn field in skb.
> 
> Unfortunately this field has been accidentally cleared
> after the call to tcp_timewait_state_process() returning
> TCP_TW_SYN.
> 
> Using a field in TCP_SKB_CB(skb) for a temporary state
> is overkill.
> 
> Switch instead to a per-cpu variable.

I guess that pushing the info via a local variable all the way down to
tcp_conn_request() would be cumbersome, and would prevent the fast-path
optimization, right?

> As a bonus, we do not have to clear tcp_tw_isn in TCP receive
> fast path.
> It is temporarily set then cleared only in the TCP_TW_SYN dance.
> 
> Fixes: 4ad19de8774e ("net: tcp6: fix double call of tcp_v6_fill_cb()")
> Fixes: eeea10b83a13 ("tcp: add tcp_v4_fill_cb()/tcp_v4_restore_cb()")
> Signed-off-by: Eric Dumazet <edumazet@google.com>

[...]

> @@ -2397,6 +2397,7 @@ int tcp_v4_rcv(struct sk_buff *skb)
>  			sk = sk2;
>  			tcp_v4_restore_cb(skb);
>  			refcounted = false;
> +			__this_cpu_write(tcp_tw_isn, isn);
>  			goto process;

Unrelated to this patch, but I think the 'process' label could be
moved down, skipping a couple of conditionals. 'sk' is a listener
socket, so checking for TW or SYN_RECV should not be needed, right?

Thanks!

Paolo
Eric Dumazet April 9, 2024, 10:11 a.m. UTC | #2
On Tue, Apr 9, 2024 at 11:58 AM Paolo Abeni <pabeni@redhat.com> wrote:
>
> On Sun, 2024-04-07 at 09:33 +0000, Eric Dumazet wrote:
> > TCP can transform a TIMEWAIT socket into a SYN_RECV one from
> > a SYN packet, and the ISN of the SYNACK packet is normally
> > generated using TIMEWAIT tw_snd_nxt :
> >
> > tcp_timewait_state_process()
> > ...
> >     u32 isn = tcptw->tw_snd_nxt + 65535 + 2;
> >     if (isn == 0)
> >         isn++;
> >     TCP_SKB_CB(skb)->tcp_tw_isn = isn;
> >     return TCP_TW_SYN;
> >
> > This SYN packet also bypasses normal checks against listen queue
> > being full or not.
> >
> > tcp_conn_request()
> > ...
> >        __u32 isn = TCP_SKB_CB(skb)->tcp_tw_isn;
> > ...
> >         /* TW buckets are converted to open requests without
> >          * limitations, they conserve resources and peer is
> >          * evidently real one.
> >          */
> >         if ((syncookies == 2 || inet_csk_reqsk_queue_is_full(sk)) && !isn) {
> >                 want_cookie = tcp_syn_flood_action(sk, rsk_ops->slab_name);
> >                 if (!want_cookie)
> >                         goto drop;
> >         }
> >
> > This was using TCP_SKB_CB(skb)->tcp_tw_isn field in skb.
> >
> > Unfortunately this field has been accidentally cleared
> > after the call to tcp_timewait_state_process() returning
> > TCP_TW_SYN.
> >
> > Using a field in TCP_SKB_CB(skb) for a temporary state
> > is overkill.
> >
> > Switch instead to a per-cpu variable.
>
> I guess that pushing the info via a local variable all the way down to
> tcp_conn_request would be cumbersome, and will prevent the fast path
> optimization, right?

Right, I tried two other variants of the fix before coming to this.

One of the issues I had is that packets eventually going through the
socket backlog can trigger this path, so I would have to carry a local
variable in a lot of paths.

References:

commit 449809a66c1d0b1563dee84493e14bf3104d2d7e ("tcp/dccp: block BH for SYN processing")
commit 1ad98e9d1bdf4724c0a8532fabd84bf3c457c2bc ("tcp/dccp: fix lockdep issue when SYN is backlogged")

I chose to not care about this tricky case (vs tcp_tw_syn), by making sure
to always leave the per-cpu variable cleared after each tcp_conn_request().

This was also a nice way of not having to clear a field/variable for each
incoming TCP packet :)
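
For reference, routing the value down as a local variable would have meant
adding an extra parameter along roughly this call chain (a sketch of the
plumbing that was avoided, for both address families, with tcp_v*_do_rcv()
also reachable from the socket backlog in release_sock()):

tcp_v4_rcv() / tcp_v6_rcv()                /* TW_SYN path picks the isn */
  -> tcp_v4_do_rcv() / tcp_v6_do_rcv()     /* also the backlog entry point */
    -> tcp_rcv_state_process()             /* TCP_LISTEN handling */
      -> icsk_af_ops->conn_request()       /* tcp_v4/v6_conn_request() */
        -> tcp_conn_request()              /* consumer of the isn */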

>
> > As a bonus, we do not have to clear tcp_tw_isn in TCP receive
> > fast path.
> > It is temporarily set then cleared only in the TCP_TW_SYN dance.
> >
> > Fixes: 4ad19de8774e ("net: tcp6: fix double call of tcp_v6_fill_cb()")
> > Fixes: eeea10b83a13 ("tcp: add tcp_v4_fill_cb()/tcp_v4_restore_cb()")
> > Signed-off-by: Eric Dumazet <edumazet@google.com>
>
> [...]
>
> > @@ -2397,6 +2397,7 @@ int tcp_v4_rcv(struct sk_buff *skb)
> >                       sk = sk2;
> >                       tcp_v4_restore_cb(skb);
> >                       refcounted = false;
> > +                     __this_cpu_write(tcp_tw_isn, isn);
> >                       goto process;
>
> Unrelated from this patch, but I think the 'process' label could be
> moved down skipping a couple of conditionals. 'sk' is a listener
> socket, checking for TW or SYN_RECV should not be needed, right?

Absolutely!

Patch

diff --git a/include/net/tcp.h b/include/net/tcp.h
index fa0ab77acee23654b22e97615de983fc04eee319..ba6c5ae86e228a1633feaf39b0f1e053c3832b08 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -52,6 +52,8 @@  extern struct inet_hashinfo tcp_hashinfo;
 DECLARE_PER_CPU(unsigned int, tcp_orphan_count);
 int tcp_orphan_count_sum(void);
 
+DECLARE_PER_CPU(u32, tcp_tw_isn);
+
 void tcp_time_wait(struct sock *sk, int state, int timeo);
 
 #define MAX_TCP_HEADER	L1_CACHE_ALIGN(128 + MAX_HEADER)
@@ -392,7 +394,8 @@  enum tcp_tw_status {
 
 enum tcp_tw_status tcp_timewait_state_process(struct inet_timewait_sock *tw,
 					      struct sk_buff *skb,
-					      const struct tcphdr *th);
+					      const struct tcphdr *th,
+					      u32 *tw_isn);
 struct sock *tcp_check_req(struct sock *sk, struct sk_buff *skb,
 			   struct request_sock *req, bool fastopen,
 			   bool *lost_race);
@@ -935,13 +938,10 @@  struct tcp_skb_cb {
 	__u32		seq;		/* Starting sequence number	*/
 	__u32		end_seq;	/* SEQ + FIN + SYN + datalen	*/
 	union {
-		/* Note : tcp_tw_isn is used in input path only
-		 *	  (isn chosen by tcp_timewait_state_process())
-		 *
+		/* Note :
 		 * 	  tcp_gso_segs/size are used in write queue only,
 		 *	  cf tcp_skb_pcount()/tcp_skb_mss()
 		 */
-		__u32		tcp_tw_isn;
 		struct {
 			u16	tcp_gso_segs;
 			u16	tcp_gso_size;
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 92ee60492314a1483cfbfa2f73d32fcad5632773..6cb5b9f74c94b72fec1d6a3cc8e6dc75bd61ba2f 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -290,6 +290,9 @@  enum {
 DEFINE_PER_CPU(unsigned int, tcp_orphan_count);
 EXPORT_PER_CPU_SYMBOL_GPL(tcp_orphan_count);
 
+DEFINE_PER_CPU(u32, tcp_tw_isn);
+EXPORT_PER_CPU_SYMBOL_GPL(tcp_tw_isn);
+
 long sysctl_tcp_mem[3] __read_mostly;
 EXPORT_SYMBOL(sysctl_tcp_mem);
 
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 48c275e6ef02bfc5dd98f0878c752841f949c714..5a45a0923a1f058cdc80255be0f76a71fd102d4d 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -7097,7 +7097,6 @@  int tcp_conn_request(struct request_sock_ops *rsk_ops,
 		     struct sock *sk, struct sk_buff *skb)
 {
 	struct tcp_fastopen_cookie foc = { .len = -1 };
-	__u32 isn = TCP_SKB_CB(skb)->tcp_tw_isn;
 	struct tcp_options_received tmp_opt;
 	struct tcp_sock *tp = tcp_sk(sk);
 	struct net *net = sock_net(sk);
@@ -7107,21 +7106,28 @@  int tcp_conn_request(struct request_sock_ops *rsk_ops,
 	struct dst_entry *dst;
 	struct flowi fl;
 	u8 syncookies;
+	u32 isn;
 
 #ifdef CONFIG_TCP_AO
 	const struct tcp_ao_hdr *aoh;
 #endif
 
-	syncookies = READ_ONCE(net->ipv4.sysctl_tcp_syncookies);
+	isn = __this_cpu_read(tcp_tw_isn);
+	if (isn) {
+		/* TW buckets are converted to open requests without
+		 * limitations, they conserve resources and peer is
+		 * evidently real one.
+		 */
+		__this_cpu_write(tcp_tw_isn, 0);
+	} else {
+		syncookies = READ_ONCE(net->ipv4.sysctl_tcp_syncookies);
 
-	/* TW buckets are converted to open requests without
-	 * limitations, they conserve resources and peer is
-	 * evidently real one.
-	 */
-	if ((syncookies == 2 || inet_csk_reqsk_queue_is_full(sk)) && !isn) {
-		want_cookie = tcp_syn_flood_action(sk, rsk_ops->slab_name);
-		if (!want_cookie)
-			goto drop;
+		if (syncookies == 2 || inet_csk_reqsk_queue_is_full(sk)) {
+			want_cookie = tcp_syn_flood_action(sk,
+							   rsk_ops->slab_name);
+			if (!want_cookie)
+				goto drop;
+		}
 	}
 
 	if (sk_acceptq_is_full(sk)) {
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 81e2f05c244d1671980a34bb756f528f3e6debcc..1e650ec71d2fe5198b9dad9e6ea9c5eaf868277f 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -2146,7 +2146,6 @@  static void tcp_v4_fill_cb(struct sk_buff *skb, const struct iphdr *iph,
 				    skb->len - th->doff * 4);
 	TCP_SKB_CB(skb)->ack_seq = ntohl(th->ack_seq);
 	TCP_SKB_CB(skb)->tcp_flags = tcp_flag_byte(th);
-	TCP_SKB_CB(skb)->tcp_tw_isn = 0;
 	TCP_SKB_CB(skb)->ip_dsfield = ipv4_get_dsfield(iph);
 	TCP_SKB_CB(skb)->sacked	 = 0;
 	TCP_SKB_CB(skb)->has_rxtstamp =
@@ -2168,6 +2167,7 @@  int tcp_v4_rcv(struct sk_buff *skb)
 	bool refcounted;
 	struct sock *sk;
 	int ret;
+	u32 isn;
 
 	drop_reason = SKB_DROP_REASON_NOT_SPECIFIED;
 	if (skb->pkt_type != PACKET_HOST)
@@ -2383,7 +2383,7 @@  int tcp_v4_rcv(struct sk_buff *skb)
 		inet_twsk_put(inet_twsk(sk));
 		goto csum_error;
 	}
-	switch (tcp_timewait_state_process(inet_twsk(sk), skb, th)) {
+	switch (tcp_timewait_state_process(inet_twsk(sk), skb, th, &isn)) {
 	case TCP_TW_SYN: {
 		struct sock *sk2 = inet_lookup_listener(net,
 							net->ipv4.tcp_death_row.hashinfo,
@@ -2397,6 +2397,7 @@  int tcp_v4_rcv(struct sk_buff *skb)
 			sk = sk2;
 			tcp_v4_restore_cb(skb);
 			refcounted = false;
+			__this_cpu_write(tcp_tw_isn, isn);
 			goto process;
 		}
 	}
diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
index 5b21a07ddf9aa5593d21cb856f0e0ea2f45b1eef..f53c7ada2ace4219917e75f806f39a00d5ab0123 100644
--- a/net/ipv4/tcp_minisocks.c
+++ b/net/ipv4/tcp_minisocks.c
@@ -95,7 +95,7 @@  static void twsk_rcv_nxt_update(struct tcp_timewait_sock *tcptw, u32 seq)
  */
 enum tcp_tw_status
 tcp_timewait_state_process(struct inet_timewait_sock *tw, struct sk_buff *skb,
-			   const struct tcphdr *th)
+			   const struct tcphdr *th, u32 *tw_isn)
 {
 	struct tcp_options_received tmp_opt;
 	struct tcp_timewait_sock *tcptw = tcp_twsk((struct sock *)tw);
@@ -228,7 +228,7 @@  tcp_timewait_state_process(struct inet_timewait_sock *tw, struct sk_buff *skb,
 		u32 isn = tcptw->tw_snd_nxt + 65535 + 2;
 		if (isn == 0)
 			isn++;
-		TCP_SKB_CB(skb)->tcp_tw_isn = isn;
+		*tw_isn = isn;
 		return TCP_TW_SYN;
 	}
 
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index 5141f7033abd8bb03bc4e162066ca4befe343bdc..3aa9da5c9a669d2754b421cfb704ad28def5a748 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -1741,7 +1741,6 @@  static void tcp_v6_fill_cb(struct sk_buff *skb, const struct ipv6hdr *hdr,
 				    skb->len - th->doff*4);
 	TCP_SKB_CB(skb)->ack_seq = ntohl(th->ack_seq);
 	TCP_SKB_CB(skb)->tcp_flags = tcp_flag_byte(th);
-	TCP_SKB_CB(skb)->tcp_tw_isn = 0;
 	TCP_SKB_CB(skb)->ip_dsfield = ipv6_get_dsfield(hdr);
 	TCP_SKB_CB(skb)->sacked = 0;
 	TCP_SKB_CB(skb)->has_rxtstamp =
@@ -1758,6 +1757,7 @@  INDIRECT_CALLABLE_SCOPE int tcp_v6_rcv(struct sk_buff *skb)
 	bool refcounted;
 	struct sock *sk;
 	int ret;
+	u32 isn;
 	struct net *net = dev_net(skb->dev);
 
 	drop_reason = SKB_DROP_REASON_NOT_SPECIFIED;
@@ -1967,7 +1967,7 @@  INDIRECT_CALLABLE_SCOPE int tcp_v6_rcv(struct sk_buff *skb)
 		goto csum_error;
 	}
 
-	switch (tcp_timewait_state_process(inet_twsk(sk), skb, th)) {
+	switch (tcp_timewait_state_process(inet_twsk(sk), skb, th, &isn)) {
 	case TCP_TW_SYN:
 	{
 		struct sock *sk2;
@@ -1985,6 +1985,7 @@  INDIRECT_CALLABLE_SCOPE int tcp_v6_rcv(struct sk_buff *skb)
 			sk = sk2;
 			tcp_v6_restore_cb(skb);
 			refcounted = false;
+			__this_cpu_write(tcp_tw_isn, isn);
 			goto process;
 		}
 	}