diff mbox series

[net-next] tcp: avoid RST in 3-way shakehands due to failure in tcp_timewait_state_process

Message ID 20241105025511.42652-1-kerneljasonxing@gmail.com (mailing list archive)
State Changes Requested
Delegated to: Netdev Maintainers
Headers show
Series [net-next] tcp: avoid RST in 3-way shakehands due to failure in tcp_timewait_state_process | expand

Checks

Context Check Description
netdev/series_format success Single patches do not need cover letters
netdev/tree_selection success Clearly marked for net-next
netdev/ynl success Generated files up to date; no warnings/errors; no diff in generated;
netdev/fixes_present success Fixes tag not required for -next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 3 this patch: 3
netdev/build_tools success No tools touched, skip
netdev/cc_maintainers success CCed 6 of 6 maintainers
netdev/build_clang success Errors and warnings before: 3 this patch: 3
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn success Errors and warnings before: 4 this patch: 4
netdev/checkpatch warning WARNING: line length of 81 exceeds 80 columns
netdev/build_clang_rust success No Rust files in patch. Skipping build
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0
netdev/contest success net-next-2024-11-05--21-00 (tests: 779)

Commit Message

Jason Xing Nov. 5, 2024, 2:55 a.m. UTC
From: Jason Xing <kernelxing@tencent.com>

We found there are rare chances that some RST packets appear during
the shakehands because the timewait socket cannot accept the SYN and
doesn't return TCP_TW_SYN in tcp_timewait_state_process().

Here is how things happen in production:
Time        Client(A)        Server(B)
0s          SYN-->
...
132s                         <-- FIN
...
169s        FIN-->
169s                         <-- ACK
169s        SYN-->
169s                         <-- ACK
169s        RST-->
As above picture shows, the two flows have a start time difference
of 169 seconds. B starts to send FIN so it will finally enter into
TIMEWAIT state. Nearly at the same time A launches a new connection
that soon is reset by itself due to receiving a ACK.

There are two key checks in tcp_timewait_state_process() when timewait
socket in B receives the SYN packet:
1) after(TCP_SKB_CB(skb)->seq, rcv_nxt)
2) (s32)(READ_ONCE(tcptw->tw_ts_recent) - tmp_opt.rcv_tsval) < 0)

Regarding the first rule, it fails as expected because in the first
connection the seq of SYN sent from A is 1892994276, then 169s have
passed, the second SYN has 239034613 (caused by overflow of s32).

Then how about the second rule?
It fails again!
Let's take a look at how the tsval comes out:
__tcp_transmit_skb()
    -> tcp_syn_options()
        -> opts->tsval = tcp_skb_timestamp_ts(tp->tcp_usec_ts, skb) + tp->tsoffset;
The timestamp depends on two things, one is skb->skb_mstamp_ns, the
other is tp->tsoffset. The latter value is fixed, so we don't need
to care about it. If both operations (sending FIN and then starting
sending SYN) from A happen in 1ms, then the tsval would be the same.
It can be clearly seen in the tcpdump log. Notice that the tsval is
with millisecond precision.

Based on the above analysis, I decided to make a small change to
the check in tcp_timewait_state_process() so that the second flow
would not fail.

Signed-off-by: Jason Xing <kernelxing@tencent.com>
---
 net/ipv4/tcp_minisocks.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Comments

Kuniyuki Iwashima Nov. 5, 2024, 7:49 a.m. UTC | #1
From: Jason Xing <kerneljasonxing@gmail.com>
Date: Tue,  5 Nov 2024 10:55:11 +0800
> From: Jason Xing <kernelxing@tencent.com>
> 
> We found there are rare chances that some RST packets appear during
> the shakehands because the timewait socket cannot accept the SYN and

s/shakehands/handshake/

same in the subject.

> doesn't return TCP_TW_SYN in tcp_timewait_state_process().
> 
> Here is how things happen in production:
> Time        Client(A)        Server(B)
> 0s          SYN-->
> ...
> 132s                         <-- FIN
> ...
> 169s        FIN-->
> 169s                         <-- ACK
> 169s        SYN-->
> 169s                         <-- ACK
> 169s        RST-->
> As above picture shows, the two flows have a start time difference
> of 169 seconds. B starts to send FIN so it will finally enter into
> TIMEWAIT state. Nearly at the same time A launches a new connection
> that soon is reset by itself due to receiving a ACK.
> 
> There are two key checks in tcp_timewait_state_process() when timewait
> socket in B receives the SYN packet:
> 1) after(TCP_SKB_CB(skb)->seq, rcv_nxt)
> 2) (s32)(READ_ONCE(tcptw->tw_ts_recent) - tmp_opt.rcv_tsval) < 0)
> 
> Regarding the first rule, it fails as expected because in the first
> connection the seq of SYN sent from A is 1892994276, then 169s have
> passed, the second SYN has 239034613 (caused by overflow of s32).
> 
> Then how about the second rule?
> It fails again!
> Let's take a look at how the tsval comes out:
> __tcp_transmit_skb()
>     -> tcp_syn_options()
>         -> opts->tsval = tcp_skb_timestamp_ts(tp->tcp_usec_ts, skb) + tp->tsoffset;
> The timestamp depends on two things, one is skb->skb_mstamp_ns, the
> other is tp->tsoffset. The latter value is fixed, so we don't need
> to care about it. If both operations (sending FIN and then starting
> sending SYN) from A happen in 1ms, then the tsval would be the same.
> It can be clearly seen in the tcpdump log. Notice that the tsval is
> with millisecond precision.
> 
> Based on the above analysis, I decided to make a small change to
> the check in tcp_timewait_state_process() so that the second flow
> would not fail.
> 
> Signed-off-by: Jason Xing <kernelxing@tencent.com>
> ---
>  net/ipv4/tcp_minisocks.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
> index bb1fe1ba867a..2b29d1bf5ca0 100644
> --- a/net/ipv4/tcp_minisocks.c
> +++ b/net/ipv4/tcp_minisocks.c
> @@ -234,7 +234,7 @@ tcp_timewait_state_process(struct inet_timewait_sock *tw, struct sk_buff *skb,
>  	if (th->syn && !th->rst && !th->ack && !paws_reject &&
>  	    (after(TCP_SKB_CB(skb)->seq, rcv_nxt) ||
>  	     (tmp_opt.saw_tstamp &&
> -	      (s32)(READ_ONCE(tcptw->tw_ts_recent) - tmp_opt.rcv_tsval) < 0))) {

I think this follows RFC 6191 and such a change requires a formal
discussion at IETF.

https://datatracker.ietf.org/doc/html/rfc6191#section-2

---8<---
      *  If TCP Timestamps would be enabled for the new incarnation of
         the connection, and the timestamp contained in the incoming SYN
         segment is greater than the last timestamp seen on the previous
         incarnation of the connection (for that direction of the data
         transfer), honor the connection request (creating a connection
         in the SYN-RECEIVED state).

      *  If TCP Timestamps would be enabled for the new incarnation of
         the connection, the timestamp contained in the incoming SYN
         segment is equal to the last timestamp seen on the previous
         incarnation of the connection (for that direction of the data
         transfer), and the Sequence Number of the incoming SYN segment
         is greater than the last sequence number seen on the previous
         incarnation of the connection (for that direction of the data
         transfer), honor the connection request (creating a connection
         in the SYN-RECEIVED state).
---8<---

> +	      (s32)(READ_ONCE(tcptw->tw_ts_recent) - tmp_opt.rcv_tsval) <= 0))) {
>  		u32 isn = tcptw->tw_snd_nxt + 65535 + 2;
>  		if (isn == 0)
>  			isn++;
> -- 
> 2.37.3
Jason Xing Nov. 5, 2024, 9:08 a.m. UTC | #2
On Tue, Nov 5, 2024 at 3:49 PM Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
>
> From: Jason Xing <kerneljasonxing@gmail.com>
> Date: Tue,  5 Nov 2024 10:55:11 +0800
> > From: Jason Xing <kernelxing@tencent.com>
> >
> > We found there are rare chances that some RST packets appear during
> > the shakehands because the timewait socket cannot accept the SYN and
>
> s/shakehands/handshake/
>
> same in the subject.
>
> > doesn't return TCP_TW_SYN in tcp_timewait_state_process().
> >
> > Here is how things happen in production:
> > Time        Client(A)        Server(B)
> > 0s          SYN-->
> > ...
> > 132s                         <-- FIN
> > ...
> > 169s        FIN-->
> > 169s                         <-- ACK
> > 169s        SYN-->
> > 169s                         <-- ACK
> > 169s        RST-->
> > As above picture shows, the two flows have a start time difference
> > of 169 seconds. B starts to send FIN so it will finally enter into
> > TIMEWAIT state. Nearly at the same time A launches a new connection
> > that soon is reset by itself due to receiving a ACK.
> >
> > There are two key checks in tcp_timewait_state_process() when timewait
> > socket in B receives the SYN packet:
> > 1) after(TCP_SKB_CB(skb)->seq, rcv_nxt)
> > 2) (s32)(READ_ONCE(tcptw->tw_ts_recent) - tmp_opt.rcv_tsval) < 0)
> >
> > Regarding the first rule, it fails as expected because in the first
> > connection the seq of SYN sent from A is 1892994276, then 169s have
> > passed, the second SYN has 239034613 (caused by overflow of s32).
> >
> > Then how about the second rule?
> > It fails again!
> > Let's take a look at how the tsval comes out:
> > __tcp_transmit_skb()
> >     -> tcp_syn_options()
> >         -> opts->tsval = tcp_skb_timestamp_ts(tp->tcp_usec_ts, skb) + tp->tsoffset;
> > The timestamp depends on two things, one is skb->skb_mstamp_ns, the
> > other is tp->tsoffset. The latter value is fixed, so we don't need
> > to care about it. If both operations (sending FIN and then starting
> > sending SYN) from A happen in 1ms, then the tsval would be the same.
> > It can be clearly seen in the tcpdump log. Notice that the tsval is
> > with millisecond precision.
> >
> > Based on the above analysis, I decided to make a small change to
> > the check in tcp_timewait_state_process() so that the second flow
> > would not fail.
> >
> > Signed-off-by: Jason Xing <kernelxing@tencent.com>
> > ---
> >  net/ipv4/tcp_minisocks.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
> > index bb1fe1ba867a..2b29d1bf5ca0 100644
> > --- a/net/ipv4/tcp_minisocks.c
> > +++ b/net/ipv4/tcp_minisocks.c
> > @@ -234,7 +234,7 @@ tcp_timewait_state_process(struct inet_timewait_sock *tw, struct sk_buff *skb,
> >       if (th->syn && !th->rst && !th->ack && !paws_reject &&
> >           (after(TCP_SKB_CB(skb)->seq, rcv_nxt) ||
> >            (tmp_opt.saw_tstamp &&
> > -           (s32)(READ_ONCE(tcptw->tw_ts_recent) - tmp_opt.rcv_tsval) < 0))) {
>
> I think this follows RFC 6191 and such a change requires a formal
> discussion at IETF.
>
> https://datatracker.ietf.org/doc/html/rfc6191#section-2
>
> ---8<---
>       *  If TCP Timestamps would be enabled for the new incarnation of
>          the connection, and the timestamp contained in the incoming SYN
>          segment is greater than the last timestamp seen on the previous

The true thing is that the timestamp of the SYN packet is greater than
that of the last packet, but the kernel implementation uses ms
precision (please see tcp_skb_timestamp_ts()). That function makes
those two timestamps the same.

This case happens as expected, so the second connection should be
established. My confusion just popped out of my mind: what rules
should we follow to stop the second flow?

Thanks,
Jason
Eric Dumazet Nov. 5, 2024, 9:35 a.m. UTC | #3
On Tue, Nov 5, 2024 at 10:08 AM Jason Xing <kerneljasonxing@gmail.com> wrote:
>
> On Tue, Nov 5, 2024 at 3:49 PM Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
> >
> > From: Jason Xing <kerneljasonxing@gmail.com>
> > Date: Tue,  5 Nov 2024 10:55:11 +0800
> > > From: Jason Xing <kernelxing@tencent.com>
> > >
> > > We found there are rare chances that some RST packets appear during
> > > the shakehands because the timewait socket cannot accept the SYN and
> >
> > s/shakehands/handshake/
> >
> > same in the subject.
> >
> > > doesn't return TCP_TW_SYN in tcp_timewait_state_process().
> > >
> > > Here is how things happen in production:
> > > Time        Client(A)        Server(B)
> > > 0s          SYN-->
> > > ...
> > > 132s                         <-- FIN
> > > ...
> > > 169s        FIN-->
> > > 169s                         <-- ACK
> > > 169s        SYN-->
> > > 169s                         <-- ACK
> > > 169s        RST-->
> > > As above picture shows, the two flows have a start time difference
> > > of 169 seconds. B starts to send FIN so it will finally enter into
> > > TIMEWAIT state. Nearly at the same time A launches a new connection
> > > that soon is reset by itself due to receiving a ACK.
> > >
> > > There are two key checks in tcp_timewait_state_process() when timewait
> > > socket in B receives the SYN packet:
> > > 1) after(TCP_SKB_CB(skb)->seq, rcv_nxt)
> > > 2) (s32)(READ_ONCE(tcptw->tw_ts_recent) - tmp_opt.rcv_tsval) < 0)
> > >
> > > Regarding the first rule, it fails as expected because in the first
> > > connection the seq of SYN sent from A is 1892994276, then 169s have
> > > passed, the second SYN has 239034613 (caused by overflow of s32).
> > >
> > > Then how about the second rule?
> > > It fails again!
> > > Let's take a look at how the tsval comes out:
> > > __tcp_transmit_skb()
> > >     -> tcp_syn_options()
> > >         -> opts->tsval = tcp_skb_timestamp_ts(tp->tcp_usec_ts, skb) + tp->tsoffset;
> > > The timestamp depends on two things, one is skb->skb_mstamp_ns, the
> > > other is tp->tsoffset. The latter value is fixed, so we don't need
> > > to care about it. If both operations (sending FIN and then starting
> > > sending SYN) from A happen in 1ms, then the tsval would be the same.
> > > It can be clearly seen in the tcpdump log. Notice that the tsval is
> > > with millisecond precision.
> > >
> > > Based on the above analysis, I decided to make a small change to
> > > the check in tcp_timewait_state_process() so that the second flow
> > > would not fail.
> > >
> > > Signed-off-by: Jason Xing <kernelxing@tencent.com>
> > > ---
> > >  net/ipv4/tcp_minisocks.c | 2 +-
> > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > >
> > > diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
> > > index bb1fe1ba867a..2b29d1bf5ca0 100644
> > > --- a/net/ipv4/tcp_minisocks.c
> > > +++ b/net/ipv4/tcp_minisocks.c
> > > @@ -234,7 +234,7 @@ tcp_timewait_state_process(struct inet_timewait_sock *tw, struct sk_buff *skb,
> > >       if (th->syn && !th->rst && !th->ack && !paws_reject &&
> > >           (after(TCP_SKB_CB(skb)->seq, rcv_nxt) ||
> > >            (tmp_opt.saw_tstamp &&
> > > -           (s32)(READ_ONCE(tcptw->tw_ts_recent) - tmp_opt.rcv_tsval) < 0))) {
> >
> > I think this follows RFC 6191 and such a change requires a formal
> > discussion at IETF.
> >
> > https://datatracker.ietf.org/doc/html/rfc6191#section-2
> >
> > ---8<---
> >       *  If TCP Timestamps would be enabled for the new incarnation of
> >          the connection, and the timestamp contained in the incoming SYN
> >          segment is greater than the last timestamp seen on the previous
>
> The true thing is that the timestamp of the SYN packet is greater than
> that of the last packet, but the kernel implementation uses ms
> precision (please see tcp_skb_timestamp_ts()). That function makes
> those two timestamps the same.
>
> This case happens as expected, so the second connection should be
> established. My confusion just popped out of my mind: what rules
> should we follow to stop the second flow?

Note that linux TCP stack can use usec resolution for TCP TS values.

You might adopt it, and no longer care about this ms granularity.
Jason Xing Nov. 5, 2024, 9:41 a.m. UTC | #4
On Tue, Nov 5, 2024 at 5:35 PM Eric Dumazet <edumazet@google.com> wrote:
>
> On Tue, Nov 5, 2024 at 10:08 AM Jason Xing <kerneljasonxing@gmail.com> wrote:
> >
> > On Tue, Nov 5, 2024 at 3:49 PM Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
> > >
> > > From: Jason Xing <kerneljasonxing@gmail.com>
> > > Date: Tue,  5 Nov 2024 10:55:11 +0800
> > > > From: Jason Xing <kernelxing@tencent.com>
> > > >
> > > > We found there are rare chances that some RST packets appear during
> > > > the shakehands because the timewait socket cannot accept the SYN and
> > >
> > > s/shakehands/handshake/
> > >
> > > same in the subject.
> > >
> > > > doesn't return TCP_TW_SYN in tcp_timewait_state_process().
> > > >
> > > > Here is how things happen in production:
> > > > Time        Client(A)        Server(B)
> > > > 0s          SYN-->
> > > > ...
> > > > 132s                         <-- FIN
> > > > ...
> > > > 169s        FIN-->
> > > > 169s                         <-- ACK
> > > > 169s        SYN-->
> > > > 169s                         <-- ACK
> > > > 169s        RST-->
> > > > As above picture shows, the two flows have a start time difference
> > > > of 169 seconds. B starts to send FIN so it will finally enter into
> > > > TIMEWAIT state. Nearly at the same time A launches a new connection
> > > > that soon is reset by itself due to receiving a ACK.
> > > >
> > > > There are two key checks in tcp_timewait_state_process() when timewait
> > > > socket in B receives the SYN packet:
> > > > 1) after(TCP_SKB_CB(skb)->seq, rcv_nxt)
> > > > 2) (s32)(READ_ONCE(tcptw->tw_ts_recent) - tmp_opt.rcv_tsval) < 0)
> > > >
> > > > Regarding the first rule, it fails as expected because in the first
> > > > connection the seq of SYN sent from A is 1892994276, then 169s have
> > > > passed, the second SYN has 239034613 (caused by overflow of s32).
> > > >
> > > > Then how about the second rule?
> > > > It fails again!
> > > > Let's take a look at how the tsval comes out:
> > > > __tcp_transmit_skb()
> > > >     -> tcp_syn_options()
> > > >         -> opts->tsval = tcp_skb_timestamp_ts(tp->tcp_usec_ts, skb) + tp->tsoffset;
> > > > The timestamp depends on two things, one is skb->skb_mstamp_ns, the
> > > > other is tp->tsoffset. The latter value is fixed, so we don't need
> > > > to care about it. If both operations (sending FIN and then starting
> > > > sending SYN) from A happen in 1ms, then the tsval would be the same.
> > > > It can be clearly seen in the tcpdump log. Notice that the tsval is
> > > > with millisecond precision.
> > > >
> > > > Based on the above analysis, I decided to make a small change to
> > > > the check in tcp_timewait_state_process() so that the second flow
> > > > would not fail.
> > > >
> > > > Signed-off-by: Jason Xing <kernelxing@tencent.com>
> > > > ---
> > > >  net/ipv4/tcp_minisocks.c | 2 +-
> > > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > > >
> > > > diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
> > > > index bb1fe1ba867a..2b29d1bf5ca0 100644
> > > > --- a/net/ipv4/tcp_minisocks.c
> > > > +++ b/net/ipv4/tcp_minisocks.c
> > > > @@ -234,7 +234,7 @@ tcp_timewait_state_process(struct inet_timewait_sock *tw, struct sk_buff *skb,
> > > >       if (th->syn && !th->rst && !th->ack && !paws_reject &&
> > > >           (after(TCP_SKB_CB(skb)->seq, rcv_nxt) ||
> > > >            (tmp_opt.saw_tstamp &&
> > > > -           (s32)(READ_ONCE(tcptw->tw_ts_recent) - tmp_opt.rcv_tsval) < 0))) {
> > >
> > > I think this follows RFC 6191 and such a change requires a formal
> > > discussion at IETF.
> > >
> > > https://datatracker.ietf.org/doc/html/rfc6191#section-2
> > >
> > > ---8<---
> > >       *  If TCP Timestamps would be enabled for the new incarnation of
> > >          the connection, and the timestamp contained in the incoming SYN
> > >          segment is greater than the last timestamp seen on the previous
> >
> > The true thing is that the timestamp of the SYN packet is greater than
> > that of the last packet, but the kernel implementation uses ms
> > precision (please see tcp_skb_timestamp_ts()). That function makes
> > those two timestamps the same.
> >
> > This case happens as expected, so the second connection should be
> > established. My confusion just popped out of my mind: what rules
> > should we follow to stop the second flow?
>
> Note that linux TCP stack can use usec resolution for TCP TS values.
>
> You might adopt it, and no longer care about this ms granularity.

Right, I noticed this feature. I wonder if we can change the check in
tcp_timewait_state_process() like this patch if it has no side
effects? I'm worried that some programs don't use this feature. It's
the reason why I try to propose this patch to you.

Thanks,
Jason
Eric Dumazet Nov. 5, 2024, 9:50 a.m. UTC | #5
On Tue, Nov 5, 2024 at 10:42 AM Jason Xing <kerneljasonxing@gmail.com> wrote:
>
> On Tue, Nov 5, 2024 at 5:35 PM Eric Dumazet <edumazet@google.com> wrote:
> >
> > On Tue, Nov 5, 2024 at 10:08 AM Jason Xing <kerneljasonxing@gmail.com> wrote:
> > >
> > > On Tue, Nov 5, 2024 at 3:49 PM Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
> > > >
> > > > From: Jason Xing <kerneljasonxing@gmail.com>
> > > > Date: Tue,  5 Nov 2024 10:55:11 +0800
> > > > > From: Jason Xing <kernelxing@tencent.com>
> > > > >
> > > > > We found there are rare chances that some RST packets appear during
> > > > > the shakehands because the timewait socket cannot accept the SYN and
> > > >
> > > > s/shakehands/handshake/
> > > >
> > > > same in the subject.
> > > >
> > > > > doesn't return TCP_TW_SYN in tcp_timewait_state_process().
> > > > >
> > > > > Here is how things happen in production:
> > > > > Time        Client(A)        Server(B)
> > > > > 0s          SYN-->
> > > > > ...
> > > > > 132s                         <-- FIN
> > > > > ...
> > > > > 169s        FIN-->
> > > > > 169s                         <-- ACK
> > > > > 169s        SYN-->
> > > > > 169s                         <-- ACK
> > > > > 169s        RST-->
> > > > > As above picture shows, the two flows have a start time difference
> > > > > of 169 seconds. B starts to send FIN so it will finally enter into
> > > > > TIMEWAIT state. Nearly at the same time A launches a new connection
> > > > > that soon is reset by itself due to receiving a ACK.
> > > > >
> > > > > There are two key checks in tcp_timewait_state_process() when timewait
> > > > > socket in B receives the SYN packet:
> > > > > 1) after(TCP_SKB_CB(skb)->seq, rcv_nxt)
> > > > > 2) (s32)(READ_ONCE(tcptw->tw_ts_recent) - tmp_opt.rcv_tsval) < 0)
> > > > >
> > > > > Regarding the first rule, it fails as expected because in the first
> > > > > connection the seq of SYN sent from A is 1892994276, then 169s have
> > > > > passed, the second SYN has 239034613 (caused by overflow of s32).
> > > > >
> > > > > Then how about the second rule?
> > > > > It fails again!
> > > > > Let's take a look at how the tsval comes out:
> > > > > __tcp_transmit_skb()
> > > > >     -> tcp_syn_options()
> > > > >         -> opts->tsval = tcp_skb_timestamp_ts(tp->tcp_usec_ts, skb) + tp->tsoffset;
> > > > > The timestamp depends on two things, one is skb->skb_mstamp_ns, the
> > > > > other is tp->tsoffset. The latter value is fixed, so we don't need
> > > > > to care about it. If both operations (sending FIN and then starting
> > > > > sending SYN) from A happen in 1ms, then the tsval would be the same.
> > > > > It can be clearly seen in the tcpdump log. Notice that the tsval is
> > > > > with millisecond precision.
> > > > >
> > > > > Based on the above analysis, I decided to make a small change to
> > > > > the check in tcp_timewait_state_process() so that the second flow
> > > > > would not fail.
> > > > >
> > > > > Signed-off-by: Jason Xing <kernelxing@tencent.com>
> > > > > ---
> > > > >  net/ipv4/tcp_minisocks.c | 2 +-
> > > > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > > > >
> > > > > diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
> > > > > index bb1fe1ba867a..2b29d1bf5ca0 100644
> > > > > --- a/net/ipv4/tcp_minisocks.c
> > > > > +++ b/net/ipv4/tcp_minisocks.c
> > > > > @@ -234,7 +234,7 @@ tcp_timewait_state_process(struct inet_timewait_sock *tw, struct sk_buff *skb,
> > > > >       if (th->syn && !th->rst && !th->ack && !paws_reject &&
> > > > >           (after(TCP_SKB_CB(skb)->seq, rcv_nxt) ||
> > > > >            (tmp_opt.saw_tstamp &&
> > > > > -           (s32)(READ_ONCE(tcptw->tw_ts_recent) - tmp_opt.rcv_tsval) < 0))) {
> > > >
> > > > I think this follows RFC 6191 and such a change requires a formal
> > > > discussion at IETF.
> > > >
> > > > https://datatracker.ietf.org/doc/html/rfc6191#section-2
> > > >
> > > > ---8<---
> > > >       *  If TCP Timestamps would be enabled for the new incarnation of
> > > >          the connection, and the timestamp contained in the incoming SYN
> > > >          segment is greater than the last timestamp seen on the previous
> > >
> > > The true thing is that the timestamp of the SYN packet is greater than
> > > that of the last packet, but the kernel implementation uses ms
> > > precision (please see tcp_skb_timestamp_ts()). That function makes
> > > those two timestamps the same.
> > >
> > > This case happens as expected, so the second connection should be
> > > established. My confusion just popped out of my mind: what rules
> > > should we follow to stop the second flow?
> >
> > Note that linux TCP stack can use usec resolution for TCP TS values.
> >
> > You might adopt it, and no longer care about this ms granularity.
>
> Right, I noticed this feature. I wonder if we can change the check in
> tcp_timewait_state_process() like this patch if it has no side
> effects? I'm worried that some programs don't use this feature. It's
> the reason why I try to propose this patch to you.

Breaking RFC ? I do not think so.

Instead, use a usec clock and voila, the problem is solved.
Jason Xing Nov. 5, 2024, 9:56 a.m. UTC | #6
On Tue, Nov 5, 2024 at 5:50 PM Eric Dumazet <edumazet@google.com> wrote:
>
> On Tue, Nov 5, 2024 at 10:42 AM Jason Xing <kerneljasonxing@gmail.com> wrote:
> >
> > On Tue, Nov 5, 2024 at 5:35 PM Eric Dumazet <edumazet@google.com> wrote:
> > >
> > > On Tue, Nov 5, 2024 at 10:08 AM Jason Xing <kerneljasonxing@gmail.com> wrote:
> > > >
> > > > On Tue, Nov 5, 2024 at 3:49 PM Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
> > > > >
> > > > > From: Jason Xing <kerneljasonxing@gmail.com>
> > > > > Date: Tue,  5 Nov 2024 10:55:11 +0800
> > > > > > From: Jason Xing <kernelxing@tencent.com>
> > > > > >
> > > > > > We found there are rare chances that some RST packets appear during
> > > > > > the shakehands because the timewait socket cannot accept the SYN and
> > > > >
> > > > > s/shakehands/handshake/
> > > > >
> > > > > same in the subject.
> > > > >
> > > > > > doesn't return TCP_TW_SYN in tcp_timewait_state_process().
> > > > > >
> > > > > > Here is how things happen in production:
> > > > > > Time        Client(A)        Server(B)
> > > > > > 0s          SYN-->
> > > > > > ...
> > > > > > 132s                         <-- FIN
> > > > > > ...
> > > > > > 169s        FIN-->
> > > > > > 169s                         <-- ACK
> > > > > > 169s        SYN-->
> > > > > > 169s                         <-- ACK
> > > > > > 169s        RST-->
> > > > > > As above picture shows, the two flows have a start time difference
> > > > > > of 169 seconds. B starts to send FIN so it will finally enter into
> > > > > > TIMEWAIT state. Nearly at the same time A launches a new connection
> > > > > > that soon is reset by itself due to receiving a ACK.
> > > > > >
> > > > > > There are two key checks in tcp_timewait_state_process() when timewait
> > > > > > socket in B receives the SYN packet:
> > > > > > 1) after(TCP_SKB_CB(skb)->seq, rcv_nxt)
> > > > > > 2) (s32)(READ_ONCE(tcptw->tw_ts_recent) - tmp_opt.rcv_tsval) < 0)
> > > > > >
> > > > > > Regarding the first rule, it fails as expected because in the first
> > > > > > connection the seq of SYN sent from A is 1892994276, then 169s have
> > > > > > passed, the second SYN has 239034613 (caused by overflow of s32).
> > > > > >
> > > > > > Then how about the second rule?
> > > > > > It fails again!
> > > > > > Let's take a look at how the tsval comes out:
> > > > > > __tcp_transmit_skb()
> > > > > >     -> tcp_syn_options()
> > > > > >         -> opts->tsval = tcp_skb_timestamp_ts(tp->tcp_usec_ts, skb) + tp->tsoffset;
> > > > > > The timestamp depends on two things, one is skb->skb_mstamp_ns, the
> > > > > > other is tp->tsoffset. The latter value is fixed, so we don't need
> > > > > > to care about it. If both operations (sending FIN and then starting
> > > > > > sending SYN) from A happen in 1ms, then the tsval would be the same.
> > > > > > It can be clearly seen in the tcpdump log. Notice that the tsval is
> > > > > > with millisecond precision.
> > > > > >
> > > > > > Based on the above analysis, I decided to make a small change to
> > > > > > the check in tcp_timewait_state_process() so that the second flow
> > > > > > would not fail.
> > > > > >
> > > > > > Signed-off-by: Jason Xing <kernelxing@tencent.com>
> > > > > > ---
> > > > > >  net/ipv4/tcp_minisocks.c | 2 +-
> > > > > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > > > > >
> > > > > > diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
> > > > > > index bb1fe1ba867a..2b29d1bf5ca0 100644
> > > > > > --- a/net/ipv4/tcp_minisocks.c
> > > > > > +++ b/net/ipv4/tcp_minisocks.c
> > > > > > @@ -234,7 +234,7 @@ tcp_timewait_state_process(struct inet_timewait_sock *tw, struct sk_buff *skb,
> > > > > >       if (th->syn && !th->rst && !th->ack && !paws_reject &&
> > > > > >           (after(TCP_SKB_CB(skb)->seq, rcv_nxt) ||
> > > > > >            (tmp_opt.saw_tstamp &&
> > > > > > -           (s32)(READ_ONCE(tcptw->tw_ts_recent) - tmp_opt.rcv_tsval) < 0))) {
> > > > >
> > > > > I think this follows RFC 6191 and such a change requires a formal
> > > > > discussion at IETF.
> > > > >
> > > > > https://datatracker.ietf.org/doc/html/rfc6191#section-2
> > > > >
> > > > > ---8<---
> > > > >       *  If TCP Timestamps would be enabled for the new incarnation of
> > > > >          the connection, and the timestamp contained in the incoming SYN
> > > > >          segment is greater than the last timestamp seen on the previous
> > > >
> > > > The true thing is that the timestamp of the SYN packet is greater than
> > > > that of the last packet, but the kernel implementation uses ms
> > > > precision (please see tcp_skb_timestamp_ts()). That function makes
> > > > those two timestamps the same.
> > > >
> > > > This case happens as expected, so the second connection should be
> > > > established. My confusion just popped out of my mind: what rules
> > > > should we follow to stop the second flow?
> > >
> > > Note that linux TCP stack can use usec resolution for TCP TS values.
> > >
> > > You might adopt it, and no longer care about this ms granularity.
> >
> > Right, I noticed this feature. I wonder if we can change the check in
> > tcp_timewait_state_process() like this patch if it has no side
> > effects? I'm worried that some programs don't use this feature. It's
> > the reason why I try to propose this patch to you.
>
> Breaking RFC ? I do not think so.

Oh right, I just can't figure it out why since we've already lost the
fine-grained timestamp in each skb. I spent a few days investigating
the bad cases this patch may bring, but I failed.

> Instead, use a usec clock and voila, the problem is solved.

Sure, it can work :)

Thanks,
Jason
Jason Xing Nov. 5, 2024, 11:48 a.m. UTC | #7
Hello Eric, Martin

On Tue, Nov 5, 2024 at 5:50 PM Eric Dumazet <edumazet@google.com> wrote:
>
> On Tue, Nov 5, 2024 at 10:42 AM Jason Xing <kerneljasonxing@gmail.com> wrote:
> >
> > On Tue, Nov 5, 2024 at 5:35 PM Eric Dumazet <edumazet@google.com> wrote:
> > >
> > > On Tue, Nov 5, 2024 at 10:08 AM Jason Xing <kerneljasonxing@gmail.com> wrote:
> > > >
> > > > On Tue, Nov 5, 2024 at 3:49 PM Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
> > > > >
> > > > > From: Jason Xing <kerneljasonxing@gmail.com>
> > > > > Date: Tue,  5 Nov 2024 10:55:11 +0800
> > > > > > From: Jason Xing <kernelxing@tencent.com>
> > > > > >
> > > > > > We found there are rare chances that some RST packets appear during
> > > > > > the shakehands because the timewait socket cannot accept the SYN and
> > > > >
> > > > > s/shakehands/handshake/
> > > > >
> > > > > same in the subject.
> > > > >
> > > > > > doesn't return TCP_TW_SYN in tcp_timewait_state_process().
> > > > > >
> > > > > > Here is how things happen in production:
> > > > > > Time        Client(A)        Server(B)
> > > > > > 0s          SYN-->
> > > > > > ...
> > > > > > 132s                         <-- FIN
> > > > > > ...
> > > > > > 169s        FIN-->
> > > > > > 169s                         <-- ACK
> > > > > > 169s        SYN-->
> > > > > > 169s                         <-- ACK
> > > > > > 169s        RST-->
> > > > > > As above picture shows, the two flows have a start time difference
> > > > > > of 169 seconds. B starts to send FIN so it will finally enter into
> > > > > > TIMEWAIT state. Nearly at the same time A launches a new connection
> > > > > > that soon is reset by itself due to receiving a ACK.
> > > > > >
> > > > > > There are two key checks in tcp_timewait_state_process() when timewait
> > > > > > socket in B receives the SYN packet:
> > > > > > 1) after(TCP_SKB_CB(skb)->seq, rcv_nxt)
> > > > > > 2) (s32)(READ_ONCE(tcptw->tw_ts_recent) - tmp_opt.rcv_tsval) < 0)
> > > > > >
> > > > > > Regarding the first rule, it fails as expected because in the first
> > > > > > connection the seq of SYN sent from A is 1892994276, then 169s have
> > > > > > passed, the second SYN has 239034613 (caused by overflow of s32).
> > > > > >
> > > > > > Then how about the second rule?
> > > > > > It fails again!
> > > > > > Let's take a look at how the tsval comes out:
> > > > > > __tcp_transmit_skb()
> > > > > >     -> tcp_syn_options()
> > > > > >         -> opts->tsval = tcp_skb_timestamp_ts(tp->tcp_usec_ts, skb) + tp->tsoffset;
> > > > > > The timestamp depends on two things, one is skb->skb_mstamp_ns, the
> > > > > > other is tp->tsoffset. The latter value is fixed, so we don't need
> > > > > > to care about it. If both operations (sending FIN and then starting
> > > > > > sending SYN) from A happen in 1ms, then the tsval would be the same.
> > > > > > It can be clearly seen in the tcpdump log. Notice that the tsval is
> > > > > > with millisecond precision.
> > > > > >
> > > > > > Based on the above analysis, I decided to make a small change to
> > > > > > the check in tcp_timewait_state_process() so that the second flow
> > > > > > would not fail.
> > > > > >
> > > > > > Signed-off-by: Jason Xing <kernelxing@tencent.com>
> > > > > > ---
> > > > > >  net/ipv4/tcp_minisocks.c | 2 +-
> > > > > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > > > > >
> > > > > > diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
> > > > > > index bb1fe1ba867a..2b29d1bf5ca0 100644
> > > > > > --- a/net/ipv4/tcp_minisocks.c
> > > > > > +++ b/net/ipv4/tcp_minisocks.c
> > > > > > @@ -234,7 +234,7 @@ tcp_timewait_state_process(struct inet_timewait_sock *tw, struct sk_buff *skb,
> > > > > >       if (th->syn && !th->rst && !th->ack && !paws_reject &&
> > > > > >           (after(TCP_SKB_CB(skb)->seq, rcv_nxt) ||
> > > > > >            (tmp_opt.saw_tstamp &&
> > > > > > -           (s32)(READ_ONCE(tcptw->tw_ts_recent) - tmp_opt.rcv_tsval) < 0))) {
> > > > >
> > > > > I think this follows RFC 6191 and such a change requires a formal
> > > > > discussion at IETF.
> > > > >
> > > > > https://datatracker.ietf.org/doc/html/rfc6191#section-2
> > > > >
> > > > > ---8<---
> > > > >       *  If TCP Timestamps would be enabled for the new incarnation of
> > > > >          the connection, and the timestamp contained in the incoming SYN
> > > > >          segment is greater than the last timestamp seen on the previous
> > > >
> > > > The true thing is that the timestamp of the SYN packet is greater than
> > > > that of the last packet, but the kernel implementation uses ms
> > > > precision (please see tcp_skb_timestamp_ts()). That function makes
> > > > those two timestamps the same.
> > > >
> > > > This case happens as expected, so the second connection should be
> > > > established. My confusion just popped out of my mind: what rules
> > > > should we follow to stop the second flow?
> > >
> > > Note that linux TCP stack can use usec resolution for TCP TS values.
> > >
> > > You might adopt it, and no longer care about this ms granularity.
> >
> > Right, I noticed this feature. I wonder if we can change the check in
> > tcp_timewait_state_process() like this patch if it has no side
> > effects? I'm worried that some programs don't use this feature. It's
> > the reason why I try to propose this patch to you.
>
> Breaking RFC ? I do not think so.
>
> Instead, use a usec clock and voila, the problem is solved.

I wonder if it is necessary to use BPF to extend the usec clock. Since
I started investigating the BPF part, I found it can help us extend
many useful features in TCP to the most extent.

Using it with the BPF program can make the feature take effect and
widely used in future.

I'm asking because I somehow have a feeling that someday the majority
of traditional sockopts could be replaced by BPF. After all, requiring
application modification for those kernel features is a bit heavy. I'm
not sure if I'm right about it :S

Thanks,
Jason
Jason Xing Nov. 7, 2024, 3:16 a.m. UTC | #8
Hello Eric,

On Tue, Nov 5, 2024 at 10:55 AM Jason Xing <kerneljasonxing@gmail.com> wrote:
>
> From: Jason Xing <kernelxing@tencent.com>
>
> We found there are rare chances that some RST packets appear during
> the shakehands because the timewait socket cannot accept the SYN and
> doesn't return TCP_TW_SYN in tcp_timewait_state_process().
>
> Here is how things happen in production:
> Time        Client(A)        Server(B)
> 0s          SYN-->
> ...
> 132s                         <-- FIN
> ...
> 169s        FIN-->
> 169s                         <-- ACK
> 169s        SYN-->
> 169s                         <-- ACK

I noticed the above ACK doesn't adhere to RFC 6191. It says:
"If the previous incarnation of the connection used Timestamps, then:
     if ...
     ...
     * Otherwise, silently drop the incoming SYN segment, thus leaving
         the previous incarnation of the connection in the TIME-WAIT
         state.
"
But the timewait socket sends an ACK because of this code snippet:
tcp_timewait_state_process()
    -> // the checks of SYN packet failed.
    -> if (!th->rst) {
        -> return TCP_TW_ACK; // this line can be traced back to 2005

I think the following patch follows the RFC:
diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
index bb1fe1ba867a..cc22f0412f98 100644
--- a/net/ipv4/tcp_minisocks.c
+++ b/net/ipv4/tcp_minisocks.c
@@ -231,15 +231,17 @@ tcp_timewait_state_process(struct
inet_timewait_sock *tw, struct sk_buff *skb,
           but not fatal yet.
         */

-       if (th->syn && !th->rst && !th->ack && !paws_reject &&
-           (after(TCP_SKB_CB(skb)->seq, rcv_nxt) ||
-            (tmp_opt.saw_tstamp &&
-             (s32)(READ_ONCE(tcptw->tw_ts_recent) - tmp_opt.rcv_tsval) < 0))) {
-               u32 isn = tcptw->tw_snd_nxt + 65535 + 2;
-               if (isn == 0)
-                       isn++;
-               *tw_isn = isn;
-               return TCP_TW_SYN;
+       if (th->syn && !th->rst && !th->ack && !paws_reject) {
+               if (after(TCP_SKB_CB(skb)->seq, rcv_nxt) ||
+                   (tmp_opt.saw_tstamp &&
+                    (s32)(READ_ONCE(tcptw->tw_ts_recent) -
tmp_opt.rcv_tsval) < 0)) {
+                       u32 isn = tcptw->tw_snd_nxt + 65535 + 2;
+                       if (isn == 0)
+                               isn++;
+                       *tw_isn = isn;
+                       return TCP_TW_SYN;
+               }
+               return TCP_TW_SUCCESS;
        }

        if (paws_reject)

Could you help me review this, Eric? Thanks in advance!

Thanks,
Jason

> 169s        RST-->
> As above picture shows, the two flows have a start time difference
> of 169 seconds. B starts to send FIN so it will finally enter into
> TIMEWAIT state. Nearly at the same time A launches a new connection
> that soon is reset by itself due to receiving a ACK.
>
> There are two key checks in tcp_timewait_state_process() when timewait
> socket in B receives the SYN packet:
> 1) after(TCP_SKB_CB(skb)->seq, rcv_nxt)
> 2) (s32)(READ_ONCE(tcptw->tw_ts_recent) - tmp_opt.rcv_tsval) < 0)
>
> Regarding the first rule, it fails as expected because in the first
> connection the seq of SYN sent from A is 1892994276, then 169s have
> passed, the second SYN has 239034613 (caused by overflow of s32).
>
> Then how about the second rule?
> It fails again!
> Let's take a look at how the tsval comes out:
> __tcp_transmit_skb()
>     -> tcp_syn_options()
>         -> opts->tsval = tcp_skb_timestamp_ts(tp->tcp_usec_ts, skb) + tp->tsoffset;
> The timestamp depends on two things, one is skb->skb_mstamp_ns, the
> other is tp->tsoffset. The latter value is fixed, so we don't need
> to care about it. If both operations (sending FIN and then starting
> sending SYN) from A happen in 1ms, then the tsval would be the same.
> It can be clearly seen in the tcpdump log. Notice that the tsval is
> with millisecond precision.
>
> Based on the above analysis, I decided to make a small change to
> the check in tcp_timewait_state_process() so that the second flow
> would not fail.
>
> Signed-off-by: Jason Xing <kernelxing@tencent.com>
> ---
>  net/ipv4/tcp_minisocks.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
> index bb1fe1ba867a..2b29d1bf5ca0 100644
> --- a/net/ipv4/tcp_minisocks.c
> +++ b/net/ipv4/tcp_minisocks.c
> @@ -234,7 +234,7 @@ tcp_timewait_state_process(struct inet_timewait_sock *tw, struct sk_buff *skb,
>         if (th->syn && !th->rst && !th->ack && !paws_reject &&
>             (after(TCP_SKB_CB(skb)->seq, rcv_nxt) ||
>              (tmp_opt.saw_tstamp &&
> -             (s32)(READ_ONCE(tcptw->tw_ts_recent) - tmp_opt.rcv_tsval) < 0))) {
> +             (s32)(READ_ONCE(tcptw->tw_ts_recent) - tmp_opt.rcv_tsval) <= 0))) {
>                 u32 isn = tcptw->tw_snd_nxt + 65535 + 2;
>                 if (isn == 0)
>                         isn++;
> --
> 2.37.3
>
Kuniyuki Iwashima Nov. 7, 2024, 4:15 a.m. UTC | #9
From: Jason Xing <kerneljasonxing@gmail.com>
Date: Thu, 7 Nov 2024 11:16:04 +0800
> > Here is how things happen in production:
> > Time        Client(A)        Server(B)
> > 0s          SYN-->
> > ...
> > 132s                         <-- FIN
> > ...
> > 169s        FIN-->
> > 169s                         <-- ACK
> > 169s        SYN-->
> > 169s                         <-- ACK
> 
> I noticed the above ACK doesn't adhere to RFC 6191. It says:
> "If the previous incarnation of the connection used Timestamps, then:
>      if ...
>      ...
>      * Otherwise, silently drop the incoming SYN segment, thus leaving
>          the previous incarnation of the connection in the TIME-WAIT
>          state.
> "
> But the timewait socket sends an ACK because of this code snippet:
> tcp_timewait_state_process()
>     -> // the checks of SYN packet failed.
>     -> if (!th->rst) {
>         -> return TCP_TW_ACK; // this line can be traced back to 2005

This is a challenge ACK following RFC 5961.

If SYN is returned here, the client may lose the chance to RST the
previous connection in TIME_WAIT.

https://www.rfc-editor.org/rfc/rfc9293.html#section-3.10.7.4-2.4.1
---8<---
      -  TIME-WAIT STATE

         o  If the SYN bit is set in these synchronized states, it may
            be either a legitimate new connection attempt (e.g., in the
            case of TIME-WAIT), an error where the connection should be
            reset, or the result of an attack attempt, as described in
            RFC 5961 [9].  For the TIME-WAIT state, new connections can
            be accepted if the Timestamp Option is used and meets
            expectations (per [40]).  For all other cases, RFC 5961
            provides a mitigation with applicability to some situations,
            though there are also alternatives that offer cryptographic
            protection (see Section 7).  RFC 5961 recommends that in
            these synchronized states, if the SYN bit is set,
            irrespective of the sequence number, TCP endpoints MUST send
            a "challenge ACK" to the remote peer:

            <SEQ=SND.NXT><ACK=RCV.NXT><CTL=ACK>
---8<---

https://datatracker.ietf.org/doc/html/rfc5961#section-4
---8<---
   1) If the SYN bit is set, irrespective of the sequence number, TCP
      MUST send an ACK (also referred to as challenge ACK) to the remote
      peer:

      <SEQ=SND.NXT><ACK=RCV.NXT><CTL=ACK>

      After sending the acknowledgment, TCP MUST drop the unacceptable
      segment and stop processing further.
---8<---
Kuniyuki Iwashima Nov. 7, 2024, 4:21 a.m. UTC | #10
From: Kuniyuki Iwashima <kuniyu@amazon.com>
Date: Wed, 6 Nov 2024 20:15:06 -0800
> From: Jason Xing <kerneljasonxing@gmail.com>
> Date: Thu, 7 Nov 2024 11:16:04 +0800
> > > Here is how things happen in production:
> > > Time        Client(A)        Server(B)
> > > 0s          SYN-->
> > > ...
> > > 132s                         <-- FIN
> > > ...
> > > 169s        FIN-->
> > > 169s                         <-- ACK
> > > 169s        SYN-->
> > > 169s                         <-- ACK
> > 
> > I noticed the above ACK doesn't adhere to RFC 6191. It says:
> > "If the previous incarnation of the connection used Timestamps, then:
> >      if ...
> >      ...
> >      * Otherwise, silently drop the incoming SYN segment, thus leaving
> >          the previous incarnation of the connection in the TIME-WAIT
> >          state.
> > "
> > But the timewait socket sends an ACK because of this code snippet:
> > tcp_timewait_state_process()
> >     -> // the checks of SYN packet failed.
> >     -> if (!th->rst) {
> >         -> return TCP_TW_ACK; // this line can be traced back to 2005
> 
> This is a challenge ACK following RFC 5961.
> 
> If SYN is returned here, the client may lose the chance to RST the
> previous connection in TIME_WAIT.

s/returned/silently dropped/ :/
Jason Xing Nov. 7, 2024, 5:23 a.m. UTC | #11
On Thu, Nov 7, 2024 at 12:15 PM Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
>
> From: Jason Xing <kerneljasonxing@gmail.com>
> Date: Thu, 7 Nov 2024 11:16:04 +0800
> > > Here is how things happen in production:
> > > Time        Client(A)        Server(B)
> > > 0s          SYN-->
> > > ...
> > > 132s                         <-- FIN
> > > ...
> > > 169s        FIN-->
> > > 169s                         <-- ACK
> > > 169s        SYN-->
> > > 169s                         <-- ACK
> >
> > I noticed the above ACK doesn't adhere to RFC 6191. It says:
> > "If the previous incarnation of the connection used Timestamps, then:
> >      if ...
> >      ...
> >      * Otherwise, silently drop the incoming SYN segment, thus leaving
> >          the previous incarnation of the connection in the TIME-WAIT
> >          state.
> > "
> > But the timewait socket sends an ACK because of this code snippet:
> > tcp_timewait_state_process()
> >     -> // the checks of SYN packet failed.
> >     -> if (!th->rst) {
> >         -> return TCP_TW_ACK; // this line can be traced back to 2005
>
> This is a challenge ACK following RFC 5961.

Please note the idea of challenge ack was proposed in 2010. But this
code snippet has already existed before 2005. If it is a challenge
ack, then at least we need to count it (by using NET_INC_STATS(net,
LINUX_MIB_TCPCHALLENGEACK);).

>
> If SYN is returned here, the client may lose the chance to RST the
> previous connection in TIME_WAIT.
>
> https://www.rfc-editor.org/rfc/rfc9293.html#section-3.10.7.4-2.4.1
> ---8<---
>       -  TIME-WAIT STATE
>
>          o  If the SYN bit is set in these synchronized states, it may
>             be either a legitimate new connection attempt (e.g., in the
>             case of TIME-WAIT), an error where the connection should be
>             reset, or the result of an attack attempt, as described in
>             RFC 5961 [9].  For the TIME-WAIT state, new connections can
>             be accepted if the Timestamp Option is used and meets
>             expectations (per [40]).  For all other cases, RFC 5961
>             provides a mitigation with applicability to some situations,
>             though there are also alternatives that offer cryptographic
>             protection (see Section 7).  RFC 5961 recommends that in
>             these synchronized states, if the SYN bit is set,
>             irrespective of the sequence number, TCP endpoints MUST send
>             a "challenge ACK" to the remote peer:
>
>             <SEQ=SND.NXT><ACK=RCV.NXT><CTL=ACK>
> ---8<---
>
> https://datatracker.ietf.org/doc/html/rfc5961#section-4
> ---8<---
>    1) If the SYN bit is set, irrespective of the sequence number, TCP
>       MUST send an ACK (also referred to as challenge ACK) to the remote
>       peer:
>
>       <SEQ=SND.NXT><ACK=RCV.NXT><CTL=ACK>
>
>       After sending the acknowledgment, TCP MUST drop the unacceptable
>       segment and stop processing further.
> ---8<---

The RFC 5961 4.2 was implemented in tcp_validate_incoming():
        /* step 4: Check for a SYN
         * RFC 5961 4.2 : Send a challenge ack
         */
        if (th->syn) {
                if (sk->sk_state == TCP_SYN_RECV && sk->sk_socket && th->ack &&
                    TCP_SKB_CB(skb)->seq + 1 == TCP_SKB_CB(skb)->end_seq &&
                    TCP_SKB_CB(skb)->seq + 1 == tp->rcv_nxt &&
                    TCP_SKB_CB(skb)->ack_seq == tp->snd_nxt)
                        goto pass;
syn_challenge:
                if (syn_inerr)
                        TCP_INC_STATS(sock_net(sk), TCP_MIB_INERRS);
                NET_INC_STATS(sock_net(sk),
LINUX_MIB_TCPSYNCHALLENGE);
                tcp_send_challenge_ack(sk);
                SKB_DR_SET(reason, TCP_INVALID_SYN);
                goto discard;
        }

Also, this quotation you mentioned obviously doesn't match the kernel
implementation:
"If the SYN bit is set, irrespective of the sequence number, TCP MUST
send an ACK"
The tcp_timewait_state_process() does care about the seq number, or
else timewait socket would refuse every SYN packet.

Thanks,
Jason
Kuniyuki Iwashima Nov. 7, 2024, 5:43 a.m. UTC | #12
From: Jason Xing <kerneljasonxing@gmail.com>
Date: Thu, 7 Nov 2024 13:23:50 +0800
> On Thu, Nov 7, 2024 at 12:15 PM Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
> >
> > From: Jason Xing <kerneljasonxing@gmail.com>
> > Date: Thu, 7 Nov 2024 11:16:04 +0800
> > > > Here is how things happen in production:
> > > > Time        Client(A)        Server(B)
> > > > 0s          SYN-->
> > > > ...
> > > > 132s                         <-- FIN
> > > > ...
> > > > 169s        FIN-->
> > > > 169s                         <-- ACK
> > > > 169s        SYN-->
> > > > 169s                         <-- ACK
> > >
> > > I noticed the above ACK doesn't adhere to RFC 6191. It says:
> > > "If the previous incarnation of the connection used Timestamps, then:
> > >      if ...
> > >      ...
> > >      * Otherwise, silently drop the incoming SYN segment, thus leaving
> > >          the previous incarnation of the connection in the TIME-WAIT
> > >          state.
> > > "
> > > But the timewait socket sends an ACK because of this code snippet:
> > > tcp_timewait_state_process()
> > >     -> // the checks of SYN packet failed.
> > >     -> if (!th->rst) {
> > >         -> return TCP_TW_ACK; // this line can be traced back to 2005
> >
> > This is a challenge ACK following RFC 5961.
> 
> Please note the idea of challenge ack was proposed in 2010. But this
> code snippet has already existed before 2005. If it is a challenge
> ack, then at least we need to count it (by using NET_INC_STATS(net,
> LINUX_MIB_TCPCHALLENGEACK);).

The word was not accurate, the behaviour is compliant with RFC 5961.
RFC is often formalised based on real implementations.

Incrementing the count makes sense to me.

> 
> >
> > If SYN is returned here, the client may lose the chance to RST the
> > previous connection in TIME_WAIT.
> >
> > https://www.rfc-editor.org/rfc/rfc9293.html#section-3.10.7.4-2.4.1
> > ---8<---
> >       -  TIME-WAIT STATE
> >
> >          o  If the SYN bit is set in these synchronized states, it may
> >             be either a legitimate new connection attempt (e.g., in the
> >             case of TIME-WAIT), an error where the connection should be
> >             reset, or the result of an attack attempt, as described in
> >             RFC 5961 [9].  For the TIME-WAIT state, new connections can
> >             be accepted if the Timestamp Option is used and meets
> >             expectations (per [40]).  For all other cases, RFC 5961
> >             provides a mitigation with applicability to some situations,
> >             though there are also alternatives that offer cryptographic
> >             protection (see Section 7).  RFC 5961 recommends that in
> >             these synchronized states, if the SYN bit is set,
> >             irrespective of the sequence number, TCP endpoints MUST send
> >             a "challenge ACK" to the remote peer:
> >
> >             <SEQ=SND.NXT><ACK=RCV.NXT><CTL=ACK>
> > ---8<---
> >
> > https://datatracker.ietf.org/doc/html/rfc5961#section-4
> > ---8<---
> >    1) If the SYN bit is set, irrespective of the sequence number, TCP
> >       MUST send an ACK (also referred to as challenge ACK) to the remote
> >       peer:
> >
> >       <SEQ=SND.NXT><ACK=RCV.NXT><CTL=ACK>
> >
> >       After sending the acknowledgment, TCP MUST drop the unacceptable
> >       segment and stop processing further.
> > ---8<---
> 
> The RFC 5961 4.2 was implemented in tcp_validate_incoming():
>         /* step 4: Check for a SYN
>          * RFC 5961 4.2 : Send a challenge ack
>          */
>         if (th->syn) {
>                 if (sk->sk_state == TCP_SYN_RECV && sk->sk_socket && th->ack &&
>                     TCP_SKB_CB(skb)->seq + 1 == TCP_SKB_CB(skb)->end_seq &&
>                     TCP_SKB_CB(skb)->seq + 1 == tp->rcv_nxt &&
>                     TCP_SKB_CB(skb)->ack_seq == tp->snd_nxt)
>                         goto pass;
> syn_challenge:
>                 if (syn_inerr)
>                         TCP_INC_STATS(sock_net(sk), TCP_MIB_INERRS);
>                 NET_INC_STATS(sock_net(sk),
> LINUX_MIB_TCPSYNCHALLENGE);
>                 tcp_send_challenge_ack(sk);
>                 SKB_DR_SET(reason, TCP_INVALID_SYN);
>                 goto discard;
>         }
> 
> Also, this quotation you mentioned obviously doesn't match the kernel
> implementation:
> "If the SYN bit is set, irrespective of the sequence number, TCP MUST
> send an ACK"
> The tcp_timewait_state_process() does care about the seq number, or
> else timewait socket would refuse every SYN packet.

That's why I pasted RFC 9293 first that clearly states that we
should check seq number and then return ACK for all other cases.
Jason Xing Nov. 7, 2024, 6:51 a.m. UTC | #13
On Thu, Nov 7, 2024 at 1:43 PM Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
>
> From: Jason Xing <kerneljasonxing@gmail.com>
> Date: Thu, 7 Nov 2024 13:23:50 +0800
> > On Thu, Nov 7, 2024 at 12:15 PM Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
> > >
> > > From: Jason Xing <kerneljasonxing@gmail.com>
> > > Date: Thu, 7 Nov 2024 11:16:04 +0800
> > > > > Here is how things happen in production:
> > > > > Time        Client(A)        Server(B)
> > > > > 0s          SYN-->
> > > > > ...
> > > > > 132s                         <-- FIN
> > > > > ...
> > > > > 169s        FIN-->
> > > > > 169s                         <-- ACK
> > > > > 169s        SYN-->
> > > > > 169s                         <-- ACK
> > > >
> > > > I noticed the above ACK doesn't adhere to RFC 6191. It says:
> > > > "If the previous incarnation of the connection used Timestamps, then:
> > > >      if ...
> > > >      ...
> > > >      * Otherwise, silently drop the incoming SYN segment, thus leaving
> > > >          the previous incarnation of the connection in the TIME-WAIT
> > > >          state.
> > > > "
> > > > But the timewait socket sends an ACK because of this code snippet:
> > > > tcp_timewait_state_process()
> > > >     -> // the checks of SYN packet failed.
> > > >     -> if (!th->rst) {
> > > >         -> return TCP_TW_ACK; // this line can be traced back to 2005
> > >
> > > This is a challenge ACK following RFC 5961.
> >
> > Please note the idea of challenge ack was proposed in 2010. But this
> > code snippet has already existed before 2005. If it is a challenge
> > ack, then at least we need to count it (by using NET_INC_STATS(net,
> > LINUX_MIB_TCPCHALLENGEACK);).
>
> The word was not accurate, the behaviour is compliant with RFC 5961.
> RFC is often formalised based on real implementations.
>
> Incrementing the count makes sense to me.
>
> >
> > >
> > > If SYN is returned here, the client may lose the chance to RST the
> > > previous connection in TIME_WAIT.
> > >
> > > https://www.rfc-editor.org/rfc/rfc9293.html#section-3.10.7.4-2.4.1
> > > ---8<---
> > >       -  TIME-WAIT STATE
> > >
> > >          o  If the SYN bit is set in these synchronized states, it may
> > >             be either a legitimate new connection attempt (e.g., in the
> > >             case of TIME-WAIT), an error where the connection should be
> > >             reset, or the result of an attack attempt, as described in
> > >             RFC 5961 [9].  For the TIME-WAIT state, new connections can
> > >             be accepted if the Timestamp Option is used and meets
> > >             expectations (per [40]).  For all other cases, RFC 5961
> > >             provides a mitigation with applicability to some situations,
> > >             though there are also alternatives that offer cryptographic
> > >             protection (see Section 7).  RFC 5961 recommends that in
> > >             these synchronized states, if the SYN bit is set,
> > >             irrespective of the sequence number, TCP endpoints MUST send
> > >             a "challenge ACK" to the remote peer:
> > >
> > >             <SEQ=SND.NXT><ACK=RCV.NXT><CTL=ACK>
> > > ---8<---
> > >
> > > https://datatracker.ietf.org/doc/html/rfc5961#section-4
> > > ---8<---
> > >    1) If the SYN bit is set, irrespective of the sequence number, TCP
> > >       MUST send an ACK (also referred to as challenge ACK) to the remote
> > >       peer:
> > >
> > >       <SEQ=SND.NXT><ACK=RCV.NXT><CTL=ACK>
> > >
> > >       After sending the acknowledgment, TCP MUST drop the unacceptable
> > >       segment and stop processing further.
> > > ---8<---
> >
> > The RFC 5961 4.2 was implemented in tcp_validate_incoming():
> >         /* step 4: Check for a SYN
> >          * RFC 5961 4.2 : Send a challenge ack
> >          */
> >         if (th->syn) {
> >                 if (sk->sk_state == TCP_SYN_RECV && sk->sk_socket && th->ack &&
> >                     TCP_SKB_CB(skb)->seq + 1 == TCP_SKB_CB(skb)->end_seq &&
> >                     TCP_SKB_CB(skb)->seq + 1 == tp->rcv_nxt &&
> >                     TCP_SKB_CB(skb)->ack_seq == tp->snd_nxt)
> >                         goto pass;
> > syn_challenge:
> >                 if (syn_inerr)
> >                         TCP_INC_STATS(sock_net(sk), TCP_MIB_INERRS);
> >                 NET_INC_STATS(sock_net(sk),
> > LINUX_MIB_TCPSYNCHALLENGE);
> >                 tcp_send_challenge_ack(sk);
> >                 SKB_DR_SET(reason, TCP_INVALID_SYN);
> >                 goto discard;
> >         }
> >
> > Also, this quotation you mentioned obviously doesn't match the kernel
> > implementation:
> > "If the SYN bit is set, irrespective of the sequence number, TCP MUST
> > send an ACK"
> > The tcp_timewait_state_process() does care about the seq number, or
> > else timewait socket would refuse every SYN packet.
>
> That's why I pasted RFC 9293 first that clearly states that we
> should check seq number and then return ACK for all other cases.

I don't think so.

RFC 9293 only states that RFC 5691 provides an approach that mitigates
the risk by rejecting all the SYN packets if the socket stays in
synchronized state. It's "For all other cases" in RFC 9293.
Please loot at "irrespective of the sequence number" in RFC 5691 4.2
[1]. It means no matter what the seq is we MUST send back an ACK
instead of establishing a new connection.
Actually the tcp_timewait_state_process() checks the seq or timestamp
in the SYN packet.

[1]: https://datatracker.ietf.org/doc/html/rfc5961#section-4.2

Thanks,
Jason
Kuniyuki Iwashima Nov. 7, 2024, 7:11 a.m. UTC | #14
From: Jason Xing <kerneljasonxing@gmail.com>
Date: Thu, 7 Nov 2024 14:51:35 +0800
> On Thu, Nov 7, 2024 at 1:43 PM Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
> >
> > From: Jason Xing <kerneljasonxing@gmail.com>
> > Date: Thu, 7 Nov 2024 13:23:50 +0800
> > > On Thu, Nov 7, 2024 at 12:15 PM Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
> > > >
> > > > From: Jason Xing <kerneljasonxing@gmail.com>
> > > > Date: Thu, 7 Nov 2024 11:16:04 +0800
> > > > > > Here is how things happen in production:
> > > > > > Time        Client(A)        Server(B)
> > > > > > 0s          SYN-->
> > > > > > ...
> > > > > > 132s                         <-- FIN
> > > > > > ...
> > > > > > 169s        FIN-->
> > > > > > 169s                         <-- ACK
> > > > > > 169s        SYN-->
> > > > > > 169s                         <-- ACK
> > > > >
> > > > > I noticed the above ACK doesn't adhere to RFC 6191. It says:
> > > > > "If the previous incarnation of the connection used Timestamps, then:
> > > > >      if ...
> > > > >      ...
> > > > >      * Otherwise, silently drop the incoming SYN segment, thus leaving
> > > > >          the previous incarnation of the connection in the TIME-WAIT
> > > > >          state.
> > > > > "
> > > > > But the timewait socket sends an ACK because of this code snippet:
> > > > > tcp_timewait_state_process()
> > > > >     -> // the checks of SYN packet failed.
> > > > >     -> if (!th->rst) {
> > > > >         -> return TCP_TW_ACK; // this line can be traced back to 2005
> > > >
> > > > This is a challenge ACK following RFC 5961.
> > >
> > > Please note the idea of challenge ack was proposed in 2010. But this
> > > code snippet has already existed before 2005. If it is a challenge
> > > ack, then at least we need to count it (by using NET_INC_STATS(net,
> > > LINUX_MIB_TCPCHALLENGEACK);).
> >
> > The word was not accurate, the behaviour is compliant with RFC 5961.
> > RFC is often formalised based on real implementations.
> >
> > Incrementing the count makes sense to me.
> >
> > >
> > > >
> > > > If SYN is returned here, the client may lose the chance to RST the
> > > > previous connection in TIME_WAIT.
> > > >
> > > > https://www.rfc-editor.org/rfc/rfc9293.html#section-3.10.7.4-2.4.1
> > > > ---8<---
> > > >       -  TIME-WAIT STATE
> > > >
> > > >          o  If the SYN bit is set in these synchronized states, it may
> > > >             be either a legitimate new connection attempt (e.g., in the
> > > >             case of TIME-WAIT), an error where the connection should be
> > > >             reset, or the result of an attack attempt, as described in
> > > >             RFC 5961 [9].  For the TIME-WAIT state, new connections can
> > > >             be accepted if the Timestamp Option is used and meets
> > > >             expectations (per [40]).  For all other cases, RFC 5961
> > > >             provides a mitigation with applicability to some situations,
> > > >             though there are also alternatives that offer cryptographic
> > > >             protection (see Section 7).  RFC 5961 recommends that in
> > > >             these synchronized states, if the SYN bit is set,
> > > >             irrespective of the sequence number, TCP endpoints MUST send
> > > >             a "challenge ACK" to the remote peer:
> > > >
> > > >             <SEQ=SND.NXT><ACK=RCV.NXT><CTL=ACK>
> > > > ---8<---
> > > >
> > > > https://datatracker.ietf.org/doc/html/rfc5961#section-4
> > > > ---8<---
> > > >    1) If the SYN bit is set, irrespective of the sequence number, TCP
> > > >       MUST send an ACK (also referred to as challenge ACK) to the remote
> > > >       peer:
> > > >
> > > >       <SEQ=SND.NXT><ACK=RCV.NXT><CTL=ACK>
> > > >
> > > >       After sending the acknowledgment, TCP MUST drop the unacceptable
> > > >       segment and stop processing further.
> > > > ---8<---
> > >
> > > The RFC 5961 4.2 was implemented in tcp_validate_incoming():
> > >         /* step 4: Check for a SYN
> > >          * RFC 5961 4.2 : Send a challenge ack
> > >          */
> > >         if (th->syn) {
> > >                 if (sk->sk_state == TCP_SYN_RECV && sk->sk_socket && th->ack &&
> > >                     TCP_SKB_CB(skb)->seq + 1 == TCP_SKB_CB(skb)->end_seq &&
> > >                     TCP_SKB_CB(skb)->seq + 1 == tp->rcv_nxt &&
> > >                     TCP_SKB_CB(skb)->ack_seq == tp->snd_nxt)
> > >                         goto pass;
> > > syn_challenge:
> > >                 if (syn_inerr)
> > >                         TCP_INC_STATS(sock_net(sk), TCP_MIB_INERRS);
> > >                 NET_INC_STATS(sock_net(sk),
> > > LINUX_MIB_TCPSYNCHALLENGE);
> > >                 tcp_send_challenge_ack(sk);
> > >                 SKB_DR_SET(reason, TCP_INVALID_SYN);
> > >                 goto discard;
> > >         }
> > >
> > > Also, this quotation you mentioned obviously doesn't match the kernel
> > > implementation:
> > > "If the SYN bit is set, irrespective of the sequence number, TCP MUST
> > > send an ACK"
> > > The tcp_timewait_state_process() does care about the seq number, or
> > > else timewait socket would refuse every SYN packet.
> >
> > That's why I pasted RFC 9293 first that clearly states that we
> > should check seq number and then return ACK for all other cases.
> 
> I don't think so.
> 
> RFC 9293 only states that RFC 5691 provides an approach that mitigates
> the risk by rejecting all the SYN packets if the socket stays in
> synchronized state. It's "For all other cases" in RFC 9293.

RFC 9293 states which RFC to prioritise.  You will find the
link [40] is RFC 6191.

---8<---
For the TIME-WAIT state, new connections can
be accepted if the Timestamp Option is used and meets
expectations (per [40]).  For all other cases, RFC 5961
...
---8<---

> Please loot at "irrespective of the sequence number" in RFC 5691 4.2
> [1]. It means no matter what the seq is we MUST send back an ACK
> instead of establishing a new connection.

RFC 9293 mentions accepatble cases first, so this is only applied
to "all other cases"


> Actually the tcp_timewait_state_process() checks the seq or timestamp
> in the SYN packet.

and this part takes precedence than "all other cases".

Also, you missed that the pasted part is the 4th step of incoming
segment processing.

https://www.rfc-editor.org/rfc/rfc9293.html#section-3.10.7.4
---8<---
First, check sequence number: ...
Second, check the RST bit: ...
Third, check security: ...
Fourth, check the SYN bit:
...
  TIME-WAIT STATE
    If the SYN bit is set in these synchronized states...
---8<---

So, RFC 9293 says "check seq number, RST, security, then
if the connection is still accepatable for TIME_WAIT based on
RFC 6191, accept it, otherwise, return ACK based on RFC 5691".
Jason Xing Nov. 7, 2024, 7:44 a.m. UTC | #15
On Thu, Nov 7, 2024 at 3:11 PM Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
>
> From: Jason Xing <kerneljasonxing@gmail.com>
> Date: Thu, 7 Nov 2024 14:51:35 +0800
> > On Thu, Nov 7, 2024 at 1:43 PM Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
> > >
> > > From: Jason Xing <kerneljasonxing@gmail.com>
> > > Date: Thu, 7 Nov 2024 13:23:50 +0800
> > > > On Thu, Nov 7, 2024 at 12:15 PM Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
> > > > >
> > > > > From: Jason Xing <kerneljasonxing@gmail.com>
> > > > > Date: Thu, 7 Nov 2024 11:16:04 +0800
> > > > > > > Here is how things happen in production:
> > > > > > > Time        Client(A)        Server(B)
> > > > > > > 0s          SYN-->
> > > > > > > ...
> > > > > > > 132s                         <-- FIN
> > > > > > > ...
> > > > > > > 169s        FIN-->
> > > > > > > 169s                         <-- ACK
> > > > > > > 169s        SYN-->
> > > > > > > 169s                         <-- ACK
> > > > > >
> > > > > > I noticed the above ACK doesn't adhere to RFC 6191. It says:
> > > > > > "If the previous incarnation of the connection used Timestamps, then:
> > > > > >      if ...
> > > > > >      ...
> > > > > >      * Otherwise, silently drop the incoming SYN segment, thus leaving
> > > > > >          the previous incarnation of the connection in the TIME-WAIT
> > > > > >          state.
> > > > > > "
> > > > > > But the timewait socket sends an ACK because of this code snippet:
> > > > > > tcp_timewait_state_process()
> > > > > >     -> // the checks of SYN packet failed.
> > > > > >     -> if (!th->rst) {
> > > > > >         -> return TCP_TW_ACK; // this line can be traced back to 2005
> > > > >
> > > > > This is a challenge ACK following RFC 5961.
> > > >
> > > > Please note the idea of challenge ack was proposed in 2010. But this
> > > > code snippet has already existed before 2005. If it is a challenge
> > > > ack, then at least we need to count it (by using NET_INC_STATS(net,
> > > > LINUX_MIB_TCPCHALLENGEACK);).
> > >
> > > The word was not accurate, the behaviour is compliant with RFC 5961.
> > > RFC is often formalised based on real implementations.
> > >
> > > Incrementing the count makes sense to me.
> > >
> > > >
> > > > >
> > > > > If SYN is returned here, the client may lose the chance to RST the
> > > > > previous connection in TIME_WAIT.
> > > > >
> > > > > https://www.rfc-editor.org/rfc/rfc9293.html#section-3.10.7.4-2.4.1
> > > > > ---8<---
> > > > >       -  TIME-WAIT STATE
> > > > >
> > > > >          o  If the SYN bit is set in these synchronized states, it may
> > > > >             be either a legitimate new connection attempt (e.g., in the
> > > > >             case of TIME-WAIT), an error where the connection should be
> > > > >             reset, or the result of an attack attempt, as described in
> > > > >             RFC 5961 [9].  For the TIME-WAIT state, new connections can
> > > > >             be accepted if the Timestamp Option is used and meets
> > > > >             expectations (per [40]).  For all other cases, RFC 5961
> > > > >             provides a mitigation with applicability to some situations,
> > > > >             though there are also alternatives that offer cryptographic
> > > > >             protection (see Section 7).  RFC 5961 recommends that in
> > > > >             these synchronized states, if the SYN bit is set,
> > > > >             irrespective of the sequence number, TCP endpoints MUST send
> > > > >             a "challenge ACK" to the remote peer:
> > > > >
> > > > >             <SEQ=SND.NXT><ACK=RCV.NXT><CTL=ACK>
> > > > > ---8<---
> > > > >
> > > > > https://datatracker.ietf.org/doc/html/rfc5961#section-4
> > > > > ---8<---
> > > > >    1) If the SYN bit is set, irrespective of the sequence number, TCP
> > > > >       MUST send an ACK (also referred to as challenge ACK) to the remote
> > > > >       peer:
> > > > >
> > > > >       <SEQ=SND.NXT><ACK=RCV.NXT><CTL=ACK>
> > > > >
> > > > >       After sending the acknowledgment, TCP MUST drop the unacceptable
> > > > >       segment and stop processing further.
> > > > > ---8<---
> > > >
> > > > The RFC 5961 4.2 was implemented in tcp_validate_incoming():
> > > >         /* step 4: Check for a SYN
> > > >          * RFC 5961 4.2 : Send a challenge ack
> > > >          */
> > > >         if (th->syn) {
> > > >                 if (sk->sk_state == TCP_SYN_RECV && sk->sk_socket && th->ack &&
> > > >                     TCP_SKB_CB(skb)->seq + 1 == TCP_SKB_CB(skb)->end_seq &&
> > > >                     TCP_SKB_CB(skb)->seq + 1 == tp->rcv_nxt &&
> > > >                     TCP_SKB_CB(skb)->ack_seq == tp->snd_nxt)
> > > >                         goto pass;
> > > > syn_challenge:
> > > >                 if (syn_inerr)
> > > >                         TCP_INC_STATS(sock_net(sk), TCP_MIB_INERRS);
> > > >                 NET_INC_STATS(sock_net(sk),
> > > > LINUX_MIB_TCPSYNCHALLENGE);
> > > >                 tcp_send_challenge_ack(sk);
> > > >                 SKB_DR_SET(reason, TCP_INVALID_SYN);
> > > >                 goto discard;
> > > >         }
> > > >
> > > > Also, this quotation you mentioned obviously doesn't match the kernel
> > > > implementation:
> > > > "If the SYN bit is set, irrespective of the sequence number, TCP MUST
> > > > send an ACK"
> > > > The tcp_timewait_state_process() does care about the seq number, or
> > > > else timewait socket would refuse every SYN packet.
> > >
> > > That's why I pasted RFC 9293 first that clearly states that we
> > > should check seq number and then return ACK for all other cases.
> >
> > I don't think so.
> >
> > RFC 9293 only states that RFC 5691 provides an approach that mitigates
> > the risk by rejecting all the SYN packets if the socket stays in
> > synchronized state. It's "For all other cases" in RFC 9293.
>
> RFC 9293 states which RFC to prioritise.  You will find the
> link [40] is RFC 6191.
>
> ---8<---
> For the TIME-WAIT state, new connections can
> be accepted if the Timestamp Option is used and meets
> expectations (per [40]).  For all other cases, RFC 5961
> ...
> ---8<---
>
> > Please loot at "irrespective of the sequence number" in RFC 5691 4.2
> > [1]. It means no matter what the seq is we MUST send back an ACK
> > instead of establishing a new connection.
>
> RFC 9293 mentions accepatble cases first, so this is only applied
> to "all other cases"
>
>
> > Actually the tcp_timewait_state_process() checks the seq or timestamp
> > in the SYN packet.
>
> and this part takes precedence than "all other cases".
>
> Also, you missed that the pasted part is the 4th step of incoming
> segment processing.
>
> https://www.rfc-editor.org/rfc/rfc9293.html#section-3.10.7.4
> ---8<---
> First, check sequence number: ...
> Second, check the RST bit: ...
> Third, check security: ...
> Fourth, check the SYN bit:
> ...
>   TIME-WAIT STATE
>     If the SYN bit is set in these synchronized states...
> ---8<---
>
> So, RFC 9293 says "check seq number, RST, security, then
> if the connection is still accepatable for TIME_WAIT based on
> RFC 6191, accept it, otherwise, return ACK based on RFC 5691".

I see what you mean here. If that is so, tcp_timewait_state_process()
has already implemented the challenge ack even before the challenge
ack showed up in 2010.

Interesting. If this part refers to RFC 5691, then we need to copy
part of tcp_send_challenge_ack() in this case to send the challenge
ack, like testing sysctl_tcp_challenge_ack_limit, etc.

Thanks,
Jason
Philo Lu Nov. 7, 2024, 7:51 a.m. UTC | #16
Hi Jason,

On 2024/11/5 10:55, Jason Xing wrote:
> From: Jason Xing <kernelxing@tencent.com>
> 
> We found there are rare chances that some RST packets appear during
> the shakehands because the timewait socket cannot accept the SYN and
> doesn't return TCP_TW_SYN in tcp_timewait_state_process().
> 
> Here is how things happen in production:
> Time        Client(A)        Server(B)
> 0s          SYN-->
> ...
> 132s                         <-- FIN
> ...
> 169s        FIN-->
> 169s                         <-- ACK
> 169s        SYN-->
> 169s                         <-- ACK
> 169s        RST-->
> As above picture shows, the two flows have a start time difference
> of 169 seconds. B starts to send FIN so it will finally enter into
> TIMEWAIT state. Nearly at the same time A launches a new connection
> that soon is reset by itself due to receiving a ACK.
> 
> There are two key checks in tcp_timewait_state_process() when timewait
> socket in B receives the SYN packet:
> 1) after(TCP_SKB_CB(skb)->seq, rcv_nxt)
> 2) (s32)(READ_ONCE(tcptw->tw_ts_recent) - tmp_opt.rcv_tsval) < 0)
> 
> Regarding the first rule, it fails as expected because in the first
> connection the seq of SYN sent from A is 1892994276, then 169s have
> passed, the second SYN has 239034613 (caused by overflow of s32).
> 
> Then how about the second rule?
> It fails again!
> Let's take a look at how the tsval comes out:
> __tcp_transmit_skb()
>      -> tcp_syn_options()
>          -> opts->tsval = tcp_skb_timestamp_ts(tp->tcp_usec_ts, skb) + tp->tsoffset;
> The timestamp depends on two things, one is skb->skb_mstamp_ns, the
> other is tp->tsoffset. The latter value is fixed, so we don't need
> to care about it. If both operations (sending FIN and then starting
> sending SYN) from A happen in 1ms, then the tsval would be the same.
> It can be clearly seen in the tcpdump log. Notice that the tsval is
> with millisecond precision.
> 
> Based on the above analysis, I decided to make a small change to
> the check in tcp_timewait_state_process() so that the second flow
> would not fail.
> 

I wonder what a bad result the RST causes. As far as I know, the client 
will not close the connect and return. Instead, it re-sends an SYN in 
TCP_TIMEOUT_MIN(2) jiffies (implemented in 
tcp_rcv_synsent_state_process). So the second connection could still be 
established successfully, at the cost of a bit more delay. Like:

  Time        Client(A)        Server(B)
  0s          SYN-->
  ...
  132s                         <-- FIN
  ...
  169s        FIN-->
  169s                         <-- ACK
  169s        SYN-->
  169s                         <-- ACK
  169s        RST-->
~2jiffies    SYN-->
                               <-- SYN,ACK

Thanks.
Jason Xing Nov. 7, 2024, 8:01 a.m. UTC | #17
On Thu, Nov 7, 2024 at 3:51 PM Philo Lu <lulie@linux.alibaba.com> wrote:
>
> Hi Jason,
>
> On 2024/11/5 10:55, Jason Xing wrote:
> > From: Jason Xing <kernelxing@tencent.com>
> >
> > We found there are rare chances that some RST packets appear during
> > the shakehands because the timewait socket cannot accept the SYN and
> > doesn't return TCP_TW_SYN in tcp_timewait_state_process().
> >
> > Here is how things happen in production:
> > Time        Client(A)        Server(B)
> > 0s          SYN-->
> > ...
> > 132s                         <-- FIN
> > ...
> > 169s        FIN-->
> > 169s                         <-- ACK
> > 169s        SYN-->
> > 169s                         <-- ACK
> > 169s        RST-->
> > As above picture shows, the two flows have a start time difference
> > of 169 seconds. B starts to send FIN so it will finally enter into
> > TIMEWAIT state. Nearly at the same time A launches a new connection
> > that soon is reset by itself due to receiving a ACK.
> >
> > There are two key checks in tcp_timewait_state_process() when timewait
> > socket in B receives the SYN packet:
> > 1) after(TCP_SKB_CB(skb)->seq, rcv_nxt)
> > 2) (s32)(READ_ONCE(tcptw->tw_ts_recent) - tmp_opt.rcv_tsval) < 0)
> >
> > Regarding the first rule, it fails as expected because in the first
> > connection the seq of SYN sent from A is 1892994276, then 169s have
> > passed, the second SYN has 239034613 (caused by overflow of s32).
> >
> > Then how about the second rule?
> > It fails again!
> > Let's take a look at how the tsval comes out:
> > __tcp_transmit_skb()
> >      -> tcp_syn_options()
> >          -> opts->tsval = tcp_skb_timestamp_ts(tp->tcp_usec_ts, skb) + tp->tsoffset;
> > The timestamp depends on two things, one is skb->skb_mstamp_ns, the
> > other is tp->tsoffset. The latter value is fixed, so we don't need
> > to care about it. If both operations (sending FIN and then starting
> > sending SYN) from A happen in 1ms, then the tsval would be the same.
> > It can be clearly seen in the tcpdump log. Notice that the tsval is
> > with millisecond precision.
> >
> > Based on the above analysis, I decided to make a small change to
> > the check in tcp_timewait_state_process() so that the second flow
> > would not fail.
> >
>
> I wonder what a bad result the RST causes. As far as I know, the client
> will not close the connect and return. Instead, it re-sends an SYN in
> TCP_TIMEOUT_MIN(2) jiffies (implemented in
> tcp_rcv_synsent_state_process). So the second connection could still be
> established successfully, at the cost of a bit more delay. Like:
>
>   Time        Client(A)        Server(B)
>   0s          SYN-->
>   ...
>   132s                         <-- FIN
>   ...
>   169s        FIN-->
>   169s                         <-- ACK
>   169s        SYN-->
>   169s                         <-- ACK
>   169s        RST-->
> ~2jiffies    SYN-->
>                                <-- SYN,ACK

That's exactly what I meant here :) Originally I didn't expect the
application to relaunch a connection in this case.
Philo Lu Nov. 7, 2024, 8:22 a.m. UTC | #18
On 2024/11/7 16:01, Jason Xing wrote:
> On Thu, Nov 7, 2024 at 3:51 PM Philo Lu <lulie@linux.alibaba.com> wrote:
>>
>> Hi Jason,
>>
>> On 2024/11/5 10:55, Jason Xing wrote:
>>> From: Jason Xing <kernelxing@tencent.com>
>>>
>>> We found there are rare chances that some RST packets appear during
>>> the shakehands because the timewait socket cannot accept the SYN and
>>> doesn't return TCP_TW_SYN in tcp_timewait_state_process().
>>>
>>> Here is how things happen in production:
>>> Time        Client(A)        Server(B)
>>> 0s          SYN-->
>>> ...
>>> 132s                         <-- FIN
>>> ...
>>> 169s        FIN-->
>>> 169s                         <-- ACK
>>> 169s        SYN-->
>>> 169s                         <-- ACK
>>> 169s        RST-->
>>> As above picture shows, the two flows have a start time difference
>>> of 169 seconds. B starts to send FIN so it will finally enter into
>>> TIMEWAIT state. Nearly at the same time A launches a new connection
>>> that soon is reset by itself due to receiving a ACK.
>>>
>>> There are two key checks in tcp_timewait_state_process() when timewait
>>> socket in B receives the SYN packet:
>>> 1) after(TCP_SKB_CB(skb)->seq, rcv_nxt)
>>> 2) (s32)(READ_ONCE(tcptw->tw_ts_recent) - tmp_opt.rcv_tsval) < 0)
>>>
>>> Regarding the first rule, it fails as expected because in the first
>>> connection the seq of SYN sent from A is 1892994276, then 169s have
>>> passed, the second SYN has 239034613 (caused by overflow of s32).
>>>
>>> Then how about the second rule?
>>> It fails again!
>>> Let's take a look at how the tsval comes out:
>>> __tcp_transmit_skb()
>>>       -> tcp_syn_options()
>>>           -> opts->tsval = tcp_skb_timestamp_ts(tp->tcp_usec_ts, skb) + tp->tsoffset;
>>> The timestamp depends on two things, one is skb->skb_mstamp_ns, the
>>> other is tp->tsoffset. The latter value is fixed, so we don't need
>>> to care about it. If both operations (sending FIN and then starting
>>> sending SYN) from A happen in 1ms, then the tsval would be the same.
>>> It can be clearly seen in the tcpdump log. Notice that the tsval is
>>> with millisecond precision.
>>>
>>> Based on the above analysis, I decided to make a small change to
>>> the check in tcp_timewait_state_process() so that the second flow
>>> would not fail.
>>>
>>
>> I wonder what a bad result the RST causes. As far as I know, the client
>> will not close the connect and return. Instead, it re-sends an SYN in
>> TCP_TIMEOUT_MIN(2) jiffies (implemented in
>> tcp_rcv_synsent_state_process). So the second connection could still be
>> established successfully, at the cost of a bit more delay. Like:
>>
>>    Time        Client(A)        Server(B)
>>    0s          SYN-->
>>    ...
>>    132s                         <-- FIN
>>    ...
>>    169s        FIN-->
>>    169s                         <-- ACK
>>    169s        SYN-->
>>    169s                         <-- ACK
>>    169s        RST-->
>> ~2jiffies    SYN-->
>>                                 <-- SYN,ACK
> 
> That's exactly what I meant here :) Originally I didn't expect the
> application to relaunch a connection in this case.

s/application/kernel/, right? Because the retry is transparent to user 
applications except the additional latency. I think all of these are 
finished in a single connect() :)
Jason Xing Nov. 7, 2024, 8:26 a.m. UTC | #19
On Thu, Nov 7, 2024 at 4:22 PM Philo Lu <lulie@linux.alibaba.com> wrote:
>
>
>
> On 2024/11/7 16:01, Jason Xing wrote:
> > On Thu, Nov 7, 2024 at 3:51 PM Philo Lu <lulie@linux.alibaba.com> wrote:
> >>
> >> Hi Jason,
> >>
> >> On 2024/11/5 10:55, Jason Xing wrote:
> >>> From: Jason Xing <kernelxing@tencent.com>
> >>>
> >>> We found there are rare chances that some RST packets appear during
> >>> the shakehands because the timewait socket cannot accept the SYN and
> >>> doesn't return TCP_TW_SYN in tcp_timewait_state_process().
> >>>
> >>> Here is how things happen in production:
> >>> Time        Client(A)        Server(B)
> >>> 0s          SYN-->
> >>> ...
> >>> 132s                         <-- FIN
> >>> ...
> >>> 169s        FIN-->
> >>> 169s                         <-- ACK
> >>> 169s        SYN-->
> >>> 169s                         <-- ACK
> >>> 169s        RST-->
> >>> As above picture shows, the two flows have a start time difference
> >>> of 169 seconds. B starts to send FIN so it will finally enter into
> >>> TIMEWAIT state. Nearly at the same time A launches a new connection
> >>> that soon is reset by itself due to receiving a ACK.
> >>>
> >>> There are two key checks in tcp_timewait_state_process() when timewait
> >>> socket in B receives the SYN packet:
> >>> 1) after(TCP_SKB_CB(skb)->seq, rcv_nxt)
> >>> 2) (s32)(READ_ONCE(tcptw->tw_ts_recent) - tmp_opt.rcv_tsval) < 0)
> >>>
> >>> Regarding the first rule, it fails as expected because in the first
> >>> connection the seq of SYN sent from A is 1892994276, then 169s have
> >>> passed, the second SYN has 239034613 (caused by overflow of s32).
> >>>
> >>> Then how about the second rule?
> >>> It fails again!
> >>> Let's take a look at how the tsval comes out:
> >>> __tcp_transmit_skb()
> >>>       -> tcp_syn_options()
> >>>           -> opts->tsval = tcp_skb_timestamp_ts(tp->tcp_usec_ts, skb) + tp->tsoffset;
> >>> The timestamp depends on two things, one is skb->skb_mstamp_ns, the
> >>> other is tp->tsoffset. The latter value is fixed, so we don't need
> >>> to care about it. If both operations (sending FIN and then starting
> >>> sending SYN) from A happen in 1ms, then the tsval would be the same.
> >>> It can be clearly seen in the tcpdump log. Notice that the tsval is
> >>> with millisecond precision.
> >>>
> >>> Based on the above analysis, I decided to make a small change to
> >>> the check in tcp_timewait_state_process() so that the second flow
> >>> would not fail.
> >>>
> >>
> >> I wonder what a bad result the RST causes. As far as I know, the client
> >> will not close the connect and return. Instead, it re-sends an SYN in
> >> TCP_TIMEOUT_MIN(2) jiffies (implemented in
> >> tcp_rcv_synsent_state_process). So the second connection could still be
> >> established successfully, at the cost of a bit more delay. Like:
> >>
> >>    Time        Client(A)        Server(B)
> >>    0s          SYN-->
> >>    ...
> >>    132s                         <-- FIN
> >>    ...
> >>    169s        FIN-->
> >>    169s                         <-- ACK
> >>    169s        SYN-->
> >>    169s                         <-- ACK
> >>    169s        RST-->
> >> ~2jiffies    SYN-->
> >>                                 <-- SYN,ACK
> >
> > That's exactly what I meant here :) Originally I didn't expect the
> > application to relaunch a connection in this case.
>
> s/application/kernel/, right?

No. Perhaps I didn't make myself clear. If the kernel doesn't silently
drop the SYN and then send back an ACK, the application has to call
the connect() syscall again.

> Because the retry is transparent to user
> applications except the additional latency. I think all of these are
> finished in a single connect() :)

Right.
Eric Dumazet Nov. 7, 2024, 8:37 a.m. UTC | #20
On Thu, Nov 7, 2024 at 9:27 AM Jason Xing <kerneljasonxing@gmail.com> wrote:
>
> On Thu, Nov 7, 2024 at 4:22 PM Philo Lu <lulie@linux.alibaba.com> wrote:
> >
> >
> >
> > On 2024/11/7 16:01, Jason Xing wrote:
> > > On Thu, Nov 7, 2024 at 3:51 PM Philo Lu <lulie@linux.alibaba.com> wrote:
> > >>
> > >> Hi Jason,
> > >>
> > >> On 2024/11/5 10:55, Jason Xing wrote:
> > >>> From: Jason Xing <kernelxing@tencent.com>
> > >>>
> > >>> We found there are rare chances that some RST packets appear during
> > >>> the shakehands because the timewait socket cannot accept the SYN and
> > >>> doesn't return TCP_TW_SYN in tcp_timewait_state_process().
> > >>>
> > >>> Here is how things happen in production:
> > >>> Time        Client(A)        Server(B)
> > >>> 0s          SYN-->
> > >>> ...
> > >>> 132s                         <-- FIN
> > >>> ...
> > >>> 169s        FIN-->
> > >>> 169s                         <-- ACK
> > >>> 169s        SYN-->
> > >>> 169s                         <-- ACK
> > >>> 169s        RST-->
> > >>> As above picture shows, the two flows have a start time difference
> > >>> of 169 seconds. B starts to send FIN so it will finally enter into
> > >>> TIMEWAIT state. Nearly at the same time A launches a new connection
> > >>> that soon is reset by itself due to receiving a ACK.
> > >>>
> > >>> There are two key checks in tcp_timewait_state_process() when timewait
> > >>> socket in B receives the SYN packet:
> > >>> 1) after(TCP_SKB_CB(skb)->seq, rcv_nxt)
> > >>> 2) (s32)(READ_ONCE(tcptw->tw_ts_recent) - tmp_opt.rcv_tsval) < 0)
> > >>>
> > >>> Regarding the first rule, it fails as expected because in the first
> > >>> connection the seq of SYN sent from A is 1892994276, then 169s have
> > >>> passed, the second SYN has 239034613 (caused by overflow of s32).
> > >>>
> > >>> Then how about the second rule?
> > >>> It fails again!
> > >>> Let's take a look at how the tsval comes out:
> > >>> __tcp_transmit_skb()
> > >>>       -> tcp_syn_options()
> > >>>           -> opts->tsval = tcp_skb_timestamp_ts(tp->tcp_usec_ts, skb) + tp->tsoffset;
> > >>> The timestamp depends on two things, one is skb->skb_mstamp_ns, the
> > >>> other is tp->tsoffset. The latter value is fixed, so we don't need
> > >>> to care about it. If both operations (sending FIN and then starting
> > >>> sending SYN) from A happen in 1ms, then the tsval would be the same.
> > >>> It can be clearly seen in the tcpdump log. Notice that the tsval is
> > >>> with millisecond precision.
> > >>>
> > >>> Based on the above analysis, I decided to make a small change to
> > >>> the check in tcp_timewait_state_process() so that the second flow
> > >>> would not fail.
> > >>>
> > >>
> > >> I wonder what a bad result the RST causes. As far as I know, the client
> > >> will not close the connect and return. Instead, it re-sends an SYN in
> > >> TCP_TIMEOUT_MIN(2) jiffies (implemented in
> > >> tcp_rcv_synsent_state_process). So the second connection could still be
> > >> established successfully, at the cost of a bit more delay. Like:
> > >>
> > >>    Time        Client(A)        Server(B)
> > >>    0s          SYN-->
> > >>    ...
> > >>    132s                         <-- FIN
> > >>    ...
> > >>    169s        FIN-->
> > >>    169s                         <-- ACK
> > >>    169s        SYN-->
> > >>    169s                         <-- ACK
> > >>    169s        RST-->
> > >> ~2jiffies    SYN-->
> > >>                                 <-- SYN,ACK
> > >
> > > That's exactly what I meant here :) Originally I didn't expect the
> > > application to relaunch a connection in this case.
> >
> > s/application/kernel/, right?
>
> No. Perhaps I didn't make myself clear. If the kernel doesn't silently
> drop the SYN and then send back an ACK, the application has to call
> the connect() syscall again.

My suggestion to stop the confusion:

Provide a packetdrill test.
Jason Xing Nov. 7, 2024, 9 a.m. UTC | #21
On Thu, Nov 7, 2024 at 4:37 PM Eric Dumazet <edumazet@google.com> wrote:
>
> On Thu, Nov 7, 2024 at 9:27 AM Jason Xing <kerneljasonxing@gmail.com> wrote:
> >
> > On Thu, Nov 7, 2024 at 4:22 PM Philo Lu <lulie@linux.alibaba.com> wrote:
> > >
> > >
> > >
> > > On 2024/11/7 16:01, Jason Xing wrote:
> > > > On Thu, Nov 7, 2024 at 3:51 PM Philo Lu <lulie@linux.alibaba.com> wrote:
> > > >>
> > > >> Hi Jason,
> > > >>
> > > >> On 2024/11/5 10:55, Jason Xing wrote:
> > > >>> From: Jason Xing <kernelxing@tencent.com>
> > > >>>
> > > >>> We found there are rare chances that some RST packets appear during
> > > >>> the shakehands because the timewait socket cannot accept the SYN and
> > > >>> doesn't return TCP_TW_SYN in tcp_timewait_state_process().
> > > >>>
> > > >>> Here is how things happen in production:
> > > >>> Time        Client(A)        Server(B)
> > > >>> 0s          SYN-->
> > > >>> ...
> > > >>> 132s                         <-- FIN
> > > >>> ...
> > > >>> 169s        FIN-->
> > > >>> 169s                         <-- ACK
> > > >>> 169s        SYN-->
> > > >>> 169s                         <-- ACK
> > > >>> 169s        RST-->
> > > >>> As above picture shows, the two flows have a start time difference
> > > >>> of 169 seconds. B starts to send FIN so it will finally enter into
> > > >>> TIMEWAIT state. Nearly at the same time A launches a new connection
> > > >>> that soon is reset by itself due to receiving a ACK.
> > > >>>
> > > >>> There are two key checks in tcp_timewait_state_process() when timewait
> > > >>> socket in B receives the SYN packet:
> > > >>> 1) after(TCP_SKB_CB(skb)->seq, rcv_nxt)
> > > >>> 2) (s32)(READ_ONCE(tcptw->tw_ts_recent) - tmp_opt.rcv_tsval) < 0)
> > > >>>
> > > >>> Regarding the first rule, it fails as expected because in the first
> > > >>> connection the seq of SYN sent from A is 1892994276, then 169s have
> > > >>> passed, the second SYN has 239034613 (caused by overflow of s32).
> > > >>>
> > > >>> Then how about the second rule?
> > > >>> It fails again!
> > > >>> Let's take a look at how the tsval comes out:
> > > >>> __tcp_transmit_skb()
> > > >>>       -> tcp_syn_options()
> > > >>>           -> opts->tsval = tcp_skb_timestamp_ts(tp->tcp_usec_ts, skb) + tp->tsoffset;
> > > >>> The timestamp depends on two things, one is skb->skb_mstamp_ns, the
> > > >>> other is tp->tsoffset. The latter value is fixed, so we don't need
> > > >>> to care about it. If both operations (sending FIN and then starting
> > > >>> sending SYN) from A happen in 1ms, then the tsval would be the same.
> > > >>> It can be clearly seen in the tcpdump log. Notice that the tsval is
> > > >>> with millisecond precision.
> > > >>>
> > > >>> Based on the above analysis, I decided to make a small change to
> > > >>> the check in tcp_timewait_state_process() so that the second flow
> > > >>> would not fail.
> > > >>>
> > > >>
> > > >> I wonder what a bad result the RST causes. As far as I know, the client
> > > >> will not close the connect and return. Instead, it re-sends an SYN in
> > > >> TCP_TIMEOUT_MIN(2) jiffies (implemented in
> > > >> tcp_rcv_synsent_state_process). So the second connection could still be
> > > >> established successfully, at the cost of a bit more delay. Like:
> > > >>
> > > >>    Time        Client(A)        Server(B)
> > > >>    0s          SYN-->
> > > >>    ...
> > > >>    132s                         <-- FIN
> > > >>    ...
> > > >>    169s        FIN-->
> > > >>    169s                         <-- ACK
> > > >>    169s        SYN-->
> > > >>    169s                         <-- ACK
> > > >>    169s        RST-->
> > > >> ~2jiffies    SYN-->

It doesn't happen. I dare to say the SYN will not be retransmitted
soon which I can tell from many tcpdump logs I captured.

> > > >>                                 <-- SYN,ACK
> > > >
> > > > That's exactly what I meant here :) Originally I didn't expect the
> > > > application to relaunch a connection in this case.
> > >
> > > s/application/kernel/, right?
> >
> > No. Perhaps I didn't make myself clear. If the kernel doesn't silently
> > drop the SYN and then send back an ACK, the application has to call
> > the connect() syscall again.
>
> My suggestion to stop the confusion:

Oh, right now, I realized that Philo and I are not on the same page :(
Please forget that conversation.

My points are:
1) If B silently drops the SYN packet, A will retransmit an SYN packet
and then the connection will be established. It's what I tried to
propose and would like to see. It also adheres to the RFC 6191.
2) As kuniyuki pointed out on the contrary, sending an ACK (like the
current implementation) instead of silently dropping the SYN packet is
actually a challenge ack. If so, I think we need to consider this ACK
as a challenge ack like what tcp_send_challenge_ack() does.

I'd like to hear more about your opinions on the above conclusion.

Thanks!
Eric Dumazet Nov. 7, 2024, 9:16 a.m. UTC | #22
On Thu, Nov 7, 2024 at 10:01 AM Jason Xing <kerneljasonxing@gmail.com> wrote:
>
> On Thu, Nov 7, 2024 at 4:37 PM Eric Dumazet <edumazet@google.com> wrote:
> >
> > On Thu, Nov 7, 2024 at 9:27 AM Jason Xing <kerneljasonxing@gmail.com> wrote:
> > >
> > > On Thu, Nov 7, 2024 at 4:22 PM Philo Lu <lulie@linux.alibaba.com> wrote:
> > > >
> > > >
> > > >
> > > > On 2024/11/7 16:01, Jason Xing wrote:
> > > > > On Thu, Nov 7, 2024 at 3:51 PM Philo Lu <lulie@linux.alibaba.com> wrote:
> > > > >>
> > > > >> Hi Jason,
> > > > >>
> > > > >> On 2024/11/5 10:55, Jason Xing wrote:
> > > > >>> From: Jason Xing <kernelxing@tencent.com>
> > > > >>>
> > > > >>> We found there are rare chances that some RST packets appear during
> > > > >>> the shakehands because the timewait socket cannot accept the SYN and
> > > > >>> doesn't return TCP_TW_SYN in tcp_timewait_state_process().
> > > > >>>
> > > > >>> Here is how things happen in production:
> > > > >>> Time        Client(A)        Server(B)
> > > > >>> 0s          SYN-->
> > > > >>> ...
> > > > >>> 132s                         <-- FIN
> > > > >>> ...
> > > > >>> 169s        FIN-->
> > > > >>> 169s                         <-- ACK
> > > > >>> 169s        SYN-->
> > > > >>> 169s                         <-- ACK
> > > > >>> 169s        RST-->
> > > > >>> As above picture shows, the two flows have a start time difference
> > > > >>> of 169 seconds. B starts to send FIN so it will finally enter into
> > > > >>> TIMEWAIT state. Nearly at the same time A launches a new connection
> > > > >>> that soon is reset by itself due to receiving a ACK.
> > > > >>>
> > > > >>> There are two key checks in tcp_timewait_state_process() when timewait
> > > > >>> socket in B receives the SYN packet:
> > > > >>> 1) after(TCP_SKB_CB(skb)->seq, rcv_nxt)
> > > > >>> 2) (s32)(READ_ONCE(tcptw->tw_ts_recent) - tmp_opt.rcv_tsval) < 0)
> > > > >>>
> > > > >>> Regarding the first rule, it fails as expected because in the first
> > > > >>> connection the seq of SYN sent from A is 1892994276, then 169s have
> > > > >>> passed, the second SYN has 239034613 (caused by overflow of s32).
> > > > >>>
> > > > >>> Then how about the second rule?
> > > > >>> It fails again!
> > > > >>> Let's take a look at how the tsval comes out:
> > > > >>> __tcp_transmit_skb()
> > > > >>>       -> tcp_syn_options()
> > > > >>>           -> opts->tsval = tcp_skb_timestamp_ts(tp->tcp_usec_ts, skb) + tp->tsoffset;
> > > > >>> The timestamp depends on two things, one is skb->skb_mstamp_ns, the
> > > > >>> other is tp->tsoffset. The latter value is fixed, so we don't need
> > > > >>> to care about it. If both operations (sending FIN and then starting
> > > > >>> sending SYN) from A happen in 1ms, then the tsval would be the same.
> > > > >>> It can be clearly seen in the tcpdump log. Notice that the tsval is
> > > > >>> with millisecond precision.
> > > > >>>
> > > > >>> Based on the above analysis, I decided to make a small change to
> > > > >>> the check in tcp_timewait_state_process() so that the second flow
> > > > >>> would not fail.
> > > > >>>
> > > > >>
> > > > >> I wonder what a bad result the RST causes. As far as I know, the client
> > > > >> will not close the connect and return. Instead, it re-sends an SYN in
> > > > >> TCP_TIMEOUT_MIN(2) jiffies (implemented in
> > > > >> tcp_rcv_synsent_state_process). So the second connection could still be
> > > > >> established successfully, at the cost of a bit more delay. Like:
> > > > >>
> > > > >>    Time        Client(A)        Server(B)
> > > > >>    0s          SYN-->
> > > > >>    ...
> > > > >>    132s                         <-- FIN
> > > > >>    ...
> > > > >>    169s        FIN-->
> > > > >>    169s                         <-- ACK
> > > > >>    169s        SYN-->
> > > > >>    169s                         <-- ACK
> > > > >>    169s        RST-->
> > > > >> ~2jiffies    SYN-->
>
> It doesn't happen. I dare to say the SYN will not be retransmitted
> soon which I can tell from many tcpdump logs I captured.
>
> > > > >>                                 <-- SYN,ACK
> > > > >
> > > > > That's exactly what I meant here :) Originally I didn't expect the
> > > > > application to relaunch a connection in this case.
> > > >
> > > > s/application/kernel/, right?
> > >
> > > No. Perhaps I didn't make myself clear. If the kernel doesn't silently
> > > drop the SYN and then send back an ACK, the application has to call
> > > the connect() syscall again.
> >
> > My suggestion to stop the confusion:
>
> Oh, right now, I realized that Philo and I are not on the same page :(
> Please forget that conversation.
>
> My points are:
> 1) If B silently drops the SYN packet, A will retransmit an SYN packet
> and then the connection will be established. It's what I tried to
> propose and would like to see. It also adheres to the RFC 6191.
> 2) As kuniyuki pointed out on the contrary, sending an ACK (like the
> current implementation) instead of silently dropping the SYN packet is
> actually a challenge ack. If so, I think we need to consider this ACK
> as a challenge ack like what tcp_send_challenge_ack() does.

Like where ? Again, a packetdrill test will clarify all of this.

>
> I'd like to hear more about your opinions on the above conclusion.
Jason Xing Nov. 7, 2024, 9:18 a.m. UTC | #23
On Thu, Nov 7, 2024 at 5:17 PM Eric Dumazet <edumazet@google.com> wrote:
>
> On Thu, Nov 7, 2024 at 10:01 AM Jason Xing <kerneljasonxing@gmail.com> wrote:
> >
> > On Thu, Nov 7, 2024 at 4:37 PM Eric Dumazet <edumazet@google.com> wrote:
> > >
> > > On Thu, Nov 7, 2024 at 9:27 AM Jason Xing <kerneljasonxing@gmail.com> wrote:
> > > >
> > > > On Thu, Nov 7, 2024 at 4:22 PM Philo Lu <lulie@linux.alibaba.com> wrote:
> > > > >
> > > > >
> > > > >
> > > > > On 2024/11/7 16:01, Jason Xing wrote:
> > > > > > On Thu, Nov 7, 2024 at 3:51 PM Philo Lu <lulie@linux.alibaba.com> wrote:
> > > > > >>
> > > > > >> Hi Jason,
> > > > > >>
> > > > > >> On 2024/11/5 10:55, Jason Xing wrote:
> > > > > >>> From: Jason Xing <kernelxing@tencent.com>
> > > > > >>>
> > > > > >>> We found there are rare chances that some RST packets appear during
> > > > > >>> the shakehands because the timewait socket cannot accept the SYN and
> > > > > >>> doesn't return TCP_TW_SYN in tcp_timewait_state_process().
> > > > > >>>
> > > > > >>> Here is how things happen in production:
> > > > > >>> Time        Client(A)        Server(B)
> > > > > >>> 0s          SYN-->
> > > > > >>> ...
> > > > > >>> 132s                         <-- FIN
> > > > > >>> ...
> > > > > >>> 169s        FIN-->
> > > > > >>> 169s                         <-- ACK
> > > > > >>> 169s        SYN-->
> > > > > >>> 169s                         <-- ACK
> > > > > >>> 169s        RST-->
> > > > > >>> As above picture shows, the two flows have a start time difference
> > > > > >>> of 169 seconds. B starts to send FIN so it will finally enter into
> > > > > >>> TIMEWAIT state. Nearly at the same time A launches a new connection
> > > > > >>> that soon is reset by itself due to receiving a ACK.
> > > > > >>>
> > > > > >>> There are two key checks in tcp_timewait_state_process() when timewait
> > > > > >>> socket in B receives the SYN packet:
> > > > > >>> 1) after(TCP_SKB_CB(skb)->seq, rcv_nxt)
> > > > > >>> 2) (s32)(READ_ONCE(tcptw->tw_ts_recent) - tmp_opt.rcv_tsval) < 0)
> > > > > >>>
> > > > > >>> Regarding the first rule, it fails as expected because in the first
> > > > > >>> connection the seq of SYN sent from A is 1892994276, then 169s have
> > > > > >>> passed, the second SYN has 239034613 (caused by overflow of s32).
> > > > > >>>
> > > > > >>> Then how about the second rule?
> > > > > >>> It fails again!
> > > > > >>> Let's take a look at how the tsval comes out:
> > > > > >>> __tcp_transmit_skb()
> > > > > >>>       -> tcp_syn_options()
> > > > > >>>           -> opts->tsval = tcp_skb_timestamp_ts(tp->tcp_usec_ts, skb) + tp->tsoffset;
> > > > > >>> The timestamp depends on two things, one is skb->skb_mstamp_ns, the
> > > > > >>> other is tp->tsoffset. The latter value is fixed, so we don't need
> > > > > >>> to care about it. If both operations (sending FIN and then starting
> > > > > >>> sending SYN) from A happen in 1ms, then the tsval would be the same.
> > > > > >>> It can be clearly seen in the tcpdump log. Notice that the tsval is
> > > > > >>> with millisecond precision.
> > > > > >>>
> > > > > >>> Based on the above analysis, I decided to make a small change to
> > > > > >>> the check in tcp_timewait_state_process() so that the second flow
> > > > > >>> would not fail.
> > > > > >>>
> > > > > >>
> > > > > >> I wonder what a bad result the RST causes. As far as I know, the client
> > > > > >> will not close the connect and return. Instead, it re-sends an SYN in
> > > > > >> TCP_TIMEOUT_MIN(2) jiffies (implemented in
> > > > > >> tcp_rcv_synsent_state_process). So the second connection could still be
> > > > > >> established successfully, at the cost of a bit more delay. Like:
> > > > > >>
> > > > > >>    Time        Client(A)        Server(B)
> > > > > >>    0s          SYN-->
> > > > > >>    ...
> > > > > >>    132s                         <-- FIN
> > > > > >>    ...
> > > > > >>    169s        FIN-->
> > > > > >>    169s                         <-- ACK
> > > > > >>    169s        SYN-->
> > > > > >>    169s                         <-- ACK
> > > > > >>    169s        RST-->
> > > > > >> ~2jiffies    SYN-->
> >
> > It doesn't happen. I dare to say the SYN will not be retransmitted
> > soon which I can tell from many tcpdump logs I captured.
> >
> > > > > >>                                 <-- SYN,ACK
> > > > > >
> > > > > > That's exactly what I meant here :) Originally I didn't expect the
> > > > > > application to relaunch a connection in this case.
> > > > >
> > > > > s/application/kernel/, right?
> > > >
> > > > No. Perhaps I didn't make myself clear. If the kernel doesn't silently
> > > > drop the SYN and then send back an ACK, the application has to call
> > > > the connect() syscall again.
> > >
> > > My suggestion to stop the confusion:
> >
> > Oh, right now, I realized that Philo and I are not on the same page :(
> > Please forget that conversation.
> >
> > My points are:
> > 1) If B silently drops the SYN packet, A will retransmit an SYN packet
> > and then the connection will be established. It's what I tried to
> > propose and would like to see. It also adheres to the RFC 6191.
> > 2) As kuniyuki pointed out on the contrary, sending an ACK (like the
> > current implementation) instead of silently dropping the SYN packet is
> > actually a challenge ack. If so, I think we need to consider this ACK
> > as a challenge ack like what tcp_send_challenge_ack() does.
>
> Like where ? Again, a packetdrill test will clarify all of this.

Well, okay, Let me try.
Jason Xing Nov. 7, 2024, 9:30 a.m. UTC | #24
On Thu, Nov 7, 2024 at 5:00 PM Jason Xing <kerneljasonxing@gmail.com> wrote:
>
> On Thu, Nov 7, 2024 at 4:37 PM Eric Dumazet <edumazet@google.com> wrote:
> >
> > On Thu, Nov 7, 2024 at 9:27 AM Jason Xing <kerneljasonxing@gmail.com> wrote:
> > >
> > > On Thu, Nov 7, 2024 at 4:22 PM Philo Lu <lulie@linux.alibaba.com> wrote:
> > > >
> > > >
> > > >
> > > > On 2024/11/7 16:01, Jason Xing wrote:
> > > > > On Thu, Nov 7, 2024 at 3:51 PM Philo Lu <lulie@linux.alibaba.com> wrote:
> > > > >>
> > > > >> Hi Jason,
> > > > >>
> > > > >> On 2024/11/5 10:55, Jason Xing wrote:
> > > > >>> From: Jason Xing <kernelxing@tencent.com>
> > > > >>>
> > > > >>> We found there are rare chances that some RST packets appear during
> > > > >>> the shakehands because the timewait socket cannot accept the SYN and
> > > > >>> doesn't return TCP_TW_SYN in tcp_timewait_state_process().
> > > > >>>
> > > > >>> Here is how things happen in production:
> > > > >>> Time        Client(A)        Server(B)
> > > > >>> 0s          SYN-->
> > > > >>> ...
> > > > >>> 132s                         <-- FIN
> > > > >>> ...
> > > > >>> 169s        FIN-->
> > > > >>> 169s                         <-- ACK
> > > > >>> 169s        SYN-->
> > > > >>> 169s                         <-- ACK
> > > > >>> 169s        RST-->
> > > > >>> As above picture shows, the two flows have a start time difference
> > > > >>> of 169 seconds. B starts to send FIN so it will finally enter into
> > > > >>> TIMEWAIT state. Nearly at the same time A launches a new connection
> > > > >>> that soon is reset by itself due to receiving a ACK.
> > > > >>>
> > > > >>> There are two key checks in tcp_timewait_state_process() when timewait
> > > > >>> socket in B receives the SYN packet:
> > > > >>> 1) after(TCP_SKB_CB(skb)->seq, rcv_nxt)
> > > > >>> 2) (s32)(READ_ONCE(tcptw->tw_ts_recent) - tmp_opt.rcv_tsval) < 0)
> > > > >>>
> > > > >>> Regarding the first rule, it fails as expected because in the first
> > > > >>> connection the seq of SYN sent from A is 1892994276, then 169s have
> > > > >>> passed, the second SYN has 239034613 (caused by overflow of s32).
> > > > >>>
> > > > >>> Then how about the second rule?
> > > > >>> It fails again!
> > > > >>> Let's take a look at how the tsval comes out:
> > > > >>> __tcp_transmit_skb()
> > > > >>>       -> tcp_syn_options()
> > > > >>>           -> opts->tsval = tcp_skb_timestamp_ts(tp->tcp_usec_ts, skb) + tp->tsoffset;
> > > > >>> The timestamp depends on two things, one is skb->skb_mstamp_ns, the
> > > > >>> other is tp->tsoffset. The latter value is fixed, so we don't need
> > > > >>> to care about it. If both operations (sending FIN and then starting
> > > > >>> sending SYN) from A happen in 1ms, then the tsval would be the same.
> > > > >>> It can be clearly seen in the tcpdump log. Notice that the tsval is
> > > > >>> with millisecond precision.
> > > > >>>
> > > > >>> Based on the above analysis, I decided to make a small change to
> > > > >>> the check in tcp_timewait_state_process() so that the second flow
> > > > >>> would not fail.
> > > > >>>
> > > > >>
> > > > >> I wonder what a bad result the RST causes. As far as I know, the client
> > > > >> will not close the connect and return. Instead, it re-sends an SYN in
> > > > >> TCP_TIMEOUT_MIN(2) jiffies (implemented in
> > > > >> tcp_rcv_synsent_state_process). So the second connection could still be
> > > > >> established successfully, at the cost of a bit more delay. Like:
> > > > >>
> > > > >>    Time        Client(A)        Server(B)
> > > > >>    0s          SYN-->
> > > > >>    ...
> > > > >>    132s                         <-- FIN
> > > > >>    ...
> > > > >>    169s        FIN-->
> > > > >>    169s                         <-- ACK
> > > > >>    169s        SYN-->
> > > > >>    169s                         <-- ACK
> > > > >>    169s        RST-->
> > > > >> ~2jiffies    SYN-->
>
> It doesn't happen. I dare to say the SYN will not be retransmitted
> soon which I can tell from many tcpdump logs I captured.

My mind went blank probably because of getting sick. Sorry. Why the
tcpdump doesn't show the retransmitted SYN packet is our private
kernel missed this following commit:

commit 9603d47bad4642118fa19fd1562569663d9235f6
Author: SeongJae Park <sjpark@amazon.de>
Date:   Sun Feb 2 03:38:26 2020 +0000

    tcp: Reduce SYN resend delay if a suspicous ACK is received

    When closing a connection, the two acks that required to change closing
    socket's status to FIN_WAIT_2 and then TIME_WAIT could be processed in
    reverse order.  This is possible in RSS disabled environments such as a
    connection inside a host.
Eric Dumazet Nov. 7, 2024, 9:45 a.m. UTC | #25
On Thu, Nov 7, 2024 at 10:30 AM Jason Xing <kerneljasonxing@gmail.com> wrote:
>
> On Thu, Nov 7, 2024 at 5:00 PM Jason Xing <kerneljasonxing@gmail.com> wrote:
> >
> > On Thu, Nov 7, 2024 at 4:37 PM Eric Dumazet <edumazet@google.com> wrote:
> > >
> > > On Thu, Nov 7, 2024 at 9:27 AM Jason Xing <kerneljasonxing@gmail.com> wrote:
> > > >
> > > > On Thu, Nov 7, 2024 at 4:22 PM Philo Lu <lulie@linux.alibaba.com> wrote:
> > > > >
> > > > >
> > > > >
> > > > > On 2024/11/7 16:01, Jason Xing wrote:
> > > > > > On Thu, Nov 7, 2024 at 3:51 PM Philo Lu <lulie@linux.alibaba.com> wrote:
> > > > > >>
> > > > > >> Hi Jason,
> > > > > >>
> > > > > >> On 2024/11/5 10:55, Jason Xing wrote:
> > > > > >>> From: Jason Xing <kernelxing@tencent.com>
> > > > > >>>
> > > > > >>> We found there are rare chances that some RST packets appear during
> > > > > >>> the shakehands because the timewait socket cannot accept the SYN and
> > > > > >>> doesn't return TCP_TW_SYN in tcp_timewait_state_process().
> > > > > >>>
> > > > > >>> Here is how things happen in production:
> > > > > >>> Time        Client(A)        Server(B)
> > > > > >>> 0s          SYN-->
> > > > > >>> ...
> > > > > >>> 132s                         <-- FIN
> > > > > >>> ...
> > > > > >>> 169s        FIN-->
> > > > > >>> 169s                         <-- ACK
> > > > > >>> 169s        SYN-->
> > > > > >>> 169s                         <-- ACK
> > > > > >>> 169s        RST-->
> > > > > >>> As above picture shows, the two flows have a start time difference
> > > > > >>> of 169 seconds. B starts to send FIN so it will finally enter into
> > > > > >>> TIMEWAIT state. Nearly at the same time A launches a new connection
> > > > > >>> that soon is reset by itself due to receiving a ACK.
> > > > > >>>
> > > > > >>> There are two key checks in tcp_timewait_state_process() when timewait
> > > > > >>> socket in B receives the SYN packet:
> > > > > >>> 1) after(TCP_SKB_CB(skb)->seq, rcv_nxt)
> > > > > >>> 2) (s32)(READ_ONCE(tcptw->tw_ts_recent) - tmp_opt.rcv_tsval) < 0)
> > > > > >>>
> > > > > >>> Regarding the first rule, it fails as expected because in the first
> > > > > >>> connection the seq of SYN sent from A is 1892994276, then 169s have
> > > > > >>> passed, the second SYN has 239034613 (caused by overflow of s32).
> > > > > >>>
> > > > > >>> Then how about the second rule?
> > > > > >>> It fails again!
> > > > > >>> Let's take a look at how the tsval comes out:
> > > > > >>> __tcp_transmit_skb()
> > > > > >>>       -> tcp_syn_options()
> > > > > >>>           -> opts->tsval = tcp_skb_timestamp_ts(tp->tcp_usec_ts, skb) + tp->tsoffset;
> > > > > >>> The timestamp depends on two things, one is skb->skb_mstamp_ns, the
> > > > > >>> other is tp->tsoffset. The latter value is fixed, so we don't need
> > > > > >>> to care about it. If both operations (sending FIN and then starting
> > > > > >>> sending SYN) from A happen in 1ms, then the tsval would be the same.
> > > > > >>> It can be clearly seen in the tcpdump log. Notice that the tsval is
> > > > > >>> with millisecond precision.
> > > > > >>>
> > > > > >>> Based on the above analysis, I decided to make a small change to
> > > > > >>> the check in tcp_timewait_state_process() so that the second flow
> > > > > >>> would not fail.
> > > > > >>>
> > > > > >>
> > > > > >> I wonder what a bad result the RST causes. As far as I know, the client
> > > > > >> will not close the connect and return. Instead, it re-sends an SYN in
> > > > > >> TCP_TIMEOUT_MIN(2) jiffies (implemented in
> > > > > >> tcp_rcv_synsent_state_process). So the second connection could still be
> > > > > >> established successfully, at the cost of a bit more delay. Like:
> > > > > >>
> > > > > >>    Time        Client(A)        Server(B)
> > > > > >>    0s          SYN-->
> > > > > >>    ...
> > > > > >>    132s                         <-- FIN
> > > > > >>    ...
> > > > > >>    169s        FIN-->
> > > > > >>    169s                         <-- ACK
> > > > > >>    169s        SYN-->
> > > > > >>    169s                         <-- ACK
> > > > > >>    169s        RST-->
> > > > > >> ~2jiffies    SYN-->
> >
> > It doesn't happen. I dare to say the SYN will not be retransmitted
> > soon which I can tell from many tcpdump logs I captured.
>
> My mind went blank probably because of getting sick. Sorry. Why the
> tcpdump doesn't show the retransmitted SYN packet is our private
> kernel missed this following commit:
>
> commit 9603d47bad4642118fa19fd1562569663d9235f6
> Author: SeongJae Park <sjpark@amazon.de>
> Date:   Sun Feb 2 03:38:26 2020 +0000
>
>     tcp: Reduce SYN resend delay if a suspicous ACK is received
>
>     When closing a connection, the two acks that required to change closing
>     socket's status to FIN_WAIT_2 and then TIME_WAIT could be processed in
>     reverse order.  This is possible in RSS disabled environments such as a
>     connection inside a host.

Not having a patch from linux-5.6 does not really give a good impression,
please next time make sure your patches are needed for upstream trees,
net or net-next.

This also means you can write a packetdrill test to show the benefit
of this patch,
and send it as a new selftests ;)

Back to packetdrill, and everyone will be happy.
Eric Dumazet Nov. 7, 2024, 9:48 a.m. UTC | #26
On Thu, Nov 7, 2024 at 10:45 AM Eric Dumazet <edumazet@google.com> wrote:
>
> On Thu, Nov 7, 2024 at 10:30 AM Jason Xing <kerneljasonxing@gmail.com> wrote:
> >
> > On Thu, Nov 7, 2024 at 5:00 PM Jason Xing <kerneljasonxing@gmail.com> wrote:
> > >
> > > On Thu, Nov 7, 2024 at 4:37 PM Eric Dumazet <edumazet@google.com> wrote:
> > > >
> > > > On Thu, Nov 7, 2024 at 9:27 AM Jason Xing <kerneljasonxing@gmail.com> wrote:
> > > > >
> > > > > On Thu, Nov 7, 2024 at 4:22 PM Philo Lu <lulie@linux.alibaba.com> wrote:
> > > > > >
> > > > > >
> > > > > >
> > > > > > On 2024/11/7 16:01, Jason Xing wrote:
> > > > > > > On Thu, Nov 7, 2024 at 3:51 PM Philo Lu <lulie@linux.alibaba.com> wrote:
> > > > > > >>
> > > > > > >> Hi Jason,
> > > > > > >>
> > > > > > >> On 2024/11/5 10:55, Jason Xing wrote:
> > > > > > >>> From: Jason Xing <kernelxing@tencent.com>
> > > > > > >>>
> > > > > > >>> We found there are rare chances that some RST packets appear during
> > > > > > >>> the shakehands because the timewait socket cannot accept the SYN and
> > > > > > >>> doesn't return TCP_TW_SYN in tcp_timewait_state_process().
> > > > > > >>>
> > > > > > >>> Here is how things happen in production:
> > > > > > >>> Time        Client(A)        Server(B)
> > > > > > >>> 0s          SYN-->
> > > > > > >>> ...
> > > > > > >>> 132s                         <-- FIN
> > > > > > >>> ...
> > > > > > >>> 169s        FIN-->
> > > > > > >>> 169s                         <-- ACK
> > > > > > >>> 169s        SYN-->
> > > > > > >>> 169s                         <-- ACK
> > > > > > >>> 169s        RST-->
> > > > > > >>> As above picture shows, the two flows have a start time difference
> > > > > > >>> of 169 seconds. B starts to send FIN so it will finally enter into
> > > > > > >>> TIMEWAIT state. Nearly at the same time A launches a new connection
> > > > > > >>> that soon is reset by itself due to receiving a ACK.
> > > > > > >>>
> > > > > > >>> There are two key checks in tcp_timewait_state_process() when timewait
> > > > > > >>> socket in B receives the SYN packet:
> > > > > > >>> 1) after(TCP_SKB_CB(skb)->seq, rcv_nxt)
> > > > > > >>> 2) (s32)(READ_ONCE(tcptw->tw_ts_recent) - tmp_opt.rcv_tsval) < 0)
> > > > > > >>>
> > > > > > >>> Regarding the first rule, it fails as expected because in the first
> > > > > > >>> connection the seq of SYN sent from A is 1892994276, then 169s have
> > > > > > >>> passed, the second SYN has 239034613 (caused by overflow of s32).
> > > > > > >>>
> > > > > > >>> Then how about the second rule?
> > > > > > >>> It fails again!
> > > > > > >>> Let's take a look at how the tsval comes out:
> > > > > > >>> __tcp_transmit_skb()
> > > > > > >>>       -> tcp_syn_options()
> > > > > > >>>           -> opts->tsval = tcp_skb_timestamp_ts(tp->tcp_usec_ts, skb) + tp->tsoffset;
> > > > > > >>> The timestamp depends on two things, one is skb->skb_mstamp_ns, the
> > > > > > >>> other is tp->tsoffset. The latter value is fixed, so we don't need
> > > > > > >>> to care about it. If both operations (sending FIN and then starting
> > > > > > >>> sending SYN) from A happen in 1ms, then the tsval would be the same.
> > > > > > >>> It can be clearly seen in the tcpdump log. Notice that the tsval is
> > > > > > >>> with millisecond precision.
> > > > > > >>>
> > > > > > >>> Based on the above analysis, I decided to make a small change to
> > > > > > >>> the check in tcp_timewait_state_process() so that the second flow
> > > > > > >>> would not fail.
> > > > > > >>>
> > > > > > >>
> > > > > > >> I wonder what a bad result the RST causes. As far as I know, the client
> > > > > > >> will not close the connect and return. Instead, it re-sends an SYN in
> > > > > > >> TCP_TIMEOUT_MIN(2) jiffies (implemented in
> > > > > > >> tcp_rcv_synsent_state_process). So the second connection could still be
> > > > > > >> established successfully, at the cost of a bit more delay. Like:
> > > > > > >>
> > > > > > >>    Time        Client(A)        Server(B)
> > > > > > >>    0s          SYN-->
> > > > > > >>    ...
> > > > > > >>    132s                         <-- FIN
> > > > > > >>    ...
> > > > > > >>    169s        FIN-->
> > > > > > >>    169s                         <-- ACK
> > > > > > >>    169s        SYN-->
> > > > > > >>    169s                         <-- ACK
> > > > > > >>    169s        RST-->
> > > > > > >> ~2jiffies    SYN-->
> > >
> > > It doesn't happen. I dare to say the SYN will not be retransmitted
> > > soon which I can tell from many tcpdump logs I captured.
> >
> > My mind went blank probably because of getting sick. Sorry. Why the
> > tcpdump doesn't show the retransmitted SYN packet is our private
> > kernel missed this following commit:
> >
> > commit 9603d47bad4642118fa19fd1562569663d9235f6
> > Author: SeongJae Park <sjpark@amazon.de>
> > Date:   Sun Feb 2 03:38:26 2020 +0000
> >
> >     tcp: Reduce SYN resend delay if a suspicous ACK is received
> >
> >     When closing a connection, the two acks that required to change closing
> >     socket's status to FIN_WAIT_2 and then TIME_WAIT could be processed in
> >     reverse order.  This is possible in RSS disabled environments such as a
> >     connection inside a host.
>
> Not having a patch from linux-5.6 does not really give a good impression,
> please next time make sure your patches are needed for upstream trees,
> net or net-next.
>
> This also means you can write a packetdrill test to show the benefit
> of this patch,
> and send it as a new selftests ;)
>
> Back to packetdrill, and everyone will be happy.

Well, there is selftest already, not based on packetdrill because
we had no packetdrill support yet in selftests

commit af8c8a450bf4698a8a6a7c68956ea5ccafe4cc88
Author: SeongJae Park <sjpark@amazon.de>
Date:   Sun Feb 2 03:38:27 2020 +0000

    selftests: net: Add FIN_ACK processing order related latency spike test
Jason Xing Nov. 7, 2024, 9:57 a.m. UTC | #27
On Thu, Nov 7, 2024 at 5:45 PM Eric Dumazet <edumazet@google.com> wrote:
>
> On Thu, Nov 7, 2024 at 10:30 AM Jason Xing <kerneljasonxing@gmail.com> wrote:
> >
> > On Thu, Nov 7, 2024 at 5:00 PM Jason Xing <kerneljasonxing@gmail.com> wrote:
> > >
> > > On Thu, Nov 7, 2024 at 4:37 PM Eric Dumazet <edumazet@google.com> wrote:
> > > >
> > > > On Thu, Nov 7, 2024 at 9:27 AM Jason Xing <kerneljasonxing@gmail.com> wrote:
> > > > >
> > > > > On Thu, Nov 7, 2024 at 4:22 PM Philo Lu <lulie@linux.alibaba.com> wrote:
> > > > > >
> > > > > >
> > > > > >
> > > > > > On 2024/11/7 16:01, Jason Xing wrote:
> > > > > > > On Thu, Nov 7, 2024 at 3:51 PM Philo Lu <lulie@linux.alibaba.com> wrote:
> > > > > > >>
> > > > > > >> Hi Jason,
> > > > > > >>
> > > > > > >> On 2024/11/5 10:55, Jason Xing wrote:
> > > > > > >>> From: Jason Xing <kernelxing@tencent.com>
> > > > > > >>>
> > > > > > >>> We found there are rare chances that some RST packets appear during
> > > > > > >>> the shakehands because the timewait socket cannot accept the SYN and
> > > > > > >>> doesn't return TCP_TW_SYN in tcp_timewait_state_process().
> > > > > > >>>
> > > > > > >>> Here is how things happen in production:
> > > > > > >>> Time        Client(A)        Server(B)
> > > > > > >>> 0s          SYN-->
> > > > > > >>> ...
> > > > > > >>> 132s                         <-- FIN
> > > > > > >>> ...
> > > > > > >>> 169s        FIN-->
> > > > > > >>> 169s                         <-- ACK
> > > > > > >>> 169s        SYN-->
> > > > > > >>> 169s                         <-- ACK
> > > > > > >>> 169s        RST-->
> > > > > > >>> As above picture shows, the two flows have a start time difference
> > > > > > >>> of 169 seconds. B starts to send FIN so it will finally enter into
> > > > > > >>> TIMEWAIT state. Nearly at the same time A launches a new connection
> > > > > > >>> that soon is reset by itself due to receiving a ACK.
> > > > > > >>>
> > > > > > >>> There are two key checks in tcp_timewait_state_process() when timewait
> > > > > > >>> socket in B receives the SYN packet:
> > > > > > >>> 1) after(TCP_SKB_CB(skb)->seq, rcv_nxt)
> > > > > > >>> 2) (s32)(READ_ONCE(tcptw->tw_ts_recent) - tmp_opt.rcv_tsval) < 0)
> > > > > > >>>
> > > > > > >>> Regarding the first rule, it fails as expected because in the first
> > > > > > >>> connection the seq of SYN sent from A is 1892994276, then 169s have
> > > > > > >>> passed, the second SYN has 239034613 (caused by overflow of s32).
> > > > > > >>>
> > > > > > >>> Then how about the second rule?
> > > > > > >>> It fails again!
> > > > > > >>> Let's take a look at how the tsval comes out:
> > > > > > >>> __tcp_transmit_skb()
> > > > > > >>>       -> tcp_syn_options()
> > > > > > >>>           -> opts->tsval = tcp_skb_timestamp_ts(tp->tcp_usec_ts, skb) + tp->tsoffset;
> > > > > > >>> The timestamp depends on two things, one is skb->skb_mstamp_ns, the
> > > > > > >>> other is tp->tsoffset. The latter value is fixed, so we don't need
> > > > > > >>> to care about it. If both operations (sending FIN and then starting
> > > > > > >>> sending SYN) from A happen in 1ms, then the tsval would be the same.
> > > > > > >>> It can be clearly seen in the tcpdump log. Notice that the tsval is
> > > > > > >>> with millisecond precision.
> > > > > > >>>
> > > > > > >>> Based on the above analysis, I decided to make a small change to
> > > > > > >>> the check in tcp_timewait_state_process() so that the second flow
> > > > > > >>> would not fail.
> > > > > > >>>
> > > > > > >>
> > > > > > >> I wonder what a bad result the RST causes. As far as I know, the client
> > > > > > >> will not close the connect and return. Instead, it re-sends an SYN in
> > > > > > >> TCP_TIMEOUT_MIN(2) jiffies (implemented in
> > > > > > >> tcp_rcv_synsent_state_process). So the second connection could still be
> > > > > > >> established successfully, at the cost of a bit more delay. Like:
> > > > > > >>
> > > > > > >>    Time        Client(A)        Server(B)
> > > > > > >>    0s          SYN-->
> > > > > > >>    ...
> > > > > > >>    132s                         <-- FIN
> > > > > > >>    ...
> > > > > > >>    169s        FIN-->
> > > > > > >>    169s                         <-- ACK
> > > > > > >>    169s        SYN-->
> > > > > > >>    169s                         <-- ACK
> > > > > > >>    169s        RST-->
> > > > > > >> ~2jiffies    SYN-->
> > >
> > > It doesn't happen. I dare to say the SYN will not be retransmitted
> > > soon which I can tell from many tcpdump logs I captured.
> >
> > My mind went blank probably because of getting sick. Sorry. Why the
> > tcpdump doesn't show the retransmitted SYN packet is our private
> > kernel missed this following commit:
> >
> > commit 9603d47bad4642118fa19fd1562569663d9235f6
> > Author: SeongJae Park <sjpark@amazon.de>
> > Date:   Sun Feb 2 03:38:26 2020 +0000
> >
> >     tcp: Reduce SYN resend delay if a suspicous ACK is received
> >
> >     When closing a connection, the two acks that required to change closing
> >     socket's status to FIN_WAIT_2 and then TIME_WAIT could be processed in
> >     reverse order.  This is possible in RSS disabled environments such as a
> >     connection inside a host.
>
> Not having a patch from linux-5.6 does not really give a good impression,
> please next time make sure your patches are needed for upstream trees,
> net or net-next.

Sorry about this. Normally, I found an issue on those older kernels
and then I will test it on the net-next tree. The issue this patch
tried to fix can be reproduced, as we all see. But today regarding the
topic about re-sending an SYN quickly, I didn't conduct the test, only
analyzing the code :( Sorry for the noise.

>
> This also means you can write a packetdrill test to show the benefit
> of this patch,
> and send it as a new selftests ;)
>
> Back to packetdrill, and everyone will be happy.

Thanks for reminding me. It's really necessary for me to get started
to learn how to write various packetdrill tests.
diff mbox series

Patch

diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
index bb1fe1ba867a..2b29d1bf5ca0 100644
--- a/net/ipv4/tcp_minisocks.c
+++ b/net/ipv4/tcp_minisocks.c
@@ -234,7 +234,7 @@  tcp_timewait_state_process(struct inet_timewait_sock *tw, struct sk_buff *skb,
 	if (th->syn && !th->rst && !th->ack && !paws_reject &&
 	    (after(TCP_SKB_CB(skb)->seq, rcv_nxt) ||
 	     (tmp_opt.saw_tstamp &&
-	      (s32)(READ_ONCE(tcptw->tw_ts_recent) - tmp_opt.rcv_tsval) < 0))) {
+	      (s32)(READ_ONCE(tcptw->tw_ts_recent) - tmp_opt.rcv_tsval) <= 0))) {
 		u32 isn = tcptw->tw_snd_nxt + 65535 + 2;
 		if (isn == 0)
 			isn++;