Message ID | 20200131122421.23286-3-sjpark@amazon.com
---|---
State | New
Series | Fix reconnection latency caused by FIN/ACK handling race
On Fri, Jan 31, 2020 at 4:25 AM <sjpark@amazon.com> wrote:

> Signed-off-by: SeongJae Park <sjpark@amazon.de>
> ---
>  net/ipv4/tcp_input.c | 6 +++++-
>  1 file changed, 5 insertions(+), 1 deletion(-)
>
> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> index 2a976f57f7e7..b168e29e1ad1 100644
> --- a/net/ipv4/tcp_input.c
> +++ b/net/ipv4/tcp_input.c
> @@ -5893,8 +5893,12 @@ static int tcp_rcv_synsent_state_process(struct sock *sk, struct sk_buff *skb,
>          * the segment and return)"
>          */
>         if (!after(TCP_SKB_CB(skb)->ack_seq, tp->snd_una) ||
> -           after(TCP_SKB_CB(skb)->ack_seq, tp->snd_nxt))
> +           after(TCP_SKB_CB(skb)->ack_seq, tp->snd_nxt)) {
> +               /* Previous FIN/ACK or RST/ACK might be ignored. */
> +               inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS,
> +                                         TCP_ATO_MIN, TCP_RTO_MAX);

This is not what I suggested.

I suggested implementing a strategy where only the _first_ retransmit
would be done earlier.

So you need to look at the current counter of retransmit attempts, then
reset the timer only if this SYN_SENT socket never resent a SYN.

We do not want to trigger packet storms if, for some reason, the remote
peer constantly sends us the same packet.

Thanks.

>                 goto reset_and_undo;
> +       }
>
>         if (tp->rx_opt.saw_tstamp && tp->rx_opt.rcv_tsecr &&
>             !between(tp->rx_opt.rcv_tsecr, tp->retrans_stamp,
> --
> 2.17.1
>
On Fri, Jan 31, 2020 at 7:25 AM <sjpark@amazon.com> wrote:
>
> From: SeongJae Park <sjpark@amazon.de>
>
> When closing a connection, the two acks required to change the closing
> socket's state to FIN_WAIT_2 and then TIME_WAIT could be processed in
> reverse order. This is possible in RSS-disabled environments such as a
> connection inside a host.
>
> For example, the expected state transitions and required packets for
> the disconnection will be similar to the flow below.
>
> 00 (Process A)                         (Process B)
> 01 ESTABLISHED                         ESTABLISHED
> 02 close()
> 03 FIN_WAIT_1
> 04         ---FIN-->
> 05                                     CLOSE_WAIT
> 06         <--ACK---
> 07 FIN_WAIT_2
> 08         <--FIN/ACK---
> 09 TIME_WAIT
> 10         ---ACK-->
> 11                                     LAST_ACK
> 12 CLOSED                              CLOSED

AFAICT this sequence is not quite what would happen; it would be
different starting at line 8, and would unfold as follows:

08                                     close()
09                                     LAST_ACK
10         <--FIN/ACK---
11 TIME_WAIT
12         ---ACK-->
13 CLOSED                              CLOSED

> The acks in lines 6 and 8 are the acks. If the line 8 packet is
> processed before the line 6 packet, it will be just ignored as it is
> not an expected packet,

AFAICT that is where the bug starts.

AFAICT, from first principles, when process A receives the FIN/ACK it
should move to TIME_WAIT even if it has not received a preceding ACK.
That's because ACKs are cumulative, so receiving a later cumulative ACK
conveys all the information in the previous ACKs.

Also, consider the de facto standard state transition diagram from
"TCP/IP Illustrated, Volume 2: The Implementation" by Wright and
Stevens, e.g.:

https://courses.cs.washington.edu/courses/cse461/19sp/lectures/TCPIP_State_Transition_Diagram.pdf

This first-principles analysis agrees with the Wright/Stevens diagram,
which says that a connection in FIN_WAIT_1 that receives a FIN/ACK
should move to TIME_WAIT.

This seems like a faster and more robust solution than installing
special timers.

Thoughts?

neal
On Fri, Jan 31, 2020 at 8:12 AM <sjpark@amazon.com> wrote:
>
> On Fri, 31 Jan 2020 07:01:21 -0800 Eric Dumazet <edumazet@google.com> wrote:
>
> > On Fri, Jan 31, 2020 at 4:25 AM <sjpark@amazon.com> wrote:
> >
> > > Signed-off-by: SeongJae Park <sjpark@amazon.de>
> > > ---
> > >  net/ipv4/tcp_input.c | 6 +++++-
> > >  1 file changed, 5 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> > > index 2a976f57f7e7..b168e29e1ad1 100644
> > > --- a/net/ipv4/tcp_input.c
> > > +++ b/net/ipv4/tcp_input.c
> > > @@ -5893,8 +5893,12 @@ static int tcp_rcv_synsent_state_process(struct sock *sk, struct sk_buff *skb,
> > >          * the segment and return)"
> > >          */
> > >         if (!after(TCP_SKB_CB(skb)->ack_seq, tp->snd_una) ||
> > > -           after(TCP_SKB_CB(skb)->ack_seq, tp->snd_nxt))
> > > +           after(TCP_SKB_CB(skb)->ack_seq, tp->snd_nxt)) {
> > > +               /* Previous FIN/ACK or RST/ACK might be ignored. */
> > > +               inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS,
> > > +                                         TCP_ATO_MIN, TCP_RTO_MAX);
> >
> > This is not what I suggested.
> >
> > I suggested implementing a strategy where only the _first_ retransmit
> > would be done earlier.
> >
> > So you need to look at the current counter of retransmit attempts,
> > then reset the timer only if this SYN_SENT socket never resent a SYN.
> >
> > We do not want to trigger packet storms if, for some reason, the
> > remote peer constantly sends us the same packet.
>
> You're right, I missed the important point, thank you for pointing it
> out. Among the retransmission-related fields of 'tcp_sock', I think
> '->total_retrans' would fit for this check. How about the change below?
>
> ```
> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> index 2a976f57f7e7..29fc0e4da931 100644
> --- a/net/ipv4/tcp_input.c
> +++ b/net/ipv4/tcp_input.c
> @@ -5893,8 +5893,14 @@ static int tcp_rcv_synsent_state_process(struct sock *sk, struct sk_buff *skb,
>          * the segment and return)"
>          */
>         if (!after(TCP_SKB_CB(skb)->ack_seq, tp->snd_una) ||
> -           after(TCP_SKB_CB(skb)->ack_seq, tp->snd_nxt))
> +           after(TCP_SKB_CB(skb)->ack_seq, tp->snd_nxt)) {
> +               /* Previous FIN/ACK or RST/ACK might be ignored. */
> +               if (tp->total_retrans == 0)

The canonical field would be icsk->icsk_retransmits (look in
net/ipv4/tcp_timer.c).

AFAIK, it seems we forgot to clear tp->total_retrans in tcp_disconnect().
I will send a patch for this tp->total_retrans thing.

> +                       inet_csk_reset_xmit_timer(sk,
> +                                       ICSK_TIME_RETRANS, TCP_ATO_MIN,
> +                                       TCP_RTO_MAX);
>                 goto reset_and_undo;
> +       }
>
>         if (tp->rx_opt.saw_tstamp && tp->rx_opt.rcv_tsecr &&
>             !between(tp->rx_opt.rcv_tsecr, tp->retrans_stamp,
> ```
>
> Thanks,
> SeongJae Park
>
> > Thanks.
> >
> > >             goto reset_and_undo;
> > > +       }
> > >
> > >         if (tp->rx_opt.saw_tstamp && tp->rx_opt.rcv_tsecr &&
> > >             !between(tp->rx_opt.rcv_tsecr, tp->retrans_stamp,
> > > --
> > > 2.17.1
> > >
On Fri, 31 Jan 2020 08:55:08 -0800 Eric Dumazet <edumazet@google.com> wrote:

> On Fri, Jan 31, 2020 at 8:12 AM <sjpark@amazon.com> wrote:
> >
> > On Fri, 31 Jan 2020 07:01:21 -0800 Eric Dumazet <edumazet@google.com> wrote:
> >
> > > On Fri, Jan 31, 2020 at 4:25 AM <sjpark@amazon.com> wrote:
> > >
> > > > Signed-off-by: SeongJae Park <sjpark@amazon.de>
> > > > ---
> > > >  net/ipv4/tcp_input.c | 6 +++++-
> > > >  1 file changed, 5 insertions(+), 1 deletion(-)
> > > >
> > > > diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> > > > index 2a976f57f7e7..b168e29e1ad1 100644
> > > > --- a/net/ipv4/tcp_input.c
> > > > +++ b/net/ipv4/tcp_input.c
> > > > @@ -5893,8 +5893,12 @@ static int tcp_rcv_synsent_state_process(struct sock *sk, struct sk_buff *skb,
> > > >          * the segment and return)"
> > > >          */
> > > >         if (!after(TCP_SKB_CB(skb)->ack_seq, tp->snd_una) ||
> > > > -           after(TCP_SKB_CB(skb)->ack_seq, tp->snd_nxt))
> > > > +           after(TCP_SKB_CB(skb)->ack_seq, tp->snd_nxt)) {
> > > > +               /* Previous FIN/ACK or RST/ACK might be ignored. */
> > > > +               inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS,
> > > > +                                         TCP_ATO_MIN, TCP_RTO_MAX);
> > >
> > > This is not what I suggested.
> > >
> > > I suggested implementing a strategy where only the _first_
> > > retransmit would be done earlier.
> > >
> > > So you need to look at the current counter of retransmit attempts,
> > > then reset the timer only if this SYN_SENT socket never resent a
> > > SYN.
> > >
> > > We do not want to trigger packet storms if, for some reason, the
> > > remote peer constantly sends us the same packet.
> >
> > You're right, I missed the important point, thank you for pointing it
> > out. Among the retransmission-related fields of 'tcp_sock', I think
> > '->total_retrans' would fit for this check. How about the change
> > below?
> >
> > ```
> > diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> > index 2a976f57f7e7..29fc0e4da931 100644
> > --- a/net/ipv4/tcp_input.c
> > +++ b/net/ipv4/tcp_input.c
> > @@ -5893,8 +5893,14 @@ static int tcp_rcv_synsent_state_process(struct sock *sk, struct sk_buff *skb,
> >          * the segment and return)"
> >          */
> >         if (!after(TCP_SKB_CB(skb)->ack_seq, tp->snd_una) ||
> > -           after(TCP_SKB_CB(skb)->ack_seq, tp->snd_nxt))
> > +           after(TCP_SKB_CB(skb)->ack_seq, tp->snd_nxt)) {
> > +               /* Previous FIN/ACK or RST/ACK might be ignored. */
> > +               if (tp->total_retrans == 0)
>
> The canonical field would be icsk->icsk_retransmits (look in
> net/ipv4/tcp_timer.c).
>
> AFAIK, it seems we forgot to clear tp->total_retrans in
> tcp_disconnect(). I will send a patch for this tp->total_retrans thing.

Oh, then I will use 'icsk->icsk_retransmits' instead of
'tp->total_retrans' in the next spin. May I also ask you to Cc me on
your 'tp->total_retrans' fix patch?

Thanks,
SeongJae Park

> > +                       inet_csk_reset_xmit_timer(sk,
> > +                                       ICSK_TIME_RETRANS, TCP_ATO_MIN,
> > +                                       TCP_RTO_MAX);
> >                 goto reset_and_undo;
> > +       }
> >
> >         if (tp->rx_opt.saw_tstamp && tp->rx_opt.rcv_tsecr &&
> >             !between(tp->rx_opt.rcv_tsecr, tp->retrans_stamp,
> > ```
> >
> > Thanks,
> > SeongJae Park
> >
> > > Thanks.
> > >
> > > >             goto reset_and_undo;
> > > > +       }
> > > >
> > > >         if (tp->rx_opt.saw_tstamp && tp->rx_opt.rcv_tsecr &&
> > > >             !between(tp->rx_opt.rcv_tsecr, tp->retrans_stamp,
> > > > --
> > > > 2.17.1
> > > >
On Fri, Jan 31, 2020 at 9:05 AM <sjpark@amazon.com> wrote:
>
> Oh, then I will use 'icsk->icsk_retransmits' instead of
> 'tp->total_retrans' in the next spin. May I also ask you to Cc me on
> your 'tp->total_retrans' fix patch?

Sure, but I usually send my patches to netdev@

Please subscribe to the list if you want to get a copy of all TCP
patches in the future.
On 1/31/20 7:10 AM, Neal Cardwell wrote:
> On Fri, Jan 31, 2020 at 7:25 AM <sjpark@amazon.com> wrote:
>>
>> From: SeongJae Park <sjpark@amazon.de>
>>
>> When closing a connection, the two acks required to change the closing
>> socket's state to FIN_WAIT_2 and then TIME_WAIT could be processed in
>> reverse order. This is possible in RSS-disabled environments such as a
>> connection inside a host.
>>
>> For example, the expected state transitions and required packets for
>> the disconnection will be similar to the flow below.
>>
>> 00 (Process A)                         (Process B)
>> 01 ESTABLISHED                         ESTABLISHED
>> 02 close()
>> 03 FIN_WAIT_1
>> 04         ---FIN-->
>> 05                                     CLOSE_WAIT
>> 06         <--ACK---
>> 07 FIN_WAIT_2
>> 08         <--FIN/ACK---
>> 09 TIME_WAIT
>> 10         ---ACK-->
>> 11                                     LAST_ACK
>> 12 CLOSED                              CLOSED
>
> AFAICT this sequence is not quite what would happen; it would be
> different starting at line 8, and would unfold as follows:
>
> 08                                     close()
> 09                                     LAST_ACK
> 10         <--FIN/ACK---
> 11 TIME_WAIT
> 12         ---ACK-->
> 13 CLOSED                              CLOSED
>
>> The acks in lines 6 and 8 are the acks. If the line 8 packet is
>> processed before the line 6 packet, it will be just ignored as it is
>> not an expected packet,
>
> AFAICT that is where the bug starts.
>
> AFAICT, from first principles, when process A receives the FIN/ACK it
> should move to TIME_WAIT even if it has not received a preceding ACK.
> That's because ACKs are cumulative, so receiving a later cumulative ACK
> conveys all the information in the previous ACKs.
>
> Also, consider the de facto standard state transition diagram from
> "TCP/IP Illustrated, Volume 2: The Implementation" by Wright and
> Stevens, e.g.:
>
> https://courses.cs.washington.edu/courses/cse461/19sp/lectures/TCPIP_State_Transition_Diagram.pdf
>
> This first-principles analysis agrees with the Wright/Stevens diagram,
> which says that a connection in FIN_WAIT_1 that receives a FIN/ACK
> should move to TIME_WAIT.
>
> This seems like a faster and more robust solution than installing
> special timers.
>
> Thoughts?

This is orthogonal, I think.

No matter how hard we fix the other side, we should improve the active
side.

Since we send a RST, sending the SYN a few ms after the RST seems way
better than waiting 1 second as if we received no packet at all.

Receiving this ACK tells us something about networking health; no need
to be very cautious about the next attempt.

Of course, if you have a fix for the passive side, that would be nice
to review!
On Fri, Jan 31, 2020 at 1:12 PM Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
> On 1/31/20 7:10 AM, Neal Cardwell wrote:
> > On Fri, Jan 31, 2020 at 7:25 AM <sjpark@amazon.com> wrote:
> >>
> >> From: SeongJae Park <sjpark@amazon.de>
> >>
> >> When closing a connection, the two acks required to change the
> >> closing socket's state to FIN_WAIT_2 and then TIME_WAIT could be
> >> processed in reverse order. This is possible in RSS-disabled
> >> environments such as a connection inside a host.
> >>
> >> For example, the expected state transitions and required packets for
> >> the disconnection will be similar to the flow below.
> >>
> >> 00 (Process A)                         (Process B)
> >> 01 ESTABLISHED                         ESTABLISHED
> >> 02 close()
> >> 03 FIN_WAIT_1
> >> 04         ---FIN-->
> >> 05                                     CLOSE_WAIT
> >> 06         <--ACK---
> >> 07 FIN_WAIT_2
> >> 08         <--FIN/ACK---
> >> 09 TIME_WAIT
> >> 10         ---ACK-->
> >> 11                                     LAST_ACK
> >> 12 CLOSED                              CLOSED
> >
> > AFAICT this sequence is not quite what would happen; it would be
> > different starting at line 8, and would unfold as follows:
> >
> > 08                                     close()
> > 09                                     LAST_ACK
> > 10         <--FIN/ACK---
> > 11 TIME_WAIT
> > 12         ---ACK-->
> > 13 CLOSED                              CLOSED
> >
> >> The acks in lines 6 and 8 are the acks. If the line 8 packet is
> >> processed before the line 6 packet, it will be just ignored as it is
> >> not an expected packet,
> >
> > AFAICT that is where the bug starts.
> >
> > AFAICT, from first principles, when process A receives the FIN/ACK it
> > should move to TIME_WAIT even if it has not received a preceding ACK.
> > That's because ACKs are cumulative, so receiving a later cumulative
> > ACK conveys all the information in the previous ACKs.
> >
> > Also, consider the de facto standard state transition diagram from
> > "TCP/IP Illustrated, Volume 2: The Implementation" by Wright and
> > Stevens, e.g.:
> >
> > https://courses.cs.washington.edu/courses/cse461/19sp/lectures/TCPIP_State_Transition_Diagram.pdf
> >
> > This first-principles analysis agrees with the Wright/Stevens
> > diagram, which says that a connection in FIN_WAIT_1 that receives a
> > FIN/ACK should move to TIME_WAIT.
> >
> > This seems like a faster and more robust solution than installing
> > special timers.
> >
> > Thoughts?
>
> This is orthogonal, I think.
>
> No matter how hard we fix the other side, we should improve the active
> side.
>
> Since we send a RST, sending the SYN a few ms after the RST seems way
> better than waiting 1 second as if we received no packet at all.
>
> Receiving this ACK tells us something about networking health; no need
> to be very cautious about the next attempt.

Yes, all good points. Thanks!

> Of course, if you have a fix for the passive side, that would be nice
> to review!

I looked into fixing this, but my quick reading of the Linux
tcp_rcv_state_process() code is that it should behave correctly and
that a connection in FIN_WAIT_1 that receives a FIN/ACK should move to
TIME_WAIT.

SeongJae, do you happen to have a tcpdump trace of the problematic
sequence where "process A" ends up in FIN_WAIT_2 when it should be in
TIME_WAIT?

If I have time I will try to construct a packetdrill case to verify the
behavior in this case.

thanks,
neal
On 1/31/20 2:11 PM, Neal Cardwell wrote:

> I looked into fixing this, but my quick reading of the Linux
> tcp_rcv_state_process() code is that it should behave correctly and
> that a connection in FIN_WAIT_1 that receives a FIN/ACK should move to
> TIME_WAIT.
>
> SeongJae, do you happen to have a tcpdump trace of the problematic
> sequence where "process A" ends up in FIN_WAIT_2 when it should be in
> TIME_WAIT?
>
> If I have time I will try to construct a packetdrill case to verify
> the behavior in this case.

Unfortunately you won't be able to reproduce the issue with packetdrill,
since it involves packets being processed at the same time (race
window).
On Fri, Jan 31, 2020 at 5:18 PM SeongJae Park <sj38.park@gmail.com> wrote:
>
> On Fri, 31 Jan 2020 17:11:35 -0500 Neal Cardwell <ncardwell@google.com> wrote:
>
> > On Fri, Jan 31, 2020 at 1:12 PM Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > >
> > > On 1/31/20 7:10 AM, Neal Cardwell wrote:
> > > > On Fri, Jan 31, 2020 at 7:25 AM <sjpark@amazon.com> wrote:
> > > >>
> > > >> From: SeongJae Park <sjpark@amazon.de>
> > > >>
> > > >> When closing a connection, the two acks required to change the
> > > >> closing socket's state to FIN_WAIT_2 and then TIME_WAIT could be
> > > >> processed in reverse order. This is possible in RSS-disabled
> > > >> environments such as a connection inside a host.
> [...]
>
> > I looked into fixing this, but my quick reading of the Linux
> > tcp_rcv_state_process() code is that it should behave correctly and
> > that a connection in FIN_WAIT_1 that receives a FIN/ACK should move
> > to TIME_WAIT.
> >
> > SeongJae, do you happen to have a tcpdump trace of the problematic
> > sequence where "process A" ends up in FIN_WAIT_2 when it should be
> > in TIME_WAIT?
>
> Hi Neal,
>
> Yes, I have. You can get it from the previous discussion of this
> patchset
> (https://lore.kernel.org/bpf/20200129171403.3926-1-sjpark@amazon.com/).
> As it also has a reproducer program and shows how I got the tcpdump
> trace, I believe you could get your own trace, too. If you have any
> questions or need help, feel free to let me know. :)

Great. Thank you for the pointer.

I had one quick question: in the message

https://lore.kernel.org/bpf/20200129171403.3926-1-sjpark@amazon.com/

it showed a trace with the client sending a RST/ACK, but this email
thread shows a FIN/ACK. I am curious about the motivation for the
difference?

Anyway, thanks for the report, and thanks to Eric for further
clarifying!

neal
On Sat, Feb 1, 2020 at 1:08 AM SeongJae Park <sj38.park@gmail.com> wrote:
>
> RST/ACK is traced if the LINGER socket option is applied in the
> reproducer program, and FIN/ACK is traced if it is not applied. The
> LINGER-applied version shows the spikes more frequently, but the main
> problem logic has no difference. I confirmed this by testing both
> versions.
>
> In the previous discussion, I showed the LINGER-applied trace. However,
> as many other documents use FIN/ACK, I changed the trace to the FIN/ACK
> version in this patchset for better understanding. I will note that it
> doesn't matter whether it is FIN/ACK or RST/ACK in the next spin.

Great. Thanks for the details!

neal
From: Eric Dumazet
> Sent: 31 January 2020 22:54
> On 1/31/20 2:11 PM, Neal Cardwell wrote:
> >
> > I looked into fixing this, but my quick reading of the Linux
> > tcp_rcv_state_process() code is that it should behave correctly and
> > that a connection in FIN_WAIT_1 that receives a FIN/ACK should move
> > to TIME_WAIT.
> >
> > SeongJae, do you happen to have a tcpdump trace of the problematic
> > sequence where "process A" ends up in FIN_WAIT_2 when it should be
> > in TIME_WAIT?
> >
> > If I have time I will try to construct a packetdrill case to verify
> > the behavior in this case.
>
> Unfortunately you won't be able to reproduce the issue with
> packetdrill, since it involves packets being processed at the same
> time (race window)

You might be able to force the timing race by adding a sleep in one of
the code paths.

No good for a regression test, but OK for code testing.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
On Mon, Feb 3, 2020 at 7:40 AM David Laight <David.Laight@aculab.com> wrote:
>
> From: Eric Dumazet
> > Sent: 31 January 2020 22:54
> > On 1/31/20 2:11 PM, Neal Cardwell wrote:
> > >
> > > I looked into fixing this, but my quick reading of the Linux
> > > tcp_rcv_state_process() code is that it should behave correctly
> > > and that a connection in FIN_WAIT_1 that receives a FIN/ACK should
> > > move to TIME_WAIT.
> > >
> > > SeongJae, do you happen to have a tcpdump trace of the problematic
> > > sequence where "process A" ends up in FIN_WAIT_2 when it should be
> > > in TIME_WAIT?
> > >
> > > If I have time I will try to construct a packetdrill case to
> > > verify the behavior in this case.
> >
> > Unfortunately you won't be able to reproduce the issue with
> > packetdrill, since it involves packets being processed at the same
> > time (race window)
>
> You might be able to force the timing race by adding a sleep in one of
> the code paths.
>
> No good for a regression test, but OK for code testing.

Please take a look at packetdrill; there is no possibility for it to
send more than one packet at a time.

Even if we modified packetdrill to allow feeding packets to its tun
device from multiple threads, the race window is tiny and you would
have to run packetdrill thousands of times to eventually trigger the
race once.

The test SeongJae provided, which uses two threads and the regular TCP
stack over the loopback interface, triggers the race more reliably.
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 2a976f57f7e7..b168e29e1ad1 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -5893,8 +5893,12 @@ static int tcp_rcv_synsent_state_process(struct sock *sk, struct sk_buff *skb,
 	 * the segment and return)"
 	 */
 	if (!after(TCP_SKB_CB(skb)->ack_seq, tp->snd_una) ||
-	    after(TCP_SKB_CB(skb)->ack_seq, tp->snd_nxt))
+	    after(TCP_SKB_CB(skb)->ack_seq, tp->snd_nxt)) {
+		/* Previous FIN/ACK or RST/ACK might be ignored. */
+		inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS,
+					  TCP_ATO_MIN, TCP_RTO_MAX);
 		goto reset_and_undo;
+	}
 
 	if (tp->rx_opt.saw_tstamp && tp->rx_opt.rcv_tsecr &&
 	    !between(tp->rx_opt.rcv_tsecr, tp->retrans_stamp,