
[net] tcp: Fix error ts_recent time during three-way handshake

Message ID 20250218105824.34511-1-wanghai38@huawei.com (mailing list archive)
State Changes Requested
Delegated to: Netdev Maintainers
Headers show
Series [net] tcp: Fix error ts_recent time during three-way handshake

Checks

Context Check Description
netdev/series_format success Single patches do not need cover letters
netdev/tree_selection success Clearly marked for net
netdev/ynl success Generated files up to date; no warnings/errors; no diff in generated;
netdev/fixes_present success Fixes tag present in non-next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 0 this patch: 0
netdev/build_tools success No tools touched, skip
netdev/cc_maintainers success CCed 8 of 8 maintainers
netdev/build_clang success Errors and warnings before: 1 this patch: 1
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success Fixes tag looks correct
netdev/build_allmodconfig_warn success Errors and warnings before: 0 this patch: 0
netdev/checkpatch success total: 0 errors, 0 warnings, 0 checks, 10 lines checked
netdev/build_clang_rust success No Rust files in patch. Skipping build
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0
netdev/contest success net-next-2025-02-18--12-00 (tests: 891)

Commit Message

Wang Hai Feb. 18, 2025, 10:58 a.m. UTC
If two ACK packets from the same connection enter tcp_check_req() on
different CPUs at almost the same time, req->ts_recent may be updated
with the more recent timestamp while the skb carrying the older
timestamp creates the new sock, which causes the tcp_validate_incoming()
check to fail.

cpu1                                cpu2
tcp_check_req
                                    tcp_check_req
req->ts_recent = tmp_opt.rcv_tsval = t1
                                    req->ts_recent = tmp_opt.rcv_tsval = t2

newsk->ts_recent = req->ts_recent = t2 // t1 < t2
tcp_child_process
tcp_rcv_state_process
tcp_validate_incoming
tcp_paws_check
if ((s32)(rx_opt->ts_recent - rx_opt->rcv_tsval) <= paws_win) // failed
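
For context, the "newsk->ts_recent = req->ts_recent" step above happens
when the child socket is created. A paraphrased sketch (not verbatim
kernel code) of that inheritance in tcp_create_openreq_child():

    /* Paraphrased: tcp_create_openreq_child(), net/ipv4/tcp_minisocks.c.
     * The child copies whatever req->ts_recent holds at creation time,
     * so the concurrent update to t2 from the other CPU ends up in newsk
     * even though this CPU is processing the skb that carries t1.
     */
    if (newtp->rx_opt.tstamp_ok)
        newtp->rx_opt.ts_recent = READ_ONCE(req->ts_recent);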

In tcp_check_req(), restore ts_recent to the timestamp carried by this
skb to fix this bug.

Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Wang Hai <wanghai38@huawei.com>
---
 net/ipv4/tcp_minisocks.c | 4 ++++
 1 file changed, 4 insertions(+)

Comments

Jason Xing Feb. 18, 2025, 12:02 p.m. UTC | #1
Hi Wang,

On Tue, Feb 18, 2025 at 7:02 PM Wang Hai <wanghai38@huawei.com> wrote:
>
> If two ack packets from a connection enter tcp_check_req at the same time
> through different cpu, it may happen that req->ts_recent is updated with
> with a more recent time and the skb with an older time creates a new sock,
> which will cause the tcp_validate_incoming check to fail.
>
> cpu1                                cpu2
> tcp_check_req
>                                     tcp_check_req
> req->ts_recent = tmp_opt.rcv_tsval = t1
>                                     req->ts_recent = tmp_opt.rcv_tsval = t2
>
> newsk->ts_recent = req->ts_recent = t2 // t1 < t2
> tcp_child_process
> tcp_rcv_state_process
> tcp_validate_incoming
> tcp_paws_check
> if ((s32)(rx_opt->ts_recent - rx_opt->rcv_tsval) <= paws_win) // failed
>
> In tcp_check_req, restore ts_recent to this skb's to fix this bug.

It's known that two packets of the same connection can be received on
two different cpus nearly at the same time, as in this case. I'm curious
whether the socket was running on loopback?

Even if the check you annotated in tcp_paws_check() fails, what about
the rest of the checks in tcp_validate_incoming()? In your test, I doubt
whether that check really failed and finally caused the skb from CPU2 to
be discarded. Let me put it this way: is the paws_win check the root
cause? What is the drop reason in tcp_validate_incoming()?

Thanks,
Jason

>
> Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
> Signed-off-by: Wang Hai <wanghai38@huawei.com>
> ---
>  net/ipv4/tcp_minisocks.c | 4 ++++
>  1 file changed, 4 insertions(+)
>
> diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
> index b089b08e9617..0208455f9eb8 100644
> --- a/net/ipv4/tcp_minisocks.c
> +++ b/net/ipv4/tcp_minisocks.c
> @@ -878,6 +878,10 @@ struct sock *tcp_check_req(struct sock *sk, struct sk_buff *skb,
>         sock_rps_save_rxhash(child, skb);
>         tcp_synack_rtt_meas(child, req);
>         *req_stolen = !own_req;
> +       if (own_req && tcp_sk(child)->rx_opt.tstamp_ok &&
> +           unlikely(tcp_sk(child)->rx_opt.ts_recent != tmp_opt.rcv_tsval))
> +               tcp_sk(child)->rx_opt.ts_recent = tmp_opt.rcv_tsval;
> +
>         return inet_csk_complete_hashdance(sk, child, req, own_req);
>
>  listen_overflow:
> --
> 2.17.1
>
>
Eric Dumazet Feb. 18, 2025, 1:35 p.m. UTC | #2
On Tue, Feb 18, 2025 at 12:00 PM Wang Hai <wanghai38@huawei.com> wrote:
>
> If two ack packets from a connection enter tcp_check_req at the same time
> through different cpu, it may happen that req->ts_recent is updated with
> with a more recent time and the skb with an older time creates a new sock,
> which will cause the tcp_validate_incoming check to fail.
>
> cpu1                                cpu2
> tcp_check_req
>                                     tcp_check_req
> req->ts_recent = tmp_opt.rcv_tsval = t1
>                                     req->ts_recent = tmp_opt.rcv_tsval = t2
>
> newsk->ts_recent = req->ts_recent = t2 // t1 < t2
> tcp_child_process
> tcp_rcv_state_process
> tcp_validate_incoming
> tcp_paws_check
> if ((s32)(rx_opt->ts_recent - rx_opt->rcv_tsval) <= paws_win) // failed
>
> In tcp_check_req, restore ts_recent to this skb's to fix this bug.
>
> Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
> Signed-off-by: Wang Hai <wanghai38@huawei.com>
> ---
>  net/ipv4/tcp_minisocks.c | 4 ++++
>  1 file changed, 4 insertions(+)
>
> diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
> index b089b08e9617..0208455f9eb8 100644
> --- a/net/ipv4/tcp_minisocks.c
> +++ b/net/ipv4/tcp_minisocks.c
> @@ -878,6 +878,10 @@ struct sock *tcp_check_req(struct sock *sk, struct sk_buff *skb,
>         sock_rps_save_rxhash(child, skb);
>         tcp_synack_rtt_meas(child, req);
>         *req_stolen = !own_req;
> +       if (own_req && tcp_sk(child)->rx_opt.tstamp_ok &&
> +           unlikely(tcp_sk(child)->rx_opt.ts_recent != tmp_opt.rcv_tsval))
> +               tcp_sk(child)->rx_opt.ts_recent = tmp_opt.rcv_tsval;
> +
>         return inet_csk_complete_hashdance(sk, child, req, own_req);

Have you seen the comment at line 818 ?

/* TODO: We probably should defer ts_recent change once
 * we take ownership of @req.
 */

Plan was clear and explained. Why implement something else (and buggy) ?
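
For reference, the ts_recent update that the TODO comment refers to, as
quoted in the removal hunk of the follow-up proposal later in this
thread, is:

    /* TODO: We probably should defer ts_recent change once
     * we take ownership of @req.
     */
    if (tmp_opt.saw_tstamp && !after(TCP_SKB_CB(skb)->seq, tcp_rsk(req)->rcv_nxt))
        WRITE_ONCE(req->ts_recent, tmp_opt.rcv_tsval);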
Wang Hai Feb. 19, 2025, 2:09 a.m. UTC | #3
On 2025/2/18 20:02, Jason Xing wrote:
> Hi Wang,
> 
> On Tue, Feb 18, 2025 at 7:02 PM Wang Hai <wanghai38@huawei.com> wrote:
>>
>> If two ack packets from a connection enter tcp_check_req at the same time
>> through different cpu, it may happen that req->ts_recent is updated with
>> with a more recent time and the skb with an older time creates a new sock,
>> which will cause the tcp_validate_incoming check to fail.
>>
>> cpu1                                cpu2
>> tcp_check_req
>>                                      tcp_check_req
>> req->ts_recent = tmp_opt.rcv_tsval = t1
>>                                      req->ts_recent = tmp_opt.rcv_tsval = t2
>>
>> newsk->ts_recent = req->ts_recent = t2 // t1 < t2
>> tcp_child_process
>> tcp_rcv_state_process
>> tcp_validate_incoming
>> tcp_paws_check
>> if ((s32)(rx_opt->ts_recent - rx_opt->rcv_tsval) <= paws_win) // failed
>>
>> In tcp_check_req, restore ts_recent to this skb's to fix this bug.
> 
> It's known that it's possible to receive two packets in two different
> cpus like this case nearly at the same time. I'm curious if the socket
> was running on the loopback?
> 
Hi Jason,

Yeah, it's running in loopback.
> Even if the above check that you commented in tcp_paws_check() fails,
> how about the rest of the checks in tcp_validate_incoming()? 

The skb from cpu1 is a valid skb, so it should have passed the
tcp_validate_incoming() check, but this concurrency issue prevents it
from doing so.

5951 static bool tcp_validate_incoming(struct sock *sk, struct sk_buff *skb,
5952                                   const struct tcphdr *th, int syn_inerr)
5953 {
5954         struct tcp_sock *tp = tcp_sk(sk);
5955         SKB_DR(reason);
5956
5957         /* RFC1323: H1. Apply PAWS check first. */
5958         if (!tcp_fast_parse_options(sock_net(sk), skb, th, tp) ||
5959             !tp->rx_opt.saw_tstamp ||
5960             tcp_paws_check(&tp->rx_opt, TCP_PAWS_WINDOW)) // failed
5961                 goto step1;
5962
5963         reason = tcp_disordered_ack_check(sk, skb);
5964         if (!reason) // failed
5965                 goto step1;
5966         /* Reset is accepted even if it did not pass PAWS. */
5967         if (th->rst) // failed
5968                 goto step1;
5969         if (unlikely(th->syn)) // failed
5970                 goto syn_challenge;
5971
5972         /* Old ACK are common, increment PAWS_OLD_ACK
5973          * and do not send a dupack.
5974          */
5975         if (reason == SKB_DROP_REASON_TCP_RFC7323_PAWS_ACK) {
5976                 NET_INC_STATS(sock_net(sk), LINUX_MIB_PAWS_OLD_ACK);
5977                 goto discard;
5978         }
5979         NET_INC_STATS(sock_net(sk), LINUX_MIB_PAWSESTABREJECTED);
5980         if (!tcp_oow_rate_limited(sock_net(sk), skb,
5981                                   LINUX_MIB_TCPACKSKIPPEDPAWS,
5982                                   &tp->last_oow_ack_time))
5983                 tcp_send_dupack(sk, skb);
5984         goto discard; // Drop the skb from here.
> test, I doubt if really that check failed which finally caused the skb
> from CPU2 to be discarded. Let me put it this way, is the paws_win
> check the root cause? What is the drop reason for
> tcp_validate_incoming()?
> 
The values of ts_recent and rcv_tsval are compared in tcp_paws_check(),
where ts_recent comes from req->ts_recent = t2 and rcv_tsval is the
timestamp of the skb handled on cpu1. ts_recent - rcv_tsval may be
greater than 1, so the check fails here. tcp_paws_check() failing causes
tcp_validate_incoming() to fail too.
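
A minimal standalone illustration of that comparison (the timestamp
values below are hypothetical, only meant to show the s32 arithmetic
against TCP_PAWS_WINDOW, which is 1):

    #include <stdio.h>
    #include <stdint.h>

    #define TCP_PAWS_WINDOW 1

    int main(void)
    {
        uint32_t t1 = 1000; /* rcv_tsval carried by the skb handled on cpu1 */
        uint32_t t2 = 1005; /* ts_recent inherited from req, updated by cpu2 */

        /* Mirrors the first test in tcp_paws_check():
         * PAWS passes when (s32)(ts_recent - rcv_tsval) <= TCP_PAWS_WINDOW.
         */
        int paws_ok = (int32_t)(t2 - t1) <= TCP_PAWS_WINDOW;

        printf("paws_ok = %d\n", paws_ok); /* prints 0: the valid ACK is rejected */
        return 0;
    }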

> Thanks,
> Jason
> 
>>
>> Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
>> Signed-off-by: Wang Hai <wanghai38@huawei.com>
>> ---
>>   net/ipv4/tcp_minisocks.c | 4 ++++
>>   1 file changed, 4 insertions(+)
>>
>> diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
>> index b089b08e9617..0208455f9eb8 100644
>> --- a/net/ipv4/tcp_minisocks.c
>> +++ b/net/ipv4/tcp_minisocks.c
>> @@ -878,6 +878,10 @@ struct sock *tcp_check_req(struct sock *sk, struct sk_buff *skb,
>>          sock_rps_save_rxhash(child, skb);
>>          tcp_synack_rtt_meas(child, req);
>>          *req_stolen = !own_req;
>> +       if (own_req && tcp_sk(child)->rx_opt.tstamp_ok &&
>> +           unlikely(tcp_sk(child)->rx_opt.ts_recent != tmp_opt.rcv_tsval))
>> +               tcp_sk(child)->rx_opt.ts_recent = tmp_opt.rcv_tsval;
>> +
>>          return inet_csk_complete_hashdance(sk, child, req, own_req);
>>
>>   listen_overflow:
>> --
>> 2.17.1
>>
>>
Wang Hai Feb. 19, 2025, 2:16 a.m. UTC | #4
On 2025/2/18 21:35, Eric Dumazet wrote:
> On Tue, Feb 18, 2025 at 12:00 PM Wang Hai <wanghai38@huawei.com> wrote:
>>
>> If two ack packets from a connection enter tcp_check_req at the same time
>> through different cpu, it may happen that req->ts_recent is updated with
>> with a more recent time and the skb with an older time creates a new sock,
>> which will cause the tcp_validate_incoming check to fail.
>>
>> cpu1                                cpu2
>> tcp_check_req
>>                                      tcp_check_req
>> req->ts_recent = tmp_opt.rcv_tsval = t1
>>                                      req->ts_recent = tmp_opt.rcv_tsval = t2
>>
>> newsk->ts_recent = req->ts_recent = t2 // t1 < t2
>> tcp_child_process
>> tcp_rcv_state_process
>> tcp_validate_incoming
>> tcp_paws_check
>> if ((s32)(rx_opt->ts_recent - rx_opt->rcv_tsval) <= paws_win) // failed
>>
>> In tcp_check_req, restore ts_recent to this skb's to fix this bug.
>>
>> Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
>> Signed-off-by: Wang Hai <wanghai38@huawei.com>
>> ---
>>   net/ipv4/tcp_minisocks.c | 4 ++++
>>   1 file changed, 4 insertions(+)
>>
>> diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
>> index b089b08e9617..0208455f9eb8 100644
>> --- a/net/ipv4/tcp_minisocks.c
>> +++ b/net/ipv4/tcp_minisocks.c
>> @@ -878,6 +878,10 @@ struct sock *tcp_check_req(struct sock *sk, struct sk_buff *skb,
>>          sock_rps_save_rxhash(child, skb);
>>          tcp_synack_rtt_meas(child, req);
>>          *req_stolen = !own_req;
>> +       if (own_req && tcp_sk(child)->rx_opt.tstamp_ok &&
>> +           unlikely(tcp_sk(child)->rx_opt.ts_recent != tmp_opt.rcv_tsval))
>> +               tcp_sk(child)->rx_opt.ts_recent = tmp_opt.rcv_tsval;
>> +
>>          return inet_csk_complete_hashdance(sk, child, req, own_req);
> 
> Have you seen the comment at line 818 ?
> 
> /* TODO: We probably should defer ts_recent change once
>   * we take ownership of @req.
>   */
> 
> Plan was clear and explained. Why implement something else (and buggy) ?
> 
Hi Eric,

Currently we have a real problem, so we want to solve it. This bug
prevents the upper layer from being notified to call accept() after a
successful three-way handshake.

The skb from cpu1, which fails tcp_paws_check() (although it could have
passed), cannot move the new socket to the TCP_ESTABLISHED state, so
parent->sk_data_ready(parent) is not triggered. The skb from cpu2 can
complete the three-way handshake, but on that path there is also no way
to call parent->sk_data_ready(parent) to notify the upper layer. As a
result, the upper layer cannot notice the new connection and call
accept() to obtain the nsk.

cpu1                                cpu2
tcp_check_req
                                     tcp_check_req
req->ts_recent = tmp_opt.rcv_tsval = t1
                                     req->ts_recent=tmp_opt.rcv_tsval= t2

newsk->ts_recent = req->ts_recent = t2 // t1 < t2
tcp_child_process
  tcp_rcv_state_process
   tcp_validate_incoming
    tcp_paws_check // failed
  parent->sk_data_ready(parent); // will not be called
                                     tcp_v4_do_rcv
                                     tcp_rcv_state_process // Complete the three-way handshake
                                                           // missing parent->sk_data_ready(parent);
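
The reason the missing wakeup is tied to the state transition can be
seen in tcp_child_process(); a paraphrased sketch (not verbatim) of the
relevant part:

    /* Paraphrased: tcp_child_process(), net/ipv4/tcp_minisocks.c */
    int state = child->sk_state;    /* TCP_SYN_RECV at this point */

    if (!sock_owned_by_user(child)) {
        ret = tcp_rcv_state_process(child, skb);
        /* Wakeup parent, send SIGIO - only if the child state moved on,
         * i.e. to TCP_ESTABLISHED. When tcp_validate_incoming() drops
         * the skb because of the PAWS failure above, sk_state stays
         * TCP_SYN_RECV and the listener is never notified.
         */
        if (state == TCP_SYN_RECV && child->sk_state != state)
            parent->sk_data_ready(parent);
    }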
Wang Hai Feb. 19, 2025, 3:08 a.m. UTC | #5
On 2025/2/18 21:35, Eric Dumazet wrote:
> On Tue, Feb 18, 2025 at 12:00 PM Wang Hai <wanghai38@huawei.com> wrote:
>>
>> If two ack packets from a connection enter tcp_check_req at the same time
>> through different cpu, it may happen that req->ts_recent is updated with
>> with a more recent time and the skb with an older time creates a new sock,
>> which will cause the tcp_validate_incoming check to fail.
>>
>> cpu1                                cpu2
>> tcp_check_req
>>                                      tcp_check_req
>> req->ts_recent = tmp_opt.rcv_tsval = t1
>>                                      req->ts_recent = tmp_opt.rcv_tsval = t2
>>
>> newsk->ts_recent = req->ts_recent = t2 // t1 < t2
>> tcp_child_process
>> tcp_rcv_state_process
>> tcp_validate_incoming
>> tcp_paws_check
>> if ((s32)(rx_opt->ts_recent - rx_opt->rcv_tsval) <= paws_win) // failed
>>
>> In tcp_check_req, restore ts_recent to this skb's to fix this bug.
>>
>> Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
>> Signed-off-by: Wang Hai <wanghai38@huawei.com>
>> ---
>>   net/ipv4/tcp_minisocks.c | 4 ++++
>>   1 file changed, 4 insertions(+)
>>
>> diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
>> index b089b08e9617..0208455f9eb8 100644
>> --- a/net/ipv4/tcp_minisocks.c
>> +++ b/net/ipv4/tcp_minisocks.c
>> @@ -878,6 +878,10 @@ struct sock *tcp_check_req(struct sock *sk, struct sk_buff *skb,
>>          sock_rps_save_rxhash(child, skb);
>>          tcp_synack_rtt_meas(child, req);
>>          *req_stolen = !own_req;
>> +       if (own_req && tcp_sk(child)->rx_opt.tstamp_ok &&
>> +           unlikely(tcp_sk(child)->rx_opt.ts_recent != tmp_opt.rcv_tsval))
>> +               tcp_sk(child)->rx_opt.ts_recent = tmp_opt.rcv_tsval;
>> +
>>          return inet_csk_complete_hashdance(sk, child, req, own_req);
> 
> Have you seen the comment at line 818 ?
> 
> /* TODO: We probably should defer ts_recent change once
>   * we take ownership of @req.
>   */
> 
> Plan was clear and explained. Why implement something else (and buggy) ?
Hi Eric,

According to the plan, can we fix it like this?

diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
index b089b08e9617..1210d4967b94 100644
--- a/net/ipv4/tcp_minisocks.c
+++ b/net/ipv4/tcp_minisocks.c
@@ -814,13 +814,6 @@ struct sock *tcp_check_req(struct sock *sk, struct sk_buff *skb,
         }

         /* In sequence, PAWS is OK. */
-
-       /* TODO: We probably should defer ts_recent change once
-        * we take ownership of @req.
-        */
-       if (tmp_opt.saw_tstamp && !after(TCP_SKB_CB(skb)->seq, tcp_rsk(req)->rcv_nxt))
-               WRITE_ONCE(req->ts_recent, tmp_opt.rcv_tsval);
-
         if (TCP_SKB_CB(skb)->seq == tcp_rsk(req)->rcv_isn) {
                 /* Truncate SYN, it is out of window starting
                    at tcp_rsk(req)->rcv_isn + 1. */
@@ -878,6 +871,9 @@ struct sock *tcp_check_req(struct sock *sk, struct sk_buff *skb,
         sock_rps_save_rxhash(child, skb);
         tcp_synack_rtt_meas(child, req);
         *req_stolen = !own_req;
+       if (own_req && tmp_opt.saw_tstamp && !after(TCP_SKB_CB(skb)->seq, tcp_rsk(req)->rcv_nxt))
+               tcp_sk(child)->rx_opt.ts_recent = tmp_opt.rcv_tsval;
+
         return inet_csk_complete_hashdance(sk, child, req, own_req);

  listen_overflow:
>
Jason Xing Feb. 19, 2025, 3:31 a.m. UTC | #6
On Wed, Feb 19, 2025 at 10:16 AM Wang Hai <wanghai38@huawei.com> wrote:
>
>
>
> On 2025/2/18 21:35, Eric Dumazet wrote:
> > On Tue, Feb 18, 2025 at 12:00 PM Wang Hai <wanghai38@huawei.com> wrote:
> >>
> >> If two ack packets from a connection enter tcp_check_req at the same time
> >> through different cpu, it may happen that req->ts_recent is updated with
> >> with a more recent time and the skb with an older time creates a new sock,
> >> which will cause the tcp_validate_incoming check to fail.
> >>
> >> cpu1                                cpu2
> >> tcp_check_req
> >>                                      tcp_check_req
> >> req->ts_recent = tmp_opt.rcv_tsval = t1
> >>                                      req->ts_recent = tmp_opt.rcv_tsval = t2
> >>
> >> newsk->ts_recent = req->ts_recent = t2 // t1 < t2
> >> tcp_child_process
> >> tcp_rcv_state_process
> >> tcp_validate_incoming
> >> tcp_paws_check
> >> if ((s32)(rx_opt->ts_recent - rx_opt->rcv_tsval) <= paws_win) // failed
> >>
> >> In tcp_check_req, restore ts_recent to this skb's to fix this bug.
> >>
> >> Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
> >> Signed-off-by: Wang Hai <wanghai38@huawei.com>
> >> ---
> >>   net/ipv4/tcp_minisocks.c | 4 ++++
> >>   1 file changed, 4 insertions(+)
> >>
> >> diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
> >> index b089b08e9617..0208455f9eb8 100644
> >> --- a/net/ipv4/tcp_minisocks.c
> >> +++ b/net/ipv4/tcp_minisocks.c
> >> @@ -878,6 +878,10 @@ struct sock *tcp_check_req(struct sock *sk, struct sk_buff *skb,
> >>          sock_rps_save_rxhash(child, skb);
> >>          tcp_synack_rtt_meas(child, req);
> >>          *req_stolen = !own_req;
> >> +       if (own_req && tcp_sk(child)->rx_opt.tstamp_ok &&
> >> +           unlikely(tcp_sk(child)->rx_opt.ts_recent != tmp_opt.rcv_tsval))
> >> +               tcp_sk(child)->rx_opt.ts_recent = tmp_opt.rcv_tsval;
> >> +
> >>          return inet_csk_complete_hashdance(sk, child, req, own_req);
> >
> > Have you seen the comment at line 818 ?
> >
> > /* TODO: We probably should defer ts_recent change once
> >   * we take ownership of @req.
> >   */
> >
> > Plan was clear and explained. Why implement something else (and buggy) ?
> >
> Hi Eric,
>
> Currently we have a real problem, so we want to solve it. This bug
> causes the upper layers to be unable to be notified to call accept after
> the successful three-way handshake.
>
> Skb from cpu1 that fails at tcp_paws_check (which it could have
> succeeded) will not be able to enter the TCP_ESTABLISHED state, and
> therefore parent->sk_data_ready(parent) will not be triggered, and skb
> from cpu2 can complete the three-way handshake, but there is also no way
> to call parent->sk_data_ready(parent) to notify the upper layer, which
> will result
> in the upper layer not being able to sense and call accept to obtain the
> nsk.
>
> cpu1                                cpu2
> tcp_check_req
>                                      tcp_check_req
> req->ts_recent = tmp_opt.rcv_tsval = t1
>                                      req->ts_recent=tmp_opt.rcv_tsval= t2
>
> newsk->ts_recent = req->ts_recent = t2 // t1 < t2
> tcp_child_process
>   tcp_rcv_state_process
>    tcp_validate_incoming
>     tcp_paws_check // failed
>   parent->sk_data_ready(parent); // will not be called
>                                      tcp_v4_do_rcv
>                                      tcp_rcv_state_process // Complete the three-way handshake
>                                                                                                         // missing parent->sk_data_ready(parent);

IIUC, the ack received on cpu1 triggered the call to
inet_csk_complete_hashdance(), so its state transitioned from
TCP_NEW_SYN_RECV to TCP_SYN_RECV, right? If so, is the reason
sk_data_ready() is not called when the skb enters tcp_child_process()
that the state failed to transition to TCP_ESTABLISHED?

Here is another question. How did the skb on the right side enter
tcp_v4_do_rcv() after entering tcp_check_req(), given that the state of
the sk it belongs to is TCP_NEW_SYN_RECV? Could you elaborate more on
this point?

Thanks,
Jason
Wang Hai Feb. 19, 2025, 1:11 p.m. UTC | #7
On 2025/2/19 11:31, Jason Xing wrote:
> On Wed, Feb 19, 2025 at 10:16 AM Wang Hai <wanghai38@huawei.com> wrote:
>>
>>
>>
>> On 2025/2/18 21:35, Eric Dumazet wrote:
>>> On Tue, Feb 18, 2025 at 12:00 PM Wang Hai <wanghai38@huawei.com> wrote:
>>>>
>>>> If two ack packets from a connection enter tcp_check_req at the same time
>>>> through different cpu, it may happen that req->ts_recent is updated with
>>>> with a more recent time and the skb with an older time creates a new sock,
>>>> which will cause the tcp_validate_incoming check to fail.
>>>>
>>>> cpu1                                cpu2
>>>> tcp_check_req
>>>>                                       tcp_check_req
>>>> req->ts_recent = tmp_opt.rcv_tsval = t1
>>>>                                       req->ts_recent = tmp_opt.rcv_tsval = t2
>>>>
>>>> newsk->ts_recent = req->ts_recent = t2 // t1 < t2
>>>> tcp_child_process
>>>> tcp_rcv_state_process
>>>> tcp_validate_incoming
>>>> tcp_paws_check
>>>> if ((s32)(rx_opt->ts_recent - rx_opt->rcv_tsval) <= paws_win) // failed
>>>>
>>>> In tcp_check_req, restore ts_recent to this skb's to fix this bug.
>>>>
>>>> Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
>>>> Signed-off-by: Wang Hai <wanghai38@huawei.com>
>>>> ---
>>>>    net/ipv4/tcp_minisocks.c | 4 ++++
>>>>    1 file changed, 4 insertions(+)
>>>>
>>>> diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
>>>> index b089b08e9617..0208455f9eb8 100644
>>>> --- a/net/ipv4/tcp_minisocks.c
>>>> +++ b/net/ipv4/tcp_minisocks.c
>>>> @@ -878,6 +878,10 @@ struct sock *tcp_check_req(struct sock *sk, struct sk_buff *skb,
>>>>           sock_rps_save_rxhash(child, skb);
>>>>           tcp_synack_rtt_meas(child, req);
>>>>           *req_stolen = !own_req;
>>>> +       if (own_req && tcp_sk(child)->rx_opt.tstamp_ok &&
>>>> +           unlikely(tcp_sk(child)->rx_opt.ts_recent != tmp_opt.rcv_tsval))
>>>> +               tcp_sk(child)->rx_opt.ts_recent = tmp_opt.rcv_tsval;
>>>> +
>>>>           return inet_csk_complete_hashdance(sk, child, req, own_req);
>>>
>>> Have you seen the comment at line 818 ?
>>>
>>> /* TODO: We probably should defer ts_recent change once
>>>    * we take ownership of @req.
>>>    */
>>>
>>> Plan was clear and explained. Why implement something else (and buggy) ?
>>>
>> Hi Eric,
>>
>> Currently we have a real problem, so we want to solve it. This bug
>> causes the upper layers to be unable to be notified to call accept after
>> the successful three-way handshake.
>>
>> Skb from cpu1 that fails at tcp_paws_check (which it could have
>> succeeded) will not be able to enter the TCP_ESTABLISHED state, and
>> therefore parent->sk_data_ready(parent) will not be triggered, and skb
>> from cpu2 can complete the three-way handshake, but there is also no way
>> to call parent->sk_data_ready(parent) to notify the upper layer, which
>> will result
>> in the upper layer not being able to sense and call accept to obtain the
>> nsk.
>>
>> cpu1                                cpu2
>> tcp_check_req
>>                                       tcp_check_req
>> req->ts_recent = tmp_opt.rcv_tsval = t1
>>                                       req->ts_recent=tmp_opt.rcv_tsval= t2
>>
>> newsk->ts_recent = req->ts_recent = t2 // t1 < t2
>> tcp_child_process
>>    tcp_rcv_state_process
>>     tcp_validate_incoming
>>      tcp_paws_check // failed
>>    parent->sk_data_ready(parent); // will not be called
>>                                       tcp_v4_do_rcv
>>                                       tcp_rcv_state_process // Complete the three-way handshake
>>                                                                                                          // missing parent->sk_data_ready(parent);
> 
> IIUC, the ack received from cpu1 triggered calling
> inet_csk_complete_hashdance() so its state transited from
> TCP_NEW_SYN_RECV to TCP_SYN_RECV, right? If so, the reason why not
> call sk_data_ready() if the skb entered into tcp_child_process() is
> that its state failed to transit to TCP_ESTABLISHED?
> 
Yes, because it didn't switch to TCP_ESTABLISHED
> Here is another question. How did the skb on the right side enter into
> tcp_v4_do_rcv() after entering tcp_check_req() if the state of sk
> which the skb belongs to is TCP_NEW_SYN_RECV? Could you elaborate more
> on this point?
Since cpu1 successfully created the child sock, tcp_check_req() on cpu2
returns NULL with req_stolen set to true, so cpu2 subsequently does
'goto lookup' to re-process the packet. At this point sk->sk_state is
already TCP_SYN_RECV, and tcp_v4_do_rcv() is then called.
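
The req_stolen path in tcp_v4_rcv() that leads to the second lookup
looks roughly like this (paraphrased sketch, details elided):

    /* Paraphrased: tcp_v4_rcv(), net/ipv4/tcp_ipv4.c */
    if (sk->sk_state == TCP_NEW_SYN_RECV) {
        struct request_sock *req = inet_reqsk(sk);
        bool req_stolen = false;
        struct sock *nsk;
        ...
        nsk = tcp_check_req(sk, skb, req, false, &req_stolen);
        if (!nsk) {
            reqsk_put(req);
            if (req_stolen) {
                /* Another cpu owned the req and created the full socket;
                 * re-lookup so the skb is fed to that socket (now in
                 * TCP_SYN_RECV) via tcp_v4_do_rcv() instead of being
                 * dropped.
                 */
                tcp_v4_restore_cb(skb);
                sock_put(sk);
                goto lookup;
            }
            goto discard_and_relse;
        }
    }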
> 
> Thanks,
> Jason
Jason Xing Feb. 20, 2025, 3:04 a.m. UTC | #8
On Wed, Feb 19, 2025 at 9:11 PM Wang Hai <wanghai38@huawei.com> wrote:
>
>
>
> On 2025/2/19 11:31, Jason Xing wrote:
> > On Wed, Feb 19, 2025 at 10:16 AM Wang Hai <wanghai38@huawei.com> wrote:
> >>
> >>
> >>
> >> On 2025/2/18 21:35, Eric Dumazet wrote:
> >>> On Tue, Feb 18, 2025 at 12:00 PM Wang Hai <wanghai38@huawei.com> wrote:
> >>>>
> >>>> If two ack packets from a connection enter tcp_check_req at the same time
> >>>> through different cpu, it may happen that req->ts_recent is updated with
> >>>> with a more recent time and the skb with an older time creates a new sock,
> >>>> which will cause the tcp_validate_incoming check to fail.
> >>>>
> >>>> cpu1                                cpu2
> >>>> tcp_check_req
> >>>>                                       tcp_check_req
> >>>> req->ts_recent = tmp_opt.rcv_tsval = t1
> >>>>                                       req->ts_recent = tmp_opt.rcv_tsval = t2
> >>>>
> >>>> newsk->ts_recent = req->ts_recent = t2 // t1 < t2
> >>>> tcp_child_process
> >>>> tcp_rcv_state_process
> >>>> tcp_validate_incoming
> >>>> tcp_paws_check
> >>>> if ((s32)(rx_opt->ts_recent - rx_opt->rcv_tsval) <= paws_win) // failed
> >>>>
> >>>> In tcp_check_req, restore ts_recent to this skb's to fix this bug.
> >>>>
> >>>> Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
> >>>> Signed-off-by: Wang Hai <wanghai38@huawei.com>
> >>>> ---
> >>>>    net/ipv4/tcp_minisocks.c | 4 ++++
> >>>>    1 file changed, 4 insertions(+)
> >>>>
> >>>> diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
> >>>> index b089b08e9617..0208455f9eb8 100644
> >>>> --- a/net/ipv4/tcp_minisocks.c
> >>>> +++ b/net/ipv4/tcp_minisocks.c
> >>>> @@ -878,6 +878,10 @@ struct sock *tcp_check_req(struct sock *sk, struct sk_buff *skb,
> >>>>           sock_rps_save_rxhash(child, skb);
> >>>>           tcp_synack_rtt_meas(child, req);
> >>>>           *req_stolen = !own_req;
> >>>> +       if (own_req && tcp_sk(child)->rx_opt.tstamp_ok &&
> >>>> +           unlikely(tcp_sk(child)->rx_opt.ts_recent != tmp_opt.rcv_tsval))
> >>>> +               tcp_sk(child)->rx_opt.ts_recent = tmp_opt.rcv_tsval;
> >>>> +
> >>>>           return inet_csk_complete_hashdance(sk, child, req, own_req);
> >>>
> >>> Have you seen the comment at line 818 ?
> >>>
> >>> /* TODO: We probably should defer ts_recent change once
> >>>    * we take ownership of @req.
> >>>    */
> >>>
> >>> Plan was clear and explained. Why implement something else (and buggy) ?
> >>>
> >> Hi Eric,
> >>
> >> Currently we have a real problem, so we want to solve it. This bug
> >> causes the upper layers to be unable to be notified to call accept after
> >> the successful three-way handshake.
> >>
> >> Skb from cpu1 that fails at tcp_paws_check (which it could have
> >> succeeded) will not be able to enter the TCP_ESTABLISHED state, and
> >> therefore parent->sk_data_ready(parent) will not be triggered, and skb
> >> from cpu2 can complete the three-way handshake, but there is also no way
> >> to call parent->sk_data_ready(parent) to notify the upper layer, which
> >> will result
> >> in the upper layer not being able to sense and call accept to obtain the
> >> nsk.
> >>
> >> cpu1                                cpu2
> >> tcp_check_req
> >>                                       tcp_check_req
> >> req->ts_recent = tmp_opt.rcv_tsval = t1
> >>                                       req->ts_recent=tmp_opt.rcv_tsval= t2
> >>
> >> newsk->ts_recent = req->ts_recent = t2 // t1 < t2
> >> tcp_child_process
> >>    tcp_rcv_state_process
> >>     tcp_validate_incoming
> >>      tcp_paws_check // failed
> >>    parent->sk_data_ready(parent); // will not be called
> >>                                       tcp_v4_do_rcv
> >>                                       tcp_rcv_state_process // Complete the three-way handshake
> >>                                                                                                          // missing parent->sk_data_ready(parent);
> >
> > IIUC, the ack received from cpu1 triggered calling
> > inet_csk_complete_hashdance() so its state transited from
> > TCP_NEW_SYN_RECV to TCP_SYN_RECV, right? If so, the reason why not
> > call sk_data_ready() if the skb entered into tcp_child_process() is
> > that its state failed to transit to TCP_ESTABLISHED?
> >
> Yes, because it didn't switch to TCP_ESTABLISHED
> > Here is another question. How did the skb on the right side enter into
> > tcp_v4_do_rcv() after entering tcp_check_req() if the state of sk
> > which the skb belongs to is TCP_NEW_SYN_RECV? Could you elaborate more
> > on this point?
> Since cpu1 successfully created the child sock, cpu2 will return
> null in tcp_check_req and req_stolen is set to true, so that it will
> subsequently go to 'goto lookup' to re-process the packet, and at
> this point, sk->sk_state is already in TCP_SYN_RECV state, and then
> then tcp_v4_do_rcv is called.

Now I can see what happened there. Perhaps it would be good to update
the commit message
in the next iteration.

Another key piece of information I notice is that the second lookup path
loses the chance to call sk_data_ready() on its parent socket. That is
one of the main reasons your application cannot get notified. Taking a
rough look at tcp_rcv_state_process(), I think it's not easy to acquire
the parent socket there and then call sk_data_ready() without modifying
more code than the current solution does. It would be a different
solution in theory.

If your new approach (like your previous reply) works, the following
commit[1] will be reverted/overwritten.
commit e0f9759f530bf789e984961dce79f525b151ecf3
Author: Eric Dumazet <edumazet@google.com>
Date:   Tue Feb 13 06:14:12 2018 -0800

    tcp: try to keep packet if SYN_RCV race is lost

    배석진 reported that in some situations, packets for a given 5-tuple
    end up being processed by different CPUS.

    This involves RPS, and fragmentation.

    배석진 is seeing packet drops when a SYN_RECV request socket is
    moved into ESTABLISH state. Other states are protected by socket lock.

    This is caused by a CPU losing the race, and simply not caring enough.

    Since this seems to occur frequently, we can do better and perform
    a second lookup.

    Note that all needed memory barriers are already in the existing code,
    thanks to the spin_lock()/spin_unlock() pair in inet_ehash_insert()
    and reqsk_put(). The second lookup must find the new socket,
    unless it has already been accepted and closed by another cpu.

    Note that the fragmentation could be avoided in the first place by
    use of a correct TCP MSS option in the SYN{ACK} packet, but this
    does not mean we can not be more robust.

Please repost a V2 when you think it's ready, then TCP maintainers
will handle it:)

Thanks,
Jason
Eric Dumazet Feb. 20, 2025, 8:41 a.m. UTC | #9
On Wed, Feb 19, 2025 at 4:08 AM Wang Hai <wanghai38@huawei.com> wrote:
>

> Hi Eric,
>
> According to the plan, can we fix it like this?
>
> diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
> index b089b08e9617..1210d4967b94 100644
> --- a/net/ipv4/tcp_minisocks.c
> +++ b/net/ipv4/tcp_minisocks.c
> @@ -814,13 +814,6 @@ struct sock *tcp_check_req(struct sock *sk, struct sk_buff *skb,
>          }
>
>          /* In sequence, PAWS is OK. */
> -
> -       /* TODO: We probably should defer ts_recent change once
> -        * we take ownership of @req.
> -        */
> -       if (tmp_opt.saw_tstamp && !after(TCP_SKB_CB(skb)->seq, tcp_rsk(req)->rcv_nxt))
> -               WRITE_ONCE(req->ts_recent, tmp_opt.rcv_tsval);
> -
>          if (TCP_SKB_CB(skb)->seq == tcp_rsk(req)->rcv_isn) {
>                  /* Truncate SYN, it is out of window starting
>                     at tcp_rsk(req)->rcv_isn + 1. */
> @@ -878,6 +871,9 @@ struct sock *tcp_check_req(struct sock *sk, struct sk_buff *skb,
>          sock_rps_save_rxhash(child, skb);
>          tcp_synack_rtt_meas(child, req);
>          *req_stolen = !own_req;
> +       if (own_req && tmp_opt.saw_tstamp && !after(TCP_SKB_CB(skb)->seq, tcp_rsk(req)->rcv_nxt))
> +               tcp_sk(child)->rx_opt.ts_recent = tmp_opt.rcv_tsval;
> +
>          return inet_csk_complete_hashdance(sk, child, req, own_req);

Yes, this was the plan ;)
Wang Hai Feb. 20, 2025, 2:03 p.m. UTC | #10
On 2025/2/20 11:04, Jason Xing wrote:
> On Wed, Feb 19, 2025 at 9:11 PM Wang Hai <wanghai38@huawei.com> wrote:
>>
>>
>>
>> On 2025/2/19 11:31, Jason Xing wrote:
>>> On Wed, Feb 19, 2025 at 10:16 AM Wang Hai <wanghai38@huawei.com> wrote:
>>>>
>>>>
>>>>
>>>> On 2025/2/18 21:35, Eric Dumazet wrote:
>>>>> On Tue, Feb 18, 2025 at 12:00 PM Wang Hai <wanghai38@huawei.com> wrote:
>>>>>>
>>>>>> If two ack packets from a connection enter tcp_check_req at the same time
>>>>>> through different cpu, it may happen that req->ts_recent is updated with
>>>>>> with a more recent time and the skb with an older time creates a new sock,
>>>>>> which will cause the tcp_validate_incoming check to fail.
>>>>>>
>>>>>> cpu1                                cpu2
>>>>>> tcp_check_req
>>>>>>                                        tcp_check_req
>>>>>> req->ts_recent = tmp_opt.rcv_tsval = t1
>>>>>>                                        req->ts_recent = tmp_opt.rcv_tsval = t2
>>>>>>
>>>>>> newsk->ts_recent = req->ts_recent = t2 // t1 < t2
>>>>>> tcp_child_process
>>>>>> tcp_rcv_state_process
>>>>>> tcp_validate_incoming
>>>>>> tcp_paws_check
>>>>>> if ((s32)(rx_opt->ts_recent - rx_opt->rcv_tsval) <= paws_win) // failed
>>>>>>
>>>>>> In tcp_check_req, restore ts_recent to this skb's to fix this bug.
>>>>>>
>>>>>> Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
>>>>>> Signed-off-by: Wang Hai <wanghai38@huawei.com>
>>>>>> ---
>>>>>>     net/ipv4/tcp_minisocks.c | 4 ++++
>>>>>>     1 file changed, 4 insertions(+)
>>>>>>
>>>>>> diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
>>>>>> index b089b08e9617..0208455f9eb8 100644
>>>>>> --- a/net/ipv4/tcp_minisocks.c
>>>>>> +++ b/net/ipv4/tcp_minisocks.c
>>>>>> @@ -878,6 +878,10 @@ struct sock *tcp_check_req(struct sock *sk, struct sk_buff *skb,
>>>>>>            sock_rps_save_rxhash(child, skb);
>>>>>>            tcp_synack_rtt_meas(child, req);
>>>>>>            *req_stolen = !own_req;
>>>>>> +       if (own_req && tcp_sk(child)->rx_opt.tstamp_ok &&
>>>>>> +           unlikely(tcp_sk(child)->rx_opt.ts_recent != tmp_opt.rcv_tsval))
>>>>>> +               tcp_sk(child)->rx_opt.ts_recent = tmp_opt.rcv_tsval;
>>>>>> +
>>>>>>            return inet_csk_complete_hashdance(sk, child, req, own_req);
>>>>>
>>>>> Have you seen the comment at line 818 ?
>>>>>
>>>>> /* TODO: We probably should defer ts_recent change once
>>>>>     * we take ownership of @req.
>>>>>     */
>>>>>
>>>>> Plan was clear and explained. Why implement something else (and buggy) ?
>>>>>
>>>> Hi Eric,
>>>>
>>>> Currently we have a real problem, so we want to solve it. This bug
>>>> causes the upper layers to be unable to be notified to call accept after
>>>> the successful three-way handshake.
>>>>
>>>> Skb from cpu1 that fails at tcp_paws_check (which it could have
>>>> succeeded) will not be able to enter the TCP_ESTABLISHED state, and
>>>> therefore parent->sk_data_ready(parent) will not be triggered, and skb
>>>> from cpu2 can complete the three-way handshake, but there is also no way
>>>> to call parent->sk_data_ready(parent) to notify the upper layer, which
>>>> will result
>>>> in the upper layer not being able to sense and call accept to obtain the
>>>> nsk.
>>>>
>>>> cpu1                                cpu2
>>>> tcp_check_req
>>>>                                        tcp_check_req
>>>> req->ts_recent = tmp_opt.rcv_tsval = t1
>>>>                                        req->ts_recent=tmp_opt.rcv_tsval= t2
>>>>
>>>> newsk->ts_recent = req->ts_recent = t2 // t1 < t2
>>>> tcp_child_process
>>>>     tcp_rcv_state_process
>>>>      tcp_validate_incoming
>>>>       tcp_paws_check // failed
>>>>     parent->sk_data_ready(parent); // will not be called
>>>>                                        tcp_v4_do_rcv
>>>>                                        tcp_rcv_state_process // Complete the three-way handshake
>>>>                                                                                                           // missing parent->sk_data_ready(parent);
>>>
>>> IIUC, the ack received from cpu1 triggered calling
>>> inet_csk_complete_hashdance() so its state transited from
>>> TCP_NEW_SYN_RECV to TCP_SYN_RECV, right? If so, the reason why not
>>> call sk_data_ready() if the skb entered into tcp_child_process() is
>>> that its state failed to transit to TCP_ESTABLISHED?
>>>
>> Yes, because it didn't switch to TCP_ESTABLISHED
>>> Here is another question. How did the skb on the right side enter into
>>> tcp_v4_do_rcv() after entering tcp_check_req() if the state of sk
>>> which the skb belongs to is TCP_NEW_SYN_RECV? Could you elaborate more
>>> on this point?
>> Since cpu1 successfully created the child sock, cpu2 will return
>> null in tcp_check_req and req_stolen is set to true, so that it will
>> subsequently go to 'goto lookup' to re-process the packet, and at
>> this point, sk->sk_state is already in TCP_SYN_RECV state, and then
>> then tcp_v4_do_rcv is called.
> 
> Now I can see what happened there. Perhaps it would be good to update
> the commit message
> in the next iteration.
Hi Jason,

Thanks for the suggestion, I'll test it out and improve the commit 
message to send v2.
> 
> Another key information I notice is that the second lookup process
> loses the chance to call sk_data_ready() for its parent socket. It's
> the one of the main reasons that cause your application to be unable
> to get notified. Taking a rough look at tcp_rcv_state_process(), I
> think it's not easy to acquire the parent socket there and then call
> sk_data_ready() without modifying more codes compared to the current
> solution. It's a different solution in theory.
Yes, I have considered this fix before, but the complexity of the fix 
would be higher.
> 
> If your new approach (like your previous reply) works, the following
> commit[1] will be reverted/overwritten.
I'm sorry, I may not have understood what you meant. With my fix applied,
commit[1] is still necessary because my fix doesn't solve the bug that
commit[1] fixes. Can you explain in detail why commit[1] would be
reverted/overwritten?

Thanks,
Wang
Jason Xing Feb. 20, 2025, 2:45 p.m. UTC | #11
On Thu, Feb 20, 2025 at 10:03 PM Wang Hai <wanghai38@huawei.com> wrote:
>
>
>
> On 2025/2/20 11:04, Jason Xing wrote:
> > On Wed, Feb 19, 2025 at 9:11 PM Wang Hai <wanghai38@huawei.com> wrote:
> >>
> >>
> >>
> >> On 2025/2/19 11:31, Jason Xing wrote:
> >>> On Wed, Feb 19, 2025 at 10:16 AM Wang Hai <wanghai38@huawei.com> wrote:
> >>>>
> >>>>
> >>>>
> >>>> On 2025/2/18 21:35, Eric Dumazet wrote:
> >>>>> On Tue, Feb 18, 2025 at 12:00 PM Wang Hai <wanghai38@huawei.com> wrote:
> >>>>>>
> >>>>>> If two ack packets from a connection enter tcp_check_req at the same time
> >>>>>> through different cpu, it may happen that req->ts_recent is updated with
> >>>>>> with a more recent time and the skb with an older time creates a new sock,
> >>>>>> which will cause the tcp_validate_incoming check to fail.
> >>>>>>
> >>>>>> cpu1                                cpu2
> >>>>>> tcp_check_req
> >>>>>>                                        tcp_check_req
> >>>>>> req->ts_recent = tmp_opt.rcv_tsval = t1
> >>>>>>                                        req->ts_recent = tmp_opt.rcv_tsval = t2
> >>>>>>
> >>>>>> newsk->ts_recent = req->ts_recent = t2 // t1 < t2
> >>>>>> tcp_child_process
> >>>>>> tcp_rcv_state_process
> >>>>>> tcp_validate_incoming
> >>>>>> tcp_paws_check
> >>>>>> if ((s32)(rx_opt->ts_recent - rx_opt->rcv_tsval) <= paws_win) // failed
> >>>>>>
> >>>>>> In tcp_check_req, restore ts_recent to this skb's to fix this bug.
> >>>>>>
> >>>>>> Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
> >>>>>> Signed-off-by: Wang Hai <wanghai38@huawei.com>
> >>>>>> ---
> >>>>>>     net/ipv4/tcp_minisocks.c | 4 ++++
> >>>>>>     1 file changed, 4 insertions(+)
> >>>>>>
> >>>>>> diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
> >>>>>> index b089b08e9617..0208455f9eb8 100644
> >>>>>> --- a/net/ipv4/tcp_minisocks.c
> >>>>>> +++ b/net/ipv4/tcp_minisocks.c
> >>>>>> @@ -878,6 +878,10 @@ struct sock *tcp_check_req(struct sock *sk, struct sk_buff *skb,
> >>>>>>            sock_rps_save_rxhash(child, skb);
> >>>>>>            tcp_synack_rtt_meas(child, req);
> >>>>>>            *req_stolen = !own_req;
> >>>>>> +       if (own_req && tcp_sk(child)->rx_opt.tstamp_ok &&
> >>>>>> +           unlikely(tcp_sk(child)->rx_opt.ts_recent != tmp_opt.rcv_tsval))
> >>>>>> +               tcp_sk(child)->rx_opt.ts_recent = tmp_opt.rcv_tsval;
> >>>>>> +
> >>>>>>            return inet_csk_complete_hashdance(sk, child, req, own_req);
> >>>>>
> >>>>> Have you seen the comment at line 818 ?
> >>>>>
> >>>>> /* TODO: We probably should defer ts_recent change once
> >>>>>     * we take ownership of @req.
> >>>>>     */
> >>>>>
> >>>>> Plan was clear and explained. Why implement something else (and buggy) ?
> >>>>>
> >>>> Hi Eric,
> >>>>
> >>>> Currently we have a real problem, so we want to solve it. This bug
> >>>> causes the upper layers to be unable to be notified to call accept after
> >>>> the successful three-way handshake.
> >>>>
> >>>> Skb from cpu1 that fails at tcp_paws_check (which it could have
> >>>> succeeded) will not be able to enter the TCP_ESTABLISHED state, and
> >>>> therefore parent->sk_data_ready(parent) will not be triggered, and skb
> >>>> from cpu2 can complete the three-way handshake, but there is also no way
> >>>> to call parent->sk_data_ready(parent) to notify the upper layer, which
> >>>> will result
> >>>> in the upper layer not being able to sense and call accept to obtain the
> >>>> nsk.
> >>>>
> >>>> cpu1                                cpu2
> >>>> tcp_check_req
> >>>>                                        tcp_check_req
> >>>> req->ts_recent = tmp_opt.rcv_tsval = t1
> >>>>                                        req->ts_recent=tmp_opt.rcv_tsval= t2
> >>>>
> >>>> newsk->ts_recent = req->ts_recent = t2 // t1 < t2
> >>>> tcp_child_process
> >>>>     tcp_rcv_state_process
> >>>>      tcp_validate_incoming
> >>>>       tcp_paws_check // failed
> >>>>     parent->sk_data_ready(parent); // will not be called
> >>>>                                        tcp_v4_do_rcv
> >>>>                                        tcp_rcv_state_process // Complete the three-way handshake
> >>>>                                                                                                           // missing parent->sk_data_ready(parent);
> >>>
> >>> IIUC, the ack received from cpu1 triggered calling
> >>> inet_csk_complete_hashdance() so its state transited from
> >>> TCP_NEW_SYN_RECV to TCP_SYN_RECV, right? If so, the reason why not
> >>> call sk_data_ready() if the skb entered into tcp_child_process() is
> >>> that its state failed to transit to TCP_ESTABLISHED?
> >>>
> >> Yes, because it didn't switch to TCP_ESTABLISHED
> >>> Here is another question. How did the skb on the right side enter into
> >>> tcp_v4_do_rcv() after entering tcp_check_req() if the state of sk
> >>> which the skb belongs to is TCP_NEW_SYN_RECV? Could you elaborate more
> >>> on this point?
> >> Since cpu1 successfully created the child sock, cpu2 will return
> >> null in tcp_check_req and req_stolen is set to true, so that it will
> >> subsequently go to 'goto lookup' to re-process the packet, and at
> >> this point, sk->sk_state is already in TCP_SYN_RECV state, and then
> >> then tcp_v4_do_rcv is called.
> >
> > Now I can see what happened there. Perhaps it would be good to update
> > the commit message
> > in the next iteration.
> Hi Jason,
>
> Thanks for the suggestion, I'll test it out and improve the commit
> message to send v2.
> >
> > Another key information I notice is that the second lookup process
> > loses the chance to call sk_data_ready() for its parent socket. It's
> > the one of the main reasons that cause your application to be unable
> > to get notified. Taking a rough look at tcp_rcv_state_process(), I
> > think it's not easy to acquire the parent socket there and then call
> > sk_data_ready() without modifying more codes compared to the current
> > solution. It's a different solution in theory.
> Yes, I have considered this fix before, but the complexity of the fix
> would be higher.

Agreed.

> >
> > If your new approach (like your previous reply) works, the following
> > commit[1] will be reverted/overwritten.
> I'm sorry, I may not have understood what you meant. Applying my fix,
> commit[1] is still necessary because it doesn't solve the bug that
> commit[1] fixes. can you explain in detail, why commit[1] will be
> reverted/overwritten.

Right, what I was trying to say is that commit[1] explains how the race
happens, which is quite similar to the case in your commit message, while
your approach avoids the failure to call sk_data_ready() on top of
commit[1]. IIUC, you postpone updating ts_recent so that the skb received
on cpu1 can pass tcp_paws_check() and get the chance to call
sk_data_ready(). Then the issue you reported will be solved. After your
patch, we will not need to resort to a second lookup for that skb. I'm
just thinking out loud.

Sorry for my previous inaccurate/wrong words. The above is what I meant.
So let it go. Please go ahead and post your patch :)

Thanks,
Jason

>
> Thanks,
> Wang
>

Patch

diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
index b089b08e9617..0208455f9eb8 100644
--- a/net/ipv4/tcp_minisocks.c
+++ b/net/ipv4/tcp_minisocks.c
@@ -878,6 +878,10 @@  struct sock *tcp_check_req(struct sock *sk, struct sk_buff *skb,
 	sock_rps_save_rxhash(child, skb);
 	tcp_synack_rtt_meas(child, req);
 	*req_stolen = !own_req;
+	if (own_req && tcp_sk(child)->rx_opt.tstamp_ok &&
+	    unlikely(tcp_sk(child)->rx_opt.ts_recent != tmp_opt.rcv_tsval))
+		tcp_sk(child)->rx_opt.ts_recent = tmp_opt.rcv_tsval;
+
 	return inet_csk_complete_hashdance(sk, child, req, own_req);
 
 listen_overflow: