diff mbox series

[net-next] tcp: clamp window like before the cleanup

Message ID 20250305-net-next-fix-tcp-win-clamp-v1-1-12afb705d34e@kernel.org (mailing list archive)
State Accepted
Commit 8e0e8bef484160ac01ea7bcc3122cc1f0405c982
Delegated to: Netdev Maintainers
Headers show
Series [net-next] tcp: clamp window like before the cleanup | expand

Checks

Context Check Description
netdev/series_format success Single patches do not need cover letters
netdev/tree_selection success Clearly marked for net-next
netdev/ynl success Generated files up to date; no warnings/errors; no diff in generated;
netdev/fixes_present success Fixes tag not required for -next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 0 this patch: 0
netdev/build_tools success No tools touched, skip
netdev/cc_maintainers success CCed 9 of 9 maintainers
netdev/build_clang success Errors and warnings before: 0 this patch: 0
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success Fixes tag looks correct
netdev/build_allmodconfig_warn success Errors and warnings before: 0 this patch: 0
netdev/checkpatch success total: 0 errors, 0 warnings, 0 checks, 25 lines checked
netdev/build_clang_rust success No Rust files in patch. Skipping build
netdev/kdoc success Errors and warnings before: 1 this patch: 1
netdev/source_inline success Was 0 now: 0
netdev/contest success net-next-2025-03-06--03-00 (tests: 894)

Commit Message

Matthieu Baerts March 5, 2025, 2:49 p.m. UTC
A recent cleanup changed the behaviour of tcp_set_window_clamp(). This
looks unintentional, and affects MPTCP selftests, e.g. some tests
re-establishing a connection after a disconnect are now unstable.

Before the cleanup, this operation was done:

  new_rcv_ssthresh = min(tp->rcv_wnd, new_window_clamp);
  tp->rcv_ssthresh = max(new_rcv_ssthresh, tp->rcv_ssthresh);

The cleanup used the 'clamp' macro which takes 3 arguments -- value,
lowest, and highest -- and returns a value between the lowest and the
highest allowable values. This then assumes ...

  lowest (rcv_ssthresh) <= highest (rcv_wnd)

... which doesn't always seem to be the case here according to the MPTCP
selftests, even when running them with plain TCP only, without MPTCP.

For example, when we have ...

  rcv_wnd < rcv_ssthresh < new_rcv_ssthresh

... before the cleanup, rcv_ssthresh was not changed, while after the
cleanup, it is lowered to rcv_wnd (the highest value).

During a simple test with TCP, here are the values I observed:

  new_window_clamp (val)  rcv_ssthresh (lo)  rcv_wnd (hi)
      117760   (out)         65495         <  65536
      128512   (out)         109595        >  80256  => lo > hi
      1184975  (out)         328987        <  329088

      113664   (out)         65483         <  65536
      117760   (out)         110968        <  110976
      129024   (out)         116527        >  109696 => lo > hi

Here, we can see that it is not that rare for rcv_ssthresh (lo) to be
higher than rcv_wnd (hi), so the behaviour differs when the clamp()
macro is used, even without MPTCP.

Note: new_window_clamp is always out of the (rcv_ssthresh, rcv_wnd) range
here, which seems to be generally the case in my tests with small
connections.

I then suggest reverting this part, so as not to change the behaviour.

Fixes: 863a952eb79a ("tcp: tcp_set_window_clamp() cleanup")
Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/551
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
---
Notes: the 'Fixes' commit is only in net-next
---
 net/ipv4/tcp.c | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)


---
base-commit: c62e6f056ea308d6382450c1cb32e41727375885
change-id: 20250305-net-next-fix-tcp-win-clamp-9f4c417ff44d

Best regards,

Comments

Jason Xing March 6, 2025, 5:21 a.m. UTC | #1
On Wed, Mar 5, 2025 at 10:49 PM Matthieu Baerts (NGI0)
<matttbe@kernel.org> wrote:
>
> [...]
> Fixes: 863a952eb79a ("tcp: tcp_set_window_clamp() cleanup")
> Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/551
> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>

Tested-by: Jason Xing <kerneljasonxing@gmail.com>

Thanks for catching this. I should have done more tests :(

Now I use netperf with TCP_CRR to test loopback and easily see the
case where tp->rcv_ssthresh is larger than tp->rcv_wnd, which means
tp->rcv_wnd is not the upper bound as you said.

Thanks,
Jason

> ---
> Notes: the 'Fixes' commit is only in net-next
> ---
>  net/ipv4/tcp.c | 12 ++++++------
>  1 file changed, 6 insertions(+), 6 deletions(-)
>
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index eb5a60c7a9ccdd23fb78a74d614c18c4f7e281c9..46951e74930844af952dfbc57a107b504d4e296b 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -3693,7 +3693,7 @@ EXPORT_SYMBOL(tcp_sock_set_keepcnt);
>
>  int tcp_set_window_clamp(struct sock *sk, int val)
>  {
> -       u32 old_window_clamp, new_window_clamp;
> +       u32 old_window_clamp, new_window_clamp, new_rcv_ssthresh;
>         struct tcp_sock *tp = tcp_sk(sk);
>
>         if (!val) {
> @@ -3714,12 +3714,12 @@ int tcp_set_window_clamp(struct sock *sk, int val)
>         /* Need to apply the reserved mem provisioning only
>          * when shrinking the window clamp.
>          */
> -       if (new_window_clamp < old_window_clamp)
> +       if (new_window_clamp < old_window_clamp) {
>                 __tcp_adjust_rcv_ssthresh(sk, new_window_clamp);
> -       else
> -               tp->rcv_ssthresh = clamp(new_window_clamp,
> -                                        tp->rcv_ssthresh,
> -                                        tp->rcv_wnd);
> +       } else {
> +               new_rcv_ssthresh = min(tp->rcv_wnd, new_window_clamp);
> +               tp->rcv_ssthresh = max(new_rcv_ssthresh, tp->rcv_ssthresh);
> +       }
>         return 0;
>  }
>
>
> ---
> base-commit: c62e6f056ea308d6382450c1cb32e41727375885
> change-id: 20250305-net-next-fix-tcp-win-clamp-9f4c417ff44d
>
> Best regards,
> --
> Matthieu Baerts (NGI0) <matttbe@kernel.org>
>
Eric Dumazet March 6, 2025, 9:45 a.m. UTC | #2
On Thu, Mar 6, 2025 at 6:22 AM Jason Xing <kerneljasonxing@gmail.com> wrote:
>
> On Wed, Mar 5, 2025 at 10:49 PM Matthieu Baerts (NGI0)
> <matttbe@kernel.org> wrote:
> >
> > [...]
> >
> > Fixes: 863a952eb79a ("tcp: tcp_set_window_clamp() cleanup")
> > Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/551
> > Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
>
> Tested-by: Jason Xing <kerneljasonxing@gmail.com>
>
> Thanks for catching this. I should have done more tests :(
>
> Now I use netperf with TCP_CRR to test loopback and easily see the
> case where tp->rcv_ssthresh is larger than tp->rcv_wnd, which means
> tp->rcv_wnd is not the upper bound as you said.
>
> Thanks,
> Jason
>

Patch looks fine to me but all our tests are passing with the current kernel,
and I was not able to trigger the condition.

Can you share what precise test you did?

Thanks!
Matthieu Baerts March 6, 2025, 9:55 a.m. UTC | #3
Hi Eric,

On 06/03/2025 10:45, Eric Dumazet wrote:
> On Thu, Mar 6, 2025 at 6:22 AM Jason Xing <kerneljasonxing@gmail.com> wrote:
>>
>> On Wed, Mar 5, 2025 at 10:49 PM Matthieu Baerts (NGI0)
>> <matttbe@kernel.org> wrote:
>>>
>>> [...]
>>>
>>> Fixes: 863a952eb79a ("tcp: tcp_set_window_clamp() cleanup")
>>> Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/551
>>> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
>>
>> Tested-by: Jason Xing <kerneljasonxing@gmail.com>
>>
>> Thanks for catching this. I should have done more tests :(
>>
>> Now I use netperf with TCP_CRR to test loopback and easily see the
>> case where tp->rcv_ssthresh is larger than tp->rcv_wnd, which means
>> tp->rcv_wnd is not the upper bound as you said.
>>
>> Thanks,
>> Jason
>>
> 
> Patch looks fine to me but all our tests are passing with the current kernel,
> and I was not able to trigger the condition.

Thank you for having looked at this patch!


> Can you share what precise test you did ?

To be able to get a situation where "rcv_ssthresh > rcv_wnd", I simply
executed the MPTCP Connect selftest. You can also force creating TCP-only
connections with '-tt', e.g.

  ./mptcp_connect.sh -tt


To be able to reproduce the issue with the selftests mentioned in [1], I
simply executed ./mptcp_connect.sh in a loop after having applied this
small patch to execute only a part of the subtests ("disconnect"):

> diff --git a/tools/testing/selftests/net/mptcp/mptcp_connect.sh b/tools/testing/selftests/net/mptcp/mptcp_connect.sh
> index 5e3c56253274..d8ebea5abc6c 100755
> --- a/tools/testing/selftests/net/mptcp/mptcp_connect.sh
> +++ b/tools/testing/selftests/net/mptcp/mptcp_connect.sh
> @@ -855,6 +855,7 @@ make_file "$sin" "server"
>  
>  mptcp_lib_subtests_last_ts_reset
>  
> +if false; then
>  check_mptcp_disabled
>  
>  stop_if_error "The kernel configuration is not valid for MPTCP"
> @@ -882,6 +883,7 @@ mptcp_lib_result_code "${ret}" "ping tests"
>  
>  stop_if_error "Could not even run ping tests"
>  mptcp_lib_pr_ok
> +fi
>  
>  [ -n "$tc_loss" ] && tc -net "$ns2" qdisc add dev ns2eth3 root netem loss random $tc_loss delay ${tc_delay}ms
>  tc_info="loss of $tc_loss "
> @@ -910,6 +912,7 @@ mptcp_lib_pr_info "Using ${tc_info}on ns3eth4"
>  
>  tc -net "$ns3" qdisc add dev ns3eth4 root netem delay ${reorder_delay}ms $tc_reorder
>  
> +if false; then
>  TEST_GROUP="loopback v4"
>  run_tests_lo "$ns1" "$ns1" 10.0.1.1 1
>  stop_if_error "Could not even run loopback test"
> @@ -959,6 +962,7 @@ log_if_error "Tests with MPTFO have failed"
>  run_test_transparent 10.0.3.1 "tproxy ipv4"
>  run_test_transparent dead:beef:3::1 "tproxy ipv6"
>  log_if_error "Tests with tproxy have failed"
> +fi
>  
>  run_tests_disconnect
>  log_if_error "Tests of the full disconnection have failed"

Note that our CI was able to easily reproduce it. Locally, it was taking
around 30 to 50 iterations to reproduce the issue.

[1] https://github.com/multipath-tcp/mptcp_net-next/issues/551

Cheers,
Matt
Eric Dumazet March 6, 2025, 10:02 a.m. UTC | #4
On Thu, Mar 6, 2025 at 10:55 AM Matthieu Baerts <matttbe@kernel.org> wrote:
>
> Hi Eric,
>
> On 06/03/2025 10:45, Eric Dumazet wrote:
> > On Thu, Mar 6, 2025 at 6:22 AM Jason Xing <kerneljasonxing@gmail.com> wrote:
> >>
> >> On Wed, Mar 5, 2025 at 10:49 PM Matthieu Baerts (NGI0)
> >> <matttbe@kernel.org> wrote:
> >>>
> >>> [...]
> >>>
> >>> Fixes: 863a952eb79a ("tcp: tcp_set_window_clamp() cleanup")
> >>> Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/551
> >>> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
> >>
> >> Tested-by: Jason Xing <kerneljasonxing@gmail.com>
> >>
> >> Thanks for catching this. I should have done more tests :(
> >>
> >> Now I use netperf with TCP_CRR to test loopback and easily see the
> >> case where tp->rcv_ssthresh is larger than tp->rcv_wnd, which means
> >> tp->rcv_wnd is not the upper bound as you said.
> >>
> >> Thanks,
> >> Jason
> >>
> >
> > Patch looks fine to me but all our tests are passing with the current kernel,
> > and I was not able to trigger the condition.
>
> Thank you for having looked at this patch!
>
>
> > Can you share what precise test you did ?
>
> To be able to get a situation where "rcv_ssthresh > rcv_wnd", I simply
> executed MPTCP Connect selftest. You can also force creating TCP only
> connections with '-tt', e.g.
>
>   ./mptcp_connect.sh -tt

I was asking Jason about TCP tests. He mentioned TCP_CRR.

I ran several of them, with temporary debugging in the kernel, and they
did not show the issue.

I am wondering if this could hide an issue in MPTCP?

>
>
> To be able to reproduce the issue with the selftests mentioned in [1], I
> simply executed ./mptcp_connect.sh in a loop after having applied this
> small patch to execute only a part of the subtests ("disconnect"):
>
> > [...]
>
> Note that our CI was able to easily reproduce it. Locally, it was taking
> around 30 to 50 iterations to reproduce the issue.
>
> [1] https://github.com/multipath-tcp/mptcp_net-next/issues/551
>
> Cheers,
> Matt
> --
> Sponsored by the NGI0 Core fund.
>
Matthieu Baerts March 6, 2025, 10:12 a.m. UTC | #5
On 06/03/2025 11:02, Eric Dumazet wrote:
> On Thu, Mar 6, 2025 at 10:55 AM Matthieu Baerts <matttbe@kernel.org> wrote:
>>
>> Hi Eric,
>>
>> On 06/03/2025 10:45, Eric Dumazet wrote:
>>> On Thu, Mar 6, 2025 at 6:22 AM Jason Xing <kerneljasonxing@gmail.com> wrote:
>>>>
>>>> On Wed, Mar 5, 2025 at 10:49 PM Matthieu Baerts (NGI0)
>>>> <matttbe@kernel.org> wrote:
>>>>>
>>>>> [...]
>>>>>
>>>>> Fixes: 863a952eb79a ("tcp: tcp_set_window_clamp() cleanup")
>>>>> Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/551
>>>>> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
>>>>
>>>> Tested-by: Jason Xing <kerneljasonxing@gmail.com>
>>>>
>>>> Thanks for catching this. I should have done more tests :(
>>>>
>>>> Now I use netperf with TCP_CRR to test loopback and easily see the
>>>> case where tp->rcv_ssthresh is larger than tp->rcv_wnd, which means
>>>> tp->rcv_wnd is not the upper bound as you said.
>>>>
>>>> Thanks,
>>>> Jason
>>>>
>>>
>>> Patch looks fine to me but all our tests are passing with the current kernel,
>>> and I was not able to trigger the condition.
>>
>> Thank you for having looked at this patch!
>>
>>
>>> Can you share what precise test you did ?
>>
>> To be able to get a situation where "rcv_ssthresh > rcv_wnd", I simply
>> executed MPTCP Connect selftest. You can also force creating TCP only
>> connections with '-tt', e.g.
>>
>>   ./mptcp_connect.sh -tt
> 
> I was asking Jason about TCP tests. He mentioned TCP_CRR

Oops, I'm sorry, I didn't look at the "To:" field.

> I made several of them, with temporary debug in the kernel that did
> not show the issue.
> 
> 
> I am wondering if this could hide an issue in MPTCP?

Indeed, I was wondering the same thing. I didn't see anything obvious
when looking at this issue. The behaviours around the window clamping
with MPTCP single flow and with "plain" TCP were quite similar, I think.

Cheers,
Matt
Eric Dumazet March 6, 2025, 10:16 a.m. UTC | #6
On Thu, Mar 6, 2025 at 11:12 AM Matthieu Baerts <matttbe@kernel.org> wrote:
>
> On 06/03/2025 11:02, Eric Dumazet wrote:
> > On Thu, Mar 6, 2025 at 10:55 AM Matthieu Baerts <matttbe@kernel.org> wrote:
> >>
> >> Hi Eric,
> >>
> >> On 06/03/2025 10:45, Eric Dumazet wrote:
> >>> On Thu, Mar 6, 2025 at 6:22 AM Jason Xing <kerneljasonxing@gmail.com> wrote:
> >>>>
> >>>> On Wed, Mar 5, 2025 at 10:49 PM Matthieu Baerts (NGI0)
> >>>> <matttbe@kernel.org> wrote:
> >>>>>
> >>>>> [...]
> >>>>>
> >>>>> Fixes: 863a952eb79a ("tcp: tcp_set_window_clamp() cleanup")
> >>>>> Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/551
> >>>>> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
> >>>>
> >>>> Tested-by: Jason Xing <kerneljasonxing@gmail.com>
> >>>>
> >>>> Thanks for catching this. I should have done more tests :(
> >>>>
> >>>> Now I use netperf with TCP_CRR to test loopback and easily see the
> >>>> case where tp->rcv_ssthresh is larger than tp->rcv_wnd, which means
> >>>> tp->rcv_wnd is not the upper bound as you said.
> >>>>
> >>>> Thanks,
> >>>> Jason
> >>>>
> >>>
> >>> Patch looks fine to me but all our tests are passing with the current kernel,
> >>> and I was not able to trigger the condition.
> >>
> >> Thank you for having looked at this patch!
> >>
> >>
> >>> Can you share what precise test you did ?
> >>
> >> To be able to get a situation where "rcv_ssthresh > rcv_wnd", I simply
> >> executed MPTCP Connect selftest. You can also force creating TCP only
> >> connections with '-tt', e.g.
> >>
> >>   ./mptcp_connect.sh -tt
> >
> > I was asking Jason about TCP tests. He mentioned TCP_CRR
>
> Oops, I'm sorry, I didn't look at the "To:" field.
>
> > I made several of them, with temporary debug in the kernel that did
> > not show the issue.
> >
> >
> > I am wondering if this could hide an issue in MPTCP ?
> Indeed, I was wondering the same thing. I didn't see anything obvious
> when looking at this issue. The behaviours around the window clamping,
> with MPTCP single flow, and "plain" TCP were quite similar I think.

OK, let me run mptcp tests just in case I see something dubious.
Eric Dumazet March 6, 2025, 11:08 a.m. UTC | #7
On Thu, Mar 6, 2025 at 11:16 AM Eric Dumazet <edumazet@google.com> wrote:
>
> On Thu, Mar 6, 2025 at 11:12 AM Matthieu Baerts <matttbe@kernel.org> wrote:
> >
> > On 06/03/2025 11:02, Eric Dumazet wrote:
> > > On Thu, Mar 6, 2025 at 10:55 AM Matthieu Baerts <matttbe@kernel.org> wrote:
> > >>
> > >> Hi Eric,
> > >>
> > >> On 06/03/2025 10:45, Eric Dumazet wrote:
> > >>> On Thu, Mar 6, 2025 at 6:22 AM Jason Xing <kerneljasonxing@gmail.com> wrote:
> > >>>>
> > >>>> On Wed, Mar 5, 2025 at 10:49 PM Matthieu Baerts (NGI0)
> > >>>> <matttbe@kernel.org> wrote:
> > >>>>>
> > >>>>> A recent cleanup changed the behaviour of tcp_set_window_clamp(). This
> > >>>>> looks unintentional, and affects MPTCP selftests, e.g. some tests
> > >>>>> re-establishing a connection after a disconnect are now unstable.
> > >>>>>
> > >>>>> Before the cleanup, this operation was done:
> > >>>>>
> > >>>>>   new_rcv_ssthresh = min(tp->rcv_wnd, new_window_clamp);
> > >>>>>   tp->rcv_ssthresh = max(new_rcv_ssthresh, tp->rcv_ssthresh);
> > >>>>>
> > >>>>> The cleanup used the 'clamp' macro which takes 3 arguments -- value,
> > >>>>> lowest, and highest -- and returns a value between the lowest and the
> > >>>>> highest allowable values. This then assumes ...
> > >>>>>
> > >>>>>   lowest (rcv_ssthresh) <= highest (rcv_wnd)
> > >>>>>
> > >>>>> ... which doesn't seem to be always the case here according to the MPTCP
> > >>>>> selftests, even when running them without MPTCP, but only TCP.
> > >>>>>
> > >>>>> For example, when we have ...
> > >>>>>
> > >>>>>   rcv_wnd < rcv_ssthresh < new_rcv_ssthresh
> > >>>>>
> > >>>>> ... before the cleanup, the rcv_ssthresh was not changed, while after
> > >>>>> the cleanup, it is lowered down to rcv_wnd (highest).
> > >>>>>
> > >>>>> During a simple test with TCP, here are the values I observed:
> > >>>>>
> > >>>>>   new_window_clamp (val)  rcv_ssthresh (lo)  rcv_wnd (hi)
> > >>>>>       117760   (out)         65495         <  65536
> > >>>>>       128512   (out)         109595        >  80256  => lo > hi
> > >>>>>       1184975  (out)         328987        <  329088
> > >>>>>
> > >>>>>       113664   (out)         65483         <  65536
> > >>>>>       117760   (out)         110968        <  110976
> > >>>>>       129024   (out)         116527        >  109696 => lo > hi
> > >>>>>
> > >>>>> Here, we can see that it is not that rare for rcv_ssthresh (lo) to
> > >>>>> be higher than rcv_wnd (hi), so the behaviour differs when the
> > >>>>> clamp() macro is used, even without MPTCP.
> > >>>>>
> > >>>>> Note: new_window_clamp is always out of range (rcv_ssthresh < rcv_wnd)
> > >>>>> here, which seems to be generally the case in my tests with small
> > >>>>> connections.
> > >>>>>
> > >>>>> I then suggest reverting this part, so as not to change the behaviour.
> > >>>>>
> > >>>>> Fixes: 863a952eb79a ("tcp: tcp_set_window_clamp() cleanup")
> > >>>>> Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/551
> > >>>>> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
> > >>>>
> > >>>> Tested-by: Jason Xing <kerneljasonxing@gmail.com>
> > >>>>
> > >>>> Thanks for catching this. I should have done more tests :(
> > >>>>
> > >>>> Now I use netperf with TCP_CRR to test loopback and easily see the
> > >>>> case where tp->rcv_ssthresh is larger than tp->rcv_wnd, which means
> > >>>> tp->rcv_wnd is not the upper bound as you said.
> > >>>>
> > >>>> Thanks,
> > >>>> Jason
> > >>>>
> > >>>
> > >>> Patch looks fine to me but all our tests are passing with the current kernel,
> > >>> and I was not able to trigger the condition.
> > >>
> > >> Thank you for having looked at this patch!
> > >>
> > >>
> > >>> Can you share what precise test you did ?
> > >>
> > >> To be able to get a situation where "rcv_ssthresh > rcv_wnd", I simply
> > >> executed MPTCP Connect selftest. You can also force creating TCP only
> > >> connections with '-tt', e.g.
> > >>
> > >>   ./mptcp_connect.sh -tt
> > >
> > > I was asking Jason about TCP tests. He mentioned TCP_CRR
> >
> > Oops, I'm sorry, I didn't look at the "To:" field.
> >
> > > I made several of them, with temporary debug in the kernel that did
> > > not show the issue.
> > >
> > >
> > > I am wondering if this could hide an issue in MPTCP ?
> > Indeed, I was wondering the same thing. I didn't see anything obvious
> > when looking at this issue. The behaviours around the window clamping,
> > with MPTCP single flow, and "plain" TCP were quite similar I think.
>
> OK, let me run mptcp tests just in case I see something dubious.

I have no idea why only MPTCP flows can trigger the condition; I do
not think it matters anyway.

Reviewed-by: Eric Dumazet <edumazet@google.com>
Jason Xing March 6, 2025, 11:18 a.m. UTC | #8
On Thu, Mar 6, 2025 at 5:45 PM Eric Dumazet <edumazet@google.com> wrote:
>
> On Thu, Mar 6, 2025 at 6:22 AM Jason Xing <kerneljasonxing@gmail.com> wrote:
> >
> > On Wed, Mar 5, 2025 at 10:49 PM Matthieu Baerts (NGI0)
> > <matttbe@kernel.org> wrote:
> > >
> > > (...)
> > >
> > > Fixes: 863a952eb79a ("tcp: tcp_set_window_clamp() cleanup")
> > > Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/551
> > > Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
> >
> > Tested-by: Jason Xing <kerneljasonxing@gmail.com>
> >
> > Thanks for catching this. I should have done more tests :(
> >
> > Now I use netperf with TCP_CRR to test loopback and easily see the
> > case where tp->rcv_ssthresh is larger than tp->rcv_wnd, which means
> > tp->rcv_wnd is not the upper bound as you said.
> >
> > Thanks,
> > Jason
> >
>
> Patch looks fine to me but all our tests are passing with the current kernel,
> and I was not able to trigger the condition.
>
> Can you share what precise test you did ?
>
> Thanks !

I did the test [1] on a virtual machine running the kernel [2]. After
seeing your reply, I checked out a clean branch, compiled the kernel
with the patch reverted again, and rebooted. The case can still be
reliably reproduced on my machine. Here are some outputs from the BPF
program [3]:
 sudo bpftrace tcp_cap.bt.2
Attaching 1 probe...
4327813, 4326775, 4310912
netperf
        tcp_set_window_clamp+1
        tcp_data_queue+1744
        tcp_rcv_established+501
        tcp_v4_do_rcv+369
        tcp_v4_rcv+4800
        ip_protocol_deliver_rcu+65

4327813, 4326827, 4310912
netperf
        tcp_set_window_clamp+1
        tcp_data_queue+1744
        tcp_rcv_established+501
        tcp_v4_do_rcv+369
        tcp_v4_rcv+4800
        ip_protocol_deliver_rcu+65

418081, 417052, 417024
swapper/11
        tcp_set_window_clamp+1
        tcp_data_queue+1744
        tcp_rcv_established+501
        tcp_v4_do_rcv+369
        tcp_v4_rcv+4800
        ip_protocol_deliver_rcu+65

I can help if you want to see something else.

Thanks,
Jason

[1]: netperf -H 127.0.0.1 and netperf -H 127.0.0.1 -t TCP_CRR
[2]: based on commit f130a0cc1b4ff1 with the patch reverted:
commit 196150bf8e912eb27d4f083fc223ad8787709c6f (HEAD -> main)
Author: Jason Xing <kerneljasonxing@gmail.com>
Date:   Thu Mar 6 19:00:47 2025 +0800

    Revert "tcp: tcp_set_window_clamp() cleanup"

    This reverts commit 863a952eb79a6acf2b1f654f4e75ed104ff4cc81.

commit f130a0cc1b4ff1ef28a307428d40436032e2b66e (origin/main, origin/HEAD)
Author: Eric Dumazet <edumazet@google.com>
Date:   Tue Mar 4 12:59:18 2025 +0000

    inet: fix lwtunnel_valid_encap_type() lock imbalance

    After blamed commit rtm_to_fib_config() now calls
    lwtunnel_valid_encap_type{_attr}() without RTNL held,
    triggering an unlock balance in __rtnl_unlock,
    as reported by syzbot [1]
[3]:
kprobe:tcp_set_window_clamp
{
        $sk = (struct sock *) arg0;
        $val = arg1;
        $tp = (struct tcp_sock *) $sk;
        $dport = $sk->__sk_common.skc_dport;
        $dport = bswap($dport);
        $lport = $sk->__sk_common.skc_num;

        if ($tp->rcv_ssthresh > $tp->rcv_wnd) {
                printf("%u, %u, %u\n", $tp->window_clamp,
                       $tp->rcv_ssthresh, $tp->rcv_wnd);
                printf("%s %s\n", comm, kstack(6));
        }
}
Matthieu Baerts March 6, 2025, 11:20 a.m. UTC | #9
Hi Eric,

On 06/03/2025 12:08, Eric Dumazet wrote:
> On Thu, Mar 6, 2025 at 11:16 AM Eric Dumazet <edumazet@google.com> wrote:
>> On Thu, Mar 6, 2025 at 11:12 AM Matthieu Baerts <matttbe@kernel.org> wrote:
>>> On 06/03/2025 11:02, Eric Dumazet wrote:

(...)

>>>> I am wondering if this could hide an issue in MPTCP ?
>>> Indeed, I was wondering the same thing. I didn't see anything obvious
>>> when looking at this issue. The behaviours around the window clamping,
>>> with MPTCP single flow, and "plain" TCP were quite similar I think.
>>
>> OK, let me run mptcp tests just in case I see something dubious.
> 
> I have no idea why only MPTCP flows can trigger the condition, I do
> not think it matters anyway.

Thank you for having looked! Did you manage to trigger this condition
with "plain" TCP (without MPTCP) when using "./mptcp_connect.sh -tt"?

The TCP-only connections will be printed like this:

  nsX TCP   -> nsY (<IP>:<port>) TCP

But yes, it probably doesn't matter.

Cheers,
Matt
patchwork-bot+netdevbpf@kernel.org March 6, 2025, 11:40 p.m. UTC | #10
Hello:

This patch was applied to netdev/net-next.git (main)
by Jakub Kicinski <kuba@kernel.org>:

On Wed, 05 Mar 2025 15:49:48 +0100 you wrote:
> A recent cleanup changed the behaviour of tcp_set_window_clamp(). This
> looks unintentional, and affects MPTCP selftests, e.g. some tests
> re-establishing a connection after a disconnect are now unstable.
> 
> Before the cleanup, this operation was done:
> 
>   new_rcv_ssthresh = min(tp->rcv_wnd, new_window_clamp);
>   tp->rcv_ssthresh = max(new_rcv_ssthresh, tp->rcv_ssthresh);
> 
> [...]

Here is the summary with links:
  - [net-next] tcp: clamp window like before the cleanup
    https://git.kernel.org/netdev/net-next/c/8e0e8bef4841

You are awesome, thank you!
Jason Xing March 7, 2025, 12:31 a.m. UTC | #11
On Thu, Mar 6, 2025 at 7:18 PM Jason Xing <kerneljasonxing@gmail.com> wrote:
>
> On Thu, Mar 6, 2025 at 5:45 PM Eric Dumazet <edumazet@google.com> wrote:
> >
> > On Thu, Mar 6, 2025 at 6:22 AM Jason Xing <kerneljasonxing@gmail.com> wrote:
> > >
> > > On Wed, Mar 5, 2025 at 10:49 PM Matthieu Baerts (NGI0)
> > > <matttbe@kernel.org> wrote:
> > > >
> > > > (...)
> > > >
> > > > Fixes: 863a952eb79a ("tcp: tcp_set_window_clamp() cleanup")
> > > > Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/551
> > > > Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
> > >
> > > Tested-by: Jason Xing <kerneljasonxing@gmail.com>
> > >
> > > Thanks for catching this. I should have done more tests :(
> > >
> > > Now I use netperf with TCP_CRR to test loopback and easily see the
> > > case where tp->rcv_ssthresh is larger than tp->rcv_wnd, which means
> > > tp->rcv_wnd is not the upper bound as you said.
> > >
> > > Thanks,
> > > Jason
> > >
> >
> > Patch looks fine to me but all our tests are passing with the current kernel,
> > and I was not able to trigger the condition.
> >
> > Can you share what precise test you did ?
> >
> > Thanks !
>
> I did the test[1] on the virtual machine running the kernel [2]. And
> after seeing your reply, I checked out a clean branch and compiled the
> kernel with the patch reverted again and rebooted. The case can still
> be reliably reproduced in my machine. Here are some outputs from BPF
> program[3]:
> (...)

Hi Eric, Matthieu

I did a quick analysis of this case. It turned out that
__release_sock() was processing the skbs in sk_backlog one by one, and then:
1) one skb went into tcp_grow_window(). At the beginning, rcv_ssthresh
was equal to rcv_wnd, but rcv_ssthresh was later increased by 'incr',
which means the updated rcv_ssthresh was larger than rcv_wnd.
2) another skb went into tcp_set_window_clamp(), where, as I saw
yesterday, the issue happened: rcv_ssthresh > rcv_wnd.

As for rcv_wnd, in synchronised states it can only be
changed/adjusted to new_win in tcp_select_window(). Thus, between 1)
and 2) above, the sk had no chance to update its rcv_wnd value, so...

I attached two consecutive calltraces here:
[Fri Mar  7 08:00:52 2025] netserver, 1, 91234, 65483, (25751,130966)
[Fri Mar  7 08:00:52 2025] CPU: 0 UID: 266980 PID: 17465 Comm:
netserver Kdump: loaded Not tainted 6.14.0-rc4+ #415
[Fri Mar  7 08:00:52 2025] Hardware name: Tencent Cloud CVM, BIOS
seabios-1.9.1-qemu-project.org 04/01/2014
[Fri Mar  7 08:00:52 2025] Call Trace:
[Fri Mar  7 08:00:52 2025]  <TASK>
[Fri Mar  7 08:00:52 2025]  dump_stack_lvl+0x5b/0x70
[Fri Mar  7 08:00:52 2025]  dump_stack+0x10/0x20
[Fri Mar  7 08:00:52 2025]  tcp_grow_window+0x297/0x320
[Fri Mar  7 08:00:52 2025]  tcp_event_data_recv+0x265/0x400
[Fri Mar  7 08:00:52 2025]  tcp_data_queue+0x6d0/0xc70
[Fri Mar  7 08:00:52 2025]  tcp_rcv_established+0x1f5/0x760
[Fri Mar  7 08:00:52 2025]  ? schedule_timeout+0xe5/0x100
[Fri Mar  7 08:00:52 2025]  tcp_v4_do_rcv+0x171/0x2d0
[Fri Mar  7 08:00:52 2025]  __release_sock+0xd1/0xe0
[Fri Mar  7 08:00:52 2025]  release_sock+0x30/0xa0
[Fri Mar  7 08:00:52 2025]  inet_accept+0x5c/0x80
[Fri Mar  7 08:00:52 2025]  do_accept+0xf1/0x180
.....
[Fri Mar  7 08:00:52 2025] netserver, 0, 91234, 65483
[Fri Mar  7 08:00:52 2025] CPU: 0 UID: 266980 PID: 17465 Comm:
netserver Kdump: loaded Not tainted 6.14.0-rc4+ #415
[Fri Mar  7 08:00:52 2025] Hardware name: Tencent Cloud CVM, BIOS
seabios-1.9.1-qemu-project.org 04/01/2014
[Fri Mar  7 08:00:52 2025] Call Trace:
[Fri Mar  7 08:00:52 2025]  <TASK>
[Fri Mar  7 08:00:52 2025]  dump_stack_lvl+0x5b/0x70
[Fri Mar  7 08:00:52 2025]  dump_stack+0x10/0x20
[Fri Mar  7 08:00:52 2025]  tcp_set_window_clamp+0xc3/0x1f0
[Fri Mar  7 08:00:52 2025]  tcp_event_data_recv+0x35b/0x400
[Fri Mar  7 08:00:52 2025]  tcp_data_queue+0x6d0/0xc70
[Fri Mar  7 08:00:52 2025]  tcp_rcv_established+0x1f5/0x760
[Fri Mar  7 08:00:52 2025]  ? schedule_timeout+0xe5/0x100
[Fri Mar  7 08:00:52 2025]  tcp_v4_do_rcv+0x171/0x2d0
[Fri Mar  7 08:00:52 2025]  __release_sock+0xd1/0xe0
[Fri Mar  7 08:00:52 2025]  release_sock+0x30/0xa0
[Fri Mar  7 08:00:52 2025]  inet_accept+0x5c/0x80
[Fri Mar  7 08:00:52 2025]  do_accept+0xf1/0x180

The test is simple: on my virtual machine, just run "netperf -H
127.0.0.1", which uses TCP_STREAM by default.

Thanks,
Jason
Patch

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index eb5a60c7a9ccdd23fb78a74d614c18c4f7e281c9..46951e74930844af952dfbc57a107b504d4e296b 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -3693,7 +3693,7 @@  EXPORT_SYMBOL(tcp_sock_set_keepcnt);
 
 int tcp_set_window_clamp(struct sock *sk, int val)
 {
-	u32 old_window_clamp, new_window_clamp;
+	u32 old_window_clamp, new_window_clamp, new_rcv_ssthresh;
 	struct tcp_sock *tp = tcp_sk(sk);
 
 	if (!val) {
@@ -3714,12 +3714,12 @@  int tcp_set_window_clamp(struct sock *sk, int val)
 	/* Need to apply the reserved mem provisioning only
 	 * when shrinking the window clamp.
 	 */
-	if (new_window_clamp < old_window_clamp)
+	if (new_window_clamp < old_window_clamp) {
 		__tcp_adjust_rcv_ssthresh(sk, new_window_clamp);
-	else
-		tp->rcv_ssthresh = clamp(new_window_clamp,
-					 tp->rcv_ssthresh,
-					 tp->rcv_wnd);
+	} else {
+		new_rcv_ssthresh = min(tp->rcv_wnd, new_window_clamp);
+		tp->rcv_ssthresh = max(new_rcv_ssthresh, tp->rcv_ssthresh);
+	}
 	return 0;
 }