Message ID   | 20250305-net-next-fix-tcp-win-clamp-v1-1-12afb705d34e@kernel.org
State        | Accepted
Commit       | 8e0e8bef484160ac01ea7bcc3122cc1f0405c982
Delegated to | Netdev Maintainers
Series       | [net-next] tcp: clamp window like before the cleanup
On Wed, Mar 5, 2025 at 10:49 PM Matthieu Baerts (NGI0) <matttbe@kernel.org> wrote:
>
> A recent cleanup changed the behaviour of tcp_set_window_clamp(). This
> looks unintentional, and affects MPTCP selftests, e.g. some tests
> re-establishing a connection after a disconnect are now unstable.
>
> Before the cleanup, this operation was done:
>
>   new_rcv_ssthresh = min(tp->rcv_wnd, new_window_clamp);
>   tp->rcv_ssthresh = max(new_rcv_ssthresh, tp->rcv_ssthresh);
>
> The cleanup used the 'clamp' macro which takes 3 arguments -- value,
> lowest, and highest -- and returns a value between the lowest and the
> highest allowable values. This then assumes ...
>
>   lowest (rcv_ssthresh) <= highest (rcv_wnd)
>
> ... which doesn't seem to be always the case here according to the MPTCP
> selftests, even when running them without MPTCP, but only TCP.
>
> For example, when we have ...
>
>   rcv_wnd < rcv_ssthresh < new_rcv_ssthresh
>
> ... before the cleanup, the rcv_ssthresh was not changed, while after
> the cleanup, it is lowered down to rcv_wnd (highest).
>
> During a simple test with TCP, here are the values I observed:
>
>   new_window_clamp (val)  rcv_ssthresh (lo)   rcv_wnd (hi)
>           117760 (out)        65495       <       65536
>           128512 (out)       109595       >       80256   => lo > hi
>          1184975 (out)       328987       <      329088
>
>           113664 (out)        65483       <       65536
>           117760 (out)       110968       <      110976
>           129024 (out)       116527       >      109696   => lo > hi
>
> Here, we can see that it is not that rare to have rcv_ssthresh (lo)
> higher than rcv_wnd (hi), hence the different behaviour now that the
> clamp() macro is used, even without MPTCP.
>
> Note: new_window_clamp is always out of range (rcv_ssthresh < rcv_wnd)
> here, which seems to be generally the case in my tests with small
> connections.
>
> I then suggest reverting this part, so as not to change the behaviour.
>
> Fixes: 863a952eb79a ("tcp: tcp_set_window_clamp() cleanup")
> Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/551
> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>

Tested-by: Jason Xing <kerneljasonxing@gmail.com>

Thanks for catching this. I should have done more tests :(

Now I use netperf with TCP_CRR to test loopback and easily see the
case where tp->rcv_ssthresh is larger than tp->rcv_wnd, which means
tp->rcv_wnd is not the upper bound as you said.

Thanks,
Jason

> ---
> Notes: the 'Fixes' commit is only in net-next
> ---
>  net/ipv4/tcp.c | 12 ++++++------
>  1 file changed, 6 insertions(+), 6 deletions(-)
>
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index eb5a60c7a9ccdd23fb78a74d614c18c4f7e281c9..46951e74930844af952dfbc57a107b504d4e296b 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -3693,7 +3693,7 @@ EXPORT_SYMBOL(tcp_sock_set_keepcnt);
>
>  int tcp_set_window_clamp(struct sock *sk, int val)
>  {
> -	u32 old_window_clamp, new_window_clamp;
> +	u32 old_window_clamp, new_window_clamp, new_rcv_ssthresh;
>  	struct tcp_sock *tp = tcp_sk(sk);
>
>  	if (!val) {
> @@ -3714,12 +3714,12 @@ int tcp_set_window_clamp(struct sock *sk, int val)
>  	/* Need to apply the reserved mem provisioning only
>  	 * when shrinking the window clamp.
>  	 */
> -	if (new_window_clamp < old_window_clamp)
> +	if (new_window_clamp < old_window_clamp) {
>  		__tcp_adjust_rcv_ssthresh(sk, new_window_clamp);
> -	else
> -		tp->rcv_ssthresh = clamp(new_window_clamp,
> -					 tp->rcv_ssthresh,
> -					 tp->rcv_wnd);
> +	} else {
> +		new_rcv_ssthresh = min(tp->rcv_wnd, new_window_clamp);
> +		tp->rcv_ssthresh = max(new_rcv_ssthresh, tp->rcv_ssthresh);
> +	}
>  	return 0;
>  }
>
> ---
> base-commit: c62e6f056ea308d6382450c1cb32e41727375885
> change-id: 20250305-net-next-fix-tcp-win-clamp-9f4c417ff44d
>
> Best regards,
> --
> Matthieu Baerts (NGI0) <matttbe@kernel.org>
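For illustration, the behavioural difference can be reproduced in a few
lines of standalone C. This is a minimal sketch, assuming stand-in
MIN/MAX/CLAMP macros that mirror the kernel's u32 semantics; the inputs
are the lo > hi row (128512, 109595, 80256) from the table quoted above.

/* Minimal userspace sketch -- illustrative only. The stand-in macros
 * mirror the kernel's min()/max()/clamp() semantics for u32 values.
 */
#include <stdio.h>
#include <stdint.h>

#define MIN(a, b)          ((a) < (b) ? (a) : (b))
#define MAX(a, b)          ((a) > (b) ? (a) : (b))
#define CLAMP(val, lo, hi) MIN(MAX((val), (lo)), (hi))

int main(void)
{
	uint32_t new_window_clamp = 128512; /* val */
	uint32_t rcv_ssthresh     = 109595; /* lo  */
	uint32_t rcv_wnd          = 80256;  /* hi  */

	/* Before the cleanup: rcv_ssthresh can only grow */
	uint32_t before = MAX(MIN(rcv_wnd, new_window_clamp), rcv_ssthresh);

	/* After the cleanup: clamp() caps at hi, silently lowering
	 * rcv_ssthresh whenever lo > hi
	 */
	uint32_t after = CLAMP(new_window_clamp, rcv_ssthresh, rcv_wnd);

	printf("before cleanup: %u (unchanged)\n", before); /* 109595 */
	printf("after cleanup:  %u (lowered)\n",  after);   /* 80256  */
	return 0;
}

With these inputs the old expression leaves rcv_ssthresh at 109595,
while clamp() lowers it to 80256, exactly the divergence the changelog
describes.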
On Thu, Mar 6, 2025 at 6:22 AM Jason Xing <kerneljasonxing@gmail.com> wrote:
>
> On Wed, Mar 5, 2025 at 10:49 PM Matthieu Baerts (NGI0)
> <matttbe@kernel.org> wrote:
> >
> > A recent cleanup changed the behaviour of tcp_set_window_clamp(). This
> > looks unintentional, and affects MPTCP selftests, e.g. some tests
> > re-establishing a connection after a disconnect are now unstable.
> >
(...)
> > Fixes: 863a952eb79a ("tcp: tcp_set_window_clamp() cleanup")
> > Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/551
> > Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
>
> Tested-by: Jason Xing <kerneljasonxing@gmail.com>
>
> Thanks for catching this. I should have done more tests :(
>
> Now I use netperf with TCP_CRR to test loopback and easily see the
> case where tp->rcv_ssthresh is larger than tp->rcv_wnd, which means
> tp->rcv_wnd is not the upper bound as you said.
>
> Thanks,
> Jason

Patch looks fine to me but all our tests are passing with the current kernel,
and I was not able to trigger the condition.

Can you share what precise test you did ?

Thanks !
Hi Eric,

On 06/03/2025 10:45, Eric Dumazet wrote:
> On Thu, Mar 6, 2025 at 6:22 AM Jason Xing <kerneljasonxing@gmail.com> wrote:
>>
>> On Wed, Mar 5, 2025 at 10:49 PM Matthieu Baerts (NGI0)
>> <matttbe@kernel.org> wrote:
>>>
(...)
>
> Patch looks fine to me but all our tests are passing with the current kernel,
> and I was not able to trigger the condition.

Thank you for having looked at this patch!

> Can you share what precise test you did ?

To be able to get a situation where "rcv_ssthresh > rcv_wnd", I simply
executed the MPTCP Connect selftest. You can also force creating TCP
only connections with '-tt', e.g.

  ./mptcp_connect.sh -tt

To be able to reproduce the issue with the selftests mentioned in [1],
I simply executed ./mptcp_connect.sh in a loop after having applied
this small patch to execute only a part of the subtests ("disconnect"):

> diff --git a/tools/testing/selftests/net/mptcp/mptcp_connect.sh b/tools/testing/selftests/net/mptcp/mptcp_connect.sh
> index 5e3c56253274..d8ebea5abc6c 100755
> --- a/tools/testing/selftests/net/mptcp/mptcp_connect.sh
> +++ b/tools/testing/selftests/net/mptcp/mptcp_connect.sh
> @@ -855,6 +855,7 @@ make_file "$sin" "server"
>
>  mptcp_lib_subtests_last_ts_reset
>
> +if false; then
>  check_mptcp_disabled
>
>  stop_if_error "The kernel configuration is not valid for MPTCP"
> @@ -882,6 +883,7 @@ mptcp_lib_result_code "${ret}" "ping tests"
>
>  stop_if_error "Could not even run ping tests"
>  mptcp_lib_pr_ok
> +fi
>
>  [ -n "$tc_loss" ] && tc -net "$ns2" qdisc add dev ns2eth3 root netem loss random $tc_loss delay ${tc_delay}ms
>  tc_info="loss of $tc_loss "
> @@ -910,6 +912,7 @@ mptcp_lib_pr_info "Using ${tc_info}on ns3eth4"
>
>  tc -net "$ns3" qdisc add dev ns3eth4 root netem delay ${reorder_delay}ms $tc_reorder
>
> +if false; then
>  TEST_GROUP="loopback v4"
>  run_tests_lo "$ns1" "$ns1" 10.0.1.1 1
>  stop_if_error "Could not even run loopback test"
> @@ -959,6 +962,7 @@ log_if_error "Tests with MPTFO have failed"
>  run_test_transparent 10.0.3.1 "tproxy ipv4"
>  run_test_transparent dead:beef:3::1 "tproxy ipv6"
>  log_if_error "Tests with tproxy have failed"
> +fi
>
>  run_tests_disconnect
>  log_if_error "Tests of the full disconnection have failed"

Note that our CI was able to easily reproduce it. Locally, it was taking
around 30 to 50 iterations to reproduce the issue.

[1] https://github.com/multipath-tcp/mptcp_net-next/issues/551

Cheers,
Matt
On Thu, Mar 6, 2025 at 10:55 AM Matthieu Baerts <matttbe@kernel.org> wrote:
>
> Hi Eric,
>
> On 06/03/2025 10:45, Eric Dumazet wrote:
(...)
> > Can you share what precise test you did ?
>
> To be able to get a situation where "rcv_ssthresh > rcv_wnd", I simply
> executed the MPTCP Connect selftest. You can also force creating TCP
> only connections with '-tt', e.g.
>
>   ./mptcp_connect.sh -tt

I was asking Jason about TCP tests. He mentioned TCP_CRR

I made several of them, with temporary debug in the kernel that did
not show the issue.

I am wondering if this could hide an issue in MPTCP ?
On 06/03/2025 11:02, Eric Dumazet wrote:
> On Thu, Mar 6, 2025 at 10:55 AM Matthieu Baerts <matttbe@kernel.org> wrote:
>>
(...)
>>
>> To be able to get a situation where "rcv_ssthresh > rcv_wnd", I simply
>> executed the MPTCP Connect selftest. You can also force creating TCP
>> only connections with '-tt', e.g.
>>
>>   ./mptcp_connect.sh -tt
>
> I was asking Jason about TCP tests. He mentioned TCP_CRR

Oops, I'm sorry, I didn't look at the "To:" field.

> I made several of them, with temporary debug in the kernel that did
> not show the issue.
>
> I am wondering if this could hide an issue in MPTCP ?

Indeed, I was wondering the same thing. I didn't see anything obvious
when looking at this issue. The behaviours around the window clamping,
with MPTCP single flow, and "plain" TCP were quite similar I think.

Cheers,
Matt
On Thu, Mar 6, 2025 at 11:12 AM Matthieu Baerts <matttbe@kernel.org> wrote:
>
> On 06/03/2025 11:02, Eric Dumazet wrote:
> > On Thu, Mar 6, 2025 at 10:55 AM Matthieu Baerts <matttbe@kernel.org> wrote:
> >>
(...)
> >
> > I was asking Jason about TCP tests. He mentioned TCP_CRR
>
> Oops, I'm sorry, I didn't look at the "To:" field.
>
> > I made several of them, with temporary debug in the kernel that did
> > not show the issue.
> >
> > I am wondering if this could hide an issue in MPTCP ?
>
> Indeed, I was wondering the same thing. I didn't see anything obvious
> when looking at this issue. The behaviours around the window clamping,
> with MPTCP single flow, and "plain" TCP were quite similar I think.

OK, let me run mptcp tests just in case I see something dubious.
On Thu, Mar 6, 2025 at 11:16 AM Eric Dumazet <edumazet@google.com> wrote:
>
> On Thu, Mar 6, 2025 at 11:12 AM Matthieu Baerts <matttbe@kernel.org> wrote:
> >
(...)
> > Indeed, I was wondering the same thing. I didn't see anything obvious
> > when looking at this issue. The behaviours around the window clamping,
> > with MPTCP single flow, and "plain" TCP were quite similar I think.
>
> OK, let me run mptcp tests just in case I see something dubious.

I have no idea why only MPTCP flows can trigger the condition, I do
not think it matters anyway.

Reviewed-by: Eric Dumazet <edumazet@google.com>
On Thu, Mar 6, 2025 at 5:45 PM Eric Dumazet <edumazet@google.com> wrote:
>
> On Thu, Mar 6, 2025 at 6:22 AM Jason Xing <kerneljasonxing@gmail.com> wrote:
> >
(...)
>
> Patch looks fine to me but all our tests are passing with the current kernel,
> and I was not able to trigger the condition.
>
> Can you share what precise test you did ?
>
> Thanks !

I did the test[1] on the virtual machine running the kernel [2]. And
after seeing your reply, I checked out a clean branch and compiled the
kernel with the patch reverted again and rebooted. The case can still
be reliably reproduced on my machine. Here are some outputs from the
BPF program[3]:

sudo bpftrace tcp_cap.bt.2
Attaching 1 probe...

4327813, 4326775, 4310912
netperf
        tcp_set_window_clamp+1
        tcp_data_queue+1744
        tcp_rcv_established+501
        tcp_v4_do_rcv+369
        tcp_v4_rcv+4800
        ip_protocol_deliver_rcu+65

4327813, 4326827, 4310912
netperf
        tcp_set_window_clamp+1
        tcp_data_queue+1744
        tcp_rcv_established+501
        tcp_v4_do_rcv+369
        tcp_v4_rcv+4800
        ip_protocol_deliver_rcu+65

418081, 417052, 417024
swapper/11
        tcp_set_window_clamp+1
        tcp_data_queue+1744
        tcp_rcv_established+501
        tcp_v4_do_rcv+369
        tcp_v4_rcv+4800
        ip_protocol_deliver_rcu+65

I can help if you want to see something else.

Thanks,
Jason

[1]: netperf -H 127.0.0.1 and netperf -H 127.0.0.1 -t TCP_CRR

[2]: based on commit f130a0cc1b4ff1 with the patch reverted:

commit 196150bf8e912eb27d4f083fc223ad8787709c6f (HEAD -> main)
Author: Jason Xing <kerneljasonxing@gmail.com>
Date:   Thu Mar 6 19:00:47 2025 +0800

    Revert "tcp: tcp_set_window_clamp() cleanup"

    This reverts commit 863a952eb79a6acf2b1f654f4e75ed104ff4cc81.

commit f130a0cc1b4ff1ef28a307428d40436032e2b66e (origin/main, origin/HEAD)
Author: Eric Dumazet <edumazet@google.com>
Date:   Tue Mar 4 12:59:18 2025 +0000

    inet: fix lwtunnel_valid_encap_type() lock imbalance

    After blamed commit rtm_to_fib_config() now calls
    lwtunnel_valid_encap_type{_attr}() without RTNL held,
    triggering an unlock balance in __rtnl_unlock,
    as reported by syzbot [1]

[3]:
kprobe:tcp_set_window_clamp
{
	$sk = (struct sock *) arg0;
	$val = arg1;
	$tp = (struct tcp_sock *) $sk;
	$dport = $sk->__sk_common.skc_dport;
	$dport = bswap($dport);
	$lport = $sk->__sk_common.skc_num;
	if ($tp->rcv_ssthresh > $tp->rcv_wnd) {
		printf("%u, %u, %u\n", $tp->window_clamp, $tp->rcv_ssthresh, $tp->rcv_wnd);
		printf("%s %s\n", comm, kstack(6));
	}
}
Hi Eric,

On 06/03/2025 12:08, Eric Dumazet wrote:
> On Thu, Mar 6, 2025 at 11:16 AM Eric Dumazet <edumazet@google.com> wrote:
>> On Thu, Mar 6, 2025 at 11:12 AM Matthieu Baerts <matttbe@kernel.org> wrote:
>>> On 06/03/2025 11:02, Eric Dumazet wrote:

(...)

>>>> I am wondering if this could hide an issue in MPTCP ?
>>> Indeed, I was wondering the same thing. I didn't see anything obvious
>>> when looking at this issue. The behaviours around the window clamping,
>>> with MPTCP single flow, and "plain" TCP were quite similar I think.
>>
>> OK, let me run mptcp tests just in case I see something dubious.
>
> I have no idea why only MPTCP flows can trigger the condition, I do
> not think it matters anyway.

Thank you for having looked!

Did you manage to trigger this condition with "plain" TCP (without
MPTCP) when using "./mptcp_connect.sh -tt"? The TCP only connections
will be printed like this:

  nsX TCP -> nsY (<IP>:<port>) TCP

But yes, it probably doesn't matter.

Cheers,
Matt
Hello:

This patch was applied to netdev/net-next.git (main)
by Jakub Kicinski <kuba@kernel.org>:

On Wed, 05 Mar 2025 15:49:48 +0100 you wrote:
> A recent cleanup changed the behaviour of tcp_set_window_clamp(). This
> looks unintentional, and affects MPTCP selftests, e.g. some tests
> re-establishing a connection after a disconnect are now unstable.
>
> Before the cleanup, this operation was done:
>
>   new_rcv_ssthresh = min(tp->rcv_wnd, new_window_clamp);
>   tp->rcv_ssthresh = max(new_rcv_ssthresh, tp->rcv_ssthresh);
>
> [...]

Here is the summary with links:
  - [net-next] tcp: clamp window like before the cleanup
    https://git.kernel.org/netdev/net-next/c/8e0e8bef4841

You are awesome, thank you!
On Thu, Mar 6, 2025 at 7:18 PM Jason Xing <kerneljasonxing@gmail.com> wrote:
>
> On Thu, Mar 6, 2025 at 5:45 PM Eric Dumazet <edumazet@google.com> wrote:
> >
(...)
> > Patch looks fine to me but all our tests are passing with the current kernel,
> > and I was not able to trigger the condition.
> >
> > Can you share what precise test you did ?
>
> I did the test[1] on the virtual machine running the kernel [2]. And
> after seeing your reply, I checked out a clean branch and compiled the
> kernel with the patch reverted again and rebooted. The case can still
> be reliably reproduced on my machine. Here are some outputs from the
> BPF program[3]:
>
(...)

Hi Eric, Matthieu

I did a quick analysis on this case. It turned out that __release_sock()
was dealing with the skbs in sk_backlog one by one; then:

1) one skb went into tcp_grow_window(). At the beginning, rcv_ssthresh
is equal to rcv_wnd, but later rcv_ssthresh will increase by 'incr',
which means the updated rcv_ssthresh is larger than rcv_wnd.

2) another skb went into tcp_set_window_clamp(), and, as I saw
yesterday, the issue happened: rcv_ssthresh > rcv_wnd.

As to the rcv_wnd: in synchronised states, it can only be
changed/adjusted to new_win in tcp_select_window(). Thus, between the
above 1) and 2), the sk didn't have the chance to update its rcv_wnd
value, so rcv_ssthresh ended up above rcv_wnd.

I attached two consecutive calltraces here:

[Fri Mar 7 08:00:52 2025] netserver, 1, 91234, 65483, (25751,130966)
[Fri Mar 7 08:00:52 2025] CPU: 0 UID: 266980 PID: 17465 Comm: netserver Kdump: loaded Not tainted 6.14.0-rc4+ #415
[Fri Mar 7 08:00:52 2025] Hardware name: Tencent Cloud CVM, BIOS seabios-1.9.1-qemu-project.org 04/01/2014
[Fri Mar 7 08:00:52 2025] Call Trace:
[Fri Mar 7 08:00:52 2025]  <TASK>
[Fri Mar 7 08:00:52 2025]  dump_stack_lvl+0x5b/0x70
[Fri Mar 7 08:00:52 2025]  dump_stack+0x10/0x20
[Fri Mar 7 08:00:52 2025]  tcp_grow_window+0x297/0x320
[Fri Mar 7 08:00:52 2025]  tcp_event_data_recv+0x265/0x400
[Fri Mar 7 08:00:52 2025]  tcp_data_queue+0x6d0/0xc70
[Fri Mar 7 08:00:52 2025]  tcp_rcv_established+0x1f5/0x760
[Fri Mar 7 08:00:52 2025]  ? schedule_timeout+0xe5/0x100
[Fri Mar 7 08:00:52 2025]  tcp_v4_do_rcv+0x171/0x2d0
[Fri Mar 7 08:00:52 2025]  __release_sock+0xd1/0xe0
[Fri Mar 7 08:00:52 2025]  release_sock+0x30/0xa0
[Fri Mar 7 08:00:52 2025]  inet_accept+0x5c/0x80
[Fri Mar 7 08:00:52 2025]  do_accept+0xf1/0x180
.....
[Fri Mar 7 08:00:52 2025] netserver, 0, 91234, 65483
[Fri Mar 7 08:00:52 2025] CPU: 0 UID: 266980 PID: 17465 Comm: netserver Kdump: loaded Not tainted 6.14.0-rc4+ #415
[Fri Mar 7 08:00:52 2025] Hardware name: Tencent Cloud CVM, BIOS seabios-1.9.1-qemu-project.org 04/01/2014
[Fri Mar 7 08:00:52 2025] Call Trace:
[Fri Mar 7 08:00:52 2025]  <TASK>
[Fri Mar 7 08:00:52 2025]  dump_stack_lvl+0x5b/0x70
[Fri Mar 7 08:00:52 2025]  dump_stack+0x10/0x20
[Fri Mar 7 08:00:52 2025]  tcp_set_window_clamp+0xc3/0x1f0
[Fri Mar 7 08:00:52 2025]  tcp_event_data_recv+0x35b/0x400
[Fri Mar 7 08:00:52 2025]  tcp_data_queue+0x6d0/0xc70
[Fri Mar 7 08:00:52 2025]  tcp_rcv_established+0x1f5/0x760
[Fri Mar 7 08:00:52 2025]  ? schedule_timeout+0xe5/0x100
[Fri Mar 7 08:00:52 2025]  tcp_v4_do_rcv+0x171/0x2d0
[Fri Mar 7 08:00:52 2025]  __release_sock+0xd1/0xe0
[Fri Mar 7 08:00:52 2025]  release_sock+0x30/0xa0
[Fri Mar 7 08:00:52 2025]  inet_accept+0x5c/0x80
[Fri Mar 7 08:00:52 2025]  do_accept+0xf1/0x180

The test is simple: on my virtual machine, just run "netperf -H
127.0.0.1", which uses TCP_STREAM by default.

Thanks,
Jason
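The two-step sequence described above can be mimicked with a short
standalone C sketch. This is a toy model under stated assumptions, not
the real kernel code: the starting values, the 'incr', and the new
clamp value are hypothetical stand-ins, chosen only to show how the old
min/max path and the clamp() path diverge once rcv_ssthresh has
outgrown a stale rcv_wnd.

/* Toy model of the sequence above -- NOT the real kernel functions.
 * All numeric values are hypothetical stand-ins.
 */
#include <stdio.h>
#include <stdint.h>

int main(void)
{
	uint32_t rcv_ssthresh = 65536;   /* starts equal to rcv_wnd */
	uint32_t rcv_wnd      = 65536;   /* only refreshed later, in
					  * tcp_select_window() */
	uint32_t window_clamp = 4194304; /* hypothetical large clamp */

	/* Step 1: a tcp_grow_window()-like step raises rcv_ssthresh
	 * by 'incr', bounded by window_clamp but not by rcv_wnd. */
	uint32_t incr = 25600;           /* hypothetical increment */
	uint32_t grown = rcv_ssthresh + incr;
	rcv_ssthresh = grown < window_clamp ? grown : window_clamp;
	printf("after grow: rcv_ssthresh=%u > rcv_wnd=%u\n",
	       rcv_ssthresh, rcv_wnd);

	/* Step 2: tcp_set_window_clamp() runs with a larger clamp
	 * while rcv_ssthresh > rcv_wnd; the two paths diverge. */
	uint32_t val = 131072;           /* hypothetical new clamp */

	/* old path: max(min(rcv_wnd, val), rcv_ssthresh) -- only grows */
	uint32_t old_path = rcv_wnd < val ? rcv_wnd : val;
	old_path = old_path > rcv_ssthresh ? old_path : rcv_ssthresh;

	/* new path: clamp(val, rcv_ssthresh, rcv_wnd) -- caps at rcv_wnd */
	uint32_t new_path = val > rcv_ssthresh ? val : rcv_ssthresh;
	new_path = new_path < rcv_wnd ? new_path : rcv_wnd;

	printf("old min/max keeps %u, clamp() lowers to %u\n",
	       old_path, new_path);
	return 0;
}

With these stand-in values, rcv_ssthresh grows to 91136 while rcv_wnd
stays at 65536; the old expression then keeps 91136, whereas clamp()
drops it to 65536, matching the regression the thread diagnoses.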
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index eb5a60c7a9ccdd23fb78a74d614c18c4f7e281c9..46951e74930844af952dfbc57a107b504d4e296b 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -3693,7 +3693,7 @@ EXPORT_SYMBOL(tcp_sock_set_keepcnt);
 
 int tcp_set_window_clamp(struct sock *sk, int val)
 {
-	u32 old_window_clamp, new_window_clamp;
+	u32 old_window_clamp, new_window_clamp, new_rcv_ssthresh;
 	struct tcp_sock *tp = tcp_sk(sk);
 
 	if (!val) {
@@ -3714,12 +3714,12 @@ int tcp_set_window_clamp(struct sock *sk, int val)
 	/* Need to apply the reserved mem provisioning only
 	 * when shrinking the window clamp.
 	 */
-	if (new_window_clamp < old_window_clamp)
+	if (new_window_clamp < old_window_clamp) {
 		__tcp_adjust_rcv_ssthresh(sk, new_window_clamp);
-	else
-		tp->rcv_ssthresh = clamp(new_window_clamp,
-					 tp->rcv_ssthresh,
-					 tp->rcv_wnd);
+	} else {
+		new_rcv_ssthresh = min(tp->rcv_wnd, new_window_clamp);
+		tp->rcv_ssthresh = max(new_rcv_ssthresh, tp->rcv_ssthresh);
+	}
 	return 0;
 }
A recent cleanup changed the behaviour of tcp_set_window_clamp(). This
looks unintentional, and affects MPTCP selftests, e.g. some tests
re-establishing a connection after a disconnect are now unstable.

Before the cleanup, this operation was done:

  new_rcv_ssthresh = min(tp->rcv_wnd, new_window_clamp);
  tp->rcv_ssthresh = max(new_rcv_ssthresh, tp->rcv_ssthresh);

The cleanup used the 'clamp' macro which takes 3 arguments -- value,
lowest, and highest -- and returns a value between the lowest and the
highest allowable values. This then assumes ...

  lowest (rcv_ssthresh) <= highest (rcv_wnd)

... which doesn't seem to be always the case here according to the MPTCP
selftests, even when running them without MPTCP, but only TCP.

For example, when we have ...

  rcv_wnd < rcv_ssthresh < new_rcv_ssthresh

... before the cleanup, the rcv_ssthresh was not changed, while after
the cleanup, it is lowered down to rcv_wnd (highest).

During a simple test with TCP, here are the values I observed:

  new_window_clamp (val)  rcv_ssthresh (lo)   rcv_wnd (hi)
          117760 (out)        65495       <       65536
          128512 (out)       109595       >       80256   => lo > hi
         1184975 (out)       328987       <      329088

          113664 (out)        65483       <       65536
          117760 (out)       110968       <      110976
          129024 (out)       116527       >      109696   => lo > hi

Here, we can see that it is not that rare to have rcv_ssthresh (lo)
higher than rcv_wnd (hi), hence the different behaviour now that the
clamp() macro is used, even without MPTCP.

Note: new_window_clamp is always out of range (rcv_ssthresh < rcv_wnd)
here, which seems to be generally the case in my tests with small
connections.

I then suggest reverting this part, so as not to change the behaviour.

Fixes: 863a952eb79a ("tcp: tcp_set_window_clamp() cleanup")
Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/551
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
---
Notes: the 'Fixes' commit is only in net-next
---
 net/ipv4/tcp.c | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

---
base-commit: c62e6f056ea308d6382450c1cb32e41727375885
change-id: 20250305-net-next-fix-tcp-win-clamp-9f4c417ff44d

Best regards,
--
Matthieu Baerts (NGI0) <matttbe@kernel.org>
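As a sanity check of the observations above, here is a small standalone
C program (illustrative only) that replays the six observed triples
from the table and flags the rows where rcv_ssthresh (lo) exceeds
rcv_wnd (hi), i.e. exactly the rows where clamp() and the original
min/max expression disagree.

/* Replay the observed triples from the changelog table and flag the
 * rows where rcv_ssthresh (lo) exceeds rcv_wnd (hi) -- illustrative only.
 */
#include <stdio.h>
#include <stdint.h>

int main(void)
{
	struct { uint32_t val, lo, hi; } rows[] = {
		{  117760,  65495,  65536 },
		{  128512, 109595,  80256 },
		{ 1184975, 328987, 329088 },
		{  113664,  65483,  65536 },
		{  117760, 110968, 110976 },
		{  129024, 116527, 109696 },
	};

	for (unsigned i = 0; i < sizeof(rows) / sizeof(rows[0]); i++) {
		/* val is out of the [lo, hi] range in every sample */
		printf("val=%7u  lo=%6u %c hi=%6u%s\n",
		       rows[i].val, rows[i].lo,
		       rows[i].lo > rows[i].hi ? '>' : '<',
		       rows[i].hi,
		       rows[i].lo > rows[i].hi ? "  => lo > hi" : "");
	}
	return 0;
}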