Message ID | 20240726204105.1466841-1-quic_subashab@quicinc.com (mailing list archive) |
---|---|
State | Accepted |
Commit | 05f76b2d634e65ab34472802d9b142ea9e03f74e |
Delegated to: | Netdev Maintainers |
Headers | show |
Series | [net,v2] tcp: Adjust clamping window for applications specifying SO_RCVBUF | expand |
On Fri, Jul 26, 2024 at 10:41 PM Subash Abhinov Kasiviswanathan <quic_subashab@quicinc.com> wrote: > > tp->scaling_ratio is not updated based on skb->len/skb->truesize once > SO_RCVBUF is set leading to the maximum window scaling to be 25% of > rcvbuf after > commit dfa2f0483360 ("tcp: get rid of sysctl_tcp_adv_win_scale") > and 50% of rcvbuf after > commit 697a6c8cec03 ("tcp: increase the default TCP scaling ratio"). > 50% tries to emulate the behavior of older kernels using > sysctl_tcp_adv_win_scale with default value. > > Systems which were using a different values of sysctl_tcp_adv_win_scale > in older kernels ended up seeing reduced download speeds in certain > cases as covered in https://lists.openwall.net/netdev/2024/05/15/13 > While the sysctl scheme is no longer acceptable, the value of 50% is > a bit conservative when the skb->len/skb->truesize ratio is later > determined to be ~0.66. > > Applications not specifying SO_RCVBUF update the window scaling and > the receiver buffer every time data is copied to userspace. This > computation is now used for applications setting SO_RCVBUF to update > the maximum window scaling while ensuring that the receive buffer > is within the application specified limit. > > Fixes: dfa2f0483360 ("tcp: get rid of sysctl_tcp_adv_win_scale") > Signed-off-by: Sean Tranchetti <quic_stranche@quicinc.com> > Signed-off-by: Subash Abhinov Kasiviswanathan <quic_subashab@quicinc.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Thanks.
Hello: This patch was applied to netdev/net.git (main) by David S. Miller <davem@davemloft.net>: On Fri, 26 Jul 2024 13:41:05 -0700 you wrote: > tp->scaling_ratio is not updated based on skb->len/skb->truesize once > SO_RCVBUF is set leading to the maximum window scaling to be 25% of > rcvbuf after > commit dfa2f0483360 ("tcp: get rid of sysctl_tcp_adv_win_scale") > and 50% of rcvbuf after > commit 697a6c8cec03 ("tcp: increase the default TCP scaling ratio"). > 50% tries to emulate the behavior of older kernels using > sysctl_tcp_adv_win_scale with default value. > > [...] Here is the summary with links: - [net,v2] tcp: Adjust clamping window for applications specifying SO_RCVBUF https://git.kernel.org/netdev/net/c/05f76b2d634e You are awesome, thank you!
On Fri, Jul 26, 2024 at 4:41 PM Subash Abhinov Kasiviswanathan <quic_subashab@quicinc.com> wrote: > > tp->scaling_ratio is not updated based on skb->len/skb->truesize once > SO_RCVBUF is set leading to the maximum window scaling to be 25% of > rcvbuf after > commit dfa2f0483360 ("tcp: get rid of sysctl_tcp_adv_win_scale") > and 50% of rcvbuf after > commit 697a6c8cec03 ("tcp: increase the default TCP scaling ratio"). > 50% tries to emulate the behavior of older kernels using > sysctl_tcp_adv_win_scale with default value. > > Systems which were using a different values of sysctl_tcp_adv_win_scale > in older kernels ended up seeing reduced download speeds in certain > cases as covered in https://lists.openwall.net/netdev/2024/05/15/13 > While the sysctl scheme is no longer acceptable, the value of 50% is > a bit conservative when the skb->len/skb->truesize ratio is later > determined to be ~0.66. > > Applications not specifying SO_RCVBUF update the window scaling and > the receiver buffer every time data is copied to userspace. This > computation is now used for applications setting SO_RCVBUF to update > the maximum window scaling while ensuring that the receive buffer > is within the application specified limit. > > Fixes: dfa2f0483360 ("tcp: get rid of sysctl_tcp_adv_win_scale") > Signed-off-by: Sean Tranchetti <quic_stranche@quicinc.com> > Signed-off-by: Subash Abhinov Kasiviswanathan <quic_subashab@quicinc.com> > --- > v1 -> v2 > Update the condition for SO_RCVBUF window_clamp updates to always > monitor the current rcvbuf value as suggested by Eric. > > net/ipv4/tcp_input.c | 23 ++++++++++++++++------- > 1 file changed, 16 insertions(+), 7 deletions(-) > > diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c > index 454362e359da..e2b9583ed96a 100644 > --- a/net/ipv4/tcp_input.c > +++ b/net/ipv4/tcp_input.c > @@ -754,8 +754,7 @@ void tcp_rcv_space_adjust(struct sock *sk) > * <prev RTT . ><current RTT .. ><next RTT .... > > */ > > - if (READ_ONCE(sock_net(sk)->ipv4.sysctl_tcp_moderate_rcvbuf) && > - !(sk->sk_userlocks & SOCK_RCVBUF_LOCK)) { > + if (READ_ONCE(sock_net(sk)->ipv4.sysctl_tcp_moderate_rcvbuf)) { > u64 rcvwin, grow; > int rcvbuf; > > @@ -771,12 +770,22 @@ void tcp_rcv_space_adjust(struct sock *sk) > > rcvbuf = min_t(u64, tcp_space_from_win(sk, rcvwin), > READ_ONCE(sock_net(sk)->ipv4.sysctl_tcp_rmem[2])); > - if (rcvbuf > sk->sk_rcvbuf) { > - WRITE_ONCE(sk->sk_rcvbuf, rcvbuf); > + if (!(sk->sk_userlocks & SOCK_RCVBUF_LOCK)) { > + if (rcvbuf > sk->sk_rcvbuf) { > + WRITE_ONCE(sk->sk_rcvbuf, rcvbuf); > > - /* Make the window clamp follow along. */ > - WRITE_ONCE(tp->window_clamp, > - tcp_win_from_space(sk, rcvbuf)); > + /* Make the window clamp follow along. */ > + WRITE_ONCE(tp->window_clamp, > + tcp_win_from_space(sk, rcvbuf)); > + } > + } else { > + /* Make the window clamp follow along while being bounded > + * by SO_RCVBUF. > + */ > + int clamp = tcp_win_from_space(sk, min(rcvbuf, sk->sk_rcvbuf)); > + > + if (clamp > tp->window_clamp) > + WRITE_ONCE(tp->window_clamp, clamp); > } > } > tp->rcvq_space.space = copied; > -- Is this the correct place to put this new code to update tp->window_clamp? AFAICT it's not the correct place. If a system administrator has disabled receive buffer autotuning by setting `sysctl net.ipv4.tcp_moderate_rcvbuf=0`, or if (copied <= tp->rcvq_space.space), then TCP connections will not reach this new code, and the window_clamp will not be adjusted, and the receive window will still be too low. Even if a system administrator has disabled receive buffer autotuning by setting `sysctl net.ipv4.tcp_moderate_rcvbuf=0`, or even if (copied <= tp->rcvq_space.space), AFAICT we still want the correct receive window value for whatever sk->sk_rcvbuf we have, based on the correct tp->scaling_ratio. So AFAICT the correct place to put this kind of logic is in tcp_measure_rcv_mss(). If we compute a new scaling_ratio and it's different than tp->scaling_ratio, then it seems we should compute a new window_clamp value using sk->sk_rcvbuf, and if the new window_clamp value is different then we should WRITE_ONCE that value into tp->window_clamp. That way we can have the correct tp->window_clamp, no matter the value of net.ipv4.tcp_moderate_rcvbuf, and even if (copied <= tp->rcvq_space.space). How does that sound? neal
On Mon, Jul 29, 2024 at 4:51 PM Neal Cardwell <ncardwell@google.com> wrote: > > On Fri, Jul 26, 2024 at 4:41 PM Subash Abhinov Kasiviswanathan > <quic_subashab@quicinc.com> wrote: > > > > tp->scaling_ratio is not updated based on skb->len/skb->truesize once > > SO_RCVBUF is set leading to the maximum window scaling to be 25% of > > rcvbuf after > > commit dfa2f0483360 ("tcp: get rid of sysctl_tcp_adv_win_scale") > > and 50% of rcvbuf after > > commit 697a6c8cec03 ("tcp: increase the default TCP scaling ratio"). > > 50% tries to emulate the behavior of older kernels using > > sysctl_tcp_adv_win_scale with default value. > > > > Systems which were using a different values of sysctl_tcp_adv_win_scale > > in older kernels ended up seeing reduced download speeds in certain > > cases as covered in https://lists.openwall.net/netdev/2024/05/15/13 > > While the sysctl scheme is no longer acceptable, the value of 50% is > > a bit conservative when the skb->len/skb->truesize ratio is later > > determined to be ~0.66. > > > > Applications not specifying SO_RCVBUF update the window scaling and > > the receiver buffer every time data is copied to userspace. This > > computation is now used for applications setting SO_RCVBUF to update > > the maximum window scaling while ensuring that the receive buffer > > is within the application specified limit. > > > > Fixes: dfa2f0483360 ("tcp: get rid of sysctl_tcp_adv_win_scale") > > Signed-off-by: Sean Tranchetti <quic_stranche@quicinc.com> > > Signed-off-by: Subash Abhinov Kasiviswanathan <quic_subashab@quicinc.com> > > --- > > v1 -> v2 > > Update the condition for SO_RCVBUF window_clamp updates to always > > monitor the current rcvbuf value as suggested by Eric. > > > > net/ipv4/tcp_input.c | 23 ++++++++++++++++------- > > 1 file changed, 16 insertions(+), 7 deletions(-) > > > > diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c > > index 454362e359da..e2b9583ed96a 100644 > > --- a/net/ipv4/tcp_input.c > > +++ b/net/ipv4/tcp_input.c > > @@ -754,8 +754,7 @@ void tcp_rcv_space_adjust(struct sock *sk) > > * <prev RTT . ><current RTT .. ><next RTT .... > > > */ > > > > - if (READ_ONCE(sock_net(sk)->ipv4.sysctl_tcp_moderate_rcvbuf) && > > - !(sk->sk_userlocks & SOCK_RCVBUF_LOCK)) { > > + if (READ_ONCE(sock_net(sk)->ipv4.sysctl_tcp_moderate_rcvbuf)) { > > u64 rcvwin, grow; > > int rcvbuf; > > > > @@ -771,12 +770,22 @@ void tcp_rcv_space_adjust(struct sock *sk) > > > > rcvbuf = min_t(u64, tcp_space_from_win(sk, rcvwin), > > READ_ONCE(sock_net(sk)->ipv4.sysctl_tcp_rmem[2])); > > - if (rcvbuf > sk->sk_rcvbuf) { > > - WRITE_ONCE(sk->sk_rcvbuf, rcvbuf); > > + if (!(sk->sk_userlocks & SOCK_RCVBUF_LOCK)) { > > + if (rcvbuf > sk->sk_rcvbuf) { > > + WRITE_ONCE(sk->sk_rcvbuf, rcvbuf); > > > > - /* Make the window clamp follow along. */ > > - WRITE_ONCE(tp->window_clamp, > > - tcp_win_from_space(sk, rcvbuf)); > > + /* Make the window clamp follow along. */ > > + WRITE_ONCE(tp->window_clamp, > > + tcp_win_from_space(sk, rcvbuf)); > > + } > > + } else { > > + /* Make the window clamp follow along while being bounded > > + * by SO_RCVBUF. > > + */ > > + int clamp = tcp_win_from_space(sk, min(rcvbuf, sk->sk_rcvbuf)); > > + > > + if (clamp > tp->window_clamp) > > + WRITE_ONCE(tp->window_clamp, clamp); > > } > > } > > tp->rcvq_space.space = copied; > > -- > > Is this the correct place to put this new code to update > tp->window_clamp? AFAICT it's not the correct place. > > If a system administrator has disabled receive buffer autotuning by > setting `sysctl net.ipv4.tcp_moderate_rcvbuf=0`, or if (copied <= > tp->rcvq_space.space), then TCP connections will not reach this new > code, and the window_clamp will not be adjusted, and the receive > window will still be too low. > > Even if a system administrator has disabled receive buffer autotuning > by setting `sysctl net.ipv4.tcp_moderate_rcvbuf=0`, or even if (copied > <= tp->rcvq_space.space), AFAICT we still want the correct receive > window value for whatever sk->sk_rcvbuf we have, based on the correct > tp->scaling_ratio. > > So AFAICT the correct place to put this kind of logic is in > tcp_measure_rcv_mss(). If we compute a new scaling_ratio and it's > different than tp->scaling_ratio, then it seems we should compute a > new window_clamp value using sk->sk_rcvbuf, and if the new > window_clamp value is different then we should WRITE_ONCE that value > into tp->window_clamp. > > That way we can have the correct tp->window_clamp, no matter the value > of net.ipv4.tcp_moderate_rcvbuf, and even if (copied <= > tp->rcvq_space.space). > > How does that sound? Can this be done without adding new code in the fast path ? Otherwise, I feel that we send a wrong signal to 'administrators' : "We will maintain code to make sure that wrong sysctls settings were not so wrong." Are you aware of anyone changing net.ipv4.tcp_moderate_rcvbuf for any valid reason ?
On Mon, Jul 29, 2024 at 11:19 AM Eric Dumazet <edumazet@google.com> wrote: > > On Mon, Jul 29, 2024 at 4:51 PM Neal Cardwell <ncardwell@google.com> wrote: > > > > On Fri, Jul 26, 2024 at 4:41 PM Subash Abhinov Kasiviswanathan > > <quic_subashab@quicinc.com> wrote: > > > > > > tp->scaling_ratio is not updated based on skb->len/skb->truesize once > > > SO_RCVBUF is set leading to the maximum window scaling to be 25% of > > > rcvbuf after > > > commit dfa2f0483360 ("tcp: get rid of sysctl_tcp_adv_win_scale") > > > and 50% of rcvbuf after > > > commit 697a6c8cec03 ("tcp: increase the default TCP scaling ratio"). > > > 50% tries to emulate the behavior of older kernels using > > > sysctl_tcp_adv_win_scale with default value. > > > > > > Systems which were using a different values of sysctl_tcp_adv_win_scale > > > in older kernels ended up seeing reduced download speeds in certain > > > cases as covered in https://lists.openwall.net/netdev/2024/05/15/13 > > > While the sysctl scheme is no longer acceptable, the value of 50% is > > > a bit conservative when the skb->len/skb->truesize ratio is later > > > determined to be ~0.66. > > > > > > Applications not specifying SO_RCVBUF update the window scaling and > > > the receiver buffer every time data is copied to userspace. This > > > computation is now used for applications setting SO_RCVBUF to update > > > the maximum window scaling while ensuring that the receive buffer > > > is within the application specified limit. > > > > > > Fixes: dfa2f0483360 ("tcp: get rid of sysctl_tcp_adv_win_scale") > > > Signed-off-by: Sean Tranchetti <quic_stranche@quicinc.com> > > > Signed-off-by: Subash Abhinov Kasiviswanathan <quic_subashab@quicinc.com> > > > --- > > > v1 -> v2 > > > Update the condition for SO_RCVBUF window_clamp updates to always > > > monitor the current rcvbuf value as suggested by Eric. > > > > > > net/ipv4/tcp_input.c | 23 ++++++++++++++++------- > > > 1 file changed, 16 insertions(+), 7 deletions(-) > > > > > > diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c > > > index 454362e359da..e2b9583ed96a 100644 > > > --- a/net/ipv4/tcp_input.c > > > +++ b/net/ipv4/tcp_input.c > > > @@ -754,8 +754,7 @@ void tcp_rcv_space_adjust(struct sock *sk) > > > * <prev RTT . ><current RTT .. ><next RTT .... > > > > */ > > > > > > - if (READ_ONCE(sock_net(sk)->ipv4.sysctl_tcp_moderate_rcvbuf) && > > > - !(sk->sk_userlocks & SOCK_RCVBUF_LOCK)) { > > > + if (READ_ONCE(sock_net(sk)->ipv4.sysctl_tcp_moderate_rcvbuf)) { > > > u64 rcvwin, grow; > > > int rcvbuf; > > > > > > @@ -771,12 +770,22 @@ void tcp_rcv_space_adjust(struct sock *sk) > > > > > > rcvbuf = min_t(u64, tcp_space_from_win(sk, rcvwin), > > > READ_ONCE(sock_net(sk)->ipv4.sysctl_tcp_rmem[2])); > > > - if (rcvbuf > sk->sk_rcvbuf) { > > > - WRITE_ONCE(sk->sk_rcvbuf, rcvbuf); > > > + if (!(sk->sk_userlocks & SOCK_RCVBUF_LOCK)) { > > > + if (rcvbuf > sk->sk_rcvbuf) { > > > + WRITE_ONCE(sk->sk_rcvbuf, rcvbuf); > > > > > > - /* Make the window clamp follow along. */ > > > - WRITE_ONCE(tp->window_clamp, > > > - tcp_win_from_space(sk, rcvbuf)); > > > + /* Make the window clamp follow along. */ > > > + WRITE_ONCE(tp->window_clamp, > > > + tcp_win_from_space(sk, rcvbuf)); > > > + } > > > + } else { > > > + /* Make the window clamp follow along while being bounded > > > + * by SO_RCVBUF. > > > + */ > > > + int clamp = tcp_win_from_space(sk, min(rcvbuf, sk->sk_rcvbuf)); > > > + > > > + if (clamp > tp->window_clamp) > > > + WRITE_ONCE(tp->window_clamp, clamp); > > > } > > > } > > > tp->rcvq_space.space = copied; > > > -- > > > > Is this the correct place to put this new code to update > > tp->window_clamp? AFAICT it's not the correct place. > > > > If a system administrator has disabled receive buffer autotuning by > > setting `sysctl net.ipv4.tcp_moderate_rcvbuf=0`, or if (copied <= > > tp->rcvq_space.space), then TCP connections will not reach this new > > code, and the window_clamp will not be adjusted, and the receive > > window will still be too low. > > > > Even if a system administrator has disabled receive buffer autotuning > > by setting `sysctl net.ipv4.tcp_moderate_rcvbuf=0`, or even if (copied > > <= tp->rcvq_space.space), AFAICT we still want the correct receive > > window value for whatever sk->sk_rcvbuf we have, based on the correct > > tp->scaling_ratio. > > > > So AFAICT the correct place to put this kind of logic is in > > tcp_measure_rcv_mss(). If we compute a new scaling_ratio and it's > > different than tp->scaling_ratio, then it seems we should compute a > > new window_clamp value using sk->sk_rcvbuf, and if the new > > window_clamp value is different then we should WRITE_ONCE that value > > into tp->window_clamp. > > > > That way we can have the correct tp->window_clamp, no matter the value > > of net.ipv4.tcp_moderate_rcvbuf, and even if (copied <= > > tp->rcvq_space.space). > > > > How does that sound? > > Can this be done without adding new code in the fast path ? I was imagining that the code would not really be in the fast path, because it would move to the spot in tcp_measure_rcv_mss() where the segment length "len" is greater than icsk->icsk_ack.rcv_mss. I imagine that should be rare, and we are already doing a somewhat expensive do_div() call there, so I was imagining that the additional cost would be relatively rare and small? > Otherwise, I feel that we send a wrong signal to 'administrators' : > "We will maintain code to make sure that wrong sysctls settings were > not so wrong." > > Are you aware of anyone changing net.ipv4.tcp_moderate_rcvbuf for any > valid reason ? No, I'm not aware of any valid reason to disable net.ipv4.tcp_moderate_rcvbuf. :-) Even if tcp_moderate_rcvbuf is enabled, AFAICT there is still the issue that if (copied <= tp->rcvq_space.space) we will not get to this code... neal
On Mon, Jul 29, 2024 at 5:33 PM Neal Cardwell <ncardwell@google.com> wrote: > I was imagining that the code would not really be in the fast path, > because it would move to the spot in tcp_measure_rcv_mss() where the > segment length "len" is greater than icsk->icsk_ack.rcv_mss. I imagine > that should be rare, and we are already doing a somewhat expensive > do_div() call there, so I was imagining that the additional cost would > be relatively rare and small? Let's see your patch :) Thanks !
Hello Neal, > > Otherwise, I feel that we send a wrong signal to 'administrators' : > > "We will maintain code to make sure that wrong sysctls settings were > > not so wrong." > > > > Are you aware of anyone changing net.ipv4.tcp_moderate_rcvbuf for any > > valid reason ? > > No, I'm not aware of any valid reason to disable > net.ipv4.tcp_moderate_rcvbuf. :-) I was also curious about why this sysctl knob was useful a long time ago? I don't see any good in it (for many years, we haven't touched it, setting it to 1 as default). Since you maintainers don't think it's a good one, could we mark it as deprecated and remove this sysctl? Thanks, Jason
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index 454362e359da..e2b9583ed96a 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -754,8 +754,7 @@ void tcp_rcv_space_adjust(struct sock *sk) * <prev RTT . ><current RTT .. ><next RTT .... > */ - if (READ_ONCE(sock_net(sk)->ipv4.sysctl_tcp_moderate_rcvbuf) && - !(sk->sk_userlocks & SOCK_RCVBUF_LOCK)) { + if (READ_ONCE(sock_net(sk)->ipv4.sysctl_tcp_moderate_rcvbuf)) { u64 rcvwin, grow; int rcvbuf; @@ -771,12 +770,22 @@ void tcp_rcv_space_adjust(struct sock *sk) rcvbuf = min_t(u64, tcp_space_from_win(sk, rcvwin), READ_ONCE(sock_net(sk)->ipv4.sysctl_tcp_rmem[2])); - if (rcvbuf > sk->sk_rcvbuf) { - WRITE_ONCE(sk->sk_rcvbuf, rcvbuf); + if (!(sk->sk_userlocks & SOCK_RCVBUF_LOCK)) { + if (rcvbuf > sk->sk_rcvbuf) { + WRITE_ONCE(sk->sk_rcvbuf, rcvbuf); - /* Make the window clamp follow along. */ - WRITE_ONCE(tp->window_clamp, - tcp_win_from_space(sk, rcvbuf)); + /* Make the window clamp follow along. */ + WRITE_ONCE(tp->window_clamp, + tcp_win_from_space(sk, rcvbuf)); + } + } else { + /* Make the window clamp follow along while being bounded + * by SO_RCVBUF. + */ + int clamp = tcp_win_from_space(sk, min(rcvbuf, sk->sk_rcvbuf)); + + if (clamp > tp->window_clamp) + WRITE_ONCE(tp->window_clamp, clamp); } } tp->rcvq_space.space = copied;