Message ID | 20240517085031.18896-1-kerneljasonxing@gmail.com (mailing list archive) |
---|---|
State | RFC |
Delegated to: | Netdev Maintainers |
Headers | show |
Series | [RFC,net-next] tcp: break the limitation of initial receive window | expand |
On Fri, May 17, 2024 at 4:50 AM Jason Xing <kerneljasonxing@gmail.com> wrote: > > From: Jason Xing <kernelxing@tencent.com> > > Since in 2018 one commit a337531b942b ("tcp: up initial rmem to 128KB and > SYN rwin to around 64KB") limited received window within 65535, most CDN > team would not benefit from this change because they cannot have a large > window to receive a big packet one time especially in long RTT. > > According to RFC 7323, it says: > "The maximum receive window, and therefore the scale factor, is > determined by the maximum receive buffer space." > > So we can get rid of this 64k limitation and let the window be tunable if > the user wants to do it within the control of buffer space. Then many > companies, I believe, can have the same behaviour as old days. Besides, > there are many papers conducting various interesting experiments which > have something to do with this window and show good outputs in some cases. > > Signed-off-by: Jason Xing <kernelxing@tencent.com> > --- > net/ipv4/tcp_output.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c > index 95caf8aaa8be..95618d0e78e4 100644 > --- a/net/ipv4/tcp_output.c > +++ b/net/ipv4/tcp_output.c > @@ -232,7 +232,7 @@ void tcp_select_initial_window(const struct sock *sk, int __space, __u32 mss, > if (READ_ONCE(sock_net(sk)->ipv4.sysctl_tcp_workaround_signed_windows)) > (*rcv_wnd) = min(space, MAX_TCP_WINDOW); > else > - (*rcv_wnd) = min_t(u32, space, U16_MAX); > + (*rcv_wnd) = space; Hmm, has this patch been tested? This doesn't look like it would work. Please note that RFC 7323 says in https://datatracker.ietf.org/doc/html/rfc7323#section-2.2 : The window field in a segment where the SYN bit is set (i.e., a <SYN> or <SYN,ACK>) MUST NOT be scaled. Since the receive window field in a SYN is unscaled, that means the TCP wire protocol has no way to convey a receive window in the SYN that is bigger than 64KBytes. That is why this code places a limit of U16_MAX on the value here. If you want to advertise a bigger receive window in the SYN, you'll need to define a new TCP option type, and write an IETF Internet Draft and/or RFC standardizing the new option. If you would like to, instead, submit a patch with a comment explaining that this U16_MAX limit is inherent in the RFC 7323 wire protocol specification, that could make sense. best regards, neal
On Fri, May 17, 2024 at 10:42 PM Neal Cardwell <ncardwell@google.com> wrote: > > On Fri, May 17, 2024 at 4:50 AM Jason Xing <kerneljasonxing@gmail.com> wrote: > > > > From: Jason Xing <kernelxing@tencent.com> > > > > Since in 2018 one commit a337531b942b ("tcp: up initial rmem to 128KB and > > SYN rwin to around 64KB") limited received window within 65535, most CDN > > team would not benefit from this change because they cannot have a large > > window to receive a big packet one time especially in long RTT. > > > > According to RFC 7323, it says: > > "The maximum receive window, and therefore the scale factor, is > > determined by the maximum receive buffer space." > > > > So we can get rid of this 64k limitation and let the window be tunable if > > the user wants to do it within the control of buffer space. Then many > > companies, I believe, can have the same behaviour as old days. Besides, > > there are many papers conducting various interesting experiments which > > have something to do with this window and show good outputs in some cases. > > > > Signed-off-by: Jason Xing <kernelxing@tencent.com> > > --- > > net/ipv4/tcp_output.c | 2 +- > > 1 file changed, 1 insertion(+), 1 deletion(-) > > > > diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c > > index 95caf8aaa8be..95618d0e78e4 100644 > > --- a/net/ipv4/tcp_output.c > > +++ b/net/ipv4/tcp_output.c > > @@ -232,7 +232,7 @@ void tcp_select_initial_window(const struct sock *sk, int __space, __u32 mss, > > if (READ_ONCE(sock_net(sk)->ipv4.sysctl_tcp_workaround_signed_windows)) > > (*rcv_wnd) = min(space, MAX_TCP_WINDOW); > > else > > - (*rcv_wnd) = min_t(u32, space, U16_MAX); > > + (*rcv_wnd) = space; > > Hmm, has this patch been tested? This doesn't look like it would work. Hello Neal, Thanks for the comment. Sure, I provided such a patch a few months ago which has been tested in production for the customers. One example of using a much bigger initial receive window: client ---window=65535---> server client <---window=14600---- server client ---window=175616---> server Then the client could send more data than before in fewer rtt. Above is the output of tcpdump. Oh, I just found a similar case: https://lore.kernel.org/all/20220213040545.365600-1-tilan7663@gmail.com/ Before this, I always believed I'm not the only one who had such an issue. > > Please note that RFC 7323 says in > https://datatracker.ietf.org/doc/html/rfc7323#section-2.2 : > > The window field in a segment where the SYN bit is set (i.e., a <SYN> > or <SYN,ACK>) MUST NOT be scaled. > > Since the receive window field in a SYN is unscaled, that means the > TCP wire protocol has no way to convey a receive window in the SYN > that is bigger than 64KBytes. > > That is why this code places a limit of U16_MAX on the value here. > > If you want to advertise a bigger receive window in the SYN, you'll No. It's not my original intention. For SYN packet itself is limited in the __tcp_transmit_skb() as below: th->window = htons(min(tp->rcv_wnd, 65535U)); > need to define a new TCP option type, and write an IETF Internet Draft > and/or RFC standardizing the new option. > > If you would like to, instead, submit a patch with a comment > explaining that this U16_MAX limit is inherent in the RFC 7323 wire > protocol specification, that could make sense. I quoted from that link: -------- I'm not trying to make the sender to send more than 64Kib in the first RTT. The change will only make the sender to send more starting on the second RTT(after first ack received on the data). Instead of having the rcv_wnd to grow from 64Kib, the rcv_wnd can start from a much larger base value. Without the patch: RTT: 1, 2, 3, ... rcv_wnd: 64KiB, 192KiB, 576KiB, ... With the patch (assume rcv_wnd is set to 512KiB): RTT: 1, 2, 3, ... rcv_wnd: 64KiB, 1.536MiB, 4.608MiB, ... -------- Another quotation from the paper [1] which has been deployed in Yahoo: "By increasing this ICW, small objects stand to be transferred in fewer RTTs, which when compounded across all objects on a page can cut down the total page load time significantly. ... Luckily, on popular operating systems (except Linux which has a much smaller receive window), the initial receive window is quite large (64KB-256KB), which would allow for utilizing a larger ICW" [1]: https://conferences.sigcomm.org/imc/2011/docs/p569.pdf Thanks, Jason > > best regards, > neal
On Sat, May 18, 2024 at 1:41 AM Jason Xing <kerneljasonxing@gmail.com> wrote: > > On Fri, May 17, 2024 at 10:42 PM Neal Cardwell <ncardwell@google.com> wrote: > > > > On Fri, May 17, 2024 at 4:50 AM Jason Xing <kerneljasonxing@gmail.com> wrote: > > > > > > From: Jason Xing <kernelxing@tencent.com> > > > > > > Since in 2018 one commit a337531b942b ("tcp: up initial rmem to 128KB and > > > SYN rwin to around 64KB") limited received window within 65535, most CDN > > > team would not benefit from this change because they cannot have a large > > > window to receive a big packet one time especially in long RTT. > > > > > > According to RFC 7323, it says: > > > "The maximum receive window, and therefore the scale factor, is > > > determined by the maximum receive buffer space." > > > > > > So we can get rid of this 64k limitation and let the window be tunable if > > > the user wants to do it within the control of buffer space. Then many > > > companies, I believe, can have the same behaviour as old days. Besides, > > > there are many papers conducting various interesting experiments which > > > have something to do with this window and show good outputs in some cases. > > > > > > Signed-off-by: Jason Xing <kernelxing@tencent.com> > > > --- > > > net/ipv4/tcp_output.c | 2 +- > > > 1 file changed, 1 insertion(+), 1 deletion(-) > > > > > > diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c > > > index 95caf8aaa8be..95618d0e78e4 100644 > > > --- a/net/ipv4/tcp_output.c > > > +++ b/net/ipv4/tcp_output.c > > > @@ -232,7 +232,7 @@ void tcp_select_initial_window(const struct sock *sk, int __space, __u32 mss, > > > if (READ_ONCE(sock_net(sk)->ipv4.sysctl_tcp_workaround_signed_windows)) > > > (*rcv_wnd) = min(space, MAX_TCP_WINDOW); > > > else > > > - (*rcv_wnd) = min_t(u32, space, U16_MAX); > > > + (*rcv_wnd) = space; > > > > Hmm, has this patch been tested? This doesn't look like it would work. > > Hello Neal, > > Thanks for the comment. > > Sure, I provided such a patch a few months ago which has been tested > in production for the customers. > > One example of using a much bigger initial receive window: > client ---window=65535---> server > client <---window=14600---- server > client ---window=175616---> server > > Then the client could send more data than before in fewer rtt. > > Above is the output of tcpdump. > > Oh, I just found a similar case: > https://lore.kernel.org/all/20220213040545.365600-1-tilan7663@gmail.com/ > > Before this, I always believed I'm not the only one who had such an issue. > > > > > Please note that RFC 7323 says in > > https://datatracker.ietf.org/doc/html/rfc7323#section-2.2 : > > > > The window field in a segment where the SYN bit is set (i.e., a <SYN> > > or <SYN,ACK>) MUST NOT be scaled. > > > > Since the receive window field in a SYN is unscaled, that means the > > TCP wire protocol has no way to convey a receive window in the SYN > > that is bigger than 64KBytes. > > > > That is why this code places a limit of U16_MAX on the value here. > > > > If you want to advertise a bigger receive window in the SYN, you'll > > No. It's not my original intention. > > For SYN packet itself is limited in the __tcp_transmit_skb() as below: > > th->window = htons(min(tp->rcv_wnd, 65535U)); With this limitation/protection of the window in SYN packet, It would not break RFC with this patch applied. I try to advertise a bigger initRwnd of ACK in a 3-way shakehand process. Thanks, Jason
On Fri, May 17, 2024 at 1:49 PM Jason Xing <kerneljasonxing@gmail.com> wrote: > > On Sat, May 18, 2024 at 1:41 AM Jason Xing <kerneljasonxing@gmail.com> wrote: > > > > On Fri, May 17, 2024 at 10:42 PM Neal Cardwell <ncardwell@google.com> wrote: > > > > > > On Fri, May 17, 2024 at 4:50 AM Jason Xing <kerneljasonxing@gmail.com> wrote: > > > > > > > > From: Jason Xing <kernelxing@tencent.com> > > > > > > > > Since in 2018 one commit a337531b942b ("tcp: up initial rmem to 128KB and > > > > SYN rwin to around 64KB") limited received window within 65535, most CDN > > > > team would not benefit from this change because they cannot have a large > > > > window to receive a big packet one time especially in long RTT. > > > > > > > > According to RFC 7323, it says: > > > > "The maximum receive window, and therefore the scale factor, is > > > > determined by the maximum receive buffer space." > > > > > > > > So we can get rid of this 64k limitation and let the window be tunable if > > > > the user wants to do it within the control of buffer space. Then many > > > > companies, I believe, can have the same behaviour as old days. Besides, > > > > there are many papers conducting various interesting experiments which > > > > have something to do with this window and show good outputs in some cases. > > > > > > > > Signed-off-by: Jason Xing <kernelxing@tencent.com> > > > > --- > > > > net/ipv4/tcp_output.c | 2 +- > > > > 1 file changed, 1 insertion(+), 1 deletion(-) > > > > > > > > diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c > > > > index 95caf8aaa8be..95618d0e78e4 100644 > > > > --- a/net/ipv4/tcp_output.c > > > > +++ b/net/ipv4/tcp_output.c > > > > @@ -232,7 +232,7 @@ void tcp_select_initial_window(const struct sock *sk, int __space, __u32 mss, > > > > if (READ_ONCE(sock_net(sk)->ipv4.sysctl_tcp_workaround_signed_windows)) > > > > (*rcv_wnd) = min(space, MAX_TCP_WINDOW); > > > > else > > > > - (*rcv_wnd) = min_t(u32, space, U16_MAX); > > > > + (*rcv_wnd) = space; > > > > > > Hmm, has this patch been tested? This doesn't look like it would work. > > > > Hello Neal, > > > > Thanks for the comment. > > > > Sure, I provided such a patch a few months ago which has been tested > > in production for the customers. > > > > One example of using a much bigger initial receive window: > > client ---window=65535---> server > > client <---window=14600---- server > > client ---window=175616---> server > > > > Then the client could send more data than before in fewer rtt. > > > > Above is the output of tcpdump. > > > > Oh, I just found a similar case: > > https://lore.kernel.org/all/20220213040545.365600-1-tilan7663@gmail.com/ > > > > Before this, I always believed I'm not the only one who had such an issue. > > > > > > > > Please note that RFC 7323 says in > > > https://datatracker.ietf.org/doc/html/rfc7323#section-2.2 : > > > > > > The window field in a segment where the SYN bit is set (i.e., a <SYN> > > > or <SYN,ACK>) MUST NOT be scaled. > > > > > > Since the receive window field in a SYN is unscaled, that means the > > > TCP wire protocol has no way to convey a receive window in the SYN > > > that is bigger than 64KBytes. > > > > > > That is why this code places a limit of U16_MAX on the value here. > > > > > > If you want to advertise a bigger receive window in the SYN, you'll > > > > No. It's not my original intention. > > > > For SYN packet itself is limited in the __tcp_transmit_skb() as below: > > > > th->window = htons(min(tp->rcv_wnd, 65535U)); > > With this limitation/protection of the window in SYN packet, It would > not break RFC with this patch applied. I try to advertise a bigger > initRwnd of ACK in a 3-way shakehand process. Thanks for the explanation. I think the confusion arose because in your title ("tcp: break the limitation of initial receive window"), I interpreted "initial receive window" as the initial receive window advertised on the wire (which is limited by protocol spec to 64 kbytes), when you meant the initial value of tp->rcv_wnd. There are similar ambiguities in the commit message body. I would suggest resubmitting a version of the patch with a revised commit title and commit description, to clarify at least the following issues: + For the patch title, perhaps something like: tcp: remove 64 KByte limit for initial tp->rcv_wnd value + For the commit description, in the sentence 'Since in 2018 one commit a337531b942b ("tcp: up initial rmem to 128KB and SYN rwin to around 64KB") limited received window within 65535', please revise this to clarify that you are talking about tp->rcv_wnd and not the receive window on the wire. For example: 'In 2018 commit a337531b942b ("tcp: up initial rmem to 128KB and SYN rwin to around 64KB") limited the initial value of tp->rcv_wnd to 65535,.' + For the commit description, please add a note that RFC 7323 limits the initial receive window on the wire in a SYN or SYN+ACK to 65535 bytes, and __tcp_transmit_skb() already ensures that constraint is respected, no matter how large tp->rcv_wnd is. + For the commit description, please include some version of your example of the receive window values in a handshake, like: One example of using a much bigger initial receive window: client --- SYN: rwindow=65535 ---> server client <--- SYN+ACK: rwindow=14600 ---- server client --- ACK: rwindow=175616 ---> server + For the commit description, please include some version of your example of the evolution of the receive window over the first 3 round trips, before and after your patch. thanks, neal
Hello Neal, On Sat, May 18, 2024 at 2:35 AM Neal Cardwell <ncardwell@google.com> wrote: > > On Fri, May 17, 2024 at 1:49 PM Jason Xing <kerneljasonxing@gmail.com> wrote: > > > > On Sat, May 18, 2024 at 1:41 AM Jason Xing <kerneljasonxing@gmail.com> wrote: > > > > > > On Fri, May 17, 2024 at 10:42 PM Neal Cardwell <ncardwell@google.com> wrote: > > > > > > > > On Fri, May 17, 2024 at 4:50 AM Jason Xing <kerneljasonxing@gmail.com> wrote: > > > > > > > > > > From: Jason Xing <kernelxing@tencent.com> > > > > > > > > > > Since in 2018 one commit a337531b942b ("tcp: up initial rmem to 128KB and > > > > > SYN rwin to around 64KB") limited received window within 65535, most CDN > > > > > team would not benefit from this change because they cannot have a large > > > > > window to receive a big packet one time especially in long RTT. > > > > > > > > > > According to RFC 7323, it says: > > > > > "The maximum receive window, and therefore the scale factor, is > > > > > determined by the maximum receive buffer space." > > > > > > > > > > So we can get rid of this 64k limitation and let the window be tunable if > > > > > the user wants to do it within the control of buffer space. Then many > > > > > companies, I believe, can have the same behaviour as old days. Besides, > > > > > there are many papers conducting various interesting experiments which > > > > > have something to do with this window and show good outputs in some cases. > > > > > > > > > > Signed-off-by: Jason Xing <kernelxing@tencent.com> > > > > > --- > > > > > net/ipv4/tcp_output.c | 2 +- > > > > > 1 file changed, 1 insertion(+), 1 deletion(-) > > > > > > > > > > diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c > > > > > index 95caf8aaa8be..95618d0e78e4 100644 > > > > > --- a/net/ipv4/tcp_output.c > > > > > +++ b/net/ipv4/tcp_output.c > > > > > @@ -232,7 +232,7 @@ void tcp_select_initial_window(const struct sock *sk, int __space, __u32 mss, > > > > > if (READ_ONCE(sock_net(sk)->ipv4.sysctl_tcp_workaround_signed_windows)) > > > > > (*rcv_wnd) = min(space, MAX_TCP_WINDOW); > > > > > else > > > > > - (*rcv_wnd) = min_t(u32, space, U16_MAX); > > > > > + (*rcv_wnd) = space; > > > > > > > > Hmm, has this patch been tested? This doesn't look like it would work. > > > > > > Hello Neal, > > > > > > Thanks for the comment. > > > > > > Sure, I provided such a patch a few months ago which has been tested > > > in production for the customers. > > > > > > One example of using a much bigger initial receive window: > > > client ---window=65535---> server > > > client <---window=14600---- server > > > client ---window=175616---> server > > > > > > Then the client could send more data than before in fewer rtt. > > > > > > Above is the output of tcpdump. > > > > > > Oh, I just found a similar case: > > > https://lore.kernel.org/all/20220213040545.365600-1-tilan7663@gmail.com/ > > > > > > Before this, I always believed I'm not the only one who had such an issue. > > > > > > > > > > > Please note that RFC 7323 says in > > > > https://datatracker.ietf.org/doc/html/rfc7323#section-2.2 : > > > > > > > > The window field in a segment where the SYN bit is set (i.e., a <SYN> > > > > or <SYN,ACK>) MUST NOT be scaled. > > > > > > > > Since the receive window field in a SYN is unscaled, that means the > > > > TCP wire protocol has no way to convey a receive window in the SYN > > > > that is bigger than 64KBytes. > > > > > > > > That is why this code places a limit of U16_MAX on the value here. > > > > > > > > If you want to advertise a bigger receive window in the SYN, you'll > > > > > > No. It's not my original intention. > > > > > > For SYN packet itself is limited in the __tcp_transmit_skb() as below: > > > > > > th->window = htons(min(tp->rcv_wnd, 65535U)); > > > > With this limitation/protection of the window in SYN packet, It would > > not break RFC with this patch applied. I try to advertise a bigger > > initRwnd of ACK in a 3-way shakehand process. > > Thanks for the explanation. > > I think the confusion arose because in your title ("tcp: break the > limitation of initial receive window"), I interpreted "initial receive > window" as the initial receive window advertised on the wire (which is > limited by protocol spec to 64 kbytes), when you meant the initial > value of tp->rcv_wnd. There are similar ambiguities in the commit > message body. > > I would suggest resubmitting a version of the patch with a revised > commit title and commit description, to clarify at least the following > issues: > > + For the patch title, perhaps something like: > tcp: remove 64 KByte limit for initial tp->rcv_wnd value Thanks for the help. I'll update it. > > + For the commit description, in the sentence 'Since in 2018 one > commit a337531b942b ("tcp: up initial rmem to 128KB and SYN rwin to > around 64KB") limited received window within 65535', please revise > this to clarify that you are talking about tp->rcv_wnd and not the > receive window on the wire. For example: 'In 2018 commit a337531b942b > ("tcp: up initial rmem to 128KB and SYN rwin to around 64KB") limited > the initial value of tp->rcv_wnd to 65535,.' Got it :) > > + For the commit description, please add a note that RFC 7323 limits > the initial receive window on the wire in a SYN or SYN+ACK to 65535 > bytes, and __tcp_transmit_skb() already ensures that constraint is > respected, no matter how large tp->rcv_wnd is. Okay, I will add it to avoid future confusion. > > + For the commit description, please include some version of your > example of the receive window values in a handshake, like: > > One example of using a much bigger initial receive window: > client --- SYN: rwindow=65535 ---> server > client <--- SYN+ACK: rwindow=14600 ---- server > client --- ACK: rwindow=175616 ---> server Agreed. > > + For the commit description, please include some version of your > example of the evolution of the receive window over the first 3 round > trips, before and after your patch. I'll revise the body message according to your comments. Thanks, Jason
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c index 95caf8aaa8be..95618d0e78e4 100644 --- a/net/ipv4/tcp_output.c +++ b/net/ipv4/tcp_output.c @@ -232,7 +232,7 @@ void tcp_select_initial_window(const struct sock *sk, int __space, __u32 mss, if (READ_ONCE(sock_net(sk)->ipv4.sysctl_tcp_workaround_signed_windows)) (*rcv_wnd) = min(space, MAX_TCP_WINDOW); else - (*rcv_wnd) = min_t(u32, space, U16_MAX); + (*rcv_wnd) = space; if (init_rcv_wnd) *rcv_wnd = min(*rcv_wnd, init_rcv_wnd * mss);