Message ID   | 20240518025008.70689-1-kerneljasonxing@gmail.com
State        | Superseded
Delegated to | Netdev Maintainers
Series       | [RFC,v2,net-next] tcp: remove 64 KByte limit for initial tp->rcv_wnd value
On Sat, May 18, 2024 at 4:50 AM Jason Xing <kerneljasonxing@gmail.com> wrote:
>
> From: Jason Xing <kernelxing@tencent.com>
>
> In 2018, commit a337531b942b ("tcp: up initial rmem to 128KB and SYN rwin
> to around 64KB") limited the initial value of tp->rcv_wnd to 65535. Most
> CDN teams do not benefit from that change because they cannot open a
> large window to receive big payloads, which slows transfers down,
> especially on long-RTT paths.
>
> According to RFC 7323:
> "The maximum receive window, and therefore the scale factor, is
> determined by the maximum receive buffer space."

This seems not relevant? The wscale factor is not changed in this patch?
tp->rcv_wnd is also not the maximum receive window.

> So we can get rid of this 64 KB limitation and let the window be tunable,
> if the user wants to tune it, within the limits of the buffer space. Then
> many companies, I believe, can have the same behaviour as in the old days.

Not sure this has ever worked, see below.

Also, the "many companies ..." mention has nothing to do in a changelog.

> Besides, there are many papers conducting various interesting experiments
> involving this window that show good results in some cases, say, paper [1]
> on the Yahoo! CDN.

I think this changelog is trying hard to sell something, but in reality
the nature of the TCP 3WHS makes your claims wrong.

Instead, you should clearly document that this problem can _not_ be
solved for both active _and_ passive connections.

In the first RTT, a client (active connection) can not send more than
64KB, if TCP specs are properly applied.

> To avoid future confusion, the current change doesn't affect the initial
> receive window on the wire in a SYN or SYN+ACK packet, which stays within
> 65535 bytes according to RFC 7323, due to the limit in
> __tcp_transmit_skb():
>
>     th->window = htons(min(tp->rcv_wnd, 65535U));
>
> In one word, __tcp_transmit_skb() already ensures that constraint is
> respected, no matter how large tp->rcv_wnd is.
>
> Let me provide one example of behaviour with and without the patch:
>
> Before:
> client --- SYN: rwindow=65535 ---> server
> client <--- SYN+ACK: rwindow=65535 --- server
> client --- ACK: rwindow=65536 ---> server
> Note: for the last ACK, the calculation is 512 << 7.
>
> After:
> client --- SYN: rwindow=65535 ---> server
> client <--- SYN+ACK: rwindow=65535 --- server
> client --- ACK: rwindow=175232 ---> server
> Note: I used the following command to make it work:
>     ip route change default via [ip] dev eth0 metric 100 initrwnd 120
> For the last ACK, the calculation is 1369 << 7.
>
> Pay attention to the last ACK of the 3-way handshake and notice that,
> with the patch applied, the window can exceed 64 KByte.

You carefully avoided mentioning the asymmetry.
I do not think this is needed in the changelog, because this is adding
confusion.

> [1]: https://conferences.sigcomm.org/imc/2011/docs/p569.pdf
>
> Signed-off-by: Jason Xing <kernelxing@tencent.com>
> ---
> v2
> Link: https://lore.kernel.org/all/20240517085031.18896-1-kerneljasonxing@gmail.com/
> 1. revise the title and body messages (Neal)
> ---
>  net/ipv4/tcp_output.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index 95caf8aaa8be..95618d0e78e4 100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -232,7 +232,7 @@ void tcp_select_initial_window(const struct sock *sk, int __space, __u32 mss,
>  	if (READ_ONCE(sock_net(sk)->ipv4.sysctl_tcp_workaround_signed_windows))
>  		(*rcv_wnd) = min(space, MAX_TCP_WINDOW);
>  	else
> -		(*rcv_wnd) = min_t(u32, space, U16_MAX);
> +		(*rcv_wnd) = space;

This is probably breaking some packetdrill tests, but your change
might [1] be good, especially because it allows DRS behavior to be
consistent for large MTU (eg MTU 9000) and bigger tcp_rmem[1],
even without playing with the initrwnd attribute.

"ss -temoi" would display rcv_space:89600 after connection setup,
instead of a capped value.

[1] This is hard to say, DRS is full of surprises.
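To make the two clamps under discussion concrete, below is a minimal
user-space model of the initial rcv_wnd selection before and after the
patch, plus the 16-bit on-wire cap applied to the SYN/SYN+ACK. This is
a sketch, not the kernel code; the helper names are illustrative, and
only the min() logic from __tcp_transmit_skb() is modelled:

#include <stdio.h>

#define U16_MAX 65535u

/* Pre-patch selection (the signed-window workaround path is omitted). */
static unsigned int select_rcv_wnd_old(unsigned int space)
{
	return space < U16_MAX ? space : U16_MAX;	/* min_t(u32, space, U16_MAX) */
}

/* Patched selection: rcv_wnd simply follows the buffer space. */
static unsigned int select_rcv_wnd_new(unsigned int space)
{
	return space;
}

/* Models the min() in: th->window = htons(min(tp->rcv_wnd, 65535U)); */
static unsigned int wire_window(unsigned int rcv_wnd)
{
	return rcv_wnd < U16_MAX ? rcv_wnd : U16_MAX;
}

int main(void)
{
	unsigned int space = 131072;	/* e.g. a 128 KB receive buffer */

	/* Either way the SYN advertises at most 65535; only the internal
	 * rcv_wnd (and thus post-handshake scaled windows) differs. */
	printf("old: rcv_wnd=%u wire=%u\n",
	       select_rcv_wnd_old(space), wire_window(select_rcv_wnd_old(space)));
	printf("new: rcv_wnd=%u wire=%u\n",
	       select_rcv_wnd_new(space), wire_window(select_rcv_wnd_new(space)));
	return 0;
}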
Hello Eric,

On Tue, May 21, 2024 at 12:51 AM Eric Dumazet <edumazet@google.com> wrote:
>
> On Sat, May 18, 2024 at 4:50 AM Jason Xing <kerneljasonxing@gmail.com> wrote:
> >
> > According to RFC 7323:
> > "The maximum receive window, and therefore the scale factor, is
> > determined by the maximum receive buffer space."
>
> This seems not relevant? The wscale factor is not changed in this patch?
> tp->rcv_wnd is also not the maximum receive window.

Thanks for your review.

I can remove this part. I was trying to show that I do not break the RFC.

> Not sure this has ever worked, see below.
>
> Also, the "many companies ..." mention has nothing to do in a changelog.

Oh, I just copied/translated something like this from my initial studies
of rcv_wnd while reading many papers.

I can also remove this sentence.

> I think this changelog is trying hard to sell something, but in reality
> the nature of the TCP 3WHS makes your claims wrong.
>
> Instead, you should clearly document that this problem can _not_ be
> solved for both active _and_ passive connections.
>
> In the first RTT, a client (active connection) can not send more than
> 64KB, if TCP specs are properly applied.

A large rcv_wnd, if the user can tweak this knob, can help transfer data
more rapidly. I'm not referring to the first RTT.

> You carefully avoided mentioning the asymmetry.
> I do not think this is needed in the changelog, because this is adding
> confusion.

The case I've met in production is only about whether we're capable of
sending more data at the very beginning, so right now I care much more
about the sending side.

> This is probably breaking some packetdrill tests, but your change
> might [1] be good,

I'll run some packetdrill tests and get back with the results soon.

> especially because it allows DRS behavior to be consistent for large
> MTU (eg MTU 9000) and bigger tcp_rmem[1],
> even without playing with the initrwnd attribute.
>
> "ss -temoi" would display rcv_space:89600 after connection setup,
> instead of a capped value.
>
> [1] This is hard to say, DRS is full of surprises.

To avoid confusion, I will remove this link and the related statements.

Here are my opinions in conclusion:
1) this change doesn't break the law, I mean, the various RFCs.
2) this change restores the behaviour we had before the 2018 commit in
   this case.
3) this change helps certain cases, especially CDN workloads.

I'll refine the changelog as far as I can, hoping it will not confuse
the readers.

Thanks,
Jason
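For reference, the arithmetic behind the two last-ACK windows from the
changelog example (512 << 7 and 1369 << 7), assuming the receive window
scale of 7 stated there. A sketch only; the values come from the thread,
not a new measurement:

#include <stdio.h>

int main(void)
{
	unsigned int wscale = 7;

	/* The 16-bit window field carries rcv_wnd >> wscale, so the
	 * effective window is the field value shifted back up. */
	printf("before: %u\n", 512u << wscale);		/* 65536, ~64 KB cap */
	printf("after:  %u\n", 1369u << wscale);	/* 175232, with initrwnd 120 */
	return 0;
}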
On Tue, May 21, 2024 at 8:36 AM Jason Xing <kerneljasonxing@gmail.com> wrote:
>
> Hello Eric,
>
> On Tue, May 21, 2024 at 12:51 AM Eric Dumazet <edumazet@google.com> wrote:
> >
> > [...]
> >
> > In the first RTT, a client (active connection) can not send more than
> > 64KB, if TCP specs are properly applied.
>
> A large rcv_wnd, if the user can tweak this knob, can help transfer data
> more rapidly. I'm not referring to the first RTT.

For the first RTT, it is surely limited to 64 KB at most, as you said.
For the whole transfer, though, the change can accelerate the sending
process and save some RTTs.

How did I find this issue? We had some servers upgraded to the latest
kernel and noticed that the user-side indicator showed worse results
than before. Because of this, I spent some time digging into this part.
After applying this patch, the indicator shows normal/good results as
before, which proves that it works. CDN teams are very sensitive to the
latency of sending big data over long-RTT paths.

Thanks,
Jason

> [...]
Hello Eric,

On Tue, May 21, 2024 at 8:36 AM Jason Xing <kerneljasonxing@gmail.com> wrote:
>
> [...]
>
> > This is probably breaking some packetdrill tests, but your change
> > might [1] be good,
>
> I'll run some packetdrill tests and get back with the results soon.

I'm done with the packetdrill tests [1]. Two tests failed when comparing
runs with and without this patch:
1) ./packetdrill/run_all.py -S -v -L -l tcp/ioctl/ioctl-siocinq-fin.pkt
2) ./packetdrill/run_all.py -S -v -L -l tcp/fastopen/server/client-ack-dropped-then-recovery-ms-timestamps.pkt

For the first one, it shows:
"FAIL [/data/home/kernelxing/source_code/packetdrill/gtests/net/tcp/ioctl/ioctl-siocinq-fin.pkt (ipv6)]
stdout:
stderr:
ioctl-siocinq-fin.pkt:28: error handling packet: timing error:
expected outbound packet at 0.302321 sec but happened at 0.342759 sec;
tolerance 0.004000 sec
script packet: 0.302321 . 1:1(0) ack 2002
actual packet: 0.342759 . 1:1(0) ack 2002 win 65535"

For the second one, it shows:
"client-ack-dropped-then-recovery-ms-timestamps.pkt:33: error handling
packet: live packet field tcp_window: expected: 256 (0x100) vs actual: 532 (0x214)
script packet: 0.012251 P. 1:5001(5000) ack 1001 win 256 <nop,nop,TS val 2010 ecr 1000>
actual packet: 0.012242 P. 1:5001(5000) ack 1001 win 532 <nop,nop,TS val 2010 ecr 1000>
Ran 3 tests: 0 passing, 3 failing, 0 timed out (0.91 sec):
tcp/fastopen/server/client-ack-dropped-then-recovery-ms-timestamps.pkt"

The reason is the unexpected window size. Since I removed the 64KB limit,
this is expected from my point of view.

[1]: https://github.com/google/packetdrill
Running: ./packetdrill/run_all.py -S -v -L -l tcp/

If you mind this change being unpredictable, how about this one instead:

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 95caf8aaa8be..3bf7d9fd2d6b 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -231,11 +231,13 @@ void tcp_select_initial_window(const struct sock *sk, int __space, __u32 mss,
 	 */
 	if (READ_ONCE(sock_net(sk)->ipv4.sysctl_tcp_workaround_signed_windows))
 		(*rcv_wnd) = min(space, MAX_TCP_WINDOW);
-	else
-		(*rcv_wnd) = min_t(u32, space, U16_MAX);
 
-	if (init_rcv_wnd)
+	if (init_rcv_wnd) {
+		*rcv_wnd = space;
 		*rcv_wnd = min(*rcv_wnd, init_rcv_wnd * mss);
+	} else {
+		*rcv_wnd = min_t(u32, space, U16_MAX);
+	}
 
 	*rcv_wscale = 0;
 	if (wscale_ok) {

?

It affects/changes the TCP stack only when the user sets initrwnd via
'ip route'.

Thanks,
Jason

> [...]
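For clarity, the alternative proposal above amounts to the following
decision logic. This is a user-space sketch with illustrative names,
not the kernel function itself, and it again omits the signed-window
workaround path:

#include <stdio.h>

/* Keep the U16_MAX cap unless the route supplies initrwnd; otherwise
 * let rcv_wnd grow up to min(space, init_rcv_wnd * mss). */
static unsigned int select_rcv_wnd_alt(unsigned int space,
				       unsigned int init_rcv_wnd,
				       unsigned int mss)
{
	if (init_rcv_wnd) {
		unsigned int cap = init_rcv_wnd * mss;
		return space < cap ? space : cap;
	}
	return space < 65535u ? space : 65535u;
}

int main(void)
{
	/* Without initrwnd the cap stays near 64 KB; with initrwnd 120
	 * and an assumed mss of 1460 it can reach 175200, buffer space
	 * permitting. */
	printf("default:      %u\n", select_rcv_wnd_alt(262144, 0, 1460));
	printf("initrwnd=120: %u\n", select_rcv_wnd_alt(262144, 120, 1460));
	return 0;
}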
On Tue, May 21, 2024 at 8:56 AM Jason Xing <kerneljasonxing@gmail.com> wrote:
>
> [...]
>
> I'm done with the packetdrill tests [1]. Two tests failed when comparing
> runs with and without this patch:
>
> [...]
>
> The reason is the unexpected window size. Since I removed the 64KB limit,
> this is expected from my point of view.

I think you misunderstood what I was saying.

Basically, this change will break some packetdrill tests, and this is fine,
because those packetdrill tests were relying on a prior kernel behavior that
was not set in stone (certainly not documented).

> If you mind this change being unpredictable, how about this one instead:
>
> [...]
>
> It affects/changes the TCP stack only when the user sets initrwnd via
> 'ip route'.

I much prefer the prior and simpler version.

Only the changelog was not very good IMO.

Also, I think this is fixing a bug and should target the net tree.

If it took 6 years to discover the unwanted side effects, we should
make sure the fix is backported by stable teams, thanks to an
appropriate Fixes: tag.
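For reference, the Fixes: tag would presumably point at the 2018 commit
named in the changelog, following the usual trailer format:

Fixes: a337531b942b ("tcp: up initial rmem to 128KB and SYN rwin to around 64KB")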
On Tue, May 21, 2024 at 5:43 PM Eric Dumazet <edumazet@google.com> wrote:
>
> [...]
>
> I much prefer the prior and simpler version.
>
> Only the changelog was not very good IMO.
>
> Also, I think this is fixing a bug and should target the net tree.
>
> If it took 6 years to discover the unwanted side effects, we should
> make sure the fix is backported by stable teams, thanks to an
> appropriate Fixes: tag.

Oh, I see. I'll submit a new one soon.

Thanks,
Jason
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 95caf8aaa8be..95618d0e78e4 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -232,7 +232,7 @@ void tcp_select_initial_window(const struct sock *sk, int __space, __u32 mss,
 	if (READ_ONCE(sock_net(sk)->ipv4.sysctl_tcp_workaround_signed_windows))
 		(*rcv_wnd) = min(space, MAX_TCP_WINDOW);
 	else
-		(*rcv_wnd) = min_t(u32, space, U16_MAX);
+		(*rcv_wnd) = space;
 
 	if (init_rcv_wnd)
 		*rcv_wnd = min(*rcv_wnd, init_rcv_wnd * mss);