Message ID | 20220111192952.49040-1-ivan@cloudflare.com (mailing list archive)
---|---
State | Changes Requested |
Delegated to: | BPF |
Series | [bpf-next] tcp: bpf: Add TCP_BPF_RCV_SSTHRESH for bpf_setsockopt
On Tue, Jan 11, 2022 at 11:29 AM Ivan Babrou <ivan@cloudflare.com> wrote:
>
> This patch adds bpf_setsockopt(TCP_BPF_RCV_SSTHRESH) to allow setting
> the rcv_ssthresh value for TCP connections. Combined with an increased
> window_clamp via tcp_rmem[1], it allows advertising an initial scaled
> TCP window larger than 64k. This is useful for high BDP connections,
> where it allows pushing data with fewer roundtrips, reducing latency.
>
> For active connections the larger window is advertised in the first
> non-SYN ACK packet as part of the 3-way handshake.
>
> For passive connections the larger window is advertised whenever
> there's any packet to send after the 3-way handshake.
>
> See: https://lkml.org/lkml/2021/12/22/652

I guess this is [1] mentioned above. Please use a lore link instead, e.g.

[1] https://lore.kernel.org/all/CABWYdi0qBQ57OHt4ZbRxMtdSzhubzkPaPKkYzdNfu4+cgPyXCA@mail.gmail.com/

Can we add a selftest for this? Something similar to

tools/testing/selftests/bpf/progs/test_misc_tcp_hdr_options.c

Thanks,
Song

[...]
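A minimal sockops program exercising the proposed option could look roughly like the sketch below. This is only an illustration: TCP_BPF_RCV_SSTHRESH (1008) is taken from this patch, while the callback choice, the program name, and the 4 MiB value are assumptions, not anything from Ivan's actual test.

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

#ifndef SOL_TCP
#define SOL_TCP 6
#endif

#ifndef TCP_BPF_RCV_SSTHRESH
#define TCP_BPF_RCV_SSTHRESH 1008	/* from the proposed patch */
#endif

char _license[] SEC("license") = "GPL";

SEC("sockops")
int bump_rcv_ssthresh(struct bpf_sock_ops *skops)
{
	int val = 4 * 1024 * 1024;	/* illustrative 4 MiB, not a recommendation */

	switch (skops->op) {
	case BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB:
	case BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB:
		/* The kernel clamps the value to tp->window_clamp, so
		 * tcp_rmem[1]/window_clamp must also be raised for the
		 * larger window to actually be advertised.
		 */
		bpf_setsockopt(skops, SOL_TCP, TCP_BPF_RCV_SSTHRESH,
			       &val, sizeof(val));
		break;
	}
	return 1;
}

Attached to a cgroup, as the existing sockops selftests do, this runs once per connection as it becomes established.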
On Wed, Jan 12, 2022 at 11:59 AM Ivan Babrou <ivan@cloudflare.com> wrote:
>
> This patch adds bpf_setsockopt(TCP_BPF_RCV_SSTHRESH) to allow setting
> the rcv_ssthresh value for TCP connections. Combined with an increased
> window_clamp via tcp_rmem[1], it allows advertising an initial scaled
> TCP window larger than 64k. This is useful for high BDP connections,
> where it allows pushing data with fewer roundtrips, reducing latency.

I would not use the word "latency" in this way, I would just say
potentially reducing roundtrips... and potentially massively increasing
packet loss, oversaturating links, and otherwise hurting latency for
other applications sharing the link, including the application that
advertised an extreme window like this.

This overall focus tends to freak me out somewhat, especially when faced
with further statements that cloudflare is using an initcwnd of 250!???

The kind of damage just IW10 can do to much slower bandwidth connections
has to be experienced to be believed.

https://tools.ietf.org/id/draft-gettys-iw10-considered-harmful-00.html

>
> For active connections the larger window is advertised in the first
> non-SYN ACK packet as part of the 3-way handshake.
>
> For passive connections the larger window is advertised whenever
> there's any packet to send after the 3-way handshake.
>
> See: https://lkml.org/lkml/2021/12/22/652
>
> Signed-off-by: Ivan Babrou <ivan@cloudflare.com>
> ---
>  include/uapi/linux/bpf.h       | 1 +
>  net/core/filter.c              | 6 ++++++
>  tools/include/uapi/linux/bpf.h | 1 +
>  3 files changed, 8 insertions(+)
>
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 791f31dd0abe..36ebf87278bd 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -5978,6 +5978,7 @@ enum {
>         TCP_BPF_SYN             = 1005, /* Copy the TCP header */
>         TCP_BPF_SYN_IP          = 1006, /* Copy the IP[46] and TCP header */
>         TCP_BPF_SYN_MAC         = 1007, /* Copy the MAC, IP[46], and TCP header */
> +       TCP_BPF_RCV_SSTHRESH    = 1008, /* Set rcv_ssthresh */
>  };
>
>  enum {
> diff --git a/net/core/filter.c b/net/core/filter.c
> index 2e32cee2c469..aafb6066b1a6 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -4904,6 +4904,12 @@ static int _bpf_setsockopt(struct sock *sk, int level, int optname,
>                                         return -EINVAL;
>                                 inet_csk(sk)->icsk_rto_min = timeout;
>                                 break;
> +                       case TCP_BPF_RCV_SSTHRESH:
> +                               if (val <= 0)
> +                                       ret = -EINVAL;
> +                               else
> +                                       tp->rcv_ssthresh = min_t(u32, val, tp->window_clamp);
> +                               break;
>                         case TCP_SAVE_SYN:
>                                 if (val < 0 || val > 1)
>                                         ret = -EINVAL;
> diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
> index 791f31dd0abe..36ebf87278bd 100644
> --- a/tools/include/uapi/linux/bpf.h
> +++ b/tools/include/uapi/linux/bpf.h
> @@ -5978,6 +5978,7 @@ enum {
>         TCP_BPF_SYN             = 1005, /* Copy the TCP header */
>         TCP_BPF_SYN_IP          = 1006, /* Copy the IP[46] and TCP header */
>         TCP_BPF_SYN_MAC         = 1007, /* Copy the MAC, IP[46], and TCP header */
> +       TCP_BPF_RCV_SSTHRESH    = 1008, /* Set rcv_ssthresh */
>  };
>
>  enum {
> --
> 2.34.1
>
On Wed, Jan 12, 2022 at 1:02 PM Dave Taht <dave.taht@gmail.com> wrote:
> I would not use the word "latency" in this way, I would just say
> potentially reducing roundtrips...

Roundtrips translate directly into latency on high-latency links.

> and potentially massively increasing packet loss, oversaturating
> links, and otherwise hurting latency for other applications sharing
> the link, including the application that advertised an extreme window
> like this.

The receive window is going to scale up to tcp_rmem[2] with traffic,
and packet loss won't stop it. That's around 3MiB on anything that's
not embedded these days.

My understanding is that congestion control on the sender side deals
with packet loss, bottleneck saturation, and packet pacing. This patch
only touches the receiving side, letting the client scale up faster if
they choose to do so. I don't think any out-of-the-box sender will make
use of this, even if we enable it on the receiver, just because the
sender's congestion control constraints are lower (like initcwnd=10).

Let me know if any of this doesn't look right to you.

> This overall focus tends to freak me out somewhat, especially when faced
> with further statements that cloudflare is using an initcwnd of 250!???

Congestion window is a learned property, not a static number. You won't
get a large initcwnd towards a poor connection. We have a dedicated
backbone with different properties.
On Tue, Jan 11, 2022 at 1:48 PM Song Liu <song@kernel.org> wrote:
>
> I guess this is [1] mentioned above. Please use a lore link instead, e.g.
>
> [1] https://lore.kernel.org/all/CABWYdi0qBQ57OHt4ZbRxMtdSzhubzkPaPKkYzdNfu4+cgPyXCA@mail.gmail.com/

Will do in the next iteration, thanks.

> Can we add a selftest for this? Something similar to
>
> tools/testing/selftests/bpf/progs/test_misc_tcp_hdr_options.c

I have a test based on the tcp_rtt selftest. Do you want me to amend my
commit with the test and resend it as v2, or make it a series of two
commits?
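The tcp_rtt selftest referenced here attaches a sockops program to a test cgroup and then opens a TCP connection inside it. A rough sketch of the userspace side of such a test is below; the object file name, program name, cgroup path, and libbpf error handling are placeholders and assumptions, not the contents of the eventual v2 test.

#include <fcntl.h>
#include <unistd.h>
#include <bpf/libbpf.h>

int main(void)
{
	struct bpf_object *obj;
	struct bpf_program *prog;
	struct bpf_link *link;
	int cg_fd, err = 1;

	/* Placeholder cgroup path; the selftests create their own cgroup. */
	cg_fd = open("/sys/fs/cgroup/rcv_ssthresh_test", O_RDONLY);
	if (cg_fd < 0)
		return 1;

	obj = bpf_object__open_file("rcv_ssthresh.bpf.o", NULL);
	if (!obj)
		goto close_cg;
	if (bpf_object__load(obj))
		goto close_obj;

	prog = bpf_object__find_program_by_name(obj, "bump_rcv_ssthresh");
	if (!prog)
		goto close_obj;

	link = bpf_program__attach_cgroup(prog, cg_fd);
	if (!link)
		goto close_obj;

	/* ... connect() a TCP socket from within the cgroup here and
	 * verify that the advertised window grew as expected ...
	 */

	err = 0;
	bpf_link__destroy(link);
close_obj:
	bpf_object__close(obj);
close_cg:
	close(cg_fd);
	return err;
}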
On Thu, Jan 13, 2022 at 2:56 PM Ivan Babrou <ivan@cloudflare.com> wrote:
>
> On Wed, Jan 12, 2022 at 1:02 PM Dave Taht <dave.taht@gmail.com> wrote:
> > I would not use the word "latency" in this way, I would just say
> > potentially reducing roundtrips...
>
> Roundtrips translate directly into latency on high-latency links.

Yes, but with the caveats below. I'm fine with you just saying round trips,
and making this api possible.

It would comfort me further if you could provide an actual scenario.

See also:

https://datatracker.ietf.org/doc/html/rfc6928

which predates packet pacing (are you using sch_fq?)

> > and potentially massively increasing packet loss, oversaturating
> > links, and otherwise hurting latency for other applications sharing
> > the link, including the application that advertised an extreme window
> > like this.
>
> The receive window is going to scale up to tcp_rmem[2] with traffic,
> and packet loss won't stop it. That's around 3MiB on anything that's
> not embedded these days.
>
> My understanding is that congestion control on the sender side deals
> with packet loss, bottleneck saturation, and packet pacing. This patch
> only touches the receiving side, letting the client scale up faster if
> they choose to do so. I don't think any out-of-the-box sender will make
> use of this, even if we enable it on the receiver, just because the
> sender's congestion control constraints are lower (like initcwnd=10).

I've always kind of not liked the sender/receiver "language" in tcp.
They are peers.

> Let me know if any of this doesn't look right to you.
>
> > This overall focus tends to freak me out somewhat, especially when faced
> > with further statements that cloudflare is using an initcwnd of 250!???
>
> Congestion window is a learned property, not a static number. You won't
> get a large initcwnd towards a poor connection.

initcwnd is set globally or on a per-route basis.

> We have a dedicated backbone with different properties.

It's not so much that I don't think your backbone can handle this...

... it's the prospect of handing whiskey, car keys and excessive
initcwnd to teenage boys on a saturday night.
On Thu, Jan 13, 2022 at 9:44 PM Dave Taht <dave.taht@gmail.com> wrote:
> Yes, but with the caveats below. I'm fine with you just saying round trips,
> and making this api possible.
>
> It would comfort me further if you could provide an actual scenario.

The actual scenario is getting a response as quickly as possible on a
fresh connection across long distances (200ms+ RTT). If an RPC response
doesn't fit into the initial 64k of rcv_ssthresh, we end up requiring
more roundtrips to receive the response. Some customers are very picky
about the latency they measure, and cutting the extra roundtrips made a
very visible difference in the tests.

> See also:
>
> https://datatracker.ietf.org/doc/html/rfc6928
>
> which predates packet pacing (are you using sch_fq?)

We are using fq and bbr.

> > Congestion window is a learned property, not a static number. You won't
> > get a large initcwnd towards a poor connection.
>
> initcwnd is set globally or on a per-route basis.

With TCP_BPF_IW the world is your oyster.
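For readers who haven't used it, TCP_BPF_IW is an existing bpf_setsockopt() option (1001) that overrides a connection's initial congestion window from a sockops program; the kernel only accepts it before any data has been sent on the socket. A hedged sketch of pairing it with the option proposed in this thread follows — the values 40 and 1 MiB are made up for illustration, not Cloudflare's settings.

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

#ifndef SOL_TCP
#define SOL_TCP 6
#endif

#ifndef TCP_BPF_IW
#define TCP_BPF_IW 1001			/* existing: set initial snd_cwnd */
#endif

#ifndef TCP_BPF_RCV_SSTHRESH
#define TCP_BPF_RCV_SSTHRESH 1008	/* proposed by this patch */
#endif

char _license[] SEC("license") = "GPL";

SEC("sockops")
int tune_windows(struct bpf_sock_ops *skops)
{
	int iw = 40;			/* packets; illustrative only */
	int rcv = 1 * 1024 * 1024;	/* bytes; illustrative only */

	switch (skops->op) {
	case BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB:
		/* Active side: bump the initial cwnd. The kernel rejects
		 * this once data has already been sent on the socket.
		 */
		bpf_setsockopt(skops, SOL_TCP, TCP_BPF_IW, &iw, sizeof(iw));
		/* fall through: also raise the receive-side threshold */
	case BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB:
		bpf_setsockopt(skops, SOL_TCP, TCP_BPF_RCV_SSTHRESH,
			       &rcv, sizeof(rcv));
		break;
	}
	return 1;
}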
On Fri, Jan 14, 2022 at 2:21 PM Ivan Babrou <ivan@cloudflare.com> wrote:
>
> On Thu, Jan 13, 2022 at 9:44 PM Dave Taht <dave.taht@gmail.com> wrote:
> > Yes, but with the caveats below. I'm fine with you just saying round trips,
> > and making this api possible.
> >
> > It would comfort me further if you could provide an actual scenario.
>
> The actual scenario is getting a response as quickly as possible on a
> fresh connection across long distances (200ms+ RTT). If an RPC response
> doesn't fit into the initial 64k of rcv_ssthresh, we end up requiring
> more roundtrips to receive the response. Some customers are very picky
> about the latency they measure, and cutting the extra roundtrips made a
> very visible difference in the tests.
>
> > See also:
> >
> > https://datatracker.ietf.org/doc/html/rfc6928
> >
> > which predates packet pacing (are you using sch_fq?)
>
> We are using fq and bbr.
>
> > > Congestion window is a learned property, not a static number. You won't
> > > get a large initcwnd towards a poor connection.
> >
> > initcwnd is set globally or on a per-route basis.

Like I said, retaining state from an existing connection as to the
window is ok. I think arbitrarily declaring a window like this for a
new connection is not.

> With TCP_BPF_IW the world is your oyster.

The oyster has to co-habit in this ocean with all the other life there,
and I would be comforted if your customer also tracked various other
TCP_INFO statistics, like RTT growth, loss, marks, and retransmits, and
was aware of not just the self-harm inflicted but of the collateral
damage.

In fact I really wish more were instrumenting everything with that; of
late we've seen a lot of need for TCP_NOTSENT_LOWAT in things like
apache traffic server in containers. A simple one-line patch for a
widely used app I can't talk about did wonders for actual perceived
throughput and responsiveness by the end user.

Measuring from the receiver is far, far more important than measuring
from the sender. Collecting long-term statistics over many connections,
also, from the real world. I hope y'all have been instrumenting your
work as well as google has on these fronts.

I know that I'm getting old and crunchy and scarred by seeing so many
(corporate wifi mostly) networks over the last decade essentially in
congestion collapse!

https://blog.apnic.net/2020/01/22/bufferbloat-may-be-solved-but-its-not-over-yet/

I'm very happy with how well sch_fq + packet pacing works to mitigate
impulses like this, as well as with so many other things like BBR and
BQL, but pacing out != pacing in, and despite my fervent wish for more
FQ+AQM techniques on more bottleneck links also, we're not there yet.

I like very much that BPF is allowing rapid innovation, but with great
power comes great responsibility.
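For context on the TCP_NOTSENT_LOWAT remark above: the "one line" change in an application is typically a plain setsockopt() that caps how much not-yet-sent data may sit in the socket's send buffer, keeping writes closer to what the path can actually drain. A generic illustration follows; the 128 KiB limit is an arbitrary example, not the value from the unnamed app.

#include <netinet/in.h>
#include <sys/socket.h>

#ifndef TCP_NOTSENT_LOWAT
#define TCP_NOTSENT_LOWAT 25
#endif

/* Cap unsent data buffered in the kernel for this socket to 128 KiB. */
static int limit_notsent(int fd)
{
	int lowat = 128 * 1024;	/* arbitrary example value */

	return setsockopt(fd, IPPROTO_TCP, TCP_NOTSENT_LOWAT,
			  &lowat, sizeof(lowat));
}

An application would call limit_notsent() right after creating or accepting the socket, before writing bulk data.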
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 791f31dd0abe..36ebf87278bd 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -5978,6 +5978,7 @@ enum {
 	TCP_BPF_SYN		= 1005, /* Copy the TCP header */
 	TCP_BPF_SYN_IP		= 1006, /* Copy the IP[46] and TCP header */
 	TCP_BPF_SYN_MAC		= 1007, /* Copy the MAC, IP[46], and TCP header */
+	TCP_BPF_RCV_SSTHRESH	= 1008, /* Set rcv_ssthresh */
 };
 
 enum {
diff --git a/net/core/filter.c b/net/core/filter.c
index 2e32cee2c469..aafb6066b1a6 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -4904,6 +4904,12 @@ static int _bpf_setsockopt(struct sock *sk, int level, int optname,
 					return -EINVAL;
 				inet_csk(sk)->icsk_rto_min = timeout;
 				break;
+			case TCP_BPF_RCV_SSTHRESH:
+				if (val <= 0)
+					ret = -EINVAL;
+				else
+					tp->rcv_ssthresh = min_t(u32, val, tp->window_clamp);
+				break;
 			case TCP_SAVE_SYN:
 				if (val < 0 || val > 1)
 					ret = -EINVAL;
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 791f31dd0abe..36ebf87278bd 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -5978,6 +5978,7 @@ enum {
 	TCP_BPF_SYN		= 1005, /* Copy the TCP header */
 	TCP_BPF_SYN_IP		= 1006, /* Copy the IP[46] and TCP header */
 	TCP_BPF_SYN_MAC		= 1007, /* Copy the MAC, IP[46], and TCP header */
+	TCP_BPF_RCV_SSTHRESH	= 1008, /* Set rcv_ssthresh */
 };
 
 enum {
This patch adds bpf_setsockopt(TCP_BPF_RCV_SSTHRESH) to allow setting
the rcv_ssthresh value for TCP connections. Combined with an increased
window_clamp via tcp_rmem[1], it allows advertising an initial scaled
TCP window larger than 64k. This is useful for high BDP connections,
where it allows pushing data with fewer roundtrips, reducing latency.

For active connections the larger window is advertised in the first
non-SYN ACK packet as part of the 3-way handshake.

For passive connections the larger window is advertised whenever
there's any packet to send after the 3-way handshake.

See: https://lkml.org/lkml/2021/12/22/652

Signed-off-by: Ivan Babrou <ivan@cloudflare.com>
---
 include/uapi/linux/bpf.h       | 1 +
 net/core/filter.c              | 6 ++++++
 tools/include/uapi/linux/bpf.h | 1 +
 3 files changed, 8 insertions(+)
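When experimenting with this patch, one way to observe its effect from userspace is to read tcpi_rcv_ssthresh via the TCP_INFO socket option on the receiving socket before and after the sockops program runs. A small sketch, not part of the patch itself:

#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <linux/tcp.h>		/* struct tcp_info, TCP_INFO */

/* Print the receive-side slow-start threshold currently held by the socket. */
static void print_rcv_ssthresh(int fd)
{
	struct tcp_info ti;
	socklen_t len = sizeof(ti);

	memset(&ti, 0, sizeof(ti));
	if (getsockopt(fd, IPPROTO_TCP, TCP_INFO, &ti, &len) == 0)
		printf("rcv_ssthresh: %u\n", ti.tcpi_rcv_ssthresh);
}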