Message ID | 20240819-jakub-krn-909-poc-msec-tw-tstamp-v1-1-6567b5006fbe@cloudflare.com (mailing list archive) |
---|---|
State | RFC |
Delegated to: | Netdev Maintainers |
Headers | show |
Series | [RFC,net-next] tcp: Allow TIME-WAIT reuse after 1 millisecond | expand |
On Mon, Aug 19, 2024 at 1:31 PM Jakub Sitnicki <jakub@cloudflare.com> wrote: > > [This patch needs a description. Please see the RFC cover letter below.] > > Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com> > --- > Can we shorten the TCP connection reincarnation period? > > Situation > ========= > > Currently, we can reuse a TCP 4-tuple (source IP + port, destination IP + port) > in the TIME-WAIT state to establish a new outgoing TCP connection after a period > of 1 second. This period, during which the 4-tuple remains blocked from reuse, > is determined by the granularity of the ts_recent_stamp / tw_ts_recent_stamp > timestamp, which presently uses a 1 Hz clock (ktime_get_seconds). > > The TIME-WAIT block is enforced by __{inet,inet6}_check_established -> > tcp_twsk_unique, where we check if the timestamp clock has ticked since the last > ts_recent_stamp update before allowing the 4-tuple to be reused. > > This mechanism, introduced in 2002 by commit b8439924316d ("Allow to bind to an > already in use local port during connect") [1], protects the TCP receiver > against segments from an earlier incarnation of the same connection (FIN > retransmits), which could potentially corrupt the TCP stream, as described by > RFC 7323 [2, 3]. > > Problem > ======= > > The one-second reincarnation period has not posed a problem when we had a > sufficiently large pool of ephemeral ports (tens of thousands per host). We now have network namespaces, and still ~30,000 ephemeral ports per netns :) > However, as we began sharing egress IPv4 addresses between hosts by partitioning > the available port range [4], the ephemeral port pool size has shrunk > significantly—down to hundreds of ports per host. > > This reduction in port pool size has made it clear that a one-second connection > reincarnation period can lead to ephemeral port exhaustion. Short-lived TCP > connections, such as DNS queries, can complete in milliseconds, yet the TCP > 4-tuple remains blocked for a period of time that is orders of magnitude longer. > > Solution > ======== > > We would like to propose to shorten the period during which the 4-tuple is tied > up. The intention is to enable TIME-WAIT reuse at least as quickly as it takes > nowadays to perform of a short-lived TCP connection, from setup to teardown. > > The ts_recent_stamp protection is based on the same principle as PAWS but > extends it across TCP connections. As RFC 7323 outlines in Appendix B.2, point > (b): > > An additional mechanism could be added to the TCP, a per-host > cache of the last timestamp received from any connection. This > value could then be used in the PAWS mechanism to reject old > duplicate segments from earlier incarnations of the connection, > if the timestamp clock can be guaranteed to have ticked at least > once since the old connection was open. This would require that > the TIME-WAIT delay plus the RTT together must be at least one > tick of the sender's timestamp clock. Such an extension is not > part of the proposal of this RFC. Note the RTT part here. I do not see this implemented in your patch. > > Due to that, we would want to follow the same guidelines as the for TSval > timestamp clock, for which RFC 7323 recommends a frequency in the range of 1 ms > to 1 sec per tick [5], when reconsidering the default setting. > > (Note that the Linux TCP stack has recently introduced even finer granularity > with microsecond TSval resolution in commit 614e8316aa4c "tcp: add support for > usec resolution in TCP TS values" [6] for use in private networks.) > > A simple implementation could be to switch from a second to a millisecond clock, > as demonstrated by the following patch. However, this could also be a tunable > option to allow administrators to adjust it based on their specific needs and > risk tolerance. > > A tunable also opens the door to letting users set the TIME-WAIT reuse period > beyond the RFC 7323 recommended range at their own risk. > > Workaround > ========== > > Today, when an application has only a small ephemeral port pool available, we > work around the 1-second reincarnation period by manually selecting the local > port with an explicit bind(). > > This has been possible since the introduction of the ts_recent_stamp protection > mechanism [1]. However, it is unclear why this is allowed for egress > connections. > > To guide readers to the relevant code: if the local port is selected by the > user, we do not pass a TIME-WAIT socket to the check_established helper from > __inet_hash_connect. This way we circumvent the timestamp check in > tcp_twsk_unique [7] (as twp is NULL). > > However, relying on this workaround conflicts with our goal of delegating TCP > local port selection to the network stack, using the IP_BIND_ADDRESS_NO_PORT [8] > and IP_LOCAL_PORT_RANGE [9] socket options. > > [1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=b8439924316d5bcb266d165b93d632a4b4b859af > [2] https://datatracker.ietf.org/doc/html/rfc7323#section-5.8 > [3] https://datatracker.ietf.org/doc/html/rfc7323#appendix-B > [4] https://lpc.events/event/16/contributions/1349/ > [5] https://datatracker.ietf.org/doc/html/rfc7323#section-5.4 > [6] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=614e8316aa4cafba3e204cb8ee48bd12b92f3d93 > [7] https://elixir.bootlin.com/linux/v6.10/source/net/ipv4/tcp_ipv4.c#L156 > [8] https://manpages.debian.org/unstable/manpages/ip.7.en.html#IP_BIND_ADDRESS_NO_PORT > [9] https://manpages.debian.org/unstable/manpages/ip.7.en.html#IP_LOCAL_PORT_RANGE > --- > > --- > drivers/net/ethernet/chelsio/inline_crypto/chtls/chtls_cm.c | 2 +- > include/linux/tcp.h | 4 ++-- > net/ipv4/tcp_input.c | 2 +- > net/ipv4/tcp_ipv4.c | 5 ++--- > net/ipv4/tcp_minisocks.c | 9 ++++++--- > 5 files changed, 12 insertions(+), 10 deletions(-) > > diff --git a/drivers/net/ethernet/chelsio/inline_crypto/chtls/chtls_cm.c b/drivers/net/ethernet/chelsio/inline_crypto/chtls/chtls_cm.c > index 6f6525983130..b15b26db8902 100644 > --- a/drivers/net/ethernet/chelsio/inline_crypto/chtls/chtls_cm.c > +++ b/drivers/net/ethernet/chelsio/inline_crypto/chtls/chtls_cm.c > @@ -1866,7 +1866,7 @@ static void chtls_timewait(struct sock *sk) > struct tcp_sock *tp = tcp_sk(sk); > > tp->rcv_nxt++; > - tp->rx_opt.ts_recent_stamp = ktime_get_seconds(); > + tp->rx_opt.ts_recent_stamp = tcp_clock_ms(); > tp->srtt_us = 0; > tcp_time_wait(sk, TCP_TIME_WAIT, 0); > } > diff --git a/include/linux/tcp.h b/include/linux/tcp.h > index 6a5e08b937b3..174257114ee4 100644 > --- a/include/linux/tcp.h > +++ b/include/linux/tcp.h > @@ -110,7 +110,7 @@ struct tcp_sack_block { > > struct tcp_options_received { > /* PAWS/RTTM data */ > - int ts_recent_stamp;/* Time we stored ts_recent (for aging) */ > + u32 ts_recent_stamp;/* Time we stored ts_recent (for aging) */ > u32 ts_recent; /* Time stamp to echo next */ > u32 rcv_tsval; /* Time stamp value */ > u32 rcv_tsecr; /* Time stamp echo reply */ > @@ -543,7 +543,7 @@ struct tcp_timewait_sock { > /* The time we sent the last out-of-window ACK: */ > u32 tw_last_oow_ack_time; > > - int tw_ts_recent_stamp; > + u32 tw_ts_recent_stamp; > u32 tw_tx_delay; > #ifdef CONFIG_TCP_MD5SIG > struct tcp_md5sig_key *tw_md5_key; > diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c > index e37488d3453f..873a1cbd6d14 100644 > --- a/net/ipv4/tcp_input.c > +++ b/net/ipv4/tcp_input.c > @@ -3778,7 +3778,7 @@ static void tcp_send_challenge_ack(struct sock *sk) > static void tcp_store_ts_recent(struct tcp_sock *tp) > { > tp->rx_opt.ts_recent = tp->rx_opt.rcv_tsval; > - tp->rx_opt.ts_recent_stamp = ktime_get_seconds(); > + tp->rx_opt.ts_recent_stamp = tcp_clock_ms(); Please do not abuse tcp_clock_ms(). Instead use tcp_time_stamp_ms(tp) Same remark for other parts of the patch, try to reuse tp->tcp_mstamp if available. Also, (tcp_clock_ms() != ts_recent_stamp) can be true even after one usec has elapsed, due to rounding. The 'one second delay' was really: 'An average of 0.5 second delay' Solution : no longer use jiffies, but usec based timestamps, since we already have this infrastructure in TCP stack.
Hello Jakub, On Mon, Aug 19, 2024 at 7:31 PM Jakub Sitnicki <jakub@cloudflare.com> wrote: > > [This patch needs a description. Please see the RFC cover letter below.] > > Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com> > --- > Can we shorten the TCP connection reincarnation period? > > Situation > ========= > > Currently, we can reuse a TCP 4-tuple (source IP + port, destination IP + port) > in the TIME-WAIT state to establish a new outgoing TCP connection after a period > of 1 second. This period, during which the 4-tuple remains blocked from reuse, > is determined by the granularity of the ts_recent_stamp / tw_ts_recent_stamp > timestamp, which presently uses a 1 Hz clock (ktime_get_seconds). > > The TIME-WAIT block is enforced by __{inet,inet6}_check_established -> > tcp_twsk_unique, where we check if the timestamp clock has ticked since the last > ts_recent_stamp update before allowing the 4-tuple to be reused. > > This mechanism, introduced in 2002 by commit b8439924316d ("Allow to bind to an > already in use local port during connect") [1], protects the TCP receiver > against segments from an earlier incarnation of the same connection (FIN > retransmits), which could potentially corrupt the TCP stream, as described by > RFC 7323 [2, 3]. > > Problem > ======= > > The one-second reincarnation period has not posed a problem when we had a > sufficiently large pool of ephemeral ports (tens of thousands per host). > However, as we began sharing egress IPv4 addresses between hosts by partitioning > the available port range [4], the ephemeral port pool size has shrunk > significantly—down to hundreds of ports per host. > > This reduction in port pool size has made it clear that a one-second connection > reincarnation period can lead to ephemeral port exhaustion. Short-lived TCP > connections, such as DNS queries, can complete in milliseconds, yet the TCP > 4-tuple remains blocked for a period of time that is orders of magnitude longer. > > Solution > ======== > > We would like to propose to shorten the period during which the 4-tuple is tied > up. The intention is to enable TIME-WAIT reuse at least as quickly as it takes > nowadays to perform of a short-lived TCP connection, from setup to teardown. > > The ts_recent_stamp protection is based on the same principle as PAWS but > extends it across TCP connections. As RFC 7323 outlines in Appendix B.2, point > (b): > > An additional mechanism could be added to the TCP, a per-host > cache of the last timestamp received from any connection. This > value could then be used in the PAWS mechanism to reject old > duplicate segments from earlier incarnations of the connection, > if the timestamp clock can be guaranteed to have ticked at least > once since the old connection was open. This would require that > the TIME-WAIT delay plus the RTT together must be at least one > tick of the sender's timestamp clock. Such an extension is not > part of the proposal of this RFC. > > Due to that, we would want to follow the same guidelines as the for TSval > timestamp clock, for which RFC 7323 recommends a frequency in the range of 1 ms > to 1 sec per tick [5], when reconsidering the default setting. > > (Note that the Linux TCP stack has recently introduced even finer granularity > with microsecond TSval resolution in commit 614e8316aa4c "tcp: add support for > usec resolution in TCP TS values" [6] for use in private networks.) > > A simple implementation could be to switch from a second to a millisecond clock, > as demonstrated by the following patch. However, this could also be a tunable > option to allow administrators to adjust it based on their specific needs and > risk tolerance. > > A tunable also opens the door to letting users set the TIME-WAIT reuse period > beyond the RFC 7323 recommended range at their own risk. > > Workaround > ========== > > Today, when an application has only a small ephemeral port pool available, we > work around the 1-second reincarnation period by manually selecting the local > port with an explicit bind(). > > This has been possible since the introduction of the ts_recent_stamp protection > mechanism [1]. However, it is unclear why this is allowed for egress > connections. > > To guide readers to the relevant code: if the local port is selected by the > user, we do not pass a TIME-WAIT socket to the check_established helper from > __inet_hash_connect. This way we circumvent the timestamp check in > tcp_twsk_unique [7] (as twp is NULL). > > However, relying on this workaround conflicts with our goal of delegating TCP > local port selection to the network stack, using the IP_BIND_ADDRESS_NO_PORT [8] > and IP_LOCAL_PORT_RANGE [9] socket options. > > [1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=b8439924316d5bcb266d165b93d632a4b4b859af > [2] https://datatracker.ietf.org/doc/html/rfc7323#section-5.8 > [3] https://datatracker.ietf.org/doc/html/rfc7323#appendix-B > [4] https://lpc.events/event/16/contributions/1349/ > [5] https://datatracker.ietf.org/doc/html/rfc7323#section-5.4 > [6] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=614e8316aa4cafba3e204cb8ee48bd12b92f3d93 > [7] https://elixir.bootlin.com/linux/v6.10/source/net/ipv4/tcp_ipv4.c#L156 > [8] https://manpages.debian.org/unstable/manpages/ip.7.en.html#IP_BIND_ADDRESS_NO_PORT > [9] https://manpages.debian.org/unstable/manpages/ip.7.en.html#IP_LOCAL_PORT_RANGE > --- > > --- > drivers/net/ethernet/chelsio/inline_crypto/chtls/chtls_cm.c | 2 +- > include/linux/tcp.h | 4 ++-- > net/ipv4/tcp_input.c | 2 +- > net/ipv4/tcp_ipv4.c | 5 ++--- > net/ipv4/tcp_minisocks.c | 9 ++++++--- > 5 files changed, 12 insertions(+), 10 deletions(-) > > diff --git a/drivers/net/ethernet/chelsio/inline_crypto/chtls/chtls_cm.c b/drivers/net/ethernet/chelsio/inline_crypto/chtls/chtls_cm.c > index 6f6525983130..b15b26db8902 100644 > --- a/drivers/net/ethernet/chelsio/inline_crypto/chtls/chtls_cm.c > +++ b/drivers/net/ethernet/chelsio/inline_crypto/chtls/chtls_cm.c > @@ -1866,7 +1866,7 @@ static void chtls_timewait(struct sock *sk) > struct tcp_sock *tp = tcp_sk(sk); > > tp->rcv_nxt++; > - tp->rx_opt.ts_recent_stamp = ktime_get_seconds(); > + tp->rx_opt.ts_recent_stamp = tcp_clock_ms(); > tp->srtt_us = 0; > tcp_time_wait(sk, TCP_TIME_WAIT, 0); > } > diff --git a/include/linux/tcp.h b/include/linux/tcp.h > index 6a5e08b937b3..174257114ee4 100644 > --- a/include/linux/tcp.h > +++ b/include/linux/tcp.h > @@ -110,7 +110,7 @@ struct tcp_sack_block { > > struct tcp_options_received { > /* PAWS/RTTM data */ > - int ts_recent_stamp;/* Time we stored ts_recent (for aging) */ > + u32 ts_recent_stamp;/* Time we stored ts_recent (for aging) */ > u32 ts_recent; /* Time stamp to echo next */ > u32 rcv_tsval; /* Time stamp value */ > u32 rcv_tsecr; /* Time stamp echo reply */ > @@ -543,7 +543,7 @@ struct tcp_timewait_sock { > /* The time we sent the last out-of-window ACK: */ > u32 tw_last_oow_ack_time; > > - int tw_ts_recent_stamp; > + u32 tw_ts_recent_stamp; > u32 tw_tx_delay; > #ifdef CONFIG_TCP_MD5SIG > struct tcp_md5sig_key *tw_md5_key; > diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c > index e37488d3453f..873a1cbd6d14 100644 > --- a/net/ipv4/tcp_input.c > +++ b/net/ipv4/tcp_input.c > @@ -3778,7 +3778,7 @@ static void tcp_send_challenge_ack(struct sock *sk) > static void tcp_store_ts_recent(struct tcp_sock *tp) > { > tp->rx_opt.ts_recent = tp->rx_opt.rcv_tsval; > - tp->rx_opt.ts_recent_stamp = ktime_get_seconds(); > + tp->rx_opt.ts_recent_stamp = tcp_clock_ms(); > } > > static void tcp_replace_ts_recent(struct tcp_sock *tp, u32 seq) > diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c > index fd17f25ff288..47e2dcda4eae 100644 > --- a/net/ipv4/tcp_ipv4.c > +++ b/net/ipv4/tcp_ipv4.c > @@ -116,7 +116,7 @@ int tcp_twsk_unique(struct sock *sk, struct sock *sktw, void *twp) > const struct inet_timewait_sock *tw = inet_twsk(sktw); > const struct tcp_timewait_sock *tcptw = tcp_twsk(sktw); > struct tcp_sock *tp = tcp_sk(sk); > - int ts_recent_stamp; > + u32 ts_recent_stamp; > > if (reuse == 2) { > /* Still does not detect *everything* that goes through > @@ -157,8 +157,7 @@ int tcp_twsk_unique(struct sock *sk, struct sock *sktw, void *twp) > */ > ts_recent_stamp = READ_ONCE(tcptw->tw_ts_recent_stamp); > if (ts_recent_stamp && > - (!twp || (reuse && time_after32(ktime_get_seconds(), > - ts_recent_stamp)))) { > + (!twp || (reuse && (u32)tcp_clock_ms() != ts_recent_stamp))) { At first glance, I wonder whether 1 ms is really too short, especially for most cases? If the rtt is 2-3 ms which is quite often seen in production, we may lose our opportunity to change the sub-state of timewait socket and finish the work that should be done as expected. One second is safe for most cases, of course, since I obscurely remember I've read one paper (tuning the initial window to 10) saying in Google the cases exceeding 100ms rtt is rare but exists. So I still feel a fixed short value is not that appropriate... Like you said, how about converting the fixed value into a tunable one and keeping 1 second as the default value? After you submit the next version, I think I can try it and test it locally :) It's interesting! Thanks, Jason
On Mon, Aug 19, 2024 at 01:59 PM +02, Eric Dumazet wrote: > On Mon, Aug 19, 2024 at 1:31 PM Jakub Sitnicki <jakub@cloudflare.com> wrote: >> >> [This patch needs a description. Please see the RFC cover letter below.] >> >> Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com> >> --- >> Can we shorten the TCP connection reincarnation period? >> >> Situation >> ========= >> >> Currently, we can reuse a TCP 4-tuple (source IP + port, destination IP + port) >> in the TIME-WAIT state to establish a new outgoing TCP connection after a period >> of 1 second. This period, during which the 4-tuple remains blocked from reuse, >> is determined by the granularity of the ts_recent_stamp / tw_ts_recent_stamp >> timestamp, which presently uses a 1 Hz clock (ktime_get_seconds). >> >> The TIME-WAIT block is enforced by __{inet,inet6}_check_established -> >> tcp_twsk_unique, where we check if the timestamp clock has ticked since the last >> ts_recent_stamp update before allowing the 4-tuple to be reused. >> >> This mechanism, introduced in 2002 by commit b8439924316d ("Allow to bind to an >> already in use local port during connect") [1], protects the TCP receiver >> against segments from an earlier incarnation of the same connection (FIN >> retransmits), which could potentially corrupt the TCP stream, as described by >> RFC 7323 [2, 3]. >> >> Problem >> ======= >> >> The one-second reincarnation period has not posed a problem when we had a >> sufficiently large pool of ephemeral ports (tens of thousands per host). > > > We now have network namespaces, and still ~30,000 ephemeral ports per netns :) It's just that we are short on public IPv4 addresses with certain traits we need to proxy on egress (like ownership, reputation, geolocation). Hence we had to share the addresses and divide the port space between hosts :-/ > >> However, as we began sharing egress IPv4 addresses between hosts by partitioning >> the available port range [4], the ephemeral port pool size has shrunk >> significantly—down to hundreds of ports per host. >> >> This reduction in port pool size has made it clear that a one-second connection >> reincarnation period can lead to ephemeral port exhaustion. Short-lived TCP >> connections, such as DNS queries, can complete in milliseconds, yet the TCP >> 4-tuple remains blocked for a period of time that is orders of magnitude longer. >> >> Solution >> ======== >> >> We would like to propose to shorten the period during which the 4-tuple is tied >> up. The intention is to enable TIME-WAIT reuse at least as quickly as it takes >> nowadays to perform of a short-lived TCP connection, from setup to teardown. >> >> The ts_recent_stamp protection is based on the same principle as PAWS but >> extends it across TCP connections. As RFC 7323 outlines in Appendix B.2, point >> (b): >> >> An additional mechanism could be added to the TCP, a per-host >> cache of the last timestamp received from any connection. This >> value could then be used in the PAWS mechanism to reject old >> duplicate segments from earlier incarnations of the connection, >> if the timestamp clock can be guaranteed to have ticked at least >> once since the old connection was open. This would require that >> the TIME-WAIT delay plus the RTT together must be at least one >> tick of the sender's timestamp clock. Such an extension is not >> part of the proposal of this RFC. > > Note the RTT part here. I do not see this implemented in your patch. > Not sure I follow. I need to look into that more. My initial thinking here was that as long as TW delay (1 msec) is not shorter than one tick of the sender's TS clock (1 msec), then I can ignore the RTT and the requirement is still met. >> >> Due to that, we would want to follow the same guidelines as the for TSval >> timestamp clock, for which RFC 7323 recommends a frequency in the range of 1 ms >> to 1 sec per tick [5], when reconsidering the default setting. >> >> (Note that the Linux TCP stack has recently introduced even finer granularity >> with microsecond TSval resolution in commit 614e8316aa4c "tcp: add support for >> usec resolution in TCP TS values" [6] for use in private networks.) >> >> A simple implementation could be to switch from a second to a millisecond clock, >> as demonstrated by the following patch. However, this could also be a tunable >> option to allow administrators to adjust it based on their specific needs and >> risk tolerance. >> >> A tunable also opens the door to letting users set the TIME-WAIT reuse period >> beyond the RFC 7323 recommended range at their own risk. >> [...] >> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c >> index e37488d3453f..873a1cbd6d14 100644 >> --- a/net/ipv4/tcp_input.c >> +++ b/net/ipv4/tcp_input.c >> @@ -3778,7 +3778,7 @@ static void tcp_send_challenge_ack(struct sock *sk) >> static void tcp_store_ts_recent(struct tcp_sock *tp) >> { >> tp->rx_opt.ts_recent = tp->rx_opt.rcv_tsval; >> - tp->rx_opt.ts_recent_stamp = ktime_get_seconds(); >> + tp->rx_opt.ts_recent_stamp = tcp_clock_ms(); > > Please do not abuse tcp_clock_ms(). > > Instead use tcp_time_stamp_ms(tp) > > Same remark for other parts of the patch, try to reuse tp->tcp_mstamp > if available. > > Also, (tcp_clock_ms() != ts_recent_stamp) can be true even after one > usec has elapsed, due to rounding. > > The 'one second delay' was really: 'An average of 0.5 second delay' > > Solution : no longer use jiffies, but usec based timestamps, since we > already have this infrastructure in TCP stack. Thank you for feedback. Especially wrt the rounding bug - eye opening. Will rework it to move away from jiffies.
Hi Jason, On Mon, Aug 19, 2024 at 08:27 PM +08, Jason Xing wrote: > On Mon, Aug 19, 2024 at 7:31 PM Jakub Sitnicki <jakub@cloudflare.com> wrote: [...] >> --- a/net/ipv4/tcp_ipv4.c >> +++ b/net/ipv4/tcp_ipv4.c >> @@ -116,7 +116,7 @@ int tcp_twsk_unique(struct sock *sk, struct sock *sktw, void *twp) >> const struct inet_timewait_sock *tw = inet_twsk(sktw); >> const struct tcp_timewait_sock *tcptw = tcp_twsk(sktw); >> struct tcp_sock *tp = tcp_sk(sk); >> - int ts_recent_stamp; >> + u32 ts_recent_stamp; >> >> if (reuse == 2) { >> /* Still does not detect *everything* that goes through >> @@ -157,8 +157,7 @@ int tcp_twsk_unique(struct sock *sk, struct sock *sktw, void *twp) >> */ >> ts_recent_stamp = READ_ONCE(tcptw->tw_ts_recent_stamp); >> if (ts_recent_stamp && >> - (!twp || (reuse && time_after32(ktime_get_seconds(), >> - ts_recent_stamp)))) { >> + (!twp || (reuse && (u32)tcp_clock_ms() != ts_recent_stamp))) { > > At first glance, I wonder whether 1 ms is really too short, especially > for most cases? If the rtt is 2-3 ms which is quite often seen in > production, we may lose our opportunity to change the sub-state of > timewait socket and finish the work that should be done as expected. Good point about TW state management. Haven't thought of that. > One second is safe for most cases, of course, since I obscurely > remember I've read one paper (tuning the initial window to 10) saying > in Google the cases exceeding 100ms rtt is rare but exists. So I still > feel a fixed short value is not that appropriate... > > Like you said, how about converting the fixed value into a tunable one > and keeping 1 second as the default value? I'm also leaning toward a tunable. The adoption could then be based on an opt-in model. We don't want to break any existing setups, naturally. > After you submit the next version, I think I can try it and test it > locally :) It's interesting! Thanks for feedback, (the other) Jakub
diff --git a/drivers/net/ethernet/chelsio/inline_crypto/chtls/chtls_cm.c b/drivers/net/ethernet/chelsio/inline_crypto/chtls/chtls_cm.c index 6f6525983130..b15b26db8902 100644 --- a/drivers/net/ethernet/chelsio/inline_crypto/chtls/chtls_cm.c +++ b/drivers/net/ethernet/chelsio/inline_crypto/chtls/chtls_cm.c @@ -1866,7 +1866,7 @@ static void chtls_timewait(struct sock *sk) struct tcp_sock *tp = tcp_sk(sk); tp->rcv_nxt++; - tp->rx_opt.ts_recent_stamp = ktime_get_seconds(); + tp->rx_opt.ts_recent_stamp = tcp_clock_ms(); tp->srtt_us = 0; tcp_time_wait(sk, TCP_TIME_WAIT, 0); } diff --git a/include/linux/tcp.h b/include/linux/tcp.h index 6a5e08b937b3..174257114ee4 100644 --- a/include/linux/tcp.h +++ b/include/linux/tcp.h @@ -110,7 +110,7 @@ struct tcp_sack_block { struct tcp_options_received { /* PAWS/RTTM data */ - int ts_recent_stamp;/* Time we stored ts_recent (for aging) */ + u32 ts_recent_stamp;/* Time we stored ts_recent (for aging) */ u32 ts_recent; /* Time stamp to echo next */ u32 rcv_tsval; /* Time stamp value */ u32 rcv_tsecr; /* Time stamp echo reply */ @@ -543,7 +543,7 @@ struct tcp_timewait_sock { /* The time we sent the last out-of-window ACK: */ u32 tw_last_oow_ack_time; - int tw_ts_recent_stamp; + u32 tw_ts_recent_stamp; u32 tw_tx_delay; #ifdef CONFIG_TCP_MD5SIG struct tcp_md5sig_key *tw_md5_key; diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index e37488d3453f..873a1cbd6d14 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -3778,7 +3778,7 @@ static void tcp_send_challenge_ack(struct sock *sk) static void tcp_store_ts_recent(struct tcp_sock *tp) { tp->rx_opt.ts_recent = tp->rx_opt.rcv_tsval; - tp->rx_opt.ts_recent_stamp = ktime_get_seconds(); + tp->rx_opt.ts_recent_stamp = tcp_clock_ms(); } static void tcp_replace_ts_recent(struct tcp_sock *tp, u32 seq) diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c index fd17f25ff288..47e2dcda4eae 100644 --- a/net/ipv4/tcp_ipv4.c +++ b/net/ipv4/tcp_ipv4.c @@ -116,7 +116,7 @@ int tcp_twsk_unique(struct sock *sk, struct sock *sktw, void *twp) const struct inet_timewait_sock *tw = inet_twsk(sktw); const struct tcp_timewait_sock *tcptw = tcp_twsk(sktw); struct tcp_sock *tp = tcp_sk(sk); - int ts_recent_stamp; + u32 ts_recent_stamp; if (reuse == 2) { /* Still does not detect *everything* that goes through @@ -157,8 +157,7 @@ int tcp_twsk_unique(struct sock *sk, struct sock *sktw, void *twp) */ ts_recent_stamp = READ_ONCE(tcptw->tw_ts_recent_stamp); if (ts_recent_stamp && - (!twp || (reuse && time_after32(ktime_get_seconds(), - ts_recent_stamp)))) { + (!twp || (reuse && (u32)tcp_clock_ms() != ts_recent_stamp))) { /* inet_twsk_hashdance_schedule() sets sk_refcnt after putting twsk * and releasing the bucket lock. */ diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c index a19a9dbd3409..d2a62c88806f 100644 --- a/net/ipv4/tcp_minisocks.c +++ b/net/ipv4/tcp_minisocks.c @@ -101,7 +101,7 @@ tcp_timewait_state_process(struct inet_timewait_sock *tw, struct sk_buff *skb, struct tcp_options_received tmp_opt; struct tcp_timewait_sock *tcptw = tcp_twsk((struct sock *)tw); bool paws_reject = false; - int ts_recent_stamp; + u32 ts_recent_stamp; tmp_opt.saw_tstamp = 0; ts_recent_stamp = READ_ONCE(tcptw->tw_ts_recent_stamp); @@ -576,7 +576,7 @@ struct sock *tcp_create_openreq_child(const struct sock *sk, if (newtp->rx_opt.tstamp_ok) { newtp->tcp_usec_ts = treq->req_usec_ts; newtp->rx_opt.ts_recent = READ_ONCE(req->ts_recent); - newtp->rx_opt.ts_recent_stamp = ktime_get_seconds(); + newtp->rx_opt.ts_recent_stamp = tcp_clock_ms(); newtp->tcp_header_len = sizeof(struct tcphdr) + TCPOLEN_TSTAMP_ALIGNED; } else { newtp->tcp_usec_ts = 0; @@ -659,6 +659,8 @@ struct sock *tcp_check_req(struct sock *sk, struct sk_buff *skb, tcp_parse_options(sock_net(sk), skb, &tmp_opt, 0, NULL); if (tmp_opt.saw_tstamp) { + unsigned int rsk_timeout; + tmp_opt.ts_recent = READ_ONCE(req->ts_recent); if (tmp_opt.rcv_tsecr) tmp_opt.rcv_tsecr -= tcp_rsk(req)->ts_off; @@ -666,7 +668,8 @@ struct sock *tcp_check_req(struct sock *sk, struct sk_buff *skb, * it can be estimated (approximately) * from another data. */ - tmp_opt.ts_recent_stamp = ktime_get_seconds() - reqsk_timeout(req, TCP_RTO_MAX) / HZ; + rsk_timeout = jiffies_to_msecs(reqsk_timeout(req, TCP_RTO_MAX)); + tmp_opt.ts_recent_stamp = tcp_clock_ms() - rsk_timeout; paws_reject = tcp_paws_reject(&tmp_opt, th->rst); } }
[This patch needs a description. Please see the RFC cover letter below.] Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com> --- Can we shorten the TCP connection reincarnation period? Situation ========= Currently, we can reuse a TCP 4-tuple (source IP + port, destination IP + port) in the TIME-WAIT state to establish a new outgoing TCP connection after a period of 1 second. This period, during which the 4-tuple remains blocked from reuse, is determined by the granularity of the ts_recent_stamp / tw_ts_recent_stamp timestamp, which presently uses a 1 Hz clock (ktime_get_seconds). The TIME-WAIT block is enforced by __{inet,inet6}_check_established -> tcp_twsk_unique, where we check if the timestamp clock has ticked since the last ts_recent_stamp update before allowing the 4-tuple to be reused. This mechanism, introduced in 2002 by commit b8439924316d ("Allow to bind to an already in use local port during connect") [1], protects the TCP receiver against segments from an earlier incarnation of the same connection (FIN retransmits), which could potentially corrupt the TCP stream, as described by RFC 7323 [2, 3]. Problem ======= The one-second reincarnation period has not posed a problem when we had a sufficiently large pool of ephemeral ports (tens of thousands per host). However, as we began sharing egress IPv4 addresses between hosts by partitioning the available port range [4], the ephemeral port pool size has shrunk significantly—down to hundreds of ports per host. This reduction in port pool size has made it clear that a one-second connection reincarnation period can lead to ephemeral port exhaustion. Short-lived TCP connections, such as DNS queries, can complete in milliseconds, yet the TCP 4-tuple remains blocked for a period of time that is orders of magnitude longer. Solution ======== We would like to propose to shorten the period during which the 4-tuple is tied up. The intention is to enable TIME-WAIT reuse at least as quickly as it takes nowadays to perform of a short-lived TCP connection, from setup to teardown. The ts_recent_stamp protection is based on the same principle as PAWS but extends it across TCP connections. As RFC 7323 outlines in Appendix B.2, point (b): An additional mechanism could be added to the TCP, a per-host cache of the last timestamp received from any connection. This value could then be used in the PAWS mechanism to reject old duplicate segments from earlier incarnations of the connection, if the timestamp clock can be guaranteed to have ticked at least once since the old connection was open. This would require that the TIME-WAIT delay plus the RTT together must be at least one tick of the sender's timestamp clock. Such an extension is not part of the proposal of this RFC. Due to that, we would want to follow the same guidelines as the for TSval timestamp clock, for which RFC 7323 recommends a frequency in the range of 1 ms to 1 sec per tick [5], when reconsidering the default setting. (Note that the Linux TCP stack has recently introduced even finer granularity with microsecond TSval resolution in commit 614e8316aa4c "tcp: add support for usec resolution in TCP TS values" [6] for use in private networks.) A simple implementation could be to switch from a second to a millisecond clock, as demonstrated by the following patch. However, this could also be a tunable option to allow administrators to adjust it based on their specific needs and risk tolerance. A tunable also opens the door to letting users set the TIME-WAIT reuse period beyond the RFC 7323 recommended range at their own risk. Workaround ========== Today, when an application has only a small ephemeral port pool available, we work around the 1-second reincarnation period by manually selecting the local port with an explicit bind(). This has been possible since the introduction of the ts_recent_stamp protection mechanism [1]. However, it is unclear why this is allowed for egress connections. To guide readers to the relevant code: if the local port is selected by the user, we do not pass a TIME-WAIT socket to the check_established helper from __inet_hash_connect. This way we circumvent the timestamp check in tcp_twsk_unique [7] (as twp is NULL). However, relying on this workaround conflicts with our goal of delegating TCP local port selection to the network stack, using the IP_BIND_ADDRESS_NO_PORT [8] and IP_LOCAL_PORT_RANGE [9] socket options. [1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=b8439924316d5bcb266d165b93d632a4b4b859af [2] https://datatracker.ietf.org/doc/html/rfc7323#section-5.8 [3] https://datatracker.ietf.org/doc/html/rfc7323#appendix-B [4] https://lpc.events/event/16/contributions/1349/ [5] https://datatracker.ietf.org/doc/html/rfc7323#section-5.4 [6] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=614e8316aa4cafba3e204cb8ee48bd12b92f3d93 [7] https://elixir.bootlin.com/linux/v6.10/source/net/ipv4/tcp_ipv4.c#L156 [8] https://manpages.debian.org/unstable/manpages/ip.7.en.html#IP_BIND_ADDRESS_NO_PORT [9] https://manpages.debian.org/unstable/manpages/ip.7.en.html#IP_LOCAL_PORT_RANGE --- --- drivers/net/ethernet/chelsio/inline_crypto/chtls/chtls_cm.c | 2 +- include/linux/tcp.h | 4 ++-- net/ipv4/tcp_input.c | 2 +- net/ipv4/tcp_ipv4.c | 5 ++--- net/ipv4/tcp_minisocks.c | 9 ++++++--- 5 files changed, 12 insertions(+), 10 deletions(-)