[RFC,net-next] tcp: Allow TIME-WAIT reuse after 1 millisecond

Message ID 20240819-jakub-krn-909-poc-msec-tw-tstamp-v1-1-6567b5006fbe@cloudflare.com (mailing list archive)
State RFC
Delegated to: Netdev Maintainers
Headers show
Series [RFC,net-next] tcp: Allow TIME-WAIT reuse after 1 millisecond | expand

Checks

Context Check Description
netdev/series_format success Single patches do not need cover letters
netdev/tree_selection success Clearly marked for net-next, async
netdev/ynl success Generated files up to date; no warnings/errors; no diff in generated;
netdev/fixes_present success Fixes tag not required for -next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 27 this patch: 27
netdev/build_tools success Errors and warnings before: 0 this patch: 0
netdev/cc_maintainers warning 4 maintainers not CCed: pabeni@redhat.com kuba@kernel.org ayush.sawal@chelsio.com dsahern@kernel.org
netdev/build_clang success Errors and warnings before: 49 this patch: 49
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn success Errors and warnings before: 2092 this patch: 2092
netdev/checkpatch warning WARNING: line length of 88 exceeds 80 columns
netdev/build_clang_rust success No Rust files in patch. Skipping build
netdev/kdoc success Errors and warnings before: 3 this patch: 3
netdev/source_inline success Was 0 now: 0

Commit Message

Jakub Sitnicki Aug. 19, 2024, 11:31 a.m. UTC
[This patch needs a description. Please see the RFC cover letter below.]

Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
---
Can we shorten the TCP connection reincarnation period?

Situation
=========

Currently, we can reuse a TCP 4-tuple (source IP + port, destination IP + port)
in the TIME-WAIT state to establish a new outgoing TCP connection after a period
of 1 second. This period, during which the 4-tuple remains blocked from reuse,
is determined by the granularity of the ts_recent_stamp / tw_ts_recent_stamp
timestamp, which presently uses a 1 Hz clock (ktime_get_seconds).

The TIME-WAIT block is enforced by __{inet,inet6}_check_established ->
tcp_twsk_unique, where we check if the timestamp clock has ticked since the last
ts_recent_stamp update before allowing the 4-tuple to be reused.
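
As an illustration, the gate can be modeled in userspace (a sketch, not the
kernel code; with the current 1 Hz clock, one tick below corresponds to one
second):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Userspace model of the reuse gate enforced via tcp_twsk_unique():
 * a TIME-WAIT 4-tuple may be reused only after the timestamp clock has
 * ticked at least once since ts_recent_stamp was last written. */
bool tw_reuse_allowed(uint32_t now_tick, uint32_t ts_recent_stamp)
{
	/* time_after32()-style wrap-safe comparison */
	return (int32_t)(now_tick - ts_recent_stamp) > 0;
}
```

With a 1 Hz clock this means the 4-tuple stays blocked until the next full
second boundary, no matter how quickly the previous connection finished.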

This mechanism, introduced in 2002 by commit b8439924316d ("Allow to bind to an
already in use local port during connect") [1], protects the TCP receiver
against segments from an earlier incarnation of the same connection (FIN
retransmits), which could potentially corrupt the TCP stream, as described by
RFC 7323 [2, 3].

Problem
=======

The one-second reincarnation period has not posed a problem when we had a
sufficiently large pool of ephemeral ports (tens of thousands per host).
However, as we began sharing egress IPv4 addresses between hosts by partitioning
the available port range [4], the ephemeral port pool size has shrunk
significantly—down to hundreds of ports per host.

This reduction in port pool size has made it clear that a one-second connection
reincarnation period can lead to ephemeral port exhaustion. Short-lived TCP
connections, such as DNS queries, can complete in milliseconds, yet the TCP
4-tuple remains blocked for a period of time that is orders of magnitude longer.

Solution
========

We would like to propose shortening the period during which the 4-tuple is tied
up. The intention is to enable TIME-WAIT reuse at least as quickly as it takes
nowadays to complete a short-lived TCP connection, from setup to teardown.

The ts_recent_stamp protection is based on the same principle as PAWS but
extends it across TCP connections. As RFC 7323 outlines in Appendix B.2, point
(b):

    An additional mechanism could be added to the TCP, a per-host
    cache of the last timestamp received from any connection.  This
    value could then be used in the PAWS mechanism to reject old
    duplicate segments from earlier incarnations of the connection,
    if the timestamp clock can be guaranteed to have ticked at least
    once since the old connection was open.  This would require that
    the TIME-WAIT delay plus the RTT together must be at least one
    tick of the sender's timestamp clock.  Such an extension is not
    part of the proposal of this RFC.

Due to that, we would want to follow the same guidelines as for the TSval
timestamp clock, for which RFC 7323 recommends a frequency in the range of 1 ms
to 1 sec per tick [5], when reconsidering the default setting.

(Note that the Linux TCP stack has recently introduced even finer granularity
with microsecond TSval resolution in commit 614e8316aa4c "tcp: add support for
usec resolution in TCP TS values" [6] for use in private networks.)

A simple implementation could be to switch from a second to a millisecond clock,
as demonstrated by the following patch. However, this could also be a tunable
option to allow administrators to adjust it based on their specific needs and
risk tolerance.

A tunable also opens the door to letting users set the TIME-WAIT reuse period
beyond the RFC 7323 recommended range at their own risk.
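
For a back-of-the-envelope comparison (a model, not kernel code): with a reuse
clock of period T, a connection closing at a uniformly random instant within a
tick stays blocked until the next tick, i.e. between 0 and T, about T/2 on
average:

```c
#include <assert.h>
#include <stdint.h>

/* Average time (in microseconds) a 4-tuple stays blocked before the
 * reuse clock ticks, assuming the connection closes at a uniformly
 * random point within a tick of period tick_us. */
uint64_t avg_block_us(uint64_t tick_us)
{
	return tick_us / 2;
}
```

Moving from a 1 sec to a 1 ms clock thus shrinks the expected block from
roughly 500 ms to roughly 0.5 ms.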

Workaround
==========

Today, when an application has only a small ephemeral port pool available, we
work around the 1-second reincarnation period by manually selecting the local
port with an explicit bind().

This has been possible since the introduction of the ts_recent_stamp protection
mechanism [1]. However, it is unclear why this is allowed for egress
connections.

To guide readers to the relevant code: if the local port is selected by the
user, we do not pass a TIME-WAIT socket to the check_established helper from
__inet_hash_connect. This way we circumvent the timestamp check in
tcp_twsk_unique [7] (as twp is NULL).
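
A minimal userspace sketch of this workaround follows (the real workaround
binds a specific port from the shrunken range; port 0 is used here only so the
sketch runs anywhere, and the remote address and connect() call are elided):

```c
#include <assert.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Select the local port ourselves with an explicit bind() before
 * connect(). With the port chosen by the user, the timestamp check in
 * tcp_twsk_unique() is skipped (twp is NULL). Returns 0 on success. */
int bind_local_port(uint16_t port)
{
	int fd = socket(AF_INET, SOCK_STREAM, 0);
	if (fd < 0)
		return -1;

	int one = 1;
	/* SO_REUSEADDR allows rebinding a port whose previous
	 * connection is still in TIME-WAIT */
	if (setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one)) < 0) {
		close(fd);
		return -1;
	}

	struct sockaddr_in local;
	memset(&local, 0, sizeof(local));
	local.sin_family = AF_INET;
	local.sin_port = htons(port);
	local.sin_addr.s_addr = htonl(INADDR_LOOPBACK);

	if (bind(fd, (struct sockaddr *)&local, sizeof(local)) < 0) {
		close(fd);
		return -1;
	}
	/* ... connect(fd, &remote, sizeof(remote)) would follow here ... */
	close(fd);
	return 0;
}
```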

However, relying on this workaround conflicts with our goal of delegating TCP
local port selection to the network stack, using the IP_BIND_ADDRESS_NO_PORT [8]
and IP_LOCAL_PORT_RANGE [9] socket options.

[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=b8439924316d5bcb266d165b93d632a4b4b859af
[2] https://datatracker.ietf.org/doc/html/rfc7323#section-5.8
[3] https://datatracker.ietf.org/doc/html/rfc7323#appendix-B
[4] https://lpc.events/event/16/contributions/1349/
[5] https://datatracker.ietf.org/doc/html/rfc7323#section-5.4
[6] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=614e8316aa4cafba3e204cb8ee48bd12b92f3d93
[7] https://elixir.bootlin.com/linux/v6.10/source/net/ipv4/tcp_ipv4.c#L156
[8] https://manpages.debian.org/unstable/manpages/ip.7.en.html#IP_BIND_ADDRESS_NO_PORT
[9] https://manpages.debian.org/unstable/manpages/ip.7.en.html#IP_LOCAL_PORT_RANGE
---

---
 drivers/net/ethernet/chelsio/inline_crypto/chtls/chtls_cm.c | 2 +-
 include/linux/tcp.h                                         | 4 ++--
 net/ipv4/tcp_input.c                                        | 2 +-
 net/ipv4/tcp_ipv4.c                                         | 5 ++---
 net/ipv4/tcp_minisocks.c                                    | 9 ++++++---
 5 files changed, 12 insertions(+), 10 deletions(-)

Comments

Eric Dumazet Aug. 19, 2024, 11:59 a.m. UTC | #1
On Mon, Aug 19, 2024 at 1:31 PM Jakub Sitnicki <jakub@cloudflare.com> wrote:
>
> [...]
>
> Problem
> =======
>
> The one-second reincarnation period has not posed a problem when we had a
> sufficiently large pool of ephemeral ports (tens of thousands per host).


We now have network namespaces, and still ~30,000 ephemeral ports per netns :)

> [...]
>
> The ts_recent_stamp protection is based on the same principle as PAWS but
> extends it across TCP connections. As RFC 7323 outlines in Appendix B.2, point
> (b):
>
>     An additional mechanism could be added to the TCP, a per-host
>     cache of the last timestamp received from any connection.  This
>     value could then be used in the PAWS mechanism to reject old
>     duplicate segments from earlier incarnations of the connection,
>     if the timestamp clock can be guaranteed to have ticked at least
>     once since the old connection was open.  This would require that
>     the TIME-WAIT delay plus the RTT together must be at least one
>     tick of the sender's timestamp clock.  Such an extension is not
>     part of the proposal of this RFC.

Note the RTT part here. I do not see this implemented in your patch.

>
> [...]
>
> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> index e37488d3453f..873a1cbd6d14 100644
> --- a/net/ipv4/tcp_input.c
> +++ b/net/ipv4/tcp_input.c
> @@ -3778,7 +3778,7 @@ static void tcp_send_challenge_ack(struct sock *sk)
>  static void tcp_store_ts_recent(struct tcp_sock *tp)
>  {
>         tp->rx_opt.ts_recent = tp->rx_opt.rcv_tsval;
> -       tp->rx_opt.ts_recent_stamp = ktime_get_seconds();
> +       tp->rx_opt.ts_recent_stamp = tcp_clock_ms();

Please do not abuse tcp_clock_ms().

Instead use tcp_time_stamp_ms(tp)

Same remark for other parts of the patch, try to reuse tp->tcp_mstamp
if available.

Also, (tcp_clock_ms() != ts_recent_stamp) can be true even after one
usec has elapsed, due to rounding.

The 'one second delay' was really: 'An average of 0.5 second delay'

Solution : no longer use jiffies, but usec based timestamps, since we
already have this infrastructure in TCP stack.
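
Eric's rounding point can be reproduced with a tiny userspace model: a stamp
taken just before a millisecond boundary already differs from the very next
read, even though almost no time has elapsed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Millisecond tick derived from a microsecond clock, as a model of
 * tcp_clock_ms() granularity. */
uint32_t ms_tick(uint64_t t_us)
{
	return (uint32_t)(t_us / 1000);
}

/* The "has the clock ticked?" comparison proposed for TIME-WAIT reuse. */
bool ticked(uint32_t now_tick, uint32_t stamp_tick)
{
	return now_tick != stamp_tick;
}
```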
Jason Xing Aug. 19, 2024, 12:27 p.m. UTC | #2
Hello Jakub,

On Mon, Aug 19, 2024 at 7:31 PM Jakub Sitnicki <jakub@cloudflare.com> wrote:
>
> [...]
>
> diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
> index fd17f25ff288..47e2dcda4eae 100644
> --- a/net/ipv4/tcp_ipv4.c
> +++ b/net/ipv4/tcp_ipv4.c
> @@ -116,7 +116,7 @@ int tcp_twsk_unique(struct sock *sk, struct sock *sktw, void *twp)
>         const struct inet_timewait_sock *tw = inet_twsk(sktw);
>         const struct tcp_timewait_sock *tcptw = tcp_twsk(sktw);
>         struct tcp_sock *tp = tcp_sk(sk);
> -       int ts_recent_stamp;
> +       u32 ts_recent_stamp;
>
>         if (reuse == 2) {
>                 /* Still does not detect *everything* that goes through
> @@ -157,8 +157,7 @@ int tcp_twsk_unique(struct sock *sk, struct sock *sktw, void *twp)
>          */
>         ts_recent_stamp = READ_ONCE(tcptw->tw_ts_recent_stamp);
>         if (ts_recent_stamp &&
> -           (!twp || (reuse && time_after32(ktime_get_seconds(),
> -                                           ts_recent_stamp)))) {
> +           (!twp || (reuse && (u32)tcp_clock_ms() != ts_recent_stamp))) {

At first glance, I wonder whether 1 ms is not too short for most cases.
If the RTT is 2-3 ms, which is quite often seen in production, we may
lose the opportunity to change the sub-state of the timewait socket and
finish the work that should be done as expected. One second is safe for
most cases, of course; I vaguely remember a paper (on tuning the initial
window to 10) saying that at Google RTTs exceeding 100 ms are rare but
do exist. So I still feel a fixed short value is not that appropriate...

Like you said, how about converting the fixed value into a tunable one
and keeping 1 second as the default value?

After you submit the next version, I think I can try it and test it
locally :) It's interesting!

Thanks,
Jason
Jakub Sitnicki Aug. 19, 2024, 1:44 p.m. UTC | #3
On Mon, Aug 19, 2024 at 01:59 PM +02, Eric Dumazet wrote:
> On Mon, Aug 19, 2024 at 1:31 PM Jakub Sitnicki <jakub@cloudflare.com> wrote:
>>
>> [...]
>>
>> The one-second reincarnation period has not posed a problem when we had a
>> sufficiently large pool of ephemeral ports (tens of thousands per host).
>
>
> We now have network namespaces, and still ~30,000 ephemeral ports per netns :)

It's just that we are short on public IPv4 addresses with certain traits
we need to proxy on egress (like ownership, reputation, geolocation).
Hence we had to share the addresses and divide the port space between
hosts :-/

>
>> [...]
>>
>> The ts_recent_stamp protection is based on the same principle as PAWS but
>> extends it across TCP connections. As RFC 7323 outlines in Appendix B.2, point
>> (b):
>>
>>     An additional mechanism could be added to the TCP, a per-host
>>     cache of the last timestamp received from any connection.  This
>>     value could then be used in the PAWS mechanism to reject old
>>     duplicate segments from earlier incarnations of the connection,
>>     if the timestamp clock can be guaranteed to have ticked at least
>>     once since the old connection was open.  This would require that
>>     the TIME-WAIT delay plus the RTT together must be at least one
>>     tick of the sender's timestamp clock.  Such an extension is not
>>     part of the proposal of this RFC.
>
> Note the RTT part here. I do not see this implemented in your patch.
>

Not sure I follow. I need to look into that more.

My initial thinking here was that as long as TW delay (1 msec) is not
shorter than one tick of the sender's TS clock (1 msec), then I can
ignore the RTT and the requirement is still met.


[...]

>> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
>> index e37488d3453f..873a1cbd6d14 100644
>> --- a/net/ipv4/tcp_input.c
>> +++ b/net/ipv4/tcp_input.c
>> @@ -3778,7 +3778,7 @@ static void tcp_send_challenge_ack(struct sock *sk)
>>  static void tcp_store_ts_recent(struct tcp_sock *tp)
>>  {
>>         tp->rx_opt.ts_recent = tp->rx_opt.rcv_tsval;
>> -       tp->rx_opt.ts_recent_stamp = ktime_get_seconds();
>> +       tp->rx_opt.ts_recent_stamp = tcp_clock_ms();
>
> Please do not abuse tcp_clock_ms().
>
> Instead use tcp_time_stamp_ms(tp)
>
> Same remark for other parts of the patch, try to reuse tp->tcp_mstamp
> if available.
>
> Also, (tcp_clock_ms() != ts_recent_stamp) can be true even after one
> usec has elapsed, due to rounding.
>
> The 'one second delay' was really: 'An average of 0.5 second delay'
>
> Solution : no longer use jiffies, but usec based timestamps, since we
> already have this infrastructure in TCP stack.

Thank you for feedback. Especially wrt the rounding bug - eye opening.

Will rework it to move away from jiffies.
Jakub Sitnicki Aug. 19, 2024, 1:54 p.m. UTC | #4
Hi Jason,

On Mon, Aug 19, 2024 at 08:27 PM +08, Jason Xing wrote:
> On Mon, Aug 19, 2024 at 7:31 PM Jakub Sitnicki <jakub@cloudflare.com> wrote:

[...]

>> --- a/net/ipv4/tcp_ipv4.c
>> +++ b/net/ipv4/tcp_ipv4.c
>> @@ -116,7 +116,7 @@ int tcp_twsk_unique(struct sock *sk, struct sock *sktw, void *twp)
>>         const struct inet_timewait_sock *tw = inet_twsk(sktw);
>>         const struct tcp_timewait_sock *tcptw = tcp_twsk(sktw);
>>         struct tcp_sock *tp = tcp_sk(sk);
>> -       int ts_recent_stamp;
>> +       u32 ts_recent_stamp;
>>
>>         if (reuse == 2) {
>>                 /* Still does not detect *everything* that goes through
>> @@ -157,8 +157,7 @@ int tcp_twsk_unique(struct sock *sk, struct sock *sktw, void *twp)
>>          */
>>         ts_recent_stamp = READ_ONCE(tcptw->tw_ts_recent_stamp);
>>         if (ts_recent_stamp &&
>> -           (!twp || (reuse && time_after32(ktime_get_seconds(),
>> -                                           ts_recent_stamp)))) {
>> +           (!twp || (reuse && (u32)tcp_clock_ms() != ts_recent_stamp))) {
>
> At first glance, I wonder whether 1 ms is too short, at least for most
> cases? If the RTT is 2-3 ms, which is quite common in production, we may
> lose the opportunity to change the sub-state of the timewait socket and
> finish the work that should be done as expected.

Good point about TW state management. Haven't thought of that.

> One second is safe for most cases, of course. I vaguely remember reading
> one paper (on tuning the initial window to 10) saying that at Google RTTs
> exceeding 100 ms are rare but do exist. So I still feel a fixed short
> value is not appropriate...
>
> Like you said, how about converting the fixed value into a tunable one
> and keeping 1 second as the default value?

I'm also leaning toward a tunable. The adoption could then be based on
an opt-in model. We don't want to break any existing setups, naturally.

> After you submit the next version, I think I can try it and test it
> locally :) It's interesting!

Thanks for the feedback,
(the other) Jakub

Patch

diff --git a/drivers/net/ethernet/chelsio/inline_crypto/chtls/chtls_cm.c b/drivers/net/ethernet/chelsio/inline_crypto/chtls/chtls_cm.c
index 6f6525983130..b15b26db8902 100644
--- a/drivers/net/ethernet/chelsio/inline_crypto/chtls/chtls_cm.c
+++ b/drivers/net/ethernet/chelsio/inline_crypto/chtls/chtls_cm.c
@@ -1866,7 +1866,7 @@  static void chtls_timewait(struct sock *sk)
 	struct tcp_sock *tp = tcp_sk(sk);
 
 	tp->rcv_nxt++;
-	tp->rx_opt.ts_recent_stamp = ktime_get_seconds();
+	tp->rx_opt.ts_recent_stamp = tcp_clock_ms();
 	tp->srtt_us = 0;
 	tcp_time_wait(sk, TCP_TIME_WAIT, 0);
 }
diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index 6a5e08b937b3..174257114ee4 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -110,7 +110,7 @@  struct tcp_sack_block {
 
 struct tcp_options_received {
 /*	PAWS/RTTM data	*/
-	int	ts_recent_stamp;/* Time we stored ts_recent (for aging) */
+	u32	ts_recent_stamp;/* Time we stored ts_recent (for aging) */
 	u32	ts_recent;	/* Time stamp to echo next		*/
 	u32	rcv_tsval;	/* Time stamp value             	*/
 	u32	rcv_tsecr;	/* Time stamp echo reply        	*/
@@ -543,7 +543,7 @@  struct tcp_timewait_sock {
 	/* The time we sent the last out-of-window ACK: */
 	u32			  tw_last_oow_ack_time;
 
-	int			  tw_ts_recent_stamp;
+	u32			  tw_ts_recent_stamp;
 	u32			  tw_tx_delay;
 #ifdef CONFIG_TCP_MD5SIG
 	struct tcp_md5sig_key	  *tw_md5_key;
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index e37488d3453f..873a1cbd6d14 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -3778,7 +3778,7 @@  static void tcp_send_challenge_ack(struct sock *sk)
 static void tcp_store_ts_recent(struct tcp_sock *tp)
 {
 	tp->rx_opt.ts_recent = tp->rx_opt.rcv_tsval;
-	tp->rx_opt.ts_recent_stamp = ktime_get_seconds();
+	tp->rx_opt.ts_recent_stamp = tcp_clock_ms();
 }
 
 static void tcp_replace_ts_recent(struct tcp_sock *tp, u32 seq)
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index fd17f25ff288..47e2dcda4eae 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -116,7 +116,7 @@  int tcp_twsk_unique(struct sock *sk, struct sock *sktw, void *twp)
 	const struct inet_timewait_sock *tw = inet_twsk(sktw);
 	const struct tcp_timewait_sock *tcptw = tcp_twsk(sktw);
 	struct tcp_sock *tp = tcp_sk(sk);
-	int ts_recent_stamp;
+	u32 ts_recent_stamp;
 
 	if (reuse == 2) {
 		/* Still does not detect *everything* that goes through
@@ -157,8 +157,7 @@  int tcp_twsk_unique(struct sock *sk, struct sock *sktw, void *twp)
 	 */
 	ts_recent_stamp = READ_ONCE(tcptw->tw_ts_recent_stamp);
 	if (ts_recent_stamp &&
-	    (!twp || (reuse && time_after32(ktime_get_seconds(),
-					    ts_recent_stamp)))) {
+	    (!twp || (reuse && (u32)tcp_clock_ms() != ts_recent_stamp))) {
 		/* inet_twsk_hashdance_schedule() sets sk_refcnt after putting twsk
 		 * and releasing the bucket lock.
 		 */
diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
index a19a9dbd3409..d2a62c88806f 100644
--- a/net/ipv4/tcp_minisocks.c
+++ b/net/ipv4/tcp_minisocks.c
@@ -101,7 +101,7 @@  tcp_timewait_state_process(struct inet_timewait_sock *tw, struct sk_buff *skb,
 	struct tcp_options_received tmp_opt;
 	struct tcp_timewait_sock *tcptw = tcp_twsk((struct sock *)tw);
 	bool paws_reject = false;
-	int ts_recent_stamp;
+	u32 ts_recent_stamp;
 
 	tmp_opt.saw_tstamp = 0;
 	ts_recent_stamp = READ_ONCE(tcptw->tw_ts_recent_stamp);
@@ -576,7 +576,7 @@  struct sock *tcp_create_openreq_child(const struct sock *sk,
 	if (newtp->rx_opt.tstamp_ok) {
 		newtp->tcp_usec_ts = treq->req_usec_ts;
 		newtp->rx_opt.ts_recent = READ_ONCE(req->ts_recent);
-		newtp->rx_opt.ts_recent_stamp = ktime_get_seconds();
+		newtp->rx_opt.ts_recent_stamp = tcp_clock_ms();
 		newtp->tcp_header_len = sizeof(struct tcphdr) + TCPOLEN_TSTAMP_ALIGNED;
 	} else {
 		newtp->tcp_usec_ts = 0;
@@ -659,6 +659,8 @@  struct sock *tcp_check_req(struct sock *sk, struct sk_buff *skb,
 		tcp_parse_options(sock_net(sk), skb, &tmp_opt, 0, NULL);
 
 		if (tmp_opt.saw_tstamp) {
+			unsigned int rsk_timeout;
+
 			tmp_opt.ts_recent = READ_ONCE(req->ts_recent);
 			if (tmp_opt.rcv_tsecr)
 				tmp_opt.rcv_tsecr -= tcp_rsk(req)->ts_off;
@@ -666,7 +668,8 @@  struct sock *tcp_check_req(struct sock *sk, struct sk_buff *skb,
 			 * it can be estimated (approximately)
 			 * from another data.
 			 */
-			tmp_opt.ts_recent_stamp = ktime_get_seconds() - reqsk_timeout(req, TCP_RTO_MAX) / HZ;
+			rsk_timeout = jiffies_to_msecs(reqsk_timeout(req, TCP_RTO_MAX));
+			tmp_opt.ts_recent_stamp = tcp_clock_ms() - rsk_timeout;
 			paws_reject = tcp_paws_reject(&tmp_opt, th->rst);
 		}
 	}
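To make the semantic change in the net/ipv4/tcp_ipv4.c hunk above concrete: the patch replaces the wraparound-safe ordering test time_after32() with a plain inequality on the ms clock. A minimal sketch (plain Python with hypothetical tick values, not kernel code) of how the two comparisons behave:

```python
def u32(x):
    """Wrap to an unsigned 32-bit value, like the kernel's u32 arithmetic."""
    return x & 0xFFFFFFFF

def s32(x):
    """Reinterpret the low 32 bits as a signed 32-bit value."""
    x &= 0xFFFFFFFF
    return x - (1 << 32) if x & 0x80000000 else x

def time_after32(a, b):
    """Kernel-style wraparound-safe 'a is strictly after b' for u32 ticks."""
    return s32(u32(a) - u32(b)) > 0

stamp = 1000
print(time_after32(1001, stamp), 1001 != stamp)  # True True: one tick later
print(time_after32(1000, stamp), 1000 != stamp)  # False False: same tick
print(time_after32(999, stamp), 999 != stamp)    # False True: '!=' also
                                                 #   fires if the stamp is ahead
# time_after32() stays correct across the u32 wrap (~49.7 days at ms ticks):
print(time_after32(0, 0xFFFFFFFF))               # True: 0 is one tick "after"
```

The practical upshot: `!=` becomes true on the very next millisecond tick, which is what permits reuse after 1 ms, but it gives up the "strictly later" ordering guarantee that time_after32() provides.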