tcp: use rto_min value from socket in retransmits timeout

Message ID	20210723093938.49354-1-zeil@yandex-team.ru (mailing list archive)
State	Rejected
Delegated to:	Netdev Maintainers
Headers	show Return-Path: <bpf-owner@kernel.org> From: Dmitry Yakunin <zeil@yandex-team.ru> To: kafai@fb.com, edumazet@google.com, netdev@vger.kernel.org, bpf@vger.kernel.org Cc: dmtrmonakhov@yandex-team.ru Subject: [PATCH] tcp: use rto_min value from socket in retransmits timeout Date: Fri, 23 Jul 2021 12:39:38 +0300 Message-Id: <20210723093938.49354-1-zeil@yandex-team.ru> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Precedence: bulk
Series	tcp: use rto_min value from socket in retransmits timeout \| expand tcp: use rto_min value from socket in retransmits timeout

Message ID

20210723093938.49354-1-zeil@yandex-team.ru (mailing list archive)

State

Rejected

Delegated to:

Netdev Maintainers

Headers

From: Dmitry Yakunin <zeil@yandex-team.ru>
To: kafai@fb.com, edumazet@google.com, netdev@vger.kernel.org,
        bpf@vger.kernel.org
Cc: dmtrmonakhov@yandex-team.ru
Subject: [PATCH] tcp: use rto_min value from socket in retransmits timeout
Date: Fri, 23 Jul 2021 12:39:38 +0300
Message-Id: <20210723093938.49354-1-zeil@yandex-team.ru>
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
Precedence: bulk

Series

tcp: use rto_min value from socket in retransmits timeout | expand

Context	Check	Description
netdev/cover_letter	success	Link
netdev/fixes_present	success	Link
netdev/patch_count	success	Link
netdev/tree_selection	success	Guessed tree name to be net-next
netdev/subject_prefix	warning	Target tree name not specified in the subject
netdev/cc_maintainers	fail	2 blamed authors not CCed: ast@kernel.org john.fastabend@gmail.com; 11 maintainers not CCed: dsahern@kernel.org yhs@fb.com kpsingh@kernel.org yoshfuji@linux-ipv6.org daniel@iogearbox.net andrii@kernel.org ast@kernel.org john.fastabend@gmail.com songliubraving@fb.com davem@davemloft.net kuba@kernel.org
netdev/source_inline	success	Was 0 now: 0
netdev/verify_signedoff	success	Link
netdev/module_param	success	Was 0 now: 0
netdev/build_32bit	success	Errors and warnings before: 1 this patch: 1
netdev/kdoc	success	Errors and warnings before: 0 this patch: 0
netdev/verify_fixes	success	Link
netdev/checkpatch	success	total: 0 errors, 0 warnings, 0 checks, 23 lines checked
netdev/build_allmodconfig_warn	success	Errors and warnings before: 1 this patch: 1
netdev/header_inline	success	Link

Context

Check

Description

netdev/cover_letter

success

Link

netdev/fixes_present

success

Link

netdev/patch_count

success

Link

netdev/tree_selection

success

Guessed tree name to be net-next

netdev/subject_prefix

warning

Target tree name not specified in the subject

netdev/cc_maintainers

fail

2 blamed authors not CCed: ast@kernel.org john.fastabend@gmail.com; 11 maintainers not CCed: dsahern@kernel.org yhs@fb.com kpsingh@kernel.org yoshfuji@linux-ipv6.org daniel@iogearbox.net andrii@kernel.org ast@kernel.org john.fastabend@gmail.com songliubraving@fb.com davem@davemloft.net kuba@kernel.org

netdev/source_inline

success

Was 0 now: 0

netdev/verify_signedoff

success

Link

netdev/module_param

success

Was 0 now: 0

netdev/build_32bit

success

Errors and warnings before: 1 this patch: 1

netdev/kdoc

success

Errors and warnings before: 0 this patch: 0

netdev/verify_fixes

success

Link

netdev/checkpatch

success

total: 0 errors, 0 warnings, 0 checks, 23 lines checked

netdev/build_allmodconfig_warn

success

Errors and warnings before: 1 this patch: 1

netdev/header_inline

success

Link

Commit Message

Dmitry Yakunin July 23, 2021, 9:39 a.m. UTC

Commit ca584ba07086 ("tcp: bpf: Add TCP_BPF_RTO_MIN for bpf_setsockopt")
adds ability to set rto_min value on socket less then default TCP_RTO_MIN.
But retransmits_timed_out() function still uses TCP_RTO_MIN and
tcp_retries{1,2} sysctls don't work properly for tuned socket values.

Fixes: ca584ba07086 ("tcp: bpf: Add TCP_BPF_RTO_MIN for bpf_setsockopt")
Signed-off-by: Dmitry Yakunin <zeil@yandex-team.ru>
Acked-by: Dmitry Monakhov <dmtrmonakhov@yandex-team.ru>
---
 net/ipv4/tcp_timer.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

Comments

Neal Cardwell July 23, 2021, 2:41 p.m. UTC | #1

.(On Fri, Jul 23, 2021 at 5:41 AM Dmitry Yakunin <zeil@yandex-team.ru> wrote:
>
> Commit ca584ba07086 ("tcp: bpf: Add TCP_BPF_RTO_MIN for bpf_setsockopt")
> adds ability to set rto_min value on socket less then default TCP_RTO_MIN.
> But retransmits_timed_out() function still uses TCP_RTO_MIN and
> tcp_retries{1,2} sysctls don't work properly for tuned socket values.
>
> Fixes: ca584ba07086 ("tcp: bpf: Add TCP_BPF_RTO_MIN for bpf_setsockopt")
> Signed-off-by: Dmitry Yakunin <zeil@yandex-team.ru>
> Acked-by: Dmitry Monakhov <dmtrmonakhov@yandex-team.ru>
> ---
>  net/ipv4/tcp_timer.c | 7 ++++---
>  1 file changed, 4 insertions(+), 3 deletions(-)
>
> diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
> index 20cf4a9..66c4b97 100644
> --- a/net/ipv4/tcp_timer.c
> +++ b/net/ipv4/tcp_timer.c
> @@ -199,12 +199,13 @@ static unsigned int tcp_model_timeout(struct sock *sk,
>   *  @boundary: max number of retransmissions
>   *  @timeout:  A custom timeout value.
>   *             If set to 0 the default timeout is calculated and used.
> - *             Using TCP_RTO_MIN and the number of unsuccessful retransmits.
> + *             Using icsk_rto_min value from socket or RTAX_RTO_MIN from route
> + *             and the number of unsuccessful retransmits.
>   *
>   * The default "timeout" value this function can calculate and use
>   * is equivalent to the timeout of a TCP Connection
>   * after "boundary" unsuccessful, exponentially backed-off
> - * retransmissions with an initial RTO of TCP_RTO_MIN.
> + * retransmissions with an initial RTO of icsk_rto_min or RTAX_RTO_MIN.
>   */
>  static bool retransmits_timed_out(struct sock *sk,
>                                   unsigned int boundary,
> @@ -217,7 +218,7 @@ static bool retransmits_timed_out(struct sock *sk,
>
>         start_ts = tcp_sk(sk)->retrans_stamp;
>         if (likely(timeout == 0)) {
> -               unsigned int rto_base = TCP_RTO_MIN;
> +               unsigned int rto_base = tcp_rto_min(sk);
>
>                 if ((1 << sk->sk_state) & (TCPF_SYN_SENT | TCPF_SYN_RECV))
>                         rto_base = tcp_timeout_init(sk);
> --

I would argue strenuously against this. We tried the approach in this
patch at Google years ago, but we had to revert that approach and go
back to using TCP_RTO_MIN as the baseline for the computation, because
using the custom tcp_rto_min(sk) caused serious reliability problems.

The behavior in this patch causes various serious reliability problems
because the retransmits_timed_out() computation is used for various
timeout decisions that determine how long a connection tries to
retransmit something before deciding the path is bad and/or giving up
and closing the connection. Here are a few of the problems this
causes:

(1) The biggest one is probably orphan retries. By default
tcp_orphan_retries() uses a retry count of 8. But if your min_rto is
5ms (used at Google for many years), then the 8 retries means an
orphaned connection (whose fd is no longer held by a process, but is
still established) only lasts for 1.275 seconds before giving up and
closing. This means that connectivity problems longer than 1.275
seconds (extremely common with cellular links) are not tolerated for
such connections; the connections often do not receive the data they
were supposed to receive.

(2) TCP_RETR1 /sysctl_tcp_retries1, used for __dst_negative_advice(),
also has big problems. Even with a min_rto as big as 20ms, on a route
with 150ms RTT, the approach in this patch will cause
retransmits_timed_out() to return true upon the 1st RTO timer firing,
even though TCP_RETR1 is 3.

(3) TCP_RETR2 /sysctl_tcp_retries2, with a default of 15, used for
regular connection retry lifetimes, also has a massive decrease in
robustness, due to falling from  109 minutes with a 200ms RTO, to
about 2.7 minutes with a min_rto of 5ms.

neal

Dmitry Yakunin July 30, 2021, 12:37 p.m. UTC | #2

Hello, Neal!

Thanks for your reply and explanations.

I agree with all your points, about safe defaults for both timeouts and the number of retries. But what the patch does is not changing the defaults, it only provides a way to work with these values through bpf, which is important in an environment that is way different from cellular networks. For example in the modern DC the rto_min value should correspond with real RTT, that definitely not 200ms.

Also I add Alexander Azimov for further discussions.

--
Dmitry

> On 23 Jul 2021, at 17:41, Neal Cardwell <ncardwell@google.com> wrote:
> 
> .(On Fri, Jul 23, 2021 at 5:41 AM Dmitry Yakunin <zeil@yandex-team.ru> wrote:
>> 
>> Commit ca584ba07086 ("tcp: bpf: Add TCP_BPF_RTO_MIN for bpf_setsockopt")
>> adds ability to set rto_min value on socket less then default TCP_RTO_MIN.
>> But retransmits_timed_out() function still uses TCP_RTO_MIN and
>> tcp_retries{1,2} sysctls don't work properly for tuned socket values.
>> 
>> Fixes: ca584ba07086 ("tcp: bpf: Add TCP_BPF_RTO_MIN for bpf_setsockopt")
>> Signed-off-by: Dmitry Yakunin <zeil@yandex-team.ru>
>> Acked-by: Dmitry Monakhov <dmtrmonakhov@yandex-team.ru>
>> ---
>> net/ipv4/tcp_timer.c | 7 ++++---
>> 1 file changed, 4 insertions(+), 3 deletions(-)
>> 
>> diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
>> index 20cf4a9..66c4b97 100644
>> --- a/net/ipv4/tcp_timer.c
>> +++ b/net/ipv4/tcp_timer.c
>> @@ -199,12 +199,13 @@ static unsigned int tcp_model_timeout(struct sock *sk,
>>  *  @boundary: max number of retransmissions
>>  *  @timeout:  A custom timeout value.
>>  *             If set to 0 the default timeout is calculated and used.
>> - *             Using TCP_RTO_MIN and the number of unsuccessful retransmits.
>> + *             Using icsk_rto_min value from socket or RTAX_RTO_MIN from route
>> + *             and the number of unsuccessful retransmits.
>>  *
>>  * The default "timeout" value this function can calculate and use
>>  * is equivalent to the timeout of a TCP Connection
>>  * after "boundary" unsuccessful, exponentially backed-off
>> - * retransmissions with an initial RTO of TCP_RTO_MIN.
>> + * retransmissions with an initial RTO of icsk_rto_min or RTAX_RTO_MIN.
>>  */
>> static bool retransmits_timed_out(struct sock *sk,
>>                                  unsigned int boundary,
>> @@ -217,7 +218,7 @@ static bool retransmits_timed_out(struct sock *sk,
>> 
>>        start_ts = tcp_sk(sk)->retrans_stamp;
>>        if (likely(timeout == 0)) {
>> -               unsigned int rto_base = TCP_RTO_MIN;
>> +               unsigned int rto_base = tcp_rto_min(sk);
>> 
>>                if ((1 << sk->sk_state) & (TCPF_SYN_SENT | TCPF_SYN_RECV))
>>                        rto_base = tcp_timeout_init(sk);
>> --
> 
> I would argue strenuously against this. We tried the approach in this
> patch at Google years ago, but we had to revert that approach and go
> back to using TCP_RTO_MIN as the baseline for the computation, because
> using the custom tcp_rto_min(sk) caused serious reliability problems.
> 
> The behavior in this patch causes various serious reliability problems
> because the retransmits_timed_out() computation is used for various
> timeout decisions that determine how long a connection tries to
> retransmit something before deciding the path is bad and/or giving up
> and closing the connection. Here are a few of the problems this
> causes:
> 
> (1) The biggest one is probably orphan retries. By default
> tcp_orphan_retries() uses a retry count of 8. But if your min_rto is
> 5ms (used at Google for many years), then the 8 retries means an
> orphaned connection (whose fd is no longer held by a process, but is
> still established) only lasts for 1.275 seconds before giving up and
> closing. This means that connectivity problems longer than 1.275
> seconds (extremely common with cellular links) are not tolerated for
> such connections; the connections often do not receive the data they
> were supposed to receive.
> 
> (2) TCP_RETR1 /sysctl_tcp_retries1, used for __dst_negative_advice(),
> also has big problems. Even with a min_rto as big as 20ms, on a route
> with 150ms RTT, the approach in this patch will cause
> retransmits_timed_out() to return true upon the 1st RTO timer firing,
> even though TCP_RETR1 is 3.
> 
> (3) TCP_RETR2 /sysctl_tcp_retries2, with a default of 15, used for
> regular connection retry lifetimes, also has a massive decrease in
> robustness, due to falling from  109 minutes with a 200ms RTO, to
> about 2.7 minutes with a min_rto of 5ms.
> 
> neal

Neal Cardwell July 30, 2021, 3:08 p.m. UTC | #3

On Fri, Jul 30, 2021 at 8:37 AM Dmitry Yakunin <zeil@yandex-team.ru> wrote:
>
> Hello, Neal!
>
> Thanks for your reply and explanations.
>
> I agree with all your points, about safe defaults for both timeouts
> and the number of retries. But what the patch does is not changing the
> defaults, it only provides a way to work with these values through
> bpf, which is important in an environment that is way different from
> cellular networks. For example in the modern DC the rto_min value
> should correspond with real RTT, that definitely not 200ms.

It seems your patch and your analysis are conflating several different issues:

(1) how long should rto_min be in datacenter environments?
(2) for reliability/robustness, how long should TCP retry to transmit
data before giving up?
(2) should rto_min just correspond to the real RTT, or other factors
(like delayed ACK timers)?

I am talking about the reliability/robustness cost of your proposal to
tie custom reductions in (1) to automatic custom reductions in (2).
(I'm not talking about safe defaults.)

If BPF or routing table entries customize rto_min, then it's great for
the rto_min knob to customize the RTO timer value to use a lower value
in datacenters to speed up loss recovery (1) (as already happens).

But just because you customize (1) does not imply that it is safe to
massively reduce the answer to (2): it is not safe to cripple
reliability/robustness by (as in your proposed patch) having the
rto_min setting massively reduce the length of time that a TCP
connection retries sending data before giving up and closing the
connection.

The problem caused by your proposal to have rto_min shorten the retry
duration (e.g. a 5ms rto_min leading to only 1.275 seconds of retries)
is a general problem of reliability/robustness, not specific to
cellular paths. My point about cellular networks was just the most
crisp example I could think of, to try to provide a clear and concrete
example.

If you really think it's important for TCP connections to only retry
sending data for 1.275 seconds, then can you please give an example of
when this is important, and then please implement a separate
customization mechanism for that, rather than forcing all Linux users
of the rto_min mechanism to suffer the fallout from tying (1) to (2)?

best regards,
neal

diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
index 20cf4a9..66c4b97 100644
--- a/net/ipv4/tcp_timer.c
+++ b/net/ipv4/tcp_timer.c
@@ -199,12 +199,13 @@  static unsigned int tcp_model_timeout(struct sock *sk,
  *  @boundary: max number of retransmissions
  *  @timeout:  A custom timeout value.
  *             If set to 0 the default timeout is calculated and used.
- *             Using TCP_RTO_MIN and the number of unsuccessful retransmits.
+ *             Using icsk_rto_min value from socket or RTAX_RTO_MIN from route
+ *             and the number of unsuccessful retransmits.
  *
  * The default "timeout" value this function can calculate and use
  * is equivalent to the timeout of a TCP Connection
  * after "boundary" unsuccessful, exponentially backed-off
- * retransmissions with an initial RTO of TCP_RTO_MIN.
+ * retransmissions with an initial RTO of icsk_rto_min or RTAX_RTO_MIN.
  */
 static bool retransmits_timed_out(struct sock *sk,
 				  unsigned int boundary,
@@ -217,7 +218,7 @@  static bool retransmits_timed_out(struct sock *sk,
 
 	start_ts = tcp_sk(sk)->retrans_stamp;
 	if (likely(timeout == 0)) {
-		unsigned int rto_base = TCP_RTO_MIN;
+		unsigned int rto_base = tcp_rto_min(sk);
 
 		if ((1 << sk->sk_state) & (TCPF_SYN_SENT | TCPF_SYN_RECV))
 			rto_base = tcp_timeout_init(sk);

tcp: use rto_min value from socket in retransmits timeout

Checks

Commit Message

Comments

Patch