From patchwork Mon Apr 25 00:34:07 2022
X-Patchwork-Submitter: Eric Dumazet
X-Patchwork-Id: 12825177
X-Patchwork-Delegate: kuba@kernel.org
Miller" , Jakub Kicinski , Paolo Abeni Cc: netdev , Eric Dumazet , Eric Dumazet , Doug Porter , Soheil Hassas Yeganeh , Neal Cardwell Subject: [PATCH net] tcp: fix potential xmit stalls caused by TCP_NOTSENT_LOWAT Date: Sun, 24 Apr 2022 17:34:07 -0700 Message-Id: <20220425003407.3002429-1-eric.dumazet@gmail.com> X-Mailer: git-send-email 2.36.0.rc2.479.g8af0fa9b8e-goog MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: netdev@vger.kernel.org X-Patchwork-Delegate: kuba@kernel.org From: Eric Dumazet I had this bug sitting for too long in my pile, it is time to fix it. Thanks to Doug Porter for reminding me of it! We had various attempts in the past, including commit 0cbe6a8f089e ("tcp: remove SOCK_QUEUE_SHRUNK"), but the issue is that TCP stack currently only generates EPOLLOUT from input path, when tp->snd_una has advanced and skb(s) cleaned from rtx queue. If a flow has a big RTT, and/or receives SACKs, it is possible that the notsent part (tp->write_seq - tp->snd_nxt) reaches 0 and no more data can be sent until tp->snd_una finally advances. What is needed is to also check if POLLOUT needs to be generated whenever tp->snd_nxt is advanced, from output path. This bug triggers more often after an idle period, as we do not receive ACK for at least one RTT. tcp_notsent_lowat could be a fraction of what CWND and pacing rate would allow to send during this RTT. In a followup patch, I will remove the bogus call to tcp_chrono_stop(sk, TCP_CHRONO_SNDBUF_LIMITED) from tcp_check_space(). Fact that we have decided to generate an EPOLLOUT does not mean the application has immediately refilled the transmit queue. This optimistic call might have been the reason the bug seemed not too serious. Tested: 200 ms rtt, 1% packet loss, 32 MB tcp_rmem[2] and tcp_wmem[2] $ echo 500000 >/proc/sys/net/ipv4/tcp_notsent_lowat $ cat bench_rr.sh SUM=0 for i in {1..10} do V=`netperf -H remote_host -l30 -t TCP_RR -- -r 10000000,10000 -o LOCAL_BYTES_SENT | egrep -v "MIGRATED|Bytes"` echo $V SUM=$(($SUM + $V)) done echo SUM=$SUM Before patch: $ bench_rr.sh 130000000 80000000 140000000 140000000 140000000 140000000 130000000 40000000 90000000 110000000 SUM=1140000000 After patch: $ bench_rr.sh 430000000 590000000 530000000 450000000 450000000 350000000 450000000 490000000 480000000 460000000 SUM=4680000000 # This is 410 % of the value before patch. 
Fixes: c9bee3b7fdec ("tcp: TCP_NOTSENT_LOWAT socket option")
Signed-off-by: Eric Dumazet
Reported-by: Doug Porter
Cc: Soheil Hassas Yeganeh
Cc: Neal Cardwell
Acked-by: Soheil Hassas Yeganeh
---
 include/net/tcp.h     |  1 +
 net/ipv4/tcp_input.c  | 12 +++++++++++-
 net/ipv4/tcp_output.c |  1 +
 3 files changed, 13 insertions(+), 1 deletion(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index 9987b3fba9f202632916cc439af9d17f1e68bcd3..cc1295037533a7741e454f7c040f77a21deae02b 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -621,6 +621,7 @@ void tcp_synack_rtt_meas(struct sock *sk, struct request_sock *req);
 void tcp_reset(struct sock *sk, struct sk_buff *skb);
 void tcp_skb_mark_lost_uncond_verify(struct tcp_sock *tp, struct sk_buff *skb);
 void tcp_fin(struct sock *sk);
+void tcp_check_space(struct sock *sk);
 
 /* tcp_timer.c */
 void tcp_init_xmit_timers(struct sock *);
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 2088f93fa37b5fb9110e7933242a27bd4009990e..48f6075228600896daa6507c4cd06acfc851a0fa 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -5454,7 +5454,17 @@ static void tcp_new_space(struct sock *sk)
 	INDIRECT_CALL_1(sk->sk_write_space, sk_stream_write_space, sk);
 }
 
-static void tcp_check_space(struct sock *sk)
+/* Caller made space either from:
+ * 1) Freeing skbs in rtx queues (after tp->snd_una has advanced)
+ * 2) Sent skbs from output queue (and thus advancing tp->snd_nxt)
+ *
+ * We might be able to generate EPOLLOUT to the application if:
+ * 1) Space consumed in output/rtx queues is below sk->sk_sndbuf/2
+ * 2) notsent amount (tp->write_seq - tp->snd_nxt) became
+ *    small enough that tcp_stream_memory_free() decides it
+ *    is time to generate EPOLLOUT.
+ */
+void tcp_check_space(struct sock *sk)
 {
 	/* pairs with tcp_poll() */
 	smp_mb();
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 9ede847f4199844c5884e3f62ea450562072a0a7..1ca2f28c9981018e6cfaee3435d711467af6048d 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -82,6 +82,7 @@ static void tcp_event_new_data_sent(struct sock *sk, struct sk_buff *skb)
 
 	NET_ADD_STATS(sock_net(sk), LINUX_MIB_TCPORIGDATASENT,
 		      tcp_skb_pcount(skb));
+	tcp_check_space(sk);
 }
 
 /* SND.NXT, if window was not shrunk or the amount of shrunk was less than one
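[ Illustration only, not part of the patch: a self-contained toy model of the
  two conditions listed in the new tcp_check_space() comment above. Struct and
  field names loosely mirror the kernel ones (write_seq, snd_nxt, sk_sndbuf),
  but this is a sketch with made-up numbers, not kernel code. ]

/*
 * Toy model of the writeability decision documented in tcp_check_space():
 * wake the writer only if 1) output/rtx queues consume less than half of
 * sk_sndbuf and 2) the notsent amount fell below tcp_notsent_lowat.
 */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

struct toy_tcp_sock {
	uint32_t write_seq;	/* next sequence the application will queue */
	uint32_t snd_nxt;	/* next sequence the stack will transmit */
	uint32_t notsent_lowat;	/* TCP_NOTSENT_LOWAT */
	int sndbuf;		/* sk->sk_sndbuf */
	int wmem_queued;	/* bytes held in output + rtx queues */
};

static bool toy_should_wake_writer(const struct toy_tcp_sock *tp)
{
	uint32_t notsent = tp->write_seq - tp->snd_nxt;

	/* 1) space consumed in output/rtx queues is below sk_sndbuf/2 */
	if (tp->wmem_queued >= tp->sndbuf / 2)
		return false;

	/* 2) notsent amount became small enough */
	return notsent < tp->notsent_lowat;
}

int main(void)
{
	struct toy_tcp_sock tp = {
		.write_seq = 1000000, .snd_nxt = 600000,
		.notsent_lowat = 500000,
		.sndbuf = 4 << 20, .wmem_queued = 1 << 20,
	};

	/* notsent = 400000 < 500000 and queues are small -> wake the writer */
	printf("wake writer: %d\n", toy_should_wake_writer(&tp));
	return 0;
}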