[net-next] net: generalize skb freeing deferral to per-cpu lists

From: Eric Dumazet <edumazet@google.com>

Logic added in commit f35f821935d8 ("tcp: defer skb freeing after socket
lock is released") helped bulk TCP flows move the cost of skb
frees outside of the critical section where the socket lock is held.

But for RPC traffic, or hosts with RFS enabled, the solution is far from
being ideal.

For RPC traffic, recvmsg() has to return to user space right after
the skb payload has been consumed, meaning that the BH handler has no
chance to pick up the skb before the recvmsg() thread does. This issue
is more visible with BIG TCP, as more RPCs fit in one skb.

For RFS, even if the BH handler picks up the skbs, they are still
picked up on the cpu on which the user thread is running.

Ideally, it is better to free the skbs (and associated page frags)
on the cpu that originally allocated them.

This patch removes the per socket anchor (sk->defer_list) and
instead uses a per-cpu list, which will hold more skbs per round.
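
As an illustration only, the change of anchor can be modelled in plain
userspace C. This is a minimal sketch, not the code from this patch:
the model_* names, the alloc_cpu field and MODEL_NR_CPUS are all
hypothetical, and the locking a real per-cpu list needs is left out.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_NR_CPUS 4	/* hypothetical cpu count, illustration only */

/* Simplified stand-in for struct sk_buff (hypothetical layout). */
struct model_skb {
	struct model_skb *next;
	int alloc_cpu;		/* cpu that allocated this skb */
};

/* One deferral list per cpu instead of one anchor per socket. */
struct model_defer_list {
	struct model_skb *head;
	unsigned int count;
};

static struct model_defer_list model_defer[MODEL_NR_CPUS];

/*
 * Queue skb on the list of the cpu that allocated it, whatever socket
 * it came from. Returns the new length of that list.
 */
static unsigned int model_defer_free(struct model_skb *skb)
{
	struct model_defer_list *dl;

	assert(skb->alloc_cpu >= 0 && skb->alloc_cpu < MODEL_NR_CPUS);
	dl = &model_defer[skb->alloc_cpu];
	skb->next = dl->head;
	dl->head = skb;
	return ++dl->count;
}
```

Because the list is keyed by cpu rather than by socket, skbs consumed
from many sockets accumulate on the same list, which is why it can
hold more skbs per round.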

This new per-cpu list is drained at the end of net_rx_action(),
after incoming packets have been processed, to lower latencies.
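
The drain step can be sketched in userspace C as follows. This is a
simplified model under stated assumptions, not the kernel code: the
model_* names are hypothetical, the freed flag stands in for the real
free, and locking is omitted. The caller is assumed to invoke it only
once the incoming packets of the round have been handled.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for struct sk_buff (hypothetical layout). */
struct model_skb {
	struct model_skb *next;
	int freed;		/* set instead of really freeing memory */
};

/* Per-cpu list of skbs waiting to be freed (hypothetical layout). */
struct model_defer_list {
	struct model_skb *head;
	unsigned int count;
};

/*
 * Detach the whole list in one step, then walk the private copy.
 * Returns the number of skbs released in this round.
 */
static unsigned int model_defer_drain(struct model_defer_list *dl)
{
	struct model_skb *skb = dl->head;
	unsigned int freed = 0;

	dl->head = NULL;
	dl->count = 0;

	while (skb) {
		struct model_skb *next = skb->next;

		skb->freed = 1;	/* stand-in for the actual free */
		skb = next;
		freed++;
	}
	return freed;
}
```

Detaching the list before walking it keeps the window in which
producers and the drainer touch the same head short.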

In normal conditions, skbs are added to the per-cpu list with
no further action. In the (unlikely) cases where the cpu does not
run the net_rx_action() handler fast enough, we use an IPI to raise
NET_RX_SOFTIRQ on the remote cpu.
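
One possible way to decide when the remote cpu needs a kick is
sketched below in userspace C. The text above does not say how "fast
enough" is detected, so the length threshold, its value and every
model_* name are assumptions made for illustration only.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_KICK_THRESHOLD 8u	/* hypothetical value, illustration only */

/* Per-cpu bookkeeping for the deferral list (hypothetical layout). */
struct model_cpu_state {
	unsigned int defer_count;	/* skbs currently waiting */
	bool ipi_pending;		/* an IPI was sent, not yet served */
	unsigned int ipis_sent;
};

/*
 * Account one more deferred skb. Returns true only when this call
 * decided to kick the remote cpu.
 */
static bool model_defer_add(struct model_cpu_state *st)
{
	st->defer_count++;

	/* Common case: the list is short, or a kick is already in flight. */
	if (st->defer_count < MODEL_KICK_THRESHOLD || st->ipi_pending)
		return false;

	st->ipi_pending = true;	/* stand-in for sending the IPI */
	st->ipis_sent++;
	return true;
}

/* Called once the owning cpu has drained its list. */
static void model_defer_drained(struct model_cpu_state *st)
{
	st->defer_count = 0;
	st->ipi_pending = false;
}
```

Tracking a pending flag means a backlog triggers at most one IPI until
the owning cpu has drained its list.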

Also, we do not bother draining the per-cpu list from dev_cpu_dead().
This is because skbs in this list have no requirement on how fast
they should be freed.

Tested on a pair of hosts with 100Gbit NIC, RFS enabled,
and /proc/sys/net/ipv4/tcp_rmem[2] tuned to 16MB to work around
page recycling strategy used by NIC driver (its page pool capacity
being too small compared to number of skbs/pages held in sockets
receive queues).

Note that this tuning was only done to demonstrate worse
conditions for skb freeing for this particular test.
These conditions can happen in more general production workloads.

10 runs of one TCP_STREAM flow

Before:
Average throughput: 49685 Mbit.

Kernel profiles on cpu running user thread recvmsg() show high cost for
skb freeing related functions (*)

    57.81%  [kernel]       [k] copy_user_enhanced_fast_string
(*) 12.87%  [kernel]       [k] skb_release_data
(*)  4.25%  [kernel]       [k] __free_one_page
(*)  3.57%  [kernel]       [k] __list_del_entry_valid
     1.85%  [kernel]       [k] __netif_receive_skb_core
     1.60%  [kernel]       [k] __skb_datagram_iter
(*)  1.59%  [kernel]       [k] free_unref_page_commit
(*)  1.16%  [kernel]       [k] __slab_free
     1.16%  [kernel]       [k] _copy_to_iter
(*)  1.01%  [kernel]       [k] kfree
(*)  0.88%  [kernel]       [k] free_unref_page
     0.57%  [kernel]       [k] ip6_rcv_core
     0.55%  [kernel]       [k] ip6t_do_table
     0.54%  [kernel]       [k] flush_smp_call_function_queue
(*)  0.54%  [kernel]       [k] free_pcppages_bulk
     0.51%  [kernel]       [k] llist_reverse_order
     0.38%  [kernel]       [k] process_backlog
(*)  0.38%  [kernel]       [k] free_pcp_prepare
     0.37%  [kernel]       [k] tcp_recvmsg_locked
(*)  0.37%  [kernel]       [k] __list_add_valid
     0.34%  [kernel]       [k] sock_rfree
     0.34%  [kernel]       [k] _raw_spin_lock_irq
(*)  0.33%  [kernel]       [k] __page_cache_release
     0.33%  [kernel]       [k] tcp_v6_rcv
(*)  0.33%  [kernel]       [k] __put_page
(*)  0.29%  [kernel]       [k] __mod_zone_page_state
     0.27%  [kernel]       [k] _raw_spin_lock

After patch:
Average throughput: 71874 Mbit.

Kernel profiles on cpu running user thread recvmsg() look better, with
the skb freeing functions no longer among the top entries:

    81.35%  [kernel]       [k] copy_user_enhanced_fast_string
     1.95%  [kernel]       [k] _copy_to_iter
     1.95%  [kernel]       [k] __skb_datagram_iter
     1.27%  [kernel]       [k] __netif_receive_skb_core
     1.03%  [kernel]       [k] ip6t_do_table
     0.60%  [kernel]       [k] sock_rfree
     0.50%  [kernel]       [k] tcp_v6_rcv
     0.47%  [kernel]       [k] ip6_rcv_core
     0.45%  [kernel]       [k] read_tsc
     0.44%  [kernel]       [k] _raw_spin_lock_irqsave
     0.37%  [kernel]       [k] _raw_spin_lock
     0.37%  [kernel]       [k] native_irq_return_iret
     0.33%  [kernel]       [k] __inet6_lookup_established
     0.31%  [kernel]       [k] ip6_protocol_deliver_rcu
     0.29%  [kernel]       [k] tcp_rcv_established
     0.29%  [kernel]       [k] llist_reverse_order

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 include/linux/netdevice.h |  3 +++
 include/linux/skbuff.h    |  2 ++
 include/net/sock.h        |  2 --
 include/net/tcp.h         | 12 ----------
 net/core/dev.c            | 27 +++++++++++++++++++++++
 net/core/skbuff.c         | 46 ++++++++++++++++++++++++++++++++++++++-
 net/core/sock.c           |  3 ---
 net/ipv4/tcp.c            | 25 +--------------------
 net/ipv4/tcp_ipv4.c       |  1 -
 net/ipv6/tcp_ipv6.c       |  1 -
 net/tls/tls_sw.c          |  2 --
 11 files changed, 78 insertions(+), 46 deletions(-)

Message ID	20220421153920.3637792-1-eric.dumazet@gmail.com (mailing list archive)
State	Superseded
Delegated to:	Netdev Maintainers
Headers	From: Eric Dumazet <eric.dumazet@gmail.com>
	To: "David S . Miller" <davem@davemloft.net>, Jakub Kicinski <kuba@kernel.org>, Paolo Abeni <pabeni@redhat.com>
	Cc: netdev <netdev@vger.kernel.org>, Eric Dumazet <edumazet@google.com>, Eric Dumazet <eric.dumazet@gmail.com>
	Subject: [PATCH net-next] net: generalize skb freeing deferral to per-cpu lists
	Date: Thu, 21 Apr 2022 08:39:20 -0700
Series	[net-next] net: generalize skb freeing deferral to per-cpu lists

Context	Check	Description
netdev/tree_selection	success	Clearly marked for net-next, async
netdev/fixes_present	success	Fixes tag not required for -next series
netdev/subject_prefix	success	Link
netdev/cover_letter	success	Single patches do not need cover letters
netdev/patch_count	success	Link
netdev/header_inline	success	No static functions without inline keyword in header files
netdev/build_32bit	success	Errors and warnings before: 5839 this patch: 5839
netdev/cc_maintainers	warning	8 maintainers not CCed: petrm@nvidia.com borisp@nvidia.com yoshfuji@linux-ipv6.org dsahern@kernel.org daniel@iogearbox.net imagedong@tencent.com keescook@chromium.org john.fastabend@gmail.com
netdev/build_clang	success	Errors and warnings before: 1151 this patch: 1151
netdev/module_param	success	Was 0 now: 0
netdev/verify_signedoff	success	Signed-off-by tag matches author and committer
netdev/verify_fixes	success	No Fixes tag
netdev/build_allmodconfig_warn	success	Errors and warnings before: 5978 this patch: 5978
netdev/checkpatch	success	total: 0 errors, 0 warnings, 0 checks, 271 lines checked
netdev/kdoc	fail	Errors and warnings before: 0 this patch: 1
netdev/source_inline	success	Was 0 now: 0
