
[v2,net-next] net: generalize skb freeing deferral to per-cpu lists

Message ID 20220422201237.416238-1-eric.dumazet@gmail.com (mailing list archive)
State Accepted
Commit 68822bdf76f10c3dc80609d4e2cdc1e847429086
Delegated to: Netdev Maintainers
Series [v2,net-next] net: generalize skb freeing deferral to per-cpu lists

Checks

Context Check Description
netdev/tree_selection success Clearly marked for net-next, async
netdev/fixes_present success Fixes tag not required for -next series
netdev/subject_prefix success Link
netdev/cover_letter success Single patches do not need cover letters
netdev/patch_count success Link
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 5837 this patch: 5837
netdev/cc_maintainers warning 8 maintainers not CCed: petrm@nvidia.com borisp@nvidia.com yoshfuji@linux-ipv6.org dsahern@kernel.org daniel@iogearbox.net imagedong@tencent.com keescook@chromium.org john.fastabend@gmail.com
netdev/build_clang success Errors and warnings before: 1151 this patch: 1151
netdev/module_param success Was 0 now: 0
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn success Errors and warnings before: 5976 this patch: 5976
netdev/checkpatch warning CHECK: Comparison to NULL could be written "skb"
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0

Commit Message

Eric Dumazet April 22, 2022, 8:12 p.m. UTC
From: Eric Dumazet <edumazet@google.com>

Logic added in commit f35f821935d8 ("tcp: defer skb freeing after socket
lock is released") helped bulk TCP flows move the cost of skb frees
outside of the critical section where the socket lock was held.

But for RPC traffic, or hosts with RFS enabled, the solution is far from
being ideal.

For RPC traffic, recvmsg() has to return to user space right after
the skb payload has been consumed, meaning that the BH handler has no
chance to pick up the skb before the recvmsg() thread does. This issue
is more visible with BIG TCP, as more RPCs fit in one skb.

For RFS, even if the BH handler picks up the skbs, it still does so
on the cpu on which the user thread is running.

Ideally, it is better to free the skbs (and associated page frags)
on the cpu that originally allocated them.

This patch removes the per-socket anchor (sk->defer_list) and
instead uses a per-cpu list, which will hold more skbs per round.

This new per-cpu list is drained at the end of net_rx_action(),
after incoming packets have been processed, to lower latencies.

In normal conditions, skbs are added to the per-cpu list with
no further action. In the (unlikely) case where the cpu does not
run the net_rx_action() handler fast enough, we use an IPI to raise
NET_RX_SOFTIRQ on the remote cpu.

Also, we do not bother draining the per-cpu list from dev_cpu_dead(),
because skbs in this list have no requirement on how fast they should
be freed.

Note that we can add in the future a small per-cpu cache
if we see any contention on sd->defer_lock.
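
For reference, the heart of the mechanism is the following function,
condensed from the patch hunks at the bottom of this page; the allocating
cpu drains its list in skb_defer_free_flush(), called from net_rx_action():

void skb_attempt_defer_free(struct sk_buff *skb)
{
	int cpu = skb->alloc_cpu;	/* new skb field: cpu that built the skb */
	struct softnet_data *sd;
	unsigned long flags;
	bool kick;

	/* Free locally when deferral is impossible or pointless. */
	if (WARN_ON_ONCE(cpu >= nr_cpu_ids) ||
	    !cpu_online(cpu) ||
	    cpu == raw_smp_processor_id()) {
		__kfree_skb(skb);
		return;
	}

	sd = &per_cpu(softnet_data, cpu);
	spin_lock_irqsave(&sd->defer_lock, flags);
	skb->next = sd->defer_list;
	/* Paired with READ_ONCE() in skb_defer_free_flush() */
	WRITE_ONCE(sd->defer_list, skb);
	sd->defer_count++;

	/* Normally no signal is sent: the remote cpu flushes its list at the
	 * end of its next net_rx_action(). Only kick it with an IPI if the
	 * list reaches 128 skbs.
	 */
	kick = sd->defer_count == 128;
	spin_unlock_irqrestore(&sd->defer_lock, flags);

	if (unlikely(kick))
		smp_call_function_single_async(cpu, &sd->defer_csd);
}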

Tested on a pair of hosts with 100Gbit NICs, RFS enabled,
and /proc/sys/net/ipv4/tcp_rmem[2] tuned to 16MB to work around
the page recycling strategy used by the NIC driver (its page pool
capacity being too small compared to the number of skbs/pages held
in socket receive queues).

Note that this tuning was only done to demonstrate worse
conditions for skb freeing for this particular test.
These conditions can happen in more general production workloads.

10 runs of one TCP_STREAM flow

Before:
Average throughput: 49685 Mbit.

Kernel profiles on the cpu running the recvmsg() user thread show a high
cost in skb-freeing related functions (*)

    57.81%  [kernel]       [k] copy_user_enhanced_fast_string
(*) 12.87%  [kernel]       [k] skb_release_data
(*)  4.25%  [kernel]       [k] __free_one_page
(*)  3.57%  [kernel]       [k] __list_del_entry_valid
     1.85%  [kernel]       [k] __netif_receive_skb_core
     1.60%  [kernel]       [k] __skb_datagram_iter
(*)  1.59%  [kernel]       [k] free_unref_page_commit
(*)  1.16%  [kernel]       [k] __slab_free
     1.16%  [kernel]       [k] _copy_to_iter
(*)  1.01%  [kernel]       [k] kfree
(*)  0.88%  [kernel]       [k] free_unref_page
     0.57%  [kernel]       [k] ip6_rcv_core
     0.55%  [kernel]       [k] ip6t_do_table
     0.54%  [kernel]       [k] flush_smp_call_function_queue
(*)  0.54%  [kernel]       [k] free_pcppages_bulk
     0.51%  [kernel]       [k] llist_reverse_order
     0.38%  [kernel]       [k] process_backlog
(*)  0.38%  [kernel]       [k] free_pcp_prepare
     0.37%  [kernel]       [k] tcp_recvmsg_locked
(*)  0.37%  [kernel]       [k] __list_add_valid
     0.34%  [kernel]       [k] sock_rfree
     0.34%  [kernel]       [k] _raw_spin_lock_irq
(*)  0.33%  [kernel]       [k] __page_cache_release
     0.33%  [kernel]       [k] tcp_v6_rcv
(*)  0.33%  [kernel]       [k] __put_page
(*)  0.29%  [kernel]       [k] __mod_zone_page_state
     0.27%  [kernel]       [k] _raw_spin_lock

After patch:
Average throughput: 73076 Mbit.

Kernel profiles on the cpu running the recvmsg() user thread look better:

    81.35%  [kernel]       [k] copy_user_enhanced_fast_string
     1.95%  [kernel]       [k] _copy_to_iter
     1.95%  [kernel]       [k] __skb_datagram_iter
     1.27%  [kernel]       [k] __netif_receive_skb_core
     1.03%  [kernel]       [k] ip6t_do_table
     0.60%  [kernel]       [k] sock_rfree
     0.50%  [kernel]       [k] tcp_v6_rcv
     0.47%  [kernel]       [k] ip6_rcv_core
     0.45%  [kernel]       [k] read_tsc
     0.44%  [kernel]       [k] _raw_spin_lock_irqsave
     0.37%  [kernel]       [k] _raw_spin_lock
     0.37%  [kernel]       [k] native_irq_return_iret
     0.33%  [kernel]       [k] __inet6_lookup_established
     0.31%  [kernel]       [k] ip6_protocol_deliver_rcu
     0.29%  [kernel]       [k] tcp_rcv_established
     0.29%  [kernel]       [k] llist_reverse_order

v2: kdoc issue (kernel bots)
    do not defer if (alloc_cpu == smp_processor_id()) (Paolo)
    replace the sk_buff_head with a single-linked list (Jakub)
    add a READ_ONCE()/WRITE_ONCE() for the lockless read of sd->defer_list

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 include/linux/netdevice.h |  5 ++++
 include/linux/skbuff.h    |  3 +++
 include/net/sock.h        |  2 --
 include/net/tcp.h         | 12 ---------
 net/core/dev.c            | 31 ++++++++++++++++++++++++
 net/core/skbuff.c         | 51 ++++++++++++++++++++++++++++++++++++++-
 net/core/sock.c           |  3 ---
 net/ipv4/tcp.c            | 25 +------------------
 net/ipv4/tcp_ipv4.c       |  1 -
 net/ipv6/tcp_ipv6.c       |  1 -
 net/tls/tls_sw.c          |  2 --
 11 files changed, 90 insertions(+), 46 deletions(-)

Comments

Paolo Abeni April 26, 2022, 7:38 a.m. UTC | #1
Hello,

I'm sorry for the late feedback. I have only a possibly relevant point
below.

On Fri, 2022-04-22 at 13:12 -0700, Eric Dumazet wrote:
[...]
> @@ -6571,6 +6577,28 @@ static int napi_threaded_poll(void *data)
>  	return 0;
>  }
>  
> +static void skb_defer_free_flush(struct softnet_data *sd)
> +{
> +	struct sk_buff *skb, *next;
> +	unsigned long flags;
> +
> +	/* Paired with WRITE_ONCE() in skb_attempt_defer_free() */
> +	if (!READ_ONCE(sd->defer_list))
> +		return;
> +
> +	spin_lock_irqsave(&sd->defer_lock, flags);
> +	skb = sd->defer_list;

I *think* that this read can possibly be fused with the previous one,
and another READ_ONCE() should avoid that.

BTW it looks like this version gives slightly better results than the
previous one, perhaps due to the single-linked list usage?

Thanks!

Paolo
Eric Dumazet April 26, 2022, 1:11 p.m. UTC | #2
On Tue, Apr 26, 2022 at 12:38 AM Paolo Abeni <pabeni@redhat.com> wrote:
>
>
> Hello,
>
> I'm sorry for the late feedback. I have only a possibly relevant point
> below.
>
> On Fri, 2022-04-22 at 13:12 -0700, Eric Dumazet wrote:
> [...]
> > @@ -6571,6 +6577,28 @@ static int napi_threaded_poll(void *data)
> >       return 0;
> >  }
> >
> > +static void skb_defer_free_flush(struct softnet_data *sd)
> > +{
> > +     struct sk_buff *skb, *next;
> > +     unsigned long flags;
> > +
> > +     /* Paired with WRITE_ONCE() in skb_attempt_defer_free() */
> > +     if (!READ_ONCE(sd->defer_list))
> > +             return;
> > +
> > +     spin_lock_irqsave(&sd->defer_lock, flags);
> > +     skb = sd->defer_list;
>
> I *think* that this read can possibly be fused with the previous one,
> and another READ_ONCE() should avoid that.

Only the lockless read needs READ_ONCE()

For the one after spin_lock_irqsave(&sd->defer_lock, flags),
there is no need for any additional barrier.

If the compiler really wants to use multiple one-byte-at-a-time loads,
we are not going to fight it; there might be good reasons for that.

(We do not want to spread READ_ONCE / WRITE_ONCE for all
loads/stores, as this has performance implications)
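
Concretely, the pattern in skb_defer_free_flush() is (a trimmed sketch,
with the reasoning spelled out as comments):

static void skb_defer_free_flush(struct softnet_data *sd)
{
	struct sk_buff *skb, *next;
	unsigned long flags;

	/* Lockless peek: skb_attempt_defer_free() on another cpu may be
	 * writing sd->defer_list concurrently, hence READ_ONCE(), paired
	 * with its WRITE_ONCE().
	 */
	if (!READ_ONCE(sd->defer_list))
		return;

	spin_lock_irqsave(&sd->defer_lock, flags);
	/* All writers hold defer_lock, so a plain load is fine here, and
	 * taking the lock already acts as a compiler barrier: this load
	 * cannot be fused with the lockless one above.
	 */
	skb = sd->defer_list;
	sd->defer_list = NULL;
	sd->defer_count = 0;
	spin_unlock_irqrestore(&sd->defer_lock, flags);

	while (skb != NULL) {
		next = skb->next;
		__kfree_skb(skb);
		skb = next;
	}
}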

>
> BTW it looks like this version gives slightly better results than the
> previous one, perhaps due to the single-linked list usage?

Yes, this could be the case, or maybe it is because 10 runs are not enough
in a host with 32 RX queues, with a 50/50 split between the two NUMA nodes.

When reaching high throughput, every detail matters, like background usage on
the network, from monitoring and machine health daemons.

Thanks.
Paolo Abeni April 26, 2022, 3:28 p.m. UTC | #3
On Tue, 2022-04-26 at 06:11 -0700, Eric Dumazet wrote:
> On Tue, Apr 26, 2022 at 12:38 AM Paolo Abeni <pabeni@redhat.com> wrote:
> > On Fri, 2022-04-22 at 13:12 -0700, Eric Dumazet wrote:
> > [...]
> > > @@ -6571,6 +6577,28 @@ static int napi_threaded_poll(void *data)
> > >       return 0;
> > >  }
> > > 
> > > +static void skb_defer_free_flush(struct softnet_data *sd)
> > > +{
> > > +     struct sk_buff *skb, *next;
> > > +     unsigned long flags;
> > > +
> > > +     /* Paired with WRITE_ONCE() in skb_attempt_defer_free() */
> > > +     if (!READ_ONCE(sd->defer_list))
> > > +             return;
> > > +
> > > +     spin_lock_irqsave(&sd->defer_lock, flags);
> > > +     skb = sd->defer_list;
> > 
> > I *think* that this read can possibly be fused with the previous one,
> > and another READ_ONCE() should avoid that.
> 
> Only the lockless read needs READ_ONCE()
> 
> For the one after spin_lock_irqsave(&sd->defer_lock, flags),
> there is no need for any additional barrier.
> 
> If the compiler really wants to use multiple one-byte-at-a-time loads,
> we are not going to fight, there might be good reasons for that.

I'm unsure I explained my doubt in a clear way: what I fear is that the
compiler could emit a single read instruction, corresponding to the
READ_ONCE() outside the lock, so that the spin-locked section will
operate on "old" defer_list. 

If that happens we could end up with 'defer_count' underestimating the
list length. It looks like that is tolerable, as we will still be
protected against defer_list growing too much.

Acked-by: Paolo Abeni <pabeni@redhat.com>
Eric Dumazet April 26, 2022, 4:13 p.m. UTC | #4
On Tue, Apr 26, 2022 at 8:28 AM Paolo Abeni <pabeni@redhat.com> wrote:
>

> I'm unsure I explained my doubt in a clear way: what I fear is that the
> compiler could emit a single read instruction, corresponding to the
> READ_ONCE() outside the lock, so that the spin-locked section will
> operate on "old" defer_list.
>
> If that happens we could end up with 'defer_count' underestimating the
> list length. It looks like that is tolerable, as we will still be
> protected against defer_list growing too much.

defer_count is always read/written under the protection of the spinlock.
It must be very precise, unless I am mistaken.
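
Both touch points are inside the defer_lock section (from the patch):

	/* skb_attempt_defer_free() */
	spin_lock_irqsave(&sd->defer_lock, flags);
	skb->next = sd->defer_list;
	WRITE_ONCE(sd->defer_list, skb);
	sd->defer_count++;
	kick = sd->defer_count == 128;
	spin_unlock_irqrestore(&sd->defer_lock, flags);

	/* skb_defer_free_flush() */
	spin_lock_irqsave(&sd->defer_lock, flags);
	skb = sd->defer_list;
	sd->defer_list = NULL;
	sd->defer_count = 0;
	spin_unlock_irqrestore(&sd->defer_lock, flags);

A reader holding the lock therefore always sees an exact defer_count;
the lockless READ_ONCE() only decides whether taking the lock is worth it.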

>
> Acked-by: Paolo Abeni <pabeni@redhat.com>
>
>

Thanks !
patchwork-bot+netdevbpf@kernel.org April 27, 2022, 12:50 a.m. UTC | #5
Hello:

This patch was applied to netdev/net-next.git (master)
by Jakub Kicinski <kuba@kernel.org>:

On Fri, 22 Apr 2022 13:12:37 -0700 you wrote:
> From: Eric Dumazet <edumazet@google.com>
> 
> Logic added in commit f35f821935d8 ("tcp: defer skb freeing after socket
> lock is released") helped bulk TCP flows to move the cost of skbs
> frees outside of critical section where socket lock was held.
> 
> But for RPC traffic, or hosts with RFS enabled, the solution is far from
> being ideal.
> 
> [...]

Here is the summary with links:
  - [v2,net-next] net: generalize skb freeing deferral to per-cpu lists
    https://git.kernel.org/netdev/net-next/c/68822bdf76f1

You are awesome, thank you!
Ido Schimmel April 27, 2022, 3:34 p.m. UTC | #6
On Fri, Apr 22, 2022 at 01:12:37PM -0700, Eric Dumazet wrote:
> From: Eric Dumazet <edumazet@google.com>
> 
> Logic added in commit f35f821935d8 ("tcp: defer skb freeing after socket
> lock is released") helped bulk TCP flows to move the cost of skbs
> frees outside of critical section where socket lock was held.
> 
> But for RPC traffic, or hosts with RFS enabled, the solution is far from
> being ideal.
> 
> For RPC traffic, recvmsg() has to return to user space right after
> skb payload has been consumed, meaning that BH handler has no chance
> to pick the skb before recvmsg() thread. This issue is more visible
> with BIG TCP, as more RPC fit one skb.
> 
> For RFS, even if BH handler picks the skbs, they are still picked
> from the cpu on which user thread is running.
> 
> Ideally, it is better to free the skbs (and associated page frags)
> on the cpu that originally allocated them.
> 
> This patch removes the per socket anchor (sk->defer_list) and
> instead uses a per-cpu list, which will hold more skbs per round.
> 
> This new per-cpu list is drained at the end of net_rx_action(),
> after incoming packets have been processed, to lower latencies.
> 
> In normal conditions, skbs are added to the per-cpu list with
> no further action. In the (unlikely) cases where the cpu does not
> run net_rx_action() handler fast enough, we use an IPI to raise
> NET_RX_SOFTIRQ on the remote cpu.
> 
> Also, we do not bother draining the per-cpu list from dev_cpu_dead()
> This is because skbs in this list have no requirement on how fast
> they should be freed.
> 
> Note that we can add in the future a small per-cpu cache
> if we see any contention on sd->defer_lock.
> 
> Tested on a pair of hosts with 100Gbit NIC, RFS enabled,
> and /proc/sys/net/ipv4/tcp_rmem[2] tuned to 16MB to work around
> page recycling strategy used by NIC driver (its page pool capacity
> being too small compared to number of skbs/pages held in sockets
> receive queues)
> 
> Note that this tuning was only done to demonstrate worse
> conditions for skb freeing for this particular test.
> These conditions can happen in more general production workload.

[...]

> Signed-off-by: Eric Dumazet <edumazet@google.com>

Eric, with this patch I'm seeing memory leaks such as these [1][2] after
boot. The system is using the igb driver for its management interface
[3]. The leaks disappear after reverting the patch.

Any ideas?

Let me know if you need more info. I can easily test a patch.

Thanks

[1]
# cat /sys/kernel/debug/kmemleak
unreferenced object 0xffff888170143740 (size 216):
  comm "softirq", pid 0, jiffies 4294825261 (age 95.244s)
  hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
    00 00 17 0f 81 88 ff ff 00 00 00 00 00 00 00 00  ................
  backtrace:
    [<ffffffff82571fc0>] napi_skb_cache_get+0xf0/0x180
    [<ffffffff8257206a>] __napi_build_skb+0x1a/0x60
    [<ffffffff8257b0f3>] napi_build_skb+0x23/0x350
    [<ffffffffa0469592>] igb_poll+0x2b72/0x5880 [igb]
    [<ffffffff825f9584>] __napi_poll.constprop.0+0xb4/0x480
    [<ffffffff825f9d5a>] net_rx_action+0x40a/0xc60
    [<ffffffff82e00295>] __do_softirq+0x295/0x9fe
    [<ffffffff81185bcc>] __irq_exit_rcu+0x11c/0x180
    [<ffffffff8118622a>] irq_exit_rcu+0xa/0x20
    [<ffffffff82bbed39>] common_interrupt+0xa9/0xc0
    [<ffffffff82c00b5e>] asm_common_interrupt+0x1e/0x40
    [<ffffffff824a186e>] cpuidle_enter_state+0x27e/0xcb0
    [<ffffffff824a236f>] cpuidle_enter+0x4f/0xa0
    [<ffffffff8126a290>] do_idle+0x3b0/0x4b0
    [<ffffffff8126a869>] cpu_startup_entry+0x19/0x20
    [<ffffffff810f4725>] start_secondary+0x265/0x340

[2]
# cat /sys/kernel/debug/kmemleak
unreferenced object 0xffff88810ce3aac0 (size 216):
  comm "softirq", pid 0, jiffies 4294861408 (age 64.607s)
  hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
    00 c0 7b 07 81 88 ff ff 00 00 00 00 00 00 00 00  ..{.............
  backtrace:
    [<ffffffff82575539>] __alloc_skb+0x229/0x360
    [<ffffffff8290bd3c>] __tcp_send_ack.part.0+0x6c/0x760
    [<ffffffff8291a062>] tcp_send_ack+0x82/0xa0
    [<ffffffff828cb6db>] __tcp_ack_snd_check+0x15b/0xa00
    [<ffffffff828f17fe>] tcp_rcv_established+0x198e/0x2120
    [<ffffffff829363b5>] tcp_v4_do_rcv+0x665/0x9a0
    [<ffffffff8293d8ae>] tcp_v4_rcv+0x2c1e/0x32f0
    [<ffffffff828610b3>] ip_protocol_deliver_rcu+0x53/0x2c0
    [<ffffffff828616eb>] ip_local_deliver+0x3cb/0x620
    [<ffffffff8285e66f>] ip_sublist_rcv_finish+0x9f/0x2c0
    [<ffffffff82860895>] ip_list_rcv_finish.constprop.0+0x525/0x6f0
    [<ffffffff82861f88>] ip_list_rcv+0x318/0x460
    [<ffffffff825f5e61>] __netif_receive_skb_list_core+0x541/0x8f0
    [<ffffffff825f8043>] netif_receive_skb_list_internal+0x763/0xdc0
    [<ffffffff826c3025>] napi_gro_complete.constprop.0+0x5a5/0x700
    [<ffffffff826c44ed>] dev_gro_receive+0xf2d/0x23f0
unreferenced object 0xffff888175e1afc0 (size 216):
  comm "sshd", pid 1024, jiffies 4294861424 (age 64.591s)
  hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
    00 c0 7b 07 81 88 ff ff 00 00 00 00 00 00 00 00  ..{.............
  backtrace:
    [<ffffffff82575539>] __alloc_skb+0x229/0x360
    [<ffffffff8258201c>] alloc_skb_with_frags+0x9c/0x720
    [<ffffffff8255f333>] sock_alloc_send_pskb+0x7b3/0x940
    [<ffffffff82876af4>] __ip_append_data+0x1874/0x36d0
    [<ffffffff8287f283>] ip_make_skb+0x263/0x2e0
    [<ffffffff82978afa>] udp_sendmsg+0x1c8a/0x29d0
    [<ffffffff829af94e>] inet_sendmsg+0x9e/0xe0
    [<ffffffff8255082d>] __sys_sendto+0x23d/0x360
    [<ffffffff82550a31>] __x64_sys_sendto+0xe1/0x1b0
    [<ffffffff82bbde05>] do_syscall_64+0x35/0x80
    [<ffffffff82c00068>] entry_SYSCALL_64_after_hwframe+0x44/0xae

[3]
# ethtool -i enp8s0
driver: igb
version: 5.18.0-rc3-custom-91743-g481c1b
firmware-version: 3.25, 0x80000708, 1.1824.0
expansion-rom-version: 
bus-info: 0000:08:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: yes
Eric Dumazet April 27, 2022, 4:53 p.m. UTC | #7
On Wed, Apr 27, 2022 at 8:34 AM Ido Schimmel <idosch@idosch.org> wrote:
>

>
> Eric, with this patch I'm seeing memory leaks such as these [1][2] after
> boot. The system is using the igb driver for its management interface
> [3]. The leaks disappear after reverting the patch.
>
> Any ideas?
>

No idea; skbs allocated to send an ACK cannot be stored in the receive
queue, so I guess this is a kmemleak false positive.

Stress your host for hours, and check if there are real kmemleaks, as
in memory being depleted.

> Let me know if you need more info. I can easily test a patch.
>
> Thanks
>
> [1]
> # cat /sys/kernel/debug/kmemleak
> unreferenced object 0xffff888170143740 (size 216):
>   comm "softirq", pid 0, jiffies 4294825261 (age 95.244s)
>   hex dump (first 32 bytes):
>     00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
>     00 00 17 0f 81 88 ff ff 00 00 00 00 00 00 00 00  ................
>   backtrace:
>     [<ffffffff82571fc0>] napi_skb_cache_get+0xf0/0x180
>     [<ffffffff8257206a>] __napi_build_skb+0x1a/0x60
>     [<ffffffff8257b0f3>] napi_build_skb+0x23/0x350
>     [<ffffffffa0469592>] igb_poll+0x2b72/0x5880 [igb]
>     [<ffffffff825f9584>] __napi_poll.constprop.0+0xb4/0x480
>     [<ffffffff825f9d5a>] net_rx_action+0x40a/0xc60
>     [<ffffffff82e00295>] __do_softirq+0x295/0x9fe
>     [<ffffffff81185bcc>] __irq_exit_rcu+0x11c/0x180
>     [<ffffffff8118622a>] irq_exit_rcu+0xa/0x20
>     [<ffffffff82bbed39>] common_interrupt+0xa9/0xc0
>     [<ffffffff82c00b5e>] asm_common_interrupt+0x1e/0x40
>     [<ffffffff824a186e>] cpuidle_enter_state+0x27e/0xcb0
>     [<ffffffff824a236f>] cpuidle_enter+0x4f/0xa0
>     [<ffffffff8126a290>] do_idle+0x3b0/0x4b0
>     [<ffffffff8126a869>] cpu_startup_entry+0x19/0x20
>     [<ffffffff810f4725>] start_secondary+0x265/0x340
>
> [2]
> # cat /sys/kernel/debug/kmemleak
> unreferenced object 0xffff88810ce3aac0 (size 216):
>   comm "softirq", pid 0, jiffies 4294861408 (age 64.607s)
>   hex dump (first 32 bytes):
>     00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
>     00 c0 7b 07 81 88 ff ff 00 00 00 00 00 00 00 00  ..{.............
>   backtrace:
>     [<ffffffff82575539>] __alloc_skb+0x229/0x360
>     [<ffffffff8290bd3c>] __tcp_send_ack.part.0+0x6c/0x760
>     [<ffffffff8291a062>] tcp_send_ack+0x82/0xa0
>     [<ffffffff828cb6db>] __tcp_ack_snd_check+0x15b/0xa00
>     [<ffffffff828f17fe>] tcp_rcv_established+0x198e/0x2120
>     [<ffffffff829363b5>] tcp_v4_do_rcv+0x665/0x9a0
>     [<ffffffff8293d8ae>] tcp_v4_rcv+0x2c1e/0x32f0
>     [<ffffffff828610b3>] ip_protocol_deliver_rcu+0x53/0x2c0
>     [<ffffffff828616eb>] ip_local_deliver+0x3cb/0x620
>     [<ffffffff8285e66f>] ip_sublist_rcv_finish+0x9f/0x2c0
>     [<ffffffff82860895>] ip_list_rcv_finish.constprop.0+0x525/0x6f0
>     [<ffffffff82861f88>] ip_list_rcv+0x318/0x460
>     [<ffffffff825f5e61>] __netif_receive_skb_list_core+0x541/0x8f0
>     [<ffffffff825f8043>] netif_receive_skb_list_internal+0x763/0xdc0
>     [<ffffffff826c3025>] napi_gro_complete.constprop.0+0x5a5/0x700
>     [<ffffffff826c44ed>] dev_gro_receive+0xf2d/0x23f0
> unreferenced object 0xffff888175e1afc0 (size 216):
>   comm "sshd", pid 1024, jiffies 4294861424 (age 64.591s)
>   hex dump (first 32 bytes):
>     00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
>     00 c0 7b 07 81 88 ff ff 00 00 00 00 00 00 00 00  ..{.............
>   backtrace:
>     [<ffffffff82575539>] __alloc_skb+0x229/0x360
>     [<ffffffff8258201c>] alloc_skb_with_frags+0x9c/0x720
>     [<ffffffff8255f333>] sock_alloc_send_pskb+0x7b3/0x940
>     [<ffffffff82876af4>] __ip_append_data+0x1874/0x36d0
>     [<ffffffff8287f283>] ip_make_skb+0x263/0x2e0
>     [<ffffffff82978afa>] udp_sendmsg+0x1c8a/0x29d0
>     [<ffffffff829af94e>] inet_sendmsg+0x9e/0xe0
>     [<ffffffff8255082d>] __sys_sendto+0x23d/0x360
>     [<ffffffff82550a31>] __x64_sys_sendto+0xe1/0x1b0
>     [<ffffffff82bbde05>] do_syscall_64+0x35/0x80
>     [<ffffffff82c00068>] entry_SYSCALL_64_after_hwframe+0x44/0xae
>
> [3]
> # ethtool -i enp8s0
> driver: igb
> version: 5.18.0-rc3-custom-91743-g481c1b
> firmware-version: 3.25, 0x80000708, 1.1824.0
> expansion-rom-version:
> bus-info: 0000:08:00.0
> supports-statistics: yes
> supports-test: yes
> supports-eeprom-access: yes
> supports-register-dump: yes
> supports-priv-flags: yes
Eric Dumazet April 27, 2022, 5:11 p.m. UTC | #8
On Wed, Apr 27, 2022 at 9:53 AM Eric Dumazet <edumazet@google.com> wrote:
>
> On Wed, Apr 27, 2022 at 8:34 AM Ido Schimmel <idosch@idosch.org> wrote:
> >
>
> >
> > Eric, with this patch I'm seeing memory leaks such as these [1][2] after
> > boot. The system is using the igb driver for its management interface
> > [3]. The leaks disappear after reverting the patch.
> >
> > Any ideas?
> >
>
> No idea, skbs allocated to send an ACK can not be stored in receive
> queue, I guess this is a kmemleak false positive.
>
> Stress your host for hours, and check if there are real kmemleaks, as
> in memory being depleted.

At first when I saw your report I wondered if the following was needed,
but I do not think so. Nothing in __kfree_skb(skb) cares about skb->next.

But you might give it a try, maybe I am wrong.

diff --git a/net/core/dev.c b/net/core/dev.c
index 611bd719706412723561c27753150b27e1dc4e7a..9dc3443649b962586cc059899ac3d71e9c7a3559
100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -6594,6 +6594,7 @@ static void skb_defer_free_flush(struct softnet_data *sd)

        while (skb != NULL) {
                next = skb->next;
+               skb_mark_not_on_list(skb);
                __kfree_skb(skb);
                skb = next;
        }
Eric Dumazet April 27, 2022, 5:59 p.m. UTC | #9
On Wed, Apr 27, 2022 at 10:11 AM Eric Dumazet <edumazet@google.com> wrote:
>
> On Wed, Apr 27, 2022 at 9:53 AM Eric Dumazet <edumazet@google.com> wrote:
> >
> > On Wed, Apr 27, 2022 at 8:34 AM Ido Schimmel <idosch@idosch.org> wrote:
> > >
> >
> > >
> > > Eric, with this patch I'm seeing memory leaks such as these [1][2] after
> > > boot. The system is using the igb driver for its management interface
> > > [3]. The leaks disappear after reverting the patch.
> > >
> > > Any ideas?
> > >
> >
> > No idea, skbs allocated to send an ACK can not be stored in receive
> > queue, I guess this is a kmemleak false positive.
> >
> > Stress your host for hours, and check if there are real kmemleaks, as
> > in memory being depleted.
>
> At first when I saw your report I wondered if the following was needed,
> but I do not think so. Nothing in __kfree_skb(skb) cares about skb->next.
>
> But you might give it a try, maybe I am wrong.
>
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 611bd719706412723561c27753150b27e1dc4e7a..9dc3443649b962586cc059899ac3d71e9c7a3559
> 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -6594,6 +6594,7 @@ static void skb_defer_free_flush(struct softnet_data *sd)
>
>         while (skb != NULL) {
>                 next = skb->next;
> +               skb_mark_not_on_list(skb);
>                 __kfree_skb(skb);
>                 skb = next;
>         }

Oh well, there is definitely a leak, sorry for that.
Eric Dumazet April 27, 2022, 6:54 p.m. UTC | #10
On Wed, Apr 27, 2022 at 10:59 AM Eric Dumazet <edumazet@google.com> wrote:
>
> On Wed, Apr 27, 2022 at 10:11 AM Eric Dumazet <edumazet@google.com> wrote:
> >
> > On Wed, Apr 27, 2022 at 9:53 AM Eric Dumazet <edumazet@google.com> wrote:
> > >
> > > On Wed, Apr 27, 2022 at 8:34 AM Ido Schimmel <idosch@idosch.org> wrote:
> > > >
> > >
> > > >
> > > > Eric, with this patch I'm seeing memory leaks such as these [1][2] after
> > > > boot. The system is using the igb driver for its management interface
> > > > [3]. The leaks disappear after reverting the patch.
> > > >
> > > > Any ideas?
> > > >
> > >
> > > No idea, skbs allocated to send an ACK can not be stored in receive
> > > queue, I guess this is a kmemleak false positive.
> > >
> > > Stress your host for hours, and check if there are real kmemleaks, as
> > > in memory being depleted.
> >
> > At first when I saw your report I wondered if the following was needed,
> > but I do not think so. Nothing in __kfree_skb(skb) cares about skb->next.
> >
> > But you might give it a try, maybe I am wrong.
> >
> > diff --git a/net/core/dev.c b/net/core/dev.c
> > index 611bd719706412723561c27753150b27e1dc4e7a..9dc3443649b962586cc059899ac3d71e9c7a3559
> > 100644
> > --- a/net/core/dev.c
> > +++ b/net/core/dev.c
> > @@ -6594,6 +6594,7 @@ static void skb_defer_free_flush(struct softnet_data *sd)
> >
> >         while (skb != NULL) {
> >                 next = skb->next;
> > +               skb_mark_not_on_list(skb);
> >                 __kfree_skb(skb);
> >                 skb = next;
> >         }
>
> Oh well, there is definitely a leak, sorry for that.

Ido, can you test if the following patch solves your issue?
It definitely looks needed.

Thanks !

diff --git a/net/core/dev.c b/net/core/dev.c
index 611bd719706412723561c27753150b27e1dc4e7a..e09cd202fc579dfe2313243e20def8044aafafa2
100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -6617,7 +6617,7 @@ static __latent_entropy void net_rx_action(struct softirq_action *h)

                if (list_empty(&list)) {
                        if (!sd_has_rps_ipi_waiting(sd) && list_empty(&repoll))
-                               return;
+                               goto end;
                        break;
                }

@@ -6644,6 +6644,7 @@ static __latent_entropy void net_rx_action(struct softirq_action *h)
                __raise_softirq_irqoff(NET_RX_SOFTIRQ);

        net_rps_action_and_irq_enable(sd);
+end:
        skb_defer_free_flush(sd);
 }
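
For context, the problem appears to be the early return in net_rx_action()
when its poll list is empty (for instance when the softirq was raised only
by the deferral IPI): that path never reached skb_defer_free_flush(), so
skbs queued on sd->defer_list could sit there indefinitely on an otherwise
idle cpu. Roughly, the pre-fix flow (trimmed):

static __latent_entropy void net_rx_action(struct softirq_action *h)
{
	struct softnet_data *sd = this_cpu_ptr(&softnet_data);
	...
	for (;;) {
		...
		if (list_empty(&list)) {
			if (!sd_has_rps_ipi_waiting(sd) && list_empty(&repoll))
				return;	/* sd->defer_list never flushed */
			break;
		}
		...
	}
	...
	net_rps_action_and_irq_enable(sd);
	skb_defer_free_flush(sd);	/* only reached when there was napi work */
}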
Qian Cai April 29, 2022, 4:18 p.m. UTC | #11
On Fri, Apr 22, 2022 at 01:12:37PM -0700, Eric Dumazet wrote:
> From: Eric Dumazet <edumazet@google.com>
> 
> Logic added in commit f35f821935d8 ("tcp: defer skb freeing after socket
> lock is released") helped bulk TCP flows to move the cost of skbs
> frees outside of critical section where socket lock was held.
> 
> But for RPC traffic, or hosts with RFS enabled, the solution is far from
> being ideal.
> 
> For RPC traffic, recvmsg() has to return to user space right after
> skb payload has been consumed, meaning that BH handler has no chance
> to pick the skb before recvmsg() thread. This issue is more visible
> with BIG TCP, as more RPC fit one skb.
> 
> For RFS, even if BH handler picks the skbs, they are still picked
> from the cpu on which user thread is running.
> 
> Ideally, it is better to free the skbs (and associated page frags)
> on the cpu that originally allocated them.
> 
> This patch removes the per socket anchor (sk->defer_list) and
> instead uses a per-cpu list, which will hold more skbs per round.
> 
> This new per-cpu list is drained at the end of net_rx_action(),
> after incoming packets have been processed, to lower latencies.
> 
> In normal conditions, skbs are added to the per-cpu list with
> no further action. In the (unlikely) cases where the cpu does not
> run net_rx_action() handler fast enough, we use an IPI to raise
> NET_RX_SOFTIRQ on the remote cpu.
> 
> Also, we do not bother draining the per-cpu list from dev_cpu_dead()
> This is because skbs in this list have no requirement on how fast
> they should be freed.
> 
> Note that we can add in the future a small per-cpu cache
> if we see any contention on sd->defer_lock.

Hmm, we started to see some memory leak reports from kmemleak that have
been around for hours without being freed since yesterday's linux-next
tree which included this commit. Any thoughts?

unreferenced object 0xffff400610f55cc0 (size 216):
  comm "git-remote-http", pid 781180, jiffies 4314091475 (age 4323.740s)
  hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
    00 c0 7e 87 ff 3f ff ff 00 00 00 00 00 00 00 00  ..~..?..........
  backtrace:
     kmem_cache_alloc_node
     __alloc_skb
     __tcp_send_ack.part.0
     tcp_send_ack
     tcp_cleanup_rbuf
     tcp_recvmsg_locked
     tcp_recvmsg
     inet_recvmsg
     __sys_recvfrom
     __arm64_sys_recvfrom
     invoke_syscall
     el0_svc_common.constprop.0
     do_el0_svc
     el0_svc
     el0t_64_sync_handler
     el0t_64_sync
unreferenced object 0xffff4001e58f0c40 (size 216):
  comm "git-remote-http", pid 781180, jiffies 4314091483 (age 4323.968s)
  hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
    00 c0 7e 87 ff 3f ff ff 00 00 00 00 00 00 00 00  ..~..?..........
  backtrace:
     kmem_cache_alloc_node
     __alloc_skb
     __tcp_send_ack.part.0
     tcp_send_ack
     tcp_cleanup_rbuf
     tcp_recvmsg_locked
     tcp_recvmsg
     inet_recvmsg
     __sys_recvfrom
     __arm64_sys_recvfrom
     invoke_syscall
     el0_svc_common.constprop.0
     do_el0_svc
     el0_svc
     el0t_64_sync_handler
     el0t_64_sync
Eric Dumazet April 29, 2022, 4:23 p.m. UTC | #12
On Fri, Apr 29, 2022 at 9:18 AM Qian Cai <quic_qiancai@quicinc.com> wrote:
>
> On Fri, Apr 22, 2022 at 01:12:37PM -0700, Eric Dumazet wrote:
> > From: Eric Dumazet <edumazet@google.com>
> >
> > Logic added in commit f35f821935d8 ("tcp: defer skb freeing after socket
> > lock is released") helped bulk TCP flows to move the cost of skbs
> > frees outside of critical section where socket lock was held.
> >
> > But for RPC traffic, or hosts with RFS enabled, the solution is far from
> > being ideal.
> >
> > For RPC traffic, recvmsg() has to return to user space right after
> > skb payload has been consumed, meaning that BH handler has no chance
> > to pick the skb before recvmsg() thread. This issue is more visible
> > with BIG TCP, as more RPC fit one skb.
> >
> > For RFS, even if BH handler picks the skbs, they are still picked
> > from the cpu on which user thread is running.
> >
> > Ideally, it is better to free the skbs (and associated page frags)
> > on the cpu that originally allocated them.
> >
> > This patch removes the per socket anchor (sk->defer_list) and
> > instead uses a per-cpu list, which will hold more skbs per round.
> >
> > This new per-cpu list is drained at the end of net_rx_action(),
> > after incoming packets have been processed, to lower latencies.
> >
> > In normal conditions, skbs are added to the per-cpu list with
> > no further action. In the (unlikely) cases where the cpu does not
> > run net_rx_action() handler fast enough, we use an IPI to raise
> > NET_RX_SOFTIRQ on the remote cpu.
> >
> > Also, we do not bother draining the per-cpu list from dev_cpu_dead()
> > This is because skbs in this list have no requirement on how fast
> > they should be freed.
> >
> > Note that we can add in the future a small per-cpu cache
> > if we see any contention on sd->defer_lock.
>
> Hmm, we started to see some memory leak reports from kmemleak that have
> been around for hours without being freed since yesterday's linux-next
> tree which included this commit. Any thoughts?
>
> unreferenced object 0xffff400610f55cc0 (size 216):
>   comm "git-remote-http", pid 781180, jiffies 4314091475 (age 4323.740s)
>   hex dump (first 32 bytes):
>     00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
>     00 c0 7e 87 ff 3f ff ff 00 00 00 00 00 00 00 00  ..~..?..........
>   backtrace:
>      kmem_cache_alloc_node
>      __alloc_skb
>      __tcp_send_ack.part.0
>      tcp_send_ack
>      tcp_cleanup_rbuf
>      tcp_recvmsg_locked
>      tcp_recvmsg
>      inet_recvmsg
>      __sys_recvfrom
>      __arm64_sys_recvfrom
>      invoke_syscall
>      el0_svc_common.constprop.0
>      do_el0_svc
>      el0_svc
>      el0t_64_sync_handler
>      el0t_64_sync
> unreferenced object 0xffff4001e58f0c40 (size 216):
>   comm "git-remote-http", pid 781180, jiffies 4314091483 (age 4323.968s)
>   hex dump (first 32 bytes):
>     00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
>     00 c0 7e 87 ff 3f ff ff 00 00 00 00 00 00 00 00  ..~..?..........
>   backtrace:
>      kmem_cache_alloc_node
>      __alloc_skb
>      __tcp_send_ack.part.0
>      tcp_send_ack
>      tcp_cleanup_rbuf
>      tcp_recvmsg_locked
>      tcp_recvmsg
>      inet_recvmsg
>      __sys_recvfrom
>      __arm64_sys_recvfrom
>      invoke_syscall
>      el0_svc_common.constprop.0
>      do_el0_svc
>      el0_svc
>      el0t_64_sync_handler
>      el0t_64_sync

Which tree are you using?

Ido said the leak has been fixed in
https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git/commit/?id=f3412b3879b4f7c4313b186b03940d4791345534
Qian Cai April 29, 2022, 8:44 p.m. UTC | #13
On Fri, Apr 29, 2022 at 09:23:13AM -0700, Eric Dumazet wrote:
> Which tree are you using?

linux-next tree, tag next-20220428

> Ido said the leak has been fixed in
> https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git/commit/?id=f3412b3879b4f7c4313b186b03940d4791345534

Cool, next-20220429 is running fine so far which included the patch.
kernel test robot May 6, 2022, 6:44 a.m. UTC | #14
Greeting,

FYI, we noticed the following commit (built with gcc-11):

commit: 72fd55c0dbb1ad5ba283ca80abc9546702815a33 ("[PATCH v2 net-next] net: generalize skb freeing deferral to per-cpu lists")
url: https://github.com/intel-lab-lkp/linux/commits/Eric-Dumazet/net-generalize-skb-freeing-deferral-to-per-cpu-lists/20220423-060710
base: https://git.kernel.org/cgit/linux/kernel/git/davem/net-next.git c78c5a660439d4d341a03b651541fda3ebe76160
patch link: https://lore.kernel.org/netdev/20220422201237.416238-1-eric.dumazet@gmail.com

in testcase: xfstests
version: xfstests-x86_64-46e1b83-1_20220414
with following parameters:

	disk: 4HDD
	fs: ext4
	fs2: smbv3
	test: generic-group-10
	ucode: 0xec

test-description: xfstests is a regression test suite for xfs and other filesystems.
test-url: git://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git


on test machine: 8 threads Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz with 16G memory

caused below changes (please refer to attached dmesg/kmsg for entire log/backtrace):



If you fix the issue, kindly add following tag
Reported-by: kernel test robot <oliver.sang@intel.com>



[   80.428226][ T1836] Attempt to set a LOCK_MAND lock via flock(2). This support has been removed and the request ignored.
[   80.739189][  T370] generic/207	 0s
[   80.739198][  T370]
[   80.785302][ T1696] run fstests generic/208 at 2022-05-06 02:42:26
[   81.143444][ T1836] Attempt to set a LOCK_MAND lock via flock(2). This support has been removed and the request ignored.
[   89.609627][   T58] kworker/u16:5 invoked oom-killer: gfp_mask=0xcd0(GFP_KERNEL|__GFP_RECLAIMABLE), order=0, oom_score_adj=0
[   89.620805][   T58] CPU: 0 PID: 58 Comm: kworker/u16:5 Not tainted 5.18.0-rc3-00568-g72fd55c0dbb1 #1
[   89.629899][   T58] Hardware name: HP HP Z240 SFF Workstation/802E, BIOS N51 Ver. 01.63 10/05/2017
[   89.638822][   T58] Workqueue: writeback wb_workfn (flush-8:16)
[   89.644727][   T58] Call Trace:
[   89.647863][   T58]  <TASK>
[ 89.650649][ T58] dump_stack_lvl (kbuild/src/consumer/lib/dump_stack.c:107 (discriminator 1)) 
[ 89.654992][ T58] dump_header (kbuild/src/consumer/mm/oom_kill.c:73 kbuild/src/consumer/mm/oom_kill.c:461) 
[ 89.659250][ T58] oom_kill_process.cold (kbuild/src/consumer/mm/oom_kill.c:979) 
[ 89.664110][ T58] out_of_memory (kbuild/src/consumer/mm/oom_kill.c:1119 (discriminator 4)) 


To reproduce:

        git clone https://github.com/intel/lkp-tests.git
        cd lkp-tests
        sudo bin/lkp install job.yaml           # job file is attached in this email
        bin/lkp split-job --compatible job.yaml # generate the yaml file for lkp run
        sudo bin/lkp run generated-yaml-file

        # if come across any failure that blocks the test,
        # please remove ~/.lkp and /lkp dir to run from a clean state.
Eric Dumazet May 6, 2022, 8:15 a.m. UTC | #15
On Thu, May 5, 2022 at 11:44 PM kernel test robot <oliver.sang@intel.com> wrote:
>
>
>
> Greeting,
>
> FYI, we noticed the following commit (built with gcc-11):
>
> commit: 72fd55c0dbb1ad5ba283ca80abc9546702815a33 ("[PATCH v2 net-next] net: generalize skb freeing deferral to per-cpu lists")
> url: https://github.com/intel-lab-lkp/linux/commits/Eric-Dumazet/net-generalize-skb-freeing-deferral-to-per-cpu-lists/20220423-060710
> base: https://git.kernel.org/cgit/linux/kernel/git/davem/net-next.git c78c5a660439d4d341a03b651541fda3ebe76160
> patch link: https://lore.kernel.org/netdev/20220422201237.416238-1-eric.dumazet@gmail.com
>

I think this commit had two follow-up fixes.
Make sure to test the tree with those fixes included, otherwise
this just adds unneeded noise.
Thank you.

commit f3412b3879b4f7c4313b186b03940d4791345534
Author: Eric Dumazet <edumazet@google.com>
Date:   Wed Apr 27 13:41:47 2022 -0700

    net: make sure net_rx_action() calls skb_defer_free_flush()

And:

commit 783d108dd71d97e4cac5fe8ce70ca43ed7dc7bb7
Author: Eric Dumazet <edumazet@google.com>
Date:   Fri Apr 29 18:15:23 2022 -0700

    tcp: drop skb dst in tcp_rcv_established()


> in testcase: xfstests
> version: xfstests-x86_64-46e1b83-1_20220414
> with following parameters:
>
>         disk: 4HDD
>         fs: ext4
>         fs2: smbv3
>         test: generic-group-10
>         ucode: 0xec
>
> test-description: xfstests is a regression test suite for xfs and other filesystems.
> test-url: git://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git
>
>
> on test machine: 8 threads Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz with 16G memory
>
> caused below changes (please refer to attached dmesg/kmsg for entire log/backtrace):
>
>
>
> If you fix the issue, kindly add following tag
> Reported-by: kernel test robot <oliver.sang@intel.com>
>
>
>
> [   80.428226][ T1836] Attempt to set a LOCK_MAND lock via flock(2). This support has been removed and the request ignored.
> [   80.739189][  T370] generic/207       0s
> [   80.739198][  T370]
> [   80.785302][ T1696] run fstests generic/208 at 2022-05-06 02:42:26
> [   81.143444][ T1836] Attempt to set a LOCK_MAND lock via flock(2). This support has been removed and the request ignored.
> [   89.609627][   T58] kworker/u16:5 invoked oom-killer: gfp_mask=0xcd0(GFP_KERNEL|__GFP_RECLAIMABLE), order=0, oom_score_adj=0
> [   89.620805][   T58] CPU: 0 PID: 58 Comm: kworker/u16:5 Not tainted 5.18.0-rc3-00568-g72fd55c0dbb1 #1
> [   89.629899][   T58] Hardware name: HP HP Z240 SFF Workstation/802E, BIOS N51 Ver. 01.63 10/05/2017
> [   89.638822][   T58] Workqueue: writeback wb_workfn (flush-8:16)
> [   89.644727][   T58] Call Trace:
> [   89.647863][   T58]  <TASK>
> [ 89.650649][ T58] dump_stack_lvl (kbuild/src/consumer/lib/dump_stack.c:107 (discriminator 1))
> [ 89.654992][ T58] dump_header (kbuild/src/consumer/mm/oom_kill.c:73 kbuild/src/consumer/mm/oom_kill.c:461)
> [ 89.659250][ T58] oom_kill_process.cold (kbuild/src/consumer/mm/oom_kill.c:979)
> [ 89.664110][ T58] out_of_memory (kbuild/src/consumer/mm/oom_kill.c:1119 (discriminator 4))
>
>
> To reproduce:
>
>         git clone https://github.com/intel/lkp-tests.git
>         cd lkp-tests
>         sudo bin/lkp install job.yaml           # job file is attached in this email
>         bin/lkp split-job --compatible job.yaml # generate the yaml file for lkp run
>         sudo bin/lkp run generated-yaml-file
>
>         # if come across any failure that blocks the test,
>         # please remove ~/.lkp and /lkp dir to run from a clean state.
>
>
>
> --
> 0-DAY CI Kernel Test Service
> https://01.org/lkp
>
>

Patch

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 7dccbfd1bf5635c27514c70b4a06d3e6f74395dd..ac8a5f71220a999aebabd73d8df2c8e2b1325ad4 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -3081,6 +3081,11 @@  struct softnet_data {
 	struct sk_buff_head	input_pkt_queue;
 	struct napi_struct	backlog;
 
+	/* Another possibly contended cache line */
+	spinlock_t		defer_lock ____cacheline_aligned_in_smp;
+	int			defer_count;
+	struct sk_buff		*defer_list;
+	call_single_data_t	defer_csd;
 };
 
 static inline void input_queue_head_incr(struct softnet_data *sd)
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 84d78df60453955a8eaf05847f6e2145176a727a..5cbc184ca685d886306ccff70b82cd409082c229 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -888,6 +888,7 @@  typedef unsigned char *sk_buff_data_t;
  *		delivery_time at egress.
  *	@napi_id: id of the NAPI struct this skb came from
  *	@sender_cpu: (aka @napi_id) source CPU in XPS
+ *	@alloc_cpu: CPU which did the skb allocation.
  *	@secmark: security marking
  *	@mark: Generic packet mark
  *	@reserved_tailroom: (aka @mark) number of bytes of free space available
@@ -1080,6 +1081,7 @@  struct sk_buff {
 		unsigned int	sender_cpu;
 	};
 #endif
+	u16			alloc_cpu;
 #ifdef CONFIG_NETWORK_SECMARK
 	__u32		secmark;
 #endif
@@ -1321,6 +1323,7 @@  struct sk_buff *__build_skb(void *data, unsigned int frag_size);
 struct sk_buff *build_skb(void *data, unsigned int frag_size);
 struct sk_buff *build_skb_around(struct sk_buff *skb,
 				 void *data, unsigned int frag_size);
+void skb_attempt_defer_free(struct sk_buff *skb);
 
 struct sk_buff *napi_build_skb(void *data, unsigned int frag_size);
 
diff --git a/include/net/sock.h b/include/net/sock.h
index a01d6c421aa2caad4032167284eed9565d4bd545..f9f8ecae0f8decb3e0e74c6efaff5b890e3685ea 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -292,7 +292,6 @@  struct sk_filter;
   *	@sk_pacing_shift: scaling factor for TCP Small Queues
   *	@sk_lingertime: %SO_LINGER l_linger setting
   *	@sk_backlog: always used with the per-socket spinlock held
-  *	@defer_list: head of llist storing skbs to be freed
   *	@sk_callback_lock: used with the callbacks in the end of this struct
   *	@sk_error_queue: rarely used
   *	@sk_prot_creator: sk_prot of original sock creator (see ipv6_setsockopt,
@@ -417,7 +416,6 @@  struct sock {
 		struct sk_buff	*head;
 		struct sk_buff	*tail;
 	} sk_backlog;
-	struct llist_head defer_list;
 
 #define sk_rmem_alloc sk_backlog.rmem_alloc
 
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 679b1964d49414fcb1c361778fd0ac664e8c8ea5..94a52ad1101c12e13c2957e8b028b697742c451f 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -1375,18 +1375,6 @@  static inline bool tcp_checksum_complete(struct sk_buff *skb)
 bool tcp_add_backlog(struct sock *sk, struct sk_buff *skb,
 		     enum skb_drop_reason *reason);
 
-#ifdef CONFIG_INET
-void __sk_defer_free_flush(struct sock *sk);
-
-static inline void sk_defer_free_flush(struct sock *sk)
-{
-	if (llist_empty(&sk->defer_list))
-		return;
-	__sk_defer_free_flush(sk);
-}
-#else
-static inline void sk_defer_free_flush(struct sock *sk) {}
-#endif
 
 int tcp_filter(struct sock *sk, struct sk_buff *skb);
 void tcp_set_state(struct sock *sk, int state);
diff --git a/net/core/dev.c b/net/core/dev.c
index 4a77ebda4fb155581a5f761a864446a046987f51..611bd719706412723561c27753150b27e1dc4e7a 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4545,6 +4545,12 @@  static void rps_trigger_softirq(void *data)
 
 #endif /* CONFIG_RPS */
 
+/* Called from hardirq (IPI) context */
+static void trigger_rx_softirq(void *data __always_unused)
+{
+	__raise_softirq_irqoff(NET_RX_SOFTIRQ);
+}
+
 /*
  * Check if this softnet_data structure is another cpu one
  * If yes, queue it to our IPI list and return 1
@@ -6571,6 +6577,28 @@  static int napi_threaded_poll(void *data)
 	return 0;
 }
 
+static void skb_defer_free_flush(struct softnet_data *sd)
+{
+	struct sk_buff *skb, *next;
+	unsigned long flags;
+
+	/* Paired with WRITE_ONCE() in skb_attempt_defer_free() */
+	if (!READ_ONCE(sd->defer_list))
+		return;
+
+	spin_lock_irqsave(&sd->defer_lock, flags);
+	skb = sd->defer_list;
+	sd->defer_list = NULL;
+	sd->defer_count = 0;
+	spin_unlock_irqrestore(&sd->defer_lock, flags);
+
+	while (skb != NULL) {
+		next = skb->next;
+		__kfree_skb(skb);
+		skb = next;
+	}
+}
+
 static __latent_entropy void net_rx_action(struct softirq_action *h)
 {
 	struct softnet_data *sd = this_cpu_ptr(&softnet_data);
@@ -6616,6 +6644,7 @@  static __latent_entropy void net_rx_action(struct softirq_action *h)
 		__raise_softirq_irqoff(NET_RX_SOFTIRQ);
 
 	net_rps_action_and_irq_enable(sd);
+	skb_defer_free_flush(sd);
 }
 
 struct netdev_adjacent {
@@ -11326,6 +11355,8 @@  static int __init net_dev_init(void)
 		INIT_CSD(&sd->csd, rps_trigger_softirq, sd);
 		sd->cpu = i;
 #endif
+		INIT_CSD(&sd->defer_csd, trigger_rx_softirq, NULL);
+		spin_lock_init(&sd->defer_lock);
 
 		init_gro_hash(&sd->backlog);
 		sd->backlog.poll = process_backlog;
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 30b523fa4ad2e9be30bdefdc61f70f989c345bbf..028a280fbabd5b69770ddd6bf0e00eae7651bbf1 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -204,7 +204,7 @@  static void __build_skb_around(struct sk_buff *skb, void *data,
 	skb_set_end_offset(skb, size);
 	skb->mac_header = (typeof(skb->mac_header))~0U;
 	skb->transport_header = (typeof(skb->transport_header))~0U;
-
+	skb->alloc_cpu = raw_smp_processor_id();
 	/* make sure we initialize shinfo sequentially */
 	shinfo = skb_shinfo(skb);
 	memset(shinfo, 0, offsetof(struct skb_shared_info, dataref));
@@ -1037,6 +1037,7 @@  static void __copy_skb_header(struct sk_buff *new, const struct sk_buff *old)
 #ifdef CONFIG_NET_RX_BUSY_POLL
 	CHECK_SKB_FIELD(napi_id);
 #endif
+	CHECK_SKB_FIELD(alloc_cpu);
 #ifdef CONFIG_XPS
 	CHECK_SKB_FIELD(sender_cpu);
 #endif
@@ -6486,3 +6487,51 @@  void __skb_ext_put(struct skb_ext *ext)
 }
 EXPORT_SYMBOL(__skb_ext_put);
 #endif /* CONFIG_SKB_EXTENSIONS */
+
+/**
+ * skb_attempt_defer_free - queue skb for remote freeing
+ * @skb: buffer
+ *
+ * Put @skb in a per-cpu list, using the cpu which
+ * allocated the skb/pages to reduce false sharing
+ * and memory zone spinlock contention.
+ */
+void skb_attempt_defer_free(struct sk_buff *skb)
+{
+	int cpu = skb->alloc_cpu;
+	struct softnet_data *sd;
+	unsigned long flags;
+	bool kick;
+
+	if (WARN_ON_ONCE(cpu >= nr_cpu_ids) ||
+	    !cpu_online(cpu) ||
+	    cpu == raw_smp_processor_id()) {
+		__kfree_skb(skb);
+		return;
+	}
+
+	sd = &per_cpu(softnet_data, cpu);
+	/* We do not send an IPI or any signal.
+	 * Remote cpu will eventually call skb_defer_free_flush()
+	 */
+	spin_lock_irqsave(&sd->defer_lock, flags);
+	skb->next = sd->defer_list;
+	/* Paired with READ_ONCE() in skb_defer_free_flush() */
+	WRITE_ONCE(sd->defer_list, skb);
+	sd->defer_count++;
+
+	/* kick every time queue length reaches 128.
+	 * This should avoid blocking in smp_call_function_single_async().
+	 * This condition should hardly be bit under normal conditions,
+	 * unless cpu suddenly stopped to receive NIC interrupts.
+	 */
+	kick = sd->defer_count == 128;
+
+	spin_unlock_irqrestore(&sd->defer_lock, flags);
+
+	/* Make sure to trigger NET_RX_SOFTIRQ on the remote CPU
+	 * if we are unlucky enough (this seems very unlikely).
+	 */
+	if (unlikely(kick))
+		smp_call_function_single_async(cpu, &sd->defer_csd);
+}
diff --git a/net/core/sock.c b/net/core/sock.c
index 29abec3eabd8905f2671e0b5789878a129453ef6..a0f3989de3d62456665e8b6382a4681fba17d60c 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -2082,9 +2082,6 @@  void sk_destruct(struct sock *sk)
 {
 	bool use_call_rcu = sock_flag(sk, SOCK_RCU_FREE);
 
-	WARN_ON_ONCE(!llist_empty(&sk->defer_list));
-	sk_defer_free_flush(sk);
-
 	if (rcu_access_pointer(sk->sk_reuseport_cb)) {
 		reuseport_detach_sock(sk);
 		use_call_rcu = true;
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index e20b87b3bf907a9b04b7531936129fb729e96c52..db55af9eb37b56bf0ec3b47212240c0302b86a1f 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -843,7 +843,6 @@  ssize_t tcp_splice_read(struct socket *sock, loff_t *ppos,
 	}
 
 	release_sock(sk);
-	sk_defer_free_flush(sk);
 
 	if (spliced)
 		return spliced;
@@ -1589,20 +1588,6 @@  void tcp_cleanup_rbuf(struct sock *sk, int copied)
 		tcp_send_ack(sk);
 }
 
-void __sk_defer_free_flush(struct sock *sk)
-{
-	struct llist_node *head;
-	struct sk_buff *skb, *n;
-
-	head = llist_del_all(&sk->defer_list);
-	llist_for_each_entry_safe(skb, n, head, ll_node) {
-		prefetch(n);
-		skb_mark_not_on_list(skb);
-		__kfree_skb(skb);
-	}
-}
-EXPORT_SYMBOL(__sk_defer_free_flush);
-
 static void tcp_eat_recv_skb(struct sock *sk, struct sk_buff *skb)
 {
 	__skb_unlink(skb, &sk->sk_receive_queue);
@@ -1610,11 +1595,7 @@  static void tcp_eat_recv_skb(struct sock *sk, struct sk_buff *skb)
 		sock_rfree(skb);
 		skb->destructor = NULL;
 		skb->sk = NULL;
-		if (!skb_queue_empty(&sk->sk_receive_queue) ||
-		    !llist_empty(&sk->defer_list)) {
-			llist_add(&skb->ll_node, &sk->defer_list);
-			return;
-		}
+		return skb_attempt_defer_free(skb);
 	}
 	__kfree_skb(skb);
 }
@@ -2453,7 +2434,6 @@  static int tcp_recvmsg_locked(struct sock *sk, struct msghdr *msg, size_t len,
 			__sk_flush_backlog(sk);
 		} else {
 			tcp_cleanup_rbuf(sk, copied);
-			sk_defer_free_flush(sk);
 			sk_wait_data(sk, &timeo, last);
 		}
 
@@ -2571,7 +2551,6 @@  int tcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len, int flags,
 	lock_sock(sk);
 	ret = tcp_recvmsg_locked(sk, msg, len, flags, &tss, &cmsg_flags);
 	release_sock(sk);
-	sk_defer_free_flush(sk);
 
 	if (cmsg_flags && ret >= 0) {
 		if (cmsg_flags & TCP_CMSG_TS)
@@ -3096,7 +3075,6 @@  int tcp_disconnect(struct sock *sk, int flags)
 		sk->sk_frag.page = NULL;
 		sk->sk_frag.offset = 0;
 	}
-	sk_defer_free_flush(sk);
 	sk_error_report(sk);
 	return 0;
 }
@@ -4225,7 +4203,6 @@  static int do_tcp_getsockopt(struct sock *sk, int level,
 		err = BPF_CGROUP_RUN_PROG_GETSOCKOPT_KERN(sk, level, optname,
 							  &zc, &len, err);
 		release_sock(sk);
-		sk_defer_free_flush(sk);
 		if (len >= offsetofend(struct tcp_zerocopy_receive, msg_flags))
 			goto zerocopy_rcv_cmsg;
 		switch (len) {
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 2c2d421425557c188c4bcf3dc113baea62e915c7..918816ec5dd49abe321f0179a2a64ca9a989a01c 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -2065,7 +2065,6 @@  int tcp_v4_rcv(struct sk_buff *skb)
 
 	sk_incoming_cpu_update(sk);
 
-	sk_defer_free_flush(sk);
 	bh_lock_sock_nested(sk);
 	tcp_segs_in(tcp_sk(sk), skb);
 	ret = 0;
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index 54277de7474b78f1fea033b7978acebc0647f3ad..60bdec257ba7220d6c05b48208a587c7be2b4087 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -1728,7 +1728,6 @@  INDIRECT_CALLABLE_SCOPE int tcp_v6_rcv(struct sk_buff *skb)
 
 	sk_incoming_cpu_update(sk);
 
-	sk_defer_free_flush(sk);
 	bh_lock_sock_nested(sk);
 	tcp_segs_in(tcp_sk(sk), skb);
 	ret = 0;
diff --git a/net/tls/tls_sw.c b/net/tls/tls_sw.c
index ddbe05ec5489dd352dee832e038884339f338b43..bc54f6c5b1a4cabbfe1e3eff1768128b2730c730 100644
--- a/net/tls/tls_sw.c
+++ b/net/tls/tls_sw.c
@@ -1911,7 +1911,6 @@  int tls_sw_recvmsg(struct sock *sk,
 
 end:
 	release_sock(sk);
-	sk_defer_free_flush(sk);
 	if (psock)
 		sk_psock_put(sk, psock);
 	return copied ? : err;
@@ -1983,7 +1982,6 @@  ssize_t tls_sw_splice_read(struct socket *sock,  loff_t *ppos,
 
 splice_read_end:
 	release_sock(sk);
-	sk_defer_free_flush(sk);
 	return copied ? : err;
 }