From patchwork Mon May 2 18:23:42 2022
X-Patchwork-Submitter: Cong Wang
X-Patchwork-Id: 12834619
X-Patchwork-Delegate: bpf@iogearbox.net
From: Cong Wang
To: netdev@vger.kernel.org
Cc: bpf@vger.kernel.org, Cong Wang, Eric Dumazet, John Fastabend,
    Daniel Borkmann, Jakub Sitnicki
Subject: [Patch bpf-next v2 1/4] tcp: introduce tcp_read_skb()
Date: Mon, 2 May 2022 11:23:42 -0700
Message-Id: <20220502182345.306970-2-xiyou.wangcong@gmail.com>
In-Reply-To: <20220502182345.306970-1-xiyou.wangcong@gmail.com>
References: <20220502182345.306970-1-xiyou.wangcong@gmail.com>
X-Mailing-List: netdev@vger.kernel.org
From: Cong Wang

This patch introduces tcp_read_skb(), based on tcp_read_sock(), as a
preparation for the next patch, which actually introduces the new proto_ops
->read_skb().

TCP is special here because it already has tcp_read_sock(), which is mainly
used by splice(). tcp_read_sock() supports partial reads and arbitrary
offsets, neither of which is needed for sockmap.

Cc: Eric Dumazet
Cc: John Fastabend
Cc: Daniel Borkmann
Cc: Jakub Sitnicki
Signed-off-by: Cong Wang
---
 include/net/tcp.h |  2 ++
 net/ipv4/tcp.c    | 63 +++++++++++++++++++++++++++++++++++++++++------
 2 files changed, 57 insertions(+), 8 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h index 94a52ad1101c..ab7516e5cc56 100644 --- a/include/net/tcp.h +++ b/include/net/tcp.h @@ -667,6 +667,8 @@ void tcp_get_info(struct sock *, struct tcp_info *); /* Read 'sendfile()'-style from a TCP socket */ int tcp_read_sock(struct sock *sk, read_descriptor_t *desc, sk_read_actor_t recv_actor); +int tcp_read_skb(struct sock *sk, read_descriptor_t *desc, + sk_read_actor_t recv_actor); void tcp_initialize_rcv_mss(struct sock *sk); diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c index db55af9eb37b..8d48126e3694 100644 --- a/net/ipv4/tcp.c +++ b/net/ipv4/tcp.c @@ -1600,7 +1600,7 @@ static void tcp_eat_recv_skb(struct sock *sk, struct sk_buff *skb) __kfree_skb(skb); } -static struct sk_buff *tcp_recv_skb(struct sock *sk, u32 seq, u32 *off) +static struct sk_buff *tcp_recv_skb(struct sock *sk, u32 seq, u32 *off, bool unlink) { struct sk_buff *skb; u32 offset; @@ -1613,6 +1613,8 @@ static struct sk_buff *tcp_recv_skb(struct sock *sk, u32 seq, u32 *off) } if (offset < skb->len || (TCP_SKB_CB(skb)->tcp_flags & TCPHDR_FIN)) { *off = offset; + if (unlink) + __skb_unlink(skb, &sk->sk_receive_queue); return skb; } /* This looks weird, but this can happen if TCP collapsing @@ -1646,7 +1648,7 @@ int tcp_read_sock(struct sock *sk, read_descriptor_t *desc, if (sk->sk_state == TCP_LISTEN) return -ENOTCONN; - while ((skb = tcp_recv_skb(sk, seq, &offset)) != NULL) { + while ((skb = tcp_recv_skb(sk, seq, &offset, false)) != NULL) { if (offset < skb->len) { int used; size_t len; @@ -1677,7 +1679,7 @@ int tcp_read_sock(struct sock *sk, read_descriptor_t *desc, * getting here: tcp_collapse might have deleted it * while aggregating skbs from the socket queue. */ - skb = tcp_recv_skb(sk, seq - 1, &offset); + skb = tcp_recv_skb(sk, seq - 1, &offset, false); if (!skb) break; /* TCP coalescing might have appended data to the skb. @@ -1702,13 +1704,58 @@ int tcp_read_sock(struct sock *sk, read_descriptor_t *desc, /* Clean up data we have read: This will do ACK frames.
*/ if (copied > 0) { - tcp_recv_skb(sk, seq, &offset); + tcp_recv_skb(sk, seq, &offset, false); tcp_cleanup_rbuf(sk, copied); } return copied; } EXPORT_SYMBOL(tcp_read_sock); +int tcp_read_skb(struct sock *sk, read_descriptor_t *desc, + sk_read_actor_t recv_actor) +{ + struct tcp_sock *tp = tcp_sk(sk); + u32 seq = tp->copied_seq; + struct sk_buff *skb; + int copied = 0; + u32 offset; + + if (sk->sk_state == TCP_LISTEN) + return -ENOTCONN; + + while ((skb = tcp_recv_skb(sk, seq, &offset, true)) != NULL) { + int used = recv_actor(desc, skb, 0, skb->len); + + if (used <= 0) { + if (!copied) + copied = used; + break; + } + seq += used; + copied += used; + + if (TCP_SKB_CB(skb)->tcp_flags & TCPHDR_FIN) { + kfree_skb(skb); + ++seq; + break; + } + kfree_skb(skb); + if (!desc->count) + break; + WRITE_ONCE(tp->copied_seq, seq); + } + WRITE_ONCE(tp->copied_seq, seq); + + tcp_rcv_space_adjust(sk); + + /* Clean up data we have read: This will do ACK frames. */ + if (copied > 0) + tcp_cleanup_rbuf(sk, copied); + + return copied; +} +EXPORT_SYMBOL(tcp_read_skb); + int tcp_peek_len(struct socket *sock) { return tcp_inq(sock->sk); @@ -1890,7 +1937,7 @@ static int receive_fallback_to_copy(struct sock *sk, struct sk_buff *skb; u32 offset; - skb = tcp_recv_skb(sk, tcp_sk(sk)->copied_seq, &offset); + skb = tcp_recv_skb(sk, tcp_sk(sk)->copied_seq, &offset, false); if (skb) tcp_zerocopy_set_hint_for_skb(sk, zc, skb, offset); } @@ -1937,7 +1984,7 @@ static int tcp_zc_handle_leftover(struct tcp_zerocopy_receive *zc, if (skb) { offset = *seq - TCP_SKB_CB(skb)->seq; } else { - skb = tcp_recv_skb(sk, *seq, &offset); + skb = tcp_recv_skb(sk, *seq, &offset, false); if (TCP_SKB_CB(skb)->has_rxtstamp) { tcp_update_recv_tstamps(skb, tss); zc->msg_flags |= TCP_CMSG_TS; @@ -2130,7 +2177,7 @@ static int tcp_zerocopy_receive(struct sock *sk, skb = skb->next; offset = seq - TCP_SKB_CB(skb)->seq; } else { - skb = tcp_recv_skb(sk, seq, &offset); + skb = tcp_recv_skb(sk, seq, &offset, false); } if (TCP_SKB_CB(skb)->has_rxtstamp) { @@ -2186,7 +2233,7 @@ static int tcp_zerocopy_receive(struct sock *sk, tcp_rcv_space_adjust(sk); /* Clean up data we have read: This will do ACK frames. 
*/ - tcp_recv_skb(sk, seq, &offset); + tcp_recv_skb(sk, seq, &offset, false); tcp_cleanup_rbuf(sk, length + copylen); ret = 0; if (length == zc->length)
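For illustration, a minimal sketch of a tcp_read_skb() caller as the function
exists after this patch (the example_* names are hypothetical and not part of
the patch): the recv actor is always handed a whole skb, with offset 0 and
len == skb->len, because tcp_recv_skb(..., true) has already unlinked the skb
from sk->sk_receive_queue.

#include <net/tcp.h>

/* Hypothetical actor: consumes every skb in full. */
static int example_skb_actor(read_descriptor_t *desc, struct sk_buff *skb,
			     unsigned int offset, size_t len)
{
	/* offset is always 0 here; a positive return value counts as bytes read */
	return len;
}

/* Hypothetical caller, e.g. from a ->sk_data_ready() hook. */
static void example_tcp_data_ready(struct sock *sk)
{
	read_descriptor_t desc = { .count = 1 };

	tcp_read_skb(sk, &desc, example_skb_actor);
}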
From patchwork Mon May 2 18:23:43 2022
X-Patchwork-Submitter: Cong Wang
X-Patchwork-Id: 12834620
X-Patchwork-Delegate: bpf@iogearbox.net
From: Cong Wang
To: netdev@vger.kernel.org
Cc: bpf@vger.kernel.org, Cong Wang, Eric Dumazet, John Fastabend,
    Daniel Borkmann, Jakub Sitnicki
Subject: [Patch bpf-next v2 2/4] net: introduce a new proto_ops ->read_skb()
Date: Mon, 2 May 2022 11:23:43 -0700
Message-Id: <20220502182345.306970-3-xiyou.wangcong@gmail.com>
In-Reply-To: <20220502182345.306970-1-xiyou.wangcong@gmail.com>
References: <20220502182345.306970-1-xiyou.wangcong@gmail.com>
X-Mailing-List: netdev@vger.kernel.org

From: Cong Wang

Currently both splice() and sockmap use ->read_sock() to read skbs from the
receive queue, but sockmap only ever reads one entire skb at a time, so
->read_sock() is more general than sockmap needs. Introduce a new proto_ops
->read_skb() which supports this semantic; with it we can finally pass
ownership of the skb to the recv actors. For non-TCP protocols,
->read_sock() can be simply converted to ->read_skb().

Cc: Eric Dumazet
Cc: John Fastabend
Cc: Daniel Borkmann
Cc: Jakub Sitnicki
Signed-off-by: Cong Wang
---
 include/linux/net.h | 4 ++++
 include/net/tcp.h | 3 +--
 include/net/udp.h | 3 +--
 net/core/skmsg.c | 20 +++++---------------
 net/ipv4/af_inet.c | 3 ++-
 net/ipv4/tcp.c | 9 +++------
 net/ipv4/udp.c | 10 ++++------
 net/ipv6/af_inet6.c | 3 ++-
 net/unix/af_unix.c | 23 +++++++++--------------
 9 files changed, 31 insertions(+), 47 deletions(-)

diff --git a/include/linux/net.h b/include/linux/net.h index 12093f4db50c..a03485e8cbb2 100644 --- a/include/linux/net.h +++ b/include/linux/net.h @@ -152,6 +152,8 @@ struct module; struct sk_buff; typedef int (*sk_read_actor_t)(read_descriptor_t *, struct sk_buff *, unsigned int, size_t); +typedef int (*skb_read_actor_t)(struct sock *, struct sk_buff *); + struct proto_ops { int family; @@ -214,6 +216,8 @@ struct proto_ops { */ int (*read_sock)(struct sock *sk, read_descriptor_t *desc, sk_read_actor_t recv_actor); + /* This is different from read_sock(), it reads an entire skb at a time. */ + int (*read_skb)(struct sock *sk, skb_read_actor_t recv_actor); int (*sendpage_locked)(struct sock *sk, struct page *page, int offset, size_t size, int flags); int (*sendmsg_locked)(struct sock *sk, struct msghdr *msg, diff --git a/include/net/tcp.h b/include/net/tcp.h index ab7516e5cc56..9f4fe3b80e30 100644 --- a/include/net/tcp.h +++ b/include/net/tcp.h @@ -667,8 +667,7 @@ void tcp_get_info(struct sock *, struct tcp_info *); /* Read 'sendfile()'-style from a TCP socket */ int tcp_read_sock(struct sock *sk, read_descriptor_t *desc, sk_read_actor_t recv_actor); -int tcp_read_skb(struct sock *sk, read_descriptor_t *desc, - sk_read_actor_t recv_actor); +int tcp_read_skb(struct sock *sk, skb_read_actor_t recv_actor); void tcp_initialize_rcv_mss(struct sock *sk); diff --git a/include/net/udp.h b/include/net/udp.h index b83a00330566..47a0e3359771 100644 --- a/include/net/udp.h +++ b/include/net/udp.h @@ -305,8 +305,7 @@ struct sock *__udp6_lib_lookup(struct net *net, struct sk_buff *skb); struct sock *udp6_lib_lookup_skb(const struct sk_buff *skb, __be16 sport, __be16 dport); -int udp_read_sock(struct sock *sk, read_descriptor_t *desc, - sk_read_actor_t recv_actor); +int udp_read_skb(struct sock *sk, skb_read_actor_t recv_actor); /* UDP uses skb->dev_scratch to cache as much information as possible and avoid * possibly multiple cache miss on dequeue() diff --git a/net/core/skmsg.c b/net/core/skmsg.c index 22b983ade0e7..50405e3eda88 100644 --- a/net/core/skmsg.c +++ b/net/core/skmsg.c @@ -1159,21 +1159,17 @@ static void sk_psock_done_strp(struct sk_psock *psock) } #endif /* CONFIG_BPF_STREAM_PARSER */ -static int sk_psock_verdict_recv(read_descriptor_t *desc, struct sk_buff *skb, - unsigned int offset, size_t orig_len) +static int sk_psock_verdict_recv(struct sock *sk, struct sk_buff *skb) { - struct sock *sk = (struct sock
*)desc->arg.data; struct sk_psock *psock; struct bpf_prog *prog; int ret = __SK_DROP; - int len = orig_len; + int len = skb->len; /* clone here so sk_eat_skb() in tcp_read_sock does not drop our data */ skb = skb_clone(skb, GFP_ATOMIC); - if (!skb) { - desc->error = -ENOMEM; + if (!skb) return 0; - } rcu_read_lock(); psock = sk_psock(sk); @@ -1203,16 +1199,10 @@ static int sk_psock_verdict_recv(read_descriptor_t *desc, struct sk_buff *skb, static void sk_psock_verdict_data_ready(struct sock *sk) { struct socket *sock = sk->sk_socket; - read_descriptor_t desc; - if (unlikely(!sock || !sock->ops || !sock->ops->read_sock)) + if (unlikely(!sock || !sock->ops || !sock->ops->read_skb)) return; - - desc.arg.data = sk; - desc.error = 0; - desc.count = 1; - - sock->ops->read_sock(sk, &desc, sk_psock_verdict_recv); + sock->ops->read_skb(sk, sk_psock_verdict_recv); } void sk_psock_start_verdict(struct sock *sk, struct sk_psock *psock) diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c index 93da9f783bec..f615263855d0 100644 --- a/net/ipv4/af_inet.c +++ b/net/ipv4/af_inet.c @@ -1040,6 +1040,7 @@ const struct proto_ops inet_stream_ops = { .sendpage = inet_sendpage, .splice_read = tcp_splice_read, .read_sock = tcp_read_sock, + .read_skb = tcp_read_skb, .sendmsg_locked = tcp_sendmsg_locked, .sendpage_locked = tcp_sendpage_locked, .peek_len = tcp_peek_len, @@ -1067,7 +1068,7 @@ const struct proto_ops inet_dgram_ops = { .setsockopt = sock_common_setsockopt, .getsockopt = sock_common_getsockopt, .sendmsg = inet_sendmsg, - .read_sock = udp_read_sock, + .read_skb = udp_read_skb, .recvmsg = inet_recvmsg, .mmap = sock_no_mmap, .sendpage = inet_sendpage, diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c index 8d48126e3694..d62490d10fd8 100644 --- a/net/ipv4/tcp.c +++ b/net/ipv4/tcp.c @@ -1711,8 +1711,7 @@ int tcp_read_sock(struct sock *sk, read_descriptor_t *desc, } EXPORT_SYMBOL(tcp_read_sock); -int tcp_read_skb(struct sock *sk, read_descriptor_t *desc, - sk_read_actor_t recv_actor) +int tcp_read_skb(struct sock *sk, skb_read_actor_t recv_actor) { struct tcp_sock *tp = tcp_sk(sk); u32 seq = tp->copied_seq; @@ -1724,7 +1723,7 @@ int tcp_read_skb(struct sock *sk, read_descriptor_t *desc, return -ENOTCONN; while ((skb = tcp_recv_skb(sk, seq, &offset, true)) != NULL) { - int used = recv_actor(desc, skb, 0, skb->len); + int used = recv_actor(sk, skb); if (used <= 0) { if (!copied) @@ -1740,9 +1739,7 @@ int tcp_read_skb(struct sock *sk, read_descriptor_t *desc, break; } kfree_skb(skb); - if (!desc->count) - break; - WRITE_ONCE(tp->copied_seq, seq); + break; } WRITE_ONCE(tp->copied_seq, seq); diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c index aa8545ca6964..b8cfa0c3de59 100644 --- a/net/ipv4/udp.c +++ b/net/ipv4/udp.c @@ -1795,8 +1795,7 @@ struct sk_buff *__skb_recv_udp(struct sock *sk, unsigned int flags, } EXPORT_SYMBOL(__skb_recv_udp); -int udp_read_sock(struct sock *sk, read_descriptor_t *desc, - sk_read_actor_t recv_actor) +int udp_read_skb(struct sock *sk, skb_read_actor_t recv_actor) { int copied = 0; @@ -1818,7 +1817,7 @@ int udp_read_sock(struct sock *sk, read_descriptor_t *desc, continue; } - used = recv_actor(desc, skb, 0, skb->len); + used = recv_actor(sk, skb); if (used <= 0) { if (!copied) copied = used; @@ -1829,13 +1828,12 @@ int udp_read_sock(struct sock *sk, read_descriptor_t *desc, } kfree_skb(skb); - if (!desc->count) - break; + break; } return copied; } -EXPORT_SYMBOL(udp_read_sock); +EXPORT_SYMBOL(udp_read_skb); /* * This should be easy, if there is something there we diff --git a/net/ipv6/af_inet6.c 
b/net/ipv6/af_inet6.c index 70564ddccc46..1aea5ef9bdea 100644 --- a/net/ipv6/af_inet6.c +++ b/net/ipv6/af_inet6.c @@ -701,6 +701,7 @@ const struct proto_ops inet6_stream_ops = { .sendpage_locked = tcp_sendpage_locked, .splice_read = tcp_splice_read, .read_sock = tcp_read_sock, + .read_skb = tcp_read_skb, .peek_len = tcp_peek_len, #ifdef CONFIG_COMPAT .compat_ioctl = inet6_compat_ioctl, @@ -726,7 +727,7 @@ const struct proto_ops inet6_dgram_ops = { .getsockopt = sock_common_getsockopt, /* ok */ .sendmsg = inet6_sendmsg, /* retpoline's sake */ .recvmsg = inet6_recvmsg, /* retpoline's sake */ - .read_sock = udp_read_sock, + .read_skb = udp_read_skb, .mmap = sock_no_mmap, .sendpage = sock_no_sendpage, .set_peek_off = sk_set_peek_off, diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c index e1dd9e9c8452..71deefaaf373 100644 --- a/net/unix/af_unix.c +++ b/net/unix/af_unix.c @@ -741,10 +741,8 @@ static ssize_t unix_stream_splice_read(struct socket *, loff_t *ppos, unsigned int flags); static int unix_dgram_sendmsg(struct socket *, struct msghdr *, size_t); static int unix_dgram_recvmsg(struct socket *, struct msghdr *, size_t, int); -static int unix_read_sock(struct sock *sk, read_descriptor_t *desc, - sk_read_actor_t recv_actor); -static int unix_stream_read_sock(struct sock *sk, read_descriptor_t *desc, - sk_read_actor_t recv_actor); +static int unix_read_skb(struct sock *sk, skb_read_actor_t recv_actor); +static int unix_stream_read_skb(struct sock *sk, skb_read_actor_t recv_actor); static int unix_dgram_connect(struct socket *, struct sockaddr *, int, int); static int unix_seqpacket_sendmsg(struct socket *, struct msghdr *, size_t); @@ -798,7 +796,7 @@ static const struct proto_ops unix_stream_ops = { .shutdown = unix_shutdown, .sendmsg = unix_stream_sendmsg, .recvmsg = unix_stream_recvmsg, - .read_sock = unix_stream_read_sock, + .read_skb = unix_stream_read_skb, .mmap = sock_no_mmap, .sendpage = unix_stream_sendpage, .splice_read = unix_stream_splice_read, @@ -823,7 +821,7 @@ static const struct proto_ops unix_dgram_ops = { .listen = sock_no_listen, .shutdown = unix_shutdown, .sendmsg = unix_dgram_sendmsg, - .read_sock = unix_read_sock, + .read_skb = unix_read_skb, .recvmsg = unix_dgram_recvmsg, .mmap = sock_no_mmap, .sendpage = sock_no_sendpage, @@ -2489,8 +2487,7 @@ static int unix_dgram_recvmsg(struct socket *sock, struct msghdr *msg, size_t si return __unix_dgram_recvmsg(sk, msg, size, flags); } -static int unix_read_sock(struct sock *sk, read_descriptor_t *desc, - sk_read_actor_t recv_actor) +static int unix_read_skb(struct sock *sk, skb_read_actor_t recv_actor) { int copied = 0; @@ -2505,7 +2502,7 @@ static int unix_read_sock(struct sock *sk, read_descriptor_t *desc, if (!skb) return err; - used = recv_actor(desc, skb, 0, skb->len); + used = recv_actor(sk, skb); if (used <= 0) { if (!copied) copied = used; @@ -2516,8 +2513,7 @@ static int unix_read_sock(struct sock *sk, read_descriptor_t *desc, kfree_skb(skb); - if (!desc->count) - break; + break; } return copied; @@ -2652,13 +2648,12 @@ static struct sk_buff *manage_oob(struct sk_buff *skb, struct sock *sk, } #endif -static int unix_stream_read_sock(struct sock *sk, read_descriptor_t *desc, - sk_read_actor_t recv_actor) +static int unix_stream_read_skb(struct sock *sk, skb_read_actor_t recv_actor) { if (unlikely(sk->sk_state != TCP_ESTABLISHED)) return -ENOTCONN; - return unix_read_sock(sk, desc, recv_actor); + return unix_read_skb(sk, recv_actor); } static int unix_stream_read_generic(struct unix_stream_read_state *state,
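A sketch of what a ->read_skb() user looks like after this change (the
example_* names are hypothetical; the in-tree caller is
sk_psock_verdict_data_ready() above): no read_descriptor_t has to be set up
any more, the actor simply receives the socket and one whole skb.

#include <linux/net.h>
#include <net/sock.h>

/* Hypothetical whole-skb actor matching skb_read_actor_t. */
static int example_read_actor(struct sock *sk, struct sk_buff *skb)
{
	return skb->len;	/* consume the whole skb */
}

/* Hypothetical data-ready callback using the new proto_ops hook. */
static void example_read_skb_data_ready(struct sock *sk)
{
	struct socket *sock = sk->sk_socket;

	if (sock && sock->ops && sock->ops->read_skb)
		sock->ops->read_skb(sk, example_read_actor);
}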
From patchwork Mon May 2 18:23:44 2022
X-Patchwork-Submitter: Cong Wang
X-Patchwork-Id: 12834621
X-Patchwork-Delegate: bpf@iogearbox.net
From: Cong Wang
To: netdev@vger.kernel.org
Cc: bpf@vger.kernel.org, Cong Wang, Eric Dumazet, John Fastabend,
    Daniel Borkmann, Jakub Sitnicki
Subject: [Patch bpf-next v2 3/4] skmsg: get rid of skb_clone()
Date: Mon, 2 May 2022 11:23:44 -0700
Message-Id: <20220502182345.306970-4-xiyou.wangcong@gmail.com>
In-Reply-To: <20220502182345.306970-1-xiyou.wangcong@gmail.com>
References: <20220502182345.306970-1-xiyou.wangcong@gmail.com>
X-Mailing-List: netdev@vger.kernel.org
From: Cong Wang

With ->read_skb() we now have an entire skb dequeued from the receive
queue, so we just need to grab an additional refcnt before passing its
ownership to the recv actors. And we should not touch the skb any more
afterwards, particularly skb->sk. Fortunately, skb->sk is already set for
most of the protocols except UDP, where skb->sk has been stolen, so we
have to fix it up for the UDP case.

Cc: Eric Dumazet
Cc: John Fastabend
Cc: Daniel Borkmann
Cc: Jakub Sitnicki
Signed-off-by: Cong Wang
---
 net/core/skmsg.c | 7 +------
 net/ipv4/udp.c | 1 +
 2 files changed, 2 insertions(+), 6 deletions(-)

diff --git a/net/core/skmsg.c b/net/core/skmsg.c index 50405e3eda88..3ff86d73672c 100644 --- a/net/core/skmsg.c +++ b/net/core/skmsg.c @@ -1166,10 +1166,7 @@ static int sk_psock_verdict_recv(struct sock *sk, struct sk_buff *skb) int ret = __SK_DROP; int len = skb->len; - /* clone here so sk_eat_skb() in tcp_read_sock does not drop our data */ - skb = skb_clone(skb, GFP_ATOMIC); - if (!skb) - return 0; + skb_get(skb); rcu_read_lock(); psock = sk_psock(sk); @@ -1182,12 +1179,10 @@ static int sk_psock_verdict_recv(struct sock *sk, struct sk_buff *skb) if (!prog) prog = READ_ONCE(psock->progs.skb_verdict); if (likely(prog)) { - skb->sk = sk; skb_dst_drop(skb); skb_bpf_redirect_clear(skb); ret = bpf_prog_run_pin_on_cpu(prog, skb); ret = sk_psock_map_verd(ret, skb_bpf_redirect_fetch(skb)); - skb->sk = NULL; } if (sk_psock_verdict_apply(psock, skb, ret) < 0) len = 0; diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c index b8cfa0c3de59..71c2c147f2d0 100644 --- a/net/ipv4/udp.c +++ b/net/ipv4/udp.c @@ -1817,6 +1817,7 @@ int udp_read_skb(struct sock *sk, skb_read_actor_t recv_actor) continue; } + WARN_ON(!skb_set_owner_sk_safe(skb, sk)); used = recv_actor(sk, skb); if (used <= 0) { if (!copied)
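A condensed sketch of the reference-counting pattern this patch switches to
(the example_* name is hypothetical; the real code is sk_psock_verdict_recv()
above): since ->read_skb() hands over an skb that is already unlinked from
the receive queue, the verdict path only needs an extra reference via
skb_get() instead of a full skb_clone(), and the ->read_skb() loop later
drops its own reference with kfree_skb().

#include <linux/skbuff.h>
#include <net/sock.h>

/* Hypothetical ->read_skb() actor illustrating the refcount flow. */
static int example_verdict_recv(struct sock *sk, struct sk_buff *skb)
{
	int len = skb->len;	/* capture before the skb may be consumed */

	skb_get(skb);		/* extra reference, replaces skb_clone() */
	/* in the real code, sk_psock_verdict_apply() consumes this reference,
	 * e.g. by queueing the skb to a psock; here we simply drop it again
	 */
	kfree_skb(skb);

	return len;		/* the ->read_skb() loop frees its own reference */
}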
From patchwork Mon May 2 18:23:45 2022
X-Patchwork-Submitter: Cong Wang
X-Patchwork-Id: 12834622
X-Patchwork-Delegate: bpf@iogearbox.net
From: Cong Wang
To: netdev@vger.kernel.org
Cc: bpf@vger.kernel.org, Cong Wang, John Fastabend, Daniel Borkmann,
    Jakub Sitnicki
Subject: [Patch bpf-next v2 4/4] skmsg: get rid of unnecessary memset()
Date: Mon, 2 May 2022 11:23:45 -0700
Message-Id: <20220502182345.306970-5-xiyou.wangcong@gmail.com>
In-Reply-To: <20220502182345.306970-1-xiyou.wangcong@gmail.com>
References: <20220502182345.306970-1-xiyou.wangcong@gmail.com>
X-Mailing-List: netdev@vger.kernel.org

From: Cong Wang

We always allocate skmsg with kzalloc(), so there is no need to call
memset(0) on it; the only thing we need from sk_msg_init() is
sg_init_marker(). Introduce a new helper which is just
kzalloc()+sg_init_marker(); this saves an unnecessary memset(0) for skmsg
on the fast path.
Cc: John Fastabend Cc: Daniel Borkmann Cc: Jakub Sitnicki Signed-off-by: Cong Wang --- net/core/skmsg.c | 23 +++++++++++++---------- 1 file changed, 13 insertions(+), 10 deletions(-) diff --git a/net/core/skmsg.c b/net/core/skmsg.c index 3ff86d73672c..6dbb735ec94d 100644 --- a/net/core/skmsg.c +++ b/net/core/skmsg.c @@ -497,23 +497,27 @@ bool sk_msg_is_readable(struct sock *sk) } EXPORT_SYMBOL_GPL(sk_msg_is_readable); -static struct sk_msg *sk_psock_create_ingress_msg(struct sock *sk, - struct sk_buff *skb) +static struct sk_msg *alloc_sk_msg(void) { struct sk_msg *msg; - if (atomic_read(&sk->sk_rmem_alloc) > sk->sk_rcvbuf) + msg = kzalloc(sizeof(*msg), __GFP_NOWARN | GFP_KERNEL); + if (unlikely(!msg)) return NULL; + sg_init_marker(msg->sg.data, NR_MSG_FRAG_IDS); + return msg; +} - if (!sk_rmem_schedule(sk, skb, skb->truesize)) +static struct sk_msg *sk_psock_create_ingress_msg(struct sock *sk, + struct sk_buff *skb) +{ + if (atomic_read(&sk->sk_rmem_alloc) > sk->sk_rcvbuf) return NULL; - msg = kzalloc(sizeof(*msg), __GFP_NOWARN | GFP_KERNEL); - if (unlikely(!msg)) + if (!sk_rmem_schedule(sk, skb, skb->truesize)) return NULL; - sk_msg_init(msg); - return msg; + return alloc_sk_msg(); } static int sk_psock_skb_ingress_enqueue(struct sk_buff *skb, @@ -590,13 +594,12 @@ static int sk_psock_skb_ingress(struct sk_psock *psock, struct sk_buff *skb, static int sk_psock_skb_ingress_self(struct sk_psock *psock, struct sk_buff *skb, u32 off, u32 len) { - struct sk_msg *msg = kzalloc(sizeof(*msg), __GFP_NOWARN | GFP_ATOMIC); + struct sk_msg *msg = alloc_sk_msg(); struct sock *sk = psock->sk; int err; if (unlikely(!msg)) return -EAGAIN; - sk_msg_init(msg); skb_set_owner_r(skb, sk); err = sk_psock_skb_ingress_enqueue(skb, off, len, psock, sk, msg); if (err < 0)
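The reasoning behind the new helper, as a standalone sketch (the example_*
name is hypothetical; the helper in the patch is alloc_sk_msg()): kzalloc()
already returns zero-filled memory, so of the two things sk_msg_init() does,
memset(0) plus sg_init_marker(), only the scatterlist end marker still needs
to be set.

#include <linux/skmsg.h>
#include <linux/scatterlist.h>
#include <linux/slab.h>

/* Hypothetical allocator equivalent to what alloc_sk_msg() now does. */
static struct sk_msg *example_alloc_msg(gfp_t gfp)
{
	struct sk_msg *msg = kzalloc(sizeof(*msg), gfp | __GFP_NOWARN);

	if (unlikely(!msg))
		return NULL;
	/* the old kzalloc() + sk_msg_init() pair would memset(msg, 0, ...) again here */
	sg_init_marker(msg->sg.data, NR_MSG_FRAG_IDS);
	return msg;
}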