From patchwork Sun Apr 10 16:10:39 2022
X-Patchwork-Id: 12808170
From: Cong Wang
To: netdev@vger.kernel.org
Cc: Cong Wang, Eric Dumazet, John Fastabend, Daniel Borkmann, Jakub Sitnicki
Subject: [Patch bpf-next v1 1/4] tcp: introduce tcp_read_skb()
Date: Sun, 10 Apr 2022 09:10:39 -0700
Message-Id: <20220410161042.183540-2-xiyou.wangcong@gmail.com>
In-Reply-To: <20220410161042.183540-1-xiyou.wangcong@gmail.com>
References: <20220410161042.183540-1-xiyou.wangcong@gmail.com>
From: Cong Wang

This patch introduces tcp_read_skb(), based on tcp_read_sock(), as a
preparation for the next patch, which actually introduces a new sock op.
TCP is special here because it has tcp_read_sock(), which is mainly used
by splice(). tcp_read_sock() supports partial reads and arbitrary
offsets, neither of which is needed by sockmap.

Cc: Eric Dumazet
Cc: John Fastabend
Cc: Daniel Borkmann
Cc: Jakub Sitnicki
Signed-off-by: Cong Wang
---
 include/net/tcp.h |  2 ++
 net/ipv4/tcp.c    | 72 +++++++++++++++++++++++++++++++++++++++++------
 2 files changed, 66 insertions(+), 8 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index 6d50a662bf89..f0d4ce6855e1 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -667,6 +667,8 @@ void tcp_get_info(struct sock *, struct tcp_info *);
 /* Read 'sendfile()'-style from a TCP socket */
 int tcp_read_sock(struct sock *sk, read_descriptor_t *desc,
 		  sk_read_actor_t recv_actor);
+int tcp_read_skb(struct sock *sk, read_descriptor_t *desc,
+		 sk_read_actor_t recv_actor);
 
 void tcp_initialize_rcv_mss(struct sock *sk);
 
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index e31cf137c614..8b054bcc6849 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1619,7 +1619,7 @@ static void tcp_eat_recv_skb(struct sock *sk, struct sk_buff *skb)
 	__kfree_skb(skb);
 }
 
-static struct sk_buff *tcp_recv_skb(struct sock *sk, u32 seq, u32 *off)
+static struct sk_buff *tcp_recv_skb(struct sock *sk, u32 seq, u32 *off, bool unlink)
 {
 	struct sk_buff *skb;
 	u32 offset;
@@ -1632,6 +1632,8 @@ static struct sk_buff *tcp_recv_skb(struct sock *sk, u32 seq, u32 *off)
 		}
 		if (offset < skb->len || (TCP_SKB_CB(skb)->tcp_flags & TCPHDR_FIN)) {
 			*off = offset;
+			if (unlink)
+				__skb_unlink(skb, &sk->sk_receive_queue);
 			return skb;
 		}
 		/* This looks weird, but this can happen if TCP collapsing
@@ -1665,7 +1667,7 @@ int tcp_read_sock(struct sock *sk, read_descriptor_t *desc,
 	if (sk->sk_state == TCP_LISTEN)
 		return -ENOTCONN;
 
-	while ((skb = tcp_recv_skb(sk, seq, &offset)) != NULL) {
+	while ((skb = tcp_recv_skb(sk, seq, &offset, false)) != NULL) {
 		if (offset < skb->len) {
 			int used;
 			size_t len;
@@ -1696,7 +1698,7 @@ int tcp_read_sock(struct sock *sk, read_descriptor_t *desc,
 		 * getting here: tcp_collapse might have deleted it
 		 * while aggregating skbs from the socket queue.
 		 */
-		skb = tcp_recv_skb(sk, seq - 1, &offset);
+		skb = tcp_recv_skb(sk, seq - 1, &offset, false);
 		if (!skb)
 			break;
 		/* TCP coalescing might have appended data to the skb.
@@ -1721,13 +1723,67 @@ int tcp_read_sock(struct sock *sk, read_descriptor_t *desc,
 
 	/* Clean up data we have read: This will do ACK frames. */
 	if (copied > 0) {
-		tcp_recv_skb(sk, seq, &offset);
+		tcp_recv_skb(sk, seq, &offset, false);
 		tcp_cleanup_rbuf(sk, copied);
 	}
 	return copied;
 }
 EXPORT_SYMBOL(tcp_read_sock);
 
+int tcp_read_skb(struct sock *sk, read_descriptor_t *desc,
+		 sk_read_actor_t recv_actor)
+{
+	struct sk_buff *skb;
+	struct tcp_sock *tp = tcp_sk(sk);
+	u32 seq = tp->copied_seq;
+	u32 offset;
+	int copied = 0;
+
+	if (sk->sk_state == TCP_LISTEN)
+		return -ENOTCONN;
+	while ((skb = tcp_recv_skb(sk, seq, &offset, true)) != NULL) {
+		if (offset < skb->len) {
+			int used;
+			size_t len;
+
+			len = skb->len - offset;
+			used = recv_actor(desc, skb, offset, len);
+			if (used <= 0) {
+				if (!copied)
+					copied = used;
+				break;
+			}
+			if (WARN_ON_ONCE(used > len))
+				used = len;
+			seq += used;
+			copied += used;
+			offset += used;
+
+			if (offset != skb->len)
+				continue;
+		}
+		if (TCP_SKB_CB(skb)->tcp_flags & TCPHDR_FIN) {
+			kfree_skb(skb);
+			++seq;
+			break;
+		}
+		kfree_skb(skb);
+		if (!desc->count)
+			break;
+		WRITE_ONCE(tp->copied_seq, seq);
+	}
+	WRITE_ONCE(tp->copied_seq, seq);
+
+	tcp_rcv_space_adjust(sk);
+
+	/* Clean up data we have read: This will do ACK frames. */
+	if (copied > 0)
+		tcp_cleanup_rbuf(sk, copied);
+
+	return copied;
+}
+EXPORT_SYMBOL(tcp_read_skb);
+
 int tcp_peek_len(struct socket *sock)
 {
 	return tcp_inq(sock->sk);
@@ -1910,7 +1966,7 @@ static int receive_fallback_to_copy(struct sock *sk,
 	struct sk_buff *skb;
 	u32 offset;
 
-	skb = tcp_recv_skb(sk, tcp_sk(sk)->copied_seq, &offset);
+	skb = tcp_recv_skb(sk, tcp_sk(sk)->copied_seq, &offset, false);
 	if (skb)
 		tcp_zerocopy_set_hint_for_skb(sk, zc, skb, offset);
 }
@@ -1957,7 +2013,7 @@ static int tcp_zc_handle_leftover(struct tcp_zerocopy_receive *zc,
 	if (skb) {
 		offset = *seq - TCP_SKB_CB(skb)->seq;
 	} else {
-		skb = tcp_recv_skb(sk, *seq, &offset);
+		skb = tcp_recv_skb(sk, *seq, &offset, false);
 		if (TCP_SKB_CB(skb)->has_rxtstamp) {
 			tcp_update_recv_tstamps(skb, tss);
 			zc->msg_flags |= TCP_CMSG_TS;
@@ -2150,7 +2206,7 @@ static int tcp_zerocopy_receive(struct sock *sk,
 			skb = skb->next;
 			offset = seq - TCP_SKB_CB(skb)->seq;
 		} else {
-			skb = tcp_recv_skb(sk, seq, &offset);
+			skb = tcp_recv_skb(sk, seq, &offset, false);
 		}
 
 		if (TCP_SKB_CB(skb)->has_rxtstamp) {
@@ -2206,7 +2262,7 @@ static int tcp_zerocopy_receive(struct sock *sk,
 		tcp_rcv_space_adjust(sk);
 
 		/* Clean up data we have read: This will do ACK frames. */
-		tcp_recv_skb(sk, seq, &offset);
+		tcp_recv_skb(sk, seq, &offset, false);
 		tcp_cleanup_rbuf(sk, length + copylen);
 		ret = 0;
 		if (length == zc->length)
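For illustration only, not part of the patch: with the signature used at
this stage a caller still supplies a read_descriptor_t and an
sk_read_actor_t, but each actor invocation receives one skb that
tcp_read_skb() has already unlinked from sk->sk_receive_queue. A minimal
sketch with hypothetical names (example_recv_actor, example_read):

#include <net/tcp.h>

/* Hypothetical actor: consume the unread part of the skb in one go.
 * tcp_read_skb() frees the skb once the actor returns.
 */
static int example_recv_actor(read_descriptor_t *desc, struct sk_buff *skb,
			      unsigned int offset, size_t len)
{
	return len;
}

static int example_read(struct sock *sk)
{
	read_descriptor_t desc = { .count = 1 };

	return tcp_read_skb(sk, &desc, example_recv_actor);
}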
From patchwork Sun Apr 10 16:10:40 2022
X-Patchwork-Id: 12808172
From: Cong Wang
To: netdev@vger.kernel.org
Cc: Cong Wang, Eric Dumazet, John Fastabend, Daniel Borkmann, Jakub Sitnicki
Subject: [Patch bpf-next v1 2/4] net: introduce a new proto_ops ->read_skb()
Date: Sun, 10 Apr 2022 09:10:40 -0700
Message-Id: <20220410161042.183540-3-xiyou.wangcong@gmail.com>
In-Reply-To: <20220410161042.183540-1-xiyou.wangcong@gmail.com>
References: <20220410161042.183540-1-xiyou.wangcong@gmail.com>

From: Cong Wang

Currently both splice() and sockmap use ->read_sock() to read skbs from
the receive queue, but sockmap only ever reads one entire skb at a time,
so ->read_sock() is too conservative for it. Introduce a new proto_ops
->read_skb() which supports this semantic; with it we can finally pass
ownership of the skb to the recv actors.

For non-TCP protocols, all ->read_sock() implementations can be simply
converted to ->read_skb().

Cc: Eric Dumazet
Cc: John Fastabend
Cc: Daniel Borkmann
Cc: Jakub Sitnicki
Signed-off-by: Cong Wang
---
 include/linux/net.h |  3 +++
 include/net/tcp.h   |  3 +--
 include/net/udp.h   |  3 +--
 net/core/skmsg.c    | 20 +++++---------------
 net/ipv4/af_inet.c  |  3 ++-
 net/ipv4/tcp.c      |  9 +++------
 net/ipv4/udp.c      | 10 ++++------
 net/ipv6/af_inet6.c |  3 ++-
 net/unix/af_unix.c  | 23 +++++++++--------------
 9 files changed, 30 insertions(+), 47 deletions(-)

diff --git a/include/linux/net.h b/include/linux/net.h
index 12093f4db50c..adcc4e54ec4a 100644
--- a/include/linux/net.h
+++ b/include/linux/net.h
@@ -152,6 +152,8 @@ struct module;
 struct sk_buff;
 
 typedef int (*sk_read_actor_t)(read_descriptor_t *, struct sk_buff *,
 			       unsigned int, size_t);
+typedef int (*skb_read_actor_t)(struct sock *, struct sk_buff *);
+
 struct proto_ops {
 	int		family;
@@ -214,6 +216,7 @@ struct proto_ops {
 					   */
 	int		(*read_sock)(struct sock *sk, read_descriptor_t *desc,
 				     sk_read_actor_t recv_actor);
+	int		(*read_skb)(struct sock *sk, skb_read_actor_t recv_actor);
 	int		(*sendpage_locked)(struct sock *sk, struct page *page,
 					   int offset, size_t size, int flags);
 	int		(*sendmsg_locked)(struct sock *sk, struct msghdr *msg,
diff --git a/include/net/tcp.h b/include/net/tcp.h
index f0d4ce6855e1..56946d5e8160 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -667,8 +667,7 @@ void tcp_get_info(struct sock *, struct tcp_info *);
 /* Read 'sendfile()'-style from a TCP socket */
 int tcp_read_sock(struct sock *sk, read_descriptor_t *desc,
 		  sk_read_actor_t recv_actor);
-int tcp_read_skb(struct sock *sk, read_descriptor_t *desc,
-		 sk_read_actor_t recv_actor);
+int tcp_read_skb(struct sock *sk, skb_read_actor_t recv_actor);
 
 void tcp_initialize_rcv_mss(struct sock *sk);
 
diff --git a/include/net/udp.h b/include/net/udp.h
index f1c2a88c9005..90cc590d42e3 100644
--- a/include/net/udp.h
+++ b/include/net/udp.h
@@ -305,8 +305,7 @@ struct sock *__udp6_lib_lookup(struct net *net,
 					   struct sk_buff *skb);
 struct sock *udp6_lib_lookup_skb(const struct sk_buff *skb,
 				 __be16 sport, __be16 dport);
-int udp_read_sock(struct sock *sk, read_descriptor_t *desc,
-		  sk_read_actor_t recv_actor);
+int udp_read_skb(struct sock *sk, skb_read_actor_t recv_actor);
 
 /* UDP uses skb->dev_scratch to cache as much information as possible and avoid
  * possibly multiple cache miss on dequeue()
diff --git a/net/core/skmsg.c b/net/core/skmsg.c
index cc381165ea08..19bca36940a2 100644
--- a/net/core/skmsg.c
+++ b/net/core/skmsg.c
@@ -1155,21 +1155,17 @@ static void sk_psock_done_strp(struct sk_psock *psock)
 }
 #endif /* CONFIG_BPF_STREAM_PARSER */
 
-static int sk_psock_verdict_recv(read_descriptor_t *desc, struct sk_buff *skb,
-				 unsigned int offset, size_t orig_len)
+static int sk_psock_verdict_recv(struct sock *sk, struct sk_buff *skb)
 {
-	struct sock *sk = (struct sock *)desc->arg.data;
 	struct sk_psock *psock;
 	struct bpf_prog *prog;
 	int ret = __SK_DROP;
-	int len = orig_len;
+	int len = skb->len;
 
 	/* clone here so sk_eat_skb() in tcp_read_sock does not drop our data */
 	skb = skb_clone(skb, GFP_ATOMIC);
-	if (!skb) {
-		desc->error = -ENOMEM;
+	if (!skb)
 		return 0;
-	}
 
 	rcu_read_lock();
 	psock = sk_psock(sk);
@@ -1199,16 +1195,10 @@ static int sk_psock_verdict_recv(read_descriptor_t *desc, struct sk_buff *skb,
 static void sk_psock_verdict_data_ready(struct sock *sk)
 {
 	struct socket *sock = sk->sk_socket;
-	read_descriptor_t desc;
 
-	if (unlikely(!sock || !sock->ops || !sock->ops->read_sock))
+	if (unlikely(!sock || !sock->ops || !sock->ops->read_skb))
 		return;
-
-	desc.arg.data = sk;
-	desc.error = 0;
-	desc.count = 1;
-
-	sock->ops->read_sock(sk, &desc, sk_psock_verdict_recv);
+	sock->ops->read_skb(sk, sk_psock_verdict_recv);
 }
 
 void sk_psock_start_verdict(struct sock *sk, struct sk_psock *psock)
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index 72fde2888ad2..c60262bcac88 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -1041,6 +1041,7 @@ const struct proto_ops inet_stream_ops = {
 	.sendpage	   = inet_sendpage,
 	.splice_read	   = tcp_splice_read,
 	.read_sock	   = tcp_read_sock,
+	.read_skb	   = tcp_read_skb,
 	.sendmsg_locked    = tcp_sendmsg_locked,
 	.sendpage_locked   = tcp_sendpage_locked,
 	.peek_len	   = tcp_peek_len,
@@ -1068,7 +1069,7 @@ const struct proto_ops inet_dgram_ops = {
 	.setsockopt	   = sock_common_setsockopt,
 	.getsockopt	   = sock_common_getsockopt,
 	.sendmsg	   = inet_sendmsg,
-	.read_sock	   = udp_read_sock,
+	.read_skb	   = udp_read_skb,
 	.recvmsg	   = inet_recvmsg,
 	.mmap		   = sock_no_mmap,
 	.sendpage	   = inet_sendpage,
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 8b054bcc6849..74e472e8178f 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1730,8 +1730,7 @@ int tcp_read_sock(struct sock *sk, read_descriptor_t *desc,
 }
 EXPORT_SYMBOL(tcp_read_sock);
 
-int tcp_read_skb(struct sock *sk, read_descriptor_t *desc,
-		 sk_read_actor_t recv_actor)
+int tcp_read_skb(struct sock *sk, skb_read_actor_t recv_actor)
 {
 	struct sk_buff *skb;
 	struct tcp_sock *tp = tcp_sk(sk);
@@ -1747,7 +1746,7 @@ int tcp_read_skb(struct sock *sk, read_descriptor_t *desc,
 			size_t len;
 
 			len = skb->len - offset;
-			used = recv_actor(desc, skb, offset, len);
+			used = recv_actor(sk, skb);
 			if (used <= 0) {
 				if (!copied)
 					copied = used;
@@ -1768,9 +1767,7 @@ int tcp_read_skb(struct sock *sk, read_descriptor_t *desc,
 			break;
 		}
 		kfree_skb(skb);
-		if (!desc->count)
-			break;
-		WRITE_ONCE(tp->copied_seq, seq);
+		break;
 	}
 	WRITE_ONCE(tp->copied_seq, seq);
 
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 6b4d8361560f..9faca5758ed6 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -1796,8 +1796,7 @@ struct sk_buff *__skb_recv_udp(struct sock *sk, unsigned int flags,
 }
 EXPORT_SYMBOL(__skb_recv_udp);
 
-int udp_read_sock(struct sock *sk, read_descriptor_t *desc,
-		  sk_read_actor_t recv_actor)
+int udp_read_skb(struct sock *sk, skb_read_actor_t recv_actor)
 {
 	int copied = 0;
 
@@ -1819,7 +1818,7 @@ int udp_read_sock(struct sock *sk, read_descriptor_t *desc,
 			continue;
 		}
 
-		used = recv_actor(desc, skb, 0, skb->len);
+		used = recv_actor(sk, skb);
 		if (used <= 0) {
 			if (!copied)
 				copied = used;
@@ -1830,13 +1829,12 @@ int udp_read_sock(struct sock *sk, read_descriptor_t *desc,
 		}
 
 		kfree_skb(skb);
-		if (!desc->count)
-			break;
+		break;
 	}
 
 	return copied;
 }
-EXPORT_SYMBOL(udp_read_sock);
+EXPORT_SYMBOL(udp_read_skb);
 
 /*
  *	This should be easy, if there is something there we
diff --git a/net/ipv6/af_inet6.c b/net/ipv6/af_inet6.c
index 7d7b7523d126..06c1b16aa739 100644
--- a/net/ipv6/af_inet6.c
+++ b/net/ipv6/af_inet6.c
@@ -702,6 +702,7 @@ const struct proto_ops inet6_stream_ops = {
 	.sendpage_locked   = tcp_sendpage_locked,
 	.splice_read	   = tcp_splice_read,
 	.read_sock	   = tcp_read_sock,
+	.read_skb	   = tcp_read_skb,
 	.peek_len	   = tcp_peek_len,
 #ifdef CONFIG_COMPAT
 	.compat_ioctl	   = inet6_compat_ioctl,
@@ -727,7 +728,7 @@ const struct proto_ops inet6_dgram_ops = {
 	.getsockopt	   = sock_common_getsockopt,	/* ok		*/
 	.sendmsg	   = inet6_sendmsg,		/* retpoline's sake */
 	.recvmsg	   = inet6_recvmsg,		/* retpoline's sake */
-	.read_sock	   = udp_read_sock,
+	.read_skb	   = udp_read_skb,
 	.mmap		   = sock_no_mmap,
 	.sendpage	   = sock_no_sendpage,
 	.set_peek_off	   = sk_set_peek_off,
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index fecbd95da918..06cf0570635d 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -741,10 +741,8 @@ static ssize_t unix_stream_splice_read(struct socket *, loff_t *ppos,
 				       unsigned int flags);
 static int unix_dgram_sendmsg(struct socket *, struct msghdr *, size_t);
 static int unix_dgram_recvmsg(struct socket *, struct msghdr *, size_t, int);
-static int unix_read_sock(struct sock *sk, read_descriptor_t *desc,
-			  sk_read_actor_t recv_actor);
-static int unix_stream_read_sock(struct sock *sk, read_descriptor_t *desc,
-				 sk_read_actor_t recv_actor);
+static int unix_read_skb(struct sock *sk, skb_read_actor_t recv_actor);
+static int unix_stream_read_skb(struct sock *sk, skb_read_actor_t recv_actor);
 static int unix_dgram_connect(struct socket *, struct sockaddr *,
 			      int, int);
 static int unix_seqpacket_sendmsg(struct socket *, struct msghdr *, size_t);
@@ -798,7 +796,7 @@ static const struct proto_ops unix_stream_ops = {
 	.shutdown =	unix_shutdown,
 	.sendmsg =	unix_stream_sendmsg,
 	.recvmsg =	unix_stream_recvmsg,
-	.read_sock =	unix_stream_read_sock,
+	.read_skb =	unix_stream_read_skb,
 	.mmap =		sock_no_mmap,
 	.sendpage =	unix_stream_sendpage,
 	.splice_read =	unix_stream_splice_read,
@@ -823,7 +821,7 @@ static const struct proto_ops unix_dgram_ops = {
 	.listen =	sock_no_listen,
 	.shutdown =	unix_shutdown,
 	.sendmsg =	unix_dgram_sendmsg,
-	.read_sock =	unix_read_sock,
+	.read_skb =	unix_read_skb,
 	.recvmsg =	unix_dgram_recvmsg,
 	.mmap =		sock_no_mmap,
 	.sendpage =	sock_no_sendpage,
@@ -2490,8 +2488,7 @@ static int unix_dgram_recvmsg(struct socket *sock, struct msghdr *msg, size_t si
 	return __unix_dgram_recvmsg(sk, msg, size, flags);
 }
 
-static int unix_read_sock(struct sock *sk, read_descriptor_t *desc,
-			  sk_read_actor_t recv_actor)
+static int unix_read_skb(struct sock *sk, skb_read_actor_t recv_actor)
 {
 	int copied = 0;
 
@@ -2506,7 +2503,7 @@ static int unix_read_sock(struct sock *sk, read_descriptor_t *desc,
 		if (!skb)
 			return err;
 
-		used = recv_actor(desc, skb, 0, skb->len);
+		used = recv_actor(sk, skb);
 		if (used <= 0) {
 			if (!copied)
 				copied = used;
@@ -2517,8 +2514,7 @@ static int unix_read_sock(struct sock *sk, read_descriptor_t *desc,
 		}
 
 		kfree_skb(skb);
-		if (!desc->count)
-			break;
+		break;
 	}
 
 	return copied;
@@ -2653,13 +2649,12 @@ static struct sk_buff *manage_oob(struct sk_buff *skb, struct sock *sk,
 }
 #endif
 
-static int unix_stream_read_sock(struct sock *sk, read_descriptor_t *desc,
-				 sk_read_actor_t recv_actor)
+static int unix_stream_read_skb(struct sock *sk, skb_read_actor_t recv_actor)
 {
 	if (unlikely(sk->sk_state != TCP_ESTABLISHED))
 		return -ENOTCONN;
 
-	return unix_read_sock(sk, desc, recv_actor);
+	return unix_read_skb(sk, recv_actor);
 }
 
 static int unix_stream_read_generic(struct unix_stream_read_state *state,
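For illustration only, not part of the patch: after this conversion a
recv actor takes just the socket and the skb, and callers no longer
build a read_descriptor_t, exactly as the sk_psock_verdict_data_ready()
conversion above shows. A minimal sketch with hypothetical names
(example_read_actor, example_data_ready):

#include <linux/net.h>
#include <linux/skbuff.h>
#include <net/sock.h>

/* Hypothetical actor with the new skb_read_actor_t signature; it
 * returns the number of bytes it consumed, or <= 0 on error.
 */
static int example_read_actor(struct sock *sk, struct sk_buff *skb)
{
	return skb->len;
}

/* A data-ready callback now goes through ->read_skb(). */
static void example_data_ready(struct sock *sk)
{
	struct socket *sock = sk->sk_socket;

	if (unlikely(!sock || !sock->ops || !sock->ops->read_skb))
		return;
	sock->ops->read_skb(sk, example_read_actor);
}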
From patchwork Sun Apr 10 16:10:41 2022
X-Patchwork-Id: 12808173
From: Cong Wang
To: netdev@vger.kernel.org
Cc: Cong Wang, Eric Dumazet, John Fastabend, Daniel Borkmann, Jakub Sitnicki
Subject: [Patch bpf-next v1 3/4] skmsg: get rid of skb_clone()
Date: Sun, 10 Apr 2022 09:10:41 -0700
Message-Id: <20220410161042.183540-4-xiyou.wangcong@gmail.com>
In-Reply-To: <20220410161042.183540-1-xiyou.wangcong@gmail.com>
References: <20220410161042.183540-1-xiyou.wangcong@gmail.com>

From: Cong Wang
With ->read_skb() we now have an entire skb dequeued from the receive
queue, so we just need to grab an additional refcnt before passing its
ownership to the recv actors. After that we should not touch the skb any
more, particularly skb->sk. Fortunately, skb->sk is already set for most
of the protocols except UDP, where skb->sk has been stolen, so we have
to fix it up for the UDP case.

Cc: Eric Dumazet
Cc: John Fastabend
Cc: Daniel Borkmann
Cc: Jakub Sitnicki
Signed-off-by: Cong Wang
---
 net/core/skmsg.c | 7 +------
 net/ipv4/udp.c   | 1 +
 2 files changed, 2 insertions(+), 6 deletions(-)

diff --git a/net/core/skmsg.c b/net/core/skmsg.c
index 19bca36940a2..7aa37b6287e1 100644
--- a/net/core/skmsg.c
+++ b/net/core/skmsg.c
@@ -1162,10 +1162,7 @@ static int sk_psock_verdict_recv(struct sock *sk, struct sk_buff *skb)
 	int ret = __SK_DROP;
 	int len = skb->len;
 
-	/* clone here so sk_eat_skb() in tcp_read_sock does not drop our data */
-	skb = skb_clone(skb, GFP_ATOMIC);
-	if (!skb)
-		return 0;
+	skb_get(skb);
 
 	rcu_read_lock();
 	psock = sk_psock(sk);
@@ -1178,12 +1175,10 @@ static int sk_psock_verdict_recv(struct sock *sk, struct sk_buff *skb)
 	if (!prog)
 		prog = READ_ONCE(psock->progs.skb_verdict);
 	if (likely(prog)) {
-		skb->sk = sk;
 		skb_dst_drop(skb);
 		skb_bpf_redirect_clear(skb);
 		ret = bpf_prog_run_pin_on_cpu(prog, skb);
 		ret = sk_psock_map_verd(ret, skb_bpf_redirect_fetch(skb));
-		skb->sk = NULL;
 	}
 	if (sk_psock_verdict_apply(psock, skb, ret) < 0)
 		len = 0;
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 9faca5758ed6..dbf33f68555d 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -1818,6 +1818,7 @@ int udp_read_skb(struct sock *sk, skb_read_actor_t recv_actor)
 			continue;
 		}
 
+		WARN_ON(!skb_set_owner_sk_safe(skb, sk));
 		used = recv_actor(sk, skb);
 		if (used <= 0) {
 			if (!copied)
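For illustration only, not part of the patch: the ownership model after
this change is reference counting instead of cloning. An actor that
wants to keep the skb takes its own reference with skb_get(); the
kfree_skb() in the ->read_skb() loop then only drops the caller's
reference. A hypothetical sketch:

#include <linux/skbuff.h>

/* Hypothetical actor: hold a private reference instead of cloning, so
 * the payload is never copied. Whoever ends up owning the skb drops
 * this reference with kfree_skb(); here we drop it immediately.
 */
static int example_verdict_recv(struct sock *sk, struct sk_buff *skb)
{
	int len = skb->len;

	skb_get(skb);
	/* ... run a verdict program or queue the skb somewhere ... */
	kfree_skb(skb);
	return len;
}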
From patchwork Sun Apr 10 16:10:42 2022
X-Patchwork-Id: 12808174
From: Cong Wang
To: netdev@vger.kernel.org
Cc: Cong Wang, John Fastabend, Daniel Borkmann, Jakub Sitnicki
Subject: [Patch bpf-next v1 4/4] skmsg: get rid of unnecessary memset()
Date: Sun, 10 Apr 2022 09:10:42 -0700
Message-Id: <20220410161042.183540-5-xiyou.wangcong@gmail.com>
In-Reply-To: <20220410161042.183540-1-xiyou.wangcong@gmail.com>
References: <20220410161042.183540-1-xiyou.wangcong@gmail.com>

From: Cong Wang

We always allocate the skmsg with kzalloc(), so there is no need to call
memset(0) on it; the only thing we need from sk_msg_init() is
sg_init_marker(). So introduce a new helper which is just
kzalloc() + sg_init_marker(); this saves an unnecessary memset(0) for
the skmsg on the fast path.
Cc: John Fastabend
Cc: Daniel Borkmann
Cc: Jakub Sitnicki
Signed-off-by: Cong Wang
---
 net/core/skmsg.c | 23 +++++++++++++----------
 1 file changed, 13 insertions(+), 10 deletions(-)

diff --git a/net/core/skmsg.c b/net/core/skmsg.c
index 7aa37b6287e1..d165d81c1e4a 100644
--- a/net/core/skmsg.c
+++ b/net/core/skmsg.c
@@ -497,23 +497,27 @@ bool sk_msg_is_readable(struct sock *sk)
 }
 EXPORT_SYMBOL_GPL(sk_msg_is_readable);
 
-static struct sk_msg *sk_psock_create_ingress_msg(struct sock *sk,
-						  struct sk_buff *skb)
+static struct sk_msg *alloc_sk_msg(void)
 {
 	struct sk_msg *msg;
 
-	if (atomic_read(&sk->sk_rmem_alloc) > sk->sk_rcvbuf)
+	msg = kzalloc(sizeof(*msg), __GFP_NOWARN | GFP_KERNEL);
+	if (unlikely(!msg))
 		return NULL;
+	sg_init_marker(msg->sg.data, NR_MSG_FRAG_IDS);
+	return msg;
+}
 
-	if (!sk_rmem_schedule(sk, skb, skb->truesize))
+static struct sk_msg *sk_psock_create_ingress_msg(struct sock *sk,
+						  struct sk_buff *skb)
+{
+	if (atomic_read(&sk->sk_rmem_alloc) > sk->sk_rcvbuf)
 		return NULL;
 
-	msg = kzalloc(sizeof(*msg), __GFP_NOWARN | GFP_KERNEL);
-	if (unlikely(!msg))
+	if (!sk_rmem_schedule(sk, skb, skb->truesize))
 		return NULL;
 
-	sk_msg_init(msg);
-	return msg;
+	return alloc_sk_msg();
 }
 
 static int sk_psock_skb_ingress_enqueue(struct sk_buff *skb,
@@ -586,13 +590,12 @@ static int sk_psock_skb_ingress(struct sk_psock *psock, struct sk_buff *skb,
 static int sk_psock_skb_ingress_self(struct sk_psock *psock, struct sk_buff *skb,
 				     u32 off, u32 len)
 {
-	struct sk_msg *msg = kzalloc(sizeof(*msg), __GFP_NOWARN | GFP_ATOMIC);
+	struct sk_msg *msg = alloc_sk_msg();
 	struct sock *sk = psock->sk;
 	int err;
 
 	if (unlikely(!msg))
 		return -EAGAIN;
-	sk_msg_init(msg);
 	skb_set_owner_r(skb, sk);
 	err = sk_psock_skb_ingress_enqueue(skb, off, len, psock, sk, msg);
 	if (err < 0)
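For illustration only, not part of the patch: sk_msg_init() is
essentially memset(0) plus sg_init_marker(), and kzalloc() already
returns zeroed memory, which is why the new helper only needs to set the
scatterlist end marker. A hypothetical standalone variant that takes the
gfp flags as a parameter:

#include <linux/scatterlist.h>
#include <linux/skmsg.h>
#include <linux/slab.h>

/* Hypothetical helper equivalent to the alloc_sk_msg() added here, but
 * with caller-supplied gfp flags: kzalloc() zeroes the struct, so only
 * the scatterlist end marker still has to be set.
 */
static struct sk_msg *example_alloc_sk_msg(gfp_t gfp)
{
	struct sk_msg *msg = kzalloc(sizeof(*msg), gfp | __GFP_NOWARN);

	if (unlikely(!msg))
		return NULL;
	sg_init_marker(msg->sg.data, NR_MSG_FRAG_IDS);
	return msg;
}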