From patchwork Mon Dec 18 02:40:17 2023
X-Patchwork-Submitter: Mina Almasry
X-Patchwork-Id: 13497350
Date: Sun, 17 Dec 2023 18:40:17 -0800
In-Reply-To: <20231218024024.3516870-1-almasrymina@google.com>
References: <20231218024024.3516870-1-almasrymina@google.com>
Message-ID: <20231218024024.3516870-11-almasrymina@google.com>
Subject: [RFC PATCH net-next v5 10/14] net: add support for skbs with unreadable frags
From: Mina Almasry
To: Mathieu Desnoyers, Martin KaFai Lau, Song Liu, Yonghong Song,
 John Fastabend, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
 netdev@vger.kernel.org, linux-kernel@vger.kernel.org,
 linux-doc@vger.kernel.org, linux-alpha@vger.kernel.org,
 linux-mips@vger.kernel.org, linux-parisc@vger.kernel.org,
 sparclinux@vger.kernel.org, linux-trace-kernel@vger.kernel.org,
 linux-arch@vger.kernel.org, bpf@vger.kernel.org,
 linux-kselftest@vger.kernel.org, linux-media@vger.kernel.org,
 dri-devel@lists.freedesktop.org
Cc: Kaiyuan Zhang, Pavel Begunkov, Alexei Starovoitov,
 "James E.J. Bottomley", Eric Dumazet, Shuah Khan, Sumit Semwal,
 Mina Almasry, Willem de Bruijn, Jeroen de Borst, Daniel Borkmann,
 Jonathan Corbet, Helge Deller, Andrii Nakryiko, Jason Gunthorpe,
 Jakub Kicinski, Matt Turner, Paolo Abeni, Jesper Dangaard Brouer,
 Arnd Bergmann, Richard Henderson, Steven Rostedt, Shailend Chand,
 Ivan Kokshaysky, Harshitha Ramamurthy, Praveen Kaligineedi,
 Willem de Bruijn, Thomas Bogendoerfer, David Ahern, Ilias Apalodimas,
 David Wei, Christian König, Yunsheng Lin, Masami Hiramatsu,
 Shakeel Butt, "David S. Miller"

For device memory TCP, we expect the skb headers to be available in host
memory for access, and we expect the skb frags to be in device memory and
inaccessible to the host. We expect there to be no mixing and matching of
device memory frags (inaccessible) with host memory frags (accessible) in
the same skb.

Add a skb->dmabuf flag which indicates whether the frags in this skb are
device memory frags or not.

__skb_fill_netmem_desc() now checks frags added to skbs for net_iov,
and marks the skb as skb->dmabuf accordingly.

Add checks throughout the network stack to avoid accessing the frags of
devmem skbs and to avoid coalescing devmem skbs with non-devmem skbs.

Signed-off-by: Willem de Bruijn
Signed-off-by: Kaiyuan Zhang
Signed-off-by: Mina Almasry

---

Changes in v1:
- Rename devmem -> dmabuf (David).
- Flip skb_frags_not_readable (Jakub).

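Illustration only, not part of the patch: a minimal userspace sketch of the
rule described above, assuming a toy model of an skb. Every toy_* name is
hypothetical and stands in for the kernel structures; a NULL host pointer
plays the role of a net_iov frag. It shows the flag being set when a
device-memory frag is added, a copy that succeeds for the header but faults
once it would touch unreadable frags, and the no-mixing coalesce rule.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define TOY_MAX_FRAGS 4

/* Hypothetical stand-ins; these are not the kernel's sk_buff or netmem types. */
struct toy_frag {
	const unsigned char *host_data;	/* NULL models a device-memory frag */
	size_t len;
};

struct toy_skb {
	unsigned char head[64];		/* linear header area, always in host memory */
	size_t headlen;
	struct toy_frag frags[TOY_MAX_FRAGS];
	int nr_frags;
	bool dmabuf;			/* set once a device-memory frag is attached */
};

static bool toy_frags_readable(const struct toy_skb *skb)
{
	return !skb->dmabuf;
}

/* Attach a frag; a device-memory frag marks the whole skb as unreadable. */
static void toy_fill_frag(struct toy_skb *skb, const unsigned char *host_data,
			  size_t len)
{
	assert(skb->nr_frags < TOY_MAX_FRAGS);
	skb->frags[skb->nr_frags].host_data = host_data;
	skb->frags[skb->nr_frags].len = len;
	skb->nr_frags++;
	if (!host_data)
		skb->dmabuf = true;
}

/* Copy len bytes from offset: the head is always readable, frags may not be. */
static int toy_copy_bits(const struct toy_skb *skb, size_t offset, void *to,
			 size_t len)
{
	unsigned char *dst = to;
	size_t start = skb->headlen;
	int i;

	if (!len)
		return 0;

	if (offset < start) {
		size_t copy = start - offset;

		if (copy > len)
			copy = len;
		memcpy(dst, skb->head + offset, copy);
		len -= copy;
		if (!len)
			return 0;
		offset += copy;
		dst += copy;
	}

	/* The remaining bytes live in frags; refuse if the host cannot read them. */
	if (!toy_frags_readable(skb))
		return -EFAULT;

	for (i = 0; i < skb->nr_frags; i++) {
		size_t end = start + skb->frags[i].len;

		if (offset < end) {
			size_t copy = end - offset;

			if (copy > len)
				copy = len;
			memcpy(dst, skb->frags[i].host_data + (offset - start),
			       copy);
			len -= copy;
			if (!len)
				return 0;
			offset += copy;
			dst += copy;
		}
		start = end;
	}
	return -EFAULT;
}

/* Readable and unreadable skbs are never merged. */
static bool toy_can_coalesce(const struct toy_skb *to,
			     const struct toy_skb *from)
{
	return toy_frags_readable(to) == toy_frags_readable(from);
}
```

In this model a header-only read of an unreadable skb returns 0, while any
read that crosses into the frags returns -EFAULT, which is the behaviour the
skb_copy_bits() hunk aims for.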
---
 include/linux/skbuff.h | 19 ++++++++++--
 include/net/tcp.h      |  5 +--
 net/core/datagram.c    |  6 ++++
 net/core/gro.c         |  5 ++-
 net/core/skbuff.c      | 69 ++++++++++++++++++++++++++++++++++++------
 net/ipv4/tcp.c         |  3 ++
 net/ipv4/tcp_input.c   | 13 ++++++--
 net/ipv4/tcp_output.c  |  5 ++-
 net/packet/af_packet.c |  4 +--
 9 files changed, 108 insertions(+), 21 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 890fc2b94fc7..07691a649b05 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -820,6 +820,8 @@ typedef unsigned char *sk_buff_data_t;
  *	@csum_level: indicates the number of consecutive checksums found in
  *		the packet minus one that have been verified as
  *		CHECKSUM_UNNECESSARY (max 3)
+ *	@dmabuf: indicates that all the fragments in this skb are backed by
+ *		dmabuf.
  *	@dst_pending_confirm: need to confirm neighbour
  *	@decrypted: Decrypted SKB
  *	@slow_gro: state present at GRO time, slower prepare step required
@@ -1006,7 +1008,7 @@ struct sk_buff {
 #if IS_ENABLED(CONFIG_IP_SCTP)
 	__u8 csum_not_inet:1;
 #endif
-
+	__u8 dmabuf:1;
 #if defined(CONFIG_NET_SCHED) || defined(CONFIG_NET_XGRESS)
 	__u16 tc_index; /* traffic control index */
 #endif
@@ -1781,6 +1783,12 @@ static inline void skb_zcopy_downgrade_managed(struct sk_buff *skb)
 		__skb_zcopy_downgrade_managed(skb);
 }
 
+/* Return true if frags in this skb are readable by the host. */
+static inline bool skb_frags_readable(const struct sk_buff *skb)
+{
+	return !skb->dmabuf;
+}
+
 static inline void skb_mark_not_on_list(struct sk_buff *skb)
 {
 	skb->next = NULL;
@@ -2498,10 +2506,17 @@ static inline void __skb_fill_netmem_desc(struct sk_buff *skb, int i,
 					  struct netmem *netmem, int off,
 					  int size)
 {
-	struct page *page = netmem_to_page(netmem);
+	struct page *page;
 
 	__skb_fill_netmem_desc_noacc(skb_shinfo(skb), i, netmem, off, size);
 
+	if (netmem_is_net_iov(netmem)) {
+		skb->dmabuf = true;
+		return;
+	}
+
+	page = netmem_to_page(netmem);
+
 	/* Propagate page pfmemalloc to the skb if we can. The problem is
 	 * that not all callers have unique ownership of the page but rely
 	 * on page_is_pfmemalloc doing the right thing(tm).
diff --git a/include/net/tcp.h b/include/net/tcp.h
index f5ca4abaee8b..5ecd5307f485 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -1017,7 +1017,7 @@ static inline int tcp_skb_mss(const struct sk_buff *skb)
 
 static inline bool tcp_skb_can_collapse_to(const struct sk_buff *skb)
 {
-	return likely(!TCP_SKB_CB(skb)->eor);
+	return likely(!TCP_SKB_CB(skb)->eor && skb_frags_readable(skb));
 }
 
 static inline bool tcp_skb_can_collapse(const struct sk_buff *to,
@@ -1025,7 +1025,8 @@ static inline bool tcp_skb_can_collapse(const struct sk_buff *to,
 {
 	return likely(tcp_skb_can_collapse_to(to) &&
 		      mptcp_skb_can_collapse(to, from) &&
-		      skb_pure_zcopy_same(to, from));
+		      skb_pure_zcopy_same(to, from) &&
+		      skb_frags_readable(to) == skb_frags_readable(from));
 }
 
 /* Events passed to congestion control interface */
diff --git a/net/core/datagram.c b/net/core/datagram.c
index 103d46fa0eeb..f28472ddbaa4 100644
--- a/net/core/datagram.c
+++ b/net/core/datagram.c
@@ -426,6 +426,9 @@ static int __skb_datagram_iter(const struct sk_buff *skb, int offset,
 			return 0;
 	}
 
+	if (!skb_frags_readable(skb))
+		goto short_copy;
+
 	/* Copy paged appendix. Hmm... why does this look so complicated? */
 	for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) {
 		int end;
@@ -638,6 +641,9 @@ int __zerocopy_sg_from_iter(struct msghdr *msg, struct sock *sk,
 	if (msg && msg->msg_ubuf && msg->sg_from_iter)
 		return msg->sg_from_iter(sk, skb, from, length);
 
+	if (!skb_frags_readable(skb))
+		return -EFAULT;
+
 	frag = skb_shinfo(skb)->nr_frags;
 
 	while (length && iov_iter_count(from)) {
diff --git a/net/core/gro.c b/net/core/gro.c
index 42d7f6755f32..26df48f1b355 100644
--- a/net/core/gro.c
+++ b/net/core/gro.c
@@ -390,6 +390,9 @@ static void gro_pull_from_frag0(struct sk_buff *skb, int grow)
 {
 	struct skb_shared_info *pinfo = skb_shinfo(skb);
 
+	if (WARN_ON_ONCE(!skb_frags_readable(skb)))
+		return;
+
 	BUG_ON(skb->end - skb->tail < grow);
 
 	memcpy(skb_tail_pointer(skb), NAPI_GRO_CB(skb)->frag0, grow);
@@ -411,7 +414,7 @@ static void gro_try_pull_from_frag0(struct sk_buff *skb)
 {
 	int grow = skb_gro_offset(skb) - skb_headlen(skb);
 
-	if (grow > 0)
+	if (grow > 0 && skb_frags_readable(skb))
 		gro_pull_from_frag0(skb, grow);
 }
 
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 671775bad5f9..298593024e42 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -1824,6 +1824,9 @@ int skb_copy_ubufs(struct sk_buff *skb, gfp_t gfp_mask)
 	if (skb_shared(skb) || skb_unclone(skb, gfp_mask))
 		return -EINVAL;
 
+	if (!skb_frags_readable(skb))
+		return -EFAULT;
+
 	if (!num_frags)
 		goto release;
 
@@ -1995,8 +1998,12 @@ struct sk_buff *skb_copy(const struct sk_buff *skb, gfp_t gfp_mask)
 {
 	int headerlen = skb_headroom(skb);
 	unsigned int size = skb_end_offset(skb) + skb->data_len;
-	struct sk_buff *n = __alloc_skb(size, gfp_mask,
-					skb_alloc_rx_flag(skb), NUMA_NO_NODE);
+	struct sk_buff *n;
+
+	if (!skb_frags_readable(skb))
+		return NULL;
+
+	n = __alloc_skb(size, gfp_mask, skb_alloc_rx_flag(skb), NUMA_NO_NODE);
 
 	if (!n)
 		return NULL;
@@ -2322,14 +2329,16 @@ struct sk_buff *skb_copy_expand(const struct sk_buff *skb,
 				int newheadroom, int newtailroom,
 				gfp_t gfp_mask)
 {
-	/*
-	 *	Allocate the copy buffer
-	 */
-	struct sk_buff *n = __alloc_skb(newheadroom + skb->len + newtailroom,
-					gfp_mask, skb_alloc_rx_flag(skb),
-					NUMA_NO_NODE);
 	int oldheadroom = skb_headroom(skb);
 	int head_copy_len, head_copy_off;
+	struct sk_buff *n;
+
+	if (!skb_frags_readable(skb))
+		return NULL;
+
+	/* Allocate the copy buffer */
+	n = __alloc_skb(newheadroom + skb->len + newtailroom, gfp_mask,
+			skb_alloc_rx_flag(skb), NUMA_NO_NODE);
 
 	if (!n)
 		return NULL;
@@ -2668,6 +2677,9 @@ void *__pskb_pull_tail(struct sk_buff *skb, int delta)
 	 */
 	int i, k, eat = (skb->tail + delta) - skb->end;
 
+	if (!skb_frags_readable(skb))
+		return NULL;
+
 	if (eat > 0 || skb_cloned(skb)) {
 		if (pskb_expand_head(skb, 0, eat > 0 ? eat + 128 : 0,
 				     GFP_ATOMIC))
@@ -2821,6 +2833,9 @@ int skb_copy_bits(const struct sk_buff *skb, int offset, void *to, int len)
 		to += copy;
 	}
 
+	if (!skb_frags_readable(skb))
+		goto fault;
+
 	for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) {
 		int end;
 		skb_frag_t *f = &skb_shinfo(skb)->frags[i];
@@ -3009,6 +3024,9 @@ static bool __skb_splice_bits(struct sk_buff *skb, struct pipe_inode_info *pipe,
 	/*
 	 * then map the fragments
 	 */
+	if (!skb_frags_readable(skb))
+		return false;
+
 	for (seg = 0; seg < skb_shinfo(skb)->nr_frags; seg++) {
 		const skb_frag_t *f = &skb_shinfo(skb)->frags[seg];
 
@@ -3232,6 +3250,9 @@ int skb_store_bits(struct sk_buff *skb, int offset, const void *from, int len)
 		from += copy;
 	}
 
+	if (!skb_frags_readable(skb))
+		goto fault;
+
 	for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) {
 		skb_frag_t *frag = &skb_shinfo(skb)->frags[i];
 		int end;
@@ -3311,6 +3332,9 @@ __wsum __skb_checksum(const struct sk_buff *skb, int offset, int len,
 		pos = copy;
 	}
 
+	if (!skb_frags_readable(skb))
+		return 0;
+
 	for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) {
 		int end;
 		skb_frag_t *frag = &skb_shinfo(skb)->frags[i];
@@ -3411,6 +3435,9 @@ __wsum skb_copy_and_csum_bits(const struct sk_buff *skb, int offset,
 		pos = copy;
 	}
 
+	if (!skb_frags_readable(skb))
+		return 0;
+
 	for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) {
 		int end;
 
@@ -3902,7 +3929,9 @@ static inline void skb_split_inside_header(struct sk_buff *skb,
 		skb_shinfo(skb1)->frags[i] = skb_shinfo(skb)->frags[i];
 
 	skb_shinfo(skb1)->nr_frags = skb_shinfo(skb)->nr_frags;
+	skb1->dmabuf = skb->dmabuf;
 	skb_shinfo(skb)->nr_frags = 0;
+	skb->dmabuf = 0;
 	skb1->data_len = skb->data_len;
 	skb1->len += skb1->data_len;
 	skb->data_len = 0;
@@ -3916,6 +3945,7 @@ static inline void skb_split_no_header(struct sk_buff *skb,
 {
 	int i, k = 0;
 	const int nfrags = skb_shinfo(skb)->nr_frags;
+	const int dmabuf = skb->dmabuf;
 
 	skb_shinfo(skb)->nr_frags = 0;
 	skb1->len = skb1->data_len = skb->len - len;
@@ -3949,6 +3979,16 @@ static inline void skb_split_no_header(struct sk_buff *skb,
 		pos += size;
 	}
 	skb_shinfo(skb1)->nr_frags = k;
+
+	if (skb_shinfo(skb)->nr_frags)
+		skb->dmabuf = dmabuf;
+	else
+		skb->dmabuf = 0;
+
+	if (skb_shinfo(skb1)->nr_frags)
+		skb1->dmabuf = dmabuf;
+	else
+		skb1->dmabuf = 0;
 }
 
 /**
@@ -4184,6 +4224,9 @@ unsigned int skb_seq_read(unsigned int consumed, const u8 **data,
 		return block_limit - abs_offset;
 	}
 
+	if (!skb_frags_readable(st->cur_skb))
+		return 0;
+
 	if (st->frag_idx == 0 && !st->frag_data)
 		st->stepped_offset += skb_headlen(st->cur_skb);
 
@@ -5799,7 +5842,10 @@ bool skb_try_coalesce(struct sk_buff *to, struct sk_buff *from,
 	    (from->pp_recycle && skb_cloned(from)))
 		return false;
 
-	if (len <= skb_tailroom(to)) {
+	if (skb_frags_readable(from) != skb_frags_readable(to))
+		return false;
+
+	if (len <= skb_tailroom(to) && skb_frags_readable(from)) {
 		if (len)
 			BUG_ON(skb_copy_bits(from, 0, skb_put(to, len), len));
 		*delta_truesize = 0;
@@ -5974,6 +6020,9 @@ int skb_ensure_writable(struct sk_buff *skb, unsigned int write_len)
 	if (!pskb_may_pull(skb, write_len))
 		return -ENOMEM;
 
+	if (!skb_frags_readable(skb))
+		return -EFAULT;
+
 	if (!skb_cloned(skb) || skb_clone_writable(skb, write_len))
 		return 0;
 
@@ -6628,7 +6677,7 @@ void skb_condense(struct sk_buff *skb)
 {
 	if (skb->data_len) {
 		if (skb->data_len > skb->end - skb->tail ||
-		    skb_cloned(skb))
+		    skb_cloned(skb) || !skb_frags_readable(skb))
 			return;
 
 		/* Nice, we can free page frag(s) right now */
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 82485af12cdc..6f4c59143156 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -2140,6 +2140,9 @@ static int tcp_zerocopy_receive(struct sock *sk,
 				skb = tcp_recv_skb(sk, seq, &offset);
 			}
 
+			if (!skb_frags_readable(skb))
+				break;
+
 			if (TCP_SKB_CB(skb)->has_rxtstamp) {
 				tcp_update_recv_tstamps(skb, tss);
 				zc->msg_flags |= TCP_CMSG_TS;
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index df7b13f0e5e0..1964459d722a 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -5331,6 +5331,9 @@ tcp_collapse(struct sock *sk, struct sk_buff_head *list, struct rb_root *root,
 	for (end_of_skbs = true; skb != NULL && skb != tail; skb = n) {
 		n = tcp_skb_next(skb, list);
 
+		if (!skb_frags_readable(skb))
+			goto skip_this;
+
 		/* No new bits? It is possible on ofo queue. */
 		if (!before(start, TCP_SKB_CB(skb)->end_seq)) {
 			skb = tcp_collapse_one(sk, skb, list, root);
@@ -5351,17 +5354,20 @@ tcp_collapse(struct sock *sk, struct sk_buff_head *list, struct rb_root *root,
 			break;
 		}
 
-		if (n && n != tail && mptcp_skb_can_collapse(skb, n) &&
+		if (n && n != tail && skb_frags_readable(n) &&
+		    mptcp_skb_can_collapse(skb, n) &&
 		    TCP_SKB_CB(skb)->end_seq != TCP_SKB_CB(n)->seq) {
 			end_of_skbs = false;
 			break;
 		}
 
+skip_this:
 		/* Decided to skip this, advance start seq. */
 		start = TCP_SKB_CB(skb)->end_seq;
 	}
 	if (end_of_skbs ||
-	    (TCP_SKB_CB(skb)->tcp_flags & (TCPHDR_SYN | TCPHDR_FIN)))
+	    (TCP_SKB_CB(skb)->tcp_flags & (TCPHDR_SYN | TCPHDR_FIN)) ||
+	    !skb_frags_readable(skb))
 		return;
 
 	__skb_queue_head_init(&tmp);
@@ -5405,7 +5411,8 @@ tcp_collapse(struct sock *sk, struct sk_buff_head *list, struct rb_root *root,
 				if (!skb ||
 				    skb == tail ||
 				    !mptcp_skb_can_collapse(nskb, skb) ||
-				    (TCP_SKB_CB(skb)->tcp_flags & (TCPHDR_SYN | TCPHDR_FIN)))
+				    (TCP_SKB_CB(skb)->tcp_flags & (TCPHDR_SYN | TCPHDR_FIN)) ||
+				    !skb_frags_readable(skb))
 					goto end;
 #ifdef CONFIG_TLS_DEVICE
 				if (skb->decrypted != nskb->decrypted)
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index e3167ad96567..30f53de14a24 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -2343,7 +2343,8 @@ static bool tcp_can_coalesce_send_queue_head(struct sock *sk, int len)
 
 		if (unlikely(TCP_SKB_CB(skb)->eor) ||
 		    tcp_has_tx_tstamp(skb) ||
-		    !skb_pure_zcopy_same(skb, next))
+		    !skb_pure_zcopy_same(skb, next) ||
+		    skb_frags_readable(skb) != skb_frags_readable(next))
 			return false;
 
 		len -= skb->len;
@@ -3227,6 +3228,8 @@ static bool tcp_can_collapse(const struct sock *sk, const struct sk_buff *skb)
 		return false;
 	if (skb_cloned(skb))
 		return false;
+	if (!skb_frags_readable(skb))
+		return false;
 	/* Some heuristics for collapsing over SACK'd could be invented */
 	if (TCP_SKB_CB(skb)->sacked & TCPCB_SACKED_ACKED)
 		return false;
diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index 5f1757a32842..bfed83ef92ef 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -2156,7 +2156,7 @@ static int packet_rcv(struct sk_buff *skb, struct net_device *dev,
 		}
 	}
 
-	snaplen = skb->len;
+	snaplen = skb_frags_readable(skb) ? skb->len : skb_headlen(skb);
 
 	res = run_filter(skb, sk, snaplen);
 	if (!res)
@@ -2276,7 +2276,7 @@ static int tpacket_rcv(struct sk_buff *skb, struct net_device *dev,
 		}
 	}
 
-	snaplen = skb->len;
+	snaplen = skb_frags_readable(skb) ? skb->len : skb_headlen(skb);
 
 	res = run_filter(skb, sk, snaplen);
 	if (!res)
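
Illustration only, not part of the patch: a small userspace sketch of two
behaviours from the hunks above, assuming a reduced model of an skb. The
mini_* names are hypothetical. mini_snaplen() mirrors the af_packet change
(only the linear header is exposed when frags are unreadable), and
mini_split_frags() mirrors the skb_split_no_header() rule that the flag
follows the frags, so an skb left with no frags is readable again. Length
bookkeeping during the split is deliberately left out of the model.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical reduced model; not the kernel's struct sk_buff. */
struct mini_skb {
	size_t len;		/* total bytes: header plus frags */
	size_t data_len;	/* bytes held in frags */
	int nr_frags;
	bool dmabuf;		/* frags are in device memory */
};

static size_t mini_headlen(const struct mini_skb *skb)
{
	return skb->len - skb->data_len;
}

/* Only the header is safe to hand to a packet socket for unreadable skbs. */
static size_t mini_snaplen(const struct mini_skb *skb)
{
	return skb->dmabuf ? mini_headlen(skb) : skb->len;
}

/* Move the last `moved` frags to skb1 and let the flag follow the frags. */
static void mini_split_frags(struct mini_skb *skb, struct mini_skb *skb1,
			     int moved)
{
	const bool dmabuf = skb->dmabuf;

	assert(moved >= 0 && moved <= skb->nr_frags);
	skb1->nr_frags = moved;
	skb->nr_frags -= moved;

	/* An skb with no frags left has nothing unreadable in it. */
	skb->dmabuf = skb->nr_frags ? dmabuf : false;
	skb1->dmabuf = skb1->nr_frags ? dmabuf : false;
}
```

For a 1500-byte skb with 1400 bytes in device-memory frags, mini_snaplen()
yields 100, and moving all frags to a second skb leaves the first readable.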