From patchwork Tue Nov 7 21:40:26 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: David Wei X-Patchwork-Id: 13449344 X-Patchwork-Delegate: kuba@kernel.org Received: from lindbergh.monkeyblade.net (lindbergh.monkeyblade.net [23.128.96.19]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 3D5F3450C2 for ; Tue, 7 Nov 2023 21:40:58 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=davidwei-uk.20230601.gappssmtp.com header.i=@davidwei-uk.20230601.gappssmtp.com header.b="zURq9TFY" Received: from mail-pl1-x636.google.com (mail-pl1-x636.google.com [IPv6:2607:f8b0:4864:20::636]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id E70C410DF for ; Tue, 7 Nov 2023 13:40:57 -0800 (PST) Received: by mail-pl1-x636.google.com with SMTP id d9443c01a7336-1cc30bf9e22so1028245ad.1 for ; Tue, 07 Nov 2023 13:40:57 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=davidwei-uk.20230601.gappssmtp.com; s=20230601; t=1699393257; x=1699998057; darn=vger.kernel.org; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:from:to:cc:subject:date :message-id:reply-to; bh=LbHxJpOeuHDOYo11oQCQ4dzaD+AelcYGpMtLC1wBtjE=; b=zURq9TFYQb2cBk0FJUhcVT4dVMUD1VfaATRGl+jjvUpZ5iD6Zaps3qD3OpMP2nw3t3 +b7Dw/DFNmlTTuZ55gf7puapaSMXLGLgBDbX4zZnnNZNOcgbP50uRLUehuMyNQRbpFgI cikaeBJThSUutYXABQFIcvmhsYH2zbfUN5IsS2Mu0JWS7WhJbr25eFvo7CNsBJ50RtHb 1vJe6v/wWnW6lTn4Tx9xrYs8KIQ4hpyn+YO5igtA2VpB52Sd8GGUsXoF1JFeisoTWPzS cpVnimw/tLqqdPywysJ6WX3WFmB2urMLUe5hZFSXQqLY9SApvA4yLwRVJUkVJWIfESHy U/6w== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1699393257; x=1699998057; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=LbHxJpOeuHDOYo11oQCQ4dzaD+AelcYGpMtLC1wBtjE=; b=rv+AKgWVQu/MHMipWLGL2+HzB2EoNy1RtWZlR+LcIw6rQhhKIbwNRnaXHNzEQeIOsm WEQWOtrpMSZZKhc7z3mY2kE/TDwC61icFQpVf4EKMD9TMmHVrjd/TCwRIzuYmq1v46vM OTuHeQCyqyCBbUEe2ollFx7q5Cgrf2Wgb/7VfVCXRe1DVuCLUfCNQQiQeargq92U3cO4 GSf2u3egzxcuicD1lsRPp7dW9u3LKAUjrGh2cwWjjGYNpCgceae97zyzYrjlbdLTk3q8 bBdXFc4TQulj4xMjBNHfh24GX7ID2Jb/vdNsO0fO5KtU3/OZlnAhU7c75FO6+7dxgvkY gdMw== X-Gm-Message-State: AOJu0Yz32z3OqN5Gul99yyIC93qgncsMBTQI9nSPafiz8okjS96Qk9Xl M3KRo52htG2iSBi9K67ydOj7ig== X-Google-Smtp-Source: AGHT+IFcZmC6BQsd2AGq4KJb4oTyCPww1Yk63L0EhdNUL2eJjnTiZEo9JXoLvZFlPH3arwmsV2pE9w== X-Received: by 2002:a17:903:2282:b0:1cc:5671:8d9 with SMTP id b2-20020a170903228200b001cc567108d9mr6359515plh.27.1699393257307; Tue, 07 Nov 2023 13:40:57 -0800 (PST) Received: from localhost (fwdproxy-prn-005.fbsv.net. [2a03:2880:ff:5::face:b00c]) by smtp.gmail.com with ESMTPSA id j1-20020a170902690100b001c74df14e6esm284401plk.51.2023.11.07.13.40.56 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Tue, 07 Nov 2023 13:40:57 -0800 (PST) From: David Wei To: io-uring@vger.kernel.org, netdev@vger.kernel.org Cc: Jens Axboe , Pavel Begunkov , Jakub Kicinski , Paolo Abeni , "David S. 
Miller" , Eric Dumazet , Jesper Dangaard Brouer , David Ahern , Mina Almasry , Willem de Bruijn , Dragos Tatulea Subject: [PATCH 01/20] io_uring: add interface queue Date: Tue, 7 Nov 2023 13:40:26 -0800 Message-Id: <20231107214045.2172393-2-dw@davidwei.uk> X-Mailer: git-send-email 2.39.3 In-Reply-To: <20231107214045.2172393-1-dw@davidwei.uk> References: <20231107214045.2172393-1-dw@davidwei.uk> Precedence: bulk X-Mailing-List: netdev@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 This patch introduces a new object in io_uring called an interface queue (ifq) which contains: * A pool region allocated by userspace and registered w/ io_uring where Rx data is written to. * A net device and one specific Rx queue in it that will be configured for ZC Rx. * A pair of shared ringbuffers w/ userspace, dubbed registered buf (rbuf) rings. Each entry contains a pool region id and an offset + len within that region. The kernel writes entries into the completion ring to tell userspace where RX data is relative to the start of a region. Userspace writes entries into the refill ring to tell the kernel when it is done with the data. For now, each io_uring instance has a single ifq, and each ifq has a single pool region associated with one Rx queue. Add a new opcode to io_uring_register that sets up an ifq. Size and offsets of shared ringbuffers are returned to userspace for it to mmap. The implementation will be added in a later patch. Co-developed-by: Pavel Begunkov Signed-off-by: Pavel Begunkov Signed-off-by: David Wei --- include/linux/io_uring_types.h | 6 +++ include/uapi/linux/io_uring.h | 50 +++++++++++++++++++++ io_uring/Makefile | 3 +- io_uring/io_uring.c | 8 ++++ io_uring/kbuf.c | 27 ++++++++++++ io_uring/kbuf.h | 5 +++ io_uring/zc_rx.c | 79 ++++++++++++++++++++++++++++++++++ io_uring/zc_rx.h | 34 +++++++++++++++ 8 files changed, 211 insertions(+), 1 deletion(-) create mode 100644 io_uring/zc_rx.c create mode 100644 io_uring/zc_rx.h diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h index 13d19b9be9f4..4f902e17b9c7 100644 --- a/include/linux/io_uring_types.h +++ b/include/linux/io_uring_types.h @@ -151,6 +151,10 @@ struct io_rings { struct io_uring_cqe cqes[] ____cacheline_aligned_in_smp; }; +struct io_rbuf_ring { + struct io_uring rq, cq; +}; + struct io_restriction { DECLARE_BITMAP(register_op, IORING_REGISTER_LAST); DECLARE_BITMAP(sqe_op, IORING_OP_LAST); @@ -336,6 +340,8 @@ struct io_ring_ctx { struct io_rsrc_data *file_data; struct io_rsrc_data *buf_data; + struct io_zc_rx_ifq *ifq; + /* protected by ->uring_lock */ struct list_head rsrc_ref_list; struct io_alloc_cache rsrc_node_cache; diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h index 8e61f8b7c2ce..84c82a789543 100644 --- a/include/uapi/linux/io_uring.h +++ b/include/uapi/linux/io_uring.h @@ -546,6 +546,9 @@ enum { /* register a range of fixed file slots for automatic slot allocation */ IORING_REGISTER_FILE_ALLOC_RANGE = 25, + /* register a network interface queue for zerocopy */ + IORING_REGISTER_ZC_RX_IFQ = 26, + /* this goes last */ IORING_REGISTER_LAST, @@ -736,6 +739,53 @@ enum { SOCKET_URING_OP_SIOCOUTQ, }; +struct io_uring_rbuf_rqe { + __u32 off; + __u32 len; + __u16 region; + __u8 __pad[6]; +}; + +struct io_uring_rbuf_cqe { + __u32 off; + __u32 len; + __u16 region; + __u8 flags; + __u8 __pad[3]; +}; + +struct io_rbuf_rqring_offsets { + __u32 head; + __u32 tail; + __u32 rqes; + __u8 __pad[4]; +}; + +struct io_rbuf_cqring_offsets { + __u32 head; + __u32 
tail; + __u32 cqes; + __u8 __pad[4]; +}; + +/* + * Argument for IORING_REGISTER_ZC_RX_IFQ + */ +struct io_uring_zc_rx_ifq_reg { + __u32 if_idx; + /* hw rx descriptor ring id */ + __u32 if_rxq_id; + __u32 region_id; + __u32 rq_entries; + __u32 cq_entries; + __u32 flags; + __u16 cpu; + + __u32 mmap_sz; + struct io_rbuf_rqring_offsets rq_off; + struct io_rbuf_cqring_offsets cq_off; +}; + #ifdef __cplusplus } #endif diff --git a/io_uring/Makefile b/io_uring/Makefile index 8cc8e5387a75..7818b015a1f2 100644 --- a/io_uring/Makefile +++ b/io_uring/Makefile @@ -7,5 +7,6 @@ obj-$(CONFIG_IO_URING) += io_uring.o xattr.o nop.o fs.o splice.o \ openclose.o uring_cmd.o epoll.o \ statx.o net.o msg_ring.o timeout.o \ sqpoll.o fdinfo.o tctx.o poll.o \ - cancel.o kbuf.o rsrc.o rw.o opdef.o notif.o + cancel.o kbuf.o rsrc.o rw.o opdef.o \ + notif.o zc_rx.o obj-$(CONFIG_IO_WQ) += io-wq.o diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c index 783ed0fff71b..ae7f37aabe78 100644 --- a/io_uring/io_uring.c +++ b/io_uring/io_uring.c @@ -92,6 +92,7 @@ #include "cancel.h" #include "net.h" #include "notif.h" +#include "zc_rx.h" #include "timeout.h" #include "poll.h" @@ -3160,6 +3161,7 @@ static __cold void io_ring_ctx_wait_and_kill(struct io_ring_ctx *ctx) percpu_ref_kill(&ctx->refs); xa_for_each(&ctx->personalities, index, creds) io_unregister_personality(ctx, index); + io_unregister_zc_rx_ifq(ctx); if (ctx->rings) io_poll_remove_all(ctx, NULL, true); mutex_unlock(&ctx->uring_lock); @@ -4536,6 +4538,12 @@ static int __io_uring_register(struct io_ring_ctx *ctx, unsigned opcode, break; ret = io_register_file_alloc_range(ctx, arg); break; + case IORING_REGISTER_ZC_RX_IFQ: + ret = -EINVAL; + if (!arg || nr_args != 1) + break; + ret = io_register_zc_rx_ifq(ctx, arg); + break; default: ret = -EINVAL; break; diff --git a/io_uring/kbuf.c b/io_uring/kbuf.c index 556f4df25b0f..30c3e5b20ab3 100644 --- a/io_uring/kbuf.c +++ b/io_uring/kbuf.c @@ -630,3 +630,30 @@ void *io_pbuf_get_address(struct io_ring_ctx *ctx, unsigned long bgid) return bl->buf_ring; } + +int io_allocate_rbuf_ring(struct io_zc_rx_ifq *ifq, + struct io_uring_zc_rx_ifq_reg *reg) +{ + gfp_t gfp = GFP_KERNEL_ACCOUNT | __GFP_ZERO | __GFP_NOWARN | __GFP_COMP; + size_t off, size, rq_size, cq_size; + void *ptr; + + off = sizeof(struct io_rbuf_ring); + rq_size = reg->rq_entries * sizeof(struct io_uring_rbuf_rqe); + cq_size = reg->cq_entries * sizeof(struct io_uring_rbuf_cqe); + size = off + rq_size + cq_size; + ptr = (void *) __get_free_pages(gfp, get_order(size)); + if (!ptr) + return -ENOMEM; + ifq->ring = (struct io_rbuf_ring *)ptr; + ifq->rqes = (struct io_uring_rbuf_rqe *)((char *)ptr + off); + ifq->cqes = (struct io_uring_rbuf_cqe *)((char *)ifq->rqes + rq_size); + + return 0; +} + +void io_free_rbuf_ring(struct io_zc_rx_ifq *ifq) +{ + if (ifq->ring) + folio_put(virt_to_folio(ifq->ring)); +} diff --git a/io_uring/kbuf.h b/io_uring/kbuf.h index d14345ef61fc..6c8afda93646 100644 --- a/io_uring/kbuf.h +++ b/io_uring/kbuf.h @@ -4,6 +4,8 @@ #include +#include "zc_rx.h" + struct io_buffer_list { /* * If ->buf_nr_pages is set, then buf_pages/buf_ring are used. 
If not, @@ -57,6 +59,9 @@ void io_kbuf_recycle_legacy(struct io_kiocb *req, unsigned issue_flags); void *io_pbuf_get_address(struct io_ring_ctx *ctx, unsigned long bgid); +int io_allocate_rbuf_ring(struct io_zc_rx_ifq *ifq, struct io_uring_zc_rx_ifq_reg *reg); +void io_free_rbuf_ring(struct io_zc_rx_ifq *ifq); + static inline void io_kbuf_recycle_ring(struct io_kiocb *req) { /* diff --git a/io_uring/zc_rx.c b/io_uring/zc_rx.c new file mode 100644 index 000000000000..45dab29fe0ae --- /dev/null +++ b/io_uring/zc_rx.c @@ -0,0 +1,79 @@ +// SPDX-License-Identifier: GPL-2.0 +#if defined(CONFIG_NET) +#include +#include +#include +#include + +#include + +#include "io_uring.h" +#include "kbuf.h" +#include "zc_rx.h" + +static struct io_zc_rx_ifq *io_zc_rx_ifq_alloc(struct io_ring_ctx *ctx) +{ + struct io_zc_rx_ifq *ifq; + + ifq = kzalloc(sizeof(*ifq), GFP_KERNEL); + if (!ifq) + return NULL; + + ifq->ctx = ctx; + + return ifq; +} + +static void io_zc_rx_ifq_free(struct io_zc_rx_ifq *ifq) +{ + io_free_rbuf_ring(ifq); + kfree(ifq); +} + +int io_register_zc_rx_ifq(struct io_ring_ctx *ctx, + struct io_uring_zc_rx_ifq_reg __user *arg) +{ + struct io_uring_zc_rx_ifq_reg reg; + struct io_zc_rx_ifq *ifq; + int ret; + + if (copy_from_user(®, arg, sizeof(reg))) + return -EFAULT; + if (ctx->ifq) + return -EBUSY; + + ifq = io_zc_rx_ifq_alloc(ctx); + if (!ifq) + return -ENOMEM; + + /* TODO: initialise network interface */ + + ret = io_allocate_rbuf_ring(ifq, ®); + if (ret) + goto err; + + /* TODO: map zc region and initialise zc pool */ + + ifq->rq_entries = reg.rq_entries; + ifq->cq_entries = reg.cq_entries; + ifq->if_rxq_id = reg.if_rxq_id; + ctx->ifq = ifq; + + return 0; +err: + io_zc_rx_ifq_free(ifq); + return ret; +} + +int io_unregister_zc_rx_ifq(struct io_ring_ctx *ctx) +{ + struct io_zc_rx_ifq *ifq = ctx->ifq; + + if (!ifq) + return -EINVAL; + + ctx->ifq = NULL; + io_zc_rx_ifq_free(ifq); + return 0; +} +#endif diff --git a/io_uring/zc_rx.h b/io_uring/zc_rx.h new file mode 100644 index 000000000000..5f6d80c1c2b8 --- /dev/null +++ b/io_uring/zc_rx.h @@ -0,0 +1,34 @@ +// SPDX-License-Identifier: GPL-2.0 +#ifndef IOU_ZC_RX_H +#define IOU_ZC_RX_H + +struct io_zc_rx_ifq { + struct io_ring_ctx *ctx; + struct net_device *dev; + struct io_rbuf_ring *ring; + struct io_uring_rbuf_rqe *rqes; + struct io_uring_rbuf_cqe *cqes; + u32 rq_entries, cq_entries; + void *pool; + + /* hw rx descriptor ring id */ + u32 if_rxq_id; +}; + +#if defined(CONFIG_NET) +int io_register_zc_rx_ifq(struct io_ring_ctx *ctx, + struct io_uring_zc_rx_ifq_reg __user *arg); +int io_unregister_zc_rx_ifq(struct io_ring_ctx *ctx); +#else +static inline int io_register_zc_rx_ifq(struct io_ring_ctx *ctx, + struct io_uring_zc_rx_ifq_reg __user *arg) +{ + return -EOPNOTSUPP; +} +static inline int io_unregister_zc_rx_ifq(struct io_ring_ctx *ctx) +{ + return -EOPNOTSUPP; +} +#endif + +#endif From patchwork Tue Nov 7 21:40:27 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: David Wei X-Patchwork-Id: 13449345 X-Patchwork-Delegate: kuba@kernel.org Received: from lindbergh.monkeyblade.net (lindbergh.monkeyblade.net [23.128.96.19]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id DF74F450D1 for ; Tue, 7 Nov 2023 21:40:59 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=davidwei-uk.20230601.gappssmtp.com 
header.i=@davidwei-uk.20230601.gappssmtp.com header.b="k9vEzKgG" Received: from mail-pg1-x52b.google.com (mail-pg1-x52b.google.com [IPv6:2607:f8b0:4864:20::52b]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id B8F1310E4 for ; Tue, 7 Nov 2023 13:40:58 -0800 (PST) Received: by mail-pg1-x52b.google.com with SMTP id 41be03b00d2f7-5bd0631f630so107747a12.0 for ; Tue, 07 Nov 2023 13:40:58 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=davidwei-uk.20230601.gappssmtp.com; s=20230601; t=1699393258; x=1699998058; darn=vger.kernel.org; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:from:to:cc:subject:date :message-id:reply-to; bh=520r0fT9CXFqnN6K8LnXvZLFpQESKDxVi7WN1Xfbe2s=; b=k9vEzKgGlRKmhi2MuyolRjg5HtyNkSuNUVCRr0Jvexarp+wKHrVnPNqne56FWpyAEI igW6nHudepHRNs7Wp9vDHYF/DhEzpmTPgCn3qhlIA01xd/99xMqgdmC6RvS8jJOQWsKY 8qp178i/m62iTjNOvwFoIxwrLJRKYorN1bIHt3TIWpE7R8Y9OOdodowTlgjsZ6P3bHlO mePztIytxhFUt3upa+u/4Fj/BlGe2qkrk2gjU5eDlQeQZXBUj/UKAff9xNx2dgltRq3k WoGe2oJsh+VjP6vClHszNRjWlo6kmld/vCD62JScEjV69ofpno51OZM8ikgAoIrS/H62 hOdg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1699393258; x=1699998058; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=520r0fT9CXFqnN6K8LnXvZLFpQESKDxVi7WN1Xfbe2s=; b=bsv325OD0B7KNpO1qJURa0ii0bnuTKGoDDLNZA15fDF3FNv+8cVBLe1afTwkefNnl+ zRJqty67jsdaXECppO0ksMDeYx5UsVSSFjjejUXE8PnzjpWfYU/dhSp1+vx5BisTD0WX 2ylyRMfnF2J/ayH1K/ox7tjGD84aaPQm/oKpFdTI28q1YzimhBpVxqktJl6kfLUy3pgl FPBKCdAW8qgx63O3lk+G3AY2fmY32ezYhG8V+im7Jw+lMEhhKBI+Km14G7NGgAXIfjwJ hBvdRwpBuE1OtF2rdHNtc2sz/gwC3aD0WApo2IO8RDaSX7nl3mQf2J2PGyoEaNnIJrSn ix4A== X-Gm-Message-State: AOJu0YwsBnj40EXVqt3E7anO686qfGxPooYFuQ7Vxx/bupsy9U3a6kqL JPDeqc9Qxg2bOOQqVz5NTH1c3Q== X-Google-Smtp-Source: AGHT+IFF+B6gbHUBADrXjroISxhXSQTrTB1jUxSanL6FHfskkynBovVZjD0Q3/Xkekd9MJkNt6gUMA== X-Received: by 2002:a17:90a:fb42:b0:280:6cde:ecc2 with SMTP id iq2-20020a17090afb4200b002806cdeecc2mr5299320pjb.11.1699393258176; Tue, 07 Nov 2023 13:40:58 -0800 (PST) Received: from localhost (fwdproxy-prn-010.fbsv.net. [2a03:2880:ff:a::face:b00c]) by smtp.gmail.com with ESMTPSA id ft20-20020a17090b0f9400b002800d17a21csm268331pjb.15.2023.11.07.13.40.57 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Tue, 07 Nov 2023 13:40:57 -0800 (PST) From: David Wei To: io-uring@vger.kernel.org, netdev@vger.kernel.org Cc: Jens Axboe , Pavel Begunkov , Jakub Kicinski , Paolo Abeni , "David S. Miller" , Eric Dumazet , Jesper Dangaard Brouer , David Ahern , Mina Almasry , Willem de Bruijn , Dragos Tatulea Subject: [PATCH 02/20] io_uring: add mmap support for shared ifq ringbuffers Date: Tue, 7 Nov 2023 13:40:27 -0800 Message-Id: <20231107214045.2172393-3-dw@davidwei.uk> X-Mailer: git-send-email 2.39.3 In-Reply-To: <20231107214045.2172393-1-dw@davidwei.uk> References: <20231107214045.2172393-1-dw@davidwei.uk> Precedence: bulk X-Mailing-List: netdev@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 This patch adds mmap support for ifq rbuf rings. There are two rings and a struct io_rbuf_ring that contains the head and tail ptrs into each ring. Just like the io_uring SQ/CQ rings, userspace issues a single mmap call using the io_uring fd w/ magic offset IORING_OFF_RBUF_RING. 
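For illustration, a rough userspace sketch of driving this (not part of the series: setup_zc_ifq() and struct zc_rings are made-up names, the entry counts are arbitrary, and error handling is minimal):

#include <string.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <linux/io_uring.h>

struct zc_rings {
	unsigned *rq_head, *rq_tail;
	unsigned *cq_head, *cq_tail;
	struct io_uring_rbuf_rqe *rqes;
	struct io_uring_rbuf_cqe *cqes;
};

static int setup_zc_ifq(int ring_fd, unsigned ifindex, unsigned rxq,
			struct zc_rings *r)
{
	struct io_uring_zc_rx_ifq_reg reg;
	char *ptr;

	memset(&reg, 0, sizeof(reg));
	reg.if_idx = ifindex;		/* e.g. from if_nametoindex() */
	reg.if_rxq_id = rxq;		/* hw rx queue to use for ZC */
	reg.region_id = 0;		/* pool region id */
	reg.rq_entries = 256;		/* arbitrary sizes for the sketch */
	reg.cq_entries = 256;

	/* the kernel fills in mmap_sz, rq_off and cq_off on success */
	if (syscall(__NR_io_uring_register, ring_fd,
		    IORING_REGISTER_ZC_RX_IFQ, &reg, 1))
		return -1;

	/* one mapping covers struct io_rbuf_ring plus both entry arrays */
	ptr = mmap(NULL, reg.mmap_sz, PROT_READ | PROT_WRITE,
		   MAP_SHARED | MAP_POPULATE, ring_fd, IORING_OFF_RBUF_RING);
	if (ptr == MAP_FAILED)
		return -1;

	/* locate head/tail pointers and entry arrays via returned offsets */
	r->rq_head = (unsigned *)(ptr + reg.rq_off.head);
	r->rq_tail = (unsigned *)(ptr + reg.rq_off.tail);
	r->cq_head = (unsigned *)(ptr + reg.cq_off.head);
	r->cq_tail = (unsigned *)(ptr + reg.cq_off.tail);
	r->rqes = (struct io_uring_rbuf_rqe *)(ptr + reg.rq_off.rqes);
	r->cqes = (struct io_uring_rbuf_cqe *)(ptr + reg.cq_off.cqes);
	return 0;
}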
An opaque ptr is returned to userspace, which is then expected to use the offsets returned in the registration struct to get access to the head/tail and rings. Co-developed-by: Pavel Begunkov Signed-off-by: Pavel Begunkov Signed-off-by: David Wei --- include/uapi/linux/io_uring.h | 2 ++ io_uring/io_uring.c | 5 +++++ io_uring/zc_rx.c | 17 +++++++++++++++++ 3 files changed, 24 insertions(+) diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h index 84c82a789543..ae5608bcd785 100644 --- a/include/uapi/linux/io_uring.h +++ b/include/uapi/linux/io_uring.h @@ -416,6 +416,8 @@ enum { #define IORING_OFF_PBUF_RING 0x80000000ULL #define IORING_OFF_PBUF_SHIFT 16 #define IORING_OFF_MMAP_MASK 0xf8000000ULL +#define IORING_OFF_RBUF_RING 0x20000000ULL +#define IORING_OFF_RBUF_SHIFT 16 /* * Filled with the offset for mmap(2) diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c index ae7f37aabe78..f06e9ed397da 100644 --- a/io_uring/io_uring.c +++ b/io_uring/io_uring.c @@ -3438,6 +3438,11 @@ static void *io_uring_validate_mmap_request(struct file *file, return ERR_PTR(-EINVAL); break; } + case IORING_OFF_RBUF_RING: + if (!ctx->ifq) + return ERR_PTR(-EINVAL); + ptr = ctx->ifq->ring; + break; default: return ERR_PTR(-EINVAL); } diff --git a/io_uring/zc_rx.c b/io_uring/zc_rx.c index 45dab29fe0ae..a3a54845c712 100644 --- a/io_uring/zc_rx.c +++ b/io_uring/zc_rx.c @@ -35,6 +35,7 @@ int io_register_zc_rx_ifq(struct io_ring_ctx *ctx, { struct io_uring_zc_rx_ifq_reg reg; struct io_zc_rx_ifq *ifq; + size_t ring_sz, rqes_sz, cqes_sz; int ret; if (copy_from_user(®, arg, sizeof(reg))) @@ -59,6 +60,22 @@ int io_register_zc_rx_ifq(struct io_ring_ctx *ctx, ifq->if_rxq_id = reg.if_rxq_id; ctx->ifq = ifq; + ring_sz = sizeof(struct io_rbuf_ring); + rqes_sz = sizeof(struct io_uring_rbuf_rqe) * ifq->rq_entries; + cqes_sz = sizeof(struct io_uring_rbuf_cqe) * ifq->cq_entries; + reg.mmap_sz = ring_sz + rqes_sz + cqes_sz; + reg.rq_off.rqes = ring_sz; + reg.cq_off.cqes = ring_sz + rqes_sz; + reg.rq_off.head = offsetof(struct io_rbuf_ring, rq.head); + reg.rq_off.tail = offsetof(struct io_rbuf_ring, rq.tail); + reg.cq_off.head = offsetof(struct io_rbuf_ring, cq.head); + reg.cq_off.tail = offsetof(struct io_rbuf_ring, cq.tail); + + if (copy_to_user(arg, ®, sizeof(reg))) { + ret = -EFAULT; + goto err; + } + return 0; err: io_zc_rx_ifq_free(ifq); From patchwork Tue Nov 7 21:40:28 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: David Wei X-Patchwork-Id: 13449346 X-Patchwork-Delegate: kuba@kernel.org Received: from lindbergh.monkeyblade.net (lindbergh.monkeyblade.net [23.128.96.19]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 8CEA636B14 for ; Tue, 7 Nov 2023 21:41:00 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=davidwei-uk.20230601.gappssmtp.com header.i=@davidwei-uk.20230601.gappssmtp.com header.b="wcA9Fl/d" Received: from mail-pl1-x62e.google.com (mail-pl1-x62e.google.com [IPv6:2607:f8b0:4864:20::62e]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id A543710E7 for ; Tue, 7 Nov 2023 13:40:59 -0800 (PST) Received: by mail-pl1-x62e.google.com with SMTP id d9443c01a7336-1cc921a4632so51701585ad.1 for ; Tue, 07 Nov 2023 13:40:59 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=davidwei-uk.20230601.gappssmtp.com; s=20230601; t=1699393259; x=1699998059; 
darn=vger.kernel.org; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:from:to:cc:subject:date :message-id:reply-to; bh=gkhKqBwYYqQgvgkz0J+eF0ekZRGveOzIaQIF+mtTnrA=; b=wcA9Fl/dDyAHZQB/CthzMWjAkAj3eG4BmlXKhXDJKRsaavjUo70ITrEmEzjqyRAzmP akumYflMxdbaxt1gtxxb6U0MnRduHSYgEoEIP5wn5K4ARFtw+i4VTeRpRhgO0XRarMUt lRKBwVI0MN7WI8a2nDPWqRRdSIyXvIQZUoswI0VUh9Y6k62hshQLgH+SA2i1CXrojU8S evIRom0Pm39P2x0QrVWkKUWFfxjMRZHY3cFN8azsWHNoLJntyurhTYF2QTd40jDZH87h xhIXwmm/PpVgJlYmmjrHphk24OKArppidvc823Y/jvjQf7J9h3f3OprmnQUemG2eUzO0 3q6w== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1699393259; x=1699998059; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=gkhKqBwYYqQgvgkz0J+eF0ekZRGveOzIaQIF+mtTnrA=; b=jL97euaWaYThyU9li3gkT8TccGmp7rOso7iZ2UE7DB4jKVCBz5/nj1AoB/6IPpLoBh +VVYUyir89EhvrKz9U8P+zZwM6xw+VyD7OEEixr3hKmfZ1B5u/sFigJ9hkRiyhTu+mw3 6rExeyS5IApAMW1RnSOcCyjll099cgtMrW8ftgnbw76zOloQJV1bKIjlpTH5fon5tDWh r0LjEeNX6CfS5bSS6+OmBcdynJSwCNnwsi61cyinQpl8k0YxPLfyuY+z9PcuIFLHNK1l TsaeWRXk2eNO58IdkxQSHMo2ep3ZUk2+ko/AnFnYrsriLn0S2PCZWhqNabw4GI6yXnhu BeBw== X-Gm-Message-State: AOJu0YxrnO+ql3q/WQJHPq15OVFDSmmxpCH0R7jAUVr3F3myk2uKoG4j yYbUfxA5X01e6yYHrSnNMyPeZw== X-Google-Smtp-Source: AGHT+IFCXiGsm4+uoYT/4g2dHZxdtmbkIAaHx6GCEsyE1kAbJJmkijduufnQHm62qVhcmiS8LbaINA== X-Received: by 2002:a17:903:2689:b0:1cc:70ed:1d68 with SMTP id jf9-20020a170903268900b001cc70ed1d68mr203111plb.67.1699393259157; Tue, 07 Nov 2023 13:40:59 -0800 (PST) Received: from localhost (fwdproxy-prn-021.fbsv.net. [2a03:2880:ff:15::face:b00c]) by smtp.gmail.com with ESMTPSA id n12-20020a1709026a8c00b001a80ad9c599sm257701plk.294.2023.11.07.13.40.58 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Tue, 07 Nov 2023 13:40:58 -0800 (PST) From: David Wei To: io-uring@vger.kernel.org, netdev@vger.kernel.org Cc: Jens Axboe , Pavel Begunkov , Jakub Kicinski , Paolo Abeni , "David S. Miller" , Eric Dumazet , Jesper Dangaard Brouer , David Ahern , Mina Almasry , Willem de Bruijn , Dragos Tatulea Subject: [PATCH 03/20] netdev: add XDP_SETUP_ZC_RX command Date: Tue, 7 Nov 2023 13:40:28 -0800 Message-Id: <20231107214045.2172393-4-dw@davidwei.uk> X-Mailer: git-send-email 2.39.3 In-Reply-To: <20231107214045.2172393-1-dw@davidwei.uk> References: <20231107214045.2172393-1-dw@davidwei.uk> Precedence: bulk X-Mailing-List: netdev@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 X-Patchwork-Delegate: kuba@kernel.org This patch adds a new XDP_SETUP_ZC_RX command that will be used in a later patch to enable or disable ZC RX for a specific RX queue. Co-developed-by: Pavel Begunkov Signed-off-by: Pavel Begunkov Signed-off-by: David Wei --- We are open to suggestions on a better way of doing this, rather than using a bpf_netdev_command. 
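Should the command stay, a hypothetical driver-side handler could look roughly like the snippet below. The mydrv_enable_zc_rx()/mydrv_disable_zc_rx() helpers are placeholders for whatever queue reconfiguration a real driver needs; a NULL ifq is how a later patch in this series asks for the queue to be restored to normal operation.

#include <linux/netdevice.h>

static int mydrv_enable_zc_rx(struct net_device *dev,
			      struct io_zc_rx_ifq *ifq, u16 queue_id);
static int mydrv_disable_zc_rx(struct net_device *dev, u16 queue_id);

static int mydrv_bpf(struct net_device *dev, struct netdev_bpf *bpf)
{
	switch (bpf->command) {
	case XDP_SETUP_ZC_RX:
		/* NULL ifq == tear ZC RX down for this queue */
		if (!bpf->zc_rx.ifq)
			return mydrv_disable_zc_rx(dev, bpf->zc_rx.queue_id);
		return mydrv_enable_zc_rx(dev, bpf->zc_rx.ifq,
					  bpf->zc_rx.queue_id);
	default:
		return -EINVAL;
	}
}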
include/linux/netdevice.h | 6 ++++++ 1 file changed, 6 insertions(+) diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index 11d704bfec9b..f9c82c89a96b 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -984,6 +984,7 @@ enum bpf_netdev_command { BPF_OFFLOAD_MAP_ALLOC, BPF_OFFLOAD_MAP_FREE, XDP_SETUP_XSK_POOL, + XDP_SETUP_ZC_RX, }; struct bpf_prog_offload_ops; @@ -1022,6 +1023,11 @@ struct netdev_bpf { struct xsk_buff_pool *pool; u16 queue_id; } xsk; + /* XDP_SETUP_ZC_RX */ + struct { + struct io_zc_rx_ifq *ifq; + u16 queue_id; + } zc_rx; }; }; From patchwork Tue Nov 7 21:40:29 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: David Wei X-Patchwork-Id: 13449347 X-Patchwork-Delegate: kuba@kernel.org Received: from lindbergh.monkeyblade.net (lindbergh.monkeyblade.net [23.128.96.19]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 50EDD450F4 for ; Tue, 7 Nov 2023 21:41:01 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=davidwei-uk.20230601.gappssmtp.com header.i=@davidwei-uk.20230601.gappssmtp.com header.b="w1Ymep9N" Received: from mail-oo1-xc2f.google.com (mail-oo1-xc2f.google.com [IPv6:2607:f8b0:4864:20::c2f]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id C566A10E5 for ; Tue, 7 Nov 2023 13:41:00 -0800 (PST) Received: by mail-oo1-xc2f.google.com with SMTP id 006d021491bc7-581e5a9413bso3409659eaf.1 for ; Tue, 07 Nov 2023 13:41:00 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=davidwei-uk.20230601.gappssmtp.com; s=20230601; t=1699393260; x=1699998060; darn=vger.kernel.org; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:from:to:cc:subject:date :message-id:reply-to; bh=1JQFVyD/BclnABaNMr0ihb/yu9Na0XwqUKf71Hw/JqE=; b=w1Ymep9NNgfB4ODycMNp//P+6E1bB3ih2At9wJKsln4ltVh6coE6wN//o76LjdMn/q 8hQ8MLGtUtA4R1ITttsoInVJFvRk/3PKjSA1Vi4opRQjRdeeO5RdRgAMB/Z62YMsmu/7 QFMIMfg+kB84wpnubMp3xCBCRZaBStabibeIIP3VwdJHKJIM+nrV5dE2sp/yY2FCU9cT +hby7KwCIPsYNaXrGTdai12MLYmmXfNSfeNJ3i0EnJH9WQilAiohoQCc73RsmQOmmxAe f73FbEl0iLbCHF6UiVDTpcBIH6K+tf3O8FPp/AV8kHiGviMayZoidfVFJWNHpRTXMROV SwxA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1699393260; x=1699998060; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=1JQFVyD/BclnABaNMr0ihb/yu9Na0XwqUKf71Hw/JqE=; b=f/irKn3bR9QUqXke1rCaLWcQ7tnn5tqOJGKWh1db19Vn+aN84WMBlH/UaSNusWGNK0 Sic/JVAJmApxN4lpyvluS1xFREuJIvcmScJ0hOVObkYR5jrX/53u4/487UOklmOb5ly1 6zSz3GNb2HcalyBr/1kGQ61K3Q/bj7PshucYPEcUSMWJQzUqMkoIrlt5Y+lC3urYNvBC UvQMcZPoewtvAPjWN2eDlPcdngDkS2J8jeJ7oRDT3daKStOvGGPxfIIJoeRuqmNd/T2z DHEECbmrjo9iRHyscnZQ5gvtpmQgNXOjfKOXXB2QjoGMaqjuYw3m8hFqERsFwE1QSiJQ arjQ== X-Gm-Message-State: AOJu0YzME0CR23IWukMGPxzLt+gyen7LzKyGieR4azmvZ3CxPWouOkPx aViUBCfkFEFjvXZcq1Di0zfmig== X-Google-Smtp-Source: AGHT+IFWdyvIluXwiwCF4cLhWlh9UcUzd1F1iP7kvHIF5igbTDAVoRK1o58ujXZe6zX4hW3DRWIEGw== X-Received: by 2002:a05:6358:1904:b0:168:e0db:ce43 with SMTP id w4-20020a056358190400b00168e0dbce43mr33357692rwm.31.1699393260074; Tue, 07 Nov 2023 13:41:00 -0800 (PST) Received: from localhost (fwdproxy-prn-012.fbsv.net. 
[2a03:2880:ff:c::face:b00c]) by smtp.gmail.com with ESMTPSA id j26-20020a63595a000000b0058ac101ad83sm1822698pgm.33.2023.11.07.13.40.59 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Tue, 07 Nov 2023 13:40:59 -0800 (PST) From: David Wei To: io-uring@vger.kernel.org, netdev@vger.kernel.org Cc: Jens Axboe , Pavel Begunkov , Jakub Kicinski , Paolo Abeni , "David S. Miller" , Eric Dumazet , Jesper Dangaard Brouer , David Ahern , Mina Almasry , Willem de Bruijn , Dragos Tatulea Subject: [PATCH 04/20] io_uring: setup ZC for an Rx queue when registering an ifq Date: Tue, 7 Nov 2023 13:40:29 -0800 Message-Id: <20231107214045.2172393-5-dw@davidwei.uk> X-Mailer: git-send-email 2.39.3 In-Reply-To: <20231107214045.2172393-1-dw@davidwei.uk> References: <20231107214045.2172393-1-dw@davidwei.uk> Precedence: bulk X-Mailing-List: netdev@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 This patch sets up ZC for an Rx queue in a net device when an ifq is registered with io_uring. The Rx queue is specified in the registration struct. The XDP command added in the previous patch is used to enable or disable ZC Rx. For now since there is only one ifq, its destruction is implicit during io_uring cleanup. Co-developed-by: Pavel Begunkov Signed-off-by: Pavel Begunkov Signed-off-by: David Wei --- io_uring/zc_rx.c | 48 ++++++++++++++++++++++++++++++++++++++++++++++-- 1 file changed, 46 insertions(+), 2 deletions(-) diff --git a/io_uring/zc_rx.c b/io_uring/zc_rx.c index a3a54845c712..85180c3044d8 100644 --- a/io_uring/zc_rx.c +++ b/io_uring/zc_rx.c @@ -4,6 +4,7 @@ #include #include #include +#include #include @@ -11,6 +12,35 @@ #include "kbuf.h" #include "zc_rx.h" +typedef int (*bpf_op_t)(struct net_device *dev, struct netdev_bpf *bpf); + +static int __io_queue_mgmt(struct net_device *dev, struct io_zc_rx_ifq *ifq, + u16 queue_id) +{ + struct netdev_bpf cmd; + bpf_op_t ndo_bpf; + + ndo_bpf = dev->netdev_ops->ndo_bpf; + if (!ndo_bpf) + return -EINVAL; + + cmd.command = XDP_SETUP_ZC_RX; + cmd.zc_rx.ifq = ifq; + cmd.zc_rx.queue_id = queue_id; + + return ndo_bpf(dev, &cmd); +} + +static int io_open_zc_rxq(struct io_zc_rx_ifq *ifq) +{ + return __io_queue_mgmt(ifq->dev, ifq, ifq->if_rxq_id); +} + +static int io_close_zc_rxq(struct io_zc_rx_ifq *ifq) +{ + return __io_queue_mgmt(ifq->dev, NULL, ifq->if_rxq_id); +} + static struct io_zc_rx_ifq *io_zc_rx_ifq_alloc(struct io_ring_ctx *ctx) { struct io_zc_rx_ifq *ifq; @@ -20,12 +50,17 @@ static struct io_zc_rx_ifq *io_zc_rx_ifq_alloc(struct io_ring_ctx *ctx) return NULL; ifq->ctx = ctx; + ifq->if_rxq_id = -1; return ifq; } static void io_zc_rx_ifq_free(struct io_zc_rx_ifq *ifq) { + if (ifq->if_rxq_id != -1) + io_close_zc_rxq(ifq); + if (ifq->dev) + dev_put(ifq->dev); io_free_rbuf_ring(ifq); kfree(ifq); } @@ -42,17 +77,22 @@ int io_register_zc_rx_ifq(struct io_ring_ctx *ctx, return -EFAULT; if (ctx->ifq) return -EBUSY; + if (reg.if_rxq_id == -1) + return -EINVAL; ifq = io_zc_rx_ifq_alloc(ctx); if (!ifq) return -ENOMEM; - /* TODO: initialise network interface */ - ret = io_allocate_rbuf_ring(ifq, ®); if (ret) goto err; + ret = -ENODEV; + ifq->dev = dev_get_by_index(current->nsproxy->net_ns, reg.if_idx); + if (!ifq->dev) + goto err; + /* TODO: map zc region and initialise zc pool */ ifq->rq_entries = reg.rq_entries; @@ -60,6 +100,10 @@ int io_register_zc_rx_ifq(struct io_ring_ctx *ctx, ifq->if_rxq_id = reg.if_rxq_id; ctx->ifq = ifq; + ret = io_open_zc_rxq(ifq); + if (ret) + goto err; + ring_sz = sizeof(struct io_rbuf_ring); rqes_sz = 
sizeof(struct io_uring_rbuf_rqe) * ifq->rq_entries; cqes_sz = sizeof(struct io_uring_rbuf_cqe) * ifq->cq_entries; From patchwork Tue Nov 7 21:40:30 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: David Wei X-Patchwork-Id: 13449348 X-Patchwork-Delegate: kuba@kernel.org Received: from lindbergh.monkeyblade.net (lindbergh.monkeyblade.net [23.128.96.19]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 03A432A8C6 for ; Tue, 7 Nov 2023 21:41:01 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=davidwei-uk.20230601.gappssmtp.com header.i=@davidwei-uk.20230601.gappssmtp.com header.b="ZysJlVCK" Received: from mail-pf1-x433.google.com (mail-pf1-x433.google.com [IPv6:2607:f8b0:4864:20::433]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 9763210D0 for ; Tue, 7 Nov 2023 13:41:01 -0800 (PST) Received: by mail-pf1-x433.google.com with SMTP id d2e1a72fcca58-6b20a48522fso5307440b3a.1 for ; Tue, 07 Nov 2023 13:41:01 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=davidwei-uk.20230601.gappssmtp.com; s=20230601; t=1699393261; x=1699998061; darn=vger.kernel.org; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:from:to:cc:subject:date :message-id:reply-to; bh=mOi3kgM6YMdJ3o8a6c8h2Q5Yb4bopwAacE/KWvRpLyw=; b=ZysJlVCKeevLgVhhfRKI1U8utXwAB3c86wL5dSN7EeAnLXomQwOGfBZGvSxm7fv2W/ 0enD7Q7q2pKoMRt+57vPH6BeeHyPgroEi8bwvvFtJfBA2mB+8OcNe5d7QHuToHTpaGrL KvrlcrhXpW0eQbIjh3sj5fpVn5rYqZeJ99cWKmMmAf1eFe2NCKspsXUU61qp/h48V6vs Znb7WdCoLDKvkjXYdZXaQTCXq2SDLf2nLFxCyBQ6o6RgjySh80LAFutyOqOus/yXMwpH fkZkVdC4NUXmSRMG0UYPyRGH+ExfOZxZ95j6OCXPybcXH51vIoF8c2wNCaUb2I5ePtub R+mQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1699393261; x=1699998061; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=mOi3kgM6YMdJ3o8a6c8h2Q5Yb4bopwAacE/KWvRpLyw=; b=h4MQRub8K97tjQJx6WybrVHmGa6GG8LEM19g8oTZlBBjrAq3EOLHDyEkVZJNaEZOLg 8yGmr+UxX7gE+4iYKmZOn19IoTYviJSQwECvBjd36+Nz/XzRiR3zioKEcMhXHWf5wpat wInLJGIkOmsKfDtGDG1r8Dx6O+2Gw3fssk6b/hz0Uucgp5Xdr4EpKoB/y5fyBuNLTwPx VUfPxVvrpTERCvYw9gEi1F2Pts2H3PbYylIcbXr+09qa6+qbS2lZavgfDm4lSiIlAxQ8 Mjc71XctQcJpVHp2JG3RCuCjEkpkQfbKOMryjWY5Ljg7Q0+XCqfeO5o9t9ya9mLEJ+HR FIDQ== X-Gm-Message-State: AOJu0YwLhqYgfPJ7PhuWDuYBLObD4PTyz3VLq1jteV63ysCBKKTYQEmj MtyduSYnEljJUamtKlkXbvlNMw== X-Google-Smtp-Source: AGHT+IHvA1IbXaADsbrtc9iZC/hG8Lytson8e0HTxc42XE/7jKTGiVwh1DGmhtsIq0IrXRJeZ+c+TQ== X-Received: by 2002:a05:6a00:23d2:b0:6bd:66ce:21d4 with SMTP id g18-20020a056a0023d200b006bd66ce21d4mr273980pfc.23.1699393261026; Tue, 07 Nov 2023 13:41:01 -0800 (PST) Received: from localhost (fwdproxy-prn-016.fbsv.net. [2a03:2880:ff:10::face:b00c]) by smtp.gmail.com with ESMTPSA id fa16-20020a056a002d1000b0068fece2c190sm5734815pfb.70.2023.11.07.13.41.00 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Tue, 07 Nov 2023 13:41:00 -0800 (PST) From: David Wei To: io-uring@vger.kernel.org, netdev@vger.kernel.org Cc: Jens Axboe , Pavel Begunkov , Jakub Kicinski , Paolo Abeni , "David S. 
Miller" , Eric Dumazet , Jesper Dangaard Brouer , David Ahern , Mina Almasry , Willem de Bruijn , Dragos Tatulea Subject: [PATCH 05/20] io_uring/zcrx: implement socket registration Date: Tue, 7 Nov 2023 13:40:30 -0800 Message-Id: <20231107214045.2172393-6-dw@davidwei.uk> X-Mailer: git-send-email 2.39.3 In-Reply-To: <20231107214045.2172393-1-dw@davidwei.uk> References: <20231107214045.2172393-1-dw@davidwei.uk> Precedence: bulk X-Mailing-List: netdev@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 X-Patchwork-Delegate: kuba@kernel.org From: Pavel Begunkov We want userspace to explicitly list all sockets it'll be using with a particular zc ifq, so we can properly configure them, e.g. binding the sockets to the corresponding interface and setting steering rules. We'll also need it to better control ifq lifetime and for termination / unregistration purposes. TODO: remove zc_rx_idx from struct socket, and uapi is likely to change Signed-off-by: Pavel Begunkov Signed-off-by: David Wei --- include/linux/net.h | 2 ++ include/uapi/linux/io_uring.h | 7 ++++ io_uring/io_uring.c | 6 ++++ io_uring/net.c | 19 +++++++++++ io_uring/zc_rx.c | 63 +++++++++++++++++++++++++++++++++++ io_uring/zc_rx.h | 17 ++++++++++ net/socket.c | 1 + 7 files changed, 115 insertions(+) diff --git a/include/linux/net.h b/include/linux/net.h index c9b4a63791a4..867061a91d30 100644 --- a/include/linux/net.h +++ b/include/linux/net.h @@ -126,6 +126,8 @@ struct socket { const struct proto_ops *ops; /* Might change with IPV6_ADDRFORM or MPTCP. */ struct socket_wq wq; + + unsigned zc_rx_idx; }; /* diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h index ae5608bcd785..917d0025cc94 100644 --- a/include/uapi/linux/io_uring.h +++ b/include/uapi/linux/io_uring.h @@ -550,6 +550,7 @@ enum { /* register a network interface queue for zerocopy */ IORING_REGISTER_ZC_RX_IFQ = 26, + IORING_REGISTER_ZC_RX_SOCK = 27, /* this goes last */ IORING_REGISTER_LAST, @@ -788,6 +789,12 @@ struct io_uring_zc_rx_ifq_reg { struct io_rbuf_cqring_offsets cq_off; }; +struct io_uring_zc_rx_sock_reg { + __u32 sockfd; + __u32 zc_rx_ifq_idx; + __u32 __resv[2]; +}; + #ifdef __cplusplus } #endif diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c index f06e9ed397da..e24e2c308a8a 100644 --- a/io_uring/io_uring.c +++ b/io_uring/io_uring.c @@ -4549,6 +4549,12 @@ static int __io_uring_register(struct io_ring_ctx *ctx, unsigned opcode, break; ret = io_register_zc_rx_ifq(ctx, arg); break; + case IORING_REGISTER_ZC_RX_SOCK: + ret = -EINVAL; + if (!arg || nr_args != 1) + break; + ret = io_register_zc_rx_sock(ctx, arg); + break; default: ret = -EINVAL; break; diff --git a/io_uring/net.c b/io_uring/net.c index 7a8e298af81b..fc0b7936971d 100644 --- a/io_uring/net.c +++ b/io_uring/net.c @@ -955,6 +955,25 @@ int io_recv(struct io_kiocb *req, unsigned int issue_flags) return ret; } +static __maybe_unused +struct io_zc_rx_ifq *io_zc_verify_sock(struct io_kiocb *req, + struct socket *sock) +{ + unsigned token = READ_ONCE(sock->zc_rx_idx); + unsigned ifq_idx = token >> IO_ZC_IFQ_IDX_OFFSET; + unsigned sock_idx = token & IO_ZC_IFQ_IDX_MASK; + struct io_zc_rx_ifq *ifq; + + if (ifq_idx) + return NULL; + ifq = req->ctx->ifq; + if (!ifq || sock_idx >= ifq->nr_sockets) + return NULL; + if (ifq->sockets[sock_idx] != req->file) + return NULL; + return ifq; +} + void io_send_zc_cleanup(struct io_kiocb *req) { struct io_sr_msg *zc = io_kiocb_to_cmd(req, struct io_sr_msg); diff --git a/io_uring/zc_rx.c b/io_uring/zc_rx.c index 
85180c3044d8..b5266a67395e 100644 --- a/io_uring/zc_rx.c +++ b/io_uring/zc_rx.c @@ -11,6 +11,7 @@ #include "io_uring.h" #include "kbuf.h" #include "zc_rx.h" +#include "rsrc.h" typedef int (*bpf_op_t)(struct net_device *dev, struct netdev_bpf *bpf); @@ -129,12 +130,74 @@ int io_register_zc_rx_ifq(struct io_ring_ctx *ctx, int io_unregister_zc_rx_ifq(struct io_ring_ctx *ctx) { struct io_zc_rx_ifq *ifq = ctx->ifq; + int i; if (!ifq) return -EINVAL; + for (i = 0; i < ifq->nr_sockets; i++) + fput(ifq->sockets[i]); + ctx->ifq = NULL; io_zc_rx_ifq_free(ifq); return 0; } + +int io_register_zc_rx_sock(struct io_ring_ctx *ctx, + struct io_uring_zc_rx_sock_reg __user *arg) +{ + struct io_uring_zc_rx_sock_reg sr; + struct io_zc_rx_ifq *ifq; + struct socket *sock; + struct file *file; + int ret = -EEXIST; + int idx; + + if (copy_from_user(&sr, arg, sizeof(sr))) + return -EFAULT; + if (sr.__resv[0] || sr.__resv[1]) + return -EINVAL; + if (sr.zc_rx_ifq_idx != 0 || !ctx->ifq) + return -EINVAL; + + ifq = ctx->ifq; + if (ifq->nr_sockets >= ARRAY_SIZE(ifq->sockets)) + return -EINVAL; + + BUILD_BUG_ON(ARRAY_SIZE(ifq->sockets) > IO_ZC_IFQ_IDX_MASK); + + file = fget(sr.sockfd); + if (!file) + return -EBADF; + + if (io_file_need_scm(file)) { + fput(file); + return -EBADF; + } + + sock = sock_from_file(file); + if (unlikely(!sock || !sock->sk)) { + fput(file); + return -ENOTSOCK; + } + + idx = ifq->nr_sockets; + lock_sock(sock->sk); + if (!sock->zc_rx_idx) { + unsigned token; + + token = idx + (sr.zc_rx_ifq_idx << IO_ZC_IFQ_IDX_OFFSET); + WRITE_ONCE(sock->zc_rx_idx, token); + ret = 0; + } + release_sock(sock->sk); + + if (ret) { + fput(file); + return -EINVAL; + } + ifq->sockets[idx] = file; + ifq->nr_sockets++; + return 0; +} #endif diff --git a/io_uring/zc_rx.h b/io_uring/zc_rx.h index 5f6d80c1c2b8..ab25f8dbb433 100644 --- a/io_uring/zc_rx.h +++ b/io_uring/zc_rx.h @@ -2,6 +2,13 @@ #ifndef IOU_ZC_RX_H #define IOU_ZC_RX_H +#include +#include + +#define IO_ZC_MAX_IFQ_SOCKETS 16 +#define IO_ZC_IFQ_IDX_OFFSET 16 +#define IO_ZC_IFQ_IDX_MASK ((1U << IO_ZC_IFQ_IDX_OFFSET) - 1) + struct io_zc_rx_ifq { struct io_ring_ctx *ctx; struct net_device *dev; @@ -11,6 +18,9 @@ struct io_zc_rx_ifq { u32 rq_entries, cq_entries; void *pool; + unsigned nr_sockets; + struct file *sockets[IO_ZC_MAX_IFQ_SOCKETS]; + /* hw rx descriptor ring id */ u32 if_rxq_id; }; @@ -19,6 +29,8 @@ struct io_zc_rx_ifq { int io_register_zc_rx_ifq(struct io_ring_ctx *ctx, struct io_uring_zc_rx_ifq_reg __user *arg); int io_unregister_zc_rx_ifq(struct io_ring_ctx *ctx); +int io_register_zc_rx_sock(struct io_ring_ctx *ctx, + struct io_uring_zc_rx_sock_reg __user *arg); #else static inline int io_register_zc_rx_ifq(struct io_ring_ctx *ctx, struct io_uring_zc_rx_ifq_reg __user *arg) @@ -29,6 +41,11 @@ static inline int io_unregister_zc_rx_ifq(struct io_ring_ctx *ctx) { return -EOPNOTSUPP; } +static inline int io_register_zc_rx_sock(struct io_ring_ctx *ctx, + struct io_uring_zc_rx_sock_reg __user *arg) +{ + return -EOPNOTSUPP; +} #endif #endif diff --git a/net/socket.c b/net/socket.c index c4a6f5532955..419b7bda3f9c 100644 --- a/net/socket.c +++ b/net/socket.c @@ -637,6 +637,7 @@ struct socket *sock_alloc(void) sock = SOCKET_I(inode); + sock->zc_rx_idx = 0; inode->i_ino = get_next_ino(); inode->i_mode = S_IFSOCK | S_IRWXUGO; inode->i_uid = current_fsuid(); From patchwork Tue Nov 7 21:40:31 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: David Wei X-Patchwork-Id: 13449349 X-Patchwork-Delegate: 
kuba@kernel.org Received: from lindbergh.monkeyblade.net (lindbergh.monkeyblade.net [23.128.96.19]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 2886C2A8D9 for ; Tue, 7 Nov 2023 21:41:02 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=davidwei-uk.20230601.gappssmtp.com header.i=@davidwei-uk.20230601.gappssmtp.com header.b="pwmWGQP1" Received: from mail-pf1-x434.google.com (mail-pf1-x434.google.com [IPv6:2607:f8b0:4864:20::434]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 7CB1510E2 for ; Tue, 7 Nov 2023 13:41:02 -0800 (PST) Received: by mail-pf1-x434.google.com with SMTP id d2e1a72fcca58-6bd0e1b1890so4706845b3a.3 for ; Tue, 07 Nov 2023 13:41:02 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=davidwei-uk.20230601.gappssmtp.com; s=20230601; t=1699393262; x=1699998062; darn=vger.kernel.org; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:from:to:cc:subject:date :message-id:reply-to; bh=XkwQzujPT5BBrG7XX0IftjDx25OJwjZS66fFII/4OGU=; b=pwmWGQP1rHaPNKS+SBY2kKuln/d7o5GQWVSvdTzs2PjNh6NmS7CA5WLXwl88n6Ib+e asdvGrkT7HW1TN34gBtR5wCyOIzF6Z52Xq2GAgTNAoIlIksXG90hVlU1JYNS3By3UT0W U08bSWiVv3JklZthJ373sD/Mv2km58LJ5WG14xBbJIGM1qMGPKu3n8y2ZgDNsVWGFBMs zdKWHtuki74Epmx+mAoa1FYeoTOHTF1rAbzoCraoqTimoxyG+4Eb8pr2iX9oR54+nkkZ 5sTp8PNnxBC+xSqsjLtEyQMC+Ih1mEPVIxeULMqJOH2M/xF6ZqjobCprsElRt8Ub/N/Q zIgQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1699393262; x=1699998062; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=XkwQzujPT5BBrG7XX0IftjDx25OJwjZS66fFII/4OGU=; b=V5Csgh1OtEMpd/HSfNBkBSXDSMnJ4YoepsVlatKIVag3eYs5X1QdO4eW7wi2NxyyLQ hltfYjD2KE2YA2N6rM5g5l4FCLXvidnCGlOA9IEkdY+XYpRFFHXQkWNf2YXsp9shxqn3 5JYGPa3baSKNlTRCTf5zjISJzI6M/HWzKxuGXQwH0QNhfcQU3BVyMvc6jbhngowXEsC7 1HpEMS9Tfd0xEwxNumNlAKhOA1BzYmamLH5KHvlZyMPvIyQpQ2TtrE1U13rYfRzMoHov Z1EpKMzDnGR+xxS3LPis7tz6HH7Eahkyg/GNsTQbh9G2NIaU5Zg4U3fvvkcuTvGQvCfK tTPQ== X-Gm-Message-State: AOJu0Yyzp6lM8QK88FyvlALZcemEiXXGHl7SdyVVzKonMzRPtVeteqor d6mZISozFzIxHi0iE81Te/CHBg== X-Google-Smtp-Source: AGHT+IHVZxWBoccQ5Jk/sZcBfoYdsYWzKFv9LsnMbT9hBrQCPhgOuD1Vl7DAtvQDuLIwn97Ua+5CfQ== X-Received: by 2002:a05:6a20:8419:b0:16b:d3d5:a5c5 with SMTP id c25-20020a056a20841900b0016bd3d5a5c5mr245524pzd.52.1699393261935; Tue, 07 Nov 2023 13:41:01 -0800 (PST) Received: from localhost (fwdproxy-prn-014.fbsv.net. [2a03:2880:ff:e::face:b00c]) by smtp.gmail.com with ESMTPSA id ey18-20020a056a0038d200b00690d255b5a1sm7560681pfb.217.2023.11.07.13.41.01 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Tue, 07 Nov 2023 13:41:01 -0800 (PST) From: David Wei To: io-uring@vger.kernel.org, netdev@vger.kernel.org Cc: Jens Axboe , Pavel Begunkov , Jakub Kicinski , Paolo Abeni , "David S. 
Miller" , Eric Dumazet , Jesper Dangaard Brouer , David Ahern , Mina Almasry , Willem de Bruijn , Dragos Tatulea Subject: [PATCH 06/20] io_uring: add ZC buf and pool Date: Tue, 7 Nov 2023 13:40:31 -0800 Message-Id: <20231107214045.2172393-7-dw@davidwei.uk> X-Mailer: git-send-email 2.39.3 In-Reply-To: <20231107214045.2172393-1-dw@davidwei.uk> References: <20231107214045.2172393-1-dw@davidwei.uk> Precedence: bulk X-Mailing-List: netdev@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 This patch adds two objects: * Zero copy buffer representation, holding a page, its mapped dma_addr, and a refcount for lifetime management. * Zero copy pool, spiritually similar to page pool, that holds ZC bufs and hands them out to net devices. The ZC pool is tiered with currently two tiers: a fast lockless cache that should only be accessed from the NAPI context of a single Rx queue, and a freelist. When a ZC pool region is first mapped, it is added to the freelist. During normal operation, bufs are moved from the freelist into the cache in POOL_CACHE_SIZE blocks before being given out. Pool regions are registered w/ io_uring using the registered buffer API, with a 1:1 mapping between region and nr_iovec in io_uring_register_buffers. This does the heavy lifting of pinning and chunking into bvecs into a struct io_mapped_ubuf for us. For now as there is only one pool region per ifq, there is no separate API for adding/removing regions yet and it is mapped implicitly during ifq registration. Co-developed-by: Pavel Begunkov Signed-off-by: Pavel Begunkov Signed-off-by: David Wei --- include/linux/io_uring.h | 6 ++ io_uring/zc_rx.c | 173 ++++++++++++++++++++++++++++++++++++++- 2 files changed, 178 insertions(+), 1 deletion(-) diff --git a/include/linux/io_uring.h b/include/linux/io_uring.h index 106cdc55ff3b..abfb73e257a4 100644 --- a/include/linux/io_uring.h +++ b/include/linux/io_uring.h @@ -41,6 +41,12 @@ static inline const void *io_uring_sqe_cmd(const struct io_uring_sqe *sqe) return sqe->cmd; } +struct io_zc_rx_buf { + dma_addr_t dma; + struct page *page; + atomic_t refcount; +}; + #if defined(CONFIG_IO_URING) int io_uring_cmd_import_fixed(u64 ubuf, unsigned long len, int rw, struct iov_iter *iter, void *ioucmd); diff --git a/io_uring/zc_rx.c b/io_uring/zc_rx.c index b5266a67395e..0f5fa9ab5cec 100644 --- a/io_uring/zc_rx.c +++ b/io_uring/zc_rx.c @@ -5,14 +5,44 @@ #include #include #include +#include #include #include "io_uring.h" #include "kbuf.h" +#include "rsrc.h" #include "zc_rx.h" #include "rsrc.h" +#define POOL_CACHE_SIZE 128 + +struct io_zc_rx_pool { + struct io_zc_rx_ifq *ifq; + struct io_zc_rx_buf *bufs; + u16 pool_id; + u32 nr_pages; + + /* fast cache */ + u32 cache_count; + u32 cache[POOL_CACHE_SIZE]; + + /* freelist */ + spinlock_t freelist_lock; + u32 free_count; + u32 freelist[]; +}; + +static inline struct device *netdev2dev(struct net_device *dev) +{ + return dev->dev.parent; +} + +static inline u64 mk_page_info(u16 pool_id, u32 pgid) +{ + return (u64)0xface << 48 | (u64)pool_id << 32 | (u64)pgid; +} + typedef int (*bpf_op_t)(struct net_device *dev, struct netdev_bpf *bpf); static int __io_queue_mgmt(struct net_device *dev, struct io_zc_rx_ifq *ifq, @@ -42,6 +72,143 @@ static int io_close_zc_rxq(struct io_zc_rx_ifq *ifq) return __io_queue_mgmt(ifq->dev, NULL, ifq->if_rxq_id); } +static int io_zc_rx_map_buf(struct device *dev, struct page *page, u16 pool_id, + u32 pgid, struct io_zc_rx_buf *buf) +{ + dma_addr_t addr; + + SetPagePrivate(page); + set_page_private(page, 
mk_page_info(pool_id, pgid)); + + addr = dma_map_page_attrs(dev, page, 0, PAGE_SIZE, + DMA_BIDIRECTIONAL, + DMA_ATTR_SKIP_CPU_SYNC); + if (dma_mapping_error(dev, addr)) { + set_page_private(page, 0); + ClearPagePrivate(page); + return -ENOMEM; + } + + buf->dma = addr; + buf->page = page; + atomic_set(&buf->refcount, 0); + get_page(page); + + return 0; +} + +static void io_zc_rx_unmap_buf(struct device *dev, struct io_zc_rx_buf *buf) +{ + struct page *page; + + page = buf->page; + set_page_private(page, 0); + ClearPagePrivate(page); + dma_unmap_page_attrs(dev, buf->dma, PAGE_SIZE, + DMA_BIDIRECTIONAL, + DMA_ATTR_SKIP_CPU_SYNC); + put_page(page); +} + +static int io_zc_rx_map_pool(struct io_zc_rx_pool *pool, + struct io_mapped_ubuf *imu, + struct device *dev) +{ + struct io_zc_rx_buf *buf; + struct page *page; + int i, ret; + + for (i = 0; i < imu->nr_bvecs; i++) { + page = imu->bvec[i].bv_page; + if (PagePrivate(page)) { + ret = -EEXIST; + goto err; + } + + buf = &pool->bufs[i]; + ret = io_zc_rx_map_buf(dev, page, pool->pool_id, i, buf); + if (ret) + goto err; + + pool->freelist[i] = i; + } + + return 0; +err: + while (i--) { + buf = &pool->bufs[i]; + io_zc_rx_unmap_buf(dev, buf); + } + + return ret; +} + +static int io_zc_rx_create_pool(struct io_ring_ctx *ctx, + struct io_zc_rx_ifq *ifq, + u16 id) +{ + struct device *dev = netdev2dev(ifq->dev); + struct io_mapped_ubuf *imu; + struct io_zc_rx_pool *pool; + int nr_pages; + int ret; + + if (ifq->pool) + return -EFAULT; + + if (unlikely(id >= ctx->nr_user_bufs)) + return -EFAULT; + id = array_index_nospec(id, ctx->nr_user_bufs); + imu = ctx->user_bufs[id]; + if (imu->ubuf & ~PAGE_MASK || imu->ubuf_end & ~PAGE_MASK) + return -EFAULT; + + ret = -ENOMEM; + nr_pages = imu->nr_bvecs; + pool = kvmalloc(struct_size(pool, freelist, nr_pages), GFP_KERNEL); + if (!pool) + goto err; + + pool->bufs = kvmalloc_array(nr_pages, sizeof(*pool->bufs), GFP_KERNEL); + if (!pool->bufs) + goto err_buf; + + ret = io_zc_rx_map_pool(pool, imu, dev); + if (ret) + goto err_map; + + pool->ifq = ifq; + pool->pool_id = id; + pool->nr_pages = nr_pages; + pool->cache_count = 0; + spin_lock_init(&pool->freelist_lock); + pool->free_count = nr_pages; + ifq->pool = pool; + + return 0; + +err_map: + kvfree(pool->bufs); +err_buf: + kvfree(pool); +err: + return ret; +} + +static void io_zc_rx_destroy_pool(struct io_zc_rx_pool *pool) +{ + struct device *dev = netdev2dev(pool->ifq->dev); + struct io_zc_rx_buf *buf; + + for (int i = 0; i < pool->nr_pages; i++) { + buf = &pool->bufs[i]; + + io_zc_rx_unmap_buf(dev, buf); + } + kvfree(pool->bufs); + kvfree(pool); +} + static struct io_zc_rx_ifq *io_zc_rx_ifq_alloc(struct io_ring_ctx *ctx) { struct io_zc_rx_ifq *ifq; @@ -60,6 +227,8 @@ static void io_zc_rx_ifq_free(struct io_zc_rx_ifq *ifq) { if (ifq->if_rxq_id != -1) io_close_zc_rxq(ifq); + if (ifq->pool) + io_zc_rx_destroy_pool(ifq->pool); if (ifq->dev) dev_put(ifq->dev); io_free_rbuf_ring(ifq); @@ -94,7 +263,9 @@ int io_register_zc_rx_ifq(struct io_ring_ctx *ctx, if (!ifq->dev) goto err; - /* TODO: map zc region and initialise zc pool */ + ret = io_zc_rx_create_pool(ctx, ifq, reg.region_id); + if (ret) + goto err; ifq->rq_entries = reg.rq_entries; ifq->cq_entries = reg.cq_entries; From patchwork Tue Nov 7 21:40:32 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: David Wei X-Patchwork-Id: 13449350 X-Patchwork-Delegate: kuba@kernel.org Received: from lindbergh.monkeyblade.net (lindbergh.monkeyblade.net 
[23.128.96.19]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id BD9842A8E8 for ; Tue, 7 Nov 2023 21:41:03 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=davidwei-uk.20230601.gappssmtp.com header.i=@davidwei-uk.20230601.gappssmtp.com header.b="rAlFztHq" Received: from mail-pg1-x536.google.com (mail-pg1-x536.google.com [IPv6:2607:f8b0:4864:20::536]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 4063810E6 for ; Tue, 7 Nov 2023 13:41:03 -0800 (PST) Received: by mail-pg1-x536.google.com with SMTP id 41be03b00d2f7-5aa481d53e5so4229128a12.1 for ; Tue, 07 Nov 2023 13:41:03 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=davidwei-uk.20230601.gappssmtp.com; s=20230601; t=1699393263; x=1699998063; darn=vger.kernel.org; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:from:to:cc:subject:date :message-id:reply-to; bh=QAg0Jz6lJH3QD8K6wvrjVW15K9s2A8jtw8v6eF5xseE=; b=rAlFztHqWLTLhgIFyZjUTBG1xOjq/9Nr9ueCzJTXlMn5BPWKRxOhsfh45DzhhR4Fjy dypaAiJbLyz6xrULQm4aQE2FmBclja+XuHGncjde9F99tZ+JjpqaCJ6sp/lCY5smEd9j Sl+G4WC5vEWA38ZbtV8RebuSoWNGHlTZflF1jREF/ecPes+msYDGzR//pg1oDCa0afNe tBdo/mvW41QOzSEZ84QKJ655rNkNvPF7sv04U7ZWBrBuJwrNa7UKv9Vt5NdpcDmNIwuR a4I7I/Xt+T2TPWVYIuOpLEQtwFilortTUCV7RofHGypJJ1pp+bg0s6T+p/TH5hNMLQqE cu/Q== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1699393263; x=1699998063; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=QAg0Jz6lJH3QD8K6wvrjVW15K9s2A8jtw8v6eF5xseE=; b=B5MRd4OwYohyh3Kee7gEepZ2R4rYdvjFP/iGLfTEZFHeBdZXOJlIcq4Tla2Va3C9OX nOtZp4cNG64OhOJu4MkyOTrfLB6J1TKLPf/tmno3VIVDPm7w9ata2ol+3nE0HGYlLfG7 M4qhYbZx2JGlYpusHfKuv3EvcS0G7dQ/Vh7KhHUOOPkfTZwmoOv+FMvx2GhYaZCSBr5o qBDDj8kmLfKHTeYQrgEVt2nv9IzT/c+RNUk3qhPAwl7UrsQa99fMWFdrLk19uQmUx8dO Pj5uPU9wp2NEsKAq4vPEXGcpJJ2BoHTlVAfF8qNAocjd8tYeDOaurkNjZTssNy8aVg4e CDow== X-Gm-Message-State: AOJu0YwgmznpzlU1+htBkIx4UHI80gz4Up3FUSbedZ9Pkuh1UOAeiaB/ N7nGj5CCBQTBwILoGZi1cg5img== X-Google-Smtp-Source: AGHT+IHWYyJ4PUWdPclTsrmoG8Zst9HyiUfl18diZ3k13PQj+J9uDi1i72YwmExYAzIS6Djd7WPR0w== X-Received: by 2002:a05:6a20:da8a:b0:181:b86b:41f with SMTP id iy10-20020a056a20da8a00b00181b86b041fmr291675pzb.33.1699393262766; Tue, 07 Nov 2023 13:41:02 -0800 (PST) Received: from localhost (fwdproxy-prn-021.fbsv.net. [2a03:2880:ff:15::face:b00c]) by smtp.gmail.com with ESMTPSA id m5-20020a62f205000000b006c0678eab2csm7792252pfh.90.2023.11.07.13.41.02 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Tue, 07 Nov 2023 13:41:02 -0800 (PST) From: David Wei To: io-uring@vger.kernel.org, netdev@vger.kernel.org Cc: Jens Axboe , Pavel Begunkov , Jakub Kicinski , Paolo Abeni , "David S. Miller" , Eric Dumazet , Jesper Dangaard Brouer , David Ahern , Mina Almasry , Willem de Bruijn , Dragos Tatulea Subject: [PATCH 07/20] io_uring: add ZC pool API Date: Tue, 7 Nov 2023 13:40:32 -0800 Message-Id: <20231107214045.2172393-8-dw@davidwei.uk> X-Mailer: git-send-email 2.39.3 In-Reply-To: <20231107214045.2172393-1-dw@davidwei.uk> References: <20231107214045.2172393-1-dw@davidwei.uk> Precedence: bulk X-Mailing-List: netdev@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 This patch adds an API to get/put bufs from a ZC pool added in the previous patch. 
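As a sketch of the intended consumer (hypothetical driver code, not part of this series: struct mydrv_rx_ring and mydrv_post_rx_desc() are placeholders), a driver refilling its hardware ring from NAPI context would do roughly:

#include <linux/io_uring.h>

struct mydrv_rx_ring {
	struct io_zc_rx_ifq *ifq;	/* handed over via XDP_SETUP_ZC_RX */
};

static void mydrv_post_rx_desc(struct mydrv_rx_ring *ring, dma_addr_t dma,
			       struct io_zc_rx_buf *buf);

static int mydrv_refill_zc_rx(struct mydrv_rx_ring *ring)
{
	struct io_zc_rx_buf *buf;

	/* only valid from the NAPI context of the queue backing this ifq */
	buf = io_zc_rx_get_buf(ring->ifq);
	if (!buf)
		return -ENOMEM;

	/* pages are already DMA mapped, program the descriptor directly */
	mydrv_post_rx_desc(ring, io_zc_rx_buf_dma(buf), buf);
	return 0;
}

static void mydrv_drop_zc_rx_buf(struct mydrv_rx_ring *ring,
				 struct io_zc_rx_buf *buf)
{
	/* e.g. on queue teardown: return an unused buffer to the pool */
	io_zc_rx_put_buf(ring->ifq, buf);
}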
Recall that there is an rbuf refill ring in an ifq that is shared w/ userspace, which puts bufs it is done with back into it. A new tier is added to the ZC pool that drains entries from the refill ring to put into the cache. So when the cache is empty, it is refilled from the refill ring first, then the freelist. ZC bufs are refcounted. Userspace is given an off + len into the entire ZC pool region, not individual pages from ZC bufs. A net device may pack multiple packets into the same page it gets from a ZC buf, so it is possible for the same ZC buf to be handed out to userspace multiple times. This means it is possible to drain the entire refill ring, and have no usable free bufs. Suggestions for dealing w/ this are very welcome! Only up to POOL_REFILL_COUNT entries are refilled from the refill ring. Given the above, we may want to limit the amount of work being done since refilling happens inside the NAPI softirq context. Co-developed-by: Pavel Begunkov Signed-off-by: Pavel Begunkov Signed-off-by: David Wei --- include/linux/io_uring.h | 19 ++++++++ io_uring/zc_rx.c | 95 ++++++++++++++++++++++++++++++++++++++++ io_uring/zc_rx.h | 11 +++++ 3 files changed, 125 insertions(+) diff --git a/include/linux/io_uring.h b/include/linux/io_uring.h index abfb73e257a4..624515a8bdd5 100644 --- a/include/linux/io_uring.h +++ b/include/linux/io_uring.h @@ -70,6 +70,18 @@ static inline void io_uring_cmd_complete_in_task(struct io_uring_cmd *ioucmd, __io_uring_cmd_do_in_task(ioucmd, task_work_cb, 0); } +struct io_zc_rx_ifq; +struct io_zc_rx_buf *io_zc_rx_get_buf(struct io_zc_rx_ifq *ifq); +void io_zc_rx_put_buf(struct io_zc_rx_ifq *ifq, struct io_zc_rx_buf *buf); + +static inline dma_addr_t io_zc_rx_buf_dma(struct io_zc_rx_buf *buf) +{ + return buf->dma; +} +static inline struct page *io_zc_rx_buf_page(struct io_zc_rx_buf *buf) +{ + return buf->page; +} static inline void io_uring_files_cancel(void) { if (current->io_uring) { @@ -106,6 +118,13 @@ static inline void io_uring_cmd_do_in_task_lazy(struct io_uring_cmd *ioucmd, void (*task_work_cb)(struct io_uring_cmd *, unsigned)) { } +static inline struct io_zc_rx_buf *io_zc_rx_get_buf(struct io_zc_rx_ifq *ifq) +{ + return NULL; +} +static inline void io_zc_rx_put_buf(struct io_zc_rx_ifq *ifq, struct io_zc_rx_buf *buf) +{ +} static inline struct sock *io_uring_get_socket(struct file *file) { return NULL; diff --git a/io_uring/zc_rx.c b/io_uring/zc_rx.c index 0f5fa9ab5cec..840a21549d89 100644 --- a/io_uring/zc_rx.c +++ b/io_uring/zc_rx.c @@ -16,6 +16,9 @@ #include "rsrc.h" #define POOL_CACHE_SIZE 128 +#define POOL_REFILL_COUNT 64 +#define IO_ZC_RX_UREF 0x10000 +#define IO_ZC_RX_KREF_MASK (IO_ZC_RX_UREF - 1) struct io_zc_rx_pool { struct io_zc_rx_ifq *ifq; @@ -269,6 +272,8 @@ int io_register_zc_rx_ifq(struct io_ring_ctx *ctx, ifq->rq_entries = reg.rq_entries; ifq->cq_entries = reg.cq_entries; + ifq->cached_rq_head = 0; + ifq->cached_cq_tail = 0; ifq->if_rxq_id = reg.if_rxq_id; ctx->ifq = ifq; @@ -371,4 +376,94 @@ int io_register_zc_rx_sock(struct io_ring_ctx *ctx, ifq->nr_sockets++; return 0; } + +static bool io_zc_rx_put_buf_uref(struct io_zc_rx_buf *buf) +{ + if (atomic_read(&buf->refcount) < IO_ZC_RX_UREF) + return false; + + return atomic_sub_and_test(IO_ZC_RX_UREF, &buf->refcount); +} + +static void io_zc_rx_refill_cache(struct io_zc_rx_ifq *ifq, int count) +{ + unsigned int entries = io_zc_rx_rqring_entries(ifq); + unsigned int mask = ifq->rq_entries - 1; + struct io_zc_rx_pool *pool = ifq->pool; + struct io_uring_rbuf_rqe *rqe; + struct io_zc_rx_buf 
*buf; + int i, filled; + + if (!entries) + return; + + for (i = 0, filled = 0; i < entries && filled < count; i++) { + unsigned int rq_idx = ifq->cached_rq_head++ & mask; + u32 pgid; + + rqe = &ifq->rqes[rq_idx]; + pgid = rqe->off / PAGE_SIZE; + buf = &pool->bufs[pgid]; + if (!io_zc_rx_put_buf_uref(buf)) + continue; + pool->cache[filled++] = pgid; + } + + smp_store_release(&ifq->ring->rq.head, ifq->cached_rq_head); + pool->cache_count += filled; +} + +struct io_zc_rx_buf *io_zc_rx_get_buf(struct io_zc_rx_ifq *ifq) +{ + struct io_zc_rx_pool *pool = ifq->pool; + struct io_zc_rx_buf *buf; + int count; + u32 pgid; + + lockdep_assert_no_hardirq(); + + if (likely(pool->cache_count)) + goto out; + + io_zc_rx_refill_cache(ifq, POOL_REFILL_COUNT); + if (pool->cache_count) + goto out; + + spin_lock_bh(&pool->freelist_lock); + count = min_t(u32, pool->free_count, POOL_CACHE_SIZE); + pool->free_count -= count; + pool->cache_count += count; + memcpy(pool->cache, &pool->freelist[pool->free_count], + count * sizeof(u32)); + spin_unlock_bh(&pool->freelist_lock); + + if (!pool->cache_count) + return NULL; +out: + pgid = pool->cache[--pool->cache_count]; + buf = &pool->bufs[pgid]; + atomic_set(&buf->refcount, 1); + return buf; +} +EXPORT_SYMBOL(io_zc_rx_get_buf); + +static void io_zc_rx_recycle_buf(struct io_zc_rx_pool *pool, + struct io_zc_rx_buf *buf) +{ + spin_lock_bh(&pool->freelist_lock); + pool->freelist[pool->free_count++] = buf - pool->bufs; + spin_unlock_bh(&pool->freelist_lock); +} + +void io_zc_rx_put_buf(struct io_zc_rx_ifq *ifq, struct io_zc_rx_buf *buf) +{ + struct io_zc_rx_pool *pool = ifq->pool; + + if (!atomic_dec_and_test(&buf->refcount)) + return; + + io_zc_rx_recycle_buf(pool, buf); +} +EXPORT_SYMBOL(io_zc_rx_put_buf); + #endif diff --git a/io_uring/zc_rx.h b/io_uring/zc_rx.h index ab25f8dbb433..a3df820e52e7 100644 --- a/io_uring/zc_rx.h +++ b/io_uring/zc_rx.h @@ -16,6 +16,8 @@ struct io_zc_rx_ifq { struct io_uring_rbuf_rqe *rqes; struct io_uring_rbuf_cqe *cqes; u32 rq_entries, cq_entries; + u32 cached_rq_head; + u32 cached_cq_tail; void *pool; unsigned nr_sockets; @@ -26,6 +28,15 @@ struct io_zc_rx_ifq { }; #if defined(CONFIG_NET) +static inline u32 io_zc_rx_rqring_entries(struct io_zc_rx_ifq *ifq) +{ + struct io_rbuf_ring *ring = ifq->ring; + u32 entries; + + entries = smp_load_acquire(&ring->rq.tail) - ifq->cached_rq_head; + return min(entries, ifq->rq_entries); +} + int io_register_zc_rx_ifq(struct io_ring_ctx *ctx, struct io_uring_zc_rx_ifq_reg __user *arg); int io_unregister_zc_rx_ifq(struct io_ring_ctx *ctx); From patchwork Tue Nov 7 21:40:33 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: David Wei X-Patchwork-Id: 13449351 X-Patchwork-Delegate: kuba@kernel.org Received: from lindbergh.monkeyblade.net (lindbergh.monkeyblade.net [23.128.96.19]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 9DAB32A8F7 for ; Tue, 7 Nov 2023 21:41:04 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=davidwei-uk.20230601.gappssmtp.com header.i=@davidwei-uk.20230601.gappssmtp.com header.b="ClCVsMEf" Received: from mail-pl1-x630.google.com (mail-pl1-x630.google.com [IPv6:2607:f8b0:4864:20::630]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 2C47810DB for ; Tue, 7 Nov 2023 13:41:04 -0800 (PST) Received: by mail-pl1-x630.google.com with SMTP id 
d9443c01a7336-1cc3bb32b5dso56141045ad.3 for ; Tue, 07 Nov 2023 13:41:04 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=davidwei-uk.20230601.gappssmtp.com; s=20230601; t=1699393263; x=1699998063; darn=vger.kernel.org; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:from:to:cc:subject:date :message-id:reply-to; bh=cl3ninwz7WqOnfXoQFVr+8n0tlWT0AeCCLPz+qAllN0=; b=ClCVsMEfweqFIx0KnAEE7bsyQagoemyd2uQ5pfmU5lPqeReiXqmHJWIjdev7DJ4EuW TzjRbU7eHNWI02QGkAhhTMVAq5yLBRPkBW+7Z6gauQHkSOcZV5fQFKaVInLijsSbXDlj rM2uvNmZPArv1hnM8i9TzbONrLVw14TEOTlhRLpQ3gDx75+pU9tNS3BgVm5Blnf65UfK YjkJJnora83Jt2RYmZyodxK287gilnqYkO42/zWtFdu8AtsObilK7tAp/Z8r/T9L05PV Z6Qpdc47l37+ysfadAxZqEkBGo+S0neDOGMK4koROS86/ZndiUr9IRsw4l22oszUuEWE /BXQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1699393263; x=1699998063; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=cl3ninwz7WqOnfXoQFVr+8n0tlWT0AeCCLPz+qAllN0=; b=MCjqq7BnbEYUQo3CZHYL9s1qZMuLwN/+oTK3i3WHXfBlAuPSsKLrrGF3PmQs6+h75m tDkjVS6v6ZF/JY9NOG0LCMNNJvUi2bOgYmr5XzZ2zVtxlwK8msuTxAvy2OuOhvc42DhV 8EkI25g3lxL1RBuMyh4RGHBnCFMwfEzsLFOWzqahxPWUaqz1J+8NVRXzzTwfSulgs/j7 ZdtIPovg3Ntrjg3MLDpr3zVIf0EuGdzpaoWXVhg6/UdJsvsZf8z89qh0LbGjRuOy2h5S EQEQ02nSQiQeL8LWsL4AMxryUl1Pka41Ikt8BLBgdNjlo0Xs3i034aC9TpH1zZ17kS5r bYVw== X-Gm-Message-State: AOJu0YwcXh1UjH2wVe2bfj14fVQqF45+r8sEy36lIa5YS5OaIHA9eskV oqA4Le4Huotdr4cItLZy11yrdA== X-Google-Smtp-Source: AGHT+IGrQgMliiiwDX5fn0J+VbpG/ztV2ihkRjg1ZR85mlnjmijxMcX4fMQWqalQs/IVVhQP77L+Iw== X-Received: by 2002:a17:902:8c83:b0:1ca:3c63:d5d3 with SMTP id t3-20020a1709028c8300b001ca3c63d5d3mr295371plo.2.1699393263695; Tue, 07 Nov 2023 13:41:03 -0800 (PST) Received: from localhost (fwdproxy-prn-014.fbsv.net. [2a03:2880:ff:e::face:b00c]) by smtp.gmail.com with ESMTPSA id d2-20020a170902cec200b001c3be750900sm270257plg.163.2023.11.07.13.41.03 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Tue, 07 Nov 2023 13:41:03 -0800 (PST) From: David Wei To: io-uring@vger.kernel.org, netdev@vger.kernel.org Cc: Jens Axboe , Pavel Begunkov , Jakub Kicinski , Paolo Abeni , "David S. Miller" , Eric Dumazet , Jesper Dangaard Brouer , David Ahern , Mina Almasry , Willem de Bruijn , Dragos Tatulea Subject: [PATCH 08/20] skbuff: add SKBFL_FIXED_FRAG and skb_fixed() Date: Tue, 7 Nov 2023 13:40:33 -0800 Message-Id: <20231107214045.2172393-9-dw@davidwei.uk> X-Mailer: git-send-email 2.39.3 In-Reply-To: <20231107214045.2172393-1-dw@davidwei.uk> References: <20231107214045.2172393-1-dw@davidwei.uk> Precedence: bulk X-Mailing-List: netdev@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 When an skb that is marked as zero copy goes up the network stack during RX, skb_orphan_frags_rx is called, which then calls skb_copy_ubufs and defeats the purpose of ZC. This is because zero copy is currently TX-only, and this behaviour is designed to prevent TX zero copy data from being redirected up the network stack; it was not written with new zero copy RX data coming from the driver in mind. This patch adds a new flag SKBFL_FIXED_FRAG and checks for it in skb_orphan_frags_rx, skipping skb_copy_ubufs if it is set.
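
To make the intended semantics concrete, the RX-path decision after this change boils down to the following (an illustrative sketch, not part of the diff; the helper name needs_ubuf_copy_on_rx is made up here):

/*
 * A zerocopy skb only needs skb_copy_ubufs() on the RX path if its frags
 * are not marked as fixed; ZC Rx skbs carry SKBFL_FIXED_FRAG and so pass
 * through untouched.
 */
static bool needs_ubuf_copy_on_rx(struct sk_buff *skb)
{
        return skb_zcopy(skb) && !skb_fixed(skb);
}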
Co-developed-by: Pavel Begunkov Signed-off-by: Pavel Begunkov Signed-off-by: David Wei --- include/linux/skbuff.h | 10 +++++++++- 1 file changed, 9 insertions(+), 1 deletion(-) diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h index 4174c4b82d13..12de269d6827 100644 --- a/include/linux/skbuff.h +++ b/include/linux/skbuff.h @@ -516,6 +516,9 @@ enum { * use frags only up until ubuf_info is released */ SKBFL_MANAGED_FRAG_REFS = BIT(4), + + /* don't move or copy the fragment */ + SKBFL_FIXED_FRAG = BIT(5), }; #define SKBFL_ZEROCOPY_FRAG (SKBFL_ZEROCOPY_ENABLE | SKBFL_SHARED_FRAG) @@ -1682,6 +1685,11 @@ static inline bool skb_zcopy_managed(const struct sk_buff *skb) return skb_shinfo(skb)->flags & SKBFL_MANAGED_FRAG_REFS; } +static inline bool skb_fixed(const struct sk_buff *skb) +{ + return skb_shinfo(skb)->flags & SKBFL_FIXED_FRAG; +} + static inline bool skb_pure_zcopy_same(const struct sk_buff *skb1, const struct sk_buff *skb2) { @@ -3143,7 +3151,7 @@ static inline int skb_orphan_frags(struct sk_buff *skb, gfp_t gfp_mask) /* Frags must be orphaned, even if refcounted, if skb might loop to rx path */ static inline int skb_orphan_frags_rx(struct sk_buff *skb, gfp_t gfp_mask) { - if (likely(!skb_zcopy(skb))) + if (likely(!skb_zcopy(skb) || skb_fixed(skb))) return 0; return skb_copy_ubufs(skb, gfp_mask); } From patchwork Tue Nov 7 21:40:34 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: David Wei X-Patchwork-Id: 13449352 X-Patchwork-Delegate: kuba@kernel.org Received: from lindbergh.monkeyblade.net (lindbergh.monkeyblade.net [23.128.96.19]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 9A2702B2C3 for ; Tue, 7 Nov 2023 21:41:05 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=davidwei-uk.20230601.gappssmtp.com header.i=@davidwei-uk.20230601.gappssmtp.com header.b="H4aaLUmI" Received: from mail-pl1-x636.google.com (mail-pl1-x636.google.com [IPv6:2607:f8b0:4864:20::636]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 12EB810D0 for ; Tue, 7 Nov 2023 13:41:05 -0800 (PST) Received: by mail-pl1-x636.google.com with SMTP id d9443c01a7336-1cc1e1e74beso56088525ad.1 for ; Tue, 07 Nov 2023 13:41:05 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=davidwei-uk.20230601.gappssmtp.com; s=20230601; t=1699393264; x=1699998064; darn=vger.kernel.org; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:from:to:cc:subject:date :message-id:reply-to; bh=cSOXUo8OS6zelv94KUNMHN/MdPKo/PdzolI0dF5/1lw=; b=H4aaLUmIS+4Y4hXoazgXFgezUU9EPqRHsugsKsyqWgUZvUpX5Fro0yu3s5aQrHyFlI zHEQpUtKPb0Z0X0jYEiJmHIfHlxbZ4wuP5hypwbr/pL9T+TU/W7CX0i7bn6Cbemah+kt dol+6KjztBKNdmTgUFGFE77nAqscpB7K5S1nvN8g2vLc3mnKrisuP6NjjUU1Zy1n18Vk BAMIwSch2kJxDTMAMixrZ+k+n42sUfLUig55ySpejn0Y8yj/Mch1PU5RPkFqlenUSlFO cLIt1vhRInpEQHnaMpe+Nz9bgjqY93Ej1oonjF3p/iq/VC7KEb8a/d1xaegj7nqaIIh2 6JxA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1699393264; x=1699998064; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=cSOXUo8OS6zelv94KUNMHN/MdPKo/PdzolI0dF5/1lw=; b=NvaXcyB7oAClwTZaEe2tY6qePzHYUd/aR8B2XF7A9YjfXwIHRV49tHqQR6BqBSjRNZ iPsKQDsULS89i0ONp1GbGTM/wvcf6WWEHy6dOj2Us8hynCNgFI+mTcc5JdBx8mr2Ehsz 
2bcdaX3bAgHR4MnNuaCJnV++TqSHCOjMKp1ScN3Qqss1X6yR9I+a5yHEYVInt5t81bkL TzdU9mWbwfhP21t5EjiZv2TF9I9oV6SJWKU6ge0SarHXb0cx5+5BZLiiZ2zapB+VGf3e 2SLghpFW5/QlG2iRULfsqdzDn0a6fwPWMHO7NgLyjb9VPGiKz2JRD2XOdXLQ0ZSmwDDv HMzw== X-Gm-Message-State: AOJu0YzDIDow2T9yKlP2xELDzM67m4N3gmAAEqeYSSa71Nnkm2pXe553 7vm4ON5pu1MSRTF2lPVXC5FQzw== X-Google-Smtp-Source: AGHT+IFv6ru8WDfKAjNQ4Ab7cvWHWmt1HeHM2E4sGqiBQT2254a7GxveuXA00BgODy2WeqT1QkFRTg== X-Received: by 2002:a17:902:9888:b0:1cc:54b5:b4fa with SMTP id s8-20020a170902988800b001cc54b5b4famr230582plp.18.1699393264581; Tue, 07 Nov 2023 13:41:04 -0800 (PST) Received: from localhost (fwdproxy-prn-006.fbsv.net. [2a03:2880:ff:6::face:b00c]) by smtp.gmail.com with ESMTPSA id u6-20020a170902e5c600b001b89466a5f4sm279147plf.105.2023.11.07.13.41.04 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Tue, 07 Nov 2023 13:41:04 -0800 (PST) From: David Wei To: io-uring@vger.kernel.org, netdev@vger.kernel.org Cc: Jens Axboe , Pavel Begunkov , Jakub Kicinski , Paolo Abeni , "David S. Miller" , Eric Dumazet , Jesper Dangaard Brouer , David Ahern , Mina Almasry , Willem de Bruijn , Dragos Tatulea Subject: [PATCH 09/20] io_uring: allocate a uarg for freeing zero copy skbs Date: Tue, 7 Nov 2023 13:40:34 -0800 Message-Id: <20231107214045.2172393-10-dw@davidwei.uk> X-Mailer: git-send-email 2.39.3 In-Reply-To: <20231107214045.2172393-1-dw@davidwei.uk> References: <20231107214045.2172393-1-dw@davidwei.uk> Precedence: bulk X-Mailing-List: netdev@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 X-Patchwork-Delegate: kuba@kernel.org As ZC skbs are marked as zero copy, they will bypass the default skb frag destructor. This patch adds a static uarg that is attached to ZC bufs and a callback that returns them to the freelist of a ZC pool. 
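
For context, the expectation is that a driver stashes this uarg when the queue is configured and attaches it to the ZC Rx skbs it builds, roughly as below. This is a sketch only: the driver-side structure and function names are hypothetical, while the netdev_bpf fields, the flags and skb_zcopy_init() are the series/stack pieces it relies on.

struct my_rx_ring {                             /* hypothetical driver state */
        struct io_zc_rx_ifq *zc_ifq;
        struct ubuf_info *zc_uarg;
};

static void my_drv_setup_zc_rx(struct my_rx_ring *ring, struct netdev_bpf *bpf)
{
        ring->zc_ifq = bpf->zc_rx.ifq;          /* NULL when tearing down */
        ring->zc_uarg = bpf->zc_rx.uarg;
}

static void my_drv_mark_zc_skb(struct my_rx_ring *ring, struct sk_buff *skb)
{
        /* attach the io_uring uarg and mark the frags as fixed zerocopy so
         * the stack leaves the pages alone and the callback below returns
         * the bufs to the pool freelist when the skb is consumed
         */
        skb_zcopy_init(skb, ring->zc_uarg);
        skb_shinfo(skb)->flags |= SKBFL_ALL_ZEROCOPY | SKBFL_FIXED_FRAG;
}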
Co-developed-by: Pavel Begunkov Signed-off-by: Pavel Begunkov Signed-off-by: David Wei --- include/linux/io_uring.h | 7 +++++++ include/linux/netdevice.h | 1 + io_uring/zc_rx.c | 44 +++++++++++++++++++++++++++++++++++++++ io_uring/zc_rx.h | 1 + 4 files changed, 53 insertions(+) diff --git a/include/linux/io_uring.h b/include/linux/io_uring.h index 624515a8bdd5..fb88e000c156 100644 --- a/include/linux/io_uring.h +++ b/include/linux/io_uring.h @@ -72,6 +72,8 @@ static inline void io_uring_cmd_complete_in_task(struct io_uring_cmd *ioucmd, struct io_zc_rx_ifq; struct io_zc_rx_buf *io_zc_rx_get_buf(struct io_zc_rx_ifq *ifq); +struct io_zc_rx_buf *io_zc_rx_buf_from_page(struct io_zc_rx_ifq *ifq, + struct page *page); void io_zc_rx_put_buf(struct io_zc_rx_ifq *ifq, struct io_zc_rx_buf *buf); static inline dma_addr_t io_zc_rx_buf_dma(struct io_zc_rx_buf *buf) @@ -122,6 +124,11 @@ static inline struct io_zc_rx_buf *io_zc_rx_get_buf(struct io_zc_rx_ifq *ifq) { return NULL; } +static inline struct io_zc_rx_buf *io_zc_rx_buf_from_page(struct io_zc_rx_ifq *ifq, + struct page *page) +{ + return NULL; +} static inline void io_zc_rx_put_buf(struct io_zc_rx_ifq *ifq, struct io_zc_rx_buf *buf) { } diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index f9c82c89a96b..ec82fc984941 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -1027,6 +1027,7 @@ struct netdev_bpf { struct { struct io_zc_rx_ifq *ifq; u16 queue_id; + struct ubuf_info *uarg; } zc_rx; }; }; diff --git a/io_uring/zc_rx.c b/io_uring/zc_rx.c index 840a21549d89..59f279486e9a 100644 --- a/io_uring/zc_rx.c +++ b/io_uring/zc_rx.c @@ -46,6 +46,11 @@ static inline u64 mk_page_info(u16 pool_id, u32 pgid) return (u64)0xface << 48 | (u64)pool_id << 32 | (u64)pgid; } +static inline bool is_zc_rx_page(struct page *page) +{ + return PagePrivate(page) && ((page_private(page) >> 48) == 0xface); +} + typedef int (*bpf_op_t)(struct net_device *dev, struct netdev_bpf *bpf); static int __io_queue_mgmt(struct net_device *dev, struct io_zc_rx_ifq *ifq, @@ -61,6 +66,7 @@ static int __io_queue_mgmt(struct net_device *dev, struct io_zc_rx_ifq *ifq, cmd.command = XDP_SETUP_ZC_RX; cmd.zc_rx.ifq = ifq; cmd.zc_rx.queue_id = queue_id; + cmd.zc_rx.uarg = ifq ? 
&ifq->uarg : 0; return ndo_bpf(dev, &cmd); } @@ -75,6 +81,26 @@ static int io_close_zc_rxq(struct io_zc_rx_ifq *ifq) return __io_queue_mgmt(ifq->dev, NULL, ifq->if_rxq_id); } +static void io_zc_rx_skb_free(struct sk_buff *skb, struct ubuf_info *uarg, + bool success) +{ + struct skb_shared_info *shinfo = skb_shinfo(skb); + struct io_zc_rx_ifq *ifq; + struct io_zc_rx_buf *buf; + struct page *page; + int i; + + ifq = container_of(uarg, struct io_zc_rx_ifq, uarg); + for (i = 0; i < shinfo->nr_frags; i++) { + page = skb_frag_page(&shinfo->frags[i]); + buf = io_zc_rx_buf_from_page(ifq, page); + if (likely(buf)) + io_zc_rx_put_buf(ifq, buf); + else + __skb_frag_unref(&shinfo->frags[i], skb->pp_recycle); + } +} + static int io_zc_rx_map_buf(struct device *dev, struct page *page, u16 pool_id, u32 pgid, struct io_zc_rx_buf *buf) { @@ -270,6 +296,8 @@ int io_register_zc_rx_ifq(struct io_ring_ctx *ctx, if (ret) goto err; + ifq->uarg.callback = io_zc_rx_skb_free; + ifq->uarg.flags = SKBFL_ALL_ZEROCOPY | SKBFL_FIXED_FRAG; ifq->rq_entries = reg.rq_entries; ifq->cq_entries = reg.cq_entries; ifq->cached_rq_head = 0; @@ -466,4 +494,20 @@ void io_zc_rx_put_buf(struct io_zc_rx_ifq *ifq, struct io_zc_rx_buf *buf) } EXPORT_SYMBOL(io_zc_rx_put_buf); +struct io_zc_rx_buf *io_zc_rx_buf_from_page(struct io_zc_rx_ifq *ifq, + struct page *page) +{ + struct io_zc_rx_pool *pool; + int pgid; + + if (!is_zc_rx_page(page)) + return NULL; + + pool = ifq->pool; + pgid = page_private(page) & 0xffffffff; + + return &pool->bufs[pgid]; +} +EXPORT_SYMBOL(io_zc_rx_buf_from_page); + #endif diff --git a/io_uring/zc_rx.h b/io_uring/zc_rx.h index a3df820e52e7..b99be0227e9e 100644 --- a/io_uring/zc_rx.h +++ b/io_uring/zc_rx.h @@ -15,6 +15,7 @@ struct io_zc_rx_ifq { struct io_rbuf_ring *ring; struct io_uring_rbuf_rqe *rqes; struct io_uring_rbuf_cqe *cqes; + struct ubuf_info uarg; u32 rq_entries, cq_entries; u32 cached_rq_head; u32 cached_cq_tail; From patchwork Tue Nov 7 21:40:35 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: David Wei X-Patchwork-Id: 13449353 X-Patchwork-Delegate: kuba@kernel.org Received: from lindbergh.monkeyblade.net (lindbergh.monkeyblade.net [23.128.96.19]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 8E5A82B2D4 for ; Tue, 7 Nov 2023 21:41:06 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=davidwei-uk.20230601.gappssmtp.com header.i=@davidwei-uk.20230601.gappssmtp.com header.b="WCF6h/UL" Received: from mail-pl1-x629.google.com (mail-pl1-x629.google.com [IPv6:2607:f8b0:4864:20::629]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id EA47C10DF for ; Tue, 7 Nov 2023 13:41:05 -0800 (PST) Received: by mail-pl1-x629.google.com with SMTP id d9443c01a7336-1cc1e1e74beso56088665ad.1 for ; Tue, 07 Nov 2023 13:41:05 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=davidwei-uk.20230601.gappssmtp.com; s=20230601; t=1699393265; x=1699998065; darn=vger.kernel.org; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:from:to:cc:subject:date :message-id:reply-to; bh=CEGEZSeHFsNXXDjUe7hxxYXsucH5o3pVw1oTNiKf1kY=; b=WCF6h/UL4Sl04sCFRepXbhFMm9+IC+OEifYNyXrap4HiP9asilQIBgb1tTzuEYOjd1 eaTh1XcCOlCWTn35sPj+wA4jRNZrwj6RQuXEcJgLLnZckuLINceRJdhXya7kHo4v8oUK X1VLiAtJZegyM8cAaaR4SHn/EqQ9Rpq7a3pz8rs6mOI+4C6BBq3l3Sua+8uOoNYGaTFe 
coaCt+lfiUuhtHLlhz8b36EYEXVsTbIBRndZY3VQ1weTowlmAo7FU5tkCYAEPywP0K1O sa1NWcyd36e121818PaGCcqWnOAvwTVKBD7XfN3TRyFwIqpy0T4AN2Ai4zd0wYpD93id r5pA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1699393265; x=1699998065; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=CEGEZSeHFsNXXDjUe7hxxYXsucH5o3pVw1oTNiKf1kY=; b=dzssaeaisp7rilhVpwQyX97MIwpLykDRWtTmRRtMtPTLVfKMhvqTOOSnrY5IZ00Nwi cyDqwBOXYnlAAmHpav+ExonXrWvIGF/p0vZg7zCRIvJQP0H67BArm5DFwS/VYQMBFcl1 /vdXdFl9qC5kcwO0gb7Ewd999v+5oR0GrlDa6O77lU8WmV1rjdI/SfKroTQ46XQewDiZ grnBhDlKA5ew6AdEuZ44kzmhWCECsjtmrUrGDpIkUTLT7kvlyNxhshu1iAQBBak9UxGp uGzdh0nxO/PKpq2TFx3XV/rYY7b0bF0pvITpwm/fsru3CiK1oEYn0luHpNwNuytMI6tF 7X0g== X-Gm-Message-State: AOJu0YzQ5ZR+/L6jLOh5FMy03b8XZ8DOXH5Lzi/rCGhlUVINNIA5efNq DU1u2uV4IYpMfwC6CWFIE0Pq+A== X-Google-Smtp-Source: AGHT+IHSl06XpdqPfoK/SHJali+kSRHZLDIfMQko+YfURseQXsqMuGdK6iNbBkPuUdvA0ZQOM6VlNA== X-Received: by 2002:a17:902:9b90:b0:1c3:3b5c:1fbf with SMTP id y16-20020a1709029b9000b001c33b5c1fbfmr283461plp.10.1699393265443; Tue, 07 Nov 2023 13:41:05 -0800 (PST) Received: from localhost (fwdproxy-prn-018.fbsv.net. [2a03:2880:ff:12::face:b00c]) by smtp.gmail.com with ESMTPSA id o7-20020a1709026b0700b001c739768214sm280417plk.92.2023.11.07.13.41.05 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Tue, 07 Nov 2023 13:41:05 -0800 (PST) From: David Wei To: io-uring@vger.kernel.org, netdev@vger.kernel.org Cc: Jens Axboe , Pavel Begunkov , Jakub Kicinski , Paolo Abeni , "David S. Miller" , Eric Dumazet , Jesper Dangaard Brouer , David Ahern , Mina Almasry , Willem de Bruijn , Dragos Tatulea Subject: [PATCH 10/20] io_uring: delay ZC pool destruction Date: Tue, 7 Nov 2023 13:40:35 -0800 Message-Id: <20231107214045.2172393-11-dw@davidwei.uk> X-Mailer: git-send-email 2.39.3 In-Reply-To: <20231107214045.2172393-1-dw@davidwei.uk> References: <20231107214045.2172393-1-dw@davidwei.uk> Precedence: bulk X-Mailing-List: netdev@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 At a point in time, a ZC buf may be in: * Rx queue * Socket * One of the ifq ringbufs * Userspace The ZC pool region and the pool itself cannot be destroyed until all bufs have been returned. This patch changes the ZC pool destruction to be delayed work, waiting for up to 10 seconds for bufs to be returned before unconditionally destroying the pool. 
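
Concretely, the wait amounts to re-checking, once a second, whether any buf still holds kernel references. A condensed sketch of that check (the constants and refcount layout are from this series; the helper itself is only an illustration):

/*
 * Kernel references live in the low bits of the refcount
 * (IO_ZC_RX_KREF_MASK); userspace references are accounted separately in
 * multiples of IO_ZC_RX_UREF. The delayed work keeps rescheduling itself
 * while this returns true and the 10 second deadline has not passed.
 */
static bool io_zc_rx_buf_busy(struct io_zc_rx_buf *buf)
{
        return atomic_read(&buf->refcount) & IO_ZC_RX_KREF_MASK;
}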
Co-developed-by: Pavel Begunkov Signed-off-by: Pavel Begunkov Signed-off-by: David Wei --- io_uring/zc_rx.c | 51 ++++++++++++++++++++++++++++++++++++++++++------ 1 file changed, 45 insertions(+), 6 deletions(-) diff --git a/io_uring/zc_rx.c b/io_uring/zc_rx.c index 59f279486e9a..bebcd637c893 100644 --- a/io_uring/zc_rx.c +++ b/io_uring/zc_rx.c @@ -30,6 +30,10 @@ struct io_zc_rx_pool { u32 cache_count; u32 cache[POOL_CACHE_SIZE]; + /* delayed destruction */ + unsigned long delay_end; + struct delayed_work destroy_work; + /* freelist */ spinlock_t freelist_lock; u32 free_count; @@ -224,20 +228,57 @@ static int io_zc_rx_create_pool(struct io_ring_ctx *ctx, return ret; } -static void io_zc_rx_destroy_pool(struct io_zc_rx_pool *pool) +static void io_zc_rx_destroy_ifq(struct io_zc_rx_ifq *ifq) +{ + if (ifq->dev) + dev_put(ifq->dev); + io_free_rbuf_ring(ifq); + kfree(ifq); +} + +static void io_zc_rx_destroy_pool_work(struct work_struct *work) { + struct io_zc_rx_pool *pool = container_of( + to_delayed_work(work), struct io_zc_rx_pool, destroy_work); struct device *dev = netdev2dev(pool->ifq->dev); struct io_zc_rx_buf *buf; + int i, refc, count; - for (int i = 0; i < pool->nr_pages; i++) { + for (i = 0; i < pool->nr_pages; i++) { buf = &pool->bufs[i]; + refc = atomic_read(&buf->refcount) & IO_ZC_RX_KREF_MASK; + if (refc) { + if (time_before(jiffies, pool->delay_end)) { + schedule_delayed_work(&pool->destroy_work, HZ); + return; + } + count++; + } + } + + if (count) { + pr_debug("freeing pool with %d/%d outstanding pages\n", + count, pool->nr_pages); + return; + } + for (i = 0; i < pool->nr_pages; i++) { + buf = &pool->bufs[i]; io_zc_rx_unmap_buf(dev, buf); } + + io_zc_rx_destroy_ifq(pool->ifq); kvfree(pool->bufs); kvfree(pool); } +static void io_zc_rx_destroy_pool(struct io_zc_rx_pool *pool) +{ + pool->delay_end = jiffies + HZ * 10; + INIT_DELAYED_WORK(&pool->destroy_work, io_zc_rx_destroy_pool_work); + schedule_delayed_work(&pool->destroy_work, 0); +} + static struct io_zc_rx_ifq *io_zc_rx_ifq_alloc(struct io_ring_ctx *ctx) { struct io_zc_rx_ifq *ifq; @@ -258,10 +299,8 @@ static void io_zc_rx_ifq_free(struct io_zc_rx_ifq *ifq) io_close_zc_rxq(ifq); if (ifq->pool) io_zc_rx_destroy_pool(ifq->pool); - if (ifq->dev) - dev_put(ifq->dev); - io_free_rbuf_ring(ifq); - kfree(ifq); + else + io_zc_rx_destroy_ifq(ifq); } int io_register_zc_rx_ifq(struct io_ring_ctx *ctx, From patchwork Tue Nov 7 21:40:36 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: David Wei X-Patchwork-Id: 13449354 X-Patchwork-Delegate: kuba@kernel.org Received: from lindbergh.monkeyblade.net (lindbergh.monkeyblade.net [23.128.96.19]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 521972B2E3 for ; Tue, 7 Nov 2023 21:41:07 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=davidwei-uk.20230601.gappssmtp.com header.i=@davidwei-uk.20230601.gappssmtp.com header.b="T9ioqKuM" Received: from mail-pl1-x62c.google.com (mail-pl1-x62c.google.com [IPv6:2607:f8b0:4864:20::62c]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id A732210E2 for ; Tue, 7 Nov 2023 13:41:06 -0800 (PST) Received: by mail-pl1-x62c.google.com with SMTP id d9443c01a7336-1cc58219376so56380275ad.1 for ; Tue, 07 Nov 2023 13:41:06 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=davidwei-uk.20230601.gappssmtp.com; s=20230601; 
t=1699393266; x=1699998066; darn=vger.kernel.org; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:from:to:cc:subject:date :message-id:reply-to; bh=d9J1FEFzhrhikaZNta65SrTLG6ucsKYhta+WP1qgq5Y=; b=T9ioqKuM6X4kUSmM0Jbhyh9n+f7Qn7Gu7a2GeBpjhQT0JAIFiM+lSBoyTzJitWekfF 4GnG4lPGVA2+9Jg7XFrkJB7Nc5qk2NZ5aQGeyZyVPUTwYrmmMIvVRV/96Uprpuc+kI9o 9yqVOJt29jVI//CNlTQxysdadMA5UxqGt5Gsnd7AAHVPuik1HpV5nsCsl0Iqbc3caFYf B5yumalZuahX7oWinRiE7hq41Cw83sNTTy57wotCtd1Gn7CExVh14Su3BZCQVJDY7ogK Y/aKSvEauBVBhS8vIrQXipBTs1LxgSoNX3DL7TTryFXR+V3IqxvZqqoKz1aBnN0/QTI1 gVoA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1699393266; x=1699998066; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=d9J1FEFzhrhikaZNta65SrTLG6ucsKYhta+WP1qgq5Y=; b=BSFVN/EKJkXHIrUBMoSGZLegEMgLnd6Pfgpbe/MSALaJpPBgMQvAGOYPR7zILQgPuF QSJwNvIll0nQ0RCIkhc41MBSEHg/2TpbqD5ep62jhYi9Cp0Ie8ZCexVol5pYLJWRnq8Z g4thch1ReMm81qTkDRnrFvF2oHnjPxR2X1Hq4s8jz7Sbh2Ob8UMssy+eeDp5/7ZC0/oa RcwowPhlOgzCkVwzAdj0h5NvQSwZWe4vgvgU5lNfRAM7sB1AyVhpUatYhahjfcXQPD/6 ieFHxmsCsbRGFCv+1AunvbGQpIvXeKy94Gw5N+gpQMWhlFHj8uj8jtqcTZB+cyuZcHmd EhcQ== X-Gm-Message-State: AOJu0YzPjMMKT7QuwbVJ/9cpekaL128MyWr8L4FVAxvig9ChN3V9oKvH Tvv8yMydsMx74i9xaAw6WIvxEg== X-Google-Smtp-Source: AGHT+IEOXch3zKD+qAyCHDnaLeYtThJTFT4ZKIiazXb5IXi2lINxPtMnVCw9yz/JkYptgvDn6gPBaA== X-Received: by 2002:a17:903:1252:b0:1cc:6597:f40b with SMTP id u18-20020a170903125200b001cc6597f40bmr232616plh.36.1699393266238; Tue, 07 Nov 2023 13:41:06 -0800 (PST) Received: from localhost (fwdproxy-prn-120.fbsv.net. [2a03:2880:ff:78::face:b00c]) by smtp.gmail.com with ESMTPSA id q13-20020a17090311cd00b001b016313b1dsm276284plh.86.2023.11.07.13.41.05 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Tue, 07 Nov 2023 13:41:06 -0800 (PST) From: David Wei To: io-uring@vger.kernel.org, netdev@vger.kernel.org Cc: Jens Axboe , Pavel Begunkov , Jakub Kicinski , Paolo Abeni , "David S. Miller" , Eric Dumazet , Jesper Dangaard Brouer , David Ahern , Mina Almasry , Willem de Bruijn , Dragos Tatulea Subject: [PATCH 11/20] net: add data pool Date: Tue, 7 Nov 2023 13:40:36 -0800 Message-Id: <20231107214045.2172393-12-dw@davidwei.uk> X-Mailer: git-send-email 2.39.3 In-Reply-To: <20231107214045.2172393-1-dw@davidwei.uk> References: <20231107214045.2172393-1-dw@davidwei.uk> Precedence: bulk X-Mailing-List: netdev@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 X-Patchwork-Delegate: kuba@kernel.org Add a struct data_pool that holds both a page_pool and an ifq (by extension, a ZC pool). Each hardware Rx queue configured for ZC will have one data_pool, set in its struct netdev_rx_queue. Payload hardware Rx queues are filled from the ZC pool, while header Rx queues are filled from the page_pool as normal. 
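
As an illustration of how a driver would consume this, filling a payload descriptor might look roughly like the following (a sketch only: the descriptor layout and function name are hypothetical, the data_pool_* helpers are the ones added below):

struct my_rx_desc {                     /* hypothetical descriptor layout */
        dma_addr_t dma_addr;
        unsigned int len;
};

static int my_fill_payload_desc(struct data_pool *dp, struct my_rx_desc *desc)
{
        /* comes from the ZC pool when an ifq is attached, otherwise from
         * the regular page_pool - the data_pool helpers hide which
         */
        struct page *page = data_pool_alloc_page(dp);

        if (!page)
                return -ENOMEM;
        desc->dma_addr = data_pool_get_dma_addr(dp, page);
        desc->len = PAGE_SIZE;
        return 0;
}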
Co-developed-by: Pavel Begunkov Signed-off-by: Pavel Begunkov Signed-off-by: David Wei --- include/net/data_pool.h | 74 +++++++++++++++++++++++++++++++++++++++++ 1 file changed, 74 insertions(+) create mode 100644 include/net/data_pool.h diff --git a/include/net/data_pool.h b/include/net/data_pool.h new file mode 100644 index 000000000000..bf2dff23724a --- /dev/null +++ b/include/net/data_pool.h @@ -0,0 +1,74 @@ +#ifndef _DATA_POOL_H +#define _DATA_POOL_H + +#include +#include +#include +#include +#include + +struct data_pool { + struct page_pool *page_pool; + struct io_zc_rx_ifq *zc_ifq; + struct ubuf_info *zc_uarg; +}; + +static inline struct page *data_pool_alloc_page(struct data_pool *dp) +{ + if (dp->zc_ifq) { + struct io_zc_rx_buf *buf; + + buf = io_zc_rx_get_buf(dp->zc_ifq); + if (!buf) + return NULL; + return buf->page; + } else { + return page_pool_dev_alloc_pages(dp->page_pool); + } +} + +static inline void data_pool_fragment_page(struct data_pool *dp, + struct page *page, + unsigned long bias) +{ + if (dp->zc_ifq) { + struct io_zc_rx_buf *buf; + + buf = io_zc_rx_buf_from_page(dp->zc_ifq, page); + atomic_set(&buf->refcount, bias); + } else { + page_pool_fragment_page(page, bias); + } +} + +static inline void data_pool_put_page(struct data_pool *dp, struct page *page) +{ + if (dp->zc_ifq) { + struct io_zc_rx_buf *buf; + + buf = io_zc_rx_buf_from_page(dp->zc_ifq, page); + if (!buf) + page_pool_recycle_direct(dp->page_pool, page); + else + io_zc_rx_put_buf(dp->zc_ifq, buf); + } else { + WARN_ON_ONCE(page->pp_magic != PP_SIGNATURE); + + page_pool_recycle_direct(dp->page_pool, page); + } +} + +static inline dma_addr_t data_pool_get_dma_addr(struct data_pool *dp, + struct page *page) +{ + if (dp->zc_ifq) { + struct io_zc_rx_buf *buf; + + buf = io_zc_rx_buf_from_page(dp->zc_ifq, page); + return io_zc_rx_buf_dma(buf); + } else { + return page_pool_get_dma_addr(page); + } +} + +#endif From patchwork Tue Nov 7 21:40:37 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: David Wei X-Patchwork-Id: 13449356 X-Patchwork-Delegate: kuba@kernel.org Received: from lindbergh.monkeyblade.net (lindbergh.monkeyblade.net [23.128.96.19]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 6A5A02B2F7 for ; Tue, 7 Nov 2023 21:41:08 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=davidwei-uk.20230601.gappssmtp.com header.i=@davidwei-uk.20230601.gappssmtp.com header.b="gDtc3u1g" Received: from mail-oo1-xc32.google.com (mail-oo1-xc32.google.com [IPv6:2607:f8b0:4864:20::c32]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id E93C810DB for ; Tue, 7 Nov 2023 13:41:07 -0800 (PST) Received: by mail-oo1-xc32.google.com with SMTP id 006d021491bc7-586940ee5a5so3143043eaf.0 for ; Tue, 07 Nov 2023 13:41:07 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=davidwei-uk.20230601.gappssmtp.com; s=20230601; t=1699393267; x=1699998067; darn=vger.kernel.org; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:from:to:cc:subject:date :message-id:reply-to; bh=s+zqFsNP24K5w+diErx4CzW6YBVojNR7U9nZFqFAmk4=; b=gDtc3u1gMwnzHQL3L1vzps0uGTtOkdnxpt8nJUVFBNtNQj78wjpNz5u9LWTDIlYE3u t0NmiEwIpRsclDiqPAOp8Tcl3/0rYEnULju6hybc1Mhdwf34rjZAPTYd3BB/I0dz74/I zNTIdh7/JjSu3uBz0lOSInTax6IA42enVOw/IH30DW7NaeOicGrGz4oXQs7Zry5F9a4K 
AISif66Agc4kWID1+PnHkYxUMISeWHUbQCt2S9NkX1rh6jKHyItqYkGbrZAn0GxqGptS Osm9ZcxKM3qQKHgWZeqMooedYFJXqdRidR+N5Hw9laPbeJsydagu+ijTS/sdGuORigGH Zt3w== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1699393267; x=1699998067; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=s+zqFsNP24K5w+diErx4CzW6YBVojNR7U9nZFqFAmk4=; b=RZQsYY0APXsDevcT6WOhKsN92cm2cDZyWT5DPeEkUoJWYxOWX5eJthi/36eNJXMWFp GVQxTqSjd1DyhMaGDsBYZfuWYmN9yJfDRLQURVU50UMoR/0DVEPbQ4fEk57sIFi5MWKt exB6xegtJ9PKeHziRmFL5gwgK4LmvP8NuCWC9ttuhxTA4ejTWhty1seKJflNK/cYjY1n nJxm+wPo1DTwMiHWB5CQsni9I53pcVQLfLJ0Jr3y89YTxEsMk3bLNEg4PUD2ncBFCw/n aDTQo1qObH8i/twsYYZaZZYIj8QHaYu++7h02je3t9aJP+fv0l8K5ykQoAC0e/tDc+QR +j+g== X-Gm-Message-State: AOJu0YwlWNwkr7QU3JX4pVrwd8lkNQpD3XqUIBGCjbs3IvzG0n7tRJk/ aBSWC6vwev0SEM+E8mVUmERGBg== X-Google-Smtp-Source: AGHT+IEqd2GTfWpxVSSb6gx/b5KY6FAhjT7V3Wr2G0n7DZUoTC7yAI6Pbm2twIB7w7C+XmvVpNJCTQ== X-Received: by 2002:a05:6358:904c:b0:16b:858c:a140 with SMTP id f12-20020a056358904c00b0016b858ca140mr1333273rwf.9.1699393267134; Tue, 07 Nov 2023 13:41:07 -0800 (PST) Received: from localhost (fwdproxy-prn-119.fbsv.net. [2a03:2880:ff:77::face:b00c]) by smtp.gmail.com with ESMTPSA id m33-20020a635821000000b0059d219cb359sm1816160pgb.9.2023.11.07.13.41.06 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Tue, 07 Nov 2023 13:41:06 -0800 (PST) From: David Wei To: io-uring@vger.kernel.org, netdev@vger.kernel.org Cc: Jens Axboe , Pavel Begunkov , Jakub Kicinski , Paolo Abeni , "David S. Miller" , Eric Dumazet , Jesper Dangaard Brouer , David Ahern , Mina Almasry , Willem de Bruijn , Dragos Tatulea Subject: [PATCH 12/20] io_uring: add io_recvzc request Date: Tue, 7 Nov 2023 13:40:37 -0800 Message-Id: <20231107214045.2172393-13-dw@davidwei.uk> X-Mailer: git-send-email 2.39.3 In-Reply-To: <20231107214045.2172393-1-dw@davidwei.uk> References: <20231107214045.2172393-1-dw@davidwei.uk> Precedence: bulk X-Mailing-List: netdev@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 This patch adds an io_uring opcode OP_RECV_ZC for doing ZC reads from a socket that is set up for ZC Rx. The request reads skbs from a socket where its page frags are tagged w/ a magic cookie in their page private field. For each frag, entries are written into the ifq rbuf completion ring, and the total number of bytes read is returned to user as an io_uring completion event. Multishot requests work. There is no need to specify provided buffers as data is returned in the ifq rbuf completion rings. Userspace is expected to look into the ifq rbuf completion ring when it receives an io_uring completion event. The addr3 field is used to encode params in the following format: addr3 = (readlen << 32) | ifq_id; readlen is the max amount of data to read from the socket. ifq_id is the interface queue id, and currently only 0 is supported. 
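
For example, userspace could queue a multishot ZC receive along these lines (a sketch assuming liburing; IORING_OP_RECV_ZC is the opcode added here, the ring is assumed to have been created with IORING_SETUP_DEFER_TASKRUN, and error handling is omitted):

#include <liburing.h>

/* Read up to read_len bytes from the socket via ifq 0. The CQE only carries
 * the byte count; the data locations arrive in the rbuf completion ring.
 */
static void queue_recv_zc(struct io_uring *ring, int sockfd, unsigned read_len)
{
        struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

        io_uring_prep_rw(IORING_OP_RECV_ZC, sqe, sockfd, NULL, 0, 0);
        sqe->ioprio = IORING_RECV_MULTISHOT;
        sqe->addr3 = ((__u64)read_len << 32) | 0;       /* readlen << 32 | ifq_id */
}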
Co-developed-by: Pavel Begunkov Signed-off-by: Pavel Begunkov Signed-off-by: David Wei --- include/uapi/linux/io_uring.h | 1 + io_uring/net.c | 122 ++++++++++++++++- io_uring/opdef.c | 16 +++ io_uring/zc_rx.c | 239 ++++++++++++++++++++++++++++++++++ io_uring/zc_rx.h | 4 + 5 files changed, 376 insertions(+), 6 deletions(-) diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h index 917d0025cc94..603d07d0a791 100644 --- a/include/uapi/linux/io_uring.h +++ b/include/uapi/linux/io_uring.h @@ -240,6 +240,7 @@ enum io_uring_op { IORING_OP_URING_CMD, IORING_OP_SEND_ZC, IORING_OP_SENDMSG_ZC, + IORING_OP_RECV_ZC, /* this goes last, obviously */ IORING_OP_LAST, diff --git a/io_uring/net.c b/io_uring/net.c index fc0b7936971d..79f2ed3a6fc0 100644 --- a/io_uring/net.c +++ b/io_uring/net.c @@ -70,6 +70,16 @@ struct io_sr_msg { struct io_kiocb *notif; }; +struct io_recvzc { + struct file *file; + unsigned len; + unsigned done_io; + unsigned msg_flags; + u16 flags; + + u32 datalen; +}; + static inline bool io_check_multishot(struct io_kiocb *req, unsigned int issue_flags) { @@ -588,7 +598,8 @@ int io_recvmsg_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) if (sr->msg_flags & MSG_ERRQUEUE) req->flags |= REQ_F_CLEAR_POLLIN; if (sr->flags & IORING_RECV_MULTISHOT) { - if (!(req->flags & REQ_F_BUFFER_SELECT)) + if (!(req->flags & REQ_F_BUFFER_SELECT) + && req->opcode != IORING_OP_RECV_ZC) return -EINVAL; if (sr->msg_flags & MSG_WAITALL) return -EINVAL; @@ -636,7 +647,7 @@ static inline bool io_recv_finish(struct io_kiocb *req, int *ret, unsigned int cflags; cflags = io_put_kbuf(req, issue_flags); - if (msg->msg_inq && msg->msg_inq != -1) + if (msg && msg->msg_inq && msg->msg_inq != -1) cflags |= IORING_CQE_F_SOCK_NONEMPTY; if (!(req->flags & REQ_F_APOLL_MULTISHOT)) { @@ -651,7 +662,7 @@ static inline bool io_recv_finish(struct io_kiocb *req, int *ret, io_recv_prep_retry(req); /* Known not-empty or unknown state, retry */ if (cflags & IORING_CQE_F_SOCK_NONEMPTY || - msg->msg_inq == -1) + (msg && msg->msg_inq == -1)) return false; if (issue_flags & IO_URING_F_MULTISHOT) *ret = IOU_ISSUE_SKIP_COMPLETE; @@ -955,9 +966,8 @@ int io_recv(struct io_kiocb *req, unsigned int issue_flags) return ret; } -static __maybe_unused -struct io_zc_rx_ifq *io_zc_verify_sock(struct io_kiocb *req, - struct socket *sock) +static struct io_zc_rx_ifq *io_zc_verify_sock(struct io_kiocb *req, + struct socket *sock) { unsigned token = READ_ONCE(sock->zc_rx_idx); unsigned ifq_idx = token >> IO_ZC_IFQ_IDX_OFFSET; @@ -974,6 +984,106 @@ struct io_zc_rx_ifq *io_zc_verify_sock(struct io_kiocb *req, return ifq; } +int io_recvzc_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) +{ + struct io_recvzc *zc = io_kiocb_to_cmd(req, struct io_recvzc); + u64 recvzc_cmd; + + recvzc_cmd = READ_ONCE(sqe->addr3); + zc->datalen = recvzc_cmd >> 32; + if (recvzc_cmd & 0xffff) + return -EINVAL; + if (!(req->ctx->flags & IORING_SETUP_DEFER_TASKRUN)) + return -EINVAL; + if (unlikely(sqe->file_index || sqe->addr2)) + return -EINVAL; + + zc->len = READ_ONCE(sqe->len); + zc->flags = READ_ONCE(sqe->ioprio); + if (zc->flags & ~(RECVMSG_FLAGS)) + return -EINVAL; + zc->msg_flags = READ_ONCE(sqe->msg_flags); + if (zc->msg_flags & MSG_DONTWAIT) + req->flags |= REQ_F_NOWAIT; + if (zc->msg_flags & MSG_ERRQUEUE) + req->flags |= REQ_F_CLEAR_POLLIN; + if (zc->flags & IORING_RECV_MULTISHOT) { + if (zc->msg_flags & MSG_WAITALL) + return -EINVAL; + if (req->opcode == IORING_OP_RECV && zc->len) + return -EINVAL; + req->flags |= 
REQ_F_APOLL_MULTISHOT; + } + +#ifdef CONFIG_COMPAT + if (req->ctx->compat) + zc->msg_flags |= MSG_CMSG_COMPAT; +#endif + zc->done_io = 0; + return 0; +} + +int io_recvzc(struct io_kiocb *req, unsigned int issue_flags) +{ + struct io_recvzc *zc = io_kiocb_to_cmd(req, struct io_recvzc); + struct socket *sock; + unsigned flags; + int ret, min_ret = 0; + bool force_nonblock = issue_flags & IO_URING_F_NONBLOCK; + struct io_zc_rx_ifq *ifq; + + if (issue_flags & IO_URING_F_UNLOCKED) + return -EAGAIN; + + if (!(req->flags & REQ_F_POLLED) && + (zc->flags & IORING_RECVSEND_POLL_FIRST)) + return -EAGAIN; + + sock = sock_from_file(req->file); + if (unlikely(!sock)) + return -ENOTSOCK; + ifq = io_zc_verify_sock(req, sock); + if (!ifq) + return -EINVAL; + +retry_multishot: + flags = zc->msg_flags; + if (force_nonblock) + flags |= MSG_DONTWAIT; + if (flags & MSG_WAITALL) + min_ret = zc->len; + + ret = io_zc_rx_recv(sock, zc->datalen, flags); + if (ret < min_ret) { + if (ret == -EAGAIN && force_nonblock) { + if (issue_flags & IO_URING_F_MULTISHOT) + return IOU_ISSUE_SKIP_COMPLETE; + return -EAGAIN; + } + if (ret > 0 && io_net_retry(sock, flags)) { + zc->len -= ret; + zc->done_io += ret; + req->flags |= REQ_F_PARTIAL_IO; + return -EAGAIN; + } + if (ret == -ERESTARTSYS) + ret = -EINTR; + req_set_fail(req); + } else if ((flags & MSG_WAITALL) && (flags & (MSG_TRUNC | MSG_CTRUNC))) { + req_set_fail(req); + } + + if (ret > 0) + ret += zc->done_io; + else if (zc->done_io) + ret = zc->done_io; + + if (!io_recv_finish(req, &ret, 0, ret <= 0, issue_flags)) + goto retry_multishot; + + return ret; +} + void io_send_zc_cleanup(struct io_kiocb *req) { struct io_sr_msg *zc = io_kiocb_to_cmd(req, struct io_sr_msg); diff --git a/io_uring/opdef.c b/io_uring/opdef.c index 3b9c6489b8b6..4dee7f83222f 100644 --- a/io_uring/opdef.c +++ b/io_uring/opdef.c @@ -33,6 +33,7 @@ #include "poll.h" #include "cancel.h" #include "rw.h" +#include "zc_rx.h" static int io_no_issue(struct io_kiocb *req, unsigned int issue_flags) { @@ -426,6 +427,18 @@ const struct io_issue_def io_issue_defs[] = { .issue = io_sendmsg_zc, #else .prep = io_eopnotsupp_prep, +#endif + }, + [IORING_OP_RECV_ZC] = { + .needs_file = 1, + .unbound_nonreg_file = 1, + .pollin = 1, + .ioprio = 1, +#if defined(CONFIG_NET) + .prep = io_recvzc_prep, + .issue = io_recvzc, +#else + .prep = io_eopnotsupp_prep, #endif }, }; @@ -648,6 +661,9 @@ const struct io_cold_def io_cold_defs[] = { .fail = io_sendrecv_fail, #endif }, + [IORING_OP_RECV_ZC] = { + .name = "RECV_ZC", + }, }; const char *io_uring_get_opcode(u8 opcode) diff --git a/io_uring/zc_rx.c b/io_uring/zc_rx.c index bebcd637c893..842aae760deb 100644 --- a/io_uring/zc_rx.c +++ b/io_uring/zc_rx.c @@ -6,6 +6,7 @@ #include #include #include +#include #include @@ -40,6 +41,13 @@ struct io_zc_rx_pool { u32 freelist[]; }; +static inline u32 io_zc_rx_cqring_entries(struct io_zc_rx_ifq *ifq) +{ + struct io_rbuf_ring *ring = ifq->ring; + + return ifq->cached_cq_tail - READ_ONCE(ring->cq.head); +} + static inline struct device *netdev2dev(struct net_device *dev) { return dev->dev.parent; @@ -311,6 +319,8 @@ int io_register_zc_rx_ifq(struct io_ring_ctx *ctx, size_t ring_sz, rqes_sz, cqes_sz; int ret; + if (!(ctx->flags & IORING_SETUP_DEFER_TASKRUN)) + return -EINVAL; if (copy_from_user(®, arg, sizeof(reg))) return -EFAULT; if (ctx->ifq) @@ -444,6 +454,14 @@ int io_register_zc_rx_sock(struct io_ring_ctx *ctx, return 0; } +static void io_zc_rx_get_buf_uref(struct io_zc_rx_pool *pool, u32 pgid) +{ + if (WARN_ON(pgid >= pool->nr_pages)) + 
return; + + atomic_add(IO_ZC_RX_UREF, &pool->bufs[pgid].refcount); +} + static bool io_zc_rx_put_buf_uref(struct io_zc_rx_buf *buf) { if (atomic_read(&buf->refcount) < IO_ZC_RX_UREF) @@ -549,4 +567,225 @@ struct io_zc_rx_buf *io_zc_rx_buf_from_page(struct io_zc_rx_ifq *ifq, } EXPORT_SYMBOL(io_zc_rx_buf_from_page); +static struct io_zc_rx_ifq *io_zc_rx_ifq_skb(struct sk_buff *skb) +{ + struct ubuf_info *uarg = skb_zcopy(skb); + + if (uarg && uarg->callback == io_zc_rx_skb_free) + return container_of(uarg, struct io_zc_rx_ifq, uarg); + return NULL; +} + +static int zc_rx_recv_frag(struct io_zc_rx_ifq *ifq, const skb_frag_t *frag, + int off, int len) +{ + struct io_uring_rbuf_cqe *cqe; + unsigned int cq_idx, queued, free, entries; + struct page *page; + unsigned int mask; + u32 pgid; + + page = skb_frag_page(frag); + off += skb_frag_off(frag); + + if (likely(ifq && is_zc_rx_page(page))) { + mask = ifq->cq_entries - 1; + pgid = page_private(page) & 0xffffffff; + io_zc_rx_get_buf_uref(ifq->pool, pgid); + cq_idx = ifq->cached_cq_tail & mask; + smp_rmb(); + queued = min(io_zc_rx_cqring_entries(ifq), ifq->cq_entries); + free = ifq->cq_entries - queued; + entries = min(free, ifq->cq_entries - cq_idx); + if (!entries) + return -ENOBUFS; + cqe = &ifq->cqes[cq_idx]; + ifq->cached_cq_tail++; + cqe->region = 0; + cqe->off = pgid * PAGE_SIZE + off; + cqe->len = len; + cqe->flags = 0; + } else { + /* TODO: copy frags that aren't backed by zc pages */ + WARN_ON_ONCE(1); + return -ENOMEM; + } + + return len; +} + +static int +zc_rx_recv_skb(read_descriptor_t *desc, struct sk_buff *skb, + unsigned int offset, size_t len) +{ + struct io_zc_rx_ifq *ifq; + struct sk_buff *frag_iter; + unsigned start, start_off; + int i, copy, end, off; + int ret = 0; + + ifq = io_zc_rx_ifq_skb(skb); + if (!ifq) { + pr_debug("non zerocopy pages are not supported\n"); + return -EFAULT; + } + start = skb_headlen(skb); + start_off = offset; + + // TODO: copy payload in skb linear data */ + WARN_ON_ONCE(offset < start); + + for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) { + const skb_frag_t *frag; + + WARN_ON(start > offset + len); + + frag = &skb_shinfo(skb)->frags[i]; + end = start + skb_frag_size(frag); + + if (offset < end) { + copy = end - offset; + if (copy > len) + copy = len; + + off = offset - start; + ret = zc_rx_recv_frag(ifq, frag, off, copy); + if (ret < 0) + goto out; + + offset += ret; + len -= ret; + if (len == 0 || ret != copy) + goto out; + } + start = end; + } + + skb_walk_frags(skb, frag_iter) { + WARN_ON(start > offset + len); + + end = start + frag_iter->len; + if (offset < end) { + copy = end - offset; + if (copy > len) + copy = len; + + off = offset - start; + ret = zc_rx_recv_skb(desc, frag_iter, off, copy); + if (ret < 0) + goto out; + + offset += ret; + len -= ret; + if (len == 0 || ret != copy) + goto out; + } + start = end; + } + +out: + smp_store_release(&ifq->ring->cq.tail, ifq->cached_cq_tail); + if (offset == start_off) + return ret; + return offset - start_off; +} + +static int io_zc_rx_tcp_read(struct sock *sk) +{ + read_descriptor_t rd_desc = { + .count = 1, + }; + + return tcp_read_sock(sk, &rd_desc, zc_rx_recv_skb); +} + +static int io_zc_rx_tcp_recvmsg(struct sock *sk, unsigned int recv_limit, + int flags, int *addr_len) +{ + size_t used; + long timeo; + int ret; + + ret = used = 0; + + lock_sock(sk); + + timeo = sock_rcvtimeo(sk, flags & MSG_DONTWAIT); + while (recv_limit) { + ret = io_zc_rx_tcp_read(sk); + if (ret < 0) + break; + if (!ret) { + if (used) + break; + if (sock_flag(sk, 
SOCK_DONE)) + break; + if (sk->sk_err) { + ret = sock_error(sk); + break; + } + if (sk->sk_shutdown & RCV_SHUTDOWN) + break; + if (sk->sk_state == TCP_CLOSE) { + ret = -ENOTCONN; + break; + } + if (!timeo) { + ret = -EAGAIN; + break; + } + if (!skb_queue_empty(&sk->sk_receive_queue)) + break; + sk_wait_data(sk, &timeo, NULL); + if (signal_pending(current)) { + ret = sock_intr_errno(timeo); + break; + } + continue; + } + recv_limit -= ret; + used += ret; + + if (!timeo) + break; + release_sock(sk); + lock_sock(sk); + + if (sk->sk_err || sk->sk_state == TCP_CLOSE || + (sk->sk_shutdown & RCV_SHUTDOWN) || + signal_pending(current)) + break; + } + + release_sock(sk); + + /* TODO: handle timestamping */ + + if (used) + return used; + + return ret; +} + +int io_zc_rx_recv(struct socket *sock, unsigned int limit, unsigned int flags) +{ + struct sock *sk = sock->sk; + const struct proto *prot; + int addr_len = 0; + int ret; + + if (flags & MSG_ERRQUEUE) + return -EOPNOTSUPP; + + prot = READ_ONCE(sk->sk_prot); + if (prot->recvmsg != tcp_recvmsg) + return -EPROTONOSUPPORT; + + sock_rps_record_flow(sk); + + ret = io_zc_rx_tcp_recvmsg(sk, limit, flags, &addr_len); + + return ret; +} + #endif diff --git a/io_uring/zc_rx.h b/io_uring/zc_rx.h index b99be0227e9e..bfba21c370b0 100644 --- a/io_uring/zc_rx.h +++ b/io_uring/zc_rx.h @@ -60,4 +60,8 @@ static inline int io_register_zc_rx_sock(struct io_ring_ctx *ctx, } #endif +int io_recvzc(struct io_kiocb *req, unsigned int issue_flags); +int io_recvzc_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe); +int io_zc_rx_recv(struct socket *sock, unsigned int limit, unsigned int flags); + #endif From patchwork Tue Nov 7 21:40:38 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: David Wei X-Patchwork-Id: 13449355 X-Patchwork-Delegate: kuba@kernel.org Received: from lindbergh.monkeyblade.net (lindbergh.monkeyblade.net [23.128.96.19]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id EB76436B14 for ; Tue, 7 Nov 2023 21:41:08 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=davidwei-uk.20230601.gappssmtp.com header.i=@davidwei-uk.20230601.gappssmtp.com header.b="TPoADq/V" Received: from mail-pl1-x633.google.com (mail-pl1-x633.google.com [IPv6:2607:f8b0:4864:20::633]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 8839910D0 for ; Tue, 7 Nov 2023 13:41:08 -0800 (PST) Received: by mail-pl1-x633.google.com with SMTP id d9443c01a7336-1cc316ccc38so50511475ad.1 for ; Tue, 07 Nov 2023 13:41:08 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=davidwei-uk.20230601.gappssmtp.com; s=20230601; t=1699393268; x=1699998068; darn=vger.kernel.org; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:from:to:cc:subject:date :message-id:reply-to; bh=H88/FIrlrOF9cEijoa57DWCiKAkLO7WMP992r2a8IF0=; b=TPoADq/V/11CTgjT/bXQohO+wcWOl7AGkhcHzFxCxCxvNvwcYR+hkNSuqt2vfeXyBY JLclsyMA92/c5mY5i6a6nfpJKwXY4WTNIUOlZQgp7J2WIMsWnGsi5jwRCKYdlpENXihb xMQyKWiMkUF5SDagEbBU3gesND8DPltcJifCDhXwNGqjTDvsunP8z6UDuxEGWRqg9E/+ 25ZSzAE6qQTKeh5ebWD0jP7Hul9Nf+6fVHpszTHDb3kZ22d4XfqcqLHyifkvSw/xHD9B 1xwFjcnWkse6dQ9Dhb/Y8jzG0mdgXYLYwJgx7i2CVleGXVl2El2mAXzCUAdEE3tTr+xO Quig== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1699393268; x=1699998068; 
h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=H88/FIrlrOF9cEijoa57DWCiKAkLO7WMP992r2a8IF0=; b=rH4IP1ypa3iEtjcu/arSaPytWkUDbTXP9f1m+3jXlcOaourSeMdZnBT7J2AFDIgBNz TJwu5QT4mWt2tkEtRr4zsI0WjfGAHRLGeLY3Ztf9D3AO5HKFg+yCHx2FWFmY5Ufldop9 8ddXz7+eju8Eo9LDkwd5u6eNjfTXRBZpcvgKs6/BbXftVKaH3yR66o5E5vuEvNIgzp2u APdtu/UAmz3fy6a1WAtURCr6fwaSivwFMwwKtzWBr1omzLJ2zry9tLygP7bo5h/GxDYo XesnGS05QdIOKQglgUWcxHMWimbfkznF7tBe1P0Ej83U2rocxv51Gwujoz+tx3HYR/B2 gpdA== X-Gm-Message-State: AOJu0YzxRwPLJgoSo2XokLOCBZt03o5L9WutrAl0etWnz8gb+Y741FvR HumsSeQFSE4+ZaqcSwTqDBwsyg== X-Google-Smtp-Source: AGHT+IHl9L3su6ib7CEqSX0f32m1v8z2zQFvHUGj1R2HWuJaFei56dCXR3qaXd6c5WDITbtZTIWrZA== X-Received: by 2002:a17:902:8f8c:b0:1cc:665d:f818 with SMTP id z12-20020a1709028f8c00b001cc665df818mr181332plo.68.1699393267988; Tue, 07 Nov 2023 13:41:07 -0800 (PST) Received: from localhost (fwdproxy-prn-001.fbsv.net. [2a03:2880:ff:1::face:b00c]) by smtp.gmail.com with ESMTPSA id w5-20020a170902e88500b001c5fe217fb9sm260355plg.267.2023.11.07.13.41.07 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Tue, 07 Nov 2023 13:41:07 -0800 (PST) From: David Wei To: io-uring@vger.kernel.org, netdev@vger.kernel.org Cc: Jens Axboe , Pavel Begunkov , Jakub Kicinski , Paolo Abeni , "David S. Miller" , Eric Dumazet , Jesper Dangaard Brouer , David Ahern , Mina Almasry , Willem de Bruijn , Dragos Tatulea Subject: [PATCH 13/20] io_uring/zcrx: propagate ifq down the stack Date: Tue, 7 Nov 2023 13:40:38 -0800 Message-Id: <20231107214045.2172393-14-dw@davidwei.uk> X-Mailer: git-send-email 2.39.3 In-Reply-To: <20231107214045.2172393-1-dw@davidwei.uk> References: <20231107214045.2172393-1-dw@davidwei.uk> Precedence: bulk X-Mailing-List: netdev@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 From: Pavel Begunkov We need to know the current ifq for copy fallback purposes, so pass it down from the issue callback down to zc_rx_recv_frag(). It'll also be needed in the future for notifications, accounting and so on. 
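
This reuses the standard tcp_read_sock() plumbing: whatever is stashed in the read_descriptor_t's arg.data is handed back to the actor for every skb, so no extra state is needed. A generic sketch of the idiom, with a made-up actor name:

static int my_actor(read_descriptor_t *desc, struct sk_buff *skb,
                    unsigned int offset, size_t len)
{
        struct io_zc_rx_ifq *ifq = desc->arg.data;      /* context from the caller */

        /* ... consume up to len bytes of skb starting at offset ... */
        return len;
}

static int my_read(struct sock *sk, struct io_zc_rx_ifq *ifq)
{
        read_descriptor_t rd_desc = { .count = 1, .arg.data = ifq };

        return tcp_read_sock(sk, &rd_desc, my_actor);
}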
Signed-off-by: Pavel Begunkov Signed-off-by: David Wei --- io_uring/net.c | 2 +- io_uring/zc_rx.c | 30 +++++++++++++++++++----------- io_uring/zc_rx.h | 3 ++- 3 files changed, 22 insertions(+), 13 deletions(-) diff --git a/io_uring/net.c b/io_uring/net.c index 79f2ed3a6fc0..e7b41c5826d5 100644 --- a/io_uring/net.c +++ b/io_uring/net.c @@ -1053,7 +1053,7 @@ int io_recvzc(struct io_kiocb *req, unsigned int issue_flags) if (flags & MSG_WAITALL) min_ret = zc->len; - ret = io_zc_rx_recv(sock, zc->datalen, flags); + ret = io_zc_rx_recv(ifq, sock, zc->datalen, flags); if (ret < min_ret) { if (ret == -EAGAIN && force_nonblock) { if (issue_flags & IO_URING_F_MULTISHOT) diff --git a/io_uring/zc_rx.c b/io_uring/zc_rx.c index 842aae760deb..038692d3265e 100644 --- a/io_uring/zc_rx.c +++ b/io_uring/zc_rx.c @@ -577,7 +577,7 @@ static struct io_zc_rx_ifq *io_zc_rx_ifq_skb(struct sk_buff *skb) } static int zc_rx_recv_frag(struct io_zc_rx_ifq *ifq, const skb_frag_t *frag, - int off, int len) + int off, int len, bool zc_skb) { struct io_uring_rbuf_cqe *cqe; unsigned int cq_idx, queued, free, entries; @@ -588,7 +588,7 @@ static int zc_rx_recv_frag(struct io_zc_rx_ifq *ifq, const skb_frag_t *frag, page = skb_frag_page(frag); off += skb_frag_off(frag); - if (likely(ifq && is_zc_rx_page(page))) { + if (likely(zc_skb && is_zc_rx_page(page))) { mask = ifq->cq_entries - 1; pgid = page_private(page) & 0xffffffff; io_zc_rx_get_buf_uref(ifq->pool, pgid); @@ -618,14 +618,19 @@ static int zc_rx_recv_skb(read_descriptor_t *desc, struct sk_buff *skb, unsigned int offset, size_t len) { - struct io_zc_rx_ifq *ifq; + struct io_zc_rx_ifq *ifq = desc->arg.data; + struct io_zc_rx_ifq *skb_ifq; struct sk_buff *frag_iter; unsigned start, start_off; int i, copy, end, off; + bool zc_skb = true; int ret = 0; - ifq = io_zc_rx_ifq_skb(skb); - if (!ifq) { + skb_ifq = io_zc_rx_ifq_skb(skb); + if (unlikely(ifq != skb_ifq)) { + zc_skb = false; + if (WARN_ON_ONCE(skb_ifq)) + return -EFAULT; pr_debug("non zerocopy pages are not supported\n"); return -EFAULT; } @@ -649,7 +654,7 @@ zc_rx_recv_skb(read_descriptor_t *desc, struct sk_buff *skb, copy = len; off = offset - start; - ret = zc_rx_recv_frag(ifq, frag, off, copy); + ret = zc_rx_recv_frag(ifq, frag, off, copy, zc_skb); if (ret < 0) goto out; @@ -690,16 +695,18 @@ zc_rx_recv_skb(read_descriptor_t *desc, struct sk_buff *skb, return offset - start_off; } -static int io_zc_rx_tcp_read(struct sock *sk) +static int io_zc_rx_tcp_read(struct io_zc_rx_ifq *ifq, struct sock *sk) { read_descriptor_t rd_desc = { .count = 1, + .arg.data = ifq, }; return tcp_read_sock(sk, &rd_desc, zc_rx_recv_skb); } -static int io_zc_rx_tcp_recvmsg(struct sock *sk, unsigned int recv_limit, +static int io_zc_rx_tcp_recvmsg(struct io_zc_rx_ifq *ifq, struct sock *sk, + unsigned int recv_limit, int flags, int *addr_len) { size_t used; @@ -712,7 +719,7 @@ static int io_zc_rx_tcp_recvmsg(struct sock *sk, unsigned int recv_limit, timeo = sock_rcvtimeo(sk, flags & MSG_DONTWAIT); while (recv_limit) { - ret = io_zc_rx_tcp_read(sk); + ret = io_zc_rx_tcp_read(ifq, sk); if (ret < 0) break; if (!ret) { @@ -767,7 +774,8 @@ static int io_zc_rx_tcp_recvmsg(struct sock *sk, unsigned int recv_limit, return ret; } -int io_zc_rx_recv(struct socket *sock, unsigned int limit, unsigned int flags) +int io_zc_rx_recv(struct io_zc_rx_ifq *ifq, struct socket *sock, + unsigned int limit, unsigned int flags) { struct sock *sk = sock->sk; const struct proto *prot; @@ -783,7 +791,7 @@ int io_zc_rx_recv(struct socket *sock, unsigned int limit, 
unsigned int flags) sock_rps_record_flow(sk); - ret = io_zc_rx_tcp_recvmsg(sk, limit, flags, &addr_len); + ret = io_zc_rx_tcp_recvmsg(ifq, sk, limit, flags, &addr_len); return ret; } diff --git a/io_uring/zc_rx.h b/io_uring/zc_rx.h index bfba21c370b0..fac32089e699 100644 --- a/io_uring/zc_rx.h +++ b/io_uring/zc_rx.h @@ -62,6 +62,7 @@ static inline int io_register_zc_rx_sock(struct io_ring_ctx *ctx, int io_recvzc(struct io_kiocb *req, unsigned int issue_flags); int io_recvzc_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe); -int io_zc_rx_recv(struct socket *sock, unsigned int limit, unsigned int flags); +int io_zc_rx_recv(struct io_zc_rx_ifq *ifq, struct socket *sock, + unsigned int limit, unsigned int flags); #endif From patchwork Tue Nov 7 21:40:39 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: David Wei X-Patchwork-Id: 13449357 X-Patchwork-Delegate: kuba@kernel.org Received: from lindbergh.monkeyblade.net (lindbergh.monkeyblade.net [23.128.96.19]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id E04BC2BCEE for ; Tue, 7 Nov 2023 21:41:09 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=davidwei-uk.20230601.gappssmtp.com header.i=@davidwei-uk.20230601.gappssmtp.com header.b="T3es60nL" Received: from mail-pf1-x430.google.com (mail-pf1-x430.google.com [IPv6:2607:f8b0:4864:20::430]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 7375510DF for ; Tue, 7 Nov 2023 13:41:09 -0800 (PST) Received: by mail-pf1-x430.google.com with SMTP id d2e1a72fcca58-6bee11456baso5662585b3a.1 for ; Tue, 07 Nov 2023 13:41:09 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=davidwei-uk.20230601.gappssmtp.com; s=20230601; t=1699393269; x=1699998069; darn=vger.kernel.org; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:from:to:cc:subject:date :message-id:reply-to; bh=0l/dAFHKr6I2GsT4AxouY6xWEkGOzCExNR1nSE25Ok4=; b=T3es60nLsItlyKtIKstM9Dy7R+HMqJjJqNJjyefe8gZ8YcDIlm0KSlczPvWiuep6MX NBEGvEvh9ZVML/vw70AKzzjW1F3guPq4L/expG7egO7mdJjP/RT1IHiCYNduBDeqhGkA syzSgF1qjFI6/nI+4/pDJQHfGVLIrxQs7nv4nJa/Ih0Avy2XNrVraKNvWZILlhhOZkzT HIVhR+eMlaQN5s+mw2M0hiPRzFzonu1InX+3Lj7kd2R2vxDgQdOQM230N1bgyNcryhKk a7O8kkRURK00FoBxOuL3WzOXNLkGwuvv6L+fR02gdyUt8tuxD9YEuOcoNLKERLHhiJrD SOog== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1699393269; x=1699998069; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=0l/dAFHKr6I2GsT4AxouY6xWEkGOzCExNR1nSE25Ok4=; b=go4IJpINDkwxAOYdyIvi69PlEB/xjEFwEGlIf+sFzYPFlh2esbCeMAJKYXk2Yu+h+6 EmJk4M/juPqGljhSpZxJIsjRkwMVFtn2JRUW25iY7dO7f4EboQfotuneyuCkadQKvH2A p8olR8TV4AZ6T20G6f8LY72YCRQH/WtZ/sLanDG2sV8xw1I3wMy+fPXVA54Bjyb2cn8Q pUBKu5N3VwnuJYpWhP0vdlsm9fGmElY5ExuBFfam7UMgd6yrM0ZJRt+TZ9NSJxhI1eLs WzX0NA8SG5uVf+R+NeR62SeCVBVUZLm9GULMbGE2rr0bx3JIFU9xQmDEbS0UvZjLIyPg +9EQ== X-Gm-Message-State: AOJu0YxzpUPY4H6PDKms24pk7363lLAOOTWbKWOmO+2CZvDgcCHTN9lO UsXfkhuv3dtzk0g8rU9C20yHCA== X-Google-Smtp-Source: AGHT+IGz9RIYPpPmCVytFlwZp2W7Q/hli9aDmu6Xh8XeTH4433Do+Hg+6z+vlEUQAjbpNk149S9u1A== X-Received: by 2002:a05:6a00:cc6:b0:68e:42c9:74e0 with SMTP id b6-20020a056a000cc600b0068e42c974e0mr339493pfv.3.1699393268935; Tue, 07 Nov 2023 13:41:08 -0800 (PST) Received: from 
localhost (fwdproxy-prn-017.fbsv.net. [2a03:2880:ff:11::face:b00c]) by smtp.gmail.com with ESMTPSA id n22-20020a635c56000000b005898e4acf2dsm1804225pgm.49.2023.11.07.13.41.08 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Tue, 07 Nov 2023 13:41:08 -0800 (PST) From: David Wei To: io-uring@vger.kernel.org, netdev@vger.kernel.org Cc: Jens Axboe , Pavel Begunkov , Jakub Kicinski , Paolo Abeni , "David S. Miller" , Eric Dumazet , Jesper Dangaard Brouer , David Ahern , Mina Almasry , Willem de Bruijn , Dragos Tatulea Subject: [PATCH 14/20] io_uring/zcrx: introduce io_zc_get_rbuf_cqe Date: Tue, 7 Nov 2023 13:40:39 -0800 Message-Id: <20231107214045.2172393-15-dw@davidwei.uk> X-Mailer: git-send-email 2.39.3 In-Reply-To: <20231107214045.2172393-1-dw@davidwei.uk> References: <20231107214045.2172393-1-dw@davidwei.uk> Precedence: bulk X-Mailing-List: netdev@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 From: Pavel Begunkov Add a simple helper for grabbing a new rbuf entry. It greatly helps zc_rx_recv_frag()'s readability and will be reused later Signed-off-by: Pavel Begunkov Signed-off-by: David Wei --- io_uring/zc_rx.c | 36 ++++++++++++++++++++++++------------ 1 file changed, 24 insertions(+), 12 deletions(-) diff --git a/io_uring/zc_rx.c b/io_uring/zc_rx.c index 038692d3265e..c1502ec3e629 100644 --- a/io_uring/zc_rx.c +++ b/io_uring/zc_rx.c @@ -576,31 +576,43 @@ static struct io_zc_rx_ifq *io_zc_rx_ifq_skb(struct sk_buff *skb) return NULL; } +static inline struct io_uring_rbuf_cqe *io_zc_get_rbuf_cqe(struct io_zc_rx_ifq *ifq) +{ + struct io_uring_rbuf_cqe *cqe; + unsigned int cq_idx, queued, free, entries; + unsigned int mask = ifq->cq_entries - 1; + + cq_idx = ifq->cached_cq_tail & mask; + smp_rmb(); + queued = min(io_zc_rx_cqring_entries(ifq), ifq->cq_entries); + free = ifq->cq_entries - queued; + entries = min(free, ifq->cq_entries - cq_idx); + if (!entries) + return NULL; + + cqe = &ifq->cqes[cq_idx]; + ifq->cached_cq_tail++; + return cqe; +} + static int zc_rx_recv_frag(struct io_zc_rx_ifq *ifq, const skb_frag_t *frag, int off, int len, bool zc_skb) { struct io_uring_rbuf_cqe *cqe; - unsigned int cq_idx, queued, free, entries; struct page *page; - unsigned int mask; u32 pgid; page = skb_frag_page(frag); off += skb_frag_off(frag); if (likely(zc_skb && is_zc_rx_page(page))) { - mask = ifq->cq_entries - 1; + cqe = io_zc_get_rbuf_cqe(ifq); + if (!cqe) + return -ENOBUFS; + pgid = page_private(page) & 0xffffffff; io_zc_rx_get_buf_uref(ifq->pool, pgid); - cq_idx = ifq->cached_cq_tail & mask; - smp_rmb(); - queued = min(io_zc_rx_cqring_entries(ifq), ifq->cq_entries); - free = ifq->cq_entries - queued; - entries = min(free, ifq->cq_entries - cq_idx); - if (!entries) - return -ENOBUFS; - cqe = &ifq->cqes[cq_idx]; - ifq->cached_cq_tail++; + cqe->region = 0; cqe->off = pgid * PAGE_SIZE + off; cqe->len = len; From patchwork Tue Nov 7 21:40:40 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: David Wei X-Patchwork-Id: 13449358 X-Patchwork-Delegate: kuba@kernel.org Received: from lindbergh.monkeyblade.net (lindbergh.monkeyblade.net [23.128.96.19]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 1A9092B2C3 for ; Tue, 7 Nov 2023 21:41:11 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=davidwei-uk.20230601.gappssmtp.com 
header.i=@davidwei-uk.20230601.gappssmtp.com header.b="hA4BzTZA" Received: from mail-oi1-x22a.google.com (mail-oi1-x22a.google.com [IPv6:2607:f8b0:4864:20::22a]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 904C510E6 for ; Tue, 7 Nov 2023 13:41:10 -0800 (PST) Received: by mail-oi1-x22a.google.com with SMTP id 5614622812f47-3b2ea7cca04so3891116b6e.2 for ; Tue, 07 Nov 2023 13:41:10 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=davidwei-uk.20230601.gappssmtp.com; s=20230601; t=1699393270; x=1699998070; darn=vger.kernel.org; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:from:to:cc:subject:date :message-id:reply-to; bh=K2OFTXg8XBavOP2tPlsdD5TmFis6HNSAe4rWWmEFvho=; b=hA4BzTZASh0LLFDHFG6EKQ9H2FUHjdRIwSm2OixwzcTzNh75FNPh4HJ6ts8obetJus /s/38G/+HimGJO2TB+5M5ug4uz9VyA4mSBGFepg9ApHbFF1M7pph2oIgVRluDTv8T//j D/d5Egl55KD2C8Q2TRsa13A3mpEMU/RL+yPJyOFvZQDFXPDJ84DOBrfceWhCkXPHH5p2 YVmMOP3QjJZQRmjQRjxPtUbShmxfzCb4Q4VFh2nB48R0E29siuI6Su0gD2tvGvFnQ9Qs ykful/SsIvMPJsoAjb8dYkIkI3D27TUYtavyVdcdL7KdDQw/TnUqlYVYEHV6VCGM+GzI df0Q== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1699393270; x=1699998070; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=K2OFTXg8XBavOP2tPlsdD5TmFis6HNSAe4rWWmEFvho=; b=MEmielRE5I+z3NazogRCpiJxuPC1u2DkfVvl8lCPSJ9nAVrKIK8kdjW8luyOrMgS5z hGs/QYyXLqYLED1M4MvVOP7hZcKjT/0JHrpd566gU7J4KVLQrui5cU8RE9Vi3smRJ64S OPwFgeVbzG+No/tX+FchdZm0mO6Owg4TEFtwSyl9pApXUUVMjnWp6GvH7UOaAtTGSleO c6RM3Mrq5BRhP4VSFK72JzZ1tfkFm78ZB1M3Nd4uKc0++Fx7HMX2/fYiqkbklRaf3L8Q m9lXvsqXr1EdCUi7sgGA72Iaz/gaiJRUs/fqBGUc1zVWNleHTZbZQXOri+uQszWh8hR8 1YWA== X-Gm-Message-State: AOJu0Yy2tR5vh59+A9KVUK/lOUz4tl3ndJ3dilwk4/80BzQEwCffXsZz ON7msBz+8uDJFBQo6ug0ItgxlQ== X-Google-Smtp-Source: AGHT+IEaaldIA3iOXxqK2zrInlxpweTaP+yU7mUkStGw/uDeYT3BPBfL6SPP3AY5ubOnpx1g9xzwtg== X-Received: by 2002:aca:1208:0:b0:3b2:f2a8:1a4c with SMTP id 8-20020aca1208000000b003b2f2a81a4cmr223003ois.44.1699393269875; Tue, 07 Nov 2023 13:41:09 -0800 (PST) Received: from localhost (fwdproxy-prn-018.fbsv.net. [2a03:2880:ff:12::face:b00c]) by smtp.gmail.com with ESMTPSA id y188-20020a6364c5000000b005b92ba3938dsm1836687pgb.77.2023.11.07.13.41.09 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Tue, 07 Nov 2023 13:41:09 -0800 (PST) From: David Wei To: io-uring@vger.kernel.org, netdev@vger.kernel.org Cc: Jens Axboe , Pavel Begunkov , Jakub Kicinski , Paolo Abeni , "David S. Miller" , Eric Dumazet , Jesper Dangaard Brouer , David Ahern , Mina Almasry , Willem de Bruijn , Dragos Tatulea Subject: [PATCH 15/20] io_uring/zcrx: add copy fallback Date: Tue, 7 Nov 2023 13:40:40 -0800 Message-Id: <20231107214045.2172393-16-dw@davidwei.uk> X-Mailer: git-send-email 2.39.3 In-Reply-To: <20231107214045.2172393-1-dw@davidwei.uk> References: <20231107214045.2172393-1-dw@davidwei.uk> Precedence: bulk X-Mailing-List: netdev@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 From: Pavel Begunkov Currently, if user fails to keep up with the network and doesn't refill the buffer ring fast enough the NIC/driver will start dropping packets. That might be too punishing, so let's fall back to non-zerocopy version by allowing the driver to do normal kernel allocations. Later, when we're in the task context doing zc_rx_recv_skb() we'll detect such pages and copy them into user specified buffers. 
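For orientation, the refill the paragraph above refers to is driven from userspace: the application hands buffers back to the kernel by posting entries to the shared refill (rq) ring. A rough userspace-side sketch follows; the entry layout and head/tail bookkeeping are simplified assumptions based on this series, not the exact uapi.

#include <stdint.h>

/* Hypothetical userspace helper: return a consumed buffer to the kernel via
 * the refill ring so the NIC does not run out of zerocopy buffers.
 */
struct rbuf_rqe {
	uint32_t off;
	uint32_t len;
	uint16_t region;
	uint8_t  __pad[6];
};

static int refill_buf(struct rbuf_rqe *rqes, uint32_t *ktail, uint32_t khead,
		      uint32_t entries, uint16_t region, uint32_t off,
		      uint32_t len)
{
	uint32_t tail = *ktail;

	if (tail - khead == entries)		/* refill ring is full */
		return -1;

	rqes[tail & (entries - 1)] = (struct rbuf_rqe) {
		.off = off, .len = len, .region = region,
	};
	/* a real implementation publishes the tail with a release barrier */
	*ktail = tail + 1;
	return 0;
}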
This patch implements the second (copy) part. It'll facilitate adoption and help the user to strike the balance between allocating the right amount of zerocopy buffers and being resilient to surges in traffic. Note, due to technical reasons, for now we're only using buffers from ->freelist, which is unreliable and is likely to fail with time. It'll be revised in later patches. Signed-off-by: Pavel Begunkov Signed-off-by: David Wei --- io_uring/zc_rx.c | 115 ++++++++++++++++++++++++++++++++++++++++++----- 1 file changed, 105 insertions(+), 10 deletions(-) diff --git a/io_uring/zc_rx.c b/io_uring/zc_rx.c index c1502ec3e629..c2ed600f0951 100644 --- a/io_uring/zc_rx.c +++ b/io_uring/zc_rx.c @@ -498,6 +498,26 @@ static void io_zc_rx_refill_cache(struct io_zc_rx_ifq *ifq, int count) pool->cache_count += filled; } +static struct io_zc_rx_buf *io_zc_get_buf_task_safe(struct io_zc_rx_ifq *ifq) +{ + struct io_zc_rx_pool *pool = ifq->pool; + struct io_zc_rx_buf *buf = NULL; + u32 pgid; + + if (!READ_ONCE(pool->free_count)) + return NULL; + + spin_lock_bh(&pool->freelist_lock); + if (pool->free_count) { + pool->free_count--; + pgid = pool->freelist[pool->free_count]; + buf = &pool->bufs[pgid]; + atomic_set(&buf->refcount, 1); + } + spin_unlock_bh(&pool->freelist_lock); + return buf; +} + struct io_zc_rx_buf *io_zc_rx_get_buf(struct io_zc_rx_ifq *ifq) { struct io_zc_rx_pool *pool = ifq->pool; @@ -576,6 +596,11 @@ static struct io_zc_rx_ifq *io_zc_rx_ifq_skb(struct sk_buff *skb) return NULL; } +static inline void io_zc_return_rbuf_cqe(struct io_zc_rx_ifq *ifq) +{ + ifq->cached_cq_tail--; +} + static inline struct io_uring_rbuf_cqe *io_zc_get_rbuf_cqe(struct io_zc_rx_ifq *ifq) { struct io_uring_rbuf_cqe *cqe; @@ -595,6 +620,51 @@ static inline struct io_uring_rbuf_cqe *io_zc_get_rbuf_cqe(struct io_zc_rx_ifq * return cqe; } +static ssize_t zc_rx_copy_chunk(struct io_zc_rx_ifq *ifq, void *data, + unsigned int offset, size_t len) +{ + size_t copy_size, copied = 0; + struct io_uring_rbuf_cqe *cqe; + struct io_zc_rx_buf *buf; + unsigned int pgid; + int ret = 0, off = 0; + u8 *vaddr; + + do { + cqe = io_zc_get_rbuf_cqe(ifq); + if (!cqe) { + ret = -ENOBUFS; + break; + } + buf = io_zc_get_buf_task_safe(ifq); + if (!buf) { + io_zc_return_rbuf_cqe(ifq); + ret = -ENOMEM; + break; + } + + vaddr = kmap_local_page(buf->page); + copy_size = min_t(size_t, PAGE_SIZE, len); + memcpy(vaddr, data + offset, copy_size); + kunmap_local(vaddr); + + pgid = page_private(buf->page) & 0xffffffff; + io_zc_rx_get_buf_uref(ifq->pool, pgid); + io_zc_rx_put_buf(ifq, buf); + + cqe->region = 0; + cqe->off = pgid * PAGE_SIZE + off; + cqe->len = copy_size; + cqe->flags = 0; + + offset += copy_size; + len -= copy_size; + copied += copy_size; + } while (offset < len); + + return copied ? copied : ret; +} + static int zc_rx_recv_frag(struct io_zc_rx_ifq *ifq, const skb_frag_t *frag, int off, int len, bool zc_skb) { @@ -618,9 +688,21 @@ static int zc_rx_recv_frag(struct io_zc_rx_ifq *ifq, const skb_frag_t *frag, cqe->len = len; cqe->flags = 0; } else { - /* TODO: copy frags that aren't backed by zc pages */ - WARN_ON_ONCE(1); - return -ENOMEM; + u32 p_off, p_len, t, copied = 0; + u8 *vaddr; + int ret = 0; + + skb_frag_foreach_page(frag, off, len, + page, p_off, p_len, t) { + vaddr = kmap_local_page(page); + ret = zc_rx_copy_chunk(ifq, vaddr, p_off, p_len); + kunmap_local(vaddr); + + if (ret < 0) + return copied ?
copied : ret; + copied += ret; + } + len = copied; } return len; @@ -633,7 +715,7 @@ zc_rx_recv_skb(read_descriptor_t *desc, struct sk_buff *skb, struct io_zc_rx_ifq *ifq = desc->arg.data; struct io_zc_rx_ifq *skb_ifq; struct sk_buff *frag_iter; - unsigned start, start_off; + unsigned start, start_off = offset; int i, copy, end, off; bool zc_skb = true; int ret = 0; @@ -643,14 +725,27 @@ zc_rx_recv_skb(read_descriptor_t *desc, struct sk_buff *skb, zc_skb = false; if (WARN_ON_ONCE(skb_ifq)) return -EFAULT; - pr_debug("non zerocopy pages are not supported\n"); - return -EFAULT; } - start = skb_headlen(skb); - start_off = offset; - // TODO: copy payload in skb linear data */ - WARN_ON_ONCE(offset < start); + if (unlikely(offset < skb_headlen(skb))) { + ssize_t copied; + size_t to_copy; + + to_copy = min_t(size_t, skb_headlen(skb) - offset, len); + copied = zc_rx_copy_chunk(ifq, skb->data, offset, to_copy); + if (copied < 0) { + ret = copied; + goto out; + } + offset += copied; + len -= copied; + if (!len) + goto out; + if (offset != skb_headlen(skb)) + goto out; + } + + start = skb_headlen(skb); for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) { const skb_frag_t *frag; From patchwork Tue Nov 7 21:40:41 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: David Wei X-Patchwork-Id: 13449360 X-Patchwork-Delegate: kuba@kernel.org Received: from lindbergh.monkeyblade.net (lindbergh.monkeyblade.net [23.128.96.19]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 8A7B92BD01 for ; Tue, 7 Nov 2023 21:41:11 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=davidwei-uk.20230601.gappssmtp.com header.i=@davidwei-uk.20230601.gappssmtp.com header.b="0qfxYbvB" Received: from mail-pl1-x630.google.com (mail-pl1-x630.google.com [IPv6:2607:f8b0:4864:20::630]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 2712510EA for ; Tue, 7 Nov 2023 13:41:11 -0800 (PST) Received: by mail-pl1-x630.google.com with SMTP id d9443c01a7336-1cc0e78ec92so41917055ad.3 for ; Tue, 07 Nov 2023 13:41:11 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=davidwei-uk.20230601.gappssmtp.com; s=20230601; t=1699393270; x=1699998070; darn=vger.kernel.org; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:from:to:cc:subject:date :message-id:reply-to; bh=DEYoqijzZJW9k3dNn5XO9gR9ltrXX/l1ozIXtxfwbU8=; b=0qfxYbvBi+FA5VLTVj5BUubHqOc0REVtaihsVsYyZEw8gMK3yYSvFhiRDwQTldCckd mUEF6RbihQb2N3f18LEW+YZesQBu8NkmWAImGfMOdB+K1+AR1DimFMenOlMnda1DtEPC wgPko3l31ONOiit4m2cn/oAG/0g8rx/xlsNHXj1QJtqybpFrBvAPHyqpe2dFAD6sy1LM VylFuHQXLqaOapxPH4Dd07SF8BtMZUjz+I8uW8PgObEAZtkHl0mSkHU8D2wqHO7fO1FK wj4eeK02VFgmxqYgSdOvTOQN4tSTfFXl+5pBOlmb/nR1Yg6bwIb/w/yddAiSqZuDN1x7 wyoA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1699393270; x=1699998070; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=DEYoqijzZJW9k3dNn5XO9gR9ltrXX/l1ozIXtxfwbU8=; b=cRFcbGeCRBTrsZ7vpGhW7O6+xGDwuUj2oD2Ch8gHALQPDBUu44yv8kxhhWxuCjN8G+ pIS1HoflS1hNjhuZJi6yEYKB9P42WEoUzemvMOLjkRN3xrJ6FIzdq7eBzQAbm7p8k7I/ 4tCTeUUBw4fY8cCCzP9RQ6hB07KHorMGJXax1w6rAajZ6gw99dxLx4QSXpRX4gXWmREw T2mfEdSgC6wU8TtiHhlGSDNfMR9Jr+ygYaDcY8mfzBwPc0as4j3E+jOPf2bauP32sox0 
Cja/FCOqMrDSYgq1wFKzmobqlLriFG7XOFe3he+1OB4ZFuDyJHzIHFgbbBMbg0AKnZ+t aojg== X-Gm-Message-State: AOJu0YyV74AmF/n8NztwJhBqbDRa3f8KDX/yPO6EM4uhaHrXbmj8Szl/ 9WqiucTZVXoWab16UElgULvIiQ== X-Google-Smtp-Source: AGHT+IEKSuqqWQvMVmIb/Hqh7B+Gxlpgb+OwjhNZAxtVTPcTPHKznyOvGf1WAmMO7Igy9ntVidWomg== X-Received: by 2002:a17:903:230c:b0:1cc:6e42:1413 with SMTP id d12-20020a170903230c00b001cc6e421413mr256995plh.57.1699393270702; Tue, 07 Nov 2023 13:41:10 -0800 (PST) Received: from localhost (fwdproxy-prn-001.fbsv.net. [2a03:2880:ff:1::face:b00c]) by smtp.gmail.com with ESMTPSA id u1-20020a170902e80100b001cc3a6813f8sm268781plg.154.2023.11.07.13.41.10 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Tue, 07 Nov 2023 13:41:10 -0800 (PST) From: David Wei To: io-uring@vger.kernel.org, netdev@vger.kernel.org Cc: Jens Axboe , Pavel Begunkov , Jakub Kicinski , Paolo Abeni , "David S. Miller" , Eric Dumazet , Jesper Dangaard Brouer , David Ahern , Mina Almasry , Willem de Bruijn , Dragos Tatulea Subject: [PATCH 16/20] net: execute custom callback from napi Date: Tue, 7 Nov 2023 13:40:41 -0800 Message-Id: <20231107214045.2172393-17-dw@davidwei.uk> X-Mailer: git-send-email 2.39.3 In-Reply-To: <20231107214045.2172393-1-dw@davidwei.uk> References: <20231107214045.2172393-1-dw@davidwei.uk> Precedence: bulk X-Mailing-List: netdev@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 X-Patchwork-Delegate: kuba@kernel.org From: Pavel Begunkov Sometimes we want to access a napi protected resource from task context like in the case of io_uring zc falling back to copy and accessing the buffer ring. Add a helper function that allows to execute a custom function from napi context by first stopping it similarly to napi_busy_loop(). Experimental and might go away after convertion to custom page pools. It has to share more code with napi_busy_loop(). It also might be spinning too long a better breaking mechanism. Signed-off-by: Pavel Begunkov Signed-off-by: David Wei --- include/net/busy_poll.h | 2 ++ net/core/dev.c | 51 +++++++++++++++++++++++++++++++++++++++++ 2 files changed, 53 insertions(+) diff --git a/include/net/busy_poll.h b/include/net/busy_poll.h index 4dabeb6c76d3..292c3b4eaa7a 100644 --- a/include/net/busy_poll.h +++ b/include/net/busy_poll.h @@ -47,6 +47,8 @@ bool sk_busy_loop_end(void *p, unsigned long start_time); void napi_busy_loop(unsigned int napi_id, bool (*loop_end)(void *, unsigned long), void *loop_end_arg, bool prefer_busy_poll, u16 budget); +void napi_execute(unsigned int napi_id, + bool (*cb)(void *), void *cb_arg); #else /* CONFIG_NET_RX_BUSY_POLL */ static inline unsigned long net_busy_loop_on(void) diff --git a/net/core/dev.c b/net/core/dev.c index 02949a929e7f..66397ac1d8fc 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -6291,6 +6291,57 @@ void napi_busy_loop(unsigned int napi_id, } EXPORT_SYMBOL(napi_busy_loop); +void napi_execute(unsigned int napi_id, + bool (*cb)(void *), void *cb_arg) +{ + bool done = false; + unsigned long val; + void *have_poll_lock = NULL; + struct napi_struct *napi; + + rcu_read_lock(); + napi = napi_by_id(napi_id); + if (!napi) + goto out; + + if (!IS_ENABLED(CONFIG_PREEMPT_RT)) + preempt_disable(); + for (;;) { + local_bh_disable(); + val = READ_ONCE(napi->state); + + /* If multiple threads are competing for this napi, + * we avoid dirtying napi->state as much as we can. 
+ */ + if (val & (NAPIF_STATE_DISABLE | NAPIF_STATE_SCHED | + NAPIF_STATE_IN_BUSY_POLL)) + goto restart; + + if (cmpxchg(&napi->state, val, + val | NAPIF_STATE_IN_BUSY_POLL | + NAPIF_STATE_SCHED) != val) + goto restart; + + have_poll_lock = netpoll_poll_lock(napi); + cb(cb_arg); + done = true; + gro_normal_list(napi); + local_bh_enable(); + break; +restart: + local_bh_enable(); + if (unlikely(need_resched())) + break; + cpu_relax(); + } + if (done) + busy_poll_stop(napi, have_poll_lock, false, 1); + if (!IS_ENABLED(CONFIG_PREEMPT_RT)) + preempt_enable(); +out: + rcu_read_unlock(); +} + #endif /* CONFIG_NET_RX_BUSY_POLL */ static void napi_hash_add(struct napi_struct *napi) From patchwork Tue Nov 7 21:40:42 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: David Wei X-Patchwork-Id: 13449359 X-Patchwork-Delegate: kuba@kernel.org Received: from lindbergh.monkeyblade.net (lindbergh.monkeyblade.net [23.128.96.19]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 769AE2BD0E for ; Tue, 7 Nov 2023 21:41:12 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=davidwei-uk.20230601.gappssmtp.com header.i=@davidwei-uk.20230601.gappssmtp.com header.b="Damd4j7n" Received: from mail-pl1-x62f.google.com (mail-pl1-x62f.google.com [IPv6:2607:f8b0:4864:20::62f]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 1422710D0 for ; Tue, 7 Nov 2023 13:41:12 -0800 (PST) Received: by mail-pl1-x62f.google.com with SMTP id d9443c01a7336-1cc0d0a0355so47298765ad.3 for ; Tue, 07 Nov 2023 13:41:12 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=davidwei-uk.20230601.gappssmtp.com; s=20230601; t=1699393271; x=1699998071; darn=vger.kernel.org; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:from:to:cc:subject:date :message-id:reply-to; bh=xpoMSiM7fsnsF8gAcS74BCI3r9yY1Vj9o22YU2WHyDA=; b=Damd4j7n/+Ewv5u83FOG9R8VcapUTXZkhSjcAjhQD+3wbpzx5NFiubJIJevUDo0c9H NsBFDrVMa6vNUk39DfP8WFCLClTZBMT9CD+JGcsDeRgFWPIMWvH6o1TRbKJDD6wFWe4/ zrxRPRDPjl+JQwNhj6mWUjCZgPgLqbaOYiKfotXM3ifrz8Gtp+Wz+gm4XNQMWS7jNHV4 m3oLpcw2/rYjamdR9XX6ORRGxoMYQIpBPnfAGoAB9jpNuPdYvx63WBO15cX7ikwUc0WH cf/qRqgDmdsrqjTVluyqKhrsoWyjSv9J/HAEWJ5ZLkRiGFnkkaIb3jx9UPiObie5EO1J gNCA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1699393271; x=1699998071; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=xpoMSiM7fsnsF8gAcS74BCI3r9yY1Vj9o22YU2WHyDA=; b=PeelziwIMJHdzkHdWndVeXQawP47zwZ49ElxFGCKrETdhrDtLIQ4cGH9ADzBKYVbQI xOuvWP+14r//w8jPweacMjzb4Zu8nnNvjPFgDpKS+ljgZ2O4k6qVvTrvems51pKlX5/U dftuMge0pa79Lok+v+NNoo0NABURcj8c3FfC6jBnqfPuASAai58Y52GxYMiaXREdlpGa HgNRLlv4j+AzE6frVoICtuhC1bQ6vxaqax8JFwPkw8MRTPMM08UD8wMwqpqqJi/tCLCI Iqv2IDH8/o6sfouNEciy/g6mFdk96CEufHzFlWuBSnhEN83ipATDHEG8x6tJioLCOJEl rcew== X-Gm-Message-State: AOJu0YzhdnTreZ6L/7bhhTvW0zYDKdft+Zf0CZZ9YA8xI9C6rU7TMw38 cIm9gVL6MFmBarAXy2et7EBvgQ== X-Google-Smtp-Source: AGHT+IHtozb1N6C8iJ3U75mzIC4LwKXFGI/Y3hhHNrrAdS+TnaNZlMdQryi+qh+yEi1QBFVqjwPNYA== X-Received: by 2002:a17:903:2349:b0:1cc:436d:39dd with SMTP id c9-20020a170903234900b001cc436d39ddmr217210plh.65.1699393271584; Tue, 07 Nov 2023 13:41:11 -0800 (PST) Received: from localhost (fwdproxy-prn-004.fbsv.net. 
[2a03:2880:ff:4::face:b00c]) by smtp.gmail.com with ESMTPSA id g10-20020a170902934a00b001b0358848b0sm273327plp.161.2023.11.07.13.41.11 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Tue, 07 Nov 2023 13:41:11 -0800 (PST) From: David Wei To: io-uring@vger.kernel.org, netdev@vger.kernel.org Cc: Jens Axboe , Pavel Begunkov , Jakub Kicinski , Paolo Abeni , "David S. Miller" , Eric Dumazet , Jesper Dangaard Brouer , David Ahern , Mina Almasry , Willem de Bruijn , Dragos Tatulea Subject: [PATCH 17/20] io_uring/zcrx: copy fallback to ring buffers Date: Tue, 7 Nov 2023 13:40:42 -0800 Message-Id: <20231107214045.2172393-18-dw@davidwei.uk> X-Mailer: git-send-email 2.39.3 In-Reply-To: <20231107214045.2172393-1-dw@davidwei.uk> References: <20231107214045.2172393-1-dw@davidwei.uk> Precedence: bulk X-Mailing-List: netdev@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 From: Pavel Begunkov The copy fallback is currently limited to spinlock protected ->freelist, but we also want to be able to grab buffers from the refill queue, which is napi protected. Use the new napi_execute() helper to inject a function call into the napi context. todo: the way we set napi_id with io_zc_rx_set_napi in drivers later is not reliable, we should catch all netif_napi_del() and update the id. Signed-off-by: Pavel Begunkov Signed-off-by: David Wei --- include/linux/io_uring.h | 1 + io_uring/zc_rx.c | 45 ++++++++++++++++++++++++++++++++++++++-- io_uring/zc_rx.h | 1 + 3 files changed, 45 insertions(+), 2 deletions(-) diff --git a/include/linux/io_uring.h b/include/linux/io_uring.h index fb88e000c156..bf886d6de4e0 100644 --- a/include/linux/io_uring.h +++ b/include/linux/io_uring.h @@ -75,6 +75,7 @@ struct io_zc_rx_buf *io_zc_rx_get_buf(struct io_zc_rx_ifq *ifq); struct io_zc_rx_buf *io_zc_rx_buf_from_page(struct io_zc_rx_ifq *ifq, struct page *page); void io_zc_rx_put_buf(struct io_zc_rx_ifq *ifq, struct io_zc_rx_buf *buf); +void io_zc_rx_set_napi(struct io_zc_rx_ifq *ifq, unsigned napi_id); static inline dma_addr_t io_zc_rx_buf_dma(struct io_zc_rx_buf *buf) { diff --git a/io_uring/zc_rx.c b/io_uring/zc_rx.c index c2ed600f0951..14328024a550 100644 --- a/io_uring/zc_rx.c +++ b/io_uring/zc_rx.c @@ -7,6 +7,7 @@ #include #include #include +#include #include @@ -41,6 +42,11 @@ struct io_zc_rx_pool { u32 freelist[]; }; +struct io_zc_refill_data { + struct io_zc_rx_ifq *ifq; + unsigned count; +}; + static inline u32 io_zc_rx_cqring_entries(struct io_zc_rx_ifq *ifq) { struct io_rbuf_ring *ring = ifq->ring; @@ -244,6 +250,12 @@ static void io_zc_rx_destroy_ifq(struct io_zc_rx_ifq *ifq) kfree(ifq); } +void io_zc_rx_set_napi(struct io_zc_rx_ifq *ifq, unsigned napi_id) +{ + ifq->napi_id = napi_id; +} +EXPORT_SYMBOL(io_zc_rx_set_napi); + static void io_zc_rx_destroy_pool_work(struct work_struct *work) { struct io_zc_rx_pool *pool = container_of( @@ -498,14 +510,43 @@ static void io_zc_rx_refill_cache(struct io_zc_rx_ifq *ifq, int count) pool->cache_count += filled; } +static bool io_napi_refill(void *data) +{ + struct io_zc_refill_data *rd = data; + struct io_zc_rx_ifq *ifq = rd->ifq; + struct io_zc_rx_pool *pool = ifq->pool; + int i, count = rd->count; + + lockdep_assert_no_hardirq(); + + if (!pool->cache_count) + io_zc_rx_refill_cache(ifq, POOL_REFILL_COUNT); + + spin_lock_bh(&pool->freelist_lock); + for (i = 0; i < count && pool->cache_count; i++) { + u32 pgid; + + pgid = pool->cache[--pool->cache_count]; + pool->freelist[pool->free_count++] = pgid; + } + spin_unlock_bh(&pool->freelist_lock); + 
return true; +} + static struct io_zc_rx_buf *io_zc_get_buf_task_safe(struct io_zc_rx_ifq *ifq) { struct io_zc_rx_pool *pool = ifq->pool; struct io_zc_rx_buf *buf = NULL; u32 pgid; - if (!READ_ONCE(pool->free_count)) - return NULL; + if (!READ_ONCE(pool->free_count)) { + struct io_zc_refill_data rd = { + .ifq = ifq, + .count = 1, + }; + + napi_execute(ifq->napi_id, io_napi_refill, &rd); + } spin_lock_bh(&pool->freelist_lock); if (pool->free_count) { diff --git a/io_uring/zc_rx.h b/io_uring/zc_rx.h index fac32089e699..fd8828e4bd7a 100644 --- a/io_uring/zc_rx.h +++ b/io_uring/zc_rx.h @@ -20,6 +20,7 @@ struct io_zc_rx_ifq { u32 cached_rq_head; u32 cached_cq_tail; void *pool; + unsigned int napi_id; unsigned nr_sockets; struct file *sockets[IO_ZC_MAX_IFQ_SOCKETS]; From patchwork Tue Nov 7 21:40:43 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: David Wei X-Patchwork-Id: 13449363 X-Patchwork-Delegate: kuba@kernel.org Received: from lindbergh.monkeyblade.net (lindbergh.monkeyblade.net [23.128.96.19]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id D9D0436B14 for ; Tue, 7 Nov 2023 21:41:13 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=davidwei-uk.20230601.gappssmtp.com header.i=@davidwei-uk.20230601.gappssmtp.com header.b="ZPbJ/rM5" Received: from mail-pj1-x102e.google.com (mail-pj1-x102e.google.com [IPv6:2607:f8b0:4864:20::102e]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 1C85E10E7 for ; Tue, 7 Nov 2023 13:41:13 -0800 (PST) Received: by mail-pj1-x102e.google.com with SMTP id 98e67ed59e1d1-280109daaaaso4629414a91.3 for ; Tue, 07 Nov 2023 13:41:13 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=davidwei-uk.20230601.gappssmtp.com; s=20230601; t=1699393272; x=1699998072; darn=vger.kernel.org; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:from:to:cc:subject:date :message-id:reply-to; bh=Tamxuyrmc9V5WDiAondngZpJ00DNIyR1wLlFf3mWuNQ=; b=ZPbJ/rM5Tnj+cefBpWPBlXoyIu+xU80BiuU0t9YskyRFCPmeIDeHU/fjBvoN/Oe2n4 RzCg630Ghrf6joNBMzrRhbkve/rF8m4nMcwPp8NthSvLy2Ug5yPytyE/Y+qb5barYg4m IX/DNKuzPkiFN/kBnJMelBCpb3LnqhL8kZnf9WAalaZZ0wg0/GG+yMmpJI/AHh+2h5Kt MW+TBCu7noOWFVfdaXyijdH9qyvXW4Jwr55GvO1R4MIRJ0wCJynigbDjx0oXZcvY4xoB LLTg/GBteXY5o7Z4NAnH8JkAGMpksgDDodXNax6wP7oJ2p5fAdnIKurD/5yI7t3WuicC oxeg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1699393272; x=1699998072; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=Tamxuyrmc9V5WDiAondngZpJ00DNIyR1wLlFf3mWuNQ=; b=nnMbnpA8HsxRwPtj3tr4/L1JlS+qG60nA7VkeNjpc4KPwKTVA7bWZsM4i1GcHyRZhk GPoVLP42BP0CsP6nfeTFAmmHPAgOgXS4clWzJ6/eZ3I97Ytwq86WJ5Zuss/PO5ORE6sI MZ4jv1fGfe4U6mvQ2WKwhALCWw5GQrkA4f8InpyUQLjODlgk0tvbBpCvss2/KVGoyj3v 9SGGprhj08AKvy+cKgQTeLoh2gvGAePGsyk1WqBmcEcXUUIZPmtjxnuy2EMJoDKYqhKE 6LvqmEM8Kp0kbpkdc0TOysD5+nPIWu6UVEz1nV4cMtwa3nvT+kgwZlrr7fBr++idMPSp RDsw== X-Gm-Message-State: AOJu0YzojP/4XThD59LToC1cYEd0KxsjemeWwSiG7FldNDTy+3D65lTL FzLJ8DJlVRJEkB1T4P0bhrh9fw== X-Google-Smtp-Source: AGHT+IG96zwTBgW1kaJGEH38MmWnZ8QQVo84jeNgQowFRn8jQbKuxQkyDCagBqHMDZInNEFklcM3pA== X-Received: by 2002:a17:90b:1bc7:b0:281:691:e58c with SMTP id oa7-20020a17090b1bc700b002810691e58cmr6076553pjb.37.1699393272478; Tue, 07 Nov 
2023 13:41:12 -0800 (PST) Received: from localhost (fwdproxy-prn-009.fbsv.net. [2a03:2880:ff:9::face:b00c]) by smtp.gmail.com with ESMTPSA id a3-20020a17090acb8300b002774d7e2fefsm250892pju.36.2023.11.07.13.41.12 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Tue, 07 Nov 2023 13:41:12 -0800 (PST) From: David Wei To: io-uring@vger.kernel.org, netdev@vger.kernel.org Cc: Jens Axboe , Pavel Begunkov , Jakub Kicinski , Paolo Abeni , "David S. Miller" , Eric Dumazet , Jesper Dangaard Brouer , David Ahern , Mina Almasry , Willem de Bruijn , Dragos Tatulea Subject: [PATCH 18/20] veth: add support for io_uring zc rx Date: Tue, 7 Nov 2023 13:40:43 -0800 Message-Id: <20231107214045.2172393-19-dw@davidwei.uk> X-Mailer: git-send-email 2.39.3 In-Reply-To: <20231107214045.2172393-1-dw@davidwei.uk> References: <20231107214045.2172393-1-dw@davidwei.uk> Precedence: bulk X-Mailing-List: netdev@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 X-Patchwork-Delegate: kuba@kernel.org From: Pavel Begunkov Dirty and hacky, testing only Add io_uring zerocopy support for veth. It's not actually zerocopy, we copy data in napi, which is early enough in the stack to be useful for testing. Signed-off-by: Pavel Begunkov Signed-off-by: David Wei --- drivers/net/veth.c | 179 ++++++++++++++++++++++++++++++++++++++++++++- io_uring/zc_rx.c | 15 ++-- 2 files changed, 186 insertions(+), 8 deletions(-) diff --git a/drivers/net/veth.c b/drivers/net/veth.c index 0deefd1573cf..08420d43ac00 100644 --- a/drivers/net/veth.c +++ b/drivers/net/veth.c @@ -27,6 +27,8 @@ #include #include #include +#include +#include #define DRV_NAME "veth" #define DRV_VERSION "1.0" @@ -67,6 +69,8 @@ struct veth_rq { struct ptr_ring xdp_ring; struct xdp_rxq_info xdp_rxq; struct page_pool *page_pool; + + struct data_pool zc_dp; }; struct veth_priv { @@ -75,6 +79,7 @@ struct veth_priv { struct bpf_prog *_xdp_prog; struct veth_rq *rq; unsigned int requested_headroom; + bool zc_installed; }; struct veth_xdp_tx_bq { @@ -335,9 +340,12 @@ static bool veth_skb_is_eligible_for_gro(const struct net_device *dev, const struct net_device *rcv, const struct sk_buff *skb) { + struct veth_priv *rcv_priv = netdev_priv(rcv); + return !(dev->features & NETIF_F_ALL_TSO) || (skb->destructor == sock_wfree && - rcv->features & (NETIF_F_GRO_FRAGLIST | NETIF_F_GRO_UDP_FWD)); + rcv->features & (NETIF_F_GRO_FRAGLIST | NETIF_F_GRO_UDP_FWD)) || + rcv_priv->zc_installed; } static netdev_tx_t veth_xmit(struct sk_buff *skb, struct net_device *dev) @@ -828,6 +836,73 @@ static int veth_convert_skb_to_xdp_buff(struct veth_rq *rq, return -ENOMEM; } +static struct sk_buff *veth_iou_rcv_skb(struct veth_rq *rq, + struct sk_buff *skb) +{ + struct sk_buff *nskb; + u32 size, len, off, max_head_size; + struct page *page; + int ret, i, head_off; + void *vaddr; + + skb_prepare_for_gro(skb); + max_head_size = skb_headlen(skb); + + rcu_read_lock(); + nskb = napi_alloc_skb(&rq->xdp_napi, max_head_size); + if (!nskb) + goto drop; + + skb_zcopy_init(nskb, rq->zc_dp.zc_uarg); + skb_copy_header(nskb, skb); + skb_mark_for_recycle(nskb); + + size = max_head_size; + if (skb_copy_bits(skb, 0, nskb->data, size)) { + consume_skb(nskb); + goto drop; + } + skb_put(nskb, size); + head_off = skb_headroom(nskb) - skb_headroom(skb); + skb_headers_offset_update(nskb, head_off); + + /* Allocate paged area of new skb */ + off = size; + len = skb->len - off; + + for (i = 0; i < MAX_SKB_FRAGS && off < skb->len; i++) { + page = data_pool_alloc_page(&rq->zc_dp); + if (!page) { + 
consume_skb(nskb); + goto drop; + } + + size = min_t(u32, len, PAGE_SIZE); + skb_add_rx_frag(nskb, i, page, 0, size, PAGE_SIZE); + + vaddr = kmap_atomic(page); + ret = skb_copy_bits(skb, off, vaddr, size); + kunmap_atomic(vaddr); + + if (ret) { + consume_skb(nskb); + goto drop; + } + len -= size; + off += size; + } + rcu_read_unlock(); + + consume_skb(skb); + skb = nskb; + return skb; +drop: + rcu_read_unlock(); + kfree_skb(skb); + return NULL; +} + + static struct sk_buff *veth_xdp_rcv_skb(struct veth_rq *rq, struct sk_buff *skb, struct veth_xdp_tx_bq *bq, @@ -971,8 +1046,13 @@ static int veth_xdp_rcv(struct veth_rq *rq, int budget, /* ndo_start_xmit */ struct sk_buff *skb = ptr; - stats->xdp_bytes += skb->len; - skb = veth_xdp_rcv_skb(rq, skb, bq, stats); + if (!rq->zc_dp.zc_ifq) { + stats->xdp_bytes += skb->len; + skb = veth_xdp_rcv_skb(rq, skb, bq, stats); + } else { + skb = veth_iou_rcv_skb(rq, skb); + } + if (skb) { if (skb_shared(skb) || skb_unclone(skb, GFP_ATOMIC)) netif_receive_skb(skb); @@ -1351,6 +1431,9 @@ static int veth_set_channels(struct net_device *dev, struct net_device *peer; int err; + if (priv->zc_installed) + return -EINVAL; + /* sanity check. Upper bounds are already enforced by the caller */ if (!ch->rx_count || !ch->tx_count) return -EINVAL; @@ -1428,6 +1511,8 @@ static int veth_open(struct net_device *dev) struct net_device *peer = rtnl_dereference(priv->peer); int err; + priv->zc_installed = false; + if (!peer) return -ENOTCONN; @@ -1618,6 +1703,89 @@ static void veth_set_rx_headroom(struct net_device *dev, int new_hr) rcu_read_unlock(); } +static int __veth_iou_set(struct net_device *dev, + struct netdev_bpf *xdp) +{ + bool napi_already_on = veth_gro_requested(dev) && (dev->flags & IFF_UP); + unsigned qid = xdp->zc_rx.queue_id; + struct veth_priv *priv = netdev_priv(dev); + struct net_device *peer; + struct veth_rq *rq; + int ret; + + if (priv->_xdp_prog) + return -EINVAL; + if (qid >= dev->real_num_rx_queues) + return -EINVAL; + if (!(dev->flags & IFF_UP)) + return -EOPNOTSUPP; + if (dev->real_num_rx_queues != 1) + return -EINVAL; + + rq = &priv->rq[qid]; + if (!!rq->zc_dp.zc_ifq == !!xdp->zc_rx.ifq) + return -EINVAL; + + if (rq->zc_dp.zc_ifq) { + veth_napi_del(dev); + rq->zc_dp.zc_ifq = NULL; + rq->zc_dp.page_pool = NULL; + rq->zc_dp.zc_uarg = NULL; + priv->zc_installed = false; + + if (!veth_gro_requested(dev) && netif_running(dev)) { + dev->features &= ~NETIF_F_GRO; + netdev_features_change(dev); + } + return 0; + } + + peer = rtnl_dereference(priv->peer); + peer->hw_features &= ~NETIF_F_GSO_SOFTWARE; + + ret = veth_create_page_pool(rq); + if (ret) + return ret; + + ret = ptr_ring_init(&rq->xdp_ring, VETH_RING_SIZE, GFP_KERNEL); + if (ret) { + page_pool_destroy(rq->page_pool); + rq->page_pool = NULL; + return ret; + } + + rq->zc_dp.zc_ifq = xdp->zc_rx.ifq; + rq->zc_dp.zc_uarg = xdp->zc_rx.uarg; + rq->zc_dp.page_pool = rq->page_pool; + priv->zc_installed = true; + + if (!veth_gro_requested(dev)) { + /* user-space did not require GRO, but adding XDP + * is supposed to get GRO working + */ + dev->features |= NETIF_F_GRO; + netdev_features_change(dev); + } + if (!napi_already_on) { + netif_napi_add(dev, &rq->xdp_napi, veth_poll); + napi_enable(&rq->xdp_napi); + rcu_assign_pointer(rq->napi, &rq->xdp_napi); + } + io_zc_rx_set_napi(rq->zc_dp.zc_ifq, rq->xdp_napi.napi_id); + return 0; +} + +static int veth_iou_set(struct net_device *dev, + struct netdev_bpf *xdp) +{ + int ret; + + rtnl_lock(); + ret = __veth_iou_set(dev, xdp); + rtnl_unlock(); + return ret; +} + 
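/*
 * Note on the io_zc_rx_set_napi() call above: the driver reports which napi
 * owns the zerocopy queue so the io_uring side can later inject work into
 * that napi via the napi_execute() helper added in patch 16 of this series
 * (the refill in the copy fallback path uses exactly this). A minimal usage
 * sketch of napi_execute() follows; the callback and argument names are
 * hypothetical, not part of this series.
 */
struct my_refill_arg {
	unsigned int count;
	bool ran;
};

static bool my_refill_cb(void *data)
{
	struct my_refill_arg *arg = data;

	/* runs as if from the napi poll loop: bh disabled, napi "owned" */
	arg->ran = true;
	return true;
}

static void my_trigger_refill(unsigned int napi_id)
{
	struct my_refill_arg arg = { .count = 1 };

	napi_execute(napi_id, my_refill_cb, &arg);
}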
static int veth_xdp_set(struct net_device *dev, struct bpf_prog *prog, struct netlink_ext_ack *extack) { @@ -1627,6 +1795,9 @@ static int veth_xdp_set(struct net_device *dev, struct bpf_prog *prog, unsigned int max_mtu; int err; + if (priv->zc_installed) + return -EINVAL; + old_prog = priv->_xdp_prog; priv->_xdp_prog = prog; peer = rtnl_dereference(priv->peer); @@ -1705,6 +1876,8 @@ static int veth_xdp(struct net_device *dev, struct netdev_bpf *xdp) switch (xdp->command) { case XDP_SETUP_PROG: return veth_xdp_set(dev, xdp->prog, xdp->extack); + case XDP_SETUP_ZC_RX: + return veth_iou_set(dev, xdp); default: return -EINVAL; } diff --git a/io_uring/zc_rx.c b/io_uring/zc_rx.c index 14328024a550..611a068c3402 100644 --- a/io_uring/zc_rx.c +++ b/io_uring/zc_rx.c @@ -122,11 +122,14 @@ static void io_zc_rx_skb_free(struct sk_buff *skb, struct ubuf_info *uarg, static int io_zc_rx_map_buf(struct device *dev, struct page *page, u16 pool_id, u32 pgid, struct io_zc_rx_buf *buf) { - dma_addr_t addr; + dma_addr_t addr = 0; SetPagePrivate(page); set_page_private(page, mk_page_info(pool_id, pgid)); + if (!dev) + goto out; + addr = dma_map_page_attrs(dev, page, 0, PAGE_SIZE, DMA_BIDIRECTIONAL, DMA_ATTR_SKIP_CPU_SYNC); @@ -135,7 +138,7 @@ static int io_zc_rx_map_buf(struct device *dev, struct page *page, u16 pool_id, ClearPagePrivate(page); return -ENOMEM; } - +out: buf->dma = addr; buf->page = page; atomic_set(&buf->refcount, 0); @@ -151,9 +154,11 @@ static void io_zc_rx_unmap_buf(struct device *dev, struct io_zc_rx_buf *buf) page = buf->page; set_page_private(page, 0); ClearPagePrivate(page); - dma_unmap_page_attrs(dev, buf->dma, PAGE_SIZE, - DMA_BIDIRECTIONAL, - DMA_ATTR_SKIP_CPU_SYNC); + + if (dev) + dma_unmap_page_attrs(dev, buf->dma, PAGE_SIZE, + DMA_BIDIRECTIONAL, + DMA_ATTR_SKIP_CPU_SYNC); put_page(page); } From patchwork Tue Nov 7 21:40:44 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: David Wei X-Patchwork-Id: 13449361 X-Patchwork-Delegate: kuba@kernel.org Received: from lindbergh.monkeyblade.net (lindbergh.monkeyblade.net [23.128.96.19]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 6FC952CCA3 for ; Tue, 7 Nov 2023 21:41:14 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=davidwei-uk.20230601.gappssmtp.com header.i=@davidwei-uk.20230601.gappssmtp.com header.b="NPBxlArz" Received: from mail-pj1-x1030.google.com (mail-pj1-x1030.google.com [IPv6:2607:f8b0:4864:20::1030]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 0B6D310EF for ; Tue, 7 Nov 2023 13:41:14 -0800 (PST) Received: by mail-pj1-x1030.google.com with SMTP id 98e67ed59e1d1-28041176e77so4662478a91.0 for ; Tue, 07 Nov 2023 13:41:14 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=davidwei-uk.20230601.gappssmtp.com; s=20230601; t=1699393273; x=1699998073; darn=vger.kernel.org; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:from:to:cc:subject:date :message-id:reply-to; bh=24NMvuwsm8opdmLI6RaThqDP9FkcBYT5SzyW1mwOoUk=; b=NPBxlArzKZXH2jXMbgP8j7xomkQmrggyA5JD8zYQbubi6gRLdQ06VDarLDMotMR980 Jv6WGZ7RV1HipXeC0sizxTj7mSms9lU7aj+0SemMjgTqrnky5F8k7EYNv07hr2rfFKLN R2LAkGCTIn1WMg44TFo3f0rfD8oUzmnYFT1XR+X3FnTP9liWRn0HHIkhUerFV2Y+8TPv D1EPd2HxzooNlo8uSSKmn9aUa9DUPP6DqqL6/5mP65bYFpSKsP9oGBOWcYdwGv/pCMqj 
mBu4vtEfxbtzqnZlBa1fZMYvOsFOz3Oq9Jdbo20yTn8tIfy1jz5099IQydBEI4YSMO+L 07pA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1699393273; x=1699998073; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=24NMvuwsm8opdmLI6RaThqDP9FkcBYT5SzyW1mwOoUk=; b=VhuXZ8VoJkLs0vlfZ84Cr1sM4NHfZbf4M9dcA35D0oIfHCWg8U4pPubpjrJMJAK4xE gyOadA3HNFAQcqX2ua1Zf6FjinerSwFYW7MxJzsIEKXg9dJvtXnwuiUrxwWLpMrQ9XD2 n4rszajpXag6GR8aeD+FFm6etVVgFUViRbid/is0ayIC0qgZ+Yu3uNWnbBzbdStrAY5t bqqB/URLGf+V5pJK0sFx3HN20ByxPwbkUe+WyZhlEkyhzYZ4diZxNAV4d0v2K6nDgv3N wX/pAoD/gvmeNEUukovgokxU0UTqOKraoxcvqR7JKnjOabL18p5H0NPD58bN7C3TuIOO Gwnw== X-Gm-Message-State: AOJu0YxLTkMzeO1wJw0J7VRl9MctYiMGBrochf9yidTuAZK96Khm3TPh XHISL1Gft8J1vLlFWZhsc/7iLg== X-Google-Smtp-Source: AGHT+IEsCCxIQ4zudTGP6Cjux/XiAMHHW5H4f2S8Qmp0qUYHU0tHtdYyhYBfoPuRWZjiqf1BB8wdAw== X-Received: by 2002:a17:90a:7e8d:b0:280:ff37:8981 with SMTP id j13-20020a17090a7e8d00b00280ff378981mr5558242pjl.44.1699393273530; Tue, 07 Nov 2023 13:41:13 -0800 (PST) Received: from localhost (fwdproxy-prn-013.fbsv.net. [2a03:2880:ff:d::face:b00c]) by smtp.gmail.com with ESMTPSA id ml10-20020a17090b360a00b0027d0af2e9c3sm255218pjb.40.2023.11.07.13.41.13 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Tue, 07 Nov 2023 13:41:13 -0800 (PST) From: David Wei To: io-uring@vger.kernel.org, netdev@vger.kernel.org Cc: Jens Axboe , Pavel Begunkov , Jakub Kicinski , Paolo Abeni , "David S. Miller" , Eric Dumazet , Jesper Dangaard Brouer , David Ahern , Mina Almasry , Willem de Bruijn , Dragos Tatulea Subject: [PATCH 19/20] bnxt: use data pool Date: Tue, 7 Nov 2023 13:40:44 -0800 Message-Id: <20231107214045.2172393-20-dw@davidwei.uk> X-Mailer: git-send-email 2.39.3 In-Reply-To: <20231107214045.2172393-1-dw@davidwei.uk> References: <20231107214045.2172393-1-dw@davidwei.uk> Precedence: bulk X-Mailing-List: netdev@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 X-Patchwork-Delegate: kuba@kernel.org The BNXT driver is modified to use data pool in order to support ZC Rx. A setup function bnxt_zc_rx is added that is called on XDP_SETUP_ZC_RX XDP command which initialises a data_pool in a netdev_rx_queue. 
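The data_pool helper itself is introduced earlier in the series and its header is not shown in this patch. As a rough model of how the driver-facing calls behave (a sketch inferred from how bnxt and veth use it here, not the actual implementation), it wraps the queue's regular page_pool plus an optional io_uring ZC ifq and picks the buffer source accordingly:

/* Sketch only: assumed shape of the data_pool used below. */
struct data_pool {
	struct page_pool	*page_pool;
	struct io_zc_rx_ifq	*zc_ifq;
	struct ubuf_info	*zc_uarg;
};

static inline struct page *data_pool_alloc_page(struct data_pool *dp)
{
	if (dp->zc_ifq) {
		struct io_zc_rx_buf *buf = io_zc_rx_get_buf(dp->zc_ifq);

		return buf ? buf->page : NULL;
	}
	return page_pool_dev_alloc_pages(dp->page_pool);
}

static inline dma_addr_t data_pool_get_dma_addr(struct data_pool *dp,
						struct page *page)
{
	if (dp->zc_ifq)
		return io_zc_rx_buf_dma(io_zc_rx_buf_from_page(dp->zc_ifq, page));
	return page_pool_get_dma_addr(page);
}

static inline void data_pool_put_page(struct data_pool *dp, struct page *page)
{
	if (dp->zc_ifq)
		io_zc_rx_put_buf(dp->zc_ifq,
				 io_zc_rx_buf_from_page(dp->zc_ifq, page));
	else
		page_pool_recycle_direct(dp->page_pool, page);
}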
Co-developed-by: Pavel Begunkov Signed-off-by: Pavel Begunkov Signed-off-by: David Wei --- drivers/net/ethernet/broadcom/bnxt/bnxt.c | 61 ++++++++++++++++--- drivers/net/ethernet/broadcom/bnxt/bnxt.h | 5 ++ drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c | 3 + include/net/netdev_rx_queue.h | 2 + 4 files changed, 61 insertions(+), 10 deletions(-) diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c index ca1088f7107e..2787c1b474db 100644 --- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c +++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c @@ -55,6 +55,8 @@ #include #include #include +#include +#include #include "bnxt_hsi.h" #include "bnxt.h" @@ -798,13 +800,7 @@ static struct page *__bnxt_alloc_rx_64k_page(struct bnxt *bp, dma_addr_t *mappin if (!page) return NULL; - *mapping = dma_map_page_attrs(&bp->pdev->dev, page, offset, - BNXT_RX_PAGE_SIZE, DMA_FROM_DEVICE, - DMA_ATTR_WEAK_ORDERING); - if (dma_mapping_error(&bp->pdev->dev, *mapping)) { - page_pool_recycle_direct(rxr->page_pool, page); - return NULL; - } + *mapping = page_pool_get_dma_addr(page); if (page_offset) *page_offset = offset; @@ -824,13 +820,13 @@ static struct page *__bnxt_alloc_rx_page(struct bnxt *bp, dma_addr_t *mapping, page = page_pool_dev_alloc_frag(rxr->page_pool, offset, BNXT_RX_PAGE_SIZE); } else { - page = page_pool_dev_alloc_pages(rxr->page_pool); + page = data_pool_alloc_page(&rxr->rx_dp); *offset = 0; } if (!page) return NULL; - *mapping = page_pool_get_dma_addr(page) + *offset; + *mapping = data_pool_get_dma_addr(&rxr->rx_dp, page) + *offset; return page; } @@ -1816,6 +1812,8 @@ static void bnxt_deliver_skb(struct bnxt *bp, struct bnxt_napi *bnapi, return; } skb_record_rx_queue(skb, bnapi->index); + if (bnapi->rx_ring->rx_dp.zc_uarg) + skb_zcopy_init(skb, bnapi->rx_ring->rx_dp.zc_uarg); skb_mark_for_recycle(skb); napi_gro_receive(&bnapi->napi, skb); } @@ -3100,7 +3098,7 @@ static void bnxt_free_one_rx_ring_skbs(struct bnxt *bp, int ring_nr) rx_agg_buf->page = NULL; __clear_bit(i, rxr->rx_agg_bmap); - page_pool_recycle_direct(rxr->page_pool, page); + data_pool_put_page(&rxr->rx_dp, page); } skip_rx_agg_free: @@ -3305,6 +3303,8 @@ static void bnxt_free_rx_rings(struct bnxt *bp) page_pool_destroy(rxr->page_pool); rxr->page_pool = NULL; + rxr->rx_dp.page_pool = NULL; + rxr->rx_dp.zc_ifq = NULL; kfree(rxr->rx_agg_bmap); rxr->rx_agg_bmap = NULL; @@ -3333,6 +3333,8 @@ static int bnxt_alloc_rx_page_pool(struct bnxt *bp, pp.flags = PP_FLAG_DMA_MAP | PP_FLAG_DMA_SYNC_DEV; if (PAGE_SIZE > BNXT_RX_PAGE_SIZE) pp.flags |= PP_FLAG_PAGE_FRAG; + pp.flags |= PP_FLAG_DMA_MAP | PP_FLAG_DMA_SYNC_DEV; + pp.max_len = PAGE_SIZE; rxr->page_pool = page_pool_create(&pp); if (IS_ERR(rxr->page_pool)) { @@ -3341,6 +3343,7 @@ static int bnxt_alloc_rx_page_pool(struct bnxt *bp, rxr->page_pool = NULL; return err; } + rxr->rx_dp.page_pool = rxr->page_pool; return 0; } @@ -3803,6 +3806,7 @@ static int bnxt_init_one_rx_ring(struct bnxt *bp, int ring_nr) { struct bnxt_rx_ring_info *rxr; struct bnxt_ring_struct *ring; + struct netdev_rx_queue *rxq; u32 type; type = (bp->rx_buf_use_size << RX_BD_LEN_SHIFT) | @@ -3831,6 +3835,12 @@ static int bnxt_init_one_rx_ring(struct bnxt *bp, int ring_nr) bnxt_init_rxbd_pages(ring, type); } + rxq = __netif_get_rx_queue(bp->dev, ring_nr); + if (rxq->data_pool.zc_ifq) { + rxr->rx_dp.zc_ifq = rxq->data_pool.zc_ifq; + rxr->rx_dp.zc_uarg = rxq->data_pool.zc_uarg; + } + return bnxt_alloc_one_rx_ring(bp, ring_nr); } @@ -13974,6 +13984,37 @@ void bnxt_print_device_info(struct bnxt 
*bp) pcie_print_link_status(bp->pdev); } +int bnxt_zc_rx(struct bnxt *bp, struct netdev_bpf *xdp) +{ + unsigned ifq_idx = xdp->zc_rx.queue_id; + + if (ifq_idx >= bp->rx_nr_rings) + return -EINVAL; + + bnxt_rtnl_lock_sp(bp); + if (netif_running(bp->dev)) { + struct netdev_rx_queue *rxq; + int rc, napi_id; + + bnxt_ulp_stop(bp); + bnxt_close_nic(bp, true, false); + + rxq = __netif_get_rx_queue(bp->dev, ifq_idx); + rxq->data_pool.zc_ifq = xdp->zc_rx.ifq; + rxq->data_pool.zc_uarg = xdp->zc_rx.uarg; + + rc = bnxt_open_nic(bp, true, false); + bnxt_ulp_start(bp, rc); + + if (xdp->zc_rx.ifq) { + napi_id = bp->bnapi[ifq_idx]->napi.napi_id; + io_zc_rx_set_napi(xdp->zc_rx.ifq, napi_id); + } + } + bnxt_rtnl_unlock_sp(bp); + return 0; +} + static int bnxt_init_one(struct pci_dev *pdev, const struct pci_device_id *ent) { struct net_device *dev; diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.h b/drivers/net/ethernet/broadcom/bnxt/bnxt.h index d95d0ca91f3f..7f3b03fa5960 100644 --- a/drivers/net/ethernet/broadcom/bnxt/bnxt.h +++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.h @@ -33,6 +33,7 @@ #ifdef CONFIG_TEE_BNXT_FW #include #endif +#include extern struct list_head bnxt_block_cb_list; @@ -946,6 +947,7 @@ struct bnxt_rx_ring_info { struct bnxt_ring_struct rx_agg_ring_struct; struct xdp_rxq_info xdp_rxq; struct page_pool *page_pool; + struct data_pool rx_dp; }; struct bnxt_rx_sw_stats { @@ -2485,4 +2487,7 @@ int bnxt_get_port_parent_id(struct net_device *dev, void bnxt_dim_work(struct work_struct *work); int bnxt_hwrm_set_ring_coal(struct bnxt *bp, struct bnxt_napi *bnapi); void bnxt_print_device_info(struct bnxt *bp); + +int bnxt_zc_rx(struct bnxt *bp, struct netdev_bpf *xdp); + #endif diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c b/drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c index 96f5ca778c67..b7ef2e551334 100644 --- a/drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c +++ b/drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c @@ -465,6 +465,9 @@ int bnxt_xdp(struct net_device *dev, struct netdev_bpf *xdp) case XDP_SETUP_PROG: rc = bnxt_xdp_set(bp, xdp->prog); break; + case XDP_SETUP_ZC_RX: + return bnxt_zc_rx(bp, xdp); + break; default: rc = -EINVAL; break; diff --git a/include/net/netdev_rx_queue.h b/include/net/netdev_rx_queue.h index cdcafb30d437..1b2944e61e19 100644 --- a/include/net/netdev_rx_queue.h +++ b/include/net/netdev_rx_queue.h @@ -6,6 +6,7 @@ #include #include #include +#include /* This structure contains an instance of an RX queue. 
*/ struct netdev_rx_queue { @@ -18,6 +19,7 @@ struct netdev_rx_queue { struct net_device *dev; netdevice_tracker dev_tracker; + struct data_pool data_pool; #ifdef CONFIG_XDP_SOCKETS struct xsk_buff_pool *pool; #endif From patchwork Tue Nov 7 21:40:45 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: David Wei X-Patchwork-Id: 13449362 X-Patchwork-Delegate: kuba@kernel.org Received: from lindbergh.monkeyblade.net (lindbergh.monkeyblade.net [23.128.96.19]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id B450A2CCB6 for ; Tue, 7 Nov 2023 21:41:15 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=davidwei-uk.20230601.gappssmtp.com header.i=@davidwei-uk.20230601.gappssmtp.com header.b="DZYmBary" Received: from mail-pl1-x636.google.com (mail-pl1-x636.google.com [IPv6:2607:f8b0:4864:20::636]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id DF8AE10E6 for ; Tue, 7 Nov 2023 13:41:14 -0800 (PST) Received: by mail-pl1-x636.google.com with SMTP id d9443c01a7336-1cc53d0030fso1101055ad.0 for ; Tue, 07 Nov 2023 13:41:14 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=davidwei-uk.20230601.gappssmtp.com; s=20230601; t=1699393274; x=1699998074; darn=vger.kernel.org; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:from:to:cc:subject:date :message-id:reply-to; bh=aN4oMaI4R9lQb3D1p9Iwygb5XnCuCVbIIXzTcWlvdpg=; b=DZYmBaryiWUy/oPARqPJEF50u3v87tKbQ+qlTDjEt+UZkUiQhUxbAEu/Qd/p+Ogq27 d7N9vlSeIqGRlql1kNg/G05KyA57/ORqXhTfg4D/CezVvVmGH2U4xNw6yxvAJ6INI0y3 CO/AWhUrrjSKleYEJXusYjx4C5UsiPzraOsgabEtG3cPh5dtDGWiIHdRdSBzA9Fv5o5o 5dAtXybbO3nrDOEAPSY6FEy5fDxIA4UCk7Pth5XS6pHI4M2KFKZ6qviCoB9FWPHsKWcE OkVOSpIj97lVJt0OQ9SCt+ZeHjAMF6ogNVwI3r1o7WMppaVR4VnG2h0ya1aH8dP4mqec Ijng== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1699393274; x=1699998074; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=aN4oMaI4R9lQb3D1p9Iwygb5XnCuCVbIIXzTcWlvdpg=; b=dZKbr15diq6OTNQPMuWUChtvju5d2yqPRWp9Lnot0EMDoH63fRGD3P2EobH9WyTXUR haOqHSP6rA8AnT9uUBTBkcjbwwBnSTS7eX4rQPi7YnwiY50moSzClwleGtqNzkfHDzXc /2oX/2qEhZpCAsOAE21t1UMEDASvMH7Cni9TJpTigEnUNg5QzxdXCAITQ908jO+2gAWO CdTpVsN916/yW7vNHvq+E2sYUzBktBWhf0yHi8PU0n399pbZ80UQ95PCpdbvueRrRy65 StfcMc1/agrxyqcdOpmlvSgaU8oYBsWY/5fCh2bBNmLjQgwj0hkMNd3uc+bdI/Ybio6h yTww== X-Gm-Message-State: AOJu0YzhPnomuh7+SHUq6ggWndQfYFBxg0wbZcKVeakqsC1DeyKISbf4 O1qPr7+8MKaTBk2hi8csogMKNw== X-Google-Smtp-Source: AGHT+IEC28F2Hi8Z2PLSGIkXJCyYnwk65ti/b8HkC4XdItbMFIjTiR7+o2qyYlTD6mIX1MINY5plzA== X-Received: by 2002:a17:903:2281:b0:1cc:332f:9e4b with SMTP id b1-20020a170903228100b001cc332f9e4bmr12727plh.1.1699393274402; Tue, 07 Nov 2023 13:41:14 -0800 (PST) Received: from localhost (fwdproxy-prn-011.fbsv.net. [2a03:2880:ff:b::face:b00c]) by smtp.gmail.com with ESMTPSA id h10-20020a170902748a00b001cc0d1af177sm264672pll.229.2023.11.07.13.41.14 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Tue, 07 Nov 2023 13:41:14 -0800 (PST) From: David Wei To: io-uring@vger.kernel.org, netdev@vger.kernel.org Cc: Jens Axboe , Pavel Begunkov , Jakub Kicinski , Paolo Abeni , "David S. 
Miller" , Eric Dumazet , Jesper Dangaard Brouer , David Ahern , Mina Almasry , Willem de Bruijn , Dragos Tatulea Subject: [PATCH 20/20] io_uring/zcrx: add multi socket support per Rx queue Date: Tue, 7 Nov 2023 13:40:45 -0800 Message-Id: <20231107214045.2172393-21-dw@davidwei.uk> X-Mailer: git-send-email 2.39.3 In-Reply-To: <20231107214045.2172393-1-dw@davidwei.uk> References: <20231107214045.2172393-1-dw@davidwei.uk> Precedence: bulk X-Mailing-List: netdev@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Extract the io_uring internal sock_idx from a sock and set it in each rbuf cqe. This allows userspace to distinguish which cqe belongs to which socket (and by association, which flow). This complicates the uapi as userspace now needs to keep a table of sock_idx to bufs per loop iteration. Each io_recvzc request on a socket will return its own completion event, but all rbuf cqes from all sockets already exist in the rbuf cq ring. Co-developed-by: Pavel Begunkov Signed-off-by: Pavel Begunkov Signed-off-by: David Wei --- include/uapi/linux/io_uring.h | 3 ++- io_uring/net.c | 1 + io_uring/zc_rx.c | 29 ++++++++++++++++++++++------- 3 files changed, 25 insertions(+), 8 deletions(-) diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h index 603d07d0a791..588fd7eda797 100644 --- a/include/uapi/linux/io_uring.h +++ b/include/uapi/linux/io_uring.h @@ -754,8 +754,9 @@ struct io_uring_rbuf_cqe { __u32 off; __u32 len; __u16 region; + __u8 sock; __u8 flags; - __u8 __pad[3]; + __u8 __pad[2]; }; struct io_rbuf_rqring_offsets { diff --git a/io_uring/net.c b/io_uring/net.c index e7b41c5826d5..4f8d19e88dcb 100644 --- a/io_uring/net.c +++ b/io_uring/net.c @@ -1031,6 +1031,7 @@ int io_recvzc(struct io_kiocb *req, unsigned int issue_flags) int ret, min_ret = 0; bool force_nonblock = issue_flags & IO_URING_F_NONBLOCK; struct io_zc_rx_ifq *ifq; + unsigned sock_idx; if (issue_flags & IO_URING_F_UNLOCKED) return -EAGAIN; diff --git a/io_uring/zc_rx.c b/io_uring/zc_rx.c index 611a068c3402..fdeaed4b4883 100644 --- a/io_uring/zc_rx.c +++ b/io_uring/zc_rx.c @@ -47,6 +47,11 @@ struct io_zc_refill_data { unsigned count; }; +struct io_zc_rx_recv_args { + struct io_zc_rx_ifq *ifq; + struct socket *sock; +}; + static inline u32 io_zc_rx_cqring_entries(struct io_zc_rx_ifq *ifq) { struct io_rbuf_ring *ring = ifq->ring; @@ -667,7 +672,7 @@ static inline struct io_uring_rbuf_cqe *io_zc_get_rbuf_cqe(struct io_zc_rx_ifq * } static ssize_t zc_rx_copy_chunk(struct io_zc_rx_ifq *ifq, void *data, - unsigned int offset, size_t len) + unsigned int offset, size_t len, unsigned sock_idx) { size_t copy_size, copied = 0; struct io_uring_rbuf_cqe *cqe; @@ -702,6 +707,7 @@ static ssize_t zc_rx_copy_chunk(struct io_zc_rx_ifq *ifq, void *data, cqe->off = pgid * PAGE_SIZE + off; cqe->len = copy_size; cqe->flags = 0; + cqe->sock = sock_idx; offset += copy_size; len -= copy_size; @@ -712,7 +718,7 @@ static ssize_t zc_rx_copy_chunk(struct io_zc_rx_ifq *ifq, void *data, } static int zc_rx_recv_frag(struct io_zc_rx_ifq *ifq, const skb_frag_t *frag, - int off, int len, bool zc_skb) + int off, int len, unsigned sock_idx, bool zc_skb) { struct io_uring_rbuf_cqe *cqe; struct page *page; @@ -732,6 +738,7 @@ static int zc_rx_recv_frag(struct io_zc_rx_ifq *ifq, const skb_frag_t *frag, cqe->region = 0; cqe->off = pgid * PAGE_SIZE + off; cqe->len = len; + cqe->sock = sock_idx; cqe->flags = 0; } else { u32 p_off, p_len, t, copied = 0; @@ -741,7 +748,7 @@ static int zc_rx_recv_frag(struct io_zc_rx_ifq *ifq, 
const skb_frag_t *frag, skb_frag_foreach_page(frag, off, len, page, p_off, p_len, t) { vaddr = kmap_local_page(page); - ret = zc_rx_copy_chunk(ifq, vaddr, p_off, p_len); + ret = zc_rx_copy_chunk(ifq, vaddr, p_off, p_len, sock_idx); kunmap_local(vaddr); if (ret < 0) @@ -758,9 +765,12 @@ static int zc_rx_recv_skb(read_descriptor_t *desc, struct sk_buff *skb, unsigned int offset, size_t len) { - struct io_zc_rx_ifq *ifq = desc->arg.data; + struct io_zc_rx_recv_args *args = desc->arg.data; + struct io_zc_rx_ifq *ifq = args->ifq; + struct socket *sock = args->sock; struct io_zc_rx_ifq *skb_ifq; struct sk_buff *frag_iter; + unsigned sock_idx = sock->zc_rx_idx & IO_ZC_IFQ_IDX_MASK; unsigned start, start_off = offset; int i, copy, end, off; bool zc_skb = true; @@ -778,7 +788,7 @@ zc_rx_recv_skb(read_descriptor_t *desc, struct sk_buff *skb, size_t to_copy; to_copy = min_t(size_t, skb_headlen(skb) - offset, len); - copied = zc_rx_copy_chunk(ifq, skb->data, offset, to_copy); + copied = zc_rx_copy_chunk(ifq, skb->data, offset, to_copy, sock_idx); if (copied < 0) { ret = copied; goto out; @@ -807,7 +817,7 @@ zc_rx_recv_skb(read_descriptor_t *desc, struct sk_buff *skb, copy = len; off = offset - start; - ret = zc_rx_recv_frag(ifq, frag, off, copy, zc_skb); + ret = zc_rx_recv_frag(ifq, frag, off, copy, sock_idx, zc_skb); if (ret < 0) goto out; @@ -850,9 +860,14 @@ zc_rx_recv_skb(read_descriptor_t *desc, struct sk_buff *skb, static int io_zc_rx_tcp_read(struct io_zc_rx_ifq *ifq, struct sock *sk) { + struct io_zc_rx_recv_args args = { + .ifq = ifq, + .sock = sk->sk_socket, + }; + read_descriptor_t rd_desc = { .count = 1, - .arg.data = ifq, + .arg.data = &args, }; return tcp_read_sock(sk, &rd_desc, zc_rx_recv_skb);
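/*
 * What the new 'sock' field buys userspace, sketched from the consumer side:
 * completions for all sockets on an ifq share one rbuf CQ, so the reaper
 * indexes a per-flow table by cqe->sock to route each buffer. The cqe layout
 * mirrors io_uring_rbuf_cqe from this series; the ring accessors and flow
 * table are simplified assumptions, not a stable uapi.
 */
#include <stdint.h>

struct rbuf_cqe {
	uint32_t off;
	uint32_t len;
	uint16_t region;
	uint8_t  sock;
	uint8_t  flags;
	uint8_t  __pad[2];
};

struct flow {
	void (*consume)(struct flow *f, const void *data, uint32_t len);
};

static void reap_rbuf_cq(struct rbuf_cqe *cqes, uint32_t *head, uint32_t tail,
			 uint32_t mask, const char *pool_base,
			 struct flow *flows)
{
	while (*head != tail) {
		const struct rbuf_cqe *cqe = &cqes[*head & mask];
		struct flow *f = &flows[cqe->sock];	/* sock_idx set by the kernel */

		f->consume(f, pool_base + cqe->off, cqe->len);
		(*head)++;
	}
}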