From patchwork Wed Feb 26 18:09:52 2025
X-Patchwork-Submitter: Keith Busch
X-Patchwork-Id: 13992881
From: Keith Busch <kbusch@meta.com>
Subject: [PATCHv6 1/5] io_uring: add support for kernel registered bvecs
Date: Wed, 26 Feb 2025 10:09:52 -0800
Message-ID: <20250226181002.2574148-2-kbusch@meta.com>

Provide an interface for the kernel to leverage the existing
pre-registered buffers that io_uring provides. User space can reference
these later to achieve zero-copy IO. User space must register an empty
fixed buffer table with io_uring in order for the kernel to make use of
it.

Signed-off-by: Keith Busch
---
 include/linux/io_uring/cmd.h |   7 ++
 io_uring/io_uring.c          |   3 +
 io_uring/rsrc.c              | 122 +++++++++++++++++++++++++++++++++--
 io_uring/rsrc.h              |   8 +++
 4 files changed, 133 insertions(+), 7 deletions(-)

diff --git a/include/linux/io_uring/cmd.h b/include/linux/io_uring/cmd.h
index 87150dc0a07cf..cf8d80d847344 100644
--- a/include/linux/io_uring/cmd.h
+++ b/include/linux/io_uring/cmd.h
@@ -4,6 +4,7 @@
 
 #include <uapi/linux/io_uring.h>
 #include <linux/io_uring_types.h>
+#include <linux/blk-mq.h>
 
 /* only top 8 bits of sqe->uring_cmd_flags for kernel internal use */
 #define IORING_URING_CMD_CANCELABLE	(1U << 30)
@@ -125,4 +126,10 @@ static inline struct io_uring_cmd_data *io_uring_cmd_get_async_data(struct io_ur
 	return cmd_to_io_kiocb(cmd)->async_data;
 }
 
+int io_buffer_register_bvec(struct io_uring_cmd *cmd, struct request *rq,
+			    void (*release)(void *), unsigned int index,
+			    unsigned int issue_flags);
+void io_buffer_unregister_bvec(struct io_uring_cmd *cmd, unsigned int index,
+			       unsigned int issue_flags);
+
 #endif /* _LINUX_IO_URING_CMD_H */
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index db1c0792def63..31e936d468051 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -3947,6 +3947,9 @@ static int __init io_uring_init(void)
 
 	io_uring_optable_init();
 
+	/* imu->perm is u8 */
+	BUILD_BUG_ON((IO_IMU_DEST | IO_IMU_SOURCE) > U8_MAX);
+
 	/*
 	 * Allow user copy in the per-command field, which starts after the
 	 * file in io_kiocb and until the opcode field. The openat2 handling
diff --git a/io_uring/rsrc.c b/io_uring/rsrc.c
index f814526982c36..5b234e84dcba6 100644
--- a/io_uring/rsrc.c
+++ b/io_uring/rsrc.c
@@ -9,6 +9,7 @@
 #include <linux/hugetlb.h>
 #include <linux/compat.h>
 #include <linux/io_uring.h>
+#include <linux/io_uring/cmd.h>
 
 #include <uapi/linux/io_uring.h>
 
@@ -104,14 +105,21 @@ int io_buffer_validate(struct iovec *iov)
 static void io_buffer_unmap(struct io_ring_ctx *ctx, struct io_rsrc_node *node)
 {
 	struct io_mapped_ubuf *imu = node->buf;
-	unsigned int i;
 
 	if (!refcount_dec_and_test(&imu->refs))
 		return;
-	for (i = 0; i < imu->nr_bvecs; i++)
-		unpin_user_page(imu->bvec[i].bv_page);
-	if (imu->acct_pages)
-		io_unaccount_mem(ctx, imu->acct_pages);
+
+	if (imu->release) {
+		imu->release(imu->priv);
+	} else {
+		unsigned int i;
+
+		for (i = 0; i < imu->nr_bvecs; i++)
+			unpin_user_page(imu->bvec[i].bv_page);
+		if (imu->acct_pages)
+			io_unaccount_mem(ctx, imu->acct_pages);
+	}
+
 	kvfree(imu);
 }
 
@@ -761,6 +769,9 @@ static struct io_rsrc_node *io_sqe_buffer_register(struct io_ring_ctx *ctx,
 	imu->len = iov->iov_len;
 	imu->nr_bvecs = nr_pages;
 	imu->folio_shift = PAGE_SHIFT;
+	imu->release = NULL;
+	imu->priv = NULL;
+	imu->perm = IO_IMU_DEST | IO_IMU_SOURCE;
 	if (coalesced)
 		imu->folio_shift = data.folio_shift;
 	refcount_set(&imu->refs, 1);
@@ -857,6 +868,94 @@ int io_sqe_buffers_register(struct io_ring_ctx *ctx, void __user *arg,
 	return ret;
 }
 
+int io_buffer_register_bvec(struct io_uring_cmd *cmd, struct request *rq,
+			    void (*release)(void *), unsigned int index,
+			    unsigned int issue_flags)
+{
+	struct io_ring_ctx *ctx = cmd_to_io_kiocb(cmd)->ctx;
+	struct io_rsrc_data *data = &ctx->buf_table;
+	struct req_iterator rq_iter;
+	struct io_mapped_ubuf *imu;
+	struct io_rsrc_node *node;
+	struct bio_vec bv, *bvec;
+	u16 nr_bvecs;
+	int ret = 0;
+
+	io_ring_submit_lock(ctx, issue_flags);
+	if (index >= data->nr) {
+		ret = -EINVAL;
+		goto unlock;
+	}
+	index = array_index_nospec(index, data->nr);
+
+	if (data->nodes[index]) {
+		ret = -EBUSY;
+		goto unlock;
+	}
+
+	node = io_rsrc_node_alloc(IORING_RSRC_BUFFER);
+	if (!node) {
+		ret = -ENOMEM;
+		goto unlock;
+	}
+
+	nr_bvecs = blk_rq_nr_phys_segments(rq);
+	imu = kvmalloc(struct_size(imu, bvec, nr_bvecs), GFP_KERNEL);
+	if (!imu) {
+		kfree(node);
+		ret = -ENOMEM;
+		goto unlock;
+	}
+
+	imu->ubuf = 0;
+	imu->len = blk_rq_bytes(rq);
+	imu->acct_pages = 0;
+	imu->folio_shift = PAGE_SHIFT;
+	imu->nr_bvecs = nr_bvecs;
+	refcount_set(&imu->refs, 1);
+	imu->release = release;
+	imu->priv = rq;
+
+	if (op_is_write(req_op(rq)))
+		imu->perm = IO_IMU_SOURCE;
+	else
+		imu->perm = IO_IMU_DEST;
+
+	bvec = imu->bvec;
+	rq_for_each_bvec(bv, rq, rq_iter)
+		*bvec++ = bv;
+
+	node->buf = imu;
+	data->nodes[index] = node;
+unlock:
+	io_ring_submit_unlock(ctx, issue_flags);
+	return ret;
+}
+EXPORT_SYMBOL_GPL(io_buffer_register_bvec);
+
+void io_buffer_unregister_bvec(struct io_uring_cmd *cmd, unsigned int index,
+			       unsigned int issue_flags)
+{
+	struct io_ring_ctx *ctx = cmd_to_io_kiocb(cmd)->ctx;
+	struct io_rsrc_data *data = &ctx->buf_table;
+	struct io_rsrc_node *node;
+
+	io_ring_submit_lock(ctx, issue_flags);
+	if (index >= data->nr)
+		goto unlock;
+	index = array_index_nospec(index, data->nr);
+
+	node = data->nodes[index];
+	if (!node || !node->buf->release)
+		goto unlock;
+
+	io_put_rsrc_node(ctx, node);
+	data->nodes[index] = NULL;
+unlock:
+	io_ring_submit_unlock(ctx, issue_flags);
+}
+EXPORT_SYMBOL_GPL(io_buffer_unregister_bvec);
+
 static int io_import_fixed(int ddir, struct iov_iter *iter,
 			   struct io_mapped_ubuf *imu,
 			   u64 buf_addr, size_t len)
@@ -871,6 +970,8 @@ static int io_import_fixed(int ddir, struct iov_iter *iter,
 	/* not inside the mapped region */
 	if (unlikely(buf_addr < imu->ubuf || buf_end > (imu->ubuf + imu->len)))
 		return -EFAULT;
+	if (!(imu->perm & (1 << ddir)))
+		return -EFAULT;
 
 	/*
 	 * Might not be a start of buffer, set size appropriately
@@ -883,8 +984,8 @@ static int io_import_fixed(int ddir, struct iov_iter *iter,
 		/*
 		 * Don't use iov_iter_advance() here, as it's really slow for
 		 * using the latter parts of a big fixed buffer - it iterates
-		 * over each segment manually. We can cheat a bit here, because
-		 * we know that:
+		 * over each segment manually. We can cheat a bit here for user
+		 * registered nodes, because we know that:
 		 *
 		 * 1) it's a BVEC iter, we set it up
 		 * 2) all bvecs are the same in size, except potentially the
@@ -898,8 +999,15 @@ static int io_import_fixed(int ddir, struct iov_iter *iter,
 		 */
 		const struct bio_vec *bvec = imu->bvec;
 
+		/*
+		 * Kernel buffer bvecs, on the other hand, don't necessarily
+		 * have the size property of user registered ones, so we have
+		 * to use the slow iter advance.
+		 */
 		if (offset < bvec->bv_len) {
 			iter->iov_offset = offset;
+		} else if (imu->release) {
+			iov_iter_advance(iter, offset);
 		} else {
 			unsigned long seg_skip;
diff --git a/io_uring/rsrc.h b/io_uring/rsrc.h
index f0e9080599646..9668804afddc4 100644
--- a/io_uring/rsrc.h
+++ b/io_uring/rsrc.h
@@ -20,6 +20,11 @@ struct io_rsrc_node {
 	};
 };
 
+enum {
+	IO_IMU_DEST	= 1 << ITER_DEST,
+	IO_IMU_SOURCE	= 1 << ITER_SOURCE,
+};
+
struct io_mapped_ubuf {
 	u64		ubuf;
 	unsigned int	len;
@@ -27,6 +32,9 @@ struct io_mapped_ubuf {
 	unsigned int	folio_shift;
 	refcount_t	refs;
 	unsigned long	acct_pages;
+	void		(*release)(void *);
+	void		*priv;
+	u8		perm;
 	struct bio_vec	bvec[] __counted_by(nr_bvecs);
 };
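A note on usage: the only user-space prerequisite for the above is an
empty fixed buffer table. A minimal sketch of that setup, assuming
liburing 2.2 or newer for io_uring_register_buffers_sparse(); this
snippet is illustrative and not part of the patch:

/*
 * Register a sparse (empty) fixed buffer table.  The kernel side can
 * then install bvecs into these slots via io_buffer_register_bvec().
 */
#include <liburing.h>
#include <stdio.h>

int main(void)
{
	struct io_uring ring;
	int ret;

	ret = io_uring_queue_init(8, &ring, 0);
	if (ret)
		return 1;

	/* 16 empty slots, no user memory pinned up front */
	ret = io_uring_register_buffers_sparse(&ring, 16);
	if (ret)
		fprintf(stderr, "sparse registration failed: %d\n", ret);

	io_uring_queue_exit(&ring);
	return ret != 0;
}

The kernel then populates and clears those slots through the new
io_buffer_register_bvec()/io_buffer_unregister_bvec() exports shown
above.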
header.b="LsIjd3CV" Received: from pps.filterd (m0001303.ppops.net [127.0.0.1]) by m0001303.ppops.net (8.18.1.2/8.18.1.2) with ESMTP id 51QFa932015227 for ; Wed, 26 Feb 2025 10:10:14 -0800 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=meta.com; h=cc :content-transfer-encoding:content-type:date:from:in-reply-to :message-id:mime-version:references:subject:to; s=s2048-2021-q4; bh=y0Y/ECEqgQaIDn98rUw+oY23UvTFuskrGecxUwUlobU=; b=LsIjd3CVm6gf yK8zsCuuexFlrR3Qbbn9V9LS13S1grjlNFxGf4cUZ8lc+YczMCBS8xSnHhR32huR 2u1qUrlbwa2vqW9WZRZkwnIb6JKBxZv9C9xXHFKvH6QquFlwAPhgHqTkBak/o+pb dFrKpOXMh8/xsuowWotXr7xY285iPANM9FZgf3HI3c71mX1WheWekJ1Z5jgSc+xb yTjtCz1f7Vl9VJTwTu4a++MNi08Lx1pCAda4/JK+VvuQi3bL3Sk+Rat6Na//J3ug c5sZMXCm5Qzk5Jb1dSFwMBPqgWyDLWOCADwoD/cUwPL2Mc0mIG9ggQObhqoZzYuh nXTRN8a7Mw== Received: from maileast.thefacebook.com ([163.114.135.16]) by m0001303.ppops.net (PPS) with ESMTPS id 4525r796ku-7 (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128 verify=NOT) for ; Wed, 26 Feb 2025 10:10:14 -0800 (PST) Received: from twshared8234.09.ash9.facebook.com (2620:10d:c0a8:1c::11) by mail.thefacebook.com (2620:10d:c0a9:6f::8fd4) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256) id 15.2.1544.14; Wed, 26 Feb 2025 18:10:04 +0000 Received: by devbig638.nha1.facebook.com (Postfix, from userid 544533) id BCEAD187C4AD8; Wed, 26 Feb 2025 10:10:05 -0800 (PST) From: Keith Busch To: , , , , CC: , , , Keith Busch Subject: [PATCHv6 2/6] io_uring: add support for kernel registered bvecs Date: Wed, 26 Feb 2025 10:09:54 -0800 Message-ID: <20250226181002.2574148-4-kbusch@meta.com> X-Mailer: git-send-email 2.43.5 In-Reply-To: <20250226181002.2574148-1-kbusch@meta.com> References: <20250226181002.2574148-1-kbusch@meta.com> Precedence: bulk X-Mailing-List: io-uring@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 X-FB-Internal: Safe X-Proofpoint-GUID: pCmP8HC2WrYlwRvwD9SpiuMW1MHE4mm0 X-Proofpoint-ORIG-GUID: pCmP8HC2WrYlwRvwD9SpiuMW1MHE4mm0 X-Proofpoint-Virus-Version: vendor=baseguard engine=ICAP:2.0.293,Aquarius:18.0.1057,Hydra:6.0.680,FMLib:17.12.68.34 definitions=2025-02-26_04,2025-02-26_01,2024-11-22_01 From: Keith Busch Provide an interface for the kernel to leverage the existing pre-registered buffers that io_uring provides. User space can reference these later to achieve zero-copy IO. User space must register an empty fixed buffer table with io_uring in order for the kernel to make use of it. 
Signed-off-by: Keith Busch --- include/linux/io_uring/cmd.h | 7 ++ io_uring/io_uring.c | 3 + io_uring/rsrc.c | 122 +++++++++++++++++++++++++++++++++-- io_uring/rsrc.h | 8 +++ 4 files changed, 133 insertions(+), 7 deletions(-) diff --git a/include/linux/io_uring/cmd.h b/include/linux/io_uring/cmd.h index 87150dc0a07cf..cf8d80d847344 100644 --- a/include/linux/io_uring/cmd.h +++ b/include/linux/io_uring/cmd.h @@ -4,6 +4,7 @@ #include #include +#include /* only top 8 bits of sqe->uring_cmd_flags for kernel internal use */ #define IORING_URING_CMD_CANCELABLE (1U << 30) @@ -125,4 +126,10 @@ static inline struct io_uring_cmd_data *io_uring_cmd_get_async_data(struct io_ur return cmd_to_io_kiocb(cmd)->async_data; } +int io_buffer_register_bvec(struct io_uring_cmd *cmd, struct request *rq, + void (*release)(void *), unsigned int index, + unsigned int issue_flags); +void io_buffer_unregister_bvec(struct io_uring_cmd *cmd, unsigned int index, + unsigned int issue_flags); + #endif /* _LINUX_IO_URING_CMD_H */ diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c index db1c0792def63..31e936d468051 100644 --- a/io_uring/io_uring.c +++ b/io_uring/io_uring.c @@ -3947,6 +3947,9 @@ static int __init io_uring_init(void) io_uring_optable_init(); + /* imu->perm is u8 */ + BUILD_BUG_ON((IO_IMU_DEST | IO_IMU_SOURCE) > U8_MAX); + /* * Allow user copy in the per-command field, which starts after the * file in io_kiocb and until the opcode field. The openat2 handling diff --git a/io_uring/rsrc.c b/io_uring/rsrc.c index f814526982c36..5b234e84dcba6 100644 --- a/io_uring/rsrc.c +++ b/io_uring/rsrc.c @@ -9,6 +9,7 @@ #include #include #include +#include #include @@ -104,14 +105,21 @@ int io_buffer_validate(struct iovec *iov) static void io_buffer_unmap(struct io_ring_ctx *ctx, struct io_rsrc_node *node) { struct io_mapped_ubuf *imu = node->buf; - unsigned int i; if (!refcount_dec_and_test(&imu->refs)) return; - for (i = 0; i < imu->nr_bvecs; i++) - unpin_user_page(imu->bvec[i].bv_page); - if (imu->acct_pages) - io_unaccount_mem(ctx, imu->acct_pages); + + if (imu->release) { + imu->release(imu->priv); + } else { + unsigned int i; + + for (i = 0; i < imu->nr_bvecs; i++) + unpin_user_page(imu->bvec[i].bv_page); + if (imu->acct_pages) + io_unaccount_mem(ctx, imu->acct_pages); + } + kvfree(imu); } @@ -761,6 +769,9 @@ static struct io_rsrc_node *io_sqe_buffer_register(struct io_ring_ctx *ctx, imu->len = iov->iov_len; imu->nr_bvecs = nr_pages; imu->folio_shift = PAGE_SHIFT; + imu->release = NULL; + imu->priv = NULL; + imu->perm = IO_IMU_DEST | IO_IMU_SOURCE; if (coalesced) imu->folio_shift = data.folio_shift; refcount_set(&imu->refs, 1); @@ -857,6 +868,94 @@ int io_sqe_buffers_register(struct io_ring_ctx *ctx, void __user *arg, return ret; } +int io_buffer_register_bvec(struct io_uring_cmd *cmd, struct request *rq, + void (*release)(void *), unsigned int index, + unsigned int issue_flags) +{ + struct io_ring_ctx *ctx = cmd_to_io_kiocb(cmd)->ctx; + struct io_rsrc_data *data = &ctx->buf_table; + struct req_iterator rq_iter; + struct io_mapped_ubuf *imu; + struct io_rsrc_node *node; + struct bio_vec bv, *bvec; + u16 nr_bvecs; + int ret = 0; + + io_ring_submit_lock(ctx, issue_flags); + if (index >= data->nr) { + ret = -EINVAL; + goto unlock; + } + index = array_index_nospec(index, data->nr); + + if (data->nodes[index]) { + ret = -EBUSY; + goto unlock; + } + + node = io_rsrc_node_alloc(IORING_RSRC_BUFFER); + if (!node) { + ret = -ENOMEM; + goto unlock; + } + + nr_bvecs = blk_rq_nr_phys_segments(rq); + imu = 
kvmalloc(struct_size(imu, bvec, nr_bvecs), GFP_KERNEL); + if (!imu) { + kfree(node); + ret = -ENOMEM; + goto unlock; + } + + imu->ubuf = 0; + imu->len = blk_rq_bytes(rq); + imu->acct_pages = 0; + imu->folio_shift = PAGE_SHIFT; + imu->nr_bvecs = nr_bvecs; + refcount_set(&imu->refs, 1); + imu->release = release; + imu->priv = rq; + + if (op_is_write(req_op(rq))) + imu->perm = IO_IMU_SOURCE; + else + imu->perm = IO_IMU_DEST; + + bvec = imu->bvec; + rq_for_each_bvec(bv, rq, rq_iter) + *bvec++ = bv; + + node->buf = imu; + data->nodes[index] = node; +unlock: + io_ring_submit_unlock(ctx, issue_flags); + return ret; +} +EXPORT_SYMBOL_GPL(io_buffer_register_bvec); + +void io_buffer_unregister_bvec(struct io_uring_cmd *cmd, unsigned int index, + unsigned int issue_flags) +{ + struct io_ring_ctx *ctx = cmd_to_io_kiocb(cmd)->ctx; + struct io_rsrc_data *data = &ctx->buf_table; + struct io_rsrc_node *node; + + io_ring_submit_lock(ctx, issue_flags); + if (index >= data->nr) + goto unlock; + index = array_index_nospec(index, data->nr); + + node = data->nodes[index]; + if (!node || !node->buf->release) + goto unlock; + + io_put_rsrc_node(ctx, node); + data->nodes[index] = NULL; +unlock: + io_ring_submit_unlock(ctx, issue_flags); +} +EXPORT_SYMBOL_GPL(io_buffer_unregister_bvec); + static int io_import_fixed(int ddir, struct iov_iter *iter, struct io_mapped_ubuf *imu, u64 buf_addr, size_t len) @@ -871,6 +970,8 @@ static int io_import_fixed(int ddir, struct iov_iter *iter, /* not inside the mapped region */ if (unlikely(buf_addr < imu->ubuf || buf_end > (imu->ubuf + imu->len))) return -EFAULT; + if (!(imu->perm & (1 << ddir))) + return -EFAULT; /* * Might not be a start of buffer, set size appropriately @@ -883,8 +984,8 @@ static int io_import_fixed(int ddir, struct iov_iter *iter, /* * Don't use iov_iter_advance() here, as it's really slow for * using the latter parts of a big fixed buffer - it iterates - * over each segment manually. We can cheat a bit here, because - * we know that: + * over each segment manually. We can cheat a bit here for user + * registered nodes, because we know that: * * 1) it's a BVEC iter, we set it up * 2) all bvecs are the same in size, except potentially the @@ -898,8 +999,15 @@ static int io_import_fixed(int ddir, struct iov_iter *iter, */ const struct bio_vec *bvec = imu->bvec; + /* + * Kernel buffer bvecs, on the other hand, don't necessarily + * have the size property of user registered ones, so we have + * to use the slow iter advance. 
+ */ if (offset < bvec->bv_len) { iter->iov_offset = offset; + } else if (imu->release) { + iov_iter_advance(iter, offset); } else { unsigned long seg_skip; diff --git a/io_uring/rsrc.h b/io_uring/rsrc.h index f0e9080599646..9668804afddc4 100644 --- a/io_uring/rsrc.h +++ b/io_uring/rsrc.h @@ -20,6 +20,11 @@ struct io_rsrc_node { }; }; +enum { + IO_IMU_DEST = 1 << ITER_DEST, + IO_IMU_SOURCE = 1 << ITER_SOURCE, +}; + struct io_mapped_ubuf { u64 ubuf; unsigned int len; @@ -27,6 +32,9 @@ struct io_mapped_ubuf { unsigned int folio_shift; refcount_t refs; unsigned long acct_pages; + void (*release)(void *); + void *priv; + u8 perm; struct bio_vec bvec[] __counted_by(nr_bvecs); }; From patchwork Wed Feb 26 18:09:57 2025 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Keith Busch X-Patchwork-Id: 13992886 Received: from mx0a-00082601.pphosted.com (mx0a-00082601.pphosted.com [67.231.145.42]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 52AA6236420 for ; Wed, 26 Feb 2025 18:10:22 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=67.231.145.42 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1740593424; cv=none; b=Y15IML/CQM2kZgV05NxSOyX4TrMtafz6nSfkdoYimRaaXW2v5bjN3OSBSLbDmgN98FaAOmURNDYj7UbdLZ86HQ2UiZgQEL5lDRWqRgzVj2UmGSARmyMXhuxF5DWdBucbegaowWxScqfLM2q/HFvNyeQuSFR0fxeG5RLOt112xY0= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1740593424; c=relaxed/simple; bh=dTwn1OBTOCnfmKkMGRk3k1UBYGieZ5Xe61oqEFDuBks=; h=From:To:CC:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version:Content-Type; b=Zg//C2Sq/CKRE1pURs9ggECn1yIMtEZFy3SkKd9G31moK1aW9RQ3emz7U5WVYXEV6sfS7ufjiQ0ptFJBfQ7LCZIkeDsFGV/p5b7+FjaaRgHVu4EtetOHaybnUQ+DfG2sZB6dPHCSkecRES+gQoGL6nVc62tbIRNZZH0KPbLCXSA= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=meta.com; spf=pass smtp.mailfrom=meta.com; dkim=pass (2048-bit key) header.d=meta.com header.i=@meta.com header.b=fzubWNqH; arc=none smtp.client-ip=67.231.145.42 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=meta.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=meta.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=meta.com header.i=@meta.com header.b="fzubWNqH" Received: from pps.filterd (m0044012.ppops.net [127.0.0.1]) by mx0a-00082601.pphosted.com (8.18.1.2/8.18.1.2) with ESMTP id 51QBHDHr021853 for ; Wed, 26 Feb 2025 10:10:21 -0800 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=meta.com; h=cc :content-transfer-encoding:content-type:date:from:in-reply-to :message-id:mime-version:references:subject:to; s=s2048-2021-q4; bh=nwmAP8sUV6CYBzcCZL2YWO0LfzfSw2Y16cunh72RRkE=; b=fzubWNqH/j45 tn6PWkXaVTYAzhdt1hpST0yWexLCxC7tAGKnY9xs7LrSRsXGB7VfKGT5wjuT7EoF bq4CzrtB+aLt+P7PCZBxtC7SyKPtSi846stkk4ooD1XSddPhkBsY6wFC28VtPvsu 2sAwxOteNPxhmac90igySbE6oc7fooPaluFzpo/Zb4RRlBdDY88DUNecAl6V+vZh SyxIGv3pre6crQUBuo8u3R5kHx56mjcb4TL/NxPio/agHFup/MxsUolw5PDQRntQ HJLzZz14RmQ+YItwx/2yZMmJZobE50iZuo1Un4nRPQwaruzciPp6ub67q+AkqEn5 znENuAX8nA== Received: from maileast.thefacebook.com ([163.114.135.16]) by mx0a-00082601.pphosted.com (PPS) with ESMTPS id 4521xrjngk-6 (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128 verify=NOT) for ; Wed, 26 Feb 2025 10:10:21 -0800 (PST) 
From: Keith Busch <kbusch@meta.com>
Subject: [PATCHv6 3/5] ublk: zc register/unregister bvec
Date: Wed, 26 Feb 2025 10:09:57 -0800
Message-ID: <20250226181002.2574148-7-kbusch@meta.com>

Provide new operations for the user to request mapping an active
request to an io_uring instance's buf_table. The user has to provide
the index at which it wants to install the buffer. A reference count is
taken on the request to ensure it can't be completed while it is active
in a ring's buf_table.

Signed-off-by: Keith Busch
---
 drivers/block/ublk_drv.c      | 119 +++++++++++++++++++++++-----------
 include/uapi/linux/ublk_cmd.h |   4 ++
 2 files changed, 86 insertions(+), 37 deletions(-)

diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
index 529085181f355..dc9ff869aa560 100644
--- a/drivers/block/ublk_drv.c
+++ b/drivers/block/ublk_drv.c
@@ -51,6 +51,9 @@
 /* private ioctl command mirror */
 #define UBLK_CMD_DEL_DEV_ASYNC	_IOC_NR(UBLK_U_CMD_DEL_DEV_ASYNC)
 
+#define UBLK_IO_REGISTER_IO_BUF		_IOC_NR(UBLK_U_IO_REGISTER_IO_BUF)
+#define UBLK_IO_UNREGISTER_IO_BUF	_IOC_NR(UBLK_U_IO_UNREGISTER_IO_BUF)
+
 /* All UBLK_F_* have to be included into UBLK_F_ALL */
 #define UBLK_F_ALL (UBLK_F_SUPPORT_ZERO_COPY \
 		| UBLK_F_URING_CMD_COMP_IN_TASK \
@@ -201,7 +204,7 @@ static inline struct ublksrv_io_desc *ublk_get_iod(struct ublk_queue *ubq,
 						   int tag);
 static inline bool ublk_dev_is_user_copy(const struct ublk_device *ub)
 {
-	return ub->dev_info.flags & UBLK_F_USER_COPY;
+	return ub->dev_info.flags & (UBLK_F_USER_COPY | UBLK_F_SUPPORT_ZERO_COPY);
 }
 
 static inline bool ublk_dev_is_zoned(const struct ublk_device *ub)
@@ -581,7 +584,7 @@ static void ublk_apply_params(struct ublk_device *ub)
 
 static inline bool ublk_support_user_copy(const struct ublk_queue *ubq)
 {
-	return ubq->flags & UBLK_F_USER_COPY;
+	return ubq->flags & (UBLK_F_USER_COPY | UBLK_F_SUPPORT_ZERO_COPY);
 }
 
 static inline bool ublk_need_req_ref(const struct ublk_queue *ubq)
@@ -1747,6 +1750,77 @@ static inline void ublk_prep_cancel(struct io_uring_cmd *cmd,
 	io_uring_cmd_mark_cancelable(cmd, issue_flags);
 }
 
+static inline struct request *__ublk_check_and_get_req(struct ublk_device *ub,
+		struct ublk_queue *ubq, int tag, size_t offset)
+{
+	struct request *req;
+
+	if (!ublk_need_req_ref(ubq))
+		return NULL;
+
+	req = blk_mq_tag_to_rq(ub->tag_set.tags[ubq->q_id], tag);
+	if (!req)
+		return NULL;
+
+	if (!ublk_get_req_ref(ubq, req))
+		return NULL;
+
+	if (unlikely(!blk_mq_request_started(req) || req->tag != tag))
+		goto fail_put;
+
+	if (!ublk_rq_has_data(req))
+		goto fail_put;
+
+	if (offset > blk_rq_bytes(req))
+		goto fail_put;
+
+	return req;
+fail_put:
+	ublk_put_req_ref(ubq, req);
+	return NULL;
+}
+
+static void ublk_io_release(void *priv)
+{
+	struct request *rq = priv;
+	struct ublk_queue *ubq = rq->mq_hctx->driver_data;
+
+	ublk_put_req_ref(ubq, rq);
+}
+
+static int ublk_register_io_buf(struct io_uring_cmd *cmd,
+				struct ublk_queue *ubq, unsigned int tag,
+				const struct ublksrv_io_cmd *ub_cmd,
+				unsigned int issue_flags)
+{
+	struct ublk_device *ub = cmd->file->private_data;
+	int index = (int)ub_cmd->addr, ret;
+	struct request *req;
+
+	req = __ublk_check_and_get_req(ub, ubq, tag, 0);
+	if (!req)
+		return -EINVAL;
+
+	ret = io_buffer_register_bvec(cmd, req, ublk_io_release, index,
+				      issue_flags);
+	if (ret) {
+		ublk_put_req_ref(ubq, req);
+		return ret;
+	}
+
+	return 0;
+}
+
+static int ublk_unregister_io_buf(struct io_uring_cmd *cmd,
+				  const struct ublksrv_io_cmd *ub_cmd,
+				  unsigned int issue_flags)
+{
+	int index = (int)ub_cmd->addr;
+
+	io_buffer_unregister_bvec(cmd, index, issue_flags);
+	return 0;
+}
+
 static int __ublk_ch_uring_cmd(struct io_uring_cmd *cmd,
 			       unsigned int issue_flags,
 			       const struct ublksrv_io_cmd *ub_cmd)
@@ -1798,6 +1872,10 @@ static int __ublk_ch_uring_cmd(struct io_uring_cmd *cmd,
 
 	ret = -EINVAL;
 	switch (_IOC_NR(cmd_op)) {
+	case UBLK_IO_REGISTER_IO_BUF:
+		return ublk_register_io_buf(cmd, ubq, tag, ub_cmd, issue_flags);
+	case UBLK_IO_UNREGISTER_IO_BUF:
+		return ublk_unregister_io_buf(cmd, ub_cmd, issue_flags);
 	case UBLK_IO_FETCH_REQ:
 		/* UBLK_IO_FETCH_REQ is only allowed before queue is setup */
 		if (ublk_queue_ready(ubq)) {
@@ -1872,36 +1950,6 @@ static int __ublk_ch_uring_cmd(struct io_uring_cmd *cmd,
 	return -EIOCBQUEUED;
 }
 
-static inline struct request *__ublk_check_and_get_req(struct ublk_device *ub,
-		struct ublk_queue *ubq, int tag, size_t offset)
-{
-	struct request *req;
-
-	if (!ublk_need_req_ref(ubq))
-		return NULL;
-
-	req = blk_mq_tag_to_rq(ub->tag_set.tags[ubq->q_id], tag);
-	if (!req)
-		return NULL;
-
-	if (!ublk_get_req_ref(ubq, req))
-		return NULL;
-
-	if (unlikely(!blk_mq_request_started(req) || req->tag != tag))
-		goto fail_put;
-
-	if (!ublk_rq_has_data(req))
-		goto fail_put;
-
-	if (offset > blk_rq_bytes(req))
-		goto fail_put;
-
-	return req;
-fail_put:
-	ublk_put_req_ref(ubq, req);
-	return NULL;
-}
-
 static inline int ublk_ch_uring_cmd_local(struct io_uring_cmd *cmd,
 					  unsigned int issue_flags)
 {
@@ -2459,7 +2507,7 @@ static int ublk_ctrl_add_dev(struct io_uring_cmd *cmd)
 		 * buffer by pwrite() to ublk char device, which can't be
 		 * used for unprivileged device
 		 */
-		if (info.flags & UBLK_F_USER_COPY)
+		if (info.flags & (UBLK_F_USER_COPY | UBLK_F_SUPPORT_ZERO_COPY))
 			return -EINVAL;
 	}
 
@@ -2527,9 +2575,6 @@ static int ublk_ctrl_add_dev(struct io_uring_cmd *cmd)
 		goto out_free_dev_number;
 	}
 
-	/* We are not ready to support zero copy */
-	ub->dev_info.flags &= ~UBLK_F_SUPPORT_ZERO_COPY;
-
 	ub->dev_info.nr_hw_queues = min_t(unsigned int,
 			ub->dev_info.nr_hw_queues, nr_cpu_ids);
 	ublk_align_max_io_size(ub);
@@ -2860,7 +2905,7 @@ static int ublk_ctrl_get_features(struct io_uring_cmd *cmd)
 {
 	const struct ublksrv_ctrl_cmd *header = io_uring_sqe_cmd(cmd->sqe);
 	void __user *argp = (void __user *)(unsigned long)header->addr;
-	u64 features = UBLK_F_ALL & ~UBLK_F_SUPPORT_ZERO_COPY;
+	u64 features = UBLK_F_ALL;
 
 	if (header->len != UBLK_FEATURES_LEN || !header->addr)
 		return -EINVAL;
diff --git a/include/uapi/linux/ublk_cmd.h b/include/uapi/linux/ublk_cmd.h
index a8bc98bb69fce..74246c926b55f 100644
--- a/include/uapi/linux/ublk_cmd.h
+++ b/include/uapi/linux/ublk_cmd.h
@@ -94,6 +94,10 @@
 	_IOWR('u', UBLK_IO_COMMIT_AND_FETCH_REQ, struct ublksrv_io_cmd)
 #define UBLK_U_IO_NEED_GET_DATA	\
 	_IOWR('u', UBLK_IO_NEED_GET_DATA, struct ublksrv_io_cmd)
+#define UBLK_U_IO_REGISTER_IO_BUF	\
+	_IOWR('u', 0x23, struct ublksrv_io_cmd)
+#define UBLK_U_IO_UNREGISTER_IO_BUF	\
+	_IOWR('u', 0x24, struct ublksrv_io_cmd)
 
 /* only ABORT means that no re-fetch */
 #define UBLK_IO_RES_OK	0
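For orientation, the server-side submission for the register command
looks roughly like the following. This is a hypothetical liburing
sketch, not part of this series: the helper name and error handling are
invented here; only UBLK_U_IO_REGISTER_IO_BUF and struct ublksrv_io_cmd
come from the patch.

/*
 * Hypothetical ublk-server helper: ask the ublk driver to install the
 * data pages of the request identified by (q_id, tag) into fixed
 * buffer slot `index` of this ring.  Assumes `dev_fd` is the open
 * /dev/ublkcN character device.
 */
#include <errno.h>
#include <string.h>
#include <liburing.h>
#include <linux/ublk_cmd.h>

static int queue_register_io_buf(struct io_uring *ring, int dev_fd,
				 __u16 q_id, __u16 tag, __u64 index)
{
	struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
	struct ublksrv_io_cmd *cmd;

	if (!sqe)
		return -EAGAIN;	/* SQ ring is full */

	io_uring_prep_rw(IORING_OP_URING_CMD, sqe, dev_fd, NULL, 0, 0);
	sqe->cmd_op = UBLK_U_IO_REGISTER_IO_BUF;

	/* the 16-byte payload lives in the SQE's inline command area */
	cmd = (struct ublksrv_io_cmd *)sqe->cmd;
	memset(cmd, 0, sizeof(*cmd));
	cmd->q_id = q_id;
	cmd->tag = tag;
	cmd->addr = index;	/* buf_table slot to install into */
	return 0;
}

A subsequent fixed-buffer operation (e.g. IORING_OP_READ_FIXED) can
then name the same index, and UBLK_U_IO_UNREGISTER_IO_BUF drops the
request reference once the server is done with the buffer.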
From patchwork Wed Feb 26 18:10:00 2025
X-Patchwork-Submitter: Keith Busch
X-Patchwork-Id: 13992879
From: Keith Busch <kbusch@meta.com>
Subject: [PATCHv6 5/6] io_uring: add abstraction for buf_table rsrc data
Date: Wed, 26 Feb 2025 10:10:00 -0800
Message-ID: <20250226181002.2574148-10-kbusch@meta.com>

We'll need to add more fields specific to the registered buffers, so
make a layer for it now. No functional change in this patch.

Reviewed-by: Caleb Sander Mateos
Reviewed-by: Pavel Begunkov
Signed-off-by: Keith Busch
---
 include/linux/io_uring_types.h |  6 +++-
 io_uring/fdinfo.c              |  8 +++---
 io_uring/nop.c                 |  2 +-
 io_uring/register.c            |  2 +-
 io_uring/rsrc.c                | 51 +++++++++++++++++-----------------
 5 files changed, 36 insertions(+), 33 deletions(-)

diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
index c0fe8a00fe53a..a05ae4cb98a4c 100644
--- a/include/linux/io_uring_types.h
+++ b/include/linux/io_uring_types.h
@@ -69,6 +69,10 @@ struct io_file_table {
 	unsigned int alloc_hint;
 };
 
+struct io_buf_table {
+	struct io_rsrc_data data;
+};
+
 struct io_hash_bucket {
 	struct hlist_head	list;
 } ____cacheline_aligned_in_smp;
@@ -293,7 +297,7 @@ struct io_ring_ctx {
 		struct io_wq_work_list	iopoll_list;
 
 		struct io_file_table	file_table;
-		struct io_rsrc_data	buf_table;
+		struct io_buf_table	buf_table;
 
 		struct io_submit_state	submit_state;
 
diff --git a/io_uring/fdinfo.c b/io_uring/fdinfo.c
index f60d0a9d505e2..d389c06cbce10 100644
--- a/io_uring/fdinfo.c
+++ b/io_uring/fdinfo.c
@@ -217,12 +217,12 @@ __cold void io_uring_show_fdinfo(struct seq_file *m, struct file *file)
 			seq_puts(m, "\n");
 		}
 	}
-	seq_printf(m, "UserBufs:\t%u\n", ctx->buf_table.nr);
-	for (i = 0; has_lock && i < ctx->buf_table.nr; i++) {
+	seq_printf(m, "UserBufs:\t%u\n", ctx->buf_table.data.nr);
+	for (i = 0; has_lock && i < ctx->buf_table.data.nr; i++) {
 		struct io_mapped_ubuf *buf = NULL;
 
-		if (ctx->buf_table.nodes[i])
-			buf = ctx->buf_table.nodes[i]->buf;
+		if (ctx->buf_table.data.nodes[i])
+			buf = ctx->buf_table.data.nodes[i]->buf;
 		if (buf)
 			seq_printf(m, "%5u: 0x%llx/%u\n", i, buf->ubuf, buf->len);
 		else
diff --git a/io_uring/nop.c b/io_uring/nop.c
index ea539531cb5f6..da8870e00eee7 100644
--- a/io_uring/nop.c
+++ b/io_uring/nop.c
@@ -66,7 +66,7 @@ int io_nop(struct io_kiocb *req, unsigned int issue_flags)
 
 		ret = -EFAULT;
 		io_ring_submit_lock(ctx, issue_flags);
-		node = io_rsrc_node_lookup(&ctx->buf_table, req->buf_index);
+		node = io_rsrc_node_lookup(&ctx->buf_table.data, req->buf_index);
 		if (node) {
 			io_req_assign_buf_node(req, node);
 			ret = 0;
diff --git a/io_uring/register.c b/io_uring/register.c
index cc23a4c205cd4..f15a8d52ad30f 100644
--- a/io_uring/register.c
+++ b/io_uring/register.c
@@ -926,7 +926,7 @@ SYSCALL_DEFINE4(io_uring_register, unsigned int, fd, unsigned int, opcode,
 	ret = __io_uring_register(ctx, opcode, arg, nr_args);
 
 	trace_io_uring_register(ctx, opcode, ctx->file_table.data.nr,
-				ctx->buf_table.nr, ret);
+				ctx->buf_table.data.nr, ret);
 	mutex_unlock(&ctx->uring_lock);
 
 	fput(file);
diff --git a/io_uring/rsrc.c b/io_uring/rsrc.c
index 5b234e84dcba6..c30a5cda08f3e 100644
--- a/io_uring/rsrc.c
+++ b/io_uring/rsrc.c
@@ -236,9 +236,9 @@ static int __io_sqe_buffers_update(struct io_ring_ctx *ctx,
 	__u32 done;
 	int i, err;
 
-	if (!ctx->buf_table.nr)
+	if (!ctx->buf_table.data.nr)
 		return -ENXIO;
-	if (up->offset + nr_args > ctx->buf_table.nr)
+	if (up->offset + nr_args > ctx->buf_table.data.nr)
 		return -EINVAL;
 
 	for (done = 0; done < nr_args; done++) {
@@ -270,9 +270,9 @@ static int __io_sqe_buffers_update(struct io_ring_ctx *ctx,
 			}
 			node->tag = tag;
 		}
-		i = array_index_nospec(up->offset + done, ctx->buf_table.nr);
-		io_reset_rsrc_node(ctx, &ctx->buf_table, i);
-		ctx->buf_table.nodes[i] = node;
+		i = array_index_nospec(up->offset + done, ctx->buf_table.data.nr);
+		io_reset_rsrc_node(ctx, &ctx->buf_table.data, i);
+		ctx->buf_table.data.nodes[i] = node;
 		if (ctx->compat)
 			user_data += sizeof(struct compat_iovec);
 		else
@@ -550,9 +550,9 @@ int io_sqe_files_register(struct io_ring_ctx *ctx, void __user *arg,
 
 int io_sqe_buffers_unregister(struct io_ring_ctx *ctx)
 {
-	if (!ctx->buf_table.nr)
+	if (!ctx->buf_table.data.nr)
 		return -ENXIO;
-	io_rsrc_data_free(ctx, &ctx->buf_table);
+	io_rsrc_data_free(ctx, &ctx->buf_table.data);
 	return 0;
 }
 
@@ -579,8 +579,8 @@ static bool headpage_already_acct(struct io_ring_ctx *ctx, struct page **pages,
 	}
 
 	/* check previously registered pages */
-	for (i = 0; i < ctx->buf_table.nr; i++) {
-		struct io_rsrc_node *node = ctx->buf_table.nodes[i];
+	for (i = 0; i < ctx->buf_table.data.nr; i++) {
+		struct io_rsrc_node *node = ctx->buf_table.data.nodes[i];
 		struct io_mapped_ubuf *imu;
 
 		if (!node)
@@ -809,7 +809,7 @@ int io_sqe_buffers_register(struct io_ring_ctx *ctx, void __user *arg,
 
 	BUILD_BUG_ON(IORING_MAX_REG_BUFFERS >= (1u << 16));
 
-	if (ctx->buf_table.nr)
+	if (ctx->buf_table.data.nr)
 		return -EBUSY;
 	if (!nr_args || nr_args > IORING_MAX_REG_BUFFERS)
 		return -EINVAL;
@@ -862,7 +862,7 @@ int io_sqe_buffers_register(struct io_ring_ctx *ctx, void __user *arg,
 		data.nodes[i] = node;
 	}
 
-	ctx->buf_table = data;
+	ctx->buf_table.data = data;
 	if (ret)
 		io_sqe_buffers_unregister(ctx);
 	return ret;
@@ -873,7 +873,7 @@ int io_buffer_register_bvec(struct io_uring_cmd *cmd, struct request *rq,
 			    unsigned int issue_flags)
 {
 	struct io_ring_ctx *ctx = cmd_to_io_kiocb(cmd)->ctx;
-	struct io_rsrc_data *data = &ctx->buf_table;
+	struct io_rsrc_data *data = &ctx->buf_table.data;
 	struct req_iterator rq_iter;
 	struct io_mapped_ubuf *imu;
 	struct io_rsrc_node *node;
@@ -937,7 +937,7 @@ void io_buffer_unregister_bvec(struct io_uring_cmd *cmd, unsigned int index,
 			       unsigned int issue_flags)
 {
 	struct io_ring_ctx *ctx = cmd_to_io_kiocb(cmd)->ctx;
-	struct io_rsrc_data *data = &ctx->buf_table;
+	struct io_rsrc_data *data = &ctx->buf_table.data;
 	struct io_rsrc_node *node;
 
 	io_ring_submit_lock(ctx, issue_flags);
@@ -1034,7 +1034,7 @@ static inline struct io_rsrc_node *io_find_buf_node(struct io_kiocb *req,
 		return req->buf_node;
 
 	io_ring_submit_lock(ctx, issue_flags);
-	node = io_rsrc_node_lookup(&ctx->buf_table, req->buf_index);
+	node = io_rsrc_node_lookup(&ctx->buf_table.data, req->buf_index);
 	if (node)
 		io_req_assign_buf_node(req, node);
 	io_ring_submit_unlock(ctx, issue_flags);
@@ -1084,10 +1084,10 @@ static int io_clone_buffers(struct io_ring_ctx *ctx, struct io_ring_ctx *src_ctx
 	if (!arg->nr && (arg->dst_off || arg->src_off))
 		return -EINVAL;
 	/* not allowed unless REPLACE is set */
-	if (ctx->buf_table.nr && !(arg->flags & IORING_REGISTER_DST_REPLACE))
+	if (ctx->buf_table.data.nr && !(arg->flags & IORING_REGISTER_DST_REPLACE))
 		return -EBUSY;
 
-	nbufs = src_ctx->buf_table.nr;
+	nbufs = src_ctx->buf_table.data.nr;
 	if (!arg->nr)
 		arg->nr = nbufs;
 	else if (arg->nr > nbufs)
@@ -1097,13 +1097,13 @@ static int io_clone_buffers(struct io_ring_ctx *ctx, struct io_ring_ctx *src_ctx
 	if (check_add_overflow(arg->nr, arg->dst_off, &nbufs))
 		return -EOVERFLOW;
 
-	ret = io_rsrc_data_alloc(&data, max(nbufs, ctx->buf_table.nr));
+	ret = io_rsrc_data_alloc(&data, max(nbufs, ctx->buf_table.data.nr));
 	if (ret)
 		return ret;
 
 	/* Fill entries in data from dst that won't overlap with src */
-	for (i = 0; i < min(arg->dst_off, ctx->buf_table.nr); i++) {
-		struct io_rsrc_node *src_node = ctx->buf_table.nodes[i];
+	for (i = 0; i < min(arg->dst_off, ctx->buf_table.data.nr); i++) {
+		struct io_rsrc_node *src_node = ctx->buf_table.data.nodes[i];
 
 		if (src_node) {
 			data.nodes[i] = src_node;
@@ -1112,7 +1112,7 @@ static int io_clone_buffers(struct io_ring_ctx *ctx, struct io_ring_ctx *src_ctx
 	}
 
 	ret = -ENXIO;
-	nbufs = src_ctx->buf_table.nr;
+	nbufs = src_ctx->buf_table.data.nr;
 	if (!nbufs)
 		goto out_free;
 	ret = -EINVAL;
@@ -1132,7 +1132,7 @@ static int io_clone_buffers(struct io_ring_ctx *ctx, struct io_ring_ctx *src_ctx
 	while (nr--) {
 		struct io_rsrc_node *dst_node, *src_node;
 
-		src_node = io_rsrc_node_lookup(&src_ctx->buf_table, i);
+		src_node = io_rsrc_node_lookup(&src_ctx->buf_table.data, i);
 		if (!src_node) {
 			dst_node = NULL;
 		} else {
@@ -1154,7 +1154,7 @@ static int io_clone_buffers(struct io_ring_ctx *ctx, struct io_ring_ctx *src_ctx
 	 * old and new nodes at this point.
 	 */
 	if (arg->flags & IORING_REGISTER_DST_REPLACE)
-		io_rsrc_data_free(ctx, &ctx->buf_table);
+		io_sqe_buffers_unregister(ctx);
 
 	/*
 	 * ctx->buf_table must be empty now - either the contents are being
@@ -1162,10 +1162,9 @@ static int io_clone_buffers(struct io_ring_ctx *ctx, struct io_ring_ctx *src_ctx
 	 * copied to a ring that does not have buffers yet (checked at function
 	 * entry).
 	 */
-	WARN_ON_ONCE(ctx->buf_table.nr);
-	ctx->buf_table = data;
+	WARN_ON_ONCE(ctx->buf_table.data.nr);
+	ctx->buf_table.data = data;
 	return 0;
-
 out_free:
 	io_rsrc_data_free(ctx, &data);
 	return ret;
@@ -1190,7 +1189,7 @@ int io_register_clone_buffers(struct io_ring_ctx *ctx, void __user *arg)
 		return -EFAULT;
 	if (buf.flags & ~(IORING_REGISTER_SRC_REGISTERED|IORING_REGISTER_DST_REPLACE))
 		return -EINVAL;
-	if (!(buf.flags & IORING_REGISTER_DST_REPLACE) && ctx->buf_table.nr)
+	if (!(buf.flags & IORING_REGISTER_DST_REPLACE) && ctx->buf_table.data.nr)
 		return -EBUSY;
 	if (memchr_inv(buf.pad, 0, sizeof(buf.pad)))
 		return -EINVAL;

From patchwork Wed Feb 26 18:10:02 2025
X-Patchwork-Submitter: Keith Busch
X-Patchwork-Id: 13992884
b=a9jK4GulX6QjG0ZVjYkl59eC2MH7RyXH0IHAiH7P9q1CRWQwM3qHUzstFMqYLuPaJNG4E8Tjvq7Ac00uLdah9eXdGy3zTGB6X38yBb3ojw5tJurCxdZXWHZ7fkFsOcukVZ6Ax/90om755uFv3QuHQWNNQGZXF/Rx7GWVtxU1wxk= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=meta.com; spf=pass smtp.mailfrom=meta.com; dkim=pass (2048-bit key) header.d=meta.com header.i=@meta.com header.b=Dd3NM7ZA; arc=none smtp.client-ip=67.231.153.30 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=meta.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=meta.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=meta.com header.i=@meta.com header.b="Dd3NM7ZA" Received: from pps.filterd (m0109331.ppops.net [127.0.0.1]) by mx0a-00082601.pphosted.com (8.18.1.2/8.18.1.2) with ESMTP id 51QF0jRE019840 for ; Wed, 26 Feb 2025 10:10:18 -0800 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=meta.com; h=cc :content-transfer-encoding:content-type:date:from:in-reply-to :message-id:mime-version:references:subject:to; s=s2048-2021-q4; bh=hbunRQgRX+ReeqwUe8G8nMtr7PKoD5fxxeT5ERKFNGQ=; b=Dd3NM7ZA7oRh eFW5/kUo6oevtHZOEOHzkFIqG1Dhc+a+cIfqy60De1qw5+qT6ywMifET66lRBq1m 53OMRduI2EQtCX84nM+hHRmZAW2uKyXO0i4B/srZS58vqFBWlpkkJoOgm/FSOr1E AsjpK6m9cngh1DKuX6jhCx5dwcE/xIj0WboI47tNS87DRRjY25DfZCYqs1oDtIwu vng5i0doRB8JF7PdTSW6bZ3Un7dMSHcZoHJfaCx99KAnCVuZehAFTMeR2WSH5U/q +0kJ2oUUpqUFbw5qUfEQSUuLaXY70FGCKRUsUk+zbMqgg8p7ym0jgW6zSREtvhCv rQ49pRIBpQ== Received: from maileast.thefacebook.com ([163.114.135.16]) by mx0a-00082601.pphosted.com (PPS) with ESMTPS id 45257j1ewn-13 (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128 verify=NOT) for ; Wed, 26 Feb 2025 10:10:17 -0800 (PST) Received: from twshared55211.03.ash8.facebook.com (2620:10d:c0a8:1c::1b) by mail.thefacebook.com (2620:10d:c0a9:6f::237c) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256) id 15.2.1544.14; Wed, 26 Feb 2025 18:10:02 +0000 Received: by devbig638.nha1.facebook.com (Postfix, from userid 544533) id 2468D187C4AEA; Wed, 26 Feb 2025 10:10:07 -0800 (PST) From: Keith Busch To: , , , , CC: , , , Keith Busch Subject: [PATCHv6 6/6] io_uring: cache nodes and mapped buffers Date: Wed, 26 Feb 2025 10:10:02 -0800 Message-ID: <20250226181002.2574148-12-kbusch@meta.com> X-Mailer: git-send-email 2.43.5 In-Reply-To: <20250226181002.2574148-1-kbusch@meta.com> References: <20250226181002.2574148-1-kbusch@meta.com> Precedence: bulk X-Mailing-List: io-uring@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 X-FB-Internal: Safe X-Proofpoint-ORIG-GUID: EwNax3Kvc2m2CXYaZAOy89_uZKgnBkKK X-Proofpoint-GUID: EwNax3Kvc2m2CXYaZAOy89_uZKgnBkKK X-Proofpoint-Virus-Version: vendor=baseguard engine=ICAP:2.0.293,Aquarius:18.0.1057,Hydra:6.0.680,FMLib:17.12.68.34 definitions=2025-02-26_04,2025-02-26_01,2024-11-22_01 From: Keith Busch Frequent alloc/free cycles on these is pretty costly. Use an io cache to more efficiently reuse these buffers. 
Signed-off-by: Keith Busch
---
 include/linux/io_uring_types.h |  18 ++---
 io_uring/filetable.c           |   2 +-
 io_uring/rsrc.c                | 123 +++++++++++++++++++++++++--------
 io_uring/rsrc.h                |   2 +-
 4 files changed, 107 insertions(+), 38 deletions(-)

diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
index a05ae4cb98a4c..fda3221de2174 100644
--- a/include/linux/io_uring_types.h
+++ b/include/linux/io_uring_types.h
@@ -69,8 +69,18 @@ struct io_file_table {
 	unsigned int alloc_hint;
 };
 
+struct io_alloc_cache {
+	void **entries;
+	unsigned int nr_cached;
+	unsigned int max_cached;
+	unsigned int elem_size;
+	unsigned int init_clear;
+};
+
 struct io_buf_table {
 	struct io_rsrc_data data;
+	struct io_alloc_cache node_cache;
+	struct io_alloc_cache imu_cache;
 };
 
 struct io_hash_bucket {
@@ -224,14 +234,6 @@ struct io_submit_state {
 	struct blk_plug plug;
 };
 
-struct io_alloc_cache {
-	void **entries;
-	unsigned int nr_cached;
-	unsigned int max_cached;
-	unsigned int elem_size;
-	unsigned int init_clear;
-};
-
 struct io_ring_ctx {
 	/* const or read-mostly hot data */
 	struct {
diff --git a/io_uring/filetable.c b/io_uring/filetable.c
index dd8eeec97acf6..a21660e3145ab 100644
--- a/io_uring/filetable.c
+++ b/io_uring/filetable.c
@@ -68,7 +68,7 @@ static int io_install_fixed_file(struct io_ring_ctx *ctx, struct file *file,
 	if (slot_index >= ctx->file_table.data.nr)
 		return -EINVAL;
 
-	node = io_rsrc_node_alloc(IORING_RSRC_FILE);
+	node = io_rsrc_node_alloc(ctx, IORING_RSRC_FILE);
 	if (!node)
 		return -ENOMEM;
 
diff --git a/io_uring/rsrc.c b/io_uring/rsrc.c
index c30a5cda08f3e..8823f15d8fe2e 100644
--- a/io_uring/rsrc.c
+++ b/io_uring/rsrc.c
@@ -33,6 +33,9 @@ static struct io_rsrc_node *io_sqe_buffer_register(struct io_ring_ctx *ctx,
 #define IORING_MAX_FIXED_FILES	(1U << 20)
 #define IORING_MAX_REG_BUFFERS	(1U << 14)
 
+#define IO_CACHED_BVECS_SEGS	32
+#define IO_CACHED_ELEMS		64
+
 int __io_account_mem(struct user_struct *user, unsigned long nr_pages)
 {
 	unsigned long page_limit, cur_pages, new_pages;
@@ -102,6 +105,22 @@ int io_buffer_validate(struct iovec *iov)
 	return 0;
 }
 
+static struct io_mapped_ubuf *io_alloc_imu(struct io_ring_ctx *ctx,
+					   int nr_bvecs)
+{
+	if (nr_bvecs <= IO_CACHED_BVECS_SEGS)
+		return io_cache_alloc(&ctx->buf_table.imu_cache, GFP_KERNEL);
+	return kvmalloc(struct_size_t(struct io_mapped_ubuf, bvec, nr_bvecs),
+			GFP_KERNEL);
+}
+
+static void io_free_imu(struct io_ring_ctx *ctx, struct io_mapped_ubuf *imu)
+{
+	if (imu->nr_bvecs > IO_CACHED_BVECS_SEGS ||
+	    !io_alloc_cache_put(&ctx->buf_table.imu_cache, imu))
+		kvfree(imu);
+}
+
 static void io_buffer_unmap(struct io_ring_ctx *ctx, struct io_rsrc_node *node)
 {
 	struct io_mapped_ubuf *imu = node->buf;
@@ -120,22 +139,35 @@ static void io_buffer_unmap(struct io_ring_ctx *ctx, struct io_rsrc_node *node)
 		io_unaccount_mem(ctx, imu->acct_pages);
 	}
 
-	kvfree(imu);
+	io_free_imu(ctx, imu);
 }
 
-struct io_rsrc_node *io_rsrc_node_alloc(int type)
+struct io_rsrc_node *io_rsrc_node_alloc(struct io_ring_ctx *ctx, int type)
 {
 	struct io_rsrc_node *node;
 
-	node = kzalloc(sizeof(*node), GFP_KERNEL);
+	if (type == IORING_RSRC_FILE)
+		node = kmalloc(sizeof(*node), GFP_KERNEL);
+	else
+		node = io_cache_alloc(&ctx->buf_table.node_cache, GFP_KERNEL);
 	if (node) {
 		node->type = type;
 		node->refs = 1;
+		node->tag = 0;
+		node->file_ptr = 0;
 	}
 	return node;
 }
 
-__cold void io_rsrc_data_free(struct io_ring_ctx *ctx, struct io_rsrc_data *data)
+static __cold void __io_rsrc_data_free(struct io_rsrc_data *data)
+{
+	kvfree(data->nodes);
+	data->nodes = NULL;
+	data->nr = 0;
+}
+
+__cold void io_rsrc_data_free(struct io_ring_ctx *ctx,
+			      struct io_rsrc_data *data)
 {
 	if (!data->nr)
 		return;
@@ -143,9 +175,7 @@ __cold void io_rsrc_data_free(struct io_ring_ctx *ctx, struct io_rsrc_data *data
 		if (data->nodes[data->nr])
 			io_put_rsrc_node(ctx, data->nodes[data->nr]);
 	}
-	kvfree(data->nodes);
-	data->nodes = NULL;
-	data->nr = 0;
+	__io_rsrc_data_free(data);
 }
 
 __cold int io_rsrc_data_alloc(struct io_rsrc_data *data, unsigned nr)
@@ -159,6 +189,33 @@ __cold int io_rsrc_data_alloc(struct io_rsrc_data *data, unsigned nr)
 	return -ENOMEM;
 }
 
+static __cold int io_rsrc_buffer_alloc(struct io_buf_table *table, unsigned nr)
+{
+	const int imu_cache_size = struct_size_t(struct io_mapped_ubuf, bvec,
+						 IO_CACHED_BVECS_SEGS);
+	const int node_size = sizeof(struct io_rsrc_node);
+	int ret;
+
+	ret = io_rsrc_data_alloc(&table->data, nr);
+	if (ret)
+		return ret;
+
+	if (io_alloc_cache_init(&table->node_cache, IO_CACHED_ELEMS,
+				node_size, 0))
+		goto free_data;
+
+	if (io_alloc_cache_init(&table->imu_cache, IO_CACHED_ELEMS,
+				imu_cache_size, 0))
+		goto free_cache;
+
+	return 0;
+free_cache:
+	io_alloc_cache_free(&table->node_cache, kfree);
+free_data:
+	__io_rsrc_data_free(&table->data);
+	return -ENOMEM;
+}
+
 static int __io_sqe_files_update(struct io_ring_ctx *ctx,
 				 struct io_uring_rsrc_update2 *up,
 				 unsigned nr_args)
@@ -208,7 +265,7 @@ static int __io_sqe_files_update(struct io_ring_ctx *ctx,
 			err = -EBADF;
 			break;
 		}
-		node = io_rsrc_node_alloc(IORING_RSRC_FILE);
+		node = io_rsrc_node_alloc(ctx, IORING_RSRC_FILE);
 		if (!node) {
 			err = -ENOMEM;
 			fput(file);
@@ -460,6 +517,8 @@ void io_free_rsrc_node(struct io_ring_ctx *ctx, struct io_rsrc_node *node)
 	case IORING_RSRC_BUFFER:
 		if (node->buf)
 			io_buffer_unmap(ctx, node);
+		if (io_alloc_cache_put(&ctx->buf_table.node_cache, node))
+			return;
 		break;
 	default:
 		WARN_ON_ONCE(1);
@@ -528,7 +587,7 @@ int io_sqe_files_register(struct io_ring_ctx *ctx, void __user *arg,
 			goto fail;
 		}
 		ret = -ENOMEM;
-		node = io_rsrc_node_alloc(IORING_RSRC_FILE);
+		node = io_rsrc_node_alloc(ctx, IORING_RSRC_FILE);
 		if (!node) {
 			fput(file);
 			goto fail;
@@ -548,11 +607,19 @@ int io_sqe_files_register(struct io_ring_ctx *ctx, void __user *arg,
 	return ret;
 }
 
+static void io_rsrc_buffer_free(struct io_ring_ctx *ctx,
+				struct io_buf_table *table)
+{
+	io_rsrc_data_free(ctx, &table->data);
+	io_alloc_cache_free(&table->node_cache, kfree);
+	io_alloc_cache_free(&table->imu_cache, kfree);
+}
+
 int io_sqe_buffers_unregister(struct io_ring_ctx *ctx)
 {
 	if (!ctx->buf_table.data.nr)
 		return -ENXIO;
-	io_rsrc_data_free(ctx, &ctx->buf_table.data);
+	io_rsrc_buffer_free(ctx, &ctx->buf_table);
 	return 0;
 }
 
@@ -733,7 +800,7 @@ static struct io_rsrc_node *io_sqe_buffer_register(struct io_ring_ctx *ctx,
 	if (!iov->iov_base)
 		return NULL;
 
-	node = io_rsrc_node_alloc(IORING_RSRC_BUFFER);
+	node = io_rsrc_node_alloc(ctx, IORING_RSRC_BUFFER);
 	if (!node)
 		return ERR_PTR(-ENOMEM);
 	node->buf = NULL;
@@ -753,7 +820,7 @@ static struct io_rsrc_node *io_sqe_buffer_register(struct io_ring_ctx *ctx,
 		coalesced = io_coalesce_buffer(&pages, &nr_pages, &data);
 	}
 
-	imu = kvmalloc(struct_size(imu, bvec, nr_pages), GFP_KERNEL);
+	imu = io_alloc_imu(ctx, nr_pages);
 	if (!imu)
 		goto done;
 
@@ -789,7 +856,7 @@ static struct io_rsrc_node *io_sqe_buffer_register(struct io_ring_ctx *ctx,
 	}
 done:
 	if (ret) {
-		kvfree(imu);
+		io_free_imu(ctx, imu);
 		if (node)
 			io_put_rsrc_node(ctx, node);
 		node = ERR_PTR(ret);
@@ -802,9 +869,9 @@ int io_sqe_buffers_register(struct io_ring_ctx *ctx, void __user *arg,
 			    unsigned int nr_args, u64 __user *tags)
 {
 	struct page *last_hpage = NULL;
-	struct io_rsrc_data data;
 	struct iovec fast_iov, *iov = &fast_iov;
 	const struct iovec __user *uvec;
+	struct io_buf_table table;
 	int i, ret;
 
 	BUILD_BUG_ON(IORING_MAX_REG_BUFFERS >= (1u << 16));
@@ -813,13 +880,14 @@ int io_sqe_buffers_register(struct io_ring_ctx *ctx, void __user *arg,
 		return -EBUSY;
 	if (!nr_args || nr_args > IORING_MAX_REG_BUFFERS)
 		return -EINVAL;
-	ret = io_rsrc_data_alloc(&data, nr_args);
+	ret = io_rsrc_buffer_alloc(&table, nr_args);
 	if (ret)
 		return ret;
 
 	if (!arg)
 		memset(iov, 0, sizeof(*iov));
 
+	ctx->buf_table = table;
 	for (i = 0; i < nr_args; i++) {
 		struct io_rsrc_node *node;
 		u64 tag = 0;
@@ -859,10 +927,8 @@ int io_sqe_buffers_register(struct io_ring_ctx *ctx, void __user *arg,
 			}
 			node->tag = tag;
 		}
-		data.nodes[i] = node;
+		table.data.nodes[i] = node;
 	}
-
-	ctx->buf_table.data = data;
 	if (ret)
 		io_sqe_buffers_unregister(ctx);
 	return ret;
@@ -893,14 +959,15 @@ int io_buffer_register_bvec(struct io_uring_cmd *cmd, struct request *rq,
 		goto unlock;
 	}
 
-	node = io_rsrc_node_alloc(IORING_RSRC_BUFFER);
+	node = io_rsrc_node_alloc(ctx, IORING_RSRC_BUFFER);
 	if (!node) {
 		ret = -ENOMEM;
 		goto unlock;
 	}
 
 	nr_bvecs = blk_rq_nr_phys_segments(rq);
-	imu = kvmalloc(struct_size(imu, bvec, nr_bvecs), GFP_KERNEL);
+
+	imu = io_alloc_imu(ctx, nr_bvecs);
 	if (!imu) {
 		kfree(node);
 		ret = -ENOMEM;
@@ -1066,7 +1133,7 @@ static void lock_two_rings(struct io_ring_ctx *ctx1, struct io_ring_ctx *ctx2)
 static int io_clone_buffers(struct io_ring_ctx *ctx, struct io_ring_ctx *src_ctx,
 			    struct io_uring_clone_buffers *arg)
 {
-	struct io_rsrc_data data;
+	struct io_buf_table table;
 	int i, ret, off, nr;
 	unsigned int nbufs;
 
@@ -1097,7 +1164,7 @@ static int io_clone_buffers(struct io_ring_ctx *ctx, struct io_ring_ctx *src_ctx
 	if (check_add_overflow(arg->nr, arg->dst_off, &nbufs))
 		return -EOVERFLOW;
 
-	ret = io_rsrc_data_alloc(&data, max(nbufs, ctx->buf_table.data.nr));
+	ret = io_rsrc_buffer_alloc(&table, max(nbufs, ctx->buf_table.data.nr));
 	if (ret)
 		return ret;
 
@@ -1106,7 +1173,7 @@ static int io_clone_buffers(struct io_ring_ctx *ctx, struct io_ring_ctx *src_ctx
 		struct io_rsrc_node *src_node = ctx->buf_table.data.nodes[i];
 
 		if (src_node) {
-			data.nodes[i] = src_node;
+			table.data.nodes[i] = src_node;
 			src_node->refs++;
 		}
 	}
@@ -1136,7 +1203,7 @@ static int io_clone_buffers(struct io_ring_ctx *ctx, struct io_ring_ctx *src_ctx
 		if (!src_node) {
 			dst_node = NULL;
 		} else {
-			dst_node = io_rsrc_node_alloc(IORING_RSRC_BUFFER);
+			dst_node = io_rsrc_node_alloc(ctx, IORING_RSRC_BUFFER);
 			if (!dst_node) {
 				ret = -ENOMEM;
 				goto out_free;
@@ -1145,12 +1212,12 @@ static int io_clone_buffers(struct io_ring_ctx *ctx, struct io_ring_ctx *src_ctx
 			refcount_inc(&src_node->buf->refs);
 			dst_node->buf = src_node->buf;
 		}
-		data.nodes[off++] = dst_node;
+		table.data.nodes[off++] = dst_node;
 		i++;
 	}
 
 	/*
-	 * If asked for replace, put the old table. data->nodes[] holds both
+	 * If asked for replace, put the old table. table.data->nodes[] holds both
 	 * old and new nodes at this point.
 	 */
 	if (arg->flags & IORING_REGISTER_DST_REPLACE)
@@ -1163,10 +1230,10 @@ static int io_clone_buffers(struct io_ring_ctx *ctx, struct io_ring_ctx *src_ctx
 	 * copied to a ring that does not have buffers yet (checked at function
 	 * entry).
	 */
 	WARN_ON_ONCE(ctx->buf_table.data.nr);
-	ctx->buf_table.data = data;
+	ctx->buf_table = table;
 	return 0;
 out_free:
-	io_rsrc_data_free(ctx, &data);
+	io_rsrc_buffer_free(ctx, &table);
 	return ret;
 }
 
diff --git a/io_uring/rsrc.h b/io_uring/rsrc.h
index 9668804afddc4..4b39d8104df19 100644
--- a/io_uring/rsrc.h
+++ b/io_uring/rsrc.h
@@ -47,7 +47,7 @@ struct io_imu_folio_data {
 	unsigned int nr_folios;
 };
 
-struct io_rsrc_node *io_rsrc_node_alloc(int type);
+struct io_rsrc_node *io_rsrc_node_alloc(struct io_ring_ctx *ctx, int type);
 void io_free_rsrc_node(struct io_ring_ctx *ctx, struct io_rsrc_node *node);
 void io_rsrc_data_free(struct io_ring_ctx *ctx, struct io_rsrc_data *data);
 int io_rsrc_data_alloc(struct io_rsrc_data *data, unsigned nr);
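
For reference, a similar user-space sketch of the size gate that
io_alloc_imu()/io_free_imu() apply above (hypothetical names, reusing
the obj_cache helpers sketched in the 6/6 commit message): only
mappings with few enough segments fit the fixed cache element size;
larger ones go straight to the heap in both directions.

/*
 * Hypothetical sketch, not the kernel code: the cache element is sized
 * for CACHED_SEGS segments, so anything bigger must bypass the cache.
 * The kernel uses io_cache_alloc()/io_alloc_cache_put() with
 * kvmalloc()/kvfree() as the fallback.
 */
#include <stdlib.h>

#define CACHED_SEGS 32

struct seg { void *addr; size_t len; };

struct mapped_buf {
	unsigned int nr_segs;
	struct seg segs[];	/* flexible array, like imu->bvec */
};

/* From the earlier obj_cache sketch; its elem_size must be
 * sizeof(struct mapped_buf) + CACHED_SEGS * sizeof(struct seg). */
struct obj_cache;
void *obj_cache_get(struct obj_cache *c);
int obj_cache_put(struct obj_cache *c, void *obj);

static struct mapped_buf *mapped_buf_alloc(struct obj_cache *c,
					   unsigned int nr_segs)
{
	if (nr_segs <= CACHED_SEGS)
		return obj_cache_get(c);	/* cache-sized: try reuse */
	return malloc(sizeof(struct mapped_buf) +
		      nr_segs * sizeof(struct seg));
}

static void mapped_buf_free(struct obj_cache *c, struct mapped_buf *buf)
{
	/* Too big for a cache slot, or the cache is full: free directly. */
	if (buf->nr_segs > CACHED_SEGS || !obj_cache_put(c, buf))
		free(buf);
}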