Message ID | 20240918021010.12894-1-xue01.he@samsung.com (mailing list archive)
---|---
State | New
Series | [v8,RESEND] io_uring: releasing CPU resources when polling
On 24/08/12 1:59AM, hexue wrote:
> This patch adds a new hybrid poll at the io_uring level. It also adds a
> setup flag, "IORING_SETUP_HY_POLL", to give applications an interface to
> enable the new hybrid polling flexibly.

Hi, just a gentle ping - is this still in the merge window, or are there
any comments on this patch?

--
hexue
On 9/25/24 09:29, hexue wrote:
> On 24/08/12 1:59AM, hexue wrote:
>> This patch adds a new hybrid poll at the io_uring level. It also adds a
>> setup flag, "IORING_SETUP_HY_POLL", to give applications an interface to
>> enable the new hybrid polling flexibly.
>
> Hi, just a gentle ping - is this still in the merge window, or are there
> any comments on this patch?

I don't have a strong opinion on the feature, but the open question we
should get some decision on is whether it's really well applicable to a
good enough set of apps / workloads, if it'll even be useful in the
future and/or for other vendors, and if the merit outweighs the extra
8 bytes + 1 flag per io_kiocb and the overhead of 1-2 static key'able
checks in hot paths.
On 9/25/2024 12:12, Pavel Begunkov wrote:
> I don't have a strong opinion on the feature, but the open question we
> should get some decision on is whether it's really well applicable to a
> good enough set of apps / workloads, if it'll even be useful in the
> future and/or for other vendors, and if the merit outweighs the extra
> 8 bytes + 1 flag per io_kiocb and the overhead of 1-2 static key'able
> checks in hot paths.

IMHO, releasing some of the CPU resources during the polling process may
be appropriate for workloads whose bottleneck is CPU resource constraints,
such as some database applications: in addition to completing IO
operations, the CPU also needs to process data, like compression and
decompression. In a high-concurrency state, not only does polling take up
a lot of CPU time, but operations like calculation and processing also
need to compete for CPU time. In this case, the performance of the
application may be difficult to improve.

The MultiRead interface of RocksDB has been adapted to io_uring, so I
used db_bench to construct a situation with high CPU pressure and
compared the performance. The test configuration is as follows:

-------------------------------------------------------------------
CPU Model    Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz
CPU Cores    8
Memory       16G
SSD          Samsung PM9A3
-------------------------------------------------------------------

Test case:
./db_bench --benchmarks=multireadrandom,stats
           --duration=60
           --threads=4/8/16
           --use_direct_reads=true
           --db=/mnt/rocks/test_db
           --wal_dir=/mnt/rocks/test_db
           --key_size=4
           --value_size=4096
           -cache_size=0
           -use_existing_db=1
           -batch_size=256
           -multiread_batched=true
           -multiread_stride=0

------------------------------------------------------
Test result:
            original    optimized
threads     ops/sec     ops/sec       CPU Utilization
16          139300      189075        100%*8
8           138639      133191        90%*8
4           71475       68361         90%*8
------------------------------------------------------

When the number of threads exceeds the number of CPU cores, the database
throughput does not increase significantly. However, hybrid polling can
release some CPU resources during the polling process, so that part of
the CPU time can be used for frequent data processing and other
operations, which speeds up the reading process, thereby improving
throughput and optimizing database performance. I tried different
compression strategies and got results similar to the above table
(~30% throughput improvement).

As more database applications adapt to the io_uring engine, I think
hybrid poll may have potential in some scenarios.

--
Xue
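For reference, a minimal sketch of how an application could opt in to the
mode discussed above. This is not from the patch or the thread; it assumes
a liburing build whose headers carry the IORING_SETUP_HY_POLL flag from
this series (renamed to IORING_SETUP_HYBRID_IOPOLL later in the thread),
and it is untested:

#include <stdio.h>
#include <string.h>
#include <liburing.h>

#ifndef IORING_SETUP_HY_POLL
#define IORING_SETUP_HY_POLL	(1U << 17)	/* value from this v8 patch */
#endif

int main(void)
{
	struct io_uring ring;
	struct io_uring_params p = { };
	int ret;

	/* hybrid poll is a mode of IOPOLL, so both flags are requested */
	p.flags = IORING_SETUP_IOPOLL | IORING_SETUP_HY_POLL;

	ret = io_uring_queue_init_params(8, &ring, &p);
	if (ret < 0) {
		fprintf(stderr, "queue_init: %s\n", strerror(-ret));
		return 1;
	}

	/*
	 * Reads/writes submitted on this ring must target files opened with
	 * O_DIRECT on a device whose driver supports poll queues, as with
	 * regular IOPOLL rings.
	 */
	io_uring_queue_exit(&ring);
	return 0;
}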
On 10/23/24 8:38 PM, hexue wrote:
> On 9/25/2024 12:12, Pavel Begunkov wrote:
>> I don't have a strong opinion on the feature, but the open question we
>> should get some decision on is whether it's really well applicable to a
>> good enough set of apps / workloads, if it'll even be useful in the
>> future and/or for other vendors, and if the merit outweighs the extra
>> 8 bytes + 1 flag per io_kiocb and the overhead of 1-2 static key'able
>> checks in hot paths.
>
> IMHO, releasing some of the CPU resources during the polling process may
> be appropriate for workloads whose bottleneck is CPU resource constraints,
> such as some database applications: in addition to completing IO
> operations, the CPU also needs to process data, like compression and
> decompression. In a high-concurrency state, not only does polling take up
> a lot of CPU time, but operations like calculation and processing also
> need to compete for CPU time. In this case, the performance of the
> application may be difficult to improve.
>
> The MultiRead interface of RocksDB has been adapted to io_uring, so I
> used db_bench to construct a situation with high CPU pressure and
> compared the performance. The test configuration is as follows:
>
> -------------------------------------------------------------------
> CPU Model    Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz
> CPU Cores    8
> Memory       16G
> SSD          Samsung PM9A3
> -------------------------------------------------------------------
>
> Test case:
> ./db_bench --benchmarks=multireadrandom,stats
>            --duration=60
>            --threads=4/8/16
>            --use_direct_reads=true
>            --db=/mnt/rocks/test_db
>            --wal_dir=/mnt/rocks/test_db
>            --key_size=4
>            --value_size=4096
>            -cache_size=0
>            -use_existing_db=1
>            -batch_size=256
>            -multiread_batched=true
>            -multiread_stride=0
>
> ------------------------------------------------------
> Test result:
>             original    optimized
> threads     ops/sec     ops/sec       CPU Utilization
> 16          139300      189075        100%*8
> 8           138639      133191        90%*8
> 4           71475       68361         90%*8
> ------------------------------------------------------
>
> When the number of threads exceeds the number of CPU cores, the database
> throughput does not increase significantly. However, hybrid polling can
> release some CPU resources during the polling process, so that part of
> the CPU time can be used for frequent data processing and other
> operations, which speeds up the reading process, thereby improving
> throughput and optimizing database performance. I tried different
> compression strategies and got results similar to the above table
> (~30% throughput improvement).
>
> As more database applications adapt to the io_uring engine, I think
> hybrid poll may have potential in some scenarios.

Thanks for posting some numbers on that part, that's useful. I do
think the feature is useful as well, but I still have some issues
with the implementation. Below is an incremental patch on top of
yours to resolve some of those, potentially. Issues:

1) The patch still reads a bit like a hack, in that it doesn't seem to
   care about following the current style. This reads a bit lazy/sloppy
   or unfinished. I've fixed that up.

2) Appropriate member and function naming.

3) Same as above, it doesn't care about proper placement of additions to
   structs. Again this is a bit lazy and wasteful, attention should be
   paid to where additions are placed to not needlessly bloat
   structures, or place members in cache unfortunate locations. For
   example, available_time is just placed at the end of io_ring_ctx,
   why? It's a submission side member, and there's room with other
   related members. Not only is the placement now where you'd want it to
   be, memory wise, it also doesn't add 8 bytes to io_uring_ctx.

4) Like the above, the io_kiocb bloat is, by far, the worst. Seems to me
   that we can share space with the polling hash_node. This obviously
   needs careful vetting, haven't done that yet. IOPOLL setups should
   not be using poll at all. This needs extra checking. The poll_state
   can go with cancel_seq_set, as there's a hole there anyway. And just
   like that, rather than add 24b to io_kiocb, it doesn't take any extra
   space at all.

5) HY_POLL is a terrible name. It's associated with IOPOLL, and so let's
   please use a name related to that. And require IOPOLL being set with
   HYBRID_IOPOLL, as it's really a variant of that. Makes it clear that
   HYBRID_IOPOLL is really just a mode of operation for IOPOLL, and it
   can't exist without that.

Please take a look at this incremental and test it, and then post a v9
that looks a lot more finished. Caveat - I haven't tested this one at
all. Thanks!

diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
index c79ee9fe86d4..6cf6a45835e5 100644
--- a/include/linux/io_uring_types.h
+++ b/include/linux/io_uring_types.h
@@ -238,6 +238,8 @@ struct io_ring_ctx {
 	struct io_rings			*rings;
 	struct percpu_ref		refs;
 
+	u64				poll_available_time;
+
 	clockid_t			clockid;
 	enum tk_offsets			clock_offset;
 
@@ -433,9 +435,6 @@ struct io_ring_ctx {
 	struct page			**sqe_pages;
 
 	struct page			**cq_wait_page;
-
-	/* for io_uring hybrid poll*/
-	u64				available_time;
 };
 
 struct io_tw_state {
@@ -647,9 +646,22 @@ struct io_kiocb {
 
 	atomic_t			refs;
 	bool				cancel_seq_set;
+	bool				poll_state;
 	struct io_task_work		io_task_work;
-	/* for polled requests, i.e. IORING_OP_POLL_ADD and async armed poll */
-	struct hlist_node		hash_node;
+	union {
+		/*
+		 * for polled requests, i.e. IORING_OP_POLL_ADD and async armed
+		 * poll
+		 */
+		struct hlist_node	hash_node;
+		/*
+		 * For IOPOLL setup queues, with hybrid polling
+		 */
+		struct {
+			u64		iopoll_start;
+			u64		iopoll_end;
+		};
+	};
 	/* internal polling, see IORING_FEAT_FAST_POLL */
 	struct async_poll		*apoll;
 	/* opcode allocated if it needs to store data for async defer */
@@ -665,10 +677,6 @@ struct io_kiocb {
 		u64			extra1;
 		u64			extra2;
 	} big_cqe;
-	/* for io_uring hybrid iopoll */
-	bool				poll_state;
-	u64				iopoll_start;
-	u64				iopoll_end;
 };
 
 struct io_overflow_cqe {
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index 034997a1e507..5a290a56af6c 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -199,7 +199,7 @@ enum io_uring_sqe_flags_bit {
  * Removes indirection through the SQ index array.
  */
 #define IORING_SETUP_NO_SQARRAY		(1U << 16)
-#define IORING_SETUP_HY_POLL		(1U << 17)
+#define IORING_SETUP_HYBRID_IOPOLL	(1U << 17)
 
 enum io_uring_op {
 	IORING_OP_NOP,
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 9631e10d681b..35071442fb70 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -307,7 +307,7 @@ static __cold struct io_ring_ctx *io_ring_ctx_alloc(struct io_uring_params *p)
 		goto err;
 
 	ctx->flags = p->flags;
-	ctx->available_time = LLONG_MAX;
+	ctx->poll_available_time = LLONG_MAX;
 	atomic_set(&ctx->cq_wait_nr, IO_CQ_WAKE_INIT);
 	init_waitqueue_head(&ctx->sqo_sq_wait);
 	INIT_LIST_HEAD(&ctx->sqd_list);
@@ -3646,6 +3646,11 @@ static __cold int io_uring_create(unsigned entries, struct io_uring_params *p,
 	ctx->clockid = CLOCK_MONOTONIC;
 	ctx->clock_offset = 0;
 
+	/* HYBRID_IOPOLL only valid with IOPOLL */
+	if ((ctx->flags & (IORING_SETUP_IOPOLL|IORING_SETUP_HYBRID_IOPOLL)) ==
+	    IORING_SETUP_HYBRID_IOPOLL)
+		return -EINVAL;
+
 	if (!(ctx->flags & IORING_SETUP_NO_SQARRAY))
 		static_branch_inc(&io_key_has_sqarray);
 
@@ -3808,7 +3813,7 @@ static long io_uring_setup(u32 entries, struct io_uring_params __user *params)
 			IORING_SETUP_SQE128 | IORING_SETUP_CQE32 |
 			IORING_SETUP_SINGLE_ISSUER | IORING_SETUP_DEFER_TASKRUN |
 			IORING_SETUP_NO_MMAP | IORING_SETUP_REGISTERED_FD_ONLY |
-			IORING_SETUP_NO_SQARRAY | IORING_SETUP_HY_POLL))
+			IORING_SETUP_NO_SQARRAY | IORING_SETUP_HYBRID_IOPOLL))
 		return -EINVAL;
 
 	return io_uring_create(entries, &p, params);
diff --git a/io_uring/rw.c b/io_uring/rw.c
index b86cef10ed72..6f7bf40df85a 100644
--- a/io_uring/rw.c
+++ b/io_uring/rw.c
@@ -782,13 +782,6 @@ static bool need_complete_io(struct io_kiocb *req)
 		S_ISBLK(file_inode(req->file)->i_mode);
 }
 
-static void init_hybrid_poll(struct io_kiocb *req)
-{
-	/* make sure every req only block once*/
-	req->poll_state = false;
-	req->iopoll_start = ktime_get_ns();
-}
-
 static int io_rw_init_file(struct io_kiocb *req, fmode_t mode, int rw_type)
 {
 	struct io_rw *rw = io_kiocb_to_cmd(req, struct io_rw);
@@ -826,8 +819,11 @@ static int io_rw_init_file(struct io_kiocb *req, fmode_t mode, int rw_type)
 		kiocb->ki_flags |= IOCB_HIPRI;
 		kiocb->ki_complete = io_complete_rw_iopoll;
 		req->iopoll_completed = 0;
-		if (ctx->flags & IORING_SETUP_HY_POLL)
-			init_hybrid_poll(req);
+		if (ctx->flags & IORING_SETUP_HYBRID_IOPOLL) {
+			/* make sure every req only blocks once */
+			req->poll_state = false;
+			req->iopoll_start = ktime_get_ns();
+		}
 	} else {
 		if (kiocb->ki_flags & IOCB_HIPRI)
 			return -EINVAL;
@@ -1126,27 +1122,24 @@ void io_rw_fail(struct io_kiocb *req)
 	io_req_set_res(req, res, req->cqe.flags);
 }
 
-static int io_uring_classic_poll(struct io_kiocb *req,
-		struct io_comp_batch *iob, unsigned int poll_flags)
+static int io_uring_classic_poll(struct io_kiocb *req, struct io_comp_batch *iob,
+				 unsigned int poll_flags)
 {
-	int ret;
 	struct file *file = req->file;
 
 	if (req->opcode == IORING_OP_URING_CMD) {
 		struct io_uring_cmd *ioucmd;
 
 		ioucmd = io_kiocb_to_cmd(req, struct io_uring_cmd);
-		ret = file->f_op->uring_cmd_iopoll(ioucmd, iob,
-				poll_flags);
+		return file->f_op->uring_cmd_iopoll(ioucmd, iob, poll_flags);
 	} else {
 		struct io_rw *rw = io_kiocb_to_cmd(req, struct io_rw);
 
-		ret = file->f_op->iopoll(&rw->kiocb, iob, poll_flags);
+		return file->f_op->iopoll(&rw->kiocb, iob, poll_flags);
 	}
-	return ret;
 }
 
-static u64 io_delay(struct io_ring_ctx *ctx, struct io_kiocb *req)
+static u64 io_hybrid_iopoll_delay(struct io_ring_ctx *ctx, struct io_kiocb *req)
 {
 	struct hrtimer_sleeper timer;
 	enum hrtimer_mode mode;
@@ -1156,11 +1149,11 @@ static u64 io_delay(struct io_ring_ctx *ctx, struct io_kiocb *req)
 	if (req->poll_state)
 		return 0;
 
-	if (ctx->available_time == LLONG_MAX)
+	if (ctx->poll_available_time == LLONG_MAX)
 		return 0;
 
-	/* Using half running time to do schedul */
-	sleep_time = ctx->available_time / 2;
+	/* Use half the running time to do schedule */
+	sleep_time = ctx->poll_available_time / 2;
 
 	kt = ktime_set(0, sleep_time);
 	req->poll_state = true;
@@ -1177,7 +1170,6 @@ static u64 io_delay(struct io_ring_ctx *ctx, struct io_kiocb *req)
 	hrtimer_cancel(&timer.timer);
 	__set_current_state(TASK_RUNNING);
 	destroy_hrtimer_on_stack(&timer.timer);
-
 	return sleep_time;
 }
 
@@ -1185,19 +1177,21 @@ static int io_uring_hybrid_poll(struct io_kiocb *req,
 		struct io_comp_batch *iob, unsigned int poll_flags)
 {
 	struct io_ring_ctx *ctx = req->ctx;
-	int ret;
 	u64 runtime, sleep_time;
+	int ret;
 
-	sleep_time = io_delay(ctx, req);
+	sleep_time = io_hybrid_iopoll_delay(ctx, req);
 	ret = io_uring_classic_poll(req, iob, poll_flags);
 	req->iopoll_end = ktime_get_ns();
 	runtime = req->iopoll_end - req->iopoll_start - sleep_time;
 
-	/* use minimize sleep time if there are different speed
-	 * drivers, it could get more completions from fast one
+	/*
+	 * Use minimum sleep time if we're polling devices with different
+	 * latencies. We could get more completions from the faster ones.
 	 */
-	if (ctx->available_time > runtime)
-		ctx->available_time = runtime;
+	if (ctx->poll_available_time > runtime)
+		ctx->poll_available_time = runtime;
+
 	return ret;
 }
 
@@ -1227,7 +1221,7 @@ int io_do_iopoll(struct io_ring_ctx *ctx, bool force_nonspin)
 		if (READ_ONCE(req->iopoll_completed))
 			break;
 
-		if (ctx->flags & IORING_SETUP_HY_POLL)
+		if (ctx->flags & IORING_SETUP_HYBRID_IOPOLL)
 			ret = io_uring_hybrid_poll(req, &iob, poll_flags);
 		else
 			ret = io_uring_classic_poll(req, &iob, poll_flags);
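As a quick way to exercise the new validation in the incremental above, a
hedged sketch (not from the thread; untested, assumes liburing and the
IORING_SETUP_HYBRID_IOPOLL value from the diff): requesting the hybrid flag
without IORING_SETUP_IOPOLL should now be rejected with -EINVAL.

#include <errno.h>
#include <stdio.h>
#include <liburing.h>

#ifndef IORING_SETUP_HYBRID_IOPOLL
#define IORING_SETUP_HYBRID_IOPOLL	(1U << 17)	/* value from the diff above */
#endif

int main(void)
{
	struct io_uring ring;
	int ret;

	/* HYBRID_IOPOLL alone is rejected by the new check in io_uring_create() */
	ret = io_uring_queue_init(8, &ring, IORING_SETUP_HYBRID_IOPOLL);
	if (ret == -EINVAL) {
		printf("rejected without IOPOLL, as expected\n");
	} else if (ret == 0) {
		printf("unexpectedly accepted\n");
		io_uring_queue_exit(&ring);
	}
	return 0;
}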
On 10/24/24 15:18, Jens Axboe wrote:
> On 10/23/24 8:38 PM, hexue wrote:
>> On 9/25/2024 12:12, Pavel Begunkov wrote:
...
>> When the number of threads exceeds the number of CPU cores, the database
>> throughput does not increase significantly. However, hybrid polling can
>> release some CPU resources during the polling process, so that part of
>> the CPU time can be used for frequent data processing and other
>> operations, which speeds up the reading process, thereby improving
>> throughput and optimizing database performance. I tried different
>> compression strategies and got results similar to the above table
>> (~30% throughput improvement).
>>
>> As more database applications adapt to the io_uring engine, I think
>> hybrid poll may have potential in some scenarios.
>
> Thanks for posting some numbers on that part, that's useful. I do
> think the feature is useful as well, but I still have some issues
> with the implementation. Below is an incremental patch on top of
> yours to resolve some of those, potentially. Issues:
>
> 1) The patch still reads a bit like a hack, in that it doesn't seem to
>    care about following the current style. This reads a bit lazy/sloppy
>    or unfinished. I've fixed that up.
>
> 2) Appropriate member and function naming.
>
> 3) Same as above, it doesn't care about proper placement of additions to
>    structs. Again this is a bit lazy and wasteful, attention should be
>    paid to where additions are placed to not needlessly bloat
>    structures, or place members in cache unfortunate locations. For
>    example, available_time is just placed at the end of io_ring_ctx,
>    why? It's a submission side member, and there's room with other
>    related members. Not only is the placement now where you'd want it to
>    be, memory wise, it also doesn't add 8 bytes to io_uring_ctx.
>
> 4) Like the above, the io_kiocb bloat is, by far, the worst. Seems to me
>    that we can share space with the polling hash_node. This obviously
>    needs careful vetting, haven't done that yet. IOPOLL setups should
>    not be using poll at all. This needs extra checking. The poll_state
>    can go with cancel_seq_set, as there's a hole there anyway. And just
>    like that, rather than add 24b to io_kiocb, it doesn't take any extra
>    space at all.
>
> 5) HY_POLL is a terrible name. It's associated with IOPOLL, and so let's
>    please use a name related to that. And require IOPOLL being set with
>    HYBRID_IOPOLL, as it's really a variant of that. Makes it clear that
>    HYBRID_IOPOLL is really just a mode of operation for IOPOLL, and it
>    can't exist without that.
>
> Please take a look at this incremental and test it, and then post a v9
> that looks a lot more finished. Caveat - I haven't tested this one at
> all. Thanks!
>
> diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
> index c79ee9fe86d4..6cf6a45835e5 100644
> --- a/include/linux/io_uring_types.h
> +++ b/include/linux/io_uring_types.h
> @@ -238,6 +238,8 @@ struct io_ring_ctx {
>  	struct io_rings			*rings;
>  	struct percpu_ref		refs;
>
> +	u64				poll_available_time;
> +
>  	clockid_t			clockid;
>  	enum tk_offsets			clock_offset;
>
> @@ -433,9 +435,6 @@ struct io_ring_ctx {
>  	struct page			**sqe_pages;
>
>  	struct page			**cq_wait_page;
> -
> -	/* for io_uring hybrid poll*/
> -	u64				available_time;
>  };
>
>  struct io_tw_state {
> @@ -647,9 +646,22 @@ struct io_kiocb {
>
>  	atomic_t			refs;
>  	bool				cancel_seq_set;
> +	bool				poll_state;

As mentioned briefly before, that can be just a req->flags flag

>  	struct io_task_work		io_task_work;
> -	/* for polled requests, i.e. IORING_OP_POLL_ADD and async armed poll */
> -	struct hlist_node		hash_node;
> +	union {
> +		/*
> +		 * for polled requests, i.e. IORING_OP_POLL_ADD and async armed
> +		 * poll
> +		 */
> +		struct hlist_node	hash_node;
> +		/*
> +		 * For IOPOLL setup queues, with hybrid polling
> +		 */
> +		struct {
> +			u64		iopoll_start;
> +			u64		iopoll_end;

And IIRC it doesn't need to store the end as it's used immediately
after it's set in the same function.

> +		};
> +	};
>  	/* internal polling, see IORING_FEAT_FAST_POLL */
>  	struct async_poll		*apoll;
>  	/* opcode allocated if it needs to store data for async defer */
> @@ -665,10 +677,6 @@ struct io_kiocb {
>  		u64			extra1;
>  		u64			extra2;
>  	} big_cqe;
> -	/* for io_uring hybrid iopoll */
> -	bool				poll_state;
> -	u64				iopoll_start;
> -	u64				iopoll_end;
>  };
> ...
On 10/24/24 8:26 AM, Pavel Begunkov wrote:
> On 10/24/24 15:18, Jens Axboe wrote:
>> On 10/23/24 8:38 PM, hexue wrote:
>>> On 9/25/2024 12:12, Pavel Begunkov wrote:
> ...
>>> When the number of threads exceeds the number of CPU cores, the database
>>> throughput does not increase significantly. However, hybrid polling can
>>> release some CPU resources during the polling process, so that part of
>>> the CPU time can be used for frequent data processing and other
>>> operations, which speeds up the reading process, thereby improving
>>> throughput and optimizing database performance. I tried different
>>> compression strategies and got results similar to the above table
>>> (~30% throughput improvement).
>>>
>>> As more database applications adapt to the io_uring engine, I think
>>> hybrid poll may have potential in some scenarios.
>>
>> Thanks for posting some numbers on that part, that's useful. I do
>> think the feature is useful as well, but I still have some issues
>> with the implementation. Below is an incremental patch on top of
>> yours to resolve some of those, potentially. Issues:
>>
>> 1) The patch still reads a bit like a hack, in that it doesn't seem to
>>    care about following the current style. This reads a bit lazy/sloppy
>>    or unfinished. I've fixed that up.
>>
>> 2) Appropriate member and function naming.
>>
>> 3) Same as above, it doesn't care about proper placement of additions to
>>    structs. Again this is a bit lazy and wasteful, attention should be
>>    paid to where additions are placed to not needlessly bloat
>>    structures, or place members in cache unfortunate locations. For
>>    example, available_time is just placed at the end of io_ring_ctx,
>>    why? It's a submission side member, and there's room with other
>>    related members. Not only is the placement now where you'd want it to
>>    be, memory wise, it also doesn't add 8 bytes to io_uring_ctx.
>>
>> 4) Like the above, the io_kiocb bloat is, by far, the worst. Seems to me
>>    that we can share space with the polling hash_node. This obviously
>>    needs careful vetting, haven't done that yet. IOPOLL setups should
>>    not be using poll at all. This needs extra checking. The poll_state
>>    can go with cancel_seq_set, as there's a hole there anyway. And just
>>    like that, rather than add 24b to io_kiocb, it doesn't take any extra
>>    space at all.
>>
>> 5) HY_POLL is a terrible name. It's associated with IOPOLL, and so let's
>>    please use a name related to that. And require IOPOLL being set with
>>    HYBRID_IOPOLL, as it's really a variant of that. Makes it clear that
>>    HYBRID_IOPOLL is really just a mode of operation for IOPOLL, and it
>>    can't exist without that.
>>
>> Please take a look at this incremental and test it, and then post a v9
>> that looks a lot more finished. Caveat - I haven't tested this one at
>> all. Thanks!
>>
>> diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
>> index c79ee9fe86d4..6cf6a45835e5 100644
>> --- a/include/linux/io_uring_types.h
>> +++ b/include/linux/io_uring_types.h
>> @@ -238,6 +238,8 @@ struct io_ring_ctx {
>>  	struct io_rings			*rings;
>>  	struct percpu_ref		refs;
>>
>> +	u64				poll_available_time;
>> +
>>  	clockid_t			clockid;
>>  	enum tk_offsets			clock_offset;
>>
>> @@ -433,9 +435,6 @@ struct io_ring_ctx {
>>  	struct page			**sqe_pages;
>>
>>  	struct page			**cq_wait_page;
>> -
>> -	/* for io_uring hybrid poll*/
>> -	u64				available_time;
>>  };
>>
>>  struct io_tw_state {
>> @@ -647,9 +646,22 @@ struct io_kiocb {
>>
>>  	atomic_t			refs;
>>  	bool				cancel_seq_set;
>> +	bool				poll_state;
>
> As mentioned briefly before, that can be just a req->flags flag

That'd be even better, I generally despise random bool addition.

>>  	struct io_task_work		io_task_work;
>> -	/* for polled requests, i.e. IORING_OP_POLL_ADD and async armed poll */
>> -	struct hlist_node		hash_node;
>> +	union {
>> +		/*
>> +		 * for polled requests, i.e. IORING_OP_POLL_ADD and async armed
>> +		 * poll
>> +		 */
>> +		struct hlist_node	hash_node;
>> +		/*
>> +		 * For IOPOLL setup queues, with hybrid polling
>> +		 */
>> +		struct {
>> +			u64		iopoll_start;
>> +			u64		iopoll_end;
>
> And IIRC it doesn't need to store the end as it's used immediately
> after it's set in the same function.

Nice, that opens up the door for less esoteric sharing as well. And
yeah, I'd just use:

runtime = ktime_get_ns() - req->iopoll_start - sleep_time;

in io_uring_hybrid_poll() and kill it entirely, doesn't even need a
local variable there. And then shove iopoll_start into the union with
comp_list/apoll_events.

My main points are really: don't randomly sprinkle additions to
structs. Think about if they are needed, and if they are, be a bit
smarter about where to place them. The original patch did neither of
those, and that's a non-starter.
On 10/24/24 15:40, Jens Axboe wrote:
> On 10/24/24 8:26 AM, Pavel Begunkov wrote:
>> On 10/24/24 15:18, Jens Axboe wrote:
>>> On 10/23/24 8:38 PM, hexue wrote:
>>>> On 9/25/2024 12:12, Pavel Begunkov wrote:
>> ...
>>>> When the number of threads exceeds the number of CPU cores, the database
>>>> throughput does not increase significantly. However, hybrid polling can
>>>> release some CPU resources during the polling process, so that part of
>>>> the CPU time can be used for frequent data processing and other
>>>> operations, which speeds up the reading process, thereby improving
>>>> throughput and optimizing database performance. I tried different
>>>> compression strategies and got results similar to the above table
>>>> (~30% throughput improvement).
>>>>
>>>> As more database applications adapt to the io_uring engine, I think
>>>> hybrid poll may have potential in some scenarios.
>>>
>>> Thanks for posting some numbers on that part, that's useful. I do
>>> think the feature is useful as well, but I still have some issues
>>> with the implementation. Below is an incremental patch on top of
>>> yours to resolve some of those, potentially. Issues:
>>>
>>> 1) The patch still reads a bit like a hack, in that it doesn't seem to
>>>    care about following the current style. This reads a bit lazy/sloppy
>>>    or unfinished. I've fixed that up.
>>>
>>> 2) Appropriate member and function naming.
>>>
>>> 3) Same as above, it doesn't care about proper placement of additions to
>>>    structs. Again this is a bit lazy and wasteful, attention should be
>>>    paid to where additions are placed to not needlessly bloat
>>>    structures, or place members in cache unfortunate locations. For
>>>    example, available_time is just placed at the end of io_ring_ctx,
>>>    why? It's a submission side member, and there's room with other
>>>    related members. Not only is the placement now where you'd want it to
>>>    be, memory wise, it also doesn't add 8 bytes to io_uring_ctx.
>>>
>>> 4) Like the above, the io_kiocb bloat is, by far, the worst. Seems to me
>>>    that we can share space with the polling hash_node. This obviously
>>>    needs careful vetting, haven't done that yet. IOPOLL setups should
>>>    not be using poll at all. This needs extra checking. The poll_state
>>>    can go with cancel_seq_set, as there's a hole there anyway. And just
>>>    like that, rather than add 24b to io_kiocb, it doesn't take any extra
>>>    space at all.
>>>
>>> 5) HY_POLL is a terrible name. It's associated with IOPOLL, and so let's
>>>    please use a name related to that. And require IOPOLL being set with
>>>    HYBRID_IOPOLL, as it's really a variant of that. Makes it clear that
>>>    HYBRID_IOPOLL is really just a mode of operation for IOPOLL, and it
>>>    can't exist without that.
>>>
>>> Please take a look at this incremental and test it, and then post a v9
>>> that looks a lot more finished. Caveat - I haven't tested this one at
>>> all. Thanks!
>>>
>>> diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
>>> index c79ee9fe86d4..6cf6a45835e5 100644
>>> --- a/include/linux/io_uring_types.h
>>> +++ b/include/linux/io_uring_types.h
>>> @@ -238,6 +238,8 @@ struct io_ring_ctx {
>>>  	struct io_rings			*rings;
>>>  	struct percpu_ref		refs;
>>>
>>> +	u64				poll_available_time;
>>> +
>>>  	clockid_t			clockid;
>>>  	enum tk_offsets			clock_offset;
>>>
>>> @@ -433,9 +435,6 @@ struct io_ring_ctx {
>>>  	struct page			**sqe_pages;
>>>
>>>  	struct page			**cq_wait_page;
>>> -
>>> -	/* for io_uring hybrid poll*/
>>> -	u64				available_time;
>>>  };
>>>
>>>  struct io_tw_state {
>>> @@ -647,9 +646,22 @@ struct io_kiocb {
>>>
>>>  	atomic_t			refs;
>>>  	bool				cancel_seq_set;
>>> +	bool				poll_state;
>>
>> As mentioned briefly before, that can be just a req->flags flag
>
> That'd be even better, I generally despise random bool addition.
>
>>>  	struct io_task_work		io_task_work;
>>> -	/* for polled requests, i.e. IORING_OP_POLL_ADD and async armed poll */
>>> -	struct hlist_node		hash_node;
>>> +	union {
>>> +		/*
>>> +		 * for polled requests, i.e. IORING_OP_POLL_ADD and async armed
>>> +		 * poll
>>> +		 */
>>> +		struct hlist_node	hash_node;
>>> +		/*
>>> +		 * For IOPOLL setup queues, with hybrid polling
>>> +		 */
>>> +		struct {
>>> +			u64		iopoll_start;
>>> +			u64		iopoll_end;
>>
>> And IIRC it doesn't need to store the end as it's used immediately
>> after it's set in the same function.
>
> Nice, that opens up the door for less esoteric sharing as well. And
> yeah, I'd just use:
>
> runtime = ktime_get_ns() - req->iopoll_start - sleep_time;
>
> in io_uring_hybrid_poll() and kill it entirely, doesn't even need a
> local variable there. And then shove iopoll_start into the union with
> comp_list/apoll_events.

That's with what the current request is hooked into the list, IOW such
aliasing will corrupt the request
On 10/24/24 8:49 AM, Pavel Begunkov wrote:
> On 10/24/24 15:40, Jens Axboe wrote:
>> On 10/24/24 8:26 AM, Pavel Begunkov wrote:
>>> On 10/24/24 15:18, Jens Axboe wrote:
>>>> On 10/23/24 8:38 PM, hexue wrote:
>>>>> On 9/25/2024 12:12, Pavel Begunkov wrote:
>>> ...
>>>>> When the number of threads exceeds the number of CPU cores, the database
>>>>> throughput does not increase significantly. However, hybrid polling can
>>>>> release some CPU resources during the polling process, so that part of
>>>>> the CPU time can be used for frequent data processing and other
>>>>> operations, which speeds up the reading process, thereby improving
>>>>> throughput and optimizing database performance. I tried different
>>>>> compression strategies and got results similar to the above table
>>>>> (~30% throughput improvement).
>>>>>
>>>>> As more database applications adapt to the io_uring engine, I think
>>>>> hybrid poll may have potential in some scenarios.
>>>>
>>>> Thanks for posting some numbers on that part, that's useful. I do
>>>> think the feature is useful as well, but I still have some issues
>>>> with the implementation. Below is an incremental patch on top of
>>>> yours to resolve some of those, potentially. Issues:
>>>>
>>>> 1) The patch still reads a bit like a hack, in that it doesn't seem to
>>>>    care about following the current style. This reads a bit lazy/sloppy
>>>>    or unfinished. I've fixed that up.
>>>>
>>>> 2) Appropriate member and function naming.
>>>>
>>>> 3) Same as above, it doesn't care about proper placement of additions to
>>>>    structs. Again this is a bit lazy and wasteful, attention should be
>>>>    paid to where additions are placed to not needlessly bloat
>>>>    structures, or place members in cache unfortunate locations. For
>>>>    example, available_time is just placed at the end of io_ring_ctx,
>>>>    why? It's a submission side member, and there's room with other
>>>>    related members. Not only is the placement now where you'd want it to
>>>>    be, memory wise, it also doesn't add 8 bytes to io_uring_ctx.
>>>>
>>>> 4) Like the above, the io_kiocb bloat is, by far, the worst. Seems to me
>>>>    that we can share space with the polling hash_node. This obviously
>>>>    needs careful vetting, haven't done that yet. IOPOLL setups should
>>>>    not be using poll at all. This needs extra checking. The poll_state
>>>>    can go with cancel_seq_set, as there's a hole there anyway. And just
>>>>    like that, rather than add 24b to io_kiocb, it doesn't take any extra
>>>>    space at all.
>>>>
>>>> 5) HY_POLL is a terrible name. It's associated with IOPOLL, and so let's
>>>>    please use a name related to that. And require IOPOLL being set with
>>>>    HYBRID_IOPOLL, as it's really a variant of that. Makes it clear that
>>>>    HYBRID_IOPOLL is really just a mode of operation for IOPOLL, and it
>>>>    can't exist without that.
>>>>
>>>> Please take a look at this incremental and test it, and then post a v9
>>>> that looks a lot more finished. Caveat - I haven't tested this one at
>>>> all. Thanks!
>>>>
>>>> diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
>>>> index c79ee9fe86d4..6cf6a45835e5 100644
>>>> --- a/include/linux/io_uring_types.h
>>>> +++ b/include/linux/io_uring_types.h
>>>> @@ -238,6 +238,8 @@ struct io_ring_ctx {
>>>>  	struct io_rings			*rings;
>>>>  	struct percpu_ref		refs;
>>>>
>>>> +	u64				poll_available_time;
>>>> +
>>>>  	clockid_t			clockid;
>>>>  	enum tk_offsets			clock_offset;
>>>>
>>>> @@ -433,9 +435,6 @@ struct io_ring_ctx {
>>>>  	struct page			**sqe_pages;
>>>>
>>>>  	struct page			**cq_wait_page;
>>>> -
>>>> -	/* for io_uring hybrid poll*/
>>>> -	u64				available_time;
>>>>  };
>>>>
>>>>  struct io_tw_state {
>>>> @@ -647,9 +646,22 @@ struct io_kiocb {
>>>>
>>>>  	atomic_t			refs;
>>>>  	bool				cancel_seq_set;
>>>> +	bool				poll_state;
>>>
>>> As mentioned briefly before, that can be just a req->flags flag
>>
>> That'd be even better, I generally despise random bool addition.
>>
>>>>  	struct io_task_work		io_task_work;
>>>> -	/* for polled requests, i.e. IORING_OP_POLL_ADD and async armed poll */
>>>> -	struct hlist_node		hash_node;
>>>> +	union {
>>>> +		/*
>>>> +		 * for polled requests, i.e. IORING_OP_POLL_ADD and async armed
>>>> +		 * poll
>>>> +		 */
>>>> +		struct hlist_node	hash_node;
>>>> +		/*
>>>> +		 * For IOPOLL setup queues, with hybrid polling
>>>> +		 */
>>>> +		struct {
>>>> +			u64		iopoll_start;
>>>> +			u64		iopoll_end;
>>>
>>> And IIRC it doesn't need to store the end as it's used immediately
>>> after it's set in the same function.
>>
>> Nice, that opens up the door for less esoteric sharing as well. And
>> yeah, I'd just use:
>>
>> runtime = ktime_get_ns() - req->iopoll_start - sleep_time;
>>
>> in io_uring_hybrid_poll() and kill it entirely, doesn't even need a
>> local variable there. And then shove iopoll_start into the union with
>> comp_list/apoll_events.
>
> That's with what the current request is hooked into the list, IOW such
> aliasing will corrupt the request

Ah true, well some other spot then, should be pretty easy to find 8
bytes for iopoll_start. As mentioned, the point is really just to THINK
about where it should go, rather than lazily just shove it at the end
like no thought has been given to it.
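As an aside on the placement point, a small userspace illustration (a toy,
not the real io_uring structures; layouts are compiler/ABI dependent,
x86-64-ish assumed) of why filling an existing padding hole, as suggested
for poll_state next to cancel_seq_set, costs no extra space, while lazily
appending a member at the end of a struct usually does:

#include <stdbool.h>
#include <stdio.h>

struct in_hole {			/* new flag reuses existing padding */
	int	refs;
	bool	cancel_seq_set;
	bool	poll_state;		/* fits in the hole after the bool */
	void	*io_task_work;
};

struct at_end {				/* new flag appended at the end */
	int	refs;
	bool	cancel_seq_set;
	void	*io_task_work;
	bool	poll_state;		/* forces a new tail plus padding */
};

int main(void)
{
	/* typically prints 16 vs 24 bytes on 64-bit ABIs */
	printf("in_hole: %zu bytes, at_end: %zu bytes\n",
	       sizeof(struct in_hole), sizeof(struct at_end));
	return 0;
}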
On 10/24/24 14:49, Jens Axboe wrote:
> On 10/24/24 8:49 AM, Pavel Begunkov wrote:
>> On 10/24/24 15:40, Jens Axboe wrote:
>>> On 10/24/24 8:26 AM, Pavel Begunkov wrote:
>>>> On 10/24/24 15:18, Jens Axboe wrote:
>>>>> On 10/23/24 8:38 PM, hexue wrote:
>>>>>> On 9/25/2024 12:12, Pavel Begunkov wrote:
>>>> ...
>
> Ah true, well some other spot then, should be pretty easy to find 8
> bytes for iopoll_start. As mentioned, the point is really just to THINK
> about where it should go, rather than lazily just shove it at the end
> like no thought has been given to it.

Thanks a lot for these suggestions and help. I will revise and submit a
complete v9 patch, thank you very much.

--
Xue
diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
index 3315005df117..35ac4a8bf6ab 100644
--- a/include/linux/io_uring_types.h
+++ b/include/linux/io_uring_types.h
@@ -422,6 +422,8 @@ struct io_ring_ctx {
 	unsigned short			n_sqe_pages;
 	struct page			**ring_pages;
 	struct page			**sqe_pages;
+	/* for io_uring hybrid poll*/
+	u64				available_time;
 };
 
 struct io_tw_state {
@@ -657,6 +659,10 @@ struct io_kiocb {
 		u64			extra1;
 		u64			extra2;
 	} big_cqe;
+	/* for io_uring hybrid iopoll */
+	bool				poll_state;
+	u64				iopoll_start;
+	u64				iopoll_end;
 };
 
 struct io_overflow_cqe {
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index 2aaf7ee256ac..42ae868651b0 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -199,6 +199,7 @@ enum io_uring_sqe_flags_bit {
  * Removes indirection through the SQ index array.
  */
 #define IORING_SETUP_NO_SQARRAY		(1U << 16)
+#define IORING_SETUP_HY_POLL		(1U << 17)
 
 enum io_uring_op {
 	IORING_OP_NOP,
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 3942db160f18..bb3dfd749572 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -301,6 +301,7 @@ static __cold struct io_ring_ctx *io_ring_ctx_alloc(struct io_uring_params *p)
 		goto err;
 
 	ctx->flags = p->flags;
+	ctx->available_time = LLONG_MAX;
 	atomic_set(&ctx->cq_wait_nr, IO_CQ_WAKE_INIT);
 	init_waitqueue_head(&ctx->sqo_sq_wait);
 	INIT_LIST_HEAD(&ctx->sqd_list);
@@ -3603,7 +3604,7 @@ static long io_uring_setup(u32 entries, struct io_uring_params __user *params)
 			IORING_SETUP_SQE128 | IORING_SETUP_CQE32 |
 			IORING_SETUP_SINGLE_ISSUER | IORING_SETUP_DEFER_TASKRUN |
 			IORING_SETUP_NO_MMAP | IORING_SETUP_REGISTERED_FD_ONLY |
-			IORING_SETUP_NO_SQARRAY))
+			IORING_SETUP_NO_SQARRAY | IORING_SETUP_HY_POLL))
 		return -EINVAL;
 
 	return io_uring_create(entries, &p, params);
diff --git a/io_uring/rw.c b/io_uring/rw.c
index c004d21e2f12..4d32b9b9900b 100644
--- a/io_uring/rw.c
+++ b/io_uring/rw.c
@@ -772,6 +772,13 @@ static bool need_complete_io(struct io_kiocb *req)
 		S_ISBLK(file_inode(req->file)->i_mode);
 }
 
+static void init_hybrid_poll(struct io_kiocb *req)
+{
+	/* make sure every req only block once*/
+	req->poll_state = false;
+	req->iopoll_start = ktime_get_ns();
+}
+
 static int io_rw_init_file(struct io_kiocb *req, fmode_t mode, int rw_type)
 {
 	struct io_rw *rw = io_kiocb_to_cmd(req, struct io_rw);
@@ -809,6 +816,8 @@ static int io_rw_init_file(struct io_kiocb *req, fmode_t mode, int rw_type)
 		kiocb->ki_flags |= IOCB_HIPRI;
 		kiocb->ki_complete = io_complete_rw_iopoll;
 		req->iopoll_completed = 0;
+		if (ctx->flags & IORING_SETUP_HY_POLL)
+			init_hybrid_poll(req);
 	} else {
 		if (kiocb->ki_flags & IOCB_HIPRI)
 			return -EINVAL;
@@ -1105,6 +1114,81 @@ void io_rw_fail(struct io_kiocb *req)
 	io_req_set_res(req, res, req->cqe.flags);
 }
 
+static int io_uring_classic_poll(struct io_kiocb *req,
+		struct io_comp_batch *iob, unsigned int poll_flags)
+{
+	int ret;
+	struct file *file = req->file;
+
+	if (req->opcode == IORING_OP_URING_CMD) {
+		struct io_uring_cmd *ioucmd;
+
+		ioucmd = io_kiocb_to_cmd(req, struct io_uring_cmd);
+		ret = file->f_op->uring_cmd_iopoll(ioucmd, iob,
+				poll_flags);
+	} else {
+		struct io_rw *rw = io_kiocb_to_cmd(req, struct io_rw);
+
+		ret = file->f_op->iopoll(&rw->kiocb, iob, poll_flags);
+	}
+	return ret;
+}
+
+static u64 io_delay(struct io_ring_ctx *ctx, struct io_kiocb *req)
+{
+	struct hrtimer_sleeper timer;
+	enum hrtimer_mode mode;
+	ktime_t kt;
+	u64 sleep_time;
+
+	if (req->poll_state)
+		return 0;
+
+	if (ctx->available_time == LLONG_MAX)
+		return 0;
+
+	/* Using half running time to do schedul */
+	sleep_time = ctx->available_time / 2;
+
+	kt = ktime_set(0, sleep_time);
+	req->poll_state = true;
+
+	mode = HRTIMER_MODE_REL;
+	hrtimer_init_sleeper_on_stack(&timer, CLOCK_MONOTONIC, mode);
+	hrtimer_set_expires(&timer.timer, kt);
+	set_current_state(TASK_INTERRUPTIBLE);
+	hrtimer_sleeper_start_expires(&timer, mode);
+
+	if (timer.task)
+		io_schedule();
+
+	hrtimer_cancel(&timer.timer);
+	__set_current_state(TASK_RUNNING);
+	destroy_hrtimer_on_stack(&timer.timer);
+
+	return sleep_time;
+}
+
+static int io_uring_hybrid_poll(struct io_kiocb *req,
+		struct io_comp_batch *iob, unsigned int poll_flags)
+{
+	struct io_ring_ctx *ctx = req->ctx;
+	int ret;
+	u64 runtime, sleep_time;
+
+	sleep_time = io_delay(ctx, req);
+	ret = io_uring_classic_poll(req, iob, poll_flags);
+	req->iopoll_end = ktime_get_ns();
+	runtime = req->iopoll_end - req->iopoll_start - sleep_time;
+
+	/* use minimize sleep time if there are different speed
+	 * drivers, it could get more completions from fast one
+	 */
+	if (ctx->available_time > runtime)
+		ctx->available_time = runtime;
+	return ret;
+}
+
 int io_do_iopoll(struct io_ring_ctx *ctx, bool force_nonspin)
 {
 	struct io_wq_work_node *pos, *start, *prev;
@@ -1121,7 +1205,6 @@ int io_do_iopoll(struct io_ring_ctx *ctx, bool force_nonspin)
 	wq_list_for_each(pos, start, &ctx->iopoll_list) {
 		struct io_kiocb *req = container_of(pos, struct io_kiocb,
 						    comp_list);
-		struct file *file = req->file;
 		int ret;
 
 		/*
@@ -1132,17 +1215,11 @@ int io_do_iopoll(struct io_ring_ctx *ctx, bool force_nonspin)
 		if (READ_ONCE(req->iopoll_completed))
 			break;
 
-		if (req->opcode == IORING_OP_URING_CMD) {
-			struct io_uring_cmd *ioucmd;
-
-			ioucmd = io_kiocb_to_cmd(req, struct io_uring_cmd);
-			ret = file->f_op->uring_cmd_iopoll(ioucmd, &iob,
-							   poll_flags);
-		} else {
-			struct io_rw *rw = io_kiocb_to_cmd(req, struct io_rw);
+		if (ctx->flags & IORING_SETUP_HY_POLL)
+			ret = io_uring_hybrid_poll(req, &iob, poll_flags);
+		else
+			ret = io_uring_classic_poll(req, &iob, poll_flags);
 
-			ret = file->f_op->iopoll(&rw->kiocb, &iob, poll_flags);
-		}
 		if (unlikely(ret < 0))
 			return ret;
 		else if (ret)
This patch adds a new hybrid poll at the io_uring level. It also adds a
setup flag, "IORING_SETUP_HY_POLL", to give applications an interface to
enable the new hybrid polling flexibly.

io_uring's polling mode can improve IO performance, but it spends 100% of
CPU resources on polling. A new hybrid poll is implemented at the io_uring
layer: once an IO is issued, it does not poll immediately, but blocks
first and runs again shortly before the IO completes, then polls to reap
the IO. This poll function can be a suboptimal solution when running on a
single thread: it offers lower performance than regular polling but higher
than IRQ, and CPU utilization is also lower than with regular polling.

Test Result: fio-3.35, 16 poll queues, 1 thread
-------------------------------------------------------------------------
Performance
-------------------------------------------------------------------------
               write          read           randwrite   randread
regular poll   BW=3939MiB/s   BW=6613MiB/s   IOPS=190K   IOPS=470K
IRQ            BW=3927MiB/s   BW=6612MiB/s   IOPS=181K   IOPS=203K
hybrid poll    BW=3937MiB/s   BW=6623MiB/s   IOPS=190K   IOPS=358K (suboptimal)
-------------------------------------------------------------------------

CPU Utilization
------------------------------------------------------
               write   read   randwrite   randread
regular poll   100%    100%   100%        100%
IRQ            50%     53%    100%        100%
hybrid poll    70%     37%    70%         85%
------------------------------------------------------

--
changes since v7:
- rebase code on for-6.12/io_uring
- remove unused variables

changes since v6:
- Modified IO path, distinct iopoll and uring_cmd_iopoll
- update test results

changes since v5:
- Remove cstime recorder
- Use minimum sleep time in different drivers
- Use half of the whole runtime to do schedule
- Consider as a suboptimal solution between regular poll and IRQ

changes since v4:
- Rewrote the commit
- Update the test results
- Reorganized the code based on 6.11

changes since v3:
- Simplified the commit
- Add some comments on code

changes since v2:
- Modified some formatting errors
- Move judgement to poll path

changes since v1:
- Extend hybrid poll to async polled io

Signed-off-by: hexue <xue01.he@samsung.com>
---
 include/linux/io_uring_types.h |  6 +++
 include/uapi/linux/io_uring.h  |  1 +
 io_uring/io_uring.c            |  3 +-
 io_uring/rw.c                  | 99 ++++++++++++++++++++++++++++------
 4 files changed, 97 insertions(+), 12 deletions(-)