Message ID | 20240918021010.12894-1-xue01.he@samsung.com (mailing list archive)
---|---
State | New
Series | [v8,RESEND] io_uring: releasing CPU resources when polling
On 24/08/12 1:59AM, hexue wrote:
> This patch adds a new hybrid poll at the io_uring level. It also adds a
> setup flag, "IORING_SETUP_HY_POLL", to give applications an interface to
> enable the new hybrid polling flexibly.

Hi, just a gentle ping - is this still in the merge window, or are there
any comments on this patch?

--
hexue
On 9/25/24 09:29, hexue wrote:
> On 24/08/12 1:59AM, hexue wrote:
>> This patch adds a new hybrid poll at the io_uring level. It also adds a
>> setup flag, "IORING_SETUP_HY_POLL", to give applications an interface to
>> enable the new hybrid polling flexibly.
>
> Hi, just a gentle ping - is this still in the merge window, or are there
> any comments on this patch?

I don't have a strong opinion on the feature, but the open question we
should get some decision on is whether it's really well applicable to a
good enough set of apps / workloads, if it'll even be useful in the
future and/or for other vendors, and if the merit outweighs the extra
8 bytes + 1 flag per io_kiocb and the overhead of 1-2 static key'able
checks in hot paths.
On 9/25/2024 12:12, Pavel Begunkov wrote:
> I don't have a strong opinion on the feature, but the open question we
> should get some decision on is whether it's really well applicable to a
> good enough set of apps / workloads, if it'll even be useful in the
> future and/or for other vendors, and if the merit outweighs the extra
> 8 bytes + 1 flag per io_kiocb and the overhead of 1-2 static key'able
> checks in hot paths.

IMHO, releasing some of the CPU resources during the polling process may
be appropriate for workloads whose bottleneck is CPU resource constraints,
such as some database applications: in addition to completing IO
operations, the CPU also needs to process data, like compression and
decompression. In a high-concurrency state, not only does polling take up
a lot of CPU time, but operations like calculation and processing also
need to compete for CPU time. In this case, the performance of the
application may be difficult to improve.

The MultiRead interface of RocksDB has been adapted to io_uring, so I
used db_bench to construct a situation with high CPU pressure and
compared the performance. The test configuration is as follows:

-------------------------------------------------------------------
CPU Model    Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz
CPU Cores    8
Memory       16G
SSD          Samsung PM9A3
-------------------------------------------------------------------

Test case:
./db_bench --benchmarks=multireadrandom,stats
           --duration=60
           --threads=4/8/16
           --use_direct_reads=true
           --db=/mnt/rocks/test_db
           --wal_dir=/mnt/rocks/test_db
           --key_size=4
           --value_size=4096
           -cache_size=0
           -use_existing_db=1
           -batch_size=256
           -multiread_batched=true
           -multiread_stride=0

------------------------------------------------------
Test result:
            original    optimized
threads     ops/sec     ops/sec       CPU Utilization
16          139300      189075        100%*8
8           138639      133191        90%*8
4           71475       68361         90%*8
------------------------------------------------------

When the number of threads exceeds the number of CPU cores, the database
throughput does not increase significantly. However, hybrid polling can
release some CPU resources during the polling process, so that part of
the CPU time can be used for frequent data processing and other
operations, which speeds up the reading process, thereby improving
throughput and optimizing database performance. I tried different
compression strategies and got results similar to the above table
(~30% throughput improvement).

As more database applications adapt to the io_uring engine, I think
hybrid poll may have potential in some scenarios.

--
Xue
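For reference, a minimal sketch of how an application could opt in to the
mode discussed above. This is not from the patch or the thread; it assumes
a liburing build whose headers carry the IORING_SETUP_HY_POLL flag from
this series (renamed to IORING_SETUP_HYBRID_IOPOLL later in the thread),
and it is untested:

#include <stdio.h>
#include <string.h>
#include <liburing.h>

#ifndef IORING_SETUP_HY_POLL
#define IORING_SETUP_HY_POLL	(1U << 17)	/* value from this v8 patch */
#endif

int main(void)
{
	struct io_uring ring;
	struct io_uring_params p = { };
	int ret;

	/* hybrid poll is a mode of IOPOLL, so both flags are requested */
	p.flags = IORING_SETUP_IOPOLL | IORING_SETUP_HY_POLL;

	ret = io_uring_queue_init_params(8, &ring, &p);
	if (ret < 0) {
		fprintf(stderr, "queue_init: %s\n", strerror(-ret));
		return 1;
	}

	/*
	 * Reads/writes submitted on this ring must target files opened with
	 * O_DIRECT on a device whose driver supports poll queues, as with
	 * regular IOPOLL rings.
	 */
	io_uring_queue_exit(&ring);
	return 0;
}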
On 10/23/24 8:38 PM, hexue wrote:
> On 9/25/2024 12:12, Pavel Begunkov wrote:
>> I don't have a strong opinion on the feature, but the open question we
>> should get some decision on is whether it's really well applicable to a
>> good enough set of apps / workloads, if it'll even be useful in the
>> future and/or for other vendors, and if the merit outweighs the extra
>> 8 bytes + 1 flag per io_kiocb and the overhead of 1-2 static key'able
>> checks in hot paths.
>
> IMHO, releasing some of the CPU resources during the polling process may
> be appropriate for workloads whose bottleneck is CPU resource constraints,
> such as some database applications: in addition to completing IO
> operations, the CPU also needs to process data, like compression and
> decompression. In a high-concurrency state, not only does polling take up
> a lot of CPU time, but operations like calculation and processing also
> need to compete for CPU time. In this case, the performance of the
> application may be difficult to improve.
>
> The MultiRead interface of RocksDB has been adapted to io_uring, so I
> used db_bench to construct a situation with high CPU pressure and
> compared the performance. The test configuration is as follows:
>
> -------------------------------------------------------------------
> CPU Model    Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz
> CPU Cores    8
> Memory       16G
> SSD          Samsung PM9A3
> -------------------------------------------------------------------
>
> Test case:
> ./db_bench --benchmarks=multireadrandom,stats
>            --duration=60
>            --threads=4/8/16
>            --use_direct_reads=true
>            --db=/mnt/rocks/test_db
>            --wal_dir=/mnt/rocks/test_db
>            --key_size=4
>            --value_size=4096
>            -cache_size=0
>            -use_existing_db=1
>            -batch_size=256
>            -multiread_batched=true
>            -multiread_stride=0
>
> ------------------------------------------------------
> Test result:
>             original    optimized
> threads     ops/sec     ops/sec       CPU Utilization
> 16          139300      189075        100%*8
> 8           138639      133191        90%*8
> 4           71475       68361         90%*8
> ------------------------------------------------------
>
> When the number of threads exceeds the number of CPU cores, the database
> throughput does not increase significantly. However, hybrid polling can
> release some CPU resources during the polling process, so that part of
> the CPU time can be used for frequent data processing and other
> operations, which speeds up the reading process, thereby improving
> throughput and optimizing database performance. I tried different
> compression strategies and got results similar to the above table
> (~30% throughput improvement).
>
> As more database applications adapt to the io_uring engine, I think
> hybrid poll may have potential in some scenarios.

Thanks for posting some numbers on that part, that's useful. I do
think the feature is useful as well, but I still have some issues
with the implementation. Below is an incremental patch on top of
yours to resolve some of those, potentially. Issues:

1) The patch still reads a bit like a hack, in that it doesn't seem to
   care about following the current style. This reads a bit lazy/sloppy
   or unfinished. I've fixed that up.

2) Appropriate member and function naming.

3) Same as above, it doesn't care about proper placement of additions to
   structs. Again this is a bit lazy and wasteful, attention should be
   paid to where additions are placed to not needlessly bloat
   structures, or place members in cache unfortunate locations. For
   example, available_time is just placed at the end of io_ring_ctx,
   why? It's a submission side member, and there's room with other
   related members. Not only is the placement now where you'd want it to
   be, memory wise, it also doesn't add 8 bytes to io_uring_ctx.

4) Like the above, the io_kiocb bloat is, by far, the worst. Seems to me
   that we can share space with the polling hash_node. This obviously
   needs careful vetting, haven't done that yet. IOPOLL setups should
   not be using poll at all. This needs extra checking. The poll_state
   can go with cancel_seq_set, as there's a hole there anyway. And just
   like that, rather than add 24b to io_kiocb, it doesn't take any extra
   space at all.

5) HY_POLL is a terrible name. It's associated with IOPOLL, and so let's
   please use a name related to that. And require IOPOLL being set with
   HYBRID_IOPOLL, as it's really a variant of that. Makes it clear that
   HYBRID_IOPOLL is really just a mode of operation for IOPOLL, and it
   can't exist without that.

Please take a look at this incremental and test it, and then post a v9
that looks a lot more finished. Caveat - I haven't tested this one at
all. Thanks!

diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
index c79ee9fe86d4..6cf6a45835e5 100644
--- a/include/linux/io_uring_types.h
+++ b/include/linux/io_uring_types.h
@@ -238,6 +238,8 @@ struct io_ring_ctx {
 	struct io_rings			*rings;
 	struct percpu_ref		refs;
 
+	u64				poll_available_time;
+
 	clockid_t			clockid;
 	enum tk_offsets			clock_offset;
 
@@ -433,9 +435,6 @@ struct io_ring_ctx {
 	struct page			**sqe_pages;
 
 	struct page			**cq_wait_page;
-
-	/* for io_uring hybrid poll*/
-	u64				available_time;
 };
 
 struct io_tw_state {
@@ -647,9 +646,22 @@ struct io_kiocb {
 
 	atomic_t			refs;
 	bool				cancel_seq_set;
+	bool				poll_state;
 	struct io_task_work		io_task_work;
-	/* for polled requests, i.e. IORING_OP_POLL_ADD and async armed poll */
-	struct hlist_node		hash_node;
+	union {
+		/*
+		 * for polled requests, i.e. IORING_OP_POLL_ADD and async armed
+		 * poll
+		 */
+		struct hlist_node	hash_node;
+		/*
+		 * For IOPOLL setup queues, with hybrid polling
+		 */
+		struct {
+			u64		iopoll_start;
+			u64		iopoll_end;
+		};
+	};
 	/* internal polling, see IORING_FEAT_FAST_POLL */
 	struct async_poll		*apoll;
 	/* opcode allocated if it needs to store data for async defer */
@@ -665,10 +677,6 @@ struct io_kiocb {
 		u64			extra1;
 		u64			extra2;
 	} big_cqe;
-	/* for io_uring hybrid iopoll */
-	bool				poll_state;
-	u64				iopoll_start;
-	u64				iopoll_end;
 };
 
 struct io_overflow_cqe {
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index 034997a1e507..5a290a56af6c 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -199,7 +199,7 @@ enum io_uring_sqe_flags_bit {
  * Removes indirection through the SQ index array.
  */
 #define IORING_SETUP_NO_SQARRAY		(1U << 16)
-#define IORING_SETUP_HY_POLL		(1U << 17)
+#define IORING_SETUP_HYBRID_IOPOLL	(1U << 17)
 
 enum io_uring_op {
 	IORING_OP_NOP,
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 9631e10d681b..35071442fb70 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -307,7 +307,7 @@ static __cold struct io_ring_ctx *io_ring_ctx_alloc(struct io_uring_params *p)
 		goto err;
 
 	ctx->flags = p->flags;
-	ctx->available_time = LLONG_MAX;
+	ctx->poll_available_time = LLONG_MAX;
 	atomic_set(&ctx->cq_wait_nr, IO_CQ_WAKE_INIT);
 	init_waitqueue_head(&ctx->sqo_sq_wait);
 	INIT_LIST_HEAD(&ctx->sqd_list);
@@ -3646,6 +3646,11 @@ static __cold int io_uring_create(unsigned entries, struct io_uring_params *p,
 	ctx->clockid = CLOCK_MONOTONIC;
 	ctx->clock_offset = 0;
 
+	/* HYBRID_IOPOLL only valid with IOPOLL */
+	if ((ctx->flags & (IORING_SETUP_IOPOLL|IORING_SETUP_HYBRID_IOPOLL)) ==
+	    IORING_SETUP_HYBRID_IOPOLL)
+		return -EINVAL;
+
 	if (!(ctx->flags & IORING_SETUP_NO_SQARRAY))
 		static_branch_inc(&io_key_has_sqarray);
 
@@ -3808,7 +3813,7 @@ static long io_uring_setup(u32 entries, struct io_uring_params __user *params)
 			IORING_SETUP_SQE128 | IORING_SETUP_CQE32 |
 			IORING_SETUP_SINGLE_ISSUER | IORING_SETUP_DEFER_TASKRUN |
 			IORING_SETUP_NO_MMAP | IORING_SETUP_REGISTERED_FD_ONLY |
-			IORING_SETUP_NO_SQARRAY | IORING_SETUP_HY_POLL))
+			IORING_SETUP_NO_SQARRAY | IORING_SETUP_HYBRID_IOPOLL))
 		return -EINVAL;
 
 	return io_uring_create(entries, &p, params);
diff --git a/io_uring/rw.c b/io_uring/rw.c
index b86cef10ed72..6f7bf40df85a 100644
--- a/io_uring/rw.c
+++ b/io_uring/rw.c
@@ -782,13 +782,6 @@ static bool need_complete_io(struct io_kiocb *req)
 		S_ISBLK(file_inode(req->file)->i_mode);
 }
 
-static void init_hybrid_poll(struct io_kiocb *req)
-{
-	/* make sure every req only block once*/
-	req->poll_state = false;
-	req->iopoll_start = ktime_get_ns();
-}
-
 static int io_rw_init_file(struct io_kiocb *req, fmode_t mode, int rw_type)
 {
 	struct io_rw *rw = io_kiocb_to_cmd(req, struct io_rw);
@@ -826,8 +819,11 @@ static int io_rw_init_file(struct io_kiocb *req, fmode_t mode, int rw_type)
 		kiocb->ki_flags |= IOCB_HIPRI;
 		kiocb->ki_complete = io_complete_rw_iopoll;
 		req->iopoll_completed = 0;
-		if (ctx->flags & IORING_SETUP_HY_POLL)
-			init_hybrid_poll(req);
+		if (ctx->flags & IORING_SETUP_HYBRID_IOPOLL) {
+			/* make sure every req only blocks once */
+			req->poll_state = false;
+			req->iopoll_start = ktime_get_ns();
+		}
 	} else {
 		if (kiocb->ki_flags & IOCB_HIPRI)
 			return -EINVAL;
@@ -1126,27 +1122,24 @@ void io_rw_fail(struct io_kiocb *req)
 	io_req_set_res(req, res, req->cqe.flags);
 }
 
-static int io_uring_classic_poll(struct io_kiocb *req,
-		struct io_comp_batch *iob, unsigned int poll_flags)
+static int io_uring_classic_poll(struct io_kiocb *req, struct io_comp_batch *iob,
+				 unsigned int poll_flags)
 {
-	int ret;
 	struct file *file = req->file;
 
 	if (req->opcode == IORING_OP_URING_CMD) {
 		struct io_uring_cmd *ioucmd;
 
 		ioucmd = io_kiocb_to_cmd(req, struct io_uring_cmd);
-		ret = file->f_op->uring_cmd_iopoll(ioucmd, iob,
-				poll_flags);
+		return file->f_op->uring_cmd_iopoll(ioucmd, iob, poll_flags);
 	} else {
 		struct io_rw *rw = io_kiocb_to_cmd(req, struct io_rw);
 
-		ret = file->f_op->iopoll(&rw->kiocb, iob, poll_flags);
+		return file->f_op->iopoll(&rw->kiocb, iob, poll_flags);
 	}
-	return ret;
 }
 
-static u64 io_delay(struct io_ring_ctx *ctx, struct io_kiocb *req)
+static u64 io_hybrid_iopoll_delay(struct io_ring_ctx *ctx, struct io_kiocb *req)
 {
 	struct hrtimer_sleeper timer;
 	enum hrtimer_mode mode;
@@ -1156,11 +1149,11 @@ static u64 io_delay(struct io_ring_ctx *ctx, struct io_kiocb *req)
 	if (req->poll_state)
 		return 0;
 
-	if (ctx->available_time == LLONG_MAX)
+	if (ctx->poll_available_time == LLONG_MAX)
 		return 0;
 
-	/* Using half running time to do schedul */
-	sleep_time = ctx->available_time / 2;
+	/* Use half the running time to do schedule */
+	sleep_time = ctx->poll_available_time / 2;
 
 	kt = ktime_set(0, sleep_time);
 	req->poll_state = true;
@@ -1177,7 +1170,6 @@ static u64 io_delay(struct io_ring_ctx *ctx, struct io_kiocb *req)
 	hrtimer_cancel(&timer.timer);
 	__set_current_state(TASK_RUNNING);
 	destroy_hrtimer_on_stack(&timer.timer);
-
 	return sleep_time;
 }
 
@@ -1185,19 +1177,21 @@ static int io_uring_hybrid_poll(struct io_kiocb *req,
 		struct io_comp_batch *iob, unsigned int poll_flags)
 {
 	struct io_ring_ctx *ctx = req->ctx;
-	int ret;
 	u64 runtime, sleep_time;
+	int ret;
 
-	sleep_time = io_delay(ctx, req);
+	sleep_time = io_hybrid_iopoll_delay(ctx, req);
 	ret = io_uring_classic_poll(req, iob, poll_flags);
 	req->iopoll_end = ktime_get_ns();
 	runtime = req->iopoll_end - req->iopoll_start - sleep_time;
 
-	/* use minimize sleep time if there are different speed
-	 * drivers, it could get more completions from fast one
+	/*
+	 * Use minimum sleep time if we're polling devices with different
+	 * latencies. We could get more completions from the faster ones.
 	 */
-	if (ctx->available_time > runtime)
-		ctx->available_time = runtime;
+	if (ctx->poll_available_time > runtime)
+		ctx->poll_available_time = runtime;
+
 	return ret;
 }
 
@@ -1227,7 +1221,7 @@ int io_do_iopoll(struct io_ring_ctx *ctx, bool force_nonspin)
 		if (READ_ONCE(req->iopoll_completed))
 			break;
 
-		if (ctx->flags & IORING_SETUP_HY_POLL)
+		if (ctx->flags & IORING_SETUP_HYBRID_IOPOLL)
 			ret = io_uring_hybrid_poll(req, &iob, poll_flags);
 		else
 			ret = io_uring_classic_poll(req, &iob, poll_flags);
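As a quick way to exercise the new validation in the incremental above, a
hedged sketch (not from the thread; untested, assumes liburing and the
IORING_SETUP_HYBRID_IOPOLL value from the diff): requesting the hybrid flag
without IORING_SETUP_IOPOLL should now be rejected with -EINVAL.

#include <errno.h>
#include <stdio.h>
#include <liburing.h>

#ifndef IORING_SETUP_HYBRID_IOPOLL
#define IORING_SETUP_HYBRID_IOPOLL	(1U << 17)	/* value from the diff above */
#endif

int main(void)
{
	struct io_uring ring;
	int ret;

	/* HYBRID_IOPOLL alone is rejected by the new check in io_uring_create() */
	ret = io_uring_queue_init(8, &ring, IORING_SETUP_HYBRID_IOPOLL);
	if (ret == -EINVAL) {
		printf("rejected without IOPOLL, as expected\n");
	} else if (ret == 0) {
		printf("unexpectedly accepted\n");
		io_uring_queue_exit(&ring);
	}
	return 0;
}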
On 10/24/24 15:18, Jens Axboe wrote:
> On 10/23/24 8:38 PM, hexue wrote:
>> On 9/25/2024 12:12, Pavel Begunkov wrote:
...
>> When the number of threads exceeds the number of CPU cores, the database
>> throughput does not increase significantly. However, hybrid polling can
>> release some CPU resources during the polling process, so that part of
>> the CPU time can be used for frequent data processing and other
>> operations, which speeds up the reading process, thereby improving
>> throughput and optimizing database performance. I tried different
>> compression strategies and got results similar to the above table
>> (~30% throughput improvement).
>>
>> As more database applications adapt to the io_uring engine, I think
>> hybrid poll may have potential in some scenarios.
>
> Thanks for posting some numbers on that part, that's useful. I do
> think the feature is useful as well, but I still have some issues
> with the implementation. Below is an incremental patch on top of
> yours to resolve some of those, potentially. Issues:
>
> 1) The patch still reads a bit like a hack, in that it doesn't seem to
>    care about following the current style. This reads a bit lazy/sloppy
>    or unfinished. I've fixed that up.
>
> 2) Appropriate member and function naming.
>
> 3) Same as above, it doesn't care about proper placement of additions to
>    structs. Again this is a bit lazy and wasteful, attention should be
>    paid to where additions are placed to not needlessly bloat
>    structures, or place members in cache unfortunate locations. For
>    example, available_time is just placed at the end of io_ring_ctx,
>    why? It's a submission side member, and there's room with other
>    related members. Not only is the placement now where you'd want it to
>    be, memory wise, it also doesn't add 8 bytes to io_uring_ctx.
>
> 4) Like the above, the io_kiocb bloat is, by far, the worst. Seems to me
>    that we can share space with the polling hash_node. This obviously
>    needs careful vetting, haven't done that yet. IOPOLL setups should
>    not be using poll at all. This needs extra checking. The poll_state
>    can go with cancel_seq_set, as there's a hole there anyway. And just
>    like that, rather than add 24b to io_kiocb, it doesn't take any extra
>    space at all.
>
> 5) HY_POLL is a terrible name. It's associated with IOPOLL, and so let's
>    please use a name related to that. And require IOPOLL being set with
>    HYBRID_IOPOLL, as it's really a variant of that. Makes it clear that
>    HYBRID_IOPOLL is really just a mode of operation for IOPOLL, and it
>    can't exist without that.
>
> Please take a look at this incremental and test it, and then post a v9
> that looks a lot more finished. Caveat - I haven't tested this one at
> all. Thanks!
>
> diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
> index c79ee9fe86d4..6cf6a45835e5 100644
> --- a/include/linux/io_uring_types.h
> +++ b/include/linux/io_uring_types.h
> @@ -238,6 +238,8 @@ struct io_ring_ctx {
>  	struct io_rings			*rings;
>  	struct percpu_ref		refs;
>
> +	u64				poll_available_time;
> +
>  	clockid_t			clockid;
>  	enum tk_offsets			clock_offset;
>
> @@ -433,9 +435,6 @@ struct io_ring_ctx {
>  	struct page			**sqe_pages;
>
>  	struct page			**cq_wait_page;
> -
> -	/* for io_uring hybrid poll*/
> -	u64				available_time;
>  };
>
>  struct io_tw_state {
> @@ -647,9 +646,22 @@ struct io_kiocb {
>
>  	atomic_t			refs;
>  	bool				cancel_seq_set;
> +	bool				poll_state;

As mentioned briefly before, that can be just a req->flags flag

>  	struct io_task_work		io_task_work;
> -	/* for polled requests, i.e. IORING_OP_POLL_ADD and async armed poll */
> -	struct hlist_node		hash_node;
> +	union {
> +		/*
> +		 * for polled requests, i.e. IORING_OP_POLL_ADD and async armed
> +		 * poll
> +		 */
> +		struct hlist_node	hash_node;
> +		/*
> +		 * For IOPOLL setup queues, with hybrid polling
> +		 */
> +		struct {
> +			u64		iopoll_start;
> +			u64		iopoll_end;

And IIRC it doesn't need to store the end as it's used immediately
after it's set in the same function.

> +		};
> +	};
>  	/* internal polling, see IORING_FEAT_FAST_POLL */
>  	struct async_poll		*apoll;
>  	/* opcode allocated if it needs to store data for async defer */
> @@ -665,10 +677,6 @@ struct io_kiocb {
>  		u64			extra1;
>  		u64			extra2;
>  	} big_cqe;
> -	/* for io_uring hybrid iopoll */
> -	bool				poll_state;
> -	u64				iopoll_start;
> -	u64				iopoll_end;
>  };
> ...
On 10/24/24 8:26 AM, Pavel Begunkov wrote:
> On 10/24/24 15:18, Jens Axboe wrote:
>> On 10/23/24 8:38 PM, hexue wrote:
>>> On 9/25/2024 12:12, Pavel Begunkov wrote:
> ...
>>> When the number of threads exceeds the number of CPU cores, the database
>>> throughput does not increase significantly. However, hybrid polling can
>>> release some CPU resources during the polling process, so that part of
>>> the CPU time can be used for frequent data processing and other
>>> operations, which speeds up the reading process, thereby improving
>>> throughput and optimizing database performance. I tried different
>>> compression strategies and got results similar to the above table
>>> (~30% throughput improvement).
>>>
>>> As more database applications adapt to the io_uring engine, I think
>>> hybrid poll may have potential in some scenarios.
>>
>> Thanks for posting some numbers on that part, that's useful. I do
>> think the feature is useful as well, but I still have some issues
>> with the implementation. Below is an incremental patch on top of
>> yours to resolve some of those, potentially. Issues:
>>
>> 1) The patch still reads a bit like a hack, in that it doesn't seem to
>>    care about following the current style. This reads a bit lazy/sloppy
>>    or unfinished. I've fixed that up.
>>
>> 2) Appropriate member and function naming.
>>
>> 3) Same as above, it doesn't care about proper placement of additions to
>>    structs. Again this is a bit lazy and wasteful, attention should be
>>    paid to where additions are placed to not needlessly bloat
>>    structures, or place members in cache unfortunate locations. For
>>    example, available_time is just placed at the end of io_ring_ctx,
>>    why? It's a submission side member, and there's room with other
>>    related members. Not only is the placement now where you'd want it to
>>    be, memory wise, it also doesn't add 8 bytes to io_uring_ctx.
>>
>> 4) Like the above, the io_kiocb bloat is, by far, the worst. Seems to me
>>    that we can share space with the polling hash_node. This obviously
>>    needs careful vetting, haven't done that yet. IOPOLL setups should
>>    not be using poll at all. This needs extra checking. The poll_state
>>    can go with cancel_seq_set, as there's a hole there anyway. And just
>>    like that, rather than add 24b to io_kiocb, it doesn't take any extra
>>    space at all.
>>
>> 5) HY_POLL is a terrible name. It's associated with IOPOLL, and so let's
>>    please use a name related to that. And require IOPOLL being set with
>>    HYBRID_IOPOLL, as it's really a variant of that. Makes it clear that
>>    HYBRID_IOPOLL is really just a mode of operation for IOPOLL, and it
>>    can't exist without that.
>>
>> Please take a look at this incremental and test it, and then post a v9
>> that looks a lot more finished. Caveat - I haven't tested this one at
>> all. Thanks!
>>
>> diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
>> index c79ee9fe86d4..6cf6a45835e5 100644
>> --- a/include/linux/io_uring_types.h
>> +++ b/include/linux/io_uring_types.h
>> @@ -238,6 +238,8 @@ struct io_ring_ctx {
>>  	struct io_rings			*rings;
>>  	struct percpu_ref		refs;
>>
>> +	u64				poll_available_time;
>> +
>>  	clockid_t			clockid;
>>  	enum tk_offsets			clock_offset;
>>
>> @@ -433,9 +435,6 @@ struct io_ring_ctx {
>>  	struct page			**sqe_pages;
>>
>>  	struct page			**cq_wait_page;
>> -
>> -	/* for io_uring hybrid poll*/
>> -	u64				available_time;
>>  };
>>
>>  struct io_tw_state {
>> @@ -647,9 +646,22 @@ struct io_kiocb {
>>
>>  	atomic_t			refs;
>>  	bool				cancel_seq_set;
>> +	bool				poll_state;
>
> As mentioned briefly before, that can be just a req->flags flag

That'd be even better, I generally despise random bool addition.

>>  	struct io_task_work		io_task_work;
>> -	/* for polled requests, i.e. IORING_OP_POLL_ADD and async armed poll */
>> -	struct hlist_node		hash_node;
>> +	union {
>> +		/*
>> +		 * for polled requests, i.e. IORING_OP_POLL_ADD and async armed
>> +		 * poll
>> +		 */
>> +		struct hlist_node	hash_node;
>> +		/*
>> +		 * For IOPOLL setup queues, with hybrid polling
>> +		 */
>> +		struct {
>> +			u64		iopoll_start;
>> +			u64		iopoll_end;
>
> And IIRC it doesn't need to store the end as it's used immediately
> after it's set in the same function.

Nice, that opens up the door for less esoteric sharing as well. And
yeah, I'd just use:

runtime = ktime_get_ns() - req->iopoll_start - sleep_time;

in io_uring_hybrid_poll() and kill it entirely, doesn't even need a
local variable there. And then shove iopoll_start into the union with
comp_list/apoll_events.

My main points are really: don't randomly sprinkle additions to
structs. Think about if they are needed, and if they are, be a bit
smarter about where to place them. The original patch did neither of
those, and that's a non-starter.
On 10/24/24 15:40, Jens Axboe wrote:
> On 10/24/24 8:26 AM, Pavel Begunkov wrote:
>> On 10/24/24 15:18, Jens Axboe wrote:
>>> On 10/23/24 8:38 PM, hexue wrote:
>>>> On 9/25/2024 12:12, Pavel Begunkov wrote:
>> ...
>>>> When the number of threads exceeds the number of CPU cores, the database
>>>> throughput does not increase significantly. However, hybrid polling can
>>>> release some CPU resources during the polling process, so that part of
>>>> the CPU time can be used for frequent data processing and other
>>>> operations, which speeds up the reading process, thereby improving
>>>> throughput and optimizing database performance. I tried different
>>>> compression strategies and got results similar to the above table
>>>> (~30% throughput improvement).
>>>>
>>>> As more database applications adapt to the io_uring engine, I think
>>>> hybrid poll may have potential in some scenarios.
>>>
>>> Thanks for posting some numbers on that part, that's useful. I do
>>> think the feature is useful as well, but I still have some issues
>>> with the implementation. Below is an incremental patch on top of
>>> yours to resolve some of those, potentially. Issues:
>>>
>>> 1) The patch still reads a bit like a hack, in that it doesn't seem to
>>>    care about following the current style. This reads a bit lazy/sloppy
>>>    or unfinished. I've fixed that up.
>>>
>>> 2) Appropriate member and function naming.
>>>
>>> 3) Same as above, it doesn't care about proper placement of additions to
>>>    structs. Again this is a bit lazy and wasteful, attention should be
>>>    paid to where additions are placed to not needlessly bloat
>>>    structures, or place members in cache unfortunate locations. For
>>>    example, available_time is just placed at the end of io_ring_ctx,
>>>    why? It's a submission side member, and there's room with other
>>>    related members. Not only is the placement now where you'd want it to
>>>    be, memory wise, it also doesn't add 8 bytes to io_uring_ctx.
>>>
>>> 4) Like the above, the io_kiocb bloat is, by far, the worst. Seems to me
>>>    that we can share space with the polling hash_node. This obviously
>>>    needs careful vetting, haven't done that yet. IOPOLL setups should
>>>    not be using poll at all. This needs extra checking. The poll_state
>>>    can go with cancel_seq_set, as there's a hole there anyway. And just
>>>    like that, rather than add 24b to io_kiocb, it doesn't take any extra
>>>    space at all.
>>>
>>> 5) HY_POLL is a terrible name. It's associated with IOPOLL, and so let's
>>>    please use a name related to that. And require IOPOLL being set with
>>>    HYBRID_IOPOLL, as it's really a variant of that. Makes it clear that
>>>    HYBRID_IOPOLL is really just a mode of operation for IOPOLL, and it
>>>    can't exist without that.
>>>
>>> Please take a look at this incremental and test it, and then post a v9
>>> that looks a lot more finished. Caveat - I haven't tested this one at
>>> all. Thanks!
>>>
>>> diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
>>> index c79ee9fe86d4..6cf6a45835e5 100644
>>> --- a/include/linux/io_uring_types.h
>>> +++ b/include/linux/io_uring_types.h
>>> @@ -238,6 +238,8 @@ struct io_ring_ctx {
>>>  	struct io_rings			*rings;
>>>  	struct percpu_ref		refs;
>>>
>>> +	u64				poll_available_time;
>>> +
>>>  	clockid_t			clockid;
>>>  	enum tk_offsets			clock_offset;
>>>
>>> @@ -433,9 +435,6 @@ struct io_ring_ctx {
>>>  	struct page			**sqe_pages;
>>>
>>>  	struct page			**cq_wait_page;
>>> -
>>> -	/* for io_uring hybrid poll*/
>>> -	u64				available_time;
>>>  };
>>>
>>>  struct io_tw_state {
>>> @@ -647,9 +646,22 @@ struct io_kiocb {
>>>
>>>  	atomic_t			refs;
>>>  	bool				cancel_seq_set;
>>> +	bool				poll_state;
>>
>> As mentioned briefly before, that can be just a req->flags flag
>
> That'd be even better, I generally despise random bool addition.
>
>>>  	struct io_task_work		io_task_work;
>>> -	/* for polled requests, i.e. IORING_OP_POLL_ADD and async armed poll */
>>> -	struct hlist_node		hash_node;
>>> +	union {
>>> +		/*
>>> +		 * for polled requests, i.e. IORING_OP_POLL_ADD and async armed
>>> +		 * poll
>>> +		 */
>>> +		struct hlist_node	hash_node;
>>> +		/*
>>> +		 * For IOPOLL setup queues, with hybrid polling
>>> +		 */
>>> +		struct {
>>> +			u64		iopoll_start;
>>> +			u64		iopoll_end;
>>
>> And IIRC it doesn't need to store the end as it's used immediately
>> after it's set in the same function.
>
> Nice, that opens up the door for less esoteric sharing as well. And
> yeah, I'd just use:
>
> runtime = ktime_get_ns() - req->iopoll_start - sleep_time;
>
> in io_uring_hybrid_poll() and kill it entirely, doesn't even need a
> local variable there. And then shove iopoll_start into the union with
> comp_list/apoll_events.

That's with what the current request is hooked into the list, IOW such
aliasing will corrupt the request
On 10/24/24 8:49 AM, Pavel Begunkov wrote:
> On 10/24/24 15:40, Jens Axboe wrote:
>> On 10/24/24 8:26 AM, Pavel Begunkov wrote:
>>> On 10/24/24 15:18, Jens Axboe wrote:
>>>> On 10/23/24 8:38 PM, hexue wrote:
>>>>> On 9/25/2024 12:12, Pavel Begunkov wrote:
>>> ...
>>>>> When the number of threads exceeds the number of CPU cores, the database
>>>>> throughput does not increase significantly. However, hybrid polling can
>>>>> release some CPU resources during the polling process, so that part of
>>>>> the CPU time can be used for frequent data processing and other
>>>>> operations, which speeds up the reading process, thereby improving
>>>>> throughput and optimizing database performance. I tried different
>>>>> compression strategies and got results similar to the above table
>>>>> (~30% throughput improvement).
>>>>>
>>>>> As more database applications adapt to the io_uring engine, I think
>>>>> hybrid poll may have potential in some scenarios.
>>>>
>>>> Thanks for posting some numbers on that part, that's useful. I do
>>>> think the feature is useful as well, but I still have some issues
>>>> with the implementation. Below is an incremental patch on top of
>>>> yours to resolve some of those, potentially. Issues:
>>>>
>>>> 1) The patch still reads a bit like a hack, in that it doesn't seem to
>>>>    care about following the current style. This reads a bit lazy/sloppy
>>>>    or unfinished. I've fixed that up.
>>>>
>>>> 2) Appropriate member and function naming.
>>>>
>>>> 3) Same as above, it doesn't care about proper placement of additions to
>>>>    structs. Again this is a bit lazy and wasteful, attention should be
>>>>    paid to where additions are placed to not needlessly bloat
>>>>    structures, or place members in cache unfortunate locations. For
>>>>    example, available_time is just placed at the end of io_ring_ctx,
>>>>    why? It's a submission side member, and there's room with other
>>>>    related members. Not only is the placement now where you'd want it to
>>>>    be, memory wise, it also doesn't add 8 bytes to io_uring_ctx.
>>>>
>>>> 4) Like the above, the io_kiocb bloat is, by far, the worst. Seems to me
>>>>    that we can share space with the polling hash_node. This obviously
>>>>    needs careful vetting, haven't done that yet. IOPOLL setups should
>>>>    not be using poll at all. This needs extra checking. The poll_state
>>>>    can go with cancel_seq_set, as there's a hole there anyway. And just
>>>>    like that, rather than add 24b to io_kiocb, it doesn't take any extra
>>>>    space at all.
>>>>
>>>> 5) HY_POLL is a terrible name. It's associated with IOPOLL, and so let's
>>>>    please use a name related to that. And require IOPOLL being set with
>>>>    HYBRID_IOPOLL, as it's really a variant of that. Makes it clear that
>>>>    HYBRID_IOPOLL is really just a mode of operation for IOPOLL, and it
>>>>    can't exist without that.
>>>>
>>>> Please take a look at this incremental and test it, and then post a v9
>>>> that looks a lot more finished. Caveat - I haven't tested this one at
>>>> all. Thanks!
>>>>
>>>> diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
>>>> index c79ee9fe86d4..6cf6a45835e5 100644
>>>> --- a/include/linux/io_uring_types.h
>>>> +++ b/include/linux/io_uring_types.h
>>>> @@ -238,6 +238,8 @@ struct io_ring_ctx {
>>>>  	struct io_rings			*rings;
>>>>  	struct percpu_ref		refs;
>>>>
>>>> +	u64				poll_available_time;
>>>> +
>>>>  	clockid_t			clockid;
>>>>  	enum tk_offsets			clock_offset;
>>>>
>>>> @@ -433,9 +435,6 @@ struct io_ring_ctx {
>>>>  	struct page			**sqe_pages;
>>>>
>>>>  	struct page			**cq_wait_page;
>>>> -
>>>> -	/* for io_uring hybrid poll*/
>>>> -	u64				available_time;
>>>>  };
>>>>
>>>>  struct io_tw_state {
>>>> @@ -647,9 +646,22 @@ struct io_kiocb {
>>>>
>>>>  	atomic_t			refs;
>>>>  	bool				cancel_seq_set;
>>>> +	bool				poll_state;
>>>
>>> As mentioned briefly before, that can be just a req->flags flag
>>
>> That'd be even better, I generally despise random bool addition.
>>
>>>>  	struct io_task_work		io_task_work;
>>>> -	/* for polled requests, i.e. IORING_OP_POLL_ADD and async armed poll */
>>>> -	struct hlist_node		hash_node;
>>>> +	union {
>>>> +		/*
>>>> +		 * for polled requests, i.e. IORING_OP_POLL_ADD and async armed
>>>> +		 * poll
>>>> +		 */
>>>> +		struct hlist_node	hash_node;
>>>> +		/*
>>>> +		 * For IOPOLL setup queues, with hybrid polling
>>>> +		 */
>>>> +		struct {
>>>> +			u64		iopoll_start;
>>>> +			u64		iopoll_end;
>>>
>>> And IIRC it doesn't need to store the end as it's used immediately
>>> after it's set in the same function.
>>
>> Nice, that opens up the door for less esoteric sharing as well. And
>> yeah, I'd just use:
>>
>> runtime = ktime_get_ns() - req->iopoll_start - sleep_time;
>>
>> in io_uring_hybrid_poll() and kill it entirely, doesn't even need a
>> local variable there. And then shove iopoll_start into the union with
>> comp_list/apoll_events.
>
> That's with what the current request is hooked into the list, IOW such
> aliasing will corrupt the request

Ah true, well some other spot then, should be pretty easy to find 8
bytes for iopoll_start. As mentioned, the point is really just to THINK
about where it should go, rather than lazily just shove it at the end
like no thought has been given to it.
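As an aside on the placement point, a small userspace illustration (a toy,
not the real io_uring structures; layouts are compiler/ABI dependent,
x86-64-ish assumed) of why filling an existing padding hole, as suggested
for poll_state next to cancel_seq_set, costs no extra space, while lazily
appending a member at the end of a struct usually does:

#include <stdbool.h>
#include <stdio.h>

struct in_hole {			/* new flag reuses existing padding */
	int	refs;
	bool	cancel_seq_set;
	bool	poll_state;		/* fits in the hole after the bool */
	void	*io_task_work;
};

struct at_end {				/* new flag appended at the end */
	int	refs;
	bool	cancel_seq_set;
	void	*io_task_work;
	bool	poll_state;		/* forces a new tail plus padding */
};

int main(void)
{
	/* typically prints 16 vs 24 bytes on 64-bit ABIs */
	printf("in_hole: %zu bytes, at_end: %zu bytes\n",
	       sizeof(struct in_hole), sizeof(struct at_end));
	return 0;
}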
On 10/24/24 14:49, Jens Axboe wrote:
> On 10/24/24 8:49 AM, Pavel Begunkov wrote:
>> On 10/24/24 15:40, Jens Axboe wrote:
>>> On 10/24/24 8:26 AM, Pavel Begunkov wrote:
>>>> On 10/24/24 15:18, Jens Axboe wrote:
>>>>> On 10/23/24 8:38 PM, hexue wrote:
>>>>>> On 9/25/2024 12:12, Pavel Begunkov wrote:
>>>> ...
>
> Ah true, well some other spot then, should be pretty easy to find 8
> bytes for iopoll_start. As mentioned, the point is really just to THINK
> about where it should go, rather than lazily just shove it at the end
> like no thought has been given to it.

Thanks a lot for these suggestions and help. I will revise and submit a
complete v9 patch, thank you very much.

--
Xue
diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
index 3315005df117..35ac4a8bf6ab 100644
--- a/include/linux/io_uring_types.h
+++ b/include/linux/io_uring_types.h
@@ -422,6 +422,8 @@ struct io_ring_ctx {
 	unsigned short			n_sqe_pages;
 	struct page			**ring_pages;
 	struct page			**sqe_pages;
+	/* for io_uring hybrid poll*/
+	u64				available_time;
 };
 
 struct io_tw_state {
@@ -657,6 +659,10 @@ struct io_kiocb {
 		u64			extra1;
 		u64			extra2;
 	} big_cqe;
+	/* for io_uring hybrid iopoll */
+	bool				poll_state;
+	u64				iopoll_start;
+	u64				iopoll_end;
 };
 
 struct io_overflow_cqe {
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index 2aaf7ee256ac..42ae868651b0 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -199,6 +199,7 @@ enum io_uring_sqe_flags_bit {
  * Removes indirection through the SQ index array.
  */
 #define IORING_SETUP_NO_SQARRAY		(1U << 16)
+#define IORING_SETUP_HY_POLL		(1U << 17)
 
 enum io_uring_op {
 	IORING_OP_NOP,
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 3942db160f18..bb3dfd749572 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -301,6 +301,7 @@ static __cold struct io_ring_ctx *io_ring_ctx_alloc(struct io_uring_params *p)
 		goto err;
 
 	ctx->flags = p->flags;
+	ctx->available_time = LLONG_MAX;
 	atomic_set(&ctx->cq_wait_nr, IO_CQ_WAKE_INIT);
 	init_waitqueue_head(&ctx->sqo_sq_wait);
 	INIT_LIST_HEAD(&ctx->sqd_list);
@@ -3603,7 +3604,7 @@ static long io_uring_setup(u32 entries, struct io_uring_params __user *params)
 			IORING_SETUP_SQE128 | IORING_SETUP_CQE32 |
 			IORING_SETUP_SINGLE_ISSUER | IORING_SETUP_DEFER_TASKRUN |
 			IORING_SETUP_NO_MMAP | IORING_SETUP_REGISTERED_FD_ONLY |
-			IORING_SETUP_NO_SQARRAY))
+			IORING_SETUP_NO_SQARRAY | IORING_SETUP_HY_POLL))
 		return -EINVAL;
 
 	return io_uring_create(entries, &p, params);
diff --git a/io_uring/rw.c b/io_uring/rw.c
index c004d21e2f12..4d32b9b9900b 100644
--- a/io_uring/rw.c
+++ b/io_uring/rw.c
@@ -772,6 +772,13 @@ static bool need_complete_io(struct io_kiocb *req)
 		S_ISBLK(file_inode(req->file)->i_mode);
 }
 
+static void init_hybrid_poll(struct io_kiocb *req)
+{
+	/* make sure every req only block once*/
+	req->poll_state = false;
+	req->iopoll_start = ktime_get_ns();
+}
+
 static int io_rw_init_file(struct io_kiocb *req, fmode_t mode, int rw_type)
 {
 	struct io_rw *rw = io_kiocb_to_cmd(req, struct io_rw);
@@ -809,6 +816,8 @@ static int io_rw_init_file(struct io_kiocb *req, fmode_t mode, int rw_type)
 		kiocb->ki_flags |= IOCB_HIPRI;
 		kiocb->ki_complete = io_complete_rw_iopoll;
 		req->iopoll_completed = 0;
+		if (ctx->flags & IORING_SETUP_HY_POLL)
+			init_hybrid_poll(req);
 	} else {
 		if (kiocb->ki_flags & IOCB_HIPRI)
 			return -EINVAL;
@@ -1105,6 +1114,81 @@ void io_rw_fail(struct io_kiocb *req)
 	io_req_set_res(req, res, req->cqe.flags);
 }
 
+static int io_uring_classic_poll(struct io_kiocb *req,
+		struct io_comp_batch *iob, unsigned int poll_flags)
+{
+	int ret;
+	struct file *file = req->file;
+
+	if (req->opcode == IORING_OP_URING_CMD) {
+		struct io_uring_cmd *ioucmd;
+
+		ioucmd = io_kiocb_to_cmd(req, struct io_uring_cmd);
+		ret = file->f_op->uring_cmd_iopoll(ioucmd, iob,
+				poll_flags);
+	} else {
+		struct io_rw *rw = io_kiocb_to_cmd(req, struct io_rw);
+
+		ret = file->f_op->iopoll(&rw->kiocb, iob, poll_flags);
+	}
+	return ret;
+}
+
+static u64 io_delay(struct io_ring_ctx *ctx, struct io_kiocb *req)
+{
+	struct hrtimer_sleeper timer;
+	enum hrtimer_mode mode;
+	ktime_t kt;
+	u64 sleep_time;
+
+	if (req->poll_state)
+		return 0;
+
+	if (ctx->available_time == LLONG_MAX)
+		return 0;
+
+	/* Using half running time to do schedul */
+	sleep_time = ctx->available_time / 2;
+
+	kt = ktime_set(0, sleep_time);
+	req->poll_state = true;
+
+	mode = HRTIMER_MODE_REL;
+	hrtimer_init_sleeper_on_stack(&timer, CLOCK_MONOTONIC, mode);
+	hrtimer_set_expires(&timer.timer, kt);
+	set_current_state(TASK_INTERRUPTIBLE);
+	hrtimer_sleeper_start_expires(&timer, mode);
+
+	if (timer.task)
+		io_schedule();
+
+	hrtimer_cancel(&timer.timer);
+	__set_current_state(TASK_RUNNING);
+	destroy_hrtimer_on_stack(&timer.timer);
+
+	return sleep_time;
+}
+
+static int io_uring_hybrid_poll(struct io_kiocb *req,
+		struct io_comp_batch *iob, unsigned int poll_flags)
+{
+	struct io_ring_ctx *ctx = req->ctx;
+	int ret;
+	u64 runtime, sleep_time;
+
+	sleep_time = io_delay(ctx, req);
+	ret = io_uring_classic_poll(req, iob, poll_flags);
+	req->iopoll_end = ktime_get_ns();
+	runtime = req->iopoll_end - req->iopoll_start - sleep_time;
+
+	/* use minimize sleep time if there are different speed
+	 * drivers, it could get more completions from fast one
+	 */
+	if (ctx->available_time > runtime)
+		ctx->available_time = runtime;
+	return ret;
+}
+
 int io_do_iopoll(struct io_ring_ctx *ctx, bool force_nonspin)
 {
 	struct io_wq_work_node *pos, *start, *prev;
@@ -1121,7 +1205,6 @@ int io_do_iopoll(struct io_ring_ctx *ctx, bool force_nonspin)
 	wq_list_for_each(pos, start, &ctx->iopoll_list) {
 		struct io_kiocb *req = container_of(pos, struct io_kiocb,
 						    comp_list);
-		struct file *file = req->file;
 		int ret;
 
 		/*
@@ -1132,17 +1215,11 @@ int io_do_iopoll(struct io_ring_ctx *ctx, bool force_nonspin)
 		if (READ_ONCE(req->iopoll_completed))
 			break;
 
-		if (req->opcode == IORING_OP_URING_CMD) {
-			struct io_uring_cmd *ioucmd;
-
-			ioucmd = io_kiocb_to_cmd(req, struct io_uring_cmd);
-			ret = file->f_op->uring_cmd_iopoll(ioucmd, &iob,
-							   poll_flags);
-		} else {
-			struct io_rw *rw = io_kiocb_to_cmd(req, struct io_rw);
+		if (ctx->flags & IORING_SETUP_HY_POLL)
+			ret = io_uring_hybrid_poll(req, &iob, poll_flags);
+		else
+			ret = io_uring_classic_poll(req, &iob, poll_flags);
 
-			ret = file->f_op->iopoll(&rw->kiocb, &iob, poll_flags);
-		}
 		if (unlikely(ret < 0))
 			return ret;
 		else if (ret)
This patch adds a new hybrid poll at the io_uring level. It also adds a
setup flag, "IORING_SETUP_HY_POLL", to give applications an interface to
enable the new hybrid polling flexibly.

io_uring's polling mode can improve IO performance, but it spends 100% of
CPU resources on polling. A new hybrid poll is implemented at the io_uring
layer: once an IO is issued, it does not poll immediately, but blocks
first and runs again shortly before the IO completes, then polls to reap
the IO. This poll function can be a suboptimal solution when running on a
single thread: it offers lower performance than regular polling but higher
than IRQ, and CPU utilization is also lower than with regular polling.

Test Result: fio-3.35, 16 poll queues, 1 thread
-------------------------------------------------------------------------
Performance
-------------------------------------------------------------------------
               write          read           randwrite   randread
regular poll   BW=3939MiB/s   BW=6613MiB/s   IOPS=190K   IOPS=470K
IRQ            BW=3927MiB/s   BW=6612MiB/s   IOPS=181K   IOPS=203K
hybrid poll    BW=3937MiB/s   BW=6623MiB/s   IOPS=190K   IOPS=358K (suboptimal)
-------------------------------------------------------------------------

CPU Utilization
------------------------------------------------------
               write   read   randwrite   randread
regular poll   100%    100%   100%        100%
IRQ            50%     53%    100%        100%
hybrid poll    70%     37%    70%         85%
------------------------------------------------------

--
changes since v7:
- rebase code on for-6.12/io_uring
- remove unused variables

changes since v6:
- Modified IO path, distinct iopoll and uring_cmd_iopoll
- update test results

changes since v5:
- Remove cstime recorder
- Use minimum sleep time in different drivers
- Use half of the whole runtime to do schedule
- Consider as a suboptimal solution between regular poll and IRQ

changes since v4:
- Rewrote the commit
- Update the test results
- Reorganized the code based on 6.11

changes since v3:
- Simplified the commit
- Add some comments on code

changes since v2:
- Modified some formatting errors
- Move judgement to poll path

changes since v1:
- Extend hybrid poll to async polled io

Signed-off-by: hexue <xue01.he@samsung.com>
---
 include/linux/io_uring_types.h |  6 +++
 include/uapi/linux/io_uring.h  |  1 +
 io_uring/io_uring.c            |  3 +-
 io_uring/rw.c                  | 99 ++++++++++++++++++++++++++++------
 4 files changed, 97 insertions(+), 12 deletions(-)