Message ID | 20231108080732.15587-1-xiaobing.li@samsung.com (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | [v2] io_uring: Statistics of the true utilization of sq threads. | expand |
On 11/8/23 1:07 AM, Xiaobing Li wrote: > Since the sq thread has a while(1) structure, during this process, there > may be a lot of time that is not processing IO but does not exceed the > timeout period, therefore, the sqpoll thread will keep running and will > keep occupying the CPU. Obviously, the CPU is wasted at this time;Our > goal is to count the part of the time that the sqpoll thread actually > processes IO, so as to reflect the part of the CPU it uses to process > IO, which can be used to help improve the actual utilization of the CPU > in the future. There should be an explanation in here on what 'work' and 'total' mean. > The test results are as follows: > cat /proc/11440/fdinfo/6 > pos: 0 > flags: 02000002 > mnt_id: 16 > ino: 94449 > SqMask: 0xf > SqHead: 1845170 > SqTail: 1845170 > CachedSqHead: 1845170 > CqMask: 0xf > CqHead: 1845154 > CqTail: 1845154 > CachedCqTail: 1845154 > SQEs: 0 > CQEs: 0 > SqThread: -1 > SqThreadCpu: -1 > UserFiles: 1 > UserBufs: 0 > PollList: > CqOverflowList: > PID: 11440 > work: 18794 > total: 19123 These should go with the other Sq thread related ones, eg be SqWork and SqTotal. It's counted in jiffies right now which is a bit odd in terms of being exposed, as you'd need to know what the base of that unit is. But probably not much of a concern, as work/total is really the metric you care about. Maybe it'd be better to expose it as a percentage, and get rid of total? Eg just have SqBusy: xx% be the output. > diff --git a/io_uring/fdinfo.c b/io_uring/fdinfo.c > index f04a43044d91..f0b79c533062 100644 > --- a/io_uring/fdinfo.c > +++ b/io_uring/fdinfo.c > @@ -213,6 +213,12 @@ __cold void io_uring_show_fdinfo(struct seq_file *m, struct file *f) > > } > > + if (ctx->sq_data) { > + seq_printf(m, "PID:\t%d\n", task_pid_nr(ctx->sq_data->thread)); > + seq_printf(m, "work:\t%lu\n", ctx->sq_data->work); > + seq_printf(m, "total:\t%lu\n", ctx->sq_data->total); > + } > + This doesn't work, it needs proper locking. See how we get the other sq values.
Xiaobing Li <xiaobing.li@samsung.com> writes: > diff --git a/io_uring/fdinfo.c b/io_uring/fdinfo.c > index f04a43044d91..f0b79c533062 100644 > --- a/io_uring/fdinfo.c > +++ b/io_uring/fdinfo.c > @@ -213,6 +213,12 @@ __cold void io_uring_show_fdinfo(struct seq_file *m, struct file *f) > > } > > + if (ctx->sq_data) { > + seq_printf(m, "PID:\t%d\n", task_pid_nr(ctx->sq_data->thread)); ctx->sq_data and sq_data->thread can become NULL if the queue is being setup/going away. You need to hold the uring_lock and the ctx->sq_data->lock to access this. But task_pid_nr is already published in this file a bit before this hunk. Perhaps drop this line and move the two lines below together with them. > + seq_printf(m, "work:\t%lu\n", ctx->sq_data->work); > + seq_printf(m, "total:\t%lu\n", ctx->sq_data->total); > + }
On 11/8/23 08:07, Xiaobing Li wrote: > Since the sq thread has a while(1) structure, during this process, there > may be a lot of time that is not processing IO but does not exceed the > timeout period, therefore, the sqpoll thread will keep running and will > keep occupying the CPU. Obviously, the CPU is wasted at this time;Our > goal is to count the part of the time that the sqpoll thread actually > processes IO, so as to reflect the part of the CPU it uses to process > IO, which can be used to help improve the actual utilization of the CPU > in the future. Let's pull the elephant out of the room, what's the use case? "Improve in the future" doesn't sound too convincing. If it's a future kernel change you have in mind, it has to go together with this patch. If it's a userspace application, it'd be interesting to hear what that is, especially if you have numbers ready. And another classic question, why can't it be done with bpf? > Signed-off-by: Xiaobing Li <xiaobing.li@samsung.com> > > v1 -> v2: Added method to query data. > ... > diff --git a/io_uring/sqpoll.c b/io_uring/sqpoll.c > index bd6c2c7959a5..c821273406bd 100644 > --- a/io_uring/sqpoll.c > +++ b/io_uring/sqpoll.c > @@ -224,6 +224,7 @@ static int io_sq_thread(void *data) > struct io_ring_ctx *ctx; > unsigned long timeout = 0; > char buf[TASK_COMM_LEN]; > + unsigned long start, begin, end; start and begin used for just slightly different accounting, it'll get confused anyone. > DEFINE_WAIT(wait); > > snprintf(buf, sizeof(buf), "iou-sqp-%d", sqd->task_pid); > @@ -235,6 +236,7 @@ static int io_sq_thread(void *data) > set_cpus_allowed_ptr(current, cpu_online_mask); > > mutex_lock(&sqd->lock); > + start = jiffies; > while (1) { > bool cap_entries, sqt_spin = false; > > @@ -245,12 +247,18 @@ static int io_sq_thread(void *data) > } > > cap_entries = !list_is_singular(&sqd->ctx_list); > + begin = jiffies; There can be {hard,soft}irq in between jiffies reads, and it can even be scheduled out in favour of another process, so it'd collect a lot of garbage. There should be a per-task stat for system time you can use: start = get_system_time(current); do_io_part(); sq->total_time += get_system_time(current) - start; wait(); ... > list_for_each_entry(ctx, &sqd->ctx_list, sqd_list) { > int ret = __io_sq_thread(ctx, cap_entries); > > if (!sqt_spin && (ret > 0 || !wq_list_empty(&ctx->iopoll_list))) > sqt_spin = true; > } > + end = jiffies; > + sqd->total = end - start; ...and then you don't need to track total at all, it'd be your total = get_system_time(sq_thread /* current */); at any given point it time. > + if (sqt_spin == true) > + sqd->work += end - begin; It should go after the io_run_task_work() below, task_work is a major part of request execution. > + > if (io_run_task_work()) > sqt_spin = true; > > diff --git a/io_uring/sqpoll.h b/io_uring/sqpoll.h > index 8df37e8c9149..0aa4e2efa4db 100644 > --- a/io_uring/sqpoll.h > +++ b/io_uring/sqpoll.h > @@ -16,6 +16,8 @@ struct io_sq_data { > pid_t task_pid; > pid_t task_tgid; > > + unsigned long work; > + unsigned long total; > unsigned long state; > struct completion exited; > };
After careful consideration and testing, I don't think getting the uring_lock is possible here, for the following reasons: Due to lock competition, ctx->uring_lock and sq->lock are usually not available here. The best proof is that the values of SqThread and SqThreadCpu always output -1. In this case, it is impossible to obtain the required work_time and total_time values. In fact, it should be feasible to obtain work_time and total_time by judging that ctx->sq_data is not NULL, because if the sq thread exits, the action of reading data will also stop, and the possibility of a null pointer reference is very low.
On 11/12/23 8:10 PM, Xiaobing Li wrote: > After careful consideration and testing, I don't think getting the > uring_lock is possible here, for the following reasons: > Due to lock competition, ctx->uring_lock and sq->lock are usually not > available here. The best proof is that the values of SqThread and > SqThreadCpu always output -1. In this case, it is impossible to obtain > the required work_time and total_time values. > In fact, it should be feasible to obtain work_time and total_time by > judging that ctx->sq_data is not NULL, because if the sq thread exits, > the action of reading data will also stop, and the possibility of a null > pointer reference is very low. We have that problem right now, in the current tree. And agree it's not the best. sq_data should be fine under the (ctx) lock, it's just the thread that may go away. Maybe we just cache the cpu/pid of it when we create it, seems better than needing to query it. And for the other stats, should be fine in ctx->sq_data.
diff --git a/io_uring/fdinfo.c b/io_uring/fdinfo.c index f04a43044d91..f0b79c533062 100644 --- a/io_uring/fdinfo.c +++ b/io_uring/fdinfo.c @@ -213,6 +213,12 @@ __cold void io_uring_show_fdinfo(struct seq_file *m, struct file *f) } + if (ctx->sq_data) { + seq_printf(m, "PID:\t%d\n", task_pid_nr(ctx->sq_data->thread)); + seq_printf(m, "work:\t%lu\n", ctx->sq_data->work); + seq_printf(m, "total:\t%lu\n", ctx->sq_data->total); + } + spin_unlock(&ctx->completion_lock); } #endif diff --git a/io_uring/sqpoll.c b/io_uring/sqpoll.c index bd6c2c7959a5..c821273406bd 100644 --- a/io_uring/sqpoll.c +++ b/io_uring/sqpoll.c @@ -224,6 +224,7 @@ static int io_sq_thread(void *data) struct io_ring_ctx *ctx; unsigned long timeout = 0; char buf[TASK_COMM_LEN]; + unsigned long start, begin, end; DEFINE_WAIT(wait); snprintf(buf, sizeof(buf), "iou-sqp-%d", sqd->task_pid); @@ -235,6 +236,7 @@ static int io_sq_thread(void *data) set_cpus_allowed_ptr(current, cpu_online_mask); mutex_lock(&sqd->lock); + start = jiffies; while (1) { bool cap_entries, sqt_spin = false; @@ -245,12 +247,18 @@ static int io_sq_thread(void *data) } cap_entries = !list_is_singular(&sqd->ctx_list); + begin = jiffies; list_for_each_entry(ctx, &sqd->ctx_list, sqd_list) { int ret = __io_sq_thread(ctx, cap_entries); if (!sqt_spin && (ret > 0 || !wq_list_empty(&ctx->iopoll_list))) sqt_spin = true; } + end = jiffies; + sqd->total = end - start; + if (sqt_spin == true) + sqd->work += end - begin; + if (io_run_task_work()) sqt_spin = true; diff --git a/io_uring/sqpoll.h b/io_uring/sqpoll.h index 8df37e8c9149..0aa4e2efa4db 100644 --- a/io_uring/sqpoll.h +++ b/io_uring/sqpoll.h @@ -16,6 +16,8 @@ struct io_sq_data { pid_t task_pid; pid_t task_tgid; + unsigned long work; + unsigned long total; unsigned long state; struct completion exited; };
Since the sq thread has a while(1) structure, during this process, there may be a lot of time that is not processing IO but does not exceed the timeout period, therefore, the sqpoll thread will keep running and will keep occupying the CPU. Obviously, the CPU is wasted at this time;Our goal is to count the part of the time that the sqpoll thread actually processes IO, so as to reflect the part of the CPU it uses to process IO, which can be used to help improve the actual utilization of the CPU in the future. Signed-off-by: Xiaobing Li <xiaobing.li@samsung.com> v1 -> v2: Added method to query data. The test results are as follows: cat /proc/11440/fdinfo/6 pos: 0 flags: 02000002 mnt_id: 16 ino: 94449 SqMask: 0xf SqHead: 1845170 SqTail: 1845170 CachedSqHead: 1845170 CqMask: 0xf CqHead: 1845154 CqTail: 1845154 CachedCqTail: 1845154 SQEs: 0 CQEs: 0 SqThread: -1 SqThreadCpu: -1 UserFiles: 1 UserBufs: 0 PollList: CqOverflowList: PID: 11440 work: 18794 total: 19123 --- io_uring/fdinfo.c | 6 ++++++ io_uring/sqpoll.c | 8 ++++++++ io_uring/sqpoll.h | 2 ++ 3 files changed, 16 insertions(+)