[3/3] IO_URING: Statistics of the true utilization of sq threads.

Message ID 20230928022228.15770-4-xiaobing.li@samsung.com (mailing list archive)
State New
Series [1/3] SCHEDULER: Add an interface for counting real utilization.

Commit Message

Xiaobing Li Sept. 28, 2023, 2:22 a.m. UTC
Since the sq thread runs in a while(1) loop, there may be long stretches
where it is not processing IO but has not yet exceeded the idle timeout,
so the sqpoll thread keeps running and keeps occupying the CPU, and that
CPU time is plainly wasted. Our goal is to count the portion of time the
sqpoll thread actually spends processing IO, so as to reflect the share
of the CPU it uses for IO, which can be used to help improve the actual
utilization of the CPU in the future.

Signed-off-by: Xiaobing Li <xiaobing.li@samsung.com>
---
 io_uring/sqpoll.c | 26 +++++++++++++++++++++++++-
 1 file changed, 25 insertions(+), 1 deletion(-)

Comments

Peter Zijlstra Sept. 28, 2023, 8:01 a.m. UTC | #1
On Thu, Sep 28, 2023 at 10:22:28AM +0800, Xiaobing Li wrote:
> Since the sq thread runs in a while(1) loop, there may be long stretches
> where it is not processing IO but has not yet exceeded the idle timeout,
> so the sqpoll thread keeps running and keeps occupying the CPU, and that
> CPU time is plainly wasted. Our goal is to count the portion of time the
> sqpoll thread actually spends processing IO, so as to reflect the share
> of the CPU it uses for IO, which can be used to help improve the actual
> utilization of the CPU in the future.
> 
> Signed-off-by: Xiaobing Li <xiaobing.li@samsung.com>
> ---
>  io_uring/sqpoll.c | 26 +++++++++++++++++++++++++-
>  1 file changed, 25 insertions(+), 1 deletion(-)
> 
> diff --git a/io_uring/sqpoll.c b/io_uring/sqpoll.c
> index bd6c2c7959a5..2c5fc4d95fa8 100644
> --- a/io_uring/sqpoll.c
> +++ b/io_uring/sqpoll.c
> @@ -11,6 +11,7 @@
>  #include <linux/audit.h>
>  #include <linux/security.h>
>  #include <linux/io_uring.h>
> +#include <linux/time.h>
>  
>  #include <uapi/linux/io_uring.h>
>  
> @@ -235,6 +236,10 @@ static int io_sq_thread(void *data)
>  		set_cpus_allowed_ptr(current, cpu_online_mask);
>  
>  	mutex_lock(&sqd->lock);
> +	bool first = true;
> +	struct timespec64 ts_start, ts_end;
> +	struct timespec64 ts_delta;
> +	struct sched_entity *se = &sqd->thread->se;
>  	while (1) {
>  		bool cap_entries, sqt_spin = false;
>  
> @@ -243,7 +248,16 @@ static int io_sq_thread(void *data)
>  				break;
>  			timeout = jiffies + sqd->sq_thread_idle;
>  		}
> -
> +		ktime_get_boottime_ts64(&ts_start);
> +		ts_delta = timespec64_sub(ts_start, ts_end);
> +		unsigned long long now = ts_delta.tv_sec * NSEC_PER_SEC + ts_delta.tv_nsec +
> +		se->sq_avg.last_update_time;
> +
> +		if (first) {
> +			now = 0;
> +			first = false;
> +		}
> +		__update_sq_avg_block(now, se);
>  		cap_entries = !list_is_singular(&sqd->ctx_list);
>  		list_for_each_entry(ctx, &sqd->ctx_list, sqd_list) {
>  			int ret = __io_sq_thread(ctx, cap_entries);
> @@ -251,6 +265,16 @@ static int io_sq_thread(void *data)
>  			if (!sqt_spin && (ret > 0 || !wq_list_empty(&ctx->iopoll_list)))
>  				sqt_spin = true;
>  		}
> +
> +		ktime_get_boottime_ts64(&ts_end);
> +		ts_delta = timespec64_sub(ts_end, ts_start);
> +		now = ts_delta.tv_sec * NSEC_PER_SEC + ts_delta.tv_nsec +
> +		se->sq_avg.last_update_time;
> +
> +		if (sqt_spin)
> +			__update_sq_avg(now, se);
> +		else
> +			__update_sq_avg_block(now, se);
>  		if (io_run_task_work())
>  			sqt_spin = true;
>  

All of this is quite insane, but the above is actually broken. You're
using wall-time to measure runtime of a preemptible thread.

On top of that, for extra insanity, you're using the frigging insane
timespec interface, because clearly the _ns() interfaces are too
complicated or something?

And that whole 'first' flag is daft too: pull 'now' out of the loop and
set it before the loop starts; then all of that goes away.
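
Spelled out, those points amount to something like the following
sketch. __update_sq_avg*() are the helpers patch 1/3 of this series
introduces, not an existing kernel API, and se.sum_exec_runtime is
only exact at tick and context-switch boundaries:

	struct sched_entity *se = &current->se;
	/* Scheduler-accounted runtime in ns; preemption is not charged. */
	u64 now = se->sum_exec_runtime;

	/* Prime last_update_time once; the 'first' flag then goes away. */
	__update_sq_avg_block(now, se);

	while (1) {
		/* ... the __io_sq_thread() loop runs and sets sqt_spin ... */

		now = se->sum_exec_runtime;
		if (sqt_spin)
			__update_sq_avg(now, se);	/* iteration did real work */
		else
			__update_sq_avg_block(now, se);	/* iteration only spun */
	}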

Now, I see what you're trying to do, but who actually uses this data?

Finally, please don't scream in the subject :/
Jens Axboe Sept. 28, 2023, 8:23 a.m. UTC | #2
On 9/28/23 2:01 AM, Peter Zijlstra wrote:
> All of this is quite insane, but the above is actually broken. You're
> using wall-time to measure runtime of a preemptible thread.

Would have to agree with that... wall-time specifics aside, this whole
thing seems to attempt to solve an issue quite the wrong way around.

> Now, I see what you're trying to do, but who actually uses this data?

This is the important question - and if you need this data, then why not
just account it in sqpoll itself and have some way to export it? Let's
please not implicate the core kernel bits for that.
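
For illustration, one io_uring-local shape of that, with hypothetical
work_time_ns/total_time_ns fields (not code from this series) kept in
struct io_sq_data and touched only by the sqpoll thread itself:

	/* In io_sq_thread(), around the per-ctx submission loop: */
	u64 t0 = current->se.sum_exec_runtime;

	list_for_each_entry(ctx, &sqd->ctx_list, sqd_list) {
		int ret = __io_sq_thread(ctx, cap_entries);

		if (!sqt_spin && (ret > 0 || !wq_list_empty(&ctx->iopoll_list)))
			sqt_spin = true;
	}

	u64 t1 = current->se.sum_exec_runtime;
	sqd->total_time_ns += t1 - t0;		/* every iteration counts here */
	if (sqt_spin)
		sqd->work_time_ns += t1 - t0;	/* only iterations that did work */

No scheduler involvement is needed for any of this.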
Matthew Wilcox (Oracle) Sept. 28, 2023, 8:37 a.m. UTC | #3
On Thu, Sep 28, 2023 at 10:01:14AM +0200, Peter Zijlstra wrote:
> Now, I see what you're trying to do, but who actually uses this data?

I ... don't.  There seems to be the notion that since we're polling, that
shouldn't count against the runtime of the thread.  But the thread has
chosen to poll!  It is doing something!  For one thing, it's preventing
the CPU from entering an idle state.  It seems absolutely fair to
account this poll time to the runtime of the thread.  Clearly I'm
missing something.
Jens Axboe Sept. 28, 2023, 8:41 a.m. UTC | #4
On 9/28/23 2:37 AM, Matthew Wilcox wrote:
> On Thu, Sep 28, 2023 at 10:01:14AM +0200, Peter Zijlstra wrote:
>> Now, I see what you're trying to do, but who actually uses this data?
> 
> I ... don't.  There seems to be the notion that since we're polling, that
> shouldn't count against the runtime of the thread.  But the thread has
> chosen to poll!  It is doing something!  For one thing, it's preventing
> the CPU from entering an idle state.  It seems absolutely fair to
> account this poll time to the runtime of the thread.  Clearly I'm
> missing something.

For sure, it should be accounted as CPU time, as it is exactly that. You
could argue that if we needed to preempt this task for something else we
would do that (and the code does check that on every loop), but it's
still using CPU.

I can see maybe wanting to know how much of the total time the thread
spent doing ACTUAL work rather than just polling for new work, but
that's not really something the scheduler should be involved in. It
should be purely an io_uring sqpoll stat of some sort, if that is truly
interesting for an application.
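
Continuing that sketch, the stat could then be exported through the
ring's fdinfo rather than the scheduler; the SqTotalTime/SqWorkTime
labels below are made up for illustration:

	/* In io_uring_show_fdinfo() (io_uring/fdinfo.c), next to the
	 * existing SqThread/SqThreadCpu lines: */
	if (sq) {
		seq_printf(m, "SqTotalTime:\t%llu\n", sq->total_time_ns);
		seq_printf(m, "SqWorkTime:\t%llu\n", sq->work_time_ns);
	}

Userspace can then compute "actual work / total" on its own, with no
sched_entity changes at all.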
kernel test robot Sept. 28, 2023, 6:33 p.m. UTC | #5
Hi Xiaobing,

kernel test robot noticed the following build errors:

[auto build test ERROR on tip/sched/core]
[also build test ERROR on linus/master v6.6-rc3 next-20230928]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patches, we suggest using '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Xiaobing-Li/SCHEDULER-Add-an-interface-for-counting-real-utilization/20230928-103219
base:   tip/sched/core
patch link:    https://lore.kernel.org/r/20230928022228.15770-4-xiaobing.li%40samsung.com
patch subject: [PATCH 3/3] IO_URING: Statistics of the true utilization of sq threads.
config: arm-mmp2_defconfig (https://download.01.org/0day-ci/archive/20230929/202309290242.FeOyvZYI-lkp@intel.com/config)
compiler: clang version 15.0.7 (https://github.com/llvm/llvm-project.git 8dfdcc7b7bf66834a761bd8de445840ef68e4d1a)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20230929/202309290242.FeOyvZYI-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add the following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202309290242.FeOyvZYI-lkp@intel.com/

All errors (new ones prefixed by >>):

   io_uring/sqpoll.c:254:7: error: no member named 'sq_avg' in 'struct sched_entity'
                   se->sq_avg.last_update_time;
                   ~~  ^
>> io_uring/sqpoll.c:260:3: error: call to undeclared function '__update_sq_avg_block'; ISO C99 and later do not support implicit function declarations [-Werror,-Wimplicit-function-declaration]
                   __update_sq_avg_block(now, se);
                   ^
   io_uring/sqpoll.c:272:7: error: no member named 'sq_avg' in 'struct sched_entity'
                   se->sq_avg.last_update_time;
                   ~~  ^
>> io_uring/sqpoll.c:275:4: error: call to undeclared function '__update_sq_avg'; ISO C99 and later do not support implicit function declarations [-Werror,-Wimplicit-function-declaration]
                           __update_sq_avg(now, se);
                           ^
   4 errors generated.


vim +/__update_sq_avg_block +260 io_uring/sqpoll.c

   221	
   222	static int io_sq_thread(void *data)
   223	{
   224		struct io_sq_data *sqd = data;
   225		struct io_ring_ctx *ctx;
   226		unsigned long timeout = 0;
   227		char buf[TASK_COMM_LEN];
   228		DEFINE_WAIT(wait);
   229	
   230		snprintf(buf, sizeof(buf), "iou-sqp-%d", sqd->task_pid);
   231		set_task_comm(current, buf);
   232	
   233		if (sqd->sq_cpu != -1)
   234			set_cpus_allowed_ptr(current, cpumask_of(sqd->sq_cpu));
   235		else
   236			set_cpus_allowed_ptr(current, cpu_online_mask);
   237	
   238		mutex_lock(&sqd->lock);
   239		bool first = true;
   240		struct timespec64 ts_start, ts_end;
   241		struct timespec64 ts_delta;
   242		struct sched_entity *se = &sqd->thread->se;
   243		while (1) {
   244			bool cap_entries, sqt_spin = false;
   245	
   246			if (io_sqd_events_pending(sqd) || signal_pending(current)) {
   247				if (io_sqd_handle_event(sqd))
   248					break;
   249				timeout = jiffies + sqd->sq_thread_idle;
   250			}
   251			ktime_get_boottime_ts64(&ts_start);
   252			ts_delta = timespec64_sub(ts_start, ts_end);
   253			unsigned long long now = ts_delta.tv_sec * NSEC_PER_SEC + ts_delta.tv_nsec +
   254			se->sq_avg.last_update_time;
   255	
   256			if (first) {
   257				now = 0;
   258				first = false;
   259			}
 > 260			__update_sq_avg_block(now, se);
   261			cap_entries = !list_is_singular(&sqd->ctx_list);
   262			list_for_each_entry(ctx, &sqd->ctx_list, sqd_list) {
   263				int ret = __io_sq_thread(ctx, cap_entries);
   264	
   265				if (!sqt_spin && (ret > 0 || !wq_list_empty(&ctx->iopoll_list)))
   266					sqt_spin = true;
   267			}
   268	
   269			ktime_get_boottime_ts64(&ts_end);
   270			ts_delta = timespec64_sub(ts_end, ts_start);
   271			now = ts_delta.tv_sec * NSEC_PER_SEC + ts_delta.tv_nsec +
   272			se->sq_avg.last_update_time;
   273	
   274			if (sqt_spin)
 > 275				__update_sq_avg(now, se);
   276			else
   277				__update_sq_avg_block(now, se);
   278			if (io_run_task_work())
   279				sqt_spin = true;
   280	
   281			if (sqt_spin || !time_after(jiffies, timeout)) {
   282				if (sqt_spin)
   283					timeout = jiffies + sqd->sq_thread_idle;
   284				if (unlikely(need_resched())) {
   285					mutex_unlock(&sqd->lock);
   286					cond_resched();
   287					mutex_lock(&sqd->lock);
   288				}
   289				continue;
   290			}
   291	
   292			prepare_to_wait(&sqd->wait, &wait, TASK_INTERRUPTIBLE);
   293			if (!io_sqd_events_pending(sqd) && !task_work_pending(current)) {
   294				bool needs_sched = true;
   295	
   296				list_for_each_entry(ctx, &sqd->ctx_list, sqd_list) {
   297					atomic_or(IORING_SQ_NEED_WAKEUP,
   298							&ctx->rings->sq_flags);
   299					if ((ctx->flags & IORING_SETUP_IOPOLL) &&
   300					    !wq_list_empty(&ctx->iopoll_list)) {
   301						needs_sched = false;
   302						break;
   303					}
   304	
   305					/*
   306					 * Ensure the store of the wakeup flag is not
   307					 * reordered with the load of the SQ tail
   308					 */
   309					smp_mb__after_atomic();
   310	
   311					if (io_sqring_entries(ctx)) {
   312						needs_sched = false;
   313						break;
   314					}
   315				}
   316	
   317				if (needs_sched) {
   318					mutex_unlock(&sqd->lock);
   319					schedule();
   320					mutex_lock(&sqd->lock);
   321				}
   322				list_for_each_entry(ctx, &sqd->ctx_list, sqd_list)
   323					atomic_andnot(IORING_SQ_NEED_WAKEUP,
   324							&ctx->rings->sq_flags);
   325			}
   326	
   327			finish_wait(&sqd->wait, &wait);
   328			timeout = jiffies + sqd->sq_thread_idle;
   329		}
   330	
   331		io_uring_cancel_generic(true, sqd);
   332		sqd->thread = NULL;
   333		list_for_each_entry(ctx, &sqd->ctx_list, sqd_list)
   334			atomic_or(IORING_SQ_NEED_WAKEUP, &ctx->rings->sq_flags);
   335		io_run_task_work();
   336		mutex_unlock(&sqd->lock);
   337	
   338		complete(&sqd->exited);
   339		do_exit(0);
   340	}
   341