diff mbox series

[v6] io_uring: Statistics of the true utilization of sq threads.

Message ID 20231225054438.44581-1-xiaobing.li@samsung.com (mailing list archive)
State New
Headers show
Series [v6] io_uring: Statistics of the true utilization of sq threads. | expand

Commit Message

Xiaobing Li Dec. 25, 2023, 5:44 a.m. UTC
Count the running time and actual IO processing time of the sqpoll
thread, and output the statistical data to fdinfo.

Variable description:
"work_time" in the code represents the sum of the jiffies of the sq
thread actually processing IO, that is, how many milliseconds it
actually takes to process IO. "total_time" represents the total time
that the sq thread has elapsed from the beginning of the loop to the
current time point, that is, how many milliseconds it has spent in
total.

The test results are as follows:
Every 0.5s: cat /proc/23112/fdinfo/6 | grep Sq
SqMask: 0x3
SqHead: 1168417
SqTail: 1168418
CachedSqHead:   1168418
SqThread:       23112
SqThreadCpu:    55
SqWorkTime:     12423
SqTotalTime:    12843

changes:
v6:
 - Replace the percentages in the fdinfo output with the actual running
time and the time actually processing IO

v5:
 - list the changes in each iteration.

v4:
 - Resubmit the patch based on removing sq->lock

v3:
 - output actual working time as a percentage of total time
 - detailed description of the meaning of each variable
 - added test results

v2:
 - output the total statistical time and work time to fdinfo

v1:
 - initial version
 - Statistics of total time and work time

Signed-off-by: Xiaobing Li <xiaobing.li@samsung.com>
---
 io_uring/fdinfo.c | 6 ++++++
 io_uring/sqpoll.c | 8 ++++++++
 io_uring/sqpoll.h | 2 ++
 3 files changed, 16 insertions(+)

Comments

Jens Axboe Dec. 26, 2023, 4:32 p.m. UTC | #1
On Mon, 25 Dec 2023 13:44:38 +0800, Xiaobing Li wrote:
> Count the running time and actual IO processing time of the sqpoll
> thread, and output the statistical data to fdinfo.
> 
> Variable description:
> "work_time" in the code represents the sum of the jiffies of the sq
> thread actually processing IO, that is, how many milliseconds it
> actually takes to process IO. "total_time" represents the total time
> that the sq thread has elapsed from the beginning of the loop to the
> current time point, that is, how many milliseconds it has spent in
> total.
> 
> [...]

Applied, thanks!

[1/1] io_uring: Statistics of the true utilization of sq threads.
      commit: 9f7e5872eca81d7341e3ec222ebdc202ff536655

Best regards,
Pavel Begunkov Dec. 30, 2023, 4:27 p.m. UTC | #2
On 12/26/23 16:32, Jens Axboe wrote:
> 
> On Mon, 25 Dec 2023 13:44:38 +0800, Xiaobing Li wrote:
>> Count the running time and actual IO processing time of the sqpoll
>> thread, and output the statistical data to fdinfo.
>>
>> Variable description:
>> "work_time" in the code represents the sum of the jiffies of the sq
>> thread actually processing IO, that is, how many milliseconds it
>> actually takes to process IO. "total_time" represents the total time
>> that the sq thread has elapsed from the beginning of the loop to the
>> current time point, that is, how many milliseconds it has spent in
>> total.
>>
>> [...]
> 
> Applied, thanks!
> 
> [1/1] io_uring: Statistics of the true utilization of sq threads.
>        commit: 9f7e5872eca81d7341e3ec222ebdc202ff536655

I don't believe the patch is near complete, there are still
pending question that the author ignored (see replies to
prev revisions).

Why it uses jiffies instead of some task run time?
Consequently, why it's fine to account irq time and other
preemption? (hint, it's not)

Why it can't be done with userspace and/or bpf? Why
can't it be estimated by checking and tracking
IORING_SQ_NEED_WAKEUP in userspace?

What's the use case in particular? Considering that
one of the previous revisions was uapi-less, something
is really fishy here. Again, it's a procfs file nobody
but a few would want to parse to use the feature.

Why it just keeps aggregating stats for the whole
life time of the ring? If the workload changes,
that would either totally screw the stats or would make
it too inert to be useful. That's especially relevant
for long running (days) processes. There should be a
way to reset it so it starts counting anew.


I say the patch has to be removed until all that is
figured, but otherwise I'll just leave a NACK for
history.
Jens Axboe Dec. 30, 2023, 5:41 p.m. UTC | #3
On 12/30/23 9:27 AM, Pavel Begunkov wrote:
> On 12/26/23 16:32, Jens Axboe wrote:
>>
>> On Mon, 25 Dec 2023 13:44:38 +0800, Xiaobing Li wrote:
>>> Count the running time and actual IO processing time of the sqpoll
>>> thread, and output the statistical data to fdinfo.
>>>
>>> Variable description:
>>> "work_time" in the code represents the sum of the jiffies of the sq
>>> thread actually processing IO, that is, how many milliseconds it
>>> actually takes to process IO. "total_time" represents the total time
>>> that the sq thread has elapsed from the beginning of the loop to the
>>> current time point, that is, how many milliseconds it has spent in
>>> total.
>>>
>>> [...]
>>
>> Applied, thanks!
>>
>> [1/1] io_uring: Statistics of the true utilization of sq threads.
>>        commit: 9f7e5872eca81d7341e3ec222ebdc202ff536655
> 
> I don't believe the patch is near complete, there are still
> pending question that the author ignored (see replies to
> prev revisions).

We can drop and defer, that's not an issue. It's still sitting top of
branch.

Can you elaborate on the pending questions?

> Why it uses jiffies instead of some task run time?
> Consequently, why it's fine to account irq time and other
> preemption? (hint, it's not)

Yeah that's a good point, might be better to use task run time. Jiffies
is also an annoying metric to expose, as you'd need to then get the tick
rate as well. Though I suspect the ratio is the interesting bit here.

> Why it can't be done with userspace and/or bpf? Why
> can't it be estimated by checking and tracking
> IORING_SQ_NEED_WAKEUP in userspace?

Asking people to integrate bpf for this is a bit silly imho. Tracking
NEED_WAKEUP is also quite cumbersome and would most likely be higher
overhead as well.

> What's the use case in particular? Considering that
> one of the previous revisions was uapi-less, something
> is really fishy here. Again, it's a procfs file nobody
> but a few would want to parse to use the feature.

I brought this up earlier too, fdinfo is not a great API. For anything,
really.

> Why it just keeps aggregating stats for the whole
> life time of the ring? If the workload changes,
> that would either totally screw the stats or would make
> it too inert to be useful. That's especially relevant
> for long running (days) processes. There should be a
> way to reset it so it starts counting anew.

I don't see a problem there with the current revision, as the app can
just remember the previous two numbers and do the appropriate math
"since last time".

> I say the patch has to be removed until all that is
> figured, but otherwise I'll just leave a NACK for
> history.

That's fine, I can drop it for now and we can get the rest addressed.
Pavel Begunkov Dec. 30, 2023, 9:06 p.m. UTC | #4
On 12/30/23 17:41, Jens Axboe wrote:
> On 12/30/23 9:27 AM, Pavel Begunkov wrote:
>> On 12/26/23 16:32, Jens Axboe wrote:
>>>
>>> On Mon, 25 Dec 2023 13:44:38 +0800, Xiaobing Li wrote:
>>>> Count the running time and actual IO processing time of the sqpoll
>>>> thread, and output the statistical data to fdinfo.
>>>>
>>>> Variable description:
>>>> "work_time" in the code represents the sum of the jiffies of the sq
>>>> thread actually processing IO, that is, how many milliseconds it
>>>> actually takes to process IO. "total_time" represents the total time
>>>> that the sq thread has elapsed from the beginning of the loop to the
>>>> current time point, that is, how many milliseconds it has spent in
>>>> total.
>>>>
>>>> [...]
>>>
>>> Applied, thanks!
>>>
>>> [1/1] io_uring: Statistics of the true utilization of sq threads.
>>>         commit: 9f7e5872eca81d7341e3ec222ebdc202ff536655
>>
>> I don't believe the patch is near complete, there are still
>> pending question that the author ignored (see replies to
>> prev revisions).
> 
> We can drop and defer, that's not an issue. It's still sitting top of
> branch.
> 
> Can you elaborate on the pending questions?

I guess that wasn't clear, but I duplicated all of them in the
email you're replying to for convenience

>> Why it uses jiffies instead of some task run time?
>> Consequently, why it's fine to account irq time and other
>> preemption? (hint, it's not)
> 
> Yeah that's a good point, might be better to use task run time. Jiffies
> is also an annoying metric to expose, as you'd need to then get the tick
> rate as well. Though I suspect the ratio is the interesting bit here.

I agree that seconds are nicer, but that's not my point. That's
not about jiffies, but that the patch keeps counting regardless
whether the SQ task was actually running, or the CPU was serving
irq, or even if it was force descheduled.

I even outlined what a solution may look like, i.e. replace jiffies
with task runtime, which should already be counted in the task.

>> Why it can't be done with userspace and/or bpf? Why
>> can't it be estimated by checking and tracking
>> IORING_SQ_NEED_WAKEUP in userspace?
> 
> Asking people to integrate bpf for this is a bit silly imho. Tracking

I haven't seen any mention of the real use case, did I miss it?
Because otherwise I fail to see how it can possibly be called
silly when it's not clear how exactly it's used.

Maybe it's a bash program printing stats to a curious user? Or
maybe it's to track once at start, and then nobody cares about
it, in which case NEED_WAKEUP would be justified.

I can guess it's for adjusting the sq timeouts, but who knows.

> NEED_WAKEUP is also quite cumbersome and would most likely be higher
> overhead as well.

Comparing to reading a procfs file or doing an io_uring
register syscall? I doubt that. It's also not everyone
would be using that.

>> What's the use case in particular? Considering that
>> one of the previous revisions was uapi-less, something
>> is really fishy here. Again, it's a procfs file nobody
>> but a few would want to parse to use the feature.
> 
> I brought this up earlier too, fdinfo is not a great API. For anything,
> really.

I saw that comment, that's why I mentioned, but the
point is that I have doubts the author is even using
the uapi.

>> Why it just keeps aggregating stats for the whole
>> life time of the ring? If the workload changes,
>> that would either totally screw the stats or would make
>> it too inert to be useful. That's especially relevant
>> for long running (days) processes. There should be a
>> way to reset it so it starts counting anew.
> 
> I don't see a problem there with the current revision, as the app can
> just remember the previous two numbers and do the appropriate math
> "since last time".

I assumed it only prints the ratio. Looked it up, there
are 2 values, so you're right, can be easily recalculated.

>> I say the patch has to be removed until all that is
>> figured, but otherwise I'll just leave a NACK for
>> history.
> 
> That's fine, I can drop it for now and we can get the rest addressed.
Jens Axboe Dec. 30, 2023, 10:17 p.m. UTC | #5
On 12/30/23 2:06 PM, Pavel Begunkov wrote:
> On 12/30/23 17:41, Jens Axboe wrote:
>> On 12/30/23 9:27 AM, Pavel Begunkov wrote:
>>> On 12/26/23 16:32, Jens Axboe wrote:
>>>>
>>>> On Mon, 25 Dec 2023 13:44:38 +0800, Xiaobing Li wrote:
>>>>> Count the running time and actual IO processing time of the sqpoll
>>>>> thread, and output the statistical data to fdinfo.
>>>>>
>>>>> Variable description:
>>>>> "work_time" in the code represents the sum of the jiffies of the sq
>>>>> thread actually processing IO, that is, how many milliseconds it
>>>>> actually takes to process IO. "total_time" represents the total time
>>>>> that the sq thread has elapsed from the beginning of the loop to the
>>>>> current time point, that is, how many milliseconds it has spent in
>>>>> total.
>>>>>
>>>>> [...]
>>>>
>>>> Applied, thanks!
>>>>
>>>> [1/1] io_uring: Statistics of the true utilization of sq threads.
>>>>         commit: 9f7e5872eca81d7341e3ec222ebdc202ff536655
>>>
>>> I don't believe the patch is near complete, there are still
>>> pending question that the author ignored (see replies to
>>> prev revisions).
>>
>> We can drop and defer, that's not an issue. It's still sitting top of
>> branch.
>>
>> Can you elaborate on the pending questions?
> 
> I guess that wasn't clear, but I duplicated all of them in the
> email you're replying to for convenience
> 
>>> Why it uses jiffies instead of some task run time?
>>> Consequently, why it's fine to account irq time and other
>>> preemption? (hint, it's not)
>>
>> Yeah that's a good point, might be better to use task run time. Jiffies
>> is also an annoying metric to expose, as you'd need to then get the tick
>> rate as well. Though I suspect the ratio is the interesting bit here.
> 
> I agree that seconds are nicer, but that's not my point. That's
> not about jiffies, but that the patch keeps counting regardless
> whether the SQ task was actually running, or the CPU was serving
> irq, or even if it was force descheduled.

Right, guess I wasn't clear, I did very much agree with using task run
time to avoid cases like that where it's perceived running, but really
isn't. For example.

> I even outlined what a solution may look like, i.e. replace jiffies
> with task runtime, which should already be counted in the task.

Would be a good change to make. And to be fair, I guess they originally
wanted something like that, as the very first patch had some scheduler
interactions. Just wasn't done quite right.

>>> Why it can't be done with userspace and/or bpf? Why
>>> can't it be estimated by checking and tracking
>>> IORING_SQ_NEED_WAKEUP in userspace?
>>
>> Asking people to integrate bpf for this is a bit silly imho. Tracking
> 
> I haven't seen any mention of the real use case, did I miss it?
> Because otherwise I fail to see how it can possibly be called
> silly when it's not clear how exactly it's used.
> 
> Maybe it's a bash program printing stats to a curious user? Or
> maybe it's to track once at start, and then nobody cares about
> it, in which case NEED_WAKEUP would be justified.
> 
> I can guess it's for adjusting the sq timeouts, but who knows.

I only know what is in those threads, but the most obvious use case
would indeed be to vet the efficiency of the chosen timeout value and
balance cpu usage with latency like that.

>> NEED_WAKEUP is also quite cumbersome and would most likely be higher
>> overhead as well.
> 
> Comparing to reading a procfs file or doing an io_uring
> register syscall? I doubt that. It's also not everyone
> would be using that.

What's the proposed integration to make NEED_WAKEUP sampling work? As
far as I can tell, you'd need to either do that kind of accounting every
time you do io_uring_submit(), or make it conditional which would then
at least still have a branch.

The kernel side would obviously not be free either, but at least it
would be restricted to the SQPOLL side of things and not need to get
entangled with the general IO path that doesn't use SQPOLL.

If we put it in there and have some way to enable/query/disable, then it
least it would just be a branch or two in there rather than in the
generic path.

>>> What's the use case in particular? Considering that
>>> one of the previous revisions was uapi-less, something
>>> is really fishy here. Again, it's a procfs file nobody
>>> but a few would want to parse to use the feature.
>>
>> I brought this up earlier too, fdinfo is not a great API. For anything,
>> really.
> 
> I saw that comment, that's why I mentioned, but the
> point is that I have doubts the author is even using
> the uapi.

Not sure I follow... If they aren't using the API, what's the point of
the patch? Or are you questioning whether this is being done for an
actual use case, or just as a "why not, might be handy" kind of thing?
Pavel Begunkov Dec. 30, 2023, 11:17 p.m. UTC | #6
On 12/30/23 22:17, Jens Axboe wrote:
> On 12/30/23 2:06 PM, Pavel Begunkov wrote:
>> On 12/30/23 17:41, Jens Axboe wrote:
>>> On 12/30/23 9:27 AM, Pavel Begunkov wrote:
>>>> On 12/26/23 16:32, Jens Axboe wrote:
>>>>>
>>>>> On Mon, 25 Dec 2023 13:44:38 +0800, Xiaobing Li wrote:
>>>>>> Count the running time and actual IO processing time of the sqpoll
>>>>>> thread, and output the statistical data to fdinfo.
>>>>>>
>>>>>> Variable description:
>>>>>> "work_time" in the code represents the sum of the jiffies of the sq
>>>>>> thread actually processing IO, that is, how many milliseconds it
>>>>>> actually takes to process IO. "total_time" represents the total time
>>>>>> that the sq thread has elapsed from the beginning of the loop to the
>>>>>> current time point, that is, how many milliseconds it has spent in
>>>>>> total.
>>>>>>
>>>>>> [...]
>>>>>
>>>>> Applied, thanks!
>>>>>
>>>>> [1/1] io_uring: Statistics of the true utilization of sq threads.
>>>>>          commit: 9f7e5872eca81d7341e3ec222ebdc202ff536655
>>>>
>>>> I don't believe the patch is near complete, there are still
>>>> pending question that the author ignored (see replies to
>>>> prev revisions).
>>>
>>> We can drop and defer, that's not an issue. It's still sitting top of
>>> branch.
>>>
>>> Can you elaborate on the pending questions?
>>
>> I guess that wasn't clear, but I duplicated all of them in the
>> email you're replying to for convenience
>>
>>>> Why it uses jiffies instead of some task run time?
>>>> Consequently, why it's fine to account irq time and other
>>>> preemption? (hint, it's not)
>>>
>>> Yeah that's a good point, might be better to use task run time. Jiffies
>>> is also an annoying metric to expose, as you'd need to then get the tick
>>> rate as well. Though I suspect the ratio is the interesting bit here.
>>
>> I agree that seconds are nicer, but that's not my point. That's
>> not about jiffies, but that the patch keeps counting regardless
>> whether the SQ task was actually running, or the CPU was serving
>> irq, or even if it was force descheduled.
> 
> Right, guess I wasn't clear, I did very much agree with using task run
> time to avoid cases like that where it's perceived running, but really
> isn't. For example.
> 
>> I even outlined what a solution may look like, i.e. replace jiffies
>> with task runtime, which should already be counted in the task.
> 
> Would be a good change to make. And to be fair, I guess they originally
> wanted something like that, as the very first patch had some scheduler
> interactions. Just wasn't done quite right.

Right, just like what the v1 was doing but without touching
core/sched.

>>>> Why it can't be done with userspace and/or bpf? Why
>>>> can't it be estimated by checking and tracking
>>>> IORING_SQ_NEED_WAKEUP in userspace?
>>>
>>> Asking people to integrate bpf for this is a bit silly imho. Tracking
>>
>> I haven't seen any mention of the real use case, did I miss it?
>> Because otherwise I fail to see how it can possibly be called
>> silly when it's not clear how exactly it's used.
>>
>> Maybe it's a bash program printing stats to a curious user? Or
>> maybe it's to track once at start, and then nobody cares about
>> it, in which case NEED_WAKEUP would be justified.
>>
>> I can guess it's for adjusting the sq timeouts, but who knows.
> 
> I only know what is in those threads, but the most obvious use case
> would indeed be to vet the efficiency of the chosen timeout value and
> balance cpu usage with latency like that.
> 
>>> NEED_WAKEUP is also quite cumbersome and would most likely be higher
>>> overhead as well.
>>
>> Comparing to reading a procfs file or doing an io_uring
>> register syscall? I doubt that. It's also not everyone
>> would be using that.
> 
> What's the proposed integration to make NEED_WAKEUP sampling work? As
> far as I can tell, you'd need to either do that kind of accounting every
> time you do io_uring_submit(), or make it conditional which would then
> at least still have a branch.
> 
> The kernel side would obviously not be free either, but at least it
> would be restricted to the SQPOLL side of things and not need to get
> entangled with the general IO path that doesn't use SQPOLL.
> 
> If we put it in there and have some way to enable/query/disable, then it
> least it would just be a branch or two in there rather than in the
> generic path.

It can be in the app without ever touching liburing/kernel. E.g. you
can binary search the idle time, minimising it, while looking that
sqpoll doesn't go to sleep too often topping with other conditions.
Another stat you can get right away is to compare the current idle
time with the total time since last wake up (which should be idle +
extra).

It might be enough for _some_ apps, but due to the fog of war
I assume there are currently 0 apps supposed that are supposed
to use it anyway.

>>>> What's the use case in particular? Considering that
>>>> one of the previous revisions was uapi-less, something
>>>> is really fishy here. Again, it's a procfs file nobody
>>>> but a few would want to parse to use the feature.
>>>
>>> I brought this up earlier too, fdinfo is not a great API. For anything,
>>> really.
>>
>> I saw that comment, that's why I mentioned, but the
>> point is that I have doubts the author is even using
>> the uapi.
> 
> Not sure I follow... If they aren't using the API, what's the point of
> the patch? Or are you questioning whether this is being done for an
> actual use case, or just as a "why not, might be handy" kind of thing?
I assume there is a use case, which hasn't been spelled out
though AFAIK, but as a matter of fact earlier patches had no uapi.

https://lore.kernel.org/all/20231106074055.1248629-1-xiaobing.li@samsung.com/

and later patches use a suspiciously inconvenient interface,
i.e. /proc. It's a good question how it was supposed to be used
(and tested), but I have only guesses. Private patches? A custom
module? Or maybe a genuine mistake, which is why I'm still very
curious hearing about the application.
Jens Axboe Dec. 30, 2023, 11:24 p.m. UTC | #7
On 12/30/23 4:17 PM, Pavel Begunkov wrote:
> On 12/30/23 22:17, Jens Axboe wrote:
>> On 12/30/23 2:06 PM, Pavel Begunkov wrote:
>>> On 12/30/23 17:41, Jens Axboe wrote:
>>>> On 12/30/23 9:27 AM, Pavel Begunkov wrote:
>>>>> On 12/26/23 16:32, Jens Axboe wrote:
>>>>>>
>>>>>> On Mon, 25 Dec 2023 13:44:38 +0800, Xiaobing Li wrote:
>>>>>>> Count the running time and actual IO processing time of the sqpoll
>>>>>>> thread, and output the statistical data to fdinfo.
>>>>>>>
>>>>>>> Variable description:
>>>>>>> "work_time" in the code represents the sum of the jiffies of the sq
>>>>>>> thread actually processing IO, that is, how many milliseconds it
>>>>>>> actually takes to process IO. "total_time" represents the total time
>>>>>>> that the sq thread has elapsed from the beginning of the loop to the
>>>>>>> current time point, that is, how many milliseconds it has spent in
>>>>>>> total.
>>>>>>>
>>>>>>> [...]
>>>>>>
>>>>>> Applied, thanks!
>>>>>>
>>>>>> [1/1] io_uring: Statistics of the true utilization of sq threads.
>>>>>>          commit: 9f7e5872eca81d7341e3ec222ebdc202ff536655
>>>>>
>>>>> I don't believe the patch is near complete, there are still
>>>>> pending question that the author ignored (see replies to
>>>>> prev revisions).
>>>>
>>>> We can drop and defer, that's not an issue. It's still sitting top of
>>>> branch.
>>>>
>>>> Can you elaborate on the pending questions?
>>>
>>> I guess that wasn't clear, but I duplicated all of them in the
>>> email you're replying to for convenience
>>>
>>>>> Why it uses jiffies instead of some task run time?
>>>>> Consequently, why it's fine to account irq time and other
>>>>> preemption? (hint, it's not)
>>>>
>>>> Yeah that's a good point, might be better to use task run time. Jiffies
>>>> is also an annoying metric to expose, as you'd need to then get the tick
>>>> rate as well. Though I suspect the ratio is the interesting bit here.
>>>
>>> I agree that seconds are nicer, but that's not my point. That's
>>> not about jiffies, but that the patch keeps counting regardless
>>> whether the SQ task was actually running, or the CPU was serving
>>> irq, or even if it was force descheduled.
>>
>> Right, guess I wasn't clear, I did very much agree with using task run
>> time to avoid cases like that where it's perceived running, but really
>> isn't. For example.
>>
>>> I even outlined what a solution may look like, i.e. replace jiffies
>>> with task runtime, which should already be counted in the task.
>>
>> Would be a good change to make. And to be fair, I guess they originally
>> wanted something like that, as the very first patch had some scheduler
>> interactions. Just wasn't done quite right.
> 
> Right, just like what the v1 was doing but without touching
> core/sched.

Yep

>>>>> Why it can't be done with userspace and/or bpf? Why
>>>>> can't it be estimated by checking and tracking
>>>>> IORING_SQ_NEED_WAKEUP in userspace?
>>>>
>>>> Asking people to integrate bpf for this is a bit silly imho. Tracking
>>>
>>> I haven't seen any mention of the real use case, did I miss it?
>>> Because otherwise I fail to see how it can possibly be called
>>> silly when it's not clear how exactly it's used.
>>>
>>> Maybe it's a bash program printing stats to a curious user? Or
>>> maybe it's to track once at start, and then nobody cares about
>>> it, in which case NEED_WAKEUP would be justified.
>>>
>>> I can guess it's for adjusting the sq timeouts, but who knows.
>>
>> I only know what is in those threads, but the most obvious use case
>> would indeed be to vet the efficiency of the chosen timeout value and
>> balance cpu usage with latency like that.
>>
>>>> NEED_WAKEUP is also quite cumbersome and would most likely be higher
>>>> overhead as well.
>>>
>>> Comparing to reading a procfs file or doing an io_uring
>>> register syscall? I doubt that. It's also not everyone
>>> would be using that.
>>
>> What's the proposed integration to make NEED_WAKEUP sampling work? As
>> far as I can tell, you'd need to either do that kind of accounting every
>> time you do io_uring_submit(), or make it conditional which would then
>> at least still have a branch.
>>
>> The kernel side would obviously not be free either, but at least it
>> would be restricted to the SQPOLL side of things and not need to get
>> entangled with the general IO path that doesn't use SQPOLL.
>>
>> If we put it in there and have some way to enable/query/disable, then it
>> least it would just be a branch or two in there rather than in the
>> generic path.
> 
> It can be in the app without ever touching liburing/kernel. E.g. you
> can binary search the idle time, minimising it, while looking that
> sqpoll doesn't go to sleep too often topping with other conditions.
> Another stat you can get right away is to compare the current idle
> time with the total time since last wake up (which should be idle +
> extra).

Sure, it's more a question of if you want a simple to query metric, or
build something a lot more involved to do it. Guess it depends on the
use case. I think exposing a metric makes sense, as it'll provide an
easy way for an app to gauge efficiency of it and whether it'd make
sense to tweak settings. But I guess then what comes next is a way to
modify the timeout setting at runtime...

>>>>> What's the use case in particular? Considering that
>>>>> one of the previous revisions was uapi-less, something
>>>>> is really fishy here. Again, it's a procfs file nobody
>>>>> but a few would want to parse to use the feature.
>>>>
>>>> I brought this up earlier too, fdinfo is not a great API. For anything,
>>>> really.
>>>
>>> I saw that comment, that's why I mentioned, but the
>>> point is that I have doubts the author is even using
>>> the uapi.
>>
>> Not sure I follow... If they aren't using the API, what's the point of
>> the patch? Or are you questioning whether this is being done for an
>> actual use case, or just as a "why not, might be handy" kind of thing?
> I assume there is a use case, which hasn't been spelled out
> though AFAIK, but as a matter of fact earlier patches had no uapi.
> 
> https://lore.kernel.org/all/20231106074055.1248629-1-xiaobing.li@samsung.com/

Might be because they already have some patches exposing these metrics.
But ->

> and later patches use a suspiciously inconvenient interface,
> i.e. /proc. It's a good question how it was supposed to be used
> (and tested), but I have only guesses. Private patches? A custom
> module? Or maybe a genuine mistake, which is why I'm still very
> curious hearing about the application.

Agree, Xiaobing, can you expand on the actual use case here? How are you
intending to use this metric?
Xiaobing Li Jan. 3, 2024, 5:49 a.m. UTC | #8
On 12/30/23 9:27 AM, Pavel Begunkov wrote:
> Why it uses jiffies instead of some task run time?
> Consequently, why it's fine to account irq time and other
> preemption? (hint, it's not)
> 
> Why it can't be done with userspace and/or bpf? Why
> can't it be estimated by checking and tracking
> IORING_SQ_NEED_WAKEUP in userspace?
> 
> What's the use case in particular? Considering that
> one of the previous revisions was uapi-less, something
> is really fishy here. Again, it's a procfs file nobody
> but a few would want to parse to use the feature.
> 
> Why it just keeps aggregating stats for the whole
> life time of the ring? If the workload changes,
> that would either totally screw the stats or would make
> it too inert to be useful. That's especially relevant
> for long running (days) processes. There should be a
> way to reset it so it starts counting anew.

Hi, Jens and Pavel,
I carefully read the questions you raised.
First of all, as to why I use jiffies to statistics time, it
is because I have done some performance tests and found that
using jiffies has a relatively smaller loss of performance 
than using task run time. Of course, using task run time is 
indeed more accurate.  But in fact, our requirements for 
accuracy are not particularly high, so after comprehensive 
consideration, we finally chose to use jiffies.
Of course, if you think that a little more performance loss 
here has no impact, I can use task run time instead, but in 
this case, does the way of calculating sqpoll thread timeout
also need to be changed, because it is also calculated through
jiffies.
Then there’s how to use this metric.
We are studying some optimization methods for io-uring, including
performance and CPU utilization, but we found that there is
currently no tool that can observe the CPU ratio of sqthread's 
actual processing IO part, so we want to merge this method  that
can observe this value so that we can more easily observe the 
optimization effects.

Best regards,
--
Xiaobing Li
Pavel Begunkov Jan. 5, 2024, 4:02 a.m. UTC | #9
On 1/3/24 05:49, Xiaobing Li wrote:
> On 12/30/23 9:27 AM, Pavel Begunkov wrote:
>> Why it uses jiffies instead of some task run time?
>> Consequently, why it's fine to account irq time and other
>> preemption? (hint, it's not)
>>
>> Why it can't be done with userspace and/or bpf? Why
>> can't it be estimated by checking and tracking
>> IORING_SQ_NEED_WAKEUP in userspace?
>>
>> What's the use case in particular? Considering that
>> one of the previous revisions was uapi-less, something
>> is really fishy here. Again, it's a procfs file nobody
>> but a few would want to parse to use the feature.
>>
>> Why it just keeps aggregating stats for the whole
>> life time of the ring? If the workload changes,
>> that would either totally screw the stats or would make
>> it too inert to be useful. That's especially relevant
>> for long running (days) processes. There should be a
>> way to reset it so it starts counting anew.
> 
> Hi, Jens and Pavel,
> I carefully read the questions you raised.
> First of all, as to why I use jiffies to statistics time, it
> is because I have done some performance tests and found that
> using jiffies has a relatively smaller loss of performance
> than using task run time. Of course, using task run time is

How does taking a measure for task runtime looks like? I expect it to
be a simple read of a variable inside task_struct, maybe with READ_ONCE,
in which case the overhead shouldn't be realistically measurable. Does
it need locking?

> indeed more accurate.  But in fact, our requirements for
> accuracy are not particularly high, so after comprehensive

I'm looking at it as a generic feature for everyone, and the
accuracy behaviour is dependent on circumstances. High load
networking spends quite a good share of CPU in softirq, and
preemption would be dependent on config, scheduling, pinning,
etc.

> consideration, we finally chose to use jiffies.
> Of course, if you think that a little more performance loss
> here has no impact, I can use task run time instead, but in
> this case, does the way of calculating sqpoll thread timeout
> also need to be changed, because it is also calculated through
> jiffies.

That's a good point. It doesn't have to change unless you're
directly inferring the idle time parameter from those two
time values rather than using the ratio. E.g. a simple
bisection of the idle time based on the utilisation metric
shouldn't change. But that definitely raises the question
what idle_time parameter should exactly mean, and what is
more convenient for algorithms.


> Then there’s how to use this metric.
> We are studying some optimization methods for io-uring, including
> performance and CPU utilization, but we found that there is
> currently no tool that can observe the CPU ratio of sqthread's
> actual processing IO part, so we want to merge this method  that
> can observe this value so that we can more easily observe the
> optimization effects.
Xiaobing Li Jan. 10, 2024, 9:05 a.m. UTC | #10
On 1/5/24 04:02 AM, Pavel Begunkov wrote:
>On 1/3/24 05:49, Xiaobing Li wrote:
>> On 12/30/23 9:27 AM, Pavel Begunkov wrote:
>>> Why it uses jiffies instead of some task run time?
>>> Consequently, why it's fine to account irq time and other
>>> preemption? (hint, it's not)
>>>
>>> Why it can't be done with userspace and/or bpf? Why
>>> can't it be estimated by checking and tracking
>>> IORING_SQ_NEED_WAKEUP in userspace?
>>>
>>> What's the use case in particular? Considering that
>>> one of the previous revisions was uapi-less, something
>>> is really fishy here. Again, it's a procfs file nobody
>>> but a few would want to parse to use the feature.
>>>
>>> Why it just keeps aggregating stats for the whole
>>> life time of the ring? If the workload changes,
>>> that would either totally screw the stats or would make
>>> it too inert to be useful. That's especially relevant
>>> for long running (days) processes. There should be a
>>> way to reset it so it starts counting anew.
>> 
>> Hi, Jens and Pavel,
>> I carefully read the questions you raised.
>> First of all, as to why I use jiffies to statistics time, it
>> is because I have done some performance tests and found that
>> using jiffies has a relatively smaller loss of performance
>> than using task run time. Of course, using task run time is
>
>How does taking a measure for task runtime looks like? I expect it to
>be a simple read of a variable inside task_struct, maybe with READ_ONCE,
>in which case the overhead shouldn't be realistically measurable. Does
>it need locking?

The task runtime I am talking about is similar to this:
start = get_system_time(current);
do_io_part();
sq->total_time += get_system_time(current) - start;

Currently, it is not possible to obtain the execution time of a piece of 
code by a simple read of a variable inside task_struct. 
Or do you have any good ideas?

>> indeed more accurate.  But in fact, our requirements for
>> accuracy are not particularly high, so after comprehensive
>
>I'm looking at it as a generic feature for everyone, and the
>accuracy behaviour is dependent on circumstances. High load
>networking spends quite a good share of CPU in softirq, and
>preemption would be dependent on config, scheduling, pinning,
>etc.

Yes, I quite agree that the accuracy behaviour is dependent on circumstances.
In fact, judging from some test results we have done, the current solution 
can basically meet everyone's requirements, and the error in the calculation 
result of utilization is estimated to be within 0.5%.


>> consideration, we finally chose to use jiffies.
>> Of course, if you think that a little more performance loss
>> here has no impact, I can use task run time instead, but in
>> this case, does the way of calculating sqpoll thread timeout
>> also need to be changed, because it is also calculated through
>> jiffies.
>
>That's a good point. It doesn't have to change unless you're
>directly inferring the idle time parameter from those two
>time values rather than using the ratio. E.g. a simple
>bisection of the idle time based on the utilisation metric
>shouldn't change. But that definitely raises the question
>what idle_time parameter should exactly mean, and what is
>more convenient for algorithms.

We think that idle_time represents the time spent by the sqpoll thread 
except for submitting IO.

In a ring, it may take time M to submit IO, or it may not submit IO in the 
entire cycle. Then we can optimize the efficiency of the sqpoll thread in 
two directions. The first is to reduce the number of rings that no IO submit,
The second is to increase the time M to increase the proportion of time 
submitted IO in the ring.
In order to observe the CPU ratio of sqthread's actual processing IO part, 
we need this patch.
Jens Axboe Jan. 10, 2024, 4:15 p.m. UTC | #11
On 1/10/24 2:05 AM, Xiaobing Li wrote:
> On 1/5/24 04:02 AM, Pavel Begunkov wrote:
>> On 1/3/24 05:49, Xiaobing Li wrote:
>>> On 12/30/23 9:27 AM, Pavel Begunkov wrote:
>>>> Why it uses jiffies instead of some task run time?
>>>> Consequently, why it's fine to account irq time and other
>>>> preemption? (hint, it's not)
>>>>
>>>> Why it can't be done with userspace and/or bpf? Why
>>>> can't it be estimated by checking and tracking
>>>> IORING_SQ_NEED_WAKEUP in userspace?
>>>>
>>>> What's the use case in particular? Considering that
>>>> one of the previous revisions was uapi-less, something
>>>> is really fishy here. Again, it's a procfs file nobody
>>>> but a few would want to parse to use the feature.
>>>>
>>>> Why it just keeps aggregating stats for the whole
>>>> life time of the ring? If the workload changes,
>>>> that would either totally screw the stats or would make
>>>> it too inert to be useful. That's especially relevant
>>>> for long running (days) processes. There should be a
>>>> way to reset it so it starts counting anew.
>>>
>>> Hi, Jens and Pavel,
>>> I carefully read the questions you raised.
>>> First of all, as to why I use jiffies to statistics time, it
>>> is because I have done some performance tests and found that
>>> using jiffies has a relatively smaller loss of performance
>>> than using task run time. Of course, using task run time is
>>
>> How does taking a measure for task runtime looks like? I expect it to
>> be a simple read of a variable inside task_struct, maybe with READ_ONCE,
>> in which case the overhead shouldn't be realistically measurable. Does
>> it need locking?
> 
> The task runtime I am talking about is similar to this:
> start = get_system_time(current);
> do_io_part();
> sq->total_time += get_system_time(current) - start;

Not sure what get_system_time() is, don't see that anywhere.

> Currently, it is not possible to obtain the execution time of a piece of 
> code by a simple read of a variable inside task_struct. 
> Or do you have any good ideas?

I must be missing something, because it seems like all you need is to
read task->stime? You could possible even make do with just logging busy
loop time, as getrusage(RUSAGE_THREAD, &stat) from userspace would then
give you the total time.

stat.ru_stime would then be the total time, the thread ran, and
1 - (above_busy_stime / stat.ru_stime) would give you the time the
percentage of time the thread ran and did useful work (eg not busy
looping.

Hmm?
Pavel Begunkov Jan. 11, 2024, 1:12 p.m. UTC | #12
On 1/10/24 09:05, Xiaobing Li wrote:
> On 1/5/24 04:02 AM, Pavel Begunkov wrote:
>> On 1/3/24 05:49, Xiaobing Li wrote:
>>> On 12/30/23 9:27 AM, Pavel Begunkov wrote:
>>>> Why it uses jiffies instead of some task run time?
>>>> Consequently, why it's fine to account irq time and other
>>>> preemption? (hint, it's not)
>>>>
>>>> Why it can't be done with userspace and/or bpf? Why
>>>> can't it be estimated by checking and tracking
>>>> IORING_SQ_NEED_WAKEUP in userspace?
>>>>
>>>> What's the use case in particular? Considering that
>>>> one of the previous revisions was uapi-less, something
>>>> is really fishy here. Again, it's a procfs file nobody
>>>> but a few would want to parse to use the feature.
>>>>
>>>> Why it just keeps aggregating stats for the whole
>>>> life time of the ring? If the workload changes,
>>>> that would either totally screw the stats or would make
>>>> it too inert to be useful. That's especially relevant
>>>> for long running (days) processes. There should be a
>>>> way to reset it so it starts counting anew.
>>>
>>> Hi, Jens and Pavel,
>>> I carefully read the questions you raised.
>>> First of all, as to why I use jiffies to statistics time, it
>>> is because I have done some performance tests and found that
>>> using jiffies has a relatively smaller loss of performance
>>> than using task run time. Of course, using task run time is
>>
>> How does taking a measure for task runtime looks like? I expect it to
>> be a simple read of a variable inside task_struct, maybe with READ_ONCE,
>> in which case the overhead shouldn't be realistically measurable. Does
>> it need locking?
> 
> The task runtime I am talking about is similar to this:
> start = get_system_time(current);
> do_io_part();
> sq->total_time += get_system_time(current) - start;
> 
> Currently, it is not possible to obtain the execution time of a piece of
> code by a simple read of a variable inside task_struct.
> Or do you have any good ideas?

Jens answered it well

>>> indeed more accurate.  But in fact, our requirements for
>>> accuracy are not particularly high, so after comprehensive
>>
>> I'm looking at it as a generic feature for everyone, and the
>> accuracy behaviour is dependent on circumstances. High load
>> networking spends quite a good share of CPU in softirq, and
>> preemption would be dependent on config, scheduling, pinning,
>> etc.
> 
> Yes, I quite agree that the accuracy behaviour is dependent on circumstances.
> In fact, judging from some test results we have done, the current solution
> can basically meet everyone's requirements, and the error in the calculation
> result of utilization is estimated to be within 0.5%.

Which sounds more than fine, but there are cases where irqs are
eating up 10s of percents of CPU, which is likely to be more
troublesome.

>>> consideration, we finally chose to use jiffies.
>>> Of course, if you think that a little more performance loss
>>> here has no impact, I can use task run time instead, but in
>>> this case, does the way of calculating sqpoll thread timeout
>>> also need to be changed, because it is also calculated through
>>> jiffies.
>>
>> That's a good point. It doesn't have to change unless you're
>> directly inferring the idle time parameter from those two
>> time values rather than using the ratio. E.g. a simple
>> bisection of the idle time based on the utilisation metric
>> shouldn't change. But that definitely raises the question
>> what idle_time parameter should exactly mean, and what is
>> more convenient for algorithms.
> 
> We think that idle_time represents the time spent by the sqpoll thread
> except for submitting IO.

I mean the idle_time parameter, i.e.
struct io_uring_params :: sq_thread_idle, which is how long an SQPOLL
thread should be continuously starved of any work to go to sleep.

For example:
sq_thread_idle = 10ms

   -> 9ms starving -> (do work) -> ...
   -> 9ms starving -> (do work) -> ...
   -> 11ms starving -> (more than idle, go sleep)

And the question was whether to count those delays in wall clock
time, as it currently is, and which is likely to be more natural
for userspace, or otherwise theoretically it could be task local time.
  

> In a ring, it may take time M to submit IO, or it may not submit IO in the
> entire cycle. Then we can optimize the efficiency of the sqpoll thread in
> two directions. The first is to reduce the number of rings that no IO submit,
> The second is to increase the time M to increase the proportion of time
> submitted IO in the ring.
> In order to observe the CPU ratio of sqthread's actual processing IO part,
> we need this patch.
Xiaobing Li Jan. 12, 2024, 1:12 a.m. UTC | #13
On 1/10/24 16:15 AM, Jens Axboe wrote:
>On 1/10/24 2:05 AM, Xiaobing Li wrote:
>> On 1/5/24 04:02 AM, Pavel Begunkov wrote:
>>> On 1/3/24 05:49, Xiaobing Li wrote:
>>>> On 12/30/23 9:27 AM, Pavel Begunkov wrote:
>>>>> Why it uses jiffies instead of some task run time?
>>>>> Consequently, why it's fine to account irq time and other
>>>>> preemption? (hint, it's not)
>>>>>
>>>>> Why it can't be done with userspace and/or bpf? Why
>>>>> can't it be estimated by checking and tracking
>>>>> IORING_SQ_NEED_WAKEUP in userspace?
>>>>>
>>>>> What's the use case in particular? Considering that
>>>>> one of the previous revisions was uapi-less, something
>>>>> is really fishy here. Again, it's a procfs file nobody
>>>>> but a few would want to parse to use the feature.
>>>>>
>>>>> Why it just keeps aggregating stats for the whole
>>>>> life time of the ring? If the workload changes,
>>>>> that would either totally screw the stats or would make
>>>>> it too inert to be useful. That's especially relevant
>>>>> for long running (days) processes. There should be a
>>>>> way to reset it so it starts counting anew.
>>>>
>>>> Hi, Jens and Pavel,
>>>> I carefully read the questions you raised.
>>>> First of all, as to why I use jiffies to statistics time, it
>>>> is because I have done some performance tests and found that
>>>> using jiffies has a relatively smaller loss of performance
>>>> than using task run time. Of course, using task run time is
>>>
>>> How does taking a measure for task runtime looks like? I expect it to
>>> be a simple read of a variable inside task_struct, maybe with READ_ONCE,
>>> in which case the overhead shouldn't be realistically measurable. Does
>>> it need locking?
>> 
>> The task runtime I am talking about is similar to this:
>> start = get_system_time(current);
>> do_io_part();
>> sq->total_time += get_system_time(current) - start;
>
>Not sure what get_system_time() is, don't see that anywhere.
>
>> Currently, it is not possible to obtain the execution time of a piece of 
>> code by a simple read of a variable inside task_struct. 
>> Or do you have any good ideas?
>
>I must be missing something, because it seems like all you need is to
>read task->stime? You could possible even make do with just logging busy
>loop time, as getrusage(RUSAGE_THREAD, &stat) from userspace would then
>give you the total time.
>
>stat.ru_stime would then be the total time, the thread ran, and
>1 - (above_busy_stime / stat.ru_stime) would give you the time the
>percentage of time the thread ran and did useful work (eg not busy
>looping.

getrusage can indeed get the total time of the thread, but this introduces an 
extra function call, which is relatively more complicated than defining a variable.
In fact, recording the total time of the loop and the time of processing the 
IO part can achieve our observation purpose. Recording only two variables will 
have less impact on the existing performance, so why not  choose a simpler 
and effective method.

--
Xiaobing Li
Jens Axboe Jan. 12, 2024, 2:58 a.m. UTC | #14
On 1/11/24 6:12 PM, Xiaobing Li wrote:
> On 1/10/24 16:15 AM, Jens Axboe wrote:
>> On 1/10/24 2:05 AM, Xiaobing Li wrote:
>>> On 1/5/24 04:02 AM, Pavel Begunkov wrote:
>>>> On 1/3/24 05:49, Xiaobing Li wrote:
>>>>> On 12/30/23 9:27 AM, Pavel Begunkov wrote:
>>>>>> Why it uses jiffies instead of some task run time?
>>>>>> Consequently, why it's fine to account irq time and other
>>>>>> preemption? (hint, it's not)
>>>>>>
>>>>>> Why it can't be done with userspace and/or bpf? Why
>>>>>> can't it be estimated by checking and tracking
>>>>>> IORING_SQ_NEED_WAKEUP in userspace?
>>>>>>
>>>>>> What's the use case in particular? Considering that
>>>>>> one of the previous revisions was uapi-less, something
>>>>>> is really fishy here. Again, it's a procfs file nobody
>>>>>> but a few would want to parse to use the feature.
>>>>>>
>>>>>> Why it just keeps aggregating stats for the whole
>>>>>> life time of the ring? If the workload changes,
>>>>>> that would either totally screw the stats or would make
>>>>>> it too inert to be useful. That's especially relevant
>>>>>> for long running (days) processes. There should be a
>>>>>> way to reset it so it starts counting anew.
>>>>>
>>>>> Hi, Jens and Pavel,
>>>>> I carefully read the questions you raised.
>>>>> First of all, as to why I use jiffies to statistics time, it
>>>>> is because I have done some performance tests and found that
>>>>> using jiffies has a relatively smaller loss of performance
>>>>> than using task run time. Of course, using task run time is
>>>>
>>>> How does taking a measure for task runtime looks like? I expect it to
>>>> be a simple read of a variable inside task_struct, maybe with READ_ONCE,
>>>> in which case the overhead shouldn't be realistically measurable. Does
>>>> it need locking?
>>>
>>> The task runtime I am talking about is similar to this:
>>> start = get_system_time(current);
>>> do_io_part();
>>> sq->total_time += get_system_time(current) - start;
>>
>> Not sure what get_system_time() is, don't see that anywhere.
>>
>>> Currently, it is not possible to obtain the execution time of a piece of 
>>> code by a simple read of a variable inside task_struct. 
>>> Or do you have any good ideas?
>>
>> I must be missing something, because it seems like all you need is to
>> read task->stime? You could possible even make do with just logging busy
>> loop time, as getrusage(RUSAGE_THREAD, &stat) from userspace would then
>> give you the total time.
>>
>> stat.ru_stime would then be the total time, the thread ran, and
>> 1 - (above_busy_stime / stat.ru_stime) would give you the time the
>> percentage of time the thread ran and did useful work (eg not busy
>> looping.
> 
> getrusage can indeed get the total time of the thread, but this
> introduces an extra function call, which is relatively more
> complicated than defining a variable. In fact, recording the total
> time of the loop and the time of processing the IO part can achieve
> our observation purpose. Recording only two variables will have less
> impact on the existing performance, so why not  choose a simpler and
> effective method.

I'm not opposed to exposing both of them, it does make the API simpler.
If we can call it an API... I think the main point was using task->stime
for it rather than jiffies etc.
Xiaobing Li Jan. 17, 2024, 8:37 a.m. UTC | #15
On 1/12/24 2:58 AM, Jens Axboe wrote:
>On 1/11/24 6:12 PM, Xiaobing Li wrote:
>> On 1/10/24 16:15 AM, Jens Axboe wrote:
>>> On 1/10/24 2:05 AM, Xiaobing Li wrote:
>>>> On 1/5/24 04:02 AM, Pavel Begunkov wrote:
>>>>> On 1/3/24 05:49, Xiaobing Li wrote:
>>>>>> On 12/30/23 9:27 AM, Pavel Begunkov wrote:
>>>>>>> Why it uses jiffies instead of some task run time?
>>>>>>> Consequently, why it's fine to account irq time and other
>>>>>>> preemption? (hint, it's not)
>>>>>>>
>>>>>>> Why it can't be done with userspace and/or bpf? Why
>>>>>>> can't it be estimated by checking and tracking
>>>>>>> IORING_SQ_NEED_WAKEUP in userspace?
>>>>>>>
>>>>>>> What's the use case in particular? Considering that
>>>>>>> one of the previous revisions was uapi-less, something
>>>>>>> is really fishy here. Again, it's a procfs file nobody
>>>>>>> but a few would want to parse to use the feature.
>>>>>>>
>>>>>>> Why it just keeps aggregating stats for the whole
>>>>>>> life time of the ring? If the workload changes,
>>>>>>> that would either totally screw the stats or would make
>>>>>>> it too inert to be useful. That's especially relevant
>>>>>>> for long running (days) processes. There should be a
>>>>>>> way to reset it so it starts counting anew.
>>>>>>
>>>>>> Hi, Jens and Pavel,
>>>>>> I carefully read the questions you raised.
>>>>>> First of all, as to why I use jiffies to statistics time, it
>>>>>> is because I have done some performance tests and found that
>>>>>> using jiffies has a relatively smaller loss of performance
>>>>>> than using task run time. Of course, using task run time is
>>>>>
>>>>> How does taking a measure for task runtime looks like? I expect it to
>>>>> be a simple read of a variable inside task_struct, maybe with READ_ONCE,
>>>>> in which case the overhead shouldn't be realistically measurable. Does
>>>>> it need locking?
>>>>
>>>> The task runtime I am talking about is similar to this:
>>>> start = get_system_time(current);
>>>> do_io_part();
>>>> sq->total_time += get_system_time(current) - start;
>>>
>>> Not sure what get_system_time() is, don't see that anywhere.
>>>
>>>> Currently, it is not possible to obtain the execution time of a piece of 
>>>> code by a simple read of a variable inside task_struct. 
>>>> Or do you have any good ideas?
>>>
>>> I must be missing something, because it seems like all you need is to
>>> read task->stime? You could possible even make do with just logging busy
>>> loop time, as getrusage(RUSAGE_THREAD, &stat) from userspace would then
>>> give you the total time.
>>>
>>> stat.ru_stime would then be the total time, the thread ran, and
>>> 1 - (above_busy_stime / stat.ru_stime) would give you the time the
>>> percentage of time the thread ran and did useful work (eg not busy
>>> looping.
>> 
>> getrusage can indeed get the total time of the thread, but this
>> introduces an extra function call, which is relatively more
>> complicated than defining a variable. In fact, recording the total
>> time of the loop and the time of processing the IO part can achieve
>> our observation purpose. Recording only two variables will have less
>> impact on the existing performance, so why not  choose a simpler and
>> effective method.
>
>I'm not opposed to exposing both of them, it does make the API simpler.
>If we can call it an API... I think the main point was using task->stime
>for it rather than jiffies etc.

Hi, Jens and Pavel
I modified the code according to your opinions.

I got the total time of the sqpoll thread through getrusage. 
eg:

fdinfo.c:
+long sq_total_time = 0;
+long sq_work_time = 0;
if (has_lock && (ctx->flags & IORING_SETUP_SQPOLL)) {
	struct io_sq_data *sq = ctx->sq_data;

	sq_pid = sq->task_pid;
	sq_cpu = sq->sq_cpu;
+	struct rusage r;
+	getrusage(sq->thread, RUSAGE_SELF, &r);
+	sq_total_time = r.ru_stime.tv_sec * 1000000 + r.ru_stime.tv_usec;
+	sq_work_time = sq->work_time;
}

seq_printf(m, "SqThread:\t%d\n", sq_pid);
seq_printf(m, "SqThreadCpu:\t%d\n", sq_cpu);
+seq_printf(m, "SqTotalTime:\t%ldus\n", sq_total_time);
+seq_printf(m, "SqWorkTime:\t%ldus\n", sq_work_time);
seq_printf(m, "UserFiles:\t%u\n", ctx->nr_user_files);

The working time of the sqpoll thread is obtained through ktime_get().
eg:

sqpoll.c:
+ktime_t start, diff;
+start = ktime_get();
list_for_each_entry(ctx, &sqd->ctx_list, sqd_list) {
	int ret = __io_sq_thread(ctx, cap_entries);

	if (!sqt_spin && (ret > 0 || !wq_list_empty(&ctx->iopoll_list)))
		sqt_spin = true;
}
if (io_run_task_work())
	sqt_spin = true;

+diff = ktime_sub(ktime_get(), start);
+if (sqt_spin == true)
+	sqd->work_time += ktime_to_us(diff);

The test results are as follows:
Every 2.0s: cat /proc/9230/fdinfo/6 | grep -E Sq
SqMask: 0x3
SqHead: 3197153
SqTail: 3197153
CachedSqHead:   3197153
SqThread:       9231
SqThreadCpu:    11
SqTotalTime:    92215321us
SqWorkTime:     15106578us

Do you think this solution work?
Jens Axboe Jan. 17, 2024, 11:04 p.m. UTC | #16
On 1/17/24 1:37 AM, Xiaobing Li wrote:
> On 1/12/24 2:58 AM, Jens Axboe wrote:
>> On 1/11/24 6:12 PM, Xiaobing Li wrote:
>>> On 1/10/24 16:15 AM, Jens Axboe wrote:
>>>> On 1/10/24 2:05 AM, Xiaobing Li wrote:
>>>>> On 1/5/24 04:02 AM, Pavel Begunkov wrote:
>>>>>> On 1/3/24 05:49, Xiaobing Li wrote:
>>>>>>> On 12/30/23 9:27 AM, Pavel Begunkov wrote:
>>>>>>>> Why it uses jiffies instead of some task run time?
>>>>>>>> Consequently, why it's fine to account irq time and other
>>>>>>>> preemption? (hint, it's not)
>>>>>>>>
>>>>>>>> Why it can't be done with userspace and/or bpf? Why
>>>>>>>> can't it be estimated by checking and tracking
>>>>>>>> IORING_SQ_NEED_WAKEUP in userspace?
>>>>>>>>
>>>>>>>> What's the use case in particular? Considering that
>>>>>>>> one of the previous revisions was uapi-less, something
>>>>>>>> is really fishy here. Again, it's a procfs file nobody
>>>>>>>> but a few would want to parse to use the feature.
>>>>>>>>
>>>>>>>> Why it just keeps aggregating stats for the whole
>>>>>>>> life time of the ring? If the workload changes,
>>>>>>>> that would either totally screw the stats or would make
>>>>>>>> it too inert to be useful. That's especially relevant
>>>>>>>> for long running (days) processes. There should be a
>>>>>>>> way to reset it so it starts counting anew.
>>>>>>>
>>>>>>> Hi, Jens and Pavel,
>>>>>>> I carefully read the questions you raised.
>>>>>>> First of all, as to why I use jiffies to statistics time, it
>>>>>>> is because I have done some performance tests and found that
>>>>>>> using jiffies has a relatively smaller loss of performance
>>>>>>> than using task run time. Of course, using task run time is
>>>>>>
>>>>>> How does taking a measure for task runtime looks like? I expect it to
>>>>>> be a simple read of a variable inside task_struct, maybe with READ_ONCE,
>>>>>> in which case the overhead shouldn't be realistically measurable. Does
>>>>>> it need locking?
>>>>>
>>>>> The task runtime I am talking about is similar to this:
>>>>> start = get_system_time(current);
>>>>> do_io_part();
>>>>> sq->total_time += get_system_time(current) - start;
>>>>
>>>> Not sure what get_system_time() is, don't see that anywhere.
>>>>
>>>>> Currently, it is not possible to obtain the execution time of a piece of 
>>>>> code by a simple read of a variable inside task_struct. 
>>>>> Or do you have any good ideas?
>>>>
>>>> I must be missing something, because it seems like all you need is to
>>>> read task->stime? You could possible even make do with just logging busy
>>>> loop time, as getrusage(RUSAGE_THREAD, &stat) from userspace would then
>>>> give you the total time.
>>>>
>>>> stat.ru_stime would then be the total time, the thread ran, and
>>>> 1 - (above_busy_stime / stat.ru_stime) would give you the time the
>>>> percentage of time the thread ran and did useful work (eg not busy
>>>> looping.
>>>
>>> getrusage can indeed get the total time of the thread, but this
>>> introduces an extra function call, which is relatively more
>>> complicated than defining a variable. In fact, recording the total
>>> time of the loop and the time of processing the IO part can achieve
>>> our observation purpose. Recording only two variables will have less
>>> impact on the existing performance, so why not  choose a simpler and
>>> effective method.
>>
>> I'm not opposed to exposing both of them, it does make the API simpler.
>> If we can call it an API... I think the main point was using task->stime
>> for it rather than jiffies etc.
> 
> Hi, Jens and Pavel
> I modified the code according to your opinions.
> 
> I got the total time of the sqpoll thread through getrusage. 
> eg?
> 
> fdinfo.c:
> +long sq_total_time = 0;
> +long sq_work_time = 0;
> if (has_lock && (ctx->flags & IORING_SETUP_SQPOLL)) {
> 	struct io_sq_data *sq = ctx->sq_data;
> 
> 	sq_pid = sq->task_pid;
> 	sq_cpu = sq->sq_cpu;
> +	struct rusage r;
> +	getrusage(sq->thread, RUSAGE_SELF, &r);
> +	sq_total_time = r.ru_stime.tv_sec * 1000000 + r.ru_stime.tv_usec;
> +	sq_work_time = sq->work_time;
> }
> 
> seq_printf(m, "SqThread:\t%d\n", sq_pid);
> seq_printf(m, "SqThreadCpu:\t%d\n", sq_cpu);
> +seq_printf(m, "SqTotalTime:\t%ldus\n", sq_total_time);
> +seq_printf(m, "SqWorkTime:\t%ldus\n", sq_work_time);
> seq_printf(m, "UserFiles:\t%u\n", ctx->nr_user_files);
> 
> The working time of the sqpoll thread is obtained through ktime_get().
> eg?
> 
> sqpoll.c:
> +ktime_t start, diff;
> +start = ktime_get();
> list_for_each_entry(ctx, &sqd->ctx_list, sqd_list) {
> 	int ret = __io_sq_thread(ctx, cap_entries);
> 
> 	if (!sqt_spin && (ret > 0 || !wq_list_empty(&ctx->iopoll_list)))
> 		sqt_spin = true;
> }
> if (io_run_task_work())
> 	sqt_spin = true;
> 
> +diff = ktime_sub(ktime_get(), start);
> +if (sqt_spin == true)
> +	sqd->work_time += ktime_to_us(diff);
> 
> The test results are as follows:
> Every 2.0s: cat /proc/9230/fdinfo/6 | grep -E Sq
> SqMask: 0x3
> SqHead: 3197153
> SqTail: 3197153
> CachedSqHead:   3197153
> SqThread:       9231
> SqThreadCpu:    11
> SqTotalTime:    92215321us
> SqWorkTime:     15106578us
> 
> Do you think this solution work?

Possibly, can you send an actual patch? Would be easier to review that
way. Bonus points for crafting test cases that can help vet that it
calculates the right thing (eg test case that does 50% idle, 25% idle,
75% idle, that kind of thing).
Xiaobing Li Jan. 18, 2024, 2:25 a.m. UTC | #17
On 1/17/24 23:04, Jens Axboe wrote:
>Possibly, can you send an actual patch? Would be easier to review that
>way. Bonus points for crafting test cases that can help vet that it
>calculates the right thing (eg test case that does 50% idle, 25% idle,
>75% idle, that kind of thing).

ok, I'll send out a v7.
Pavel Begunkov Jan. 18, 2024, 2:56 a.m. UTC | #18
On 1/17/24 08:37, Xiaobing Li wrote:
> On 1/12/24 2:58 AM, Jens Axboe wrote:
>> On 1/11/24 6:12 PM, Xiaobing Li wrote:
>>> On 1/10/24 16:15 AM, Jens Axboe wrote:
>>>> On 1/10/24 2:05 AM, Xiaobing Li wrote:
>>>>> On 1/5/24 04:02 AM, Pavel Begunkov wrote:
>>>>>> On 1/3/24 05:49, Xiaobing Li wrote:
>>>>>>> On 12/30/23 9:27 AM, Pavel Begunkov wrote:
>>>>>>>> Why it uses jiffies instead of some task run time?
>>>>>>>> Consequently, why it's fine to account irq time and other
>>>>>>>> preemption? (hint, it's not)
>>>>>>>>
>>>>>>>> Why it can't be done with userspace and/or bpf? Why
>>>>>>>> can't it be estimated by checking and tracking
>>>>>>>> IORING_SQ_NEED_WAKEUP in userspace?
>>>>>>>>
>>>>>>>> What's the use case in particular? Considering that
>>>>>>>> one of the previous revisions was uapi-less, something
>>>>>>>> is really fishy here. Again, it's a procfs file nobody
>>>>>>>> but a few would want to parse to use the feature.
>>>>>>>>
>>>>>>>> Why it just keeps aggregating stats for the whole
>>>>>>>> life time of the ring? If the workload changes,
>>>>>>>> that would either totally screw the stats or would make
>>>>>>>> it too inert to be useful. That's especially relevant
>>>>>>>> for long running (days) processes. There should be a
>>>>>>>> way to reset it so it starts counting anew.
>>>>>>>
>>>>>>> Hi, Jens and Pavel,
>>>>>>> I carefully read the questions you raised.
>>>>>>> First of all, as to why I use jiffies to statistics time, it
>>>>>>> is because I have done some performance tests and found that
>>>>>>> using jiffies has a relatively smaller loss of performance
>>>>>>> than using task run time. Of course, using task run time is
>>>>>>
>>>>>> How does taking a measure for task runtime looks like? I expect it to
>>>>>> be a simple read of a variable inside task_struct, maybe with READ_ONCE,
>>>>>> in which case the overhead shouldn't be realistically measurable. Does
>>>>>> it need locking?
>>>>>
>>>>> The task runtime I am talking about is similar to this:
>>>>> start = get_system_time(current);
>>>>> do_io_part();
>>>>> sq->total_time += get_system_time(current) - start;
>>>>
>>>> Not sure what get_system_time() is, don't see that anywhere.
>>>>
>>>>> Currently, it is not possible to obtain the execution time of a piece of
>>>>> code by a simple read of a variable inside task_struct.
>>>>> Or do you have any good ideas?
>>>>
>>>> I must be missing something, because it seems like all you need is to
>>>> read task->stime? You could possible even make do with just logging busy
>>>> loop time, as getrusage(RUSAGE_THREAD, &stat) from userspace would then
>>>> give you the total time.
>>>>
>>>> stat.ru_stime would then be the total time, the thread ran, and
>>>> 1 - (above_busy_stime / stat.ru_stime) would give you the time the
>>>> percentage of time the thread ran and did useful work (eg not busy
>>>> looping.
>>>
>>> getrusage can indeed get the total time of the thread, but this
>>> introduces an extra function call, which is relatively more
>>> complicated than defining a variable. In fact, recording the total
>>> time of the loop and the time of processing the IO part can achieve
>>> our observation purpose. Recording only two variables will have less
>>> impact on the existing performance, so why not  choose a simpler and
>>> effective method.
>>
>> I'm not opposed to exposing both of them, it does make the API simpler.
>> If we can call it an API... I think the main point was using task->stime
>> for it rather than jiffies etc.
> 
> Hi, Jens and Pavel
> I modified the code according to your opinions.
> 
> I got the total time of the sqpoll thread through getrusage.
> eg:
> 
> fdinfo.c:
> +long sq_total_time = 0;
> +long sq_work_time = 0;
> if (has_lock && (ctx->flags & IORING_SETUP_SQPOLL)) {
> 	struct io_sq_data *sq = ctx->sq_data;
> 
> 	sq_pid = sq->task_pid;
> 	sq_cpu = sq->sq_cpu;
> +	struct rusage r;
> +	getrusage(sq->thread, RUSAGE_SELF, &r);
> +	sq_total_time = r.ru_stime.tv_sec * 1000000 + r.ru_stime.tv_usec;
> +	sq_work_time = sq->work_time;
> }

That's neat, but fwiw my concerns are mostly about what's
exposed to the user space.

> seq_printf(m, "SqThread:\t%d\n", sq_pid);
> seq_printf(m, "SqThreadCpu:\t%d\n", sq_cpu);
> +seq_printf(m, "SqTotalTime:\t%ldus\n", sq_total_time);
> +seq_printf(m, "SqWorkTime:\t%ldus\n", sq_work_time);
> seq_printf(m, "UserFiles:\t%u\n", ctx->nr_user_files);
> 
> The working time of the sqpoll thread is obtained through ktime_get().
> eg:

Just like with jiffies, ktime_get() is wall clock time, but
uncomfortably much more expensive. Why not stime Jens dug up
last time?

> sqpoll.c:
> +ktime_t start, diff;
> +start = ktime_get();
> list_for_each_entry(ctx, &sqd->ctx_list, sqd_list) {
> 	int ret = __io_sq_thread(ctx, cap_entries);
> 
> 	if (!sqt_spin && (ret > 0 || !wq_list_empty(&ctx->iopoll_list)))
> 		sqt_spin = true;
> }
> if (io_run_task_work())
> 	sqt_spin = true;
> 
> +diff = ktime_sub(ktime_get(), start);
> +if (sqt_spin == true)
> +	sqd->work_time += ktime_to_us(diff);
> 
> The test results are as follows:
> Every 2.0s: cat /proc/9230/fdinfo/6 | grep -E Sq
> SqMask: 0x3
> SqHead: 3197153
> SqTail: 3197153
> CachedSqHead:   3197153
> SqThread:       9231
> SqThreadCpu:    11
> SqTotalTime:    92215321us
> SqWorkTime:     15106578us
> 
> Do you think this solution work?

I'm curious, is the plan to leave it only accessible via
procfs? It's not machine readable well, so should be quite
an annoyance parsing it every time.
diff mbox series

Patch

diff --git a/io_uring/fdinfo.c b/io_uring/fdinfo.c
index 976e9500f651..82d3d9471bac 100644
--- a/io_uring/fdinfo.c
+++ b/io_uring/fdinfo.c
@@ -64,6 +64,8 @@  __cold void io_uring_show_fdinfo(struct seq_file *m, struct file *f)
 	unsigned int sq_shift = 0;
 	unsigned int sq_entries, cq_entries;
 	int sq_pid = -1, sq_cpu = -1;
+	unsigned long work_time = 0;
+	unsigned long total_time = 0;
 	bool has_lock;
 	unsigned int i;
 
@@ -147,10 +149,14 @@  __cold void io_uring_show_fdinfo(struct seq_file *m, struct file *f)
 
 		sq_pid = sq->task_pid;
 		sq_cpu = sq->sq_cpu;
+		work_time = sq->work_time;
+		total_time = sq->total_time;
 	}
 
 	seq_printf(m, "SqThread:\t%d\n", sq_pid);
 	seq_printf(m, "SqThreadCpu:\t%d\n", sq_cpu);
+	seq_printf(m, "SqWorkTime:\t%lu\n", work_time);
+	seq_printf(m, "SqTotalTime:\t%lu\n", total_time);
 	seq_printf(m, "UserFiles:\t%u\n", ctx->nr_user_files);
 	for (i = 0; has_lock && i < ctx->nr_user_files; i++) {
 		struct file *f = io_file_from_index(&ctx->file_table, i);
diff --git a/io_uring/sqpoll.c b/io_uring/sqpoll.c
index 65b5dbe3c850..9b74e344c52a 100644
--- a/io_uring/sqpoll.c
+++ b/io_uring/sqpoll.c
@@ -225,6 +225,7 @@  static int io_sq_thread(void *data)
 	struct io_ring_ctx *ctx;
 	unsigned long timeout = 0;
 	char buf[TASK_COMM_LEN];
+	unsigned long sq_start, sq_work_begin, sq_work_end;
 	DEFINE_WAIT(wait);
 
 	snprintf(buf, sizeof(buf), "iou-sqp-%d", sqd->task_pid);
@@ -241,6 +242,7 @@  static int io_sq_thread(void *data)
 	}
 
 	mutex_lock(&sqd->lock);
+	sq_start = jiffies;
 	while (1) {
 		bool cap_entries, sqt_spin = false;
 
@@ -251,6 +253,7 @@  static int io_sq_thread(void *data)
 		}
 
 		cap_entries = !list_is_singular(&sqd->ctx_list);
+		sq_work_begin = jiffies;
 		list_for_each_entry(ctx, &sqd->ctx_list, sqd_list) {
 			int ret = __io_sq_thread(ctx, cap_entries);
 
@@ -260,6 +263,11 @@  static int io_sq_thread(void *data)
 		if (io_run_task_work())
 			sqt_spin = true;
 
+		sq_work_end = jiffies;
+		sqd->total_time = sq_work_end - sq_start;
+		if (sqt_spin == true)
+			sqd->work_time += sq_work_end - sq_work_begin;
+
 		if (sqt_spin || !time_after(jiffies, timeout)) {
 			if (sqt_spin)
 				timeout = jiffies + sqd->sq_thread_idle;
diff --git a/io_uring/sqpoll.h b/io_uring/sqpoll.h
index 8df37e8c9149..92b4b07220fa 100644
--- a/io_uring/sqpoll.h
+++ b/io_uring/sqpoll.h
@@ -16,6 +16,8 @@  struct io_sq_data {
 	pid_t			task_pid;
 	pid_t			task_tgid;
 
+	unsigned long		work_time;
+	unsigned long		total_time;
 	unsigned long		state;
 	struct completion	exited;
 };