[RFC,2/2] cpufreq/schedutil: Remove iowait boost

Message ID	20240304201625.100619-3-christian.loehle@arm.com (mailing list archive)
State	RFC, archived
Headers	show Received: from foss.arm.com (foss.arm.com [217.140.110.172]) by smtp.subspace.kernel.org (Postfix) with ESMTP id 007EA2C840; Mon, 4 Mar 2024 20:17:18 +0000 (UTC) From: Christian Loehle <christian.loehle@arm.com> To: linux-kernel@vger.kernel.org Cc: peterz@infradead.org, juri.lelli@redhat.com, mingo@redhat.com, rafael@kernel.org, dietmar.eggemann@arm.com, vschneid@redhat.com, vincent.guittot@linaro.org, Johannes.Thumshirn@wdc.com, adrian.hunter@intel.com, ulf.hansson@linaro.org, andres@anarazel.de, asml.silence@gmail.com, linux-pm@vger.kernel.org, linux-block@vger.kernel.org, io-uring@vger.kernel.org, Christian Loehle <christian.loehle@arm.com> Subject: [RFC PATCH 2/2] cpufreq/schedutil: Remove iowait boost Date: Mon, 4 Mar 2024 20:16:25 +0000 Message-Id: <20240304201625.100619-3-christian.loehle@arm.com> In-Reply-To: <20240304201625.100619-1-christian.loehle@arm.com> References: <20240304201625.100619-1-christian.loehle@arm.com> Precedence: bulk MIME-Version: 1.0 Content-Transfer-Encoding: 8bit
Series	Introduce per-task io utilization boost \| expand [RFC,0/2] Introduce per-task io utilization boost [RFC,1/2] sched/fair: Introduce per-task io util boost [RFC,2/2] cpufreq/schedutil: Remove iowait boost

Christian Loehle March 4, 2024, 8:16 p.m. UTC

The previous commit provides a new cpu_util_cfs_boost_io interface for
schedutil which uses the io boosted utilization of the per-task
tracking strategy. Schedutil iowait boosting is therefore no longer
necessary so remove it.

Signed-off-by: Christian Loehle <christian.loehle@arm.com>
---
 kernel/sched/cpufreq_schedutil.c | 152 +------------------------------
 1 file changed, 5 insertions(+), 147 deletions(-)

Rafael J. Wysocki March 18, 2024, 2:07 p.m. UTC | #1

On Mon, Mar 4, 2024 at 9:17 PM Christian Loehle
<christian.loehle@arm.com> wrote:
>
> The previous commit provides a new cpu_util_cfs_boost_io interface for
> schedutil which uses the io boosted utilization of the per-task
> tracking strategy. Schedutil iowait boosting is therefore no longer
> necessary so remove it.

I'm wondering about the cases when schedutil is used without EAS.

Are they still going to be handled as before after this change?

Christian Loehle March 18, 2024, 4:40 p.m. UTC | #2

On 18/03/2024 14:07, Rafael J. Wysocki wrote:
> On Mon, Mar 4, 2024 at 9:17 PM Christian Loehle
> <christian.loehle@arm.com> wrote:
>>
>> The previous commit provides a new cpu_util_cfs_boost_io interface for
>> schedutil which uses the io boosted utilization of the per-task
>> tracking strategy. Schedutil iowait boosting is therefore no longer
>> necessary so remove it.
> 
> I'm wondering about the cases when schedutil is used without EAS.
> 
> Are they still going to be handled as before after this change?

Well they should still get boosted (under the new conditions) and according
to my tests that does work. Anything in particular you're worried about?

So in terms of throughput I see similar results with EAS and CAS+sugov.
I'm happy including numbers in the cover letter for future versions, too.
So far my intuition was that nobody would care enough to include them
(as long as it generally still works).

Kind Regards,
Christian

Rafael J. Wysocki March 18, 2024, 5:08 p.m. UTC | #3

On Mon, Mar 18, 2024 at 5:40 PM Christian Loehle
<christian.loehle@arm.com> wrote:
>
> On 18/03/2024 14:07, Rafael J. Wysocki wrote:
> > On Mon, Mar 4, 2024 at 9:17 PM Christian Loehle
> > <christian.loehle@arm.com> wrote:
> >>
> >> The previous commit provides a new cpu_util_cfs_boost_io interface for
> >> schedutil which uses the io boosted utilization of the per-task
> >> tracking strategy. Schedutil iowait boosting is therefore no longer
> >> necessary so remove it.
> >
> > I'm wondering about the cases when schedutil is used without EAS.
> >
> > Are they still going to be handled as before after this change?
>
> Well they should still get boosted (under the new conditions) and according
> to my tests that does work.

OK

> Anything in particular you're worried about?

It is not particularly clear to me how exactly the boost is taken into
account without EAS.

> So in terms of throughput I see similar results with EAS and CAS+sugov.
> I'm happy including numbers in the cover letter for future versions, too.
> So far my intuition was that nobody would care enough to include them
> (as long as it generally still works).

Well, IMV clear understanding of the changes is more important.

Christian Loehle March 19, 2024, 1:58 p.m. UTC | #4

On 18/03/2024 17:08, Rafael J. Wysocki wrote:
> On Mon, Mar 18, 2024 at 5:40 PM Christian Loehle
> <christian.loehle@arm.com> wrote:
>>
>> On 18/03/2024 14:07, Rafael J. Wysocki wrote:
>>> On Mon, Mar 4, 2024 at 9:17 PM Christian Loehle
>>> <christian.loehle@arm.com> wrote:
>>>>
>>>> The previous commit provides a new cpu_util_cfs_boost_io interface for
>>>> schedutil which uses the io boosted utilization of the per-task
>>>> tracking strategy. Schedutil iowait boosting is therefore no longer
>>>> necessary so remove it.
>>>
>>> I'm wondering about the cases when schedutil is used without EAS.
>>>
>>> Are they still going to be handled as before after this change?
>>
>> Well they should still get boosted (under the new conditions) and according
>> to my tests that does work.
> 
> OK
> 
>> Anything in particular you're worried about?
> 
> It is not particularly clear to me how exactly the boost is taken into
> account without EAS.

So a quick rundown for now, I'll try to include something along the lines in
future versions then, too.
Every task_struct carries an io_boost_level in the range of [0..8] with it.
The boost is in units of utilization (w.r.t SCHED_CAPACITY_SCALE, independent
of CPU the task might be currently enqueued on).
The boost is taken into account for:
1. sugov frequency selection with
io_boost = cpu_util_io_boost(sg_cpu->cpu);
util = max(util, io_boost);

The io boost of all tasks enqueued on the rq will be max-aggregated with the
util here. (See cfs_rq->io_boost_tasks).

2. Task placement, for EAS in feec();
Otherwise select_idle_sibling() / select_idle_capacity() to ensure the CPU
satisfies the requested io_boost of the task to be enqueued.

Determining the io_boost_level is a bit more involved than with sugov's
implementation and happens in dequeue_io_boost(), hopefully that part
is reasonably understandable from the code.

Hope that helps.

Kind Regards,
Christian

> 
>> So in terms of throughput I see similar results with EAS and CAS+sugov.
>> I'm happy including numbers in the cover letter for future versions, too.
>> So far my intuition was that nobody would care enough to include them
>> (as long as it generally still works).
> 
> Well, IMV clear understanding of the changes is more important.

Qais Yousef March 25, 2024, 2:37 a.m. UTC | #5

On 03/18/24 18:08, Rafael J. Wysocki wrote:
> On Mon, Mar 18, 2024 at 5:40 PM Christian Loehle
> <christian.loehle@arm.com> wrote:
> >
> > On 18/03/2024 14:07, Rafael J. Wysocki wrote:
> > > On Mon, Mar 4, 2024 at 9:17 PM Christian Loehle
> > > <christian.loehle@arm.com> wrote:
> > >>
> > >> The previous commit provides a new cpu_util_cfs_boost_io interface for
> > >> schedutil which uses the io boosted utilization of the per-task
> > >> tracking strategy. Schedutil iowait boosting is therefore no longer
> > >> necessary so remove it.
> > >
> > > I'm wondering about the cases when schedutil is used without EAS.
> > >
> > > Are they still going to be handled as before after this change?
> >
> > Well they should still get boosted (under the new conditions) and according
> > to my tests that does work.
> 
> OK
> 
> > Anything in particular you're worried about?
> 
> It is not particularly clear to me how exactly the boost is taken into
> account without EAS.
> 
> > So in terms of throughput I see similar results with EAS and CAS+sugov.
> > I'm happy including numbers in the cover letter for future versions, too.
> > So far my intuition was that nobody would care enough to include them
> > (as long as it generally still works).
> 
> Well, IMV clear understanding of the changes is more important.

I think the major thing we need to be careful about is the behavior when the
task is sleeping. I think the boosting will be removed when the task is
dequeued and I can bet there will be systems out there where the BLOCK softirq
being boosted when the task is sleeping will matter.

FWIW I do have an implementation for per-task iowait boost where I went a step
further and converted intel_pstate too and like Christian didn't notice
a regression. But I am not sure (rather don't think) I triggered this use case.
I can't tell when the systems truly have per-cpu cpufreq control or just appear
so and they are actually shared but not visible at linux level.

Christian Loehle April 19, 2024, 1:42 p.m. UTC | #6

On 25/03/2024 02:37, Qais Yousef wrote:
> On 03/18/24 18:08, Rafael J. Wysocki wrote:
>> On Mon, Mar 18, 2024 at 5:40 PM Christian Loehle
>> <christian.loehle@arm.com> wrote:
>>>
>>> On 18/03/2024 14:07, Rafael J. Wysocki wrote:
>>>> On Mon, Mar 4, 2024 at 9:17 PM Christian Loehle
>>>> <christian.loehle@arm.com> wrote:
>>>>>
>>>>> The previous commit provides a new cpu_util_cfs_boost_io interface for
>>>>> schedutil which uses the io boosted utilization of the per-task
>>>>> tracking strategy. Schedutil iowait boosting is therefore no longer
>>>>> necessary so remove it.
>>>>
>>>> I'm wondering about the cases when schedutil is used without EAS.
>>>>
>>>> Are they still going to be handled as before after this change?
>>>
>>> Well they should still get boosted (under the new conditions) and according
>>> to my tests that does work.
>>
>> OK
>>
>>> Anything in particular you're worried about?
>>
>> It is not particularly clear to me how exactly the boost is taken into
>> account without EAS.
>>
>>> So in terms of throughput I see similar results with EAS and CAS+sugov.
>>> I'm happy including numbers in the cover letter for future versions, too.
>>> So far my intuition was that nobody would care enough to include them
>>> (as long as it generally still works).
>>
>> Well, IMV clear understanding of the changes is more important.
> 
> I think the major thing we need to be careful about is the behavior when the
> task is sleeping. I think the boosting will be removed when the task is
> dequeued and I can bet there will be systems out there where the BLOCK softirq
> being boosted when the task is sleeping will matter.

Currently I see this mainly protected by the sugov rate_limit_us.
With the enqueue's being the dominating cpufreq updates it's not really an
issue, the boost is expected to survive the sleep duration, during which it
wouldn't be active.
I did experiment with some sort of 'stickiness' of the boost to the rq, but
it is somewhat of a pain to deal with if we want to remove it once enqueued
on a different rq. A sugov 1ms timer is much simpler of course.
Currently it's not necessary IMO, but for the sake of being future-proof in
terms of more frequent freq updates I might include it in v2.

> 
> FWIW I do have an implementation for per-task iowait boost where I went a step
> further and converted intel_pstate too and like Christian didn't notice
> a regression. But I am not sure (rather don't think) I triggered this use case.
> I can't tell when the systems truly have per-cpu cpufreq control or just appear
> so and they are actually shared but not visible at linux level.

Please do share your intel_pstate proposal!

Kind Regards,
Christian

Qais Yousef April 29, 2024, 11:18 a.m. UTC | #7

On 04/19/24 14:42, Christian Loehle wrote:

> > I think the major thing we need to be careful about is the behavior when the
> > task is sleeping. I think the boosting will be removed when the task is
> > dequeued and I can bet there will be systems out there where the BLOCK softirq
> > being boosted when the task is sleeping will matter.
> 
> Currently I see this mainly protected by the sugov rate_limit_us.
> With the enqueue's being the dominating cpufreq updates it's not really an
> issue, the boost is expected to survive the sleep duration, during which it
> wouldn't be active.
> I did experiment with some sort of 'stickiness' of the boost to the rq, but
> it is somewhat of a pain to deal with if we want to remove it once enqueued
> on a different rq. A sugov 1ms timer is much simpler of course.
> Currently it's not necessary IMO, but for the sake of being future-proof in
> terms of more frequent freq updates I might include it in v2.

Making sure things work with purpose would be really great. This implicit
dependency is not great IMHO and make both testing and reasoning about why
things are good or bad harder when analysing real workloads. Especially by non
kernel developers.

> 
> > 
> > FWIW I do have an implementation for per-task iowait boost where I went a step
> > further and converted intel_pstate too and like Christian didn't notice
> > a regression. But I am not sure (rather don't think) I triggered this use case.
> > I can't tell when the systems truly have per-cpu cpufreq control or just appear
> > so and they are actually shared but not visible at linux level.
> 
> Please do share your intel_pstate proposal!

This is what I had. I haven't been working on this for the past few months, but
I remember tried several tests on different machines then without a problem.
I tried to re-order patches at some point though and I hope I didn't break
something accidentally and forgot the state.

https://github.com/torvalds/linux/compare/master...qais-yousef:linux:uclamp-max-aggregation

Christian Loehle May 7, 2024, 3:19 p.m. UTC | #8

On 29/04/2024 12:18, Qais Yousef wrote:
> On 04/19/24 14:42, Christian Loehle wrote:
> 
>>> I think the major thing we need to be careful about is the behavior when the
>>> task is sleeping. I think the boosting will be removed when the task is
>>> dequeued and I can bet there will be systems out there where the BLOCK softirq
>>> being boosted when the task is sleeping will matter.
>>
>> Currently I see this mainly protected by the sugov rate_limit_us.
>> With the enqueue's being the dominating cpufreq updates it's not really an
>> issue, the boost is expected to survive the sleep duration, during which it
>> wouldn't be active.
>> I did experiment with some sort of 'stickiness' of the boost to the rq, but
>> it is somewhat of a pain to deal with if we want to remove it once enqueued
>> on a different rq. A sugov 1ms timer is much simpler of course.
>> Currently it's not necessary IMO, but for the sake of being future-proof in
>> terms of more frequent freq updates I might include it in v2.
> 
> Making sure things work with purpose would be really great. This implicit
> dependency is not great IMHO and make both testing and reasoning about why
> things are good or bad harder when analysing real workloads. Especially by non
> kernel developers.

Agreed.
Even without your proposed changes [1] relying on sugov rate_limit_us is
unfortunate.
There is a problem with an arbitrarily low rate_limit_us more generally, not
just because we kind of rely on the CPU being boosted right before the task is
actually enqueued (for the interrupt/softirq part of it), but also because of
the latency from requested frequency improvement to actually running on that
frequency. If the task is 90% done by the time it sees the improvement and
the frequency will be updated (back to a lower one) before the next enqueue,
then that's hardly worth the effort.
Currently this is covered by rate_limit_us probabillistically and that seems
to be good enough in practice, but it's not very pleasing (and also EAS can't
take it into consideration).
That's not just exclusive for iowait wakeup tasks of course, but in theory any
that is off the rq frequently (and still requests a higher frequency than it can
realistically build up through util_avg like through uclamp_min).

>>>
>>> FWIW I do have an implementation for per-task iowait boost where I went a step
>>> further and converted intel_pstate too and like Christian didn't notice
>>> a regression. But I am not sure (rather don't think) I triggered this use case.
>>> I can't tell when the systems truly have per-cpu cpufreq control or just appear
>>> so and they are actually shared but not visible at linux level.
>>
>> Please do share your intel_pstate proposal!
> 
> This is what I had. I haven't been working on this for the past few months, but
> I remember tried several tests on different machines then without a problem.
> I tried to re-order patches at some point though and I hope I didn't break
> something accidentally and forgot the state.
> 
> https://github.com/torvalds/linux/compare/master...qais-yousef:linux:uclamp-max-aggregation
> 

Thanks for sharing, that looks reasonable with consolidating it into uclamp_min.
Couple of thoughts on yours, I'm sure you're aware, but consider it me thinking out
loud:
- iowait boost is taken into consideration for task placement, but with just the
4 steps that made it more aggressive on HMP. (Potentially 2-3 consecutive iowait
wakeups to land on the big instead of running at max OPP of a LITTLE).
- If the current iowait boost decay is sensible is questionable, but there should
probably be some decay. Taken to the extreme this would mean something
like blk_wait_io() demands 1024 utilization, if it waits for a very long time.
Repeating myself here, but iowait wakeups itself is tricky to work with (and I
try to work around that).
- The intel_pstate solution will increase boost even if
previous_wakeup->iowait_boost > current->iowait_boost
right? But using current->iowait_boost is a clever idea.

[1]
https://lore.kernel.org/lkml/ZgKFT5b423hfQdl9@gmail.com/T/

Kind Regards,
Christian

Qais Yousef May 12, 2024, 3:29 p.m. UTC | #9

On 05/07/24 16:19, Christian Loehle wrote:
> On 29/04/2024 12:18, Qais Yousef wrote:
> > On 04/19/24 14:42, Christian Loehle wrote:
> > 
> >>> I think the major thing we need to be careful about is the behavior when the
> >>> task is sleeping. I think the boosting will be removed when the task is
> >>> dequeued and I can bet there will be systems out there where the BLOCK softirq
> >>> being boosted when the task is sleeping will matter.
> >>
> >> Currently I see this mainly protected by the sugov rate_limit_us.
> >> With the enqueue's being the dominating cpufreq updates it's not really an
> >> issue, the boost is expected to survive the sleep duration, during which it
> >> wouldn't be active.
> >> I did experiment with some sort of 'stickiness' of the boost to the rq, but
> >> it is somewhat of a pain to deal with if we want to remove it once enqueued
> >> on a different rq. A sugov 1ms timer is much simpler of course.
> >> Currently it's not necessary IMO, but for the sake of being future-proof in
> >> terms of more frequent freq updates I might include it in v2.
> > 
> > Making sure things work with purpose would be really great. This implicit
> > dependency is not great IMHO and make both testing and reasoning about why
> > things are good or bad harder when analysing real workloads. Especially by non
> > kernel developers.
> 
> Agreed.
> Even without your proposed changes [1] relying on sugov rate_limit_us is
> unfortunate.
> There is a problem with an arbitrarily low rate_limit_us more generally, not
> just because we kind of rely on the CPU being boosted right before the task is
> actually enqueued (for the interrupt/softirq part of it), but also because of
> the latency from requested frequency improvement to actually running on that
> frequency. If the task is 90% done by the time it sees the improvement and
> the frequency will be updated (back to a lower one) before the next enqueue,
> then that's hardly worth the effort.

I think that's why iowait boost is done the way it is today. You need to
sustain the boost as the thing that need it run for a very short amount of
time..

Have you looked at how long the iowait boosted tasks run for in your tests?

> Currently this is covered by rate_limit_us probabillistically and that seems

Side note. While looking more at the history of rate_limit_us and why it is set
soo much higher than reported transition_latency by the driver, I am slowly
reaching the conclusion that what is happening here is similar to introduction
of down_rate_limit_us in Android. Our ability to predict system requirement is
not great and we end up with prematurely lowering frequencies, and this rate
limit just happens to more likely keep it higher so on average you see better
perf in various workloads. I hope we can improve on this so we don't rely on
this magic and enable better usage of the hardware ability to transition
between frequencies fast and be more exact. I am trying to work on this as part
of my magic margins series. But I think the story is bigger than that..

> to be good enough in practice, but it's not very pleasing (and also EAS can't
> take it into consideration).
> That's not just exclusive for iowait wakeup tasks of course, but in theory any
> that is off the rq frequently (and still requests a higher frequency than it can
> realistically build up through util_avg like through uclamp_min).

For uclamp_min, if the task RUNNING duration is shorter than the hardware
ability to apply the boost, I think there's little we can do. The user can opt
to boost the system in general. Note that this is likely a problem on systems
with multi-ms transition_delay_us. If the task is running for few 100s us and
it really wants to be boosted for this short time then a static system wide
boost is all we can do to guarantee what it wants. The hardware is poor fit for
the use case in this scenario. And personally I'd push back against introducing
complexity to deal with such poor fit scenarios. We can already set min freq
for the policy via sysfs and uclamp_min can still help with task placement for
HMP systems.

Now the problem we could have which is similar to iowait boost scenario is when
there's a chaining effect that requires the boost for the duration of the
chain.

I think we can do something about it if we:

	1. Guarantee the chain will run on the same CPU.
	2. Introduce a 'sticky' flag for the boost to stay while the chain is
	   running.
	3. Introduce a start/finish indication for the chain.

I think we can do something like that with sched-qos [1] to tag the chain via
a cookie and request the boost to apply to them collectively.

Generally, userspace would be better to collapse this chain into a single task
that runs in one go.

I don't know how often this scenario exists in practice and what limitations
exist that make the simple collapse solution infeasible.. So I'd leave this
out until more info is available.

> 
> >>>
> >>> FWIW I do have an implementation for per-task iowait boost where I went a step
> >>> further and converted intel_pstate too and like Christian didn't notice
> >>> a regression. But I am not sure (rather don't think) I triggered this use case.
> >>> I can't tell when the systems truly have per-cpu cpufreq control or just appear
> >>> so and they are actually shared but not visible at linux level.
> >>
> >> Please do share your intel_pstate proposal!
> > 
> > This is what I had. I haven't been working on this for the past few months, but
> > I remember tried several tests on different machines then without a problem.
> > I tried to re-order patches at some point though and I hope I didn't break
> > something accidentally and forgot the state.
> > 
> > https://github.com/torvalds/linux/compare/master...qais-yousef:linux:uclamp-max-aggregation
> > 
> 
> Thanks for sharing, that looks reasonable with consolidating it into uclamp_min.
> Couple of thoughts on yours, I'm sure you're aware, but consider it me thinking out
> loud:
> - iowait boost is taken into consideration for task placement, but with just the
> 4 steps that made it more aggressive on HMP. (Potentially 2-3 consecutive iowait
> wakeups to land on the big instead of running at max OPP of a LITTLE).

Yeah I opted to keep the logic the same. I think there are gains to be had even
without being smarter about the algorithm. But we do need to improve it, yes.
The current logic is too aggressive and the perf/power trade-off will be tricky
in practice.

> - If the current iowait boost decay is sensible is questionable, but there should
> probably be some decay. Taken to the extreme this would mean something
> like blk_wait_io() demands 1024 utilization, if it waits for a very long time.
> Repeating myself here, but iowait wakeups itself is tricky to work with (and I
> try to work around that).

I didn't get you here. But generally the story can go few levels deep yes. My
approach was to make incremental progress without breaking existing stuff, but
help move things in the right direction over time. Fixing everything in one go
will be hard and not productive.

> - The intel_pstate solution will increase boost even if
> previous_wakeup->iowait_boost > current->iowait_boost
> right? But using current->iowait_boost is a clever idea.

I forgot the details now. But I seem to remember intel_pstate had its own
accounting when it sees iowait boost in intel_pstate_hwp_boost_up/down().

Note that the major worry I had is about the softirq not being boosted.
Although in my testing this didn't seem to show up as things seemed fine. But
I haven't dug down to see how accidental this was. I could easily see my
patches making some use case out there not happy as the softirq might not get
a chance to see the boost. I got distracted with other stuff and didn't get
back to the topic since then. I am more than happy to support your efforts
though :)

[1] https://lore.kernel.org/lkml/20230916213316.p36nhgnibsidoggt@airbuntu/

[RFC,2/2] cpufreq/schedutil: Remove iowait boost

Commit Message

Comments

Patch