
[Resend] psi : calc cfs task memstall time more precisely

Message ID 1634278612-17055-1-git-send-email-huangzhaoyang@gmail.com (mailing list archive)
State New

Commit Message

Zhaoyang Huang Oct. 15, 2021, 6:16 a.m. UTC
From: Zhaoyang Huang <zhaoyang.huang@unisoc.com>

In an EAS enabled system, there are two scenarios that conflict with the current design:

1. The workload tends to be distributed unevenly among cores because of the
scheduler policy; RT tasks usually preempt CFS tasks on the little cores.
2. A CFS task's memstall time is currently counted simply as exit - entry,
which ignores the time it is preempted by RT tasks, DL tasks and IRQs.

With these two constraints, the per-CPU non-idle time is mainly consumed by
non-CFS tasks and cannot be averaged. Eliminate that time by calculating the
time growth via the proportion of the cfs_rq's utilization in the whole rq.

e.g.
Here is the scenario this commit wants to fix: RT tasks and IRQs consume
some of the whole rq's utilization. This can be typical on a core that is
assigned to handle all IRQs; furthermore, RT tasks usually run on the
little cores under EAS.

Binder:305_3-314    [002] d..1   257.880195: psi_memtime_fixup: original:30616,adjusted:25951,se:89,cfs:353,rt:139,dl:0,irq:18
droid.phone-1525    [001] d..1   265.145492: psi_memtime_fixup: original:61616,adjusted:53492,se:55,cfs:225,rt:121,dl:0,irq:15

Signed-off-by: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
---
 kernel/sched/psi.c | 20 +++++++++++++++++++-
 1 file changed, 19 insertions(+), 1 deletion(-)
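
For reference, the heart of the change is the scaling below. This is a
simplified, untested sketch of the psi_memtime_fixup() hunk quoted in the
comments (hypothetical helper name; the "+ 1" rounding term of the original
and the CONFIG_HAVE_SCHED_AVG_IRQ dependency of rq->avg_irq are glossed over):

static unsigned long scale_memstall_growth(struct rq *rq, u32 growth)
{
        /* utilization consumed by non-CFS classes on this CPU */
        unsigned long non_cfs = rq->avg_rt.util_avg + rq->avg_dl.util_avg +
                                rq->avg_irq.util_avg;

        if (non_cfs >= SCHED_CAPACITY_SCALE)
                return 0;

        /* scale the raw delta by the share left for CFS (out of 1024) */
        return div64_ul((u64)(SCHED_CAPACITY_SCALE - non_cfs) * growth,
                        SCHED_CAPACITY_SCALE);
}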

Comments

Johannes Weiner Nov. 2, 2021, 7:47 p.m. UTC | #1
CC peterz as well for rt and timekeeping magic

On Fri, Oct 15, 2021 at 02:16:52PM +0800, Huangzhaoyang wrote:
> From: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
> 
> In an EAS enabled system, there are two scenarios discordant to current design,
> 
> 1. workload used to be heavy uneven among cores for sake of scheduler policy.
> RT task usually preempts CFS task in little core.
> 2. CFS task's memstall time is counted as simple as exit - entry so far, which
> ignore the preempted time by RT, DL and Irqs.
> 
> With these two constraints, the percpu nonidle time would be mainly consumed by
> none CFS tasks and couldn't be averaged. Eliminating them by calc the time growth
> via the proportion of cfs_rq's utilization on the whole rq.
> 
> eg.
> Here is the scenario which this commit want to fix, that is the rt and irq consume
> some utilization of the whole rq. This scenario could be typical in a core
> which is assigned to deal with all irqs. Furthermore, the rt task used to run on
> little core under EAS.
> 
> Binder:305_3-314    [002] d..1   257.880195: psi_memtime_fixup: original:30616,adjusted:25951,se:89,cfs:353,rt:139,dl:0,irq:18
> droid.phone-1525    [001] d..1   265.145492: psi_memtime_fixup: original:61616,adjusted:53492,se:55,cfs:225,rt:121,dl:0,irq:15
>
> Signed-off-by: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
> ---
>  kernel/sched/psi.c | 20 +++++++++++++++++++-
>  1 file changed, 19 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
> index cc25a3c..754a836 100644
> --- a/kernel/sched/psi.c
> +++ b/kernel/sched/psi.c
> @@ -182,6 +182,8 @@ struct psi_group psi_system = {
>  
>  static void psi_avgs_work(struct work_struct *work);
>  
> +static unsigned long psi_memtime_fixup(u32 growth);
> +
>  static void group_init(struct psi_group *group)
>  {
>  	int cpu;
> @@ -492,6 +494,21 @@ static u64 window_update(struct psi_window *win, u64 now, u64 value)
>  	return growth;
>  }
>  
> +static unsigned long psi_memtime_fixup(u32 growth)
> +{
> +	struct rq *rq = task_rq(current);
> +	unsigned long growth_fixed = (unsigned long)growth;
> +
> +	if (!(current->policy == SCHED_NORMAL || current->policy == SCHED_BATCH))
> +		return growth_fixed;
> +
> +	if (current->in_memstall)
> +		growth_fixed = div64_ul((1024 - rq->avg_rt.util_avg - rq->avg_dl.util_avg
> +					- rq->avg_irq.util_avg + 1) * growth, 1024);
> +
> +	return growth_fixed;
> +}
> +
>  static void init_triggers(struct psi_group *group, u64 now)
>  {
>  	struct psi_trigger *t;
> @@ -658,6 +675,7 @@ static void record_times(struct psi_group_cpu *groupc, u64 now)
>  	}
>  
>  	if (groupc->state_mask & (1 << PSI_MEM_SOME)) {
> +		delta = psi_memtime_fixup(delta);

Ok, so we want to deduct IRQ and RT preemption time from the memstall
period of an active reclaimer, since it's technically not stalled on
memory during this time but on CPU.

However, we do NOT want to deduct IRQ and RT time from memstalls that
are sleeping on refault swapins, since they are not affected by what
is going on on the CPU.

Does util_avg capture that difference? I'm not confident it does - but
correct me if I'm wrong. We need the length of time during which an IRQ
or an RT task preempted the old rq->curr, not the absolute irq/rt length.

(Btw, such preemption periods, in addition to being deducted from
memory stalls, should probably also be added to CPU contention stalls,
to make CPU pressure reporting more accurate as well.)
Zhaoyang Huang Nov. 3, 2021, 7:07 a.m. UTC | #2
On Wed, Nov 3, 2021 at 3:47 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> CC peterz as well for rt and timekeeping magic
>
> On Fri, Oct 15, 2021 at 02:16:52PM +0800, Huangzhaoyang wrote:
> > From: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
> >
> > In an EAS enabled system, there are two scenarios discordant to current design,
> >
> > 1. workload used to be heavy uneven among cores for sake of scheduler policy.
> > RT task usually preempts CFS task in little core.
> > 2. CFS task's memstall time is counted as simple as exit - entry so far, which
> > ignore the preempted time by RT, DL and Irqs.
> >
> > With these two constraints, the percpu nonidle time would be mainly consumed by
> > none CFS tasks and couldn't be averaged. Eliminating them by calc the time growth
> > via the proportion of cfs_rq's utilization on the whole rq.
> >
> > eg.
> > Here is the scenario which this commit want to fix, that is the rt and irq consume
> > some utilization of the whole rq. This scenario could be typical in a core
> > which is assigned to deal with all irqs. Furthermore, the rt task used to run on
> > little core under EAS.
> >
> > Binder:305_3-314    [002] d..1   257.880195: psi_memtime_fixup: original:30616,adjusted:25951,se:89,cfs:353,rt:139,dl:0,irq:18
> > droid.phone-1525    [001] d..1   265.145492: psi_memtime_fixup: original:61616,adjusted:53492,se:55,cfs:225,rt:121,dl:0,irq:15
> >
> > Signed-off-by: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
> > ---
> >  kernel/sched/psi.c | 20 +++++++++++++++++++-
> >  1 file changed, 19 insertions(+), 1 deletion(-)
> >
> > diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
> > index cc25a3c..754a836 100644
> > --- a/kernel/sched/psi.c
> > +++ b/kernel/sched/psi.c
> > @@ -182,6 +182,8 @@ struct psi_group psi_system = {
> >
> >  static void psi_avgs_work(struct work_struct *work);
> >
> > +static unsigned long psi_memtime_fixup(u32 growth);
> > +
> >  static void group_init(struct psi_group *group)
> >  {
> >       int cpu;
> > @@ -492,6 +494,21 @@ static u64 window_update(struct psi_window *win, u64 now, u64 value)
> >       return growth;
> >  }
> >
> > +static unsigned long psi_memtime_fixup(u32 growth)
> > +{
> > +     struct rq *rq = task_rq(current);
> > +     unsigned long growth_fixed = (unsigned long)growth;
> > +
> > +     if (!(current->policy == SCHED_NORMAL || current->policy == SCHED_BATCH))
> > +             return growth_fixed;
> > +
> > +     if (current->in_memstall)
> > +             growth_fixed = div64_ul((1024 - rq->avg_rt.util_avg - rq->avg_dl.util_avg
> > +                                     - rq->avg_irq.util_avg + 1) * growth, 1024);
> > +
> > +     return growth_fixed;
> > +}
> > +
> >  static void init_triggers(struct psi_group *group, u64 now)
> >  {
> >       struct psi_trigger *t;
> > @@ -658,6 +675,7 @@ static void record_times(struct psi_group_cpu *groupc, u64 now)
> >       }
> >
> >       if (groupc->state_mask & (1 << PSI_MEM_SOME)) {
> > +             delta = psi_memtime_fixup(delta);
>
Add Vincent for advice on the CPU load mechanism.

> Ok, so we want to deduct IRQ and RT preemption time from the memstall
> period of an active reclaimer, since it's technically not stalled on
> memory during this time but on CPU.
>
> However, we do NOT want to deduct IRQ and RT time from memstalls that
> are sleeping on refaults swapins, since they are not affected by what
> is going on on the CPU.
>
> Does util_avg capture that difference? I'm not confident it does - but
> correct me if I'm wrong. We need length of time during which and IRQ
> or an RT task preempted the old rq->curr, not absolute irq/rt length.
As far as I understand, a core's capacity = IRQ + DEADLINE + RT + CFS. For
a given time period, all CFS tasks preempt each other within CFS's
utilization, so the time sleeping on refaults is counted in.
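
For context, the decomposition above maps onto the per-class PELT signals
tracked in struct rq; a small illustrative helper (hypothetical name;
rq->avg_irq only exists with CONFIG_HAVE_SCHED_AVG_IRQ):

static unsigned long rq_total_util(struct rq *rq)
{
        /* CFS + RT + DL + IRQ utilization as tracked per rq by PELT */
        return READ_ONCE(rq->cfs.avg.util_avg) +
               READ_ONCE(rq->avg_rt.util_avg) +
               READ_ONCE(rq->avg_dl.util_avg) +
               READ_ONCE(rq->avg_irq.util_avg);
}
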
>
> (Btw, such preemption periods, in addition to being deducted from
> memory stalls, should probably also be added to CPU contention stalls,
> to make CPU pressure reporting more accurate as well.)
Zhaoyang Huang Nov. 3, 2021, 7:08 a.m. UTC | #3
+Vincent Guittot

On Wed, Nov 3, 2021 at 3:07 PM Zhaoyang Huang <huangzhaoyang@gmail.com> wrote:
>
> On Wed, Nov 3, 2021 at 3:47 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
> >
> > CC peterz as well for rt and timekeeping magic
> >
> > On Fri, Oct 15, 2021 at 02:16:52PM +0800, Huangzhaoyang wrote:
> > > From: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
> > >
> > > In an EAS enabled system, there are two scenarios discordant to current design,
> > >
> > > 1. workload used to be heavy uneven among cores for sake of scheduler policy.
> > > RT task usually preempts CFS task in little core.
> > > 2. CFS task's memstall time is counted as simple as exit - entry so far, which
> > > ignore the preempted time by RT, DL and Irqs.
> > >
> > > With these two constraints, the percpu nonidle time would be mainly consumed by
> > > none CFS tasks and couldn't be averaged. Eliminating them by calc the time growth
> > > via the proportion of cfs_rq's utilization on the whole rq.
> > >
> > > eg.
> > > Here is the scenario which this commit want to fix, that is the rt and irq consume
> > > some utilization of the whole rq. This scenario could be typical in a core
> > > which is assigned to deal with all irqs. Furthermore, the rt task used to run on
> > > little core under EAS.
> > >
> > > Binder:305_3-314    [002] d..1   257.880195: psi_memtime_fixup: original:30616,adjusted:25951,se:89,cfs:353,rt:139,dl:0,irq:18
> > > droid.phone-1525    [001] d..1   265.145492: psi_memtime_fixup: original:61616,adjusted:53492,se:55,cfs:225,rt:121,dl:0,irq:15
> > >
> > > Signed-off-by: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
> > > ---
> > >  kernel/sched/psi.c | 20 +++++++++++++++++++-
> > >  1 file changed, 19 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
> > > index cc25a3c..754a836 100644
> > > --- a/kernel/sched/psi.c
> > > +++ b/kernel/sched/psi.c
> > > @@ -182,6 +182,8 @@ struct psi_group psi_system = {
> > >
> > >  static void psi_avgs_work(struct work_struct *work);
> > >
> > > +static unsigned long psi_memtime_fixup(u32 growth);
> > > +
> > >  static void group_init(struct psi_group *group)
> > >  {
> > >       int cpu;
> > > @@ -492,6 +494,21 @@ static u64 window_update(struct psi_window *win, u64 now, u64 value)
> > >       return growth;
> > >  }
> > >
> > > +static unsigned long psi_memtime_fixup(u32 growth)
> > > +{
> > > +     struct rq *rq = task_rq(current);
> > > +     unsigned long growth_fixed = (unsigned long)growth;
> > > +
> > > +     if (!(current->policy == SCHED_NORMAL || current->policy == SCHED_BATCH))
> > > +             return growth_fixed;
> > > +
> > > +     if (current->in_memstall)
> > > +             growth_fixed = div64_ul((1024 - rq->avg_rt.util_avg - rq->avg_dl.util_avg
> > > +                                     - rq->avg_irq.util_avg + 1) * growth, 1024);
> > > +
> > > +     return growth_fixed;
> > > +}
> > > +
> > >  static void init_triggers(struct psi_group *group, u64 now)
> > >  {
> > >       struct psi_trigger *t;
> > > @@ -658,6 +675,7 @@ static void record_times(struct psi_group_cpu *groupc, u64 now)
> > >       }
> > >
> > >       if (groupc->state_mask & (1 << PSI_MEM_SOME)) {
> > > +             delta = psi_memtime_fixup(delta);
> >
> add vincent for advise on cpu load mechanism
>
> > Ok, so we want to deduct IRQ and RT preemption time from the memstall
> > period of an active reclaimer, since it's technically not stalled on
> > memory during this time but on CPU.
> >
> > However, we do NOT want to deduct IRQ and RT time from memstalls that
> > are sleeping on refaults swapins, since they are not affected by what
> > is going on on the CPU.
> >
> > Does util_avg capture that difference? I'm not confident it does - but
> > correct me if I'm wrong. We need length of time during which and IRQ
> > or an RT task preempted the old rq->curr, not absolute irq/rt length.
> As far as my understanding, core's capacity = IRQ + DEADLINE + RT +
> CFS. For a certain time period, all cfs tasks preempt each other
> inside CFS's utilization. So the sleeping on refaults is counted in.
> >
> > (Btw, such preemption periods, in addition to being deducted from
> > memory stalls, should probably also be added to CPU contention stalls,
> > to make CPU pressure reporting more accurate as well.)
Dietmar Eggemann Nov. 4, 2021, 8:58 a.m. UTC | #4
On 03/11/2021 08:08, Zhaoyang Huang wrote:
> +Vincent Guittot
> 
> On Wed, Nov 3, 2021 at 3:07 PM Zhaoyang Huang <huangzhaoyang@gmail.com> wrote:
>>
>> On Wed, Nov 3, 2021 at 3:47 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>>>
>>> CC peterz as well for rt and timekeeping magic
>>>
>>> On Fri, Oct 15, 2021 at 02:16:52PM +0800, Huangzhaoyang wrote:
>>>> From: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
>>>>
>>>> In an EAS enabled system, there are two scenarios discordant to current design,

I don't understand the EAS (probably asymmetric CPU capacity is meant
here) angle of the story. Pressure on CPU capacity which is usable for
CFS happens on SMP as well?

>>>>
>>>> 1. workload used to be heavy uneven among cores for sake of scheduler policy.
>>>> RT task usually preempts CFS task in little core.
>>>> 2. CFS task's memstall time is counted as simple as exit - entry so far, which
>>>> ignore the preempted time by RT, DL and Irqs.
>>>>
>>>> With these two constraints, the percpu nonidle time would be mainly consumed by
>>>> none CFS tasks and couldn't be averaged. Eliminating them by calc the time growth
>>>> via the proportion of cfs_rq's utilization on the whole rq.
>>>>
>>>> eg.
>>>> Here is the scenario which this commit want to fix, that is the rt and irq consume
>>>> some utilization of the whole rq. This scenario could be typical in a core
>>>> which is assigned to deal with all irqs. Furthermore, the rt task used to run on
>>>> little core under EAS.
>>>>
>>>> Binder:305_3-314    [002] d..1   257.880195: psi_memtime_fixup: original:30616,adjusted:25951,se:89,cfs:353,rt:139,dl:0,irq:18
>>>> droid.phone-1525    [001] d..1   265.145492: psi_memtime_fixup: original:61616,adjusted:53492,se:55,cfs:225,rt:121,dl:0,irq:15
>>>>
>>>> Signed-off-by: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
>>>> ---
>>>>  kernel/sched/psi.c | 20 +++++++++++++++++++-
>>>>  1 file changed, 19 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
>>>> index cc25a3c..754a836 100644
>>>> --- a/kernel/sched/psi.c
>>>> +++ b/kernel/sched/psi.c
>>>> @@ -182,6 +182,8 @@ struct psi_group psi_system = {
>>>>
>>>>  static void psi_avgs_work(struct work_struct *work);
>>>>
>>>> +static unsigned long psi_memtime_fixup(u32 growth);
>>>> +
>>>>  static void group_init(struct psi_group *group)
>>>>  {
>>>>       int cpu;
>>>> @@ -492,6 +494,21 @@ static u64 window_update(struct psi_window *win, u64 now, u64 value)
>>>>       return growth;
>>>>  }
>>>>
>>>> +static unsigned long psi_memtime_fixup(u32 growth)
>>>> +{
>>>> +     struct rq *rq = task_rq(current);
>>>> +     unsigned long growth_fixed = (unsigned long)growth;
>>>> +
>>>> +     if (!(current->policy == SCHED_NORMAL || current->policy == SCHED_BATCH))
>>>> +             return growth_fixed;

This will let the idle task (swapper) pass. Is this intended? Or do you
want to only let CFS tasks (including SCHED_IDLE) pass?

if (current->sched_class != &fair_sched_class)
    return growth_fixed;

>>>> +
>>>> +     if (current->in_memstall)
>>>> +             growth_fixed = div64_ul((1024 - rq->avg_rt.util_avg - rq->avg_dl.util_avg
>>>> +                                     - rq->avg_irq.util_avg + 1) * growth, 1024);
>>>> +

We do this slightly differently in scale_rt_capacity() [fair.c]:

max = arch_scale_cpu_capacity(cpu_of(rq)) /* instead of 1024 to support
                                             asymmetric CPU capacity */

used = cpu_util_rt(rq);
used += cpu_util_dl(rq);
used += thermal_load_avg(rq);

free = max - used
irq = cpu_util_irq(rq)

used = scale_irq_capacity(free, irq, max);

scaling then with: (max - used) / max
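
For comparison, a sketch of scale_rt_capacity() based on the description
above and the mainline fair.c of this era (exact details differ between
kernel versions; treat this as illustrative, not authoritative):

static unsigned long scale_rt_capacity(int cpu)
{
        struct rq *rq = cpu_rq(cpu);
        unsigned long max = arch_scale_cpu_capacity(cpu);
        unsigned long used, free, irq;

        irq = cpu_util_irq(rq);
        if (unlikely(irq >= max))
                return 1;

        used  = cpu_util_rt(rq);
        used += cpu_util_dl(rq);
        used += thermal_load_avg(rq);
        if (unlikely(used >= max))
                return 1;

        free = max - used;

        /* scale the capacity left for CFS by the non-IRQ share of the CPU */
        return scale_irq_capacity(free, irq, max);
}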

>>>> +     return growth_fixed;
>>>> +}
>>>> +
>>>>  static void init_triggers(struct psi_group *group, u64 now)
>>>>  {
>>>>       struct psi_trigger *t;
>>>> @@ -658,6 +675,7 @@ static void record_times(struct psi_group_cpu *groupc, u64 now)
>>>>       }
>>>>
>>>>       if (groupc->state_mask & (1 << PSI_MEM_SOME)) {
>>>> +             delta = psi_memtime_fixup(delta);
>>>
>> add vincent for advise on cpu load mechanism
>>
>>> Ok, so we want to deduct IRQ and RT preemption time from the memstall
>>> period of an active reclaimer, since it's technically not stalled on
>>> memory during this time but on CPU.
>>>
>>> However, we do NOT want to deduct IRQ and RT time from memstalls that
>>> are sleeping on refaults swapins, since they are not affected by what
>>> is going on on the CPU.
>>>
>>> Does util_avg capture that difference? I'm not confident it does - but
>>> correct me if I'm wrong. We need length of time during which and IRQ
>>> or an RT task preempted the old rq->curr, not absolute irq/rt length.
>> As far as my understanding, core's capacity = IRQ + DEADLINE + RT +
>> CFS. For a certain time period, all cfs tasks preempt each other
>> inside CFS's utilization. So the sleeping on refaults is counted in.
>>>
>>> (Btw, such preemption periods, in addition to being deducted from
>>> memory stalls, should probably also be added to CPU contention stalls,
>>> to make CPU pressure reporting more accurate as well.)
Zhaoyang Huang Nov. 5, 2021, 5:58 a.m. UTC | #5
On Thu, Nov 4, 2021 at 4:58 PM Dietmar Eggemann
<dietmar.eggemann@arm.com> wrote:
>
> On 03/11/2021 08:08, Zhaoyang Huang wrote:
> > +Vincent Guittot
> >
> > On Wed, Nov 3, 2021 at 3:07 PM Zhaoyang Huang <huangzhaoyang@gmail.com> wrote:
> >>
> >> On Wed, Nov 3, 2021 at 3:47 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
> >>>
> >>> CC peterz as well for rt and timekeeping magic
> >>>
> >>> On Fri, Oct 15, 2021 at 02:16:52PM +0800, Huangzhaoyang wrote:
> >>>> From: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
> >>>>
> >>>> In an EAS enabled system, there are two scenarios discordant to current design,
>
> I don't understand the EAS (probably asymmetric CPU capacity is meant
> here) angle of the story. Pressure on CPU capacity which is usable for
> CFS happens on SMP as well?
Mentioning EAS here is mainly about RT tasks preempting small CFS tasks
(big CFS tasks could be scheduled to the big cores), which introduces a
larger proportion of preempted time within PSI_MEM_STALL than SMP does.
>
> >>>>
> >>>> 1. workload used to be heavy uneven among cores for sake of scheduler policy.
> >>>> RT task usually preempts CFS task in little core.
> >>>> 2. CFS task's memstall time is counted as simple as exit - entry so far, which
> >>>> ignore the preempted time by RT, DL and Irqs.
> >>>>
> >>>> With these two constraints, the percpu nonidle time would be mainly consumed by
> >>>> none CFS tasks and couldn't be averaged. Eliminating them by calc the time growth
> >>>> via the proportion of cfs_rq's utilization on the whole rq.
> >>>>
> >>>> eg.
> >>>> Here is the scenario which this commit want to fix, that is the rt and irq consume
> >>>> some utilization of the whole rq. This scenario could be typical in a core
> >>>> which is assigned to deal with all irqs. Furthermore, the rt task used to run on
> >>>> little core under EAS.
> >>>>
> >>>> Binder:305_3-314    [002] d..1   257.880195: psi_memtime_fixup: original:30616,adjusted:25951,se:89,cfs:353,rt:139,dl:0,irq:18
> >>>> droid.phone-1525    [001] d..1   265.145492: psi_memtime_fixup: original:61616,adjusted:53492,se:55,cfs:225,rt:121,dl:0,irq:15
> >>>>
> >>>> Signed-off-by: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
> >>>> ---
> >>>>  kernel/sched/psi.c | 20 +++++++++++++++++++-
> >>>>  1 file changed, 19 insertions(+), 1 deletion(-)
> >>>>
> >>>> diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
> >>>> index cc25a3c..754a836 100644
> >>>> --- a/kernel/sched/psi.c
> >>>> +++ b/kernel/sched/psi.c
> >>>> @@ -182,6 +182,8 @@ struct psi_group psi_system = {
> >>>>
> >>>>  static void psi_avgs_work(struct work_struct *work);
> >>>>
> >>>> +static unsigned long psi_memtime_fixup(u32 growth);
> >>>> +
> >>>>  static void group_init(struct psi_group *group)
> >>>>  {
> >>>>       int cpu;
> >>>> @@ -492,6 +494,21 @@ static u64 window_update(struct psi_window *win, u64 now, u64 value)
> >>>>       return growth;
> >>>>  }
> >>>>
> >>>> +static unsigned long psi_memtime_fixup(u32 growth)
> >>>> +{
> >>>> +     struct rq *rq = task_rq(current);
> >>>> +     unsigned long growth_fixed = (unsigned long)growth;
> >>>> +
> >>>> +     if (!(current->policy == SCHED_NORMAL || current->policy == SCHED_BATCH))
> >>>> +             return growth_fixed;
>
> This will let the idle task (swapper) pass. Is this indented? Or do you
> want to only let CFS tasks (including SCHED_IDLE) pass?
idle tasks will NOT call psi_memstall_xxx. We just want CFS tasks to
scale the STALL time.
>
> if (current->sched_class != &fair_sched_class)
>     return growth_fixed;
>
> >>>> +
> >>>> +     if (current->in_memstall)
> >>>> +             growth_fixed = div64_ul((1024 - rq->avg_rt.util_avg - rq->avg_dl.util_avg
> >>>> +                                     - rq->avg_irq.util_avg + 1) * growth, 1024);
> >>>> +
>
> We do this slightly different in scale_rt_capacity() [fair.c]:
>
> max = arch_scale_cpu_capacity(cpu_of(rq) /* instead of 1024 to support
>                                             asymmetric CPU capacity */
Is it possible that the sum of the rq's util_avg is larger than
arch_scale_cpu_capacity() because of task migration and the like?
>
> used = cpu_util_rt(rq);
> used += cpu_util_dl(rq);
> used += thermal_load_avg(rq);
>
> free = max - used
> irq = cpu_util_irq(rq)
>
> used = scale_irq_capacity(free, irq, max);
>
> scaling then with with: max - used / max
OK, so the thermal utilization should be included as well.
>
> >>>> +     return growth_fixed;
> >>>> +}
> >>>> +
> >>>>  static void init_triggers(struct psi_group *group, u64 now)
> >>>>  {
> >>>>       struct psi_trigger *t;
> >>>> @@ -658,6 +675,7 @@ static void record_times(struct psi_group_cpu *groupc, u64 now)
> >>>>       }
> >>>>
> >>>>       if (groupc->state_mask & (1 << PSI_MEM_SOME)) {
> >>>> +             delta = psi_memtime_fixup(delta);
> >>>
> >> add vincent for advise on cpu load mechanism
> >>
> >>> Ok, so we want to deduct IRQ and RT preemption time from the memstall
> >>> period of an active reclaimer, since it's technically not stalled on
> >>> memory during this time but on CPU.
> >>>
> >>> However, we do NOT want to deduct IRQ and RT time from memstalls that
> >>> are sleeping on refaults swapins, since they are not affected by what
> >>> is going on on the CPU.
> >>>
> >>> Does util_avg capture that difference? I'm not confident it does - but
> >>> correct me if I'm wrong. We need length of time during which and IRQ
> >>> or an RT task preempted the old rq->curr, not absolute irq/rt length.
> >> As far as my understanding, core's capacity = IRQ + DEADLINE + RT +
> >> CFS. For a certain time period, all cfs tasks preempt each other
> >> inside CFS's utilization. So the sleeping on refaults is counted in.
> >>>
> >>> (Btw, such preemption periods, in addition to being deducted from
> >>> memory stalls, should probably also be added to CPU contention stalls,
> >>> to make CPU pressure reporting more accurate as well.)
>
Dietmar Eggemann Nov. 5, 2021, 4:42 p.m. UTC | #6
On 05/11/2021 06:58, Zhaoyang Huang wrote:
> On Thu, Nov 4, 2021 at 4:58 PM Dietmar Eggemann
> <dietmar.eggemann@arm.com> wrote:
>>
>> On 03/11/2021 08:08, Zhaoyang Huang wrote:
>>> +Vincent Guittot
>>>
>>> On Wed, Nov 3, 2021 at 3:07 PM Zhaoyang Huang <huangzhaoyang@gmail.com> wrote:
>>>>
>>>> On Wed, Nov 3, 2021 at 3:47 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>>>>>
>>>>> CC peterz as well for rt and timekeeping magic
>>>>>
>>>>> On Fri, Oct 15, 2021 at 02:16:52PM +0800, Huangzhaoyang wrote:
>>>>>> From: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
>>>>>>
>>>>>> In an EAS enabled system, there are two scenarios discordant to current design,
>>
>> I don't understand the EAS (probably asymmetric CPU capacity is meant
>> here) angle of the story. Pressure on CPU capacity which is usable for
>> CFS happens on SMP as well?
>  Mentioning EAS here mainly about RT tasks preempting small CFS tasks
> (big CFS tasks could be scheduled to big core), which would introduce
> more proportion of preempted time within PSI_MEM_STALL than SMP does.

What's your CPU layout? Do you have the little before the big CPUs? Like
Hikey 960?

root@linaro-developer:~# cat /sys/devices/system/cpu/cpu*/cpu_capacity
462
462
462
462
1024
1024
1024
1024

And I guess rt class prefers lower CPU numbers hence you see this?

>>>>>> 1. workload used to be heavy uneven among cores for sake of scheduler policy.
>>>>>> RT task usually preempts CFS task in little core.
>>>>>> 2. CFS task's memstall time is counted as simple as exit - entry so far, which
>>>>>> ignore the preempted time by RT, DL and Irqs.
>>>>>>
>>>>>> With these two constraints, the percpu nonidle time would be mainly consumed by
>>>>>> none CFS tasks and couldn't be averaged. Eliminating them by calc the time growth
>>>>>> via the proportion of cfs_rq's utilization on the whole rq.
>>>>>>
>>>>>> eg.
>>>>>> Here is the scenario which this commit want to fix, that is the rt and irq consume
>>>>>> some utilization of the whole rq. This scenario could be typical in a core
>>>>>> which is assigned to deal with all irqs. Furthermore, the rt task used to run on
>>>>>> little core under EAS.
>>>>>>
>>>>>> Binder:305_3-314    [002] d..1   257.880195: psi_memtime_fixup: original:30616,adjusted:25951,se:89,cfs:353,rt:139,dl:0,irq:18
>>>>>> droid.phone-1525    [001] d..1   265.145492: psi_memtime_fixup: original:61616,adjusted:53492,se:55,cfs:225,rt:121,dl:0,irq:15

[...]

>>>>>> @@ -492,6 +494,21 @@ static u64 window_update(struct psi_window *win, u64 now, u64 value)
>>>>>>       return growth;
>>>>>>  }
>>>>>>
>>>>>> +static unsigned long psi_memtime_fixup(u32 growth)
>>>>>> +{
>>>>>> +     struct rq *rq = task_rq(current);
>>>>>> +     unsigned long growth_fixed = (unsigned long)growth;
>>>>>> +
>>>>>> +     if (!(current->policy == SCHED_NORMAL || current->policy == SCHED_BATCH))
>>>>>> +             return growth_fixed;
>>
>> This will let the idle task (swapper) pass. Is this indented? Or do you
>> want to only let CFS tasks (including SCHED_IDLE) pass?
> idle tasks will NOT call psi_memstall_xxx. We just want CFS tasks to
> scale the STALL time.

Not sure I  get this.

__schedule() -> psi_sched_switch() -> psi_task_change() ->
psi_group_change() -> record_times() -> psi_memtime_fixup()

is something else than calling psi_memstall_enter() or _leave()?

IMHO, at least record_times() can be called with current equal
swapper/X. Or is it that PSI_MEM_SOME is never set for the idle task in
this callstack? I don't know the PSI internals.

>>
>> if (current->sched_class != &fair_sched_class)
>>     return growth_fixed;
>>
>>>>>> +
>>>>>> +     if (current->in_memstall)
>>>>>> +             growth_fixed = div64_ul((1024 - rq->avg_rt.util_avg - rq->avg_dl.util_avg
>>>>>> +                                     - rq->avg_irq.util_avg + 1) * growth, 1024);
>>>>>> +
>>
>> We do this slightly different in scale_rt_capacity() [fair.c]:
>>
>> max = arch_scale_cpu_capacity(cpu_of(rq) /* instead of 1024 to support
>>                                             asymmetric CPU capacity */
> Is it possible that the SUM of rqs' util_avg large than
> arch_scale_cpu_capacity because of task migration things?

I assume you meant if the rq (cpu_rq(CPUx)) util_avg sum (CFS, RT, DL,
IRQ and thermal part) can be larger than arch_scale_cpu_capacity(CPUx)?

Yes it can.

Have a look at

effective_cpu_util(..., max, ...) {

  if (foo >= max)
    return max;

}

Even the CFS part (cpu_rq(CPUx)->cfs.avg.util_avg) can be larger than
the original cpu capacity (rq->cpu_capacity_orig).

Have a look at cpu_util(). capacity_orig_of(CPUx) and
arch_scale_cpu_capacity(CPUx) both return rq->cpu_capacity_orig.
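
A rough sketch of the clamping referred to here, based on cpu_util() in
mainline fair.c of this era (version-dependent details omitted):

static inline unsigned long cpu_util(int cpu)
{
        struct cfs_rq *cfs_rq = &cpu_rq(cpu)->cfs;
        unsigned long util = READ_ONCE(cfs_rq->avg.util_avg);

        if (sched_feat(UTIL_EST))
                util = max_t(unsigned long, util,
                             READ_ONCE(cfs_rq->avg.util_est.enqueued));

        /* the raw PELT signal can exceed the CPU's capacity; clamp it */
        return min_t(unsigned long, util, capacity_orig_of(cpu));
}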

[...]
Xuewen Yan Nov. 8, 2021, 8:49 a.m. UTC | #7
Hi Dietmar

On Sat, Nov 6, 2021 at 1:20 AM Dietmar Eggemann
<dietmar.eggemann@arm.com> wrote:
>
> On 05/11/2021 06:58, Zhaoyang Huang wrote:
> >> I don't understand the EAS (probably asymmetric CPU capacity is meant
> >> here) angle of the story. Pressure on CPU capacity which is usable for
> >> CFS happens on SMP as well?
> >  Mentioning EAS here mainly about RT tasks preempting small CFS tasks
> > (big CFS tasks could be scheduled to big core), which would introduce
> > more proportion of preempted time within PSI_MEM_STALL than SMP does.
>
> What's your CPU layout? Do you have the little before the big CPUs? Like
> Hikey 960?
>
> root@linaro-developer:~# cat /sys/devices/system/cpu/cpu*/cpu_capacity
> 462
> 462
> 462
> 462
> 1024
> 1024
> 1024
> 1024
>
> And I guess rt class prefers lower CPU numbers hence you see this?
>
our CPU layout is:
xuewen.yan:/ # cat /sys/devices/system/cpu/cpu*/cpu_capacity
544
544
544
544
544
544
1024
1024

On our platform, the kernel is used in mobile phones running Android, and
we prefer power savings, so we prefer the RT class to run on the little cores.


> >>
> >> This will let the idle task (swapper) pass. Is this indented? Or do you
> >> want to only let CFS tasks (including SCHED_IDLE) pass?
> > idle tasks will NOT call psi_memstall_xxx. We just want CFS tasks to
> > scale the STALL time.
>
> Not sure I  get this.
>
> __schedule() -> psi_sched_switch() -> psi_task_change() ->
> psi_group_change() -> record_times() -> psi_memtime_fixup()
>
> is something else than calling psi_memstall_enter() or _leave()?
>
> IMHO, at least record_times() can be called with current equal
> swapper/X. Or is it that PSI_MEM_SOME is never set for the idle task in
> this callstack? I don't know the PSI internals.
>
> >>
> >> if (current->sched_class != &fair_sched_class)
> >>     return growth_fixed;
> >>
> >>>>>> +
> >>>>>> +     if (current->in_memstall)
> >>>>>> +             growth_fixed = div64_ul((1024 - rq->avg_rt.util_avg - rq->avg_dl.util_avg
> >>>>>> +                                     - rq->avg_irq.util_avg + 1) * growth, 1024);
> >>>>>> +
> >>
> >> We do this slightly different in scale_rt_capacity() [fair.c]:
> >>
> >> max = arch_scale_cpu_capacity(cpu_of(rq) /* instead of 1024 to support
> >>                                             asymmetric CPU capacity */
> > Is it possible that the SUM of rqs' util_avg large than
> > arch_scale_cpu_capacity because of task migration things?
>
> I assume you meant if the rq (cpu_rq(CPUx)) util_avg sum (CFS, RT, DL,
> IRQ and thermal part) can be larger than arch_scale_cpu_capacity(CPUx)?
>
> Yes it can.
>
> Have a lock at
>
> effective_cpu_util(..., max, ...) {
>
>   if (foo >= max)
>     return max;
>
> }
>
> Even the CFS part (cpu_rq(CPUx)->cfs.avg.util_avg) can be larger than
> the original cpu capacity (rq->cpu_capacity_orig).
>
> Have a look at cpu_util(). capacity_orig_of(CPUx) and
> arch_scale_cpu_capacity(CPUx) both returning rq->cpu_capacity_orig.
>

Well, do you mean we should not use 1024 and should use the original CPU
capacity instead?
And maybe using sched_cpu_util() is a good choice, just like this:

+       if (current->in_memstall)
+               growth_fixed = div64_ul(cpu_util_cfs(rq) * growth,
+                                       sched_cpu_util(rq->cpu, capacity_orig_of(rq->cpu)));

Thanks!

BR
xuewen
Zhaoyang Huang Nov. 8, 2021, 9:20 a.m. UTC | #8
On Mon, Nov 8, 2021 at 4:49 PM Xuewen Yan <xuewen.yan94@gmail.com> wrote:
>
> Hi Dietmar
>
> On Sat, Nov 6, 2021 at 1:20 AM Dietmar Eggemann
> <dietmar.eggemann@arm.com> wrote:
> >
> > On 05/11/2021 06:58, Zhaoyang Huang wrote:
> > >> I don't understand the EAS (probably asymmetric CPU capacity is meant
> > >> here) angle of the story. Pressure on CPU capacity which is usable for
> > >> CFS happens on SMP as well?
> > >  Mentioning EAS here mainly about RT tasks preempting small CFS tasks
> > > (big CFS tasks could be scheduled to big core), which would introduce
> > > more proportion of preempted time within PSI_MEM_STALL than SMP does.
> >
> > What's your CPU layout? Do you have the little before the big CPUs? Like
> > Hikey 960?
> >
> > root@linaro-developer:~# cat /sys/devices/system/cpu/cpu*/cpu_capacity
> > 462
> > 462
> > 462
> > 462
> > 1024
> > 1024
> > 1024
> > 1024
> >
> > And I guess rt class prefers lower CPU numbers hence you see this?
> >
> our CPU layout is:
> xuewen.yan:/ # cat /sys/devices/system/cpu/cpu*/cpu_capacity
> 544
> 544
> 544
> 544
> 544
> 544
> 1024
> 1024
>
> And in our platform, we use the kernel in mobile phones with Android.
> And we prefer power, so we prefer the RT class to run on little cores.
>
>
> > >>
> > >> This will let the idle task (swapper) pass. Is this indented? Or do you
> > >> want to only let CFS tasks (including SCHED_IDLE) pass?
> > > idle tasks will NOT call psi_memstall_xxx. We just want CFS tasks to
> > > scale the STALL time.
> >
> > Not sure I  get this.
> >
> > __schedule() -> psi_sched_switch() -> psi_task_change() ->
> > psi_group_change() -> record_times() -> psi_memtime_fixup()
> >
> > is something else than calling psi_memstall_enter() or _leave()?
> >
> > IMHO, at least record_times() can be called with current equal
> > swapper/X. Or is it that PSI_MEM_SOME is never set for the idle task in
> > this callstack? I don't know the PSI internals.
According to my understanding, PSI_MEM_SOME represents the core's state in
which there is at least one task trapped in the memstall path (only counted
by calling psi_memstall_enter()). record_times() is responsible for
collecting the core's delta time since that state started. What we are
doing is making the delta time more precise, so the idle task is
irrelevant here.
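
For readers not familiar with the PSI internals being discussed, a trimmed
sketch of how a reclaimer marks a memstall, based on kernel/sched/psi.c
(the psi_disabled check and comments are shortened):

void psi_memstall_enter(unsigned long *flags)
{
        struct rq_flags rf;
        struct rq *rq;

        *flags = current->in_memstall;
        if (*flags)
                return;

        /* keep in_memstall changes atomic wrt scheduling state changes */
        rq = this_rq_lock_irq(&rf);

        current->in_memstall = 1;
        psi_task_change(current, 0, TSK_MEMSTALL);

        rq_unlock_irq(rq, &rf);
}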

> >
> > >>
> > >> if (current->sched_class != &fair_sched_class)
> > >>     return growth_fixed;
> > >>
> > >>>>>> +
> > >>>>>> +     if (current->in_memstall)
> > >>>>>> +             growth_fixed = div64_ul((1024 - rq->avg_rt.util_avg - rq->avg_dl.util_avg
> > >>>>>> +                                     - rq->avg_irq.util_avg + 1) * growth, 1024);
> > >>>>>> +
> > >>
> > >> We do this slightly different in scale_rt_capacity() [fair.c]:
> > >>
> > >> max = arch_scale_cpu_capacity(cpu_of(rq) /* instead of 1024 to support
> > >>                                             asymmetric CPU capacity */
> > > Is it possible that the SUM of rqs' util_avg large than
> > > arch_scale_cpu_capacity because of task migration things?
> >
> > I assume you meant if the rq (cpu_rq(CPUx)) util_avg sum (CFS, RT, DL,
> > IRQ and thermal part) can be larger than arch_scale_cpu_capacity(CPUx)?
> >
> > Yes it can.
> >
> > Have a lock at
> >
> > effective_cpu_util(..., max, ...) {
> >
> >   if (foo >= max)
> >     return max;
> >
> > }
> >
> > Even the CFS part (cpu_rq(CPUx)->cfs.avg.util_avg) can be larger than
> > the original cpu capacity (rq->cpu_capacity_orig).
> >
> > Have a look at cpu_util(). capacity_orig_of(CPUx) and
> > arch_scale_cpu_capacity(CPUx) both returning rq->cpu_capacity_orig.
> >
>
> Well, your means is we should not use the 1024 and should use the
> original cpu capacity?
> And maybe use the "sched_cpu_util()" is a good choice just like this:
>
> +       if (current->in_memstall)
> +               growth_fixed = div64_ul(cpu_util_cfs(rq) * growth,
> sched_cpu_util(rq->cpu, capacity_orig_of(rq->cpu)));
>
> Thanks!
>
> BR
> xuewen
Dietmar Eggemann Nov. 9, 2021, 9:43 a.m. UTC | #9
On 08/11/2021 09:49, Xuewen Yan wrote:
> Hi Dietmar
> 
> On Sat, Nov 6, 2021 at 1:20 AM Dietmar Eggemann
> <dietmar.eggemann@arm.com> wrote:
>>
>> On 05/11/2021 06:58, Zhaoyang Huang wrote:
>>>> I don't understand the EAS (probably asymmetric CPU capacity is meant
>>>> here) angle of the story. Pressure on CPU capacity which is usable for
>>>> CFS happens on SMP as well?
>>>  Mentioning EAS here mainly about RT tasks preempting small CFS tasks
>>> (big CFS tasks could be scheduled to big core), which would introduce
>>> more proportion of preempted time within PSI_MEM_STALL than SMP does.
>>
>> What's your CPU layout? Do you have the little before the big CPUs? Like
>> Hikey 960?

[...]

>> And I guess rt class prefers lower CPU numbers hence you see this?
>>
> our CPU layout is:
> xuewen.yan:/ # cat /sys/devices/system/cpu/cpu*/cpu_capacity
> 544
> 544
> 544
> 544
> 544
> 544
> 1024
> 1024
> 
> And in our platform, we use the kernel in mobile phones with Android.
> And we prefer power, so we prefer the RT class to run on little cores.

Ah, OK, out-of-tree extensions.

[...]

>>>>>>>> +     if (current->in_memstall)
>>>>>>>> +             growth_fixed = div64_ul((1024 - rq->avg_rt.util_avg - rq->avg_dl.util_avg
>>>>>>>> +                                     - rq->avg_irq.util_avg + 1) * growth, 1024);
>>>>>>>> +
>>>>
>>>> We do this slightly different in scale_rt_capacity() [fair.c]:
>>>>
>>>> max = arch_scale_cpu_capacity(cpu_of(rq) /* instead of 1024 to support
>>>>                                             asymmetric CPU capacity */
>>> Is it possible that the SUM of rqs' util_avg large than
>>> arch_scale_cpu_capacity because of task migration things?
>>
>> I assume you meant if the rq (cpu_rq(CPUx)) util_avg sum (CFS, RT, DL,
>> IRQ and thermal part) can be larger than arch_scale_cpu_capacity(CPUx)?
>>
>> Yes it can.
>>
>> Have a lock at
>>
>> effective_cpu_util(..., max, ...) {
>>
>>   if (foo >= max)
>>     return max;
>>
>> }
>>
>> Even the CFS part (cpu_rq(CPUx)->cfs.avg.util_avg) can be larger than
>> the original cpu capacity (rq->cpu_capacity_orig).
>>
>> Have a look at cpu_util(). capacity_orig_of(CPUx) and
>> arch_scale_cpu_capacity(CPUx) both returning rq->cpu_capacity_orig.
>>
> 
> Well, your means is we should not use the 1024 and should use the
> original cpu capacity?
> And maybe use the "sched_cpu_util()" is a good choice just like this:
> 
> +       if (current->in_memstall)
> +               growth_fixed = div64_ul(cpu_util_cfs(rq) * growth,
> sched_cpu_util(rq->cpu, capacity_orig_of(rq->cpu)));

Not sure about this. In case util_cfs=0 you would get scale=0.

IMHO, you need

cap      = rq->cpu_capacity
cap_orig = rq->cpu_capacity_orig

scale = (cap * X) / cap_orig

or, if the update of these rq values happens too infrequently for you, then
you have to calculate the pressure every time. Something like:

pressure = cpu_util_rt(rq) + cpu_util_dl(rq)

irq = cpu_util_irq(rq)

if (irq >= cap_orig)
    pressure = cap_orig
else
    pressure = scale_irq_capacity(pressure, irq, cap_orig)
    pressure += irq

scale = ((cap_orig - pressure) * X) / cap_orig
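
Put together as one helper (hypothetical name, untested sketch that simply
follows the steps above):

static unsigned long scale_by_cfs_share(struct rq *rq, unsigned long x)
{
        unsigned long cap_orig = rq->cpu_capacity_orig;
        unsigned long irq = cpu_util_irq(rq);
        unsigned long pressure;

        if (irq >= cap_orig)
                return 0;               /* nothing left for CFS */

        pressure = cpu_util_rt(rq) + cpu_util_dl(rq);
        pressure = scale_irq_capacity(pressure, irq, cap_orig);
        pressure += irq;

        if (pressure >= cap_orig)
                return 0;

        return ((cap_orig - pressure) * x) / cap_orig;
}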
Dietmar Eggemann Nov. 9, 2021, 12:29 p.m. UTC | #10
On 08/11/2021 10:20, Zhaoyang Huang wrote:
> On Mon, Nov 8, 2021 at 4:49 PM Xuewen Yan <xuewen.yan94@gmail.com> wrote:
>>
>> Hi Dietmar
>>
>> On Sat, Nov 6, 2021 at 1:20 AM Dietmar Eggemann
>> <dietmar.eggemann@arm.com> wrote:
>>>
>>> On 05/11/2021 06:58, Zhaoyang Huang wrote:

[...]

>>>>> This will let the idle task (swapper) pass. Is this indented? Or do you
>>>>> want to only let CFS tasks (including SCHED_IDLE) pass?
>>>> idle tasks will NOT call psi_memstall_xxx. We just want CFS tasks to
>>>> scale the STALL time.
>>>
>>> Not sure I  get this.
>>>
>>> __schedule() -> psi_sched_switch() -> psi_task_change() ->
>>> psi_group_change() -> record_times() -> psi_memtime_fixup()
>>>
>>> is something else than calling psi_memstall_enter() or _leave()?
>>>
>>> IMHO, at least record_times() can be called with current equal
>>> swapper/X. Or is it that PSI_MEM_SOME is never set for the idle task in
>>> this callstack? I don't know the PSI internals.
> According to my understanding, PSI_MEM_SOME represents the CORE's
> state within which there is at least one task trapped in memstall
> path(only counted in by calling PSI_MEMSTALL_ENTER). record_times is
> responsible for collecting the delta time of the CORE since it start.
> What we are doing is to make the delta time more precise. So idle task
> is irrelevant for these.

Coming back to the original snippet of the patch.

static unsigned long psi_memtime_fixup(u32 growth)
{

    if (!(current->policy == SCHED_NORMAL ||
          current->policy == SCHED_BATCH))
        return growth_fixed;

With this condition:

(1) you're not bailing when current is the idle task. It has policy
    equal 0 (SCHED_NORMAL)

(2) But you're bailing for a SCHED_IDLE (CFS) task.

I'm not sure that this is intended here?

Since you want to do the scaling later based on what's left for CFS tasks
from the CPU capacity, my hunch is that you rather want to do:

    if (current->sched_class != &fair_sched_class)
        return growth_fixed;

What are the possible sched classes of current in psi_memtime_fixup()?
Peter Zijlstra Nov. 9, 2021, 2:56 p.m. UTC | #11
On Tue, Nov 02, 2021 at 03:47:33PM -0400, Johannes Weiner wrote:
> CC peterz as well for rt and timekeeping magic
> 
> On Fri, Oct 15, 2021 at 02:16:52PM +0800, Huangzhaoyang wrote:
> > From: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
> > 
> > In an EAS enabled system, there are two scenarios discordant to current design,
> > 
> > 1. workload used to be heavy uneven among cores for sake of scheduler policy.
> > RT task usually preempts CFS task in little core.
> > 2. CFS task's memstall time is counted as simple as exit - entry so far, which
> > ignore the preempted time by RT, DL and Irqs.

It ignores preemption full-stop. I don't see why RT/IRQ should be
special cased here.

> > With these two constraints, the percpu nonidle time would be mainly consumed by
> > none CFS tasks and couldn't be averaged. Eliminating them by calc the time growth
> > via the proportion of cfs_rq's utilization on the whole rq.


> > +static unsigned long psi_memtime_fixup(u32 growth)
> > +{
> > +	struct rq *rq = task_rq(current);
> > +	unsigned long growth_fixed = (unsigned long)growth;
> > +
> > +	if (!(current->policy == SCHED_NORMAL || current->policy == SCHED_BATCH))
> > +		return growth_fixed;
> > +
> > +	if (current->in_memstall)
> > +		growth_fixed = div64_ul((1024 - rq->avg_rt.util_avg - rq->avg_dl.util_avg
> > +					- rq->avg_irq.util_avg + 1) * growth, 1024);
> > +
> > +	return growth_fixed;
> > +}
> > +
> >  static void init_triggers(struct psi_group *group, u64 now)
> >  {
> >  	struct psi_trigger *t;
> > @@ -658,6 +675,7 @@ static void record_times(struct psi_group_cpu *groupc, u64 now)
> >  	}
> >  
> >  	if (groupc->state_mask & (1 << PSI_MEM_SOME)) {
> > +		delta = psi_memtime_fixup(delta);
> 
> Ok, so we want to deduct IRQ and RT preemption time from the memstall
> period of an active reclaimer, since it's technically not stalled on
> memory during this time but on CPU.
> 
> However, we do NOT want to deduct IRQ and RT time from memstalls that
> are sleeping on refaults swapins, since they are not affected by what
> is going on on the CPU.

I think that the focus on RT/IRQ is misguided here, and the implementation
is horrendous.

So the fundamental question seems to be (and I think Johannes is the one
to answer that): what time base do these metrics want to use?

Do some of these states want to account in task-time instead of
wall-time perhaps? I can't quite remember, but vague memories are
telling me most of the PSI accounting was about blocked tasks, not
running tasks, which makes all this rather more complicated.

Randomly scaling time as proposed seems almost certainly wrong. What
would that make the stats mean?
Zhaoyang Huang Nov. 10, 2021, 1:37 a.m. UTC | #12
On Tue, Nov 9, 2021 at 10:56 PM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Tue, Nov 02, 2021 at 03:47:33PM -0400, Johannes Weiner wrote:
> > CC peterz as well for rt and timekeeping magic
> >
> > On Fri, Oct 15, 2021 at 02:16:52PM +0800, Huangzhaoyang wrote:
> > > From: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
> > >
> > > In an EAS enabled system, there are two scenarios discordant to current design,
> > >
> > > 1. workload used to be heavy uneven among cores for sake of scheduler policy.
> > > RT task usually preempts CFS task in little core.
> > > 2. CFS task's memstall time is counted as simple as exit - entry so far, which
> > > ignore the preempted time by RT, DL and Irqs.
>
> It ignores preemption full-stop. I don't see why RT/IRQ should be
> special cased here.
As Johannes commented, what we are trying to solve is mainly the time a
CFS task is preempted by RT/IRQ, NOT the RT/IRQ time itself.
Could you please catch up on Dietmar's recent reply, which may provide
more information.
>
> > > With these two constraints, the percpu nonidle time would be mainly consumed by
> > > none CFS tasks and couldn't be averaged. Eliminating them by calc the time growth
> > > via the proportion of cfs_rq's utilization on the whole rq.
>
>
> > > +static unsigned long psi_memtime_fixup(u32 growth)
> > > +{
> > > +   struct rq *rq = task_rq(current);
> > > +   unsigned long growth_fixed = (unsigned long)growth;
> > > +
> > > +   if (!(current->policy == SCHED_NORMAL || current->policy == SCHED_BATCH))
> > > +           return growth_fixed;
> > > +
> > > +   if (current->in_memstall)
> > > +           growth_fixed = div64_ul((1024 - rq->avg_rt.util_avg - rq->avg_dl.util_avg
> > > +                                   - rq->avg_irq.util_avg + 1) * growth, 1024);
> > > +
> > > +   return growth_fixed;
> > > +}
> > > +
> > >  static void init_triggers(struct psi_group *group, u64 now)
> > >  {
> > >     struct psi_trigger *t;
> > > @@ -658,6 +675,7 @@ static void record_times(struct psi_group_cpu *groupc, u64 now)
> > >     }
> > >
> > >     if (groupc->state_mask & (1 << PSI_MEM_SOME)) {
> > > +           delta = psi_memtime_fixup(delta);
> >
> > Ok, so we want to deduct IRQ and RT preemption time from the memstall
> > period of an active reclaimer, since it's technically not stalled on
> > memory during this time but on CPU.
> >
> > However, we do NOT want to deduct IRQ and RT time from memstalls that
> > are sleeping on refaults swapins, since they are not affected by what
> > is going on on the CPU.
>
> I think that focus on RT/IRQ is mis-guided here, and the implementation
> is horrendous.
>
> So the fundamental question seems to be; and I think Johannes is the one
> to answer that: What time-base do these metrics want to use?
>
> Do some of these states want to account in task-time instead of
> wall-time perhaps? I can't quite remember, but vague memories are
> telling me most of the PSI accounting was about blocked tasks, not
> running tasks, which makes all this rather more complicated.
Memstall time is counted as exit - enter, which includes both the blocked
and the running state. However, we think the blocked time introduced by
RT/IRQ/DL preemption is memstall-irrelevant (it should be eliminated),
while the time preempted by other CFS tasks could be counted. Thanks to the
load-tracking mechanism, the implementation can be simple: calculate the
proportion of CFS utilization in the whole core's capacity.
>
> Randomly scaling time as proposed seems almost certainly wrong. What
> would that make the stats mean?
It is NOT random scaling; the delta is scaled in each record_times() call for CFS tasks.
Xuewen Yan Nov. 10, 2021, 5:36 a.m. UTC | #13
Hi Dietmar
On Tue, Nov 9, 2021 at 5:43 PM Dietmar Eggemann
<dietmar.eggemann@arm.com> wrote:
>
> On 08/11/2021 09:49, Xuewen Yan wrote:
> > Hi Dietmar
> >
> > On Sat, Nov 6, 2021 at 1:20 AM Dietmar Eggemann
> > <dietmar.eggemann@arm.com> wrote:
> >>
> >> On 05/11/2021 06:58, Zhaoyang Huang wrote:
> >>>> I don't understand the EAS (probably asymmetric CPU capacity is meant
> >>>> here) angle of the story. Pressure on CPU capacity which is usable for
> >>>> CFS happens on SMP as well?
> >>>  Mentioning EAS here mainly about RT tasks preempting small CFS tasks
> >>> (big CFS tasks could be scheduled to big core), which would introduce
> >>> more proportion of preempted time within PSI_MEM_STALL than SMP does.
> >>
> >> What's your CPU layout? Do you have the little before the big CPUs? Like
> >> Hikey 960?
>
> [...]
>
> >> And I guess rt class prefers lower CPU numbers hence you see this?
> >>
> > our CPU layout is:
> > xuewen.yan:/ # cat /sys/devices/system/cpu/cpu*/cpu_capacity
> > 544
> > 544
> > 544
> > 544
> > 544
> > 544
> > 1024
> > 1024
> >
> > And in our platform, we use the kernel in mobile phones with Android.
> > And we prefer power, so we prefer the RT class to run on little cores.
>
> Ah, OK, out-of-tree extensions.
>
> [...]
>
> >>>>>>>> +     if (current->in_memstall)
> >>>>>>>> +             growth_fixed = div64_ul((1024 - rq->avg_rt.util_avg - rq->avg_dl.util_avg
> >>>>>>>> +                                     - rq->avg_irq.util_avg + 1) * growth, 1024);
> >>>>>>>> +
> >>>>
> >>>> We do this slightly different in scale_rt_capacity() [fair.c]:
> >>>>
> >>>> max = arch_scale_cpu_capacity(cpu_of(rq) /* instead of 1024 to support
> >>>>                                             asymmetric CPU capacity */
> >>> Is it possible that the SUM of rqs' util_avg large than
> >>> arch_scale_cpu_capacity because of task migration things?
> >>
> >> I assume you meant if the rq (cpu_rq(CPUx)) util_avg sum (CFS, RT, DL,
> >> IRQ and thermal part) can be larger than arch_scale_cpu_capacity(CPUx)?
> >>
> >> Yes it can.
> >>
> >> Have a lock at
> >>
> >> effective_cpu_util(..., max, ...) {
> >>
> >>   if (foo >= max)
> >>     return max;
> >>
> >> }
> >>
> >> Even the CFS part (cpu_rq(CPUx)->cfs.avg.util_avg) can be larger than
> >> the original cpu capacity (rq->cpu_capacity_orig).
> >>
> >> Have a look at cpu_util(). capacity_orig_of(CPUx) and
> >> arch_scale_cpu_capacity(CPUx) both returning rq->cpu_capacity_orig.
> >>
> >
> > Well, your means is we should not use the 1024 and should use the
> > original cpu capacity?
> > And maybe use the "sched_cpu_util()" is a good choice just like this:
> >
> > +       if (current->in_memstall)
> > +               growth_fixed = div64_ul(cpu_util_cfs(rq) * growth,
> > sched_cpu_util(rq->cpu, capacity_orig_of(rq->cpu)));
>
> Not sure about this. In case util_cfs=0 you would get scale=0.

Yes, we should consider that. In addition, the case where util_cfs >
capacity_orig because of UTIL_EST should also be considered...

>
> IMHO, you need
>
> cap      = rq->cpu_capacity
> cap_orig = rq->cpu_capacity_orig
>
> scale = (cap * X) / cap_orig
>
> or if the update of these rq values happens to infrequently for you then
> you have to calc the pressure evey time. Something like:
>
> pressure = cpu_util_rt(rq) + cpu_util_dl(rq)
>
> irq = cpu_util_irq(rq)
>
> if (irq >= cap_orig)
>     pressure = cap_orig
> else
>     pressure = scale_irq_capacity(pressure, irq, cap_orig)
>     pressure += irq
>
> scale = ((cap_orig - pressure) * X) / cap_orig

Why rescale the util there? sched_cpu_util() would invoke
effective_cpu_util(), and I don't think it's necessary to rescale it.

Thanks!
Xuewen Yan Nov. 10, 2021, 5:38 a.m. UTC | #14
On Tue, Nov 9, 2021 at 8:29 PM Dietmar Eggemann
<dietmar.eggemann@arm.com> wrote:
>
> On 08/11/2021 10:20, Zhaoyang Huang wrote:
> > On Mon, Nov 8, 2021 at 4:49 PM Xuewen Yan <xuewen.yan94@gmail.com> wrote:
> >>
> >> Hi Dietmar
> >>
> >> On Sat, Nov 6, 2021 at 1:20 AM Dietmar Eggemann
> >> <dietmar.eggemann@arm.com> wrote:
> >>>
> >>> On 05/11/2021 06:58, Zhaoyang Huang wrote:
>
> [...]
>
> >>>>> This will let the idle task (swapper) pass. Is this indented? Or do you
> >>>>> want to only let CFS tasks (including SCHED_IDLE) pass?
> >>>> idle tasks will NOT call psi_memstall_xxx. We just want CFS tasks to
> >>>> scale the STALL time.
> >>>
> >>> Not sure I  get this.
> >>>
> >>> __schedule() -> psi_sched_switch() -> psi_task_change() ->
> >>> psi_group_change() -> record_times() -> psi_memtime_fixup()
> >>>
> >>> is something else than calling psi_memstall_enter() or _leave()?
> >>>
> >>> IMHO, at least record_times() can be called with current equal
> >>> swapper/X. Or is it that PSI_MEM_SOME is never set for the idle task in
> >>> this callstack? I don't know the PSI internals.
> > According to my understanding, PSI_MEM_SOME represents the CORE's
> > state within which there is at least one task trapped in memstall
> > path(only counted in by calling PSI_MEMSTALL_ENTER). record_times is
> > responsible for collecting the delta time of the CORE since it start.
> > What we are doing is to make the delta time more precise. So idle task
> > is irrelevant for these.
>
> Coming back to the original snippet of the patch.
>
> static unsigned long psi_memtime_fixup(u32 growth)
> {
>
>     if (!(current->policy == SCHED_NORMAL ||
>           current->policy == SCHED_BATCH))
>         return growth_fixed;
>
> With this condition:
>
> (1) you're not bailing when current is the idle task. It has policy
>     equal 0 (SCHED_NORMAL)
>
> (2) But you're bailing for a SCHED_IDLE (CFS) task.
>
> I'm not sure that this is indented here?
>
> Since you want to do the scaling later based on whats left for CFS tasks
> from the CPU capacity my hunch is that you want to rather do:
>
>     if (current->sched_class != &fair_sched_class)
>         return growth_fixed;

Yes, it is the correct method!

Thanks!
>
> What's the possible sched classes of current in psi_memtime_fixup?
Peter Zijlstra Nov. 10, 2021, 8:36 a.m. UTC | #15
On Wed, Nov 10, 2021 at 09:37:00AM +0800, Zhaoyang Huang wrote:
> On Tue, Nov 9, 2021 at 10:56 PM Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > On Tue, Nov 02, 2021 at 03:47:33PM -0400, Johannes Weiner wrote:
> > > CC peterz as well for rt and timekeeping magic
> > >
> > > On Fri, Oct 15, 2021 at 02:16:52PM +0800, Huangzhaoyang wrote:
> > > > From: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
> > > >
> > > > In an EAS enabled system, there are two scenarios discordant to current design,
> > > >
> > > > 1. workload used to be heavy uneven among cores for sake of scheduler policy.
> > > > RT task usually preempts CFS task in little core.
> > > > 2. CFS task's memstall time is counted as simple as exit - entry so far, which
> > > > ignore the preempted time by RT, DL and Irqs.
> >
> > It ignores preemption full-stop. I don't see why RT/IRQ should be
> > special cased here.
> As Johannes comments, what we are trying to solve is mainly the
> preempted time of the CFS task by RT/IRQ, NOT the RT/IRQ themselves.
> Could you please catch up the recent reply of Dietmar, which maybe
> provide more information.

In that case NAK.
Zhaoyang Huang Nov. 10, 2021, 8:47 a.m. UTC | #16
On Wed, Nov 10, 2021 at 4:36 PM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Wed, Nov 10, 2021 at 09:37:00AM +0800, Zhaoyang Huang wrote:
> > On Tue, Nov 9, 2021 at 10:56 PM Peter Zijlstra <peterz@infradead.org> wrote:
> > >
> > > On Tue, Nov 02, 2021 at 03:47:33PM -0400, Johannes Weiner wrote:
> > > > CC peterz as well for rt and timekeeping magic
> > > >
> > > > On Fri, Oct 15, 2021 at 02:16:52PM +0800, Huangzhaoyang wrote:
> > > > > From: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
> > > > >
> > > > > In an EAS enabled system, there are two scenarios discordant to current design,
> > > > >
> > > > > 1. workload used to be heavy uneven among cores for sake of scheduler policy.
> > > > > RT task usually preempts CFS task in little core.
> > > > > 2. CFS task's memstall time is counted as simple as exit - entry so far, which
> > > > > ignore the preempted time by RT, DL and Irqs.
> > >
> > > It ignores preemption full-stop. I don't see why RT/IRQ should be
> > > special cased here.
> > As Johannes commented, what we are trying to solve is mainly the time
> > the CFS task is preempted by RT/IRQ, NOT the RT/IRQ time itself.
> > Could you please catch up on Dietmar's recent reply, which may
> > provide more information.
>
> In that case NAK.
Would you please explain whether there is any constraint that prevents us
from doing so? We do think eliminating the preempted time is reasonable and
doable, since that time is not memory related but rather reflects a lack of
CPU, etc.
Vincent Guittot Nov. 10, 2021, 8:49 a.m. UTC | #17
On Tue, 9 Nov 2021 at 15:56, Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Tue, Nov 02, 2021 at 03:47:33PM -0400, Johannes Weiner wrote:
> > CC peterz as well for rt and timekeeping magic
> >
> > On Fri, Oct 15, 2021 at 02:16:52PM +0800, Huangzhaoyang wrote:
> > > From: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
> > >
> > > In an EAS enabled system, there are two scenarios discordant to current design,
> > >
> > > 1. workload used to be heavy uneven among cores for sake of scheduler policy.
> > > RT task usually preempts CFS task in little core.
> > > 2. CFS task's memstall time is counted as simple as exit - entry so far, which
> > > ignore the preempted time by RT, DL and Irqs.
>
> It ignores preemption full-stop. I don't see why RT/IRQ should be
> special cased here.
>
> > > With these two constraints, the percpu nonidle time would be mainly consumed by
> > > none CFS tasks and couldn't be averaged. Eliminating them by calc the time growth
> > > via the proportion of cfs_rq's utilization on the whole rq.
>
>
> > > +static unsigned long psi_memtime_fixup(u32 growth)
> > > +{
> > > +   struct rq *rq = task_rq(current);
> > > +   unsigned long growth_fixed = (unsigned long)growth;
> > > +
> > > +   if (!(current->policy == SCHED_NORMAL || current->policy == SCHED_BATCH))
> > > +           return growth_fixed;
> > > +
> > > +   if (current->in_memstall)
> > > +           growth_fixed = div64_ul((1024 - rq->avg_rt.util_avg - rq->avg_dl.util_avg
> > > +                                   - rq->avg_irq.util_avg + 1) * growth, 1024);
> > > +
> > > +   return growth_fixed;
> > > +}
> > > +
> > >  static void init_triggers(struct psi_group *group, u64 now)
> > >  {
> > >     struct psi_trigger *t;
> > > @@ -658,6 +675,7 @@ static void record_times(struct psi_group_cpu *groupc, u64 now)
> > >     }
> > >
> > >     if (groupc->state_mask & (1 << PSI_MEM_SOME)) {
> > > +           delta = psi_memtime_fixup(delta);
> >
> > Ok, so we want to deduct IRQ and RT preemption time from the memstall
> > period of an active reclaimer, since it's technically not stalled on
> > memory during this time but on CPU.
> >
> > However, we do NOT want to deduct IRQ and RT time from memstalls that
> > are sleeping on refaults swapins, since they are not affected by what
> > is going on on the CPU.
>
> I think that focus on RT/IRQ is mis-guided here, and the implementation
> is horrendous.
>
> So the fundamental question seems to be; and I think Johannes is the one
> to answer that: What time-base do these metrics want to use?
>
> Do some of these states want to account in task-time instead of
> wall-time perhaps? I can't quite remember, but vague memories are
> telling me most of the PSI accounting was about blocked tasks, not
> running tasks, which makes all this rather more complicated.

I tend to agree with this.
Using rq_clock_task(rq) instead of cpu_clock(cpu) will, for example, remove
the time spent under interrupt, and AFAICT rq->clock_task is updated before
the psi functions are called.

>
> Randomly scaling time as proposed seems almost certainly wrong. What
> would that make the stats mean?
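As a rough sketch of that direction (illustrative only; the helper name is made up and this is not posted code), the psi accounting could take its time base from the rq's task clock instead of cpu_clock():

/* Hypothetical helper: use the rq's task clock as the psi time base. */
static u64 psi_now(struct rq *rq)
{
	/*
	 * rq->clock_task has IRQ (and steal) time removed when IRQ time
	 * accounting is enabled, so deltas measured against it do not
	 * include time spent in interrupts. Time during which a CFS task
	 * is preempted by RT/DL still counts, since those tasks also
	 * advance clock_task. Per the note above, the clock is updated
	 * before the psi hooks run in the scheduler paths.
	 */
	return rq_clock_task(rq);
}

Callers such as psi_task_change()/record_times() would then be handed this value instead of cpu_clock(cpu).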
Zhaoyang Huang Nov. 10, 2021, 9:04 a.m. UTC | #18
On Wed, Nov 10, 2021 at 4:49 PM Vincent Guittot
<vincent.guittot@linaro.org> wrote:
>
> On Tue, 9 Nov 2021 at 15:56, Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > On Tue, Nov 02, 2021 at 03:47:33PM -0400, Johannes Weiner wrote:
> > > CC peterz as well for rt and timekeeping magic
> > >
> > > On Fri, Oct 15, 2021 at 02:16:52PM +0800, Huangzhaoyang wrote:
> > > > From: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
> > > >
> > > > In an EAS enabled system, there are two scenarios discordant to current design,
> > > >
> > > > 1. workload used to be heavy uneven among cores for sake of scheduler policy.
> > > > RT task usually preempts CFS task in little core.
> > > > 2. CFS task's memstall time is counted as simple as exit - entry so far, which
> > > > ignore the preempted time by RT, DL and Irqs.
> >
> > It ignores preemption full-stop. I don't see why RT/IRQ should be
> > special cased here.
> >
> > > > With these two constraints, the percpu nonidle time would be mainly consumed by
> > > > none CFS tasks and couldn't be averaged. Eliminating them by calc the time growth
> > > > via the proportion of cfs_rq's utilization on the whole rq.
> >
> >
> > > > +static unsigned long psi_memtime_fixup(u32 growth)
> > > > +{
> > > > +   struct rq *rq = task_rq(current);
> > > > +   unsigned long growth_fixed = (unsigned long)growth;
> > > > +
> > > > +   if (!(current->policy == SCHED_NORMAL || current->policy == SCHED_BATCH))
> > > > +           return growth_fixed;
> > > > +
> > > > +   if (current->in_memstall)
> > > > +           growth_fixed = div64_ul((1024 - rq->avg_rt.util_avg - rq->avg_dl.util_avg
> > > > +                                   - rq->avg_irq.util_avg + 1) * growth, 1024);
> > > > +
> > > > +   return growth_fixed;
> > > > +}
> > > > +
> > > >  static void init_triggers(struct psi_group *group, u64 now)
> > > >  {
> > > >     struct psi_trigger *t;
> > > > @@ -658,6 +675,7 @@ static void record_times(struct psi_group_cpu *groupc, u64 now)
> > > >     }
> > > >
> > > >     if (groupc->state_mask & (1 << PSI_MEM_SOME)) {
> > > > +           delta = psi_memtime_fixup(delta);
> > >
> > > Ok, so we want to deduct IRQ and RT preemption time from the memstall
> > > period of an active reclaimer, since it's technically not stalled on
> > > memory during this time but on CPU.
> > >
> > > However, we do NOT want to deduct IRQ and RT time from memstalls that
> > > are sleeping on refaults swapins, since they are not affected by what
> > > is going on on the CPU.
> >
> > I think that focus on RT/IRQ is mis-guided here, and the implementation
> > is horrendous.
> >
> > So the fundamental question seems to be; and I think Johannes is the one
> > to answer that: What time-base do these metrics want to use?
> >
> > Do some of these states want to account in task-time instead of
> > wall-time perhaps? I can't quite remember, but vague memories are
> > telling me most of the PSI accounting was about blocked tasks, not
> > running tasks, which makes all this rather more complicated.
>
> I tend to agree with this.
> Using rq_clock_task(rq) instead of cpu_clock(cpu) will, for example, remove
> the time spent under interrupt, and AFAICT rq->clock_task is updated before
> the psi functions are called.
Thanks, Vincent. Could rq_clock_task also help remove the time a CFS task
is preempted by RT/DL, which is the main part we want to address in the
memstall time?
>
> >
> > Randomly scaling time as proposed seems almost certainly wrong. What
> > would that make the stats mean?
Dietmar Eggemann Nov. 12, 2021, 2:16 p.m. UTC | #19
On 10/11/2021 06:36, Xuewen Yan wrote:
> Hi Dietmar
> On Tue, Nov 9, 2021 at 5:43 PM Dietmar Eggemann
> <dietmar.eggemann@arm.com> wrote:
>>
>> On 08/11/2021 09:49, Xuewen Yan wrote:
>>> Hi Dietmar
>>>
>>> On Sat, Nov 6, 2021 at 1:20 AM Dietmar Eggemann
>>> <dietmar.eggemann@arm.com> wrote:
>>>>
>>>> On 05/11/2021 06:58, Zhaoyang Huang wrote:

[...]

>>>> Even the CFS part (cpu_rq(CPUx)->cfs.avg.util_avg) can be larger than
>>>> the original cpu capacity (rq->cpu_capacity_orig).
>>>>
>>>> Have a look at cpu_util(). capacity_orig_of(CPUx) and
>>>> arch_scale_cpu_capacity(CPUx) both returning rq->cpu_capacity_orig.
>>>>
>>>
>>> Well, your means is we should not use the 1024 and should use the
>>> original cpu capacity?
>>> And maybe use the "sched_cpu_util()" is a good choice just like this:
>>>
>>> +       if (current->in_memstall)
>>> +               growth_fixed = div64_ul(cpu_util_cfs(rq) * growth,
>>> sched_cpu_util(rq->cpu, capacity_orig_of(rq->cpu)));
>>
>> Not sure about this. In case util_cfs=0 you would get scale=0.
> 
> Yes, we should consider it. In addition, it should also be considered
> when util_cfs > capacity_orig because of UTIL_EST...

We should clamp cpu_util_cfs() with capacity_orig_of(), like we
currently do for the CFS-internal cpu_util(). In fact, we should only
use one function for this. Sent a patch out:

https://lkml.kernel.org/r/20211112141349.155713-1-dietmar.eggemann@arm.com

[...]
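A hedged sketch of that variant (function names taken from this thread; signatures assumed for kernels of that era, with the fair-class bail-out from earlier in the discussion), scaling the delta by the clamped CFS share of the whole rq:

static unsigned long psi_memtime_fixup(u32 growth)
{
	struct rq *rq = task_rq(current);
	int cpu = cpu_of(rq);
	unsigned long max = capacity_orig_of(cpu);
	unsigned long util_cfs, util_total;

	if (current->sched_class != &fair_sched_class)
		return growth;

	/* Clamp so UTIL_EST cannot push the CFS share above 100%. */
	util_cfs = min(cpu_util_cfs(rq), max);
	util_total = sched_cpu_util(cpu, max);

	/* Avoid a zero divisor (or a zero result) when nothing ran recently. */
	if (!util_cfs || !util_total)
		return growth;

	return div64_ul((u64)util_cfs * growth, util_total);
}

The in_memstall condition from the posted patch is left out here for brevity.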
Johannes Weiner Nov. 12, 2021, 4:36 p.m. UTC | #20
On Tue, Nov 09, 2021 at 03:56:36PM +0100, Peter Zijlstra wrote:
> On Tue, Nov 02, 2021 at 03:47:33PM -0400, Johannes Weiner wrote:
> > CC peterz as well for rt and timekeeping magic
> > 
> > On Fri, Oct 15, 2021 at 02:16:52PM +0800, Huangzhaoyang wrote:
> > > From: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
> > > 
> > > In an EAS enabled system, there are two scenarios discordant to current design,
> > > 
> > > 1. workload used to be heavy uneven among cores for sake of scheduler policy.
> > > RT task usually preempts CFS task in little core.
> > > 2. CFS task's memstall time is counted as simple as exit - entry so far, which
> > > ignore the preempted time by RT, DL and Irqs.
> 
> It ignores preemption full-stop. I don't see why RT/IRQ should be
> special cased here.
> 
> > > With these two constraints, the percpu nonidle time would be mainly consumed by
> > > none CFS tasks and couldn't be averaged. Eliminating them by calc the time growth
> > > via the proportion of cfs_rq's utilization on the whole rq.
> 
> 
> > > +static unsigned long psi_memtime_fixup(u32 growth)
> > > +{
> > > +	struct rq *rq = task_rq(current);
> > > +	unsigned long growth_fixed = (unsigned long)growth;
> > > +
> > > +	if (!(current->policy == SCHED_NORMAL || current->policy == SCHED_BATCH))
> > > +		return growth_fixed;
> > > +
> > > +	if (current->in_memstall)
> > > +		growth_fixed = div64_ul((1024 - rq->avg_rt.util_avg - rq->avg_dl.util_avg
> > > +					- rq->avg_irq.util_avg + 1) * growth, 1024);
> > > +
> > > +	return growth_fixed;
> > > +}
> > > +
> > >  static void init_triggers(struct psi_group *group, u64 now)
> > >  {
> > >  	struct psi_trigger *t;
> > > @@ -658,6 +675,7 @@ static void record_times(struct psi_group_cpu *groupc, u64 now)
> > >  	}
> > >  
> > >  	if (groupc->state_mask & (1 << PSI_MEM_SOME)) {
> > > +		delta = psi_memtime_fixup(delta);
> > 
> > Ok, so we want to deduct IRQ and RT preemption time from the memstall
> > period of an active reclaimer, since it's technically not stalled on
> > memory during this time but on CPU.
> > 
> > However, we do NOT want to deduct IRQ and RT time from memstalls that
> > are sleeping on refaults swapins, since they are not affected by what
> > is going on on the CPU.
> 
> I think that focus on RT/IRQ is mis-guided here, and the implementation
> is horrendous.
> 
> So the fundamental question seems to be; and I think Johannes is the one
> to answer that: What time-base do these metrics want to use?
> 
> Do some of these states want to account in task-time instead of
> wall-time perhaps? I can't quite remember, but vague memories are
> telling me most of the PSI accounting was about blocked tasks, not
> running tasks, which makes all this rather more complicated.
> 
> Randomly scaling time as proposed seems almost certainly wrong. What
> would that make the stats mean?

It *could* be argued that IRQs and RT preemptions are CPU stalls.

I'm less convinced we should subtract preemptions from memory stalls.

Yes, when you're reclaiming and you get preempted for whatever reason,
you're technically stalled on CPU in this moment. However, reclaim
itself consumes CPU and walltime, and it could be what is causing
those preemptions to begin with! For example, reclaim could eat up 90%
of your scheduling timeslice and then cause a preemption when the
thread is back in userspace and trying to be productive. By consuming
time, it also drags out the overall timeline for userspace to finish
its work, and a longer timeline will have more disruptions from
independent events like IRQs and RT thread wakeups.

So if you *were* to discount CPU contention from memory stalls, it
would also mean that you'd have to count *memory stalls* when
userspace experiences CPU contention caused by preceding reclaims. I
don't think it makes sense to try to go down that road...

They're dependent resources. Just like faster CPUs and faster IO
devices mean less memory pressure for the same amount of reclaim and
paging activity, it seems logical that contention of those underlying
resources will result in longer memory stalls and higher pressure.
Peter Zijlstra Nov. 12, 2021, 7:23 p.m. UTC | #21
On Fri, Nov 12, 2021 at 11:36:08AM -0500, Johannes Weiner wrote:
> On Tue, Nov 09, 2021 at 03:56:36PM +0100, Peter Zijlstra wrote:

> > I think that focus on RT/IRQ is mis-guided here, and the implementation
> > is horrendous.
> > 
> > So the fundamental question seems to be; and I think Johannes is the one
> > to answer that: What time-base do these metrics want to use?
> > 
> > Do some of these states want to account in task-time instead of
> > wall-time perhaps? I can't quite remember, but vague memories are
> > telling me most of the PSI accounting was about blocked tasks, not
> > running tasks, which makes all this rather more complicated.
> > 
> > Randomly scaling time as proposed seems almost certainly wrong. What
> > would that make the stats mean?
> 
> It *could* be argued that IRQs and RT preemptions are CPU stalls.
> 
> I'm less convinced we should subtract preemptions from memory stalls.
> 
> Yes, when you're reclaiming and you get preempted for whatever reason,
> you're technically stalled on CPU in this moment. However, reclaim
> itself consumes CPU and walltime, and it could be what is causing
> those preemptions to begin with! For example, reclaim could eat up 90%
> of your scheduling timeslice and then cause a preemption when the
> thread is back in userspace and trying to be productive. By consuming
> time, it also drags out the overall timeline for userspace to finish
> its work, and a longer timeline will have more disruptions from
> independent events like IRQs and RT thread wakeups.

Reclaim could be running at RT priority. There is fundamentally no
distinction between CFS and RT tasks. Consider priority inheritance on
kernel mutexes, or an admin thinking it makes sense to chrt kswapd. Or
an RT task ends up in direct reclaim, or...

Either they all count or they don't. There is no sane middle ground
here.
Zhaoyang Huang Nov. 15, 2021, 2:24 a.m. UTC | #22
On Sat, Nov 13, 2021 at 12:36 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Tue, Nov 09, 2021 at 03:56:36PM +0100, Peter Zijlstra wrote:
> > On Tue, Nov 02, 2021 at 03:47:33PM -0400, Johannes Weiner wrote:
> > > CC peterz as well for rt and timekeeping magic
> > >
> > > On Fri, Oct 15, 2021 at 02:16:52PM +0800, Huangzhaoyang wrote:
> > > > From: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
> > > >
> > > > In an EAS enabled system, there are two scenarios discordant to current design,
> > > >
> > > > 1. workload used to be heavy uneven among cores for sake of scheduler policy.
> > > > RT task usually preempts CFS task in little core.
> > > > 2. CFS task's memstall time is counted as simple as exit - entry so far, which
> > > > ignore the preempted time by RT, DL and Irqs.
> >
> > It ignores preemption full-stop. I don't see why RT/IRQ should be
> > special cased here.
> >
> > > > With these two constraints, the percpu nonidle time would be mainly consumed by
> > > > none CFS tasks and couldn't be averaged. Eliminating them by calc the time growth
> > > > via the proportion of cfs_rq's utilization on the whole rq.
> >
> >
> > > > +static unsigned long psi_memtime_fixup(u32 growth)
> > > > +{
> > > > + struct rq *rq = task_rq(current);
> > > > + unsigned long growth_fixed = (unsigned long)growth;
> > > > +
> > > > + if (!(current->policy == SCHED_NORMAL || current->policy == SCHED_BATCH))
> > > > +         return growth_fixed;
> > > > +
> > > > + if (current->in_memstall)
> > > > +         growth_fixed = div64_ul((1024 - rq->avg_rt.util_avg - rq->avg_dl.util_avg
> > > > +                                 - rq->avg_irq.util_avg + 1) * growth, 1024);
> > > > +
> > > > + return growth_fixed;
> > > > +}
> > > > +
> > > >  static void init_triggers(struct psi_group *group, u64 now)
> > > >  {
> > > >   struct psi_trigger *t;
> > > > @@ -658,6 +675,7 @@ static void record_times(struct psi_group_cpu *groupc, u64 now)
> > > >   }
> > > >
> > > >   if (groupc->state_mask & (1 << PSI_MEM_SOME)) {
> > > > +         delta = psi_memtime_fixup(delta);
> > >
> > > Ok, so we want to deduct IRQ and RT preemption time from the memstall
> > > period of an active reclaimer, since it's technically not stalled on
> > > memory during this time but on CPU.
> > >
> > > However, we do NOT want to deduct IRQ and RT time from memstalls that
> > > are sleeping on refaults swapins, since they are not affected by what
> > > is going on on the CPU.
> >
> > I think that focus on RT/IRQ is mis-guided here, and the implementation
> > is horrendous.
> >
> > So the fundamental question seems to be; and I think Johannes is the one
> > to answer that: What time-base do these metrics want to use?
> >
> > Do some of these states want to account in task-time instead of
> > wall-time perhaps? I can't quite remember, but vague memories are
> > telling me most of the PSI accounting was about blocked tasks, not
> > running tasks, which makes all this rather more complicated.
> >
> > Randomly scaling time as proposed seems almost certainly wrong. What
> > would that make the stats mean?
>
> It *could* be argued that IRQs and RT preemptions are CPU stalls.
>
> I'm less convinced we should subtract preemptions from memory stalls.
>
> Yes, when you're reclaiming and you get preempted for whatever reason,
> you're technically stalled on CPU in this moment. However, reclaim
> itself consumes CPU and walltime, and it could be what is causing
> those preemptions to begin with! For example, reclaim could eat up 90%
> of your scheduling timeslice and then cause a preemption when the
> thread is back in userspace and trying to be productive. By consuming
> time, it also drags out the overall timeline for userspace to finish
> its work, and a longer timeline will have more disruptions from
> independent events like IRQs and RT thread wakeups.
>
> So if you *were* to discount CPU contention from memory stalls, it
> would also mean that you'd have to count *memory stalls* when
> userspace experiences CPU contention caused by preceding reclaims. I
> don't think it makes sense to try to go down that road...
>
> They're dependent resources. Just like faster CPUs and faster IO
> devices mean less memory pressure for the same amount of reclaim and
> paging activity, it seems logical that contention of those underlying
> resources will result in longer memory stalls and higher pressure.
Imagine two triggers created on CPU and MEMORY, with one RT
non-memstall process consuming 90% of the rq's utilization while a memstall
CFS process gets the remaining 10%. The problem is that we will be misled,
as both resources appear busy under the current mechanism.

Patch

diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index cc25a3c..754a836 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -182,6 +182,8 @@  struct psi_group psi_system = {
 
 static void psi_avgs_work(struct work_struct *work);
 
+static unsigned long psi_memtime_fixup(u32 growth);
+
 static void group_init(struct psi_group *group)
 {
 	int cpu;
@@ -492,6 +494,21 @@  static u64 window_update(struct psi_window *win, u64 now, u64 value)
 	return growth;
 }
 
+static unsigned long psi_memtime_fixup(u32 growth)
+{
+	struct rq *rq = task_rq(current);
+	unsigned long growth_fixed = (unsigned long)growth;
+
+	if (!(current->policy == SCHED_NORMAL || current->policy == SCHED_BATCH))
+		return growth_fixed;
+
+	if (current->in_memstall)
+		growth_fixed = div64_ul((1024 - rq->avg_rt.util_avg - rq->avg_dl.util_avg
+					- rq->avg_irq.util_avg + 1) * growth, 1024);
+
+	return growth_fixed;
+}
+
 static void init_triggers(struct psi_group *group, u64 now)
 {
 	struct psi_trigger *t;
@@ -658,6 +675,7 @@  static void record_times(struct psi_group_cpu *groupc, u64 now)
 	}
 
 	if (groupc->state_mask & (1 << PSI_MEM_SOME)) {
+		delta = psi_memtime_fixup(delta);
 		groupc->times[PSI_MEM_SOME] += delta;
 		if (groupc->state_mask & (1 << PSI_MEM_FULL))
 			groupc->times[PSI_MEM_FULL] += delta;
@@ -928,8 +946,8 @@  void psi_memstall_leave(unsigned long *flags)
 	 */
 	rq = this_rq_lock_irq(&rf);
 
-	current->in_memstall = 0;
 	psi_task_change(current, TSK_MEMSTALL, 0);
+	current->in_memstall = 0;
 
 	rq_unlock_irq(rq, &rf);
 }