Message ID | 1634278612-17055-1-git-send-email-huangzhaoyang@gmail.com
---|---
State | New
Series | [Resend] psi: calc cfs task memstall time more precisely
CC peterz as well for rt and timekeeping magic

On Fri, Oct 15, 2021 at 02:16:52PM +0800, Huangzhaoyang wrote:
> From: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
>
> In an EAS-enabled system, there are two scenarios at odds with the current design:
>
> 1. Workloads tend to be heavily unbalanced among cores because of scheduler policy;
> RT tasks usually preempt CFS tasks on little cores.
> 2. A CFS task's memstall time is currently counted simply as exit - entry, which
> ignores the time it was preempted by RT, DL and IRQs.
>
> With these two constraints, the percpu nonidle time would be mainly consumed by
> non-CFS tasks and couldn't be averaged. Eliminate that time by scaling the time
> growth by the proportion of the cfs_rq's utilization on the whole rq.
>
> e.g.
> Here is the scenario this commit wants to fix: rt and irq consume some of the
> whole rq's utilization. This is typical on a core that is assigned to handle
> all irqs. Furthermore, rt tasks usually run on little cores under EAS.
>
> Binder:305_3-314 [002] d..1 257.880195: psi_memtime_fixup: original:30616,adjusted:25951,se:89,cfs:353,rt:139,dl:0,irq:18
> droid.phone-1525 [001] d..1 265.145492: psi_memtime_fixup: original:61616,adjusted:53492,se:55,cfs:225,rt:121,dl:0,irq:15
>
> Signed-off-by: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
> ---
> kernel/sched/psi.c | 20 +++++++++++++++++++-
> 1 file changed, 19 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
> index cc25a3c..754a836 100644
> --- a/kernel/sched/psi.c
> +++ b/kernel/sched/psi.c
> @@ -182,6 +182,8 @@ struct psi_group psi_system = {
>
> static void psi_avgs_work(struct work_struct *work);
>
> +static unsigned long psi_memtime_fixup(u32 growth);
> +
> static void group_init(struct psi_group *group)
> {
> int cpu;
> @@ -492,6 +494,21 @@ static u64 window_update(struct psi_window *win, u64 now, u64 value)
> return growth;
> }
>
> +static unsigned long psi_memtime_fixup(u32 growth)
> +{
> + struct rq *rq = task_rq(current);
> + unsigned long growth_fixed = (unsigned long)growth;
> +
> + if (!(current->policy == SCHED_NORMAL || current->policy == SCHED_BATCH))
> + return growth_fixed;
> +
> + if (current->in_memstall)
> + growth_fixed = div64_ul((1024 - rq->avg_rt.util_avg - rq->avg_dl.util_avg
> + - rq->avg_irq.util_avg + 1) * growth, 1024);
> +
> + return growth_fixed;
> +}
> +
> static void init_triggers(struct psi_group *group, u64 now)
> {
> struct psi_trigger *t;
> @@ -658,6 +675,7 @@ static void record_times(struct psi_group_cpu *groupc, u64 now)
> }
>
> if (groupc->state_mask & (1 << PSI_MEM_SOME)) {
> + delta = psi_memtime_fixup(delta);

Ok, so we want to deduct IRQ and RT preemption time from the memstall
period of an active reclaimer, since it's technically not stalled on
memory during this time but on CPU.

However, we do NOT want to deduct IRQ and RT time from memstalls that
are sleeping on refaults or swapins, since they are not affected by what
is going on on the CPU.

Does util_avg capture that difference? I'm not confident it does - but
correct me if I'm wrong. We need the length of time during which an IRQ
or an RT task preempted the old rq->curr, not the absolute irq/rt length.

(Btw, such preemption periods, in addition to being deducted from
memory stalls, should probably also be added to CPU contention stalls,
to make CPU pressure reporting more accurate as well.)
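For reference, the trace fields in the commit message are internally
consistent with the proposed formula. Taking the first trace line as a
worked example:

    scale    = 1024 - rt - dl - irq + 1 = 1024 - 139 - 0 - 18 + 1 = 868
    adjusted = original * scale / 1024  = 30616 * 868 / 1024      = 25951

and likewise 61616 * 889 / 1024 = 53492 for the second line.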
On Wed, Nov 3, 2021 at 3:47 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> CC peterz as well for rt and timekeeping magic
>
> On Fri, Oct 15, 2021 at 02:16:52PM +0800, Huangzhaoyang wrote:
> > From: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
[...]
> > if (groupc->state_mask & (1 << PSI_MEM_SOME)) {
> > + delta = psi_memtime_fixup(delta);

adding Vincent for advice on the cpu load mechanism

> Ok, so we want to deduct IRQ and RT preemption time from the memstall
> period of an active reclaimer, since it's technically not stalled on
> memory during this time but on CPU.
>
> However, we do NOT want to deduct IRQ and RT time from memstalls that
> are sleeping on refaults or swapins, since they are not affected by what
> is going on on the CPU.
>
> Does util_avg capture that difference? I'm not confident it does - but
> correct me if I'm wrong. We need the length of time during which an IRQ
> or an RT task preempted the old rq->curr, not the absolute irq/rt length.

As far as I understand, a core's capacity = IRQ + DEADLINE + RT + CFS.
For a certain time period, all cfs tasks preempt each other inside CFS's
utilization, so the time sleeping on refaults is counted in.

> (Btw, such preemption periods, in addition to being deducted from
> memory stalls, should probably also be added to CPU contention stalls,
> to make CPU pressure reporting more accurate as well.)
+Vincent Guittot

On Wed, Nov 3, 2021 at 3:07 PM Zhaoyang Huang <huangzhaoyang@gmail.com> wrote:
>
> On Wed, Nov 3, 2021 at 3:47 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
> >
> > CC peterz as well for rt and timekeeping magic
> >
> > On Fri, Oct 15, 2021 at 02:16:52PM +0800, Huangzhaoyang wrote:
> > > From: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
[...]
> > Does util_avg capture that difference? I'm not confident it does - but
> > correct me if I'm wrong. We need the length of time during which an IRQ
> > or an RT task preempted the old rq->curr, not the absolute irq/rt length.
> As far as I understand, a core's capacity = IRQ + DEADLINE + RT + CFS.
> For a certain time period, all cfs tasks preempt each other inside CFS's
> utilization, so the time sleeping on refaults is counted in.
> >
> > (Btw, such preemption periods, in addition to being deducted from
> > memory stalls, should probably also be added to CPU contention stalls,
> > to make CPU pressure reporting more accurate as well.)
On 03/11/2021 08:08, Zhaoyang Huang wrote:
> +Vincent Guittot
>
> On Wed, Nov 3, 2021 at 3:07 PM Zhaoyang Huang <huangzhaoyang@gmail.com> wrote:
>>
>> On Wed, Nov 3, 2021 at 3:47 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>>>
>>> On Fri, Oct 15, 2021 at 02:16:52PM +0800, Huangzhaoyang wrote:
>>>> From: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
>>>>
>>>> In an EAS-enabled system, there are two scenarios at odds with the current design:

I don't understand the EAS (probably asymmetric CPU capacity is meant
here) angle of the story. Pressure on CPU capacity which is usable for
CFS happens on SMP as well?

[...]

>>>> +static unsigned long psi_memtime_fixup(u32 growth)
>>>> +{
>>>> + struct rq *rq = task_rq(current);
>>>> + unsigned long growth_fixed = (unsigned long)growth;
>>>> +
>>>> + if (!(current->policy == SCHED_NORMAL || current->policy == SCHED_BATCH))
>>>> + return growth_fixed;

This will let the idle task (swapper) pass. Is this intended? Or do you
want to only let CFS tasks (including SCHED_IDLE) pass?

    if (current->sched_class != &fair_sched_class)
        return growth_fixed;

>>>> +
>>>> + if (current->in_memstall)
>>>> + growth_fixed = div64_ul((1024 - rq->avg_rt.util_avg - rq->avg_dl.util_avg
>>>> + - rq->avg_irq.util_avg + 1) * growth, 1024);
>>>> +

We do this slightly differently in scale_rt_capacity() [fair.c]:

    max = arch_scale_cpu_capacity(cpu_of(rq)) /* instead of 1024, to
                                                 support asymmetric CPU
                                                 capacity */

    used = cpu_util_rt(rq);
    used += cpu_util_dl(rq);
    used += thermal_load_avg(rq);

    free = max - used
    irq = cpu_util_irq(rq)

    used = scale_irq_capacity(free, irq, max);

scaling then with: (max - used) / max

[...]
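For concreteness, a version of psi_memtime_fixup() following the
scale_rt_capacity() scheme described above might look roughly like the
sketch below. This is an untested illustration, not part of the patch
under review: the helpers are the ones scale_rt_capacity() uses in
fair.c, the min() clamp is an added guard (the PELT sums can exceed
capacity, as discussed later in the thread), and the patch's
current->in_memstall check is elided for brevity:

    static unsigned long psi_memtime_fixup(u32 growth)
    {
            struct rq *rq = task_rq(current);
            unsigned long max = arch_scale_cpu_capacity(cpu_of(rq));
            unsigned long used, free, irq;

            if (current->sched_class != &fair_sched_class)
                    return growth;

            used = cpu_util_rt(rq) + cpu_util_dl(rq) + thermal_load_avg(rq);
            used = min(used, max);          /* PELT sums can exceed capacity */
            free = max - used;

            irq = cpu_util_irq(rq);
            if (irq >= max)
                    return 0;               /* CPU fully consumed by irq */

            /* capacity left for CFS: free, scaled down by the irq share */
            free = scale_irq_capacity(free, irq, max);

            /* scale the memstall delta by free / max */
            return div64_ul((u64)free * growth, max);
    }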
On Thu, Nov 4, 2021 at 4:58 PM Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
>
> On 03/11/2021 08:08, Zhaoyang Huang wrote:
[...]
> I don't understand the EAS (probably asymmetric CPU capacity is meant
> here) angle of the story. Pressure on CPU capacity which is usable for
> CFS happens on SMP as well?

Mentioning EAS here is mainly about RT tasks preempting small CFS tasks
(big CFS tasks could be scheduled to a big core), which introduces a
larger proportion of preempted time within PSI_MEM_STALL than SMP does.

[...]

> This will let the idle task (swapper) pass. Is this intended? Or do you
> want to only let CFS tasks (including SCHED_IDLE) pass?

Idle tasks will NOT call psi_memstall_xxx. We just want CFS tasks to
scale the STALL time.

> We do this slightly differently in scale_rt_capacity() [fair.c]:
>
>     max = arch_scale_cpu_capacity(cpu_of(rq)) /* instead of 1024, to
>                                                  support asymmetric CPU
>                                                  capacity */

Is it possible that the sum of the rq's util_avg signals gets larger
than arch_scale_cpu_capacity() because of task migration and the like?

>     used = cpu_util_rt(rq);
>     used += cpu_util_dl(rq);
>     used += thermal_load_avg(rq);
>
>     free = max - used
>     irq = cpu_util_irq(rq)
>
>     used = scale_irq_capacity(free, irq, max);
>
> scaling then with: (max - used) / max

OK, so introduce thermal util as well.

[...]
On 05/11/2021 06:58, Zhaoyang Huang wrote:
> On Thu, Nov 4, 2021 at 4:58 PM Dietmar Eggemann
> <dietmar.eggemann@arm.com> wrote:
>>
[...]
>> I don't understand the EAS (probably asymmetric CPU capacity is meant
>> here) angle of the story. Pressure on CPU capacity which is usable for
>> CFS happens on SMP as well?
> Mentioning EAS here is mainly about RT tasks preempting small CFS tasks
> (big CFS tasks could be scheduled to a big core), which introduces a
> larger proportion of preempted time within PSI_MEM_STALL than SMP does.

What's your CPU layout? Do you have the little before the big CPUs? Like
Hikey 960?

root@linaro-developer:~# cat /sys/devices/system/cpu/cpu*/cpu_capacity
462
462
462
462
1024
1024
1024
1024

And I guess the rt class prefers lower CPU numbers, hence you see this?

[...]

>> This will let the idle task (swapper) pass. Is this intended? Or do you
>> want to only let CFS tasks (including SCHED_IDLE) pass?
> Idle tasks will NOT call psi_memstall_xxx. We just want CFS tasks to
> scale the STALL time.

Not sure I get this.

__schedule() -> psi_sched_switch() -> psi_task_change() ->
psi_group_change() -> record_times() -> psi_memtime_fixup()

is something else than calling psi_memstall_enter() or _leave()?

IMHO, at least record_times() can be called with current equal
swapper/X. Or is it that PSI_MEM_SOME is never set for the idle task in
this callstack? I don't know the PSI internals.

[...]

>> Is it possible that the sum of the rq's util_avg signals gets larger
>> than arch_scale_cpu_capacity() because of task migration and the like?

I assume you meant whether the rq (cpu_rq(CPUx)) util_avg sum (CFS, RT,
DL, IRQ and thermal part) can be larger than arch_scale_cpu_capacity(CPUx)?

Yes it can.

Have a look at

    effective_cpu_util(..., max, ...) {

        if (foo >= max)
            return max;

    }

Even the CFS part (cpu_rq(CPUx)->cfs.avg.util_avg) can be larger than
the original cpu capacity (rq->cpu_capacity_orig).

Have a look at cpu_util(). capacity_orig_of(CPUx) and
arch_scale_cpu_capacity(CPUx) both return rq->cpu_capacity_orig.

[...]
Hi Dietmar

On Sat, Nov 6, 2021 at 1:20 AM Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
>
> On 05/11/2021 06:58, Zhaoyang Huang wrote:
[...]
> What's your CPU layout? Do you have the little before the big CPUs? Like
> Hikey 960?
[...]
> And I guess the rt class prefers lower CPU numbers, hence you see this?
>
our CPU layout is:
xuewen.yan:/ # cat /sys/devices/system/cpu/cpu*/cpu_capacity
544
544
544
544
544
544
1024
1024

And on our platform, we use the kernel in mobile phones with Android.
We prefer low power consumption, so we prefer the RT class to run on
little cores.

[...]

> Even the CFS part (cpu_rq(CPUx)->cfs.avg.util_avg) can be larger than
> the original cpu capacity (rq->cpu_capacity_orig).
>
> Have a look at cpu_util(). capacity_orig_of(CPUx) and
> arch_scale_cpu_capacity(CPUx) both return rq->cpu_capacity_orig.
>
So you mean we should not use 1024 and should use the original cpu
capacity instead? And maybe using sched_cpu_util() is a good choice,
like this:

+	if (current->in_memstall)
+		growth_fixed = div64_ul(cpu_util_cfs(rq) * growth,
+				sched_cpu_util(rq->cpu, capacity_orig_of(rq->cpu)));

Thanks!

BR
xuewen
On Mon, Nov 8, 2021 at 4:49 PM Xuewen Yan <xuewen.yan94@gmail.com> wrote:
>
> Hi Dietmar
>
> On Sat, Nov 6, 2021 at 1:20 AM Dietmar Eggemann
> <dietmar.eggemann@arm.com> wrote:
[...]
> > Not sure I get this.
> >
> > __schedule() -> psi_sched_switch() -> psi_task_change() ->
> > psi_group_change() -> record_times() -> psi_memtime_fixup()
> >
> > is something else than calling psi_memstall_enter() or _leave()?
> >
> > IMHO, at least record_times() can be called with current equal
> > swapper/X. Or is it that PSI_MEM_SOME is never set for the idle task in
> > this callstack? I don't know the PSI internals.

According to my understanding, PSI_MEM_SOME represents the CORE's state
within which there is at least one task trapped in the memstall path
(only counted in by calling psi_memstall_enter()). record_times() is
responsible for collecting the delta time of the CORE since its start.
What we are doing is making the delta time more precise, so the idle
task is irrelevant here.

[...]
On 08/11/2021 09:49, Xuewen Yan wrote:
> Hi Dietmar
>
> On Sat, Nov 6, 2021 at 1:20 AM Dietmar Eggemann
> <dietmar.eggemann@arm.com> wrote:
[...]
> our CPU layout is:
> xuewen.yan:/ # cat /sys/devices/system/cpu/cpu*/cpu_capacity
> 544
> 544
> 544
> 544
> 544
> 544
> 1024
> 1024
>
> And on our platform, we use the kernel in mobile phones with Android.
> We prefer low power consumption, so we prefer the RT class to run on
> little cores.

Ah, OK, out-of-tree extensions.

[...]

> So you mean we should not use 1024 and should use the original cpu
> capacity instead? And maybe using sched_cpu_util() is a good choice,
> like this:
>
> +	if (current->in_memstall)
> +		growth_fixed = div64_ul(cpu_util_cfs(rq) * growth,
> +				sched_cpu_util(rq->cpu, capacity_orig_of(rq->cpu)));

Not sure about this. In case util_cfs=0 you would get scale=0.

IMHO, you need:

    cap      = rq->cpu_capacity
    cap_orig = rq->cpu_capacity_orig

    scale = (cap * X) / cap_orig

or, if the update of these rq values happens too infrequently for you,
you have to calculate the pressure every time. Something like:

    pressure = cpu_util_rt(rq) + cpu_util_dl(rq)

    irq = cpu_util_irq(rq)

    if (irq >= cap_orig)
        pressure = cap_orig
    else
        pressure = scale_irq_capacity(pressure, irq, cap_orig)
        pressure += irq

    scale = ((cap_orig - pressure) * X) / cap_orig
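Rendered as C, the first variant above, which reuses the precomputed
rq->cpu_capacity (maintained from the load-balance path as the capacity
remaining for CFS after rt/dl/irq/thermal pressure), might look roughly
like the following. This is a sketch only and the helper name is
hypothetical:

    static u32 psi_memtime_scale(struct rq *rq, u32 delta)
    {
            unsigned long cap      = rq->cpu_capacity;      /* left for CFS */
            unsigned long cap_orig = rq->cpu_capacity_orig; /* original capacity */

            /* scale = (cap * X) / cap_orig, with X = delta */
            return div64_ul((u64)cap * delta, cap_orig);
    }

The stated caveat applies: rq->cpu_capacity is only refreshed from the
load-balance path, so if that is too infrequent the pressure has to be
recomputed from the PELT signals each time, as in the second variant.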
On 08/11/2021 10:20, Zhaoyang Huang wrote:
> On Mon, Nov 8, 2021 at 4:49 PM Xuewen Yan <xuewen.yan94@gmail.com> wrote:
[...]
> According to my understanding, PSI_MEM_SOME represents the CORE's state
> within which there is at least one task trapped in the memstall path
> (only counted in by calling psi_memstall_enter()). record_times() is
> responsible for collecting the delta time of the CORE since its start.
> What we are doing is making the delta time more precise, so the idle
> task is irrelevant here.

Coming back to the original snippet of the patch:

    static unsigned long psi_memtime_fixup(u32 growth)
    {

        if (!(current->policy == SCHED_NORMAL ||
              current->policy == SCHED_BATCH))
            return growth_fixed;

With this condition:

(1) you're not bailing when current is the idle task. It has policy
    equal 0 (SCHED_NORMAL).

(2) But you're bailing for a SCHED_IDLE (CFS) task.

I'm not sure that this is intended here?

Since you want to do the scaling later based on what's left for CFS tasks
from the CPU capacity, my hunch is that you rather want to do:

    if (current->sched_class != &fair_sched_class)
        return growth_fixed;

What are the possible sched classes of current in psi_memtime_fixup()?
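The edge cases behind this suggestion: the idle task's ->policy reads as
0 (SCHED_NORMAL) even though it is not run by CFS, while SCHED_IDLE
tasks are run by the fair class but fail the policy test. Checking the
class covers both, as in this sketch (the helper name is hypothetical):

    /* true only for tasks run by CFS: SCHED_NORMAL, SCHED_BATCH and
     * SCHED_IDLE - but not the idle task, whose ->policy also reads
     * as SCHED_NORMAL (0) */
    static inline bool psi_task_is_cfs(struct task_struct *p)
    {
            return p->sched_class == &fair_sched_class;
    }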
On Tue, Nov 02, 2021 at 03:47:33PM -0400, Johannes Weiner wrote:
> CC peterz as well for rt and timekeeping magic
>
> On Fri, Oct 15, 2021 at 02:16:52PM +0800, Huangzhaoyang wrote:
> > From: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
> >
> > In an EAS-enabled system, there are two scenarios at odds with the current design:
> >
> > 1. Workloads tend to be heavily unbalanced among cores because of scheduler policy;
> > RT tasks usually preempt CFS tasks on little cores.
> > 2. A CFS task's memstall time is currently counted simply as exit - entry, which
> > ignores the time it was preempted by RT, DL and IRQs.

It ignores preemption full-stop. I don't see why RT/IRQ should be
special cased here.

[...]

> Ok, so we want to deduct IRQ and RT preemption time from the memstall
> period of an active reclaimer, since it's technically not stalled on
> memory during this time but on CPU.
>
> However, we do NOT want to deduct IRQ and RT time from memstalls that
> are sleeping on refaults or swapins, since they are not affected by what
> is going on on the CPU.

I think that focus on RT/IRQ is misguided here, and the implementation
is horrendous.

So the fundamental question seems to be, and I think Johannes is the one
to answer that: what time-base do these metrics want to use?

Do some of these states want to account in task-time instead of
wall-time perhaps? I can't quite remember, but vague memories are
telling me most of the PSI accounting was about blocked tasks, not
running tasks, which makes all this rather more complicated.

Randomly scaling time as proposed seems almost certainly wrong. What
would that make the stats mean?
On Tue, Nov 9, 2021 at 10:56 PM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Tue, Nov 02, 2021 at 03:47:33PM -0400, Johannes Weiner wrote:
[...]
> It ignores preemption full-stop. I don't see why RT/IRQ should be
> special cased here.

As Johannes commented, what we are trying to solve is mainly the time
the CFS task is preempted by RT/IRQ, NOT the RT/IRQ time itself. Could
you please catch up on the recent replies from Dietmar, which may
provide more information.

[...]

> I think that focus on RT/IRQ is misguided here, and the implementation
> is horrendous.
>
> So the fundamental question seems to be, and I think Johannes is the one
> to answer that: what time-base do these metrics want to use?
>
> Do some of these states want to account in task-time instead of
> wall-time perhaps? I can't quite remember, but vague memories are
> telling me most of the PSI accounting was about blocked tasks, not
> running tasks, which makes all this rather more complicated.

The memstall time is counted as exit - enter, which includes both
blocked and running states. However, we think the blocked time
introduced by preemption from RT/IRQ/DL is memstall-irrelevant (and
should be eliminated), while the blocked time among CFS tasks can be
counted in. Thanks to the load-tracking mechanism, the implementation
can be as simple as calculating the proportion of CFS util among the
whole core's capacity.

> Randomly scaling time as proposed seems almost certainly wrong. What
> would that make the stats mean?

It is NOT random scaling; it scales within each record_times() for CFS
tasks.
Hi Dietmar

On Tue, Nov 9, 2021 at 5:43 PM Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
>
> On 08/11/2021 09:49, Xuewen Yan wrote:
[...]
> > So you mean we should not use 1024 and should use the original cpu
> > capacity instead? And maybe using sched_cpu_util() is a good choice,
> > like this:
> >
> > +	if (current->in_memstall)
> > +		growth_fixed = div64_ul(cpu_util_cfs(rq) * growth,
> > +				sched_cpu_util(rq->cpu, capacity_orig_of(rq->cpu)));
>
> Not sure about this. In case util_cfs=0 you would get scale=0.

Yes, we should consider that. In addition, the case util_cfs >
capacity_orig should also be considered, because of UTIL_EST...

> IMHO, you need:
>
>     cap      = rq->cpu_capacity
>     cap_orig = rq->cpu_capacity_orig
>
>     scale = (cap * X) / cap_orig
>
> or, if the update of these rq values happens too infrequently for you,
> you have to calculate the pressure every time. Something like:
>
>     pressure = cpu_util_rt(rq) + cpu_util_dl(rq)
>
>     irq = cpu_util_irq(rq)
>
>     if (irq >= cap_orig)
>         pressure = cap_orig
>     else
>         pressure = scale_irq_capacity(pressure, irq, cap_orig)
>         pressure += irq
>
>     scale = ((cap_orig - pressure) * X) / cap_orig

Why rescale the util there? sched_cpu_util() invokes
effective_cpu_util(), and I don't think it's necessary to rescale it.

Thanks!
On Tue, Nov 9, 2021 at 8:29 PM Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
>
> On 08/11/2021 10:20, Zhaoyang Huang wrote:
[...]
> Since you want to do the scaling later based on what's left for CFS tasks
> from the CPU capacity, my hunch is that you rather want to do:
>
>     if (current->sched_class != &fair_sched_class)
>         return growth_fixed;

Yes, that is the correct method! Thanks!

> What are the possible sched classes of current in psi_memtime_fixup()?
On Wed, Nov 10, 2021 at 09:37:00AM +0800, Zhaoyang Huang wrote:
> On Tue, Nov 9, 2021 at 10:56 PM Peter Zijlstra <peterz@infradead.org> wrote:
[...]
> > It ignores preemption full-stop. I don't see why RT/IRQ should be
> > special cased here.
> As Johannes commented, what we are trying to solve is mainly the time
> the CFS task is preempted by RT/IRQ, NOT the RT/IRQ time itself. Could
> you please catch up on the recent replies from Dietmar, which may
> provide more information.

In that case NAK.
On Wed, Nov 10, 2021 at 4:36 PM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Wed, Nov 10, 2021 at 09:37:00AM +0800, Zhaoyang Huang wrote:
[...]
> > As Johannes commented, what we are trying to solve is mainly the time
> > the CFS task is preempted by RT/IRQ, NOT the RT/IRQ time itself. Could
> > you please catch up on the recent replies from Dietmar, which may
> > provide more information.
>
> In that case NAK.

Would you please explain if there is any constraint that prevents us
from doing so? We do think eliminating the preempted time is reasonable
and doable, as it is memory-irrelevant but rather related to lack of
CPU etc.
On Tue, 9 Nov 2021 at 15:56, Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Tue, Nov 02, 2021 at 03:47:33PM -0400, Johannes Weiner wrote:
[...]
> I think that focus on RT/IRQ is misguided here, and the implementation
> is horrendous.
>
> So the fundamental question seems to be, and I think Johannes is the one
> to answer that: what time-base do these metrics want to use?
>
> Do some of these states want to account in task-time instead of
> wall-time perhaps? I can't quite remember, but vague memories are
> telling me most of the PSI accounting was about blocked tasks, not
> running tasks, which makes all this rather more complicated.

I tend to agree with this.
Using rq_clock_task(rq) instead of cpu_clock(cpu) will remove the time
spent under interrupt, as an example, and AFAICT rq->clock_task is
updated before calling the psi functions.

> Randomly scaling time as proposed seems almost certainly wrong. What
> would that make the stats mean?
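As a rough sketch of that substitution (not a reviewed change): PSI
takes its timestamps via cpu_clock(cpu), e.g. in psi_task_change(),
which runs under the rq lock that rq_clock_task() requires. Other PSI
clock reads, such as the averaging worker, would need the same
treatment and are glossed over here:

--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ void psi_task_change(struct task_struct *task, int clear, int set)
-	now = cpu_clock(cpu);
+	now = rq_clock_task(cpu_rq(cpu)); /* excludes time accounted to irq */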
On Wed, Nov 10, 2021 at 4:49 PM Vincent Guittot <vincent.guittot@linaro.org> wrote:
>
> On Tue, 9 Nov 2021 at 15:56, Peter Zijlstra <peterz@infradead.org> wrote:
[...]
> I tend to agree with this.
> Using rq_clock_task(rq) instead of cpu_clock(cpu) will remove the time
> spent under interrupt, as an example, and AFAICT rq->clock_task is
> updated before calling the psi functions.

Thanks, Vincent. Could rq_clock_task also help with removing the time a
CFS task is preempted by RT/DL, which is the main part we want to
remove from the memstall time?

[...]
On 10/11/2021 06:36, Xuewen Yan wrote:
> Hi Dietmar
> On Tue, Nov 9, 2021 at 5:43 PM Dietmar Eggemann
> <dietmar.eggemann@arm.com> wrote:
>>
>> On 08/11/2021 09:49, Xuewen Yan wrote:
>>> Hi Dietmar
>>>
>>> On Sat, Nov 6, 2021 at 1:20 AM Dietmar Eggemann
>>> <dietmar.eggemann@arm.com> wrote:
>>>>
>>>> On 05/11/2021 06:58, Zhaoyang Huang wrote:

[...]

>>>> Even the CFS part (cpu_rq(CPUx)->cfs.avg.util_avg) can be larger than
>>>> the original cpu capacity (rq->cpu_capacity_orig).
>>>>
>>>> Have a look at cpu_util(). capacity_orig_of(CPUx) and
>>>> arch_scale_cpu_capacity(CPUx) both returning rq->cpu_capacity_orig.
>>>
>>> Well, your means is we should not use the 1024 and should use the
>>> original cpu capacity?
>>> And maybe use the "sched_cpu_util()" is a good choice just like this:
>>>
>>> +	if (current->in_memstall)
>>> +		growth_fixed = div64_ul(cpu_util_cfs(rq) * growth,
>>> +				sched_cpu_util(rq->cpu, capacity_orig_of(rq->cpu)));
>>
>> Not sure about this. In case util_cfs=0 you would get scale=0.
>
> Yes, we should consider it. In addition, it also should be considered
> when util_cfs > capacity_orig because of the UTIL_EST...

We should clamp cpu_util_cfs() with capacity_orig_of(), like we
currently do for the CFS-internal cpu_util(). In fact, we should only
use one function for this. Sent a patch out:

https://lkml.kernel.org/r/20211112141349.155713-1-dietmar.eggemann@arm.com

[...]
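A minimal sketch of the clamping Dietmar describes, separate from his linked patch: the helpers (cpu_util_cfs(), capacity_orig_of(), cpu_of(), div64_ul()) are the real kernel ones, but this exact function is illustrative, including the fallback that papers over the util_cfs=0 case raised above.

/*
 * Illustrative only: bound the CFS utilization by the CPU's original
 * capacity so UTIL_EST overshoot cannot push the scaling factor above
 * 1, and fall back to the unscaled growth when util or capacity is 0
 * to avoid the util_cfs=0 => scale=0 problem.
 */
static unsigned long scale_by_cfs_util(struct rq *rq, u32 growth)
{
	unsigned long cap  = capacity_orig_of(cpu_of(rq));
	unsigned long util = min(cpu_util_cfs(rq), cap);

	if (!util || !cap)
		return growth;

	return div64_ul((u64)util * growth, cap);
}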
On Tue, Nov 09, 2021 at 03:56:36PM +0100, Peter Zijlstra wrote:
> On Tue, Nov 02, 2021 at 03:47:33PM -0400, Johannes Weiner wrote:
> > CC peterz as well for rt and timekeeping magic

[...]

> I think that focus on RT/IRQ is misguided here, and the implementation
> is horrendous.
>
> So the fundamental question seems to be; and I think Johannes is the one
> to answer that: what time-base do these metrics want to use?
>
> Do some of these states want to account in task-time instead of
> wall-time perhaps? I can't quite remember, but vague memories are
> telling me most of the PSI accounting was about blocked tasks, not
> running tasks, which makes all this rather more complicated.
>
> Randomly scaling time as proposed seems almost certainly wrong. What
> would that make the stats mean?

It *could* be argued that IRQs and RT preemptions are CPU stalls.

I'm less convinced we should subtract preemptions from memory stalls.

Yes, when you're reclaiming and you get preempted for whatever reason,
you're technically stalled on CPU in this moment. However, reclaim
itself consumes CPU and walltime, and it could be what is causing
those preemptions to begin with! For example, reclaim could eat up 90%
of your scheduling timeslice and then cause a preemption when the
thread is back in userspace and trying to be productive. By consuming
time, it also drags out the overall timeline for userspace to finish
its work, and a longer timeline will have more disruptions from
independent events like IRQs and RT thread wakeups.

So if you *were* to discount CPU contention from memory stalls, it
would also mean that you'd have to count *memory stalls* when
userspace experiences CPU contention caused by preceding reclaims. I
don't think it makes sense to try to go down that road...

They're dependent resources. Just like faster CPUs and faster IO
devices mean less memory pressure for the same amount of reclaim and
paging activity, it seems logical that contention of those underlying
resources will result in longer memory stalls and higher pressure.
On Fri, Nov 12, 2021 at 11:36:08AM -0500, Johannes Weiner wrote:
> On Tue, Nov 09, 2021 at 03:56:36PM +0100, Peter Zijlstra wrote:

[...]

> It *could* be argued that IRQs and RT preemptions are CPU stalls.
>
> I'm less convinced we should subtract preemptions from memory stalls.
>
> Yes, when you're reclaiming and you get preempted for whatever reason,
> you're technically stalled on CPU in this moment. However, reclaim
> itself consumes CPU and walltime, and it could be what is causing
> those preemptions to begin with! For example, reclaim could eat up 90%
> of your scheduling timeslice and then cause a preemption when the
> thread is back in userspace and trying to be productive. By consuming
> time, it also drags out the overall timeline for userspace to finish
> its work, and a longer timeline will have more disruptions from
> independent events like IRQs and RT thread wakeups.

Reclaim could be running at RT priority. There is fundamentally no
distinction between CFS and RT tasks. Consider priority inheritance on
kernel mutexes, or an admin thinking it makes sense to chrt kswapd. Or
an RT task ends up in direct reclaim, or...

Either they all count or they don't. There is no sane middle ground
here.
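To make Peter's "chrt kswapd" example concrete, here is a hedged userspace sketch of how any task, reclaim threads included, can be moved into the RT class; the pid, priority, and function name are illustrative.

#include <sched.h>
#include <stdio.h>
#include <sys/types.h>

/*
 * Userspace equivalent of `chrt -f -p 50 <pid>`: switch an arbitrary
 * task (e.g. kswapd0) to SCHED_FIFO priority 50. Any reclaim work it
 * does afterwards runs as an RT task, so an "RT never memstalls"
 * assumption would not hold.
 */
int make_rt(pid_t pid)
{
	struct sched_param sp = { .sched_priority = 50 };

	if (sched_setscheduler(pid, SCHED_FIFO, &sp) == -1) {
		perror("sched_setscheduler");
		return -1;
	}
	return 0;
}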
On Sat, Nov 13, 2021 at 12:36 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Tue, Nov 09, 2021 at 03:56:36PM +0100, Peter Zijlstra wrote:

[...]

> It *could* be argued that IRQs and RT preemptions are CPU stalls.
>
> I'm less convinced we should subtract preemptions from memory stalls.
>
> Yes, when you're reclaiming and you get preempted for whatever reason,
> you're technically stalled on CPU in this moment. However, reclaim
> itself consumes CPU and walltime, and it could be what is causing
> those preemptions to begin with! For example, reclaim could eat up 90%
> of your scheduling timeslice and then cause a preemption when the
> thread is back in userspace and trying to be productive. By consuming
> time, it also drags out the overall timeline for userspace to finish
> its work, and a longer timeline will have more disruptions from
> independent events like IRQs and RT thread wakeups.
>
> So if you *were* to discount CPU contention from memory stalls, it
> would also mean that you'd have to count *memory stalls* when
> userspace experiences CPU contention caused by preceding reclaims. I
> don't think it makes sense to try to go down that road...
>
> They're dependent resources. Just like faster CPUs and faster IO
> devices mean less memory pressure for the same amount of reclaim and
> paging activity, it seems logical that contention of those underlying
> resources will result in longer memory stalls and higher pressure.

Imagine two triggers created on CPU and MEMORY, with one RT
non-memstall process consuming 90% of the rq's util while a memstall
CFS process gets the remaining 10%. The problem is that we will be
misled, because under the current mechanism both resources appear
busy.
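For context on the "two triggers" above, this is the userspace PSI trigger interface. Below is a minimal watcher adapted from Documentation/accounting/psi.rst, with illustrative thresholds (150ms of memory "some" stall per 1s window); a second trigger on /proc/pressure/cpu would be set up the same way.

#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	const char trig[] = "some 150000 1000000";
	struct pollfd fds;

	/* Open the pressure file and register the trigger by writing it. */
	fds.fd = open("/proc/pressure/memory", O_RDWR | O_NONBLOCK);
	if (fds.fd < 0) {
		fprintf(stderr, "open: %s\n", strerror(errno));
		return 1;
	}
	if (write(fds.fd, trig, strlen(trig) + 1) < 0) {
		fprintf(stderr, "write: %s\n", strerror(errno));
		return 1;
	}
	fds.events = POLLPRI;

	/* The kernel signals threshold crossings via POLLPRI. */
	while (poll(&fds, 1, -1) > 0) {
		if (fds.revents & POLLERR) {
			fprintf(stderr, "trigger fd error\n");
			return 1;
		}
		if (fds.revents & POLLPRI)
			printf("memory pressure event\n");
	}
	return 0;
}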
diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index cc25a3c..754a836 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -182,6 +182,8 @@ struct psi_group psi_system = {
 
 static void psi_avgs_work(struct work_struct *work);
 
+static unsigned long psi_memtime_fixup(u32 growth);
+
 static void group_init(struct psi_group *group)
 {
 	int cpu;
@@ -492,6 +494,21 @@ static u64 window_update(struct psi_window *win, u64 now, u64 value)
 	return growth;
 }
 
+static unsigned long psi_memtime_fixup(u32 growth)
+{
+	struct rq *rq = task_rq(current);
+	unsigned long growth_fixed = (unsigned long)growth;
+
+	if (!(current->policy == SCHED_NORMAL || current->policy == SCHED_BATCH))
+		return growth_fixed;
+
+	if (current->in_memstall)
+		growth_fixed = div64_ul((1024 - rq->avg_rt.util_avg - rq->avg_dl.util_avg
+				- rq->avg_irq.util_avg + 1) * growth, 1024);
+
+	return growth_fixed;
+}
+
 static void init_triggers(struct psi_group *group, u64 now)
 {
 	struct psi_trigger *t;
@@ -658,6 +675,7 @@ static void record_times(struct psi_group_cpu *groupc, u64 now)
 	}
 
 	if (groupc->state_mask & (1 << PSI_MEM_SOME)) {
+		delta = psi_memtime_fixup(delta);
 		groupc->times[PSI_MEM_SOME] += delta;
 		if (groupc->state_mask & (1 << PSI_MEM_FULL))
 			groupc->times[PSI_MEM_FULL] += delta;
@@ -928,8 +946,8 @@ void psi_memstall_leave(unsigned long *flags)
 	 */
 	rq = this_rq_lock_irq(&rf);
 
-	current->in_memstall = 0;
 	psi_task_change(current, TSK_MEMSTALL, 0);
+	current->in_memstall = 0;
 
 	rq_unlock_irq(rq, &rf);
 }
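To see what psi_memtime_fixup() does numerically, here is a standalone sketch of the same arithmetic with made-up PELT utilization values, purely illustrative.

#include <stdio.h>

/* Hypothetical PELT utilizations on a 1024-capacity scale. */
#define UTIL_RT		100
#define UTIL_DL		0
#define UTIL_IRQ	24

int main(void)
{
	unsigned long growth = 30000;	/* raw memstall delta, ns */
	unsigned long scale  = 1024 - UTIL_RT - UTIL_DL - UTIL_IRQ + 1;
	unsigned long fixed  = scale * growth / 1024;

	/* Here 901/1024 of the raw delta is kept as memstall time. */
	printf("raw %lu ns -> adjusted %lu ns\n", growth, fixed);
	return 0;
}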