Message ID | 1471571741-19504-1-git-send-email-smuckle@linaro.org (mailing list archive) |
---|---|
State | Superseded, archived |
2016-08-19 9:55 GMT+08:00 Steve Muckle <steve.muckle@linaro.org>:
> PELT scales its util_sum and util_avg values via
> arch_scale_cpu_capacity(). If that function is passed the CPU's sched
> domain then it will reduce the scaling capacity if SD_SHARE_CPUCAPACITY
> is set. PELT does not pass in the sd however. The other caller of
> arch_scale_cpu_capacity, update_cpu_capacity(), does. This means
> util_sum and util_avg scale beyond the CPU capacity on SMT.
>
> On an Intel i7-3630QM for example rq->cpu_capacity_orig is 589 but
> util_avg scales up to 1024.
>
> Fix this by passing in the sd in __update_load_avg() as well.

I believe we noticed this at least several months ago.
https://lkml.org/lkml/2016/5/25/228

>
> Signed-off-by: Steve Muckle <smuckle@linaro.org>
> ---
>  kernel/sched/fair.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 61d485421bed..95d34b337152 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -2731,7 +2731,7 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
>  	sa->last_update_time = now;
>
>  	scale_freq = arch_scale_freq_capacity(NULL, cpu);
> -	scale_cpu = arch_scale_cpu_capacity(NULL, cpu);
> +	scale_cpu = arch_scale_cpu_capacity(cpu_rq(cpu)->sd, cpu);
>
>  	/* delta_w is the amount already accumulated against our next period */
>  	delta_w = sa->period_contrib;
> --
> 2.7.3
>

On Fri, Aug 19, 2016 at 10:30:36AM +0800, Wanpeng Li wrote:
> 2016-08-19 9:55 GMT+08:00 Steve Muckle <steve.muckle@linaro.org>:
> > PELT scales its util_sum and util_avg values via
> > arch_scale_cpu_capacity(). If that function is passed the CPU's sched
> > domain then it will reduce the scaling capacity if SD_SHARE_CPUCAPACITY
> > is set. PELT does not pass in the sd however. The other caller of
> > arch_scale_cpu_capacity, update_cpu_capacity(), does. This means
> > util_sum and util_avg scale beyond the CPU capacity on SMT.
> >
> > On an Intel i7-3630QM for example rq->cpu_capacity_orig is 589 but
> > util_avg scales up to 1024.
> >
> > Fix this by passing in the sd in __update_load_avg() as well.
>
> I believe we notice this at least several months ago.
> https://lkml.org/lkml/2016/5/25/228

Glad to see I'm not alone in thinking this is an issue.

It causes an issue with schedutil, effectively doubling the apparent
demand on SMT. I don't know the load balance code well enough offhand to
say whether it's an issue there.

cheers,
Steve
--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

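The "doubled demand" remark can be made concrete with a small user-space
sketch. It assumes a purely proportional mapping from the util/max ratio
to a frequency request; this is a simplification for illustration and
not schedutil's actual formula. freq_request_khz() and the 2400000 kHz
figure used in the note below are hypothetical; 589 and 1024 are the
values quoted in the patch description.

```c
#include <assert.h>

/*
 * Simplified proportional frequency request (an assumption for
 * illustration, not the governor's real code): scale max_freq_khz by
 * util/max and clamp the result to max_freq_khz.
 */
static unsigned long freq_request_khz(unsigned long max_freq_khz,
				      unsigned long util,
				      unsigned long max)
{
	unsigned long freq;

	assert(max > 0);
	freq = max_freq_khz * util / max;

	return freq > max_freq_khz ? max_freq_khz : freq;
}
```

With a hypothetical 2400000 kHz maximum, a half-busy CPU (util 512 on
PELT's 0..1024 scale) asks for 1200000 kHz when max is 1024, but for
more than 2000000 kHz when max is the SMT capacity of 589, because the
two values are on different scales.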
Hi Steve,

On 19/08/16 02:55, Steve Muckle wrote:
> PELT scales its util_sum and util_avg values via
> arch_scale_cpu_capacity(). If that function is passed the CPU's sched
> domain then it will reduce the scaling capacity if SD_SHARE_CPUCAPACITY
> is set. PELT does not pass in the sd however. The other caller of
> arch_scale_cpu_capacity, update_cpu_capacity(), does. This means
> util_sum and util_avg scale beyond the CPU capacity on SMT.
>
> On an Intel i7-3630QM for example rq->cpu_capacity_orig is 589 but
> util_avg scales up to 1024.
>
> Fix this by passing in the sd in __update_load_avg() as well.
>
> Signed-off-by: Steve Muckle <smuckle@linaro.org>
> ---
>  kernel/sched/fair.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 61d485421bed..95d34b337152 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -2731,7 +2731,7 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
>  	sa->last_update_time = now;
>
>  	scale_freq = arch_scale_freq_capacity(NULL, cpu);
> -	scale_cpu = arch_scale_cpu_capacity(NULL, cpu);
> +	scale_cpu = arch_scale_cpu_capacity(cpu_rq(cpu)->sd, cpu);

Wouldn't you have to subscribe to this rcu pointer rq->sd w/ something
like 'rcu_dereference(cpu_rq(cpu)->sd)'?

IMHO, __update_load_avg() is called outside existing RCU read-side
critical sections as well so there would be a pair of
rcu_read_lock()/rcu_read_unlock() required in this case.

>
>  	/* delta_w is the amount already accumulated against our next period */
>  	delta_w = sa->period_contrib;
>

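The access pattern described in this review can be sketched in user
space. Everything below is a stand-in: the rcu_read_lock(),
rcu_read_unlock() and rcu_dereference() stubs only count read-side
nesting so the pairing can be checked, and they are not the kernel's
implementations. scale_cpu_under_rcu() and the trimmed-down structs are
hypothetical and not part of the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for RCU read-side state: just a nesting counter. */
static int rcu_read_depth;

static void rcu_read_lock(void)
{
	rcu_read_depth++;
}

static void rcu_read_unlock(void)
{
	assert(rcu_read_depth > 0);
	rcu_read_depth--;
}

/* Stand-in for rcu_dereference(): insists on an active read-side section. */
#define rcu_dereference(p) (assert(rcu_read_depth > 0), (p))

/* Only the fields this sketch needs. */
struct sched_domain {
	unsigned long smt_gain;
	unsigned int span_weight;
};

struct rq {
	struct sched_domain *sd;	/* RCU-protected in the kernel */
};

/* Hypothetical helper: read rq->sd only inside a read-side section. */
static unsigned long scale_cpu_under_rcu(struct rq *rq)
{
	struct sched_domain *sd;
	unsigned long cap = 1024;

	rcu_read_lock();
	sd = rcu_dereference(rq->sd);
	if (sd && sd->span_weight > 1)
		cap = sd->smt_gain / sd->span_weight;
	rcu_read_unlock();

	return cap;
}
```

The value is copied out before rcu_read_unlock(), so nothing touches the
sched domain after the read-side section ends.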
Hi Steve,

On Thu, Aug 18, 2016 at 06:55:41PM -0700, Steve Muckle wrote:
> PELT scales its util_sum and util_avg values via
> arch_scale_cpu_capacity(). If that function is passed the CPU's sched
> domain then it will reduce the scaling capacity if SD_SHARE_CPUCAPACITY
> is set. PELT does not pass in the sd however. The other caller of
> arch_scale_cpu_capacity, update_cpu_capacity(), does. This means
> util_sum and util_avg scale beyond the CPU capacity on SMT.
>
> On an Intel i7-3630QM for example rq->cpu_capacity_orig is 589 but
> util_avg scales up to 1024.

I can't convince myself whether this is the right thing to do. SMT is a
bit 'special' and it depends on how you model SMT capacity.

I'm no SMT expert, but the way I understand the current SMT capacity
model is that capacity_orig represents the capacity of the SMT-thread
when all its thread-siblings are busy. The true capacity of an
SMT-thread where all thread-siblings are idle is actually 1024, but we
don't model this (it would be a nightmare to track when the capacity
should change). The capacity of a core with two or more SMT-threads is
chosen to be 1024 + smt_gain, where smt_gain is supposed to represent
the additional throughput we gain for the additional SMT-threads. The
reason why we don't have 1024 per thread is that we would prefer to have
only one task per core if possible.

With util_avg scaling to 1024 a core (capacity = 2*589) would be nearly
'full' with just one always-running task. If we change util_avg to max
out at 589, it would take two always-running tasks for the combined
utilization to match the core capacity. So we may lose some bias
towards spreading for SMT systems.

AFAICT, group_is_overloaded() and group_has_capacity() would both be
affected by this patch.

Interestingly, Vincent recently proposed to set the SMT-thread capacity
to 1024 which would effectively make all the current SMT code redundant.
It would make things a lot simpler, but I'm not sure if we can get away
with it. It would need discussion at least.

Opinions?

Morten

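The "nearly full with one task" argument is plain arithmetic on the
numbers quoted above (589 per SMT thread, two threads per core). A
minimal sketch, where core_fill_percent() is a hypothetical helper and
not a scheduler function:

```c
#include <assert.h>

/*
 * Percentage of a core's capacity used by nr_tasks always-running
 * tasks, each contributing util_max to the summed utilization.
 * Integer division truncates, as kernel-style fixed-point code would.
 */
static unsigned long core_fill_percent(unsigned long util_max,
				       unsigned int nr_tasks,
				       unsigned long thread_capacity,
				       unsigned int nr_threads)
{
	unsigned long core_capacity = thread_capacity * nr_threads;

	assert(core_capacity > 0);

	return util_max * nr_tasks * 100 / core_capacity;
}
```

With util_avg topping out at 1024, one always-running task fills
core_fill_percent(1024, 1, 589, 2) = 86 percent of the core. With the
ceiling lowered to 589, one task fills 50 percent and it takes two to
reach 100.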
On Fri, Aug 19, 2016 at 04:30:39PM +0100, Morten Rasmussen wrote:
> Hi Steve,
>
> On Thu, Aug 18, 2016 at 06:55:41PM -0700, Steve Muckle wrote:
> > PELT scales its util_sum and util_avg values via
> > arch_scale_cpu_capacity(). If that function is passed the CPU's sched
> > domain then it will reduce the scaling capacity if SD_SHARE_CPUCAPACITY
> > is set. PELT does not pass in the sd however. The other caller of
> > arch_scale_cpu_capacity, update_cpu_capacity(), does. This means
> > util_sum and util_avg scale beyond the CPU capacity on SMT.
> >
> > On an Intel i7-3630QM for example rq->cpu_capacity_orig is 589 but
> > util_avg scales up to 1024.
>
> I can't convince myself whether this is the right thing to do. SMT is a
> bit 'special' and it depends on how you model SMT capacity.
>
> I'm no SMT expert, but the way I understand the current SMT capacity
> model is that capacity_orig represents the capacity of the SMT-thread
> when all its thread-siblings are busy. The true capacity of an
> SMT-thread where all thread-siblings are idle is actually 1024, but we
> don't model this (it would be a nightmare to track when the capacity
> should change). The capacity of a core with two or more SMT-threads is
> chosen to be 1024 + smt_gain, where smt_gain is supposed to represent
> the additional throughput we gain for the additional SMT-threads. The
> reason why we don't have 1024 per thread is that we would prefer to have
> only one task per core if possible.
>
> With util_avg scaling to 1024 a core (capacity = 2*589) would be nearly
> 'full' with just one always-running task. If we change util_avg to max
> out at 589, it would take two always-running tasks for the combined
> utilization to match the core capacity. So we may lose some bias
> towards spreading for SMT systems.
>
> AFAICT, group_is_overloaded() and group_has_capacity() would both be
> affected by this patch.
>
> Interestingly, Vincent recently proposed to set the SMT-thread capacity
> to 1024 which would effectively make all the current SMT code redundant.
> It would make things a lot simpler, but I'm not sure if we can get away
> with it. It would need discussion at least.
>
> Opinions?

Thanks for having a look.

The reason I pushed this patch was to address an issue with the
schedutil governor - demand is effectively doubled on SMT systems due to
the above scheme. But this can just be fixed for schedutil by using a
max value there consistent with what __update_load_avg() is using. I'll
send another patch. It looks like there's a good reason for the current
PELT scaling w.r.t. SMT in the scheduler/load balancer.

thanks,
Steve

On Fri, Aug 19, 2016 at 04:00:57PM +0100, Dietmar Eggemann wrote:
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 61d485421bed..95d34b337152 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -2731,7 +2731,7 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
> >  	sa->last_update_time = now;
> >
> >  	scale_freq = arch_scale_freq_capacity(NULL, cpu);
> > -	scale_cpu = arch_scale_cpu_capacity(NULL, cpu);
> > +	scale_cpu = arch_scale_cpu_capacity(cpu_rq(cpu)->sd, cpu);
>
> Wouldn't you have to subscribe to this rcu pointer rq->sd w/ something
> like 'rcu_dereference(cpu_rq(cpu)->sd)'?
>
> IMHO, __update_load_avg() is called outside existing RCU read-side
> critical sections as well so there would be a pair of
> rcu_read_lock()/rcu_read_unlock() required in this case.

Thanks Dietmar for the review.

Yeah I didn't consider that this was protected with rcu. It looks like
I'm abandoning this approach anyway though and doing something limited
just to schedutil.

thanks,
Steve

2016-08-19 23:30 GMT+08:00 Morten Rasmussen <morten.rasmussen@arm.com>:
> Hi Steve,
>
> On Thu, Aug 18, 2016 at 06:55:41PM -0700, Steve Muckle wrote:
>> PELT scales its util_sum and util_avg values via
>> arch_scale_cpu_capacity(). If that function is passed the CPU's sched
>> domain then it will reduce the scaling capacity if SD_SHARE_CPUCAPACITY
>> is set. PELT does not pass in the sd however. The other caller of
>> arch_scale_cpu_capacity, update_cpu_capacity(), does. This means
>> util_sum and util_avg scale beyond the CPU capacity on SMT.
>>
>> On an Intel i7-3630QM for example rq->cpu_capacity_orig is 589 but
>> util_avg scales up to 1024.
>
> I can't convince myself whether this is the right thing to do. SMT is a
> bit 'special' and it depends on how you model SMT capacity.
>
> I'm no SMT expert, but the way I understand the current SMT capacity
> model is that capacity_orig represents the capacity of the SMT-thread
> when all its thread-siblings are busy. The true capacity of an
> SMT-thread where all thread-siblings are idle is actually 1024, but we
> don't model this (it would be a nightmare to track when the capacity
> should change). The capacity of a core with two or more SMT-threads is
> chosen to be 1024 + smt_gain, where smt_gain is supposed to represent
> the additional throughput we gain for the additional SMT-threads. The
> reason why we don't have 1024 per thread is that we would prefer to have
> only one task per core if possible.

Agreed, maybe the capacity of an SMT-thread where all thread-siblings
are idle can be 1024 + smt_gain with the latest IA technology.
http://www.intel.com/content/www/us/en/architecture-and-technology/turbo-boost/turbo-boost-max-technology.html

Regards,
Wanpeng Li

On Fri, Aug 19, 2016 at 04:30:39PM +0100, Morten Rasmussen wrote:
> I can't convince myself whether this is the right thing to do. SMT is a
> bit 'special' and it depends on how you model SMT capacity.
>
> I'm no SMT expert, but the way I understand the current SMT capacity
> model is that capacity_orig represents the capacity of the SMT-thread
> when all its thread-siblings are busy.

Correct. Has a weird side effect if you have >2 siblings and unplug
some, but not symmetrically. Rather uncommon case though.

> The true capacity of an
> SMT-thread where all thread-siblings are idle is actually 1024, but we
> don't model this (it would be a nightmare to track when the capacity
> should change).

Right, so we have some dynamics in the capacity, but doing things like
that (and the power7 asymmetric SMT) requires changing the capacity of
other CPUs, which gets to be real interesting real quick.

The current dynamics are limited to CPU local things, like having RT
tasks eat time.

> The capacity of a core with two or more SMT-threads is
> chosen to be 1024 + smt_gain, where smt_gain is supposed to represent the

(1024 * smt_gain) >> 10

> additional throughput we gain for the additional SMT-threads. The reason
> why we don't have 1024 per thread is that we would prefer to have only
> one task per core if possible.

Not really, it stems from the fact that 1024 used to (and still might in
some places) represent 1 (nice-0) task (at 100% utilization).

And if you have SMT you really don't want to stick 2 tasks on if you can
do differently. Simply because 2 threads on a core do not get the same
throughput (in general) as 2 cores do.

Now, these days SD_PREFER_SIBLING might actually be the main force that
gets us 1 task per core if possible. We no longer use the capacity stuff
to compute how many tasks we can run (with exception of
update_numa_stats it seems).

> With util_avg scaling to 1024 a core (capacity = 2*589) would be nearly
> 'full' with just one always-running task. If we change util_avg to max
> out at 589, it would take two always-running tasks for the combined
> utilization to match the core capacity. So we may lose some bias
> towards spreading for SMT systems.

Right, so this is always going to be a bit weird, as util numbers shrink
under load. Therefore they too shrink when you saturate a core with SMT
threads.

> AFAICT, group_is_overloaded() and group_has_capacity() would both be
> affected by this patch.
>
> Interestingly, Vincent recently proposed to set the SMT-thread capacity
> to 1024 which would effectively make all the current SMT code redundant.
> It would make things a lot simpler, but I'm not sure if we can get away
> with it. It would need discussion at least.
>
> Opinions?

Time I go stare at SMT again I suppose.. :-)

On Wed, Aug 31, 2016 at 03:07:20PM +0200, Peter Zijlstra wrote:
> On Fri, Aug 19, 2016 at 04:30:39PM +0100, Morten Rasmussen wrote:
> > I can't convince myself whether this is the right thing to do. SMT is a
> > bit 'special' and it depends on how you model SMT capacity.
> >
> > I'm no SMT expert, but the way I understand the current SMT capacity
> > model is that capacity_orig represents the capacity of the SMT-thread
> > when all its thread-siblings are busy.
>
> Correct. Has a weird side effect if you have >2 siblings and unplug
> some, but not symmetrically. Rather uncommon case though.
>
> > The true capacity of an
> > SMT-thread where all thread-siblings are idle is actually 1024, but we
> > don't model this (it would be a nightmare to track when the capacity
> > should change).
>
> Right, so we have some dynamics in the capacity, but doing things like
> that (and the power7 asymmetric SMT) requires changing the capacity of
> other CPUs, which gets to be real interesting real quick.
>
> The current dynamics are limited to CPU local things, like having RT
> tasks eat time.
>
> > The capacity of a core with two or more SMT-threads is
> > chosen to be 1024 + smt_gain, where smt_gain is supposed to represent the
>
> (1024 * smt_gain) >> 10

Looking at the code it seems that we just use smt_gain as the core
capacity, so the SMT capacity is simply sd->smt_gain/sd->span_weight,
where sd->smt_gain is initialized to 1178 by default.

But it really doesn't matter ;-)

> > additional throughput we gain for the additional SMT-threads. The reason
> > why we don't have 1024 per thread is that we would prefer to have only
> > one task per core if possible.
>
> Not really, it stems from the fact that 1024 used to (and still might in
> some places) represent 1 (nice-0) task (at 100% utilization).
>
> And if you have SMT you really don't want to stick 2 tasks on if you can
> do differently. Simply because 2 threads on a core do not get the same
> throughput (in general) as 2 cores do.

Agreed, that is what I failed to communicate above.

> Now, these days SD_PREFER_SIBLING might actually be the main force that
> gets us 1 task per core if possible. We no longer use the capacity stuff
> to compute how many tasks we can run (with exception of
> update_numa_stats it seems).

Okay. I think the load_above_capacity stuff still does that and we tried
to get rid of that a while back. If we can rely on SD_PREFER_SIBLING
alone, it would certainly make things simpler.

> > With util_avg scaling to 1024 a core (capacity = 2*589) would be nearly
> > 'full' with just one always-running task. If we change util_avg to max
> > out at 589, it would take two always-running tasks for the combined
> > utilization to match the core capacity. So we may lose some bias
> > towards spreading for SMT systems.
>
> Right, so this is always going to be a bit weird, as util numbers shrink
> under load. Therefore they too shrink when you saturate a core with SMT
> threads.

Shouldn't utilization increase, not shrink, if you saturate more SMT
threads? The effective throughput of each SMT thread should reduce when
more threads are saturated so the utilization should go up since
utilization is time-based?

> > AFAICT, group_is_overloaded() and group_has_capacity() would both be
> > affected by this patch.
> >
> > Interestingly, Vincent recently proposed to set the SMT-thread capacity
> > to 1024 which would effectively make all the current SMT code redundant.
> > It would make things a lot simpler, but I'm not sure if we can get away
> > with it. It would need discussion at least.
> >
> > Opinions?
>
> Time I go stare at SMT again I suppose.. :-)

I'm afraid so.

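The behaviour discussed in this thread (full scale with no sched domain,
smt_gain/span_weight when SD_SHARE_CPUCAPACITY is set) can be modelled
in user space as below. This is a sketch built from the descriptions in
the thread, not a copy of the kernel function: model_scale_cpu_capacity()
is a hypothetical name, the struct carries only the fields the sketch
needs, and the flag value is illustrative.

```c
#include <assert.h>
#include <stddef.h>

#define SCHED_CAPACITY_SCALE	1024UL
/* Illustrative flag value for this model only, not taken from the kernel. */
#define SD_SHARE_CPUCAPACITY	0x1

/* Only the fields this sketch needs. */
struct sched_domain {
	int flags;
	unsigned long smt_gain;		/* 1178 by default, per the thread */
	unsigned int span_weight;	/* number of CPUs in the domain */
};

/*
 * Model of the capacity lookup: with no domain (the NULL that PELT
 * passes) the result is the full scale; with an SMT domain it is
 * smt_gain divided across the sibling threads.
 */
static unsigned long model_scale_cpu_capacity(const struct sched_domain *sd)
{
	if (sd && (sd->flags & SD_SHARE_CPUCAPACITY) && sd->span_weight > 1) {
		assert(sd->smt_gain > 0);
		return sd->smt_gain / sd->span_weight;
	}

	return SCHED_CAPACITY_SCALE;
}
```

For a two-thread SMT domain with the default smt_gain this gives
1178 / 2 = 589, the cpu_capacity_orig reported for the i7-3630QM, while
the NULL case gives the 1024 that util_avg scales up to.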
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 61d485421bed..95d34b337152 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2731,7 +2731,7 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
 	sa->last_update_time = now;
 
 	scale_freq = arch_scale_freq_capacity(NULL, cpu);
-	scale_cpu = arch_scale_cpu_capacity(NULL, cpu);
+	scale_cpu = arch_scale_cpu_capacity(cpu_rq(cpu)->sd, cpu);
 
 	/* delta_w is the amount already accumulated against our next period */
 	delta_w = sa->period_contrib;

PELT scales its util_sum and util_avg values via
arch_scale_cpu_capacity(). If that function is passed the CPU's sched
domain then it will reduce the scaling capacity if SD_SHARE_CPUCAPACITY
is set. PELT does not pass in the sd however. The other caller of
arch_scale_cpu_capacity, update_cpu_capacity(), does. This means
util_sum and util_avg scale beyond the CPU capacity on SMT.

On an Intel i7-3630QM for example rq->cpu_capacity_orig is 589 but
util_avg scales up to 1024.

Fix this by passing in the sd in __update_load_avg() as well.

Signed-off-by: Steve Muckle <smuckle@linaro.org>
---
 kernel/sched/fair.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)