Message ID | 1458606068-7476-1-git-send-email-smuckle@linaro.org (mailing list archive) |
---|---|
State | RFC |
Hi Steve,

these patches fall into the bucket of 'optimization of updating the
value only if the root cfs_rq util has changed' as discussed in '[PATCH
5/8] sched/cpufreq: pass sched class into cpufreq_update_util' of Mike
T's current series '[PATCH 0/8] schedutil enhancements', right?

I wonder if it makes sense to apply them before a proper 'capacity vote
aggregation from CFS/RT/DL' has been agreed upon?

Otherwise I agree with the changes in your 3 patches (inc. "[RFC PATCH]
sched/fair: call cpufreq hook in additional paths") to only invoke
cpufreq_update_util() if &rq->cfs.avg.util_avg has really changed.

On 03/22/2016 01:21 AM, Steve Muckle wrote:
> The cpufreq hook should be called whenever the root cfs_rq
> utilization changes so update_cfs_rq_load_avg() is a better
> place for it. The current location is not invoked in the
> enqueue_entity() or update_blocked_averages() paths.
>
> Suggested-by: Vincent Guittot <vincent.guittot@linaro.org>
> Signed-off-by: Steve Muckle <smuckle@linaro.org>
> ---
>  kernel/sched/fair.c | 50 ++++++++++++++++++++++++++------------------------
>  1 file changed, 26 insertions(+), 24 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 46d64e4ccfde..d418deb04049 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -2825,7 +2825,9 @@ static inline u64 cfs_rq_clock_task(struct cfs_rq *cfs_rq);
>  static inline int update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
>  {
>  	struct sched_avg *sa = &cfs_rq->avg;
> +	struct rq *rq = rq_of(cfs_rq);
>  	int decayed, removed = 0;
> +	int cpu = cpu_of(rq);
>
>  	if (atomic_long_read(&cfs_rq->removed_load_avg)) {
>  		s64 r = atomic_long_xchg(&cfs_rq->removed_load_avg, 0);
> @@ -2840,7 +2842,7 @@ static inline int update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
>  		sa->util_sum = max_t(s32, sa->util_sum - r * LOAD_AVG_MAX, 0);
>  	}
>
> -	decayed = __update_load_avg(now, cpu_of(rq_of(cfs_rq)), sa,
> +	decayed = __update_load_avg(now, cpu, sa,
>  		scale_load_down(cfs_rq->load.weight), cfs_rq->curr != NULL, cfs_rq);

Why did you change these 3 lines above? You reverted this back in "[RFC
PATCH] sched/fair: call cpufreq hook in additional paths".

[...]

--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Hi Dietmar,

On 03/28/2016 05:02 AM, Dietmar Eggemann wrote:
> Hi Steve,
>
> these patches fall into the bucket of 'optimization of updating the
> value only if the root cfs_rq util has changed' as discussed in '[PATCH
> 5/8] sched/cpufreq: pass sched class into cpufreq_update_util' of Mike
> T's current series '[PATCH 0/8] schedutil enhancements', right?

I would say just the second patch is an optimization. The first and
third patches cover additional paths in CFS where the hook should be
called but currently is not, which I think is a correctness issue.

> I wonder if it makes sense to apply them before a proper 'capacity vote
> aggregation from CFS/RT/DL' has been agreed upon?

Getting the right call sites for the hook in CFS should be orthogonal to
the sched class vote aggregation IMO.

> Otherwise I agree with the changes in your 3 patches (inc. "[RFC PATCH]
> sched/fair: call cpufreq hook in additional paths") to only invoke
> cpufreq_update_util() if &rq->cfs.avg.util_avg has really changed.
>
> On 03/22/2016 01:21 AM, Steve Muckle wrote:
>> The cpufreq hook should be called whenever the root cfs_rq
>> utilization changes so update_cfs_rq_load_avg() is a better
>> place for it. The current location is not invoked in the
>> enqueue_entity() or update_blocked_averages() paths.
[...]
>> -	decayed = __update_load_avg(now, cpu_of(rq_of(cfs_rq)), sa,
>> +	decayed = __update_load_avg(now, cpu, sa,
>>  		scale_load_down(cfs_rq->load.weight), cfs_rq->curr != NULL,
>>  		cfs_rq);
>
> Why did you change these 3 lines above? You reverted this back in "[RFC
> PATCH] sched/fair: call cpufreq hook in additional paths".

If all three patches are accepted in principle I can restructure them if
so desired. I did not want to introduce a dependency between them and
patch 2 represents the cleanest implementation at that point.

Thanks for the review!

thanks,
Steve
On 03/28/2016 06:34 PM, Steve Muckle wrote:
> Hi Dietmar,
>
> On 03/28/2016 05:02 AM, Dietmar Eggemann wrote:
>> these patches fall into the bucket of 'optimization of updating the
>> value only if the root cfs_rq util has changed' as discussed in '[PATCH
>> 5/8] sched/cpufreq: pass sched class into cpufreq_update_util' of Mike
>> T's current series '[PATCH 0/8] schedutil enhancements', right?
>
> I would say just the second patch is an optimization. The first and
> third patches cover additional paths in CFS where the hook should be
> called but currently is not, which I think is a correctness issue.

Not disagreeing here but I don't know if this level of accuracy is
really needed. I mean we currently miss updates in
enqueue_task_fair()->enqueue_entity()->enqueue_entity_load_avg() and
idle_balance()/rebalance_domains()->update_blocked_averages() but there
are plenty of call sites of update_load_avg(se, ...) with
'&rq_of(cfs_rq_of(se))->cfs == cfs_rq_of(se)'.

The question for me is: does schedutil work better with this new, more
accurate signal? IMO, not receiving a bunch of consecutive
cpufreq_update_util()'s w/ the same 'util' value is probably a good
thing, unless we see the interaction with the RT/DL class as mentioned
by Sai. Here an agreement on the design for the 'capacity vote
aggregation from CFS/RT/DL' would help to clarify.

>> I wonder if it makes sense to apply them before a proper 'capacity vote
>> aggregation from CFS/RT/DL' has been agreed upon?
>
> Getting the right call sites for the hook in CFS should be orthogonal to
> the sched class vote aggregation IMO.

Hopefully :-)

[...]

Cheers,
Dietmar
On 03/28/2016 11:30 AM, Dietmar Eggemann wrote:
> On 03/28/2016 06:34 PM, Steve Muckle wrote:
>> I would say just the second patch is an optimization. The first and
>> third patches cover additional paths in CFS where the hook should be
>> called but currently is not, which I think is a correctness issue.
>
> Not disagreeing here but I don't know if this level of accuracy is
> really needed. I mean we currently miss updates in
> enqueue_task_fair()->enqueue_entity()->enqueue_entity_load_avg() and
> idle_balance()/rebalance_domains()->update_blocked_averages() but there
> are plenty of call sites of update_load_avg(se, ...) with
> '&rq_of(cfs_rq_of(se))->cfs == cfs_rq_of(se)'.
>
> The question for me is does schedutil work better with this new, more
> accurate signal? IMO, not receiving a bunch of consecutive
> cpufreq_update_util's w/ the same 'util' value is probably a good thing,
> unless we see the interaction with RT/DL class as mentioned by Sai. Here
> an agreement on the design for the 'capacity vote aggregation from
> CFS/RT/DL' would help to clarify.

Without covering all the paths where CFS utilization changes it's
possible to have to wait up to a tick to act on some changes, since the
tick is the only guaranteed regularly-occurring instance of the hook.
That's an unacceptable amount of latency IMO...

thanks,
Steve
On Mon, Mar 28, 2016 at 12:38:26PM -0700, Steve Muckle wrote:
> Without covering all the paths where CFS utilization changes it's
> possible to have to wait up to a tick to act on some changes, since the
> tick is the only guaranteed regularly-occurring instance of the hook.
> That's an unacceptable amount of latency IMO...

Note that even with your patches that might still be the case. Remote
wakeups might not happen on the destination CPU at all, so it might not
be until the next tick (which always happens locally) that we'll
'observe' the utilization change brought with the wakeups.

We could force all the remote wakeups to IPI the destination CPU, but
that comes at a significant performance cost.
On 03/30/2016 12:35 PM, Peter Zijlstra wrote:
> On Mon, Mar 28, 2016 at 12:38:26PM -0700, Steve Muckle wrote:
>> Without covering all the paths where CFS utilization changes it's
>> possible to have to wait up to a tick to act on some changes, since the
>> tick is the only guaranteed regularly-occurring instance of the hook.
>> That's an unacceptable amount of latency IMO...
>
> Note that even with your patches that might still be the case. Remote
> wakeups might not happen on the destination CPU at all, so it might not
> be until the next tick (which always happens locally) that we'll
> 'observe' the utilization change brought with the wakeups.
>
> We could force all the remote wakeups to IPI the destination CPU, but
> that comes at a significant performance cost.

What about only IPI'ing the destination when the utilization change is
known to require a higher CPU frequency?
On Wed, Mar 30, 2016 at 06:42:20PM -0700, Steve Muckle wrote:
> On 03/30/2016 12:35 PM, Peter Zijlstra wrote:
[...]
>> We could force all the remote wakeups to IPI the destination CPU, but
>> that comes at a significant performance cost.
>
> What about only IPI'ing the destination when the utilization change is
> known to require a higher CPU frequency?

Can't, the way the wakeup path is constructed we would be sending the
IPI way before we know about utilization.
On 30 March 2016 at 21:35, Peter Zijlstra <peterz@infradead.org> wrote:
[...]
> Note that even with your patches that might still be the case. Remote
> wakeups might not happen on the destination CPU at all, so it might not
> be until the next tick (which always happens locally) that we'll
> 'observe' the utilization change brought with the wakeups.
>
> We could force all the remote wakeups to IPI the destination CPU, but
> that comes at a significant performance cost.

Isn't a reschedule IPI already sent in this case?
On Thu, Mar 31, 2016 at 11:27:22AM +0200, Vincent Guittot wrote:
[...]
> Isn't a reschedule IPI already sent in this case?

In what case? Assuming you talk about a remote wakeup, no. Only if that
wakeup results in a preemption, which isn't a given.

And we really don't want to carry the 'has util increased' information
all the way down to where we make that decision.
On 31 March 2016 at 11:34, Peter Zijlstra <peterz@infradead.org> wrote:
[...]
> In what case? Assuming you talk about a remote wakeup, no. Only if that
> wakeup results in a preemption, which isn't a given.

Yes, I was speaking about a remote wakeup.

In ttwu_queue_remote(), there is a call to smp_send_reschedule(). Is
there another way to add a remote task in the wake list?

> And we really don't want to carry the 'has util increased' information
> all the way down to where we make that decision.

Yes, I agree.
On Thu, Mar 31, 2016 at 11:50:40AM +0200, Vincent Guittot wrote:
> Yes, I was speaking about a remote wakeup.
>
> In ttwu_queue_remote(), there is a call to smp_send_reschedule(). Is
> there another way to add a remote task in the wake list?

Right, but at that point we don't yet know how much util will change.

And doing that IPI unconditionally is expensive; see:

  518cd6234178 ("sched: Only queue remote wakeups when crossing cache boundaries")

13% regression on TCP_RR.
On 31 March 2016 at 12:47, Peter Zijlstra <peterz@infradead.org> wrote:
[...]
> Right, but at that point we don't yet know how much util will change.
>
> And doing that IPI unconditionally is expensive; see:
>
>   518cd6234178 ("sched: Only queue remote wakeups when crossing cache boundaries")
>
> 13% regression on TCP_RR.

OK. In fact, I looked for a sequence where the utilization of a rq is
not updated until the next tick but I can't find it.

If the cpu doesn't share cache, the task is added to the wake list, an
IPI is sent and the utilization is updated. Otherwise, we directly
enqueue the task on the rq and the utilization is updated.

Is there another path?
On Thu, Mar 31, 2016 at 02:14:50PM +0200, Vincent Guittot wrote:
> In fact, I looked for a sequence where the utilization of a rq is not
> updated until the next tick but I can't find it.

No, util is always updated, however..

> If the cpu doesn't share cache, the task is added to the wake list, an
> IPI is sent and the utilization is updated.

Here we run:

  ttwu_do_activate()
    ttwu_activate()
      activate_task()
        enqueue_task()
          p->sched_class->enqueue_task() := enqueue_task_fair()
            update_load_avg()
              update_cfs_rq_load_avg()
                cfs_rq_util_change()

on the local CPU, and we can indeed call out to have the frequency
changed.

> Otherwise, we directly enqueue the task on the rq and the utilization
> is updated.

But here we run it on a remote cpu, so we cannot call out and the
frequency remains the same.

So if a remote wakeup on the same LLC domain happens, utilization will
increase but we will not observe it until the next tick.
On 31 March 2016 at 14:34, Peter Zijlstra <peterz@infradead.org> wrote:
[...]
> But here we run it on a remote cpu, so we cannot call out and the
> frequency remains the same.
>
> So if a remote wakeup on the same LLC domain happens, utilization will
> increase but we will not observe it until the next tick.

OK. I forgot that we have the condition cpu == smp_processor_id() in
cfs_rq_util_change().
On 03/31/2016 12:37 AM, Peter Zijlstra wrote:
> On Wed, Mar 30, 2016 at 06:42:20PM -0700, Steve Muckle wrote:
>> What about only IPI'ing the destination when the utilization change is
>> known to require a higher CPU frequency?
>
> Can't, the way the wakeup path is constructed we would be sending the
> IPI way before we know about utilization.

Sorry, I thought we were referring to the possibility of sending an IPI
to just run the cpufreq driver rather than to conduct the whole wakeup
operation.

My thinking was in CFS we get rid of the (cpu == smp_processor_id())
condition for calling the cpufreq hook.

The sched governor can then calculate utilization and frequency required
for cpu. If (cpu == smp_processor_id()), the update is processed
normally. If (cpu != smp_processor_id()) and the new frequency is higher
than cpu's Fcur, the sched gov IPIs cpu to continue running the update
operation. Otherwise, the update is dropped.

Does that sound plausible?
On Thu, Mar 31, 2016 at 02:26:06PM -0700, Steve Muckle wrote:
[...]
> The sched governor can then calculate utilization and frequency required
> for cpu. If (cpu == smp_processor_id()), the update is processed
> normally. If (cpu != smp_processor_id()) and the new frequency is higher
> than cpu's Fcur, the sched gov IPIs cpu to continue running the update
> operation. Otherwise, the update is dropped.
>
> Does that sound plausible?

Can be done I suppose..
Hi Rafael,

On 04/01/2016 02:20 AM, Peter Zijlstra wrote:
>>> My thinking was in CFS we get rid of the (cpu == smp_processor_id())
>>> condition for calling the cpufreq hook.
>>>
>>> The sched governor can then calculate utilization and frequency required
>>> for cpu. If (cpu == smp_processor_id()), the update is processed
>>> normally. If (cpu != smp_processor_id()) and the new frequency is higher
>>> than cpu's Fcur, the sched gov IPIs cpu to continue running the update
>>> operation. Otherwise, the update is dropped.
>>>
>>> Does that sound plausible?
>
> Can be done I suppose..

Currently we drop schedutil updates for a target CPU which do not occur
on that CPU.

Is this solely due to platforms which must run the cpufreq driver on the
target CPU?

Are there also shared cpufreq policies where the driver needs to run on
any CPU in the affected policy/freq domain?

thanks,
Steve
On Mon, Apr 11, 2016 at 9:28 PM, Steve Muckle <steve.muckle@linaro.org> wrote:
[...]
> Currently we drop schedutil updates for a target CPU which do not occur
> on that CPU.
>
> Is this solely due to platforms which must run the cpufreq driver on the
> target CPU?

The current code assumes that the CPU running the update will always
be the one that gets updated. Anything else would require extra
synchronization.

> Are there also shared cpufreq policies where the driver needs to run on
> any CPU in the affected policy/freq domain?

Yes, there are, AFAICS, but drivers are expected to cope with that (if
I understand the question correctly).

Thanks,
Rafael
On Mon, Apr 11, 2016 at 11:20 PM, Rafael J. Wysocki <rafael@kernel.org> wrote:
[...]
>> Currently we drop schedutil updates for a target CPU which do not occur
>> on that CPU.
>>
>> Is this solely due to platforms which must run the cpufreq driver on the
>> target CPU?
>
> The current code assumes that the CPU running the update will always
> be the one that gets updated. Anything else would require extra
> synchronization.

This is rather fundamental.

For example, if you look at cpufreq_update_util(), it does this:

  data = rcu_dereference_sched(*this_cpu_ptr(&cpufreq_update_util_data));

meaning that it will run the current CPU's utilization update
callback. Of course, that won't work cross-CPU, because in principle
different CPUs may use different governors and therefore different
util update callbacks.

If you want to do remote updates, I guess that will require an
irq_work to run the update on the target CPU, but then you'll probably
want to neglect the rate limit on it as well, so it looks like a
"need_update" flag in struct update_util_data will be useful for that.

I think I can prototype something along these lines, but can you
please tell me more about the case you have in mind?

Thanks,
Rafael
On Tue, Apr 12, 2016 at 04:29:06PM +0200, Rafael J. Wysocki wrote:
[...]
> This is rather fundamental.
>
> For example, if you look at cpufreq_update_util(), it does this:
>
>   data = rcu_dereference_sched(*this_cpu_ptr(&cpufreq_update_util_data));
>
> meaning that it will run the current CPU's utilization update
> callback. Of course, that won't work cross-CPU, because in principle
> different CPUs may use different governors and therefore different
> util update callbacks.
>
> If you want to do remote updates, I guess that will require an
> irq_work to run the update on the target CPU, but then you'll probably
> want to neglect the rate limit on it as well, so it looks like a
> "need_update" flag in struct update_util_data will be useful for that.
>
> I think I can prototype something along these lines, but can you
> please tell me more about the case you have in mind?

I'm concerned generally with the latency to react to changes in
required capacity due to remote wakeups, which are quite common on SMP
platforms with shared cache. Unless the hook is called it could take
up to a tick to react AFAICS if the target CPU is running some other
task that does not get preempted by the wakeup.

That's a potentially long time for say UI-critical applications and
seems like a lost opportunity for us to leverage closer
scheduler-cpufreq communication to get better performance.

thanks,
Steve
On Tue, Apr 12, 2016 at 9:38 PM, Steve Muckle <steve.muckle@linaro.org> wrote: > On Tue, Apr 12, 2016 at 04:29:06PM +0200, Rafael J. Wysocki wrote: >> On Mon, Apr 11, 2016 at 11:20 PM, Rafael J. Wysocki <rafael@kernel.org> wrote: >> > On Mon, Apr 11, 2016 at 9:28 PM, Steve Muckle <steve.muckle@linaro.org> wrote: >> >> Hi Rafael, >> >> >> >> On 04/01/2016 02:20 AM, Peter Zijlstra wrote: >> >>>> > My thinking was in CFS we get rid of the (cpu == smp_processor_id()) >> >>>> > condition for calling the cpufreq hook. >> >>>> > >> >>>> > The sched governor can then calculate utilization and frequency required >> >>>> > for cpu. If (cpu == smp_processor_id()), the update is processed >> >>>> > normally. If (cpu != smp_processor_id()) and the new frequency is higher >> >>>> > than cpu's Fcur, the sched gov IPIs cpu to continue running the update >> >>>> > operation. Otherwise, the update is dropped. >> >>>> > >> >>>> > Does that sound plausible? >> >>> >> >>> Can be done I suppose.. >> >> >> >> Currently we drop schedutil updates for a target CPU which do not occur >> >> on that CPU. >> >> >> >> Is this solely due to platforms which must run the cpufreq driver on the >> >> target CPU? >> > >> > The current code assumes that the CPU running the update will always >> > be the one that gets updated. Anything else would require extra >> > synchronization. >> >> This is rather fundamental. >> >> For example, if you look at cpufreq_update_util(), it does this: >> >> data = rcu_dereference_sched(*this_cpu_ptr(&cpufreq_update_util_data)); >> >> meaning that it will run the current CPU's utilization update >> callback. Of course, that won't work cross-CPU, because in principle >> different CPUs may use different governors and therefore different >> util update callbacks. 
>> If you want to do remote updates, I guess that will require an
>> irq_work to run the update on the target CPU, but then you'll probably
>> want to neglect the rate limit on it as well, so it looks like a
>> "need_update" flag in struct update_util_data will be useful for that.
>>
>> I think I can prototype something along these lines, but can you
>> please tell me more about the case you have in mind?
>
> I'm concerned generally with the latency to react to changes in
> required capacity due to remote wakeups, which are quite common on SMP
> platforms with shared cache. Unless the hook is called it could take
> up to a tick to react AFAICS if the target CPU is running some other
> task that does not get preempted by the wakeup.

So the scenario seems to be that CPU A is running task X and CPU B
wakes up task Y on it remotely, but that task has to wait for CPU A to
get to it, so you want to increase the frequency of CPU A at the
wakeup time so as to reduce the time the woken up task has to wait.

In that case task X would not be giving the CPU away (ie. no
invocations of schedule()) for the whole tick, so it would be
CPU/memory bound. In that case I would expect CPU A to be running at
full capacity already unless this is the first tick period in which
task X behaves this way which looks like a corner case to me.

Moreover, sending an IPI to CPU A in that case looks like the right
thing to do to me anyway.

Thanks,
Rafael
On 04/13/2016 07:45 AM, Rafael J. Wysocki wrote:
>> I'm concerned generally with the latency to react to changes in
>> required capacity due to remote wakeups, which are quite common on SMP
>> platforms with shared cache. Unless the hook is called it could take
>> up to a tick to react AFAICS if the target CPU is running some other
>> task that does not get preempted by the wakeup.
>
> So the scenario seems to be that CPU A is running task X and CPU B
> wakes up task Y on it remotely, but that task has to wait for CPU A to
> get to it, so you want to increase the frequency of CPU A at the
> wakeup time so as to reduce the time the woken up task has to wait.
>
> In that case task X would not be giving the CPU away (ie. no
> invocations of schedule()) for the whole tick, so it would be
> CPU/memory bound. In that case I would expect CPU A to be running at
> full capacity already unless this is the first tick period in which
> task X behaves this way which looks like a corner case to me.

This situation is fairly common in bursty workloads (such as UI driven
ones).

> Moreover, sending an IPI to CPU A in that case looks like the right
> thing to do to me anyway.

Sorry I didn't follow - sending an IPI to do what exactly? Perform the
wakeup operation on the target CPU?

thanks,
Steve
On Wed, Apr 13, 2016 at 7:53 PM, Steve Muckle <steve.muckle@linaro.org> wrote:
> On 04/13/2016 07:45 AM, Rafael J. Wysocki wrote:
>>> I'm concerned generally with the latency to react to changes in
>>> required capacity due to remote wakeups, which are quite common on SMP
>>> platforms with shared cache. Unless the hook is called it could take
>>> up to a tick to react AFAICS if the target CPU is running some other
>>> task that does not get preempted by the wakeup.
>>
>> So the scenario seems to be that CPU A is running task X and CPU B
>> wakes up task Y on it remotely, but that task has to wait for CPU A to
>> get to it, so you want to increase the frequency of CPU A at the
>> wakeup time so as to reduce the time the woken up task has to wait.
>>
>> In that case task X would not be giving the CPU away (ie. no
>> invocations of schedule()) for the whole tick, so it would be
>> CPU/memory bound. In that case I would expect CPU A to be running at
>> full capacity already unless this is the first tick period in which
>> task X behaves this way which looks like a corner case to me.
>
> This situation is fairly common in bursty workloads (such as UI driven
> ones).
>
>> Moreover, sending an IPI to CPU A in that case looks like the right
>> thing to do to me anyway.
>
> Sorry I didn't follow - sending an IPI to do what exactly? Perform the
> wakeup operation on the target CPU?

Basically, to run a frequency update. You can combine that with the
wakeup itself, though, I suppose.
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 46d64e4ccfde..d418deb04049 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2825,7 +2825,9 @@ static inline u64 cfs_rq_clock_task(struct cfs_rq *cfs_rq);
 static inline int update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
 {
 	struct sched_avg *sa = &cfs_rq->avg;
+	struct rq *rq = rq_of(cfs_rq);
 	int decayed, removed = 0;
+	int cpu = cpu_of(rq);
 
 	if (atomic_long_read(&cfs_rq->removed_load_avg)) {
 		s64 r = atomic_long_xchg(&cfs_rq->removed_load_avg, 0);
@@ -2840,7 +2842,7 @@ static inline int update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
 		sa->util_sum = max_t(s32, sa->util_sum - r * LOAD_AVG_MAX, 0);
 	}
 
-	decayed = __update_load_avg(now, cpu_of(rq_of(cfs_rq)), sa,
+	decayed = __update_load_avg(now, cpu, sa,
 		scale_load_down(cfs_rq->load.weight), cfs_rq->curr != NULL, cfs_rq);
 
 #ifndef CONFIG_64BIT
@@ -2848,28 +2850,6 @@ static inline int update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
 	cfs_rq->load_last_update_time_copy = sa->last_update_time;
 #endif
 
-	return decayed || removed;
-}
-
-/* Update task and its cfs_rq load average */
-static inline void update_load_avg(struct sched_entity *se, int update_tg)
-{
-	struct cfs_rq *cfs_rq = cfs_rq_of(se);
-	u64 now = cfs_rq_clock_task(cfs_rq);
-	struct rq *rq = rq_of(cfs_rq);
-	int cpu = cpu_of(rq);
-
-	/*
-	 * Track task load average for carrying it to new CPU after migrated, and
-	 * track group sched_entity load average for task_h_load calc in migration
-	 */
-	__update_load_avg(now, cpu, &se->avg,
-		se->on_rq * scale_load_down(se->load.weight),
-		cfs_rq->curr == se, NULL);
-
-	if (update_cfs_rq_load_avg(now, cfs_rq) && update_tg)
-		update_tg_load_avg(cfs_rq, 0);
-
 	if (cpu == smp_processor_id() && &rq->cfs == cfs_rq) {
 		unsigned long max = rq->cpu_capacity_orig;
 
@@ -2890,8 +2870,30 @@ static inline void update_load_avg(struct sched_entity *se, int update_tg)
 		 * See cpu_util().
 		 */
 		cpufreq_update_util(rq_clock(rq),
-				    min(cfs_rq->avg.util_avg, max), max);
+				    min(sa->util_avg, max), max);
 	}
+
+	return decayed || removed;
+}
+
+/* Update task and its cfs_rq load average */
+static inline void update_load_avg(struct sched_entity *se, int update_tg)
+{
+	struct cfs_rq *cfs_rq = cfs_rq_of(se);
+	u64 now = cfs_rq_clock_task(cfs_rq);
+	struct rq *rq = rq_of(cfs_rq);
+	int cpu = cpu_of(rq);
+
+	/*
+	 * Track task load average for carrying it to new CPU after migrated, and
+	 * track group sched_entity load average for task_h_load calc in migration
+	 */
+	__update_load_avg(now, cpu, &se->avg,
+		se->on_rq * scale_load_down(se->load.weight),
+		cfs_rq->curr == se, NULL);
+
+	if (update_cfs_rq_load_avg(now, cfs_rq) && update_tg)
+		update_tg_load_avg(cfs_rq, 0);
 }
 
 static void attach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
The cpufreq hook should be called whenever the root cfs_rq
utilization changes so update_cfs_rq_load_avg() is a better
place for it. The current location is not invoked in the
enqueue_entity() or update_blocked_averages() paths.

Suggested-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Steve Muckle <smuckle@linaro.org>
---
 kernel/sched/fair.c | 50 ++++++++++++++++++++++++++------------------------
 1 file changed, 26 insertions(+), 24 deletions(-)