Message ID | 20191011134500.235736-1-douglas.raillard@arm.com (mailing list archive) |
---|---|
Headers | show |
Series | sched/cpufreq: Make schedutil energy aware | expand |
On Fri, Oct 11, 2019 at 02:44:54PM +0100, Douglas RAILLARD wrote: > This has been ligthly tested with a rtapp task ramping from 10% to 75% > utilisation on a big core. Results are improved by fast ramp-up > EWMA [1], since it greatly reduces the oscillation in frequency at first > idle when ramping up. > > [1] [PATCH] sched/fair: util_est: fast ramp-up EWMA on utilization increases > Message-ID: <20190620150555.15717-1-patrick.bellasi@arm.com> > https://lore.kernel.org/lkml/20190620150555.15717-1-patrick.bellasi@arm.com/ I don't really see anything fundamentally weird here. Any actual numbers or other means of quantifying the improvement these patches bring?
Hi Peter, On 10/14/19 3:53 PM, Peter Zijlstra wrote: > On Fri, Oct 11, 2019 at 02:44:54PM +0100, Douglas RAILLARD wrote: >> This has been ligthly tested with a rtapp task ramping from 10% to 75% >> utilisation on a big core. Results are improved by fast ramp-up >> EWMA [1], since it greatly reduces the oscillation in frequency at first >> idle when ramping up. >> >> [1] [PATCH] sched/fair: util_est: fast ramp-up EWMA on utilization increases >> Message-ID: <20190620150555.15717-1-patrick.bellasi@arm.com> >> https://lore.kernel.org/lkml/20190620150555.15717-1-patrick.bellasi@arm.com/ > > > I don't really see anything fundamentally weird here. Any actual numbers > or other means of quantifying the improvement these patches bring? > I posted some numbers based on a similar experiment on the v2 of that series that are still applicable: TL;DR the rt-app negative slack is divided by 1.75 by this series, with an increase of 3% in total energy consumption. There is a burst every 0.6s, and the average power consumption increase is proportional to the average number of bursts. The workload is an rt-app task ramping up from 5% to 75% util in one big step, pinned on a big core. The whole cycle is 0.6s long (0.3s at 5% followed by 0.3s at 75%). This cycle is repeated 20 times and the average of boosting is taken. The test device is a Google Pixel 3 (Qcom Snapdragon 845) phone. It has a lot more OPPs than a hikey 960, so gradations in boosting are better reflected on frequency selection. avg slack (higher=better): Average time between task sleep and its next periodic activation. See rt-app doc: https://github.com/scheduler-tools/rt-app/blob/9a50d76f726d7c325c82ac8c7ed9ed70e1c97937/doc/tutorial.txt#L631 avg negative slack (lower in absolute value=better): Same as avg slack, but only taking into account negative values. Negative slack means a task activation did not have enough time to complete before the next periodic activation fired, which is what we want to avoid. boost energy overhead (lower=better): Extra power consumption induced by ramp boost, assuming continuous OPP space (infinite number of OPP) and single-CPU policies. In practice, fixed number of OPP decrease this value, and more CPU per policy increases it, since boost(policy) = max(boost(cpu) foreach cpu of policy)). Without ramp boost: +--------------------+--------------------+ |avg slack (us) |avg negative slack | | |(us) | +--------------------+--------------------+ |6598.72 |-10217.13 | |6595.49 |-10200.13 | |6613.72 |-10401.06 | |6600.29 |-9860.872 | |6605.53 |-10057.64 | |6612.05 |-10267.50 | |6599.01 |-9939.60 | |6593.79 |-9445.633 | |6613.56 |-10276.75 | |6595.44 |-9751.770 | +--------------------+--------------------+ |average | +--------------------+--------------------+ |6602.76 |-10041.81 | +--------------------+--------------------+ With ramp boost enabled: +--------------------+--------------------+--------------------+ |boost energy |avg slack (us) |avg negative slack | |overhead (%) | |(us) | +--------------------+--------------------+--------------------+ |3.05 |7148.93 |-5664.26 | |3.04 |7144.69 |-5667.77 | |3.05 |7149.05 |-5698.31 | |2.97 |7126.71 |-6040.23 | |3.02 |7140.28 |-5826.78 | |3.03 |7135.11 |-5749.62 | |3.05 |7140.24 |-5750.0 | |3.05 |7144.84 |-5667.04 | |3.07 |7157.30 |-5656.65 | |3.06 |7154.65 |-5653.76 | +--------------------+--------------------+--------------------+ |average | +--------------------+--------------------+--------------------+ |3.039000 |7144.18 |-5737.44 | +--------------------+--------------------+--------------------+ The negative slack is due to missed activations while the utilization signals increase during the big utilization step. Ramp boost is designed to boost frequency during that phase, which materializes in 1.75 less negative slack, for an extra power consumption under 3%. Regards, Douglas
On Mon, Oct 14, 2019 at 04:50:24PM +0100, Douglas Raillard wrote: > I posted some numbers based on a similar experiment on the v2 of that series that > are still applicable: > > TL;DR the rt-app negative slack is divided by 1.75 by this series, with an > increase of 3% in total energy consumption. There is a burst every 0.6s, and > the average power consumption increase is proportional to the average number > of bursts. > > > The workload is an rt-app task ramping up from 5% to 75% util in one big step, > pinned on a big core. The whole cycle is 0.6s long (0.3s at 5% followed by 0.3s at 75%). > This cycle is repeated 20 times and the average of boosting is taken. > > The test device is a Google Pixel 3 (Qcom Snapdragon 845) phone. > It has a lot more OPPs than a hikey 960, so gradations in boosting > are better reflected on frequency selection. > > avg slack (higher=better): > Average time between task sleep and its next periodic activation. > See rt-app doc: https://github.com/scheduler-tools/rt-app/blob/9a50d76f726d7c325c82ac8c7ed9ed70e1c97937/doc/tutorial.txt#L631 > > avg negative slack (lower in absolute value=better): > Same as avg slack, but only taking into account negative values. > Negative slack means a task activation did not have enough time to complete before the next > periodic activation fired, which is what we want to avoid. > > boost energy overhead (lower=better): > Extra power consumption induced by ramp boost, assuming continuous OPP space (infinite number of OPP) > and single-CPU policies. In practice, fixed number of OPP decrease this value, and more CPU per policy increases it, > since boost(policy) = max(boost(cpu) foreach cpu of policy)). > > Without ramp boost: > +--------------------+--------------------+ > |avg slack (us) |avg negative slack | > | |(us) | > +--------------------+--------------------+ > |6598.72 |-10217.13 | > |6595.49 |-10200.13 | > |6613.72 |-10401.06 | > |6600.29 |-9860.872 | > |6605.53 |-10057.64 | > |6612.05 |-10267.50 | > |6599.01 |-9939.60 | > |6593.79 |-9445.633 | > |6613.56 |-10276.75 | > |6595.44 |-9751.770 | > +--------------------+--------------------+ > |average | > +--------------------+--------------------+ > |6602.76 |-10041.81 | > +--------------------+--------------------+ > > > With ramp boost enabled: > +--------------------+--------------------+--------------------+ > |boost energy |avg slack (us) |avg negative slack | > |overhead (%) | |(us) | > +--------------------+--------------------+--------------------+ > |3.05 |7148.93 |-5664.26 | > |3.04 |7144.69 |-5667.77 | > |3.05 |7149.05 |-5698.31 | > |2.97 |7126.71 |-6040.23 | > |3.02 |7140.28 |-5826.78 | > |3.03 |7135.11 |-5749.62 | > |3.05 |7140.24 |-5750.0 | > |3.05 |7144.84 |-5667.04 | > |3.07 |7157.30 |-5656.65 | > |3.06 |7154.65 |-5653.76 | > +--------------------+--------------------+--------------------+ > |average | > +--------------------+--------------------+--------------------+ > |3.039000 |7144.18 |-5737.44 | > +--------------------+--------------------+--------------------+ > > > The negative slack is due to missed activations while the utilization signals > increase during the big utilization step. Ramp boost is designed to boost frequency during > that phase, which materializes in 1.75 less negative slack, for an extra power > consumption under 3%. OK, so I think I see what it is doing, and why. Normally we use (map_util_freq): freq = C * max_freq * util / max ; C=1.25 But here, when util is increasing, we effectively increase our C to allow picking a higher OPP. Because of that higher OPP we finish our work sooner (avg slack increases) and miss our activation less often (avg neg slack decreases). Now, the thing is, we use map_util_freq() in more places, and should we not reflect this increase in C for all of them? That is, why is this patch changing get_next_freq() and not map_util_freq(). I don't think that question is answered in the Changelogs. Exactly because it does change the energy consumption (it must) should that not also be reflected in the EAS logic? I'm still thinking about the exact means you're using to raise C; that is, the 'util - util_est' as cost_margin. It hurts my brain still.
On Thursday 17 Oct 2019 at 11:50:15 (+0200), Peter Zijlstra wrote: > Now, the thing is, we use map_util_freq() in more places, and should we > not reflect this increase in C for all of them? That is, why is this > patch changing get_next_freq() and not map_util_freq(). > > I don't think that question is answered in the Changelogs. > > Exactly because it does change the energy consumption (it must) should > that not also be reflected in the EAS logic? Right that shouldn't hurt and keep things consistent. That probably won't have a huge impact in practice (the boost should be != 0 only when the util signals haven't converged IIUC, which is a case where the EAS calculation is already 'wrong' anyway), but that still feels like the right thing to do. > I'm still thinking about the exact means you're using to raise C; that > is, the 'util - util_est' as cost_margin. It hurts my brain still. +1 ... Thanks, Quentin
On Thu, Oct 17, 2019 at 12:11:16PM +0100, Quentin Perret wrote: > On Thursday 17 Oct 2019 at 11:50:15 (+0200), Peter Zijlstra wrote: > > Now, the thing is, we use map_util_freq() in more places, and should we > > not reflect this increase in C for all of them? That is, why is this > > patch changing get_next_freq() and not map_util_freq(). > > > > I don't think that question is answered in the Changelogs. > > > > Exactly because it does change the energy consumption (it must) should > > that not also be reflected in the EAS logic? > > Right that shouldn't hurt and keep things consistent. That probably > won't have a huge impact in practice (the boost should be != 0 only when > the util signals haven't converged IIUC, which is a case where the EAS > calculation is already 'wrong' anyway), but that still feels like the > right thing to do. It only boosts when 'rq->cfs.avg.util' increases while 'rq->cfs.avg.util_est.enqueued' remains unchanged (and util > util_est obv). This condition can be true for select_task_rq_fair(), because that is ran before we do enqueue_task_fair() (for obvious raisins). > > I'm still thinking about the exact means you're using to raise C; that > > is, the 'util - util_est' as cost_margin. It hurts my brain still. > > +1 ... cost_i = capacity_i / power_i ; for the i-th OPP We then do: 'x = util - util_avg' and use that on the first OPP >= min_freq: cost(x) = cost_j + cost_j * x ; freq_j >= min_freq = cost_j * (1 + x) And then we find the highest OPP that has: cost_k <= cost(x) Now, 'P = C*V^2*f', which under 'V ~ f' turns into 'P ~ f^3'. (this assumption is important because we don't have V_i, but know that when f increases, V also increases and the linear relation is the simplest) This then gives us: capacity_i = f_i / f_max power_i ~ f_i ^ 3 cost_i = capacity_i / power_i ~ (f_i / f_max) / f_i^3 ~ 1 / (f_max * f_i^2) (your changelog already called if efficiency, but then went on calling it cost; as per the above, you see that higher frequencies have lower efficiency, as expected) cost(x) then turns into something like: cost(x) = cost_j * (1 + x) ~ (1 + x) / (f_max * f_j^2) We then get the following equation (assuming inf. OPPs): cost_k = cost(x) 1 / (f_max * f_k^2) = (1 + x) / (f_max * f_j^2) From which we can solve f_k: f_k = f_j / sqrt(1 + x) ; x = util - util_est Which, given positive 'x' results in f_k < f_j, IOW. we're selecting a lower frequency. Given that 'cost' really is efficiency, and we've made the equations about selecting a higher efficiency, that makes sense in so far that it would always end up at the knee in the graph (our most efficient OPP). Is this really what we're wanting to do?
Hi Peter, On 10/17/19 10:50 AM, Peter Zijlstra wrote: > On Mon, Oct 14, 2019 at 04:50:24PM +0100, Douglas Raillard wrote: > >> I posted some numbers based on a similar experiment on the v2 of that series that >> are still applicable: >> >> TL;DR the rt-app negative slack is divided by 1.75 by this series, with an >> increase of 3% in total energy consumption. There is a burst every 0.6s, and >> the average power consumption increase is proportional to the average number >> of bursts. >> >> >> The workload is an rt-app task ramping up from 5% to 75% util in one big step, >> pinned on a big core. The whole cycle is 0.6s long (0.3s at 5% followed by 0.3s at 75%). >> This cycle is repeated 20 times and the average of boosting is taken. >> >> The test device is a Google Pixel 3 (Qcom Snapdragon 845) phone. >> It has a lot more OPPs than a hikey 960, so gradations in boosting >> are better reflected on frequency selection. >> >> avg slack (higher=better): >> Average time between task sleep and its next periodic activation. >> See rt-app doc: https://github.com/scheduler-tools/rt-app/blob/9a50d76f726d7c325c82ac8c7ed9ed70e1c97937/doc/tutorial.txt#L631 >> >> avg negative slack (lower in absolute value=better): >> Same as avg slack, but only taking into account negative values. >> Negative slack means a task activation did not have enough time to complete before the next >> periodic activation fired, which is what we want to avoid. >> >> boost energy overhead (lower=better): >> Extra power consumption induced by ramp boost, assuming continuous OPP space (infinite number of OPP) >> and single-CPU policies. In practice, fixed number of OPP decrease this value, and more CPU per policy increases it, >> since boost(policy) = max(boost(cpu) foreach cpu of policy)). >> >> Without ramp boost: >> +--------------------+--------------------+ >> |avg slack (us) |avg negative slack | >> | |(us) | >> +--------------------+--------------------+ >> |6598.72 |-10217.13 | >> |6595.49 |-10200.13 | >> |6613.72 |-10401.06 | >> |6600.29 |-9860.872 | >> |6605.53 |-10057.64 | >> |6612.05 |-10267.50 | >> |6599.01 |-9939.60 | >> |6593.79 |-9445.633 | >> |6613.56 |-10276.75 | >> |6595.44 |-9751.770 | >> +--------------------+--------------------+ >> |average | >> +--------------------+--------------------+ >> |6602.76 |-10041.81 | >> +--------------------+--------------------+ >> >> >> With ramp boost enabled: >> +--------------------+--------------------+--------------------+ >> |boost energy |avg slack (us) |avg negative slack | >> |overhead (%) | |(us) | >> +--------------------+--------------------+--------------------+ >> |3.05 |7148.93 |-5664.26 | >> |3.04 |7144.69 |-5667.77 | >> |3.05 |7149.05 |-5698.31 | >> |2.97 |7126.71 |-6040.23 | >> |3.02 |7140.28 |-5826.78 | >> |3.03 |7135.11 |-5749.62 | >> |3.05 |7140.24 |-5750.0 | >> |3.05 |7144.84 |-5667.04 | >> |3.07 |7157.30 |-5656.65 | >> |3.06 |7154.65 |-5653.76 | >> +--------------------+--------------------+--------------------+ >> |average | >> +--------------------+--------------------+--------------------+ >> |3.039000 |7144.18 |-5737.44 | >> +--------------------+--------------------+--------------------+ >> >> >> The negative slack is due to missed activations while the utilization signals >> increase during the big utilization step. Ramp boost is designed to boost frequency during >> that phase, which materializes in 1.75 less negative slack, for an extra power >> consumption under 3%. > > OK, so I think I see what it is doing, and why. > > Normally we use (map_util_freq): > > freq = C * max_freq * util / max ; C=1.25 > > But here, when util is increasing, we effectively increase our C to > allow picking a higher OPP. Because of that higher OPP we finish our > work sooner (avg slack increases) Indeed. The increase in avg slack is much smaller than the absolute decrease in negative slack, which means we don't boost when that's not necessary (i.e. on activations where the slack is already positive). > and miss our activation less often > (avg neg slack decreases). We actually miss the activation as often, but by a lesser amount of time, which is desirable. Preventing from missing activations altogether could probably be achieved by very aggressive boosting, at which point it might be better/easier to just assume util=1024 and let it decrease to its correct value. None of that sounds very appealing in the general case though. > Now, the thing is, we use map_util_freq() in more places, and should we > not reflect this increase in C for all of them? That is, why is this > patch changing get_next_freq() and not map_util_freq(). > > I don't think that question is answered in the Changelogs. > > Exactly because it does change the energy consumption (it must) should > that not also be reflected in the EAS logic? map_util_freq() is only used in schedutil and in EAS by compute_energy(), so I retarget the question at compute_energy(). It is supposed to compute the energy consumed by a given CPU if a given task was migrated on it. This implies some assumptions on util signals: 1) util(_est.enqueued) of the task is representative of the task's duty cycle (the semantic of util is somehow blurry for aperiodic tasks AFAIK). 2) util of the task is CPU-invariant 3) task util + target CPU util = new target CPU util for the foreseeable future, i.e. the amount of future we can predict with reasonable accuracy. Otherwise we would end up moving things around all the time. Since ramp boost is designed to be transient, integrating it (indirectly) in "target CPU util" will add some noise to the placement decision, potentially rendering it obsolete as soon as the boosting stops. Basing a costly decision on that does not sound like a good idea IMHO. > I'm still thinking about the exact means you're using to raise C; that > is, the 'util - util_est' as cost_margin. It hurts my brain still. util_est is currently the best approximation of the actual portion of the CPU the task needs: 1) for periodic tasks, it's not too far from the duty cycle, and is always higher 2) for aperiodic tasks, it (indirectly) takes into account the total time it took to complete the previous activation, so the signal is not 100% composed of logical signals only relevant for periodic tasks (although it's a big part of it). 3) Point 1) and 2) together allows util_est to adapt to periodic tasks that changes their duty cycle over time, without needing a very long history (the last task period is sufficient). For periodic tasks, the distance between instantaneous util_avg and the actual task duty cycle indicates somehow what is our best guess of the (potential) change in the task duty cycle. util_est is the threshold (assuming util_avg increasing) for util_avg after which we know for sure that even if the task stopped right now, its duty cycle would be higher than during the previous period. This means for a given task and with (util >= util_est): 1) util - util_est == 0 means the task duty cycle will be equal to the one during during the previous activation, if the tasks stopped executing right now. 2) util - util_est > 0 means the task duty cycle will be higher to the one during during the previous activation, if the tasks stopped executing right now. Using the difference (util - util_est) will therefore give these properties to the boost signal: * no boost will be applied as long as the task has a constant or decreasing duty cycle. * when we can detect that the duty cycle increases, we temporarily increase the frequency. We start with a slight increase, and the longer we wait for the current period to finish, the more we boost, since the more likely it is that the task has a much larger duty cycle than anticipated. More specifically, the evaluation of "how much more" is done the exact same way as it is done for PELT, since the dynamic of the boost is "inherited" from PELT. Now if the task is aperiodic, the boost will allow reaching the highest frequency faster, which may or may not be desired. Ultimately, it's not more or less wrong than just picking the freq based on util_est alone, since util_est is already somewhat meaningless for aperiodic tasks. It just allows reaching the max freq at some point without waiting for too long, which is all what we can do without more info on the task. When applying these boosting rules on the runqueue util signals, we are able to detect if at least one task needs boosting according to these rules. That only holds as long as the history we look at is the result of a stable set of tasks, i.e. no tasks added or removed from the rq. Regards, Douglas
On Thu, Oct 17, 2019 at 03:23:04PM +0100, Douglas Raillard wrote: > On 10/17/19 10:50 AM, Peter Zijlstra wrote: > > Now, the thing is, we use map_util_freq() in more places, and should we > > not reflect this increase in C for all of them? That is, why is this > > patch changing get_next_freq() and not map_util_freq(). > > > > I don't think that question is answered in the Changelogs. > > > > Exactly because it does change the energy consumption (it must) should > > that not also be reflected in the EAS logic? > > map_util_freq() is only used in schedutil and in EAS by compute_energy(), so > I retarget the question at compute_energy(). It is supposed to compute > the energy consumed by a given CPU if a given task was migrated on it. > This implies some assumptions on util signals: > 1) util(_est.enqueued) of the task is representative of the task's > duty cycle (the semantic of util is somehow blurry for aperiodic tasks > AFAIK). > 2) util of the task is CPU-invariant ( we know this to be false, but do indeed make this assumption because simplicity, taking IPC differences into account would just complicate things more ) > 3) task util + target CPU util = new target CPU util for the > foreseeable future, i.e. the amount of future we can predict with > reasonable accuracy. Otherwise we would end up moving things around > all the time. > > Since ramp boost is designed to be transient, integrating it > (indirectly) in "target CPU util" will add some noise to the placement > decision, potentially rendering it obsolete as soon as the boosting > stops. Basing a costly decision on that does not sound like a good > idea IMHO. Well, we _hope_ recent past is a reasonable predictor for the near future. We of course know it'll be complete crap every so often, but hope that on average it is true more than false :-) Anyway, the above seems like a sensible enough argument, and seems worthy of being part of the Changelog. Also maybe a comment in schedutil as to why there is a map_util_freq() modifier there.
On Thu, Oct 17, 2019 at 03:23:04PM +0100, Douglas Raillard wrote: > On 10/17/19 10:50 AM, Peter Zijlstra wrote: > > I'm still thinking about the exact means you're using to raise C; that > > is, the 'util - util_est' as cost_margin. It hurts my brain still. > > util_est is currently the best approximation of the actual portion of the CPU the task needs: > 1) for periodic tasks, it's not too far from the duty cycle, and is always higher > > 2) for aperiodic tasks, it (indirectly) takes into account the total time it took > to complete the previous activation, so the signal is not 100% composed of logical signals > only relevant for periodic tasks (although it's a big part of it). > > 3) Point 1) and 2) together allows util_est to adapt to periodic tasks that changes > their duty cycle over time, without needing a very long history (the last task period > is sufficient). > > For periodic tasks, the distance between instantaneous util_avg and the actual task > duty cycle indicates somehow what is our best guess of the (potential) change in the task > duty cycle. > > util_est is the threshold (assuming util_avg increasing) for util_avg after which we know > for sure that even if the task stopped right now, its duty cycle would be higher than > during the previous period. > This means for a given task and with (util >= util_est): > > 1) util - util_est == 0 means the task duty cycle will be equal to the one during > during the previous activation, if the tasks stopped executing right now. > > 2) util - util_est > 0 means the task duty cycle will be higher to the one during > during the previous activation, if the tasks stopped executing right now. So far I can follow, 2) is indeed a fairly sane indication that utilization is growing. > Using the difference (util - util_est) will therefore give these properties to the boost signal: > * no boost will be applied as long as the task has a constant or decreasing duty cycle. > > * when we can detect that the duty cycle increases, we temporarily increase the frequency. > We start with a slight increase, and the longer we wait for the current period to finish, > the more we boost, since the more likely it is that the task has a much larger duty cycle > than anticipated. More specifically, the evaluation of "how much more" is done the exact > same way as it is done for PELT, since the dynamic of the boost is "inherited" from PELT. Right, because as long it keeps running, util_est will not be changed, so the difference will continue to increase. What I don't see is how that that difference makes sense as input to: cost(x) : (1 + x) * cost_j I suppose that limits the additional OPP to twice the previously selected cost / efficiency (see the confusion from that other email). But given that efficency drops (or costs rise) for higher OPPs that still doesn't really make sense.. > Now if the task is aperiodic, the boost will allow reaching the highest frequency faster, > which may or may not be desired. Ultimately, it's not more or less wrong than just picking > the freq based on util_est alone, since util_est is already somewhat meaningless for aperiodic > tasks. It just allows reaching the max freq at some point without waiting for too long, which is > all what we can do without more info on the task. > > When applying these boosting rules on the runqueue util signals, we are able to detect if at least one > task needs boosting according to these rules. That only holds as long as the history we look at is > the result of a stable set of tasks, i.e. no tasks added or removed from the rq. So while I agree that 2) is a reasonable signal to work from, everything that comes after is still much confusing me.
On 17/10/2019 16:11, Peter Zijlstra wrote: > On Thu, Oct 17, 2019 at 12:11:16PM +0100, Quentin Perret wrote: [...] > It only boosts when 'rq->cfs.avg.util' increases while > 'rq->cfs.avg.util_est.enqueued' remains unchanged (and util > util_est > obv). > > This condition can be true for select_task_rq_fair(), because that is > ran before we do enqueue_task_fair() (for obvious raisins). > >>> I'm still thinking about the exact means you're using to raise C; that >>> is, the 'util - util_est' as cost_margin. It hurts my brain still. >> >> +1 ... > > cost_i = capacity_i / power_i ; for the i-th OPP I get confused by this definition. efficiency=capacity/power but the cs->cost value used in em_pd_get_higher_freq() is defined as cs_cost = cs->power * cpu_max_freq / cs->freq [energy_model.h] > We then do: 'x = util - util_avg' and use that on the first > OPP >= min_freq: > > cost(x) = cost_j + cost_j * x ; freq_j >= min_freq > = cost_j * (1 + x) > > And then we find the highest OPP that has: > > cost_k <= cost(x) [...]
On Fri, Oct 18, 2019 at 09:44:44AM +0200, Dietmar Eggemann wrote: > On 17/10/2019 16:11, Peter Zijlstra wrote: > > On Thu, Oct 17, 2019 at 12:11:16PM +0100, Quentin Perret wrote: > > [...] > > > It only boosts when 'rq->cfs.avg.util' increases while > > 'rq->cfs.avg.util_est.enqueued' remains unchanged (and util > util_est > > obv). > > > > This condition can be true for select_task_rq_fair(), because that is > > ran before we do enqueue_task_fair() (for obvious raisins). > > > >>> I'm still thinking about the exact means you're using to raise C; that > >>> is, the 'util - util_est' as cost_margin. It hurts my brain still. > >> > >> +1 ... > > > > cost_i = capacity_i / power_i ; for the i-th OPP > > I get confused by this definition. efficiency=capacity/power but the > cs->cost value used in em_pd_get_higher_freq() is defined as > > cs_cost = cs->power * cpu_max_freq / cs->freq [energy_model.h] Well, chalk that up to confusion inspired by the Changelog of patch 1. Let me redo it with that formula then.
On Fri, Oct 18, 2019 at 09:44:44AM +0200, Dietmar Eggemann wrote: > On 17/10/2019 16:11, Peter Zijlstra wrote: > > On Thu, Oct 17, 2019 at 12:11:16PM +0100, Quentin Perret wrote: > > [...] > > > It only boosts when 'rq->cfs.avg.util' increases while > > 'rq->cfs.avg.util_est.enqueued' remains unchanged (and util > util_est > > obv). > > > > This condition can be true for select_task_rq_fair(), because that is > > ran before we do enqueue_task_fair() (for obvious raisins). > > > >>> I'm still thinking about the exact means you're using to raise C; that > >>> is, the 'util - util_est' as cost_margin. It hurts my brain still. > >> > >> +1 ... > > > > cost_i = capacity_i / power_i ; for the i-th OPP > > I get confused by this definition. efficiency=capacity/power but the > cs->cost value used in em_pd_get_higher_freq() is defined as > > cs_cost = cs->power * cpu_max_freq / cs->freq [energy_model.h] cost_i = power_i * f_max / f_i cost(x) = cost_j * (1 + x) ; f_j >= min_freq cost_k <= cost(x) P = C*V^2*f, V ~ f -> P ~ f^3 cost_i ~ f_i^3 * f_max / f_i = f_i^2 * f_max cost(x) = (1 + x) * f_j^2 * f_max cost_k = cost(x) f_k^2 * f_max = (1 + x) * f_j^2 * f_max f_k = sqrt(1 + x) * f_j Which does indeed make more sense... However, I still struggle with using our 'x = util - util_est' as input for an OPP specific increase.
On 10/17/19 8:07 PM, Peter Zijlstra wrote: > On Thu, Oct 17, 2019 at 03:23:04PM +0100, Douglas Raillard wrote: >> On 10/17/19 10:50 AM, Peter Zijlstra wrote: > >>> I'm still thinking about the exact means you're using to raise C; that >>> is, the 'util - util_est' as cost_margin. It hurts my brain still. >> >> util_est is currently the best approximation of the actual portion of the CPU the task needs: >> 1) for periodic tasks, it's not too far from the duty cycle, and is always higher >> >> 2) for aperiodic tasks, it (indirectly) takes into account the total time it took >> to complete the previous activation, so the signal is not 100% composed of logical signals >> only relevant for periodic tasks (although it's a big part of it). >> >> 3) Point 1) and 2) together allows util_est to adapt to periodic tasks that changes >> their duty cycle over time, without needing a very long history (the last task period >> is sufficient). >> >> For periodic tasks, the distance between instantaneous util_avg and the actual task >> duty cycle indicates somehow what is our best guess of the (potential) change in the task >> duty cycle. >> >> util_est is the threshold (assuming util_avg increasing) for util_avg after which we know >> for sure that even if the task stopped right now, its duty cycle would be higher than >> during the previous period. >> This means for a given task and with (util >= util_est): >> >> 1) util - util_est == 0 means the task duty cycle will be equal to the one during >> during the previous activation, if the tasks stopped executing right now. >> >> 2) util - util_est > 0 means the task duty cycle will be higher to the one during >> during the previous activation, if the tasks stopped executing right now. > > So far I can follow, 2) is indeed a fairly sane indication that > utilization is growing. > >> Using the difference (util - util_est) will therefore give these properties to the boost signal: >> * no boost will be applied as long as the task has a constant or decreasing duty cycle. >> >> * when we can detect that the duty cycle increases, we temporarily increase the frequency. >> We start with a slight increase, and the longer we wait for the current period to finish, >> the more we boost, since the more likely it is that the task has a much larger duty cycle >> than anticipated. More specifically, the evaluation of "how much more" is done the exact >> same way as it is done for PELT, since the dynamic of the boost is "inherited" from PELT. > > Right, because as long it keeps running, util_est will not be changed, > so the difference will continue to increase. > > What I don't see is how that that difference makes sense as input to: > > cost(x) : (1 + x) * cost_j The actual input is: x = (EM_COST_MARGIN_SCALE/SCHED_CAPACITY_SCALE) * (util - util_est) Since EM_COST_MARGIN_SCALE == SCHED_CAPACITY_SCALE == 1024, this factor of 1 is not directly reflected in the code but is important for units consistency. Non-zero x means that we are going to allow spending more energy than what schedutil currently thinks of being the minimal energy. (it's actually not that minimal, since we have this 1.25 factor, plus the fact that we use util_est.enqueued, which will always over-estimate the actual util of the task). > > I suppose that limits the additional OPP to twice the previously > selected cost / efficiency (see the confusion from that other email). > But given that efficency drops (or costs rise) for higher OPPs that > still doesn't really make sense.. Yes, this current limit to +100% freq boosting is somehow arbitrary and could probably benefit from being tunable in some way (Kconfig option maybe). When (margin > 0), we end up selecting an OPP that has a higher cost than the one strictly required, which is expected. The goal is to speed things up at the expense of more power consumed to achieve the same work, hence at a lower efficiency (== higher cost). That's the main reason why this boosting apply a margin on the cost of the selected OPP rather than just inflating the util. This allows controlling directly how much more power (battery life) we are going to spend to achieve some work that we know could be achieved with less power. This "how much more" is the margin. A policy for such boosting must obviously be quite picky in when it decides to boost (not boosting all the time). Decreasing the amount of negative slack is one situation where spending a bit more power to ensure the work is done in time can be more important than just efficiency, within reasonable limits. (the eternal efficiency vs throughput vs latency trade-off). > >> Now if the task is aperiodic, the boost will allow reaching the highest frequency faster, >> which may or may not be desired. Ultimately, it's not more or less wrong than just picking >> the freq based on util_est alone, since util_est is already somewhat meaningless for aperiodic >> tasks. It just allows reaching the max freq at some point without waiting for too long, which is >> all what we can do without more info on the task. >> >> When applying these boosting rules on the runqueue util signals, we are able to detect if at least one >> task needs boosting according to these rules. That only holds as long as the history we look at is >> the result of a stable set of tasks, i.e. no tasks added or removed from the rq. > > So while I agree that 2) is a reasonable signal to work from, everything > that comes after is still much confusing me. "Now if the task is aperiodic ...": What I mean by that is that AFAIK, there isn't really any fundamentally right or wrong way of choosing frequencies if the tasks we are dealing with are not periodic, unless you add more constraints to the problem. We currently base the decision on an overestimation of some kind of average running time per second of the task. The average being the EWMA implemented by PELT, not to be confused with util_est.ewma that adds an extra EWMA on top. What "window" of time used for that average (or EWMA half life in our case) will change the value of the average for aperiodic tasks. The choice of that half life is driven by how much of the task history we want to take into account. That is driven by how often we expect tasks to change their "execution profile" on average so to speak (a thread pool picking disparate work items at random would change its profile very often for example). Once this window or half life is chosen, we can ensure that the CPU will use a frequency high enough to avoid work piling up more than what can be computed using a simple proportional controller. The goal of the schedutil controller is to make sure that: current CPU capa == current util * 1.25 All that to say that in the aperiodic case, some pieces of the setup are directly provided by the policy (PELT half life), which are empirically determined to perform well, without any way of computing an provably optimal value (unless we know for sure exactly when a task is going to change its workload and CPU share it will require). For periodic tasks, we can compute the exact frequency that will lead to using 100% of the CPU just by looking at the duty cycle of the tasks, and that's more or less what schedutil does. "When applying these boosting rules on the runqueue util signals ...": Assuming the set of enqueued tasks stays the same between 2 observations from schedutil, if we see the rq util_avg increase above its util_est.enqueued, that means that at least one task had its util_avg go above util_est.enqueued. We might miss some boosting opportunities if some (util - util_est) compensates: TASK_1(util - util_est) = - TASK_2(util - util_est) but working on the aggregated value is much easier in schedutil, to avoid crawling the list of entities.
On Fri, Oct 18, 2019 at 12:46:25PM +0100, Douglas Raillard wrote: > > What I don't see is how that that difference makes sense as input to: > > > > cost(x) : (1 + x) * cost_j > > The actual input is: > x = (EM_COST_MARGIN_SCALE/SCHED_CAPACITY_SCALE) * (util - util_est) > > Since EM_COST_MARGIN_SCALE == SCHED_CAPACITY_SCALE == 1024, this factor of 1 > is not directly reflected in the code but is important for units > consistency. But completely irrelevant for the actual math and conceptual understanding. Just because computers suck at real numbers, and floats are expensive, doesn't mean we have to burden ourselves with fixed point when writing equations. Also, as a physicist I'm prone to normalizing everything to 1, because that's lazy. > > I suppose that limits the additional OPP to twice the previously > > selected cost / efficiency (see the confusion from that other email). > > But given that efficency drops (or costs rise) for higher OPPs that > > still doesn't really make sense.. > Yes, this current limit to +100% freq boosting is somehow arbitrary and > could probably benefit from being tunable in some way (Kconfig option > maybe). When (margin > 0), we end up selecting an OPP that has a higher cost > than the one strictly required, which is expected. The goal is to speed > things up at the expense of more power consumed to achieve the same work, > hence at a lower efficiency (== higher cost). No, no Kconfig knobs. > That's the main reason why this boosting apply a margin on the cost of the > selected OPP rather than just inflating the util. This allows controlling > directly how much more power (battery life) we are going to spend to achieve > some work that we know could be achieved with less power. But you're not; the margin is relative to the OPP, it is not absolute. Or rather, the only actual limit is in relation to the max OPP. So you have very little actual control over how much more energy you're spending. > > So while I agree that 2) is a reasonable signal to work from, everything > > that comes after is still much confusing me. > "When applying these boosting rules on the runqueue util signals ...": > Assuming the set of enqueued tasks stays the same between 2 observations > from schedutil, if we see the rq util_avg increase above its > util_est.enqueued, that means that at least one task had its util_avg go > above util_est.enqueued. We might miss some boosting opportunities if some > (util - util_est) compensates: > TASK_1(util - util_est) = - TASK_2(util - util_est) > but working on the aggregated value is much easier in schedutil, to avoid > crawling the list of entities. That still does not explain why 'util - util_est', when >0, makes for a sensible input into an OPP relative function. I agree that 'util - util_est', when >0, indicates utilization is increasing (for the aperiodic blah blah blah). But after that I'm still confused.
On 10/18/19 1:07 PM, Peter Zijlstra wrote: > On Fri, Oct 18, 2019 at 12:46:25PM +0100, Douglas Raillard wrote: > >>> What I don't see is how that that difference makes sense as input to: >>> >>> cost(x) : (1 + x) * cost_j >> >> The actual input is: >> x = (EM_COST_MARGIN_SCALE/SCHED_CAPACITY_SCALE) * (util - util_est) >> >> Since EM_COST_MARGIN_SCALE == SCHED_CAPACITY_SCALE == 1024, this factor of 1 >> is not directly reflected in the code but is important for units >> consistency. > > But completely irrelevant for the actual math and conceptual > understanding. > how that that difference makes sense as input to I was unsure if you referred to the units being inconsistent or the actual way of computing values being strange, so I provided some justification for both. > Just because computers suck at real numbers, and floats > are expensive, doesn't mean we have to burden ourselves with fixed point > when writing equations. > > Also, as a physicist I'm prone to normalizing everything to 1, because > that's lazy. > >>> I suppose that limits the additional OPP to twice the previously >>> selected cost / efficiency (see the confusion from that other email). >>> But given that efficency drops (or costs rise) for higher OPPs that >>> still doesn't really make sense.. > >> Yes, this current limit to +100% freq boosting is somehow arbitrary and >> could probably benefit from being tunable in some way (Kconfig option >> maybe). When (margin > 0), we end up selecting an OPP that has a higher cost >> than the one strictly required, which is expected. The goal is to speed >> things up at the expense of more power consumed to achieve the same work, >> hence at a lower efficiency (== higher cost). > > No, no Kconfig knobs. > >> That's the main reason why this boosting apply a margin on the cost of the >> selected OPP rather than just inflating the util. This allows controlling >> directly how much more power (battery life) we are going to spend to achieve >> some work that we know could be achieved with less power. > > But you're not; the margin is relative to the OPP, it is not absolute. Considering a CPU with 1024 max capacity (since we are not talking about migrations here, we can ignore CPU invariance): work = normalized number of iterations of a given busy loop # Thanks to freq invariance work = util (between 0 and 1) util = f/f_max # f(work) is the min freq that is admissible for "work", which we will # abbreviate as "f" f(work) = work * f_max # from struct em_cap_state doc in energy_model.h cost(f) = power(f) * f_max / f cost(f) = power(f) / util cost(f) = power(f) / work power(f) = cost(f) * work boosted_cost(f) = cost(f) + x boosted_power(f) = boosted_cost(f) * work boosted_power(f) = (cost(f) + x) * work # Let's normalize cost() so we can forget about f and deal only with work. cost'(work) = cost(f)/cost(f_max) x' = x/cost(f_max) boosted_power'(work) = (cost'(work) + x') * work boosted_power'(work) = cost'(work) * work + x' * work boosted_power'(work) = power'(work) + x' * work boosted_power'(work) = power'(work) + A(work) # Over a duration T, spend an extra B unit of energy B(work) = A(work) * T lost_battery_percent(work) = 100 * B(work)/total_battery_energy lost_battery_percent(work) = 100 * T * x' * work /total_battery_energy lost_battery_percent(work) = (100 * T / cost(f_max) / total_battery_energy) * x * work This means that the effect of boosting on battery life is proportional to "x" unless I made a mistake somewhere. > > Or rather, the only actual limit is in relation to the max OPP. So you > have very little actual control over how much more energy you're > spending. > >>> So while I agree that 2) is a reasonable signal to work from, everything >>> that comes after is still much confusing me. > >> "When applying these boosting rules on the runqueue util signals ...": >> Assuming the set of enqueued tasks stays the same between 2 observations >> from schedutil, if we see the rq util_avg increase above its >> util_est.enqueued, that means that at least one task had its util_avg go >> above util_est.enqueued. We might miss some boosting opportunities if some >> (util - util_est) compensates: >> TASK_1(util - util_est) = - TASK_2(util - util_est) >> but working on the aggregated value is much easier in schedutil, to avoid >> crawling the list of entities. > > That still does not explain why 'util - util_est', when >0, makes for a > sensible input into an OPP relative function > I agree that 'util - util_est', when >0, indicates utilization is > increasing (for the aperiodic blah blah blah). But after that I'm still > confused. For the same reason PELT makes a sensible input for OPP selection. Currently, OPP selection is based on max(util_avg, util_est.enqueued) (from cpu_util_cfs in sched.h), so as soon as we have (util - util_est > 0), the OPP will be selected according to util_avg. In a way, using util_avg there is already some kind of boosting. Since the boosting is essentially (util - constant), it grows the same way as util. If we think of (util - util_est) as being some estimation of how wrong we were in the estimation of the task "true" utilization of the CPU, then it makes sense to feed that to the boost. The wronger we were, the more we want to boost, because the more time passes, the more the scheduler realizes it actually does not know what the task needs. In doubt, provide a higher freq than usual until we get to know this task better. When that happens (at the next period), boosting is disabled and we revert to the usual behavior (aka margin=0). Hope we are converging to some wording that makes sense.
On Fri, 18 Oct 2019 at 16:44, Douglas Raillard <douglas.raillard@arm.com> wrote: > > > > On 10/18/19 1:07 PM, Peter Zijlstra wrote: > > On Fri, Oct 18, 2019 at 12:46:25PM +0100, Douglas Raillard wrote: > > > >>> What I don't see is how that that difference makes sense as input to: > >>> > >>> cost(x) : (1 + x) * cost_j > >> > >> The actual input is: > >> x = (EM_COST_MARGIN_SCALE/SCHED_CAPACITY_SCALE) * (util - util_est) > >> > >> Since EM_COST_MARGIN_SCALE == SCHED_CAPACITY_SCALE == 1024, this factor of 1 > >> is not directly reflected in the code but is important for units > >> consistency. > > > > But completely irrelevant for the actual math and conceptual > > understanding. > > > how that that difference makes sense as input to > I was unsure if you referred to the units being inconsistent or the > actual way of computing values being strange, so I provided some > justification for both. > > > Just because computers suck at real numbers, and floats > > are expensive, doesn't mean we have to burden ourselves with fixed point > > when writing equations. > > > > Also, as a physicist I'm prone to normalizing everything to 1, because > > that's lazy. > > > >>> I suppose that limits the additional OPP to twice the previously > >>> selected cost / efficiency (see the confusion from that other email). > >>> But given that efficency drops (or costs rise) for higher OPPs that > >>> still doesn't really make sense.. > > > >> Yes, this current limit to +100% freq boosting is somehow arbitrary and > >> could probably benefit from being tunable in some way (Kconfig option > >> maybe). When (margin > 0), we end up selecting an OPP that has a higher cost > >> than the one strictly required, which is expected. The goal is to speed > >> things up at the expense of more power consumed to achieve the same work, > >> hence at a lower efficiency (== higher cost). > > > > No, no Kconfig knobs. > > > >> That's the main reason why this boosting apply a margin on the cost of the > >> selected OPP rather than just inflating the util. This allows controlling > >> directly how much more power (battery life) we are going to spend to achieve > >> some work that we know could be achieved with less power. > > > > But you're not; the margin is relative to the OPP, it is not absolute. > > Considering a CPU with 1024 max capacity (since we are not talking about > migrations here, we can ignore CPU invariance): > > work = normalized number of iterations of a given busy loop > # Thanks to freq invariance > work = util (between 0 and 1) > util = f/f_max > > # f(work) is the min freq that is admissible for "work", which we will > # abbreviate as "f" > f(work) = work * f_max > > # from struct em_cap_state doc in energy_model.h > cost(f) = power(f) * f_max / f > cost(f) = power(f) / util > cost(f) = power(f) / work > power(f) = cost(f) * work > > boosted_cost(f) = cost(f) + x In em_pd_get_higher_freq, the boost is a % of cost(f) so it should be boosted_cost(f)=cost(f)1+ cost(f)*x > boosted_power(f) = boosted_cost(f) * work > boosted_power(f) = (cost(f) + x) * work > > # Let's normalize cost() so we can forget about f and deal only with work. > cost'(work) = cost(f)/cost(f_max) > x' = x/cost(f_max) > boosted_power'(work) = (cost'(work) + x') * work > boosted_power'(work) = cost'(work) * work + x' * work > boosted_power'(work) = power'(work) + x' * work > boosted_power'(work) = power'(work) + A(work) > > # Over a duration T, spend an extra B unit of energy > B(work) = A(work) * T > lost_battery_percent(work) = 100 * B(work)/total_battery_energy > lost_battery_percent(work) = 100 * T * x' * work /total_battery_energy > lost_battery_percent(work) = > (100 * T / cost(f_max) / total_battery_energy) * x * work > > This means that the effect of boosting on battery life is proportional > to "x" unless I made a mistake somewhere. > > > > > Or rather, the only actual limit is in relation to the max OPP. So you > > have very little actual control over how much more energy you're > > spending. > > > >>> So while I agree that 2) is a reasonable signal to work from, everything > >>> that comes after is still much confusing me. > > > >> "When applying these boosting rules on the runqueue util signals ...": > >> Assuming the set of enqueued tasks stays the same between 2 observations > >> from schedutil, if we see the rq util_avg increase above its > >> util_est.enqueued, that means that at least one task had its util_avg go > >> above util_est.enqueued. We might miss some boosting opportunities if some > >> (util - util_est) compensates: > >> TASK_1(util - util_est) = - TASK_2(util - util_est) > >> but working on the aggregated value is much easier in schedutil, to avoid > >> crawling the list of entities. > > > > That still does not explain why 'util - util_est', when >0, makes for a > > sensible input into an OPP relative function > I agree that 'util - util_est', when >0, indicates utilization is > > increasing (for the aperiodic blah blah blah). But after that I'm still > > confused. > > For the same reason PELT makes a sensible input for OPP selection. > Currently, OPP selection is based on max(util_avg, util_est.enqueued) > (from cpu_util_cfs in sched.h), so as soon as we have > (util - util_est > 0), the OPP will be selected according to util_avg. > In a way, using util_avg there is already some kind of boosting. > > Since the boosting is essentially (util - constant), it grows the same > way as util. If we think of (util - util_est) as being some estimation > of how wrong we were in the estimation of the task "true" utilization of > the CPU, then it makes sense to feed that to the boost. The wronger we > were, the more we want to boost, because the more time passes, the more > the scheduler realizes it actually does not know what the task needs. In > doubt, provide a higher freq than usual until we get to know this task > better. When that happens (at the next period), boosting is disabled and > we revert to the usual behavior (aka margin=0). > > Hope we are converging to some wording that makes sense.
On Fri, 18 Oct 2019 at 16:44, Douglas Raillard <douglas.raillard@arm.com> wrote: > > > > On 10/18/19 1:07 PM, Peter Zijlstra wrote: > > On Fri, Oct 18, 2019 at 12:46:25PM +0100, Douglas Raillard wrote: > > > >>> What I don't see is how that that difference makes sense as input to: > >>> > >>> cost(x) : (1 + x) * cost_j > >> > >> The actual input is: > >> x = (EM_COST_MARGIN_SCALE/SCHED_CAPACITY_SCALE) * (util - util_est) > >> > >> Since EM_COST_MARGIN_SCALE == SCHED_CAPACITY_SCALE == 1024, this factor of 1 > >> is not directly reflected in the code but is important for units > >> consistency. > > > > But completely irrelevant for the actual math and conceptual > > understanding. > > > how that that difference makes sense as input to > I was unsure if you referred to the units being inconsistent or the > actual way of computing values being strange, so I provided some > justification for both. > > > Just because computers suck at real numbers, and floats > > are expensive, doesn't mean we have to burden ourselves with fixed point > > when writing equations. > > > > Also, as a physicist I'm prone to normalizing everything to 1, because > > that's lazy. > > > >>> I suppose that limits the additional OPP to twice the previously > >>> selected cost / efficiency (see the confusion from that other email). > >>> But given that efficency drops (or costs rise) for higher OPPs that > >>> still doesn't really make sense.. > > > >> Yes, this current limit to +100% freq boosting is somehow arbitrary and > >> could probably benefit from being tunable in some way (Kconfig option > >> maybe). When (margin > 0), we end up selecting an OPP that has a higher cost > >> than the one strictly required, which is expected. The goal is to speed > >> things up at the expense of more power consumed to achieve the same work, > >> hence at a lower efficiency (== higher cost). > > > > No, no Kconfig knobs. > > > >> That's the main reason why this boosting apply a margin on the cost of the > >> selected OPP rather than just inflating the util. This allows controlling > >> directly how much more power (battery life) we are going to spend to achieve > >> some work that we know could be achieved with less power. > > > > But you're not; the margin is relative to the OPP, it is not absolute. > > Considering a CPU with 1024 max capacity (since we are not talking about > migrations here, we can ignore CPU invariance): > > work = normalized number of iterations of a given busy loop > # Thanks to freq invariance > work = util (between 0 and 1) > util = f/f_max > > # f(work) is the min freq that is admissible for "work", which we will > # abbreviate as "f" > f(work) = work * f_max > > # from struct em_cap_state doc in energy_model.h > cost(f) = power(f) * f_max / f > cost(f) = power(f) / util > cost(f) = power(f) / work > power(f) = cost(f) * work > > boosted_cost(f) = cost(f) + x > boosted_power(f) = boosted_cost(f) * work > boosted_power(f) = (cost(f) + x) * work > > # Let's normalize cost() so we can forget about f and deal only with work. > cost'(work) = cost(f)/cost(f_max) > x' = x/cost(f_max) > boosted_power'(work) = (cost'(work) + x') * work > boosted_power'(work) = cost'(work) * work + x' * work > boosted_power'(work) = power'(work) + x' * work > boosted_power'(work) = power'(work) + A(work) > > # Over a duration T, spend an extra B unit of energy > B(work) = A(work) * T > lost_battery_percent(work) = 100 * B(work)/total_battery_energy > lost_battery_percent(work) = 100 * T * x' * work /total_battery_energy > lost_battery_percent(work) = > (100 * T / cost(f_max) / total_battery_energy) * x * work > > This means that the effect of boosting on battery life is proportional > to "x" unless I made a mistake somewhere. Because the boost is relative to cost(f) and cost is not linear to the frequency, I don't think that it's is a linear relation. > > > > > Or rather, the only actual limit is in relation to the max OPP. So you > > have very little actual control over how much more energy you're > > spending. > > > >>> So while I agree that 2) is a reasonable signal to work from, everything > >>> that comes after is still much confusing me. > > > >> "When applying these boosting rules on the runqueue util signals ...": > >> Assuming the set of enqueued tasks stays the same between 2 observations > >> from schedutil, if we see the rq util_avg increase above its > >> util_est.enqueued, that means that at least one task had its util_avg go > >> above util_est.enqueued. We might miss some boosting opportunities if some > >> (util - util_est) compensates: > >> TASK_1(util - util_est) = - TASK_2(util - util_est) > >> but working on the aggregated value is much easier in schedutil, to avoid > >> crawling the list of entities. > > > > That still does not explain why 'util - util_est', when >0, makes for a > > sensible input into an OPP relative function > I agree that 'util - util_est', when >0, indicates utilization is > > increasing (for the aperiodic blah blah blah). But after that I'm still > > confused. > > For the same reason PELT makes a sensible input for OPP selection. > Currently, OPP selection is based on max(util_avg, util_est.enqueued) > (from cpu_util_cfs in sched.h), so as soon as we have > (util - util_est > 0), the OPP will be selected according to util_avg. > In a way, using util_avg there is already some kind of boosting. > > Since the boosting is essentially (util - constant), it grows the same > way as util. If we think of (util - util_est) as being some estimation > of how wrong we were in the estimation of the task "true" utilization of > the CPU, then it makes sense to feed that to the boost. The wronger we > were, the more we want to boost, because the more time passes, the more > the scheduler realizes it actually does not know what the task needs. In > doubt, provide a higher freq than usual until we get to know this task > better. When that happens (at the next period), boosting is disabled and > we revert to the usual behavior (aka margin=0). > > Hope we are converging to some wording that makes sense.
On 10/18/19 4:15 PM, Vincent Guittot wrote: > On Fri, 18 Oct 2019 at 16:44, Douglas Raillard <douglas.raillard@arm.com> wrote: >> >> >> >> On 10/18/19 1:07 PM, Peter Zijlstra wrote: >>> On Fri, Oct 18, 2019 at 12:46:25PM +0100, Douglas Raillard wrote: >>> >>>>> What I don't see is how that that difference makes sense as input to: >>>>> >>>>> cost(x) : (1 + x) * cost_j >>>> >>>> The actual input is: >>>> x = (EM_COST_MARGIN_SCALE/SCHED_CAPACITY_SCALE) * (util - util_est) >>>> >>>> Since EM_COST_MARGIN_SCALE == SCHED_CAPACITY_SCALE == 1024, this factor of 1 >>>> is not directly reflected in the code but is important for units >>>> consistency. >>> >>> But completely irrelevant for the actual math and conceptual >>> understanding. >> >> > how that that difference makes sense as input to >> I was unsure if you referred to the units being inconsistent or the >> actual way of computing values being strange, so I provided some >> justification for both. >> >>> Just because computers suck at real numbers, and floats >>> are expensive, doesn't mean we have to burden ourselves with fixed point >>> when writing equations. >>> >>> Also, as a physicist I'm prone to normalizing everything to 1, because >>> that's lazy. >>> >>>>> I suppose that limits the additional OPP to twice the previously >>>>> selected cost / efficiency (see the confusion from that other email). >>>>> But given that efficency drops (or costs rise) for higher OPPs that >>>>> still doesn't really make sense.. >>> >>>> Yes, this current limit to +100% freq boosting is somehow arbitrary and >>>> could probably benefit from being tunable in some way (Kconfig option >>>> maybe). When (margin > 0), we end up selecting an OPP that has a higher cost >>>> than the one strictly required, which is expected. The goal is to speed >>>> things up at the expense of more power consumed to achieve the same work, >>>> hence at a lower efficiency (== higher cost). >>> >>> No, no Kconfig knobs. >>> >>>> That's the main reason why this boosting apply a margin on the cost of the >>>> selected OPP rather than just inflating the util. This allows controlling >>>> directly how much more power (battery life) we are going to spend to achieve >>>> some work that we know could be achieved with less power. >>> >>> But you're not; the margin is relative to the OPP, it is not absolute. >> >> Considering a CPU with 1024 max capacity (since we are not talking about >> migrations here, we can ignore CPU invariance): >> >> work = normalized number of iterations of a given busy loop >> # Thanks to freq invariance >> work = util (between 0 and 1) >> util = f/f_max >> >> # f(work) is the min freq that is admissible for "work", which we will >> # abbreviate as "f" >> f(work) = work * f_max >> >> # from struct em_cap_state doc in energy_model.h >> cost(f) = power(f) * f_max / f >> cost(f) = power(f) / util >> cost(f) = power(f) / work >> power(f) = cost(f) * work >> >> boosted_cost(f) = cost(f) + x > > In em_pd_get_higher_freq, the boost is a % of cost(f) so it should be > boosted_cost(f)=cost(f)1+ cost(f)*x Good point, this means that we need to change "x" in these equations: x = cost(f) * margin Which leads to: lost_battery_percent(work) = (100 * T / cost(f_max) / total_battery_energy) * cost'(work) * margin * work lost_battery_percent(work) is still proportional to something that can easily be traced and averaged (cost'(work,t) * margin(work,t)). At the end of the day, since the impact depends on whether the workload will make the condition to trigger, tracing is necessary to see how it performs. Other than that, I agree that the thing becomes simpler if em_pd_get_higher_freq() takes an absolute margin (as a per-1024 of max cost) rather than something proportional to cost(f). I'll make the change for v4. >> boosted_power(f) = boosted_cost(f) * work >> boosted_power(f) = (cost(f) + x) * work >> >> # Let's normalize cost() so we can forget about f and deal only with work. >> cost'(work) = cost(f)/cost(f_max) >> x' = x/cost(f_max) >> boosted_power'(work) = (cost'(work) + x') * work >> boosted_power'(work) = cost'(work) * work + x' * work >> boosted_power'(work) = power'(work) + x' * work >> boosted_power'(work) = power'(work) + A(work) >> >> # Over a duration T, spend an extra B unit of energy >> B(work) = A(work) * T >> lost_battery_percent(work) = 100 * B(work)/total_battery_energy >> lost_battery_percent(work) = 100 * T * x' * work /total_battery_energy >> lost_battery_percent(work) = >> (100 * T / cost(f_max) / total_battery_energy) * x * work >> >> This means that the effect of boosting on battery life is proportional >> to "x" unless I made a mistake somewhere. >> >>> >>> Or rather, the only actual limit is in relation to the max OPP. So you >>> have very little actual control over how much more energy you're >>> spending. >>> >>>>> So while I agree that 2) is a reasonable signal to work from, everything >>>>> that comes after is still much confusing me. >>> >>>> "When applying these boosting rules on the runqueue util signals ...": >>>> Assuming the set of enqueued tasks stays the same between 2 observations >>>> from schedutil, if we see the rq util_avg increase above its >>>> util_est.enqueued, that means that at least one task had its util_avg go >>>> above util_est.enqueued. We might miss some boosting opportunities if some >>>> (util - util_est) compensates: >>>> TASK_1(util - util_est) = - TASK_2(util - util_est) >>>> but working on the aggregated value is much easier in schedutil, to avoid >>>> crawling the list of entities. >>> >>> That still does not explain why 'util - util_est', when >0, makes for a >>> sensible input into an OPP relative function > I agree that 'util - util_est', when >0, indicates utilization is >>> increasing (for the aperiodic blah blah blah). But after that I'm still >>> confused. >> >> For the same reason PELT makes a sensible input for OPP selection. >> Currently, OPP selection is based on max(util_avg, util_est.enqueued) >> (from cpu_util_cfs in sched.h), so as soon as we have >> (util - util_est > 0), the OPP will be selected according to util_avg. >> In a way, using util_avg there is already some kind of boosting. >> >> Since the boosting is essentially (util - constant), it grows the same >> way as util. If we think of (util - util_est) as being some estimation >> of how wrong we were in the estimation of the task "true" utilization of >> the CPU, then it makes sense to feed that to the boost. The wronger we >> were, the more we want to boost, because the more time passes, the more >> the scheduler realizes it actually does not know what the task needs. In >> doubt, provide a higher freq than usual until we get to know this task >> better. When that happens (at the next period), boosting is disabled and >> we revert to the usual behavior (aka margin=0). >> >> Hope we are converging to some wording that makes sense.
On 10/18/19 8:59 AM, Peter Zijlstra wrote: > On Fri, Oct 18, 2019 at 09:44:44AM +0200, Dietmar Eggemann wrote: >> On 17/10/2019 16:11, Peter Zijlstra wrote: >>> On Thu, Oct 17, 2019 at 12:11:16PM +0100, Quentin Perret wrote: >> >> [...] >> >>> It only boosts when 'rq->cfs.avg.util' increases while >>> 'rq->cfs.avg.util_est.enqueued' remains unchanged (and util > util_est >>> obv). >>> >>> This condition can be true for select_task_rq_fair(), because that is >>> ran before we do enqueue_task_fair() (for obvious raisins). >>> >>>>> I'm still thinking about the exact means you're using to raise C; that >>>>> is, the 'util - util_est' as cost_margin. It hurts my brain still. >>>> >>>> +1 ... >>> >>> cost_i = capacity_i / power_i ; for the i-th OPP >> >> I get confused by this definition. efficiency=capacity/power but the >> cs->cost value used in em_pd_get_higher_freq() is defined as >> >> cs_cost = cs->power * cpu_max_freq / cs->freq [energy_model.h] > > Well, chalk that up to confusion inspired by the Changelog of patch 1. I've updated the commit message to include that ordering OPPs by increasing efficiency=capa/power on one CPU leads to the same ordering as ordering by decreasing cost=power*f_max/f. efficiency=(cpu_max_capa/1024) * (f/f_max) / power efficiency=(cpu_max_capa/1024) / cost > Let me redo it with that formula then. >