Message ID | 20220207073036.14901-2-lukasz.luba@arm.com (mailing list archive)
---|---
State | New, archived
Delegated to: | Daniel Lezcano
Series | Ignore Energy Model with abstract scale in IPA and DTPM
On Mon, Feb 07, 2022 at 07:30:35AM +0000, Lukasz Luba wrote:
> The Energy Model supports power values either in Watts or in some abstract
> scale. When the 2nd option is in use, the thermal governor IPA should not
> be allowed to operate, since the relation between cooling devices is not
> properly defined. Thus, it might be possible that big GPU has lower power
> values in abstract scale than a Little CPU. To mitigate a misbehaviour
> of the thermal control algorithm, simply not register a cooling device
> capable of working with IPA.

Ugh, this would break thermal throttling for existing devices that are
currently supported in the upstream kernel.

Wasn't the conclusion that it is the responsibility of the device tree
owners to ensure that cooling devices with different scales aren't used
in the same thermal zone?

That's also what's currently specified in the power allocator
documentation:

  Another important thing is the consistent scale of the power values
  provided by the cooling devices. All of the cooling devices in a single
  thermal zone should have power values reported either in milli-Watts
  or scaled to the same 'abstract scale'.

Which was actually added by yourself:

commit 5a64f775691647c242aa40d34f3512e7b179a921
Author: Lukasz Luba <lukasz.luba@arm.com>
Date:   Tue Nov 3 09:05:58 2020 +0000

    PM: EM: Clarify abstract scale usage for power values in Energy Model

    The Energy Model (EM) can store power values in milli-Watts or in abstract
    scale. This might cause issues in the subsystems which use the EM for
    estimating the device power, such as:

    - mixing of different scales in a subsystem which uses multiple
      (cooling) devices (e.g. thermal Intelligent Power Allocation (IPA))

    - assuming that energy [milli-Joules] can be derived from the EM power
      values which might not be possible since the power scale doesn't have
      to be in milli-Watts

    To avoid misconfiguration add the requisite documentation to the EM and
    related subsystems: EAS and IPA.
    Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
    Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

It's ugly to have the abstract scales in the first place, but that's
unfortunately what we currently have for at least some cooling devices.

IMO it would be preferable to stick to catching non-compliant configurations
in reviews, rather than breaking thermal throttling for existing devices
with configurations that comply with the current documentation.
On 2/8/22 12:50 AM, Matthias Kaehlcke wrote:
> On Mon, Feb 07, 2022 at 07:30:35AM +0000, Lukasz Luba wrote:
>> The Energy Model supports power values either in Watts or in some abstract
>> scale. When the 2nd option is in use, the thermal governor IPA should not
>> be allowed to operate, since the relation between cooling devices is not
>> properly defined. Thus, it might be possible that big GPU has lower power
>> values in abstract scale than a Little CPU. To mitigate a misbehaviour
>> of the thermal control algorithm, simply not register a cooling device
>> capable of working with IPA.
>
> Ugh, this would break thermal throttling for existing devices that are
> currently supported in the upstream kernel.

Could you point me to those devices? I cannot find them in the mainline
DT. There are no GPU devices which register an Energy Model (EM) in
upstream, neither using DT (which would be power in mW) nor explicitly
providing an EM get_power() callback. The EM is needed to have IPA.

Please clarify which existing devices are going to be broken with this
change.

> Wasn't the conclusion that it is the responsability of the device tree
> owners to ensure that cooling devices with different scales aren't used
> in the same thermal zone?

It's based on the assumption that someone has DT and control. There was
also an implicit assumption that IPA would work properly on such a
platform - but it won't.

1. You cannot have 2 thermal zones: one with the CPUs and the other with
   the GPU only, and both working with two instances of IPA.
2. The abstract power scale doesn't guarantee anything about the power
   values, and IPA was simply designed with milli-Watts in mind. So even
   working on CPUs only, using bogoWatts, is not something we could
   guarantee in IPA.

> That's also what's currently specified in the power allocator
> documentation:
>
> Another important thing is the consistent scale of the power values
> provided by the cooling devices.
> All of the cooling devices in a single
> thermal zone should have power values reported either in milli-Watts
> or scaled to the same 'abstract scale'.

This can change. We have removed the userspace governor from the kernel
recently. The trend is to implement thermal policy in FW. Dealing with
some intermediate configurations causes a complicated design, and support
of the algorithm logic is also more complex.

> Which was actually added by yourself:
>
> commit 5a64f775691647c242aa40d34f3512e7b179a921
> Author: Lukasz Luba <lukasz.luba@arm.com>
> Date:   Tue Nov 3 09:05:58 2020 +0000
>
>     PM: EM: Clarify abstract scale usage for power values in Energy Model
>
>     The Energy Model (EM) can store power values in milli-Watts or in abstract
>     scale. This might cause issues in the subsystems which use the EM for
>     estimating the device power, such as:
>
>     - mixing of different scales in a subsystem which uses multiple
>       (cooling) devices (e.g. thermal Intelligent Power Allocation (IPA))
>
>     - assuming that energy [milli-Joules] can be derived from the EM power
>       values which might not be possible since the power scale doesn't have
>       to be in milli-Watts
>
>     To avoid misconfiguration add the requisite documentation to the EM and
>     related subsystems: EAS and IPA.
>
>     Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
>     Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>
> It's ugly to have the abstract scales in the first place, but that's
> unfortunately what we currently have for at least some cooling devices.

A few questions:

Do you use 'we' as Chrome engineers? Could you point me to those devices
please? Are they new or some old platforms which just need maintenance?
How does IPA work for you in such a real platform configuration? If
possible, could you share some plots of temperature, frequency and
CPU/GPU utilization? Do you maybe know how the real power was scaled for
them? It would help me understand and judge.
> > IMO it would be preferable to stick to catching incompliant configurations > in reviews, rather than breaking thermal throttling for existing devices > with configurations that comply with the current documentation. > Without access to the source code of those devices, it's hard for me to see if they are broken. Regards, Lukasz
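Lukasz's point about mixed scales can be made concrete with a toy sketch (illustrative only, not kernel code, and the power numbers are made up): an IPA-style governor that splits a power budget proportionally to each device's reported power gets the split backwards when one device reports milli-Watts and another an abstract scale.

```python
# Toy illustration (not kernel code) of the scale-mixing hazard described
# above: an IPA-style governor splits its power budget proportionally to
# the maximum power each cooling device reports. If one device reports
# milli-Watts and another an abstract scale, the split is meaningless.

def split_budget(budget, max_powers):
    """Split 'budget' proportionally to each device's reported max power."""
    total = sum(max_powers.values())
    return {dev: budget * p / total for dev, p in max_powers.items()}

# Consistent scale (both mW): the 4 W GPU rightly gets most of the budget.
consistent = split_budget(3000, {"gpu": 4000, "little_cpu": 1000})

# Mixed scales: the GPU's abstract-scale value (say 100 'bogoWatt' units)
# is numerically smaller than the little CPU's 1000 mW, so the governor
# starves the device that actually burns the most power.
mixed = split_budget(3000, {"gpu": 100, "little_cpu": 1000})

print(consistent)  # gpu: 2400.0 of 3000
print(mixed)       # gpu: ~272.7 of 3000 -- the allocation is inverted
```

This is the "big GPU has lower power values in abstract scale than a Little CPU" case from the patch description, reduced to arithmetic.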
On Tue, Feb 08, 2022 at 09:32:28AM +0000, Lukasz Luba wrote: > > > On 2/8/22 12:50 AM, Matthias Kaehlcke wrote: > > On Mon, Feb 07, 2022 at 07:30:35AM +0000, Lukasz Luba wrote: > > > The Energy Model supports power values either in Watts or in some abstract > > > scale. When the 2nd option is in use, the thermal governor IPA should not > > > be allowed to operate, since the relation between cooling devices is not > > > properly defined. Thus, it might be possible that big GPU has lower power > > > values in abstract scale than a Little CPU. To mitigate a misbehaviour > > > of the thermal control algorithm, simply not register a cooling device > > > capable of working with IPA. > > > > Ugh, this would break thermal throttling for existing devices that are > > currently supported in the upstream kernel. > > Could you point me to those devices? I cannot find them in the mainline > DT. There are no GPU devices which register Energy Model (EM) in > upstream, neither using DT (which would be power in mW) nor explicitly > providing EM get_power() callback. The EM is needed to have IPA. > > Please clarify which existing devices are going to be broken with this > change. I was thinking about arch/arm64/boot/dts/qcom/sc7180-trogdor-*, and potentially other SC7180 boards that use IPA for the CPU thermal zones. Initially SC7180 used an abstract scale for the CPU energy model, however I realized your change doesn't actually impact SC7180 CPUs for two reasons: 1. The energy model of the CPUs is registered through cpufreq_register_em_with_opp dev_pm_opp_of_register_em em_dev_register_perf_domain em_dev_register_perf_domain() is called with 'milliwatts = true', regardless of the potentially abstract scale, so IPA would not be disabled with your change. 2. 
There is a patch from Doug that adjusted the dynamic power coefficients of the CPUs to be closer to reality: commit 82ea7d411d43f60dce878252558e926f957109f0 Author: Douglas Anderson <dianders@chromium.org> Date: Thu Sep 2 14:51:37 2021 -0700 arm64: dts: qcom: sc7180: Base dynamic CPU power coefficients in reality > > Wasn't the conclusion that it is the responsability of the device tree > > owners to ensure that cooling devices with different scales aren't used > > in the same thermal zone? > > It's based on assumption that someone has DT and control. There was also > implicit assumption that IPA would work properly on such platform - but > it won't. > > 1. You cannot have 2 thermal zones: one with CPUs and other with GPU > only and both working with two instances of IPA. It's not clear to me why such a configuration wouldn't work. Is it also a problem to have multiple CPU thermal zones (one for each core) that use multiple instances of IPA? SC7180 has separate thermal zones for each core (or thermal sensor), Chrome OS uses IPA for CPU thermal throttling. > 2. The abstract power scale doesn't guaranty anything about power values > and IPA was simply designed with milli-Watts in mind. So even working > on CPUs only using bogoWatts, is not what we could guaranty in IPA. That's bad news for SoCs that are configured with bogoWatt values, from past discussions I had the impression that this is unfortunately not uncommon. > It's ugly to have the abstract scales in the first place, but that's > unfortunately what we currently have for at least some cooling devices. > A few questions: > > Do you use 'we' as Chrome engineers? I was referring to the kernel, in particular QCOM SC7180. > Could you point me to those devices please? arch/arm64/boot/dts/qcom/sc7180-trogdor-* Though as per above they shouldn't be impacted by your change, since the CPUs always pretend to use milli-Watts. 
[skipped some questions/answers since sc7180 isn't actually impacted by the change] Thanks Matthias
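For reference, the 'dynamic-power-coefficient' DT property that Doug's commit tuned feeds a simple dynamic-power model, roughly P = C * V^2 * f. A small sketch with made-up OPP values (the coefficient, frequencies and voltages below are illustrative, not sc7180's real numbers):

```python
# Sketch of the power model behind the 'dynamic-power-coefficient' DT
# property: the DT-based EM registration path estimates per-OPP dynamic
# power as roughly P = C * V^2 * f. All values below are hypothetical.

def opp_power_mw(coeff_uw_mhz_v2, freq_mhz, volt_v):
    """Dynamic power in mW for one OPP: C [uW/MHz/V^2] * V^2 * f [MHz]."""
    return coeff_uw_mhz_v2 * volt_v ** 2 * freq_mhz / 1000.0  # uW -> mW

# Hypothetical big-core OPPs (frequency in MHz, voltage in V), C = 550:
opps = [(1056, 0.70), (1766, 0.80), (2400, 0.95)]
table = [(f, round(opp_power_mw(550, f, v))) for f, v in opps]
print(table)  # [(1056, 285), (1766, 622), (2400, 1191)]
```

A coefficient that is off by a constant factor scales the whole table, which is why 'bogoWatt' coefficients keep IPA workable only as long as every cooling device in the zone is scaled the same way.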
On 2/8/22 5:25 PM, Matthias Kaehlcke wrote:
> On Tue, Feb 08, 2022 at 09:32:28AM +0000, Lukasz Luba wrote:
>> On 2/8/22 12:50 AM, Matthias Kaehlcke wrote:
>>> On Mon, Feb 07, 2022 at 07:30:35AM +0000, Lukasz Luba wrote:
>>>> The Energy Model supports power values either in Watts or in some abstract
>>>> scale. When the 2nd option is in use, the thermal governor IPA should not
>>>> be allowed to operate, since the relation between cooling devices is not
>>>> properly defined. Thus, it might be possible that big GPU has lower power
>>>> values in abstract scale than a Little CPU. To mitigate a misbehaviour
>>>> of the thermal control algorithm, simply not register a cooling device
>>>> capable of working with IPA.
>>>
>>> Ugh, this would break thermal throttling for existing devices that are
>>> currently supported in the upstream kernel.
>>
>> Could you point me to those devices? I cannot find them in the mainline
>> DT. There are no GPU devices which register Energy Model (EM) in
>> upstream, neither using DT (which would be power in mW) nor explicitly
>> providing EM get_power() callback. The EM is needed to have IPA.
>>
>> Please clarify which existing devices are going to be broken with this
>> change.
>
> I was thinking about arch/arm64/boot/dts/qcom/sc7180-trogdor-*, and
> potentially other SC7180 boards that use IPA for the CPU thermal
> zones.
>
> Initially SC7180 used an abstract scale for the CPU energy model,
> however I realized your change doesn't actually impact SC7180 CPUs
> for two reasons:
>
> 1. The energy model of the CPUs is registered through
>
>    cpufreq_register_em_with_opp
>      dev_pm_opp_of_register_em
>        em_dev_register_perf_domain
>
> em_dev_register_perf_domain() is called with 'milliwatts = true',
> regardless of the potentially abstract scale, so IPA would not be
> disabled with your change.

That's a good registration.

> 2.
> There is a patch from Doug that adjusted the dynamic power
> coefficients of the CPUs to be closer to reality:
>
> commit 82ea7d411d43f60dce878252558e926f957109f0
> Author: Douglas Anderson <dianders@chromium.org>
> Date:   Thu Sep 2 14:51:37 2021 -0700
>
>     arm64: dts: qcom: sc7180: Base dynamic CPU power coefficients in reality

Thanks for the link to the commit. I'll have a look. It's good that it
doesn't break this board.

>>> Wasn't the conclusion that it is the responsability of the device tree
>>> owners to ensure that cooling devices with different scales aren't used
>>> in the same thermal zone?
>>
>> It's based on assumption that someone has DT and control. There was also
>> implicit assumption that IPA would work properly on such platform - but
>> it won't.
>>
>> 1. You cannot have 2 thermal zones: one with CPUs and other with GPU
>> only and both working with two instances of IPA.
>
> It's not clear to me why such a configuration wouldn't work. Is it also a
> problem to have multiple CPU thermal zones (one for each core) that use
> multiple instances of IPA? SC7180 has separate thermal zones for each core
> (or thermal sensor), Chrome OS uses IPA for CPU thermal throttling.

Unfortunately, the control mechanism does not work well in such a setup.
The main idea of IPA is to find the 'best' OPP for a cooling device, e.g.
one in the middle of a freq table. The worst scenario is when we change
our decision between the lowest and highest OPP in short periods. It's
burning too much power at the highest freq, then we throttle too much,
then we unthrottle because the temp is dropping by some 'good enough'
value. This situation can happen due to a few reasons:
1. Power values from the EM are not reflecting reality, e.g. scaled too much
2. We don't have proper information about CPU load and the frequencies used
3. PID coefficients are not properly set
4. the board has a small physical thermal sink potential but the silicon
   is 'hot'
5. the workload is too dynamic
6.
there is another power-hungry device (GPU, DSP, ISP) which is outside
of our control loop, e.g. it is controlled in another thermal zone and
has an impact on our temp sensor value

Proper setup and tuning of IPA could bring quite a good benefit, because
it could pick the 'best at that moment' OPPs for the devices, instead of
throttling too hard and then unthrottling. Unfortunately, it's tricky and
needs time (experiments, understanding the system).

We have been trying to ease this entry barrier for developers. The
optimization of the PID coefficients brings very good results. The
automated mechanism would figure out the best values for the given board
based on the tests run. It's under review now, and there are also shared
results [1][2]. It would solve some of the issues with complex tuning.

I was going to give it a try for my old experiments with the bogoWatts
devices, but I don't expect that this is a silver bullet. Regarding the
'two thermal zones with IPA issues' I'm pretty sure it won't help.

>> 2. The abstract power scale doesn't guaranty anything about power values
>> and IPA was simply designed with milli-Watts in mind. So even working
>> on CPUs only using bogoWatts, is not what we could guaranty in IPA.
>
> That's bad news for SoCs that are configured with bogoWatt values, from
> past discussions I had the impression that this is unfortunately not
> uncommon.
>
>> It's ugly to have the abstract scales in the first place, but that's
>> unfortunately what we currently have for at least some cooling devices.
>
>> A few questions:
>>
>> Do you use 'we' as Chrome engineers?
>
> I was referring to the kernel, in particular QCOM SC7180.

Thanks for sharing the name.

>> Could you point me to those devices please?
>
> arch/arm64/boot/dts/qcom/sc7180-trogdor-*
>
> Though as per above they shouldn't be impacted by your change, since the
> CPUs always pretend to use milli-Watts.
> > [skipped some questions/answers since sc7180 isn't actually impacted by > the change] Thank you Matthias. I will investigate your setup to get better understanding. Regards, Lukasz > > Thanks > > Matthias > [1] https://nbviewer.org/gist/chemis01/0e86ad81508860659a57338dae8274f9 [2] https://lore.kernel.org/all/20211217184907.2103677-1-chetan.mistry@arm.com/
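The throttle/unthrottle oscillation Lukasz describes comes from the PID loop at the core of IPA. A heavily simplified sketch (the gains, sustainable power and temperature trace are made up; the real power_allocator governor also splits the resulting budget across cooling devices):

```python
# Heavily simplified sketch of the PID loop at the core of IPA: the
# temperature error drives a total power budget. Gains, sustainable
# power and the temperature samples below are illustrative only.

def pid_budget(error, state, k_p=50.0, k_i=5.0, sustainable=2000.0):
    """Power budget [mW] from the temperature error [degC]."""
    state["i"] += error  # integral term accumulates the error history
    return max(0.0, sustainable + k_p * error + k_i * state["i"])

state = {"i": 0.0}
target = 75.0
budgets = [pid_budget(target - temp, state) for temp in (70, 74, 78, 80)]
print(budgets)  # [2275.0, 2080.0, 1865.0, 1740.0]
# The budget shrinks as the temperature overshoots the 75 degC target;
# badly chosen gains make it swing between extremes instead of settling.
```

With mis-scaled EM power values or poorly tuned gains, the budget-to-OPP mapping lands on the wrong frequencies, which is the ping-pong between lowest and highest OPP described above.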
On Wed, Feb 09, 2022 at 11:16:36AM +0000, Lukasz Luba wrote: > > > On 2/8/22 5:25 PM, Matthias Kaehlcke wrote: > > On Tue, Feb 08, 2022 at 09:32:28AM +0000, Lukasz Luba wrote: > > > > > > > > > On 2/8/22 12:50 AM, Matthias Kaehlcke wrote: > > > > On Mon, Feb 07, 2022 at 07:30:35AM +0000, Lukasz Luba wrote: > > > > > The Energy Model supports power values either in Watts or in some abstract > > > > > scale. When the 2nd option is in use, the thermal governor IPA should not > > > > > be allowed to operate, since the relation between cooling devices is not > > > > > properly defined. Thus, it might be possible that big GPU has lower power > > > > > values in abstract scale than a Little CPU. To mitigate a misbehaviour > > > > > of the thermal control algorithm, simply not register a cooling device > > > > > capable of working with IPA. > > > > > > > > Ugh, this would break thermal throttling for existing devices that are > > > > currently supported in the upstream kernel. > > > > > > Could you point me to those devices? I cannot find them in the mainline > > > DT. There are no GPU devices which register Energy Model (EM) in > > > upstream, neither using DT (which would be power in mW) nor explicitly > > > providing EM get_power() callback. The EM is needed to have IPA. > > > > > > Please clarify which existing devices are going to be broken with this > > > change. > > > > I was thinking about arch/arm64/boot/dts/qcom/sc7180-trogdor-*, and > > potentially other SC7180 boards that use IPA for the CPU thermal > > zones. > > > > Initially SC7180 used an abstract scale for the CPU energy model, > > however I realized your change doesn't actually impact SC7180 CPUs > > for two reasons: > > > > 1. 
The energy model of the CPUs is registered through > > > > cpufreq_register_em_with_opp > > dev_pm_opp_of_register_em > > em_dev_register_perf_domain > > > > em_dev_register_perf_domain() is called with 'milliwatts = true', > > regardless of the potentially abstract scale, so IPA would not be > > disabled with your change. > > That good registration. > > > > > 2. There is a patch from Doug that adjusted the dynamic power > > coefficients of the CPUs to be closer to reality: > > > > commit 82ea7d411d43f60dce878252558e926f957109f0 > > Author: Douglas Anderson <dianders@chromium.org> > > Date: Thu Sep 2 14:51:37 2021 -0700 > > > > arm64: dts: qcom: sc7180: Base dynamic CPU power coefficients in reality > > Thanks for the link to the commit. I'll have a look. It's good that it > doesn't break this board. > > > > > > > Wasn't the conclusion that it is the responsability of the device tree > > > > owners to ensure that cooling devices with different scales aren't used > > > > in the same thermal zone? > > > > > > It's based on assumption that someone has DT and control. There was also > > > implicit assumption that IPA would work properly on such platform - but > > > it won't. > > > > > > 1. You cannot have 2 thermal zones: one with CPUs and other with GPU > > > only and both working with two instances of IPA. > > > > It's not clear to me why such a configuration wouldn't work. Is it also a > > problem to have multiple CPU thermal zones (one for each core) that use > > multiple instances of IPA? SC7180 has separate thermal zones for each core > > (or thermal sensor), Chrome OS uses IPA for CPU thermal throttling. > > Unfortunately, the control mechanism is not working best in such setup. > Main idea of IPA is to find 'best' OPP for a cooling device, e.g. > one in the middle of a freq table. Worst scenario is when we change > our decision between lowest and highest OPP, in short periods. 
> It's burning too much power at highest freq, then we throttle too much, > then we unthrottle because temp is dropping by some 'good enough' value. > This situation can happen due to a few reasons: > 1. Power values from EM are not reflecting reality, e.g. scaled too much > 2. We don't have proper information about CPU load and frequencies used > 3. PID coefficients are not properly set > 4. board has small physical thermal sink potential but silicon if 'hot' > 5. the workload is too dynamic > 6. there is another power hungry device (GPU, DSP, ISP) which is outside > of our control loop, e.g. is controlled in other thermal zone and has > impact on our temp sensor value Thanks for the details! > Proper setup and tuning of IPA could bring quite good benefit, because > it could pick the 'best at that moment' OPPs for the devices, instead > of throttling too hard and then unthrottling. Unfortunately, it's > tricky and needs time (experiments, understanding the system). On some sc7180 ChromeOS boards we observed a behavior as you describe, we mitigated it by tuning the device 'sustainable-power' device tree property and the trip point temperatures. > We have been trying to easy this entry barrier for developers. Very good > results brings the optimization of the PID coefficients. The automated > mechanism would figure out best values for the given board based on the > tests run. It's under review now, there are also shared results [1][2]. > It would solve a some of the issues with complex tuning. Good to know, thanks for the pointer! > I was going to give it a try for my old experiments with the bogoWatts > devices, but I don't expect that this is a silver bullet. Regarding > the 'two thermal zones with IPA issues' I'm prity sure it won't help. > > > > > > 2. The abstract power scale doesn't guaranty anything about power values > > > and IPA was simply designed with milli-Watts in mind. 
So even working > > > on CPUs only using bogoWatts, is not what we could guaranty in IPA. > > > > That's bad news for SoCs that are configured with bogoWatt values, from > > past discussions I had the impression that this is unfortunately not > > uncommon. > > > > > It's ugly to have the abstract scales in the first place, but that's > > > unfortunately what we currently have for at least some cooling devices. > > > > > A few questions: > > > > > > Do you use 'we' as Chrome engineers? > > > > I was referring to the kernel, in particular QCOM SC7180. > > Thanks for sharing the name. > > > > > > Could you point me to those devices please? > > > > arch/arm64/boot/dts/qcom/sc7180-trogdor-* > > > > Though as per above they shouldn't be impacted by your change, since the > > CPUs always pretend to use milli-Watts. > > > > [skipped some questions/answers since sc7180 isn't actually impacted by > > the change] > > Thank you Matthias. I will investigate your setup to get better > understanding. Thanks!
Hi Matthias,

On 2/9/22 10:17 PM, Matthias Kaehlcke wrote:
> On Wed, Feb 09, 2022 at 11:16:36AM +0000, Lukasz Luba wrote:
>> On 2/8/22 5:25 PM, Matthias Kaehlcke wrote:
>>> On Tue, Feb 08, 2022 at 09:32:28AM +0000, Lukasz Luba wrote:
>>>>
>>>> [snip]
>>>> Could you point me to those devices please?
>>>
>>> arch/arm64/boot/dts/qcom/sc7180-trogdor-*
>>>
>>> Though as per above they shouldn't be impacted by your change, since the
>>> CPUs always pretend to use milli-Watts.
>>>
>>> [skipped some questions/answers since sc7180 isn't actually impacted by
>>> the change]
>>
>> Thank you Matthias. I will investigate your setup to get better
>> understanding.
>
> Thanks!

I've checked those DT files and related code. As you already said, this
patch is safe for them. So we can apply it IMO.

-------------Off-topic------------------
Not in $subject comments:

AFAICS based on the two files which define thermal zones:
  sc7180-trogdor-homestar.dtsi
  sc7180-trogdor-coachz.dtsi
only the 'big' cores are used as cooling devices in the
'skin_temp_thermal' zone - CPU6 and CPU7.

I assume you don't want to model the power usage of the Little cluster
(which is quite big: 6 CPUs) at all, do you? I can see that the Little
CPUs have a small dyn-power-coeff, ~30% of the big ones, and a lower max
freq, but it still might be worth adding them to IPA. You might give them
more 'weight', to make sure they receive more power during the power
split.

You also don't have a GPU cooling device in that thermal zone. Based on
my experience, if your GPU is a power-hungry one, e.g. 2-4 Watts, you
might get better results when you model this 'hot' device (which impacts
your temp sensor reported value).

Regards,
Lukasz
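The 'weight' suggestion corresponds to scaling each device's share when the budget is divided (in DT this is the cooling map's 'contribution' property). A toy sketch with illustrative numbers:

```python
# Toy sketch of weighting cooling devices during the IPA power split:
# each device's share is scaled by a weight, so little CPUs with a small
# dyn-power-coeff can still be granted a useful share. All numbers are
# illustrative only.

def weighted_split(budget, requests, weights):
    """Split 'budget' proportionally to weight * requested power."""
    scaled = {d: weights[d] * p for d, p in requests.items()}
    total = sum(scaled.values())
    return {d: budget * s / total for d, s in scaled.items()}

requests = {"big": 2000.0, "little": 600.0}  # requested power, mW

even = weighted_split(1300, requests, {"big": 1, "little": 1})
boosted = weighted_split(1300, requests, {"big": 1, "little": 2})
print(even)     # little gets 300.0
print(boosted)  # little gets 487.5 -- more power despite the lower request
```

Doubling the little cluster's weight roughly doubles its share of the budget without touching the energy model itself.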
Hi, On Tue, Feb 8, 2022 at 1:32 AM Lukasz Luba <lukasz.luba@arm.com> wrote: > > > Another important thing is the consistent scale of the power values > > provided by the cooling devices. All of the cooling devices in a single > > thermal zone should have power values reported either in milli-Watts > > or scaled to the same 'abstract scale'. > > This can change. We have removed the userspace governor from kernel > recently. The trend is to implement thermal policy in FW. Dealing with > some intermediate configurations are causing complicated design, support > of the algorithm logic is also more complex. One thing that didn't get addressed is the whole "The trend is to implement thermal policy in FW". I'm not sure I can get on board with that trend. IMO "moving to FW" isn't a super great trend. FW is harder to update than kernel and trying to keep it in sync with the kernel isn't wonderful. Unless something _has_ to be in FW I personally prefer it to be in the kernel. ...although now that I re-read this, I'm not sure which firmware you might be talking about. Is this the AP firmware, or some companion chip / coprocessor? Even so, I'd still rather see things done in the kernel when possible... -Doug
Hi, On Wed, Feb 16, 2022 at 7:35 AM Lukasz Luba <lukasz.luba@arm.com> wrote: > > Hi Matthias, > > On 2/9/22 10:17 PM, Matthias Kaehlcke wrote: > > On Wed, Feb 09, 2022 at 11:16:36AM +0000, Lukasz Luba wrote: > >> > >> > >> On 2/8/22 5:25 PM, Matthias Kaehlcke wrote: > >>> On Tue, Feb 08, 2022 at 09:32:28AM +0000, Lukasz Luba wrote: > >>>> > >>>> > > [snip] > > >>>> Could you point me to those devices please? > >>> > >>> arch/arm64/boot/dts/qcom/sc7180-trogdor-* > >>> > >>> Though as per above they shouldn't be impacted by your change, since the > >>> CPUs always pretend to use milli-Watts. > >>> > >>> [skipped some questions/answers since sc7180 isn't actually impacted by > >>> the change] > >> > >> Thank you Matthias. I will investigate your setup to get better > >> understanding. > > > > Thanks! > > > > I've checked those DT files and related code. > As you already said, this patch is safe for them. > So we can apply it IMO. > > > -------------Off-topic------------------ > Not in $subject comments: > > AFAICS based on two files which define thermal zones: > sc7180-trogdor-homestar.dtsi > sc7180-trogdor-coachz.dtsi > > only the 'big' cores are used as cooling devices in the > 'skin_temp_thermal' - the CPU6 and CPU7. > > I assume you don't want to model at all the power usage > from the Little cluster (which is quite big: 6 CPUs), do you? > I can see that the Little CPUs have small dyn-power-coeff > ~30% of the big and lower max freq, but still might be worth > to add them to IPA. You might give them more 'weight', to > make sure they receive more power during power split. > > You also don't have GPU cooling device in that thermal zone. > Based on my experience if your GPU is a power hungry one, > e.g. 2-4Watts, you might get better results when you model > this 'hot' device (which impacts your temp sensor reported value). I think the two boards you point at (homestar and coachz) are just the two that override the default defined in the SoC dtsi file. 
If you look in sc7180.dtsi you'll see 'gpuss1-thermal' which has a cooling map. You can also see the cooling maps for the littles. I guess we don't have a `dynamic-power-coefficient` for the GPU, though? Seems like we should, but I haven't dug through all the code here... -Doug
On Wed, Feb 16, 2022 at 09:33:50AM -0800, Doug Anderson wrote: > Hi, > > On Wed, Feb 16, 2022 at 7:35 AM Lukasz Luba <lukasz.luba@arm.com> wrote: > > > > Hi Matthias, > > > > On 2/9/22 10:17 PM, Matthias Kaehlcke wrote: > > > On Wed, Feb 09, 2022 at 11:16:36AM +0000, Lukasz Luba wrote: > > >> > > >> > > >> On 2/8/22 5:25 PM, Matthias Kaehlcke wrote: > > >>> On Tue, Feb 08, 2022 at 09:32:28AM +0000, Lukasz Luba wrote: > > >>>> > > >>>> > > > > [snip] > > > > >>>> Could you point me to those devices please? > > >>> > > >>> arch/arm64/boot/dts/qcom/sc7180-trogdor-* > > >>> > > >>> Though as per above they shouldn't be impacted by your change, since the > > >>> CPUs always pretend to use milli-Watts. > > >>> > > >>> [skipped some questions/answers since sc7180 isn't actually impacted by > > >>> the change] > > >> > > >> Thank you Matthias. I will investigate your setup to get better > > >> understanding. > > > > > > Thanks! > > > > > > > I've checked those DT files and related code. > > As you already said, this patch is safe for them. > > So we can apply it IMO. > > > > > > -------------Off-topic------------------ > > Not in $subject comments: > > > > AFAICS based on two files which define thermal zones: > > sc7180-trogdor-homestar.dtsi > > sc7180-trogdor-coachz.dtsi > > > > only the 'big' cores are used as cooling devices in the > > 'skin_temp_thermal' - the CPU6 and CPU7. > > > > I assume you don't want to model at all the power usage > > from the Little cluster (which is quite big: 6 CPUs), do you? > > I can see that the Little CPUs have small dyn-power-coeff > > ~30% of the big and lower max freq, but still might be worth > > to add them to IPA. You might give them more 'weight', to > > make sure they receive more power during power split. In experiments we saw that including the little cores as cooling devices for 'skin_temp_thermal' didn't have a significant impact on thermals, so we left them out. > > You also don't have GPU cooling device in that thermal zone. 
> > Based on my experience if your GPU is a power hungry one, > > e.g. 2-4Watts, you might get better results when you model > > this 'hot' device (which impacts your temp sensor reported value). > > I think the two boards you point at (homestar and coachz) are just the > two that override the default defined in the SoC dtsi file. If you > look in sc7180.dtsi you'll see 'gpuss1-thermal' which has a cooling > map. You can also see the cooling maps for the littles. Yep, plus thermal zones with cooling maps for the big cores. > I guess we don't have a `dynamic-power-coefficient` for the GPU, > though? Seems like we should, but I haven't dug through all the code > here... To my knowledge the SC7x80 GPU doesn't register an energy model, which is one of the reasons the GPU wasn't included as cooling device for 'skin_temp_thermal'.
On 2/16/22 10:13 PM, Matthias Kaehlcke wrote: > On Wed, Feb 16, 2022 at 09:33:50AM -0800, Doug Anderson wrote: >> Hi, >> >> On Wed, Feb 16, 2022 at 7:35 AM Lukasz Luba <lukasz.luba@arm.com> wrote: >>> >>> Hi Matthias, >>> >>> On 2/9/22 10:17 PM, Matthias Kaehlcke wrote: >>>> On Wed, Feb 09, 2022 at 11:16:36AM +0000, Lukasz Luba wrote: >>>>> >>>>> >>>>> On 2/8/22 5:25 PM, Matthias Kaehlcke wrote: >>>>>> On Tue, Feb 08, 2022 at 09:32:28AM +0000, Lukasz Luba wrote: >>>>>>> >>>>>>> >>> >>> [snip] >>> >>>>>>> Could you point me to those devices please? >>>>>> >>>>>> arch/arm64/boot/dts/qcom/sc7180-trogdor-* >>>>>> >>>>>> Though as per above they shouldn't be impacted by your change, since the >>>>>> CPUs always pretend to use milli-Watts. >>>>>> >>>>>> [skipped some questions/answers since sc7180 isn't actually impacted by >>>>>> the change] >>>>> >>>>> Thank you Matthias. I will investigate your setup to get better >>>>> understanding. >>>> >>>> Thanks! >>>> >>> >>> I've checked those DT files and related code. >>> As you already said, this patch is safe for them. >>> So we can apply it IMO. >>> >>> >>> -------------Off-topic------------------ >>> Not in $subject comments: >>> >>> AFAICS based on two files which define thermal zones: >>> sc7180-trogdor-homestar.dtsi >>> sc7180-trogdor-coachz.dtsi >>> >>> only the 'big' cores are used as cooling devices in the >>> 'skin_temp_thermal' - the CPU6 and CPU7. >>> >>> I assume you don't want to model at all the power usage >>> from the Little cluster (which is quite big: 6 CPUs), do you? >>> I can see that the Little CPUs have small dyn-power-coeff >>> ~30% of the big and lower max freq, but still might be worth >>> to add them to IPA. You might give them more 'weight', to >>> make sure they receive more power during power split. > > In experiments we saw that including the little cores as cooling > devices for 'skin_temp_thermal' didn't have a significant impact on > thermals, so we left them out. 
> >>> You also don't have GPU cooling device in that thermal zone. >>> Based on my experience if your GPU is a power hungry one, >>> e.g. 2-4Watts, you might get better results when you model >>> this 'hot' device (which impacts your temp sensor reported value). >> >> I think the two boards you point at (homestar and coachz) are just the >> two that override the default defined in the SoC dtsi file. If you >> look in sc7180.dtsi you'll see 'gpuss1-thermal' which has a cooling >> map. You can also see the cooling maps for the littles. > > Yep, plus thermal zones with cooling maps for the big cores. > >> I guess we don't have a `dynamic-power-coefficient` for the GPU, >> though? Seems like we should, but I haven't dug through all the code >> here... > > To my knowledge the SC7x80 GPU doesn't register an energy model, which is > one of the reasons the GPU wasn't included as cooling device for > 'skin_temp_thermal'. > You can give it a try by editing the DT: add the 'dynamic-power-coefficient' to the GPU node, plus probably a small modification in the driver code. If the GPU driver registers the cooling device the new way, you would also get the EM registered, thanks to the new devfreq cooling code (commit: 84e0d87c9944eb36ae6037a). You can check an example from the Panfrost GPU driver [1]. I can see an upstream MSM GPU driver, but I don't know if that is your GPU driver. It registers the devfreq cooling the 'old' way [2], but it would be easy to change it to use the new function. The GPU driver would use the same dev_pm_opp_of_register_em() as your CPUs do, so the EM would be in 'milli-Watts' (which should be fine). [1] https://elixir.bootlin.com/linux/v5.16/source/drivers/gpu/drm/panfrost/panfrost_devfreq.c#L153 [2] https://elixir.bootlin.com/linux/v5.16/source/drivers/gpu/drm/msm/msm_gpu_devfreq.c#L129
On 2/16/22 5:21 PM, Doug Anderson wrote: > Hi, > > On Tue, Feb 8, 2022 at 1:32 AM Lukasz Luba <lukasz.luba@arm.com> wrote: >> >>> Another important thing is the consistent scale of the power values >>> provided by the cooling devices. All of the cooling devices in a single >>> thermal zone should have power values reported either in milli-Watts >>> or scaled to the same 'abstract scale'. >> >> This can change. We have removed the userspace governor from kernel >> recently. The trend is to implement thermal policy in FW. Dealing with >> some intermediate configurations are causing complicated design, support >> of the algorithm logic is also more complex. > > One thing that didn't get addressed is the whole "The trend is to > implement thermal policy in FW". I'm not sure I can get on board with > that trend. IMO "moving to FW" isn't a super great trend. FW is harder > to update than kernel and trying to keep it in sync with the kernel > isn't wonderful. Unless something _has_ to be in FW I personally > prefer it to be in the kernel. There are pros and cons to both approaches (as always). However, there are some use cases where the kernel is not able to react fast enough, e.g. to sudden power usage changes, which can mean the power rail is not able to stay within its required conditions. When we are talking about tough requirements for those power & thermal policies, the mechanism must be fast, precise and reliable. Here you can find the Arm reference FW implementation, with an IPA clone in there (I have been reviewing this) [1][2]. As you can see there is a new FW feature set: "MPMM, Traffic-cop and Thermal management". Apart from the Arm implementation, there are already known thermal monitoring mechanisms in HW/FW, like in the new QCom SoCs which use this driver code [3]. The driver receives an interrupt about throttling conditions and just populates the thermal pressure.
> > ...although now that I re-read this, I'm not sure which firmware you > might be talking about. Is this the AP firmware, or some companion > chip / coprocessor? Even so, I'd still rather see things done in the > kernel when possible... It's FW running on a dedicated microprocessor. In Arm SoCs it's usually some Cortex-M. We communicate with it from the kernel via the SCMI drivers (using shared memory and mailboxes). We recommend using the SCMI protocol to send e.g. a 'performance request' to the FW via a 'fast channel', instead of implementing the PMIC and clock handling and doing the voltage & frequency change in the kernel (using drivers & locking). That implementation avoids costly locking and allows the task scheduler to go via the SCMI cpufreq driver [4] and the SCMI perf layer [5]. We don't need a dedicated 'sugov' kthread in a Deadline policy to do that work and preempt the currently running task. IMHO the FW approach opens new opportunities. Regards, Lukasz [1] https://github.com/ARM-software/SCP-firmware/pull/588 [2] https://github.com/ARM-software/SCP-firmware/pull/588/commits/59c62ead5eb66353ae805c367bfa86192e28c410 [3] https://elixir.bootlin.com/linux/v5.17-rc4/source/drivers/cpufreq/qcom-cpufreq-hw.c#L287 [4] https://elixir.bootlin.com/linux/latest/source/drivers/cpufreq/scmi-cpufreq.c#L65 [5] https://elixir.bootlin.com/linux/v5.17-rc4/source/drivers/firmware/arm_scmi/perf.c#L465
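As a side note, the thermal-pressure mechanism mentioned in this message boils down to telling the scheduler how much CPU capacity a thermal frequency cap has taken away. A minimal Python sketch of that arithmetic (a hypothetical helper for illustration, not the kernel code — the kernel side lives around arch_update_thermal_pressure() and the qcom-cpufreq-hw interrupt handler):

```python
def thermal_pressure(max_capacity, max_freq_khz, capped_freq_khz):
    """Capacity lost to a thermal frequency cap (illustrative sketch).

    The capped capacity scales linearly with the capped frequency;
    the 'pressure' is the capacity the scheduler can no longer use.
    """
    capped_capacity = max_capacity * capped_freq_khz // max_freq_khz
    return max_capacity - capped_capacity

# A CPU with capacity 1024 throttled from 2.0 GHz to 1.5 GHz
# loses a quarter of its capacity.
pressure = thermal_pressure(1024, 2_000_000, 1_500_000)  # 256
```

When the FW lifts the cap, the capped frequency returns to the maximum and the pressure drops back to zero.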
On Wed, Feb 16, 2022 at 10:43:34PM +0000, Lukasz Luba wrote: > > > On 2/16/22 10:13 PM, Matthias Kaehlcke wrote: > > On Wed, Feb 16, 2022 at 09:33:50AM -0800, Doug Anderson wrote: > > > Hi, > > > > > > On Wed, Feb 16, 2022 at 7:35 AM Lukasz Luba <lukasz.luba@arm.com> wrote: > > > > > > > > Hi Matthias, > > > > > > > > On 2/9/22 10:17 PM, Matthias Kaehlcke wrote: > > > > > On Wed, Feb 09, 2022 at 11:16:36AM +0000, Lukasz Luba wrote: > > > > > > > > > > > > > > > > > > On 2/8/22 5:25 PM, Matthias Kaehlcke wrote: > > > > > > > On Tue, Feb 08, 2022 at 09:32:28AM +0000, Lukasz Luba wrote: > > > > > > > > > > > > > > > > > > > > > > > > [snip] > > > > > > > > > > > > Could you point me to those devices please? > > > > > > > > > > > > > > arch/arm64/boot/dts/qcom/sc7180-trogdor-* > > > > > > > > > > > > > > Though as per above they shouldn't be impacted by your change, since the > > > > > > > CPUs always pretend to use milli-Watts. > > > > > > > > > > > > > > [skipped some questions/answers since sc7180 isn't actually impacted by > > > > > > > the change] > > > > > > > > > > > > Thank you Matthias. I will investigate your setup to get better > > > > > > understanding. > > > > > > > > > > Thanks! > > > > > > > > > > > > > I've checked those DT files and related code. > > > > As you already said, this patch is safe for them. > > > > So we can apply it IMO. > > > > > > > > > > > > -------------Off-topic------------------ > > > > Not in $subject comments: > > > > > > > > AFAICS based on two files which define thermal zones: > > > > sc7180-trogdor-homestar.dtsi > > > > sc7180-trogdor-coachz.dtsi > > > > > > > > only the 'big' cores are used as cooling devices in the > > > > 'skin_temp_thermal' - the CPU6 and CPU7. > > > > > > > > I assume you don't want to model at all the power usage > > > > from the Little cluster (which is quite big: 6 CPUs), do you? 
> > > > I can see that the Little CPUs have small dyn-power-coeff > > > > ~30% of the big and lower max freq, but still might be worth > > > > to add them to IPA. You might give them more 'weight', to > > > > make sure they receive more power during power split. > > > > In experiments we saw that including the little cores as cooling > > devices for 'skin_temp_thermal' didn't have a significant impact on > > thermals, so we left them out. > > > > > > You also don't have GPU cooling device in that thermal zone. > > > > Based on my experience if your GPU is a power hungry one, > > > > e.g. 2-4Watts, you might get better results when you model > > > > this 'hot' device (which impacts your temp sensor reported value). > > > > > > I think the two boards you point at (homestar and coachz) are just the > > > two that override the default defined in the SoC dtsi file. If you > > > look in sc7180.dtsi you'll see 'gpuss1-thermal' which has a cooling > > > map. You can also see the cooling maps for the littles. > > > > Yep, plus thermal zones with cooling maps for the big cores. > > > > > I guess we don't have a `dynamic-power-coefficient` for the GPU, > > > though? Seems like we should, but I haven't dug through all the code > > > here... > > > > To my knowledge the SC7x80 GPU doesn't register an energy model, which is > > one of the reasons the GPU wasn't included as cooling device for > > 'skin_temp_thermal'. > > > > You can give it a try by editing the DT and adding in the > GPU node the 'dynamic-power-coefficient' + probably > small modification in the driver code. > > If the GPU driver registers the cooling device in the new way, you > would also get EM registered thanks to the devfreq cooling new code > (commit: 84e0d87c9944eb36ae6037a). > > You can check an example from Panfrost GPU driver [1]. Ah, I missed that, thanks for the pointer! > I can see some upstream MSM GPU driver, but I don't know if that is > your GPU driver. 
It registers the 'old' way the devfreq cooling [2] > but it would be easy to change to use the new function. > The GPU driver would use the same dev_pm_opp_of_register_em() as > your CPUs do, so EM would be in 'milli-Watts' (so should be fine). Yep, that's what we are using.
On 16/02/2022 23:13, Matthias Kaehlcke wrote: > On Wed, Feb 16, 2022 at 09:33:50AM -0800, Doug Anderson wrote: >> Hi, >> >> On Wed, Feb 16, 2022 at 7:35 AM Lukasz Luba <lukasz.luba@arm.com> wrote: >>> >>> Hi Matthias, >>> >>> On 2/9/22 10:17 PM, Matthias Kaehlcke wrote: >>>> On Wed, Feb 09, 2022 at 11:16:36AM +0000, Lukasz Luba wrote: >>>>> >>>>> >>>>> On 2/8/22 5:25 PM, Matthias Kaehlcke wrote: >>>>>> On Tue, Feb 08, 2022 at 09:32:28AM +0000, Lukasz Luba wrote: >>>>>>> >>>>>>> >>> >>> [snip] >>> >>>>>>> Could you point me to those devices please? >>>>>> >>>>>> arch/arm64/boot/dts/qcom/sc7180-trogdor-* >>>>>> >>>>>> Though as per above they shouldn't be impacted by your change, since the >>>>>> CPUs always pretend to use milli-Watts. >>>>>> >>>>>> [skipped some questions/answers since sc7180 isn't actually impacted by >>>>>> the change] >>>>> >>>>> Thank you Matthias. I will investigate your setup to get better >>>>> understanding. >>>> >>>> Thanks! >>>> >>> >>> I've checked those DT files and related code. >>> As you already said, this patch is safe for them. >>> So we can apply it IMO. >>> >>> >>> -------------Off-topic------------------ >>> Not in $subject comments: >>> >>> AFAICS based on two files which define thermal zones: >>> sc7180-trogdor-homestar.dtsi >>> sc7180-trogdor-coachz.dtsi >>> >>> only the 'big' cores are used as cooling devices in the >>> 'skin_temp_thermal' - the CPU6 and CPU7. >>> >>> I assume you don't want to model at all the power usage >>> from the Little cluster (which is quite big: 6 CPUs), do you? >>> I can see that the Little CPUs have small dyn-power-coeff >>> ~30% of the big and lower max freq, but still might be worth >>> to add them to IPA. You might give them more 'weight', to >>> make sure they receive more power during power split. > > In experiments we saw that including the little cores as cooling > devices for 'skin_temp_thermal' didn't have a significant impact on > thermals, so we left them out. 
I agree, that was also my conclusion after doing some measurements. Basically, the little cores are always cold and are victims of the big cores' heat dissipation. They of course contribute a bit to the heat, but capping their performance does not change the overall temperature trend. That is less true with silver-gold, where it is the same micro-arch but different frequencies. >>> You also don't have GPU cooling device in that thermal zone. >>> Based on my experience if your GPU is a power hungry one, >>> e.g. 2-4Watts, you might get better results when you model >>> this 'hot' device (which impacts your temp sensor reported value). >> >> I think the two boards you point at (homestar and coachz) are just the >> two that override the default defined in the SoC dtsi file. If you >> look in sc7180.dtsi you'll see 'gpuss1-thermal' which has a cooling >> map. You can also see the cooling maps for the littles. > > Yep, plus thermal zones with cooling maps for the big cores. > >> I guess we don't have a `dynamic-power-coefficient` for the GPU, >> though? Seems like we should, but I haven't dug through all the code >> here... > > To my knowledge the SC7x80 GPU doesn't register an energy model, which is > one of the reasons the GPU wasn't included as cooling device for > 'skin_temp_thermal'.
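The 'weight' knob quoted above changes how IPA divides the granted power budget: each cooling device's power request is scaled by its weight before the proportional split, so a heavily weighted device gets a larger share. A small Python sketch of that idea (hypothetical helper name; the kernel logic is roughly what the power_allocator governor's divvy_up_power() does):

```python
def split_power(budget_mw, requests_mw, weights):
    """Split a power budget proportionally to weighted requests
    (sketch of the IPA power-split idea, not the kernel code)."""
    weighted = [r * w for r, w in zip(requests_mw, weights)]
    total = sum(weighted)
    if total == 0:
        # Nobody requests power: grant nothing.
        return [0] * len(requests_mw)
    # Integer division, as the kernel works in integer milli-Watts.
    return [budget_mw * wr // total for wr in weighted]

# Bigs and littles request the same power, but the littles carry a
# higher weight, so they receive a larger share of the budget.
grants = split_power(1000, [400, 400], [1, 3])  # [250, 750]
```

With equal weights the split degenerates to a plain proportional share of the requests, which is why leaving the cold littles out barely changes the outcome here.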
On 16/02/2022 18:33, Doug Anderson wrote: > Hi, > > On Wed, Feb 16, 2022 at 7:35 AM Lukasz Luba <lukasz.luba@arm.com> wrote: >> >> Hi Matthias, >> >> On 2/9/22 10:17 PM, Matthias Kaehlcke wrote: >>> On Wed, Feb 09, 2022 at 11:16:36AM +0000, Lukasz Luba wrote: >>>> >>>> >>>> On 2/8/22 5:25 PM, Matthias Kaehlcke wrote: >>>>> On Tue, Feb 08, 2022 at 09:32:28AM +0000, Lukasz Luba wrote: >>>>>> >>>>>> >> >> [snip] >> >>>>>> Could you point me to those devices please? >>>>> >>>>> arch/arm64/boot/dts/qcom/sc7180-trogdor-* >>>>> >>>>> Though as per above they shouldn't be impacted by your change, since the >>>>> CPUs always pretend to use milli-Watts. >>>>> >>>>> [skipped some questions/answers since sc7180 isn't actually impacted by >>>>> the change] >>>> >>>> Thank you Matthias. I will investigate your setup to get better >>>> understanding. >>> >>> Thanks! >>> >> >> I've checked those DT files and related code. >> As you already said, this patch is safe for them. >> So we can apply it IMO. >> >> >> -------------Off-topic------------------ >> Not in $subject comments: >> >> AFAICS based on two files which define thermal zones: >> sc7180-trogdor-homestar.dtsi >> sc7180-trogdor-coachz.dtsi >> >> only the 'big' cores are used as cooling devices in the >> 'skin_temp_thermal' - the CPU6 and CPU7. >> >> I assume you don't want to model at all the power usage >> from the Little cluster (which is quite big: 6 CPUs), do you? >> I can see that the Little CPUs have small dyn-power-coeff >> ~30% of the big and lower max freq, but still might be worth >> to add them to IPA. You might give them more 'weight', to >> make sure they receive more power during power split. >> >> You also don't have GPU cooling device in that thermal zone. >> Based on my experience if your GPU is a power hungry one, >> e.g. 2-4Watts, you might get better results when you model >> this 'hot' device (which impacts your temp sensor reported value). 
> > I think the two boards you point at (homestar and coachz) are just the > two that override the default defined in the SoC dtsi file. If you > look in sc7180.dtsi you'll see 'gpuss1-thermal' which has a cooling > map. You can also see the cooling maps for the littles. > > I guess we don't have a `dynamic-power-coefficient` for the GPU, > though? Seems like we should, but I haven't dug through all the code > here... The dynamic-power-coefficient is available for OPPs, which covers both CPUfreq and devfreq. As the GPU is managed by devfreq, setting the dynamic-power-coefficient makes the energy model available for it. However, the OPPs must define both the frequency and the voltage. That is the case for most platforms, except on QCom platforms. There the voltage may not be specified, as the platform uses a frequency index and the hardware does the voltage change behind our back. The QCom cpufreq backend gets the voltage table from a register (or whatever) and completes the voltage values for the OPPs, thus adding the information which is missing in the device tree. The energy model can then initialize itself and allow the usage of the Energy Aware Scheduler. However, this piece of code is missing for the GPU part.
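The voltage matters because the simple Energy Model built from 'dynamic-power-coefficient' computes each OPP's dynamic power as P = C * f * V^2, with C in uW/MHz/V^2 (per the DT binding). A quick sketch of that arithmetic, with a hypothetical helper name:

```python
def opp_dynamic_power_uw(coeff_uw_per_mhz_v2, freq_hz, volt_uv):
    """Dynamic power of one OPP: P = C * f * V^2.

    C is the 'dynamic-power-coefficient' (uW/MHz/V^2), f the OPP
    frequency, V the OPP voltage. Illustrative sketch only.
    """
    freq_mhz = freq_hz / 1_000_000
    volt_v = volt_uv / 1_000_000
    return coeff_uw_per_mhz_v2 * freq_mhz * volt_v ** 2

# 1 GHz OPP at 1.0 V with C = 100 uW/MHz/V^2 -> 100000 uW (100 mW).
power = opp_dynamic_power_uw(100, 1_000_000_000, 1_000_000)
```

Without a per-OPP voltage (the QCom GPU case above), V is simply unknown and the formula cannot be evaluated, which is why the EM never gets built.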
On 16/02/2022 23:43, Lukasz Luba wrote: > > > On 2/16/22 10:13 PM, Matthias Kaehlcke wrote: >> On Wed, Feb 16, 2022 at 09:33:50AM -0800, Doug Anderson wrote: >>> Hi, >>> >>> On Wed, Feb 16, 2022 at 7:35 AM Lukasz Luba <lukasz.luba@arm.com> wrote: >>>> >>>> Hi Matthias, >>>> >>>> On 2/9/22 10:17 PM, Matthias Kaehlcke wrote: >>>>> On Wed, Feb 09, 2022 at 11:16:36AM +0000, Lukasz Luba wrote: >>>>>> >>>>>> >>>>>> On 2/8/22 5:25 PM, Matthias Kaehlcke wrote: >>>>>>> On Tue, Feb 08, 2022 at 09:32:28AM +0000, Lukasz Luba wrote: >>>>>>>> >>>>>>>> >>>> >>>> [snip] >>>> >>>>>>>> Could you point me to those devices please? >>>>>>> >>>>>>> arch/arm64/boot/dts/qcom/sc7180-trogdor-* >>>>>>> >>>>>>> Though as per above they shouldn't be impacted by your change, >>>>>>> since the >>>>>>> CPUs always pretend to use milli-Watts. >>>>>>> >>>>>>> [skipped some questions/answers since sc7180 isn't actually >>>>>>> impacted by >>>>>>> the change] >>>>>> >>>>>> Thank you Matthias. I will investigate your setup to get better >>>>>> understanding. >>>>> >>>>> Thanks! >>>>> >>>> >>>> I've checked those DT files and related code. >>>> As you already said, this patch is safe for them. >>>> So we can apply it IMO. >>>> >>>> >>>> -------------Off-topic------------------ >>>> Not in $subject comments: >>>> >>>> AFAICS based on two files which define thermal zones: >>>> sc7180-trogdor-homestar.dtsi >>>> sc7180-trogdor-coachz.dtsi >>>> >>>> only the 'big' cores are used as cooling devices in the >>>> 'skin_temp_thermal' - the CPU6 and CPU7. >>>> >>>> I assume you don't want to model at all the power usage >>>> from the Little cluster (which is quite big: 6 CPUs), do you? >>>> I can see that the Little CPUs have small dyn-power-coeff >>>> ~30% of the big and lower max freq, but still might be worth >>>> to add them to IPA. You might give them more 'weight', to >>>> make sure they receive more power during power split. 
>> >> In experiments we saw that including the little cores as cooling >> devices for 'skin_temp_thermal' didn't have a significant impact on >> thermals, so we left them out. >> >>>> You also don't have GPU cooling device in that thermal zone. >>>> Based on my experience if your GPU is a power hungry one, >>>> e.g. 2-4Watts, you might get better results when you model >>>> this 'hot' device (which impacts your temp sensor reported value). >>> >>> I think the two boards you point at (homestar and coachz) are just the >>> two that override the default defined in the SoC dtsi file. If you >>> look in sc7180.dtsi you'll see 'gpuss1-thermal' which has a cooling >>> map. You can also see the cooling maps for the littles. >> >> Yep, plus thermal zones with cooling maps for the big cores. >> >>> I guess we don't have a `dynamic-power-coefficient` for the GPU, >>> though? Seems like we should, but I haven't dug through all the code >>> here... >> >> To my knowledge the SC7x80 GPU doesn't register an energy model, which is >> one of the reasons the GPU wasn't included as cooling device for >> 'skin_temp_thermal'. >> > > You can give it a try by editing the DT and adding in the > GPU node the 'dynamic-power-coefficient' + probably > small modification in the driver code. That should not work for sc7180 as the DT does not specify the voltage values. The QCom backend should retrieve the voltage values for each OPP from a register or a hardcoded table coming from the SoC documentation. However, I can confirm that the dynamic-power-coefficient will add an energy model for the GPU (or any devfreq device) if the voltage is specified in the OPP description; I checked with the panfrost driver and the memory controller on rk3399. > If the GPU driver registers the cooling device in the new way, you > would also get EM registered thanks to the devfreq cooling new code > (commit: 84e0d87c9944eb36ae6037a). > > You can check an example from Panfrost GPU driver [1]. 
> > I can see some upstream MSM GPU driver, but I don't know if that is > your GPU driver. It registers the 'old' way the devfreq cooling [2] > but it would be easy to change to use the new function. > The GPU driver would use the same dev_pm_opp_of_register_em() as > your CPUs do, so EM would be in 'milli-Watts' (so should be fine). > > > [1] > https://elixir.bootlin.com/linux/v5.16/source/drivers/gpu/drm/panfrost/panfrost_devfreq.c#L153 > > [2] > https://elixir.bootlin.com/linux/v5.16/source/drivers/gpu/drm/msm/msm_gpu_devfreq.c#L129 >
Hi Daniel, On 2/17/22 10:10 AM, Daniel Lezcano wrote: > On 16/02/2022 18:33, Doug Anderson wrote: >> Hi, >> >> On Wed, Feb 16, 2022 at 7:35 AM Lukasz Luba <lukasz.luba@arm.com> wrote: >>> >>> Hi Matthias, >>> >>> On 2/9/22 10:17 PM, Matthias Kaehlcke wrote: >>>> On Wed, Feb 09, 2022 at 11:16:36AM +0000, Lukasz Luba wrote: >>>>> >>>>> >>>>> On 2/8/22 5:25 PM, Matthias Kaehlcke wrote: >>>>>> On Tue, Feb 08, 2022 at 09:32:28AM +0000, Lukasz Luba wrote: >>>>>>> >>>>>>> >>> >>> [snip] >>> >>>>>>> Could you point me to those devices please? >>>>>> >>>>>> arch/arm64/boot/dts/qcom/sc7180-trogdor-* >>>>>> >>>>>> Though as per above they shouldn't be impacted by your change, >>>>>> since the >>>>>> CPUs always pretend to use milli-Watts. >>>>>> >>>>>> [skipped some questions/answers since sc7180 isn't actually >>>>>> impacted by >>>>>> the change] >>>>> >>>>> Thank you Matthias. I will investigate your setup to get better >>>>> understanding. >>>> >>>> Thanks! >>>> >>> >>> I've checked those DT files and related code. >>> As you already said, this patch is safe for them. >>> So we can apply it IMO. >>> >>> >>> -------------Off-topic------------------ >>> Not in $subject comments: >>> >>> AFAICS based on two files which define thermal zones: >>> sc7180-trogdor-homestar.dtsi >>> sc7180-trogdor-coachz.dtsi >>> >>> only the 'big' cores are used as cooling devices in the >>> 'skin_temp_thermal' - the CPU6 and CPU7. >>> >>> I assume you don't want to model at all the power usage >>> from the Little cluster (which is quite big: 6 CPUs), do you? >>> I can see that the Little CPUs have small dyn-power-coeff >>> ~30% of the big and lower max freq, but still might be worth >>> to add them to IPA. You might give them more 'weight', to >>> make sure they receive more power during power split. >>> >>> You also don't have GPU cooling device in that thermal zone. >>> Based on my experience if your GPU is a power hungry one, >>> e.g. 
2-4Watts, you might get better results when you model >>> this 'hot' device (which impacts your temp sensor reported value). >> >> I think the two boards you point at (homestar and coachz) are just the >> two that override the default defined in the SoC dtsi file. If you >> look in sc7180.dtsi you'll see 'gpuss1-thermal' which has a cooling >> map. You can also see the cooling maps for the littles. >> >> I guess we don't have a `dynamic-power-coefficient` for the GPU, >> though? Seems like we should, but I haven't dug through all the code >> here... > > The dynamic-power-coefficient is available for OPPs which includes > CPUfreq and devfreq. As the GPU is managed by devfreq, setting the > dynamic-power-coefficient makes the energy model available for it. > > However, the OPPs must define the frequency and the voltage. That is the > case for most platforms except on QCom platform. > > That may not be specified as it uses a frequency index and the hardware > does the voltage change in our back. The QCom cpufreq backend get the > voltage table from a register (or whatever) and completes the voltage > values for the OPPs, thus adding the information which is missing in the > device tree. The energy model can then initializes itself and allows the > usage of the Energy Aware Scheduler. > > However this piece of code is missing for the GPU part. > Thank you for joining the discussion. I didn't know that the QCom GPU voltage information is missing. If the voltage is not available (only the frequencies), there is another way. There is an 'advanced' EM which uses the registration function em_dev_register_perf_domain(). It uses a local driver callback to get the power for each found frequency. This has the benefit that there is no restriction to 'fit' the math formula; instead, just average power values can be fed into the EM. It's called the 'advanced' EM [1]. 
Now we hit (again) the DT & EM issue (it's an old one; IIRC Morten was proposing this upstream from ~2014, but EAS wasn't merged back then): where to store these power-freq values, which are then used by the callback. We have the 'dynamic-power-coefficient' in DT, but it has limitations. It would be good to have this simple array attached to the GPU/CPU node. IMHO it meets the requirements of DT: it describes the HW (it would have Hz and Watt values). Doug, Matthias, could you have a look at that function and its usage, please [1]? If you guys would support me on this, I would start a discussion on LKML with an RFC proposal. [1] https://elixir.bootlin.com/linux/v5.17-rc4/source/Documentation/power/energy-model.rst#L87
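For background on what the EM core does with the values an active_power() callback reports: besides storing (frequency, power) per performance state, it also derives a 'cost' term, roughly fmax * power / freq, which EAS uses to compare the energy efficiency of OPPs. A rough Python sketch under that assumption (hypothetical helper, not the kernel code):

```python
def build_em_table(freq_power_uw):
    """Build (freq_hz, power_uw, cost) tuples roughly the way the EM
    core does: cost = fmax * power / freq (illustrative sketch)."""
    table = sorted(freq_power_uw)
    fmax = table[-1][0]  # highest supported frequency
    return [(f, p, fmax * p // f) for f, p in table]

# Callback-style power data for a hypothetical GPU with two OPPs.
em = build_em_table([(500_000_000, 100_000), (1_000_000_000, 400_000)])
```

Note how the 500 MHz OPP ends up with half the cost of the 1 GHz one here, which is exactly the kind of comparison the average-power array would feed.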
On 17/02/2022 11:47, Lukasz Luba wrote: > Hi Daniel, > > On 2/17/22 10:10 AM, Daniel Lezcano wrote: >> On 16/02/2022 18:33, Doug Anderson wrote: >>> Hi, >>> >>> On Wed, Feb 16, 2022 at 7:35 AM Lukasz Luba <lukasz.luba@arm.com> wrote: >>>> >>>> Hi Matthias, >>>> >>>> On 2/9/22 10:17 PM, Matthias Kaehlcke wrote: >>>>> On Wed, Feb 09, 2022 at 11:16:36AM +0000, Lukasz Luba wrote: >>>>>> >>>>>> >>>>>> On 2/8/22 5:25 PM, Matthias Kaehlcke wrote: >>>>>>> On Tue, Feb 08, 2022 at 09:32:28AM +0000, Lukasz Luba wrote: >>>>>>>> >>>>>>>> >>>> >>>> [snip] >>>> >>>>>>>> Could you point me to those devices please? >>>>>>> >>>>>>> arch/arm64/boot/dts/qcom/sc7180-trogdor-* >>>>>>> >>>>>>> Though as per above they shouldn't be impacted by your change, >>>>>>> since the >>>>>>> CPUs always pretend to use milli-Watts. >>>>>>> >>>>>>> [skipped some questions/answers since sc7180 isn't actually >>>>>>> impacted by >>>>>>> the change] >>>>>> >>>>>> Thank you Matthias. I will investigate your setup to get better >>>>>> understanding. >>>>> >>>>> Thanks! >>>>> >>>> >>>> I've checked those DT files and related code. >>>> As you already said, this patch is safe for them. >>>> So we can apply it IMO. >>>> >>>> >>>> -------------Off-topic------------------ >>>> Not in $subject comments: >>>> >>>> AFAICS based on two files which define thermal zones: >>>> sc7180-trogdor-homestar.dtsi >>>> sc7180-trogdor-coachz.dtsi >>>> >>>> only the 'big' cores are used as cooling devices in the >>>> 'skin_temp_thermal' - the CPU6 and CPU7. >>>> >>>> I assume you don't want to model at all the power usage >>>> from the Little cluster (which is quite big: 6 CPUs), do you? >>>> I can see that the Little CPUs have small dyn-power-coeff >>>> ~30% of the big and lower max freq, but still might be worth >>>> to add them to IPA. You might give them more 'weight', to >>>> make sure they receive more power during power split. >>>> >>>> You also don't have GPU cooling device in that thermal zone. 
>>>> Based on my experience if your GPU is a power hungry one, >>>> e.g. 2-4Watts, you might get better results when you model >>>> this 'hot' device (which impacts your temp sensor reported value). >>> >>> I think the two boards you point at (homestar and coachz) are just the >>> two that override the default defined in the SoC dtsi file. If you >>> look in sc7180.dtsi you'll see 'gpuss1-thermal' which has a cooling >>> map. You can also see the cooling maps for the littles. >>> >>> I guess we don't have a `dynamic-power-coefficient` for the GPU, >>> though? Seems like we should, but I haven't dug through all the code >>> here... >> >> The dynamic-power-coefficient is available for OPPs which includes >> CPUfreq and devfreq. As the GPU is managed by devfreq, setting the >> dynamic-power-coefficient makes the energy model available for it. >> >> However, the OPPs must define the frequency and the voltage. That is >> the case for most platforms except on QCom platform. >> >> That may not be specified as it uses a frequency index and the >> hardware does the voltage change in our back. The QCom cpufreq backend >> get the voltage table from a register (or whatever) and completes the >> voltage values for the OPPs, thus adding the information which is >> missing in the device tree. The energy model can then initializes >> itself and allows the usage of the Energy Aware Scheduler. >> >> However this piece of code is missing for the GPU part. >> > > Thank you for joining the discussion. I don't know about that Qcom > GPU voltage information is missing. > > If the voltage is not available (only the frequencies), there is > another way. There is an 'advanced' EM which uses registration function: > em_dev_register_perf_domain(). It uses a local driver callback to get > power for each found frequency. It has benefit because there is no > restriction to 'fit' into the math formula, instead just avg power > values can be feed into EM. It's called 'advanced' EM [1]. 
> > Now we hit (again) the DT & EM issue (it's an old one, IIRC Morten > was proposing from ~2014 this upstream, but EAS wasn't merged back > then): > where to store these power-freq values, which are then used by the > callback. Why not make it more generic and replace the frequency by a performance index, so it can be used by any kind of perf limiter? > We have the 'dynamic-power-coefficient' in DT, but > it has limitations. It would be good to have this simple array > attached to the GPU/CPU node. IMHO it meet the requirement of DT, > it describes the HW (it would have HZ and Watts values). > > Doug, Matthias could you have a look at that function and its > usage, please [1]? > If you guys would support me in this, I would start, with an RFC > proposal, a discussion on LKML. > > > [1] > https://elixir.bootlin.com/linux/v5.17-rc4/source/Documentation/power/energy-model.rst#L87 >
On 2/17/22 11:28 AM, Daniel Lezcano wrote: > On 17/02/2022 11:47, Lukasz Luba wrote: >> Hi Daniel, >> >> On 2/17/22 10:10 AM, Daniel Lezcano wrote: >>> On 16/02/2022 18:33, Doug Anderson wrote: >>>> Hi, >>>> >>>> On Wed, Feb 16, 2022 at 7:35 AM Lukasz Luba <lukasz.luba@arm.com> >>>> wrote: >>>>> >>>>> Hi Matthias, >>>>> >>>>> On 2/9/22 10:17 PM, Matthias Kaehlcke wrote: >>>>>> On Wed, Feb 09, 2022 at 11:16:36AM +0000, Lukasz Luba wrote: >>>>>>> >>>>>>> >>>>>>> On 2/8/22 5:25 PM, Matthias Kaehlcke wrote: >>>>>>>> On Tue, Feb 08, 2022 at 09:32:28AM +0000, Lukasz Luba wrote: >>>>>>>>> >>>>>>>>> >>>>> >>>>> [snip] >>>>> >>>>>>>>> Could you point me to those devices please? >>>>>>>> >>>>>>>> arch/arm64/boot/dts/qcom/sc7180-trogdor-* >>>>>>>> >>>>>>>> Though as per above they shouldn't be impacted by your change, >>>>>>>> since the >>>>>>>> CPUs always pretend to use milli-Watts. >>>>>>>> >>>>>>>> [skipped some questions/answers since sc7180 isn't actually >>>>>>>> impacted by >>>>>>>> the change] >>>>>>> >>>>>>> Thank you Matthias. I will investigate your setup to get better >>>>>>> understanding. >>>>>> >>>>>> Thanks! >>>>>> >>>>> >>>>> I've checked those DT files and related code. >>>>> As you already said, this patch is safe for them. >>>>> So we can apply it IMO. >>>>> >>>>> >>>>> -------------Off-topic------------------ >>>>> Not in $subject comments: >>>>> >>>>> AFAICS based on two files which define thermal zones: >>>>> sc7180-trogdor-homestar.dtsi >>>>> sc7180-trogdor-coachz.dtsi >>>>> >>>>> only the 'big' cores are used as cooling devices in the >>>>> 'skin_temp_thermal' - the CPU6 and CPU7. >>>>> >>>>> I assume you don't want to model at all the power usage >>>>> from the Little cluster (which is quite big: 6 CPUs), do you? >>>>> I can see that the Little CPUs have small dyn-power-coeff >>>>> ~30% of the big and lower max freq, but still might be worth >>>>> to add them to IPA. 
You might give them more 'weight', to >>>>> make sure they receive more power during power split. >>>>> >>>>> You also don't have GPU cooling device in that thermal zone. >>>>> Based on my experience if your GPU is a power hungry one, >>>>> e.g. 2-4Watts, you might get better results when you model >>>>> this 'hot' device (which impacts your temp sensor reported value). >>>> >>>> I think the two boards you point at (homestar and coachz) are just the >>>> two that override the default defined in the SoC dtsi file. If you >>>> look in sc7180.dtsi you'll see 'gpuss1-thermal' which has a cooling >>>> map. You can also see the cooling maps for the littles. >>>> >>>> I guess we don't have a `dynamic-power-coefficient` for the GPU, >>>> though? Seems like we should, but I haven't dug through all the code >>>> here... >>> >>> The dynamic-power-coefficient is available for OPPs which includes >>> CPUfreq and devfreq. As the GPU is managed by devfreq, setting the >>> dynamic-power-coefficient makes the energy model available for it. >>> >>> However, the OPPs must define the frequency and the voltage. That is >>> the case for most platforms except on QCom platform. >>> >>> That may not be specified as it uses a frequency index and the >>> hardware does the voltage change in our back. The QCom cpufreq >>> backend get the voltage table from a register (or whatever) and >>> completes the voltage values for the OPPs, thus adding the >>> information which is missing in the device tree. The energy model can >>> then initializes itself and allows the usage of the Energy Aware >>> Scheduler. >>> >>> However this piece of code is missing for the GPU part. >>> >> >> Thank you for joining the discussion. I don't know about that Qcom >> GPU voltage information is missing. >> >> If the voltage is not available (only the frequencies), there is >> another way. There is an 'advanced' EM which uses registration function: >> em_dev_register_perf_domain(). 
It uses a local driver callback to get >> power for each found frequency. It has benefit because there is no >> restriction to 'fit' into the math formula, instead just avg power >> values can be feed into EM. It's called 'advanced' EM [1]. >> >> Now we hit (again) the DT & EM issue (it's an old one, IIRC Morten >> was proposing from ~2014 this upstream, but EAS wasn't merged back >> then): >> where to store these power-freq values, which are then used by the >> callback. > > Why not make it more generic and replace the frequency by a performance > index, so it can be used by any kind of perf limiter? For that DT array, yes, it can be an index, so effectively it could be a simple 1d array. Something like:

msm_gpu_energy_model: msm-gpu-energy-model {
	compatible = "energy-model";
	/* Power values in micro-Watts, sorted, one per OPP or
	   performance state. The number of entries must match the
	   number of OPPs. */
	power-microwatt = <100000>,
			  <230000>,
			  <380000>,
			  <600000>;
};

Then in the gpu node, instead of having 'dynamic-power-coefficient', which is useless because the voltage is missing, we would have 'energy-model', like:

energy-model = <&msm_gpu_energy_model>;

If you agree to continue this topic, I will send an RFC so we can discuss this idea further; this $subject thread doesn't fit well. Thank you again for your feedback Daniel!
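In concrete terms, the 'advanced' EM registration discussed above amounts to the framework asking a driver callback for the average power at each OPP and building a table of <frequency, power> tuples from the answers. The sketch below models that flow in Python purely for illustration; the real interface is the kernel's em_dev_register_perf_domain() with an em_data_callback, and the frequencies and power numbers here are hypothetical:

```python
# Conceptual model (illustration only) of the "advanced" Energy Model
# registration: walk the device's OPPs and ask a driver callback for
# the power at each frequency, building <frequency, power> tuples.

# Hypothetical per-OPP average power measurements in micro-Watts,
# e.g. taken from board measurements rather than derived from C*f*V^2.
MEASURED_POWER_UW = {
    180_000_000: 100_000,
    342_000_000: 230_000,
    504_000_000: 380_000,
    700_000_000: 600_000,
}

def active_power(freq_hz):
    """Driver callback: return average power (uW) for a given frequency."""
    return MEASURED_POWER_UW[freq_hz]

def register_perf_domain(opp_freqs_hz, get_power):
    """Build the EM table: one (frequency, power) state per OPP, sorted."""
    return [(f, get_power(f)) for f in sorted(opp_freqs_hz)]

em_table = register_perf_domain(MEASURED_POWER_UW.keys(), active_power)
```

Because the power values come from a callback rather than a formula, the table can hold arbitrary measured averages, which is exactly why this route works when the OPP voltages are unavailable.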
On 17/02/2022 13:11, Lukasz Luba wrote: [ ... ] >> Why not make it more generic and replace the frequency by a >> performance index, so it can be used by any kind of perf limiter? > > For that DT array, yes, it can be an index, so effectively it could be > a simple 1d array. > > something like: > > msm_gpu_energy_model: msm-gpu-energy-model { > compatible = "energy-model" > /* Values are sorted micro-Watts which correspond to each OPP > or performance state. The total amount of them must match > number of OPPs. */ > power-microwatt = <100000>, > <230000>, > <380000>, > <600000>; > }; > > then in gpu node instead of having 'dynamic-power-coefficient', > which is useless because voltage is missing, we would have > 'energy-model', like: > > energy-model = <&msm_gpu_energy_model>; > > > If you agree to continue this topic. I will send an RFC so we could > further discuss this idea. This $subject doesn't fit well. Yes, I definitely agree to continue on this topic.
On 2/17/22 12:33 PM, Daniel Lezcano wrote: > On 17/02/2022 13:11, Lukasz Luba wrote: > > [ ... ] > >>> Why not make it more generic and replace the frequency by a >>> performance index, so it can be used by any kind of perf limiter? >> >> For that DT array, yes, it can be an index, so effectively it could be >> a simple 1d array. >> >> something like: >> >> msm_gpu_energy_model: msm-gpu-energy-model { >> compatible = "energy-model" >> /* Values are sorted micro-Watts which correspond to each OPP >> or performance state. The total amount of them must match >> number of OPPs. */ >> power-microwatt = <100000>, >> <230000>, >> <380000>, >> <600000>; >> }; >> >> then in gpu node instead of having 'dynamic-power-coefficient', >> which is useless because voltage is missing, we would have >> 'energy-model', like: >> >> energy-model = <&msm_gpu_energy_model>; >> >> >> If you agree to continue this topic. I will send an RFC so we could >> further discuss this idea. This $subject doesn't fit well. > > Yes, definitively I agree to continue on this topic. > > Great! I'm going to craft something...
Hi, On Thu, Feb 17, 2022 at 2:47 AM Lukasz Luba <lukasz.luba@arm.com> wrote: > > Hi Daniel, > > On 2/17/22 10:10 AM, Daniel Lezcano wrote: > > On 16/02/2022 18:33, Doug Anderson wrote: > >> Hi, > >> > >> On Wed, Feb 16, 2022 at 7:35 AM Lukasz Luba <lukasz.luba@arm.com> wrote: > >>> > >>> Hi Matthias, > >>> > >>> On 2/9/22 10:17 PM, Matthias Kaehlcke wrote: > >>>> On Wed, Feb 09, 2022 at 11:16:36AM +0000, Lukasz Luba wrote: > >>>>> > >>>>> > >>>>> On 2/8/22 5:25 PM, Matthias Kaehlcke wrote: > >>>>>> On Tue, Feb 08, 2022 at 09:32:28AM +0000, Lukasz Luba wrote: > >>>>>>> > >>>>>>> > >>> > >>> [snip] > >>> > >>>>>>> Could you point me to those devices please? > >>>>>> > >>>>>> arch/arm64/boot/dts/qcom/sc7180-trogdor-* > >>>>>> > >>>>>> Though as per above they shouldn't be impacted by your change, > >>>>>> since the > >>>>>> CPUs always pretend to use milli-Watts. > >>>>>> > >>>>>> [skipped some questions/answers since sc7180 isn't actually > >>>>>> impacted by > >>>>>> the change] > >>>>> > >>>>> Thank you Matthias. I will investigate your setup to get better > >>>>> understanding. > >>>> > >>>> Thanks! > >>>> > >>> > >>> I've checked those DT files and related code. > >>> As you already said, this patch is safe for them. > >>> So we can apply it IMO. > >>> > >>> > >>> -------------Off-topic------------------ > >>> Not in $subject comments: > >>> > >>> AFAICS based on two files which define thermal zones: > >>> sc7180-trogdor-homestar.dtsi > >>> sc7180-trogdor-coachz.dtsi > >>> > >>> only the 'big' cores are used as cooling devices in the > >>> 'skin_temp_thermal' - the CPU6 and CPU7. > >>> > >>> I assume you don't want to model at all the power usage > >>> from the Little cluster (which is quite big: 6 CPUs), do you? > >>> I can see that the Little CPUs have small dyn-power-coeff > >>> ~30% of the big and lower max freq, but still might be worth > >>> to add them to IPA. You might give them more 'weight', to > >>> make sure they receive more power during power split. 
> >>> > >>> You also don't have GPU cooling device in that thermal zone. > >>> Based on my experience if your GPU is a power hungry one, > >>> e.g. 2-4Watts, you might get better results when you model > >>> this 'hot' device (which impacts your temp sensor reported value). > >> > >> I think the two boards you point at (homestar and coachz) are just the > >> two that override the default defined in the SoC dtsi file. If you > >> look in sc7180.dtsi you'll see 'gpuss1-thermal' which has a cooling > >> map. You can also see the cooling maps for the littles. > >> > >> I guess we don't have a `dynamic-power-coefficient` for the GPU, > >> though? Seems like we should, but I haven't dug through all the code > >> here... > > > > The dynamic-power-coefficient is available for OPPs which includes > > CPUfreq and devfreq. As the GPU is managed by devfreq, setting the > > dynamic-power-coefficient makes the energy model available for it. > > > > However, the OPPs must define the frequency and the voltage. That is the > > case for most platforms except on QCom platform. > > > > That may not be specified as it uses a frequency index and the hardware > > does the voltage change in our back. The QCom cpufreq backend get the > > voltage table from a register (or whatever) and completes the voltage > > values for the OPPs, thus adding the information which is missing in the > > device tree. The energy model can then initializes itself and allows the > > usage of the Energy Aware Scheduler. > > > > However this piece of code is missing for the GPU part. > > > > Thank you for joining the discussion. I don't know about that Qcom > GPU voltage information is missing. > > If the voltage is not available (only the frequencies), there is > another way. There is an 'advanced' EM which uses registration function: > em_dev_register_perf_domain(). It uses a local driver callback to get > power for each found frequency. 
It has benefit because there is no > restriction to 'fit' into the math formula, instead just avg power > values can be feed into EM. It's called 'advanced' EM [1]. It seems like there _should_ be a way to get the voltage out for GPU operating points, like is done with cpufreq in qcom_cpufreq_hw_read_lut(), but it might need someone with Qualcomm documentation to help with it. Maybe Rajendra would be able to help? Adding Jordon and Rob to this conversation in case they're aware of anything. As you said, we could just list a power for each frequency, though. I'm actually not sure which one would be more accurate across a range of devices with different "corners": specifying a dynamic power coefficient used for all "corners" and then using the actual voltage and doing the math, or specifying a power number for each frequency and ignoring the actual voltage used. In any case we're trying to get ballpark numbers and not every device will be exactly the same, so probably it doesn't matter that much. > Now we hit (again) the DT & EM issue (it's an old one, IIRC Morten > was proposing from ~2014 this upstream, but EAS wasn't merged back > then): > where to store these power-freq values, which are then used by the > callback. We have the 'dynamic-power-coefficient' in DT, but > it has limitations. It would be good to have this simple array > attached to the GPU/CPU node. IMHO it meet the requirement of DT, > it describes the HW (it would have HZ and Watts values). > > Doug, Matthias could you have a look at that function and its > usage, please [1]? > If you guys would support me in this, I would start, with an RFC > proposal, a discussion on LKML. > > [1] > https://elixir.bootlin.com/linux/v5.17-rc4/source/Documentation/power/energy-model.rst#L87 Matthias: I think you've spent more time on the thermal stuff than me so I'll assume you'll follow-up here. If not then please yell! Ideally, though, someone from Qualcomm would jump in and own this.
Basically it allows throttling the GPU and CPU together more intelligently, in tandem, instead of treating them separately, IIUC, right? -Doug
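The two options Doug weighs (a single dynamic-power-coefficient combined with the per-"corner" voltage, versus a fixed power number per frequency) differ in whether the voltage enters the math at all. The coefficient route follows the usual dynamic power relation P = C * f * V^2 that the kernel's EM applies when registering from a 'dynamic-power-coefficient'; the sketch below shows that math with illustrative unit scaling and made-up numbers, not the kernel's actual fixed-point implementation:

```python
def dynamic_power_mw(coeff, freq_hz, volt_uv):
    """Dynamic power P = C * f * V^2.

    coeff   -- DT 'dynamic-power-coefficient', in uW/MHz/V^2
    freq_hz -- OPP frequency in Hz
    volt_uv -- OPP voltage in uV (the piece missing from the QCom GPU OPPs)
    Returns power in mW.
    """
    freq_mhz = freq_hz / 1e6
    volt_v = volt_uv / 1e6
    return coeff * freq_mhz * volt_v ** 2 / 1000.0  # uW -> mW

# e.g. coeff 100 uW/MHz/V^2, 1.8 GHz at 0.9 V -> ~145.8 mW
power_mw = dynamic_power_mw(100, 1_800_000_000, 900_000)
```

With the voltage unknown this math cannot be evaluated, which is what makes the per-frequency power table the practical alternative here.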
On Thu, Feb 17, 2022 at 12:11:27PM +0000, Lukasz Luba wrote: > > > On 2/17/22 11:28 AM, Daniel Lezcano wrote: > > On 17/02/2022 11:47, Lukasz Luba wrote: > > > Hi Daniel, > > > > > > On 2/17/22 10:10 AM, Daniel Lezcano wrote: > > > > On 16/02/2022 18:33, Doug Anderson wrote: > > > > > Hi, > > > > > > > > > > On Wed, Feb 16, 2022 at 7:35 AM Lukasz Luba > > > > > <lukasz.luba@arm.com> wrote: > > > > > > > > > > > > Hi Matthias, > > > > > > > > > > > > On 2/9/22 10:17 PM, Matthias Kaehlcke wrote: > > > > > > > On Wed, Feb 09, 2022 at 11:16:36AM +0000, Lukasz Luba wrote: > > > > > > > > > > > > > > > > > > > > > > > > On 2/8/22 5:25 PM, Matthias Kaehlcke wrote: > > > > > > > > > On Tue, Feb 08, 2022 at 09:32:28AM +0000, Lukasz Luba wrote: > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > [snip] > > > > > > > > > > > > > > > > Could you point me to those devices please? > > > > > > > > > > > > > > > > > > arch/arm64/boot/dts/qcom/sc7180-trogdor-* > > > > > > > > > > > > > > > > > > Though as per above they shouldn't be > > > > > > > > > impacted by your change, since the > > > > > > > > > CPUs always pretend to use milli-Watts. > > > > > > > > > > > > > > > > > > [skipped some questions/answers since sc7180 > > > > > > > > > isn't actually impacted by > > > > > > > > > the change] > > > > > > > > > > > > > > > > Thank you Matthias. I will investigate your setup to get better > > > > > > > > understanding. > > > > > > > > > > > > > > Thanks! > > > > > > > > > > > > > > > > > > > I've checked those DT files and related code. > > > > > > As you already said, this patch is safe for them. > > > > > > So we can apply it IMO. 
> > > > > > > > > > > > > > > > > > -------------Off-topic------------------ > > > > > > Not in $subject comments: > > > > > > > > > > > > AFAICS based on two files which define thermal zones: > > > > > > sc7180-trogdor-homestar.dtsi > > > > > > sc7180-trogdor-coachz.dtsi > > > > > > > > > > > > only the 'big' cores are used as cooling devices in the > > > > > > 'skin_temp_thermal' - the CPU6 and CPU7. > > > > > > > > > > > > I assume you don't want to model at all the power usage > > > > > > from the Little cluster (which is quite big: 6 CPUs), do you? > > > > > > I can see that the Little CPUs have small dyn-power-coeff > > > > > > ~30% of the big and lower max freq, but still might be worth > > > > > > to add them to IPA. You might give them more 'weight', to > > > > > > make sure they receive more power during power split. > > > > > > > > > > > > You also don't have GPU cooling device in that thermal zone. > > > > > > Based on my experience if your GPU is a power hungry one, > > > > > > e.g. 2-4Watts, you might get better results when you model > > > > > > this 'hot' device (which impacts your temp sensor reported value). > > > > > > > > > > I think the two boards you point at (homestar and coachz) are just the > > > > > two that override the default defined in the SoC dtsi file. If you > > > > > look in sc7180.dtsi you'll see 'gpuss1-thermal' which has a cooling > > > > > map. You can also see the cooling maps for the littles. > > > > > > > > > > I guess we don't have a `dynamic-power-coefficient` for the GPU, > > > > > though? Seems like we should, but I haven't dug through all the code > > > > > here... > > > > > > > > The dynamic-power-coefficient is available for OPPs which > > > > includes CPUfreq and devfreq. As the GPU is managed by devfreq, > > > > setting the dynamic-power-coefficient makes the energy model > > > > available for it. > > > > > > > > However, the OPPs must define the frequency and the voltage. 
> > > > That is the case for most platforms except on QCom platform. > > > > > > > > That may not be specified as it uses a frequency index and the > > > > hardware does the voltage change in our back. The QCom cpufreq > > > > backend get the voltage table from a register (or whatever) and > > > > completes the voltage values for the OPPs, thus adding the > > > > information which is missing in the device tree. The energy > > > > model can then initializes itself and allows the usage of the > > > > Energy Aware Scheduler. > > > > > > > > However this piece of code is missing for the GPU part. > > > > > > > > > > Thank you for joining the discussion. I don't know about that Qcom > > > GPU voltage information is missing. > > > > > > If the voltage is not available (only the frequencies), there is > > > another way. There is an 'advanced' EM which uses registration function: > > > em_dev_register_perf_domain(). It uses a local driver callback to get > > > power for each found frequency. It has benefit because there is no > > > restriction to 'fit' into the math formula, instead just avg power > > > values can be feed into EM. It's called 'advanced' EM [1]. > > > > > > Now we hit (again) the DT & EM issue (it's an old one, IIRC Morten > > > was proposing from ~2014 this upstream, but EAS wasn't merged back > > > then): > > > where to store these power-freq values, which are then used by the > > > callback. > > > > Why not make it more generic and replace the frequency by a performance > > index, so it can be used by any kind of perf limiter? > > For that DT array, yes, it can be an index, so effectively it could be > a simple 1d array. > > something like: > > msm_gpu_energy_model: msm-gpu-energy-model { > compatible = "energy-model" > /* Values are sorted micro-Watts which correspond to each OPP > or performance state. The total amount of them must match > number of OPPs. 
*/ > power-microwatt = <100000>, > <230000>, > <380000>, > <600000>; > }; IIUC for the QCOM GPU the voltages/power consumption per OPP aren't fixed but can vary between different SoCs of the same model. If the ranges aren't too wide it might still be suitable to have a table with the average power consumption. Another question is whether QCOM would be willing to provide information about the GPU power consumption. For the SC7180 CPUs they only provided bogoWatt numbers. Once boards with a given SoC/GPU are available to the public someone could come up with such a table based on measurements, similar to what Doug did for the SC7180 CPUs (commit 82ea7d411d43f). > then in gpu node instead of having 'dynamic-power-coefficient', > which is useless because voltage is missing, we would have > 'energy-model', like: > > energy-model = <&msm_gpu_energy_model>; > > > If you agree to continue this topic. I will send an RFC so we could > further discuss this idea. This $subject doesn't fit well. > > Thank you again for your feedback Daniel! Thanks Lukasz and Daniel for looking into this! m.
Hi, On Wed, Feb 16, 2022 at 3:28 PM Lukasz Luba <lukasz.luba@arm.com> wrote: > > On 2/16/22 5:21 PM, Doug Anderson wrote: > > Hi, > > > > On Tue, Feb 8, 2022 at 1:32 AM Lukasz Luba <lukasz.luba@arm.com> wrote: > >> > >>> Another important thing is the consistent scale of the power values > >>> provided by the cooling devices. All of the cooling devices in a single > >>> thermal zone should have power values reported either in milli-Watts > >>> or scaled to the same 'abstract scale'. > >> > >> This can change. We have removed the userspace governor from kernel > >> recently. The trend is to implement thermal policy in FW. Dealing with > >> some intermediate configurations are causing complicated design, support > >> of the algorithm logic is also more complex. > > > > One thing that didn't get addressed is the whole "The trend is to > > implement thermal policy in FW". I'm not sure I can get on board with > > that trend. IMO "moving to FW" isn't a super great trend. FW is harder > > to update than kernel and trying to keep it in sync with the kernel > > isn't wonderful. Unless something _has_ to be in FW I personally > > prefer it to be in the kernel. > > There are pros and cons for both approaches (as always). > > Although, there are some use cases, where the kernel is not able to > react that fast, e.g. sudden power usage changes, which can cause > that the power rail is not able to sustain within required conditions. > When we are talking about tough requirements for those power & thermal > policies, the mechanism must be fast, precised and reliable. > > Here you can find Arm reference FW implementation and an IPA clone > in there (I have been reviewing this) [1][2]. > > As you can see there is a new FW feature set: > "MPMM, Traffic-cop and Thermal management". > > Apart from Arm implementation, there are already known thermal > monitoring mechanisms in HW/FW. Like in the new Qcom SoCs which > are using this driver code [3]. 
The driver receives an interrupt > about throttling conditions and just populates the thermal pressure. Yeah, this has come up in another context recently too. Right on on the Qcom SoCs I'm working with (sc7180 on Chromebooks) we've essentially disabled all the HW/FW throttling (LMH), preferring to let Linux manage things. We chose to do it this way with the assumption that Linux would be able to make better decisions than the firmware and it was easier to understand / update than an opaque vendor-provided blob. LMH is still there with super high limits in case Linux goofs up (we don't want to damage the CPU) but it's not the primary means of throttling. As you said, Linux reacts a bit slower, though I've heard that might be fixed soon-ish? So far on sc7180 Chromebooks it hasn't been a problem because we have more total thermal mass and the CPUs in sc7180 don't actually generate that much heat compared to other CPUs. We also have thermal interrupts enabled, which helps. That being said, improvements are certainly welcome! > > ...although now that I re-read this, I'm not sure which firmware you > > might be talking about. Is this the AP firmware, or some companion > > chip / coprocessor? Even so, I'd still rather see things done in the > > kernel when possible... > > It's a FW run on a dedicated microprocessor. In Arm SoCs it's usually > some Cortex-M. We communicated with it from the kernel via SCMI drivers > (using shared memory and mailboxes). We recommend to use the SCMI > protocol to send e.g. 'performance request' to the FW via 'fast > channel' instead of having an implementation of PMIC and clock, and do > the voltage & freq change in the kernel (using drivers & locking). That > implementation allows to avoid costly locking and allows to go via > that SCMI cpufreq driver [4] and SCMI perf layer [5] the task scheduler. > We don't need a dedicated 'sugov' kthread in a Deadline policy to > do that work and preempt the currently running task. 
> > IMHO the FW approach opens new opportunities. > > Regards, > Lukasz > > [1] https://github.com/ARM-software/SCP-firmware/pull/588 > [2] > https://github.com/ARM-software/SCP-firmware/pull/588/commits/59c62ead5eb66353ae805c367bfa86192e28c410 > [3] > https://elixir.bootlin.com/linux/v5.17-rc4/source/drivers/cpufreq/qcom-cpufreq-hw.c#L287 > [4] > https://elixir.bootlin.com/linux/latest/source/drivers/cpufreq/scmi-cpufreq.c#L65 > [5] > https://elixir.bootlin.com/linux/v5.17-rc4/source/drivers/firmware/arm_scmi/perf.c#L465
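For reference, "populates the thermal pressure" in the Qcom driver discussion above means telling the scheduler how much CPU capacity was lost when HW/FW caps the frequency. A simplified model of that convention (capacity in the usual 0..1024 scale, capacity assumed to scale linearly with frequency) could look like this; it is a sketch of the idea, not the driver's actual code:

```python
def thermal_pressure(max_capacity, max_freq_hz, capped_freq_hz):
    """Capacity lost to HW/FW throttling, as reported to the scheduler.

    pressure = max_capacity - capacity_at_capped_frequency
    """
    capped_capacity = max_capacity * capped_freq_hz // max_freq_hz
    return max_capacity - capped_capacity

# e.g. a CPU of capacity 1024 with fmax 2.0 GHz throttled to 1.5 GHz
pressure = thermal_pressure(1024, 2_000_000_000, 1_500_000_000)
```

The scheduler subtracts this pressure from the CPU's capacity, so tasks are steered away from throttled CPUs even though the throttling decision itself was made in HW/FW.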
On Thu, Feb 17, 2022 at 08:37:39AM -0800, Doug Anderson wrote: > Hi, > > On Thu, Feb 17, 2022 at 2:47 AM Lukasz Luba <lukasz.luba@arm.com> wrote: > > > > Hi Daniel, > > > > On 2/17/22 10:10 AM, Daniel Lezcano wrote: > > > On 16/02/2022 18:33, Doug Anderson wrote: > > >> Hi, > > >> > > >> On Wed, Feb 16, 2022 at 7:35 AM Lukasz Luba <lukasz.luba@arm.com> wrote: > > >>> > > >>> Hi Matthias, > > >>> > > >>> On 2/9/22 10:17 PM, Matthias Kaehlcke wrote: > > >>>> On Wed, Feb 09, 2022 at 11:16:36AM +0000, Lukasz Luba wrote: > > >>>>> > > >>>>> > > >>>>> On 2/8/22 5:25 PM, Matthias Kaehlcke wrote: > > >>>>>> On Tue, Feb 08, 2022 at 09:32:28AM +0000, Lukasz Luba wrote: > > >>>>>>> > > >>>>>>> > > >>> > > >>> [snip] > > >>> > > >>>>>>> Could you point me to those devices please? > > >>>>>> > > >>>>>> arch/arm64/boot/dts/qcom/sc7180-trogdor-* > > >>>>>> > > >>>>>> Though as per above they shouldn't be impacted by your change, > > >>>>>> since the > > >>>>>> CPUs always pretend to use milli-Watts. > > >>>>>> > > >>>>>> [skipped some questions/answers since sc7180 isn't actually > > >>>>>> impacted by > > >>>>>> the change] > > >>>>> > > >>>>> Thank you Matthias. I will investigate your setup to get better > > >>>>> understanding. > > >>>> > > >>>> Thanks! > > >>>> > > >>> > > >>> I've checked those DT files and related code. > > >>> As you already said, this patch is safe for them. > > >>> So we can apply it IMO. > > >>> > > >>> > > >>> -------------Off-topic------------------ > > >>> Not in $subject comments: > > >>> > > >>> AFAICS based on two files which define thermal zones: > > >>> sc7180-trogdor-homestar.dtsi > > >>> sc7180-trogdor-coachz.dtsi > > >>> > > >>> only the 'big' cores are used as cooling devices in the > > >>> 'skin_temp_thermal' - the CPU6 and CPU7. > > >>> > > >>> I assume you don't want to model at all the power usage > > >>> from the Little cluster (which is quite big: 6 CPUs), do you? 
> > >>> I can see that the Little CPUs have small dyn-power-coeff > > >>> ~30% of the big and lower max freq, but still might be worth > > >>> to add them to IPA. You might give them more 'weight', to > > >>> make sure they receive more power during power split. > > >>> > > >>> You also don't have GPU cooling device in that thermal zone. > > >>> Based on my experience if your GPU is a power hungry one, > > >>> e.g. 2-4Watts, you might get better results when you model > > >>> this 'hot' device (which impacts your temp sensor reported value). > > >> > > >> I think the two boards you point at (homestar and coachz) are just the > > >> two that override the default defined in the SoC dtsi file. If you > > >> look in sc7180.dtsi you'll see 'gpuss1-thermal' which has a cooling > > >> map. You can also see the cooling maps for the littles. > > >> > > >> I guess we don't have a `dynamic-power-coefficient` for the GPU, > > >> though? Seems like we should, but I haven't dug through all the code > > >> here... > > > > > > The dynamic-power-coefficient is available for OPPs which includes > > > CPUfreq and devfreq. As the GPU is managed by devfreq, setting the > > > dynamic-power-coefficient makes the energy model available for it. > > > > > > However, the OPPs must define the frequency and the voltage. That is the > > > case for most platforms except on QCom platform. > > > > > > That may not be specified as it uses a frequency index and the hardware > > > does the voltage change in our back. The QCom cpufreq backend get the > > > voltage table from a register (or whatever) and completes the voltage > > > values for the OPPs, thus adding the information which is missing in the > > > device tree. The energy model can then initializes itself and allows the > > > usage of the Energy Aware Scheduler. > > > > > > However this piece of code is missing for the GPU part. > > > > > > > Thank you for joining the discussion. 
I don't know about that Qcom > > GPU voltage information is missing. > > > > If the voltage is not available (only the frequencies), there is > > another way. There is an 'advanced' EM which uses registration function: > > em_dev_register_perf_domain(). It uses a local driver callback to get > > power for each found frequency. It has benefit because there is no > > restriction to 'fit' into the math formula, instead just avg power > > values can be feed into EM. It's called 'advanced' EM [1]. > > It seems like there _should_ be a way to get the voltage out for GPU > operating points, like is done with cpufreq in > qcom_cpufreq_hw_read_lut(), but it might need someone with Qualcomm > documentation to help with it. Maybe Rajendra would be able to help? > Adding Jordon and Rob to this conversation in case they're aware of > anything. > > As you said, we could just list a power for each frequency, though. > > I'm actually not sure which one would be more accurate across a range > of devices with different "corners": specifying a dynamic power > coefficient used for all "corners" and then using the actual voltage > and doing the math, or specifying a power number for each frequency > and ignoring the actual voltage used. In any case we're trying to get > ballpark numbers and not every device will be exactly the same, so > probably it doesn't matter that much. > > > > Now we hit (again) the DT & EM issue (it's an old one, IIRC Morten > > was proposing from ~2014 this upstream, but EAS wasn't merged back > > then): > > where to store these power-freq values, which are then used by the > > callback. We have the 'dynamic-power-coefficient' in DT, but > > it has limitations. It would be good to have this simple array > > attached to the GPU/CPU node. IMHO it meet the requirement of DT, > > it describes the HW (it would have HZ and Watts values). > > > > Doug, Matthias could you have a look at that function and its > > usage, please [1]? 
> > If you guys would support me in this, I would start, with an RFC > > proposal, a discussion on LKML. > > > > [1] > > https://elixir.bootlin.com/linux/v5.17-rc4/source/Documentation/power/energy-model.rst#L87 > > Matthias: I think you've spent more time on the thermal stuff than me > so I'll assume you'll follow-up here. If not then please yell! > > Ideally, though, someone from Qualcomm would jump in an own this. > Basically it allows more intelligently throttling the GPU and CPU > together in tandem instead of treating them separately IIUC, right? Yes, I think for the em_dev_register_perf_domain() route support from Qualcomm would be needed since "Drivers must provide a callback function returning <frequency, power> tuples for each performance state. ".
On 2/17/22 4:50 PM, Doug Anderson wrote: > Hi, > > On Wed, Feb 16, 2022 at 3:28 PM Lukasz Luba <lukasz.luba@arm.com> wrote: >> >> On 2/16/22 5:21 PM, Doug Anderson wrote: >>> Hi, >>> >>> On Tue, Feb 8, 2022 at 1:32 AM Lukasz Luba <lukasz.luba@arm.com> wrote: >>>> >>>>> Another important thing is the consistent scale of the power values >>>>> provided by the cooling devices. All of the cooling devices in a single >>>>> thermal zone should have power values reported either in milli-Watts >>>>> or scaled to the same 'abstract scale'. >>>> >>>> This can change. We have removed the userspace governor from kernel >>>> recently. The trend is to implement thermal policy in FW. Dealing with >>>> some intermediate configurations are causing complicated design, support >>>> of the algorithm logic is also more complex. >>> >>> One thing that didn't get addressed is the whole "The trend is to >>> implement thermal policy in FW". I'm not sure I can get on board with >>> that trend. IMO "moving to FW" isn't a super great trend. FW is harder >>> to update than kernel and trying to keep it in sync with the kernel >>> isn't wonderful. Unless something _has_ to be in FW I personally >>> prefer it to be in the kernel. >> >> There are pros and cons for both approaches (as always). >> >> Although, there are some use cases, where the kernel is not able to >> react that fast, e.g. sudden power usage changes, which can cause >> that the power rail is not able to sustain within required conditions. >> When we are talking about tough requirements for those power & thermal >> policies, the mechanism must be fast, precised and reliable. >> >> Here you can find Arm reference FW implementation and an IPA clone >> in there (I have been reviewing this) [1][2]. >> >> As you can see there is a new FW feature set: >> "MPMM, Traffic-cop and Thermal management". >> >> Apart from Arm implementation, there are already known thermal >> monitoring mechanisms in HW/FW. 
Like in the new Qcom SoCs which >> are using this driver code [3]. The driver receives an interrupt >> about throttling conditions and just populates the thermal pressure. > > Yeah, this has come up in another context recently too. Right on on > the Qcom SoCs I'm working with (sc7180 on Chromebooks) we've > essentially disabled all the HW/FW throttling (LMH), preferring to let > Linux manage things. We chose to do it this way with the assumption > that Linux would be able to make better decisions than the firmware > and it was easier to understand / update than an opaque > vendor-provided blob. LMH is still there with super high limits in > case Linux goofs up (we don't want to damage the CPU) but it's not the > primary means of throttling. > > As you said, Linux reacts a bit slower, though I've heard that might > be fixed soon-ish? So far on sc7180 Chromebooks it hasn't been a > problem because we have more total thermal mass and the CPUs in sc7180 > don't actually generate that much heat compared to other CPUs. We also > have thermal interrupts enabled, which helps. That being said, > improvements are certainly welcome! > Thanks Doug for sharing this with me. I'll keep your requirements, configuration and usage in mind. I've learned recently that some SoCs start throttling very early during boot, even before the cpumask is available. That was causing the kernel to blow up (a dereference of a NULL cpumask pointer in that LMH code [1]). [1] https://lore.kernel.org/linux-pm/20220128032554.155132-2-bjorn.andersson@linaro.org/
On 2/17/22 5:14 PM, Matthias Kaehlcke wrote: > On Thu, Feb 17, 2022 at 08:37:39AM -0800, Doug Anderson wrote: >> Hi, >> >> On Thu, Feb 17, 2022 at 2:47 AM Lukasz Luba <lukasz.luba@arm.com> wrote: >>> >>> Hi Daniel, >>> >>> On 2/17/22 10:10 AM, Daniel Lezcano wrote: >>>> On 16/02/2022 18:33, Doug Anderson wrote: >>>>> Hi, >>>>> >>>>> On Wed, Feb 16, 2022 at 7:35 AM Lukasz Luba <lukasz.luba@arm.com> wrote: >>>>>> >>>>>> Hi Matthias, >>>>>> >>>>>> On 2/9/22 10:17 PM, Matthias Kaehlcke wrote: >>>>>>> On Wed, Feb 09, 2022 at 11:16:36AM +0000, Lukasz Luba wrote: >>>>>>>> >>>>>>>> >>>>>>>> On 2/8/22 5:25 PM, Matthias Kaehlcke wrote: >>>>>>>>> On Tue, Feb 08, 2022 at 09:32:28AM +0000, Lukasz Luba wrote: >>>>>>>>>> >>>>>>>>>> >>>>>> >>>>>> [snip] >>>>>> >>>>>>>>>> Could you point me to those devices please? >>>>>>>>> >>>>>>>>> arch/arm64/boot/dts/qcom/sc7180-trogdor-* >>>>>>>>> >>>>>>>>> Though as per above they shouldn't be impacted by your change, >>>>>>>>> since the >>>>>>>>> CPUs always pretend to use milli-Watts. >>>>>>>>> >>>>>>>>> [skipped some questions/answers since sc7180 isn't actually >>>>>>>>> impacted by >>>>>>>>> the change] >>>>>>>> >>>>>>>> Thank you Matthias. I will investigate your setup to get better >>>>>>>> understanding. >>>>>>> >>>>>>> Thanks! >>>>>>> >>>>>> >>>>>> I've checked those DT files and related code. >>>>>> As you already said, this patch is safe for them. >>>>>> So we can apply it IMO. >>>>>> >>>>>> >>>>>> -------------Off-topic------------------ >>>>>> Not in $subject comments: >>>>>> >>>>>> AFAICS based on two files which define thermal zones: >>>>>> sc7180-trogdor-homestar.dtsi >>>>>> sc7180-trogdor-coachz.dtsi >>>>>> >>>>>> only the 'big' cores are used as cooling devices in the >>>>>> 'skin_temp_thermal' - the CPU6 and CPU7. >>>>>> >>>>>> I assume you don't want to model at all the power usage >>>>>> from the Little cluster (which is quite big: 6 CPUs), do you? 
>>>>>> I can see that the Little CPUs have small dyn-power-coeff
>>>>>> (~30% of the big cores') and a lower max freq, but it still might
>>>>>> be worth adding them to IPA. You might give them more 'weight', to
>>>>>> make sure they receive more power during the power split.
>>>>>>
>>>>>> You also don't have a GPU cooling device in that thermal zone.
>>>>>> Based on my experience, if your GPU is a power-hungry one,
>>>>>> e.g. 2-4 Watts, you might get better results when you model
>>>>>> this 'hot' device (which impacts your temp sensor reported value).
>>>>>
>>>>> I think the two boards you point at (homestar and coachz) are just the
>>>>> two that override the default defined in the SoC dtsi file. If you
>>>>> look in sc7180.dtsi you'll see 'gpuss1-thermal' which has a cooling
>>>>> map. You can also see the cooling maps for the littles.
>>>>>
>>>>> I guess we don't have a `dynamic-power-coefficient` for the GPU,
>>>>> though? Seems like we should, but I haven't dug through all the code
>>>>> here...
>>>>
>>>> The dynamic-power-coefficient is available for OPPs, which includes
>>>> CPUfreq and devfreq. As the GPU is managed by devfreq, setting the
>>>> dynamic-power-coefficient makes the energy model available for it.
>>>>
>>>> However, the OPPs must define the frequency and the voltage. That is the
>>>> case for most platforms, except on QCom platforms.
>>>>
>>>> There the voltage may not be specified, as the OPPs use a frequency
>>>> index and the hardware does the voltage change behind our back. The
>>>> QCom cpufreq backend gets the voltage table from a register (or
>>>> whatever) and completes the voltage values for the OPPs, thus adding
>>>> the information which is missing in the device tree. The energy model
>>>> can then initialize itself and allow the usage of the Energy Aware
>>>> Scheduler.
>>>>
>>>> However, this piece of code is missing for the GPU part.
>>>>
>>>
>>> Thank you for joining the discussion. I didn't know that the Qcom
>>> GPU voltage information is missing.
>>>
>>> If the voltage is not available (only the frequencies), there is
>>> another way. There is an 'advanced' EM which uses the registration
>>> function em_dev_register_perf_domain(). It uses a local driver
>>> callback to get the power for each discovered frequency. It has the
>>> benefit that there is no restriction to 'fit' into the math formula;
>>> instead, just avg power values can be fed into the EM. It's called
>>> 'advanced' EM [1].
>>
>> It seems like there _should_ be a way to get the voltage out for GPU
>> operating points, like is done with cpufreq in
>> qcom_cpufreq_hw_read_lut(), but it might need someone with Qualcomm
>> documentation to help with it. Maybe Rajendra would be able to help?
>> Adding Jordon and Rob to this conversation in case they're aware of
>> anything.
>>
>> As you said, we could just list a power for each frequency, though.
>>
>> I'm actually not sure which one would be more accurate across a range
>> of devices with different "corners": specifying a dynamic power
>> coefficient used for all "corners" and then using the actual voltage
>> and doing the math, or specifying a power number for each frequency
>> and ignoring the actual voltage used. In any case we're trying to get
>> ballpark numbers and not every device will be exactly the same, so
>> probably it doesn't matter that much.
>>
>>
>>> Now we hit (again) the DT & EM issue (it's an old one; IIRC Morten
>>> was proposing this upstream from ~2014, but EAS wasn't merged back
>>> then): where to store these power-freq values, which are then used by
>>> the callback. We have the 'dynamic-power-coefficient' in DT, but
>>> it has limitations. It would be good to have this simple array
>>> attached to the GPU/CPU node. IMHO it meets the requirements of DT:
>>> it describes the HW (it would have Hz and Watts values).
>>>
>>> Doug, Matthias, could you have a look at that function and its
>>> usage, please [1]?
>>> If you guys would support me in this, I would start a discussion on
>>> LKML with an RFC proposal.
>>>
>>> [1]
>>> https://elixir.bootlin.com/linux/v5.17-rc4/source/Documentation/power/energy-model.rst#L87
>>
>> Matthias: I think you've spent more time on the thermal stuff than me
>> so I'll assume you'll follow-up here. If not then please yell!
>>
>> Ideally, though, someone from Qualcomm would jump in and own this.
>> Basically it allows more intelligently throttling the GPU and CPU
>> together in tandem instead of treating them separately IIUC, right?
>
> Yes, I think for the em_dev_register_perf_domain() route support from
> Qualcomm would be needed since "Drivers must provide a callback
> function returning <frequency, power> tuples for each performance
> state. ".

Not necessarily. It might be done 'generically' by the framework. There
are other benefits of this 'energy-model' entry in the DT. I'll list
them in the cover letter. Let me send an RFC, so we can discuss it
there.

Thanks guys!

Regards,
Lukasz
Hi Daniel,

On 2/7/22 7:30 AM, Lukasz Luba wrote:
> The Energy Model supports power values either in Watts or in some abstract
> scale. When the 2nd option is in use, the thermal governor IPA should not
> be allowed to operate, since the relation between cooling devices is not
> properly defined. Thus, it might be possible that big GPU has lower power
> values in abstract scale than a Little CPU. To mitigate a misbehaviour
> of the thermal control algorithm, simply not register a cooling device
> capable of working with IPA.
>
> Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
> ---
>  drivers/thermal/cpufreq_cooling.c |  2 +-
>  drivers/thermal/devfreq_cooling.c | 16 +++++++++++++---
>  2 files changed, 14 insertions(+), 4 deletions(-)

The discussion below this patch went slightly off-topic, but it was
valuable. It also clarified that no existing platforms are broken by
this change.

Could you take the patch into the thermal tree, please?

Regards,
Lukasz
Adding Manaf (QCom) in the loop On 17/02/2022 17:37, Doug Anderson wrote: > Hi, > > On Thu, Feb 17, 2022 at 2:47 AM Lukasz Luba <lukasz.luba@arm.com> wrote: >> >> Hi Daniel, >> >> On 2/17/22 10:10 AM, Daniel Lezcano wrote: >>> On 16/02/2022 18:33, Doug Anderson wrote: >>>> Hi, >>>> >>>> On Wed, Feb 16, 2022 at 7:35 AM Lukasz Luba <lukasz.luba@arm.com> wrote: >>>>> >>>>> Hi Matthias, >>>>> >>>>> On 2/9/22 10:17 PM, Matthias Kaehlcke wrote: >>>>>> On Wed, Feb 09, 2022 at 11:16:36AM +0000, Lukasz Luba wrote: >>>>>>> >>>>>>> >>>>>>> On 2/8/22 5:25 PM, Matthias Kaehlcke wrote: >>>>>>>> On Tue, Feb 08, 2022 at 09:32:28AM +0000, Lukasz Luba wrote: >>>>>>>>> >>>>>>>>> >>>>> >>>>> [snip] >>>>> >>>>>>>>> Could you point me to those devices please? >>>>>>>> >>>>>>>> arch/arm64/boot/dts/qcom/sc7180-trogdor-* >>>>>>>> >>>>>>>> Though as per above they shouldn't be impacted by your change, >>>>>>>> since the >>>>>>>> CPUs always pretend to use milli-Watts. >>>>>>>> >>>>>>>> [skipped some questions/answers since sc7180 isn't actually >>>>>>>> impacted by >>>>>>>> the change] >>>>>>> >>>>>>> Thank you Matthias. I will investigate your setup to get better >>>>>>> understanding. >>>>>> >>>>>> Thanks! >>>>>> >>>>> >>>>> I've checked those DT files and related code. >>>>> As you already said, this patch is safe for them. >>>>> So we can apply it IMO. >>>>> >>>>> >>>>> -------------Off-topic------------------ >>>>> Not in $subject comments: >>>>> >>>>> AFAICS based on two files which define thermal zones: >>>>> sc7180-trogdor-homestar.dtsi >>>>> sc7180-trogdor-coachz.dtsi >>>>> >>>>> only the 'big' cores are used as cooling devices in the >>>>> 'skin_temp_thermal' - the CPU6 and CPU7. >>>>> >>>>> I assume you don't want to model at all the power usage >>>>> from the Little cluster (which is quite big: 6 CPUs), do you? >>>>> I can see that the Little CPUs have small dyn-power-coeff >>>>> ~30% of the big and lower max freq, but still might be worth >>>>> to add them to IPA. 
You might give them more 'weight', to
>>>>> make sure they receive more power during the power split.
>>>>>
>>>>> You also don't have a GPU cooling device in that thermal zone.
>>>>> Based on my experience, if your GPU is a power-hungry one,
>>>>> e.g. 2-4 Watts, you might get better results when you model
>>>>> this 'hot' device (which impacts your temp sensor reported value).
>>>>
>>>> I think the two boards you point at (homestar and coachz) are just the
>>>> two that override the default defined in the SoC dtsi file. If you
>>>> look in sc7180.dtsi you'll see 'gpuss1-thermal' which has a cooling
>>>> map. You can also see the cooling maps for the littles.
>>>>
>>>> I guess we don't have a `dynamic-power-coefficient` for the GPU,
>>>> though? Seems like we should, but I haven't dug through all the code
>>>> here...
>>>
>>> The dynamic-power-coefficient is available for OPPs, which includes
>>> CPUfreq and devfreq. As the GPU is managed by devfreq, setting the
>>> dynamic-power-coefficient makes the energy model available for it.
>>>
>>> However, the OPPs must define the frequency and the voltage. That is the
>>> case for most platforms, except on QCom platforms.
>>>
>>> There the voltage may not be specified, as the OPPs use a frequency
>>> index and the hardware does the voltage change behind our back. The
>>> QCom cpufreq backend gets the voltage table from a register (or
>>> whatever) and completes the voltage values for the OPPs, thus adding
>>> the information which is missing in the device tree. The energy model
>>> can then initialize itself and allow the usage of the Energy Aware
>>> Scheduler.
>>>
>>> However, this piece of code is missing for the GPU part.
>>>
>>
>> Thank you for joining the discussion. I didn't know that the Qcom
>> GPU voltage information is missing.
>>
>> If the voltage is not available (only the frequencies), there is
>> another way. There is an 'advanced' EM which uses the registration
>> function em_dev_register_perf_domain().
It uses a local driver callback to get
>> the power for each discovered frequency. It has the benefit that
>> there is no restriction to 'fit' into the math formula; instead,
>> just avg power values can be fed into the EM. It's called 'advanced'
>> EM [1].
>
> It seems like there _should_ be a way to get the voltage out for GPU
> operating points, like is done with cpufreq in
> qcom_cpufreq_hw_read_lut(), but it might need someone with Qualcomm
> documentation to help with it. Maybe Rajendra would be able to help?
> Adding Jordon and Rob to this conversation in case they're aware of
> anything.
>
> As you said, we could just list a power for each frequency, though.
>
> I'm actually not sure which one would be more accurate across a range
> of devices with different "corners": specifying a dynamic power
> coefficient used for all "corners" and then using the actual voltage
> and doing the math, or specifying a power number for each frequency
> and ignoring the actual voltage used. In any case we're trying to get
> ballpark numbers and not every device will be exactly the same, so
> probably it doesn't matter that much.
>
>
>> Now we hit (again) the DT & EM issue (it's an old one; IIRC Morten
>> was proposing this upstream from ~2014, but EAS wasn't merged back
>> then): where to store these power-freq values, which are then used by
>> the callback. We have the 'dynamic-power-coefficient' in DT, but
>> it has limitations. It would be good to have this simple array
>> attached to the GPU/CPU node. IMHO it meets the requirements of DT:
>> it describes the HW (it would have Hz and Watts values).
>>
>> Doug, Matthias, could you have a look at that function and its
>> usage, please [1]?
>> If you guys would support me in this, I would start a discussion on
>> LKML with an RFC proposal.
>>
>> [1]
>> https://elixir.bootlin.com/linux/v5.17-rc4/source/Documentation/power/energy-model.rst#L87
>
> Matthias: I think you've spent more time on the thermal stuff than me
> so I'll assume you'll follow-up here. If not then please yell!
>
> Ideally, though, someone from Qualcomm would jump in and own this.
> Basically it allows more intelligently throttling the GPU and CPU
> together in tandem instead of treating them separately IIUC, right?
>
> -Doug
Hi Daniel,

gentle ping

On 2/17/22 18:18, Lukasz Luba wrote:
> Hi Daniel,
>
>
> On 2/7/22 7:30 AM, Lukasz Luba wrote:
>> The Energy Model supports power values either in Watts or in some
>> abstract
>> scale. When the 2nd option is in use, the thermal governor IPA should not
>> be allowed to operate, since the relation between cooling devices is not
>> properly defined. Thus, it might be possible that big GPU has lower power
>> values in abstract scale than a Little CPU. To mitigate a misbehaviour
>> of the thermal control algorithm, simply not register a cooling device
>> capable of working with IPA.
>>
>> Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
>> ---
>>   drivers/thermal/cpufreq_cooling.c |  2 +-
>>   drivers/thermal/devfreq_cooling.c | 16 +++++++++++++---
>>   2 files changed, 14 insertions(+), 4 deletions(-)
>
> The discussion below this patch went slightly off-topic, but it was
> valuable. It also clarified that no existing platforms are broken by
> this change.
>
> Could you take the patch into the thermal tree, please?
>
> Regards,
> Lukasz
Hi Lukasz,

I don't think it makes sense to remove the support of the energy model
if the units are abstract.

IIUC, regarding your previous answer, we don't really know what SoC
vendors will do with these numbers, and likely they will provide
consistent abstract values which won't prevent a correct behavior.

What would be the benefit of giving inconsistent abstract values which
will be unusable, except for giving a broken energy model?

Your proposed changes would be acceptable if the energy model had a
'broken' flag, IMO.

On 22/02/2022 18:05, Lukasz Luba wrote:
> Hi Daniel,
>
> gentle ping
>
> On 2/17/22 18:18, Lukasz Luba wrote:
>> Hi Daniel,
>>
>>
>> On 2/7/22 7:30 AM, Lukasz Luba wrote:
>>> The Energy Model supports power values either in Watts or in some
>>> abstract
>>> scale. When the 2nd option is in use, the thermal governor IPA should
>>> not
>>> be allowed to operate, since the relation between cooling devices is not
>>> properly defined. Thus, it might be possible that big GPU has lower
>>> power
>>> values in abstract scale than a Little CPU. To mitigate a misbehaviour
>>> of the thermal control algorithm, simply not register a cooling device
>>> capable of working with IPA.
>>>
>>> Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
>>> ---
>>>   drivers/thermal/cpufreq_cooling.c |  2 +-
>>>   drivers/thermal/devfreq_cooling.c | 16 +++++++++++++---
>>>   2 files changed, 14 insertions(+), 4 deletions(-)
>>
>> The discussion below this patch went slightly off-topic, but it was
>> valuable. It also clarified that no existing platforms are broken by
>> this change.
>>
>> Could you take the patch into the thermal tree, please?
>>
>> Regards,
>> Lukasz
On 2/22/22 18:12, Daniel Lezcano wrote:
>
> Hi Lukasz,
>
> I don't think it makes sense to remove the support of the energy model
> if the units are abstract.
>
> IIUC, regarding your previous answer, we don't really know what SoC
> vendors will do with these numbers, and likely they will provide
> consistent abstract values which won't prevent a correct behavior.
>
> What would be the benefit of giving inconsistent abstract values which
> will be unusable, except for giving a broken energy model?

The power values in an EM which uses an abstract scale would make sense
to EAS, but not to IPA or DTPM. Those platforms which want to enable
EAS, but don't need IPA, would register such an '<a_good_name_here>' EM.

>
> Your proposed changes would be acceptable if the energy model had a
> 'broken' flag, IMO.

That is doable. I can add that flag, so we can call it an 'artificial'
EM (when this new flag is set).

Let me craft the RFC patch with this new flag proposal then.
Do you agree? Can I also add you as 'Suggested-by'?

Thank you for coming back to me with the comments.
Hi Lukasz,

On 22/02/2022 19:31, Lukasz Luba wrote:
>
>
> On 2/22/22 18:12, Daniel Lezcano wrote:
>>
>> Hi Lukasz,
>>
>> I don't think it makes sense to remove the support of the energy model
>> if the units are abstract.
>>
>> IIUC, regarding your previous answer, we don't really know what SoC
>> vendors will do with these numbers, and likely they will provide
>> consistent abstract values which won't prevent a correct behavior.
>>
>> What would be the benefit of giving inconsistent abstract values which
>> will be unusable, except for giving a broken energy model?
>
> The power values in an EM which uses an abstract scale would make sense
> to EAS, but not to IPA or DTPM. Those platforms which want to enable
> EAS, but don't need IPA, would register such an '<a_good_name_here>' EM.

Sorry, but I don't understand why DTPM cannot deal with abstract
values?

>> Your proposed changes would be acceptable if the energy model had a
>> 'broken' flag, IMO.
>
> That is doable. I can add that flag, so we can call it an 'artificial'
> EM (when this new flag is set).

It is too soon IMO; I would like to see the numbers first so we can
take an enlightened decision. Right now, it is unclear what the numbers
will be.

> Let me craft the RFC patch with this new flag proposal then.
> Do you agree? Can I also add you as 'Suggested-by'?
>
> Thank you for coming back to me with the comments.
On 2/22/22 22:10, Daniel Lezcano wrote:
>
> Hi Lukasz,
>
> On 22/02/2022 19:31, Lukasz Luba wrote:
>>
>>
>> On 2/22/22 18:12, Daniel Lezcano wrote:
>>>
>>> Hi Lukasz,
>>>
>>> I don't think it makes sense to remove the support of the energy
>>> model if the units are abstract.
>>>
>>> IIUC, regarding your previous answer, we don't really know what SoC
>>> vendors will do with these numbers, and likely they will provide
>>> consistent abstract values which won't prevent a correct behavior.
>>>
>>> What would be the benefit of giving inconsistent abstract values
>>> which will be unusable, except for giving a broken energy model?
>>
>> The power values in an EM which uses an abstract scale would make sense
>> to EAS, but not to IPA or DTPM. Those platforms which want to enable
>> EAS, but don't need IPA, would register such an '<a_good_name_here>' EM.
>
> Sorry, but I don't understand why DTPM cannot deal with abstract values?

They will be totally meaningless/bogus.

>
>>> Your proposed changes would be acceptable if the energy model had a
>>> 'broken' flag, IMO.
>>
>> That is doable. I can add that flag, so we can call it an 'artificial'
>> EM (when this new flag is set).
>
> It is too soon IMO; I would like to see the numbers first so we can
> take an enlightened decision. Right now, it is unclear what the numbers
> will be.

We are going to add new support from our roadmap for platforms which
don't have this power information but are going to use EAS. I'm going
to send some patches soon which create that support. Pierre is going to
send the platform code. I want to make sure that such platforms won't
register power actors for IPA. Other thermal governors will work, since
they don't use the EM for making decisions.
diff --git a/drivers/thermal/cpufreq_cooling.c b/drivers/thermal/cpufreq_cooling.c
index 43b1ae8a7789..f831ed40b333 100644
--- a/drivers/thermal/cpufreq_cooling.c
+++ b/drivers/thermal/cpufreq_cooling.c
@@ -328,7 +328,7 @@ static inline bool em_is_sane(struct cpufreq_cooling_device *cpufreq_cdev,
 	struct cpufreq_policy *policy;
 	unsigned int nr_levels;
 
-	if (!em)
+	if (!em || !(em->flags & EM_PERF_DOMAIN_MILLIWATTS))
 		return false;
 
 	policy = cpufreq_cdev->policy;
diff --git a/drivers/thermal/devfreq_cooling.c b/drivers/thermal/devfreq_cooling.c
index 4310cb342a9f..7e8bd1368cab 100644
--- a/drivers/thermal/devfreq_cooling.c
+++ b/drivers/thermal/devfreq_cooling.c
@@ -336,6 +336,14 @@ static int devfreq_cooling_gen_tables(struct devfreq_cooling_device *dfc,
 	return 0;
 }
 
+static inline bool em_is_sane(struct em_perf_domain *em)
+{
+	if (!em || !(em->flags & EM_PERF_DOMAIN_MILLIWATTS))
+		return false;
+	else
+		return true;
+}
+
 /**
  * of_devfreq_cooling_register_power() - Register devfreq cooling device,
  *		with OF and power information.
@@ -358,6 +366,7 @@ of_devfreq_cooling_register_power(struct device_node *np, struct devfreq *df,
 	struct thermal_cooling_device *cdev;
 	struct device *dev = df->dev.parent;
 	struct devfreq_cooling_device *dfc;
+	struct em_perf_domain *em;
 	char *name;
 	int err, num_opps;
 
@@ -367,8 +376,9 @@ of_devfreq_cooling_register_power(struct device_node *np, struct devfreq *df,
 
 	dfc->devfreq = df;
 
-	dfc->em_pd = em_pd_get(dev);
-	if (dfc->em_pd) {
+	em = em_pd_get(dev);
+	if (em_is_sane(em)) {
+		dfc->em_pd = em;
 		devfreq_cooling_ops.get_requested_power =
 			devfreq_cooling_get_requested_power;
 		devfreq_cooling_ops.state2power = devfreq_cooling_state2power;
@@ -379,7 +389,7 @@ of_devfreq_cooling_register_power(struct device_node *np, struct devfreq *df,
 		num_opps = em_pd_nr_perf_states(dfc->em_pd);
 	} else {
 		/* Backward compatibility for drivers which do not use IPA */
-		dev_dbg(dev, "missing EM for cooling device\n");
+		dev_dbg(dev, "missing proper EM for cooling device\n");
 
 		num_opps = dev_pm_opp_get_opp_count(dev);
The Energy Model supports power values either in Watts or in some
abstract scale. When the 2nd option is in use, the thermal governor IPA
should not be allowed to operate, since the relation between cooling
devices is not properly defined: a big GPU might have lower power
values in the abstract scale than a little CPU. To mitigate a
misbehaviour of the thermal control algorithm, simply do not register a
cooling device capable of working with IPA.

Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
---
 drivers/thermal/cpufreq_cooling.c |  2 +-
 drivers/thermal/devfreq_cooling.c | 16 +++++++++++++---
 2 files changed, 14 insertions(+), 4 deletions(-)