Message ID: 20180320094312.24081-3-dietmar.eggemann@arm.com (mailing list archive)
State: RFC, archived
On Tue, Mar 20, 2018 at 09:43:08AM +0000, Dietmar Eggemann wrote:
> From: Quentin Perret <quentin.perret@arm.com>
>
> The energy consumption of each CPU in the system is modeled with a list
> of values representing its dissipated power and compute capacity at each
> available Operating Performance Point (OPP). These values are derived
> from existing information in the kernel (currently used by the thermal
> subsystem) and don't require the introduction of new platform-specific
> tunables. The energy model is also provided with a simple representation
> of all frequency domains as cpumasks, hence enabling the scheduler to be
> aware of dependencies between CPUs. The data required to build the energy
> model is provided by the OPP library which enables an abstract view of
> the platform from the scheduler. The new data structures holding these
> models and the routines to populate them are stored in
> kernel/sched/energy.c.
>
> For the sake of simplicity, it is assumed in the energy model that all
> CPUs in a frequency domain share the same micro-architecture. As long as
> this assumption is correct, the energy models of different CPUs belonging
> to the same frequency domain are equal. Hence, this commit builds only one
> energy model per frequency domain, and links all relevant CPUs to it in
> order to save time and memory. If needed for future hardware platforms,
> relaxing this assumption should imply relatively simple modifications in
> the code but a significantly higher algorithmic complexity.
>
> As it appears that energy-aware scheduling really makes a difference on
> heterogeneous systems (e.g. big.LITTLE platforms), it is restricted to
> systems having:
>
>    1. SD_ASYM_CPUCAPACITY flag set
>    2. Dynamic Voltage and Frequency Scaling (DVFS) is enabled
>    3. Available power estimates for the OPPs of all possible CPUs
>
> Moreover, the scheduler is notified of the energy model availability
> using a static key in order to minimize the overhead on non-energy-aware
> systems.
>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Signed-off-by: Quentin Perret <quentin.perret@arm.com>
> Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
>
> ---
> This patch depends on additional infrastructure being merged in the OPP
> core. As this infrastructure can also be useful for other clients, the
> related patches have been posted separately [1].
>
> [1] https://marc.info/?l=linux-pm&m=151635516419249&w=2
> ---
>  include/linux/sched/energy.h |  31 +++++++
>  kernel/sched/Makefile        |   2 +-
>  kernel/sched/energy.c        | 190 +++++++++++++++++++++++++++++++++++++++++++
>  3 files changed, 222 insertions(+), 1 deletion(-)
>  create mode 100644 include/linux/sched/energy.h
>  create mode 100644 kernel/sched/energy.c
>
> diff --git a/include/linux/sched/energy.h b/include/linux/sched/energy.h
> new file mode 100644
> index 000000000000..b4f43564ffe4
> --- /dev/null
> +++ b/include/linux/sched/energy.h
> @@ -0,0 +1,31 @@
> +#ifndef _LINUX_SCHED_ENERGY_H
> +#define _LINUX_SCHED_ENERGY_H

No copyright or license info? Not good :(

> --- /dev/null
> +++ b/kernel/sched/energy.c
> @@ -0,0 +1,190 @@
> +/*
> + * Released under the GPLv2 only.
> + * SPDX-License-Identifier: GPL-2.0

Please read the documentation for the SPDX lines on how to do them
correctly. Newer versions of checkpatch.pl will catch this, but that is
in linux-next for the moment.

And once you have the SPDX line, the "Released under..." line is not
needed.

> + *
> + * Energy-aware scheduling models
> + *
> + * Copyright (C) 2018, Arm Ltd.
> + * Written by: Quentin Perret, Arm Ltd.
> + *
> + * This file is subject to the terms and conditions of the GNU General Public
> + * License. See the file "COPYING" in the main directory of this archive
> + * for more details.

This paragraph is not needed at all.

> + */
> +
> +#define pr_fmt(fmt) "sched-energy: " fmt
> +
> +#include <linux/sched/topology.h>
> +#include <linux/sched/energy.h>
> +#include <linux/pm_opp.h>
> +
> +#include "sched.h"
> +
> +DEFINE_STATIC_KEY_FALSE(sched_energy_present);
> +struct sched_energy_model ** __percpu energy_model;
> +
> +/*
> + * A copy of the cpumasks representing the frequency domains is kept private
> + * to the scheduler. They are stacked in a dynamically allocated linked list
> + * as we don't know how many frequency domains the system has.
> + */
> +LIST_HEAD(freq_domains);

global variable? If so, please prefix it with something more unique
than "freq_".

> +#ifdef CONFIG_PM_OPP

#ifdefs go in .h files, not .c files, right?

thanks,

greg k-h
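For reference, the convention Greg is pointing at (documented in
Documentation/process/license-rules.rst) puts the SPDX tag on the very
first line of the file, using // comment style in .c files and /* */ in
headers, and replaces the license boilerplate paragraphs entirely. A
fixed-up header for energy.c would look something like:

        // SPDX-License-Identifier: GPL-2.0
        /*
         * Energy-aware scheduling models
         *
         * Copyright (C) 2018, Arm Ltd.
         * Written by: Quentin Perret, Arm Ltd.
         */

with a matching /* SPDX-License-Identifier: GPL-2.0 */ as the first line
of the new include/linux/sched/energy.h.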
On Tuesday 20 Mar 2018 at 10:52:15 (+0100), Greg Kroah-Hartman wrote:
> On Tue, Mar 20, 2018 at 09:43:08AM +0000, Dietmar Eggemann wrote:
> > From: Quentin Perret <quentin.perret@arm.com>

[...]

> > +#ifndef _LINUX_SCHED_ENERGY_H
> > +#define _LINUX_SCHED_ENERGY_H
>
> No copyright or license info? Not good :(
>
> > --- /dev/null
> > +++ b/kernel/sched/energy.c
> > @@ -0,0 +1,190 @@
> > +/*
> > + * Released under the GPLv2 only.
> > + * SPDX-License-Identifier: GPL-2.0
>
> Please read the documentation for the SPDX lines on how to do them
> correctly. Newer versions of checkpatch.pl will catch this, but that is
> in linux-next for the moment.
>
> And once you have the SPDX line, the "Released under..." line is not
> needed.
>
> > + * This file is subject to the terms and conditions of the GNU General Public
> > + * License. See the file "COPYING" in the main directory of this archive
> > + * for more details.
>
> This paragraph is not needed at all.

Right, I will fix all the license issues and add one to the new header
file. I modelled these on existing files a while ago when I first wrote
the patches and forgot to update them later on. Sorry about that.

> > +/*
> > + * A copy of the cpumasks representing the frequency domains is kept private
> > + * to the scheduler. They are stacked in a dynamically allocated linked list
> > + * as we don't know how many frequency domains the system has.
> > + */
> > +LIST_HEAD(freq_domains);
>
> global variable? If so, please prefix it with something more unique
> than "freq_".

Will do.

> > +#ifdef CONFIG_PM_OPP
>
> #ifdefs go in .h files, not .c files, right?

Yes, good point. Actually, I might be able to tweak only
kernel/sched/Makefile to ensure we have CONFIG_PM_OPP. I will look into
it.

> thanks,
>
> greg k-h

Thanks,
Quentin
On Tuesday 20 Mar 2018 at 10:52:15 (+0100), Greg Kroah-Hartman wrote:
> On Tue, Mar 20, 2018 at 09:43:08AM +0000, Dietmar Eggemann wrote:
> > From: Quentin Perret <quentin.perret@arm.com>

[...]

> > +#ifdef CONFIG_PM_OPP
>
> #ifdefs go in .h files, not .c files, right?

So, after looking into this, my suggestion would be to:

   1) remove the #ifdef CONFIG_PM_OPP from energy.c entirely;
   2) make sure init_sched_energy() is stubbed properly for !CONFIG_SMP
      and !CONFIG_PM_OPP in include/linux/sched/energy.h;
   3) relocate the global variables (energy_model, freq_domains, ...)
      to fair.c; and
   4) modify kernel/sched/Makefile with something like:

      ifeq ($(CONFIG_PM_OPP),y)
      obj-$(CONFIG_SMP) += energy.o
      endif

That way, energy.c is not compiled if not needed by the arch, and the
ifdefs are kept within header files and Makefiles.

Would that work?

Thanks,
Quentin
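For illustration, a minimal sketch of what the stub in point 2) could
look like under that scheme -- with the real declarations sitting behind
the same guard, init_sched_energy() collapses to a no-op whenever
energy.o is not built:

        #ifndef _LINUX_SCHED_ENERGY_H
        #define _LINUX_SCHED_ENERGY_H

        #if defined(CONFIG_SMP) && defined(CONFIG_PM_OPP)
        /* struct sched_energy_model, struct freq_domain, externs, ... */
        void init_sched_energy(void);
        #else
        static inline void init_sched_energy(void) { }
        #endif

        #endif /* _LINUX_SCHED_ENERGY_H */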
On Tue, Mar 20, 2018 at 09:43:08AM +0000, Dietmar Eggemann wrote:
> From: Quentin Perret <quentin.perret@arm.com>
>
> The energy consumption of each CPU in the system is modeled with a list
> of values representing its dissipated power and compute capacity at each
> available Operating Performance Point (OPP). These values are derived
> from existing information in the kernel (currently used by the thermal
> subsystem) and don't require the introduction of new platform-specific
> tunables. The energy model is also provided with a simple representation
> of all frequency domains as cpumasks, hence enabling the scheduler to be
> aware of dependencies between CPUs. The data required to build the energy
> model is provided by the OPP library which enables an abstract view of
> the platform from the scheduler. The new data structures holding these
> models and the routines to populate them are stored in
> kernel/sched/energy.c.
>
> For the sake of simplicity, it is assumed in the energy model that all
> CPUs in a frequency domain share the same micro-architecture. As long as
> this assumption is correct, the energy models of different CPUs belonging
> to the same frequency domain are equal. Hence, this commit builds only one
> energy model per frequency domain, and links all relevant CPUs to it in
> order to save time and memory. If needed for future hardware platforms,
> relaxing this assumption should imply relatively simple modifications in
> the code but a significantly higher algorithmic complexity.

What this doesn't mention is why this isn't part of the regular topology
bits. IIRC this is because the frequency domains don't necessarily need
to align with the existing topology, but this completely fails to state
any of that.

Also, since I'm not at all familiar with DT and the OPP library stuff,
this code is completely unreadable to me and there isn't a nice comment
to help me along.
On Monday 09 Apr 2018 at 14:01:11 (+0200), Peter Zijlstra wrote:
> On Tue, Mar 20, 2018 at 09:43:08AM +0000, Dietmar Eggemann wrote:
> > From: Quentin Perret <quentin.perret@arm.com>

[...]

> What this doesn't mention is why this isn't part of the regular topology
> bits. IIRC this is because the frequency domains don't necessarily need
> to align with the existing topology, but this completely fails to state
> any of that.

Yes, that's the main reason. Frequency domains and scheduling domains
don't necessarily align. That used to be the case for big.LITTLE
platforms, but not anymore with DynamIQ ...

> Also, since I'm not at all familiar with DT and the OPP library stuff,
> this code is completely unreadable to me and there isn't a nice comment
> to help me along.

Right, so I can definitely fix that. Comments in the code and a better
commit message should hopefully help. It has also already been suggested
that a documentation file should be added alongside the code for this
patchset, so I'll make sure we add that for the next version. In the
meantime, here is a (hopefully) better explanation below.

In this specific patch, we are basically trying to figure out the
boundaries of frequency domains, and the power consumed by each CPU at
each OPP, to make them available to the scheduler. The important thing
here is that, in both cases, we rely on the OPP library to keep the code
as platform-agnostic as possible.

In the case of the frequency domains, for example, the cpufreq driver is
in charge of specifying the CPUs that are sharing frequencies. That
information can come from DT, or SCPI, or SCMI, or whatever -- we
probably shouldn't have to care about that from the scheduler's
standpoint. That's why using dev_pm_opp_get_sharing_cpus() is handy: the
OPP library gives us the digested information we need.

The power values (dev_pm_opp_get_power) we use right now are those
already used by the thermal subsystem (IPA), which means we don't have
to introduce any new DT binding whatsoever. In the near future, the
power values could also come from other sources (SCMI for example), and
again it's probably not the scheduler's job to care about those things,
so the OPP library is helping us again. As mentioned in the notes, as of
today, this approach has dependencies on other patches relating to these
things, which are already on the list [1].

The rest of the code in this patch is just about iterating over the
CPUs/freq. domains/OPPs. The algorithm is more or less the following:

   1. find a frequency domain which hasn't been visited yet;
   2. estimate the power and capacity of a CPU in this freq domain at
      each possible OPP;
   3. map all CPUs in the freq domain to this list of <capacity, power>
      tuples;
   4. go to 1.

I hope that makes sense.

Thanks,
Quentin

[1] https://marc.info/?l=linux-pm&m=151635516419249&w=2
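Condensed into code -- this is essentially the body of
init_sched_energy() from the patch, with all error handling and cleanup
stripped out -- the loop above reads:

        void init_sched_energy(void)
        {
                struct sched_energy_model *em;
                struct freq_domain *fdom;
                int cpu, fdom_cpu;

                for_each_possible_cpu(cpu) {
                        /* 1. Skip CPUs whose frequency domain was already visited. */
                        if (*per_cpu_ptr(energy_model, cpu))
                                continue;

                        /* Record the span of the newly discovered domain. */
                        fdom = kzalloc(sizeof(*fdom), GFP_KERNEL);
                        dev_pm_opp_get_sharing_cpus(get_cpu_device(cpu), &fdom->span);
                        list_add(&fdom->next, &freq_domains);

                        /* 2. Build one <capacity, power> table for the whole domain. */
                        em = build_energy_model(cpumask_first(&fdom->span));

                        /* 3. Link every CPU in the domain to the shared table. */
                        for_each_cpu(fdom_cpu, &fdom->span)
                                *per_cpu_ptr(energy_model, fdom_cpu) = em;
                }
        }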
On Mon, Apr 09, 2018 at 02:45:11PM +0100, Quentin Perret wrote:
> In this specific patch, we are basically trying to figure out the
> boundaries of frequency domains, and the power consumed by each CPU
> at each OPP, to make them available to the scheduler. The important
> thing here is that, in both cases, we rely on the OPP library to
> keep the code as platform-agnostic as possible.

AFAICT the only users of this PM_OPP stuff are a bunch of ARM platforms.
Granted, nobody else has built a big.LITTLE-style system, so that might
all be fine, I suppose.

It won't be until some !ARM chip comes along that we'll know how
generically usable any of this really is.

> In the case of the frequency domains, for example, the cpufreq driver is
> in charge of specifying the CPUs that are sharing frequencies. That
> information can come from DT, or SCPI, or SCMI, or whatever -- we
> probably shouldn't have to care about that from the scheduler's
> standpoint. That's why using dev_pm_opp_get_sharing_cpus() is handy: the
> OPP library gives us the digested information we need.

So I kinda would've expected to just ask cpufreq, which after all
already knows these things. Why did we need to invent this pm_opp thing?
Cpufreq has a ton of supported architectures, pm_opp not so much.

> The power values (dev_pm_opp_get_power) we use right now are those
> already used by the thermal subsystem (IPA), which means we don't have

I love an IPA style beer, but I'm thinking that's not the same IPA,
right :-)

> to introduce any new DT binding whatsoever. In the near future, the
> power values could also come from other sources (SCMI for example), and
> again it's probably not the scheduler's job to care about those things,
> so the OPP library is helping us again. As mentioned in the notes, as of
> today, this approach has dependencies on other patches relating to these
> things, which are already on the list [1].

Is there any !ARM thermal driver? (clearly I'm not up-to-date on things
thermal).
On Monday 09 Apr 2018 at 17:32:33 (+0200), Peter Zijlstra wrote:
> On Mon, Apr 09, 2018 at 02:45:11PM +0100, Quentin Perret wrote:
>
> > In this specific patch, we are basically trying to figure out the
> > boundaries of frequency domains, and the power consumed by each CPU
> > at each OPP, to make them available to the scheduler. The important
> > thing here is that, in both cases, we rely on the OPP library to
> > keep the code as platform-agnostic as possible.
>
> AFAICT the only users of this PM_OPP stuff are a bunch of ARM platforms.

That's correct.

> Granted, nobody else has built a big.LITTLE-style system, so that might
> all be fine, I suppose.
>
> It won't be until some !ARM chip comes along that we'll know how
> generically usable any of this really is.

Right. There is already a lot of diversity in the Arm ecosystem that has
to be managed. That's what I meant by platform-agnostic. Now, I agree
that it should be discussed whether or not this is enough for other
archs ...

It might be reasonable to expect from the archs who want to use EAS that
they expose their OPPs in the OPP lib. That should be harmless, and EAS
needs to know about the OPPs, so they should be made visible, ideally
somewhere generic. Otherwise, that means the interface with EAS has to
be defined only by the energy model data structures, and the actual
energy model loading procedure becomes free-form arch code.

I quite like the first idea from a pure design standpoint, but I could
also understand if maintainers of other archs were reluctant to have new
dependencies on PM_OPP ...

> > In the case of the frequency domains, for example, the cpufreq driver is
> > in charge of specifying the CPUs that are sharing frequencies. [...]
>
> So I kinda would've expected to just ask cpufreq, which after all
> already knows these things. Why did we need to invent this pm_opp thing?

Yes, we can definitely rely on cpufreq for this one. There is a "strong"
dependency on PM_OPP to get power values, so I decided to use PM_OPP for
the frequency domains as well, for consistency. But I can change that if
needed.

> Cpufreq has a ton of supported architectures, pm_opp not so much.
>
> > The power values (dev_pm_opp_get_power) we use right now are those
> > already used by the thermal subsystem (IPA), which means we don't have
>
> I love an IPA style beer, but I'm thinking that's not the same IPA,
> right :-)

Well, both can help to chill down in a way ... :-)

The IPA I'm talking about means Intelligent Power Allocator. It's a
thermal governor that uses a power model of the platform to allocate
power budgets to CPUs & GPUs using a control loop. The code is in
drivers/thermal/power_allocator.c if this is of interest.

> > to introduce any new DT binding whatsoever. [...]
>
> Is there any !ARM thermal driver? (clearly I'm not up-to-date on things
> thermal).

I don't think so.

Thanks,
Quentin
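For reference, swapping the frequency-domain discovery over to cpufreq,
as agreed above, would be a small change in init_sched_energy(). A
sketch, assuming the cpufreq driver is already registered by the time
this code runs (and that <linux/cpufreq.h> is included):

        /* Instead of dev_pm_opp_get_sharing_cpus(cpu_dev, &fdom->span): */
        struct cpufreq_policy *policy = cpufreq_cpu_get(cpu);

        if (!policy)
                goto free_em;

        /* All CPUs clocked together with 'cpu', including offline ones. */
        cpumask_copy(&fdom->span, policy->related_cpus);
        cpufreq_cpu_put(policy);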
On Mon, Apr 9, 2018 at 6:42 PM, Quentin Perret <quentin.perret@arm.com> wrote:
> On Monday 09 Apr 2018 at 17:32:33 (+0200), Peter Zijlstra wrote:
>> On Mon, Apr 09, 2018 at 02:45:11PM +0100, Quentin Perret wrote:

[...]

>> AFAICT the only users of this PM_OPP stuff are a bunch of ARM platforms.
>
> That's correct.
>
>> Granted, nobody else has built a big.LITTLE-style system, so that might
>> all be fine, I suppose.
>>
>> It won't be until some !ARM chip comes along that we'll know how
>> generically usable any of this really is.
>
> Right. There is already a lot of diversity in the Arm ecosystem that has
> to be managed. That's what I meant by platform-agnostic. Now, I agree
> that it should be discussed whether or not this is enough for other
> archs ...

Even for ARM64 w/ ACPI, mind you.

> It might be reasonable to expect from the archs who want to use EAS that
> they expose their OPPs in the OPP lib. That should be harmless, and EAS
> needs to know about the OPPs, so they should be made visible, ideally
> somewhere generic. Otherwise, that means the interface with EAS has to
> be defined only by the energy model data structures, and the actual
> energy model loading procedure becomes free-form arch code.
>
> I quite like the first idea from a pure design standpoint, but I could
> also understand if maintainers of other archs were reluctant to have new
> dependencies on PM_OPP ...

Not just reluctant I would think.

Depending on PM_OPP directly here is like depending on ACPI directly.
Would you agree with the latter?

>> > In the case of the frequency domains, for example, the cpufreq driver is
>> > in charge of specifying the CPUs that are sharing frequencies. [...]
>>
>> So I kinda would've expected to just ask cpufreq, which after all
>> already knows these things. Why did we need to invent this pm_opp thing?
>
> Yes, we can definitely rely on cpufreq for this one. There is a "strong"
> dependency on PM_OPP to get power values, so I decided to use PM_OPP for
> the frequency domains as well, for consistency. But I can change that if
> needed.

Yes, please.

[...]

>> Is there any !ARM thermal driver? (clearly I'm not up-to-date on things
>> thermal).
>
> I don't think so.

No, there isn't, AFAICS.

Thanks!
On Tuesday 10 Apr 2018 at 08:55:14 (+0200), Rafael J. Wysocki wrote:
> On Mon, Apr 9, 2018 at 6:42 PM, Quentin Perret <quentin.perret@arm.com> wrote:
> > On Monday 09 Apr 2018 at 17:32:33 (+0200), Peter Zijlstra wrote:
> >> On Mon, Apr 09, 2018 at 02:45:11PM +0100, Quentin Perret wrote:

[...]

> > I quite like the first idea from a pure design standpoint, but I could
> > also understand if maintainers of other archs were reluctant to have
> > new dependencies on PM_OPP ...
>
> Not just reluctant I would think.
>
> Depending on PM_OPP directly here is like depending on ACPI directly.
> Would you agree with the latter?

Right, I see your point. I was suggesting to use PM_OPP only to make the
OPPs *visible*, nothing else. That doesn't mean all archs would have to
use dev_pm_opp_set_rate() or anything; they could just keep on doing
DVFS their own way. PM_OPP would just be a common way to make OPPs
visible outside of their subsystem, which should be harmless. The point
is to keep the energy model loading code common to all archs.

Another solution would be to let the archs populate the energy model
data structures themselves, and turn the current energy.c file into
arm/arm64-specific code, for example.

Overall, I guess the question is whether or not PM_OPP is the right
interface for EAS on multiple archs ... That sounds like an interesting
discussion topic for OSPM next week, so thanks a lot for raising this
point!

Regards,
Quentin
On Tue, Apr 10, 2018 at 11:31 AM, Quentin Perret <quentin.perret@arm.com> wrote:
> On Tuesday 10 Apr 2018 at 08:55:14 (+0200), Rafael J. Wysocki wrote:

[...]

> Right, I see your point. I was suggesting to use PM_OPP only to make the
> OPPs *visible*, nothing else. [...] The point is to keep the energy
> model loading code common to all archs.
>
> Another solution would be to let the archs populate the energy model
> data structures themselves, and turn the current energy.c file into
> arm/arm64-specific code, for example.
>
> Overall, I guess the question is whether or not PM_OPP is the right
> interface for EAS on multiple archs ... That sounds like an interesting
> discussion topic for OSPM next week,

I agree.

> so thanks a lot for raising this point!

And moreover, we already have cpufreq and cpuidle that use their own
representations of the same information, generally coming from lower
layers. They do that because they need to work with different platforms
that generally represent the low-level information differently.

I don't see why that principle doesn't apply to EAS. Maybe there should
be a common data structure to be used by them all, but I'm quite
confident that PM_OPP is not suitable for this purpose in general.
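To make the arch-populated alternative from the discussion above
concrete: the scheduler-side interface could shrink to a registration
helper along the lines of the sketch below. The name and signature are
purely illustrative, not existing kernel API:

        /*
         * Hypothetical hook: the platform code builds its own
         * <capacity, power> tables -- from DT, ACPI, firmware, ... --
         * and hands them over, so the scheduler side stays agnostic of
         * where the data comes from.
         */
        void sched_energy_register_model(const struct cpumask *span,
                                         struct sched_energy_model *em)
        {
                int cpu;

                for_each_cpu(cpu, span)
                        *per_cpu_ptr(energy_model, cpu) = em;
        }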
diff --git a/include/linux/sched/energy.h b/include/linux/sched/energy.h
new file mode 100644
index 000000000000..b4f43564ffe4
--- /dev/null
+++ b/include/linux/sched/energy.h
@@ -0,0 +1,31 @@
+#ifndef _LINUX_SCHED_ENERGY_H
+#define _LINUX_SCHED_ENERGY_H
+
+#ifdef CONFIG_SMP
+struct capacity_state {
+	unsigned long cap;	/* compute capacity */
+	unsigned long power;	/* power consumption at this compute capacity */
+};
+
+struct sched_energy_model {
+	int nr_cap_states;
+	struct capacity_state *cap_states;
+};
+
+struct freq_domain {
+	struct list_head next;
+	cpumask_t span;
+};
+
+extern struct sched_energy_model ** __percpu energy_model;
+extern struct static_key_false sched_energy_present;
+extern struct list_head freq_domains;
+#define for_each_freq_domain(fdom) \
+	list_for_each_entry(fdom, &freq_domains, next)
+
+void init_sched_energy(void);
+#else
+static inline void init_sched_energy(void) { }
+#endif
+
+#endif
diff --git a/kernel/sched/Makefile b/kernel/sched/Makefile
index d9a02b318108..912972ad4dbc 100644
--- a/kernel/sched/Makefile
+++ b/kernel/sched/Makefile
@@ -20,7 +20,7 @@ obj-y += core.o loadavg.o clock.o cputime.o
 obj-y += idle.o fair.o rt.o deadline.o
 obj-y += wait.o wait_bit.o swait.o completion.o
 
-obj-$(CONFIG_SMP) += cpupri.o cpudeadline.o topology.o stop_task.o
+obj-$(CONFIG_SMP) += cpupri.o cpudeadline.o topology.o stop_task.o energy.o
 obj-$(CONFIG_SCHED_AUTOGROUP) += autogroup.o
 obj-$(CONFIG_SCHEDSTATS) += stats.o
 obj-$(CONFIG_SCHED_DEBUG) += debug.o
diff --git a/kernel/sched/energy.c b/kernel/sched/energy.c
new file mode 100644
index 000000000000..4662c993e096
--- /dev/null
+++ b/kernel/sched/energy.c
@@ -0,0 +1,190 @@
+/*
+ * Released under the GPLv2 only.
+ * SPDX-License-Identifier: GPL-2.0
+ *
+ * Energy-aware scheduling models
+ *
+ * Copyright (C) 2018, Arm Ltd.
+ * Written by: Quentin Perret, Arm Ltd.
+ *
+ * This file is subject to the terms and conditions of the GNU General Public
+ * License. See the file "COPYING" in the main directory of this archive
+ * for more details.
+ */
+
+#define pr_fmt(fmt) "sched-energy: " fmt
+
+#include <linux/sched/topology.h>
+#include <linux/sched/energy.h>
+#include <linux/pm_opp.h>
+
+#include "sched.h"
+
+DEFINE_STATIC_KEY_FALSE(sched_energy_present);
+struct sched_energy_model ** __percpu energy_model;
+
+/*
+ * A copy of the cpumasks representing the frequency domains is kept private
+ * to the scheduler. They are stacked in a dynamically allocated linked list
+ * as we don't know how many frequency domains the system has.
+ */
+LIST_HEAD(freq_domains);
+
+#ifdef CONFIG_PM_OPP
+static struct sched_energy_model *build_energy_model(int cpu)
+{
+	unsigned long cap_scale = arch_scale_cpu_capacity(NULL, cpu);
+	unsigned long cap, freq, power, max_freq = ULONG_MAX;
+	unsigned long opp_eff, prev_opp_eff = ULONG_MAX;
+	struct sched_energy_model *em = NULL;
+	struct device *cpu_dev;
+	struct dev_pm_opp *opp;
+	int opp_cnt, i;
+
+	cpu_dev = get_cpu_device(cpu);
+	if (!cpu_dev) {
+		pr_err("CPU%d: Failed to get device\n", cpu);
+		return NULL;
+	}
+
+	opp_cnt = dev_pm_opp_get_opp_count(cpu_dev);
+	if (opp_cnt <= 0) {
+		pr_err("CPU%d: Failed to get # of available OPPs.\n", cpu);
+		return NULL;
+	}
+
+	opp = dev_pm_opp_find_freq_floor(cpu_dev, &max_freq);
+	if (IS_ERR(opp)) {
+		pr_err("CPU%d: Failed to get max frequency.\n", cpu);
+		return NULL;
+	}
+
+	dev_pm_opp_put(opp);
+	if (!max_freq) {
+		pr_err("CPU%d: Found null max frequency.\n", cpu);
+		return NULL;
+	}
+
+	em = kzalloc(sizeof(*em), GFP_KERNEL);
+	if (!em)
+		return NULL;
+
+	em->cap_states = kcalloc(opp_cnt, sizeof(*em->cap_states), GFP_KERNEL);
+	if (!em->cap_states)
+		goto free_em;
+
+	for (i = 0, freq = 0; i < opp_cnt; i++, freq++) {
+		opp = dev_pm_opp_find_freq_ceil(cpu_dev, &freq);
+		if (IS_ERR(opp)) {
+			pr_err("CPU%d: Failed to get OPP %d.\n", cpu, i+1);
+			goto free_cs;
+		}
+
+		power = dev_pm_opp_get_power(opp);
+		dev_pm_opp_put(opp);
+		if (!power || !freq)
+			goto free_cs;
+
+		cap = freq * cap_scale / max_freq;
+		em->cap_states[i].power = power;
+		em->cap_states[i].cap = cap;
+
+		/*
+		 * The capacity/watts efficiency ratio should decrease as the
+		 * frequency grows on sane platforms. If not, warn the user
+		 * that some high OPPs are more power efficient than some
+		 * of the lower ones.
+		 */
+		opp_eff = (cap << 20) / power;
+		if (opp_eff >= prev_opp_eff)
+			pr_warn("CPU%d: cap/pwr: OPP%d > OPP%d\n", cpu, i, i-1);
+		prev_opp_eff = opp_eff;
+	}
+
+	em->nr_cap_states = opp_cnt;
+	return em;
+
+free_cs:
+	kfree(em->cap_states);
+free_em:
+	kfree(em);
+	return NULL;
+}
+
+static void free_energy_model(void)
+{
+	struct sched_energy_model *em;
+	struct freq_domain *tmp, *pos;
+	int cpu;
+
+	list_for_each_entry_safe(pos, tmp, &freq_domains, next) {
+		cpu = cpumask_first(&(pos->span));
+		em = *per_cpu_ptr(energy_model, cpu);
+		if (em) {
+			kfree(em->cap_states);
+			kfree(em);
+		}
+
+		list_del(&(pos->next));
+		kfree(pos);
+	}
+
+	free_percpu(energy_model);
+}
+
+void init_sched_energy(void)
+{
+	struct freq_domain *fdom;
+	struct sched_energy_model *em;
+	struct device *cpu_dev;
+	int cpu, ret, fdom_cpu;
+
+	/* Energy Aware Scheduling is used for asymmetric systems only. */
+	if (!lowest_flag_domain(smp_processor_id(), SD_ASYM_CPUCAPACITY))
+		return;
+
+	energy_model = alloc_percpu(struct sched_energy_model *);
+	if (!energy_model)
+		goto exit_fail;
+
+	for_each_possible_cpu(cpu) {
+		if (*per_cpu_ptr(energy_model, cpu))
+			continue;
+
+		/* Keep a copy of the sharing_cpus mask */
+		fdom = kzalloc(sizeof(struct freq_domain), GFP_KERNEL);
+		if (!fdom)
+			goto free_em;
+
+		cpu_dev = get_cpu_device(cpu);
+		ret = dev_pm_opp_get_sharing_cpus(cpu_dev, &(fdom->span));
+		if (ret)
+			goto free_em;
+		list_add(&(fdom->next), &freq_domains);
+
+		/*
+		 * Build the energy model of one CPU, and link it to all CPUs
+		 * in its frequency domain. This should be correct as long as
+		 * they share the same micro-architecture.
+		 */
+		fdom_cpu = cpumask_first(&(fdom->span));
+		em = build_energy_model(fdom_cpu);
+		if (!em)
+			goto free_em;
+
+		for_each_cpu(fdom_cpu, &(fdom->span))
+			*per_cpu_ptr(energy_model, fdom_cpu) = em;
+	}
+
+	static_branch_enable(&sched_energy_present);
+
+	pr_info("Energy Aware Scheduling started.\n");
+	return;
+free_em:
+	free_energy_model();
+exit_fail:
+	pr_err("Energy Aware Scheduling initialization failed.\n");
+}
+#else
+void init_sched_energy(void) {}
+#endif
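For illustration only -- the following helper is not part of the patch
-- a scheduler-side consumer of these structures could estimate the
energy cost of running a frequency domain at a given utilization along
these lines: pick the lowest-capacity state able to serve the
utilization, then scale its power by the expected busy time:

        static unsigned long fdom_estimate_energy(struct freq_domain *fdom,
                                                  unsigned long util)
        {
                int cpu = cpumask_first(&fdom->span);
                struct sched_energy_model *em = *per_cpu_ptr(energy_model, cpu);
                /* Default to the highest OPP if the domain is over-utilized. */
                struct capacity_state *cs = &em->cap_states[em->nr_cap_states - 1];
                int i;

                /* cap_states is sorted by capacity: find the lowest fitting OPP. */
                for (i = 0; i < em->nr_cap_states; i++) {
                        if (em->cap_states[i].cap >= util) {
                                cs = &em->cap_states[i];
                                break;
                        }
                }

                /* Energy ~= power at that OPP, scaled by the expected busy time. */
                return cs->power * util / cs->cap;
        }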