
[RFC,18/27] drivers: cpu-pd: Add PM Domain governor for CPUs

Message ID 1447799871-56374-19-git-send-email-lina.iyer@linaro.org (mailing list archive)
State New, archived

Commit Message

Lina Iyer Nov. 17, 2015, 10:37 p.m. UTC
A PM domain comprising CPUs may be powered off when all the CPUs in
the domain are powered down. Powering down a CPU domain is generally an
expensive operation, and therefore the power/performance trade-offs
should be considered. The time between the last CPU powering down and
the first CPU powering up in a domain is the time available for the
domain to sleep. Ideally, the sleep time of the domain should fulfill
the residency requirement of the domain's idle state.

To do this effectively, read the time before the wakeup of the cluster's
CPUs and ensure that the domain's idle state sleep time satisfies the
QoS requirements of each of the CPUs, the PM QoS CPU_DMA_LATENCY constraint,
and the state's residency.

Signed-off-by: Lina Iyer <lina.iyer@linaro.org>
---
 drivers/base/power/cpu-pd.c | 83 ++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 82 insertions(+), 1 deletion(-)

Comments

Lorenzo Pieralisi Nov. 18, 2015, 6:42 p.m. UTC | #1
On Tue, Nov 17, 2015 at 03:37:42PM -0700, Lina Iyer wrote:
> A PM domain comprising of CPUs may be powered off when all the CPUs in
> the domain are powered down. Powering down a CPU domain is generally a
> expensive operation and therefore the power performance trade offs
> should be considered. The time between the last CPU powering down and
> the first CPU powering up in a domain, is the time available for the
> domain to sleep. Ideally, the sleep time of the domain should fulfill
> the residency requirement of the domains' idle state.
> 
> To do this effectively, read the time before the wakeup of the cluster's
> CPUs and ensure that the domain's idle state sleep time guarantees the
> QoS requirements of each of the CPU, the PM QoS CPU_DMA_LATENCY and the
> state's residency.

To me this information should be part of the CPUidle governor (it is
already there); we should not split the decision into multiple layers.

The problem you are facing is that the CPUidle governor(s) do not take
cross-CPU relationships into account. I do not think that adding another
decision layer in the power domain subsystem helps; you are doing that
just because adding it to the existing CPUidle governor(s) is invasive.

Why can't we use the power domain work you put together to, e.g., disable
idle states that are shared by multiple CPUs and make them "visible" only
when the power domain that encompasses them is actually going down?

You could use the power domains information to detect states that
are shared between cpus.

It is just an idea. What I am saying is that having another governor in
the power domain subsystem does not make much sense; you split the
decision into two layers while there is really only one, the existing
CPUidle governor, and that is where the decision should be taken.

Thoughts appreciated.

Lorenzo

> Signed-off-by: Lina Iyer <lina.iyer@linaro.org>
> ---
>  drivers/base/power/cpu-pd.c | 83 ++++++++++++++++++++++++++++++++++++++++++++-
>  1 file changed, 82 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/base/power/cpu-pd.c b/drivers/base/power/cpu-pd.c
> index 617ce54..a00abc1 100644
> --- a/drivers/base/power/cpu-pd.c
> +++ b/drivers/base/power/cpu-pd.c
> @@ -21,6 +21,7 @@
>  #include <linux/pm_qos.h>
>  #include <linux/rculist.h>
>  #include <linux/slab.h>
> +#include <linux/tick.h>
>  
>  #define CPU_PD_NAME_MAX 36
>  
> @@ -66,6 +67,86 @@ static void get_cpus_in_domain(struct generic_pm_domain *genpd,
>  	}
>  }
>  
> +static bool cpu_pd_down_ok(struct dev_pm_domain *pd)
> +{
> +	struct generic_pm_domain *genpd = pd_to_genpd(pd);
> +	struct cpu_pm_domain *cpu_pd = to_cpu_pd(genpd);
> +	int qos = pm_qos_request(PM_QOS_CPU_DMA_LATENCY);
> +	u64 sleep_ns = ~0;
> +	ktime_t earliest;
> +	int cpu;
> +	int i;
> +
> +	/* Reset the last set genpd state, default to index 0 */
> +	genpd->state_idx = 0;
> +
> +	/* We dont want to power down, if QoS is 0 */
> +	if (!qos)
> +		return false;
> +
> +	/*
> +	 * Find the sleep time for the cluster.
> +	 * The time between now and the first wake up of any CPU that
> +	 * are in this domain hierarchy is the time available for the
> +	 * domain to be idle.
> +	 */
> +	earliest.tv64 = KTIME_MAX;
> +	for_each_cpu_and(cpu, cpu_pd->cpus, cpu_online_mask) {
> +		struct device *cpu_dev = get_cpu_device(cpu);
> +		struct gpd_timing_data *td;
> +
> +		td = &dev_gpd_data(cpu_dev)->td;
> +
> +		if (earliest.tv64 < td->next_wakeup.tv64)
> +			earliest = td->next_wakeup;
> +	}
> +
> +	sleep_ns = ktime_to_ns(ktime_sub(earliest, ktime_get()));
> +	if (sleep_ns <= 0)
> +		return false;
> +
> +	/*
> +	 * Find the deepest sleep state that satisfies the residency
> +	 * requirement and the QoS constraint
> +	 */
> +	for (i = genpd->state_count - 1; i > 0; i--) {
> +		u64 state_sleep_ns;
> +
> +		state_sleep_ns = genpd->states[i].power_off_latency_ns +
> +			genpd->states[i].power_on_latency_ns +
> +			genpd->states[i].residency_ns;
> +
> +		/*
> +		 * If we cant sleep to save power in the state, move on
> +		 * to the next lower idle state.
> +		 */
> +		if (state_sleep_ns > sleep_ns)
> +		       continue;
> +
> +		/*
> +		 * We also dont want to sleep more than we should to
> +		 * gaurantee QoS.
> +		 */
> +		if (state_sleep_ns < (qos * NSEC_PER_USEC))
> +			break;
> +	}
> +
> +	if (i >= 0)
> +		genpd->state_idx = i;
> +
> +	return  (i >= 0) ? true : false;
> +}
> +
> +static bool cpu_stop_ok(struct device *dev)
> +{
> +	return true;
> +}
> +
> +struct dev_power_governor cpu_pd_gov = {
> +	.power_down_ok = cpu_pd_down_ok,
> +	.stop_ok = cpu_stop_ok,
> +};
> +
>  static int cpu_pd_power_off(struct generic_pm_domain *genpd)
>  {
>  	struct cpu_pm_domain *pd = to_cpu_pd(genpd);
> @@ -183,7 +264,7 @@ int of_register_cpu_pm_domain(struct device_node *dn,
>  
>  	/* Register the CPU genpd */
>  	pr_debug("adding %s as CPU PM domain.\n", pd->genpd->name);
> -	ret = of_pm_genpd_init(dn, pd->genpd, &simple_qos_governor, false);
> +	ret = of_pm_genpd_init(dn, pd->genpd, &cpu_pd_gov, false);
>  	if (ret) {
>  		pr_err("Unable to initialize domain %s\n", dn->full_name);
>  		return ret;
> -- 
> 2.1.4
>
Marc Titinger Nov. 19, 2015, 8:50 a.m. UTC | #2
On 18/11/2015 19:42, Lorenzo Pieralisi wrote:
> On Tue, Nov 17, 2015 at 03:37:42PM -0700, Lina Iyer wrote:
>> A PM domain comprising of CPUs may be powered off when all the CPUs in
>> the domain are powered down. Powering down a CPU domain is generally a
>> expensive operation and therefore the power performance trade offs
>> should be considered. The time between the last CPU powering down and
>> the first CPU powering up in a domain, is the time available for the
>> domain to sleep. Ideally, the sleep time of the domain should fulfill
>> the residency requirement of the domains' idle state.
>>
>> To do this effectively, read the time before the wakeup of the cluster's
>> CPUs and ensure that the domain's idle state sleep time guarantees the
>> QoS requirements of each of the CPU, the PM QoS CPU_DMA_LATENCY and the
>> state's residency.
>
> To me this information should be part of the CPUidle governor (it is
> already there), we should not split the decision into multiple layers.
>
> The problem you are facing is that the CPUidle governor(s) do not take
> cross cpus relationship into account, I do not think that adding another
> decision layer in the power domain subsystem helps, you are doing that
> just because adding it to the existing CPUidle governor(s) is invasive.
>
> Why can't we use the power domain work you put together to eg disable
> idle states that share multiple cpus and make them "visible" only
> when the power domain that encompass them is actually going down ?
>
> You could use the power domains information to detect states that
> are shared between cpus.
>
> It is just an idea, what I am saying is that having another governor in
> the power domain subsytem does not make much sense, you split the
> decision in two layers while there is actually one, the existing
> CPUidle governor and that's where the decision should be taken.
>
> Thoughts appreciated.

Maybe this is silly and not thought through, but I wonder if the 
responsibilities could be split, for instance with an outer control loop 
that has the heuristic to compute the next tick time and the required 
CPU power needed during that time slot, and an inner control loop 
(genpd) that has a per-domain QoS and can optimize power consumption.

Marc.

>
> Lorenzo
>
>> Signed-off-by: Lina Iyer <lina.iyer@linaro.org>
>> ---
>>   drivers/base/power/cpu-pd.c | 83 ++++++++++++++++++++++++++++++++++++++++++++-
>>   1 file changed, 82 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/base/power/cpu-pd.c b/drivers/base/power/cpu-pd.c
>> index 617ce54..a00abc1 100644
>> --- a/drivers/base/power/cpu-pd.c
>> +++ b/drivers/base/power/cpu-pd.c
>> @@ -21,6 +21,7 @@
>>   #include <linux/pm_qos.h>
>>   #include <linux/rculist.h>
>>   #include <linux/slab.h>
>> +#include <linux/tick.h>
>>
>>   #define CPU_PD_NAME_MAX 36
>>
>> @@ -66,6 +67,86 @@ static void get_cpus_in_domain(struct generic_pm_domain *genpd,
>>   	}
>>   }
>>
>> +static bool cpu_pd_down_ok(struct dev_pm_domain *pd)
>> +{
>> +	struct generic_pm_domain *genpd = pd_to_genpd(pd);
>> +	struct cpu_pm_domain *cpu_pd = to_cpu_pd(genpd);
>> +	int qos = pm_qos_request(PM_QOS_CPU_DMA_LATENCY);
>> +	u64 sleep_ns = ~0;
>> +	ktime_t earliest;
>> +	int cpu;
>> +	int i;
>> +
>> +	/* Reset the last set genpd state, default to index 0 */
>> +	genpd->state_idx = 0;
>> +
>> +	/* We dont want to power down, if QoS is 0 */
>> +	if (!qos)
>> +		return false;
>> +
>> +	/*
>> +	 * Find the sleep time for the cluster.
>> +	 * The time between now and the first wake up of any CPU that
>> +	 * are in this domain hierarchy is the time available for the
>> +	 * domain to be idle.
>> +	 */
>> +	earliest.tv64 = KTIME_MAX;
>> +	for_each_cpu_and(cpu, cpu_pd->cpus, cpu_online_mask) {
>> +		struct device *cpu_dev = get_cpu_device(cpu);
>> +		struct gpd_timing_data *td;
>> +
>> +		td = &dev_gpd_data(cpu_dev)->td;
>> +
>> +		if (earliest.tv64 < td->next_wakeup.tv64)
>> +			earliest = td->next_wakeup;
>> +	}
>> +
>> +	sleep_ns = ktime_to_ns(ktime_sub(earliest, ktime_get()));
>> +	if (sleep_ns <= 0)
>> +		return false;
>> +
>> +	/*
>> +	 * Find the deepest sleep state that satisfies the residency
>> +	 * requirement and the QoS constraint
>> +	 */
>> +	for (i = genpd->state_count - 1; i > 0; i--) {
>> +		u64 state_sleep_ns;
>> +
>> +		state_sleep_ns = genpd->states[i].power_off_latency_ns +
>> +			genpd->states[i].power_on_latency_ns +
>> +			genpd->states[i].residency_ns;
>> +
>> +		/*
>> +		 * If we cant sleep to save power in the state, move on
>> +		 * to the next lower idle state.
>> +		 */
>> +		if (state_sleep_ns > sleep_ns)
>> +		       continue;
>> +
>> +		/*
>> +		 * We also dont want to sleep more than we should to
>> +		 * gaurantee QoS.
>> +		 */
>> +		if (state_sleep_ns < (qos * NSEC_PER_USEC))
>> +			break;
>> +	}
>> +
>> +	if (i >= 0)
>> +		genpd->state_idx = i;
>> +
>> +	return  (i >= 0) ? true : false;
>> +}
>> +
>> +static bool cpu_stop_ok(struct device *dev)
>> +{
>> +	return true;
>> +}
>> +
>> +struct dev_power_governor cpu_pd_gov = {
>> +	.power_down_ok = cpu_pd_down_ok,
>> +	.stop_ok = cpu_stop_ok,
>> +};
>> +
>>   static int cpu_pd_power_off(struct generic_pm_domain *genpd)
>>   {
>>   	struct cpu_pm_domain *pd = to_cpu_pd(genpd);
>> @@ -183,7 +264,7 @@ int of_register_cpu_pm_domain(struct device_node *dn,
>>
>>   	/* Register the CPU genpd */
>>   	pr_debug("adding %s as CPU PM domain.\n", pd->genpd->name);
>> -	ret = of_pm_genpd_init(dn, pd->genpd, &simple_qos_governor, false);
>> +	ret = of_pm_genpd_init(dn, pd->genpd, &cpu_pd_gov, false);
>>   	if (ret) {
>>   		pr_err("Unable to initialize domain %s\n", dn->full_name);
>>   		return ret;
>> --
>> 2.1.4
>>
Kevin Hilman Nov. 19, 2015, 11:52 p.m. UTC | #3
Lorenzo Pieralisi <lorenzo.pieralisi@arm.com> writes:

> On Tue, Nov 17, 2015 at 03:37:42PM -0700, Lina Iyer wrote:
>> A PM domain comprising of CPUs may be powered off when all the CPUs in
>> the domain are powered down. Powering down a CPU domain is generally a
>> expensive operation and therefore the power performance trade offs
>> should be considered. The time between the last CPU powering down and
>> the first CPU powering up in a domain, is the time available for the
>> domain to sleep. Ideally, the sleep time of the domain should fulfill
>> the residency requirement of the domains' idle state.
>> 
>> To do this effectively, read the time before the wakeup of the cluster's
>> CPUs and ensure that the domain's idle state sleep time guarantees the
>> QoS requirements of each of the CPU, the PM QoS CPU_DMA_LATENCY and the
>> state's residency.
>
> To me this information should be part of the CPUidle governor (it is
> already there), we should not split the decision into multiple layers.
>
> The problem you are facing is that the CPUidle governor(s) do not take
> cross cpus relationship into account, I do not think that adding another
> decision layer in the power domain subsystem helps, you are doing that
> just because adding it to the existing CPUidle governor(s) is invasive.
>
> Why can't we use the power domain work you put together to eg disable
> idle states that share multiple cpus and make them "visible" only
> when the power domain that encompass them is actually going down ?
>
> You could use the power domains information to detect states that
> are shared between cpus.
>
> It is just an idea, what I am saying is that having another governor in
> the power domain subsytem does not make much sense, you split the
> decision in two layers while there is actually one, the existing
> CPUidle governor and that's where the decision should be taken.

Hmm, considering "normal" devices in "normal" power domains, and
following the same logic, the equivalent would be to say that the
decision to gate the power domain belongs to the individual drivers
in the domain instead of in the power domain layer.  I disagree.

IMO, there are different decision layers because there are different
hardware layers.  Devices (including CPUs) are responsible for handling
device-local idle states, based on device-local conditions (e.g. local
wakeups, timers, etc.), and domains are responsible for handling
decisions based on conditions of the whole domain.

Kevin
Kevin Hilman Nov. 20, 2015, 12:03 a.m. UTC | #4
Lina Iyer <lina.iyer@linaro.org> writes:

> A PM domain comprising of CPUs may be powered off when all the CPUs in
> the domain are powered down. Powering down a CPU domain is generally a
> expensive operation and therefore the power performance trade offs
> should be considered. The time between the last CPU powering down and
> the first CPU powering up in a domain, is the time available for the
> domain to sleep. Ideally, the sleep time of the domain should fulfill
> the residency requirement of the domains' idle state.
>
> To do this effectively, read the time before the wakeup of the cluster's
> CPUs and ensure that the domain's idle state sleep time guarantees the
> QoS requirements of each of the CPU, the PM QoS CPU_DMA_LATENCY and the
> state's residency.
>
> Signed-off-by: Lina Iyer <lina.iyer@linaro.org>

[...]

> +static bool cpu_stop_ok(struct device *dev)
> +{
> +	return true;
> +}
> +
> +struct dev_power_governor cpu_pd_gov = {
> +	.power_down_ok = cpu_pd_down_ok,
> +	.stop_ok = cpu_stop_ok,
> +};

If stop_ok is unconditionally true, it should probably just be removed
(IOW cpu_pd_gov->stop_ok == NULL), and that will avoid an unnecessary
function call.

Kevin
Lorenzo Pieralisi Nov. 20, 2015, 4:21 p.m. UTC | #5
On Thu, Nov 19, 2015 at 03:52:13PM -0800, Kevin Hilman wrote:
> Lorenzo Pieralisi <lorenzo.pieralisi@arm.com> writes:
> 
> > On Tue, Nov 17, 2015 at 03:37:42PM -0700, Lina Iyer wrote:
> >> A PM domain comprising of CPUs may be powered off when all the CPUs in
> >> the domain are powered down. Powering down a CPU domain is generally a
> >> expensive operation and therefore the power performance trade offs
> >> should be considered. The time between the last CPU powering down and
> >> the first CPU powering up in a domain, is the time available for the
> >> domain to sleep. Ideally, the sleep time of the domain should fulfill
> >> the residency requirement of the domains' idle state.
> >> 
> >> To do this effectively, read the time before the wakeup of the cluster's
> >> CPUs and ensure that the domain's idle state sleep time guarantees the
> >> QoS requirements of each of the CPU, the PM QoS CPU_DMA_LATENCY and the
> >> state's residency.
> >
> > To me this information should be part of the CPUidle governor (it is
> > already there), we should not split the decision into multiple layers.
> >
> > The problem you are facing is that the CPUidle governor(s) do not take
> > cross cpus relationship into account, I do not think that adding another
> > decision layer in the power domain subsystem helps, you are doing that
> > just because adding it to the existing CPUidle governor(s) is invasive.
> >
> > Why can't we use the power domain work you put together to eg disable
> > idle states that share multiple cpus and make them "visible" only
> > when the power domain that encompass them is actually going down ?
> >
> > You could use the power domains information to detect states that
> > are shared between cpus.
> >
> > It is just an idea, what I am saying is that having another governor in
> > the power domain subsytem does not make much sense, you split the
> > decision in two layers while there is actually one, the existing
> > CPUidle governor and that's where the decision should be taken.
> 
> Hmm, considering "normal" devices in "normal" power domains, and
> following the same logic, the equivalent would be to say that the
> decision to gate the power domain belongs to the individual drivers
> in the domain instead of in the power domain layer.  I disagree.
> 
> IMO, there are different decision layers because there are different
> hardware layers.  Devices (including CPUs) are reponsible for handling
> device-local idle states, based on device-local conditions (e.g. local
> wakeups, timers, etc.)  and domains are responsible for handling
> decisions based on conditions of the whole domain.

After going through the series for the second time (it is quite complex and
should probably be split), I understood your point of view and I agree with
it; I will review it more in depth to understand the details.

One thing that is not clear to me is how we would end up handling
cluster states in platform coordinated mode with this series (and
I am actually referring to the data we would add in the idle-states,
such as min-residency). I admit that data for cluster states at present
is not extremely well defined, because we have to add latencies for
the cluster state even if the state itself may be just a cpu one (by
definition a cluster state is entered only if all cpus in the cluster
enter it, otherwise FW or power controller demote them automatically).

I would like to take this series as an opportunity to improve the
current situation in a clean way (and without changing the bindings,
only augmenting them).

On a side note, I think we should give up the concept of a cluster
entirely; to me clusters are just groups of CPUs. I do not see any reason
why we should group CPUs this way, and I do not like the dependencies
of this series on the cpu-map either. I do not see the reason, but I
will go through the code again to make sure I am not missing anything.

To be clear, to me the cpumask should be created with all CPUs belonging
to a given power domain, with no cluster dependency (and yes, the CPU PM
notifiers are not appropriate at present; e.g. on
cpu_cluster_pm_{enter/exit} we save and restore the GIC distributor state
even on multi-cluster systems, which is useless and has no connection with
the real power domain topology at all, so the concept of a cluster as it
stands is shaky, to say the least).

Thanks,
Lorenzo
Lina Iyer Nov. 20, 2015, 4:42 p.m. UTC | #6
On Fri, Nov 20 2015 at 09:20 -0700, Lorenzo Pieralisi wrote:
>On Thu, Nov 19, 2015 at 03:52:13PM -0800, Kevin Hilman wrote:
>> Lorenzo Pieralisi <lorenzo.pieralisi@arm.com> writes:
>>
>> > On Tue, Nov 17, 2015 at 03:37:42PM -0700, Lina Iyer wrote:
>> >> A PM domain comprising of CPUs may be powered off when all the CPUs in
>> >> the domain are powered down. Powering down a CPU domain is generally a
>> >> expensive operation and therefore the power performance trade offs
>> >> should be considered. The time between the last CPU powering down and
>> >> the first CPU powering up in a domain, is the time available for the
>> >> domain to sleep. Ideally, the sleep time of the domain should fulfill
>> >> the residency requirement of the domains' idle state.
>> >>
>> >> To do this effectively, read the time before the wakeup of the cluster's
>> >> CPUs and ensure that the domain's idle state sleep time guarantees the
>> >> QoS requirements of each of the CPU, the PM QoS CPU_DMA_LATENCY and the
>> >> state's residency.
>> >
>> > To me this information should be part of the CPUidle governor (it is
>> > already there), we should not split the decision into multiple layers.
>> >
>> > The problem you are facing is that the CPUidle governor(s) do not take
>> > cross cpus relationship into account, I do not think that adding another
>> > decision layer in the power domain subsystem helps, you are doing that
>> > just because adding it to the existing CPUidle governor(s) is invasive.
>> >
>> > Why can't we use the power domain work you put together to eg disable
>> > idle states that share multiple cpus and make them "visible" only
>> > when the power domain that encompass them is actually going down ?
>> >
>> > You could use the power domains information to detect states that
>> > are shared between cpus.
>> >
>> > It is just an idea, what I am saying is that having another governor in
>> > the power domain subsytem does not make much sense, you split the
>> > decision in two layers while there is actually one, the existing
>> > CPUidle governor and that's where the decision should be taken.
>>
>> Hmm, considering "normal" devices in "normal" power domains, and
>> following the same logic, the equivalent would be to say that the
>> decision to gate the power domain belongs to the individual drivers
>> in the domain instead of in the power domain layer.  I disagree.
>>
>> IMO, there are different decision layers because there are different
>> hardware layers.  Devices (including CPUs) are reponsible for handling
>> device-local idle states, based on device-local conditions (e.g. local
>> wakeups, timers, etc.)  and domains are responsible for handling
>> decisions based on conditions of the whole domain.
>
>After going through the series for the second time (it is quite complex and
>should probably be split) I understood your point of view and I agree with
>it, I will review it more in-depth to understand the details.
>
I have included patches from Axel and Marc, so as to get a complete
picture. My core changes are in genpd, cpu-pd and psci.c

>One thing that is not clear to me is how we would end up handling
>cluster states in platform coordinated mode with this series (and
>I am actually referring to the data we would add in the idle-states,
>such as min-residency).
>
From what I see, platform-coordinated mode doesn't need any of this.
We are fine as it is today: CPUs vote for the cluster state they can
enter, and the firmware decides based on these votes. It makes sense,
and is probably easier, to flatten out the cluster states and attach
them to cpuidle for that.

I couldn't find a symmetry with OS-initiated mode. Maybe it deserves more
discussion and brainstorming.

>I admit that data for cluster states at present
>is not extremely well defined, because we have to add latencies for
>the cluster state even if the state itself may be just a cpu one (by
>definition a cluster state is entered only if all cpus in the cluster
>enter it, otherwise FW or power controller demote them automatically).
>

>I would like to take this series as an opportunity to improve the
>current situation in a clean way (and without changing the bindings,
>only augmenting them).
>
>On a side note, I think we should give up the concept of cluster
>entirely, to me they are just a group of cpus, I do not see any reason
>why we should group cpus this way and I do not like the dependencies
>of this series on the cpu-map either, I do not see the reason but I
>will go through code again to make sure I am not missing anything.
>
SoCs can have different organizations of CPUs (grouped into clusters)
and different power domains that power these clusters. This information has
to come from the DT. Since there are no actual devices in Linux for domain
management (with PSCI), I have added them to the cpu-map, which already
builds up the cluster hierarchy. The only addition I had to make was to
allow these cluster nodes to tell the kernel that they are domain
providers.

>To be clear, to me the cpumask should be created with all cpus belonging
>in a given power domain, no cluster dependency (and yes the CPU PM
>notifiers are not appropriate at present - eg on
>cpu_cluster_pm_{enter/exit} we save and restore the GIC distributor state
>even on multi-cluster systems, that's useless and has no connection with
>the real power domain topology at all, so the concept of cluster as it
>stands is shaky to say the least).
>

Let's discuss this more. I am interested in what you are thinking; I will
let you go through the code.

Thanks for your time, Lorenzo.

-- Lina
Lina Iyer Nov. 20, 2015, 5:39 p.m. UTC | #7
On Thu, Nov 19 2015 at 01:50 -0700, Marc Titinger wrote:
>On 18/11/2015 19:42, Lorenzo Pieralisi wrote:
>>On Tue, Nov 17, 2015 at 03:37:42PM -0700, Lina Iyer wrote:
>>>A PM domain comprising of CPUs may be powered off when all the CPUs in
>>>the domain are powered down. Powering down a CPU domain is generally a
>>>expensive operation and therefore the power performance trade offs
>>>should be considered. The time between the last CPU powering down and
>>>the first CPU powering up in a domain, is the time available for the
>>>domain to sleep. Ideally, the sleep time of the domain should fulfill
>>>the residency requirement of the domains' idle state.
>>>
>>>To do this effectively, read the time before the wakeup of the cluster's
>>>CPUs and ensure that the domain's idle state sleep time guarantees the
>>>QoS requirements of each of the CPU, the PM QoS CPU_DMA_LATENCY and the
>>>state's residency.
>>
>>To me this information should be part of the CPUidle governor (it is
>>already there), we should not split the decision into multiple layers.
>>
>>The problem you are facing is that the CPUidle governor(s) do not take
>>cross cpus relationship into account, I do not think that adding another
>>decision layer in the power domain subsystem helps, you are doing that
>>just because adding it to the existing CPUidle governor(s) is invasive.
>>
>>Why can't we use the power domain work you put together to eg disable
>>idle states that share multiple cpus and make them "visible" only
>>when the power domain that encompass them is actually going down ?
>>
>>You could use the power domains information to detect states that
>>are shared between cpus.
>>
>>It is just an idea, what I am saying is that having another governor in
>>the power domain subsytem does not make much sense, you split the
>>decision in two layers while there is actually one, the existing
>>CPUidle governor and that's where the decision should be taken.
>>
>>Thoughts appreciated.
>
>Maybe this is silly and not thought-through, but I wonder if the 
>responsibilities could be split or instance with an outer control loop 
>that has the heuristic to compute the next tick time, and the required 
>cpu-power needed during that time slot, and an inner control loop 
>(genpd) that has a per-domain QoS and can optimize power consumption.
>
Not sure I understand everything you said, but heuristics across a
bunch of CPUs can be very erratic. It's hard enough for the menu governor
to determine heuristics on a per-CPU basis.

The governor in this patch already takes care of PM QoS, but does not do
per-CPU QoS.

We should discuss this more.

-- Lina

>Marc.
>
>>
>>Lorenzo
>>
>>>Signed-off-by: Lina Iyer <lina.iyer@linaro.org>
>>>---
>>>  drivers/base/power/cpu-pd.c | 83 ++++++++++++++++++++++++++++++++++++++++++++-
>>>  1 file changed, 82 insertions(+), 1 deletion(-)
>>>
>>>diff --git a/drivers/base/power/cpu-pd.c b/drivers/base/power/cpu-pd.c
>>>index 617ce54..a00abc1 100644
>>>--- a/drivers/base/power/cpu-pd.c
>>>+++ b/drivers/base/power/cpu-pd.c
>>>@@ -21,6 +21,7 @@
>>>  #include <linux/pm_qos.h>
>>>  #include <linux/rculist.h>
>>>  #include <linux/slab.h>
>>>+#include <linux/tick.h>
>>>
>>>  #define CPU_PD_NAME_MAX 36
>>>
>>>@@ -66,6 +67,86 @@ static void get_cpus_in_domain(struct generic_pm_domain *genpd,
>>>  	}
>>>  }
>>>
>>>+static bool cpu_pd_down_ok(struct dev_pm_domain *pd)
>>>+{
>>>+	struct generic_pm_domain *genpd = pd_to_genpd(pd);
>>>+	struct cpu_pm_domain *cpu_pd = to_cpu_pd(genpd);
>>>+	int qos = pm_qos_request(PM_QOS_CPU_DMA_LATENCY);
>>>+	u64 sleep_ns = ~0;
>>>+	ktime_t earliest;
>>>+	int cpu;
>>>+	int i;
>>>+
>>>+	/* Reset the last set genpd state, default to index 0 */
>>>+	genpd->state_idx = 0;
>>>+
>>>+	/* We dont want to power down, if QoS is 0 */
>>>+	if (!qos)
>>>+		return false;
>>>+
>>>+	/*
>>>+	 * Find the sleep time for the cluster.
>>>+	 * The time between now and the first wake up of any CPU that
>>>+	 * are in this domain hierarchy is the time available for the
>>>+	 * domain to be idle.
>>>+	 */
>>>+	earliest.tv64 = KTIME_MAX;
>>>+	for_each_cpu_and(cpu, cpu_pd->cpus, cpu_online_mask) {
>>>+		struct device *cpu_dev = get_cpu_device(cpu);
>>>+		struct gpd_timing_data *td;
>>>+
>>>+		td = &dev_gpd_data(cpu_dev)->td;
>>>+
>>>+		if (earliest.tv64 < td->next_wakeup.tv64)
>>>+			earliest = td->next_wakeup;
>>>+	}
>>>+
>>>+	sleep_ns = ktime_to_ns(ktime_sub(earliest, ktime_get()));
>>>+	if (sleep_ns <= 0)
>>>+		return false;
>>>+
>>>+	/*
>>>+	 * Find the deepest sleep state that satisfies the residency
>>>+	 * requirement and the QoS constraint
>>>+	 */
>>>+	for (i = genpd->state_count - 1; i > 0; i--) {
>>>+		u64 state_sleep_ns;
>>>+
>>>+		state_sleep_ns = genpd->states[i].power_off_latency_ns +
>>>+			genpd->states[i].power_on_latency_ns +
>>>+			genpd->states[i].residency_ns;
>>>+
>>>+		/*
>>>+		 * If we cant sleep to save power in the state, move on
>>>+		 * to the next lower idle state.
>>>+		 */
>>>+		if (state_sleep_ns > sleep_ns)
>>>+		       continue;
>>>+
>>>+		/*
>>>+		 * We also dont want to sleep more than we should to
>>>+		 * gaurantee QoS.
>>>+		 */
>>>+		if (state_sleep_ns < (qos * NSEC_PER_USEC))
>>>+			break;
>>>+	}
>>>+
>>>+	if (i >= 0)
>>>+		genpd->state_idx = i;
>>>+
>>>+	return  (i >= 0) ? true : false;
>>>+}
>>>+
>>>+static bool cpu_stop_ok(struct device *dev)
>>>+{
>>>+	return true;
>>>+}
>>>+
>>>+struct dev_power_governor cpu_pd_gov = {
>>>+	.power_down_ok = cpu_pd_down_ok,
>>>+	.stop_ok = cpu_stop_ok,
>>>+};
>>>+
>>>  static int cpu_pd_power_off(struct generic_pm_domain *genpd)
>>>  {
>>>  	struct cpu_pm_domain *pd = to_cpu_pd(genpd);
>>>@@ -183,7 +264,7 @@ int of_register_cpu_pm_domain(struct device_node *dn,
>>>
>>>  	/* Register the CPU genpd */
>>>  	pr_debug("adding %s as CPU PM domain.\n", pd->genpd->name);
>>>-	ret = of_pm_genpd_init(dn, pd->genpd, &simple_qos_governor, false);
>>>+	ret = of_pm_genpd_init(dn, pd->genpd, &cpu_pd_gov, false);
>>>  	if (ret) {
>>>  		pr_err("Unable to initialize domain %s\n", dn->full_name);
>>>  		return ret;
>>>--
>>>2.1.4
>>>
>

Patch

diff --git a/drivers/base/power/cpu-pd.c b/drivers/base/power/cpu-pd.c
index 617ce54..a00abc1 100644
--- a/drivers/base/power/cpu-pd.c
+++ b/drivers/base/power/cpu-pd.c
@@ -21,6 +21,7 @@ 
 #include <linux/pm_qos.h>
 #include <linux/rculist.h>
 #include <linux/slab.h>
+#include <linux/tick.h>
 
 #define CPU_PD_NAME_MAX 36
 
@@ -66,6 +67,86 @@  static void get_cpus_in_domain(struct generic_pm_domain *genpd,
 	}
 }
 
+static bool cpu_pd_down_ok(struct dev_pm_domain *pd)
+{
+	struct generic_pm_domain *genpd = pd_to_genpd(pd);
+	struct cpu_pm_domain *cpu_pd = to_cpu_pd(genpd);
+	int qos = pm_qos_request(PM_QOS_CPU_DMA_LATENCY);
+	u64 sleep_ns = ~0;
+	ktime_t earliest;
+	int cpu;
+	int i;
+
+	/* Reset the last set genpd state, default to index 0 */
+	genpd->state_idx = 0;
+
+	/* We dont want to power down, if QoS is 0 */
+	if (!qos)
+		return false;
+
+	/*
+	 * Find the sleep time for the cluster.
+	 * The time between now and the first wake up of any CPU that
+	 * are in this domain hierarchy is the time available for the
+	 * domain to be idle.
+	 */
+	earliest.tv64 = KTIME_MAX;
+	for_each_cpu_and(cpu, cpu_pd->cpus, cpu_online_mask) {
+		struct device *cpu_dev = get_cpu_device(cpu);
+		struct gpd_timing_data *td;
+
+		td = &dev_gpd_data(cpu_dev)->td;
+
+		if (earliest.tv64 < td->next_wakeup.tv64)
+			earliest = td->next_wakeup;
+	}
+
+	sleep_ns = ktime_to_ns(ktime_sub(earliest, ktime_get()));
+	if (sleep_ns <= 0)
+		return false;
+
+	/*
+	 * Find the deepest sleep state that satisfies the residency
+	 * requirement and the QoS constraint
+	 */
+	for (i = genpd->state_count - 1; i > 0; i--) {
+		u64 state_sleep_ns;
+
+		state_sleep_ns = genpd->states[i].power_off_latency_ns +
+			genpd->states[i].power_on_latency_ns +
+			genpd->states[i].residency_ns;
+
+		/*
+		 * If we cant sleep to save power in the state, move on
+		 * to the next lower idle state.
+		 */
+		if (state_sleep_ns > sleep_ns)
+		       continue;
+
+		/*
+		 * We also dont want to sleep more than we should to
+		 * gaurantee QoS.
+		 */
+		if (state_sleep_ns < (qos * NSEC_PER_USEC))
+			break;
+	}
+
+	if (i >= 0)
+		genpd->state_idx = i;
+
+	return  (i >= 0) ? true : false;
+}
+
+static bool cpu_stop_ok(struct device *dev)
+{
+	return true;
+}
+
+struct dev_power_governor cpu_pd_gov = {
+	.power_down_ok = cpu_pd_down_ok,
+	.stop_ok = cpu_stop_ok,
+};
+
 static int cpu_pd_power_off(struct generic_pm_domain *genpd)
 {
 	struct cpu_pm_domain *pd = to_cpu_pd(genpd);
@@ -183,7 +264,7 @@  int of_register_cpu_pm_domain(struct device_node *dn,
 
 	/* Register the CPU genpd */
 	pr_debug("adding %s as CPU PM domain.\n", pd->genpd->name);
-	ret = of_pm_genpd_init(dn, pd->genpd, &simple_qos_governor, false);
+	ret = of_pm_genpd_init(dn, pd->genpd, &cpu_pd_gov, false);
 	if (ret) {
 		pr_err("Unable to initialize domain %s\n", dn->full_name);
 		return ret;