
[RFCv7,03/10] sched: scheduler-driven cpu frequency selection

Message ID 1456190570-4475-4-git-send-email-smuckle@linaro.org (mailing list archive)
State RFC, archived

Commit Message

Steve Muckle Feb. 23, 2016, 1:22 a.m. UTC
From: Michael Turquette <mturquette@baylibre.com>

Scheduler-driven CPU frequency selection hopes to exploit both
per-task and global information in the scheduler to improve frequency
selection policy, achieving lower power consumption, improved
responsiveness/performance, and less reliance on heuristics and
tunables. For further discussion on the motivation of this integration
see [0].

This patch implements a shim layer between the Linux scheduler and the
cpufreq subsystem. The interface accepts capacity requests from the
CFS, RT and deadline sched classes. The requests from each sched class
are summed on each CPU with a margin applied to the CFS and RT
capacity requests to provide some headroom. Deadline requests are
expected to be precise enough given their nature to not require
headroom. The maximum total capacity request for a CPU in a frequency
domain drives the requested frequency for that domain.
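
As a worked example (illustrative numbers, using the 1280/1024 margin from
the code below): a CPU with a CFS request of 410 and an RT request of 102
is taken as (410 + 102) * 1280 / 1024 = 640 out of SCHED_CAPACITY_SCALE
(1024), plus any deadline request with no margin applied. If another CPU in
the same frequency domain totals 700, the domain frequency is driven by 700.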

Policy is determined by both the sched classes and this shim layer.

Note that this algorithm is event-driven. There is no polling loop to
check cpu idle time nor any other method which is unsynchronized with
the scheduler, aside from an optional throttling mechanism.

Thanks to Juri Lelli <juri.lelli@arm.com> for contributing design ideas,
code and test results, and to Ricky Liang <jcliang@chromium.org>
for initialization and static key inc/dec fixes.

[0] http://article.gmane.org/gmane.linux.kernel/1499836

[smuckle@linaro.org: various additions and fixes, revised commit text]

CC: Ricky Liang <jcliang@chromium.org>
Signed-off-by: Michael Turquette <mturquette@baylibre.com>
Signed-off-by: Juri Lelli <juri.lelli@arm.com>
Signed-off-by: Steve Muckle <smuckle@linaro.org>
---
 drivers/cpufreq/Kconfig      |  21 ++
 include/linux/cpufreq.h      |   3 +
 include/linux/sched.h        |   8 +
 kernel/sched/Makefile        |   1 +
 kernel/sched/cpufreq_sched.c | 459 +++++++++++++++++++++++++++++++++++++++++++
 kernel/sched/sched.h         |  51 +++++
 6 files changed, 543 insertions(+)
 create mode 100644 kernel/sched/cpufreq_sched.c

Comments

Rafael J. Wysocki Feb. 25, 2016, 3:55 a.m. UTC | #1
Hi,

I promised a review and here it goes.

Let me focus on this one as the rest seems to depend on it.

On Monday, February 22, 2016 05:22:43 PM Steve Muckle wrote:
> From: Michael Turquette <mturquette@baylibre.com>
> 
> Scheduler-driven CPU frequency selection hopes to exploit both
> per-task and global information in the scheduler to improve frequency
> selection policy, achieving lower power consumption, improved
> responsiveness/performance, and less reliance on heuristics and
> tunables. For further discussion on the motivation of this integration
> see [0].
> 
> This patch implements a shim layer between the Linux scheduler and the
> cpufreq subsystem. The interface accepts capacity requests from the
> CFS, RT and deadline sched classes. The requests from each sched class
> are summed on each CPU with a margin applied to the CFS and RT
> capacity requests to provide some headroom. Deadline requests are
> expected to be precise enough given their nature to not require
> headroom. The maximum total capacity request for a CPU in a frequency
> domain drives the requested frequency for that domain.
> 
> Policy is determined by both the sched classes and this shim layer.
> 
> Note that this algorithm is event-driven. There is no polling loop to
> check cpu idle time nor any other method which is unsynchronized with
> the scheduler, aside from an optional throttling mechanism.
> 
> Thanks to Juri Lelli <juri.lelli@arm.com> for contributing design ideas,
> code and test results, and to Ricky Liang <jcliang@chromium.org>
> for initialization and static key inc/dec fixes.
> 
> [0] http://article.gmane.org/gmane.linux.kernel/1499836
> 
> [smuckle@linaro.org: various additions and fixes, revised commit text]

Well, the changelog is still a bit terse in my view.  It should at least
describe the design somewhat (mention the static keys and how they are
used, etc.) and explain why things are done this way.

> CC: Ricky Liang <jcliang@chromium.org>
> Signed-off-by: Michael Turquette <mturquette@baylibre.com>
> Signed-off-by: Juri Lelli <juri.lelli@arm.com>
> Signed-off-by: Steve Muckle <smuckle@linaro.org>
> ---
>  drivers/cpufreq/Kconfig      |  21 ++
>  include/linux/cpufreq.h      |   3 +
>  include/linux/sched.h        |   8 +
>  kernel/sched/Makefile        |   1 +
>  kernel/sched/cpufreq_sched.c | 459 +++++++++++++++++++++++++++++++++++++++++++
>  kernel/sched/sched.h         |  51 +++++
>  6 files changed, 543 insertions(+)
>  create mode 100644 kernel/sched/cpufreq_sched.c
> 

[cut]

> diff --git a/include/linux/cpufreq.h b/include/linux/cpufreq.h
> index 93e1c1c..ce8b895 100644
> --- a/include/linux/cpufreq.h
> +++ b/include/linux/cpufreq.h
> @@ -495,6 +495,9 @@ extern struct cpufreq_governor cpufreq_gov_ondemand;
>  #elif defined(CONFIG_CPU_FREQ_DEFAULT_GOV_CONSERVATIVE)
>  extern struct cpufreq_governor cpufreq_gov_conservative;
>  #define CPUFREQ_DEFAULT_GOVERNOR	(&cpufreq_gov_conservative)
> +#elif defined(CONFIG_CPU_FREQ_DEFAULT_GOV_SCHED)
> +extern struct cpufreq_governor cpufreq_gov_sched;
> +#define CPUFREQ_DEFAULT_GOVERNOR	(&cpufreq_gov_sched)
>  #endif
>  
>  /*********************************************************************
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index a292c4b..27a6cd8 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -937,6 +937,14 @@ enum cpu_idle_type {
>  #define SCHED_CAPACITY_SHIFT	10
>  #define SCHED_CAPACITY_SCALE	(1L << SCHED_CAPACITY_SHIFT)
>  
> +struct sched_capacity_reqs {
> +	unsigned long cfs;
> +	unsigned long rt;
> +	unsigned long dl;
> +
> +	unsigned long total;
> +};

Without a comment explaining what this represents it is quite hard to
decode it.

> +
>  /*
>   * Wake-queues are lists of tasks with a pending wakeup, whose
>   * callers have already marked the task as woken internally,
> diff --git a/kernel/sched/Makefile b/kernel/sched/Makefile
> index 6768797..90ed832 100644
> --- a/kernel/sched/Makefile
> +++ b/kernel/sched/Makefile
> @@ -19,3 +19,4 @@ obj-$(CONFIG_SCHED_AUTOGROUP) += auto_group.o
>  obj-$(CONFIG_SCHEDSTATS) += stats.o
>  obj-$(CONFIG_SCHED_DEBUG) += debug.o
>  obj-$(CONFIG_CGROUP_CPUACCT) += cpuacct.o
> +obj-$(CONFIG_CPU_FREQ_GOV_SCHED) += cpufreq_sched.o
> diff --git a/kernel/sched/cpufreq_sched.c b/kernel/sched/cpufreq_sched.c
> new file mode 100644
> index 0000000..a113e4e
> --- /dev/null
> +++ b/kernel/sched/cpufreq_sched.c
> @@ -0,0 +1,459 @@
> +/*

Any chance to add one sentence about what's in the file?

Besides, governors traditionally go to drivers/cpufreq.  Why is this different?

> + *  Copyright (C)  2015 Michael Turquette <mturquette@linaro.org>
> + *  Copyright (C)  2015-2016 Steve Muckle <smuckle@linaro.org>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + */
> +
> +#include <linux/cpufreq.h>
> +#include <linux/module.h>
> +#include <linux/kthread.h>
> +#include <linux/percpu.h>
> +#include <linux/irq_work.h>
> +#include <linux/delay.h>
> +#include <linux/string.h>
> +
> +#include "sched.h"
> +
> +struct static_key __read_mostly __sched_freq = STATIC_KEY_INIT_FALSE;

Well, I'm not familiar with static keys and how they work, so you'll need to
explain this part to me.  I'm assuming that there is some magic related to
the value of the key that changes set_*_cpu_capacity() into no-ops when the
key is 0, presumably by modifying kernel code.

So this is clever but tricky and my question here is why it is necessary.

For example, compared to the RCU-based approach I'm using, how much better
is this?  Yes, there is some tiny overhead related to checking whether callbacks
are present and invoking them, but is it really so bad?  Can you actually
measure the difference in any realistic workload?

One thing I personally like in the RCU-based approach is its universality.  The
callbacks may be installed by different entities in a uniform way: intel_pstate
can do that, the old governors can do that, my experimental schedutil code can
do that and your code could have done that too in principle.  And this is very
nice, because it is a common underlying mechanism that can be used by everybody
regardless of their particular implementations on the other side.

Why would I want to use something different, then?

> +static bool __read_mostly cpufreq_driver_slow;
> +
> +/*
> + * The number of enabled schedfreq policies is modified during GOV_START/STOP.
> + * It, along with whether the schedfreq static key is enabled, is protected by
> + * the gov_enable_lock.
> + */

Well, it would be good to at least briefly explain the role of
enabled_policies.

> +static int enabled_policies;
> +static DEFINE_MUTEX(gov_enable_lock);
> +
> +#ifndef CONFIG_CPU_FREQ_DEFAULT_GOV_SCHED
> +static struct cpufreq_governor cpufreq_gov_sched;
> +#endif

I don't think you need the #ifndef any more after recent changes in linux-next.

> +
> +/*
> + * Capacity margin added to CFS and RT capacity requests to provide
> + * some head room if task utilization further increases.
> + */

OK, where does this number come from?

> +unsigned int capacity_margin = 1280;
> +
> +static DEFINE_PER_CPU(struct gov_data *, cpu_gov_data);
> +DEFINE_PER_CPU(struct sched_capacity_reqs, cpu_sched_capacity_reqs);
> +
> +/**
> + * gov_data - per-policy data internal to the governor
> + * @throttle: next throttling period expiry. Derived from throttle_nsec
> + * @throttle_nsec: throttle period length in nanoseconds
> + * @task: worker thread for dvfs transition that may block/sleep
> + * @irq_work: callback used to wake up worker thread
> + * @policy: pointer to cpufreq policy associated with this governor data
> + * @fastpath_lock: prevents multiple CPUs in a frequency domain from racing
> + * with each other in fast path during calculation of domain frequency
> + * @slowpath_lock: mutex used to synchronize with slow path - ensure policy
> + * remains enabled, and eliminate racing between slow and fast path
> + * @enabled: boolean value indicating that the policy is started, protected
> + * by the slowpath_lock
> + * @requested_freq: last frequency requested by the sched governor
> + *
> + * struct gov_data is the per-policy cpufreq_sched-specific data
> + * structure. A per-policy instance of it is created when the
> + * cpufreq_sched governor receives the CPUFREQ_GOV_POLICY_INIT
> + * condition and a pointer to it exists in the gov_data member of
> + * struct cpufreq_policy.
> + */
> +struct gov_data {
> +	ktime_t throttle;
> +	unsigned int throttle_nsec;
> +	struct task_struct *task;
> +	struct irq_work irq_work;
> +	struct cpufreq_policy *policy;
> +	raw_spinlock_t fastpath_lock;
> +	struct mutex slowpath_lock;
> +	unsigned int enabled;
> +	unsigned int requested_freq;
> +};
> +
> +static void cpufreq_sched_try_driver_target(struct cpufreq_policy *policy,
> +					    unsigned int freq)
> +{
> +	struct gov_data *gd = policy->governor_data;
> +
> +	__cpufreq_driver_target(policy, freq, CPUFREQ_RELATION_L);
> +	gd->throttle = ktime_add_ns(ktime_get(), gd->throttle_nsec);
> +}
> +
> +static bool finish_last_request(struct gov_data *gd)
> +{
> +	ktime_t now = ktime_get();
> +
> +	if (ktime_after(now, gd->throttle))
> +		return false;
> +
> +	while (1) {

I would write this as

	do {

> +		int usec_left = ktime_to_ns(ktime_sub(gd->throttle, now));
> +
> +		usec_left /= NSEC_PER_USEC;
> +		usleep_range(usec_left, usec_left + 100);
> +		now = ktime_get();

	} while (ktime_before(now, gd->throttle));

	return true;

But maybe that's just me. :-)
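
Putting that together, the whole function would read:

        static bool finish_last_request(struct gov_data *gd)
        {
                ktime_t now = ktime_get();

                if (ktime_after(now, gd->throttle))
                        return false;

                do {
                        int usec_left = ktime_to_ns(ktime_sub(gd->throttle, now));

                        usec_left /= NSEC_PER_USEC;
                        usleep_range(usec_left, usec_left + 100);
                        now = ktime_get();
                } while (ktime_before(now, gd->throttle));

                return true;
        }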

> +		if (ktime_after(now, gd->throttle))
> +			return true;
> +	}
> +}

I'm not a big fan of this throttling mechanism overall, but more about that later.

> +
> +static int cpufreq_sched_thread(void *data)
> +{

Now, what really is the advantage of having those extra threads vs using
workqueues?

I guess the underlying concern is that RT tasks may stall workqueues indefinitely
in theory and then the frequency won't be updated, but there's much more kernel
stuff run from workqueues and if that is starved, you won't get very far anyway.

If you take special measures to prevent frequency change requests from being
stalled by RT tasks, question is why are they so special?  Aren't there any
other kernel activities that also should be protected from that and may be
more important than CPU frequency changes?

Plus if this really is the problem here, then it also affects the other cpufreq
governors, so maybe it should be solved for everybody in some common way?

> +	struct sched_param param;
> +	struct gov_data *gd = (struct gov_data*) data;
> +	struct cpufreq_policy *policy = gd->policy;
> +	unsigned int new_request = 0;
> +	unsigned int last_request = 0;
> +	int ret;
> +
> +	param.sched_priority = 50;
> +	ret = sched_setscheduler_nocheck(gd->task, SCHED_FIFO, &param);
> +	if (ret) {
> +		pr_warn("%s: failed to set SCHED_FIFO\n", __func__);
> +		do_exit(-EINVAL);
> +	} else {
> +		pr_debug("%s: kthread (%d) set to SCHED_FIFO\n",
> +				__func__, gd->task->pid);
> +	}
> +
> +	mutex_lock(&gd->slowpath_lock);
> +
> +	while (true) {
> +		set_current_state(TASK_INTERRUPTIBLE);
> +		if (kthread_should_stop()) {
> +			set_current_state(TASK_RUNNING);
> +			break;
> +		}
> +		new_request = gd->requested_freq;
> +		if (!gd->enabled || new_request == last_request) {

This generally is a mistake.

You can't assume that if you had requested a CPU to enter a specific P-state,
that state was actually entered.  In the platform-coordinated case the hardware
(or firmware) may choose to ignore your request and you won't be told about
that.

For this reason, you generally need to make the request every time even if it
is identical to the previous one.  Even in the one-CPU-per-policy case there
may be HW coordination you don't know about.

> +			mutex_unlock(&gd->slowpath_lock);
> +			schedule();
> +			mutex_lock(&gd->slowpath_lock);
> +		} else {
> +			set_current_state(TASK_RUNNING);
> +			/*
> +			 * if the frequency thread sleeps while waiting to be
> +			 * unthrottled, start over to check for a newer request
> +			 */
> +			if (finish_last_request(gd))
> +				continue;
> +			last_request = new_request;
> +			cpufreq_sched_try_driver_target(policy, new_request);
> +		}
> +	}
> +
> +	mutex_unlock(&gd->slowpath_lock);
> +
> +	return 0;
> +}
> +
> +static void cpufreq_sched_irq_work(struct irq_work *irq_work)
> +{
> +	struct gov_data *gd;
> +
> +	gd = container_of(irq_work, struct gov_data, irq_work);
> +	if (!gd)
> +		return;
> +
> +	wake_up_process(gd->task);

I'm wondering what would be wrong with writing it as

	if (gd)
		wake_up_process(gd->task);

And can gd turn out to be NULL here in any case?

> +}
> +
> +static void update_fdomain_capacity_request(int cpu)
> +{
> +	unsigned int freq_new, index_new, cpu_tmp;
> +	struct cpufreq_policy *policy;
> +	struct gov_data *gd = per_cpu(cpu_gov_data, cpu);
> +	unsigned long capacity = 0;
> +
> +	if (!gd)
> +		return;
> +

Why is this check necessary?

> +	/* interrupts already disabled here via rq locked */
> +	raw_spin_lock(&gd->fastpath_lock);

Well, if you compare this with the one-CPU-per-policy path in my experimental
schedutil governor code with the "fast switch" patch on top, you'll notice that
it doesn't use any locks and/or atomic ops.  That's very much on purpose and
here's where your whole gain from using static keys practically goes away.

> +
> +	policy = gd->policy;
> +
> +	for_each_cpu(cpu_tmp, policy->cpus) {
> +		struct sched_capacity_reqs *scr;
> +
> +		scr = &per_cpu(cpu_sched_capacity_reqs, cpu_tmp);
> +		capacity = max(capacity, scr->total);
> +	}

You could save a few cycles from this in the case when the policy is not
shared.

> +
> +	freq_new = capacity * policy->max >> SCHED_CAPACITY_SHIFT;

Where does this formula come from?

> +
> +	/*
> +	 * Calling this without locking policy->rwsem means we race
> +	 * against changes with policy->min and policy->max. This should
> +	 * be okay though.
> +	 */
> +	if (cpufreq_frequency_table_target(policy, policy->freq_table,
> +					   freq_new, CPUFREQ_RELATION_L,
> +					   &index_new))
> +		goto out;

__cpufreq_driver_target() will call this again, so isn't calling it here
a bit wasteful?

> +	freq_new = policy->freq_table[index_new].frequency;
> +
> +	if (freq_new == gd->requested_freq)
> +		goto out;
> +

Again, the above generally is a mistake for reasons explained earlier.

> +	gd->requested_freq = freq_new;
> +
> +	if (cpufreq_driver_slow || !mutex_trylock(&gd->slowpath_lock)) {

This really doesn't look good to me.

Why is the mutex needed here in the first place?  cpufreq_sched_stop() should
be able to make sure that this function won't be run again for the policy
without using this lock.

> +		irq_work_queue_on(&gd->irq_work, cpu);

I hope that you are aware of the fact that irq_work_queue_on() explodes
on uniprocessor ARM32 if you run an SMP kernel on it?

And what happens in the !cpufreq_driver_is_slow() case when we don't
initialize the irq_work?

> +	} else if (policy->transition_ongoing ||
> +		   ktime_before(ktime_get(), gd->throttle)) {

If this really runs in the scheduler paths, you don't want to have ktime_get()
here.

> +		mutex_unlock(&gd->slowpath_lock);
> +		irq_work_queue_on(&gd->irq_work, cpu);

All right.

I think I have figured out how this is arranged, but I may be wrong. :-)

Here's my understanding of it.  If we are throttled, we don't just skip the
request.  Instead, we wake up the gd thread kind of in the hope that the
throttling may end when it actually wakes up.  So in fact we poke at the
gd thread on a regular basis asking it "Are you still throttled?" and that
happens on every call from the scheduler until the throttling is over if
I'm not mistaken.  This means that during throttling every call from the
scheduler generates an irq_work that wakes up the gd thread just to make it
check if it still is throttled and go to sleep again.  Please tell me
that I haven't understood this correctly.

The above aside, I personally think that rate limiting should happen at the source
and not at the worker thread level.  So if you're throttled, you should just
return immediately from this function without generating any more work.  That,
BTW, is what the sampling rate in my code is for.

> +	} else {
> +		cpufreq_sched_try_driver_target(policy, freq_new);

Well, this is supposed to be the fast path AFAICS.

Did you actually look at what __cpufreq_driver_target() does in general?
Including the wait_event() in cpufreq_freq_transition_begin() to mention just
one suspicious thing?  And how much overhead it generates in the most general
case?

No, running *that* from the fast path is not a good idea.  Quite honestly,
you'd need a new driver callback and a new way to run it from the cpufreq core
to implement this in a reasonably efficient way.

> +		mutex_unlock(&gd->slowpath_lock);
> +	}
> +
> +out:
> +	raw_spin_unlock(&gd->fastpath_lock);
> +}
> +
> +void update_cpu_capacity_request(int cpu, bool request)
> +{
> +	unsigned long new_capacity;
> +	struct sched_capacity_reqs *scr;
> +
> +	/* The rq lock serializes access to the CPU's sched_capacity_reqs. */
> +	lockdep_assert_held(&cpu_rq(cpu)->lock);
> +
> +	scr = &per_cpu(cpu_sched_capacity_reqs, cpu);
> +
> +	new_capacity = scr->cfs + scr->rt;
> +	new_capacity = new_capacity * capacity_margin
> +		/ SCHED_CAPACITY_SCALE;
> +	new_capacity += scr->dl;

Can you please explain the formula here?

> +
> +	if (new_capacity == scr->total)
> +		return;
> +

The same mistake as before.

> +	scr->total = new_capacity;
> +	if (request)
> +		update_fdomain_capacity_request(cpu);
> +}
> +
> +static ssize_t show_throttle_nsec(struct cpufreq_policy *policy, char *buf)
> +{
> +	struct gov_data *gd = policy->governor_data;
> +	return sprintf(buf, "%u\n", gd->throttle_nsec);
> +}
> +
> +static ssize_t store_throttle_nsec(struct cpufreq_policy *policy,
> +				   const char *buf, size_t count)
> +{
> +	struct gov_data *gd = policy->governor_data;
> +	unsigned int input;
> +	int ret;
> +
> +	ret = sscanf(buf, "%u", &input);
> +
> +	if (ret != 1)
> +		return -EINVAL;
> +
> +	gd->throttle_nsec = input;
> +	return count;
> +}
> +
> +static struct freq_attr sched_freq_throttle_nsec_attr =
> +	__ATTR(throttle_nsec, 0644, show_throttle_nsec, store_throttle_nsec);
> +
> +static struct attribute *sched_freq_sysfs_attribs[] = {
> +	&sched_freq_throttle_nsec_attr.attr,
> +	NULL
> +};
> +
> +static struct attribute_group sched_freq_sysfs_group = {
> +	.attrs = sched_freq_sysfs_attribs,
> +	.name = "sched_freq",
> +};
> +
> +static int cpufreq_sched_policy_init(struct cpufreq_policy *policy)
> +{
> +	struct gov_data *gd;
> +	int ret;
> +
> +	gd = kzalloc(sizeof(*gd), GFP_KERNEL);
> +	if (!gd)
> +		return -ENOMEM;
> +	policy->governor_data = gd;
> +	gd->policy = policy;
> +	raw_spin_lock_init(&gd->fastpath_lock);
> +	mutex_init(&gd->slowpath_lock);
> +
> +	ret = sysfs_create_group(&policy->kobj, &sched_freq_sysfs_group);
> +	if (ret)
> +		goto err_mem;
> +
> +	/*
> +	 * Set up schedfreq thread for slow path freq transitions if
> +	 * required by the driver.
> +	 */
> +	if (cpufreq_driver_is_slow()) {
> +		cpufreq_driver_slow = true;
> +		gd->task = kthread_create(cpufreq_sched_thread, gd,
> +					  "kschedfreq:%d",
> +					  cpumask_first(policy->related_cpus));
> +		if (IS_ERR_OR_NULL(gd->task)) {
> +			pr_err("%s: failed to create kschedfreq thread\n",
> +			       __func__);
> +			goto err_sysfs;
> +		}
> +		get_task_struct(gd->task);
> +		kthread_bind_mask(gd->task, policy->related_cpus);
> +		wake_up_process(gd->task);
> +		init_irq_work(&gd->irq_work, cpufreq_sched_irq_work);
> +	}
> +	return 0;
> +
> +err_sysfs:
> +	sysfs_remove_group(&policy->kobj, &sched_freq_sysfs_group);
> +err_mem:
> +	policy->governor_data = NULL;
> +	kfree(gd);
> +	return -ENOMEM;
> +}
> +
> +static int cpufreq_sched_policy_exit(struct cpufreq_policy *policy)
> +{
> +	struct gov_data *gd = policy->governor_data;
> +
> +	/* Stop the schedfreq thread associated with this policy. */
> +	if (cpufreq_driver_slow) {
> +		kthread_stop(gd->task);
> +		put_task_struct(gd->task);
> +	}
> +	sysfs_remove_group(&policy->kobj, &sched_freq_sysfs_group);
> +	policy->governor_data = NULL;
> +	kfree(gd);
> +	return 0;
> +}
> +
> +static int cpufreq_sched_start(struct cpufreq_policy *policy)
> +{
> +	struct gov_data *gd = policy->governor_data;
> +	int cpu;
> +
> +	/*
> +	 * The schedfreq static key is managed here so the global schedfreq
> +	 * lock must be taken - a per-policy lock such as policy->rwsem is
> +	 * not sufficient.
> +	 */
> +	mutex_lock(&gov_enable_lock);
> +
> +	gd->enabled = 1;
> +
> +	/*
> +	 * Set up percpu information. Writing the percpu gd pointer will
> +	 * enable the fast path if the static key is already enabled.
> +	 */
> +	for_each_cpu(cpu, policy->cpus) {
> +		memset(&per_cpu(cpu_sched_capacity_reqs, cpu), 0,
> +		       sizeof(struct sched_capacity_reqs));
> +		per_cpu(cpu_gov_data, cpu) = gd;
> +	}
> +
> +	if (enabled_policies == 0)
> +		static_key_slow_inc(&__sched_freq);
> +	enabled_policies++;
> +	mutex_unlock(&gov_enable_lock);
> +
> +	return 0;
> +}
> +
> +static void dummy(void *info) {}
> +
> +static int cpufreq_sched_stop(struct cpufreq_policy *policy)
> +{
> +	struct gov_data *gd = policy->governor_data;
> +	int cpu;
> +
> +	/*
> +	 * The schedfreq static key is managed here so the global schedfreq
> +	 * lock must be taken - a per-policy lock such as policy->rwsem is
> +	 * not sufficient.
> +	 */
> +	mutex_lock(&gov_enable_lock);
> +
> +	/*
> +	 * The governor stop path may or may not hold policy->rwsem. There
> +	 * must be synchronization with the slow path however.
> +	 */
> +	mutex_lock(&gd->slowpath_lock);
> +
> +	/*
> +	 * Stop new entries into the hot path for all CPUs. This will
> +	 * potentially affect other policies which are still running but
> +	 * this is an infrequent operation.
> +	 */
> +	static_key_slow_dec(&__sched_freq);
> +	enabled_policies--;
> +
> +	/*
> +	 * Ensure that all CPUs currently part of this policy are out
> +	 * of the hot path so that if this policy exits we can free gd.
> +	 */
> +	preempt_disable();
> +	smp_call_function_many(policy->cpus, dummy, NULL, true);
> +	preempt_enable();

I'm not sure how this works, can you please tell me?

> +
> +	/*
> +	 * Other CPUs in other policies may still have the schedfreq
> +	 * static key enabled. The percpu gd is used to signal which
> +	 * CPUs are enabled in the sched gov during the hot path.
> +	 */
> +	for_each_cpu(cpu, policy->cpus)
> +		per_cpu(cpu_gov_data, cpu) = NULL;
> +
> +	/* Pause the slow path for this policy. */
> +	gd->enabled = 0;
> +
> +	if (enabled_policies)
> +		static_key_slow_inc(&__sched_freq);
> +	mutex_unlock(&gd->slowpath_lock);
> +	mutex_unlock(&gov_enable_lock);
> +
> +	return 0;
> +}
> +
> +static int cpufreq_sched_setup(struct cpufreq_policy *policy,
> +			       unsigned int event)
> +{
> +	switch (event) {
> +	case CPUFREQ_GOV_POLICY_INIT:
> +		return cpufreq_sched_policy_init(policy);
> +	case CPUFREQ_GOV_POLICY_EXIT:
> +		return cpufreq_sched_policy_exit(policy);
> +	case CPUFREQ_GOV_START:
> +		return cpufreq_sched_start(policy);
> +	case CPUFREQ_GOV_STOP:
> +		return cpufreq_sched_stop(policy);
> +	case CPUFREQ_GOV_LIMITS:
> +		break;
> +	}
> +	return 0;
> +}
> +
> +#ifndef CONFIG_CPU_FREQ_DEFAULT_GOV_SCHED
> +static
> +#endif
> +struct cpufreq_governor cpufreq_gov_sched = {
> +	.name			= "sched",
> +	.governor		= cpufreq_sched_setup,
> +	.owner			= THIS_MODULE,
> +};
> +
> +static int __init cpufreq_sched_init(void)
> +{
> +	return cpufreq_register_governor(&cpufreq_gov_sched);
> +}
> +
> +/* Try to make this the default governor */
> +fs_initcall(cpufreq_sched_init);

I have no comments to the rest of the patch.

Thanks,
Rafael

Peter Zijlstra Feb. 25, 2016, 9:21 a.m. UTC | #2
On Thu, Feb 25, 2016 at 04:55:57AM +0100, Rafael J. Wysocki wrote:
> Well, I'm not familiar with static keys and how they work, so you'll need to
> explain this part to me. 

See include/linux/jump_label.h, it has lots of text on them. There is
also Documentation/static-keys.txt
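
For reference, a minimal sketch of the pattern as used in this patch
(do_rare_work() is a made-up stand-in):

        #include <linux/jump_label.h>

        static struct static_key key = STATIC_KEY_INIT_FALSE;

        static void do_rare_work(void) { /* hypothetical slow-path work */ }

        /* Hot path: the branch is patched to a NOP while the key is false. */
        static void hot_path(void)
        {
                if (static_key_false(&key))
                        do_rare_work();
        }

        /* Slow path: inc/dec rewrite the hot-path branch at runtime. */
        static void feature_enable(void)  { static_key_slow_inc(&key); }
        static void feature_disable(void) { static_key_slow_dec(&key); }
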
Peter Zijlstra Feb. 25, 2016, 9:28 a.m. UTC | #3
On Thu, Feb 25, 2016 at 04:55:57AM +0100, Rafael J. Wysocki wrote:
> > +static void dummy(void *info) {}
> > +
> > +static int cpufreq_sched_stop(struct cpufreq_policy *policy)
> > +{
> > +	struct gov_data *gd = policy->governor_data;
> > +	int cpu;
> > +
> > +	/*
> > +	 * The schedfreq static key is managed here so the global schedfreq
> > +	 * lock must be taken - a per-policy lock such as policy->rwsem is
> > +	 * not sufficient.
> > +	 */
> > +	mutex_lock(&gov_enable_lock);
> > +
> > +	/*
> > +	 * The governor stop path may or may not hold policy->rwsem. There
> > +	 * must be synchronization with the slow path however.
> > +	 */
> > +	mutex_lock(&gd->slowpath_lock);
> > +
> > +	/*
> > +	 * Stop new entries into the hot path for all CPUs. This will
> > +	 * potentially affect other policies which are still running but
> > +	 * this is an infrequent operation.
> > +	 */
> > +	static_key_slow_dec(&__sched_freq);
> > +	enabled_policies--;
> > +
> > +	/*
> > +	 * Ensure that all CPUs currently part of this policy are out
> > +	 * of the hot path so that if this policy exits we can free gd.
> > +	 */
> > +	preempt_disable();
> > +	smp_call_function_many(policy->cpus, dummy, NULL, true);
> > +	preempt_enable();
> 
> I'm not sure how this works, can you please tell me?

I think it relies on the fact that rq->lock disables IRQs, so if we've
managed to IPI all relevant CPUs, it means they cannot be inside a
rq->lock section.

It's vile though; one should not spray IPIs if one can avoid it. Such
things are much better done with RCU. Sure, sync_sched() takes a little
longer, but this isn't a fast path by any measure.
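
Concretely, the stop path could quiesce readers without IPIs along these
lines (a sketch, relying on the hot path running entirely under rq->lock):

        /* Stop new entries into the hot path for all CPUs. */
        static_key_slow_dec(&__sched_freq);
        for_each_cpu(cpu, policy->cpus)
                per_cpu(cpu_gov_data, cpu) = NULL;

        /*
         * Wait for CPUs already in the hot path to leave it; they run
         * under rq->lock with IRQs (and thus preemption) disabled, which
         * is an RCU-sched read-side section.
         */
        synchronize_sched();

        /* gd is no longer reachable from the hot path and may be freed. */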

> > +
> > +	/*
> > +	 * Other CPUs in other policies may still have the schedfreq
> > +	 * static key enabled. The percpu gd is used to signal which
> > +	 * CPUs are enabled in the sched gov during the hot path.
> > +	 */
> > +	for_each_cpu(cpu, policy->cpus)
> > +		per_cpu(cpu_gov_data, cpu) = NULL;
> > +
> > +	/* Pause the slow path for this policy. */
> > +	gd->enabled = 0;
> > +
> > +	if (enabled_policies)
> > +		static_key_slow_inc(&__sched_freq);
> > +	mutex_unlock(&gd->slowpath_lock);
> > +	mutex_unlock(&gov_enable_lock);
> > +
> > +	return 0;
> > +}
Rafael J. Wysocki Feb. 25, 2016, 11:04 a.m. UTC | #4
On Thursday, February 25, 2016 04:55:57 AM Rafael J. Wysocki wrote:
> Hi,
> 

[cut]

> > +	while (true) {
> > +		set_current_state(TASK_INTERRUPTIBLE);
> > +		if (kthread_should_stop()) {
> > +			set_current_state(TASK_RUNNING);
> > +			break;
> > +		}
> > +		new_request = gd->requested_freq;
> > +		if (!gd->enabled || new_request == last_request) {
> 
> This generally is a mistake.
> 
> You can't assume that if you had requested a CPU to enter a specific P-state,
> that state was actually entered.  In the platform-coordinated case the hardware
> (or firmware) may choose to ignore your request and you won't be told about
> that.
> 
> For this reason, you generally need to make the request every time even if it
> is identical to the previous one.  Even in the one-CPU-per-policy case there
> may be HW coordination you don't know about.

Please scratch this bit, actually (and analogous comments below).

It is reasonable to expect that the platform will remember your last request,
so you don't have to repeat it.

Which means that I can optimize the schedutil thing somewhat. :-)

Thanks,
Rafael

Rafael J. Wysocki Feb. 25, 2016, 9:04 p.m. UTC | #5
On Thursday, February 25, 2016 10:21:50 AM Peter Zijlstra wrote:
> On Thu, Feb 25, 2016 at 04:55:57AM +0100, Rafael J. Wysocki wrote:
> > Well, I'm not familiar with static keys and how they work, so you'll need to
> > explain this part to me. 
> 
> See include/linux/jump_label.h, it has lots of text on them. There is
> also Documentation/static-keys.txt

Thanks for the pointers!

It looks like the author of the $subject patch hasn't looked at the latter
document lately.

In any case, IMO this might be used to hide the cpufreq_update_util() call
sites from the scheduler code in case no one has set anything via
cpufreq_set_update_util_data() for any CPUs.

That essentially is when things like the performance governor are in use,
so may be worth doing, but that's your judgement call mostly. :-)
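
E.g. something like this (a sketch; cpufreq_util_key is a made-up name and
set/clear calls are assumed to stay balanced):

        static struct static_key cpufreq_util_key = STATIC_KEY_INIT_FALSE;

        void cpufreq_set_update_util_data(int cpu, struct update_util_data *data)
        {
                if (data)
                        static_key_slow_inc(&cpufreq_util_key);
                else
                        static_key_slow_dec(&cpufreq_util_key);
                rcu_assign_pointer(per_cpu(cpufreq_update_util_data, cpu), data);
        }

        /* ... and at the scheduler call sites: */
        if (static_key_false(&cpufreq_util_key))
                cpufreq_update_util(time, util, max);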

Thanks,
Rafael

Rafael J. Wysocki Feb. 25, 2016, 9:08 p.m. UTC | #6
On Thursday, February 25, 2016 10:28:37 AM Peter Zijlstra wrote:
> On Thu, Feb 25, 2016 at 04:55:57AM +0100, Rafael J. Wysocki wrote:
> > > +static void dummy(void *info) {}
> > > +
> > > +static int cpufreq_sched_stop(struct cpufreq_policy *policy)
> > > +{
> > > +	struct gov_data *gd = policy->governor_data;
> > > +	int cpu;
> > > +
> > > +	/*
> > > +	 * The schedfreq static key is managed here so the global schedfreq
> > > +	 * lock must be taken - a per-policy lock such as policy->rwsem is
> > > +	 * not sufficient.
> > > +	 */
> > > +	mutex_lock(&gov_enable_lock);
> > > +
> > > +	/*
> > > +	 * The governor stop path may or may not hold policy->rwsem. There
> > > +	 * must be synchronization with the slow path however.
> > > +	 */
> > > +	mutex_lock(&gd->slowpath_lock);
> > > +
> > > +	/*
> > > +	 * Stop new entries into the hot path for all CPUs. This will
> > > +	 * potentially affect other policies which are still running but
> > > +	 * this is an infrequent operation.
> > > +	 */
> > > +	static_key_slow_dec(&__sched_freq);
> > > +	enabled_policies--;
> > > +
> > > +	/*
> > > +	 * Ensure that all CPUs currently part of this policy are out
> > > +	 * of the hot path so that if this policy exits we can free gd.
> > > +	 */
> > > +	preempt_disable();
> > > +	smp_call_function_many(policy->cpus, dummy, NULL, true);
> > > +	preempt_enable();
> > 
> > I'm not sure how this works, can you please tell me?
> 
> I think it relies on the fact that rq->lock disables IRQs, so if we've
> managed to IPI all relevant CPUs, it means they cannot be inside a
> rq->lock section.
> 
> Its vile though; one should not spray IPIs if one can avoid it. Such
> things are much better done with RCU. Sure sync_sched() takes a little
> longer, but this isn't a fast path by any measure.

I see, thanks!

BTW, when cpufreq_update_util() callbacks are removed, I use synchronize_rcu()
to wait for the running ones, but would it be better to use synchronize_sched()
in there instead?

Thanks,
Rafael

Steve Muckle Feb. 26, 2016, 12:34 a.m. UTC | #7
On 02/24/2016 07:55 PM, Rafael J. Wysocki wrote:
> Hi,
> 
> I promised a review and here it goes.

Thanks Rafael for your detailed review.

> 
> Let me focus on this one as the rest seems to depend on it.
> 
> On Monday, February 22, 2016 05:22:43 PM Steve Muckle wrote:
>> From: Michael Turquette <mturquette@baylibre.com>
>>
>> Scheduler-driven CPU frequency selection hopes to exploit both
>> per-task and global information in the scheduler to improve frequency
>> selection policy, achieving lower power consumption, improved
>> responsiveness/performance, and less reliance on heuristics and
>> tunables. For further discussion on the motivation of this integration
>> see [0].
>>
>> This patch implements a shim layer between the Linux scheduler and the
>> cpufreq subsystem. The interface accepts capacity requests from the
>> CFS, RT and deadline sched classes. The requests from each sched class
>> are summed on each CPU with a margin applied to the CFS and RT
>> capacity requests to provide some headroom. Deadline requests are
>> expected to be precise enough given their nature to not require
>> headroom. The maximum total capacity request for a CPU in a frequency
>> domain drives the requested frequency for that domain.
>>
>> Policy is determined by both the sched classes and this shim layer.
>>
>> Note that this algorithm is event-driven. There is no polling loop to
>> check cpu idle time nor any other method which is unsynchronized with
>> the scheduler, aside from an optional throttling mechanism.
>>
>> Thanks to Juri Lelli <juri.lelli@arm.com> for contributing design ideas,
>> code and test results, and to Ricky Liang <jcliang@chromium.org>
>> for initialization and static key inc/dec fixes.
>>
>> [0] http://article.gmane.org/gmane.linux.kernel/1499836
>>
>> [smuckle@linaro.org: various additions and fixes, revised commit text]
> 
> Well, the changelog is still a bit terse in my view.  It should at least
> describe the design somewhat (mention the static keys and how they are
> used, etc.) and explain why things are done this way.

Sure. Will add more, including the details you mention.

> 
...
>>  /*********************************************************************
>> diff --git a/include/linux/sched.h b/include/linux/sched.h
>> index a292c4b..27a6cd8 100644
>> --- a/include/linux/sched.h
>> +++ b/include/linux/sched.h
>> @@ -937,6 +937,14 @@ enum cpu_idle_type {
>>  #define SCHED_CAPACITY_SHIFT	10
>>  #define SCHED_CAPACITY_SCALE	(1L << SCHED_CAPACITY_SHIFT)
>>  
>> +struct sched_capacity_reqs {
>> +	unsigned long cfs;
>> +	unsigned long rt;
>> +	unsigned long dl;
>> +
>> +	unsigned long total;
>> +};
> 
> Without a comment explaining what this represents it is quite hard to
> decode it.

This is the per-CPU representation of capacity requests from each sched
class. The scale of the cfs, rt, and dl capacity numbers are 0 to
SCHED_CAPACITY_SCALE. I'll add a comment here.
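
E.g. something like:

        /*
         * sched_capacity_reqs - per-CPU capacity requests from the sched
         * classes, all on the 0..SCHED_CAPACITY_SCALE scale.
         * @cfs:   utilization-based request from the CFS class
         * @rt:    request from the RT class
         * @dl:    bandwidth-based request from the deadline class
         * @total: cfs + rt with the headroom margin applied, plus dl;
         *         this drives the actual frequency selection
         */
        struct sched_capacity_reqs {
                unsigned long cfs;
                unsigned long rt;
                unsigned long dl;

                unsigned long total;
        };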

> 
>> +
>>  /*
>>   * Wake-queues are lists of tasks with a pending wakeup, whose
>>   * callers have already marked the task as woken internally,
>> diff --git a/kernel/sched/Makefile b/kernel/sched/Makefile
>> index 6768797..90ed832 100644
>> --- a/kernel/sched/Makefile
>> +++ b/kernel/sched/Makefile
>> @@ -19,3 +19,4 @@ obj-$(CONFIG_SCHED_AUTOGROUP) += auto_group.o
>>  obj-$(CONFIG_SCHEDSTATS) += stats.o
>>  obj-$(CONFIG_SCHED_DEBUG) += debug.o
>>  obj-$(CONFIG_CGROUP_CPUACCT) += cpuacct.o
>> +obj-$(CONFIG_CPU_FREQ_GOV_SCHED) += cpufreq_sched.o
>> diff --git a/kernel/sched/cpufreq_sched.c b/kernel/sched/cpufreq_sched.c
>> new file mode 100644
>> index 0000000..a113e4e
>> --- /dev/null
>> +++ b/kernel/sched/cpufreq_sched.c
>> @@ -0,0 +1,459 @@
>> +/*
> 
> Any chance to add one sentence about what's in the file?

Sure, will do.

> 
> Besides, governors traditionally go to drivers/cpufreq.  Why is this different?

This predates my involvement but I'm guessing this was done because the
long term vision was for cpufreq to eventually be removed from the
picture. But it may be quite some time before that happens - I think
communicating directly with drivers may be difficult due to the need to
synchronize with thermal or userspace min/max requests, etc., plus there's
the fact that we've got a longstanding API exposed to userspace.

I'm happy to move this to drivers/cpufreq.

> 
>> + *  Copyright (C)  2015 Michael Turquette <mturquette@linaro.org>
>> + *  Copyright (C)  2015-2016 Steve Muckle <smuckle@linaro.org>
>> + *
>> + * This program is free software; you can redistribute it and/or modify
>> + * it under the terms of the GNU General Public License version 2 as
>> + * published by the Free Software Foundation.
>> + */
>> +
>> +#include <linux/cpufreq.h>
>> +#include <linux/module.h>
>> +#include <linux/kthread.h>
>> +#include <linux/percpu.h>
>> +#include <linux/irq_work.h>
>> +#include <linux/delay.h>
>> +#include <linux/string.h>
>> +
>> +#include "sched.h"
>> +
>> +struct static_key __read_mostly __sched_freq = STATIC_KEY_INIT_FALSE;
> 
> Well, I'm not familiar with static keys and how they work, so you'll need to
> explain this part to me.  I'm assuming that there is some magic related to
> the value of the key that changes set_*_cpu_capacity() into no-ops when the
> key is 0, presumably by modifying kernel code.

That's correct.

> 
> So this is clever but tricky and my question here is why it is necessary.
> 
> For example, compared to the RCU-based approach I'm using, how much better
> is this?  Yes, there is some tiny overhead related to checking whether callbacks
> are present and invoking them, but is it really so bad?  Can you actually
> measure the difference in any realistic workload?

I have not tried to measure any difference. We erred on the side of
caution, aiming to avoid any impact on the scheduler hot paths where
possible.

> One thing I personally like in the RCU-based approach is its universality.  The
> callbacks may be installed by different entities in a uniform way: intel_pstate
> can do that, the old governors can do that, my experimental schedutil code can
> do that and your code could have done that too in principle.  And this is very
> nice, because it is a common underlying mechanism that can be used by everybody
> regardless of their particular implementations on the other side.
> 
> Why would I want to use something different, then?

I've got nothing against a callback registration mechanism. As you
mentioned in another mail it could itself use static keys, enabling the
static key when a callback is registered and disabling it otherwise to
avoid calling into cpufreq_update_util().

> 
>> +static bool __read_mostly cpufreq_driver_slow;
>> +
>> +/*
>> + * The number of enabled schedfreq policies is modified during GOV_START/STOP.
>> + * It, along with whether the schedfreq static key is enabled, is protected by
>> + * the gov_enable_lock.
>> + */
> 
> Well, it would be good to at least briefly explain the role of
> enabled_policies.

It's just that if the number of enabled policies is > 0, the static key
must be enabled so that the callbacks work. I'll clarify that in the
comment here.

> 
>> +static int enabled_policies;
>> +static DEFINE_MUTEX(gov_enable_lock);
>> +
>> +#ifndef CONFIG_CPU_FREQ_DEFAULT_GOV_SCHED
>> +static struct cpufreq_governor cpufreq_gov_sched;
>> +#endif
> 
> I don't think you need the #ifndef any more after recent changes in linux-next.

Ah, I've been working on top of tip/sched/core. I should probably move
over to linux-next. Will do so and check this.

> 
>> +
>> +/*
>> + * Capacity margin added to CFS and RT capacity requests to provide
>> + * some head room if task utilization further increases.
>> + */
> 
> OK, where does this number come from?

Someone's posterior. :)

This really should be a tunable IMO, but there's a fairly strong
anti-tunable sentiment, so it's been left hard-coded in an attempt to
provide something that "just works."

At the least I can add a comment saying that the 20% idle headroom
requirement was an off the cuff estimate and that at this time, we don't
have significant data to suggest it's the best number.
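
Spelling out the arithmetic: a request C becomes C * 1280 / 1024 = 1.25 * C,
so 25% is added on top of the measured utilization; equivalently, the
utilization ends up at 1024/1280 = 80% of the requested capacity, leaving
20% headroom.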

> 
...
>> +static bool finish_last_request(struct gov_data *gd)
>> +{
>> +	ktime_t now = ktime_get();
>> +
>> +	if (ktime_after(now, gd->throttle))
>> +		return false;
>> +
>> +	while (1) {
> 
> I would write this as
> 
> 	do {
> 
>> +		int usec_left = ktime_to_ns(ktime_sub(gd->throttle, now));
>> +
>> +		usec_left /= NSEC_PER_USEC;
>> +		usleep_range(usec_left, usec_left + 100);
>> +		now = ktime_get();
> 
> 	} while (ktime_before(now, gd->throttle));
> 
> 	return true;
> 
> But maybe that's just me. :-)

I agree that's cleaner, will change it.

> 
>> +		if (ktime_after(now, gd->throttle))
>> +			return true;
>> +	}
>> +}
> 
> I'm not a big fan of this throttling mechanism overall, but more about that later.
> 
>> +
>> +static int cpufreq_sched_thread(void *data)
>> +{
> 
> Now, what really is the advantage of having those extra threads vs using
> workqueues?
> 
> I guess the underlying concern is that RT tasks may stall workqueues indefinitely
> in theory and then the frequency won't be updated, but there's much more kernel
> stuff run from workqueues and if that is starved, you won't get very far anyway.
> 
> If you take special measures to prevent frequency change requests from being
> stalled by RT tasks, question is why are they so special?  Aren't there any
> other kernel activities that also should be protected from that and may be
> more important than CPU frequency changes?

I think updating the CPU frequency during periods of heavy RT/DL load is
one of the most (if not the most) important things. I can't speak for
other system activities that may get blocked, but there's an opportunity
to protect CPU frequency changes here, and it seems worth taking to me.

> 
> Plus if this really is the problem here, then it also affects the other cpufreq
> governors, so maybe it should be solved for everybody in some common way?

Agreed, I'd think a freq change thread that serves frequency change
requests would be a useful common component. The locking and throttling
(slowpath_lock, finish_last_request()) are somewhat specific to this
implementation, but could probably be done generically and maybe even
used in other governors. If you're okay with it though I'd like to view
that as a slightly longer term effort, as I think it would get unwieldy
trying to do that as part of this initial change.

> 
...
>> +
>> +static void cpufreq_sched_irq_work(struct irq_work *irq_work)
>> +{
>> +	struct gov_data *gd;
>> +
>> +	gd = container_of(irq_work, struct gov_data, irq_work);
>> +	if (!gd)
>> +		return;
>> +
>> +	wake_up_process(gd->task);
> 
> I'm wondering what would be wrong with writing it as
> 
> 	if (gd)
> 		wake_up_process(gd->task);
> 
> And can gd turn out to be NULL here in any case?

In practice I don't think this would ever happen, but there's nothing
that would guarantee the policy can't be stopped and exit while one of
these irq_works is in flight. That would free not only gd but also the
irq_work structure itself.

Rather than check if gd is NULL here I think synchronization is required
to flush an in flight irq_work when the policy is being stopped. I will
add an irq_work_sync in the policy stop path and remove the NULL check.

> 
>> +}
>> +
>> +static void update_fdomain_capacity_request(int cpu)
>> +{
>> +	unsigned int freq_new, index_new, cpu_tmp;
>> +	struct cpufreq_policy *policy;
>> +	struct gov_data *gd = per_cpu(cpu_gov_data, cpu);
>> +	unsigned long capacity = 0;
>> +
>> +	if (!gd)
>> +		return;
>> +
> 
> Why is this check necessary?

As soon as one policy in the system uses scheduler-guided frequency the
static key will be enabled and all CPUs will be calling into the
scheduler hooks and can potentially come through this path. But some
CPUs may not be part of a scheduler-guided frequency policy. Those CPUs
will have a NULL gov_data pointer.

That being said, I think this test should be moved up into
update_cpu_capacity_request() to avoid that computation when it is not
required. Better still would be bailing when required in the
set_*_cpu_capacity() macros in kernel/sched/sched.h. I'll see if I can
do that instead.

> 
>> +	/* interrupts already disabled here via rq locked */
>> +	raw_spin_lock(&gd->fastpath_lock);
> 
> Well, if you compare this with the one-CPU-per-policy path in my experimental
> schedutil governor code with the "fast switch" patch on top, you'll notice that
> it doesn't use any locks and/or atomic ops.  That's very much on purpose and
> here's where your whole gain from using static keys practically goes away.

Sure, I like the idea of special fast callbacks for the simple case of
UP frequency domains and will add it to the list of things to do here.

The static key is really a separate issue IMO as it's about minimizing
the impact of this feature on scheduler hot paths when schedfreq is not
enabled.

> 
>> +
>> +	policy = gd->policy;
>> +
>> +	for_each_cpu(cpu_tmp, policy->cpus) {
>> +		struct sched_capacity_reqs *scr;
>> +
>> +		scr = &per_cpu(cpu_sched_capacity_reqs, cpu_tmp);
>> +		capacity = max(capacity, scr->total);
>> +	}
> 
> You could save a few cycles from this in the case when the policy is not
> shared.

Agreed. Will look at making special UP paths.

> 
>> +
>> +	freq_new = capacity * policy->max >> SCHED_CAPACITY_SHIFT;
> 
> Where does this formula come from?

Capacity here is 0 to SCHED_CAPACITY_SCALE, so this is translating the
capacity request to a frequency request via
(capacity/SCHED_CAPACITY_SCALE) * policy->max. I'll add a comment to
this effect.
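
A quick worked example: with capacity = 768 and policy->max = 2000000 (kHz),
freq_new = 768 * 2000000 >> 10 = 1500000, i.e. a request for 75% of capacity
maps to 75% of the maximum frequency.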

The race with policy->max potentially changing also deserves a comment.
If policy->max changes say just before we read it here to do this
translation, the scheduler PELT numbers will still largely be based on
the old policy->max value, because they are an exponential moving
average and will take time to re-adjust. So at least for now I'm
ignoring this race as I don't think it's really meaningful to attempt
any synchronization.

> 
>> +
>> +	/*
>> +	 * Calling this without locking policy->rwsem means we race
>> +	 * against changes with policy->min and policy->max. This should
>> +	 * be okay though.
>> +	 */
>> +	if (cpufreq_frequency_table_target(policy, policy->freq_table,
>> +					   freq_new, CPUFREQ_RELATION_L,
>> +					   &index_new))
>> +		goto out;
> 
> __cpufreq_driver_target() will call this again, so isn't calling it here
> a bit wasteful?

I wanted to avoid waking up the frequency change thread (an expensive
operation) whenever possible, or even making an unnecessary fastpath
frequency request, so translating the raw frequency request to a
supported target frequency allows us to bail if the actual requested
target frequency will end up being the same as the current one. I
thought this was more valuable than the extra table lookup here.

Actually, I could make this better by storing the next lower frequency
than the current frequency as well as the current frequency - this would
allow me to do a much simpler test to see if we'd end up requesting the
same frequency or something different.
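
Roughly like this (freq_below being a hypothetical field caching the table
frequency just below the currently requested one):

        /*
         * Sketch: with CPUFREQ_RELATION_L, a raw request maps to the
         * currently requested table entry iff it falls within
         * (freq_below, requested_freq], so no table walk is needed.
         */
        if (freq_new > gd->freq_below && freq_new <= gd->requested_freq)
                goto out;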

> 
>> +	freq_new = policy->freq_table[index_new].frequency;
>> +
>> +	if (freq_new == gd->requested_freq)
>> +		goto out;
>> +
> 
> Again, the above generally is a mistake for reasons explained earlier.

(skipping per the other email)

> 
>> +	gd->requested_freq = freq_new;
>> +
>> +	if (cpufreq_driver_slow || !mutex_trylock(&gd->slowpath_lock)) {
> 
> This really doesn't look good to me.
> 
> Why is the mutex needed here in the first place?  cpufreq_sched_stop() should
> be able to make sure that this function won't be run again for the policy
> without using this lock.

The acquisition of the slowpath_lock here isn't for synchronizing with
the policy being stopped - that's handled differently via the smp call
in the stop routine.

Rather it may be the case that schedfreq runs both in the fast and slow
path on a target (maybe because of throttling, or because a previously
started asynchronous transition isn't done yet). If that is so, then
when the slow path is active, I do not want to attempt a transition in
the fast path.

> 
>> +		irq_work_queue_on(&gd->irq_work, cpu);
> 
> I hope that you are aware of the fact that irq_work_queue_on() explodes
> on uniprocessor ARM32 if you run an SMP kernel on it?

No, I wasn't. Fortunately I think switching it to irq_work_queue() to
run the irq_work on the same CPU as the fast path should be fine
(assuming there's no similar landmine there).

> 
> And what happens in the !cpufreq_driver_is_slow() case when we don't
> initialize the irq_work?

That's a bug, the irq_work should always be initialized. Will fix.

> 
>> +	} else if (policy->transition_ongoing ||
>> +		   ktime_before(ktime_get(), gd->throttle)) {
> 
> If this really runs in the scheduler paths, you don't want to have ktime_get()
> here.

I'll switch this to be sched_clock() based.

> 
>> +		mutex_unlock(&gd->slowpath_lock);
>> +		irq_work_queue_on(&gd->irq_work, cpu);
> 
> All right.
> 
> I think I have figured out how this is arranged, but I may be wrong. :-)
> 
> Here's my understanding of it.  If we are throttled, we don't just skip the
> request.  Instead, we wake up the gd thread kind of in the hope that the
> throttling may end when it actually wakes up.  So in fact we poke at the
> gd thread on a regular basis asking it "Are you still throttled?" and that
> happens on every call from the scheduler until the throttling is over if
> I'm not mistaken.  This means that during throttling every call from the
> scheduler generates an irq_work that wakes up the gd thread just to make it
> check if it still is throttled and go to sleep again.  Please tell me
> that I haven't understood this correctly.

Sorry, yes this should be doing something much more intelligent such as
checking to see if the freq thread is already due to wake up at some
point, and setting a timer for it if not.

> 
> The above aside, I personally think that rate limiting should happen at the source
> and not at the worker thread level.  So if you're throttled, you should just
> return immediately from this function without generating any more work.  That,
> BTW, is what the sampling rate in my code is for.

I will work on modifying the throttling here to be more sane.
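
As a first cut, the source-side check could be as simple as (a sketch;
gd->throttle would move to a sched_clock() timebase per the comment above):

        /* Rate limit at the source: drop the update while throttled. */
        if (ktime_before(ktime_get(), gd->throttle))
                goto out;       /* no irq_work, no thread wakeup */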

> 
>> +	} else {
>> +		cpufreq_sched_try_driver_target(policy, freq_new);
> 
> Well, this is supposed to be the fast path AFAICS.
> 
> Did you actually look at what __cpufreq_driver_target() does in general?
> Including the wait_event() in cpufreq_freq_transition_begin() to mention just
> one suspicious thing?  And how much overhead it generates in the most general
> case?
> 
> No, running *that* from the fast path is not a good idea.  Quite honestly,
> you'd need a new driver callback and a new way to run it from the cpufreq core
> to implement this in a reasonably efficient way.

Yes I'm aware of the wait_event(). I've attempted to work around it by
ensuring schedfreq does not issue a __cpufreq_driver_target attempt
while a transition is in flight (transition_ongoing), and ensuring the
slow and fast paths can't race. I'm not sure yet whether this is enough
if something like thermal or userspace changes min/max, whether that can
independently start a transition that may cause a fast path request here
to block. This governor does not use the dbs stuff.

While it's not exactly lean, I didn't see anything else in
__cpufreq_driver_target() that looked really terrible.

I've nothing against a new fast switch driver interface. It may be nice
to support unmodified drivers in the fast path as well, if it can be
made to work, even though it may not be optimal.

> 
>> +		mutex_unlock(&gd->slowpath_lock);
>> +	}
>> +
>> +out:
>> +	raw_spin_unlock(&gd->fastpath_lock);
>> +}
>> +
>> +void update_cpu_capacity_request(int cpu, bool request)
>> +{
>> +	unsigned long new_capacity;
>> +	struct sched_capacity_reqs *scr;
>> +
>> +	/* The rq lock serializes access to the CPU's sched_capacity_reqs. */
>> +	lockdep_assert_held(&cpu_rq(cpu)->lock);
>> +
>> +	scr = &per_cpu(cpu_sched_capacity_reqs, cpu);
>> +
>> +	new_capacity = scr->cfs + scr->rt;
>> +	new_capacity = new_capacity * capacity_margin
>> +		/ SCHED_CAPACITY_SCALE;
>> +	new_capacity += scr->dl;
> 
> Can you please explain the formula here?

The deadline class is expected to provide precise enough capacity
requirements (since tasks admitted to it have CPU bandwidth parameters)
such that it does not require additional CPU headroom.

So we're applying the CPU headroom to the CFS and RT capacity requests
only, then adding the DL request.

I'll add a comment here.
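
Worked example: with cfs = 300, rt = 100, dl = 200 and capacity_margin =
1280, new_capacity = (300 + 100) * 1280 / 1024 + 200 = 500 + 200 = 700.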

> 
>> +
>> +	if (new_capacity == scr->total)
>> +		return;
>> +
> 
> The same mistake as before.

(assuming this is part of the same comment in the other email)

> 
...
>> +static int cpufreq_sched_stop(struct cpufreq_policy *policy)
>> +{
>> +	struct gov_data *gd = policy->governor_data;
>> +	int cpu;
>> +
>> +	/*
>> +	 * The schedfreq static key is managed here so the global schedfreq
>> +	 * lock must be taken - a per-policy lock such as policy->rwsem is
>> +	 * not sufficient.
>> +	 */
>> +	mutex_lock(&gov_enable_lock);
>> +
>> +	/*
>> +	 * The governor stop path may or may not hold policy->rwsem. There
>> +	 * must be synchronization with the slow path however.
>> +	 */
>> +	mutex_lock(&gd->slowpath_lock);
>> +
>> +	/*
>> +	 * Stop new entries into the hot path for all CPUs. This will
>> +	 * potentially affect other policies which are still running but
>> +	 * this is an infrequent operation.
>> +	 */
>> +	static_key_slow_dec(&__sched_freq);
>> +	enabled_policies--;
>> +
>> +	/*
>> +	 * Ensure that all CPUs currently part of this policy are out
>> +	 * of the hot path so that if this policy exits we can free gd.
>> +	 */
>> +	preempt_disable();
>> +	smp_call_function_many(policy->cpus, dummy, NULL, true);
>> +	preempt_enable();
> 
> I'm not sure how this works, can you please tell me?

Peter correctly interpreted my intentions.

The RCU possibility also crossed my mind. They both seemed like a bit of
a hack to me - this wouldn't really be doing any RCU per se, rather
relying on its implementation. I'll switch to RCU since that seems to be
preferred though. It's certainly cleaner to write.

> 
...
> 
> I have no comments to the rest of the patch.

Thanks again for the review.

thanks,
Steve

Peter Zijlstra Feb. 26, 2016, 9:18 a.m. UTC | #8
On Thu, Feb 25, 2016 at 10:08:48PM +0100, Rafael J. Wysocki wrote:
> On Thursday, February 25, 2016 10:28:37 AM Peter Zijlstra wrote:
> > Its vile though; one should not spray IPIs if one can avoid it. Such
> > things are much better done with RCU. Sure sync_sched() takes a little
> > longer, but this isn't a fast path by any measure.
> 
> I see, thanks!
> 
> BTW, when cpufreq_update_util() callbacks are removed, I use synchronize_rcu()
> to wait for the running ones, but would it be better to use synchronize_sched()
> in there instead?

So I think we only call the callback with rq->lock held, in which case
sync_sched() is good enough.

It would allow you to get rid of the rcu_read_{,un}lock() calls as well.

The down-side is that it all makes the code a little harder to get,
because you're relying on caller context to DTRT.
Rafael J. Wysocki Feb. 27, 2016, 2:39 a.m. UTC | #9
On Thursday, February 25, 2016 04:34:23 PM Steve Muckle wrote:
> On 02/24/2016 07:55 PM, Rafael J. Wysocki wrote:
> > Hi,

[cut]

> > One thing I personally like in the RCU-based approach is its universality.  The
> > callbacks may be installed by different entities in a uniform way: intel_pstate
> > can do that, the old governors can do that, my experimental schedutil code can
> > do that and your code could have done that too in principle.  And this is very
> > nice, because it is a common underlying mechanism that can be used by everybody
> > regardless of their particular implementations on the other side.
> > 
> > Why would I want to use something different, then?
> 
> I've got nothing against a callback registration mechanism. As you
> mentioned in another mail it could itself use static keys, enabling the
> static key when a callback is registered and disabling it otherwise to
> avoid calling into cpufreq_update_util().

But then it would only make a difference if cpufreq_update_util() was not
used at all (ie. no callbacks installed for any policies by anyone).  The
only reason why it may matter is that the total number of systems using
the performance governor is quite large AFAICS and they would benefit from
that.

[cut]

> 
> > 
> >> +
> >> +/*
> >> + * Capacity margin added to CFS and RT capacity requests to provide
> >> + * some head room if task utilization further increases.
> >> + */
> > 
> > OK, where does this number come from?
> 
> Someone's posterior :) .
> 
> This really should be a tunable IMO, but there's a fairly strong
> anti-tunable sentiment, so it's been left hard-coded in an attempt to
> provide something that "just works."

Ouch.

> At the least I can add a comment saying that the 20% idle headroom
> requirement was an off the cuff estimate and that at this time, we don't
> have significant data to suggest it's the best number.

Well, in this area, every number has to be justified.  Otherwise we end
up with things that sort of work, but nobody actually understands why.

[cut]

> > 
> >> +
> >> +static int cpufreq_sched_thread(void *data)
> >> +{
> > 
> > Now, what really is the advantage of having those extra threads vs using
> > workqueues?
> > 
> > I guess the underlying concern is that RT tasks may stall workqueues indefinitely
> > in theory and then the frequency won't be updated, but there's much more kernel
> > stuff run from workqueues and if that is starved, you won't get very far anyway.
> > 
> > If you take special measures to prevent frequency change requests from being
> > stalled by RT tasks, question is why are they so special?  Aren't there any
> > other kernel activities that also should be protected from that and may be
> > more important than CPU frequency changes?
> 
> I think updating the CPU frequency during periods of heavy RT/DL load is
> one of the most (if not the most) important things. I can't speak for
> other system activities that may get blocked, but there's an opportunity
> to protect CPU frequency changes here, and it seems worth taking to me.

So do it in a general way for everybody and not just for one governor
that you happen to be working on.

That said I'm unconvinced about the approach still.

Having more RT threads in a system that already is under RT pressure seems like
a recipe for trouble.  Moreover, it's likely that those new RT threads will
disturb the system's normal operation somehow even without the RT pressure;
have you investigated that?  Also having them per policy may be overkill and
binding them to policy CPUs only is not necessary.

Overall, it looks like a dynamic pool of threads that may run on every CPU
might be a better approach, but that would almost duplicate the workqueues
subsystem, so is it really worth it?

And is the problem actually visible in practice?  I have no record of any reports
mentioning it, although theoretically it's been there forever, so had it been
real, someone would have noticed it and complained about it IMO.

> > 
> > Plus if this really is the problem here, then it also affects the other cpufreq
> > governors, so maybe it should be solved for everybody in some common way?
> 
> Agreed, I'd think a freq change thread that serves frequency change
> requests would be a useful common component. The locking and throttling
> (slowpath_lock, finish_last_request()) are somewhat specific to this
> implementation, but could probably be done generically and maybe even
> used in other governors. If you're okay with it though I'd like to view
> that as a slightly longer term effort, as I think it would get unwieldy
> trying to do that as part of this initial change.

I really am not sure if this is useful at all, so why bother with it initially?

> > 
> ...
> >> +
> >> +static void cpufreq_sched_irq_work(struct irq_work *irq_work)
> >> +{
> >> +	struct gov_data *gd;
> >> +
> >> +	gd = container_of(irq_work, struct gov_data, irq_work);
> >> +	if (!gd)
> >> +		return;
> >> +
> >> +	wake_up_process(gd->task);
> > 
> > I'm wondering what would be wrong with writing it as
> > 
> > 	if (gd)
> > 		wake_up_process(gd->task);
> > 
> > And can gd turn out to be NULL here in any case?
> 
> In practice I don't think this would ever happen, but there's nothing
> that would guarantee the policy can't be stopped and exit while one of
> these irq_works is in flight. This would free not only gd but also the
> irq_work structure itself.
> 
> Rather than check if gd is NULL here I think synchronization is required
> to flush an in flight irq_work when the policy is being stopped.

Right.

> I will add an irq_work_sync in the policy stop path and remove the NULL check.
> 
> > 
> >> +}
> >> +
> >> +static void update_fdomain_capacity_request(int cpu)
> >> +{
> >> +	unsigned int freq_new, index_new, cpu_tmp;
> >> +	struct cpufreq_policy *policy;
> >> +	struct gov_data *gd = per_cpu(cpu_gov_data, cpu);
> >> +	unsigned long capacity = 0;
> >> +
> >> +	if (!gd)
> >> +		return;
> >> +
> > 
> > Why is this check necessary?
> 
> As soon as one policy in the system uses scheduler-guided frequency the
> static key will be enabled and all CPUs will be calling into the
> scheduler hooks and can potentially come through this path. But some
> CPUs may not be part of a scheduler-guided frequency policy. Those CPUs
> will have a NULL gov_data pointer.
> 
> That being said, I think this test should be moved up into
> update_cpu_capacity_request() to avoid that computation when it is not
> required. Better still would be bailing when required in the
> set_*_cpu_capacity() macros in kernel/sched/sched.h. I'll see if I can
> do that instead.

If you switch over to using cpufreq_update_util() callbacks like the other
governors, you won't need it any more, because then you'll be guaranteed that
you won't run if the callback hasn't been installed for this CPU.

[cut]

> > 
> >> +
> >> +	freq_new = capacity * policy->max >> SCHED_CAPACITY_SHIFT;
> > 
> > Where does this formula come from?
> 
> Capacity here is 0 to SCHED_CAPACITY_SCALE, so this is translating the
> capacity request to a frequency request via
> (capacity/SCHED_CAPACITY_SCALE) * policy->max. I'll add a comment to
> this effect.
>
> The race with policy->max potentially changing also deserves a comment.
> If policy->max changes say just before we read it here to do this
> translation, the scheduler PELT numbers will still largely be based on
> the old policy->max value, because they are an exponential moving
> average and will take time to re-adjust. So at least for now I'm
> ignoring this race as I don't think it's really meaningful to attempt
> any synchronization.

Agreed.
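
Concretely, the translation above is linear in the capacity request (a
sketch with made-up numbers):

#define SCHED_CAPACITY_SHIFT	10

/*
 * capacity is in [0, SCHED_CAPACITY_SCALE]; scale it into [0, policy->max].
 * E.g. capacity = 689 with policy->max = 2000000 kHz gives
 * 689 * 2000000 >> 10 = 1345703 kHz, which the table lookup below then
 * rounds up to a real OPP via CPUFREQ_RELATION_L.
 */
static unsigned int capacity_to_freq(unsigned long capacity,
				     unsigned int max_khz)
{
	return capacity * max_khz >> SCHED_CAPACITY_SHIFT;
}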

> > 
> >> +
> >> +	/*
> >> +	 * Calling this without locking policy->rwsem means we race
> >> +	 * against changes with policy->min and policy->max. This should
> >> +	 * be okay though.
> >> +	 */
> >> +	if (cpufreq_frequency_table_target(policy, policy->freq_table,
> >> +					   freq_new, CPUFREQ_RELATION_L,
> >> +					   &index_new))
> >> +		goto out;
> > 
> > __cpufreq_driver_target() will call this again, so isn't calling it here
> > a bit wasteful?
> 
> I wanted to avoid waking up the frequency change thread (an expensive
> operation) whenever possible, or even making an unnecessary fastpath
> frequency request,

In the fastpath case the driver will check if the new freq is equal to the
old one as it generally has no guarantee that the freq it is asked for
matches one it has in the table.  It needs to look it up and check.

And the lookup may actually be the most expensive operation in that case
(the request itself may be a matter of a read from and a write to a fast
register), so by adding it here you pretty much double the overhead.  Not
to mention the fact that the driver's lookup may be optimized quite a bit
compared to what cpufreq_frequency_table_target() does.

> so translating the raw frequency request to a
> supported target frequency allows us to bail if the actual requested
> target frequency will end up being the same as the current one. I
> thought this was more valuable than the extra table lookup here.

You don't need to do a table lookup for that.  You only need to store
the frequency that you have previously requested.

As long as the driver is sane, it should deterministically choose the same
frequency from the table for the same target value every time, so comparing
with what you passed to it before should be sufficient.
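
In code, the suggestion amounts to something like this (a sketch; gov_data
and cpufreq_sched_try_driver_target() follow the naming in the patch, with
requested_freq now holding the raw target rather than the table-resolved
frequency):

static void maybe_request_freq(struct gov_data *gd,
			       struct cpufreq_policy *policy,
			       unsigned int freq_new)
{
	/*
	 * An unchanged raw target maps to the same table entry, so
	 * there is nothing to do. Two different raw targets that
	 * collapse to the same OPP are caught by the driver's own
	 * check instead.
	 */
	if (freq_new == gd->requested_freq)
		return;

	gd->requested_freq = freq_new;
	cpufreq_sched_try_driver_target(policy, freq_new);
}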

> Actually, I could make this better by storing the next lower frequency
> than the current frequency as well as the current frequency - this would
> allow me to do a much simpler test to see if we'd end up requesting the
> same frequency or something different.
> 
> > 
> >> +	freq_new = policy->freq_table[index_new].frequency;
> >> +
> >> +	if (freq_new == gd->requested_freq)
> >> +		goto out;
> >> +
> > 
> > Again, the above generally is a mistake for reasons explained earlier.
> 
> (skipping per the other email)
> 
> > 
> >> +	gd->requested_freq = freq_new;
> >> +
> >> +	if (cpufreq_driver_slow || !mutex_trylock(&gd->slowpath_lock)) {
> > 
> > This really doesn't look good to me.
> > 
> > Why is the mutex needed here in the first place?  cpufreq_sched_stop() should
> > be able to make sure that this function won't be run again for the policy
> > without using this lock.
> 
> The acquisition of the slowpath_lock here isn't for synchronizing with
> the policy being stopped - that's handled differently via the smp call
> in the stop routine.
> 
> Rather it may be the case that schedfreq runs both in the fast and slow
> path on a target (maybe because of throttling, or because a previously
> started asynchronous transition isn't done yet). If that is so, then
> when the slow path is active, I do not want to attempt a transition in
> the fast path.

But you never throttle in the "fast switch" case, do you?  You don't
start the gd thread in that case even.

That aside, playing tricks with mutexes like that is ugly like hell and
doesn't make the code easier to understand in any way.

> 
> > 
> >> +		irq_work_queue_on(&gd->irq_work, cpu);
> > 
> > I hope that you are aware of the fact that irq_work_queue_on() explodes
> > on uniprocessor ARM32 if you run an SMP kernel on it?
> 
> No, I wasn't. Fortunately I think switching it to irq_work_queue() to
> run the irq_work on the same CPU as the fast path should be fine
> (assuming there's no similar landmine there).

You don't have to run it on the same CPU, though.  It doesn't matter what
CPU kicks your worker thread, does it?

[cut]

> >> +	} else {
> >> +		cpufreq_sched_try_driver_target(policy, freq_new);
> > 
> > Well, this is supposed to be the fast path AFAICS.
> > 
> > Did you actually look at what __cpufreq_driver_target() does in general?
> > Including the wait_event() in cpufreq_freq_transition_begin() to mention just
> > one suspicious thing?  And how much overhead it generates in the most general
> > case?
> > 
> > No, running *that* from the fast path is not a good idea.  Quite honestly,
> > you'd need a new driver callback and a new way to run it from the cpufreq core
> > to implement this in a reasonably efficient way.
> 
> Yes I'm aware of the wait_event(). I've attempted to work around it by
> ensuring schedfreq does not issue a __cpufreq_driver_target attempt
> while a transition is in flight (transition_ongoing), and ensuring the
> slow and fast paths can't race. I'm not sure yet whether this is enough:
> if something like thermal or userspace changes min/max, that may
> independently start a transition which causes a fast path request here
> to block. This governor does not use the dbs stuff.
> 
> While it's not exactly lean, I didn't see anything else in
> __cpufreq_driver_target() that looked really terrible.
> 
> I've nothing against a new fast switch driver interface. It may be nice
> to support unmodified drivers in the fast path as well, if it can be
> made to work, even though it may not be optimal.

My bet is that you'd need to modify drivers for that anyway, because none of
them meet the fast path requirements for the very simple reason that they
weren't designed with that in mind.

So if you need to modify them anyway, why not to add a new callback to them
in the process?

> > 
> >> +		mutex_unlock(&gd->slowpath_lock);
> >> +	}
> >> +
> >> +out:
> >> +	raw_spin_unlock(&gd->fastpath_lock);
> >> +}
> >> +
> >> +void update_cpu_capacity_request(int cpu, bool request)
> >> +{
> >> +	unsigned long new_capacity;
> >> +	struct sched_capacity_reqs *scr;
> >> +
> >> +	/* The rq lock serializes access to the CPU's sched_capacity_reqs. */
> >> +	lockdep_assert_held(&cpu_rq(cpu)->lock);
> >> +
> >> +	scr = &per_cpu(cpu_sched_capacity_reqs, cpu);
> >> +
> >> +	new_capacity = scr->cfs + scr->rt;
> >> +	new_capacity = new_capacity * capacity_margin
> >> +		/ SCHED_CAPACITY_SCALE;
> >> +	new_capacity += scr->dl;
> > 
> > Can you please explain the formula here?
> 
> The deadline class is expected to provide precise enough capacity
> requirements (since tasks admitted to it have CPU bandwidth parameters)
> such that it does not require additional CPU headroom.
> 
> So we're applying the CPU headroom to the CFS and RT capacity requests
> only, then adding the DL request.
> 
> I'll add a comment here.

One thing that has escaped me: How do you take policy->min into account?
In particular, what if you end up with a frequency below policy->min?

[cut]

> >> +	/*
> >> +	 * Ensure that all CPUs currently part of this policy are out
> >> +	 * of the hot path so that if this policy exits we can free gd.
> >> +	 */
> >> +	preempt_disable();
> >> +	smp_call_function_many(policy->cpus, dummy, NULL, true);
> >> +	preempt_enable();
> > 
> > I'm not sure how this works, can you please tell me?
> 
> Peter correctly interpreted my intentions.
> 
> The RCU possibility also crossed my mind. They both seemed like a bit of
> a hack to me - this wouldn't really be doing any RCU per se, rather
> relying on its implementation. I'll switch to RCU since that seems to be
> preferred though. It's certainly cleaner to write.

Well, in that case you simply can switch over to using cpufreq_update_util()
callbacks and then you'll get that part for free.

Overall, sorry to say that, I don't quite see much to salvage in this patch.

The throttling implementation is a disaster.  The idea of having RT worker
threads per policy is questionable for a few reasons.  The idea of invoking
the existing driver callbacks from the fast path is not a good one and
trying to use __cpufreq_driver_target() for that only makes it worse.
The mutex manipulations under a raw spinlock are firmly in the "don't do
that" category and the static keys won't be useful any more or need to be
used elsewhere.

The way that you compute the frequency to ask for kind of might be defended,
but it has problems too, like arbitrary factors coming out of thin air.

So what's left then, realistically?

Thanks,
Rafael

Steve Muckle Feb. 27, 2016, 4:17 a.m. UTC | #10
On 02/26/2016 06:39 PM, Rafael J. Wysocki wrote:
>>> One thing I personally like in the RCU-based approach is its universality.  The
>>> callbacks may be installed by different entities in a uniform way: intel_pstate
>>> can do that, the old governors can do that, my experimental schedutil code can
>>> do that and your code could have done that too in principle.  And this is very
>>> nice, because it is a common underlying mechanism that can be used by everybody
>>> regardless of their particular implementations on the other side.
>>>
>>> Why would I want to use something different, then?
>>
>> I've got nothing against a callback registration mechanism. As you
>> mentioned in another mail it could itself use static keys, enabling the
>> static key when a callback is registered and disabling it otherwise to
>> avoid calling into cpufreq_update_util().
> 
> But then it would only make a difference if cpufreq_update_util() was not
> used at all (ie. no callbacks installed for any policies by anyone).  The
> only reason why it may matter is that the total number of systems using
> the performance governor is quite large AFAICS and they would benefit from
> that.

I'd think that's a benefit worth preserving, but I guess that's Peter
and Ingo's call.

> 
...
>>>> +/*
>>>> + * Capacity margin added to CFS and RT capacity requests to provide
>>>> + * some head room if task utilization further increases.
>>>> + */
>>>
>>> OK, where does this number come from?
>>
>> Someone's posterior :) .
>>
>> This really should be a tunable IMO, but there's a fairly strong
>> anti-tunable sentiment, so it's been left hard-coded in an attempt to
>> provide something that "just works."
> 
> Ouch.
> 
>> At the least I can add a comment saying that the 20% idle headroom
>> requirement was an off the cuff estimate and that at this time, we don't
>> have significant data to suggest it's the best number.
> 
> Well, in this area, every number has to be justified.  Otherwise we end
> up with things that sort of work, but nobody actually understands why.

It's just a starting point. There's a lot of profiling and tuning that
has yet to happen. We figured there were larger design issues to discuss
prior to spending a lot of time tweaking the headroom value.

> 
> [cut]
> 
>>>
>>>> +
>>>> +static int cpufreq_sched_thread(void *data)
>>>> +{
>>>
>>> Now, what really is the advantage of having those extra threads vs using
>>> workqueues?
>>>
>>> I guess the underlying concern is that RT tasks may stall workqueues indefinitely
>>> in theory and then the frequency won't be updated, but there's much more kernel
>>> stuff run from workqueues and if that is starved, you won't get very far anyway.
>>>
>>> If you take special measures to prevent frequency change requests from being
>>> stalled by RT tasks, question is why are they so special?  Aren't there any
>>> other kernel activities that also should be protected from that and may be
>>> more important than CPU frequency changes?
>>
>> I think updating the CPU frequency during periods of heavy RT/DL load is
>> one of the most (if not the most) important things. I can't speak for
>> other system activities that may get blocked, but there's an opportunity
>> to protect CPU frequency changes here, and it seems worth taking to me.
> 
> So do it in a general way for everybody and not just for one governor
> that you happen to be working on.
> 
> That said I'm unconvinced about the approach still.
> 
> Having more RT threads in a system that already is under RT pressure seems like
> a recipe for trouble.  Moreover, it's likely that those new RT threads will
> disturb the system's normal operation somehow even without the RT pressure;
> have you investigated that?

Sorry I'm not sure what you mean by disturb normal operation.

Generally speaking, increasing the capacity of a heavily loaded system
seems to me to be something that should run urgently, so that the system
can potentially get itself out of trouble and meet the workload's needs.

> Also having them per policy may be overkill and
> binding them to policy CPUs only is not necessary.
> 
> Overall, it looks like a dynamic pool of threads that may run on every CPU
> might be a better approach, but that would almost duplicate the workqueues
> subsystem, so is it really worth it?
> 
> And is the problem actually visible in practice?  I have no record of any reports
> mentioning it, although theoretically it's been there forever, so had it been
> real, someone would have noticed it and complained about it IMO.

While I don't have a test case drawn up to provide, it seems like it'd be
easy to create one. More importantly the interactive governor in Android
uses this same kind of model, starting a frequency change thread and
making it RT. Android is particularly sensitive to latency in frequency
response. So that's likely one big reason why you're not hearing about
this issue - some folks have already worked around it.

> 
>>>
>>> Plus if this really is the problem here, then it also affects the other cpufreq
>>> governors, so maybe it should be solved for everybody in some common way?
>>
>> Agreed, I'd think a freq change thread that serves frequency change
>> requests would be a useful common component. The locking and throttling
>> (slowpath_lock, finish_last_request()) are somewhat specific to this
>> implementation, but could probably be done generically and maybe even
>> used in other governors. If you're okay with it though I'd like to view
>> that as a slightly longer term effort, as I think it would get unwieldy
>> trying to do that as part of this initial change.
> 
> I really am not sure if this is useful at all, so why bother with it initially?

I still think ensuring CPU frequency changes aren't delayed by other
task activity is useful...

> 
...
>>>
>>>> +}
>>>> +
>>>> +static void update_fdomain_capacity_request(int cpu)
>>>> +{
>>>> +	unsigned int freq_new, index_new, cpu_tmp;
>>>> +	struct cpufreq_policy *policy;
>>>> +	struct gov_data *gd = per_cpu(cpu_gov_data, cpu);
>>>> +	unsigned long capacity = 0;
>>>> +
>>>> +	if (!gd)
>>>> +		return;
>>>> +
>>>
>>> Why is this check necessary?
>>
>> As soon as one policy in the system uses scheduler-guided frequency the
>> static key will be enabled and all CPUs will be calling into the
>> scheduler hooks and can potentially come through this path. But some
>> CPUs may not be part of a scheduler-guided frequency policy. Those CPUs
>> will have a NULL gov_data pointer.
>>
>> That being said, I think this test should be moved up into
>> update_cpu_capacity_request() to avoid that computation when it is not
>> required. Better still would be bailing when required in the
>> set_*_cpu_capacity() macros in kernel/sched/sched.h. I'll see if I can
>> do that instead.
> 
> If you switch over to using cpufreq_update_util() callbacks like the other
> governors, you won't need it any more, because then you'll be guaranteed that
> you won't run if the callback hasn't been installed for this CPU.

Sure.

> 
...
>>>> +		irq_work_queue_on(&gd->irq_work, cpu);
>>>
>>> I hope that you are aware of the fact that irq_work_queue_on() explodes
>>> on uniprocessor ARM32 if you run an SMP kernel on it?
>>
>> No, I wasn't. Fortunately I think switching it to irq_work_queue() to
>> run the irq_work on the same CPU as the fast path should be fine
>> (assuming there's no similar landmine there).
> 
> You don't have to run it on the same CPU, though.  It doesn't matter what
> CPU kicks your worker thread, does it?

I don't want to wake a random CPU to potentially just do the kick.

> 
> [cut]
> 
>>>> +	} else {
>>>> +		cpufreq_sched_try_driver_target(policy, freq_new);
>>>
>>> Well, this is supposed to be the fast path AFAICS.
>>>
>>> Did you actually look at what __cpufreq_driver_target() does in general?
>>> Including the wait_event() in cpufreq_freq_transition_begin() to mention just
>>> one suspicious thing?  And how much overhead it generates in the most general
>>> case?
>>>
>>> No, running *that* from the fast path is not a good idea.  Quite honestly,
>>> you'd need a new driver callback and a new way to run it from the cpufreq core
>>> to implement this in a reasonably efficient way.
>>
>> Yes I'm aware of the wait_event(). I've attempted to work around it by
>> ensuring schedfreq does not issue a __cpufreq_driver_target attempt
>> while a transition is in flight (transition_ongoing), and ensuring the
>> slow and fast paths can't race. I'm not sure yet whether this is enough:
>> if something like thermal or userspace changes min/max, that may
>> independently start a transition which causes a fast path request here
>> to block. This governor does not use the dbs stuff.
>>
>> While it's not exactly lean, I didn't see anything else in
>> __cpufreq_driver_target() that looked really terrible.
>>
>> I've nothing against a new fast switch driver interface. It may be nice
>> to support unmodified drivers in the fast path as well, if it can be
>> made to work, even though it may not be optimal.
> 
> My bet is that you'd need to modify drivers for that anyway, because none of
> them meet the fast path requirements for the very simple reason that they
> weren't designed with that in mind.
> 
> So if you need to modify them anyway, why not to add a new callback to them
> in the process?

This implies to me that there are drivers which can be made to work in
the fast path (i.e. do not sleep and are reasonably quick) but currently
do not. Are there really such drivers? Why would they have unnecessary
blocking or work in them?

I would have thought that if a driver sleeps or is slow, it is because
of platform limitations (like a blocking call to adjust a regulator)
which are unavoidable.

> 
>>>
>>>> +		mutex_unlock(&gd->slowpath_lock);
>>>> +	}
>>>> +
>>>> +out:
>>>> +	raw_spin_unlock(&gd->fastpath_lock);
>>>> +}
>>>> +
>>>> +void update_cpu_capacity_request(int cpu, bool request)
>>>> +{
>>>> +	unsigned long new_capacity;
>>>> +	struct sched_capacity_reqs *scr;
>>>> +
>>>> +	/* The rq lock serializes access to the CPU's sched_capacity_reqs. */
>>>> +	lockdep_assert_held(&cpu_rq(cpu)->lock);
>>>> +
>>>> +	scr = &per_cpu(cpu_sched_capacity_reqs, cpu);
>>>> +
>>>> +	new_capacity = scr->cfs + scr->rt;
>>>> +	new_capacity = new_capacity * capacity_margin
>>>> +		/ SCHED_CAPACITY_SCALE;
>>>> +	new_capacity += scr->dl;
>>>
>>> Can you please explain the formula here?
>>
>> The deadline class is expected to provide precise enough capacity
>> requirements (since tasks admitted to it have CPU bandwidth parameters)
>> such that it does not require additional CPU headroom.
>>
>> So we're applying the CPU headroom to the CFS and RT capacity requests
>> only, then adding the DL request.
>>
>> I'll add a comment here.
> 
> One thing that has escaped me: How do you take policy->min into account?
> In particular, what if you end up with a frequency below policy->min?

__cpufreq_driver_target will floor the request with policy->min.
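
For reference, the clamping in question is done near the top of
__cpufreq_driver_target(), roughly:

	/* Make sure that target_freq is within supported range */
	if (target_freq > policy->max)
		target_freq = policy->max;
	if (target_freq < policy->min)
		target_freq = policy->min;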

> 
> [cut]
> 
>>>> +	/*
>>>> +	 * Ensure that all CPUs currently part of this policy are out
>>>> +	 * of the hot path so that if this policy exits we can free gd.
>>>> +	 */
>>>> +	preempt_disable();
>>>> +	smp_call_function_many(policy->cpus, dummy, NULL, true);
>>>> +	preempt_enable();
>>>
>>> I'm not sure how this works, can you please tell me?
>>
>> Peter correctly interpreted my intentions.
>>
>> The RCU possibility also crossed my mind. They both seemed like a bit of
>> a hack to me - this wouldn't really be doing any RCU per se, rather
>> relying on its implementation. I'll switch to RCU since that seems to be
>> preferred though. It's certainly cleaner to write.
> 
> Well, in that case you simply can switch over to using cpufreq_update_util()
> callbacks and then you'll get that part for free.

I'm not sure. Removing the callbacks by itself doesn't guarantee all the
CPUs are out of them - don't you still need something to synchronize
with that?

> Overall, sorry to say that, I don't quite see much to salvage in this patch.
> 
> The throttling implementation is a disaster.  The idea of having RT worker
> threads per policy is questionable for a few reasons.  The idea of invoking
> the existing driver callbacks from the fast path is not a good one and
> trying to use __cpufreq_driver_target() for that only makes it worse.
> The mutex manipulations under a raw spinlock are firmly in the "don't do
> that" category and the static keys won't be useful any more or need to be
> used elsewhere.
> 
> The way that you compute the frequency to ask for kind of might be defended,
> but it has problems too, like arbitrary factors coming out of thin air.
> 
> So what's left then, realistically?

Aside from the throttling, which I agree was a last minute and
not-well-thought-out addition and needs to be fixed, I'd still challenge
the statements above. But I don't think there's much point. I'm happy to
work towards enabling ARM in schedutil. I have some review comments
which I am just finishing up and will send shortly.

thanks,
Steve

Rafael J. Wysocki Feb. 28, 2016, 2:26 a.m. UTC | #11
On Friday, February 26, 2016 08:17:46 PM Steve Muckle wrote:
> On 02/26/2016 06:39 PM, Rafael J. Wysocki wrote:
> >>> One thing I personally like in the RCU-based approach is its universality.  The
> >>> callbacks may be installed by different entities in a uniform way: intel_pstate
> >>> can do that, the old governors can do that, my experimental schedutil code can
> >>> do that and your code could have done that too in principle.  And this is very
> >>> nice, because it is a common underlying mechanism that can be used by everybody
> >>> regardless of their particular implementations on the other side.
> >>>
> >>> Why would I want to use something different, then?
> >>
> >> I've got nothing against a callback registration mechanism. As you
> >> mentioned in another mail it could itself use static keys, enabling the
> >> static key when a callback is registered and disabling it otherwise to
> >> avoid calling into cpufreq_update_util().
> > 
> > But then it would only make a difference if cpufreq_update_util() was not
> > used at all (ie. no callbacks installed for any policies by anyone).  The
> > only reason why it may matter is that the total number of systems using
> > the performance governor is quite large AFAICS and they would benefit from
> > that.
> 
> I'd think that's a benefit worth preserving, but I guess that's Peter
> and Ingo's call.

That's exactly what I said to Peter. :-)

[cut]

> > 
> > That said I'm unconvinced about the approach still.
> > 
> > Having more RT threads in a system that already is under RT pressure seems like
> > a recipe for trouble.  Moreover, it's likely that those new RT threads will
> > disturb the system's normal operation somehow even without the RT pressure;
> > have you investigated that?
> 
> Sorry I'm not sure what you mean by disturb normal operation.

That would introduce a number of extra RT threads that would be woken up quite
often and on a regular basis, so there would be some extra RT noise in the
system, especially on systems with one CPU per cpufreq policy and many CPUs.

That's not present ATM and surely wouldn't be completely transparent.

> Generally speaking, increasing the capacity of a heavily loaded system
> seems to me to be something that should run urgently, so that the system
> can potentially get itself out of trouble and meet the workload's needs.
> 
> > Also having them per policy may be overkill and
> > binding them to policy CPUs only is not necessary.
> > 
> > Overall, it looks like a dynamic pool of threads that may run on every CPU
> > might be a better approach, but that would almost duplicate the workqueues
> > subsystem, so is it really worth it?
> > 
> > And is the problem actually visible in practice?  I have no record of any reports
> > mentioning it, although theoretically it's been there forever, so had it been
> > real, someone would have noticed it and complained about it IMO.
> 
> While I don't have a test case drawn up to provide, it seems like it'd be
> easy to create one. More importantly the interactive governor in Android
> uses this same kind of model, starting a frequency change thread and
> making it RT. Android is particularly sensitive to latency in frequency
> response. So that's likely one big reason why you're not hearing about
> this issue - some folks have already worked around it.

OK, so Android is the reason. :-)

Fair enough.  I still think that care is needed here, though.

[cut]

> >>
> >> I've nothing against a new fast switch driver interface. It may be nice
> >> to support unmodified drivers in the fast path as well, if it can be
> >> made to work, even though it may not be optimal.
> > 
> > My bet is that you'd need to modify drivers for that anyway, because none of
> > them meet the fast path requirements for the very simple reason that they
> > weren't designed with that in mind.
> > 
> > So if you need to modify them anyway, why not to add a new callback to them
> > in the process?
> 
> This implies to me that there are drivers which can be made to work in
> the fast path (i.e. do not sleep and are reasonably quick) but currently
> do not. Are there really such drivers? Why would they have unnecessary
> blocking or work in them?

The ACPI driver is an example here.  It uses smp_call_function_many() to
run stuff on policy CPUs and that cannot be called from the fast path,
so the driver has to be modified.

> I would have thought that if a driver sleeps or is slow, it is because
> of platform limitations (like a blocking call to adjust a regulator)
> which are unavoidable.

Well, if you don't need to worry about sleeping, it's common to use
things like mutexes for synchronization etc.  Quite simply, if you don't
expect that the given piece of code will ever be run from interrupt
context, you'll not code for that and the scheduler paths have even more
restrictions than typical interrupt handlers.
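
One possible shape for such a callback (purely a sketch of the idea under
discussion; no such interface existed at the time):

	/*
	 * Hypothetical addition to struct cpufreq_driver: called from
	 * scheduler context with the rq lock held, so it must not sleep,
	 * must not take mutexes, and should complete in at most a few
	 * microseconds. Returns the frequency actually set.
	 */
	unsigned int	(*fast_switch)(struct cpufreq_policy *policy,
				       unsigned int target_freq);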

> >>>
> >>>> +		mutex_unlock(&gd->slowpath_lock);
> >>>> +	}
> >>>> +
> >>>> +out:
> >>>> +	raw_spin_unlock(&gd->fastpath_lock);
> >>>> +}
> >>>> +
> >>>> +void update_cpu_capacity_request(int cpu, bool request)
> >>>> +{
> >>>> +	unsigned long new_capacity;
> >>>> +	struct sched_capacity_reqs *scr;
> >>>> +
> >>>> +	/* The rq lock serializes access to the CPU's sched_capacity_reqs. */
> >>>> +	lockdep_assert_held(&cpu_rq(cpu)->lock);
> >>>> +
> >>>> +	scr = &per_cpu(cpu_sched_capacity_reqs, cpu);
> >>>> +
> >>>> +	new_capacity = scr->cfs + scr->rt;
> >>>> +	new_capacity = new_capacity * capacity_margin
> >>>> +		/ SCHED_CAPACITY_SCALE;
> >>>> +	new_capacity += scr->dl;
> >>>
> >>> Can you please explain the formula here?
> >>
> >> The deadline class is expected to provide precise enough capacity
> >> requirements (since tasks admitted to it have CPU bandwidth parameters)
> >> such that it does not require additional CPU headroom.
> >>
> >> So we're applying the CPU headroom to the CFS and RT capacity requests
> >> only, then adding the DL request.
> >>
> >> I'll add a comment here.
> > 
> > One thing that has escaped me: How do you take policy->min into account?
> > In particular, what if you end up with a frequency below policy->min?
> 
> __cpufreq_driver_target will floor the request with policy->min.

I see.

> > 
> > [cut]
> > 
> >>>> +	/*
> >>>> +	 * Ensure that all CPUs currently part of this policy are out
> >>>> +	 * of the hot path so that if this policy exits we can free gd.
> >>>> +	 */
> >>>> +	preempt_disable();
> >>>> +	smp_call_function_many(policy->cpus, dummy, NULL, true);
> >>>> +	preempt_enable();
> >>>
> >>> I'm not sure how this works, can you please tell me?
> >>
> >> Peter correctly interpreted my intentions.
> >>
> >> The RCU possibility also crossed my mind. They both seemed like a bit of
> >> a hack to me - this wouldn't really be doing any RCU per se, rather
> >> relying on its implementation. I'll switch to RCU since that seems to be
> >> preferred though. It's certainly cleaner to write.
> > 
> > Well, in that case you simply can switch over to using cpufreq_update_util()
> > callbacks and then you'll get that part for free.
> 
> I'm not sure. Removing the callbacks by itself doesn't guarantee all the
> CPUs are out of them - don't you still need something to synchronize
> with that?

The update_util pointers are dereferenced and checked against NULL and the
callbacks are run (if present) in the same RCU read-side critical section.
synchronize_rcu() ensures that (a) all of the previously running read-side
critical sections complete before it returns and (b) all of the read-side
critical sections that begin after it has returned will see the new value
of the pointer.  This guarantees that the callbacks will not be running
after clearing the update_util pointers and executing synchronize_rcu().
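
Spelled out, that pattern looks roughly like this (names follow the
cpufreq_update_util() code being discussed):

#include <linux/percpu.h>
#include <linux/rcupdate.h>
#include <linux/types.h>

struct update_util_data {
	void (*func)(struct update_util_data *data, u64 time,
		     unsigned long util, unsigned long max);
};

static DEFINE_PER_CPU(struct update_util_data *, cpufreq_update_util_data);

/* Reader side: runs under the rq lock, i.e. with preemption disabled. */
static void cpufreq_update_util(u64 time, unsigned long util,
				unsigned long max)
{
	struct update_util_data *data;

	data = rcu_dereference_sched(*this_cpu_ptr(&cpufreq_update_util_data));
	if (data)
		data->func(data, time, util, max);
}

/* Writer side, e.g. on governor stop: */
static void clear_update_util(int cpu)
{
	rcu_assign_pointer(per_cpu(cpufreq_update_util_data, cpu), NULL);
	synchronize_rcu();	/* or synchronize_sched(), per Peter above */
	/* no callback can still be running; governor data may be freed */
}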

> > Overall, sorry to say that, I don't quite see much to salvage in this patch.
> > 
> > The throttling implementation is a disaster.  The idea of having RT worker
> > threads per policy is questionable for a few reasons.  The idea of invoking
> > the existing driver callbacks from the fast path is not a good one and
> > trying to use __cpufreq_driver_target() for that only makes it worse.
> > The mutex manipulations under a raw spinlock are firmly in the "don't do
> > that" category and the static keys won't be useful any more or need to be
> > used elsewhere.
> > 
> > The way that you compute the frequency to ask for kind of might be defended,
> > but it has problems too, like arbitrary factors coming out of thin air.
> > 
> > So what's left then, realistically?
> 
> Aside from the throttling, which I agree was a last minute and
> not-well-thought-out addition and needs to be fixed, I'd still challenge
> the statements above. But I don't think there's much point. I'm happy to
> work towards enabling ARM in schedutil.

Cool! :-)

> I have some review comments which I am just finishing up and will send shortly.

I've responded to those already I think.

Thanks,
Rafael

Peter Zijlstra March 1, 2016, 1:17 p.m. UTC | #12
On Thu, Feb 25, 2016 at 04:34:23PM -0800, Steve Muckle wrote:
> >> +	/*
> >> +	 * Ensure that all CPUs currently part of this policy are out
> >> +	 * of the hot path so that if this policy exits we can free gd.
> >> +	 */
> >> +	preempt_disable();
> >> +	smp_call_function_many(policy->cpus, dummy, NULL, true);
> >> +	preempt_enable();
> > 
> > I'm not sure how this works, can you please tell me?
> 
> Peter correctly interpreted my intentions.
> 
> The RCU possibility also crossed my mind. They both seemed like a bit of
> a hack to me - this wouldn't really be doing any RCU per se, rather
> relying on its implementation. I'll switch to RCU since that seems to be
> preferred though. It's certainly cleaner to write.

RCU is widely used in this fashion. synchronize_sched() is explicitly
constructed to sync against preempt disable sections. This is not an
implementation detail.
Peter Zijlstra March 1, 2016, 1:26 p.m. UTC | #13
On Fri, Feb 26, 2016 at 08:17:46PM -0800, Steve Muckle wrote:
> > But then it would only make a difference if cpufreq_update_util() was not
> > used at all (ie. no callbacks installed for any policies by anyone).  The
> > only reason why it may matter is that the total number of systems using
> > the performance governor is quite large AFAICS and they would benefit from
> > that.
> 
> I'd think that's a benefit worth preserving, but I guess that's Peter
> and Ingo's call.

Probably worth it, most of this is on rather fast paths.

See commit 1cde2930e154 ("sched/preempt: Add static_key() to
preempt_notifiers") for example.
Peter Zijlstra March 1, 2016, 2:31 p.m. UTC | #14
On Sun, Feb 28, 2016 at 03:26:21AM +0100, Rafael J. Wysocki wrote:

> > > That said I'm unconvinced about the approach still.
> > > 
> > > Having more RT threads in a system that already is under RT pressure seems like
> > > a recipe for trouble.  Moreover, it's likely that those new RT threads will
> > > disturb the system's normal operation somehow even without the RT pressure;
> > > have you investigated that?
> > 
> > Sorry I'm not sure what you mean by disturb normal operation.
> 
> That would introduce a number of extra RT threads that would be woken up quite
> often and on a regular basis, so there would be some extra RT noise in the
> system, especially on systems with one CPU per cpufreq policy and many CPUs.
> 
> That's not present ATM and surely wouldn't be completely transparent.

Having RT tasks should not be a problem. You can always set their
priority such that they do not interfere with an actual RT workload.

> > Generally speaking, increasing the capacity of a heavily loaded system
> > seems to me to be something that should run urgently, so that the system
> > can potentially get itself out of trouble and meet the workload's needs.
> > 
> > > Also having them per policy may be overkill and
> > > binding them to policy CPUs only is not necessary.
> > > 
> > > Overall, it looks like a dynamic pool of threads that may run on every CPU
> > > might be a better approach, but that would almost duplicate the workqueues
> > > subsystem, so is it really worth it?
> > > 
> > > And is the problem actually visible in practice?  I have no record of any reports
> > > mentioning it, although theoretically it's been there forever, so had it been
> > > real, someone would have noticed it and complained about it IMO.
> > 
> > While I don't have a test case drawn up to provide, it seems like it'd be
> > easy to create one. More importantly the interactive governor in Android
> > uses this same kind of model, starting a frequency change thread and
> > making it RT. Android is particularly sensitive to latency in frequency
> > response. So that's likely one big reason why you're not hearing about
> > this issue - some folks have already worked around it.
> 
> OK, so Android is the reason. :-)
> 
> Fair enough.  I still think that care is needed here, though.

So I sort of see the point of having per-cpu RT kthread tasks to effect
the OPP change for CFS. And here I would indeed suggest just having a
task per cpu; moving tasks about is too complex and generates even more
noise.

But the problem of having RT tasks is that you should run RT tasks at
max OPP.

Now you can probably fudge things and not account these RT tasks to the
RT 'workload' since you know their characteristics etc. But it's going
to be ugly, I suspect.
Rafael J. Wysocki March 1, 2016, 8:32 p.m. UTC | #15
On Tue, Mar 1, 2016 at 3:31 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Sun, Feb 28, 2016 at 03:26:21AM +0100, Rafael J. Wysocki wrote:
>
>> > > That said I'm unconvinced about the approach still.
>> > >
>> > > Having more RT threads in a system that already is under RT pressure seems like
>> > > a recipe for trouble.  Moreover, it's likely that those new RT threads will
>> > > disturb the system's normal operation somehow even without the RT pressure;
>> > > have you investigated that?
>> >
>> > Sorry I'm not sure what you mean by disturb normal operation.
>>
>> That would introduce a number of extra RT threads that would be woken up quite
>> often and on a regular basis, so there would be some extra RT noise in the
>> system, especially on systems with one CPU per cpufreq policy and many CPUs.
>>
>> That's not present ATM and surely wouldn't be completely transparent.
>
> Having RT tasks should not be a problem. You can always set their
> priority such that they do not interfere with an actual RT workload.
>
>> > Generally speaking, increasing the capacity of a heavily loaded system
>> > seems to me to be something that should run urgently, so that the system
>> > can potentially get itself out of trouble and meet the workload's needs.
>> >
>> > > Also having them per policy may be overkill and
>> > > binding them to policy CPUs only is not necessary.
>> > >
>> > > Overall, it looks like a dynamic pool of threads that may run on every CPU
>> > > might be a better approach, but that would almost duplicate the workqueues
>> > > subsystem, so is it really worth it?
>> > >
>> > > And is the problem actually visible in practice?  I have no record of any reports
>> > > mentioning it, although theoretically it's been there forever, so had it been
>> > > real, someone would have noticed it and complained about it IMO.
>> >
>> > While I don't have a test case drawn up to provide, it seems like it'd be
>> > easy to create one. More importantly the interactive governor in Android
>> > uses this same kind of model, starting a frequency change thread and
>> > making it RT. Android is particularly sensitive to latency in frequency
>> > response. So that's likely one big reason why you're not hearing about
>> > this issue - some folks have already worked around it.
>>
>> OK, so Android is the reason. :-)
>>
>> Fair enough.  I still think that care is needed here, though.
>
> So I sort of see the point of having per-cpu RT kthread tasks to effect
> the OPP change for CFS. And here I would indeed suggest just having a
> task per cpu, moving tasks about is too complex and generates even more
> noise.
>
> But the problem of having RT tasks is that you should run RT tasks at
> max OPP.
>
> Now you can probably fudge things and not account these RT tasks to the
> RT 'workload' since you know their characteristics etc. But it's going
> to be ugly, I suspect.

Good point.
Michael Turquette March 2, 2016, 7:49 a.m. UTC | #16
Hi,

I'm still catching up on the plurality of scheduler/cpufreq threads but
I thought I would chime in with some historical reasons for why
cpufreq_sched.c looks the way it does today.

Quoting Steve Muckle (2016-02-25 16:34:23)
> On 02/24/2016 07:55 PM, Rafael J. Wysocki wrote:
> > On Monday, February 22, 2016 05:22:43 PM Steve Muckle wrote:
> > Besides, governors traditionally go to drivers/cpufreq.  Why is this different?
> 
> This predates my involvement but I'm guessing this was done because the
> long term vision was for cpufreq to eventually be removed from the
> picture. But it may be quite some time before that happens - I think
> communicating directly with drivers may be difficult due to the need to
> synchronize with thermal or userspace min/max requests etc, plus there's
> the fact that we've got a longstanding API exposed to userspace.
> 
> I'm happy to move this to drivers/cpufreq.

Originally it was put in kernel/sched/ because cpufreq_sched.c needed
access to kernel/sched/sched.h. Rafael's implementation flips the
push-pull relationship between the governor and scheduler so that this
is not necessary. If that requirement is removed from Steve's series
then there is no need to put the file outside of drivers/cpufreq.

...
> > One thing I personally like in the RCU-based approach is its universality.  The
> > callbacks may be installed by different entities in a uniform way: intel_pstate
> > can do that, the old governors can do that, my experimental schedutil code can
> > do that and your code could have done that too in principle.  And this is very
> > nice, because it is a common underlying mechanism that can be used by everybody
> > regardless of their particular implementations on the other side.
> > 
> > Why would I want to use something different, then?
> 
> I've got nothing against a callback registration mechanism. As you
> mentioned in another mail it could itself use static keys, enabling the
> static key when a callback is registered and disabling it otherwise to
> avoid calling into cpufreq_update_util().

I'm very happy to see the return of capacity/util ops. I had these in my
initial prototype back in 2014 but Peter shot them down, partially due to
the performance hit of indirect function call overhead in the fast path:

http://permalink.gmane.org/gmane.linux.kernel/1803410

...
> >> +
> >> +/*
> >> + * Capacity margin added to CFS and RT capacity requests to provide
> >> + * some head room if task utilization further increases.
> >> + */
> > 
> > OK, where does this number come from?
> 
> Someone's posterior :) .

Indeed, this margin is based on a posterior-derived value, in particular
the 25% imbalance_pct in struct sched_domain:

include/linux/sched.h:
struct sched_domain {
        /* These fields must be setup */
        ...
        unsigned int imbalance_pct;     /* No balance until over watermark */
...

and,

kernel/sched/core.c:
static struct sched_domain *
sd_init(struct sched_domain_topology_level *tl, int cpu)
{
        struct sched_domain *sd = *per_cpu_ptr(tl->data.sd, cpu);
        ...
        sd = (struct sched_domain){
                .min_interval           = sd_weight,
                .max_interval           = 2*sd_weight,
                .busy_factor            = 32,
                .imbalance_pct          = 125,
...

> 
> This really should be a tunable IMO, but there's a fairly strong
> anti-tunable sentiment, so it's been left hard-coded in an attempt to
> provide something that "just works."

It should definitely be tunable.

> 
> At the least I can add a comment saying that the 20% idle headroom
> requirement was an off the cuff estimate and that at this time, we don't
> have significant data to suggest it's the best number.

Yeah, it was just for prototyping.

...
> > Now, what really is the advantage of having those extra threads vs using
> > workqueues?

More historical anecdotes: The RT/DL stuff below is absolutely worth
talking about, but if your question is, "why did the design choose
kthreads over wq's", the answer is legacy.

Originally back in 2014 Morten played with queuing work up via wq within
schedule() context, but hit a problem where that caused re-entering
schedule() from schedule() and entering reentrancy hell and entering
reentrancy hell and entering reentrancy hell and entering reentrancy
hell ... well you get the idea.

I came up with an ugly workaround to wake up a dormant kthread with
wake_up_process() which was called from an arch-specific callback to
change cpu frequency and later Juri suggested to use irq_work to kick
the kthread, which is still the method used in Steve's patch series.

Seems that it is fine to toss work onto a workqueue from irq_work
context however, which is nice, except that the RT thread approach
yields better latency. I'll let you guys sort out the issues around
RT/DL starvation, which seems like a real issue.
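
For reference, the kick described above looks roughly like this in the
series (the struct is trimmed to the relevant fields, and the NULL check
Steve agreed to drop is omitted):

#include <linux/irq_work.h>
#include <linux/kernel.h>
#include <linux/sched.h>

struct gov_data {
	struct irq_work		irq_work;
	struct task_struct	*task;	/* per-policy frequency kthread */
	/* ... */
};

/* Hard irq context: a normal wakeup is safe here. */
static void cpufreq_sched_irq_work(struct irq_work *irq_work)
{
	struct gov_data *gd = container_of(irq_work, struct gov_data,
					   irq_work);

	wake_up_process(gd->task);
}

/*
 * Scheduler context (rq lock held): defer the wakeup to hard irq
 * context. gd->irq_work is set up in the governor start path with
 * init_irq_work(&gd->irq_work, cpufreq_sched_irq_work).
 */
static void cpufreq_sched_kick_thread(struct gov_data *gd)
{
	irq_work_queue(&gd->irq_work);
}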

> > 
> > I guess the underlying concern is that RT tasks may stall workqueues indefinitely
> > in theory and then the frequency won't be updated, but there's much more kernel
> > stuff run from workqueues and if that is starved, you won't get very far anyway.
> > 
> > If you take special measures to prevent frequency change requests from being
> > stalled by RT tasks, question is why are they so special?  Aren't there any
> > other kernel activities that also should be protected from that and may be
> > more important than CPU frequency changes?
> 
> I think updating the CPU frequency during periods of heavy RT/DL load is
> one of the most (if not the most) important things. I can't speak for
> other system activities that may get blocked, but there's an opportunity
> to protect CPU frequency changes here, and it seems worth taking to me.
> 
> > 
> > Plus if this really is the problem here, then it also affects the other cpufreq
> > governors, so maybe it should be solved for everybody in some common way?
> 
> Agreed, I'd think a freq change thread that serves frequency change
> requests would be a useful common component. The locking and throttling
> (slowpath_lock, finish_last_request()) are somewhat specific to this
> implementation, but could probably be done generically and maybe even
> used in other governors. If you're okay with it though I'd like to view
> that as a slightly longer term effort, as I think it would get unwieldy
> trying to do that as part of this initial change.

I do not have any data to back up a case for stalls caused by RT/DL
starvation, but conceptually I would say that latency is fundamentally
more important in a scheduler-driven cpu frequency selection scenario,
versus the legacy timer-based governors.

In the latter case we get woken up by a timer (prior to Rafael's recent
"cpufreq: governor: Replace timers with utilization update callbacks"
patch), we sample idleness/busyness, and change frequency, all in one go
and all from process context.

In the case of the scheduler selecting frequency in the hot path, with
hardware that takes a long time to transition frequency (and thus must
be done in a slow path), we want to minimize the time delta between the
scheduler picking a frequency and the thread that executes that change
actually being run.

In my over-simplified view of the scheduler, it would be great if we
could have a backdoor mechanism to place the frequency transition
kthread onto a runqueue from within the schedule() context and dispense
with the irq_work stuff in Steve's series altogether.

Regards,
Mike
Rafael J. Wysocki March 3, 2016, 2:49 a.m. UTC | #17
On Wed, Mar 2, 2016 at 8:49 AM, Michael Turquette
<mturquette@baylibre.com> wrote:
>

[cut]

> I do not have any data to back up a case for stalls caused by RT/DL
> starvation, but conceptually I would say that latency is fundamentally
> more important in a scheduler-driven cpu frequency selection scenario,
> versus the legacy timer-based governors.
>
> In the latter case we get woken up by a timer (prior to Rafael's recent
> "cpufreq: governor: Replace timers with utilization update callbacks"
> patch), we sample idleness/busyness, and change frequency, all in one go
> and all from process context.
>
> In the case of the scheduler selecting frequency in the hot path, with
> hardware that takes a long time to transition frequency (and thus must
> be done in a slow path), we want to minimize the time delta between the
> scheduler picking a frequency and the thread that executes that change
> actually being run.

That is a good point.  However, Peter's point about RT tasks having to
run at max util and affecting frequency control this way is good too.

I'm not actually sure if RT is the right answer here.  DL may be a
better choice.  After all, we want the thing to happen shortly, but
not necessarily at full speed.

So something like a DL workqueue would be quite useful here it seems.

> In my over-simplified view of the scheduler, it would be great if we
> could have a backdoor mechanism to place the frequency transition
> kthread onto a runqueue from within the schedule() context and dispense
> with the irq_work stuff in Steve's series altogether.

Well, I use irq_work() now in schedutil and ondemand/conservative too
for queuing up work items and it gets the job done.

Thanks,
Rafael
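
For reference, the irq_work pattern in question reduces to roughly the
following (a distilled sketch of the mechanism, not verbatim kernel
source; gov_data and the worker kthread are as in the patch below):

	/* Runs in hard-irq context; defer the slow work to the kthread. */
	static void freq_change_irq_work(struct irq_work *work)
	{
		struct gov_data *gd = container_of(work, struct gov_data,
						   irq_work);

		wake_up_process(gd->task);
	}

	/* Setup, once per policy: */
	init_irq_work(&gd->irq_work, freq_change_irq_work);

	/* From the scheduler hot path, with interrupts already disabled: */
	irq_work_queue_on(&gd->irq_work, cpu);

The handler runs in interrupt context shortly afterwards, so the actual
wakeup happens outside the scheduler hot path and its rq->lock critical
section.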
Steve Muckle March 3, 2016, 3:50 a.m. UTC | #18
On 03/02/2016 06:49 PM, Rafael J. Wysocki wrote:
> I'm not actually sure if RT is the right answer here.  DL may be a
> better choice.  After all, we want the thing to happen shortly, but
> not necessarily at full speed.
> 
> So something like a DL workqueue would be quite useful here it seems.

The DL idea seems like a good one to me.

It would also prevent cpufreq changes from being delayed by other RT or
DL tasks.

thanks,
Steve
Juri Lelli March 3, 2016, 9:34 a.m. UTC | #19
On 02/03/16 19:50, Steve Muckle wrote:
> On 03/02/2016 06:49 PM, Rafael J. Wysocki wrote:
> > I'm not actually sure if RT is the right answer here.  DL may be a
> > better choice.  After all, we want the thing to happen shortly, but
> > not necessarily at full speed.
> > 
> > So something like a DL workqueue would be quite useful here it seems.
> 
> The DL idea seems like a good one to me.
> 
> It would also prevent cpufreq changes from being delayed by other RT or
> DL tasks.
> 

Dimensioning period and runtime could require some experimenting, but
it's worth a try, I agree.

Best,

- Juri
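
A hedged sketch of what such a reservation might look like, assuming
the in-kernel sched_setattr() can be used for this, and with purely
illustrative runtime/period values standing in for the experimentation
mentioned above:

	struct sched_attr attr = {
		.size		= sizeof(attr),
		.sched_policy	= SCHED_DEADLINE,
		/* Placeholder budget: 100us of CPU time every 1ms. */
		.sched_runtime	= 100 * NSEC_PER_USEC,
		.sched_deadline	= 1000 * NSEC_PER_USEC,
		.sched_period	= 1000 * NSEC_PER_USEC,
	};

	/* Would replace the SCHED_FIFO setup in cpufreq_sched_thread(). */
	ret = sched_setattr(gd->task, &attr);

One wrinkle: deadline admission control constrains task affinity, so
this would need to be reconciled with the kthread_bind_mask() call in
the patch.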
Peter Zijlstra March 3, 2016, 1:03 p.m. UTC | #20
On Tue, Mar 01, 2016 at 11:49:10PM -0800, Michael Turquette wrote:
> 
> In my over-simplified view of the scheduler, it would be great if we
> could have a backdoor mechanism to place the frequency transition
> kthread onto a runqueue from within the schedule() context and dispense
> with the irq_work stuff in Steve's series altogether.

This is actually very very hard :/

So while there is something similar for workqueues,
try_to_wake_up_local(), that will not work for the cpufreq stuff.

The main problem is that schedule() is done with rq->lock held, while
wakeups need p->pi_lock, and it so happens that rq->lock nests inside of
p->pi_lock.

Now, the workqueue stuff with try_to_wake_up_local() can get away with
dropping rq->lock, because of where it is called, way early in
schedule() before we really muck things up.

The cpufreq hook otoh is called all over the place.

The second problem is that doing a wakeup will in fact also end up
calling the cpufreq hook, so you're back in recursion hell.

The third problem is that cpufreq is called from wakeups, which would
want to do another wakeup (see point 2), but this also means we have to
nest p->pi_lock, and we can't really do that either.
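
In code terms the established nesting is roughly this (a simplified
sketch, not verbatim kernel source):

	/* Established order: p->pi_lock is taken outside rq->lock. */
	raw_spin_lock_irqsave(&p->pi_lock, flags);	/* try_to_wake_up() */
	rq = __task_rq_lock(p);			/* rq->lock nests inside */

	/*
	 * schedule() already runs with rq->lock held, so waking the
	 * frequency kthread from a cpufreq hook called inside it would
	 * mean taking p->pi_lock while holding rq->lock -- the inverse
	 * order, and hence a potential deadlock.
	 */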


Ingo Molnar March 3, 2016, 2:21 p.m. UTC | #21
* Steve Muckle <steve.muckle@linaro.org> wrote:

> From: Michael Turquette <mturquette@baylibre.com>
> 
> Scheduler-driven CPU frequency selection hopes to exploit both
> per-task and global information in the scheduler to improve frequency
> selection policy, achieving lower power consumption, improved
> responsiveness/performance, and less reliance on heuristics and
> tunables. For further discussion on the motivation of this integration
> see [0].
> 
> This patch implements a shim layer between the Linux scheduler and the
> cpufreq subsystem. The interface accepts capacity requests from the
> CFS, RT and deadline sched classes. The requests from each sched class
> are summed on each CPU with a margin applied to the CFS and RT
> capacity requests to provide some headroom. Deadline requests are
> expected to be precise enough given their nature to not require
> headroom. The maximum total capacity request for a CPU in a frequency
> domain drives the requested frequency for that domain.
> 
> Policy is determined by both the sched classes and this shim layer.
> 
> Note that this algorithm is event-driven. There is no polling loop to
> check cpu idle time nor any other method which is unsynchronized with
> the scheduler, aside from an optional throttling mechanism.
> 
> Thanks to Juri Lelli <juri.lelli@arm.com> for contributing design ideas,
> code and test results, and to Ricky Liang <jcliang@chromium.org>
> for initialization and static key inc/dec fixes.
> 
> [0] http://article.gmane.org/gmane.linux.kernel/1499836
> 
> [smuckle@linaro.org: various additions and fixes, revised commit text]
> 
> CC: Ricky Liang <jcliang@chromium.org>
> Signed-off-by: Michael Turquette <mturquette@baylibre.com>
> Signed-off-by: Juri Lelli <juri.lelli@arm.com>
> Signed-off-by: Steve Muckle <smuckle@linaro.org>
> ---
>  drivers/cpufreq/Kconfig      |  21 ++
>  include/linux/cpufreq.h      |   3 +
>  include/linux/sched.h        |   8 +
>  kernel/sched/Makefile        |   1 +
>  kernel/sched/cpufreq_sched.c | 459 +++++++++++++++++++++++++++++++++++++++++++

Please rename this to kernel/sched/cpufreq.c - no need to say 'sched' twice! :-)

>  kernel/sched/sched.h         |  51 +++++
>  6 files changed, 543 insertions(+)
>  create mode 100644 kernel/sched/cpufreq_sched.c

So I really like how you push all high level code into kernel/sched/cpufreq.c and 
use the cpufreq drivers only for actual low level frequency switching.

It would be nice to converge this code with the code from Rafael:

   [PATCH 0/6] cpufreq: schedutil governor

i.e. use scheduler internal metrics within the scheduler, and create a clear 
interface between low level cpufreq drivers and the cpufreq code living in the 
scheduler.

Thanks,

	Ingo
Patch

diff --git a/drivers/cpufreq/Kconfig b/drivers/cpufreq/Kconfig
index 659879a..82d1548 100644
--- a/drivers/cpufreq/Kconfig
+++ b/drivers/cpufreq/Kconfig
@@ -102,6 +102,14 @@  config CPU_FREQ_DEFAULT_GOV_CONSERVATIVE
 	  Be aware that not all cpufreq drivers support the conservative
 	  governor. If unsure have a look at the help section of the
 	  driver. Fallback governor will be the performance governor.
+
+config CPU_FREQ_DEFAULT_GOV_SCHED
+	bool "sched"
+	select CPU_FREQ_GOV_SCHED
+	help
+	  Use the CPUfreq governor 'sched' as default. This scales
+	  cpu frequency using CPU utilization estimates from the
+	  scheduler.
 endchoice
 
 config CPU_FREQ_GOV_PERFORMANCE
@@ -183,6 +191,19 @@  config CPU_FREQ_GOV_CONSERVATIVE
 
 	  If in doubt, say N.
 
+config CPU_FREQ_GOV_SCHED
+	bool "'sched' cpufreq governor"
+	depends on CPU_FREQ
+	select CPU_FREQ_GOV_COMMON
+	select IRQ_WORK
+	help
+	  'sched' - this governor scales cpu frequency from the
+	  scheduler as a function of cpu capacity utilization. It does
+	  not evaluate utilization on a periodic basis (as ondemand
+	  does) but instead is event-driven by the scheduler.
+
+	  If in doubt, say N.
+
 comment "CPU frequency scaling drivers"
 
 config CPUFREQ_DT
diff --git a/include/linux/cpufreq.h b/include/linux/cpufreq.h
index 93e1c1c..ce8b895 100644
--- a/include/linux/cpufreq.h
+++ b/include/linux/cpufreq.h
@@ -495,6 +495,9 @@  extern struct cpufreq_governor cpufreq_gov_ondemand;
 #elif defined(CONFIG_CPU_FREQ_DEFAULT_GOV_CONSERVATIVE)
 extern struct cpufreq_governor cpufreq_gov_conservative;
 #define CPUFREQ_DEFAULT_GOVERNOR	(&cpufreq_gov_conservative)
+#elif defined(CONFIG_CPU_FREQ_DEFAULT_GOV_SCHED)
+extern struct cpufreq_governor cpufreq_gov_sched;
+#define CPUFREQ_DEFAULT_GOVERNOR	(&cpufreq_gov_sched)
 #endif
 
 /*********************************************************************
diff --git a/include/linux/sched.h b/include/linux/sched.h
index a292c4b..27a6cd8 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -937,6 +937,14 @@  enum cpu_idle_type {
 #define SCHED_CAPACITY_SHIFT	10
 #define SCHED_CAPACITY_SCALE	(1L << SCHED_CAPACITY_SHIFT)
 
+struct sched_capacity_reqs {
+	unsigned long cfs;
+	unsigned long rt;
+	unsigned long dl;
+
+	unsigned long total;
+};
+
 /*
  * Wake-queues are lists of tasks with a pending wakeup, whose
  * callers have already marked the task as woken internally,
diff --git a/kernel/sched/Makefile b/kernel/sched/Makefile
index 6768797..90ed832 100644
--- a/kernel/sched/Makefile
+++ b/kernel/sched/Makefile
@@ -19,3 +19,4 @@  obj-$(CONFIG_SCHED_AUTOGROUP) += auto_group.o
 obj-$(CONFIG_SCHEDSTATS) += stats.o
 obj-$(CONFIG_SCHED_DEBUG) += debug.o
 obj-$(CONFIG_CGROUP_CPUACCT) += cpuacct.o
+obj-$(CONFIG_CPU_FREQ_GOV_SCHED) += cpufreq_sched.o
diff --git a/kernel/sched/cpufreq_sched.c b/kernel/sched/cpufreq_sched.c
new file mode 100644
index 0000000..a113e4e
--- /dev/null
+++ b/kernel/sched/cpufreq_sched.c
@@ -0,0 +1,459 @@ 
+/*
+ *  Copyright (C)  2015 Michael Turquette <mturquette@linaro.org>
+ *  Copyright (C)  2015-2016 Steve Muckle <smuckle@linaro.org>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include <linux/cpufreq.h>
+#include <linux/module.h>
+#include <linux/kthread.h>
+#include <linux/percpu.h>
+#include <linux/irq_work.h>
+#include <linux/delay.h>
+#include <linux/string.h>
+
+#include "sched.h"
+
+struct static_key __read_mostly __sched_freq = STATIC_KEY_INIT_FALSE;
+static bool __read_mostly cpufreq_driver_slow;
+
+/*
+ * The number of enabled schedfreq policies is modified during GOV_START/STOP.
+ * It, along with whether the schedfreq static key is enabled, is protected by
+ * the gov_enable_lock.
+ */
+static int enabled_policies;
+static DEFINE_MUTEX(gov_enable_lock);
+
+#ifndef CONFIG_CPU_FREQ_DEFAULT_GOV_SCHED
+static struct cpufreq_governor cpufreq_gov_sched;
+#endif
+
+/*
+ * Capacity margin added to CFS and RT capacity requests to provide
+ * some head room if task utilization further increases.
+ */
+unsigned int capacity_margin = 1280;
+
+static DEFINE_PER_CPU(struct gov_data *, cpu_gov_data);
+DEFINE_PER_CPU(struct sched_capacity_reqs, cpu_sched_capacity_reqs);
+
+/**
+ * gov_data - per-policy data internal to the governor
+ * @throttle: next throttling period expiry. Derived from throttle_nsec
+ * @throttle_nsec: throttle period length in nanoseconds
+ * @task: worker thread for dvfs transition that may block/sleep
+ * @irq_work: callback used to wake up worker thread
+ * @policy: pointer to cpufreq policy associated with this governor data
+ * @fastpath_lock: prevents multiple CPUs in a frequency domain from racing
+ * with each other in fast path during calculation of domain frequency
+ * @slowpath_lock: mutex used to synchronize with slow path - ensure policy
+ * remains enabled, and eliminate racing between slow and fast path
+ * @enabled: boolean value indicating that the policy is started, protected
+ * by the slowpath_lock
+ * @requested_freq: last frequency requested by the sched governor
+ *
+ * struct gov_data is the per-policy cpufreq_sched-specific data
+ * structure. A per-policy instance of it is created when the
+ * cpufreq_sched governor receives the CPUFREQ_GOV_POLICY_INIT
+ * condition and a pointer to it exists in the gov_data member of
+ * struct cpufreq_policy.
+ */
+struct gov_data {
+	ktime_t throttle;
+	unsigned int throttle_nsec;
+	struct task_struct *task;
+	struct irq_work irq_work;
+	struct cpufreq_policy *policy;
+	raw_spinlock_t fastpath_lock;
+	struct mutex slowpath_lock;
+	unsigned int enabled;
+	unsigned int requested_freq;
+};
+
+static void cpufreq_sched_try_driver_target(struct cpufreq_policy *policy,
+					    unsigned int freq)
+{
+	struct gov_data *gd = policy->governor_data;
+
+	__cpufreq_driver_target(policy, freq, CPUFREQ_RELATION_L);
+	gd->throttle = ktime_add_ns(ktime_get(), gd->throttle_nsec);
+}
+
+static bool finish_last_request(struct gov_data *gd)
+{
+	ktime_t now = ktime_get();
+
+	if (ktime_after(now, gd->throttle))
+		return false;
+
+	while (1) {
+		int usec_left = ktime_to_ns(ktime_sub(gd->throttle, now));
+
+		usec_left /= NSEC_PER_USEC;
+		usleep_range(usec_left, usec_left + 100);
+		now = ktime_get();
+		if (ktime_after(now, gd->throttle))
+			return true;
+	}
+}
+
+static int cpufreq_sched_thread(void *data)
+{
+	struct sched_param param;
+	struct gov_data *gd = (struct gov_data*) data;
+	struct cpufreq_policy *policy = gd->policy;
+	unsigned int new_request = 0;
+	unsigned int last_request = 0;
+	int ret;
+
+	param.sched_priority = 50;
+	ret = sched_setscheduler_nocheck(gd->task, SCHED_FIFO, &param);
+	if (ret) {
+		pr_warn("%s: failed to set SCHED_FIFO\n", __func__);
+		do_exit(-EINVAL);
+	} else {
+		pr_debug("%s: kthread (%d) set to SCHED_FIFO\n",
+				__func__, gd->task->pid);
+	}
+
+	mutex_lock(&gd->slowpath_lock);
+
+	while (true) {
+		set_current_state(TASK_INTERRUPTIBLE);
+		if (kthread_should_stop()) {
+			set_current_state(TASK_RUNNING);
+			break;
+		}
+		new_request = gd->requested_freq;
+		if (!gd->enabled || new_request == last_request) {
+			mutex_unlock(&gd->slowpath_lock);
+			schedule();
+			mutex_lock(&gd->slowpath_lock);
+		} else {
+			set_current_state(TASK_RUNNING);
+			/*
+			 * if the frequency thread sleeps while waiting to be
+			 * unthrottled, start over to check for a newer request
+			 */
+			if (finish_last_request(gd))
+				continue;
+			last_request = new_request;
+			cpufreq_sched_try_driver_target(policy, new_request);
+		}
+	}
+
+	mutex_unlock(&gd->slowpath_lock);
+
+	return 0;
+}
+
+static void cpufreq_sched_irq_work(struct irq_work *irq_work)
+{
+	struct gov_data *gd;
+
+	gd = container_of(irq_work, struct gov_data, irq_work);
+	if (!gd)
+		return;
+
+	wake_up_process(gd->task);
+}
+
+static void update_fdomain_capacity_request(int cpu)
+{
+	unsigned int freq_new, index_new, cpu_tmp;
+	struct cpufreq_policy *policy;
+	struct gov_data *gd = per_cpu(cpu_gov_data, cpu);
+	unsigned long capacity = 0;
+
+	if (!gd)
+		return;
+
+	/* interrupts already disabled here via rq locked */
+	raw_spin_lock(&gd->fastpath_lock);
+
+	policy = gd->policy;
+
+	for_each_cpu(cpu_tmp, policy->cpus) {
+		struct sched_capacity_reqs *scr;
+
+		scr = &per_cpu(cpu_sched_capacity_reqs, cpu_tmp);
+		capacity = max(capacity, scr->total);
+	}
+
+	freq_new = capacity * policy->max >> SCHED_CAPACITY_SHIFT;
+
+	/*
+	 * Calling this without locking policy->rwsem means we race
+	 * against changes with policy->min and policy->max. This should
+	 * be okay though.
+	 */
+	if (cpufreq_frequency_table_target(policy, policy->freq_table,
+					   freq_new, CPUFREQ_RELATION_L,
+					   &index_new))
+		goto out;
+	freq_new = policy->freq_table[index_new].frequency;
+
+	if (freq_new == gd->requested_freq)
+		goto out;
+
+	gd->requested_freq = freq_new;
+
+	if (cpufreq_driver_slow || !mutex_trylock(&gd->slowpath_lock)) {
+		irq_work_queue_on(&gd->irq_work, cpu);
+	} else if (policy->transition_ongoing ||
+		   ktime_before(ktime_get(), gd->throttle)) {
+		mutex_unlock(&gd->slowpath_lock);
+		irq_work_queue_on(&gd->irq_work, cpu);
+	} else {
+		cpufreq_sched_try_driver_target(policy, freq_new);
+		mutex_unlock(&gd->slowpath_lock);
+	}
+
+out:
+	raw_spin_unlock(&gd->fastpath_lock);
+}
+
+void update_cpu_capacity_request(int cpu, bool request)
+{
+	unsigned long new_capacity;
+	struct sched_capacity_reqs *scr;
+
+	/* The rq lock serializes access to the CPU's sched_capacity_reqs. */
+	lockdep_assert_held(&cpu_rq(cpu)->lock);
+
+	scr = &per_cpu(cpu_sched_capacity_reqs, cpu);
+
+	new_capacity = scr->cfs + scr->rt;
+	new_capacity = new_capacity * capacity_margin
+		/ SCHED_CAPACITY_SCALE;
+	new_capacity += scr->dl;
+
+	if (new_capacity == scr->total)
+		return;
+
+	scr->total = new_capacity;
+	if (request)
+		update_fdomain_capacity_request(cpu);
+}
+
+static ssize_t show_throttle_nsec(struct cpufreq_policy *policy, char *buf)
+{
+	struct gov_data *gd = policy->governor_data;
+	return sprintf(buf, "%u\n", gd->throttle_nsec);
+}
+
+static ssize_t store_throttle_nsec(struct cpufreq_policy *policy,
+				   const char *buf, size_t count)
+{
+	struct gov_data *gd = policy->governor_data;
+	unsigned int input;
+	int ret;
+
+	ret = sscanf(buf, "%u", &input);
+
+	if (ret != 1)
+		return -EINVAL;
+
+	gd->throttle_nsec = input;
+	return count;
+}
+
+static struct freq_attr sched_freq_throttle_nsec_attr =
+	__ATTR(throttle_nsec, 0644, show_throttle_nsec, store_throttle_nsec);
+
+static struct attribute *sched_freq_sysfs_attribs[] = {
+	&sched_freq_throttle_nsec_attr.attr,
+	NULL
+};
+
+static struct attribute_group sched_freq_sysfs_group = {
+	.attrs = sched_freq_sysfs_attribs,
+	.name = "sched_freq",
+};
+
+static int cpufreq_sched_policy_init(struct cpufreq_policy *policy)
+{
+	struct gov_data *gd;
+	int ret;
+
+	gd = kzalloc(sizeof(*gd), GFP_KERNEL);
+	if (!gd)
+		return -ENOMEM;
+	policy->governor_data = gd;
+	gd->policy = policy;
+	raw_spin_lock_init(&gd->fastpath_lock);
+	mutex_init(&gd->slowpath_lock);
+
+	ret = sysfs_create_group(&policy->kobj, &sched_freq_sysfs_group);
+	if (ret)
+		goto err_mem;
+
+	/*
+	 * Set up schedfreq thread for slow path freq transitions if
+	 * required by the driver.
+	 */
+	if (cpufreq_driver_is_slow()) {
+		cpufreq_driver_slow = true;
+		gd->task = kthread_create(cpufreq_sched_thread, gd,
+					  "kschedfreq:%d",
+					  cpumask_first(policy->related_cpus));
+		if (IS_ERR_OR_NULL(gd->task)) {
+			pr_err("%s: failed to create kschedfreq thread\n",
+			       __func__);
+			goto err_sysfs;
+		}
+		get_task_struct(gd->task);
+		kthread_bind_mask(gd->task, policy->related_cpus);
+		wake_up_process(gd->task);
+		init_irq_work(&gd->irq_work, cpufreq_sched_irq_work);
+	}
+	return 0;
+
+err_sysfs:
+	sysfs_remove_group(&policy->kobj, &sched_freq_sysfs_group);
+err_mem:
+	policy->governor_data = NULL;
+	kfree(gd);
+	return -ENOMEM;
+}
+
+static int cpufreq_sched_policy_exit(struct cpufreq_policy *policy)
+{
+	struct gov_data *gd = policy->governor_data;
+
+	/* Stop the schedfreq thread associated with this policy. */
+	if (cpufreq_driver_slow) {
+		kthread_stop(gd->task);
+		put_task_struct(gd->task);
+	}
+	sysfs_remove_group(&policy->kobj, &sched_freq_sysfs_group);
+	policy->governor_data = NULL;
+	kfree(gd);
+	return 0;
+}
+
+static int cpufreq_sched_start(struct cpufreq_policy *policy)
+{
+	struct gov_data *gd = policy->governor_data;
+	int cpu;
+
+	/*
+	 * The schedfreq static key is managed here so the global schedfreq
+	 * lock must be taken - a per-policy lock such as policy->rwsem is
+	 * not sufficient.
+	 */
+	mutex_lock(&gov_enable_lock);
+
+	gd->enabled = 1;
+
+	/*
+	 * Set up percpu information. Writing the percpu gd pointer will
+	 * enable the fast path if the static key is already enabled.
+	 */
+	for_each_cpu(cpu, policy->cpus) {
+		memset(&per_cpu(cpu_sched_capacity_reqs, cpu), 0,
+		       sizeof(struct sched_capacity_reqs));
+		per_cpu(cpu_gov_data, cpu) = gd;
+	}
+
+	if (enabled_policies == 0)
+		static_key_slow_inc(&__sched_freq);
+	enabled_policies++;
+	mutex_unlock(&gov_enable_lock);
+
+	return 0;
+}
+
+static void dummy(void *info) {}
+
+static int cpufreq_sched_stop(struct cpufreq_policy *policy)
+{
+	struct gov_data *gd = policy->governor_data;
+	int cpu;
+
+	/*
+	 * The schedfreq static key is managed here so the global schedfreq
+	 * lock must be taken - a per-policy lock such as policy->rwsem is
+	 * not sufficient.
+	 */
+	mutex_lock(&gov_enable_lock);
+
+	/*
+	 * The governor stop path may or may not hold policy->rwsem. There
+	 * must be synchronization with the slow path however.
+	 */
+	mutex_lock(&gd->slowpath_lock);
+
+	/*
+	 * Stop new entries into the hot path for all CPUs. This will
+	 * potentially affect other policies which are still running but
+	 * this is an infrequent operation.
+	 */
+	static_key_slow_dec(&__sched_freq);
+	enabled_policies--;
+
+	/*
+	 * Ensure that all CPUs currently part of this policy are out
+	 * of the hot path so that if this policy exits we can free gd.
+	 */
+	preempt_disable();
+	smp_call_function_many(policy->cpus, dummy, NULL, true);
+	preempt_enable();
+
+	/*
+	 * Other CPUs in other policies may still have the schedfreq
+	 * static key enabled. The percpu gd is used to signal which
+	 * CPUs are enabled in the sched gov during the hot path.
+	 */
+	for_each_cpu(cpu, policy->cpus)
+		per_cpu(cpu_gov_data, cpu) = NULL;
+
+	/* Pause the slow path for this policy. */
+	gd->enabled = 0;
+
+	if (enabled_policies)
+		static_key_slow_inc(&__sched_freq);
+	mutex_unlock(&gd->slowpath_lock);
+	mutex_unlock(&gov_enable_lock);
+
+	return 0;
+}
+
+static int cpufreq_sched_setup(struct cpufreq_policy *policy,
+			       unsigned int event)
+{
+	switch (event) {
+	case CPUFREQ_GOV_POLICY_INIT:
+		return cpufreq_sched_policy_init(policy);
+	case CPUFREQ_GOV_POLICY_EXIT:
+		return cpufreq_sched_policy_exit(policy);
+	case CPUFREQ_GOV_START:
+		return cpufreq_sched_start(policy);
+	case CPUFREQ_GOV_STOP:
+		return cpufreq_sched_stop(policy);
+	case CPUFREQ_GOV_LIMITS:
+		break;
+	}
+	return 0;
+}
+
+#ifndef CONFIG_CPU_FREQ_DEFAULT_GOV_SCHED
+static
+#endif
+struct cpufreq_governor cpufreq_gov_sched = {
+	.name			= "sched",
+	.governor		= cpufreq_sched_setup,
+	.owner			= THIS_MODULE,
+};
+
+static int __init cpufreq_sched_init(void)
+{
+	return cpufreq_register_governor(&cpufreq_gov_sched);
+}
+
+/* Try to make this the default governor */
+fs_initcall(cpufreq_sched_init);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 1d58387..17908dd 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1384,6 +1384,57 @@  unsigned long arch_scale_cpu_capacity(struct sched_domain *sd, int cpu)
 }
 #endif
 
+#ifdef CONFIG_CPU_FREQ_GOV_SCHED
+extern unsigned int capacity_margin;
+extern struct static_key __sched_freq;
+
+static inline bool sched_freq(void)
+{
+	return static_key_false(&__sched_freq);
+}
+
+DECLARE_PER_CPU(struct sched_capacity_reqs, cpu_sched_capacity_reqs);
+void update_cpu_capacity_request(int cpu, bool request);
+
+static inline void set_cfs_cpu_capacity(int cpu, bool request,
+					unsigned long capacity)
+{
+	if (per_cpu(cpu_sched_capacity_reqs, cpu).cfs != capacity) {
+		per_cpu(cpu_sched_capacity_reqs, cpu).cfs = capacity;
+		update_cpu_capacity_request(cpu, request);
+	}
+}
+
+static inline void set_rt_cpu_capacity(int cpu, bool request,
+				       unsigned long capacity)
+{
+	if (per_cpu(cpu_sched_capacity_reqs, cpu).rt != capacity) {
+		per_cpu(cpu_sched_capacity_reqs, cpu).rt = capacity;
+		update_cpu_capacity_request(cpu, request);
+	}
+}
+
+static inline void set_dl_cpu_capacity(int cpu, bool request,
+				       unsigned long capacity)
+{
+	if (per_cpu(cpu_sched_capacity_reqs, cpu).dl != capacity) {
+		per_cpu(cpu_sched_capacity_reqs, cpu).dl = capacity;
+		update_cpu_capacity_request(cpu, request);
+	}
+}
+#else
+static inline bool sched_freq(void) { return false; }
+static inline void set_cfs_cpu_capacity(int cpu, bool request,
+					unsigned long capacity)
+{ }
+static inline void set_rt_cpu_capacity(int cpu, bool request,
+				       unsigned long capacity)
+{ }
+static inline void set_dl_cpu_capacity(int cpu, bool request,
+				       unsigned long capacity)
+{ }
+#endif
+
 static inline void sched_rt_avg_update(struct rq *rq, u64 rt_delta)
 {
 	rq->rt_avg += rt_delta * arch_scale_freq_capacity(NULL, cpu_of(rq));