Message ID | 1303807401-16350-1-git-send-email-myungjoo.ham@samsung.com (mailing list archive) |
---|---|
State | RFC, archived |
Headers | show |
Hi, To start with let me say I don't have any fundamental objections at this point, although I'm not 100% sure I understand correctly how this feature is supposed to be used. My understanding is that if a device is known to have multiple OPPs, its subsystem and driver may call dvfs_add_device() for it, which will cause it to be periodically monitored and put into the "optimum" OPP using the infrastructure introduced by your patch (on the basis of the given usage profile and with the help of the given governor). Is that correct? That said I have some specific comments below. On Tuesday, April 26, 2011, MyungJoo Ham wrote: > With OPPs, a device may have multiple operable frequency and voltage > sets. However, there can be multiple possible operable sets and a system > will need to choose one from them. In order to reduce the power > consumption (by reducing frequency and voltage) without affecting the > performance too much, a DVFS scheme may be used. > > This patch introduces the DVFS capability to non-CPU devices with OPPs. Well, I'd like the DVFS acronym to be expanded at least once somewhere, possibly in the comments describing the new files and necessarily in the Kconfig option description. > In this RTC version, we do not have specific gorvernors or device RFC I guess? > support, yet. However, we plan to test and use DVFS scheme on > DRAM/BUS/GPU for our test boards (Exynos4-NURI), which have DVFS > capability and have seperated DVFS-like drivers without any common code. > > The generic DVFS may appear quite similar with /drivers/cpufreq. > However, CPUFREQ does not allow to have multiple devices registered and > is not suitable to have multiple heterogenous devices with different > (but simple) governors. > > Signed-off-by: MyungJoo Ham <myungjoo.ham@samsung.com> > Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com> > --- > drivers/base/power/Makefile | 1 + > drivers/base/power/dvfs.c | 401 +++++++++++++++++++++++++++++++++++++++++++ > drivers/base/power/opp.c | 7 + > include/linux/dvfs.h | 69 ++++++++ > kernel/power/Kconfig | 9 + > 5 files changed, 487 insertions(+), 0 deletions(-) > create mode 100644 drivers/base/power/dvfs.c > create mode 100644 include/linux/dvfs.h > > diff --git a/drivers/base/power/Makefile b/drivers/base/power/Makefile > index 118c1b9..4d92d7f 100644 > --- a/drivers/base/power/Makefile > +++ b/drivers/base/power/Makefile > @@ -3,6 +3,7 @@ obj-$(CONFIG_PM_SLEEP) += main.o wakeup.o > obj-$(CONFIG_PM_RUNTIME) += runtime.o > obj-$(CONFIG_PM_TRACE_RTC) += trace.o > obj-$(CONFIG_PM_OPP) += opp.o > +obj-$(CONFIG_PM_GENERIC_DVFS) += dvfs.o > > ccflags-$(CONFIG_DEBUG_DRIVER) := -DDEBUG > ccflags-$(CONFIG_PM_VERBOSE) += -DDEBUG > diff --git a/drivers/base/power/dvfs.c b/drivers/base/power/dvfs.c > new file mode 100644 > index 0000000..867ad70 > --- /dev/null > +++ b/drivers/base/power/dvfs.c > @@ -0,0 +1,401 @@ > +/* > + * Generic DVFS Interface for Non-CPU Devices > + * Based on OPP. > + * > + * Copyright (C) 2011 Samsung Electronics > + * MyungJoo Ham <myungjoo.ham@samsung.com> > + * > + * This program is free software; you can redistribute it and/or modify > + * it under the terms of the GNU General Public License version 2 as > + * published by the Free Software Foundation. > + */ > + > +#include <linux/kernel.h> > +#include <linux/errno.h> > +#include <linux/err.h> > +#include <linux/init.h> > +#include <linux/slab.h> > +#include <linux/opp.h> > +#include <linux/dvfs.h> > +#include <linux/workqueue.h> > +#include <linux/platform_device.h> > +#include <linux/list.h> > + > +#define DVFS_INTERVAL 20 You should state in a comment what's the unit of this and explain the choice of the value. > +/** > + * struct dvfs - Device DVFS structure > + * @node list node - contains the devices with DVFS that have been > + * registered. > + * @dev device pointer > + * @profile device-specific dvfs profile > + * @governor method how to choose frequency based on the usage. > + * @previous_freq previously configured frequency value. > + * @next_polling the number of remaining "dvfs_monitor" executions to > + * reevaluate frequency/voltage of the device. Set by > + * profile's polling_ms interval. > + * @tickle positive if DVFS-tickling is activated for the device. > + * at each executino of dvfs_monitor, tickle is decremented. > + * User may tickle a device-dvfs in order to set maximum > + * frequency instaneously with some guaranteed duration. > + * > + * This structure stores the DVFS information for a give device. > + */ > +struct dvfs { > + struct list_head node; > + > + struct device *dev; > + struct dvfs_device_profile *profile; > + struct dvfs_governor *governor; > + > + unsigned long previous_freq; > + unsigned int next_polling; > + unsigned int tickle; > +}; > + I'd move the structure definition to the header file you introduce below. Even though it's supposed to be internal, it's better to have it defined next to the other structures for the benefit of the people who will try to understand your code in the future. > +/* > + * dvfs_work periodically (given by DVFS_INTERVAL) monitors every > + * registered device. > + */ > +static struct delayed_work dvfs_work; > +/* The list of all device-dvfs */ > +static LIST_HEAD(dvfs_list); > +/* Exclusive access to dvfs_list and its elements */ > +static DEFINE_MUTEX(dvfs_list_lock); > + > +/** > + * find_device_dvfs() - find dvfs struct using device pointer > + * @dev: device pointer used to lookup device DVFS. > + * > + * Search the list of device DVFSs and return the matched device's DVFS info. > + */ > +static struct dvfs *find_device_dvfs(struct device *dev) > +{ > + struct dvfs *tmp_dvfs, *dvfs = ERR_PTR(-ENODEV); > + > + if (unlikely(IS_ERR_OR_NULL(dev))) { > + pr_err("%s: Invalid parameters\n", __func__); > + return ERR_PTR(-EINVAL); > + } > + > + list_for_each_entry(tmp_dvfs, &dvfs_list, node) { > + if (tmp_dvfs->dev == dev) { > + dvfs = tmp_dvfs; > + break; > + } > + } > + > + return dvfs; > +} > + > +/** > + * dvfs_next_polling() - the number of dvfs-monitor iterations to satisfy > + * the given polling interval. The interval is rounded by > + * dvfs-monitor poling interval (DVFS_INTERVAL) with ceiling > + * function. > + * @ms: the length of polling interval in milli-second > + */ > +static unsigned int dvfs_next_polling(int ms) > +{ > + unsigned int ret; > + > + if (ms <= 0) > + return 0; > + > + ret = ms / DVFS_INTERVAL; > + if (ms % DVFS_INTERVAL) > + ret++; > + > + return ret; > +} It seems you could use DIV_ROUND_UP instead of this whole function. > + > +/** > + * dvfs_do() - Check the usage profile of a given device and configure > + * frequency and voltage accordingly > + * @dvfs: DVFS info of the given device > + */ > +static int dvfs_do(struct dvfs *dvfs) > +{ > + struct dvfs_usage_profile profile; > + struct opp *opp; > + unsigned long freq; > + int err; > + > + err = dvfs->profile->get_usage_profile(dvfs->dev, &profile); > + if (err) { > + dev_err(dvfs->dev, "%s: Cannot get usage profile (%d)\n", > + __func__, err); > + return err; > + } > + > + err = dvfs->governor->get_target_freq(&profile, dvfs->governor, &freq); It may be better to define ->get_target_freq() so that it takes devfs as an arg. In that case you'd need to pass fewer pointers and the governor would be able to access all of its data anyway. > + if (err) { > + dev_err(dvfs->dev, "%s: Cannot get target frequency (%d)\n", > + __func__, err); > + return err; > + } > + > + opp = opp_find_freq_ceil(dvfs->dev, &freq); > + if (opp == ERR_PTR(-ENODEV)) > + opp = opp_find_freq_floor(dvfs->dev, &freq); > + > + if (IS_ERR(opp)) { > + dev_err(dvfs->dev, "%s: Cannot find opp with %luHz.\n", > + __func__, freq); > + return PTR_ERR(opp); > + } > + > + freq = opp_get_freq(opp); > + if (dvfs->previous_freq != freq) { > + err = dvfs->profile->target(dvfs->dev, freq, > + opp_get_voltage(opp)); > + dvfs->previous_freq = freq; > + } > + > + if (err) > + dev_err(dvfs->dev, "%s: Cannot set %luHz/%luuV\n", __func__, > + opp_get_freq(opp), opp_get_voltage(opp)); > + return err; > +} I'd generally consider using fewer error messages. While they may be good for debugging, they have a potential of spamming the kernel's message buffer in case of a problem on a production system. Also they don't really make the code cleaner, so to speak. :-) > + > +/** > + * dvfs_monitor() - Regularly run dvfs_do() and support device DVFS tickle. > + * @work: the work struct used to run dvfs_monitor periodically. > + */ > +static void dvfs_monitor(struct work_struct *work) > +{ > + struct dvfs *dvfs; > + > + mutex_lock(&dvfs_list_lock); > + > + list_for_each_entry(dvfs, &dvfs_list, node) { > + if (dvfs->next_polling == 0) > + continue; > + if (dvfs->tickle) { > + dvfs->tickle--; > + continue; > + } > + if (dvfs->next_polling == 1) { > + dvfs_do(dvfs); > + dvfs->next_polling = dvfs_next_polling( > + dvfs->profile->polling_ms); > + } else { > + dvfs->next_polling--; > + } > + } > + > + schedule_delayed_work(&dvfs_work, msecs_to_jiffies(DVFS_INTERVAL)); > + mutex_unlock(&dvfs_list_lock); > +} > + > +/** > + * dvfs_add_device() - Add dvfs feature to the device > + * @dev: the device to add dvfs feature. > + * @profile: device-specific profile to run dvfs. > + * @governor: the policy to choose frequency. > + */ > +int dvfs_add_device(struct device *dev, struct dvfs_device_profile *profile, > + struct dvfs_governor *governor) > +{ > + struct dvfs *new_dvfs, *dvfs; > + struct cpufreq_frequency_table *table; > + int err; > + > + if (!dev || !profile || !governor) { > + dev_err(dev, "%s: Invalid parameters.\n", __func__); > + return -EINVAL; > + } > + > + err = opp_init_cpufreq_table(dev, &table); I'd change the name opp_init_cpufreq_table() to something more suitable, since it's no longer used for the CPU frequencies alone. Something like opp_init_frequency_table() might work. > + if (err) { > + dev_err(dev, "%s: Device OPP cannot provide DVFS table (%d)\n", > + __func__, err); > + return err; > + } > + > + new_dvfs = kzalloc(sizeof(struct dvfs), GFP_KERNEL); > + if (!new_dvfs) { > + dev_err(dev, "%s: Unable to create DVFS for the device\n", > + __func__); > + return -ENOMEM; > + } > + > + mutex_lock(&dvfs_list_lock); > + > + dvfs = find_device_dvfs(dev); > + if (!IS_ERR(dvfs)) { > + dev_err(dev, "%s: Unable to create DVFS for the device. " > + "It already has one.\n", __func__); > + mutex_unlock(&dvfs_list_lock); > + kfree(new_dvfs); > + return -EINVAL; > + } It seems that this should be checked before allocating the new object (you can use kzalloc() under the mutex). I'd even run it before initializing the device's frequency table. > + > + new_dvfs->dev = dev; > + new_dvfs->profile = profile; > + new_dvfs->governor = governor; > + new_dvfs->next_polling = dvfs_next_polling(profile->polling_ms); > + new_dvfs->previous_freq = 0; > + > + list_add(&new_dvfs->node, &dvfs_list); > + > + mutex_unlock(&dvfs_list_lock); > + > + return 0; > +} > + > +/** > + * dvfs_remove_device() - Remove DVFS feature from a device. > + * @device: the device to remove dvfs feature. > + */ > +int dvfs_remove_device(struct device *dev) > +{ > + struct dvfs *dvfs; > + > + if (!dev) > + return -EINVAL; > + > + mutex_lock(&dvfs_list_lock); > + dvfs = find_device_dvfs(dev); > + if (IS_ERR(dvfs)) { > + dev_err(dev, "%s: Unable to find DVFS entry for the device.\n", > + __func__); > + mutex_unlock(&dvfs_list_lock); > + return -EINVAL; > + } > + > + list_del(&dvfs->node); > + > + kfree(dvfs); > + > + mutex_unlock(&dvfs_list_lock); > + > + return 0; > +} > + > +/** > + * dvfs_update() - Notify that the device OPP has been changed. > + * @dev: the device whose OPP has been changed. > + * @may_not_exist: do not print error message even if the device > + * does not have dvfs entry. > + */ > +int dvfs_update(struct device *dev, bool may_not_exist) > +{ > + struct dvfs *dvfs; > + int err = 0; > + > + mutex_lock(&dvfs_list_lock); > + dvfs = find_device_dvfs(dev); > + if (!IS_ERR(dvfs)) { > + if (dvfs->tickle) { > + unsigned long freq = dvfs->profile->max_freq; > + struct opp *opp = opp_find_freq_floor(dvfs->dev, &freq); > + > + freq = opp_get_freq(opp); > + if (dvfs->previous_freq != freq) { > + err = dvfs->profile->target(dvfs->dev, freq, > + opp_get_voltage(opp)); > + dvfs->previous_freq = freq; > + } > + } else { > + err = dvfs_do(dvfs); > + } > + } > + mutex_unlock(&dvfs_list_lock); > + > + if (IS_ERR(dvfs)) { > + if (may_not_exist && PTR_ERR(dvfs) == -EINVAL) > + return 0; > + > + dev_err(dev, "%s: Device DVFS not found (%ld)\n", __func__, > + PTR_ERR(dvfs)); > + return PTR_ERR(dvfs); > + } > + > + return err; > +} > + > +/** > + * dvfs_tickle_device() - Guarantee maximum operation speed for a while > + * instaneously. > + * @dev: the device to be tickled. > + * @duration_ms: the duration of tickle effect. > + * > + * Tickle sets the device at the maximum frequency instaneously and > + * the maximul frequency is guaranteed to be used for the given duration. ^^^^^^^ I guess that should be "maximum". > + * For faster user reponse time, an input event may tickle a related device > + * so that the input event does not need to wait for the DVFS to react with > + * normal interval. > + */ Can you explain how this function is supposed to be used, please? > +int dvfs_tickle_device(struct device *dev, unsigned long duration_ms) > +{ > + struct dvfs *dvfs; > + struct opp *opp; > + unsigned long freq; > + int err = 0; > + > + mutex_lock(&dvfs_list_lock); > + dvfs = find_device_dvfs(dev); > + if (!IS_ERR(dvfs)) { > + freq = dvfs->profile->max_freq; > + opp = opp_find_freq_floor(dvfs->dev, &freq); > + freq = opp_get_freq(opp); > + if (dvfs->previous_freq != freq) { > + err = dvfs->profile->target(dvfs->dev, freq, > + opp_get_voltage(opp)); > + dvfs->previous_freq = freq; > + } > + if (err) > + dev_err(dev, "%s: Cannot set frequency.\n", __func__); > + else > + dvfs->tickle = dvfs_next_polling(duration_ms); > + } > + mutex_unlock(&dvfs_list_lock); > + > + if (IS_ERR(dvfs)) { > + dev_err(dev, "%s: Cannot find dvfs.\n", __func__); > + err = PTR_ERR(dvfs); > + } > + > + return err; > +} > + > +#ifdef CONFIG_PM_SLEEP > +static int dvfs_pause(struct device *dev) > +{ > + cancel_delayed_work_sync(&dvfs_work); > + return 0; > +} > + > +static void dvfs_continue(struct device *dev) > +{ > + schedule_delayed_work(&dvfs_work, msecs_to_jiffies(DVFS_INTERVAL)); > +} > +#else > +#define dvfs_pause NULL > +#define dvfs_continue NULL > +#endif > +static const struct dev_pm_ops dvfs_pm = { > + .prepare = dvfs_pause, > + .complete = dvfs_continue, > +}; > +static struct platform_driver dvfs_pm_drv = { > + .driver = { > + .name = "generic_dvfs_pm", > + .pm = &dvfs_pm, > + }, > +}; > +static struct platform_device dvfs_pm_dev = { > + .name = "generic_dvfs_pm", > +}; Please don't do that. You can simply use the freezable workqueue introduced by commit 4149efb (workqueue: add system_freezeable_wq) and then you won't have to worry about system suspend. > + > +static int __init dvfs_init(void) > +{ > + platform_driver_register(&dvfs_pm_drv); > + platform_device_register(&dvfs_pm_dev); > + > + INIT_DELAYED_WORK_DEFERRABLE(&dvfs_work, dvfs_monitor); > + schedule_delayed_work(&dvfs_work, msecs_to_jiffies(DVFS_INTERVAL)); > + > + return 0; > +} > +late_initcall(dvfs_init); > diff --git a/drivers/base/power/opp.c b/drivers/base/power/opp.c > index 56a6899..df74655 100644 > --- a/drivers/base/power/opp.c > +++ b/drivers/base/power/opp.c > @@ -21,6 +21,7 @@ > #include <linux/rculist.h> > #include <linux/rcupdate.h> > #include <linux/opp.h> > +#include <linux/dvfs.h> > > /* > * Internal data structure organization with the OPP layer library is as > @@ -428,6 +429,9 @@ int opp_add(struct device *dev, unsigned long freq, unsigned long u_volt) > list_add_rcu(&new_opp->node, head); > mutex_unlock(&dev_opp_list_lock); > > + /* Notify generic dvfs for the change */ > + dvfs_update(dev, true); > + > return 0; > } > > @@ -512,6 +516,9 @@ unlock: > mutex_unlock(&dev_opp_list_lock); > out: > kfree(new_opp); > + > + /* Notify generic dvfs for the change */ > + dvfs_update(dev, true); > return r; > } > > diff --git a/include/linux/dvfs.h b/include/linux/dvfs.h > new file mode 100644 > index 0000000..de3e53d > --- /dev/null > +++ b/include/linux/dvfs.h > @@ -0,0 +1,69 @@ > +/* > + * Generic DVFS Interface for Non-CPU Devices > + * > + * Copyright (C) 2011 Samsung Electronics > + * MyungJoo Ham <myungjoo.ham@samsung.com> > + * > + * This program is free software; you can redistribute it and/or modify > + * it under the terms of the GNU General Public License version 2 as > + * published by the Free Software Foundation. > + */ > + > +#ifndef __LINUX_DVFS_H__ > +#define __LINUX_DVFS_H__ > + > +struct dvfs; > + > +struct dvfs_usage_profile { > + /* both since the last measure */ > + unsigned long total_time; > + unsigned long busy_time; > +}; > + > +struct dvfs_device_profile { > + unsigned long max_freq; /* may be larger than the actual value */ > + int (*target)(struct device *dev, unsigned long freq, > + unsigned long u_volt); > + int (*get_usage_profile)(struct device *dev, > + struct dvfs_usage_profile *profile); > + int polling_ms; /* 0 for at opp change only */ > +}; > + > +struct dvfs_governor { > + void *data; /* private data for get_target_freq */ > + int (*get_target_freq)(struct dvfs_usage_profile *profile, > + struct dvfs_governor *this, unsigned long *freq); > +}; > + > +#if defined(CONFIG_PM_GENERIC_DVFS) > +extern int dvfs_add_device(struct device *ev, > + struct dvfs_device_profile *profile, > + struct dvfs_governor *governor); > +extern int dvfs_remove_device(struct device *dev); > +extern int dvfs_update(struct device *dev, bool may_not_exist); > +extern int dvfs_tickle_device(struct device *dev, unsigned long duration_ms); > +#else /* !CONFIG_PM_GENERIC_DVFS */ > +static int dvfs_add_device(struct device *ev, > + struct dvfs_device_profile *profile, > + struct dvfs_governor *governor) > +{ > + return 0; > +} > + > +static int dvfs_remove_device(struct device *dev) > +{ > + return 0; > +} > + > +static int dvfs_update(struct device *dev, bool may_not_exist) > +{ > + return 0; > +} > + > +static int dvfs_tickle_device(struct device *dev, unsigned long duration_ms) > +{ > + return 0; > +} > +#endif /* CONFIG_PM_GENERIC_DVFS */ > + > +#endif /* __LINUX_DVFS_H__ */ > diff --git a/kernel/power/Kconfig b/kernel/power/Kconfig > index 4603f08..7d191ec 100644 > --- a/kernel/power/Kconfig > +++ b/kernel/power/Kconfig > @@ -225,3 +225,12 @@ config PM_OPP > representing individual voltage domains and provides SOC > implementations a ready to use framework to manage OPPs. > For more information, read <file:Documentation/power/opp.txt> > + > +config PM_GENERIC_DVFS > + bool "Generic DVFS framework" > + depends on PM_OPP > + help > + With OPP support, a device may have a list of frequencies and > + voltages available. Generic DVFS framework provides a method to > + control frequency/voltage of a device with OPP with given profile > + and governor per device. > Thanks, Rafael
On Tue, Apr 26, 2011 at 05:43:21PM +0900, MyungJoo Ham wrote: > With OPPs, a device may have multiple operable frequency and voltage > sets. However, there can be multiple possible operable sets and a system > will need to choose one from them. In order to reduce the power > consumption (by reducing frequency and voltage) without affecting the > performance too much, a DVFS scheme may be used. What is "DVFS"? A new filesystem? confused, greg k-h
Hi Greg, I think DVFS here means “Dynamic voltage and frequency scaling”, refering the first google's result: a dynamic voltage and frequency scaling (DVFS) *technique that minimizes the total system energy consumption for performing a task while satisfying a given execution time constraint. We first show that in order to guarantee minimum energy for task execution by using DVFS it is essential to divide the system power into active and standby power components. Next, we present a new DVFS technique, which considers not only the active power, but also the standby component of the system power. It's widely used in embeded Soc. Best regards, Zhang Jiejing 2011/4/26 Greg KH <gregkh@suse.de> > On Tue, Apr 26, 2011 at 05:43:21PM +0900, MyungJoo Ham wrote: > > With OPPs, a device may have multiple operable frequency and voltage > > sets. However, there can be multiple possible operable sets and a system > > will need to choose one from them. In order to reduce the power > > consumption (by reducing frequency and voltage) without affecting the > > performance too much, a DVFS scheme may be used. > > What is "DVFS"? A new filesystem? > > confused, > > greg k-h > _______________________________________________ > linux-pm mailing list > linux-pm@lists.linux-foundation.org > https://lists.linux-foundation.org/mailman/listinfo/linux-pm >
On Tue, Apr 26, 2011 at 05:43:21PM +0900, MyungJoo Ham wrote: > +static int __init dvfs_init(void) > +{ > + platform_driver_register(&dvfs_pm_drv); > + platform_device_register(&dvfs_pm_dev); > + > + INIT_DELAYED_WORK_DEFERRABLE(&dvfs_work, dvfs_monitor); > + schedule_delayed_work(&dvfs_work, msecs_to_jiffies(DVFS_INTERVAL)); This will unconditionally start the timer which polls the devices at whatever rate. I'd expect to start and stop the poll depending on if we have devices and/or governor type code (which I don't see anything for here, though it's hard to know what criteria a generic governor would use) which require it.
On Tue 2011-04-26 06:22:24, Greg KH wrote: > On Tue, Apr 26, 2011 at 05:43:21PM +0900, MyungJoo Ham wrote: > > With OPPs, a device may have multiple operable frequency and voltage > > sets. However, there can be multiple possible operable sets and a system > > will need to choose one from them. In order to reduce the power > > consumption (by reducing frequency and voltage) without affecting the > > performance too much, a DVFS scheme may be used. > > What is "DVFS"? A new filesystem? Dynamic voltage & frequency scaling ... very probably. But yes, the naming sucks. Pavel
On Tue, Apr 26, 2011 at 10:45:18PM +0800, Jiejing.Zhang wrote: > Hi Greg, > > I think DVFS here means “Dynamic voltage and frequency scaling”, refering the > first google's result: Ok, if so, please spell it out. Also, don't use 'dvfs' as a prefix in the kernel, too many other people will think it is a filesystem. How about "dynvolt"? Or anything else, just no 'fs' in the name otherwise confusion will reign. > a dynamic voltage and frequency scaling (DVFS) *technique that minimizes the > total system energy consumption for performing a task while satisfying > a given execution time constraint. We first show that in order to > guarantee minimum energy for task execution by using DVFS it is > essential to divide the system power into active and standby power > components. Next, we present a new DVFS technique, which considers not > only the active power, but also the standby component of the system > power. Please put this in the changelog entry, you need to explain it to those of us who have never seen it. thanks, greg k-h
Hello, 2011/4/26 Rafael J. Wysocki <rjw@sisk.pl>: > Hi, > > To start with let me say I don't have any fundamental objections at this point, > although I'm not 100% sure I understand correctly how this feature is supposed > to be used. My understanding is that if a device is known to have multiple > OPPs, its subsystem and driver may call dvfs_add_device() for it, which will > cause it to be periodically monitored and put into the "optimum" OPP using > the infrastructure introduced by your patch (on the basis of the given > usage profile and with the help of the given governor). Is that correct? Yes, that is correct. > > That said I have some specific comments below. > > On Tuesday, April 26, 2011, MyungJoo Ham wrote: >> With OPPs, a device may have multiple operable frequency and voltage >> sets. However, there can be multiple possible operable sets and a system >> will need to choose one from them. In order to reduce the power >> consumption (by reducing frequency and voltage) without affecting the >> performance too much, a DVFS scheme may be used. >> >> This patch introduces the DVFS capability to non-CPU devices with OPPs. > > Well, I'd like the DVFS acronym to be expanded at least once somewhere, > possibly in the comments describing the new files and necessarily in the > Kconfig option description. Sure, I will do so. > >> In this RTC version, we do not have specific gorvernors or device > > RFC I guess? Uh oh.. yes, it should've been typed "RFC". > [] >> + >> +#define DVFS_INTERVAL 20 > > You should state in a comment what's the unit of this and explain the choice > of the value. Ok. I'll add a comment noting that it's in msec. > >> +/** >> + * struct dvfs - Device DVFS structure >> + * @node list node - contains the devices with DVFS that have been >> + * registered. >> + * @dev device pointer >> + * @profile device-specific dvfs profile >> + * @governor method how to choose frequency based on the usage. >> + * @previous_freq previously configured frequency value. >> + * @next_polling the number of remaining "dvfs_monitor" executions to >> + * reevaluate frequency/voltage of the device. Set by >> + * profile's polling_ms interval. >> + * @tickle positive if DVFS-tickling is activated for the device. >> + * at each executino of dvfs_monitor, tickle is decremented. >> + * User may tickle a device-dvfs in order to set maximum >> + * frequency instaneously with some guaranteed duration. >> + * >> + * This structure stores the DVFS information for a give device. >> + */ >> +struct dvfs { >> + struct list_head node; >> + >> + struct device *dev; >> + struct dvfs_device_profile *profile; >> + struct dvfs_governor *governor; >> + >> + unsigned long previous_freq; >> + unsigned int next_polling; >> + unsigned int tickle; >> +}; >> + > > I'd move the structure definition to the header file you introduce below. > Even though it's supposed to be internal, it's better to have it defined > next to the other structures for the benefit of the people who will > try to understand your code in the future. Ok. I'll do that. > [] >> +static unsigned int dvfs_next_polling(int ms) >> +{ >> + unsigned int ret; >> + >> + if (ms <= 0) >> + return 0; >> + >> + ret = ms / DVFS_INTERVAL; >> + if (ms % DVFS_INTERVAL) >> + ret++; >> + >> + return ret; >> +} > > It seems you could use DIV_ROUND_UP instead of this whole function. Thanks. > >> + >> +/** >> + * dvfs_do() - Check the usage profile of a given device and configure >> + * frequency and voltage accordingly >> + * @dvfs: DVFS info of the given device >> + */ >> +static int dvfs_do(struct dvfs *dvfs) >> +{ >> + struct dvfs_usage_profile profile; >> + struct opp *opp; >> + unsigned long freq; >> + int err; >> + >> + err = dvfs->profile->get_usage_profile(dvfs->dev, &profile); >> + if (err) { >> + dev_err(dvfs->dev, "%s: Cannot get usage profile (%d)\n", >> + __func__, err); >> + return err; >> + } >> + >> + err = dvfs->governor->get_target_freq(&profile, dvfs->governor, &freq); > > It may be better to define ->get_target_freq() so that it takes devfs as an arg. > In that case you'd need to pass fewer pointers and the governor would be > able to access all of its data anyway. Oh Yes, I need to revise governor design so that the governor can take more information about the device and dvfs options. I've been testing with some test-case governors and it made some issues. :) > [] > > I'd generally consider using fewer error messages. While they may be good > for debugging, they have a potential of spamming the kernel's message buffer > in case of a problem on a production system. Also they don't really make > the code cleaner, so to speak. :-) Ok, I'll keep that in mind in the next revision. > [] >> + err = opp_init_cpufreq_table(dev, &table); > > I'd change the name opp_init_cpufreq_table() to something more suitable, > since it's no longer used for the CPU frequencies alone. Something like > opp_init_frequency_table() might work. Got it. Anyway, I'm going to remove the dependency on cpufreq_frequency_table or anything similar as well. > [] >> + >> + dvfs = find_device_dvfs(dev); >> + if (!IS_ERR(dvfs)) { >> + dev_err(dev, "%s: Unable to create DVFS for the device. " >> + "It already has one.\n", __func__); >> + mutex_unlock(&dvfs_list_lock); >> + kfree(new_dvfs); >> + return -EINVAL; >> + } > > It seems that this should be checked before allocating the new object > (you can use kzalloc() under the mutex). I'd even run it before initializing > the device's frequency table. Fine. That seems more economic. > [] >> + >> +/** >> + * dvfs_tickle_device() - Guarantee maximum operation speed for a while >> + * instaneously. >> + * @dev: the device to be tickled. >> + * @duration_ms: the duration of tickle effect. >> + * >> + * Tickle sets the device at the maximum frequency instaneously and >> + * the maximul frequency is guaranteed to be used for the given duration. > ^^^^^^^ > I guess that should be "maximum". Yes. :) > >> + * For faster user reponse time, an input event may tickle a related device >> + * so that the input event does not need to wait for the DVFS to react with >> + * normal interval. >> + */ > > Can you explain how this function is supposed to be used, please? Ok, I'll show you a usage example The system: a touchscreen based computer with DVFS'able GPU. Use case: a user scrolls the screen with fingers suddenly. (think of an iPhone or Android device's home screen filled wit icons) While there is no touchscreen input (thus, the screen does not move fast), we do not use GPU too much; thus it is scaled to run at slower rate (let's say 100MHz). If a user suddenly touches screen and starts moving, the screen should follow it; thus, requires much output from GPU. With normal DVFS-GPU, it would take some time to speed it up because it should monitor the GPU usage and react afterwards. With the API given, "tickle" may be included in the user input event. So that GPU can be scaled to maximum frequency immediately. If the monitoring frequency is 50ms, DVFS will likely to react with about 75ms of delay. For example, if the touch event has occurred at 25ms after the last DVFS monitoring event, at the next DVFS monitoring event, the measured load will be about 50%, which usually won't require DVFS to speed up. Then, DVFS will react at the next monitoring event; thus, it took 50ms + 25ms. 75ms is enough to give some impression to users. These days, many mobile devices require 60Hz-based UI, which is about 16ms. Anyway, this happens with drivers/cpufreq also. We have been testing "tickling" associated with drivers/cpufreq/cpufreq.c. This has been reduced user response time significantly removing the need for tuning the threshold values. > >> +int dvfs_tickle_device(struct device *dev, unsigned long duration_ms) >> +{ >> + struct dvfs *dvfs; >> + struct opp *opp; >> + unsigned long freq; >> + int err = 0; >> + >> + mutex_lock(&dvfs_list_lock); >> + dvfs = find_device_dvfs(dev); >> + if (!IS_ERR(dvfs)) { >> + freq = dvfs->profile->max_freq; >> + opp = opp_find_freq_floor(dvfs->dev, &freq); >> + freq = opp_get_freq(opp); >> + if (dvfs->previous_freq != freq) { >> + err = dvfs->profile->target(dvfs->dev, freq, >> + opp_get_voltage(opp)); >> + dvfs->previous_freq = freq; >> + } >> + if (err) >> + dev_err(dev, "%s: Cannot set frequency.\n", __func__); >> + else >> + dvfs->tickle = dvfs_next_polling(duration_ms); >> + } >> + mutex_unlock(&dvfs_list_lock); >> + >> + if (IS_ERR(dvfs)) { >> + dev_err(dev, "%s: Cannot find dvfs.\n", __func__); >> + err = PTR_ERR(dvfs); >> + } >> + >> + return err; >> +} >> + >> +#ifdef CONFIG_PM_SLEEP >> +static int dvfs_pause(struct device *dev) >> +{ >> + cancel_delayed_work_sync(&dvfs_work); >> + return 0; >> +} >> + >> +static void dvfs_continue(struct device *dev) >> +{ >> + schedule_delayed_work(&dvfs_work, msecs_to_jiffies(DVFS_INTERVAL)); >> +} >> +#else >> +#define dvfs_pause NULL >> +#define dvfs_continue NULL >> +#endif >> +static const struct dev_pm_ops dvfs_pm = { >> + .prepare = dvfs_pause, >> + .complete = dvfs_continue, >> +}; >> +static struct platform_driver dvfs_pm_drv = { >> + .driver = { >> + .name = "generic_dvfs_pm", >> + .pm = &dvfs_pm, >> + }, >> +}; >> +static struct platform_device dvfs_pm_dev = { >> + .name = "generic_dvfs_pm", >> +}; > > Please don't do that. You can simply use the freezable workqueue introduced > by commit 4149efb (workqueue: add system_freezeable_wq) and then you won't > have to worry about system suspend. Cool! Thanks. > > > Thanks, > Rafael > Thank you for your comments. Cheers! - MyungJoo
On Wed, Apr 27, 2011 at 12:07 AM, Mark Brown <broonie@sirena.org.uk> wrote: > On Tue, Apr 26, 2011 at 05:43:21PM +0900, MyungJoo Ham wrote: > >> +static int __init dvfs_init(void) >> +{ >> + platform_driver_register(&dvfs_pm_drv); >> + platform_device_register(&dvfs_pm_dev); >> + >> + INIT_DELAYED_WORK_DEFERRABLE(&dvfs_work, dvfs_monitor); >> + schedule_delayed_work(&dvfs_work, msecs_to_jiffies(DVFS_INTERVAL)); > > This will unconditionally start the timer which polls the devices at > whatever rate. I'd expect to start and stop the poll depending on if we > have devices and/or governor type code (which I don't see anything for > here, though it's hard to know what criteria a generic governor would > use) which require it. > _______________________________________________ > linux-pm mailing list > linux-pm@lists.linux-foundation.org > https://lists.linux-foundation.org/mailman/listinfo/linux-pm > Thanks. I will let it poll only when there are dvfs-devices with polling interval specified. Cheers! - MyungJoo
On Wed, Apr 27, 2011 at 5:54 AM, Greg KH <gregkh@suse.de> wrote: > On Tue, Apr 26, 2011 at 10:45:18PM +0800, Jiejing.Zhang wrote: >> Hi Greg, >> >> I think DVFS here means “Dynamic voltage and frequency scaling”, refering the >> first google's result: > > Ok, if so, please spell it out. Yes, will do so. > > Also, don't use 'dvfs' as a prefix in the kernel, too many other people > will think it is a filesystem. How about "dynvolt"? > > Or anything else, just no 'fs' in the name otherwise confusion will > reign. Um... ok... I'll avoid using fs as postfix and think about the name again. "oppfreq", "devfreq", "dvfscale", "oppscale", "devscale", "ddp (dynamic device power)", or "dfvs" (switched frequency and voltage to avoid confusion with file systems). > >> a dynamic voltage and frequency scaling (DVFS) *technique that minimizes the >> total system energy consumption for performing a task while satisfying >> a given execution time constraint. We first show that in order to >> guarantee minimum energy for task execution by using DVFS it is >> essential to divide the system power into active and standby power >> components. Next, we present a new DVFS technique, which considers not >> only the active power, but also the standby component of the system >> power. > > Please put this in the changelog entry, you need to explain it to those > of us who have never seen it. Sure. I'll do that. > > thanks, > > greg k-h > _______________________________________________ > linux-pm mailing list > linux-pm@lists.linux-foundation.org > https://lists.linux-foundation.org/mailman/listinfo/linux-pm Cheers! - MyungJoo
On Tue, Apr 26, 2011 at 01:54:30PM -0700, Greg KH wrote: > On Tue, Apr 26, 2011 at 10:45:18PM +0800, Jiejing.Zhang wrote: > > I think DVFS here means ???Dynamic voltage and frequency scaling???,?? refering the > > first google's result: > Ok, if so, please spell it out. > Also, don't use 'dvfs' as a prefix in the kernel, too many other people > will think it is a filesystem. How about "dynvolt"? > Or anything else, just no 'fs' in the name otherwise confusion will > reign. This really is the standard technical term for this, it's vanishingly rare for it to be a source of confusion in context. We should at least mention the actual term in the documentation even if we invent our own name for it.
On Wednesday, April 27, 2011, MyungJoo Ham wrote: > Hello, > > 2011/4/26 Rafael J. Wysocki <rjw@sisk.pl>: > > Hi, > > ... > >> + > >> +/** > >> + * dvfs_tickle_device() - Guarantee maximum operation speed for a while > >> + * instaneously. > >> + * @dev: the device to be tickled. > >> + * @duration_ms: the duration of tickle effect. > >> + * > >> + * Tickle sets the device at the maximum frequency instaneously and > >> + * the maximul frequency is guaranteed to be used for the given duration. > > ^^^^^^^ > > I guess that should be "maximum". > > Yes. :) > > > > >> + * For faster user reponse time, an input event may tickle a related device > >> + * so that the input event does not need to wait for the DVFS to react with > >> + * normal interval. > >> + */ > > > > Can you explain how this function is supposed to be used, please? > > Ok, I'll show you a usage example > > The system: a touchscreen based computer with DVFS'able GPU. > Use case: a user scrolls the screen with fingers suddenly. (think of > an iPhone or Android device's home screen filled wit icons) > While there is no touchscreen input (thus, the screen does not move > fast), we do not use GPU too much; thus it is scaled to run at slower > rate (let's say 100MHz). > If a user suddenly touches screen and starts moving, the screen should > follow it; thus, requires much output from GPU. > With normal DVFS-GPU, it would take some time to speed it up because > it should monitor the GPU usage and react afterwards. > > With the API given, "tickle" may be included in the user input event. > So that GPU can be scaled to maximum frequency immediately. So in this case the input subsystem would be calling dvfs_tickle_device() (or whatever it is finally called) for the GPU, right? I'm afraid you'd need a user space interface for that and let the application (or library) handling the touchpad input control the "tickle". Otherwise it would be difficult to arrange things correctly (the association between the input subsystem and the GPU is not obvious at the kernel level). Ideally, though, the working OPP for the GPU should depend on the number of processing requests per a unit of time, pretty much like for a CPU. > If the monitoring frequency is 50ms, DVFS will likely to react with > about 75ms of delay. > For example, if the touch event has occurred at 25ms after the last > DVFS monitoring event, at the next DVFS monitoring event, the measured > load will be about 50%, which usually won't require DVFS to speed up. > Then, DVFS will react at the next monitoring event; thus, it took 50ms > + 25ms. > > 75ms is enough to give some impression to users. These days, many > mobile devices require 60Hz-based UI, which is about 16ms. > > Anyway, this happens with drivers/cpufreq also. We have been testing > "tickling" associated with drivers/cpufreq/cpufreq.c. This has been > reduced user response time significantly removing the need for tuning > the threshold values. I think I understand the problem, but I'm not sure if there's a clean way to trigger the "tickle" from the kernel level. Thanks, Rafael
On Wednesday, April 27, 2011, MyungJoo Ham wrote: > On Wed, Apr 27, 2011 at 5:54 AM, Greg KH <gregkh@suse.de> wrote: > > On Tue, Apr 26, 2011 at 10:45:18PM +0800, Jiejing.Zhang wrote: > >> Hi Greg, > >> > >> I think DVFS here means “Dynamic voltage and frequency scaling”, refering the > >> first google's result: > > > > Ok, if so, please spell it out. > > Yes, will do so. > > > > > Also, don't use 'dvfs' as a prefix in the kernel, too many other people > > will think it is a filesystem. How about "dynvolt"? > > > > Or anything else, just no 'fs' in the name otherwise confusion will > > reign. > > Um... ok... I'll avoid using fs as postfix and think about the name again. > "oppfreq", "devfreq", "dvfscale", "oppscale", "devscale", "ddp > (dynamic device power)", or "dfvs" (switched frequency and voltage to > avoid confusion with file systems). I think devfreq is the best one of the above. At least it will make everyone think that the feature is analogous to cpufreq, which is correct. Thanks, Rafael
On Tue, Apr 26, 2011 at 10:22 PM, MyungJoo Ham <myungjoo.ham@samsung.com> wrote: > Hello, > > 2011/4/26 Rafael J. Wysocki <rjw@sisk.pl>: >> Hi, >> >> To start with let me say I don't have any fundamental objections at this point, >> although I'm not 100% sure I understand correctly how this feature is supposed >> to be used. My understanding is that if a device is known to have multiple >> OPPs, its subsystem and driver may call dvfs_add_device() for it, which will >> cause it to be periodically monitored and put into the "optimum" OPP using >> the infrastructure introduced by your patch (on the basis of the given >> usage profile and with the help of the given governor). Is that correct? > > Yes, that is correct. I'm a little confused about the design for this, and OPP as well. OPP matches a struct device * and a frequency to a voltage, which is not a generically useful pairing, as far as I can tell. On Tegra, it is quite possible for a single device to have multiple clocks that each have different voltage requirements, for example the display block can have an interface clock as well as a pixel clock. Simplifying this to dev + freq = voltage seems very OMAP specific, and will be difficult or impossible to adapt to Tegra. Moreover, from a silicon perspective, there is always a simple link from a single frequency to a minimum voltage for a given circuit. There is no need to group them into OPPs, which seem to have a group of clocks and their frequencies that map to a single voltage. That is an artifact of the way TI specifies voltages. I don't think DVFS is even the right place for any sort of governor. DVFS is very simple - to increase to a specific clock speed, the voltage must be immediately be raised, with minimum or no delay, to a specified value that is specific to that clock. When the frequency is lowered, the voltage should be decreased. There is a tiny bit of policy to determine when to delay dropping the voltage in case the frequency will immediately be raised again, but nowhere near the complexity of what is shown here. I proposed in a different thread on LKML that DVFS be handled within the generic clock implementation. Platforms would register a regulator and a table of voltages for each struct clock that required DVFS, and the voltages would be changed on normal clk_* requests. This maintains compatibility with existing clk_* calls. There is a place for a GPU, etc., frequency governor, but it is a completely separate issue from DVFS, and should not be mixed in. I could have a GPU that is not voltage scalable, but could still benefit from lowering the frequency when it is not in use. A devfreq interface sounds perfect for this, as long as it only ends up calling clk_* functions, and those functions handle getting the voltage correct.
On Wed, Apr 27, 2011 at 12:49, Colin Cross <ccross@google.com> wrote: +l-o > I'm a little confused about the design for this, and OPP as well. OPP > matches a struct device * and a frequency to a voltage, which is not a > generically useful pairing, as far as I can tell. On Tegra, it is > quite possible for a single device to have multiple clocks that each > have different voltage requirements, for example the display block can > have an interface clock as well as a pixel clock. Simplifying this to > dev + freq = voltage seems very OMAP specific, and will be difficult > or impossible to adapt to Tegra. We have the same requirements as well(iclk,fclk,pixclk etc..)! We group them under voltage domains in OMAP ;). if your issue was a ability to have a single freq to a OPP, it is upto SoC to do the proper mapping. Concept of an OPP still remains consistent - which is for a voltage, there is only so much freq you can drive that specific module to. > Moreover, from a silicon perspective, there is always a simple link > from a single frequency to a minimum voltage for a given circuit. > There is no need to group them into OPPs, which seem to have a group > of clocks and their frequencies that map to a single voltage. That is > an artifact of the way TI specifies voltages. > > I don't think DVFS is even the right place for any sort of governor. > DVFS is very simple - to increase to a specific clock speed, the > voltage must be immediately be raised, with minimum or no delay, to a > specified value that is specific to that clock. When the frequency is > lowered, the voltage should be decreased. There is a tiny bit of > policy to determine when to delay dropping the voltage in case the > frequency will immediately be raised again, but nowhere near the > complexity of what is shown here. > > I proposed in a different thread on LKML that DVFS be handled within > the generic clock implementation. Platforms would register a > regulator and a table of voltages for each struct clock that required > DVFS, and the voltages would be changed on normal clk_* requests. > This maintains compatibility with existing clk_* calls. It is upto SoC frameworks to implement the transitions. E.g. lets look at scalability: How'd the mechanism proposed work with temperature variances: Example: I dont want to hit 1.5GHz if temp >70C - wont it be an SoC specific hack I'd need to introduce? All OPP framework does is store that maps, and leaves it to users to choose regulators, clock framework variances, SoC temperature sensors or what ever mechanisms they choose to allow through a transition. > There is a place for a GPU, etc., frequency governor, but it is a > completely separate issue from DVFS, and should not be mixed in. I > could have a GPU that is not voltage scalable, but could still benefit > from lowering the frequency when it is not in use. A devfreq > interface sounds perfect for this, as long as it only ends up calling > clk_* functions, and those functions handle getting the voltage > correct. Regards, Nishanth Menon PS: https://lists.linux-foundation.org/pipermail/linux-pm/2011-April/031113.html for start of thread
On Wed, Apr 27, 2011 at 11:07 AM, Menon, Nishanth <nm@ti.com> wrote: > On Wed, Apr 27, 2011 at 12:49, Colin Cross <ccross@google.com> wrote: > +l-o > >> I'm a little confused about the design for this, and OPP as well. OPP >> matches a struct device * and a frequency to a voltage, which is not a >> generically useful pairing, as far as I can tell. On Tegra, it is >> quite possible for a single device to have multiple clocks that each >> have different voltage requirements, for example the display block can >> have an interface clock as well as a pixel clock. Simplifying this to >> dev + freq = voltage seems very OMAP specific, and will be difficult >> or impossible to adapt to Tegra. > We have the same requirements as well(iclk,fclk,pixclk etc..)! We > group them under voltage domains in OMAP ;). if your issue was a > ability to have a single freq to a OPP, it is upto SoC to do the > proper mapping. Concept of an OPP still remains consistent - which is > for a voltage, there is only so much freq you can drive that specific > module to. No, that is still wrong. You don't drive a module at a frequency, you drive a clock. You can't map struct device * 1-1 to a clock. Look at omap2_set_init_voltage: static int __init omap2_set_init_voltage(char *vdd_name, char *clk_name, struct device *dev) { clk = clk_get(NULL, clk_name); freq = clk->rate; opp = opp_find_freq_ceil(dev, &freq); ... } Now what happens if I have a dev with two frequencies, >> Moreover, from a silicon perspective, there is always a simple link >> from a single frequency to a minimum voltage for a given circuit. >> There is no need to group them into OPPs, which seem to have a group >> of clocks and their frequencies that map to a single voltage. That is >> an artifact of the way TI specifies voltages. >> >> I don't think DVFS is even the right place for any sort of governor. >> DVFS is very simple - to increase to a specific clock speed, the >> voltage must be immediately be raised, with minimum or no delay, to a >> specified value that is specific to that clock. When the frequency is >> lowered, the voltage should be decreased. There is a tiny bit of >> policy to determine when to delay dropping the voltage in case the >> frequency will immediately be raised again, but nowhere near the >> complexity of what is shown here. >> >> I proposed in a different thread on LKML that DVFS be handled within >> the generic clock implementation. Platforms would register a >> regulator and a table of voltages for each struct clock that required >> DVFS, and the voltages would be changed on normal clk_* requests. >> This maintains compatibility with existing clk_* calls. > > It is upto SoC frameworks to implement the transitions. E.g. lets look > at scalability: How'd the mechanism proposed work with temperature > variances: Example: I dont want to hit 1.5GHz if temp >70C - wont it > be an SoC specific hack I'd need to introduce? > > All OPP framework does is store that maps, and leaves it to users to > choose regulators, clock framework variances, SoC temperature sensors > or what ever mechanisms they choose to allow through a transition. > >> There is a place for a GPU, etc., frequency governor, but it is a >> completely separate issue from DVFS, and should not be mixed in. I >> could have a GPU that is not voltage scalable, but could still benefit >> from lowering the frequency when it is not in use. A devfreq >> interface sounds perfect for this, as long as it only ends up calling >> clk_* functions, and those functions handle getting the voltage >> correct. > > Regards, > Nishanth Menon > PS: > https://lists.linux-foundation.org/pipermail/linux-pm/2011-April/031113.html > for start of thread >
On Wed, Apr 27, 2011 at 10:49:47AM -0700, Colin Cross wrote: > Moreover, from a silicon perspective, there is always a simple link > from a single frequency to a minimum voltage for a given circuit. > There is no need to group them into OPPs, which seem to have a group > of clocks and their frequencies that map to a single voltage. That is > an artifact of the way TI specifies voltages. This isn't just a TI thing, other CPU (and general big digital) vendors spec things similarly. It's just that TI have more code in mainline than most. My understanding is that when people do this the set of operating points aren't purely about DVFS, they're also about specifying the relationships between the various system clock rates and potentially other things which are supported, especially in the blocks that mediate between multiple domains. The voltages are just one of the parameters here, and with multiple supplies the voltages aren't always orthogonal to each other. > I proposed in a different thread on LKML that DVFS be handled within > the generic clock implementation. Platforms would register a > regulator and a table of voltages for each struct clock that required > DVFS, and the voltages would be changed on normal clk_* requests. > This maintains compatibility with existing clk_* calls. This comes back to the thing we were discussing in the thread there about there being n:m constraints between settings. I've not looked at this in any detail but it may be that for the systems which spec things as operating points we want to root the core clocking in an operating point block which deals with stuff like this.
(sorry, missent the earlier one) On Wed, Apr 27, 2011 at 11:07 AM, Menon, Nishanth <nm@ti.com> wrote: > On Wed, Apr 27, 2011 at 12:49, Colin Cross <ccross@google.com> wrote: > +l-o > >> I'm a little confused about the design for this, and OPP as well. OPP >> matches a struct device * and a frequency to a voltage, which is not a >> generically useful pairing, as far as I can tell. On Tegra, it is >> quite possible for a single device to have multiple clocks that each >> have different voltage requirements, for example the display block can >> have an interface clock as well as a pixel clock. Simplifying this to >> dev + freq = voltage seems very OMAP specific, and will be difficult >> or impossible to adapt to Tegra. > We have the same requirements as well(iclk,fclk,pixclk etc..)! We > group them under voltage domains in OMAP ;). if your issue was a > ability to have a single freq to a OPP, it is upto SoC to do the > proper mapping. Concept of an OPP still remains consistent - which is > for a voltage, there is only so much freq you can drive that specific > module to. No, that is still wrong. You don't drive a module at a frequency, you drive a clock. You can't map struct device * 1-1 to a clock. Look at omap2_set_init_voltage: static int __init omap2_set_init_voltage(char *vdd_name, char *clk_name, struct device *dev) { ... clk = clk_get(NULL, clk_name); freq = clk->rate; opp = opp_find_freq_ceil(dev, &freq); ... } What happens if I have a dev with two frequencies? I can only pass a dev into opp. It makes infinitely more sense to pass in a clock: opp_find_freq_ceil(clk, &freq). > It is upto SoC frameworks to implement the transitions. E.g. lets look > at scalability: How'd the mechanism proposed work with temperature > variances: Example: I dont want to hit 1.5GHz if temp >70C - wont it > be an SoC specific hack I'd need to introduce? No, because you're putting it in the wrong place, that is a policy decision. Handle it in the clock framework, or handle it in the device driver. That's a bad example either way - what happens if you are already at 1.5GHz when the temperature crosses 70C? You need an interrupt that tells you the temperature is too high, and than needs to affect a policy decision at a much higher level than dvfs. > > All OPP framework does is store that maps, and leaves it to users to > choose regulators, clock framework variances, SoC temperature sensors > or what ever mechanisms they choose to allow through a transition. I understand its just a map, but its a map between two things that don't have a direct mapping in many SoCs. I think if you changed every usage of struct dev * in opp to struct clk *, it would make much more sense. There is already a mapping from struct dev * to struct clk *, its called clk_get, and it takes a second parameter to allow devices to have multiple clocks.
On Wed, Apr 27, 2011 at 11:34 AM, Mark Brown <broonie@sirena.org.uk> wrote: > On Wed, Apr 27, 2011 at 10:49:47AM -0700, Colin Cross wrote: > >> Moreover, from a silicon perspective, there is always a simple link >> from a single frequency to a minimum voltage for a given circuit. >> There is no need to group them into OPPs, which seem to have a group >> of clocks and their frequencies that map to a single voltage. That is >> an artifact of the way TI specifies voltages. > > This isn't just a TI thing, other CPU (and general big digital) vendors > spec things similarly. It's just that TI have more code in mainline > than most. My understanding is that when people do this the set of > operating points aren't purely about DVFS, they're also about specifying > the relationships between the various system clock rates and potentially > other things which are supported, especially in the blocks that mediate > between multiple domains. The voltages are just one of the parameters > here, and with multiple supplies the voltages aren't always orthogonal > to each other. Then TI (and other vendors) are providing one table that specifies two things - the relationship between frequency and voltage for each clock, and the relationship between multiple clocks for optimal performance. Mixing those in the kernel breaks platforms (like Tegra) that do not have any relationship between multiple clocks. Can you point me to a platform that implements keeping multiple clocks changing between OPPs together? As far as I can tell, OMAP4 always does dev -> clk, clk -> freq, dev + freq -> voltage, and never looks at multiple clocks. >> I proposed in a different thread on LKML that DVFS be handled within >> the generic clock implementation. Platforms would register a >> regulator and a table of voltages for each struct clock that required >> DVFS, and the voltages would be changed on normal clk_* requests. >> This maintains compatibility with existing clk_* calls. > > This comes back to the thing we were discussing in the thread there > about there being n:m constraints between settings. I've not looked at > this in any detail but it may be that for the systems which spec things > as operating points we want to root the core clocking in an operating > point block which deals with stuff like this. > On some platforms there may be a need for some global clock governor, which can keep multiple clocks in sync, but I haven't seen one for which that is truly required.
On Wed, Apr 27, 2011 at 13:29, Colin Cross <ccross@google.com> wrote: > On Wed, Apr 27, 2011 at 11:07 AM, Menon, Nishanth <nm@ti.com> wrote: >> On Wed, Apr 27, 2011 at 12:49, Colin Cross <ccross@google.com> wrote: >> +l-o >> >>> I'm a little confused about the design for this, and OPP as well. OPP >>> matches a struct device * and a frequency to a voltage, which is not a >>> generically useful pairing, as far as I can tell. On Tegra, it is >>> quite possible for a single device to have multiple clocks that each >>> have different voltage requirements, for example the display block can >>> have an interface clock as well as a pixel clock. Simplifying this to >>> dev + freq = voltage seems very OMAP specific, and will be difficult >>> or impossible to adapt to Tegra. >> We have the same requirements as well(iclk,fclk,pixclk etc..)! We >> group them under voltage domains in OMAP ;). if your issue was a >> ability to have a single freq to a OPP, it is upto SoC to do the >> proper mapping. Concept of an OPP still remains consistent - which is >> for a voltage, there is only so much freq you can drive that specific >> module to. > No, that is still wrong. You don't drive a module at a frequency, you > drive a clock. You can't map struct device * 1-1 to a clock. Look at Agreed, module runs on clocks - Lets say n clocks provide a module it's functionality. > omap2_set_init_voltage: > static int __init omap2_set_init_voltage(char *vdd_name, char *clk_name, > struct device *dev) { > > clk = clk_get(NULL, clk_name); > freq = clk->rate; > opp = opp_find_freq_ceil(dev, &freq); > ... > } > > Now what happens if I have a dev with two frequencies, we do have it - it depends on what the OPP table represents. we do have modules which have both interface and functional clocks on OMAP as well. for a module(represented by struct device *) which has n clocks, choose the scheme of representation of clock that depends on voltage for the module. in the example you provided "the display block can have an interface clock as well as a pixel clock" - I suppose you mean: {.pclk = x, .iclk = y, .v = z} The question I'd ask is this : for a voltage z, is the dependency on pclk or iclk? I can expect a dependency of pclk to iclk requirement (considering pixel clock drives an external display for example). the table reduces to just {.iclk = y, .v = z} and a different table that has divisor for .iclk to pclk which is SoC based. OPP table is just a storage and retrieval mechanism, it is upto SoC frameworks to choose the most adequate of solutions - e.g. OMAP has omap_device, hwmod and a clock framework for more intricate control to work in conjunction with cpuidle frameworks as well. There is cross domain dependency which OMAP (yet to be pushed to mainline) has - example: when OMAP4's MPUs are at a certain OPP, L3 (OMAP's SoC bus) needs to be at least a certain OPP - these are framework which may be very custom to OMAP itself. --- Regards, Nishanth Menon
On Wed, Apr 27, 2011 at 11:48 AM, Menon, Nishanth <nm@ti.com> wrote: > On Wed, Apr 27, 2011 at 13:29, Colin Cross <ccross@google.com> wrote: >> On Wed, Apr 27, 2011 at 11:07 AM, Menon, Nishanth <nm@ti.com> wrote: >>> On Wed, Apr 27, 2011 at 12:49, Colin Cross <ccross@google.com> wrote: >>> +l-o >>> >>>> I'm a little confused about the design for this, and OPP as well. OPP >>>> matches a struct device * and a frequency to a voltage, which is not a >>>> generically useful pairing, as far as I can tell. On Tegra, it is >>>> quite possible for a single device to have multiple clocks that each >>>> have different voltage requirements, for example the display block can >>>> have an interface clock as well as a pixel clock. Simplifying this to >>>> dev + freq = voltage seems very OMAP specific, and will be difficult >>>> or impossible to adapt to Tegra. >>> We have the same requirements as well(iclk,fclk,pixclk etc..)! We >>> group them under voltage domains in OMAP ;). if your issue was a >>> ability to have a single freq to a OPP, it is upto SoC to do the >>> proper mapping. Concept of an OPP still remains consistent - which is >>> for a voltage, there is only so much freq you can drive that specific >>> module to. >> No, that is still wrong. You don't drive a module at a frequency, you >> drive a clock. You can't map struct device * 1-1 to a clock. Look at > Agreed, module runs on clocks - Lets say n clocks provide a module > it's functionality. > >> omap2_set_init_voltage: >> static int __init omap2_set_init_voltage(char *vdd_name, char *clk_name, >> struct device *dev) { >> >> clk = clk_get(NULL, clk_name); >> freq = clk->rate; >> opp = opp_find_freq_ceil(dev, &freq); >> ... >> } >> >> Now what happens if I have a dev with two frequencies, > we do have it - it depends on what the OPP table represents. we do > have modules which have both interface and functional clocks on OMAP > as well. for a module(represented by struct device *) which has n > clocks, choose the scheme of representation of clock that depends on > voltage for the module. > in the example you provided "the display block can have an interface > clock as well as a pixel clock" - I suppose you mean: > {.pclk = x, .iclk = y, .v = z} > The question I'd ask is this : for a voltage z, is the dependency on > pclk or iclk? I can expect a dependency of pclk to iclk requirement > (considering pixel clock drives an external display for example). the > table reduces to just > {.iclk = y, .v = z} and a different table that has divisor for .iclk > to pclk which is SoC based. No, there can be voltage requirements on both, and the higher voltage requirement of the two must be used. > OPP table is just a storage and retrieval mechanism, it is upto SoC > frameworks to choose the most adequate of solutions - e.g. OMAP has > omap_device, hwmod and a clock framework for more intricate control to > work in conjunction with cpuidle frameworks as well. > > There is cross domain dependency which OMAP (yet to be pushed to > mainline) has - example: when OMAP4's MPUs are at a certain OPP, L3 > (OMAP's SoC bus) needs to be at least a certain OPP - these are > framework which may be very custom to OMAP itself. > > --- > Regards, > Nishanth Menon >
On Wed, Apr 27, 2011 at 11:45:18AM -0700, Colin Cross wrote: > Then TI (and other vendors) are providing one table that specifies two > things - the relationship between frequency and voltage for each > clock, and the relationship between multiple clocks for optimal I imagine you'll also see some of the constraints coming from other sources (thermal was mentioned, for example). > performance. Mixing those in the kernel breaks platforms (like Tegra) > that do not have any relationship between multiple clocks. I'm not sure why this is a problem, presumably you can either define a lot of simple OPP domains or not use the relevant infrastructure? If we can't cope with that sort of scenario then we'd also be unable to cope with two distinct chips simultaneously using the infrastructure which seems like an obvious use case. > Can you point me to a platform that implements keeping multiple clocks > changing between OPPs together? As far as I can tell, OMAP4 always > does dev -> clk, clk -> freq, dev + freq -> voltage, and never looks > at multiple clocks. I don't know that I've got anything I can publish, sorry. > > This comes back to the thing we were discussing in the thread there > > about there being n:m constraints between settings. ?I've not looked at > > this in any detail but it may be that for the systems which spec things > > as operating points we want to root the core clocking in an operating > > point block which deals with stuff like this. > On some platforms there may be a need for some global clock governor, > which can keep multiple clocks in sync, but I haven't seen one for > which that is truly required. I gather that with a lot of this stuff there's a big difference between what works and what people are willing to sign off on over all process variations, possible timing conditions and so on.
On Wed, 27 Apr 2011, Menon, Nishanth wrote: > On Wed, Apr 27, 2011 at 12:49, Colin Cross <ccross@google.com> wrote: > > I proposed in a different thread on LKML that DVFS be handled within > > the generic clock implementation. Platforms would register a > > regulator and a table of voltages for each struct clock that required > > DVFS, and the voltages would be changed on normal clk_* requests. > > This maintains compatibility with existing clk_* calls. > > It is upto SoC frameworks to implement the transitions. E.g. lets look > at scalability: How'd the mechanism proposed work with temperature > variances: Example: I dont want to hit 1.5GHz if temp >70C - wont it > be an SoC specific hack I'd need to introduce? Why is limiting the max core frequency depending on temperature a SoC specific problem ? Everyone wants to do that. x86 does it in hardware / SMM, other architectures want the kernel to take care of it. So the decision is simple. Something wants to set core freq to 1.5 GHz, so it calls clk_set_rate() and there we consult the DVFS code first to validate that setting. If it can be set, fine, then DVFS will set the voltages _before_ we change the frequency or it will simply veto the change because one of the preliminaries for such a change is not given. Please stop thinking that your SoC is sooo special. It's NOT. The HW concepts are quite similar all over the place, they are just named differently and use different IP blocks with slightly different functionality, but the problems are not unique to a particular SoC at all. > All OPP framework does is store that maps, and leaves it to users to > choose regulators, clock framework variances, SoC temperature sensors > or what ever mechanisms they choose to allow through a transition. That's how it's implemented, but that does not say that the design is correct and usable for more than the usecase it was modeled after. We are looking into a common clock framework, which abstracts out the duplicated functionality of the various implementations and reduces them to the real thing: hardware drivers. So we really need to look into that DVFS problem as well, simply because it is tightly coupled and not a complete separate entity. And looking at the struct clk disaster we really don't want another incarnation in terms of DVFS where we end up with the same decision functions in various SoCs over and over. Thanks, tglx
On Wed, 27 Apr 2011, Menon, Nishanth wrote: > OPP table is just a storage and retrieval mechanism, it is upto SoC > frameworks to choose the most adequate of solutions - e.g. OMAP has > omap_device, hwmod and a clock framework for more intricate control to > work in conjunction with cpuidle frameworks as well. Can you please stop thinking about OMAP for a minute? A clock framework is nothing SoC specific. A framework is an abstraction of common HW functionality, which implements general functionality and relies on the HW specific part to configure it and to provide access to the hardware itself. clocks are ordered as trees in HW, simply because you cannot have a clock consumer be driven by more than one active clock at the same time. A clock consumer may select a different clock producer, but that merily changes the tree structure nothing else. So why should every SoC implement it's own (different buggy) version of tree handling and call it framework? Yes, I know you might argue that some devices need two clocks enabled to be functional. That's correct, but coupling those clocks at the framework level is the wrong thing to do. If a device needs both an interface clock and a separate interconnect clock to work, then it needs to enable both clocks and become a consumer of them. > There is cross domain dependency which OMAP (yet to be pushed to > mainline) has - example: when OMAP4's MPUs are at a certain OPP, L3 > (OMAP's SoC bus) needs to be at least a certain OPP - these are > framework which may be very custom to OMAP itself. Wrong again. That's not a framework when you hack SoC specific decision functions into it. It's the OMAP internal hackery to make stuff work, but that's far from a framework. What you are describing is a restriction which can be expressed in tables or rules which are fed into a general framework. Look at generic irqs, generic timekeeping, generic clockevents and tons of other real frameworks in the kernel. They abstract out concepts and provide generic interfaces rather than claiming that the problem is unique to a particular piece of silicon. Forget OMAP implementation details for a while, sit back and look at the big picture. Thanks, tglx
On Wed, Apr 27, 2011 at 12:26 PM, Thomas Gleixner <tglx@linutronix.de> wrote: > Forget OMAP implementation details for a while, sit back and look at > the big picture. Here's my proposal for DVFS: - DVFS is implemented in drivers/clk/dvfs.c, and is called by the common clock implementation to adjust the voltages, if necessary, on regular clk_* calls. - Platform code provides mappings in the form (clk, regulator, max frequency, min voltage) to the dvfs code. - Everything that is in OPP today gets converted to helper functions inside the dvfs implementation, and is never called from SoC code (except to pass tables at init), or from drivers. - OPP can be recreated in the future as a upper level policy manager for clocks that need to move together, if that is ever necessary. It would not know anything about voltages. - A few common policy implementations need to be added to the common clock implementation, like temperature limits. For Tegra: - DVFS continues to be accessed by calling clk_* functions For OMAP: - DVFS is triggered by hwmod through clk_* functions. Any cross-arch driver can continue to call clk_* functions. OPP currently has opp_enable and opp_disable functions. I don't understand why these are needed, they are only used at init time to determine available voltages, which could be handled by never passing unavailable voltages to the dvfs implementation.
On Thu, Apr 28, 2011 at 3:37 AM, Colin Cross <ccross@google.com> wrote: > (sorry, missent the earlier one) > > On Wed, Apr 27, 2011 at 11:07 AM, Menon, Nishanth <nm@ti.com> wrote: >> On Wed, Apr 27, 2011 at 12:49, Colin Cross <ccross@google.com> wrote: >> +l-o >> >>> I'm a little confused about the design for this, and OPP as well. OPP >>> matches a struct device * and a frequency to a voltage, which is not a >>> generically useful pairing, as far as I can tell. On Tegra, it is >>> quite possible for a single device to have multiple clocks that each >>> have different voltage requirements, for example the display block can >>> have an interface clock as well as a pixel clock. Simplifying this to >>> dev + freq = voltage seems very OMAP specific, and will be difficult >>> or impossible to adapt to Tegra. >> We have the same requirements as well(iclk,fclk,pixclk etc..)! We >> group them under voltage domains in OMAP ;). if your issue was a >> ability to have a single freq to a OPP, it is upto SoC to do the >> proper mapping. Concept of an OPP still remains consistent - which is >> for a voltage, there is only so much freq you can drive that specific >> module to. > No, that is still wrong. You don't drive a module at a frequency, you > drive a clock. You can't map struct device * 1-1 to a clock. Look at > omap2_set_init_voltage: > static int __init omap2_set_init_voltage(char *vdd_name, char *clk_name, > struct device *dev) { > ... > clk = clk_get(NULL, clk_name); > freq = clk->rate; > opp = opp_find_freq_ceil(dev, &freq); > ... > } > > What happens if I have a dev with two frequencies? I can only pass a > dev into opp. It makes infinitely more sense to pass in a clock: > opp_find_freq_ceil(clk, &freq). What one instance of DVFS (devfreq) controls are clocks and regulators. (a device may have multiple regulators as well as multiple clocks) What one instance of DVFS (devfreq) monitors (device load and/or temperature) is a device that uses the clocks and regulators. If we focus on the things that are controlled by DVFS, connecting DVFS with clock seems fine; however, DVFS's decision is based on the status of the device and the decision (monitoring result) configures a set of clocks and regulators. The clocks are not configured independently from others if the clocks are used by a DVFS-capable device. The frequency/voltage pair (OPP in this patch) associated with a device becomes a representative value of a specific configuration that configures the set of clocks and regulators. This is quite similar with CPUFREQ. CPUFREQ provides a single frequency value as a result of monitoring; however the machine's cpufreq driver may set multiple clocks and multiple voltage regulators based on the representative value (which is usually the core clock) although the cpufreq driver may need to control many more clocks with different frequencies. With multiple clocks of a device, if there is a clock that is required to be set independently from the "representative" clock with DVFS, it means that the DVFS monitoring result (load/temperature) is not a scalar value but a vector (multi-dimensional value). That implies that we need to monitor different and independent values, which in turn, implies that we need separated devices. Note that the DVFS monitor result from load and temperature combined is not a multi-dimensional value because the temperature limits "maximum possible frequency or voltage" and the load gives "preferred lower bound of frequency" that can be overridden by the limit set by temperature. Therefore, having one DVFS per clock where multiple clocks are attached to a device will create multiple monitors that monitor the same object(device behavior) with same metrics (load and temperature). Besides, the reason I've started with "target" callback, not clk and regulator names or pointers is that a device may have multiple clks and regulators and the OPP may only show the representative clock/regulators as CPUFREQ does. Especially when the order of transitions of those multiple clocks and regulators matter (if they are in a single device, it sometimes does), running a DVFS per clock, not per device, will be bothersome if not disasterous. > >> It is upto SoC frameworks to implement the transitions. E.g. lets look >> at scalability: How'd the mechanism proposed work with temperature >> variances: Example: I dont want to hit 1.5GHz if temp >70C - wont it >> be an SoC specific hack I'd need to introduce? > No, because you're putting it in the wrong place, that is a policy > decision. Handle it in the clock framework, or handle it in the > device driver. That's a bad example either way - what happens if you > are already at 1.5GHz when the temperature crosses 70C? You need an > interrupt that tells you the temperature is too high, and than needs > to affect a policy decision at a much higher level than dvfs. > >> >> All OPP framework does is store that maps, and leaves it to users to >> choose regulators, clock framework variances, SoC temperature sensors >> or what ever mechanisms they choose to allow through a transition. > I understand its just a map, but its a map between two things that > don't have a direct mapping in many SoCs. I think if you changed > every usage of struct dev * in opp to struct clk *, it would make much > more sense. There is already a mapping from struct dev * to struct > clk *, its called clk_get, and it takes a second parameter to allow > devices to have multiple clocks. Anyway, I agree that we need either one of the followings (at least in the future) 1) OPP mapped with clock or 2) multiple OPPs mapped with a dev, each OPP with a clock of the dev. Still, DVFS(devfreq) would be better linked directly with a device although we will need to tell which OPP of the device or which clock's OPP should be linked as representative clock OPP. > _______________________________________________ > linux-pm mailing list > linux-pm@lists.linux-foundation.org > https://lists.linux-foundation.org/mailman/listinfo/linux-pm > Cheers! - MyungJoo
On Thu, Apr 28, 2011 at 5:48 AM, Colin Cross <ccross@google.com> wrote: > On Wed, Apr 27, 2011 at 12:26 PM, Thomas Gleixner <tglx@linutronix.de> wrote: >> Forget OMAP implementation details for a while, sit back and look at >> the big picture. > > Here's my proposal for DVFS: > - DVFS is implemented in drivers/clk/dvfs.c, and is called by the > common clock implementation to adjust the voltages, if necessary, on > regular clk_* calls. > - Platform code provides mappings in the form (clk, regulator, max > frequency, min voltage) to the dvfs code. > - Everything that is in OPP today gets converted to helper functions > inside the dvfs implementation, and is never called from SoC code > (except to pass tables at init), or from drivers. > - OPP can be recreated in the future as a upper level policy manager > for clocks that need to move together, if that is ever necessary. It > would not know anything about voltages. > - A few common policy implementations need to be added to the common > clock implementation, like temperature limits. I hope that my previous reply answered this. > > For Tegra: > - DVFS continues to be accessed by calling clk_* functions > > For OMAP: > - DVFS is triggered by hwmod through clk_* functions. Any cross-arch > driver can continue to call clk_* functions. > > OPP currently has opp_enable and opp_disable functions. I don't > understand why these are needed, they are only used at init time to > determine available voltages, which could be handled by never passing > unavailable voltages to the dvfs implementation. We need them in runtime. A device "a" may want to guarantee that a device "b" to be at least "200MHz" or faster while it does some operations. Then, "a" will opp_disable("b", 100MHz and others); and opp_enable("b", them) later on. We have similar issues with multimedia blocks (MFC, Camera, FB, GPU) and CPU/Memory Bus. Ondemand governor of CPUFREQ has some delay on catching up a workload (1.5x the sampling rate in average, <2.0x the sampling rate in worst cases), which may incur flickering/tearing issues with multimedia streams. On the other hand, a general thermal monitor or battery manager might want to limit energy usage by disabling top performance clocks if it is too hot or the battery level is low. > _______________________________________________ > linux-pm mailing list > linux-pm@lists.linux-foundation.org > https://lists.linux-foundation.org/mailman/listinfo/linux-pm >
On Wed, Apr 27, 2011 at 10:59 PM, MyungJoo Ham <myungjoo.ham@samsung.com> wrote: > What one instance of DVFS (devfreq) controls are clocks and > regulators. (a device may have multiple regulators as well as multiple > clocks) > What one instance of DVFS (devfreq) monitors (device load and/or > temperature) is a device that uses the clocks and regulators. > > If we focus on the things that are controlled by DVFS, connecting DVFS > with clock seems fine; however, DVFS's decision is based on the status > of the device and the decision (monitoring result) configures a set of > clocks and regulators. The clocks are not configured independently > from others if the clocks are used by a DVFS-capable device. The > frequency/voltage pair (OPP in this patch) associated with a device > becomes a representative value of a specific configuration that > configures the set of clocks and regulators. > > This is quite similar with CPUFREQ. CPUFREQ provides a single > frequency value as a result of monitoring; however the machine's > cpufreq driver may set multiple clocks and multiple voltage regulators > based on the representative value (which is usually the core clock) > although the cpufreq driver may need to control many more clocks with > different frequencies. > > With multiple clocks of a device, if there is a clock that is required > to be set independently from the "representative" clock with DVFS, it > means that the DVFS monitoring result (load/temperature) is not a > scalar value but a vector (multi-dimensional value). That implies that > we need to monitor different and independent values, which in turn, > implies that we need separated devices. Note that the DVFS monitor > result from load and temperature combined is not a multi-dimensional > value because the temperature limits "maximum possible frequency or > voltage" and the load gives "preferred lower bound of frequency" that > can be overridden by the limit set by temperature. > > Therefore, having one DVFS per clock where multiple clocks are > attached to a device will create multiple monitors that monitor the > same object(device behavior) with same metrics (load and temperature). > > Besides, the reason I've started with "target" callback, not clk and > regulator names or pointers is that a device may have multiple clks > and regulators and the OPP may only show the representative > clock/regulators as CPUFREQ does. Especially when the order of > transitions of those multiple clocks and regulators matter (if they > are in a single device, it sometimes does), running a DVFS per clock, > not per device, will be bothersome if not disasterous. I understand the need for some sort of governor that can use device state to determine the necessary clock frequencies. Where I disagree is the connection to voltages. The governor should ONLY determine the frequencies desired, and the voltage required to meet those frequencies should be determined by the clock framework, based only on the clock and the frequency.
On Wed, Apr 27, 2011 at 11:12 PM, MyungJoo Ham <myungjoo.ham@samsung.com> wrote: > On Thu, Apr 28, 2011 at 5:48 AM, Colin Cross <ccross@google.com> wrote: >> OPP currently has opp_enable and opp_disable functions. I don't >> understand why these are needed, they are only used at init time to >> determine available voltages, which could be handled by never passing >> unavailable voltages to the dvfs implementation. > > We need them in runtime. > > A device "a" may want to guarantee that a device "b" to be at least > "200MHz" or faster while it does some operations. Then, "a" will > opp_disable("b", 100MHz and others); and opp_enable("b", them) later > on. We have similar issues with multimedia blocks (MFC, Camera, FB, > GPU) and CPU/Memory Bus. Ondemand governor of CPUFREQ has some delay > on catching up a workload (1.5x the sampling rate in average, <2.0x > the sampling rate in worst cases), which may incur flickering/tearing > issues with multimedia streams. On the other hand, a general thermal > monitor or battery manager might want to limit energy usage by > disabling top performance clocks if it is too hot or the battery level > is low. That sounds like a very strange api, when what you really mean is clk_set_min_rate or clk_set_max_rate.
On Thu, Apr 28, 2011 at 3:44 PM, Colin Cross <ccross@google.com> wrote: > On Wed, Apr 27, 2011 at 11:12 PM, MyungJoo Ham <myungjoo.ham@samsung.com> wrote: >> On Thu, Apr 28, 2011 at 5:48 AM, Colin Cross <ccross@google.com> wrote: >>> OPP currently has opp_enable and opp_disable functions. I don't >>> understand why these are needed, they are only used at init time to >>> determine available voltages, which could be handled by never passing >>> unavailable voltages to the dvfs implementation. >> >> We need them in runtime. >> >> A device "a" may want to guarantee that a device "b" to be at least >> "200MHz" or faster while it does some operations. Then, "a" will >> opp_disable("b", 100MHz and others); and opp_enable("b", them) later >> on. We have similar issues with multimedia blocks (MFC, Camera, FB, >> GPU) and CPU/Memory Bus. Ondemand governor of CPUFREQ has some delay >> on catching up a workload (1.5x the sampling rate in average, <2.0x >> the sampling rate in worst cases), which may incur flickering/tearing >> issues with multimedia streams. On the other hand, a general thermal >> monitor or battery manager might want to limit energy usage by >> disabling top performance clocks if it is too hot or the battery level >> is low. > > That sounds like a very strange api, when what you really mean is > clk_set_min_rate or clk_set_max_rate. Essentially, that's what needed. However, with clk_set_min/max_rate, don't we need to let another device to be consumer of other devices' clocks? Not just introducing a device to other devices? > _______________________________________________ > linux-pm mailing list > linux-pm@lists.linux-foundation.org > https://lists.linux-foundation.org/mailman/listinfo/linux-pm >
On Wed, Apr 27, 2011 at 10:46 PM, Rafael J. Wysocki <rjw@sisk.pl> wrote: > On Wednesday, April 27, 2011, MyungJoo Ham wrote: >> Hello, >> >> 2011/4/26 Rafael J. Wysocki <rjw@sisk.pl>: >> > Hi, >> > > ... >> >> + >> >> +/** >> >> + * dvfs_tickle_device() - Guarantee maximum operation speed for a while >> >> + * instaneously. >> >> + * @dev: the device to be tickled. >> >> + * @duration_ms: the duration of tickle effect. >> >> + * >> >> + * Tickle sets the device at the maximum frequency instaneously and >> >> + * the maximul frequency is guaranteed to be used for the given duration. >> > ^^^^^^^ >> > I guess that should be "maximum". >> >> Yes. :) >> >> > >> >> + * For faster user reponse time, an input event may tickle a related device >> >> + * so that the input event does not need to wait for the DVFS to react with >> >> + * normal interval. >> >> + */ >> > >> > Can you explain how this function is supposed to be used, please? >> >> Ok, I'll show you a usage example >> >> The system: a touchscreen based computer with DVFS'able GPU. >> Use case: a user scrolls the screen with fingers suddenly. (think of >> an iPhone or Android device's home screen filled wit icons) >> While there is no touchscreen input (thus, the screen does not move >> fast), we do not use GPU too much; thus it is scaled to run at slower >> rate (let's say 100MHz). >> If a user suddenly touches screen and starts moving, the screen should >> follow it; thus, requires much output from GPU. >> With normal DVFS-GPU, it would take some time to speed it up because >> it should monitor the GPU usage and react afterwards. >> >> With the API given, "tickle" may be included in the user input event. >> So that GPU can be scaled to maximum frequency immediately. > > So in this case the input subsystem would be calling dvfs_tickle_device() > (or whatever it is finally called) for the GPU, right? Yes, that's correct. The input subsystem should somehow "tickle" a DVFS device (GPU in this case). With CPUFREQ-Tickle (never upstreamed anyway), we have been doing this by adding tickle as an "input event" at board file (i.e., arch/arm/mach-*/mach-*.c). > > I'm afraid you'd need a user space interface for that and let the > application (or library) handling the touchpad input control the "tickle". > Otherwise it would be difficult to arrange things correctly (the association > between the input subsystem and the GPU is not obvious at the kernel level). > Creating a user space interface to connect an input event (or any other user space event) with a DVFS (like /sys/devices/system/cpu/cpux/cpufreq/...) will provide the flexibility for user space to tickle although it might not be as efficient as the approach in the previous paragraph in the metric of how much response time is reduced. We may do that by allowing something like this: "echo 0 > /sys/class/devfreq/devname.0/tickle" or even "echo0 > /sys/class/devfreq/tickle_all" > Ideally, though, the working OPP for the GPU should depend on the number of > processing requests per a unit of time, pretty much like for a CPU. As long as we measure workload (number of requests or operations per unit time) periodically, we have the issue mentioned that can be mitigated by tickling. Ideally, if we can count each operation at device driver, we do not need tickling because we can monitor the workload with per-operation basis not per-sampling-time basis. If it is a device with .. like .. 1000 operations/sec, it might be feasible. However, GPUs may have like over 100,000,000 operations/sec and we'd better just monitor them periodically. Besides, devices including many GPUs allow DMA and user-addressable memory without device driver intervention for normal operations once the device is configured. Thus, device driver wouldn't know anything about per-operation basis; we need to stick with periodic monitoring anyway (and with delayed DVFS issue) unless a GPU can invoke an interrupt of "I'm overloaded, let me work faster" > >> If the monitoring frequency is 50ms, DVFS will likely to react with >> about 75ms of delay. >> For example, if the touch event has occurred at 25ms after the last >> DVFS monitoring event, at the next DVFS monitoring event, the measured >> load will be about 50%, which usually won't require DVFS to speed up. >> Then, DVFS will react at the next monitoring event; thus, it took 50ms >> + 25ms. >> >> 75ms is enough to give some impression to users. These days, many >> mobile devices require 60Hz-based UI, which is about 16ms. >> >> Anyway, this happens with drivers/cpufreq also. We have been testing >> "tickling" associated with drivers/cpufreq/cpufreq.c. This has been >> reduced user response time significantly removing the need for tuning >> the threshold values. > > I think I understand the problem, but I'm not sure if there's a clean way > to trigger the "tickle" from the kernel level. I'm considering the followings (and they are not mutually exclusive). How about them? 1. (the way that we've been using) Add a tickle call to input events at board file. 2. provide sysfs interface that triggers tickling along with devfreq sysfs. > > Thanks, > Rafael > _______________________________________________ > linux-pm mailing list > linux-pm@lists.linux-foundation.org > https://lists.linux-foundation.org/mailman/listinfo/linux-pm > Cheers! - MyungJoo
On Wed, Apr 27, 2011 at 11:50 PM, MyungJoo Ham <myungjoo.ham@samsung.com> wrote: > On Thu, Apr 28, 2011 at 3:44 PM, Colin Cross <ccross@google.com> wrote: >> On Wed, Apr 27, 2011 at 11:12 PM, MyungJoo Ham <myungjoo.ham@samsung.com> wrote: >>> On Thu, Apr 28, 2011 at 5:48 AM, Colin Cross <ccross@google.com> wrote: >>>> OPP currently has opp_enable and opp_disable functions. I don't >>>> understand why these are needed, they are only used at init time to >>>> determine available voltages, which could be handled by never passing >>>> unavailable voltages to the dvfs implementation. >>> >>> We need them in runtime. >>> >>> A device "a" may want to guarantee that a device "b" to be at least >>> "200MHz" or faster while it does some operations. Then, "a" will >>> opp_disable("b", 100MHz and others); and opp_enable("b", them) later >>> on. We have similar issues with multimedia blocks (MFC, Camera, FB, >>> GPU) and CPU/Memory Bus. Ondemand governor of CPUFREQ has some delay >>> on catching up a workload (1.5x the sampling rate in average, <2.0x >>> the sampling rate in worst cases), which may incur flickering/tearing >>> issues with multimedia streams. On the other hand, a general thermal >>> monitor or battery manager might want to limit energy usage by >>> disabling top performance clocks if it is too hot or the battery level >>> is low. >> >> That sounds like a very strange api, when what you really mean is >> clk_set_min_rate or clk_set_max_rate. > > Essentially, that's what needed. > However, with clk_set_min/max_rate, don't we need to let another > device to be consumer of other devices' clocks? Not just introducing a > device to other devices? Yes, but that's effectively what you're doing through a backwards api anyways. The question is, for these complicated clock scenarios where the final frequency of a clock depends on so many factors, should that control go through the clock framework, or through some sort of global clock governor (which is where OPP would reappear).
On Thu, Apr 28, 2011 at 3:43 PM, Colin Cross <ccross@google.com> wrote: > I understand the need for some sort of governor that can use device > state to determine the necessary clock frequencies. Where I disagree > is the connection to voltages. The governor should ONLY determine the > frequencies desired, and the voltage required to meet those > frequencies should be determined by the clock framework, based only on > the clock and the frequency. Yes, as long as AVS(Adaptive Voltage Scaling) is not involved, devfreq does not need to care about voltages and let device driver (such as the target callback or its callee) take care of voltages. Besides, my impression on AVS is that AVS wouldn't be depending on software DVFS scheme, at least with some AVS test on S5PC110. So, I'd say that it's safe to let devfreq framework handle frequency only and let target callback handle anything else except for choosing representative clock frequency. However, if we are going to detach devfreq from OPP, we only need to provide frequency list at init and { an interface to control max/min freq or an interface to lookup max/min freq of corresponding representative clock. } > _______________________________________________ > linux-pm mailing list > linux-pm@lists.linux-foundation.org > https://lists.linux-foundation.org/mailman/listinfo/linux-pm > ps. In our AVS test, the device drivers had nothing to do with voltage scaling except for initializing devices. The H/W did everything about voltage scaling dynamically. Thanks, MyungJoo.
On Thu, Apr 28, 2011 at 4:06 PM, Colin Cross <ccross@google.com> wrote: > On Wed, Apr 27, 2011 at 11:50 PM, MyungJoo Ham <myungjoo.ham@samsung.com> wrote: >> On Thu, Apr 28, 2011 at 3:44 PM, Colin Cross <ccross@google.com> wrote: >>> On Wed, Apr 27, 2011 at 11:12 PM, MyungJoo Ham <myungjoo.ham@samsung.com> wrote: >>>> On Thu, Apr 28, 2011 at 5:48 AM, Colin Cross <ccross@google.com> wrote: >>>>> OPP currently has opp_enable and opp_disable functions. I don't >>>>> understand why these are needed, they are only used at init time to >>>>> determine available voltages, which could be handled by never passing >>>>> unavailable voltages to the dvfs implementation. >>>> >>>> We need them in runtime. >>>> >>>> A device "a" may want to guarantee that a device "b" to be at least >>>> "200MHz" or faster while it does some operations. Then, "a" will >>>> opp_disable("b", 100MHz and others); and opp_enable("b", them) later >>>> on. We have similar issues with multimedia blocks (MFC, Camera, FB, >>>> GPU) and CPU/Memory Bus. Ondemand governor of CPUFREQ has some delay >>>> on catching up a workload (1.5x the sampling rate in average, <2.0x >>>> the sampling rate in worst cases), which may incur flickering/tearing >>>> issues with multimedia streams. On the other hand, a general thermal >>>> monitor or battery manager might want to limit energy usage by >>>> disabling top performance clocks if it is too hot or the battery level >>>> is low. >>> >>> That sounds like a very strange api, when what you really mean is >>> clk_set_min_rate or clk_set_max_rate. >> >> Essentially, that's what needed. >> However, with clk_set_min/max_rate, don't we need to let another >> device to be consumer of other devices' clocks? Not just introducing a >> device to other devices? > > Yes, but that's effectively what you're doing through a backwards api > anyways. The question is, for these complicated clock scenarios where > the final frequency of a clock depends on so many factors, should that > control go through the clock framework, or through some sort of global > clock governor (which is where OPP would reappear). > In the use cases of runtime clock setting by devfreq or other devices mentioned above, we are controlling the device's performance with the representative clock of the device, not a specific clock among the clocks that the device has. For a device "A" with clock "a1" and "a2", another device "B" would not control both "a1" and "a2" directly to get the guaranteed performance from "A". Besides, "B" should not do so if there are specific orders, delays, and other controls for "A" to properly change performance. Therefore, my answer is that it would be preferred to control through some wrapper/interface/or anything that is connected to the device of the controlled clocks (and let the device's callback or something control its clocks), not to control through clock framework directly. In this version of devfreq+OPP, these are handled by the "target" callback. Cheers! - MyungJoo
On Wed, Apr 27, 2011 at 01:48:52PM -0700, Colin Cross wrote: > OPP currently has opp_enable and opp_disable functions. I don't > understand why these are needed, they are only used at init time to > determine available voltages, which could be handled by never passing > unavailable voltages to the dvfs implementation. I queried this when OPP was originally added. The motivation which was given (which seemed fairly reasonable) was to reduce the number of data tables for similar parts and board designs. That did seem like something which it was reasonable to factor out in some way, though possibly with a different mechanism.
On Thursday, April 28, 2011, MyungJoo Ham wrote: > On Wed, Apr 27, 2011 at 10:46 PM, Rafael J. Wysocki <rjw@sisk.pl> wrote: > > On Wednesday, April 27, 2011, MyungJoo Ham wrote: > >> Hello, ... > >> Anyway, this happens with drivers/cpufreq also. We have been testing > >> "tickling" associated with drivers/cpufreq/cpufreq.c. This has been > >> reduced user response time significantly removing the need for tuning > >> the threshold values. > > > > I think I understand the problem, but I'm not sure if there's a clean way > > to trigger the "tickle" from the kernel level. > > I'm considering the followings (and they are not mutually exclusive). > How about them? > 1. (the way that we've been using) Add a tickle call to input events > at board file. Well, first, the input events may not be the only user of that feature and the GPU may not be the only target. So, there should be a way to specify multiple targets for each user and vice versa while keeping the drivers portable between platforms. That seems to be challenging. :-) Second, even if the input events were the only user and the GPU were the only possible target, it probably won't be necessary to "tickle" the GPU every time the touchscreen is used, only when the user is "scrolling", which input events really can't tell. The only entity having an idea about what the _user_ is doing is the application being used at the moment. > 2. provide sysfs interface that triggers tickling along with devfreq sysfs. That may be a better approach IMO, but making applications use it also will be challenging. Thanks, Rafael
diff --git a/drivers/base/power/Makefile b/drivers/base/power/Makefile index 118c1b9..4d92d7f 100644 --- a/drivers/base/power/Makefile +++ b/drivers/base/power/Makefile @@ -3,6 +3,7 @@ obj-$(CONFIG_PM_SLEEP) += main.o wakeup.o obj-$(CONFIG_PM_RUNTIME) += runtime.o obj-$(CONFIG_PM_TRACE_RTC) += trace.o obj-$(CONFIG_PM_OPP) += opp.o +obj-$(CONFIG_PM_GENERIC_DVFS) += dvfs.o ccflags-$(CONFIG_DEBUG_DRIVER) := -DDEBUG ccflags-$(CONFIG_PM_VERBOSE) += -DDEBUG diff --git a/drivers/base/power/dvfs.c b/drivers/base/power/dvfs.c new file mode 100644 index 0000000..867ad70 --- /dev/null +++ b/drivers/base/power/dvfs.c @@ -0,0 +1,401 @@ +/* + * Generic DVFS Interface for Non-CPU Devices + * Based on OPP. + * + * Copyright (C) 2011 Samsung Electronics + * MyungJoo Ham <myungjoo.ham@samsung.com> + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License version 2 as + * published by the Free Software Foundation. + */ + +#include <linux/kernel.h> +#include <linux/errno.h> +#include <linux/err.h> +#include <linux/init.h> +#include <linux/slab.h> +#include <linux/opp.h> +#include <linux/dvfs.h> +#include <linux/workqueue.h> +#include <linux/platform_device.h> +#include <linux/list.h> + +#define DVFS_INTERVAL 20 +/** + * struct dvfs - Device DVFS structure + * @node list node - contains the devices with DVFS that have been + * registered. + * @dev device pointer + * @profile device-specific dvfs profile + * @governor method how to choose frequency based on the usage. + * @previous_freq previously configured frequency value. + * @next_polling the number of remaining "dvfs_monitor" executions to + * reevaluate frequency/voltage of the device. Set by + * profile's polling_ms interval. + * @tickle positive if DVFS-tickling is activated for the device. + * at each executino of dvfs_monitor, tickle is decremented. + * User may tickle a device-dvfs in order to set maximum + * frequency instaneously with some guaranteed duration. + * + * This structure stores the DVFS information for a give device. + */ +struct dvfs { + struct list_head node; + + struct device *dev; + struct dvfs_device_profile *profile; + struct dvfs_governor *governor; + + unsigned long previous_freq; + unsigned int next_polling; + unsigned int tickle; +}; + +/* + * dvfs_work periodically (given by DVFS_INTERVAL) monitors every + * registered device. + */ +static struct delayed_work dvfs_work; +/* The list of all device-dvfs */ +static LIST_HEAD(dvfs_list); +/* Exclusive access to dvfs_list and its elements */ +static DEFINE_MUTEX(dvfs_list_lock); + +/** + * find_device_dvfs() - find dvfs struct using device pointer + * @dev: device pointer used to lookup device DVFS. + * + * Search the list of device DVFSs and return the matched device's DVFS info. + */ +static struct dvfs *find_device_dvfs(struct device *dev) +{ + struct dvfs *tmp_dvfs, *dvfs = ERR_PTR(-ENODEV); + + if (unlikely(IS_ERR_OR_NULL(dev))) { + pr_err("%s: Invalid parameters\n", __func__); + return ERR_PTR(-EINVAL); + } + + list_for_each_entry(tmp_dvfs, &dvfs_list, node) { + if (tmp_dvfs->dev == dev) { + dvfs = tmp_dvfs; + break; + } + } + + return dvfs; +} + +/** + * dvfs_next_polling() - the number of dvfs-monitor iterations to satisfy + * the given polling interval. The interval is rounded by + * dvfs-monitor poling interval (DVFS_INTERVAL) with ceiling + * function. + * @ms: the length of polling interval in milli-second + */ +static unsigned int dvfs_next_polling(int ms) +{ + unsigned int ret; + + if (ms <= 0) + return 0; + + ret = ms / DVFS_INTERVAL; + if (ms % DVFS_INTERVAL) + ret++; + + return ret; +} + +/** + * dvfs_do() - Check the usage profile of a given device and configure + * frequency and voltage accordingly + * @dvfs: DVFS info of the given device + */ +static int dvfs_do(struct dvfs *dvfs) +{ + struct dvfs_usage_profile profile; + struct opp *opp; + unsigned long freq; + int err; + + err = dvfs->profile->get_usage_profile(dvfs->dev, &profile); + if (err) { + dev_err(dvfs->dev, "%s: Cannot get usage profile (%d)\n", + __func__, err); + return err; + } + + err = dvfs->governor->get_target_freq(&profile, dvfs->governor, &freq); + if (err) { + dev_err(dvfs->dev, "%s: Cannot get target frequency (%d)\n", + __func__, err); + return err; + } + + opp = opp_find_freq_ceil(dvfs->dev, &freq); + if (opp == ERR_PTR(-ENODEV)) + opp = opp_find_freq_floor(dvfs->dev, &freq); + + if (IS_ERR(opp)) { + dev_err(dvfs->dev, "%s: Cannot find opp with %luHz.\n", + __func__, freq); + return PTR_ERR(opp); + } + + freq = opp_get_freq(opp); + if (dvfs->previous_freq != freq) { + err = dvfs->profile->target(dvfs->dev, freq, + opp_get_voltage(opp)); + dvfs->previous_freq = freq; + } + + if (err) + dev_err(dvfs->dev, "%s: Cannot set %luHz/%luuV\n", __func__, + opp_get_freq(opp), opp_get_voltage(opp)); + return err; +} + +/** + * dvfs_monitor() - Regularly run dvfs_do() and support device DVFS tickle. + * @work: the work struct used to run dvfs_monitor periodically. + */ +static void dvfs_monitor(struct work_struct *work) +{ + struct dvfs *dvfs; + + mutex_lock(&dvfs_list_lock); + + list_for_each_entry(dvfs, &dvfs_list, node) { + if (dvfs->next_polling == 0) + continue; + if (dvfs->tickle) { + dvfs->tickle--; + continue; + } + if (dvfs->next_polling == 1) { + dvfs_do(dvfs); + dvfs->next_polling = dvfs_next_polling( + dvfs->profile->polling_ms); + } else { + dvfs->next_polling--; + } + } + + schedule_delayed_work(&dvfs_work, msecs_to_jiffies(DVFS_INTERVAL)); + mutex_unlock(&dvfs_list_lock); +} + +/** + * dvfs_add_device() - Add dvfs feature to the device + * @dev: the device to add dvfs feature. + * @profile: device-specific profile to run dvfs. + * @governor: the policy to choose frequency. + */ +int dvfs_add_device(struct device *dev, struct dvfs_device_profile *profile, + struct dvfs_governor *governor) +{ + struct dvfs *new_dvfs, *dvfs; + struct cpufreq_frequency_table *table; + int err; + + if (!dev || !profile || !governor) { + dev_err(dev, "%s: Invalid parameters.\n", __func__); + return -EINVAL; + } + + err = opp_init_cpufreq_table(dev, &table); + if (err) { + dev_err(dev, "%s: Device OPP cannot provide DVFS table (%d)\n", + __func__, err); + return err; + } + + new_dvfs = kzalloc(sizeof(struct dvfs), GFP_KERNEL); + if (!new_dvfs) { + dev_err(dev, "%s: Unable to create DVFS for the device\n", + __func__); + return -ENOMEM; + } + + mutex_lock(&dvfs_list_lock); + + dvfs = find_device_dvfs(dev); + if (!IS_ERR(dvfs)) { + dev_err(dev, "%s: Unable to create DVFS for the device. " + "It already has one.\n", __func__); + mutex_unlock(&dvfs_list_lock); + kfree(new_dvfs); + return -EINVAL; + } + + new_dvfs->dev = dev; + new_dvfs->profile = profile; + new_dvfs->governor = governor; + new_dvfs->next_polling = dvfs_next_polling(profile->polling_ms); + new_dvfs->previous_freq = 0; + + list_add(&new_dvfs->node, &dvfs_list); + + mutex_unlock(&dvfs_list_lock); + + return 0; +} + +/** + * dvfs_remove_device() - Remove DVFS feature from a device. + * @device: the device to remove dvfs feature. + */ +int dvfs_remove_device(struct device *dev) +{ + struct dvfs *dvfs; + + if (!dev) + return -EINVAL; + + mutex_lock(&dvfs_list_lock); + dvfs = find_device_dvfs(dev); + if (IS_ERR(dvfs)) { + dev_err(dev, "%s: Unable to find DVFS entry for the device.\n", + __func__); + mutex_unlock(&dvfs_list_lock); + return -EINVAL; + } + + list_del(&dvfs->node); + + kfree(dvfs); + + mutex_unlock(&dvfs_list_lock); + + return 0; +} + +/** + * dvfs_update() - Notify that the device OPP has been changed. + * @dev: the device whose OPP has been changed. + * @may_not_exist: do not print error message even if the device + * does not have dvfs entry. + */ +int dvfs_update(struct device *dev, bool may_not_exist) +{ + struct dvfs *dvfs; + int err = 0; + + mutex_lock(&dvfs_list_lock); + dvfs = find_device_dvfs(dev); + if (!IS_ERR(dvfs)) { + if (dvfs->tickle) { + unsigned long freq = dvfs->profile->max_freq; + struct opp *opp = opp_find_freq_floor(dvfs->dev, &freq); + + freq = opp_get_freq(opp); + if (dvfs->previous_freq != freq) { + err = dvfs->profile->target(dvfs->dev, freq, + opp_get_voltage(opp)); + dvfs->previous_freq = freq; + } + } else { + err = dvfs_do(dvfs); + } + } + mutex_unlock(&dvfs_list_lock); + + if (IS_ERR(dvfs)) { + if (may_not_exist && PTR_ERR(dvfs) == -EINVAL) + return 0; + + dev_err(dev, "%s: Device DVFS not found (%ld)\n", __func__, + PTR_ERR(dvfs)); + return PTR_ERR(dvfs); + } + + return err; +} + +/** + * dvfs_tickle_device() - Guarantee maximum operation speed for a while + * instaneously. + * @dev: the device to be tickled. + * @duration_ms: the duration of tickle effect. + * + * Tickle sets the device at the maximum frequency instaneously and + * the maximul frequency is guaranteed to be used for the given duration. + * For faster user reponse time, an input event may tickle a related device + * so that the input event does not need to wait for the DVFS to react with + * normal interval. + */ +int dvfs_tickle_device(struct device *dev, unsigned long duration_ms) +{ + struct dvfs *dvfs; + struct opp *opp; + unsigned long freq; + int err = 0; + + mutex_lock(&dvfs_list_lock); + dvfs = find_device_dvfs(dev); + if (!IS_ERR(dvfs)) { + freq = dvfs->profile->max_freq; + opp = opp_find_freq_floor(dvfs->dev, &freq); + freq = opp_get_freq(opp); + if (dvfs->previous_freq != freq) { + err = dvfs->profile->target(dvfs->dev, freq, + opp_get_voltage(opp)); + dvfs->previous_freq = freq; + } + if (err) + dev_err(dev, "%s: Cannot set frequency.\n", __func__); + else + dvfs->tickle = dvfs_next_polling(duration_ms); + } + mutex_unlock(&dvfs_list_lock); + + if (IS_ERR(dvfs)) { + dev_err(dev, "%s: Cannot find dvfs.\n", __func__); + err = PTR_ERR(dvfs); + } + + return err; +} + +#ifdef CONFIG_PM_SLEEP +static int dvfs_pause(struct device *dev) +{ + cancel_delayed_work_sync(&dvfs_work); + return 0; +} + +static void dvfs_continue(struct device *dev) +{ + schedule_delayed_work(&dvfs_work, msecs_to_jiffies(DVFS_INTERVAL)); +} +#else +#define dvfs_pause NULL +#define dvfs_continue NULL +#endif +static const struct dev_pm_ops dvfs_pm = { + .prepare = dvfs_pause, + .complete = dvfs_continue, +}; +static struct platform_driver dvfs_pm_drv = { + .driver = { + .name = "generic_dvfs_pm", + .pm = &dvfs_pm, + }, +}; +static struct platform_device dvfs_pm_dev = { + .name = "generic_dvfs_pm", +}; + +static int __init dvfs_init(void) +{ + platform_driver_register(&dvfs_pm_drv); + platform_device_register(&dvfs_pm_dev); + + INIT_DELAYED_WORK_DEFERRABLE(&dvfs_work, dvfs_monitor); + schedule_delayed_work(&dvfs_work, msecs_to_jiffies(DVFS_INTERVAL)); + + return 0; +} +late_initcall(dvfs_init); diff --git a/drivers/base/power/opp.c b/drivers/base/power/opp.c index 56a6899..df74655 100644 --- a/drivers/base/power/opp.c +++ b/drivers/base/power/opp.c @@ -21,6 +21,7 @@ #include <linux/rculist.h> #include <linux/rcupdate.h> #include <linux/opp.h> +#include <linux/dvfs.h> /* * Internal data structure organization with the OPP layer library is as @@ -428,6 +429,9 @@ int opp_add(struct device *dev, unsigned long freq, unsigned long u_volt) list_add_rcu(&new_opp->node, head); mutex_unlock(&dev_opp_list_lock); + /* Notify generic dvfs for the change */ + dvfs_update(dev, true); + return 0; } @@ -512,6 +516,9 @@ unlock: mutex_unlock(&dev_opp_list_lock); out: kfree(new_opp); + + /* Notify generic dvfs for the change */ + dvfs_update(dev, true); return r; } diff --git a/include/linux/dvfs.h b/include/linux/dvfs.h new file mode 100644 index 0000000..de3e53d --- /dev/null +++ b/include/linux/dvfs.h @@ -0,0 +1,69 @@ +/* + * Generic DVFS Interface for Non-CPU Devices + * + * Copyright (C) 2011 Samsung Electronics + * MyungJoo Ham <myungjoo.ham@samsung.com> + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License version 2 as + * published by the Free Software Foundation. + */ + +#ifndef __LINUX_DVFS_H__ +#define __LINUX_DVFS_H__ + +struct dvfs; + +struct dvfs_usage_profile { + /* both since the last measure */ + unsigned long total_time; + unsigned long busy_time; +}; + +struct dvfs_device_profile { + unsigned long max_freq; /* may be larger than the actual value */ + int (*target)(struct device *dev, unsigned long freq, + unsigned long u_volt); + int (*get_usage_profile)(struct device *dev, + struct dvfs_usage_profile *profile); + int polling_ms; /* 0 for at opp change only */ +}; + +struct dvfs_governor { + void *data; /* private data for get_target_freq */ + int (*get_target_freq)(struct dvfs_usage_profile *profile, + struct dvfs_governor *this, unsigned long *freq); +}; + +#if defined(CONFIG_PM_GENERIC_DVFS) +extern int dvfs_add_device(struct device *ev, + struct dvfs_device_profile *profile, + struct dvfs_governor *governor); +extern int dvfs_remove_device(struct device *dev); +extern int dvfs_update(struct device *dev, bool may_not_exist); +extern int dvfs_tickle_device(struct device *dev, unsigned long duration_ms); +#else /* !CONFIG_PM_GENERIC_DVFS */ +static int dvfs_add_device(struct device *ev, + struct dvfs_device_profile *profile, + struct dvfs_governor *governor) +{ + return 0; +} + +static int dvfs_remove_device(struct device *dev) +{ + return 0; +} + +static int dvfs_update(struct device *dev, bool may_not_exist) +{ + return 0; +} + +static int dvfs_tickle_device(struct device *dev, unsigned long duration_ms) +{ + return 0; +} +#endif /* CONFIG_PM_GENERIC_DVFS */ + +#endif /* __LINUX_DVFS_H__ */ diff --git a/kernel/power/Kconfig b/kernel/power/Kconfig index 4603f08..7d191ec 100644 --- a/kernel/power/Kconfig +++ b/kernel/power/Kconfig @@ -225,3 +225,12 @@ config PM_OPP representing individual voltage domains and provides SOC implementations a ready to use framework to manage OPPs. For more information, read <file:Documentation/power/opp.txt> + +config PM_GENERIC_DVFS + bool "Generic DVFS framework" + depends on PM_OPP + help + With OPP support, a device may have a list of frequencies and + voltages available. Generic DVFS framework provides a method to + control frequency/voltage of a device with OPP with given profile + and governor per device.