[3/4] PM / OPP: take RCU lock in dev_pm_opp_get_opp_count

Message ID 1418771379-24369-4-git-send-email-dtor@chromium.org (mailing list archive)
State Accepted, archived
Delegated to: Rafael Wysocki

Commit Message

Dmitry Torokhov Dec. 16, 2014, 11:09 p.m. UTC
A lot of callers are missing the fact that dev_pm_opp_get_opp_count
needs to be called under RCU lock. Given that RCU locks can safely be
nested, instead of providing *_locked() API, let's take RCU lock inside
dev_pm_opp_get_opp_count() and leave callers as is.

Signed-off-by: Dmitry Torokhov <dtor@chromium.org>
---
 drivers/base/power/opp.c | 15 ++++++++-------
 1 file changed, 8 insertions(+), 7 deletions(-)

Comments

Viresh Kumar Dec. 17, 2014, 4:36 a.m. UTC | #1
On 17 December 2014 at 04:39, Dmitry Torokhov <dtor@chromium.org> wrote:
> A lot of callers are missing the fact that dev_pm_opp_get_opp_count
> needs to be called under RCU lock. Given that RCU locks can safely be
> nested, instead of providing *_locked() API, let's take RCU lock inside

Hmm, I asked for a *_locked() API because many users of
dev_pm_opp_get_opp_count() are already calling it from rcu read side
critical sections.

Now, there are two questions:
- Can rcu-read side critical sections be nested ?

Yes, this is what the comment over rcu_read_lock() says

 * RCU read-side critical sections may be nested.  Any deferred actions
 * will be deferred until the outermost RCU read-side critical section
 * completes.

- Would it be better to drop these double rcu_read_locks() ? i.e. either
get a *_locked() API or fix the callers of dev_pm_opp_get_opp_count().

@Paul: What do you say ?

> dev_pm_opp_get_opp_count() and leave callers as is.
>
> Signed-off-by: Dmitry Torokhov <dtor@chromium.org>
> ---
>  drivers/base/power/opp.c | 15 ++++++++-------
>  1 file changed, 8 insertions(+), 7 deletions(-)
>
> diff --git a/drivers/base/power/opp.c b/drivers/base/power/opp.c
> index 413c7fe..ee5eca2 100644
> --- a/drivers/base/power/opp.c
> +++ b/drivers/base/power/opp.c
> @@ -216,9 +216,7 @@ EXPORT_SYMBOL_GPL(dev_pm_opp_get_freq);
>   * This function returns the number of available opps if there are any,
>   * else returns 0 if none or the corresponding error value.
>   *
> - * Locking: This function must be called under rcu_read_lock(). This function
> - * internally references two RCU protected structures: device_opp and opp which
> - * are safe as long as we are under a common RCU locked section.
> + * Locking: This function takes rcu_read_lock().
>   */
>  int dev_pm_opp_get_opp_count(struct device *dev)
>  {
> @@ -226,13 +224,14 @@ int dev_pm_opp_get_opp_count(struct device *dev)
>         struct dev_pm_opp *temp_opp;
>         int count = 0;
>
> -       opp_rcu_lockdep_assert();
> +       rcu_read_lock();
>
>         dev_opp = find_device_opp(dev);
>         if (IS_ERR(dev_opp)) {
> -               int r = PTR_ERR(dev_opp);
> -               dev_err(dev, "%s: device OPP not found (%d)\n", __func__, r);
> -               return r;
> +               count = PTR_ERR(dev_opp);
> +               dev_err(dev, "%s: device OPP not found (%d)\n",
> +                       __func__, count);
> +               goto out_unlock;
>         }
>
>         list_for_each_entry_rcu(temp_opp, &dev_opp->opp_list, node) {
> @@ -240,6 +239,8 @@ int dev_pm_opp_get_opp_count(struct device *dev)
>                         count++;
>         }
>
> +out_unlock:
> +       rcu_read_unlock();
>         return count;
>  }

Looked fine otherwise:

Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Dmitry Torokhov Dec. 17, 2014, 5:28 p.m. UTC | #2
On Wednesday, December 17, 2014 10:06:17 AM Viresh Kumar wrote:
> On 17 December 2014 at 04:39, Dmitry Torokhov <dtor@chromium.org> wrote:
> > A lot of callers are missing the fact that dev_pm_opp_get_opp_count
> > needs to be called under RCU lock. Given that RCU locks can safely be
> > nested, instead of providing *_locked() API, let's take RCU lock inside
> 
> Hmm, I asked for a *_locked() API because many users of
> dev_pm_opp_get_opp_count() are already calling it from rcu read side
> critical sections.
> 
> Now, there are two questions:
> - Can rcu-read side critical sections be nested ?
> 
> Yes, this is what the comment over rcu_read_lock() says
> 
>  * RCU read-side critical sections may be nested.  Any deferred actions
>  * will be deferred until the outermost RCU read-side critical section
>  * completes.
> 
> - Would it be better to drop these double rcu_read_locks() ? i.e. either
> get a *_locked() API or fix the callers of dev_pm_opp_get_opp_count().
> 
> @Paul: What do you say ?
>

FWIW the change is a stop-gap; I hope we'll get away from using 
dev_pm_opp_get_opp_count() in cpufreq drivers and then we can revert the 
change. I just did not want to touch cpufreq drivers unless necessary.

Thanks,
Dmitry
Paul E. McKenney Dec. 17, 2014, 11:47 p.m. UTC | #3
On Wed, Dec 17, 2014 at 10:06:17AM +0530, Viresh Kumar wrote:
> On 17 December 2014 at 04:39, Dmitry Torokhov <dtor@chromium.org> wrote:
> > A lot of callers are missing the fact that dev_pm_opp_get_opp_count
> > needs to be called under RCU lock. Given that RCU locks can safely be
> > nested, instead of providing *_locked() API, let's take RCU lock inside
> 
> Hmm, I asked for a *_locked() API because many users of
> dev_pm_opp_get_opp_count() are already calling it from rcu read side
> critical sections.
> 
> Now, there are two questions:
> - Can rcu-read side critical sections be nested ?
> 
> Yes, this is what the comment over rcu_read_lock() says
> 
>  * RCU read-side critical sections may be nested.  Any deferred actions
>  * will be deferred until the outermost RCU read-side critical section
>  * completes.
> 
> - Would it be better to drop these double rcu_read_locks() ? i.e. either
> get a *_locked() API or fix the callers of dev_pm_opp_get_opp_count().
> 
> @Paul: What do you say ?

Yep, they can be nested.  Both rcu_read_lock() and rcu_read_unlock()
are quite fast, as are their friends, so there is almost no performance
penalty from nesting.  So the decision normally turns on maintainability
and style.
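Paul's point about nesting can be illustrated with a small userspace toy model (this is NOT the kernel's RCU implementation; a per-thread nesting counter stands in for the real read-side state, and a flag stands in for deferred reclamation):

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of RCU read-side nesting: a nesting counter, plus a
 * "deferred action" (think: freeing an old OPP entry) that may only
 * run once the counter drops back to 0, i.e. when the outermost
 * read-side critical section completes. */
static int rcu_nesting;
static bool deferred_pending;
static bool deferred_ran;

static void toy_rcu_read_lock(void)  { rcu_nesting++; }

static void toy_rcu_read_unlock(void)
{
	if (--rcu_nesting == 0 && deferred_pending) {
		deferred_pending = false;
		deferred_ran = true;	/* reclamation happens here */
	}
}

static void toy_call_rcu(void) { deferred_pending = true; }

/* A helper that takes the lock itself, like the patched
 * dev_pm_opp_get_opp_count(); calling it from inside an existing
 * critical section just bumps the counter. */
static int toy_get_count(void)
{
	toy_rcu_read_lock();	/* nested: safe */
	int count = 42;		/* stand-in for walking the OPP list */
	toy_rcu_read_unlock();	/* counter back to 1: nothing reclaimed */
	return count;
}
```

The outer critical section keeps deferred work pending across the nested lock/unlock pair, matching the comment quoted above.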

							Thanx, Paul

> > dev_pm_opp_get_opp_count() and leave callers as is.
> >
> > Signed-off-by: Dmitry Torokhov <dtor@chromium.org>
> > ---
> >  drivers/base/power/opp.c | 15 ++++++++-------
> >  1 file changed, 8 insertions(+), 7 deletions(-)
> >
> > diff --git a/drivers/base/power/opp.c b/drivers/base/power/opp.c
> > index 413c7fe..ee5eca2 100644
> > --- a/drivers/base/power/opp.c
> > +++ b/drivers/base/power/opp.c
> > @@ -216,9 +216,7 @@ EXPORT_SYMBOL_GPL(dev_pm_opp_get_freq);
> >   * This function returns the number of available opps if there are any,
> >   * else returns 0 if none or the corresponding error value.
> >   *
> > - * Locking: This function must be called under rcu_read_lock(). This function
> > - * internally references two RCU protected structures: device_opp and opp which
> > - * are safe as long as we are under a common RCU locked section.
> > + * Locking: This function takes rcu_read_lock().
> >   */
> >  int dev_pm_opp_get_opp_count(struct device *dev)
> >  {
> > @@ -226,13 +224,14 @@ int dev_pm_opp_get_opp_count(struct device *dev)
> >         struct dev_pm_opp *temp_opp;
> >         int count = 0;
> >
> > -       opp_rcu_lockdep_assert();
> > +       rcu_read_lock();
> >
> >         dev_opp = find_device_opp(dev);
> >         if (IS_ERR(dev_opp)) {
> > -               int r = PTR_ERR(dev_opp);
> > -               dev_err(dev, "%s: device OPP not found (%d)\n", __func__, r);
> > -               return r;
> > +               count = PTR_ERR(dev_opp);
> > +               dev_err(dev, "%s: device OPP not found (%d)\n",
> > +                       __func__, count);
> > +               goto out_unlock;
> >         }
> >
> >         list_for_each_entry_rcu(temp_opp, &dev_opp->opp_list, node) {
> > @@ -240,6 +239,8 @@ int dev_pm_opp_get_opp_count(struct device *dev)
> >                         count++;
> >         }
> >
> > +out_unlock:
> > +       rcu_read_unlock();
> >         return count;
> >  }
> 
> Looked fine otherwise:
> 
> Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
> 

Viresh Kumar Dec. 18, 2014, 2:11 a.m. UTC | #4
On 18 December 2014 at 05:17, Paul E. McKenney
<paulmck@linux.vnet.ibm.com> wrote:
> Yep, they can be nested.  Both rcu_read_lock() and rcu_read_unlock()
> are quite fast, as are their friends, so there is almost no performance
> penalty from nesting.  So the decision normally turns on maintainability
> and style.

Thanks again for your kind advice :)
Viresh Kumar Dec. 18, 2014, 2:11 a.m. UTC | #5
On 17 December 2014 at 22:58, Dmitry Torokhov <dtor@chromium.org> wrote:
> FWIW the change is a stop-gap; I hope we'll get away from using
> dev_pm_opp_get_opp_count() in cpufreq drivers and then we can revert the
> change. I just did not want to touch cpufreq drivers unless necessary.

Yeah, all good now. No more questions :)
Nishanth Menon Dec. 24, 2014, 4:48 p.m. UTC | #6
On 12/16/2014 05:09 PM, Dmitry Torokhov wrote:
> A lot of callers are missing the fact that dev_pm_opp_get_opp_count
> needs to be called under RCU lock. Given that RCU locks can safely be
> nested, instead of providing *_locked() API, let's take RCU lock inside
> dev_pm_opp_get_opp_count() and leave callers as is.

While it is true that we can safely do nested RCU locks, this also
encourages wrong usage.

count = dev_pm_opp_get_opp_count(dev)
^^ point A
array   = kzalloc(count * sizeof (*array));
rcu_read_lock();
^^ point B
.. work down the list and add OPPs..
...

Between A and B, we might have had a list modification (dynamic OPP
addition or deletion) - which implies that the count is no longer
accurate at point B. Instead, enforcing that callers have the
responsibility of rcu_lock is exactly what we have to do, since the OPP
library has no way to enforce pointer or data accuracy.

Sorry, NAK. This sets up the stage for further similar additions to
get_voltage, get_freq, etc., and basically encourages erroneous usage of
the functions and misinterpretation of the data procured.
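The stale-count hazard in the pseudocode above can be reproduced in a small single-threaded C sketch (names are illustrative, not the real OPP API; the "concurrent" removal is simulated by an explicit call between points A and B):

```c
#include <assert.h>
#include <stdlib.h>

/* Toy illustration of the count-vs-traversal race: the count read at
 * point A can disagree with what a later traversal at point B sees if
 * the list was modified in between. */
struct opp { int freq; struct opp *next; };
static struct opp *opp_list;

static int opp_count(void)
{
	int n = 0;
	for (struct opp *o = opp_list; o; o = o->next)
		n++;
	return n;
}

static void opp_add(int freq)
{
	struct opp *o = malloc(sizeof(*o));
	o->freq = freq;
	o->next = opp_list;
	opp_list = o;
}

static void opp_remove_head(void)	/* "dynamic OPP deletion" */
{
	struct opp *o = opp_list;
	opp_list = o->next;
	free(o);
}

/* Fill 'array' (sized from an earlier count) from the current list;
 * returns how many entries were actually written. */
static int opp_fill(int *array, int capacity)
{
	int n = 0;
	for (struct opp *o = opp_list; o && n < capacity; o = o->next)
		array[n++] = o->freq;
	return n;
}
```

If the list shrinks between the count and the fill, the array ends up partially filled (and if it grew instead, entries would be silently dropped), which is exactly the misinterpretation being objected to.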

> 
> Signed-off-by: Dmitry Torokhov <dtor@chromium.org>
> ---
>  drivers/base/power/opp.c | 15 ++++++++-------
>  1 file changed, 8 insertions(+), 7 deletions(-)
> 
> diff --git a/drivers/base/power/opp.c b/drivers/base/power/opp.c
> index 413c7fe..ee5eca2 100644
> --- a/drivers/base/power/opp.c
> +++ b/drivers/base/power/opp.c
> @@ -216,9 +216,7 @@ EXPORT_SYMBOL_GPL(dev_pm_opp_get_freq);
>   * This function returns the number of available opps if there are any,
>   * else returns 0 if none or the corresponding error value.
>   *
> - * Locking: This function must be called under rcu_read_lock(). This function
> - * internally references two RCU protected structures: device_opp and opp which
> - * are safe as long as we are under a common RCU locked section.
> + * Locking: This function takes rcu_read_lock().
>   */
>  int dev_pm_opp_get_opp_count(struct device *dev)
>  {
> @@ -226,13 +224,14 @@ int dev_pm_opp_get_opp_count(struct device *dev)
>  	struct dev_pm_opp *temp_opp;
>  	int count = 0;
>  
> -	opp_rcu_lockdep_assert();
> +	rcu_read_lock();
>  
>  	dev_opp = find_device_opp(dev);
>  	if (IS_ERR(dev_opp)) {
> -		int r = PTR_ERR(dev_opp);
> -		dev_err(dev, "%s: device OPP not found (%d)\n", __func__, r);
> -		return r;
> +		count = PTR_ERR(dev_opp);
> +		dev_err(dev, "%s: device OPP not found (%d)\n",
> +			__func__, count);
> +		goto out_unlock;
>  	}
>  
>  	list_for_each_entry_rcu(temp_opp, &dev_opp->opp_list, node) {
> @@ -240,6 +239,8 @@ int dev_pm_opp_get_opp_count(struct device *dev)
>  			count++;
>  	}
>  
> +out_unlock:
> +	rcu_read_unlock();
>  	return count;
>  }
>  EXPORT_SYMBOL_GPL(dev_pm_opp_get_opp_count);
>
Dmitry Torokhov Dec. 24, 2014, 5:09 p.m. UTC | #7
On Wed, Dec 24, 2014 at 8:48 AM, Nishanth Menon <nm@ti.com> wrote:
> On 12/16/2014 05:09 PM, Dmitry Torokhov wrote:
>> A lot of callers are missing the fact that dev_pm_opp_get_opp_count
>> needs to be called under RCU lock. Given that RCU locks can safely be
>> nested, instead of providing *_locked() API, let's take RCU lock inside
>> dev_pm_opp_get_opp_count() and leave callers as is.
>
> While it is true that we can safely do nested RCU locks, This also
> encourages wrong usage.
>
> count = dev_pm_opp_get_opp_count(dev)
> ^^ point A
> array   = kzalloc(count * sizeof (*array));
> rcu_read_lock();
> ^^ point B
> .. work down the list and add OPPs..
> ...
>
> Between A and B, we might have had list modification (dynamic OPP
> addition or deletion) - which implies that the count is no longer
> accurate between point A and B. instead, enforcing callers to have the
> responsibility of rcu_lock is exactly what we have to do since the OPP
> library has no clue how to enforce pointer or data accuracy.

No, you seem to have a misconception that rcu_lock protects you past
the point B, but that is also wrong. The only thing rcu "lock"
provides is safe traversing the list and guarantee that elements will
not disappear while you are referencing them, but list can both
contract and expand under you. In that regard code in
drivers/cpufreq/cpufreq_opp.c is utterly wrong. If you want to count
the list and use number of elements you should be taking a mutex.
Luckily all cpufreq drivers at the moment only want to see if OPP
table is empty or not, so as a stop-gap we can take rcu_lock
automatically as we are getting the count. We won't necessarily get an
accurate result, but at least we will be safe traversing the list.
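The mutex-based pattern Dmitry alludes to (count and use the elements under one lock, so both see a single consistent snapshot) might look roughly like this; all names are illustrative, not the real OPP API:

```c
#include <assert.h>
#include <pthread.h>

/* Sketch: if the number of elements is actually used (e.g. to size an
 * array), take a mutex so the count and the copy happen against one
 * consistent view of the table. */
static pthread_mutex_t opp_table_lock = PTHREAD_MUTEX_INITIALIZER;
static int opp_entries[8];
static int opp_nr;

static int snapshot_opps(int *dst, int capacity)
{
	pthread_mutex_lock(&opp_table_lock);
	int n = opp_nr < capacity ? opp_nr : capacity;
	for (int i = 0; i < n; i++)
		dst[i] = opp_entries[i];	/* count and copy under one lock */
	pthread_mutex_unlock(&opp_table_lock);
	return n;
}
```

Unlike rcu_read_lock(), the mutex excludes writers, so a concurrent add or remove cannot slip in between counting and copying.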

Thanks.
Nishanth Menon Dec. 24, 2014, 5:16 p.m. UTC | #8
On 12/24/2014 11:09 AM, Dmitry Torokhov wrote:
> On Wed, Dec 24, 2014 at 8:48 AM, Nishanth Menon <nm@ti.com> wrote:
>> On 12/16/2014 05:09 PM, Dmitry Torokhov wrote:
>>> A lot of callers are missing the fact that dev_pm_opp_get_opp_count
>>> needs to be called under RCU lock. Given that RCU locks can safely be
>>> nested, instead of providing *_locked() API, let's take RCU lock inside
>>> dev_pm_opp_get_opp_count() and leave callers as is.
>>
>> While it is true that we can safely do nested RCU locks, This also
>> encourages wrong usage.
>>
>> count = dev_pm_opp_get_opp_count(dev)
>> ^^ point A
>> array   = kzalloc(count * sizeof (*array));
>> rcu_read_lock();
>> ^^ point B
>> .. work down the list and add OPPs..
>> ...
>>
>> Between A and B, we might have had list modification (dynamic OPP
>> addition or deletion) - which implies that the count is no longer
>> accurate between point A and B. instead, enforcing callers to have the
>> responsibility of rcu_lock is exactly what we have to do since the OPP
>> library has no clue how to enforce pointer or data accuracy.
> 
> No, you seem to have a misconception that rcu_lock protects you past
> the point B, but that is also wrong. The only thing rcu "lock"
> provides is safe traversing the list and guarantee that elements will
> not disappear while you are referencing them, but list can both
> contract and expand under you. In that regard code in
> drivers/cpufreq/cpufreq_opp.c is utterly wrong. If you want to count
> the list and use number of elements you should be taking a mutex.
> Luckily all cpufreq drivers at the moment only want to see if OPP
> table is empty or not, so as a stop-gap we can take rcu_lock
> automatically as we are getting count. We won't get necessarily
> accurate result, but at least we will be safe traversing the list.

So, instead of a half solution, let's consider this in the realm of
dynamic OPPs as well. Agreed on the point that we only have safe
traversal and pointer validity. The real problem, however, is with
"dynamic OPPs" (one of the original reasons why I did not add dynamic
OPPs in the original version was to escape from their complexity for
users - anyway, we are beyond that now). If OPPs can be removed on
the fly, we need the following:
a) use OPP notifiers to adequately handle list modification
b) lock down list modification (and the associated APIs) to ensure that
the original cpufreq/devfreq list is correct.

I still don't see the need for this half solution.
Dmitry Torokhov Dec. 24, 2014, 5:31 p.m. UTC | #9
On Wed, Dec 24, 2014 at 9:16 AM, Nishanth Menon <nm@ti.com> wrote:
> On 12/24/2014 11:09 AM, Dmitry Torokhov wrote:
>> On Wed, Dec 24, 2014 at 8:48 AM, Nishanth Menon <nm@ti.com> wrote:
>>> On 12/16/2014 05:09 PM, Dmitry Torokhov wrote:
>>>> A lot of callers are missing the fact that dev_pm_opp_get_opp_count
>>>> needs to be called under RCU lock. Given that RCU locks can safely be
>>>> nested, instead of providing *_locked() API, let's take RCU lock inside
>>>> dev_pm_opp_get_opp_count() and leave callers as is.
>>>
>>> While it is true that we can safely do nested RCU locks, This also
>>> encourages wrong usage.
>>>
>>> count = dev_pm_opp_get_opp_count(dev)
>>> ^^ point A
>>> array   = kzalloc(count * sizeof (*array));
>>> rcu_read_lock();
>>> ^^ point B
>>> .. work down the list and add OPPs..
>>> ...
>>>
>>> Between A and B, we might have had list modification (dynamic OPP
>>> addition or deletion) - which implies that the count is no longer
>>> accurate between point A and B. instead, enforcing callers to have the
>>> responsibility of rcu_lock is exactly what we have to do since the OPP
>>> library has no clue how to enforce pointer or data accuracy.
>>
>> No, you seem to have a misconception that rcu_lock protects you past
>> the point B, but that is also wrong. The only thing rcu "lock"
>> provides is safe traversing the list and guarantee that elements will
>> not disappear while you are referencing them, but list can both
>> contract and expand under you. In that regard code in
>> drivers/cpufreq/cpufreq_opp.c is utterly wrong. If you want to count
>> the list and use number of elements you should be taking a mutex.
>> Luckily all cpufreq drivers at the moment only want to see if OPP
>> table is empty or not, so as a stop-gap we can take rcu_lock
>> automatically as we are getting count. We won't get necessarily
>> accurate result, but at least we will be safe traversing the list.
>
> So, instead of a half solution, lets consider this in the realm of
> dynamic OPPs as well. agreed to the point that we only have safe
> traversal and pointer validity. the real problem however is with
> "dynamic OPPs" (one of the original reasons why i did not add dynamic
> OPPs in the original version was to escape from it's complexity for
> users - anyways.. we are beyond that now). if OPPs can be removed on
> the fly, we need the following:
> a) use OPP notifiers to adequately handle list modification
> b) lock down list modification (and associated APIs) to ensure that
> the original cpufreq /devfreq list is correct.
>
> I still dont see the need to do this half solution.

The need for a half solution at the moment is that you can't safely
traverse the lists and may crash on an invalid pointer.

Going forward I think (I mentioned that in my other email) that we
should rework the OPP API so that callers fetch the OPP table object for
a device at init/probe time and then use it to get OPPs. This way we
won't have to traverse two lists every time we want to reference an OPP.

And instead of relying on notifiers, maybe look into using OPP tables
directly in cpufreq drivers instead of converting OPPs into static-ish
cpufreq tables.
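A minimal sketch of the proposed rework, as a userspace toy model (every name here is hypothetical; this is the shape of the idea, not a kernel API): the caller looks the table up once at probe time, then queries the handle directly instead of re-walking two RCU-protected lists per call.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model: a table handle fetched once, queried many times. */
struct toy_opp_table {
	int nr_opps;
	const int *freqs;	/* kHz */
};

static const int toy_freqs[] = { 396000, 792000, 996000 };
static struct toy_opp_table toy_table = { 3, toy_freqs };

/* probe time: find the device's table once */
static struct toy_opp_table *toy_opp_table_get(const char *dev)
{
	(void)dev;		/* single device in this toy model */
	return &toy_table;
}

/* hot path: no list walk, just dereference the handle */
static int toy_opp_table_count(const struct toy_opp_table *t)
{
	return t->nr_opps;
}
```

In the real kernel such a handle would still need a lifetime rule (get/put or RCU on the table itself), but per-call lookups of device_opp would disappear.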

Thanks.
Nishanth Menon Dec. 24, 2014, 5:37 p.m. UTC | #10
On 12/24/2014 11:31 AM, Dmitry Torokhov wrote:
> On Wed, Dec 24, 2014 at 9:16 AM, Nishanth Menon <nm@ti.com> wrote:
>> On 12/24/2014 11:09 AM, Dmitry Torokhov wrote:
>>> On Wed, Dec 24, 2014 at 8:48 AM, Nishanth Menon <nm@ti.com> wrote:
>>>> On 12/16/2014 05:09 PM, Dmitry Torokhov wrote:
>>>>> A lot of callers are missing the fact that dev_pm_opp_get_opp_count
>>>>> needs to be called under RCU lock. Given that RCU locks can safely be
>>>>> nested, instead of providing *_locked() API, let's take RCU lock inside
>>>>> dev_pm_opp_get_opp_count() and leave callers as is.
>>>>
>>>> While it is true that we can safely do nested RCU locks, This also
>>>> encourages wrong usage.
>>>>
>>>> count = dev_pm_opp_get_opp_count(dev)
>>>> ^^ point A
>>>> array   = kzalloc(count * sizeof (*array));
>>>> rcu_read_lock();
>>>> ^^ point B
>>>> .. work down the list and add OPPs..
>>>> ...
>>>>
>>>> Between A and B, we might have had list modification (dynamic OPP
>>>> addition or deletion) - which implies that the count is no longer
>>>> accurate between point A and B. instead, enforcing callers to have the
>>>> responsibility of rcu_lock is exactly what we have to do since the OPP
>>>> library has no clue how to enforce pointer or data accuracy.
>>>
>>> No, you seem to have a misconception that rcu_lock protects you past
>>> the point B, but that is also wrong. The only thing rcu "lock"
>>> provides is safe traversing the list and guarantee that elements will
>>> not disappear while you are referencing them, but list can both
>>> contract and expand under you. In that regard code in
>>> drivers/cpufreq/cpufreq_opp.c is utterly wrong. If you want to count
>>> the list and use number of elements you should be taking a mutex.
>>> Luckily all cpufreq drivers at the moment only want to see if OPP
>>> table is empty or not, so as a stop-gap we can take rcu_lock
>>> automatically as we are getting count. We won't get necessarily
>>> accurate result, but at least we will be safe traversing the list.
>>
>> So, instead of a half solution, lets consider this in the realm of
>> dynamic OPPs as well. agreed to the point that we only have safe
>> traversal and pointer validity. the real problem however is with
>> "dynamic OPPs" (one of the original reasons why i did not add dynamic
>> OPPs in the original version was to escape from it's complexity for
>> users - anyways.. we are beyond that now). if OPPs can be removed on
>> the fly, we need the following:
>> a) use OPP notifiers to adequately handle list modification
>> b) lock down list modification (and associated APIs) to ensure that
>> the original cpufreq /devfreq list is correct.
>>
>> I still dont see the need to do this half solution.
> 
> The need for half solution at the moment is that you can't safely
> travel the lists and may crash on an invalid pointer.

So, fix cpufreq-dt instead of moving the hack inside the OPP driver.

> 
> Going forward I think (I mentioned that in my other email) that we
> should rework the OPP API so that callers fetch OPP table object for a
> device at init/probe time and then use it to get OPPs. This way won't
> have to travel two lists any time we want to reference an OPP.
> 
> And instead of relying notifiers, maybe look into using OPP tables
> directly in cpufreq drivers instead of converting OPP into static-ish
> cpufreq tables.
> 

If you'd like a proper fix for OPP usage, I am open to seeing such a
proposal that works not just for cpufreq but for devfreq as well.
Dmitry Torokhov Dec. 24, 2014, 5:44 p.m. UTC | #11
On Wed, Dec 24, 2014 at 9:37 AM, Nishanth Menon <nm@ti.com> wrote:
> On 12/24/2014 11:31 AM, Dmitry Torokhov wrote:
>> On Wed, Dec 24, 2014 at 9:16 AM, Nishanth Menon <nm@ti.com> wrote:
>>> On 12/24/2014 11:09 AM, Dmitry Torokhov wrote:
>>>> On Wed, Dec 24, 2014 at 8:48 AM, Nishanth Menon <nm@ti.com> wrote:
>>>>> On 12/16/2014 05:09 PM, Dmitry Torokhov wrote:
>>>>>> A lot of callers are missing the fact that dev_pm_opp_get_opp_count
>>>>>> needs to be called under RCU lock. Given that RCU locks can safely be
>>>>>> nested, instead of providing *_locked() API, let's take RCU lock inside
>>>>>> dev_pm_opp_get_opp_count() and leave callers as is.
>>>>>
>>>>> While it is true that we can safely do nested RCU locks, This also
>>>>> encourages wrong usage.
>>>>>
>>>>> count = dev_pm_opp_get_opp_count(dev)
>>>>> ^^ point A
>>>>> array   = kzalloc(count * sizeof (*array));
>>>>> rcu_read_lock();
>>>>> ^^ point B
>>>>> .. work down the list and add OPPs..
>>>>> ...
>>>>>
>>>>> Between A and B, we might have had list modification (dynamic OPP
>>>>> addition or deletion) - which implies that the count is no longer
>>>>> accurate between point A and B. instead, enforcing callers to have the
>>>>> responsibility of rcu_lock is exactly what we have to do since the OPP
>>>>> library has no clue how to enforce pointer or data accuracy.
>>>>
>>>> No, you seem to have a misconception that rcu_lock protects you past
>>>> the point B, but that is also wrong. The only thing rcu "lock"
>>>> provides is safe traversing the list and guarantee that elements will
>>>> not disappear while you are referencing them, but list can both
>>>> contract and expand under you. In that regard code in
>>>> drivers/cpufreq/cpufreq_opp.c is utterly wrong. If you want to count
>>>> the list and use number of elements you should be taking a mutex.
>>>> Luckily all cpufreq drivers at the moment only want to see if OPP
>>>> table is empty or not, so as a stop-gap we can take rcu_lock
>>>> automatically as we are getting count. We won't get necessarily
>>>> accurate result, but at least we will be safe traversing the list.
>>>
>>> So, instead of a half solution, lets consider this in the realm of
>>> dynamic OPPs as well. agreed to the point that we only have safe
>>> traversal and pointer validity. the real problem however is with
>>> "dynamic OPPs" (one of the original reasons why i did not add dynamic
>>> OPPs in the original version was to escape from it's complexity for
>>> users - anyways.. we are beyond that now). if OPPs can be removed on
>>> the fly, we need the following:
>>> a) use OPP notifiers to adequately handle list modification
>>> b) lock down list modification (and associated APIs) to ensure that
>>> the original cpufreq /devfreq list is correct.
>>>
>>> I still dont see the need to do this half solution.
>>
>> The need for half solution at the moment is that you can't safely
>> travel the lists and may crash on an invalid pointer.
>
> So, fix the cpufreq-dt instead of moving the hack inside OPP driver.

I started there, but it is not only cpufreq-dt that got it wrong. I
considered changing individual drivers (Viresh also suggested adding a
_locked() variant API), but decided patching OPP was less invasive for
now.

>
>>
>> Going forward I think (I mentioned that in my other email) that we
>> should rework the OPP API so that callers fetch OPP table object for a
>> device at init/probe time and then use it to get OPPs. This way won't
>> have to travel two lists any time we want to reference an OPP.
>>
>> And instead of relying notifiers, maybe look into using OPP tables
>> directly in cpufreq drivers instead of converting OPP into static-ish
>> cpufreq tables.
>>
>
> If you'd like a proper fix for OPP usage, I am all open to see such a
> proposal that works not just for cpufreq, but also for devfreq as well.

Yeah, let's see what kind of time I have ;)

Thanks.
Nishanth Menon Dec. 24, 2014, 8:37 p.m. UTC | #12
On 12/24/2014 11:44 AM, Dmitry Torokhov wrote:
> On Wed, Dec 24, 2014 at 9:37 AM, Nishanth Menon <nm@ti.com> wrote:
>> On 12/24/2014 11:31 AM, Dmitry Torokhov wrote:
>>> On Wed, Dec 24, 2014 at 9:16 AM, Nishanth Menon <nm@ti.com> wrote:
>>>> On 12/24/2014 11:09 AM, Dmitry Torokhov wrote:
>>>>> On Wed, Dec 24, 2014 at 8:48 AM, Nishanth Menon <nm@ti.com> wrote:
>>>>>> On 12/16/2014 05:09 PM, Dmitry Torokhov wrote:
>>>>>>> A lot of callers are missing the fact that dev_pm_opp_get_opp_count
>>>>>>> needs to be called under RCU lock. Given that RCU locks can safely be
>>>>>>> nested, instead of providing *_locked() API, let's take RCU lock inside
>>>>>>> dev_pm_opp_get_opp_count() and leave callers as is.
>>>>>>
>>>>>> While it is true that we can safely do nested RCU locks, This also
>>>>>> encourages wrong usage.
>>>>>>
>>>>>> count = dev_pm_opp_get_opp_count(dev)
>>>>>> ^^ point A
>>>>>> array   = kzalloc(count * sizeof (*array));
>>>>>> rcu_read_lock();
>>>>>> ^^ point B
>>>>>> .. work down the list and add OPPs..
>>>>>> ...
>>>>>>
>>>>>> Between A and B, we might have had list modification (dynamic OPP
>>>>>> addition or deletion) - which implies that the count is no longer
>>>>>> accurate between point A and B. instead, enforcing callers to have the
>>>>>> responsibility of rcu_lock is exactly what we have to do since the OPP
>>>>>> library has no clue how to enforce pointer or data accuracy.
>>>>>
>>>>> No, you seem to have a misconception that rcu_lock protects you past
>>>>> the point B, but that is also wrong. The only thing rcu "lock"
>>>>> provides is safe traversing the list and guarantee that elements will
>>>>> not disappear while you are referencing them, but list can both
>>>>> contract and expand under you. In that regard code in
>>>>> drivers/cpufreq/cpufreq_opp.c is utterly wrong. If you want to count
>>>>> the list and use number of elements you should be taking a mutex.
>>>>> Luckily all cpufreq drivers at the moment only want to see if OPP
>>>>> table is empty or not, so as a stop-gap we can take rcu_lock
>>>>> automatically as we are getting count. We won't get necessarily
>>>>> accurate result, but at least we will be safe traversing the list.
>>>>
>>>> So, instead of a half solution, lets consider this in the realm of
>>>> dynamic OPPs as well. agreed to the point that we only have safe
>>>> traversal and pointer validity. the real problem however is with
>>>> "dynamic OPPs" (one of the original reasons why i did not add dynamic
>>>> OPPs in the original version was to escape from it's complexity for
>>>> users - anyways.. we are beyond that now). if OPPs can be removed on
>>>> the fly, we need the following:
>>>> a) use OPP notifiers to adequately handle list modification
>>>> b) lock down list modification (and associated APIs) to ensure that
>>>> the original cpufreq /devfreq list is correct.
>>>>
>>>> I still dont see the need to do this half solution.
>>>
>>> The need for half solution at the moment is that you can't safely
>>> travel the lists and may crash on an invalid pointer.
>>
>> So, fix the cpufreq-dt instead of moving the hack inside OPP driver.
> 
> I started there, but it is not only cpufreq-dt that got it wrong. I
> considered changing individual drivers (Viresh also suggested adding
> _locked() variant API), but decided patching opp was less invasive for
> now.

True. I had done an audit and cleanup a couple or so years back, I
think, and things of course tend to go down the bitrot path without
constant checkups :(

>>> Going forward I think (I mentioned that in my other email) that we
>>> should rework the OPP API so that callers fetch OPP table object for a
>>> device at init/probe time and then use it to get OPPs. This way won't
>>> have to travel two lists any time we want to reference an OPP.
>>>
>>> And instead of relying notifiers, maybe look into using OPP tables
>>> directly in cpufreq drivers instead of converting OPP into static-ish
>>> cpufreq tables.
>>>
>>
>> If you'd like a proper fix for OPP usage, I am all open to see such a
>> proposal that works not just for cpufreq, but also for devfreq as well.
> 
> Yeah, let's see what kind of time I have ;)

That would be nice. Thank you.

If you could split off the remaining patches from the series, esp.
patches #1 and #2, rebase them on 3.19-rc1 and repost, it would be
nice to get them merged, as they do look like obvious improvements
that can go in without depending on the remainder of the series,
which we can work on meanwhile.
Rafael J. Wysocki Dec. 27, 2014, 8:34 p.m. UTC | #13
On Wednesday, December 24, 2014 02:37:14 PM Nishanth Menon wrote:
> On 12/24/2014 11:44 AM, Dmitry Torokhov wrote:
> > On Wed, Dec 24, 2014 at 9:37 AM, Nishanth Menon <nm@ti.com> wrote:
> >> On 12/24/2014 11:31 AM, Dmitry Torokhov wrote:
> >>> On Wed, Dec 24, 2014 at 9:16 AM, Nishanth Menon <nm@ti.com> wrote:
> >>>> On 12/24/2014 11:09 AM, Dmitry Torokhov wrote:
> >>>>> On Wed, Dec 24, 2014 at 8:48 AM, Nishanth Menon <nm@ti.com> wrote:
> >>>>>> On 12/16/2014 05:09 PM, Dmitry Torokhov wrote:
> >>>>>>> A lot of callers are missing the fact that dev_pm_opp_get_opp_count
> >>>>>>> needs to be called under RCU lock. Given that RCU locks can safely be
> >>>>>>> nested, instead of providing *_locked() API, let's take RCU lock inside
> >>>>>>> dev_pm_opp_get_opp_count() and leave callers as is.
> >>>>>>
> >>>>>> While it is true that we can safely do nested RCU locks, this also
> >>>>>> encourages wrong usage.
> >>>>>>
> >>>>>> count = dev_pm_opp_get_opp_count(dev)
> >>>>>> ^^ point A
> >>>>>> array   = kzalloc(count * sizeof(*array), GFP_KERNEL);
> >>>>>> rcu_read_lock();
> >>>>>> ^^ point B
> >>>>>> .. work down the list and add OPPs..
> >>>>>> ...
> >>>>>>
> >>>>>> Between A and B, we might have had list modification (dynamic OPP
> >>>>>> addition or deletion) - which implies that the count is no longer
> >>>>>> accurate between point A and B. Instead, making callers responsible
> >>>>>> for the rcu_lock is exactly what we have to do, since the OPP
> >>>>>> library has no clue how to enforce pointer or data accuracy.
> >>>>>
> >>>>> No, you seem to have a misconception that rcu_lock protects you past
> >>>>> point B, but that is also wrong. The only thing the rcu "lock"
> >>>>> provides is safe traversal of the list and a guarantee that elements
> >>>>> will not disappear while you are referencing them, but the list can
> >>>>> both contract and expand under you. In that regard the code in
> >>>>> drivers/cpufreq/cpufreq_opp.c is utterly wrong. If you want to count
> >>>>> the list and use the number of elements you should be taking a mutex.
> >>>>> Luckily all cpufreq drivers at the moment only want to see if the OPP
> >>>>> table is empty or not, so as a stop-gap we can take the rcu_lock
> >>>>> automatically as we get the count. We won't necessarily get an
> >>>>> accurate result, but at least we will be safe traversing the list.
> >>>>
> >>>> So, instead of a half solution, let's consider this in the realm of
> >>>> dynamic OPPs as well. Agreed to the point that we only have safe
> >>>> traversal and pointer validity. The real problem, however, is with
> >>>> "dynamic OPPs" (one of the original reasons why I did not add dynamic
> >>>> OPPs in the original version was to escape from its complexity for
> >>>> users - anyways.. we are beyond that now). If OPPs can be removed on
> >>>> the fly, we need the following:
> >>>> a) use OPP notifiers to adequately handle list modification
> >>>> b) lock down list modification (and associated APIs) to ensure that
> >>>> the original cpufreq/devfreq list is correct.
> >>>>
> >>>> I still don't see the need to do this half solution.
> >>>
> >>> The need for a half solution at the moment is that you can't safely
> >>> traverse the lists and may crash on an invalid pointer.
> >>
> >> So, fix the cpufreq-dt instead of moving the hack inside OPP driver.
> > 
> > I started there, but it is not only cpufreq-dt that got it wrong. I
> > considered changing individual drivers (Viresh also suggested adding
> > _locked() variant API), but decided patching opp was less invasive for
> > now.
> 
> True. I had done an audit and cleanup, I think a couple or so years
> back, and things of course tend to go down the bitrot path without
> constant checkups :(
> 
> >>> Going forward I think (I mentioned that in my other email) that we
> >>> should rework the OPP API so that callers fetch the OPP table object
> >>> for a device at init/probe time and then use it to get OPPs. This way
> >>> we won't have to traverse two lists every time we want to reference
> >>> an OPP.
> >>>
> >>> And instead of relying on notifiers, maybe look into using OPP tables
> >>> directly in cpufreq drivers instead of converting OPPs into static-ish
> >>> cpufreq tables.
> >>>
> >>
> >> If you'd like a proper fix for OPP usage, I am all open to see such a
> >> proposal that works not just for cpufreq, but also for devfreq as well.
> > 
> > Yeah, let's see what kind of time I have ;)
> 
> That would be nice. Thank you.
> 
> If you could split off the remaining patches from the series, esp.
> patches #1 and #2, rebase them on 3.19-rc1 and repost, it would be
> nice to get them merged, as they do look like obvious improvements
> that can go in without depending on the remainder of the series,
> which we can work on meanwhile.

I actually have Dmitry's entire series queued up for the next push
(most likely on Monday) and at this point I'm not seeing a compelling
reason for not pushing it.

Patch

diff --git a/drivers/base/power/opp.c b/drivers/base/power/opp.c
index 413c7fe..ee5eca2 100644
--- a/drivers/base/power/opp.c
+++ b/drivers/base/power/opp.c
@@ -216,9 +216,7 @@  EXPORT_SYMBOL_GPL(dev_pm_opp_get_freq);
  * This function returns the number of available opps if there are any,
  * else returns 0 if none or the corresponding error value.
  *
- * Locking: This function must be called under rcu_read_lock(). This function
- * internally references two RCU protected structures: device_opp and opp which
- * are safe as long as we are under a common RCU locked section.
+ * Locking: This function takes rcu_read_lock().
  */
 int dev_pm_opp_get_opp_count(struct device *dev)
 {
@@ -226,13 +224,14 @@  int dev_pm_opp_get_opp_count(struct device *dev)
 	struct dev_pm_opp *temp_opp;
 	int count = 0;
 
-	opp_rcu_lockdep_assert();
+	rcu_read_lock();
 
 	dev_opp = find_device_opp(dev);
 	if (IS_ERR(dev_opp)) {
-		int r = PTR_ERR(dev_opp);
-		dev_err(dev, "%s: device OPP not found (%d)\n", __func__, r);
-		return r;
+		count = PTR_ERR(dev_opp);
+		dev_err(dev, "%s: device OPP not found (%d)\n",
+			__func__, count);
+		goto out_unlock;
 	}
 
 	list_for_each_entry_rcu(temp_opp, &dev_opp->opp_list, node) {
@@ -240,6 +239,8 @@  int dev_pm_opp_get_opp_count(struct device *dev)
 			count++;
 	}
 
+out_unlock:
+	rcu_read_unlock();
 	return count;
 }
 EXPORT_SYMBOL_GPL(dev_pm_opp_get_opp_count);