diff mbox

PM: Document rules on using pm_runtime_resume() in system suspend callbacks

Message ID 1849600.vI30ccvhCD@aspire.rjw.lan (mailing list archive)
State Mainlined
Delegated to: Rafael Wysocki
Headers show

Commit Message

Rafael J. Wysocki Sept. 20, 2017, 12:26 a.m. UTC
From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

It quite often is necessary to resume devices from runtime suspend
during system suspend for various reasons (for example, if their
wakeup settings need to be changed), but that requires middle-layer
or subsystem code to follow additional rules which currently are not
clearly documented.

Namely, if a driver calls pm_runtime_resume() for the device from
its ->suspend (or equivalent) system sleep callback, that may not
work if the middle layer above it has updated the state of the
device from its ->prepare or ->suspend callbacks already in an
incompatible way.  For this reason, all middle layers must follow
the rule that, until the ->suspend callback provided by the device's
driver is invoked, the only way in which the device's state can be
updated is by calling pm_runtime_resume() for it, if necessary.
Fortunately enough, all middle layers in the code base today follow
this rule, but it is not explicitly stated anywhere, so do that.

Note that calling pm_runtime_resume() from the ->suspend callback
of a driver will cause the ->runtime_resume callback provided by the
middle layer to be invoked, but the rule above guarantees that this
callback will nest properly with the middle layer's ->suspend
callback and it will play well with the ->prepare one invoked before.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
---

This is a follow-up for the system suspend callbacks discussion during
the Power Management and Energy-Awareness session at the LPC last week.

In particular, I have been thinking about using pm_runtime_resume() from
driver ->suspend callbacks and it actually appears to be quite defendable to
me as long as it is guaranteed that middle layers will not mess up with the
device state before the driver's ->suspend callback is invoked.  If that
is the case, and I *think* that it currently is the case for all of the
middle layers in the tree unless I overlook something (USB anyone?), the
middle layer callbacks involved (->prepare, ->suspend and ->runtime_resume)
should actually nest properly and there should not be problems with that.
So, my proposal here would be to simply go ahead and document the rules
regarding this case without modifying the code.

At the same time, I see at least two general problems with calling
pm_runtime_force_suspend() from the ->suspend callbacks of device drivers
(unless the middle layers involved are trivial).

First, note that the middle layer callbacks involved in that case are
->prepare, ->suspend, ->runtime_suspend (called indirectly from within the
former) and then *also* ->suspend_late and ->suspend_noirq, because the PM
core will call the last two from the middle layer as it has no information
that pm_runtime_force_suspend() was called for the device in the "suspend"
phase.  Of course, in general, what the middle layer ->suspend_late and
->suspend_noirq do is not guaranteed to play well with what its
->runtime_suspend does even if ->suspend itself is OK (but for ->runtime_resume
all of that actually works, because the state it leaves the device in should
be compatible with the system suspend callbacks invoked in the later phases).

Second, leaving devices in runtime suspend in the "suspend" phase of system
suspend is fishy even when their runtime PM is disabled, because that doesn't
guarantee anything regarding their children or possible consumers.  Runtime
PM may still be enabled for those devices at that time and runtime resume may
be triggered for them later, in which case it all quickly falls apart.

IOW, there are reasons why the PM core bumps up the runtime PM usage counters
of all devices during system suspend and they also apply to runtime suspend
callbacks being invoked directly with runtime PM disabled for the given device.
Frankly, it generally is only safe to leave a device in runtime suspend during
system suspend if it can be guarateed that the system suspend callbacks in the
subsequent suspend phases will not be invoked for it at all.

Thanks,
Rafael

---
 Documentation/driver-api/pm/devices.rst |   25 ++++++++++++++++++++++++-
 1 file changed, 24 insertions(+), 1 deletion(-)

Comments

Ulf Hansson Sept. 20, 2017, 12:28 p.m. UTC | #1
On 20 September 2017 at 02:26, Rafael J. Wysocki <rjw@rjwysocki.net> wrote:
> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>
> It quite often is necessary to resume devices from runtime suspend
> during system suspend for various reasons (for example, if their
> wakeup settings need to be changed), but that requires middle-layer
> or subsystem code to follow additional rules which currently are not
> clearly documented.
>
> Namely, if a driver calls pm_runtime_resume() for the device from
> its ->suspend (or equivalent) system sleep callback, that may not
> work if the middle layer above it has updated the state of the
> device from its ->prepare or ->suspend callbacks already in an
> incompatible way.  For this reason, all middle layers must follow
> the rule that, until the ->suspend callback provided by the device's
> driver is invoked, the only way in which the device's state can be
> updated is by calling pm_runtime_resume() for it, if necessary.
> Fortunately enough, all middle layers in the code base today follow
> this rule, but it is not explicitly stated anywhere, so do that.
>
> Note that calling pm_runtime_resume() from the ->suspend callback
> of a driver will cause the ->runtime_resume callback provided by the
> middle layer to be invoked, but the rule above guarantees that this
> callback will nest properly with the middle layer's ->suspend
> callback and it will play well with the ->prepare one invoked before.
>
> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

Reviewed-by: Ulf Hansson <ulf.hansson@linaro.org>

Still, some comments below..

> ---
>
> This is a follow-up for the system suspend callbacks discussion during
> the Power Management and Energy-Awareness session at the LPC last week.
>
> In particular, I have been thinking about using pm_runtime_resume() from
> driver ->suspend callbacks and it actually appears to be quite defendable to
> me as long as it is guaranteed that middle layers will not mess up with the
> device state before the driver's ->suspend callback is invoked.  If that
> is the case, and I *think* that it currently is the case for all of the
> middle layers in the tree unless I overlook something (USB anyone?), the
> middle layer callbacks involved (->prepare, ->suspend and ->runtime_resume)
> should actually nest properly and there should not be problems with that.
> So, my proposal here would be to simply go ahead and document the rules
> regarding this case without modifying the code.
>
> At the same time, I see at least two general problems with calling
> pm_runtime_force_suspend() from the ->suspend callbacks of device drivers
> (unless the middle layers involved are trivial).
>
> First, note that the middle layer callbacks involved in that case are
> ->prepare, ->suspend, ->runtime_suspend (called indirectly from within the
> former) and then *also* ->suspend_late and ->suspend_noirq, because the PM
> core will call the last two from the middle layer as it has no information
> that pm_runtime_force_suspend() was called for the device in the "suspend"
> phase.  Of course, in general, what the middle layer ->suspend_late and
> ->suspend_noirq do is not guaranteed to play well with what its
> ->runtime_suspend does even if ->suspend itself is OK (but for ->runtime_resume
> all of that actually works, because the state it leaves the device in should
> be compatible with the system suspend callbacks invoked in the later phases).
>
> Second, leaving devices in runtime suspend in the "suspend" phase of system
> suspend is fishy even when their runtime PM is disabled, because that doesn't
> guarantee anything regarding their children or possible consumers.  Runtime
> PM may still be enabled for those devices at that time and runtime resume may
> be triggered for them later, in which case it all quickly falls apart.

This is true, although to me this is a about a different problem and
has very little to do with pm_runtime_force_suspend().

More precisely, whether runtime PM becomes disabled in the suspend
phase or suspend_late phase, really doesn't matter. Because in the end
this is about suspending/resuming devices in the correct order.

>
> IOW, there are reasons why the PM core bumps up the runtime PM usage counters
> of all devices during system suspend and they also apply to runtime suspend
> callbacks being invoked directly with runtime PM disabled for the given device.
> Frankly, it generally is only safe to leave a device in runtime suspend during
> system suspend if it can be guarateed that the system suspend callbacks in the
> subsequent suspend phases will not be invoked for it at all.

I understand this is perfectly true for some of the non-trivial middle
layers, however just to be clear, this statement don't have to serve
as a general rule for all cases, right?

Moreover, bumping the runtime PM usage count
(pm_runtime_get_noresume()) in the device_prepare/suspend() phase, was
originally added to prevent the runtime PM core from runtime
suspending a device, in cases when runtime PM has been enabled for it.
Preventing the ->runtime_suspend() callback from being invoked when
runtime PM is disabled, just doesn't make any sense to me.

>
> Thanks,
> Rafael
>
> ---
>  Documentation/driver-api/pm/devices.rst |   25 ++++++++++++++++++++++++-
>  1 file changed, 24 insertions(+), 1 deletion(-)
>
> Index: linux-pm/Documentation/driver-api/pm/devices.rst
> ===================================================================
> --- linux-pm.orig/Documentation/driver-api/pm/devices.rst
> +++ linux-pm/Documentation/driver-api/pm/devices.rst
> @@ -328,7 +328,10 @@ the phases are: ``prepare``, ``suspend``
>         After the ``->prepare`` callback method returns, no new children may be
>         registered below the device.  The method may also prepare the device or
>         driver in some way for the upcoming system power transition, but it
> -       should not put the device into a low-power state.
> +       should not put the device into a low-power state.  Moreover, if the
> +       device supports runtime power management, the ``->prepare`` callback
> +       method must not update its state in case it is necessary to resume it
> +       from runtime suspend later on.
>
>         For devices supporting runtime power management, the return value of the
>         prepare callback can be used to indicate to the PM core that it may
> @@ -356,6 +359,16 @@ the phases are: ``prepare``, ``suspend``
>         the appropriate low-power state, depending on the bus type the device is
>         on, and they may enable wakeup events.
>
> +       However, for devices supporting runtime power management, the
> +       ``->suspend`` methods provided by subsystems (bus types and PM domains
> +       in particular) must follow an additional rule regarding what can be done
> +       to the devices before their drivers' ``->suspend`` methods are called.
> +       Namely, they can only resume the devices from runtime suspend by
> +       calling :c:func:`pm_runtime_resume` for them, if that is necessary, and
> +       they must not update the state of the devices in any other way at that
> +       time (in case the drivers need to resume the devices from runtime
> +       suspend in their ``->suspend`` methods).
> +
>      3. For a number of devices it is convenient to split suspend into the
>         "quiesce device" and "save device state" phases, in which cases
>         ``suspend_late`` is meant to do the latter.  It is always executed after
> @@ -729,6 +742,16 @@ state temporarily, for example so that i
>  disabled.  This all depends on the hardware and the design of the subsystem and
>  device driver in question.
>
> +If it is necessary to resume a device from runtime suspend during a system-wide
> +transition into a sleep state, that can be done by calling
> +:c:func:`pm_runtime_resume` for it from the ``->suspend`` callback (or its
> +couterpart for transitions related to hibernation) of either the device's driver
> +or a subsystem responsible for it (for example, a bus type or a PM domain).
> +That is guaranteed to work by the requirement that subsystems must not change
> +the state of devices (possibly except for resuming them from runtime suspend)
> +from their ``->prepare`` and ``->suspend`` callbacks (or equivalent) *before*
> +invoking device drivers' ``->suspend`` callbacks (or equivalent).
> +
>  During system-wide resume from a sleep state it's easiest to put devices into
>  the full-power state, as explained in :file:`Documentation/power/runtime_pm.txt`.
>  Refer to that document for more information regarding this particular issue as
>

Kind regards
Uffe
Rafael J. Wysocki Sept. 20, 2017, 2:01 p.m. UTC | #2
On Wed, Sep 20, 2017 at 2:28 PM, Ulf Hansson <ulf.hansson@linaro.org> wrote:
> On 20 September 2017 at 02:26, Rafael J. Wysocki <rjw@rjwysocki.net> wrote:
>> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>>
>> It quite often is necessary to resume devices from runtime suspend
>> during system suspend for various reasons (for example, if their
>> wakeup settings need to be changed), but that requires middle-layer
>> or subsystem code to follow additional rules which currently are not
>> clearly documented.
>>
>> Namely, if a driver calls pm_runtime_resume() for the device from
>> its ->suspend (or equivalent) system sleep callback, that may not
>> work if the middle layer above it has updated the state of the
>> device from its ->prepare or ->suspend callbacks already in an
>> incompatible way.  For this reason, all middle layers must follow
>> the rule that, until the ->suspend callback provided by the device's
>> driver is invoked, the only way in which the device's state can be
>> updated is by calling pm_runtime_resume() for it, if necessary.
>> Fortunately enough, all middle layers in the code base today follow
>> this rule, but it is not explicitly stated anywhere, so do that.
>>
>> Note that calling pm_runtime_resume() from the ->suspend callback
>> of a driver will cause the ->runtime_resume callback provided by the
>> middle layer to be invoked, but the rule above guarantees that this
>> callback will nest properly with the middle layer's ->suspend
>> callback and it will play well with the ->prepare one invoked before.
>>
>> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>
> Reviewed-by: Ulf Hansson <ulf.hansson@linaro.org>

Thanks!

> Still, some comments below..
>
>> ---
>>
>> This is a follow-up for the system suspend callbacks discussion during
>> the Power Management and Energy-Awareness session at the LPC last week.
>>
>> In particular, I have been thinking about using pm_runtime_resume() from
>> driver ->suspend callbacks and it actually appears to be quite defendable to
>> me as long as it is guaranteed that middle layers will not mess up with the
>> device state before the driver's ->suspend callback is invoked.  If that
>> is the case, and I *think* that it currently is the case for all of the
>> middle layers in the tree unless I overlook something (USB anyone?), the
>> middle layer callbacks involved (->prepare, ->suspend and ->runtime_resume)
>> should actually nest properly and there should not be problems with that.
>> So, my proposal here would be to simply go ahead and document the rules
>> regarding this case without modifying the code.
>>
>> At the same time, I see at least two general problems with calling
>> pm_runtime_force_suspend() from the ->suspend callbacks of device drivers
>> (unless the middle layers involved are trivial).
>>
>> First, note that the middle layer callbacks involved in that case are
>> ->prepare, ->suspend, ->runtime_suspend (called indirectly from within the
>> former) and then *also* ->suspend_late and ->suspend_noirq, because the PM
>> core will call the last two from the middle layer as it has no information
>> that pm_runtime_force_suspend() was called for the device in the "suspend"
>> phase.  Of course, in general, what the middle layer ->suspend_late and
>> ->suspend_noirq do is not guaranteed to play well with what its
>> ->runtime_suspend does even if ->suspend itself is OK (but for ->runtime_resume
>> all of that actually works, because the state it leaves the device in should
>> be compatible with the system suspend callbacks invoked in the later phases).
>>
>> Second, leaving devices in runtime suspend in the "suspend" phase of system
>> suspend is fishy even when their runtime PM is disabled, because that doesn't
>> guarantee anything regarding their children or possible consumers.  Runtime
>> PM may still be enabled for those devices at that time and runtime resume may
>> be triggered for them later, in which case it all quickly falls apart.
>
> This is true, although to me this is a about a different problem and
> has very little to do with pm_runtime_force_suspend().
>
> More precisely, whether runtime PM becomes disabled in the suspend
> phase or suspend_late phase, really doesn't matter. Because in the end
> this is about suspending/resuming devices in the correct order.

Yes, it is, but this is not my point (I didn't make it clear enough I guess).

At the time you make the decision to disable runtime PM for a parent
(say) and leave it in runtime suspend, all of its children are
suspended just fine (otherwise the parent wouldn't have been suspended
too).  However, you *also* need to make sure that there will be no
attempts to resume any of them *after* that point, which practically
means that either runtime PM has to have been disabled already for all
of them at the time it is disabled for the parent, or there has to be
another guarantee in place.

That's why the core tries to enforce the "runtime PM disabled for the
entire hierarchy below" guarantee for the devices with direct_complete
set, but that may just be overkill in many cases.  I guess it may be
better to use WARN_ON() to catch the cases in which things may really
go wrong.

>> IOW, there are reasons why the PM core bumps up the runtime PM usage counters
>> of all devices during system suspend and they also apply to runtime suspend
>> callbacks being invoked directly with runtime PM disabled for the given device.
>> Frankly, it generally is only safe to leave a device in runtime suspend during
>> system suspend if it can be guarateed that the system suspend callbacks in the
>> subsequent suspend phases will not be invoked for it at all.
>
> I understand this is perfectly true for some of the non-trivial middle
> layers, however just to be clear, this statement don't have to serve
> as a general rule for all cases, right?

Well, a really general version of it is something like "it is only
safe to leave a device in runtime suspend during system suspend if it
can be guaranteed that the system suspend callbacks in the subsequent
suspend phases will not change its state" and the most effective way
to make that guarantee is to prevent them from being invoked at all.
:-)

> Moreover, bumping the runtime PM usage count
> (pm_runtime_get_noresume()) in the device_prepare/suspend() phase, was
> originally added to prevent the runtime PM core from runtime
> suspending a device, in cases when runtime PM has been enabled for it.
> Preventing the ->runtime_suspend() callback from being invoked when
> runtime PM is disabled, just doesn't make any sense to me.

The problem is that the functionality of ->runtime_suspend() in
principle overlaps with the functionality of ->suspend(),
->suspend_late() and ->suspend_noirq() combined, but it need not be
entirely the same.  Therefore if you invoke ->runtime_suspend()
anywhere between the beginning of ->suspend() and the end of
->suspend_noirq(), the remaining code in the system sleep callbacks
needs to know about that in order to avoid, for example, attempting to
power down the device for the second time in a row, which very well
may kill the system in some extreme cases.

Of course, if those callbacks are trivial, this problem goes away, but
they need not be trivial and if you are a platform driver (or an
i2c/spi driver too for that matter), you aren't guaranteed that they
will always be trivial.  That is quite a bit of an issue to me.

Thanks,
Rafael
Alan Stern Sept. 20, 2017, 4:15 p.m. UTC | #3
On Wed, 20 Sep 2017, Rafael J. Wysocki wrote:

> >> Second, leaving devices in runtime suspend in the "suspend" phase of system
> >> suspend is fishy even when their runtime PM is disabled, because that doesn't
> >> guarantee anything regarding their children or possible consumers.  Runtime
> >> PM may still be enabled for those devices at that time and runtime resume may
> >> be triggered for them later, in which case it all quickly falls apart.
> >
> > This is true, although to me this is a about a different problem and
> > has very little to do with pm_runtime_force_suspend().
> >
> > More precisely, whether runtime PM becomes disabled in the suspend
> > phase or suspend_late phase, really doesn't matter. Because in the end
> > this is about suspending/resuming devices in the correct order.
> 
> Yes, it is, but this is not my point (I didn't make it clear enough I guess).
> 
> At the time you make the decision to disable runtime PM for a parent
> (say) and leave it in runtime suspend, all of its children are
> suspended just fine (otherwise the parent wouldn't have been suspended
> too).  However, you *also* need to make sure that there will be no
> attempts to resume any of them *after* that point, which practically
> means that either runtime PM has to have been disabled already for all
> of them at the time it is disabled for the parent, or there has to be
> another guarantee in place.
> 
> That's why the core tries to enforce the "runtime PM disabled for the
> entire hierarchy below" guarantee for the devices with direct_complete
> set, but that may just be overkill in many cases.  I guess it may be
> better to use WARN_ON() to catch the cases in which things may really
> go wrong.

That's a good idea.

> >> IOW, there are reasons why the PM core bumps up the runtime PM usage counters
> >> of all devices during system suspend and they also apply to runtime suspend
> >> callbacks being invoked directly with runtime PM disabled for the given device.
> >> Frankly, it generally is only safe to leave a device in runtime suspend during
> >> system suspend if it can be guarateed that the system suspend callbacks in the
> >> subsequent suspend phases will not be invoked for it at all.
> >
> > I understand this is perfectly true for some of the non-trivial middle
> > layers, however just to be clear, this statement don't have to serve
> > as a general rule for all cases, right?
> 
> Well, a really general version of it is something like "it is only
> safe to leave a device in runtime suspend during system suspend if it
> can be guaranteed that the system suspend callbacks in the subsequent
> suspend phases will not change its state" and the most effective way
> to make that guarantee is to prevent them from being invoked at all.
> :-)

Of course, this can be overkill.  It's probably common for there to be
little physical difference between a device's "suspended" state and
its "runtime-suspended" state.  In such cases, the middle layer and
the subsequent callbacks merely have to recognize that the device is
already at low power (by checking an internal flag, for instance) and
then do nothing when they see that it is.

> > Moreover, bumping the runtime PM usage count
> > (pm_runtime_get_noresume()) in the device_prepare/suspend() phase, was
> > originally added to prevent the runtime PM core from runtime
> > suspending a device, in cases when runtime PM has been enabled for it.
> > Preventing the ->runtime_suspend() callback from being invoked when
> > runtime PM is disabled, just doesn't make any sense to me.
> 
> The problem is that the functionality of ->runtime_suspend() in
> principle overlaps with the functionality of ->suspend(),
> ->suspend_late() and ->suspend_noirq() combined, but it need not be
> entirely the same.  Therefore if you invoke ->runtime_suspend()
> anywhere between the beginning of ->suspend() and the end of
> ->suspend_noirq(), the remaining code in the system sleep callbacks
> needs to know about that in order to avoid, for example, attempting to
> power down the device for the second time in a row, which very well
> may kill the system in some extreme cases.

Another possibility is the arrival of an ill-timed runtime resume after
some of the suspend callbacks have run (because of a wakeup request,
for example).  In short, system PM and runtime PM do overlap
considerably, and this means that it would be easy for either one to
interfere with the other unless the developers are very careful.

> Of course, if those callbacks are trivial, this problem goes away, but
> they need not be trivial and if you are a platform driver (or an
> i2c/spi driver too for that matter), you aren't guaranteed that they
> will always be trivial.  That is quite a bit of an issue to me.

Indeed.  It may not be possible to come up with a firm set of detailed
rules that apply everywhere.  But we should make sure that people are
aware of the issues.

Alan Stern
Johannes Stezenbach Sept. 20, 2017, 4:27 p.m. UTC | #4
On Wed, Sep 20, 2017 at 04:01:32PM +0200, Rafael J. Wysocki wrote:
> On Wed, Sep 20, 2017 at 2:28 PM, Ulf Hansson <ulf.hansson@linaro.org> wrote:
> > On 20 September 2017 at 02:26, Rafael J. Wysocki <rjw@rjwysocki.net> wrote:
> >>
> >> Second, leaving devices in runtime suspend in the "suspend" phase of system
> >> suspend is fishy even when their runtime PM is disabled, because that doesn't
> >> guarantee anything regarding their children or possible consumers.  Runtime
> >> PM may still be enabled for those devices at that time and runtime resume may
> >> be triggered for them later, in which case it all quickly falls apart.
> >
> > This is true, although to me this is a about a different problem and
> > has very little to do with pm_runtime_force_suspend().
> >
> > More precisely, whether runtime PM becomes disabled in the suspend
> > phase or suspend_late phase, really doesn't matter. Because in the end
> > this is about suspending/resuming devices in the correct order.
> 
> Yes, it is, but this is not my point (I didn't make it clear enough I guess).
> 
> At the time you make the decision to disable runtime PM for a parent
> (say) and leave it in runtime suspend, all of its children are
> suspended just fine (otherwise the parent wouldn't have been suspended
> too).  However, you *also* need to make sure that there will be no
> attempts to resume any of them *after* that point, which practically
> means that either runtime PM has to have been disabled already for all
> of them at the time it is disabled for the parent, or there has to be
> another guarantee in place.
> 
> That's why the core tries to enforce the "runtime PM disabled for the
> entire hierarchy below" guarantee for the devices with direct_complete
> set, but that may just be overkill in many cases.  I guess it may be
> better to use WARN_ON() to catch the cases in which things may really
> go wrong.

I read this half a dozen times and I'm still confused.
Moreover, Documentation/driver-api/pm/devices.rst says:

    Runtime Power Management model:

        Devices may also be put into low-power states while the system is
        running, independently of other power management activity in principle.
        However, devices are not generally independent of each other (for
        example, a parent device cannot be suspended unless all of its child
        devices have been suspended).  ...

However, isn't this a fundamental difference of runtime suspend
vs. system suspend that parent devices *can* be runtime suspended
before their children?  E.g. an audio codec could keep running
while the i2c bus used to program its registers can be runtime suspended.
If this is correct I think it would be useful to spell it out explicitly
in the documentation.

During system suspend, pm core will suspend children first,
and if the child's ->suspend hook uses the i2c bus to access registers,
it will implicitly runtime resume the i2c bus (e.g. due to pm_runtime_get_sync()
in i2c_dw_xfer()).  Later pm core will ->suspend the i2c bus.

I have a hunch the root of the problem is that ->prepare walks the tree
in top-down order, and its return value is used to decide about
direct-complete.  Why does it do that?  Shouldn't pm core check
the direct_complete flag during ->suspend if the device
is in runtime suspend, to decide whether to skip runtime resume + ->suspend
for *this* device?


Johannes
Rafael J. Wysocki Sept. 21, 2017, 12:39 a.m. UTC | #5
On Wed, Sep 20, 2017 at 6:27 PM, Johannes Stezenbach <js@sig21.net> wrote:
> On Wed, Sep 20, 2017 at 04:01:32PM +0200, Rafael J. Wysocki wrote:
>> On Wed, Sep 20, 2017 at 2:28 PM, Ulf Hansson <ulf.hansson@linaro.org> wrote:
>> > On 20 September 2017 at 02:26, Rafael J. Wysocki <rjw@rjwysocki.net> wrote:
>> >>
>> >> Second, leaving devices in runtime suspend in the "suspend" phase of system
>> >> suspend is fishy even when their runtime PM is disabled, because that doesn't
>> >> guarantee anything regarding their children or possible consumers.  Runtime
>> >> PM may still be enabled for those devices at that time and runtime resume may
>> >> be triggered for them later, in which case it all quickly falls apart.
>> >
>> > This is true, although to me this is a about a different problem and
>> > has very little to do with pm_runtime_force_suspend().
>> >
>> > More precisely, whether runtime PM becomes disabled in the suspend
>> > phase or suspend_late phase, really doesn't matter. Because in the end
>> > this is about suspending/resuming devices in the correct order.
>>
>> Yes, it is, but this is not my point (I didn't make it clear enough I guess).
>>
>> At the time you make the decision to disable runtime PM for a parent
>> (say) and leave it in runtime suspend, all of its children are
>> suspended just fine (otherwise the parent wouldn't have been suspended
>> too).  However, you *also* need to make sure that there will be no
>> attempts to resume any of them *after* that point, which practically
>> means that either runtime PM has to have been disabled already for all
>> of them at the time it is disabled for the parent, or there has to be
>> another guarantee in place.
>>
>> That's why the core tries to enforce the "runtime PM disabled for the
>> entire hierarchy below" guarantee for the devices with direct_complete
>> set, but that may just be overkill in many cases.  I guess it may be
>> better to use WARN_ON() to catch the cases in which things may really
>> go wrong.
>
> I read this half a dozen times and I'm still confused.
> Moreover, Documentation/driver-api/pm/devices.rst says:
>
>     Runtime Power Management model:
>
>         Devices may also be put into low-power states while the system is
>         running, independently of other power management activity in principle.
>         However, devices are not generally independent of each other (for
>         example, a parent device cannot be suspended unless all of its child
>         devices have been suspended).  ...
>
> However, isn't this a fundamental difference of runtime suspend
> vs. system suspend that parent devices *can* be runtime suspended
> before their children?

No, it isn't.

>  E.g. an audio codec could keep running
> while the i2c bus used to program its registers can be runtime suspended.
> If this is correct I think it would be useful to spell it out explicitly
> in the documentation.

That's because the i2c bus uses the ignore_children flag that allows
it to override the general rules. :-)

> During system suspend, pm core will suspend children first,
> and if the child's ->suspend hook uses the i2c bus to access registers,
> it will implicitly runtime resume the i2c bus (e.g. due to pm_runtime_get_sync()
> in i2c_dw_xfer()).  Later pm core will ->suspend the i2c bus.

Yup.

> I have a hunch the root of the problem is that ->prepare walks the tree
> in top-down order, and its return value is used to decide about
> direct-complete.  Why does it do that?  Shouldn't pm core check
> the direct_complete flag during ->suspend if the device
> is in runtime suspend, to decide whether to skip runtime resume + ->suspend
> for *this* device?

direct_complete has nothing to do with this.

First off, the PM core does check the direct_complete flag in
__device_suspend() and does more-or-less what you are saying.

However, that flag is initialized in device_prepare() with the help of
the ->suspend() return value, because whether or not it makes sense to
set that flag depends on some conditions that may change between
consecutive system suspend-resume cycles in general and need to be
checked in advance before setting it.

HTH

Rafael
Rafael J. Wysocki Sept. 21, 2017, 12:46 a.m. UTC | #6
On Wed, Sep 20, 2017 at 6:15 PM, Alan Stern <stern@rowland.harvard.edu> wrote:
> On Wed, 20 Sep 2017, Rafael J. Wysocki wrote:
>
>> >> Second, leaving devices in runtime suspend in the "suspend" phase of system
>> >> suspend is fishy even when their runtime PM is disabled, because that doesn't
>> >> guarantee anything regarding their children or possible consumers.  Runtime
>> >> PM may still be enabled for those devices at that time and runtime resume may
>> >> be triggered for them later, in which case it all quickly falls apart.
>> >
>> > This is true, although to me this is a about a different problem and
>> > has very little to do with pm_runtime_force_suspend().
>> >
>> > More precisely, whether runtime PM becomes disabled in the suspend
>> > phase or suspend_late phase, really doesn't matter. Because in the end
>> > this is about suspending/resuming devices in the correct order.
>>
>> Yes, it is, but this is not my point (I didn't make it clear enough I guess).
>>
>> At the time you make the decision to disable runtime PM for a parent
>> (say) and leave it in runtime suspend, all of its children are
>> suspended just fine (otherwise the parent wouldn't have been suspended
>> too).  However, you *also* need to make sure that there will be no
>> attempts to resume any of them *after* that point, which practically
>> means that either runtime PM has to have been disabled already for all
>> of them at the time it is disabled for the parent, or there has to be
>> another guarantee in place.
>>
>> That's why the core tries to enforce the "runtime PM disabled for the
>> entire hierarchy below" guarantee for the devices with direct_complete
>> set, but that may just be overkill in many cases.  I guess it may be
>> better to use WARN_ON() to catch the cases in which things may really
>> go wrong.
>
> That's a good idea.
>
>> >> IOW, there are reasons why the PM core bumps up the runtime PM usage counters
>> >> of all devices during system suspend and they also apply to runtime suspend
>> >> callbacks being invoked directly with runtime PM disabled for the given device.
>> >> Frankly, it generally is only safe to leave a device in runtime suspend during
>> >> system suspend if it can be guarateed that the system suspend callbacks in the
>> >> subsequent suspend phases will not be invoked for it at all.
>> >
>> > I understand this is perfectly true for some of the non-trivial middle
>> > layers, however just to be clear, this statement don't have to serve
>> > as a general rule for all cases, right?
>>
>> Well, a really general version of it is something like "it is only
>> safe to leave a device in runtime suspend during system suspend if it
>> can be guaranteed that the system suspend callbacks in the subsequent
>> suspend phases will not change its state" and the most effective way
>> to make that guarantee is to prevent them from being invoked at all.
>> :-)
>
> Of course, this can be overkill.  It's probably common for there to be
> little physical difference between a device's "suspended" state and
> its "runtime-suspended" state.  In such cases, the middle layer and
> the subsequent callbacks merely have to recognize that the device is
> already at low power (by checking an internal flag, for instance) and
> then do nothing when they see that it is.

Right.

The idea, though, was to make it work even if they didn't do that. :-)

>> > Moreover, bumping the runtime PM usage count
>> > (pm_runtime_get_noresume()) in the device_prepare/suspend() phase, was
>> > originally added to prevent the runtime PM core from runtime
>> > suspending a device, in cases when runtime PM has been enabled for it.
>> > Preventing the ->runtime_suspend() callback from being invoked when
>> > runtime PM is disabled, just doesn't make any sense to me.
>>
>> The problem is that the functionality of ->runtime_suspend() in
>> principle overlaps with the functionality of ->suspend(),
>> ->suspend_late() and ->suspend_noirq() combined, but it need not be
>> entirely the same.  Therefore if you invoke ->runtime_suspend()
>> anywhere between the beginning of ->suspend() and the end of
>> ->suspend_noirq(), the remaining code in the system sleep callbacks
>> needs to know about that in order to avoid, for example, attempting to
>> power down the device for the second time in a row, which very well
>> may kill the system in some extreme cases.
>
> Another possibility is the arrival of an ill-timed runtime resume after
> some of the suspend callbacks have run (because of a wakeup request,
> for example).  In short, system PM and runtime PM do overlap
> considerably, and this means that it would be easy for either one to
> interfere with the other unless the developers are very careful.

Right.

>> Of course, if those callbacks are trivial, this problem goes away, but
>> they need not be trivial and if you are a platform driver (or an
>> i2c/spi driver too for that matter), you aren't guaranteed that they
>> will always be trivial.  That is quite a bit of an issue to me.
>
> Indeed.  It may not be possible to come up with a firm set of detailed
> rules that apply everywhere.  But we should make sure that people are
> aware of the issues.

Agreed.

Thanks,
Rafael
Johannes Stezenbach Sept. 21, 2017, 9:27 a.m. UTC | #7
On Thu, Sep 21, 2017 at 02:39:30AM +0200, Rafael J. Wysocki wrote:
> On Wed, Sep 20, 2017 at 6:27 PM, Johannes Stezenbach <js@sig21.net> wrote:
> >
> >  E.g. an audio codec could keep running
> > while the i2c bus used to program its registers can be runtime suspended.
> > If this is correct I think it would be useful to spell it out explicitly
> > in the documentation.
> 
> That's because the i2c bus uses the ignore_children flag that allows
> it to override the general rules. :-)

Ah!  I was looking at Documentation/driver-api/pm only (which is
changed by your patch), but this is documented in Documentation/power
(and obviously I hadn't checked the code, shame on me).

> direct_complete has nothing to do with this.

Oh?  Reading again, do I get this right:

1. simple method: always call pm_runtime_resume() in ->suspend(),
   then suspend the driver again
2. optimization: if pm_runtime_suspended(), the driver's ->suspend()
   can possibly do nothing if conditions permit, otherwise it calls
   pm_runtime_resume() and then suspends
3. optimization: tell pm core to skip ->suspend() via return value
   from ->prepare() which sets direct_complete

...and your patch only deals with 1 and 2.

Sorry to hijack your thread for side discussion, it was
inadvertant due to my lack of understanding.


> First off, the PM core does check the direct_complete flag in
> __device_suspend() and does more-or-less what you are saying.
> 
> However, that flag is initialized in device_prepare() with the help of
> the ->suspend() return value, because whether or not it makes sense to

you mean ->prepare(), right?

> set that flag depends on some conditions that may change between
> consecutive system suspend-resume cycles in general and need to be
> checked in advance before setting it.
> 
> HTH

It does, however the question remains *why* it needs to check
it in ->prepare() and not right before calling ->suspend().
Using ->prepare() for the purpose seems wrong since it traverses
the hierarchy in the "wrong" order.  Only right before
calling ->suspend() the driver knows if its current state
allows it to skip any further actions for suspend, because
suspending children or other users may cause pm_runtime_resume()
for it.  (In the back of my head I have the scenario of
bug #196861, some completely different driver uses
i2c via ACPI OpRegion during its suspend.)


Thanks,
Johannes
Rafael J. Wysocki Sept. 21, 2017, 2:30 p.m. UTC | #8
On Thursday, September 21, 2017 11:27:13 AM CEST Johannes Stezenbach wrote:
> On Thu, Sep 21, 2017 at 02:39:30AM +0200, Rafael J. Wysocki wrote:
> > On Wed, Sep 20, 2017 at 6:27 PM, Johannes Stezenbach <js@sig21.net> wrote:
> > >
> > >  E.g. an audio codec could keep running
> > > while the i2c bus used to program its registers can be runtime suspended.
> > > If this is correct I think it would be useful to spell it out explicitly
> > > in the documentation.
> > 
> > That's because the i2c bus uses the ignore_children flag that allows
> > it to override the general rules. :-)
> 
> Ah!  I was looking at Documentation/driver-api/pm only (which is
> changed by your patch), but this is documented in Documentation/power
> (and obviously I hadn't checked the code, shame on me).
> 
> > direct_complete has nothing to do with this.
> 
> Oh?  Reading again, do I get this right:
> 
> 1. simple method: always call pm_runtime_resume() in ->suspend(),
>    then suspend the driver again

Right.

> 2. optimization: if pm_runtime_suspended(), the driver's ->suspend()
>    can possibly do nothing if conditions permit, otherwise it calls
>    pm_runtime_resume() and then suspends

Well, that may work in some cases. :-)

> 3. optimization: tell pm core to skip ->suspend() via return value
>    from ->prepare() which sets direct_complete

Yup.

> ...and your patch only deals with 1 and 2.

Yes, basically.

> Sorry to hijack your thread for side discussion, it was
> inadvertant due to my lack of understanding.
> 
> 
> > First off, the PM core does check the direct_complete flag in
> > __device_suspend() and does more-or-less what you are saying.
> > 
> > However, that flag is initialized in device_prepare() with the help of
> > the ->suspend() return value, because whether or not it makes sense to
> 
> you mean ->prepare(), right?

Right (sorry).

> > set that flag depends on some conditions that may change between
> > consecutive system suspend-resume cycles in general and need to be
> > checked in advance before setting it.
> > 
> > HTH
> 
> It does, however the question remains *why* it needs to check
> it in ->prepare() and not right before calling ->suspend().

Becuase the core needs input from middle layers in some cases before
it decides to call ->suspend().

> Using ->prepare() for the purpose seems wrong since it traverses
> the hierarchy in the "wrong" order.

That doesn't matter.  What matters is whether or not the device's
state is "compatible" with system sleep at the ->prepare() time (which is
checked by some middle layers in ->prepare()).

BTW, did you notice the pm_runtime_status_suspended() checks in
__device_suspend()?  They are in there in case the device has been
resumed from runtime suspend after ->prepare().

> Only right before calling ->suspend() the driver knows if its current
> state allows it to skip any further actions for suspend, because
> suspending children or other users may cause pm_runtime_resume()
> for it.  (In the back of my head I have the scenario of
> bug #196861, some completely different driver uses
> i2c via ACPI OpRegion during its suspend.)

Yes, that's a special case (again, because of the way i2c handles
runtime PM overall).

Thanks,
Rafael
Alan Stern Sept. 21, 2017, 2:36 p.m. UTC | #9
On Thu, 21 Sep 2017, Johannes Stezenbach wrote:

> On Thu, Sep 21, 2017 at 02:39:30AM +0200, Rafael J. Wysocki wrote:
> > On Wed, Sep 20, 2017 at 6:27 PM, Johannes Stezenbach <js@sig21.net> wrote:
> > >
> > >  E.g. an audio codec could keep running
> > > while the i2c bus used to program its registers can be runtime suspended.
> > > If this is correct I think it would be useful to spell it out explicitly
> > > in the documentation.
> > 
> > That's because the i2c bus uses the ignore_children flag that allows
> > it to override the general rules. :-)
> 
> Ah!  I was looking at Documentation/driver-api/pm only (which is
> changed by your patch), but this is documented in Documentation/power
> (and obviously I hadn't checked the code, shame on me).
> 
> > direct_complete has nothing to do with this.
> 
> Oh?  Reading again, do I get this right:
> 
> 1. simple method: always call pm_runtime_resume() in ->suspend(),
>    then suspend the driver again
> 2. optimization: if pm_runtime_suspended(), the driver's ->suspend()
>    can possibly do nothing if conditions permit, otherwise it calls
>    pm_runtime_resume() and then suspends
> 3. optimization: tell pm core to skip ->suspend() via return value
>    from ->prepare() which sets direct_complete
> 
> ...and your patch only deals with 1 and 2.
> 
> Sorry to hijack your thread for side discussion, it was
> inadvertant due to my lack of understanding.
> 
> 
> > First off, the PM core does check the direct_complete flag in
> > __device_suspend() and does more-or-less what you are saying.
> > 
> > However, that flag is initialized in device_prepare() with the help of
> > the ->suspend() return value, because whether or not it makes sense to
> 
> you mean ->prepare(), right?
> 
> > set that flag depends on some conditions that may change between
> > consecutive system suspend-resume cycles in general and need to be
> > checked in advance before setting it.
> > 
> > HTH
> 
> It does, however the question remains *why* it needs to check
> it in ->prepare() and not right before calling ->suspend().
> Using ->prepare() for the purpose seems wrong since it traverses
> the hierarchy in the "wrong" order.

No, it is the _right_ order.  If a device's ->prepare() says that
direct_complete is okay, but one of its descendants disallows
direct_complete, we then want to clear the direct_complete flag in the
original device structure.  We couldn't do this if we checked the 
descendant's driver first.

>  Only right before
> calling ->suspend() the driver knows if its current state
> allows it to skip any further actions for suspend, because
> suspending children or other users may cause pm_runtime_resume()
> for it.

If the device gets runtime-resumed before ->suspend() would be called 
then the direct_complete setting doesn't matter.  The PM core follows 
the direct_complete path only if the device is already in runtime 
suspend when the ->suspend() callback would normally be invoked.

And if the device does get runtime-resumed like this, it can't be 
runtime-suspended again.  The PM core makes sure of that.

Alan Stern

>  (In the back of my head I have the scenario of
> bug #196861, some completely different driver uses
> i2c via ACPI OpRegion during its suspend.)
> 
> 
> Thanks,
> Johannes
Rafael J. Wysocki Sept. 21, 2017, 2:44 p.m. UTC | #10
On Thursday, September 21, 2017 4:36:30 PM CEST Alan Stern wrote:
> On Thu, 21 Sep 2017, Johannes Stezenbach wrote:
> 
> > On Thu, Sep 21, 2017 at 02:39:30AM +0200, Rafael J. Wysocki wrote:
> > > On Wed, Sep 20, 2017 at 6:27 PM, Johannes Stezenbach <js@sig21.net> wrote:
> > > >
> > > >  E.g. an audio codec could keep running
> > > > while the i2c bus used to program its registers can be runtime suspended.
> > > > If this is correct I think it would be useful to spell it out explicitly
> > > > in the documentation.
> > > 
> > > That's because the i2c bus uses the ignore_children flag that allows
> > > it to override the general rules. :-)
> > 
> > Ah!  I was looking at Documentation/driver-api/pm only (which is
> > changed by your patch), but this is documented in Documentation/power
> > (and obviously I hadn't checked the code, shame on me).
> > 
> > > direct_complete has nothing to do with this.
> > 
> > Oh?  Reading again, do I get this right:
> > 
> > 1. simple method: always call pm_runtime_resume() in ->suspend(),
> >    then suspend the driver again
> > 2. optimization: if pm_runtime_suspended(), the driver's ->suspend()
> >    can possibly do nothing if conditions permit, otherwise it calls
> >    pm_runtime_resume() and then suspends
> > 3. optimization: tell pm core to skip ->suspend() via return value
> >    from ->prepare() which sets direct_complete
> > 
> > ...and your patch only deals with 1 and 2.
> > 
> > Sorry to hijack your thread for side discussion, it was
> > inadvertant due to my lack of understanding.
> > 
> > 
> > > First off, the PM core does check the direct_complete flag in
> > > __device_suspend() and does more-or-less what you are saying.
> > > 
> > > However, that flag is initialized in device_prepare() with the help of
> > > the ->suspend() return value, because whether or not it makes sense to
> > 
> > you mean ->prepare(), right?
> > 
> > > set that flag depends on some conditions that may change between
> > > consecutive system suspend-resume cycles in general and need to be
> > > checked in advance before setting it.
> > > 
> > > HTH
> > 
> > It does, however the question remains *why* it needs to check
> > it in ->prepare() and not right before calling ->suspend().
> > Using ->prepare() for the purpose seems wrong since it traverses
> > the hierarchy in the "wrong" order.
> 
> No, it is the _right_ order.  If a device's ->prepare() says that
> direct_complete is okay, but one of its descendants disallows
> direct_complete, we then want to clear the direct_complete flag in the
> original device structure.  We couldn't do this if we checked the 
> descendant's driver first.

But we really clear it for parents (and suppliers) in __device_suspend(),
which is still OK, because that is first called for the children (and
consumers).  So the ordering of ->prepare() doesn't really matter here IMO.

Thanks,
Rafael
Ulf Hansson Sept. 22, 2017, 7:22 a.m. UTC | #11
[...]

>>> Second, leaving devices in runtime suspend in the "suspend" phase of system
>>> suspend is fishy even when their runtime PM is disabled, because that doesn't
>>> guarantee anything regarding their children or possible consumers.  Runtime
>>> PM may still be enabled for those devices at that time and runtime resume may
>>> be triggered for them later, in which case it all quickly falls apart.
>>
>> This is true, although to me this is a about a different problem and
>> has very little to do with pm_runtime_force_suspend().
>>
>> More precisely, whether runtime PM becomes disabled in the suspend
>> phase or suspend_late phase, really doesn't matter. Because in the end
>> this is about suspending/resuming devices in the correct order.
>
> Yes, it is, but this is not my point (I didn't make it clear enough I guess).
>
> At the time you make the decision to disable runtime PM for a parent
> (say) and leave it in runtime suspend, all of its children are
> suspended just fine (otherwise the parent wouldn't have been suspended
> too).  However, you *also* need to make sure that there will be no
> attempts to resume any of them *after* that point, which practically
> means that either runtime PM has to have been disabled already for all
> of them at the time it is disabled for the parent, or there has to be
> another guarantee in place.
>
> That's why the core tries to enforce the "runtime PM disabled for the
> entire hierarchy below" guarantee for the devices with direct_complete
> set, but that may just be overkill in many cases.  I guess it may be
> better to use WARN_ON() to catch the cases in which things may really
> go wrong.

Yes, using WARN_ON() is some clever way, seems like a great idea.

My point is, disabling runtime PM in a hierarchical manner, isn't a
solution to the suspend/resume ordering problem.

>
>>> IOW, there are reasons why the PM core bumps up the runtime PM usage counters
>>> of all devices during system suspend and they also apply to runtime suspend
>>> callbacks being invoked directly with runtime PM disabled for the given device.
>>> Frankly, it generally is only safe to leave a device in runtime suspend during
>>> system suspend if it can be guarateed that the system suspend callbacks in the
>>> subsequent suspend phases will not be invoked for it at all.
>>
>> I understand this is perfectly true for some of the non-trivial middle
>> layers, however just to be clear, this statement don't have to serve
>> as a general rule for all cases, right?
>
> Well, a really general version of it is something like "it is only
> safe to leave a device in runtime suspend during system suspend if it
> can be guaranteed that the system suspend callbacks in the subsequent
> suspend phases will not change its state" and the most effective way
> to make that guarantee is to prevent them from being invoked at all.
> :-)
>
>> Moreover, bumping the runtime PM usage count
>> (pm_runtime_get_noresume()) in the device_prepare/suspend() phase, was
>> originally added to prevent the runtime PM core from runtime
>> suspending a device, in cases when runtime PM has been enabled for it.
>> Preventing the ->runtime_suspend() callback from being invoked when
>> runtime PM is disabled, just doesn't make any sense to me.
>
> The problem is that the functionality of ->runtime_suspend() in
> principle overlaps with the functionality of ->suspend(),
> ->suspend_late() and ->suspend_noirq() combined, but it need not be
> entirely the same.  Therefore if you invoke ->runtime_suspend()
> anywhere between the beginning of ->suspend() and the end of
> ->suspend_noirq(), the remaining code in the system sleep callbacks
> needs to know about that in order to avoid, for example, attempting to
> power down the device for the second time in a row, which very well
> may kill the system in some extreme cases.

Agree, this may be a bit tricky to deal with for the more complex middle layers.

However, this is also very good reason to why it's useful keep the
runtime PM status up to date, always reflecting the HW status of the
device - even during system suspend phases.

>
> Of course, if those callbacks are trivial, this problem goes away, but
> they need not be trivial and if you are a platform driver (or an
> i2c/spi driver too for that matter), you aren't guaranteed that they
> will always be trivial.  That is quite a bit of an issue to me.

I understand your concern, but I don't share it. :-)

To me, I would rather keep things as simple as possible, then deal
with the complexity once we have a valid case for it.

Kind regards
Uffe
Rafael J. Wysocki Sept. 22, 2017, 10:29 p.m. UTC | #12
On Friday, September 22, 2017 9:22:47 AM CEST Ulf Hansson wrote:
> [...]
> 
> >>> Second, leaving devices in runtime suspend in the "suspend" phase of system
> >>> suspend is fishy even when their runtime PM is disabled, because that doesn't
> >>> guarantee anything regarding their children or possible consumers.  Runtime
> >>> PM may still be enabled for those devices at that time and runtime resume may
> >>> be triggered for them later, in which case it all quickly falls apart.
> >>
> >> This is true, although to me this is a about a different problem and
> >> has very little to do with pm_runtime_force_suspend().
> >>
> >> More precisely, whether runtime PM becomes disabled in the suspend
> >> phase or suspend_late phase, really doesn't matter. Because in the end
> >> this is about suspending/resuming devices in the correct order.
> >
> > Yes, it is, but this is not my point (I didn't make it clear enough I guess).
> >
> > At the time you make the decision to disable runtime PM for a parent
> > (say) and leave it in runtime suspend, all of its children are
> > suspended just fine (otherwise the parent wouldn't have been suspended
> > too).  However, you *also* need to make sure that there will be no
> > attempts to resume any of them *after* that point, which practically
> > means that either runtime PM has to have been disabled already for all
> > of them at the time it is disabled for the parent, or there has to be
> > another guarantee in place.
> >
> > That's why the core tries to enforce the "runtime PM disabled for the
> > entire hierarchy below" guarantee for the devices with direct_complete
> > set, but that may just be overkill in many cases.  I guess it may be
> > better to use WARN_ON() to catch the cases in which things may really
> > go wrong.
> 
> Yes, using WARN_ON() is some clever way, seems like a great idea.
> 
> My point is, disabling runtime PM in a hierarchical manner, isn't a
> solution to the suspend/resume ordering problem.
> 
> >
> >>> IOW, there are reasons why the PM core bumps up the runtime PM usage counters
> >>> of all devices during system suspend and they also apply to runtime suspend
> >>> callbacks being invoked directly with runtime PM disabled for the given device.
> >>> Frankly, it generally is only safe to leave a device in runtime suspend during
> >>> system suspend if it can be guarateed that the system suspend callbacks in the
> >>> subsequent suspend phases will not be invoked for it at all.
> >>
> >> I understand this is perfectly true for some of the non-trivial middle
> >> layers, however just to be clear, this statement don't have to serve
> >> as a general rule for all cases, right?
> >
> > Well, a really general version of it is something like "it is only
> > safe to leave a device in runtime suspend during system suspend if it
> > can be guaranteed that the system suspend callbacks in the subsequent
> > suspend phases will not change its state" and the most effective way
> > to make that guarantee is to prevent them from being invoked at all.
> > :-)
> >
> >> Moreover, bumping the runtime PM usage count
> >> (pm_runtime_get_noresume()) in the device_prepare/suspend() phase, was
> >> originally added to prevent the runtime PM core from runtime
> >> suspending a device, in cases when runtime PM has been enabled for it.
> >> Preventing the ->runtime_suspend() callback from being invoked when
> >> runtime PM is disabled, just doesn't make any sense to me.
> >
> > The problem is that the functionality of ->runtime_suspend() in
> > principle overlaps with the functionality of ->suspend(),
> > ->suspend_late() and ->suspend_noirq() combined, but it need not be
> > entirely the same.  Therefore if you invoke ->runtime_suspend()
> > anywhere between the beginning of ->suspend() and the end of
> > ->suspend_noirq(), the remaining code in the system sleep callbacks
> > needs to know about that in order to avoid, for example, attempting to
> > power down the device for the second time in a row, which very well
> > may kill the system in some extreme cases.
> 
> Agree, this may be a bit tricky to deal with for the more complex middle layers.
> 
> However, this is also very good reason to why it's useful keep the
> runtime PM status up to date, always reflecting the HW status of the
> device - even during system suspend phases.
> 
> >
> > Of course, if those callbacks are trivial, this problem goes away, but
> > they need not be trivial and if you are a platform driver (or an
> > i2c/spi driver too for that matter), you aren't guaranteed that they
> > will always be trivial.  That is quite a bit of an issue to me.
> 
> I understand your concern, but I don't share it. :-)
> 
> To me, I would rather keep things as simple as possible, then deal
> with the complexity once we have a valid case for it.

But we have valid cases, actually, which is why we are talking about that
in the first place.

The point is how to address them and *none* of the choices at hand are
particuarly simple.  Some of them are simply plain unacceptable to me, though,
and I have tried very hard to explain why this is the case, but at this point
I'm just totally out of patience, so please accept the fact.

Thanks,
Rafael
diff mbox

Patch

Index: linux-pm/Documentation/driver-api/pm/devices.rst
===================================================================
--- linux-pm.orig/Documentation/driver-api/pm/devices.rst
+++ linux-pm/Documentation/driver-api/pm/devices.rst
@@ -328,7 +328,10 @@  the phases are: ``prepare``, ``suspend``
 	After the ``->prepare`` callback method returns, no new children may be
 	registered below the device.  The method may also prepare the device or
 	driver in some way for the upcoming system power transition, but it
-	should not put the device into a low-power state.
+	should not put the device into a low-power state.  Moreover, if the
+	device supports runtime power management, the ``->prepare`` callback
+	method must not update its state in case it is necessary to resume it
+	from runtime suspend later on.
 
 	For devices supporting runtime power management, the return value of the
 	prepare callback can be used to indicate to the PM core that it may
@@ -356,6 +359,16 @@  the phases are: ``prepare``, ``suspend``
 	the appropriate low-power state, depending on the bus type the device is
 	on, and they may enable wakeup events.
 
+	However, for devices supporting runtime power management, the
+	``->suspend`` methods provided by subsystems (bus types and PM domains
+	in particular) must follow an additional rule regarding what can be done
+	to the devices before their drivers' ``->suspend`` methods are called.
+	Namely, they can only resume the devices from runtime suspend by
+	calling :c:func:`pm_runtime_resume` for them, if that is necessary, and
+	they must not update the state of the devices in any other way at that
+	time (in case the drivers need to resume the devices from runtime
+	suspend in their ``->suspend`` methods).
+
     3.	For a number of devices it is convenient to split suspend into the
 	"quiesce device" and "save device state" phases, in which cases
 	``suspend_late`` is meant to do the latter.  It is always executed after
@@ -729,6 +742,16 @@  state temporarily, for example so that i
 disabled.  This all depends on the hardware and the design of the subsystem and
 device driver in question.
 
+If it is necessary to resume a device from runtime suspend during a system-wide
+transition into a sleep state, that can be done by calling
+:c:func:`pm_runtime_resume` for it from the ``->suspend`` callback (or its
+couterpart for transitions related to hibernation) of either the device's driver
+or a subsystem responsible for it (for example, a bus type or a PM domain).
+That is guaranteed to work by the requirement that subsystems must not change
+the state of devices (possibly except for resuming them from runtime suspend)
+from their ``->prepare`` and ``->suspend`` callbacks (or equivalent) *before*
+invoking device drivers' ``->suspend`` callbacks (or equivalent).
+
 During system-wide resume from a sleep state it's easiest to put devices into
 the full-power state, as explained in :file:`Documentation/power/runtime_pm.txt`.
 Refer to that document for more information regarding this particular issue as