diff mbox

driver core: Ensure proper suspend/resume ordering

Message ID 1441880343-30852-1-git-send-email-thierry.reding@gmail.com (mailing list archive)
State RFC, archived
Headers show

Commit Message

Thierry Reding Sept. 10, 2015, 10:19 a.m. UTC
From: Thierry Reding <treding@nvidia.com>

Deferred probe can lead to strange situations where a device that is a
dependency for others will be moved to the end of the dpm_list. At the
same time the dependers may not be moved because at the time they will
be probed the dependee may already have been successfully reprobed and
they will not have to defer the probe themselves.

One example where this happens is the Jetson TK1 board (Tegra124). The
gpio-keys driver exposes the power key of the board as an input device
that can also be used as a wakeup source. Commit 17cdddf0fb68 ("ARM:
tegra: Add gpio-ranges property") results in the gpio-tegra driver
deferring probe because one of its dependencies, the pinctrl-tegra
driver, has not successfully completed probing. Currently the deferred
probe code will move the corresponding gpio-tegra device to the end of
the dpm_list, but by the time the gpio-keys device, depending on the
gpio-tegra device, is probed, gpio-tegra has already been reprobed, so
the gpio-keys device is not moved to the end of dpm_list itself. As a
result, the suspend ordering becomes pinctrl-tegra -> gpio-keys ->
gpio-tegra. That's problematic because the gpio-keys driver requests
the power key to be a wakeup source. However, the programming of the
wakeup interrupt registers happens in the gpio-tegra driver's suspend
callback, which is now called before that of the gpio-keys driver. The
result is that the wrong values are programmed and leaves the system
unable to be resumed using the power key.

To fix this situation, always move devices to the end of the dpm_list
before probing them. Technically this should only be done for devices
that have been successfully probed, but that won't work for recursive
probing of devices (think an I2C master that instantiates children in
its ->probe()). Effectively the dpm_list will end up ordered the same
way that devices were probed, hence taking care of dependencies.

Signed-off-by: Thierry Reding <treding@nvidia.com>
---
Note that this commit is kind of the PM equivalent of 52cdbdd49853
("driver core: correct device's shutdown order) and that we have two
lists that are essentially the same (dpm_list and devices_kset). I'm
wondering if it would be worth looking into getting rid of one of
them? I don't see any reason why the ordering for shutdown and
suspend/resume should be different, and having a single list would
help keep this in sync.

 drivers/base/dd.c | 33 +++++++++++++++++++++++----------
 1 file changed, 23 insertions(+), 10 deletions(-)

Comments

Grygorii Strashko Sept. 10, 2015, 10:58 a.m. UTC | #1
On 09/10/2015 01:19 PM, Thierry Reding wrote:
> From: Thierry Reding <treding@nvidia.com>
> 
> Deferred probe can lead to strange situations where a device that is a
> dependency for others will be moved to the end of the dpm_list. At the
> same time the dependers may not be moved because at the time they will
> be probed the dependee may already have been successfully reprobed and
> they will not have to defer the probe themselves.
> 
> One example where this happens is the Jetson TK1 board (Tegra124). The
> gpio-keys driver exposes the power key of the board as an input device
> that can also be used as a wakeup source. Commit 17cdddf0fb68 ("ARM:
> tegra: Add gpio-ranges property") results in the gpio-tegra driver
> deferring probe because one of its dependencies, the pinctrl-tegra
> driver, has not successfully completed probing. Currently the deferred
> probe code will move the corresponding gpio-tegra device to the end of
> the dpm_list, but by the time the gpio-keys device, depending on the
> gpio-tegra device, is probed, gpio-tegra has already been reprobed, so
> the gpio-keys device is not moved to the end of dpm_list itself. As a
> result, the suspend ordering becomes pinctrl-tegra -> gpio-keys ->
> gpio-tegra. That's problematic because the gpio-keys driver requests
> the power key to be a wakeup source. However, the programming of the
> wakeup interrupt registers happens in the gpio-tegra driver's suspend
> callback, which is now called before that of the gpio-keys driver. The
> result is that the wrong values are programmed and leaves the system
> unable to be resumed using the power key.
> 
> To fix this situation, always move devices to the end of the dpm_list
> before probing them. Technically this should only be done for devices
> that have been successfully probed, but that won't work for recursive
> probing of devices (think an I2C master that instantiates children in
> its ->probe()). Effectively the dpm_list will end up ordered the same
> way that devices were probed, hence taking care of dependencies.

Second try :), first one was here:
"[RFC 1/1] driver core: re-order dpm_list after a succussful probe"
https://lkml.org/lkml/2014/12/12/324
and it was unsuccessful exactly because dpm_list was reordered after probe()
and not before, i think.

I'll try to test it.

> 
> Signed-off-by: Thierry Reding <treding@nvidia.com>
> ---
> Note that this commit is kind of the PM equivalent of 52cdbdd49853
> ("driver core: correct device's shutdown order) and that we have two
> lists that are essentially the same (dpm_list and devices_kset). I'm
> wondering if it would be worth looking into getting rid of one of
> them? I don't see any reason why the ordering for shutdown and
> suspend/resume should be different, and having a single list would
> help keep this in sync.

Yep. I've tried to remove one of those lists while working on shutdown issue.
I've tried to drop dmp_list, but my try was unsuccessful, because I was not
able to find simple way to handle device's  dynamic creation/removal during
suspend/resume :(.
Also note, dmp_list's presence depends on Kconfig options.

> 
>   drivers/base/dd.c | 33 +++++++++++++++++++++++----------
>   1 file changed, 23 insertions(+), 10 deletions(-)
> 
> diff --git a/drivers/base/dd.c b/drivers/base/dd.c
> index be0eb4639128..56291b11049b 100644
> --- a/drivers/base/dd.c
> +++ b/drivers/base/dd.c
> @@ -88,16 +88,6 @@ static void deferred_probe_work_func(struct work_struct *work)
>   		 */
>   		mutex_unlock(&deferred_probe_mutex);
>   
> -		/*
> -		 * Force the device to the end of the dpm_list since
> -		 * the PM code assumes that the order we add things to
> -		 * the list is a good order for suspend but deferred
> -		 * probe makes that very unsafe.
> -		 */
> -		device_pm_lock();
> -		device_pm_move_last(dev);
> -		device_pm_unlock();
> -
>   		dev_dbg(dev, "Retrying from deferred list\n");
>   		bus_probe_device(dev);
>   
> @@ -312,6 +302,29 @@ static int really_probe(struct device *dev, struct device_driver *drv)
>   	 */
>   	devices_kset_move_last(dev);
>   
> +	/*
> +	 * Force the device to the end of the dpm_list since the PM code
> +	 * assumes that the order we add things to the list is a good order
> +	 * for suspend but deferred probe makes that very unsafe.
> +	 *
> +	 * Deferred probe can also cause situations in which a device that is
> +	 * a dependency for others gets moved further down the dpm_list as a
> +	 * result of probe deferral. In that case the dependee will end up
> +	 * getting suspended before any of its dependers.
> +	 *
> +	 * To ensure proper ordering of suspend/resume, move every device that
> +	 * is being probed to the end of the dpm_list. Note that technically
> +	 * only successfully probed devices need to be moved, but that breaks
> +	 * for recursively added devices because they would end up in the list
> +	 * in reverse of the desired order, so we simply do it unconditionally
> +	 * for all devices before they are being probed. In the worst case the
> +	 * list will be reordered a couple more times than necessary, which
> +	 * should be an insignificant amount of work.
> +	 */
> +	device_pm_lock();
> +	device_pm_move_last(dev);
> +	device_pm_unlock();
> +
>   	if (dev->bus->probe) {
>   		ret = dev->bus->probe(dev);
>   		if (ret)
>
Rafael J. Wysocki Sept. 10, 2015, 10:08 p.m. UTC | #2
On Thursday, September 10, 2015 12:19:03 PM Thierry Reding wrote:
> From: Thierry Reding <treding@nvidia.com>
> 
> Deferred probe can lead to strange situations where a device that is a
> dependency for others will be moved to the end of the dpm_list. At the
> same time the dependers may not be moved because at the time they will
> be probed the dependee may already have been successfully reprobed and
> they will not have to defer the probe themselves.

So there's a bug in the implementation of deferred probing IMO.

> One example where this happens is the Jetson TK1 board (Tegra124). The
> gpio-keys driver exposes the power key of the board as an input device
> that can also be used as a wakeup source. Commit 17cdddf0fb68 ("ARM:
> tegra: Add gpio-ranges property") results in the gpio-tegra driver
> deferring probe because one of its dependencies, the pinctrl-tegra
> driver, has not successfully completed probing. Currently the deferred
> probe code will move the corresponding gpio-tegra device to the end of
> the dpm_list, but by the time the gpio-keys device, depending on the
> gpio-tegra device, is probed, gpio-tegra has already been reprobed, so
> the gpio-keys device is not moved to the end of dpm_list itself. As a
> result, the suspend ordering becomes pinctrl-tegra -> gpio-keys ->
> gpio-tegra. That's problematic because the gpio-keys driver requests
> the power key to be a wakeup source. However, the programming of the
> wakeup interrupt registers happens in the gpio-tegra driver's suspend
> callback, which is now called before that of the gpio-keys driver. The
> result is that the wrong values are programmed and leaves the system
> unable to be resumed using the power key.
> 
> To fix this situation, always move devices to the end of the dpm_list
> before probing them. Technically this should only be done for devices
> that have been successfully probed, but that won't work for recursive
> probing of devices (think an I2C master that instantiates children in
> its ->probe()). Effectively the dpm_list will end up ordered the same
> way that devices were probed, hence taking care of dependencies.
> 
> Signed-off-by: Thierry Reding <treding@nvidia.com>
> ---
> Note that this commit is kind of the PM equivalent of 52cdbdd49853
> ("driver core: correct device's shutdown order) and that we have two
> lists that are essentially the same (dpm_list and devices_kset). I'm
> wondering if it would be worth looking into getting rid of one of
> them? I don't see any reason why the ordering for shutdown and
> suspend/resume should be different, and having a single list would
> help keep this in sync.

We move away things from dpm_list during suspend and add them back to it
during resume to handle the situations in which some devices go away or
appear during suspend/resume.  That makes this idea potentially problematic.

> 
>  drivers/base/dd.c | 33 +++++++++++++++++++++++----------
>  1 file changed, 23 insertions(+), 10 deletions(-)
> 
> diff --git a/drivers/base/dd.c b/drivers/base/dd.c
> index be0eb4639128..56291b11049b 100644
> --- a/drivers/base/dd.c
> +++ b/drivers/base/dd.c
> @@ -88,16 +88,6 @@ static void deferred_probe_work_func(struct work_struct *work)
>  		 */
>  		mutex_unlock(&deferred_probe_mutex);
>  
> -		/*
> -		 * Force the device to the end of the dpm_list since
> -		 * the PM code assumes that the order we add things to
> -		 * the list is a good order for suspend but deferred
> -		 * probe makes that very unsafe.
> -		 */
> -		device_pm_lock();
> -		device_pm_move_last(dev);
> -		device_pm_unlock();
> -
>  		dev_dbg(dev, "Retrying from deferred list\n");
>  		bus_probe_device(dev);
>  
> @@ -312,6 +302,29 @@ static int really_probe(struct device *dev, struct device_driver *drv)
>  	 */
>  	devices_kset_move_last(dev);
>  
> +	/*
> +	 * Force the device to the end of the dpm_list since the PM code
> +	 * assumes that the order we add things to the list is a good order
> +	 * for suspend but deferred probe makes that very unsafe.
> +	 *
> +	 * Deferred probe can also cause situations in which a device that is
> +	 * a dependency for others gets moved further down the dpm_list as a
> +	 * result of probe deferral. In that case the dependee will end up
> +	 * getting suspended before any of its dependers.
> +	 *
> +	 * To ensure proper ordering of suspend/resume, move every device that
> +	 * is being probed to the end of the dpm_list. Note that technically
> +	 * only successfully probed devices need to be moved, but that breaks
> +	 * for recursively added devices because they would end up in the list
> +	 * in reverse of the desired order, so we simply do it unconditionally
> +	 * for all devices before they are being probed. In the worst case the
> +	 * list will be reordered a couple more times than necessary, which
> +	 * should be an insignificant amount of work.
> +	 */
> +	device_pm_lock();
> +	device_pm_move_last(dev);
> +	device_pm_unlock();

So I don't agree with doing that for every driver being probed against the
same device.  That's just wasteful IMO.

> +
>  	if (dev->bus->probe) {
>  		ret = dev->bus->probe(dev);
>  		if (ret)
> 

Alan, what do you think about this?

Thanks,
Rafael

--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Thierry Reding Sept. 11, 2015, 12:03 p.m. UTC | #3
On Fri, Sep 11, 2015 at 12:08:02AM +0200, Rafael J. Wysocki wrote:
> On Thursday, September 10, 2015 12:19:03 PM Thierry Reding wrote:
> > From: Thierry Reding <treding@nvidia.com>
> > 
> > Deferred probe can lead to strange situations where a device that is a
> > dependency for others will be moved to the end of the dpm_list. At the
> > same time the dependers may not be moved because at the time they will
> > be probed the dependee may already have been successfully reprobed and
> > they will not have to defer the probe themselves.
> 
> So there's a bug in the implementation of deferred probing IMO.

Well, yeah. The root problem here is that we don't have dependency
information and deferred probing is supposed to fix that. It does so
fairly well, but it breaks in this particular case.

> > One example where this happens is the Jetson TK1 board (Tegra124). The
> > gpio-keys driver exposes the power key of the board as an input device
> > that can also be used as a wakeup source. Commit 17cdddf0fb68 ("ARM:
> > tegra: Add gpio-ranges property") results in the gpio-tegra driver
> > deferring probe because one of its dependencies, the pinctrl-tegra
> > driver, has not successfully completed probing. Currently the deferred
> > probe code will move the corresponding gpio-tegra device to the end of
> > the dpm_list, but by the time the gpio-keys device, depending on the
> > gpio-tegra device, is probed, gpio-tegra has already been reprobed, so
> > the gpio-keys device is not moved to the end of dpm_list itself. As a
> > result, the suspend ordering becomes pinctrl-tegra -> gpio-keys ->
> > gpio-tegra. That's problematic because the gpio-keys driver requests
> > the power key to be a wakeup source. However, the programming of the
> > wakeup interrupt registers happens in the gpio-tegra driver's suspend
> > callback, which is now called before that of the gpio-keys driver. The
> > result is that the wrong values are programmed and leaves the system
> > unable to be resumed using the power key.
> > 
> > To fix this situation, always move devices to the end of the dpm_list
> > before probing them. Technically this should only be done for devices
> > that have been successfully probed, but that won't work for recursive
> > probing of devices (think an I2C master that instantiates children in
> > its ->probe()). Effectively the dpm_list will end up ordered the same
> > way that devices were probed, hence taking care of dependencies.
> > 
> > Signed-off-by: Thierry Reding <treding@nvidia.com>
> > ---
> > Note that this commit is kind of the PM equivalent of 52cdbdd49853
> > ("driver core: correct device's shutdown order) and that we have two
> > lists that are essentially the same (dpm_list and devices_kset). I'm
> > wondering if it would be worth looking into getting rid of one of
> > them? I don't see any reason why the ordering for shutdown and
> > suspend/resume should be different, and having a single list would
> > help keep this in sync.
> 
> We move away things from dpm_list during suspend and add them back to it
> during resume to handle the situations in which some devices go away or
> appear during suspend/resume.  That makes this idea potentially problematic.

Okay, I see. If they are used for different purposes it's fine to keep
them both.

> >  drivers/base/dd.c | 33 +++++++++++++++++++++++----------
> >  1 file changed, 23 insertions(+), 10 deletions(-)
> > 
> > diff --git a/drivers/base/dd.c b/drivers/base/dd.c
> > index be0eb4639128..56291b11049b 100644
> > --- a/drivers/base/dd.c
> > +++ b/drivers/base/dd.c
> > @@ -88,16 +88,6 @@ static void deferred_probe_work_func(struct work_struct *work)
> >  		 */
> >  		mutex_unlock(&deferred_probe_mutex);
> >  
> > -		/*
> > -		 * Force the device to the end of the dpm_list since
> > -		 * the PM code assumes that the order we add things to
> > -		 * the list is a good order for suspend but deferred
> > -		 * probe makes that very unsafe.
> > -		 */
> > -		device_pm_lock();
> > -		device_pm_move_last(dev);
> > -		device_pm_unlock();
> > -
> >  		dev_dbg(dev, "Retrying from deferred list\n");
> >  		bus_probe_device(dev);
> >  
> > @@ -312,6 +302,29 @@ static int really_probe(struct device *dev, struct device_driver *drv)
> >  	 */
> >  	devices_kset_move_last(dev);
> >  
> > +	/*
> > +	 * Force the device to the end of the dpm_list since the PM code
> > +	 * assumes that the order we add things to the list is a good order
> > +	 * for suspend but deferred probe makes that very unsafe.
> > +	 *
> > +	 * Deferred probe can also cause situations in which a device that is
> > +	 * a dependency for others gets moved further down the dpm_list as a
> > +	 * result of probe deferral. In that case the dependee will end up
> > +	 * getting suspended before any of its dependers.
> > +	 *
> > +	 * To ensure proper ordering of suspend/resume, move every device that
> > +	 * is being probed to the end of the dpm_list. Note that technically
> > +	 * only successfully probed devices need to be moved, but that breaks
> > +	 * for recursively added devices because they would end up in the list
> > +	 * in reverse of the desired order, so we simply do it unconditionally
> > +	 * for all devices before they are being probed. In the worst case the
> > +	 * list will be reordered a couple more times than necessary, which
> > +	 * should be an insignificant amount of work.
> > +	 */
> > +	device_pm_lock();
> > +	device_pm_move_last(dev);
> > +	device_pm_unlock();
> 
> So I don't agree with doing that for every driver being probed against the
> same device.  That's just wasteful IMO.

I don't understand. At this point driver matching has already taken
place, so this will only every happen for one particular driver and the
corresponding device. In the most common case this will happen exactly
once, when the device is probed. Worst case it will happen a second or
more times if a driver defers probing for a specific device. But in that
case moving the device to the end of the list is absolutely required to
keep it properly ordered for suspend/resume. Note that this is already
done by the original code in deferred_probe_work_func() that is removed
in the first hunk. The fix here is to do it for every device to ensure
that inter-device dependencies are properly dealt with.

I agree that it's a little wasteful, but that's completely in line with
deferred probe. It's a simple solution to a very difficult problem, so
naturally it comes at a cost. But it's also fairly elegant in how it
correctly solves the ordering problem with very little code. The only
other way that I can think of to avoid reordering the list for every
device probe would be to sort it in advance using dependency information
which we don't have. So we'd need to first add all that dependency
information, and using that information is likely to be more work than
simply reordering the list for each probe.

Thierry
Alan Stern Sept. 11, 2015, 2:05 p.m. UTC | #4
On Fri, 11 Sep 2015, Rafael J. Wysocki wrote:

> On Thursday, September 10, 2015 12:19:03 PM Thierry Reding wrote:
> > From: Thierry Reding <treding@nvidia.com>
> > 
> > Deferred probe can lead to strange situations where a device that is a
> > dependency for others will be moved to the end of the dpm_list. At the
> > same time the dependers may not be moved because at the time they will
> > be probed the dependee may already have been successfully reprobed and
> > they will not have to defer the probe themselves.
> 
> So there's a bug in the implementation of deferred probing IMO.
> 
> > One example where this happens is the Jetson TK1 board (Tegra124). The
> > gpio-keys driver exposes the power key of the board as an input device
> > that can also be used as a wakeup source. Commit 17cdddf0fb68 ("ARM:
> > tegra: Add gpio-ranges property") results in the gpio-tegra driver
> > deferring probe because one of its dependencies, the pinctrl-tegra
> > driver, has not successfully completed probing. Currently the deferred
> > probe code will move the corresponding gpio-tegra device to the end of
> > the dpm_list, but by the time the gpio-keys device, depending on the
> > gpio-tegra device, is probed, gpio-tegra has already been reprobed, so
> > the gpio-keys device is not moved to the end of dpm_list itself. As a
> > result, the suspend ordering becomes pinctrl-tegra -> gpio-keys ->
> > gpio-tegra. That's problematic because the gpio-keys driver requests
> > the power key to be a wakeup source. However, the programming of the
> > wakeup interrupt registers happens in the gpio-tegra driver's suspend
> > callback, which is now called before that of the gpio-keys driver. The
> > result is that the wrong values are programmed and leaves the system
> > unable to be resumed using the power key.
> > 
> > To fix this situation, always move devices to the end of the dpm_list
> > before probing them. Technically this should only be done for devices
> > that have been successfully probed, but that won't work for recursive
> > probing of devices (think an I2C master that instantiates children in
> > its ->probe()). Effectively the dpm_list will end up ordered the same
> > way that devices were probed, hence taking care of dependencies.
> > 
> > Signed-off-by: Thierry Reding <treding@nvidia.com>
> > ---
> > Note that this commit is kind of the PM equivalent of 52cdbdd49853
> > ("driver core: correct device's shutdown order) and that we have two
> > lists that are essentially the same (dpm_list and devices_kset). I'm
> > wondering if it would be worth looking into getting rid of one of
> > them? I don't see any reason why the ordering for shutdown and
> > suspend/resume should be different, and having a single list would
> > help keep this in sync.
> 
> We move away things from dpm_list during suspend and add them back to it
> during resume to handle the situations in which some devices go away or
> appear during suspend/resume.  That makes this idea potentially problematic.

> Alan, what do you think about this?

It's a tricky problem.  Let me give it some thought and I'll get back 
to you.

Alan Stern

--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Alan Stern Sept. 11, 2015, 7:01 p.m. UTC | #5
On Fri, 11 Sep 2015, Thierry Reding wrote:

> On Fri, Sep 11, 2015 at 12:08:02AM +0200, Rafael J. Wysocki wrote:
> > On Thursday, September 10, 2015 12:19:03 PM Thierry Reding wrote:
> > > From: Thierry Reding <treding@nvidia.com>
> > > 
> > > Deferred probe can lead to strange situations where a device that is a
> > > dependency for others will be moved to the end of the dpm_list. At the
> > > same time the dependers may not be moved because at the time they will
> > > be probed the dependee may already have been successfully reprobed and
> > > they will not have to defer the probe themselves.
> > 
> > So there's a bug in the implementation of deferred probing IMO.
> 
> Well, yeah. The root problem here is that we don't have dependency
> information and deferred probing is supposed to fix that. It does so
> fairly well, but it breaks in this particular case.
> 
> > > One example where this happens is the Jetson TK1 board (Tegra124). The
> > > gpio-keys driver exposes the power key of the board as an input device
> > > that can also be used as a wakeup source. Commit 17cdddf0fb68 ("ARM:
> > > tegra: Add gpio-ranges property") results in the gpio-tegra driver
> > > deferring probe because one of its dependencies, the pinctrl-tegra
> > > driver, has not successfully completed probing. Currently the deferred
> > > probe code will move the corresponding gpio-tegra device to the end of
> > > the dpm_list, but by the time the gpio-keys device, depending on the
> > > gpio-tegra device, is probed, gpio-tegra has already been reprobed, so
> > > the gpio-keys device is not moved to the end of dpm_list itself. As a
> > > result, the suspend ordering becomes pinctrl-tegra -> gpio-keys ->
> > > gpio-tegra. That's problematic because the gpio-keys driver requests
> > > the power key to be a wakeup source. However, the programming of the
> > > wakeup interrupt registers happens in the gpio-tegra driver's suspend
> > > callback, which is now called before that of the gpio-keys driver. The
> > > result is that the wrong values are programmed and leaves the system
> > > unable to be resumed using the power key.
> > > 
> > > To fix this situation, always move devices to the end of the dpm_list
> > > before probing them. Technically this should only be done for devices
> > > that have been successfully probed, but that won't work for recursive
> > > probing of devices (think an I2C master that instantiates children in
> > > its ->probe()). Effectively the dpm_list will end up ordered the same
> > > way that devices were probed, hence taking care of dependencies.

I'm not worried about the overhead involved in moving a device to the 
end of the list every time it is probed.  That ought to be relatively 
small.

There are a few things to watch out for.  Since the dpm_list gets
modified during system sleep transitions, we would have to make sure
that nothing gets probed during those times.  In principle, that's what
the "prepare" stage is meant for, but there's still a race.  As long as
no other kernel thread (such as the deferred probing mechanism) tries
to probe a device once everything has been frozen, we should be okay.
But if not, there will be trouble -- after the ->prepare callback runs, 
the device is no longer on the dpm_list and so we don't want this patch 
to put it back on that list.

There's also an issue about other types of dependencies.  For instance,
it's conceivable that device B might be discovered and depend on device
A, even before A has been bound to a driver.  (B might be discovered by 
A's subsystem rather than A's driver.)  In that case, moving A to the 
end of the list would cause B to come before A even though B depends on 
A.  Of course, deferred probing already has this problem.

An easy way to check for this sort of thing would be to verify that a 
device about to be probed doesn't have any children.  This wouldn't 
catch all the potential dependencies, but it would be a reasonable 
start.  Do we currently check that after a device has been unbound, it 
doesn't have any remaining children?  We should do that too.

Alan Stern

--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Rafael J. Wysocki Sept. 11, 2015, 10:28 p.m. UTC | #6
On Friday, September 11, 2015 03:01:14 PM Alan Stern wrote:
> On Fri, 11 Sep 2015, Thierry Reding wrote:
> 
> > On Fri, Sep 11, 2015 at 12:08:02AM +0200, Rafael J. Wysocki wrote:
> > > On Thursday, September 10, 2015 12:19:03 PM Thierry Reding wrote:
> > > > From: Thierry Reding <treding@nvidia.com>
> > > > 
> > > > Deferred probe can lead to strange situations where a device that is a
> > > > dependency for others will be moved to the end of the dpm_list. At the
> > > > same time the dependers may not be moved because at the time they will
> > > > be probed the dependee may already have been successfully reprobed and
> > > > they will not have to defer the probe themselves.
> > > 
> > > So there's a bug in the implementation of deferred probing IMO.
> > 
> > Well, yeah. The root problem here is that we don't have dependency
> > information and deferred probing is supposed to fix that. It does so
> > fairly well, but it breaks in this particular case.
> > 
> > > > One example where this happens is the Jetson TK1 board (Tegra124). The
> > > > gpio-keys driver exposes the power key of the board as an input device
> > > > that can also be used as a wakeup source. Commit 17cdddf0fb68 ("ARM:
> > > > tegra: Add gpio-ranges property") results in the gpio-tegra driver
> > > > deferring probe because one of its dependencies, the pinctrl-tegra
> > > > driver, has not successfully completed probing. Currently the deferred
> > > > probe code will move the corresponding gpio-tegra device to the end of
> > > > the dpm_list, but by the time the gpio-keys device, depending on the
> > > > gpio-tegra device, is probed, gpio-tegra has already been reprobed, so
> > > > the gpio-keys device is not moved to the end of dpm_list itself. As a
> > > > result, the suspend ordering becomes pinctrl-tegra -> gpio-keys ->
> > > > gpio-tegra. That's problematic because the gpio-keys driver requests
> > > > the power key to be a wakeup source. However, the programming of the
> > > > wakeup interrupt registers happens in the gpio-tegra driver's suspend
> > > > callback, which is now called before that of the gpio-keys driver. The
> > > > result is that the wrong values are programmed and leaves the system
> > > > unable to be resumed using the power key.
> > > > 
> > > > To fix this situation, always move devices to the end of the dpm_list
> > > > before probing them. Technically this should only be done for devices
> > > > that have been successfully probed, but that won't work for recursive
> > > > probing of devices (think an I2C master that instantiates children in
> > > > its ->probe()). Effectively the dpm_list will end up ordered the same
> > > > way that devices were probed, hence taking care of dependencies.
> 
> I'm not worried about the overhead involved in moving a device to the 
> end of the list every time it is probed.  That ought to be relatively 
> small.
> 
> There are a few things to watch out for.  Since the dpm_list gets
> modified during system sleep transitions, we would have to make sure
> that nothing gets probed during those times.  In principle, that's what
> the "prepare" stage is meant for, but there's still a race.  As long as
> no other kernel thread (such as the deferred probing mechanism) tries
> to probe a device once everything has been frozen, we should be okay.
> But if not, there will be trouble -- after the ->prepare callback runs, 
> the device is no longer on the dpm_list and so we don't want this patch 
> to put it back on that list.

Right.

> There's also an issue about other types of dependencies.  For instance,
> it's conceivable that device B might be discovered and depend on device
> A, even before A has been bound to a driver.  (B might be discovered by 
> A's subsystem rather than A's driver.)  In that case, moving A to the 
> end of the list would cause B to come before A even though B depends on 
> A.  Of course, deferred probing already has this problem.

That may actually happen for PCIe ports and devices below them AFAICS.

Devices below PCIe ports are discovered by the PCI subsystem and the PCIe
ports need not be probed before those devices are probed.

> An easy way to check for this sort of thing would be to verify that a 
> device about to be probed doesn't have any children.  This wouldn't 
> catch all the potential dependencies, but it would be a reasonable 
> start.

That would address the PCIe ports issue.

> Do we currently check that after a device has been unbound, it 
> doesn't have any remaining children?  We should do that too.

We don't and if the device is something like a PCIe port, it'd be OK for it
to still have active children after unbind.

Thanks,
Rafael

--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Rafael J. Wysocki Sept. 11, 2015, 10:38 p.m. UTC | #7
On Friday, September 11, 2015 02:03:46 PM Thierry Reding wrote:
> On Fri, Sep 11, 2015 at 12:08:02AM +0200, Rafael J. Wysocki wrote:
> > On Thursday, September 10, 2015 12:19:03 PM Thierry Reding wrote:
> > > From: Thierry Reding <treding@nvidia.com>
> > >=20
> > > Deferred probe can lead to strange situations where a device that is a
> > > dependency for others will be moved to the end of the dpm_list. At the
> > > same time the dependers may not be moved because at the time they will
> > > be probed the dependee may already have been successfully reprobed and
> > > they will not have to defer the probe themselves.
> >=20
> > So there's a bug in the implementation of deferred probing IMO.
> 
> Well, yeah. The root problem here is that we don't have dependency
> information and deferred probing is supposed to fix that. It does so
> fairly well, but it breaks in this particular case.
> 
> > > One example where this happens is the Jetson TK1 board (Tegra124). The
> > > gpio-keys driver exposes the power key of the board as an input device
> > > that can also be used as a wakeup source. Commit 17cdddf0fb68 ("ARM:
> > > tegra: Add gpio-ranges property") results in the gpio-tegra driver
> > > deferring probe because one of its dependencies, the pinctrl-tegra
> > > driver, has not successfully completed probing. Currently the deferred
> > > probe code will move the corresponding gpio-tegra device to the end of
> > > the dpm_list, but by the time the gpio-keys device, depending on the
> > > gpio-tegra device, is probed, gpio-tegra has already been reprobed, so
> > > the gpio-keys device is not moved to the end of dpm_list itself.

At this point, when checking its dependencies, gpio-keys should also check
if their ordering with respect to it in dpm_list is correct and move itself
to the end of it otherwise.

There might be a helper for that I suppose.

> > > As a
> > > result, the suspend ordering becomes pinctrl-tegra -> gpio-keys ->
> > > gpio-tegra. That's problematic because the gpio-keys driver requests
> > > the power key to be a wakeup source. However, the programming of the
> > > wakeup interrupt registers happens in the gpio-tegra driver's suspend
> > > callback, which is now called before that of the gpio-keys driver. The
> > > result is that the wrong values are programmed and leaves the system
> > > unable to be resumed using the power key.
> > >=20
> > > To fix this situation, always move devices to the end of the dpm_list
> > > before probing them. Technically this should only be done for devices
> > > that have been successfully probed, but that won't work for recursive
> > > probing of devices (think an I2C master that instantiates children in
> > > its ->probe()). Effectively the dpm_list will end up ordered the same
> > > way that devices were probed, hence taking care of dependencies.
> > >=20
> > > Signed-off-by: Thierry Reding <treding@nvidia.com>
> > > ---
> > > Note that this commit is kind of the PM equivalent of 52cdbdd49853
> > > ("driver core: correct device's shutdown order) and that we have two
> > > lists that are essentially the same (dpm_list and devices_kset). I'm
> > > wondering if it would be worth looking into getting rid of one of
> > > them? I don't see any reason why the ordering for shutdown and
> > > suspend/resume should be different, and having a single list would
> > > help keep this in sync.
> >=20
> > We move away things from dpm_list during suspend and add them back to it
> > during resume to handle the situations in which some devices go away or
> > appear during suspend/resume.  That makes this idea potentially problemat=
> ic.
> 
> Okay, I see. If they are used for different purposes it's fine to keep
> them both.
> 
> > >  drivers/base/dd.c | 33 +++++++++++++++++++++++----------
> > >  1 file changed, 23 insertions(+), 10 deletions(-)
> > >=20
> > > diff --git a/drivers/base/dd.c b/drivers/base/dd.c
> > > index be0eb4639128..56291b11049b 100644
> > > --- a/drivers/base/dd.c
> > > +++ b/drivers/base/dd.c
> > > @@ -88,16 +88,6 @@ static void deferred_probe_work_func(struct work_str=
> uct *work)
> > >  		 */
> > >  		mutex_unlock(&deferred_probe_mutex);
> > > =20
> > > -		/*
> > > -		 * Force the device to the end of the dpm_list since
> > > -		 * the PM code assumes that the order we add things to
> > > -		 * the list is a good order for suspend but deferred
> > > -		 * probe makes that very unsafe.
> > > -		 */
> > > -		device_pm_lock();
> > > -		device_pm_move_last(dev);
> > > -		device_pm_unlock();
> > > -
> > >  		dev_dbg(dev, "Retrying from deferred list\n");
> > >  		bus_probe_device(dev);
> > > =20
> > > @@ -312,6 +302,29 @@ static int really_probe(struct device *dev, struct=
>  device_driver *drv)
> > >  	 */
> > >  	devices_kset_move_last(dev);
> > > =20
> > > +	/*
> > > +	 * Force the device to the end of the dpm_list since the PM code
> > > +	 * assumes that the order we add things to the list is a good order
> > > +	 * for suspend but deferred probe makes that very unsafe.
> > > +	 *
> > > +	 * Deferred probe can also cause situations in which a device that is
> > > +	 * a dependency for others gets moved further down the dpm_list as a
> > > +	 * result of probe deferral. In that case the dependee will end up
> > > +	 * getting suspended before any of its dependers.
> > > +	 *
> > > +	 * To ensure proper ordering of suspend/resume, move every device that
> > > +	 * is being probed to the end of the dpm_list. Note that technically
> > > +	 * only successfully probed devices need to be moved, but that breaks
> > > +	 * for recursively added devices because they would end up in the list
> > > +	 * in reverse of the desired order, so we simply do it unconditionally
> > > +	 * for all devices before they are being probed. In the worst case the
> > > +	 * list will be reordered a couple more times than necessary, which
> > > +	 * should be an insignificant amount of work.
> > > +	 */
> > > +	device_pm_lock();
> > > +	device_pm_move_last(dev);
> > > +	device_pm_unlock();
> >=20
> > So I don't agree with doing that for every driver being probed against the
> > same device.  That's just wasteful IMO.
> 
> I don't understand. At this point driver matching has already taken
> place, so this will only every happen for one particular driver and the
> corresponding device. In the most common case this will happen exactly
> once, when the device is probed. Worst case it will happen a second or
> more times if a driver defers probing for a specific device. But in that
> case moving the device to the end of the list is absolutely required to
> keep it properly ordered for suspend/resume. Note that this is already
> done by the original code in deferred_probe_work_func() that is removed
> in the first hunk. The fix here is to do it for every device to ensure
> that inter-device dependencies are properly dealt with.
> 
> I agree that it's a little wasteful, but that's completely in line with
> deferred probe. It's a simple solution to a very difficult problem, so
> naturally it comes at a cost. But it's also fairly elegant in how it
> correctly solves the ordering problem with very little code. The only
> other way that I can think of to avoid reordering the list for every
> device probe would be to sort it in advance using dependency information
> which we don't have. So we'd need to first add all that dependency
> information, and using that information is likely to be more work than
> simply reordering the list for each probe.

But that is problematic too as Alan pointed it out.

I'm always cautious about changes that make the core do something for every
device/driver, because they are very likely to step on corner cases that we
don't need to worry about otherwise.

Thanks,
Rafael

--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Alan Stern Sept. 12, 2015, 5:40 p.m. UTC | #8
On Sat, 12 Sep 2015, Rafael J. Wysocki wrote:

> On Friday, September 11, 2015 03:01:14 PM Alan Stern wrote:

> > There's also an issue about other types of dependencies.  For instance,
> > it's conceivable that device B might be discovered and depend on device
> > A, even before A has been bound to a driver.  (B might be discovered by 
> > A's subsystem rather than A's driver.)  In that case, moving A to the 
> > end of the list would cause B to come before A even though B depends on 
> > A.  Of course, deferred probing already has this problem.
> 
> That may actually happen for PCIe ports and devices below them AFAICS.
> 
> Devices below PCIe ports are discovered by the PCI subsystem and the PCIe
> ports need not be probed before those devices are probed.

Is it possible to change this?  Make it so that devices below PCIe 
ports are discovered by the port driver rather than by the PCI 
subsystem?  Or would that be far too difficult?

> > An easy way to check for this sort of thing would be to verify that a 
> > device about to be probed doesn't have any children.  This wouldn't 
> > catch all the potential dependencies, but it would be a reasonable 
> > start.
> 
> That would address the PCIe ports issue.

Well, it would _detect_ the PCIe ports issue.  It might also detect 
other things.

Alan Stern

--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Rafael J. Wysocki Sept. 15, 2015, 12:40 a.m. UTC | #9
Hi Alan,

On Sat, Sep 12, 2015 at 7:40 PM, Alan Stern <stern@rowland.harvard.edu> wrote:
> On Sat, 12 Sep 2015, Rafael J. Wysocki wrote:
>
>> On Friday, September 11, 2015 03:01:14 PM Alan Stern wrote:
>
>> > There's also an issue about other types of dependencies.  For instance,
>> > it's conceivable that device B might be discovered and depend on device
>> > A, even before A has been bound to a driver.  (B might be discovered by
>> > A's subsystem rather than A's driver.)  In that case, moving A to the
>> > end of the list would cause B to come before A even though B depends on
>> > A.  Of course, deferred probing already has this problem.
>>
>> That may actually happen for PCIe ports and devices below them AFAICS.
>>
>> Devices below PCIe ports are discovered by the PCI subsystem and the PCIe
>> ports need not be probed before those devices are probed.
>
> Is it possible to change this?  Make it so that devices below PCIe
> ports are discovered by the port driver rather than by the PCI
> subsystem?  Or would that be far too difficult?

I don't think it would be really useful.

PCIe ports are PCI bridges from the resource allocation and basic
functionality perspective, so they are handled accordingly.

The PCIe port driver provides additional services (such as PME/hotplug
interrupt handling and advanced error reporting) that aren't necessary
for probing devices below the ports.

I guess the ordering of PCIe ports probing might be changed to happen
at the "right" time, but care needs to be taken in that case too.

>> > An easy way to check for this sort of thing would be to verify that a
>> > device about to be probed doesn't have any children.  This wouldn't
>> > catch all the potential dependencies, but it would be a reasonable
>> > start.
>>
>> That would address the PCIe ports issue.
>
> Well, it would _detect_ the PCIe ports issue.

That's what I meant. :-)

>  It might also detect other things.

Right.

Thanks,
Rafael
--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Alan Stern Sept. 15, 2015, 2:23 p.m. UTC | #10
On Tue, 15 Sep 2015, Rafael J. Wysocki wrote:

> Hi Alan,

Hi.

> On Sat, Sep 12, 2015 at 7:40 PM, Alan Stern <stern@rowland.harvard.edu> wrote:
> > On Sat, 12 Sep 2015, Rafael J. Wysocki wrote:
> >
> >> On Friday, September 11, 2015 03:01:14 PM Alan Stern wrote:
> >
> >> > There's also an issue about other types of dependencies.  For instance,
> >> > it's conceivable that device B might be discovered and depend on device
> >> > A, even before A has been bound to a driver.  (B might be discovered by
> >> > A's subsystem rather than A's driver.)  In that case, moving A to the
> >> > end of the list would cause B to come before A even though B depends on
> >> > A.  Of course, deferred probing already has this problem.
> >>
> >> That may actually happen for PCIe ports and devices below them AFAICS.
> >>
> >> Devices below PCIe ports are discovered by the PCI subsystem and the PCIe
> >> ports need not be probed before those devices are probed.
> >
> > Is it possible to change this?  Make it so that devices below PCIe
> > ports are discovered by the port driver rather than by the PCI
> > subsystem?  Or would that be far too difficult?
> 
> I don't think it would be really useful.
> 
> PCIe ports are PCI bridges from the resource allocation and basic
> functionality perspective, so they are handled accordingly.
> 
> The PCIe port driver provides additional services (such as PME/hotplug
> interrupt handling and advanced error reporting) that aren't necessary
> for probing devices below the ports.
> 
> I guess the ordering of PCIe ports probing might be changed to happen
> at the "right" time, but care needs to be taken in that case too.

Does suspending a PCIe port do anything significant?  In particular, 
does it cut off normal communication with anything attached through 
the port?

If it does then we better not move the port device to the end of the 
dpm_list after any descendant devices have been probed.

> >> > An easy way to check for this sort of thing would be to verify that a
> >> > device about to be probed doesn't have any children.  This wouldn't
> >> > catch all the potential dependencies, but it would be a reasonable
> >> > start.
> >>
> >> That would address the PCIe ports issue.
> >
> > Well, it would _detect_ the PCIe ports issue.
> 
> That's what I meant. :-)
> 
> >  It might also detect other things.
> 
> Right.

All right, I'll try writing a test patch to report these things.

Alan Stern

--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Thierry Reding Sept. 15, 2015, 3:53 p.m. UTC | #11
On Sat, Sep 12, 2015 at 12:38:19AM +0200, Rafael J. Wysocki wrote:
> On Friday, September 11, 2015 02:03:46 PM Thierry Reding wrote:
> > On Fri, Sep 11, 2015 at 12:08:02AM +0200, Rafael J. Wysocki wrote:
> > > On Thursday, September 10, 2015 12:19:03 PM Thierry Reding wrote:
> > > > From: Thierry Reding <treding@nvidia.com>
> > > >=20
> > > > Deferred probe can lead to strange situations where a device that is a
> > > > dependency for others will be moved to the end of the dpm_list. At the
> > > > same time the dependers may not be moved because at the time they will
> > > > be probed the dependee may already have been successfully reprobed and
> > > > they will not have to defer the probe themselves.
> > >=20
> > > So there's a bug in the implementation of deferred probing IMO.
> > 
> > Well, yeah. The root problem here is that we don't have dependency
> > information and deferred probing is supposed to fix that. It does so
> > fairly well, but it breaks in this particular case.
> > 
> > > > One example where this happens is the Jetson TK1 board (Tegra124). The
> > > > gpio-keys driver exposes the power key of the board as an input device
> > > > that can also be used as a wakeup source. Commit 17cdddf0fb68 ("ARM:
> > > > tegra: Add gpio-ranges property") results in the gpio-tegra driver
> > > > deferring probe because one of its dependencies, the pinctrl-tegra
> > > > driver, has not successfully completed probing. Currently the deferred
> > > > probe code will move the corresponding gpio-tegra device to the end of
> > > > the dpm_list, but by the time the gpio-keys device, depending on the
> > > > gpio-tegra device, is probed, gpio-tegra has already been reprobed, so
> > > > the gpio-keys device is not moved to the end of dpm_list itself.
> 
> At this point, when checking its dependencies, gpio-keys should also check
> if their ordering with respect to it in dpm_list is correct and move itself
> to the end of it otherwise.
> 
> There might be a helper for that I suppose.

But that's essentially the same thing as what this patch proposes,
except that every driver now needs to call this helper, rather than
having the core take care of it.

> > > > As a
> > > > result, the suspend ordering becomes pinctrl-tegra -> gpio-keys ->
> > > > gpio-tegra. That's problematic because the gpio-keys driver requests
> > > > the power key to be a wakeup source. However, the programming of the
> > > > wakeup interrupt registers happens in the gpio-tegra driver's suspend
> > > > callback, which is now called before that of the gpio-keys driver. The
> > > > result is that the wrong values are programmed and leaves the system
> > > > unable to be resumed using the power key.
> > > >=20
> > > > To fix this situation, always move devices to the end of the dpm_list
> > > > before probing them. Technically this should only be done for devices
> > > > that have been successfully probed, but that won't work for recursive
> > > > probing of devices (think an I2C master that instantiates children in
> > > > its ->probe()). Effectively the dpm_list will end up ordered the same
> > > > way that devices were probed, hence taking care of dependencies.
> > > >=20
> > > > Signed-off-by: Thierry Reding <treding@nvidia.com>
> > > > ---
> > > > Note that this commit is kind of the PM equivalent of 52cdbdd49853
> > > > ("driver core: correct device's shutdown order) and that we have two
> > > > lists that are essentially the same (dpm_list and devices_kset). I'm
> > > > wondering if it would be worth looking into getting rid of one of
> > > > them? I don't see any reason why the ordering for shutdown and
> > > > suspend/resume should be different, and having a single list would
> > > > help keep this in sync.
> > >=20
> > > We move away things from dpm_list during suspend and add them back to it
> > > during resume to handle the situations in which some devices go away or
> > > appear during suspend/resume.  That makes this idea potentially problemat=
> > ic.
> > 
> > Okay, I see. If they are used for different purposes it's fine to keep
> > them both.
> > 
> > > >  drivers/base/dd.c | 33 +++++++++++++++++++++++----------
> > > >  1 file changed, 23 insertions(+), 10 deletions(-)
> > > >=20
> > > > diff --git a/drivers/base/dd.c b/drivers/base/dd.c
> > > > index be0eb4639128..56291b11049b 100644
> > > > --- a/drivers/base/dd.c
> > > > +++ b/drivers/base/dd.c
> > > > @@ -88,16 +88,6 @@ static void deferred_probe_work_func(struct work_str=
> > uct *work)
> > > >  		 */
> > > >  		mutex_unlock(&deferred_probe_mutex);
> > > > =20
> > > > -		/*
> > > > -		 * Force the device to the end of the dpm_list since
> > > > -		 * the PM code assumes that the order we add things to
> > > > -		 * the list is a good order for suspend but deferred
> > > > -		 * probe makes that very unsafe.
> > > > -		 */
> > > > -		device_pm_lock();
> > > > -		device_pm_move_last(dev);
> > > > -		device_pm_unlock();
> > > > -
> > > >  		dev_dbg(dev, "Retrying from deferred list\n");
> > > >  		bus_probe_device(dev);
> > > > =20
> > > > @@ -312,6 +302,29 @@ static int really_probe(struct device *dev, struct=
> >  device_driver *drv)
> > > >  	 */
> > > >  	devices_kset_move_last(dev);
> > > > =20
> > > > +	/*
> > > > +	 * Force the device to the end of the dpm_list since the PM code
> > > > +	 * assumes that the order we add things to the list is a good order
> > > > +	 * for suspend but deferred probe makes that very unsafe.
> > > > +	 *
> > > > +	 * Deferred probe can also cause situations in which a device that is
> > > > +	 * a dependency for others gets moved further down the dpm_list as a
> > > > +	 * result of probe deferral. In that case the dependee will end up
> > > > +	 * getting suspended before any of its dependers.
> > > > +	 *
> > > > +	 * To ensure proper ordering of suspend/resume, move every device that
> > > > +	 * is being probed to the end of the dpm_list. Note that technically
> > > > +	 * only successfully probed devices need to be moved, but that breaks
> > > > +	 * for recursively added devices because they would end up in the list
> > > > +	 * in reverse of the desired order, so we simply do it unconditionally
> > > > +	 * for all devices before they are being probed. In the worst case the
> > > > +	 * list will be reordered a couple more times than necessary, which
> > > > +	 * should be an insignificant amount of work.
> > > > +	 */
> > > > +	device_pm_lock();
> > > > +	device_pm_move_last(dev);
> > > > +	device_pm_unlock();
> > >=20
> > > So I don't agree with doing that for every driver being probed against the
> > > same device.  That's just wasteful IMO.
> > 
> > I don't understand. At this point driver matching has already taken
> > place, so this will only every happen for one particular driver and the
> > corresponding device. In the most common case this will happen exactly
> > once, when the device is probed. Worst case it will happen a second or
> > more times if a driver defers probing for a specific device. But in that
> > case moving the device to the end of the list is absolutely required to
> > keep it properly ordered for suspend/resume. Note that this is already
> > done by the original code in deferred_probe_work_func() that is removed
> > in the first hunk. The fix here is to do it for every device to ensure
> > that inter-device dependencies are properly dealt with.
> > 
> > I agree that it's a little wasteful, but that's completely in line with
> > deferred probe. It's a simple solution to a very difficult problem, so
> > naturally it comes at a cost. But it's also fairly elegant in how it
> > correctly solves the ordering problem with very little code. The only
> > other way that I can think of to avoid reordering the list for every
> > device probe would be to sort it in advance using dependency information
> > which we don't have. So we'd need to first add all that dependency
> > information, and using that information is likely to be more work than
> > simply reordering the list for each probe.
> 
> But that is problematic too as Alan pointed it out.
> 
> I'm always cautious about changes that make the core do something for every
> device/driver, because they are very likely to step on corner cases that we
> don't need to worry about otherwise.

To me this seems more like a fundamental issue that should be fixed at
the root rather than be fixed up case by case as we find them. Keeping
the suspend/resume order the same as probe order sounds like the most
logical thing to do. I grant you that there could be cases where this
might break, but then I'd consider those cases to be broken rather than
the other way around.

With the current situation potentially every user of GPIOs (and the same
is true for other types of resources) is broken, though we may not
notice (immediately). In fact it was mostly by coincidence that I
noticed in this case. The GPIO key works perfectly fine for regular use,
it just doesn't work as a wakeup source anymore. That's not something
that people test very frequently, hence could've gone unnoticed for much
longer.

Of course reordering the dpm_list to follow the probe order has the
potential to break other setups in similarily subtle ways, so I do
understand your reluctance.

Perhaps it would help if we put this patch into some boot farm to get
more coverage, maybe that would give us a better picture for how
invasive the change is, or how bad the fallout could be?

I can of course go and come up with a more ad-hoc solution that fixes
the issue for this particular use-case, but I'd like to avoid a
situation where we end up having to patch up drivers one by one, when
we could have a solution that works in the general case.

Also note that a recent commit (52cdbdd49853 "driver core: correct
device's shutdown order") patch already made an equivalent change to the
shutdown order. I'd expect that most devices would fail with that patch
already if ordering is a real problem. Of course shutdown is a one-way
ticket, so failure might not be as relevant as for suspend/resume.

Grygorii, Greg: have you heard of any fallout caused by the above patch
yet?

Thierry
Thierry Reding Sept. 15, 2015, 4:10 p.m. UTC | #12
On Fri, Sep 11, 2015 at 03:01:14PM -0400, Alan Stern wrote:
> On Fri, 11 Sep 2015, Thierry Reding wrote:
> 
> > On Fri, Sep 11, 2015 at 12:08:02AM +0200, Rafael J. Wysocki wrote:
> > > On Thursday, September 10, 2015 12:19:03 PM Thierry Reding wrote:
> > > > From: Thierry Reding <treding@nvidia.com>
> > > > 
> > > > Deferred probe can lead to strange situations where a device that is a
> > > > dependency for others will be moved to the end of the dpm_list. At the
> > > > same time the dependers may not be moved because at the time they will
> > > > be probed the dependee may already have been successfully reprobed and
> > > > they will not have to defer the probe themselves.
> > > 
> > > So there's a bug in the implementation of deferred probing IMO.
> > 
> > Well, yeah. The root problem here is that we don't have dependency
> > information and deferred probing is supposed to fix that. It does so
> > fairly well, but it breaks in this particular case.
> > 
> > > > One example where this happens is the Jetson TK1 board (Tegra124). The
> > > > gpio-keys driver exposes the power key of the board as an input device
> > > > that can also be used as a wakeup source. Commit 17cdddf0fb68 ("ARM:
> > > > tegra: Add gpio-ranges property") results in the gpio-tegra driver
> > > > deferring probe because one of its dependencies, the pinctrl-tegra
> > > > driver, has not successfully completed probing. Currently the deferred
> > > > probe code will move the corresponding gpio-tegra device to the end of
> > > > the dpm_list, but by the time the gpio-keys device, depending on the
> > > > gpio-tegra device, is probed, gpio-tegra has already been reprobed, so
> > > > the gpio-keys device is not moved to the end of dpm_list itself. As a
> > > > result, the suspend ordering becomes pinctrl-tegra -> gpio-keys ->
> > > > gpio-tegra. That's problematic because the gpio-keys driver requests
> > > > the power key to be a wakeup source. However, the programming of the
> > > > wakeup interrupt registers happens in the gpio-tegra driver's suspend
> > > > callback, which is now called before that of the gpio-keys driver. The
> > > > result is that the wrong values are programmed and leaves the system
> > > > unable to be resumed using the power key.
> > > > 
> > > > To fix this situation, always move devices to the end of the dpm_list
> > > > before probing them. Technically this should only be done for devices
> > > > that have been successfully probed, but that won't work for recursive
> > > > probing of devices (think an I2C master that instantiates children in
> > > > its ->probe()). Effectively the dpm_list will end up ordered the same
> > > > way that devices were probed, hence taking care of dependencies.
> 
> I'm not worried about the overhead involved in moving a device to the 
> end of the list every time it is probed.  That ought to be relatively 
> small.
> 
> There are a few things to watch out for.  Since the dpm_list gets
> modified during system sleep transitions, we would have to make sure
> that nothing gets probed during those times.  In principle, that's what
> the "prepare" stage is meant for, but there's still a race.  As long as
> no other kernel thread (such as the deferred probing mechanism) tries
> to probe a device once everything has been frozen, we should be okay.
> But if not, there will be trouble -- after the ->prepare callback runs, 
> the device is no longer on the dpm_list and so we don't want this patch 
> to put it back on that list.

Perhaps moving to the end of the list needs to be a little smarter. That
is it could check whether the device has been prepared for suspension or
not and only move when it hasn't?

Then again, shouldn't the core even prohibit new probes once the suspend
has been triggered? Sounds like asking for a lot of trouble if it didn't
...

> There's also an issue about other types of dependencies.  For instance,
> it's conceivable that device B might be discovered and depend on device
> A, even before A has been bound to a driver.  (B might be discovered by 
> A's subsystem rather than A's driver.)  In that case, moving A to the 
> end of the list would cause B to come before A even though B depends on 
> A.  Of course, deferred probing already has this problem.

But that's exactly the problem that I'm seeing. B isn't discovered by
A's subsystem, but the type of dependency is the same. A in this case
would be the GPIO controller and B the gpio-keys device. B clearly
depends on A, but deferred probe currently moves A to the end of the
list but not A, hence why the problem occurs.

That's also a problem that I think this patch solves. By moving every
device to the end of the list before it is probed we ensure that the
dpm_list is ordered in the same way as the probe order. For this to work
the precondition of course is that drivers know about the dependencies
and will defer probe if necessary.

> An easy way to check for this sort of thing would be to verify that a 
> device about to be probed doesn't have any children.  This wouldn't 
> catch all the potential dependencies, but it would be a reasonable 
> start.  Do we currently check that after a device has been unbound, it 
> doesn't have any remaining children?  We should do that too.

I think that would cover the other half of the cases where deferred
probe isn't implemented because drivers are written with the assumption
that children will be instantiated after their parent has been probed.

But it won't catch all the cases where there is no parent-child
relationship between the depender and the dependee.

Thierry
Alan Stern Sept. 15, 2015, 7:18 p.m. UTC | #13
On Tue, 15 Sep 2015, Thierry Reding wrote:

> > There are a few things to watch out for.  Since the dpm_list gets
> > modified during system sleep transitions, we would have to make sure
> > that nothing gets probed during those times.  In principle, that's what
> > the "prepare" stage is meant for, but there's still a race.  As long as
> > no other kernel thread (such as the deferred probing mechanism) tries
> > to probe a device once everything has been frozen, we should be okay.
> > But if not, there will be trouble -- after the ->prepare callback runs, 
> > the device is no longer on the dpm_list and so we don't want this patch 
> > to put it back on that list.
> 
> Perhaps moving to the end of the list needs to be a little smarter. That
> is it could check whether the device has been prepared for suspension or
> not and only move when it hasn't?

Maybe.  But doesn't that mean it won't solve your problem completely?

> Then again, shouldn't the core even prohibit new probes once the suspend
> has been triggered? Sounds like asking for a lot of trouble if it didn't
> ...

The core prohibits new devices from being registered.  It does not 
prohibit probes of existing devices, because they currently do not 
affect the dpm_list.

In general, we rely on subsystems not to do any probing once a device 
is suspended.  It's probably reasonable to ask them not to do any 
probing once a device has gone through the "prepare" stage.

> > There's also an issue about other types of dependencies.  For instance,
> > it's conceivable that device B might be discovered and depend on device
> > A, even before A has been bound to a driver.  (B might be discovered by 
> > A's subsystem rather than A's driver.)  In that case, moving A to the 
> > end of the list would cause B to come before A even though B depends on 
> > A.  Of course, deferred probing already has this problem.
> 
> But that's exactly the problem that I'm seeing.

Not quite.

>  B isn't discovered by
> A's subsystem, but the type of dependency is the same. A in this case
> would be the GPIO controller and B the gpio-keys device. B clearly
> depends on A, but deferred probe currently moves A to the end of the
> list but not A, hence why the problem occurs.

The difference is that in my example, B can be probed before A.  In
your case it can't.  Therefore the patch works for your case but not
for mine.

> That's also a problem that I think this patch solves. By moving every
> device to the end of the list before it is probed we ensure that the
> dpm_list is ordered in the same way as the probe order. For this to work
> the precondition of course is that drivers know about the dependencies
> and will defer probe if necessary.

Do I understand correctly?  You're saying a driver must defer a probe
if the device it's probing depends on another device which hasn't
been bound yet.  That does not sound like a reasonable sort of
requirement -- we might know about the dependency but we shouldn't have
to check whether the prerequisite device has been bound.

> > An easy way to check for this sort of thing would be to verify that a 
> > device about to be probed doesn't have any children.  This wouldn't 
> > catch all the potential dependencies, but it would be a reasonable 
> > start.  Do we currently check that after a device has been unbound, it 
> > doesn't have any remaining children?  We should do that too.
> 
> I think that would cover the other half of the cases where deferred
> probe isn't implemented because drivers are written with the assumption
> that children will be instantiated after their parent has been probed.
> 
> But it won't catch all the cases where there is no parent-child
> relationship between the depender and the dependee.

No, it won't catch all cases.  But the cases it does catch are ones 
where your patch is likely to cause trouble, so it would be good to 
know about them.

Alan Stern

--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Rafael J. Wysocki Sept. 16, 2015, 1:04 a.m. UTC | #14
On Tuesday, September 15, 2015 10:23:07 AM Alan Stern wrote:
> On Tue, 15 Sep 2015, Rafael J. Wysocki wrote:
> 
> > Hi Alan,
> 
> Hi.
> 
> > On Sat, Sep 12, 2015 at 7:40 PM, Alan Stern <stern@rowland.harvard.edu> wrote:
> > > On Sat, 12 Sep 2015, Rafael J. Wysocki wrote:
> > >
> > >> On Friday, September 11, 2015 03:01:14 PM Alan Stern wrote:
> > >
> > >> > There's also an issue about other types of dependencies.  For instance,
> > >> > it's conceivable that device B might be discovered and depend on device
> > >> > A, even before A has been bound to a driver.  (B might be discovered by
> > >> > A's subsystem rather than A's driver.)  In that case, moving A to the
> > >> > end of the list would cause B to come before A even though B depends on
> > >> > A.  Of course, deferred probing already has this problem.
> > >>
> > >> That may actually happen for PCIe ports and devices below them AFAICS.
> > >>
> > >> Devices below PCIe ports are discovered by the PCI subsystem and the PCIe
> > >> ports need not be probed before those devices are probed.
> > >
> > > Is it possible to change this?  Make it so that devices below PCIe
> > > ports are discovered by the port driver rather than by the PCI
> > > subsystem?  Or would that be far too difficult?
> > 
> > I don't think it would be really useful.
> > 
> > PCIe ports are PCI bridges from the resource allocation and basic
> > functionality perspective, so they are handled accordingly.
> > 
> > The PCIe port driver provides additional services (such as PME/hotplug
> > interrupt handling and advanced error reporting) that aren't necessary
> > for probing devices below the ports.
> > 
> > I guess the ordering of PCIe ports probing might be changed to happen
> > at the "right" time, but care needs to be taken in that case too.
> 
> Does suspending a PCIe port do anything significant?  In particular, 
> does it cut off normal communication with anything attached through 
> the port?

No, it doesn't.

> If it does then we better not move the port device to the end of the 
> dpm_list after any descendant devices have been probed.
> 
> > >> > An easy way to check for this sort of thing would be to verify that a
> > >> > device about to be probed doesn't have any children.  This wouldn't
> > >> > catch all the potential dependencies, but it would be a reasonable
> > >> > start.
> > >>
> > >> That would address the PCIe ports issue.
> > >
> > > Well, it would _detect_ the PCIe ports issue.
> > 
> > That's what I meant. :-)
> > 
> > >  It might also detect other things.
> > 
> > Right.
> 
> All right, I'll try writing a test patch to report these things.

OK, thanks!

Rafael

--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Rafael J. Wysocki Sept. 16, 2015, 1:18 a.m. UTC | #15
On Tuesday, September 15, 2015 05:53:02 PM Thierry Reding wrote:
> On Sat, Sep 12, 2015 at 12:38:19AM +0200, Rafael J. Wysocki wrote:
> > On Friday, September 11, 2015 02:03:46 PM Thierry Reding wrote:
> > > On Fri, Sep 11, 2015 at 12:08:02AM +0200, Rafael J. Wysocki wrote:
> > > > On Thursday, September 10, 2015 12:19:03 PM Thierry Reding wrote:
> > > > > From: Thierry Reding <treding@nvidia.com>
> > > > >=3D20
> > > > > Deferred probe can lead to strange situations where a device that i=
> s a
> > > > > dependency for others will be moved to the end of the dpm_list. At =
> the
> > > > > same time the dependers may not be moved because at the time they w=
> ill
> > > > > be probed the dependee may already have been successfully reprobed =
> and
> > > > > they will not have to defer the probe themselves.
> > > >=3D20
> > > > So there's a bug in the implementation of deferred probing IMO.
> > >=20
> > > Well, yeah. The root problem here is that we don't have dependency
> > > information and deferred probing is supposed to fix that. It does so
> > > fairly well, but it breaks in this particular case.
> > >=20
> > > > > One example where this happens is the Jetson TK1 board (Tegra124). =
> The
> > > > > gpio-keys driver exposes the power key of the board as an input dev=
> ice
> > > > > that can also be used as a wakeup source. Commit 17cdddf0fb68 ("ARM:
> > > > > tegra: Add gpio-ranges property") results in the gpio-tegra driver
> > > > > deferring probe because one of its dependencies, the pinctrl-tegra
> > > > > driver, has not successfully completed probing. Currently the defer=
> red
> > > > > probe code will move the corresponding gpio-tegra device to the end=
>  of
> > > > > the dpm_list, but by the time the gpio-keys device, depending on the
> > > > > gpio-tegra device, is probed, gpio-tegra has already been reprobed,=
>  so
> > > > > the gpio-keys device is not moved to the end of dpm_list itself.
> >=20
> > At this point, when checking its dependencies, gpio-keys should also check
> > if their ordering with respect to it in dpm_list is correct and move itse=
> lf
> > to the end of it otherwise.
> >=20
> > There might be a helper for that I suppose.
> 
> But that's essentially the same thing as what this patch proposes,
> except that every driver now needs to call this helper, rather than
> having the core take care of it.

Well, not quite.  Not "every driver", but "every driver with dependencies
the driver core doesn't know about".  That's quite a difference in my view.
 
[...]

> > I'm always cautious about changes that make the core do something for eve=
> ry
> > device/driver, because they are very likely to step on corner cases that =
> we
> > don't need to worry about otherwise.
> 
> To me this seems more like a fundamental issue that should be fixed at
> the root rather than be fixed up case by case as we find them. Keeping
> the suspend/resume order the same as probe order sounds like the most
> logical thing to do. I grant you that there could be cases where this
> might break, but then I'd consider those cases to be broken rather than
> the other way around.

We have to disagree, then.

> With the current situation potentially every user of GPIOs (and the same
> is true for other types of resources) is broken, though we may not
> notice (immediately). In fact it was mostly by coincidence that I
> noticed in this case. The GPIO key works perfectly fine for regular use,
> it just doesn't work as a wakeup source anymore. That's not something
> that people test very frequently, hence could've gone unnoticed for much
> longer.
> 
> Of course reordering the dpm_list to follow the probe order has the
> potential to break other setups in similarily subtle ways, so I do
> understand your reluctance.
> 
> Perhaps it would help if we put this patch into some boot farm to get
> more coverage, maybe that would give us a better picture for how
> invasive the change is, or how bad the fallout could be?

I'd actually prefer to have a patch that I don't have to worry about
even in theory.

What about an opt-in mechanism for things that are likely to need this
stuff instead of imposing it on everybody unconditionally?

> I can of course go and come up with a more ad-hoc solution that fixes
> the issue for this particular use-case, but I'd like to avoid a
> situation where we end up having to patch up drivers one by one, when
> we could have a solution that works in the general case.
> 
> Also note that a recent commit (52cdbdd49853 "driver core: correct
> device's shutdown order") patch already made an equivalent change to the
> shutdown order. I'd expect that most devices would fail with that patch
> already if ordering is a real problem. Of course shutdown is a one-way
> ticket, so failure might not be as relevant as for suspend/resume.

Did you forget about the other Alan's concerns regarding probing devices
during suspend/resume?

> Grygorii, Greg: have you heard of any fallout caused by the above patch
> yet?

Did that patch manipulate dpm_list?

Thanks,
Rafael

--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Rafael J. Wysocki Sept. 16, 2015, 1:28 a.m. UTC | #16
On Tuesday, September 15, 2015 03:18:19 PM Alan Stern wrote:
> On Tue, 15 Sep 2015, Thierry Reding wrote:
> 
> > > There are a few things to watch out for.  Since the dpm_list gets
> > > modified during system sleep transitions, we would have to make sure
> > > that nothing gets probed during those times.  In principle, that's what
> > > the "prepare" stage is meant for, but there's still a race.  As long as
> > > no other kernel thread (such as the deferred probing mechanism) tries
> > > to probe a device once everything has been frozen, we should be okay.
> > > But if not, there will be trouble -- after the ->prepare callback runs, 
> > > the device is no longer on the dpm_list and so we don't want this patch 
> > > to put it back on that list.
> > 
> > Perhaps moving to the end of the list needs to be a little smarter. That
> > is it could check whether the device has been prepared for suspension or
> > not and only move when it hasn't?
> 
> Maybe.  But doesn't that mean it won't solve your problem completely?
> 
> > Then again, shouldn't the core even prohibit new probes once the suspend
> > has been triggered? Sounds like asking for a lot of trouble if it didn't
> > ...
> 
> The core prohibits new devices from being registered.  It does not 
> prohibit probes of existing devices, because they currently do not 
> affect the dpm_list.

Which may be a mistake, because it does affect callbacks executed during
suspend/resume (after successful probe the device potentially has a different
set of PM callbacks than before).

> In general, we rely on subsystems not to do any probing once a device 
> is suspended.  It's probably reasonable to ask them not to do any 
> probing once a device has gone through the "prepare" stage.

Right.

Question is when it should be allowed to probe again.  I guess at the same
time we allow registrations to to take place again?

Thanks,
Rafael

--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Thierry Reding Sept. 16, 2015, 1:16 p.m. UTC | #17
On Tue, Sep 15, 2015 at 03:18:19PM -0400, Alan Stern wrote:
> On Tue, 15 Sep 2015, Thierry Reding wrote:
> 
> > > There are a few things to watch out for.  Since the dpm_list gets
> > > modified during system sleep transitions, we would have to make sure
> > > that nothing gets probed during those times.  In principle, that's what
> > > the "prepare" stage is meant for, but there's still a race.  As long as
> > > no other kernel thread (such as the deferred probing mechanism) tries
> > > to probe a device once everything has been frozen, we should be okay.
> > > But if not, there will be trouble -- after the ->prepare callback runs, 
> > > the device is no longer on the dpm_list and so we don't want this patch 
> > > to put it back on that list.
> > 
> > Perhaps moving to the end of the list needs to be a little smarter. That
> > is it could check whether the device has been prepared for suspension or
> > not and only move when it hasn't?
> 
> Maybe.  But doesn't that mean it won't solve your problem completely?

It would solve the problem completely if probing was prohibited during
the suspend/resume cycle. But if that were true there'd be no need to
special-case in the first place.

> > Then again, shouldn't the core even prohibit new probes once the suspend
> > has been triggered? Sounds like asking for a lot of trouble if it didn't
> > ...
> 
> The core prohibits new devices from being registered.  It does not 
> prohibit probes of existing devices, because they currently do not 
> affect the dpm_list.

My understanding was that the core was guaranteed not to call suspend or
resume callbacks for devices that haven't completed probe. At least I've
never seen any driver code specifically check in their suspend or resume
callbacks that they've been probed successfully. Allowing probes while a
suspend is happening sounds racy.

> In general, we rely on subsystems not to do any probing once a device 
> is suspended.  It's probably reasonable to ask them not to do any 
> probing once a device has gone through the "prepare" stage.

Perhaps the reason why we seem to be talking across purposes is that I
haven't thought much about devices where the bus does all the heavy
lifting. So suspending the device from a bus' perspective makes sense
even if the device hasn't been bound.

And yes, I agree that preventing a probe for a device that has been
prepared for suspension sounds like a very reasonable thing to do.

> > > There's also an issue about other types of dependencies.  For instance,
> > > it's conceivable that device B might be discovered and depend on device
> > > A, even before A has been bound to a driver.  (B might be discovered by 
> > > A's subsystem rather than A's driver.)  In that case, moving A to the 
> > > end of the list would cause B to come before A even though B depends on 
> > > A.  Of course, deferred probing already has this problem.
> > 
> > But that's exactly the problem that I'm seeing.
> 
> Not quite.
> 
> >  B isn't discovered by
> > A's subsystem, but the type of dependency is the same. A in this case
> > would be the GPIO controller and B the gpio-keys device. B clearly
> > depends on A, but deferred probe currently moves A to the end of the
> > list but not A, hence why the problem occurs.
> 
> The difference is that in my example, B can be probed before A.  In
> your case it can't.  Therefore the patch works for your case but not
> for mine.

How would that even work in practice? Essentially you have a dependency
between two devices and no way of guaranteeing any ordering. Either the
dependency is completely optional, in which case the ordering of the
dpm_list must be irrelevant to the interaction, or the drivers make too
many assumptions and it is only working by accident.

> > That's also a problem that I think this patch solves. By moving every
> > device to the end of the list before it is probed we ensure that the
> > dpm_list is ordered in the same way as the probe order. For this to work
> > the precondition of course is that drivers know about the dependencies
> > and will defer probe if necessary.
> 
> Do I understand correctly?  You're saying a driver must defer a probe
> if the device it's probing depends on another device which hasn't
> been bound yet.  That does not sound like a reasonable sort of
> requirement -- we might know about the dependency but we shouldn't have
> to check whether the prerequisite device has been bound.

I guess that depends on the kind of dependency you have. For most cases
that I'm aware of the dependencies are required dependencies. That is, a
consumer uses a resource registered by a provider. The provider will
register the resource at probe time, so if the consumer is probed before
the provider, then the resource it needs isn't there. For a required
dependency that implies that the consumer must defer probe until the
provider has been bound because it simply can't continue without getting
access to the resource first.

I'm slightly confused by your statement. If a consumer depends on a
provider for a resource, how can we finish the consumer's probe without
checking that the provider has been bound? It's true that we don't
technically check for the device to have been bound, but the end result
is the same.

Unless I misunderstand what you're saying we'd need to have some
mechanism to notify the consumer (after it's been probed) that the
provider of it's resource has become available.

Thierry
Grygorii Strashko Sept. 16, 2015, 1:38 p.m. UTC | #18
On 09/16/2015 04:18 AM, Rafael J. Wysocki wrote:
> On Tuesday, September 15, 2015 05:53:02 PM Thierry Reding wrote:
>> On Sat, Sep 12, 2015 at 12:38:19AM +0200, Rafael J. Wysocki wrote:
>>> On Friday, September 11, 2015 02:03:46 PM Thierry Reding wrote:
>>>> On Fri, Sep 11, 2015 at 12:08:02AM +0200, Rafael J. Wysocki wrote:
>>>>> On Thursday, September 10, 2015 12:19:03 PM Thierry Reding wrote:
>>>>>> From: Thierry Reding <treding@nvidia.com>
>>>>>> =3D20
>>>>>> Deferred probe can lead to strange situations where a device that i=
>> s a
>>>>>> dependency for others will be moved to the end of the dpm_list. At =
>> the
>>>>>> same time the dependers may not be moved because at the time they w=
>> ill
>>>>>> be probed the dependee may already have been successfully reprobed =
>> and
>>>>>> they will not have to defer the probe themselves.
>>>>> =3D20
>>>>> So there's a bug in the implementation of deferred probing IMO.
>>>> =20
>>>> Well, yeah. The root problem here is that we don't have dependency
>>>> information and deferred probing is supposed to fix that. It does so
>>>> fairly well, but it breaks in this particular case.
>>>> =20
>>>>>> One example where this happens is the Jetson TK1 board (Tegra124). =
>> The
>>>>>> gpio-keys driver exposes the power key of the board as an input dev=
>> ice
>>>>>> that can also be used as a wakeup source. Commit 17cdddf0fb68 ("ARM:
>>>>>> tegra: Add gpio-ranges property") results in the gpio-tegra driver
>>>>>> deferring probe because one of its dependencies, the pinctrl-tegra
>>>>>> driver, has not successfully completed probing. Currently the defer=
>> red
>>>>>> probe code will move the corresponding gpio-tegra device to the end=
>>   of
>>>>>> the dpm_list, but by the time the gpio-keys device, depending on the
>>>>>> gpio-tegra device, is probed, gpio-tegra has already been reprobed,=
>>   so
>>>>>> the gpio-keys device is not moved to the end of dpm_list itself.
>>> =20
>>> At this point, when checking its dependencies, gpio-keys should also check
>>> if their ordering with respect to it in dpm_list is correct and move itse=
>> lf
>>> to the end of it otherwise.
>>> =20
>>> There might be a helper for that I suppose.
>>
>> But that's essentially the same thing as what this patch proposes,
>> except that every driver now needs to call this helper, rather than
>> having the core take care of it.
>
> Well, not quite.  Not "every driver", but "every driver with dependencies
> the driver core doesn't know about".  That's quite a difference in my view.
>
> [...]
>
>>> I'm always cautious about changes that make the core do something for eve=
>> ry
>>> device/driver, because they are very likely to step on corner cases that =
>> we
>>> don't need to worry about otherwise.
>>
>> To me this seems more like a fundamental issue that should be fixed at
>> the root rather than be fixed up case by case as we find them. Keeping
>> the suspend/resume order the same as probe order sounds like the most
>> logical thing to do. I grant you that there could be cases where this
>> might break, but then I'd consider those cases to be broken rather than
>> the other way around.
>
> We have to disagree, then.
>
>> With the current situation potentially every user of GPIOs (and the same
>> is true for other types of resources) is broken, though we may not
>> notice (immediately). In fact it was mostly by coincidence that I
>> noticed in this case. The GPIO key works perfectly fine for regular use,
>> it just doesn't work as a wakeup source anymore. That's not something
>> that people test very frequently, hence could've gone unnoticed for much
>> longer.
>>
>> Of course reordering the dpm_list to follow the probe order has the
>> potential to break other setups in similarily subtle ways, so I do
>> understand your reluctance.
>>
>> Perhaps it would help if we put this patch into some boot farm to get
>> more coverage, maybe that would give us a better picture for how
>> invasive the change is, or how bad the fallout could be?
>
> I'd actually prefer to have a patch that I don't have to worry about
> even in theory.
>
> What about an opt-in mechanism for things that are likely to need this
> stuff instead of imposing it on everybody unconditionally?
>
>> I can of course go and come up with a more ad-hoc solution that fixes
>> the issue for this particular use-case, but I'd like to avoid a
>> situation where we end up having to patch up drivers one by one, when
>> we could have a solution that works in the general case.
>>
>> Also note that a recent commit (52cdbdd49853 "driver core: correct
>> device's shutdown order") patch already made an equivalent change to the
>> shutdown order. I'd expect that most devices would fail with that patch
>> already if ordering is a real problem. Of course shutdown is a one-way
>> ticket, so failure might not be as relevant as for suspend/resume.

It was in linux-next for at about one month or more.

>
> Did you forget about the other Alan's concerns regarding probing devices
> during suspend/resume?
>
>> Grygorii, Greg: have you heard of any fallout caused by the above patch
>> yet?

no

>
> Did that patch manipulate dpm_list?

no.
Grygorii Strashko Sept. 16, 2015, 1:38 p.m. UTC | #19
On 09/16/2015 04:28 AM, Rafael J. Wysocki wrote:
> On Tuesday, September 15, 2015 03:18:19 PM Alan Stern wrote:
>> On Tue, 15 Sep 2015, Thierry Reding wrote:
>>
>>>> There are a few things to watch out for.  Since the dpm_list gets
>>>> modified during system sleep transitions, we would have to make sure
>>>> that nothing gets probed during those times.  In principle, that's what
>>>> the "prepare" stage is meant for, but there's still a race.  As long as
>>>> no other kernel thread (such as the deferred probing mechanism) tries
>>>> to probe a device once everything has been frozen, we should be okay.
>>>> But if not, there will be trouble -- after the ->prepare callback runs,
>>>> the device is no longer on the dpm_list and so we don't want this patch
>>>> to put it back on that list.
>>>
>>> Perhaps moving to the end of the list needs to be a little smarter. That
>>> is it could check whether the device has been prepared for suspension or
>>> not and only move when it hasn't?
>>
>> Maybe.  But doesn't that mean it won't solve your problem completely?
>>
>>> Then again, shouldn't the core even prohibit new probes once the suspend
>>> has been triggered? Sounds like asking for a lot of trouble if it didn't
>>> ...
>>
>> The core prohibits new devices from being registered.  It does not
>> prohibit probes of existing devices, because they currently do not
>> affect the dpm_list.

Seems I missed smth, but I can't find the place in Kernel that prohibits
creation of new devices during suspend.

Could someone point me on, please?
Grygorii Strashko Sept. 16, 2015, 1:39 p.m. UTC | #20
Hi Thierry, Alan, Rafael,

On 09/12/2015 01:28 AM, Rafael J. Wysocki wrote:
> On Friday, September 11, 2015 03:01:14 PM Alan Stern wrote:
>> On Fri, 11 Sep 2015, Thierry Reding wrote:
>>
>>> On Fri, Sep 11, 2015 at 12:08:02AM +0200, Rafael J. Wysocki wrote:
>>>> On Thursday, September 10, 2015 12:19:03 PM Thierry Reding wrote:
>>>>> From: Thierry Reding <treding@nvidia.com>
>>>>>
>>>>> Deferred probe can lead to strange situations where a device that is a
>>>>> dependency for others will be moved to the end of the dpm_list. At the
>>>>> same time the dependers may not be moved because at the time they will
>>>>> be probed the dependee may already have been successfully reprobed and
>>>>> they will not have to defer the probe themselves.
>>>>
>>>> So there's a bug in the implementation of deferred probing IMO.
>>>
>>> Well, yeah. The root problem here is that we don't have dependency
>>> information and deferred probing is supposed to fix that. It does so
>>> fairly well, but it breaks in this particular case.

Actually this is the old problem and deferred probing just makes it more
reproducible.

Lets say, we have device P-producer and device C-consumer.
- devices are created/registered in the following order (It doesn't matter
how they are created - dt, platform,..)
-- device C created
-- device P created

- devices probing order (legacy probing order depends on Makefiles and
initcalls - no deferred probing)
-- device P probed
-- device C probed

- suspend order
-- device P suspended
-- device C suspended (ops, device C may still need device P)

To W/A such issues:
- driver's suspend code can be shifted between suspend layers (from
.suspend() to .suspend_noirq(), for example)
- device's registration order can be changed manually in DT or platform code
- can play with initcalls

May be it was enough before, but now it's not :(, simply because, now
there are much more generic/common frameworks in Kernel:
gpio, regulators, dma ++ clk, syscon, phy, extcon, component, display framework..
As result, It introduces more devices and dependencies between devices and,
therefore, suspend ordering issue can be hit more often.

>>>
>>>>> One example where this happens is the Jetson TK1 board (Tegra124). The
>>>>> gpio-keys driver exposes the power key of the board as an input device
>>>>> that can also be used as a wakeup source. Commit 17cdddf0fb68 ("ARM:
>>>>> tegra: Add gpio-ranges property") results in the gpio-tegra driver
>>>>> deferring probe because one of its dependencies, the pinctrl-tegra
>>>>> driver, has not successfully completed probing. Currently the deferred
>>>>> probe code will move the corresponding gpio-tegra device to the end of
>>>>> the dpm_list, but by the time the gpio-keys device, depending on the
>>>>> gpio-tegra device, is probed, gpio-tegra has already been reprobed, so
>>>>> the gpio-keys device is not moved to the end of dpm_list itself. As a
>>>>> result, the suspend ordering becomes pinctrl-tegra -> gpio-keys ->
>>>>> gpio-tegra. That's problematic because the gpio-keys driver requests
>>>>> the power key to be a wakeup source. However, the programming of the
>>>>> wakeup interrupt registers happens in the gpio-tegra driver's suspend
>>>>> callback, which is now called before that of the gpio-keys driver. The
>>>>> result is that the wrong values are programmed and leaves the system
>>>>> unable to be resumed using the power key.
>>>>>
>>>>> To fix this situation, always move devices to the end of the dpm_list
>>>>> before probing them. Technically this should only be done for devices
>>>>> that have been successfully probed, but that won't work for recursive
>>>>> probing of devices (think an I2C master that instantiates children in
>>>>> its ->probe()). Effectively the dpm_list will end up ordered the same
>>>>> way that devices were probed, hence taking care of dependencies.
>>
>> I'm not worried about the overhead involved in moving a device to the
>> end of the list every time it is probed.  That ought to be relatively
>> small.
>>
>> There are a few things to watch out for.  Since the dpm_list gets
>> modified during system sleep transitions, we would have to make sure
>> that nothing gets probed during those times.  In principle, that's what
>> the "prepare" stage is meant for, but there's still a race.  As long as
>> no other kernel thread (such as the deferred probing mechanism) tries
>> to probe a device once everything has been frozen, we should be okay.
>> But if not, there will be trouble -- after the ->prepare callback runs,
>> the device is no longer on the dpm_list and so we don't want this patch
>> to put it back on that list.
> 

I think, It should prohibited to probe devices during suspend/hibernation.
And solution introduced in this patch might help to fix it -
in general, we could do :
- add sync point on suspend enter: wait_for_device_probe() and
- prohibit probing: move all devices which will request probing into
deferred_probe list
- one suspend exit: allow probing and do driver_deferred_probe_trigger

> 
>> There's also an issue about other types of dependencies.  For instance,
>> it's conceivable that device B might be discovered and depend on device
>> A, even before A has been bound to a driver.  (B might be discovered by
>> A's subsystem rather than A's driver.)  In that case, moving A to the
>> end of the list would cause B to come before A even though B depends on
>> A.  Of course, deferred probing already has this problem.
> 
> That may actually happen for PCIe ports and devices below them AFAICS.
> 
> Devices below PCIe ports are discovered by the PCI subsystem and the PCIe
> ports need not be probed before those devices are probed.

I'd like to mention here that this patch will work only
if dmp_list will be filled according device creation order ("parent<-child" dependencies)
*AND* according device's probing order ("supplier<-consumer").
So, if there is the case when Parent device can be probed AFTER its children
- it will not work, because "parent<-child" dependencies will not be tracked
any more :( Sry, I could not even imagine that such crazy case exist :'(

Are there any other subsystems with the same behavior like PCI?
If not - probably, it could be fixed in PCI subsystem using device_pm_move_after() or 
device_move() in PCIe ports probe.
if yes - ... maybe we can scan/re-check and reorder dpm_list on suspend enter and
restore ("parent<-child" dependencies).

> 
>> An easy way to check for this sort of thing would be to verify that a
>> device about to be probed doesn't have any children.  This wouldn't
>> catch all the potential dependencies, but it would be a reasonable
>> start.
> 
> That would address the PCIe ports issue.
> 

that might work also.

Truth is that smth. need to be done 100%. Personally, I was hit by this issue also,
 and it cost me 3 hours of debugging and I came up with the same patch as
Bill Huang, then spent some time trying to understand what is wrong with PCI
- finally, I've just changed the order of my devices in DT :)

Also, I think, it will be good to have this patch in -next to collect more feedbacks.
Alan Stern Sept. 16, 2015, 4:58 p.m. UTC | #21
On Wed, 16 Sep 2015, Grygorii Strashko wrote:

> >> The core prohibits new devices from being registered.  It does not
> >> prohibit probes of existing devices, because they currently do not
> >> affect the dpm_list.
> 
> Seems I missed smth, but I can't find the place in Kernel that prohibits
> creation of new devices during suspend.
> 
> Could someone point me on, please?

In Documentation/power/devices.txt, there is a section describing the
various phases of system suspend.  The part about the "prepare" phase
says:

    1.	The prepare phase is meant to prevent races by preventing new devices
	from being registered; the PM core would never know that all the
	children of a device had been suspended if new children could be
	registered at will.  (By contrast, devices may be unregistered at any
	time.)  Unlike the other suspend-related phases, during the prepare
	phase the device tree is traversed top-down.

	After the prepare callback method returns, no new children may be
	registered below the device.  The method may also prepare the device or
	driver in some way for the upcoming system power transition, but it
	should not put the device into a low-power state.

Alan Stern

--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Alan Stern Sept. 16, 2015, 5:08 p.m. UTC | #22
On Wed, 16 Sep 2015, Rafael J. Wysocki wrote:

> > The core prohibits new devices from being registered.  It does not 
> > prohibit probes of existing devices, because they currently do not 
> > affect the dpm_list.
> 
> Which may be a mistake, because it does affect callbacks executed during
> suspend/resume (after successful probe the device potentially has a different
> set of PM callbacks than before).
> 
> > In general, we rely on subsystems not to do any probing once a device 
> > is suspended.  It's probably reasonable to ask them not to do any 
> > probing once a device has gone through the "prepare" stage.
> 
> Right.
> 
> Question is when it should be allowed to probe again.  I guess at the same
> time we allow registrations to to take place again?

That would make the most sense.  Particularly since registration 
automatically causes probing to occur.

Now, we do allow registration below a device as soon as the ->resume
callback returns, whereas devices don't get added back to the dpm_list
until after the "complete" phase is totally finished.  If a
previously-existing device gets probed in between, it would be moved 
off the dpm_prepared_list before its ->complete callback was invoked, 
which means its ->complete wouldn't get called at all.

Perhaps this would be okay; it depends on the subsystem.

Alan Stern

--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Alan Stern Sept. 16, 2015, 7:22 p.m. UTC | #23
On Wed, 16 Sep 2015, Thierry Reding wrote:

> > > Perhaps moving to the end of the list needs to be a little smarter. That
> > > is it could check whether the device has been prepared for suspension or
> > > not and only move when it hasn't?
> > 
> > Maybe.  But doesn't that mean it won't solve your problem completely?
> 
> It would solve the problem completely if probing was prohibited during
> the suspend/resume cycle. But if that were true there'd be no need to
> special-case in the first place.

It's possible that this approach will help.  We'll see.

It would also help if your patch checked to see if the device has any 
children, and avoided moving it to the end of the list if it does.  In 
fact, that might be sufficient to avoid almost all problems.

> > > Then again, shouldn't the core even prohibit new probes once the suspend
> > > has been triggered? Sounds like asking for a lot of trouble if it didn't
> > > ...
> > 
> > The core prohibits new devices from being registered.  It does not 
> > prohibit probes of existing devices, because they currently do not 
> > affect the dpm_list.
> 
> My understanding was that the core was guaranteed not to call suspend or
> resume callbacks for devices that haven't completed probe.

No, that's not so.  The core invokes suspend and resume callbacks for 
all registered devices.

>  At least I've
> never seen any driver code specifically check in their suspend or resume
> callbacks that they've been probed successfully.

Now that wouldn't make any sense, would it?  A driver's suspend or
resume callback wouldn't be invoked in the first place unless the
driver had already been probed for that device.  So there's no point in
checking whether the probe has occurred.

>  Allowing probes while a
> suspend is happening sounds racy.

Yes, it does.  It definitely isn't a good idea.

During a sleep transition all user threads are frozen.  So are some
kernel threads, but not all of them.  It's possible that one of the
running kernel threads might want to do a probe -- something in a
workqueue, for example.  That's the only way it could happen.

> > In general, we rely on subsystems not to do any probing once a device 
> > is suspended.  It's probably reasonable to ask them not to do any 
> > probing once a device has gone through the "prepare" stage.
> 
> Perhaps the reason why we seem to be talking across purposes is that I
> haven't thought much about devices where the bus does all the heavy
> lifting. So suspending the device from a bus' perspective makes sense
> even if the device hasn't been bound.

Yes.

> And yes, I agree that preventing a probe for a device that has been
> prepared for suspension sounds like a very reasonable thing to do.
> 
> > > > There's also an issue about other types of dependencies.  For instance,
> > > > it's conceivable that device B might be discovered and depend on device
> > > > A, even before A has been bound to a driver.  (B might be discovered by 
> > > > A's subsystem rather than A's driver.)  In that case, moving A to the 
> > > > end of the list would cause B to come before A even though B depends on 
> > > > A.  Of course, deferred probing already has this problem.
> > > 
> > > But that's exactly the problem that I'm seeing.
> > 
> > Not quite.
> > 
> > >  B isn't discovered by
> > > A's subsystem, but the type of dependency is the same. A in this case
> > > would be the GPIO controller and B the gpio-keys device. B clearly
> > > depends on A, but deferred probe currently moves A to the end of the
> > > list but not A, hence why the problem occurs.
> > 
> > The difference is that in my example, B can be probed before A.  In
> > your case it can't.  Therefore the patch works for your case but not
> > for mine.
> 
> How would that even work in practice? Essentially you have a dependency
> between two devices and no way of guaranteeing any ordering. Either the
> dependency is completely optional, in which case the ordering of the
> dpm_list must be irrelevant to the interaction, or the drivers make too
> many assumptions and it is only working by accident.

Rafael gave a good example.  A device behind a PCIe port can't be
detected until the port has been registered, so the port will always
get on the dpm_list before the other device does.  But both devices can
be added before the port has been probed.

This dependency is not optional and it is guaranteed.  But it's not 
related to anything the drivers do.

> > > That's also a problem that I think this patch solves. By moving every
> > > device to the end of the list before it is probed we ensure that the
> > > dpm_list is ordered in the same way as the probe order. For this to work
> > > the precondition of course is that drivers know about the dependencies
> > > and will defer probe if necessary.
> > 
> > Do I understand correctly?  You're saying a driver must defer a probe
> > if the device it's probing depends on another device which hasn't
> > been bound yet.  That does not sound like a reasonable sort of
> > requirement -- we might know about the dependency but we shouldn't have
> > to check whether the prerequisite device has been bound.
> 
> I guess that depends on the kind of dependency you have. For most cases
> that I'm aware of the dependencies are required dependencies. That is, a
> consumer uses a resource registered by a provider. The provider will
> register the resource at probe time, so if the consumer is probed before
> the provider, then the resource it needs isn't there. For a required
> dependency that implies that the consumer must defer probe until the
> provider has been bound because it simply can't continue without getting
> access to the resource first.

Okay.  It would be a good idea to restrict your patch to handle only 
that particular kind of dependency, if you can think of any way to do 
this.

> I'm slightly confused by your statement. If a consumer depends on a
> provider for a resource, how can we finish the consumer's probe without
> checking that the provider has been bound? It's true that we don't
> technically check for the device to have been bound, but the end result
> is the same.
> 
> Unless I misunderstand what you're saying we'd need to have some
> mechanism to notify the consumer (after it's been probed) that the
> provider of it's resource has become available.

You misunderstand me.  Yes, I agree that your patch handles the case 
where a consumer depends on a provider for a resource.  But it doesn't 
handle the case where one device depends on another for something other 
than a resource (or perhaps for a resource that can be provided even in 
the absence of a driver).

Alan Stern

--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Alan Stern Sept. 16, 2015, 7:27 p.m. UTC | #24
On Wed, 16 Sep 2015, Grygorii Strashko wrote:

> I think, It should prohibited to probe devices during suspend/hibernation.
> And solution introduced in this patch might help to fix it -
> in general, we could do :
> - add sync point on suspend enter: wait_for_device_probe() and
> - prohibit probing: move all devices which will request probing into
> deferred_probe list
> - one suspend exit: allow probing and do driver_deferred_probe_trigger

That could work; it's a good idea.

> I'd like to mention here that this patch will work only
> if dmp_list will be filled according device creation order ("parent<-child" dependencies)
> *AND* according device's probing order ("supplier<-consumer").
> So, if there is the case when Parent device can be probed AFTER its children
> - it will not work, because "parent<-child" dependencies will not be tracked
> any more :( Sry, I could not even imagine that such crazy case exist :'(

If we avoid moving devices to the end of the dpm_list when they already 
have children, then we should be okay, right?

> Are there any other subsystems with the same behavior like PCI?

I don't know.

> If not - probably, it could be fixed in PCI subsystem using device_pm_move_after() or 
> device_move() in PCIe ports probe.
> if yes - ... maybe we can scan/re-check and reorder dpm_list on suspend enter and
> restore ("parent<-child" dependencies).

> Truth is that smth. need to be done 100%. Personally, I was hit by this issue also,
>  and it cost me 3 hours of debugging and I came up with the same patch as
> Bill Huang, then spent some time trying to understand what is wrong with PCI
> - finally, I've just changed the order of my devices in DT :)
> 
> Also, I think, it will be good to have this patch in -next to collect more feedbacks.

I like the idea of forcing all probes during a sleep transition to be 
deferred.  We could carry them out just before unfreezing the user 
threads.  That combined with the change mentioned above ought to be 
worth testing.

Alan Stern

--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Rafael J. Wysocki Sept. 17, 2015, 12:07 a.m. UTC | #25
On Wednesday, September 16, 2015 03:27:55 PM Alan Stern wrote:
> On Wed, 16 Sep 2015, Grygorii Strashko wrote:
> 
> > I think, It should prohibited to probe devices during suspend/hibernation.
> > And solution introduced in this patch might help to fix it -
> > in general, we could do :
> > - add sync point on suspend enter: wait_for_device_probe() and
> > - prohibit probing: move all devices which will request probing into
> > deferred_probe list
> > - one suspend exit: allow probing and do driver_deferred_probe_trigger
> 
> That could work; it's a good idea.
> 
> > I'd like to mention here that this patch will work only
> > if dmp_list will be filled according device creation order ("parent<-child" dependencies)
> > *AND* according device's probing order ("supplier<-consumer").
> > So, if there is the case when Parent device can be probed AFTER its children
> > - it will not work, because "parent<-child" dependencies will not be tracked
> > any more :( Sry, I could not even imagine that such crazy case exist :'(
> 
> If we avoid moving devices to the end of the dpm_list when they already 
> have children, then we should be okay, right?
> 
> > Are there any other subsystems with the same behavior like PCI?
> 
> I don't know.
> 
> > If not - probably, it could be fixed in PCI subsystem using device_pm_move_after() or 
> > device_move() in PCIe ports probe.
> > if yes - ... maybe we can scan/re-check and reorder dpm_list on suspend enter and
> > restore ("parent<-child" dependencies).
> 
> > Truth is that smth. need to be done 100%. Personally, I was hit by this issue also,
> >  and it cost me 3 hours of debugging and I came up with the same patch as
> > Bill Huang, then spent some time trying to understand what is wrong with PCI
> > - finally, I've just changed the order of my devices in DT :)
> > 
> > Also, I think, it will be good to have this patch in -next to collect more feedbacks.
> 
> I like the idea of forcing all probes during a sleep transition to be 
> deferred.  We could carry them out just before unfreezing the user 
> threads.  That combined with the change mentioned above ought to be 
> worth testing.

Agreed.

Thanks,
Rafael

--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Rafael J. Wysocki Sept. 17, 2015, 12:27 a.m. UTC | #26
On Wednesday, September 16, 2015 03:22:37 PM Alan Stern wrote:
> On Wed, 16 Sep 2015, Thierry Reding wrote:
> 
> > > > Perhaps moving to the end of the list needs to be a little smarter. That
> > > > is it could check whether the device has been prepared for suspension or
> > > > not and only move when it hasn't?
> > > 
> > > Maybe.  But doesn't that mean it won't solve your problem completely?
> > 
> > It would solve the problem completely if probing was prohibited during
> > the suspend/resume cycle. But if that were true there'd be no need to
> > special-case in the first place.
> 
> It's possible that this approach will help.  We'll see.
> 
> It would also help if your patch checked to see if the device has any 
> children, and avoided moving it to the end of the list if it does.  In 
> fact, that might be sufficient to avoid almost all problems.

I agree.

In any case if a device that already has children is about to be probed,
this is sort of a corner case anyway and should be handled as such.

> > > > Then again, shouldn't the core even prohibit new probes once the suspend
> > > > has been triggered? Sounds like asking for a lot of trouble if it didn't
> > > > ...
> > > 
> > > The core prohibits new devices from being registered.  It does not 
> > > prohibit probes of existing devices, because they currently do not 
> > > affect the dpm_list.
> > 
> > My understanding was that the core was guaranteed not to call suspend or
> > resume callbacks for devices that haven't completed probe.
> 
> No, that's not so.  The core invokes suspend and resume callbacks for 
> all registered devices.

And those may be bus type callbacks that decide whether or not to call
driver callbacks.  Having a driver bound to a device is not necessary for
a bus type callback to succeed in general.

> >  At least I've
> > never seen any driver code specifically check in their suspend or resume
> > callbacks that they've been probed successfully.
> 
> Now that wouldn't make any sense, would it?  A driver's suspend or
> resume callback wouldn't be invoked in the first place unless the
> driver had already been probed for that device.  So there's no point in
> checking whether the probe has occurred.
> 
> >  Allowing probes while a
> > suspend is happening sounds racy.
> 
> Yes, it does.  It definitely isn't a good idea.
> 
> During a sleep transition all user threads are frozen.  So are some
> kernel threads, but not all of them.  It's possible that one of the
> running kernel threads might want to do a probe -- something in a
> workqueue, for example.  That's the only way it could happen.

A hotplug event or something like that I suppose, but those should be
deferred to post resume time too.

> > > In general, we rely on subsystems not to do any probing once a device 
> > > is suspended.  It's probably reasonable to ask them not to do any 
> > > probing once a device has gone through the "prepare" stage.
> > 
> > Perhaps the reason why we seem to be talking across purposes is that I
> > haven't thought much about devices where the bus does all the heavy
> > lifting. So suspending the device from a bus' perspective makes sense
> > even if the device hasn't been bound.
> 
> Yes.

Precisely.

> > And yes, I agree that preventing a probe for a device that has been
> > prepared for suspension sounds like a very reasonable thing to do.
> > 
> > > > > There's also an issue about other types of dependencies.  For instance,
> > > > > it's conceivable that device B might be discovered and depend on device
> > > > > A, even before A has been bound to a driver.  (B might be discovered by 
> > > > > A's subsystem rather than A's driver.)  In that case, moving A to the 
> > > > > end of the list would cause B to come before A even though B depends on 
> > > > > A.  Of course, deferred probing already has this problem.
> > > > 
> > > > But that's exactly the problem that I'm seeing.
> > > 
> > > Not quite.
> > > 
> > > >  B isn't discovered by
> > > > A's subsystem, but the type of dependency is the same. A in this case
> > > > would be the GPIO controller and B the gpio-keys device. B clearly
> > > > depends on A, but deferred probe currently moves A to the end of the
> > > > list but not A, hence why the problem occurs.
> > > 
> > > The difference is that in my example, B can be probed before A.  In
> > > your case it can't.  Therefore the patch works for your case but not
> > > for mine.
> > 
> > How would that even work in practice? Essentially you have a dependency
> > between two devices and no way of guaranteeing any ordering. Either the
> > dependency is completely optional, in which case the ordering of the
> > dpm_list must be irrelevant to the interaction, or the drivers make too
> > many assumptions and it is only working by accident.
> 
> Rafael gave a good example.  A device behind a PCIe port can't be
> detected until the port has been registered, so the port will always
> get on the dpm_list before the other device does.  But both devices can
> be added before the port has been probed.
> 
> This dependency is not optional and it is guaranteed.  But it's not 
> related to anything the drivers do.
> 
> > > > That's also a problem that I think this patch solves. By moving every
> > > > device to the end of the list before it is probed we ensure that the
> > > > dpm_list is ordered in the same way as the probe order. For this to work
> > > > the precondition of course is that drivers know about the dependencies
> > > > and will defer probe if necessary.
> > > 
> > > Do I understand correctly?  You're saying a driver must defer a probe
> > > if the device it's probing depends on another device which hasn't
> > > been bound yet.  That does not sound like a reasonable sort of
> > > requirement -- we might know about the dependency but we shouldn't have
> > > to check whether the prerequisite device has been bound.
> > 
> > I guess that depends on the kind of dependency you have. For most cases
> > that I'm aware of the dependencies are required dependencies. That is, a
> > consumer uses a resource registered by a provider. The provider will
> > register the resource at probe time, so if the consumer is probed before
> > the provider, then the resource it needs isn't there. For a required
> > dependency that implies that the consumer must defer probe until the
> > provider has been bound because it simply can't continue without getting
> > access to the resource first.
> 
> Okay.  It would be a good idea to restrict your patch to handle only 
> that particular kind of dependency, if you can think of any way to do 
> this.

Agreed.

I've been postulating exactly that from the start, but I might not be able to
state that clearly enough.

> > I'm slightly confused by your statement. If a consumer depends on a
> > provider for a resource, how can we finish the consumer's probe without
> > checking that the provider has been bound? It's true that we don't
> > technically check for the device to have been bound, but the end result
> > is the same.
> > 
> > Unless I misunderstand what you're saying we'd need to have some
> > mechanism to notify the consumer (after it's been probed) that the
> > provider of it's resource has become available.
> 
> You misunderstand me.  Yes, I agree that your patch handles the case 
> where a consumer depends on a provider for a resource.  But it doesn't 
> handle the case where one device depends on another for something other 
> than a resource (or perhaps for a resource that can be provided even in 
> the absence of a driver).

Right.

So to recap my original point, I do have concerns about doing the
"move the device to the end of dpm_list when it is about to be probed"
thing unconditionally for all devices.

However, the overall idea is not bad in my view, it just needs to be
refined.  We first need to avoid probing for drivers during system
sleep transitions, which would be good to do regardless.  It also would help
if we could identify the cases when the moving is really necessary and
restrict it to those cases only.  I'm not sure if that's practical,
though.

And I'm wondering if and how that is related to runtime PM?  It only
covers the system sleep transitions case, but who's responsible for the
runtime PM part?  Device drivers?

Thanks,
Rafael

--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Rafael J. Wysocki Sept. 17, 2015, 12:55 a.m. UTC | #27
On Wednesday, September 16, 2015 04:38:07 PM Grygorii Strashko wrote:
> On 09/16/2015 04:18 AM, Rafael J. Wysocki wrote:
> > On Tuesday, September 15, 2015 05:53:02 PM Thierry Reding wrote:

[cut]

> 
> >
> > Did that patch manipulate dpm_list?
> 
> no.

So whether or not there were any problems with it is meaningless here.

Thanks,
Rafael

--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Rafael J. Wysocki Sept. 17, 2015, 2:04 a.m. UTC | #28
On Thu, Sep 17, 2015 at 2:27 AM, Rafael J. Wysocki <rjw@rjwysocki.net> wrote:
> On Wednesday, September 16, 2015 03:22:37 PM Alan Stern wrote:
>> On Wed, 16 Sep 2015, Thierry Reding wrote:

[cut]

>
> And I'm wondering if and how that is related to runtime PM?  It only
> covers the system sleep transitions case, but who's responsible for the
> runtime PM part?  Device drivers?
>

Which reminds me of something we all seem to be forgetting about:
there is asynchronous suspend and resume which may cause suspend and
resume callbacks of devices to be executed in an order that is
different from the dpm_list order.  In those cases the device that
depends on another one has to explicitly wait for the other one to
complete its callback in the current phase of the transition.

While correct ordering of dpm_list is essential for this to work too,
it by no means is sufficient, so in the end the driver having a
dependency needs to know about it and act on it as needed (or we need
an alternative mechanism that will do that automatically, but I'm not
sure what that may be).

Actually, I was thinking about adding something like pm_get() for this
purpose that will do pm_runtime_get_sync() on the target and will
ensure that the right things will happen during system suspend/resume
in addition to that, including reordering dpm_list if necessary.
Plus, of course, the complementary pm_put().

With something like that available, there should be no need to reorder
dpm_list anywhere else.  The problem with this approach is that the
reordering becomes quite complicated then, as it would need to move
the device itself after the target and anything that depends on it
along with it and tracking those dependencies becomes quite
problematic.

Thanks,
Rafael
--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Tomeu Vizoso Sept. 17, 2015, 5:40 a.m. UTC | #29
On 17 September 2015 at 02:27, Rafael J. Wysocki <rjw@rjwysocki.net> wrote:
> On Wednesday, September 16, 2015 03:22:37 PM Alan Stern wrote:
>>
>> It would also help if your patch checked to see if the device has any
>> children, and avoided moving it to the end of the list if it does.  In
>> fact, that might be sufficient to avoid almost all problems.
>
> I agree.
>
> In any case if a device that already has children is about to be probed,
> this is sort of a corner case anyway and should be handled as such.

Just wanted to mention that it's very common in platforms that make
use of DT to have some devices registered before their parents have probed.
If I grep for DTS with simple-bus and simple-mfd I get 802 matches.

Regards,

Tomeu
--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Alan Stern Sept. 17, 2015, 5:02 p.m. UTC | #30
On Thu, 17 Sep 2015, Rafael J. Wysocki wrote:

> On Thu, Sep 17, 2015 at 2:27 AM, Rafael J. Wysocki <rjw@rjwysocki.net> wrote:
> > On Wednesday, September 16, 2015 03:22:37 PM Alan Stern wrote:
> >> On Wed, 16 Sep 2015, Thierry Reding wrote:
> 
> [cut]
> 
> >
> > And I'm wondering if and how that is related to runtime PM?  It only
> > covers the system sleep transitions case, but who's responsible for the
> > runtime PM part?  Device drivers?

In theory the drivers are responsible.  In practice, I don't know how 
this is handled.

> Which reminds me of something we all seem to be forgetting about:
> there is asynchronous suspend and resume which may cause suspend and
> resume callbacks of devices to be executed in an order that is
> different from the dpm_list order.  In those cases the device that
> depends on another one has to explicitly wait for the other one to
> complete its callback in the current phase of the transition.
> 
> While correct ordering of dpm_list is essential for this to work too,
> it by no means is sufficient, so in the end the driver having a
> dependency needs to know about it and act on it as needed (or we need
> an alternative mechanism that will do that automatically, but I'm not
> sure what that may be).
> 
> Actually, I was thinking about adding something like pm_get() for this
> purpose that will do pm_runtime_get_sync() on the target and will
> ensure that the right things will happen during system suspend/resume
> in addition to that, including reordering dpm_list if necessary.
> Plus, of course, the complementary pm_put().
> 
> With something like that available, there should be no need to reorder
> dpm_list anywhere else.  The problem with this approach is that the
> reordering becomes quite complicated then, as it would need to move
> the device itself after the target and anything that depends on it
> along with it and tracking those dependencies becomes quite
> problematic.

Keeping explicit track of these non-child-parent dependencies seems 
reasonable.  But I don't know how you could combine it with reordering 
dpm_list.

One possibility might be to do a topological sort of all devices, with 
the initial set of constraints given by the explicit dependencies and 
the parent-child relations.  So basically this would mean ignoring the 
actual dpm_list and making up a new list of your own whenever a sleep 
transition starts.  It might work, but I'm not sure how reliable it 
would be.

Alan Stern

--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Rafael J. Wysocki Sept. 17, 2015, 6:15 p.m. UTC | #31
Hi,

On Thu, Sep 17, 2015 at 7:40 AM, Tomeu Vizoso
<tomeu.vizoso@collabora.com> wrote:
> On 17 September 2015 at 02:27, Rafael J. Wysocki <rjw@rjwysocki.net> wrote:
>> On Wednesday, September 16, 2015 03:22:37 PM Alan Stern wrote:
>>>
>>> It would also help if your patch checked to see if the device has any
>>> children, and avoided moving it to the end of the list if it does.  In
>>> fact, that might be sufficient to avoid almost all problems.
>>
>> I agree.
>>
>> In any case if a device that already has children is about to be probed,
>> this is sort of a corner case anyway and should be handled as such.
>
> Just wanted to mention that it's very common in platforms that make
> use of DT to have some devices registered before their parents have probed.
> If I grep for DTS with simple-bus and simple-mfd I get 802 matches.


This is good to know, thanks!

Rafael
--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Rafael J. Wysocki Sept. 17, 2015, 6:43 p.m. UTC | #32
Hi Alan,

On Thu, Sep 17, 2015 at 7:02 PM, Alan Stern <stern@rowland.harvard.edu> wrote:
> On Thu, 17 Sep 2015, Rafael J. Wysocki wrote:
>
>> On Thu, Sep 17, 2015 at 2:27 AM, Rafael J. Wysocki <rjw@rjwysocki.net> wrote:
>> > On Wednesday, September 16, 2015 03:22:37 PM Alan Stern wrote:
>> >> On Wed, 16 Sep 2015, Thierry Reding wrote:
>>
>> [cut]
>>
>> >
>> > And I'm wondering if and how that is related to runtime PM?  It only
>> > covers the system sleep transitions case, but who's responsible for the
>> > runtime PM part?  Device drivers?
>
> In theory the drivers are responsible.  In practice, I don't know how
> this is handled.

Same here, unfortunately.

>> Which reminds me of something we all seem to be forgetting about:
>> there is asynchronous suspend and resume which may cause suspend and
>> resume callbacks of devices to be executed in an order that is
>> different from the dpm_list order.  In those cases the device that
>> depends on another one has to explicitly wait for the other one to
>> complete its callback in the current phase of the transition.
>>
>> While correct ordering of dpm_list is essential for this to work too,
>> it by no means is sufficient, so in the end the driver having a
>> dependency needs to know about it and act on it as needed (or we need
>> an alternative mechanism that will do that automatically, but I'm not
>> sure what that may be).

Note: Problems also may happen if device A depends on device B and its
driver to be present and functional and then the B's driver module is
unloaded.  The core doesn't prevent that from happening AFAICS.

>> Actually, I was thinking about adding something like pm_get() for this
>> purpose that will do pm_runtime_get_sync() on the target and will
>> ensure that the right things will happen during system suspend/resume
>> in addition to that, including reordering dpm_list if necessary.
>> Plus, of course, the complementary pm_put().
>>
>> With something like that available, there should be no need to reorder
>> dpm_list anywhere else.  The problem with this approach is that the
>> reordering becomes quite complicated then, as it would need to move
>> the device itself after the target and anything that depends on it
>> along with it and tracking those dependencies becomes quite
>> problematic.
>
> Keeping explicit track of these non-child-parent dependencies seems
> reasonable.  But I don't know how you could combine it with reordering
> dpm_list.
>
> One possibility might be to do a topological sort of all devices, with
> the initial set of constraints given by the explicit dependencies and
> the parent-child relations.  So basically this would mean ignoring the
> actual dpm_list and making up a new list of your own whenever a sleep
> transition starts.  It might work, but I'm not sure how reliable it
> would be.

I'd like to go back to my initial hunch that the driver knowing about
a dependency on another one should tell the core about that, so the
core can make the right things happen at various times (like system
suspend/resume etc).

What if we introduce a mechanism allowing drivers to say "I depend on
device X and its driver to be present and functional from now on" and
store that information somewhere for the core to use?

Some time ago (a few years ago actually IIRC) I proposed something
called "PM links".  The idea was to have objects representing such
dependencies, although I was not taking the "the driver of the device
I depend on should be present and functional going forward" condition.

Say, if a driver wants to check the presence of the device+driver it
needs to be functional, it will do something like

ret = create_pm_link(dev, producer);

and that will return -EPROBE_DEFER if the producer device is not
functional.  If success is returned, the link has been created and now
the core will take it into account.

On driver removal the core may just delete the links where the device
is the "consumer".  Also there may be a delete_pm_link(dev, producer)
operation if needed.

The creation of a link may then include the reordering of dpm_list as
appropriate so all "producers" are now followed by all of their
"consumers".  Going forward, though, the core may use the links to
make all "producers" wait for the PM callbacks of their "consumers" to
complete during system suspend etc.  It also may use them to prevent
drivers being depended on from being unloaded and/or to force the
removal of drivers that depend on something being removed.  In
principle it may also use those links to coordinate runtime PM
transitions, but I guess that's not going to be useful in all cases,
so there needs to be an opt-in mechanism for that.

Please tell me what you think.

Thanks,
Rafael
--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Alan Stern Sept. 17, 2015, 7:21 p.m. UTC | #33
On Thu, 17 Sep 2015, Rafael J. Wysocki wrote:

> Hi,
> 
> On Thu, Sep 17, 2015 at 7:40 AM, Tomeu Vizoso
> <tomeu.vizoso@collabora.com> wrote:
> > On 17 September 2015 at 02:27, Rafael J. Wysocki <rjw@rjwysocki.net> wrote:
> >> On Wednesday, September 16, 2015 03:22:37 PM Alan Stern wrote:
> >>>
> >>> It would also help if your patch checked to see if the device has any
> >>> children, and avoided moving it to the end of the list if it does.  In
> >>> fact, that might be sufficient to avoid almost all problems.
> >>
> >> I agree.
> >>
> >> In any case if a device that already has children is about to be probed,
> >> this is sort of a corner case anyway and should be handled as such.
> >
> > Just wanted to mention that it's very common in platforms that make
> > use of DT to have some devices registered before their parents have probed.
> > If I grep for DTS with simple-bus and simple-mfd I get 802 matches.
> 
> 
> This is good to know, thanks!

Hmmm.  In view of this, maybe it's not worthwhile to try looking for 
such things.  Given that they exist all over the place, we will 
definitely need to handle them correctly.

Alan Stern

--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Alan Stern Sept. 17, 2015, 9:06 p.m. UTC | #34
On Thu, 17 Sep 2015, Rafael J. Wysocki wrote:

> Note: Problems also may happen if device A depends on device B and its
> driver to be present and functional and then the B's driver module is
> unloaded.  The core doesn't prevent that from happening AFAICS.

It also doesn't prevent B's driver from being unbound from the B 
device.

To some extent the kernel _does_ prevent driver modules from being
unloaded.  If A's driver uses code resources provided by B's driver
then the module's refcount would be larger than 0.

> I'd like to go back to my initial hunch that the driver knowing about
> a dependency on another one should tell the core about that, so the
> core can make the right things happen at various times (like system
> suspend/resume etc).
> 
> What if we introduce a mechanism allowing drivers to say "I depend on
> device X and its driver to be present and functional from now on" and
> store that information somewhere for the core to use?
> 
> Some time ago (a few years ago actually IIRC) I proposed something
> called "PM links".  The idea was to have objects representing such
> dependencies, although I was not taking the "the driver of the device
> I depend on should be present and functional going forward" condition.
> 
> Say, if a driver wants to check the presence of the device+driver it
> needs to be functional, it will do something like
> 
> ret = create_pm_link(dev, producer);
> 
> and that will return -EPROBE_DEFER if the producer device is not
> functional.  If success is returned, the link has been created and now
> the core will take it into account.
> 
> On driver removal the core may just delete the links where the device
> is the "consumer".  Also there may be a delete_pm_link(dev, producer)
> operation if needed.
> 
> The creation of a link may then include the reordering of dpm_list as
> appropriate so all "producers" are now followed by all of their
> "consumers".  Going forward, though, the core may use the links to
> make all "producers" wait for the PM callbacks of their "consumers" to
> complete during system suspend etc.  It also may use them to prevent
> drivers being depended on from being unloaded and/or to force the
> removal of drivers that depend on something being removed.  In
> principle it may also use those links to coordinate runtime PM
> transitions, but I guess that's not going to be useful in all cases,
> so there needs to be an opt-in mechanism for that.
> 
> Please tell me what you think.

Sounds familiar.  I recall this basic approach from a Plumbers
conference some years ago -- maybe that was when you first proposed it!

You might want to categorize the dependencies into different types.  I 
can think of three types offhand:

	The target device must be present before the current device
	can be probed (hard to imagine how that could be stored as a PM
	link if the target device isn't present, though);

	The target device must be bound to a driver before the current
	device can be probed;

	The target device must be at full power whenever the current
	device is.

Maybe you can think of others.

[Oddly enough, the USB subsystem has some dependencies that don't fall 
into any of these categories.  They have to do with the peculiar way in 
which a low- or full-speed device is handed off from a high-speed 
controller to its companion low/full-speed controller, and they apply 
only to system resume, not to normal operation.  (That is, device A 
requires device B to be at full power when A is being resumed from a 
system sleep, but not when A is operating normally or when A is being 
runtime-resumed.)  For such things, we should keep the existing 
device_pm_wait_for_dev() API.]

This sounds like a big change, but it might be worthwhile.

Alan Stern

--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Rafael J. Wysocki Sept. 18, 2015, 12:13 a.m. UTC | #35
On Thu, Sep 17, 2015 at 11:06 PM, Alan Stern <stern@rowland.harvard.edu> wrote:
> On Thu, 17 Sep 2015, Rafael J. Wysocki wrote:
>
>> Note: Problems also may happen if device A depends on device B and its
>> driver to be present and functional and then the B's driver module is
>> unloaded.  The core doesn't prevent that from happening AFAICS.
>
> It also doesn't prevent B's driver from being unbound from the B
> device.
>
> To some extent the kernel _does_ prevent driver modules from being
> unloaded.  If A's driver uses code resources provided by B's driver
> then the module's refcount would be larger than 0.

Right.

>> I'd like to go back to my initial hunch that the driver knowing about
>> a dependency on another one should tell the core about that, so the
>> core can make the right things happen at various times (like system
>> suspend/resume etc).
>>
>> What if we introduce a mechanism allowing drivers to say "I depend on
>> device X and its driver to be present and functional from now on" and
>> store that information somewhere for the core to use?
>>
>> Some time ago (a few years ago actually IIRC) I proposed something
>> called "PM links".  The idea was to have objects representing such
>> dependencies, although I was not taking the "the driver of the device
>> I depend on should be present and functional going forward" condition.
>>
>> Say, if a driver wants to check the presence of the device+driver it
>> needs to be functional, it will do something like
>>
>> ret = create_pm_link(dev, producer);
>>
>> and that will return -EPROBE_DEFER if the producer device is not
>> functional.  If success is returned, the link has been created and now
>> the core will take it into account.
>>
>> On driver removal the core may just delete the links where the device
>> is the "consumer".  Also there may be a delete_pm_link(dev, producer)
>> operation if needed.
>>
>> The creation of a link may then include the reordering of dpm_list as
>> appropriate so all "producers" are now followed by all of their
>> "consumers".  Going forward, though, the core may use the links to
>> make all "producers" wait for the PM callbacks of their "consumers" to
>> complete during system suspend etc.  It also may use them to prevent
>> drivers being depended on from being unloaded and/or to force the
>> removal of drivers that depend on something being removed.  In
>> principle it may also use those links to coordinate runtime PM
>> transitions, but I guess that's not going to be useful in all cases,
>> so there needs to be an opt-in mechanism for that.
>>
>> Please tell me what you think.
>
> Sounds familiar.  I recall this basic approach from a Plumbers
> conference some years ago -- maybe that was when you first proposed it!
>
> You might want to categorize the dependencies into different types.  I
> can think of three types offhand:
>
>         The target device must be present before the current device
>         can be probed (hard to imagine how that could be stored as a PM
>         link if the target device isn't present, though);

Right, but there is a tricky part here.  The presence of the device
object need not imply that the device is physically present. :-)

>         The target device must be bound to a driver before the current
>         device can be probed;
>
>         The target device must be at full power whenever the current
>         device is.

Or even before attempting to put the current device at full power.

> Maybe you can think of others.
>
> [Oddly enough, the USB subsystem has some dependencies that don't fall
> into any of these categories.  They have to do with the peculiar way in
> which a low- or full-speed device is handed off from a high-speed
> controller to its companion low/full-speed controller, and they apply
> only to system resume, not to normal operation.  (That is, device A
> requires device B to be at full power when A is being resumed from a
> system sleep, but not when A is operating normally or when A is being
> runtime-resumed.)  For such things, we should keep the existing
> device_pm_wait_for_dev() API.]

Absolutely.  The idea is to use the existing APIs for that where it makes sense.

> This sounds like a big change, but it might be worthwhile.

Well, the more I think about that the more it seems to me that some
redesign is needed.

Thanks,
Rafael
--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Thierry Reding Sept. 18, 2015, 3:55 p.m. UTC | #36
On Thu, Sep 17, 2015 at 08:43:54PM +0200, Rafael J. Wysocki wrote:
> Hi Alan,
> 
> On Thu, Sep 17, 2015 at 7:02 PM, Alan Stern <stern@rowland.harvard.edu> wrote:
> > On Thu, 17 Sep 2015, Rafael J. Wysocki wrote:
> >
> >> On Thu, Sep 17, 2015 at 2:27 AM, Rafael J. Wysocki <rjw@rjwysocki.net> wrote:
> >> > On Wednesday, September 16, 2015 03:22:37 PM Alan Stern wrote:
> >> >> On Wed, 16 Sep 2015, Thierry Reding wrote:
> >>
> >> [cut]
> >>
> >> >
> >> > And I'm wondering if and how that is related to runtime PM?  It only
> >> > covers the system sleep transitions case, but who's responsible for the
> >> > runtime PM part?  Device drivers?
> >
> > In theory the drivers are responsible.  In practice, I don't know how
> > this is handled.
> 
> Same here, unfortunately.
> 
> >> Which reminds me of something we all seem to be forgetting about:
> >> there is asynchronous suspend and resume which may cause suspend and
> >> resume callbacks of devices to be executed in an order that is
> >> different from the dpm_list order.  In those cases the device that
> >> depends on another one has to explicitly wait for the other one to
> >> complete its callback in the current phase of the transition.
> >>
> >> While correct ordering of dpm_list is essential for this to work too,
> >> it by no means is sufficient, so in the end the driver having a
> >> dependency needs to know about it and act on it as needed (or we need
> >> an alternative mechanism that will do that automatically, but I'm not
> >> sure what that may be).
> 
> Note: Problems also may happen if device A depends on device B and its
> driver to be present and functional and then the B's driver module is
> unloaded.  The core doesn't prevent that from happening AFAICS.

As Alan already mentioned this is typically solved by consumers taking a
module reference. However that only makes sure that the module stays
around, so references to code or global data will remain valid. However
it does not prevent the device from being unbound and freeing all the
resources associated with it.

A lot of subsystems are buggy this way. Typically the solution here is
to properly reference count objects that subsystems hand out. But that
does not solve the problem entirely. You still need to deal with the
situation where the device backing an object goes away. Consumers may
keep a reference to the object, which ensures that the data stays around
but operations on the objects would still fail (consider cases where the
operation accesses registers that have been unmapped when the provider
was unbound).

If the core provided a means to prevent a device from being unbound if
it still had consumers, that would fix a whole lot of problems at once.

Of course there's still the matter of some types of devices physically
disappearing (USB, PCI, ...).

> >> Actually, I was thinking about adding something like pm_get() for this
> >> purpose that will do pm_runtime_get_sync() on the target and will
> >> ensure that the right things will happen during system suspend/resume
> >> in addition to that, including reordering dpm_list if necessary.
> >> Plus, of course, the complementary pm_put().
> >>
> >> With something like that available, there should be no need to reorder
> >> dpm_list anywhere else.  The problem with this approach is that the
> >> reordering becomes quite complicated then, as it would need to move
> >> the device itself after the target and anything that depends on it
> >> along with it and tracking those dependencies becomes quite
> >> problematic.
> >
> > Keeping explicit track of these non-child-parent dependencies seems
> > reasonable.  But I don't know how you could combine it with reordering
> > dpm_list.
> >
> > One possibility might be to do a topological sort of all devices, with
> > the initial set of constraints given by the explicit dependencies and
> > the parent-child relations.  So basically this would mean ignoring the
> > actual dpm_list and making up a new list of your own whenever a sleep
> > transition starts.  It might work, but I'm not sure how reliable it
> > would be.
> 
> I'd like to go back to my initial hunch that the driver knowing about
> a dependency on another one should tell the core about that, so the
> core can make the right things happen at various times (like system
> suspend/resume etc).
> 
> What if we introduce a mechanism allowing drivers to say "I depend on
> device X and its driver to be present and functional from now on" and
> store that information somewhere for the core to use?
> 
> Some time ago (a few years ago actually IIRC) I proposed something
> called "PM links".  The idea was to have objects representing such
> dependencies, although I was not taking the "the driver of the device
> I depend on should be present and functional going forward" condition.
> 
> Say, if a driver wants to check the presence of the device+driver it
> needs to be functional, it will do something like
> 
> ret = create_pm_link(dev, producer);
> 
> and that will return -EPROBE_DEFER if the producer device is not
> functional.  If success is returned, the link has been created and now
> the core will take it into account.
> 
> On driver removal the core may just delete the links where the device
> is the "consumer".  Also there may be a delete_pm_link(dev, producer)
> operation if needed.
> 
> The creation of a link may then include the reordering of dpm_list as
> appropriate so all "producers" are now followed by all of their
> "consumers".  Going forward, though, the core may use the links to
> make all "producers" wait for the PM callbacks of their "consumers" to
> complete during system suspend etc.  It also may use them to prevent
> drivers being depended on from being unloaded and/or to force the
> removal of drivers that depend on something being removed.  In
> principle it may also use those links to coordinate runtime PM
> transitions, but I guess that's not going to be useful in all cases,
> so there needs to be an opt-in mechanism for that.

Force-removing drivers that depend on a device that's being unbound
would be a possibility to solve the problem where consumers depend on a
device that could physically go away. It might also be the right thing
to do in any case. Presumably somebody unloading a module want to do
just that, and refusing to do so isn't playing very nice. Of course
allowing random modules to be removed even if a lot of consumers might
depend on it may not be friendly either. Consider if you unload a GPIO
driver that provides a pin that's used to enable power to an eMMC that
might have the root filesystem.

Then again, if you unload a module you better know what you're doing
anyway, so maybe that's not something we need to be concerned about.

> Please tell me what you think.

I think that's a great idea. There's probably some bikeshedding to be
had, but on the whole I think this would add very useful information to
the driver model.

Many subsystems nowadays already provide a very similar API, often of
the form:

	resource = [dev_]*_get(dev, "name");

Subsystems usually use dev and "name" to look up the provider and the
resource. When they do lookup the provider they will typically be able
to get at the underlying struct device, so this would provide a very
nice entrypoint to call the core function. That way we can move this
into subsystems and individual drivers don't have to be updated.

I think this would also tie in nicely with Tomeu's patch set to do
on-demand probing. Essentially a [dev_]*_get() call could in turn call
this new "declare dependency" API, and the new API could underneath do
on-demand probing.

Given that this isn't a strictly PM mechanism anymore, perhaps something
like:

	int device_depend(struct device *dev, struct device *target);

would be a more generic option.

Thierry
Rafael J. Wysocki Sept. 18, 2015, 11:07 p.m. UTC | #37
Hi Thierry,

On Fri, Sep 18, 2015 at 5:55 PM, Thierry Reding
<thierry.reding@gmail.com> wrote:
> On Thu, Sep 17, 2015 at 08:43:54PM +0200, Rafael J. Wysocki wrote:
>> Hi Alan,
>>
>> On Thu, Sep 17, 2015 at 7:02 PM, Alan Stern <stern@rowland.harvard.edu> wrote:
>> > On Thu, 17 Sep 2015, Rafael J. Wysocki wrote:
>> >
>> >> On Thu, Sep 17, 2015 at 2:27 AM, Rafael J. Wysocki <rjw@rjwysocki.net> wrote:
>> >> > On Wednesday, September 16, 2015 03:22:37 PM Alan Stern wrote:
>> >> >> On Wed, 16 Sep 2015, Thierry Reding wrote:
>> >>
>> >> [cut]
>> >>
>> >> >
>> >> > And I'm wondering if and how that is related to runtime PM?  It only
>> >> > covers the system sleep transitions case, but who's responsible for the
>> >> > runtime PM part?  Device drivers?
>> >
>> > In theory the drivers are responsible.  In practice, I don't know how
>> > this is handled.
>>
>> Same here, unfortunately.
>>
>> >> Which reminds me of something we all seem to be forgetting about:
>> >> there is asynchronous suspend and resume which may cause suspend and
>> >> resume callbacks of devices to be executed in an order that is
>> >> different from the dpm_list order.  In those cases the device that
>> >> depends on another one has to explicitly wait for the other one to
>> >> complete its callback in the current phase of the transition.
>> >>
>> >> While correct ordering of dpm_list is essential for this to work too,
>> >> it by no means is sufficient, so in the end the driver having a
>> >> dependency needs to know about it and act on it as needed (or we need
>> >> an alternative mechanism that will do that automatically, but I'm not
>> >> sure what that may be).
>>
>> Note: Problems also may happen if device A depends on device B and its
>> driver to be present and functional and then the B's driver module is
>> unloaded.  The core doesn't prevent that from happening AFAICS.
>
> As Alan already mentioned this is typically solved by consumers taking a
> module reference. However that only makes sure that the module stays
> around, so references to code or global data will remain valid. However
> it does not prevent the device from being unbound and freeing all the
> resources associated with it.
>
> A lot of subsystems are buggy this way. Typically the solution here is
> to properly reference count objects that subsystems hand out. But that
> does not solve the problem entirely. You still need to deal with the
> situation where the device backing an object goes away. Consumers may
> keep a reference to the object, which ensures that the data stays around
> but operations on the objects would still fail (consider cases where the
> operation accesses registers that have been unmapped when the provider
> was unbound).
>
> If the core provided a means to prevent a device from being unbound if
> it still had consumers, that would fix a whole lot of problems at once.

Indeed.

> Of course there's still the matter of some types of devices physically
> disappearing (USB, PCI, ...).

Right.  In some cases removal is simply necessary as part of the
cleanup, like after a surprise hot-unplug of a device, for example.
In those cases everything that depended on the device that went away
should be unbound from drivers at least IMO.

>> >> Actually, I was thinking about adding something like pm_get() for this
>> >> purpose that will do pm_runtime_get_sync() on the target and will
>> >> ensure that the right things will happen during system suspend/resume
>> >> in addition to that, including reordering dpm_list if necessary.
>> >> Plus, of course, the complementary pm_put().
>> >>
>> >> With something like that available, there should be no need to reorder
>> >> dpm_list anywhere else.  The problem with this approach is that the
>> >> reordering becomes quite complicated then, as it would need to move
>> >> the device itself after the target and anything that depends on it
>> >> along with it and tracking those dependencies becomes quite
>> >> problematic.
>> >
>> > Keeping explicit track of these non-child-parent dependencies seems
>> > reasonable.  But I don't know how you could combine it with reordering
>> > dpm_list.
>> >
>> > One possibility might be to do a topological sort of all devices, with
>> > the initial set of constraints given by the explicit dependencies and
>> > the parent-child relations.  So basically this would mean ignoring the
>> > actual dpm_list and making up a new list of your own whenever a sleep
>> > transition starts.  It might work, but I'm not sure how reliable it
>> > would be.
>>
>> I'd like to go back to my initial hunch that the driver knowing about
>> a dependency on another one should tell the core about that, so the
>> core can make the right things happen at various times (like system
>> suspend/resume etc).
>>
>> What if we introduce a mechanism allowing drivers to say "I depend on
>> device X and its driver to be present and functional from now on" and
>> store that information somewhere for the core to use?
>>
>> Some time ago (a few years ago actually IIRC) I proposed something
>> called "PM links".  The idea was to have objects representing such
>> dependencies, although I was not taking the "the driver of the device
>> I depend on should be present and functional going forward" condition.
>>
>> Say, if a driver wants to check the presence of the device+driver it
>> needs to be functional, it will do something like
>>
>> ret = create_pm_link(dev, producer);
>>
>> and that will return -EPROBE_DEFER if the producer device is not
>> functional.  If success is returned, the link has been created and now
>> the core will take it into account.
>>
>> On driver removal the core may just delete the links where the device
>> is the "consumer".  Also there may be a delete_pm_link(dev, producer)
>> operation if needed.
>>
>> The creation of a link may then include the reordering of dpm_list as
>> appropriate so all "producers" are now followed by all of their
>> "consumers".  Going forward, though, the core may use the links to
>> make all "producers" wait for the PM callbacks of their "consumers" to
>> complete during system suspend etc.  It also may use them to prevent
>> drivers being depended on from being unloaded and/or to force the
>> removal of drivers that depend on something being removed.  In
>> principle it may also use those links to coordinate runtime PM
>> transitions, but I guess that's not going to be useful in all cases,
>> so there needs to be an opt-in mechanism for that.
>
> Force-removing drivers that depend on a device that's being unbound
> would be a possibility to solve the problem where consumers depend on a
> device that could physically go away. It might also be the right thing
> to do in any case. Presumably somebody unloading a module want to do
> just that, and refusing to do so isn't playing very nice. Of course
> allowing random modules to be removed even if a lot of consumers might
> depend on it may not be friendly either. Consider if you unload a GPIO
> driver that provides a pin that's used to enable power to an eMMC that
> might have the root filesystem.
>
> Then again, if you unload a module you better know what you're doing
> anyway, so maybe that's not something we need to be concerned about.

I think that it's better to fail module unloads in such cases by
default (to prevent simple silly mistakes from having possibly severe
consequences), but if a "force" option is used, we should regard that
as "the user really means it" and do as requested.  That would be very
much analogous to the hot-unplug situation from the software
perspective.

>> Please tell me what you think.
>
> I think that's a great idea. There's probably some bikeshedding to be
> had, but on the whole I think this would add very useful information to
> the driver model.
>
> Many subsystems nowadays already provide a very similar API, often of
> the form:
>
>         resource = [dev_]*_get(dev, "name");
>
> Subsystems usually use dev and "name" to look up the provider and the
> resource. When they do lookup the provider they will typically be able
> to get at the underlying struct device, so this would provide a very
> nice entrypoint to call the core function. That way we can move this
> into subsystems and individual drivers don't have to be updated.

Right.

> I think this would also tie in nicely with Tomeu's patch set to do
> on-demand probing. Essentially a [dev_]*_get() call could in turn call
> this new "declare dependency" API, and the new API could underneath do
> on-demand probing.
>
> Given that this isn't a strictly PM mechanism anymore, perhaps something
> like:
>
>         int device_depend(struct device *dev, struct device *target);
>
> would be a more generic option.

I thought about something like link_device(dev, target, flags), where
the last argument would indicate what the core is supposed to use the
link for (removal handling, system suspend/resume, runtime PM etc).

And I agree that this isn't really PM-specific.

OK, thanks a lot for the feedback!

Let me think about that a bit more and I'll try to come up with a more
detailed design description.

Thanks,
Rafael
--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Thierry Reding Sept. 21, 2015, 8:51 a.m. UTC | #38
On Sat, Sep 19, 2015 at 01:07:56AM +0200, Rafael J. Wysocki wrote:
> On Fri, Sep 18, 2015 at 5:55 PM, Thierry Reding <thierry.reding@gmail.com> wrote:
[...]
> > Of course there's still the matter of some types of devices physically
> > disappearing (USB, PCI, ...).
> 
> Right.  In some cases removal is simply necessary as part of the
> cleanup, like after a surprise hot-unplug of a device, for example.
> In those cases everything that depended on the device that went away
> should be unbound from drivers at least IMO.

Agreed.

> > Force-removing drivers that depend on a device that's being unbound
> > would be a possibility to solve the problem where consumers depend on a
> > device that could physically go away. It might also be the right thing
> > to do in any case. Presumably somebody unloading a module want to do
> > just that, and refusing to do so isn't playing very nice. Of course
> > allowing random modules to be removed even if a lot of consumers might
> > depend on it may not be friendly either. Consider if you unload a GPIO
> > driver that provides a pin that's used to enable power to an eMMC that
> > might have the root filesystem.
> >
> > Then again, if you unload a module you better know what you're doing
> > anyway, so maybe that's not something we need to be concerned about.
> 
> I think that it's better to fail module unloads in such cases by
> default (to prevent simple silly mistakes from having possibly severe
> consequences), but if a "force" option is used, we should regard that
> as "the user really means it" and do as requested.  That would be very
> much analogous to the hot-unplug situation from the software
> perspective.

Sounds very reasonable to me.

> > I think this would also tie in nicely with Tomeu's patch set to do
> > on-demand probing. Essentially a [dev_]*_get() call could in turn call
> > this new "declare dependency" API, and the new API could underneath do
> > on-demand probing.
> >
> > Given that this isn't a strictly PM mechanism anymore, perhaps something
> > like:
> >
> >         int device_depend(struct device *dev, struct device *target);
> >
> > would be a more generic option.
> 
> I thought about something like link_device(dev, target, flags), where
> the last argument would indicate what the core is supposed to use the
> link for (removal handling, system suspend/resume, runtime PM etc).

Sounds good to me. I think the core isn't quite consistent on the naming
of functions, so we have things like device_register/unregister() versus
get/put_device(). I'd lean towards device_link(dev, target, flags), but
I'll go with any color you'd like the shed to have.

> And I agree that this isn't really PM-specific.
> 
> OK, thanks a lot for the feedback!
> 
> Let me think about that a bit more and I'll try to come up with a more
> detailed design description.

This sounds like it's not going to make it into v4.3 anymore, so I'll
need to think about the easiest way to (temporarily) fix up the current
regression.

Is this something that you will have time to implement yourself? If so,
please keep me in the loop and Cc me on any patches that you need
tested. If you're short on time, let me know as well and I'll see if I
can take a stab at it myself, though I'm pretty sure I'll need further
guidance along the way.

Thierry
Alan Stern Sept. 21, 2015, 2:34 p.m. UTC | #39
On Mon, 21 Sep 2015, Thierry Reding wrote:

> > > Force-removing drivers that depend on a device that's being unbound
> > > would be a possibility to solve the problem where consumers depend on a
> > > device that could physically go away. It might also be the right thing
> > > to do in any case. Presumably somebody unloading a module want to do
> > > just that, and refusing to do so isn't playing very nice. Of course
> > > allowing random modules to be removed even if a lot of consumers might
> > > depend on it may not be friendly either. Consider if you unload a GPIO
> > > driver that provides a pin that's used to enable power to an eMMC that
> > > might have the root filesystem.
> > >
> > > Then again, if you unload a module you better know what you're doing
> > > anyway, so maybe that's not something we need to be concerned about.
> > 
> > I think that it's better to fail module unloads in such cases by
> > default (to prevent simple silly mistakes from having possibly severe
> > consequences), but if a "force" option is used, we should regard that
> > as "the user really means it" and do as requested.  That would be very
> > much analogous to the hot-unplug situation from the software
> > perspective.
> 
> Sounds very reasonable to me.

I'm not so sure about this.  For one thing, how are you going to 
distinguish which module unloads are safe?

For another, even if you do make this distinction, don't you think 
people will get into the habit of always using the "force" option?  My 
impression is that most module unloads end up causing some device to be 
unbound from a driver, or even completely deregistered.

Right now the kernel uses the "You better know what you're doing when
you unload a module" point of view, and I don't see any good reasons
for changing.

Alan Stern

--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Rafael J. Wysocki Sept. 22, 2015, 12:26 a.m. UTC | #40
On Monday, September 21, 2015 10:34:54 AM Alan Stern wrote:
> On Mon, 21 Sep 2015, Thierry Reding wrote:
> 
> > > > Force-removing drivers that depend on a device that's being unbound
> > > > would be a possibility to solve the problem where consumers depend on a
> > > > device that could physically go away. It might also be the right thing
> > > > to do in any case. Presumably somebody unloading a module want to do
> > > > just that, and refusing to do so isn't playing very nice. Of course
> > > > allowing random modules to be removed even if a lot of consumers might
> > > > depend on it may not be friendly either. Consider if you unload a GPIO
> > > > driver that provides a pin that's used to enable power to an eMMC that
> > > > might have the root filesystem.
> > > >
> > > > Then again, if you unload a module you better know what you're doing
> > > > anyway, so maybe that's not something we need to be concerned about.
> > > 
> > > I think that it's better to fail module unloads in such cases by
> > > default (to prevent simple silly mistakes from having possibly severe
> > > consequences), but if a "force" option is used, we should regard that
> > > as "the user really means it" and do as requested.  That would be very
> > > much analogous to the hot-unplug situation from the software
> > > perspective.
> > 
> > Sounds very reasonable to me.
> 
> I'm not so sure about this.  For one thing, how are you going to 
> distinguish which module unloads are safe?

Ones that have no dependencies?

Very simple: If the driver being unloaded is explicitly depended on by
something (ie. there is a "device link" to it), we fail the unload with
-EBUSY unless the "force" option is used.

> For another, even if you do make this distinction, don't you think 
> people will get into the habit of always using the "force" option?  My 
> impression is that most module unloads end up causing some device to be 
> unbound from a driver, or even completely deregistered.
> 
> Right now the kernel uses the "You better know what you're doing when
> you unload a module" point of view, and I don't see any good reasons
> for changing.

Well, is that point of view appropriate from the users' perspective?

Thanks,
Rafael

--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Rafael J. Wysocki Sept. 22, 2015, 12:39 a.m. UTC | #41
Hi,

On Mon, Sep 21, 2015 at 10:51 AM, Thierry Reding
<thierry.reding@gmail.com> wrote:
> On Sat, Sep 19, 2015 at 01:07:56AM +0200, Rafael J. Wysocki wrote:
>> On Fri, Sep 18, 2015 at 5:55 PM, Thierry Reding <thierry.reding@gmail.com> wrote:
> [...]
>> > Of course there's still the matter of some types of devices physically
>> > disappearing (USB, PCI, ...).
>>
>> Right.  In some cases removal is simply necessary as part of the
>> cleanup, like after a surprise hot-unplug of a device, for example.
>> In those cases everything that depended on the device that went away
>> should be unbound from drivers at least IMO.
>
> Agreed.
>
>> > Force-removing drivers that depend on a device that's being unbound
>> > would be a possibility to solve the problem where consumers depend on a
>> > device that could physically go away. It might also be the right thing
>> > to do in any case. Presumably somebody unloading a module want to do
>> > just that, and refusing to do so isn't playing very nice. Of course
>> > allowing random modules to be removed even if a lot of consumers might
>> > depend on it may not be friendly either. Consider if you unload a GPIO
>> > driver that provides a pin that's used to enable power to an eMMC that
>> > might have the root filesystem.
>> >
>> > Then again, if you unload a module you better know what you're doing
>> > anyway, so maybe that's not something we need to be concerned about.
>>
>> I think that it's better to fail module unloads in such cases by
>> default (to prevent simple silly mistakes from having possibly severe
>> consequences), but if a "force" option is used, we should regard that
>> as "the user really means it" and do as requested.  That would be very
>> much analogous to the hot-unplug situation from the software
>> perspective.
>
> Sounds very reasonable to me.
>
>> > I think this would also tie in nicely with Tomeu's patch set to do
>> > on-demand probing. Essentially a [dev_]*_get() call could in turn call
>> > this new "declare dependency" API, and the new API could underneath do
>> > on-demand probing.
>> >
>> > Given that this isn't a strictly PM mechanism anymore, perhaps something
>> > like:
>> >
>> >         int device_depend(struct device *dev, struct device *target);
>> >
>> > would be a more generic option.
>>
>> I thought about something like link_device(dev, target, flags), where
>> the last argument would indicate what the core is supposed to use the
>> link for (removal handling, system suspend/resume, runtime PM etc).
>
> Sounds good to me. I think the core isn't quite consistent on the naming
> of functions, so we have things like device_register/unregister() versus
> get/put_device(). I'd lean towards device_link(dev, target, flags), but
> I'll go with any color you'd like the shed to have.

Well, whatever.

n any case it would be good to have "link" and "device" in the name,
regardless of the ordering. :-)

>> And I agree that this isn't really PM-specific.
>>
>> OK, thanks a lot for the feedback!
>>
>> Let me think about that a bit more and I'll try to come up with a more
>> detailed design description.
>
> This sounds like it's not going to make it into v4.3 anymore, so I'll
> need to think about the easiest way to (temporarily) fix up the current
> regression.
>
> Is this something that you will have time to implement yourself? If so,
> please keep me in the loop and Cc me on any patches that you need
> tested. If you're short on time, let me know as well and I'll see if I
> can take a stab at it myself, though I'm pretty sure I'll need further
> guidance along the way.

I'd like to try to do that myself, but that'll take some time.  I hope
this isn't a problem.

Given the time frame it should be doable for v4.4 in theory.

Thanks,
Rafael
--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Alan Stern Sept. 22, 2015, 1:22 a.m. UTC | #42
On Tue, 22 Sep 2015, Rafael J. Wysocki wrote:

> On Monday, September 21, 2015 10:34:54 AM Alan Stern wrote:
> > On Mon, 21 Sep 2015, Thierry Reding wrote:
> > 
> > > > > Force-removing drivers that depend on a device that's being unbound
> > > > > would be a possibility to solve the problem where consumers depend on a
> > > > > device that could physically go away. It might also be the right thing
> > > > > to do in any case. Presumably somebody unloading a module want to do
> > > > > just that, and refusing to do so isn't playing very nice. Of course
> > > > > allowing random modules to be removed even if a lot of consumers might
> > > > > depend on it may not be friendly either. Consider if you unload a GPIO
> > > > > driver that provides a pin that's used to enable power to an eMMC that
> > > > > might have the root filesystem.
> > > > >
> > > > > Then again, if you unload a module you better know what you're doing
> > > > > anyway, so maybe that's not something we need to be concerned about.
> > > > 
> > > > I think that it's better to fail module unloads in such cases by
> > > > default (to prevent simple silly mistakes from having possibly severe
> > > > consequences), but if a "force" option is used, we should regard that
> > > > as "the user really means it" and do as requested.  That would be very
> > > > much analogous to the hot-unplug situation from the software
> > > > perspective.
> > > 
> > > Sounds very reasonable to me.
> > 
> > I'm not so sure about this.  For one thing, how are you going to 
> > distinguish which module unloads are safe?
> 
> Ones that have no dependencies?

Does a parent-child relationship count as a dependency?  AFAIK, that's 
the only sort of dependency we have for mounted filesystems, in 
general.

> Very simple: If the driver being unloaded is explicitly depended on by
> something (ie. there is a "device link" to it), we fail the unload with
> -EBUSY unless the "force" option is used.

Hmmm.  In other words, you will iterate through all the drivers,
checking the ->owner field against the module being removed.  Then for
each matching driver, you will iterate through all the bound devices,
and all their descendants, looking for links of the appropriate type.  
And perhaps do the same thing for subsystems and classes.

Doable, I guess, even if it is inelegant.  Fortunately module unload
has no timing requirements.

> > For another, even if you do make this distinction, don't you think 
> > people will get into the habit of always using the "force" option?  My 
> > impression is that most module unloads end up causing some device to be 
> > unbound from a driver, or even completely deregistered.
> > 
> > Right now the kernel uses the "You better know what you're doing when
> > you unload a module" point of view, and I don't see any good reasons
> > for changing.
> 
> Well, is that point of view appropriate from the users' perspective?

I suspect people will find it more appropriate than the new point of
view you are proposing.  I don't know how often people unload modules,
but I suspect almost every time they do, it involves unregistering one
or more devices.  Goodness knows how many of those devices will be the
target of a dependency link, though.

Alan Stern

--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/drivers/base/dd.c b/drivers/base/dd.c
index be0eb4639128..56291b11049b 100644
--- a/drivers/base/dd.c
+++ b/drivers/base/dd.c
@@ -88,16 +88,6 @@  static void deferred_probe_work_func(struct work_struct *work)
 		 */
 		mutex_unlock(&deferred_probe_mutex);
 
-		/*
-		 * Force the device to the end of the dpm_list since
-		 * the PM code assumes that the order we add things to
-		 * the list is a good order for suspend but deferred
-		 * probe makes that very unsafe.
-		 */
-		device_pm_lock();
-		device_pm_move_last(dev);
-		device_pm_unlock();
-
 		dev_dbg(dev, "Retrying from deferred list\n");
 		bus_probe_device(dev);
 
@@ -312,6 +302,29 @@  static int really_probe(struct device *dev, struct device_driver *drv)
 	 */
 	devices_kset_move_last(dev);
 
+	/*
+	 * Force the device to the end of the dpm_list since the PM code
+	 * assumes that the order we add things to the list is a good order
+	 * for suspend but deferred probe makes that very unsafe.
+	 *
+	 * Deferred probe can also cause situations in which a device that is
+	 * a dependency for others gets moved further down the dpm_list as a
+	 * result of probe deferral. In that case the dependee will end up
+	 * getting suspended before any of its dependers.
+	 *
+	 * To ensure proper ordering of suspend/resume, move every device that
+	 * is being probed to the end of the dpm_list. Note that technically
+	 * only successfully probed devices need to be moved, but that breaks
+	 * for recursively added devices because they would end up in the list
+	 * in reverse of the desired order, so we simply do it unconditionally
+	 * for all devices before they are being probed. In the worst case the
+	 * list will be reordered a couple more times than necessary, which
+	 * should be an insignificant amount of work.
+	 */
+	device_pm_lock();
+	device_pm_move_last(dev);
+	device_pm_unlock();
+
 	if (dev->bus->probe) {
 		ret = dev->bus->probe(dev);
 		if (ret)