[RFC] drivers: power: Add watchdog timer to catch drivers which lockup during suspend.

Message ID 1367360914-23389-1-git-send-email-zoran.markovic@linaro.org (mailing list archive)
State RFC, archived

Commit Message

Zoran Markovic April 30, 2013, 10:28 p.m. UTC
From: Benoit Goby <benoit@android.com>

Below is a patch from the Android kernel that detects a driver suspend
lockup and captures a dump in the kernel log. Please review and provide
comments.

Rather than hard-lock the kernel, dump the suspend thread stack and
BUG() when a driver takes too long to suspend.  The timeout is set to
12 seconds to be longer than the usbhid 10 second timeout.

Exclude from the watchdog the time spent waiting for children that
are resumed asynchronously and time every device, whether or not they
resumed synchronously.

Cc: Android Kernel Team <kernel-team@android.com>
Cc: Colin Cross <ccross@android.com>
Cc: Todd Poynor <toddpoynor@google.com>
Cc: San Mehat <san@google.com>
Cc: Benoit Goby <benoit@android.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Len Brown <len.brown@intel.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Original-author: San Mehat <san@google.com>
Signed-off-by: Benoit Goby <benoit@android.com>
[zoran.markovic@linaro.org: Changed printk(KERN_EMERG,...) to pr_emerg(...),
tweaked commit message.]
Signed-off-by: Zoran Markovic <zoran.markovic@linaro.org>
---
 drivers/base/power/main.c |   45 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 45 insertions(+)

Comments

Greg KH April 30, 2013, 11:30 p.m. UTC | #1
On Tue, Apr 30, 2013 at 03:28:33PM -0700, Zoran Markovic wrote:
> From: Benoit Goby <benoit@android.com>
> 
> Below is a patch from android kernel that detects a driver suspend
> lockup and captures dump in the kernel log. Please review and provide
> comments.

There's this really cool thing called a watchdog driver that does stuff
like this :)

> Rather than hard-lock the kernel, dump the suspend thread stack and
> BUG() when a driver takes too long to suspend.  The timeout is set to
> 12 seconds to be longer than the usbhid 10 second timeout.
> 
> Exclude from the watchdog the time spent waiting for children that
> are resumed asynchronously and time every device, whether or not they
> resumed synchronously.

No, don't add a driver-core-only timer, use the existing watchdog timers
if you are worried about the kernel locking up.

thanks,

greg k-h
--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Pavel Machek May 1, 2013, 12:30 a.m. UTC | #2
Hi!

> Below is a patch from android kernel that detects a driver suspend
> lockup and captures dump in the kernel log. Please review and provide
> comments.
> 
> Rather than hard-lock the kernel, dump the suspend thread stack and
> BUG() when a driver takes too long to suspend.  The timeout is set to
> 12 seconds to be longer than the usbhid 10 second timeout.
> 
> Exclude from the watchdog the time spent waiting for children that
> are resumed asynchronously and time every device, whether or not they
> resumed synchronously.
> 
> Cc: Android Kernel Team <kernel-team@android.com>
> Cc: Colin Cross <ccross@android.com>
> Cc: Todd Poynor <toddpoynor@google.com>
> Cc: San Mehat <san@google.com>
> Cc: Benoit Goby <benoit@android.com>
> Cc: John Stultz <john.stultz@linaro.org>
> Cc: Pavel Machek <pavel@ucw.cz>
> Cc: Rafael J. Wysocki <rjw@sisk.pl>
> Cc: Len Brown <len.brown@intel.com>
> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> Original-author: San Mehat <san@google.com>
> Signed-off-by: Benoit Goby <benoit@android.com>
> [zoran.markovic@linaro.org: Changed printk(KERN_EMERG,...) to pr_emerg(...),
> tweaked commit message.]
> Signed-off-by: Zoran Markovic <zoran.markovic@linaro.org>
> ---
>  drivers/base/power/main.c |   45 +++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 45 insertions(+)
> 
> diff --git a/drivers/base/power/main.c b/drivers/base/power/main.c
> index 15beb50..eb70c0e 100644
> --- a/drivers/base/power/main.c
> +++ b/drivers/base/power/main.c
> @@ -29,6 +29,8 @@
>  #include <linux/async.h>
>  #include <linux/suspend.h>
>  #include <linux/cpuidle.h>
> +#include <linux/timer.h>
> +
>  #include "../base.h"
>  #include "power.h"
>  
> @@ -54,6 +56,12 @@ struct suspend_stats suspend_stats;
>  static DEFINE_MUTEX(dpm_list_mtx);
>  static pm_message_t pm_transition;
>  
> +static void dpm_drv_timeout(unsigned long data);
> +struct dpm_drv_wd_data {
> +	struct device *dev;
> +	struct task_struct *tsk;
> +};
> +
>  static int async_error;
>  
>  /**
> @@ -663,6 +671,30 @@ static bool is_async(struct device *dev)
>  }
>  
>  /**
> + *     dpm_drv_timeout - Driver suspend / resume watchdog handler
> + *     @data: struct device which timed out
> + *
> + *     Called when a driver has timed out suspending or resuming.
> + *     There's not much we can do here to recover so
> + *     BUG() out for a crash-dump
> + *
> + */
> +static void dpm_drv_timeout(unsigned long data)
> +{
> +	struct dpm_drv_wd_data *wd_data = (void *)data;
> +	struct device *dev = wd_data->dev;
> +	struct task_struct *tsk = wd_data->tsk;
> +
> +	pr_emerg("**** DPM device timeout: %s (%s)\n", dev_name(dev),
> +		(dev->driver ? dev->driver->name : "no driver"));
> +
> +	pr_emerg("dpm suspend stack:\n");
> +	show_stack(tsk, NULL);
> +
> +	BUG();
> +}

So you:

dump stack of the suspend task

do BUG which
   dumps stack of current task
   kills current task

The current task may very well be the idle task; in that case you kill
the machine. Sounds like you should be doing something else, like kill -9
instead of BUG()?
									Pavel
Colin Cross May 1, 2013, 3:36 a.m. UTC | #3
On Tue, Apr 30, 2013 at 4:30 PM, Greg Kroah-Hartman
<gregkh@linuxfoundation.org> wrote:
> On Tue, Apr 30, 2013 at 03:28:33PM -0700, Zoran Markovic wrote:
>> From: Benoit Goby <benoit@android.com>
>>
>> Below is a patch from android kernel that detects a driver suspend
>> lockup and captures dump in the kernel log. Please review and provide
>> comments.
>
> There's this really cool thing called a watchdog driver that does stuff
> like this :)

If the watchdog driver worked in this case this patch wouldn't exist.

>> Rather than hard-lock the kernel, dump the suspend thread stack and
>> BUG() when a driver takes too long to suspend.  The timeout is set to
>> 12 seconds to be longer than the usbhid 10 second timeout.
>>
>> Exclude from the watchdog the time spent waiting for children that
>> are resumed asynchronously and time every device, whether or not they
>> resumed synchronously.
>
> No, don't add a driver-core-only timer, use the existing watchdog timers
> if you are worried about the kernel locking up.

The watchdog timers are useless here.  For one, they generally stop
when their driver suspend op is called, so you may not even have one
running when you lock up.  More importantly, the purpose of this patch
is to tell you which driver locked up and hopefully why, and the
watchdog driver will usually result in a silent reset.  This patch
will cause a stack trace of the driver suspend op that is blocking
suspend progress, even if that call does not happen in the suspend
thread.
Colin Cross May 1, 2013, 3:39 a.m. UTC | #4
On Tue, Apr 30, 2013 at 5:30 PM, Pavel Machek <pavel@ucw.cz> wrote:
> Hi!
>
>> Below is a patch from android kernel that detects a driver suspend
>> lockup and captures dump in the kernel log. Please review and provide
>> comments.
>>
>> Rather than hard-lock the kernel, dump the suspend thread stack and
>> BUG() when a driver takes too long to suspend.  The timeout is set to
>> 12 seconds to be longer than the usbhid 10 second timeout.
>>
>> Exclude from the watchdog the time spent waiting for children that
>> are resumed asynchronously and time every device, whether or not they
>> resumed synchronously.
>>
>> Cc: Android Kernel Team <kernel-team@android.com>
>> Cc: Colin Cross <ccross@android.com>
>> Cc: Todd Poynor <toddpoynor@google.com>
>> Cc: San Mehat <san@google.com>
>> Cc: Benoit Goby <benoit@android.com>
>> Cc: John Stultz <john.stultz@linaro.org>
>> Cc: Pavel Machek <pavel@ucw.cz>
>> Cc: Rafael J. Wysocki <rjw@sisk.pl>
>> Cc: Len Brown <len.brown@intel.com>
>> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
>> Original-author: San Mehat <san@google.com>
>> Signed-off-by: Benoit Goby <benoit@android.com>
>> [zoran.markovic@linaro.org: Changed printk(KERN_EMERG,...) to pr_emerg(...),
>> tweaked commit message.]
>> Signed-off-by: Zoran Markovic <zoran.markovic@linaro.org>
>> ---
>>  drivers/base/power/main.c |   45 +++++++++++++++++++++++++++++++++++++++++++++
>>  1 file changed, 45 insertions(+)
>>
>> diff --git a/drivers/base/power/main.c b/drivers/base/power/main.c
>> index 15beb50..eb70c0e 100644
>> --- a/drivers/base/power/main.c
>> +++ b/drivers/base/power/main.c
>> @@ -29,6 +29,8 @@
>>  #include <linux/async.h>
>>  #include <linux/suspend.h>
>>  #include <linux/cpuidle.h>
>> +#include <linux/timer.h>
>> +
>>  #include "../base.h"
>>  #include "power.h"
>>
>> @@ -54,6 +56,12 @@ struct suspend_stats suspend_stats;
>>  static DEFINE_MUTEX(dpm_list_mtx);
>>  static pm_message_t pm_transition;
>>
>> +static void dpm_drv_timeout(unsigned long data);
>> +struct dpm_drv_wd_data {
>> +     struct device *dev;
>> +     struct task_struct *tsk;
>> +};
>> +
>>  static int async_error;
>>
>>  /**
>> @@ -663,6 +671,30 @@ static bool is_async(struct device *dev)
>>  }
>>
>>  /**
>> + *     dpm_drv_timeout - Driver suspend / resume watchdog handler
>> + *     @data: struct device which timed out
>> + *
>> + *     Called when a driver has timed out suspending or resuming.
>> + *     There's not much we can do here to recover so
>> + *     BUG() out for a crash-dump
>> + *
>> + */
>> +static void dpm_drv_timeout(unsigned long data)
>> +{
>> +     struct dpm_drv_wd_data *wd_data = (void *)data;
>> +     struct device *dev = wd_data->dev;
>> +     struct task_struct *tsk = wd_data->tsk;
>> +
>> +     pr_emerg("**** DPM device timeout: %s (%s)\n", dev_name(dev),
>> +             (dev->driver ? dev->driver->name : "no driver"));
>> +
>> +     pr_emerg("dpm suspend stack:\n");
>> +     show_stack(tsk, NULL);
>> +
>> +     BUG();
>> +}
>
> So you:
>
> dump stack of the suspend task
It dumps the stack of the suspend task if the suspend callback is run
synchronously, or the async task if the suspend op is run
asynchronously.

> do BUG which
>    dumps stack of current task
>    kills current task
>
> Current task may very well be idle task; in such case you kill the
> machine. Sounds like you should be doing something else, like kill -9
> instead of BUG()?

Not much else you can do, you are stuck part way into suspend with a
driver's suspend callback half executed.  All userspace tasks are
frozen, and the suspend task is blocked indefinitely.
Greg KH May 1, 2013, 4:17 a.m. UTC | #5
On Tue, Apr 30, 2013 at 08:36:21PM -0700, Colin Cross wrote:
> On Tue, Apr 30, 2013 at 4:30 PM, Greg Kroah-Hartman
> <gregkh@linuxfoundation.org> wrote:
> > On Tue, Apr 30, 2013 at 03:28:33PM -0700, Zoran Markovic wrote:
> >> From: Benoit Goby <benoit@android.com>
> >>
> >> Below is a patch from android kernel that detects a driver suspend
> >> lockup and captures dump in the kernel log. Please review and provide
> >> comments.
> >
> > There's this really cool thing called a watchdog driver that does stuff
> > like this :)
> 
> If the watchdog driver worked in this case this patch wouldn't exist.

Great, let's fix the watchdog timer then :)

What's wrong with it?

> >> Rather than hard-lock the kernel, dump the suspend thread stack and
> >> BUG() when a driver takes too long to suspend.  The timeout is set to
> >> 12 seconds to be longer than the usbhid 10 second timeout.
> >>
> >> Exclude from the watchdog the time spent waiting for children that
> >> are resumed asynchronously and time every device, whether or not they
> >> resumed synchronously.
> >
> > No, don't add a driver-core-only timer, use the existing watchdog timers
> > if you are worried about the kernel locking up.
> 
> The watchdog timers are useless here.  For one, they generally stop
> when their driver suspend op is called, so you may not even have one
> running when you lock up.

But you can fix that, right?

> More importantly, the purpose of this patch is to tell you which
> driver locked up and hopefully why, and the watchdog driver will
> usually result in a silent reset.

I thought it was an option as to what the watchdog does when it
triggers.

> This patch will cause a stack trace of the driver suspend op that is
> blocking suspend progress, even if that call does not happen in the
> suspend thread.

But who can see this, the machine is now dead.

greg k-h
Colin Cross May 1, 2013, 4:39 a.m. UTC | #6
On Tue, Apr 30, 2013 at 9:17 PM, Greg Kroah-Hartman
<gregkh@linuxfoundation.org> wrote:
> On Tue, Apr 30, 2013 at 08:36:21PM -0700, Colin Cross wrote:
>> On Tue, Apr 30, 2013 at 4:30 PM, Greg Kroah-Hartman
>> <gregkh@linuxfoundation.org> wrote:
>> > On Tue, Apr 30, 2013 at 03:28:33PM -0700, Zoran Markovic wrote:
>> >> From: Benoit Goby <benoit@android.com>
>> >>
>> >> Below is a patch from android kernel that detects a driver suspend
>> >> lockup and captures dump in the kernel log. Please review and provide
>> >> comments.
>> >
>> > There's this really cool thing called a watchdog driver that does stuff
>> > like this :)
>>
>> If the watchdog driver worked in this case this patch wouldn't exist.
>
> Great, let's fix the watchdog timer then :)
>
> What's wrong with it?
>
>> >> Rather than hard-lock the kernel, dump the suspend thread stack and
>> >> BUG() when a driver takes too long to suspend.  The timeout is set to
>> >> 12 seconds to be longer than the usbhid 10 second timeout.
>> >>
>> >> Exclude from the watchdog the time spent waiting for children that
>> >> are resumed asynchronously and time every device, whether or not they
>> >> resumed synchronously.
>> >
>> > No, don't add a driver-core-only timer, use the existing watchdog timers
>> > if you are worried about the kernel locking up.
>>
>> The watchdog timers are useless here.  For one, they generally stop
>> when their driver suspend op is called, so you may not even have one
>> running when you lock up.
>
> But you can fix that, right?

Ah, you're talking about the lockup detectors, and not drivers/watchdog.

The hardlockup detector can tell you if timer interrupts are not
firing, which is unaffected by this patch since the timer wouldn't
fire anyway.  The softlockup detector could eventually tell you that
tasks were not being scheduled, but not why.  Even panic on softlockup
will only get you the stack trace of the current task, which will be
the locked up task if it is spinning, but is likely to be the idle
task if the suspend task is blocked on a wait_event.  This patch will
give the stack trace of the suspend operation that is blocked, even if
it is an asynchronous suspend callback.

>> More importantly, the purpose of this patch is to tell you which
>> driver locked up and hopefully why, and the watchdog driver will
>> usually result in a silent reset.
>
> I thought it was an option as to what the watchdog does when it
> triggers.
>
>> This patch will cause a stack trace of the driver suspend op that is
>> blocking suspend progress, even if that call does not happen in the
>> suspend thread.
>
> But who can see this, the machine is now dead.

I'm not sure what might still be working in this situation on x86, but
on ARM the machine is dead anyways.  Some random subset of drivers are
suspended, so you probably have no hardware watchdog, no console, no
video.  kexec on panic, kgdb on panic, console messages saved in
pstore, or jtag are the only options I know of.  This patch is very
useful in conjunction with pstore console.
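
The distinction drawn in this reply can be illustrated with a minimal
userspace sketch. It is not kernel code and not part of the original
message. The names struct task, report_current(), report_recorded() and
demo_reports_differ() are hypothetical. The sketch only models why a handler
that reports the task recorded when the watchdog was armed names the blocked
suspend operation, while a detector that can only see whatever is running at
expiry time is likely to name the idle task.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for a kernel task; only the name matters here. */
struct task {
	const char *name;
};

/* Mirrors the idea of struct dpm_drv_wd_data in the patch: remember
 * which device is being suspended and which task is suspending it. */
struct wd_data {
	const char *dev_name;
	const struct task *tsk;
};

/* Models a detector that only sees whatever runs when it fires. */
void report_current(const struct task *running, char *buf, size_t len)
{
	assert(buf != NULL && len > 0);
	snprintf(buf, len, "lockup? running task: %s", running->name);
}

/* Models the patch's handler: report the task recorded at arm time. */
void report_recorded(const struct wd_data *wd, char *buf, size_t len)
{
	assert(buf != NULL && len > 0);
	snprintf(buf, len, "timeout: %s, blocked task: %s",
		 wd->dev_name, wd->tsk->name);
}

/* Returns 1 when the two reports name different tasks. */
int demo_reports_differ(void)
{
	struct task idle = { "idle" };
	struct task suspender = { "async-suspend-worker" };
	struct wd_data wd = { "example-dev", &suspender };
	char seen_at_expiry[128];
	char recorded[128];

	report_current(&idle, seen_at_expiry, sizeof(seen_at_expiry));
	report_recorded(&wd, recorded, sizeof(recorded));
	return strcmp(seen_at_expiry, recorded) != 0;
}
```

In the real patch the recorded task is passed to show_stack(), so the trace
follows the task that armed the timer regardless of which task happens to be
running when the timer fires.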
Colin Cross May 1, 2013, 5:14 a.m. UTC | #7
On Tue, Apr 30, 2013 at 9:57 PM, anish singh
<anish198519851985@gmail.com> wrote:
>
>
>
> On Wed, May 1, 2013 at 10:09 AM, Colin Cross <ccross@android.com> wrote:
>>
>> On Tue, Apr 30, 2013 at 9:17 PM, Greg Kroah-Hartman
>> <gregkh@linuxfoundation.org> wrote:
>> > On Tue, Apr 30, 2013 at 08:36:21PM -0700, Colin Cross wrote:
>> >> On Tue, Apr 30, 2013 at 4:30 PM, Greg Kroah-Hartman
>> >> <gregkh@linuxfoundation.org> wrote:
>> >> > On Tue, Apr 30, 2013 at 03:28:33PM -0700, Zoran Markovic wrote:
>> >> >> From: Benoit Goby <benoit@android.com>
>> >> >>
>> >> >> Below is a patch from android kernel that detects a driver suspend
>> >> >> lockup and captures dump in the kernel log. Please review and
>> >> >> provide
>> >> >> comments.
>> >> >
>> >> > There's this really cool thing called a watchdog driver that does
>> >> > stuff
>> >> > like this :)
>> >>
>> >> If the watchdog driver worked in this case this patch wouldn't exist.
>> >
>> > Great, let's fix the watchdog timer then :)
>> >
>> > What's wrong with it?
>> >
>> >> >> Rather than hard-lock the kernel, dump the suspend thread stack and
>> >> >> BUG() when a driver takes too long to suspend.  The timeout is set
>> >> >> to
>> >> >> 12 seconds to be longer than the usbhid 10 second timeout.
>> >> >>
>> >> >> Exclude from the watchdog the time spent waiting for children that
>> >> >> are resumed asynchronously and time every device, whether or not
>> >> >> they
>> >> >> resumed synchronously.
>> >> >
>> >> > No, don't add a driver-core-only timer, use the existing watchdog
>> >> > timers
>> >> > if you are worried about the kernel locking up.
>> >>
>> >> The watchdog timers are useless here.  For one, they generally stop
>> >> when their driver suspend op is called, so you may not even have one
>> >> running when you lock up.
>> >
>> > But you can fix that, right?
>>
>> Ah, you're talking about the lockup detectors, and not drivers/watchdog.
>>
>> The hardlockup detector can tell you if timer interrupts are not
>> firing, which is unaffected by this patch since the timer wouldn't
>> fire any way.  The softlockup detector could eventually tell you that
>
> I was wondering: if timers don't fire, then how does your dpm_drv_timeout
> function get called?

That's what I meant - this patch doesn't do anything if timers are not firing.

>>
>> tasks were not being scheduled, but not why.  Even panic on softlockup
>> will only get you the stack trace of the current task, which will be
>> the locked up task if it is spinning, but is likely to be the idle
>> task if the suspend task is blocked on a wait_event.  This patch will
>> give the stack trace of the suspend operation that is blocked, even if
>> it is an asynchronous suspend callback.
>
> ......but not when the timers themselves are not firing, right?
>>
>> >> More importantly, the purpose of this patch is to tell you which
>> >> driver locked up and hopefully why, and the watchdog driver will
>> >> usually result in a silent reset.
>> >
>> > I thought it was an option as to what the watchdog does when it
>> > triggers.
>> >
>> >> This patch will cause a stack trace of the driver suspend op that is
>> >> blocking suspend progress, even if that call does not happen in the
>> >> suspend thread.
>> >
>> > But who can see this, the machine is now dead.
>>
>> I'm not sure what might still be working in this situation on x86, but
>> on ARM the machine is dead anyways.  Some random subset of drivers are
>> suspended, so you probably have no hardware watchdog, no console, no
>> video.  kexec on panic, kgdb on panic, console messages saved in
>> pstore, or jtag are the only options I know of.  This patch is very
>> useful in conjunction with pstore console.
Pavel Machek May 1, 2013, 10:56 a.m. UTC | #8
Hi!

> >> @@ -663,6 +671,30 @@ static bool is_async(struct device *dev)
> >>  }
> >>
> >>  /**
> >> + *     dpm_drv_timeout - Driver suspend / resume watchdog handler
> >> + *     @data: struct device which timed out
> >> + *
> >> + *     Called when a driver has timed out suspending or resuming.
> >> + *     There's not much we can do here to recover so
> >> + *     BUG() out for a crash-dump
> >> + *
> >> + */
> >> +static void dpm_drv_timeout(unsigned long data)
> >> +{
> >> +     struct dpm_drv_wd_data *wd_data = (void *)data;
> >> +     struct device *dev = wd_data->dev;
> >> +     struct task_struct *tsk = wd_data->tsk;
> >> +
> >> +     pr_emerg("**** DPM device timeout: %s (%s)\n", dev_name(dev),
> >> +             (dev->driver ? dev->driver->name : "no driver"));
> >> +
> >> +     pr_emerg("dpm suspend stack:\n");
> >> +     show_stack(tsk, NULL);
> >> +
> >> +     BUG();
> >> +}
> >
> > So you:
> >
> > dump stack of the suspend task
> It dumps the stack of the suspend task if the suspend callback is run
> synchronously, or the async task if the suspend op is run
> asynchronously.

Let's call that [a]suspend task.

> > do BUG which
> >    dumps stack of current task
> >    kills current task
> >
> > Current task may very well be idle task; in such case you kill the
> > machine. Sounds like you should be doing something else, like kill -9
> > instead of BUG()?
> 
> Not much else you can do, you are stuck part way into suspend with a
> driver's suspend callback half executed.  All userspace tasks are
> frozen, and the suspend task is blocked indefinitely.

Yes, there's a better option. Attempt killing the [a]suspend task,
instead of killing the current task.

Try putting mdelay(100000) into the suspend path. Your patch will do the
wrong thing in that case (actually turning a debuggable problem into
an undebuggable one).
									Pavel
Colin Cross May 1, 2013, 4:10 p.m. UTC | #9
On Wed, May 1, 2013 at 3:56 AM, Pavel Machek <pavel@ucw.cz> wrote:
> Hi!
>
>> >> @@ -663,6 +671,30 @@ static bool is_async(struct device *dev)
>> >>  }
>> >>
>> >>  /**
>> >> + *     dpm_drv_timeout - Driver suspend / resume watchdog handler
>> >> + *     @data: struct device which timed out
>> >> + *
>> >> + *     Called when a driver has timed out suspending or resuming.
>> >> + *     There's not much we can do here to recover so
>> >> + *     BUG() out for a crash-dump
>> >> + *
>> >> + */
>> >> +static void dpm_drv_timeout(unsigned long data)
>> >> +{
>> >> +     struct dpm_drv_wd_data *wd_data = (void *)data;
>> >> +     struct device *dev = wd_data->dev;
>> >> +     struct task_struct *tsk = wd_data->tsk;
>> >> +
>> >> +     pr_emerg("**** DPM device timeout: %s (%s)\n", dev_name(dev),
>> >> +             (dev->driver ? dev->driver->name : "no driver"));
>> >> +
>> >> +     pr_emerg("dpm suspend stack:\n");
>> >> +     show_stack(tsk, NULL);
>> >> +
>> >> +     BUG();
>> >> +}
>> >
>> > So you:
>> >
>> > dump stack of the suspend task
>> It dumps the stack of the suspend task if the suspend callback is run
>> synchronously, or the async task if the suspend op is run
>> asynchronously.
>
> Lets call that [a]suspend task.
>
>> > do BUG which
>> >    dumps stack of current task
>> >    kills current task
>> >
>> > Current task may very well be idle task; in such case you kill the
>> > machine. Sounds like you should be doing something else, like kill -9
>> > instead of BUG()?
>>
>> Not much else you can do, you are stuck part way into suspend with a
>> driver's suspend callback half executed.  All userspace tasks are
>> frozen, and the suspend task is blocked indefinitely.
>
> Yes, there's better option. Attempt killing the [a]suspend task,
> instead of killing the current task.

That will leave you in a completely undefined state.  If you just kill
the task, you are likely to kill the synchronous suspend task, which
is the task that would resume your drivers and unfreeze tasks.  That
will leave you with no userspace tasks running, and much of your
hardware suspended.  How is that a useful result?  If you somehow
respawn a resume thread to resume whatever hardware you can and
unfreeze tasks, you still have the hardware that was suspending when
it was killed in a bad state, and probably has locks held, so you're
just going to deadlock or crash soon after.

> Try putting mdelay(100000) into suspend path. Your patch will do the
> wrong thing in that case (actually turning debuggable problem into
> undebuggable one).

I'm not saying this patch as is is right for everyone (it probably at
least needs to be configurable to be turned off, change the delay, and
change the panic to just a stack trace), but from a mobile perspective
this patch is far more debuggable than without this patch.  We work
very hard to make sure that panics are highly debuggable; in fact we
often prefer panics over any other behavior when the device is in a
bad state, because it immediately gets the user's device working again
while still giving us useful information in our automatic log
collection.

With an mdelay(100000) in the suspend path, users in our debug device
pool are likely to just pull the battery because their screen won't
turn on, in which case I get no debugging information.  With this
patch, the device will automatically reboot due to the panic, and I
will get captured logs after reboot that show a stack trace ending
with mdelay, which tells me exactly where to look for this
mdelay(100000).
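
The three knobs mentioned in this reply (turning the watchdog off, changing
the delay, and producing only a stack trace instead of a panic) could be
modelled roughly as follows. This is a userspace sketch under assumed names,
not part of the original message and not something the posted patch
implements: struct wd_config, enum wd_action and wd_decide() are all
hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* What the watchdog would do once the timeout check runs. */
enum wd_action {
	WD_NONE,         /* disabled, or the device is still within budget */
	WD_STACK_TRACE,  /* log the blocked task's stack and carry on */
	WD_PANIC         /* log, then panic so the device reboots */
};

/* Hypothetical tunables; the posted patch hard-codes 12 seconds and BUG(). */
struct wd_config {
	bool enabled;
	unsigned int timeout_secs;
	bool panic_on_timeout;
};

/* Decide what to do for a device whose suspend callback has been
 * running for elapsed_secs seconds. */
enum wd_action wd_decide(const struct wd_config *cfg, unsigned int elapsed_secs)
{
	assert(cfg != NULL);

	if (!cfg->enabled)
		return WD_NONE;
	if (elapsed_secs < cfg->timeout_secs)
		return WD_NONE;
	return cfg->panic_on_timeout ? WD_PANIC : WD_STACK_TRACE;
}
```

With { true, 12, true } this matches the behaviour described for the mobile
case (panic, reboot, collect logs); with panic_on_timeout cleared it only
reports the stuck device.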
Greg KH May 1, 2013, 4:24 p.m. UTC | #10
On Wed, May 01, 2013 at 09:10:49AM -0700, Colin Cross wrote:
> I'm not saying this patch as is is right for everyone (it probably at
> least needs to be configurable to be turned off, change the delay, and
> change the panic to just a stack trace),

Those changes would be nice.

> but from a mobile perspective
> this patch is far more debuggable than without this patch.  We work
> very hard to make sure that panic's are highly debuggable, in fact we
> often prefer panics over any other behavior when the device is in a
> bad state, because it immediately gets the user's device working again
> while still giving us useful information in our automatic log
> collection.
> 
> With an mdelay(100000) in the suspend path, users in our debug device
> pool are likely to just pull the battery because their screen won't
> turn on, in which case I get no debugging information.  With this
> patch, the device will automatically reboot due to the panic, and I
> will get captured logs after reboot that show a stack trace ending
> with mdelay, which tells me exactly where to look for this
> mdelay(100000).

All of that information would be _great_ to have in the changelog for
the patch, as it explains exactly why you need this.  {hint}

thanks,

greg k-h
Pavel Machek May 2, 2013, 12:30 p.m. UTC | #11
Hi!

> >> > do BUG which
> >> >    dumps stack of current task
> >> >    kills current task
> >> >
> >> > Current task may very well be idle task; in such case you kill the
> >> > machine. Sounds like you should be doing something else, like kill -9
> >> > instead of BUG()?
> >>
> >> Not much else you can do, you are stuck part way into suspend with a
> >> driver's suspend callback half executed.  All userspace tasks are
> >> frozen, and the suspend task is blocked indefinitely.
> >
> > Yes, there's better option. Attempt killing the [a]suspend task,
> > instead of killing the current task.
> 
> That will leave you in a completely undefined state.  If you just kill
> the task, you are likely to kill the synchronous suspend task, which
> is the task that would resume your drivers and unfreeze tasks.  That
> will leave you with no userspace tasks running, and much of your
> hardware suspended.  How is that a useful result?  If you somehow

So instead you kill a random task? (BUG() from a timer kills a pretty much
random task, right?)

If you want to do panic(), do panic().
									Pavel
Colin Cross May 2, 2013, 6:25 p.m. UTC | #12
On Thu, May 2, 2013 at 5:30 AM, Pavel Machek <pavel@ucw.cz> wrote:
> Hi!
>
>> >> > do BUG which
>> >> >    dumps stack of current task
>> >> >    kills current task
>> >> >
>> >> > Current task may very well be idle task; in such case you kill the
>> >> > machine. Sounds like you should be doing something else, like kill -9
>> >> > instead of BUG()?
>> >>
>> >> Not much else you can do, you are stuck part way into suspend with a
>> >> driver's suspend callback half executed.  All userspace tasks are
>> >> frozen, and the suspend task is blocked indefinitely.
>> >
>> > Yes, there's better option. Attempt killing the [a]suspend task,
>> > instead of killing the current task.
>>
>> That will leave you in a completely undefined state.  If you just kill
>> the task, you are likely to kill the synchronous suspend task, which
>> is the task that would resume your drivers and unfreeze tasks.  That
>> will leave you with no userspace tasks running, and much of your
>> hardware suspended.  How is that a useful result?  If you somehow
>
> So instead you kill random task? (BUG() from timer kills pretty much
> random task, right?)
>
> If you want to do panic(), do panic().

At least on ARM a BUG() in an interrupt or softirq always results in a
panic, but this can be switched to directly call panic.
--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Patch

diff --git a/drivers/base/power/main.c b/drivers/base/power/main.c
index 15beb50..eb70c0e 100644
--- a/drivers/base/power/main.c
+++ b/drivers/base/power/main.c
@@ -29,6 +29,8 @@ 
 #include <linux/async.h>
 #include <linux/suspend.h>
 #include <linux/cpuidle.h>
+#include <linux/timer.h>
+
 #include "../base.h"
 #include "power.h"
 
@@ -54,6 +56,12 @@  struct suspend_stats suspend_stats;
 static DEFINE_MUTEX(dpm_list_mtx);
 static pm_message_t pm_transition;
 
+static void dpm_drv_timeout(unsigned long data);
+struct dpm_drv_wd_data {
+	struct device *dev;
+	struct task_struct *tsk;
+};
+
 static int async_error;
 
 /**
@@ -663,6 +671,30 @@  static bool is_async(struct device *dev)
 }
 
 /**
+ *     dpm_drv_timeout - Driver suspend / resume watchdog handler
+ *     @data: pointer to the struct dpm_drv_wd_data for the device which timed out
+ *
+ *     Called when a driver has timed out suspending or resuming.
+ *     There's not much we can do here to recover so
+ *     BUG() out for a crash-dump
+ *
+ */
+static void dpm_drv_timeout(unsigned long data)
+{
+	struct dpm_drv_wd_data *wd_data = (void *)data;
+	struct device *dev = wd_data->dev;
+	struct task_struct *tsk = wd_data->tsk;
+
+	pr_emerg("**** DPM device timeout: %s (%s)\n", dev_name(dev),
+		(dev->driver ? dev->driver->name : "no driver"));
+
+	pr_emerg("dpm suspend stack:\n");
+	show_stack(tsk, NULL);
+
+	BUG();
+}
+
+/**
  * dpm_resume - Execute "resume" callbacks for non-sysdev devices.
  * @state: PM transition of the system being carried out.
  *
@@ -1053,6 +1085,8 @@  static int __device_suspend(struct device *dev, pm_message_t state, bool async)
 	pm_callback_t callback = NULL;
 	char *info = NULL;
 	int error = 0;
+	struct timer_list timer;
+	struct dpm_drv_wd_data data;
 
 	dpm_wait_for_children(dev, async);
 
@@ -1076,6 +1110,14 @@  static int __device_suspend(struct device *dev, pm_message_t state, bool async)
 	if (dev->power.syscore)
 		goto Complete;
 
+	data.dev = dev;
+	data.tsk = get_current();
+	init_timer_on_stack(&timer);
+	timer.expires = jiffies + HZ * 12;
+	timer.function = dpm_drv_timeout;
+	timer.data = (unsigned long)&data;
+	add_timer(&timer);
+
 	device_lock(dev);
 
 	if (dev->pm_domain) {
@@ -1131,6 +1173,9 @@  static int __device_suspend(struct device *dev, pm_message_t state, bool async)
 
 	device_unlock(dev);
 
+	del_timer_sync(&timer);
+	destroy_timer_on_stack(&timer);
+
  Complete:
 	complete_all(&dev->power.completion);
 	if (error)
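
The expiry computed in the patch, jiffies + HZ * 12, relies on unsigned
tick arithmetic that stays correct when the tick counter wraps around. The
following userspace sketch illustrates that arithmetic only. SIM_HZ,
sim_expiry() and sim_expired() are hypothetical names, and the tick rate of
100 is an assumption for the example; the comparison follows the same idea
as the kernel's time_after_eq() helper.

```c
#include <stdbool.h>

#define SIM_HZ           100UL  /* assumed ticks per second for this example */
#define SIM_TIMEOUT_SECS 12UL   /* the 12 second timeout used by the patch */

/* Tick at which a watchdog armed at tick `now` would expire.
 * Unsigned overflow wraps, which is well defined in C. */
unsigned long sim_expiry(unsigned long now)
{
	return now + SIM_HZ * SIM_TIMEOUT_SECS;
}

/* True once `now` has reached or passed `expires`. Comparing the signed
 * difference keeps the answer right across a counter wrap, provided the
 * two values are less than half the counter range apart. The conversion
 * to long assumes the usual two's complement behaviour of common compilers. */
bool sim_expired(unsigned long now, unsigned long expires)
{
	return (long)(now - expires) >= 0;
}
```

A plain `now >= expires` comparison would report an immediate timeout for a
watchdog armed shortly before the counter wraps, which is why the signed
difference is used instead.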