Message ID | 1367360914-23389-1-git-send-email-zoran.markovic@linaro.org (mailing list archive) |
---|---|
State | RFC, archived |
On Tue, Apr 30, 2013 at 03:28:33PM -0700, Zoran Markovic wrote:
> From: Benoit Goby <benoit@android.com>
>
> Below is a patch from android kernel that detects a driver suspend
> lockup and captures dump in the kernel log. Please review and provide
> comments.

There's this really cool thing called a watchdog driver that does stuff
like this :)

> Rather than hard-lock the kernel, dump the suspend thread stack and
> BUG() when a driver takes too long to suspend. The timeout is set to
> 12 seconds to be longer than the usbhid 10 second timeout.
>
> Exclude from the watchdog the time spent waiting for children that
> are resumed asynchronously and time every device, whether or not they
> resumed synchronously.

No, don't add a driver-core-only timer, use the existing watchdog timers
if you are worried about the kernel locking up.

thanks,

greg k-h
--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Hi!

> Below is a patch from android kernel that detects a driver suspend
> lockup and captures dump in the kernel log. Please review and provide
> comments.
>
> Rather than hard-lock the kernel, dump the suspend thread stack and
> BUG() when a driver takes too long to suspend. The timeout is set to
> 12 seconds to be longer than the usbhid 10 second timeout.
>
> Exclude from the watchdog the time spent waiting for children that
> are resumed asynchronously and time every device, whether or not they
> resumed synchronously.
>
> Cc: Android Kernel Team <kernel-team@android.com>
> Cc: Colin Cross <ccross@android.com>
> Cc: Todd Poynor <toddpoynor@google.com>
> Cc: San Mehat <san@google.com>
> Cc: Benoit Goby <benoit@android.com>
> Cc: John Stultz <john.stultz@linaro.org>
> Cc: Pavel Machek <pavel@ucw.cz>
> Cc: Rafael J. Wysocki <rjw@sisk.pl>
> Cc: Len Brown <len.brown@intel.com>
> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> Original-author: San Mehat <san@google.com>
> Signed-off-by: Benoit Goby <benoit@android.com>
> [zoran.markovic@linaro.org: Changed printk(KERN_EMERG,...) to pr_emerg(...),
> tweaked commit message.]
> Signed-off-by: Zoran Markovic <zoran.markovic@linaro.org>
> ---
>  drivers/base/power/main.c | 45 +++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 45 insertions(+)
>
> diff --git a/drivers/base/power/main.c b/drivers/base/power/main.c
> index 15beb50..eb70c0e 100644
> --- a/drivers/base/power/main.c
> +++ b/drivers/base/power/main.c
> @@ -29,6 +29,8 @@
>  #include <linux/async.h>
>  #include <linux/suspend.h>
>  #include <linux/cpuidle.h>
> +#include <linux/timer.h>
> +
>  #include "../base.h"
>  #include "power.h"
>
> @@ -54,6 +56,12 @@ struct suspend_stats suspend_stats;
>  static DEFINE_MUTEX(dpm_list_mtx);
>  static pm_message_t pm_transition;
>
> +static void dpm_drv_timeout(unsigned long data);
> +struct dpm_drv_wd_data {
> +	struct device *dev;
> +	struct task_struct *tsk;
> +};
> +
>  static int async_error;
>
>  /**
> @@ -663,6 +671,30 @@ static bool is_async(struct device *dev)
>  }
>
>  /**
> + * dpm_drv_timeout - Driver suspend / resume watchdog handler
> + * @data: struct device which timed out
> + *
> + * Called when a driver has timed out suspending or resuming.
> + * There's not much we can do here to recover so
> + * BUG() out for a crash-dump
> + *
> + */
> +static void dpm_drv_timeout(unsigned long data)
> +{
> +	struct dpm_drv_wd_data *wd_data = (void *)data;
> +	struct device *dev = wd_data->dev;
> +	struct task_struct *tsk = wd_data->tsk;
> +
> +	pr_emerg("**** DPM device timeout: %s (%s)\n", dev_name(dev),
> +		(dev->driver ? dev->driver->name : "no driver"));
> +
> +	pr_emerg("dpm suspend stack:\n");
> +	show_stack(tsk, NULL);
> +
> +	BUG();
> +}

So you:

  dump stack of the suspend task
  do BUG which
    dumps stack of current task
    kills current task

Current task may very well be idle task; in such case you kill the
machine. Sounds like you should be doing something else, like kill -9
instead of BUG()?

Pavel

On Tue, Apr 30, 2013 at 4:30 PM, Greg Kroah-Hartman
<gregkh@linuxfoundation.org> wrote:
> On Tue, Apr 30, 2013 at 03:28:33PM -0700, Zoran Markovic wrote:
>> From: Benoit Goby <benoit@android.com>
>>
>> Below is a patch from android kernel that detects a driver suspend
>> lockup and captures dump in the kernel log. Please review and provide
>> comments.
>
> There's this really cool thing called a watchdog driver that does stuff
> like this :)

If the watchdog driver worked in this case this patch wouldn't exist.

>> Rather than hard-lock the kernel, dump the suspend thread stack and
>> BUG() when a driver takes too long to suspend. The timeout is set to
>> 12 seconds to be longer than the usbhid 10 second timeout.
>>
>> Exclude from the watchdog the time spent waiting for children that
>> are resumed asynchronously and time every device, whether or not they
>> resumed synchronously.
>
> No, don't add a driver-core-only timer, use the existing watchdog timers
> if you are worried about the kernel locking up.

The watchdog timers are useless here. For one, they generally stop
when their driver suspend op is called, so you may not even have one
running when you lock up.

More importantly, the purpose of this patch is to tell you which
driver locked up and hopefully why, and the watchdog driver will
usually result in a silent reset.

This patch will cause a stack trace of the driver suspend op that is
blocking suspend progress, even if that call does not happen in the
suspend thread.

On Tue, Apr 30, 2013 at 5:30 PM, Pavel Machek <pavel@ucw.cz> wrote:
> Hi!
>
>> Below is a patch from android kernel that detects a driver suspend
>> lockup and captures dump in the kernel log. Please review and provide
>> comments.
>>
>> Rather than hard-lock the kernel, dump the suspend thread stack and
>> BUG() when a driver takes too long to suspend. The timeout is set to
>> 12 seconds to be longer than the usbhid 10 second timeout.
>>
>> Exclude from the watchdog the time spent waiting for children that
>> are resumed asynchronously and time every device, whether or not they
>> resumed synchronously.
>>
>> Cc: Android Kernel Team <kernel-team@android.com>
>> Cc: Colin Cross <ccross@android.com>
>> Cc: Todd Poynor <toddpoynor@google.com>
>> Cc: San Mehat <san@google.com>
>> Cc: Benoit Goby <benoit@android.com>
>> Cc: John Stultz <john.stultz@linaro.org>
>> Cc: Pavel Machek <pavel@ucw.cz>
>> Cc: Rafael J. Wysocki <rjw@sisk.pl>
>> Cc: Len Brown <len.brown@intel.com>
>> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
>> Original-author: San Mehat <san@google.com>
>> Signed-off-by: Benoit Goby <benoit@android.com>
>> [zoran.markovic@linaro.org: Changed printk(KERN_EMERG,...) to pr_emerg(...),
>> tweaked commit message.]
>> Signed-off-by: Zoran Markovic <zoran.markovic@linaro.org>
>> ---
>>  drivers/base/power/main.c | 45 +++++++++++++++++++++++++++++++++++++++++++++
>>  1 file changed, 45 insertions(+)
>>
>> diff --git a/drivers/base/power/main.c b/drivers/base/power/main.c
>> index 15beb50..eb70c0e 100644
>> --- a/drivers/base/power/main.c
>> +++ b/drivers/base/power/main.c
>> @@ -29,6 +29,8 @@
>>  #include <linux/async.h>
>>  #include <linux/suspend.h>
>>  #include <linux/cpuidle.h>
>> +#include <linux/timer.h>
>> +
>>  #include "../base.h"
>>  #include "power.h"
>>
>> @@ -54,6 +56,12 @@ struct suspend_stats suspend_stats;
>>  static DEFINE_MUTEX(dpm_list_mtx);
>>  static pm_message_t pm_transition;
>>
>> +static void dpm_drv_timeout(unsigned long data);
>> +struct dpm_drv_wd_data {
>> +	struct device *dev;
>> +	struct task_struct *tsk;
>> +};
>> +
>>  static int async_error;
>>
>>  /**
>> @@ -663,6 +671,30 @@ static bool is_async(struct device *dev)
>>  }
>>
>>  /**
>> + * dpm_drv_timeout - Driver suspend / resume watchdog handler
>> + * @data: struct device which timed out
>> + *
>> + * Called when a driver has timed out suspending or resuming.
>> + * There's not much we can do here to recover so
>> + * BUG() out for a crash-dump
>> + *
>> + */
>> +static void dpm_drv_timeout(unsigned long data)
>> +{
>> +	struct dpm_drv_wd_data *wd_data = (void *)data;
>> +	struct device *dev = wd_data->dev;
>> +	struct task_struct *tsk = wd_data->tsk;
>> +
>> +	pr_emerg("**** DPM device timeout: %s (%s)\n", dev_name(dev),
>> +		(dev->driver ? dev->driver->name : "no driver"));
>> +
>> +	pr_emerg("dpm suspend stack:\n");
>> +	show_stack(tsk, NULL);
>> +
>> +	BUG();
>> +}
>
> So you:
>
>   dump stack of the suspend task

It dumps the stack of the suspend task if the suspend callback is run
synchronously, or the async task if the suspend op is run
asynchronously.

>   do BUG which
>     dumps stack of current task
>     kills current task
>
> Current task may very well be idle task; in such case you kill the
> machine. Sounds like you should be doing something else, like kill -9
> instead of BUG()?

Not much else you can do, you are stuck part way into suspend with a
driver's suspend callback half executed. All userspace tasks are
frozen, and the suspend task is blocked indefinitely.

On Tue, Apr 30, 2013 at 08:36:21PM -0700, Colin Cross wrote:
> On Tue, Apr 30, 2013 at 4:30 PM, Greg Kroah-Hartman
> <gregkh@linuxfoundation.org> wrote:
> > On Tue, Apr 30, 2013 at 03:28:33PM -0700, Zoran Markovic wrote:
> >> From: Benoit Goby <benoit@android.com>
> >>
> >> Below is a patch from android kernel that detects a driver suspend
> >> lockup and captures dump in the kernel log. Please review and provide
> >> comments.
> >
> > There's this really cool thing called a watchdog driver that does stuff
> > like this :)
>
> If the watchdog driver worked in this case this patch wouldn't exist.

Great, let's fix the watchdog timer then :)

What's wrong with it?

> >> Rather than hard-lock the kernel, dump the suspend thread stack and
> >> BUG() when a driver takes too long to suspend. The timeout is set to
> >> 12 seconds to be longer than the usbhid 10 second timeout.
> >>
> >> Exclude from the watchdog the time spent waiting for children that
> >> are resumed asynchronously and time every device, whether or not they
> >> resumed synchronously.
> >
> > No, don't add a driver-core-only timer, use the existing watchdog timers
> > if you are worried about the kernel locking up.
>
> The watchdog timers are useless here. For one, they generally stop
> when their driver suspend op is called, so you may not even have one
> running when you lock up.

But you can fix that, right?

> More importantly, the purpose of this patch is to tell you which
> driver locked up and hopefully why, and the watchdog driver will
> usually result in a silent reset.

I thought it was an option as to what the watchdog does when it
triggers.

> This patch will cause a stack trace of the driver suspend op that is
> blocking suspend progress, even if that call does not happen in the
> suspend thread.

But who can see this, the machine is now dead.

greg k-h

On Tue, Apr 30, 2013 at 9:17 PM, Greg Kroah-Hartman
<gregkh@linuxfoundation.org> wrote:
> On Tue, Apr 30, 2013 at 08:36:21PM -0700, Colin Cross wrote:
>> On Tue, Apr 30, 2013 at 4:30 PM, Greg Kroah-Hartman
>> <gregkh@linuxfoundation.org> wrote:
>> > On Tue, Apr 30, 2013 at 03:28:33PM -0700, Zoran Markovic wrote:
>> >> From: Benoit Goby <benoit@android.com>
>> >>
>> >> Below is a patch from android kernel that detects a driver suspend
>> >> lockup and captures dump in the kernel log. Please review and provide
>> >> comments.
>> >
>> > There's this really cool thing called a watchdog driver that does stuff
>> > like this :)
>>
>> If the watchdog driver worked in this case this patch wouldn't exist.
>
> Great, let's fix the watchdog timer then :)
>
> What's wrong with it?
>
>> >> Rather than hard-lock the kernel, dump the suspend thread stack and
>> >> BUG() when a driver takes too long to suspend. The timeout is set to
>> >> 12 seconds to be longer than the usbhid 10 second timeout.
>> >>
>> >> Exclude from the watchdog the time spent waiting for children that
>> >> are resumed asynchronously and time every device, whether or not they
>> >> resumed synchronously.
>> >
>> > No, don't add a driver-core-only timer, use the existing watchdog timers
>> > if you are worried about the kernel locking up.
>>
>> The watchdog timers are useless here. For one, they generally stop
>> when their driver suspend op is called, so you may not even have one
>> running when you lock up.
>
> But you can fix that, right?

Ah, you're talking about the lockup detectors, and not drivers/watchdog.

The hardlockup detector can tell you if timer interrupts are not
firing, which is unaffected by this patch since the timer wouldn't
fire any way. The softlockup detector could eventually tell you that
tasks were not being scheduled, but not why. Even panic on softlockup
will only get you the stack trace of the current task, which will be
the locked up task if it is spinning, but is likely to be the idle
task if the suspend task is blocked on a wait_event. This patch will
give the stack trace of the suspend operation that is blocked, even if
it is an asynchronous suspend callback.

>> More importantly, the purpose of this patch is to tell you which
>> driver locked up and hopefully why, and the watchdog driver will
>> usually result in a silent reset.
>
> I thought it was an option as to what the watchdog does when it
> triggers.
>
>> This patch will cause a stack trace of the driver suspend op that is
>> blocking suspend progress, even if that call does not happen in the
>> suspend thread.
>
> But who can see this, the machine is now dead.

I'm not sure what might still be working in this situation on x86, but
on ARM the machine is dead anyways. Some random subset of drivers are
suspended, so you probably have no hardware watchdog, no console, no
video. kexec on panic, kgdb on panic, console messages saved in
pstore, or jtag are the only options I know of. This patch is very
useful in conjunction with pstore console.

On Tue, Apr 30, 2013 at 9:57 PM, anish singh
<anish198519851985@gmail.com> wrote:
>
> On Wed, May 1, 2013 at 10:09 AM, Colin Cross <ccross@android.com> wrote:
>>
>> On Tue, Apr 30, 2013 at 9:17 PM, Greg Kroah-Hartman
>> <gregkh@linuxfoundation.org> wrote:
>> > On Tue, Apr 30, 2013 at 08:36:21PM -0700, Colin Cross wrote:
>> >> On Tue, Apr 30, 2013 at 4:30 PM, Greg Kroah-Hartman
>> >> <gregkh@linuxfoundation.org> wrote:
>> >> > On Tue, Apr 30, 2013 at 03:28:33PM -0700, Zoran Markovic wrote:
>> >> >> From: Benoit Goby <benoit@android.com>
>> >> >>
>> >> >> Below is a patch from android kernel that detects a driver suspend
>> >> >> lockup and captures dump in the kernel log. Please review and
>> >> >> provide comments.
>> >> >
>> >> > There's this really cool thing called a watchdog driver that does
>> >> > stuff like this :)
>> >>
>> >> If the watchdog driver worked in this case this patch wouldn't exist.
>> >
>> > Great, let's fix the watchdog timer then :)
>> >
>> > What's wrong with it?
>> >
>> >> >> Rather than hard-lock the kernel, dump the suspend thread stack and
>> >> >> BUG() when a driver takes too long to suspend. The timeout is set
>> >> >> to 12 seconds to be longer than the usbhid 10 second timeout.
>> >> >>
>> >> >> Exclude from the watchdog the time spent waiting for children that
>> >> >> are resumed asynchronously and time every device, whether or not
>> >> >> they resumed synchronously.
>> >> >
>> >> > No, don't add a driver-core-only timer, use the existing watchdog
>> >> > timers if you are worried about the kernel locking up.
>> >>
>> >> The watchdog timers are useless here. For one, they generally stop
>> >> when their driver suspend op is called, so you may not even have one
>> >> running when you lock up.
>> >
>> > But you can fix that, right?
>>
>> Ah, you're talking about the lockup detectors, and not drivers/watchdog.
>>
>> The hardlockup detector can tell you if timer interrupts are not
>> firing, which is unaffected by this patch since the timer wouldn't
>> fire any way. The softlockup detector could eventually tell you that
>
> I was wondering if timers don't fire then how is your dpm_drv_timeout
> function gets called?

That's what I meant - this patch doesn't do anything if timers are not
firing.

>> tasks were not being scheduled, but not why. Even panic on softlockup
>> will only get you the stack trace of the current task, which will be
>> the locked up task if it is spinning, but is likely to be the idle
>> task if the suspend task is blocked on a wait_event. This patch will
>> give the stack trace of the suspend operation that is blocked, even if
>> it is an asynchronous suspend callback.
>
> ......but not when timers itself is not firing right?
>
>> >> More importantly, the purpose of this patch is to tell you which
>> >> driver locked up and hopefully why, and the watchdog driver will
>> >> usually result in a silent reset.
>> >
>> > I thought it was an option as to what the watchdog does when it
>> > triggers.
>> >
>> >> This patch will cause a stack trace of the driver suspend op that is
>> >> blocking suspend progress, even if that call does not happen in the
>> >> suspend thread.
>> >
>> > But who can see this, the machine is now dead.
>>
>> I'm not sure what might still be working in this situation on x86, but
>> on ARM the machine is dead anyways. Some random subset of drivers are
>> suspended, so you probably have no hardware watchdog, no console, no
>> video. kexec on panic, kgdb on panic, console messages saved in
>> pstore, or jtag are the only options I know of. This patch is very
>> useful in conjunction with pstore console.

Hi!

> >> @@ -663,6 +671,30 @@ static bool is_async(struct device *dev)
> >>  }
> >>
> >>  /**
> >> + * dpm_drv_timeout - Driver suspend / resume watchdog handler
> >> + * @data: struct device which timed out
> >> + *
> >> + * Called when a driver has timed out suspending or resuming.
> >> + * There's not much we can do here to recover so
> >> + * BUG() out for a crash-dump
> >> + *
> >> + */
> >> +static void dpm_drv_timeout(unsigned long data)
> >> +{
> >> +	struct dpm_drv_wd_data *wd_data = (void *)data;
> >> +	struct device *dev = wd_data->dev;
> >> +	struct task_struct *tsk = wd_data->tsk;
> >> +
> >> +	pr_emerg("**** DPM device timeout: %s (%s)\n", dev_name(dev),
> >> +		(dev->driver ? dev->driver->name : "no driver"));
> >> +
> >> +	pr_emerg("dpm suspend stack:\n");
> >> +	show_stack(tsk, NULL);
> >> +
> >> +	BUG();
> >> +}
> >
> > So you:
> >
> >   dump stack of the suspend task
>
> It dumps the stack of the suspend task if the suspend callback is run
> synchronously, or the async task if the suspend op is run
> asynchronously.

Lets call that [a]suspend task.

> >   do BUG which
> >     dumps stack of current task
> >     kills current task
> >
> > Current task may very well be idle task; in such case you kill the
> > machine. Sounds like you should be doing something else, like kill -9
> > instead of BUG()?
>
> Not much else you can do, you are stuck part way into suspend with a
> driver's suspend callback half executed. All userspace tasks are
> frozen, and the suspend task is blocked indefinitely.

Yes, there's better option. Attempt killing the [a]suspend task,
instead of killing the current task.

Try putting mdelay(100000) into suspend path. Your patch will do the
wrong thing in that case (actually turning debuggable problem into
undebuggable one).

Pavel

On Wed, May 1, 2013 at 3:56 AM, Pavel Machek <pavel@ucw.cz> wrote:
> Hi!
>
>> >> @@ -663,6 +671,30 @@ static bool is_async(struct device *dev)
>> >>  }
>> >>
>> >>  /**
>> >> + * dpm_drv_timeout - Driver suspend / resume watchdog handler
>> >> + * @data: struct device which timed out
>> >> + *
>> >> + * Called when a driver has timed out suspending or resuming.
>> >> + * There's not much we can do here to recover so
>> >> + * BUG() out for a crash-dump
>> >> + *
>> >> + */
>> >> +static void dpm_drv_timeout(unsigned long data)
>> >> +{
>> >> +	struct dpm_drv_wd_data *wd_data = (void *)data;
>> >> +	struct device *dev = wd_data->dev;
>> >> +	struct task_struct *tsk = wd_data->tsk;
>> >> +
>> >> +	pr_emerg("**** DPM device timeout: %s (%s)\n", dev_name(dev),
>> >> +		(dev->driver ? dev->driver->name : "no driver"));
>> >> +
>> >> +	pr_emerg("dpm suspend stack:\n");
>> >> +	show_stack(tsk, NULL);
>> >> +
>> >> +	BUG();
>> >> +}
>> >
>> > So you:
>> >
>> >   dump stack of the suspend task
>>
>> It dumps the stack of the suspend task if the suspend callback is run
>> synchronously, or the async task if the suspend op is run
>> asynchronously.
>
> Lets call that [a]suspend task.
>
>> >   do BUG which
>> >     dumps stack of current task
>> >     kills current task
>> >
>> > Current task may very well be idle task; in such case you kill the
>> > machine. Sounds like you should be doing something else, like kill -9
>> > instead of BUG()?
>>
>> Not much else you can do, you are stuck part way into suspend with a
>> driver's suspend callback half executed. All userspace tasks are
>> frozen, and the suspend task is blocked indefinitely.
>
> Yes, there's better option. Attempt killing the [a]suspend task,
> instead of killing the current task.

That will leave you in a completely undefined state. If you just kill
the task, you are likely to kill the synchronous suspend task, which
is the task that would resume your drivers and unfreeze tasks. That
will leave you with no userspace tasks running, and much of your
hardware suspended. How is that a useful result? If you somehow
respawn a resume thread to resume whatever hardware you can and
unfreeze tasks, you still have the hardware that was suspending when
it was killed in a bad state, and probably has locks held, so you're
just going to deadlock or crash soon after.

> Try putting mdelay(100000) into suspend path. Your patch will do the
> wrong thing in that case (actually turning debuggable problem into
> undebuggable one).

I'm not saying this patch as is is right for everyone (it probably at
least needs to be configurable to be turned off, change the delay, and
change the panic to just a stack trace), but from a mobile perspective
this patch is far more debuggable than without this patch. We work
very hard to make sure that panic's are highly debuggable, in fact we
often prefer panics over any other behavior when the device is in a
bad state, because it immediately gets the user's device working again
while still giving us useful information in our automatic log
collection.

With an mdelay(100000) in the suspend path, users in our debug device
pool are likely to just pull the battery because their screen won't
turn on, in which case I get no debugging information. With this
patch, the device will automatically reboot due to the panic, and I
will get captured logs after reboot that show a stack trace ending
with mdelay, which tells me exactly where to look for this
mdelay(100000).

On Wed, May 01, 2013 at 09:10:49AM -0700, Colin Cross wrote:
> I'm not saying this patch as is is right for everyone (it probably at
> least needs to be configurable to be turned off, change the delay, and
> change the panic to just a stack trace),

Those changes would be nice.

> but from a mobile perspective
> this patch is far more debuggable than without this patch. We work
> very hard to make sure that panic's are highly debuggable, in fact we
> often prefer panics over any other behavior when the device is in a
> bad state, because it immediately gets the user's device working again
> while still giving us useful information in our automatic log
> collection.
>
> With an mdelay(100000) in the suspend path, users in our debug device
> pool are likely to just pull the battery because their screen won't
> turn on, in which case I get no debugging information. With this
> patch, the device will automatically reboot due to the panic, and I
> will get captured logs after reboot that show a stack trace ending
> with mdelay, which tells me exactly where to look for this
> mdelay(100000).

All of that information would be _great_ to have in the changelog for
the patch, as it explains exactly why you need this. {hint}

thanks,

greg k-h

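The three changes mentioned above (a way to turn the watchdog off, an adjustable delay, and a stack trace instead of a panic) can be pictured with a small user-space model. This is only a sketch, not part of the posted patch: the names `model_wd_config`, `model_wd_action` and `model_wd_decide` are hypothetical, and a real kernel implementation would more likely expose such settings through Kconfig or a module parameter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical outcomes for the suspend watchdog in this model. */
enum model_wd_action {
	MODEL_WD_NONE,	/* watchdog disabled, or deadline not reached yet */
	MODEL_WD_TRACE,	/* log the stuck task's stack and keep going */
	MODEL_WD_PANIC	/* log the stack, then panic so the device reboots */
};

/* Hypothetical knobs; none of these exist in the posted patch. */
struct model_wd_config {
	unsigned int timeout_secs;	/* 0 turns the watchdog off */
	bool panic_on_timeout;		/* false: stack trace only */
};

/* Decide what to do after a callback has been running for elapsed_secs. */
static enum model_wd_action
model_wd_decide(const struct model_wd_config *cfg, unsigned int elapsed_secs)
{
	assert(cfg != NULL);

	if (cfg->timeout_secs == 0 || elapsed_secs < cfg->timeout_secs)
		return MODEL_WD_NONE;

	return cfg->panic_on_timeout ? MODEL_WD_PANIC : MODEL_WD_TRACE;
}
```

With `timeout_secs = 12` and `panic_on_timeout = true` the model matches the behaviour described for the posted patch; setting `timeout_secs = 0` corresponds to switching the feature off.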
Hi!

> >> >   do BUG which
> >> >     dumps stack of current task
> >> >     kills current task
> >> >
> >> > Current task may very well be idle task; in such case you kill the
> >> > machine. Sounds like you should be doing something else, like kill -9
> >> > instead of BUG()?
> >>
> >> Not much else you can do, you are stuck part way into suspend with a
> >> driver's suspend callback half executed. All userspace tasks are
> >> frozen, and the suspend task is blocked indefinitely.
> >
> > Yes, there's better option. Attempt killing the [a]suspend task,
> > instead of killing the current task.
>
> That will leave you in a completely undefined state. If you just kill
> the task, you are likely to kill the synchronous suspend task, which
> is the task that would resume your drivers and unfreeze tasks. That
> will leave you with no userspace tasks running, and much of your
> hardware suspended. How is that a useful result? If you somehow

So instead you kill random task? (BUG() from timer kills pretty much
random task, right?)

If you want to do panic(), do panic().

Pavel

On Thu, May 2, 2013 at 5:30 AM, Pavel Machek <pavel@ucw.cz> wrote:
> Hi!
>
>> >> >   do BUG which
>> >> >     dumps stack of current task
>> >> >     kills current task
>> >> >
>> >> > Current task may very well be idle task; in such case you kill the
>> >> > machine. Sounds like you should be doing something else, like kill -9
>> >> > instead of BUG()?
>> >>
>> >> Not much else you can do, you are stuck part way into suspend with a
>> >> driver's suspend callback half executed. All userspace tasks are
>> >> frozen, and the suspend task is blocked indefinitely.
>> >
>> > Yes, there's better option. Attempt killing the [a]suspend task,
>> > instead of killing the current task.
>>
>> That will leave you in a completely undefined state. If you just kill
>> the task, you are likely to kill the synchronous suspend task, which
>> is the task that would resume your drivers and unfreeze tasks. That
>> will leave you with no userspace tasks running, and much of your
>> hardware suspended. How is that a useful result? If you somehow
>
> So instead you kill random task? (BUG() from timer kills pretty much
> random task, right?)
>
> If you want to do panic(), do panic().

At least on ARM a BUG() in an interrupt or softirq always results in a
panic, but this can be switched to directly call panic.

diff --git a/drivers/base/power/main.c b/drivers/base/power/main.c
index 15beb50..eb70c0e 100644
--- a/drivers/base/power/main.c
+++ b/drivers/base/power/main.c
@@ -29,6 +29,8 @@
 #include <linux/async.h>
 #include <linux/suspend.h>
 #include <linux/cpuidle.h>
+#include <linux/timer.h>
+
 #include "../base.h"
 #include "power.h"

@@ -54,6 +56,12 @@ struct suspend_stats suspend_stats;
 static DEFINE_MUTEX(dpm_list_mtx);
 static pm_message_t pm_transition;

+static void dpm_drv_timeout(unsigned long data);
+struct dpm_drv_wd_data {
+	struct device *dev;
+	struct task_struct *tsk;
+};
+
 static int async_error;

 /**
@@ -663,6 +671,30 @@ static bool is_async(struct device *dev)
 }

 /**
+ * dpm_drv_timeout - Driver suspend / resume watchdog handler
+ * @data: struct device which timed out
+ *
+ * Called when a driver has timed out suspending or resuming.
+ * There's not much we can do here to recover so
+ * BUG() out for a crash-dump
+ *
+ */
+static void dpm_drv_timeout(unsigned long data)
+{
+	struct dpm_drv_wd_data *wd_data = (void *)data;
+	struct device *dev = wd_data->dev;
+	struct task_struct *tsk = wd_data->tsk;
+
+	pr_emerg("**** DPM device timeout: %s (%s)\n", dev_name(dev),
+		(dev->driver ? dev->driver->name : "no driver"));
+
+	pr_emerg("dpm suspend stack:\n");
+	show_stack(tsk, NULL);
+
+	BUG();
+}
+
+/**
  * dpm_resume - Execute "resume" callbacks for non-sysdev devices.
  * @state: PM transition of the system being carried out.
  *
@@ -1053,6 +1085,8 @@ static int __device_suspend(struct device *dev, pm_message_t state, bool async)
 	pm_callback_t callback = NULL;
 	char *info = NULL;
 	int error = 0;
+	struct timer_list timer;
+	struct dpm_drv_wd_data data;

 	dpm_wait_for_children(dev, async);

@@ -1076,6 +1110,14 @@ static int __device_suspend(struct device *dev, pm_message_t state, bool async)
 	if (dev->power.syscore)
 		goto Complete;

+	data.dev = dev;
+	data.tsk = get_current();
+	init_timer_on_stack(&timer);
+	timer.expires = jiffies + HZ * 12;
+	timer.function = dpm_drv_timeout;
+	timer.data = (unsigned long)&data;
+	add_timer(&timer);
+
 	device_lock(dev);

 	if (dev->pm_domain) {
@@ -1131,6 +1173,9 @@ static int __device_suspend(struct device *dev, pm_message_t state, bool async)

 	device_unlock(dev);

+	del_timer_sync(&timer);
+	destroy_timer_on_stack(&timer);
+
  Complete:
 	complete_all(&dev->power.completion);
 	if (error)
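
The patch arms an on-stack timer for `jiffies + HZ * 12` before the suspend callback runs and deletes it afterwards. The following user-space model of that arm / disarm / expire sequence is a rough sketch only: `MODEL_HZ`, `struct model_watchdog` and the `model_wd_*` helpers are hypothetical names invented for illustration, the tick rate of 100 is an assumption (the kernel's HZ is a build-time option), and the signed-difference comparison mirrors the idea behind the kernel's `time_after_eq()` rather than reproducing the timer code itself.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumed tick rate for this model only. */
#define MODEL_HZ		100UL
/* The patch uses a 12 second timeout. */
#define MODEL_TIMEOUT_SECS	12UL

/* Stand-in for the on-stack timer plus its armed state. */
struct model_watchdog {
	unsigned long expires;	/* tick count at which the watchdog fires */
	bool armed;
};

/* Arm the watchdog relative to the current tick count. */
static void model_wd_arm(struct model_watchdog *wd, unsigned long now)
{
	assert(wd != NULL);
	/* Unsigned arithmetic wraps, just as jiffies does. */
	wd->expires = now + MODEL_HZ * MODEL_TIMEOUT_SECS;
	wd->armed = true;
}

/* Disarm once the callback has returned (del_timer_sync() in the patch). */
static void model_wd_disarm(struct model_watchdog *wd)
{
	assert(wd != NULL);
	wd->armed = false;
}

/* True once 'now' has reached the deadline, even across counter wraparound. */
static bool model_wd_expired(const struct model_watchdog *wd, unsigned long now)
{
	assert(wd != NULL);
	return wd->armed && (long)(now - wd->expires) >= 0;
}
```

As the thread notes, nothing here helps when timer interrupts themselves stop: in the model that corresponds to `now` never advancing, so `model_wd_expired()` never returns true.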