Message ID | 201107251043.19932.oneukum@suse.de (mailing list archive) |
---|---|
State | Rejected, archived |
Headers | show |
On Mon, Jul 25, 2011 at 10:43:19AM +0200, Oliver Neukum wrote: > Hi Rafael, > > I had a problem with the kernel stopping the machine forever because I got an > oops while tasks were frozen. It seems to me that we should thaw when this > happens. How about this approach? > > Regards > Oliver > > From 6f3b5e7a5c7ccf3564bdd2e703eba7eee753ecdc Mon Sep 17 00:00:00 2001 > From: Oliver Neukum <oliver@neukum.org> > Date: Fri, 22 Jul 2011 11:20:19 +0200 > Subject: [PATCH] unfreeze tasks if an oops happens while tasks are frozen > > If an oops kills the task suspending or snapshotting > is system, the system is dead because the action is > never completed and the tasks never thawed. > > Signed-off-by: Oliver Neukum <oneukum@suse.de> > --- > include/linux/freezer.h | 1 + > kernel/panic.c | 2 ++ > kernel/power/process.c | 11 +++++++++++ > 3 files changed, 14 insertions(+), 0 deletions(-) > > diff --git a/include/linux/freezer.h b/include/linux/freezer.h > index 1effc8b..9907cf6 100644 > --- a/include/linux/freezer.h > +++ b/include/linux/freezer.h > @@ -50,6 +50,7 @@ extern int thaw_process(struct task_struct *p); > extern void refrigerator(void); > extern int freeze_processes(void); > extern void thaw_processes(void); > +extern void thaw_in_oops(void); > > static inline int try_to_freeze(void) > { > diff --git a/kernel/panic.c b/kernel/panic.c > index 6923167..255e662 100644 > --- a/kernel/panic.c > +++ b/kernel/panic.c > @@ -23,6 +23,7 @@ > #include <linux/init.h> > #include <linux/nmi.h> > #include <linux/dmi.h> > +#include <linux/freezer.h> > > #define PANIC_TIMER_STEP 100 > #define PANIC_BLINK_SPD 18 > @@ -355,6 +356,7 @@ void oops_exit(void) > do_oops_enter_exit(); > print_oops_end_marker(); > kmsg_dump(KMSG_DUMP_OOPS); > + thaw_in_oops(); > } > > #ifdef WANT_WARN_ON_SLOWPATH > diff --git a/kernel/power/process.c b/kernel/power/process.c > index 0cf3a27..20994cd 100644 > --- a/kernel/power/process.c > +++ b/kernel/power/process.c > @@ -22,6 +22,9 @@ > */ > #define TIMEOUT (20 * HZ) > > +/* in case we oops while processes are frozen */ > +static bool tasks_fozen = false; I think you mean 'tasks_frozen' here :).
Hi, On Monday, July 25, 2011, Oliver Neukum wrote: > Hi Rafael, > > I had a problem with the kernel stopping the machine forever because I got an > oops while tasks were frozen. It seems to me that we should thaw when this > happens. How about this approach? Well, we do something like this already for the OOM killer (see oom_killer_disable() and friends), so I think it would be better to simply extend/modify that mechanism instead of adding a new one doing almost exactly the same thing. I have no complaints about adding thaw_in_oops(), though, so long as Andrew thinks it makes sense. Thanks, Rafael > From 6f3b5e7a5c7ccf3564bdd2e703eba7eee753ecdc Mon Sep 17 00:00:00 2001 > From: Oliver Neukum <oliver@neukum.org> > Date: Fri, 22 Jul 2011 11:20:19 +0200 > Subject: [PATCH] unfreeze tasks if an oops happens while tasks are frozen > > If an oops kills the task suspending or snapshotting > is system, the system is dead because the action is > never completed and the tasks never thawed. > > Signed-off-by: Oliver Neukum <oneukum@suse.de> > --- > include/linux/freezer.h | 1 + > kernel/panic.c | 2 ++ > kernel/power/process.c | 11 +++++++++++ > 3 files changed, 14 insertions(+), 0 deletions(-) > > diff --git a/include/linux/freezer.h b/include/linux/freezer.h > index 1effc8b..9907cf6 100644 > --- a/include/linux/freezer.h > +++ b/include/linux/freezer.h > @@ -50,6 +50,7 @@ extern int thaw_process(struct task_struct *p); > extern void refrigerator(void); > extern int freeze_processes(void); > extern void thaw_processes(void); > +extern void thaw_in_oops(void); > > static inline int try_to_freeze(void) > { > diff --git a/kernel/panic.c b/kernel/panic.c > index 6923167..255e662 100644 > --- a/kernel/panic.c > +++ b/kernel/panic.c > @@ -23,6 +23,7 @@ > #include <linux/init.h> > #include <linux/nmi.h> > #include <linux/dmi.h> > +#include <linux/freezer.h> > > #define PANIC_TIMER_STEP 100 > #define PANIC_BLINK_SPD 18 > @@ -355,6 +356,7 @@ void oops_exit(void) > do_oops_enter_exit(); > print_oops_end_marker(); > kmsg_dump(KMSG_DUMP_OOPS); > + thaw_in_oops(); > } > > #ifdef WANT_WARN_ON_SLOWPATH > diff --git a/kernel/power/process.c b/kernel/power/process.c > index 0cf3a27..20994cd 100644 > --- a/kernel/power/process.c > +++ b/kernel/power/process.c > @@ -22,6 +22,9 @@ > */ > #define TIMEOUT (20 * HZ) > > +/* in case we oops while processes are frozen */ > +static bool tasks_fozen = false; > + > static inline int freezable(struct task_struct * p) > { > if ((p == current) || > @@ -131,6 +134,7 @@ static int try_to_freeze_tasks(bool sig_only) > elapsed_csecs % 100); > } > > + tasks_fozen = (todo == 0); > return todo ? -EBUSY : 0; > } > > @@ -189,7 +193,14 @@ void thaw_processes(void) > thaw_workqueues(); > thaw_tasks(true); > thaw_tasks(false); > + tasks_fozen = false; > schedule(); > printk("done.\n"); > } > > +void thaw_in_oops(void) > +{ > + if (tasks_fozen) > + thaw_processes(); > +} > + >
On Wed, 27 Jul 2011 00:24:11 +0200 "Rafael J. Wysocki" <rjw@sisk.pl> wrote: > > Hi Rafael, > > > > I had a problem with the kernel stopping the machine forever because I got an > > oops while tasks were frozen. It seems to me that we should thaw when this > > happens. How about this approach? > > Well, we do something like this already for the OOM killer (see > oom_killer_disable() and friends), so I think it would be better to > simply extend/modify that mechanism instead of adding a new one > doing almost exactly the same thing. > > I have no complaints about adding thaw_in_oops(), though, so long as > Andrew thinks it makes sense. mm... The patch as proposed is very simple, direct, explicit. I suspect that trying to embed this operation within some other one would end up producing a less clear result. Sometimes we do exceptional and weird things, and leaving the code exceptional and weird-looking is better than hiding it in some framework, if you follow what I mean. It does need some code comments to explain to people what it's doing and more importantly why it's doing it. Also, something which doesn't break the build when CONFIG_FREEZER=n would be nice.
On Wednesday, July 27, 2011, Andrew Morton wrote: > On Wed, 27 Jul 2011 00:24:11 +0200 > "Rafael J. Wysocki" <rjw@sisk.pl> wrote: > > > > Hi Rafael, > > > > > > I had a problem with the kernel stopping the machine forever because I got an > > > oops while tasks were frozen. It seems to me that we should thaw when this > > > happens. How about this approach? > > > > Well, we do something like this already for the OOM killer (see > > oom_killer_disable() and friends), so I think it would be better to > > simply extend/modify that mechanism instead of adding a new one > > doing almost exactly the same thing. > > > > I have no complaints about adding thaw_in_oops(), though, so long as > > Andrew thinks it makes sense. > > mm... The patch as proposed is very simple, direct, explicit. I > suspect that trying to embed this operation within some other one would > end up producing a less clear result. Sometimes we do exceptional and > weird things, and leaving the code exceptional and weird-looking is > better than hiding it in some framework, if you follow what I mean. My point is that instead of using the oom_killer_disabled variable in page_alloc.c, we could use a oom_killer_disabled() function returning the value of the new tasks_are_frozen variable. Then, oom_killer_disable/enable() won't be necessary any more. > It does need some code comments to explain to people what it's doing > and more importantly why it's doing it. Also, something which doesn't > break the build when CONFIG_FREEZER=n would be nice. Right. Thanks, Rafael
Am Mittwoch, 27. Juli 2011, 11:21:21 schrieb Rafael J. Wysocki: > > mm... The patch as proposed is very simple, direct, explicit. I > > suspect that trying to embed this operation within some other one would > > end up producing a less clear result. Sometimes we do exceptional and > > weird things, and leaving the code exceptional and weird-looking is > > better than hiding it in some framework, if you follow what I mean. > > My point is that instead of using the oom_killer_disabled variable in > page_alloc.c, we could use a oom_killer_disabled() function returning > the value of the new tasks_are_frozen variable. Then, > oom_killer_disable/enable() won't be necessary any more. Well, if we do this, it'll be a separate change. Not that I disagree. Regards Oliver
On Mon 2011-07-25 10:43:19, Oliver Neukum wrote: > Hi Rafael, > > I had a problem with the kernel stopping the machine forever because I got an > oops while tasks were frozen. It seems to me that we should thaw when this > happens. How about this approach? Danger, Will Robinson. You must not write to the filesystems after hibernation started. By thawing, you may do just that. It should be safe to thaw as long as final suspend signature is not on disk and will not be written there.
Am Freitag, 29. Juli 2011, 21:54:30 schrieb Pavel Machek: > On Mon 2011-07-25 10:43:19, Oliver Neukum wrote: > > Hi Rafael, > > > > I had a problem with the kernel stopping the machine forever because I got an > > oops while tasks were frozen. It seems to me that we should thaw when this > > happens. How about this approach? > > Danger, Will Robinson. You must not write to the filesystems after > hibernation started. By thawing, you may do just that. > > It should be safe to thaw as long as final suspend signature is not > on disk and will not be written there. Yes. But this applies only if I use kernel-space hibernation, doesn't it? So I need a second flag for the signature being written. Any other problem? Regards Oliver
On Friday, July 29, 2011, Pavel Machek wrote: > On Mon 2011-07-25 10:43:19, Oliver Neukum wrote: > > Hi Rafael, > > > > I had a problem with the kernel stopping the machine forever because I got an > > oops while tasks were frozen. It seems to me that we should thaw when this > > happens. How about this approach? > > Danger, Will Robinson. You must not write to the filesystems after > hibernation started. By thawing, you may do just that. > > It should be safe to thaw as long as final suspend signature is not > on disk and will not be written there. This only applies to hibernation and only when we're going to restore from the image the presumably hasn't been saved yet, right? However, there's another problem I didn't think of before. Namely, we cannot thaw tasks before resuming devices in case we've already suspended them, because that will defeat the very purpose of the freezing in the first place. Thanks, Rafael
On Friday, July 29, 2011, Oliver Neukum wrote: > Am Freitag, 29. Juli 2011, 21:54:30 schrieb Pavel Machek: > > On Mon 2011-07-25 10:43:19, Oliver Neukum wrote: > > > Hi Rafael, > > > > > > I had a problem with the kernel stopping the machine forever because I got an > > > oops while tasks were frozen. It seems to me that we should thaw when this > > > happens. How about this approach? > > > > Danger, Will Robinson. You must not write to the filesystems after > > hibernation started. By thawing, you may do just that. > > > > It should be safe to thaw as long as final suspend signature is not > > on disk and will not be written there. > > Yes. But this applies only if I use kernel-space hibernation, doesn't it? > So I need a second flag for the signature being written. Any other problem? Yes, please see my reply to Pavel. Thanks, Rafael
Am Freitag, 29. Juli 2011, 22:27:54 schrieb Rafael J. Wysocki: > However, there's another problem I didn't think of before. Namely, > we cannot thaw tasks before resuming devices in case we've already > suspended them, because that will defeat the very purpose of the > freezing in the first place. That purpose is already defeated. The machine cannot be deader than dead. We could try to go through the device tree. But the machine just oopsed quite likely doing just that, so this doesn't seem wise to me. IMHO if we oops while frozen, chances are the system is going to die. It certainly dies now. However, we must certainly not thaw if a valid image is already on disk, lest we corrupt a filesystem. Regards Oliver
On Friday, July 29, 2011, Oliver Neukum wrote: > Am Freitag, 29. Juli 2011, 22:27:54 schrieb Rafael J. Wysocki: > > However, there's another problem I didn't think of before. Namely, > > we cannot thaw tasks before resuming devices in case we've already > > suspended them, because that will defeat the very purpose of the > > freezing in the first place. > > That purpose is already defeated. The machine cannot be deader than dead. OK, so what exactly is the purpose of the thawing, then? Rafael
Am Freitag, 29. Juli 2011, 23:09:44 schrieb Rafael J. Wysocki: > On Friday, July 29, 2011, Oliver Neukum wrote: > > Am Freitag, 29. Juli 2011, 22:27:54 schrieb Rafael J. Wysocki: > > > However, there's another problem I didn't think of before. Namely, > > > we cannot thaw tasks before resuming devices in case we've already > > > suspended them, because that will defeat the very purpose of the > > > freezing in the first place. > > > > That purpose is already defeated. The machine cannot be deader than dead. > > OK, so what exactly is the purpose of the thawing, then? We may be lucky enough to get the oops written out to disk. The oops may happen in a driver rarely used and be late in the sequence. Regards Oliver
On Friday, July 29, 2011, Oliver Neukum wrote: > Am Freitag, 29. Juli 2011, 23:09:44 schrieb Rafael J. Wysocki: > > On Friday, July 29, 2011, Oliver Neukum wrote: > > > Am Freitag, 29. Juli 2011, 22:27:54 schrieb Rafael J. Wysocki: > > > > However, there's another problem I didn't think of before. Namely, > > > > we cannot thaw tasks before resuming devices in case we've already > > > > suspended them, because that will defeat the very purpose of the > > > > freezing in the first place. > > > > > > That purpose is already defeated. The machine cannot be deader than dead. > > > > OK, so what exactly is the purpose of the thawing, then? > > We may be lucky enough to get the oops written out to disk. The oops may happen > in a driver rarely used and be late in the sequence. OK, so is there any guarantee that we won't corrupt things this way? Rafael
diff --git a/include/linux/freezer.h b/include/linux/freezer.h index 1effc8b..9907cf6 100644 --- a/include/linux/freezer.h +++ b/include/linux/freezer.h @@ -50,6 +50,7 @@ extern int thaw_process(struct task_struct *p); extern void refrigerator(void); extern int freeze_processes(void); extern void thaw_processes(void); +extern void thaw_in_oops(void); static inline int try_to_freeze(void) { diff --git a/kernel/panic.c b/kernel/panic.c index 6923167..255e662 100644 --- a/kernel/panic.c +++ b/kernel/panic.c @@ -23,6 +23,7 @@ #include <linux/init.h> #include <linux/nmi.h> #include <linux/dmi.h> +#include <linux/freezer.h> #define PANIC_TIMER_STEP 100 #define PANIC_BLINK_SPD 18 @@ -355,6 +356,7 @@ void oops_exit(void) do_oops_enter_exit(); print_oops_end_marker(); kmsg_dump(KMSG_DUMP_OOPS); + thaw_in_oops(); } #ifdef WANT_WARN_ON_SLOWPATH diff --git a/kernel/power/process.c b/kernel/power/process.c index 0cf3a27..20994cd 100644 --- a/kernel/power/process.c +++ b/kernel/power/process.c @@ -22,6 +22,9 @@ */ #define TIMEOUT (20 * HZ) +/* in case we oops while processes are frozen */ +static bool tasks_fozen = false; + static inline int freezable(struct task_struct * p) { if ((p == current) || @@ -131,6 +134,7 @@ static int try_to_freeze_tasks(bool sig_only) elapsed_csecs % 100); } + tasks_fozen = (todo == 0); return todo ? -EBUSY : 0; } @@ -189,7 +193,14 @@ void thaw_processes(void) thaw_workqueues(); thaw_tasks(true); thaw_tasks(false); + tasks_fozen = false; schedule(); printk("done.\n"); } +void thaw_in_oops(void) +{ + if (tasks_fozen) + thaw_processes(); +} +