better oopsing when frozen

Message ID	201107251043.19932.oneukum@suse.de (mailing list archive)
State	Rejected, archived
Headers	show Received: from smtp1.linux-foundation.org (smtp1.linux-foundation.org [140.211.169.13]) by demeter2.kernel.org (8.14.4/8.14.4) with ESMTP id p6P8gSdP012290 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=FAIL) for <patchwork-linux-pm@patchwork.kernel.org>; Mon, 25 Jul 2011 08:42:54 GMT From: Oliver Neukum <oneukum@suse.de> Organization: SUSE To: linux-pm@lists.linux-foundation.org, linux-kernel@vger.kernel.org, rjw@sisk.pl Date: Mon, 25 Jul 2011 10:43:19 +0200 User-Agent: KMail/1.13.5 (Linux/3.0.0-rc7-12-desktop+; KDE/4.4.4; x86_64; ; ) MIME-Version: 1.0 Message-Id: <201107251043.19932.oneukum@suse.de> Received-SPF: pass (localhost is always allowed.) Subject: [linux-pm] better oopsing when frozen Precedence: list Content-Type: text/plain; charset="iso-8859-1" Sender: linux-pm-bounces@lists.linux-foundation.org Errors-To: linux-pm-bounces@lists.linux-foundation.org Content-Transfer-Encoding: 8bit

Oliver Neukum July 25, 2011, 8:43 a.m. UTC

Hi Rafael,

I had a problem with the kernel stopping the machine forever because I got an
oops while tasks were frozen. It seems to me that we should thaw when this
happens. How about this approach?

	Regards
		Oliver

From 6f3b5e7a5c7ccf3564bdd2e703eba7eee753ecdc Mon Sep 17 00:00:00 2001
From: Oliver Neukum <oliver@neukum.org>
Date: Fri, 22 Jul 2011 11:20:19 +0200
Subject: [PATCH] unfreeze tasks if an oops happens while tasks are frozen

If an oops kills the task suspending or snapshotting
is system, the system is dead because the action is
never completed and the tasks never thawed.

Signed-off-by: Oliver Neukum <oneukum@suse.de>
---
 include/linux/freezer.h |    1 +
 kernel/panic.c          |    2 ++
 kernel/power/process.c  |   11 +++++++++++
 3 files changed, 14 insertions(+), 0 deletions(-)

Borislav Petkov July 25, 2011, 10:10 a.m. UTC | #1

On Mon, Jul 25, 2011 at 10:43:19AM +0200, Oliver Neukum wrote:
> Hi Rafael,
> 
> I had a problem with the kernel stopping the machine forever because I got an
> oops while tasks were frozen. It seems to me that we should thaw when this
> happens. How about this approach?
> 
> 	Regards
> 		Oliver
> 
> From 6f3b5e7a5c7ccf3564bdd2e703eba7eee753ecdc Mon Sep 17 00:00:00 2001
> From: Oliver Neukum <oliver@neukum.org>
> Date: Fri, 22 Jul 2011 11:20:19 +0200
> Subject: [PATCH] unfreeze tasks if an oops happens while tasks are frozen
> 
> If an oops kills the task suspending or snapshotting
> is system, the system is dead because the action is
> never completed and the tasks never thawed.
> 
> Signed-off-by: Oliver Neukum <oneukum@suse.de>
> ---
>  include/linux/freezer.h |    1 +
>  kernel/panic.c          |    2 ++
>  kernel/power/process.c  |   11 +++++++++++
>  3 files changed, 14 insertions(+), 0 deletions(-)
> 
> diff --git a/include/linux/freezer.h b/include/linux/freezer.h
> index 1effc8b..9907cf6 100644
> --- a/include/linux/freezer.h
> +++ b/include/linux/freezer.h
> @@ -50,6 +50,7 @@ extern int thaw_process(struct task_struct *p);
>  extern void refrigerator(void);
>  extern int freeze_processes(void);
>  extern void thaw_processes(void);
> +extern void thaw_in_oops(void);
>  
>  static inline int try_to_freeze(void)
>  {
> diff --git a/kernel/panic.c b/kernel/panic.c
> index 6923167..255e662 100644
> --- a/kernel/panic.c
> +++ b/kernel/panic.c
> @@ -23,6 +23,7 @@
>  #include <linux/init.h>
>  #include <linux/nmi.h>
>  #include <linux/dmi.h>
> +#include <linux/freezer.h>
>  
>  #define PANIC_TIMER_STEP 100
>  #define PANIC_BLINK_SPD 18
> @@ -355,6 +356,7 @@ void oops_exit(void)
>  	do_oops_enter_exit();
>  	print_oops_end_marker();
>  	kmsg_dump(KMSG_DUMP_OOPS);
> +	thaw_in_oops();
>  }
>  
>  #ifdef WANT_WARN_ON_SLOWPATH
> diff --git a/kernel/power/process.c b/kernel/power/process.c
> index 0cf3a27..20994cd 100644
> --- a/kernel/power/process.c
> +++ b/kernel/power/process.c
> @@ -22,6 +22,9 @@
>   */
>  #define TIMEOUT	(20 * HZ)
>  
> +/* in case we oops while processes are frozen */
> +static bool tasks_fozen = false;

I think you mean 'tasks_frozen' here :).

Rafael Wysocki July 26, 2011, 10:24 p.m. UTC | #2

Hi,

On Monday, July 25, 2011, Oliver Neukum wrote:
> Hi Rafael,
> 
> I had a problem with the kernel stopping the machine forever because I got an
> oops while tasks were frozen. It seems to me that we should thaw when this
> happens. How about this approach?

Well, we do something like this already for the OOM killer (see
oom_killer_disable() and friends), so I think it would be better to
simply extend/modify that mechanism instead of adding a new one
doing almost exactly the same thing.

I have no complaints about adding thaw_in_oops(), though, so long as
Andrew thinks it makes sense.

Thanks,
Rafael 


> From 6f3b5e7a5c7ccf3564bdd2e703eba7eee753ecdc Mon Sep 17 00:00:00 2001
> From: Oliver Neukum <oliver@neukum.org>
> Date: Fri, 22 Jul 2011 11:20:19 +0200
> Subject: [PATCH] unfreeze tasks if an oops happens while tasks are frozen
> 
> If an oops kills the task suspending or snapshotting
> is system, the system is dead because the action is
> never completed and the tasks never thawed.
> 
> Signed-off-by: Oliver Neukum <oneukum@suse.de>
> ---
>  include/linux/freezer.h |    1 +
>  kernel/panic.c          |    2 ++
>  kernel/power/process.c  |   11 +++++++++++
>  3 files changed, 14 insertions(+), 0 deletions(-)
> 
> diff --git a/include/linux/freezer.h b/include/linux/freezer.h
> index 1effc8b..9907cf6 100644
> --- a/include/linux/freezer.h
> +++ b/include/linux/freezer.h
> @@ -50,6 +50,7 @@ extern int thaw_process(struct task_struct *p);
>  extern void refrigerator(void);
>  extern int freeze_processes(void);
>  extern void thaw_processes(void);
> +extern void thaw_in_oops(void);
>  
>  static inline int try_to_freeze(void)
>  {
> diff --git a/kernel/panic.c b/kernel/panic.c
> index 6923167..255e662 100644
> --- a/kernel/panic.c
> +++ b/kernel/panic.c
> @@ -23,6 +23,7 @@
>  #include <linux/init.h>
>  #include <linux/nmi.h>
>  #include <linux/dmi.h>
> +#include <linux/freezer.h>
>  
>  #define PANIC_TIMER_STEP 100
>  #define PANIC_BLINK_SPD 18
> @@ -355,6 +356,7 @@ void oops_exit(void)
>  	do_oops_enter_exit();
>  	print_oops_end_marker();
>  	kmsg_dump(KMSG_DUMP_OOPS);
> +	thaw_in_oops();
>  }
>  
>  #ifdef WANT_WARN_ON_SLOWPATH
> diff --git a/kernel/power/process.c b/kernel/power/process.c
> index 0cf3a27..20994cd 100644
> --- a/kernel/power/process.c
> +++ b/kernel/power/process.c
> @@ -22,6 +22,9 @@
>   */
>  #define TIMEOUT	(20 * HZ)
>  
> +/* in case we oops while processes are frozen */
> +static bool tasks_fozen = false;
> +
>  static inline int freezable(struct task_struct * p)
>  {
>  	if ((p == current) ||
> @@ -131,6 +134,7 @@ static int try_to_freeze_tasks(bool sig_only)
>  			elapsed_csecs % 100);
>  	}
>  
> +	tasks_fozen = (todo == 0);
>  	return todo ? -EBUSY : 0;
>  }
>  
> @@ -189,7 +193,14 @@ void thaw_processes(void)
>  	thaw_workqueues();
>  	thaw_tasks(true);
>  	thaw_tasks(false);
> +	tasks_fozen = false;
>  	schedule();
>  	printk("done.\n");
>  }
>  
> +void thaw_in_oops(void)
> +{
> +	if (tasks_fozen)
> +		thaw_processes();
> +}
> +
>

Andrew Morton July 26, 2011, 10:45 p.m. UTC | #3

On Wed, 27 Jul 2011 00:24:11 +0200
"Rafael J. Wysocki" <rjw@sisk.pl> wrote:

> > Hi Rafael,
> > 
> > I had a problem with the kernel stopping the machine forever because I got an
> > oops while tasks were frozen. It seems to me that we should thaw when this
> > happens. How about this approach?
> 
> Well, we do something like this already for the OOM killer (see
> oom_killer_disable() and friends), so I think it would be better to
> simply extend/modify that mechanism instead of adding a new one
> doing almost exactly the same thing.
> 
> I have no complaints about adding thaw_in_oops(), though, so long as
> Andrew thinks it makes sense.

mm...  The patch as proposed is very simple, direct, explicit.  I
suspect that trying to embed this operation within some other one would
end up producing a less clear result.  Sometimes we do exceptional and
weird things, and leaving the code exceptional and weird-looking is
better than hiding it in some framework, if you follow what I mean.

It does need some code comments to explain to people what it's doing
and more importantly why it's doing it.  Also, something which doesn't
break the build when CONFIG_FREEZER=n would be nice.

Rafael Wysocki July 27, 2011, 9:21 a.m. UTC | #4

On Wednesday, July 27, 2011, Andrew Morton wrote:
> On Wed, 27 Jul 2011 00:24:11 +0200
> "Rafael J. Wysocki" <rjw@sisk.pl> wrote:
> 
> > > Hi Rafael,
> > > 
> > > I had a problem with the kernel stopping the machine forever because I got an
> > > oops while tasks were frozen. It seems to me that we should thaw when this
> > > happens. How about this approach?
> > 
> > Well, we do something like this already for the OOM killer (see
> > oom_killer_disable() and friends), so I think it would be better to
> > simply extend/modify that mechanism instead of adding a new one
> > doing almost exactly the same thing.
> > 
> > I have no complaints about adding thaw_in_oops(), though, so long as
> > Andrew thinks it makes sense.
> 
> mm...  The patch as proposed is very simple, direct, explicit.  I
> suspect that trying to embed this operation within some other one would
> end up producing a less clear result.  Sometimes we do exceptional and
> weird things, and leaving the code exceptional and weird-looking is
> better than hiding it in some framework, if you follow what I mean.

My point is that instead of using the oom_killer_disabled variable in
page_alloc.c, we could use a oom_killer_disabled() function returning
the value of the new tasks_are_frozen variable.  Then,
oom_killer_disable/enable() won't be necessary any more.

> It does need some code comments to explain to people what it's doing
> and more importantly why it's doing it.  Also, something which doesn't
> break the build when CONFIG_FREEZER=n would be nice.

Right.

Thanks,
Rafael

Oliver Neukum July 27, 2011, 10:23 a.m. UTC | #5

Am Mittwoch, 27. Juli 2011, 11:21:21 schrieb Rafael J. Wysocki:
> > mm...  The patch as proposed is very simple, direct, explicit.  I
> > suspect that trying to embed this operation within some other one would
> > end up producing a less clear result.  Sometimes we do exceptional and
> > weird things, and leaving the code exceptional and weird-looking is
> > better than hiding it in some framework, if you follow what I mean.
> 
> My point is that instead of using the oom_killer_disabled variable in
> page_alloc.c, we could use a oom_killer_disabled() function returning
> the value of the new tasks_are_frozen variable.  Then,
> oom_killer_disable/enable() won't be necessary any more.

Well, if we do this, it'll be a separate change. Not that I disagree. 

	Regards
		Oliver

Pavel Machek July 29, 2011, 7:54 p.m. UTC | #6

On Mon 2011-07-25 10:43:19, Oliver Neukum wrote:
> Hi Rafael,
> 
> I had a problem with the kernel stopping the machine forever because I got an
> oops while tasks were frozen. It seems to me that we should thaw when this
> happens. How about this approach?

Danger, Will Robinson. You must not write to the filesystems after
hibernation started. By thawing, you may do just that.

It should be safe to thaw as long as final suspend signature is not
on disk and will not be written there.

Oliver Neukum July 29, 2011, 8:27 p.m. UTC | #7

Am Freitag, 29. Juli 2011, 21:54:30 schrieb Pavel Machek:
> On Mon 2011-07-25 10:43:19, Oliver Neukum wrote:
> > Hi Rafael,
> > 
> > I had a problem with the kernel stopping the machine forever because I got an
> > oops while tasks were frozen. It seems to me that we should thaw when this
> > happens. How about this approach?
> 
> Danger, Will Robinson. You must not write to the filesystems after
> hibernation started. By thawing, you may do just that.
> 
> It should be safe to thaw as long as final suspend signature is not
> on disk and will not be written there.

Yes. But this applies only if I use kernel-space hibernation, doesn't it?
So I need a second flag for the signature being written. Any other problem?

	Regards
		Oliver

Rafael Wysocki July 29, 2011, 8:27 p.m. UTC | #8

On Friday, July 29, 2011, Pavel Machek wrote:
> On Mon 2011-07-25 10:43:19, Oliver Neukum wrote:
> > Hi Rafael,
> > 
> > I had a problem with the kernel stopping the machine forever because I got an
> > oops while tasks were frozen. It seems to me that we should thaw when this
> > happens. How about this approach?
> 
> Danger, Will Robinson. You must not write to the filesystems after
> hibernation started. By thawing, you may do just that.
> 
> It should be safe to thaw as long as final suspend signature is not
> on disk and will not be written there.

This only applies to hibernation and only when we're going to restore
from the image the presumably hasn't been saved yet, right?

However, there's another problem I didn't think of before.  Namely,
we cannot thaw tasks before resuming devices in case we've already
suspended them, because that will defeat the very purpose of the
freezing in the first place.

Thanks,
Rafael

Rafael Wysocki July 29, 2011, 8:29 p.m. UTC | #9

On Friday, July 29, 2011, Oliver Neukum wrote:
> Am Freitag, 29. Juli 2011, 21:54:30 schrieb Pavel Machek:
> > On Mon 2011-07-25 10:43:19, Oliver Neukum wrote:
> > > Hi Rafael,
> > > 
> > > I had a problem with the kernel stopping the machine forever because I got an
> > > oops while tasks were frozen. It seems to me that we should thaw when this
> > > happens. How about this approach?
> > 
> > Danger, Will Robinson. You must not write to the filesystems after
> > hibernation started. By thawing, you may do just that.
> > 
> > It should be safe to thaw as long as final suspend signature is not
> > on disk and will not be written there.
> 
> Yes. But this applies only if I use kernel-space hibernation, doesn't it?
> So I need a second flag for the signature being written. Any other problem?

Yes, please see my reply to Pavel.

Thanks,
Rafael

Oliver Neukum July 29, 2011, 9:04 p.m. UTC | #10

Am Freitag, 29. Juli 2011, 22:27:54 schrieb Rafael J. Wysocki:
> However, there's another problem I didn't think of before.  Namely,
> we cannot thaw tasks before resuming devices in case we've already
> suspended them, because that will defeat the very purpose of the
> freezing in the first place.

That purpose is already defeated. The machine cannot be deader than dead.
We could try to go through the device tree. But the machine just oopsed
quite likely doing just that, so this doesn't seem wise to me.

IMHO if we oops while frozen, chances are the system is going to die.
It certainly dies now. However, we must certainly not thaw if a valid
image is already on disk, lest we corrupt a filesystem.

	Regards
		Oliver

Rafael Wysocki July 29, 2011, 9:09 p.m. UTC | #11

On Friday, July 29, 2011, Oliver Neukum wrote:
> Am Freitag, 29. Juli 2011, 22:27:54 schrieb Rafael J. Wysocki:
> > However, there's another problem I didn't think of before.  Namely,
> > we cannot thaw tasks before resuming devices in case we've already
> > suspended them, because that will defeat the very purpose of the
> > freezing in the first place.
> 
> That purpose is already defeated. The machine cannot be deader than dead.

OK, so what exactly is the purpose of the thawing, then?

Rafael

Oliver Neukum July 29, 2011, 9:17 p.m. UTC | #12

Am Freitag, 29. Juli 2011, 23:09:44 schrieb Rafael J. Wysocki:
> On Friday, July 29, 2011, Oliver Neukum wrote:
> > Am Freitag, 29. Juli 2011, 22:27:54 schrieb Rafael J. Wysocki:
> > > However, there's another problem I didn't think of before.  Namely,
> > > we cannot thaw tasks before resuming devices in case we've already
> > > suspended them, because that will defeat the very purpose of the
> > > freezing in the first place.
> > 
> > That purpose is already defeated. The machine cannot be deader than dead.
> 
> OK, so what exactly is the purpose of the thawing, then?

We may be lucky enough to get the oops written out to disk. The oops may happen
in a driver rarely used and be late in the sequence.

	Regards
		Oliver

Rafael Wysocki July 29, 2011, 9:30 p.m. UTC | #13

On Friday, July 29, 2011, Oliver Neukum wrote:
> Am Freitag, 29. Juli 2011, 23:09:44 schrieb Rafael J. Wysocki:
> > On Friday, July 29, 2011, Oliver Neukum wrote:
> > > Am Freitag, 29. Juli 2011, 22:27:54 schrieb Rafael J. Wysocki:
> > > > However, there's another problem I didn't think of before.  Namely,
> > > > we cannot thaw tasks before resuming devices in case we've already
> > > > suspended them, because that will defeat the very purpose of the
> > > > freezing in the first place.
> > > 
> > > That purpose is already defeated. The machine cannot be deader than dead.
> > 
> > OK, so what exactly is the purpose of the thawing, then?
> 
> We may be lucky enough to get the oops written out to disk. The oops may happen
> in a driver rarely used and be late in the sequence.

OK, so is there any guarantee that we won't corrupt things this way?

Rafael

better oopsing when frozen

Commit Message

Comments

Patch