[v2] x86/power: Fix 'nosmt' vs. hibernation triple fault during resume

Message ID nycvar.YFH.7.76.1905291230130.1962@cbobk.fhfr.pm (mailing list archive)
State Superseded, archived
Series [v2] x86/power: Fix 'nosmt' vs. hibernation triple fault during resume

Commit Message

Jiri Kosina May 29, 2019, 10:32 a.m. UTC
From: Jiri Kosina <jkosina@suse.cz>

As explained in

	0cc3cd21657b ("cpu/hotplug: Boot HT siblings at least once")

we always, no matter what, have to bring up x86 HT siblings during boot at
least once in order to avoid the first MCE bringing the system to its knees.

That means that whenever 'nosmt' is supplied on the kernel command line,
all the HT siblings are, as a result, sitting in mwait or cpuidle after
going through the online-offline cycle at least once.

This causes a serious issue though when a kernel, which saw 'nosmt' on its
command line, is going to perform resume from hibernation: if the resume
from the hibernated image is successful, cr3 is flipped in order to point
to the address space of the kernel that is being resumed, which in turn
means that all the HT siblings are all of a sudden mwaiting on an address
which is no longer valid.

That results in a triple fault shortly after cr3 is switched, and the
machine reboots.

Fix this by always waking up all the SMT siblings before initiating the 
'restore from hibernation' process; this guarantees that all the HT 
siblings will be properly carried over to the resumed kernel waiting in 
resume_play_dead(), and acted upon accordingly afterwards, based on the 
target kernel configuration.

Cc: stable@vger.kernel.org # v4.19+
Debugged-by: Thomas Gleixner <tglx@linutronix.de>
Fixes: 0cc3cd21657b ("cpu/hotplug: Boot HT siblings at least once")
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
---

v1 -> v2:
	restructure error handling as suggested by peterz
	add Rafael's ack

 arch/x86/power/cpu.c | 10 ++++++++++
 include/linux/cpu.h  |  2 ++
 kernel/cpu.c         |  2 +-
 3 files changed, 13 insertions(+), 1 deletion(-)

Comments

Peter Zijlstra May 29, 2019, 12:02 p.m. UTC | #1
On Wed, May 29, 2019 at 12:32:02PM +0200, Jiri Kosina wrote:
>  arch/x86/power/cpu.c | 10 ++++++++++
>  include/linux/cpu.h  |  2 ++
>  kernel/cpu.c         |  2 +-
>  3 files changed, 13 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/power/cpu.c b/arch/x86/power/cpu.c
> index a7d966964c6f..513ce09e9950 100644
> --- a/arch/x86/power/cpu.c
> +++ b/arch/x86/power/cpu.c
> @@ -299,7 +299,17 @@ int hibernate_resume_nonboot_cpu_disable(void)
>  	 * address in its instruction pointer may not be possible to resolve
>  	 * any more at that point (the page tables used by it previously may
>  	 * have been overwritten by hibernate image data).
> +	 *
> +	 * First, make sure that we wake up all the potentially disabled SMT
> +	 * threads which have been initially brought up and then put into
> +	 * mwait/cpuidle sleep.
> +	 * Those will be put to proper (not interfering with hibernation
> +	 * resume) sleep afterwards, and the resumed kernel will decide itself
> +	 * what to do with them.
>  	 */
> +	ret = cpuhp_smt_enable();
> +	if (ret)
> +		return ret;
>  	smp_ops.play_dead = resume_play_dead;
>  	ret = disable_nonboot_cpus();
>  	smp_ops.play_dead = play_dead;

Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Josh Poimboeuf May 29, 2019, 4:10 p.m. UTC | #2
On Wed, May 29, 2019 at 12:32:02PM +0200, Jiri Kosina wrote:
> From: Jiri Kosina <jkosina@suse.cz>
> 
> As explained in
> 
> 	0cc3cd21657b ("cpu/hotplug: Boot HT siblings at least once")
> 
> we always, no matter what, have to bring up x86 HT siblings during boot at
> least once in order to avoid the first MCE bringing the system to its knees.
> 
> That means that whenever 'nosmt' is supplied on the kernel command line,
> all the HT siblings are, as a result, sitting in mwait or cpuidle after
> going through the online-offline cycle at least once.
> 
> This causes a serious issue though when a kernel, which saw 'nosmt' on its
> command line, is going to perform resume from hibernation: if the resume
> from the hibernated image is successful, cr3 is flipped in order to point
> to the address space of the kernel that is being resumed, which in turn
> means that all the HT siblings are all of a sudden mwaiting on an address
> which is no longer valid.
> 
> That results in a triple fault shortly after cr3 is switched, and the
> machine reboots.
> 
> Fix this by always waking up all the SMT siblings before initiating the 
> 'restore from hibernation' process; this guarantees that all the HT 
> siblings will be properly carried over to the resumed kernel waiting in 
> resume_play_dead(), and acted upon accordingly afterwards, based on the 
> target kernel configuration.

hibernation_restore() is called by user space at runtime, via ioctl or
sysfs.  So I think this still doesn't fix the case where you've disabled
CPUs at runtime via sysfs, and then resumed from hibernation.  Or are we
declaring that this is not a supported scenario?

Would it be possible for mwait_play_dead() to instead just monitor a
fixmap address which doesn't change with KASLR?
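
For reference, mwait_play_dead() currently arms MONITOR/MWAIT on an address in
the offlined CPU's own kernel memory, which is exactly what stops being valid
once cr3 is flipped to the resumed image. A rough sketch of that loop (heavily
simplified from arch/x86/kernel/smpboot.c of this era; C-state selection and
the surrounding details are elided, so treat this as illustrative rather than
the verbatim source):

#include <linux/thread_info.h>  /* current_thread_info() */
#include <asm/barrier.h>        /* mb() */
#include <asm/mwait.h>          /* __monitor(), __mwait() */
#include <asm/special_insns.h>  /* wbinvd(), clflush() */

static void mwait_play_dead_sketch(void)
{
        /* A cache line in the old kernel's memory, private to this CPU. */
        void *mwait_ptr = &current_thread_info()->flags;

        wbinvd();
        while (1) {
                mb();
                clflush(mwait_ptr);             /* erratum workaround in the real code */
                mb();
                __monitor(mwait_ptr, 0, 0);     /* VA only resolves under the old cr3 */
                mb();
                __mwait(0, 0);
        }
}

Monitoring a fixmap address instead would make the monitored VA independent of
the old image's page tables and KASLR offset, which is what the question above
is aiming at.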

Is there a reason why maxcpus= doesn't do the CR4.MCE booted_once
dance?
Jiri Kosina May 29, 2019, 4:26 p.m. UTC | #3
On Wed, 29 May 2019, Josh Poimboeuf wrote:

> hibernation_restore() is called by user space at runtime, via ioctl or 
> sysfs.  So I think this still doesn't fix the case where you've disabled 
> CPUs at runtime via sysfs, and then resumed from hibernation.  Or are we 
> declaring that this is not a supported scenario?

Yeah I personally find that scenario awkward :) Anyway, cpuhp_smt_enable() 
is going to online even those potentially "manually" offlined CPUs, isn't 
it?

Are you perhaps suggesting to call enable_nonboot_cpus() instead of 
cpuhp_smt_enable() here to make it more explicit?

> Is there a reason why maxcpus= doesn't do the CR4.MCE booted_once
> dance?

I am not sure whether it's really needed. My understanding is that the MCE 
issue happens only after the primary sibling has been brought up; if that 
never happened, MCE wouldn't be broadcasted to that core at all in the 
first place.

But this needs to be confirmed by Intel.
Peter Zijlstra May 29, 2019, 5 p.m. UTC | #4
On Wed, May 29, 2019 at 06:26:59PM +0200, Jiri Kosina wrote:
> On Wed, 29 May 2019, Josh Poimboeuf wrote:

> > Is there a reason why maxcpus= doesn't do the CR4.MCE booted_once
> > dance?
> 
> I am not sure whether it's really needed. My understanding is that the MCE 
> > issue happens only after the primary sibling has been brought up; if that 
> never happened, MCE wouldn't be broadcasted to that core at all in the 
> first place.
> 
> But this needs to be confirmed by Intel.

(I'm not confirming anything, as I've no clue), but that code stems from
long before we found out about that brilliant MCE stuff (which was
fairly recent).
Thomas Gleixner May 29, 2019, 5:15 p.m. UTC | #5
On Wed, 29 May 2019, Peter Zijlstra wrote:
> On Wed, May 29, 2019 at 06:26:59PM +0200, Jiri Kosina wrote:
> > On Wed, 29 May 2019, Josh Poimboeuf wrote:
> 
> > > Is there a reason why maxcpus= doesn't do the CR4.MCE booted_once
> > > dance?
> > 
> > I am not sure whether it's really needed. My understanding is that the MCE 
> > > issue happens only after the primary sibling has been brought up; if that 
> > never happened, MCE wouldn't be broadcasted to that core at all in the 
> > first place.
> > 
> > But this needs to be confirmed by Intel.
> 
> (I'm not confirming anything, as I've no clue), but that code stems from
> long before we found out about that brilliant MCE stuff (which was
> fairly recent).

Actually we knew about the brilliant MCE wreckage for a long time and
maxcpus was always considered to be a debug/testing bandaid and not to be
used for anything serious in production.

Of course 'nosmt' changed that, because it is aimed at production
scenarios, so we were forced to deal with that 'feature'.

We could do the same thing with 'maxcpus' now that we have all the
mechanisms there at our fingertips already, but I'd rather not do it.

Thanks,

	tglx
Josh Poimboeuf May 29, 2019, 5:17 p.m. UTC | #6
On Wed, May 29, 2019 at 06:26:59PM +0200, Jiri Kosina wrote:
> On Wed, 29 May 2019, Josh Poimboeuf wrote:
> 
> > hibernation_restore() is called by user space at runtime, via ioctl or 
> > sysfs.  So I think this still doesn't fix the case where you've disabled 
> > CPUs at runtime via sysfs, and then resumed from hibernation.  Or are we 
> > declaring that this is not a supported scenario?
> 
> Yeah I personally find that scenario awkward :) Anyway, cpuhp_smt_enable() 
> is going to online even those potentially "manually" offlined CPUs, isn't 
> it?
> 
> Are you perhaps suggesting to call enable_nonboot_cpus() instead of 
> cpuhp_smt_enable() here to make it more explicit?

Maybe, but I guess that wouldn't work as-is because it relies on
the frozen_cpus mask.  

But maybe this is just a scenario we don't care about anyway?

I still have the question about whether we could make mwait_play_dead()
monitor a fixed address.  If we could get that to work, that seems more
robust to me.

Another question.  With your patch, if booted with nosmt, is SMT still
disabled after you resume from hibernation?  I don't see how SMT would
get disabled again.

> > Is there a reason why maxcpus= doesn't do the CR4.MCE booted_once
> > dance?
> 
> I am not sure whether it's really needed. My understanding is that the MCE 
> > issue happens only after the primary sibling has been brought up; if that 
> never happened, MCE wouldn't be broadcasted to that core at all in the 
> first place.
> 
> But this needs to be confirmed by Intel.

Right, but can't maxcpus= create scenarios where only the primary
sibling has been brought up?

Anyway, Thomas indicated on IRC that maxcpus= may be deprecated and
should probably be documented as such.  So maybe it's another scenario
we don't care about.
Jiri Kosina May 29, 2019, 5:29 p.m. UTC | #7
On Wed, 29 May 2019, Josh Poimboeuf wrote:

> I still have the question about whether we could make mwait_play_dead() 
> monitor a fixed address.  If we could get that to work, that seems more 
> robust to me.

Hmm, does it really?

That'd mean the resumer and resumee must have the same fixmap. How are you 
going to guarantee that? Currently the resuming kernel doesn't really have 
to be the same as the one that is being resumed.

> Another question.  With your patch, if booted with nosmt, is SMT still 
> disabled after you resume from hibernation?  

Yup, it is.

> I don't see how SMT would get disabled again.

The target kernel only onlines the CPUs which were online at the time of 
hibernation (and are therefore in frozen_cpus mask).
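
For context, the resumed kernel's re-onlining walks only the frozen_cpus mask
recorded when the secondary CPUs were frozen; roughly (a simplified sketch of
enable_nonboot_cpus() in kernel/cpu.c of this era, with tracing and some of
the error handling trimmed):

void enable_nonboot_cpus(void)
{
        int cpu, error;

        cpu_maps_update_begin();
        __cpu_hotplug_enable();

        /* Only CPUs that were online at freeze time get brought back. */
        for_each_cpu(cpu, frozen_cpus) {
                error = _cpu_up(cpu, 1, CPUHP_ONLINE);
                if (error)
                        pr_warn("Error taking CPU%d up: %d\n", cpu, error);
        }

        cpumask_clear(frozen_cpus);
        cpu_maps_update_end();
}

SMT siblings that were already offline at hibernation time are not in
frozen_cpus, so they stay offline in the resumed kernel.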
Jiri Kosina May 29, 2019, 6:02 p.m. UTC | #8
On Wed, 29 May 2019, Jiri Kosina wrote:

> The target kernel only onlines the CPUs which were online at the time of 
> hibernation (and are therefore in frozen_cpus mask).

Hm, there is a catch though. After resume, the SMT siblings are now in hlt 
instead of mwait.

Which means that the resumed kernel has to do one more online/offline 
cycle for them, to push them to mwait again.

Bah.

I'll send v3 shortly, so please don't apply v2 just yet.
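
For reference, the siblings end up in hlt because the restore path parks them
via resume_play_dead() (the handler installed into smp_ops.play_dead in the
hunk above), which boils down to the following in arch/x86/power/cpu.c and
spins in hlt rather than mwait:

static int resume_play_dead(void)
{
        play_dead_common();
        tboot_shutdown(TB_SHUTDOWN_WFS);
        hlt_play_dead();        /* parks the CPU in a hlt loop, not mwait */
        return 0;
}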

Patch

diff --git a/arch/x86/power/cpu.c b/arch/x86/power/cpu.c
index a7d966964c6f..513ce09e9950 100644
--- a/arch/x86/power/cpu.c
+++ b/arch/x86/power/cpu.c
@@ -299,7 +299,17 @@  int hibernate_resume_nonboot_cpu_disable(void)
 	 * address in its instruction pointer may not be possible to resolve
 	 * any more at that point (the page tables used by it previously may
 	 * have been overwritten by hibernate image data).
+	 *
+	 * First, make sure that we wake up all the potentially disabled SMT
+	 * threads which have been initially brought up and then put into
+	 * mwait/cpuidle sleep.
+	 * Those will be put to proper (not interfering with hibernation
+	 * resume) sleep afterwards, and the resumed kernel will decide itself
+	 * what to do with them.
 	 */
+	ret = cpuhp_smt_enable();
+	if (ret)
+		return ret;
 	smp_ops.play_dead = resume_play_dead;
 	ret = disable_nonboot_cpus();
 	smp_ops.play_dead = play_dead;
diff --git a/include/linux/cpu.h b/include/linux/cpu.h
index 3813fe45effd..b5523552a607 100644
--- a/include/linux/cpu.h
+++ b/include/linux/cpu.h
@@ -201,10 +201,12 @@  enum cpuhp_smt_control {
 extern enum cpuhp_smt_control cpu_smt_control;
 extern void cpu_smt_disable(bool force);
 extern void cpu_smt_check_topology(void);
+extern int cpuhp_smt_enable(void);
 #else
 # define cpu_smt_control		(CPU_SMT_NOT_IMPLEMENTED)
 static inline void cpu_smt_disable(bool force) { }
 static inline void cpu_smt_check_topology(void) { }
+static inline int cpuhp_smt_enable(void) { return 0; }
 #endif
 
 /*
diff --git a/kernel/cpu.c b/kernel/cpu.c
index f2ef10460698..3ff5ce0e4132 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -2093,7 +2093,7 @@  static int cpuhp_smt_disable(enum cpuhp_smt_control ctrlval)
 	return ret;
 }
 
-static int cpuhp_smt_enable(void)
+int cpuhp_smt_enable(void)
 {
 	int cpu, ret = 0;