ARM: call disable_nonboot_cpus() from machine_shutdown()

Message ID	1357160861-26282-1-git-send-email-swarren@wwwdotorg.org (mailing list archive)
State	New, archived
Headers	show Return-Path: <linux-arm-kernel-bounces+patchwork-linux-arm=patchwork.kernel.org@lists.infradead.org> From: Stephen Warren <swarren@wwwdotorg.org> To: Russell King <linux@arm.linux.org.uk> Subject: [PATCH] ARM: call disable_nonboot_cpus() from machine_shutdown() Date: Wed, 2 Jan 2013 14:07:41 -0700 Message-Id: <1357160861-26282-1-git-send-email-swarren@wwwdotorg.org> summary: Content analysis details: (-2.6 points) pts rule name description ---- ---------------------- -------------------------------------------------- -0.0 SPF_PASS SPF: sender matches SPF record -0.7 RP_MATCHES_RCVD Envelope sender domain matches handover relay domain -1.9 BAYES_00 BODY: Bayes spam probability is 0 to 1% [score: 0.0000] Cc: kexec@lists.infradead.org, Will Deacon <will.deacon@arm.com>, Eric Biederman <ebiederm@xmission.com>, linux-arm-kernel@lists.infradead.org, Stephen Warren <swarren@nvidia.com> Precedence: list MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Sender: linux-arm-kernel-bounces@lists.infradead.org Errors-To: linux-arm-kernel-bounces+patchwork-linux-arm=patchwork.kernel.org@lists.infradead.org

Stephen Warren Jan. 2, 2013, 9:07 p.m. UTC

From: Stephen Warren <swarren@nvidia.com>

On Tegra at least, this change allows kexec to work with SMP enabled.
Without this,  machine_shutdown() simply puts all CPUs into a loop. If
the code of that loop is over-written, the CPUs may hang or crash (which
I do observe in practice), or cause the kexec'd kernel not to be able to
initialize them.

This fix has the added benefit that the kexec always happens on the boot
CPU, and thus kexec mirrors the initial kernel boot as much as possible.

Signed-off-by: Stephen Warren <swarren@nvidia.com>
---
Russell, I assume this should go into the ARM patch tracker if OK?

 arch/arm/kernel/process.c |    1 +
 1 file changed, 1 insertion(+)

Russell King - ARM Linux Jan. 2, 2013, 9:52 p.m. UTC | #1

On Wed, Jan 02, 2013 at 02:07:41PM -0700, Stephen Warren wrote:
> From: Stephen Warren <swarren@nvidia.com>
> 
> On Tegra at least, this change allows kexec to work with SMP enabled.
> Without this,  machine_shutdown() simply puts all CPUs into a loop. If
> the code of that loop is over-written, the CPUs may hang or crash (which
> I do observe in practice), or cause the kexec'd kernel not to be able to
> initialize them.
> 
> This fix has the added benefit that the kexec always happens on the boot
> CPU, and thus kexec mirrors the initial kernel boot as much as possible.
> 
> Signed-off-by: Stephen Warren <swarren@nvidia.com>
> ---
> Russell, I assume this should go into the ARM patch tracker if OK?

I guess the discussions about getting this into the generic parts of
the kernel came to nothing?

So, yes.

> 
>  arch/arm/kernel/process.c |    1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/arch/arm/kernel/process.c b/arch/arm/kernel/process.c
> index f79dd1e..1893bda 100644
> --- a/arch/arm/kernel/process.c
> +++ b/arch/arm/kernel/process.c
> @@ -239,6 +239,7 @@ __setup("reboot=", reboot_setup);
>  
>  void machine_shutdown(void)
>  {
> +	disable_nonboot_cpus();
>  #ifdef CONFIG_SMP
>  	smp_send_stop();
>  #endif
> -- 
> 1.7.10.4
>

Stephen Boyd Jan. 2, 2013, 11:59 p.m. UTC | #2

On 01/02/13 13:07, Stephen Warren wrote:
> From: Stephen Warren <swarren@nvidia.com>
>
> On Tegra at least, this change allows kexec to work with SMP enabled.
> Without this,  machine_shutdown() simply puts all CPUs into a loop. If
> the code of that loop is over-written, the CPUs may hang or crash (which
> I do observe in practice), or cause the kexec'd kernel not to be able to
> initialize them.
>
> This fix has the added benefit that the kexec always happens on the boot
> CPU, and thus kexec mirrors the initial kernel boot as much as possible.
>
> Signed-off-by: Stephen Warren <swarren@nvidia.com>
> ---
> Russell, I assume this should go into the ARM patch tracker if OK?
>
>  arch/arm/kernel/process.c |    1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/arch/arm/kernel/process.c b/arch/arm/kernel/process.c
> index f79dd1e..1893bda 100644
> --- a/arch/arm/kernel/process.c
> +++ b/arch/arm/kernel/process.c
> @@ -239,6 +239,7 @@ __setup("reboot=", reboot_setup);
>  
>  void machine_shutdown(void)
>  {
> +	disable_nonboot_cpus();
>  #ifdef CONFIG_SMP
>  	smp_send_stop();
>  #endif

How does this work in a CONFIG_SUSPEND=n build? It looks like
disable_nonboot_cpus() would be a no-op and so we wouldn't actually hot
unplug the other CPUs before sending the smp_send_stop(). And then the
smp_send_stop() seems a little unnecessary if we do actually shutdown
the other CPUs.

Will Deacon Jan. 3, 2013, 12:02 p.m. UTC | #3

On Wed, Jan 02, 2013 at 11:59:02PM +0000, Stephen Boyd wrote:
> On 01/02/13 13:07, Stephen Warren wrote:
> > From: Stephen Warren <swarren@nvidia.com>
> >
> > On Tegra at least, this change allows kexec to work with SMP enabled.
> > Without this,  machine_shutdown() simply puts all CPUs into a loop. If
> > the code of that loop is over-written, the CPUs may hang or crash (which
> > I do observe in practice), or cause the kexec'd kernel not to be able to
> > initialize them.
> >
> > This fix has the added benefit that the kexec always happens on the boot
> > CPU, and thus kexec mirrors the initial kernel boot as much as possible.
> >
> > Signed-off-by: Stephen Warren <swarren@nvidia.com>
> > ---
> > Russell, I assume this should go into the ARM patch tracker if OK?
> >
> >  arch/arm/kernel/process.c |    1 +
> >  1 file changed, 1 insertion(+)
> >
> > diff --git a/arch/arm/kernel/process.c b/arch/arm/kernel/process.c
> > index f79dd1e..1893bda 100644
> > --- a/arch/arm/kernel/process.c
> > +++ b/arch/arm/kernel/process.c
> > @@ -239,6 +239,7 @@ __setup("reboot=", reboot_setup);
> >  
> >  void machine_shutdown(void)
> >  {
> > +	disable_nonboot_cpus();
> >  #ifdef CONFIG_SMP
> >  	smp_send_stop();
> >  #endif
> 
> How does this work in a CONFIG_SUSPEND=n build? It looks like
> disable_nonboot_cpus() would be a no-op and so we wouldn't actually hot
> unplug the other CPUs before sending the smp_send_stop(). And then the
> smp_send_stop() seems a little unnecessary if we do actually shutdown
> the other CPUs.

You need the smp_send_stop call in order to send the cpu_kill (looks like
tegra needs die and then kill). So you really need hotplug support as well
as suspend for this to do much (if not, the secondaries end up spinning
with interrupts disabled which is probably the best we can do anyway).

We could add SUSPEND as a KEXEC dependency if SMP (we already have HOTPLUG
there) if you like?

Will

Will Deacon Jan. 3, 2013, 12:03 p.m. UTC | #4

On Wed, Jan 02, 2013 at 09:07:41PM +0000, Stephen Warren wrote:
> From: Stephen Warren <swarren@nvidia.com>
> 
> On Tegra at least, this change allows kexec to work with SMP enabled.
> Without this,  machine_shutdown() simply puts all CPUs into a loop. If
> the code of that loop is over-written, the CPUs may hang or crash (which
> I do observe in practice), or cause the kexec'd kernel not to be able to
> initialize them.
> 
> This fix has the added benefit that the kexec always happens on the boot
> CPU, and thus kexec mirrors the initial kernel boot as much as possible.
> 
> Signed-off-by: Stephen Warren <swarren@nvidia.com>

Acked-by: Will Deacon <will.deacon@arm.com>

Cheers Stephen,

Will

Russell King - ARM Linux Jan. 3, 2013, 12:21 p.m. UTC | #5

On Thu, Jan 03, 2013 at 12:02:59PM +0000, Will Deacon wrote:
> You need the smp_send_stop call in order to send the cpu_kill (looks like
> tegra needs die and then kill). So you really need hotplug support as well
> as suspend for this to do much (if not, the secondaries end up spinning
> with interrupts disabled which is probably the best we can do anyway).
> 
> We could add SUSPEND as a KEXEC dependency if SMP (we already have HOTPLUG
> there) if you like?

Or we could look into bringing in the code to do this when KEXEC is
enabled - which would mean an amount of restructuring of the Kconfig
files.

In other words, go from this:

config SUSPEND
        bool "Suspend to RAM and standby"
        depends on ARCH_SUSPEND_POSSIBLE
        default y

config HIBERNATE_CALLBACKS
        bool

config PM_SLEEP
        def_bool y
        depends on SUSPEND || HIBERNATE_CALLBACKS

config PM_SLEEP_SMP
        def_bool y
        depends on SMP
        depends on ARCH_SUSPEND_POSSIBLE || ARCH_HIBERNATION_POSSIBLE
        depends on PM_SLEEP
        select HOTPLUG
        select HOTPLUG_CPU

to:

config SUSPEND
	bool...
	depends on ARCH_SUSPEND_POSSIBLE
	default y
	select PM_SLEEP

config HIBERNATE_CALLBACKS
	bool
	select PM_SLEEP

config PM_SLEEP
	bool
	select PM_SLEEP_SMP if SMP && (ARCH_SUSPEND_POSSIBLE || ARCH_HIBERNATION_POSSIBLE)

config PM_SLEEP_SMP
	bool
        select HOTPLUG
        select HOTPLUG_CPU

This means we could have KEXEC select PM_SLEEP_SMP (and maybe PM_SLEEP) as
required, rather than bringing in the entire suspend support just to have
working kexec.

However, how many platforms don't have suspend support enabled?

Jason Gunthorpe Jan. 3, 2013, 6:08 p.m. UTC | #6

On Thu, Jan 03, 2013 at 12:21:00PM +0000, Russell King - ARM Linux wrote:

> This means we could have KEXEC select PM_SLEEP_SMP (and maybe PM_SLEEP) as
> required, rather than bringing in the entire suspend support just to have
> working kexec.
> 
> However, how many platforms don't have suspend support enabled?

I was looking into something similar to this for Kirkwood - my
interest was in getting kexec to work, which means it needs to call
the hibernation callbacks in the drivers prior to execing, otherwise
there is a fairly high chance DMA will corrupt the kexec target.

So this means the code needs the suspend support, however Kirkwood
doesn't have any support for ARM_CPU_SUSPEND, which blocks enabling
suspend at all for that platform. I was looking at moving the 'depends
on CPU_..' from ARCH_SUSPEND_POSSIBLE to ARM_CPU_SUSPEND, which seemed
to do the trick for kirkwood but I didn't study every possibility..

Regards,
Jason

Stephen Warren Jan. 3, 2013, 8:26 p.m. UTC | #7

On 01/03/2013 05:21 AM, Russell King - ARM Linux wrote:
> On Thu, Jan 03, 2013 at 12:02:59PM +0000, Will Deacon wrote:
>> You need the smp_send_stop call in order to send the cpu_kill (looks like
>> tegra needs die and then kill). So you really need hotplug support as well
>> as suspend for this to do much (if not, the secondaries end up spinning
>> with interrupts disabled which is probably the best we can do anyway).
>>
>> We could add SUSPEND as a KEXEC dependency if SMP (we already have HOTPLUG
>> there) if you like?
> 
> Or we could look into bringing in the code to do this when KEXEC is
> enabled - which would mean an amount of restructuring of the Kconfig
> files.

I'm not sure if any of this thread means we should hold off on this
patch, or just that the Kconfig could/should be enhanced later?

One thing I did just notice with my patch: disable_nonboot_cpus() ends
up being called twice for the poweroff path:

[   30.461847] Disabling non-boot CPUs ...
[   30.478797] CPU1: shutdown
[   30.492104] Power down.
[   30.494578] Disabling non-boot CPUs ...

Is this worth worrying about?

It's because kernel_power_off() calls disable_nonboot_cpus() then
machine_shutdown() which also calls disable_nonboot_cpus().

Will Deacon Jan. 6, 2013, 4:22 p.m. UTC | #8

On Thu, Jan 03, 2013 at 08:26:15PM +0000, Stephen Warren wrote:
> On 01/03/2013 05:21 AM, Russell King - ARM Linux wrote:
> > On Thu, Jan 03, 2013 at 12:02:59PM +0000, Will Deacon wrote:
> >> You need the smp_send_stop call in order to send the cpu_kill (looks like
> >> tegra needs die and then kill). So you really need hotplug support as well
> >> as suspend for this to do much (if not, the secondaries end up spinning
> >> with interrupts disabled which is probably the best we can do anyway).
> >>
> >> We could add SUSPEND as a KEXEC dependency if SMP (we already have HOTPLUG
> >> there) if you like?
> > 
> > Or we could look into bringing in the code to do this when KEXEC is
> > enabled - which would mean an amount of restructuring of the Kconfig
> > files.
> 
> I'm not sure if any of this thread means we should hold off on this
> patch, or just that the Kconfig could/should be enhanced later?

Well, Russell's suggestion looked easy enough to have a crack out so you
could always post a series implementing it along with this patch.

> One thing I did just notice with my patch: disable_nonboot_cpus() ends
> up being called twice for the poweroff path:
> 
> [   30.461847] Disabling non-boot CPUs ...
> [   30.478797] CPU1: shutdown
> [   30.492104] Power down.
> [   30.494578] Disabling non-boot CPUs ...
> 
> Is this worth worrying about?

It's harmless but it's also pretty horrible. Unfortunately, I don't see
what we can do about it: it's a direct side-effect of generic code calling
disable_nonboot_cpus for poweroff and not for kexec.

Will

Russell King - ARM Linux Jan. 6, 2013, 4:40 p.m. UTC | #9

On Sun, Jan 06, 2013 at 04:22:00PM +0000, Will Deacon wrote:
> On Thu, Jan 03, 2013 at 08:26:15PM +0000, Stephen Warren wrote:
> > On 01/03/2013 05:21 AM, Russell King - ARM Linux wrote:
> > > On Thu, Jan 03, 2013 at 12:02:59PM +0000, Will Deacon wrote:
> > >> You need the smp_send_stop call in order to send the cpu_kill (looks like
> > >> tegra needs die and then kill). So you really need hotplug support as well
> > >> as suspend for this to do much (if not, the secondaries end up spinning
> > >> with interrupts disabled which is probably the best we can do anyway).
> > >>
> > >> We could add SUSPEND as a KEXEC dependency if SMP (we already have HOTPLUG
> > >> there) if you like?
> > > 
> > > Or we could look into bringing in the code to do this when KEXEC is
> > > enabled - which would mean an amount of restructuring of the Kconfig
> > > files.
> > 
> > I'm not sure if any of this thread means we should hold off on this
> > patch, or just that the Kconfig could/should be enhanced later?
> 
> Well, Russell's suggestion looked easy enough to have a crack out so you
> could always post a series implementing it along with this patch.
> 
> > One thing I did just notice with my patch: disable_nonboot_cpus() ends
> > up being called twice for the poweroff path:
> > 
> > [   30.461847] Disabling non-boot CPUs ...
> > [   30.478797] CPU1: shutdown
> > [   30.492104] Power down.
> > [   30.494578] Disabling non-boot CPUs ...
> > 
> > Is this worth worrying about?
> 
> It's harmless but it's also pretty horrible. Unfortunately, I don't see
> what we can do about it: it's a direct side-effect of generic code calling
> disable_nonboot_cpus for poweroff and not for kexec.

I think we're starting to hit the age old problem with software: over
time it becomes more difficult to change established software as it
becomes more risky to make changes, and no one wants to make those
changes through fear of breakign something else.

Eventually, this results in someone declaring that the current project
is beyond hope, and the next Linus Torvalds comes along and the next
Linux kernel project will be born.  That probably isn't want people want,
but it's what will eventually happen when things just get too painful.

Unless, of course, we start pushing back now and don't leave issues to
fester.

However, that said, I can see why kexec does this.  I've not looked too
deeply into kexec, but I'd image that the path where the secondary CPUs
are not taken offline prior to kexec happening is to allow kexec to
be used for crash recovery, where you don't want to do too much in case
you re-tickle the bug that caused the original crash.

I suspect what platforms like x86 do in this scenario (via
machine_shutdown) is to forcefully put the secondary CPUs into reset
and hold them there until the new kexec'd kernel gets going.

The problem is... on ARM we're yet again shot in the foot through the
unwillingness to architect certain aspects of the system: there is no
architecturally known way to place CPUs into a reset state once they've
been started up.  In other words, once a CPU starts executing kernel
code, it needs to always execute kernel code or there must be some kind
of platform specific hook to disable it.

Maybe that's the answer here: have machine_shutdown() call out to some
platform specific function to do the forced-takedown of secondary CPUs,
and if there's no such specific function, then we use our present CPU
stopping method of making them spin in a WFI loop with IRQs disabled.

Eric W. Biederman Jan. 7, 2013, 1:53 a.m. UTC | #10

Russell King - ARM Linux <linux@arm.linux.org.uk> writes:

> On Sun, Jan 06, 2013 at 04:22:00PM +0000, Will Deacon wrote:
>> On Thu, Jan 03, 2013 at 08:26:15PM +0000, Stephen Warren wrote:
>> > On 01/03/2013 05:21 AM, Russell King - ARM Linux wrote:
>> > > On Thu, Jan 03, 2013 at 12:02:59PM +0000, Will Deacon wrote:
>> > >> You need the smp_send_stop call in order to send the cpu_kill (looks like
>> > >> tegra needs die and then kill). So you really need hotplug support as well
>> > >> as suspend for this to do much (if not, the secondaries end up spinning
>> > >> with interrupts disabled which is probably the best we can do anyway).
>> > >>
>> > >> We could add SUSPEND as a KEXEC dependency if SMP (we already have HOTPLUG
>> > >> there) if you like?
>> > > 
>> > > Or we could look into bringing in the code to do this when KEXEC is
>> > > enabled - which would mean an amount of restructuring of the Kconfig
>> > > files.
>> > 
>> > I'm not sure if any of this thread means we should hold off on this
>> > patch, or just that the Kconfig could/should be enhanced later?
>> 
>> Well, Russell's suggestion looked easy enough to have a crack out so you
>> could always post a series implementing it along with this patch.
>> 
>> > One thing I did just notice with my patch: disable_nonboot_cpus() ends
>> > up being called twice for the poweroff path:
>> > 
>> > [   30.461847] Disabling non-boot CPUs ...
>> > [   30.478797] CPU1: shutdown
>> > [   30.492104] Power down.
>> > [   30.494578] Disabling non-boot CPUs ...
>> > 
>> > Is this worth worrying about?
>> 
>> It's harmless but it's also pretty horrible. Unfortunately, I don't see
>> what we can do about it: it's a direct side-effect of generic code calling
>> disable_nonboot_cpus for poweroff and not for kexec.
>
> I think we're starting to hit the age old problem with software: over
> time it becomes more difficult to change established software as it
> becomes more risky to make changes, and no one wants to make those
> changes through fear of breakign something else.
>
> Eventually, this results in someone declaring that the current project
> is beyond hope, and the next Linus Torvalds comes along and the next
> Linux kernel project will be born.  That probably isn't want people want,
> but it's what will eventually happen when things just get too painful.
>
> Unless, of course, we start pushing back now and don't leave issues to
> fester.
>
> However, that said, I can see why kexec does this.  I've not looked too
> deeply into kexec, but I'd image that the path where the secondary CPUs
> are not taken offline prior to kexec happening is to allow kexec to
> be used for crash recovery, where you don't want to do too much in case
> you re-tickle the bug that caused the original crash.
>
> I suspect what platforms like x86 do in this scenario (via
> machine_shutdown) is to forcefully put the secondary CPUs into reset
> and hold them there until the new kexec'd kernel gets going.
>
> The problem is... on ARM we're yet again shot in the foot through the
> unwillingness to architect certain aspects of the system: there is no
> architecturally known way to place CPUs into a reset state once they've
> been started up.  In other words, once a CPU starts executing kernel
> code, it needs to always execute kernel code or there must be some kind
> of platform specific hook to disable it.
>
> Maybe that's the answer here: have machine_shutdown() call out to some
> platform specific function to do the forced-takedown of secondary CPUs,
> and if there's no such specific function, then we use our present CPU
> stopping method of making them spin in a WFI loop with IRQs disabled.

Yes.  On x86 we have had the generic equivalent of disable_nonboot_cpus
in the machine_shutdown for a long time.

Now I don't know the full history of what led someone to make a
conscious choice to not call disable_nonboot_cpus() in kernel_kexec()
before machine_shutdown() and machine_kexec().  But it was a deliberate
choice and someone needs to dig up the reasons and review if the reasons
are still valid before we can change the generic code path without
introducing regressions.

I remember there being some problems when disable_nonboot_cpus() was
added originially.  And if disable_nonboot_cpus disappears in
configurations without power managment that is another problem, because
platforms that have SMP always need that logic to happen.

But what it looks like is whoever put disable_nonboot_cpus() on the kernel
reboot path didn't do their homework, did a sloppy job and left us with
a situation where we need to write both generic code and arch specific
code to do the same thing.

I have cleaned up the mess that is the reboot path once a bunch of years
ago, and apparently it is deteriorating again.  Unfortunately right now
I don't have the time or inclination to sort out whatever the nonsense
disable_nonboot_cpus() has become.  What is required of
machine_shutdown() is clear even if there is duplication with
disable_nonboot_cpus().

Please note the only code path that generically calls machine_shutdown()
is kernel_kexec() so if you wish you can avoid duplication by
refactoring you architecture specific code.

Eric

Will Deacon Jan. 7, 2013, 2:25 p.m. UTC | #11

Hi Eric,

On Mon, Jan 07, 2013 at 01:53:30AM +0000, Eric W. Biederman wrote:
> Russell King - ARM Linux <linux@arm.linux.org.uk> writes:
> > The problem is... on ARM we're yet again shot in the foot through the
> > unwillingness to architect certain aspects of the system: there is no
> > architecturally known way to place CPUs into a reset state once they've
> > been started up.  In other words, once a CPU starts executing kernel
> > code, it needs to always execute kernel code or there must be some kind
> > of platform specific hook to disable it.
> >
> > Maybe that's the answer here: have machine_shutdown() call out to some
> > platform specific function to do the forced-takedown of secondary CPUs,
> > and if there's no such specific function, then we use our present CPU
> > stopping method of making them spin in a WFI loop with IRQs disabled.
> 
> Yes.  On x86 we have had the generic equivalent of disable_nonboot_cpus
> in the machine_shutdown for a long time.

Ok, we can do that that on ARM too, but we currently call machine_shutdown
from machine_{power_off, halt, restart} so we just need to fix that and we
will be doing functionally the same thing as x86.

> Please note the only code path that generically calls machine_shutdown()
> is kernel_kexec() so if you wish you can avoid duplication by
> refactoring you architecture specific code.

Exactly, that sounds like the right way forward to me. Thanks.

Will

Russell King - ARM Linux Jan. 7, 2013, 2:48 p.m. UTC | #12

On Sun, Jan 06, 2013 at 05:53:30PM -0800, Eric W. Biederman wrote:
> I have cleaned up the mess that is the reboot path once a bunch of years
> ago, and apparently it is deteriorating again.

Unfortunately, that's what happens when lots of cooks get their fingers
in a pie with no coordination.  This is why having maintainers is soo
important for code - a good maintainer ensures that the code remains
high quality by whatever means.

Code without maintainers is subject to modification in all kinds of
random ways, including duplicating code amongst architectures which
should be generic code - that happens because no one wants to understand
what other architectures require.

Your original cleanups, afaik, are still all intact.  The problem is
that since then, kexec has come along, and invented a new callback
(machine_shutdown) which gets called while the system is still live
in full SMP mode (and presumably all the devices are still running
and potentially scribbling over memory with their DMA.)  Meanwhile
kexec wants to pass control to the new kernel... that doesn't sound
particularly clever to me.

Stephen Warren Jan. 9, 2013, 12:06 a.m. UTC | #13

On 01/06/2013 06:53 PM, Eric W. Biederman wrote:
...
> Yes.  On x86 we have had the generic equivalent of disable_nonboot_cpus
> in the machine_shutdown for a long time.

It looks like the x86 implementation does a bit more than
disable_nonboot_cpus():

disable_nonboot_cpus():
	find first cpu in online cpu mask
	disable everything else

x86's machine_shutdown():
	default to rebooting on cpu 0
	if user specified a different cpu, override default
	bring that cpu online
	disable everything else

So, x86 always kexec's on a specific CPU, whereas if we use
disable_nonboot_cpus() on ARM, we'll end up kexec'ing on whichever is
the first online CPU, which might not be the actual boot CPU, and can vary.

On Tegra this doesn't (currently?) matter since CPU 0 can't be taken
offline due to our CPU hotplug driver disallowing it. But, perhaps other
SoCs or future Tegra versions don't/won't have this restriction, so the
difference will be material.

Should the x86 code be lifted into the implementation of
disable_nonboot_cpus()?

For the record, here's what I can tell about what the various
arch-specific machine_shutdown() do:

ARM, ARM64: calls smp_send_stop()
  -> disable_nonboot_cpus() would be correct

IA64: shuts down all CPUs except the current one
  -> disable_nonboot_cpus() would be correct

Microblaze: nothing (but no SMP support?)
  -> disable_nonboot_cpus() would be irrelevant, but fine

MIPS: machine-specific:
  Cavium Octeon: Shuts down CPUs, waits until num_online_cpus()==1
    Not sure which CPU isn't shut down though; the current one?
  Others: Nothing (but at least some have SMP support)
  -> disable_nonboot_cpus() would be a behaviour change

PPC: machine-specific
  Only implementations either aren't for SMP, or do nothing (but
presumably many have SMP support)
  -> disable_nonboot_cpus() would be a behaviour change

SH: smp_send_stop()
  -> disable_nonboot_cpus() would be correct

S390: nothing (but appears to have SMP support)
  -> disable_nonboot_cpus() would be a behaviour change

Tile: nothing (but bans kexec unless no SMP or only 1 CPU online)
  -> disable_nonboot_cpus() would be irrelevant, but fine, and perhaps
even allow removal of the kexec ban for SMP?

So at least I don't see anything that would particularly indicate that
having the kexec generic/core code call disable_nonboot_cpus() would
cause problems; many architectures have done something like that
themselves. That said, it certainly would cause some behavioural
differences on some big platforms like PPC...

Eric W. Biederman Jan. 11, 2013, 5:59 a.m. UTC | #14

Russell King - ARM Linux <linux@arm.linux.org.uk> writes:

> On Sun, Jan 06, 2013 at 05:53:30PM -0800, Eric W. Biederman wrote:
>> I have cleaned up the mess that is the reboot path once a bunch of years
>> ago, and apparently it is deteriorating again.
>
> Unfortunately, that's what happens when lots of cooks get their fingers
> in a pie with no coordination.  This is why having maintainers is soo
> important for code - a good maintainer ensures that the code remains
> high quality by whatever means.
>
> Code without maintainers is subject to modification in all kinds of
> random ways, including duplicating code amongst architectures which
> should be generic code - that happens because no one wants to understand
> what other architectures require.

I don't think you meant to but I have been rather strongly insulted.
With what time I can find I have been maintaining kexec.  Which I admit
has mostly been telling people don't do that silly stupid thing.

> Your original cleanups, afaik, are still all intact.  The problem is
> that since then, kexec has come along, and invented a new callback
> (machine_shutdown) which gets called while the system is still live
> in full SMP mode (and presumably all the devices are still running
> and potentially scribbling over memory with their DMA.)

The devices should not be running. The sequence of code is.
	kernel_restart_prepare();
	machine_shutdown();
	machine_kexec();

The reboot notifier and all of the device shutdown methods have been 
called in kernel_restart_prepare().  If drivers don't stop dma
in their shutdown method that is a driver bug.  I wish we could
use the remove method but I lost that fight a decade ago.

It is exactly the same sequence we use for a normal reboot
except for the the lack of disable_nonboot_cpus().

> Meanwhile
> kexec wants to pass control to the new kernel... that doesn't sound
> particularly clever to me.

Because I was annoyed I just stopped and between faded memories and
reading the kernel logs I think I can reconstruct how
disable_nonboot_cpus() got into the places it has gotten.

Ages and ages ago (the logs say Oct 2007) disable_nonboot_cpus() stopped
being a pure cpu-hotplug and power management thing and started being
called from kernel_power_off.  I objected to it's inclusion on the
reboot path but there was some deep magic ACPI ordering mess that
necessitated it be called in kernel_power_off.

In retrospect that disable_nonboot_cpus() call should have been x86
specific.

As I recall disable_nonboot_cpus() at the time was a cpu hotplug hack
and reading through the code again tonight it appears that
disable_noboot_cpus() continues to be a messed up cpu hotplug hack.

disable_nonboot_cpus() should really be called
sometimes_dangerously_hotunplug_all_but_one_cpu().

If that code is going to be something other than power management
specific it is not cool that disable_nonboot_cpus() is not always
enabled when SMP is enabled.  It means that architectures need to
implement two different ways of shutting down cpus.

One of the truly nasty things about cpu_hotplug is that it requires that
irqs be migrated away from a cpu with interrupts disabled, which at
least on x86 in some interrupt delivery modes is impossible to do
safely.  The only way to losslessly (and without wedging irq
controllers) in those interrupt delivery modes (needed for more than 8
cpus) is to migrate an irq in it's irq handler.  Which is fine for
setting /proc/irq/$N/smp_affinity but is useless for cpu hot-unplug,
where we need to guarantee that all irqs are going to stop hitting a
cpu.

Now sometimes_dangerously_hotunplug_all_but_one_cpu() on the reboot path 
was added just a few months ago in Oct 2012, and it appears to due to
weird ARM maintainership.  At least the x86 reboot_cpu_id option is
broken due to that addition.   The unlikely but very real ways to wedge
your cpu during a reboot still remain in those code paths and I expect
we simply have not seen regressions yet as there has not been enough
time for the regressions to be noticed and tracked down.

Now given that the inconsistences were caused by crazy ARM
maintainership and that disable_nonboot_cpus() is still crazy enough
to make me scared thinking about the code.

We should remove disable_nonboot_cpus() from the reboot path.   It is
still a crazy unmaintained cpu hotplug mess.

Move disable_nonboot_cpus() into machine_shutdown in the ARM reboot path
if you want to rely on that crazy piece of code.  Or possibly you can
write something ARM specific that requires less work and is safer.

Eric

Eric W. Biederman Jan. 11, 2013, 6:28 a.m. UTC | #15

Thanks for the cross architecture survey.

Stephen Warren <swarren@wwwdotorg.org> writes:

> On 01/06/2013 06:53 PM, Eric W. Biederman wrote:
> ...
>> Yes.  On x86 we have had the generic equivalent of disable_nonboot_cpus
>> in the machine_shutdown for a long time.
>
> It looks like the x86 implementation does a bit more than
> disable_nonboot_cpus():
>
> disable_nonboot_cpus():
> 	find first cpu in online cpu mask
> 	disable everything else
>
> x86's machine_shutdown():
> 	default to rebooting on cpu 0
> 	if user specified a different cpu, override default
> 	bring that cpu online
> 	disable everything else

That you can no longer specify a reboot_cpu_id on x86 is a regression
since Oct caused by people fixing another piece of the ARM SMP reboot.

Not that I know it is knew I wonder what other x86 regressions it
causes.  The migrating of irqs to different cpus with interrupts
disabled on interrupt the architectural x86 interrupt controllers that
don't support that and that I have experimentally verified can be locked
on with no recovery short of power cycling gives me the willies.

The actual reboot_cpu_id option was someones first hack at getting
the reboot to happen on the boot cpu so it may not matter much as
long as we reboot on the boot cpu.

reboot_cpu_id really wasn't a kexec thing in this case.

> So, x86 always kexec's on a specific CPU, whereas if we use
> disable_nonboot_cpus() on ARM, we'll end up kexec'ing on whichever is
> the first online CPU, which might not be the actual boot CPU, and can
> vary.

> On Tegra this doesn't (currently?) matter since CPU 0 can't be taken
> offline due to our CPU hotplug driver disallowing it. But, perhaps other
> SoCs or future Tegra versions don't/won't have this restriction, so the
> difference will be material.

> Should the x86 code be lifted into the implementation of
> disable_nonboot_cpus()?

disable_nonboot_cpus() roughly as it exists today is needed for
hibernation and some silliness like that.

I don't have any problem with generic code in the reboot path
doing:
if (cpu_online(0))
	set_cpus_allowed_ptr(current, cpumask_of(0));

All of the code in native_stop_other_cpus() on x86 is both simple and
pretty architecture specific so I don't see much if any point in making
a generic version.  At most there is an argument for an arch specific
stop_other_cpus() call.

I have a real hard time believe that power management and cpu hot-unplug
need to be as complex as disable_nonboot_cpus() suggest.

For justifying disable_nonboot_cpus() there is the question of how many
architectures is it safe to migrate irqs from one cpu to another with
irqs disabled.

> For the record, here's what I can tell about what the various
> arch-specific machine_shutdown() do:
>
> ARM, ARM64: calls smp_send_stop()
>   -> disable_nonboot_cpus() would be correct
>
> IA64: shuts down all CPUs except the current one
>   -> disable_nonboot_cpus() would be correct
>
> Microblaze: nothing (but no SMP support?)
>   -> disable_nonboot_cpus() would be irrelevant, but fine
>
> MIPS: machine-specific:
>   Cavium Octeon: Shuts down CPUs, waits until num_online_cpus()==1
>     Not sure which CPU isn't shut down though; the current one?
>   Others: Nothing (but at least some have SMP support)
>   -> disable_nonboot_cpus() would be a behaviour change
>
> PPC: machine-specific
>   Only implementations either aren't for SMP, or do nothing (but
> presumably many have SMP support)
>   -> disable_nonboot_cpus() would be a behaviour change
>
> SH: smp_send_stop()
>   -> disable_nonboot_cpus() would be correct
>
> S390: nothing (but appears to have SMP support)
>   -> disable_nonboot_cpus() would be a behaviour change
>
> Tile: nothing (but bans kexec unless no SMP or only 1 CPU online)
>   -> disable_nonboot_cpus() would be irrelevant, but fine, and perhaps
> even allow removal of the kexec ban for SMP?
>
> So at least I don't see anything that would particularly indicate that
> having the kexec generic/core code call disable_nonboot_cpus() would
> cause problems; many architectures have done something like that
> themselves. That said, it certainly would cause some behavioural
> differences on some big platforms like PPC...

Behavior differences someone introduced just a kernel ago in the PPC
reboot code to deal with ARMs lack of code in machine_shutdown to handle
this at all.

Eric

Russell King - ARM Linux Jan. 11, 2013, 10:04 a.m. UTC | #16

On Thu, Jan 10, 2013 at 09:59:35PM -0800, Eric W. Biederman wrote:
> Russell King - ARM Linux <linux@arm.linux.org.uk> writes:
> 
> > On Sun, Jan 06, 2013 at 05:53:30PM -0800, Eric W. Biederman wrote:
> >> I have cleaned up the mess that is the reboot path once a bunch of years
> >> ago, and apparently it is deteriorating again.
> >
> > Unfortunately, that's what happens when lots of cooks get their fingers
> > in a pie with no coordination.  This is why having maintainers is soo
> > important for code - a good maintainer ensures that the code remains
> > high quality by whatever means.
> >
> > Code without maintainers is subject to modification in all kinds of
> > random ways, including duplicating code amongst architectures which
> > should be generic code - that happens because no one wants to understand
> > what other architectures require.
> 
> I don't think you meant to but I have been rather strongly insulted.

Why would you be insulted by the above?  It's a statement of fact that
is true for anything.  Ever heard of the expression "Too many cooks
spoil the broth" ?  Does that expression insult you too?

You've admitted that you don't have time to look at the shutdown/reboot
paths very much, so they are by way of that fact lacking in maintainer
input - but if you read the rest of what I wrote, I also commented that
they were still in pretty good shape inspite of that.

So actually, there's a complement there.  Shame you missed it.

> Now sometimes_dangerously_hotunplug_all_but_one_cpu() on the reboot path 
> was added just a few months ago in Oct 2012, and it appears to due to
> weird ARM maintainership.

Err.  No.  Stop trying to accuse people of being crap when there's no
evidence:

commit f96972f2dc6365421cf2366ebd61ee4cf060c8d5
Author: Shawn Guo <shawn.guo@linaro.org>
Date:   Thu Oct 4 17:12:23 2012 -0700

    kernel/sys.c: call disable_nonboot_cpus() in kernel_restart()
...
    Signed-off-by: Shawn Guo <shawn.guo@linaro.org>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

No ARM maintainers were involved in that change.  I just checked my
mailboxes for September 2012 which is when it was mailed to the list.
It got missed, and Andrew replied and accepted the patch (as is
illustrated by the sign-offs).  And even if _I_ had been involved with
it, _I_ would not have spotted the x86 behaviour, just the same that
Andrew didn't.

Is this not itself evidence of my comment above: had the above change
gone through someone who had a deep understanding of that code - iow,
a maintainer, it probably would not have been committed to the kernel?

To accuse others who have nothing to do with the area, and weren't
involved with the change in any way of being responsible for that
change is... well... I don't think there's anything that can be said
to that which isn't inflamatory.

It seems that, given your comments, the above commit needs to be reverted,
and then _maybe_ the disable_nonboot_cpus() call should be added to ARMs
machine_shutdown() instead.

Stephen Warren Jan. 29, 2013, 10:01 p.m. UTC | #17

On 01/10/2013 10:59 PM, Eric W. Biederman wrote:
...
> disable_nonboot_cpus() should really be called
> sometimes_dangerously_hotunplug_all_but_one_cpu().
> 
> If that code is going to be something other than power management
> specific it is not cool that disable_nonboot_cpus() is not always
> enabled when SMP is enabled.  It means that architectures need to
> implement two different ways of shutting down cpus.
> 
> One of the truly nasty things about cpu_hotplug is that it requires that
> irqs be migrated away from a cpu with interrupts disabled, which at
> least on x86 in some interrupt delivery modes is impossible to do
> safely.  The only way to losslessly (and without wedging irq
> controllers) in those interrupt delivery modes (needed for more than 8
> cpus) is to migrate an irq in it's irq handler.  Which is fine for
> setting /proc/irq/$N/smp_affinity but is useless for cpu hot-unplug,
> where we need to guarantee that all irqs are going to stop hitting a
> cpu.

I'm a little confused about this; disable_nonboot_cpus() calls
_cpu_down() to hot-unplug the CPU, and the regular CPU hotplug path is
drivers/base/cpu.c:store_online() -> cpu_down() -> _cpu_down(), i.e. the
same basic code.

Given that, why is CPU hot unplug safe on x86 via sysfs (well, I'm just
assuming it must be, since it's enabled in my regular distro kernel and
appears to work fine), but not safe when it's done from
disable_nonboot_cpus()?

> Now sometimes_dangerously_hotunplug_all_but_one_cpu() on the reboot path 
> was added just a few months ago in Oct 2012, and it appears to due to
> weird ARM maintainership.  At least the x86 reboot_cpu_id option is
> broken due to that addition.

I guess this is true because reboot_cpu_id() isn't honored unless that
CPU is already offline, and disable_nonboot_cpus() probably took it
offline on average.

Perhaps x86's native_machine_shutdown() should explicitly bring that CPU
online if it's in cpu_possible_mask? Or perhaps disable_nonboot_cpus()
should be enhanced with x86's reboot_cpu_id logic, so it simply works
the same way across all architectures; that way some chunk of x86's
native_machine_shutdown() could be made common.

...
> We should remove disable_nonboot_cpus() from the reboot path.   It is
> still a crazy unmaintained cpu hotplug mess.

Even if we updated it to standardize the reboot_cpu_id logic across all
architectures, and hence made it still suitable for x86?

Stephen Warren Jan. 29, 2013, 10:10 p.m. UTC | #18

On 01/10/2013 11:28 PM, Eric W. Biederman wrote:
...
> I don't have any problem with generic code in the reboot path
> doing:
> if (cpu_online(0))
> 	set_cpus_allowed_ptr(current, cpumask_of(0));

It looks like that API just affects the scheduler, and not whether the
other CPUs are actually active/hot-plugged-in. At least for my use-case,
I need something that really disables the other CPUs so they aren't
executing code, hence my tendency to hot-un-plug them using
disable_nonboot_cpus(), rather than just shift task execution off them
using the code above.

I wonder if all architectures shouldn't always do the following in all
reboot/shutdown/kexec cases:

* set_cpus_allowed_ptr() to limit code execution to a single CPU.
* disable_nonboot_cpus() (or equivalent) if it's available to turn off
all the other CPUs.

The issue here would be that disable_nonboot_cpus() isn't always
available; I assume that's part of the reason that there are
arch-specific machine_xxx() hooks, so that architectures can
power-off/reset their CPUs even when hotplug isn't enabled? I wonder if
that can be refactored so that reboot/poweroff/kexec can share some
CPU-disable code with CPU hotplug?

ARM: call disable_nonboot_cpus() from machine_shutdown()

Commit Message

Comments

Patch