mbox series

[v3,0/2] arm64: Allow erratum 1418040 for late CPUs

Message ID 20200731173824.107480-1-maz@kernel.org (mailing list archive)
Headers show
Series arm64: Allow erratum 1418040 for late CPUs | expand

Message

Marc Zyngier July 31, 2020, 5:38 p.m. UTC
Erratum 1418040 currently prevents a late CPU from booting if none
of the early CPUs are affected by it. This is because the handling
is implemented as alternatives, and we have already got rid of them
by the time userspace onlines a new CPU.

A solution to this is to move everything into C code, and rely on
static keys instead. Once this is done, the feature can be allowed
for late CPUs.

Note that CPUs affected by 1418040 also tend to miss AArch32-EL1,
meaning they cannot be used as late CPUs when KVM is enabled and
that their sibblings have AArch32-EL1.

* From v1:
  - Dropped check for kernel threads
  - Added comment describing the switching logic
  - Made the errata handling function __always_inline

* From v2:
  - Dropped __always_inline again
  - Simplified logic

Marc Zyngier (2):
  arm64: Move handling of erratum 1418040 into C code
  arm64: Allow booting of late CPUs affected by erratum 1418040

 arch/arm64/kernel/cpu_errata.c |  2 ++
 arch/arm64/kernel/entry.S      | 21 --------------------
 arch/arm64/kernel/process.c    | 35 ++++++++++++++++++++++++++++++++++
 3 files changed, 37 insertions(+), 21 deletions(-)

Comments

Doug Anderson Sept. 9, 2020, 2:53 p.m. UTC | #1
Hi,

On Fri, Aug 21, 2020 at 11:15 AM Catalin Marinas
<catalin.marinas@arm.com> wrote:
>
> On Fri, 31 Jul 2020 18:38:22 +0100, Marc Zyngier wrote:
> > Erratum 1418040 currently prevents a late CPU from booting if none
> > of the early CPUs are affected by it. This is because the handling
> > is implemented as alternatives, and we have already got rid of them
> > by the time userspace onlines a new CPU.
> >
> > A solution to this is to move everything into C code, and rely on
> > static keys instead. Once this is done, the feature can be allowed
> > for late CPUs.
> >
> > [...]
>
> Applied to arm64 (for-next/fixes), thanks!
>
> [1/2] arm64: Move handling of erratum 1418040 into C code
>       https://git.kernel.org/arm64/c/d49f7d7376d0
> [2/2] arm64: Allow booting of late CPUs affected by erratum 1418040
>       https://git.kernel.org/arm64/c/bf87bb0881d0

NOTE: patch 2 seems to have come in through a stable merge onto Chrome
OS 5.4 and is causing a regression when resuming from suspend.  In the
short term we've got a revert going into our tree:

https://crrev.com/c/2399101

...but that's obviously not a long term fix.  I haven't done any
debugging of this myself, though I can if there's nobody more
qualified to do it and/or nobody else has time.  I'm just trying to
make sure that the problem is reported somewhere where others might
notice it rather than in an obscure Chrome OS tree.  ;-)

-Doug
Sai Prakash Ranjan Sept. 10, 2020, 1:43 p.m. UTC | #2
On 2020-09-09 20:23, Doug Anderson wrote:
> Hi,
> 
> On Fri, Aug 21, 2020 at 11:15 AM Catalin Marinas
> <catalin.marinas@arm.com> wrote:
>> 
>> On Fri, 31 Jul 2020 18:38:22 +0100, Marc Zyngier wrote:
>> > Erratum 1418040 currently prevents a late CPU from booting if none
>> > of the early CPUs are affected by it. This is because the handling
>> > is implemented as alternatives, and we have already got rid of them
>> > by the time userspace onlines a new CPU.
>> >
>> > A solution to this is to move everything into C code, and rely on
>> > static keys instead. Once this is done, the feature can be allowed
>> > for late CPUs.
>> >
>> > [...]
>> 
>> Applied to arm64 (for-next/fixes), thanks!
>> 
>> [1/2] arm64: Move handling of erratum 1418040 into C code
>>       https://git.kernel.org/arm64/c/d49f7d7376d0
>> [2/2] arm64: Allow booting of late CPUs affected by erratum 1418040
>>       https://git.kernel.org/arm64/c/bf87bb0881d0
> 
> NOTE: patch 2 seems to have come in through a stable merge onto Chrome
> OS 5.4 and is causing a regression when resuming from suspend.  In the
> short term we've got a revert going into our tree:
> 
> https://crrev.com/c/2399101
> 
> ...but that's obviously not a long term fix.  I haven't done any
> debugging of this myself, though I can if there's nobody more
> qualified to do it and/or nobody else has time.  I'm just trying to
> make sure that the problem is reported somewhere where others might
> notice it rather than in an obscure Chrome OS tree.  ;-)
> 

The rootcause is pretty straightforward however I'm afraid the
solution isn't so but I may be mistaken, so this happens on
big.LITTLE systems with CPUs differing in erratum 1418040
which was applicable only for big cores and not little cores.
So when trying to bringup little cores during resume, there
is a conflict as below (messages snipped from the internal bug
for more visibility).

Enabling non-boot CPUs ...
CPU features: CPU1: Detected conflict for capability 35 (ARM erratum 
1418040), System: 1, CPU: 0
CPU1: will not boot
CPU1: will not boot
CPU1: failed to come online
psci: CPU1 killed (polled 0 ms)
CPU1: died during early boot
Error taking CPU1 up: -5

Thanks,
Sai
Marc Zyngier Sept. 11, 2020, 1:30 p.m. UTC | #3
On 2020-09-10 14:43, Sai Prakash Ranjan wrote:
> On 2020-09-09 20:23, Doug Anderson wrote:
>> Hi,
>> 
>> On Fri, Aug 21, 2020 at 11:15 AM Catalin Marinas
>> <catalin.marinas@arm.com> wrote:
>>> 
>>> On Fri, 31 Jul 2020 18:38:22 +0100, Marc Zyngier wrote:
>>> > Erratum 1418040 currently prevents a late CPU from booting if none
>>> > of the early CPUs are affected by it. This is because the handling
>>> > is implemented as alternatives, and we have already got rid of them
>>> > by the time userspace onlines a new CPU.
>>> >
>>> > A solution to this is to move everything into C code, and rely on
>>> > static keys instead. Once this is done, the feature can be allowed
>>> > for late CPUs.
>>> >
>>> > [...]
>>> 
>>> Applied to arm64 (for-next/fixes), thanks!
>>> 
>>> [1/2] arm64: Move handling of erratum 1418040 into C code
>>>       https://git.kernel.org/arm64/c/d49f7d7376d0
>>> [2/2] arm64: Allow booting of late CPUs affected by erratum 1418040
>>>       https://git.kernel.org/arm64/c/bf87bb0881d0
>> 
>> NOTE: patch 2 seems to have come in through a stable merge onto Chrome
>> OS 5.4 and is causing a regression when resuming from suspend.  In the
>> short term we've got a revert going into our tree:
>> 
>> https://crrev.com/c/2399101
>> 
>> ...but that's obviously not a long term fix.  I haven't done any
>> debugging of this myself, though I can if there's nobody more
>> qualified to do it and/or nobody else has time.  I'm just trying to
>> make sure that the problem is reported somewhere where others might
>> notice it rather than in an obscure Chrome OS tree.  ;-)
>> 
> 
> The rootcause is pretty straightforward however I'm afraid the
> solution isn't so but I may be mistaken, so this happens on
> big.LITTLE systems with CPUs differing in erratum 1418040
> which was applicable only for big cores and not little cores.
> So when trying to bringup little cores during resume, there
> is a conflict as below (messages snipped from the internal bug
> for more visibility).
> 
> Enabling non-boot CPUs ...
> CPU features: CPU1: Detected conflict for capability 35 (ARM erratum
> 1418040), System: 1, CPU: 0
> CPU1: will not boot
> CPU1: will not boot
> CPU1: failed to come online
> psci: CPU1 killed (polled 0 ms)
> CPU1: died during early boot
> Error taking CPU1 up: -5

This is becoming very annoying... By allowing the buggy CPUs to come
in late, we have made it impossible for the good ones to work correctly.

Can you try this (untested yet, I'm dealing with another bucket of
errata at the moment):

diff --git a/arch/arm64/kernel/cpu_errata.c 
b/arch/arm64/kernel/cpu_errata.c
index 6c8303559beb..fcf7f763400c 100644
--- a/arch/arm64/kernel/cpu_errata.c
+++ b/arch/arm64/kernel/cpu_errata.c
@@ -477,6 +477,7 @@ const struct arm64_cpu_capabilities arm64_errata[] = 
{
  		.capability = ARM64_WORKAROUND_1418040,
  		ERRATA_MIDR_RANGE_LIST(erratum_1418040_list),
  		.type = (ARM64_CPUCAP_SCOPE_LOCAL_CPU |
+			 ARM64_CPUCAP_OPTIONAL_FOR_LATE_CPU |
  			 ARM64_CPUCAP_PERMITTED_FOR_LATE_CPU),
  	},
  #endif


Thanks,

         M.
Sai Prakash Ranjan Sept. 11, 2020, 4:34 p.m. UTC | #4
On 2020-09-11 19:00, Marc Zyngier wrote:
> On 2020-09-10 14:43, Sai Prakash Ranjan wrote:
>> On 2020-09-09 20:23, Doug Anderson wrote:
>>> Hi,
>>> 
>>> On Fri, Aug 21, 2020 at 11:15 AM Catalin Marinas
>>> <catalin.marinas@arm.com> wrote:
>>>> 
>>>> On Fri, 31 Jul 2020 18:38:22 +0100, Marc Zyngier wrote:
>>>> > Erratum 1418040 currently prevents a late CPU from booting if none
>>>> > of the early CPUs are affected by it. This is because the handling
>>>> > is implemented as alternatives, and we have already got rid of them
>>>> > by the time userspace onlines a new CPU.
>>>> >
>>>> > A solution to this is to move everything into C code, and rely on
>>>> > static keys instead. Once this is done, the feature can be allowed
>>>> > for late CPUs.
>>>> >
>>>> > [...]
>>>> 
>>>> Applied to arm64 (for-next/fixes), thanks!
>>>> 
>>>> [1/2] arm64: Move handling of erratum 1418040 into C code
>>>>       https://git.kernel.org/arm64/c/d49f7d7376d0
>>>> [2/2] arm64: Allow booting of late CPUs affected by erratum 1418040
>>>>       https://git.kernel.org/arm64/c/bf87bb0881d0
>>> 
>>> NOTE: patch 2 seems to have come in through a stable merge onto 
>>> Chrome
>>> OS 5.4 and is causing a regression when resuming from suspend.  In 
>>> the
>>> short term we've got a revert going into our tree:
>>> 
>>> https://crrev.com/c/2399101
>>> 
>>> ...but that's obviously not a long term fix.  I haven't done any
>>> debugging of this myself, though I can if there's nobody more
>>> qualified to do it and/or nobody else has time.  I'm just trying to
>>> make sure that the problem is reported somewhere where others might
>>> notice it rather than in an obscure Chrome OS tree.  ;-)
>>> 
>> 
>> The rootcause is pretty straightforward however I'm afraid the
>> solution isn't so but I may be mistaken, so this happens on
>> big.LITTLE systems with CPUs differing in erratum 1418040
>> which was applicable only for big cores and not little cores.
>> So when trying to bringup little cores during resume, there
>> is a conflict as below (messages snipped from the internal bug
>> for more visibility).
>> 
>> Enabling non-boot CPUs ...
>> CPU features: CPU1: Detected conflict for capability 35 (ARM erratum
>> 1418040), System: 1, CPU: 0
>> CPU1: will not boot
>> CPU1: will not boot
>> CPU1: failed to come online
>> psci: CPU1 killed (polled 0 ms)
>> CPU1: died during early boot
>> Error taking CPU1 up: -5
> 
> This is becoming very annoying... By allowing the buggy CPUs to come
> in late, we have made it impossible for the good ones to work 
> correctly.
> 
> Can you try this (untested yet, I'm dealing with another bucket of
> errata at the moment):
> 
> diff --git a/arch/arm64/kernel/cpu_errata.c 
> b/arch/arm64/kernel/cpu_errata.c
> index 6c8303559beb..fcf7f763400c 100644
> --- a/arch/arm64/kernel/cpu_errata.c
> +++ b/arch/arm64/kernel/cpu_errata.c
> @@ -477,6 +477,7 @@ const struct arm64_cpu_capabilities arm64_errata[] 
> = {
>  		.capability = ARM64_WORKAROUND_1418040,
>  		ERRATA_MIDR_RANGE_LIST(erratum_1418040_list),
>  		.type = (ARM64_CPUCAP_SCOPE_LOCAL_CPU |
> +			 ARM64_CPUCAP_OPTIONAL_FOR_LATE_CPU |
>  			 ARM64_CPUCAP_PERMITTED_FOR_LATE_CPU),
>  	},
>  #endif
> 

Yes, this works.

Thanks,
Sai
Will Deacon Sept. 11, 2020, 4:42 p.m. UTC | #5
On Fri, Sep 11, 2020 at 10:04:24PM +0530, Sai Prakash Ranjan wrote:
> On 2020-09-11 19:00, Marc Zyngier wrote:
> > On 2020-09-10 14:43, Sai Prakash Ranjan wrote:
> > > On 2020-09-09 20:23, Doug Anderson wrote:
> > > > Hi,
> > > > 
> > > > On Fri, Aug 21, 2020 at 11:15 AM Catalin Marinas
> > > > <catalin.marinas@arm.com> wrote:
> > > > > 
> > > > > On Fri, 31 Jul 2020 18:38:22 +0100, Marc Zyngier wrote:
> > > > > > Erratum 1418040 currently prevents a late CPU from booting if none
> > > > > > of the early CPUs are affected by it. This is because the handling
> > > > > > is implemented as alternatives, and we have already got rid of them
> > > > > > by the time userspace onlines a new CPU.
> > > > > >
> > > > > > A solution to this is to move everything into C code, and rely on
> > > > > > static keys instead. Once this is done, the feature can be allowed
> > > > > > for late CPUs.
> > > > > >
> > > > > > [...]
> > > > > 
> > > > > Applied to arm64 (for-next/fixes), thanks!
> > > > > 
> > > > > [1/2] arm64: Move handling of erratum 1418040 into C code
> > > > >       https://git.kernel.org/arm64/c/d49f7d7376d0
> > > > > [2/2] arm64: Allow booting of late CPUs affected by erratum 1418040
> > > > >       https://git.kernel.org/arm64/c/bf87bb0881d0
> > > > 
> > > > NOTE: patch 2 seems to have come in through a stable merge onto
> > > > Chrome
> > > > OS 5.4 and is causing a regression when resuming from suspend.
> > > > In the
> > > > short term we've got a revert going into our tree:
> > > > 
> > > > https://crrev.com/c/2399101
> > > > 
> > > > ...but that's obviously not a long term fix.  I haven't done any
> > > > debugging of this myself, though I can if there's nobody more
> > > > qualified to do it and/or nobody else has time.  I'm just trying to
> > > > make sure that the problem is reported somewhere where others might
> > > > notice it rather than in an obscure Chrome OS tree.  ;-)
> > > > 
> > > 
> > > The rootcause is pretty straightforward however I'm afraid the
> > > solution isn't so but I may be mistaken, so this happens on
> > > big.LITTLE systems with CPUs differing in erratum 1418040
> > > which was applicable only for big cores and not little cores.
> > > So when trying to bringup little cores during resume, there
> > > is a conflict as below (messages snipped from the internal bug
> > > for more visibility).
> > > 
> > > Enabling non-boot CPUs ...
> > > CPU features: CPU1: Detected conflict for capability 35 (ARM erratum
> > > 1418040), System: 1, CPU: 0
> > > CPU1: will not boot
> > > CPU1: will not boot
> > > CPU1: failed to come online
> > > psci: CPU1 killed (polled 0 ms)
> > > CPU1: died during early boot
> > > Error taking CPU1 up: -5
> > 
> > This is becoming very annoying... By allowing the buggy CPUs to come
> > in late, we have made it impossible for the good ones to work correctly.
> > 
> > Can you try this (untested yet, I'm dealing with another bucket of
> > errata at the moment):
> > 
> > diff --git a/arch/arm64/kernel/cpu_errata.c
> > b/arch/arm64/kernel/cpu_errata.c
> > index 6c8303559beb..fcf7f763400c 100644
> > --- a/arch/arm64/kernel/cpu_errata.c
> > +++ b/arch/arm64/kernel/cpu_errata.c
> > @@ -477,6 +477,7 @@ const struct arm64_cpu_capabilities arm64_errata[] =
> > {
> >  		.capability = ARM64_WORKAROUND_1418040,
> >  		ERRATA_MIDR_RANGE_LIST(erratum_1418040_list),
> >  		.type = (ARM64_CPUCAP_SCOPE_LOCAL_CPU |
> > +			 ARM64_CPUCAP_OPTIONAL_FOR_LATE_CPU |
> >  			 ARM64_CPUCAP_PERMITTED_FOR_LATE_CPU),
> >  	},
> >  #endif
> > 
> 
> Yes, this works.

Maybe we should spell it "ARM64_CPUCAP_WEAK_LOCAL_CPU_FEATURE" and add
a comment about the "feature"?

Will
Marc Zyngier Sept. 11, 2020, 5:47 p.m. UTC | #6
On 2020-09-11 17:42, Will Deacon wrote:
> On Fri, Sep 11, 2020 at 10:04:24PM +0530, Sai Prakash Ranjan wrote:
>> On 2020-09-11 19:00, Marc Zyngier wrote:
>> > On 2020-09-10 14:43, Sai Prakash Ranjan wrote:
>> > > On 2020-09-09 20:23, Doug Anderson wrote:
>> > > > Hi,
>> > > >
>> > > > On Fri, Aug 21, 2020 at 11:15 AM Catalin Marinas
>> > > > <catalin.marinas@arm.com> wrote:
>> > > > >
>> > > > > On Fri, 31 Jul 2020 18:38:22 +0100, Marc Zyngier wrote:
>> > > > > > Erratum 1418040 currently prevents a late CPU from booting if none
>> > > > > > of the early CPUs are affected by it. This is because the handling
>> > > > > > is implemented as alternatives, and we have already got rid of them
>> > > > > > by the time userspace onlines a new CPU.
>> > > > > >
>> > > > > > A solution to this is to move everything into C code, and rely on
>> > > > > > static keys instead. Once this is done, the feature can be allowed
>> > > > > > for late CPUs.
>> > > > > >
>> > > > > > [...]
>> > > > >
>> > > > > Applied to arm64 (for-next/fixes), thanks!
>> > > > >
>> > > > > [1/2] arm64: Move handling of erratum 1418040 into C code
>> > > > >       https://git.kernel.org/arm64/c/d49f7d7376d0
>> > > > > [2/2] arm64: Allow booting of late CPUs affected by erratum 1418040
>> > > > >       https://git.kernel.org/arm64/c/bf87bb0881d0
>> > > >
>> > > > NOTE: patch 2 seems to have come in through a stable merge onto
>> > > > Chrome
>> > > > OS 5.4 and is causing a regression when resuming from suspend.
>> > > > In the
>> > > > short term we've got a revert going into our tree:
>> > > >
>> > > > https://crrev.com/c/2399101
>> > > >
>> > > > ...but that's obviously not a long term fix.  I haven't done any
>> > > > debugging of this myself, though I can if there's nobody more
>> > > > qualified to do it and/or nobody else has time.  I'm just trying to
>> > > > make sure that the problem is reported somewhere where others might
>> > > > notice it rather than in an obscure Chrome OS tree.  ;-)
>> > > >
>> > >
>> > > The rootcause is pretty straightforward however I'm afraid the
>> > > solution isn't so but I may be mistaken, so this happens on
>> > > big.LITTLE systems with CPUs differing in erratum 1418040
>> > > which was applicable only for big cores and not little cores.
>> > > So when trying to bringup little cores during resume, there
>> > > is a conflict as below (messages snipped from the internal bug
>> > > for more visibility).
>> > >
>> > > Enabling non-boot CPUs ...
>> > > CPU features: CPU1: Detected conflict for capability 35 (ARM erratum
>> > > 1418040), System: 1, CPU: 0
>> > > CPU1: will not boot
>> > > CPU1: will not boot
>> > > CPU1: failed to come online
>> > > psci: CPU1 killed (polled 0 ms)
>> > > CPU1: died during early boot
>> > > Error taking CPU1 up: -5
>> >
>> > This is becoming very annoying... By allowing the buggy CPUs to come
>> > in late, we have made it impossible for the good ones to work correctly.
>> >
>> > Can you try this (untested yet, I'm dealing with another bucket of
>> > errata at the moment):
>> >
>> > diff --git a/arch/arm64/kernel/cpu_errata.c
>> > b/arch/arm64/kernel/cpu_errata.c
>> > index 6c8303559beb..fcf7f763400c 100644
>> > --- a/arch/arm64/kernel/cpu_errata.c
>> > +++ b/arch/arm64/kernel/cpu_errata.c
>> > @@ -477,6 +477,7 @@ const struct arm64_cpu_capabilities arm64_errata[] =
>> > {
>> >  		.capability = ARM64_WORKAROUND_1418040,
>> >  		ERRATA_MIDR_RANGE_LIST(erratum_1418040_list),
>> >  		.type = (ARM64_CPUCAP_SCOPE_LOCAL_CPU |
>> > +			 ARM64_CPUCAP_OPTIONAL_FOR_LATE_CPU |
>> >  			 ARM64_CPUCAP_PERMITTED_FOR_LATE_CPU),
>> >  	},
>> >  #endif
>> >
>> 
>> Yes, this works.
> 
> Maybe we should spell it "ARM64_CPUCAP_WEAK_LOCAL_CPU_FEATURE" and add
> a comment about the "feature"?

Yeah, that's exactly what it amounts to. Patch incoming.

         M.