Message ID | 20200731173824.107480-1-maz@kernel.org (mailing list archive) |
---|---|
Headers | show |
Series | arm64: Allow erratum 1418040 for late CPUs | expand |
Hi, On Fri, Aug 21, 2020 at 11:15 AM Catalin Marinas <catalin.marinas@arm.com> wrote: > > On Fri, 31 Jul 2020 18:38:22 +0100, Marc Zyngier wrote: > > Erratum 1418040 currently prevents a late CPU from booting if none > > of the early CPUs are affected by it. This is because the handling > > is implemented as alternatives, and we have already got rid of them > > by the time userspace onlines a new CPU. > > > > A solution to this is to move everything into C code, and rely on > > static keys instead. Once this is done, the feature can be allowed > > for late CPUs. > > > > [...] > > Applied to arm64 (for-next/fixes), thanks! > > [1/2] arm64: Move handling of erratum 1418040 into C code > https://git.kernel.org/arm64/c/d49f7d7376d0 > [2/2] arm64: Allow booting of late CPUs affected by erratum 1418040 > https://git.kernel.org/arm64/c/bf87bb0881d0 NOTE: patch 2 seems to have come in through a stable merge onto Chrome OS 5.4 and is causing a regression when resuming from suspend. In the short term we've got a revert going into our tree: https://crrev.com/c/2399101 ...but that's obviously not a long term fix. I haven't done any debugging of this myself, though I can if there's nobody more qualified to do it and/or nobody else has time. I'm just trying to make sure that the problem is reported somewhere where others might notice it rather than in an obscure Chrome OS tree. ;-) -Doug
On 2020-09-09 20:23, Doug Anderson wrote: > Hi, > > On Fri, Aug 21, 2020 at 11:15 AM Catalin Marinas > <catalin.marinas@arm.com> wrote: >> >> On Fri, 31 Jul 2020 18:38:22 +0100, Marc Zyngier wrote: >> > Erratum 1418040 currently prevents a late CPU from booting if none >> > of the early CPUs are affected by it. This is because the handling >> > is implemented as alternatives, and we have already got rid of them >> > by the time userspace onlines a new CPU. >> > >> > A solution to this is to move everything into C code, and rely on >> > static keys instead. Once this is done, the feature can be allowed >> > for late CPUs. >> > >> > [...] >> >> Applied to arm64 (for-next/fixes), thanks! >> >> [1/2] arm64: Move handling of erratum 1418040 into C code >> https://git.kernel.org/arm64/c/d49f7d7376d0 >> [2/2] arm64: Allow booting of late CPUs affected by erratum 1418040 >> https://git.kernel.org/arm64/c/bf87bb0881d0 > > NOTE: patch 2 seems to have come in through a stable merge onto Chrome > OS 5.4 and is causing a regression when resuming from suspend. In the > short term we've got a revert going into our tree: > > https://crrev.com/c/2399101 > > ...but that's obviously not a long term fix. I haven't done any > debugging of this myself, though I can if there's nobody more > qualified to do it and/or nobody else has time. I'm just trying to > make sure that the problem is reported somewhere where others might > notice it rather than in an obscure Chrome OS tree. ;-) > The rootcause is pretty straightforward however I'm afraid the solution isn't so but I may be mistaken, so this happens on big.LITTLE systems with CPUs differing in erratum 1418040 which was applicable only for big cores and not little cores. So when trying to bringup little cores during resume, there is a conflict as below (messages snipped from the internal bug for more visibility). Enabling non-boot CPUs ... CPU features: CPU1: Detected conflict for capability 35 (ARM erratum 1418040), System: 1, CPU: 0 CPU1: will not boot CPU1: will not boot CPU1: failed to come online psci: CPU1 killed (polled 0 ms) CPU1: died during early boot Error taking CPU1 up: -5 Thanks, Sai
On 2020-09-10 14:43, Sai Prakash Ranjan wrote: > On 2020-09-09 20:23, Doug Anderson wrote: >> Hi, >> >> On Fri, Aug 21, 2020 at 11:15 AM Catalin Marinas >> <catalin.marinas@arm.com> wrote: >>> >>> On Fri, 31 Jul 2020 18:38:22 +0100, Marc Zyngier wrote: >>> > Erratum 1418040 currently prevents a late CPU from booting if none >>> > of the early CPUs are affected by it. This is because the handling >>> > is implemented as alternatives, and we have already got rid of them >>> > by the time userspace onlines a new CPU. >>> > >>> > A solution to this is to move everything into C code, and rely on >>> > static keys instead. Once this is done, the feature can be allowed >>> > for late CPUs. >>> > >>> > [...] >>> >>> Applied to arm64 (for-next/fixes), thanks! >>> >>> [1/2] arm64: Move handling of erratum 1418040 into C code >>> https://git.kernel.org/arm64/c/d49f7d7376d0 >>> [2/2] arm64: Allow booting of late CPUs affected by erratum 1418040 >>> https://git.kernel.org/arm64/c/bf87bb0881d0 >> >> NOTE: patch 2 seems to have come in through a stable merge onto Chrome >> OS 5.4 and is causing a regression when resuming from suspend. In the >> short term we've got a revert going into our tree: >> >> https://crrev.com/c/2399101 >> >> ...but that's obviously not a long term fix. I haven't done any >> debugging of this myself, though I can if there's nobody more >> qualified to do it and/or nobody else has time. I'm just trying to >> make sure that the problem is reported somewhere where others might >> notice it rather than in an obscure Chrome OS tree. ;-) >> > > The rootcause is pretty straightforward however I'm afraid the > solution isn't so but I may be mistaken, so this happens on > big.LITTLE systems with CPUs differing in erratum 1418040 > which was applicable only for big cores and not little cores. > So when trying to bringup little cores during resume, there > is a conflict as below (messages snipped from the internal bug > for more visibility). > > Enabling non-boot CPUs ... > CPU features: CPU1: Detected conflict for capability 35 (ARM erratum > 1418040), System: 1, CPU: 0 > CPU1: will not boot > CPU1: will not boot > CPU1: failed to come online > psci: CPU1 killed (polled 0 ms) > CPU1: died during early boot > Error taking CPU1 up: -5 This is becoming very annoying... By allowing the buggy CPUs to come in late, we have made it impossible for the good ones to work correctly. Can you try this (untested yet, I'm dealing with another bucket of errata at the moment): diff --git a/arch/arm64/kernel/cpu_errata.c b/arch/arm64/kernel/cpu_errata.c index 6c8303559beb..fcf7f763400c 100644 --- a/arch/arm64/kernel/cpu_errata.c +++ b/arch/arm64/kernel/cpu_errata.c @@ -477,6 +477,7 @@ const struct arm64_cpu_capabilities arm64_errata[] = { .capability = ARM64_WORKAROUND_1418040, ERRATA_MIDR_RANGE_LIST(erratum_1418040_list), .type = (ARM64_CPUCAP_SCOPE_LOCAL_CPU | + ARM64_CPUCAP_OPTIONAL_FOR_LATE_CPU | ARM64_CPUCAP_PERMITTED_FOR_LATE_CPU), }, #endif Thanks, M.
On 2020-09-11 19:00, Marc Zyngier wrote: > On 2020-09-10 14:43, Sai Prakash Ranjan wrote: >> On 2020-09-09 20:23, Doug Anderson wrote: >>> Hi, >>> >>> On Fri, Aug 21, 2020 at 11:15 AM Catalin Marinas >>> <catalin.marinas@arm.com> wrote: >>>> >>>> On Fri, 31 Jul 2020 18:38:22 +0100, Marc Zyngier wrote: >>>> > Erratum 1418040 currently prevents a late CPU from booting if none >>>> > of the early CPUs are affected by it. This is because the handling >>>> > is implemented as alternatives, and we have already got rid of them >>>> > by the time userspace onlines a new CPU. >>>> > >>>> > A solution to this is to move everything into C code, and rely on >>>> > static keys instead. Once this is done, the feature can be allowed >>>> > for late CPUs. >>>> > >>>> > [...] >>>> >>>> Applied to arm64 (for-next/fixes), thanks! >>>> >>>> [1/2] arm64: Move handling of erratum 1418040 into C code >>>> https://git.kernel.org/arm64/c/d49f7d7376d0 >>>> [2/2] arm64: Allow booting of late CPUs affected by erratum 1418040 >>>> https://git.kernel.org/arm64/c/bf87bb0881d0 >>> >>> NOTE: patch 2 seems to have come in through a stable merge onto >>> Chrome >>> OS 5.4 and is causing a regression when resuming from suspend. In >>> the >>> short term we've got a revert going into our tree: >>> >>> https://crrev.com/c/2399101 >>> >>> ...but that's obviously not a long term fix. I haven't done any >>> debugging of this myself, though I can if there's nobody more >>> qualified to do it and/or nobody else has time. I'm just trying to >>> make sure that the problem is reported somewhere where others might >>> notice it rather than in an obscure Chrome OS tree. ;-) >>> >> >> The rootcause is pretty straightforward however I'm afraid the >> solution isn't so but I may be mistaken, so this happens on >> big.LITTLE systems with CPUs differing in erratum 1418040 >> which was applicable only for big cores and not little cores. >> So when trying to bringup little cores during resume, there >> is a conflict as below (messages snipped from the internal bug >> for more visibility). >> >> Enabling non-boot CPUs ... >> CPU features: CPU1: Detected conflict for capability 35 (ARM erratum >> 1418040), System: 1, CPU: 0 >> CPU1: will not boot >> CPU1: will not boot >> CPU1: failed to come online >> psci: CPU1 killed (polled 0 ms) >> CPU1: died during early boot >> Error taking CPU1 up: -5 > > This is becoming very annoying... By allowing the buggy CPUs to come > in late, we have made it impossible for the good ones to work > correctly. > > Can you try this (untested yet, I'm dealing with another bucket of > errata at the moment): > > diff --git a/arch/arm64/kernel/cpu_errata.c > b/arch/arm64/kernel/cpu_errata.c > index 6c8303559beb..fcf7f763400c 100644 > --- a/arch/arm64/kernel/cpu_errata.c > +++ b/arch/arm64/kernel/cpu_errata.c > @@ -477,6 +477,7 @@ const struct arm64_cpu_capabilities arm64_errata[] > = { > .capability = ARM64_WORKAROUND_1418040, > ERRATA_MIDR_RANGE_LIST(erratum_1418040_list), > .type = (ARM64_CPUCAP_SCOPE_LOCAL_CPU | > + ARM64_CPUCAP_OPTIONAL_FOR_LATE_CPU | > ARM64_CPUCAP_PERMITTED_FOR_LATE_CPU), > }, > #endif > Yes, this works. Thanks, Sai
On Fri, Sep 11, 2020 at 10:04:24PM +0530, Sai Prakash Ranjan wrote: > On 2020-09-11 19:00, Marc Zyngier wrote: > > On 2020-09-10 14:43, Sai Prakash Ranjan wrote: > > > On 2020-09-09 20:23, Doug Anderson wrote: > > > > Hi, > > > > > > > > On Fri, Aug 21, 2020 at 11:15 AM Catalin Marinas > > > > <catalin.marinas@arm.com> wrote: > > > > > > > > > > On Fri, 31 Jul 2020 18:38:22 +0100, Marc Zyngier wrote: > > > > > > Erratum 1418040 currently prevents a late CPU from booting if none > > > > > > of the early CPUs are affected by it. This is because the handling > > > > > > is implemented as alternatives, and we have already got rid of them > > > > > > by the time userspace onlines a new CPU. > > > > > > > > > > > > A solution to this is to move everything into C code, and rely on > > > > > > static keys instead. Once this is done, the feature can be allowed > > > > > > for late CPUs. > > > > > > > > > > > > [...] > > > > > > > > > > Applied to arm64 (for-next/fixes), thanks! > > > > > > > > > > [1/2] arm64: Move handling of erratum 1418040 into C code > > > > > https://git.kernel.org/arm64/c/d49f7d7376d0 > > > > > [2/2] arm64: Allow booting of late CPUs affected by erratum 1418040 > > > > > https://git.kernel.org/arm64/c/bf87bb0881d0 > > > > > > > > NOTE: patch 2 seems to have come in through a stable merge onto > > > > Chrome > > > > OS 5.4 and is causing a regression when resuming from suspend. > > > > In the > > > > short term we've got a revert going into our tree: > > > > > > > > https://crrev.com/c/2399101 > > > > > > > > ...but that's obviously not a long term fix. I haven't done any > > > > debugging of this myself, though I can if there's nobody more > > > > qualified to do it and/or nobody else has time. I'm just trying to > > > > make sure that the problem is reported somewhere where others might > > > > notice it rather than in an obscure Chrome OS tree. ;-) > > > > > > > > > > The rootcause is pretty straightforward however I'm afraid the > > > solution isn't so but I may be mistaken, so this happens on > > > big.LITTLE systems with CPUs differing in erratum 1418040 > > > which was applicable only for big cores and not little cores. > > > So when trying to bringup little cores during resume, there > > > is a conflict as below (messages snipped from the internal bug > > > for more visibility). > > > > > > Enabling non-boot CPUs ... > > > CPU features: CPU1: Detected conflict for capability 35 (ARM erratum > > > 1418040), System: 1, CPU: 0 > > > CPU1: will not boot > > > CPU1: will not boot > > > CPU1: failed to come online > > > psci: CPU1 killed (polled 0 ms) > > > CPU1: died during early boot > > > Error taking CPU1 up: -5 > > > > This is becoming very annoying... By allowing the buggy CPUs to come > > in late, we have made it impossible for the good ones to work correctly. > > > > Can you try this (untested yet, I'm dealing with another bucket of > > errata at the moment): > > > > diff --git a/arch/arm64/kernel/cpu_errata.c > > b/arch/arm64/kernel/cpu_errata.c > > index 6c8303559beb..fcf7f763400c 100644 > > --- a/arch/arm64/kernel/cpu_errata.c > > +++ b/arch/arm64/kernel/cpu_errata.c > > @@ -477,6 +477,7 @@ const struct arm64_cpu_capabilities arm64_errata[] = > > { > > .capability = ARM64_WORKAROUND_1418040, > > ERRATA_MIDR_RANGE_LIST(erratum_1418040_list), > > .type = (ARM64_CPUCAP_SCOPE_LOCAL_CPU | > > + ARM64_CPUCAP_OPTIONAL_FOR_LATE_CPU | > > ARM64_CPUCAP_PERMITTED_FOR_LATE_CPU), > > }, > > #endif > > > > Yes, this works. Maybe we should spell it "ARM64_CPUCAP_WEAK_LOCAL_CPU_FEATURE" and add a comment about the "feature"? Will
On 2020-09-11 17:42, Will Deacon wrote: > On Fri, Sep 11, 2020 at 10:04:24PM +0530, Sai Prakash Ranjan wrote: >> On 2020-09-11 19:00, Marc Zyngier wrote: >> > On 2020-09-10 14:43, Sai Prakash Ranjan wrote: >> > > On 2020-09-09 20:23, Doug Anderson wrote: >> > > > Hi, >> > > > >> > > > On Fri, Aug 21, 2020 at 11:15 AM Catalin Marinas >> > > > <catalin.marinas@arm.com> wrote: >> > > > > >> > > > > On Fri, 31 Jul 2020 18:38:22 +0100, Marc Zyngier wrote: >> > > > > > Erratum 1418040 currently prevents a late CPU from booting if none >> > > > > > of the early CPUs are affected by it. This is because the handling >> > > > > > is implemented as alternatives, and we have already got rid of them >> > > > > > by the time userspace onlines a new CPU. >> > > > > > >> > > > > > A solution to this is to move everything into C code, and rely on >> > > > > > static keys instead. Once this is done, the feature can be allowed >> > > > > > for late CPUs. >> > > > > > >> > > > > > [...] >> > > > > >> > > > > Applied to arm64 (for-next/fixes), thanks! >> > > > > >> > > > > [1/2] arm64: Move handling of erratum 1418040 into C code >> > > > > https://git.kernel.org/arm64/c/d49f7d7376d0 >> > > > > [2/2] arm64: Allow booting of late CPUs affected by erratum 1418040 >> > > > > https://git.kernel.org/arm64/c/bf87bb0881d0 >> > > > >> > > > NOTE: patch 2 seems to have come in through a stable merge onto >> > > > Chrome >> > > > OS 5.4 and is causing a regression when resuming from suspend. >> > > > In the >> > > > short term we've got a revert going into our tree: >> > > > >> > > > https://crrev.com/c/2399101 >> > > > >> > > > ...but that's obviously not a long term fix. I haven't done any >> > > > debugging of this myself, though I can if there's nobody more >> > > > qualified to do it and/or nobody else has time. I'm just trying to >> > > > make sure that the problem is reported somewhere where others might >> > > > notice it rather than in an obscure Chrome OS tree. ;-) >> > > > >> > > >> > > The rootcause is pretty straightforward however I'm afraid the >> > > solution isn't so but I may be mistaken, so this happens on >> > > big.LITTLE systems with CPUs differing in erratum 1418040 >> > > which was applicable only for big cores and not little cores. >> > > So when trying to bringup little cores during resume, there >> > > is a conflict as below (messages snipped from the internal bug >> > > for more visibility). >> > > >> > > Enabling non-boot CPUs ... >> > > CPU features: CPU1: Detected conflict for capability 35 (ARM erratum >> > > 1418040), System: 1, CPU: 0 >> > > CPU1: will not boot >> > > CPU1: will not boot >> > > CPU1: failed to come online >> > > psci: CPU1 killed (polled 0 ms) >> > > CPU1: died during early boot >> > > Error taking CPU1 up: -5 >> > >> > This is becoming very annoying... By allowing the buggy CPUs to come >> > in late, we have made it impossible for the good ones to work correctly. >> > >> > Can you try this (untested yet, I'm dealing with another bucket of >> > errata at the moment): >> > >> > diff --git a/arch/arm64/kernel/cpu_errata.c >> > b/arch/arm64/kernel/cpu_errata.c >> > index 6c8303559beb..fcf7f763400c 100644 >> > --- a/arch/arm64/kernel/cpu_errata.c >> > +++ b/arch/arm64/kernel/cpu_errata.c >> > @@ -477,6 +477,7 @@ const struct arm64_cpu_capabilities arm64_errata[] = >> > { >> > .capability = ARM64_WORKAROUND_1418040, >> > ERRATA_MIDR_RANGE_LIST(erratum_1418040_list), >> > .type = (ARM64_CPUCAP_SCOPE_LOCAL_CPU | >> > + ARM64_CPUCAP_OPTIONAL_FOR_LATE_CPU | >> > ARM64_CPUCAP_PERMITTED_FOR_LATE_CPU), >> > }, >> > #endif >> > >> >> Yes, this works. > > Maybe we should spell it "ARM64_CPUCAP_WEAK_LOCAL_CPU_FEATURE" and add > a comment about the "feature"? Yeah, that's exactly what it amounts to. Patch incoming. M.