Message ID | 20220817121551.21790-1-ionela.voinescu@arm.com (mailing list archive)
---|---
State | New, archived
Series | [RESEND,v2] arm64: errata: add detection for AMEVCNTR01 incrementing incorrectly
Hi Ionela,

On Wed, Aug 17, 2022 at 01:15:51PM +0100, Ionela Voinescu wrote:
> diff --git a/arch/arm64/kernel/cpu_errata.c b/arch/arm64/kernel/cpu_errata.c
> index 7e6289e709fc..810dd3c39882 100644
> --- a/arch/arm64/kernel/cpu_errata.c
> +++ b/arch/arm64/kernel/cpu_errata.c
> @@ -654,6 +654,16 @@ const struct arm64_cpu_capabilities arm64_errata[] = {
>  		ERRATA_MIDR_REV_RANGE(MIDR_CORTEX_A510, 0, 0, 2)
>  	},
>  #endif
> +#ifdef CONFIG_ARM64_ERRATUM_2457168
> +	{
> +		.desc = "ARM erratum 2457168",
> +		.capability = ARM64_WORKAROUND_2457168,
> +		.type = ARM64_CPUCAP_WEAK_LOCAL_CPU_FEATURE,
> +
> +		/* Cortex-A510 r0p0-r1p1 */
> +		CAP_MIDR_RANGE(MIDR_CORTEX_A510, 0, 0, 1, 1)
> +	},
> +#endif
>  #ifdef CONFIG_ARM64_ERRATUM_2038923
>  	{
>  		.desc = "ARM erratum 2038923",
> diff --git a/arch/arm64/kernel/cpufeature.c b/arch/arm64/kernel/cpufeature.c
> index 907401e4fffb..af4de817d712 100644
> --- a/arch/arm64/kernel/cpufeature.c
> +++ b/arch/arm64/kernel/cpufeature.c
> @@ -1870,7 +1870,10 @@ static void cpu_amu_enable(struct arm64_cpu_capabilities const *cap)
>  		pr_info("detected CPU%d: Activity Monitors Unit (AMU)\n",
>  			smp_processor_id());
>  		cpumask_set_cpu(smp_processor_id(), &amu_cpus);
> -		update_freq_counters_refs();
> +
> +		/* 0 reference values signal broken/disabled counters */
> +		if (!this_cpu_has_cap(ARM64_WORKAROUND_2457168))
> +			update_freq_counters_refs();
>  	}
>  }

From a CPU errata workaround perspective, this part looks fine to me.

> diff --git a/arch/arm64/kernel/topology.c b/arch/arm64/kernel/topology.c
> index 869ffc4d4484..5d7efb15f7cf 100644
> --- a/arch/arm64/kernel/topology.c
> +++ b/arch/arm64/kernel/topology.c
> @@ -301,7 +301,8 @@ static void cpu_read_corecnt(void *val)
>
>  static void cpu_read_constcnt(void *val)
>  {
> -	*(u64 *)val = read_constcnt();
> +	*(u64 *)val = this_cpu_has_cap(ARM64_WORKAROUND_2457168) ?
> +		      0UL : read_constcnt();
>  }
>
>  static inline
> @@ -328,7 +329,12 @@ int counters_read_on_cpu(int cpu, smp_call_func_t func, u64 *val)
>   */
>  bool cpc_ffh_supported(void)
>  {
> -	return freq_counters_valid(get_cpu_with_amu_feat());
> +	int cpu = get_cpu_with_amu_feat();
> +
> +	if ((cpu >= nr_cpu_ids) || !cpumask_test_cpu(cpu, cpu_present_mask))
> +		return false;
> +
> +	return true;
>  }

So here we tell the core code that FFH is supported but always return 0
via cpc_read_ffh() if the const counter is requested. I assume the core
code figures this out and doesn't use the value on the affected CPUs. I
was hoping cpc_ffh_supported() would be per-CPU and the core code would
simply skip calling cpc_read() on the broken cores.

Is the other register read by cpc_read_ffh() still useful without the
const one?

While the Kconfig entry describes the behaviour, I'd rather have a
comment in cpc_ffh_supported() and maybe cpu_read_constcnt() on why we
do these tricks.

Thanks.
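For context, the FFH read path under discussion looks roughly like the
sketch below. This is a simplified paraphrase of the dispatch in
arch/arm64/kernel/topology.c, not verbatim kernel code, and it assumes
the 0x0 (core counter) / 0x1 (constant counter) FFH address convention
while omitting the masking of the returned value. The point is that a
read on an affected core still "succeeds": the caller gets a value of 0
for the constant counter, not an error code.

/*
 * Simplified sketch (not verbatim): the ACPI CPPC core calls
 * cpc_read_ffh(), which runs the per-CPU AMU read helper on the target
 * CPU. With this patch, cpu_read_constcnt() stores 0 on cores affected
 * by erratum 2457168, so a 0 value (rather than an error) is what the
 * CPPC core sees.
 */
int cpc_read_ffh(int cpu, struct cpc_reg *reg, u64 *val)
{
	/* FFH address 0x0: core counter, 0x1: constant counter */
	smp_call_func_t read_fn = reg->address ? cpu_read_constcnt :
						 cpu_read_corecnt;

	return counters_read_on_cpu(cpu, read_fn, val);
}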
Hi Catalin,

On Wednesday 17 Aug 2022 at 17:59:01 (+0100), Catalin Marinas wrote:
> Hi Ionela,
>
> On Wed, Aug 17, 2022 at 01:15:51PM +0100, Ionela Voinescu wrote:
> > diff --git a/arch/arm64/kernel/cpu_errata.c b/arch/arm64/kernel/cpu_errata.c
> > index 7e6289e709fc..810dd3c39882 100644
> > --- a/arch/arm64/kernel/cpu_errata.c
> > +++ b/arch/arm64/kernel/cpu_errata.c
> > @@ -654,6 +654,16 @@ const struct arm64_cpu_capabilities arm64_errata[] = {
> >  		ERRATA_MIDR_REV_RANGE(MIDR_CORTEX_A510, 0, 0, 2)
> >  	},
> >  #endif
> > +#ifdef CONFIG_ARM64_ERRATUM_2457168
> > +	{
> > +		.desc = "ARM erratum 2457168",
> > +		.capability = ARM64_WORKAROUND_2457168,
> > +		.type = ARM64_CPUCAP_WEAK_LOCAL_CPU_FEATURE,
> > +
> > +		/* Cortex-A510 r0p0-r1p1 */
> > +		CAP_MIDR_RANGE(MIDR_CORTEX_A510, 0, 0, 1, 1)
> > +	},
> > +#endif
> >  #ifdef CONFIG_ARM64_ERRATUM_2038923
> >  	{
> >  		.desc = "ARM erratum 2038923",
> > diff --git a/arch/arm64/kernel/cpufeature.c b/arch/arm64/kernel/cpufeature.c
> > index 907401e4fffb..af4de817d712 100644
> > --- a/arch/arm64/kernel/cpufeature.c
> > +++ b/arch/arm64/kernel/cpufeature.c
> > @@ -1870,7 +1870,10 @@ static void cpu_amu_enable(struct arm64_cpu_capabilities const *cap)
> >  		pr_info("detected CPU%d: Activity Monitors Unit (AMU)\n",
> >  			smp_processor_id());
> >  		cpumask_set_cpu(smp_processor_id(), &amu_cpus);
> > -		update_freq_counters_refs();
> > +
> > +		/* 0 reference values signal broken/disabled counters */
> > +		if (!this_cpu_has_cap(ARM64_WORKAROUND_2457168))
> > +			update_freq_counters_refs();
> >  	}
> >  }
>
> From a CPU errata workaround perspective, this part looks fine to me.
>
> > diff --git a/arch/arm64/kernel/topology.c b/arch/arm64/kernel/topology.c
> > index 869ffc4d4484..5d7efb15f7cf 100644
> > --- a/arch/arm64/kernel/topology.c
> > +++ b/arch/arm64/kernel/topology.c
> > @@ -301,7 +301,8 @@ static void cpu_read_corecnt(void *val)
> >
> >  static void cpu_read_constcnt(void *val)
> >  {
> > -	*(u64 *)val = read_constcnt();
> > +	*(u64 *)val = this_cpu_has_cap(ARM64_WORKAROUND_2457168) ?
> > +		      0UL : read_constcnt();
> >  }
> >
> >  static inline
> > @@ -328,7 +329,12 @@ int counters_read_on_cpu(int cpu, smp_call_func_t func, u64 *val)
> >   */
> >  bool cpc_ffh_supported(void)
> >  {
> > -	return freq_counters_valid(get_cpu_with_amu_feat());
> > +	int cpu = get_cpu_with_amu_feat();
> > +
> > +	if ((cpu >= nr_cpu_ids) || !cpumask_test_cpu(cpu, cpu_present_mask))
> > +		return false;
> > +
> > +	return true;
> > }
>
> So here we tell the core code that FFH is supported but always return 0
> via cpc_read_ffh() if the const counter is requested. I assume the core
> code figures this out and doesn't use the value on the affected CPUs. I
> was hoping cpc_ffh_supported() would be per-CPU and the core code would
> simply skip calling cpc_read() on the broken cores.

I used to think the same, but I've realised that the current approach is
best, in my opinion.

There are two users of these counters exposed through FFH in the kernel:
CPPC-based frequency invariance (FIE) and reading the current frequency
through sysfs. If AMU counters are disabled or the CPU is affected by this
erratum, a single read of 0 for any of the counters will result in
cppc_get_perf_ctrs() returning -EFAULT, which:

 - (cppc_cpufreq_cpu_fie_init()) Will disable the use of FIE for that
   policy, and those counters will never be read again for that CPU for
   the purpose of FIE. This is the operation that would result in reading
   those counters most often, which in this case won't happen.

 - Will return -EFAULT from cppc_cpufreq_get_rate(), signaling to the user
   that it cannot return a proper frequency using those counters. That's
   cast to unsigned int, so the user would have to be knowledgeable on the
   matter :), but that's an existing problem.

Therefore, error checking based on a counter read of 0 would be
equivalent here to checking a potential ffh_supported(cpu). Also, in the
future we might use FFH for more than reading these counters, so it's
better to keep ffh_supported() reflecting only whether FFH is generically
supported, even if in some cases the "backend" (AMUs here) is disabled
or broken.

Also, given that a platform will most likely use the same method for all
CPUs when reading counters, forgetting or not considering errata, and
given the current use of ffh_supported() as gatekeeper of a CPU probe
based on the validity of all CPC methods, even if cpc_ffh_supported()
were per-CPU it would still be better to probe the CPU and let the users
of the counters deal with breakage, especially since these use cases are
not critical.

> Is the other register read by cpc_read_ffh() still useful without the
> const one?

Not for the current uses, and unlikely to be in the future - I don't see
how the core counter value can be useful without a constant reference.

> While the Kconfig entry describes the behaviour, I'd rather have a
> comment in cpc_ffh_supported() and maybe cpu_read_constcnt() on why we
> do these tricks.
>

Will do!

Thanks,
Ionela.

> Thanks.
>
> --
> Catalin
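To make the propagation described above concrete, here is a small
self-contained model of it in plain userspace C. The helper names
(read_const_counter(), get_perf_ctrs(), fie_enabled) are illustrative
only and do not match the real cppc_cpufreq driver; the sketch just
shows how a single counter read of 0 becomes -EFAULT and switches off
both users of the counters:

#include <stdio.h>
#include <stdbool.h>
#include <stdint.h>
#include <errno.h>

/* On an affected or disabled core the constant counter reads as 0. */
static uint64_t read_const_counter(bool cpu_affected)
{
	return cpu_affected ? 0 : 1000000;
}

/* Mirrors the idea that cppc_get_perf_ctrs() fails if any counter is 0. */
static int get_perf_ctrs(bool cpu_affected, uint64_t *ref, uint64_t *delivered)
{
	*ref = read_const_counter(cpu_affected);
	*delivered = 5000000;		/* core counter, assumed fine */

	if (!*ref || !*delivered)
		return -EFAULT;
	return 0;
}

int main(void)
{
	uint64_t ref, delivered;
	bool fie_enabled = true;

	/* FIE init: one failed read disables FIE for this policy... */
	if (get_perf_ctrs(true, &ref, &delivered))
		fie_enabled = false;

	/* ...and sysfs "current frequency" reads simply report the error. */
	printf("FIE enabled: %s\n", fie_enabled ? "yes" : "no");
	printf("get_rate() result: %d\n", get_perf_ctrs(true, &ref, &delivered));
	return 0;
}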
On Thu, Aug 18, 2022 at 01:03:51PM +0100, Ionela Voinescu wrote:
> On Wednesday 17 Aug 2022 at 17:59:01 (+0100), Catalin Marinas wrote:
> > On Wed, Aug 17, 2022 at 01:15:51PM +0100, Ionela Voinescu wrote:
> > > diff --git a/arch/arm64/kernel/topology.c b/arch/arm64/kernel/topology.c
> > > index 869ffc4d4484..5d7efb15f7cf 100644
> > > --- a/arch/arm64/kernel/topology.c
> > > +++ b/arch/arm64/kernel/topology.c
> > > @@ -301,7 +301,8 @@ static void cpu_read_corecnt(void *val)
> > >
> > >  static void cpu_read_constcnt(void *val)
> > >  {
> > > -	*(u64 *)val = read_constcnt();
> > > +	*(u64 *)val = this_cpu_has_cap(ARM64_WORKAROUND_2457168) ?
> > > +		      0UL : read_constcnt();
> > >  }
> > >
> > >  static inline
> > > @@ -328,7 +329,12 @@ int counters_read_on_cpu(int cpu, smp_call_func_t func, u64 *val)
> > >   */
> > >  bool cpc_ffh_supported(void)
> > >  {
> > > -	return freq_counters_valid(get_cpu_with_amu_feat());
> > > +	int cpu = get_cpu_with_amu_feat();
> > > +
> > > +	if ((cpu >= nr_cpu_ids) || !cpumask_test_cpu(cpu, cpu_present_mask))
> > > +		return false;
> > > +
> > > +	return true;
> > > }
> >
> > So here we tell the core code that FFH is supported but always return 0
> > via cpc_read_ffh() if the const counter is requested. I assume the core
> > code figures this out and doesn't use the value on the affected CPUs. I
> > was hoping cpc_ffh_supported() would be per-CPU and the core code would
> > simply skip calling cpc_read() on the broken cores.
>
> I used to think the same, but I've realised that the current approach is
> best, in my opinion.
>
> There are two users of these counters exposed through FFH in the kernel:
> CPPC-based frequency invariance (FIE) and reading the current frequency
> through sysfs. If AMU counters are disabled or the CPU is affected by this
> erratum, a single read of 0 for any of the counters will result in
> cppc_get_perf_ctrs() returning -EFAULT, which:
>
>  - (cppc_cpufreq_cpu_fie_init()) Will disable the use of FIE for that
>    policy, and those counters will never be read again for that CPU for
>    the purpose of FIE. This is the operation that would result in reading
>    those counters most often, which in this case won't happen.
>
>  - Will return -EFAULT from cppc_cpufreq_get_rate(), signaling to the user
>    that it cannot return a proper frequency using those counters. That's
>    cast to unsigned int, so the user would have to be knowledgeable on the
>    matter :), but that's an existing problem.
>
> Therefore, error checking based on a counter read of 0 would be
> equivalent here to checking a potential ffh_supported(cpu). Also, in the
> future we might use FFH for more than reading these counters, so it's
> better to keep ffh_supported() reflecting only whether FFH is generically
> supported, even if in some cases the "backend" (AMUs here) is disabled
> or broken.

This works for me as long as the callers are aware of what a return of 0
when reading the counter means.

> > Is the other register read by cpc_read_ffh() still useful without the
> > const one?
>
> Not for the current uses, and unlikely to be in the future - I don't see
> how the core counter value can be useful without a constant reference.

I was thinking of returning 0 directly from cpc_read_ffh(), since the
other counter is not used independently, but I guess your approach
matches the erratum better since it's only the const counter that's
broken.

> > While the Kconfig entry describes the behaviour, I'd rather have a
> > comment in cpc_ffh_supported() and maybe cpu_read_constcnt() on why we
> > do these tricks.
>
> Will do!
Thanks. With the comments added, feel free to add:

Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
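For reference, the comments agreed on above could take roughly the
following shape. This is only a sketch of what the follow-up might add;
the wording that eventually lands in the updated patch may well differ:

static void cpu_read_constcnt(void *val)
{
	/*
	 * Return 0 if the current CPU is affected by erratum 2457168.
	 * A value of 0 is also how disabled counters present themselves,
	 * so callers treat the constant counter as unusable on this CPU.
	 */
	*(u64 *)val = this_cpu_has_cap(ARM64_WORKAROUND_2457168) ?
		      0UL : read_constcnt();
}

bool cpc_ffh_supported(void)
{
	int cpu = get_cpu_with_amu_feat();

	/*
	 * FFH is considered supported if there is at least one present CPU
	 * with AMUs, even if some counters are disabled or broken (e.g. by
	 * erratum 2457168): reads of those counters return 0 and the users
	 * of the values handle that case themselves.
	 */
	if ((cpu >= nr_cpu_ids) || !cpumask_test_cpu(cpu, cpu_present_mask))
		return false;

	return true;
}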
diff --git a/Documentation/arm64/silicon-errata.rst b/Documentation/arm64/silicon-errata.rst
index 33b04db8408f..fda97b3fcf01 100644
--- a/Documentation/arm64/silicon-errata.rst
+++ b/Documentation/arm64/silicon-errata.rst
@@ -52,6 +52,8 @@ stable kernels.
 | Allwinner      | A64/R18         | UNKNOWN1        | SUN50I_ERRATUM_UNKNOWN1     |
 +----------------+-----------------+-----------------+-----------------------------+
 +----------------+-----------------+-----------------+-----------------------------+
+| ARM            | Cortex-A510     | #2457168        | ARM64_ERRATUM_2457168       |
++----------------+-----------------+-----------------+-----------------------------+
 | ARM            | Cortex-A510     | #2064142        | ARM64_ERRATUM_2064142       |
 +----------------+-----------------+-----------------+-----------------------------+
 | ARM            | Cortex-A510     | #2038923        | ARM64_ERRATUM_2038923       |
diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 571cc234d0b3..9fb9fff08c94 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -917,6 +917,23 @@ config ARM64_ERRATUM_1902691
 
 	  If unsure, say Y.
 
+config ARM64_ERRATUM_2457168
+	bool "Cortex-A510: 2457168: workaround for AMEVCNTR01 incrementing incorrectly"
+	depends on ARM64_AMU_EXTN
+	default y
+	help
+	  This option adds the workaround for ARM Cortex-A510 erratum 2457168.
+
+	  The AMU counter AMEVCNTR01 (constant counter) should increment at the same rate
+	  as the system counter. On affected Cortex-A510 cores AMEVCNTR01 increments
+	  incorrectly giving a significantly higher output value.
+
+	  Work around this problem by returning 0 when reading the affected counter in
+	  key locations that results in disabling all users of this counter. This effect
+	  is the same to firmware disabling affected counters.
+
+	  If unsure, say Y.
+
 config CAVIUM_ERRATUM_22375
 	bool "Cavium erratum 22375, 24313"
 	default y
diff --git a/arch/arm64/kernel/cpu_errata.c b/arch/arm64/kernel/cpu_errata.c
index 7e6289e709fc..810dd3c39882 100644
--- a/arch/arm64/kernel/cpu_errata.c
+++ b/arch/arm64/kernel/cpu_errata.c
@@ -654,6 +654,16 @@ const struct arm64_cpu_capabilities arm64_errata[] = {
 		ERRATA_MIDR_REV_RANGE(MIDR_CORTEX_A510, 0, 0, 2)
 	},
 #endif
+#ifdef CONFIG_ARM64_ERRATUM_2457168
+	{
+		.desc = "ARM erratum 2457168",
+		.capability = ARM64_WORKAROUND_2457168,
+		.type = ARM64_CPUCAP_WEAK_LOCAL_CPU_FEATURE,
+
+		/* Cortex-A510 r0p0-r1p1 */
+		CAP_MIDR_RANGE(MIDR_CORTEX_A510, 0, 0, 1, 1)
+	},
+#endif
 #ifdef CONFIG_ARM64_ERRATUM_2038923
 	{
 		.desc = "ARM erratum 2038923",
diff --git a/arch/arm64/kernel/cpufeature.c b/arch/arm64/kernel/cpufeature.c
index 907401e4fffb..af4de817d712 100644
--- a/arch/arm64/kernel/cpufeature.c
+++ b/arch/arm64/kernel/cpufeature.c
@@ -1870,7 +1870,10 @@ static void cpu_amu_enable(struct arm64_cpu_capabilities const *cap)
 		pr_info("detected CPU%d: Activity Monitors Unit (AMU)\n",
 			smp_processor_id());
 		cpumask_set_cpu(smp_processor_id(), &amu_cpus);
-		update_freq_counters_refs();
+
+		/* 0 reference values signal broken/disabled counters */
+		if (!this_cpu_has_cap(ARM64_WORKAROUND_2457168))
+			update_freq_counters_refs();
 	}
 }
 
diff --git a/arch/arm64/kernel/topology.c b/arch/arm64/kernel/topology.c
index 869ffc4d4484..5d7efb15f7cf 100644
--- a/arch/arm64/kernel/topology.c
+++ b/arch/arm64/kernel/topology.c
@@ -301,7 +301,8 @@ static void cpu_read_corecnt(void *val)
 
 static void cpu_read_constcnt(void *val)
 {
-	*(u64 *)val = read_constcnt();
+	*(u64 *)val = this_cpu_has_cap(ARM64_WORKAROUND_2457168) ?
+		      0UL : read_constcnt();
 }
 
 static inline
@@ -328,7 +329,12 @@ int counters_read_on_cpu(int cpu, smp_call_func_t func, u64 *val)
  */
 bool cpc_ffh_supported(void)
 {
-	return freq_counters_valid(get_cpu_with_amu_feat());
+	int cpu = get_cpu_with_amu_feat();
+
+	if ((cpu >= nr_cpu_ids) || !cpumask_test_cpu(cpu, cpu_present_mask))
+		return false;
+
+	return true;
 }
 
 int cpc_read_ffh(int cpu, struct cpc_reg *reg, u64 *val)
diff --git a/arch/arm64/tools/cpucaps b/arch/arm64/tools/cpucaps
index 779653771507..63b2484ce6c3 100644
--- a/arch/arm64/tools/cpucaps
+++ b/arch/arm64/tools/cpucaps
@@ -67,6 +67,7 @@ WORKAROUND_1902691
 WORKAROUND_2038923
 WORKAROUND_2064142
 WORKAROUND_2077057
+WORKAROUND_2457168
 WORKAROUND_TRBE_OVERWRITE_FILL_MODE
 WORKAROUND_TSB_FLUSH_FAILURE
 WORKAROUND_TRBE_WRITE_OUT_OF_RANGE
The AMU counter AMEVCNTR01 (constant counter) should increment at the
same rate as the system counter. On affected Cortex-A510 cores,
AMEVCNTR01 increments incorrectly, giving a significantly higher output
value. This results in inaccurate task scheduler utilization tracking
and incorrect feedback on CPU frequency.

Work around this problem by returning 0 when reading the affected
counter in key locations, which results in disabling all users of this
counter, whether for frequency invariance or as FFH reference counter.
The effect is the same as firmware disabling the affected counters.

Details on how the two features are affected by this erratum:

 - AMU counters will not be used for frequency invariance for affected
   CPUs and CPUs in the same cpufreq policy. AMUs can still be used for
   frequency invariance for unaffected CPUs in the system. Although
   unlikely, if no alternative method can be found to support frequency
   invariance for affected CPUs (cpufreq based or a solution based on
   platform counters), frequency invariance will be disabled. Please
   check the chapter on frequency invariance at
   Documentation/scheduler/sched-capacity.rst for details of its effect.

 - Given that FFH can be used to fetch either the core or constant
   counter values, restrictions are lifted regarding any of these
   counters returning a valid (!0) value. Therefore FFH is considered
   supported if there is at least one CPU that supports AMUs,
   independent of any counters being enabled or affected by this
   erratum.

The above is achieved through adding a new erratum:
ARM64_ERRATUM_2457168.

Signed-off-by: Ionela Voinescu <ionela.voinescu@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: James Morse <james.morse@arm.com>
---
Hi,

This patch is based on the information in the A510 Errata Notice,
version 13.0 at [1] and applies on v5.19-rc5.

v2 RESEND: v2 rebased on 6.0-rc1

v1 -> v2:
 - v1 at [2]
 - Move detection of the erratum to cpu_errata.c
 - Limit checking for affected CPUs to the init phase for FIE (Frequency
   Invariance Engine). For FFH we'll still check for affected CPUs at
   each read of the constant counter, but reads happen less often
   (driven by sysfs reads) compared to FIE (on the tick).

[1] https://developer.arm.com/documentation/SDEN2397589/1300/?lang=en
[2] https://lore.kernel.org/lkml/20220607125340.13635-1-ionela.voinescu@arm.com/

Thanks,
Ionela.

 Documentation/arm64/silicon-errata.rst |  2 ++
 arch/arm64/Kconfig                     | 17 +++++++++++++++++
 arch/arm64/kernel/cpu_errata.c         | 10 ++++++++++
 arch/arm64/kernel/cpufeature.c         |  5 ++++-
 arch/arm64/kernel/topology.c           | 10 ++++++++--
 arch/arm64/tools/cpucaps               |  1 +
 6 files changed, 42 insertions(+), 3 deletions(-)
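To see why leaving the reference values at 0 is enough to keep the
affected counters out of use for frequency invariance, it helps to look
at the validity check in arch/arm64/kernel/topology.c. The sketch below
is paraphrased from memory of that file around this kernel version and
is not quoted verbatim (exact names and messages may differ): because
cpu_amu_enable() now skips update_freq_counters_refs() on affected
cores, the per-CPU reference values stay 0, this check keeps failing for
them, and AMU-based frequency invariance is never enabled there.

static inline bool freq_counters_valid(int cpu)
{
	if ((cpu >= nr_cpu_ids) || !cpumask_test_cpu(cpu, cpu_present_mask))
		return false;

	if (!cpu_has_amu_feat(cpu))
		return false;

	/*
	 * The reference values are sampled by update_freq_counters_refs();
	 * on cores affected by erratum 2457168 that call is skipped, so
	 * they remain 0 and the counters are treated as unusable here.
	 */
	if (unlikely(!per_cpu(arch_const_cycles_prev, cpu) ||
		     !per_cpu(arch_core_cycles_prev, cpu)))
		return false;

	return true;
}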