[v7,0/2] Add L1 and L2 error detection for A53, A57 and A72

Message ID 1744409319-24912-1-git-send-email-vijayb@linux.microsoft.com (mailing list archive)

Message

Vijay Balakrishna April 11, 2025, 10:08 p.m. UTC
Hello,

This is an attempt to revive [v5] series. I have attempted to address comments
and suggestions from Marc Zyngier since [v5]. Additionally, I have extended
support for A72 processors. Testing on a problematic A72 SoC has led to the
detection of Correctable Errors (CEs). I am eager to hear your suggestions and
feedback on this series.

Thanks,
Vijay

[v5] https://lore.kernel.org/all/20210401110615.15326-1-s.hauer@pengutronix.de/#t
[v6] https://lore.kernel.org/all/1744241785-20256-1-git-send-email-vijayb@linux.microsoft.com/

Changes since v6:
- move the clearing of the CPU/L2 syndrome registers back into
  read_errors(), as it was in [v5] (Tyler)
- upon detecting a valid error, clear syndrome registers immediately
  to avoid clobbering between the read and write (Marc)
- NULL return check for of_get_cpu_node() (Tyler)
- of_node_put() to avoid refcount issue (Tyler)
- quotes are dropped in yaml file (Krzysztof)
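
For readers following the read-then-clear change, here is a minimal userspace
sketch of the pattern, not the driver code itself. The register is modeled as
a plain variable; fake_merrsr, fake_read_merrsr(), fake_write_merrsr() and
read_errors_sketch() are hypothetical names, and the Valid/Fatal bit positions
(31 and 63) are assumptions to be checked against the relevant TRM. In the
kernel the accesses would be system register reads and writes, with an isb()
after the clearing write.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed layout: bit 31 = Valid, bit 63 = Fatal. Verify against the TRM. */
#define MERRSR_VALID (UINT64_C(1) << 31)
#define MERRSR_FATAL (UINT64_C(1) << 63)

/* Stand-in for a per-CPU syndrome register; this is not hardware access. */
static uint64_t fake_merrsr;

static uint64_t fake_read_merrsr(void)
{
	return fake_merrsr;
}

static void fake_write_merrsr(uint64_t val)
{
	fake_merrsr = val;
}

/* Copy of the register contents taken at the moment an error was seen. */
struct merrsr_snapshot {
	uint64_t raw;
	bool valid;
	bool fatal;
};

static struct merrsr_snapshot read_errors_sketch(void)
{
	struct merrsr_snapshot snap = { 0 };
	uint64_t val = fake_read_merrsr();

	if (val & MERRSR_VALID) {
		/*
		 * Clear immediately after the read so the window in which a
		 * new error could be recorded and then wiped stays small.
		 * Kernel code would follow this write with isb().
		 */
		fake_write_merrsr(0);

		snap.raw = val;
		snap.valid = true;
		snap.fatal = (val & MERRSR_FATAL) != 0;
	}

	/* No valid error: the register is left untouched. */
	return snap;
}
```

Reporting then works from the snapshot only, which keeps error retrieval
separate from error reporting.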

Changes since v5:
- rebase on v6.15-rc1
- the syndrome registers for CPU/L2 memory errors are cleared only upon
  detecting an error, followed by an isb() for synchronization (Marc)
- "edac-enabled" hunk moved to initial patch to avoid breaking virtual
  environments (Marc)
- to ensure compatibility across all three families, we are not reporting
  "L1 Dirty RAM," documented only in the A53 TRM
- the above prompted changing the default CPU L1 error message from
  "unknown" to "Unspecified"
- capturing CPUID/WAY information in L2 memory error log (Marc)
- module license from "GPL v2" to "GPL" (checkpatch.pl warning)
- extend support for A72
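
To illustrate the "Unspecified" default and the CPUID/WAY capture, here is a
rough userspace sketch of the decoding, not the code in the patch. The field
positions (RAMID in bits 30:24, CPUID/WAY in bits 21:18) and the RAMID
encodings are assumptions based on my reading of the TRMs and should be
verified there; merrsr_ramid(), l1_ramid_name() and l2merrsr_cpuid_way() are
hypothetical helper names.

```c
#include <stdint.h>

/* Assumed field positions; check the A53/A57/A72 TRMs before relying on them. */
#define MERRSR_RAMID_SHIFT		24
#define MERRSR_RAMID_MASK		UINT64_C(0x7f)
#define L2MERRSR_CPUID_WAY_SHIFT	18
#define L2MERRSR_CPUID_WAY_MASK		UINT64_C(0xf)

/* Extract the RAM identifier from a CPU memory error syndrome value. */
static unsigned int merrsr_ramid(uint64_t val)
{
	return (unsigned int)((val >> MERRSR_RAMID_SHIFT) & MERRSR_RAMID_MASK);
}

/* Extract the CPUID/WAY field from an L2 memory error syndrome value. */
static unsigned int l2merrsr_cpuid_way(uint64_t val)
{
	return (unsigned int)((val >> L2MERRSR_CPUID_WAY_SHIFT) &
			      L2MERRSR_CPUID_WAY_MASK);
}

/*
 * Map a RAMID to a name. Only encodings assumed to be shared by all three
 * cores are listed. "L1 Dirty RAM" (assumed to be 0x14 on A53) is left out
 * on purpose, so it falls through to "Unspecified".
 */
static const char *l1_ramid_name(unsigned int ramid)
{
	switch (ramid) {
	case 0x00:
		return "L1-I Tag RAM";
	case 0x01:
		return "L1-I Data RAM";
	case 0x08:
		return "L1-D Tag RAM";
	case 0x09:
		return "L1-D Data RAM";
	case 0x18:
		return "TLB RAM";
	default:
		return "Unspecified";
	}
}
```

With this shape, any encoding a given core does not document is reported as
"Unspecified" rather than being given a misleading name.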

Changes since v4:
- Rebase on v5.12-rc5

Changes since v3:
- Add edac-enabled property to make EDAC support optional

Changes since v2:
- drop usage of virtual dt node (Robh)
- use read_sysreg_s instead of open coded variant (James Morse)
- separate error retrieving from error reporting
- use smp_call_function_single rather than smp_call_function_single_async
- make driver single instance and register all 'cpu' hierarchy up front once

Changes since v1:
- Split dt-binding into separate patch
- Sort local function variables in reverse-xmas tree order
- drop unnecessary comparison and make variable bool

Sascha Hauer (2):
  drivers/edac: Add L1 and L2 error detection for A53, A57 and A72
  dt-bindings: arm: cpus: Add edac-enabled property

 .../devicetree/bindings/arm/cpus.yaml         |   6 +
 drivers/edac/Kconfig                          |   9 +
 drivers/edac/Makefile                         |   1 +
 drivers/edac/cortex_arm64_l1_l2.c             | 232 ++++++++++++++++++
 4 files changed, 248 insertions(+)
 create mode 100644 drivers/edac/cortex_arm64_l1_l2.c


base-commit: 0af2f6be1b4281385b618cb86ad946eded089ac8

Comments

Borislav Petkov April 13, 2025, 8:39 p.m. UTC | #1
On Fri, Apr 11, 2025 at 03:08:37PM -0700, Vijay Balakrishna wrote:
> Hello,
> 
> This is an attempt to revive [v5] series. I have attempted to address comments
> and suggestions from Marc Zyngier since [v5]. Additionally, I have extended
> support for A72 processors. Testing on a problematic A72 SoC has led to the
> detection of Correctable Errors (CEs). I am eager to hear your suggestions and
> feedback on this series.

Did you not read Marc's note:

https://lore.kernel.org/all/86a58kl51r.wl-maz@kernel.org/

or

https://lore.kernel.org/all/86frigkmtd.wl-maz@kernel.org/

?
Vijay Balakrishna April 16, 2025, 12:05 a.m. UTC | #2
On 4/13/25 13:39, Borislav Petkov wrote:
> On Fri, Apr 11, 2025 at 03:08:37PM -0700, Vijay Balakrishna wrote:
>> Hello,
>>
>> This is an attempt to revive [v5] series. I have attempted to address comments
>> and suggestions from Marc Zyngier since [v5]. Additionally, I have extended
>> support for A72 processors. Testing on a problematic A72 SoC has led to the
>> detection of Correctable Errors (CEs). I am eager to hear your suggestions and
>> feedback on this series.
> 
> Did you not read Marc's note:
> 
> https://lore.kernel.org/all/86a58kl51r.wl-maz@kernel.org/
> 
> or
> 
> https://lore.kernel.org/all/86frigkmtd.wl-maz@kernel.org/
> 
> ?
> 

Hi Borislav,

Before posting v7 I had seen the second reply above, but not the first. I
opted to submit v7 after addressing the comments and issues identified
in v6 for the benefit of those interested. Sascha's v5 series helped us
confirm that a problematic A72 was indeed suffering from CEs.

Our primary focus is on A72. I can re-submit with modifications solely 
related to A72 and exclude A53 and A57. As Tyler mentioned, we have a 
significant number of A72-based systems in our fleet, and timely 
replacements via monitoring CEs will be instrumental in managing them 
effectively. Please share your thoughts.

Thanks,
Vijay