[RFC,0/3,qemu] arm/acpi: ACPI based FW First error injection

Message ID 20240628090605.529-1-shiju.jose@huawei.com (mailing list archive)

Shiju Jose June 28, 2024, 9:06 a.m. UTC
From: Shiju Jose <shiju.jose@huawei.com>

This series adds:
1. ACPI-based FW-first error injection, and
2. support for injecting ARM processor errors.

This QEMU-based error injection mechanism has proven very useful for testing
and upstreaming the RAS FW-first related changes in the kernel,
as well as in the user space tools, when hardware is not available.

What is this?
- The ACPI + UEFI specs define a means of notifying the OS of errors that
  firmware has handled (gathered up data, reset the relevant error tracking
  units, etc.) in a set of standard formats (UEFI spec appendix N).
- ARM virt already supports the standard HEST ACPI table description of
  Synchronous External Abort (SEA) for memory errors. This series builds on
  this to add a GHESv2 / Generic Error Device / GPIO interrupt path for
  asynchronous error reporting.
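
To make the GHESv2 flow above concrete, here is a minimal standalone sketch of
the acknowledge handshake: firmware publishes a record into the error status
block and raises the notification, and must not overwrite the block until the
OS has acknowledged it via the Read Ack register. This is not code from the
series. The struct, the function names and the 64-byte block size are all
hypothetical, and the GPIO interrupt is modelled as a simple counter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical model of one GHESv2 error source; not QEMU's data layout. */
#define MODEL_BLOCK_SIZE 64

struct ghes_v2_model {
    uint8_t  status_block[MODEL_BLOCK_SIZE]; /* stands in for the error status block */
    uint32_t len;       /* number of valid bytes in status_block */
    bool     pending;   /* true until the OS acknowledges the record */
    unsigned irq_count; /* how many times the notification was raised */
};

/*
 * Firmware side: publish a record and notify the OS. Refuses to overwrite
 * a record that the OS has not acknowledged yet.
 */
bool fw_report_error(struct ghes_v2_model *src, const void *rec, uint32_t len)
{
    assert(src != NULL && rec != NULL);
    if (src->pending || len > MODEL_BLOCK_SIZE) {
        return false;
    }
    memcpy(src->status_block, rec, len);
    src->len = len;
    src->pending = true;
    src->irq_count++; /* models asserting the GPIO line */
    return true;
}

/*
 * OS side: copy the record out, then acknowledge it. Returns the number of
 * bytes consumed, or 0 if nothing is pending or the buffer is too small.
 */
uint32_t os_handle_error(struct ghes_v2_model *src, void *out, uint32_t out_size)
{
    assert(src != NULL && out != NULL);
    if (!src->pending || out_size < src->len) {
        return 0;
    }
    memcpy(out, src->status_block, src->len);
    src->pending = false; /* models the write to the Read Ack register */
    return src->len;
}
```

In this model a second fw_report_error() call fails until os_handle_error()
has run, which is the property the Read Ack register provides over the
original GHES.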

- The OS normally negotiates for control of error registers via _OSC.
  Previously QEMU unconditionally granted control of these registers.
  This series includes a machine parameter to allow the 'FW' to not let the
  OS take control and tracks whether the OS has asked for control or not.
  Note this code relies on the standard handshake - it's not remotely
  correct if the OS does not follow that flow - this can be hardened with
  some more AML magic.

Alternatives:
- In theory we could emulate a management controller running appropriate
  firmware and have that actually handle the errors. It's much easier to
  instead intercept them before the error reporting messages are sent and the
  result is logged in the root port's error registers. As far as the guest is
  concerned it doesn't matter if these registers are handled via the firmware
  or never got written in the first place (the guest isn't allowed to touch
  these registers anyway!).
  This is sort of the same argument for why we build ACPI tables in general
  in QEMU rather than making that an EDK2 problem.

Why?
- The kernel supports both firmware first and native RAS.
  As only some vendors have adopted a FW first model and hardware
  availability is limited, this code has proven challenging to test.

Why an RFC?
- We assume that adding this support to QEMU will be controversial.
- We probably need to figure out how to do this for x86, as apparently
  people also want to use that architecture.

Reference to the previous series:
https://patchew.org/QEMU/20240205141940.31111-1-Jonathan.Cameron@huawei.com/

Mauro Carvalho has added instructions to the wiki on how to inject ARM
processor errors:
https://github.com/mchehab/rasdaemon/wiki/error-injection

The series is available here:
https://gitlab.com/shiju.jose/qemu/-/commits/arm-error-inject

Jonathan Cameron (3):
  arm/virt: Wire up GPIO error source for ACPI / GHES
  acpi/ghes: Support GPIO error source.
  acpi/ghes: Add a logic to handle block addresses and FW first ARM
    processor error injection

 configs/targets/aarch64-softmmu.mak |   1 +
 hw/acpi/ghes.c                      | 266 ++++++++++++++++++++++++++--
 hw/arm/Kconfig                      |   4 +
 hw/arm/arm_error_inject.c           |  35 ++++
 hw/arm/arm_error_inject_stubs.c     |  18 ++
 hw/arm/meson.build                  |   3 +
 hw/arm/virt-acpi-build.c            |  29 ++-
 hw/arm/virt.c                       |  12 +-
 include/hw/acpi/ghes.h              |   3 +
 include/hw/boards.h                 |   1 +
 qapi/arm-error-inject.json          |  49 +++++
 qapi/meson.build                    |   1 +
 qapi/qapi-schema.json               |   1 +
 13 files changed, 405 insertions(+), 18 deletions(-)
 create mode 100644 hw/arm/arm_error_inject.c
 create mode 100644 hw/arm/arm_error_inject_stubs.c
 create mode 100644 qapi/arm-error-inject.json