diff mbox series

[RFC,v1] KVM: arm64: Introduce KVM_CAP_ARM_SIGBUS_ON_SEA

Message ID 20241031212104.1429609-1-jiaqiyan@google.com (mailing list archive)
State New
Headers show
Series [RFC,v1] KVM: arm64: Introduce KVM_CAP_ARM_SIGBUS_ON_SEA | expand

Commit Message

Jiaqi Yan Oct. 31, 2024, 9:21 p.m. UTC
Currently KVM handles SEA in guest by injecting async SError into
guest directly, bypassing VMM, usually results in guest kernel panic.

One major situation of guest SEA is when vCPU consumes uncorrectable
memory error on the physical memory. Although SError and guest kernel
panic effectively stops the propagation of corrupted memory, it is not
easy for VMM and guest to recover from memory error in a more graceful
manner.

Alternatively KVM can send a SIGBUS BUS_OBJERR to VMM/vCPU, just like
how core kernel signals SIGBUS BUS_OBJERR to the poison consuming
thread.
In addition to the benifit that KVM's handling for SEA becomes aligned
with core kernel behavior
- The blast radius in VM can be limited to only the consuming thread
  in guest, instead of entire guest kernel, unless the consumption is
  from guest kernel.
- VMM now has the chance to do its duties to stop the VM from repeatedly
  consuming corrupted data. For example, VMM can unmap the guest page
  from stage-2 table to intercept forseen memory poison consumption,
  and for every consumption injects SEA to EL1 with synthetic memory
  error CPER.

Introduce a new KVM ARM capability KVM_CAP_ARM_SIGBUS_ON_SEA. VMM
can opt in this new capability if it prefers SIGBUS than SError
injection during VM init. Now SEA handling in KVM works as follows:
1. Delegate to APEI/GHES to see if SEA can be claimed by them.
2. If APEI failed to claim the SEA and KVM_CAP_ARM_SIGBUS_ON_SEA is
   enabled for the VM, and the SEA is NOT about translation table,
   send SIGBUS BUS_OBJERR signal with host virtual address.
3. Otherwise directly inject async SError to guest.

Tested on a machine running Siryn AmpereOne processor. A in-house VMM
that opts in KVM_CAP_ARM_SIGBUS_ON_SEA started a VM. A dummy application
in VM allocated some memory buffer. The test used EINJ to inject an
uncorrectable recoverable memory error at a page in the allocated memory
buffer. The dummy application then consumed the memory error. Some hack
was done to make core kernel's memory_failure triggered by poison
generation to fail, so KVM had to deal with the SEA guest abort due to
poison consumption. vCPU thread in VMM received SIGBUS BUS_OBJERR with
valid host virtual address of the poisoned page. VMM then injected a SEA
into guest using KVM_SET_VCPU_EVENTS with ext_dabt_pending=1. At last
the dummy application in guest was killed by SIGBUS BUS_OBJERR, while the
guest survived and continued to run.

Signed-off-by: Jiaqi Yan <jiaqiyan@google.com>
---
 arch/arm64/include/asm/kvm_host.h |  2 +
 arch/arm64/include/asm/kvm_ras.h  | 20 ++++----
 arch/arm64/kvm/Makefile           |  2 +-
 arch/arm64/kvm/arm.c              |  5 ++
 arch/arm64/kvm/kvm_ras.c          | 77 +++++++++++++++++++++++++++++++
 arch/arm64/kvm/mmu.c              |  8 +---
 include/uapi/linux/kvm.h          |  1 +
 7 files changed, 98 insertions(+), 17 deletions(-)
 create mode 100644 arch/arm64/kvm/kvm_ras.c

Comments

Oliver Upton Oct. 31, 2024, 9:46 p.m. UTC | #1
Hi Jiaqi,

Thank you for sending this out.

On Thu, Oct 31, 2024 at 09:21:04PM +0000, Jiaqi Yan wrote:
> Currently KVM handles SEA in guest by injecting async SError into
> guest directly, bypassing VMM, usually results in guest kernel panic.
> 
> One major situation of guest SEA is when vCPU consumes uncorrectable
> memory error on the physical memory. Although SError and guest kernel
> panic effectively stops the propagation of corrupted memory, it is not
> easy for VMM and guest to recover from memory error in a more graceful
> manner.
> 
> Alternatively KVM can send a SIGBUS BUS_OBJERR to VMM/vCPU, just like
> how core kernel signals SIGBUS BUS_OBJERR to the poison consuming
> thread.
> In addition to the benifit that KVM's handling for SEA becomes aligned
> with core kernel behavior
> - The blast radius in VM can be limited to only the consuming thread
>   in guest, instead of entire guest kernel, unless the consumption is
>   from guest kernel.
> - VMM now has the chance to do its duties to stop the VM from repeatedly
>   consuming corrupted data. For example, VMM can unmap the guest page
>   from stage-2 table to intercept forseen memory poison consumption,
>   and for every consumption injects SEA to EL1 with synthetic memory
>   error CPER.
> 
> Introduce a new KVM ARM capability KVM_CAP_ARM_SIGBUS_ON_SEA. VMM
> can opt in this new capability if it prefers SIGBUS than SError
> injection during VM init. Now SEA handling in KVM works as follows:

I'm somewhat tempted to force the new behavior on userspace
unconditionally. Working back from an unexpected SError in the VM to the
KVM SEA handler is a bit of a mess, and can be annoying if the operator
can't access console logs of the VM.

As it stands today, UAPI expectations around SEAs are platform
dependent. If APEI claims the SEA and decides to offline a page, the
user will get a SIGBUS.

So sending a SIGBUS for the case that firmware _doesn't_ claim the SEA
seems like a good move from a consistency PoV. But it is a decently-sized
change to do without explicit buy-in from userspace so let's see what
others think.

> 1. Delegate to APEI/GHES to see if SEA can be claimed by them.
> 2. If APEI failed to claim the SEA and KVM_CAP_ARM_SIGBUS_ON_SEA is
>    enabled for the VM, and the SEA is NOT about translation table,
>    send SIGBUS BUS_OBJERR signal with host virtual address.
> 3. Otherwise directly inject async SError to guest.

The other reason I'm a bit lukewarm on user buy in is the UAPI suffers
from the same issue we do today: it depends on the platform. If the SEA
is claimed by APEI/GHES then the cap does nothing.

> +static int kvm_delegate_guest_sea(phys_addr_t addr, u64 esr)
> +{
> +	/* apei_claim_sea(NULL) expects to mask interrupts itself */
> +	lockdep_assert_irqs_enabled();
> +	return apei_claim_sea(NULL);
> +}

Consider dropping parameters from this since they're unused.

> +void kvm_handle_guest_sea(struct kvm_vcpu *vcpu)
> +{
> +	bool sigbus_on_sea;
> +	int idx;
> +	u64 vcpu_esr = kvm_vcpu_get_esr(vcpu);
> +	u8 fsc = kvm_vcpu_trap_get_fault(vcpu);
> +	phys_addr_t fault_ipa = kvm_vcpu_get_fault_ipa(vcpu);
> +	gfn_t gfn = fault_ipa >> PAGE_SHIFT;
> +	/* When FnV is set, send 0 as si_addr like what do_sea() does. */
> +	unsigned long hva = 0UL;
> +
> +	/*
> +	 * For RAS the host kernel may handle this abort.
> +	 * There is no need to SIGBUS VMM, or pass the error into the guest.
> +	 */
> +	if (kvm_delegate_guest_sea(fault_ipa, vcpu_esr) == 0)
> +		return;
> +
> +	sigbus_on_sea = test_bit(KVM_ARCH_FLAG_SIGBUS_ON_SEA,
> +				 &(vcpu->kvm->arch.flags));
> +
> +	/*
> +	 * In addition to userspace opt-in, SIGBUS only makes sense if the
> +	 * abort is NOT about translation table walk and NOT about hardware
> +	 * update of translation table.
> +	 */
> +	sigbus_on_sea &= (fsc == ESR_ELx_FSC_EXTABT || fsc == ESR_ELx_FSC_SECC);

Is this because we potentially can't determine a valid HVA for the
fault? Maybe these should go out to userspace still with si_addr = 0.
Marc Zyngier Nov. 1, 2024, 9:03 a.m. UTC | #2
On Thu, 31 Oct 2024 21:21:04 +0000,
Jiaqi Yan <jiaqiyan@google.com> wrote:
> 
> Currently KVM handles SEA in guest by injecting async SError into
> guest directly, bypassing VMM, usually results in guest kernel panic.
> 
> One major situation of guest SEA is when vCPU consumes uncorrectable
> memory error on the physical memory. Although SError and guest kernel
> panic effectively stops the propagation of corrupted memory, it is not
> easy for VMM and guest to recover from memory error in a more graceful
> manner.
> 
> Alternatively KVM can send a SIGBUS BUS_OBJERR to VMM/vCPU, just like
> how core kernel signals SIGBUS BUS_OBJERR to the poison consuming
> thread.

Can you elaborate on why the delivery of a signal is preferable to
simply exiting back to userspace with a description of the error?
Signals are usually not generated by KVM, and are a pretty twisted way
to generate an exit.

> In addition to the benifit that KVM's handling for SEA becomes aligned
> with core kernel behavior
> - The blast radius in VM can be limited to only the consuming thread
>   in guest, instead of entire guest kernel, unless the consumption is
>   from guest kernel.
> - VMM now has the chance to do its duties to stop the VM from repeatedly
>   consuming corrupted data. For example, VMM can unmap the guest page
>   from stage-2 table to intercept forseen memory poison consumption,

Not quite. The VMM doesn't manage stage-2. It can remove the page from
the VMA if it has it mapped, but that's it. The kernel deals with S2.

Which brings me to the next subject: when the kernel unmaps the page
at S2, it is likely to use CMOs. Can these CMOs create RAS error
themselves?

>   and for every consumption injects SEA to EL1 with synthetic memory
>   error CPER.

Why do we need to involve ACPI here? I would expect the production of
an architected error record instead. Or at least be given the option.

> Introduce a new KVM ARM capability KVM_CAP_ARM_SIGBUS_ON_SEA. VMM
> can opt in this new capability if it prefers SIGBUS than SError
> injection during VM init. Now SEA handling in KVM works as follows:
> 1. Delegate to APEI/GHES to see if SEA can be claimed by them.
> 2. If APEI failed to claim the SEA and KVM_CAP_ARM_SIGBUS_ON_SEA is
>    enabled for the VM, and the SEA is NOT about translation table,
>    send SIGBUS BUS_OBJERR signal with host virtual address.

And what if it is? S1 PTs are backed by userspace memory, like
anything else. I don't think we should have a different treatment of
those, because the HW wouldn't treat them differently either.

> 3. Otherwise directly inject async SError to guest.
>
> Tested on a machine running Siryn AmpereOne processor. A in-house VMM
> that opts in KVM_CAP_ARM_SIGBUS_ON_SEA started a VM. A dummy application
> in VM allocated some memory buffer. The test used EINJ to inject an
> uncorrectable recoverable memory error at a page in the allocated memory
> buffer. The dummy application then consumed the memory error. Some hack
> was done to make core kernel's memory_failure triggered by poison
> generation to fail, so KVM had to deal with the SEA guest abort due to
> poison consumption. vCPU thread in VMM received SIGBUS BUS_OBJERR with
> valid host virtual address of the poisoned page. VMM then injected a SEA
> into guest using KVM_SET_VCPU_EVENTS with ext_dabt_pending=1. At last
> the dummy application in guest was killed by SIGBUS BUS_OBJERR, while the
> guest survived and continued to run.
> 
> Signed-off-by: Jiaqi Yan <jiaqiyan@google.com>
> ---
>  arch/arm64/include/asm/kvm_host.h |  2 +
>  arch/arm64/include/asm/kvm_ras.h  | 20 ++++----
>  arch/arm64/kvm/Makefile           |  2 +-
>  arch/arm64/kvm/arm.c              |  5 ++
>  arch/arm64/kvm/kvm_ras.c          | 77 +++++++++++++++++++++++++++++++
>  arch/arm64/kvm/mmu.c              |  8 +---
>  include/uapi/linux/kvm.h          |  1 +
>  7 files changed, 98 insertions(+), 17 deletions(-)
>  create mode 100644 arch/arm64/kvm/kvm_ras.c
> 
> diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
> index bf64fed9820ea..eb37a2489411a 100644
> --- a/arch/arm64/include/asm/kvm_host.h
> +++ b/arch/arm64/include/asm/kvm_host.h
> @@ -334,6 +334,8 @@ struct kvm_arch {
>  	/* Fine-Grained UNDEF initialised */
>  #define KVM_ARCH_FLAG_FGU_INITIALIZED			8
>  	unsigned long flags;
> +	/* Instead of injecting SError into guest, SIGBUS VMM */
> +#define KVM_ARCH_FLAG_SIGBUS_ON_SEA			9

nit: why do you put this definition out of sequence (below 'flags')?

>  
>  	/* VM-wide vCPU feature set */
>  	DECLARE_BITMAP(vcpu_features, KVM_VCPU_MAX_FEATURES);
> diff --git a/arch/arm64/include/asm/kvm_ras.h b/arch/arm64/include/asm/kvm_ras.h
> index 87e10d9a635b5..4bb7a424e3f6c 100644
> --- a/arch/arm64/include/asm/kvm_ras.h
> +++ b/arch/arm64/include/asm/kvm_ras.h
> @@ -11,15 +11,17 @@
>  #include <asm/acpi.h>
>  
>  /*
> - * Was this synchronous external abort a RAS notification?
> - * Returns '0' for errors handled by some RAS subsystem, or -ENOENT.
> + * Handle synchronous external abort (SEA) in the following order:
> + * 1. Delegate to APEI/GHES to see if SEA can be claimed by them. If so, we
> + *    are all done.
> + * 2. If userspace opts in KVM_CAP_ARM_SIGBUS_ON_SEA, and if the SEA is NOT
> + *    about translation table, send SIGBUS
> + *    - si_code is BUS_OBJERR.
> + *    - si_addr will be 0 when accurate HVA is unavailable.
> + * 3. Otherwise, directly inject an async SError to guest.
> + *
> + * Note this applies to both ESR_ELx_EC_IABT_* and ESR_ELx_EC_DABT_*.
>   */
> -static inline int kvm_handle_guest_sea(phys_addr_t addr, u64 esr)
> -{
> -	/* apei_claim_sea(NULL) expects to mask interrupts itself */
> -	lockdep_assert_irqs_enabled();
> -
> -	return apei_claim_sea(NULL);
> -}
> +void kvm_handle_guest_sea(struct kvm_vcpu *vcpu);
>  
>  #endif /* __ARM64_KVM_RAS_H__ */
> diff --git a/arch/arm64/kvm/Makefile b/arch/arm64/kvm/Makefile
> index 3cf7adb2b5038..c4a3a6d4870e6 100644
> --- a/arch/arm64/kvm/Makefile
> +++ b/arch/arm64/kvm/Makefile
> @@ -23,7 +23,7 @@ kvm-y += arm.o mmu.o mmio.o psci.o hypercalls.o pvtime.o \
>  	 vgic/vgic-v3.o vgic/vgic-v4.o \
>  	 vgic/vgic-mmio.o vgic/vgic-mmio-v2.o \
>  	 vgic/vgic-mmio-v3.o vgic/vgic-kvm-device.o \
> -	 vgic/vgic-its.o vgic/vgic-debug.o
> +	 vgic/vgic-its.o vgic/vgic-debug.o kvm_ras.o
>  
>  kvm-$(CONFIG_HW_PERF_EVENTS)  += pmu-emul.o pmu.o
>  kvm-$(CONFIG_ARM64_PTR_AUTH)  += pauth.o
> diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
> index 48cafb65d6acf..bb97ad678dbec 100644
> --- a/arch/arm64/kvm/arm.c
> +++ b/arch/arm64/kvm/arm.c
> @@ -151,6 +151,10 @@ int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
>  		}
>  		mutex_unlock(&kvm->slots_lock);
>  		break;
> +	case KVM_CAP_ARM_SIGBUS_ON_SEA:
> +		r = 0;
> +		set_bit(KVM_ARCH_FLAG_SIGBUS_ON_SEA, &kvm->arch.flags);

Shouldn't this be somehow gated on the VM being RAS aware?

> +		break;
>  	default:
>  		break;
>  	}
> @@ -339,6 +343,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
>  	case KVM_CAP_ARM_SYSTEM_SUSPEND:
>  	case KVM_CAP_IRQFD_RESAMPLE:
>  	case KVM_CAP_COUNTER_OFFSET:
> +	case KVM_CAP_ARM_SIGBUS_ON_SEA:
>  		r = 1;
>  		break;
>  	case KVM_CAP_SET_GUEST_DEBUG2:
> diff --git a/arch/arm64/kvm/kvm_ras.c b/arch/arm64/kvm/kvm_ras.c
> new file mode 100644
> index 0000000000000..3225462bcbcda
> --- /dev/null
> +++ b/arch/arm64/kvm/kvm_ras.c
> @@ -0,0 +1,77 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +
> +#include <linux/bitops.h>
> +#include <linux/kvm_host.h>
> +
> +#include <asm/kvm_emulate.h>
> +#include <asm/kvm_ras.h>
> +#include <asm/system_misc.h>
> +
> +/*
> + * For synchrnous external instruction or data abort, not on translation
> + * table walk or hardware update of translation table, is FAR_EL2 valid?
> + */
> +static inline bool kvm_vcpu_sea_far_valid(const struct kvm_vcpu *vcpu)
> +{
> +	return !(vcpu->arch.fault.esr_el2 & ESR_ELx_FnV);
> +}
> +
> +/*
> + * Was this synchronous external abort a RAS notification?
> + * Returns '0' for errors handled by some RAS subsystem, or -ENOENT.
> + */
> +static int kvm_delegate_guest_sea(phys_addr_t addr, u64 esr)
> +{
> +	/* apei_claim_sea(NULL) expects to mask interrupts itself */
> +	lockdep_assert_irqs_enabled();
> +	return apei_claim_sea(NULL);
> +}
> +
> +void kvm_handle_guest_sea(struct kvm_vcpu *vcpu)
> +{
> +	bool sigbus_on_sea;
> +	int idx;
> +	u64 vcpu_esr = kvm_vcpu_get_esr(vcpu);
> +	u8 fsc = kvm_vcpu_trap_get_fault(vcpu);
> +	phys_addr_t fault_ipa = kvm_vcpu_get_fault_ipa(vcpu);
> +	gfn_t gfn = fault_ipa >> PAGE_SHIFT;
> +	/* When FnV is set, send 0 as si_addr like what do_sea() does. */
> +	unsigned long hva = 0UL;
> +
> +	/*
> +	 * For RAS the host kernel may handle this abort.
> +	 * There is no need to SIGBUS VMM, or pass the error into the guest.
> +	 */
> +	if (kvm_delegate_guest_sea(fault_ipa, vcpu_esr) == 0)
> +		return;
> +
> +	sigbus_on_sea = test_bit(KVM_ARCH_FLAG_SIGBUS_ON_SEA,
> +				 &(vcpu->kvm->arch.flags));
> +
> +	/*
> +	 * In addition to userspace opt-in, SIGBUS only makes sense if the
> +	 * abort is NOT about translation table walk and NOT about hardware
> +	 * update of translation table.
> +	 */
> +	sigbus_on_sea &= (fsc == ESR_ELx_FSC_EXTABT || fsc == ESR_ELx_FSC_SECC);
> +
> +	/* Pass the error directly into the guest. */
> +	if (!sigbus_on_sea) {
> +		kvm_inject_vabt(vcpu);
> +		return;
> +	}
> +
> +	if (kvm_vcpu_sea_far_valid(vcpu)) {
> +		idx = srcu_read_lock(&vcpu->kvm->srcu);
> +		hva = gfn_to_hva(vcpu->kvm, gfn);
> +		srcu_read_unlock(&vcpu->kvm->srcu, idx);
> +	}
> +
> +	/*
> +	 * Send a SIGBUS BUS_OBJERR to vCPU thread (the userspace thread that
> +	 * runs KVM_RUN) or VMM, which aligns with what host kernel do_sea()
> +	 * does if apei_claim_sea() fails.
> +	 */
> +	arm64_notify_die("synchronous external abort",
> +			 current_pt_regs(), SIGBUS, BUS_OBJERR, hva, vcpu_esr);

This is the point where I really think we should simply trigger an
exit with all that syndrome information stashed in kvm_run, like any
other event requiring userspace help.

Also: where is the documentation?

Thanks,

	M.
Jiaqi Yan Nov. 4, 2024, 5:02 a.m. UTC | #3
Hi Marc, thanks for your quick response!

On Fri, Nov 1, 2024 at 6:54 AM Marc Zyngier <maz@kernel.org> wrote:
>
> On Thu, 31 Oct 2024 21:21:04 +0000,
> Jiaqi Yan <jiaqiyan@google.com> wrote:
> >
> > Currently KVM handles SEA in guest by injecting async SError into
> > guest directly, bypassing VMM, usually results in guest kernel panic.
> >
> > One major situation of guest SEA is when vCPU consumes uncorrectable
> > memory error on the physical memory. Although SError and guest kernel
> > panic effectively stops the propagation of corrupted memory, it is not
> > easy for VMM and guest to recover from memory error in a more graceful
> > manner.
> >
> > Alternatively KVM can send a SIGBUS BUS_OBJERR to VMM/vCPU, just like
> > how core kernel signals SIGBUS BUS_OBJERR to the poison consuming
> > thread.
>
> Can you elaborate on why the delivery of a signal is preferable to
> simply exiting back to userspace with a description of the error?
> Signals are usually not generated by KVM, and are a pretty twisted way
> to generate an exit.

A couple of reasons. First, my intuition is that KVM and core kernel
(do_sea) should have aligned behavior when APEI failed to claim the
SEA. Second, if we only talk about SEA caused by memory poison
consumption, both arm64 and x86 KVM already send SIGBUS to VMM/vCPU
thread (kvm_send_hwpoison_signal) to signal hardware memory failure,
although the situation is slightly different here, where we have a
hardware event, versus a HWPoison flag check or VM_FAULT_HWPOISON
returned. But from VMM/vCPU's perspective, hardware event or software
level VM_FAULT_HWPOISON, it would be nice if it can react to just the
same event, the SIGBUS signal.

And there is another reason around your comment on arm64_notify_die.

By "exiting back to userspace with a description of the error", are
you suggesting KVM_EXIT_MEMORY_FAULT? If so, we may need to add a new
flag to tell VMM the error is hardware memory poison, which could be
KVM_MEMORY_EXIT_FLAG_USERFAULT[1] if we don't want a specific one (but
I think a specific flag for hwpoison is probably clearer).

>
> > In addition to the benifit that KVM's handling for SEA becomes aligned
> > with core kernel behavior
> > - The blast radius in VM can be limited to only the consuming thread
> >   in guest, instead of entire guest kernel, unless the consumption is
> >   from guest kernel.
> > - VMM now has the chance to do its duties to stop the VM from repeatedly
> >   consuming corrupted data. For example, VMM can unmap the guest page
> >   from stage-2 table to intercept forseen memory poison consumption,
>
> Not quite. The VMM doesn't manage stage-2. It can remove the page from
> the VMA if it has it mapped, but that's it. The kernel deals with S2.

I should probably not mention the implementation, "unmap from S2".
What is needed for preventing repeated consuming memory poison is
simply preventing guest access to certain memory pages. There is a
work in progress KVM API [1] by my colleague James.

[1] https://lpc.events/event/18/contributions/1757/attachments/1442/3073/LPC_%20KVM%20Userfault.pdf

>
> Which brings me to the next subject: when the kernel unmaps the page
> at S2, it is likely to use CMOs. Can these CMOs create RAS error
> themselves?

I assume CMO here means writing dirty cache lines to memory. Writing
something new to a poisoned cacheline usually won't cause RAS error.
Notifying memory poison usually is delayed to a memory load
transaction.

>
> >   and for every consumption injects SEA to EL1 with synthetic memory
> >   error CPER.
>
> Why do we need to involve ACPI here? I would expect the production of
> an architected error record instead. Or at least be given the option.

Sorry, I was just mentioning a specific VMM's implementation. There
are definitely multiple options (Machine Check MSRs vs CPER for error
description data, SEA vs SDEI vs SPI for notification mechanisms) for
how VMM involves the guest to handle memory error. My preference is:
VMM populates CPER in guest HEST when VMM instructs KVM to inject
i/dabt to the guest.

And a word about "be given the option": I think when VMM receives
SIGBUS with si_addr=faulted/poisoned HVA, it's got all these options,
like using the si_addr to construct CPER with poisoned guest physical
address, or mci_address MSR.

>
> > Introduce a new KVM ARM capability KVM_CAP_ARM_SIGBUS_ON_SEA. VMM
> > can opt in this new capability if it prefers SIGBUS than SError
> > injection during VM init. Now SEA handling in KVM works as follows:
> > 1. Delegate to APEI/GHES to see if SEA can be claimed by them.
> > 2. If APEI failed to claim the SEA and KVM_CAP_ARM_SIGBUS_ON_SEA is
> >    enabled for the VM, and the SEA is NOT about translation table,
> >    send SIGBUS BUS_OBJERR signal with host virtual address.
>
> And what if it is? S1 PTs are backed by userspace memory, like
> anything else. I don't think we should have a different treatment of
> those, because the HW wouldn't treat them differently either.

You are talking about ESR_ELx_FSC_SEA_TTW(1), or
ESR_ELx_FSC_SEA_TTW(0), right? I think you are right, S1 is no
difference.

But I think we want to make an exception for SEA about S2 PTs.

>
> > 3. Otherwise directly inject async SError to guest.
> >
> > Tested on a machine running Siryn AmpereOne processor. A in-house VMM
> > that opts in KVM_CAP_ARM_SIGBUS_ON_SEA started a VM. A dummy application
> > in VM allocated some memory buffer. The test used EINJ to inject an
> > uncorrectable recoverable memory error at a page in the allocated memory
> > buffer. The dummy application then consumed the memory error. Some hack
> > was done to make core kernel's memory_failure triggered by poison
> > generation to fail, so KVM had to deal with the SEA guest abort due to
> > poison consumption. vCPU thread in VMM received SIGBUS BUS_OBJERR with
> > valid host virtual address of the poisoned page. VMM then injected a SEA
> > into guest using KVM_SET_VCPU_EVENTS with ext_dabt_pending=1. At last
> > the dummy application in guest was killed by SIGBUS BUS_OBJERR, while the
> > guest survived and continued to run.
> >
> > Signed-off-by: Jiaqi Yan <jiaqiyan@google.com>
> > ---
> >  arch/arm64/include/asm/kvm_host.h |  2 +
> >  arch/arm64/include/asm/kvm_ras.h  | 20 ++++----
> >  arch/arm64/kvm/Makefile           |  2 +-
> >  arch/arm64/kvm/arm.c              |  5 ++
> >  arch/arm64/kvm/kvm_ras.c          | 77 +++++++++++++++++++++++++++++++
> >  arch/arm64/kvm/mmu.c              |  8 +---
> >  include/uapi/linux/kvm.h          |  1 +
> >  7 files changed, 98 insertions(+), 17 deletions(-)
> >  create mode 100644 arch/arm64/kvm/kvm_ras.c
> >
> > diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
> > index bf64fed9820ea..eb37a2489411a 100644
> > --- a/arch/arm64/include/asm/kvm_host.h
> > +++ b/arch/arm64/include/asm/kvm_host.h
> > @@ -334,6 +334,8 @@ struct kvm_arch {
> >       /* Fine-Grained UNDEF initialised */
> >  #define KVM_ARCH_FLAG_FGU_INITIALIZED                        8
> >       unsigned long flags;
> > +     /* Instead of injecting SError into guest, SIGBUS VMM */
> > +#define KVM_ARCH_FLAG_SIGBUS_ON_SEA                  9
>
> nit: why do you put this definition out of sequence (below 'flags')?

Ah, I will move it on top of flags.

>
> >
> >       /* VM-wide vCPU feature set */
> >       DECLARE_BITMAP(vcpu_features, KVM_VCPU_MAX_FEATURES);
> > diff --git a/arch/arm64/include/asm/kvm_ras.h b/arch/arm64/include/asm/kvm_ras.h
> > index 87e10d9a635b5..4bb7a424e3f6c 100644
> > --- a/arch/arm64/include/asm/kvm_ras.h
> > +++ b/arch/arm64/include/asm/kvm_ras.h
> > @@ -11,15 +11,17 @@
> >  #include <asm/acpi.h>
> >
> >  /*
> > - * Was this synchronous external abort a RAS notification?
> > - * Returns '0' for errors handled by some RAS subsystem, or -ENOENT.
> > + * Handle synchronous external abort (SEA) in the following order:
> > + * 1. Delegate to APEI/GHES to see if SEA can be claimed by them. If so, we
> > + *    are all done.
> > + * 2. If userspace opts in KVM_CAP_ARM_SIGBUS_ON_SEA, and if the SEA is NOT
> > + *    about translation table, send SIGBUS
> > + *    - si_code is BUS_OBJERR.
> > + *    - si_addr will be 0 when accurate HVA is unavailable.
> > + * 3. Otherwise, directly inject an async SError to guest.
> > + *
> > + * Note this applies to both ESR_ELx_EC_IABT_* and ESR_ELx_EC_DABT_*.
> >   */
> > -static inline int kvm_handle_guest_sea(phys_addr_t addr, u64 esr)
> > -{
> > -     /* apei_claim_sea(NULL) expects to mask interrupts itself */
> > -     lockdep_assert_irqs_enabled();
> > -
> > -     return apei_claim_sea(NULL);
> > -}
> > +void kvm_handle_guest_sea(struct kvm_vcpu *vcpu);
> >
> >  #endif /* __ARM64_KVM_RAS_H__ */
> > diff --git a/arch/arm64/kvm/Makefile b/arch/arm64/kvm/Makefile
> > index 3cf7adb2b5038..c4a3a6d4870e6 100644
> > --- a/arch/arm64/kvm/Makefile
> > +++ b/arch/arm64/kvm/Makefile
> > @@ -23,7 +23,7 @@ kvm-y += arm.o mmu.o mmio.o psci.o hypercalls.o pvtime.o \
> >        vgic/vgic-v3.o vgic/vgic-v4.o \
> >        vgic/vgic-mmio.o vgic/vgic-mmio-v2.o \
> >        vgic/vgic-mmio-v3.o vgic/vgic-kvm-device.o \
> > -      vgic/vgic-its.o vgic/vgic-debug.o
> > +      vgic/vgic-its.o vgic/vgic-debug.o kvm_ras.o
> >
> >  kvm-$(CONFIG_HW_PERF_EVENTS)  += pmu-emul.o pmu.o
> >  kvm-$(CONFIG_ARM64_PTR_AUTH)  += pauth.o
> > diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
> > index 48cafb65d6acf..bb97ad678dbec 100644
> > --- a/arch/arm64/kvm/arm.c
> > +++ b/arch/arm64/kvm/arm.c
> > @@ -151,6 +151,10 @@ int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
> >               }
> >               mutex_unlock(&kvm->slots_lock);
> >               break;
> > +     case KVM_CAP_ARM_SIGBUS_ON_SEA:
> > +             r = 0;
> > +             set_bit(KVM_ARCH_FLAG_SIGBUS_ON_SEA, &kvm->arch.flags);
>
> Shouldn't this be somehow gated on the VM being RAS aware?

Do you mean a CAP that VMM can tell KVM the VM guest has RAS ability?
I don't know if there is one for arm64. On x86 there is
KVM_X86_SETUP_MCE. KVM_CAP_ARM_INJECT_EXT_DABT maybe a revelant one
but I don't think it is exactly the one for "RAS ability".

>
> > +             break;
> >       default:
> >               break;
> >       }
> > @@ -339,6 +343,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
> >       case KVM_CAP_ARM_SYSTEM_SUSPEND:
> >       case KVM_CAP_IRQFD_RESAMPLE:
> >       case KVM_CAP_COUNTER_OFFSET:
> > +     case KVM_CAP_ARM_SIGBUS_ON_SEA:
> >               r = 1;
> >               break;
> >       case KVM_CAP_SET_GUEST_DEBUG2:
> > diff --git a/arch/arm64/kvm/kvm_ras.c b/arch/arm64/kvm/kvm_ras.c
> > new file mode 100644
> > index 0000000000000..3225462bcbcda
> > --- /dev/null
> > +++ b/arch/arm64/kvm/kvm_ras.c
> > @@ -0,0 +1,77 @@
> > +// SPDX-License-Identifier: GPL-2.0-only
> > +
> > +#include <linux/bitops.h>
> > +#include <linux/kvm_host.h>
> > +
> > +#include <asm/kvm_emulate.h>
> > +#include <asm/kvm_ras.h>
> > +#include <asm/system_misc.h>
> > +
> > +/*
> > + * For synchrnous external instruction or data abort, not on translation
> > + * table walk or hardware update of translation table, is FAR_EL2 valid?
> > + */
> > +static inline bool kvm_vcpu_sea_far_valid(const struct kvm_vcpu *vcpu)
> > +{
> > +     return !(vcpu->arch.fault.esr_el2 & ESR_ELx_FnV);
> > +}
> > +
> > +/*
> > + * Was this synchronous external abort a RAS notification?
> > + * Returns '0' for errors handled by some RAS subsystem, or -ENOENT.
> > + */
> > +static int kvm_delegate_guest_sea(phys_addr_t addr, u64 esr)
> > +{
> > +     /* apei_claim_sea(NULL) expects to mask interrupts itself */
> > +     lockdep_assert_irqs_enabled();
> > +     return apei_claim_sea(NULL);
> > +}
> > +
> > +void kvm_handle_guest_sea(struct kvm_vcpu *vcpu)
> > +{
> > +     bool sigbus_on_sea;
> > +     int idx;
> > +     u64 vcpu_esr = kvm_vcpu_get_esr(vcpu);
> > +     u8 fsc = kvm_vcpu_trap_get_fault(vcpu);
> > +     phys_addr_t fault_ipa = kvm_vcpu_get_fault_ipa(vcpu);
> > +     gfn_t gfn = fault_ipa >> PAGE_SHIFT;
> > +     /* When FnV is set, send 0 as si_addr like what do_sea() does. */
> > +     unsigned long hva = 0UL;
> > +
> > +     /*
> > +      * For RAS the host kernel may handle this abort.
> > +      * There is no need to SIGBUS VMM, or pass the error into the guest.
> > +      */
> > +     if (kvm_delegate_guest_sea(fault_ipa, vcpu_esr) == 0)
> > +             return;
> > +
> > +     sigbus_on_sea = test_bit(KVM_ARCH_FLAG_SIGBUS_ON_SEA,
> > +                              &(vcpu->kvm->arch.flags));
> > +
> > +     /*
> > +      * In addition to userspace opt-in, SIGBUS only makes sense if the
> > +      * abort is NOT about translation table walk and NOT about hardware
> > +      * update of translation table.
> > +      */
> > +     sigbus_on_sea &= (fsc == ESR_ELx_FSC_EXTABT || fsc == ESR_ELx_FSC_SECC);
> > +
> > +     /* Pass the error directly into the guest. */
> > +     if (!sigbus_on_sea) {
> > +             kvm_inject_vabt(vcpu);
> > +             return;
> > +     }
> > +
> > +     if (kvm_vcpu_sea_far_valid(vcpu)) {
> > +             idx = srcu_read_lock(&vcpu->kvm->srcu);
> > +             hva = gfn_to_hva(vcpu->kvm, gfn);
> > +             srcu_read_unlock(&vcpu->kvm->srcu, idx);
> > +     }
> > +
> > +     /*
> > +      * Send a SIGBUS BUS_OBJERR to vCPU thread (the userspace thread that
> > +      * runs KVM_RUN) or VMM, which aligns with what host kernel do_sea()
> > +      * does if apei_claim_sea() fails.
> > +      */
> > +     arm64_notify_die("synchronous external abort",
> > +                      current_pt_regs(), SIGBUS, BUS_OBJERR, hva, vcpu_esr);
>
> This is the point where I really think we should simply trigger an
> exit with all that syndrome information stashed in kvm_run, like any
> other event requiring userspace help.

Ah, there is another reason SIGBUS is better than kvm exit: "It is a
programming error to set ext_dabt_pending after an exit which was not
either KVM_EXIT_MMIO or KVM_EXIT_ARM_NISV", from
Documentation/virt/kvm/api.rst. So if VMM is allowed to inject data
abort to guest, at least current documentation doesn't suggest kvm
exit is feasible.

>
> Also: where is the documentation?

Once I get more positive feedback and send out PATCH instead of RFC, I
can add to Documentation/virt/kvm/api.rst.

>
> Thanks,
>
>         M.
>
> --
> Without deviation from the norm, progress is not possible.
Jiaqi Yan Nov. 8, 2024, 4:54 a.m. UTC | #4
Hi Oliver,

Sorry for getting back to you late...

On Thu, Oct 31, 2024 at 2:46 PM Oliver Upton <oliver.upton@linux.dev> wrote:
>
> Hi Jiaqi,
>
> Thank you for sending this out.
>
> On Thu, Oct 31, 2024 at 09:21:04PM +0000, Jiaqi Yan wrote:
> > Currently KVM handles SEA in guest by injecting async SError into
> > guest directly, bypassing VMM, usually results in guest kernel panic.
> >
> > One major situation of guest SEA is when vCPU consumes uncorrectable
> > memory error on the physical memory. Although SError and guest kernel
> > panic effectively stops the propagation of corrupted memory, it is not
> > easy for VMM and guest to recover from memory error in a more graceful
> > manner.
> >
> > Alternatively KVM can send a SIGBUS BUS_OBJERR to VMM/vCPU, just like
> > how core kernel signals SIGBUS BUS_OBJERR to the poison consuming
> > thread.
> > In addition to the benifit that KVM's handling for SEA becomes aligned
> > with core kernel behavior
> > - The blast radius in VM can be limited to only the consuming thread
> >   in guest, instead of entire guest kernel, unless the consumption is
> >   from guest kernel.
> > - VMM now has the chance to do its duties to stop the VM from repeatedly
> >   consuming corrupted data. For example, VMM can unmap the guest page
> >   from stage-2 table to intercept forseen memory poison consumption,
> >   and for every consumption injects SEA to EL1 with synthetic memory
> >   error CPER.
> >
> > Introduce a new KVM ARM capability KVM_CAP_ARM_SIGBUS_ON_SEA. VMM
> > can opt in this new capability if it prefers SIGBUS than SError
> > injection during VM init. Now SEA handling in KVM works as follows:
>
> I'm somewhat tempted to force the new behavior on userspace
> unconditionally. Working back from an unexpected SError in the VM to the
> KVM SEA handler is a bit of a mess, and can be annoying if the operator
> can't access console logs of the VM.

Ack, I also think involving VMM is preferable than injecting SError
directly to guest.

>
> As it stands today, UAPI expectations around SEAs are platform
> dependent. If APEI claims the SEA and decides to offline a page, the
> user will get a SIGBUS.
>
> So sending a SIGBUS for the case that firmware _doesn't_ claim the SEA
> seems like a good move from a consistency PoV. But it is a decently-sized
> change to do without explicit buy-in from userspace so let's see what
> others think.

Sounds good, I will wait for a couple of more days, and if the opt-in
part isn't necessary to anyone else, I will remove it from the next
revision.

>
> > 1. Delegate to APEI/GHES to see if SEA can be claimed by them.
> > 2. If APEI failed to claim the SEA and KVM_CAP_ARM_SIGBUS_ON_SEA is
> >    enabled for the VM, and the SEA is NOT about translation table,
> >    send SIGBUS BUS_OBJERR signal with host virtual address.
> > 3. Otherwise directly inject async SError to guest.
>
> The other reason I'm a bit lukewarm on user buy in is the UAPI suffers
> from the same issue we do today: it depends on the platform. If the SEA
> is claimed by APEI/GHES then the cap does nothing.

Good point, yeah, the path of KVM handling SEA and sending SIGBUS
should be treated as fallback code to APEI/GHES.

>
> > +static int kvm_delegate_guest_sea(phys_addr_t addr, u64 esr)
> > +{
> > +     /* apei_claim_sea(NULL) expects to mask interrupts itself */
> > +     lockdep_assert_irqs_enabled();
> > +     return apei_claim_sea(NULL);
> > +}
>
> Consider dropping parameters from this since they're unused.

Ack, will do in the next revision.

>
> > +void kvm_handle_guest_sea(struct kvm_vcpu *vcpu)
> > +{
> > +     bool sigbus_on_sea;
> > +     int idx;
> > +     u64 vcpu_esr = kvm_vcpu_get_esr(vcpu);
> > +     u8 fsc = kvm_vcpu_trap_get_fault(vcpu);
> > +     phys_addr_t fault_ipa = kvm_vcpu_get_fault_ipa(vcpu);
> > +     gfn_t gfn = fault_ipa >> PAGE_SHIFT;
> > +     /* When FnV is set, send 0 as si_addr like what do_sea() does. */
> > +     unsigned long hva = 0UL;
> > +
> > +     /*
> > +      * For RAS the host kernel may handle this abort.
> > +      * There is no need to SIGBUS VMM, or pass the error into the guest.
> > +      */
> > +     if (kvm_delegate_guest_sea(fault_ipa, vcpu_esr) == 0)
> > +             return;
> > +
> > +     sigbus_on_sea = test_bit(KVM_ARCH_FLAG_SIGBUS_ON_SEA,
> > +                              &(vcpu->kvm->arch.flags));
> > +
> > +     /*
> > +      * In addition to userspace opt-in, SIGBUS only makes sense if the
> > +      * abort is NOT about translation table walk and NOT about hardware
> > +      * update of translation table.
> > +      */
> > +     sigbus_on_sea &= (fsc == ESR_ELx_FSC_EXTABT || fsc == ESR_ELx_FSC_SECC);
>
> Is this because we potentially can't determine a valid HVA for the
> fault? Maybe these should go out to userspace still with si_addr = 0.

No, it is not related to the availability of a valid HVA. In this
patch, as long as it decides to sigbus_on_sea, si_addr can be 0 if a
valid HVA isn't available when !kvm_vcpu_sea_far_valid (OR when
HPFAR_EL2 cannot be translated from a valid FAR_EL2, which I think
requires some improvement).

The code here wants to limit SIGBUS _BUS_OBJERR_ to only SEA and
parity+ECC error, similar to the code here (for SEA)
https://elixir.bootlin.com/linux/v6.11.6/source/arch/arm64/mm/fault.c#L771
and here (for synchronous parity or ECC error)
https://elixir.bootlin.com/linux/v6.11.6/source/arch/arm64/mm/fault.c#L779.

>
> --
> Thanks,
> Oliver
Jiaqi Yan Nov. 8, 2024, 9:18 p.m. UTC | #5
On Sun, Nov 3, 2024 at 9:02 PM Jiaqi Yan <jiaqiyan@google.com> wrote:
>
> Hi Marc, thanks for your quick response!
>
> On Fri, Nov 1, 2024 at 6:54 AM Marc Zyngier <maz@kernel.org> wrote:
> >
> > On Thu, 31 Oct 2024 21:21:04 +0000,
> > Jiaqi Yan <jiaqiyan@google.com> wrote:
> > >
> > > Currently KVM handles SEA in guest by injecting async SError into
> > > guest directly, bypassing VMM, usually results in guest kernel panic.
> > >
> > > One major situation of guest SEA is when vCPU consumes uncorrectable
> > > memory error on the physical memory. Although SError and guest kernel
> > > panic effectively stops the propagation of corrupted memory, it is not
> > > easy for VMM and guest to recover from memory error in a more graceful
> > > manner.
> > >
> > > Alternatively KVM can send a SIGBUS BUS_OBJERR to VMM/vCPU, just like
> > > how core kernel signals SIGBUS BUS_OBJERR to the poison consuming
> > > thread.
> >
> > Can you elaborate on why the delivery of a signal is preferable to
> > simply exiting back to userspace with a description of the error?
> > Signals are usually not generated by KVM, and are a pretty twisted way
> > to generate an exit.
>
> A couple of reasons. First, my intuition is that KVM and core kernel
> (do_sea) should have aligned behavior when APEI failed to claim the
> SEA. Second, if we only talk about SEA caused by memory poison
> consumption, both arm64 and x86 KVM already send SIGBUS to VMM/vCPU
> thread (kvm_send_hwpoison_signal) to signal hardware memory failure,
> although the situation is slightly different here, where we have a
> hardware event, versus a HWPoison flag check or VM_FAULT_HWPOISON
> returned. But from VMM/vCPU's perspective, hardware event or software
> level VM_FAULT_HWPOISON, it would be nice if it can react to just the
> same event, the SIGBUS signal.
>
> And there is another reason around your comment on arm64_notify_die.
>
> By "exiting back to userspace with a description of the error", are
> you suggesting KVM_EXIT_MEMORY_FAULT? If so, we may need to add a new
> flag to tell VMM the error is hardware memory poison, which could be
> KVM_MEMORY_EXIT_FLAG_USERFAULT[1] if we don't want a specific one (but
> I think a specific flag for hwpoison is probably clearer).
>
> >
> > > In addition to the benifit that KVM's handling for SEA becomes aligned
> > > with core kernel behavior
> > > - The blast radius in VM can be limited to only the consuming thread
> > >   in guest, instead of entire guest kernel, unless the consumption is
> > >   from guest kernel.
> > > - VMM now has the chance to do its duties to stop the VM from repeatedly
> > >   consuming corrupted data. For example, VMM can unmap the guest page
> > >   from stage-2 table to intercept forseen memory poison consumption,
> >
> > Not quite. The VMM doesn't manage stage-2. It can remove the page from
> > the VMA if it has it mapped, but that's it. The kernel deals with S2.
>
> I should probably not mention the implementation, "unmap from S2".
> What is needed for preventing repeated consuming memory poison is
> simply preventing guest access to certain memory pages. There is a
> work in progress KVM API [1] by my colleague James.
>
> [1] https://lpc.events/event/18/contributions/1757/attachments/1442/3073/LPC_%20KVM%20Userfault.pdf
>
> >
> > Which brings me to the next subject: when the kernel unmaps the page
> > at S2, it is likely to use CMOs. Can these CMOs create RAS error
> > themselves?
>
> I assume CMO here means writing dirty cache lines to memory. Writing
> something new to a poisoned cacheline usually won't cause RAS error.
> Notifying memory poison usually is delayed to a memory load
> transaction.
>
> >
> > >   and for every consumption injects SEA to EL1 with synthetic memory
> > >   error CPER.
> >
> > Why do we need to involve ACPI here? I would expect the production of
> > an architected error record instead. Or at least be given the option.
>
> Sorry, I was just mentioning a specific VMM's implementation. There
> are definitely multiple options (Machine Check MSRs vs CPER for error
> description data, SEA vs SDEI vs SPI for notification mechanisms) for
> how VMM involves the guest to handle memory error. My preference is:
> VMM populates CPER in guest HEST when VMM instructs KVM to inject
> i/dabt to the guest.
>
> And a word about "be given the option": I think when VMM receives
> SIGBUS with si_addr=faulted/poisoned HVA, it's got all these options,
> like using the si_addr to construct CPER with poisoned guest physical
> address, or mci_address MSR.
>
> >
> > > Introduce a new KVM ARM capability KVM_CAP_ARM_SIGBUS_ON_SEA. VMM
> > > can opt in this new capability if it prefers SIGBUS than SError
> > > injection during VM init. Now SEA handling in KVM works as follows:
> > > 1. Delegate to APEI/GHES to see if SEA can be claimed by them.
> > > 2. If APEI failed to claim the SEA and KVM_CAP_ARM_SIGBUS_ON_SEA is
> > >    enabled for the VM, and the SEA is NOT about translation table,
> > >    send SIGBUS BUS_OBJERR signal with host virtual address.
> >
> > And what if it is? S1 PTs are backed by userspace memory, like
> > anything else. I don't think we should have a different treatment of
> > those, because the HW wouldn't treat them differently either.
>
> You are talking about ESR_ELx_FSC_SEA_TTW(1), or
> ESR_ELx_FSC_SEA_TTW(0), right? I think you are right, S1 is no
> difference.
>
> But I think we want to make an exception for SEA about S2 PTs.
>
> >
> > > 3. Otherwise directly inject async SError to guest.
> > >
> > > Tested on a machine running Siryn AmpereOne processor. A in-house VMM
> > > that opts in KVM_CAP_ARM_SIGBUS_ON_SEA started a VM. A dummy application
> > > in VM allocated some memory buffer. The test used EINJ to inject an
> > > uncorrectable recoverable memory error at a page in the allocated memory
> > > buffer. The dummy application then consumed the memory error. Some hack
> > > was done to make core kernel's memory_failure triggered by poison
> > > generation to fail, so KVM had to deal with the SEA guest abort due to
> > > poison consumption. vCPU thread in VMM received SIGBUS BUS_OBJERR with
> > > valid host virtual address of the poisoned page. VMM then injected a SEA
> > > into guest using KVM_SET_VCPU_EVENTS with ext_dabt_pending=1. At last
> > > the dummy application in guest was killed by SIGBUS BUS_OBJERR, while the
> > > guest survived and continued to run.
> > >
> > > Signed-off-by: Jiaqi Yan <jiaqiyan@google.com>
> > > ---
> > >  arch/arm64/include/asm/kvm_host.h |  2 +
> > >  arch/arm64/include/asm/kvm_ras.h  | 20 ++++----
> > >  arch/arm64/kvm/Makefile           |  2 +-
> > >  arch/arm64/kvm/arm.c              |  5 ++
> > >  arch/arm64/kvm/kvm_ras.c          | 77 +++++++++++++++++++++++++++++++
> > >  arch/arm64/kvm/mmu.c              |  8 +---
> > >  include/uapi/linux/kvm.h          |  1 +
> > >  7 files changed, 98 insertions(+), 17 deletions(-)
> > >  create mode 100644 arch/arm64/kvm/kvm_ras.c
> > >
> > > diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
> > > index bf64fed9820ea..eb37a2489411a 100644
> > > --- a/arch/arm64/include/asm/kvm_host.h
> > > +++ b/arch/arm64/include/asm/kvm_host.h
> > > @@ -334,6 +334,8 @@ struct kvm_arch {
> > >       /* Fine-Grained UNDEF initialised */
> > >  #define KVM_ARCH_FLAG_FGU_INITIALIZED                        8
> > >       unsigned long flags;
> > > +     /* Instead of injecting SError into guest, SIGBUS VMM */
> > > +#define KVM_ARCH_FLAG_SIGBUS_ON_SEA                  9
> >
> > nit: why do you put this definition out of sequence (below 'flags')?
>
> Ah, I will move it on top of flags.
>
> >
> > >
> > >       /* VM-wide vCPU feature set */
> > >       DECLARE_BITMAP(vcpu_features, KVM_VCPU_MAX_FEATURES);
> > > diff --git a/arch/arm64/include/asm/kvm_ras.h b/arch/arm64/include/asm/kvm_ras.h
> > > index 87e10d9a635b5..4bb7a424e3f6c 100644
> > > --- a/arch/arm64/include/asm/kvm_ras.h
> > > +++ b/arch/arm64/include/asm/kvm_ras.h
> > > @@ -11,15 +11,17 @@
> > >  #include <asm/acpi.h>
> > >
> > >  /*
> > > - * Was this synchronous external abort a RAS notification?
> > > - * Returns '0' for errors handled by some RAS subsystem, or -ENOENT.
> > > + * Handle synchronous external abort (SEA) in the following order:
> > > + * 1. Delegate to APEI/GHES to see if SEA can be claimed by them. If so, we
> > > + *    are all done.
> > > + * 2. If userspace opts in KVM_CAP_ARM_SIGBUS_ON_SEA, and if the SEA is NOT
> > > + *    about translation table, send SIGBUS
> > > + *    - si_code is BUS_OBJERR.
> > > + *    - si_addr will be 0 when accurate HVA is unavailable.
> > > + * 3. Otherwise, directly inject an async SError to guest.
> > > + *
> > > + * Note this applies to both ESR_ELx_EC_IABT_* and ESR_ELx_EC_DABT_*.
> > >   */
> > > -static inline int kvm_handle_guest_sea(phys_addr_t addr, u64 esr)
> > > -{
> > > -     /* apei_claim_sea(NULL) expects to mask interrupts itself */
> > > -     lockdep_assert_irqs_enabled();
> > > -
> > > -     return apei_claim_sea(NULL);
> > > -}
> > > +void kvm_handle_guest_sea(struct kvm_vcpu *vcpu);
> > >
> > >  #endif /* __ARM64_KVM_RAS_H__ */
> > > diff --git a/arch/arm64/kvm/Makefile b/arch/arm64/kvm/Makefile
> > > index 3cf7adb2b5038..c4a3a6d4870e6 100644
> > > --- a/arch/arm64/kvm/Makefile
> > > +++ b/arch/arm64/kvm/Makefile
> > > @@ -23,7 +23,7 @@ kvm-y += arm.o mmu.o mmio.o psci.o hypercalls.o pvtime.o \
> > >        vgic/vgic-v3.o vgic/vgic-v4.o \
> > >        vgic/vgic-mmio.o vgic/vgic-mmio-v2.o \
> > >        vgic/vgic-mmio-v3.o vgic/vgic-kvm-device.o \
> > > -      vgic/vgic-its.o vgic/vgic-debug.o
> > > +      vgic/vgic-its.o vgic/vgic-debug.o kvm_ras.o
> > >
> > >  kvm-$(CONFIG_HW_PERF_EVENTS)  += pmu-emul.o pmu.o
> > >  kvm-$(CONFIG_ARM64_PTR_AUTH)  += pauth.o
> > > diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
> > > index 48cafb65d6acf..bb97ad678dbec 100644
> > > --- a/arch/arm64/kvm/arm.c
> > > +++ b/arch/arm64/kvm/arm.c
> > > @@ -151,6 +151,10 @@ int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
> > >               }
> > >               mutex_unlock(&kvm->slots_lock);
> > >               break;
> > > +     case KVM_CAP_ARM_SIGBUS_ON_SEA:
> > > +             r = 0;
> > > +             set_bit(KVM_ARCH_FLAG_SIGBUS_ON_SEA, &kvm->arch.flags);
> >
> > Shouldn't this be somehow gated on the VM being RAS aware?
>
> Do you mean a CAP that VMM can tell KVM the VM guest has RAS ability?
> I don't know if there is one for arm64. On x86 there is
> KVM_X86_SETUP_MCE. KVM_CAP_ARM_INJECT_EXT_DABT maybe a revelant one
> but I don't think it is exactly the one for "RAS ability".

Hi Marc,

There is also a kernel config called ARM64_RAS_EXTN but I believe it
is for the host CPU, not about VM's RAS.

But taking Oliver's opinion into account, I think we want to remove
KVM_CAP_ARM_SIGBUS_ON_SEA. Wonder what your thoughts are.

Thanks,
Jiaqi

>
> >
> > > +             break;
> > >       default:
> > >               break;
> > >       }
> > > @@ -339,6 +343,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
> > >       case KVM_CAP_ARM_SYSTEM_SUSPEND:
> > >       case KVM_CAP_IRQFD_RESAMPLE:
> > >       case KVM_CAP_COUNTER_OFFSET:
> > > +     case KVM_CAP_ARM_SIGBUS_ON_SEA:
> > >               r = 1;
> > >               break;
> > >       case KVM_CAP_SET_GUEST_DEBUG2:
> > > diff --git a/arch/arm64/kvm/kvm_ras.c b/arch/arm64/kvm/kvm_ras.c
> > > new file mode 100644
> > > index 0000000000000..3225462bcbcda
> > > --- /dev/null
> > > +++ b/arch/arm64/kvm/kvm_ras.c
> > > @@ -0,0 +1,77 @@
> > > +// SPDX-License-Identifier: GPL-2.0-only
> > > +
> > > +#include <linux/bitops.h>
> > > +#include <linux/kvm_host.h>
> > > +
> > > +#include <asm/kvm_emulate.h>
> > > +#include <asm/kvm_ras.h>
> > > +#include <asm/system_misc.h>
> > > +
> > > +/*
> > > + * For synchrnous external instruction or data abort, not on translation
> > > + * table walk or hardware update of translation table, is FAR_EL2 valid?
> > > + */
> > > +static inline bool kvm_vcpu_sea_far_valid(const struct kvm_vcpu *vcpu)
> > > +{
> > > +     return !(vcpu->arch.fault.esr_el2 & ESR_ELx_FnV);
> > > +}
> > > +
> > > +/*
> > > + * Was this synchronous external abort a RAS notification?
> > > + * Returns '0' for errors handled by some RAS subsystem, or -ENOENT.
> > > + */
> > > +static int kvm_delegate_guest_sea(phys_addr_t addr, u64 esr)
> > > +{
> > > +     /* apei_claim_sea(NULL) expects to mask interrupts itself */
> > > +     lockdep_assert_irqs_enabled();
> > > +     return apei_claim_sea(NULL);
> > > +}
> > > +
> > > +void kvm_handle_guest_sea(struct kvm_vcpu *vcpu)
> > > +{
> > > +     bool sigbus_on_sea;
> > > +     int idx;
> > > +     u64 vcpu_esr = kvm_vcpu_get_esr(vcpu);
> > > +     u8 fsc = kvm_vcpu_trap_get_fault(vcpu);
> > > +     phys_addr_t fault_ipa = kvm_vcpu_get_fault_ipa(vcpu);
> > > +     gfn_t gfn = fault_ipa >> PAGE_SHIFT;
> > > +     /* When FnV is set, send 0 as si_addr like what do_sea() does. */
> > > +     unsigned long hva = 0UL;
> > > +
> > > +     /*
> > > +      * For RAS the host kernel may handle this abort.
> > > +      * There is no need to SIGBUS VMM, or pass the error into the guest.
> > > +      */
> > > +     if (kvm_delegate_guest_sea(fault_ipa, vcpu_esr) == 0)
> > > +             return;
> > > +
> > > +     sigbus_on_sea = test_bit(KVM_ARCH_FLAG_SIGBUS_ON_SEA,
> > > +                              &(vcpu->kvm->arch.flags));
> > > +
> > > +     /*
> > > +      * In addition to userspace opt-in, SIGBUS only makes sense if the
> > > +      * abort is NOT about translation table walk and NOT about hardware
> > > +      * update of translation table.
> > > +      */
> > > +     sigbus_on_sea &= (fsc == ESR_ELx_FSC_EXTABT || fsc == ESR_ELx_FSC_SECC);
> > > +
> > > +     /* Pass the error directly into the guest. */
> > > +     if (!sigbus_on_sea) {
> > > +             kvm_inject_vabt(vcpu);
> > > +             return;
> > > +     }
> > > +
> > > +     if (kvm_vcpu_sea_far_valid(vcpu)) {
> > > +             idx = srcu_read_lock(&vcpu->kvm->srcu);
> > > +             hva = gfn_to_hva(vcpu->kvm, gfn);
> > > +             srcu_read_unlock(&vcpu->kvm->srcu, idx);
> > > +     }
> > > +
> > > +     /*
> > > +      * Send a SIGBUS BUS_OBJERR to vCPU thread (the userspace thread that
> > > +      * runs KVM_RUN) or VMM, which aligns with what host kernel do_sea()
> > > +      * does if apei_claim_sea() fails.
> > > +      */
> > > +     arm64_notify_die("synchronous external abort",
> > > +                      current_pt_regs(), SIGBUS, BUS_OBJERR, hva, vcpu_esr);
> >
> > This is the point where I really think we should simply trigger an
> > exit with all that syndrome information stashed in kvm_run, like any
> > other event requiring userspace help.
>
> Ah, there is another reason SIGBUS is better than kvm exit: "It is a
> programming error to set ext_dabt_pending after an exit which was not
> either KVM_EXIT_MMIO or KVM_EXIT_ARM_NISV", from
> Documentation/virt/kvm/api.rst. So if VMM is allowed to inject data
> abort to guest, at least current documentation doesn't suggest kvm
> exit is feasible.
>
> >
> > Also: where is the documentation?
>
> Once I get more positive feedback and send out PATCH instead of RFC, I
> can add to Documentation/virt/kvm/api.rst.
>
> >
> > Thanks,
> >
> >         M.
> >
> > --
> > Without deviation from the norm, progress is not possible.
Marc Zyngier Nov. 12, 2024, 8:51 a.m. UTC | #6
On Mon, 04 Nov 2024 05:02:03 +0000,
Jiaqi Yan <jiaqiyan@google.com> wrote:
> 
> Hi Marc, thanks for your quick response!
> 
> On Fri, Nov 1, 2024 at 6:54 AM Marc Zyngier <maz@kernel.org> wrote:
> >
> > On Thu, 31 Oct 2024 21:21:04 +0000,
> > Jiaqi Yan <jiaqiyan@google.com> wrote:
> > >
> > > Currently KVM handles SEA in guest by injecting async SError into
> > > guest directly, bypassing VMM, usually results in guest kernel panic.
> > >
> > > One major situation of guest SEA is when vCPU consumes uncorrectable
> > > memory error on the physical memory. Although SError and guest kernel
> > > panic effectively stops the propagation of corrupted memory, it is not
> > > easy for VMM and guest to recover from memory error in a more graceful
> > > manner.
> > >
> > > Alternatively KVM can send a SIGBUS BUS_OBJERR to VMM/vCPU, just like
> > > how core kernel signals SIGBUS BUS_OBJERR to the poison consuming
> > > thread.
> >
> > Can you elaborate on why the delivery of a signal is preferable to
> > simply exiting back to userspace with a description of the error?
> > Signals are usually not generated by KVM, and are a pretty twisted way
> > to generate an exit.
> 
> A couple of reasons. First, my intuition is that KVM and core kernel
> (do_sea) should have aligned behavior when APEI failed to claim the
> SEA. Second, if we only talk about SEA caused by memory poison
> consumption, both arm64 and x86 KVM already send SIGBUS to VMM/vCPU
> thread (kvm_send_hwpoison_signal) to signal hardware memory failure,
> although the situation is slightly different here, where we have a
> hardware event, versus a HWPoison flag check or VM_FAULT_HWPOISON
> returned. But from VMM/vCPU's perspective, hardware event or software
> level VM_FAULT_HWPOISON, it would be nice if it can react to just the
> same event, the SIGBUS signal.
> 
> And there is another reason around your comment on arm64_notify_die.
> 
> By "exiting back to userspace with a description of the error", are
> you suggesting KVM_EXIT_MEMORY_FAULT? If so, we may need to add a new
> flag to tell VMM the error is hardware memory poison, which could be
> KVM_MEMORY_EXIT_FLAG_USERFAULT[1] if we don't want a specific one (but
> I think a specific flag for hwpoison is probably clearer).

KVM_MEMORY_EXIT_FLAG_USERFAULT is not upstream, and it isn't clear to
me if or when it will make it there. But the idea is indeed to treat a
RAS error as an exit reason.

> 
> >
> > > In addition to the benifit that KVM's handling for SEA becomes aligned
> > > with core kernel behavior
> > > - The blast radius in VM can be limited to only the consuming thread
> > >   in guest, instead of entire guest kernel, unless the consumption is
> > >   from guest kernel.
> > > - VMM now has the chance to do its duties to stop the VM from repeatedly
> > >   consuming corrupted data. For example, VMM can unmap the guest page
> > >   from stage-2 table to intercept forseen memory poison consumption,
> >
> > Not quite. The VMM doesn't manage stage-2. It can remove the page from
> > the VMA if it has it mapped, but that's it. The kernel deals with S2.
> 
> I should probably not mention the implementation, "unmap from S2".
> What is needed for preventing repeated consuming memory poison is
> simply preventing guest access to certain memory pages. There is a
> work in progress KVM API [1] by my colleague James.
> 
> [1] https://lpc.events/event/18/contributions/1757/attachments/1442/3073/LPC_%20KVM%20Userfault.pdf
> 
> >
> > Which brings me to the next subject: when the kernel unmaps the page
> > at S2, it is likely to use CMOs. Can these CMOs create RAS error
> > themselves?
> 
> I assume CMO here means writing dirty cache lines to memory. Writing
> something new to a poisoned cacheline usually won't cause RAS error.
> Notifying memory poison usually is delayed to a memory load
> transaction.

I don't see anything of the sort in the spec. At least R_VGXBJ calls
out that:

<quote>
An error is propagated by the PE by one or more of the following
occurring that would not have been permitted to occur had the fault
not been activated:

* Consumption of the corrupt value by any instruction, propagating the
error to the target(s) of the instruction.  This includes:

  - A store of a corrupt value.
[...]
</quote>

So while implementations /may/ only act and deliver an error on a
load, that's only an implementation detail.

> 
> >
> > >   and for every consumption injects SEA to EL1 with synthetic memory
> > >   error CPER.
> >
> > Why do we need to involve ACPI here? I would expect the production of
> > an architected error record instead. Or at least be given the option.
> 
> Sorry, I was just mentioning a specific VMM's implementation. There
> are definitely multiple options (Machine Check MSRs vs CPER for error
> description data, SEA vs SDEI vs SPI for notification mechanisms) for
> how VMM involves the guest to handle memory error. My preference is:
> VMM populates CPER in guest HEST when VMM instructs KVM to inject
> i/dabt to the guest.
> 
> And a word about "be given the option": I think when VMM receives
> SIGBUS with si_addr=faulted/poisoned HVA, it's got all these options,
> like using the si_addr to construct CPER with poisoned guest physical
> address, or mci_address MSR.
> 
> >
> > > Introduce a new KVM ARM capability KVM_CAP_ARM_SIGBUS_ON_SEA. VMM
> > > can opt in this new capability if it prefers SIGBUS than SError
> > > injection during VM init. Now SEA handling in KVM works as follows:
> > > 1. Delegate to APEI/GHES to see if SEA can be claimed by them.
> > > 2. If APEI failed to claim the SEA and KVM_CAP_ARM_SIGBUS_ON_SEA is
> > >    enabled for the VM, and the SEA is NOT about translation table,
> > >    send SIGBUS BUS_OBJERR signal with host virtual address.
> >
> > And what if it is? S1 PTs are backed by userspace memory, like
> > anything else. I don't think we should have a different treatment of
> > those, because the HW wouldn't treat them differently either.
> 
> You are talking about ESR_ELx_FSC_SEA_TTW(1), or
> ESR_ELx_FSC_SEA_TTW(0), right? I think you are right, S1 is no
> difference.
> 
> But I think we want to make an exception for SEA about S2 PTs.

S2 PTs are owned by the hypervisor, not by userspace, so I'm fine not
reporting those through this mechanism.

> 
> >
> > > 3. Otherwise directly inject async SError to guest.
> > >
> > > Tested on a machine running Siryn AmpereOne processor. A in-house VMM
> > > that opts in KVM_CAP_ARM_SIGBUS_ON_SEA started a VM. A dummy application
> > > in VM allocated some memory buffer. The test used EINJ to inject an
> > > uncorrectable recoverable memory error at a page in the allocated memory
> > > buffer. The dummy application then consumed the memory error. Some hack
> > > was done to make core kernel's memory_failure triggered by poison
> > > generation to fail, so KVM had to deal with the SEA guest abort due to
> > > poison consumption. vCPU thread in VMM received SIGBUS BUS_OBJERR with
> > > valid host virtual address of the poisoned page. VMM then injected a SEA
> > > into guest using KVM_SET_VCPU_EVENTS with ext_dabt_pending=1. At last
> > > the dummy application in guest was killed by SIGBUS BUS_OBJERR, while the
> > > guest survived and continued to run.
> > >
> > > Signed-off-by: Jiaqi Yan <jiaqiyan@google.com>
> > > ---
> > >  arch/arm64/include/asm/kvm_host.h |  2 +
> > >  arch/arm64/include/asm/kvm_ras.h  | 20 ++++----
> > >  arch/arm64/kvm/Makefile           |  2 +-
> > >  arch/arm64/kvm/arm.c              |  5 ++
> > >  arch/arm64/kvm/kvm_ras.c          | 77 +++++++++++++++++++++++++++++++
> > >  arch/arm64/kvm/mmu.c              |  8 +---
> > >  include/uapi/linux/kvm.h          |  1 +
> > >  7 files changed, 98 insertions(+), 17 deletions(-)
> > >  create mode 100644 arch/arm64/kvm/kvm_ras.c
> > >
> > > diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
> > > index bf64fed9820ea..eb37a2489411a 100644
> > > --- a/arch/arm64/include/asm/kvm_host.h
> > > +++ b/arch/arm64/include/asm/kvm_host.h
> > > @@ -334,6 +334,8 @@ struct kvm_arch {
> > >       /* Fine-Grained UNDEF initialised */
> > >  #define KVM_ARCH_FLAG_FGU_INITIALIZED                        8
> > >       unsigned long flags;
> > > +     /* Instead of injecting SError into guest, SIGBUS VMM */
> > > +#define KVM_ARCH_FLAG_SIGBUS_ON_SEA                  9
> >
> > nit: why do you put this definition out of sequence (below 'flags')?
> 
> Ah, I will move it on top of flags.
> 
> >
> > >
> > >       /* VM-wide vCPU feature set */
> > >       DECLARE_BITMAP(vcpu_features, KVM_VCPU_MAX_FEATURES);
> > > diff --git a/arch/arm64/include/asm/kvm_ras.h b/arch/arm64/include/asm/kvm_ras.h
> > > index 87e10d9a635b5..4bb7a424e3f6c 100644
> > > --- a/arch/arm64/include/asm/kvm_ras.h
> > > +++ b/arch/arm64/include/asm/kvm_ras.h
> > > @@ -11,15 +11,17 @@
> > >  #include <asm/acpi.h>
> > >
> > >  /*
> > > - * Was this synchronous external abort a RAS notification?
> > > - * Returns '0' for errors handled by some RAS subsystem, or -ENOENT.
> > > + * Handle synchronous external abort (SEA) in the following order:
> > > + * 1. Delegate to APEI/GHES to see if SEA can be claimed by them. If so, we
> > > + *    are all done.
> > > + * 2. If userspace opts in KVM_CAP_ARM_SIGBUS_ON_SEA, and if the SEA is NOT
> > > + *    about translation table, send SIGBUS
> > > + *    - si_code is BUS_OBJERR.
> > > + *    - si_addr will be 0 when accurate HVA is unavailable.
> > > + * 3. Otherwise, directly inject an async SError to guest.
> > > + *
> > > + * Note this applies to both ESR_ELx_EC_IABT_* and ESR_ELx_EC_DABT_*.
> > >   */
> > > -static inline int kvm_handle_guest_sea(phys_addr_t addr, u64 esr)
> > > -{
> > > -     /* apei_claim_sea(NULL) expects to mask interrupts itself */
> > > -     lockdep_assert_irqs_enabled();
> > > -
> > > -     return apei_claim_sea(NULL);
> > > -}
> > > +void kvm_handle_guest_sea(struct kvm_vcpu *vcpu);
> > >
> > >  #endif /* __ARM64_KVM_RAS_H__ */
> > > diff --git a/arch/arm64/kvm/Makefile b/arch/arm64/kvm/Makefile
> > > index 3cf7adb2b5038..c4a3a6d4870e6 100644
> > > --- a/arch/arm64/kvm/Makefile
> > > +++ b/arch/arm64/kvm/Makefile
> > > @@ -23,7 +23,7 @@ kvm-y += arm.o mmu.o mmio.o psci.o hypercalls.o pvtime.o \
> > >        vgic/vgic-v3.o vgic/vgic-v4.o \
> > >        vgic/vgic-mmio.o vgic/vgic-mmio-v2.o \
> > >        vgic/vgic-mmio-v3.o vgic/vgic-kvm-device.o \
> > > -      vgic/vgic-its.o vgic/vgic-debug.o
> > > +      vgic/vgic-its.o vgic/vgic-debug.o kvm_ras.o
> > >
> > >  kvm-$(CONFIG_HW_PERF_EVENTS)  += pmu-emul.o pmu.o
> > >  kvm-$(CONFIG_ARM64_PTR_AUTH)  += pauth.o
> > > diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
> > > index 48cafb65d6acf..bb97ad678dbec 100644
> > > --- a/arch/arm64/kvm/arm.c
> > > +++ b/arch/arm64/kvm/arm.c
> > > @@ -151,6 +151,10 @@ int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
> > >               }
> > >               mutex_unlock(&kvm->slots_lock);
> > >               break;
> > > +     case KVM_CAP_ARM_SIGBUS_ON_SEA:
> > > +             r = 0;
> > > +             set_bit(KVM_ARCH_FLAG_SIGBUS_ON_SEA, &kvm->arch.flags);
> >
> > Shouldn't this be somehow gated on the VM being RAS aware?
> 
> Do you mean a CAP that VMM can tell KVM the VM guest has RAS ability?
> I don't know if there is one for arm64. On x86 there is
> KVM_X86_SETUP_MCE. KVM_CAP_ARM_INJECT_EXT_DABT maybe a revelant one
> but I don't think it is exactly the one for "RAS ability".

Having though about this a bit more, I now think this is independent
of the guest supporting RAS. This really is about the VMM asking to be
made aware of RAS errors affecting the guest, and it is the signalling
back to the guest that needs to be gated by ID_AA64PFR0_EL1.RAS.

>
> >
> > > +             break;
> > >       default:
> > >               break;
> > >       }
> > > @@ -339,6 +343,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
> > >       case KVM_CAP_ARM_SYSTEM_SUSPEND:
> > >       case KVM_CAP_IRQFD_RESAMPLE:
> > >       case KVM_CAP_COUNTER_OFFSET:
> > > +     case KVM_CAP_ARM_SIGBUS_ON_SEA:
> > >               r = 1;
> > >               break;
> > >       case KVM_CAP_SET_GUEST_DEBUG2:
> > > diff --git a/arch/arm64/kvm/kvm_ras.c b/arch/arm64/kvm/kvm_ras.c
> > > new file mode 100644
> > > index 0000000000000..3225462bcbcda
> > > --- /dev/null
> > > +++ b/arch/arm64/kvm/kvm_ras.c
> > > @@ -0,0 +1,77 @@
> > > +// SPDX-License-Identifier: GPL-2.0-only
> > > +
> > > +#include <linux/bitops.h>
> > > +#include <linux/kvm_host.h>
> > > +
> > > +#include <asm/kvm_emulate.h>
> > > +#include <asm/kvm_ras.h>
> > > +#include <asm/system_misc.h>
> > > +
> > > +/*
> > > + * For synchrnous external instruction or data abort, not on translation
> > > + * table walk or hardware update of translation table, is FAR_EL2 valid?
> > > + */
> > > +static inline bool kvm_vcpu_sea_far_valid(const struct kvm_vcpu *vcpu)
> > > +{
> > > +     return !(vcpu->arch.fault.esr_el2 & ESR_ELx_FnV);
> > > +}
> > > +
> > > +/*
> > > + * Was this synchronous external abort a RAS notification?
> > > + * Returns '0' for errors handled by some RAS subsystem, or -ENOENT.
> > > + */
> > > +static int kvm_delegate_guest_sea(phys_addr_t addr, u64 esr)
> > > +{
> > > +     /* apei_claim_sea(NULL) expects to mask interrupts itself */
> > > +     lockdep_assert_irqs_enabled();
> > > +     return apei_claim_sea(NULL);
> > > +}
> > > +
> > > +void kvm_handle_guest_sea(struct kvm_vcpu *vcpu)
> > > +{
> > > +     bool sigbus_on_sea;
> > > +     int idx;
> > > +     u64 vcpu_esr = kvm_vcpu_get_esr(vcpu);
> > > +     u8 fsc = kvm_vcpu_trap_get_fault(vcpu);
> > > +     phys_addr_t fault_ipa = kvm_vcpu_get_fault_ipa(vcpu);
> > > +     gfn_t gfn = fault_ipa >> PAGE_SHIFT;
> > > +     /* When FnV is set, send 0 as si_addr like what do_sea() does. */
> > > +     unsigned long hva = 0UL;
> > > +
> > > +     /*
> > > +      * For RAS the host kernel may handle this abort.
> > > +      * There is no need to SIGBUS VMM, or pass the error into the guest.
> > > +      */
> > > +     if (kvm_delegate_guest_sea(fault_ipa, vcpu_esr) == 0)
> > > +             return;
> > > +
> > > +     sigbus_on_sea = test_bit(KVM_ARCH_FLAG_SIGBUS_ON_SEA,
> > > +                              &(vcpu->kvm->arch.flags));
> > > +
> > > +     /*
> > > +      * In addition to userspace opt-in, SIGBUS only makes sense if the
> > > +      * abort is NOT about translation table walk and NOT about hardware
> > > +      * update of translation table.
> > > +      */
> > > +     sigbus_on_sea &= (fsc == ESR_ELx_FSC_EXTABT || fsc == ESR_ELx_FSC_SECC);
> > > +
> > > +     /* Pass the error directly into the guest. */
> > > +     if (!sigbus_on_sea) {
> > > +             kvm_inject_vabt(vcpu);
> > > +             return;
> > > +     }
> > > +
> > > +     if (kvm_vcpu_sea_far_valid(vcpu)) {
> > > +             idx = srcu_read_lock(&vcpu->kvm->srcu);
> > > +             hva = gfn_to_hva(vcpu->kvm, gfn);
> > > +             srcu_read_unlock(&vcpu->kvm->srcu, idx);
> > > +     }
> > > +
> > > +     /*
> > > +      * Send a SIGBUS BUS_OBJERR to vCPU thread (the userspace thread that
> > > +      * runs KVM_RUN) or VMM, which aligns with what host kernel do_sea()
> > > +      * does if apei_claim_sea() fails.
> > > +      */
> > > +     arm64_notify_die("synchronous external abort",
> > > +                      current_pt_regs(), SIGBUS, BUS_OBJERR, hva, vcpu_esr);
> >
> > This is the point where I really think we should simply trigger an
> > exit with all that syndrome information stashed in kvm_run, like any
> > other event requiring userspace help.
> 
> Ah, there is another reason SIGBUS is better than kvm exit: "It is a
> programming error to set ext_dabt_pending after an exit which was not
> either KVM_EXIT_MMIO or KVM_EXIT_ARM_NISV", from
> Documentation/virt/kvm/api.rst. So if VMM is allowed to inject data
> abort to guest, at least current documentation doesn't suggest kvm
> exit is feasible.

I really can't make the link between the two. If you take this to the
letter, then you can't inject an external abort on the back of a
signal handler either.

Also, you're obviously changing the UAPI, so everything is fair game
provided that you don't break backward compatibility.

> 
> >
> > Also: where is the documentation?
> 
> Once I get more positive feedback and send out PATCH instead of RFC, I
> can add to Documentation/virt/kvm/api.rst.

Sorry, but that's a key point for reviewing a UAPI change.

	M.
Marc Zyngier Nov. 12, 2024, 8:55 a.m. UTC | #7
On Fri, 08 Nov 2024 21:18:50 +0000,
Jiaqi Yan <jiaqiyan@google.com> wrote:
> 
> There is also a kernel config called ARM64_RAS_EXTN but I believe it
> is for the host CPU, not about VM's RAS.
> 
> But taking Oliver's opinion into account, I think we want to remove
> KVM_CAP_ARM_SIGBUS_ON_SEA. Wonder what your thoughts are.

If we stick to a signal-based signalling (which I still dislike with a
burning passion), then I agree it can be removed, as this is no
different from a signal being delivered from the rest of the kernel.

	M.
Jiaqi Yan Nov. 19, 2024, 11:57 p.m. UTC | #8
Thanks for your comments, Marc.

While continuing the discussion here, I think it may make sense I sent
out a V2 with 2 major updates:
- add documentation for new SIGBUS feature
- remove KVM_CAP_ARM_SIGBUS_ON_SEA

On Tue, Nov 12, 2024 at 12:51 AM Marc Zyngier <maz@kernel.org> wrote:
>
> On Mon, 04 Nov 2024 05:02:03 +0000,
> Jiaqi Yan <jiaqiyan@google.com> wrote:
> >
> > Hi Marc, thanks for your quick response!
> >
> > On Fri, Nov 1, 2024 at 6:54 AM Marc Zyngier <maz@kernel.org> wrote:
> > >
> > > On Thu, 31 Oct 2024 21:21:04 +0000,
> > > Jiaqi Yan <jiaqiyan@google.com> wrote:
> > > >
> > > > Currently KVM handles SEA in guest by injecting async SError into
> > > > guest directly, bypassing VMM, usually results in guest kernel panic.
> > > >
> > > > One major situation of guest SEA is when vCPU consumes uncorrectable
> > > > memory error on the physical memory. Although SError and guest kernel
> > > > panic effectively stops the propagation of corrupted memory, it is not
> > > > easy for VMM and guest to recover from memory error in a more graceful
> > > > manner.
> > > >
> > > > Alternatively KVM can send a SIGBUS BUS_OBJERR to VMM/vCPU, just like
> > > > how core kernel signals SIGBUS BUS_OBJERR to the poison consuming
> > > > thread.
> > >
> > > Can you elaborate on why the delivery of a signal is preferable to
> > > simply exiting back to userspace with a description of the error?
> > > Signals are usually not generated by KVM, and are a pretty twisted way
> > > to generate an exit.
> >
> > A couple of reasons. First, my intuition is that KVM and core kernel
> > (do_sea) should have aligned behavior when APEI failed to claim the
> > SEA. Second, if we only talk about SEA caused by memory poison
> > consumption, both arm64 and x86 KVM already send SIGBUS to VMM/vCPU
> > thread (kvm_send_hwpoison_signal) to signal hardware memory failure,
> > although the situation is slightly different here, where we have a
> > hardware event, versus a HWPoison flag check or VM_FAULT_HWPOISON
> > returned. But from VMM/vCPU's perspective, hardware event or software
> > level VM_FAULT_HWPOISON, it would be nice if it can react to just the
> > same event, the SIGBUS signal.
> >
> > And there is another reason around your comment on arm64_notify_die.
> >
> > By "exiting back to userspace with a description of the error", are
> > you suggesting KVM_EXIT_MEMORY_FAULT? If so, we may need to add a new
> > flag to tell VMM the error is hardware memory poison, which could be
> > KVM_MEMORY_EXIT_FLAG_USERFAULT[1] if we don't want a specific one (but
> > I think a specific flag for hwpoison is probably clearer).
>
> KVM_MEMORY_EXIT_FLAG_USERFAULT is not upstream, and it isn't clear to
> me if or when it will make it there. But the idea is indeed to treat a
> RAS error as an exit reason.

Let me try again on preferring SIGBUS to KVM_EXIT. At least for memory error.

Let's say we should really use some KVM_EXIT to describe RAS error
(e.g. KVM_EXIT_MEMORY_FAULT for memory error). Does that mean we need
to change these existing SIGBUSes [1,2] in KVM into KVM_EXIT? Doesn't
it break VMM [3,4] that expects SIGBUS for memory errors?

The difference between [1,2] and the scenario dealt by this patch is
just whether APEI/GHES/memory_failure succeeded in recovering a memory
error or not. Is the idea that the uAPI will be different depending on
successful recovery or not?

[1] https://elixir.bootlin.com/linux/v6.12-rc7/source/arch/arm64/kvm/mmu.c#L1576
[2] https://elixir.bootlin.com/linux/v6.12-rc7/source/arch/x86/kvm/mmu/mmu.c#L3301
[3] https://github.com/qemu/qemu/blob/master/target/i386/kvm/kvm.c#L738
[4] https://github.com/qemu/qemu/blob/master/target/arm/kvm.c#L2365

>
> >
> > >
> > > > In addition to the benifit that KVM's handling for SEA becomes aligned
> > > > with core kernel behavior
> > > > - The blast radius in VM can be limited to only the consuming thread
> > > >   in guest, instead of entire guest kernel, unless the consumption is
> > > >   from guest kernel.
> > > > - VMM now has the chance to do its duties to stop the VM from repeatedly
> > > >   consuming corrupted data. For example, VMM can unmap the guest page
> > > >   from stage-2 table to intercept forseen memory poison consumption,
> > >
> > > Not quite. The VMM doesn't manage stage-2. It can remove the page from
> > > the VMA if it has it mapped, but that's it. The kernel deals with S2.
> >
> > I should probably not mention the implementation, "unmap from S2".
> > What is needed for preventing repeated consuming memory poison is
> > simply preventing guest access to certain memory pages. There is a
> > work in progress KVM API [1] by my colleague James.
> >
> > [1] https://lpc.events/event/18/contributions/1757/attachments/1442/3073/LPC_%20KVM%20Userfault.pdf
> >
> > >
> > > Which brings me to the next subject: when the kernel unmaps the page
> > > at S2, it is likely to use CMOs. Can these CMOs create RAS error
> > > themselves?
> >
> > I assume CMO here means writing dirty cache lines to memory. Writing
> > something new to a poisoned cacheline usually won't cause RAS error.
> > Notifying memory poison usually is delayed to a memory load
> > transaction.
>
> I don't see anything of the sort in the spec. At least R_VGXBJ calls
> out that:
>
> <quote>
> An error is propagated by the PE by one or more of the following
> occurring that would not have been permitted to occur had the fault
> not been activated:
>
> * Consumption of the corrupt value by any instruction, propagating the
> error to the target(s) of the instruction.  This includes:
>
>   - A store of a corrupt value.
> [...]
> </quote>
>
> So while implementations /may/ only act and deliver an error on a
> load, that's only an implementation detail.

Thanks for finding out from the spec, lesson learned. Back to your
original question and assume error will be propagated by the PE, is
your concern that "unmap from S2" will cause a SEA on TTW or hardware
update of translation table level 2 (i.e. DFSC=0b010110), which
renders the system into a worse fault state?

>
> >
> > >
> > > >   and for every consumption injects SEA to EL1 with synthetic memory
> > > >   error CPER.
> > >
> > > Why do we need to involve ACPI here? I would expect the production of
> > > an architected error record instead. Or at least be given the option.
> >
> > Sorry, I was just mentioning a specific VMM's implementation. There
> > are definitely multiple options (Machine Check MSRs vs CPER for error
> > description data, SEA vs SDEI vs SPI for notification mechanisms) for
> > how VMM involves the guest to handle memory error. My preference is:
> > VMM populates CPER in guest HEST when VMM instructs KVM to inject
> > i/dabt to the guest.
> >
> > And a word about "be given the option": I think when VMM receives
> > SIGBUS with si_addr=faulted/poisoned HVA, it's got all these options,
> > like using the si_addr to construct CPER with poisoned guest physical
> > address, or mci_address MSR.
> >
> > >
> > > > Introduce a new KVM ARM capability KVM_CAP_ARM_SIGBUS_ON_SEA. VMM
> > > > can opt in this new capability if it prefers SIGBUS than SError
> > > > injection during VM init. Now SEA handling in KVM works as follows:
> > > > 1. Delegate to APEI/GHES to see if SEA can be claimed by them.
> > > > 2. If APEI failed to claim the SEA and KVM_CAP_ARM_SIGBUS_ON_SEA is
> > > >    enabled for the VM, and the SEA is NOT about translation table,
> > > >    send SIGBUS BUS_OBJERR signal with host virtual address.
> > >
> > > And what if it is? S1 PTs are backed by userspace memory, like
> > > anything else. I don't think we should have a different treatment of
> > > those, because the HW wouldn't treat them differently either.
> >
> > You are talking about ESR_ELx_FSC_SEA_TTW(1), or
> > ESR_ELx_FSC_SEA_TTW(0), right? I think you are right, S1 is no
> > difference.
> >
> > But I think we want to make an exception for SEA about S2 PTs.
>
> S2 PTs are owned by the hypervisor, not by userspace, so I'm fine not
> reporting those through this mechanism.

Ack, will update in the v2.

>
> >
> > >
> > > > 3. Otherwise directly inject async SError to guest.
> > > >
> > > > Tested on a machine running Siryn AmpereOne processor. A in-house VMM
> > > > that opts in KVM_CAP_ARM_SIGBUS_ON_SEA started a VM. A dummy application
> > > > in VM allocated some memory buffer. The test used EINJ to inject an
> > > > uncorrectable recoverable memory error at a page in the allocated memory
> > > > buffer. The dummy application then consumed the memory error. Some hack
> > > > was done to make core kernel's memory_failure triggered by poison
> > > > generation to fail, so KVM had to deal with the SEA guest abort due to
> > > > poison consumption. vCPU thread in VMM received SIGBUS BUS_OBJERR with
> > > > valid host virtual address of the poisoned page. VMM then injected a SEA
> > > > into guest using KVM_SET_VCPU_EVENTS with ext_dabt_pending=1. At last
> > > > the dummy application in guest was killed by SIGBUS BUS_OBJERR, while the
> > > > guest survived and continued to run.
> > > >
> > > > Signed-off-by: Jiaqi Yan <jiaqiyan@google.com>
> > > > ---
> > > >  arch/arm64/include/asm/kvm_host.h |  2 +
> > > >  arch/arm64/include/asm/kvm_ras.h  | 20 ++++----
> > > >  arch/arm64/kvm/Makefile           |  2 +-
> > > >  arch/arm64/kvm/arm.c              |  5 ++
> > > >  arch/arm64/kvm/kvm_ras.c          | 77 +++++++++++++++++++++++++++++++
> > > >  arch/arm64/kvm/mmu.c              |  8 +---
> > > >  include/uapi/linux/kvm.h          |  1 +
> > > >  7 files changed, 98 insertions(+), 17 deletions(-)
> > > >  create mode 100644 arch/arm64/kvm/kvm_ras.c
> > > >
> > > > diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
> > > > index bf64fed9820ea..eb37a2489411a 100644
> > > > --- a/arch/arm64/include/asm/kvm_host.h
> > > > +++ b/arch/arm64/include/asm/kvm_host.h
> > > > @@ -334,6 +334,8 @@ struct kvm_arch {
> > > >       /* Fine-Grained UNDEF initialised */
> > > >  #define KVM_ARCH_FLAG_FGU_INITIALIZED                        8
> > > >       unsigned long flags;
> > > > +     /* Instead of injecting SError into guest, SIGBUS VMM */
> > > > +#define KVM_ARCH_FLAG_SIGBUS_ON_SEA                  9
> > >
> > > nit: why do you put this definition out of sequence (below 'flags')?
> >
> > Ah, I will move it on top of flags.
> >
> > >
> > > >
> > > >       /* VM-wide vCPU feature set */
> > > >       DECLARE_BITMAP(vcpu_features, KVM_VCPU_MAX_FEATURES);
> > > > diff --git a/arch/arm64/include/asm/kvm_ras.h b/arch/arm64/include/asm/kvm_ras.h
> > > > index 87e10d9a635b5..4bb7a424e3f6c 100644
> > > > --- a/arch/arm64/include/asm/kvm_ras.h
> > > > +++ b/arch/arm64/include/asm/kvm_ras.h
> > > > @@ -11,15 +11,17 @@
> > > >  #include <asm/acpi.h>
> > > >
> > > >  /*
> > > > - * Was this synchronous external abort a RAS notification?
> > > > - * Returns '0' for errors handled by some RAS subsystem, or -ENOENT.
> > > > + * Handle synchronous external abort (SEA) in the following order:
> > > > + * 1. Delegate to APEI/GHES to see if SEA can be claimed by them. If so, we
> > > > + *    are all done.
> > > > + * 2. If userspace opts in KVM_CAP_ARM_SIGBUS_ON_SEA, and if the SEA is NOT
> > > > + *    about translation table, send SIGBUS
> > > > + *    - si_code is BUS_OBJERR.
> > > > + *    - si_addr will be 0 when accurate HVA is unavailable.
> > > > + * 3. Otherwise, directly inject an async SError to guest.
> > > > + *
> > > > + * Note this applies to both ESR_ELx_EC_IABT_* and ESR_ELx_EC_DABT_*.
> > > >   */
> > > > -static inline int kvm_handle_guest_sea(phys_addr_t addr, u64 esr)
> > > > -{
> > > > -     /* apei_claim_sea(NULL) expects to mask interrupts itself */
> > > > -     lockdep_assert_irqs_enabled();
> > > > -
> > > > -     return apei_claim_sea(NULL);
> > > > -}
> > > > +void kvm_handle_guest_sea(struct kvm_vcpu *vcpu);
> > > >
> > > >  #endif /* __ARM64_KVM_RAS_H__ */
> > > > diff --git a/arch/arm64/kvm/Makefile b/arch/arm64/kvm/Makefile
> > > > index 3cf7adb2b5038..c4a3a6d4870e6 100644
> > > > --- a/arch/arm64/kvm/Makefile
> > > > +++ b/arch/arm64/kvm/Makefile
> > > > @@ -23,7 +23,7 @@ kvm-y += arm.o mmu.o mmio.o psci.o hypercalls.o pvtime.o \
> > > >        vgic/vgic-v3.o vgic/vgic-v4.o \
> > > >        vgic/vgic-mmio.o vgic/vgic-mmio-v2.o \
> > > >        vgic/vgic-mmio-v3.o vgic/vgic-kvm-device.o \
> > > > -      vgic/vgic-its.o vgic/vgic-debug.o
> > > > +      vgic/vgic-its.o vgic/vgic-debug.o kvm_ras.o
> > > >
> > > >  kvm-$(CONFIG_HW_PERF_EVENTS)  += pmu-emul.o pmu.o
> > > >  kvm-$(CONFIG_ARM64_PTR_AUTH)  += pauth.o
> > > > diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
> > > > index 48cafb65d6acf..bb97ad678dbec 100644
> > > > --- a/arch/arm64/kvm/arm.c
> > > > +++ b/arch/arm64/kvm/arm.c
> > > > @@ -151,6 +151,10 @@ int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
> > > >               }
> > > >               mutex_unlock(&kvm->slots_lock);
> > > >               break;
> > > > +     case KVM_CAP_ARM_SIGBUS_ON_SEA:
> > > > +             r = 0;
> > > > +             set_bit(KVM_ARCH_FLAG_SIGBUS_ON_SEA, &kvm->arch.flags);
> > >
> > > Shouldn't this be somehow gated on the VM being RAS aware?
> >
> > Do you mean a CAP that VMM can tell KVM the VM guest has RAS ability?
> > I don't know if there is one for arm64. On x86 there is
> > KVM_X86_SETUP_MCE. KVM_CAP_ARM_INJECT_EXT_DABT maybe a revelant one
> > but I don't think it is exactly the one for "RAS ability".
>
> Having though about this a bit more, I now think this is independent
> of the guest supporting RAS. This really is about the VMM asking to be
> made aware of RAS errors affecting the guest, and it is the signalling
> back to the guest that needs to be gated by ID_AA64PFR0_EL1.RAS.

Just to make sure I fully catch you. I think ID_AA64PFR0_EL1.RAS
translates to ARM64_HAS_RAS_EXTN in the kernel. If VMM signals RAS
error back to the guest with SEA, are you suggesting
__kvm_arm_vcpu_set_events should check
cpus_have_final_cap(ARM64_HAS_RAS_EXTN) before it
kvm_inject_dabt(vcpu)?

If so, how could __kvm_arm_vcpu_set_events know if the error is about
RAS (e.g. memory error) vs about accessing memory not in a memslot
(i.e. KVM_EXIT_ARM_NISV)? I guess KVM needs to look at ESR_EL2 again
(e.g. kvm_vcpu_abt_issea vs kvm_vcpu_dabt_isvalid)?

>
> >
> > >
> > > > +             break;
> > > >       default:
> > > >               break;
> > > >       }
> > > > @@ -339,6 +343,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
> > > >       case KVM_CAP_ARM_SYSTEM_SUSPEND:
> > > >       case KVM_CAP_IRQFD_RESAMPLE:
> > > >       case KVM_CAP_COUNTER_OFFSET:
> > > > +     case KVM_CAP_ARM_SIGBUS_ON_SEA:
> > > >               r = 1;
> > > >               break;
> > > >       case KVM_CAP_SET_GUEST_DEBUG2:
> > > > diff --git a/arch/arm64/kvm/kvm_ras.c b/arch/arm64/kvm/kvm_ras.c
> > > > new file mode 100644
> > > > index 0000000000000..3225462bcbcda
> > > > --- /dev/null
> > > > +++ b/arch/arm64/kvm/kvm_ras.c
> > > > @@ -0,0 +1,77 @@
> > > > +// SPDX-License-Identifier: GPL-2.0-only
> > > > +
> > > > +#include <linux/bitops.h>
> > > > +#include <linux/kvm_host.h>
> > > > +
> > > > +#include <asm/kvm_emulate.h>
> > > > +#include <asm/kvm_ras.h>
> > > > +#include <asm/system_misc.h>
> > > > +
> > > > +/*
> > > > + * For synchrnous external instruction or data abort, not on translation
> > > > + * table walk or hardware update of translation table, is FAR_EL2 valid?
> > > > + */
> > > > +static inline bool kvm_vcpu_sea_far_valid(const struct kvm_vcpu *vcpu)
> > > > +{
> > > > +     return !(vcpu->arch.fault.esr_el2 & ESR_ELx_FnV);
> > > > +}
> > > > +
> > > > +/*
> > > > + * Was this synchronous external abort a RAS notification?
> > > > + * Returns '0' for errors handled by some RAS subsystem, or -ENOENT.
> > > > + */
> > > > +static int kvm_delegate_guest_sea(phys_addr_t addr, u64 esr)
> > > > +{
> > > > +     /* apei_claim_sea(NULL) expects to mask interrupts itself */
> > > > +     lockdep_assert_irqs_enabled();
> > > > +     return apei_claim_sea(NULL);
> > > > +}
> > > > +
> > > > +void kvm_handle_guest_sea(struct kvm_vcpu *vcpu)
> > > > +{
> > > > +     bool sigbus_on_sea;
> > > > +     int idx;
> > > > +     u64 vcpu_esr = kvm_vcpu_get_esr(vcpu);
> > > > +     u8 fsc = kvm_vcpu_trap_get_fault(vcpu);
> > > > +     phys_addr_t fault_ipa = kvm_vcpu_get_fault_ipa(vcpu);
> > > > +     gfn_t gfn = fault_ipa >> PAGE_SHIFT;
> > > > +     /* When FnV is set, send 0 as si_addr like what do_sea() does. */
> > > > +     unsigned long hva = 0UL;
> > > > +
> > > > +     /*
> > > > +      * For RAS the host kernel may handle this abort.
> > > > +      * There is no need to SIGBUS VMM, or pass the error into the guest.
> > > > +      */
> > > > +     if (kvm_delegate_guest_sea(fault_ipa, vcpu_esr) == 0)
> > > > +             return;
> > > > +
> > > > +     sigbus_on_sea = test_bit(KVM_ARCH_FLAG_SIGBUS_ON_SEA,
> > > > +                              &(vcpu->kvm->arch.flags));
> > > > +
> > > > +     /*
> > > > +      * In addition to userspace opt-in, SIGBUS only makes sense if the
> > > > +      * abort is NOT about translation table walk and NOT about hardware
> > > > +      * update of translation table.
> > > > +      */
> > > > +     sigbus_on_sea &= (fsc == ESR_ELx_FSC_EXTABT || fsc == ESR_ELx_FSC_SECC);
> > > > +
> > > > +     /* Pass the error directly into the guest. */
> > > > +     if (!sigbus_on_sea) {
> > > > +             kvm_inject_vabt(vcpu);
> > > > +             return;
> > > > +     }
> > > > +
> > > > +     if (kvm_vcpu_sea_far_valid(vcpu)) {
> > > > +             idx = srcu_read_lock(&vcpu->kvm->srcu);
> > > > +             hva = gfn_to_hva(vcpu->kvm, gfn);
> > > > +             srcu_read_unlock(&vcpu->kvm->srcu, idx);
> > > > +     }
> > > > +
> > > > +     /*
> > > > +      * Send a SIGBUS BUS_OBJERR to vCPU thread (the userspace thread that
> > > > +      * runs KVM_RUN) or VMM, which aligns with what host kernel do_sea()
> > > > +      * does if apei_claim_sea() fails.
> > > > +      */
> > > > +     arm64_notify_die("synchronous external abort",
> > > > +                      current_pt_regs(), SIGBUS, BUS_OBJERR, hva, vcpu_esr);
> > >
> > > This is the point where I really think we should simply trigger an
> > > exit with all that syndrome information stashed in kvm_run, like any
> > > other event requiring userspace help.
> >
> > Ah, there is another reason SIGBUS is better than kvm exit: "It is a
> > programming error to set ext_dabt_pending after an exit which was not
> > either KVM_EXIT_MMIO or KVM_EXIT_ARM_NISV", from
> > Documentation/virt/kvm/api.rst. So if VMM is allowed to inject data
> > abort to guest, at least current documentation doesn't suggest kvm
> > exit is feasible.
>
> I really can't make the link between the two. If you take this to the
> letter, then you can't inject an external abort on the back of a
> signal handler either.

Sorry for the confusion, I think my understanding of the doc and what
I said were both wrong. It seems to me that what the doc is suggesting
is, the use case of ext_dabt_pending (injecting SEA into guest) for
now is only to enable VMM to handle KVM_EXIT_MMIO and
KVM_EXIT_ARM_NISV.

With the new SIGBUS for memory error, is it fair to update the doc to
say another use case of ext_dabt_pending is to enable VMM to signal
guest of RAS error?

>
> Also, you're obviously changing the UAPI, so everything is fair game
> provided that you don't break backward compatibility.
>
> >
> > >
> > > Also: where is the documentation?
> >
> > Once I get more positive feedback and send out PATCH instead of RFC, I
> > can add to Documentation/virt/kvm/api.rst.
>
> Sorry, but that's a key point for reviewing a UAPI change.

Let me include the doc for the SIGBUS feature in v2.

>
>         M.
>
> --
> Without deviation from the norm, progress is not possible.
Oliver Upton Nov. 20, 2024, 7:46 a.m. UTC | #9
On Tue, Nov 19, 2024 at 03:57:46PM -0800, Jiaqi Yan wrote:
>
> While continuing the discussion here, I think it may make sense I sent
> out a V2 with 2 major updates:
>   - add documentation for new SIGBUS feature
>   - remove KVM_CAP_ARM_SIGBUS_ON_SEA

Just wanted to add that QEMU already has a functioning "MCE" injection
implemenation based on signals that deals with the sloppy mess of
coordinating w/ vCPU threads [*].

I completely agree with Marc that the UAPI around using signals for this
sort of thing is a giant pile of crap, but it seems to be a semi-understood
pile of crap. And from that perspective, wiring unclaimed SEAs into the
existing infrastructure at least makes the UAPI consistent.

[*]: https://elixir.bootlin.com/qemu/v9.1.1/C/ident/kvm_on_sigbus_vcpu

> On Tue, Nov 12, 2024 at 12:51 AM Marc Zyngier <maz@kernel.org> wrote:
> > > Do you mean a CAP that VMM can tell KVM the VM guest has RAS ability?
> > > I don't know if there is one for arm64. On x86 there is
> > > KVM_X86_SETUP_MCE. KVM_CAP_ARM_INJECT_EXT_DABT maybe a revelant one
> > > but I don't think it is exactly the one for "RAS ability".
> >
> > Having though about this a bit more, I now think this is independent
> > of the guest supporting RAS. This really is about the VMM asking to be
> > made aware of RAS errors affecting the guest, and it is the signalling
> > back to the guest that needs to be gated by ID_AA64PFR0_EL1.RAS.
> 
> Just to make sure I fully catch you. I think ID_AA64PFR0_EL1.RAS
> translates to ARM64_HAS_RAS_EXTN in the kernel. If VMM signals RAS
> error back to the guest with SEA, are you suggesting
> __kvm_arm_vcpu_set_events should check
> cpus_have_final_cap(ARM64_HAS_RAS_EXTN) before it
> kvm_inject_dabt(vcpu)?
> 
> If so, how could __kvm_arm_vcpu_set_events know if the error is about
> RAS (e.g. memory error) vs about accessing memory not in a memslot
> (i.e. KVM_EXIT_ARM_NISV)? I guess KVM needs to look at ESR_EL2 again
> (e.g. kvm_vcpu_abt_issea vs kvm_vcpu_dabt_isvalid)?

Good point. I don't think we can lock down this UAPI after the fact
given the existing use cases. It is ultimately up to the VMM what to do.

I don't see anything that would stop an implementation without FEAT_RAS
from generating an SEA in this situation. The lack of FEAT_RAS (to me at
least) implies:

 - No ESR injection for vSErrors (already enforced in UAPI)
 - No deferral of SErrors / ESBs
 - No error record registers in the PE

Of course, it is very likely you've thought about this more than I have,
Marc.
diff mbox series

Patch

diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
index bf64fed9820ea..eb37a2489411a 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -334,6 +334,8 @@  struct kvm_arch {
 	/* Fine-Grained UNDEF initialised */
 #define KVM_ARCH_FLAG_FGU_INITIALIZED			8
 	unsigned long flags;
+	/* Instead of injecting SError into guest, SIGBUS VMM */
+#define KVM_ARCH_FLAG_SIGBUS_ON_SEA			9
 
 	/* VM-wide vCPU feature set */
 	DECLARE_BITMAP(vcpu_features, KVM_VCPU_MAX_FEATURES);
diff --git a/arch/arm64/include/asm/kvm_ras.h b/arch/arm64/include/asm/kvm_ras.h
index 87e10d9a635b5..4bb7a424e3f6c 100644
--- a/arch/arm64/include/asm/kvm_ras.h
+++ b/arch/arm64/include/asm/kvm_ras.h
@@ -11,15 +11,17 @@ 
 #include <asm/acpi.h>
 
 /*
- * Was this synchronous external abort a RAS notification?
- * Returns '0' for errors handled by some RAS subsystem, or -ENOENT.
+ * Handle synchronous external abort (SEA) in the following order:
+ * 1. Delegate to APEI/GHES to see if SEA can be claimed by them. If so, we
+ *    are all done.
+ * 2. If userspace opts in KVM_CAP_ARM_SIGBUS_ON_SEA, and if the SEA is NOT
+ *    about translation table, send SIGBUS
+ *    - si_code is BUS_OBJERR.
+ *    - si_addr will be 0 when accurate HVA is unavailable.
+ * 3. Otherwise, directly inject an async SError to guest.
+ *
+ * Note this applies to both ESR_ELx_EC_IABT_* and ESR_ELx_EC_DABT_*.
  */
-static inline int kvm_handle_guest_sea(phys_addr_t addr, u64 esr)
-{
-	/* apei_claim_sea(NULL) expects to mask interrupts itself */
-	lockdep_assert_irqs_enabled();
-
-	return apei_claim_sea(NULL);
-}
+void kvm_handle_guest_sea(struct kvm_vcpu *vcpu);
 
 #endif /* __ARM64_KVM_RAS_H__ */
diff --git a/arch/arm64/kvm/Makefile b/arch/arm64/kvm/Makefile
index 3cf7adb2b5038..c4a3a6d4870e6 100644
--- a/arch/arm64/kvm/Makefile
+++ b/arch/arm64/kvm/Makefile
@@ -23,7 +23,7 @@  kvm-y += arm.o mmu.o mmio.o psci.o hypercalls.o pvtime.o \
 	 vgic/vgic-v3.o vgic/vgic-v4.o \
 	 vgic/vgic-mmio.o vgic/vgic-mmio-v2.o \
 	 vgic/vgic-mmio-v3.o vgic/vgic-kvm-device.o \
-	 vgic/vgic-its.o vgic/vgic-debug.o
+	 vgic/vgic-its.o vgic/vgic-debug.o kvm_ras.o
 
 kvm-$(CONFIG_HW_PERF_EVENTS)  += pmu-emul.o pmu.o
 kvm-$(CONFIG_ARM64_PTR_AUTH)  += pauth.o
diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index 48cafb65d6acf..bb97ad678dbec 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -151,6 +151,10 @@  int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
 		}
 		mutex_unlock(&kvm->slots_lock);
 		break;
+	case KVM_CAP_ARM_SIGBUS_ON_SEA:
+		r = 0;
+		set_bit(KVM_ARCH_FLAG_SIGBUS_ON_SEA, &kvm->arch.flags);
+		break;
 	default:
 		break;
 	}
@@ -339,6 +343,7 @@  int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 	case KVM_CAP_ARM_SYSTEM_SUSPEND:
 	case KVM_CAP_IRQFD_RESAMPLE:
 	case KVM_CAP_COUNTER_OFFSET:
+	case KVM_CAP_ARM_SIGBUS_ON_SEA:
 		r = 1;
 		break;
 	case KVM_CAP_SET_GUEST_DEBUG2:
diff --git a/arch/arm64/kvm/kvm_ras.c b/arch/arm64/kvm/kvm_ras.c
new file mode 100644
index 0000000000000..3225462bcbcda
--- /dev/null
+++ b/arch/arm64/kvm/kvm_ras.c
@@ -0,0 +1,77 @@ 
+// SPDX-License-Identifier: GPL-2.0-only
+
+#include <linux/bitops.h>
+#include <linux/kvm_host.h>
+
+#include <asm/kvm_emulate.h>
+#include <asm/kvm_ras.h>
+#include <asm/system_misc.h>
+
+/*
+ * For synchrnous external instruction or data abort, not on translation
+ * table walk or hardware update of translation table, is FAR_EL2 valid?
+ */
+static inline bool kvm_vcpu_sea_far_valid(const struct kvm_vcpu *vcpu)
+{
+	return !(vcpu->arch.fault.esr_el2 & ESR_ELx_FnV);
+}
+
+/*
+ * Was this synchronous external abort a RAS notification?
+ * Returns '0' for errors handled by some RAS subsystem, or -ENOENT.
+ */
+static int kvm_delegate_guest_sea(phys_addr_t addr, u64 esr)
+{
+	/* apei_claim_sea(NULL) expects to mask interrupts itself */
+	lockdep_assert_irqs_enabled();
+	return apei_claim_sea(NULL);
+}
+
+void kvm_handle_guest_sea(struct kvm_vcpu *vcpu)
+{
+	bool sigbus_on_sea;
+	int idx;
+	u64 vcpu_esr = kvm_vcpu_get_esr(vcpu);
+	u8 fsc = kvm_vcpu_trap_get_fault(vcpu);
+	phys_addr_t fault_ipa = kvm_vcpu_get_fault_ipa(vcpu);
+	gfn_t gfn = fault_ipa >> PAGE_SHIFT;
+	/* When FnV is set, send 0 as si_addr like what do_sea() does. */
+	unsigned long hva = 0UL;
+
+	/*
+	 * For RAS the host kernel may handle this abort.
+	 * There is no need to SIGBUS VMM, or pass the error into the guest.
+	 */
+	if (kvm_delegate_guest_sea(fault_ipa, vcpu_esr) == 0)
+		return;
+
+	sigbus_on_sea = test_bit(KVM_ARCH_FLAG_SIGBUS_ON_SEA,
+				 &(vcpu->kvm->arch.flags));
+
+	/*
+	 * In addition to userspace opt-in, SIGBUS only makes sense if the
+	 * abort is NOT about translation table walk and NOT about hardware
+	 * update of translation table.
+	 */
+	sigbus_on_sea &= (fsc == ESR_ELx_FSC_EXTABT || fsc == ESR_ELx_FSC_SECC);
+
+	/* Pass the error directly into the guest. */
+	if (!sigbus_on_sea) {
+		kvm_inject_vabt(vcpu);
+		return;
+	}
+
+	if (kvm_vcpu_sea_far_valid(vcpu)) {
+		idx = srcu_read_lock(&vcpu->kvm->srcu);
+		hva = gfn_to_hva(vcpu->kvm, gfn);
+		srcu_read_unlock(&vcpu->kvm->srcu, idx);
+	}
+
+	/*
+	 * Send a SIGBUS BUS_OBJERR to vCPU thread (the userspace thread that
+	 * runs KVM_RUN) or VMM, which aligns with what host kernel do_sea()
+	 * does if apei_claim_sea() fails.
+	 */
+	arm64_notify_die("synchronous external abort",
+			 current_pt_regs(), SIGBUS, BUS_OBJERR, hva, vcpu_esr);
+}
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index a71fe6f6bd90f..f5335953827ec 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -1766,13 +1766,7 @@  int kvm_handle_guest_abort(struct kvm_vcpu *vcpu)
 
 	/* Synchronous External Abort? */
 	if (kvm_vcpu_abt_issea(vcpu)) {
-		/*
-		 * For RAS the host kernel may handle this abort.
-		 * There is no need to pass the error into the guest.
-		 */
-		if (kvm_handle_guest_sea(fault_ipa, kvm_vcpu_get_esr(vcpu)))
-			kvm_inject_vabt(vcpu);
-
+		kvm_handle_guest_sea(vcpu);
 		return 1;
 	}
 
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 637efc0551453..fe3ca787e72fa 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -933,6 +933,7 @@  struct kvm_enable_cap {
 #define KVM_CAP_PRE_FAULT_MEMORY 236
 #define KVM_CAP_X86_APIC_BUS_CYCLES_NS 237
 #define KVM_CAP_X86_GUEST_MODE 238
+#define KVM_CAP_ARM_SIGBUS_ON_SEA 239
 
 struct kvm_irq_routing_irqchip {
 	__u32 irqchip;