Message ID | 20241031212104.1429609-1-jiaqiyan@google.com (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | [RFC,v1] KVM: arm64: Introduce KVM_CAP_ARM_SIGBUS_ON_SEA | expand |
Hi Jiaqi, Thank you for sending this out. On Thu, Oct 31, 2024 at 09:21:04PM +0000, Jiaqi Yan wrote: > Currently KVM handles SEA in guest by injecting async SError into > guest directly, bypassing VMM, usually results in guest kernel panic. > > One major situation of guest SEA is when vCPU consumes uncorrectable > memory error on the physical memory. Although SError and guest kernel > panic effectively stops the propagation of corrupted memory, it is not > easy for VMM and guest to recover from memory error in a more graceful > manner. > > Alternatively KVM can send a SIGBUS BUS_OBJERR to VMM/vCPU, just like > how core kernel signals SIGBUS BUS_OBJERR to the poison consuming > thread. > In addition to the benifit that KVM's handling for SEA becomes aligned > with core kernel behavior > - The blast radius in VM can be limited to only the consuming thread > in guest, instead of entire guest kernel, unless the consumption is > from guest kernel. > - VMM now has the chance to do its duties to stop the VM from repeatedly > consuming corrupted data. For example, VMM can unmap the guest page > from stage-2 table to intercept forseen memory poison consumption, > and for every consumption injects SEA to EL1 with synthetic memory > error CPER. > > Introduce a new KVM ARM capability KVM_CAP_ARM_SIGBUS_ON_SEA. VMM > can opt in this new capability if it prefers SIGBUS than SError > injection during VM init. Now SEA handling in KVM works as follows: I'm somewhat tempted to force the new behavior on userspace unconditionally. Working back from an unexpected SError in the VM to the KVM SEA handler is a bit of a mess, and can be annoying if the operator can't access console logs of the VM. As it stands today, UAPI expectations around SEAs are platform dependent. If APEI claims the SEA and decides to offline a page, the user will get a SIGBUS. So sending a SIGBUS for the case that firmware _doesn't_ claim the SEA seems like a good move from a consistency PoV. But it is a decently-sized change to do without explicit buy-in from userspace so let's see what others think. > 1. Delegate to APEI/GHES to see if SEA can be claimed by them. > 2. If APEI failed to claim the SEA and KVM_CAP_ARM_SIGBUS_ON_SEA is > enabled for the VM, and the SEA is NOT about translation table, > send SIGBUS BUS_OBJERR signal with host virtual address. > 3. Otherwise directly inject async SError to guest. The other reason I'm a bit lukewarm on user buy in is the UAPI suffers from the same issue we do today: it depends on the platform. If the SEA is claimed by APEI/GHES then the cap does nothing. > +static int kvm_delegate_guest_sea(phys_addr_t addr, u64 esr) > +{ > + /* apei_claim_sea(NULL) expects to mask interrupts itself */ > + lockdep_assert_irqs_enabled(); > + return apei_claim_sea(NULL); > +} Consider dropping parameters from this since they're unused. > +void kvm_handle_guest_sea(struct kvm_vcpu *vcpu) > +{ > + bool sigbus_on_sea; > + int idx; > + u64 vcpu_esr = kvm_vcpu_get_esr(vcpu); > + u8 fsc = kvm_vcpu_trap_get_fault(vcpu); > + phys_addr_t fault_ipa = kvm_vcpu_get_fault_ipa(vcpu); > + gfn_t gfn = fault_ipa >> PAGE_SHIFT; > + /* When FnV is set, send 0 as si_addr like what do_sea() does. */ > + unsigned long hva = 0UL; > + > + /* > + * For RAS the host kernel may handle this abort. > + * There is no need to SIGBUS VMM, or pass the error into the guest. > + */ > + if (kvm_delegate_guest_sea(fault_ipa, vcpu_esr) == 0) > + return; > + > + sigbus_on_sea = test_bit(KVM_ARCH_FLAG_SIGBUS_ON_SEA, > + &(vcpu->kvm->arch.flags)); > + > + /* > + * In addition to userspace opt-in, SIGBUS only makes sense if the > + * abort is NOT about translation table walk and NOT about hardware > + * update of translation table. > + */ > + sigbus_on_sea &= (fsc == ESR_ELx_FSC_EXTABT || fsc == ESR_ELx_FSC_SECC); Is this because we potentially can't determine a valid HVA for the fault? Maybe these should go out to userspace still with si_addr = 0.
On Thu, 31 Oct 2024 21:21:04 +0000, Jiaqi Yan <jiaqiyan@google.com> wrote: > > Currently KVM handles SEA in guest by injecting async SError into > guest directly, bypassing VMM, usually results in guest kernel panic. > > One major situation of guest SEA is when vCPU consumes uncorrectable > memory error on the physical memory. Although SError and guest kernel > panic effectively stops the propagation of corrupted memory, it is not > easy for VMM and guest to recover from memory error in a more graceful > manner. > > Alternatively KVM can send a SIGBUS BUS_OBJERR to VMM/vCPU, just like > how core kernel signals SIGBUS BUS_OBJERR to the poison consuming > thread. Can you elaborate on why the delivery of a signal is preferable to simply exiting back to userspace with a description of the error? Signals are usually not generated by KVM, and are a pretty twisted way to generate an exit. > In addition to the benifit that KVM's handling for SEA becomes aligned > with core kernel behavior > - The blast radius in VM can be limited to only the consuming thread > in guest, instead of entire guest kernel, unless the consumption is > from guest kernel. > - VMM now has the chance to do its duties to stop the VM from repeatedly > consuming corrupted data. For example, VMM can unmap the guest page > from stage-2 table to intercept forseen memory poison consumption, Not quite. The VMM doesn't manage stage-2. It can remove the page from the VMA if it has it mapped, but that's it. The kernel deals with S2. Which brings me to the next subject: when the kernel unmaps the page at S2, it is likely to use CMOs. Can these CMOs create RAS error themselves? > and for every consumption injects SEA to EL1 with synthetic memory > error CPER. Why do we need to involve ACPI here? I would expect the production of an architected error record instead. Or at least be given the option. > Introduce a new KVM ARM capability KVM_CAP_ARM_SIGBUS_ON_SEA. VMM > can opt in this new capability if it prefers SIGBUS than SError > injection during VM init. Now SEA handling in KVM works as follows: > 1. Delegate to APEI/GHES to see if SEA can be claimed by them. > 2. If APEI failed to claim the SEA and KVM_CAP_ARM_SIGBUS_ON_SEA is > enabled for the VM, and the SEA is NOT about translation table, > send SIGBUS BUS_OBJERR signal with host virtual address. And what if it is? S1 PTs are backed by userspace memory, like anything else. I don't think we should have a different treatment of those, because the HW wouldn't treat them differently either. > 3. Otherwise directly inject async SError to guest. > > Tested on a machine running Siryn AmpereOne processor. A in-house VMM > that opts in KVM_CAP_ARM_SIGBUS_ON_SEA started a VM. A dummy application > in VM allocated some memory buffer. The test used EINJ to inject an > uncorrectable recoverable memory error at a page in the allocated memory > buffer. The dummy application then consumed the memory error. Some hack > was done to make core kernel's memory_failure triggered by poison > generation to fail, so KVM had to deal with the SEA guest abort due to > poison consumption. vCPU thread in VMM received SIGBUS BUS_OBJERR with > valid host virtual address of the poisoned page. VMM then injected a SEA > into guest using KVM_SET_VCPU_EVENTS with ext_dabt_pending=1. At last > the dummy application in guest was killed by SIGBUS BUS_OBJERR, while the > guest survived and continued to run. > > Signed-off-by: Jiaqi Yan <jiaqiyan@google.com> > --- > arch/arm64/include/asm/kvm_host.h | 2 + > arch/arm64/include/asm/kvm_ras.h | 20 ++++---- > arch/arm64/kvm/Makefile | 2 +- > arch/arm64/kvm/arm.c | 5 ++ > arch/arm64/kvm/kvm_ras.c | 77 +++++++++++++++++++++++++++++++ > arch/arm64/kvm/mmu.c | 8 +--- > include/uapi/linux/kvm.h | 1 + > 7 files changed, 98 insertions(+), 17 deletions(-) > create mode 100644 arch/arm64/kvm/kvm_ras.c > > diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h > index bf64fed9820ea..eb37a2489411a 100644 > --- a/arch/arm64/include/asm/kvm_host.h > +++ b/arch/arm64/include/asm/kvm_host.h > @@ -334,6 +334,8 @@ struct kvm_arch { > /* Fine-Grained UNDEF initialised */ > #define KVM_ARCH_FLAG_FGU_INITIALIZED 8 > unsigned long flags; > + /* Instead of injecting SError into guest, SIGBUS VMM */ > +#define KVM_ARCH_FLAG_SIGBUS_ON_SEA 9 nit: why do you put this definition out of sequence (below 'flags')? > > /* VM-wide vCPU feature set */ > DECLARE_BITMAP(vcpu_features, KVM_VCPU_MAX_FEATURES); > diff --git a/arch/arm64/include/asm/kvm_ras.h b/arch/arm64/include/asm/kvm_ras.h > index 87e10d9a635b5..4bb7a424e3f6c 100644 > --- a/arch/arm64/include/asm/kvm_ras.h > +++ b/arch/arm64/include/asm/kvm_ras.h > @@ -11,15 +11,17 @@ > #include <asm/acpi.h> > > /* > - * Was this synchronous external abort a RAS notification? > - * Returns '0' for errors handled by some RAS subsystem, or -ENOENT. > + * Handle synchronous external abort (SEA) in the following order: > + * 1. Delegate to APEI/GHES to see if SEA can be claimed by them. If so, we > + * are all done. > + * 2. If userspace opts in KVM_CAP_ARM_SIGBUS_ON_SEA, and if the SEA is NOT > + * about translation table, send SIGBUS > + * - si_code is BUS_OBJERR. > + * - si_addr will be 0 when accurate HVA is unavailable. > + * 3. Otherwise, directly inject an async SError to guest. > + * > + * Note this applies to both ESR_ELx_EC_IABT_* and ESR_ELx_EC_DABT_*. > */ > -static inline int kvm_handle_guest_sea(phys_addr_t addr, u64 esr) > -{ > - /* apei_claim_sea(NULL) expects to mask interrupts itself */ > - lockdep_assert_irqs_enabled(); > - > - return apei_claim_sea(NULL); > -} > +void kvm_handle_guest_sea(struct kvm_vcpu *vcpu); > > #endif /* __ARM64_KVM_RAS_H__ */ > diff --git a/arch/arm64/kvm/Makefile b/arch/arm64/kvm/Makefile > index 3cf7adb2b5038..c4a3a6d4870e6 100644 > --- a/arch/arm64/kvm/Makefile > +++ b/arch/arm64/kvm/Makefile > @@ -23,7 +23,7 @@ kvm-y += arm.o mmu.o mmio.o psci.o hypercalls.o pvtime.o \ > vgic/vgic-v3.o vgic/vgic-v4.o \ > vgic/vgic-mmio.o vgic/vgic-mmio-v2.o \ > vgic/vgic-mmio-v3.o vgic/vgic-kvm-device.o \ > - vgic/vgic-its.o vgic/vgic-debug.o > + vgic/vgic-its.o vgic/vgic-debug.o kvm_ras.o > > kvm-$(CONFIG_HW_PERF_EVENTS) += pmu-emul.o pmu.o > kvm-$(CONFIG_ARM64_PTR_AUTH) += pauth.o > diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c > index 48cafb65d6acf..bb97ad678dbec 100644 > --- a/arch/arm64/kvm/arm.c > +++ b/arch/arm64/kvm/arm.c > @@ -151,6 +151,10 @@ int kvm_vm_ioctl_enable_cap(struct kvm *kvm, > } > mutex_unlock(&kvm->slots_lock); > break; > + case KVM_CAP_ARM_SIGBUS_ON_SEA: > + r = 0; > + set_bit(KVM_ARCH_FLAG_SIGBUS_ON_SEA, &kvm->arch.flags); Shouldn't this be somehow gated on the VM being RAS aware? > + break; > default: > break; > } > @@ -339,6 +343,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext) > case KVM_CAP_ARM_SYSTEM_SUSPEND: > case KVM_CAP_IRQFD_RESAMPLE: > case KVM_CAP_COUNTER_OFFSET: > + case KVM_CAP_ARM_SIGBUS_ON_SEA: > r = 1; > break; > case KVM_CAP_SET_GUEST_DEBUG2: > diff --git a/arch/arm64/kvm/kvm_ras.c b/arch/arm64/kvm/kvm_ras.c > new file mode 100644 > index 0000000000000..3225462bcbcda > --- /dev/null > +++ b/arch/arm64/kvm/kvm_ras.c > @@ -0,0 +1,77 @@ > +// SPDX-License-Identifier: GPL-2.0-only > + > +#include <linux/bitops.h> > +#include <linux/kvm_host.h> > + > +#include <asm/kvm_emulate.h> > +#include <asm/kvm_ras.h> > +#include <asm/system_misc.h> > + > +/* > + * For synchrnous external instruction or data abort, not on translation > + * table walk or hardware update of translation table, is FAR_EL2 valid? > + */ > +static inline bool kvm_vcpu_sea_far_valid(const struct kvm_vcpu *vcpu) > +{ > + return !(vcpu->arch.fault.esr_el2 & ESR_ELx_FnV); > +} > + > +/* > + * Was this synchronous external abort a RAS notification? > + * Returns '0' for errors handled by some RAS subsystem, or -ENOENT. > + */ > +static int kvm_delegate_guest_sea(phys_addr_t addr, u64 esr) > +{ > + /* apei_claim_sea(NULL) expects to mask interrupts itself */ > + lockdep_assert_irqs_enabled(); > + return apei_claim_sea(NULL); > +} > + > +void kvm_handle_guest_sea(struct kvm_vcpu *vcpu) > +{ > + bool sigbus_on_sea; > + int idx; > + u64 vcpu_esr = kvm_vcpu_get_esr(vcpu); > + u8 fsc = kvm_vcpu_trap_get_fault(vcpu); > + phys_addr_t fault_ipa = kvm_vcpu_get_fault_ipa(vcpu); > + gfn_t gfn = fault_ipa >> PAGE_SHIFT; > + /* When FnV is set, send 0 as si_addr like what do_sea() does. */ > + unsigned long hva = 0UL; > + > + /* > + * For RAS the host kernel may handle this abort. > + * There is no need to SIGBUS VMM, or pass the error into the guest. > + */ > + if (kvm_delegate_guest_sea(fault_ipa, vcpu_esr) == 0) > + return; > + > + sigbus_on_sea = test_bit(KVM_ARCH_FLAG_SIGBUS_ON_SEA, > + &(vcpu->kvm->arch.flags)); > + > + /* > + * In addition to userspace opt-in, SIGBUS only makes sense if the > + * abort is NOT about translation table walk and NOT about hardware > + * update of translation table. > + */ > + sigbus_on_sea &= (fsc == ESR_ELx_FSC_EXTABT || fsc == ESR_ELx_FSC_SECC); > + > + /* Pass the error directly into the guest. */ > + if (!sigbus_on_sea) { > + kvm_inject_vabt(vcpu); > + return; > + } > + > + if (kvm_vcpu_sea_far_valid(vcpu)) { > + idx = srcu_read_lock(&vcpu->kvm->srcu); > + hva = gfn_to_hva(vcpu->kvm, gfn); > + srcu_read_unlock(&vcpu->kvm->srcu, idx); > + } > + > + /* > + * Send a SIGBUS BUS_OBJERR to vCPU thread (the userspace thread that > + * runs KVM_RUN) or VMM, which aligns with what host kernel do_sea() > + * does if apei_claim_sea() fails. > + */ > + arm64_notify_die("synchronous external abort", > + current_pt_regs(), SIGBUS, BUS_OBJERR, hva, vcpu_esr); This is the point where I really think we should simply trigger an exit with all that syndrome information stashed in kvm_run, like any other event requiring userspace help. Also: where is the documentation? Thanks, M.
Hi Marc, thanks for your quick response! On Fri, Nov 1, 2024 at 6:54 AM Marc Zyngier <maz@kernel.org> wrote: > > On Thu, 31 Oct 2024 21:21:04 +0000, > Jiaqi Yan <jiaqiyan@google.com> wrote: > > > > Currently KVM handles SEA in guest by injecting async SError into > > guest directly, bypassing VMM, usually results in guest kernel panic. > > > > One major situation of guest SEA is when vCPU consumes uncorrectable > > memory error on the physical memory. Although SError and guest kernel > > panic effectively stops the propagation of corrupted memory, it is not > > easy for VMM and guest to recover from memory error in a more graceful > > manner. > > > > Alternatively KVM can send a SIGBUS BUS_OBJERR to VMM/vCPU, just like > > how core kernel signals SIGBUS BUS_OBJERR to the poison consuming > > thread. > > Can you elaborate on why the delivery of a signal is preferable to > simply exiting back to userspace with a description of the error? > Signals are usually not generated by KVM, and are a pretty twisted way > to generate an exit. A couple of reasons. First, my intuition is that KVM and core kernel (do_sea) should have aligned behavior when APEI failed to claim the SEA. Second, if we only talk about SEA caused by memory poison consumption, both arm64 and x86 KVM already send SIGBUS to VMM/vCPU thread (kvm_send_hwpoison_signal) to signal hardware memory failure, although the situation is slightly different here, where we have a hardware event, versus a HWPoison flag check or VM_FAULT_HWPOISON returned. But from VMM/vCPU's perspective, hardware event or software level VM_FAULT_HWPOISON, it would be nice if it can react to just the same event, the SIGBUS signal. And there is another reason around your comment on arm64_notify_die. By "exiting back to userspace with a description of the error", are you suggesting KVM_EXIT_MEMORY_FAULT? If so, we may need to add a new flag to tell VMM the error is hardware memory poison, which could be KVM_MEMORY_EXIT_FLAG_USERFAULT[1] if we don't want a specific one (but I think a specific flag for hwpoison is probably clearer). > > > In addition to the benifit that KVM's handling for SEA becomes aligned > > with core kernel behavior > > - The blast radius in VM can be limited to only the consuming thread > > in guest, instead of entire guest kernel, unless the consumption is > > from guest kernel. > > - VMM now has the chance to do its duties to stop the VM from repeatedly > > consuming corrupted data. For example, VMM can unmap the guest page > > from stage-2 table to intercept forseen memory poison consumption, > > Not quite. The VMM doesn't manage stage-2. It can remove the page from > the VMA if it has it mapped, but that's it. The kernel deals with S2. I should probably not mention the implementation, "unmap from S2". What is needed for preventing repeated consuming memory poison is simply preventing guest access to certain memory pages. There is a work in progress KVM API [1] by my colleague James. [1] https://lpc.events/event/18/contributions/1757/attachments/1442/3073/LPC_%20KVM%20Userfault.pdf > > Which brings me to the next subject: when the kernel unmaps the page > at S2, it is likely to use CMOs. Can these CMOs create RAS error > themselves? I assume CMO here means writing dirty cache lines to memory. Writing something new to a poisoned cacheline usually won't cause RAS error. Notifying memory poison usually is delayed to a memory load transaction. > > > and for every consumption injects SEA to EL1 with synthetic memory > > error CPER. > > Why do we need to involve ACPI here? I would expect the production of > an architected error record instead. Or at least be given the option. Sorry, I was just mentioning a specific VMM's implementation. There are definitely multiple options (Machine Check MSRs vs CPER for error description data, SEA vs SDEI vs SPI for notification mechanisms) for how VMM involves the guest to handle memory error. My preference is: VMM populates CPER in guest HEST when VMM instructs KVM to inject i/dabt to the guest. And a word about "be given the option": I think when VMM receives SIGBUS with si_addr=faulted/poisoned HVA, it's got all these options, like using the si_addr to construct CPER with poisoned guest physical address, or mci_address MSR. > > > Introduce a new KVM ARM capability KVM_CAP_ARM_SIGBUS_ON_SEA. VMM > > can opt in this new capability if it prefers SIGBUS than SError > > injection during VM init. Now SEA handling in KVM works as follows: > > 1. Delegate to APEI/GHES to see if SEA can be claimed by them. > > 2. If APEI failed to claim the SEA and KVM_CAP_ARM_SIGBUS_ON_SEA is > > enabled for the VM, and the SEA is NOT about translation table, > > send SIGBUS BUS_OBJERR signal with host virtual address. > > And what if it is? S1 PTs are backed by userspace memory, like > anything else. I don't think we should have a different treatment of > those, because the HW wouldn't treat them differently either. You are talking about ESR_ELx_FSC_SEA_TTW(1), or ESR_ELx_FSC_SEA_TTW(0), right? I think you are right, S1 is no difference. But I think we want to make an exception for SEA about S2 PTs. > > > 3. Otherwise directly inject async SError to guest. > > > > Tested on a machine running Siryn AmpereOne processor. A in-house VMM > > that opts in KVM_CAP_ARM_SIGBUS_ON_SEA started a VM. A dummy application > > in VM allocated some memory buffer. The test used EINJ to inject an > > uncorrectable recoverable memory error at a page in the allocated memory > > buffer. The dummy application then consumed the memory error. Some hack > > was done to make core kernel's memory_failure triggered by poison > > generation to fail, so KVM had to deal with the SEA guest abort due to > > poison consumption. vCPU thread in VMM received SIGBUS BUS_OBJERR with > > valid host virtual address of the poisoned page. VMM then injected a SEA > > into guest using KVM_SET_VCPU_EVENTS with ext_dabt_pending=1. At last > > the dummy application in guest was killed by SIGBUS BUS_OBJERR, while the > > guest survived and continued to run. > > > > Signed-off-by: Jiaqi Yan <jiaqiyan@google.com> > > --- > > arch/arm64/include/asm/kvm_host.h | 2 + > > arch/arm64/include/asm/kvm_ras.h | 20 ++++---- > > arch/arm64/kvm/Makefile | 2 +- > > arch/arm64/kvm/arm.c | 5 ++ > > arch/arm64/kvm/kvm_ras.c | 77 +++++++++++++++++++++++++++++++ > > arch/arm64/kvm/mmu.c | 8 +--- > > include/uapi/linux/kvm.h | 1 + > > 7 files changed, 98 insertions(+), 17 deletions(-) > > create mode 100644 arch/arm64/kvm/kvm_ras.c > > > > diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h > > index bf64fed9820ea..eb37a2489411a 100644 > > --- a/arch/arm64/include/asm/kvm_host.h > > +++ b/arch/arm64/include/asm/kvm_host.h > > @@ -334,6 +334,8 @@ struct kvm_arch { > > /* Fine-Grained UNDEF initialised */ > > #define KVM_ARCH_FLAG_FGU_INITIALIZED 8 > > unsigned long flags; > > + /* Instead of injecting SError into guest, SIGBUS VMM */ > > +#define KVM_ARCH_FLAG_SIGBUS_ON_SEA 9 > > nit: why do you put this definition out of sequence (below 'flags')? Ah, I will move it on top of flags. > > > > > /* VM-wide vCPU feature set */ > > DECLARE_BITMAP(vcpu_features, KVM_VCPU_MAX_FEATURES); > > diff --git a/arch/arm64/include/asm/kvm_ras.h b/arch/arm64/include/asm/kvm_ras.h > > index 87e10d9a635b5..4bb7a424e3f6c 100644 > > --- a/arch/arm64/include/asm/kvm_ras.h > > +++ b/arch/arm64/include/asm/kvm_ras.h > > @@ -11,15 +11,17 @@ > > #include <asm/acpi.h> > > > > /* > > - * Was this synchronous external abort a RAS notification? > > - * Returns '0' for errors handled by some RAS subsystem, or -ENOENT. > > + * Handle synchronous external abort (SEA) in the following order: > > + * 1. Delegate to APEI/GHES to see if SEA can be claimed by them. If so, we > > + * are all done. > > + * 2. If userspace opts in KVM_CAP_ARM_SIGBUS_ON_SEA, and if the SEA is NOT > > + * about translation table, send SIGBUS > > + * - si_code is BUS_OBJERR. > > + * - si_addr will be 0 when accurate HVA is unavailable. > > + * 3. Otherwise, directly inject an async SError to guest. > > + * > > + * Note this applies to both ESR_ELx_EC_IABT_* and ESR_ELx_EC_DABT_*. > > */ > > -static inline int kvm_handle_guest_sea(phys_addr_t addr, u64 esr) > > -{ > > - /* apei_claim_sea(NULL) expects to mask interrupts itself */ > > - lockdep_assert_irqs_enabled(); > > - > > - return apei_claim_sea(NULL); > > -} > > +void kvm_handle_guest_sea(struct kvm_vcpu *vcpu); > > > > #endif /* __ARM64_KVM_RAS_H__ */ > > diff --git a/arch/arm64/kvm/Makefile b/arch/arm64/kvm/Makefile > > index 3cf7adb2b5038..c4a3a6d4870e6 100644 > > --- a/arch/arm64/kvm/Makefile > > +++ b/arch/arm64/kvm/Makefile > > @@ -23,7 +23,7 @@ kvm-y += arm.o mmu.o mmio.o psci.o hypercalls.o pvtime.o \ > > vgic/vgic-v3.o vgic/vgic-v4.o \ > > vgic/vgic-mmio.o vgic/vgic-mmio-v2.o \ > > vgic/vgic-mmio-v3.o vgic/vgic-kvm-device.o \ > > - vgic/vgic-its.o vgic/vgic-debug.o > > + vgic/vgic-its.o vgic/vgic-debug.o kvm_ras.o > > > > kvm-$(CONFIG_HW_PERF_EVENTS) += pmu-emul.o pmu.o > > kvm-$(CONFIG_ARM64_PTR_AUTH) += pauth.o > > diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c > > index 48cafb65d6acf..bb97ad678dbec 100644 > > --- a/arch/arm64/kvm/arm.c > > +++ b/arch/arm64/kvm/arm.c > > @@ -151,6 +151,10 @@ int kvm_vm_ioctl_enable_cap(struct kvm *kvm, > > } > > mutex_unlock(&kvm->slots_lock); > > break; > > + case KVM_CAP_ARM_SIGBUS_ON_SEA: > > + r = 0; > > + set_bit(KVM_ARCH_FLAG_SIGBUS_ON_SEA, &kvm->arch.flags); > > Shouldn't this be somehow gated on the VM being RAS aware? Do you mean a CAP that VMM can tell KVM the VM guest has RAS ability? I don't know if there is one for arm64. On x86 there is KVM_X86_SETUP_MCE. KVM_CAP_ARM_INJECT_EXT_DABT maybe a revelant one but I don't think it is exactly the one for "RAS ability". > > > + break; > > default: > > break; > > } > > @@ -339,6 +343,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext) > > case KVM_CAP_ARM_SYSTEM_SUSPEND: > > case KVM_CAP_IRQFD_RESAMPLE: > > case KVM_CAP_COUNTER_OFFSET: > > + case KVM_CAP_ARM_SIGBUS_ON_SEA: > > r = 1; > > break; > > case KVM_CAP_SET_GUEST_DEBUG2: > > diff --git a/arch/arm64/kvm/kvm_ras.c b/arch/arm64/kvm/kvm_ras.c > > new file mode 100644 > > index 0000000000000..3225462bcbcda > > --- /dev/null > > +++ b/arch/arm64/kvm/kvm_ras.c > > @@ -0,0 +1,77 @@ > > +// SPDX-License-Identifier: GPL-2.0-only > > + > > +#include <linux/bitops.h> > > +#include <linux/kvm_host.h> > > + > > +#include <asm/kvm_emulate.h> > > +#include <asm/kvm_ras.h> > > +#include <asm/system_misc.h> > > + > > +/* > > + * For synchrnous external instruction or data abort, not on translation > > + * table walk or hardware update of translation table, is FAR_EL2 valid? > > + */ > > +static inline bool kvm_vcpu_sea_far_valid(const struct kvm_vcpu *vcpu) > > +{ > > + return !(vcpu->arch.fault.esr_el2 & ESR_ELx_FnV); > > +} > > + > > +/* > > + * Was this synchronous external abort a RAS notification? > > + * Returns '0' for errors handled by some RAS subsystem, or -ENOENT. > > + */ > > +static int kvm_delegate_guest_sea(phys_addr_t addr, u64 esr) > > +{ > > + /* apei_claim_sea(NULL) expects to mask interrupts itself */ > > + lockdep_assert_irqs_enabled(); > > + return apei_claim_sea(NULL); > > +} > > + > > +void kvm_handle_guest_sea(struct kvm_vcpu *vcpu) > > +{ > > + bool sigbus_on_sea; > > + int idx; > > + u64 vcpu_esr = kvm_vcpu_get_esr(vcpu); > > + u8 fsc = kvm_vcpu_trap_get_fault(vcpu); > > + phys_addr_t fault_ipa = kvm_vcpu_get_fault_ipa(vcpu); > > + gfn_t gfn = fault_ipa >> PAGE_SHIFT; > > + /* When FnV is set, send 0 as si_addr like what do_sea() does. */ > > + unsigned long hva = 0UL; > > + > > + /* > > + * For RAS the host kernel may handle this abort. > > + * There is no need to SIGBUS VMM, or pass the error into the guest. > > + */ > > + if (kvm_delegate_guest_sea(fault_ipa, vcpu_esr) == 0) > > + return; > > + > > + sigbus_on_sea = test_bit(KVM_ARCH_FLAG_SIGBUS_ON_SEA, > > + &(vcpu->kvm->arch.flags)); > > + > > + /* > > + * In addition to userspace opt-in, SIGBUS only makes sense if the > > + * abort is NOT about translation table walk and NOT about hardware > > + * update of translation table. > > + */ > > + sigbus_on_sea &= (fsc == ESR_ELx_FSC_EXTABT || fsc == ESR_ELx_FSC_SECC); > > + > > + /* Pass the error directly into the guest. */ > > + if (!sigbus_on_sea) { > > + kvm_inject_vabt(vcpu); > > + return; > > + } > > + > > + if (kvm_vcpu_sea_far_valid(vcpu)) { > > + idx = srcu_read_lock(&vcpu->kvm->srcu); > > + hva = gfn_to_hva(vcpu->kvm, gfn); > > + srcu_read_unlock(&vcpu->kvm->srcu, idx); > > + } > > + > > + /* > > + * Send a SIGBUS BUS_OBJERR to vCPU thread (the userspace thread that > > + * runs KVM_RUN) or VMM, which aligns with what host kernel do_sea() > > + * does if apei_claim_sea() fails. > > + */ > > + arm64_notify_die("synchronous external abort", > > + current_pt_regs(), SIGBUS, BUS_OBJERR, hva, vcpu_esr); > > This is the point where I really think we should simply trigger an > exit with all that syndrome information stashed in kvm_run, like any > other event requiring userspace help. Ah, there is another reason SIGBUS is better than kvm exit: "It is a programming error to set ext_dabt_pending after an exit which was not either KVM_EXIT_MMIO or KVM_EXIT_ARM_NISV", from Documentation/virt/kvm/api.rst. So if VMM is allowed to inject data abort to guest, at least current documentation doesn't suggest kvm exit is feasible. > > Also: where is the documentation? Once I get more positive feedback and send out PATCH instead of RFC, I can add to Documentation/virt/kvm/api.rst. > > Thanks, > > M. > > -- > Without deviation from the norm, progress is not possible.
Hi Oliver, Sorry for getting back to you late... On Thu, Oct 31, 2024 at 2:46 PM Oliver Upton <oliver.upton@linux.dev> wrote: > > Hi Jiaqi, > > Thank you for sending this out. > > On Thu, Oct 31, 2024 at 09:21:04PM +0000, Jiaqi Yan wrote: > > Currently KVM handles SEA in guest by injecting async SError into > > guest directly, bypassing VMM, usually results in guest kernel panic. > > > > One major situation of guest SEA is when vCPU consumes uncorrectable > > memory error on the physical memory. Although SError and guest kernel > > panic effectively stops the propagation of corrupted memory, it is not > > easy for VMM and guest to recover from memory error in a more graceful > > manner. > > > > Alternatively KVM can send a SIGBUS BUS_OBJERR to VMM/vCPU, just like > > how core kernel signals SIGBUS BUS_OBJERR to the poison consuming > > thread. > > In addition to the benifit that KVM's handling for SEA becomes aligned > > with core kernel behavior > > - The blast radius in VM can be limited to only the consuming thread > > in guest, instead of entire guest kernel, unless the consumption is > > from guest kernel. > > - VMM now has the chance to do its duties to stop the VM from repeatedly > > consuming corrupted data. For example, VMM can unmap the guest page > > from stage-2 table to intercept forseen memory poison consumption, > > and for every consumption injects SEA to EL1 with synthetic memory > > error CPER. > > > > Introduce a new KVM ARM capability KVM_CAP_ARM_SIGBUS_ON_SEA. VMM > > can opt in this new capability if it prefers SIGBUS than SError > > injection during VM init. Now SEA handling in KVM works as follows: > > I'm somewhat tempted to force the new behavior on userspace > unconditionally. Working back from an unexpected SError in the VM to the > KVM SEA handler is a bit of a mess, and can be annoying if the operator > can't access console logs of the VM. Ack, I also think involving VMM is preferable than injecting SError directly to guest. > > As it stands today, UAPI expectations around SEAs are platform > dependent. If APEI claims the SEA and decides to offline a page, the > user will get a SIGBUS. > > So sending a SIGBUS for the case that firmware _doesn't_ claim the SEA > seems like a good move from a consistency PoV. But it is a decently-sized > change to do without explicit buy-in from userspace so let's see what > others think. Sounds good, I will wait for a couple of more days, and if the opt-in part isn't necessary to anyone else, I will remove it from the next revision. > > > 1. Delegate to APEI/GHES to see if SEA can be claimed by them. > > 2. If APEI failed to claim the SEA and KVM_CAP_ARM_SIGBUS_ON_SEA is > > enabled for the VM, and the SEA is NOT about translation table, > > send SIGBUS BUS_OBJERR signal with host virtual address. > > 3. Otherwise directly inject async SError to guest. > > The other reason I'm a bit lukewarm on user buy in is the UAPI suffers > from the same issue we do today: it depends on the platform. If the SEA > is claimed by APEI/GHES then the cap does nothing. Good point, yeah, the path of KVM handling SEA and sending SIGBUS should be treated as fallback code to APEI/GHES. > > > +static int kvm_delegate_guest_sea(phys_addr_t addr, u64 esr) > > +{ > > + /* apei_claim_sea(NULL) expects to mask interrupts itself */ > > + lockdep_assert_irqs_enabled(); > > + return apei_claim_sea(NULL); > > +} > > Consider dropping parameters from this since they're unused. Ack, will do in the next revision. > > > +void kvm_handle_guest_sea(struct kvm_vcpu *vcpu) > > +{ > > + bool sigbus_on_sea; > > + int idx; > > + u64 vcpu_esr = kvm_vcpu_get_esr(vcpu); > > + u8 fsc = kvm_vcpu_trap_get_fault(vcpu); > > + phys_addr_t fault_ipa = kvm_vcpu_get_fault_ipa(vcpu); > > + gfn_t gfn = fault_ipa >> PAGE_SHIFT; > > + /* When FnV is set, send 0 as si_addr like what do_sea() does. */ > > + unsigned long hva = 0UL; > > + > > + /* > > + * For RAS the host kernel may handle this abort. > > + * There is no need to SIGBUS VMM, or pass the error into the guest. > > + */ > > + if (kvm_delegate_guest_sea(fault_ipa, vcpu_esr) == 0) > > + return; > > + > > + sigbus_on_sea = test_bit(KVM_ARCH_FLAG_SIGBUS_ON_SEA, > > + &(vcpu->kvm->arch.flags)); > > + > > + /* > > + * In addition to userspace opt-in, SIGBUS only makes sense if the > > + * abort is NOT about translation table walk and NOT about hardware > > + * update of translation table. > > + */ > > + sigbus_on_sea &= (fsc == ESR_ELx_FSC_EXTABT || fsc == ESR_ELx_FSC_SECC); > > Is this because we potentially can't determine a valid HVA for the > fault? Maybe these should go out to userspace still with si_addr = 0. No, it is not related to the availability of a valid HVA. In this patch, as long as it decides to sigbus_on_sea, si_addr can be 0 if a valid HVA isn't available when !kvm_vcpu_sea_far_valid (OR when HPFAR_EL2 cannot be translated from a valid FAR_EL2, which I think requires some improvement). The code here wants to limit SIGBUS _BUS_OBJERR_ to only SEA and parity+ECC error, similar to the code here (for SEA) https://elixir.bootlin.com/linux/v6.11.6/source/arch/arm64/mm/fault.c#L771 and here (for synchronous parity or ECC error) https://elixir.bootlin.com/linux/v6.11.6/source/arch/arm64/mm/fault.c#L779. > > -- > Thanks, > Oliver
On Sun, Nov 3, 2024 at 9:02 PM Jiaqi Yan <jiaqiyan@google.com> wrote: > > Hi Marc, thanks for your quick response! > > On Fri, Nov 1, 2024 at 6:54 AM Marc Zyngier <maz@kernel.org> wrote: > > > > On Thu, 31 Oct 2024 21:21:04 +0000, > > Jiaqi Yan <jiaqiyan@google.com> wrote: > > > > > > Currently KVM handles SEA in guest by injecting async SError into > > > guest directly, bypassing VMM, usually results in guest kernel panic. > > > > > > One major situation of guest SEA is when vCPU consumes uncorrectable > > > memory error on the physical memory. Although SError and guest kernel > > > panic effectively stops the propagation of corrupted memory, it is not > > > easy for VMM and guest to recover from memory error in a more graceful > > > manner. > > > > > > Alternatively KVM can send a SIGBUS BUS_OBJERR to VMM/vCPU, just like > > > how core kernel signals SIGBUS BUS_OBJERR to the poison consuming > > > thread. > > > > Can you elaborate on why the delivery of a signal is preferable to > > simply exiting back to userspace with a description of the error? > > Signals are usually not generated by KVM, and are a pretty twisted way > > to generate an exit. > > A couple of reasons. First, my intuition is that KVM and core kernel > (do_sea) should have aligned behavior when APEI failed to claim the > SEA. Second, if we only talk about SEA caused by memory poison > consumption, both arm64 and x86 KVM already send SIGBUS to VMM/vCPU > thread (kvm_send_hwpoison_signal) to signal hardware memory failure, > although the situation is slightly different here, where we have a > hardware event, versus a HWPoison flag check or VM_FAULT_HWPOISON > returned. But from VMM/vCPU's perspective, hardware event or software > level VM_FAULT_HWPOISON, it would be nice if it can react to just the > same event, the SIGBUS signal. > > And there is another reason around your comment on arm64_notify_die. > > By "exiting back to userspace with a description of the error", are > you suggesting KVM_EXIT_MEMORY_FAULT? If so, we may need to add a new > flag to tell VMM the error is hardware memory poison, which could be > KVM_MEMORY_EXIT_FLAG_USERFAULT[1] if we don't want a specific one (but > I think a specific flag for hwpoison is probably clearer). > > > > > > In addition to the benifit that KVM's handling for SEA becomes aligned > > > with core kernel behavior > > > - The blast radius in VM can be limited to only the consuming thread > > > in guest, instead of entire guest kernel, unless the consumption is > > > from guest kernel. > > > - VMM now has the chance to do its duties to stop the VM from repeatedly > > > consuming corrupted data. For example, VMM can unmap the guest page > > > from stage-2 table to intercept forseen memory poison consumption, > > > > Not quite. The VMM doesn't manage stage-2. It can remove the page from > > the VMA if it has it mapped, but that's it. The kernel deals with S2. > > I should probably not mention the implementation, "unmap from S2". > What is needed for preventing repeated consuming memory poison is > simply preventing guest access to certain memory pages. There is a > work in progress KVM API [1] by my colleague James. > > [1] https://lpc.events/event/18/contributions/1757/attachments/1442/3073/LPC_%20KVM%20Userfault.pdf > > > > > Which brings me to the next subject: when the kernel unmaps the page > > at S2, it is likely to use CMOs. Can these CMOs create RAS error > > themselves? > > I assume CMO here means writing dirty cache lines to memory. Writing > something new to a poisoned cacheline usually won't cause RAS error. > Notifying memory poison usually is delayed to a memory load > transaction. > > > > > > and for every consumption injects SEA to EL1 with synthetic memory > > > error CPER. > > > > Why do we need to involve ACPI here? I would expect the production of > > an architected error record instead. Or at least be given the option. > > Sorry, I was just mentioning a specific VMM's implementation. There > are definitely multiple options (Machine Check MSRs vs CPER for error > description data, SEA vs SDEI vs SPI for notification mechanisms) for > how VMM involves the guest to handle memory error. My preference is: > VMM populates CPER in guest HEST when VMM instructs KVM to inject > i/dabt to the guest. > > And a word about "be given the option": I think when VMM receives > SIGBUS with si_addr=faulted/poisoned HVA, it's got all these options, > like using the si_addr to construct CPER with poisoned guest physical > address, or mci_address MSR. > > > > > > Introduce a new KVM ARM capability KVM_CAP_ARM_SIGBUS_ON_SEA. VMM > > > can opt in this new capability if it prefers SIGBUS than SError > > > injection during VM init. Now SEA handling in KVM works as follows: > > > 1. Delegate to APEI/GHES to see if SEA can be claimed by them. > > > 2. If APEI failed to claim the SEA and KVM_CAP_ARM_SIGBUS_ON_SEA is > > > enabled for the VM, and the SEA is NOT about translation table, > > > send SIGBUS BUS_OBJERR signal with host virtual address. > > > > And what if it is? S1 PTs are backed by userspace memory, like > > anything else. I don't think we should have a different treatment of > > those, because the HW wouldn't treat them differently either. > > You are talking about ESR_ELx_FSC_SEA_TTW(1), or > ESR_ELx_FSC_SEA_TTW(0), right? I think you are right, S1 is no > difference. > > But I think we want to make an exception for SEA about S2 PTs. > > > > > > 3. Otherwise directly inject async SError to guest. > > > > > > Tested on a machine running Siryn AmpereOne processor. A in-house VMM > > > that opts in KVM_CAP_ARM_SIGBUS_ON_SEA started a VM. A dummy application > > > in VM allocated some memory buffer. The test used EINJ to inject an > > > uncorrectable recoverable memory error at a page in the allocated memory > > > buffer. The dummy application then consumed the memory error. Some hack > > > was done to make core kernel's memory_failure triggered by poison > > > generation to fail, so KVM had to deal with the SEA guest abort due to > > > poison consumption. vCPU thread in VMM received SIGBUS BUS_OBJERR with > > > valid host virtual address of the poisoned page. VMM then injected a SEA > > > into guest using KVM_SET_VCPU_EVENTS with ext_dabt_pending=1. At last > > > the dummy application in guest was killed by SIGBUS BUS_OBJERR, while the > > > guest survived and continued to run. > > > > > > Signed-off-by: Jiaqi Yan <jiaqiyan@google.com> > > > --- > > > arch/arm64/include/asm/kvm_host.h | 2 + > > > arch/arm64/include/asm/kvm_ras.h | 20 ++++---- > > > arch/arm64/kvm/Makefile | 2 +- > > > arch/arm64/kvm/arm.c | 5 ++ > > > arch/arm64/kvm/kvm_ras.c | 77 +++++++++++++++++++++++++++++++ > > > arch/arm64/kvm/mmu.c | 8 +--- > > > include/uapi/linux/kvm.h | 1 + > > > 7 files changed, 98 insertions(+), 17 deletions(-) > > > create mode 100644 arch/arm64/kvm/kvm_ras.c > > > > > > diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h > > > index bf64fed9820ea..eb37a2489411a 100644 > > > --- a/arch/arm64/include/asm/kvm_host.h > > > +++ b/arch/arm64/include/asm/kvm_host.h > > > @@ -334,6 +334,8 @@ struct kvm_arch { > > > /* Fine-Grained UNDEF initialised */ > > > #define KVM_ARCH_FLAG_FGU_INITIALIZED 8 > > > unsigned long flags; > > > + /* Instead of injecting SError into guest, SIGBUS VMM */ > > > +#define KVM_ARCH_FLAG_SIGBUS_ON_SEA 9 > > > > nit: why do you put this definition out of sequence (below 'flags')? > > Ah, I will move it on top of flags. > > > > > > > > > /* VM-wide vCPU feature set */ > > > DECLARE_BITMAP(vcpu_features, KVM_VCPU_MAX_FEATURES); > > > diff --git a/arch/arm64/include/asm/kvm_ras.h b/arch/arm64/include/asm/kvm_ras.h > > > index 87e10d9a635b5..4bb7a424e3f6c 100644 > > > --- a/arch/arm64/include/asm/kvm_ras.h > > > +++ b/arch/arm64/include/asm/kvm_ras.h > > > @@ -11,15 +11,17 @@ > > > #include <asm/acpi.h> > > > > > > /* > > > - * Was this synchronous external abort a RAS notification? > > > - * Returns '0' for errors handled by some RAS subsystem, or -ENOENT. > > > + * Handle synchronous external abort (SEA) in the following order: > > > + * 1. Delegate to APEI/GHES to see if SEA can be claimed by them. If so, we > > > + * are all done. > > > + * 2. If userspace opts in KVM_CAP_ARM_SIGBUS_ON_SEA, and if the SEA is NOT > > > + * about translation table, send SIGBUS > > > + * - si_code is BUS_OBJERR. > > > + * - si_addr will be 0 when accurate HVA is unavailable. > > > + * 3. Otherwise, directly inject an async SError to guest. > > > + * > > > + * Note this applies to both ESR_ELx_EC_IABT_* and ESR_ELx_EC_DABT_*. > > > */ > > > -static inline int kvm_handle_guest_sea(phys_addr_t addr, u64 esr) > > > -{ > > > - /* apei_claim_sea(NULL) expects to mask interrupts itself */ > > > - lockdep_assert_irqs_enabled(); > > > - > > > - return apei_claim_sea(NULL); > > > -} > > > +void kvm_handle_guest_sea(struct kvm_vcpu *vcpu); > > > > > > #endif /* __ARM64_KVM_RAS_H__ */ > > > diff --git a/arch/arm64/kvm/Makefile b/arch/arm64/kvm/Makefile > > > index 3cf7adb2b5038..c4a3a6d4870e6 100644 > > > --- a/arch/arm64/kvm/Makefile > > > +++ b/arch/arm64/kvm/Makefile > > > @@ -23,7 +23,7 @@ kvm-y += arm.o mmu.o mmio.o psci.o hypercalls.o pvtime.o \ > > > vgic/vgic-v3.o vgic/vgic-v4.o \ > > > vgic/vgic-mmio.o vgic/vgic-mmio-v2.o \ > > > vgic/vgic-mmio-v3.o vgic/vgic-kvm-device.o \ > > > - vgic/vgic-its.o vgic/vgic-debug.o > > > + vgic/vgic-its.o vgic/vgic-debug.o kvm_ras.o > > > > > > kvm-$(CONFIG_HW_PERF_EVENTS) += pmu-emul.o pmu.o > > > kvm-$(CONFIG_ARM64_PTR_AUTH) += pauth.o > > > diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c > > > index 48cafb65d6acf..bb97ad678dbec 100644 > > > --- a/arch/arm64/kvm/arm.c > > > +++ b/arch/arm64/kvm/arm.c > > > @@ -151,6 +151,10 @@ int kvm_vm_ioctl_enable_cap(struct kvm *kvm, > > > } > > > mutex_unlock(&kvm->slots_lock); > > > break; > > > + case KVM_CAP_ARM_SIGBUS_ON_SEA: > > > + r = 0; > > > + set_bit(KVM_ARCH_FLAG_SIGBUS_ON_SEA, &kvm->arch.flags); > > > > Shouldn't this be somehow gated on the VM being RAS aware? > > Do you mean a CAP that VMM can tell KVM the VM guest has RAS ability? > I don't know if there is one for arm64. On x86 there is > KVM_X86_SETUP_MCE. KVM_CAP_ARM_INJECT_EXT_DABT maybe a revelant one > but I don't think it is exactly the one for "RAS ability". Hi Marc, There is also a kernel config called ARM64_RAS_EXTN but I believe it is for the host CPU, not about VM's RAS. But taking Oliver's opinion into account, I think we want to remove KVM_CAP_ARM_SIGBUS_ON_SEA. Wonder what your thoughts are. Thanks, Jiaqi > > > > > > + break; > > > default: > > > break; > > > } > > > @@ -339,6 +343,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext) > > > case KVM_CAP_ARM_SYSTEM_SUSPEND: > > > case KVM_CAP_IRQFD_RESAMPLE: > > > case KVM_CAP_COUNTER_OFFSET: > > > + case KVM_CAP_ARM_SIGBUS_ON_SEA: > > > r = 1; > > > break; > > > case KVM_CAP_SET_GUEST_DEBUG2: > > > diff --git a/arch/arm64/kvm/kvm_ras.c b/arch/arm64/kvm/kvm_ras.c > > > new file mode 100644 > > > index 0000000000000..3225462bcbcda > > > --- /dev/null > > > +++ b/arch/arm64/kvm/kvm_ras.c > > > @@ -0,0 +1,77 @@ > > > +// SPDX-License-Identifier: GPL-2.0-only > > > + > > > +#include <linux/bitops.h> > > > +#include <linux/kvm_host.h> > > > + > > > +#include <asm/kvm_emulate.h> > > > +#include <asm/kvm_ras.h> > > > +#include <asm/system_misc.h> > > > + > > > +/* > > > + * For synchrnous external instruction or data abort, not on translation > > > + * table walk or hardware update of translation table, is FAR_EL2 valid? > > > + */ > > > +static inline bool kvm_vcpu_sea_far_valid(const struct kvm_vcpu *vcpu) > > > +{ > > > + return !(vcpu->arch.fault.esr_el2 & ESR_ELx_FnV); > > > +} > > > + > > > +/* > > > + * Was this synchronous external abort a RAS notification? > > > + * Returns '0' for errors handled by some RAS subsystem, or -ENOENT. > > > + */ > > > +static int kvm_delegate_guest_sea(phys_addr_t addr, u64 esr) > > > +{ > > > + /* apei_claim_sea(NULL) expects to mask interrupts itself */ > > > + lockdep_assert_irqs_enabled(); > > > + return apei_claim_sea(NULL); > > > +} > > > + > > > +void kvm_handle_guest_sea(struct kvm_vcpu *vcpu) > > > +{ > > > + bool sigbus_on_sea; > > > + int idx; > > > + u64 vcpu_esr = kvm_vcpu_get_esr(vcpu); > > > + u8 fsc = kvm_vcpu_trap_get_fault(vcpu); > > > + phys_addr_t fault_ipa = kvm_vcpu_get_fault_ipa(vcpu); > > > + gfn_t gfn = fault_ipa >> PAGE_SHIFT; > > > + /* When FnV is set, send 0 as si_addr like what do_sea() does. */ > > > + unsigned long hva = 0UL; > > > + > > > + /* > > > + * For RAS the host kernel may handle this abort. > > > + * There is no need to SIGBUS VMM, or pass the error into the guest. > > > + */ > > > + if (kvm_delegate_guest_sea(fault_ipa, vcpu_esr) == 0) > > > + return; > > > + > > > + sigbus_on_sea = test_bit(KVM_ARCH_FLAG_SIGBUS_ON_SEA, > > > + &(vcpu->kvm->arch.flags)); > > > + > > > + /* > > > + * In addition to userspace opt-in, SIGBUS only makes sense if the > > > + * abort is NOT about translation table walk and NOT about hardware > > > + * update of translation table. > > > + */ > > > + sigbus_on_sea &= (fsc == ESR_ELx_FSC_EXTABT || fsc == ESR_ELx_FSC_SECC); > > > + > > > + /* Pass the error directly into the guest. */ > > > + if (!sigbus_on_sea) { > > > + kvm_inject_vabt(vcpu); > > > + return; > > > + } > > > + > > > + if (kvm_vcpu_sea_far_valid(vcpu)) { > > > + idx = srcu_read_lock(&vcpu->kvm->srcu); > > > + hva = gfn_to_hva(vcpu->kvm, gfn); > > > + srcu_read_unlock(&vcpu->kvm->srcu, idx); > > > + } > > > + > > > + /* > > > + * Send a SIGBUS BUS_OBJERR to vCPU thread (the userspace thread that > > > + * runs KVM_RUN) or VMM, which aligns with what host kernel do_sea() > > > + * does if apei_claim_sea() fails. > > > + */ > > > + arm64_notify_die("synchronous external abort", > > > + current_pt_regs(), SIGBUS, BUS_OBJERR, hva, vcpu_esr); > > > > This is the point where I really think we should simply trigger an > > exit with all that syndrome information stashed in kvm_run, like any > > other event requiring userspace help. > > Ah, there is another reason SIGBUS is better than kvm exit: "It is a > programming error to set ext_dabt_pending after an exit which was not > either KVM_EXIT_MMIO or KVM_EXIT_ARM_NISV", from > Documentation/virt/kvm/api.rst. So if VMM is allowed to inject data > abort to guest, at least current documentation doesn't suggest kvm > exit is feasible. > > > > > Also: where is the documentation? > > Once I get more positive feedback and send out PATCH instead of RFC, I > can add to Documentation/virt/kvm/api.rst. > > > > > Thanks, > > > > M. > > > > -- > > Without deviation from the norm, progress is not possible.
On Mon, 04 Nov 2024 05:02:03 +0000, Jiaqi Yan <jiaqiyan@google.com> wrote: > > Hi Marc, thanks for your quick response! > > On Fri, Nov 1, 2024 at 6:54 AM Marc Zyngier <maz@kernel.org> wrote: > > > > On Thu, 31 Oct 2024 21:21:04 +0000, > > Jiaqi Yan <jiaqiyan@google.com> wrote: > > > > > > Currently KVM handles SEA in guest by injecting async SError into > > > guest directly, bypassing VMM, usually results in guest kernel panic. > > > > > > One major situation of guest SEA is when vCPU consumes uncorrectable > > > memory error on the physical memory. Although SError and guest kernel > > > panic effectively stops the propagation of corrupted memory, it is not > > > easy for VMM and guest to recover from memory error in a more graceful > > > manner. > > > > > > Alternatively KVM can send a SIGBUS BUS_OBJERR to VMM/vCPU, just like > > > how core kernel signals SIGBUS BUS_OBJERR to the poison consuming > > > thread. > > > > Can you elaborate on why the delivery of a signal is preferable to > > simply exiting back to userspace with a description of the error? > > Signals are usually not generated by KVM, and are a pretty twisted way > > to generate an exit. > > A couple of reasons. First, my intuition is that KVM and core kernel > (do_sea) should have aligned behavior when APEI failed to claim the > SEA. Second, if we only talk about SEA caused by memory poison > consumption, both arm64 and x86 KVM already send SIGBUS to VMM/vCPU > thread (kvm_send_hwpoison_signal) to signal hardware memory failure, > although the situation is slightly different here, where we have a > hardware event, versus a HWPoison flag check or VM_FAULT_HWPOISON > returned. But from VMM/vCPU's perspective, hardware event or software > level VM_FAULT_HWPOISON, it would be nice if it can react to just the > same event, the SIGBUS signal. > > And there is another reason around your comment on arm64_notify_die. > > By "exiting back to userspace with a description of the error", are > you suggesting KVM_EXIT_MEMORY_FAULT? If so, we may need to add a new > flag to tell VMM the error is hardware memory poison, which could be > KVM_MEMORY_EXIT_FLAG_USERFAULT[1] if we don't want a specific one (but > I think a specific flag for hwpoison is probably clearer). KVM_MEMORY_EXIT_FLAG_USERFAULT is not upstream, and it isn't clear to me if or when it will make it there. But the idea is indeed to treat a RAS error as an exit reason. > > > > > > In addition to the benifit that KVM's handling for SEA becomes aligned > > > with core kernel behavior > > > - The blast radius in VM can be limited to only the consuming thread > > > in guest, instead of entire guest kernel, unless the consumption is > > > from guest kernel. > > > - VMM now has the chance to do its duties to stop the VM from repeatedly > > > consuming corrupted data. For example, VMM can unmap the guest page > > > from stage-2 table to intercept forseen memory poison consumption, > > > > Not quite. The VMM doesn't manage stage-2. It can remove the page from > > the VMA if it has it mapped, but that's it. The kernel deals with S2. > > I should probably not mention the implementation, "unmap from S2". > What is needed for preventing repeated consuming memory poison is > simply preventing guest access to certain memory pages. There is a > work in progress KVM API [1] by my colleague James. > > [1] https://lpc.events/event/18/contributions/1757/attachments/1442/3073/LPC_%20KVM%20Userfault.pdf > > > > > Which brings me to the next subject: when the kernel unmaps the page > > at S2, it is likely to use CMOs. Can these CMOs create RAS error > > themselves? > > I assume CMO here means writing dirty cache lines to memory. Writing > something new to a poisoned cacheline usually won't cause RAS error. > Notifying memory poison usually is delayed to a memory load > transaction. I don't see anything of the sort in the spec. At least R_VGXBJ calls out that: <quote> An error is propagated by the PE by one or more of the following occurring that would not have been permitted to occur had the fault not been activated: * Consumption of the corrupt value by any instruction, propagating the error to the target(s) of the instruction. This includes: - A store of a corrupt value. [...] </quote> So while implementations /may/ only act and deliver an error on a load, that's only an implementation detail. > > > > > > and for every consumption injects SEA to EL1 with synthetic memory > > > error CPER. > > > > Why do we need to involve ACPI here? I would expect the production of > > an architected error record instead. Or at least be given the option. > > Sorry, I was just mentioning a specific VMM's implementation. There > are definitely multiple options (Machine Check MSRs vs CPER for error > description data, SEA vs SDEI vs SPI for notification mechanisms) for > how VMM involves the guest to handle memory error. My preference is: > VMM populates CPER in guest HEST when VMM instructs KVM to inject > i/dabt to the guest. > > And a word about "be given the option": I think when VMM receives > SIGBUS with si_addr=faulted/poisoned HVA, it's got all these options, > like using the si_addr to construct CPER with poisoned guest physical > address, or mci_address MSR. > > > > > > Introduce a new KVM ARM capability KVM_CAP_ARM_SIGBUS_ON_SEA. VMM > > > can opt in this new capability if it prefers SIGBUS than SError > > > injection during VM init. Now SEA handling in KVM works as follows: > > > 1. Delegate to APEI/GHES to see if SEA can be claimed by them. > > > 2. If APEI failed to claim the SEA and KVM_CAP_ARM_SIGBUS_ON_SEA is > > > enabled for the VM, and the SEA is NOT about translation table, > > > send SIGBUS BUS_OBJERR signal with host virtual address. > > > > And what if it is? S1 PTs are backed by userspace memory, like > > anything else. I don't think we should have a different treatment of > > those, because the HW wouldn't treat them differently either. > > You are talking about ESR_ELx_FSC_SEA_TTW(1), or > ESR_ELx_FSC_SEA_TTW(0), right? I think you are right, S1 is no > difference. > > But I think we want to make an exception for SEA about S2 PTs. S2 PTs are owned by the hypervisor, not by userspace, so I'm fine not reporting those through this mechanism. > > > > > > 3. Otherwise directly inject async SError to guest. > > > > > > Tested on a machine running Siryn AmpereOne processor. A in-house VMM > > > that opts in KVM_CAP_ARM_SIGBUS_ON_SEA started a VM. A dummy application > > > in VM allocated some memory buffer. The test used EINJ to inject an > > > uncorrectable recoverable memory error at a page in the allocated memory > > > buffer. The dummy application then consumed the memory error. Some hack > > > was done to make core kernel's memory_failure triggered by poison > > > generation to fail, so KVM had to deal with the SEA guest abort due to > > > poison consumption. vCPU thread in VMM received SIGBUS BUS_OBJERR with > > > valid host virtual address of the poisoned page. VMM then injected a SEA > > > into guest using KVM_SET_VCPU_EVENTS with ext_dabt_pending=1. At last > > > the dummy application in guest was killed by SIGBUS BUS_OBJERR, while the > > > guest survived and continued to run. > > > > > > Signed-off-by: Jiaqi Yan <jiaqiyan@google.com> > > > --- > > > arch/arm64/include/asm/kvm_host.h | 2 + > > > arch/arm64/include/asm/kvm_ras.h | 20 ++++---- > > > arch/arm64/kvm/Makefile | 2 +- > > > arch/arm64/kvm/arm.c | 5 ++ > > > arch/arm64/kvm/kvm_ras.c | 77 +++++++++++++++++++++++++++++++ > > > arch/arm64/kvm/mmu.c | 8 +--- > > > include/uapi/linux/kvm.h | 1 + > > > 7 files changed, 98 insertions(+), 17 deletions(-) > > > create mode 100644 arch/arm64/kvm/kvm_ras.c > > > > > > diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h > > > index bf64fed9820ea..eb37a2489411a 100644 > > > --- a/arch/arm64/include/asm/kvm_host.h > > > +++ b/arch/arm64/include/asm/kvm_host.h > > > @@ -334,6 +334,8 @@ struct kvm_arch { > > > /* Fine-Grained UNDEF initialised */ > > > #define KVM_ARCH_FLAG_FGU_INITIALIZED 8 > > > unsigned long flags; > > > + /* Instead of injecting SError into guest, SIGBUS VMM */ > > > +#define KVM_ARCH_FLAG_SIGBUS_ON_SEA 9 > > > > nit: why do you put this definition out of sequence (below 'flags')? > > Ah, I will move it on top of flags. > > > > > > > > > /* VM-wide vCPU feature set */ > > > DECLARE_BITMAP(vcpu_features, KVM_VCPU_MAX_FEATURES); > > > diff --git a/arch/arm64/include/asm/kvm_ras.h b/arch/arm64/include/asm/kvm_ras.h > > > index 87e10d9a635b5..4bb7a424e3f6c 100644 > > > --- a/arch/arm64/include/asm/kvm_ras.h > > > +++ b/arch/arm64/include/asm/kvm_ras.h > > > @@ -11,15 +11,17 @@ > > > #include <asm/acpi.h> > > > > > > /* > > > - * Was this synchronous external abort a RAS notification? > > > - * Returns '0' for errors handled by some RAS subsystem, or -ENOENT. > > > + * Handle synchronous external abort (SEA) in the following order: > > > + * 1. Delegate to APEI/GHES to see if SEA can be claimed by them. If so, we > > > + * are all done. > > > + * 2. If userspace opts in KVM_CAP_ARM_SIGBUS_ON_SEA, and if the SEA is NOT > > > + * about translation table, send SIGBUS > > > + * - si_code is BUS_OBJERR. > > > + * - si_addr will be 0 when accurate HVA is unavailable. > > > + * 3. Otherwise, directly inject an async SError to guest. > > > + * > > > + * Note this applies to both ESR_ELx_EC_IABT_* and ESR_ELx_EC_DABT_*. > > > */ > > > -static inline int kvm_handle_guest_sea(phys_addr_t addr, u64 esr) > > > -{ > > > - /* apei_claim_sea(NULL) expects to mask interrupts itself */ > > > - lockdep_assert_irqs_enabled(); > > > - > > > - return apei_claim_sea(NULL); > > > -} > > > +void kvm_handle_guest_sea(struct kvm_vcpu *vcpu); > > > > > > #endif /* __ARM64_KVM_RAS_H__ */ > > > diff --git a/arch/arm64/kvm/Makefile b/arch/arm64/kvm/Makefile > > > index 3cf7adb2b5038..c4a3a6d4870e6 100644 > > > --- a/arch/arm64/kvm/Makefile > > > +++ b/arch/arm64/kvm/Makefile > > > @@ -23,7 +23,7 @@ kvm-y += arm.o mmu.o mmio.o psci.o hypercalls.o pvtime.o \ > > > vgic/vgic-v3.o vgic/vgic-v4.o \ > > > vgic/vgic-mmio.o vgic/vgic-mmio-v2.o \ > > > vgic/vgic-mmio-v3.o vgic/vgic-kvm-device.o \ > > > - vgic/vgic-its.o vgic/vgic-debug.o > > > + vgic/vgic-its.o vgic/vgic-debug.o kvm_ras.o > > > > > > kvm-$(CONFIG_HW_PERF_EVENTS) += pmu-emul.o pmu.o > > > kvm-$(CONFIG_ARM64_PTR_AUTH) += pauth.o > > > diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c > > > index 48cafb65d6acf..bb97ad678dbec 100644 > > > --- a/arch/arm64/kvm/arm.c > > > +++ b/arch/arm64/kvm/arm.c > > > @@ -151,6 +151,10 @@ int kvm_vm_ioctl_enable_cap(struct kvm *kvm, > > > } > > > mutex_unlock(&kvm->slots_lock); > > > break; > > > + case KVM_CAP_ARM_SIGBUS_ON_SEA: > > > + r = 0; > > > + set_bit(KVM_ARCH_FLAG_SIGBUS_ON_SEA, &kvm->arch.flags); > > > > Shouldn't this be somehow gated on the VM being RAS aware? > > Do you mean a CAP that VMM can tell KVM the VM guest has RAS ability? > I don't know if there is one for arm64. On x86 there is > KVM_X86_SETUP_MCE. KVM_CAP_ARM_INJECT_EXT_DABT maybe a revelant one > but I don't think it is exactly the one for "RAS ability". Having though about this a bit more, I now think this is independent of the guest supporting RAS. This really is about the VMM asking to be made aware of RAS errors affecting the guest, and it is the signalling back to the guest that needs to be gated by ID_AA64PFR0_EL1.RAS. > > > > > > + break; > > > default: > > > break; > > > } > > > @@ -339,6 +343,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext) > > > case KVM_CAP_ARM_SYSTEM_SUSPEND: > > > case KVM_CAP_IRQFD_RESAMPLE: > > > case KVM_CAP_COUNTER_OFFSET: > > > + case KVM_CAP_ARM_SIGBUS_ON_SEA: > > > r = 1; > > > break; > > > case KVM_CAP_SET_GUEST_DEBUG2: > > > diff --git a/arch/arm64/kvm/kvm_ras.c b/arch/arm64/kvm/kvm_ras.c > > > new file mode 100644 > > > index 0000000000000..3225462bcbcda > > > --- /dev/null > > > +++ b/arch/arm64/kvm/kvm_ras.c > > > @@ -0,0 +1,77 @@ > > > +// SPDX-License-Identifier: GPL-2.0-only > > > + > > > +#include <linux/bitops.h> > > > +#include <linux/kvm_host.h> > > > + > > > +#include <asm/kvm_emulate.h> > > > +#include <asm/kvm_ras.h> > > > +#include <asm/system_misc.h> > > > + > > > +/* > > > + * For synchrnous external instruction or data abort, not on translation > > > + * table walk or hardware update of translation table, is FAR_EL2 valid? > > > + */ > > > +static inline bool kvm_vcpu_sea_far_valid(const struct kvm_vcpu *vcpu) > > > +{ > > > + return !(vcpu->arch.fault.esr_el2 & ESR_ELx_FnV); > > > +} > > > + > > > +/* > > > + * Was this synchronous external abort a RAS notification? > > > + * Returns '0' for errors handled by some RAS subsystem, or -ENOENT. > > > + */ > > > +static int kvm_delegate_guest_sea(phys_addr_t addr, u64 esr) > > > +{ > > > + /* apei_claim_sea(NULL) expects to mask interrupts itself */ > > > + lockdep_assert_irqs_enabled(); > > > + return apei_claim_sea(NULL); > > > +} > > > + > > > +void kvm_handle_guest_sea(struct kvm_vcpu *vcpu) > > > +{ > > > + bool sigbus_on_sea; > > > + int idx; > > > + u64 vcpu_esr = kvm_vcpu_get_esr(vcpu); > > > + u8 fsc = kvm_vcpu_trap_get_fault(vcpu); > > > + phys_addr_t fault_ipa = kvm_vcpu_get_fault_ipa(vcpu); > > > + gfn_t gfn = fault_ipa >> PAGE_SHIFT; > > > + /* When FnV is set, send 0 as si_addr like what do_sea() does. */ > > > + unsigned long hva = 0UL; > > > + > > > + /* > > > + * For RAS the host kernel may handle this abort. > > > + * There is no need to SIGBUS VMM, or pass the error into the guest. > > > + */ > > > + if (kvm_delegate_guest_sea(fault_ipa, vcpu_esr) == 0) > > > + return; > > > + > > > + sigbus_on_sea = test_bit(KVM_ARCH_FLAG_SIGBUS_ON_SEA, > > > + &(vcpu->kvm->arch.flags)); > > > + > > > + /* > > > + * In addition to userspace opt-in, SIGBUS only makes sense if the > > > + * abort is NOT about translation table walk and NOT about hardware > > > + * update of translation table. > > > + */ > > > + sigbus_on_sea &= (fsc == ESR_ELx_FSC_EXTABT || fsc == ESR_ELx_FSC_SECC); > > > + > > > + /* Pass the error directly into the guest. */ > > > + if (!sigbus_on_sea) { > > > + kvm_inject_vabt(vcpu); > > > + return; > > > + } > > > + > > > + if (kvm_vcpu_sea_far_valid(vcpu)) { > > > + idx = srcu_read_lock(&vcpu->kvm->srcu); > > > + hva = gfn_to_hva(vcpu->kvm, gfn); > > > + srcu_read_unlock(&vcpu->kvm->srcu, idx); > > > + } > > > + > > > + /* > > > + * Send a SIGBUS BUS_OBJERR to vCPU thread (the userspace thread that > > > + * runs KVM_RUN) or VMM, which aligns with what host kernel do_sea() > > > + * does if apei_claim_sea() fails. > > > + */ > > > + arm64_notify_die("synchronous external abort", > > > + current_pt_regs(), SIGBUS, BUS_OBJERR, hva, vcpu_esr); > > > > This is the point where I really think we should simply trigger an > > exit with all that syndrome information stashed in kvm_run, like any > > other event requiring userspace help. > > Ah, there is another reason SIGBUS is better than kvm exit: "It is a > programming error to set ext_dabt_pending after an exit which was not > either KVM_EXIT_MMIO or KVM_EXIT_ARM_NISV", from > Documentation/virt/kvm/api.rst. So if VMM is allowed to inject data > abort to guest, at least current documentation doesn't suggest kvm > exit is feasible. I really can't make the link between the two. If you take this to the letter, then you can't inject an external abort on the back of a signal handler either. Also, you're obviously changing the UAPI, so everything is fair game provided that you don't break backward compatibility. > > > > > Also: where is the documentation? > > Once I get more positive feedback and send out PATCH instead of RFC, I > can add to Documentation/virt/kvm/api.rst. Sorry, but that's a key point for reviewing a UAPI change. M.
On Fri, 08 Nov 2024 21:18:50 +0000, Jiaqi Yan <jiaqiyan@google.com> wrote: > > There is also a kernel config called ARM64_RAS_EXTN but I believe it > is for the host CPU, not about VM's RAS. > > But taking Oliver's opinion into account, I think we want to remove > KVM_CAP_ARM_SIGBUS_ON_SEA. Wonder what your thoughts are. If we stick to a signal-based signalling (which I still dislike with a burning passion), then I agree it can be removed, as this is no different from a signal being delivered from the rest of the kernel. M.
Thanks for your comments, Marc. While continuing the discussion here, I think it may make sense I sent out a V2 with 2 major updates: - add documentation for new SIGBUS feature - remove KVM_CAP_ARM_SIGBUS_ON_SEA On Tue, Nov 12, 2024 at 12:51 AM Marc Zyngier <maz@kernel.org> wrote: > > On Mon, 04 Nov 2024 05:02:03 +0000, > Jiaqi Yan <jiaqiyan@google.com> wrote: > > > > Hi Marc, thanks for your quick response! > > > > On Fri, Nov 1, 2024 at 6:54 AM Marc Zyngier <maz@kernel.org> wrote: > > > > > > On Thu, 31 Oct 2024 21:21:04 +0000, > > > Jiaqi Yan <jiaqiyan@google.com> wrote: > > > > > > > > Currently KVM handles SEA in guest by injecting async SError into > > > > guest directly, bypassing VMM, usually results in guest kernel panic. > > > > > > > > One major situation of guest SEA is when vCPU consumes uncorrectable > > > > memory error on the physical memory. Although SError and guest kernel > > > > panic effectively stops the propagation of corrupted memory, it is not > > > > easy for VMM and guest to recover from memory error in a more graceful > > > > manner. > > > > > > > > Alternatively KVM can send a SIGBUS BUS_OBJERR to VMM/vCPU, just like > > > > how core kernel signals SIGBUS BUS_OBJERR to the poison consuming > > > > thread. > > > > > > Can you elaborate on why the delivery of a signal is preferable to > > > simply exiting back to userspace with a description of the error? > > > Signals are usually not generated by KVM, and are a pretty twisted way > > > to generate an exit. > > > > A couple of reasons. First, my intuition is that KVM and core kernel > > (do_sea) should have aligned behavior when APEI failed to claim the > > SEA. Second, if we only talk about SEA caused by memory poison > > consumption, both arm64 and x86 KVM already send SIGBUS to VMM/vCPU > > thread (kvm_send_hwpoison_signal) to signal hardware memory failure, > > although the situation is slightly different here, where we have a > > hardware event, versus a HWPoison flag check or VM_FAULT_HWPOISON > > returned. But from VMM/vCPU's perspective, hardware event or software > > level VM_FAULT_HWPOISON, it would be nice if it can react to just the > > same event, the SIGBUS signal. > > > > And there is another reason around your comment on arm64_notify_die. > > > > By "exiting back to userspace with a description of the error", are > > you suggesting KVM_EXIT_MEMORY_FAULT? If so, we may need to add a new > > flag to tell VMM the error is hardware memory poison, which could be > > KVM_MEMORY_EXIT_FLAG_USERFAULT[1] if we don't want a specific one (but > > I think a specific flag for hwpoison is probably clearer). > > KVM_MEMORY_EXIT_FLAG_USERFAULT is not upstream, and it isn't clear to > me if or when it will make it there. But the idea is indeed to treat a > RAS error as an exit reason. Let me try again on preferring SIGBUS to KVM_EXIT. At least for memory error. Let's say we should really use some KVM_EXIT to describe RAS error (e.g. KVM_EXIT_MEMORY_FAULT for memory error). Does that mean we need to change these existing SIGBUSes [1,2] in KVM into KVM_EXIT? Doesn't it break VMM [3,4] that expects SIGBUS for memory errors? The difference between [1,2] and the scenario dealt by this patch is just whether APEI/GHES/memory_failure succeeded in recovering a memory error or not. Is the idea that the uAPI will be different depending on successful recovery or not? [1] https://elixir.bootlin.com/linux/v6.12-rc7/source/arch/arm64/kvm/mmu.c#L1576 [2] https://elixir.bootlin.com/linux/v6.12-rc7/source/arch/x86/kvm/mmu/mmu.c#L3301 [3] https://github.com/qemu/qemu/blob/master/target/i386/kvm/kvm.c#L738 [4] https://github.com/qemu/qemu/blob/master/target/arm/kvm.c#L2365 > > > > > > > > > > In addition to the benifit that KVM's handling for SEA becomes aligned > > > > with core kernel behavior > > > > - The blast radius in VM can be limited to only the consuming thread > > > > in guest, instead of entire guest kernel, unless the consumption is > > > > from guest kernel. > > > > - VMM now has the chance to do its duties to stop the VM from repeatedly > > > > consuming corrupted data. For example, VMM can unmap the guest page > > > > from stage-2 table to intercept forseen memory poison consumption, > > > > > > Not quite. The VMM doesn't manage stage-2. It can remove the page from > > > the VMA if it has it mapped, but that's it. The kernel deals with S2. > > > > I should probably not mention the implementation, "unmap from S2". > > What is needed for preventing repeated consuming memory poison is > > simply preventing guest access to certain memory pages. There is a > > work in progress KVM API [1] by my colleague James. > > > > [1] https://lpc.events/event/18/contributions/1757/attachments/1442/3073/LPC_%20KVM%20Userfault.pdf > > > > > > > > Which brings me to the next subject: when the kernel unmaps the page > > > at S2, it is likely to use CMOs. Can these CMOs create RAS error > > > themselves? > > > > I assume CMO here means writing dirty cache lines to memory. Writing > > something new to a poisoned cacheline usually won't cause RAS error. > > Notifying memory poison usually is delayed to a memory load > > transaction. > > I don't see anything of the sort in the spec. At least R_VGXBJ calls > out that: > > <quote> > An error is propagated by the PE by one or more of the following > occurring that would not have been permitted to occur had the fault > not been activated: > > * Consumption of the corrupt value by any instruction, propagating the > error to the target(s) of the instruction. This includes: > > - A store of a corrupt value. > [...] > </quote> > > So while implementations /may/ only act and deliver an error on a > load, that's only an implementation detail. Thanks for finding out from the spec, lesson learned. Back to your original question and assume error will be propagated by the PE, is your concern that "unmap from S2" will cause a SEA on TTW or hardware update of translation table level 2 (i.e. DFSC=0b010110), which renders the system into a worse fault state? > > > > > > > > > > and for every consumption injects SEA to EL1 with synthetic memory > > > > error CPER. > > > > > > Why do we need to involve ACPI here? I would expect the production of > > > an architected error record instead. Or at least be given the option. > > > > Sorry, I was just mentioning a specific VMM's implementation. There > > are definitely multiple options (Machine Check MSRs vs CPER for error > > description data, SEA vs SDEI vs SPI for notification mechanisms) for > > how VMM involves the guest to handle memory error. My preference is: > > VMM populates CPER in guest HEST when VMM instructs KVM to inject > > i/dabt to the guest. > > > > And a word about "be given the option": I think when VMM receives > > SIGBUS with si_addr=faulted/poisoned HVA, it's got all these options, > > like using the si_addr to construct CPER with poisoned guest physical > > address, or mci_address MSR. > > > > > > > > > Introduce a new KVM ARM capability KVM_CAP_ARM_SIGBUS_ON_SEA. VMM > > > > can opt in this new capability if it prefers SIGBUS than SError > > > > injection during VM init. Now SEA handling in KVM works as follows: > > > > 1. Delegate to APEI/GHES to see if SEA can be claimed by them. > > > > 2. If APEI failed to claim the SEA and KVM_CAP_ARM_SIGBUS_ON_SEA is > > > > enabled for the VM, and the SEA is NOT about translation table, > > > > send SIGBUS BUS_OBJERR signal with host virtual address. > > > > > > And what if it is? S1 PTs are backed by userspace memory, like > > > anything else. I don't think we should have a different treatment of > > > those, because the HW wouldn't treat them differently either. > > > > You are talking about ESR_ELx_FSC_SEA_TTW(1), or > > ESR_ELx_FSC_SEA_TTW(0), right? I think you are right, S1 is no > > difference. > > > > But I think we want to make an exception for SEA about S2 PTs. > > S2 PTs are owned by the hypervisor, not by userspace, so I'm fine not > reporting those through this mechanism. Ack, will update in the v2. > > > > > > > > > > 3. Otherwise directly inject async SError to guest. > > > > > > > > Tested on a machine running Siryn AmpereOne processor. A in-house VMM > > > > that opts in KVM_CAP_ARM_SIGBUS_ON_SEA started a VM. A dummy application > > > > in VM allocated some memory buffer. The test used EINJ to inject an > > > > uncorrectable recoverable memory error at a page in the allocated memory > > > > buffer. The dummy application then consumed the memory error. Some hack > > > > was done to make core kernel's memory_failure triggered by poison > > > > generation to fail, so KVM had to deal with the SEA guest abort due to > > > > poison consumption. vCPU thread in VMM received SIGBUS BUS_OBJERR with > > > > valid host virtual address of the poisoned page. VMM then injected a SEA > > > > into guest using KVM_SET_VCPU_EVENTS with ext_dabt_pending=1. At last > > > > the dummy application in guest was killed by SIGBUS BUS_OBJERR, while the > > > > guest survived and continued to run. > > > > > > > > Signed-off-by: Jiaqi Yan <jiaqiyan@google.com> > > > > --- > > > > arch/arm64/include/asm/kvm_host.h | 2 + > > > > arch/arm64/include/asm/kvm_ras.h | 20 ++++---- > > > > arch/arm64/kvm/Makefile | 2 +- > > > > arch/arm64/kvm/arm.c | 5 ++ > > > > arch/arm64/kvm/kvm_ras.c | 77 +++++++++++++++++++++++++++++++ > > > > arch/arm64/kvm/mmu.c | 8 +--- > > > > include/uapi/linux/kvm.h | 1 + > > > > 7 files changed, 98 insertions(+), 17 deletions(-) > > > > create mode 100644 arch/arm64/kvm/kvm_ras.c > > > > > > > > diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h > > > > index bf64fed9820ea..eb37a2489411a 100644 > > > > --- a/arch/arm64/include/asm/kvm_host.h > > > > +++ b/arch/arm64/include/asm/kvm_host.h > > > > @@ -334,6 +334,8 @@ struct kvm_arch { > > > > /* Fine-Grained UNDEF initialised */ > > > > #define KVM_ARCH_FLAG_FGU_INITIALIZED 8 > > > > unsigned long flags; > > > > + /* Instead of injecting SError into guest, SIGBUS VMM */ > > > > +#define KVM_ARCH_FLAG_SIGBUS_ON_SEA 9 > > > > > > nit: why do you put this definition out of sequence (below 'flags')? > > > > Ah, I will move it on top of flags. > > > > > > > > > > > > > /* VM-wide vCPU feature set */ > > > > DECLARE_BITMAP(vcpu_features, KVM_VCPU_MAX_FEATURES); > > > > diff --git a/arch/arm64/include/asm/kvm_ras.h b/arch/arm64/include/asm/kvm_ras.h > > > > index 87e10d9a635b5..4bb7a424e3f6c 100644 > > > > --- a/arch/arm64/include/asm/kvm_ras.h > > > > +++ b/arch/arm64/include/asm/kvm_ras.h > > > > @@ -11,15 +11,17 @@ > > > > #include <asm/acpi.h> > > > > > > > > /* > > > > - * Was this synchronous external abort a RAS notification? > > > > - * Returns '0' for errors handled by some RAS subsystem, or -ENOENT. > > > > + * Handle synchronous external abort (SEA) in the following order: > > > > + * 1. Delegate to APEI/GHES to see if SEA can be claimed by them. If so, we > > > > + * are all done. > > > > + * 2. If userspace opts in KVM_CAP_ARM_SIGBUS_ON_SEA, and if the SEA is NOT > > > > + * about translation table, send SIGBUS > > > > + * - si_code is BUS_OBJERR. > > > > + * - si_addr will be 0 when accurate HVA is unavailable. > > > > + * 3. Otherwise, directly inject an async SError to guest. > > > > + * > > > > + * Note this applies to both ESR_ELx_EC_IABT_* and ESR_ELx_EC_DABT_*. > > > > */ > > > > -static inline int kvm_handle_guest_sea(phys_addr_t addr, u64 esr) > > > > -{ > > > > - /* apei_claim_sea(NULL) expects to mask interrupts itself */ > > > > - lockdep_assert_irqs_enabled(); > > > > - > > > > - return apei_claim_sea(NULL); > > > > -} > > > > +void kvm_handle_guest_sea(struct kvm_vcpu *vcpu); > > > > > > > > #endif /* __ARM64_KVM_RAS_H__ */ > > > > diff --git a/arch/arm64/kvm/Makefile b/arch/arm64/kvm/Makefile > > > > index 3cf7adb2b5038..c4a3a6d4870e6 100644 > > > > --- a/arch/arm64/kvm/Makefile > > > > +++ b/arch/arm64/kvm/Makefile > > > > @@ -23,7 +23,7 @@ kvm-y += arm.o mmu.o mmio.o psci.o hypercalls.o pvtime.o \ > > > > vgic/vgic-v3.o vgic/vgic-v4.o \ > > > > vgic/vgic-mmio.o vgic/vgic-mmio-v2.o \ > > > > vgic/vgic-mmio-v3.o vgic/vgic-kvm-device.o \ > > > > - vgic/vgic-its.o vgic/vgic-debug.o > > > > + vgic/vgic-its.o vgic/vgic-debug.o kvm_ras.o > > > > > > > > kvm-$(CONFIG_HW_PERF_EVENTS) += pmu-emul.o pmu.o > > > > kvm-$(CONFIG_ARM64_PTR_AUTH) += pauth.o > > > > diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c > > > > index 48cafb65d6acf..bb97ad678dbec 100644 > > > > --- a/arch/arm64/kvm/arm.c > > > > +++ b/arch/arm64/kvm/arm.c > > > > @@ -151,6 +151,10 @@ int kvm_vm_ioctl_enable_cap(struct kvm *kvm, > > > > } > > > > mutex_unlock(&kvm->slots_lock); > > > > break; > > > > + case KVM_CAP_ARM_SIGBUS_ON_SEA: > > > > + r = 0; > > > > + set_bit(KVM_ARCH_FLAG_SIGBUS_ON_SEA, &kvm->arch.flags); > > > > > > Shouldn't this be somehow gated on the VM being RAS aware? > > > > Do you mean a CAP that VMM can tell KVM the VM guest has RAS ability? > > I don't know if there is one for arm64. On x86 there is > > KVM_X86_SETUP_MCE. KVM_CAP_ARM_INJECT_EXT_DABT maybe a revelant one > > but I don't think it is exactly the one for "RAS ability". > > Having though about this a bit more, I now think this is independent > of the guest supporting RAS. This really is about the VMM asking to be > made aware of RAS errors affecting the guest, and it is the signalling > back to the guest that needs to be gated by ID_AA64PFR0_EL1.RAS. Just to make sure I fully catch you. I think ID_AA64PFR0_EL1.RAS translates to ARM64_HAS_RAS_EXTN in the kernel. If VMM signals RAS error back to the guest with SEA, are you suggesting __kvm_arm_vcpu_set_events should check cpus_have_final_cap(ARM64_HAS_RAS_EXTN) before it kvm_inject_dabt(vcpu)? If so, how could __kvm_arm_vcpu_set_events know if the error is about RAS (e.g. memory error) vs about accessing memory not in a memslot (i.e. KVM_EXIT_ARM_NISV)? I guess KVM needs to look at ESR_EL2 again (e.g. kvm_vcpu_abt_issea vs kvm_vcpu_dabt_isvalid)? > > > > > > > > > > + break; > > > > default: > > > > break; > > > > } > > > > @@ -339,6 +343,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext) > > > > case KVM_CAP_ARM_SYSTEM_SUSPEND: > > > > case KVM_CAP_IRQFD_RESAMPLE: > > > > case KVM_CAP_COUNTER_OFFSET: > > > > + case KVM_CAP_ARM_SIGBUS_ON_SEA: > > > > r = 1; > > > > break; > > > > case KVM_CAP_SET_GUEST_DEBUG2: > > > > diff --git a/arch/arm64/kvm/kvm_ras.c b/arch/arm64/kvm/kvm_ras.c > > > > new file mode 100644 > > > > index 0000000000000..3225462bcbcda > > > > --- /dev/null > > > > +++ b/arch/arm64/kvm/kvm_ras.c > > > > @@ -0,0 +1,77 @@ > > > > +// SPDX-License-Identifier: GPL-2.0-only > > > > + > > > > +#include <linux/bitops.h> > > > > +#include <linux/kvm_host.h> > > > > + > > > > +#include <asm/kvm_emulate.h> > > > > +#include <asm/kvm_ras.h> > > > > +#include <asm/system_misc.h> > > > > + > > > > +/* > > > > + * For synchrnous external instruction or data abort, not on translation > > > > + * table walk or hardware update of translation table, is FAR_EL2 valid? > > > > + */ > > > > +static inline bool kvm_vcpu_sea_far_valid(const struct kvm_vcpu *vcpu) > > > > +{ > > > > + return !(vcpu->arch.fault.esr_el2 & ESR_ELx_FnV); > > > > +} > > > > + > > > > +/* > > > > + * Was this synchronous external abort a RAS notification? > > > > + * Returns '0' for errors handled by some RAS subsystem, or -ENOENT. > > > > + */ > > > > +static int kvm_delegate_guest_sea(phys_addr_t addr, u64 esr) > > > > +{ > > > > + /* apei_claim_sea(NULL) expects to mask interrupts itself */ > > > > + lockdep_assert_irqs_enabled(); > > > > + return apei_claim_sea(NULL); > > > > +} > > > > + > > > > +void kvm_handle_guest_sea(struct kvm_vcpu *vcpu) > > > > +{ > > > > + bool sigbus_on_sea; > > > > + int idx; > > > > + u64 vcpu_esr = kvm_vcpu_get_esr(vcpu); > > > > + u8 fsc = kvm_vcpu_trap_get_fault(vcpu); > > > > + phys_addr_t fault_ipa = kvm_vcpu_get_fault_ipa(vcpu); > > > > + gfn_t gfn = fault_ipa >> PAGE_SHIFT; > > > > + /* When FnV is set, send 0 as si_addr like what do_sea() does. */ > > > > + unsigned long hva = 0UL; > > > > + > > > > + /* > > > > + * For RAS the host kernel may handle this abort. > > > > + * There is no need to SIGBUS VMM, or pass the error into the guest. > > > > + */ > > > > + if (kvm_delegate_guest_sea(fault_ipa, vcpu_esr) == 0) > > > > + return; > > > > + > > > > + sigbus_on_sea = test_bit(KVM_ARCH_FLAG_SIGBUS_ON_SEA, > > > > + &(vcpu->kvm->arch.flags)); > > > > + > > > > + /* > > > > + * In addition to userspace opt-in, SIGBUS only makes sense if the > > > > + * abort is NOT about translation table walk and NOT about hardware > > > > + * update of translation table. > > > > + */ > > > > + sigbus_on_sea &= (fsc == ESR_ELx_FSC_EXTABT || fsc == ESR_ELx_FSC_SECC); > > > > + > > > > + /* Pass the error directly into the guest. */ > > > > + if (!sigbus_on_sea) { > > > > + kvm_inject_vabt(vcpu); > > > > + return; > > > > + } > > > > + > > > > + if (kvm_vcpu_sea_far_valid(vcpu)) { > > > > + idx = srcu_read_lock(&vcpu->kvm->srcu); > > > > + hva = gfn_to_hva(vcpu->kvm, gfn); > > > > + srcu_read_unlock(&vcpu->kvm->srcu, idx); > > > > + } > > > > + > > > > + /* > > > > + * Send a SIGBUS BUS_OBJERR to vCPU thread (the userspace thread that > > > > + * runs KVM_RUN) or VMM, which aligns with what host kernel do_sea() > > > > + * does if apei_claim_sea() fails. > > > > + */ > > > > + arm64_notify_die("synchronous external abort", > > > > + current_pt_regs(), SIGBUS, BUS_OBJERR, hva, vcpu_esr); > > > > > > This is the point where I really think we should simply trigger an > > > exit with all that syndrome information stashed in kvm_run, like any > > > other event requiring userspace help. > > > > Ah, there is another reason SIGBUS is better than kvm exit: "It is a > > programming error to set ext_dabt_pending after an exit which was not > > either KVM_EXIT_MMIO or KVM_EXIT_ARM_NISV", from > > Documentation/virt/kvm/api.rst. So if VMM is allowed to inject data > > abort to guest, at least current documentation doesn't suggest kvm > > exit is feasible. > > I really can't make the link between the two. If you take this to the > letter, then you can't inject an external abort on the back of a > signal handler either. Sorry for the confusion, I think my understanding of the doc and what I said were both wrong. It seems to me that what the doc is suggesting is, the use case of ext_dabt_pending (injecting SEA into guest) for now is only to enable VMM to handle KVM_EXIT_MMIO and KVM_EXIT_ARM_NISV. With the new SIGBUS for memory error, is it fair to update the doc to say another use case of ext_dabt_pending is to enable VMM to signal guest of RAS error? > > Also, you're obviously changing the UAPI, so everything is fair game > provided that you don't break backward compatibility. > > > > > > > > > Also: where is the documentation? > > > > Once I get more positive feedback and send out PATCH instead of RFC, I > > can add to Documentation/virt/kvm/api.rst. > > Sorry, but that's a key point for reviewing a UAPI change. Let me include the doc for the SIGBUS feature in v2. > > M. > > -- > Without deviation from the norm, progress is not possible.
On Tue, Nov 19, 2024 at 03:57:46PM -0800, Jiaqi Yan wrote: > > While continuing the discussion here, I think it may make sense I sent > out a V2 with 2 major updates: > - add documentation for new SIGBUS feature > - remove KVM_CAP_ARM_SIGBUS_ON_SEA Just wanted to add that QEMU already has a functioning "MCE" injection implemenation based on signals that deals with the sloppy mess of coordinating w/ vCPU threads [*]. I completely agree with Marc that the UAPI around using signals for this sort of thing is a giant pile of crap, but it seems to be a semi-understood pile of crap. And from that perspective, wiring unclaimed SEAs into the existing infrastructure at least makes the UAPI consistent. [*]: https://elixir.bootlin.com/qemu/v9.1.1/C/ident/kvm_on_sigbus_vcpu > On Tue, Nov 12, 2024 at 12:51 AM Marc Zyngier <maz@kernel.org> wrote: > > > Do you mean a CAP that VMM can tell KVM the VM guest has RAS ability? > > > I don't know if there is one for arm64. On x86 there is > > > KVM_X86_SETUP_MCE. KVM_CAP_ARM_INJECT_EXT_DABT maybe a revelant one > > > but I don't think it is exactly the one for "RAS ability". > > > > Having though about this a bit more, I now think this is independent > > of the guest supporting RAS. This really is about the VMM asking to be > > made aware of RAS errors affecting the guest, and it is the signalling > > back to the guest that needs to be gated by ID_AA64PFR0_EL1.RAS. > > Just to make sure I fully catch you. I think ID_AA64PFR0_EL1.RAS > translates to ARM64_HAS_RAS_EXTN in the kernel. If VMM signals RAS > error back to the guest with SEA, are you suggesting > __kvm_arm_vcpu_set_events should check > cpus_have_final_cap(ARM64_HAS_RAS_EXTN) before it > kvm_inject_dabt(vcpu)? > > If so, how could __kvm_arm_vcpu_set_events know if the error is about > RAS (e.g. memory error) vs about accessing memory not in a memslot > (i.e. KVM_EXIT_ARM_NISV)? I guess KVM needs to look at ESR_EL2 again > (e.g. kvm_vcpu_abt_issea vs kvm_vcpu_dabt_isvalid)? Good point. I don't think we can lock down this UAPI after the fact given the existing use cases. It is ultimately up to the VMM what to do. I don't see anything that would stop an implementation without FEAT_RAS from generating an SEA in this situation. The lack of FEAT_RAS (to me at least) implies: - No ESR injection for vSErrors (already enforced in UAPI) - No deferral of SErrors / ESBs - No error record registers in the PE Of course, it is very likely you've thought about this more than I have, Marc.
diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h index bf64fed9820ea..eb37a2489411a 100644 --- a/arch/arm64/include/asm/kvm_host.h +++ b/arch/arm64/include/asm/kvm_host.h @@ -334,6 +334,8 @@ struct kvm_arch { /* Fine-Grained UNDEF initialised */ #define KVM_ARCH_FLAG_FGU_INITIALIZED 8 unsigned long flags; + /* Instead of injecting SError into guest, SIGBUS VMM */ +#define KVM_ARCH_FLAG_SIGBUS_ON_SEA 9 /* VM-wide vCPU feature set */ DECLARE_BITMAP(vcpu_features, KVM_VCPU_MAX_FEATURES); diff --git a/arch/arm64/include/asm/kvm_ras.h b/arch/arm64/include/asm/kvm_ras.h index 87e10d9a635b5..4bb7a424e3f6c 100644 --- a/arch/arm64/include/asm/kvm_ras.h +++ b/arch/arm64/include/asm/kvm_ras.h @@ -11,15 +11,17 @@ #include <asm/acpi.h> /* - * Was this synchronous external abort a RAS notification? - * Returns '0' for errors handled by some RAS subsystem, or -ENOENT. + * Handle synchronous external abort (SEA) in the following order: + * 1. Delegate to APEI/GHES to see if SEA can be claimed by them. If so, we + * are all done. + * 2. If userspace opts in KVM_CAP_ARM_SIGBUS_ON_SEA, and if the SEA is NOT + * about translation table, send SIGBUS + * - si_code is BUS_OBJERR. + * - si_addr will be 0 when accurate HVA is unavailable. + * 3. Otherwise, directly inject an async SError to guest. + * + * Note this applies to both ESR_ELx_EC_IABT_* and ESR_ELx_EC_DABT_*. */ -static inline int kvm_handle_guest_sea(phys_addr_t addr, u64 esr) -{ - /* apei_claim_sea(NULL) expects to mask interrupts itself */ - lockdep_assert_irqs_enabled(); - - return apei_claim_sea(NULL); -} +void kvm_handle_guest_sea(struct kvm_vcpu *vcpu); #endif /* __ARM64_KVM_RAS_H__ */ diff --git a/arch/arm64/kvm/Makefile b/arch/arm64/kvm/Makefile index 3cf7adb2b5038..c4a3a6d4870e6 100644 --- a/arch/arm64/kvm/Makefile +++ b/arch/arm64/kvm/Makefile @@ -23,7 +23,7 @@ kvm-y += arm.o mmu.o mmio.o psci.o hypercalls.o pvtime.o \ vgic/vgic-v3.o vgic/vgic-v4.o \ vgic/vgic-mmio.o vgic/vgic-mmio-v2.o \ vgic/vgic-mmio-v3.o vgic/vgic-kvm-device.o \ - vgic/vgic-its.o vgic/vgic-debug.o + vgic/vgic-its.o vgic/vgic-debug.o kvm_ras.o kvm-$(CONFIG_HW_PERF_EVENTS) += pmu-emul.o pmu.o kvm-$(CONFIG_ARM64_PTR_AUTH) += pauth.o diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c index 48cafb65d6acf..bb97ad678dbec 100644 --- a/arch/arm64/kvm/arm.c +++ b/arch/arm64/kvm/arm.c @@ -151,6 +151,10 @@ int kvm_vm_ioctl_enable_cap(struct kvm *kvm, } mutex_unlock(&kvm->slots_lock); break; + case KVM_CAP_ARM_SIGBUS_ON_SEA: + r = 0; + set_bit(KVM_ARCH_FLAG_SIGBUS_ON_SEA, &kvm->arch.flags); + break; default: break; } @@ -339,6 +343,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext) case KVM_CAP_ARM_SYSTEM_SUSPEND: case KVM_CAP_IRQFD_RESAMPLE: case KVM_CAP_COUNTER_OFFSET: + case KVM_CAP_ARM_SIGBUS_ON_SEA: r = 1; break; case KVM_CAP_SET_GUEST_DEBUG2: diff --git a/arch/arm64/kvm/kvm_ras.c b/arch/arm64/kvm/kvm_ras.c new file mode 100644 index 0000000000000..3225462bcbcda --- /dev/null +++ b/arch/arm64/kvm/kvm_ras.c @@ -0,0 +1,77 @@ +// SPDX-License-Identifier: GPL-2.0-only + +#include <linux/bitops.h> +#include <linux/kvm_host.h> + +#include <asm/kvm_emulate.h> +#include <asm/kvm_ras.h> +#include <asm/system_misc.h> + +/* + * For synchrnous external instruction or data abort, not on translation + * table walk or hardware update of translation table, is FAR_EL2 valid? + */ +static inline bool kvm_vcpu_sea_far_valid(const struct kvm_vcpu *vcpu) +{ + return !(vcpu->arch.fault.esr_el2 & ESR_ELx_FnV); +} + +/* + * Was this synchronous external abort a RAS notification? + * Returns '0' for errors handled by some RAS subsystem, or -ENOENT. + */ +static int kvm_delegate_guest_sea(phys_addr_t addr, u64 esr) +{ + /* apei_claim_sea(NULL) expects to mask interrupts itself */ + lockdep_assert_irqs_enabled(); + return apei_claim_sea(NULL); +} + +void kvm_handle_guest_sea(struct kvm_vcpu *vcpu) +{ + bool sigbus_on_sea; + int idx; + u64 vcpu_esr = kvm_vcpu_get_esr(vcpu); + u8 fsc = kvm_vcpu_trap_get_fault(vcpu); + phys_addr_t fault_ipa = kvm_vcpu_get_fault_ipa(vcpu); + gfn_t gfn = fault_ipa >> PAGE_SHIFT; + /* When FnV is set, send 0 as si_addr like what do_sea() does. */ + unsigned long hva = 0UL; + + /* + * For RAS the host kernel may handle this abort. + * There is no need to SIGBUS VMM, or pass the error into the guest. + */ + if (kvm_delegate_guest_sea(fault_ipa, vcpu_esr) == 0) + return; + + sigbus_on_sea = test_bit(KVM_ARCH_FLAG_SIGBUS_ON_SEA, + &(vcpu->kvm->arch.flags)); + + /* + * In addition to userspace opt-in, SIGBUS only makes sense if the + * abort is NOT about translation table walk and NOT about hardware + * update of translation table. + */ + sigbus_on_sea &= (fsc == ESR_ELx_FSC_EXTABT || fsc == ESR_ELx_FSC_SECC); + + /* Pass the error directly into the guest. */ + if (!sigbus_on_sea) { + kvm_inject_vabt(vcpu); + return; + } + + if (kvm_vcpu_sea_far_valid(vcpu)) { + idx = srcu_read_lock(&vcpu->kvm->srcu); + hva = gfn_to_hva(vcpu->kvm, gfn); + srcu_read_unlock(&vcpu->kvm->srcu, idx); + } + + /* + * Send a SIGBUS BUS_OBJERR to vCPU thread (the userspace thread that + * runs KVM_RUN) or VMM, which aligns with what host kernel do_sea() + * does if apei_claim_sea() fails. + */ + arm64_notify_die("synchronous external abort", + current_pt_regs(), SIGBUS, BUS_OBJERR, hva, vcpu_esr); +} diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c index a71fe6f6bd90f..f5335953827ec 100644 --- a/arch/arm64/kvm/mmu.c +++ b/arch/arm64/kvm/mmu.c @@ -1766,13 +1766,7 @@ int kvm_handle_guest_abort(struct kvm_vcpu *vcpu) /* Synchronous External Abort? */ if (kvm_vcpu_abt_issea(vcpu)) { - /* - * For RAS the host kernel may handle this abort. - * There is no need to pass the error into the guest. - */ - if (kvm_handle_guest_sea(fault_ipa, kvm_vcpu_get_esr(vcpu))) - kvm_inject_vabt(vcpu); - + kvm_handle_guest_sea(vcpu); return 1; } diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h index 637efc0551453..fe3ca787e72fa 100644 --- a/include/uapi/linux/kvm.h +++ b/include/uapi/linux/kvm.h @@ -933,6 +933,7 @@ struct kvm_enable_cap { #define KVM_CAP_PRE_FAULT_MEMORY 236 #define KVM_CAP_X86_APIC_BUS_CYCLES_NS 237 #define KVM_CAP_X86_GUEST_MODE 238 +#define KVM_CAP_ARM_SIGBUS_ON_SEA 239 struct kvm_irq_routing_irqchip { __u32 irqchip;
Currently KVM handles SEA in guest by injecting async SError into guest directly, bypassing VMM, usually results in guest kernel panic. One major situation of guest SEA is when vCPU consumes uncorrectable memory error on the physical memory. Although SError and guest kernel panic effectively stops the propagation of corrupted memory, it is not easy for VMM and guest to recover from memory error in a more graceful manner. Alternatively KVM can send a SIGBUS BUS_OBJERR to VMM/vCPU, just like how core kernel signals SIGBUS BUS_OBJERR to the poison consuming thread. In addition to the benifit that KVM's handling for SEA becomes aligned with core kernel behavior - The blast radius in VM can be limited to only the consuming thread in guest, instead of entire guest kernel, unless the consumption is from guest kernel. - VMM now has the chance to do its duties to stop the VM from repeatedly consuming corrupted data. For example, VMM can unmap the guest page from stage-2 table to intercept forseen memory poison consumption, and for every consumption injects SEA to EL1 with synthetic memory error CPER. Introduce a new KVM ARM capability KVM_CAP_ARM_SIGBUS_ON_SEA. VMM can opt in this new capability if it prefers SIGBUS than SError injection during VM init. Now SEA handling in KVM works as follows: 1. Delegate to APEI/GHES to see if SEA can be claimed by them. 2. If APEI failed to claim the SEA and KVM_CAP_ARM_SIGBUS_ON_SEA is enabled for the VM, and the SEA is NOT about translation table, send SIGBUS BUS_OBJERR signal with host virtual address. 3. Otherwise directly inject async SError to guest. Tested on a machine running Siryn AmpereOne processor. A in-house VMM that opts in KVM_CAP_ARM_SIGBUS_ON_SEA started a VM. A dummy application in VM allocated some memory buffer. The test used EINJ to inject an uncorrectable recoverable memory error at a page in the allocated memory buffer. The dummy application then consumed the memory error. Some hack was done to make core kernel's memory_failure triggered by poison generation to fail, so KVM had to deal with the SEA guest abort due to poison consumption. vCPU thread in VMM received SIGBUS BUS_OBJERR with valid host virtual address of the poisoned page. VMM then injected a SEA into guest using KVM_SET_VCPU_EVENTS with ext_dabt_pending=1. At last the dummy application in guest was killed by SIGBUS BUS_OBJERR, while the guest survived and continued to run. Signed-off-by: Jiaqi Yan <jiaqiyan@google.com> --- arch/arm64/include/asm/kvm_host.h | 2 + arch/arm64/include/asm/kvm_ras.h | 20 ++++---- arch/arm64/kvm/Makefile | 2 +- arch/arm64/kvm/arm.c | 5 ++ arch/arm64/kvm/kvm_ras.c | 77 +++++++++++++++++++++++++++++++ arch/arm64/kvm/mmu.c | 8 +--- include/uapi/linux/kvm.h | 1 + 7 files changed, 98 insertions(+), 17 deletions(-) create mode 100644 arch/arm64/kvm/kvm_ras.c