Message ID | 1510343650-23659-8-git-send-email-gengdongjiu@huawei.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Hi Dongjiu Geng, On 10/11/17 19:54, Dongjiu Geng wrote: > If it is not RAS SError, directly inject virtual SError, > which will keep the old way. If it is RAS SError, firstly > let host ACPI module to handle it. > For the ACPI handling, > if the error address is invalid, APEI driver will not > identify the address to hwpoison memory and can not notify > guest to do the recovery. The guest can't do any recover either. There is no recovery you can do without some information about what the error is. This is your memory corruption at an unknown address? We should reboot. (I agree memory_failure.c's::me_kernel() is ignoring kernel errors, we should try and fix this. It makes some sense for polled or irq notifications, but not SEA/SEI). > In order to safe, KVM continues > categorizing errors and handle it separately. > If the RAS error is not propagated, let host user space to > handle it. No. Host user space should not know anything about the kernel or platform RAS support. Doing so creates an ABI link between EL3 firmware and Qemu. This is totally unmaintainable. This thing needs to be portable. The kernel should handle the error, and report any symptoms to user-space. e.g. 'this memory is gone'. We shouldn't special case KVM. > The reason is that sometimes we can only kill the > guest effected application instead of panic whose guest OS. > Host user space specifies a valid ESR and inject virtual > SError, guest can just kill the current application if the > non-consumed error coming from guest application. > > Signed-off-by: Dongjiu Geng <gengdongjiu@huawei.com> > Signed-off-by: Quanming Wu <wuquanming@huawei.com> The last Signed-off-by should match the person posting the patch. It's a chain of custody for GPL-signoff purposes, not a 'partially-written-by'. If you want to credit Quanming Wu you can add CC and they can Ack/Review your patch. > diff --git a/arch/arm64/kvm/handle_exit.c b/arch/arm64/kvm/handle_exit.c > index 7debb74..1afdc87 100644 > --- a/arch/arm64/kvm/handle_exit.c > +++ b/arch/arm64/kvm/handle_exit.c > @@ -178,6 +179,66 @@ static exit_handle_fn kvm_get_exit_handler(struct kvm_vcpu *vcpu) > return arm_exit_handlers[hsr_ec]; > } > > +/** > + * kvm_handle_guest_sei - handles SError interrupt or asynchronous aborts > + * @vcpu: the VCPU pointer > + * > + * For RAS SError interrupt, firstly let host kernel handle it. > + * If the AET is [ESR_ELx_AET_UER], then let user space handle it, > + */ > +static int kvm_handle_guest_sei(struct kvm_vcpu *vcpu, struct kvm_run *run) > +{ > + unsigned int esr = kvm_vcpu_get_hsr(vcpu); > + bool impdef_syndrome = esr & ESR_ELx_ISV; /* aka IDS */ > + unsigned int aet = esr & ESR_ELx_AET; > + > + /* > + * This is not RAS SError > + */ > + if (!cpus_have_const_cap(ARM64_HAS_RAS_EXTN)) { > + kvm_inject_vabt(vcpu); > + return 1; > + } > + /* The host kernel may handle this abort. */ > + handle_guest_sei(); This has to claim the SError as a notification. If APEI claims the error, KVM doesn't need to do anything more. You ignore its return code. > + > + /* > + * In below two conditions, it will directly inject the > + * virtual SError: > + * 1. The Syndrome is IMPLEMENTATION DEFINED > + * 2. It is Uncategorized SEI > + */ > + if (impdef_syndrome || > + ((esr & ESR_ELx_FSC) != ESR_ELx_FSC_SERROR)) { > + kvm_inject_vabt(vcpu); > + return 1; > + } > + > + switch (aet) { > + case ESR_ELx_AET_CE: /* corrected error */ > + case ESR_ELx_AET_UEO: /* restartable error, not yet consumed */ > + return 1; /* continue processing the guest exit */ > + case ESR_ELx_AET_UER: /* The error has not been propagated */ > + /* > + * Userspace only handle the guest SError Interrupt(SEI) if the > + * error has not been propagated > + */ > + run->exit_reason = KVM_EXIT_EXCEPTION; > + run->ex.exception = ESR_ELx_EC_SERROR; > + run->ex.error_code = KVM_SEI_SEV_RECOVERABLE; > + return 0; We should not pass RAS notifications to user space. The kernel either handles them, or it panics(). User space shouldn't even know if the kernel supports RAS until it gets an MCEERR signal. You're making your firmware-first notification an EL3->EL0 signal, bypassing the OS. If we get a RAS SError and there are no CPER records or values in the ERR nodes, we should panic as it looks like the CPU/firmware is broken. (spurious RAS errors) > + default: > + /* > + * Until now, the CPU supports RAS and SEI is fatal, or host > + * does not support to handle the SError. > + */ > + panic("This Asynchronous SError interrupt is dangerous, panic"); > + } > + > + return 0; > +} > + > /* > * Return > 0 to return to guest, < 0 on error, 0 (and set exit_reason) on > * proper exit to userspace. James
Hi James, Thanks a lot for the review. On 2017/11/15 0:00, James Morse wrote: > Hi Dongjiu Geng, > > On 10/11/17 19:54, Dongjiu Geng wrote: >> If it is not RAS SError, directly inject virtual SError, >> which will keep the old way. If it is RAS SError, firstly >> let host ACPI module to handle it. > >> For the ACPI handling, >> if the error address is invalid, APEI driver will not >> identify the address to hwpoison memory and can not notify >> guest to do the recovery. > > The guest can't do any recover either. There is no recovery you can do without > some information about what the error is. > > This is your memory corruption at an unknown address? We should reboot. > > (I agree memory_failure.c's::me_kernel() is ignoring kernel errors, we should > try and fix this. It makes some sense for polled or irq notifications, but not > SEA/SEI). > > >> In order to safe, KVM continues >> categorizing errors and handle it separately. > >> If the RAS error is not propagated, let host user space to >> handle it. > > No. Host user space should not know anything about the kernel or platform RAS > support. Doing so creates an ABI link between EL3 firmware and Qemu. This is > totally unmaintainable. Here I have two question: (1) If the AET(Asynchronous Error Type) is Recoverable error (UER), do you mean we also reboot or panic? (2) what is the chance to set guest ESR for Qemu? here I return a error code to Qemu. when Qemu get this error return, it will specify guest ESR and inject the abort. here if KVM does not return error to Qemu, Qemu will do not know when to set the guest ESR value and inject abort. > > This thing needs to be portable. The kernel should handle the error, and report > any symptoms to user-space. e.g. 'this memory is gone'. > > We shouldn't special case KVM. > > >> The reason is that sometimes we can only kill the >> guest effected application instead of panic whose guest OS. >> Host user space specifies a valid ESR and inject virtual >> SError, guest can just kill the current application if the >> non-consumed error coming from guest application. >> >> Signed-off-by: Dongjiu Geng <gengdongjiu@huawei.com> >> Signed-off-by: Quanming Wu <wuquanming@huawei.com> > > The last Signed-off-by should match the person posting the patch. It's a chain > of custody for GPL-signoff purposes, not a 'partially-written-by'. If you want > to credit Quanming Wu you can add CC and they can Ack/Review your patch. Ok, got it. thanks a lot for your suggestion. > > >> diff --git a/arch/arm64/kvm/handle_exit.c b/arch/arm64/kvm/handle_exit.c >> index 7debb74..1afdc87 100644 >> --- a/arch/arm64/kvm/handle_exit.c >> +++ b/arch/arm64/kvm/handle_exit.c >> @@ -178,6 +179,66 @@ static exit_handle_fn kvm_get_exit_handler(struct kvm_vcpu *vcpu) >> return arm_exit_handlers[hsr_ec]; >> } >> >> +/** >> + * kvm_handle_guest_sei - handles SError interrupt or asynchronous aborts >> + * @vcpu: the VCPU pointer >> + * >> + * For RAS SError interrupt, firstly let host kernel handle it. >> + * If the AET is [ESR_ELx_AET_UER], then let user space handle it, >> + */ >> +static int kvm_handle_guest_sei(struct kvm_vcpu *vcpu, struct kvm_run *run) >> +{ >> + unsigned int esr = kvm_vcpu_get_hsr(vcpu); >> + bool impdef_syndrome = esr & ESR_ELx_ISV; /* aka IDS */ >> + unsigned int aet = esr & ESR_ELx_AET; >> + >> + /* >> + * This is not RAS SError >> + */ >> + if (!cpus_have_const_cap(ARM64_HAS_RAS_EXTN)) { >> + kvm_inject_vabt(vcpu); >> + return 1; >> + } > >> + /* The host kernel may handle this abort. */ >> + handle_guest_sei(); > > This has to claim the SError as a notification. If APEI claims the error, KVM > doesn't need to do anything more. You ignore its return code. Thanks for the pointing out. I will check the return code, if it return success, KVM doesn't need to do anything more, otherwise, continue run. > > >> + >> + /* >> + * In below two conditions, it will directly inject the >> + * virtual SError: >> + * 1. The Syndrome is IMPLEMENTATION DEFINED >> + * 2. It is Uncategorized SEI >> + */ >> + if (impdef_syndrome || >> + ((esr & ESR_ELx_FSC) != ESR_ELx_FSC_SERROR)) { >> + kvm_inject_vabt(vcpu); >> + return 1; >> + } >> + >> + switch (aet) { >> + case ESR_ELx_AET_CE: /* corrected error */ >> + case ESR_ELx_AET_UEO: /* restartable error, not yet consumed */ >> + return 1; /* continue processing the guest exit */ > >> + case ESR_ELx_AET_UER: /* The error has not been propagated */ >> + /* >> + * Userspace only handle the guest SError Interrupt(SEI) if the >> + * error has not been propagated >> + */ >> + run->exit_reason = KVM_EXIT_EXCEPTION; >> + run->ex.exception = ESR_ELx_EC_SERROR; >> + run->ex.error_code = KVM_SEI_SEV_RECOVERABLE; >> + return 0; > > We should not pass RAS notifications to user space. The kernel either handles > them, or it panics(). User space shouldn't even know if the kernel supports RAS > until it gets an MCEERR signal. Now I rely on this error return to let Qemu set guest ESR, otherwise user space will do not know when to set the guest ESR. If so, how and when we told user space(Qemu) to set the guest ESR and inject abort? > > You're making your firmware-first notification an EL3->EL0 signal, bypassing the OS. > > If we get a RAS SError and there are no CPER records or values in the ERR nodes, > we should panic as it looks like the CPU/firmware is broken. (spurious RAS errors) > > >> + default: >> + /* >> + * Until now, the CPU supports RAS and SEI is fatal, or host >> + * does not support to handle the SError. >> + */ >> + panic("This Asynchronous SError interrupt is dangerous, panic"); >> + } >> + >> + return 0; >> +} >> + >> /* >> * Return > 0 to return to guest, < 0 on error, 0 (and set exit_reason) on >> * proper exit to userspace. > > > > James > > . >
On 2017/11/15 0:00, James Morse wrote: >> + * error has not been propagated >> + */ >> + run->exit_reason = KVM_EXIT_EXCEPTION; >> + run->ex.exception = ESR_ELx_EC_SERROR; >> + run->ex.error_code = KVM_SEI_SEV_RECOVERABLE; >> + return 0; > We should not pass RAS notifications to user space. The kernel either handles > them, or it panics(). User space shouldn't even know if the kernel supports RAS > until it gets an MCEERR signal. > > You're making your firmware-first notification an EL3->EL0 signal, bypassing the OS. > > If we get a RAS SError and there are no CPER records or values in the ERR nodes, > we should panic as it looks like the CPU/firmware is broken. (spurious RAS errors) Hi james, sorry to disturb you! do you think whether we need to set the guest ESR by user space? if need, I need to notify user space that there is a SError happen and need to set ESR for guest in some place of KVM. so here I return a error code to user space. you mean we should not pass RAS notifications to user space, so could you give some suggestion how to notify user space to set guest ESR. Thanks a lot in advance. > >
Hi gengdongjiu, On 06/12/17 10:26, gengdongjiu wrote: > On 2017/11/15 0:00, James Morse wrote: >>> + * error has not been propagated >>> + */ >>> + run->exit_reason = KVM_EXIT_EXCEPTION; >>> + run->ex.exception = ESR_ELx_EC_SERROR; >>> + run->ex.error_code = KVM_SEI_SEV_RECOVERABLE; >>> + return 0; >> We should not pass RAS notifications to user space. The kernel either handles >> them, or it panics(). User space shouldn't even know if the kernel supports RAS >> until it gets an MCEERR signal. >> >> You're making your firmware-first notification an EL3->EL0 signal, bypassing the OS. >> >> If we get a RAS SError and there are no CPER records or values in the ERR nodes, >> we should panic as it looks like the CPU/firmware is broken. (spurious RAS errors) > do you think whether we need to set the guest ESR by user space? if need, I need to > notify user space that there is a SError happen and need to set ESR for guest in some place of > KVM. I think you are still coming from a world where user-space gets raw RAS notifications via KVM. This should not happen because the notification method is private to firmware and the kernel. KVM is just in the way when a guest is running. Notifications reaching KVM should be plumbed into the APEI-firmware-first-code or eventually, a kernel-first mechanism if APEI doesn't 'claim' them. The kernel RAS code may signal user-space with the symptoms of the error, and user-space may decided to generate a new RAS notification for the guest. This should function in exactly the same way, regardless of which notification method is in use between the kernel and firmware. (its the only way to make this future-proof). Which notification user-space chooses to use entirely depends on what (if anything) it advertised to the guest in the HEST. User-space has to be in control of triggering any SError, not just overriding the ESR when KVM has decided it wants to kill the guest. > so here I return a error code to user space. you mean we should not pass RAS notifications > to user space, so could you give some suggestion how to notify user space to set guest ESR. KVM shouldn't give the guest an SError when it takes a RAS notification, it should pass the notification to the kernel RAS code. It only needs to 'fall through' to some default cause if both APEI and kernel-first deny-all-knowledge of this notification. The end-to-end flow is then (assuming no-VHE): (1)An error occurs, taking the CPU to EL3. EL3: triage the error, generate CPER, notify the OS EL2: KVM takes the notification, exits the guest, returns to host EL1. EL1: KVM handle_exit() calls APEI to handle the error. This is the end of KVMs involvement in RAS - its just plumbing. (2)APEI processes the CPER records and signals affected processes. If KVM's user-space is affected, KVM will spot the pending signal when it goes to re-enter the guest, and exit to user-space instead. Qemu takes the SIGBUS_MCEERR_A{O,R}. (3) Qemu decides it wants to hand the guest a RAS error, it populates the CPER records (in memory only Qemu knows about), then drives the KVM API to make the appropriate notification appear. (1) only happens if the guest was running when the error arrived. GHES has ~4 flavours of IRQ which may be used to describe corruption in guest memory. Steps (2) and (3) are exactly the same in this case. Qemu may decide to trigger RAS errors all by itself, (probably for testing and debugging), in which case (1) and (2) don't happen, but (3), is exactly the same. This way platform-firmware/host-kernel can use kernel-first or firmware-first with any of the notifications, independently from Qemu/guest-kernel making a different kernel-first or firmware-first with different notifications. Passing information out of KVM breaks this, forcing Qemu to know about the mechanism platform-firmware is using. We need to tackle (1) and (3) separately. For (3) we need some API that lets Qemu _trigger_ an SError in the guest, with a specified ESR. But, we don't have a way of migrating pending SError yet... which is where I got stuck last time I was looking at this. James
Hi James, On 2017/12/7 3:04, James Morse wrote: > Hi gengdongjiu, > > On 06/12/17 10:26, gengdongjiu wrote: >> On 2017/11/15 0:00, James Morse wrote: >>>> + * error has not been propagated >>>> + */ >>>> + run->exit_reason = KVM_EXIT_EXCEPTION; >>>> + run->ex.exception = ESR_ELx_EC_SERROR; >>>> + run->ex.error_code = KVM_SEI_SEV_RECOVERABLE; >>>> + return 0; >>> We should not pass RAS notifications to user space. The kernel either handles >>> them, or it panics(). User space shouldn't even know if the kernel supports RAS >>> until it gets an MCEERR signal. >>> >>> You're making your firmware-first notification an EL3->EL0 signal, bypassing the OS. >>> >>> If we get a RAS SError and there are no CPER records or values in the ERR nodes, >>> we should panic as it looks like the CPU/firmware is broken. (spurious RAS errors) > >> do you think whether we need to set the guest ESR by user space? if need, I need to >> notify user space that there is a SError happen and need to set ESR for guest in some place of >> KVM. > > I think you are still coming from a world where user-space gets raw RAS > notifications via KVM. This should not happen because the notification method is > private to firmware and the kernel. KVM is just in the way when a guest is running. > > Notifications reaching KVM should be plumbed into the APEI-firmware-first-code > or eventually, a kernel-first mechanism if APEI doesn't 'claim' them. > > The kernel RAS code may signal user-space with the symptoms of the error, and > user-space may decided to generate a new RAS notification for the guest. > > This should function in exactly the same way, regardless of which notification > method is in use between the kernel and firmware. (its the only way to make this > future-proof). > > Which notification user-space chooses to use entirely depends on what (if > anything) it advertised to the guest in the HEST. User-space has to be in > control of triggering any SError, not just overriding the ESR when KVM has > decided it wants to kill the guest. thanks, I will explain more. > > >> so here I return a error code to user space. you mean we should not pass RAS notifications >> to user space, so could you give some suggestion how to notify user space to set guest ESR. > > KVM shouldn't give the guest an SError when it takes a RAS notification, it > should pass the notification to the kernel RAS code. It only needs to 'fall > through' to some default cause if both APEI and kernel-first deny-all-knowledge > of this notification. > > > The end-to-end flow is then (assuming no-VHE): > (1)An error occurs, taking the CPU to EL3. > EL3: triage the error, generate CPER, notify the OS > EL2: KVM takes the notification, exits the guest, returns to host EL1. > EL1: KVM handle_exit() calls APEI to handle the error. > This is the end of KVMs involvement in RAS - its just plumbing. > > (2)APEI processes the CPER records and signals affected processes. > If KVM's user-space is affected, KVM will spot the pending signal when it goes > to re-enter the guest, and exit to user-space instead. > Qemu takes the SIGBUS_MCEERR_A{O,R}. > > (3) Qemu decides it wants to hand the guest a RAS error, it populates the CPER > records (in memory only Qemu knows about), then drives the KVM API to make the > appropriate notification appear. > > > (1) only happens if the guest was running when the error arrived. GHES has ~4 > flavours of IRQ which may be used to describe corruption in guest memory. Steps > (2) and (3) are exactly the same in this case. > > Qemu may decide to trigger RAS errors all by itself, (probably for testing and > debugging), in which case (1) and (2) don't happen, but (3), is exactly the same. > > > This way platform-firmware/host-kernel can use kernel-first or firmware-first > with any of the notifications, independently from Qemu/guest-kernel making a > different kernel-first or firmware-first with different notifications. > > Passing information out of KVM breaks this, forcing Qemu to know about the > mechanism platform-firmware is using. > > > We need to tackle (1) and (3) separately. For (3) we need some API that lets > Qemu _trigger_ an SError in the guest, with a specified ESR. But, we don't have > a way of migrating pending SError yet... which is where I got stuck last time I > was looking at this. I understand you most idea. But In the Qemu one signal type can only correspond to one behavior, can not correspond to two behaviors, otherwise Qemu will do not know how to do. For the Qemu, if it receives the SIGBUS_MCEERR_AR signal, it will populate the CPER records and inject a SEA to guest through KVM IOCTL "KVM_SET_ONE_REG"; if receives the SIGBUS_MCEERR_AO signal, it will record the CPER and trigger a IRQ to notify guest, as shown below: SIGBUS_MCEERR_AR trigger Synchronous External Abort. SIGBUS_MCEERR_AO trigger GPIO IRQ. For the SIGBUS_MCEERR_AO and SIGBUS_MCEERR_AR, we have already specify trigger method, which all not involve _trigger_ an SError. so there is no chance for Qemu to trigger the SError when gets the SIGBUS_MCEERR_A{O,R}. > > > > James > > . >
Hi James, On 2017/12/7 14:37, gengdongjiu wrote: >> We need to tackle (1) and (3) separately. For (3) we need some API that lets >> Qemu _trigger_ an SError in the guest, with a specified ESR. But, we don't have >> a way of migrating pending SError yet... which is where I got stuck last time I >> was looking at this. > I understand you most idea. > > But In the Qemu one signal type can only correspond to one behavior, can not correspond to two behaviors, > otherwise Qemu will do not know how to do. > > For the Qemu, if it receives the SIGBUS_MCEERR_AR signal, it will populate the CPER > records and inject a SEA to guest through KVM IOCTL "KVM_SET_ONE_REG"; if receives the SIGBUS_MCEERR_AO > signal, it will record the CPER and trigger a IRQ to notify guest, as shown below: > > SIGBUS_MCEERR_AR trigger Synchronous External Abort. > SIGBUS_MCEERR_AO trigger GPIO IRQ. > > For the SIGBUS_MCEERR_AO and SIGBUS_MCEERR_AR, we have already specify trigger method, which all > > not involve _trigger_ an SError. > > so there is no chance for Qemu to trigger the SError when gets the SIGBUS_MCEERR_A{O,R}. As I explained above: If Qemu received SIGBUS_MCEERR_AR, it will record CPER and trigger Synchronous External Abort; If Qemu received SIGBUS_MCEERR_AO, it will record CPER and trigger GPIO IRQ; So Qemu does not know when to _trigger_ an SError. so here I "return a error" to Qemu if ghes_notify_sei() return failure in [1], if you opposed KVM "return error", do you have a better idea about it? thanks About the way of migrating pending SError, I think it is a separate case, because Qemu still does not know how and when to trigger the SError. [1]: static int kvm_handle_guest_sei(struct kvm_vcpu *vcpu, struct kvm_run *run) { ....................... + case ESR_ELx_AET_UER: /* The error has not been propagated */ + /* + * Userspace only handle the guest SError Interrupt(SEI) if the + * error has not been propagated + */ + run->exit_reason = KVM_EXIT_EXCEPTION; + run->ex.exception = ESR_ELx_EC_SERROR; + run->ex.error_code = KVM_SEI_SEV_RECOVERABLE; + return 0; ....................... } >
Hi gengdongjiu, On 07/12/17 06:37, gengdongjiu wrote: > I understand you most idea. > > But In the Qemu one signal type can only correspond to one behavior, can not correspond to two behaviors, > otherwise Qemu will do not know how to do. > > For the Qemu, if it receives the SIGBUS_MCEERR_AR signal, it will populate the CPER > records and inject a SEA to guest through KVM IOCTL "KVM_SET_ONE_REG"; if receives the SIGBUS_MCEERR_AO > signal, it will record the CPER and trigger a IRQ to notify guest, as shown below: > > SIGBUS_MCEERR_AR trigger Synchronous External Abort. > SIGBUS_MCEERR_AO trigger GPIO IRQ. > > For the SIGBUS_MCEERR_AO and SIGBUS_MCEERR_AR, we have already specify trigger method, which all > > not involve _trigger_ an SError. It's a policy choice. How does your virtual CPU notify RAS errors to its virtual software? You could use SError for SIGBUS_MCEERR_AR, it depends on what type of CPU you are trying to emulate. I'd suggest using NOTIFY_SEA for SIGBUS_MCEERR_AR as it avoids problems where the guest doesn't take the SError immediately, instead tries to re-execute the code KVM has unmapped from stage2 because its corrupt. (You could detect this happening in Qemu and try something else) Synchronous/asynchronous external abort matters to the CPU, but once the error has been notified to software the reasons for this distinction disappear. Once the error has been handled, all trace of this distinction is gone. CPER records only describe component failures. You are trying to re-create some state that disappeared with one of the firmware-first abstractions. Trying to re-create this information isn't worth the effort as the distinction doesn't matter to linux, only to the CPU. > so there is no chance for Qemu to trigger the SError when gets the SIGBUS_MCEERR_A{O,R}. You mean there is no reason for Qemu to trigger an SError when it gets a signal from the kernel. The reasons the CPU might have to generate an SError don't apply to linux and KVM user space. User-space will never get a signal for an uncontained error, we will always panic(). We can't give user-space a signal for imprecise exceptions, as it can't return from the signal. The classes of error that are left are covered by polled/irq and NOTIFY_SEA. Qemu can decide to generate RAS SErrors for SIGBUS_MCEERR_AR if it really wants to, (but I don't think you should, the kernel may have unmapped the page at PC from stage2 due to corruption). I think the problem here is you're applying the CPU->software behaviour and choices to software->software. By the time user-space gets the error, the behaviour is different. Thanks, James
Hi James, On 2017/12/16 2:52, James Morse wrote: >> signal, it will record the CPER and trigger a IRQ to notify guest, as shown below: >> >> SIGBUS_MCEERR_AR trigger Synchronous External Abort. >> SIGBUS_MCEERR_AO trigger GPIO IRQ. >> >> For the SIGBUS_MCEERR_AO and SIGBUS_MCEERR_AR, we have already specify trigger method, which all >> >> not involve _trigger_ an SError. > It's a policy choice. How does your virtual CPU notify RAS errors to its virtual > software? You could use SError for SIGBUS_MCEERR_AR, it depends on what type of > CPU you are trying to emulate. > > I'd suggest using NOTIFY_SEA for SIGBUS_MCEERR_AR as it avoids problems where > the guest doesn't take the SError immediately, instead tries to re-execute the I agree it is better to use NOTIFY_SEA for SIGBUS_MCEERR_AR in this case. > code KVM has unmapped from stage2 because its corrupt. (You could detect this > happening in Qemu and try something else)For something else, using NOTIFY_SEI for SIGBUS_MCEERR_AR? At current implementation, It seems only have this case that "KVM has unmapped from stage2", do you thing we still have something else? > > > Synchronous/asynchronous external abort matters to the CPU, but once the error > has been notified to software the reasons for this distinction disappear. Once > the error has been handled, all trace of this distinction is gone. > > CPER records only describe component failures. You are trying to re-create some > state that disappeared with one of the firmware-first abstractions. Trying to > re-create this information isn't worth the effort as the distinction doesn't > matter to linux, only to the CPU. > > >> so there is no chance for Qemu to trigger the SError when gets the SIGBUS_MCEERR_A{O,R}. > You mean there is no reason for Qemu to trigger an SError when it gets a signal > from the kernel. > > The reasons the CPU might have to generate an SError don't apply to linux and > KVM user space. User-space will never get a signal for an uncontained error, we > will always panic(). We can't give user-space a signal for imprecise exceptions, > as it can't return from the signal. The classes of error that are left are > covered by polled/irq and NOTIFY_SEA. > > Qemu can decide to generate RAS SErrors for SIGBUS_MCEERR_AR if it really wants > to, (but I don't think you should, the kernel may have unmapped the page at PC > from stage2 due to corruption). yes, you also said you do not want to generate RAS SErrors for SIGBUS_MCEERR_AR, so Qemu does not know in which condition to generate RAS SErrors. > > I think the problem here is you're applying the CPU->software behaviour and > choices to software->software. By the time user-space gets the error, the > behaviour is different. In the KVM, as a policy choice to reserve this API to specify guest ESR and drive to trigger SError is OK, At least for Qemu it does not know in which condition to trigger it.
[...] > >> + case ESR_ELx_AET_UER: /* The error has not been propagated */ >> + /* >> + * Userspace only handle the guest SError Interrupt(SEI) if the >> + * error has not been propagated >> + */ >> + run->exit_reason = KVM_EXIT_EXCEPTION; >> + run->ex.exception = ESR_ELx_EC_SERROR; >> + run->ex.error_code = KVM_SEI_SEV_RECOVERABLE; >> + return 0; > > We should not pass RAS notifications to user space. The kernel either handles > them, or it panics(). User space shouldn't even know if the kernel supports RAS For the ESR_ELx_AET_UER(Recoverable error), let us see its definition below, which get from [0] The state of the PE is Recoverable if all of the following are true: — The error has not been silently propagated. — The error has not been architecturally consumed by the PE. (The PE architectural state is not infected.) — The exception is precise and PE can recover execution from the preferred return address of the exception, if software locates and repairs the error. The PE cannot make correct progress without either consuming the error or otherwise making the error unrecoverable. The error remains latent in the system. If software cannot locate and repair the error, either the application or the VM, or both, must be isolated by software. so we can see the exception is precise and PE can recover execution from the preferred return address of the exception, so let guest handling it is better, for example, if it is guest application RAS error, we can kill the guest application instead of panic whole OS; if it is guest kernel RAS error, guest will panic. Host does not know which application of guest has error, so host can not handle it, panic OS is not a good choice for the Recoverable error. [0] https://static.docs.arm.com/ddi0587/a/RAS%20Extension-release%20candidate_march_29.pdf > until it gets an MCEERR signal. user space will detect whether kernel support RAS before handing it. > > You're making your firmware-first notification an EL3->EL0 signal, bypassing the OS. > > If we get a RAS SError and there are no CPER records or values in the ERR nodes, > we should panic as it looks like the CPU/firmware is broken. (spurious RAS errors) > > >> + default: >> + /* >> + * Until now, the CPU supports RAS and SEI is fatal, or host >> + * does not support to handle the SError. >> + */ >> + panic("This Asynchronous SError interrupt is dangerous, panic"); >> + } >> + >> + return 0; >> +} >> + >> /* >> * Return > 0 to return to guest, < 0 on error, 0 (and set exit_reason) on >> * proper exit to userspace. > > > > James > _______________________________________________ > kvmarm mailing list > kvmarm@lists.cs.columbia.edu > https://lists.cs.columbia.edu/mailman/listinfo/kvmarm
Hi gengdongjiu, On 15/12/17 03:30, gengdongjiu wrote: > On 2017/12/7 14:37, gengdongjiu wrote: >>> We need to tackle (1) and (3) separately. For (3) we need some API that lets >>> Qemu _trigger_ an SError in the guest, with a specified ESR. But, we don't have >>> a way of migrating pending SError yet... which is where I got stuck last time I >>> was looking at this. >> I understand you most idea. >> >> But In the Qemu one signal type can only correspond to one behavior, can not correspond to two behaviors, >> otherwise Qemu will do not know how to do. >> >> For the Qemu, if it receives the SIGBUS_MCEERR_AR signal, it will populate the CPER >> records and inject a SEA to guest through KVM IOCTL "KVM_SET_ONE_REG"; if receives the SIGBUS_MCEERR_AO >> signal, it will record the CPER and trigger a IRQ to notify guest, as shown below: >> >> SIGBUS_MCEERR_AR trigger Synchronous External Abort. >> SIGBUS_MCEERR_AO trigger GPIO IRQ. >> >> For the SIGBUS_MCEERR_AO and SIGBUS_MCEERR_AR, we have already specify trigger method, which all >> >> not involve _trigger_ an SError. >> >> so there is no chance for Qemu to trigger the SError when gets the SIGBUS_MCEERR_A{O,R}. > > As I explained above: > > If Qemu received SIGBUS_MCEERR_AR, it will record CPER and trigger Synchronous External Abort; > If Qemu received SIGBUS_MCEERR_AO, it will record CPER and trigger GPIO IRQ; > So Qemu does not know when to _trigger_ an SError. There is no answer to this. How the CPU decides is specific to the CPU design. How Qemu decides is going to be specific to the machine it emulates. My understanding is there is some overlap for which RAS errors are reported as synchronous external abort, and which use SError. (Obviously the imprecise ones are all SError). Which one the CPU uses depends on how the CPU is designed. When you take an SIGBUS_MCEERR_AR from KVM, its because KVM can't complete a stage2 fault because the page is marked with PG_poisoned. These started out as a synchronous exception, but you could still report these with SError. We don't have a way to signal user-space about imprecise exceptions, this isn't a KVM specific problem. > so here I "return a error" to Qemu if ghes_notify_sei() return failure in [1], if you opposed KVM "return error", > do you have a better idea about it? thanks If ghes_notify_sei() fails to claim the error, we should drop through to kernel-first-handling. We don't have that yet, just the stub that ignores errors where we can make progress. If neither firmware-first nor kernel-first claim a RAS error, we're in trouble. I'd like to panic() as we got a RAS notification but no description of the error. We can't do this until we have kernel-first support, hence that stub. > About the way of migrating pending SError, I think it is a separate case, because Qemu still does not know > how and when to trigger the SError. I agree, but I think we should fix this first before we add another user of this unmigratable hypervisor state. (I recall someone saying migration is needed for any new KVM/cpu features, but I can't find the thread) > [1]: > static int kvm_handle_guest_sei(struct kvm_vcpu *vcpu, struct kvm_run *run) > { > ....................... > + case ESR_ELx_AET_UER: /* The error has not been propagated */ > + /* > + * Userspace only handle the guest SError Interrupt(SEI) if the > + * error has not been propagated > + */ > + run->exit_reason = KVM_EXIT_EXCEPTION; > + run->ex.exception = ESR_ELx_EC_SERROR; I'm against telling user space RAS errors ever happened, only the final user-visible error when the kernel can't fix it. This is inventing something new for RAS errors not claimed by firmware-first. If we have kernel-first too, this will never happen. (unless your system is losing the error description). Your system has firmware-first, why isn't it claiming the notification? If its not finding CPER records written by firmware, check firmware and the UEFI memory map agree on the attributes to be used when read/writing that area. > + run->ex.error_code = KVM_SEI_SEV_RECOVERABLE; > + return 0; Thanks, James
Hi gengdongjiu, On 16/12/17 04:47, gengdongjiu wrote: > [...] >> >>> + case ESR_ELx_AET_UER: /* The error has not been propagated */ >>> + /* >>> + * Userspace only handle the guest SError Interrupt(SEI) if the >>> + * error has not been propagated >>> + */ >>> + run->exit_reason = KVM_EXIT_EXCEPTION; >>> + run->ex.exception = ESR_ELx_EC_SERROR; >>> + run->ex.error_code = KVM_SEI_SEV_RECOVERABLE; >>> + return 0; >> >> We should not pass RAS notifications to user space. The kernel either handles >> them, or it panics(). User space shouldn't even know if the kernel supports RAS > > For the ESR_ELx_AET_UER(Recoverable error), let us see its definition > below, which get from [0] [..] > so we can see the exception is precise and PE can recover execution > from the preferred return address of the exception, > so let guest handling it is > better, for example, if it is guest application RAS error, we can kill > the guest application instead of panic whole OS; if it is guest kernel > RAS error, guest will panic. If the kernel takes an unhandled RAS error it should panic - we don't know where the error is. I understand you want to kill-off guest tasks as a result of RAS errors, but this needs to go through the whole APEI->memory_failure()->sigbus machinery so that the kernel knows the kernel can keep running. This saves us signalling user-space when we don't need to. An example: code-corruption. Linux can happily re-read affected user-space executables from disk, there is absolutely nothing user-space can do about it. Handling errors first in the kernel allows us to do recovery for all the affected processes, not just the one that happens to be running right now. > Host does not know which application of guest has error, so host can > not handle it, It has to work this out, otherwise the errors we can handle never get a chance. This kernel is expected to look at the error description, (which for some reason we aren't talking about here), e.g. the CPER records, and determine what recovery action is necessary for this error. For memory errors this may be re-reading from disk, or at the worst case, unmapping from all user-space users (including KVM's stage2) and raining signals on all affected processes. For a memory error the important piece of information is the physical address. Only the kernel can do anything with this, it determines who owns the affected memory and what needs doing to recover from the error. If you pass the notification to user-space, all it can do is signal the guest to "stop doing whatever it is you're doing". The guest may have been able to re-read pages from disk, or otherwise handle the error. Has the error been handled? No: The error remains latent in the system. > panic OS is not a good choice for the Recoverable error. If we don't know where the error is, and we can't make progress, its the only sane choice. This code is never expected to run! (why are we arguing about it?) We should get RAS errors as GHES notifications from firmware via some mechanism. If those are NOTIFY_SEI then APEI should claim the notification and kick off the appropriate handling based on the CPER records. If/when we get kernel-first, that can claim the SError. What we're left with is RAS notifications that no-one claimed because there was no error-description found. James
On Fri, Jan 12, 2018 at 06:05:23PM +0000, James Morse wrote: > On 15/12/17 03:30, gengdongjiu wrote: > > On 2017/12/7 14:37, gengdongjiu wrote: [...] > > (I recall someone saying migration is needed for any new KVM/cpu features, but I > can't find the thread) > I don't know of any hard set-in-stone rule for this, but I have certainly argued that since migration is a popular technique in data centers and often a key motivation behind using virtual machines as it provides both load-balancing and high availability, we should think about migration support for all features and state. Further, experience has shown that retroactively trying to support migration can result in really complex interfaces for saving/restoring state (see the ITS ordering requirements in Documentation/virtual/kvm/devices/arm-vgic-its.txt as an example) so thinking about this problem when introducing functionality is a good idea. Of course, if there are really good arguments for having some state that simply cannot be migrated, then that's fine, and we should just make sure that userspace (e.g. QEMU) and higher level components in the stack (libvirt, openstack, etc.) can detect this state being used, and ideally enable/disable it, so that it can predict that a particular VM cannot be migrated off a particular host, or between a particular set of two hosts. As an example, migration is typically prohibited when using VFIO direct device assignment, but userspace etc. are already aware of this. As a final note, if we add support for some architectural feature, which may be present on some particular hardware and/or implementation, if the KVM support for said feature is automatically enabled (and not selectively from userspace), I would push back quite strongly on something that doesn't support migration, because it would effectively prevent migration of VMs on ARM. Thanks, -Christoffer
Hi Christoffer On 2018/1/15 16:33, Christoffer Dall wrote: > On Fri, Jan 12, 2018 at 06:05:23PM +0000, James Morse wrote: >> On 15/12/17 03:30, gengdongjiu wrote: >>> On 2017/12/7 14:37, gengdongjiu wrote: > > [...] > >> >> (I recall someone saying migration is needed for any new KVM/cpu features, but I >> can't find the thread) >> > > I don't know of any hard set-in-stone rule for this, but I have > certainly argued that since migration is a popular technique in data > centers and often a key motivation behind using virtual machines as it > provides both load-balancing and high availability, we should think > about migration support for all features and state. Further, experience > has shown that retroactively trying to support migration can result in > really complex interfaces for saving/restoring state (see the ITS > ordering requirements in > Documentation/virtual/kvm/devices/arm-vgic-its.txt as an example) so > thinking about this problem when introducing functionality is a good > idea. > > Of course, if there are really good arguments for having some state that > simply cannot be migrated, then that's fine, and we should just make > sure that userspace (e.g. QEMU) and higher level components in the > stack (libvirt, openstack, etc.) can detect this state being used, and > ideally enable/disable it, so that it can predict that a particular VM > cannot be migrated off a particular host, or between a particular set of > two hosts. As an example, migration is typically prohibited when using > VFIO direct device assignment, but userspace etc. are already aware of > this. > > As a final note, if we add support for some architectural feature, which > may be present on some particular hardware and/or implementation, if the > KVM support for said feature is automatically enabled (and not > selectively from userspace), I would push back quite strongly on > something that doesn't support migration, because it would effectively > prevent migration of VMs on ARM. Thanks very much for this mail and reply, I will check it, please give me some time due to recently busy with other things. > > Thanks, > -Christoffer > > . >
Hi James, thanks very much for your mail and reply, I will check it ASAP. Due to recently busy with other thing, so reply may be late. On 2018/1/13 2:05, James Morse wrote: > Hi gengdongjiu, > > On 16/12/17 04:47, gengdongjiu wrote: >> [...] >>> >>>> + case ESR_ELx_AET_UER: /* The error has not been propagated */ >>>> + /* >>>> + * Userspace only handle the guest SError Interrupt(SEI) if the >>>> + * error has not been propagated >>>> + */ >>>> + run->exit_reason = KVM_EXIT_EXCEPTION; >>>> + run->ex.exception = ESR_ELx_EC_SERROR; >>>> + run->ex.error_code = KVM_SEI_SEV_RECOVERABLE; >>>> + return 0; >>> >>> We should not pass RAS notifications to user space. The kernel either handles >>> them, or it panics(). User space shouldn't even know if the kernel supports RAS >> >> For the ESR_ELx_AET_UER(Recoverable error), let us see its definition >> below, which get from [0] > > [..] > >> so we can see the exception is precise and PE can recover execution >> from the preferred return address of the exception, > >> so let guest handling it is >> better, for example, if it is guest application RAS error, we can kill >> the guest application instead of panic whole OS; if it is guest kernel >> RAS error, guest will panic. > > If the kernel takes an unhandled RAS error it should panic - we don't know where > the error is. > > I understand you want to kill-off guest tasks as a result of RAS errors, but > this needs to go through the whole APEI->memory_failure()->sigbus machinery so > that the kernel knows the kernel can keep running. > > This saves us signalling user-space when we don't need to. An example: > code-corruption. Linux can happily re-read affected user-space executables from > disk, there is absolutely nothing user-space can do about it. > Handling errors first in the kernel allows us to do recovery for all the > affected processes, not just the one that happens to be running right now. > > >> Host does not know which application of guest has error, so host can >> not handle it, > > It has to work this out, otherwise the errors we can handle never get a chance. > > This kernel is expected to look at the error description, (which for some reason > we aren't talking about here), e.g. the CPER records, and determine what > recovery action is necessary for this error. > For memory errors this may be re-reading from disk, or at the worst case, > unmapping from all user-space users (including KVM's stage2) and raining signals > on all affected processes. > > For a memory error the important piece of information is the physical address. > Only the kernel can do anything with this, it determines who owns the affected > memory and what needs doing to recover from the error. > > If you pass the notification to user-space, all it can do is signal the guest to > "stop doing whatever it is you're doing". The guest may have been able to > re-read pages from disk, or otherwise handle the error. > Has the error been handled? No: The error remains latent in the system. > > >> panic OS is not a good choice for the Recoverable error. > > If we don't know where the error is, and we can't make progress, its the only > sane choice. > > This code is never expected to run! (why are we arguing about it?) We should get > RAS errors as GHES notifications from firmware via some mechanism. If those are > NOTIFY_SEI then APEI should claim the notification and kick off the appropriate > handling based on the CPER records. If/when we get kernel-first, that can claim > the SError. What we're left with is RAS notifications that no-one claimed > because there was no error-description found. > > > > James > > . >
Hi James, Sorry for my late response due to out of office recently. 2018-01-13 2:05 GMT+08:00 James Morse <james.morse@arm.com>: > Hi gengdongjiu, > > On 15/12/17 03:30, gengdongjiu wrote: >> On 2017/12/7 14:37, gengdongjiu wrote: >>>> We need to tackle (1) and (3) separately. For (3) we need some API that lets >>>> Qemu _trigger_ an SError in the guest, with a specified ESR. But, we don't have >>>> a way of migrating pending SError yet... which is where I got stuck last time I >>>> was looking at this. >>> I understand you most idea. >>> >>> But In the Qemu one signal type can only correspond to one behavior, can not correspond to two behaviors, >>> otherwise Qemu will do not know how to do. >>> >>> For the Qemu, if it receives the SIGBUS_MCEERR_AR signal, it will populate the CPER >>> records and inject a SEA to guest through KVM IOCTL "KVM_SET_ONE_REG"; if receives the SIGBUS_MCEERR_AO >>> signal, it will record the CPER and trigger a IRQ to notify guest, as shown below: >>> >>> SIGBUS_MCEERR_AR trigger Synchronous External Abort. >>> SIGBUS_MCEERR_AO trigger GPIO IRQ. >>> >>> For the SIGBUS_MCEERR_AO and SIGBUS_MCEERR_AR, we have already specify trigger method, which all >>> >>> not involve _trigger_ an SError. >>> >>> so there is no chance for Qemu to trigger the SError when gets the SIGBUS_MCEERR_A{O,R}. >> >> As I explained above: >> >> If Qemu received SIGBUS_MCEERR_AR, it will record CPER and trigger Synchronous External Abort; >> If Qemu received SIGBUS_MCEERR_AO, it will record CPER and trigger GPIO IRQ; > >> So Qemu does not know when to _trigger_ an SError. > > There is no answer to this. How the CPU decides is specific to the CPU design. > How Qemu decides is going to be specific to the machine it emulates. > > My understanding is there is some overlap for which RAS errors are reported as > synchronous external abort, and which use SError. (Obviously the imprecise ones > are all SError). Which one the CPU uses depends on how the CPU is designed. > > When you take an SIGBUS_MCEERR_AR from KVM, its because KVM can't complete a > stage2 fault because the page is marked with PG_poisoned. These started out as a > synchronous exception, but you could still report these with SError. yes, I agree, it is policy choice. > > We don't have a way to signal user-space about imprecise exceptions, this isn't > a KVM specific problem. > > >> so here I "return a error" to Qemu if ghes_notify_sei() return failure in [1], if you opposed KVM "return error", >> do you have a better idea about it? thanks > > If ghes_notify_sei() fails to claim the error, we should drop through to > kernel-first-handling. We don't have that yet, just the stub that ignores errors > where we can make progress. > > If neither firmware-first nor kernel-first claim a RAS error, we're in trouble. > I'd like to panic() as we got a RAS notification but no description of the > error. We can't do this until we have kernel-first support, hence that stub. > > >> About the way of migrating pending SError, I think it is a separate case, because Qemu still does not know >> how and when to trigger the SError. > > I agree, but I think we should fix this first before we add another user of this > unmigratable hypervisor state. > > (I recall someone saying migration is needed for any new KVM/cpu features, but I > can't find the thread) > > >> [1]: >> static int kvm_handle_guest_sei(struct kvm_vcpu *vcpu, struct kvm_run *run) >> { >> ....................... >> + case ESR_ELx_AET_UER: /* The error has not been propagated */ >> + /* >> + * Userspace only handle the guest SError Interrupt(SEI) if the >> + * error has not been propagated >> + */ >> + run->exit_reason = KVM_EXIT_EXCEPTION; >> + run->ex.exception = ESR_ELx_EC_SERROR; > > I'm against telling user space RAS errors ever happened, only the final > user-visible error when the kernel can't fix it. thanks for the explanation. For the ESR_ELx_AET_UER, this exception is precise, closing the VM may be better[1]. But if you think panic is better until we support kernel-first, it is also OK to me. +static int kvm_handle_guest_sei(struct kvm_vcpu *vcpu, struct kvm_run *run) +{ + unsigned int esr = kvm_vcpu_get_hsr(vcpu); + bool impdef_syndrome = esr & ESR_ELx_ISV; /* aka IDS */ + unsigned int aet = esr & ESR_ELx_AET; + + /* + * This is not RAS SError + */ + if (!cpus_have_const_cap(ARM64_HAS_RAS_EXTN)) { + kvm_inject_vabt(vcpu); + return 1; + } + + /* For RAS the host kernel may handle this abort. */ + if (!handle_guest_sei()) + return 1; + + /* + * In below two conditions, it will directly inject the + * virtual SError: + * 1. The Syndrome is IMPLEMENTATION DEFINED + * 2. It is Uncategorized SEI + */ + if (impdef_syndrome || + ((esr & ESR_ELx_FSC) != ESR_ELx_FSC_SERROR)) { + kvm_inject_vabt(vcpu); + return 1; + } + + switch (aet) { + case ESR_ELx_AET_CE: /* corrected error */ + case ESR_ELx_AET_UEO: /* restartable error, not yet consumed */ + return 1; /* continue processing the guest exit */ + case ESR_ELx_AET_UER: /* recoverable error */ + /* + * the exception is precise, not been silently propagated + * and not been consumed by the CPU, temporarily shut down + * the VM to isolated the error, hope not touch it again. + */ + run->exit_reason = KVM_EXIT_EXCEPTION; + return 0; + default: + /* + * Until now, the CPU supports RAS, SError interrupt is fatal + * and host does not successfully handle it. + */ + panic("This Asynchronous SError interrupt is dangerous, panic"); + } + + return 0; +} + > > This is inventing something new for RAS errors not claimed by firmware-first. > If we have kernel-first too, this will never happen. (unless your system is > losing the error description). In fact, if we have kernel-first, I think we still need to judge the error type by ESR, right? If the handle_guest_sei() , may be the system does not support firmware-first, so we judge the ESR value, > > > Your system has firmware-first, why isn't it claiming the notification? > If its not finding CPER records written by firmware, check firmware and the UEFI > memory map agree on the attributes to be used when read/writing that area. > > >> + run->ex.error_code = KVM_SEI_SEV_RECOVERABLE; >> + return 0; > > > Thanks, > > James > > _______________________________________________ > kvmarm mailing list > kvmarm@lists.cs.columbia.edu > https://lists.cs.columbia.edu/mailman/listinfo/kvmarm
2018-01-13 2:05 GMT+08:00 James Morse <james.morse@arm.com>: > Hi gengdongjiu, > > On 16/12/17 04:47, gengdongjiu wrote: >> [...] >>> >>>> + case ESR_ELx_AET_UER: /* The error has not been propagated */ >>>> + /* >>>> + * Userspace only handle the guest SError Interrupt(SEI) if the >>>> + * error has not been propagated >>>> + */ >>>> + run->exit_reason = KVM_EXIT_EXCEPTION; >>>> + run->ex.exception = ESR_ELx_EC_SERROR; >>>> + run->ex.error_code = KVM_SEI_SEV_RECOVERABLE; >>>> + return 0; >>> >>> We should not pass RAS notifications to user space. The kernel either handles >>> them, or it panics(). User space shouldn't even know if the kernel supports RAS >> >> For the ESR_ELx_AET_UER(Recoverable error), let us see its definition >> below, which get from [0] > > [..] > >> so we can see the exception is precise and PE can recover execution >> from the preferred return address of the exception, > >> so let guest handling it is >> better, for example, if it is guest application RAS error, we can kill >> the guest application instead of panic whole OS; if it is guest kernel >> RAS error, guest will panic. > > If the kernel takes an unhandled RAS error it should panic - we don't know where > the error is. OK, here I will panic. > > I understand you want to kill-off guest tasks as a result of RAS errors, but > this needs to go through the whole APEI->memory_failure()->sigbus machinery so > that the kernel knows the kernel can keep running. > > This saves us signalling user-space when we don't need to. An example: > code-corruption. Linux can happily re-read affected user-space executables from > disk, there is absolutely nothing user-space can do about it. > Handling errors first in the kernel allows us to do recovery for all the > affected processes, not just the one that happens to be running right now. > > >> Host does not know which application of guest has error, so host can >> not handle it, > > It has to work this out, otherwise the errors we can handle never get a chance. > > This kernel is expected to look at the error description, (which for some reason > we aren't talking about here), e.g. the CPER records, and determine what > recovery action is necessary for this error. > For memory errors this may be re-reading from disk, or at the worst case, > unmapping from all user-space users (including KVM's stage2) and raining signals > on all affected processes. > > For a memory error the important piece of information is the physical address. > Only the kernel can do anything with this, it determines who owns the affected > memory and what needs doing to recover from the error. > > If you pass the notification to user-space, all it can do is signal the guest to > "stop doing whatever it is you're doing". The guest may have been able to > re-read pages from disk, or otherwise handle the error. > Has the error been handled? No: The error remains latent in the system. > > >> panic OS is not a good choice for the Recoverable error. > > If we don't know where the error is, and we can't make progress, its the only > sane choice. Ok, I will panic here. > > This code is never expected to run! (why are we arguing about it?) We should get > RAS errors as GHES notifications from firmware via some mechanism. If those are > NOTIFY_SEI then APEI should claim the notification and kick off the appropriate > handling based on the CPER records. If/when we get kernel-first, that can claim > the SError. What we're left with is RAS notifications that no-one claimed > because there was no error-description found. > > > > James
2018-01-15 16:33 GMT+08:00 Christoffer Dall <christoffer.dall@linaro.org>: > On Fri, Jan 12, 2018 at 06:05:23PM +0000, James Morse wrote: >> On 15/12/17 03:30, gengdongjiu wrote: >> > On 2017/12/7 14:37, gengdongjiu wrote: > > [...] > >> >> (I recall someone saying migration is needed for any new KVM/cpu features, but I >> can't find the thread) >> > > I don't know of any hard set-in-stone rule for this, but I have > certainly argued that since migration is a popular technique in data > centers and often a key motivation behind using virtual machines as it > provides both load-balancing and high availability, we should think > about migration support for all features and state. Further, experience > has shown that retroactively trying to support migration can result in > really complex interfaces for saving/restoring state (see the ITS > ordering requirements in > Documentation/virtual/kvm/devices/arm-vgic-its.txt as an example) so > thinking about this problem when introducing functionality is a good > idea. yes, agree it. > > Of course, if there are really good arguments for having some state that > simply cannot be migrated, then that's fine, and we should just make > sure that userspace (e.g. QEMU) and higher level components in the > stack (libvirt, openstack, etc.) can detect this state being used, and > ideally enable/disable it, so that it can predict that a particular VM > cannot be migrated off a particular host, or between a particular set of > two hosts. As an example, migration is typically prohibited when using > VFIO direct device assignment, but userspace etc. are already aware of > this. Ok, I think this problem is similar to migrating a VM that uses an irqchip in userspace and has set the IRQ or FIQ lines using KVM_IRQ_LINE. > > As a final note, if we add support for some architectural feature, which > may be present on some particular hardware and/or implementation, if the > KVM support for said feature is automatically enabled (and not > selectively from userspace), I would push back quite strongly on > something that doesn't support migration, because it would effectively > prevent migration of VMs on ARM. Ok, got it. > > Thanks, > -Christoffer > _______________________________________________ > kvmarm mailing list > kvmarm@lists.cs.columbia.edu > https://lists.cs.columbia.edu/mailman/listinfo/kvmarm
Hi gengdongjiu, On 21/01/18 02:45, gengdongjiu wrote: > For the ESR_ELx_AET_UER, this exception is precise, closing the VM may > be better[1]. > But if you think panic is better until we support kernel-first, it is > also OK to me. I'm not convinced SError while a guest was running means only guest memory could be affected. Mechanisms like KSM means the error could affect multiple guests. Both firmware-fist and kernel-first will give us the address, with which we can know which processes are affected, isolated the memory and signal affected processes. Until we have one of these panic() is the only way we have to contain an error, but its an interim fix. Not panic()ing the host for an error that should be contained to the guest is a fudge, we don't actually know its safe (KSM, page-table etc). I want to improve on this with {firmware, kernel}-first support (or both!), I don't want to expose that this is happening to user-space, as once we have one of {firmware, kernel}-first, it shouldn't happen. >> This is inventing something new for RAS errors not claimed by firmware-first. >> If we have kernel-first too, this will never happen. (unless your system is >> losing the error description). > In fact, if we have kernel-first, I think we still need to judge the > error type by ESR, right? The kernel-first mechanism should consider the ESR/FAR, yes, but once the error has been claimed and handled, KVM shouldn't care about any of these values. (maybe we'll sanity check for uncontained errors, just in case the error escaped to the RAS code...) My point here was exposing 'unhandled' (ignored) RAS errors to user-space creates an ABI: someone will complain once we start handling the error, and they no longer get a notification via this 'unhandled' interface. Code written to use this interface becomes useless/untested. > If the handle_guest_sei() , may be the system does not support firmware-first, > so we judge the ESR value, ...and panic()/ignore as appropriate. I agree not all systems will support firmware-first, (big-endian is the obvious example), but if we get kernel-first support this ESR guessing can disappear, I'm against exposing it to user-space in the meantime. Thanks, James
Hi gengdongjiu, On 16/12/17 03:44, gengdongjiu wrote: > On 2017/12/16 2:52, James Morse wrote: >>> signal, it will record the CPER and trigger a IRQ to notify guest, as shown below: >>> >>> SIGBUS_MCEERR_AR trigger Synchronous External Abort. >>> SIGBUS_MCEERR_AO trigger GPIO IRQ. >>> >>> For the SIGBUS_MCEERR_AO and SIGBUS_MCEERR_AR, we have already specify trigger method, which all >>> >>> not involve _trigger_ an SError. >> It's a policy choice. How does your virtual CPU notify RAS errors to its virtual >> software? You could use SError for SIGBUS_MCEERR_AR, it depends on what type of >> CPU you are trying to emulate. >> >> I'd suggest using NOTIFY_SEA for SIGBUS_MCEERR_AR as it avoids problems where >> the guest doesn't take the SError immediately, instead tries to re-execute the > I agree it is better to use NOTIFY_SEA for SIGBUS_MCEERR_AR in this case. >> code KVM has unmapped from stage2 because its corrupt. (You could detect this >> happening in Qemu and try something else) > For something else, using NOTIFY_SEI for SIGBUS_MCEERR_AR? Sorry that was unclear. If you use NOTIFY_SEI, the guest may have PSTATE.A set, in which case the the CPU will patiently wait for it to unmask, (or consume it with an ESB-instruction), before delivering the notification. The guest may not have unmasked SError because its hammering the same page taking the same fault again and again. Pending the asynchronous notification and re-running the vcpu doesn't guarantee progress will be made. In this case user-space can spot its pended an asynchronous notification (for the same address!) more than once in the last few seconds, and try something else, like firing a guest:reboot watchdog on another CPU. > At current implementation, > It seems only have this case that "KVM has unmapped from stage2", do you thing we > still have something else? I'm wary that this only works for errors where we know the guest PC accessed the faulting location. The arch code will send this signal too if user-space touches the PG_poisoned page. (I recall you checked Qemu spots this case and acts differently). Migration is the obvious example for Qemu read/writing guest memory. On x86 the MachineCheck code sends these signals too, so our kernel-first implementation may do the same. As a response to a RAS error notified by synchronous-external-abort, this is fine. But we need to remember '_AR' implies the error is related to the code the signal interrupted, which wouldn't be true for an error notified by SError. >> Synchronous/asynchronous external abort matters to the CPU, but once the error >> has been notified to software the reasons for this distinction disappear. Once >> the error has been handled, all trace of this distinction is gone. >> >> CPER records only describe component failures. You are trying to re-create some >> state that disappeared with one of the firmware-first abstractions. Trying to >> re-create this information isn't worth the effort as the distinction doesn't >> matter to linux, only to the CPU. >> >> >>> so there is no chance for Qemu to trigger the SError when gets the SIGBUS_MCEERR_A{O,R}. >> You mean there is no reason for Qemu to trigger an SError when it gets a signal >> from the kernel. >> >> The reasons the CPU might have to generate an SError don't apply to linux and >> KVM user space. User-space will never get a signal for an uncontained error, we >> will always panic(). We can't give user-space a signal for imprecise exceptions, >> as it can't return from the signal. The classes of error that are left are >> covered by polled/irq and NOTIFY_SEA. >> >> Qemu can decide to generate RAS SErrors for SIGBUS_MCEERR_AR if it really wants >> to, (but I don't think you should, the kernel may have unmapped the page at PC >> from stage2 due to corruption). > yes, you also said you do not want to generate RAS SErrors for SIGBUS_MCEERR_AR, > so Qemu does not know in which condition to generate RAS SErrors. There are two things going on here, firstly the guest may have masked PSTATE.A, and be hammering an unmapped page. (this this 'sorry that was unclear' case above). This would happen if the exception-entry code or stack became corrupt when an exception was taken. The second is what does existing non-RAS-aware software do? For SError it panic()s, whereas for synchronous external abort there are some cases that can be handled. (e.g. on linux: synchronous external abort from user-space). >> I think the problem here is you're applying the CPU->software behaviour and >> choices to software->software. By the time user-space gets the error, the >> behaviour is different. > In the KVM, as a policy choice to reserve this API to specify guest ESR and > drive to trigger SError is OK, > At least for Qemu it does not know in which condition to trigger it. I think you're saying "lets keep it KVM for now, Qemu doesn't have a better idea of what to do." Thanks, James
diff --git a/arch/arm64/include/asm/esr.h b/arch/arm64/include/asm/esr.h index 66ed8b6..aca7eee 100644 --- a/arch/arm64/include/asm/esr.h +++ b/arch/arm64/include/asm/esr.h @@ -102,6 +102,7 @@ #define ESR_ELx_FSC_ACCESS (0x08) #define ESR_ELx_FSC_FAULT (0x04) #define ESR_ELx_FSC_PERM (0x0C) +#define ESR_ELx_FSC_SERROR (0x11) /* ISS field definitions for Data Aborts */ #define ESR_ELx_ISV_SHIFT (24) @@ -119,6 +120,20 @@ #define ESR_ELx_CM_SHIFT (8) #define ESR_ELx_CM (UL(1) << ESR_ELx_CM_SHIFT) +/* ISS field definitions for SError interrupt */ +#define ESR_ELx_AET_SHIFT (10) +#define ESR_ELx_AET (UL(0x7) << ESR_ELx_AET_SHIFT) +/* Uncontainable error */ +#define ESR_ELx_AET_UC (UL(0) << ESR_ELx_AET_SHIFT) +/* Unrecoverable error */ +#define ESR_ELx_AET_UEU (UL(1) << ESR_ELx_AET_SHIFT) +/* Restartable error */ +#define ESR_ELx_AET_UEO (UL(2) << ESR_ELx_AET_SHIFT) +/* Recoverable error */ +#define ESR_ELx_AET_UER (UL(3) << ESR_ELx_AET_SHIFT) +/* Corrected */ +#define ESR_ELx_AET_CE (UL(6) << ESR_ELx_AET_SHIFT) + /* ISS field definitions for exceptions taken in to Hyp */ #define ESR_ELx_CV (UL(1) << 24) #define ESR_ELx_COND_SHIFT (20) diff --git a/arch/arm64/include/asm/kvm_asm.h b/arch/arm64/include/asm/kvm_asm.h index 26a64d0..884f723 100644 --- a/arch/arm64/include/asm/kvm_asm.h +++ b/arch/arm64/include/asm/kvm_asm.h @@ -27,6 +27,9 @@ #define ARM_EXCEPTION_IRQ 0 #define ARM_EXCEPTION_EL1_SERROR 1 #define ARM_EXCEPTION_TRAP 2 +/* Error code for SError Interrupt (SEI) exception */ +#define KVM_SEI_SEV_RECOVERABLE 1 + /* The hyp-stub will return this for any kvm_call_hyp() call */ #define ARM_EXCEPTION_HYP_GONE HVC_STUB_ERR diff --git a/arch/arm64/include/asm/system_misc.h b/arch/arm64/include/asm/system_misc.h index 07aa8e3..9ee13ad 100644 --- a/arch/arm64/include/asm/system_misc.h +++ b/arch/arm64/include/asm/system_misc.h @@ -57,6 +57,7 @@ void hook_debug_fault_code(int nr, int (*fn)(unsigned long, unsigned int, }) int handle_guest_sea(phys_addr_t addr, unsigned int esr); +int handle_guest_sei(void); #endif /* __ASSEMBLY__ */ diff --git a/arch/arm64/kvm/handle_exit.c b/arch/arm64/kvm/handle_exit.c index 7debb74..1afdc87 100644 --- a/arch/arm64/kvm/handle_exit.c +++ b/arch/arm64/kvm/handle_exit.c @@ -28,6 +28,7 @@ #include <asm/kvm_emulate.h> #include <asm/kvm_mmu.h> #include <asm/kvm_psci.h> +#include <asm/system_misc.h> #define CREATE_TRACE_POINTS #include "trace.h" @@ -178,6 +179,66 @@ static exit_handle_fn kvm_get_exit_handler(struct kvm_vcpu *vcpu) return arm_exit_handlers[hsr_ec]; } +/** + * kvm_handle_guest_sei - handles SError interrupt or asynchronous aborts + * @vcpu: the VCPU pointer + * + * For RAS SError interrupt, firstly let host kernel handle it. + * If the AET is [ESR_ELx_AET_UER], then let user space handle it, + */ +static int kvm_handle_guest_sei(struct kvm_vcpu *vcpu, struct kvm_run *run) +{ + unsigned int esr = kvm_vcpu_get_hsr(vcpu); + bool impdef_syndrome = esr & ESR_ELx_ISV; /* aka IDS */ + unsigned int aet = esr & ESR_ELx_AET; + + /* + * This is not RAS SError + */ + if (!cpus_have_const_cap(ARM64_HAS_RAS_EXTN)) { + kvm_inject_vabt(vcpu); + return 1; + } + + /* The host kernel may handle this abort. */ + handle_guest_sei(); + + /* + * In below two conditions, it will directly inject the + * virtual SError: + * 1. The Syndrome is IMPLEMENTATION DEFINED + * 2. It is Uncategorized SEI + */ + if (impdef_syndrome || + ((esr & ESR_ELx_FSC) != ESR_ELx_FSC_SERROR)) { + kvm_inject_vabt(vcpu); + return 1; + } + + switch (aet) { + case ESR_ELx_AET_CE: /* corrected error */ + case ESR_ELx_AET_UEO: /* restartable error, not yet consumed */ + return 1; /* continue processing the guest exit */ + case ESR_ELx_AET_UER: /* The error has not been propagated */ + /* + * Userspace only handle the guest SError Interrupt(SEI) if the + * error has not been propagated + */ + run->exit_reason = KVM_EXIT_EXCEPTION; + run->ex.exception = ESR_ELx_EC_SERROR; + run->ex.error_code = KVM_SEI_SEV_RECOVERABLE; + return 0; + default: + /* + * Until now, the CPU supports RAS and SEI is fatal, or host + * does not support to handle the SError. + */ + panic("This Asynchronous SError interrupt is dangerous, panic"); + } + + return 0; +} + /* * Return > 0 to return to guest, < 0 on error, 0 (and set exit_reason) on * proper exit to userspace. @@ -201,8 +262,7 @@ int handle_exit(struct kvm_vcpu *vcpu, struct kvm_run *run, *vcpu_pc(vcpu) -= adj; } - kvm_inject_vabt(vcpu); - return 1; + return kvm_handle_guest_sei(vcpu, run); } exception_index = ARM_EXCEPTION_CODE(exception_index); @@ -211,8 +271,7 @@ int handle_exit(struct kvm_vcpu *vcpu, struct kvm_run *run, case ARM_EXCEPTION_IRQ: return 1; case ARM_EXCEPTION_EL1_SERROR: - kvm_inject_vabt(vcpu); - return 1; + return kvm_handle_guest_sei(vcpu, run); case ARM_EXCEPTION_TRAP: /* * See ARM ARM B1.14.1: "Hyp traps on instructions diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c index b64958b..8560672 100644 --- a/arch/arm64/mm/fault.c +++ b/arch/arm64/mm/fault.c @@ -728,6 +728,22 @@ int handle_guest_sea(phys_addr_t addr, unsigned int esr) } /* + * Handle SError interrupt that occurred in guest OS. + * + * The return value will be zero if the SEI was successfully handled + * and non-zero if handling is failed. + */ +int handle_guest_sei(void) +{ + int ret = -ENOENT; + + if (IS_ENABLED(CONFIG_ACPI_APEI_SEI)) + ret = ghes_notify_sei(); + + return ret; +} + +/* * Dispatch a data abort to the relevant handler. */ asmlinkage void __exception do_mem_abort(unsigned long addr, unsigned int esr,