Message ID | 20200720211359.GF502563@redhat.com (mailing list archive)
---|---
State | New, archived
Series | [v4] kvm,x86: Exit to user space in case page fault error
On Mon, Jul 20, 2020 at 05:13:59PM -0400, Vivek Goyal wrote: > Page fault error handling behavior in kvm seems little inconsistent when > page fault reports error. If we are doing fault synchronously > then we capture error (-EFAULT) returned by __gfn_to_pfn_memslot() and > exit to user space and qemu reports error, "error: kvm run failed Bad address". Hi Vitaly, A gentle reminder. How does this patch look now? Thanks Vivek > > But if we are doing async page fault, then async_pf_execute() will simply > ignore the error reported by get_user_pages_remote() or > by kvm_mmu_do_page_fault(). It is assumed that page fault was successful > and either a page ready event is injected in guest or guest is brought > out of artificial halt state and run again. In both the cases when guest > retries the instruction, it takes exit again as page fault was not > successful in previous attempt. And then this infinite loop continues > forever. > > Trying fault in a loop will make sense if error is temporary and will > be resolved on retry. But I don't see any intention in the code to > determine if error is temporary or not. Whether to do fault synchronously > or asynchronously, depends on so many variables but none of the varibales > is whether error is temporary or not. (kvm_can_do_async_pf()). > > And that makes it very inconsistent or unpredictable to figure out whether > kvm will exit to qemu with error or it will just retry and go into an > infinite loop. > > This patch tries to make this behavior consistent. That is instead of > getting into infinite loop of retrying page fault, exit to user space > and stop VM if page fault error happens. > > In future this can be improved by injecting errors into guest. As of > now we don't have any race free method to inject errors in guest. > > When page fault error happens in async path save that pfn and when next > time guest retries, do a sync fault instead of async fault. So that if error > is encountered, we exit to qemu and avoid infinite loop. > > We maintain a cache of error gfns and force sync fault if a gfn is > found in cache of error gfn. There is a small possibility that we > miss an error gfn (as it got overwritten by a new error gfn). But > its just a hint and sooner or later some error pfn will match > and we will force sync fault and exit to user space. > > Changes from v3: > - Added function kvm_find_and_remove_error_gfn() and removed > kvm_find_error_gfn() and kvm_del_error_gfn(). (Vitaly) > > - Added a macro GFN_INVALID (Vitaly). > > - Used gpa_to_gfn() to convert gpa to gfn (Vitaly) > > Change from v2: > - Fixed a warning by converting kvm_find_error_gfn() static. > > Change from v1: > - Maintain a cache of error gfns, instead of single gfn. 
(Vitaly) > > Signed-off-by: Vivek Goyal <vgoyal@redhat.com> > --- > arch/x86/include/asm/kvm_host.h | 2 ++ > arch/x86/kvm/mmu.h | 2 +- > arch/x86/kvm/mmu/mmu.c | 2 +- > arch/x86/kvm/x86.c | 54 +++++++++++++++++++++++++++++++-- > include/linux/kvm_types.h | 1 + > 5 files changed, 56 insertions(+), 5 deletions(-) > > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h > index be5363b21540..e6f8d3f1a377 100644 > --- a/arch/x86/include/asm/kvm_host.h > +++ b/arch/x86/include/asm/kvm_host.h > @@ -137,6 +137,7 @@ static inline gfn_t gfn_to_index(gfn_t gfn, gfn_t base_gfn, int level) > #define KVM_NR_VAR_MTRR 8 > > #define ASYNC_PF_PER_VCPU 64 > +#define ERROR_GFN_PER_VCPU 64 > > enum kvm_reg { > VCPU_REGS_RAX = __VCPU_REGS_RAX, > @@ -778,6 +779,7 @@ struct kvm_vcpu_arch { > unsigned long nested_apf_token; > bool delivery_as_pf_vmexit; > bool pageready_pending; > + gfn_t error_gfns[ERROR_GFN_PER_VCPU]; > } apf; > > /* OSVW MSRs (AMD only) */ > diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h > index 444bb9c54548..d0a2a12c7bb6 100644 > --- a/arch/x86/kvm/mmu.h > +++ b/arch/x86/kvm/mmu.h > @@ -60,7 +60,7 @@ void kvm_init_mmu(struct kvm_vcpu *vcpu, bool reset_roots); > void kvm_init_shadow_mmu(struct kvm_vcpu *vcpu, u32 cr0, u32 cr4, u32 efer); > void kvm_init_shadow_ept_mmu(struct kvm_vcpu *vcpu, bool execonly, > bool accessed_dirty, gpa_t new_eptp); > -bool kvm_can_do_async_pf(struct kvm_vcpu *vcpu); > +bool kvm_can_do_async_pf(struct kvm_vcpu *vcpu, gfn_t gfn); > int kvm_handle_page_fault(struct kvm_vcpu *vcpu, u64 error_code, > u64 fault_address, char *insn, int insn_len); > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c > index 6d6a0ae7800c..b51d4aa405e0 100644 > --- a/arch/x86/kvm/mmu/mmu.c > +++ b/arch/x86/kvm/mmu/mmu.c > @@ -4078,7 +4078,7 @@ static bool try_async_pf(struct kvm_vcpu *vcpu, bool prefault, gfn_t gfn, > if (!async) > return false; /* *pfn has correct page already */ > > - if (!prefault && kvm_can_do_async_pf(vcpu)) { > + if (!prefault && kvm_can_do_async_pf(vcpu, gpa_to_gfn(cr2_or_gpa))) { > trace_kvm_try_async_get_page(cr2_or_gpa, gfn); > if (kvm_find_async_pf_gfn(vcpu, gfn)) { > trace_kvm_async_pf_doublefault(cr2_or_gpa, gfn); > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c > index 88c593f83b28..c1f5094d6e53 100644 > --- a/arch/x86/kvm/x86.c > +++ b/arch/x86/kvm/x86.c > @@ -263,6 +263,13 @@ static inline void kvm_async_pf_hash_reset(struct kvm_vcpu *vcpu) > vcpu->arch.apf.gfns[i] = ~0; > } > > +static inline void kvm_error_gfn_hash_reset(struct kvm_vcpu *vcpu) > +{ > + int i; > + for (i = 0; i < ERROR_GFN_PER_VCPU; i++) > + vcpu->arch.apf.error_gfns[i] = GFN_INVALID; > +} > + > static void kvm_on_user_return(struct user_return_notifier *urn) > { > unsigned slot; > @@ -9484,6 +9491,7 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu) > vcpu->arch.pat = MSR_IA32_CR_PAT_DEFAULT; > > kvm_async_pf_hash_reset(vcpu); > + kvm_error_gfn_hash_reset(vcpu); > kvm_pmu_init(vcpu); > > vcpu->arch.pending_external_vector = -1; > @@ -9608,6 +9616,7 @@ void kvm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event) > > kvm_clear_async_pf_completion_queue(vcpu); > kvm_async_pf_hash_reset(vcpu); > + kvm_error_gfn_hash_reset(vcpu); > vcpu->arch.apf.halted = false; > > if (kvm_mpx_supported()) { > @@ -10369,6 +10378,36 @@ void kvm_set_rflags(struct kvm_vcpu *vcpu, unsigned long rflags) > } > EXPORT_SYMBOL_GPL(kvm_set_rflags); > > +static inline u32 kvm_error_gfn_hash_fn(gfn_t gfn) > +{ > + BUILD_BUG_ON(!is_power_of_2(ERROR_GFN_PER_VCPU)); > + > + 
return hash_32(gfn & 0xffffffff, order_base_2(ERROR_GFN_PER_VCPU)); > +} > + > +static void kvm_add_error_gfn(struct kvm_vcpu *vcpu, gfn_t gfn) > +{ > + u32 key = kvm_error_gfn_hash_fn(gfn); > + > + /* > + * Overwrite the previous gfn. This is just a hint to do > + * sync page fault. > + */ > + vcpu->arch.apf.error_gfns[key] = gfn; > +} > + > +/* Returns true if gfn was found in hash table, false otherwise */ > +static bool kvm_find_and_remove_error_gfn(struct kvm_vcpu *vcpu, gfn_t gfn) > +{ > + u32 key = kvm_error_gfn_hash_fn(gfn); > + > + if (vcpu->arch.apf.error_gfns[key] != gfn) > + return 0; > + > + vcpu->arch.apf.error_gfns[key] = GFN_INVALID; > + return true; > +} > + > void kvm_arch_async_page_ready(struct kvm_vcpu *vcpu, struct kvm_async_pf *work) > { > int r; > @@ -10385,7 +10424,9 @@ void kvm_arch_async_page_ready(struct kvm_vcpu *vcpu, struct kvm_async_pf *work) > work->arch.cr3 != vcpu->arch.mmu->get_guest_pgd(vcpu)) > return; > > - kvm_mmu_do_page_fault(vcpu, work->cr2_or_gpa, 0, true); > + r = kvm_mmu_do_page_fault(vcpu, work->cr2_or_gpa, 0, true); > + if (r < 0) > + kvm_add_error_gfn(vcpu, gpa_to_gfn(work->cr2_or_gpa)); > } > > static inline u32 kvm_async_pf_hash_fn(gfn_t gfn) > @@ -10495,7 +10536,7 @@ static bool kvm_can_deliver_async_pf(struct kvm_vcpu *vcpu) > return true; > } > > -bool kvm_can_do_async_pf(struct kvm_vcpu *vcpu) > +bool kvm_can_do_async_pf(struct kvm_vcpu *vcpu, gfn_t gfn) > { > if (unlikely(!lapic_in_kernel(vcpu) || > kvm_event_needs_reinjection(vcpu) || > @@ -10509,7 +10550,14 @@ bool kvm_can_do_async_pf(struct kvm_vcpu *vcpu) > * If interrupts are off we cannot even use an artificial > * halt state. > */ > - return kvm_arch_interrupt_allowed(vcpu); > + if (!kvm_arch_interrupt_allowed(vcpu)) > + return false; > + > + /* Found gfn in error gfn cache. Force sync fault */ > + if (kvm_find_and_remove_error_gfn(vcpu, gfn)) > + return false; > + > + return true; > } > > bool kvm_arch_async_page_not_present(struct kvm_vcpu *vcpu, > diff --git a/include/linux/kvm_types.h b/include/linux/kvm_types.h > index 68e84cf42a3f..677bb8269cd3 100644 > --- a/include/linux/kvm_types.h > +++ b/include/linux/kvm_types.h > @@ -36,6 +36,7 @@ typedef u64 gpa_t; > typedef u64 gfn_t; > > #define GPA_INVALID (~(gpa_t)0) > +#define GFN_INVALID (~(gfn_t)0) > > typedef unsigned long hva_t; > typedef u64 hpa_t; > -- > 2.25.4 >
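For context on the hashing scheme in the patch above: each error gfn lands in one of 64 slots chosen by hash_32(gfn, order_base_2(ERROR_GFN_PER_VCPU)), and a collision simply overwrites the older entry, which is why the commit message calls the cache "just a hint". Below is a stand-alone user-space sketch of that indexing (not kernel code; the multiplier is copied from include/linux/hash.h and the sample gfn values are invented):

/* Mimics the kernel's hash_32()/order_base_2() indexing of error gfns. */
#include <stdint.h>
#include <stdio.h>

#define ERROR_GFN_PER_VCPU 64          /* must stay a power of two */
#define GOLDEN_RATIO_32 0x61C88647u    /* same constant as include/linux/hash.h */

static uint32_t hash_32(uint32_t val, unsigned int bits)
{
	return (val * GOLDEN_RATIO_32) >> (32 - bits);
}

static uint32_t error_gfn_key(uint64_t gfn)
{
	/* order_base_2(64) == 6, so the key is a 6-bit slot index */
	return hash_32((uint32_t)gfn, 6);
}

int main(void)
{
	uint64_t gfns[] = { 0x1000, 0x1001, 0xfeed0, 0x7f000000001000ULL };

	for (unsigned int i = 0; i < sizeof(gfns) / sizeof(gfns[0]); i++)
		printf("gfn 0x%llx -> slot %u\n",
		       (unsigned long long)gfns[i], error_gfn_key(gfns[i]));
	return 0;
}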
Vivek Goyal <vgoyal@redhat.com> writes: > On Mon, Jul 20, 2020 at 05:13:59PM -0400, Vivek Goyal wrote: >> Page fault error handling behavior in kvm seems little inconsistent when >> page fault reports error. If we are doing fault synchronously >> then we capture error (-EFAULT) returned by __gfn_to_pfn_memslot() and >> exit to user space and qemu reports error, "error: kvm run failed Bad address". > > Hi Vitaly, > > A gentle reminder. How does this patch look now? > Sorry, I even reviewed it but never replied. It looks good to me! Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
On Mon, Jul 27, 2020 at 06:09:32PM +0200, Vitaly Kuznetsov wrote: > Vivek Goyal <vgoyal@redhat.com> writes: > > > On Mon, Jul 20, 2020 at 05:13:59PM -0400, Vivek Goyal wrote: > >> Page fault error handling behavior in kvm seems little inconsistent when > >> page fault reports error. If we are doing fault synchronously > >> then we capture error (-EFAULT) returned by __gfn_to_pfn_memslot() and > >> exit to user space and qemu reports error, "error: kvm run failed Bad address". > > > > Hi Vitaly, > > > > A gentle reminder. How does this patch look now? > > > > Sorry, I even reviewed it but never replied. It looks good to me! > > Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Thanks Vitaly. Paolo, what do you think about this patch? Do you have any concerns with it? Can it be merged? Thanks Vivek
> Page fault error handling behavior in kvm seems little inconsistent when > page fault reports error. If we are doing fault synchronously > then we capture error (-EFAULT) returned by __gfn_to_pfn_memslot() and > exit to user space and qemu reports error, "error: kvm run failed Bad address". > > But if we are doing async page fault, then async_pf_execute() will simply > ignore the error reported by get_user_pages_remote() or > by kvm_mmu_do_page_fault(). It is assumed that page fault was successful > and either a page ready event is injected in guest or guest is brought > out of artificial halt state and run again. In both the cases when guest > retries the instruction, it takes exit again as page fault was not > successful in previous attempt. And then this infinite loop continues > forever. > > Trying fault in a loop will make sense if error is temporary and will > be resolved on retry. But I don't see any intention in the code to > determine if error is temporary or not. Whether to do fault synchronously > or asynchronously, depends on so many variables but none of the varibales s/varibales/variables > is whether error is temporary or not. (kvm_can_do_async_pf()). > > And that makes it very inconsistent or unpredictable to figure out whether > kvm will exit to qemu with error or it will just retry and go into an > infinite loop. > > This patch tries to make this behavior consistent. That is instead of > getting into infinite loop of retrying page fault, exit to user space > and stop VM if page fault error happens. > > In future this can be improved by injecting errors into guest. As of > now we don't have any race free method to inject errors in guest. > > When page fault error happens in async path save that pfn and when next > time guest retries, do a sync fault instead of async fault. So that if error > is encountered, we exit to qemu and avoid infinite loop. > > We maintain a cache of error gfns and force sync fault if a gfn is > found in cache of error gfn. There is a small possibility that we > miss an error gfn (as it got overwritten by a new error gfn). But > its just a hint and sooner or later some error pfn will match > and we will force sync fault and exit to user space. > > Changes from v3: > - Added function kvm_find_and_remove_error_gfn() and removed > kvm_find_error_gfn() and kvm_del_error_gfn(). (Vitaly) > > - Added a macro GFN_INVALID (Vitaly). > > - Used gpa_to_gfn() to convert gpa to gfn (Vitaly) > > Change from v2: > - Fixed a warning by converting kvm_find_error_gfn() static. > > Change from v1: > - Maintain a cache of error gfns, instead of single gfn. 
(Vitaly) > > Signed-off-by: Vivek Goyal <vgoyal@redhat.com> > --- > arch/x86/include/asm/kvm_host.h | 2 ++ > arch/x86/kvm/mmu.h | 2 +- > arch/x86/kvm/mmu/mmu.c | 2 +- > arch/x86/kvm/x86.c | 54 +++++++++++++++++++++++++++++++-- > include/linux/kvm_types.h | 1 + > 5 files changed, 56 insertions(+), 5 deletions(-) > > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h > index be5363b21540..e6f8d3f1a377 100644 > --- a/arch/x86/include/asm/kvm_host.h > +++ b/arch/x86/include/asm/kvm_host.h > @@ -137,6 +137,7 @@ static inline gfn_t gfn_to_index(gfn_t gfn, gfn_t base_gfn, int level) > #define KVM_NR_VAR_MTRR 8 > > #define ASYNC_PF_PER_VCPU 64 > +#define ERROR_GFN_PER_VCPU 64 > > enum kvm_reg { > VCPU_REGS_RAX = __VCPU_REGS_RAX, > @@ -778,6 +779,7 @@ struct kvm_vcpu_arch { > unsigned long nested_apf_token; > bool delivery_as_pf_vmexit; > bool pageready_pending; > + gfn_t error_gfns[ERROR_GFN_PER_VCPU]; > } apf; > > /* OSVW MSRs (AMD only) */ > diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h > index 444bb9c54548..d0a2a12c7bb6 100644 > --- a/arch/x86/kvm/mmu.h > +++ b/arch/x86/kvm/mmu.h > @@ -60,7 +60,7 @@ void kvm_init_mmu(struct kvm_vcpu *vcpu, bool reset_roots); > void kvm_init_shadow_mmu(struct kvm_vcpu *vcpu, u32 cr0, u32 cr4, u32 efer); > void kvm_init_shadow_ept_mmu(struct kvm_vcpu *vcpu, bool execonly, > bool accessed_dirty, gpa_t new_eptp); > -bool kvm_can_do_async_pf(struct kvm_vcpu *vcpu); > +bool kvm_can_do_async_pf(struct kvm_vcpu *vcpu, gfn_t gfn); > int kvm_handle_page_fault(struct kvm_vcpu *vcpu, u64 error_code, > u64 fault_address, char *insn, int insn_len); > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c > index 6d6a0ae7800c..b51d4aa405e0 100644 > --- a/arch/x86/kvm/mmu/mmu.c > +++ b/arch/x86/kvm/mmu/mmu.c > @@ -4078,7 +4078,7 @@ static bool try_async_pf(struct kvm_vcpu *vcpu, bool prefault, gfn_t gfn, > if (!async) > return false; /* *pfn has correct page already */ > > - if (!prefault && kvm_can_do_async_pf(vcpu)) { > + if (!prefault && kvm_can_do_async_pf(vcpu, gpa_to_gfn(cr2_or_gpa))) { > trace_kvm_try_async_get_page(cr2_or_gpa, gfn); > if (kvm_find_async_pf_gfn(vcpu, gfn)) { > trace_kvm_async_pf_doublefault(cr2_or_gpa, gfn); > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c > index 88c593f83b28..c1f5094d6e53 100644 > --- a/arch/x86/kvm/x86.c > +++ b/arch/x86/kvm/x86.c > @@ -263,6 +263,13 @@ static inline void kvm_async_pf_hash_reset(struct kvm_vcpu *vcpu) > vcpu->arch.apf.gfns[i] = ~0; > } > > +static inline void kvm_error_gfn_hash_reset(struct kvm_vcpu *vcpu) > +{ > + int i; > + for (i = 0; i < ERROR_GFN_PER_VCPU; i++) > + vcpu->arch.apf.error_gfns[i] = GFN_INVALID; > +} > + > static void kvm_on_user_return(struct user_return_notifier *urn) > { > unsigned slot; > @@ -9484,6 +9491,7 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu) > vcpu->arch.pat = MSR_IA32_CR_PAT_DEFAULT; > > kvm_async_pf_hash_reset(vcpu); > + kvm_error_gfn_hash_reset(vcpu); > kvm_pmu_init(vcpu); > > vcpu->arch.pending_external_vector = -1; > @@ -9608,6 +9616,7 @@ void kvm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event) > > kvm_clear_async_pf_completion_queue(vcpu); > kvm_async_pf_hash_reset(vcpu); > + kvm_error_gfn_hash_reset(vcpu); > vcpu->arch.apf.halted = false; > > if (kvm_mpx_supported()) { > @@ -10369,6 +10378,36 @@ void kvm_set_rflags(struct kvm_vcpu *vcpu, unsigned long rflags) > } > EXPORT_SYMBOL_GPL(kvm_set_rflags); > > +static inline u32 kvm_error_gfn_hash_fn(gfn_t gfn) > +{ > + BUILD_BUG_ON(!is_power_of_2(ERROR_GFN_PER_VCPU)); > + > + 
return hash_32(gfn & 0xffffffff, order_base_2(ERROR_GFN_PER_VCPU)); > +} > + > +static void kvm_add_error_gfn(struct kvm_vcpu *vcpu, gfn_t gfn) > +{ > + u32 key = kvm_error_gfn_hash_fn(gfn); > + > + /* > + * Overwrite the previous gfn. This is just a hint to do > + * sync page fault. > + */ > + vcpu->arch.apf.error_gfns[key] = gfn; > +} > + > +/* Returns true if gfn was found in hash table, false otherwise */ > +static bool kvm_find_and_remove_error_gfn(struct kvm_vcpu *vcpu, gfn_t gfn) > +{ > + u32 key = kvm_error_gfn_hash_fn(gfn); > + > + if (vcpu->arch.apf.error_gfns[key] != gfn) > + return 0; > + > + vcpu->arch.apf.error_gfns[key] = GFN_INVALID; > + return true; > +} > + > void kvm_arch_async_page_ready(struct kvm_vcpu *vcpu, struct kvm_async_pf *work) > { > int r; > @@ -10385,7 +10424,9 @@ void kvm_arch_async_page_ready(struct kvm_vcpu *vcpu, struct kvm_async_pf *work) > work->arch.cr3 != vcpu->arch.mmu->get_guest_pgd(vcpu)) > return; > > - kvm_mmu_do_page_fault(vcpu, work->cr2_or_gpa, 0, true); > + r = kvm_mmu_do_page_fault(vcpu, work->cr2_or_gpa, 0, true); > + if (r < 0) > + kvm_add_error_gfn(vcpu, gpa_to_gfn(work->cr2_or_gpa)); > } > > static inline u32 kvm_async_pf_hash_fn(gfn_t gfn) > @@ -10495,7 +10536,7 @@ static bool kvm_can_deliver_async_pf(struct kvm_vcpu *vcpu) > return true; > } > > -bool kvm_can_do_async_pf(struct kvm_vcpu *vcpu) > +bool kvm_can_do_async_pf(struct kvm_vcpu *vcpu, gfn_t gfn) > { > if (unlikely(!lapic_in_kernel(vcpu) || > kvm_event_needs_reinjection(vcpu) || > @@ -10509,7 +10550,14 @@ bool kvm_can_do_async_pf(struct kvm_vcpu *vcpu) > * If interrupts are off we cannot even use an artificial > * halt state. > */ > - return kvm_arch_interrupt_allowed(vcpu); > + if (!kvm_arch_interrupt_allowed(vcpu)) > + return false; > + > + /* Found gfn in error gfn cache. Force sync fault */ > + if (kvm_find_and_remove_error_gfn(vcpu, gfn)) > + return false; > + > + return true; > } > > bool kvm_arch_async_page_not_present(struct kvm_vcpu *vcpu, > diff --git a/include/linux/kvm_types.h b/include/linux/kvm_types.h > index 68e84cf42a3f..677bb8269cd3 100644 > --- a/include/linux/kvm_types.h > +++ b/include/linux/kvm_types.h > @@ -36,6 +36,7 @@ typedef u64 gpa_t; > typedef u64 gfn_t; > > #define GPA_INVALID (~(gpa_t)0) > +#define GFN_INVALID (~(gfn_t)0) > > typedef unsigned long hva_t; > typedef u64 hpa_t; > -- > 2.25.4 This patch looks good to me. Reviewed-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com> >
On Mon, Jul 20, 2020 at 05:13:59PM -0400, Vivek Goyal wrote: > Page fault error handling behavior in kvm seems little inconsistent when > page fault reports error. If we are doing fault synchronously > then we capture error (-EFAULT) returned by __gfn_to_pfn_memslot() and > exit to user space and qemu reports error, "error: kvm run failed Bad address". Hi Paolo, Ping. I am wondering what do you think about this patch. Can it be merged? Thanks Vivek > > But if we are doing async page fault, then async_pf_execute() will simply > ignore the error reported by get_user_pages_remote() or > by kvm_mmu_do_page_fault(). It is assumed that page fault was successful > and either a page ready event is injected in guest or guest is brought > out of artificial halt state and run again. In both the cases when guest > retries the instruction, it takes exit again as page fault was not > successful in previous attempt. And then this infinite loop continues > forever. > > Trying fault in a loop will make sense if error is temporary and will > be resolved on retry. But I don't see any intention in the code to > determine if error is temporary or not. Whether to do fault synchronously > or asynchronously, depends on so many variables but none of the varibales > is whether error is temporary or not. (kvm_can_do_async_pf()). > > And that makes it very inconsistent or unpredictable to figure out whether > kvm will exit to qemu with error or it will just retry and go into an > infinite loop. > > This patch tries to make this behavior consistent. That is instead of > getting into infinite loop of retrying page fault, exit to user space > and stop VM if page fault error happens. > > In future this can be improved by injecting errors into guest. As of > now we don't have any race free method to inject errors in guest. > > When page fault error happens in async path save that pfn and when next > time guest retries, do a sync fault instead of async fault. So that if error > is encountered, we exit to qemu and avoid infinite loop. > > We maintain a cache of error gfns and force sync fault if a gfn is > found in cache of error gfn. There is a small possibility that we > miss an error gfn (as it got overwritten by a new error gfn). But > its just a hint and sooner or later some error pfn will match > and we will force sync fault and exit to user space. > > Changes from v3: > - Added function kvm_find_and_remove_error_gfn() and removed > kvm_find_error_gfn() and kvm_del_error_gfn(). (Vitaly) > > - Added a macro GFN_INVALID (Vitaly). > > - Used gpa_to_gfn() to convert gpa to gfn (Vitaly) > > Change from v2: > - Fixed a warning by converting kvm_find_error_gfn() static. > > Change from v1: > - Maintain a cache of error gfns, instead of single gfn. 
(Vitaly) > > Signed-off-by: Vivek Goyal <vgoyal@redhat.com> > --- > arch/x86/include/asm/kvm_host.h | 2 ++ > arch/x86/kvm/mmu.h | 2 +- > arch/x86/kvm/mmu/mmu.c | 2 +- > arch/x86/kvm/x86.c | 54 +++++++++++++++++++++++++++++++-- > include/linux/kvm_types.h | 1 + > 5 files changed, 56 insertions(+), 5 deletions(-) > > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h > index be5363b21540..e6f8d3f1a377 100644 > --- a/arch/x86/include/asm/kvm_host.h > +++ b/arch/x86/include/asm/kvm_host.h > @@ -137,6 +137,7 @@ static inline gfn_t gfn_to_index(gfn_t gfn, gfn_t base_gfn, int level) > #define KVM_NR_VAR_MTRR 8 > > #define ASYNC_PF_PER_VCPU 64 > +#define ERROR_GFN_PER_VCPU 64 > > enum kvm_reg { > VCPU_REGS_RAX = __VCPU_REGS_RAX, > @@ -778,6 +779,7 @@ struct kvm_vcpu_arch { > unsigned long nested_apf_token; > bool delivery_as_pf_vmexit; > bool pageready_pending; > + gfn_t error_gfns[ERROR_GFN_PER_VCPU]; > } apf; > > /* OSVW MSRs (AMD only) */ > diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h > index 444bb9c54548..d0a2a12c7bb6 100644 > --- a/arch/x86/kvm/mmu.h > +++ b/arch/x86/kvm/mmu.h > @@ -60,7 +60,7 @@ void kvm_init_mmu(struct kvm_vcpu *vcpu, bool reset_roots); > void kvm_init_shadow_mmu(struct kvm_vcpu *vcpu, u32 cr0, u32 cr4, u32 efer); > void kvm_init_shadow_ept_mmu(struct kvm_vcpu *vcpu, bool execonly, > bool accessed_dirty, gpa_t new_eptp); > -bool kvm_can_do_async_pf(struct kvm_vcpu *vcpu); > +bool kvm_can_do_async_pf(struct kvm_vcpu *vcpu, gfn_t gfn); > int kvm_handle_page_fault(struct kvm_vcpu *vcpu, u64 error_code, > u64 fault_address, char *insn, int insn_len); > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c > index 6d6a0ae7800c..b51d4aa405e0 100644 > --- a/arch/x86/kvm/mmu/mmu.c > +++ b/arch/x86/kvm/mmu/mmu.c > @@ -4078,7 +4078,7 @@ static bool try_async_pf(struct kvm_vcpu *vcpu, bool prefault, gfn_t gfn, > if (!async) > return false; /* *pfn has correct page already */ > > - if (!prefault && kvm_can_do_async_pf(vcpu)) { > + if (!prefault && kvm_can_do_async_pf(vcpu, gpa_to_gfn(cr2_or_gpa))) { > trace_kvm_try_async_get_page(cr2_or_gpa, gfn); > if (kvm_find_async_pf_gfn(vcpu, gfn)) { > trace_kvm_async_pf_doublefault(cr2_or_gpa, gfn); > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c > index 88c593f83b28..c1f5094d6e53 100644 > --- a/arch/x86/kvm/x86.c > +++ b/arch/x86/kvm/x86.c > @@ -263,6 +263,13 @@ static inline void kvm_async_pf_hash_reset(struct kvm_vcpu *vcpu) > vcpu->arch.apf.gfns[i] = ~0; > } > > +static inline void kvm_error_gfn_hash_reset(struct kvm_vcpu *vcpu) > +{ > + int i; > + for (i = 0; i < ERROR_GFN_PER_VCPU; i++) > + vcpu->arch.apf.error_gfns[i] = GFN_INVALID; > +} > + > static void kvm_on_user_return(struct user_return_notifier *urn) > { > unsigned slot; > @@ -9484,6 +9491,7 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu) > vcpu->arch.pat = MSR_IA32_CR_PAT_DEFAULT; > > kvm_async_pf_hash_reset(vcpu); > + kvm_error_gfn_hash_reset(vcpu); > kvm_pmu_init(vcpu); > > vcpu->arch.pending_external_vector = -1; > @@ -9608,6 +9616,7 @@ void kvm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event) > > kvm_clear_async_pf_completion_queue(vcpu); > kvm_async_pf_hash_reset(vcpu); > + kvm_error_gfn_hash_reset(vcpu); > vcpu->arch.apf.halted = false; > > if (kvm_mpx_supported()) { > @@ -10369,6 +10378,36 @@ void kvm_set_rflags(struct kvm_vcpu *vcpu, unsigned long rflags) > } > EXPORT_SYMBOL_GPL(kvm_set_rflags); > > +static inline u32 kvm_error_gfn_hash_fn(gfn_t gfn) > +{ > + BUILD_BUG_ON(!is_power_of_2(ERROR_GFN_PER_VCPU)); > + > + 
return hash_32(gfn & 0xffffffff, order_base_2(ERROR_GFN_PER_VCPU)); > +} > + > +static void kvm_add_error_gfn(struct kvm_vcpu *vcpu, gfn_t gfn) > +{ > + u32 key = kvm_error_gfn_hash_fn(gfn); > + > + /* > + * Overwrite the previous gfn. This is just a hint to do > + * sync page fault. > + */ > + vcpu->arch.apf.error_gfns[key] = gfn; > +} > + > +/* Returns true if gfn was found in hash table, false otherwise */ > +static bool kvm_find_and_remove_error_gfn(struct kvm_vcpu *vcpu, gfn_t gfn) > +{ > + u32 key = kvm_error_gfn_hash_fn(gfn); > + > + if (vcpu->arch.apf.error_gfns[key] != gfn) > + return 0; > + > + vcpu->arch.apf.error_gfns[key] = GFN_INVALID; > + return true; > +} > + > void kvm_arch_async_page_ready(struct kvm_vcpu *vcpu, struct kvm_async_pf *work) > { > int r; > @@ -10385,7 +10424,9 @@ void kvm_arch_async_page_ready(struct kvm_vcpu *vcpu, struct kvm_async_pf *work) > work->arch.cr3 != vcpu->arch.mmu->get_guest_pgd(vcpu)) > return; > > - kvm_mmu_do_page_fault(vcpu, work->cr2_or_gpa, 0, true); > + r = kvm_mmu_do_page_fault(vcpu, work->cr2_or_gpa, 0, true); > + if (r < 0) > + kvm_add_error_gfn(vcpu, gpa_to_gfn(work->cr2_or_gpa)); > } > > static inline u32 kvm_async_pf_hash_fn(gfn_t gfn) > @@ -10495,7 +10536,7 @@ static bool kvm_can_deliver_async_pf(struct kvm_vcpu *vcpu) > return true; > } > > -bool kvm_can_do_async_pf(struct kvm_vcpu *vcpu) > +bool kvm_can_do_async_pf(struct kvm_vcpu *vcpu, gfn_t gfn) > { > if (unlikely(!lapic_in_kernel(vcpu) || > kvm_event_needs_reinjection(vcpu) || > @@ -10509,7 +10550,14 @@ bool kvm_can_do_async_pf(struct kvm_vcpu *vcpu) > * If interrupts are off we cannot even use an artificial > * halt state. > */ > - return kvm_arch_interrupt_allowed(vcpu); > + if (!kvm_arch_interrupt_allowed(vcpu)) > + return false; > + > + /* Found gfn in error gfn cache. Force sync fault */ > + if (kvm_find_and_remove_error_gfn(vcpu, gfn)) > + return false; > + > + return true; > } > > bool kvm_arch_async_page_not_present(struct kvm_vcpu *vcpu, > diff --git a/include/linux/kvm_types.h b/include/linux/kvm_types.h > index 68e84cf42a3f..677bb8269cd3 100644 > --- a/include/linux/kvm_types.h > +++ b/include/linux/kvm_types.h > @@ -36,6 +36,7 @@ typedef u64 gpa_t; > typedef u64 gfn_t; > > #define GPA_INVALID (~(gpa_t)0) > +#define GFN_INVALID (~(gfn_t)0) > > typedef unsigned long hva_t; > typedef u64 hpa_t; > -- > 2.25.4 >
On Mon, Jul 20, 2020 at 05:13:59PM -0400, Vivek Goyal wrote: > --- > arch/x86/include/asm/kvm_host.h | 2 ++ > arch/x86/kvm/mmu.h | 2 +- > arch/x86/kvm/mmu/mmu.c | 2 +- > arch/x86/kvm/x86.c | 54 +++++++++++++++++++++++++++++++-- > include/linux/kvm_types.h | 1 + > 5 files changed, 56 insertions(+), 5 deletions(-) > > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h > index be5363b21540..e6f8d3f1a377 100644 > --- a/arch/x86/include/asm/kvm_host.h > +++ b/arch/x86/include/asm/kvm_host.h > @@ -137,6 +137,7 @@ static inline gfn_t gfn_to_index(gfn_t gfn, gfn_t base_gfn, int level) > #define KVM_NR_VAR_MTRR 8 > > #define ASYNC_PF_PER_VCPU 64 > +#define ERROR_GFN_PER_VCPU 64 Aren't these two related? I.e. wouldn't it make sense to do: #define ERROR_GFN_PER_VCPU ASYNC_PF_PER_VCPU Or maybe even size error_gfns[] to ASYNC_PF_PER_VCPU? > > enum kvm_reg { > VCPU_REGS_RAX = __VCPU_REGS_RAX, > @@ -778,6 +779,7 @@ struct kvm_vcpu_arch { > unsigned long nested_apf_token; > bool delivery_as_pf_vmexit; > bool pageready_pending; > + gfn_t error_gfns[ERROR_GFN_PER_VCPU]; > } apf; > > /* OSVW MSRs (AMD only) */ > diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h > index 444bb9c54548..d0a2a12c7bb6 100644 > --- a/arch/x86/kvm/mmu.h > +++ b/arch/x86/kvm/mmu.h > @@ -60,7 +60,7 @@ void kvm_init_mmu(struct kvm_vcpu *vcpu, bool reset_roots); > void kvm_init_shadow_mmu(struct kvm_vcpu *vcpu, u32 cr0, u32 cr4, u32 efer); > void kvm_init_shadow_ept_mmu(struct kvm_vcpu *vcpu, bool execonly, > bool accessed_dirty, gpa_t new_eptp); > -bool kvm_can_do_async_pf(struct kvm_vcpu *vcpu); > +bool kvm_can_do_async_pf(struct kvm_vcpu *vcpu, gfn_t gfn); > int kvm_handle_page_fault(struct kvm_vcpu *vcpu, u64 error_code, > u64 fault_address, char *insn, int insn_len); > ... > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c > index 88c593f83b28..c1f5094d6e53 100644 > --- a/arch/x86/kvm/x86.c > +++ b/arch/x86/kvm/x86.c > @@ -263,6 +263,13 @@ static inline void kvm_async_pf_hash_reset(struct kvm_vcpu *vcpu) > vcpu->arch.apf.gfns[i] = ~0; > } > > +static inline void kvm_error_gfn_hash_reset(struct kvm_vcpu *vcpu) > +{ > + int i; Need a newline. > + for (i = 0; i < ERROR_GFN_PER_VCPU; i++) > + vcpu->arch.apf.error_gfns[i] = GFN_INVALID; > +} > + > static void kvm_on_user_return(struct user_return_notifier *urn) > { > unsigned slot; > @@ -9484,6 +9491,7 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu) > vcpu->arch.pat = MSR_IA32_CR_PAT_DEFAULT; > > kvm_async_pf_hash_reset(vcpu); > + kvm_error_gfn_hash_reset(vcpu); > kvm_pmu_init(vcpu); > > vcpu->arch.pending_external_vector = -1; > @@ -9608,6 +9616,7 @@ void kvm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event) > > kvm_clear_async_pf_completion_queue(vcpu); > kvm_async_pf_hash_reset(vcpu); > + kvm_error_gfn_hash_reset(vcpu); > vcpu->arch.apf.halted = false; > > if (kvm_mpx_supported()) { > @@ -10369,6 +10378,36 @@ void kvm_set_rflags(struct kvm_vcpu *vcpu, unsigned long rflags) > } > EXPORT_SYMBOL_GPL(kvm_set_rflags); > > +static inline u32 kvm_error_gfn_hash_fn(gfn_t gfn) > +{ > + BUILD_BUG_ON(!is_power_of_2(ERROR_GFN_PER_VCPU)); > + > + return hash_32(gfn & 0xffffffff, order_base_2(ERROR_GFN_PER_VCPU)); > +} > + > +static void kvm_add_error_gfn(struct kvm_vcpu *vcpu, gfn_t gfn) > +{ > + u32 key = kvm_error_gfn_hash_fn(gfn); > + > + /* > + * Overwrite the previous gfn. This is just a hint to do > + * sync page fault. 
> + */ > + vcpu->arch.apf.error_gfns[key] = gfn; > +} > + > +/* Returns true if gfn was found in hash table, false otherwise */ > +static bool kvm_find_and_remove_error_gfn(struct kvm_vcpu *vcpu, gfn_t gfn) > +{ > + u32 key = kvm_error_gfn_hash_fn(gfn); Mostly out of curiosity, do we really need a hash? E.g. could we get away with an array of 4 values? 2 values? Just wondering if we can avoid 64*8 bytes per CPU. One thought would be to use the index to handle the case of no error gfns so that the size of the array doesn't affect lookup for the common case, e.g. ... u8 error_gfn_head; gfn_t error_gfns[ERROR_GFN_PER_VCPU]; } apf; if (vcpu->arch.apf.error_gfn_head == 0xff) return false; for (i = 0; i < vcpu->arch.apf.error_gfn_head; i++) { if (vcpu->arch.apf.error_gfns[i] == gfn) { vcpu->arch.apf.error_gfns[i] = INVALID_GFN; return true; } } return true; Or you could even avoid INVALID_GFN altogether by compacting the array on removal. The VM is probably dead anyways, burning a few cycles to clean things up is a non-issue. Ditto for slow insertion. > + > + if (vcpu->arch.apf.error_gfns[key] != gfn) > + return 0; Should be "return false". > + > + vcpu->arch.apf.error_gfns[key] = GFN_INVALID; > + return true; > +} > + > void kvm_arch_async_page_ready(struct kvm_vcpu *vcpu, struct kvm_async_pf *work) > { > int r; > @@ -10385,7 +10424,9 @@ void kvm_arch_async_page_ready(struct kvm_vcpu *vcpu, struct kvm_async_pf *work) > work->arch.cr3 != vcpu->arch.mmu->get_guest_pgd(vcpu)) > return; > > - kvm_mmu_do_page_fault(vcpu, work->cr2_or_gpa, 0, true); > + r = kvm_mmu_do_page_fault(vcpu, work->cr2_or_gpa, 0, true); > + if (r < 0) > + kvm_add_error_gfn(vcpu, gpa_to_gfn(work->cr2_or_gpa)); > } > > static inline u32 kvm_async_pf_hash_fn(gfn_t gfn) > @@ -10495,7 +10536,7 @@ static bool kvm_can_deliver_async_pf(struct kvm_vcpu *vcpu) > return true; > } > > -bool kvm_can_do_async_pf(struct kvm_vcpu *vcpu) > +bool kvm_can_do_async_pf(struct kvm_vcpu *vcpu, gfn_t gfn) > { > if (unlikely(!lapic_in_kernel(vcpu) || > kvm_event_needs_reinjection(vcpu) || > @@ -10509,7 +10550,14 @@ bool kvm_can_do_async_pf(struct kvm_vcpu *vcpu) > * If interrupts are off we cannot even use an artificial > * halt state. > */ > - return kvm_arch_interrupt_allowed(vcpu); > + if (!kvm_arch_interrupt_allowed(vcpu)) > + return false; > + > + /* Found gfn in error gfn cache. Force sync fault */ > + if (kvm_find_and_remove_error_gfn(vcpu, gfn)) > + return false; > + > + return true; > } > > bool kvm_arch_async_page_not_present(struct kvm_vcpu *vcpu, > diff --git a/include/linux/kvm_types.h b/include/linux/kvm_types.h > index 68e84cf42a3f..677bb8269cd3 100644 > --- a/include/linux/kvm_types.h > +++ b/include/linux/kvm_types.h > @@ -36,6 +36,7 @@ typedef u64 gpa_t; > typedef u64 gfn_t; > > #define GPA_INVALID (~(gpa_t)0) > +#define GFN_INVALID (~(gfn_t)0) > > typedef unsigned long hva_t; > typedef u64 hpa_t; > -- > 2.25.4 >
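To make the suggested alternative concrete, here is a compilable user-space sketch of the array-plus-count variant outlined above. The struct, field names and the deliberately tiny array size are hypothetical, and unlike the final "return true;" in the pseudocode above (which reads like a typo), the miss path below returns false:

#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define ERROR_GFN_PER_VCPU 4            /* small on purpose */

typedef uint64_t gfn_t;

struct apf_error_cache {
	uint8_t count;                  /* number of valid entries */
	gfn_t gfns[ERROR_GFN_PER_VCPU];
};

static void add_error_gfn(struct apf_error_cache *c, gfn_t gfn)
{
	/* When full, overwrite the newest entry; the cache is only a hint. */
	if (c->count == ERROR_GFN_PER_VCPU)
		c->count--;
	c->gfns[c->count++] = gfn;
}

/* Returns true (and drops the entry) if gfn was recorded as failed. */
static bool find_and_remove_error_gfn(struct apf_error_cache *c, gfn_t gfn)
{
	for (uint8_t i = 0; i < c->count; i++) {
		if (c->gfns[i] != gfn)
			continue;
		/* Compact the array instead of leaving an INVALID_GFN hole. */
		memmove(&c->gfns[i], &c->gfns[i + 1],
			(c->count - i - 1) * sizeof(gfn_t));
		c->count--;
		return true;
	}
	return false;
}

int main(void)
{
	struct apf_error_cache c = { 0 };

	add_error_gfn(&c, 0x1000);
	add_error_gfn(&c, 0x2000);
	assert(find_and_remove_error_gfn(&c, 0x1000));
	assert(!find_and_remove_error_gfn(&c, 0x1000));   /* already removed */
	return 0;
}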
On Mon, Sep 28, 2020 at 09:37:00PM -0700, Sean Christopherson wrote: > On Mon, Jul 20, 2020 at 05:13:59PM -0400, Vivek Goyal wrote: > > --- > > arch/x86/include/asm/kvm_host.h | 2 ++ > > arch/x86/kvm/mmu.h | 2 +- > > arch/x86/kvm/mmu/mmu.c | 2 +- > > arch/x86/kvm/x86.c | 54 +++++++++++++++++++++++++++++++-- > > include/linux/kvm_types.h | 1 + > > 5 files changed, 56 insertions(+), 5 deletions(-) > > > > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h > > index be5363b21540..e6f8d3f1a377 100644 > > --- a/arch/x86/include/asm/kvm_host.h > > +++ b/arch/x86/include/asm/kvm_host.h > > @@ -137,6 +137,7 @@ static inline gfn_t gfn_to_index(gfn_t gfn, gfn_t base_gfn, int level) > > #define KVM_NR_VAR_MTRR 8 > > > > #define ASYNC_PF_PER_VCPU 64 > > +#define ERROR_GFN_PER_VCPU 64 > > Aren't these two related? I.e. wouldn't it make sense to do: Hi, They are related somewhat but they don't have to be same. I think we can accumulate many more error GFNs if this vcpu does not schedule the same task again to retry. > #define ERROR_GFN_PER_VCPU ASYNC_PF_PER_VCPU > > Or maybe even size error_gfns[] to ASYNC_PF_PER_VCPU? Given these two don't have to be same, I kept it separate. And kept the hash size same for now. If one wants, hash can be bigger or smaller down the line. > > > > > enum kvm_reg { > > VCPU_REGS_RAX = __VCPU_REGS_RAX, > > @@ -778,6 +779,7 @@ struct kvm_vcpu_arch { > > unsigned long nested_apf_token; > > bool delivery_as_pf_vmexit; > > bool pageready_pending; > > + gfn_t error_gfns[ERROR_GFN_PER_VCPU]; > > } apf; > > > > /* OSVW MSRs (AMD only) */ > > diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h > > index 444bb9c54548..d0a2a12c7bb6 100644 > > --- a/arch/x86/kvm/mmu.h > > +++ b/arch/x86/kvm/mmu.h > > @@ -60,7 +60,7 @@ void kvm_init_mmu(struct kvm_vcpu *vcpu, bool reset_roots); > > void kvm_init_shadow_mmu(struct kvm_vcpu *vcpu, u32 cr0, u32 cr4, u32 efer); > > void kvm_init_shadow_ept_mmu(struct kvm_vcpu *vcpu, bool execonly, > > bool accessed_dirty, gpa_t new_eptp); > > -bool kvm_can_do_async_pf(struct kvm_vcpu *vcpu); > > +bool kvm_can_do_async_pf(struct kvm_vcpu *vcpu, gfn_t gfn); > > int kvm_handle_page_fault(struct kvm_vcpu *vcpu, u64 error_code, > > u64 fault_address, char *insn, int insn_len); > > > > ... > > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c > > index 88c593f83b28..c1f5094d6e53 100644 > > --- a/arch/x86/kvm/x86.c > > +++ b/arch/x86/kvm/x86.c > > @@ -263,6 +263,13 @@ static inline void kvm_async_pf_hash_reset(struct kvm_vcpu *vcpu) > > vcpu->arch.apf.gfns[i] = ~0; > > } > > > > +static inline void kvm_error_gfn_hash_reset(struct kvm_vcpu *vcpu) > > +{ > > + int i; > > Need a newline. Will do. 
> > > + for (i = 0; i < ERROR_GFN_PER_VCPU; i++) > > + vcpu->arch.apf.error_gfns[i] = GFN_INVALID; > > +} > > + > > static void kvm_on_user_return(struct user_return_notifier *urn) > > { > > unsigned slot; > > @@ -9484,6 +9491,7 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu) > > vcpu->arch.pat = MSR_IA32_CR_PAT_DEFAULT; > > > > kvm_async_pf_hash_reset(vcpu); > > + kvm_error_gfn_hash_reset(vcpu); > > kvm_pmu_init(vcpu); > > > > vcpu->arch.pending_external_vector = -1; > > @@ -9608,6 +9616,7 @@ void kvm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event) > > > > kvm_clear_async_pf_completion_queue(vcpu); > > kvm_async_pf_hash_reset(vcpu); > > + kvm_error_gfn_hash_reset(vcpu); > > vcpu->arch.apf.halted = false; > > > > if (kvm_mpx_supported()) { > > @@ -10369,6 +10378,36 @@ void kvm_set_rflags(struct kvm_vcpu *vcpu, unsigned long rflags) > > } > > EXPORT_SYMBOL_GPL(kvm_set_rflags); > > > > +static inline u32 kvm_error_gfn_hash_fn(gfn_t gfn) > > +{ > > + BUILD_BUG_ON(!is_power_of_2(ERROR_GFN_PER_VCPU)); > > + > > + return hash_32(gfn & 0xffffffff, order_base_2(ERROR_GFN_PER_VCPU)); > > +} > > + > > +static void kvm_add_error_gfn(struct kvm_vcpu *vcpu, gfn_t gfn) > > +{ > > + u32 key = kvm_error_gfn_hash_fn(gfn); > > + > > + /* > > + * Overwrite the previous gfn. This is just a hint to do > > + * sync page fault. > > + */ > > + vcpu->arch.apf.error_gfns[key] = gfn; > > +} > > + > > +/* Returns true if gfn was found in hash table, false otherwise */ > > +static bool kvm_find_and_remove_error_gfn(struct kvm_vcpu *vcpu, gfn_t gfn) > > +{ > > + u32 key = kvm_error_gfn_hash_fn(gfn); > > Mostly out of curiosity, do we really need a hash? E.g. could we get away > with an array of 4 values? 2 values? Just wondering if we can avoid 64*8 > bytes per CPU. We are relying on returning error when guest task retries fault. Fault will be retried on same address if same task is run by vcpu after "page ready" event. There is no guarantee that same task will be run. In theory, this cpu could have a large number of tasks queued and run these tasks before the faulting task is run again. Now say there are 128 tasks being run and 32 of them have page fault errors. Then if we keep 4 values, newer failures will simply overwrite older failures and we will keep spinning instead of exiting to user space. That's why this array of 64 gfns and add gfns based on hash. This does not completely elimiante the above possibility but chances of one hitting this are very very slim. > > One thought would be to use the index to handle the case of no error gfns so > that the size of the array doesn't affect lookup for the common case, e.g. We are calculating hash of gfn (used as index in array). So lookup cost is not driven by size of array. Its O(1) and not O(N). We just lookup at one element in array and if it does not match, we return false. u32 key = kvm_error_gfn_hash_fn(gfn); if (vcpu->arch.apf.error_gfns[key] != gfn) return 0; > > ... > > u8 error_gfn_head; > gfn_t error_gfns[ERROR_GFN_PER_VCPU]; > } apf; > > > if (vcpu->arch.apf.error_gfn_head == 0xff) > return false; > > for (i = 0; i < vcpu->arch.apf.error_gfn_head; i++) { > if (vcpu->arch.apf.error_gfns[i] == gfn) { > vcpu->arch.apf.error_gfns[i] = INVALID_GFN; > return true; > } > } > return true; > > Or you could even avoid INVALID_GFN altogether by compacting the array on > removal. The VM is probably dead anyways, burning a few cycles to clean > things up is a non-issue. Ditto for slow insertion. Same for insertion. Its O(1) and not dependent on size of array. 
Also insertion anyway is very infrequent event because it will not be often that error will happen. > > > + > > + if (vcpu->arch.apf.error_gfns[key] != gfn) > > + return 0; > > Should be "return false". Will fix. Thanks Vivek
On Thu, Oct 01, 2020 at 05:55:08PM -0400, Vivek Goyal wrote: > On Mon, Sep 28, 2020 at 09:37:00PM -0700, Sean Christopherson wrote: > > On Mon, Jul 20, 2020 at 05:13:59PM -0400, Vivek Goyal wrote: > > > @@ -10369,6 +10378,36 @@ void kvm_set_rflags(struct kvm_vcpu *vcpu, unsigned long rflags) > > > } > > > EXPORT_SYMBOL_GPL(kvm_set_rflags); > > > > > > +static inline u32 kvm_error_gfn_hash_fn(gfn_t gfn) > > > +{ > > > + BUILD_BUG_ON(!is_power_of_2(ERROR_GFN_PER_VCPU)); > > > + > > > + return hash_32(gfn & 0xffffffff, order_base_2(ERROR_GFN_PER_VCPU)); > > > +} > > > + > > > +static void kvm_add_error_gfn(struct kvm_vcpu *vcpu, gfn_t gfn) > > > +{ > > > + u32 key = kvm_error_gfn_hash_fn(gfn); > > > + > > > + /* > > > + * Overwrite the previous gfn. This is just a hint to do > > > + * sync page fault. > > > + */ > > > + vcpu->arch.apf.error_gfns[key] = gfn; > > > +} > > > + > > > +/* Returns true if gfn was found in hash table, false otherwise */ > > > +static bool kvm_find_and_remove_error_gfn(struct kvm_vcpu *vcpu, gfn_t gfn) > > > +{ > > > + u32 key = kvm_error_gfn_hash_fn(gfn); > > > > Mostly out of curiosity, do we really need a hash? E.g. could we get away > > with an array of 4 values? 2 values? Just wondering if we can avoid 64*8 > > bytes per CPU. > > We are relying on returning error when guest task retries fault. Fault > will be retried on same address if same task is run by vcpu after > "page ready" event. There is no guarantee that same task will be > run. In theory, this cpu could have a large number of tasks queued > and run these tasks before the faulting task is run again. Now say > there are 128 tasks being run and 32 of them have page fault > errors. Then if we keep 4 values, newer failures will simply > overwrite older failures and we will keep spinning instead of > exiting to user space. > > That's why this array of 64 gfns and add gfns based on hash. This > does not completely elimiante the above possibility but chances > of one hitting this are very very slim. But have you actually tried such a scenario? In other words, is there good justification for burning the extra memory? Alternatively, what about adding a new KVM request type to handle this? E.g. when the APF comes back with -EFAULT, snapshot the GFN and make a request. The vCPU then gets kicked and exits to userspace. Before exiting to userspace, the request handler resets vcpu->arch.apf.error_gfn. Bad GFNs simply get if error_gfn is "valid", i.e. there's a pending request. That would guarantee the error is propagated to userspace, and doesn't lose any guest information as dropping error GFNs just means the guest will take more page fault exits. > > One thought would be to use the index to handle the case of no error gfns so > > that the size of the array doesn't affect lookup for the common case, e.g. > > We are calculating hash of gfn (used as index in array). So lookup cost > is not driven by size of array. Its O(1) and not O(N). We just lookup > at one element in array and if it does not match, we return false. > > u32 key = kvm_error_gfn_hash_fn(gfn); > > if (vcpu->arch.apf.error_gfns[key] != gfn) > return 0; > > > > > > ... 
> > > > u8 error_gfn_head; > > gfn_t error_gfns[ERROR_GFN_PER_VCPU]; > > } apf; > > > > > > if (vcpu->arch.apf.error_gfn_head == 0xff) > > return false; > > > > for (i = 0; i < vcpu->arch.apf.error_gfn_head; i++) { > > if (vcpu->arch.apf.error_gfns[i] == gfn) { > > vcpu->arch.apf.error_gfns[i] = INVALID_GFN; > > return true; > > } > > } > > return true; > > > > Or you could even avoid INVALID_GFN altogether by compacting the array on > > removal. The VM is probably dead anyways, burning a few cycles to clean > > things up is a non-issue. Ditto for slow insertion. > > Same for insertion. Its O(1) and not dependent on size of array. Also > insertion anyway is very infrequent event because it will not be > often that error will happen. I know, I was saying that if we move to an array then we'd technically be subject to O(whatever) search and delete, but that it's irrelevant from a performance perspective because the guest is already toast if it hits a bad GFN. > > > + > > > + if (vcpu->arch.apf.error_gfns[key] != gfn) > > > + return 0; > > > > Should be "return false". > > Will fix. > > Thanks > Vivek >
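For readers unfamiliar with KVM requests, the flow being proposed can be modelled roughly as below. This is a user-space illustration only: KVM_REQ_APF_ERROR, the struct layout and the exit handling are invented; the real kernel primitives being mimicked are kvm_make_request()/kvm_check_request(), which vcpu_enter_guest() consults before re-entering the guest.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define KVM_REQ_APF_ERROR  3            /* hypothetical request number */
#define GFN_INVALID        (~(uint64_t)0)

struct vcpu {
	uint64_t requests;              /* bitmask of pending requests */
	uint64_t apf_error_gfn;         /* gfn that failed, or GFN_INVALID */
};

static void kvm_make_request(int req, struct vcpu *v)
{
	v->requests |= 1ULL << req;
}

static bool kvm_check_request(int req, struct vcpu *v)
{
	if (!(v->requests & (1ULL << req)))
		return false;
	v->requests &= ~(1ULL << req);
	return true;
}

/* Async page fault completion path: the fault came back with -EFAULT. */
static void async_pf_failed(struct vcpu *v, uint64_t gfn)
{
	v->apf_error_gfn = gfn;
	kvm_make_request(KVM_REQ_APF_ERROR, v);
	/* the real code would also kick the vCPU out of guest mode here */
}

/* Returns false when the vCPU must exit to userspace. */
static bool vcpu_enter_guest(struct vcpu *v)
{
	if (kvm_check_request(KVM_REQ_APF_ERROR, v)) {
		printf("exit to userspace: bad gfn 0x%llx\n",
		       (unsigned long long)v->apf_error_gfn);
		v->apf_error_gfn = GFN_INVALID;
		return false;
	}
	/* ... enter the guest ... */
	return true;
}

int main(void)
{
	struct vcpu v = { .apf_error_gfn = GFN_INVALID };

	async_pf_failed(&v, 0x1234);
	while (vcpu_enter_guest(&v))
		;
	return 0;
}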
On Thu, Oct 01, 2020 at 03:33:20PM -0700, Sean Christopherson wrote: > On Thu, Oct 01, 2020 at 05:55:08PM -0400, Vivek Goyal wrote: > > On Mon, Sep 28, 2020 at 09:37:00PM -0700, Sean Christopherson wrote: > > > On Mon, Jul 20, 2020 at 05:13:59PM -0400, Vivek Goyal wrote: > > > > @@ -10369,6 +10378,36 @@ void kvm_set_rflags(struct kvm_vcpu *vcpu, unsigned long rflags) > > > > } > > > > EXPORT_SYMBOL_GPL(kvm_set_rflags); > > > > > > > > +static inline u32 kvm_error_gfn_hash_fn(gfn_t gfn) > > > > +{ > > > > + BUILD_BUG_ON(!is_power_of_2(ERROR_GFN_PER_VCPU)); > > > > + > > > > + return hash_32(gfn & 0xffffffff, order_base_2(ERROR_GFN_PER_VCPU)); > > > > +} > > > > + > > > > +static void kvm_add_error_gfn(struct kvm_vcpu *vcpu, gfn_t gfn) > > > > +{ > > > > + u32 key = kvm_error_gfn_hash_fn(gfn); > > > > + > > > > + /* > > > > + * Overwrite the previous gfn. This is just a hint to do > > > > + * sync page fault. > > > > + */ > > > > + vcpu->arch.apf.error_gfns[key] = gfn; > > > > +} > > > > + > > > > +/* Returns true if gfn was found in hash table, false otherwise */ > > > > +static bool kvm_find_and_remove_error_gfn(struct kvm_vcpu *vcpu, gfn_t gfn) > > > > +{ > > > > + u32 key = kvm_error_gfn_hash_fn(gfn); > > > > > > Mostly out of curiosity, do we really need a hash? E.g. could we get away > > > with an array of 4 values? 2 values? Just wondering if we can avoid 64*8 > > > bytes per CPU. > > > > We are relying on returning error when guest task retries fault. Fault > > will be retried on same address if same task is run by vcpu after > > "page ready" event. There is no guarantee that same task will be > > run. In theory, this cpu could have a large number of tasks queued > > and run these tasks before the faulting task is run again. Now say > > there are 128 tasks being run and 32 of them have page fault > > errors. Then if we keep 4 values, newer failures will simply > > overwrite older failures and we will keep spinning instead of > > exiting to user space. > > > > That's why this array of 64 gfns and add gfns based on hash. This > > does not completely elimiante the above possibility but chances > > of one hitting this are very very slim. > > But have you actually tried such a scenario? In other words, is there good > justification for burning the extra memory? Its not easy to try and reproduce. So it is all theory at this point of time. If you are worried about memory usage, we can probably reduce the size of hash table. Say from 64, reduce it to 8. I am fine with that. I think initially I had a single error_gfn. But Vitaly had concerns about above scenario, so I implemeted a hash table. I think reducing hash table size to 8 or 16 probaly is a good middle ground. > > Alternatively, what about adding a new KVM request type to handle this? > E.g. when the APF comes back with -EFAULT, snapshot the GFN and make a > request. The vCPU then gets kicked and exits to userspace. Before exiting > to userspace, the request handler resets vcpu->arch.apf.error_gfn. Bad GFNs > simply get if error_gfn is "valid", i.e. there's a pending request. Sorry, I did not understand the above proposal. Can you please elaborate a bit more. Part of it is that I don't know much about KVM requests. Looking at the code it looks like that main loop is parsing if some kvm request is pending and executing that action. Don't we want to make sure that we exit to user space when guest retries error gfn access again. In this case once we get -EFAULT, we will still inject page_ready into guest. 
And then either same process or a different process might run. So when exactly code raises a kvm request. If I raise it right when I get -EFAULT, then kvm will exit to user space upon next entry time. But there is no guarantee guest vcpu is running the process which actually accessed the error gfn. And that probably means that register state of cpu does not mean much and one can not easily figure out which task tried to access the bad memory and when. That's why we prepare a list of error gfn and only exit to user space when error_gfn access is retried so that guest vcpu context is correct. What am I missing? Thanks Vivek > > That would guarantee the error is propagated to userspace, and doesn't lose > any guest information as dropping error GFNs just means the guest will take > more page fault exits. > > > > One thought would be to use the index to handle the case of no error gfns so > > > that the size of the array doesn't affect lookup for the common case, e.g. > > > > We are calculating hash of gfn (used as index in array). So lookup cost > > is not driven by size of array. Its O(1) and not O(N). We just lookup > > at one element in array and if it does not match, we return false. > > > > u32 key = kvm_error_gfn_hash_fn(gfn); > > > > if (vcpu->arch.apf.error_gfns[key] != gfn) > > return 0; > > > > > > > > > > ... > > > > > > u8 error_gfn_head; > > > gfn_t error_gfns[ERROR_GFN_PER_VCPU]; > > > } apf; > > > > > > > > > if (vcpu->arch.apf.error_gfn_head == 0xff) > > > return false; > > > > > > for (i = 0; i < vcpu->arch.apf.error_gfn_head; i++) { > > > if (vcpu->arch.apf.error_gfns[i] == gfn) { > > > vcpu->arch.apf.error_gfns[i] = INVALID_GFN; > > > return true; > > > } > > > } > > > return true; > > > > > > Or you could even avoid INVALID_GFN altogether by compacting the array on > > > removal. The VM is probably dead anyways, burning a few cycles to clean > > > things up is a non-issue. Ditto for slow insertion. > > > > Same for insertion. Its O(1) and not dependent on size of array. Also > > insertion anyway is very infrequent event because it will not be > > often that error will happen. > > I know, I was saying that if we move to an array then we'd technically be > subject to O(whatever) search and delete, but that it's irrelevant from a > performance perspective because the guest is already toast if it hits a bad > GFN. > > > > > + > > > > + if (vcpu->arch.apf.error_gfns[key] != gfn) > > > > + return 0; > > > > > > Should be "return false". > > > > Will fix. > > > > Thanks > > Vivek > > >
On Fri, Oct 02, 2020 at 11:38:54AM -0400, Vivek Goyal wrote: > On Thu, Oct 01, 2020 at 03:33:20PM -0700, Sean Christopherson wrote: > > Alternatively, what about adding a new KVM request type to handle this? > > E.g. when the APF comes back with -EFAULT, snapshot the GFN and make a > > request. The vCPU then gets kicked and exits to userspace. Before exiting > > to userspace, the request handler resets vcpu->arch.apf.error_gfn. Bad GFNs > > simply get if error_gfn is "valid", i.e. there's a pending request. > > Sorry, I did not understand the above proposal. Can you please elaborate > a bit more. Part of it is that I don't know much about KVM requests. > Looking at the code it looks like that main loop is parsing if some > kvm request is pending and executing that action. > > Don't we want to make sure that we exit to user space when guest retries > error gfn access again. > In this case once we get -EFAULT, we will still inject page_ready into > guest. And then either same process or a different process might run. > > So when exactly code raises a kvm request. If I raise it right when > I get -EFAULT, then kvm will exit to user space upon next entry > time. But there is no guarantee guest vcpu is running the process which > actually accessed the error gfn. And that probably means that register > state of cpu does not mean much and one can not easily figure out > which task tried to access the bad memory and when. > > That's why we prepare a list of error gfn and only exit to user space > when error_gfn access is retried so that guest vcpu context is correct. > > What am I missing? I don't think it's necessary to provide userspace with the register state of the guest task that hit the bad page. Other than debugging, I don't see how userspace can do anything useful which such information. Even if you want to inject an event of some form into the guest, having the correct context for the event itself is not required. IMO it's perfectly reasonable for such an event to be asynchronous. IIUC, your end goal is to be able to gracefully handle DAX file truncation. Simply killing the guest task that hit the bad page isn't sufficient, as nothing prevents a future task from accessing the same bad page. To fully handle the situation, the guest needs to remove the bad page from its memory pool. Once the page is offlined, the guest kernel's error handling will kick in when a task accesses the bad page (or nothing ever touches the bad page again and everyone is happy). Note, I'm not necessarily suggesting that QEMU piggyback its #MC injection to handle this, but I suspect the resulting behavior will look quite similar, e.g. notify the virtiofs driver in the guest, which does some magic to take the offending region offline, and then guest tasks get SIGBUS or whatever. I also don't think it's KVM's responsibility to _directly_ handle such a scenario. As I said in an earlier version, KVM can't possibly know _why_ a page fault came back with -EFAULT, only userspace can connect the dots of GPA -> HVA -> vm_area_struct -> file -> inject event. KVM definitely should exit to userspace on the -EFAULT instead of hanging the guest, but that can be done via a new request, as suggested.
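A sketch of the userspace side of "connect the dots": because the VMM set up the memslots, it can translate a guest physical address reported on such an exit back to the host virtual address and the backing file, and then decide how to notify the guest. Everything below (the slot table, file names, exit handling) is invented for illustration and is not QEMU code:

#include <stdint.h>
#include <stdio.h>

struct memslot {
	uint64_t gpa_start;
	uint64_t size;
	uint64_t hva_start;             /* from mmap() of the backing file */
	const char *backing_file;       /* e.g. a virtiofs DAX window */
};

static const struct memslot slots[] = {
	{ 0x000000000, 0x80000000, 0x7f0000000000ULL, "guest-ram" },
	{ 0x100000000, 0x40000000, 0x7f8000000000ULL, "virtiofs-dax-window" },
};

static const struct memslot *find_slot(uint64_t gpa)
{
	for (size_t i = 0; i < sizeof(slots) / sizeof(slots[0]); i++)
		if (gpa >= slots[i].gpa_start &&
		    gpa < slots[i].gpa_start + slots[i].size)
			return &slots[i];
	return NULL;
}

/* Called when KVM reports that access to @gpa failed with -EFAULT. */
static void handle_bad_gpa(uint64_t gpa)
{
	const struct memslot *s = find_slot(gpa);

	if (!s) {
		printf("gpa 0x%llx is not backed by any slot\n",
		       (unsigned long long)gpa);
		return;
	}
	printf("gpa 0x%llx -> hva 0x%llx in '%s': notify the guest driver\n",
	       (unsigned long long)gpa,
	       (unsigned long long)(s->hva_start + (gpa - s->gpa_start)),
	       s->backing_file);
}

int main(void)
{
	handle_bad_gpa(0x100001000ULL);
	return 0;
}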
On Fri, Oct 02, 2020 at 11:30:37AM -0700, Sean Christopherson wrote: > On Fri, Oct 02, 2020 at 11:38:54AM -0400, Vivek Goyal wrote: > > On Thu, Oct 01, 2020 at 03:33:20PM -0700, Sean Christopherson wrote: > > > Alternatively, what about adding a new KVM request type to handle this? > > > E.g. when the APF comes back with -EFAULT, snapshot the GFN and make a > > > request. The vCPU then gets kicked and exits to userspace. Before exiting > > > to userspace, the request handler resets vcpu->arch.apf.error_gfn. Bad GFNs > > > simply get if error_gfn is "valid", i.e. there's a pending request. > > > > Sorry, I did not understand the above proposal. Can you please elaborate > > a bit more. Part of it is that I don't know much about KVM requests. > > Looking at the code it looks like that main loop is parsing if some > > kvm request is pending and executing that action. > > > > Don't we want to make sure that we exit to user space when guest retries > > error gfn access again. > > > In this case once we get -EFAULT, we will still inject page_ready into > > guest. And then either same process or a different process might run. > > > > So when exactly code raises a kvm request. If I raise it right when > > I get -EFAULT, then kvm will exit to user space upon next entry > > time. But there is no guarantee guest vcpu is running the process which > > actually accessed the error gfn. And that probably means that register > > state of cpu does not mean much and one can not easily figure out > > which task tried to access the bad memory and when. > > > > That's why we prepare a list of error gfn and only exit to user space > > when error_gfn access is retried so that guest vcpu context is correct. > > > > What am I missing? > > I don't think it's necessary to provide userspace with the register state of > the guest task that hit the bad page. Other than debugging, I don't see how > userspace can do anything useful which such information. I think debugging is the whole point so that user can figure out which access by guest task resulted in bad memory access. I would think this will be important piece of information. > > Even if you want to inject an event of some form into the guest, having the > correct context for the event itself is not required. IMO it's perfectly > reasonable for such an event to be asynchronous. > > IIUC, your end goal is to be able to gracefully handle DAX file truncation. > Simply killing the guest task that hit the bad page isn't sufficient, as > nothing prevents a future task from accessing the same bad page. Next task can't even mmap that page mmap will fail. File got truncated, that page does not exist. So sending SIGBUS to task should definitely solve the problem. We also need to solve the issue for guest kernel accessing the page which got truncated on host. In that case we need to use correct memcpy helpers and use exception table magic and return error code to user space. > To fully > handle the situation, the guest needs to remove the bad page from its memory > pool. Once the page is offlined, the guest kernel's error handling will > kick in when a task accesses the bad page (or nothing ever touches the bad > page again and everyone is happy). This is not really a case of bad page as such. It is more of a page gone missing/trucated. And no new user can map it. We just need to worry about existing users who already have it mapped. 
> > Note, I'm not necessarily suggesting that QEMU piggyback its #MC injection > to handle this, but I suspect the resulting behavior will look quite similar, > e.g. notify the virtiofs driver in the guest, which does some magic to take > the offending region offline, and then guest tasks get SIGBUS or whatever. > > I also don't think it's KVM's responsibility to _directly_ handle such a > scenario. As I said in an earlier version, KVM can't possibly know _why_ a > page fault came back with -EFAULT, only userspace can connect the dots of > GPA -> HVA -> vm_area_struct -> file -> inject event. KVM definitely should > exit to userspace on the -EFAULT instead of hanging the guest, but that can > be done via a new request, as suggested.

KVM at least should have a mechanism to report this back to the guest, and we don't have any. There only seems to be the #MC machinery for poisoned pages, and I am not sure how much we can build on top of it to take care of this case. One problem I found with #MC is that it generates a synchronous #MC only on loads, not stores, so all of the error handling is written around the assumption that a synchronous #MC can happen only on a load. Stores generate a different kind of #MC, and that too asynchronously, so the caller will not know about the error immediately. In this case we need to know about the error in the context of the caller, for both loads and stores.

Anyway, I agree that this patch does not solve the problem of race-free synchronous event injection into the guest. That's something yet to be solved, either by #VE or by #MC or by #foo. This patch is only doing two things.

- Because we don't have a mechanism to report errors to the guest, use the second best method and exit to user space.

- Make behavior consistent between synchronous and async faults. Currently sync faults exit to user space on error while async faults spin infinitely.

Once we have a method to report errors back to the guest, we should first report the error to the guest, and only in the absence of such a mechanism exit to user space.

Thanks
Vivek
On Fri, Oct 02, 2020 at 03:27:34PM -0400, Vivek Goyal wrote: > On Fri, Oct 02, 2020 at 11:30:37AM -0700, Sean Christopherson wrote: > > On Fri, Oct 02, 2020 at 11:38:54AM -0400, Vivek Goyal wrote: > > > On Thu, Oct 01, 2020 at 03:33:20PM -0700, Sean Christopherson wrote: > > > > Alternatively, what about adding a new KVM request type to handle this? > > > > E.g. when the APF comes back with -EFAULT, snapshot the GFN and make a > > > > request. The vCPU then gets kicked and exits to userspace. Before exiting > > > > to userspace, the request handler resets vcpu->arch.apf.error_gfn. Bad GFNs > > > > simply get if error_gfn is "valid", i.e. there's a pending request. > > > > > > Sorry, I did not understand the above proposal. Can you please elaborate > > > a bit more. Part of it is that I don't know much about KVM requests. > > > Looking at the code it looks like that main loop is parsing if some > > > kvm request is pending and executing that action. > > > > > > Don't we want to make sure that we exit to user space when guest retries > > > error gfn access again. > > > > > In this case once we get -EFAULT, we will still inject page_ready into > > > guest. And then either same process or a different process might run. > > > > > > So when exactly code raises a kvm request. If I raise it right when > > > I get -EFAULT, then kvm will exit to user space upon next entry > > > time. But there is no guarantee guest vcpu is running the process which > > > actually accessed the error gfn. And that probably means that register > > > state of cpu does not mean much and one can not easily figure out > > > which task tried to access the bad memory and when. > > > > > > That's why we prepare a list of error gfn and only exit to user space > > > when error_gfn access is retried so that guest vcpu context is correct. > > > > > > What am I missing? > > > > I don't think it's necessary to provide userspace with the register state of > > the guest task that hit the bad page. Other than debugging, I don't see how > > userspace can do anything useful which such information. > > I think debugging is the whole point so that user can figure out which > access by guest task resulted in bad memory access. I would think this > will be important piece of information. But isn't this failure due to a truncation in the host? Why would we care about debugging the guest? It hasn't done anything wrong, has it? Or am I misunderstanding the original problem statement. > > To fully handle the situation, the guest needs to remove the bad page from > > its memory pool. Once the page is offlined, the guest kernel's error > > handling will kick in when a task accesses the bad page (or nothing ever > > touches the bad page again and everyone is happy). > > This is not really a case of bad page as such. It is more of a page > gone missing/trucated. And no new user can map it. We just need to > worry about existing users who already have it mapped. What do you mean by "no new user can map it"? Are you talking about guest tasks or host tasks? If guest tasks, how would the guest know the page is missing and thus prevent mapping the non-existent page? > > Note, I'm not necessarily suggesting that QEMU piggyback its #MC injection > > to handle this, but I suspect the resulting behavior will look quite similar, > > e.g. notify the virtiofs driver in the guest, which does some magic to take > > the offending region offline, and then guest tasks get SIGBUS or whatever. 
> > > > I also don't think it's KVM's responsibility to _directly_ handle such a > > scenario. As I said in an earlier version, KVM can't possibly know _why_ a > > page fault came back with -EFAULT, only userspace can connect the dots of > > GPA -> HVA -> vm_area_struct -> file -> inject event. KVM definitely should > > exit to userspace on the -EFAULT instead of hanging the guest, but that can > > be done via a new request, as suggested. > > KVM atleast should have the mechanism to report this back to guest. And > we don't have any. There only seems to be #MC stuff for poisoned pages. > I am not sure how much we can build on top of #MC stuff to take care > of this case. One problem with #MC I found was that it generates > synchronous #MC only on load and not store. So all the code is > written in such a way that synchronous #MC can happen only on load > and hence the error handling. > > Stores generate different kind of #MC that too asynchronously and > caller will not know about it immiditely. But in this case we need > to know about error in the context of caller both for loads and stores. > > Anyway, I agree that this patch does not solve the problem of race > free synchronous event inject into the guest. That's something yet > to be solved either by #VE or by #MC or by #foo. > > This patch is only doing two things. > > - Because we don't have a mechanism to report errors to guest, use > the second best method and exit to user space. Not that it matters at this point, but IMO exiting to userspace is the correct behavior, not simply "second best method". Again, KVM by design does not know what lies behind any given HVA, and therefore should not be making decisions about how to handle bad HVAs. > - Make behavior consistent between synchronous fault and async faults. > Currently sync faults exit to user space on error while async > faults spin infinitely. Yes, and this part can be done with a new request type. Adding a new request doesn't commit KVM to any ABI changes, e.g. userspace will still see an -EFAULT return, and nothing more. I.e. handling this via request doesn't prevent switching to synchronous exits in the future, if such a feature is required. > Once we have a method to report errors back to guest, then we first > should report error back to guest. And only in the absense of such > mechanism we should exit to user space. I don't see the value in extending KVM's ABI to avoid the exit to userspace. E.g. we could add a memslot flag to autoreflect -EFAULT as an event of some form, but this is a slow path, the extra time should be a non-issue.
On Fri, Oct 02, 2020 at 12:45:18PM -0700, Sean Christopherson wrote: > On Fri, Oct 02, 2020 at 03:27:34PM -0400, Vivek Goyal wrote: > > On Fri, Oct 02, 2020 at 11:30:37AM -0700, Sean Christopherson wrote: > > > On Fri, Oct 02, 2020 at 11:38:54AM -0400, Vivek Goyal wrote: > > > > On Thu, Oct 01, 2020 at 03:33:20PM -0700, Sean Christopherson wrote: > > > > > Alternatively, what about adding a new KVM request type to handle this? > > > > > E.g. when the APF comes back with -EFAULT, snapshot the GFN and make a > > > > > request. The vCPU then gets kicked and exits to userspace. Before exiting > > > > > to userspace, the request handler resets vcpu->arch.apf.error_gfn. Bad GFNs > > > > > simply get if error_gfn is "valid", i.e. there's a pending request. > > > > > > > > Sorry, I did not understand the above proposal. Can you please elaborate > > > > a bit more. Part of it is that I don't know much about KVM requests. > > > > Looking at the code it looks like that main loop is parsing if some > > > > kvm request is pending and executing that action. > > > > > > > > Don't we want to make sure that we exit to user space when guest retries > > > > error gfn access again. > > > > > > > In this case once we get -EFAULT, we will still inject page_ready into > > > > guest. And then either same process or a different process might run. > > > > > > > > So when exactly code raises a kvm request. If I raise it right when > > > > I get -EFAULT, then kvm will exit to user space upon next entry > > > > time. But there is no guarantee guest vcpu is running the process which > > > > actually accessed the error gfn. And that probably means that register > > > > state of cpu does not mean much and one can not easily figure out > > > > which task tried to access the bad memory and when. > > > > > > > > That's why we prepare a list of error gfn and only exit to user space > > > > when error_gfn access is retried so that guest vcpu context is correct. > > > > > > > > What am I missing? > > > > > > I don't think it's necessary to provide userspace with the register state of > > > the guest task that hit the bad page. Other than debugging, I don't see how > > > userspace can do anything useful which such information. > > > > I think debugging is the whole point so that user can figure out which > > access by guest task resulted in bad memory access. I would think this > > will be important piece of information. > > But isn't this failure due to a truncation in the host? Why would we care > about debugging the guest? It hasn't done anything wrong, has it? Or am I > misunderstanding the original problem statement. I think you understood problem statement right. If guest has right context, it just gives additional information who tried to access the missing memory page. > > > > To fully handle the situation, the guest needs to remove the bad page from > > > its memory pool. Once the page is offlined, the guest kernel's error > > > handling will kick in when a task accesses the bad page (or nothing ever > > > touches the bad page again and everyone is happy). > > > > This is not really a case of bad page as such. It is more of a page > > gone missing/trucated. And no new user can map it. We just need to > > worry about existing users who already have it mapped. > > What do you mean by "no new user can map it"? Are you talking about guest > tasks or host tasks? If guest tasks, how would the guest know the page is > missing and thus prevent mapping the non-existent page? 
If a new task wants mmap(), it will send a request to virtiofsd/qemu on host. If file has been truncated, then mapping beyond file size will fail and process will get error. So they will not be able to map a page which has been truncated. > > > > Note, I'm not necessarily suggesting that QEMU piggyback its #MC injection > > > to handle this, but I suspect the resulting behavior will look quite similar, > > > e.g. notify the virtiofs driver in the guest, which does some magic to take > > > the offending region offline, and then guest tasks get SIGBUS or whatever. > > > > > > I also don't think it's KVM's responsibility to _directly_ handle such a > > > scenario. As I said in an earlier version, KVM can't possibly know _why_ a > > > page fault came back with -EFAULT, only userspace can connect the dots of > > > GPA -> HVA -> vm_area_struct -> file -> inject event. KVM definitely should > > > exit to userspace on the -EFAULT instead of hanging the guest, but that can > > > be done via a new request, as suggested. > > > > KVM atleast should have the mechanism to report this back to guest. And > > we don't have any. There only seems to be #MC stuff for poisoned pages. > > I am not sure how much we can build on top of #MC stuff to take care > > of this case. One problem with #MC I found was that it generates > > synchronous #MC only on load and not store. So all the code is > > written in such a way that synchronous #MC can happen only on load > > and hence the error handling. > > > > Stores generate different kind of #MC that too asynchronously and > > caller will not know about it immiditely. But in this case we need > > to know about error in the context of caller both for loads and stores. > > > > Anyway, I agree that this patch does not solve the problem of race > > free synchronous event inject into the guest. That's something yet > > to be solved either by #VE or by #MC or by #foo. > > > > This patch is only doing two things. > > > > - Because we don't have a mechanism to report errors to guest, use > > the second best method and exit to user space. > > Not that it matters at this point, but IMO exiting to userspace is the > correct behavior, not simply "second best method". Again, KVM by design > does not know what lies behind any given HVA, and therefore should not be > making decisions about how to handle bad HVAs. > > > - Make behavior consistent between synchronous fault and async faults. > > Currently sync faults exit to user space on error while async > > faults spin infinitely. > > Yes, and this part can be done with a new request type. Adding a new > request doesn't commit KVM to any ABI changes, e.g. userspace will still > see an -EFAULT return, and nothing more. I.e. handling this via request > doesn't prevent switching to synchronous exits in the future, if such a > feature is required. I am really not sure what benfit this kvm request is bringing to the table. To me maintaining a hash table and exiting when guest retries is much nicer desgin. In fact, once we have the mechanism to inject error into the guest using an exception, We can easily plug that into the same path. If memory usage is a concern, we can reduce the hash table size to say 4 or 8. How is this change commiting KVM to an ABI? > > > Once we have a method to report errors back to guest, then we first > > should report error back to guest. And only in the absense of such > > mechanism we should exit to user space. > > I don't see the value in extending KVM's ABI to avoid the exit to userspace. > E.g. 
we could add a memslot flag to autoreflect -EFAULT as an event of some > form, but this is a slow path, the extra time should be a non-issue.

IIUC, you are saying that this becomes a property of the memory region. Some memory regions can choose to have their errors reported back to the guest (using some kind of flag in the memslot). And in the absence of such a flag, the default behavior will continue to be an exit to user space?

I guess that's fine if we don't want to treat all the memory areas in the same way. I can't think why reporting errors for all areas to the guest is a bad idea. Let the guest either handle the error or die if it can't handle it. But that discussion is for some other day. Our first problem is that we don't have a race-free mechanism to report errors.

Thanks
Vivek
On Fri, Oct 02, 2020 at 04:02:14PM -0400, Vivek Goyal wrote: > On Fri, Oct 02, 2020 at 12:45:18PM -0700, Sean Christopherson wrote: > > On Fri, Oct 02, 2020 at 03:27:34PM -0400, Vivek Goyal wrote: > > > On Fri, Oct 02, 2020 at 11:30:37AM -0700, Sean Christopherson wrote: > > > > On Fri, Oct 02, 2020 at 11:38:54AM -0400, Vivek Goyal wrote: > > > > I don't think it's necessary to provide userspace with the register state of > > > > the guest task that hit the bad page. Other than debugging, I don't see how > > > > userspace can do anything useful which such information. > > > > > > I think debugging is the whole point so that user can figure out which > > > access by guest task resulted in bad memory access. I would think this > > > will be important piece of information. > > > > But isn't this failure due to a truncation in the host? Why would we care > > about debugging the guest? It hasn't done anything wrong, has it? Or am I > > misunderstanding the original problem statement. > > I think you understood problem statement right. If guest has right > context, it just gives additional information who tried to access > the missing memory page. Yes, but it's not actionable, e.g. QEMU can't do anything differently given a guest RIP. It's useful information for hands-on debug, but the information can be easily collected through other means when doing hands-on debug. > > > > To fully handle the situation, the guest needs to remove the bad page from > > > > its memory pool. Once the page is offlined, the guest kernel's error > > > > handling will kick in when a task accesses the bad page (or nothing ever > > > > touches the bad page again and everyone is happy). > > > > > > This is not really a case of bad page as such. It is more of a page > > > gone missing/trucated. And no new user can map it. We just need to > > > worry about existing users who already have it mapped. > > > > What do you mean by "no new user can map it"? Are you talking about guest > > tasks or host tasks? If guest tasks, how would the guest know the page is > > missing and thus prevent mapping the non-existent page? > > If a new task wants mmap(), it will send a request to virtiofsd/qemu > on host. If file has been truncated, then mapping beyond file size > will fail and process will get error. So they will not be able to > map a page which has been truncated. Ah. Is there anything that prevents the notification side of things from being handled purely within the virtiofs layer? E.g. host notifies the guest that a file got truncated, virtiofs driver in the guest invokes a kernel API to remove the page(s). > > > > Note, I'm not necessarily suggesting that QEMU piggyback its #MC injection > > > > to handle this, but I suspect the resulting behavior will look quite similar, > > > > e.g. notify the virtiofs driver in the guest, which does some magic to take > > > > the offending region offline, and then guest tasks get SIGBUS or whatever. > > > > > > > > I also don't think it's KVM's responsibility to _directly_ handle such a > > > > scenario. As I said in an earlier version, KVM can't possibly know _why_ a > > > > page fault came back with -EFAULT, only userspace can connect the dots of > > > > GPA -> HVA -> vm_area_struct -> file -> inject event. KVM definitely should > > > > exit to userspace on the -EFAULT instead of hanging the guest, but that can > > > > be done via a new request, as suggested. > > > > > > KVM atleast should have the mechanism to report this back to guest. And > > > we don't have any. 
There only seems to be #MC stuff for poisoned pages. > > > I am not sure how much we can build on top of #MC stuff to take care > > > of this case. One problem with #MC I found was that it generates > > > synchronous #MC only on load and not store. So all the code is > > > written in such a way that synchronous #MC can happen only on load > > > and hence the error handling. > > > > > > Stores generate different kind of #MC that too asynchronously and > > > caller will not know about it immiditely. But in this case we need > > > to know about error in the context of caller both for loads and stores. > > > > > > Anyway, I agree that this patch does not solve the problem of race > > > free synchronous event inject into the guest. That's something yet > > > to be solved either by #VE or by #MC or by #foo. > > > > > > This patch is only doing two things. > > > > > > - Because we don't have a mechanism to report errors to guest, use > > > the second best method and exit to user space. > > > > Not that it matters at this point, but IMO exiting to userspace is the > > correct behavior, not simply "second best method". Again, KVM by design > > does not know what lies behind any given HVA, and therefore should not be > > making decisions about how to handle bad HVAs. > > > > > - Make behavior consistent between synchronous fault and async faults. > > > Currently sync faults exit to user space on error while async > > > faults spin infinitely. > > > > Yes, and this part can be done with a new request type. Adding a new > > request doesn't commit KVM to any ABI changes, e.g. userspace will still > > see an -EFAULT return, and nothing more. I.e. handling this via request > > doesn't prevent switching to synchronous exits in the future, if such a > > feature is required. > > I am really not sure what benfit this kvm request is bringing to the > table. To me maintaining a hash table and exiting when guest retries > is much nicer desgin. 8 bytes pre vCPU versus 512 bytes per vCPU, with no guesswork. I.e. the request can guarantee the first access to a truncated file will be reported to userspace. > In fact, once we have the mechanism to inject error into the guest using an > exception, We can easily plug that into the same path. You're blurring the two things together. The first step is to fix the infinite loop by exiting to userspace with -EFAULT. Notifying the guest of the truncated file is a completely different problem to solve. Reporting -EFAULT to userspace is a pure bug fix, and is not unique to DAX. > If memory usage is a concern, we can reduce the hash table size to say > 4 or 8. > > How is this change commiting KVM to an ABI? I didn't mean to imply this patch changes the ABI. What I was trying to say is that going with the request-based implementation doesn't commit KVM to new behavior, e.g. we can yank out the request implementation in favor of something else if the need arises. > > > Once we have a method to report errors back to guest, then we first > > > should report error back to guest. And only in the absense of such > > > mechanism we should exit to user space. > > > > I don't see the value in extending KVM's ABI to avoid the exit to userspace. > > E.g. we could add a memslot flag to autoreflect -EFAULT as an event of some > > form, but this is a slow path, the extra time should be a non-issue. > > IIUC, you are saying that this becomes the property of memory region. Some > memory regions can choose their errors being reported back to guest > (using some kind of flag in memslot). 
And in the absence of such a flag, > the default behavior will continue to be an exit to user space? > > I guess that's fine if we don't want to treat all the memory areas in > the same way. I can't think why reporting errors for all areas to the guest > is a bad idea.

Because it'd be bleeding host information to the guest. E.g. if there's a host kernel bug that causes gup() to fail, then KVM would inject a random fault into the guest, which is all kinds of bad.

> Let the guest either handle the error or die if it can't > handle it. But that discussion is for some other day. Our first problem > is that we don't have a race-free mechanism to report errors. > > Thanks > Vivek >
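(For reference on Sean's size comparison above: the v4 patch reserves ERROR_GFN_PER_VCPU = 64 slots of gfn_t, a 64-bit value, per vCPU, i.e. 64 * 8 = 512 bytes, while snapshotting a single error GFN for a pending request needs only one 8-byte gfn_t per vCPU.)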
On Fri, Oct 02, 2020 at 02:13:14PM -0700, Sean Christopherson wrote: > On Fri, Oct 02, 2020 at 04:02:14PM -0400, Vivek Goyal wrote: > > On Fri, Oct 02, 2020 at 12:45:18PM -0700, Sean Christopherson wrote: > > > On Fri, Oct 02, 2020 at 03:27:34PM -0400, Vivek Goyal wrote: > > > > On Fri, Oct 02, 2020 at 11:30:37AM -0700, Sean Christopherson wrote: > > > > > On Fri, Oct 02, 2020 at 11:38:54AM -0400, Vivek Goyal wrote: > > > > > I don't think it's necessary to provide userspace with the register state of > > > > > the guest task that hit the bad page. Other than debugging, I don't see how > > > > > userspace can do anything useful which such information. > > > > > > > > I think debugging is the whole point so that user can figure out which > > > > access by guest task resulted in bad memory access. I would think this > > > > will be important piece of information. > > > > > > But isn't this failure due to a truncation in the host? Why would we care > > > about debugging the guest? It hasn't done anything wrong, has it? Or am I > > > misunderstanding the original problem statement. > > > > I think you understood problem statement right. If guest has right > > context, it just gives additional information who tried to access > > the missing memory page. > > Yes, but it's not actionable, e.g. QEMU can't do anything differently given > a guest RIP. It's useful information for hands-on debug, but the information > can be easily collected through other means when doing hands-on debug. Hi Sean, I tried my patch and truncated file on host before guest did memcpy(). After truncation guest process tried memcpy() on truncated region and kvm exited to user space with -EFAULT. I see following on serial console. I am assuming qemu is printing the state of vcpu. ************************************************************ error: kvm run failed Bad address RAX=00007fff6e7a9750 RBX=0000000000000000 RCX=00007f513927e000 RDX=000000000000a RSI=00007f513927e000 RDI=00007fff6e7a9750 RBP=00007fff6e7a97b0 RSP=00007fff6e7a8 R8 =0000000000000000 R9 =0000000000000031 R10=00007fff6e7a957c R11=0000000000006 R12=0000000000401140 R13=0000000000000000 R14=0000000000000000 R15=0000000000000 RIP=00007f51391e0547 RFL=00010202 [-------] CPL=3 II=0 A20=1 SMM=0 HLT=0 ES =0000 0000000000000000 ffffffff 00c00000 CS =0033 0000000000000000 ffffffff 00a0fb00 DPL=3 CS64 [-RA] SS =002b 0000000000000000 ffffffff 00c0f300 DPL=3 DS [-WA] DS =0000 0000000000000000 ffffffff 00c00000 FS =0000 00007f5139246540 ffffffff 00c00000 GS =0000 0000000000000000 ffffffff 00c00000 LDT=0000 0000000000000000 00000000 00000000 TR =0040 fffffe00003a6000 00004087 00008b00 DPL=0 TSS64-busy GDT= fffffe00003a4000 0000007f IDT= fffffe0000000000 00000fff CR0=80050033 CR2=00007f513927e004 CR3=000000102b5eb805 CR4=00770ee0 DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000 DR3=000000000000 DR6=00000000fffe0ff0 DR7=0000000000000400 EFER=0000000000000d01 Code=fa 6f 06 c5 fa 6f 4c 16 f0 c5 fa 7f 07 c5 fa 7f 4c 17 f0 c3 <48> 8b 4c 16 3 ***************************************************************** I also changed my test program to print source and destination address for memcpy. dst=0x0x7fff6e7a9750 src=0x0x7f513927e000 Here dst matches RDI and src matches RSI. This trace also tells me CPL=3 so a user space access triggered this. Now I have few questions. - If we exit to user space asynchronously (using kvm request), what debug information is in there which tells user which address is bad. 
I admit that even above trace does not seem to be telling me directly which address (HVA?) is bad. But if I take a crash dump of guest, using above information I should be able to get to GPA which is problematic. And looking at /proc/iomem it should also tell which device this memory region is in. Also using this crash dump one should be able to walk through virtiofs data structures and figure out which file and what offset with-in file does it belong to. Now one can look at filesystem on host and see file got truncated and it will become obvious it can't be faulted in. And then one can continue to debug that how did we arrive here. But if we don't exit to user space synchronously, Only relevant information we seem to have is -EFAULT. Apart from that, how does one figure out what address is bad, or who tried to access it. Or which file/offset does it belong to etc. I agree that problem is not necessarily in guest code. But by exiting synchronously, it gives enough information that one can use crash dump to get to bottom of the issue. If we exit to user space asynchronously, all this information will be lost and it might make it very hard to figure out (if not impossible), what's going on. > > > > > > To fully handle the situation, the guest needs to remove the bad page from > > > > > its memory pool. Once the page is offlined, the guest kernel's error > > > > > handling will kick in when a task accesses the bad page (or nothing ever > > > > > touches the bad page again and everyone is happy). > > > > > > > > This is not really a case of bad page as such. It is more of a page > > > > gone missing/trucated. And no new user can map it. We just need to > > > > worry about existing users who already have it mapped. > > > > > > What do you mean by "no new user can map it"? Are you talking about guest > > > tasks or host tasks? If guest tasks, how would the guest know the page is > > > missing and thus prevent mapping the non-existent page? > > > > If a new task wants mmap(), it will send a request to virtiofsd/qemu > > on host. If file has been truncated, then mapping beyond file size > > will fail and process will get error. So they will not be able to > > map a page which has been truncated. > > Ah. Is there anything that prevents the notification side of things from > being handled purely within the virtiofs layer? E.g. host notifies the guest > that a file got truncated, virtiofs driver in the guest invokes a kernel API > to remove the page(s). virtiofsd notifications can help a bit but not in all cases. For example, If file got truncated and guest kernel accesses it immidiately after that, (before notification arrives), it will hang and notification will not be able to do much. So while notification might be nice to have, but we still will need some sort of error reporting from kvm. Thanks Vivek
On Mon, Oct 05, 2020 at 11:33:18AM -0400, Vivek Goyal wrote: > On Fri, Oct 02, 2020 at 02:13:14PM -0700, Sean Christopherson wrote: > Now I have few questions. > > - If we exit to user space asynchronously (using kvm request), what debug > information is in there which tells user which address is bad. I admit > that even above trace does not seem to be telling me directly which > address (HVA?) is bad. > > But if I take a crash dump of guest, using above information I should > be able to get to GPA which is problematic. And looking at /proc/iomem > it should also tell which device this memory region is in. > > Also using this crash dump one should be able to walk through virtiofs data > structures and figure out which file and what offset with-in file does > it belong to. Now one can look at filesystem on host and see file got > truncated and it will become obvious it can't be faulted in. And then > one can continue to debug that how did we arrive here. > > But if we don't exit to user space synchronously, Only relevant > information we seem to have is -EFAULT. Apart from that, how does one > figure out what address is bad, or who tried to access it. Or which > file/offset does it belong to etc. > > I agree that problem is not necessarily in guest code. But by exiting > synchronously, it gives enough information that one can use crash > dump to get to bottom of the issue. If we exit to user space > asynchronously, all this information will be lost and it might make > it very hard to figure out (if not impossible), what's going on. If we want userspace to be able to do something useful, KVM should explicitly inform userspace about the error, userspace shouldn't simply assume that -EFAULT means a HVA->PFN lookup failed. Userspace also shouldn't have to query guest state to handle the error, as that won't work for protected guests guests like SEV-ES and TDX. I can think of two options: 1. Send a signal, a la kvm_send_hwpoison_signal(). 2. Add a userspace exit reason along with a new entry in the run struct, e.g. that provides the bad GPA, HVA, possibly permissions, etc... > > > > > > To fully handle the situation, the guest needs to remove the bad page from > > > > > > its memory pool. Once the page is offlined, the guest kernel's error > > > > > > handling will kick in when a task accesses the bad page (or nothing ever > > > > > > touches the bad page again and everyone is happy). > > > > > > > > > > This is not really a case of bad page as such. It is more of a page > > > > > gone missing/trucated. And no new user can map it. We just need to > > > > > worry about existing users who already have it mapped. > > > > > > > > What do you mean by "no new user can map it"? Are you talking about guest > > > > tasks or host tasks? If guest tasks, how would the guest know the page is > > > > missing and thus prevent mapping the non-existent page? > > > > > > If a new task wants mmap(), it will send a request to virtiofsd/qemu > > > on host. If file has been truncated, then mapping beyond file size > > > will fail and process will get error. So they will not be able to > > > map a page which has been truncated. > > > > Ah. Is there anything that prevents the notification side of things from > > being handled purely within the virtiofs layer? E.g. host notifies the guest > > that a file got truncated, virtiofs driver in the guest invokes a kernel API > > to remove the page(s). > > virtiofsd notifications can help a bit but not in all cases. 
For example, > If file got truncated and guest kernel accesses it immidiately after that, > (before notification arrives), it will hang and notification will not > be able to do much. > > So while notification might be nice to have, but we still will need some > sort of error reporting from kvm. And I'm in full agreement with that. What I'm saying is that resolving the infinite loop is a _bug fix_ and is completely orthogonal to adding a new mechanism to handle the file truncation scenario.
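To make option 2 above concrete, the exit could carry the bad addresses in the kvm_run structure. Everything below (the exit reason value, the struct and field names) is a hypothetical sketch for illustration, not an existing or agreed-upon KVM ABI:

/* Hypothetical new exit reason; value and names are illustrative only. */
#define KVM_EXIT_HVA_FAULT	36

/* Hypothetical new member of the kvm_run union, filled in before exiting: */
struct kvm_hva_fault {
	__u64 gpa;	/* guest physical address whose backing could not be faulted in */
	__u64 hva;	/* host virtual address the GPA maps to */
	__u64 flags;	/* e.g. whether the guest access was a read, a write or a fetch */
};

With something along these lines userspace can map the HVA back to the virtiofs (or other) mapping and decide what to do, without inspecting guest register state, which also keeps the scheme usable for SEV-ES/TDX guests.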
On Mon, Oct 05, 2020 at 09:16:20AM -0700, Sean Christopherson wrote: > On Mon, Oct 05, 2020 at 11:33:18AM -0400, Vivek Goyal wrote: > > On Fri, Oct 02, 2020 at 02:13:14PM -0700, Sean Christopherson wrote: > > Now I have few questions. > > > > - If we exit to user space asynchronously (using kvm request), what debug > > information is in there which tells user which address is bad. I admit > > that even above trace does not seem to be telling me directly which > > address (HVA?) is bad. > > > > But if I take a crash dump of guest, using above information I should > > be able to get to GPA which is problematic. And looking at /proc/iomem > > it should also tell which device this memory region is in. > > > > Also using this crash dump one should be able to walk through virtiofs data > > structures and figure out which file and what offset with-in file does > > it belong to. Now one can look at filesystem on host and see file got > > truncated and it will become obvious it can't be faulted in. And then > > one can continue to debug that how did we arrive here. > > > > But if we don't exit to user space synchronously, Only relevant > > information we seem to have is -EFAULT. Apart from that, how does one > > figure out what address is bad, or who tried to access it. Or which > > file/offset does it belong to etc. > > > > I agree that problem is not necessarily in guest code. But by exiting > > synchronously, it gives enough information that one can use crash > > dump to get to bottom of the issue. If we exit to user space > > asynchronously, all this information will be lost and it might make > > it very hard to figure out (if not impossible), what's going on. > > If we want userspace to be able to do something useful, KVM should explicitly > inform userspace about the error, userspace shouldn't simply assume that > -EFAULT means a HVA->PFN lookup failed. I guess that's fine. But for this patch, user space is not doing anything. Its just printing error -EFAULT and dumping guest state (Same as we do in case of synchronous fault). > Userspace also shouldn't have to > query guest state to handle the error, as that won't work for protected guests > guests like SEV-ES and TDX. So qemu would not be able to dump vcpu register state when kvm returns with -EFAULT for the case of SEV-ES and TDX? > > I can think of two options: > > 1. Send a signal, a la kvm_send_hwpoison_signal(). This works because -EHWPOISON is a special kind of error which is different from -EFAULT. For truncation, even kvm gets -EFAULT. if (vm_fault & (VM_FAULT_HWPOISON | VM_FAULT_HWPOISON_LARGE)) return (foll_flags & FOLL_HWPOISON) ? -EHWPOISON : -EFAULT; if (vm_fault & (VM_FAULT_SIGBUS | VM_FAULT_SIGSEGV)) return -EFAULT; Anyway, if -EFAULT is too generic, and we need something finer grained, that can be looked into when we actually have a method where kvm/qemu injects error into guest. > > 2. Add a userspace exit reason along with a new entry in the run struct, > e.g. that provides the bad GPA, HVA, possibly permissions, etc... This sounds more reasonable to me. That is kvm gives additional information to qemu about failing HVA and GPA with -EFAULT and that can be helpful in debugging a problem. This seems like an extension of KVM API. Even with this, if we want to figure out which file got truncated, we will need to take a dump of guest and try to figure out which file this GPA is currently mapping(By looking at virtiofs data structures). And that becomes little easier if vcpu is running the task which accessed that GPA. 
Anyway, if we have the failing GPA, I think it should be possible to figure out the inode even without the accessing task being current on the vcpu.

So we seem to have 3 options.

A. Just exit to user space with -EFAULT (using kvm request) and don't wait for the accessing task to run on vcpu again.

B. Store the error gfn in a hash and exit to user space when the task accessing that gfn runs again.

C. Extend the KVM API and report the failing HVA/GFN access by the guest. That should allow not having to exit to user space synchronously.

Thanks
Vivek
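As an illustration of the GPA -> HVA -> vm_area_struct -> file chain referred to above (the one that lets you find the inode from a failing GPA), a rough sketch using existing kernel helpers is below. The function itself is hypothetical and the locking/refcounting is simplified; in practice this walk would be done by userspace or a debugging tool rather than by KVM:

/* Illustrative only: walk from a failing GPA to the backing file. */
static struct file *gpa_to_backing_file(struct kvm_vcpu *vcpu, gpa_t gpa)
{
	gfn_t gfn = gpa_to_gfn(gpa);
	struct kvm_memory_slot *slot = gfn_to_memslot(vcpu->kvm, gfn);
	struct vm_area_struct *vma;
	struct file *file = NULL;
	unsigned long hva;

	if (!slot)
		return NULL;				/* GPA not backed by any memslot */

	hva = gfn_to_hva_memslot(slot, gfn);		/* GPA -> HVA via the memslot */

	mmap_read_lock(vcpu->kvm->mm);
	vma = find_vma(vcpu->kvm->mm, hva);		/* HVA -> vm_area_struct */
	if (vma && vma->vm_start <= hva && vma->vm_file)
		file = get_file(vma->vm_file);		/* vm_area_struct -> file/inode */
	mmap_read_unlock(vcpu->kvm->mm);

	return file;
}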
Vivek Goyal <vgoyal@redhat.com> writes: > A. Just exit to user space with -EFAULT (using kvm request) and don't > wait for the accessing task to run on vcpu again. What if we also save the required information (RIP, GFN, ...) in the guest along with the APF token so in case of -EFAULT we can just 'crash' the guest and the required information can easily be obtained from kdump? This will solve the debugging problem even for TDX/SEV-ES (if kdump is possible there).
On Tue, Oct 06, 2020 at 04:05:16PM +0200, Vitaly Kuznetsov wrote: > Vivek Goyal <vgoyal@redhat.com> writes: > > > A. Just exit to user space with -EFAULT (using kvm request) and don't > > wait for the accessing task to run on vcpu again. > > What if we also save the required information (RIP, GFN, ...) in the > guest along with the APF token Can you elaborate a bit more on this. You mean save GFN on stack before it starts waiting for PAGE_READY event? > so in case of -EFAULT we can just 'crash' > the guest and the required information can easily be obtained from > kdump? This will solve the debugging problem even for TDX/SEV-ES (if > kdump is possible there). Just saving additional info in guest will not help because there might be many tasks waiting and you don't know which GFN is problematic one. Thanks Vivek
Vivek Goyal <vgoyal@redhat.com> writes: > On Tue, Oct 06, 2020 at 04:05:16PM +0200, Vitaly Kuznetsov wrote: >> Vivek Goyal <vgoyal@redhat.com> writes: >> >> > A. Just exit to user space with -EFAULT (using kvm request) and don't >> > wait for the accessing task to run on vcpu again. >> >> What if we also save the required information (RIP, GFN, ...) in the >> guest along with the APF token > > Can you elaborate a bit more on this. You mean save GFN on stack before > it starts waiting for PAGE_READY event? When PAGE_NOT_PRESENT event is injected as #PF (for now) in the guest kernel gets all the registers of the userspace process (except for CR2 which is replaced with a token). In case it is not trivial to extract accessed GFN from this data we can extend the shared APF structure and add it there, KVM has it when it queues APF. > >> so in case of -EFAULT we can just 'crash' >> the guest and the required information can easily be obtained from >> kdump? This will solve the debugging problem even for TDX/SEV-ES (if >> kdump is possible there). > > Just saving additional info in guest will not help because there might > be many tasks waiting and you don't know which GFN is problematic one. But KVM knows which token caused the -EFAULT when we exit to userspace (and we can pass this information to it) so to debug the situation you take this token and then explore the kdump searching for what's associated with this exact token.
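A minimal sketch of the guest-visible side of this suggestion. The existing fields are an abridged approximation of the shared APF structure of that time; the gfn field, its width and its placement are assumptions for illustration, not a proposed ABI:

/* Per-vCPU data shared between KVM and the guest (abridged approximation). */
struct kvm_vcpu_pv_apf_data {
	__u32 flags;	/* 'page not present' events, delivered via #PF for now */
	__u32 token;	/* 'page ready' events, delivered via interrupt */
	__u64 gfn;	/* hypothetical addition: the GFN the APF was queued for */
	__u8  pad[48];	/* shrunk so the overall size stays the same */
	__u32 enabled;
};

The guest could then record the token together with the GFN (and whatever register state it cares about) before putting the task to sleep, so a later kdump can be searched by token as described above.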
On Tue, Oct 06, 2020 at 04:50:44PM +0200, Vitaly Kuznetsov wrote: > Vivek Goyal <vgoyal@redhat.com> writes: > > > On Tue, Oct 06, 2020 at 04:05:16PM +0200, Vitaly Kuznetsov wrote: > >> Vivek Goyal <vgoyal@redhat.com> writes: > >> > >> > A. Just exit to user space with -EFAULT (using kvm request) and don't > >> > wait for the accessing task to run on vcpu again. > >> > >> What if we also save the required information (RIP, GFN, ...) in the > >> guest along with the APF token > > > > Can you elaborate a bit more on this. You mean save GFN on stack before > > it starts waiting for PAGE_READY event? > > When PAGE_NOT_PRESENT event is injected as #PF (for now) in the guest > kernel gets all the registers of the userspace process (except for CR2 > which is replaced with a token). In case it is not trivial to extract > accessed GFN from this data we can extend the shared APF structure and > add it there, KVM has it when it queues APF. > > > > >> so in case of -EFAULT we can just 'crash' > >> the guest and the required information can easily be obtained from > >> kdump? This will solve the debugging problem even for TDX/SEV-ES (if > >> kdump is possible there). > > > > Just saving additional info in guest will not help because there might > > be many tasks waiting and you don't know which GFN is problematic one. > > But KVM knows which token caused the -EFAULT when we exit to userspace > (and we can pass this information to it) so to debug the situation you > take this token and then explore the kdump searching for what's > associated with this exact token. So you will have to report token (along with -EFAULT) to user space. So this is basically the 3rd proposal which is extension of kvm API and will report say HVA/GFN also to user space along with -EFAULT. Thanks Vivek
Vivek Goyal <vgoyal@redhat.com> writes: > On Tue, Oct 06, 2020 at 04:50:44PM +0200, Vitaly Kuznetsov wrote: >> Vivek Goyal <vgoyal@redhat.com> writes: >> >> > On Tue, Oct 06, 2020 at 04:05:16PM +0200, Vitaly Kuznetsov wrote: >> >> Vivek Goyal <vgoyal@redhat.com> writes: >> >> >> >> > A. Just exit to user space with -EFAULT (using kvm request) and don't >> >> > wait for the accessing task to run on vcpu again. >> >> >> >> What if we also save the required information (RIP, GFN, ...) in the >> >> guest along with the APF token >> > >> > Can you elaborate a bit more on this. You mean save GFN on stack before >> > it starts waiting for PAGE_READY event? >> >> When PAGE_NOT_PRESENT event is injected as #PF (for now) in the guest >> kernel gets all the registers of the userspace process (except for CR2 >> which is replaced with a token). In case it is not trivial to extract >> accessed GFN from this data we can extend the shared APF structure and >> add it there, KVM has it when it queues APF. >> >> > >> >> so in case of -EFAULT we can just 'crash' >> >> the guest and the required information can easily be obtained from >> >> kdump? This will solve the debugging problem even for TDX/SEV-ES (if >> >> kdump is possible there). >> > >> > Just saving additional info in guest will not help because there might >> > be many tasks waiting and you don't know which GFN is problematic one. >> >> But KVM knows which token caused the -EFAULT when we exit to userspace >> (and we can pass this information to it) so to debug the situation you >> take this token and then explore the kdump searching for what's >> associated with this exact token. > > So you will have to report token (along with -EFAULT) to user space. So this > is basically the 3rd proposal which is extension of kvm API and will > report say HVA/GFN also to user space along with -EFAULT. > Right, I meant to say that guest kernel has full register state of the userspace process which caused APF to get queued and instead of trying to extract it in KVM and pass to userspace in case of a (later) failure we limit KVM api change to contain token or GFN only and somehow keep the rest in the guest. This should help with TDX/SEV-ES.
On Tue, Oct 06, 2020 at 05:24:54PM +0200, Vitaly Kuznetsov wrote: > Vivek Goyal <vgoyal@redhat.com> writes: > > So you will have to report token (along with -EFAULT) to user space. So this > > is basically the 3rd proposal which is extension of kvm API and will > > report say HVA/GFN also to user space along with -EFAULT. > > Right, I meant to say that guest kernel has full register state of the > userspace process which caused APF to get queued and instead of trying > to extract it in KVM and pass to userspace in case of a (later) failure > we limit KVM api change to contain token or GFN only and somehow keep > the rest in the guest. This should help with TDX/SEV-ES. Whatever gets reported to userspace should be identical with and without async page faults, i.e. it definitely shouldn't have token information. Note, TDX doesn't allow injection exceptions, so reflecting a #PF back into the guest is not an option. Nor do I think that's "correct" behavior (see everyone's objections to using #PF for APF fixed). I.e. the event should probably be an IRQ.
On Tue, Oct 06, 2020 at 09:12:00AM -0700, Sean Christopherson wrote: > On Tue, Oct 06, 2020 at 05:24:54PM +0200, Vitaly Kuznetsov wrote: > > Vivek Goyal <vgoyal@redhat.com> writes: > > > So you will have to report token (along with -EFAULT) to user space. So this > > > is basically the 3rd proposal which is extension of kvm API and will > > > report say HVA/GFN also to user space along with -EFAULT. > > > > Right, I meant to say that guest kernel has full register state of the > > userspace process which caused APF to get queued and instead of trying > > to extract it in KVM and pass to userspace in case of a (later) failure > > we limit KVM api change to contain token or GFN only and somehow keep > > the rest in the guest. This should help with TDX/SEV-ES. > > Whatever gets reported to userspace should be identical with and without > async page faults, i.e. it definitely shouldn't have token information. > > Note, TDX doesn't allow injection exceptions, so reflecting a #PF back > into the guest is not an option. Nor do I think that's "correct" > behavior (see everyone's objections to using #PF for APF fixed). I.e. the > event should probably be an IRQ.

I am not sure an IRQ for "page not present" works. Will it have conflicts/issues with other high-priority interrupts, which can get injected before the "page not present" event?

Vivek
Sean Christopherson <sean.j.christopherson@intel.com> writes: > On Tue, Oct 06, 2020 at 05:24:54PM +0200, Vitaly Kuznetsov wrote: >> Vivek Goyal <vgoyal@redhat.com> writes: >> > So you will have to report token (along with -EFAULT) to user space. So this >> > is basically the 3rd proposal which is extension of kvm API and will >> > report say HVA/GFN also to user space along with -EFAULT. >> >> Right, I meant to say that guest kernel has full register state of the >> userspace process which caused APF to get queued and instead of trying >> to extract it in KVM and pass to userspace in case of a (later) failure >> we limit KVM api change to contain token or GFN only and somehow keep >> the rest in the guest. This should help with TDX/SEV-ES. > > Whatever gets reported to userspace should be identical with and without > async page faults, i.e. it definitely shouldn't have token information.

Oh, right, when the error gets reported synchronously the guest's kernel is not yet aware of the issue, so it won't be possible to find anything in its kdump if userspace decides to crash it immediately. The register state (if available) will be accurate though.

> Note, TDX doesn't allow injection exceptions, so reflecting a #PF back > into the guest is not an option.

Not even #MC? So sad :-)

> Nor do I think that's "correct" behavior (see everyone's objections to > using #PF for APF fixed). I.e. the event should probably be an IRQ.

I recall Paolo objected against making the APF 'page not present' event into an interrupt, as it will require some very special handling to make sure it gets injected (and handled) immediately, but I'm not really sure how big the hack is going to be; maybe in the light of TDX/SEV-ES it's worth a try.
On Tue, Oct 06, 2020 at 06:39:56PM +0200, Vitaly Kuznetsov wrote: > Sean Christopherson <sean.j.christopherson@intel.com> writes: > > > On Tue, Oct 06, 2020 at 05:24:54PM +0200, Vitaly Kuznetsov wrote: > >> Vivek Goyal <vgoyal@redhat.com> writes: > >> > So you will have to report token (along with -EFAULT) to user space. So this > >> > is basically the 3rd proposal which is extension of kvm API and will > >> > report say HVA/GFN also to user space along with -EFAULT. > >> > >> Right, I meant to say that guest kernel has full register state of the > >> userspace process which caused APF to get queued and instead of trying > >> to extract it in KVM and pass to userspace in case of a (later) failure > >> we limit KVM api change to contain token or GFN only and somehow keep > >> the rest in the guest. This should help with TDX/SEV-ES. > > > > Whatever gets reported to userspace should be identical with and without > > async page faults, i.e. it definitely shouldn't have token information. > > > > Oh, right, when the error gets reported synchronously guest's kernel is > not yet aware of the issue so it won't be possible to find anything in > its kdump if userspace decides to crash it immediately. The register > state (if available) will be actual though. > > > Note, TDX doesn't allow injection exceptions, so reflecting a #PF back > > into the guest is not an option. > > Not even #MC? So sad :-) Heh, #MC isn't allowed either, yet... > > Nor do I think that's "correct" behavior (see everyone's objections to > > using #PF for APF fixed). I.e. the event should probably be an IRQ. > > I recall Paolo objected against making APF 'page not present' into in > interrupt as it will require some very special handling to make sure it > gets injected (and handled) immediately but I'm not really sure how big > the hack is going to be, maybe in the light of TDX/SEV-ES it's worth a > try. This shouldn't have anything to do with APF. Again, the event injection is needed even in the synchronous case as the file truncation in the host can affect existing mappings in the guest. I don't know that the mechanism needs to be virtiofs specific or if there can be a more generic "these PFNs have disappeared", but it's most definitely orthogonal to APF.
* Sean Christopherson (sean.j.christopherson@intel.com) wrote: > On Tue, Oct 06, 2020 at 06:39:56PM +0200, Vitaly Kuznetsov wrote: > > Sean Christopherson <sean.j.christopherson@intel.com> writes: > > > > > On Tue, Oct 06, 2020 at 05:24:54PM +0200, Vitaly Kuznetsov wrote: > > >> Vivek Goyal <vgoyal@redhat.com> writes: > > >> > So you will have to report token (along with -EFAULT) to user space. So this > > >> > is basically the 3rd proposal which is extension of kvm API and will > > >> > report say HVA/GFN also to user space along with -EFAULT. > > >> > > >> Right, I meant to say that guest kernel has full register state of the > > >> userspace process which caused APF to get queued and instead of trying > > >> to extract it in KVM and pass to userspace in case of a (later) failure > > >> we limit KVM api change to contain token or GFN only and somehow keep > > >> the rest in the guest. This should help with TDX/SEV-ES. > > > > > > Whatever gets reported to userspace should be identical with and without > > > async page faults, i.e. it definitely shouldn't have token information. > > > > > > > Oh, right, when the error gets reported synchronously guest's kernel is > > not yet aware of the issue so it won't be possible to find anything in > > its kdump if userspace decides to crash it immediately. The register > > state (if available) will be actual though. > > > > > Note, TDX doesn't allow injection exceptions, so reflecting a #PF back > > > into the guest is not an option. > > > > Not even #MC? So sad :-) > > Heh, #MC isn't allowed either, yet... > > > > Nor do I think that's "correct" behavior (see everyone's objections to > > > using #PF for APF fixed). I.e. the event should probably be an IRQ. > > > > I recall Paolo objected against making APF 'page not present' into in > > interrupt as it will require some very special handling to make sure it > > gets injected (and handled) immediately but I'm not really sure how big > > the hack is going to be, maybe in the light of TDX/SEV-ES it's worth a > > try. > > This shouldn't have anything to do with APF. Again, the event injection is > needed even in the synchronous case as the file truncation in the host can > affect existing mappings in the guest. > > I don't know that the mechanism needs to be virtiofs specific or if there can > be a more generic "these PFNs have disappeared", but it's most definitely > orthogonal to APF. There are other cases we get 'these PFNs have disappeared' other than virtiofs; the classic is when people back the guest using a tmpfs that then runs out of room. Dave > _______________________________________________ > Virtio-fs mailing list > Virtio-fs@redhat.com > https://www.redhat.com/mailman/listinfo/virtio-fs
On Tue, Oct 06, 2020 at 06:21:48PM +0100, Dr. David Alan Gilbert wrote: > * Sean Christopherson (sean.j.christopherson@intel.com) wrote: > > On Tue, Oct 06, 2020 at 06:39:56PM +0200, Vitaly Kuznetsov wrote: > > > Sean Christopherson <sean.j.christopherson@intel.com> writes: > > > > > > > On Tue, Oct 06, 2020 at 05:24:54PM +0200, Vitaly Kuznetsov wrote: > > > >> Vivek Goyal <vgoyal@redhat.com> writes: > > > >> > So you will have to report token (along with -EFAULT) to user space. So this > > > >> > is basically the 3rd proposal which is extension of kvm API and will > > > >> > report say HVA/GFN also to user space along with -EFAULT. > > > >> > > > >> Right, I meant to say that guest kernel has full register state of the > > > >> userspace process which caused APF to get queued and instead of trying > > > >> to extract it in KVM and pass to userspace in case of a (later) failure > > > >> we limit KVM api change to contain token or GFN only and somehow keep > > > >> the rest in the guest. This should help with TDX/SEV-ES. > > > > > > > > Whatever gets reported to userspace should be identical with and without > > > > async page faults, i.e. it definitely shouldn't have token information. > > > > > > > > > > Oh, right, when the error gets reported synchronously guest's kernel is > > > not yet aware of the issue so it won't be possible to find anything in > > > its kdump if userspace decides to crash it immediately. The register > > > state (if available) will be actual though. > > > > > > > Note, TDX doesn't allow injection exceptions, so reflecting a #PF back > > > > into the guest is not an option. > > > > > > Not even #MC? So sad :-) > > > > Heh, #MC isn't allowed either, yet... > > > > > > Nor do I think that's "correct" behavior (see everyone's objections to > > > > using #PF for APF fixed). I.e. the event should probably be an IRQ. > > > > > > I recall Paolo objected against making APF 'page not present' into in > > > interrupt as it will require some very special handling to make sure it > > > gets injected (and handled) immediately but I'm not really sure how big > > > the hack is going to be, maybe in the light of TDX/SEV-ES it's worth a > > > try. > > > > This shouldn't have anything to do with APF. Again, the event injection is > > needed even in the synchronous case as the file truncation in the host can > > affect existing mappings in the guest. > > > > I don't know that the mechanism needs to be virtiofs specific or if there can > > be a more generic "these PFNs have disappeared", but it's most definitely > > orthogonal to APF. > > There are other cases we get 'these PFNs have disappeared' other than > virtiofs; the classic is when people back the guest using a tmpfs that > then runs out of room. I also played with nvdimm driver where device was backed a file on host. If I truncate that file, we face similar issues. https://lore.kernel.org/kvm/20200616214847.24482-1-vgoyal@redhat.com/ I think any resource which can be backed by a file on host, can potentially run into this issue if file is truncated. (if guest can do load/store on these pages directly). Thanks Vivek
On Tue, Oct 06, 2020 at 10:17:04AM -0700, Sean Christopherson wrote:

[..]
> > > Note, TDX doesn't allow injection exceptions, so reflecting a #PF back
> > > into the guest is not an option.
> >
> > Not even #MC? So sad :-)
>
> Heh, #MC isn't allowed either, yet...

If #MC is not allowed, logic related to hwpoison memory will not work, as
that seems to inject #MC.

Vivek
On Tue, Oct 06, 2020 at 01:35:27PM -0400, Vivek Goyal wrote:
> On Tue, Oct 06, 2020 at 10:17:04AM -0700, Sean Christopherson wrote:
>
> [..]
> > > > Note, TDX doesn't allow injection exceptions, so reflecting a #PF back
> > > > into the guest is not an option.
> > >
> > > Not even #MC? So sad :-)
> >
> > Heh, #MC isn't allowed either, yet...
>
> If #MC is not allowed, logic related to hwpoison memory will not work
> as that seems to inject #MC.

Yep. Piggybacking #MC is undesirable for other reasons, e.g. adds a
"hardware" dependency for what is really a paravirt feature.
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index be5363b21540..e6f8d3f1a377 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -137,6 +137,7 @@ static inline gfn_t gfn_to_index(gfn_t gfn, gfn_t base_gfn, int level)
 #define KVM_NR_VAR_MTRR 8
 
 #define ASYNC_PF_PER_VCPU 64
+#define ERROR_GFN_PER_VCPU 64
 
 enum kvm_reg {
 	VCPU_REGS_RAX = __VCPU_REGS_RAX,
@@ -778,6 +779,7 @@ struct kvm_vcpu_arch {
 		unsigned long nested_apf_token;
 		bool delivery_as_pf_vmexit;
 		bool pageready_pending;
+		gfn_t error_gfns[ERROR_GFN_PER_VCPU];
 	} apf;
 
 	/* OSVW MSRs (AMD only) */
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index 444bb9c54548..d0a2a12c7bb6 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -60,7 +60,7 @@ void kvm_init_mmu(struct kvm_vcpu *vcpu, bool reset_roots);
 void kvm_init_shadow_mmu(struct kvm_vcpu *vcpu, u32 cr0, u32 cr4, u32 efer);
 void kvm_init_shadow_ept_mmu(struct kvm_vcpu *vcpu, bool execonly,
 			     bool accessed_dirty, gpa_t new_eptp);
-bool kvm_can_do_async_pf(struct kvm_vcpu *vcpu);
+bool kvm_can_do_async_pf(struct kvm_vcpu *vcpu, gfn_t gfn);
 int kvm_handle_page_fault(struct kvm_vcpu *vcpu, u64 error_code,
 			  u64 fault_address, char *insn, int insn_len);
 
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 6d6a0ae7800c..b51d4aa405e0 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4078,7 +4078,7 @@ static bool try_async_pf(struct kvm_vcpu *vcpu, bool prefault, gfn_t gfn,
 	if (!async)
 		return false; /* *pfn has correct page already */
 
-	if (!prefault && kvm_can_do_async_pf(vcpu)) {
+	if (!prefault && kvm_can_do_async_pf(vcpu, gpa_to_gfn(cr2_or_gpa))) {
 		trace_kvm_try_async_get_page(cr2_or_gpa, gfn);
 		if (kvm_find_async_pf_gfn(vcpu, gfn)) {
 			trace_kvm_async_pf_doublefault(cr2_or_gpa, gfn);
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 88c593f83b28..c1f5094d6e53 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -263,6 +263,13 @@ static inline void kvm_async_pf_hash_reset(struct kvm_vcpu *vcpu)
 		vcpu->arch.apf.gfns[i] = ~0;
 }
 
+static inline void kvm_error_gfn_hash_reset(struct kvm_vcpu *vcpu)
+{
+	int i;
+	for (i = 0; i < ERROR_GFN_PER_VCPU; i++)
+		vcpu->arch.apf.error_gfns[i] = GFN_INVALID;
+}
+
 static void kvm_on_user_return(struct user_return_notifier *urn)
 {
 	unsigned slot;
@@ -9484,6 +9491,7 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
 	vcpu->arch.pat = MSR_IA32_CR_PAT_DEFAULT;
 
 	kvm_async_pf_hash_reset(vcpu);
+	kvm_error_gfn_hash_reset(vcpu);
 	kvm_pmu_init(vcpu);
 
 	vcpu->arch.pending_external_vector = -1;
@@ -9608,6 +9616,7 @@ void kvm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
 
 	kvm_clear_async_pf_completion_queue(vcpu);
 	kvm_async_pf_hash_reset(vcpu);
+	kvm_error_gfn_hash_reset(vcpu);
 	vcpu->arch.apf.halted = false;
 
 	if (kvm_mpx_supported()) {
@@ -10369,6 +10378,36 @@ void kvm_set_rflags(struct kvm_vcpu *vcpu, unsigned long rflags)
 }
 EXPORT_SYMBOL_GPL(kvm_set_rflags);
 
+static inline u32 kvm_error_gfn_hash_fn(gfn_t gfn)
+{
+	BUILD_BUG_ON(!is_power_of_2(ERROR_GFN_PER_VCPU));
+
+	return hash_32(gfn & 0xffffffff, order_base_2(ERROR_GFN_PER_VCPU));
+}
+
+static void kvm_add_error_gfn(struct kvm_vcpu *vcpu, gfn_t gfn)
+{
+	u32 key = kvm_error_gfn_hash_fn(gfn);
+
+	/*
+	 * Overwrite the previous gfn. This is just a hint to do
+	 * sync page fault.
+	 */
+	vcpu->arch.apf.error_gfns[key] = gfn;
+}
+
+/* Returns true if gfn was found in hash table, false otherwise */
+static bool kvm_find_and_remove_error_gfn(struct kvm_vcpu *vcpu, gfn_t gfn)
+{
+	u32 key = kvm_error_gfn_hash_fn(gfn);
+
+	if (vcpu->arch.apf.error_gfns[key] != gfn)
+		return 0;
+
+	vcpu->arch.apf.error_gfns[key] = GFN_INVALID;
+	return true;
+}
+
 void kvm_arch_async_page_ready(struct kvm_vcpu *vcpu, struct kvm_async_pf *work)
 {
 	int r;
@@ -10385,7 +10424,9 @@ void kvm_arch_async_page_ready(struct kvm_vcpu *vcpu, struct kvm_async_pf *work)
 	    work->arch.cr3 != vcpu->arch.mmu->get_guest_pgd(vcpu))
 		return;
 
-	kvm_mmu_do_page_fault(vcpu, work->cr2_or_gpa, 0, true);
+	r = kvm_mmu_do_page_fault(vcpu, work->cr2_or_gpa, 0, true);
+	if (r < 0)
+		kvm_add_error_gfn(vcpu, gpa_to_gfn(work->cr2_or_gpa));
 }
 
 static inline u32 kvm_async_pf_hash_fn(gfn_t gfn)
@@ -10495,7 +10536,7 @@ static bool kvm_can_deliver_async_pf(struct kvm_vcpu *vcpu)
 	return true;
 }
 
-bool kvm_can_do_async_pf(struct kvm_vcpu *vcpu)
+bool kvm_can_do_async_pf(struct kvm_vcpu *vcpu, gfn_t gfn)
 {
 	if (unlikely(!lapic_in_kernel(vcpu) ||
 		     kvm_event_needs_reinjection(vcpu) ||
@@ -10509,7 +10550,14 @@ bool kvm_can_do_async_pf(struct kvm_vcpu *vcpu)
 	 * If interrupts are off we cannot even use an artificial
 	 * halt state.
 	 */
-	return kvm_arch_interrupt_allowed(vcpu);
+	if (!kvm_arch_interrupt_allowed(vcpu))
+		return false;
+
+	/* Found gfn in error gfn cache. Force sync fault */
+	if (kvm_find_and_remove_error_gfn(vcpu, gfn))
+		return false;
+
+	return true;
 }
 
 bool kvm_arch_async_page_not_present(struct kvm_vcpu *vcpu,
diff --git a/include/linux/kvm_types.h b/include/linux/kvm_types.h
index 68e84cf42a3f..677bb8269cd3 100644
--- a/include/linux/kvm_types.h
+++ b/include/linux/kvm_types.h
@@ -36,6 +36,7 @@ typedef u64 gpa_t;
 typedef u64 gfn_t;
 
 #define GPA_INVALID	(~(gpa_t)0)
+#define GFN_INVALID	(~(gfn_t)0)
 
 typedef unsigned long hva_t;
 typedef u64 hpa_t;
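[Editor's note: the error-gfn bookkeeping added above is a small direct-mapped
table that is simply overwritten on collision. The following userspace sketch
is an approximation, not kernel code: the multiplicative hash only stands in
for the kernel's hash_32(), and everything outside the patch's identifiers is
invented. It shows the add / find-and-remove behaviour that
kvm_can_do_async_pf() relies on.]

/* Standalone sketch of the patch's direct-mapped error-gfn cache. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define ERROR_GFN_PER_VCPU 64                 /* must be a power of 2 */
#define GFN_INVALID (~(uint64_t)0)

typedef uint64_t gfn_t;

static gfn_t error_gfns[ERROR_GFN_PER_VCPU];

/* Rough stand-in for hash_32(gfn, order_base_2(ERROR_GFN_PER_VCPU)). */
static uint32_t error_gfn_hash_fn(gfn_t gfn)
{
        return ((uint32_t)gfn * 0x61C88647u) >> (32 - 6);  /* 2^6 == 64 slots */
}

/* Overwrite whatever was in the slot: the cache is only a hint. */
static void add_error_gfn(gfn_t gfn)
{
        error_gfns[error_gfn_hash_fn(gfn)] = gfn;
}

/* Returns true (and clears the slot) iff exactly this gfn is cached. */
static bool find_and_remove_error_gfn(gfn_t gfn)
{
        uint32_t key = error_gfn_hash_fn(gfn);

        if (error_gfns[key] != gfn)
                return false;

        error_gfns[key] = GFN_INVALID;
        return true;
}

int main(void)
{
        for (int i = 0; i < ERROR_GFN_PER_VCPU; i++)
                error_gfns[i] = GFN_INVALID;

        add_error_gfn(0x1234);                              /* async path hit an error   */
        printf("%d\n", find_and_remove_error_gfn(0x1234));  /* 1: force a sync fault     */
        printf("%d\n", find_and_remove_error_gfn(0x1234));  /* 0: hint already consumed  */
        return 0;
}

Because a colliding insert overwrites the slot, a lookup can miss an older
error gfn; as the changelog notes, that only delays the forced synchronous
fault until the gfn is recorded again on the next failing retry.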
Page fault error handling behavior in kvm seems a little inconsistent when a
page fault reports an error. If we are doing the fault synchronously, then we
capture the error (-EFAULT) returned by __gfn_to_pfn_memslot(), exit to user
space, and qemu reports the error: "error: kvm run failed Bad address".

But if we are doing an async page fault, then async_pf_execute() will simply
ignore the error reported by get_user_pages_remote() or by
kvm_mmu_do_page_fault(). It is assumed that the page fault was successful and
either a page ready event is injected into the guest or the guest is brought
out of the artificial halt state and run again. In both cases, when the guest
retries the instruction, it takes an exit again because the page fault was not
successful in the previous attempt. And then this infinite loop continues
forever.

Trying the fault in a loop would make sense if the error were temporary and
would be resolved on retry. But I don't see any intention in the code to
determine whether the error is temporary or not. Whether to do the fault
synchronously or asynchronously depends on many variables, but none of those
variables is whether the error is temporary or not (kvm_can_do_async_pf()).

That makes it very inconsistent and unpredictable to figure out whether kvm
will exit to qemu with an error or just retry and go into an infinite loop.

This patch tries to make this behavior consistent. That is, instead of
getting into an infinite loop of retrying the page fault, exit to user space
and stop the VM if a page fault error happens.

In the future this can be improved by injecting errors into the guest. As of
now we don't have any race-free method to inject errors into the guest.

When a page fault error happens in the async path, save that gfn, and when
the guest retries, do a sync fault instead of an async fault. So if the error
is encountered again, we exit to qemu and avoid the infinite loop.

We maintain a cache of error gfns and force a sync fault if a gfn is found in
the cache of error gfns. There is a small possibility that we miss an error
gfn (as it got overwritten by a new error gfn). But it's just a hint, and
sooner or later some error gfn will match and we will force a sync fault and
exit to user space.

Changes from v3:
- Added function kvm_find_and_remove_error_gfn() and removed
  kvm_find_error_gfn() and kvm_del_error_gfn(). (Vitaly)

- Added a macro GFN_INVALID. (Vitaly)

- Used gpa_to_gfn() to convert gpa to gfn. (Vitaly)

Changes from v2:
- Fixed a warning by making kvm_find_error_gfn() static.

Changes from v1:
- Maintain a cache of error gfns, instead of a single gfn. (Vitaly)

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 arch/x86/include/asm/kvm_host.h |  2 ++
 arch/x86/kvm/mmu.h              |  2 +-
 arch/x86/kvm/mmu/mmu.c          |  2 +-
 arch/x86/kvm/x86.c              | 54 +++++++++++++++++++++++++++++++--
 include/linux/kvm_types.h       |  1 +
 5 files changed, 56 insertions(+), 5 deletions(-)
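[Editor's note: to make the retry behaviour described above concrete, here is
a small self-contained simulation. All names are invented and a single slot
stands in for the 64-entry cache: the first failing fault goes down the async
path and only records the gfn, while the guest's retry is refused async
treatment, takes the synchronous path, and returns -EFAULT, so the vCPU exits
to user space instead of looping.]

/* Toy model of the sync-vs-async decision after this patch (not kernel code). */
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>

static unsigned long long error_gfn = ~0ULL;   /* single-slot stand-in for the cache */

static bool can_do_async_pf(unsigned long long gfn)
{
        if (error_gfn == gfn) {                /* hint found: consume it, force sync */
                error_gfn = ~0ULL;
                return false;
        }
        return true;
}

/* Returns 0 when the guest is simply re-run, -EFAULT when KVM exits to userspace. */
static int handle_fault(unsigned long long gfn, bool backing_gone)
{
        if (can_do_async_pf(gfn)) {
                if (backing_gone)
                        error_gfn = gfn;       /* async worker saw the error: remember gfn */
                return 0;                      /* guest retries the instruction            */
        }
        return backing_gone ? -EFAULT : 0;     /* sync path: error reaches userspace       */
}

int main(void)
{
        printf("first fault:  %d\n", handle_fault(0x42, true));  /* 0   -> retried  */
        printf("second fault: %d\n", handle_fault(0x42, true));  /* -14 -> VM stops */
        return 0;
}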