
[v4] kvm,x86: Exit to user space in case page fault error

Message ID 20200720211359.GF502563@redhat.com (mailing list archive)
State New, archived
Series [v4] kvm,x86: Exit to user space in case page fault error

Commit Message

Vivek Goyal July 20, 2020, 9:13 p.m. UTC
Page fault error handling behavior in KVM is somewhat inconsistent when a
page fault reports an error. If we are handling the fault synchronously,
we capture the error (-EFAULT) returned by __gfn_to_pfn_memslot(), exit to
user space, and qemu reports "error: kvm run failed Bad address".

But if we are doing an async page fault, async_pf_execute() will simply
ignore the error reported by get_user_pages_remote() or by
kvm_mmu_do_page_fault(). The page fault is assumed to have succeeded, and
either a "page ready" event is injected into the guest or the guest is
brought out of the artificial halt state and run again. In both cases, when
the guest retries the instruction it takes an exit again because the page
fault did not succeed in the previous attempt, and this infinite loop
continues forever.

Retrying the fault in a loop would make sense if the error were temporary
and would be resolved on retry. But I don't see any intention in the code to
determine whether the error is temporary or not. Whether the fault is done
synchronously or asynchronously depends on many variables, but none of those
variables is whether the error is temporary (kvm_can_do_async_pf()).

That makes it very inconsistent and unpredictable whether kvm will exit to
qemu with an error or just retry and go into an infinite loop.

This patch tries to make this behavior consistent: instead of getting into
an infinite loop of retrying the page fault, exit to user space and stop the
VM if a page fault error happens.

In the future this can be improved by injecting errors into the guest. As of
now we don't have any race-free method to inject errors into the guest.

When a page fault error happens in the async path, save that gfn, and when
the guest retries next time, do a sync fault instead of an async fault. That
way, if the error is encountered again, we exit to qemu and avoid the
infinite loop.

We maintain a cache of error gfns and force a sync fault if a gfn is found
in the error gfn cache. There is a small possibility that we miss an error
gfn (because it got overwritten by a new error gfn), but it is just a hint;
sooner or later some error gfn will match, and we will force a sync fault
and exit to user space.
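
In rough pseudocode, the flow this patch implements (using the helpers
added below, kvm_add_error_gfn() and kvm_find_and_remove_error_gfn()) is:

	/* async completion path: the fault failed, remember the gfn as a hint */
	r = kvm_mmu_do_page_fault(vcpu, work->cr2_or_gpa, 0, true);
	if (r < 0)
		kvm_add_error_gfn(vcpu, gpa_to_gfn(work->cr2_or_gpa));

	/* next fault on that gfn: refuse async pf so the fault is retried
	 * synchronously and the -EFAULT propagates to user space */
	if (kvm_find_and_remove_error_gfn(vcpu, gfn))
		return false;	/* in kvm_can_do_async_pf() */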

Changes from v3:
- Added function kvm_find_and_remove_error_gfn() and removed
  kvm_find_error_gfn() and kvm_del_error_gfn(). (Vitaly)

- Added a macro GFN_INVALID (Vitaly).

- Used gpa_to_gfn() to convert gpa to gfn (Vitaly)

Change from v2:
- Fixed a warning by converting kvm_find_error_gfn() static.

Change from v1:
- Maintain a cache of error gfns, instead of single gfn. (Vitaly)

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 arch/x86/include/asm/kvm_host.h |  2 ++
 arch/x86/kvm/mmu.h              |  2 +-
 arch/x86/kvm/mmu/mmu.c          |  2 +-
 arch/x86/kvm/x86.c              | 54 +++++++++++++++++++++++++++++++--
 include/linux/kvm_types.h       |  1 +
 5 files changed, 56 insertions(+), 5 deletions(-)

Comments

Vivek Goyal July 27, 2020, 1:56 p.m. UTC | #1
On Mon, Jul 20, 2020 at 05:13:59PM -0400, Vivek Goyal wrote:
> Page fault error handling behavior in kvm seems little inconsistent when
> page fault reports error. If we are doing fault synchronously
> then we capture error (-EFAULT) returned by __gfn_to_pfn_memslot() and
> exit to user space and qemu reports error, "error: kvm run failed Bad address".

Hi Vitaly,

A gentle reminder. How does this patch look now?

Thanks
Vivek

> 
> But if we are doing async page fault, then async_pf_execute() will simply
> ignore the error reported by get_user_pages_remote() or
> by kvm_mmu_do_page_fault(). It is assumed that page fault was successful
> and either a page ready event is injected in guest or guest is brought
> out of artificial halt state and run again. In both the cases when guest
> retries the instruction, it takes exit again as page fault was not
> successful in previous attempt. And then this infinite loop continues
> forever.
> 
> Trying fault in a loop will make sense if error is temporary and will
> be resolved on retry. But I don't see any intention in the code to
> determine if error is temporary or not.  Whether to do fault synchronously
> or asynchronously, depends on so many variables but none of the varibales
> is whether error is temporary or not. (kvm_can_do_async_pf()).
> 
> And that makes it very inconsistent or unpredictable to figure out whether
> kvm will exit to qemu with error or it will just retry and go into an
> infinite loop.
> 
> This patch tries to make this behavior consistent. That is instead of
> getting into infinite loop of retrying page fault, exit to user space
> and stop VM if page fault error happens.
> 
> In future this can be improved by injecting errors into guest. As of
> now we don't have any race free method to inject errors in guest.
> 
> When page fault error happens in async path save that pfn and when next
> time guest retries, do a sync fault instead of async fault. So that if error
> is encountered, we exit to qemu and avoid infinite loop.
> 
> We maintain a cache of error gfns and force sync fault if a gfn is
> found in cache of error gfn. There is a small possibility that we
> miss an error gfn (as it got overwritten by a new error gfn). But
> its just a hint and sooner or later some error pfn will match
> and we will force sync fault and exit to user space.
> 
> Changes from v3:
> - Added function kvm_find_and_remove_error_gfn() and removed
>   kvm_find_error_gfn() and kvm_del_error_gfn(). (Vitaly)
> 
> - Added a macro GFN_INVALID (Vitaly).
> 
> - Used gpa_to_gfn() to convert gpa to gfn (Vitaly)
> 
> Change from v2:
> - Fixed a warning by converting kvm_find_error_gfn() static.
> 
> Change from v1:
> - Maintain a cache of error gfns, instead of single gfn. (Vitaly)
> 
> Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
> ---
>  arch/x86/include/asm/kvm_host.h |  2 ++
>  arch/x86/kvm/mmu.h              |  2 +-
>  arch/x86/kvm/mmu/mmu.c          |  2 +-
>  arch/x86/kvm/x86.c              | 54 +++++++++++++++++++++++++++++++--
>  include/linux/kvm_types.h       |  1 +
>  5 files changed, 56 insertions(+), 5 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index be5363b21540..e6f8d3f1a377 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -137,6 +137,7 @@ static inline gfn_t gfn_to_index(gfn_t gfn, gfn_t base_gfn, int level)
>  #define KVM_NR_VAR_MTRR 8
>  
>  #define ASYNC_PF_PER_VCPU 64
> +#define ERROR_GFN_PER_VCPU 64
>  
>  enum kvm_reg {
>  	VCPU_REGS_RAX = __VCPU_REGS_RAX,
> @@ -778,6 +779,7 @@ struct kvm_vcpu_arch {
>  		unsigned long nested_apf_token;
>  		bool delivery_as_pf_vmexit;
>  		bool pageready_pending;
> +		gfn_t error_gfns[ERROR_GFN_PER_VCPU];
>  	} apf;
>  
>  	/* OSVW MSRs (AMD only) */
> diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
> index 444bb9c54548..d0a2a12c7bb6 100644
> --- a/arch/x86/kvm/mmu.h
> +++ b/arch/x86/kvm/mmu.h
> @@ -60,7 +60,7 @@ void kvm_init_mmu(struct kvm_vcpu *vcpu, bool reset_roots);
>  void kvm_init_shadow_mmu(struct kvm_vcpu *vcpu, u32 cr0, u32 cr4, u32 efer);
>  void kvm_init_shadow_ept_mmu(struct kvm_vcpu *vcpu, bool execonly,
>  			     bool accessed_dirty, gpa_t new_eptp);
> -bool kvm_can_do_async_pf(struct kvm_vcpu *vcpu);
> +bool kvm_can_do_async_pf(struct kvm_vcpu *vcpu, gfn_t gfn);
>  int kvm_handle_page_fault(struct kvm_vcpu *vcpu, u64 error_code,
>  				u64 fault_address, char *insn, int insn_len);
>  
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 6d6a0ae7800c..b51d4aa405e0 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -4078,7 +4078,7 @@ static bool try_async_pf(struct kvm_vcpu *vcpu, bool prefault, gfn_t gfn,
>  	if (!async)
>  		return false; /* *pfn has correct page already */
>  
> -	if (!prefault && kvm_can_do_async_pf(vcpu)) {
> +	if (!prefault && kvm_can_do_async_pf(vcpu, gpa_to_gfn(cr2_or_gpa))) {
>  		trace_kvm_try_async_get_page(cr2_or_gpa, gfn);
>  		if (kvm_find_async_pf_gfn(vcpu, gfn)) {
>  			trace_kvm_async_pf_doublefault(cr2_or_gpa, gfn);
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 88c593f83b28..c1f5094d6e53 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -263,6 +263,13 @@ static inline void kvm_async_pf_hash_reset(struct kvm_vcpu *vcpu)
>  		vcpu->arch.apf.gfns[i] = ~0;
>  }
>  
> +static inline void kvm_error_gfn_hash_reset(struct kvm_vcpu *vcpu)
> +{
> +	int i;
> +	for (i = 0; i < ERROR_GFN_PER_VCPU; i++)
> +		vcpu->arch.apf.error_gfns[i] = GFN_INVALID;
> +}
> +
>  static void kvm_on_user_return(struct user_return_notifier *urn)
>  {
>  	unsigned slot;
> @@ -9484,6 +9491,7 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
>  	vcpu->arch.pat = MSR_IA32_CR_PAT_DEFAULT;
>  
>  	kvm_async_pf_hash_reset(vcpu);
> +	kvm_error_gfn_hash_reset(vcpu);
>  	kvm_pmu_init(vcpu);
>  
>  	vcpu->arch.pending_external_vector = -1;
> @@ -9608,6 +9616,7 @@ void kvm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
>  
>  	kvm_clear_async_pf_completion_queue(vcpu);
>  	kvm_async_pf_hash_reset(vcpu);
> +	kvm_error_gfn_hash_reset(vcpu);
>  	vcpu->arch.apf.halted = false;
>  
>  	if (kvm_mpx_supported()) {
> @@ -10369,6 +10378,36 @@ void kvm_set_rflags(struct kvm_vcpu *vcpu, unsigned long rflags)
>  }
>  EXPORT_SYMBOL_GPL(kvm_set_rflags);
>  
> +static inline u32 kvm_error_gfn_hash_fn(gfn_t gfn)
> +{
> +	BUILD_BUG_ON(!is_power_of_2(ERROR_GFN_PER_VCPU));
> +
> +	return hash_32(gfn & 0xffffffff, order_base_2(ERROR_GFN_PER_VCPU));
> +}
> +
> +static void kvm_add_error_gfn(struct kvm_vcpu *vcpu, gfn_t gfn)
> +{
> +	u32 key = kvm_error_gfn_hash_fn(gfn);
> +
> +	/*
> +	 * Overwrite the previous gfn. This is just a hint to do
> +	 * sync page fault.
> +	 */
> +	vcpu->arch.apf.error_gfns[key] = gfn;
> +}
> +
> +/* Returns true if gfn was found in hash table, false otherwise */
> +static bool kvm_find_and_remove_error_gfn(struct kvm_vcpu *vcpu, gfn_t gfn)
> +{
> +	u32 key = kvm_error_gfn_hash_fn(gfn);
> +
> +	if (vcpu->arch.apf.error_gfns[key] != gfn)
> +		return 0;
> +
> +	vcpu->arch.apf.error_gfns[key] = GFN_INVALID;
> +	return true;
> +}
> +
>  void kvm_arch_async_page_ready(struct kvm_vcpu *vcpu, struct kvm_async_pf *work)
>  {
>  	int r;
> @@ -10385,7 +10424,9 @@ void kvm_arch_async_page_ready(struct kvm_vcpu *vcpu, struct kvm_async_pf *work)
>  	      work->arch.cr3 != vcpu->arch.mmu->get_guest_pgd(vcpu))
>  		return;
>  
> -	kvm_mmu_do_page_fault(vcpu, work->cr2_or_gpa, 0, true);
> +	r = kvm_mmu_do_page_fault(vcpu, work->cr2_or_gpa, 0, true);
> +	if (r < 0)
> +		kvm_add_error_gfn(vcpu, gpa_to_gfn(work->cr2_or_gpa));
>  }
>  
>  static inline u32 kvm_async_pf_hash_fn(gfn_t gfn)
> @@ -10495,7 +10536,7 @@ static bool kvm_can_deliver_async_pf(struct kvm_vcpu *vcpu)
>  	return true;
>  }
>  
> -bool kvm_can_do_async_pf(struct kvm_vcpu *vcpu)
> +bool kvm_can_do_async_pf(struct kvm_vcpu *vcpu, gfn_t gfn)
>  {
>  	if (unlikely(!lapic_in_kernel(vcpu) ||
>  		     kvm_event_needs_reinjection(vcpu) ||
> @@ -10509,7 +10550,14 @@ bool kvm_can_do_async_pf(struct kvm_vcpu *vcpu)
>  	 * If interrupts are off we cannot even use an artificial
>  	 * halt state.
>  	 */
> -	return kvm_arch_interrupt_allowed(vcpu);
> +	if (!kvm_arch_interrupt_allowed(vcpu))
> +		return false;
> +
> +	/* Found gfn in error gfn cache. Force sync fault */
> +	if (kvm_find_and_remove_error_gfn(vcpu, gfn))
> +		return false;
> +
> +	return true;
>  }
>  
>  bool kvm_arch_async_page_not_present(struct kvm_vcpu *vcpu,
> diff --git a/include/linux/kvm_types.h b/include/linux/kvm_types.h
> index 68e84cf42a3f..677bb8269cd3 100644
> --- a/include/linux/kvm_types.h
> +++ b/include/linux/kvm_types.h
> @@ -36,6 +36,7 @@ typedef u64            gpa_t;
>  typedef u64            gfn_t;
>  
>  #define GPA_INVALID	(~(gpa_t)0)
> +#define GFN_INVALID	(~(gfn_t)0)
>  
>  typedef unsigned long  hva_t;
>  typedef u64            hpa_t;
> -- 
> 2.25.4
>
Vitaly Kuznetsov July 27, 2020, 4:09 p.m. UTC | #2
Vivek Goyal <vgoyal@redhat.com> writes:

> On Mon, Jul 20, 2020 at 05:13:59PM -0400, Vivek Goyal wrote:
>> Page fault error handling behavior in kvm seems little inconsistent when
>> page fault reports error. If we are doing fault synchronously
>> then we capture error (-EFAULT) returned by __gfn_to_pfn_memslot() and
>> exit to user space and qemu reports error, "error: kvm run failed Bad address".
>
> Hi Vitaly,
>
> A gentle reminder. How does this patch look now?
>

Sorry, I even reviewed it but never replied. It looks good to me!

Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Vivek Goyal July 27, 2020, 6:40 p.m. UTC | #3
On Mon, Jul 27, 2020 at 06:09:32PM +0200, Vitaly Kuznetsov wrote:
> Vivek Goyal <vgoyal@redhat.com> writes:
> 
> > On Mon, Jul 20, 2020 at 05:13:59PM -0400, Vivek Goyal wrote:
> >> Page fault error handling behavior in kvm seems little inconsistent when
> >> page fault reports error. If we are doing fault synchronously
> >> then we capture error (-EFAULT) returned by __gfn_to_pfn_memslot() and
> >> exit to user space and qemu reports error, "error: kvm run failed Bad address".
> >
> > Hi Vitaly,
> >
> > A gentle reminder. How does this patch look now?
> >
> 
> Sorry, I even reviewd it but never replied. It looks good to me!
> 
> Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>

Thanks Vitaly.

Paolo, what do you think about this patch? Do you have any concerns with
it? Can this be merged?

Thanks
Vivek
Pankaj Gupta July 30, 2020, 5:01 a.m. UTC | #4
> Page fault error handling behavior in kvm seems little inconsistent when
> page fault reports error. If we are doing fault synchronously
> then we capture error (-EFAULT) returned by __gfn_to_pfn_memslot() and
> exit to user space and qemu reports error, "error: kvm run failed Bad address".
>
> But if we are doing async page fault, then async_pf_execute() will simply
> ignore the error reported by get_user_pages_remote() or
> by kvm_mmu_do_page_fault(). It is assumed that page fault was successful
> and either a page ready event is injected in guest or guest is brought
> out of artificial halt state and run again. In both the cases when guest
> retries the instruction, it takes exit again as page fault was not
> successful in previous attempt. And then this infinite loop continues
> forever.
>
> Trying fault in a loop will make sense if error is temporary and will
> be resolved on retry. But I don't see any intention in the code to
> determine if error is temporary or not.  Whether to do fault synchronously
> or asynchronously, depends on so many variables but none of the varibales
s/varibales/variables
> is whether error is temporary or not. (kvm_can_do_async_pf()).
>
> And that makes it very inconsistent or unpredictable to figure out whether
> kvm will exit to qemu with error or it will just retry and go into an
> infinite loop.
>
> This patch tries to make this behavior consistent. That is instead of
> getting into infinite loop of retrying page fault, exit to user space
> and stop VM if page fault error happens.
>
> In future this can be improved by injecting errors into guest. As of
> now we don't have any race free method to inject errors in guest.
>
> When page fault error happens in async path save that pfn and when next
> time guest retries, do a sync fault instead of async fault. So that if error
> is encountered, we exit to qemu and avoid infinite loop.
>
> We maintain a cache of error gfns and force sync fault if a gfn is
> found in cache of error gfn. There is a small possibility that we
> miss an error gfn (as it got overwritten by a new error gfn). But
> its just a hint and sooner or later some error pfn will match
> and we will force sync fault and exit to user space.
>
> Changes from v3:
> - Added function kvm_find_and_remove_error_gfn() and removed
>   kvm_find_error_gfn() and kvm_del_error_gfn(). (Vitaly)
>
> - Added a macro GFN_INVALID (Vitaly).
>
> - Used gpa_to_gfn() to convert gpa to gfn (Vitaly)
>
> Change from v2:
> - Fixed a warning by converting kvm_find_error_gfn() static.
>
> Change from v1:
> - Maintain a cache of error gfns, instead of single gfn. (Vitaly)
>
> Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
> ---
>  arch/x86/include/asm/kvm_host.h |  2 ++
>  arch/x86/kvm/mmu.h              |  2 +-
>  arch/x86/kvm/mmu/mmu.c          |  2 +-
>  arch/x86/kvm/x86.c              | 54 +++++++++++++++++++++++++++++++--
>  include/linux/kvm_types.h       |  1 +
>  5 files changed, 56 insertions(+), 5 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index be5363b21540..e6f8d3f1a377 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -137,6 +137,7 @@ static inline gfn_t gfn_to_index(gfn_t gfn, gfn_t base_gfn, int level)
>  #define KVM_NR_VAR_MTRR 8
>
>  #define ASYNC_PF_PER_VCPU 64
> +#define ERROR_GFN_PER_VCPU 64
>
>  enum kvm_reg {
>         VCPU_REGS_RAX = __VCPU_REGS_RAX,
> @@ -778,6 +779,7 @@ struct kvm_vcpu_arch {
>                 unsigned long nested_apf_token;
>                 bool delivery_as_pf_vmexit;
>                 bool pageready_pending;
> +               gfn_t error_gfns[ERROR_GFN_PER_VCPU];
>         } apf;
>
>         /* OSVW MSRs (AMD only) */
> diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
> index 444bb9c54548..d0a2a12c7bb6 100644
> --- a/arch/x86/kvm/mmu.h
> +++ b/arch/x86/kvm/mmu.h
> @@ -60,7 +60,7 @@ void kvm_init_mmu(struct kvm_vcpu *vcpu, bool reset_roots);
>  void kvm_init_shadow_mmu(struct kvm_vcpu *vcpu, u32 cr0, u32 cr4, u32 efer);
>  void kvm_init_shadow_ept_mmu(struct kvm_vcpu *vcpu, bool execonly,
>                              bool accessed_dirty, gpa_t new_eptp);
> -bool kvm_can_do_async_pf(struct kvm_vcpu *vcpu);
> +bool kvm_can_do_async_pf(struct kvm_vcpu *vcpu, gfn_t gfn);
>  int kvm_handle_page_fault(struct kvm_vcpu *vcpu, u64 error_code,
>                                 u64 fault_address, char *insn, int insn_len);
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 6d6a0ae7800c..b51d4aa405e0 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -4078,7 +4078,7 @@ static bool try_async_pf(struct kvm_vcpu *vcpu, bool prefault, gfn_t gfn,
>         if (!async)
>                 return false; /* *pfn has correct page already */
>
> -       if (!prefault && kvm_can_do_async_pf(vcpu)) {
> +       if (!prefault && kvm_can_do_async_pf(vcpu, gpa_to_gfn(cr2_or_gpa))) {
>                 trace_kvm_try_async_get_page(cr2_or_gpa, gfn);
>                 if (kvm_find_async_pf_gfn(vcpu, gfn)) {
>                         trace_kvm_async_pf_doublefault(cr2_or_gpa, gfn);
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 88c593f83b28..c1f5094d6e53 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -263,6 +263,13 @@ static inline void kvm_async_pf_hash_reset(struct kvm_vcpu *vcpu)
>                 vcpu->arch.apf.gfns[i] = ~0;
>  }
>
> +static inline void kvm_error_gfn_hash_reset(struct kvm_vcpu *vcpu)
> +{
> +       int i;
> +       for (i = 0; i < ERROR_GFN_PER_VCPU; i++)
> +               vcpu->arch.apf.error_gfns[i] = GFN_INVALID;
> +}
> +
>  static void kvm_on_user_return(struct user_return_notifier *urn)
>  {
>         unsigned slot;
> @@ -9484,6 +9491,7 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
>         vcpu->arch.pat = MSR_IA32_CR_PAT_DEFAULT;
>
>         kvm_async_pf_hash_reset(vcpu);
> +       kvm_error_gfn_hash_reset(vcpu);
>         kvm_pmu_init(vcpu);
>
>         vcpu->arch.pending_external_vector = -1;
> @@ -9608,6 +9616,7 @@ void kvm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
>
>         kvm_clear_async_pf_completion_queue(vcpu);
>         kvm_async_pf_hash_reset(vcpu);
> +       kvm_error_gfn_hash_reset(vcpu);
>         vcpu->arch.apf.halted = false;
>
>         if (kvm_mpx_supported()) {
> @@ -10369,6 +10378,36 @@ void kvm_set_rflags(struct kvm_vcpu *vcpu, unsigned long rflags)
>  }
>  EXPORT_SYMBOL_GPL(kvm_set_rflags);
>
> +static inline u32 kvm_error_gfn_hash_fn(gfn_t gfn)
> +{
> +       BUILD_BUG_ON(!is_power_of_2(ERROR_GFN_PER_VCPU));
> +
> +       return hash_32(gfn & 0xffffffff, order_base_2(ERROR_GFN_PER_VCPU));
> +}
> +
> +static void kvm_add_error_gfn(struct kvm_vcpu *vcpu, gfn_t gfn)
> +{
> +       u32 key = kvm_error_gfn_hash_fn(gfn);
> +
> +       /*
> +        * Overwrite the previous gfn. This is just a hint to do
> +        * sync page fault.
> +        */
> +       vcpu->arch.apf.error_gfns[key] = gfn;
> +}
> +
> +/* Returns true if gfn was found in hash table, false otherwise */
> +static bool kvm_find_and_remove_error_gfn(struct kvm_vcpu *vcpu, gfn_t gfn)
> +{
> +       u32 key = kvm_error_gfn_hash_fn(gfn);
> +
> +       if (vcpu->arch.apf.error_gfns[key] != gfn)
> +               return 0;
> +
> +       vcpu->arch.apf.error_gfns[key] = GFN_INVALID;
> +       return true;
> +}
> +
>  void kvm_arch_async_page_ready(struct kvm_vcpu *vcpu, struct kvm_async_pf *work)
>  {
>         int r;
> @@ -10385,7 +10424,9 @@ void kvm_arch_async_page_ready(struct kvm_vcpu *vcpu, struct kvm_async_pf *work)
>               work->arch.cr3 != vcpu->arch.mmu->get_guest_pgd(vcpu))
>                 return;
>
> -       kvm_mmu_do_page_fault(vcpu, work->cr2_or_gpa, 0, true);
> +       r = kvm_mmu_do_page_fault(vcpu, work->cr2_or_gpa, 0, true);
> +       if (r < 0)
> +               kvm_add_error_gfn(vcpu, gpa_to_gfn(work->cr2_or_gpa));
>  }
>
>  static inline u32 kvm_async_pf_hash_fn(gfn_t gfn)
> @@ -10495,7 +10536,7 @@ static bool kvm_can_deliver_async_pf(struct kvm_vcpu *vcpu)
>         return true;
>  }
>
> -bool kvm_can_do_async_pf(struct kvm_vcpu *vcpu)
> +bool kvm_can_do_async_pf(struct kvm_vcpu *vcpu, gfn_t gfn)
>  {
>         if (unlikely(!lapic_in_kernel(vcpu) ||
>                      kvm_event_needs_reinjection(vcpu) ||
> @@ -10509,7 +10550,14 @@ bool kvm_can_do_async_pf(struct kvm_vcpu *vcpu)
>          * If interrupts are off we cannot even use an artificial
>          * halt state.
>          */
> -       return kvm_arch_interrupt_allowed(vcpu);
> +       if (!kvm_arch_interrupt_allowed(vcpu))
> +               return false;
> +
> +       /* Found gfn in error gfn cache. Force sync fault */
> +       if (kvm_find_and_remove_error_gfn(vcpu, gfn))
> +               return false;
> +
> +       return true;
>  }
>
>  bool kvm_arch_async_page_not_present(struct kvm_vcpu *vcpu,
> diff --git a/include/linux/kvm_types.h b/include/linux/kvm_types.h
> index 68e84cf42a3f..677bb8269cd3 100644
> --- a/include/linux/kvm_types.h
> +++ b/include/linux/kvm_types.h
> @@ -36,6 +36,7 @@ typedef u64            gpa_t;
>  typedef u64            gfn_t;
>
>  #define GPA_INVALID    (~(gpa_t)0)
> +#define GFN_INVALID    (~(gfn_t)0)
>
>  typedef unsigned long  hva_t;
>  typedef u64            hpa_t;
> --
> 2.25.4

This patch looks good to me.

Reviewed-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com>

>
Vivek Goyal Aug. 7, 2020, 5:51 p.m. UTC | #5
On Mon, Jul 20, 2020 at 05:13:59PM -0400, Vivek Goyal wrote:
> Page fault error handling behavior in kvm seems little inconsistent when
> page fault reports error. If we are doing fault synchronously
> then we capture error (-EFAULT) returned by __gfn_to_pfn_memslot() and
> exit to user space and qemu reports error, "error: kvm run failed Bad address".

Hi Paolo,

Ping. I am wondering what you think about this patch. Can it be merged?

Thanks
Vivek

> 
> But if we are doing async page fault, then async_pf_execute() will simply
> ignore the error reported by get_user_pages_remote() or
> by kvm_mmu_do_page_fault(). It is assumed that page fault was successful
> and either a page ready event is injected in guest or guest is brought
> out of artificial halt state and run again. In both the cases when guest
> retries the instruction, it takes exit again as page fault was not
> successful in previous attempt. And then this infinite loop continues
> forever.
> 
> Trying fault in a loop will make sense if error is temporary and will
> be resolved on retry. But I don't see any intention in the code to
> determine if error is temporary or not.  Whether to do fault synchronously
> or asynchronously, depends on so many variables but none of the varibales
> is whether error is temporary or not. (kvm_can_do_async_pf()).
> 
> And that makes it very inconsistent or unpredictable to figure out whether
> kvm will exit to qemu with error or it will just retry and go into an
> infinite loop.
> 
> This patch tries to make this behavior consistent. That is instead of
> getting into infinite loop of retrying page fault, exit to user space
> and stop VM if page fault error happens.
> 
> In future this can be improved by injecting errors into guest. As of
> now we don't have any race free method to inject errors in guest.
> 
> When page fault error happens in async path save that pfn and when next
> time guest retries, do a sync fault instead of async fault. So that if error
> is encountered, we exit to qemu and avoid infinite loop.
> 
> We maintain a cache of error gfns and force sync fault if a gfn is
> found in cache of error gfn. There is a small possibility that we
> miss an error gfn (as it got overwritten by a new error gfn). But
> its just a hint and sooner or later some error pfn will match
> and we will force sync fault and exit to user space.
> 
> Changes from v3:
> - Added function kvm_find_and_remove_error_gfn() and removed
>   kvm_find_error_gfn() and kvm_del_error_gfn(). (Vitaly)
> 
> - Added a macro GFN_INVALID (Vitaly).
> 
> - Used gpa_to_gfn() to convert gpa to gfn (Vitaly)
> 
> Change from v2:
> - Fixed a warning by converting kvm_find_error_gfn() static.
> 
> Change from v1:
> - Maintain a cache of error gfns, instead of single gfn. (Vitaly)
> 
> Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
> ---
>  arch/x86/include/asm/kvm_host.h |  2 ++
>  arch/x86/kvm/mmu.h              |  2 +-
>  arch/x86/kvm/mmu/mmu.c          |  2 +-
>  arch/x86/kvm/x86.c              | 54 +++++++++++++++++++++++++++++++--
>  include/linux/kvm_types.h       |  1 +
>  5 files changed, 56 insertions(+), 5 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index be5363b21540..e6f8d3f1a377 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -137,6 +137,7 @@ static inline gfn_t gfn_to_index(gfn_t gfn, gfn_t base_gfn, int level)
>  #define KVM_NR_VAR_MTRR 8
>  
>  #define ASYNC_PF_PER_VCPU 64
> +#define ERROR_GFN_PER_VCPU 64
>  
>  enum kvm_reg {
>  	VCPU_REGS_RAX = __VCPU_REGS_RAX,
> @@ -778,6 +779,7 @@ struct kvm_vcpu_arch {
>  		unsigned long nested_apf_token;
>  		bool delivery_as_pf_vmexit;
>  		bool pageready_pending;
> +		gfn_t error_gfns[ERROR_GFN_PER_VCPU];
>  	} apf;
>  
>  	/* OSVW MSRs (AMD only) */
> diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
> index 444bb9c54548..d0a2a12c7bb6 100644
> --- a/arch/x86/kvm/mmu.h
> +++ b/arch/x86/kvm/mmu.h
> @@ -60,7 +60,7 @@ void kvm_init_mmu(struct kvm_vcpu *vcpu, bool reset_roots);
>  void kvm_init_shadow_mmu(struct kvm_vcpu *vcpu, u32 cr0, u32 cr4, u32 efer);
>  void kvm_init_shadow_ept_mmu(struct kvm_vcpu *vcpu, bool execonly,
>  			     bool accessed_dirty, gpa_t new_eptp);
> -bool kvm_can_do_async_pf(struct kvm_vcpu *vcpu);
> +bool kvm_can_do_async_pf(struct kvm_vcpu *vcpu, gfn_t gfn);
>  int kvm_handle_page_fault(struct kvm_vcpu *vcpu, u64 error_code,
>  				u64 fault_address, char *insn, int insn_len);
>  
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 6d6a0ae7800c..b51d4aa405e0 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -4078,7 +4078,7 @@ static bool try_async_pf(struct kvm_vcpu *vcpu, bool prefault, gfn_t gfn,
>  	if (!async)
>  		return false; /* *pfn has correct page already */
>  
> -	if (!prefault && kvm_can_do_async_pf(vcpu)) {
> +	if (!prefault && kvm_can_do_async_pf(vcpu, gpa_to_gfn(cr2_or_gpa))) {
>  		trace_kvm_try_async_get_page(cr2_or_gpa, gfn);
>  		if (kvm_find_async_pf_gfn(vcpu, gfn)) {
>  			trace_kvm_async_pf_doublefault(cr2_or_gpa, gfn);
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 88c593f83b28..c1f5094d6e53 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -263,6 +263,13 @@ static inline void kvm_async_pf_hash_reset(struct kvm_vcpu *vcpu)
>  		vcpu->arch.apf.gfns[i] = ~0;
>  }
>  
> +static inline void kvm_error_gfn_hash_reset(struct kvm_vcpu *vcpu)
> +{
> +	int i;
> +	for (i = 0; i < ERROR_GFN_PER_VCPU; i++)
> +		vcpu->arch.apf.error_gfns[i] = GFN_INVALID;
> +}
> +
>  static void kvm_on_user_return(struct user_return_notifier *urn)
>  {
>  	unsigned slot;
> @@ -9484,6 +9491,7 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
>  	vcpu->arch.pat = MSR_IA32_CR_PAT_DEFAULT;
>  
>  	kvm_async_pf_hash_reset(vcpu);
> +	kvm_error_gfn_hash_reset(vcpu);
>  	kvm_pmu_init(vcpu);
>  
>  	vcpu->arch.pending_external_vector = -1;
> @@ -9608,6 +9616,7 @@ void kvm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
>  
>  	kvm_clear_async_pf_completion_queue(vcpu);
>  	kvm_async_pf_hash_reset(vcpu);
> +	kvm_error_gfn_hash_reset(vcpu);
>  	vcpu->arch.apf.halted = false;
>  
>  	if (kvm_mpx_supported()) {
> @@ -10369,6 +10378,36 @@ void kvm_set_rflags(struct kvm_vcpu *vcpu, unsigned long rflags)
>  }
>  EXPORT_SYMBOL_GPL(kvm_set_rflags);
>  
> +static inline u32 kvm_error_gfn_hash_fn(gfn_t gfn)
> +{
> +	BUILD_BUG_ON(!is_power_of_2(ERROR_GFN_PER_VCPU));
> +
> +	return hash_32(gfn & 0xffffffff, order_base_2(ERROR_GFN_PER_VCPU));
> +}
> +
> +static void kvm_add_error_gfn(struct kvm_vcpu *vcpu, gfn_t gfn)
> +{
> +	u32 key = kvm_error_gfn_hash_fn(gfn);
> +
> +	/*
> +	 * Overwrite the previous gfn. This is just a hint to do
> +	 * sync page fault.
> +	 */
> +	vcpu->arch.apf.error_gfns[key] = gfn;
> +}
> +
> +/* Returns true if gfn was found in hash table, false otherwise */
> +static bool kvm_find_and_remove_error_gfn(struct kvm_vcpu *vcpu, gfn_t gfn)
> +{
> +	u32 key = kvm_error_gfn_hash_fn(gfn);
> +
> +	if (vcpu->arch.apf.error_gfns[key] != gfn)
> +		return 0;
> +
> +	vcpu->arch.apf.error_gfns[key] = GFN_INVALID;
> +	return true;
> +}
> +
>  void kvm_arch_async_page_ready(struct kvm_vcpu *vcpu, struct kvm_async_pf *work)
>  {
>  	int r;
> @@ -10385,7 +10424,9 @@ void kvm_arch_async_page_ready(struct kvm_vcpu *vcpu, struct kvm_async_pf *work)
>  	      work->arch.cr3 != vcpu->arch.mmu->get_guest_pgd(vcpu))
>  		return;
>  
> -	kvm_mmu_do_page_fault(vcpu, work->cr2_or_gpa, 0, true);
> +	r = kvm_mmu_do_page_fault(vcpu, work->cr2_or_gpa, 0, true);
> +	if (r < 0)
> +		kvm_add_error_gfn(vcpu, gpa_to_gfn(work->cr2_or_gpa));
>  }
>  
>  static inline u32 kvm_async_pf_hash_fn(gfn_t gfn)
> @@ -10495,7 +10536,7 @@ static bool kvm_can_deliver_async_pf(struct kvm_vcpu *vcpu)
>  	return true;
>  }
>  
> -bool kvm_can_do_async_pf(struct kvm_vcpu *vcpu)
> +bool kvm_can_do_async_pf(struct kvm_vcpu *vcpu, gfn_t gfn)
>  {
>  	if (unlikely(!lapic_in_kernel(vcpu) ||
>  		     kvm_event_needs_reinjection(vcpu) ||
> @@ -10509,7 +10550,14 @@ bool kvm_can_do_async_pf(struct kvm_vcpu *vcpu)
>  	 * If interrupts are off we cannot even use an artificial
>  	 * halt state.
>  	 */
> -	return kvm_arch_interrupt_allowed(vcpu);
> +	if (!kvm_arch_interrupt_allowed(vcpu))
> +		return false;
> +
> +	/* Found gfn in error gfn cache. Force sync fault */
> +	if (kvm_find_and_remove_error_gfn(vcpu, gfn))
> +		return false;
> +
> +	return true;
>  }
>  
>  bool kvm_arch_async_page_not_present(struct kvm_vcpu *vcpu,
> diff --git a/include/linux/kvm_types.h b/include/linux/kvm_types.h
> index 68e84cf42a3f..677bb8269cd3 100644
> --- a/include/linux/kvm_types.h
> +++ b/include/linux/kvm_types.h
> @@ -36,6 +36,7 @@ typedef u64            gpa_t;
>  typedef u64            gfn_t;
>  
>  #define GPA_INVALID	(~(gpa_t)0)
> +#define GFN_INVALID	(~(gfn_t)0)
>  
>  typedef unsigned long  hva_t;
>  typedef u64            hpa_t;
> -- 
> 2.25.4
>
Sean Christopherson Sept. 29, 2020, 4:37 a.m. UTC | #6
On Mon, Jul 20, 2020 at 05:13:59PM -0400, Vivek Goyal wrote:
> ---
>  arch/x86/include/asm/kvm_host.h |  2 ++
>  arch/x86/kvm/mmu.h              |  2 +-
>  arch/x86/kvm/mmu/mmu.c          |  2 +-
>  arch/x86/kvm/x86.c              | 54 +++++++++++++++++++++++++++++++--
>  include/linux/kvm_types.h       |  1 +
>  5 files changed, 56 insertions(+), 5 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index be5363b21540..e6f8d3f1a377 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -137,6 +137,7 @@ static inline gfn_t gfn_to_index(gfn_t gfn, gfn_t base_gfn, int level)
>  #define KVM_NR_VAR_MTRR 8
>  
>  #define ASYNC_PF_PER_VCPU 64
> +#define ERROR_GFN_PER_VCPU 64

Aren't these two related?  I.e. wouldn't it make sense to do:

  #define ERROR_GFN_PER_VCPU ASYNC_PF_PER_VCPU

Or maybe even size error_gfns[] to ASYNC_PF_PER_VCPU?

>  
>  enum kvm_reg {
>  	VCPU_REGS_RAX = __VCPU_REGS_RAX,
> @@ -778,6 +779,7 @@ struct kvm_vcpu_arch {
>  		unsigned long nested_apf_token;
>  		bool delivery_as_pf_vmexit;
>  		bool pageready_pending;
> +		gfn_t error_gfns[ERROR_GFN_PER_VCPU];
>  	} apf;
>  
>  	/* OSVW MSRs (AMD only) */
> diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
> index 444bb9c54548..d0a2a12c7bb6 100644
> --- a/arch/x86/kvm/mmu.h
> +++ b/arch/x86/kvm/mmu.h
> @@ -60,7 +60,7 @@ void kvm_init_mmu(struct kvm_vcpu *vcpu, bool reset_roots);
>  void kvm_init_shadow_mmu(struct kvm_vcpu *vcpu, u32 cr0, u32 cr4, u32 efer);
>  void kvm_init_shadow_ept_mmu(struct kvm_vcpu *vcpu, bool execonly,
>  			     bool accessed_dirty, gpa_t new_eptp);
> -bool kvm_can_do_async_pf(struct kvm_vcpu *vcpu);
> +bool kvm_can_do_async_pf(struct kvm_vcpu *vcpu, gfn_t gfn);
>  int kvm_handle_page_fault(struct kvm_vcpu *vcpu, u64 error_code,
>  				u64 fault_address, char *insn, int insn_len);
>  

...

> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 88c593f83b28..c1f5094d6e53 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -263,6 +263,13 @@ static inline void kvm_async_pf_hash_reset(struct kvm_vcpu *vcpu)
>  		vcpu->arch.apf.gfns[i] = ~0;
>  }
>  
> +static inline void kvm_error_gfn_hash_reset(struct kvm_vcpu *vcpu)
> +{
> +	int i;

Need a newline.   

> +	for (i = 0; i < ERROR_GFN_PER_VCPU; i++)
> +		vcpu->arch.apf.error_gfns[i] = GFN_INVALID;
> +}
> +
>  static void kvm_on_user_return(struct user_return_notifier *urn)
>  {
>  	unsigned slot;
> @@ -9484,6 +9491,7 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
>  	vcpu->arch.pat = MSR_IA32_CR_PAT_DEFAULT;
>  
>  	kvm_async_pf_hash_reset(vcpu);
> +	kvm_error_gfn_hash_reset(vcpu);
>  	kvm_pmu_init(vcpu);
>  
>  	vcpu->arch.pending_external_vector = -1;
> @@ -9608,6 +9616,7 @@ void kvm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
>  
>  	kvm_clear_async_pf_completion_queue(vcpu);
>  	kvm_async_pf_hash_reset(vcpu);
> +	kvm_error_gfn_hash_reset(vcpu);
>  	vcpu->arch.apf.halted = false;
>  
>  	if (kvm_mpx_supported()) {
> @@ -10369,6 +10378,36 @@ void kvm_set_rflags(struct kvm_vcpu *vcpu, unsigned long rflags)
>  }
>  EXPORT_SYMBOL_GPL(kvm_set_rflags);
>  
> +static inline u32 kvm_error_gfn_hash_fn(gfn_t gfn)
> +{
> +	BUILD_BUG_ON(!is_power_of_2(ERROR_GFN_PER_VCPU));
> +
> +	return hash_32(gfn & 0xffffffff, order_base_2(ERROR_GFN_PER_VCPU));
> +}
> +
> +static void kvm_add_error_gfn(struct kvm_vcpu *vcpu, gfn_t gfn)
> +{
> +	u32 key = kvm_error_gfn_hash_fn(gfn);
> +
> +	/*
> +	 * Overwrite the previous gfn. This is just a hint to do
> +	 * sync page fault.
> +	 */
> +	vcpu->arch.apf.error_gfns[key] = gfn;
> +}
> +
> +/* Returns true if gfn was found in hash table, false otherwise */
> +static bool kvm_find_and_remove_error_gfn(struct kvm_vcpu *vcpu, gfn_t gfn)
> +{
> +	u32 key = kvm_error_gfn_hash_fn(gfn);

Mostly out of curiosity, do we really need a hash?  E.g. could we get away
with an array of 4 values?  2 values?  Just wondering if we can avoid 64*8
bytes per CPU.

One thought would be to use the index to handle the case of no error gfns so
that the size of the array doesn't affect lookup for the common case, e.g.

	...

		u8 error_gfn_head;
		gfn_t error_gfns[ERROR_GFN_PER_VCPU];
	} apf;	


	if (vcpu->arch.apf.error_gfn_head == 0xff)
		return false;

	for (i = 0; i < vcpu->arch.apf.error_gfn_head; i++) {
		if (vcpu->arch.apf.error_gfns[i] == gfn) {
			vcpu->arch.apf.error_gfns[i] = INVALID_GFN;
			return true;
		}
	}
	return false;

Or you could even avoid INVALID_GFN altogether by compacting the array on
removal.  The VM is probably dead anyways, burning a few cycles to clean
things up is a non-issue.  Ditto for slow insertion.
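
For illustration, the compacting variant might look something like this
(just a sketch; 'error_gfn_count' is a hypothetical field tracking how many
entries are in use, it is not part of this patch):

	static bool kvm_find_and_remove_error_gfn(struct kvm_vcpu *vcpu, gfn_t gfn)
	{
		u32 i, n = vcpu->arch.apf.error_gfn_count;

		for (i = 0; i < n; i++) {
			if (vcpu->arch.apf.error_gfns[i] != gfn)
				continue;
			/* compact: move the last entry into the freed slot */
			vcpu->arch.apf.error_gfns[i] = vcpu->arch.apf.error_gfns[n - 1];
			vcpu->arch.apf.error_gfn_count = n - 1;
			return true;
		}
		return false;
	}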

> +
> +	if (vcpu->arch.apf.error_gfns[key] != gfn)
> +		return 0;

Should be "return false".

> +
> +	vcpu->arch.apf.error_gfns[key] = GFN_INVALID;
> +	return true;
> +}
> +
>  void kvm_arch_async_page_ready(struct kvm_vcpu *vcpu, struct kvm_async_pf *work)
>  {
>  	int r;
> @@ -10385,7 +10424,9 @@ void kvm_arch_async_page_ready(struct kvm_vcpu *vcpu, struct kvm_async_pf *work)
>  	      work->arch.cr3 != vcpu->arch.mmu->get_guest_pgd(vcpu))
>  		return;
>  
> -	kvm_mmu_do_page_fault(vcpu, work->cr2_or_gpa, 0, true);
> +	r = kvm_mmu_do_page_fault(vcpu, work->cr2_or_gpa, 0, true);
> +	if (r < 0)
> +		kvm_add_error_gfn(vcpu, gpa_to_gfn(work->cr2_or_gpa));
>  }
>  
>  static inline u32 kvm_async_pf_hash_fn(gfn_t gfn)
> @@ -10495,7 +10536,7 @@ static bool kvm_can_deliver_async_pf(struct kvm_vcpu *vcpu)
>  	return true;
>  }
>  
> -bool kvm_can_do_async_pf(struct kvm_vcpu *vcpu)
> +bool kvm_can_do_async_pf(struct kvm_vcpu *vcpu, gfn_t gfn)
>  {
>  	if (unlikely(!lapic_in_kernel(vcpu) ||
>  		     kvm_event_needs_reinjection(vcpu) ||
> @@ -10509,7 +10550,14 @@ bool kvm_can_do_async_pf(struct kvm_vcpu *vcpu)
>  	 * If interrupts are off we cannot even use an artificial
>  	 * halt state.
>  	 */
> -	return kvm_arch_interrupt_allowed(vcpu);
> +	if (!kvm_arch_interrupt_allowed(vcpu))
> +		return false;
> +
> +	/* Found gfn in error gfn cache. Force sync fault */
> +	if (kvm_find_and_remove_error_gfn(vcpu, gfn))
> +		return false;
> +
> +	return true;
>  }
>  
>  bool kvm_arch_async_page_not_present(struct kvm_vcpu *vcpu,
> diff --git a/include/linux/kvm_types.h b/include/linux/kvm_types.h
> index 68e84cf42a3f..677bb8269cd3 100644
> --- a/include/linux/kvm_types.h
> +++ b/include/linux/kvm_types.h
> @@ -36,6 +36,7 @@ typedef u64            gpa_t;
>  typedef u64            gfn_t;
>  
>  #define GPA_INVALID	(~(gpa_t)0)
> +#define GFN_INVALID	(~(gfn_t)0)
>  
>  typedef unsigned long  hva_t;
>  typedef u64            hpa_t;
> -- 
> 2.25.4
>
Vivek Goyal Oct. 1, 2020, 9:55 p.m. UTC | #7
On Mon, Sep 28, 2020 at 09:37:00PM -0700, Sean Christopherson wrote:
> On Mon, Jul 20, 2020 at 05:13:59PM -0400, Vivek Goyal wrote:
> > ---
> >  arch/x86/include/asm/kvm_host.h |  2 ++
> >  arch/x86/kvm/mmu.h              |  2 +-
> >  arch/x86/kvm/mmu/mmu.c          |  2 +-
> >  arch/x86/kvm/x86.c              | 54 +++++++++++++++++++++++++++++++--
> >  include/linux/kvm_types.h       |  1 +
> >  5 files changed, 56 insertions(+), 5 deletions(-)
> > 
> > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > index be5363b21540..e6f8d3f1a377 100644
> > --- a/arch/x86/include/asm/kvm_host.h
> > +++ b/arch/x86/include/asm/kvm_host.h
> > @@ -137,6 +137,7 @@ static inline gfn_t gfn_to_index(gfn_t gfn, gfn_t base_gfn, int level)
> >  #define KVM_NR_VAR_MTRR 8
> >  
> >  #define ASYNC_PF_PER_VCPU 64
> > +#define ERROR_GFN_PER_VCPU 64
> 
> Aren't these two related?  I.e. wouldn't it make sense to do:

Hi,

They are somewhat related, but they don't have to be the same. We can
accumulate many more error GFNs if this vcpu does not schedule the same
task again to retry.

>   #define ERROR_GFN_PER_VCPU ASYNC_PF_PER_VCPU
> 
> Or maybe even size error_gfns[] to ASYNC_PF_PER_VCPU?

Given these two don't have to be the same, I kept them separate and kept
the hash size the same for now. If one wants, the hash can be made bigger
or smaller down the line.

> 
> >  
> >  enum kvm_reg {
> >  	VCPU_REGS_RAX = __VCPU_REGS_RAX,
> > @@ -778,6 +779,7 @@ struct kvm_vcpu_arch {
> >  		unsigned long nested_apf_token;
> >  		bool delivery_as_pf_vmexit;
> >  		bool pageready_pending;
> > +		gfn_t error_gfns[ERROR_GFN_PER_VCPU];
> >  	} apf;
> >  
> >  	/* OSVW MSRs (AMD only) */
> > diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
> > index 444bb9c54548..d0a2a12c7bb6 100644
> > --- a/arch/x86/kvm/mmu.h
> > +++ b/arch/x86/kvm/mmu.h
> > @@ -60,7 +60,7 @@ void kvm_init_mmu(struct kvm_vcpu *vcpu, bool reset_roots);
> >  void kvm_init_shadow_mmu(struct kvm_vcpu *vcpu, u32 cr0, u32 cr4, u32 efer);
> >  void kvm_init_shadow_ept_mmu(struct kvm_vcpu *vcpu, bool execonly,
> >  			     bool accessed_dirty, gpa_t new_eptp);
> > -bool kvm_can_do_async_pf(struct kvm_vcpu *vcpu);
> > +bool kvm_can_do_async_pf(struct kvm_vcpu *vcpu, gfn_t gfn);
> >  int kvm_handle_page_fault(struct kvm_vcpu *vcpu, u64 error_code,
> >  				u64 fault_address, char *insn, int insn_len);
> >  
> 
> ...
> 
> > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > index 88c593f83b28..c1f5094d6e53 100644
> > --- a/arch/x86/kvm/x86.c
> > +++ b/arch/x86/kvm/x86.c
> > @@ -263,6 +263,13 @@ static inline void kvm_async_pf_hash_reset(struct kvm_vcpu *vcpu)
> >  		vcpu->arch.apf.gfns[i] = ~0;
> >  }
> >  
> > +static inline void kvm_error_gfn_hash_reset(struct kvm_vcpu *vcpu)
> > +{
> > +	int i;
> 
> Need a newline.   

Will do.

> 
> > +	for (i = 0; i < ERROR_GFN_PER_VCPU; i++)
> > +		vcpu->arch.apf.error_gfns[i] = GFN_INVALID;
> > +}
> > +
> >  static void kvm_on_user_return(struct user_return_notifier *urn)
> >  {
> >  	unsigned slot;
> > @@ -9484,6 +9491,7 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
> >  	vcpu->arch.pat = MSR_IA32_CR_PAT_DEFAULT;
> >  
> >  	kvm_async_pf_hash_reset(vcpu);
> > +	kvm_error_gfn_hash_reset(vcpu);
> >  	kvm_pmu_init(vcpu);
> >  
> >  	vcpu->arch.pending_external_vector = -1;
> > @@ -9608,6 +9616,7 @@ void kvm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
> >  
> >  	kvm_clear_async_pf_completion_queue(vcpu);
> >  	kvm_async_pf_hash_reset(vcpu);
> > +	kvm_error_gfn_hash_reset(vcpu);
> >  	vcpu->arch.apf.halted = false;
> >  
> >  	if (kvm_mpx_supported()) {
> > @@ -10369,6 +10378,36 @@ void kvm_set_rflags(struct kvm_vcpu *vcpu, unsigned long rflags)
> >  }
> >  EXPORT_SYMBOL_GPL(kvm_set_rflags);
> >  
> > +static inline u32 kvm_error_gfn_hash_fn(gfn_t gfn)
> > +{
> > +	BUILD_BUG_ON(!is_power_of_2(ERROR_GFN_PER_VCPU));
> > +
> > +	return hash_32(gfn & 0xffffffff, order_base_2(ERROR_GFN_PER_VCPU));
> > +}
> > +
> > +static void kvm_add_error_gfn(struct kvm_vcpu *vcpu, gfn_t gfn)
> > +{
> > +	u32 key = kvm_error_gfn_hash_fn(gfn);
> > +
> > +	/*
> > +	 * Overwrite the previous gfn. This is just a hint to do
> > +	 * sync page fault.
> > +	 */
> > +	vcpu->arch.apf.error_gfns[key] = gfn;
> > +}
> > +
> > +/* Returns true if gfn was found in hash table, false otherwise */
> > +static bool kvm_find_and_remove_error_gfn(struct kvm_vcpu *vcpu, gfn_t gfn)
> > +{
> > +	u32 key = kvm_error_gfn_hash_fn(gfn);
> 
> Mostly out of curiosity, do we really need a hash?  E.g. could we get away
> with an array of 4 values?  2 values?  Just wondering if we can avoid 64*8
> bytes per CPU.

We are relying on returning an error when the guest task retries the fault.
The fault will be retried on the same address if the same task is run by the
vcpu after the "page ready" event. There is no guarantee that the same task
will be run. In theory, this cpu could have a large number of tasks queued
and run those tasks before the faulting task runs again. Now say 128 tasks
are being run and 32 of them have page fault errors. Then if we keep only 4
values, newer failures will simply overwrite older failures and we will keep
spinning instead of exiting to user space.

That's why there is an array of 64 gfns and gfns are added based on a hash.
This does not completely eliminate the above possibility, but the chances of
hitting it are very slim.

> 
> One thought would be to use the index to handle the case of no error gfns so
> that the size of the array doesn't affect lookup for the common case, e.g.

We calculate a hash of the gfn (used as an index into the array), so the
lookup cost is not driven by the size of the array. It is O(1), not O(N). We
just look at one element in the array and if it does not match, we return
false.

	u32 key = kvm_error_gfn_hash_fn(gfn);

	if (vcpu->arch.apf.error_gfns[key] != gfn)
		return 0;


> 
> 	...
> 
> 		u8 error_gfn_head;
> 		gfn_t error_gfns[ERROR_GFN_PER_VCPU];
> 	} apf;	
> 
> 
> 	if (vcpu->arch.apf.error_gfn_head == 0xff)
> 		return false;
> 
> 	for (i = 0; i < vcpu->arch.apf.error_gfn_head; i++) {
> 		if (vcpu->arch.apf.error_gfns[i] == gfn) {
> 			vcpu->arch.apf.error_gfns[i] = INVALID_GFN;
> 			return true;
> 		}
> 	}
> 	return true;
> 
> Or you could even avoid INVALID_GFN altogether by compacting the array on
> removal.  The VM is probably dead anyways, burning a few cycles to clean
> things up is a non-issue.  Ditto for slow insertion.

Same for insertion. It is O(1) and not dependent on the size of the array.
Insertion is also a very infrequent event, because errors will not happen
often.


> 
> > +
> > +	if (vcpu->arch.apf.error_gfns[key] != gfn)
> > +		return 0;
> 
> Should be "return false".

Will fix.

Thanks
Vivek
Sean Christopherson Oct. 1, 2020, 10:33 p.m. UTC | #8
On Thu, Oct 01, 2020 at 05:55:08PM -0400, Vivek Goyal wrote:
> On Mon, Sep 28, 2020 at 09:37:00PM -0700, Sean Christopherson wrote:
> > On Mon, Jul 20, 2020 at 05:13:59PM -0400, Vivek Goyal wrote:
> > > @@ -10369,6 +10378,36 @@ void kvm_set_rflags(struct kvm_vcpu *vcpu, unsigned long rflags)
> > >  }
> > >  EXPORT_SYMBOL_GPL(kvm_set_rflags);
> > >  
> > > +static inline u32 kvm_error_gfn_hash_fn(gfn_t gfn)
> > > +{
> > > +	BUILD_BUG_ON(!is_power_of_2(ERROR_GFN_PER_VCPU));
> > > +
> > > +	return hash_32(gfn & 0xffffffff, order_base_2(ERROR_GFN_PER_VCPU));
> > > +}
> > > +
> > > +static void kvm_add_error_gfn(struct kvm_vcpu *vcpu, gfn_t gfn)
> > > +{
> > > +	u32 key = kvm_error_gfn_hash_fn(gfn);
> > > +
> > > +	/*
> > > +	 * Overwrite the previous gfn. This is just a hint to do
> > > +	 * sync page fault.
> > > +	 */
> > > +	vcpu->arch.apf.error_gfns[key] = gfn;
> > > +}
> > > +
> > > +/* Returns true if gfn was found in hash table, false otherwise */
> > > +static bool kvm_find_and_remove_error_gfn(struct kvm_vcpu *vcpu, gfn_t gfn)
> > > +{
> > > +	u32 key = kvm_error_gfn_hash_fn(gfn);
> > 
> > Mostly out of curiosity, do we really need a hash?  E.g. could we get away
> > with an array of 4 values?  2 values?  Just wondering if we can avoid 64*8
> > bytes per CPU.
> 
> We are relying on returning error when guest task retries fault. Fault
> will be retried on same address if same task is run by vcpu after
> "page ready" event. There is no guarantee that same task will be
> run. In theory, this cpu could have a large number of tasks queued
> and run these tasks before the faulting task is run again. Now say
> there are 128 tasks being run and 32 of them have page fault
> errors. Then if we keep 4 values, newer failures will simply
> overwrite older failures and we will keep spinning instead of
> exiting to user space.
> 
> That's why this array of 64 gfns and add gfns based on hash. This
> does not completely elimiante the above possibility but chances
> of one hitting this are very very slim.

But have you actually tried such a scenario?  In other words, is there good
justification for burning the extra memory?

Alternatively, what about adding a new KVM request type to handle this?
E.g. when the APF comes back with -EFAULT, snapshot the GFN and make a
request.  The vCPU then gets kicked and exits to userspace.  Before exiting
to userspace, the request handler resets vcpu->arch.apf.error_gfn.  Bad GFNs
simply get dropped if error_gfn is "valid", i.e. there's a pending request.

That would guarantee the error is propagated to userspace, and doesn't lose
any guest information as dropping error GFNs just means the guest will take
more page fault exits.
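
A very rough sketch of that flow (the request name and the error_gfn field
are hypothetical, purely for illustration):

	/* on -EFAULT from the async fault, e.g. in kvm_arch_async_page_ready() */
	vcpu->arch.apf.error_gfn = gpa_to_gfn(work->cr2_or_gpa);
	kvm_make_request(KVM_REQ_APF_ERROR, vcpu);

	/* in vcpu_enter_guest(), before re-entering the guest */
	if (kvm_check_request(KVM_REQ_APF_ERROR, vcpu)) {
		/* fill vcpu->run with the bad gfn details for userspace */
		r = 0;		/* returning 0 exits to user space */
		goto out;
	}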

> > One thought would be to use the index to handle the case of no error gfns so
> > that the size of the array doesn't affect lookup for the common case, e.g.
> 
> We are calculating hash of gfn (used as index in array). So lookup cost
> is not driven by size of array. Its O(1) and not O(N). We just lookup
> at one element in array and if it does not match, we return false.
> 
> 	u32 key = kvm_error_gfn_hash_fn(gfn);
> 
> 	if (vcpu->arch.apf.error_gfns[key] != gfn)
> 		return 0;
> 
> 
> > 
> > 	...
> > 
> > 		u8 error_gfn_head;
> > 		gfn_t error_gfns[ERROR_GFN_PER_VCPU];
> > 	} apf;	
> > 
> > 
> > 	if (vcpu->arch.apf.error_gfn_head == 0xff)
> > 		return false;
> > 
> > 	for (i = 0; i < vcpu->arch.apf.error_gfn_head; i++) {
> > 		if (vcpu->arch.apf.error_gfns[i] == gfn) {
> > 			vcpu->arch.apf.error_gfns[i] = INVALID_GFN;
> > 			return true;
> > 		}
> > 	}
> > 	return true;
> > 
> > Or you could even avoid INVALID_GFN altogether by compacting the array on
> > removal.  The VM is probably dead anyways, burning a few cycles to clean
> > things up is a non-issue.  Ditto for slow insertion.
> 
> Same for insertion. Its O(1) and not dependent on size of array. Also
> insertion anyway is very infrequent event because it will not be
> often that error will happen.

I know, I was saying that if we move to an array then we'd technically be
subject to O(whatever) search and delete, but that it's irrelevant from a
performance perspective because the guest is already toast if it hits a bad
GFN.

> > > +
> > > +	if (vcpu->arch.apf.error_gfns[key] != gfn)
> > > +		return 0;
> > 
> > Should be "return false".
> 
> Will fix.
> 
> Thanks
> Vivek
>
Vivek Goyal Oct. 2, 2020, 3:38 p.m. UTC | #9
On Thu, Oct 01, 2020 at 03:33:20PM -0700, Sean Christopherson wrote:
> On Thu, Oct 01, 2020 at 05:55:08PM -0400, Vivek Goyal wrote:
> > On Mon, Sep 28, 2020 at 09:37:00PM -0700, Sean Christopherson wrote:
> > > On Mon, Jul 20, 2020 at 05:13:59PM -0400, Vivek Goyal wrote:
> > > > @@ -10369,6 +10378,36 @@ void kvm_set_rflags(struct kvm_vcpu *vcpu, unsigned long rflags)
> > > >  }
> > > >  EXPORT_SYMBOL_GPL(kvm_set_rflags);
> > > >  
> > > > +static inline u32 kvm_error_gfn_hash_fn(gfn_t gfn)
> > > > +{
> > > > +	BUILD_BUG_ON(!is_power_of_2(ERROR_GFN_PER_VCPU));
> > > > +
> > > > +	return hash_32(gfn & 0xffffffff, order_base_2(ERROR_GFN_PER_VCPU));
> > > > +}
> > > > +
> > > > +static void kvm_add_error_gfn(struct kvm_vcpu *vcpu, gfn_t gfn)
> > > > +{
> > > > +	u32 key = kvm_error_gfn_hash_fn(gfn);
> > > > +
> > > > +	/*
> > > > +	 * Overwrite the previous gfn. This is just a hint to do
> > > > +	 * sync page fault.
> > > > +	 */
> > > > +	vcpu->arch.apf.error_gfns[key] = gfn;
> > > > +}
> > > > +
> > > > +/* Returns true if gfn was found in hash table, false otherwise */
> > > > +static bool kvm_find_and_remove_error_gfn(struct kvm_vcpu *vcpu, gfn_t gfn)
> > > > +{
> > > > +	u32 key = kvm_error_gfn_hash_fn(gfn);
> > > 
> > > Mostly out of curiosity, do we really need a hash?  E.g. could we get away
> > > with an array of 4 values?  2 values?  Just wondering if we can avoid 64*8
> > > bytes per CPU.
> > 
> > We are relying on returning error when guest task retries fault. Fault
> > will be retried on same address if same task is run by vcpu after
> > "page ready" event. There is no guarantee that same task will be
> > run. In theory, this cpu could have a large number of tasks queued
> > and run these tasks before the faulting task is run again. Now say
> > there are 128 tasks being run and 32 of them have page fault
> > errors. Then if we keep 4 values, newer failures will simply
> > overwrite older failures and we will keep spinning instead of
> > exiting to user space.
> > 
> > That's why this array of 64 gfns and add gfns based on hash. This
> > does not completely elimiante the above possibility but chances
> > of one hitting this are very very slim.
> 
> But have you actually tried such a scenario?  In other words, is there good
> justification for burning the extra memory?

It's not easy to try and reproduce, so it is all theory at this point. If
you are worried about memory usage, we can probably reduce the size of the
hash table, say from 64 down to 8. I am fine with that. I think initially I
had a single error_gfn, but Vitaly had concerns about the above scenario, so
I implemented a hash table.

I think reducing the hash table size to 8 or 16 is probably a good middle
ground.

> 
> Alternatively, what about adding a new KVM request type to handle this?
> E.g. when the APF comes back with -EFAULT, snapshot the GFN and make a
> request.  The vCPU then gets kicked and exits to userspace.  Before exiting
> to userspace, the request handler resets vcpu->arch.apf.error_gfn.  Bad GFNs
> simply get if error_gfn is "valid", i.e. there's a pending request.

Sorry, I did not understand the above proposal. Can you please elaborate a
bit more? Part of it is that I don't know much about KVM requests. Looking
at the code, it seems the main loop checks whether some kvm request is
pending and executes the corresponding action.

Don't we want to make sure that we exit to user space when the guest
retries the error gfn access? In this case, once we get -EFAULT, we will
still inject page_ready into the guest, and then either the same process or
a different process might run.

So when exactly does the code raise a kvm request? If I raise it right when
I get -EFAULT, then kvm will exit to user space upon the next entry. But
there is no guarantee the guest vcpu is running the process which actually
accessed the error gfn, which probably means the register state of the cpu
does not mean much and one cannot easily figure out which task tried to
access the bad memory and when.

That's why we prepare a list of error gfns and only exit to user space when
the error_gfn access is retried, so that the guest vcpu context is correct.

What am I missing?

Thanks
Vivek

> 
> That would guarantee the error is propagated to userspace, and doesn't lose
> any guest information as dropping error GFNs just means the guest will take
> more page fault exits.
> 
> > > One thought would be to use the index to handle the case of no error gfns so
> > > that the size of the array doesn't affect lookup for the common case, e.g.
> > 
> > We are calculating hash of gfn (used as index in array). So lookup cost
> > is not driven by size of array. Its O(1) and not O(N). We just lookup
> > at one element in array and if it does not match, we return false.
> > 
> > 	u32 key = kvm_error_gfn_hash_fn(gfn);
> > 
> > 	if (vcpu->arch.apf.error_gfns[key] != gfn)
> > 		return 0;
> > 
> > 
> > > 
> > > 	...
> > > 
> > > 		u8 error_gfn_head;
> > > 		gfn_t error_gfns[ERROR_GFN_PER_VCPU];
> > > 	} apf;	
> > > 
> > > 
> > > 	if (vcpu->arch.apf.error_gfn_head == 0xff)
> > > 		return false;
> > > 
> > > 	for (i = 0; i < vcpu->arch.apf.error_gfn_head; i++) {
> > > 		if (vcpu->arch.apf.error_gfns[i] == gfn) {
> > > 			vcpu->arch.apf.error_gfns[i] = INVALID_GFN;
> > > 			return true;
> > > 		}
> > > 	}
> > > 	return true;
> > > 
> > > Or you could even avoid INVALID_GFN altogether by compacting the array on
> > > removal.  The VM is probably dead anyways, burning a few cycles to clean
> > > things up is a non-issue.  Ditto for slow insertion.
> > 
> > Same for insertion. Its O(1) and not dependent on size of array. Also
> > insertion anyway is very infrequent event because it will not be
> > often that error will happen.
> 
> I know, I was saying that if we move to an array then we'd technically be
> subject to O(whatever) search and delete, but that it's irrelevant from a
> performance perspective because the guest is already toast if it hits a bad
> GFN.
> 
> > > > +
> > > > +	if (vcpu->arch.apf.error_gfns[key] != gfn)
> > > > +		return 0;
> > > 
> > > Should be "return false".
> > 
> > Will fix.
> > 
> > Thanks
> > Vivek
> > 
>
Sean Christopherson Oct. 2, 2020, 6:30 p.m. UTC | #10
On Fri, Oct 02, 2020 at 11:38:54AM -0400, Vivek Goyal wrote:
> On Thu, Oct 01, 2020 at 03:33:20PM -0700, Sean Christopherson wrote:
> > Alternatively, what about adding a new KVM request type to handle this?
> > E.g. when the APF comes back with -EFAULT, snapshot the GFN and make a
> > request.  The vCPU then gets kicked and exits to userspace.  Before exiting
> > to userspace, the request handler resets vcpu->arch.apf.error_gfn.  Bad GFNs
> > simply get if error_gfn is "valid", i.e. there's a pending request.
> 
> Sorry, I did not understand the above proposal. Can you please elaborate
> a bit more. Part of it is that I don't know much about KVM requests.
> Looking at the code it looks like that main loop is parsing if some
> kvm request is pending and executing that action.
> 
> Don't we want to make sure that we exit to user space when guest retries
> error gfn access again.

> In this case once we get -EFAULT, we will still inject page_ready into
> guest. And then either same process or a different process might run. 
> 
> So when exactly code raises a kvm request. If I raise it right when
> I get -EFAULT, then kvm will exit to user space upon next entry
> time. But there is no guarantee guest vcpu is running the process which
> actually accessed the error gfn. And that probably means that register
> state of cpu does not mean much and one can not easily figure out
> which task tried to access the bad memory and when.
> 
> That's why we prepare a list of error gfn and only exit to user space
> when error_gfn access is retried so that guest vcpu context is correct.
> 
> What am I missing?

I don't think it's necessary to provide userspace with the register state of
the guest task that hit the bad page.  Other than debugging, I don't see how
userspace can do anything useful with such information.

Even if you want to inject an event of some form into the guest, having the
correct context for the event itself is not required.  IMO it's perfectly
reasonable for such an event to be asynchronous.

IIUC, your end goal is to be able to gracefully handle DAX file truncation.
Simply killing the guest task that hit the bad page isn't sufficient, as
nothing prevents a future task from accessing the same bad page.  To fully
handle the situation, the guest needs to remove the bad page from its memory
pool.  Once the page is offlined, the guest kernel's error handling will
kick in when a task accesses the bad page (or nothing ever touches the bad
page again and everyone is happy).

Note, I'm not necessarily suggesting that QEMU piggyback its #MC injection
to handle this, but I suspect the resulting behavior will look quite similar,
e.g. notify the virtiofs driver in the guest, which does some magic to take
the offending region offline, and then guest tasks get SIGBUS or whatever.

I also don't think it's KVM's responsibility to _directly_ handle such a
scenario.  As I said in an earlier version, KVM can't possibly know _why_ a
page fault came back with -EFAULT, only userspace can connect the dots of
GPA -> HVA -> vm_area_struct -> file -> inject event.  KVM definitely should
exit to userspace on the -EFAULT instead of hanging the guest, but that can
be done via a new request, as suggested.
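
Roughly, that request-based flow could look like the below; all names here
are illustrative, not existing KVM symbols:

	/* async_pf completion path, when the fault came back with an error */
	vcpu->arch.apf.error_gfn = gpa_to_gfn(work->cr2_or_gpa);
	kvm_make_request(KVM_REQ_APF_ERROR, vcpu);	/* hypothetical request */
	kvm_vcpu_kick(vcpu);

	/* vcpu_enter_guest(), request processing before reentering the guest */
	if (kvm_check_request(KVM_REQ_APF_ERROR, vcpu)) {
		vcpu->run->exit_reason = KVM_EXIT_UNKNOWN;	/* or a new exit reason */
		r = -EFAULT;
		goto out;
	}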
Vivek Goyal Oct. 2, 2020, 7:27 p.m. UTC | #11
On Fri, Oct 02, 2020 at 11:30:37AM -0700, Sean Christopherson wrote:
> On Fri, Oct 02, 2020 at 11:38:54AM -0400, Vivek Goyal wrote:
> > On Thu, Oct 01, 2020 at 03:33:20PM -0700, Sean Christopherson wrote:
> > > Alternatively, what about adding a new KVM request type to handle this?
> > > E.g. when the APF comes back with -EFAULT, snapshot the GFN and make a
> > > request.  The vCPU then gets kicked and exits to userspace.  Before exiting
> > > to userspace, the request handler resets vcpu->arch.apf.error_gfn.  Bad GFNs
> > > simply get if error_gfn is "valid", i.e. there's a pending request.
> > 
> > Sorry, I did not understand the above proposal. Can you please elaborate
> > a bit more. Part of it is that I don't know much about KVM requests.
> > Looking at the code it looks like that main loop is parsing if some
> > kvm request is pending and executing that action.
> > 
> > Don't we want to make sure that we exit to user space when guest retries
> > error gfn access again.
> 
> > In this case once we get -EFAULT, we will still inject page_ready into
> > guest. And then either same process or a different process might run. 
> > 
> > So when exactly code raises a kvm request. If I raise it right when
> > I get -EFAULT, then kvm will exit to user space upon next entry
> > time. But there is no guarantee guest vcpu is running the process which
> > actually accessed the error gfn. And that probably means that register
> > state of cpu does not mean much and one can not easily figure out
> > which task tried to access the bad memory and when.
> > 
> > That's why we prepare a list of error gfn and only exit to user space
> > when error_gfn access is retried so that guest vcpu context is correct.
> > 
> > What am I missing?
> 
> I don't think it's necessary to provide userspace with the register state of
> the guest task that hit the bad page.  Other than debugging, I don't see how
> userspace can do anything useful which such information.

I think debugging is the whole point, so that the user can figure out which
access by a guest task resulted in the bad memory access. I would think this
will be an important piece of information.

> 
> Even if you want to inject an event of some form into the guest, having the
> correct context for the event itself is not required.  IMO it's perfectly
> reasonable for such an event to be asynchronous.
> 
> IIUC, your end goal is to be able to gracefully handle DAX file truncation.
> Simply killing the guest task that hit the bad page isn't sufficient, as
> nothing prevents a future task from accessing the same bad page.

The next task can't even mmap that page; the mmap will fail. The file got
truncated, so that page does not exist.

So sending SIGBUS to the task should definitely solve the problem. We also
need to solve the issue of the guest kernel accessing a page which got
truncated on the host. In that case we need to use the correct memcpy helpers
and exception table magic, and return an error code to user space.
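
A rough sketch of that guest-side handling, assuming a fault/machine-check
safe copy helper (memcpy_mcsafe()/copy_mc_to_kernel() style; the helper name
dax_window_copy() is made up here, not actual virtiofs code):

	static int dax_window_copy(void *dst, const void *src, size_t len)
	{
		/* A page that vanished on the host becomes -EIO, not a crash */
		if (copy_mc_to_kernel(dst, src, len))
			return -EIO;
		return 0;
	}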

> To fully
> handle the situation, the guest needs to remove the bad page from its memory
> pool.  Once the page is offlined, the guest kernel's error handling will
> kick in when a task accesses the bad page (or nothing ever touches the bad
> page again and everyone is happy).

This is not really a case of a bad page as such. It is more a case of a page
gone missing/truncated, and no new user can map it. We just need to
worry about existing users who already have it mapped.

> 
> Note, I'm not necessarily suggesting that QEMU piggyback its #MC injection
> to handle this, but I suspect the resulting behavior will look quite similar,
> e.g. notify the virtiofs driver in the guest, which does some magic to take
> the offending region offline, and then guest tasks get SIGBUS or whatever.
> 
> I also don't think it's KVM's responsibility to _directly_ handle such a
> scenario.  As I said in an earlier version, KVM can't possibly know _why_ a
> page fault came back with -EFAULT, only userspace can connect the dots of
> GPA -> HVA -> vm_area_struct -> file -> inject event.  KVM definitely should
> exit to userspace on the -EFAULT instead of hanging the guest, but that can
> be done via a new request, as suggested.

KVM at least should have a mechanism to report this back to the guest, and
we don't have any. There only seems to be the #MC stuff for poisoned pages.
I am not sure how much we can build on top of the #MC machinery to take care
of this case. One problem with #MC I found is that it generates a
synchronous #MC only on loads and not on stores, so all the code is
written assuming a synchronous #MC can happen only on a load, and the
error handling follows from that.

Stores generate a different kind of #MC, and asynchronously at that, so the
caller will not know about it immediately. But in this case we need to know
about the error in the context of the caller, for both loads and stores.

Anyway, I agree that this patch does not solve the problem of race-free
synchronous event injection into the guest. That's something yet
to be solved either by #VE or by #MC or by #foo.

This patch is only doing two things.

- Because we don't have a mechanism to report errors to the guest, use
  the second best method and exit to user space.

- Make behavior consistent between synchronous and async faults.
  Currently sync faults exit to user space on error while async
  faults spin infinitely.

Once we have a method to report errors back to the guest, we should report
the error to the guest first, and only in the absence of such a mechanism
exit to user space.

Thanks
Vivek
Sean Christopherson Oct. 2, 2020, 7:45 p.m. UTC | #12
On Fri, Oct 02, 2020 at 03:27:34PM -0400, Vivek Goyal wrote:
> On Fri, Oct 02, 2020 at 11:30:37AM -0700, Sean Christopherson wrote:
> > On Fri, Oct 02, 2020 at 11:38:54AM -0400, Vivek Goyal wrote:
> > > On Thu, Oct 01, 2020 at 03:33:20PM -0700, Sean Christopherson wrote:
> > > > Alternatively, what about adding a new KVM request type to handle this?
> > > > E.g. when the APF comes back with -EFAULT, snapshot the GFN and make a
> > > > request.  The vCPU then gets kicked and exits to userspace.  Before exiting
> > > > to userspace, the request handler resets vcpu->arch.apf.error_gfn.  Bad GFNs
> > > > simply get if error_gfn is "valid", i.e. there's a pending request.
> > > 
> > > Sorry, I did not understand the above proposal. Can you please elaborate
> > > a bit more. Part of it is that I don't know much about KVM requests.
> > > Looking at the code it looks like that main loop is parsing if some
> > > kvm request is pending and executing that action.
> > > 
> > > Don't we want to make sure that we exit to user space when guest retries
> > > error gfn access again.
> > 
> > > In this case once we get -EFAULT, we will still inject page_ready into
> > > guest. And then either same process or a different process might run. 
> > > 
> > > So when exactly code raises a kvm request. If I raise it right when
> > > I get -EFAULT, then kvm will exit to user space upon next entry
> > > time. But there is no guarantee guest vcpu is running the process which
> > > actually accessed the error gfn. And that probably means that register
> > > state of cpu does not mean much and one can not easily figure out
> > > which task tried to access the bad memory and when.
> > > 
> > > That's why we prepare a list of error gfn and only exit to user space
> > > when error_gfn access is retried so that guest vcpu context is correct.
> > > 
> > > What am I missing?
> > 
> > I don't think it's necessary to provide userspace with the register state of
> > the guest task that hit the bad page.  Other than debugging, I don't see how
> > userspace can do anything useful which such information.
> 
> I think debugging is the whole point so that user can figure out which
> access by guest task resulted in bad memory access. I would think this
> will be important piece of information.

But isn't this failure due to a truncation in the host?  Why would we care
about debugging the guest?  It hasn't done anything wrong, has it?  Or am I
misunderstanding the original problem statement?

> > To fully handle the situation, the guest needs to remove the bad page from
> > its memory pool.  Once the page is offlined, the guest kernel's error
> > handling will kick in when a task accesses the bad page (or nothing ever
> > touches the bad page again and everyone is happy).
> 
> This is not really a case of bad page as such. It is more of a page
> gone missing/trucated. And no new user can map it. We just need to
> worry about existing users who already have it mapped.

What do you mean by "no new user can map it"?  Are you talking about guest
tasks or host tasks?  If guest tasks, how would the guest know the page is
missing and thus prevent mapping the non-existent page?

> > Note, I'm not necessarily suggesting that QEMU piggyback its #MC injection
> > to handle this, but I suspect the resulting behavior will look quite similar,
> > e.g. notify the virtiofs driver in the guest, which does some magic to take
> > the offending region offline, and then guest tasks get SIGBUS or whatever.
> > 
> > I also don't think it's KVM's responsibility to _directly_ handle such a
> > scenario.  As I said in an earlier version, KVM can't possibly know _why_ a
> > page fault came back with -EFAULT, only userspace can connect the dots of
> > GPA -> HVA -> vm_area_struct -> file -> inject event.  KVM definitely should
> > exit to userspace on the -EFAULT instead of hanging the guest, but that can
> > be done via a new request, as suggested.
> 
> KVM atleast should have the mechanism to report this back to guest. And
> we don't have any. There only seems to be #MC stuff for poisoned pages.
> I am not sure how much we can build on top of #MC stuff to take care
> of this case. One problem with #MC I found was that it generates
> synchronous #MC only on load and not store. So all the code is
> written in such a way that synchronous #MC can happen only on load
> and hence the error handling. 
> 
> Stores generate different kind of #MC that too asynchronously and
> caller will not know about it immiditely. But in this case we need
> to know about error in the context of caller both for loads and stores.
> 
> Anyway, I agree that this patch does not solve the problem of race
> free synchronous event inject into the guest. That's something yet
> to be solved either by #VE or by #MC or by #foo.
> 
> This patch is only doing two things.
> 
> - Because we don't have a mechanism to report errors to guest, use
>   the second best method and exit to user space.

Not that it matters at this point, but IMO exiting to userspace is the
correct behavior, not simply "second best method".  Again, KVM by design
does not know what lies behind any given HVA, and therefore should not be
making decisions about how to handle bad HVAs.
 
> - Make behavior consistent between synchronous fault and async faults.
>   Currently sync faults exit to user space on error while async
>    faults spin infinitely.

Yes, and this part can be done with a new request type.  Adding a new
request doesn't commit KVM to any ABI changes, e.g. userspace will still
see an -EFAULT return, and nothing more.  I.e. handling this via request
doesn't prevent switching to synchronous exits in the future, if such a
feature is required.

> Once we have a method to report errors back to guest, then we first
> should report error back to guest. And only in the absense of such
> mechanism we should exit to user space.

I don't see the value in extending KVM's ABI to avoid the exit to userspace.
E.g. we could add a memslot flag to autoreflect -EFAULT as an event of some
form, but this is a slow path, the extra time should be a non-issue.
Vivek Goyal Oct. 2, 2020, 8:02 p.m. UTC | #13
On Fri, Oct 02, 2020 at 12:45:18PM -0700, Sean Christopherson wrote:
> On Fri, Oct 02, 2020 at 03:27:34PM -0400, Vivek Goyal wrote:
> > On Fri, Oct 02, 2020 at 11:30:37AM -0700, Sean Christopherson wrote:
> > > On Fri, Oct 02, 2020 at 11:38:54AM -0400, Vivek Goyal wrote:
> > > > On Thu, Oct 01, 2020 at 03:33:20PM -0700, Sean Christopherson wrote:
> > > > > Alternatively, what about adding a new KVM request type to handle this?
> > > > > E.g. when the APF comes back with -EFAULT, snapshot the GFN and make a
> > > > > request.  The vCPU then gets kicked and exits to userspace.  Before exiting
> > > > > to userspace, the request handler resets vcpu->arch.apf.error_gfn.  Bad GFNs
> > > > > simply get if error_gfn is "valid", i.e. there's a pending request.
> > > > 
> > > > Sorry, I did not understand the above proposal. Can you please elaborate
> > > > a bit more. Part of it is that I don't know much about KVM requests.
> > > > Looking at the code it looks like that main loop is parsing if some
> > > > kvm request is pending and executing that action.
> > > > 
> > > > Don't we want to make sure that we exit to user space when guest retries
> > > > error gfn access again.
> > > 
> > > > In this case once we get -EFAULT, we will still inject page_ready into
> > > > guest. And then either same process or a different process might run. 
> > > > 
> > > > So when exactly code raises a kvm request. If I raise it right when
> > > > I get -EFAULT, then kvm will exit to user space upon next entry
> > > > time. But there is no guarantee guest vcpu is running the process which
> > > > actually accessed the error gfn. And that probably means that register
> > > > state of cpu does not mean much and one can not easily figure out
> > > > which task tried to access the bad memory and when.
> > > > 
> > > > That's why we prepare a list of error gfn and only exit to user space
> > > > when error_gfn access is retried so that guest vcpu context is correct.
> > > > 
> > > > What am I missing?
> > > 
> > > I don't think it's necessary to provide userspace with the register state of
> > > the guest task that hit the bad page.  Other than debugging, I don't see how
> > > userspace can do anything useful which such information.
> > 
> > I think debugging is the whole point so that user can figure out which
> > access by guest task resulted in bad memory access. I would think this
> > will be important piece of information.
> 
> But isn't this failure due to a truncation in the host?  Why would we care
> about debugging the guest?  It hasn't done anything wrong, has it?  Or am I
> misunderstanding the original problem statement.

I think you understood the problem statement right. If the guest has the right
context, it just gives additional information about who tried to access
the missing memory page.

> 
> > > To fully handle the situation, the guest needs to remove the bad page from
> > > its memory pool.  Once the page is offlined, the guest kernel's error
> > > handling will kick in when a task accesses the bad page (or nothing ever
> > > touches the bad page again and everyone is happy).
> > 
> > This is not really a case of bad page as such. It is more of a page
> > gone missing/trucated. And no new user can map it. We just need to
> > worry about existing users who already have it mapped.
> 
> What do you mean by "no new user can map it"?  Are you talking about guest
> tasks or host tasks?  If guest tasks, how would the guest know the page is
> missing and thus prevent mapping the non-existent page?

If a new task wants to mmap(), it will send a request to virtiofsd/qemu
on the host. If the file has been truncated, then mapping beyond the file size
will fail and the process will get an error. So it will not be able to
map a page which has been truncated.

> 
> > > Note, I'm not necessarily suggesting that QEMU piggyback its #MC injection
> > > to handle this, but I suspect the resulting behavior will look quite similar,
> > > e.g. notify the virtiofs driver in the guest, which does some magic to take
> > > the offending region offline, and then guest tasks get SIGBUS or whatever.
> > > 
> > > I also don't think it's KVM's responsibility to _directly_ handle such a
> > > scenario.  As I said in an earlier version, KVM can't possibly know _why_ a
> > > page fault came back with -EFAULT, only userspace can connect the dots of
> > > GPA -> HVA -> vm_area_struct -> file -> inject event.  KVM definitely should
> > > exit to userspace on the -EFAULT instead of hanging the guest, but that can
> > > be done via a new request, as suggested.
> > 
> > KVM atleast should have the mechanism to report this back to guest. And
> > we don't have any. There only seems to be #MC stuff for poisoned pages.
> > I am not sure how much we can build on top of #MC stuff to take care
> > of this case. One problem with #MC I found was that it generates
> > synchronous #MC only on load and not store. So all the code is
> > written in such a way that synchronous #MC can happen only on load
> > and hence the error handling. 
> > 
> > Stores generate different kind of #MC that too asynchronously and
> > caller will not know about it immiditely. But in this case we need
> > to know about error in the context of caller both for loads and stores.
> > 
> > Anyway, I agree that this patch does not solve the problem of race
> > free synchronous event inject into the guest. That's something yet
> > to be solved either by #VE or by #MC or by #foo.
> > 
> > This patch is only doing two things.
> > 
> > - Because we don't have a mechanism to report errors to guest, use
> >   the second best method and exit to user space.
> 
> Not that it matters at this point, but IMO exiting to userspace is the
> correct behavior, not simply "second best method".  Again, KVM by design
> does not know what lies behind any given HVA, and therefore should not be
> making decisions about how to handle bad HVAs.
>  
> > - Make behavior consistent between synchronous fault and async faults.
> >   Currently sync faults exit to user space on error while async
> >    faults spin infinitely.
> 
> Yes, and this part can be done with a new request type.  Adding a new
> request doesn't commit KVM to any ABI changes, e.g. userspace will still
> see an -EFAULT return, and nothing more.  I.e. handling this via request
> doesn't prevent switching to synchronous exits in the future, if such a
> feature is required.

I am really not sure what benefit this kvm request is bringing to the
table. To me, maintaining a hash table and exiting when the guest retries
is a much nicer design. In fact, once we have a mechanism to
inject errors into the guest using an exception, we can easily plug
that into the same path.

If memory usage is a concern, we can reduce the hash table size to say
4 or 8.

How is this change committing KVM to an ABI?

> 
> > Once we have a method to report errors back to guest, then we first
> > should report error back to guest. And only in the absense of such
> > mechanism we should exit to user space.
> 
> I don't see the value in extending KVM's ABI to avoid the exit to userspace.
> E.g. we could add a memslot flag to autoreflect -EFAULT as an event of some
> form, but this is a slow path, the extra time should be a non-issue.

IIUC, you are saying that this becomes a property of the memory region. Some
memory regions can choose to have their errors reported back to the guest
(using some kind of flag in the memslot), and in the absence of such a flag,
the default behavior will continue to be an exit to user space?

I guess that's fine if we don't want to treat all the memory areas the
same way. I can't think why reporting errors for all areas to the guest
is a bad idea. Let the guest either handle the error or die if it can't
handle it. But that discussion is for some other day. Our first problem
is that we don't have a race-free mechanism to report errors.

Thanks
Vivek
Sean Christopherson Oct. 2, 2020, 9:13 p.m. UTC | #14
On Fri, Oct 02, 2020 at 04:02:14PM -0400, Vivek Goyal wrote:
> On Fri, Oct 02, 2020 at 12:45:18PM -0700, Sean Christopherson wrote:
> > On Fri, Oct 02, 2020 at 03:27:34PM -0400, Vivek Goyal wrote:
> > > On Fri, Oct 02, 2020 at 11:30:37AM -0700, Sean Christopherson wrote:
> > > > On Fri, Oct 02, 2020 at 11:38:54AM -0400, Vivek Goyal wrote:
> > > > I don't think it's necessary to provide userspace with the register state of
> > > > the guest task that hit the bad page.  Other than debugging, I don't see how
> > > > userspace can do anything useful which such information.
> > > 
> > > I think debugging is the whole point so that user can figure out which
> > > access by guest task resulted in bad memory access. I would think this
> > > will be important piece of information.
> > 
> > But isn't this failure due to a truncation in the host?  Why would we care
> > about debugging the guest?  It hasn't done anything wrong, has it?  Or am I
> > misunderstanding the original problem statement.
> 
> I think you understood problem statement right. If guest has right
> context, it just gives additional information who tried to access
> the missing memory page. 

Yes, but it's not actionable, e.g. QEMU can't do anything differently given
a guest RIP.  It's useful information for hands-on debug, but the information
can be easily collected through other means when doing hands-on debug.
 
> > > > To fully handle the situation, the guest needs to remove the bad page from
> > > > its memory pool.  Once the page is offlined, the guest kernel's error
> > > > handling will kick in when a task accesses the bad page (or nothing ever
> > > > touches the bad page again and everyone is happy).
> > > 
> > > This is not really a case of bad page as such. It is more of a page
> > > gone missing/trucated. And no new user can map it. We just need to
> > > worry about existing users who already have it mapped.
> > 
> > What do you mean by "no new user can map it"?  Are you talking about guest
> > tasks or host tasks?  If guest tasks, how would the guest know the page is
> > missing and thus prevent mapping the non-existent page?
> 
> If a new task wants mmap(), it will send a request to virtiofsd/qemu
> on host. If file has been truncated, then mapping beyond file size
> will fail and process will get error.  So they will not be able to
> map a page which has been truncated.

Ah.  Is there anything that prevents the notification side of things from
being handled purely within the virtiofs layer?  E.g. host notifies the guest
that a file got truncated, virtiofs driver in the guest invokes a kernel API
to remove the page(s).

> > > > Note, I'm not necessarily suggesting that QEMU piggyback its #MC injection
> > > > to handle this, but I suspect the resulting behavior will look quite similar,
> > > > e.g. notify the virtiofs driver in the guest, which does some magic to take
> > > > the offending region offline, and then guest tasks get SIGBUS or whatever.
> > > > 
> > > > I also don't think it's KVM's responsibility to _directly_ handle such a
> > > > scenario.  As I said in an earlier version, KVM can't possibly know _why_ a
> > > > page fault came back with -EFAULT, only userspace can connect the dots of
> > > > GPA -> HVA -> vm_area_struct -> file -> inject event.  KVM definitely should
> > > > exit to userspace on the -EFAULT instead of hanging the guest, but that can
> > > > be done via a new request, as suggested.
> > > 
> > > KVM atleast should have the mechanism to report this back to guest. And
> > > we don't have any. There only seems to be #MC stuff for poisoned pages.
> > > I am not sure how much we can build on top of #MC stuff to take care
> > > of this case. One problem with #MC I found was that it generates
> > > synchronous #MC only on load and not store. So all the code is
> > > written in such a way that synchronous #MC can happen only on load
> > > and hence the error handling. 
> > > 
> > > Stores generate different kind of #MC that too asynchronously and
> > > caller will not know about it immiditely. But in this case we need
> > > to know about error in the context of caller both for loads and stores.
> > > 
> > > Anyway, I agree that this patch does not solve the problem of race
> > > free synchronous event inject into the guest. That's something yet
> > > to be solved either by #VE or by #MC or by #foo.
> > > 
> > > This patch is only doing two things.
> > > 
> > > - Because we don't have a mechanism to report errors to guest, use
> > >   the second best method and exit to user space.
> > 
> > Not that it matters at this point, but IMO exiting to userspace is the
> > correct behavior, not simply "second best method".  Again, KVM by design
> > does not know what lies behind any given HVA, and therefore should not be
> > making decisions about how to handle bad HVAs.
> >  
> > > - Make behavior consistent between synchronous fault and async faults.
> > >   Currently sync faults exit to user space on error while async
> > >    faults spin infinitely.
> > 
> > Yes, and this part can be done with a new request type.  Adding a new
> > request doesn't commit KVM to any ABI changes, e.g. userspace will still
> > see an -EFAULT return, and nothing more.  I.e. handling this via request
> > doesn't prevent switching to synchronous exits in the future, if such a
> > feature is required.
> 
> I am really not sure what benfit this kvm request is bringing to the
> table. To me maintaining a hash table and exiting when guest retries
> is much nicer desgin.

8 bytes per vCPU versus 512 bytes per vCPU, with no guesswork.  I.e. the
request can guarantee the first access to a truncated file will be reported
to userspace.

> In fact, once we have the mechanism to inject error into the guest using an
> exception, We can easily plug that into the same path.

You're blurring the two things together.  The first step is to fix the
infinite loop by exiting to userspace with -EFAULT.  Notifying the guest of
the truncated file is a completely different problem to solve.  Reporting
-EFAULT to userspace is a pure bug fix, and is not unique to DAX.

> If memory usage is a concern, we can reduce the hash table size to say
> 4 or 8.
> 
> How is this change commiting KVM to an ABI?

I didn't mean to imply this patch changes the ABI.  What I was trying to say
is that going with the request-based implementation doesn't commit KVM to
new behavior, e.g. we can yank out the request implementation in favor of
something else if the need arises.

> > > Once we have a method to report errors back to guest, then we first
> > > should report error back to guest. And only in the absense of such
> > > mechanism we should exit to user space.
> > 
> > I don't see the value in extending KVM's ABI to avoid the exit to userspace.
> > E.g. we could add a memslot flag to autoreflect -EFAULT as an event of some
> > form, but this is a slow path, the extra time should be a non-issue.
> 
> IIUC, you are saying that this becomes the property of memory region. Some
> memory regions can choose their errors being reported back to guest
> (using some kind of flag in memslot). And in the absense of such flag,
> default behavior will continue to be exit to user space?
> 
> I guess that's fine if we don't want to treat all the memory areas in
> same way. I can't think why reporting errors for all areas to guest
> is a bad idea.

Because it'd be bleeding host information to the guest.  E.g. if there's
a host kernel bug that causes gup() to fail, then KVM would inject a random
fault into the guest, which is all kinds of bad.

> Let guest either handle the error or die if it can't
> handle it. But that discussion is for some other day. Our first problem
> is that we don't have a race free mechanism to report errors.
> 
> Thanks
> Vivek
>
Vivek Goyal Oct. 5, 2020, 3:33 p.m. UTC | #15
On Fri, Oct 02, 2020 at 02:13:14PM -0700, Sean Christopherson wrote:
> On Fri, Oct 02, 2020 at 04:02:14PM -0400, Vivek Goyal wrote:
> > On Fri, Oct 02, 2020 at 12:45:18PM -0700, Sean Christopherson wrote:
> > > On Fri, Oct 02, 2020 at 03:27:34PM -0400, Vivek Goyal wrote:
> > > > On Fri, Oct 02, 2020 at 11:30:37AM -0700, Sean Christopherson wrote:
> > > > > On Fri, Oct 02, 2020 at 11:38:54AM -0400, Vivek Goyal wrote:
> > > > > I don't think it's necessary to provide userspace with the register state of
> > > > > the guest task that hit the bad page.  Other than debugging, I don't see how
> > > > > userspace can do anything useful which such information.
> > > > 
> > > > I think debugging is the whole point so that user can figure out which
> > > > access by guest task resulted in bad memory access. I would think this
> > > > will be important piece of information.
> > > 
> > > But isn't this failure due to a truncation in the host?  Why would we care
> > > about debugging the guest?  It hasn't done anything wrong, has it?  Or am I
> > > misunderstanding the original problem statement.
> > 
> > I think you understood problem statement right. If guest has right
> > context, it just gives additional information who tried to access
> > the missing memory page. 
> 
> Yes, but it's not actionable, e.g. QEMU can't do anything differently given
> a guest RIP.  It's useful information for hands-on debug, but the information
> can be easily collected through other means when doing hands-on debug.

Hi Sean,

I tried my patch and truncated the file on the host before the guest did
memcpy(). After truncation the guest process tried memcpy() on the truncated
region and kvm exited to user space with -EFAULT. I see the following on the
serial console.

I am assuming qemu is printing the state of the vcpu.

************************************************************
error: kvm run failed Bad address
RAX=00007fff6e7a9750 RBX=0000000000000000 RCX=00007f513927e000 RDX=000000000000a
RSI=00007f513927e000 RDI=00007fff6e7a9750 RBP=00007fff6e7a97b0 RSP=00007fff6e7a8
R8 =0000000000000000 R9 =0000000000000031 R10=00007fff6e7a957c R11=0000000000006
R12=0000000000401140 R13=0000000000000000 R14=0000000000000000 R15=0000000000000
RIP=00007f51391e0547 RFL=00010202 [-------] CPL=3 II=0 A20=1 SMM=0 HLT=0
ES =0000 0000000000000000 ffffffff 00c00000
CS =0033 0000000000000000 ffffffff 00a0fb00 DPL=3 CS64 [-RA]
SS =002b 0000000000000000 ffffffff 00c0f300 DPL=3 DS   [-WA]
DS =0000 0000000000000000 ffffffff 00c00000
FS =0000 00007f5139246540 ffffffff 00c00000
GS =0000 0000000000000000 ffffffff 00c00000
LDT=0000 0000000000000000 00000000 00000000
TR =0040 fffffe00003a6000 00004087 00008b00 DPL=0 TSS64-busy
GDT=     fffffe00003a4000 0000007f
IDT=     fffffe0000000000 00000fff
CR0=80050033 CR2=00007f513927e004 CR3=000000102b5eb805 CR4=00770ee0
DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000 DR3=000000000000
DR6=00000000fffe0ff0 DR7=0000000000000400
EFER=0000000000000d01
Code=fa 6f 06 c5 fa 6f 4c 16 f0 c5 fa 7f 07 c5 fa 7f 4c 17 f0 c3 <48> 8b 4c 16 3
*****************************************************************

I also changed my test program to print the source and destination addresses
for memcpy.

dst=0x0x7fff6e7a9750 src=0x0x7f513927e000

Here dst matches RDI and src matches RSI. This trace also tells me
CPL=3 so a user space access triggered this.

Now I have a few questions.

- If we exit to user space asynchronously (using a kvm request), what debug
  information is in there which tells the user which address is bad? I admit
  that even the above trace does not seem to be telling me directly which
  address (HVA?) is bad.

  But if I take a crash dump of the guest, using the above information I should
  be able to get to the GPA which is problematic. And looking at /proc/iomem
  should also tell which device this memory region is in.

  Also, using this crash dump one should be able to walk through virtiofs data
  structures and figure out which file and what offset within the file it
  belongs to. Now one can look at the filesystem on the host, see the file got
  truncated, and it will become obvious it can't be faulted in. And then
  one can continue to debug how we arrived here.

But if we don't exit to user space synchronously, the only relevant
information we seem to have is -EFAULT. Apart from that, how does one
figure out what address is bad, who tried to access it, or which
file/offset it belongs to?

I agree that the problem is not necessarily in guest code. But by exiting
synchronously, we give enough information that one can use a crash
dump to get to the bottom of the issue. If we exit to user space
asynchronously, all this information will be lost and it might make
it very hard (if not impossible) to figure out what's going on.

>  
> > > > > To fully handle the situation, the guest needs to remove the bad page from
> > > > > its memory pool.  Once the page is offlined, the guest kernel's error
> > > > > handling will kick in when a task accesses the bad page (or nothing ever
> > > > > touches the bad page again and everyone is happy).
> > > > 
> > > > This is not really a case of bad page as such. It is more of a page
> > > > gone missing/trucated. And no new user can map it. We just need to
> > > > worry about existing users who already have it mapped.
> > > 
> > > What do you mean by "no new user can map it"?  Are you talking about guest
> > > tasks or host tasks?  If guest tasks, how would the guest know the page is
> > > missing and thus prevent mapping the non-existent page?
> > 
> > If a new task wants mmap(), it will send a request to virtiofsd/qemu
> > on host. If file has been truncated, then mapping beyond file size
> > will fail and process will get error.  So they will not be able to
> > map a page which has been truncated.
> 
> Ah.  Is there anything that prevents the notification side of things from
> being handled purely within the virtiofs layer?  E.g. host notifies the guest
> that a file got truncated, virtiofs driver in the guest invokes a kernel API
> to remove the page(s).

virtiofsd notifications can help a bit but not in all cases. For example,
if the file got truncated and the guest kernel accesses it immediately after
that (before the notification arrives), it will hang and the notification
will not be able to do much.

So while notifications might be nice to have, we will still need some
sort of error reporting from kvm.

Thanks
Vivek
Sean Christopherson Oct. 5, 2020, 4:16 p.m. UTC | #16
On Mon, Oct 05, 2020 at 11:33:18AM -0400, Vivek Goyal wrote:
> On Fri, Oct 02, 2020 at 02:13:14PM -0700, Sean Christopherson wrote:
> Now I have few questions.
> 
> - If we exit to user space asynchronously (using kvm request), what debug
>   information is in there which tells user which address is bad. I admit
>   that even above trace does not seem to be telling me directly which
>   address (HVA?) is bad.
> 
>   But if I take a crash dump of guest, using above information I should
>   be able to get to GPA which is problematic. And looking at /proc/iomem
>   it should also tell which device this memory region is in.
> 
>   Also using this crash dump one should be able to walk through virtiofs data
>   structures and figure out which file and what offset with-in file does
>   it belong to. Now one can look at filesystem on host and see file got
>   truncated and it will become obvious it can't be faulted in. And then
>   one can continue to debug that how did we arrive here.
> 
> But if we don't exit to user space synchronously, Only relevant
> information we seem to have is -EFAULT. Apart from that, how does one
> figure out what address is bad, or who tried to access it. Or which
> file/offset does it belong to etc.
>
> I agree that problem is not necessarily in guest code. But by exiting
> synchronously, it gives enough information that one can use crash
> dump to get to bottom of the issue. If we exit to user space
> asynchronously, all this information will be lost and it might make
> it very hard to figure out (if not impossible), what's going on.

If we want userspace to be able to do something useful, KVM should explicitly
inform userspace about the error, userspace shouldn't simply assume that
-EFAULT means a HVA->PFN lookup failed.  Userspace also shouldn't have to
query guest state to handle the error, as that won't work for protected
guests like SEV-ES and TDX.

I can think of two options:

  1. Send a signal, a la kvm_send_hwpoison_signal().

  2. Add a userspace exit reason along with a new entry in the run struct,
     e.g. that provides the bad GPA, HVA, possibly permissions, etc...
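
For option 2, the uAPI addition could be shaped roughly like this (all of it
hypothetical, just to illustrate, not an existing KVM ABI):

	/* include/uapi/linux/kvm.h */
	struct kvm_run {
		...
		union {
			...
			/* KVM_EXIT_MEMORY_FAULT (new exit reason) */
			struct {
				__u64 gpa;
				__u64 hva;
				__u64 flags;	/* read/write/exec, etc. */
			} memory_fault;
		};
	};

	/* filled in when the bad access is detected */
	vcpu->run->exit_reason = KVM_EXIT_MEMORY_FAULT;
	vcpu->run->memory_fault.gpa = gfn_to_gpa(gfn);
	vcpu->run->memory_fault.hva = hva;
	return 0;	/* exit to userspace */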

> > > > > > To fully handle the situation, the guest needs to remove the bad page from
> > > > > > its memory pool.  Once the page is offlined, the guest kernel's error
> > > > > > handling will kick in when a task accesses the bad page (or nothing ever
> > > > > > touches the bad page again and everyone is happy).
> > > > > 
> > > > > This is not really a case of bad page as such. It is more of a page
> > > > > gone missing/trucated. And no new user can map it. We just need to
> > > > > worry about existing users who already have it mapped.
> > > > 
> > > > What do you mean by "no new user can map it"?  Are you talking about guest
> > > > tasks or host tasks?  If guest tasks, how would the guest know the page is
> > > > missing and thus prevent mapping the non-existent page?
> > > 
> > > If a new task wants mmap(), it will send a request to virtiofsd/qemu
> > > on host. If file has been truncated, then mapping beyond file size
> > > will fail and process will get error.  So they will not be able to
> > > map a page which has been truncated.
> > 
> > Ah.  Is there anything that prevents the notification side of things from
> > being handled purely within the virtiofs layer?  E.g. host notifies the guest
> > that a file got truncated, virtiofs driver in the guest invokes a kernel API
> > to remove the page(s).
> 
> virtiofsd notifications can help a bit but not in all cases. For example,
> If file got truncated and guest kernel accesses it immidiately after that,
> (before notification arrives), it will hang and notification will not
> be able to do much.
> 
> So while notification might be nice to have, but we still will need some
> sort of error reporting from kvm.

And I'm in full agreement with that.  What I'm saying is that resolving the
infinite loop is a _bug fix_ and is completely orthogonal to adding a new
mechanism to handle the file truncation scenario.
Vivek Goyal Oct. 6, 2020, 1:46 p.m. UTC | #17
On Mon, Oct 05, 2020 at 09:16:20AM -0700, Sean Christopherson wrote:
> On Mon, Oct 05, 2020 at 11:33:18AM -0400, Vivek Goyal wrote:
> > On Fri, Oct 02, 2020 at 02:13:14PM -0700, Sean Christopherson wrote:
> > Now I have few questions.
> > 
> > - If we exit to user space asynchronously (using kvm request), what debug
> >   information is in there which tells user which address is bad. I admit
> >   that even above trace does not seem to be telling me directly which
> >   address (HVA?) is bad.
> > 
> >   But if I take a crash dump of guest, using above information I should
> >   be able to get to GPA which is problematic. And looking at /proc/iomem
> >   it should also tell which device this memory region is in.
> > 
> >   Also using this crash dump one should be able to walk through virtiofs data
> >   structures and figure out which file and what offset with-in file does
> >   it belong to. Now one can look at filesystem on host and see file got
> >   truncated and it will become obvious it can't be faulted in. And then
> >   one can continue to debug that how did we arrive here.
> > 
> > But if we don't exit to user space synchronously, Only relevant
> > information we seem to have is -EFAULT. Apart from that, how does one
> > figure out what address is bad, or who tried to access it. Or which
> > file/offset does it belong to etc.
> >
> > I agree that problem is not necessarily in guest code. But by exiting
> > synchronously, it gives enough information that one can use crash
> > dump to get to bottom of the issue. If we exit to user space
> > asynchronously, all this information will be lost and it might make
> > it very hard to figure out (if not impossible), what's going on.
> 
> If we want userspace to be able to do something useful, KVM should explicitly
> inform userspace about the error, userspace shouldn't simply assume that
> -EFAULT means a HVA->PFN lookup failed.

I guess that's fine. But for this patch, user space is not doing anything.
It's just printing the -EFAULT error and dumping guest state (same as we do
in the case of a synchronous fault).

> Userspace also shouldn't have to
> query guest state to handle the error, as that won't work for protected guests
> guests like SEV-ES and TDX.

So qemu would not be able to dump vcpu register state when kvm returns
with -EFAULT for the case of SEV-ES and TDX?

> 
> I can think of two options:
> 
>   1. Send a signal, a la kvm_send_hwpoison_signal().

This works because -EHWPOISON is a special kind of error which is
different from -EFAULT. For truncation, even kvm gets -EFAULT.

        if (vm_fault & (VM_FAULT_HWPOISON | VM_FAULT_HWPOISON_LARGE))
                return (foll_flags & FOLL_HWPOISON) ? -EHWPOISON : -EFAULT;
        if (vm_fault & (VM_FAULT_SIGBUS | VM_FAULT_SIGSEGV))
                return -EFAULT;

Anyway, if -EFAULT is too generic and we need something finer grained,
that can be looked into when we actually have a method where kvm/qemu
injects errors into the guest.

> 
>   2. Add a userspace exit reason along with a new entry in the run struct,
>      e.g. that provides the bad GPA, HVA, possibly permissions, etc...

This sounds more reasonable to me. That is, kvm gives additional
information to qemu about the failing HVA and GPA along with -EFAULT, and
that can be helpful in debugging a problem. This seems like an extension
of the KVM API.

Even with this, if we want to figure out which file got truncated, we
will need to take a dump of the guest and try to figure out which file
this GPA is currently mapping (by looking at virtiofs data structures).
And that becomes a little easier if the vcpu is running the task which
accessed that GPA. Anyway, if we have the failing GPA, I think it should
be possible to figure out the inode even without the accessing task being
current on the vcpu.

So we seem to have 3 options.

A. Just exit to user space with -EFAULT (using a kvm request) and don't
   wait for the accessing task to run on the vcpu again.

B. Store the error gfn in a hash and exit to user space when the task
   accessing that gfn runs again.

C. Extend the KVM API and report the failing HVA/GFN accessed by the guest.
   That should allow not having to exit to user space synchronously.

Thanks
Vivek
Vitaly Kuznetsov Oct. 6, 2020, 2:05 p.m. UTC | #18
Vivek Goyal <vgoyal@redhat.com> writes:

> A. Just exit to user space with -EFAULT (using kvm request) and don't
>    wait for the accessing task to run on vcpu again. 

What if we also save the required information (RIP, GFN, ...) in the
guest along with the APF token so in case of -EFAULT we can just 'crash'
the guest and the required information can easily be obtained from
kdump? This will solve the debugging problem even for TDX/SEV-ES (if
kdump is possible there).
Vivek Goyal Oct. 6, 2020, 2:15 p.m. UTC | #19
On Tue, Oct 06, 2020 at 04:05:16PM +0200, Vitaly Kuznetsov wrote:
> Vivek Goyal <vgoyal@redhat.com> writes:
> 
> > A. Just exit to user space with -EFAULT (using kvm request) and don't
> >    wait for the accessing task to run on vcpu again. 
> 
> What if we also save the required information (RIP, GFN, ...) in the
> guest along with the APF token

Can you elaborate a bit more on this? You mean save the GFN on the stack
before it starts waiting for the PAGE_READY event?

> so in case of -EFAULT we can just 'crash'
> the guest and the required information can easily be obtained from
> kdump? This will solve the debugging problem even for TDX/SEV-ES (if
> kdump is possible there).

Just saving additional info in the guest will not help because there might
be many tasks waiting and you don't know which GFN is the problematic one.

Thanks
Vivek
Vitaly Kuznetsov Oct. 6, 2020, 2:50 p.m. UTC | #20
Vivek Goyal <vgoyal@redhat.com> writes:

> On Tue, Oct 06, 2020 at 04:05:16PM +0200, Vitaly Kuznetsov wrote:
>> Vivek Goyal <vgoyal@redhat.com> writes:
>> 
>> > A. Just exit to user space with -EFAULT (using kvm request) and don't
>> >    wait for the accessing task to run on vcpu again. 
>> 
>> What if we also save the required information (RIP, GFN, ...) in the
>> guest along with the APF token
>
> Can you elaborate a bit more on this. You mean save GFN on stack before
> it starts waiting for PAGE_READY event?

When the PAGE_NOT_PRESENT event is injected as a #PF (for now), the guest
kernel gets all the registers of the userspace process (except for CR2,
which is replaced with a token). In case it is not trivial to extract the
accessed GFN from this data, we can extend the shared APF structure and
add it there; KVM has it when it queues the APF.
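
E.g. something along these lines (hypothetical extension of the shared APF
data, not what is in the tree today):

	struct kvm_vcpu_pv_apf_data {
		__u32 flags;
		__u32 token;
		__u64 gfn;	/* new: gfn the APF was queued for */
		/* ... existing padding and 'enabled' ... */
	};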

>
>> so in case of -EFAULT we can just 'crash'
>> the guest and the required information can easily be obtained from
>> kdump? This will solve the debugging problem even for TDX/SEV-ES (if
>> kdump is possible there).
>
> Just saving additional info in guest will not help because there might
> be many tasks waiting and you don't know which GFN is problematic one.

But KVM knows which token caused the -EFAULT when we exit to userspace
(and we can pass this information to it), so to debug the situation you
take this token and then explore the kdump, searching for what's
associated with this exact token.
Vivek Goyal Oct. 6, 2020, 3:08 p.m. UTC | #21
On Tue, Oct 06, 2020 at 04:50:44PM +0200, Vitaly Kuznetsov wrote:
> Vivek Goyal <vgoyal@redhat.com> writes:
> 
> > On Tue, Oct 06, 2020 at 04:05:16PM +0200, Vitaly Kuznetsov wrote:
> >> Vivek Goyal <vgoyal@redhat.com> writes:
> >> 
> >> > A. Just exit to user space with -EFAULT (using kvm request) and don't
> >> >    wait for the accessing task to run on vcpu again. 
> >> 
> >> What if we also save the required information (RIP, GFN, ...) in the
> >> guest along with the APF token
> >
> > Can you elaborate a bit more on this. You mean save GFN on stack before
> > it starts waiting for PAGE_READY event?
> 
> When PAGE_NOT_PRESENT event is injected as #PF (for now) in the guest
> kernel gets all the registers of the userspace process (except for CR2
> which is replaced with a token). In case it is not trivial to extract
> accessed GFN from this data we can extend the shared APF structure and
> add it there, KVM has it when it queues APF.
> 
> >
> >> so in case of -EFAULT we can just 'crash'
> >> the guest and the required information can easily be obtained from
> >> kdump? This will solve the debugging problem even for TDX/SEV-ES (if
> >> kdump is possible there).
> >
> > Just saving additional info in guest will not help because there might
> > be many tasks waiting and you don't know which GFN is problematic one.
> 
> But KVM knows which token caused the -EFAULT when we exit to userspace
> (and we can pass this information to it) so to debug the situation you
> take this token and then explore the kdump searching for what's
> associated with this exact token.

So you will have to report the token (along with -EFAULT) to user space. So
this is basically the 3rd proposal, which is an extension of the kvm API and
will report, say, the HVA/GFN to user space along with -EFAULT.

Thanks
Vivek
Vitaly Kuznetsov Oct. 6, 2020, 3:24 p.m. UTC | #22
Vivek Goyal <vgoyal@redhat.com> writes:

> On Tue, Oct 06, 2020 at 04:50:44PM +0200, Vitaly Kuznetsov wrote:
>> Vivek Goyal <vgoyal@redhat.com> writes:
>> 
>> > On Tue, Oct 06, 2020 at 04:05:16PM +0200, Vitaly Kuznetsov wrote:
>> >> Vivek Goyal <vgoyal@redhat.com> writes:
>> >> 
>> >> > A. Just exit to user space with -EFAULT (using kvm request) and don't
>> >> >    wait for the accessing task to run on vcpu again. 
>> >> 
>> >> What if we also save the required information (RIP, GFN, ...) in the
>> >> guest along with the APF token
>> >
>> > Can you elaborate a bit more on this. You mean save GFN on stack before
>> > it starts waiting for PAGE_READY event?
>> 
>> When PAGE_NOT_PRESENT event is injected as #PF (for now) in the guest
>> kernel gets all the registers of the userspace process (except for CR2
>> which is replaced with a token). In case it is not trivial to extract
>> accessed GFN from this data we can extend the shared APF structure and
>> add it there, KVM has it when it queues APF.
>> 
>> >
>> >> so in case of -EFAULT we can just 'crash'
>> >> the guest and the required information can easily be obtained from
>> >> kdump? This will solve the debugging problem even for TDX/SEV-ES (if
>> >> kdump is possible there).
>> >
>> > Just saving additional info in guest will not help because there might
>> > be many tasks waiting and you don't know which GFN is problematic one.
>> 
>> But KVM knows which token caused the -EFAULT when we exit to userspace
>> (and we can pass this information to it) so to debug the situation you
>> take this token and then explore the kdump searching for what's
>> associated with this exact token.
>
> So you will have to report token (along with -EFAULT) to user space. So this
> is basically the 3rd proposal which is extension of kvm API and will
> report say HVA/GFN also to user space along with -EFAULT.
>

Right, I meant to say that the guest kernel has the full register state of the
userspace process which caused the APF to get queued, and instead of trying
to extract it in KVM and pass it to userspace in case of a (later) failure,
we limit the KVM API change to contain the token or GFN only and somehow keep
the rest in the guest. This should help with TDX/SEV-ES.
Sean Christopherson Oct. 6, 2020, 4:12 p.m. UTC | #23
On Tue, Oct 06, 2020 at 05:24:54PM +0200, Vitaly Kuznetsov wrote:
> Vivek Goyal <vgoyal@redhat.com> writes:
> > So you will have to report token (along with -EFAULT) to user space. So this
> > is basically the 3rd proposal which is extension of kvm API and will
> > report say HVA/GFN also to user space along with -EFAULT.
> 
> Right, I meant to say that guest kernel has full register state of the
> userspace process which caused APF to get queued and instead of trying
> to extract it in KVM and pass to userspace in case of a (later) failure
> we limit KVM api change to contain token or GFN only and somehow keep
> the rest in the guest. This should help with TDX/SEV-ES.

Whatever gets reported to userspace should be identical with and without
async page faults, i.e. it definitely shouldn't have token information.

Note, TDX doesn't allow injecting exceptions, so reflecting a #PF back
into the guest is not an option.  Nor do I think that's "correct"
behavior (see everyone's objections to using #PF for APF fixed).  I.e. the
event should probably be an IRQ.
Vivek Goyal Oct. 6, 2020, 4:24 p.m. UTC | #24
On Tue, Oct 06, 2020 at 09:12:00AM -0700, Sean Christopherson wrote:
> On Tue, Oct 06, 2020 at 05:24:54PM +0200, Vitaly Kuznetsov wrote:
> > Vivek Goyal <vgoyal@redhat.com> writes:
> > > So you will have to report token (along with -EFAULT) to user space. So this
> > > is basically the 3rd proposal which is extension of kvm API and will
> > > report say HVA/GFN also to user space along with -EFAULT.
> > 
> > Right, I meant to say that guest kernel has full register state of the
> > userspace process which caused APF to get queued and instead of trying
> > to extract it in KVM and pass to userspace in case of a (later) failure
> > we limit KVM api change to contain token or GFN only and somehow keep
> > the rest in the guest. This should help with TDX/SEV-ES.
> 
> Whatever gets reported to userspace should be identical with and without
> async page faults, i.e. it definitely shouldn't have token information.
> 
> Note, TDX doesn't allow injection exceptions, so reflecting a #PF back
> into the guest is not an option.  Nor do I think that's "correct"
> behavior (see everyone's objections to using #PF for APF fixed).  I.e. the
> event should probably be an IRQ.

I am not sure if an IRQ for "page not present" works. Will it have
conflicts/issues with other high-priority interrupts which can
get injected before the "page not present" event?

Vivek
Vitaly Kuznetsov Oct. 6, 2020, 4:39 p.m. UTC | #25
Sean Christopherson <sean.j.christopherson@intel.com> writes:

> On Tue, Oct 06, 2020 at 05:24:54PM +0200, Vitaly Kuznetsov wrote:
>> Vivek Goyal <vgoyal@redhat.com> writes:
>> > So you will have to report token (along with -EFAULT) to user space. So this
>> > is basically the 3rd proposal which is extension of kvm API and will
>> > report say HVA/GFN also to user space along with -EFAULT.
>> 
>> Right, I meant to say that guest kernel has full register state of the
>> userspace process which caused APF to get queued and instead of trying
>> to extract it in KVM and pass to userspace in case of a (later) failure
>> we limit KVM api change to contain token or GFN only and somehow keep
>> the rest in the guest. This should help with TDX/SEV-ES.
>
> Whatever gets reported to userspace should be identical with and without
> async page faults, i.e. it definitely shouldn't have token information.
>

Oh, right, when the error gets reported synchronously the guest's kernel is
not yet aware of the issue, so it won't be possible to find anything in
its kdump if userspace decides to crash it immediately. The register
state (if available) will be accurate though.

> Note, TDX doesn't allow injection exceptions, so reflecting a #PF back
> into the guest is not an option.  

Not even #MC? So sad :-)

> Nor do I think that's "correct" behavior (see everyone's objections to
> using #PF for APF fixed).  I.e. the event should probably be an IRQ.

I recall Paolo objected to making the APF 'page not present' into an
interrupt, as it will require some very special handling to make sure it
gets injected (and handled) immediately, but I'm not really sure how big
the hack is going to be; maybe in light of TDX/SEV-ES it's worth a
try.
Sean Christopherson Oct. 6, 2020, 5:17 p.m. UTC | #26
On Tue, Oct 06, 2020 at 06:39:56PM +0200, Vitaly Kuznetsov wrote:
> Sean Christopherson <sean.j.christopherson@intel.com> writes:
> 
> > On Tue, Oct 06, 2020 at 05:24:54PM +0200, Vitaly Kuznetsov wrote:
> >> Vivek Goyal <vgoyal@redhat.com> writes:
> >> > So you will have to report token (along with -EFAULT) to user space. So this
> >> > is basically the 3rd proposal which is extension of kvm API and will
> >> > report say HVA/GFN also to user space along with -EFAULT.
> >> 
> >> Right, I meant to say that guest kernel has full register state of the
> >> userspace process which caused APF to get queued and instead of trying
> >> to extract it in KVM and pass to userspace in case of a (later) failure
> >> we limit KVM api change to contain token or GFN only and somehow keep
> >> the rest in the guest. This should help with TDX/SEV-ES.
> >
> > Whatever gets reported to userspace should be identical with and without
> > async page faults, i.e. it definitely shouldn't have token information.
> >
> 
> Oh, right, when the error gets reported synchronously guest's kernel is
> not yet aware of the issue so it won't be possible to find anything in
> its kdump if userspace decides to crash it immediately. The register
> state (if available) will be actual though.
> 
> > Note, TDX doesn't allow injection exceptions, so reflecting a #PF back
> > into the guest is not an option.  
> 
> Not even #MC? So sad :-)

Heh, #MC isn't allowed either, yet...

> > Nor do I think that's "correct" behavior (see everyone's objections to
> > using #PF for APF fixed).  I.e. the event should probably be an IRQ.
> 
> I recall Paolo objected against making APF 'page not present' into an
> interrupt as it will require some very special handling to make sure it
> gets injected (and handled) immediately but I'm not really sure how big
> the hack is going to be, maybe in the light of TDX/SEV-ES it's worth a
> try.

This shouldn't have anything to do with APF.  Again, the event injection is
needed even in the synchronous case as the file truncation in the host can
affect existing mappings in the guest.

I don't know that the mechanism needs to be virtiofs-specific or if there
can be a more generic "these PFNs have disappeared" notification, but it's
most definitely orthogonal to APF.
Dr. David Alan Gilbert Oct. 6, 2020, 5:21 p.m. UTC | #27
* Sean Christopherson (sean.j.christopherson@intel.com) wrote:
> On Tue, Oct 06, 2020 at 06:39:56PM +0200, Vitaly Kuznetsov wrote:
> > Sean Christopherson <sean.j.christopherson@intel.com> writes:
> > 
> > > On Tue, Oct 06, 2020 at 05:24:54PM +0200, Vitaly Kuznetsov wrote:
> > >> Vivek Goyal <vgoyal@redhat.com> writes:
> > >> > So you will have to report token (along with -EFAULT) to user space. So this
> > >> > is basically the 3rd proposal which is extension of kvm API and will
> > >> > report say HVA/GFN also to user space along with -EFAULT.
> > >> 
> > >> Right, I meant to say that guest kernel has full register state of the
> > >> userspace process which caused APF to get queued and instead of trying
> > >> to extract it in KVM and pass to userspace in case of a (later) failure
> > >> we limit KVM api change to contain token or GFN only and somehow keep
> > >> the rest in the guest. This should help with TDX/SEV-ES.
> > >
> > > Whatever gets reported to userspace should be identical with and without
> > > async page faults, i.e. it definitely shouldn't have token information.
> > >
> > 
> > Oh, right, when the error gets reported synchronously guest's kernel is
> > not yet aware of the issue so it won't be possible to find anything in
> > its kdump if userspace decides to crash it immediately. The register
> > state (if available) will be actual though.
> > 
> > > Note, TDX doesn't allow injection exceptions, so reflecting a #PF back
> > > into the guest is not an option.  
> > 
> > Not even #MC? So sad :-)
> 
> Heh, #MC isn't allowed either, yet...
> 
> > > Nor do I think that's "correct" behavior (see everyone's objections to
> > > using #PF for APF fixed).  I.e. the event should probably be an IRQ.
> > 
> > I recall Paolo objected against making APF 'page not present' into an
> > interrupt as it will require some very special handling to make sure it
> > gets injected (and handled) immediately but I'm not really sure how big
> > the hack is going to be, maybe in the light of TDX/SEV-ES it's worth a
> > try.
> 
> This shouldn't have anything to do with APF.  Again, the event injection is
> needed even in the synchronous case as the file truncation in the host can
> affect existing mappings in the guest.
> 
> I don't know that the mechanism needs to be virtiofs specific or if there can
> be a more generic "these PFNs have disappeared", but it's most definitely
> orthogonal to APF.

There are cases other than virtiofs where we get 'these PFNs have
disappeared'; the classic one is when people back the guest using a
tmpfs that then runs out of room.

Dave

Vivek Goyal Oct. 6, 2020, 5:28 p.m. UTC | #28
On Tue, Oct 06, 2020 at 06:21:48PM +0100, Dr. David Alan Gilbert wrote:
> * Sean Christopherson (sean.j.christopherson@intel.com) wrote:
> > On Tue, Oct 06, 2020 at 06:39:56PM +0200, Vitaly Kuznetsov wrote:
> > > Sean Christopherson <sean.j.christopherson@intel.com> writes:
> > > 
> > > > On Tue, Oct 06, 2020 at 05:24:54PM +0200, Vitaly Kuznetsov wrote:
> > > >> Vivek Goyal <vgoyal@redhat.com> writes:
> > > >> > So you will have to report token (along with -EFAULT) to user space. So this
> > > >> > is basically the 3rd proposal which is extension of kvm API and will
> > > >> > report say HVA/GFN also to user space along with -EFAULT.
> > > >> 
> > > >> Right, I meant to say that guest kernel has full register state of the
> > > >> userspace process which caused APF to get queued and instead of trying
> > > >> to extract it in KVM and pass to userspace in case of a (later) failure
> > > >> we limit KVM api change to contain token or GFN only and somehow keep
> > > >> the rest in the guest. This should help with TDX/SEV-ES.
> > > >
> > > > Whatever gets reported to userspace should be identical with and without
> > > > async page faults, i.e. it definitely shouldn't have token information.
> > > >
> > > 
> > > Oh, right, when the error gets reported synchronously guest's kernel is
> > > not yet aware of the issue so it won't be possible to find anything in
> > > its kdump if userspace decides to crash it immediately. The register
> > > state (if available) will be actual though.
> > > 
> > > > Note, TDX doesn't allow injection exceptions, so reflecting a #PF back
> > > > into the guest is not an option.  
> > > 
> > > Not even #MC? So sad :-)
> > 
> > Heh, #MC isn't allowed either, yet...
> > 
> > > > Nor do I think that's "correct" behavior (see everyone's objections to
> > > > using #PF for APF fixed).  I.e. the event should probably be an IRQ.
> > > 
> > > I recall Paolo objected against making APF 'page not present' into an
> > > interrupt as it will require some very special handling to make sure it
> > > gets injected (and handled) immediately but I'm not really sure how big
> > > the hack is going to be, maybe in the light of TDX/SEV-ES it's worth a
> > > try.
> > 
> > This shouldn't have anything to do with APF.  Again, the event injection is
> > needed even in the synchronous case as the file truncation in the host can
> > affect existing mappings in the guest.
> > 
> > I don't know that the mechanism needs to be virtiofs specific or if there can
> > be a more generic "these PFNs have disappeared", but it's most definitely
> > orthogonal to APF.
> 
> There are other cases we get 'these PFNs have disappeared' other than
> virtiofs;  the classic is when people back the guest using a tmpfs that
> then runs out of room.

I also played with the nvdimm driver where the device was backed by a
file on the host. If I truncate that file, we face similar issues.

https://lore.kernel.org/kvm/20200616214847.24482-1-vgoyal@redhat.com/

I think any resource which can be backed by a file on the host can
potentially run into this issue if the file is truncated (provided the
guest can do load/store on these pages directly).
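
For reference, a minimal host-side sketch of that failure mode (purely
illustrative, not part of this patch; the file path and messages are
made up):

/*
 * Map a file MAP_SHARED, truncate it behind the mapping, then touch the
 * now-unbacked page.  A plain userspace access takes SIGBUS; when the
 * same access comes from a guest mapping, KVM's fault path sees the
 * error instead, and this series makes it exit to userspace rather
 * than retry forever.
 */
#include <fcntl.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

static void bus_handler(int sig)
{
	(void)sig;
	write(2, "SIGBUS: page is gone after truncate\n", 36);
	_exit(1);
}

int main(void)
{
	long psz = sysconf(_SC_PAGESIZE);
	int fd = open("/tmp/backing-file", O_RDWR | O_CREAT | O_TRUNC, 0600);
	char *p;

	if (fd < 0 || ftruncate(fd, 2 * psz))
		exit(1);

	p = mmap(NULL, 2 * psz, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (p == MAP_FAILED)
		exit(1);

	p[0] = 1;			/* page is backed, access works */

	signal(SIGBUS, bus_handler);
	ftruncate(fd, psz);		/* host truncates the backing file */

	p[psz] = 1;			/* page now beyond EOF -> SIGBUS */
	printf("unexpected: access succeeded\n");
	return 0;
}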

Thanks
Vivek
Vivek Goyal Oct. 6, 2020, 5:35 p.m. UTC | #29
On Tue, Oct 06, 2020 at 10:17:04AM -0700, Sean Christopherson wrote:

[..]
> > > Note, TDX doesn't allow injection exceptions, so reflecting a #PF back
> > > into the guest is not an option.  
> > 
> > Not even #MC? So sad :-)
> 
> Heh, #MC isn't allowed either, yet...

If #MC is not allowed, the logic related to hwpoison memory will not
work, as that seems to inject #MC.

Vivek
Sean Christopherson Oct. 7, 2020, 12:04 a.m. UTC | #30
On Tue, Oct 06, 2020 at 01:35:27PM -0400, Vivek Goyal wrote:
> On Tue, Oct 06, 2020 at 10:17:04AM -0700, Sean Christopherson wrote:
> 
> [..]
> > > > Note, TDX doesn't allow injection exceptions, so reflecting a #PF back
> > > > into the guest is not an option.  
> > > 
> > > Not even #MC? So sad :-)
> > 
> > Heh, #MC isn't allowed either, yet...
> 
> If #MC is not allowed, the logic related to hwpoison memory will not work
> as that seems to inject #MC.

Yep.  Piggybacking on #MC is undesirable for other reasons, e.g. it adds
a "hardware" dependency for what is really a paravirt feature.
diff mbox series

Patch

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index be5363b21540..e6f8d3f1a377 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -137,6 +137,7 @@  static inline gfn_t gfn_to_index(gfn_t gfn, gfn_t base_gfn, int level)
 #define KVM_NR_VAR_MTRR 8
 
 #define ASYNC_PF_PER_VCPU 64
+#define ERROR_GFN_PER_VCPU 64
 
 enum kvm_reg {
 	VCPU_REGS_RAX = __VCPU_REGS_RAX,
@@ -778,6 +779,7 @@  struct kvm_vcpu_arch {
 		unsigned long nested_apf_token;
 		bool delivery_as_pf_vmexit;
 		bool pageready_pending;
+		gfn_t error_gfns[ERROR_GFN_PER_VCPU];
 	} apf;
 
 	/* OSVW MSRs (AMD only) */
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index 444bb9c54548..d0a2a12c7bb6 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -60,7 +60,7 @@  void kvm_init_mmu(struct kvm_vcpu *vcpu, bool reset_roots);
 void kvm_init_shadow_mmu(struct kvm_vcpu *vcpu, u32 cr0, u32 cr4, u32 efer);
 void kvm_init_shadow_ept_mmu(struct kvm_vcpu *vcpu, bool execonly,
 			     bool accessed_dirty, gpa_t new_eptp);
-bool kvm_can_do_async_pf(struct kvm_vcpu *vcpu);
+bool kvm_can_do_async_pf(struct kvm_vcpu *vcpu, gfn_t gfn);
 int kvm_handle_page_fault(struct kvm_vcpu *vcpu, u64 error_code,
 				u64 fault_address, char *insn, int insn_len);
 
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 6d6a0ae7800c..b51d4aa405e0 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4078,7 +4078,7 @@  static bool try_async_pf(struct kvm_vcpu *vcpu, bool prefault, gfn_t gfn,
 	if (!async)
 		return false; /* *pfn has correct page already */
 
-	if (!prefault && kvm_can_do_async_pf(vcpu)) {
+	if (!prefault && kvm_can_do_async_pf(vcpu, gpa_to_gfn(cr2_or_gpa))) {
 		trace_kvm_try_async_get_page(cr2_or_gpa, gfn);
 		if (kvm_find_async_pf_gfn(vcpu, gfn)) {
 			trace_kvm_async_pf_doublefault(cr2_or_gpa, gfn);
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 88c593f83b28..c1f5094d6e53 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -263,6 +263,13 @@  static inline void kvm_async_pf_hash_reset(struct kvm_vcpu *vcpu)
 		vcpu->arch.apf.gfns[i] = ~0;
 }
 
+static inline void kvm_error_gfn_hash_reset(struct kvm_vcpu *vcpu)
+{
+	int i;
+	for (i = 0; i < ERROR_GFN_PER_VCPU; i++)
+		vcpu->arch.apf.error_gfns[i] = GFN_INVALID;
+}
+
 static void kvm_on_user_return(struct user_return_notifier *urn)
 {
 	unsigned slot;
@@ -9484,6 +9491,7 @@  int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
 	vcpu->arch.pat = MSR_IA32_CR_PAT_DEFAULT;
 
 	kvm_async_pf_hash_reset(vcpu);
+	kvm_error_gfn_hash_reset(vcpu);
 	kvm_pmu_init(vcpu);
 
 	vcpu->arch.pending_external_vector = -1;
@@ -9608,6 +9616,7 @@  void kvm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
 
 	kvm_clear_async_pf_completion_queue(vcpu);
 	kvm_async_pf_hash_reset(vcpu);
+	kvm_error_gfn_hash_reset(vcpu);
 	vcpu->arch.apf.halted = false;
 
 	if (kvm_mpx_supported()) {
@@ -10369,6 +10378,36 @@  void kvm_set_rflags(struct kvm_vcpu *vcpu, unsigned long rflags)
 }
 EXPORT_SYMBOL_GPL(kvm_set_rflags);
 
+static inline u32 kvm_error_gfn_hash_fn(gfn_t gfn)
+{
+	BUILD_BUG_ON(!is_power_of_2(ERROR_GFN_PER_VCPU));
+
+	return hash_32(gfn & 0xffffffff, order_base_2(ERROR_GFN_PER_VCPU));
+}
+
+static void kvm_add_error_gfn(struct kvm_vcpu *vcpu, gfn_t gfn)
+{
+	u32 key = kvm_error_gfn_hash_fn(gfn);
+
+	/*
+	 * Overwrite the previous gfn. This is just a hint to do
+	 * sync page fault.
+	 */
+	vcpu->arch.apf.error_gfns[key] = gfn;
+}
+
+/* Returns true if gfn was found in hash table, false otherwise */
+static bool kvm_find_and_remove_error_gfn(struct kvm_vcpu *vcpu, gfn_t gfn)
+{
+	u32 key = kvm_error_gfn_hash_fn(gfn);
+
+	if (vcpu->arch.apf.error_gfns[key] != gfn)
+		return false;
+
+	vcpu->arch.apf.error_gfns[key] = GFN_INVALID;
+	return true;
+}
+
 void kvm_arch_async_page_ready(struct kvm_vcpu *vcpu, struct kvm_async_pf *work)
 {
 	int r;
@@ -10385,7 +10424,9 @@  void kvm_arch_async_page_ready(struct kvm_vcpu *vcpu, struct kvm_async_pf *work)
 	      work->arch.cr3 != vcpu->arch.mmu->get_guest_pgd(vcpu))
 		return;
 
-	kvm_mmu_do_page_fault(vcpu, work->cr2_or_gpa, 0, true);
+	r = kvm_mmu_do_page_fault(vcpu, work->cr2_or_gpa, 0, true);
+	if (r < 0)
+		kvm_add_error_gfn(vcpu, gpa_to_gfn(work->cr2_or_gpa));
 }
 
 static inline u32 kvm_async_pf_hash_fn(gfn_t gfn)
@@ -10495,7 +10536,7 @@  static bool kvm_can_deliver_async_pf(struct kvm_vcpu *vcpu)
 	return true;
 }
 
-bool kvm_can_do_async_pf(struct kvm_vcpu *vcpu)
+bool kvm_can_do_async_pf(struct kvm_vcpu *vcpu, gfn_t gfn)
 {
 	if (unlikely(!lapic_in_kernel(vcpu) ||
 		     kvm_event_needs_reinjection(vcpu) ||
@@ -10509,7 +10550,14 @@  bool kvm_can_do_async_pf(struct kvm_vcpu *vcpu)
 	 * If interrupts are off we cannot even use an artificial
 	 * halt state.
 	 */
-	return kvm_arch_interrupt_allowed(vcpu);
+	if (!kvm_arch_interrupt_allowed(vcpu))
+		return false;
+
+	/* Found gfn in error gfn cache. Force sync fault */
+	if (kvm_find_and_remove_error_gfn(vcpu, gfn))
+		return false;
+
+	return true;
 }
 
 bool kvm_arch_async_page_not_present(struct kvm_vcpu *vcpu,
diff --git a/include/linux/kvm_types.h b/include/linux/kvm_types.h
index 68e84cf42a3f..677bb8269cd3 100644
--- a/include/linux/kvm_types.h
+++ b/include/linux/kvm_types.h
@@ -36,6 +36,7 @@  typedef u64            gpa_t;
 typedef u64            gfn_t;
 
 #define GPA_INVALID	(~(gpa_t)0)
+#define GFN_INVALID	(~(gfn_t)0)
 
 typedef unsigned long  hva_t;
 typedef u64            hpa_t;
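
For completeness, a rough sketch of how this change is visible to a
userspace VMM: with the patch applied, a failed fault propagates out of
the vCPU run path, so the KVM_RUN ioctl fails with EFAULT instead of the
vCPU re-faulting forever, and the run loop can report the error and
stop. This is only an illustration of a generic run loop, not code from
this series; setup and most exit handling are omitted.

/*
 * Minimal KVM_RUN loop sketch.  vcpu_fd is an existing vCPU fd
 * (from KVM_CREATE_VCPU) and run is its mmap'ed struct kvm_run.
 */
#include <errno.h>
#include <linux/kvm.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>

static int run_vcpu(int vcpu_fd, struct kvm_run *run)
{
	for (;;) {
		if (ioctl(vcpu_fd, KVM_RUN, 0) < 0) {
			if (errno == EINTR)
				continue;
			/* e.g. EFAULT when the host page backing a GPA is gone */
			fprintf(stderr, "kvm run failed: %s\n", strerror(errno));
			return -1;
		}

		switch (run->exit_reason) {
		case KVM_EXIT_HLT:
			return 0;
		case KVM_EXIT_IO:
		case KVM_EXIT_MMIO:
			/* device emulation would go here */
			break;
		default:
			fprintf(stderr, "unhandled exit %u\n", run->exit_reason);
			return -1;
		}
	}
}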