diff mbox series

[RFC] kvm,x86: Exit to user space in case of page fault error

Message ID 20200625214701.GA180786@redhat.com (mailing list archive)
State New, archived
Headers show
Series [RFC] kvm,x86: Exit to user space in case of page fault error | expand

Commit Message

Vivek Goyal June 25, 2020, 9:47 p.m. UTC
Page fault error handling behavior in kvm seems a little inconsistent
when a page fault reports an error. If we are doing the fault
synchronously, we capture the error (-EFAULT) returned by
__gfn_to_pfn_memslot(), exit to user space, and qemu reports
"error: kvm run failed Bad address".

But if we are doing an async page fault, async_pf_execute() will simply
ignore the error reported by get_user_pages_remote() or by
kvm_mmu_do_page_fault(). It is assumed that the page fault was
successful, and either a page ready event is injected into the guest or
the guest is brought out of the artificial halt state and run again. In
both cases, when the guest retries the instruction, it takes an exit
again because the page fault was not successful in the previous attempt,
and this infinite loop continues forever.

Faulting in a loop would make sense if the error were temporary and
would be resolved on retry. But I don't see any intention in the code to
determine whether the error is temporary or not. Whether to do the fault
synchronously or asynchronously depends on many variables
(kvm_can_do_async_pf()), but none of those variables is whether the
error is temporary.

That makes it very inconsistent and unpredictable whether kvm will exit
to qemu with an error or just retry and go into an infinite loop.

This patch tries to make the behavior consistent: instead of getting
into an infinite loop of retrying the page fault, exit to user space and
stop the VM if a page fault error happens.

In the future this can be improved by injecting errors into the guest.
As of now we don't have any race-free method to inject errors into the
guest.

When a page fault error happens in the async path, save that gfn, and
when the guest next retries, do a sync fault instead of an async fault.
That way, if the error is encountered again, we exit to qemu and avoid
the infinite loop.

As of now only one error gfn is stored, which means it could be
overwritten before the next retry from the guest happens. But this is
just a hint, and if we miss it, we will catch it some other time. If
this becomes an issue, we could later maintain an array of error gfns to
ease it.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 arch/x86/include/asm/kvm_host.h |  1 +
 arch/x86/kvm/mmu.h              |  2 +-
 arch/x86/kvm/mmu/mmu.c          |  2 +-
 arch/x86/kvm/x86.c              | 14 +++++++++++---
 4 files changed, 14 insertions(+), 5 deletions(-)

Comments

Vitaly Kuznetsov June 26, 2020, 9:25 a.m. UTC | #1
Vivek Goyal <vgoyal@redhat.com> writes:

> Page fault error handling behavior in kvm seems little inconsistent when
> page fault reports error. If we are doing fault synchronously
> then we capture error (-EFAULT) returned by __gfn_to_pfn_memslot() and
> exit to user space and qemu reports error, "error: kvm run failed Bad address".
>
> But if we are doing async page fault, then async_pf_execute() will simply
> ignore the error reported by get_user_pages_remote() or
> by kvm_mmu_do_page_fault(). It is assumed that page fault was successful
> and either a page ready event is injected in guest or guest is brought
> out of artificial halt state and run again. In both the cases when guest
> retries the instruction, it takes exit again as page fault was not
> successful in previous attempt. And then this infinite loop continues
> forever.
>
> Trying fault in a loop will make sense if error is temporary and will
> be resolved on retry. But I don't see any intention in the code to
> determine if error is temporary or not.  Whether to do fault synchronously
> or asynchronously, depends on so many variables but none of the variables
> is whether error is temporary or not. (kvm_can_do_async_pf()).
>
> And that makes it very inconsistent or unpredictable to figure out whether
> kvm will exit to qemu with error or it will just retry and go into an
> infinite loop.
>
> This patch tries to make this behavior consistent. That is instead of
> getting into infinite loop of retrying page fault, exit to user space
> and stop VM if page fault error happens.
>
> In future this can be improved by injecting errors into guest. As of
> now we don't have any race free method to inject errors in guest.
>
> When page fault error happens in async path save that pfn and when next
> time guest retries, do a sync fault instead of async fault. So that if error
> is encountered, we exit to qemu and avoid infinite loop.
>
> As of now only one error pfn is stored and that means it could be
> overwritten before next a retry from guest happens. But this is
> just a hint and if we miss it, some other time we will catch it.
> If this becomes an issue, we could maintain an array of error
> gfn later to help ease the issue.
>
> Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
> ---
>  arch/x86/include/asm/kvm_host.h |  1 +
>  arch/x86/kvm/mmu.h              |  2 +-
>  arch/x86/kvm/mmu/mmu.c          |  2 +-
>  arch/x86/kvm/x86.c              | 14 +++++++++++---
>  4 files changed, 14 insertions(+), 5 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index be5363b21540..3c0677b9d3d5 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -778,6 +778,7 @@ struct kvm_vcpu_arch {
>  		unsigned long nested_apf_token;
>  		bool delivery_as_pf_vmexit;
>  		bool pageready_pending;
> +		gfn_t error_gfn;
>  	} apf;
>  
>  	/* OSVW MSRs (AMD only) */
> diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
> index 444bb9c54548..d0a2a12c7bb6 100644
> --- a/arch/x86/kvm/mmu.h
> +++ b/arch/x86/kvm/mmu.h
> @@ -60,7 +60,7 @@ void kvm_init_mmu(struct kvm_vcpu *vcpu, bool reset_roots);
>  void kvm_init_shadow_mmu(struct kvm_vcpu *vcpu, u32 cr0, u32 cr4, u32 efer);
>  void kvm_init_shadow_ept_mmu(struct kvm_vcpu *vcpu, bool execonly,
>  			     bool accessed_dirty, gpa_t new_eptp);
> -bool kvm_can_do_async_pf(struct kvm_vcpu *vcpu);
> +bool kvm_can_do_async_pf(struct kvm_vcpu *vcpu, gfn_t gfn);
>  int kvm_handle_page_fault(struct kvm_vcpu *vcpu, u64 error_code,
>  				u64 fault_address, char *insn, int insn_len);
>  
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 76817d13c86e..a882a6a9f7a7 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -4078,7 +4078,7 @@ static bool try_async_pf(struct kvm_vcpu *vcpu, bool prefault, gfn_t gfn,
>  	if (!async)
>  		return false; /* *pfn has correct page already */
>  
> -	if (!prefault && kvm_can_do_async_pf(vcpu)) {
> +	if (!prefault && kvm_can_do_async_pf(vcpu, cr2_or_gpa >> PAGE_SHIFT)) {

gpa_to_gfn(cr2_or_gpa) ?

>  		trace_kvm_try_async_get_page(cr2_or_gpa, gfn);
>  		if (kvm_find_async_pf_gfn(vcpu, gfn)) {
>  			trace_kvm_async_pf_doublefault(cr2_or_gpa, gfn);
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 3b92db412335..a6af7e9831b9 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -10380,7 +10380,9 @@ void kvm_arch_async_page_ready(struct kvm_vcpu *vcpu, struct kvm_async_pf *work)
>  	      work->arch.cr3 != vcpu->arch.mmu->get_guest_pgd(vcpu))
>  		return;
>  
> -	kvm_mmu_do_page_fault(vcpu, work->cr2_or_gpa, 0, true);
> +	r = kvm_mmu_do_page_fault(vcpu, work->cr2_or_gpa, 0, true);
> +	if (r < 0)
> +		vcpu->arch.apf.error_gfn = work->cr2_or_gpa >> PAGE_SHIFT;
>  }
>  
>  static inline u32 kvm_async_pf_hash_fn(gfn_t gfn)
> @@ -10490,7 +10492,7 @@ static bool kvm_can_deliver_async_pf(struct kvm_vcpu *vcpu)
>  	return true;
>  }
>  
> -bool kvm_can_do_async_pf(struct kvm_vcpu *vcpu)
> +bool kvm_can_do_async_pf(struct kvm_vcpu *vcpu, gfn_t gfn)
>  {
>  	if (unlikely(!lapic_in_kernel(vcpu) ||
>  		     kvm_event_needs_reinjection(vcpu) ||
> @@ -10504,7 +10506,13 @@ bool kvm_can_do_async_pf(struct kvm_vcpu *vcpu)
>  	 * If interrupts are off we cannot even use an artificial
>  	 * halt state.
>  	 */
> -	return kvm_arch_interrupt_allowed(vcpu);
> +	if (!kvm_arch_interrupt_allowed(vcpu))
> +		return false;
> +
> +	if (vcpu->arch.apf.error_gfn == gfn)
> +		return false;
> +
> +	return true;
>  }
>  
>  bool kvm_arch_async_page_not_present(struct kvm_vcpu *vcpu,

I'm a little bit afraid that a single error_gfn may not give us
deterministic behavior. E.g. when we have a lot of faulting processes
it may take many iterations to hit 'error_gfn == gfn' because we'll
always be overwriting 'error_gfn' with new values and waking up some
(random) process.

What if we just temporarily disable the whole APF mechanism? That would
ensure we're making forward progress. Something like (completely
untested):

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index f8998e97457f..945b3d5a2796 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -778,6 +778,7 @@ struct kvm_vcpu_arch {
 		unsigned long nested_apf_token;
 		bool delivery_as_pf_vmexit;
 		bool pageready_pending;
+		bool error_pending;
 	} apf;
 
 	/* OSVW MSRs (AMD only) */
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index fdd05c233308..e5f04ae97e91 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4124,8 +4124,18 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
 	if (try_async_pf(vcpu, prefault, gfn, gpa, &pfn, write, &map_writable))
 		return RET_PF_RETRY;
 
-	if (handle_abnormal_pfn(vcpu, is_tdp ? 0 : gpa, gfn, pfn, ACC_ALL, &r))
+	if (handle_abnormal_pfn(vcpu, is_tdp ? 0 : gpa, gfn, pfn, ACC_ALL, &r)) {
+		/*
+		 * In case APF mechanism was previously disabled due to an error
+		 * we are ready to re-enable it here as we're about to inject an
+		 * error to userspace. There is no guarantee we are handling the
+		 * same GFN which failed in APF here but at least we are making
+		 * forward progress.
+		 */
+
+		vcpu->arch.apf.error_pending = false;
 		return r;
+	}
 
 	r = RET_PF_RETRY;
 	spin_lock(&vcpu->kvm->mmu_lock);
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 00c88c2f34e4..4607cf4d5117 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -10379,7 +10379,9 @@ void kvm_arch_async_page_ready(struct kvm_vcpu *vcpu, struct kvm_async_pf *work)
 	      work->arch.cr3 != vcpu->arch.mmu->get_guest_pgd(vcpu))
 		return;
 
-	kvm_mmu_do_page_fault(vcpu, work->cr2_or_gpa, 0, true);
+	r = kvm_mmu_do_page_fault(vcpu, work->cr2_or_gpa, 0, true);
+	if (r < 0)
+		vcpu->arch.apf.error_pending = true;
 }
 
 static inline u32 kvm_async_pf_hash_fn(gfn_t gfn)
@@ -10499,6 +10501,9 @@ bool kvm_can_do_async_pf(struct kvm_vcpu *vcpu)
 	if (kvm_hlt_in_guest(vcpu->kvm) && !kvm_can_deliver_async_pf(vcpu))
 		return false;
 
+	if (unlikely(vcpu->arch.apf.error_pending))
+		return false;
+
 	/*
 	 * If interrupts are off we cannot even use an artificial
 	 * halt state.
Vivek Goyal June 26, 2020, 3:03 p.m. UTC | #2
On Fri, Jun 26, 2020 at 11:25:19AM +0200, Vitaly Kuznetsov wrote:

[..]
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index 76817d13c86e..a882a6a9f7a7 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -4078,7 +4078,7 @@ static bool try_async_pf(struct kvm_vcpu *vcpu, bool prefault, gfn_t gfn,
> >  	if (!async)
> >  		return false; /* *pfn has correct page already */
> >  
> > -	if (!prefault && kvm_can_do_async_pf(vcpu)) {
> > +	if (!prefault && kvm_can_do_async_pf(vcpu, cr2_or_gpa >> PAGE_SHIFT)) {
> 
> gpa_to_gfn(cr2_or_gpa) ?

Will do.

[..]
> > -bool kvm_can_do_async_pf(struct kvm_vcpu *vcpu)
> > +bool kvm_can_do_async_pf(struct kvm_vcpu *vcpu, gfn_t gfn)
> >  {
> >  	if (unlikely(!lapic_in_kernel(vcpu) ||
> >  		     kvm_event_needs_reinjection(vcpu) ||
> > @@ -10504,7 +10506,13 @@ bool kvm_can_do_async_pf(struct kvm_vcpu *vcpu)
> >  	 * If interrupts are off we cannot even use an artificial
> >  	 * halt state.
> >  	 */
> > -	return kvm_arch_interrupt_allowed(vcpu);
> > +	if (!kvm_arch_interrupt_allowed(vcpu))
> > +		return false;
> > +
> > +	if (vcpu->arch.apf.error_gfn == gfn)
> > +		return false;
> > +
> > +	return true;
> >  }
> >  
> >  bool kvm_arch_async_page_not_present(struct kvm_vcpu *vcpu,
> 
> I'm a little bit afraid that a single error_gfn may not give us
> deterministic behavior. E.g. when we have a lot of faulting processes
> it may take many iterations to hit 'error_gfn == gfn' because we'll
> always be overwriting 'error_gfn' with new values and waking up some
> (random) process.
> 
> What if we just temporarily disable the whole APF mechanism? That would
> ensure we're making forward progress. Something like (completely
> untested):
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index f8998e97457f..945b3d5a2796 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -778,6 +778,7 @@ struct kvm_vcpu_arch {
>  		unsigned long nested_apf_token;
>  		bool delivery_as_pf_vmexit;
>  		bool pageready_pending;
> +		bool error_pending;
>  	} apf;
>  
>  	/* OSVW MSRs (AMD only) */
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index fdd05c233308..e5f04ae97e91 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -4124,8 +4124,18 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
>  	if (try_async_pf(vcpu, prefault, gfn, gpa, &pfn, write, &map_writable))
>  		return RET_PF_RETRY;
>  
> -	if (handle_abnormal_pfn(vcpu, is_tdp ? 0 : gpa, gfn, pfn, ACC_ALL, &r))
> +	if (handle_abnormal_pfn(vcpu, is_tdp ? 0 : gpa, gfn, pfn, ACC_ALL, &r)) {
> +		/*
> +		 * In case APF mechanism was previously disabled due to an error
> +		 * we are ready to re-enable it here as we're about to inject an
> +		 * error to userspace. There is no guarantee we are handling the
> +		 * same GFN which failed in APF here but at least we are making
> +		 * forward progress.
> +		 */
> +
> +		vcpu->arch.apf.error_pending = false;

I like this idea. It is simple. But I have a concern with it:

- Can it happen that we never retry faulting in the error pfn? Say a
  process accessed a pfn, we set error_pending, and then the process got
  killed due to a pending signal. Now the process will not retry the
  error pfn, error_pending will remain set, and we will have completely
  disabled the APF mechanism till the next error happens (if it happens).

In another idea, we could think of maintaining another hash of error
gfns, similar to "vcpu->arch.apf.gfns[]". Say "vcpu->arch.apf.error_gfns[]":

- When an error happens on a gfn, add it to the hash. If the slot is
  busy, overwrite it.

- When kvm_can_do_async_pf(gfn) is called, check if this gfn is present
  in error_gfns[]; if yes, clear it and force a sync fault.

This is more complicated but should take care of your concerns. Also,
even if the process never retries that gfn, we are fine. At worst that
gfn will remain in the error_gfns array but will not disable APF
completely.
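As a rough user-space model of what such an error-gfn hash could look like (struct, function, and constant names here are invented for illustration; an in-kernel version would hang off vcpu->arch.apf and reuse kvm_async_pf_hash_fn(), which is built on hash_32()):

```c
#include <assert.h>
#include <stdint.h>

#define ERROR_GFN_PER_VCPU 64U            /* assumed size, mirroring ASYNC_PF_PER_VCPU */
#define ERROR_GFN_EMPTY    (~(uint64_t)0) /* free-slot marker, as in apf.gfns[] */

typedef uint64_t gfn_t;

struct apf_error_state {
	gfn_t error_gfns[ERROR_GFN_PER_VCPU];
};

static void apf_error_state_init(struct apf_error_state *s)
{
	for (unsigned int i = 0; i < ERROR_GFN_PER_VCPU; i++)
		s->error_gfns[i] = ERROR_GFN_EMPTY;
}

/* The kernel's kvm_async_pf_hash_fn() uses hash_32(); a plain modulo
 * keeps this model easy to follow. */
static unsigned int error_gfn_slot(gfn_t gfn)
{
	return (unsigned int)(gfn % ERROR_GFN_PER_VCPU);
}

/* Record a gfn that failed in the async path; a busy slot is simply
 * overwritten, as described above. */
static void record_error_gfn(struct apf_error_state *s, gfn_t gfn)
{
	s->error_gfns[error_gfn_slot(gfn)] = gfn;
}

/* Would be called from kvm_can_do_async_pf(): returns 1 (and clears the
 * slot) when this gfn previously failed, telling the caller to force a
 * synchronous fault so the error reaches user space. */
static int check_and_clear_error_gfn(struct apf_error_state *s, gfn_t gfn)
{
	unsigned int slot = error_gfn_slot(gfn);

	if (s->error_gfns[slot] != gfn)
		return 0;
	s->error_gfns[slot] = ERROR_GFN_EMPTY;
	return 1;
}
```

Because the table is direct-mapped with overwrite, a stale entry can be evicted by a colliding gfn, but as noted it is only a hint: at worst the entry lingers without disabling APF as a whole.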

Thanks
Vivek

>  		return r;
> +	}
>  
>  	r = RET_PF_RETRY;
>  	spin_lock(&vcpu->kvm->mmu_lock);
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 00c88c2f34e4..4607cf4d5117 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -10379,7 +10379,9 @@ void kvm_arch_async_page_ready(struct kvm_vcpu *vcpu, struct kvm_async_pf *work)
>  	      work->arch.cr3 != vcpu->arch.mmu->get_guest_pgd(vcpu))
>  		return;
>  
> -	kvm_mmu_do_page_fault(vcpu, work->cr2_or_gpa, 0, true);
> +	r = kvm_mmu_do_page_fault(vcpu, work->cr2_or_gpa, 0, true);
> +	if (r < 0)
> +		vcpu->arch.apf.error_pending = true;
>  }
>  
>  static inline u32 kvm_async_pf_hash_fn(gfn_t gfn)
> @@ -10499,6 +10501,9 @@ bool kvm_can_do_async_pf(struct kvm_vcpu *vcpu)
>  	if (kvm_hlt_in_guest(vcpu->kvm) && !kvm_can_deliver_async_pf(vcpu))
>  		return false;
>  
> +	if (unlikely(vcpu->arch.apf.error_pending))
> +		return false;
> +
>  	/*
>  	 * If interrupts are off we cannot even use an artificial
>  	 * halt state.
> 
> -- 
> Vitaly
>
Vitaly Kuznetsov June 29, 2020, 8:56 p.m. UTC | #3
Vivek Goyal <vgoyal@redhat.com> writes:

> On Fri, Jun 26, 2020 at 11:25:19AM +0200, Vitaly Kuznetsov wrote:
>
> [..]
>> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
>> > index 76817d13c86e..a882a6a9f7a7 100644
>> > --- a/arch/x86/kvm/mmu/mmu.c
>> > +++ b/arch/x86/kvm/mmu/mmu.c
>> > @@ -4078,7 +4078,7 @@ static bool try_async_pf(struct kvm_vcpu *vcpu, bool prefault, gfn_t gfn,
>> >  	if (!async)
>> >  		return false; /* *pfn has correct page already */
>> >  
>> > -	if (!prefault && kvm_can_do_async_pf(vcpu)) {
>> > +	if (!prefault && kvm_can_do_async_pf(vcpu, cr2_or_gpa >> PAGE_SHIFT)) {
>> 
>> gpa_to_gfn(cr2_or_gpa) ?
>
> Will do.
>
> [..]
>> > -bool kvm_can_do_async_pf(struct kvm_vcpu *vcpu)
>> > +bool kvm_can_do_async_pf(struct kvm_vcpu *vcpu, gfn_t gfn)
>> >  {
>> >  	if (unlikely(!lapic_in_kernel(vcpu) ||
>> >  		     kvm_event_needs_reinjection(vcpu) ||
>> > @@ -10504,7 +10506,13 @@ bool kvm_can_do_async_pf(struct kvm_vcpu *vcpu)
>> >  	 * If interrupts are off we cannot even use an artificial
>> >  	 * halt state.
>> >  	 */
>> > -	return kvm_arch_interrupt_allowed(vcpu);
>> > +	if (!kvm_arch_interrupt_allowed(vcpu))
>> > +		return false;
>> > +
>> > +	if (vcpu->arch.apf.error_gfn == gfn)
>> > +		return false;
>> > +
>> > +	return true;
>> >  }
>> >  
>> >  bool kvm_arch_async_page_not_present(struct kvm_vcpu *vcpu,
>> 
>> I'm a little bit afraid that a single error_gfn may not give us
>> deterministic behavior. E.g. when we have a lot of faulting processes
>> it may take many iterations to hit 'error_gfn == gfn' because we'll
>> always be overwriting 'error_gfn' with new values and waking up some
>> (random) process.
>> 
>> What if we just temporarily disable the whole APF mechanism? That would
>> ensure we're making forward progress. Something like (completely
>> untested):
>> 
>> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
>> index f8998e97457f..945b3d5a2796 100644
>> --- a/arch/x86/include/asm/kvm_host.h
>> +++ b/arch/x86/include/asm/kvm_host.h
>> @@ -778,6 +778,7 @@ struct kvm_vcpu_arch {
>>  		unsigned long nested_apf_token;
>>  		bool delivery_as_pf_vmexit;
>>  		bool pageready_pending;
>> +		bool error_pending;
>>  	} apf;
>>  
>>  	/* OSVW MSRs (AMD only) */
>> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
>> index fdd05c233308..e5f04ae97e91 100644
>> --- a/arch/x86/kvm/mmu/mmu.c
>> +++ b/arch/x86/kvm/mmu/mmu.c
>> @@ -4124,8 +4124,18 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
>>  	if (try_async_pf(vcpu, prefault, gfn, gpa, &pfn, write, &map_writable))
>>  		return RET_PF_RETRY;
>>  
>> -	if (handle_abnormal_pfn(vcpu, is_tdp ? 0 : gpa, gfn, pfn, ACC_ALL, &r))
>> +	if (handle_abnormal_pfn(vcpu, is_tdp ? 0 : gpa, gfn, pfn, ACC_ALL, &r)) {
>> +		/*
>> +		 * In case APF mechanism was previously disabled due to an error
>> +		 * we are ready to re-enable it here as we're about to inject an
>> +		 * error to userspace. There is no guarantee we are handling the
>> +		 * same GFN which failed in APF here but at least we are making
>> +		 * forward progress.
>> +		 */
>> +
>> +		vcpu->arch.apf.error_pending = false;
>
> I like this idea. It is simple. But I have a concern with it though.
>
> - Can it happen that we never retry faulting in error pfn.  Say a process
>   accessed a pfn, we set error_pending, and then process got killed due
>   to pending signal. Now process will not retry error pfn. And
>   error_pending will remain set and we completely disabled APF
>   mechanism till next error happens (if it happens).

Can a process in kvm_async_pf_task_wait_schedule() get killed? I don't
see us checking signals/... in the loop, just 'if
(hlist_unhashed(&n.link))' -- and this only happens when APF task
completes. I don't know much about processes to be honest, could easily
be wrong completely :-)

>
> In another idea, we could think of maintaining another hash of error
> gfns. Similar to "vcpu->arch.apf.gfns[]". Say "vgpu->arch.apf.error_gfns[]"
>
> - When error happens on a gfn, add it to hash. If slot is busy, overwrite
>   it.
>
> - When kvm_can_do_async_pf(gfn) is called, check if this gfn is present
>   in error_gfn, if yes, clear it and force sync fault.
>
> This is more complicated but should take care of your concerns. Also 
> even if process never retries that gfn, we are fine. At max that
> gfn will remain error_gfn array but will not disable APF completely.

Yes, we can do that, but I'm not sure it wouldn't be overkill: we are
not trying to protect the mechanism against a malicious guest. Using APF
is the guest's choice anyway, so even if there's going to be an easy way
to disable it completely from the guest's side (poke an address and
never retry upon wakeup), it doesn't sound like a big deal.

Also, we can introduce a status bit in the APF 'page ready' notification
stating that the page is actually NOT ready and the mechanism was
blocked because of that; the guest will then have to access the GFN to
get the error injected (and unblock the mechanism).
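A minimal sketch of that status-bit idea (the flag bit, macro, and helper names below are purely hypothetical; the real 'page ready' token layout is part of the documented KVM ABI and adding a bit there would need its own spec change):

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical high bit of the 32-bit APF token carried by a
 * 'page ready' notification: set when the page is NOT actually
 * ready because the mechanism was blocked by an error. */
#define APF_FLAG_PAGE_NOT_PRESENT (1U << 31)

/* Host side: tag the token when waking the guest after a failed fault. */
static uint32_t apf_encode_token(uint32_t token, int page_ready)
{
	return page_ready ? token : (token | APF_FLAG_PAGE_NOT_PRESENT);
}

/* Guest side: a flagged wakeup means "access the GFN again so the host
 * can inject the error synchronously and unblock APF". */
static int apf_wakeup_needs_sync_retry(uint32_t token)
{
	return (token & APF_FLAG_PAGE_NOT_PRESENT) != 0;
}
```

This only illustrates the protocol shape; whether the woken process actually retries the access is exactly the concern raised in the follow-up below.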
Vivek Goyal June 29, 2020, 10:03 p.m. UTC | #4
On Mon, Jun 29, 2020 at 10:56:25PM +0200, Vitaly Kuznetsov wrote:
> Vivek Goyal <vgoyal@redhat.com> writes:
> 
> > On Fri, Jun 26, 2020 at 11:25:19AM +0200, Vitaly Kuznetsov wrote:
> >
> > [..]
> >> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> >> > index 76817d13c86e..a882a6a9f7a7 100644
> >> > --- a/arch/x86/kvm/mmu/mmu.c
> >> > +++ b/arch/x86/kvm/mmu/mmu.c
> >> > @@ -4078,7 +4078,7 @@ static bool try_async_pf(struct kvm_vcpu *vcpu, bool prefault, gfn_t gfn,
> >> >  	if (!async)
> >> >  		return false; /* *pfn has correct page already */
> >> >  
> >> > -	if (!prefault && kvm_can_do_async_pf(vcpu)) {
> >> > +	if (!prefault && kvm_can_do_async_pf(vcpu, cr2_or_gpa >> PAGE_SHIFT)) {
> >> 
> >> gpa_to_gfn(cr2_or_gpa) ?
> >
> > Will do.
> >
> > [..]
> >> > -bool kvm_can_do_async_pf(struct kvm_vcpu *vcpu)
> >> > +bool kvm_can_do_async_pf(struct kvm_vcpu *vcpu, gfn_t gfn)
> >> >  {
> >> >  	if (unlikely(!lapic_in_kernel(vcpu) ||
> >> >  		     kvm_event_needs_reinjection(vcpu) ||
> >> > @@ -10504,7 +10506,13 @@ bool kvm_can_do_async_pf(struct kvm_vcpu *vcpu)
> >> >  	 * If interrupts are off we cannot even use an artificial
> >> >  	 * halt state.
> >> >  	 */
> >> > -	return kvm_arch_interrupt_allowed(vcpu);
> >> > +	if (!kvm_arch_interrupt_allowed(vcpu))
> >> > +		return false;
> >> > +
> >> > +	if (vcpu->arch.apf.error_gfn == gfn)
> >> > +		return false;
> >> > +
> >> > +	return true;
> >> >  }
> >> >  
> >> >  bool kvm_arch_async_page_not_present(struct kvm_vcpu *vcpu,
> >> 
> >> I'm a little bit afraid that a single error_gfn may not give us
> >> deterministic behavior. E.g. when we have a lot of faulting processes
> >> it may take many iterations to hit 'error_gfn == gfn' because we'll
> >> always be overwriting 'error_gfn' with new values and waking up some
> >> (random) process.
> >> 
> >> What if we just temporarily disable the whole APF mechanism? That would
> >> ensure we're making forward progress. Something like (completely
> >> untested):
> >> 
> >> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> >> index f8998e97457f..945b3d5a2796 100644
> >> --- a/arch/x86/include/asm/kvm_host.h
> >> +++ b/arch/x86/include/asm/kvm_host.h
> >> @@ -778,6 +778,7 @@ struct kvm_vcpu_arch {
> >>  		unsigned long nested_apf_token;
> >>  		bool delivery_as_pf_vmexit;
> >>  		bool pageready_pending;
> >> +		bool error_pending;
> >>  	} apf;
> >>  
> >>  	/* OSVW MSRs (AMD only) */
> >> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> >> index fdd05c233308..e5f04ae97e91 100644
> >> --- a/arch/x86/kvm/mmu/mmu.c
> >> +++ b/arch/x86/kvm/mmu/mmu.c
> >> @@ -4124,8 +4124,18 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
> >>  	if (try_async_pf(vcpu, prefault, gfn, gpa, &pfn, write, &map_writable))
> >>  		return RET_PF_RETRY;
> >>  
> >> -	if (handle_abnormal_pfn(vcpu, is_tdp ? 0 : gpa, gfn, pfn, ACC_ALL, &r))
> >> +	if (handle_abnormal_pfn(vcpu, is_tdp ? 0 : gpa, gfn, pfn, ACC_ALL, &r)) {
> >> +		/*
> >> +		 * In case APF mechanism was previously disabled due to an error
> >> +		 * we are ready to re-enable it here as we're about to inject an
> >> +		 * error to userspace. There is no guarantee we are handling the
> >> +		 * same GFN which failed in APF here but at least we are making
> >> +		 * forward progress.
> >> +		 */
> >> +
> >> +		vcpu->arch.apf.error_pending = false;
> >
> > I like this idea. It is simple. But I have a concern with it though.
> >
> > - Can it happen that we never retry faulting in error pfn.  Say a process
> >   accessed a pfn, we set error_pending, and then process got killed due
> >   to pending signal. Now process will not retry error pfn. And
> >   error_pending will remain set and we completely disabled APF
> >   mechanism till next error happens (if it happens).
> 
> Can a process in kvm_async_pf_task_wait_schedule() get killed? I don't
> see us checking signals/... in the loop, just 'if
> (hlist_unhashed(&n.link))' -- and this only happens when APF task
> completes. I don't know much about processes to be honest, could easily
> be wrong completely :-)

I think a waiting process will be woken up and scheduled again. And
when it starts running again and goes back to user space (the faulting
instruction was in user space), we should check for a pending signal
and kill it.

That's how my patches for sending SIGBUS worked. I queued the SIGBUS,
and when the process got scheduled it received the SIGBUS, got killed,
and stopped retrying the instruction. (Otherwise this fault cycle would
never end.)

Hence, I think it is possible. Another process can send SIGKILL to
this process which is waiting for APF. Once the APF page ready event
comes in, the process will be killed after that without retrying the
instruction. I will be glad to be corrected if I understood it wrong.

> 
> >
> > In another idea, we could think of maintaining another hash of error
> > gfns. Similar to "vcpu->arch.apf.gfns[]". Say "vgpu->arch.apf.error_gfns[]"
> >
> > - When error happens on a gfn, add it to hash. If slot is busy, overwrite
> >   it.
> >
> > - When kvm_can_do_async_pf(gfn) is called, check if this gfn is present
> >   in error_gfn, if yes, clear it and force sync fault.
> >
> > This is more complicated but should take care of your concerns. Also 
> > even if process never retries that gfn, we are fine. At max that
> > gfn will remain error_gfn array but will not disable APF completely.
> 
> Yes, we can do that but I'm not sure it wouldn't be an overkill: we are
> not trying to protect the mechanism against a malicious guest. Using APF
> is guest's choice anyway so even if there's going to be an easy way to
> disable it completely (poke an address and never retry upon wakeup) from
> guest's side it doesn't sound like a big deal.

Sure, but if the guest chose APF and then it got disabled completely
unintentionally, that's a problem, isn't it? This is just a race
condition which can disable APF unintentionally and leave it like
that till the next error happens.

> 
> Also, we can introduce a status bit in the APF 'page ready' notification
> stating that the page is actually NOT ready and the mechanism was blocked
> because of that, the guest will have to access the GFN to get the error
> injected (and unblock the mechanism).

I am not sure how we will force the guest to access that pfn if the
accessing process gets killed. This actually feels like the least
preferred of all the options.

Thanks
Vivek
Vitaly Kuznetsov June 30, 2020, 1:24 p.m. UTC | #5
Vivek Goyal <vgoyal@redhat.com> writes:

> On Mon, Jun 29, 2020 at 10:56:25PM +0200, Vitaly Kuznetsov wrote:
>> Vivek Goyal <vgoyal@redhat.com> writes:
>> 
>> > On Fri, Jun 26, 2020 at 11:25:19AM +0200, Vitaly Kuznetsov wrote:
>> >
>> > [..]
>> >> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
>> >> > index 76817d13c86e..a882a6a9f7a7 100644
>> >> > --- a/arch/x86/kvm/mmu/mmu.c
>> >> > +++ b/arch/x86/kvm/mmu/mmu.c
>> >> > @@ -4078,7 +4078,7 @@ static bool try_async_pf(struct kvm_vcpu *vcpu, bool prefault, gfn_t gfn,
>> >> >  	if (!async)
>> >> >  		return false; /* *pfn has correct page already */
>> >> >  
>> >> > -	if (!prefault && kvm_can_do_async_pf(vcpu)) {
>> >> > +	if (!prefault && kvm_can_do_async_pf(vcpu, cr2_or_gpa >> PAGE_SHIFT)) {
>> >> 
>> >> gpa_to_gfn(cr2_or_gpa) ?
>> >
>> > Will do.
>> >
>> > [..]
>> >> > -bool kvm_can_do_async_pf(struct kvm_vcpu *vcpu)
>> >> > +bool kvm_can_do_async_pf(struct kvm_vcpu *vcpu, gfn_t gfn)
>> >> >  {
>> >> >  	if (unlikely(!lapic_in_kernel(vcpu) ||
>> >> >  		     kvm_event_needs_reinjection(vcpu) ||
>> >> > @@ -10504,7 +10506,13 @@ bool kvm_can_do_async_pf(struct kvm_vcpu *vcpu)
>> >> >  	 * If interrupts are off we cannot even use an artificial
>> >> >  	 * halt state.
>> >> >  	 */
>> >> > -	return kvm_arch_interrupt_allowed(vcpu);
>> >> > +	if (!kvm_arch_interrupt_allowed(vcpu))
>> >> > +		return false;
>> >> > +
>> >> > +	if (vcpu->arch.apf.error_gfn == gfn)
>> >> > +		return false;
>> >> > +
>> >> > +	return true;
>> >> >  }
>> >> >  
>> >> >  bool kvm_arch_async_page_not_present(struct kvm_vcpu *vcpu,
>> >> 
>> >> I'm a little bit afraid that a single error_gfn may not give us
>> >> deterministic behavior. E.g. when we have a lot of faulting processes
>> >> it may take many iterations to hit 'error_gfn == gfn' because we'll
>> >> always be overwriting 'error_gfn' with new values and waking up some
>> >> (random) process.
>> >> 
>> >> What if we just temporarily disable the whole APF mechanism? That would
>> >> ensure we're making forward progress. Something like (completely
>> >> untested):
>> >> 
>> >> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
>> >> index f8998e97457f..945b3d5a2796 100644
>> >> --- a/arch/x86/include/asm/kvm_host.h
>> >> +++ b/arch/x86/include/asm/kvm_host.h
>> >> @@ -778,6 +778,7 @@ struct kvm_vcpu_arch {
>> >>  		unsigned long nested_apf_token;
>> >>  		bool delivery_as_pf_vmexit;
>> >>  		bool pageready_pending;
>> >> +		bool error_pending;
>> >>  	} apf;
>> >>  
>> >>  	/* OSVW MSRs (AMD only) */
>> >> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
>> >> index fdd05c233308..e5f04ae97e91 100644
>> >> --- a/arch/x86/kvm/mmu/mmu.c
>> >> +++ b/arch/x86/kvm/mmu/mmu.c
>> >> @@ -4124,8 +4124,18 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
>> >>  	if (try_async_pf(vcpu, prefault, gfn, gpa, &pfn, write, &map_writable))
>> >>  		return RET_PF_RETRY;
>> >>  
>> >> -	if (handle_abnormal_pfn(vcpu, is_tdp ? 0 : gpa, gfn, pfn, ACC_ALL, &r))
>> >> +	if (handle_abnormal_pfn(vcpu, is_tdp ? 0 : gpa, gfn, pfn, ACC_ALL, &r)) {
>> >> +		/*
>> >> +		 * In case APF mechanism was previously disabled due to an error
>> >> +		 * we are ready to re-enable it here as we're about to inject an
>> >> +		 * error to userspace. There is no guarantee we are handling the
>> >> +		 * same GFN which failed in APF here but at least we are making
>> >> +		 * forward progress.
>> >> +		 */
>> >> +
>> >> +		vcpu->arch.apf.error_pending = false;
>> >
>> > I like this idea. It is simple. But I have a concern with it though.
>> >
>> > - Can it happen that we never retry faulting in error pfn.  Say a process
>> >   accessed a pfn, we set error_pending, and then process got killed due
>> >   to pending signal. Now process will not retry error pfn. And
>> >   error_pending will remain set and we completely disabled APF
>> >   mechanism till next error happens (if it happens).
>> 
>> Can a process in kvm_async_pf_task_wait_schedule() get killed? I don't
>> see us checking signals/... in the loop, just 'if
>> (hlist_unhashed(&n.link))' -- and this only happens when APF task
>> completes. I don't know much about processes to be honest, could easily
>> be wrong completely :-)
>
> I think a waiting process will be woken up and scheduled again. And
> when it starts running again and goes back to user space (the faulting
> instruction was in user space), then we should check for a pending signal
> and kill it.
>
> That's how my patches for sending SIGBUS were working. I queued SIGBUS
> and then when process got scheduled, it got SIGBUS and got killed and
> stopped retrying instruction. (Otherwise this fault cycle will never
> end).
>
> Hence, I think it is possible. Another process can send SIGKILL to
> this process which is waiting for APF. Once the APF page ready event
> comes in, the process will be killed after that without retrying the
> instruction. I will be glad to be corrected if I understood it wrong.
>

It's probably me who's missing something important here :-) but I think
you describe how it *should* work as I'm not seeing how we can leave the
loop in kvm_async_pf_task_wait_schedule() other than by 
"if (hlist_unhashed(&n.link)) break;" and this only happens when APF
completes.

>> 
>> >
>> > In another idea, we could think of maintaining another hash of error
>> > gfns, similar to "vcpu->arch.apf.gfns[]". Say "vcpu->arch.apf.error_gfns[]"
>> >
>> > - When error happens on a gfn, add it to hash. If slot is busy, overwrite
>> >   it.
>> >
>> > - When kvm_can_do_async_pf(gfn) is called, check if this gfn is present
>> >   in error_gfn, if yes, clear it and force sync fault.
>> >
>> > This is more complicated but should take care of your concerns. Also
>> > even if the process never retries that gfn, we are fine. At most that
>> > gfn will remain in the error_gfn array but will not disable APF completely.
>> 
>> Yes, we can do that but I'm not sure it wouldn't be an overkill: we are
>> not trying to protect the mechanism against a malicious guest. Using APF
>> is guest's choice anyway so even if there's going to be an easy way to
>> disable it completely (poke an address and never retry upon wakeup) from
>> guest's side it doesn't sound like a big deal.
>
> Sure but if guest chose APF and then it got disabled completely
> intentionally, then it's a problem, isn't it? This is just a race
> condition which can disable APF unintentionally and leave it like
> that till the next error happens.
>
>> 
>> Also, we can introduce a status bit in the APF 'page ready' notification
>> stating that the page is actually NOT ready and the mechanism was blocked
>> because of that; the guest will have to access the GFN to get the error
>> injected (and unblock the mechanism).
>
> I am not sure how will we force guest to access that pfn if accessing
> process gets killed. This actually feels like least preferred of all
> options.

When guest receives the 'page ready' event with an error it (like for
every other 'page ready' event) tries to wake up the corresponding
process but if the process is dead already it can do in-kernel probing
of the GFN, this way we guarantee that the error is always injected. I'm
not sure if it is needed though but in case it is, this can be a
solution. We can add a new feature bit and only deliver errors when the
guest indicates that it knows what to do with them.
Vivek Goyal June 30, 2020, 2:53 p.m. UTC | #6
On Tue, Jun 30, 2020 at 03:24:43PM +0200, Vitaly Kuznetsov wrote:
> Vivek Goyal <vgoyal@redhat.com> writes:
> 
> > On Mon, Jun 29, 2020 at 10:56:25PM +0200, Vitaly Kuznetsov wrote:
> >> Vivek Goyal <vgoyal@redhat.com> writes:
> >> 
> >> > On Fri, Jun 26, 2020 at 11:25:19AM +0200, Vitaly Kuznetsov wrote:
> >> >
> >> > [..]
> >> >> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> >> >> > index 76817d13c86e..a882a6a9f7a7 100644
> >> >> > --- a/arch/x86/kvm/mmu/mmu.c
> >> >> > +++ b/arch/x86/kvm/mmu/mmu.c
> >> >> > @@ -4078,7 +4078,7 @@ static bool try_async_pf(struct kvm_vcpu *vcpu, bool prefault, gfn_t gfn,
> >> >> >  	if (!async)
> >> >> >  		return false; /* *pfn has correct page already */
> >> >> >  
> >> >> > -	if (!prefault && kvm_can_do_async_pf(vcpu)) {
> >> >> > +	if (!prefault && kvm_can_do_async_pf(vcpu, cr2_or_gpa >> PAGE_SHIFT)) {
> >> >> 
> >> >> gpa_to_gfn(cr2_or_gpa) ?
> >> >
> >> > Will do.
> >> >
> >> > [..]
> >> >> > -bool kvm_can_do_async_pf(struct kvm_vcpu *vcpu)
> >> >> > +bool kvm_can_do_async_pf(struct kvm_vcpu *vcpu, gfn_t gfn)
> >> >> >  {
> >> >> >  	if (unlikely(!lapic_in_kernel(vcpu) ||
> >> >> >  		     kvm_event_needs_reinjection(vcpu) ||
> >> >> > @@ -10504,7 +10506,13 @@ bool kvm_can_do_async_pf(struct kvm_vcpu *vcpu)
> >> >> >  	 * If interrupts are off we cannot even use an artificial
> >> >> >  	 * halt state.
> >> >> >  	 */
> >> >> > -	return kvm_arch_interrupt_allowed(vcpu);
> >> >> > +	if (!kvm_arch_interrupt_allowed(vcpu))
> >> >> > +		return false;
> >> >> > +
> >> >> > +	if (vcpu->arch.apf.error_gfn == gfn)
> >> >> > +		return false;
> >> >> > +
> >> >> > +	return true;
> >> >> >  }
> >> >> >  
> >> >> >  bool kvm_arch_async_page_not_present(struct kvm_vcpu *vcpu,
> >> >> 
> >> >> I'm a little bit afraid that a single error_gfn may not give us
> >> >> deterministric behavior. E.g. when we have a lot of faulting processes
> >> >> it may take many iterations to hit 'error_gfn == gfn' because we'll
> >> >> always be overwriting 'error_gfn' with new values and waking up some
> >> >> (random) process.
> >> >> 
> >> >> What if we just temporary disable the whole APF mechanism? That would
> >> >> ensure we're making forward progress. Something like (completely
> >> >> untested):
> >> >> 
> >> >> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> >> >> index f8998e97457f..945b3d5a2796 100644
> >> >> --- a/arch/x86/include/asm/kvm_host.h
> >> >> +++ b/arch/x86/include/asm/kvm_host.h
> >> >> @@ -778,6 +778,7 @@ struct kvm_vcpu_arch {
> >> >>  		unsigned long nested_apf_token;
> >> >>  		bool delivery_as_pf_vmexit;
> >> >>  		bool pageready_pending;
> >> >> +		bool error_pending;
> >> >>  	} apf;
> >> >>  
> >> >>  	/* OSVW MSRs (AMD only) */
> >> >> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> >> >> index fdd05c233308..e5f04ae97e91 100644
> >> >> --- a/arch/x86/kvm/mmu/mmu.c
> >> >> +++ b/arch/x86/kvm/mmu/mmu.c
> >> >> @@ -4124,8 +4124,18 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
> >> >>  	if (try_async_pf(vcpu, prefault, gfn, gpa, &pfn, write, &map_writable))
> >> >>  		return RET_PF_RETRY;
> >> >>  
> >> >> -	if (handle_abnormal_pfn(vcpu, is_tdp ? 0 : gpa, gfn, pfn, ACC_ALL, &r))
> >> >> +	if (handle_abnormal_pfn(vcpu, is_tdp ? 0 : gpa, gfn, pfn, ACC_ALL, &r)) {
> >> >> +		/*
> >> >> +		 * In case APF mechanism was previously disabled due to an error
> >> >> +		 * we are ready to re-enable it here as we're about to inject an
> >> >> +		 * error to userspace. There is no guarantee we are handling the
> >> >> +		 * same GFN which failed in APF here but at least we are making
> >> >> +		 * forward progress.
> >> >> +		 */
> >> >> +
> >> >> +		vcpu->arch.apf.error_pending = false;
> >> >
> >> > I like this idea. It is simple. But I have a concern with it though.
> >> >
> >> > - Can it happen that we never retry faulting in error pfn.  Say a process
> >> >   accessed a pfn, we set error_pending, and then process got killed due
> >> >   to pending signal. Now process will not retry error pfn. And
> >> >   error_pending will remain set and we completely disabled APF
> >> >   mechanism till next error happens (if it happens).
> >> 
> >> Can a process in kvm_async_pf_task_wait_schedule() get killed? I don't
> >> see us checking signals/... in the loop, just 'if
> >> (hlist_unhashed(&n.link))' -- and this only happens when APF task
> >> completes. I don't know much about processes to be honest, could easily
> >> be wrong completely :-)
> >
> > I think a waiting process will be woken up and scheduled again. And
> > when it starts running again and goes back to user space (the faulting
> > instruction was in user space), then we should check for a pending signal
> > and kill it.
> >
> > That's how my patches for sending SIGBUS were working. I queued SIGBUS
> > and then when process got scheduled, it got SIGBUS and got killed and
> > stopped retrying instruction. (Otherwise this fault cycle will never
> > end).
> >
> > Hence, I think it is possible. Another process can send SIGKILL to
> > this process which is waiting for APF. Once the APF page ready event
> > comes in, the process will be killed after that without retrying the
> > instruction. I will be glad to be corrected if I understood it wrong.
> >
> 
> It's probably me who's missing something important here :-) but I think
> you describe how it *should* work as I'm not seeing how we can leave the
> loop in kvm_async_pf_task_wait_schedule() other than by 
> "if (hlist_unhashed(&n.link)) break;" and this only happens when APF
> completes.

We don't leave the loop in kvm_async_pf_task_wait_schedule(). The kill
will happen before you return to user space.

I have not looked too closely but I think the following code path might be
taken after async PF has completed.

__kvm_handle_async_pf()
  idtentry_exit_cond_rcu()
    prepare_exit_to_usermode()
      __prepare_exit_to_usermode()
        exit_to_usermode_loop()
	  do_signal()

So once you have been woken up (because APF completed), you will
return to user space, and before that you will check if there are
pending signals and handle them first, before user space
gets a chance to run again and retry the faulting instruction.

> 
> >> 
> >> >
> >> > In another idea, we could think of maintaining another hash of error
> >> > gfns. Similar to "vcpu->arch.apf.gfns[]". Say "vgpu->arch.apf.error_gfns[]"
> >> >
> >> > - When error happens on a gfn, add it to hash. If slot is busy, overwrite
> >> >   it.
> >> >
> >> > - When kvm_can_do_async_pf(gfn) is called, check if this gfn is present
> >> >   in error_gfn, if yes, clear it and force sync fault.
> >> >
> >> > This is more complicated but should take care of your concerns. Also 
> >> > even if process never retries that gfn, we are fine. At max that
> >> > gfn will remain error_gfn array but will not disable APF completely.
> >> 
> >> Yes, we can do that but I'm not sure it wouldn't be an overkill: we are
> >> not trying to protect the mechanism against a malicious guest. Using APF
> >> is guest's choice anyway so even if there's going to be an easy way to
> >> disable it completely (poke an address and never retry upon wakeup) from
> >> guest's side it doesn't sound like a big deal.
> >
> > Sure but if guest chose APF and then it got disabled completely
> > intentionally, then it's a problem, isn't it? This is just a race
> > condition which can disable APF unintentionally and leave it like
> > that till the next error happens.
> >
> >> 
> >> Also, we can introduce a status bit in the APF 'page ready' notification
> >> stating that the page is actually NOT ready and the mechanism was blocked
> >> because of that; the guest will have to access the GFN to get the error
> >> injected (and unblock the mechanism).
> >
> > I am not sure how will we force guest to access that pfn if accessing
> > process gets killed. This actually feels like least preferred of all
> > options.
> 
> When guest receives the 'page ready' event with an error it (like for
> every other 'page ready' event) tries to wake up the corresponding
> process but if the process is dead already it can do in-kernel probing
> of the GFN, this way we guarantee that the error is always injected. I'm
> not sure if it is needed though but in case it is, this can be a
> solution. We can add a new feature bit and only deliver errors when the
> guest indicates that it knows what to do with them.

- The process will be delivered the signal after async PF completion,
  while returning to user space. You have lost control by then.

- If you retry in the kernel, we completely change the context of who
  was trying to access the gfn in question. We want to retain the real
  context and the information about who was trying to access that gfn.

Vivek
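The error-gfn cache discussed above can be sketched as a small userspace model. This is untested illustration only: the direct-mapped layout, the 64-slot size, and the helper names are stand-ins for whatever vcpu->arch.apf and kvm_can_do_async_pf() would actually use, not the real KVM implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

typedef uint64_t gfn_t;

/* Illustrative stand-in for the proposed vcpu->arch.apf.error_gfns[];
 * the slot count and the modulo "hash" are made up for this sketch. */
#define ERROR_GFN_SLOTS 64

struct apf_state {
	gfn_t error_gfns[ERROR_GFN_SLOTS];
};

static void apf_init(struct apf_state *apf)
{
	/* (gfn_t)-1 marks an empty slot so that gfn 0 is usable. */
	memset(apf->error_gfns, 0xff, sizeof(apf->error_gfns));
}

static void record_error_gfn(struct apf_state *apf, gfn_t gfn)
{
	/* If the slot is busy, overwrite it, as proposed in the thread. */
	apf->error_gfns[gfn % ERROR_GFN_SLOTS] = gfn;
}

/* Mirrors the proposed check in kvm_can_do_async_pf(): if this gfn
 * previously failed, clear the slot and force one synchronous fault. */
static bool can_do_async_pf(struct apf_state *apf, gfn_t gfn)
{
	gfn_t *slot = &apf->error_gfns[gfn % ERROR_GFN_SLOTS];

	if (*slot == gfn) {
		*slot = (gfn_t)-1;
		return false;	/* force the sync path once */
	}
	return true;
}
```

Note that, unlike the single error_gfn field, an overwrite here only evicts one colliding entry instead of disabling APF tracking for every other gfn.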
Vitaly Kuznetsov June 30, 2020, 3:13 p.m. UTC | #7
Vivek Goyal <vgoyal@redhat.com> writes:

> On Tue, Jun 30, 2020 at 03:24:43PM +0200, Vitaly Kuznetsov wrote:

>> 
>> It's probably me who's missing something important here :-) but I think
>> you describe how it *should* work as I'm not seeing how we can leave the
>> loop in kvm_async_pf_task_wait_schedule() other than by 
>> "if (hlist_unhashed(&n.link)) break;" and this only happens when APF
>> completes.
>
> We don't leave loop in kvm_async_pf_task_wait_schedule(). It will happen
> before you return to user space.
>
> I have not looked too closely but I think the following code path might be taken
> after async PF has completed.
>
> __kvm_handle_async_pf()
>   idtentry_exit_cond_rcu()
>     prepare_exit_to_usermode()
>       __prepare_exit_to_usermode()
>         exit_to_usermode_loop()
> 	  do_signal()
>
> So once you have been woken up (because APF completed),

Ah, OK so we still need to complete APF and we can't kill the process
before this happens, that's what I was missing.

>  you will
> return to user space and before that you will check if there are
> pending signals and handle that signal first before user space
> gets a chance to run again and retry faulting instruction.

...

>
>> 
>> When guest receives the 'page ready' event with an error it (like for
>> every other 'page ready' event) tries to wake up the corresponding
>> process but if the process is dead already it can do in-kernel probing
>> of the GFN, this way we guarantee that the error is always injected. I'm
>> not sure if it is needed though but in case it is, this can be a
>> solution. We can add a new feature bit and only deliver errors when the
>> guest indicates that it knows what to do with them.
>
> - Process will be delivered the signal after async PF completion and during
>   returning to user space. You have lost control by then.
>

So actually there's no way for kernel to know if the userspace process
managed to re-try the instruction and get the error injected or if it
was killed prior to that.

> - If you retry in kernel, we will change the context completely that
>   who was trying to access the gfn in question. We want to retain
>   the real context and retain information who was trying to access
>   gfn in question.

(Just so I understand the idea better) does the guest context matter to
the host? Or, more specifically, are we going to do anything besides
get_user_pages() which will actually analyze who triggered the access
*in the guest*?
Vivek Goyal June 30, 2020, 3:25 p.m. UTC | #8
On Tue, Jun 30, 2020 at 05:13:54PM +0200, Vitaly Kuznetsov wrote:
> Vivek Goyal <vgoyal@redhat.com> writes:
> 
> > On Tue, Jun 30, 2020 at 03:24:43PM +0200, Vitaly Kuznetsov wrote:
> 
> >> 
> >> It's probably me who's missing something important here :-) but I think
> >> you describe how it *should* work as I'm not seeing how we can leave the
> >> loop in kvm_async_pf_task_wait_schedule() other than by 
> >> "if (hlist_unhashed(&n.link)) break;" and this only happens when APF
> >> completes.
> >
> > We don't leave loop in kvm_async_pf_task_wait_schedule(). It will happen
> > before you return to user space.
> >
> > I have not looked too closely but I think following code path might be taken
> > after aync PF has completed.
> >
> > __kvm_handle_async_pf()
> >   idtentry_exit_cond_rcu()
> >     prepare_exit_to_usermode()
> >       __prepare_exit_to_usermode()
> >         exit_to_usermode_loop()
> > 	  do_signal()
> >
> > So once you have been woken up (because APF completed),
> 
> Ah, OK so we still need to complete APF and we can't kill the process
> before this happens, that's what I was missing.
> 
> >  you will
> > return to user space and before that you will check if there are
> > pending signals and handle that signal first before user space
> > gets a chance to run again and retry faulting instruction.
> 
> ...
> 
> >
> >> 
> >> When guest receives the 'page ready' event with an error it (like for
> >> every other 'page ready' event) tries to wake up the corresponding
> >> process but if the process is dead already it can do in-kernel probing
> >> of the GFN, this way we guarantee that the error is always injected. I'm
> >> not sure if it is needed though but in case it is, this can be a
> >> solution. We can add a new feature bit and only deliver errors when the
> >> guest indicates that it knows what to do with them.
> >
> > - Process will be delivered singal after async PF completion and during
> >   returning to user space. You have lost control by then.
> >
> 
> So actually there's no way for kernel to know if the userspace process
> managed to re-try the instruction and get the error injected or if it
> was killed prior to that.

Yes. 

> 
> > - If you retry in kernel, we will change the context completely that
> >   who was trying to access the gfn in question. We want to retain
> >   the real context and retain information who was trying to access
> >   gfn in question.
> 
> (Just so I understand the idea better) does the guest context matter to
> the host? Or, more specifically, are we going to do anything besides
> get_user_pages() which will actually analyze who triggered the access
> *in the guest*?

When we exit to user space, qemu prints a bunch of register state. I am
wondering what that state represents. Does some of that trace
back to the process which was trying to access that hva? I don't
know.

I think keeping a cache of error gfns might not be too bad from an
implementation point of view. I will give it a try and see how
bad it looks.

Vivek
Vitaly Kuznetsov June 30, 2020, 3:43 p.m. UTC | #9
Vivek Goyal <vgoyal@redhat.com> writes:

> On Tue, Jun 30, 2020 at 05:13:54PM +0200, Vitaly Kuznetsov wrote:
>> 
>> > - If you retry in kernel, we will change the context completely that
>> >   who was trying to access the gfn in question. We want to retain
>> >   the real context and retain information who was trying to access
>> >   gfn in question.
>> 
>> (Just so I understand the idea better) does the guest context matter to
>> the host? Or, more specifically, are we going to do anything besides
>> get_user_pages() which will actually analyze who triggered the access
>> *in the guest*?
>
> When we exit to user space, qemu prints bunch of register state. I am
> wondering what does that state represent. Does some of that traces
> back to the process which was trying to access that hva? I don't
> know.

We can get the full CPU state when the fault happens if we need to but
generally we are not analyzing it. I can imagine looking at CPL, for
example, but trying to distinguish guest's 'process A' from 'process B'
may not be simple.

>
> I think keeping a cache of error gfns might not be too bad from
> implemetation point of view. I will give it a try and see how
> bad does it look.

Right; I'm only worried about the fact that every cache (or hash) has a
limited size and under certain circumstances we may overflow it. When an
overflow happens, we will follow the APF path again and this can go over
and over. Maybe we can punch a hole in EPT/NPT making the PFN reserved/
not-present so when the guest tries to access it again we trap the
access in KVM and, if the error persists, don't follow the APF path?
Sean Christopherson June 30, 2020, 3:50 p.m. UTC | #10
On Tue, Jun 30, 2020 at 05:43:54PM +0200, Vitaly Kuznetsov wrote:
> Vivek Goyal <vgoyal@redhat.com> writes:
> 
> > On Tue, Jun 30, 2020 at 05:13:54PM +0200, Vitaly Kuznetsov wrote:
> >> 
> >> > - If you retry in kernel, we will change the context completely that
> >> >   who was trying to access the gfn in question. We want to retain
> >> >   the real context and retain information who was trying to access
> >> >   gfn in question.
> >> 
> >> (Just so I understand the idea better) does the guest context matter to
> >> the host? Or, more specifically, are we going to do anything besides
> >> get_user_pages() which will actually analyze who triggered the access
> >> *in the guest*?
> >
> > When we exit to user space, qemu prints bunch of register state. I am
> > wondering what does that state represent. Does some of that traces
> > back to the process which was trying to access that hva? I don't
> > know.
> 
> We can get the full CPU state when the fault happens if we need to but
> generally we are not analyzing it. I can imagine looking at CPL, for
> example, but trying to distinguish guest's 'process A' from 'process B'
> may not be simple.
> 
> >
> > I think keeping a cache of error gfns might not be too bad from
> > implemetation point of view. I will give it a try and see how
> > bad does it look.
> 
> Right; I'm only worried about the fact that every cache (or hash) has a
> limited size and under certain curcumstances we may overflow it. When an
> overflow happens, we will follow the APF path again and this can go over
> and over. Maybe we can punch a hole in EPT/NPT making the PFN reserved/
> not-present so when the guest tries to access it again we trap the
> access in KVM and, if the error persists, don't follow the APF path?

Just to make sure I'm somewhat keeping track, is the problem we're trying to
solve that the guest may not immediately retry the "bad" GPA and so KVM may
not detect that the async #PF already came back as -EFAULT or whatever?
Vitaly Kuznetsov June 30, 2020, 4:12 p.m. UTC | #11
Sean Christopherson <sean.j.christopherson@intel.com> writes:

> On Tue, Jun 30, 2020 at 05:43:54PM +0200, Vitaly Kuznetsov wrote:
>> Vivek Goyal <vgoyal@redhat.com> writes:
>> 
>> > On Tue, Jun 30, 2020 at 05:13:54PM +0200, Vitaly Kuznetsov wrote:
>> >> 
>> >> > - If you retry in kernel, we will change the context completely that
>> >> >   who was trying to access the gfn in question. We want to retain
>> >> >   the real context and retain information who was trying to access
>> >> >   gfn in question.
>> >> 
>> >> (Just so I understand the idea better) does the guest context matter to
>> >> the host? Or, more specifically, are we going to do anything besides
>> >> get_user_pages() which will actually analyze who triggered the access
>> >> *in the guest*?
>> >
>> > When we exit to user space, qemu prints bunch of register state. I am
>> > wondering what does that state represent. Does some of that traces
>> > back to the process which was trying to access that hva? I don't
>> > know.
>> 
>> We can get the full CPU state when the fault happens if we need to but
>> generally we are not analyzing it. I can imagine looking at CPL, for
>> example, but trying to distinguish guest's 'process A' from 'process B'
>> may not be simple.
>> 
>> >
>> > I think keeping a cache of error gfns might not be too bad from
>> > implemetation point of view. I will give it a try and see how
>> > bad does it look.
>> 
>> Right; I'm only worried about the fact that every cache (or hash) has a
>> limited size and under certain curcumstances we may overflow it. When an
>> overflow happens, we will follow the APF path again and this can go over
>> and over. Maybe we can punch a hole in EPT/NPT making the PFN reserved/
>> not-present so when the guest tries to access it again we trap the
>> access in KVM and, if the error persists, don't follow the APF path?
>
> Just to make sure I'm somewhat keeping track, is the problem we're trying to
> solve that the guest may not immediately retry the "bad" GPA and so KVM may
> not detect that the async #PF already came back as -EFAULT or whatever? 

Yes. In Vivek's patch there's a single 'error_gfn' per vCPU which serves
as an indicator whether to follow APF path or not.
Sean Christopherson June 30, 2020, 4:32 p.m. UTC | #12
On Tue, Jun 30, 2020 at 06:12:49PM +0200, Vitaly Kuznetsov wrote:
> Sean Christopherson <sean.j.christopherson@intel.com> writes:
> 
> > On Tue, Jun 30, 2020 at 05:43:54PM +0200, Vitaly Kuznetsov wrote:
> >> Vivek Goyal <vgoyal@redhat.com> writes:
> >> 
> >> > On Tue, Jun 30, 2020 at 05:13:54PM +0200, Vitaly Kuznetsov wrote:
> >> >> 
> >> >> > - If you retry in kernel, we will change the context completely that
> >> >> >   who was trying to access the gfn in question. We want to retain
> >> >> >   the real context and retain information who was trying to access
> >> >> >   gfn in question.
> >> >> 
> >> >> (Just so I understand the idea better) does the guest context matter to
> >> >> the host? Or, more specifically, are we going to do anything besides
> >> >> get_user_pages() which will actually analyze who triggered the access
> >> >> *in the guest*?
> >> >
> >> > When we exit to user space, qemu prints bunch of register state. I am
> >> > wondering what does that state represent. Does some of that traces
> >> > back to the process which was trying to access that hva? I don't
> >> > know.
> >> 
> >> We can get the full CPU state when the fault happens if we need to but
> >> generally we are not analyzing it. I can imagine looking at CPL, for
> >> example, but trying to distinguish guest's 'process A' from 'process B'
> >> may not be simple.
> >> 
> >> >
> >> > I think keeping a cache of error gfns might not be too bad from
> >> > implemetation point of view. I will give it a try and see how
> >> > bad does it look.
> >> 
> >> Right; I'm only worried about the fact that every cache (or hash) has a
> >> limited size and under certain curcumstances we may overflow it. When an
> >> overflow happens, we will follow the APF path again and this can go over
> >> and over. Maybe we can punch a hole in EPT/NPT making the PFN reserved/
> >> not-present so when the guest tries to access it again we trap the
> >> access in KVM and, if the error persists, don't follow the APF path?
> >
> > Just to make sure I'm somewhat keeping track, is the problem we're trying to
> > solve that the guest may not immediately retry the "bad" GPA and so KVM may
> > not detect that the async #PF already came back as -EFAULT or whatever? 
> 
> Yes. In Vivek's patch there's a single 'error_gfn' per vCPU which serves
> as an indicator whether to follow APF path or not.

A thought along the lines of your "punch a hole in the page tables" idea
would be to invalidate the SPTE (in the unlikely case it's present but not
writable) and tagging it as being invalid for async #PF.  E.g. for !EPT,
there are 63 bits available for metadata.  For EPT, there's a measly 60,
assuming we want to avoid using SUPPRESS_VE.  The fully !present case would
be straightforward, but the !writable case would require extra work,
especially for shadow paging.

With the SPTE tagged, it'd "just" be a matter of hooking into the page fault
paths to detect the flag and disable async #PF.  For TDP that's not too bad,
e.g. pass in a flag to fast_page_fault() and propagate it to try_async_pf().
Not sure how to handle shadow paging, that code makes my head hurt just
looking at it.

It'd require tweaking is_shadow_present_pte() to be more precise, but that's
probably a good thing, and peanuts compared to handling the faults.
Vivek Goyal June 30, 2020, 6:25 p.m. UTC | #13
On Tue, Jun 30, 2020 at 05:43:54PM +0200, Vitaly Kuznetsov wrote:
> Vivek Goyal <vgoyal@redhat.com> writes:
> 
> > On Tue, Jun 30, 2020 at 05:13:54PM +0200, Vitaly Kuznetsov wrote:
> >> 
> >> > - If you retry in kernel, we will change the context completely that
> >> >   who was trying to access the gfn in question. We want to retain
> >> >   the real context and retain information who was trying to access
> >> >   gfn in question.
> >> 
> >> (Just so I understand the idea better) does the guest context matter to
> >> the host? Or, more specifically, are we going to do anything besides
> >> get_user_pages() which will actually analyze who triggered the access
> >> *in the guest*?
> >
> > When we exit to user space, qemu prints bunch of register state. I am
> > wondering what does that state represent. Does some of that traces
> > back to the process which was trying to access that hva? I don't
> > know.
> 
> We can get the full CPU state when the fault happens if we need to but
> generally we are not analyzing it. I can imagine looking at CPL, for
> example, but trying to distinguish guest's 'process A' from 'process B'
> may not be simple.
> 
> >
> > I think keeping a cache of error gfns might not be too bad from
> > implemetation point of view. I will give it a try and see how
> > bad does it look.
> 
> Right; I'm only worried about the fact that every cache (or hash) has a
> limited size and under certain curcumstances we may overflow it. When an
> overflow happens, we will follow the APF path again and this can go over
> and over.

Sure. But what are the chances of that happening? Say our cache size is
64. That means we need at least 128 processes doing co-ordinated faults
(all in the error zone) to skip the cache completely all the time. We
have to hit the cache only once. Chances of missing the error gfn
cache completely for a very long time are very slim. And if we miss
it a few times, no harm done. We will just spin a few times and then
exit to qemu.

IOW, chances of spinning infinitely are not zero. But they look so
small that in practice I am not worried about it.
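That arithmetic can be checked with a toy model of the cache (illustrative assumptions: a direct-mapped, overwrite-on-miss cache with 64 slots and faulters retrying in round-robin order, each on its own distinct gfn). With 64 faulters everyone hits on the first retry; with 128 perfectly coordinated faulters each slot is evicted between retries and progress is never made.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define SLOTS 64

/* Returns true if any faulter gets a cache hit (i.e. is forced down
 * the synchronous path, making forward progress) within 'rounds'
 * rounds of round-robin retries. Faulter i always retries gfn i. */
static bool makes_progress(int nfaulters, int rounds)
{
	unsigned long cache[SLOTS];
	int r, i, s;

	memset(cache, 0xff, sizeof(cache));	/* all slots empty */

	for (r = 0; r < rounds; r++) {
		for (i = 0; i < nfaulters; i++) {
			unsigned long gfn = (unsigned long)i;

			s = (int)(gfn % SLOTS);
			if (cache[s] == gfn)
				return true;	/* hit: forward progress */
			cache[s] = gfn;		/* miss: record error gfn */
		}
	}
	return false;
}
```

Only the fully adversarial 128-faulter pattern starves the cache; even 65 faulters (one contested slot) still make progress through the 63 uncontested slots.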

> Maybe we can punch a hole in EPT/NPT making the PFN reserved/
> not-present so when the guest tries to access it again we trap the
> access in KVM and, if the error persists, don't follow the APF path?

The cache solution seems simpler than this. Trying to maintain any state
in page tables will invariably be more complex (especially given the
many flavors of paging).

I can start looking in this direction if you really think it is worth
implementing a page-table-based solution for this problem. I feel we
should implement something simpler for now and, if there turn out to be
easy ways to skip the error gfn cache, then replace it with a
page-table-based solution (this will only require a hypervisor change
and no guest changes).

Vivek
Vitaly Kuznetsov July 1, 2020, 8:06 a.m. UTC | #14
Vivek Goyal <vgoyal@redhat.com> writes:

> On Tue, Jun 30, 2020 at 05:43:54PM +0200, Vitaly Kuznetsov wrote:
>> Vivek Goyal <vgoyal@redhat.com> writes:
>> 
>> > On Tue, Jun 30, 2020 at 05:13:54PM +0200, Vitaly Kuznetsov wrote:
>> >> 
>> >> > - If you retry in kernel, we will change the context completely that
>> >> >   who was trying to access the gfn in question. We want to retain
>> >> >   the real context and retain information who was trying to access
>> >> >   gfn in question.
>> >> 
>> >> (Just so I understand the idea better) does the guest context matter to
>> >> the host? Or, more specifically, are we going to do anything besides
>> >> get_user_pages() which will actually analyze who triggered the access
>> >> *in the guest*?
>> >
>> > When we exit to user space, qemu prints bunch of register state. I am
>> > wondering what does that state represent. Does some of that traces
>> > back to the process which was trying to access that hva? I don't
>> > know.
>> 
>> We can get the full CPU state when the fault happens if we need to but
>> generally we are not analyzing it. I can imagine looking at CPL, for
>> example, but trying to distinguish guest's 'process A' from 'process B'
>> may not be simple.
>> 
>> >
>> > I think keeping a cache of error gfns might not be too bad from
>> > implemetation point of view. I will give it a try and see how
>> > bad does it look.
>> 
>> Right; I'm only worried about the fact that every cache (or hash) has a
>> limited size and under certain curcumstances we may overflow it. When an
>> overflow happens, we will follow the APF path again and this can go over
>> and over.
>
> Sure. But what are the chances of that happening. Say our cache size is
> 64. That means we need atleast 128 processes to do co-ordinated faults
> (all in error zone) to skip the cache completely all the time. We
> have to hit cache only once. Chances of missing the error gnf
> cache completely for a very long time are very slim. And if we miss
> it few times, now harm done. We will just spin few times and then
> exit to qemu.
>
> IOW, chances of spinning infinitely are not zero. But they look so
> small that in practice I am not worried about it.
>
>> Maybe we can punch a hole in EPT/NPT making the PFN reserved/
>> not-present so when the guest tries to access it again we trap the
>> access in KVM and, if the error persists, don't follow the APF path?
>
> The cache solution seems simpler than this. Trying to maintain any state
> in the page tables will invariably be more complex (especially given the
> many flavors of paging).
>
> I can start looking in this direction if you really think it's worth
> implementing a page-table-based solution for this problem. I feel we
> should implement something simpler for now and, if there turn out to be
> easy ways to skip the error gfn cache, replace it with a page-table-based
> solution later (this will only require a hypervisor change and no guest
> changes).

I think we're fine with an interim cache/hash solution as long as the
chances of accidentally getting into an infinite loop are very slim
and we don't have any specific latency requirements. Feel free to forget
about the PT suggestion for now; we can certainly make it 'step 2' (IMO).
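
For reference, the fixed-size error-gfn cache discussed above could be
sketched as below. This is a standalone userspace illustration with
invented names, not part of the patch that follows (the patch keeps only
a single error_gfn); it just shows the direct-mapped lookup/evict
behavior whose overflow properties the thread is debating.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

typedef uint64_t gfn_t;

/*
 * Small direct-mapped "error gfn" cache, similar in spirit to KVM's
 * async_pf gfn hash.  A collision simply evicts the older entry, so an
 * "overflow" only means the evicted gfn may be retried a few more times
 * before it is cached again.
 */
#define ERROR_GFN_CACHE_SIZE	64	/* must be a power of two */

struct error_gfn_cache {
	gfn_t slots[ERROR_GFN_CACHE_SIZE];
};

static void error_gfn_cache_init(struct error_gfn_cache *c)
{
	/* 0 is a valid gfn, so fill slots with an all-ones sentinel. */
	memset(c->slots, 0xff, sizeof(c->slots));
}

static uint32_t error_gfn_hash(gfn_t gfn)
{
	/* Cheap multiplicative hash, reduced to the table size. */
	return (uint32_t)((gfn * 0x9E3779B97F4A7C15ULL) >> 32) &
	       (ERROR_GFN_CACHE_SIZE - 1);
}

static void record_error_gfn(struct error_gfn_cache *c, gfn_t gfn)
{
	/* A collision evicts the older entry. */
	c->slots[error_gfn_hash(gfn)] = gfn;
}

/* Analogue of the error_gfn check added to kvm_can_do_async_pf(). */
static bool is_error_gfn(struct error_gfn_cache *c, gfn_t gfn)
{
	return c->slots[error_gfn_hash(gfn)] == gfn;
}
```

With this shape, kvm_arch_async_page_ready() would call
record_error_gfn() on failure, and kvm_can_do_async_pf() would refuse
the async path when is_error_gfn() hits, forcing a synchronous fault
that can exit to user space.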
Patch

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index be5363b21540..3c0677b9d3d5 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -778,6 +778,7 @@  struct kvm_vcpu_arch {
 		unsigned long nested_apf_token;
 		bool delivery_as_pf_vmexit;
 		bool pageready_pending;
+		gfn_t error_gfn;
 	} apf;
 
 	/* OSVW MSRs (AMD only) */
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index 444bb9c54548..d0a2a12c7bb6 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -60,7 +60,7 @@  void kvm_init_mmu(struct kvm_vcpu *vcpu, bool reset_roots);
 void kvm_init_shadow_mmu(struct kvm_vcpu *vcpu, u32 cr0, u32 cr4, u32 efer);
 void kvm_init_shadow_ept_mmu(struct kvm_vcpu *vcpu, bool execonly,
 			     bool accessed_dirty, gpa_t new_eptp);
-bool kvm_can_do_async_pf(struct kvm_vcpu *vcpu);
+bool kvm_can_do_async_pf(struct kvm_vcpu *vcpu, gfn_t gfn);
 int kvm_handle_page_fault(struct kvm_vcpu *vcpu, u64 error_code,
 				u64 fault_address, char *insn, int insn_len);
 
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 76817d13c86e..a882a6a9f7a7 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4078,7 +4078,7 @@  static bool try_async_pf(struct kvm_vcpu *vcpu, bool prefault, gfn_t gfn,
 	if (!async)
 		return false; /* *pfn has correct page already */
 
-	if (!prefault && kvm_can_do_async_pf(vcpu)) {
+	if (!prefault && kvm_can_do_async_pf(vcpu, cr2_or_gpa >> PAGE_SHIFT)) {
 		trace_kvm_try_async_get_page(cr2_or_gpa, gfn);
 		if (kvm_find_async_pf_gfn(vcpu, gfn)) {
 			trace_kvm_async_pf_doublefault(cr2_or_gpa, gfn);
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 3b92db412335..a6af7e9831b9 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -10380,7 +10380,9 @@  void kvm_arch_async_page_ready(struct kvm_vcpu *vcpu, struct kvm_async_pf *work)
 	      work->arch.cr3 != vcpu->arch.mmu->get_guest_pgd(vcpu))
 		return;
 
-	kvm_mmu_do_page_fault(vcpu, work->cr2_or_gpa, 0, true);
+	r = kvm_mmu_do_page_fault(vcpu, work->cr2_or_gpa, 0, true);
+	if (r < 0)
+		vcpu->arch.apf.error_gfn = work->cr2_or_gpa >> PAGE_SHIFT;
 }
 
 static inline u32 kvm_async_pf_hash_fn(gfn_t gfn)
@@ -10490,7 +10492,7 @@  static bool kvm_can_deliver_async_pf(struct kvm_vcpu *vcpu)
 	return true;
 }
 
-bool kvm_can_do_async_pf(struct kvm_vcpu *vcpu)
+bool kvm_can_do_async_pf(struct kvm_vcpu *vcpu, gfn_t gfn)
 {
 	if (unlikely(!lapic_in_kernel(vcpu) ||
 		     kvm_event_needs_reinjection(vcpu) ||
@@ -10504,7 +10506,13 @@  bool kvm_can_do_async_pf(struct kvm_vcpu *vcpu)
 	 * If interrupts are off we cannot even use an artificial
 	 * halt state.
 	 */
-	return kvm_arch_interrupt_allowed(vcpu);
+	if (!kvm_arch_interrupt_allowed(vcpu))
+		return false;
+
+	if (vcpu->arch.apf.error_gfn == gfn)
+		return false;
+
+	return true;
 }
 
 bool kvm_arch_async_page_not_present(struct kvm_vcpu *vcpu,