[v2,1/4] KVM: x86: move the event handling of KVM_REQ_GET_VMCS12_PAGES into a common function

Message ID 20220828222544.1964917-2-mizhang@google.com (mailing list archive)
State New, archived
Series: Fix a race between posted interrupt delivery and migration in a nested VM

Commit Message

Mingwei Zhang Aug. 28, 2022, 10:25 p.m. UTC
Create a common function to handle kvm request in the vcpu_run loop. KVM
implicitly assumes the virtual APIC page being present + mapped into the
kernel address space when executing vmx_guest_apic_has_interrupts().
However, with demand paging KVM breaks the assumption, as the
KVM_REQ_GET_VMCS12_PAGES event isn't assessed before entering vcpu_block.

Fix this by getting vmcs12 pages before inspecting the guest's APIC page.
Because of this fix, the event handling code of
KVM_REQ_GET_NESTED_STATE_PAGES becomes a common code path for both
vcpu_enter_guest() and vcpu_block(). Thus, put this code snippet into a
common helper function to avoid code duplication.

Cc: Maxim Levitsky <mlevitsk@redhat.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Originally-by: Oliver Upton <oupton@google.com>
Signed-off-by: Oliver Upton <oupton@google.com>
Signed-off-by: Mingwei Zhang <mizhang@google.com>
---
 arch/x86/kvm/x86.c | 29 +++++++++++++++++++++++------
 1 file changed, 23 insertions(+), 6 deletions(-)
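
For context, the assumption described above is about the virtual-APIC mapping that
vmx_guest_apic_has_interrupt() relies on. A rough sketch of that helper, paraphrased
from arch/x86/kvm/vmx/vmx.c as of roughly this kernel vintage (details may differ):

	static bool vmx_guest_apic_has_interrupt(struct kvm_vcpu *vcpu)
	{
		struct vcpu_vmx *vmx = to_vmx(vcpu);
		void *vapic_page;
		u32 vppr;
		int rvi;

		/* Bails (and WARNs) if the virtual-APIC page isn't mapped. */
		if (WARN_ON_ONCE(!is_guest_mode(vcpu)) ||
		    !nested_cpu_has_vid(get_vmcs12(vcpu)) ||
		    WARN_ON_ONCE(!vmx->nested.virtual_apic_map.gfn))
			return false;

		rvi = vmx_get_rvi();

		/* Reads the guest's PPR straight out of the mapped vAPIC page. */
		vapic_page = vmx->nested.virtual_apic_map.hva;
		vppr = *((u32 *)(vapic_page + APIC_PROCPRI));

		return ((rvi & 0xf0) > (vppr & 0xf0));
	}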

Comments

Sean Christopherson Aug. 29, 2022, 4:09 p.m. UTC | #1
On Sun, Aug 28, 2022, Mingwei Zhang wrote:
> Create a common function to handle kvm request in the vcpu_run loop. KVM
> implicitly assumes the virtual APIC page being present + mapped into the
> kernel address space when executing vmx_guest_apic_has_interrupts().
> However, with demand paging KVM breaks the assumption, as the
> KVM_REQ_GET_VMCS12_PAGES event isn't assessed before entering vcpu_block.

KVM_REQ_GET_VMCS12_PAGES doesn't exist upstream.

> Fix this by getting vmcs12 pages before inspecting the guest's APIC page.
> Because of this fix, the event handling code of
> KVM_REQ_GET_NESTED_STATE_PAGES becomes a common code path for both
> vcpu_enter_guest() and vcpu_block(). Thus, put this code snippet into a
> common helper function to avoid code duplication.
> 
> Cc: Maxim Levitsky <mlevitsk@redhat.com>
> Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
> Originally-by: Oliver Upton <oupton@google.com>
> Signed-off-by: Oliver Upton <oupton@google.com>

If you drop someone as author, then their SOB also needs to be jettisoned.

> Signed-off-by: Mingwei Zhang <mizhang@google.com>
> ---
>  arch/x86/kvm/x86.c | 29 +++++++++++++++++++++++------
>  1 file changed, 23 insertions(+), 6 deletions(-)
> 
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index d7374d768296..3dcaac8f0584 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -10261,12 +10261,6 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
>  			r = -EIO;
>  			goto out;
>  		}
> -		if (kvm_check_request(KVM_REQ_GET_NESTED_STATE_PAGES, vcpu)) {
> -			if (unlikely(!kvm_x86_ops.nested_ops->get_nested_state_pages(vcpu))) {
> -				r = 0;
> -				goto out;
> -			}
> -		}
>  		if (kvm_check_request(KVM_REQ_MMU_FREE_OBSOLETE_ROOTS, vcpu))
>  			kvm_mmu_free_obsolete_roots(vcpu);
>  		if (kvm_check_request(KVM_REQ_MIGRATE_TIMER, vcpu))
> @@ -10666,6 +10660,23 @@ static inline bool kvm_vcpu_running(struct kvm_vcpu *vcpu)
>  		!vcpu->arch.apf.halted);
>  }
>  
> +static int kvm_vcpu_handle_common_requests(struct kvm_vcpu *vcpu)
> +{
> +	if (kvm_request_pending(vcpu)) {

Probably going to be a moot point, but write this as

	if (!kvm_request_pending(vcpu))
		return 1;

to reduce indentation.

> +		/*
> +		 * Get the vmcs12 pages before checking for interrupts that
> +		 * might unblock the guest if L1 is using virtual-interrupt
> +		 * delivery.
> +		 */
> +		if (kvm_check_request(KVM_REQ_GET_NESTED_STATE_PAGES, vcpu)) {
> +			if (unlikely(!kvm_x86_ops.nested_ops->get_nested_state_pages(vcpu)))

Similarly

	if (kvm_check_request(KVM_REQ_GET_NESTED_STATE_PAGES, vcpu) &&
	    unlikely(!kvm_x86_ops.nested_ops->get_nested_state_pages(vcpu)))
		return 0;

though I can see the argument for fully isolating each request.  But again, likely
a moot point.

> +				return 0;
> +		}
> +	}
> +
> +	return 1;
> +}
> +
>  /* Called within kvm->srcu read side.  */
>  static int vcpu_run(struct kvm_vcpu *vcpu)
>  {
> @@ -10681,6 +10692,12 @@ static int vcpu_run(struct kvm_vcpu *vcpu)
>  		 * this point can start executing an instruction.
>  		 */
>  		vcpu->arch.at_instruction_boundary = false;
> +
> +		/* Process common request regardless of vcpu state. */
> +		r = kvm_vcpu_handle_common_requests(vcpu);

IMO this is subtly a dangerous hook.  It implies that both vcpu_enter_guest()
and vcpu_block() correctly handle requests becoming pending after the "common"
check, but that's not actually the case.  If a request _needs_ to be handled
before vcpu_block(), then ideally it should be explicitly queried in
kvm_vcpu_check_block().  KVM_REQ_GET_NESTED_STATE_PAGES doesn't have issues because
it's only ever set from the vCPU itself.

Following that train of thought, KVM_REQ_GET_NESTED_STATE_PAGES really shouldn't
even be a request.  Aha!  And we can do that in a way that would magically fix this
bug, and would ensure we don't leave a trap for future us.

KVM already provides KVM_REQ_UNBLOCK to prevent blocking the vCPU without actually
waking the vCPU, i.e. to kick the vCPU back into the vcpu_run() loop.  The request
is provided specifically for scenarios like this where KVM needs to do work before
blocking.
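
For reference, KVM_REQ_UNBLOCK is consumed on the blocking path; a simplified
sketch of kvm_vcpu_check_block() (SRCU handling and locking omitted, details may
differ):

	static int kvm_vcpu_check_block(struct kvm_vcpu *vcpu)
	{
		if (kvm_arch_vcpu_runnable(vcpu))
			return -EINTR;
		if (kvm_cpu_has_pending_timer(vcpu))
			return -EINTR;
		if (signal_pending(current))
			return -EINTR;
		/* A pending KVM_REQ_UNBLOCK also aborts blocking. */
		if (kvm_check_request(KVM_REQ_UNBLOCK, vcpu))
			return -EINTR;
		return 0;
	}

A nonzero return aborts the block and drops the vCPU back into the vcpu_run() loop.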

Normally I'd say we should do this over multiple patches so that the "blocking"
bug is fixed before doing the rework/cleanup, but I'm ok if we want to skip straight
to the rework since we're obviously carrying an internal patch and no one else is
likely to need the fix.  But I also wouldn't object to including an intermediate
patch to fix the bug so that there's a better paper trail.

E.g. as a very partial conversion:

---
 arch/x86/include/asm/kvm_host.h |  2 ++
 arch/x86/kvm/vmx/nested.c       |  2 +-
 arch/x86/kvm/x86.c              | 12 ++++++++++++
 arch/x86/kvm/x86.h              | 10 ++++++++++
 4 files changed, 25 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 9345303c8c6d..bfca37419783 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -939,6 +939,8 @@ struct kvm_vcpu_arch {
 	 */
 	bool pdptrs_from_userspace;

+	bool nested_get_pages_pending;
+
 #if IS_ENABLED(CONFIG_HYPERV)
 	hpa_t hv_root_tdp;
 #endif
diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index ddd4367d4826..e83b145c3a35 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -3446,7 +3446,7 @@ enum nvmx_vmentry_status nested_vmx_enter_non_root_mode(struct kvm_vcpu *vcpu,
 		 * to nested_get_vmcs12_pages before the next VM-entry.  The MSRs
 		 * have already been set at vmentry time and should not be reset.
 		 */
-		kvm_make_request(KVM_REQ_GET_NESTED_STATE_PAGES, vcpu);
+		kvm_nested_get_pages_set_pending(vcpu);
 	}

 	/*
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index c0e3e7915a3a..0a7601ebffc6 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -9650,6 +9650,12 @@ int kvm_check_nested_events(struct kvm_vcpu *vcpu)
 	return kvm_x86_ops.nested_ops->check_events(vcpu);
 }

+static int kvm_get_nested_state_pages(struct kvm_vcpu *vcpu)
+{
+	vcpu->arch.nested_get_pages_pending = false;
+	return kvm_x86_ops.nested_ops->get_nested_state_pages(vcpu);
+}
+
 static void kvm_inject_exception(struct kvm_vcpu *vcpu)
 {
 	trace_kvm_inj_exception(vcpu->arch.exception.nr,
@@ -10700,6 +10706,12 @@ static int vcpu_run(struct kvm_vcpu *vcpu)
 		if (kvm_cpu_has_pending_timer(vcpu))
 			kvm_inject_pending_timer_irqs(vcpu);

+		if (vcpu->arch.nested_get_pages_pending) {
+			r = kvm_get_nested_state_pages(vcpu);
+			if (r <= 0)
+				break;
+		}
+
 		if (dm_request_for_irq_injection(vcpu) &&
 			kvm_vcpu_ready_for_interrupt_injection(vcpu)) {
 			r = 0;
diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
index 1926d2cb8e79..e35aac39dc73 100644
--- a/arch/x86/kvm/x86.h
+++ b/arch/x86/kvm/x86.h
@@ -481,4 +481,14 @@ int kvm_sev_es_string_io(struct kvm_vcpu *vcpu, unsigned int size,
 			 unsigned int port, void *data,  unsigned int count,
 			 int in);

+static inline void kvm_nested_get_pages_set_pending(struct kvm_vcpu *vcpu)
+{
+	/*
+	 * Here is a comment explaining why KVM needs to prevent the vCPU from
+	 * blocking until the vCPU's nested pages have been loaded.
+	 */
+	vcpu->arch.nested_get_pages_pending = true;
+	kvm_make_request(KVM_REQ_UNBLOCK, vcpu);
+}
+
 #endif

base-commit: 14a47a98151834c5bd2f6d8d592b01108a3f882a
--
Mingwei Zhang Sept. 7, 2022, 12:01 a.m. UTC | #2
On Mon, Aug 29, 2022, Sean Christopherson wrote:
> On Sun, Aug 28, 2022, Mingwei Zhang wrote:
> > Create a common function to handle kvm request in the vcpu_run loop. KVM
> > implicitly assumes the virtual APIC page being present + mapped into the
> > kernel address space when executing vmx_guest_apic_has_interrupts().
> > However, with demand paging KVM breaks the assumption, as the
> > KVM_REQ_GET_VMCS12_PAGES event isn't assessed before entering vcpu_block.
> 
> KVM_REQ_GET_VMCS12_PAGES doesn't exist upstream.

ack.
> 
> > Fix this by getting vmcs12 pages before inspecting the guest's APIC page.
> > Because of this fix, the event handling code of
> > KVM_REQ_GET_NESTED_STATE_PAGES becomes a common code path for both
> > vcpu_enter_guest() and vcpu_block(). Thus, put this code snippet into a
> > common helper function to avoid code duplication.
> > 
> > Cc: Maxim Levitsky <mlevitsk@redhat.com>
> > Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
> > Originally-by: Oliver Upton <oupton@google.com>
> > Signed-off-by: Oliver Upton <oupton@google.com>
> 
> If you drop someone as author, then their SOB also needs to be jettisoned.
> 

ack.

> > Signed-off-by: Mingwei Zhang <mizhang@google.com>
> > ---
> >  arch/x86/kvm/x86.c | 29 +++++++++++++++++++++++------
> >  1 file changed, 23 insertions(+), 6 deletions(-)
> > 
> > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > index d7374d768296..3dcaac8f0584 100644
> > --- a/arch/x86/kvm/x86.c
> > +++ b/arch/x86/kvm/x86.c
> > @@ -10261,12 +10261,6 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
> >  			r = -EIO;
> >  			goto out;
> >  		}
> > -		if (kvm_check_request(KVM_REQ_GET_NESTED_STATE_PAGES, vcpu)) {
> > -			if (unlikely(!kvm_x86_ops.nested_ops->get_nested_state_pages(vcpu))) {
> > -				r = 0;
> > -				goto out;
> > -			}
> > -		}
> >  		if (kvm_check_request(KVM_REQ_MMU_FREE_OBSOLETE_ROOTS, vcpu))
> >  			kvm_mmu_free_obsolete_roots(vcpu);
> >  		if (kvm_check_request(KVM_REQ_MIGRATE_TIMER, vcpu))
> > @@ -10666,6 +10660,23 @@ static inline bool kvm_vcpu_running(struct kvm_vcpu *vcpu)
> >  		!vcpu->arch.apf.halted);
> >  }
> >  
> > +static int kvm_vcpu_handle_common_requests(struct kvm_vcpu *vcpu)
> > +{
> > +	if (kvm_request_pending(vcpu)) {
> 
> Probably going to be a moot point, but write this as
> 
> 	if (!kvm_request_pending(vcpu))
> 		return 1;
> 
> to reduce indentation.
> 
> > +		/*
> > +		 * Get the vmcs12 pages before checking for interrupts that
> > +		 * might unblock the guest if L1 is using virtual-interrupt
> > +		 * delivery.
> > +		 */
> > +		if (kvm_check_request(KVM_REQ_GET_NESTED_STATE_PAGES, vcpu)) {
> > +			if (unlikely(!kvm_x86_ops.nested_ops->get_nested_state_pages(vcpu)))
> 
> Similarly
> 
> 	if (kvm_check_request(KVM_REQ_GET_NESTED_STATE_PAGES, vcpu) &&
> 	    unlikely(!kvm_x86_ops.nested_ops->get_nested_state_pages(vcpu)))
> 		return 0;
> 
> though I can see the argument for fully isolating each request.  But again, likely
> a moot point.
> 
> > +				return 0;
> > +		}
> > +	}
> > +
> > +	return 1;
> > +}
> > +
> >  /* Called within kvm->srcu read side.  */
> >  static int vcpu_run(struct kvm_vcpu *vcpu)
> >  {
> > @@ -10681,6 +10692,12 @@ static int vcpu_run(struct kvm_vcpu *vcpu)
> >  		 * this point can start executing an instruction.
> >  		 */
> >  		vcpu->arch.at_instruction_boundary = false;
> > +
> > +		/* Process common request regardless of vcpu state. */
> > +		r = kvm_vcpu_handle_common_requests(vcpu);
> 
> IMO this is subtly a dangerous hook.  It implies that both vcpu_enter_guest()
> and vcpu_block() correctly handle requests becoming pending after the "common"
> check, but that's not actually the case.  If a request _needs_ to be handled
> before vcpu_block(), then ideally it should be explicitly queried in
> kvm_vcpu_check_block().  KVM_REQ_GET_NESTED_STATE_PAGES doesn't have issues because
> it's only ever set from the vCPU itself.
> 
> Following that train of thought, KVM_REQ_GET_NESTED_STATE_PAGES really shouldn't
> even be a request.  Aha!  And we can do that in a way that would magically fix this
> bug, and would ensure we don't leave a trap for future us.
> 
> KVM already provides KVM_REQ_UNBLOCK to prevent blocking the vCPU without actaully
> waking the vCPU, i.e. to kick the vCPU back into the vcpu_run() loop.  The request
> is provided specifically for scenarios like this where KVM needs to do work before
> blocking.
> 

Hmm, I think this won't work. The warning is hit on this call path
(although the dynamic trace does not show the full stack trace in the
source code):

WARN_ON_ONCE(!vmx->nested.virtual_apic_map.gfn)
vmx_guest_apic_has_interrupt()
kvm_guest_apic_has_interrupt()
kvm_vcpu_has_events()
kvm_arch_vcpu_runnable()
kvm_vcpu_check_block()

If you go to kvm_vcpu_check_block(), the check of KVM_REQ_UNBLOCK comes
after the check of kvm_arch_vcpu_runnable(). So, with the diff you
pointed out, we will still see the warning.

Maybe what we can do is reorder the kvm_check_request(KVM_REQ_UNBLOCK,
vcpu) check to the beginning of kvm_vcpu_check_block()? But I am not
sure.
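
Purely as a sketch of that idea (hypothetical, not tested, and using the same
simplified form of kvm_vcpu_check_block() as above), the reorder would look
something like:

	static int kvm_vcpu_check_block(struct kvm_vcpu *vcpu)
	{
		/* Hypothetical: service KVM_REQ_UNBLOCK before querying runnable state. */
		if (kvm_check_request(KVM_REQ_UNBLOCK, vcpu))
			return -EINTR;
		if (kvm_arch_vcpu_runnable(vcpu))
			return -EINTR;
		if (kvm_cpu_has_pending_timer(vcpu))
			return -EINTR;
		if (signal_pending(current))
			return -EINTR;
		return 0;
	}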

Thanks.
-Mingwei
> Normally I'd say we should do this over multiple patches so that the "blocking"
> bug is fixed before doing the rework/cleanup, but I'm ok if we want to skip straight
> to the rework since we're obviously carrying an internal patch and no one else is
> likely to need the fix.  But I also wouldn't object to including an intermediate
> patch to fix the bug so that there's a better paper trail.
> 
> E.g. as a very partial conversion:
> 
> ---
>  arch/x86/include/asm/kvm_host.h |  2 ++
>  arch/x86/kvm/vmx/nested.c       |  2 +-
>  arch/x86/kvm/x86.c              | 12 ++++++++++++
>  arch/x86/kvm/x86.h              | 10 ++++++++++
>  4 files changed, 25 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 9345303c8c6d..bfca37419783 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -939,6 +939,8 @@ struct kvm_vcpu_arch {
>  	 */
>  	bool pdptrs_from_userspace;
> 
> +	bool nested_get_pages_pending;
> +
>  #if IS_ENABLED(CONFIG_HYPERV)
>  	hpa_t hv_root_tdp;
>  #endif
> diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
> index ddd4367d4826..e83b145c3a35 100644
> --- a/arch/x86/kvm/vmx/nested.c
> +++ b/arch/x86/kvm/vmx/nested.c
> @@ -3446,7 +3446,7 @@ enum nvmx_vmentry_status nested_vmx_enter_non_root_mode(struct kvm_vcpu *vcpu,
>  		 * to nested_get_vmcs12_pages before the next VM-entry.  The MSRs
>  		 * have already been set at vmentry time and should not be reset.
>  		 */
> -		kvm_make_request(KVM_REQ_GET_NESTED_STATE_PAGES, vcpu);
> +		kvm_nested_get_pages_set_pending(vcpu);
>  	}
> 
>  	/*
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index c0e3e7915a3a..0a7601ebffc6 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -9650,6 +9650,12 @@ int kvm_check_nested_events(struct kvm_vcpu *vcpu)
>  	return kvm_x86_ops.nested_ops->check_events(vcpu);
>  }
> 
> +static int kvm_get_nested_state_pages(struct kvm_vcpu *vcpu)
> +{
> +	vcpu->arch.nested_get_pages_pending = false;
> +	return kvm_x86_ops.nested_ops->get_nested_state_pages(vcpu);
> +}
> +
>  static void kvm_inject_exception(struct kvm_vcpu *vcpu)
>  {
>  	trace_kvm_inj_exception(vcpu->arch.exception.nr,
> @@ -10700,6 +10706,12 @@ static int vcpu_run(struct kvm_vcpu *vcpu)
>  		if (kvm_cpu_has_pending_timer(vcpu))
>  			kvm_inject_pending_timer_irqs(vcpu);
> 
> +		if (vcpu->arch.nested_get_pages_pending) {
> +			r = kvm_get_nested_state_pages(vcpu);
> +			if (r <= 0)
> +				break;
> +		}
> +
>  		if (dm_request_for_irq_injection(vcpu) &&
>  			kvm_vcpu_ready_for_interrupt_injection(vcpu)) {
>  			r = 0;
> diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
> index 1926d2cb8e79..e35aac39dc73 100644
> --- a/arch/x86/kvm/x86.h
> +++ b/arch/x86/kvm/x86.h
> @@ -481,4 +481,14 @@ int kvm_sev_es_string_io(struct kvm_vcpu *vcpu, unsigned int size,
>  			 unsigned int port, void *data,  unsigned int count,
>  			 int in);
> 
> +static inline void kvm_nested_get_pages_set_pending(struct kvm_vcpu *vcpu)
> +{
> +	/*
> +	 * Here is a comment explaining why KVM needs to prevent the vCPU from
> +	 * blocking until the vCPU's nested pages have been loaded.
> +	 */
> +	vcpu->arch.nested_get_pages_pending = true;
> +	kvm_make_request(KVM_REQ_UNBLOCK, vcpu);
> +}
> +
>  #endif
> 
> base-commit: 14a47a98151834c5bd2f6d8d592b01108a3f882a
> --
Yuan Yao Sept. 7, 2022, 2:50 a.m. UTC | #3
On Mon, Aug 29, 2022 at 04:09:33PM +0000, Sean Christopherson wrote:
> On Sun, Aug 28, 2022, Mingwei Zhang wrote:
> > Create a common function to handle kvm request in the vcpu_run loop. KVM
> > implicitly assumes the virtual APIC page being present + mapped into the
> > kernel address space when executing vmx_guest_apic_has_interrupts().
> > However, with demand paging KVM breaks the assumption, as the
> > KVM_REQ_GET_VMCS12_PAGES event isn't assessed before entering vcpu_block.
>
> KVM_REQ_GET_VMCS12_PAGES doesn't exist upstream.
>
> > Fix this by getting vmcs12 pages before inspecting the guest's APIC page.
> > Because of this fix, the event handling code of
> > KVM_REQ_GET_NESTED_STATE_PAGES becomes a common code path for both
> > vcpu_enter_guest() and vcpu_block(). Thus, put this code snippet into a
> > common helper function to avoid code duplication.
> >
> > Cc: Maxim Levitsky <mlevitsk@redhat.com>
> > Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
> > Originally-by: Oliver Upton <oupton@google.com>
> > Signed-off-by: Oliver Upton <oupton@google.com>
>
> If you drop someone as author, then their SOB also needs to be jettisoned.
>
> > Signed-off-by: Mingwei Zhang <mizhang@google.com>
> > ---
> >  arch/x86/kvm/x86.c | 29 +++++++++++++++++++++++------
> >  1 file changed, 23 insertions(+), 6 deletions(-)
> >
> > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > index d7374d768296..3dcaac8f0584 100644
> > --- a/arch/x86/kvm/x86.c
> > +++ b/arch/x86/kvm/x86.c
> > @@ -10261,12 +10261,6 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
> >  			r = -EIO;
> >  			goto out;
> >  		}
> > -		if (kvm_check_request(KVM_REQ_GET_NESTED_STATE_PAGES, vcpu)) {
> > -			if (unlikely(!kvm_x86_ops.nested_ops->get_nested_state_pages(vcpu))) {
> > -				r = 0;
> > -				goto out;
> > -			}
> > -		}
> >  		if (kvm_check_request(KVM_REQ_MMU_FREE_OBSOLETE_ROOTS, vcpu))
> >  			kvm_mmu_free_obsolete_roots(vcpu);
> >  		if (kvm_check_request(KVM_REQ_MIGRATE_TIMER, vcpu))
> > @@ -10666,6 +10660,23 @@ static inline bool kvm_vcpu_running(struct kvm_vcpu *vcpu)
> >  		!vcpu->arch.apf.halted);
> >  }
> >
> > +static int kvm_vcpu_handle_common_requests(struct kvm_vcpu *vcpu)
> > +{
> > +	if (kvm_request_pending(vcpu)) {
>
> Probably going to be a moot point, but write this as
>
> 	if (!kvm_request_pending(vcpu))
> 		return 1;
>
> to reduce indentation.
>
> > +		/*
> > +		 * Get the vmcs12 pages before checking for interrupts that
> > +		 * might unblock the guest if L1 is using virtual-interrupt
> > +		 * delivery.
> > +		 */
> > +		if (kvm_check_request(KVM_REQ_GET_NESTED_STATE_PAGES, vcpu)) {
> > +			if (unlikely(!kvm_x86_ops.nested_ops->get_nested_state_pages(vcpu)))
>
> Similarly
>
> 	if (kvm_check_request(KVM_REQ_GET_NESTED_STATE_PAGES, vcpu) &&
> 	    unlikely(!kvm_x86_ops.nested_ops->get_nested_state_pages(vcpu)))
> 		return 0;
>
> though I can see the argument for fully isolating each request.  But again, likely
> a moot point.
>
> > +				return 0;
> > +		}
> > +	}
> > +
> > +	return 1;
> > +}
> > +
> >  /* Called within kvm->srcu read side.  */
> >  static int vcpu_run(struct kvm_vcpu *vcpu)
> >  {
> > @@ -10681,6 +10692,12 @@ static int vcpu_run(struct kvm_vcpu *vcpu)
> >  		 * this point can start executing an instruction.
> >  		 */
> >  		vcpu->arch.at_instruction_boundary = false;
> > +
> > +		/* Process common request regardless of vcpu state. */
> > +		r = kvm_vcpu_handle_common_requests(vcpu);
>
> IMO this is subtly a dangerous hook.  It implies that both vcpu_enter_guest()
> and vcpu_block() correctly handle requests becoming pending after the "common"
> check, but that's not actually the case.  If a request _needs_ to be handled
> before vcpu_block(), then ideally it should be explicitly queried in
> kvm_vcpu_check_block().  KVM_REQ_GET_NESTED_STATE_PAGES doesn't have issues because
> it's only ever set from the vCPU itself.
>
> Following that train of thought, KVM_REQ_GET_NESTED_STATE_PAGES really shouldn't
> even be a request.  Aha!  And we can do that in a way that would magically fix this
> bug, and would ensure we don't leave a trap for future us.
>
> KVM already provides KVM_REQ_UNBLOCK to prevent blocking the vCPU without actaully
> waking the vCPU, i.e. to kick the vCPU back into the vcpu_run() loop.  The request
> is provided specifically for scenarios like this where KVM needs to do work before
> blocking.
>
> Normally I'd say we should do this over multiple patches so that the "blocking"
> bug is fixed before doing the rework/cleanup, but I'm ok if we want to skip straight
> to the rework since we're obviously carrying an internal patch and no one else is
> likely to need the fix.  But I also wouldn't object to including an intermediate
> patch to fix the bug so that there's a better paper trail.
>
> E.g. as a very partial conversion:
>
> ---
>  arch/x86/include/asm/kvm_host.h |  2 ++
>  arch/x86/kvm/vmx/nested.c       |  2 +-
>  arch/x86/kvm/x86.c              | 12 ++++++++++++
>  arch/x86/kvm/x86.h              | 10 ++++++++++
>  4 files changed, 25 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 9345303c8c6d..bfca37419783 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -939,6 +939,8 @@ struct kvm_vcpu_arch {
>  	 */
>  	bool pdptrs_from_userspace;
>
> +	bool nested_get_pages_pending;
> +
>  #if IS_ENABLED(CONFIG_HYPERV)
>  	hpa_t hv_root_tdp;
>  #endif
> diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
> index ddd4367d4826..e83b145c3a35 100644
> --- a/arch/x86/kvm/vmx/nested.c
> +++ b/arch/x86/kvm/vmx/nested.c
> @@ -3446,7 +3446,7 @@ enum nvmx_vmentry_status nested_vmx_enter_non_root_mode(struct kvm_vcpu *vcpu,
>  		 * to nested_get_vmcs12_pages before the next VM-entry.  The MSRs
>  		 * have already been set at vmentry time and should not be reset.
>  		 */
> -		kvm_make_request(KVM_REQ_GET_NESTED_STATE_PAGES, vcpu);
> +		kvm_nested_get_pages_set_pending(vcpu);
>  	}
>
>  	/*
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index c0e3e7915a3a..0a7601ebffc6 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -9650,6 +9650,12 @@ int kvm_check_nested_events(struct kvm_vcpu *vcpu)
>  	return kvm_x86_ops.nested_ops->check_events(vcpu);
>  }
>
> +static int kvm_get_nested_state_pages(struct kvm_vcpu *vcpu)
> +{
> +	vcpu->arch.nested_get_pages_pending = false;
> +	return kvm_x86_ops.nested_ops->get_nested_state_pages(vcpu);
> +}
> +
>  static void kvm_inject_exception(struct kvm_vcpu *vcpu)
>  {
>  	trace_kvm_inj_exception(vcpu->arch.exception.nr,
> @@ -10700,6 +10706,12 @@ static int vcpu_run(struct kvm_vcpu *vcpu)
>  		if (kvm_cpu_has_pending_timer(vcpu))
>  			kvm_inject_pending_timer_irqs(vcpu);
>
> +		if (vcpu->arch.nested_get_pages_pending) {
> +			r = kvm_get_nested_state_pages(vcpu);
> +			if (r <= 0)
> +				break;
> +		}
> +

Will this lead to skipping get_nested_state_pages for the first L2
VM-entry in every L2 running iteration? Because with the above changes
KVM_REQ_GET_NESTED_STATE_PAGES is not set in
nested_vmx_enter_non_root_mode() and
vcpu->arch.nested_get_pages_pending is not checked in
vcpu_enter_guest().

>  		if (dm_request_for_irq_injection(vcpu) &&
>  			kvm_vcpu_ready_for_interrupt_injection(vcpu)) {
>  			r = 0;
> diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
> index 1926d2cb8e79..e35aac39dc73 100644
> --- a/arch/x86/kvm/x86.h
> +++ b/arch/x86/kvm/x86.h
> @@ -481,4 +481,14 @@ int kvm_sev_es_string_io(struct kvm_vcpu *vcpu, unsigned int size,
>  			 unsigned int port, void *data,  unsigned int count,
>  			 int in);
>
> +static inline void kvm_nested_get_pages_set_pending(struct kvm_vcpu *vcpu)
> +{
> +	/*
> +	 * Here is a comment explaining why KVM needs to prevent the vCPU from
> +	 * blocking until the vCPU's nested pages have been loaded.
> +	 */
> +	vcpu->arch.nested_get_pages_pending = true;
> +	kvm_make_request(KVM_REQ_UNBLOCK, vcpu);
> +}
> +
>  #endif
>
> base-commit: 14a47a98151834c5bd2f6d8d592b01108a3f882a
> --
Mingwei Zhang Sept. 7, 2022, 4:26 a.m. UTC | #4
> > @@ -10700,6 +10706,12 @@ static int vcpu_run(struct kvm_vcpu *vcpu)
> >               if (kvm_cpu_has_pending_timer(vcpu))
> >                       kvm_inject_pending_timer_irqs(vcpu);
> >
> > +             if (vcpu->arch.nested_get_pages_pending) {
> > +                     r = kvm_get_nested_state_pages(vcpu);
> > +                     if (r <= 0)
> > +                             break;
> > +             }
> > +
>
> Will this leads to skip the get_nested_state_pages for L2 first time
> vmentry in every L2 running iteration ? Because with above changes
> KVM_REQ_GET_NESTED_STATE_PAGES is not set in
> nested_vmx_enter_non_root_mode() and
> vcpu->arch.nested_get_pages_pending is not checked in
> vcpu_enter_guest().
>
Good catch. I think the diff won't work when vcpu is runnable. It only
tries to catch the vcpu block case. Even for the vcpu block case,  the
check of KVM_REQ_UNBLOCK is way too late. Ah, kvm_vcpu_check_block()
is called by kvm_vcpu_block() which is called by vcpu_block(). The
warning is triggered at the very beginning of vcpu_block(), i.e.,
within kvm_arch_vcpu_runnable(). So, please ignore the trace in my
previous email.

In addition, my minor push back for that is
vcpu->arch.nested_get_pages_pending seems to be another
KVM_REQ_GET_NESTED_STATE_PAGES.

Thanks.
-Mingwei


-Mingwei
Yuan Yao Sept. 7, 2022, 5:35 a.m. UTC | #5
On Tue, Sep 06, 2022 at 09:26:33PM -0700, Mingwei Zhang wrote:
> > > @@ -10700,6 +10706,12 @@ static int vcpu_run(struct kvm_vcpu *vcpu)
> > >               if (kvm_cpu_has_pending_timer(vcpu))
> > >                       kvm_inject_pending_timer_irqs(vcpu);
> > >
> > > +             if (vcpu->arch.nested_get_pages_pending) {
> > > +                     r = kvm_get_nested_state_pages(vcpu);
> > > +                     if (r <= 0)
> > > +                             break;
> > > +             }
> > > +
> >
> > Will this leads to skip the get_nested_state_pages for L2 first time
> > vmentry in every L2 running iteration ? Because with above changes
> > KVM_REQ_GET_NESTED_STATE_PAGES is not set in
> > nested_vmx_enter_non_root_mode() and
> > vcpu->arch.nested_get_pages_pending is not checked in
> > vcpu_enter_guest().
> >
> Good catch. I think the diff won't work when vcpu is runnable. It only
> tries to catch the vcpu block case. Even for the vcpu block case,  the
> check of KVM_REQ_UNBLOCK is way too late. Ah, kvm_vcpu_check_block()
> is called by kvm_vcpu_block() which is called by vcpu_block(). The
> warning is triggered at the very beginning of vcpu_block(), i.e.,
> within kvm_arch_vcpu_runnable(). So, please ignore the trace in my
> previous email.
>
> In addition, my minor push back for that is
> vcpu->arch.nested_get_pages_pending seems to be another
> KVM_REQ_GET_NESTED_STATE_PAGES.

Yeah, but at the concept level it's not a request bit living in
vcpu->requests that can be covered by e.g. kvm_request_pending().
It's necessary to check vcpu->arch.nested_get_pages_pending in
vcpu_enter_guest() if Sean's idea is to replace
KVM_REQ_GET_NESTED_STATE_PAGES with nested_get_pages_pending.

>
> Thanks.
> -Mingwei
>
>
> -Mingwei
Sean Christopherson Sept. 7, 2022, 3:48 p.m. UTC | #6
On Wed, Sep 07, 2022, Yuan Yao wrote:
> On Tue, Sep 06, 2022 at 09:26:33PM -0700, Mingwei Zhang wrote:
> > > > @@ -10700,6 +10706,12 @@ static int vcpu_run(struct kvm_vcpu *vcpu)
> > > >               if (kvm_cpu_has_pending_timer(vcpu))
> > > >                       kvm_inject_pending_timer_irqs(vcpu);
> > > >
> > > > +             if (vcpu->arch.nested_get_pages_pending) {
> > > > +                     r = kvm_get_nested_state_pages(vcpu);
> > > > +                     if (r <= 0)
> > > > +                             break;
> > > > +             }
> > > > +
> > >
> > > Will this leads to skip the get_nested_state_pages for L2 first time
> > > vmentry in every L2 running iteration ? Because with above changes
> > > KVM_REQ_GET_NESTED_STATE_PAGES is not set in
> > > nested_vmx_enter_non_root_mode() and
> > > vcpu->arch.nested_get_pages_pending is not checked in
> > > vcpu_enter_guest().
> > >
> > Good catch. I think the diff won't work when vcpu is runnable.

It works, but it's inefficient if the request comes from KVM_SET_NESTED_STATE.
The pending KVM_REQ_UNBLOCK that comes with the flag will prevent actually running
the guest.  Specifically, this chunk of code will detect the pending request and
bail out of vcpu_enter_guest().

	if (kvm_vcpu_exit_request(vcpu)) {
		vcpu->mode = OUTSIDE_GUEST_MODE;
		smp_wmb();
		local_irq_enable();
		preempt_enable();
		kvm_vcpu_srcu_read_lock(vcpu);
		r = 1;
		goto cancel_injection;
	}

But the inefficiency is a non-issue since "true" emulation of VM-Enter will flow
through this path (the VMRESUME/VMLAUNCH/VMRUN exit handler runs at the end of
vcpu_enter_guest()).

> > It only tries to catch the vcpu block case. Even for the vcpu block case,
> > the check of KVM_REQ_UNBLOCK is way too late. Ah, kvm_vcpu_check_block() is
> > called by kvm_vcpu_block() which is called by vcpu_block(). The warning is
> > triggered at the very beginning of vcpu_block(), i.e., within
> > kvm_arch_vcpu_runnable(). So, please ignore the trace in my previous email.
> >
> > In addition, my minor push back for that is
> > vcpu->arch.nested_get_pages_pending seems to be another
> > KVM_REQ_GET_NESTED_STATE_PAGES.
> 
> Yeah, but in concept level it's not a REQ mask lives in the
> vcpu->requests which can be cached by e.g. kvm_request_pending().
> It's necessary to check vcpu->arch.nested_get_pages_pending in
> vcpu_enter_guest() if Sean's idea is to replace
> KVM_REQ_GET_NESTED_STATE_PAGES with nested_get_pages_pending.

Yes, the key is that it's not a request.  Requests have implicit properties:
e.g. as above, they effectively prevent running the vCPU until the request goes
away, they can be pended from other vCPUs, etc...  And the property that is most
relevant to this bug: except for special cases, requests only need to be serviced
before running the vCPU.

And the number of requests is limited due to them being stored in a bitmap.  x86
still has plenty of room due to kvm_vcpu.requests being a u64, but it's still
preferable to avoid using a request unless absolutely necessary.
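
For readers less familiar with the request machinery: a request is just a bit in
vcpu->requests, roughly along these lines (paraphrased from
include/linux/kvm_host.h; helper names and barriers simplified):

	static inline void kvm_make_request(int req, struct kvm_vcpu *vcpu)
	{
		/* Publish any request payload before setting the bit. */
		smp_wmb();
		set_bit(req & KVM_REQUEST_MASK, (void *)&vcpu->requests);
	}

	static inline bool kvm_check_request(int req, struct kvm_vcpu *vcpu)
	{
		if (!test_bit(req & KVM_REQUEST_MASK, (void *)&vcpu->requests))
			return false;

		clear_bit(req & KVM_REQUEST_MASK, (void *)&vcpu->requests);
		/* Pairs with the barrier in kvm_make_request(). */
		smp_mb__after_atomic();
		return true;
	}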

For this case, since using a request isn't strictly needed and using a request
would require special casing that request, my strong preference is to not use a
request.

So yes, my idea is to "just" replace the request with a flag, but there are
quite a few subtle implications in not using a request.
Sean Christopherson Sept. 7, 2022, 3:56 p.m. UTC | #7
On Wed, Sep 07, 2022, Sean Christopherson wrote:
> On Wed, Sep 07, 2022, Yuan Yao wrote:
> > On Tue, Sep 06, 2022 at 09:26:33PM -0700, Mingwei Zhang wrote:
> > > > > @@ -10700,6 +10706,12 @@ static int vcpu_run(struct kvm_vcpu *vcpu)
> > > > >               if (kvm_cpu_has_pending_timer(vcpu))
> > > > >                       kvm_inject_pending_timer_irqs(vcpu);
> > > > >
> > > > > +             if (vcpu->arch.nested_get_pages_pending) {
> > > > > +                     r = kvm_get_nested_state_pages(vcpu);
> > > > > +                     if (r <= 0)
> > > > > +                             break;
> > > > > +             }
> > > > > +
> > > >
> > > > Will this leads to skip the get_nested_state_pages for L2 first time
> > > > vmentry in every L2 running iteration ? Because with above changes
> > > > KVM_REQ_GET_NESTED_STATE_PAGES is not set in
> > > > nested_vmx_enter_non_root_mode() and
> > > > vcpu->arch.nested_get_pages_pending is not checked in
> > > > vcpu_enter_guest().
> > > >
> > > Good catch. I think the diff won't work when vcpu is runnable.
> 
> It works, but it's inefficient if the request comes from KVM_SET_NESTED_STATE.
> The pending KVM_REQ_UNBLOCK that comes with the flag will prevent actually running
> the guest.  Specifically, this chunk of code will detect the pending request and
> bail out of vcpu_enter_guest().
> 
> 	if (kvm_vcpu_exit_request(vcpu)) {
> 		vcpu->mode = OUTSIDE_GUEST_MODE;
> 		smp_wmb();
> 		local_irq_enable();
> 		preempt_enable();
> 		kvm_vcpu_srcu_read_lock(vcpu);
> 		r = 1;
> 		goto cancel_injection;
> 	}
> 
> But the inefficiency is a non-issue since "true" emulation of VM-Enter will flow
> through this path (the VMRESUME/VMLAUNCH/VMRUN exit handler runs at the end of
> vcpu_enter_guest().

Actually, nested VM-Enter doesn't use this path at all.  The above holds true for
emulated RSM, but that's largely a moot point since RSM isn't exactly a hot path.

Patch

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index d7374d768296..3dcaac8f0584 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -10261,12 +10261,6 @@  static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
 			r = -EIO;
 			goto out;
 		}
-		if (kvm_check_request(KVM_REQ_GET_NESTED_STATE_PAGES, vcpu)) {
-			if (unlikely(!kvm_x86_ops.nested_ops->get_nested_state_pages(vcpu))) {
-				r = 0;
-				goto out;
-			}
-		}
 		if (kvm_check_request(KVM_REQ_MMU_FREE_OBSOLETE_ROOTS, vcpu))
 			kvm_mmu_free_obsolete_roots(vcpu);
 		if (kvm_check_request(KVM_REQ_MIGRATE_TIMER, vcpu))
@@ -10666,6 +10660,23 @@  static inline bool kvm_vcpu_running(struct kvm_vcpu *vcpu)
 		!vcpu->arch.apf.halted);
 }
 
+static int kvm_vcpu_handle_common_requests(struct kvm_vcpu *vcpu)
+{
+	if (kvm_request_pending(vcpu)) {
+		/*
+		 * Get the vmcs12 pages before checking for interrupts that
+		 * might unblock the guest if L1 is using virtual-interrupt
+		 * delivery.
+		 */
+		if (kvm_check_request(KVM_REQ_GET_NESTED_STATE_PAGES, vcpu)) {
+			if (unlikely(!kvm_x86_ops.nested_ops->get_nested_state_pages(vcpu)))
+				return 0;
+		}
+	}
+
+	return 1;
+}
+
 /* Called within kvm->srcu read side.  */
 static int vcpu_run(struct kvm_vcpu *vcpu)
 {
@@ -10681,6 +10692,12 @@  static int vcpu_run(struct kvm_vcpu *vcpu)
 		 * this point can start executing an instruction.
 		 */
 		vcpu->arch.at_instruction_boundary = false;
+
+		/* Process common request regardless of vcpu state. */
+		r = kvm_vcpu_handle_common_requests(vcpu);
+		if (r <= 0)
+			break;
+
 		if (kvm_vcpu_running(vcpu)) {
 			r = vcpu_enter_guest(vcpu);
 		} else {