
[1/5] KVM: x86: Get vmcs12 pages before checking pending interrupts

Message ID 20220802230718.1891356-2-mizhang@google.com
State New, archived
Series Fix a race between posted interrupt delivery and migration in a nested VM

Commit Message

Mingwei Zhang Aug. 2, 2022, 11:07 p.m. UTC
From: Oliver Upton <oupton@google.com>

vmx_guest_apic_has_interrupt implicitly depends on the virtual APIC
page being present + mapped into the kernel address space. However, with
demand paging we break this dependency, as the KVM_REQ_GET_VMCS12_PAGES
event isn't assessed before entering vcpu_block.

Fix this by getting vmcs12 pages before inspecting the guest's APIC
page. Note that upstream does not have this issue, as they will directly
get the vmcs12 pages on vmlaunch/vmresume instead of relying on the
event request mechanism. However, the upstream approach is problematic,
as the vmcs12 pages will not be present if a live migration occurred
before checking the virtual APIC page.

Signed-off-by: Oliver Upton <oupton@google.com>
Signed-off-by: Mingwei Zhang <mizhang@google.com>
---
 arch/x86/kvm/x86.c | 17 +++++++++++++++++
 1 file changed, 17 insertions(+)

Comments

Maxim Levitsky Aug. 3, 2022, 10:08 a.m. UTC | #1
On Tue, 2022-08-02 at 23:07 +0000, Mingwei Zhang wrote:
> From: Oliver Upton <oupton@google.com>
> 
> vmx_guest_apic_has_interrupt implicitly depends on the virtual APIC
> page being present + mapped into the kernel address space. However, with
> demand paging we break this dependency, as the KVM_REQ_GET_VMCS12_PAGES
> event isn't assessed before entering vcpu_block.
> 
> Fix this by getting vmcs12 pages before inspecting the guest's APIC
> page. Note that upstream does not have this issue, as they will directly
> get the vmcs12 pages on vmlaunch/vmresume instead of relying on the
> event request mechanism. However, the upstream approach is problematic,
> as the vmcs12 pages will not be present if a live migration occurred
> before checking the virtual APIC page.

Since this patch is intended for upstream, I don't fully understand
the meaning of the above paragraph.


> 
> Signed-off-by: Oliver Upton <oupton@google.com>
> Signed-off-by: Mingwei Zhang <mizhang@google.com>
> ---
>  arch/x86/kvm/x86.c | 17 +++++++++++++++++
>  1 file changed, 17 insertions(+)
> 
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 5366f884e9a7..1d3d8127aaea 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -10599,6 +10599,23 @@ static inline int vcpu_block(struct kvm_vcpu *vcpu)
>  {
>  	bool hv_timer;
>  
> +	/*
> +	 * We must first get the vmcs12 pages before checking for interrupts
> +	 * that might unblock the guest if L1 is using virtual-interrupt
> +	 * delivery.
> +	 */
> +	if (kvm_check_request(KVM_REQ_GET_NESTED_STATE_PAGES, vcpu)) {
> +		/*
> +		 * If we have to ask user-space to post-copy a page,
> +		 * then we have to keep trying to get all of the
> +		 * VMCS12 pages until we succeed.
> +		 */
> +		if (unlikely(!kvm_x86_ops.nested_ops->get_nested_state_pages(vcpu))) {
> +			kvm_make_request(KVM_REQ_GET_NESTED_STATE_PAGES, vcpu);
> +			return 0;
> +		}
> +	}
> +
>  	if (!kvm_arch_vcpu_runnable(vcpu)) {
>  		/*
>  		 * Switch to the software timer before halt-polling/blocking as


If I understand correctly, you are saying that if the apic backing page is
migrated in post-copy, then 'get_nested_state_pages' will return false and
thus fail?

AFAIK both the SVM and VMX versions of 'get_nested_state_pages' assume that
this is not the case for many things like MSR bitmaps and such - they always
use non-atomic versions of guest memory access like 'kvm_vcpu_read_guest'
and 'kvm_vcpu_map', which are supposed to block if they attempt to access an
HVA which is not present, and then userfaultfd should take over and wake
them up.

If that still fails, the nested VM entry usually fails, and/or the whole VM
crashes with 'KVM_EXIT_INTERNAL_ERROR'.
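
For reference, the blocking pattern being described looks roughly like
this (an illustrative sketch modeled on the upstream VMX helpers, not a
quote of the real code):

	/*
	 * kvm_vcpu_map() faults the backing page in, so a page that is
	 * still remote after post-copy migration starts simply blocks
	 * here until userfaultfd (or an equivalent demand paging
	 * mechanism) resolves it.
	 */
	if (nested_cpu_has(vmcs12, CPU_BASED_TPR_SHADOW) &&
	    kvm_vcpu_map(vcpu, gpa_to_gfn(vmcs12->virtual_apic_page_addr),
			 &vmx->nested.virtual_apic_map))
		return false;	/* map failed even after blocking */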

Anything I missed? 

Best regards,
	Maxim Levitsky
Mingwei Zhang Aug. 3, 2022, 4:45 p.m. UTC | #2
On Wed, Aug 03, 2022, Maxim Levitsky wrote:
> On Tue, 2022-08-02 at 23:07 +0000, Mingwei Zhang wrote:
> > From: Oliver Upton <oupton@google.com>
> > 
> > vmx_guest_apic_has_interrupt implicitly depends on the virtual APIC
> > page being present + mapped into the kernel address space. However, with
> > demand paging we break this dependency, as the KVM_REQ_GET_VMCS12_PAGES
> > event isn't assessed before entering vcpu_block.
> > 
> > Fix this by getting vmcs12 pages before inspecting the guest's APIC
> > page. Note that upstream does not have this issue, as they will directly
> > get the vmcs12 pages on vmlaunch/vmresume instead of relying on the
> > event request mechanism. However, the upstream approach is problematic,
> > as the vmcs12 pages will not be present if a live migration occurred
> > before checking the virtual APIC page.
> 
> Since this patch is intended for upstream, I don't fully understand
> the meaning of the above paragraph.

My apologies. Some of the statements need to be updated, which I should
have done before sending. But I think the point here is that there is a
missing get_nested_state_pages() call within vcpu_block() when a
KVM_REQ_GET_NESTED_STATE_PAGES request is pending.

> 
> 
> > 
> > Signed-off-by: Oliver Upton <oupton@google.com>
> > Signed-off-by: Mingwei Zhang <mizhang@google.com>
> > ---
> >  arch/x86/kvm/x86.c | 17 +++++++++++++++++
> >  1 file changed, 17 insertions(+)
> > 
> > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > index 5366f884e9a7..1d3d8127aaea 100644
> > --- a/arch/x86/kvm/x86.c
> > +++ b/arch/x86/kvm/x86.c
> > @@ -10599,6 +10599,23 @@ static inline int vcpu_block(struct kvm_vcpu *vcpu)
> >  {
> >  	bool hv_timer;
> >  
> > +	/*
> > +	 * We must first get the vmcs12 pages before checking for interrupts
> > +	 * that might unblock the guest if L1 is using virtual-interrupt
> > +	 * delivery.
> > +	 */
> > +	if (kvm_check_request(KVM_REQ_GET_NESTED_STATE_PAGES, vcpu)) {
> > +		/*
> > +		 * If we have to ask user-space to post-copy a page,
> > +		 * then we have to keep trying to get all of the
> > +		 * VMCS12 pages until we succeed.
> > +		 */
> > +		if (unlikely(!kvm_x86_ops.nested_ops->get_nested_state_pages(vcpu))) {
> > +			kvm_make_request(KVM_REQ_GET_NESTED_STATE_PAGES, vcpu);
> > +			return 0;
> > +		}
> > +	}
> > +
> >  	if (!kvm_arch_vcpu_runnable(vcpu)) {
> >  		/*
> >  		 * Switch to the software timer before halt-polling/blocking as
> 
> 
> If I understand correctly, you are saying that if the apic backing page is
> migrated in post-copy, then 'get_nested_state_pages' will return false and
> thus fail?

What I mean is that when the vCPU was halted and then migrated in this
case, KVM did not call get_nested_state_pages() before getting into
kvm_arch_vcpu_runnable(). That function checks the apic backing page;
the check failed and triggered the warning.
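
For context, the check that trips is in vmx_guest_apic_has_interrupt(),
which looks roughly like this (paraphrased from the upstream code, so
treat the details as approximate):

	if (WARN_ON_ONCE(!is_guest_mode(vcpu)) ||
	    !nested_cpu_has_vid(get_vmcs12(vcpu)) ||
	    WARN_ON_ONCE(!vmx->nested.virtual_apic_map.gfn))
		return false;

The last WARN_ON_ONCE() is what fires when the vAPIC page was never
(re)mapped after the migration.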
> 
> AFAIK both the SVM and VMX versions of 'get_nested_state_pages' assume that
> this is not the case for many things like MSR bitmaps and such - they always
> use non-atomic versions of guest memory access like 'kvm_vcpu_read_guest'
> and 'kvm_vcpu_map', which are supposed to block if they attempt to access an
> HVA which is not present, and then userfaultfd should take over and wake
> them up.

You are right here.
> 
> If that still fails, the nested VM entry usually fails, and/or the whole VM
> crashes with 'KVM_EXIT_INTERNAL_ERROR'.
> 
> Anything I missed? 
> 
> Best regards,
> 	Maxim Levitsky
>
Mingwei Zhang Aug. 3, 2022, 5 p.m. UTC | #3
On Wed, Aug 03, 2022, Mingwei Zhang wrote:
> On Wed, Aug 03, 2022, Maxim Levitsky wrote:
> > On Tue, 2022-08-02 at 23:07 +0000, Mingwei Zhang wrote:
> > > From: Oliver Upton <oupton@google.com>
> > > 
> > > vmx_guest_apic_has_interrupt implicitly depends on the virtual APIC
> > > page being present + mapped into the kernel address space. However, with
> > > demand paging we break this dependency, as the KVM_REQ_GET_VMCS12_PAGES
> > > event isn't assessed before entering vcpu_block.
> > > 
> > > Fix this by getting vmcs12 pages before inspecting the guest's APIC
> > > page. Note that upstream does not have this issue, as they will directly
> > > get the vmcs12 pages on vmlaunch/vmresume instead of relying on the
> > > event request mechanism. However, the upstream approach is problematic,
> > > as the vmcs12 pages will not be present if a live migration occurred
> > > before checking the virtual APIC page.
> > 
> > Since this patch is intended for upstream, I don't fully understand
> > the meaning of the above paragraph.
> 
> My apologies. Some of the statements need to be updated, which I should
> have done before sending. But I think the point here is that there is a
> missing get_nested_state_pages() call within vcpu_block() when a
> KVM_REQ_GET_NESTED_STATE_PAGES request is pending.
> 
> > 
> > 
> > > 
> > > Signed-off-by: Oliver Upton <oupton@google.com>
> > > Signed-off-by: Mingwei Zhang <mizhang@google.com>
> > > ---
> > >  arch/x86/kvm/x86.c | 17 +++++++++++++++++
> > >  1 file changed, 17 insertions(+)
> > > 
> > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > > index 5366f884e9a7..1d3d8127aaea 100644
> > > --- a/arch/x86/kvm/x86.c
> > > +++ b/arch/x86/kvm/x86.c
> > > @@ -10599,6 +10599,23 @@ static inline int vcpu_block(struct kvm_vcpu *vcpu)
> > >  {
> > >  	bool hv_timer;
> > >  
> > > +	/*
> > > +	 * We must first get the vmcs12 pages before checking for interrupts
> > > +	 * that might unblock the guest if L1 is using virtual-interrupt
> > > +	 * delivery.
> > > +	 */
> > > +	if (kvm_check_request(KVM_REQ_GET_NESTED_STATE_PAGES, vcpu)) {
> > > +		/*
> > > +		 * If we have to ask user-space to post-copy a page,
> > > +		 * then we have to keep trying to get all of the
> > > +		 * VMCS12 pages until we succeed.
> > > +		 */
> > > +		if (unlikely(!kvm_x86_ops.nested_ops->get_nested_state_pages(vcpu))) {
> > > +			kvm_make_request(KVM_REQ_GET_NESTED_STATE_PAGES, vcpu);
> > > +			return 0;
> > > +		}
> > > +	}
> > > +
> > >  	if (!kvm_arch_vcpu_runnable(vcpu)) {
> > >  		/*
> > >  		 * Switch to the software timer before halt-polling/blocking as
> > 
> > 
> > If I understand correctly, you are saying that if the apic backing page is
> > migrated in post-copy, then 'get_nested_state_pages' will return false and
> > thus fail?
> 
> What I mean is that when the vCPU was halted and then migrated in this
> case, KVM did not call get_nested_state_pages() before getting into
> kvm_arch_vcpu_runnable(). That function checks the apic backing page;
> the check failed and triggered the warning.
> > 
> > AFAIK both the SVM and VMX versions of 'get_nested_state_pages' assume that
> > this is not the case for many things like MSR bitmaps and such - they always
> > use non-atomic versions of guest memory access like 'kvm_vcpu_read_guest'
> > and 'kvm_vcpu_map', which are supposed to block if they attempt to access an
> > HVA which is not present, and then userfaultfd should take over and wake
> > them up.
> 
> You are right here.
> > 
> > If that still fails, the nested VM entry usually fails, and/or the whole VM
> > crashes with 'KVM_EXIT_INTERNAL_ERROR'.
> > 

Ah, I think I understand what you are saying. Hmm, so basically the
patch here keeps re-requesting the vmcs12 pages on failure, but what you
are saying is that we only need to call 'get_nested_state_pages' once;
if it fails, then the VM fails to work. Let me double check and get
back.

> > Anything I missed? 
> > 
> > Best regards,
> > 	Maxim Levitsky
> >
Paolo Bonzini Aug. 3, 2022, 5:18 p.m. UTC | #4
On 8/3/22 01:07, Mingwei Zhang wrote:
> +	/*
> +	 * We must first get the vmcs12 pages before checking for interrupts
> +	 * that might unblock the guest if L1 is using virtual-interrupt
> +	 * delivery.
> +	 */
> +	if (kvm_check_request(KVM_REQ_GET_NESTED_STATE_PAGES, vcpu)) {
> +		/*
> +		 * If we have to ask user-space to post-copy a page,
> +		 * then we have to keep trying to get all of the
> +		 * VMCS12 pages until we succeed.
> +		 */
> +		if (unlikely(!kvm_x86_ops.nested_ops->get_nested_state_pages(vcpu))) {
> +			kvm_make_request(KVM_REQ_GET_NESTED_STATE_PAGES, vcpu);
> +			return 0;
> +		}
> +	}
> +

I think request handling (except for KVM_REQ_EVENT) could be more 
generically moved from vcpu_enter_guest() to vcpu_run().

Paolo
Mingwei Zhang Aug. 3, 2022, 5:51 p.m. UTC | #5
On Wed, Aug 3, 2022 at 10:18 AM Paolo Bonzini <pbonzini@redhat.com> wrote:
>
> On 8/3/22 01:07, Mingwei Zhang wrote:
> > +     /*
> > +      * We must first get the vmcs12 pages before checking for interrupts
> > +      * that might unblock the guest if L1 is using virtual-interrupt
> > +      * delivery.
> > +      */
> > +     if (kvm_check_request(KVM_REQ_GET_NESTED_STATE_PAGES, vcpu)) {
> > +             /*
> > +              * If we have to ask user-space to post-copy a page,
> > +              * then we have to keep trying to get all of the
> > +              * VMCS12 pages until we succeed.
> > +              */
> > +             if (unlikely(!kvm_x86_ops.nested_ops->get_nested_state_pages(vcpu))) {
> > +                     kvm_make_request(KVM_REQ_GET_NESTED_STATE_PAGES, vcpu);
> > +                     return 0;
> > +             }
> > +     }
> > +
>
> I think request handling (except for KVM_REQ_EVENT) could be more
> generically moved from vcpu_enter_guest() to vcpu_run().

Yeah, sounds good to me. I can come up with an updated version. At
least, I will remove the repeated request here.
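
For concreteness, something along these lines (a hypothetical sketch,
not the posted patch - the request would be handled once in vcpu_run()'s
loop so it covers both the runnable and the blocked paths):

	for (;;) {
		/*
		 * Hypothetical: handling the request here covers both
		 * vcpu_enter_guest() and vcpu_block(), instead of
		 * duplicating the check in each path.
		 */
		if (kvm_check_request(KVM_REQ_GET_NESTED_STATE_PAGES, vcpu) &&
		    unlikely(!kvm_x86_ops.nested_ops->get_nested_state_pages(vcpu))) {
			r = 0;
			break;
		}

		if (kvm_vcpu_running(vcpu))
			r = vcpu_enter_guest(vcpu);
		else
			r = vcpu_block(vcpu);
		...
	}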

>
> Paolo
>
Oliver Upton Aug. 3, 2022, 6:36 p.m. UTC | #6
On Wed, Aug 03, 2022 at 05:00:26PM +0000, Mingwei Zhang wrote:
> On Wed, Aug 03, 2022, Mingwei Zhang wrote:
> > On Wed, Aug 03, 2022, Maxim Levitsky wrote:
> > > On Tue, 2022-08-02 at 23:07 +0000, Mingwei Zhang wrote:
> > > > From: Oliver Upton <oupton@google.com>
> > > > 
> > > > vmx_guest_apic_has_interrupt implicitly depends on the virtual APIC
> > > > page being present + mapped into the kernel address space. However, with
> > > > demand paging we break this dependency, as the KVM_REQ_GET_VMCS12_PAGES
> > > > event isn't assessed before entering vcpu_block.
> > > > 
> > > > Fix this by getting vmcs12 pages before inspecting the guest's APIC
> > > > page. Note that upstream does not have this issue, as they will directly
> > > > get the vmcs12 pages on vmlaunch/vmresume instead of relying on the
> > > > event request mechanism. However, the upstream approach is problematic,
> > > > as the vmcs12 pages will not be present if a live migration occurred
> > > > before checking the virtual APIC page.
> > > 
> > > Since this patch is intended for upstream, I don't fully understand
> > > the meaning of the above paragraph.
> > 
> > My apologies. Some of the statements need to be updated, which I should
> > have done before sending. But I think the point here is that there is a
> > missing get_nested_state_pages() call within vcpu_block() when a
> > KVM_REQ_GET_NESTED_STATE_PAGES request is pending.

This was my poorly written changelog, sorry about that Mingwei :)

> > > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > > > index 5366f884e9a7..1d3d8127aaea 100644
> > > > --- a/arch/x86/kvm/x86.c
> > > > +++ b/arch/x86/kvm/x86.c
> > > > @@ -10599,6 +10599,23 @@ static inline int vcpu_block(struct kvm_vcpu *vcpu)
> > > >  {
> > > >  	bool hv_timer;
> > > >  
> > > > +	/*
> > > > +	 * We must first get the vmcs12 pages before checking for interrupts
> > > > +	 * that might unblock the guest if L1 is using virtual-interrupt
> > > > +	 * delivery.
> > > > +	 */
> > > > +	if (kvm_check_request(KVM_REQ_GET_NESTED_STATE_PAGES, vcpu)) {
> > > > +		/*
> > > > +		 * If we have to ask user-space to post-copy a page,
> > > > +		 * then we have to keep trying to get all of the
> > > > +		 * VMCS12 pages until we succeed.
> > > > +		 */
> > > > +		if (unlikely(!kvm_x86_ops.nested_ops->get_nested_state_pages(vcpu))) {
> > > > +			kvm_make_request(KVM_REQ_GET_NESTED_STATE_PAGES, vcpu);
> > > > +			return 0;
> > > > +		}
> > > > +	}
> > > > +
> > > >  	if (!kvm_arch_vcpu_runnable(vcpu)) {
> > > >  		/*
> > > >  		 * Switch to the software timer before halt-polling/blocking as
> > > 
> > > 
> > > If I understand correctly, you are saying that if the apic backing page is
> > > migrated in post-copy, then 'get_nested_state_pages' will return false and
> > > thus fail?
> > 
> > What I mean is that when the vCPU was halted and then migrated in this
> > case, KVM did not call get_nested_state_pages() before getting into
> > kvm_arch_vcpu_runnable(). That function checks the apic backing page;
> > the check failed and triggered the warning.
> > > 
> > > AFAIK both the SVM and VMX versions of 'get_nested_state_pages' assume that
> > > this is not the case for many things like MSR bitmaps and such - they always
> > > use non-atomic versions of guest memory access like 'kvm_vcpu_read_guest'
> > > and 'kvm_vcpu_map', which are supposed to block if they attempt to access an
> > > HVA which is not present, and then userfaultfd should take over and wake
> > > them up.
> > 
> > You are right here.
> > > 
> > > If that still fails, the nested VM entry usually fails, and/or the whole VM
> > > crashes with 'KVM_EXIT_INTERNAL_ERROR'.
> > > 
> 
> Ah, I think I understand what you are saying. Hmm, so basically the
> patch here keeps re-requesting the vmcs12 pages on failure, but what you
> are saying is that we only need to call 'get_nested_state_pages' once;
> if it fails, then the VM fails to work. Let me double check and get
> back.

IIRC the reason we re-set the request on a failed return was because our
local implementation of the VMX hook was non-blocking and would bail on
the first page that needed to be demanded from the source. So, you
effectively keep hitting the request until all of the pages are pulled in.

--
Thanks,
Oliver
Maxim Levitsky Aug. 3, 2022, 7:34 p.m. UTC | #7
On Wed, 2022-08-03 at 10:51 -0700, Mingwei Zhang wrote:
> On Wed, Aug 3, 2022 at 10:18 AM Paolo Bonzini <pbonzini@redhat.com> wrote:
> > On 8/3/22 01:07, Mingwei Zhang wrote:
> > > +     /*
> > > +      * We must first get the vmcs12 pages before checking for interrupts
> > > +      * that might unblock the guest if L1 is using virtual-interrupt
> > > +      * delivery.
> > > +      */
> > > +     if (kvm_check_request(KVM_REQ_GET_NESTED_STATE_PAGES, vcpu)) {
> > > +             /*
> > > +              * If we have to ask user-space to post-copy a page,
> > > +              * then we have to keep trying to get all of the
> > > +              * VMCS12 pages until we succeed.
> > > +              */
> > > +             if (unlikely(!kvm_x86_ops.nested_ops->get_nested_state_pages(vcpu))) {
> > > +                     kvm_make_request(KVM_REQ_GET_NESTED_STATE_PAGES, vcpu);
> > > +                     return 0;
> > > +             }
> > > +     }
> > > +
> > 
> > I think request handling (except for KVM_REQ_EVENT) could be more
> > generically moved from vcpu_enter_guest() to vcpu_run().
> 
> Yeah, sounds good to me. I can come up with an updated version. At
> least, I will remove the repeat request here.


Now it all makes sense. I do think that KVM_REQ_GET_NESTED_STATE_PAGES processing
when the vCPU is halted is indeed missing.

This reminds me that I would be *very* happy to remove
KVM_REQ_GET_NESTED_STATE_PAGES, if by any chance there is an agreement
to do so upstream. This is yet another reason to do so, to be honest.
Just my 0.2 cents of course.


Best regards,
	Maxim Levitsky



PS:
This also reminded me of a related issue:
https://lore.kernel.org/lkml/20220427173758.517087-1-pbonzini@redhat.com/T/
Any update on it?

> 
> > Paolo
> >
Sean Christopherson Aug. 25, 2022, 12:11 a.m. UTC | #8
On Wed, Aug 03, 2022, Maxim Levitsky wrote:
> On Wed, 2022-08-03 at 10:51 -0700, Mingwei Zhang wrote:
> > On Wed, Aug 3, 2022 at 10:18 AM Paolo Bonzini <pbonzini@redhat.com> wrote:
> > > On 8/3/22 01:07, Mingwei Zhang wrote:
> > > > +     /*
> > > > +      * We must first get the vmcs12 pages before checking for interrupts
> > > > +      * that might unblock the guest if L1 is using virtual-interrupt
> > > > +      * delivery.
> > > > +      */
> > > > +     if (kvm_check_request(KVM_REQ_GET_NESTED_STATE_PAGES, vcpu)) {
> > > > +             /*
> > > > +              * If we have to ask user-space to post-copy a page,
> > > > +              * then we have to keep trying to get all of the
> > > > +              * VMCS12 pages until we succeed.
> > > > +              */
> > > > +             if (unlikely(!kvm_x86_ops.nested_ops->get_nested_state_pages(vcpu))) {
> > > > +                     kvm_make_request(KVM_REQ_GET_NESTED_STATE_PAGES, vcpu);
> > > > +                     return 0;
> > > > +             }
> > > > +     }
> > > > +
> > > 
> > > I think request handling (except for KVM_REQ_EVENT) could be more
> > > generically moved from vcpu_enter_guest() to vcpu_run().
> > 
> > Yeah, sounds good to me. I can come up with an updated version. At
> > least, I will remove the repeat request here.
> 
> Now it all makes sense. I do think that KVM_REQ_GET_NESTED_STATE_PAGES processing
> when the vCPU is halted is indeed missing.
> 
> This reminds me that I would be *very* happy to remove the KVM_REQ_GET_NESTED_STATE_PAGES,
> if by any chance there is an agreement to do so upstream.
> This is yet another reason to do so to be honest.
> Just my 0.2 cents of course.

+100

@google folks, what would it take for us to mark KVM_REQ_GET_NESTED_STATE_PAGES
as deprecated in upstream and stop accepting patches/fixes?  IIUC, when we eventually
move to userfaultfd, all this goes away, i.e. we do want to ditch this at some point.
Jim Mattson Aug. 25, 2022, 2:51 a.m. UTC | #9
On Wed, Aug 24, 2022 at 5:11 PM Sean Christopherson <seanjc@google.com> wrote:

> @google folks, what would it take for us to mark KVM_REQ_GET_NESTED_STATE_PAGES
> as deprecated in upstream and stop accepting patches/fixes?  IIUC, when we eventually
> move to userfaultfd, all this goes away, i.e. we do want to ditch this at some point.

Userfaultfd is a red herring. There were two reasons that we needed
this when nested live migration was implemented:
1) our netlink socket mechanism for funneling remote page requests to
a userspace listener was broken.
2) we were not necessarily prepared to deal with remote page requests
during VM setup.

(1) has long since been fixed. Though our preference is to exit from
KVM_RUN and get the vCPU thread to request the remote page itself, we
are now capable of queuing a remote page request with a separate
listener thread and blocking in the kernel until the page is received.
I believe that mechanism is functionally equivalent to userfaultfd,
though not as elegant.
I don't know about (2). I'm not sure when the listener thread is set
up, relative to all of the other setup steps. Eliminating
KVM_REQ_GET_NESTED_STATE_PAGES means that userspace must be prepared
to fetch a remote page by the first call to KVM_SET_NESTED_STATE. The
same is true when using userfaultfd.

These new ordering constraints represent a UAPI breakage, but we don't
seem to be as concerned about that as we once were. Maybe that's a
good thing. Can we get rid of all of the superseded ioctls, like
KVM_SET_CPUID, while we're at it?
Sean Christopherson Aug. 25, 2022, 2:40 p.m. UTC | #10
On Wed, Aug 24, 2022, Jim Mattson wrote:
> On Wed, Aug 24, 2022 at 5:11 PM Sean Christopherson <seanjc@google.com> wrote:
> 
> > @google folks, what would it take for us to mark KVM_REQ_GET_NESTED_STATE_PAGES
> > as deprecated in upstream and stop accepting patches/fixes?  IIUC, when we eventually
> > move to userfaultfd, all this goes away, i.e. we do want to ditch this at some point.
> 
> Userfaultfd is a red herring. There were two reasons that we needed
> this when nested live migration was implemented:
> 1) our netlink socket mechanism for funneling remote page requests to
> a userspace listener was broken.
> 2) we were not necessarily prepared to deal with remote page requests
> during VM setup.
> 
> (1) has long since been fixed. Though our preference is to exit from
> KVM_RUN and get the vCPU thread to request the remote page itself, we
> are now capable of queuing a remote page request with a separate
> listener thread and blocking in the kernel until the page is received.
> I believe that mechanism is functionally equivalent to userfaultfd,
> though not as elegant.
> I don't know about (2). I'm not sure when the listener thread is set
> up, relative to all of the other setup steps. Eliminating
> KVM_REQ_GET_NESTED_STATE_PAGES means that userspace must be prepared
> to fetch a remote page by the first call to KVM_SET_NESTED_STATE. The
> same is true when using userfaultfd.
> 
> These new ordering constraints represent a UAPI breakage, but we don't
> seem to be as concerned about that as we once were. Maybe that's a
> good thing. Can we get rid of all of the superseded ioctls, like
> KVM_SET_CPUID, while we're at it?

I view KVM_REQ_GET_NESTED_STATE_PAGES as a special case.  We are likely the only
users, we can (eventually) wean ourselves off the feature, and we can carry
internal patches (which we are obviously already carrying) until we transition
away.  And unlike KVM_SET_CPUID and other ancient ioctls() that are largely
forgotten, this feature is likely to be a maintenance burden as long as it exists.
Mingwei Zhang Aug. 25, 2022, 5:16 p.m. UTC | #11
On Thu, Aug 25, 2022 at 7:41 AM Sean Christopherson <seanjc@google.com> wrote:
>
> On Wed, Aug 24, 2022, Jim Mattson wrote:
> > On Wed, Aug 24, 2022 at 5:11 PM Sean Christopherson <seanjc@google.com> wrote:
> >
> > > @google folks, what would it take for us to mark KVM_REQ_GET_NESTED_STATE_PAGES
> > > as deprecated in upstream and stop accepting patches/fixes?  IIUC, when we eventually
> > > move to userfaultfd, all this goes away, i.e. we do want to ditch this at some point.
> >
> > Userfaultfd is a red herring. There were two reasons that we needed
> > this when nested live migration was implemented:
> > 1) our netlink socket mechanism for funneling remote page requests to
> > a userspace listener was broken.
> > 2) we were not necessarily prepared to deal with remote page requests
> > during VM setup.
> >
> > (1) has long since been fixed. Though our preference is to exit from
> > KVM_RUN and get the vCPU thread to request the remote page itself, we
> > are now capable of queuing a remote page request with a separate
> > listener thread and blocking in the kernel until the page is received.
> > I believe that mechanism is functionally equivalent to userfaultfd,
> > though not as elegant.
> > I don't know about (2). I'm not sure when the listener thread is set
> > up, relative to all of the other setup steps. Eliminating
> > KVM_REQ_GET_NESTED_STATE_PAGES means that userspace must be prepared
> > to fetch a remote page by the first call to KVM_SET_NESTED_STATE. The
> > same is true when using userfaultfd.
> >
> > These new ordering constraints represent a UAPI breakage, but we don't
> > seem to be as concerned about that as we once were. Maybe that's a
> > good thing. Can we get rid of all of the superseded ioctls, like
> > KVM_SET_CPUID, while we're at it?
>
> I view KVM_REQ_GET_NESTED_STATE_PAGES as a special case.  We are likely the only
> users, we can (eventually) wean ourselves off the feature, and we can carry
> internal patches (which we are obviously already carrying) until we transition
> away.  And unlike KVM_SET_CPUID and other ancient ioctls() that are largely
> forgotten, this feature is likely to be a maintenance burden as long as it exists.

KVM_REQ_GET_NESTED_STATE_PAGES has been uniformly used in the
KVM_SET_NESTED_STATE ioctl in VMX (including eVMCS) and SVM; it is
basically a two-step mechanism for setting up nested state.

We can change that, but this may have side effects, and I think this
use case has nothing to do with demand paging.

I noticed that nested_vmx_enter_non_root_mode() is called in
KVM_SET_NESTED_STATE in VMX, while in the SVM implementation, it is
simply a kvm_make_request(KVM_REQ_GET_NESTED_STATE_PAGES, vcpu);

hmm... so is the nested_vmx_enter_non_root_mode() call in the VMX
KVM_SET_NESTED_STATE ioctl() still necessary? I am asking because the
same function is called again in nested_vmx_run().

Thanks.
-Mingwei
Sean Christopherson Aug. 25, 2022, 5:53 p.m. UTC | #12
On Thu, Aug 25, 2022, Mingwei Zhang wrote:
> On Thu, Aug 25, 2022 at 7:41 AM Sean Christopherson <seanjc@google.com> wrote:
> >
> > On Wed, Aug 24, 2022, Jim Mattson wrote:
> > > On Wed, Aug 24, 2022 at 5:11 PM Sean Christopherson <seanjc@google.com> wrote:
> > >
> > > > @google folks, what would it take for us to mark KVM_REQ_GET_NESTED_STATE_PAGES
> > > > as deprecated in upstream and stop accepting patches/fixes?  IIUC, when we eventually
> > > > move to userfaultfd, all this goes away, i.e. we do want to ditch this at some point.
> > >
> > > Userfaultfd is a red herring. There were two reasons that we needed
> > > this when nested live migration was implemented:
> > > 1) our netlink socket mechanism for funneling remote page requests to
> > > a userspace listener was broken.
> > > 2) we were not necessarily prepared to deal with remote page requests
> > > during VM setup.
> > >
> > > (1) has long since been fixed. Though our preference is to exit from
> > > KVM_RUN and get the vCPU thread to request the remote page itself, we
> > > are now capable of queuing a remote page request with a separate
> > > listener thread and blocking in the kernel until the page is received.
> > > I believe that mechanism is functionally equivalent to userfaultfd,
> > > though not as elegant.
> > > I don't know about (2). I'm not sure when the listener thread is set
> > > up, relative to all of the other setup steps. Eliminating
> > > KVM_REQ_GET_NESTED_STATE_PAGES means that userspace must be prepared
> > > to fetch a remote page by the first call to KVM_SET_NESTED_STATE. The
> > > same is true when using userfaultfd.
> > >
> > > These new ordering constraints represent a UAPI breakage, but we don't
> > > seem to be as concerned about that as we once were. Maybe that's a
> > > good thing. Can we get rid of all of the superseded ioctls, like
> > > KVM_SET_CPUID, while we're at it?
> >
> > I view KVM_REQ_GET_NESTED_STATE_PAGES as a special case.  We are likely the only
> > users, we can (eventually) wean ourselves off the feature, and we can carry
> > internal patches (which we are obviously already carrying) until we transition
> > away.  And unlike KVM_SET_CPUID and other ancient ioctls() that are largely
> > forgotten, this feature is likely to be a maintenance burden as long as it exists.
> 
> KVM_REQ_GET_NESTED_STATE_PAGES has been uniformly used in the
> KVM_SET_NESTED_STATE ioctl in VMX (including eVMCS) and SVM; it is
> basically a two-step mechanism for setting up nested state.
> 
> We can change that, but this may have side effects, and I think this
> use case has nothing to do with demand paging.

There are two uses of KVM_REQ_GET_NESTED_STATE_PAGES:

  1. Defer loads when leaving SMM.

  2. Defer loads for KVM_SET_NESTED_STATE.

#1 is fully solvable without a request, e.g. split ->leave_smm() into two helpers,
one that restores whatever metadata is needed before restoring from SMRAM, and
a second that loads guest virtualization state _after_ restoring all other
guest state from SMRAM.
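
Hypothetically, something like this (the two hook names below are
invented for illustration):

	/* Restore metadata needed before loading state from SMRAM. */
	int (*leave_smm_prepare)(struct kvm_vcpu *vcpu, const char *smstate);

	/* (Re)enter guest (nested) mode after all SMRAM state is loaded. */
	int (*leave_smm_commit)(struct kvm_vcpu *vcpu);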

#2 is done because of the reasons Jim listed above, which are specific to demand
paging (including userfaultfd).  There might be some interactions with other
ioctls() (KVM_SET_SREGS?) that are papered over by the request, but that can be
solved without a full request since only the first KVM_RUN after KVM_SET_NESTED_STATE
needs to refresh things (though ideally we'd avoid that).

In other words, if the demand paging use case goes away, then KVM can get rid of
KVM_REQ_GET_NESTED_STATE_PAGES.  

> KVM_SET_NESTED_STATE in VMX, while in the SVM implementation, it is
> simply a kvm_make_request(KVM_REQ_GET_NESTED_STATE_PAGES, vcpu);

svm_set_nested_state() very roughly open-codes enter_svm_guest_mode().  VMX could
do the same, but that may or may not be a net positive.

> hmm... so is the nested_vmx_enter_non_root_mode() call in the VMX
> KVM_SET_NESTED_STATE ioctl() still necessary? I am asking because the
> same function is called again in nested_vmx_run().

nested_vmx_run() is used only to emulate VMLAUNCH/VMRESUME and won't be invoked
if the vCPU is already running L2 at the time of migration.
Mingwei Zhang Aug. 25, 2022, 8:35 p.m. UTC | #13
> There are two uses of KVM_REQ_GET_NESTED_STATE_PAGES:
>
>   1. Defer loads when leaving SMM.
>
>   2. Defer loads for KVM_SET_NESTED_STATE.
>
> #1 is fully solvable without a request, e.g. split ->leave_smm() into two helpers,
> one that restores whatever metadata is needed before restoring from SMRAM, and
> a second that loads guest virtualization state _after_ restoring all other
> guest state from SMRAM.
>
> #2 is done because of the reasons Jim listed above, which are specific to demand
> paging (including userfaultfd).  There might be some interactions with other
> ioctls() (KVM_SET_SREGS?) that are papered over by the request, but that can be
> solved without a full request since only the first KVM_RUN after KVM_SET_NESTED_STATE
> needs to refresh things (though ideally we'd avoid that).

Ack on the fact that the 2-step process is specific to demand paging.

Currently, KVM_SET_NESTED_STATE is a two-step process in which the 1st
step does not require vmcs12 to be ready. So, I am thinking about what
it means to deprecate KVM_REQ_GET_NESTED_STATE_PAGES?

For case #2, I think there might be two options if we deprecate it:

 - Ensuring vmcs12 is ready during the call to
ioctl(KVM_SET_NESTED_STATE). This requires, as Jim mentioned, that the
thread listening for remote page requests be ready to serve before this
call (this is true regardless of uffd-based or Google-based demand
paging). We can definitely solve this ordering problem, but that is
beyond KVM's scope. It basically requires our userspace to cooperate.

 - Ensuring vmcs12 is ready before vmenter. This basically defers the
vmcs12 checks to the last second. I think this might be the better
option. However, isn't it the same as the original implementation,
i.e., instead of using KVM_REQ_GET_NESTED_STATE_PAGES, we have to use
some other flag to tell KVM to load the vmcs12?

Thanks.
-Mingwei
>
> In other words, if the demand paging use case goes away, then KVM can get rid of
> KVM_REQ_GET_NESTED_STATE_PAGES.
>
> > KVM_SET_NESTED_STATE in VMX, while in SVM implementation, it is simply
> > just a kvm_make_request(KVM_REQ_GET_NESTED_STATE_PAGES, vcpu);
>
> svm_set_nested_state() very roughly open-codes enter_svm_guest_mode().  VMX could
> do the same, but that may or may not be a net positive.
>
> > hmm... so is the nested_vmx_enter_non_root_mode() call in vmx
> > KVM_SET_NESTED_STATE ioctl() still necessary? I am thinking that
> > because the same function is called again in nested_vmx_run().
>
> nested_vmx_run() is used only to emulate VMLAUNCH/VMRESUME and won't be invoked
> if the vCPU is already running L2 at the time of migration.

Ack.
Mingwei Zhang Aug. 25, 2022, 8:37 p.m. UTC | #14
> Currently, KVM_SET_NESTED_STATE is a two-step process in which the 1st
> step does not require vmcs12 to be ready. So, I am thinking about what
> it means to deprecate KVM_REQ_GET_NESTED_STATE_PAGES?

typo:

"KVM_SET_NESTED_STATE is a two-step process" -> "Launching a nested VM
is a two-step process".
Jim Mattson Aug. 25, 2022, 11:21 p.m. UTC | #15
On Thu, Aug 25, 2022 at 1:35 PM Mingwei Zhang <mizhang@google.com> wrote:
>
> > There are two uses of KVM_REQ_GET_NESTED_STATE_PAGES:
> >
> >   1. Defer loads when leaving SMM.
> >
> >   2. Defer loads for KVM_SET_NESTED_STATE.
> >
> > #1 is fully solvable without a request, e.g. split ->leave_smm() into two helpers,
> > one that restores whatever metadata is needed before restoring from SMRAM, and
> > a second that loads guest virtualization state _after_ restoring all other
> > guest state from SMRAM.
> >
> > #2 is done because of the reasons Jim listed above, which are specific to demand
> > paging (including userfaultfd).  There might be some interactions with other
> > ioctls() (KVM_SET_SREGS?) that are papered over by the request, but that can be
> > solved without a full request since only the first KVM_RUN after KVM_SET_NESTED_STATE
> > needs to refresh things (though ideally we'd avoid that).
>
> Ack on the fact that the 2-step process is specific to demand paging.
>
> Currently, KVM_SET_NESTED_STATE is a two-step process in which the 1st
> step does not require vmcs12 to be ready. So, I am thinking about what
> it means to deprecate KVM_REQ_GET_NESTED_STATE_PAGES?
>
> For case #2, I think there might be two options if we deprecate it:
>
>  - Ensuring vmcs12 is ready during the call to
> ioctl(KVM_SET_NESTED_STATE). This requires, as Jim mentioned, that the
> thread listening for remote page requests be ready to serve before this
> call (this is true regardless of uffd-based or Google-based demand
> paging). We can definitely solve this ordering problem, but that is
> beyond KVM's scope. It basically requires our userspace to cooperate.

The vmcs12 isn't the problem, since its contents were loaded into L0
kernel memory at VMPTRLD. The problem is the structures hanging off of
the vmcs12, like the posted interrupt descriptor. The new constraints
need to be documented, and every user space VMM has to follow them
before we can eliminate KVM_REQ_GET_NESTED_STATE_PAGES.
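
Concretely, those structures are the pages named by the guest-physical
pointer fields in the vmcs12, e.g. (an illustrative subset of the
upstream struct vmcs12):

	u64 msr_bitmap;			/* MSR permission bitmap */
	u64 virtual_apic_page_addr;	/* vAPIC page (TPR shadow) */
	u64 posted_intr_desc_addr;	/* posted-interrupt descriptor */
	u64 apic_access_addr;		/* APIC-access page */
	u64 pml_address;		/* PML log */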

>  - Ensuring vmcs12 is ready before vmenter. This basically defers the
> vmcs12 checks to the last second. I think this might be the better
> option. However, isn't it the same as the original implementation,
> i.e., instead of using KVM_REQ_GET_NESTED_STATE_PAGES, we have to use
> some other flag to tell KVM to load the vmcs12?

Again, the vmcs12 is not a problem, since its contents are already
cached in L0 kernel memory. Accesses to the structures hanging off of
the vmcs12 are already handled *during KVM_RUN* by existing demand
paging mechanisms.

> Thanks.
> -Mingwei
> >
> > In other words, if the demand paging use case goes away, then KVM can get rid of
> > KVM_REQ_GET_NESTED_STATE_PAGES.
> >
> > > KVM_SET_NESTED_STATE in VMX, while in SVM implementation, it is simply
> > > just a kvm_make_request(KVM_REQ_GET_NESTED_STATE_PAGES, vcpu);
> >
> > svm_set_nested_state() very roughly open-codes enter_svm_guest_mode().  VMX could
> > do the same, but that may or may not be a net positive.
> >
> > > hmm... so is the nested_vmx_enter_non_root_mode() call in vmx
> > > KVM_SET_NESTED_STATE ioctl() still necessary? I am thinking that
> > > because the same function is called again in nested_vmx_run().
> >
> > nested_vmx_run() is used only to emulate VMLAUNCH/VMRESUME and won't be invoked
> > if the vCPU is already running L2 at the time of migration.
>
> Ack.

Patch

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 5366f884e9a7..1d3d8127aaea 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -10599,6 +10599,23 @@ static inline int vcpu_block(struct kvm_vcpu *vcpu)
 {
 	bool hv_timer;
 
+	/*
+	 * We must first get the vmcs12 pages before checking for interrupts
+	 * that might unblock the guest if L1 is using virtual-interrupt
+	 * delivery.
+	 */
+	if (kvm_check_request(KVM_REQ_GET_NESTED_STATE_PAGES, vcpu)) {
+		/*
+		 * If we have to ask user-space to post-copy a page,
+		 * then we have to keep trying to get all of the
+		 * VMCS12 pages until we succeed.
+		 */
+		if (unlikely(!kvm_x86_ops.nested_ops->get_nested_state_pages(vcpu))) {
+			kvm_make_request(KVM_REQ_GET_NESTED_STATE_PAGES, vcpu);
+			return 0;
+		}
+	}
+
 	if (!kvm_arch_vcpu_runnable(vcpu)) {
 		/*
 		 * Switch to the software timer before halt-polling/blocking as