Message ID | 20220802230718.1891356-2-mizhang@google.com (mailing list archive)
---|---
State | New, archived
Series | Fix a race between posted interrupt delivery and migration in a nested VM
On Tue, 2022-08-02 at 23:07 +0000, Mingwei Zhang wrote:
> From: Oliver Upton <oupton@google.com>
>
> vmx_guest_apic_has_interrupts implicitly depends on the virtual APIC
> page being present + mapped into the kernel address space. However, with
> demand paging we break this dependency, as the KVM_REQ_GET_VMCS12_PAGES
> event isn't assessed before entering vcpu_block.
>
> Fix this by getting vmcs12 pages before inspecting the guest's APIC
> page. Note that upstream does not have this issue, as they will directly
> get the vmcs12 pages on vmlaunch/vmresume instead of relying on the
> event request mechanism. However, the upstream approach is problematic,
> as the vmcs12 pages will not be present if a live migration occurred
> before checking the virtual APIC page.

Since this patch is intended for upstream, I don't fully understand
the meaning of the above paragraph.

>
> Signed-off-by: Oliver Upton <oupton@google.com>
> Signed-off-by: Mingwei Zhang <mizhang@google.com>
> ---
>  arch/x86/kvm/x86.c | 17 +++++++++++++++++
>  1 file changed, 17 insertions(+)
>
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 5366f884e9a7..1d3d8127aaea 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -10599,6 +10599,23 @@ static inline int vcpu_block(struct kvm_vcpu *vcpu)
>  {
>  	bool hv_timer;
>  
> +	/*
> +	 * We must first get the vmcs12 pages before checking for interrupts
> +	 * that might unblock the guest if L1 is using virtual-interrupt
> +	 * delivery.
> +	 */
> +	if (kvm_check_request(KVM_REQ_GET_NESTED_STATE_PAGES, vcpu)) {
> +		/*
> +		 * If we have to ask user-space to post-copy a page,
> +		 * then we have to keep trying to get all of the
> +		 * VMCS12 pages until we succeed.
> +		 */
> +		if (unlikely(!kvm_x86_ops.nested_ops->get_nested_state_pages(vcpu))) {
> +			kvm_make_request(KVM_REQ_GET_NESTED_STATE_PAGES, vcpu);
> +			return 0;
> +		}
> +	}
> +
>  	if (!kvm_arch_vcpu_runnable(vcpu)) {
>  		/*
>  		 * Switch to the software timer before halt-polling/blocking as

If I understand correctly, you are saying that if the apic backing page is
migrated in post-copy, then 'get_nested_state_pages' will return false and
thus fail?

AFAIK both SVM and VMX versions of 'get_nested_state_pages' assume that this
is not the case for many things like MSR bitmaps and such - they always use
the non-atomic versions of guest memory access, like 'kvm_vcpu_read_guest'
and 'kvm_vcpu_map', which are supposed to block if they attempt to access an
HVA which is not present, and then userfaultfd should take over and wake
them up.

If that still fails, nested VM entry usually fails, and/or the whole VM
is crashed with 'KVM_EXIT_INTERNAL_ERROR'.

Anything I missed?

Best regards,
	Maxim Levitsky
On Wed, Aug 03, 2022, Maxim Levitsky wrote:
> On Tue, 2022-08-02 at 23:07 +0000, Mingwei Zhang wrote:
> > From: Oliver Upton <oupton@google.com>
> >
> > vmx_guest_apic_has_interrupts implicitly depends on the virtual APIC
> > page being present + mapped into the kernel address space. However, with
> > demand paging we break this dependency, as the KVM_REQ_GET_VMCS12_PAGES
> > event isn't assessed before entering vcpu_block.
> >
> > Fix this by getting vmcs12 pages before inspecting the guest's APIC
> > page. Note that upstream does not have this issue, as they will directly
> > get the vmcs12 pages on vmlaunch/vmresume instead of relying on the
> > event request mechanism. However, the upstream approach is problematic,
> > as the vmcs12 pages will not be present if a live migration occurred
> > before checking the virtual APIC page.
>
> Since this patch is intended for upstream, I don't fully understand
> the meaning of the above paragraph.

My apologies, some of the statements need to be updated, which I should have
done before sending. But the point here is that a get_nested_state_pages()
call is missing from vcpu_block() when the KVM_REQ_GET_NESTED_STATE_PAGES
request is pending.

> >
> > Signed-off-by: Oliver Upton <oupton@google.com>
> > Signed-off-by: Mingwei Zhang <mizhang@google.com>
> > ---
> >  arch/x86/kvm/x86.c | 17 +++++++++++++++++
> >  1 file changed, 17 insertions(+)
> >
> > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > index 5366f884e9a7..1d3d8127aaea 100644
> > --- a/arch/x86/kvm/x86.c
> > +++ b/arch/x86/kvm/x86.c
> > @@ -10599,6 +10599,23 @@ static inline int vcpu_block(struct kvm_vcpu *vcpu)
> >  {
> >  	bool hv_timer;
> >  
> > +	/*
> > +	 * We must first get the vmcs12 pages before checking for interrupts
> > +	 * that might unblock the guest if L1 is using virtual-interrupt
> > +	 * delivery.
> > +	 */
> > +	if (kvm_check_request(KVM_REQ_GET_NESTED_STATE_PAGES, vcpu)) {
> > +		/*
> > +		 * If we have to ask user-space to post-copy a page,
> > +		 * then we have to keep trying to get all of the
> > +		 * VMCS12 pages until we succeed.
> > +		 */
> > +		if (unlikely(!kvm_x86_ops.nested_ops->get_nested_state_pages(vcpu))) {
> > +			kvm_make_request(KVM_REQ_GET_NESTED_STATE_PAGES, vcpu);
> > +			return 0;
> > +		}
> > +	}
> > +
> >  	if (!kvm_arch_vcpu_runnable(vcpu)) {
> >  		/*
> >  		 * Switch to the software timer before halt-polling/blocking as
>
> If I understand correctly, you are saying that if the apic backing page is
> migrated in post-copy, then 'get_nested_state_pages' will return false and
> thus fail?

What I mean is that when the vCPU was halted and then migrated in this
case, KVM did not call get_nested_state_pages() before getting into
kvm_arch_vcpu_runnable(). This function checks the apic backing page,
fails on that check, and triggers the warning.

> AFAIK both SVM and VMX versions of 'get_nested_state_pages' assume that this
> is not the case for many things like MSR bitmaps and such - they always use
> the non-atomic versions of guest memory access, like 'kvm_vcpu_read_guest'
> and 'kvm_vcpu_map', which are supposed to block if they attempt to access an
> HVA which is not present, and then userfaultfd should take over and wake
> them up.

You are right here.

> If that still fails, nested VM entry usually fails, and/or the whole VM
> is crashed with 'KVM_EXIT_INTERNAL_ERROR'.
>
> Anything I missed?
>
> Best regards,
> 	Maxim Levitsky
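For context, the warning discussed above comes from the virtual-interrupt-delivery
check that kvm_arch_vcpu_runnable() ultimately reaches. Below is a minimal sketch of
that dependency, paraphrased from the upstream vmx_guest_apic_has_interrupt() of this
era; treat exact field and helper names as approximate rather than authoritative.

static bool vmx_guest_apic_has_interrupt(struct kvm_vcpu *vcpu)
{
	struct vcpu_vmx *vmx = to_vmx(vcpu);
	void *vapic_page;
	u32 vppr;
	int rvi;

	/*
	 * If the virtual APIC page was never (re)mapped after migration,
	 * the WARN below fires: this is the failure mode being discussed.
	 */
	if (WARN_ON_ONCE(!is_guest_mode(vcpu)) ||
	    !nested_cpu_has_vid(get_vmcs12(vcpu)) ||
	    WARN_ON_ONCE(!vmx->nested.virtual_apic_map.gfn))
		return false;

	rvi = vmx_get_rvi();

	/* Reads the *mapped* vAPIC page: requires get_nested_state_pages(). */
	vapic_page = vmx->nested.virtual_apic_map.hva;
	vppr = *((u32 *)(vapic_page + APIC_PROCPRI));

	return ((rvi & 0xf0) > (vppr & 0xf0));
}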
On Wed, Aug 03, 2022, Mingwei Zhang wrote:
> On Wed, Aug 03, 2022, Maxim Levitsky wrote:
> > On Tue, 2022-08-02 at 23:07 +0000, Mingwei Zhang wrote:
> > > From: Oliver Upton <oupton@google.com>
> > >
> > > vmx_guest_apic_has_interrupts implicitly depends on the virtual APIC
> > > page being present + mapped into the kernel address space. However, with
> > > demand paging we break this dependency, as the KVM_REQ_GET_VMCS12_PAGES
> > > event isn't assessed before entering vcpu_block.
> > >
> > > Fix this by getting vmcs12 pages before inspecting the guest's APIC
> > > page. Note that upstream does not have this issue, as they will directly
> > > get the vmcs12 pages on vmlaunch/vmresume instead of relying on the
> > > event request mechanism. However, the upstream approach is problematic,
> > > as the vmcs12 pages will not be present if a live migration occurred
> > > before checking the virtual APIC page.
> >
> > Since this patch is intended for upstream, I don't fully understand
> > the meaning of the above paragraph.
>
> My apologies, some of the statements need to be updated, which I should have
> done before sending. But the point here is that a get_nested_state_pages()
> call is missing from vcpu_block() when the KVM_REQ_GET_NESTED_STATE_PAGES
> request is pending.
>
> > >
> > > Signed-off-by: Oliver Upton <oupton@google.com>
> > > Signed-off-by: Mingwei Zhang <mizhang@google.com>
> > > ---
> > >  arch/x86/kvm/x86.c | 17 +++++++++++++++++
> > >  1 file changed, 17 insertions(+)
> > >
> > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > > index 5366f884e9a7..1d3d8127aaea 100644
> > > --- a/arch/x86/kvm/x86.c
> > > +++ b/arch/x86/kvm/x86.c
> > > @@ -10599,6 +10599,23 @@ static inline int vcpu_block(struct kvm_vcpu *vcpu)
> > >  {
> > >  	bool hv_timer;
> > >  
> > > +	/*
> > > +	 * We must first get the vmcs12 pages before checking for interrupts
> > > +	 * that might unblock the guest if L1 is using virtual-interrupt
> > > +	 * delivery.
> > > +	 */
> > > +	if (kvm_check_request(KVM_REQ_GET_NESTED_STATE_PAGES, vcpu)) {
> > > +		/*
> > > +		 * If we have to ask user-space to post-copy a page,
> > > +		 * then we have to keep trying to get all of the
> > > +		 * VMCS12 pages until we succeed.
> > > +		 */
> > > +		if (unlikely(!kvm_x86_ops.nested_ops->get_nested_state_pages(vcpu))) {
> > > +			kvm_make_request(KVM_REQ_GET_NESTED_STATE_PAGES, vcpu);
> > > +			return 0;
> > > +		}
> > > +	}
> > > +
> > >  	if (!kvm_arch_vcpu_runnable(vcpu)) {
> > >  		/*
> > >  		 * Switch to the software timer before halt-polling/blocking as
> >
> > If I understand correctly, you are saying that if the apic backing page is
> > migrated in post-copy, then 'get_nested_state_pages' will return false and
> > thus fail?
>
> What I mean is that when the vCPU was halted and then migrated in this
> case, KVM did not call get_nested_state_pages() before getting into
> kvm_arch_vcpu_runnable(). This function checks the apic backing page,
> fails on that check, and triggers the warning.
>
> > AFAIK both SVM and VMX versions of 'get_nested_state_pages' assume that this
> > is not the case for many things like MSR bitmaps and such - they always use
> > the non-atomic versions of guest memory access, like 'kvm_vcpu_read_guest'
> > and 'kvm_vcpu_map', which are supposed to block if they attempt to access an
> > HVA which is not present, and then userfaultfd should take over and wake
> > them up.
>
> You are right here.
>
> > If that still fails, nested VM entry usually fails, and/or the whole VM
> > is crashed with 'KVM_EXIT_INTERNAL_ERROR'.

Ah, I think I understand what you are saying. Hmm, so basically the
patch here is to continuously request vmcs12 pages if that fails. But what
you are saying is that we just need to call 'get_nested_state_pages'
once; if it fails, then the VM fails to work. Let me double check and
get back.

> > Anything I missed?
> >
> > Best regards,
> > 	Maxim Levitsky
On 8/3/22 01:07, Mingwei Zhang wrote:
> +	/*
> +	 * We must first get the vmcs12 pages before checking for interrupts
> +	 * that might unblock the guest if L1 is using virtual-interrupt
> +	 * delivery.
> +	 */
> +	if (kvm_check_request(KVM_REQ_GET_NESTED_STATE_PAGES, vcpu)) {
> +		/*
> +		 * If we have to ask user-space to post-copy a page,
> +		 * then we have to keep trying to get all of the
> +		 * VMCS12 pages until we succeed.
> +		 */
> +		if (unlikely(!kvm_x86_ops.nested_ops->get_nested_state_pages(vcpu))) {
> +			kvm_make_request(KVM_REQ_GET_NESTED_STATE_PAGES, vcpu);
> +			return 0;
> +		}
> +	}
> +

I think request handling (except for KVM_REQ_EVENT) could be more
generically moved from vcpu_enter_guest() to vcpu_run().

Paolo
On Wed, Aug 3, 2022 at 10:18 AM Paolo Bonzini <pbonzini@redhat.com> wrote:
>
> On 8/3/22 01:07, Mingwei Zhang wrote:
> > +	/*
> > +	 * We must first get the vmcs12 pages before checking for interrupts
> > +	 * that might unblock the guest if L1 is using virtual-interrupt
> > +	 * delivery.
> > +	 */
> > +	if (kvm_check_request(KVM_REQ_GET_NESTED_STATE_PAGES, vcpu)) {
> > +		/*
> > +		 * If we have to ask user-space to post-copy a page,
> > +		 * then we have to keep trying to get all of the
> > +		 * VMCS12 pages until we succeed.
> > +		 */
> > +		if (unlikely(!kvm_x86_ops.nested_ops->get_nested_state_pages(vcpu))) {
> > +			kvm_make_request(KVM_REQ_GET_NESTED_STATE_PAGES, vcpu);
> > +			return 0;
> > +		}
> > +	}
> > +
>
> I think request handling (except for KVM_REQ_EVENT) could be more
> generically moved from vcpu_enter_guest() to vcpu_run().

Yeah, sounds good to me. I can come up with an updated version. At the
least, I will remove the repeated request here.

> Paolo
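To make the suggestion concrete, here is a rough sketch (an illustration of the
idea, not the actual follow-up patch) of hoisting the request into vcpu_run(),
where kvm_vcpu_running() is the existing x86.c helper that picks between
vcpu_enter_guest() and vcpu_block(); all other requests and exit handling are
elided.

static int vcpu_run(struct kvm_vcpu *vcpu)
{
	int r;

	for (;;) {
		/*
		 * Hoisted from vcpu_enter_guest(): map nested state pages
		 * before either path can consult them, e.g. before
		 * kvm_arch_vcpu_runnable() reads the virtual APIC page.
		 */
		if (kvm_check_request(KVM_REQ_GET_NESTED_STATE_PAGES, vcpu) &&
		    unlikely(!kvm_x86_ops.nested_ops->get_nested_state_pages(vcpu)))
			return 0;	/* exit to userspace, as vcpu_enter_guest() does today */

		if (kvm_vcpu_running(vcpu))
			r = vcpu_enter_guest(vcpu);
		else
			r = vcpu_block(vcpu);

		if (r <= 0)
			break;

		/* ... signal and pending-event handling elided ... */
	}

	return r;
}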
On Wed, Aug 03, 2022 at 05:00:26PM +0000, Mingwei Zhang wrote:
> On Wed, Aug 03, 2022, Mingwei Zhang wrote:
> > On Wed, Aug 03, 2022, Maxim Levitsky wrote:
> > > On Tue, 2022-08-02 at 23:07 +0000, Mingwei Zhang wrote:
> > > > From: Oliver Upton <oupton@google.com>
> > > >
> > > > vmx_guest_apic_has_interrupts implicitly depends on the virtual APIC
> > > > page being present + mapped into the kernel address space. However, with
> > > > demand paging we break this dependency, as the KVM_REQ_GET_VMCS12_PAGES
> > > > event isn't assessed before entering vcpu_block.
> > > >
> > > > Fix this by getting vmcs12 pages before inspecting the guest's APIC
> > > > page. Note that upstream does not have this issue, as they will directly
> > > > get the vmcs12 pages on vmlaunch/vmresume instead of relying on the
> > > > event request mechanism. However, the upstream approach is problematic,
> > > > as the vmcs12 pages will not be present if a live migration occurred
> > > > before checking the virtual APIC page.
> > >
> > > Since this patch is intended for upstream, I don't fully understand
> > > the meaning of the above paragraph.
> >
> > My apologies, some of the statements need to be updated, which I should have
> > done before sending. But the point here is that a get_nested_state_pages()
> > call is missing from vcpu_block() when the KVM_REQ_GET_NESTED_STATE_PAGES
> > request is pending.

This was my poorly written changelog, sorry about that Mingwei :)

> > > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > > > index 5366f884e9a7..1d3d8127aaea 100644
> > > > --- a/arch/x86/kvm/x86.c
> > > > +++ b/arch/x86/kvm/x86.c
> > > > @@ -10599,6 +10599,23 @@ static inline int vcpu_block(struct kvm_vcpu *vcpu)
> > > >  {
> > > >  	bool hv_timer;
> > > >  
> > > > +	/*
> > > > +	 * We must first get the vmcs12 pages before checking for interrupts
> > > > +	 * that might unblock the guest if L1 is using virtual-interrupt
> > > > +	 * delivery.
> > > > +	 */
> > > > +	if (kvm_check_request(KVM_REQ_GET_NESTED_STATE_PAGES, vcpu)) {
> > > > +		/*
> > > > +		 * If we have to ask user-space to post-copy a page,
> > > > +		 * then we have to keep trying to get all of the
> > > > +		 * VMCS12 pages until we succeed.
> > > > +		 */
> > > > +		if (unlikely(!kvm_x86_ops.nested_ops->get_nested_state_pages(vcpu))) {
> > > > +			kvm_make_request(KVM_REQ_GET_NESTED_STATE_PAGES, vcpu);
> > > > +			return 0;
> > > > +		}
> > > > +	}
> > > > +
> > > >  	if (!kvm_arch_vcpu_runnable(vcpu)) {
> > > >  		/*
> > > >  		 * Switch to the software timer before halt-polling/blocking as
> > >
> > > If I understand correctly, you are saying that if the apic backing page is
> > > migrated in post-copy, then 'get_nested_state_pages' will return false and
> > > thus fail?
> >
> > What I mean is that when the vCPU was halted and then migrated in this
> > case, KVM did not call get_nested_state_pages() before getting into
> > kvm_arch_vcpu_runnable(). This function checks the apic backing page,
> > fails on that check, and triggers the warning.
> >
> > > AFAIK both SVM and VMX versions of 'get_nested_state_pages' assume that this
> > > is not the case for many things like MSR bitmaps and such - they always use
> > > the non-atomic versions of guest memory access, like 'kvm_vcpu_read_guest'
> > > and 'kvm_vcpu_map', which are supposed to block if they attempt to access an
> > > HVA which is not present, and then userfaultfd should take over and wake
> > > them up.
> >
> > You are right here.
> >
> > > If that still fails, nested VM entry usually fails, and/or the whole VM
> > > is crashed with 'KVM_EXIT_INTERNAL_ERROR'.
>
> Ah, I think I understand what you are saying. Hmm, so basically the
> patch here is to continuously request vmcs12 pages if that fails. But what
> you are saying is that we just need to call 'get_nested_state_pages'
> once; if it fails, then the VM fails to work. Let me double check and
> get back.

IIRC the reason we reset the request on a nonzero return was because our
local implementation of the VMX hook was non-blocking and would bail on
the first page that needed to be demanded from the source. So, you
effectively keep hitting the request until all the pages are pulled in.

--
Thanks,
Oliver
On Wed, 2022-08-03 at 10:51 -0700, Mingwei Zhang wrote:
> On Wed, Aug 3, 2022 at 10:18 AM Paolo Bonzini <pbonzini@redhat.com> wrote:
> > On 8/3/22 01:07, Mingwei Zhang wrote:
> > > +	/*
> > > +	 * We must first get the vmcs12 pages before checking for interrupts
> > > +	 * that might unblock the guest if L1 is using virtual-interrupt
> > > +	 * delivery.
> > > +	 */
> > > +	if (kvm_check_request(KVM_REQ_GET_NESTED_STATE_PAGES, vcpu)) {
> > > +		/*
> > > +		 * If we have to ask user-space to post-copy a page,
> > > +		 * then we have to keep trying to get all of the
> > > +		 * VMCS12 pages until we succeed.
> > > +		 */
> > > +		if (unlikely(!kvm_x86_ops.nested_ops->get_nested_state_pages(vcpu))) {
> > > +			kvm_make_request(KVM_REQ_GET_NESTED_STATE_PAGES, vcpu);
> > > +			return 0;
> > > +		}
> > > +	}
> > > +
> >
> > I think request handling (except for KVM_REQ_EVENT) could be more
> > generically moved from vcpu_enter_guest() to vcpu_run().
>
> Yeah, sounds good to me. I can come up with an updated version. At the
> least, I will remove the repeated request here.

Now it all makes sense. I do think that KVM_REQ_GET_NESTED_STATE_PAGES
processing when the vCPU is halted is indeed missing.

This reminds me that I would be *very* happy to remove
KVM_REQ_GET_NESTED_STATE_PAGES, if by any chance there is an agreement
to do so upstream. This is yet another reason to do so, to be honest.
Just my 0.2 cents of course.

Best regards,
	Maxim Levitsky

PS: This also reminded me of a related issue:
https://lore.kernel.org/lkml/20220427173758.517087-1-pbonzini@redhat.com/T/

Any update on it?

> > Paolo
On Wed, Aug 03, 2022, Maxim Levitsky wrote:
> On Wed, 2022-08-03 at 10:51 -0700, Mingwei Zhang wrote:
> > On Wed, Aug 3, 2022 at 10:18 AM Paolo Bonzini <pbonzini@redhat.com> wrote:
> > > On 8/3/22 01:07, Mingwei Zhang wrote:
> > > > +	/*
> > > > +	 * We must first get the vmcs12 pages before checking for interrupts
> > > > +	 * that might unblock the guest if L1 is using virtual-interrupt
> > > > +	 * delivery.
> > > > +	 */
> > > > +	if (kvm_check_request(KVM_REQ_GET_NESTED_STATE_PAGES, vcpu)) {
> > > > +		/*
> > > > +		 * If we have to ask user-space to post-copy a page,
> > > > +		 * then we have to keep trying to get all of the
> > > > +		 * VMCS12 pages until we succeed.
> > > > +		 */
> > > > +		if (unlikely(!kvm_x86_ops.nested_ops->get_nested_state_pages(vcpu))) {
> > > > +			kvm_make_request(KVM_REQ_GET_NESTED_STATE_PAGES, vcpu);
> > > > +			return 0;
> > > > +		}
> > > > +	}
> > > > +
> > >
> > > I think request handling (except for KVM_REQ_EVENT) could be more
> > > generically moved from vcpu_enter_guest() to vcpu_run().
> >
> > Yeah, sounds good to me. I can come up with an updated version. At the
> > least, I will remove the repeated request here.
>
> Now it all makes sense. I do think that KVM_REQ_GET_NESTED_STATE_PAGES
> processing when the vCPU is halted is indeed missing.
>
> This reminds me that I would be *very* happy to remove
> KVM_REQ_GET_NESTED_STATE_PAGES, if by any chance there is an agreement
> to do so upstream. This is yet another reason to do so, to be honest.
> Just my 0.2 cents of course.

+100

@google folks, what would it take for us to mark KVM_REQ_GET_NESTED_STATE_PAGES
as deprecated in upstream and stop accepting patches/fixes? IIUC, when we eventually
move to userfaultfd, all this goes away, i.e. we do want to ditch this at some point.
On Wed, Aug 24, 2022 at 5:11 PM Sean Christopherson <seanjc@google.com> wrote:

> @google folks, what would it take for us to mark KVM_REQ_GET_NESTED_STATE_PAGES
> as deprecated in upstream and stop accepting patches/fixes? IIUC, when we eventually
> move to userfaultfd, all this goes away, i.e. we do want to ditch this at some point.

Userfaultfd is a red herring. There were two reasons that we needed
this when nested live migration was implemented:
1) our netlink socket mechanism for funneling remote page requests to
a userspace listener was broken.
2) we were not necessarily prepared to deal with remote page requests
during VM setup.

(1) has long since been fixed. Though our preference is to exit from
KVM_RUN and get the vCPU thread to request the remote page itself, we
are now capable of queuing a remote page request with a separate
listener thread and blocking in the kernel until the page is received.
I believe that mechanism is functionally equivalent to userfaultfd,
though not as elegant.

I don't know about (2). I'm not sure when the listener thread is set
up, relative to all of the other setup steps. Eliminating
KVM_REQ_GET_NESTED_STATE_PAGES means that userspace must be prepared
to fetch a remote page by the first call to KVM_SET_NESTED_STATE. The
same is true when using userfaultfd.

These new ordering constraints represent a UAPI breakage, but we don't
seem to be as concerned about that as we once were. Maybe that's a
good thing. Can we get rid of all of the superseded ioctls, like
KVM_SET_CPUID, while we're at it?
On Wed, Aug 24, 2022, Jim Mattson wrote:
> On Wed, Aug 24, 2022 at 5:11 PM Sean Christopherson <seanjc@google.com> wrote:
>
> > @google folks, what would it take for us to mark KVM_REQ_GET_NESTED_STATE_PAGES
> > as deprecated in upstream and stop accepting patches/fixes? IIUC, when we eventually
> > move to userfaultfd, all this goes away, i.e. we do want to ditch this at some point.
>
> Userfaultfd is a red herring. There were two reasons that we needed
> this when nested live migration was implemented:
> 1) our netlink socket mechanism for funneling remote page requests to
> a userspace listener was broken.
> 2) we were not necessarily prepared to deal with remote page requests
> during VM setup.
>
> (1) has long since been fixed. Though our preference is to exit from
> KVM_RUN and get the vCPU thread to request the remote page itself, we
> are now capable of queuing a remote page request with a separate
> listener thread and blocking in the kernel until the page is received.
> I believe that mechanism is functionally equivalent to userfaultfd,
> though not as elegant.
>
> I don't know about (2). I'm not sure when the listener thread is set
> up, relative to all of the other setup steps. Eliminating
> KVM_REQ_GET_NESTED_STATE_PAGES means that userspace must be prepared
> to fetch a remote page by the first call to KVM_SET_NESTED_STATE. The
> same is true when using userfaultfd.
>
> These new ordering constraints represent a UAPI breakage, but we don't
> seem to be as concerned about that as we once were. Maybe that's a
> good thing. Can we get rid of all of the superseded ioctls, like
> KVM_SET_CPUID, while we're at it?

I view KVM_REQ_GET_NESTED_STATE_PAGES as a special case. We are likely the only
users, we can (eventually) wean ourselves off the feature, and we can carry
internal patches (which we are obviously already carrying) until we transition
away. And unlike KVM_SET_CPUID and other ancient ioctls() that are largely
forgotten, this feature is likely to be a maintenance burden as long as it exists.
On Thu, Aug 25, 2022 at 7:41 AM Sean Christopherson <seanjc@google.com> wrote:
>
> On Wed, Aug 24, 2022, Jim Mattson wrote:
> > On Wed, Aug 24, 2022 at 5:11 PM Sean Christopherson <seanjc@google.com> wrote:
> >
> > > @google folks, what would it take for us to mark KVM_REQ_GET_NESTED_STATE_PAGES
> > > as deprecated in upstream and stop accepting patches/fixes? IIUC, when we eventually
> > > move to userfaultfd, all this goes away, i.e. we do want to ditch this at some point.
> >
> > Userfaultfd is a red herring. There were two reasons that we needed
> > this when nested live migration was implemented:
> > 1) our netlink socket mechanism for funneling remote page requests to
> > a userspace listener was broken.
> > 2) we were not necessarily prepared to deal with remote page requests
> > during VM setup.
> >
> > (1) has long since been fixed. Though our preference is to exit from
> > KVM_RUN and get the vCPU thread to request the remote page itself, we
> > are now capable of queuing a remote page request with a separate
> > listener thread and blocking in the kernel until the page is received.
> > I believe that mechanism is functionally equivalent to userfaultfd,
> > though not as elegant.
> >
> > I don't know about (2). I'm not sure when the listener thread is set
> > up, relative to all of the other setup steps. Eliminating
> > KVM_REQ_GET_NESTED_STATE_PAGES means that userspace must be prepared
> > to fetch a remote page by the first call to KVM_SET_NESTED_STATE. The
> > same is true when using userfaultfd.
> >
> > These new ordering constraints represent a UAPI breakage, but we don't
> > seem to be as concerned about that as we once were. Maybe that's a
> > good thing. Can we get rid of all of the superseded ioctls, like
> > KVM_SET_CPUID, while we're at it?
>
> I view KVM_REQ_GET_NESTED_STATE_PAGES as a special case. We are likely the only
> users, we can (eventually) wean ourselves off the feature, and we can carry
> internal patches (which we are obviously already carrying) until we transition
> away. And unlike KVM_SET_CPUID and other ancient ioctls() that are largely
> forgotten, this feature is likely to be a maintenance burden as long as it exists.

KVM_REQ_GET_NESTED_STATE_PAGES has been uniformly used in the
KVM_SET_NESTED_STATE ioctl in VMX (including eVMCS) and SVM; it is
basically a two-step mechanism for setting up nested state. We can
change that, but this may have side effects, and I think this usage
case has nothing to do with demand paging.

I noticed that nested_vmx_enter_non_root_mode() is called in
KVM_SET_NESTED_STATE in VMX, while in the SVM implementation, it is
simply a kvm_make_request(KVM_REQ_GET_NESTED_STATE_PAGES, vcpu);

hmm... so is the nested_vmx_enter_non_root_mode() call in the vmx
KVM_SET_NESTED_STATE ioctl() still necessary? I am thinking that
because the same function is called again in nested_vmx_run().

Thanks.
-Mingwei
On Thu, Aug 25, 2022, Mingwei Zhang wrote:
> On Thu, Aug 25, 2022 at 7:41 AM Sean Christopherson <seanjc@google.com> wrote:
> >
> > On Wed, Aug 24, 2022, Jim Mattson wrote:
> > > On Wed, Aug 24, 2022 at 5:11 PM Sean Christopherson <seanjc@google.com> wrote:
> > >
> > > > @google folks, what would it take for us to mark KVM_REQ_GET_NESTED_STATE_PAGES
> > > > as deprecated in upstream and stop accepting patches/fixes? IIUC, when we eventually
> > > > move to userfaultfd, all this goes away, i.e. we do want to ditch this at some point.
> > >
> > > Userfaultfd is a red herring. There were two reasons that we needed
> > > this when nested live migration was implemented:
> > > 1) our netlink socket mechanism for funneling remote page requests to
> > > a userspace listener was broken.
> > > 2) we were not necessarily prepared to deal with remote page requests
> > > during VM setup.
> > >
> > > (1) has long since been fixed. Though our preference is to exit from
> > > KVM_RUN and get the vCPU thread to request the remote page itself, we
> > > are now capable of queuing a remote page request with a separate
> > > listener thread and blocking in the kernel until the page is received.
> > > I believe that mechanism is functionally equivalent to userfaultfd,
> > > though not as elegant.
> > >
> > > I don't know about (2). I'm not sure when the listener thread is set
> > > up, relative to all of the other setup steps. Eliminating
> > > KVM_REQ_GET_NESTED_STATE_PAGES means that userspace must be prepared
> > > to fetch a remote page by the first call to KVM_SET_NESTED_STATE. The
> > > same is true when using userfaultfd.
> > >
> > > These new ordering constraints represent a UAPI breakage, but we don't
> > > seem to be as concerned about that as we once were. Maybe that's a
> > > good thing. Can we get rid of all of the superseded ioctls, like
> > > KVM_SET_CPUID, while we're at it?
> >
> > I view KVM_REQ_GET_NESTED_STATE_PAGES as a special case. We are likely the only
> > users, we can (eventually) wean ourselves off the feature, and we can carry
> > internal patches (which we are obviously already carrying) until we transition
> > away. And unlike KVM_SET_CPUID and other ancient ioctls() that are largely
> > forgotten, this feature is likely to be a maintenance burden as long as it exists.
>
> KVM_REQ_GET_NESTED_STATE_PAGES has been uniformly used in the
> KVM_SET_NESTED_STATE ioctl in VMX (including eVMCS) and SVM; it is
> basically a two-step mechanism for setting up nested state.
>
> We can change that, but this may have side effects, and I think this
> usage case has nothing to do with demand paging.

There are two uses of KVM_REQ_GET_NESTED_STATE_PAGES:

1. Defer loads when leaving SMM.

2. Defer loads for KVM_SET_NESTED_STATE.

#1 is fully solvable without a request, e.g. split ->leave_smm() into two helpers,
one that restores whatever metadata is needed before restoring from SMRAM, and
a second to load guest virtualization state _after_ restoring all other guest
state from SMRAM.

#2 is done because of the reasons Jim listed above, which are specific to demand
paging (including userfaultfd). There might be some interactions with other
ioctls() (KVM_SET_SREGS?) that are papered over by the request, but that can be
solved without a full request since only the first KVM_RUN after KVM_SET_NESTED_STATE
needs to refresh things (though ideally we'd avoid that).

In other words, if the demand paging use case goes away, then KVM can get rid of
KVM_REQ_GET_NESTED_STATE_PAGES.

> I noticed that nested_vmx_enter_non_root_mode() is called in
> KVM_SET_NESTED_STATE in VMX, while in the SVM implementation, it is
> simply a kvm_make_request(KVM_REQ_GET_NESTED_STATE_PAGES, vcpu);

svm_set_nested_state() very roughly open codes enter_svm_guest_mode(). VMX could
do the same, but that may or may not be a net positive.

> hmm... so is the nested_vmx_enter_non_root_mode() call in the vmx
> KVM_SET_NESTED_STATE ioctl() still necessary? I am thinking that
> because the same function is called again in nested_vmx_run().

nested_vmx_run() is used only to emulate VMLAUNCH/VMRESUME and won't be invoked
if the vCPU is already running L2 at the time of migration.
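As an illustration of resolving use #1 without the request, the split described
above might look roughly like the following. Both hook names are invented for
this sketch; the single hook that actually existed at the time was
int (*leave_smm)(struct kvm_vcpu *vcpu, const char *smstate).

/* Hypothetical two-phase replacement for ->leave_smm(); names invented. */
struct kvm_x86_smm_return_ops {
	/*
	 * Phase 1: restore whatever vendor metadata is needed before common
	 * code restores register state from SMRAM.
	 */
	int (*leave_smm_prepare)(struct kvm_vcpu *vcpu, const char *smstate);

	/*
	 * Phase 2: re-enter guest (non-root) mode; runs only after *all*
	 * other guest state has been restored from SMRAM, which removes the
	 * need to defer the work via KVM_REQ_GET_NESTED_STATE_PAGES.
	 */
	int (*leave_smm_commit)(struct kvm_vcpu *vcpu);
};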
> There are two uses of KVM_REQ_GET_NESTED_STATE_PAGES:
>
> 1. Defer loads when leaving SMM.
>
> 2. Defer loads for KVM_SET_NESTED_STATE.
>
> #1 is fully solvable without a request, e.g. split ->leave_smm() into two helpers,
> one that restores whatever metadata is needed before restoring from SMRAM, and
> a second to load guest virtualization state _after_ restoring all other guest
> state from SMRAM.
>
> #2 is done because of the reasons Jim listed above, which are specific to demand
> paging (including userfaultfd). There might be some interactions with other
> ioctls() (KVM_SET_SREGS?) that are papered over by the request, but that can be
> solved without a full request since only the first KVM_RUN after KVM_SET_NESTED_STATE
> needs to refresh things (though ideally we'd avoid that).

Ack on the fact that the 2-step process is specific to demand paging.

Currently, KVM_SET_NESTED_STATE is a two-step process in which the 1st
step does not require vmcs12 to be ready. So, I am thinking about what
it means to deprecate KVM_REQ_GET_NESTED_STATE_PAGES?

For case #2, I think there might be two options if we deprecate it:

- Ensuring vmcs12 is ready during the call to
ioctl(KVM_SET_NESTED_STATE). This requires, as Jim mentioned, that the
thread which listens for remote page requests is ready to serve
before this call (this is true regardless of uffd-based or Google-based
demand paging). We can definitely solve this ordering problem, but
that is beyond KVM's scope; it basically requires our userspace to
cooperate.

- Ensuring vmcs12 is ready before vmenter. This basically defers the
vmcs12 checks to the last second. I think this might be the better
option. However, isn't it the same as the original implementation, i.e.,
instead of using KVM_REQ_GET_NESTED_STATE_PAGES, we have to use some
other flag to tell KVM to load a vmcs12?

Thanks.
-Mingwei

> In other words, if the demand paging use case goes away, then KVM can get rid of
> KVM_REQ_GET_NESTED_STATE_PAGES.
>
> > I noticed that nested_vmx_enter_non_root_mode() is called in
> > KVM_SET_NESTED_STATE in VMX, while in the SVM implementation, it is
> > simply a kvm_make_request(KVM_REQ_GET_NESTED_STATE_PAGES, vcpu);
>
> svm_set_nested_state() very roughly open codes enter_svm_guest_mode(). VMX could
> do the same, but that may or may not be a net positive.
>
> > hmm... so is the nested_vmx_enter_non_root_mode() call in the vmx
> > KVM_SET_NESTED_STATE ioctl() still necessary? I am thinking that
> > because the same function is called again in nested_vmx_run().
>
> nested_vmx_run() is used only to emulate VMLAUNCH/VMRESUME and won't be invoked
> if the vCPU is already running L2 at the time of migration.

Ack.
> Currently, KVM_SET_NESTED_STATE is a two-step process in which the 1st
> step does not require vmcs12 to be ready. So, I am thinking about what
> it means to deprecate KVM_REQ_GET_NESTED_STATE_PAGES?

typo: "KVM_SET_NESTED_STATE is a two-step process" -> "Launching a nested
VM is a two-step process".
On Thu, Aug 25, 2022 at 1:35 PM Mingwei Zhang <mizhang@google.com> wrote:
>
> > There are two uses of KVM_REQ_GET_NESTED_STATE_PAGES:
> >
> > 1. Defer loads when leaving SMM.
> >
> > 2. Defer loads for KVM_SET_NESTED_STATE.
> >
> > #1 is fully solvable without a request, e.g. split ->leave_smm() into two helpers,
> > one that restores whatever metadata is needed before restoring from SMRAM, and
> > a second to load guest virtualization state _after_ restoring all other guest
> > state from SMRAM.
> >
> > #2 is done because of the reasons Jim listed above, which are specific to demand
> > paging (including userfaultfd). There might be some interactions with other
> > ioctls() (KVM_SET_SREGS?) that are papered over by the request, but that can be
> > solved without a full request since only the first KVM_RUN after KVM_SET_NESTED_STATE
> > needs to refresh things (though ideally we'd avoid that).
>
> Ack on the fact that the 2-step process is specific to demand paging.
>
> Currently, KVM_SET_NESTED_STATE is a two-step process in which the 1st
> step does not require vmcs12 to be ready. So, I am thinking about what
> it means to deprecate KVM_REQ_GET_NESTED_STATE_PAGES?
>
> For case #2, I think there might be two options if we deprecate it:
>
> - Ensuring vmcs12 is ready during the call to
> ioctl(KVM_SET_NESTED_STATE). This requires, as Jim mentioned, that the
> thread which listens for remote page requests is ready to serve
> before this call (this is true regardless of uffd-based or Google-based
> demand paging). We can definitely solve this ordering problem, but
> that is beyond KVM's scope; it basically requires our userspace to
> cooperate.

The vmcs12 isn't the problem, since its contents were loaded into L0
kernel memory at VMPTRLD. The problem is the structures hanging off of
the vmcs12, like the posted interrupt descriptor.

The new constraints need to be documented, and every user space VMM
has to follow them before we can eliminate KVM_REQ_GET_NESTED_STATE_PAGES.

> - Ensuring vmcs12 is ready before vmenter. This basically defers the
> vmcs12 checks to the last second. I think this might be the better
> option. However, isn't it the same as the original implementation, i.e.,
> instead of using KVM_REQ_GET_NESTED_STATE_PAGES, we have to use some
> other flag to tell KVM to load a vmcs12?

Again, the vmcs12 is not a problem, since its contents are already
cached in L0 kernel memory. Accesses to the structures hanging off of
the vmcs12 are already handled *during KVM_RUN* by existing demand
paging mechanisms.

> Thanks.
> -Mingwei
>
> > In other words, if the demand paging use case goes away, then KVM can get rid of
> > KVM_REQ_GET_NESTED_STATE_PAGES.
> >
> > > I noticed that nested_vmx_enter_non_root_mode() is called in
> > > KVM_SET_NESTED_STATE in VMX, while in the SVM implementation, it is
> > > simply a kvm_make_request(KVM_REQ_GET_NESTED_STATE_PAGES, vcpu);
> >
> > svm_set_nested_state() very roughly open codes enter_svm_guest_mode(). VMX could
> > do the same, but that may or may not be a net positive.
> >
> > > hmm... so is the nested_vmx_enter_non_root_mode() call in the vmx
> > > KVM_SET_NESTED_STATE ioctl() still necessary? I am thinking that
> > > because the same function is called again in nested_vmx_run().
> >
> > nested_vmx_run() is used only to emulate VMLAUNCH/VMRESUME and won't be invoked
> > if the vCPU is already running L2 at the time of migration.
>
> Ack.
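For readers less familiar with the code, here is a condensed sketch of the
"structures hanging off of the vmcs12", loosely based on nested_get_vmcs12_pages()
in arch/x86/kvm/vmx/nested.c. It is heavily simplified: only the posted-interrupt
descriptor is shown, and a mapping failure is reduced to a bare "pages not ready"
result rather than the real code's fallback behavior.

static bool nested_get_pi_desc(struct kvm_vcpu *vcpu)
{
	struct vcpu_vmx *vmx = to_vmx(vcpu);
	struct vmcs12 *vmcs12 = get_vmcs12(vcpu);	/* cached in L0 memory, no fault */
	struct kvm_host_map *map = &vmx->nested.pi_desc_map;

	if (!nested_cpu_has_posted_intr(vmcs12))
		return true;

	/* May fault on guest memory, i.e. may block on demand paging/userfaultfd. */
	if (kvm_vcpu_map(vcpu, gpa_to_gfn(vmcs12->posted_intr_desc_addr), map))
		return false;	/* pages not ready */

	vmx->nested.pi_desc = (struct pi_desc *)((void *)map->hva +
				offset_in_page(vmcs12->posted_intr_desc_addr));
	return true;
}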
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 5366f884e9a7..1d3d8127aaea 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -10599,6 +10599,23 @@ static inline int vcpu_block(struct kvm_vcpu *vcpu)
 {
 	bool hv_timer;
 
+	/*
+	 * We must first get the vmcs12 pages before checking for interrupts
+	 * that might unblock the guest if L1 is using virtual-interrupt
+	 * delivery.
+	 */
+	if (kvm_check_request(KVM_REQ_GET_NESTED_STATE_PAGES, vcpu)) {
+		/*
+		 * If we have to ask user-space to post-copy a page,
+		 * then we have to keep trying to get all of the
+		 * VMCS12 pages until we succeed.
+		 */
+		if (unlikely(!kvm_x86_ops.nested_ops->get_nested_state_pages(vcpu))) {
+			kvm_make_request(KVM_REQ_GET_NESTED_STATE_PAGES, vcpu);
+			return 0;
+		}
+	}
+
 	if (!kvm_arch_vcpu_runnable(vcpu)) {
 		/*
 		 * Switch to the software timer before halt-polling/blocking as