Message ID: 20240609154945.55332-6-nsaenz@amazon.com (mailing list archive)
State: New, archived
Series: Introducing Core Building Blocks for Hyper-V VSM Emulation
On Sun, Jun 09, 2024, Nicolas Saenz Julienne wrote:
> Model inactive VTL vCPUs' behaviour with a new MP state.
>
> Inactive VTLs are in an artificial halt state. They enter this state in
> response to invoking HvCallVtlCall or HvCallVtlReturn. User-space, which
> is VTL aware, can process the hypercall and set the vCPU to
> MP_STATE_HV_INACTIVE_VTL. When a vCPU is run in this state it'll block
> until a wakeup event is received. The rules for what constitutes an
> event are analogous to halt's, except that VTLs ignore RFLAGS.IF.
>
> When a wakeup event is registered, KVM will exit to user-space with a
> KVM_SYSTEM_EVENT exit and the KVM_SYSTEM_EVENT_WAKEUP event type.
> User-space is responsible for deciding whether the event takes
> precedence over the active VTL, and will switch the vCPU to
> KVM_MP_STATE_RUNNABLE before resuming execution on it.
>
> Running a KVM_MP_STATE_HV_INACTIVE_VTL vCPU with pending events will
> return immediately to user-space.
>
> Note that by re-using the readily available halt infrastructure in
> KVM_RUN, MP_STATE_HV_INACTIVE_VTL correctly handles (or disables)
> virtualisation features like the VMX preemption timer or APICv before
> blocking.

IIUC, this is a convoluted and roundabout way to let userspace check if a
vCPU has a wake event, correct?  Even by the end of the series, KVM never
sets MP_STATE_HV_INACTIVE_VTL, i.e. the only use for this is to combine it
as:

  KVM_SET_MP_STATE => KVM_RUN => KVM_SET_MP_STATE => KVM_RUN

The upside to this approach is that it requires minimal uAPI and very few
KVM changes, but that's about it AFAICT.  On the other hand, making this so
painfully specific feels like a missed opportunity, and unnecessarily
bleeds VTL details into KVM.

Bringing halt-polling into the picture (by going down kvm_vcpu_halt()) is
also rather bizarre, since quite a bit of time has already elapsed since
the vCPU first did HvCallVtlCall/HvCallVtlReturn.  But that doesn't really
have anything to do with MP_STATE_HV_INACTIVE_VTL, e.g. it'd be just as
easy to go to kvm_vcpu_block().

Why not add an ioctl() to very explicitly block until a wake event is
ready?  Or, probably better, a generic "wait" ioctl() that takes the wait
type as an argument.

Kinda like your idea of supporting .poll() on the vCPU FD[*], except it's
very specifically restricted to a single caller (takes vcpu->mutex).  We
could probably actually implement it via .poll(), but I suspect that would
be more confusing than helpful.

E.g. extract the guts of vcpu_block() to a separate helper, and then wire
that up to an ioctl().

As for the RFLAGS.IF quirk, maybe handle that via a kvm_run flag?  That
way, userspace doesn't need to do a round-trip just to set a single bit.
E.g. I think we should be able to squeeze it into "struct kvm_hyperv_exit".

Actually, speaking of kvm_hyperv_exit, is there a reason we can't simply
wire up HVCALL_VTL_CALL and/or HVCALL_VTL_RETURN to a dedicated
complete_userspace_io() callback that blocks if some flag is set?  That
would make it _much_ cleaner to scope the RFLAGS.IF check to
kvm_hyperv_exit, and would require little to no new uAPI.

> @@ -3797,6 +3798,10 @@ bool svm_interrupt_blocked(struct kvm_vcpu *vcpu)
>  	if (!gif_set(svm))
>  		return true;
>  
> +	/*
> +	 * The Hyper-V TLFS states that RFLAGS.IF is ignored when deciding
> +	 * whether to block interrupts targeted at inactive VTLs.
> +	 */
>  	if (is_guest_mode(vcpu)) {
>  		/* As long as interrupts are being delivered...  */
>  		if ((svm->nested.ctl.int_ctl & V_INTR_MASKING_MASK)
> @@ -3808,7 +3813,7 @@ bool svm_interrupt_blocked(struct kvm_vcpu *vcpu)
>  		if (nested_exit_on_intr(svm))
>  			return false;
>  	} else {
> -		if (!svm_get_if_flag(vcpu))
> +		if (!svm_get_if_flag(vcpu) && !kvm_hv_vcpu_is_idle_vtl(vcpu))

Speaking of RFLAGS.IF, I think it makes sense to add a common x86 helper to
handle the RFLAGS.IF vs. idle VTL logic.  Naming will be annoying, but
that's about it.

E.g. kvm_is_irq_blocked_by_rflags_if() or so.

[*] https://lore.kernel.org/lkml/20231001111313.77586-1-nsaenz@amazon.com
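Spelled out, the KVM_SET_MP_STATE => KVM_RUN round-trip above looks roughly
like the following from a VTL-aware VMM. This is only a sketch: error
handling, the kvm_run mmap() setup, and the policy deciding whether the
wake event outranks the active VTL are all omitted.

#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Sketch: vcpu_fd is a vCPU file descriptor, run its mmap()'d kvm_run. */
static void run_inactive_vtl(int vcpu_fd, struct kvm_run *run)
{
	struct kvm_mp_state mp = { .mp_state = KVM_MP_STATE_HV_INACTIVE_VTL };

	/* Park the vCPU after processing HvCallVtlCall/HvCallVtlReturn. */
	ioctl(vcpu_fd, KVM_SET_MP_STATE, &mp);

	/* Blocks until a wake event, then exits with SYSTEM_EVENT_WAKEUP. */
	ioctl(vcpu_fd, KVM_RUN, 0);

	if (run->exit_reason == KVM_EXIT_SYSTEM_EVENT &&
	    run->system_event.type == KVM_SYSTEM_EVENT_WAKEUP) {
		/* The event takes precedence: make the vCPU runnable again. */
		mp.mp_state = KVM_MP_STATE_RUNNABLE;
		ioctl(vcpu_fd, KVM_SET_MP_STATE, &mp);
		ioctl(vcpu_fd, KVM_RUN, 0);
	}
}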
On Fri Sep 13, 2024 at 7:01 PM UTC, Sean Christopherson wrote:
> On Sun, Jun 09, 2024, Nicolas Saenz Julienne wrote:
> > Model inactive VTL vCPUs' behaviour with a new MP state.
> >
> > [...]
>
> IIUC, this is a convoluted and roundabout way to let userspace check if a
> vCPU has a wake event, correct?  Even by the end of the series, KVM never
> sets MP_STATE_HV_INACTIVE_VTL, i.e. the only use for this is to combine
> it as:
>
>   KVM_SET_MP_STATE => KVM_RUN => KVM_SET_MP_STATE => KVM_RUN

Correct.

> The upside to this approach is that it requires minimal uAPI and very few
> KVM changes, but that's about it AFAICT.  On the other hand, making this
> so painfully specific feels like a missed opportunity, and unnecessarily
> bleeds VTL details into KVM.
>
> Bringing halt-polling into the picture (by going down kvm_vcpu_halt()) is
> also rather bizarre, since quite a bit of time has already elapsed since
> the vCPU first did HvCallVtlCall/HvCallVtlReturn.  But that doesn't
> really have anything to do with MP_STATE_HV_INACTIVE_VTL, e.g. it'd be
> just as easy to go to kvm_vcpu_block().
>
> Why not add an ioctl() to very explicitly block until a wake event is
> ready?  Or, probably better, a generic "wait" ioctl() that takes the wait
> type as an argument.
>
> Kinda like your idea of supporting .poll() on the vCPU FD[*], except it's
> very specifically restricted to a single caller (takes vcpu->mutex).  We
> could probably actually implement it via .poll(), but I suspect that
> would be more confusing than helpful.
>
> E.g. extract the guts of vcpu_block() to a separate helper, and then wire
> that up to an ioctl().
>
> As for the RFLAGS.IF quirk, maybe handle that via a kvm_run flag?  That
> way, userspace doesn't need to do a round-trip just to set a single bit.
> E.g. I think we should be able to squeeze it into "struct
> kvm_hyperv_exit".

It's things like the RFLAGS.IF exemption that deterred me from building a
generic interface. We might find out that the generic blocking logic
doesn't match the expected VTL semantics and be stuck with a uAPI that
isn't enough for VSM, nor useful for any other use-case. We can always
introduce 'flags', I guess.

Note that I'm just being cautious here; AFAICT the generic approach works,
and I'm fine with going with the "wait" ioctl.
> Actually, speaking of kvm_hyperv_exit, is there a reason we can't simply
> wire up HVCALL_VTL_CALL and/or HVCALL_VTL_RETURN to a dedicated
> complete_userspace_io() callback that blocks if some flag is set?  That
> would make it _much_ cleaner to scope the RFLAGS.IF check to
> kvm_hyperv_exit, and would require little to no new uAPI.

So IIUC, the approach is to have complete_userspace_io() block after
re-entering from HVCALL_VTL_RETURN, then have it exit back to user-space
whenever an event becomes available (maybe re-using
KVM_SYSTEM_EVENT_WAKEUP?). That would work, but will need something extra
to be compatible with migration/live-update.

> > @@ -3797,6 +3798,10 @@ bool svm_interrupt_blocked(struct kvm_vcpu *vcpu)
> >  	if (!gif_set(svm))
> >  		return true;
> >  
> > +	/*
> > +	 * The Hyper-V TLFS states that RFLAGS.IF is ignored when deciding
> > +	 * whether to block interrupts targeted at inactive VTLs.
> > +	 */
> >  	if (is_guest_mode(vcpu)) {
> >  		/* As long as interrupts are being delivered...  */
> >  		if ((svm->nested.ctl.int_ctl & V_INTR_MASKING_MASK)
> > @@ -3808,7 +3813,7 @@ bool svm_interrupt_blocked(struct kvm_vcpu *vcpu)
> >  		if (nested_exit_on_intr(svm))
> >  			return false;
> >  	} else {
> > -		if (!svm_get_if_flag(vcpu))
> > +		if (!svm_get_if_flag(vcpu) && !kvm_hv_vcpu_is_idle_vtl(vcpu))
>
> Speaking of RFLAGS.IF, I think it makes sense to add a common x86 helper
> to handle the RFLAGS.IF vs. idle VTL logic.  Naming will be annoying, but
> that's about it.
>
> E.g. kvm_is_irq_blocked_by_rflags_if() or so.

Noted.

Thanks,
Nicolas
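For illustration, the complete_userspace_io() idea as understood above
would look roughly like the sketch below. The callback name and its
registration point are hypothetical; vcpu->arch.complete_userspace_io,
kvm_vcpu_halt(), and the KVM_SYSTEM_EVENT exit are the existing KVM
mechanisms being composed.

/*
 * Hypothetical callback; KVM invokes it when userspace re-enters KVM_RUN
 * after handling the HVCALL_VTL_RETURN exit.
 */
static int complete_hvcall_vtl_return(struct kvm_vcpu *vcpu)
{
	/* Block the now-inactive VTL until a wake event is pending... */
	kvm_vcpu_halt(vcpu);

	/* ...then bounce back so userspace can pick which VTL runs next. */
	vcpu->run->exit_reason = KVM_EXIT_SYSTEM_EVENT;
	vcpu->run->system_event.type = KVM_SYSTEM_EVENT_WAKEUP;
	return 0;	/* returning <= 0 exits to userspace */
}

/* In the HVCALL_VTL_RETURN handler, before exiting to userspace: */
vcpu->arch.complete_userspace_io = complete_hvcall_vtl_return;

As noted above, anything along these lines still needs a story for
migration/live-update, since the blocked state would be invisible to
userspace.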
On Mon, Sep 16, 2024, Nicolas Saenz Julienne wrote:
> On Fri Sep 13, 2024 at 7:01 PM UTC, Sean Christopherson wrote:
> > E.g. extract the guts of vcpu_block() to a separate helper, and then
> > wire that up to an ioctl().
> >
> > As for the RFLAGS.IF quirk, maybe handle that via a kvm_run flag?  That
> > way, userspace doesn't need to do a round-trip just to set a single
> > bit.  E.g. I think we should be able to squeeze it into "struct
> > kvm_hyperv_exit".
>
> It's things like the RFLAGS.IF exemption that deterred me from building a
> generic interface. We might find out that the generic blocking logic
> doesn't match the expected VTL semantics and be stuck with a uAPI that
> isn't enough for VSM, nor useful for any other use-case.

That's only motivation for ensuring that we are as confident as we can
reasonably be that the uAPI we merge will work for VSM, e.g. by building
out userspace and proving that a generic ioctl() provides the necessary
functionality.

If there's no other immediate use case, then there's no reason to merge a
generic ioctl() until VSM support is imminent.  And if there is another use
case, then the concern that a generic ioctl() isn't useful obviously goes
away.

> We can always introduce 'flags', I guess.
>
> Note that I'm just being cautious here; AFAICT the generic approach
> works, and I'm fine with going with the "wait" ioctl.
>
> > Actually, speaking of kvm_hyperv_exit, is there a reason we can't
> > simply wire up HVCALL_VTL_CALL and/or HVCALL_VTL_RETURN to a dedicated
> > complete_userspace_io() callback that blocks if some flag is set?  That
> > would make it _much_ cleaner to scope the RFLAGS.IF check to
> > kvm_hyperv_exit, and would require little to no new uAPI.
>
> So IIUC, the approach is to have complete_userspace_io() block after
> re-entering from HVCALL_VTL_RETURN, then have it exit back to user-space
> whenever an event becomes available (maybe re-using
> KVM_SYSTEM_EVENT_WAKEUP?).

Mostly out of curiosity, why does control need to return to userspace?

> That would work, but will need something extra to be compatible with
> migration/live-update.

Gah, right, because KVM's generic ABI is that userspace must complete I/O
exits before saving/restoring state.  Yeah, having KVM automatically enter
a blocking state is probably a bad idea.
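To make the generic alternative concrete, a "wait" ioctl() of the kind
discussed in this thread might take the following uAPI shape. Everything
below is invented for illustration; the ioctl number, struct, and flag
names are not from any posted patch.

/* Hypothetical uAPI sketch, not from any posted patch. */
struct kvm_vcpu_wait {
	__u32 type;	/* what to wait for, e.g. HLT-style wake events */
	__u32 flags;	/* e.g. a bit to ignore RFLAGS.IF for inactive VTLs */
};

/* The ioctl number below is a placeholder. */
#define KVM_VCPU_WAIT	_IOW(KVMIO, 0xd4, struct kvm_vcpu_wait)

A VSM VMM would request halt-like wait semantics with the
RFLAGS.IF-ignoring flag set, keeping the VTL quirk out of KVM's core
blocking logic while leaving the ioctl reusable for other wait types.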
diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 17893b330b76f..e664c54a13b04 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -1517,6 +1517,8 @@ Possible values are:
                                  [s390]
    KVM_MP_STATE_SUSPENDED        the vcpu is in a suspend state and is waiting
                                  for a wakeup event [arm64]
+   KVM_MP_STATE_HV_INACTIVE_VTL  the vcpu is an inactive VTL and is waiting for
+                                 a wakeup event [x86]
    ==========================    ===============================================
 
 On x86, this ioctl is only useful after KVM_CREATE_IRQCHIP. Without an
@@ -1559,6 +1561,23 @@ KVM_MP_STATE_RUNNABLE which reflect if the vcpu is paused or not.
 On LoongArch, only the KVM_MP_STATE_RUNNABLE state is used to reflect
 whether the vcpu is runnable.
 
+For x86:
+^^^^^^^^
+
+KVM_MP_STATE_HV_INACTIVE_VTL is only available to a VM if Hyper-V's
+HV_ACCESS_VSM CPUID is exposed to the guest. This processor state models the
+behavior of an inactive VTL and should only be used for this purpose. A
+userspace process should only switch a vCPU into this MP state in response to a
+HvCallVtlCall, HvCallVtlReturn.
+
+If a vCPU is in KVM_MP_STATE_HV_INACTIVE_VTL, KVM will emulate the
+architectural execution of a HLT instruction with the caveat that RFLAGS.IF is
+ignored when deciding whether to wake up (TLFS 12.12.2.1). If a wakeup is
+recognized, KVM will exit to userspace with a KVM_SYSTEM_EVENT exit, where the
+event type is KVM_SYSTEM_EVENT_WAKEUP. Userspace has the responsibility to
+switch the vCPU back into KVM_MP_STATE_RUNNABLE state. Calling KVM_RUN on a
+KVM_MP_STATE_HV_INACTIVE_VTL vCPU with pending events will exit immediately.
+
 4.39 KVM_SET_MP_STATE
 ---------------------
 
diff --git a/arch/x86/kvm/hyperv.h b/arch/x86/kvm/hyperv.h
index d007d2203e0e4..d42fe3f85b002 100644
--- a/arch/x86/kvm/hyperv.h
+++ b/arch/x86/kvm/hyperv.h
@@ -271,6 +271,10 @@ static inline bool kvm_hv_cpuid_vsm_enabled(struct kvm_vcpu *vcpu)
 	return hv_vcpu && (hv_vcpu->cpuid_cache.features_ebx & HV_ACCESS_VSM);
 }
 
+static inline bool kvm_hv_vcpu_is_idle_vtl(struct kvm_vcpu *vcpu)
+{
+	return vcpu->arch.mp_state == KVM_MP_STATE_HV_INACTIVE_VTL;
+}
 #else /* CONFIG_KVM_HYPERV */
 static inline void kvm_hv_setup_tsc_page(struct kvm *kvm,
 					 struct pvclock_vcpu_time_info *hv_clock) {}
@@ -332,6 +336,10 @@ static inline bool kvm_hv_cpuid_vsm_enabled(struct kvm_vcpu *vcpu)
 {
 	return false;
 }
+static inline bool kvm_hv_vcpu_is_idle_vtl(struct kvm_vcpu *vcpu)
+{
+	return false;
+}
 #endif /* CONFIG_KVM_HYPERV */
 
 #endif /* __ARCH_X86_KVM_HYPERV_H__ */
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 296c524988f95..9671191fef4ea 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -49,6 +49,7 @@
 
 #include "svm.h"
 #include "svm_ops.h"
+#include "hyperv.h"
 
 #include "kvm_onhyperv.h"
 #include "svm_onhyperv.h"
@@ -3797,6 +3798,10 @@ bool svm_interrupt_blocked(struct kvm_vcpu *vcpu)
 	if (!gif_set(svm))
 		return true;
 
+	/*
+	 * The Hyper-V TLFS states that RFLAGS.IF is ignored when deciding
+	 * whether to block interrupts targeted at inactive VTLs.
+	 */
 	if (is_guest_mode(vcpu)) {
 		/* As long as interrupts are being delivered...  */
 		if ((svm->nested.ctl.int_ctl & V_INTR_MASKING_MASK)
@@ -3808,7 +3813,7 @@ bool svm_interrupt_blocked(struct kvm_vcpu *vcpu)
 		if (nested_exit_on_intr(svm))
 			return false;
 	} else {
-		if (!svm_get_if_flag(vcpu))
+		if (!svm_get_if_flag(vcpu) && !kvm_hv_vcpu_is_idle_vtl(vcpu))
 			return true;
 	}
 
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index b3c83c06f8265..ac0682fece604 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -5057,7 +5057,12 @@ bool vmx_interrupt_blocked(struct kvm_vcpu *vcpu)
 	if (is_guest_mode(vcpu) && nested_exit_on_intr(vcpu))
 		return false;
 
-	return !(vmx_get_rflags(vcpu) & X86_EFLAGS_IF) ||
+	/*
+	 * The Hyper-V TLFS states that RFLAGS.IF is ignored when deciding
+	 * whether to block interrupts targeted at inactive VTLs.
+	 */
+	return (!(vmx_get_rflags(vcpu) & X86_EFLAGS_IF) &&
+		!kvm_hv_vcpu_is_idle_vtl(vcpu)) ||
 	       (vmcs_read32(GUEST_INTERRUPTIBILITY_INFO) &
 		(GUEST_INTR_STATE_STI | GUEST_INTR_STATE_MOV_SS));
 }
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 8c9e4281d978d..a6e2312ccb68f 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -134,6 +134,7 @@ static int kvm_vcpu_do_singlestep(struct kvm_vcpu *vcpu);
 
 static int __set_sregs2(struct kvm_vcpu *vcpu, struct kvm_sregs2 *sregs2);
 static void __get_sregs2(struct kvm_vcpu *vcpu, struct kvm_sregs2 *sregs2);
+static inline bool kvm_vcpu_has_events(struct kvm_vcpu *vcpu);
 
 static DEFINE_MUTEX(vendor_module_lock);
 struct kvm_x86_ops kvm_x86_ops __read_mostly;
@@ -11176,7 +11177,8 @@ static inline int vcpu_block(struct kvm_vcpu *vcpu)
 			kvm_lapic_switch_to_sw_timer(vcpu);
 
 		kvm_vcpu_srcu_read_unlock(vcpu);
-		if (vcpu->arch.mp_state == KVM_MP_STATE_HALTED)
+		if (vcpu->arch.mp_state == KVM_MP_STATE_HALTED ||
+		    kvm_hv_vcpu_is_idle_vtl(vcpu))
 			kvm_vcpu_halt(vcpu);
 		else
 			kvm_vcpu_block(vcpu);
@@ -11218,6 +11220,7 @@ static inline int vcpu_block(struct kvm_vcpu *vcpu)
 		vcpu->arch.apf.halted = false;
 		break;
 	case KVM_MP_STATE_INIT_RECEIVED:
+	case KVM_MP_STATE_HV_INACTIVE_VTL:
 		break;
 	default:
 		WARN_ON_ONCE(1);
@@ -11264,6 +11267,13 @@ static int vcpu_run(struct kvm_vcpu *vcpu)
 		if (kvm_cpu_has_pending_timer(vcpu))
 			kvm_inject_pending_timer_irqs(vcpu);
 
+		if (kvm_hv_vcpu_is_idle_vtl(vcpu) && kvm_vcpu_has_events(vcpu)) {
+			r = 0;
+			vcpu->run->exit_reason = KVM_EXIT_SYSTEM_EVENT;
+			vcpu->run->system_event.type = KVM_SYSTEM_EVENT_WAKEUP;
+			break;
+		}
+
 		if (dm_request_for_irq_injection(vcpu) &&
 		    kvm_vcpu_ready_for_interrupt_injection(vcpu)) {
 			r = 0;
@@ -11703,6 +11713,10 @@ int kvm_arch_vcpu_ioctl_set_mpstate(struct kvm_vcpu *vcpu,
 			goto out;
 		break;
 
+	case KVM_MP_STATE_HV_INACTIVE_VTL:
+		if (is_guest_mode(vcpu) || !kvm_hv_cpuid_vsm_enabled(vcpu))
+			goto out;
+		break;
 	case KVM_MP_STATE_RUNNABLE:
 		break;
 
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index fbdee8d754595..f4864e6907e0b 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -564,6 +564,7 @@ struct kvm_vapic_addr {
 #define KVM_MP_STATE_LOAD              8
 #define KVM_MP_STATE_AP_RESET_HOLD     9
 #define KVM_MP_STATE_SUSPENDED         10
+#define KVM_MP_STATE_HV_INACTIVE_VTL   11
 
 struct kvm_mp_state {
 	__u32 mp_state;
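On the common-helper suggestion from the review thread above: the
duplicated svm.c/vmx.c checks in this patch could collapse into something
like the sketch below. The helper's placement and the use of
kvm_get_rflags() are assumptions, not part of the patch; notably, SVM goes
through svm_get_if_flag() to cope with SEV-ES, so a common helper would
need to account for that.

/* Sketch of the helper suggested in review; name per Sean's reply. */
static inline bool kvm_is_irq_blocked_by_rflags_if(struct kvm_vcpu *vcpu)
{
	/*
	 * Per the TLFS (12.12.2.1), inactive VTLs ignore RFLAGS.IF when
	 * deciding whether interrupts are blocked.
	 */
	return !(kvm_get_rflags(vcpu) & X86_EFLAGS_IF) &&
	       !kvm_hv_vcpu_is_idle_vtl(vcpu);
}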
Model inactive VTL vCPUs' behaviour with a new MP state.

Inactive VTLs are in an artificial halt state. They enter this state in
response to invoking HvCallVtlCall or HvCallVtlReturn. User-space, which
is VTL aware, can process the hypercall and set the vCPU to
MP_STATE_HV_INACTIVE_VTL. When a vCPU is run in this state it'll block
until a wakeup event is received. The rules for what constitutes an event
are analogous to halt's, except that VTLs ignore RFLAGS.IF.

When a wakeup event is registered, KVM will exit to user-space with a
KVM_SYSTEM_EVENT exit and the KVM_SYSTEM_EVENT_WAKEUP event type.
User-space is responsible for deciding whether the event takes precedence
over the active VTL, and will switch the vCPU to KVM_MP_STATE_RUNNABLE
before resuming execution on it.

Running a KVM_MP_STATE_HV_INACTIVE_VTL vCPU with pending events will
return immediately to user-space.

Note that by re-using the readily available halt infrastructure in
KVM_RUN, MP_STATE_HV_INACTIVE_VTL correctly handles (or disables)
virtualisation features like the VMX preemption timer or APICv before
blocking.

Suggested-by: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Nicolas Saenz Julienne <nsaenz@amazon.com>

---

I do recall Sean mentioning that using MP states for this might have
unexpected side-effects, but that was in the context of introducing a
broader `HALTED_USERSPACE`-style state. I believe that by narrowing the MP
state's semantics down to the specifics of inactive VTLs -- alternatively,
we could change RFLAGS.IF in user-space before updating the MP state -- we
cement this as a VSM-only API, as well as limit the ambiguity about the
guest/vCPU's state upon entering this execution mode.

 Documentation/virt/kvm/api.rst | 19 +++++++++++++++++++
 arch/x86/kvm/hyperv.h          |  8 ++++++++
 arch/x86/kvm/svm/svm.c         |  7 ++++++-
 arch/x86/kvm/vmx/vmx.c         |  7 ++++++-
 arch/x86/kvm/x86.c             | 16 +++++++++++++++-
 include/uapi/linux/kvm.h       |  1 +
 6 files changed, 55 insertions(+), 3 deletions(-)