
[v2,00/49] KVM: x86: CPUID overhaul, fixes, and caching

Message ID 20240517173926.965351-1-seanjc@google.com (mailing list archive)

Message

Sean Christopherson May 17, 2024, 5:38 p.m. UTC
This is technically v2 of "Replace governed features with guest cpu_caps",
but it obviously snowballed just a bit.  This series wanders all over the
place, and ideally would be 3-4 distinct series, but the patches interact
with and depend on one another throughout.

The super short TL;DR: snapshot all X86_FEATURE_* flags that KVM cares
about so that all queries against guest capabilities are "fast", i.e. so
that no feature requires manual enabling or a judgment call as to whether
its queries need to be fast.
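
As a rough illustration of the "snapshot once, query fast" idea, here is a
minimal userspace sketch.  It is not code from the series: every toy_* name
is hypothetical, only CPUID.0x1 ECX/EDX are modeled, and the real cpu_caps
layout and helpers differ.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for one guest CPUID entry. */
struct toy_cpuid_entry {
	uint32_t function;
	uint32_t index;
	uint32_t eax, ebx, ecx, edx;
};

/* Hypothetical feature words that this toy model snapshots. */
enum toy_cap_word {
	TOY_CAP_1_ECX,		/* CPUID.0x1:ECX */
	TOY_CAP_1_EDX,		/* CPUID.0x1:EDX */
	TOY_NR_CAP_WORDS,
};

struct toy_vcpu {
	const struct toy_cpuid_entry *entries;
	size_t nr_entries;
	uint32_t cpu_caps[TOY_NR_CAP_WORDS];
};

/* Slow path: linear search of the guest CPUID array. */
static inline const struct toy_cpuid_entry *
toy_find_entry(const struct toy_vcpu *vcpu, uint32_t function, uint32_t index)
{
	for (size_t i = 0; i < vcpu->nr_entries; i++) {
		const struct toy_cpuid_entry *e = &vcpu->entries[i];

		if (e->function == function && e->index == index)
			return e;
	}
	return NULL;
}

/* Snapshot the feature words once, e.g. when guest CPUID is set. */
static inline void toy_snapshot_caps(struct toy_vcpu *vcpu)
{
	const struct toy_cpuid_entry *e = toy_find_entry(vcpu, 0x1, 0);

	vcpu->cpu_caps[TOY_CAP_1_ECX] = e ? e->ecx : 0;
	vcpu->cpu_caps[TOY_CAP_1_EDX] = e ? e->edx : 0;
}

/* Fast path: a single bit test, with no CPUID lookup. */
static inline bool toy_guest_cap_has(const struct toy_vcpu *vcpu,
				     enum toy_cap_word word, unsigned int bit)
{
	return (vcpu->cpu_caps[word] & (UINT32_C(1) << bit)) != 0;
}
```

After toy_snapshot_caps() runs, every query is a constant-time bit test, so
no caller has to decide whether a particular feature deserves a fast path.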

The guest_cpu_cap_* nomenclature follows the existing kvm_cpu_cap_* naming,
except for a few (maybe just one?) cases where guest cpu_caps need APIs
that kvm_cpu_caps don't.  In theory, the similar names will make this
approach more intuitive.

Maxim's suggestion to incorporate KVM's capabilities into the guest's cpu_caps
grew on me, to the point where I decided to just go for it.  Through macro
shenanigans (see the last DO NOT APPLY patch) and manually verifying that
vcpu->arch.cpu_caps is always a superset of guest CPUID, I was able to gain
sufficient confidence that KVM won't silently change guest behavior.  Many, but
not all, of the new patches are related in some way to that approach.
 
There are *multiple* potentially breaking changes in this series (in for a
penny, in for a pound).  However, I don't expect any fallout for real world
VMMs because the ABI changes either disallow things that couldn't possibly
have worked in the first place, or are following in the footsteps of other
behaviors, e.g. KVM advertises x2APIC, which is 100% dependent on an in-kernel
local APIC.

 * Disallow stuffing CPUID-dependent guest CR4 features before setting guest
   CPUID.
 * Disallow KVM_CAP_X86_DISABLE_EXITS after vCPU creation.
 * Reject disabling of MWAIT/HLT interception when not allowed.
 * Advertise TSC_DEADLINE_TIMER in KVM_GET_SUPPORTED_CPUID.
 * Advertise HYPERVISOR in KVM_GET_SUPPORTED_CPUID.
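
To illustrate the ordering constraint in the KVM_CAP_X86_DISABLE_EXITS item,
here is a toy userspace model.  It is a sketch only: the toy_* names are
hypothetical, and the -EINVAL return value is an assumption, not something
stated in this cover letter.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical, simplified VM state. */
struct toy_vm {
	unsigned int created_vcpus;
	uint64_t disabled_exits;
};

static inline void toy_create_vcpu(struct toy_vm *vm)
{
	vm->created_vcpus++;
}

/*
 * Model of enabling the capability: allowed only while no vCPU exists.
 * The error code is an assumption made for this sketch.
 */
static inline int toy_disable_exits(struct toy_vm *vm, uint64_t mask)
{
	if (vm->created_vcpus)
		return -EINVAL;

	vm->disabled_exits |= mask;
	return 0;
}
```

In this model, a VMM that disables exits before creating vCPUs is
unaffected; only a call made after toy_create_vcpu() is rejected, and the
previously configured mask is left untouched.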

Lastly, regarding the PoC DO NOT APPLY patch, I hope to turn that into an actual
patch in the future.  E.g. I think we can shove feature usage information into
a .note or something, and then do post-processing a la objtool during the build.

v2:
 - Collect a few reviews (though I dropped several due to the patches changing
   significantly).
 - Incorporate KVM's support into the vCPU's cpu_caps. [Maxim]
 - A massive pile of new patches.

v1: https://lore.kernel.org/all/20231110235528.1561679-1-seanjc@google.com

Sean Christopherson (49):
  KVM: x86: Do all post-set CPUID processing during vCPU creation
  KVM: x86: Explicitly do runtime CPUID updates "after" initial setup
  KVM: x86: Account for KVM-reserved CR4 bits when passing through CR4
    on VMX
  KVM: selftests: Update x86's set_sregs_test to match KVM's CPUID
    enforcement
  KVM: selftests: Assert that the @cpuid passed to get_cpuid_entry() is
    non-NULL
  KVM: selftests: Refresh vCPU CPUID cache in __vcpu_get_cpuid_entry()
  KVM: selftests: Verify KVM stuffs runtime CPUID OS bits on CR4 writes
  KVM: x86: Move __kvm_is_valid_cr4() definition to x86.h
  KVM: x86/pmu: Drop now-redundant refresh() during init()
  KVM: x86: Drop now-redundant MAXPHYADDR and GPA rsvd bits from vCPU
    creation
  KVM: x86: Disallow KVM_CAP_X86_DISABLE_EXITS after vCPU creation
  KVM: x86: Reject disabling of MWAIT/HLT interception when not allowed
  KVM: selftests: Fix a bad TEST_REQUIRE() in x86's KVM PV test
  KVM: selftests: Update x86's KVM PV test to match KVM's disabling
    exits behavior
  KVM: x86: Zero out PV features cache when the CPUID leaf is not
    present
  KVM: x86: Don't update PV features caches when enabling enforcement
    capability
  KVM: x86: Do reverse CPUID sanity checks in __feature_leaf()
  KVM: x86: Account for max supported CPUID leaf when getting raw host
    CPUID
  KVM: x86: Add a macro to init CPUID features that ignore host kernel
    support
  KVM: x86: Rename kvm_cpu_cap_mask() to kvm_cpu_cap_init()
  KVM: x86: Add a macro to init CPUID features that are 64-bit only
  KVM: x86: Add a macro to precisely handle aliased 0x1.EDX CPUID
    features
  KVM: x86: Handle kernel- and KVM-defined CPUID words in a single
    helper
  KVM: x86: #undef SPEC_CTRL_SSBD in cpuid.c to avoid macro collisions
  KVM: x86: Harden CPU capabilities processing against out-of-scope
    features
  KVM: x86: Add a macro to init CPUID features that KVM emulates in
    software
  KVM: x86: Swap incoming guest CPUID into vCPU before massaging in
    KVM_SET_CPUID2
  KVM: x86: Clear PV_UNHALT for !HLT-exiting only when userspace sets
    CPUID
  KVM: x86: Remove unnecessary caching of KVM's PV CPUID base
  KVM: x86: Always operate on kvm_vcpu data in cpuid_entry2_find()
  KVM: x86: Move kvm_find_cpuid_entry{,_index}() up near
    cpuid_entry2_find()
  KVM: x86: Remove all direct usage of cpuid_entry2_find()
  KVM: x86: Advertise TSC_DEADLINE_TIMER in KVM_GET_SUPPORTED_CPUID
  KVM: x86: Advertise HYPERVISOR in KVM_GET_SUPPORTED_CPUID
  KVM: x86: Add a macro to handle features that are fully VMM controlled
  KVM: x86: Rename "governed features" helpers to use "guest_cpu_cap"
  KVM: x86: Replace guts of "governed" features with comprehensive
    cpu_caps
  KVM: x86: Initialize guest cpu_caps based on guest CPUID
  KVM: x86: Extract code for generating per-entry emulated CPUID
    information
  KVM: x86: Initialize guest cpu_caps based on KVM support
  KVM: x86: Avoid double CPUID lookup when updating MWAIT at runtime
  KVM: x86: Drop unnecessary check that cpuid_entry2_find() returns
    right leaf
  KVM: x86: Update OS{XSAVE,PKE} bits in guest CPUID irrespective of
    host support
  KVM: x86: Update guest cpu_caps at runtime for dynamic CPUID-based
    features
  KVM: x86: Shuffle code to prepare for dropping guest_cpuid_has()
  KVM: x86: Replace (almost) all guest CPUID feature queries with
    cpu_caps
  KVM: x86: Drop superfluous host XSAVE check when adjusting guest
    XSAVES caps
  KVM: x86: Add a macro for features that are synthesized into
    boot_cpu_data
  *** DO NOT APPLY *** KVM: x86: Verify KVM initializes all consumed
    guest caps

 Documentation/virt/kvm/api.rst                |  10 +-
 arch/x86/include/asm/kvm_host.h               |  46 +-
 arch/x86/kvm/cpuid.c                          | 660 +++++++++++-------
 arch/x86/kvm/cpuid.h                          | 141 ++--
 arch/x86/kvm/governed_features.h              |  22 -
 arch/x86/kvm/hyperv.c                         |   2 +-
 arch/x86/kvm/lapic.c                          |   2 +-
 arch/x86/kvm/mmu.h                            |   2 +-
 arch/x86/kvm/mmu/mmu.c                        |   4 +-
 arch/x86/kvm/mtrr.c                           |   2 +-
 arch/x86/kvm/pmu.c                            |   1 -
 arch/x86/kvm/reverse_cpuid.h                  |  22 +-
 arch/x86/kvm/smm.c                            |  10 +-
 arch/x86/kvm/svm/nested.c                     |  22 +-
 arch/x86/kvm/svm/pmu.c                        |   8 +-
 arch/x86/kvm/svm/sev.c                        |  21 +-
 arch/x86/kvm/svm/svm.c                        |  46 +-
 arch/x86/kvm/svm/svm.h                        |   4 +-
 arch/x86/kvm/vmx/hyperv.h                     |   2 +-
 arch/x86/kvm/vmx/nested.c                     |  18 +-
 arch/x86/kvm/vmx/pmu_intel.c                  |   4 +-
 arch/x86/kvm/vmx/sgx.c                        |  14 +-
 arch/x86/kvm/vmx/vmx.c                        |  61 +-
 arch/x86/kvm/x86.c                            | 153 ++--
 arch/x86/kvm/x86.h                            |   6 +-
 include/asm-generic/vmlinux.lds.h             |   4 +
 .../selftests/kvm/include/x86_64/processor.h  |  11 +-
 .../selftests/kvm/lib/x86_64/processor.c      |   2 +
 .../selftests/kvm/x86_64/kvm_pv_test.c        |  38 +-
 .../selftests/kvm/x86_64/set_sregs_test.c     |  63 +-
 30 files changed, 791 insertions(+), 610 deletions(-)
 delete mode 100644 arch/x86/kvm/governed_features.h


base-commit: 4aad0b1893a141f114ba40ed509066f3c9bc24b0

Comments

Paolo Bonzini May 17, 2024, 5:54 p.m. UTC | #1
On Fri, May 17, 2024 at 7:39 PM Sean Christopherson <seanjc@google.com> wrote:
>  * Disallow KVM_CAP_X86_DISABLE_EXITS after vCPU creation
>  * Reject disabling of MWAIT/HLT interception when not allowed
>  * Advertise TSC_DEADLINE_TIMER in KVM_GET_SUPPORTED_CPUID.

This is technically a breaking change, and it's even documented in
api.rst under "KVM_GET_SUPPORTED_CPUID issues":

---
CPU[EAX=1]:ECX[21] (X2APIC) is reported by
``KVM_GET_SUPPORTED_CPUID``, but it can only be enabled if
``KVM_CREATE_IRQCHIP`` or ``KVM_ENABLE_CAP(KVM_CAP_IRQCHIP_SPLIT)``
are used to enable in-kernel emulation of the local APIC.

The same is true for the ``KVM_FEATURE_PV_UNHALT`` paravirtualized feature.

CPU[EAX=1]:ECX[24] (TSC_DEADLINE) is not reported by
``KVM_GET_SUPPORTED_CPUID``. It can be enabled if
``KVM_CAP_TSC_DEADLINE_TIMER`` is present and the kernel has enabled
in-kernel emulation of the local APIC.
---

However, I think we can get away with it. QEMU's source code currently does:

        /* tsc-deadline flag is not returned by GET_SUPPORTED_CPUID, but it
         * can be enabled if the kernel has KVM_CAP_TSC_DEADLINE_TIMER,
         * and the irqchip is in the kernel.
         */
        if (kvm_irqchip_in_kernel() &&
                kvm_check_extension(s, KVM_CAP_TSC_DEADLINE_TIMER)) {
            ret |= CPUID_EXT_TSC_DEADLINE_TIMER;
        }

        /* x2apic is reported by GET_SUPPORTED_CPUID, but it can't be enabled
         * without the in-kernel irqchip
         */
        if (!kvm_irqchip_in_kernel()) {
            ret &= ~CPUID_EXT_X2APIC;
        }

so it has to cope with the existing mess, but (understandably) it doesn't
expect the opposite mess.
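
To make the asymmetry concrete, here is a toy model of the two adjustments
in the QEMU snippet above.  It is a sketch, not QEMU code: the toy_* names
are hypothetical, and only the two CPUID.0x1:ECX bits quoted from api.rst
(21 for X2APIC, 24 for TSC_DEADLINE) are modeled.

```c
#include <stdbool.h>
#include <stdint.h>

#define TOY_X2APIC		(UINT32_C(1) << 21)	/* CPUID.0x1:ECX[21] */
#define TOY_TSC_DEADLINE	(UINT32_C(1) << 24)	/* CPUID.0x1:ECX[24] */

/*
 * Apply the same two fixups as the quoted QEMU code to the ECX value
 * reported by KVM_GET_SUPPORTED_CPUID.
 */
static inline uint32_t toy_vmm_adjust_ecx(uint32_t supported_ecx,
					  bool irqchip_in_kernel,
					  bool has_tsc_deadline_cap)
{
	uint32_t ret = supported_ecx;

	/* Add TSC_DEADLINE when the in-kernel local APIC can back it. */
	if (irqchip_in_kernel && has_tsc_deadline_cap)
		ret |= TOY_TSC_DEADLINE;

	/* Strip X2APIC when there is no in-kernel local APIC. */
	if (!irqchip_in_kernel)
		ret &= ~TOY_X2APIC;

	return ret;
}
```

With an in-kernel irqchip, the result is X2APIC plus TSC_DEADLINE whether or
not KVM already reported TSC_DEADLINE.  Without one, X2APIC is stripped in
both cases, but a TSC_DEADLINE bit reported by KVM would pass through
because nothing clears it.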

However, in practice the userspace APIC has always been utterly broken and
is even deprecated in QEMU, so we might get away with it. I don't see why
one would use no kernel APIC unless the guest has no APIC whatsoever.

And no guest that doesn't find an APIC is going to use the TSC
deadline timer (sure, the MSR is outside x2APIC space, but how in the
world would you configure LVTT?). Likewise for X2APIC, since you need to
turn it on at 0xFEE0_0000 first.

Paolo