diff mbox series

[2/3] KVM: x86: Add support for VMware guest specific hypercalls

Message ID 20241030033514.1728937-3-zack.rusin@broadcom.com (mailing list archive)
State New
Headers show
Series KVM: x86: Small changes to support VMware guests | expand

Commit Message

Zack Rusin Oct. 30, 2024, 3:34 a.m. UTC
VMware products handle hypercalls in userspace. Give KVM the ability
to run VMware guests unmodified by fowarding all hypercalls to the
userspace.

Enabling of the KVM_CAP_X86_VMWARE_HYPERCALL_ENABLE capability turns
the feature on - it's off by default. This allows vmx's built on top
of KVM to support VMware specific hypercalls.

Signed-off-by: Zack Rusin <zack.rusin@broadcom.com>
Cc: Doug Covelli <doug.covelli@broadcom.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Sean Christopherson <seanjc@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: x86@kernel.org
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Isaku Yamahata <isaku.yamahata@intel.com>
Cc: Joel Stanley <joel@jms.id.au>
Cc: Zack Rusin <zack.rusin@broadcom.com>
Cc: kvm@vger.kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-kselftest@vger.kernel.org
---
 Documentation/virt/kvm/api.rst  | 41 +++++++++++++++++++++++++++++----
 arch/x86/include/asm/kvm_host.h |  1 +
 arch/x86/kvm/x86.c              | 33 ++++++++++++++++++++++++++
 include/uapi/linux/kvm.h        |  1 +
 4 files changed, 72 insertions(+), 4 deletions(-)

Comments

Paolo Bonzini Nov. 4, 2024, 10:13 p.m. UTC | #1
On Wed, Oct 30, 2024 at 4:35 AM Zack Rusin <zack.rusin@broadcom.com> wrote:
>
> VMware products handle hypercalls in userspace. Give KVM the ability
> to run VMware guests unmodified by fowarding all hypercalls to the
> userspace.
>
> Enabling of the KVM_CAP_X86_VMWARE_HYPERCALL_ENABLE capability turns
> the feature on - it's off by default. This allows vmx's built on top
> of KVM to support VMware specific hypercalls.

Hi Zack,

is there a spec of the hypercalls that are supported by userspace? I
would like to understand if there's anything that's best handled in
the kernel.

If we allow forwarding _all_ hypercalls to userspace, then people will
use it for things other than VMware and there goes all hope of
accelerating stuff in the kernel in the future.

So even having _some_ checks in the kernel before going out to
userspace would keep that door open, or at least try.

Patch 1 instead looks good from an API point of view.

Paolo

> Signed-off-by: Zack Rusin <zack.rusin@broadcom.com>
> Cc: Doug Covelli <doug.covelli@broadcom.com>
> Cc: Paolo Bonzini <pbonzini@redhat.com>
> Cc: Jonathan Corbet <corbet@lwn.net>
> Cc: Sean Christopherson <seanjc@google.com>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Borislav Petkov <bp@alien8.de>
> Cc: Dave Hansen <dave.hansen@linux.intel.com>
> Cc: x86@kernel.org
> Cc: "H. Peter Anvin" <hpa@zytor.com>
> Cc: Shuah Khan <shuah@kernel.org>
> Cc: Namhyung Kim <namhyung@kernel.org>
> Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
> Cc: Isaku Yamahata <isaku.yamahata@intel.com>
> Cc: Joel Stanley <joel@jms.id.au>
> Cc: Zack Rusin <zack.rusin@broadcom.com>
> Cc: kvm@vger.kernel.org
> Cc: linux-doc@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
> Cc: linux-kselftest@vger.kernel.org
> ---
>  Documentation/virt/kvm/api.rst  | 41 +++++++++++++++++++++++++++++----
>  arch/x86/include/asm/kvm_host.h |  1 +
>  arch/x86/kvm/x86.c              | 33 ++++++++++++++++++++++++++
>  include/uapi/linux/kvm.h        |  1 +
>  4 files changed, 72 insertions(+), 4 deletions(-)
>
> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> index 33ef3cc785e4..5a8c7922f64f 100644
> --- a/Documentation/virt/kvm/api.rst
> +++ b/Documentation/virt/kvm/api.rst
> @@ -6601,10 +6601,11 @@ to the byte array.
>  .. note::
>
>        For KVM_EXIT_IO, KVM_EXIT_MMIO, KVM_EXIT_OSI, KVM_EXIT_PAPR, KVM_EXIT_XEN,
> -      KVM_EXIT_EPR, KVM_EXIT_X86_RDMSR and KVM_EXIT_X86_WRMSR the corresponding
> -      operations are complete (and guest state is consistent) only after userspace
> -      has re-entered the kernel with KVM_RUN.  The kernel side will first finish
> -      incomplete operations and then check for pending signals.
> +      KVM_EXIT_EPR, KVM_EXIT_HYPERCALL, KVM_EXIT_X86_RDMSR and KVM_EXIT_X86_WRMSR
> +      the corresponding operations are complete (and guest state is consistent)
> +      only after userspace has re-entered the kernel with KVM_RUN. The kernel
> +      side will first finish incomplete operations and then check for pending
> +      signals.
>
>        The pending state of the operation is not preserved in state which is
>        visible to userspace, thus userspace should ensure that the operation is
> @@ -8201,6 +8202,38 @@ default value for it is set via the kvm.enable_vmware_backdoor
>  kernel parameter (false when not set). Must be set before any
>  VCPUs have been created.
>
> +7.38 KVM_CAP_X86_VMWARE_HYPERCALL
> +---------------------------------
> +
> +:Architectures: x86
> +:Parameters: args[0] whether the feature should be enabled or not
> +:Returns: 0 on success.
> +
> +Capability allows userspace to handle hypercalls. When enabled
> +whenever the vcpu has executed a VMCALL(Intel) or a VMMCALL(AMD)
> +instruction kvm will exit to userspace with KVM_EXIT_HYPERCALL.
> +
> +On exit the hypercall structure of the kvm_run structure will
> +look as follows:
> +
> +::
> +   /* KVM_EXIT_HYPERCALL */
> +   struct {
> +      __u64 nr;      // rax
> +      __u64 args[6]; // rbx, rcx, rdx, rsi, rdi, rbp
> +      __u64 ret;     // cpl, whatever userspace
> +                     // sets this to on return will be
> +                     // written to the rax
> +      __u64 flags;   // KVM_EXIT_HYPERCALL_LONG_MODE if
> +                     // the hypercall was executed in
> +                     // 64bit mode, 0 otherwise
> +   } hypercall;
> +
> +Except when running in compatibility mode with VMware hypervisors
> +userspace handling of hypercalls is discouraged. To implement
> +such functionality, use KVM_EXIT_IO (x86) or KVM_EXIT_MMIO
> +(all except s390).
> +
>  8. Other capabilities.
>  ======================
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 7fcf185e337f..7fbb11682517 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1404,6 +1404,7 @@ struct kvm_arch {
>         struct kvm_xen xen;
>  #endif
>         bool vmware_backdoor_enabled;
> +       bool vmware_hypercall_enabled;
>
>         bool backwards_tsc_observed;
>         bool boot_vcpu_runs_old_kvmclock;
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index d7071907d6a5..b676c54266e7 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -4689,6 +4689,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
>         case KVM_CAP_MEMORY_FAULT_INFO:
>         case KVM_CAP_X86_GUEST_MODE:
>         case KVM_CAP_X86_VMWARE_BACKDOOR:
> +       case KVM_CAP_X86_VMWARE_HYPERCALL:
>                 r = 1;
>                 break;
>         case KVM_CAP_PRE_FAULT_MEMORY:
> @@ -6784,6 +6785,13 @@ int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
>                 }
>                 mutex_unlock(&kvm->lock);
>                 break;
> +       case KVM_CAP_X86_VMWARE_HYPERCALL:
> +               r = -EINVAL;
> +               if (cap->args[0] & ~1)
> +                       break;
> +               kvm->arch.vmware_hypercall_enabled = cap->args[0];
> +               r = 0;
> +               break;
>         default:
>                 r = -EINVAL;
>                 break;
> @@ -10127,6 +10135,28 @@ static int complete_hypercall_exit(struct kvm_vcpu *vcpu)
>         return kvm_skip_emulated_instruction(vcpu);
>  }
>
> +static int kvm_vmware_hypercall(struct kvm_vcpu *vcpu)
> +{
> +       struct kvm_run *run = vcpu->run;
> +       bool is_64_bit = is_64_bit_hypercall(vcpu);
> +       u64 mask = is_64_bit ? U64_MAX : U32_MAX;
> +
> +       vcpu->run->hypercall.flags = is_64_bit ? KVM_EXIT_HYPERCALL_LONG_MODE : 0;
> +       run->hypercall.nr = kvm_rax_read(vcpu) & mask;
> +       run->hypercall.args[0] = kvm_rbx_read(vcpu) & mask;
> +       run->hypercall.args[1] = kvm_rcx_read(vcpu) & mask;
> +       run->hypercall.args[2] = kvm_rdx_read(vcpu) & mask;
> +       run->hypercall.args[3] = kvm_rsi_read(vcpu) & mask;
> +       run->hypercall.args[4] = kvm_rdi_read(vcpu) & mask;
> +       run->hypercall.args[5] = kvm_rbp_read(vcpu) & mask;
> +       run->hypercall.ret = kvm_x86_call(get_cpl)(vcpu);
> +
> +       run->exit_reason = KVM_EXIT_HYPERCALL;
> +       vcpu->arch.complete_userspace_io = complete_hypercall_exit;
> +
> +       return 0;
> +}
> +
>  unsigned long __kvm_emulate_hypercall(struct kvm_vcpu *vcpu, unsigned long nr,
>                                       unsigned long a0, unsigned long a1,
>                                       unsigned long a2, unsigned long a3,
> @@ -10225,6 +10255,9 @@ int kvm_emulate_hypercall(struct kvm_vcpu *vcpu)
>         int op_64_bit;
>         int cpl;
>
> +       if (vcpu->kvm->arch.vmware_hypercall_enabled)
> +               return kvm_vmware_hypercall(vcpu);
> +
>         if (kvm_xen_hypercall_enabled(vcpu->kvm))
>                 return kvm_xen_hypercall(vcpu);
>
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index c7b5f1c2ee1c..4c2cc6ed29a0 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -934,6 +934,7 @@ struct kvm_enable_cap {
>  #define KVM_CAP_X86_APIC_BUS_CYCLES_NS 237
>  #define KVM_CAP_X86_GUEST_MODE 238
>  #define KVM_CAP_X86_VMWARE_BACKDOOR 239
> +#define KVM_CAP_X86_VMWARE_HYPERCALL 240
>
>  struct kvm_irq_routing_irqchip {
>         __u32 irqchip;
> --
> 2.43.0
>
Zack Rusin Nov. 5, 2024, 4:59 a.m. UTC | #2
On Mon, Nov 4, 2024 at 5:13 PM Paolo Bonzini <pbonzini@redhat.com> wrote:
>
> On Wed, Oct 30, 2024 at 4:35 AM Zack Rusin <zack.rusin@broadcom.com> wrote:
> >
> > VMware products handle hypercalls in userspace. Give KVM the ability
> > to run VMware guests unmodified by fowarding all hypercalls to the
> > userspace.
> >
> > Enabling of the KVM_CAP_X86_VMWARE_HYPERCALL_ENABLE capability turns
> > the feature on - it's off by default. This allows vmx's built on top
> > of KVM to support VMware specific hypercalls.
>
> Hi Zack,

Hi, Paolo.

Thank you for looking at this.

> is there a spec of the hypercalls that are supported by userspace? I
> would like to understand if there's anything that's best handled in
> the kernel.

There's no spec but we have open headers listing the hypercalls.
There's about a 100 of them (a few were deprecated), the full
list starts here:
https://github.com/vmware/open-vm-tools/blob/739c5a2f4bfd4cdda491e6a6f6869d88c0bd6972/open-vm-tools/lib/include/backdoor_def.h#L97
They're not well documented, but the names are pretty self-explenatory.

> If we allow forwarding _all_ hypercalls to userspace, then people will
> use it for things other than VMware and there goes all hope of
> accelerating stuff in the kernel in the future.
>
> So even having _some_ checks in the kernel before going out to
> userspace would keep that door open, or at least try.

Doug just looked at this and I think I might have an idea on how to
limit the scope at least a bit: if you think it would help we could
limit forwarding of hypercalls to userspace only to those that that
come with a BDOOR_MAGIC (which is 0x564D5868) in eax. Would that help?

> Patch 1 instead looks good from an API point of view.

Ah, great, thanks!

z
Sean Christopherson Nov. 7, 2024, 10:32 p.m. UTC | #3
On Mon, Nov 04, 2024, Zack Rusin wrote:
> On Mon, Nov 4, 2024 at 5:13 PM Paolo Bonzini <pbonzini@redhat.com> wrote:
> >
> > On Wed, Oct 30, 2024 at 4:35 AM Zack Rusin <zack.rusin@broadcom.com> wrote:
> > >
> > > VMware products handle hypercalls in userspace. Give KVM the ability
> > > to run VMware guests unmodified by fowarding all hypercalls to the
> > > userspace.
> > >
> > > Enabling of the KVM_CAP_X86_VMWARE_HYPERCALL_ENABLE capability turns
> > > the feature on - it's off by default. This allows vmx's built on top
> > > of KVM to support VMware specific hypercalls.
> >
> > Hi Zack,
> 
> Hi, Paolo.
> 
> Thank you for looking at this.
> 
> > is there a spec of the hypercalls that are supported by userspace? I
> > would like to understand if there's anything that's best handled in
> > the kernel.
> 
> There's no spec but we have open headers listing the hypercalls.
> There's about a 100 of them (a few were deprecated), the full
> list starts here:
> https://github.com/vmware/open-vm-tools/blob/739c5a2f4bfd4cdda491e6a6f6869d88c0bd6972/open-vm-tools/lib/include/backdoor_def.h#L97
> They're not well documented, but the names are pretty self-explenatory.

At a quick glance, this one needs to be handled in KVM:

  BDOOR_CMD_VCPU_MMIO_HONORS_PAT

and these probably should be in KVM:

  BDOOR_CMD_GETTIME
  BDOOR_CMD_SIDT
  BDOOR_CMD_SGDT
  BDOOR_CMD_SLDT_STR
  BDOOR_CMD_GETTIMEFULL
  BDOOR_CMD_VCPU_LEGACY_X2APIC_OK
  BDOOR_CMD_STEALCLOCK

and these maybe? (it's not clear what they do, from the name alone)

  BDOOR_CMD_GET_VCPU_INFO
  BDOOR_CMD_VCPU_RESERVED

> > If we allow forwarding _all_ hypercalls to userspace, then people will
> > use it for things other than VMware and there goes all hope of
> > accelerating stuff in the kernel in the future.

To some extent, that ship has sailed, no?  E.g. do KVM_XEN_HVM_CONFIG with
KVM_XEN_HVM_CONFIG_INTERCEPT_HCALL set, and userspace can intercept pretty much
all hypercalls with very few side effects.

> > So even having _some_ checks in the kernel before going out to
> > userspace would keep that door open, or at least try.
> 
> Doug just looked at this and I think I might have an idea on how to
> limit the scope at least a bit: if you think it would help we could
> limit forwarding of hypercalls to userspace only to those that that
> come with a BDOOR_MAGIC (which is 0x564D5868) in eax. Would that help?

I don't think it addresses Paolo's concern (if I understood Paolo's concern
correctly), but it would help from the perspective of allowing KVM to support
VMware hypercalls and Xen/Hyper-V/KVM hypercalls in the same VM.

I also think we should add CONFIG_KVM_VMWARE from the get-go, and if we're feeling
lucky, maybe even retroactively bury KVM_CAP_X86_VMWARE_BACKDOOR behind that
Kconfig.  That would allow limiting the exposure to VMware specific code, e.g. if
KVM does end up handling hypercalls in-kernel.  And it might deter abuse to some
extent.
Zack Rusin Nov. 8, 2024, 5:03 a.m. UTC | #4
On Thu, Nov 7, 2024 at 5:32 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Mon, Nov 04, 2024, Zack Rusin wrote:
> > On Mon, Nov 4, 2024 at 5:13 PM Paolo Bonzini <pbonzini@redhat.com> wrote:
> > >
> > > On Wed, Oct 30, 2024 at 4:35 AM Zack Rusin <zack.rusin@broadcom.com> wrote:
> > > >
> > > > VMware products handle hypercalls in userspace. Give KVM the ability
> > > > to run VMware guests unmodified by fowarding all hypercalls to the
> > > > userspace.
> > > >
> > > > Enabling of the KVM_CAP_X86_VMWARE_HYPERCALL_ENABLE capability turns
> > > > the feature on - it's off by default. This allows vmx's built on top
> > > > of KVM to support VMware specific hypercalls.
> > >
> > > Hi Zack,
> >
> > Hi, Paolo.
> >
> > Thank you for looking at this.
> >
> > > is there a spec of the hypercalls that are supported by userspace? I
> > > would like to understand if there's anything that's best handled in
> > > the kernel.
> >
> > There's no spec but we have open headers listing the hypercalls.
> > There's about a 100 of them (a few were deprecated), the full
> > list starts here:
> > https://github.com/vmware/open-vm-tools/blob/739c5a2f4bfd4cdda491e6a6f6869d88c0bd6972/open-vm-tools/lib/include/backdoor_def.h#L97
> > They're not well documented, but the names are pretty self-explenatory.
>
> At a quick glance, this one needs to be handled in KVM:
>
>   BDOOR_CMD_VCPU_MMIO_HONORS_PAT
>
> and these probably should be in KVM:
>
>   BDOOR_CMD_GETTIME
>   BDOOR_CMD_SIDT
>   BDOOR_CMD_SGDT
>   BDOOR_CMD_SLDT_STR
>   BDOOR_CMD_GETTIMEFULL
>   BDOOR_CMD_VCPU_LEGACY_X2APIC_OK
>   BDOOR_CMD_STEALCLOCK
>
> and these maybe? (it's not clear what they do, from the name alone)
>
>   BDOOR_CMD_GET_VCPU_INFO
>   BDOOR_CMD_VCPU_RESERVED

I'm not sure if there's any value in implementing a few of them. iirc
there's 101 of them (as I mentioned a lot have been deprecated but
that's for userspace, on the host we still have to do something for
old guests using them) and, if out of those 101 we implement 100 in
the kernel then, as far as this patch is concerned, it's no different
than if we had 0 out of 101 because we're still going to have to exit
to userspace to handle that 1 remaining.

Unless you're saying that those would be useful to you. In which case
I'd be glad to implement them for you, but I'd put them behind some
kind of a cap or a kernel config because we wouldn't be using them -
besides what Doug mentioned - we already maintain the shared code for
them that's used on Windows, MacOS, ESX and Linux so even if we had
them in the Linux kernel it would still make more sense to use the
code that's shared with the other OSes to lessen the maintenance
burden (so that changing anything within that code consistently
changes across all the OSes).

> > > If we allow forwarding _all_ hypercalls to userspace, then people will
> > > use it for things other than VMware and there goes all hope of
> > > accelerating stuff in the kernel in the future.
>
> To some extent, that ship has sailed, no?  E.g. do KVM_XEN_HVM_CONFIG with
> KVM_XEN_HVM_CONFIG_INTERCEPT_HCALL set, and userspace can intercept pretty much
> all hypercalls with very few side effects.
>
> > > So even having _some_ checks in the kernel before going out to
> > > userspace would keep that door open, or at least try.
> >
> > Doug just looked at this and I think I might have an idea on how to
> > limit the scope at least a bit: if you think it would help we could
> > limit forwarding of hypercalls to userspace only to those that that
> > come with a BDOOR_MAGIC (which is 0x564D5868) in eax. Would that help?
>
> I don't think it addresses Paolo's concern (if I understood Paolo's concern
> correctly), but it would help from the perspective of allowing KVM to support
> VMware hypercalls and Xen/Hyper-V/KVM hypercalls in the same VM.

Yea, I just don't think there's any realistic way we could handle all
of those hypercalls in the kernel so I'm trying to offer some ideas on
how to lessen the scope to make it as painless as possible. Unless you
think we could somehow parlay my piercing blue eyes into getting those
patches in as is, in which case let's do that ;)

> I also think we should add CONFIG_KVM_VMWARE from the get-go, and if we're feeling
> lucky, maybe even retroactively bury KVM_CAP_X86_VMWARE_BACKDOOR behind that
> Kconfig.  That would allow limiting the exposure to VMware specific code, e.g. if
> KVM does end up handling hypercalls in-kernel.  And it might deter abuse to some
> extent.

I thought about that too. I was worried that even if we make it on by
default it will require quite a bit of handholding to make sure all
the distros include it, or otherwise on desktops Workstation still
wouldn't work with KVM by default, I also felt a little silly trying
to add a kernel config for those few lines that would be on pretty
much everywhere and since we didn't implement the vmware backdoor
functionality I didn't want to presume and try to shield a feature
that might be in production by others with a new kernel config.

z
Paolo Bonzini Nov. 9, 2024, 4:44 p.m. UTC | #5
On 11/7/24 23:32, Sean Christopherson wrote:
> On Mon, Nov 04, 2024, Zack Rusin wrote:
>> On Mon, Nov 4, 2024 at 5:13 PM Paolo Bonzini <pbonzini@redhat.com> wrote:
>>>
>>> On Wed, Oct 30, 2024 at 4:35 AM Zack Rusin <zack.rusin@broadcom.com> wrote:
>>>>
>>>> VMware products handle hypercalls in userspace. Give KVM the ability
>>>> to run VMware guests unmodified by fowarding all hypercalls to the
>>>> userspace.
>>>>
>>>> Enabling of the KVM_CAP_X86_VMWARE_HYPERCALL_ENABLE capability turns
>>>> the feature on - it's off by default. This allows vmx's built on top
>>>> of KVM to support VMware specific hypercalls.
>>>
>>> Hi Zack,
>>
>> Hi, Paolo.
>>
>> Thank you for looking at this.
>>
>>> is there a spec of the hypercalls that are supported by userspace? I
>>> would like to understand if there's anything that's best handled in
>>> the kernel.
>>
>> There's no spec but we have open headers listing the hypercalls.
>> There's about a 100 of them (a few were deprecated), the full
>> list starts here:
>> https://github.com/vmware/open-vm-tools/blob/739c5a2f4bfd4cdda491e6a6f6869d88c0bd6972/open-vm-tools/lib/include/backdoor_def.h#L97
>> They're not well documented, but the names are pretty self-explenatory.
> 
> At a quick glance, this one needs to be handled in KVM:
> 
>    BDOOR_CMD_VCPU_MMIO_HONORS_PAT
> 
> and these probably should be in KVM:
> 
>    BDOOR_CMD_GETTIME
>    BDOOR_CMD_SIDT
>    BDOOR_CMD_SGDT
>    BDOOR_CMD_SLDT_STR
>    BDOOR_CMD_GETTIMEFULL
>    BDOOR_CMD_VCPU_LEGACY_X2APIC_OK
>    BDOOR_CMD_STEALCLOCK
> 
> and these maybe? (it's not clear what they do, from the name alone)
> 
>    BDOOR_CMD_GET_VCPU_INFO
>    BDOOR_CMD_VCPU_RESERVED
> 
>>> If we allow forwarding _all_ hypercalls to userspace, then people will
>>> use it for things other than VMware and there goes all hope of
>>> accelerating stuff in the kernel in the future.
> 
> To some extent, that ship has sailed, no?  E.g. do KVM_XEN_HVM_CONFIG with
> KVM_XEN_HVM_CONFIG_INTERCEPT_HCALL set, and userspace can intercept pretty much
> all hypercalls with very few side effects.

Yes, but "pretty much all" is different from "this is a blanket 
hypercall vmexit for you to do as you please".

>>> So even having _some_ checks in the kernel before going out to
>>> userspace would keep that door open, or at least try.
>>
>> Doug just looked at this and I think I might have an idea on how to
>> limit the scope at least a bit: if you think it would help we could
>> limit forwarding of hypercalls to userspace only to those that that
>> come with a BDOOR_MAGIC (which is 0x564D5868) in eax. Would that help?
> 
> I don't think it addresses Paolo's concern (if I understood Paolo's concern
> correctly),

It does alleviate it.  Yeah, it would be just a tiny hurdle for 
userspace to set eax to a specific hex value to get them hypercalls. 
But it is _something_ at least.  It's enough to decrease substantially 
my level of sympathy for whoever does it, and as you point out it's also 
justified in terms of interoperability.

> but it would help from the perspective of allowing KVM to support
> VMware hypercalls and Xen/Hyper-V/KVM hypercalls in the same VM.

That too.  VMware in fact might be interested in reusing Hyper-V 
support.  Zack?

> I also think we should add CONFIG_KVM_VMWARE from the get-go, and if we're feeling
> lucky, maybe even retroactively bury KVM_CAP_X86_VMWARE_BACKDOOR behind that
> Kconfig.  That would allow limiting the exposure to VMware specific code, e.g. if
> KVM does end up handling hypercalls in-kernel.  And it might deter abuse to some
> extent.

A bit of wishful thinking on the last sentence but yes, we should do it.

Also we should have a single cap, KVM_CAP_X86_VMWARE, with flags 
KVM_CAP_X86_VMWARE_{BACKDOOR,HYPERCALL}.  Depending on exact details of 
VMware's spec it may even make sense to split further as in 
KVM_CAP_X86_VMWARE_{IOPORT,PMC,HYPERCALL}.  The I/O port is a bit nasty 
with how it bypasses the TSS; if VMware wanted to deprecate it, I would 
not complain at all.

To sum up:

- new Kconfig symbol hiding all existing VMware code

- new cap, KVM_CAP_X86_VMWARE returning the bits that you can set with 
KVM_ENABLE_CAP.  As in your patch, enable_vmware_backdoor provides a 
default for KVM_CAP_X86_VMWARE when the cap is not enabled, but it is 
generally deprecated.

- enable_vmware_backdoor should *not* enable KVM_CAP_X86_VMWARE_HYPERCALL

Paolo
Paolo Bonzini Nov. 9, 2024, 6:20 p.m. UTC | #6
On 11/8/24 06:03, Zack Rusin wrote:
>>> There's no spec but we have open headers listing the hypercalls.
>>> There's about a 100 of them (a few were deprecated), the full
>>> list starts here:
>>> https://github.com/vmware/open-vm-tools/blob/739c5a2f4bfd4cdda491e6a6f6869d88c0bd6972/open-vm-tools/lib/include/backdoor_def.h#L97
>>> They're not well documented, but the names are pretty self-explenatory.
>>
>> At a quick glance, this one needs to be handled in KVM:
>>
>>    BDOOR_CMD_VCPU_MMIO_HONORS_PAT
>>
>> and these probably should be in KVM:
>>
>>    BDOOR_CMD_GETTIME
>>    BDOOR_CMD_SIDT
>>    BDOOR_CMD_SGDT
>>    BDOOR_CMD_SLDT_STR
>>    BDOOR_CMD_GETTIMEFULL
>>    BDOOR_CMD_VCPU_LEGACY_X2APIC_OK
>>    BDOOR_CMD_STEALCLOCK
>
> I'm not sure if there's any value in implementing a few of them.

The value is that some of these depend on what the hypervisor does, not 
on what userspace does.  For Hypervisor.framework you have a lot of 
leeway, for KVM and Hyper-V less so.

Please understand that adding support for a closed spec is already a bit 
of a tall ask.  We can meet in the middle and make up for the 
closedness, but the way to do it is not technical; it's essentially 
trust.  You are the guys that know the spec and the userspace code best, 
so we trust you to make choices that make technical sense for both KVM 
and VMware.  But without a spec we even have to trust you on what makes 
sense or not to have in the kernel, so we ask you to be... honest about 
that.

One important point is that from the KVM maintainers' point of view, the 
feature you're adding might be used by others and not just VMware 
Workstation.  Microsoft and Apple might see things differently (Apple in 
particular has a much thinner wrapper around the processor's 
virtualization capbilities).

> iirc
> there's 101 of them (as I mentioned a lot have been deprecated but
> that's for userspace, on the host we still have to do something for
> old guests using them) and, if out of those 101 we implement 100 in
> the kernel then, as far as this patch is concerned, it's no different
> than if we had 0 out of 101 because we're still going to have to exit
> to userspace to handle that 1 remaining.
> 
> Unless you're saying that those would be useful to you. In which case
> I'd be glad to implement them for you, but I'd put them behind some
> kind of a cap or a kernel config because we wouldn't be using them -

Actually we'd ask you to _not_ put them behind a cap, and live with the 
kernel implementation.  Obviously that's not a requirement for all the 
100+ hypercalls, only for those where it makes sense.

> besides what Doug mentioned - we already maintain the shared code for
> them that's used on Windows, MacOS, ESX and Linux so even if we had
> them in the Linux kernel it would still make more sense to use the
> code that's shared with the other OSes to lessen the maintenance
> burden (so that changing anything within that code consistently
> changes across all the OSes).

If some of them can have shared code across all OSes, then that's a good 
sign that they do not belong in the kernel.  On the other hand, if the 
code is specific to Windows/macOS/ESX/Linux, and maybe it even calls 
into low-level Hypervisor.framework APIs on macOS, then it's possible or 
even likely that the best implementation for Linux is "just assume that 
KVM will do it" and assert(0).

In yet other cases (maybe those SGDT/SLDT/STR/SIDT ones??), if the code 
that you have for Linux is "just do this KVM ioctl to do it", it may 
provide better performance if you save the roundtrip to userspace and 
back.  If KVM is the best performing hypervisor for VMware Workstation, 
then we're happy, :) and if you have some performance issue we want to 
help you too.

A related topic is that a good implementation, equivalent to what the 
proprietary hypervisor implemented, might require adding a ioctl to 
query something that KVM currently does not provide (maybe the current 
steal clock? IIRC it's only available via a Xen ioctl, not a generic 
one).  In that case you'd need to contribute that extra API.  Doing that 
now is easier for both you guys and the KVM maintainers, so that's 
another reason to go through the list and share your findings.

Anyway, one question apart from this: is the API the same for the I/O 
port and hypercall backdoors?

>> I don't think it addresses Paolo's concern (if I understood Paolo's concern
>> correctly), but it would help from the perspective of allowing KVM to support
>> VMware hypercalls and Xen/Hyper-V/KVM hypercalls in the same VM.
> 
> Yea, I just don't think there's any realistic way we could handle all
> of those hypercalls in the kernel so I'm trying to offer some ideas on
> how to lessen the scope to make it as painless as possible. Unless you
> think we could somehow parlay my piercing blue eyes into getting those
> patches in as is, in which case let's do that ;)

Unlikely :) but it's not in bad shape at all!  The main remaining 
discussion point is the subset of hypercalls that need support in the 
kernel (either as a kernel implementation, or as a new ioctl). 
Hopefully the above guidelines will help you.

>> I also think we should add CONFIG_KVM_VMWARE from the get-go, and if we're feeling
>> lucky, maybe even retroactively bury KVM_CAP_X86_VMWARE_BACKDOOR behind that
>> Kconfig.  That would allow limiting the exposure to VMware specific code, e.g. if
>> KVM does end up handling hypercalls in-kernel.  And it might deter abuse to some
>> extent.
> 
> I thought about that too. I was worried that even if we make it on by
> default it will require quite a bit of handholding to make sure all
> the distros include it, or otherwise on desktops Workstation still
> wouldn't work with KVM by default, I also felt a little silly trying
> to add a kernel config for those few lines that would be on pretty
> much everywhere and since we didn't implement the vmware backdoor
> functionality I didn't want to presume and try to shield a feature
> that might be in production by others with a new kernel config.
We don't have a huge number of such knobs but based on experience I 
expect that it will be turned off only by cloud providers or appliance 
manufacturers that want to reduce the attack surface.  If it's enabled 
by default, distros will generally leave it on.  You can also add "If 
unsure, say Y" to the help message as we already do in several cases.(*)

In fact, if someone wants to turn it off, they will send the patch 
themselves to add CONFIG_KVM_VMWARE and it will be accepted.  So we 
might as well ask for it from the start. :)

Thanks,

Paolo

(*) In fact I am wondering if we should flip the default for Xen, in the 
beginning it was just an Amazon thing but since then David has 
contributed support in QEMU and CI.  To be clear, I am *not* asking 
VMware for anything but selftests to make CONFIG_KVM_VMWARE default to 
enabled.
Doug Covelli Nov. 9, 2024, 9:11 p.m. UTC | #7
On Sat, Nov 9, 2024 at 1:20 PM Paolo Bonzini <pbonzini@redhat.com> wrote:
>
> On 11/8/24 06:03, Zack Rusin wrote:
> >>> There's no spec but we have open headers listing the hypercalls.
> >>> There's about a 100 of them (a few were deprecated), the full
> >>> list starts here:
> >>> https://github.com/vmware/open-vm-tools/blob/739c5a2f4bfd4cdda491e6a6f6869d88c0bd6972/open-vm-tools/lib/include/backdoor_def.h#L97
> >>> They're not well documented, but the names are pretty self-explenatory.
> >>
> >> At a quick glance, this one needs to be handled in KVM:
> >>
> >>    BDOOR_CMD_VCPU_MMIO_HONORS_PAT
> >>
> >> and these probably should be in KVM:
> >>
> >>    BDOOR_CMD_GETTIME
> >>    BDOOR_CMD_SIDT
> >>    BDOOR_CMD_SGDT
> >>    BDOOR_CMD_SLDT_STR
> >>    BDOOR_CMD_GETTIMEFULL
> >>    BDOOR_CMD_VCPU_LEGACY_X2APIC_OK
> >>    BDOOR_CMD_STEALCLOCK
> >
> > I'm not sure if there's any value in implementing a few of them.
>
> The value is that some of these depend on what the hypervisor does, not
> on what userspace does.  For Hypervisor.framework you have a lot of
> leeway, for KVM and Hyper-V less so.
>
> Please understand that adding support for a closed spec is already a bit
> of a tall ask.  We can meet in the middle and make up for the
> closedness, but the way to do it is not technical; it's essentially
> trust.  You are the guys that know the spec and the userspace code best,
> so we trust you to make choices that make technical sense for both KVM
> and VMware.  But without a spec we even have to trust you on what makes
> sense or not to have in the kernel, so we ask you to be... honest about
> that.
>
> One important point is that from the KVM maintainers' point of view, the
> feature you're adding might be used by others and not just VMware
> Workstation.  Microsoft and Apple might see things differently (Apple in
> particular has a much thinner wrapper around the processor's
> virtualization capbilities).
>
> > iirc
> > there's 101 of them (as I mentioned a lot have been deprecated but
> > that's for userspace, on the host we still have to do something for
> > old guests using them) and, if out of those 101 we implement 100 in
> > the kernel then, as far as this patch is concerned, it's no different
> > than if we had 0 out of 101 because we're still going to have to exit
> > to userspace to handle that 1 remaining.
> >
> > Unless you're saying that those would be useful to you. In which case
> > I'd be glad to implement them for you, but I'd put them behind some
> > kind of a cap or a kernel config because we wouldn't be using them -
>
> Actually we'd ask you to _not_ put them behind a cap, and live with the
> kernel implementation.  Obviously that's not a requirement for all the
> 100+ hypercalls, only for those where it makes sense.
>
> > besides what Doug mentioned - we already maintain the shared code for
> > them that's used on Windows, MacOS, ESX and Linux so even if we had
> > them in the Linux kernel it would still make more sense to use the
> > code that's shared with the other OSes to lessen the maintenance
> > burden (so that changing anything within that code consistently
> > changes across all the OSes).
>
> If some of them can have shared code across all OSes, then that's a good
> sign that they do not belong in the kernel.  On the other hand, if the
> code is specific to Windows/macOS/ESX/Linux, and maybe it even calls
> into low-level Hypervisor.framework APIs on macOS, then it's possible or
> even likely that the best implementation for Linux is "just assume that
> KVM will do it" and assert(0).
>
> In yet other cases (maybe those SGDT/SLDT/STR/SIDT ones??), if the code
> that you have for Linux is "just do this KVM ioctl to do it", it may
> provide better performance if you save the roundtrip to userspace and
> back.  If KVM is the best performing hypervisor for VMware Workstation,
> then we're happy, :) and if you have some performance issue we want to
> help you too.

Appreciate the concern about performance however I don't think it is
something we should worry about.  Even with our existing VMM, which
runs at CPL0, all of these backdoor calls are handled by userspace
which means they are very slow (~28K cycles overhead on my Zen2) and
are not used in any perf critical code (if they were we would have
handled them at CPL0). Running on KVM the overhead is significantly
less.

As for the SGDT/SLDT/STR/SIDT backdoor calls these were added > 20
years ago for SW that used these instructions from CPL3 which did not
work well before VT/SVM were introduced.  These are really of no use
on modern CPUs and will be blocked if the guest OS has enabled UMIP.
Adding support for these to the KVM code would be a bit of a waste
IMHO  I have no objection to adding support for handling some backdoor
calls in the kernel if we find ones where it would be advantageous to
do so I'm just not aware of any where this would be the caase..

> A related topic is that a good implementation, equivalent to what the
> proprietary hypervisor implemented, might require adding a ioctl to
> query something that KVM currently does not provide (maybe the current
> steal clock? IIRC it's only available via a Xen ioctl, not a generic
> one).  In that case you'd need to contribute that extra API.  Doing that
> now is easier for both you guys and the KVM maintainers, so that's
> another reason to go through the list and share your findings.

For stolen time the backdoor call is used to enable the functionality
not to get/set the stolen time.  I agree that we would probably want
to do something KVM specific for this one however this is currently
really only supported by ESX (and only currently used by Photon OS) so
I don't think adding that support to KVM is critical.

> Anyway, one question apart from this: is the API the same for the I/O
> port and hypercall backdoors?

Yeah the calls and arguments are the same.  The hypercall based
interface is an attempt to modernize the backdoor since as you pointed
out the I/O based interface is kind of hacky as it bypasses the normal
checks for an I/O port access at CPL3.  It would be nice to get rid of
it but unfortunately I don't think that will happen in the foreseeable
future as there are a lot of existing VMs out there with older SW that
still uses this interface.

> >> I don't think it addresses Paolo's concern (if I understood Paolo's concern
> >> correctly), but it would help from the perspective of allowing KVM to support
> >> VMware hypercalls and Xen/Hyper-V/KVM hypercalls in the same VM.
> >
> > Yea, I just don't think there's any realistic way we could handle all
> > of those hypercalls in the kernel so I'm trying to offer some ideas on
> > how to lessen the scope to make it as painless as possible. Unless you
> > think we could somehow parlay my piercing blue eyes into getting those
> > patches in as is, in which case let's do that ;)
>
> Unlikely :) but it's not in bad shape at all!  The main remaining
> discussion point is the subset of hypercalls that need support in the
> kernel (either as a kernel implementation, or as a new ioctl).
> Hopefully the above guidelines will help you.
>
> >> I also think we should add CONFIG_KVM_VMWARE from the get-go, and if we're feeling
> >> lucky, maybe even retroactively bury KVM_CAP_X86_VMWARE_BACKDOOR behind that
> >> Kconfig.  That would allow limiting the exposure to VMware specific code, e.g. if
> >> KVM does end up handling hypercalls in-kernel.  And it might deter abuse to some
> >> extent.
> >
> > I thought about that too. I was worried that even if we make it on by
> > default it will require quite a bit of handholding to make sure all
> > the distros include it, or otherwise on desktops Workstation still
> > wouldn't work with KVM by default, I also felt a little silly trying
> > to add a kernel config for those few lines that would be on pretty
> > much everywhere and since we didn't implement the vmware backdoor
> > functionality I didn't want to presume and try to shield a feature
> > that might be in production by others with a new kernel config.
> We don't have a huge number of such knobs but based on experience I
> expect that it will be turned off only by cloud providers or appliance
> manufacturers that want to reduce the attack surface.  If it's enabled
> by default, distros will generally leave it on.  You can also add "If
> unsure, say Y" to the help message as we already do in several cases.(*)
>
> In fact, if someone wants to turn it off, they will send the patch
> themselves to add CONFIG_KVM_VMWARE and it will be accepted.  So we
> might as well ask for it from the start. :)
>
> Thanks,
>
> Paolo
>
> (*) In fact I am wondering if we should flip the default for Xen, in the
> beginning it was just an Amazon thing but since then David has
> contributed support in QEMU and CI.  To be clear, I am *not* asking
> VMware for anything but selftests to make CONFIG_KVM_VMWARE default to
> enabled.
>
Paolo Bonzini Nov. 11, 2024, 6:49 p.m. UTC | #8
On 11/9/24 22:11, Doug Covelli wrote:
> On Sat, Nov 9, 2024 at 1:20 PM Paolo Bonzini <pbonzini@redhat.com> wrote:
>> On 11/8/24 06:03, Zack Rusin wrote:
>>>>> There's no spec but we have open headers listing the hypercalls.
>>>>> There's about a 100 of them (a few were deprecated), the full
>>>>> list starts here:
>>>>> https://github.com/vmware/open-vm-tools/blob/739c5a2f4bfd4cdda491e6a6f6869d88c0bd6972/open-vm-tools/lib/include/backdoor_def.h#L97
>>>>> They're not well documented, but the names are pretty self-explenatory.
>>>>
>>>> At a quick glance, this one needs to be handled in KVM:
>>>>
>>>>     BDOOR_CMD_VCPU_MMIO_HONORS_PAT
>>>>
>>>> and these probably should be in KVM:
>>>>
>>>>     BDOOR_CMD_GETTIME
>>>>     BDOOR_CMD_SIDT
>>>>     BDOOR_CMD_SGDT
>>>>     BDOOR_CMD_SLDT_STR
>>>>     BDOOR_CMD_GETTIMEFULL
>>>>     BDOOR_CMD_VCPU_LEGACY_X2APIC_OK
>>>>     BDOOR_CMD_STEALCLOCK
>>>
>>> I'm not sure if there's any value in implementing a few of them.
>>
>> The value is that some of these depend on what the hypervisor does, not
>> on what userspace does.  For Hypervisor.framework you have a lot of
>> leeway, for KVM and Hyper-V less so. [..] From the KVM maintainers'
>> point of view, the feature you're adding might be used by others and
>> not just VMware Workstation.  Microsoft and Apple might see things
>> differently (Apple in particular has a much thinner wrapper around
>> the processor's virtualization capbilities).
> 
> [...]
>
> the SGDT/SLDT/STR/SIDT backdoor calls these were added > 20
> years ago for SW that used these instructions from CPL3 which did not
> work well before VT/SVM were introduced.  These are really of no use
> on modern CPUs and will be blocked if the guest OS has enabled UMIP.
> [...]
> 
> For stolen time the backdoor call is [...] currently
> really only supported by ESX (and only currently used by Photon OS) so
> I don't think adding that support to KVM is critical.

Sounds good.  All I want is ensuring that someone with access to the 
spec did the exercise.

Still guessing, but for MMIO_HONORS_PAT we probably want to add a 
separate KVM_CHECK_EXTENSION capability.

Is BDOOR_CMD_VCPU_LEGACY_X2APIC_OK something where you can just return a 
constant?

This leaves just GETTIME and GETTIMEFULL.  If four hypercalls require 
some care in the hypervisor (which may or may not be an in-kernel 
implementation), that's not bad.  Can you share a bit more about these four?

>> Anyway, one question apart from this: is the API the same for the I/O
>> port and hypercall backdoors?
> 
> Yeah the calls and arguments are the same.  The hypercall based
> interface is an attempt to modernize the backdoor since as you pointed
> out the I/O based interface is kind of hacky as it bypasses the normal
> checks for an I/O port access at CPL3.  It would be nice to get rid of
> it but unfortunately I don't think that will happen in the foreseeable
> future as there are a lot of existing VMs out there with older SW that
> still uses this interface.

Yeah, but I think it still justifies that the KVM_ENABLE_CAP API can 
enable the hypercall but not the I/O port.

Paolo
Doug Covelli Nov. 11, 2024, 8:55 p.m. UTC | #9
On Mon, Nov 11, 2024 at 1:49 PM Paolo Bonzini <pbonzini@redhat.com> wrote:
>
> On 11/9/24 22:11, Doug Covelli wrote:
> > On Sat, Nov 9, 2024 at 1:20 PM Paolo Bonzini <pbonzini@redhat.com> wrote:
> >> On 11/8/24 06:03, Zack Rusin wrote:
> >>>>> There's no spec but we have open headers listing the hypercalls.
> >>>>> There's about a 100 of them (a few were deprecated), the full
> >>>>> list starts here:
> >>>>> https://github.com/vmware/open-vm-tools/blob/739c5a2f4bfd4cdda491e6a6f6869d88c0bd6972/open-vm-tools/lib/include/backdoor_def.h#L97
> >>>>> They're not well documented, but the names are pretty self-explenatory.
> >>>>
> >>>> At a quick glance, this one needs to be handled in KVM:
> >>>>
> >>>>     BDOOR_CMD_VCPU_MMIO_HONORS_PAT
> >>>>
> >>>> and these probably should be in KVM:
> >>>>
> >>>>     BDOOR_CMD_GETTIME
> >>>>     BDOOR_CMD_SIDT
> >>>>     BDOOR_CMD_SGDT
> >>>>     BDOOR_CMD_SLDT_STR
> >>>>     BDOOR_CMD_GETTIMEFULL
> >>>>     BDOOR_CMD_VCPU_LEGACY_X2APIC_OK
> >>>>     BDOOR_CMD_STEALCLOCK
> >>>
> >>> I'm not sure if there's any value in implementing a few of them.
> >>
> >> The value is that some of these depend on what the hypervisor does, not
> >> on what userspace does.  For Hypervisor.framework you have a lot of
> >> leeway, for KVM and Hyper-V less so. [..] From the KVM maintainers'
> >> point of view, the feature you're adding might be used by others and
> >> not just VMware Workstation.  Microsoft and Apple might see things
> >> differently (Apple in particular has a much thinner wrapper around
> >> the processor's virtualization capbilities).
> >
> > [...]
> >
> > the SGDT/SLDT/STR/SIDT backdoor calls these were added > 20
> > years ago for SW that used these instructions from CPL3 which did not
> > work well before VT/SVM were introduced.  These are really of no use
> > on modern CPUs and will be blocked if the guest OS has enabled UMIP.
> > [...]
> >
> > For stolen time the backdoor call is [...] currently
> > really only supported by ESX (and only currently used by Photon OS) so
> > I don't think adding that support to KVM is critical.
>
> Sounds good.  All I want is ensuring that someone with access to the
> spec did the exercise.
>
> Still guessing, but for MMIO_HONORS_PAT we probably want to add a
> separate KVM_CHECK_EXTENSION capability.
>
> Is BDOOR_CMD_VCPU_LEGACY_X2APIC_OK something where you can just return a
> constant?
>
> This leaves just GETTIME and GETTIMEFULL.  If four hypercalls require
> some care in the hypervisor (which may or may not be an in-kernel
> implementation), that's not bad.  Can you share a bit more about these four?

BDOOR_CMD_VCPU_MMIO_HONORS_PAT and BDOOR_CMD_VCPU_LEGACY_X2APIC_OK are not
actually backdoor calls - they are flags returned by BDOOR_CMD_GET_VCPU_INFO.

BDOOR_CMD_VCPU_MMIO_HONORS_PAT is only ever set to 1 on ESX as it is only
relevant for PCI passthru which is not supported on Linux/Windows/macOS.  IIRC
this was added over 10 years ago for some Infiniband device vendor to use in
their driver although I'm not sure that ever materialized.

BDOOR_CMD_VCPU_LEGACY_X2APIC_OK indicates if it is OK to use x2APIC w/o
interrupt remapping (e.g a virtual IOMMU).  I'm not sure if KVM supports this
but I think this one can be set to TRUE unconditionally as we have no plans to
use KVM_CREATE_IRQCHIP - if anything we would use KVM_CAP_SPLIT_IRQCHIP although
my preference would be to handle all APIC/IOAPIC/PIC emulation ourselves
provided we can avoid CR8 exits but that is another discussion.

For now I think it makes sense to handle BDOOR_CMD_GET_VCPU_INFO at userlevel
like we do on Windows and macOS.

BDOOR_CMD_GETTIME/BDOOR_CMD_GETTIMEFULL are similar with the former being
deprecated in favor of the latter.  Both do essentially the same thing which is
to return the host OS's time - on Linux this is obtained via gettimeofday.  I
believe this is mainly used by tools to fix up the VM's time when resuming from
suspend.  I think it is fine to continue handling these at userlevel.

Doug

> >> Anyway, one question apart from this: is the API the same for the I/O
> >> port and hypercall backdoors?
> >
> > Yeah the calls and arguments are the same.  The hypercall based
> > interface is an attempt to modernize the backdoor since as you pointed
> > out the I/O based interface is kind of hacky as it bypasses the normal
> > checks for an I/O port access at CPL3.  It would be nice to get rid of
> > it but unfortunately I don't think that will happen in the foreseeable
> > future as there are a lot of existing VMs out there with older SW that
> > still uses this interface.
>
> Yeah, but I think it still justifies that the KVM_ENABLE_CAP API can
> enable the hypercall but not the I/O port.
>
> Paolo
>
Paolo Bonzini Nov. 12, 2024, 5:44 p.m. UTC | #10
Il lun 11 nov 2024, 21:55 Doug Covelli <doug.covelli@broadcom.com> ha scritto:
>
> BDOOR_CMD_VCPU_MMIO_HONORS_PAT and BDOOR_CMD_VCPU_LEGACY_X2APIC_OK are not
> actually backdoor calls - they are flags returned by BDOOR_CMD_GET_VCPU_INFO.
>
> BDOOR_CMD_VCPU_MMIO_HONORS_PAT is only ever set to 1 on ESX as it is only
> relevant for PCI passthru which is not supported on Linux/Windows/macOS.  IIRC
> this was added over 10 years ago for some Infiniband device vendor to use in
> their driver although I'm not sure that ever materialized.

Ok. So I guess false is safe.

> BDOOR_CMD_VCPU_LEGACY_X2APIC_OK indicates if it is OK to use x2APIC w/o
> interrupt remapping (e.g a virtual IOMMU).  I'm not sure if KVM supports this
> but I think this one can be set to TRUE unconditionally as we have no plans to
> use KVM_CREATE_IRQCHIP - if anything we would use KVM_CAP_SPLIT_IRQCHIP although
> my preference would be to handle all APIC/IOAPIC/PIC emulation ourselves
> provided we can avoid CR8 exits but that is another discussion.

Split irqchip should be the best tradeoff. Without it, moves from cr8
stay in the kernel, but moves to cr8 always go to userspace with a
KVM_EXIT_SET_TPR exit. You also won't be able to use Intel
flexpriority (in-processor accelerated TPR) because KVM does not know
which bits are set in IRR. So it will be *really* every move to cr8
that goes to userspace.

> For now I think it makes sense to handle BDOOR_CMD_GET_VCPU_INFO at userlevel
> like we do on Windows and macOS.
>
> BDOOR_CMD_GETTIME/BDOOR_CMD_GETTIMEFULL are similar with the former being
> deprecated in favor of the latter.  Both do essentially the same thing which is
> to return the host OS's time - on Linux this is obtained via gettimeofday.  I
> believe this is mainly used by tools to fix up the VM's time when resuming from
> suspend.  I think it is fine to continue handling these at userlevel.

As long as the TSC is not involved it should be okay.

Paolo

> > >> Anyway, one question apart from this: is the API the same for the I/O
> > >> port and hypercall backdoors?
> > >
> > > Yeah the calls and arguments are the same.  The hypercall based
> > > interface is an attempt to modernize the backdoor since as you pointed
> > > out the I/O based interface is kind of hacky as it bypasses the normal
> > > checks for an I/O port access at CPL3.  It would be nice to get rid of
> > > it but unfortunately I don't think that will happen in the foreseeable
> > > future as there are a lot of existing VMs out there with older SW that
> > > still uses this interface.
> >
> > Yeah, but I think it still justifies that the KVM_ENABLE_CAP API can
> > enable the hypercall but not the I/O port.
> >
> > Paolo
Doug Covelli Nov. 12, 2024, 8:44 p.m. UTC | #11
On Tue, Nov 12, 2024 at 12:44 PM Paolo Bonzini <pbonzini@redhat.com> wrote:
>
> Il lun 11 nov 2024, 21:55 Doug Covelli <doug.covelli@broadcom.com> ha scritto:
> >
> > BDOOR_CMD_VCPU_MMIO_HONORS_PAT and BDOOR_CMD_VCPU_LEGACY_X2APIC_OK are not
> > actually backdoor calls - they are flags returned by BDOOR_CMD_GET_VCPU_INFO.
> >
> > BDOOR_CMD_VCPU_MMIO_HONORS_PAT is only ever set to 1 on ESX as it is only
> > relevant for PCI passthru which is not supported on Linux/Windows/macOS.  IIRC
> > this was added over 10 years ago for some Infiniband device vendor to use in
> > their driver although I'm not sure that ever materialized.
>
> Ok. So I guess false is safe.
>
> > BDOOR_CMD_VCPU_LEGACY_X2APIC_OK indicates if it is OK to use x2APIC w/o
> > interrupt remapping (e.g a virtual IOMMU).  I'm not sure if KVM supports this
> > but I think this one can be set to TRUE unconditionally as we have no plans to
> > use KVM_CREATE_IRQCHIP - if anything we would use KVM_CAP_SPLIT_IRQCHIP although
> > my preference would be to handle all APIC/IOAPIC/PIC emulation ourselves
> > provided we can avoid CR8 exits but that is another discussion.
>
> Split irqchip should be the best tradeoff. Without it, moves from cr8
> stay in the kernel, but moves to cr8 always go to userspace with a
> KVM_EXIT_SET_TPR exit. You also won't be able to use Intel
> flexpriority (in-processor accelerated TPR) because KVM does not know
> which bits are set in IRR. So it will be *really* every move to cr8
> that goes to userspace.

Sorry to hijack this thread but is there a technical reason not to allow CR8
based accesses to the TPR (not MMIO accesses) when the in-kernel local APIC is
not in use?  Both MSFT's WHP and Apple's hypervisor framework allow this and it
seems like it would be generally useful for any Hypervisor that does not want to
use the in-kernel APIC but still want to run Windows guests with decent
performance.

When we switched to WHP the biggest source of problems by far was from trying
to integrate our monitor with MSFT's APIC emulation code.  Even if we do want
to use the KVM in-kernel APIC at some point in the future it is still nice to
be able to fall back on our own APIC emulation code if necessary.

Also I could not find these documented anywhere but with MSFT's APIC our monitor
relies on extensions for trapping certain events such as INIT/SIPI plus LINT0
and SVR writes:

UINT64 X64ApicInitSipiExitTrap    : 1; // WHvRunVpExitReasonX64ApicInitSipiTrap
UINT64 X64ApicWriteLint0ExitTrap  : 1; // WHvRunVpExitReasonX64ApicWriteTrap
UINT64 X64ApicWriteLint1ExitTrap  : 1; // WHvRunVpExitReasonX64ApicWriteTrap
UINT64 X64ApicWriteSvrExitTrap    : 1; // WHvRunVpExitReasonX64ApicWriteTrap

I did not see any similar functionality for KVM.  Does anything like that exist?
In any case we would be happy to add support for handling CR8 accesses w/o
exiting w/o the in-kernel APIC along with some sort of a way to configure the
TPR threshold if folks are not opposed to that.

Doug

> > For now I think it makes sense to handle BDOOR_CMD_GET_VCPU_INFO at userlevel
> > like we do on Windows and macOS.
> >
> > BDOOR_CMD_GETTIME/BDOOR_CMD_GETTIMEFULL are similar with the former being
> > deprecated in favor of the latter.  Both do essentially the same thing which is
> > to return the host OS's time - on Linux this is obtained via gettimeofday.  I
> > believe this is mainly used by tools to fix up the VM's time when resuming from
> > suspend.  I think it is fine to continue handling these at userlevel.
>
> As long as the TSC is not involved it should be okay.
>
> Paolo
>
> > > >> Anyway, one question apart from this: is the API the same for the I/O
> > > >> port and hypercall backdoors?
> > > >
> > > > Yeah the calls and arguments are the same.  The hypercall based
> > > > interface is an attempt to modernize the backdoor since as you pointed
> > > > out the I/O based interface is kind of hacky as it bypasses the normal
> > > > checks for an I/O port access at CPL3.  It would be nice to get rid of
> > > > it but unfortunately I don't think that will happen in the foreseeable
> > > > future as there are a lot of existing VMs out there with older SW that
> > > > still uses this interface.
> > >
> > > Yeah, but I think it still justifies that the KVM_ENABLE_CAP API can
> > > enable the hypercall but not the I/O port.
> > >
> > > Paolo
>
Sean Christopherson Nov. 13, 2024, 4:04 p.m. UTC | #12
On Wed, Nov 13, 2024, Paolo Bonzini wrote:
> Il mar 12 nov 2024, 21:44 Doug Covelli <doug.covelli@broadcom.com> ha
> scritto:
> 
> > > Split irqchip should be the best tradeoff. Without it, moves from cr8
> > > stay in the kernel, but moves to cr8 always go to userspace with a
> > > KVM_EXIT_SET_TPR exit. You also won't be able to use Intel
> > > flexpriority (in-processor accelerated TPR) because KVM does not know
> > > which bits are set in IRR. So it will be *really* every move to cr8
> > > that goes to userspace.
> >
> > Sorry to hijack this thread but is there a technical reason not to allow
> > CR8 based accesses to the TPR (not MMIO accesses) when the in-kernel local
> > APIC is not in use?
> 
> No worries, you're not hijacking :) The only reason is that it would be
> more code for a seldom used feature and anyway with worse performance. (To
> be clear, CR8 based accesses are allowed, but stores cause an exit in order
> to check the new TPR against IRR. That's because KVM's API does not have an
> equivalent of the TPR threshold as you point out below).
> 
> > Also I could not find these documented anywhere but with MSFT's APIC our
> > monitor relies on extensions for trapping certain events such as INIT/SIPI
> > plus LINT0 and SVR writes:
> >
> > UINT64 X64ApicInitSipiExitTrap    : 1; //
> > WHvRunVpExitReasonX64ApicInitSipiTrap
> > UINT64 X64ApicWriteLint0ExitTrap  : 1; //
> > WHvRunVpExitReasonX64ApicWriteTrap
> > UINT64 X64ApicWriteLint1ExitTrap  : 1; //
> > WHvRunVpExitReasonX64ApicWriteTrap
> > UINT64 X64ApicWriteSvrExitTrap    : 1; //
> > WHvRunVpExitReasonX64ApicWriteTrap
> >
> 
> There's no need for this in KVM's in-kernel APIC model. INIT and SIPI are
> handled in the hypervisor and you can get the current state of APs via
> KVM_GET_MPSTATE. LINT0 and LINT1 are injected with KVM_INTERRUPT and
> KVM_NMI respectively, and they obey IF/PPR and NMI blocking respectively,
> plus the interrupt shadow; so there's no need for userspace to know when
> LINT0/LINT1 themselves change. The spurious interrupt vector register is
> also handled completely in kernel.
> 
> > I did not see any similar functionality for KVM.  Does anything like that
> > exist?  In any case we would be happy to add support for handling CR8
> > accesses w/o exiting w/o the in-kernel APIC along with some sort of a way
> > to configure the TPR threshold if folks are not opposed to that.
> 
> As far I know everybody who's using KVM (whether proprietary or open
> source) has had no need for that, so I don't think it's a good idea to make
> the API more complex. 

+1

> Performance of Windows guests is going to be bad anyway with userspace APIC.

Heh, on modern hardware, performance of any guest is going to suck with a userspace
APIC, compared to what is possible with an in-kernel APIC.

More importantly, I really, really don't want to encourage non-trivial usage of
a local APIC in userspace.  KVM's support for a userspace local APIC is very
poorly tested these days.   I have zero desire to spend any amount of time reviewing
and fixing issues that are unique to emulating the local APIC in userspace.  And
long term, I would love to force an in-kernel local APIC, though I don't know if
that's entirely feasible.
Doug Covelli Nov. 13, 2024, 4:24 p.m. UTC | #13
On Wed, Nov 13, 2024 at 2:31 AM Paolo Bonzini <pbonzini@redhat.com> wrote:
>
>
>
> Il mar 12 nov 2024, 21:44 Doug Covelli <doug.covelli@broadcom.com> ha scritto:
>>
>> > Split irqchip should be the best tradeoff. Without it, moves from cr8
>> > stay in the kernel, but moves to cr8 always go to userspace with a
>> > KVM_EXIT_SET_TPR exit. You also won't be able to use Intel
>> > flexpriority (in-processor accelerated TPR) because KVM does not know
>> > which bits are set in IRR. So it will be *really* every move to cr8
>> > that goes to userspace.
>>
>> Sorry to hijack this thread but is there a technical reason not to allow CR8
>> based accesses to the TPR (not MMIO accesses) when the in-kernel local APIC is
>> not in use?
>
>
> No worries, you're not hijacking :) The only reason is that it would be more code for a seldom used feature and anyway with worse performance. (To be clear, CR8 based accesses are allowed, but stores cause an exit in order to check the new TPR against IRR. That's because KVM's API does not have an equivalent of the TPR threshold as you point out below).

I have not really looked at the code but it seems like it could also
simplify things as CR8 would be handled more uniformly regardless of
who is virtualizing the local APIC.

>> Also I could not find these documented anywhere but with MSFT's APIC our monitor
>> relies on extensions for trapping certain events such as INIT/SIPI plus LINT0
>> and SVR writes:
>>
>> UINT64 X64ApicInitSipiExitTrap    : 1; // WHvRunVpExitReasonX64ApicInitSipiTrap
>> UINT64 X64ApicWriteLint0ExitTrap  : 1; // WHvRunVpExitReasonX64ApicWriteTrap
>> UINT64 X64ApicWriteLint1ExitTrap  : 1; // WHvRunVpExitReasonX64ApicWriteTrap
>> UINT64 X64ApicWriteSvrExitTrap    : 1; // WHvRunVpExitReasonX64ApicWriteTrap
>
>
> There's no need for this in KVM's in-kernel APIC model. INIT and SIPI are handled in the hypervisor and you can get the current state of APs via KVM_GET_MPSTATE. LINT0 and LINT1 are injected with KVM_INTERRUPT and KVM_NMI respectively, and they obey IF/PPR and NMI blocking respectively, plus the interrupt shadow; so there's no need for userspace to know when LINT0/LINT1 themselves change. The spurious interrupt vector register is also handled completely in kernel.

I realize that KVM can handle LINT0/SVR updates themselves but our
interrupt subsystem relies on knowing the current values of these
registers even when not virtualizing the local APIC.  I suppose we
could use KVM_GET_LAPIC to sync things up on demand but that seems
like it might nor be great from a performance point of view.

>> I did not see any similar functionality for KVM.  Does anything like that exist?
>> In any case we would be happy to add support for handling CR8 accesses w/o
>> exiting w/o the in-kernel APIC along with some sort of a way to configure the
>> TPR threshold if folks are not opposed to that.
>
>
> As far I know everybody who's using KVM (whether proprietary or open source) has had no need for that, so I don't think it's a good idea to make the API more complex. Performance of Windows guests is going to be bad anyway with userspace APIC.

From what I have seen the exit cost with KVM is significantly lower
than with WHP/Hyper-V.  I don't think performance of Windows guests
with userspace APIC emulation would be bad if CR8 exits could be
avoided (Linux guests perf isn't bad from what I have observed and the
main difference is the astronomical number of CR8 exits).  It seems
like it would be pretty decent although I agree if you want the
absolute best performance then you would want to use the in kernel
APIC to speed up handling of ICR/EOI writes but those are relatively
infrequent compared to CR8 accesses .

Anyway I just saw Sean's response while writing this and it seems he
is not in favor of avoiding CR8 exits w/o the in kernel APIC either so
I suppose we will have to look into making use of the in kernel APIC.

Doug

> Paolo
>
>> Doug
>>
>> > > For now I think it makes sense to handle BDOOR_CMD_GET_VCPU_INFO at userlevel
>> > > like we do on Windows and macOS.
>> > >
>> > > BDOOR_CMD_GETTIME/BDOOR_CMD_GETTIMEFULL are similar with the former being
>> > > deprecated in favor of the latter.  Both do essentially the same thing which is
>> > > to return the host OS's time - on Linux this is obtained via gettimeofday.  I
>> > > believe this is mainly used by tools to fix up the VM's time when resuming from
>> > > suspend.  I think it is fine to continue handling these at userlevel.
>> >
>> > As long as the TSC is not involved it should be okay.
>> >
>> > Paolo
>> >
>> > > > >> Anyway, one question apart from this: is the API the same for the I/O
>> > > > >> port and hypercall backdoors?
>> > > > >
>> > > > > Yeah the calls and arguments are the same.  The hypercall based
>> > > > > interface is an attempt to modernize the backdoor since as you pointed
>> > > > > out the I/O based interface is kind of hacky as it bypasses the normal
>> > > > > checks for an I/O port access at CPL3.  It would be nice to get rid of
>> > > > > it but unfortunately I don't think that will happen in the foreseeable
>> > > > > future as there are a lot of existing VMs out there with older SW that
>> > > > > still uses this interface.
>> > > >
>> > > > Yeah, but I think it still justifies that the KVM_ENABLE_CAP API can
>> > > > enable the hypercall but not the I/O port.
>> > > >
>> > > > Paolo
>> >
>>
>> --
>> This electronic communication and the information and any files transmitted
>> with it, or attached to it, are confidential and are intended solely for
>> the use of the individual or entity to whom it is addressed and may contain
>> information that is confidential, legally privileged, protected by privacy
>> laws, or otherwise restricted from disclosure to anyone else. If you are
>> not the intended recipient or the person responsible for delivering the
>> e-mail to the intended recipient, you are hereby notified that any use,
>> copying, distributing, dissemination, forwarding, printing, or copying of
>> this e-mail is strictly prohibited. If you received this e-mail in error,
>> please return the e-mail to the sender, delete it from your computer, and
>> destroy any printed copy of it.
>>
Paolo Bonzini Nov. 13, 2024, 5:59 p.m. UTC | #14
On 11/13/24 17:24, Doug Covelli wrote:
>> No worries, you're not hijacking :) The only reason is that it would
>> be more code for a seldom used feature and anyway with worse performance.
>> (To be clear, CR8 based accesses are allowed, but stores cause an exit
>> in order to check the new TPR against IRR. That's because KVM's API
>> does not have an equivalent of the TPR threshold as you point out below).
> 
> I have not really looked at the code but it seems like it could also
> simplify things as CR8 would be handled more uniformly regardless of
> who is virtualizing the local APIC.

Not much because CR8 basically does not exist at all (it's just a byte
in memory) with userspace APIC.  So it's not easy to make it simpler, even
though it's less uniform.

That said, there is an optimization: you only get KVM_EXIT_SET_TPR if
CR8 decreases.

>>> Also I could not find these documented anywhere but with MSFT's APIC our monitor
>>> relies on extensions for trapping certain events such as INIT/SIPI plus LINT0
>>> and SVR writes:
>>>
>>> UINT64 X64ApicInitSipiExitTrap    : 1; // WHvRunVpExitReasonX64ApicInitSipiTrap
>>> UINT64 X64ApicWriteLint0ExitTrap  : 1; // WHvRunVpExitReasonX64ApicWriteTrap
>>> UINT64 X64ApicWriteLint1ExitTrap  : 1; // WHvRunVpExitReasonX64ApicWriteTrap
>>> UINT64 X64ApicWriteSvrExitTrap    : 1; // WHvRunVpExitReasonX64ApicWriteTrap
>>
>> There's no need for this in KVM's in-kernel APIC model. INIT and
>> SIPI are handled in the hypervisor and you can get the current
>> state of APs via KVM_GET_MPSTATE. LINT0 and LINT1 are injected
>> with KVM_INTERRUPT and KVM_NMI respectively, and they obey IF/PPR
>> and NMI blocking respectively, plus the interrupt shadow; so
>> there's no need for userspace to know when LINT0/LINT1 themselves
>> change. The spurious interrupt vector register is also handled
>> completely in kernel.
> 
> I realize that KVM can handle LINT0/SVR updates themselves but our
> interrupt subsystem relies on knowing the current values of these
> registers even when not virtualizing the local APIC.  I suppose we
> could use KVM_GET_LAPIC to sync things up on demand but that seems
> like it might nor be great from a performance point of view.

Ah no, you're right---you want to track the CPU that has ExtINT enabled
and send KVM_INTERRUPT to that one, I guess?  And you need the spurious
vector registers because writes can set the mask bit in LINTx, but
essentially you want to trap LINT0 changes.

Something like this (missing the KVM_ENABLE_CAP and KVM_CHECK_EXTENSION
code) is good, feel free to include it in your v2 (Co-developed-by
and Signed-off-by me):

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 5fb29ca3263b..b7dd89c99613 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -122,6 +122,7 @@
  #define KVM_REQ_HV_TLB_FLUSH \
  	KVM_ARCH_REQ_FLAGS(32, KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
  #define KVM_REQ_UPDATE_PROTECTED_GUEST_STATE	KVM_ARCH_REQ(34)
+#define KVM_REQ_REPORT_LINT0_ACCESS	KVM_ARCH_REQ(35)
  
  #define CR0_RESERVED_BITS                                               \
  	(~(unsigned long)(X86_CR0_PE | X86_CR0_MP | X86_CR0_EM | X86_CR0_TS \
@@ -775,6 +776,7 @@ struct kvm_vcpu_arch {
  	u64 smi_count;
  	bool at_instruction_boundary;
  	bool tpr_access_reporting;
+	bool lint0_access_reporting;
  	bool xfd_no_write_intercept;
  	u64 ia32_xss;
  	u64 microcode_version;
diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 88dc43660d23..0e070f447aa2 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -1561,6 +1561,21 @@ static u32 apic_get_tmcct(struct kvm_lapic *apic)
  			      apic->divide_count));
  }
  
+static void __report_lint0_access(struct kvm_lapic *apic, u32 value)
+{
+	struct kvm_vcpu *vcpu = apic->vcpu;
+	struct kvm_run *run = vcpu->run;
+
+	kvm_make_request(KVM_REQ_REPORT_LINT0_ACCESS, vcpu);
+	run->lint0_access.value = value;
+}
+
+static inline void report_lint0_access(struct kvm_lapic *apic, u32 value)
+{
+	if (apic->vcpu->arch.lint0_access_reporting)
+		__report_lint0_access(apic, value);
+}
+
  static void __report_tpr_access(struct kvm_lapic *apic, bool write)
  {
  	struct kvm_vcpu *vcpu = apic->vcpu;
@@ -2312,8 +2327,10 @@ static int kvm_lapic_reg_write(struct kvm_lapic *apic, u32 reg, u32 val)
  			int i;
  
  			for (i = 0; i < apic->nr_lvt_entries; i++) {
-				kvm_lapic_set_reg(apic, APIC_LVTx(i),
-					kvm_lapic_get_reg(apic, APIC_LVTx(i)) | APIC_LVT_MASKED);
+				u32 old = kvm_lapic_get_reg(apic, APIC_LVTx(i));
+				kvm_lapic_set_reg(apic, APIC_LVTx(i), old | APIC_LVT_MASKED);
+				if (i == 0 && !(old & APIC_LVT_MASKED))
+					report_lint0_access(apic, old | APIC_LVT_MASKED);
  			}
  			apic_update_lvtt(apic);
  			atomic_set(&apic->lapic_timer.pending, 0);
@@ -2352,6 +2369,8 @@ static int kvm_lapic_reg_write(struct kvm_lapic *apic, u32 reg, u32 val)
  		if (!kvm_apic_sw_enabled(apic))
  			val |= APIC_LVT_MASKED;
  		val &= apic_lvt_mask[index];
+		if (index == 0 && val != kvm_lapic_get_reg(apic, reg))
+			report_lint0_access(apic, val);
  		kvm_lapic_set_reg(apic, reg, val);
  		break;
  	}
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index d0d3dc3b7ef6..2b039b372c3f 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -10879,6 +10879,11 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
  			kvm_vcpu_flush_tlb_guest(vcpu);
  #endif
  
+		if (kvm_check_request(KVM_REQ_REPORT_LINT0_ACCESS, vcpu)) {
+			vcpu->run->exit_reason = KVM_EXIT_LINT0_ACCESS;
+			r = 0;
+			goto out;
+		}
  		if (kvm_check_request(KVM_REQ_REPORT_TPR_ACCESS, vcpu)) {
  			vcpu->run->exit_reason = KVM_EXIT_TPR_ACCESS;
  			r = 0;
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 637efc055145..ec97727f9de4 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -178,6 +178,7 @@ struct kvm_xen_exit {
  #define KVM_EXIT_NOTIFY           37
  #define KVM_EXIT_LOONGARCH_IOCSR  38
  #define KVM_EXIT_MEMORY_FAULT     39
+#define KVM_EXIT_LINT0_ACCESS     40
  
  /* For KVM_EXIT_INTERNAL_ERROR */
  /* Emulate instruction failed. */
@@ -283,6 +284,10 @@ struct kvm_run {
  				__u64 flags;
  			};
  		} hypercall;
+		/* KVM_EXIT_LINT0_ACCESS */
+		struct {
+			__u32 value;
+		} lint0_access;
  		/* KVM_EXIT_TPR_ACCESS */
  		struct {
  			__u64 rip;


For LINT1, it should be less performance critical; if it's possible
to just go through all vCPUs, and do KVM_GET_LAPIC to check who you
should send a KVM_NMI to, then I'd do that.  I'd also accept a patch
that adds a VM-wide KVM_NMI ioctl that does the same in the hypervisor
if it's useful for you.

And since I've been proven wrong already, what do you need INIT/SIPI for?

Paolo
Doug Covelli Nov. 14, 2024, 3:45 p.m. UTC | #15
On Wed, Nov 13, 2024 at 12:59 PM Paolo Bonzini <pbonzini@redhat.com> wrote:
>
> On 11/13/24 17:24, Doug Covelli wrote:
> >> No worries, you're not hijacking :) The only reason is that it would
> >> be more code for a seldom used feature and anyway with worse performance.
> >> (To be clear, CR8 based accesses are allowed, but stores cause an exit
> >> in order to check the new TPR against IRR. That's because KVM's API
> >> does not have an equivalent of the TPR threshold as you point out below).
> >
> > I have not really looked at the code but it seems like it could also
> > simplify things as CR8 would be handled more uniformly regardless of
> > who is virtualizing the local APIC.
>
> Not much because CR8 basically does not exist at all (it's just a byte
> in memory) with userspace APIC.  So it's not easy to make it simpler, even
> though it's less uniform.
>
> That said, there is an optimization: you only get KVM_EXIT_SET_TPR if
> CR8 decreases.
>
> >>> Also I could not find these documented anywhere but with MSFT's APIC our monitor
> >>> relies on extensions for trapping certain events such as INIT/SIPI plus LINT0
> >>> and SVR writes:
> >>>
> >>> UINT64 X64ApicInitSipiExitTrap    : 1; // WHvRunVpExitReasonX64ApicInitSipiTrap
> >>> UINT64 X64ApicWriteLint0ExitTrap  : 1; // WHvRunVpExitReasonX64ApicWriteTrap
> >>> UINT64 X64ApicWriteLint1ExitTrap  : 1; // WHvRunVpExitReasonX64ApicWriteTrap
> >>> UINT64 X64ApicWriteSvrExitTrap    : 1; // WHvRunVpExitReasonX64ApicWriteTrap
> >>
> >> There's no need for this in KVM's in-kernel APIC model. INIT and
> >> SIPI are handled in the hypervisor and you can get the current
> >> state of APs via KVM_GET_MPSTATE. LINT0 and LINT1 are injected
> >> with KVM_INTERRUPT and KVM_NMI respectively, and they obey IF/PPR
> >> and NMI blocking respectively, plus the interrupt shadow; so
> >> there's no need for userspace to know when LINT0/LINT1 themselves
> >> change. The spurious interrupt vector register is also handled
> >> completely in kernel.
> >
> > I realize that KVM can handle LINT0/SVR updates themselves but our
> > interrupt subsystem relies on knowing the current values of these
> > registers even when not virtualizing the local APIC.  I suppose we
> > could use KVM_GET_LAPIC to sync things up on demand but that seems
> > like it might nor be great from a performance point of view.
>
> Ah no, you're right---you want to track the CPU that has ExtINT enabled
> and send KVM_INTERRUPT to that one, I guess?  And you need the spurious
> vector registers because writes can set the mask bit in LINTx, but
> essentially you want to trap LINT0 changes.
>
> Something like this (missing the KVM_ENABLE_CAP and KVM_CHECK_EXTENSION
> code) is good, feel free to include it in your v2 (Co-developed-by
> and Signed-off-by me):
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 5fb29ca3263b..b7dd89c99613 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -122,6 +122,7 @@
>   #define KVM_REQ_HV_TLB_FLUSH \
>         KVM_ARCH_REQ_FLAGS(32, KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
>   #define KVM_REQ_UPDATE_PROTECTED_GUEST_STATE  KVM_ARCH_REQ(34)
> +#define KVM_REQ_REPORT_LINT0_ACCESS    KVM_ARCH_REQ(35)
>
>   #define CR0_RESERVED_BITS                                               \
>         (~(unsigned long)(X86_CR0_PE | X86_CR0_MP | X86_CR0_EM | X86_CR0_TS \
> @@ -775,6 +776,7 @@ struct kvm_vcpu_arch {
>         u64 smi_count;
>         bool at_instruction_boundary;
>         bool tpr_access_reporting;
> +       bool lint0_access_reporting;
>         bool xfd_no_write_intercept;
>         u64 ia32_xss;
>         u64 microcode_version;
> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> index 88dc43660d23..0e070f447aa2 100644
> --- a/arch/x86/kvm/lapic.c
> +++ b/arch/x86/kvm/lapic.c
> @@ -1561,6 +1561,21 @@ static u32 apic_get_tmcct(struct kvm_lapic *apic)
>                               apic->divide_count));
>   }
>
> +static void __report_lint0_access(struct kvm_lapic *apic, u32 value)
> +{
> +       struct kvm_vcpu *vcpu = apic->vcpu;
> +       struct kvm_run *run = vcpu->run;
> +
> +       kvm_make_request(KVM_REQ_REPORT_LINT0_ACCESS, vcpu);
> +       run->lint0_access.value = value;
> +}
> +
> +static inline void report_lint0_access(struct kvm_lapic *apic, u32 value)
> +{
> +       if (apic->vcpu->arch.lint0_access_reporting)
> +               __report_lint0_access(apic, value);
> +}
> +
>   static void __report_tpr_access(struct kvm_lapic *apic, bool write)
>   {
>         struct kvm_vcpu *vcpu = apic->vcpu;
> @@ -2312,8 +2327,10 @@ static int kvm_lapic_reg_write(struct kvm_lapic *apic, u32 reg, u32 val)
>                         int i;
>
>                         for (i = 0; i < apic->nr_lvt_entries; i++) {
> -                               kvm_lapic_set_reg(apic, APIC_LVTx(i),
> -                                       kvm_lapic_get_reg(apic, APIC_LVTx(i)) | APIC_LVT_MASKED);
> +                               u32 old = kvm_lapic_get_reg(apic, APIC_LVTx(i));
> +                               kvm_lapic_set_reg(apic, APIC_LVTx(i), old | APIC_LVT_MASKED);
> +                               if (i == 0 && !(old & APIC_LVT_MASKED))
> +                                       report_lint0_access(apic, old | APIC_LVT_MASKED);
>                         }
>                         apic_update_lvtt(apic);
>                         atomic_set(&apic->lapic_timer.pending, 0);
> @@ -2352,6 +2369,8 @@ static int kvm_lapic_reg_write(struct kvm_lapic *apic, u32 reg, u32 val)
>                 if (!kvm_apic_sw_enabled(apic))
>                         val |= APIC_LVT_MASKED;
>                 val &= apic_lvt_mask[index];
> +               if (index == 0 && val != kvm_lapic_get_reg(apic, reg))
> +                       report_lint0_access(apic, val);
>                 kvm_lapic_set_reg(apic, reg, val);
>                 break;
>         }
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index d0d3dc3b7ef6..2b039b372c3f 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -10879,6 +10879,11 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
>                         kvm_vcpu_flush_tlb_guest(vcpu);
>   #endif
>
> +               if (kvm_check_request(KVM_REQ_REPORT_LINT0_ACCESS, vcpu)) {
> +                       vcpu->run->exit_reason = KVM_EXIT_LINT0_ACCESS;
> +                       r = 0;
> +                       goto out;
> +               }
>                 if (kvm_check_request(KVM_REQ_REPORT_TPR_ACCESS, vcpu)) {
>                         vcpu->run->exit_reason = KVM_EXIT_TPR_ACCESS;
>                         r = 0;
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index 637efc055145..ec97727f9de4 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -178,6 +178,7 @@ struct kvm_xen_exit {
>   #define KVM_EXIT_NOTIFY           37
>   #define KVM_EXIT_LOONGARCH_IOCSR  38
>   #define KVM_EXIT_MEMORY_FAULT     39
> +#define KVM_EXIT_LINT0_ACCESS     40
>
>   /* For KVM_EXIT_INTERNAL_ERROR */
>   /* Emulate instruction failed. */
> @@ -283,6 +284,10 @@ struct kvm_run {
>                                 __u64 flags;
>                         };
>                 } hypercall;
> +               /* KVM_EXIT_LINT0_ACCESS */
> +               struct {
> +                       __u32 value;
> +               } lint0_access;
>                 /* KVM_EXIT_TPR_ACCESS */
>                 struct {
>                         __u64 rip;
>
>
> For LINT1, it should be less performance critical; if it's possible
> to just go through all vCPUs, and do KVM_GET_LAPIC to check who you
> should send a KVM_NMI to, then I'd do that.  I'd also accept a patch
> that adds a VM-wide KVM_NMI ioctl that does the same in the hypervisor
> if it's useful for you.

Thanks for the patch - I'll get it a try but it might not be right away.

> And since I've been proven wrong already, what do you need INIT/SIPI for?

I don't think this one is as critical.  I believe the reason it was
added was so that we can synchronize startup of the APs with execution
of the BSP for guests that do not do a good job of that (Windows).

Doug

> Paolo
>
Doug Covelli Dec. 12, 2024, 12:19 p.m. UTC | #16
On Thu, Nov 14, 2024 at 10:45 AM Doug Covelli <doug.covelli@broadcom.com> wrote:
>
> On Wed, Nov 13, 2024 at 12:59 PM Paolo Bonzini <pbonzini@redhat.com> wrote:
> >
> > On 11/13/24 17:24, Doug Covelli wrote:
> > >> No worries, you're not hijacking :) The only reason is that it would
> > >> be more code for a seldom used feature and anyway with worse performance.
> > >> (To be clear, CR8 based accesses are allowed, but stores cause an exit
> > >> in order to check the new TPR against IRR. That's because KVM's API
> > >> does not have an equivalent of the TPR threshold as you point out below).
> > >
> > > I have not really looked at the code but it seems like it could also
> > > simplify things as CR8 would be handled more uniformly regardless of
> > > who is virtualizing the local APIC.
> >
> > Not much because CR8 basically does not exist at all (it's just a byte
> > in memory) with userspace APIC.  So it's not easy to make it simpler, even
> > though it's less uniform.
> >
> > That said, there is an optimization: you only get KVM_EXIT_SET_TPR if
> > CR8 decreases.
> >
> > >>> Also I could not find these documented anywhere but with MSFT's APIC our monitor
> > >>> relies on extensions for trapping certain events such as INIT/SIPI plus LINT0
> > >>> and SVR writes:
> > >>>
> > >>> UINT64 X64ApicInitSipiExitTrap    : 1; // WHvRunVpExitReasonX64ApicInitSipiTrap
> > >>> UINT64 X64ApicWriteLint0ExitTrap  : 1; // WHvRunVpExitReasonX64ApicWriteTrap
> > >>> UINT64 X64ApicWriteLint1ExitTrap  : 1; // WHvRunVpExitReasonX64ApicWriteTrap
> > >>> UINT64 X64ApicWriteSvrExitTrap    : 1; // WHvRunVpExitReasonX64ApicWriteTrap
> > >>
> > >> There's no need for this in KVM's in-kernel APIC model. INIT and
> > >> SIPI are handled in the hypervisor and you can get the current
> > >> state of APs via KVM_GET_MPSTATE. LINT0 and LINT1 are injected
> > >> with KVM_INTERRUPT and KVM_NMI respectively, and they obey IF/PPR
> > >> and NMI blocking respectively, plus the interrupt shadow; so
> > >> there's no need for userspace to know when LINT0/LINT1 themselves
> > >> change. The spurious interrupt vector register is also handled
> > >> completely in kernel.
> > >
> > > I realize that KVM can handle LINT0/SVR updates themselves but our
> > > interrupt subsystem relies on knowing the current values of these
> > > registers even when not virtualizing the local APIC.  I suppose we
> > > could use KVM_GET_LAPIC to sync things up on demand but that seems
> > > like it might nor be great from a performance point of view.
> >
> > Ah no, you're right---you want to track the CPU that has ExtINT enabled
> > and send KVM_INTERRUPT to that one, I guess?  And you need the spurious
> > vector registers because writes can set the mask bit in LINTx, but
> > essentially you want to trap LINT0 changes.
> >
> > Something like this (missing the KVM_ENABLE_CAP and KVM_CHECK_EXTENSION
> > code) is good, feel free to include it in your v2 (Co-developed-by
> > and Signed-off-by me):
> >
> > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > index 5fb29ca3263b..b7dd89c99613 100644
> > --- a/arch/x86/include/asm/kvm_host.h
> > +++ b/arch/x86/include/asm/kvm_host.h
> > @@ -122,6 +122,7 @@
> >   #define KVM_REQ_HV_TLB_FLUSH \
> >         KVM_ARCH_REQ_FLAGS(32, KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
> >   #define KVM_REQ_UPDATE_PROTECTED_GUEST_STATE  KVM_ARCH_REQ(34)
> > +#define KVM_REQ_REPORT_LINT0_ACCESS    KVM_ARCH_REQ(35)
> >
> >   #define CR0_RESERVED_BITS                                               \
> >         (~(unsigned long)(X86_CR0_PE | X86_CR0_MP | X86_CR0_EM | X86_CR0_TS \
> > @@ -775,6 +776,7 @@ struct kvm_vcpu_arch {
> >         u64 smi_count;
> >         bool at_instruction_boundary;
> >         bool tpr_access_reporting;
> > +       bool lint0_access_reporting;
> >         bool xfd_no_write_intercept;
> >         u64 ia32_xss;
> >         u64 microcode_version;
> > diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> > index 88dc43660d23..0e070f447aa2 100644
> > --- a/arch/x86/kvm/lapic.c
> > +++ b/arch/x86/kvm/lapic.c
> > @@ -1561,6 +1561,21 @@ static u32 apic_get_tmcct(struct kvm_lapic *apic)
> >                               apic->divide_count));
> >   }
> >
> > +static void __report_lint0_access(struct kvm_lapic *apic, u32 value)
> > +{
> > +       struct kvm_vcpu *vcpu = apic->vcpu;
> > +       struct kvm_run *run = vcpu->run;
> > +
> > +       kvm_make_request(KVM_REQ_REPORT_LINT0_ACCESS, vcpu);
> > +       run->lint0_access.value = value;
> > +}
> > +
> > +static inline void report_lint0_access(struct kvm_lapic *apic, u32 value)
> > +{
> > +       if (apic->vcpu->arch.lint0_access_reporting)
> > +               __report_lint0_access(apic, value);
> > +}
> > +
> >   static void __report_tpr_access(struct kvm_lapic *apic, bool write)
> >   {
> >         struct kvm_vcpu *vcpu = apic->vcpu;
> > @@ -2312,8 +2327,10 @@ static int kvm_lapic_reg_write(struct kvm_lapic *apic, u32 reg, u32 val)
> >                         int i;
> >
> >                         for (i = 0; i < apic->nr_lvt_entries; i++) {
> > -                               kvm_lapic_set_reg(apic, APIC_LVTx(i),
> > -                                       kvm_lapic_get_reg(apic, APIC_LVTx(i)) | APIC_LVT_MASKED);
> > +                               u32 old = kvm_lapic_get_reg(apic, APIC_LVTx(i));
> > +                               kvm_lapic_set_reg(apic, APIC_LVTx(i), old | APIC_LVT_MASKED);
> > +                               if (i == 0 && !(old & APIC_LVT_MASKED))
> > +                                       report_lint0_access(apic, old | APIC_LVT_MASKED);
> >                         }
> >                         apic_update_lvtt(apic);
> >                         atomic_set(&apic->lapic_timer.pending, 0);
> > @@ -2352,6 +2369,8 @@ static int kvm_lapic_reg_write(struct kvm_lapic *apic, u32 reg, u32 val)
> >                 if (!kvm_apic_sw_enabled(apic))
> >                         val |= APIC_LVT_MASKED;
> >                 val &= apic_lvt_mask[index];
> > +               if (index == 0 && val != kvm_lapic_get_reg(apic, reg))
> > +                       report_lint0_access(apic, val);
> >                 kvm_lapic_set_reg(apic, reg, val);
> >                 break;
> >         }
> > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > index d0d3dc3b7ef6..2b039b372c3f 100644
> > --- a/arch/x86/kvm/x86.c
> > +++ b/arch/x86/kvm/x86.c
> > @@ -10879,6 +10879,11 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
> >                         kvm_vcpu_flush_tlb_guest(vcpu);
> >   #endif
> >
> > +               if (kvm_check_request(KVM_REQ_REPORT_LINT0_ACCESS, vcpu)) {
> > +                       vcpu->run->exit_reason = KVM_EXIT_LINT0_ACCESS;
> > +                       r = 0;
> > +                       goto out;
> > +               }
> >                 if (kvm_check_request(KVM_REQ_REPORT_TPR_ACCESS, vcpu)) {
> >                         vcpu->run->exit_reason = KVM_EXIT_TPR_ACCESS;
> >                         r = 0;
> > diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> > index 637efc055145..ec97727f9de4 100644
> > --- a/include/uapi/linux/kvm.h
> > +++ b/include/uapi/linux/kvm.h
> > @@ -178,6 +178,7 @@ struct kvm_xen_exit {
> >   #define KVM_EXIT_NOTIFY           37
> >   #define KVM_EXIT_LOONGARCH_IOCSR  38
> >   #define KVM_EXIT_MEMORY_FAULT     39
> > +#define KVM_EXIT_LINT0_ACCESS     40
> >
> >   /* For KVM_EXIT_INTERNAL_ERROR */
> >   /* Emulate instruction failed. */
> > @@ -283,6 +284,10 @@ struct kvm_run {
> >                                 __u64 flags;
> >                         };
> >                 } hypercall;
> > +               /* KVM_EXIT_LINT0_ACCESS */
> > +               struct {
> > +                       __u32 value;
> > +               } lint0_access;
> >                 /* KVM_EXIT_TPR_ACCESS */
> >                 struct {
> >                         __u64 rip;
> >
> >
> > For LINT1, it should be less performance critical; if it's possible
> > to just go through all vCPUs, and do KVM_GET_LAPIC to check who you
> > should send a KVM_NMI to, then I'd do that.  I'd also accept a patch
> > that adds a VM-wide KVM_NMI ioctl that does the same in the hypervisor
> > if it's useful for you.
>
> Thanks for the patch - I'll get it a try but it might not be right away.
>
> > And since I've been proven wrong already, what do you need INIT/SIPI for?
>
> I don't think this one is as critical.  I believe the reason it was
> added was so that we can synchronize startup of the APs with execution
> of the BSP for guests that do not do a good job of that (Windows).
>
> Doug

We were able to get the in-kernel APIC working with our code using the split
IRQ chip option with our virtual EFI FW even w/o the traps for SVR and LVT0
writes.  Performance of Windows VMs is greatly improved as expected.
Unfortunately our ancient legacy BIOS will not work with > 1 VCPU due to lack
of support for IPIs with an archaic delivery mode of remote read which it uses
to discover APs by attempting to read their APIC ID register.  MSFT WHP supports
this functionality via an option, WHvPartitionPropertyCodeApicRemoteReadSupport.

Changing our legacy BIOS is not an option so in order to support Windows VMs
with the legacy BIOS with decent performance we would either need to add support
for remote reads of the APIC ID register to KVM or support CR8 accesses w/o
exiting w/o the in-kernel APIC in order.  Do you have a preference?

Thanks,
Doug
diff mbox series

Patch

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 33ef3cc785e4..5a8c7922f64f 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -6601,10 +6601,11 @@  to the byte array.
 .. note::
 
       For KVM_EXIT_IO, KVM_EXIT_MMIO, KVM_EXIT_OSI, KVM_EXIT_PAPR, KVM_EXIT_XEN,
-      KVM_EXIT_EPR, KVM_EXIT_X86_RDMSR and KVM_EXIT_X86_WRMSR the corresponding
-      operations are complete (and guest state is consistent) only after userspace
-      has re-entered the kernel with KVM_RUN.  The kernel side will first finish
-      incomplete operations and then check for pending signals.
+      KVM_EXIT_EPR, KVM_EXIT_HYPERCALL, KVM_EXIT_X86_RDMSR and KVM_EXIT_X86_WRMSR
+      the corresponding operations are complete (and guest state is consistent)
+      only after userspace has re-entered the kernel with KVM_RUN. The kernel
+      side will first finish incomplete operations and then check for pending
+      signals.
 
       The pending state of the operation is not preserved in state which is
       visible to userspace, thus userspace should ensure that the operation is
@@ -8201,6 +8202,38 @@  default value for it is set via the kvm.enable_vmware_backdoor
 kernel parameter (false when not set). Must be set before any
 VCPUs have been created.
 
+7.38 KVM_CAP_X86_VMWARE_HYPERCALL
+---------------------------------
+
+:Architectures: x86
+:Parameters: args[0] whether the feature should be enabled or not
+:Returns: 0 on success.
+
+Capability allows userspace to handle hypercalls. When enabled
+whenever the vcpu has executed a VMCALL(Intel) or a VMMCALL(AMD)
+instruction kvm will exit to userspace with KVM_EXIT_HYPERCALL.
+
+On exit the hypercall structure of the kvm_run structure will
+look as follows:
+
+::
+   /* KVM_EXIT_HYPERCALL */
+   struct {
+      __u64 nr;      // rax
+      __u64 args[6]; // rbx, rcx, rdx, rsi, rdi, rbp
+      __u64 ret;     // cpl, whatever userspace
+                     // sets this to on return will be
+                     // written to the rax
+      __u64 flags;   // KVM_EXIT_HYPERCALL_LONG_MODE if
+                     // the hypercall was executed in
+                     // 64bit mode, 0 otherwise
+   } hypercall;
+
+Except when running in compatibility mode with VMware hypervisors
+userspace handling of hypercalls is discouraged. To implement
+such functionality, use KVM_EXIT_IO (x86) or KVM_EXIT_MMIO
+(all except s390).
+
 8. Other capabilities.
 ======================
 
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 7fcf185e337f..7fbb11682517 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1404,6 +1404,7 @@  struct kvm_arch {
 	struct kvm_xen xen;
 #endif
 	bool vmware_backdoor_enabled;
+	bool vmware_hypercall_enabled;
 
 	bool backwards_tsc_observed;
 	bool boot_vcpu_runs_old_kvmclock;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index d7071907d6a5..b676c54266e7 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4689,6 +4689,7 @@  int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 	case KVM_CAP_MEMORY_FAULT_INFO:
 	case KVM_CAP_X86_GUEST_MODE:
 	case KVM_CAP_X86_VMWARE_BACKDOOR:
+	case KVM_CAP_X86_VMWARE_HYPERCALL:
 		r = 1;
 		break;
 	case KVM_CAP_PRE_FAULT_MEMORY:
@@ -6784,6 +6785,13 @@  int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
 		}
 		mutex_unlock(&kvm->lock);
 		break;
+	case KVM_CAP_X86_VMWARE_HYPERCALL:
+		r = -EINVAL;
+		if (cap->args[0] & ~1)
+			break;
+		kvm->arch.vmware_hypercall_enabled = cap->args[0];
+		r = 0;
+		break;
 	default:
 		r = -EINVAL;
 		break;
@@ -10127,6 +10135,28 @@  static int complete_hypercall_exit(struct kvm_vcpu *vcpu)
 	return kvm_skip_emulated_instruction(vcpu);
 }
 
+static int kvm_vmware_hypercall(struct kvm_vcpu *vcpu)
+{
+	struct kvm_run *run = vcpu->run;
+	bool is_64_bit = is_64_bit_hypercall(vcpu);
+	u64 mask = is_64_bit ? U64_MAX : U32_MAX;
+
+	vcpu->run->hypercall.flags = is_64_bit ? KVM_EXIT_HYPERCALL_LONG_MODE : 0;
+	run->hypercall.nr = kvm_rax_read(vcpu) & mask;
+	run->hypercall.args[0] = kvm_rbx_read(vcpu) & mask;
+	run->hypercall.args[1] = kvm_rcx_read(vcpu) & mask;
+	run->hypercall.args[2] = kvm_rdx_read(vcpu) & mask;
+	run->hypercall.args[3] = kvm_rsi_read(vcpu) & mask;
+	run->hypercall.args[4] = kvm_rdi_read(vcpu) & mask;
+	run->hypercall.args[5] = kvm_rbp_read(vcpu) & mask;
+	run->hypercall.ret = kvm_x86_call(get_cpl)(vcpu);
+
+	run->exit_reason = KVM_EXIT_HYPERCALL;
+	vcpu->arch.complete_userspace_io = complete_hypercall_exit;
+
+	return 0;
+}
+
 unsigned long __kvm_emulate_hypercall(struct kvm_vcpu *vcpu, unsigned long nr,
 				      unsigned long a0, unsigned long a1,
 				      unsigned long a2, unsigned long a3,
@@ -10225,6 +10255,9 @@  int kvm_emulate_hypercall(struct kvm_vcpu *vcpu)
 	int op_64_bit;
 	int cpl;
 
+	if (vcpu->kvm->arch.vmware_hypercall_enabled)
+		return kvm_vmware_hypercall(vcpu);
+
 	if (kvm_xen_hypercall_enabled(vcpu->kvm))
 		return kvm_xen_hypercall(vcpu);
 
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index c7b5f1c2ee1c..4c2cc6ed29a0 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -934,6 +934,7 @@  struct kvm_enable_cap {
 #define KVM_CAP_X86_APIC_BUS_CYCLES_NS 237
 #define KVM_CAP_X86_GUEST_MODE 238
 #define KVM_CAP_X86_VMWARE_BACKDOOR 239
+#define KVM_CAP_X86_VMWARE_HYPERCALL 240
 
 struct kvm_irq_routing_irqchip {
 	__u32 irqchip;