diff mbox series

[RFC,v3,3/5] KVM: x86: Add notifications for Heki policy configuration and violation

Message ID 20240503131910.307630-4-mic@digikod.net (mailing list archive)
State New
Headers show
Series Hypervisor-Enforced Kernel Integrity - CR pinning | expand

Commit Message

Mickaël Salaün May 3, 2024, 1:19 p.m. UTC
Add an interface for user space to be notified about guests' Heki policy
and related violations.

Extend the KVM_ENABLE_CAP IOCTL with KVM_CAP_HEKI_CONFIGURE and
KVM_CAP_HEKI_DENIAL. Each one takes a bitmask as first argument that can
contains KVM_HEKI_EXIT_REASON_CR0 and KVM_HEKI_EXIT_REASON_CR4. The
returned value is the bitmask of known Heki exit reasons, for now:
KVM_HEKI_EXIT_REASON_CR0 and KVM_HEKI_EXIT_REASON_CR4.

If KVM_CAP_HEKI_CONFIGURE is set, a VM exit will be triggered for each
KVM_HC_LOCK_CR_UPDATE hypercalls according to the requested control
register. This enables to enlighten the VMM with the guest
auto-restrictions.

If KVM_CAP_HEKI_DENIAL is set, a VM exit will be triggered for each
pinned CR violation. This enables the VMM to react to a policy
violation.

Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Madhavan T. Venkataraman <madvenka@linux.microsoft.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Sean Christopherson <seanjc@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: Wanpeng Li <wanpengli@tencent.com>
Signed-off-by: Mickaël Salaün <mic@digikod.net>
Link: https://lore.kernel.org/r/20240503131910.307630-4-mic@digikod.net
---

Changes since v1:
* New patch. Making user space aware of Heki properties was requested by
  Sean Christopherson.
---
 arch/x86/kvm/vmx/vmx.c   |   5 +-
 arch/x86/kvm/x86.c       | 114 +++++++++++++++++++++++++++++++++++----
 arch/x86/kvm/x86.h       |   7 +--
 include/linux/kvm_host.h |   2 +
 include/uapi/linux/kvm.h |  22 ++++++++
 5 files changed, 136 insertions(+), 14 deletions(-)

Comments

Sean Christopherson May 3, 2024, 2:03 p.m. UTC | #1
On Fri, May 03, 2024, Mickaël Salaün wrote:
> Add an interface for user space to be notified about guests' Heki policy
> and related violations.
> 
> Extend the KVM_ENABLE_CAP IOCTL with KVM_CAP_HEKI_CONFIGURE and
> KVM_CAP_HEKI_DENIAL. Each one takes a bitmask as first argument that can
> contains KVM_HEKI_EXIT_REASON_CR0 and KVM_HEKI_EXIT_REASON_CR4. The
> returned value is the bitmask of known Heki exit reasons, for now:
> KVM_HEKI_EXIT_REASON_CR0 and KVM_HEKI_EXIT_REASON_CR4.
> 
> If KVM_CAP_HEKI_CONFIGURE is set, a VM exit will be triggered for each
> KVM_HC_LOCK_CR_UPDATE hypercalls according to the requested control
> register. This enables to enlighten the VMM with the guest
> auto-restrictions.
> 
> If KVM_CAP_HEKI_DENIAL is set, a VM exit will be triggered for each
> pinned CR violation. This enables the VMM to react to a policy
> violation.
> 
> Cc: Borislav Petkov <bp@alien8.de>
> Cc: Dave Hansen <dave.hansen@linux.intel.com>
> Cc: H. Peter Anvin <hpa@zytor.com>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Kees Cook <keescook@chromium.org>
> Cc: Madhavan T. Venkataraman <madvenka@linux.microsoft.com>
> Cc: Paolo Bonzini <pbonzini@redhat.com>
> Cc: Sean Christopherson <seanjc@google.com>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
> Cc: Wanpeng Li <wanpengli@tencent.com>
> Signed-off-by: Mickaël Salaün <mic@digikod.net>
> Link: https://lore.kernel.org/r/20240503131910.307630-4-mic@digikod.net
> ---
> 
> Changes since v1:
> * New patch. Making user space aware of Heki properties was requested by
>   Sean Christopherson.

No, I suggested having userspace _control_ the pinning[*], not merely be notified
of pinning.

 : IMO, manipulation of protections, both for memory (this patch) and CPU state
 : (control registers in the next patch) should come from userspace.  I have no
 : objection to KVM providing plumbing if necessary, but I think userspace needs to
 : to have full control over the actual state.
 : 
 : One of the things that caused Intel's control register pinning series to stall
 : out was how to handle edge cases like kexec() and reboot.  Deferring to userspace
 : means the kernel doesn't need to define policy, e.g. when to unprotect memory,
 : and avoids questions like "should userspace be able to overwrite pinned control
 : registers".
 : 
 : And like the confidential VM use case, keeping userspace in the loop is a big
 : beneifit, e.g. the guest can't circumvent protections by coercing userspace into
 : writing to protected memory.

I stand by that suggestion, because I don't see a sane way to handle things like
kexec() and reboot without having a _much_ more sophisticated policy than would
ever be acceptable in KVM.

I think that can be done without KVM having any awareness of CR pinning whatsoever.
E.g. userspace just needs to ability to intercept CR writes and inject #GPs.  Off
the cuff, I suspect the uAPI could look very similar to MSR filtering.  E.g. I bet
userspace could enforce MSR pinning without any new KVM uAPI at all.

[*] https://lore.kernel.org/all/ZFUyhPuhtMbYdJ76@google.com
Mickaël Salaün May 6, 2024, 5:50 p.m. UTC | #2
On Fri, May 03, 2024 at 07:03:21AM GMT, Sean Christopherson wrote:
> On Fri, May 03, 2024, Mickaël Salaün wrote:
> > Add an interface for user space to be notified about guests' Heki policy
> > and related violations.
> > 
> > Extend the KVM_ENABLE_CAP IOCTL with KVM_CAP_HEKI_CONFIGURE and
> > KVM_CAP_HEKI_DENIAL. Each one takes a bitmask as first argument that can
> > contains KVM_HEKI_EXIT_REASON_CR0 and KVM_HEKI_EXIT_REASON_CR4. The
> > returned value is the bitmask of known Heki exit reasons, for now:
> > KVM_HEKI_EXIT_REASON_CR0 and KVM_HEKI_EXIT_REASON_CR4.
> > 
> > If KVM_CAP_HEKI_CONFIGURE is set, a VM exit will be triggered for each
> > KVM_HC_LOCK_CR_UPDATE hypercalls according to the requested control
> > register. This enables to enlighten the VMM with the guest
> > auto-restrictions.
> > 
> > If KVM_CAP_HEKI_DENIAL is set, a VM exit will be triggered for each
> > pinned CR violation. This enables the VMM to react to a policy
> > violation.
> > 
> > Cc: Borislav Petkov <bp@alien8.de>
> > Cc: Dave Hansen <dave.hansen@linux.intel.com>
> > Cc: H. Peter Anvin <hpa@zytor.com>
> > Cc: Ingo Molnar <mingo@redhat.com>
> > Cc: Kees Cook <keescook@chromium.org>
> > Cc: Madhavan T. Venkataraman <madvenka@linux.microsoft.com>
> > Cc: Paolo Bonzini <pbonzini@redhat.com>
> > Cc: Sean Christopherson <seanjc@google.com>
> > Cc: Thomas Gleixner <tglx@linutronix.de>
> > Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
> > Cc: Wanpeng Li <wanpengli@tencent.com>
> > Signed-off-by: Mickaël Salaün <mic@digikod.net>
> > Link: https://lore.kernel.org/r/20240503131910.307630-4-mic@digikod.net
> > ---
> > 
> > Changes since v1:
> > * New patch. Making user space aware of Heki properties was requested by
> >   Sean Christopherson.
> 
> No, I suggested having userspace _control_ the pinning[*], not merely be notified
> of pinning.
> 
>  : IMO, manipulation of protections, both for memory (this patch) and CPU state
>  : (control registers in the next patch) should come from userspace.  I have no
>  : objection to KVM providing plumbing if necessary, but I think userspace needs to
>  : to have full control over the actual state.
>  : 
>  : One of the things that caused Intel's control register pinning series to stall
>  : out was how to handle edge cases like kexec() and reboot.  Deferring to userspace
>  : means the kernel doesn't need to define policy, e.g. when to unprotect memory,
>  : and avoids questions like "should userspace be able to overwrite pinned control
>  : registers".
>  : 
>  : And like the confidential VM use case, keeping userspace in the loop is a big
>  : beneifit, e.g. the guest can't circumvent protections by coercing userspace into
>  : writing to protected memory.
> 
> I stand by that suggestion, because I don't see a sane way to handle things like
> kexec() and reboot without having a _much_ more sophisticated policy than would
> ever be acceptable in KVM.
> 
> I think that can be done without KVM having any awareness of CR pinning whatsoever.
> E.g. userspace just needs to ability to intercept CR writes and inject #GPs.  Off
> the cuff, I suspect the uAPI could look very similar to MSR filtering.  E.g. I bet
> userspace could enforce MSR pinning without any new KVM uAPI at all.
> 
> [*] https://lore.kernel.org/all/ZFUyhPuhtMbYdJ76@google.com

OK, I had concern about the control not directly coming from the guest,
especially in the case of pKVM and confidential computing, but I get you
point.  It should indeed be quite similar to the MSR filtering on the
userspace side, except that we need another interface for the guest to
request such change (i.e. self-protection).

Would it be OK to keep this new KVM_HC_LOCK_CR_UPDATE hypercall but
forward the request to userspace with a VM exit instead?  That would
also enable userspace to get the request and directly configure the CR
pinning with the same VM exit.
Sean Christopherson May 7, 2024, 1:34 a.m. UTC | #3
On Mon, May 06, 2024, Mickaël Salaün wrote:
> On Fri, May 03, 2024 at 07:03:21AM GMT, Sean Christopherson wrote:
> > > ---
> > > 
> > > Changes since v1:
> > > * New patch. Making user space aware of Heki properties was requested by
> > >   Sean Christopherson.
> > 
> > No, I suggested having userspace _control_ the pinning[*], not merely be notified
> > of pinning.
> > 
> >  : IMO, manipulation of protections, both for memory (this patch) and CPU state
> >  : (control registers in the next patch) should come from userspace.  I have no
> >  : objection to KVM providing plumbing if necessary, but I think userspace needs to
> >  : to have full control over the actual state.
> >  : 
> >  : One of the things that caused Intel's control register pinning series to stall
> >  : out was how to handle edge cases like kexec() and reboot.  Deferring to userspace
> >  : means the kernel doesn't need to define policy, e.g. when to unprotect memory,
> >  : and avoids questions like "should userspace be able to overwrite pinned control
> >  : registers".
> >  : 
> >  : And like the confidential VM use case, keeping userspace in the loop is a big
> >  : beneifit, e.g. the guest can't circumvent protections by coercing userspace into
> >  : writing to protected memory.
> > 
> > I stand by that suggestion, because I don't see a sane way to handle things like
> > kexec() and reboot without having a _much_ more sophisticated policy than would
> > ever be acceptable in KVM.
> > 
> > I think that can be done without KVM having any awareness of CR pinning whatsoever.
> > E.g. userspace just needs to ability to intercept CR writes and inject #GPs.  Off
> > the cuff, I suspect the uAPI could look very similar to MSR filtering.  E.g. I bet
> > userspace could enforce MSR pinning without any new KVM uAPI at all.
> > 
> > [*] https://lore.kernel.org/all/ZFUyhPuhtMbYdJ76@google.com
> 
> OK, I had concern about the control not directly coming from the guest,
> especially in the case of pKVM and confidential computing, but I get you

Hardware-based CoCo is completely out of scope, because KVM has zero visibility
into the guest (well, SNP technically allows trapping CR0/CR4, but KVM really
shouldn't intercept CR0/CR4 for SNP guests).

And more importantly, _KVM_ doesn't define any policies for CoCo VMs.  KVM might
help enforce policies that are defined by hardware/firmware, but KVM doesn't
define any of its own.

If pKVM on x86 comes along, then KVM will likely get in the business of defining
policy, but until that happens, KVM needs to stay firmly out of the picture.

> point.  It should indeed be quite similar to the MSR filtering on the
> userspace side, except that we need another interface for the guest to
> request such change (i.e. self-protection).
> 
> Would it be OK to keep this new KVM_HC_LOCK_CR_UPDATE hypercall but
> forward the request to userspace with a VM exit instead?  That would
> also enable userspace to get the request and directly configure the CR
> pinning with the same VM exit.

No?  Maybe?  I strongly suspect that full support will need a richer set of APIs
than a single hypercall.  E.g. to handle kexec(), suspend+resume, emulated SMM,
and so on and so forth.  And that's just for CR pinning.

And hypercalls are hampered by the fact that VMCALL/VMMCALL don't allow for
delegation or restriction, i.e. there's no way for the guest to communicate to
the hypervisor that a less privileged component is allowed to perform some action,
nor is there a way for the guest to say some chunk of CPL0 code *isn't* allowed
to request transition.  Delegation and restriction all has to be done out-of-band.

It'd probably be more annoying to setup initially, but I think a synthetic device
with an MMIO-based interface would be more powerful and flexible in the long run.
Then userspace can evolve without needing to wait for KVM to catch up.

Actually, potential bad/crazy idea.  Why does the _host_ need to define policy?
Linux already knows what assets it wants to (un)protect and when.  What's missing
is a way for the guest kernel to effectively deprivilege and re-authenticate
itself as needed.  We've been tossing around the idea of paired VMs+vCPUs to
support VTLs and SEV's VMPLs, what if we usurped/piggybacked those ideas, with a
bit of pKVM mixed in?

Borrowing VTL terminology, where VTL0 is the least privileged, userspace launches
the VM at VTL0.  At some point, the guest triggers the deprivileging sequence and
userspace creates VTL1.  Userpace also provides a way for VTL0 restrict access to
its memory, e.g. to effectively make the page tables for the kernel's direct map
writable only from VTL1, to make kernel text RO (or XO), etc.  And VTL0 could then
also completely remove its access to code that changes CR0/CR4.

It would obviously require a _lot_ more upfront work, e.g. to isolate the kernel
text that modifies CR0/CR4 so that it can be removed from VTL0, but that should
be doable with annotations, e.g. tag relevant functions with __magic or whatever,
throw them in a dedicated section, and then free/protect the section(s) at the
appropriate time.

KVM would likely need to provide the ability to switch VTLs (or whatever they get
called), and host userspace would need to provide a decent amount of the backend
mechanisms and "core" policies, e.g. to manage VTL0 memory, teardown (turn off?)
VTL1 on kexec(), etc.  But everything else could live in the guest kernel itself.
E.g. to have CR pinning play nice with kexec(), toss the relevant kexec() code into
VTL1.  That way VTL1 can verify the kexec() target and tear itself down before
jumping into the new kernel. 

This is very off the cuff and have-wavy, e.g. I don't have much of an idea what
it would take to harden kernel text patching, but keeping the policy in the guest
seems like it'd make everything more tractable than trying to define an ABI
between Linux and a VMM that is rich and flexible enough to support all the
fancy things Linux does (and will do in the future).

Am I crazy?  Or maybe reinventing whatever that McAfee thing was that led to
Intel implementing EPTP switching?
Mickaël Salaün May 7, 2024, 9:30 a.m. UTC | #4
On Mon, May 06, 2024 at 06:34:53PM GMT, Sean Christopherson wrote:
> On Mon, May 06, 2024, Mickaël Salaün wrote:
> > On Fri, May 03, 2024 at 07:03:21AM GMT, Sean Christopherson wrote:
> > > > ---
> > > > 
> > > > Changes since v1:
> > > > * New patch. Making user space aware of Heki properties was requested by
> > > >   Sean Christopherson.
> > > 
> > > No, I suggested having userspace _control_ the pinning[*], not merely be notified
> > > of pinning.
> > > 
> > >  : IMO, manipulation of protections, both for memory (this patch) and CPU state
> > >  : (control registers in the next patch) should come from userspace.  I have no
> > >  : objection to KVM providing plumbing if necessary, but I think userspace needs to
> > >  : to have full control over the actual state.
> > >  : 
> > >  : One of the things that caused Intel's control register pinning series to stall
> > >  : out was how to handle edge cases like kexec() and reboot.  Deferring to userspace
> > >  : means the kernel doesn't need to define policy, e.g. when to unprotect memory,
> > >  : and avoids questions like "should userspace be able to overwrite pinned control
> > >  : registers".
> > >  : 
> > >  : And like the confidential VM use case, keeping userspace in the loop is a big
> > >  : beneifit, e.g. the guest can't circumvent protections by coercing userspace into
> > >  : writing to protected memory.
> > > 
> > > I stand by that suggestion, because I don't see a sane way to handle things like
> > > kexec() and reboot without having a _much_ more sophisticated policy than would
> > > ever be acceptable in KVM.
> > > 
> > > I think that can be done without KVM having any awareness of CR pinning whatsoever.
> > > E.g. userspace just needs to ability to intercept CR writes and inject #GPs.  Off
> > > the cuff, I suspect the uAPI could look very similar to MSR filtering.  E.g. I bet
> > > userspace could enforce MSR pinning without any new KVM uAPI at all.
> > > 
> > > [*] https://lore.kernel.org/all/ZFUyhPuhtMbYdJ76@google.com
> > 
> > OK, I had concern about the control not directly coming from the guest,
> > especially in the case of pKVM and confidential computing, but I get you
> 
> Hardware-based CoCo is completely out of scope, because KVM has zero visibility
> into the guest (well, SNP technically allows trapping CR0/CR4, but KVM really
> shouldn't intercept CR0/CR4 for SNP guests).
> 
> And more importantly, _KVM_ doesn't define any policies for CoCo VMs.  KVM might
> help enforce policies that are defined by hardware/firmware, but KVM doesn't
> define any of its own.
> 
> If pKVM on x86 comes along, then KVM will likely get in the business of defining
> policy, but until that happens, KVM needs to stay firmly out of the picture.
> 
> > point.  It should indeed be quite similar to the MSR filtering on the
> > userspace side, except that we need another interface for the guest to
> > request such change (i.e. self-protection).
> > 
> > Would it be OK to keep this new KVM_HC_LOCK_CR_UPDATE hypercall but
> > forward the request to userspace with a VM exit instead?  That would
> > also enable userspace to get the request and directly configure the CR
> > pinning with the same VM exit.
> 
> No?  Maybe?  I strongly suspect that full support will need a richer set of APIs
> than a single hypercall.  E.g. to handle kexec(), suspend+resume, emulated SMM,
> and so on and so forth.  And that's just for CR pinning.
> 
> And hypercalls are hampered by the fact that VMCALL/VMMCALL don't allow for
> delegation or restriction, i.e. there's no way for the guest to communicate to
> the hypervisor that a less privileged component is allowed to perform some action,
> nor is there a way for the guest to say some chunk of CPL0 code *isn't* allowed
> to request transition.  Delegation and restriction all has to be done out-of-band.
> 
> It'd probably be more annoying to setup initially, but I think a synthetic device
> with an MMIO-based interface would be more powerful and flexible in the long run.
> Then userspace can evolve without needing to wait for KVM to catch up.
> 
> Actually, potential bad/crazy idea.  Why does the _host_ need to define policy?
> Linux already knows what assets it wants to (un)protect and when.  What's missing
> is a way for the guest kernel to effectively deprivilege and re-authenticate
> itself as needed.  We've been tossing around the idea of paired VMs+vCPUs to
> support VTLs and SEV's VMPLs, what if we usurped/piggybacked those ideas, with a
> bit of pKVM mixed in?
> 
> Borrowing VTL terminology, where VTL0 is the least privileged, userspace launches
> the VM at VTL0.  At some point, the guest triggers the deprivileging sequence and
> userspace creates VTL1.  Userpace also provides a way for VTL0 restrict access to
> its memory, e.g. to effectively make the page tables for the kernel's direct map
> writable only from VTL1, to make kernel text RO (or XO), etc.  And VTL0 could then
> also completely remove its access to code that changes CR0/CR4.
> 
> It would obviously require a _lot_ more upfront work, e.g. to isolate the kernel
> text that modifies CR0/CR4 so that it can be removed from VTL0, but that should
> be doable with annotations, e.g. tag relevant functions with __magic or whatever,
> throw them in a dedicated section, and then free/protect the section(s) at the
> appropriate time.
> 
> KVM would likely need to provide the ability to switch VTLs (or whatever they get
> called), and host userspace would need to provide a decent amount of the backend
> mechanisms and "core" policies, e.g. to manage VTL0 memory, teardown (turn off?)
> VTL1 on kexec(), etc.  But everything else could live in the guest kernel itself.
> E.g. to have CR pinning play nice with kexec(), toss the relevant kexec() code into
> VTL1.  That way VTL1 can verify the kexec() target and tear itself down before
> jumping into the new kernel. 
> 
> This is very off the cuff and have-wavy, e.g. I don't have much of an idea what
> it would take to harden kernel text patching, but keeping the policy in the guest
> seems like it'd make everything more tractable than trying to define an ABI
> between Linux and a VMM that is rich and flexible enough to support all the
> fancy things Linux does (and will do in the future).

Yes, we agree that the guest needs to manage its own policy.  That's why
we implemented Heki for KVM this way, but without VTLs because KVM
doesn't support them.

To sum up, is the VTL approach the only one that would be acceptable for
KVM?  If yes, that would indeed require a *lot* of work for something
we're not sure will be accepted later on.

> 
> Am I crazy?  Or maybe reinventing whatever that McAfee thing was that led to
> Intel implementing EPTP switching?
>
Sean Christopherson May 7, 2024, 4:16 p.m. UTC | #5
On Tue, May 07, 2024, Mickaël Salaün wrote:
> > Actually, potential bad/crazy idea.  Why does the _host_ need to define policy?
> > Linux already knows what assets it wants to (un)protect and when.  What's missing
> > is a way for the guest kernel to effectively deprivilege and re-authenticate
> > itself as needed.  We've been tossing around the idea of paired VMs+vCPUs to
> > support VTLs and SEV's VMPLs, what if we usurped/piggybacked those ideas, with a
> > bit of pKVM mixed in?
> > 
> > Borrowing VTL terminology, where VTL0 is the least privileged, userspace launches
> > the VM at VTL0.  At some point, the guest triggers the deprivileging sequence and
> > userspace creates VTL1.  Userpace also provides a way for VTL0 restrict access to
> > its memory, e.g. to effectively make the page tables for the kernel's direct map
> > writable only from VTL1, to make kernel text RO (or XO), etc.  And VTL0 could then
> > also completely remove its access to code that changes CR0/CR4.
> > 
> > It would obviously require a _lot_ more upfront work, e.g. to isolate the kernel
> > text that modifies CR0/CR4 so that it can be removed from VTL0, but that should
> > be doable with annotations, e.g. tag relevant functions with __magic or whatever,
> > throw them in a dedicated section, and then free/protect the section(s) at the
> > appropriate time.
> > 
> > KVM would likely need to provide the ability to switch VTLs (or whatever they get
> > called), and host userspace would need to provide a decent amount of the backend
> > mechanisms and "core" policies, e.g. to manage VTL0 memory, teardown (turn off?)
> > VTL1 on kexec(), etc.  But everything else could live in the guest kernel itself.
> > E.g. to have CR pinning play nice with kexec(), toss the relevant kexec() code into
> > VTL1.  That way VTL1 can verify the kexec() target and tear itself down before
> > jumping into the new kernel. 
> > 
> > This is very off the cuff and have-wavy, e.g. I don't have much of an idea what
> > it would take to harden kernel text patching, but keeping the policy in the guest
> > seems like it'd make everything more tractable than trying to define an ABI
> > between Linux and a VMM that is rich and flexible enough to support all the
> > fancy things Linux does (and will do in the future).
> 
> Yes, we agree that the guest needs to manage its own policy.  That's why
> we implemented Heki for KVM this way, but without VTLs because KVM
> doesn't support them.
> 
> To sum up, is the VTL approach the only one that would be acceptable for
> KVM?  

Heh, that's not a question you want to be asking.  You're effectively asking me
to make an authorative, "final" decision on a topic which I am only passingly
familiar with.

But since you asked it... :-)  Probably?

I see a lot of advantages to a VTL/VSM-like approach:

 1. Provides Linux-as-a guest the flexibility it needs to meaningfully advance
    its security, with the least amount of policy built into the guest/host ABI.

 2. Largely decouples guest policy from the host, i.e. should allow the guest to
    evolve/update it's policy without needing to coordinate changes with the host.

 3. The KVM implementation can be generic enough to be reusable for other features.

 4. Other groups are already working on VTL-like support in KVM, e.g. for VSM
    itself, and potentially for VMPL/SVSM support.

IMO, #2 is a *huge* selling point.  Not having to coordinate changes across
multiple code bases and/or organizations and/or maintainers is a big win for
velocity, long term maintenance, and probably the very viability of HEKI.

Providing the guest with the tools to define and implement its own policy means
end users don't have to way for some third party, e.g. CSPs, to deploy the
accompanying host-side changes, because there are no host-side changes.

And encapsulating everything in the guest drastically reduces the friction with
changes in the kernel that interact with hardening, both from a technical and a
social perspective.  I.e. giving the kernel (near) complete control over its
destiny minimizes the number of moving parts, and will be far, far easier to sell
to maintainers.  I would expect maintainers to react much more favorably to being
handed tools to harden the kernel, as opposed to being presented a set of APIs
that can be used to make the kernel compliant with _someone else's_ vision of
what kernel hardening should look like.

E.g. imagine a new feature comes along that requires overriding CR0/CR4 pinning
in a way that doesn't fit into existing policy.  If the VMM is involved in
defining/enforcing the CR pinning policy, then supporting said new feature would
require new guest/host ABI and an updated host VMM in order to make the new
feature compatible with HEKI.  Inevitably, even if everything goes smoothly from
an upstreaming perspective, that will result in guests that have to choose between
HEKI and new feature X, because there is zero chance that all hosts that run Linux
as a guest will be updated in advance of new feature X being deployed.

And if/when things don't go smoothly, odds are very good that kernel maintainers
will eventually tire of having to coordinate and negotiate with QEMU and other
VMMs, and will become resistant to continuing to support/extend HEKI.

> If yes, that would indeed require a *lot* of work for something we're not
> sure will be accepted later on.

Yes and no.  The AWS folks are pursuing VSM support in KVM+QEMU, and SVSM support
is trending toward the paired VM+vCPU model.  IMO, it's entirely feasible to
design KVM support such that much of the development load can be shared between
the projects.  And having 2+ use cases for a feature (set) makes it _much_ more
likely that the feature(s) will be accepted.

And similar to what Paolo said regarding HEKI not having a complete story, I
don't see a clear line of sight for landing host-defined policy enforcement, as
there are many open, non-trivial questions that need answers. I.e. upstreaming
HEKI in its current form is also far from a done deal, and isn't guaranteed to
be substantially less work when all is said and done.
Nicolas Saenz Julienne May 10, 2024, 10:07 a.m. UTC | #6
On Tue May 7, 2024 at 4:16 PM UTC, Sean Christopherson wrote:
> > If yes, that would indeed require a *lot* of work for something we're not
> > sure will be accepted later on.
>
> Yes and no.  The AWS folks are pursuing VSM support in KVM+QEMU, and SVSM support
> is trending toward the paired VM+vCPU model.  IMO, it's entirely feasible to
> design KVM support such that much of the development load can be shared between
> the projects.  And having 2+ use cases for a feature (set) makes it _much_ more
> likely that the feature(s) will be accepted.

Since Sean mentioned our VSM efforts, a small update. We were able to
validate the concept of one KVM VM per VTL as discussed in LPC. Right
now only for single CPU guests, but are in the late stages of bringing
up MP support. The resulting KVM code is small, and most will be
uncontroversial (I hope). If other obligations allow it, we plan on
having something suitable for review in the coming months.

Our implementation aims to implement all the VSM spec necessary to run
with Microsoft Credential Guard. But note that some aspects necessary
for HVCI are not covered, especially the ones that depend on MBEC
support, or some categories of secure intercepts.

Development happens
https://github.com/vianpl/{linux,qemu,kvm-unit-tests} and the vsm-next
branch, but I'd advice against looking into it until we add some order
to the rework. Regardless, feel free to get in touch.

Nicolas
Mickaël Salaün May 14, 2024, 12:15 p.m. UTC | #7
On Tue, May 07, 2024 at 09:16:06AM -0700, Sean Christopherson wrote:
> On Tue, May 07, 2024, Mickaël Salaün wrote:
> > > Actually, potential bad/crazy idea.  Why does the _host_ need to define policy?
> > > Linux already knows what assets it wants to (un)protect and when.  What's missing
> > > is a way for the guest kernel to effectively deprivilege and re-authenticate
> > > itself as needed.  We've been tossing around the idea of paired VMs+vCPUs to
> > > support VTLs and SEV's VMPLs, what if we usurped/piggybacked those ideas, with a
> > > bit of pKVM mixed in?
> > > 
> > > Borrowing VTL terminology, where VTL0 is the least privileged, userspace launches
> > > the VM at VTL0.  At some point, the guest triggers the deprivileging sequence and
> > > userspace creates VTL1.  Userpace also provides a way for VTL0 restrict access to
> > > its memory, e.g. to effectively make the page tables for the kernel's direct map
> > > writable only from VTL1, to make kernel text RO (or XO), etc.  And VTL0 could then
> > > also completely remove its access to code that changes CR0/CR4.
> > > 
> > > It would obviously require a _lot_ more upfront work, e.g. to isolate the kernel
> > > text that modifies CR0/CR4 so that it can be removed from VTL0, but that should
> > > be doable with annotations, e.g. tag relevant functions with __magic or whatever,
> > > throw them in a dedicated section, and then free/protect the section(s) at the
> > > appropriate time.
> > > 
> > > KVM would likely need to provide the ability to switch VTLs (or whatever they get
> > > called), and host userspace would need to provide a decent amount of the backend
> > > mechanisms and "core" policies, e.g. to manage VTL0 memory, teardown (turn off?)
> > > VTL1 on kexec(), etc.  But everything else could live in the guest kernel itself.
> > > E.g. to have CR pinning play nice with kexec(), toss the relevant kexec() code into
> > > VTL1.  That way VTL1 can verify the kexec() target and tear itself down before
> > > jumping into the new kernel. 
> > > 
> > > This is very off the cuff and have-wavy, e.g. I don't have much of an idea what
> > > it would take to harden kernel text patching, but keeping the policy in the guest
> > > seems like it'd make everything more tractable than trying to define an ABI
> > > between Linux and a VMM that is rich and flexible enough to support all the
> > > fancy things Linux does (and will do in the future).
> > 
> > Yes, we agree that the guest needs to manage its own policy.  That's why
> > we implemented Heki for KVM this way, but without VTLs because KVM
> > doesn't support them.
> > 
> > To sum up, is the VTL approach the only one that would be acceptable for
> > KVM?  
> 
> Heh, that's not a question you want to be asking.  You're effectively asking me
> to make an authorative, "final" decision on a topic which I am only passingly
> familiar with.
> 
> But since you asked it... :-)  Probably?
> 
> I see a lot of advantages to a VTL/VSM-like approach:
> 
>  1. Provides Linux-as-a guest the flexibility it needs to meaningfully advance
>     its security, with the least amount of policy built into the guest/host ABI.
> 
>  2. Largely decouples guest policy from the host, i.e. should allow the guest to
>     evolve/update it's policy without needing to coordinate changes with the host.
> 
>  3. The KVM implementation can be generic enough to be reusable for other features.
> 
>  4. Other groups are already working on VTL-like support in KVM, e.g. for VSM
>     itself, and potentially for VMPL/SVSM support.
> 
> IMO, #2 is a *huge* selling point.  Not having to coordinate changes across
> multiple code bases and/or organizations and/or maintainers is a big win for
> velocity, long term maintenance, and probably the very viability of HEKI.

Agree, this is our goal.

> 
> Providing the guest with the tools to define and implement its own policy means
> end users don't have to way for some third party, e.g. CSPs, to deploy the
> accompanying host-side changes, because there are no host-side changes.
> 
> And encapsulating everything in the guest drastically reduces the friction with
> changes in the kernel that interact with hardening, both from a technical and a
> social perspective.  I.e. giving the kernel (near) complete control over its
> destiny minimizes the number of moving parts, and will be far, far easier to sell
> to maintainers.  I would expect maintainers to react much more favorably to being
> handed tools to harden the kernel, as opposed to being presented a set of APIs
> that can be used to make the kernel compliant with _someone else's_ vision of
> what kernel hardening should look like.
> 
> E.g. imagine a new feature comes along that requires overriding CR0/CR4 pinning
> in a way that doesn't fit into existing policy.  If the VMM is involved in
> defining/enforcing the CR pinning policy, then supporting said new feature would
> require new guest/host ABI and an updated host VMM in order to make the new
> feature compatible with HEKI.  Inevitably, even if everything goes smoothly from
> an upstreaming perspective, that will result in guests that have to choose between
> HEKI and new feature X, because there is zero chance that all hosts that run Linux
> as a guest will be updated in advance of new feature X being deployed.

Sure. We need to find a generic-enough KVM interface to be able to
restrict a wide range of virtualization/hardware mechanisms (to not rely
too much on KVM changes) and delegate most of enforcement/emulation to
VTL1.  In short, policy definition owned by VTL0/guest, and policy
enforcement shared between KVM (coarse grained) and VTL1 (fine grained).

> 
> And if/when things don't go smoothly, odds are very good that kernel maintainers
> will eventually tire of having to coordinate and negotiate with QEMU and other
> VMMs, and will become resistant to continuing to support/extend HEKI.

Yes, that was our concern too and another reason why we choose to let
the guest handle its own security policy.

> 
> > If yes, that would indeed require a *lot* of work for something we're not
> > sure will be accepted later on.
> 
> Yes and no.  The AWS folks are pursuing VSM support in KVM+QEMU, and SVSM support
> is trending toward the paired VM+vCPU model.  IMO, it's entirely feasible to
> design KVM support such that much of the development load can be shared between
> the projects.  And having 2+ use cases for a feature (set) makes it _much_ more
> likely that the feature(s) will be accepted.
> 
> And similar to what Paolo said regarding HEKI not having a complete story, I
> don't see a clear line of sight for landing host-defined policy enforcement, as
> there are many open, non-trivial questions that need answers. I.e. upstreaming
> HEKI in its current form is also far from a done deal, and isn't guaranteed to
> be substantially less work when all is said and done.

I'm not sure to understand why "Heki not having a complete story".  The
goal is the same as the current kernel self-protection mechanisms.
Mickaël Salaün May 14, 2024, 12:23 p.m. UTC | #8
On Fri, May 10, 2024 at 10:07:00AM +0000, Nicolas Saenz Julienne wrote:
> On Tue May 7, 2024 at 4:16 PM UTC, Sean Christopherson wrote:
> > > If yes, that would indeed require a *lot* of work for something we're not
> > > sure will be accepted later on.
> >
> > Yes and no.  The AWS folks are pursuing VSM support in KVM+QEMU, and SVSM support
> > is trending toward the paired VM+vCPU model.  IMO, it's entirely feasible to
> > design KVM support such that much of the development load can be shared between
> > the projects.  And having 2+ use cases for a feature (set) makes it _much_ more
> > likely that the feature(s) will be accepted.
> 
> Since Sean mentioned our VSM efforts, a small update. We were able to
> validate the concept of one KVM VM per VTL as discussed in LPC. Right
> now only for single CPU guests, but are in the late stages of bringing
> up MP support. The resulting KVM code is small, and most will be
> uncontroversial (I hope). If other obligations allow it, we plan on
> having something suitable for review in the coming months.

Looks good!

> 
> Our implementation aims to implement all the VSM spec necessary to run
> with Microsoft Credential Guard. But note that some aspects necessary
> for HVCI are not covered, especially the ones that depend on MBEC
> support, or some categories of secure intercepts.

We already implemented support for MBEC, so that should not be an issue.
We just need to find the best interface to configure it.

> 
> Development happens
> https://github.com/vianpl/{linux,qemu,kvm-unit-tests} and the vsm-next
> branch, but I'd advice against looking into it until we add some order
> to the rework. Regardless, feel free to get in touch.

Thanks for the update.

Could we schedule a PUCK meeting to synchronize and help each other?
What about June 12?
Sean Christopherson May 15, 2024, 8:32 p.m. UTC | #9
On Tue, May 14, 2024, Mickaël Salaün wrote:
> On Fri, May 10, 2024 at 10:07:00AM +0000, Nicolas Saenz Julienne wrote:
> > Development happens
> > https://github.com/vianpl/{linux,qemu,kvm-unit-tests} and the vsm-next
> > branch, but I'd advice against looking into it until we add some order
> > to the rework. Regardless, feel free to get in touch.
> 
> Thanks for the update.
> 
> Could we schedule a PUCK meeting to synchronize and help each other?
> What about June 12?

June 12th works on my end.
Nicolas Saenz Julienne May 16, 2024, 2:02 p.m. UTC | #10
On Tue May 14, 2024 at 12:23 PM UTC, Mickaël Salaün wrote:
> > Development happens
> > https://github.com/vianpl/{linux,qemu,kvm-unit-tests} and the vsm-next
> > branch, but I'd advice against looking into it until we add some order
> > to the rework. Regardless, feel free to get in touch.
>
> Thanks for the update.
>
> Could we schedule a PUCK meeting to synchronize and help each other?
> What about June 12?

Sounds great! June 12th works for me.

Nicolas
diff mbox series

Patch

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 7ba970b525f7..5869a1ed7866 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -5445,6 +5445,7 @@  static int handle_cr(struct kvm_vcpu *vcpu)
 	int reg;
 	int err;
 	int ret;
+	bool exit = false;
 
 	exit_qualification = vmx_get_exit_qual(vcpu);
 	cr = exit_qualification & 15;
@@ -5454,8 +5455,8 @@  static int handle_cr(struct kvm_vcpu *vcpu)
 		val = kvm_register_read(vcpu, reg);
 		trace_kvm_cr_write(cr, val);
 
-		ret = heki_check_cr(vcpu, cr, val);
-		if (ret)
+		ret = heki_check_cr(vcpu, cr, val, &exit);
+		if (exit)
 			return ret;
 
 		switch (cr) {
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index a5f47be59abc..865e88f2b0fc 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -119,6 +119,10 @@  static u64 __read_mostly cr4_reserved_bits = CR4_RESERVED_BITS;
 
 #define KVM_CAP_PMU_VALID_MASK KVM_PMU_CAP_DISABLE
 
+#define KVM_HEKI_EXIT_REASON_VALID_MASK ( \
+	KVM_HEKI_EXIT_REASON_CR0 | \
+	KVM_HEKI_EXIT_REASON_CR4)
+
 #define KVM_X2APIC_API_VALID_FLAGS (KVM_X2APIC_API_USE_32BIT_IDS | \
                                     KVM_X2APIC_API_DISABLE_BROADCAST_QUIRK)
 
@@ -4836,6 +4840,10 @@  int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 		if (kvm_is_vm_type_supported(KVM_X86_SW_PROTECTED_VM))
 			r |= BIT(KVM_X86_SW_PROTECTED_VM);
 		break;
+	case KVM_CAP_HEKI_CONFIGURE:
+	case KVM_CAP_HEKI_DENIAL:
+		r = KVM_HEKI_EXIT_REASON_VALID_MASK;
+		break;
 	default:
 		break;
 	}
@@ -6729,6 +6737,22 @@  int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
 		}
 		mutex_unlock(&kvm->lock);
 		break;
+#ifdef CONFIG_HEKI
+	case KVM_CAP_HEKI_CONFIGURE:
+		r = -EINVAL;
+		if (cap->args[0] & ~KVM_HEKI_EXIT_REASON_VALID_MASK)
+			break;
+		kvm->heki_configure_exit_reason = cap->args[0];
+		r = 0;
+		break;
+	case KVM_CAP_HEKI_DENIAL:
+		r = -EINVAL;
+		if (cap->args[0] & ~KVM_HEKI_EXIT_REASON_VALID_MASK)
+			break;
+		kvm->heki_denial_exit_reason = cap->args[0];
+		r = 0;
+		break;
+#endif
 	default:
 		r = -EINVAL;
 		break;
@@ -8283,11 +8307,60 @@  static unsigned long emulator_get_cr(struct x86_emulate_ctxt *ctxt, int cr)
 
 #ifdef CONFIG_HEKI
 
+static int complete_heki_configure_exit(struct kvm_vcpu *const vcpu)
+{
+	kvm_rax_write(vcpu, 0);
+	++vcpu->stat.hypercalls;
+	return kvm_skip_emulated_instruction(vcpu);
+}
+
+static int complete_heki_denial_exit(struct kvm_vcpu *const vcpu)
+{
+	kvm_inject_gp(vcpu, 0);
+	return 1;
+}
+
+/* Returns true if the @exit_reason is handled by @vcpu->kvm. */
+static bool heki_exit_cr(struct kvm_vcpu *const vcpu, const __u32 exit_reason,
+			 const u64 heki_reason, unsigned long value)
+{
+	switch (exit_reason) {
+	case KVM_EXIT_HEKI_CONFIGURE:
+		if (!(vcpu->kvm->heki_configure_exit_reason & heki_reason))
+			return false;
+
+		vcpu->run->heki_configure.reason = heki_reason;
+		memset(vcpu->run->heki_configure.reserved, 0,
+		       sizeof(vcpu->run->heki_configure.reserved));
+		vcpu->run->heki_configure.cr_pinned = value;
+		vcpu->arch.complete_userspace_io = complete_heki_configure_exit;
+		break;
+	case KVM_EXIT_HEKI_DENIAL:
+		if (!(vcpu->kvm->heki_denial_exit_reason & heki_reason))
+			return false;
+
+		vcpu->run->heki_denial.reason = heki_reason;
+		memset(vcpu->run->heki_denial.reserved, 0,
+		       sizeof(vcpu->run->heki_denial.reserved));
+		vcpu->run->heki_denial.cr_value = value;
+		vcpu->arch.complete_userspace_io = complete_heki_denial_exit;
+		break;
+	default:
+		WARN_ON_ONCE(1);
+		return false;
+	}
+
+	vcpu->run->exit_reason = exit_reason;
+	return true;
+}
+
 #define HEKI_ABI_VERSION 1
 
 static int heki_lock_cr(struct kvm_vcpu *const vcpu, const unsigned long cr,
-			unsigned long pin, unsigned long flags)
+			unsigned long pin, unsigned long flags, bool *exit)
 {
+	*exit = false;
+
 	if (flags) {
 		if ((flags == KVM_LOCK_CR_UPDATE_VERSION) && !cr && !pin)
 			return HEKI_ABI_VERSION;
@@ -8307,6 +8380,8 @@  static int heki_lock_cr(struct kvm_vcpu *const vcpu, const unsigned long cr,
 			return -KVM_EINVAL;
 
 		atomic_long_or(pin, &vcpu->kvm->heki_pinned_cr0);
+		*exit = heki_exit_cr(vcpu, KVM_EXIT_HEKI_CONFIGURE,
+				     KVM_HEKI_EXIT_REASON_CR0, pin);
 		return 0;
 	case 4:
 		/* Checks for irrelevant bits. */
@@ -8316,24 +8391,37 @@  static int heki_lock_cr(struct kvm_vcpu *const vcpu, const unsigned long cr,
 		/* Ignores bits not present in host. */
 		pin &= __read_cr4();
 		atomic_long_or(pin, &vcpu->kvm->heki_pinned_cr4);
+		*exit = heki_exit_cr(vcpu, KVM_EXIT_HEKI_CONFIGURE,
+				     KVM_HEKI_EXIT_REASON_CR4, pin);
 		return 0;
 	}
 	return -KVM_EINVAL;
 }
 
+/*
+ * Sets @exit to true if the caller must exit (i.e. denied access) with the
+ * returned value:
+ * - 0 when kvm_run is configured;
+ * - 1 when there is no user space handler.
+ */
 int heki_check_cr(struct kvm_vcpu *const vcpu, const unsigned long cr,
-		  const unsigned long val)
+		  const unsigned long val, bool *exit)
 {
 	unsigned long pinned;
 
+	*exit = false;
+
 	switch (cr) {
 	case 0:
 		pinned = atomic_long_read(&vcpu->kvm->heki_pinned_cr0);
 		if ((val & pinned) != pinned) {
 			pr_warn_ratelimited(
 				"heki: Blocked CR0 update: 0x%lx\n", val);
-			kvm_inject_gp(vcpu, 0);
-			return 1;
+			*exit = true;
+			if (heki_exit_cr(vcpu, KVM_EXIT_HEKI_DENIAL,
+					 KVM_HEKI_EXIT_REASON_CR0, val))
+				return 0;
+			return complete_heki_denial_exit(vcpu);
 		}
 		return 0;
 	case 4:
@@ -8341,8 +8429,11 @@  int heki_check_cr(struct kvm_vcpu *const vcpu, const unsigned long cr,
 		if ((val & pinned) != pinned) {
 			pr_warn_ratelimited(
 				"heki: Blocked CR4 update: 0x%lx\n", val);
-			kvm_inject_gp(vcpu, 0);
-			return 1;
+			*exit = true;
+			if (heki_exit_cr(vcpu, KVM_EXIT_HEKI_DENIAL,
+					 KVM_HEKI_EXIT_REASON_CR4, val))
+				return 0;
+			return complete_heki_denial_exit(vcpu);
 		}
 		return 0;
 	}
@@ -8356,9 +8447,10 @@  static int emulator_set_cr(struct x86_emulate_ctxt *ctxt, int cr, ulong val)
 {
 	struct kvm_vcpu *vcpu = emul_to_vcpu(ctxt);
 	int res = 0;
+	bool exit = false;
 
-	res = heki_check_cr(vcpu, cr, val);
-	if (res)
+	res = heki_check_cr(vcpu, cr, val, &exit);
+	if (exit)
 		return res;
 
 	switch (cr) {
@@ -10222,7 +10314,11 @@  int kvm_emulate_hypercall(struct kvm_vcpu *vcpu)
 		if (a0 > U32_MAX) {
 			ret = -KVM_EINVAL;
 		} else {
-			ret = heki_lock_cr(vcpu, a0, a1, a2);
+			bool exit = false;
+
+			ret = heki_lock_cr(vcpu, a0, a1, a2, &exit);
+			if (exit)
+				return ret;
 		}
 		break;
 #endif /* CONFIG_HEKI */
diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
index ade7d68ddaff..2740b74ab583 100644
--- a/arch/x86/kvm/x86.h
+++ b/arch/x86/kvm/x86.h
@@ -292,18 +292,19 @@  static inline bool kvm_check_has_quirk(struct kvm *kvm, u64 quirk)
 
 #ifdef CONFIG_HEKI
 
-int heki_check_cr(struct kvm_vcpu *vcpu, unsigned long cr, unsigned long val);
+int heki_check_cr(struct kvm_vcpu *vcpu, unsigned long cr, unsigned long val,
+		  bool *exit);
 
 #else /* CONFIG_HEKI */
 
 static inline int heki_check_cr(struct kvm_vcpu *vcpu, unsigned long cr,
-				unsigned long val)
+				unsigned long val, bool *exit)
 {
 	return 0;
 }
 
 static inline int heki_lock_cr(struct kvm_vcpu *const vcpu, unsigned long cr,
-			       unsigned long pin)
+			       unsigned long pin, bool *exit)
 {
 	return 0;
 }
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 6ff13937929a..cf8e271d47aa 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -839,6 +839,8 @@  struct kvm {
 #ifdef CONFIG_HEKI
 	atomic_long_t heki_pinned_cr0;
 	atomic_long_t heki_pinned_cr4;
+	u64 heki_configure_exit_reason;
+	u64 heki_denial_exit_reason;
 #endif /* CONFIG_HEKI */
 
 #ifdef CONFIG_HAVE_KVM_PM_NOTIFIER
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 2190adbe3002..1051c2f817ba 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -178,6 +178,8 @@  struct kvm_xen_exit {
 #define KVM_EXIT_NOTIFY           37
 #define KVM_EXIT_LOONGARCH_IOCSR  38
 #define KVM_EXIT_MEMORY_FAULT     39
+#define KVM_EXIT_HEKI_CONFIGURE   40
+#define KVM_EXIT_HEKI_DENIAL      41
 
 /* For KVM_EXIT_INTERNAL_ERROR */
 /* Emulate instruction failed. */
@@ -433,6 +435,24 @@  struct kvm_run {
 			__u64 gpa;
 			__u64 size;
 		} memory_fault;
+		/* KVM_EXIT_HEKI_CONFIGURE */
+		struct {
+#define KVM_HEKI_EXIT_REASON_CR0	(1ULL << 0)
+#define KVM_HEKI_EXIT_REASON_CR4	(1ULL << 1)
+			__u64 reason;
+			union {
+				__u64 cr_pinned;
+				__u64 reserved[7]; /* ignored */
+			};
+		} heki_configure;
+		/* KVM_EXIT_HEKI_DENIAL */
+		struct {
+			__u64 reason;
+			union {
+				__u64 cr_value;
+				__u64 reserved[7]; /* ignored */
+			};
+		} heki_denial;
 		/* Fix the size of the union. */
 		char padding[256];
 	};
@@ -917,6 +937,8 @@  struct kvm_enable_cap {
 #define KVM_CAP_MEMORY_ATTRIBUTES 233
 #define KVM_CAP_GUEST_MEMFD 234
 #define KVM_CAP_VM_TYPES 235
+#define KVM_CAP_HEKI_CONFIGURE 236
+#define KVM_CAP_HEKI_DENIAL 237
 
 struct kvm_irq_routing_irqchip {
 	__u32 irqchip;