[v2,00/24] KVM: arm64: Introduce pKVM shadow state at EL2

Message ID 20220630135747.26983-1-will@kernel.org

Message

Will Deacon June 30, 2022, 1:57 p.m. UTC
Hi everyone,

This series has been extracted from the pKVM base support series (aka
"pKVM mega-patch") previously posted here:

  https://lore.kernel.org/kvmarm/20220519134204.5379-1-will@kernel.org/

Unlike that more comprehensive series, this one is fairly fundamental
and does not introduce any new ABI commitments, leaving questions
involving the management of guest private memory and the creation of
protected VMs for future work. Instead, this series extends the pKVM EL2
code so that it can dynamically instantiate and manage VM shadow
structures without the host being able to access them directly. These
shadow structures consist of a shadow VM, a set of shadow vCPUs and a
stage-2 page-table; the pages used to hold them are returned to the
host when the VM is destroyed.
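
To give a rough idea of the shape of things, each protected VM ends up
with hypervisor-side state along these lines (an illustrative sketch;
field names approximate the series rather than quote it):

struct kvm_shadow_vm {
	struct kvm kvm;			/* the hypervisor's private instance */
	struct kvm *host_kvm;		/* the untrusted EL1 counterpart */
	size_t shadow_area_size;	/* memory donated by the host */
	struct kvm_pgtable pgt;		/* guest stage-2 page-table */
};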

The last patch is marked as RFC because, although it plumbs in the
shadow state, it is woefully inefficient and copies to/from the host
state on every vCPU run. Without the last patch, the new structures are
unused but we move considerably closer to isolating guests from the
host.

The series is based on Marc's rework of the flags
(kvm-arm64/burn-the-flags).

Feedback welcome.

Cheers,

Will, Quentin, Fuad and Marc

Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Sean Christopherson <seanjc@google.com>
Cc: Will Deacon <will@kernel.org>
Cc: Alexandru Elisei <alexandru.elisei@arm.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: James Morse <james.morse@arm.com>
Cc: Chao Peng <chao.p.peng@linux.intel.com>
Cc: Quentin Perret <qperret@google.com>
Cc: Suzuki K Poulose <suzuki.poulose@arm.com>
Cc: Michael Roth <michael.roth@amd.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Fuad Tabba <tabba@google.com>
Cc: Oliver Upton <oliver.upton@linux.dev>
Cc: Marc Zyngier <maz@kernel.org>

Cc: kernel-team@android.com
Cc: kvm@vger.kernel.org
Cc: kvmarm@lists.cs.columbia.edu
Cc: linux-arm-kernel@lists.infradead.org

--->8

Fuad Tabba (3):
  KVM: arm64: Add hyp_spinlock_t static initializer
  KVM: arm64: Introduce shadow VM state at EL2
  KVM: arm64: Instantiate VM shadow data from EL1

Quentin Perret (15):
  KVM: arm64: Move hyp refcount manipulation helpers
  KVM: arm64: Allow non-coalescable pages in a hyp_pool
  KVM: arm64: Add flags to struct hyp_page
  KVM: arm64: Back hyp_vmemmap for all of memory
  KVM: arm64: Make hyp stage-1 refcnt correct on the whole range
  KVM: arm64: Implement do_donate() helper for donating memory
  KVM: arm64: Prevent the donation of no-map pages
  KVM: arm64: Add helpers to pin memory shared with hyp
  KVM: arm64: Add pcpu fixmap infrastructure at EL2
  KVM: arm64: Add generic hyp_memcache helpers
  KVM: arm64: Instantiate guest stage-2 page-tables at EL2
  KVM: arm64: Return guest memory from EL2 via dedicated teardown
    memcache
  KVM: arm64: Unmap kvm_arm_hyp_percpu_base from the host
  KVM: arm64: Explicitly map kvm_vgic_global_state at EL2
  KVM: arm64: Don't map host sections in pkvm

Will Deacon (6):
  KVM: arm64: Unify identifiers used to distinguish host and hypervisor
  KVM: arm64: Include asm/kvm_mmu.h in nvhe/mem_protect.h
  KVM: arm64: Initialise hyp symbols regardless of pKVM
  KVM: arm64: Provide I-cache invalidation by VA at EL2
  KVM: arm64: Maintain a copy of 'kvm_arm_vmid_bits' at EL2
  KVM: arm64: Use the shadow vCPU structure in handle___kvm_vcpu_run()

 arch/arm64/include/asm/kvm_asm.h              |   6 +-
 arch/arm64/include/asm/kvm_host.h             |  65 +++
 arch/arm64/include/asm/kvm_hyp.h              |   3 +
 arch/arm64/include/asm/kvm_pgtable.h          |   8 +
 arch/arm64/include/asm/kvm_pkvm.h             |  38 ++
 arch/arm64/kernel/image-vars.h                |  15 -
 arch/arm64/kvm/arm.c                          |  40 +-
 arch/arm64/kvm/hyp/hyp-constants.c            |   3 +
 arch/arm64/kvm/hyp/include/nvhe/gfp.h         |   6 +-
 arch/arm64/kvm/hyp/include/nvhe/mem_protect.h |  19 +-
 arch/arm64/kvm/hyp/include/nvhe/memory.h      |  26 +-
 arch/arm64/kvm/hyp/include/nvhe/mm.h          |  18 +-
 arch/arm64/kvm/hyp/include/nvhe/pkvm.h        |  70 +++
 arch/arm64/kvm/hyp/include/nvhe/spinlock.h    |  10 +-
 arch/arm64/kvm/hyp/nvhe/cache.S               |  11 +
 arch/arm64/kvm/hyp/nvhe/hyp-main.c            | 105 +++-
 arch/arm64/kvm/hyp/nvhe/hyp-smp.c             |   2 +
 arch/arm64/kvm/hyp/nvhe/mem_protect.c         | 456 +++++++++++++++++-
 arch/arm64/kvm/hyp/nvhe/mm.c                  | 136 +++++-
 arch/arm64/kvm/hyp/nvhe/page_alloc.c          |  42 +-
 arch/arm64/kvm/hyp/nvhe/pkvm.c                | 438 +++++++++++++++++
 arch/arm64/kvm/hyp/nvhe/setup.c               |  96 ++--
 arch/arm64/kvm/hyp/pgtable.c                  |   9 +
 arch/arm64/kvm/mmu.c                          |  26 +
 arch/arm64/kvm/pkvm.c                         | 121 ++++-
 25 files changed, 1625 insertions(+), 144 deletions(-)
 create mode 100644 arch/arm64/kvm/hyp/include/nvhe/pkvm.h

Comments

Sean Christopherson July 6, 2022, 7:17 p.m. UTC | #1
On Thu, Jun 30, 2022, Will Deacon wrote:
> Hi everyone,
> 
> This series has been extracted from the pKVM base support series (aka
> "pKVM mega-patch") previously posted here:
> 
>   https://lore.kernel.org/kvmarm/20220519134204.5379-1-will@kernel.org/
> 
> Unlike that more comprehensive series, this one is fairly fundamental
> and does not introduce any new ABI commitments, leaving questions
> involving the management of guest private memory and the creation of
> protected VMs for future work. Instead, this series extends the pKVM EL2
> code so that it can dynamically instantiate and manage VM shadow
> structures without the host being able to access them directly. These
> shadow structures consist of a shadow VM, a set of shadow vCPUs and a
> stage-2 page-table; the pages used to hold them are returned to the
> host when the VM is destroyed.
> 
> The last patch is marked as RFC because, although it plumbs in the
> shadow state, it is woefully inefficient and copies to/from the host
> state on every vCPU run. Without the last patch, the new structures are
> unused but we move considerably closer to isolating guests from the
> host.

...

>  arch/arm64/include/asm/kvm_asm.h              |   6 +-
>  arch/arm64/include/asm/kvm_host.h             |  65 +++
>  arch/arm64/include/asm/kvm_hyp.h              |   3 +
>  arch/arm64/include/asm/kvm_pgtable.h          |   8 +
>  arch/arm64/include/asm/kvm_pkvm.h             |  38 ++
>  arch/arm64/kernel/image-vars.h                |  15 -
>  arch/arm64/kvm/arm.c                          |  40 +-
>  arch/arm64/kvm/hyp/hyp-constants.c            |   3 +
>  arch/arm64/kvm/hyp/include/nvhe/gfp.h         |   6 +-
>  arch/arm64/kvm/hyp/include/nvhe/mem_protect.h |  19 +-
>  arch/arm64/kvm/hyp/include/nvhe/memory.h      |  26 +-
>  arch/arm64/kvm/hyp/include/nvhe/mm.h          |  18 +-
>  arch/arm64/kvm/hyp/include/nvhe/pkvm.h        |  70 +++
>  arch/arm64/kvm/hyp/include/nvhe/spinlock.h    |  10 +-
>  arch/arm64/kvm/hyp/nvhe/cache.S               |  11 +
>  arch/arm64/kvm/hyp/nvhe/hyp-main.c            | 105 +++-
>  arch/arm64/kvm/hyp/nvhe/hyp-smp.c             |   2 +
>  arch/arm64/kvm/hyp/nvhe/mem_protect.c         | 456 +++++++++++++++++-
>  arch/arm64/kvm/hyp/nvhe/mm.c                  | 136 +++++-
>  arch/arm64/kvm/hyp/nvhe/page_alloc.c          |  42 +-
>  arch/arm64/kvm/hyp/nvhe/pkvm.c                | 438 +++++++++++++++++
>  arch/arm64/kvm/hyp/nvhe/setup.c               |  96 ++--
>  arch/arm64/kvm/hyp/pgtable.c                  |   9 +
>  arch/arm64/kvm/mmu.c                          |  26 +
>  arch/arm64/kvm/pkvm.c                         | 121 ++++-
>  25 files changed, 1625 insertions(+), 144 deletions(-)
>  create mode 100644 arch/arm64/kvm/hyp/include/nvhe/pkvm.h

The lack of documentation and the rather terse changelogs make this really hard
to review for folks that aren't intimately familiar with pKVM.  I have a decent
idea of the end goal of "shadowing", but that's mostly because of my involvement in
similar x86 projects.  Nothing in the changelogs ever explains _why_ pKVM uses
shadows.

I put "shadowing" in quotes because if the unstrusted host is aware that the VM
and vCPU it is manipulating aren't the "real" VMs/vCPUs, and there is an explicit API
between the untrusted host and pKVM for creating/destroying VMs/vCPUs, then I would
argue that it's not truly shadowing, especially if pKVM uses data/values verbatim
and only verifies correctness/safety.  It's definitely a nit, but for future readers
I think overloading "shadowing" could be confusing.

And beyond the basics, IMO pKVM needs a more formal definition of exactly what
guest state is protected/hidden from the untrusted host.  Peeking at the mega series,
there is a huge pile of patches that result in "gradual reduction of EL2 trust in
host data", but I couldn't find any documentation that defines what that end result is.
Will Deacon July 8, 2022, 4:23 p.m. UTC | #2
Hi Sean,

Thanks for having a look.

On Wed, Jul 06, 2022 at 07:17:29PM +0000, Sean Christopherson wrote:
> On Thu, Jun 30, 2022, Will Deacon wrote:
> > This series has been extracted from the pKVM base support series (aka
> > "pKVM mega-patch") previously posted here:
> > 
> >   https://lore.kernel.org/kvmarm/20220519134204.5379-1-will@kernel.org/
> > 
> > Unlike that more comprehensive series, this one is fairly fundamental
> > and does not introduce any new ABI commitments, leaving questions
> > involving the management of guest private memory and the creation of
> > protected VMs for future work. Instead, this series extends the pKVM EL2
> > code so that it can dynamically instantiate and manage VM shadow
> > structures without the host being able to access them directly. These
> > shadow structures consist of a shadow VM, a set of shadow vCPUs and a
> > stage-2 page-table; the pages used to hold them are returned to the
> > host when the VM is destroyed.
> > 
> > The last patch is marked as RFC because, although it plumbs in the
> > shadow state, it is woefully inefficient and copies to/from the host
> > state on every vCPU run. Without the last patch, the new structures are
> > unused but we move considerably closer to isolating guests from the
> > host.
> 
> ...
> 
> >  arch/arm64/include/asm/kvm_asm.h              |   6 +-
> >  arch/arm64/include/asm/kvm_host.h             |  65 +++
> >  arch/arm64/include/asm/kvm_hyp.h              |   3 +
> >  arch/arm64/include/asm/kvm_pgtable.h          |   8 +
> >  arch/arm64/include/asm/kvm_pkvm.h             |  38 ++
> >  arch/arm64/kernel/image-vars.h                |  15 -
> >  arch/arm64/kvm/arm.c                          |  40 +-
> >  arch/arm64/kvm/hyp/hyp-constants.c            |   3 +
> >  arch/arm64/kvm/hyp/include/nvhe/gfp.h         |   6 +-
> >  arch/arm64/kvm/hyp/include/nvhe/mem_protect.h |  19 +-
> >  arch/arm64/kvm/hyp/include/nvhe/memory.h      |  26 +-
> >  arch/arm64/kvm/hyp/include/nvhe/mm.h          |  18 +-
> >  arch/arm64/kvm/hyp/include/nvhe/pkvm.h        |  70 +++
> >  arch/arm64/kvm/hyp/include/nvhe/spinlock.h    |  10 +-
> >  arch/arm64/kvm/hyp/nvhe/cache.S               |  11 +
> >  arch/arm64/kvm/hyp/nvhe/hyp-main.c            | 105 +++-
> >  arch/arm64/kvm/hyp/nvhe/hyp-smp.c             |   2 +
> >  arch/arm64/kvm/hyp/nvhe/mem_protect.c         | 456 +++++++++++++++++-
> >  arch/arm64/kvm/hyp/nvhe/mm.c                  | 136 +++++-
> >  arch/arm64/kvm/hyp/nvhe/page_alloc.c          |  42 +-
> >  arch/arm64/kvm/hyp/nvhe/pkvm.c                | 438 +++++++++++++++++
> >  arch/arm64/kvm/hyp/nvhe/setup.c               |  96 ++--
> >  arch/arm64/kvm/hyp/pgtable.c                  |   9 +
> >  arch/arm64/kvm/mmu.c                          |  26 +
> >  arch/arm64/kvm/pkvm.c                         | 121 ++++-
> >  25 files changed, 1625 insertions(+), 144 deletions(-)
> >  create mode 100644 arch/arm64/kvm/hyp/include/nvhe/pkvm.h
> 
> The lack of documentation and the rather terse changelogs make this really hard
> to review for folks that aren't intimately familiar with pKVM.  I have a decent
> idea of the end goal of "shadowing", but that's mostly because of my involvement in
> similar x86 projects.  Nothing in the changelogs ever explains _why_ pKVM uses
> shadows.

That's understandable, but thanks for persevering; this series is pretty
deep in the murky depths of the arm64 architecture and EL2 code, so it
doesn't really map to the KVM code most folks are familiar with. It's fair
to say we're assuming a lot of niche prior knowledge (which is quite common
for arch code in my experience), but I wanted to inherit the broader cc list
so you were aware of this break-away series. Sadly, I don't think beefing up
the commit messages would get us to a point where somebody not already
familiar with the EL2 code could give a constructive review, but we can try to
expand them a bit if you genuinely think it would help.

On the more positive side, we'll be speaking at KVM forum about what we've
done here, so that will be a great place to discuss it further and then we
can also link back to the recordings in later postings of the mega-series.

> I put "shadowing" in quotes because if the unstrusted host is aware that the VM
> and vCPU it is manipulating aren't the "real" VMs/vCPUs, and there is an explicit API
> between the untrusted host and pKVM for creating/destroying VMs/vCPUs, then I would
> argue that it's not truly shadowing, especially if pKVM uses data/values verbatim
> and only verifies correctness/safety.  It's definitely a nit, but for future readers
> I think overloading "shadowing" could be confusing.

Ah, this is really interesting and nicely puts the ball back in my court as
I'm not well versed with x86's use of "shadowing". We should probably use
another name (ideas?), but our "shadow" is very much explicit -- rather than
the host using its 'struct kvm's and 'struct kvm_vcpu's directly, it instead
passes those data structures to the hypervisor which allocates its own
structures (in some cases reusing the host definitions directly, e.g.
'struct kvm_vcpu') and returns a handle to the host as a reference. The
host can then issue hypercalls with this handle to load/put vCPUs of that
VM, run them once they are loaded and synchronise aspects of the vCPU state
between the host and the hypervisor copies for things like emulation traps
or interrupt injection. The main thing is that the pages containing the
hypervisor structures are not accessible by the host until the corresponding
VM is destroyed.
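
Roughly, from the host's side, the lifecycle looks something like this
(a sketch only; the hypercall names are close to what we have in the
series, but don't treat this as the exact ABI):

	pkvm_handle_t handle;

	/* Donate memory and instantiate the hypervisor's copy of the VM. */
	handle = kvm_call_hyp_nvhe(__pkvm_init_shadow, kvm, ...);

	/* Run a vCPU; the hypervisor resolves the handle internally. */
	ret = kvm_call_hyp_nvhe(__kvm_vcpu_run, vcpu);

	/* Tear down the VM; donated pages are returned to the host. */
	kvm_call_hyp_nvhe(__pkvm_teardown_shadow, handle);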

The advantage of doing it this way is that we don't need to change very
much of the host KVM code at all, and we can even reuse some of it directly
in the hypervisor (e.g. inline functions and macros).

Perhaps we should s/shadow/hyp/ to make this a little clearer?

> And beyond the basics, IMO pKVM needs a more formal definition of exactly what
> guest state is protected/hidden from the untrusted host.  Peeking at the mega series,
> there is a huge pile of patches that result in "gradual reduction of EL2 trust in
> host data", but I couldn't find any documentation that defines what that end result is.

That's fair; I'll work to extend the documentation in the next iteration of
the mega series to cover this in more detail. Roughly speaking, the end
result is that the vCPU register and memory state is inaccessible to the
host except in cases where the guest has done something to expose it, such
as MMIO or a memory sharing hypercall. Given the complexity of the register
state (GPRs, floating point, SIMD, debug, etc.), the mega-series elevates
portions of the state from the host to the hypervisor as separate patches
to structure things a bit better (that's where the gradual reduction comes
in).

Does that help at all?

Cheers,

Will
Vincent Donnefort July 19, 2022, 2:24 p.m. UTC | #3
On Thu, Jun 30, 2022 at 02:57:23PM +0100, Will Deacon wrote:
> Hi everyone,
> 
> This series has been extracted from the pKVM base support series (aka
> "pKVM mega-patch") previously posted here:
> 
>   https://lore.kernel.org/kvmarm/20220519134204.5379-1-will@kernel.org/
> 
> Unlike that more comprehensive series, this one is fairly fundamental
> and does not introduce any new ABI commitments, leaving questions
> involving the management of guest private memory and the creation of
> protected VMs for future work. Instead, this series extends the pKVM EL2
> code so that it can dynamically instantiate and manage VM shadow
> structures without the host being able to access them directly. These
> shadow structures consist of a shadow VM, a set of shadow vCPUs and a
> stage-2 page-table; the pages used to hold them are returned to the
> host when the VM is destroyed.
> 
> The last patch is marked as RFC because, although it plumbs in the
> shadow state, it is woefully inefficient and copies to/from the host
> state on every vCPU run. Without the last patch, the new structures are
> unused but we move considerably closer to isolating guests from the
> host.
> 
> The series is based on Marc's rework of the flags
> (kvm-arm64/burn-the-flags).
> 
> Feedback welcome.
> 
> Cheers,

Only had a few nitpicks

Reviewed-by: Vincent Donnefort <vdonnefort@google.com>

Also, I've been using this patchset for quite a while now.

Tested-by: Vincent Donnefort <vdonnefort@google.com>

[...]
Sean Christopherson July 19, 2022, 4:11 p.m. UTC | #4
Apologies for the slow reply.

On Fri, Jul 08, 2022, Will Deacon wrote:
> On Wed, Jul 06, 2022 at 07:17:29PM +0000, Sean Christopherson wrote:
> > On Thu, Jun 30, 2022, Will Deacon wrote:
> > The lack of documentation and the rather terse changelogs make this really hard
> > to review for folks that aren't intimately familiar with pKVM.  I have a decent
> > idea of the end goal of "shadowing", but that's mostly because of my involvement in
> > similar x86 projects.  Nothing in the changelogs ever explains _why_ pKVM uses
> > shadows.
> 
> That's understandable, but thanks for persevering; this series is pretty
> deep in the murky depths of the arm64 architecture and EL2 code, so it
> doesn't really map to the KVM code most folks are familiar with. It's fair
> to say we're assuming a lot of niche prior knowledge (which is quite common
> for arch code in my experience),

Assuming prior knowledge is fine so long as that prior knowledge is something
that can be gained by future readers through public documentation.  E.g. arch code
is always littered with acronyms, but someone can almost always decipher the
acronyms by reading specs or just searching the interwebs.

My objection to the changelogs is that they talk about "shadow" VMs, vCPUs, state,
structures, etc., without ever explaining what a shadow is, how it will be used,
or what its end purpose is.

A big part of the problem is that "shadow" is not unique terminology _and_ isn't
inherently tied to "protected kvm", e.g. a reader can't intuit that a "shadow vm"
is the trusted hypervisor's instance of the VM.  And I can search for pKVM and
get relevant, helpful search results.  But if I search for "shadow vm", I get
unrelated results, and "pkvm shadow vm" just leads me back to this series.

> but I wanted to inherit the broader cc list so you were aware of this
> break-away series. Sadly, I don't think beefing up the commit messages would
> get us to a point where somebody not already familiar with the EL2 code could
> give a constructive review, but we can try to expand them a bit if you
> genuinely think it would help.

I'm not looking at it just from a review point of view, but also from a future
reader's perspective.  E.g. someone who looks at this changelog in isolation is going to
have no idea what a "shadow VM" is:

  KVM: arm64: Introduce pKVM shadow VM state at EL2

  Introduce a table of shadow VM structures at EL2 and provide hypercalls
  to the host for creating and destroying shadow VMs.

Obviously there will be some context available in surrounding patches, but if you
avoid the "shadow" terminology and provide a bit more context, then it yields
something like:

  KVM: arm64: Add infrastructure to create and track pKVM instances at EL2

  Introduce a global table (and lock) to track pKVM instances at EL2, and
  provide hypercalls that can be used by the untrusted host to create and
  destroy pKVM VMs.  pKVM VM/vCPU state is directly accessible only by the
  trusted hypervisor (EL2).  

  Each pKVM VM is directly associated with an untrusted host KVM instance,
  and is referenced by the host using an opaque handle.  Future patches will
  provide hypercalls to allow the host to initialize/set/get pKVM VM/vCPU
  state using the opaque handle.
   
> On the more positive side, we'll be speaking at KVM forum about what we've
> done here, so that will be a great place to discuss it further and then we
> can also link back to the recordings in later postings of the mega-series.
> 
> > I put "shadowing" in quotes because if the unstrusted host is aware that the VM
> > and vCPU it is manipulating aren't the "real" VMs/vCPUs, and there is an explicit API
> > between the untrusted host and pKVM for creating/destroying VMs/vCPUs, then I would
> > argue that it's not truly shadowing, especially if pKVM uses data/values verbatim
> > and only verifies correctness/safety.  It's definitely a nit, but for future readers
> > I think overloading "shadowing" could be confusing.
> 
> Ah, this is really interesting and nicely puts the ball back in my court as
> I'm not well versed with x86's use of "shadowing".

It's not just an x86 thing, e.g. see https://en.wikipedia.org/wiki/Shadow_table.  The
use in pKVM is _really_ close in that what pKVM calls the shadow is the "real"
data that's used, but pKVM inverts the typical virtualization usage, which is
why I find it confusing.  I.e. instead of shadowing state being written by the
guest, pKVM is "shadowing" state written by the host.  If there ever comes a need
to actually shadow guest state, e.g. for nested virtualization, then using shadow
to refer to the protected state is going to create a conundrum.

Honestly, I think pKVM is simply being too cute in picking names.  And not just
for "shadow", e.g. IMO the flush/sync terminology in patch 24 is also unnecessarily
cute.  Instead of coming up with clever names, just be explicit in what the code
is doing.  E.g. something like:

  flush_shadow_state() => sync_host_to_pkvm_vcpu()
  sync_shadow_state()  => sync_pkvm_to_host_vcpu()

Then readers know the two functions are pairs, and will have a decent idea of
what the functions do even if they don't fully understand pKVM vs. host.

"shadow_area_size" is another case where it's unnecessarily cryptic, e.g. just
call it "donated_memory_size".

> Perhaps we should s/shadow/hyp/ to make this a little clearer?

Or maybe just "pkvm"?  I think that's especially viable if you do away with
kvm_shadow_vcpu_state.  As of this series at least, kvm_shadow_vcpu_state is
completely unnecessary.  kvm_vcpu.kvm can be used to get at the VM, and thus pKVM
state via container_of().  Then the host_vcpu can be retrieved by using the
vcpu_idx, e.g.

	struct pkvm_vm *pkvm_vm = to_pkvm_vm(pkvm_vcpu->vm);
	struct kvm_vcpu *host_vcpu;

	host_vcpu = kvm_get_vcpu(pkvm_vm->host_vm, pkvm_vcpu->vcpu_idx);

Even better is to not have to do that in the first place.  AFAICT, there's no need
to do such a lookup in handle___kvm_vcpu_run() since pKVM already has pointers to
both the host vCPU and the pKVM vCPU.

E.g. I believe you can make the code look like this:

struct kvm_arch {
	...

	/*
	 * For an untrusted host VM, pkvm_handle is used to look up the
	 * associated pKVM instance.
	 */
	pkvm_handle_t pkvm_handle;
};

struct pkvm_vm {
	struct kvm kvm;

	/* Backpointer to the host's (untrusted) KVM instance. */
	struct kvm *host_kvm;

	size_t donated_memory_size;

	struct kvm_pgtable pgt;
};

static struct kvm *pkvm_get_vm(pkvm_handle_t handle)
{
	unsigned int idx = pkvm_handle_to_idx(handle);

	if (unlikely(idx >= KVM_MAX_PVMS))
		return NULL;

	return pkvm_vm_table[idx];
}

struct kvm_vcpu *pkvm_vcpu_load(pkvm_handle_t handle, unsigned int vcpu_idx)
{
	struct kvm_vcpu *pkvm_vcpu = NULL;
	struct kvm *vm;

	hyp_spin_lock(&pkvm_global_lock);
	vm = pkvm_get_vm(handle);
	if (!vm || atomic_read(&vm->online_vcpus) <= vcpu_idx)
		goto unlock;

	pkvm_vcpu = kvm_get_vcpu(vm, vcpu_idx);
	hyp_page_ref_inc(hyp_virt_to_page(vm));
unlock:
	hyp_spin_unlock(&pkvm_global_lock);
	return pkvm_vcpu;
}

void pkvm_vcpu_put(struct kvm_vcpu *pkvm_vcpu)
{
	hyp_spin_lock(&pkvm_global_lock);
	hyp_page_ref_dec(hyp_virt_to_page(pkvm_vcpu->kvm));
	hyp_spin_unlock(&pkvm_global_lock);
}

static void sync_host_to_pkvm_vcpu(struct kvm_vcpu *pkvm_vcpu, struct kvm_vcpu *host_vcpu)
{
	pkvm_vcpu->arch.ctxt		= host_vcpu->arch.ctxt;

	pkvm_vcpu->arch.sve_state	= kern_hyp_va(host_vcpu->arch.sve_state);
	pkvm_vcpu->arch.sve_max_vl	= host_vcpu->arch.sve_max_vl;

	pkvm_vcpu->arch.hw_mmu		= host_vcpu->arch.hw_mmu;

	pkvm_vcpu->arch.hcr_el2		= host_vcpu->arch.hcr_el2;
	pkvm_vcpu->arch.mdcr_el2	= host_vcpu->arch.mdcr_el2;
	pkvm_vcpu->arch.cptr_el2	= host_vcpu->arch.cptr_el2;

	pkvm_vcpu->arch.iflags		= host_vcpu->arch.iflags;
	pkvm_vcpu->arch.fp_state	= host_vcpu->arch.fp_state;

	pkvm_vcpu->arch.debug_ptr	= kern_hyp_va(host_vcpu->arch.debug_ptr);
	pkvm_vcpu->arch.host_fpsimd_state = host_vcpu->arch.host_fpsimd_state;

	pkvm_vcpu->arch.vsesr_el2	= host_vcpu->arch.vsesr_el2;

	pkvm_vcpu->arch.vgic_cpu.vgic_v3 = host_vcpu->arch.vgic_cpu.vgic_v3;
}

static void sync_pkvm_to_host_vcpu(struct kvm_vcpu *host_vcpu, struct kvm_vcpu *pkvm_vcpu)
{
	struct vgic_v3_cpu_if *pkvm_cpu_if = &pkvm_vcpu->arch.vgic_cpu.vgic_v3;
	struct vgic_v3_cpu_if *host_cpu_if = &host_vcpu->arch.vgic_cpu.vgic_v3;
	unsigned int i;

	host_vcpu->arch.ctxt		= pkvm_vcpu->arch.ctxt;

	host_vcpu->arch.hcr_el2		= pkvm_vcpu->arch.hcr_el2;
	host_vcpu->arch.cptr_el2	= pkvm_vcpu->arch.cptr_el2;

	host_vcpu->arch.fault		= pkvm_vcpu->arch.fault;

	host_vcpu->arch.iflags		= pkvm_vcpu->arch.iflags;
	host_vcpu->arch.fp_state	= pkvm_vcpu->arch.fp_state;

	host_cpu_if->vgic_hcr		= pkvm_cpu_if->vgic_hcr;
	for (i = 0; i < pkvm_cpu_if->used_lrs; ++i)
		host_cpu_if->vgic_lr[i] = pkvm_cpu_if->vgic_lr[i];
}

static void handle___kvm_vcpu_run(struct kvm_cpu_context *host_ctxt)
{
	DECLARE_REG(struct kvm_vcpu *, host_vcpu, host_ctxt, 1);
	int ret;

	host_vcpu = kern_hyp_va(host_vcpu);

	if (unlikely(is_protected_kvm_enabled())) {
		struct kvm *host_kvm = kern_hyp_va(host_vcpu->kvm);
		struct kvm_vcpu *pkvm_vcpu;

		pkvm_vcpu = pkvm_vcpu_load(host_kvm->arch.pkvm_handle,
					   host_vcpu->vcpu_idx);
		if (!pkvm_vcpu) {
			ret = -EINVAL;
			goto out;
		}

		sync_host_to_pkvm_vcpu(pkvm_vcpu, host_vcpu);

		ret = __kvm_vcpu_run(pkvm_vcpu);

		sync_pkvm_to_host_vcpu(host_vcpu, pkvm_vcpu);

		pkvm_vcpu_put(pkvm_vcpu);
	} else {
		ret = __kvm_vcpu_run(host_vcpu);
	}

out:
	cpu_reg(host_ctxt, 1) = ret;
}
Marc Zyngier July 20, 2022, 9:25 a.m. UTC | #5
On Tue, 19 Jul 2022 17:11:32 +0100,
Sean Christopherson <seanjc@google.com> wrote:

> Honestly, I think pKVM is simply being too cute in picking names.

I don't know what you mean by "cute" here, but I assume this is not
exactly a flattering qualifier.

> And not just for "shadow", e.g. IMO the flush/sync terminology in
> patch 24 is also unnecessarily cute.  Instead of coming up with
> clever names, just be explicit in what the code is doing.
> E.g. something like:
> 
>   flush_shadow_state() => sync_host_to_pkvm_vcpu()
>   sync_shadow_state()  => sync_pkvm_to_host_vcpu()

As much as I like bikeshedding, this isn't going to happen. We have had
the sync/flush duality since day one, we have a lot of code based
around this naming, and departing from it seems counter productive.

	M.
Will Deacon July 20, 2022, 6:48 p.m. UTC | #6
Hi Sean,

On Tue, Jul 19, 2022 at 04:11:32PM +0000, Sean Christopherson wrote:
> Apologies for the slow reply.

No problem; you've provided a tonne of insightful feedback here, so it was
worth the wait. Thanks!

> On Fri, Jul 08, 2022, Will Deacon wrote:
> > but I wanted to inherit the broader cc list so you were aware of this
> > break-away series. Sadly, I don't think beefing up the commit messages would
> > get us to a point where somebody not already familiar with the EL2 code could
> > give a constructive review, but we can try to expand them a bit if you
> > genuinely think it would help.
> 
> I'm not looking at it just from a review point of view, but also from a future
> reader's perspective.  E.g. someone who looks at this changelog in isolation is going to
> have no idea what a "shadow VM" is:
> 
>   KVM: arm64: Introduce pKVM shadow VM state at EL2
> 
>   Introduce a table of shadow VM structures at EL2 and provide hypercalls
>   to the host for creating and destroying shadow VMs.
> 
> Obviously there will be some context available in surrounding patches, but if you
> avoid the "shadow" terminology and provide a bit more context, then it yields
> something like:
> 
>   KVM: arm64: Add infrastructure to create and track pKVM instances at EL2
> 
>   Introduce a global table (and lock) to track pKVM instances at EL2, and
>   provide hypercalls that can be used by the untrusted host to create and
>   destroy pKVM VMs.  pKVM VM/vCPU state is directly accessible only by the
>   trusted hypervisor (EL2).  
> 
>   Each pKVM VM is directly associated with an untrusted host KVM instance,
>   and is referenced by the host using an opaque handle.  Future patches will
>   provide hypercalls to allow the host to initialize/set/get pKVM VM/vCPU
>   state using the opaque handle.

Thanks, that's much better. I'll have to summon up the energy to go through
the others as well...

> > Perhaps we should s/shadow/hyp/ to make this a little clearer?
> 
> Or maybe just "pkvm"?

I think the "hyp" part is useful to distinguish the pkvm code running at EL2
from the pkvm code running at EL1. For example, we have a 'pkvm' member in
'struct kvm_arch' which is used by the _host_ at EL1.

So I'd say either "pkvm_hyp" or "hyp" instead of "shadow". The latter is
nice and short...

> I think that's especially viable if you do away with
> kvm_shadow_vcpu_state.  As of this series at least, kvm_shadow_vcpu_state is
> completely unnecessary.  kvm_vcpu.kvm can be used to get at the VM, and thus pKVM
> state via container_of().  Then the host_vcpu can be retrieved by using the
> vcpu_idx, e.g.
> 
> 	struct pkvm_vm *pkvm_vm = to_pkvm_vm(pkvm_vcpu->vm);
> 	struct kvm_vcpu *host_vcpu;
> 
> 	host_vcpu = kvm_get_vcpu(pkvm_vm->host_vm, pkvm_vcpu->vcpu_idx);

Using container_of() here is neat; we can definitely go ahead with that
change. However, looking at this in more detail with Fuad, removing
'struct kvm_shadow_vcpu_state' entirely isn't going to work:

> E.g. I believe you can make the code look like this:
> 
> struct kvm_arch {
> 	...
> 
> 	/*
> 	 * For an untrusted host VM, pkvm_handle is used to look up the
> 	 * associated pKVM instance.
> 	 */
> 	pkvm_handle_t pkvm_handle;
> };
> 
> struct pkvm_vm {
> 	struct kvm kvm;
> 
> 	/* Backpointer to the host's (untrusted) KVM instance. */
> 	struct kvm *host_kvm;
> 
> 	size_t donated_memory_size;
> 
> 	struct kvm_pgtable pgt;
> };
> 
> static struct kvm *pkvm_get_vm(pkvm_handle_t handle)
> {
> 	unsigned int idx = pkvm_handle_to_idx(handle);
> 
> 	if (unlikely(idx >= KVM_MAX_PVMS))
> 		return NULL;
> 
> 	return pkvm_vm_table[idx];
> }
> 
> struct kvm_vcpu *pkvm_vcpu_load(pkvm_handle_t handle, unsigned int vcpu_idx)
> {
> 	struct kvm_vcpu *pkvm_vcpu = NULL;
> 	struct kvm *vm;
> 
> 	hyp_spin_lock(&pkvm_global_lock);
> 	vm = pkvm_get_vm(handle);
> 	if (!vm || atomic_read(&vm->online_vcpus) <= vcpu_idx)
> 		goto unlock;
> 
> 	pkvm_vcpu = kvm_get_vcpu(vm, vcpu_idx);

kvm_get_vcpu() makes use of an xarray to hold the vCPU pointers and this is
really something which we cannot support at EL2 where, amongst other things,
we do not have support for RCU. Consequently, we do need to keep our own
mapping from the shad^H^H^H^Hhyp vCPU to the host vCPU.
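
For now, that mapping lives in a small wrapper structure, roughly like
this (trimmed; see the series for the full definition):

	struct kvm_shadow_vcpu_state {
		/* The hypervisor's private vCPU. */
		struct kvm_vcpu shadow_vcpu;

		/* Backpointer to the host's (untrusted) vCPU instance. */
		struct kvm_vcpu *host_vcpu;

		struct kvm_shadow_vm *shadow_vm;
	};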

We also end up expanding the 'struct kvm_shadow_vcpu_state' structure later
to track additional vCPU state in the hypervisor, for example in the
mega-series:

https://lore.kernel.org/kvmarm/20220519134204.5379-78-will@kernel.org/#Z31arch:arm64:kvm:hyp:include:nvhe:pkvm.h

Cheers,

Will
Sean Christopherson July 20, 2022, 9:17 p.m. UTC | #7
On Wed, Jul 20, 2022, Will Deacon wrote:
> Hi Sean,
> 
> On Tue, Jul 19, 2022 at 04:11:32PM +0000, Sean Christopherson wrote:
> > Or maybe just "pkvm"?
> 
> I think the "hyp" part is useful to distinguish the pkvm code running at EL2
> from the pkvm code running at EL1. For example, we have a 'pkvm' member in
> 'struct kvm_arch' which is used by the _host_ at EL1.

Right, my suggestion was to rename that to pkvm_handle to avoid a direct conflict,
and then that naturally yields the "pkvm_handle => pkvm_vm" association.  Or are
you expecting to shove more stuff into that "pkvm" struct?
 
> So I'd say either "pkvm_hyp" or "hyp" instead of "shadow". The latter is
> nice and short...

I 100% agree that differentiating between EL1 and EL2 is important for functions,
structs and global variables, but I would argue it's not so important for fields
and local variables where the "owning" struct/function provides that context.  But
that's actually a partial argument for just using "hyp".

My concern with just using e.g. "kvm_hyp" is that, because non-pKVM nVHE also has
the host vs. hyp split, it could lead people to believe that "kvm_hyp" is also
used for the non-pKVM case.

So, what about a blend?  E.g. "struct pkvm_hyp_vcpu *hyp_vcpu".  That provides
the context that the struct is specific to the EL2 side of pKVM, most usage is
nice and short, and the "hyp" prefix avoids the ambiguity that a bare "pkvm" would
suffer for EL1 vs. EL2.

Doesn't look awful?

static void handle___kvm_vcpu_run(struct kvm_cpu_context *host_ctxt)
{
	DECLARE_REG(struct kvm_vcpu *, host_vcpu, host_ctxt, 1);
	int ret;

	host_vcpu = kern_hyp_va(host_vcpu);

	if (unlikely(is_protected_kvm_enabled())) {
		struct pkvm_hyp_vcpu *hyp_vcpu;
		struct kvm *host_kvm;

		host_kvm = kern_hyp_va(host_vcpu->kvm);

		hyp_vcpu = pkvm_load_hyp_vcpu(host_kvm->arch.pkvm.handle,
					      host_vcpu->vcpu_idx);
		if (!hyp_vcpu) {
			ret = -EINVAL;
			goto out;
		}

		flush_pkvm_guest_state(hyp_vcpu);

		ret = __kvm_vcpu_run(&hyp_vcpu->vcpu);

		sync_pkvm_guest_state(hyp_vcpu);

		pkvm_put_hyp_vcpu(hyp_vcpu);
	} else {
		/* The host is fully trusted, run its vCPU directly. */
		ret = __kvm_vcpu_run(host_vcpu);
	}

out:
	cpu_reg(host_ctxt, 1) = ret;
}

> > I think that's especially viable if you do away with
> > kvm_shadow_vcpu_state.  As of this series at least, kvm_shadow_vcpu_state is
> > completely unnecessary.  kvm_vcpu.kvm can be used to get at the VM, and thus pKVM
> > state via container_of().  Then the host_vcpu can be retrieved by using the
> > vcpu_idx, e.g.
> > 
> > 	struct pkvm_vm *pkvm_vm = to_pkvm_vm(pkvm_vcpu->vm);
> > 	struct kvm_vcpu *host_vcpu;
> > 
> > 	host_vcpu = kvm_get_vcpu(pkvm_vm->host_vm, pkvm_vcpu->vcpu_idx);
> 
> Using container_of() here is neat; we can definitely go ahead with that
> change. However, looking at this in more detail with Fuad, removing
> 'struct kvm_shadow_vcpu_state' entirely isn't going to work:

> > struct kvm_vcpu *pkvm_vcpu_load(pkvm_handle_t handle, unsigned int vcpu_idx)
> > {
> > 	struct kvm_vcpu *pkvm_vcpu = NULL;
> > 	struct kvm *vm;
> > 
> > 	hyp_spin_lock(&pkvm_global_lock);
> > 	vm = pkvm_get_vm(handle);
> > 	if (!vm || atomic_read(&vm->online_vcpus) <= vcpu_idx)
> > 		goto unlock;
> > 
> > 	pkvm_vcpu = kvm_get_vcpu(vm, vcpu_idx);
> 
> kvm_get_vcpu() makes use of an xarray to hold the vCPU pointers and this is
> really something which we cannot support at EL2 where, amongst other things,
> we do not have support for RCU. Consequently, we do need to keep our own
> mapping from the shad^H^H^H^Hhyp vCPU to the host vCPU.

Hmm, are there guardrails in place to prevent using "unsafe" fields from "struct kvm"
and "struct kvm_vcpu" at EL2?  If not, it seems like embedding the common structs
in the hyp/pkvm-specific structs is going to bite us in the rear at some point.

Mostly out of curiosity, I assume the EL2 restriction only applies to nVHE mode?

And waaaay off topic, has anyone explored adding macro magic to generate wrappers
to (un)marshal registers to parameters/returns for the hyp functions?  E.g. it'd
be neat if you could make the code look like this without having to add a wrapper
for every function:

static int handle___kvm_vcpu_run(unsigned long __host_vcpu)
{
	struct kvm_vcpu *host_vcpu = kern_hyp_va(__host_vcpu);
	int ret;

	if (unlikely(is_protected_kvm_enabled())) {
		struct pkvm_hyp_vcpu *hyp_vcpu;
		struct kvm *host_kvm;

		host_kvm = kern_hyp_va(host_vcpu->kvm);

		hyp_vcpu = pkvm_load_hyp_vcpu(host_kvm->arch.pkvm.handle,
					      host_vcpu->vcpu_idx);
		if (!hyp_vcpu)
			return -EINVAL;

		flush_hypervisor_state(hyp_vcpu);

		ret = __kvm_vcpu_run(&hyp_vcpu->vcpu);

		sync_hypervisor_state(hyp_vcpu);
		pkvm_put_hyp_vcpu(hyp_vcpu);
	} else {
		/* The host is fully trusted, run its vCPU directly. */
		ret = __kvm_vcpu_run(host_vcpu);
	}
	return ret;
}
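
For the record, a minimal wrapper-generator along those lines might look
like the below (completely untested; HYP_HANDLER_1 and the hcall_ prefix
are made-up names, while DECLARE_REG/cpu_reg are the existing helpers):

#define HYP_HANDLER_1(fn, type1)					\
	static void hcall_##fn(struct kvm_cpu_context *host_ctxt)	\
	{								\
		DECLARE_REG(type1, arg1, host_ctxt, 1);			\
									\
		cpu_reg(host_ctxt, 1) = handle_##fn(arg1);		\
	}

/* Generates hcall___kvm_vcpu_run() for the hypercall dispatch table. */
HYP_HANDLER_1(__kvm_vcpu_run, unsigned long);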