mbox series

[RFC,00/27] KVM/arm64: A stage 2 for the host

Message ID 20201117181607.1761516-1-qperret@google.com (mailing list archive)
Headers show
Series KVM/arm64: A stage 2 for the host | expand

Message

Quentin Perret Nov. 17, 2020, 6:15 p.m. UTC
Hi all,

This RFC series provides the infrastructure enabling to wrap the host
kernel with a stage 2 when running KVM in nVHE. This can be useful for
several use-cases, but the primary motivation is to (eventually) be able
to protect guest memory from the host kernel. More details about the
overall idea, design, and motivations can be found in Will's talk at KVM
Forum 2020 [1], or the pKVM talk at the Android uconf during LPC 2020
[2].

This series essentially gets us to a point where the 'VM' bit is set
in the host's HCR_EL2 when running in nVHE and if 'kvm-arm.protected'
is set on the kernel command line. The EL2 object directly handles
memory aborts from the host and manages entirely its stage 2 page table.

However, this series does _not_ provide any real user for this (yet)
and simply idmaps everything into the host stage 2 as RWX cacheable.
This is all about the infrastructure for now, so clearly not ready for
inclusion upstream yet (hence the RFC tag), but the bases are there and
I thought it'd be useful to start a discussion with the community early
as this is a rather intrusive change. So, here goes.

One of the interesting requirements that comes with the series is that
managing page-tables requires some sort of memory allocator at EL2 to
allocate, refcount and free memory pages. Clearly, none of that is
currently possible in nVHE, so a significant chunk of the series is
dedicated to solving that problem. The proposed EL2 memory allocator
mimics Linux' buddy system in principles, and re-uses some of the arm64
mm design choices. Specifically, it uses a vmemmap at EL2 which contains
a set of struct hyp_page entries to hold pages metadata. To support
this, I extended the EL2 object to make it manage its own stage 1
page-table in addition to host stage 2. This simplifies the hyp_vmemmap
creation and was going to be required anyway for the protected VM
use-case -- the threat model implies the host cannot be trusted after
boot, and it will thus be crucial to ensure it cannot map arbitrary code
at EL2.

The pool of memory pages used by the EL2 allocator are reserved by the
host early during boot (while it is still trusted) using the memblock
API, and are donated to EL2 during KVM init. The current assumption is
that the host reserves enough pages to allow the EL2 object to map all
of memory at page granularity for both hyp stage 1 and host stage 2,
plus some extra pages for device mappings.

On top of that the series introduces a few smaller features that are
needed along the way, but hopefully all of those are detailed properly
in the relevant commit messages.

And as a last note, I'd like to point out that there are at this point
trivial ways for the host to circumvent its stage 2 protection. It still
owns the guests stage 2 for example, meaning that nothing would prevent
a malicious host from using a guest as a proxy to access protected
memory, _yet_. This series lays the ground for future work to address
these things, which will clearly require a stage 2 over the host at some
point, so I just wanted to set the expectations right.

With all that in mind, the series is organized as follows:

 - patches 01-03 provide EL2 with some utility libraries needed for
   memory management and synchronization;

 - patches 04-09 mostly refactor smalls portions of the code to ease the
   EL2 memory management;

 - patches 10-17 add the actual EL2 memory management code, as well as
   the setup/bootstrap code on the KVM init path;

 - patches 18-24 refactor the existing stage 2 management code to make
   it re-usable from the EL2 object;

 - and finally patches 25-27 introduce the host stage 2 and the trap
   handling logic at EL2.

This work is based on the latest kvmarm/queue (which includes Marc's
host EL2 entry rework [3], as well as Will's guest vector refactoring
[4]) + David's PSCI proxying series [5].

And if you'd like a branch that has all the bits and pieces:

    https://android-kvm.googlesource.com/linux qperret/host-stage2

Boot-tested (host and guest) using qemu in VHE and nVHE, and on real
hardware on a AML-S905X-CC (Le Potato).

Thanks,
Quentin

[1] https://kvmforum2020.sched.com/event/eE24/virtualization-for-the-masses-exposing-kvm-on-android-will-deacon-google
[2] https://youtu.be/54q6RzS9BpQ?t=10859
[3] https://lore.kernel.org/kvmarm/20201109175923.445945-1-maz@kernel.org/
[4] https://lore.kernel.org/kvmarm/20201113113847.21619-1-will@kernel.org/
[5] https://lore.kernel.org/kvmarm/20201116204318.63987-1-dbrazdil@google.com/


Quentin Perret (24):
  KVM: arm64: Initialize kvm_nvhe_init_params early
  KVM: arm64: Avoid free_page() in page-table allocator
  KVM: arm64: Factor memory allocation out of pgtable.c
  KVM: arm64: Introduce a BSS section for use at Hyp
  KVM: arm64: Make kvm_call_hyp() a function call at Hyp
  KVM: arm64: Allow using kvm_nvhe_sym() in hyp code
  KVM: arm64: Introduce an early Hyp page allocator
  KVM: arm64: Stub CONFIG_DEBUG_LIST at Hyp
  KVM: arm64: Introduce a Hyp buddy page allocator
  KVM: arm64: Enable access to sanitized CPU features at EL2
  KVM: arm64: Factor out vector address calculation
  of/fdt: Introduce early_init_dt_add_memory_hyp()
  KVM: arm64: Prepare Hyp memory protection
  KVM: arm64: Elevate Hyp mappings creation at EL2
  KVM: arm64: Use kvm_arch for stage 2 pgtable
  KVM: arm64: Use kvm_arch in kvm_s2_mmu
  KVM: arm64: Set host stage 2 using kvm_nvhe_init_params
  KVM: arm64: Refactor kvm_arm_setup_stage2()
  KVM: arm64: Refactor __load_guest_stage2()
  KVM: arm64: Refactor __populate_fault_info()
  KVM: arm64: Make memcache anonymous in pgtable allocator
  KVM: arm64: Reserve memory for host stage 2
  KVM: arm64: Sort the memblock regions list
  KVM: arm64: Wrap the host with a stage 2

Will Deacon (3):
  arm64: lib: Annotate {clear,copy}_page() as position-independent
  KVM: arm64: Link position-independent string routines into .hyp.text
  KVM: arm64: Add standalone ticket spinlock implementation for use at
    hyp

 arch/arm64/include/asm/cpufeature.h           |   1 +
 arch/arm64/include/asm/hyp_image.h            |   4 +
 arch/arm64/include/asm/kvm_asm.h              |  13 +-
 arch/arm64/include/asm/kvm_cpufeature.h       |  19 ++
 arch/arm64/include/asm/kvm_host.h             |  17 +-
 arch/arm64/include/asm/kvm_hyp.h              |   8 +
 arch/arm64/include/asm/kvm_mmu.h              |  69 +++++-
 arch/arm64/include/asm/kvm_pgtable.h          |  41 +++-
 arch/arm64/include/asm/sections.h             |   1 +
 arch/arm64/kernel/asm-offsets.c               |   3 +
 arch/arm64/kernel/cpufeature.c                |  14 +-
 arch/arm64/kernel/image-vars.h                |  35 +++
 arch/arm64/kernel/vmlinux.lds.S               |   7 +
 arch/arm64/kvm/arm.c                          | 136 +++++++++--
 arch/arm64/kvm/hyp/Makefile                   |   2 +-
 arch/arm64/kvm/hyp/include/hyp/switch.h       |  36 +--
 arch/arm64/kvm/hyp/include/nvhe/early_alloc.h |  14 ++
 arch/arm64/kvm/hyp/include/nvhe/gfp.h         |  32 +++
 arch/arm64/kvm/hyp/include/nvhe/mem_protect.h |  33 +++
 arch/arm64/kvm/hyp/include/nvhe/memory.h      |  55 +++++
 arch/arm64/kvm/hyp/include/nvhe/mm.h          | 107 +++++++++
 arch/arm64/kvm/hyp/include/nvhe/spinlock.h    |  95 ++++++++
 arch/arm64/kvm/hyp/include/nvhe/util.h        |  25 ++
 arch/arm64/kvm/hyp/nvhe/Makefile              |   9 +-
 arch/arm64/kvm/hyp/nvhe/cache.S               |  13 ++
 arch/arm64/kvm/hyp/nvhe/cpufeature.c          |   8 +
 arch/arm64/kvm/hyp/nvhe/early_alloc.c         |  60 +++++
 arch/arm64/kvm/hyp/nvhe/hyp-init.S            |  39 ++++
 arch/arm64/kvm/hyp/nvhe/hyp-main.c            |  50 ++++
 arch/arm64/kvm/hyp/nvhe/hyp.lds.S             |   1 +
 arch/arm64/kvm/hyp/nvhe/mem_protect.c         | 191 ++++++++++++++++
 arch/arm64/kvm/hyp/nvhe/mm.c                  | 175 ++++++++++++++
 arch/arm64/kvm/hyp/nvhe/page_alloc.c          | 185 +++++++++++++++
 arch/arm64/kvm/hyp/nvhe/psci-relay.c          |   7 +-
 arch/arm64/kvm/hyp/nvhe/setup.c               | 214 ++++++++++++++++++
 arch/arm64/kvm/hyp/nvhe/stub.c                |  22 ++
 arch/arm64/kvm/hyp/nvhe/switch.c              |  12 +-
 arch/arm64/kvm/hyp/nvhe/tlb.c                 |   4 +-
 arch/arm64/kvm/hyp/pgtable.c                  |  98 ++++----
 arch/arm64/kvm/hyp/reserved_mem.c             |  95 ++++++++
 arch/arm64/kvm/mmu.c                          | 114 +++++++++-
 arch/arm64/kvm/reset.c                        |  42 +---
 arch/arm64/lib/clear_page.S                   |   4 +-
 arch/arm64/lib/copy_page.S                    |   4 +-
 arch/arm64/mm/init.c                          |   3 +
 drivers/of/fdt.c                              |   5 +
 46 files changed, 1971 insertions(+), 151 deletions(-)
 create mode 100644 arch/arm64/include/asm/kvm_cpufeature.h
 create mode 100644 arch/arm64/kvm/hyp/include/nvhe/early_alloc.h
 create mode 100644 arch/arm64/kvm/hyp/include/nvhe/gfp.h
 create mode 100644 arch/arm64/kvm/hyp/include/nvhe/mem_protect.h
 create mode 100644 arch/arm64/kvm/hyp/include/nvhe/memory.h
 create mode 100644 arch/arm64/kvm/hyp/include/nvhe/mm.h
 create mode 100644 arch/arm64/kvm/hyp/include/nvhe/spinlock.h
 create mode 100644 arch/arm64/kvm/hyp/include/nvhe/util.h
 create mode 100644 arch/arm64/kvm/hyp/nvhe/cache.S
 create mode 100644 arch/arm64/kvm/hyp/nvhe/cpufeature.c
 create mode 100644 arch/arm64/kvm/hyp/nvhe/early_alloc.c
 create mode 100644 arch/arm64/kvm/hyp/nvhe/mem_protect.c
 create mode 100644 arch/arm64/kvm/hyp/nvhe/mm.c
 create mode 100644 arch/arm64/kvm/hyp/nvhe/page_alloc.c
 create mode 100644 arch/arm64/kvm/hyp/nvhe/setup.c
 create mode 100644 arch/arm64/kvm/hyp/nvhe/stub.c
 create mode 100644 arch/arm64/kvm/hyp/reserved_mem.c