
[v1,00/24] Opt-in always-on nVHE hypervisor

Message ID 20201109113233.9012-1-dbrazdil@google.com

Message

David Brazdil Nov. 9, 2020, 11:32 a.m. UTC
As we progress towards being able to keep guest state private from the
host running the nVHE hypervisor, this series allows the hypervisor to
install itself on newly booted CPUs before the host is allowed to run
on them.

All functionality described below is opt-in, guarded by an early param
'kvm-arm.protected'. Future patches specific to the new "protected" mode
should be hidden behind the same param.
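
To make the opt-in concrete: the gist is an early param that flips a
single boolean before KVM initializes. A minimal sketch of the idea
(the flag and function names below are illustrative, not necessarily
what the patches use):

  #include <linux/init.h>     /* early_param() */
  #include <linux/string.h>   /* strtobool() */

  static bool kvm_arm_protected_mode;

  static int __init early_kvm_protected_cfg(char *buf)
  {
          /* Accepts "1"/"0"/"y"/"n"; the flag is untouched on error. */
          return strtobool(buf, &kvm_arm_protected_mode);
  }
  early_param("kvm-arm.protected", early_kvm_protected_cfg);

  /*
   * KVM init then enables the new behaviour only when the flag is set,
   * so existing systems see no change by default.
   */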

The hypervisor starts trapping host SMCs and intercepting the host's PSCI
CPU_ON/OFF/SUSPEND calls. It replaces the host's entry point with its
own, initializes the EL2 state of the new CPU, and installs the nVHE hyp
vector before ERETing to the host's entry point.
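
Roughly, the CPU_ON path amounts to remembering where the host wanted
the new CPU to start and substituting the hypervisor's own entry point
in the call that actually reaches EL3. A hedged sketch of the idea (the
helpers marked as hypothetical are placeholders, not the real hyp code):

  /* Sketch: proxying the host's PSCI CPU_ON through EL2. */
  struct boot_args {
          unsigned long pc;       /* entry point the host asked for */
          unsigned long r0;       /* host's context_id argument */
  };
  static struct boot_args cpu_on_args[NR_CPUS];

  static int handle_cpu_on(u64 mpidr, u64 host_entry, u64 context_id)
  {
          int cpu = find_cpu_by_mpidr(mpidr); /* hypothetical, see below */

          if (cpu < 0)
                  return PSCI_RET_INVALID_PARAMS;

          /* Remember where the host wanted the new CPU to resume... */
          cpu_on_args[cpu].pc = host_entry;
          cpu_on_args[cpu].r0 = context_id;

          /*
           * ...but ask EL3 to start it at the hypervisor's entry point
           * instead. The CPU then comes up at EL2, initializes its EL2
           * state and vectors, and only afterwards ERETs to
           * cpu_on_args[cpu].pc at EL1.
           */
          return psci_forward_to_el3(PSCI_0_2_FN64_CPU_ON, mpidr,
                                     hyp_entry_phys, cpu); /* hypothetical */
  }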

The kernel checks new cores' features against the finalized system
capabilities. To avoid the need to move this code/data to EL2, the
implementation only allows booting cores that were online at the time of
KVM initialization and had therefore already been checked.
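
In other words, a booting core is accepted only if its MPIDR was
recorded when KVM initialized; anything else is rejected before it ever
runs the host. A possible shape of that lookup against a hyp-side copy
of cpu_logical_map (again just a sketch with made-up names):

  /* Sketch: hyp-side copy of the logical map, filled at KVM init. */
  static u64 hyp_cpu_logical_map[NR_CPUS] = {
          [0 ... NR_CPUS - 1] = INVALID_HWID,
  };

  static int find_cpu_by_mpidr(u64 mpidr)
  {
          int cpu;

          mpidr &= MPIDR_HWID_BITMASK;
          for (cpu = 0; cpu < NR_CPUS; cpu++) {
                  if (hyp_cpu_logical_map[cpu] == mpidr)
                          return cpu;
          }
          /* Unknown core: it was not online when KVM initialized. */
          return -1;
  }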

Other PSCI SMCs are forwarded to EL3, though only the known set of SMCs
implemented in the kernel is allowed. Non-PSCI SMCs are also forwarded
to EL3. Future changes will need to ensure the safety of all SMCs with
respect to private guests.
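
The host SMC handler then ends up as a filter over function IDs, along
these lines (a sketch; the helpers are placeholders and the set of
forwarded IDs shown is not exhaustive):

  /* Sketch: dispatching SMCs trapped from the host at EL2. */
  static void handle_host_smc(struct kvm_cpu_context *host_ctxt)
  {
          u64 func_id = host_ctxt->regs.regs[0];

          switch (func_id) {
          case PSCI_0_2_FN64_CPU_ON:
          case PSCI_0_2_FN_CPU_OFF:
          case PSCI_0_2_FN64_CPU_SUSPEND:
                  /* Intercepted: the hyp issues its own call to EL3. */
                  handle_psci_cpu_call(host_ctxt);    /* placeholder */
                  break;
          case PSCI_0_2_FN_PSCI_VERSION:
          case PSCI_0_2_FN64_AFFINITY_INFO:
          case PSCI_0_2_FN_SYSTEM_OFF:
          case PSCI_0_2_FN_SYSTEM_RESET:
                  /* Known-safe PSCI calls: passed through unchanged. */
                  forward_host_smc(host_ctxt);        /* placeholder */
                  break;
          default:
                  if (is_psci_fn(func_id))            /* placeholder */
                          /* Unrecognized PSCI IDs are not forwarded. */
                          host_ctxt->regs.regs[0] = PSCI_RET_NOT_SUPPORTED;
                  else
                          /* Non-PSCI SMCs are forwarded, for now. */
                          forward_host_smc(host_ctxt);
                  break;
          }
  }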

The host is still allowed to reset EL2 back to the stub vector, e.g. for
hibernation or kexec, but will not disable nVHE when there are no VMs.

Tested on Rock Pi 4b, based on 5.10-rc3.

Changes since RFC:
  * add early param to make features opt-in
  * simplify CPU_ON/SUSPEND implementation
  * replace spinlocks with CAS atomics (see the sketch after this list)
  * make cpu_logical_map ro_after_init
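
The CAS change above boils down to claiming per-CPU boot state with a
single compare-and-swap instead of holding a lock across the operation.
As an illustration (not the actual patch code):

  /* Illustration: claim a CPU's boot slot with compare-and-swap. */
  enum cpu_state { CPU_STATE_IDLE, CPU_STATE_PENDING };
  static atomic_t cpu_boot_state[NR_CPUS];

  static bool try_claim_cpu(int cpu)
  {
          /* Exactly one of any number of racing callers succeeds. */
          return atomic_cmpxchg_acquire(&cpu_boot_state[cpu],
                                        CPU_STATE_IDLE,
                                        CPU_STATE_PENDING) == CPU_STATE_IDLE;
  }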

  -David

David Brazdil (24):
  psci: Accessor for configured PSCI version
  psci: Accessor for configured PSCI function IDs
  arm64: Move MAIR_EL1_SET to asm/memory.h
  kvm: arm64: Initialize MAIR_EL2 using a constant
  kvm: arm64: Add .hyp.data..ro_after_init ELF section
  kvm: arm64: Support per_cpu_ptr in nVHE hyp code
  kvm: arm64: Create nVHE copy of cpu_logical_map
  kvm: arm64: Move hyp-init params to a per-CPU struct
  kvm: arm64: Refactor handle_trap to use a switch
  kvm: arm64: Extract parts of el2_setup into a macro
  kvm: arm64: Add SMC handler in nVHE EL2
  kvm: arm64: Extract __do_hyp_init into a helper function
  kvm: arm64: Add CPU entry point in nVHE hyp
  kvm: arm64: Add function to enter host from KVM nVHE hyp code
  kvm: arm64: Bootstrap PSCI SMC handler in nVHE EL2
  kvm: arm64: Add offset for hyp VA <-> PA conversion
  kvm: arm64: Add __hyp_pa_symbol helper macro
  kvm: arm64: Forward safe PSCI SMCs coming from host
  kvm: arm64: Intercept host's PSCI_CPU_ON SMCs
  kvm: arm64: Intercept host's CPU_SUSPEND PSCI SMCs
  kvm: arm64: Add kvm-arm.protected early kernel parameter
  kvm: arm64: Keep nVHE EL2 vector installed
  kvm: arm64: Trap host SMCs in protected mode.
  kvm: arm64: Fix EL2 mode availability checks

 arch/arm64/include/asm/kvm_arm.h   |   1 +
 arch/arm64/include/asm/kvm_asm.h   | 136 ++++++++++++++
 arch/arm64/include/asm/kvm_hyp.h   |   9 +
 arch/arm64/include/asm/memory.h    |  13 ++
 arch/arm64/include/asm/percpu.h    |   6 +
 arch/arm64/include/asm/sections.h  |   1 +
 arch/arm64/include/asm/virt.h      |  26 +++
 arch/arm64/kernel/asm-offsets.c    |   5 +
 arch/arm64/kernel/head.S           | 140 ++------------
 arch/arm64/kernel/image-vars.h     |   7 +
 arch/arm64/kernel/vmlinux.lds.S    |  10 +
 arch/arm64/kvm/arm.c               | 157 ++++++++++++++--
 arch/arm64/kvm/hyp/nvhe/Makefile   |   3 +-
 arch/arm64/kvm/hyp/nvhe/host.S     |   9 +
 arch/arm64/kvm/hyp/nvhe/hyp-init.S |  84 +++++++--
 arch/arm64/kvm/hyp/nvhe/hyp-main.c |  56 +++++-
 arch/arm64/kvm/hyp/nvhe/hyp.lds.S  |   1 +
 arch/arm64/kvm/hyp/nvhe/percpu.c   |  38 ++++
 arch/arm64/kvm/hyp/nvhe/psci.c     | 291 +++++++++++++++++++++++++++++
 arch/arm64/kvm/hyp/nvhe/switch.c   |   5 +-
 arch/arm64/mm/proc.S               |  13 --
 drivers/firmware/psci/psci.c       |  25 ++-
 include/linux/psci.h               |  18 ++
 include/uapi/linux/psci.h          |   1 +
 24 files changed, 865 insertions(+), 190 deletions(-)
 create mode 100644 arch/arm64/kvm/hyp/nvhe/percpu.c
 create mode 100644 arch/arm64/kvm/hyp/nvhe/psci.c

Comments

Christoph Hellwig Nov. 10, 2020, 10:15 a.m. UTC | #1
On Mon, Nov 09, 2020 at 11:32:09AM +0000, David Brazdil wrote:
> As we progress towards being able to keep guest state private to the
> host running nVHE hypervisor, this series allows the hypervisor to
> install itself on newly booted CPUs before the host is allowed to run
> on them.

Why?  I thought we were trying to kill nVHE off now that newer CPUs
provide the saner virtualization extensions?
Marc Zyngier Nov. 10, 2020, 11:18 a.m. UTC | #2
On 2020-11-10 10:15, Christoph Hellwig wrote:
> On Mon, Nov 09, 2020 at 11:32:09AM +0000, David Brazdil wrote:
>> As we progress towards being able to keep guest state private to the
>> host running nVHE hypervisor, this series allows the hypervisor to
>> install itself on newly booted CPUs before the host is allowed to run
>> on them.
> 
> Why?  I thought we were trying to kill nVHE off now that newer CPUs
> provide the saner virtualization extensions?

We can't kill nVHE at all, because that is the only game in town.
You can't even buy a decent machine with VHE, no matter how much money
you put on the table.

nVHE is here for the foreseeable future, and we even use its misfeatures
to our advantage in order to offer confidential VMs. See Will's
presentation at KVM forum a couple of weeks ago for the gory details.

Thanks,

         M.
Marc Zyngier Nov. 11, 2020, 2:32 p.m. UTC | #3
Hi David,

On 2020-11-09 11:32, David Brazdil wrote:
> As we progress towards being able to keep guest state private to the
> host running nVHE hypervisor, this series allows the hypervisor to
> install itself on newly booted CPUs before the host is allowed to run
> on them.
> 
> All functionality described below is opt-in, guarded by an early param
> 'kvm-arm.protected'. Future patches specific to the new "protected" 
> mode
> should be hidden behind the same param.
> 
> The hypervisor starts trapping host SMCs and intercepting host's PSCI
> CPU_ON/OFF/SUSPEND calls. It replaces the host's entry point with its
> own, initializes the EL2 state of the new CPU and installs the nVHE hyp
> vector before ERETing to the host's entry point.
> 
> The kernel checks new cores' features against the finalized system
> capabilities. To avoid the need to move this code/data to EL2, the
> implementation only allows to boot cores that were online at the time 
> of
> KVM initialization and therefore had been checked already.
> 
> Other PSCI SMCs are forwarded to EL3, though only the known set of SMCs
> implemented in the kernel is allowed. Non-PSCI SMCs are also forwarded
> to EL3. Future changes will need to ensure the safety of all SMCs wrt.
> private guests.
> 
> The host is still allowed to reset EL2 back to the stub vector, eg. for
> hibernation or kexec, but will not disable nVHE when there are no VMs.
> 
> Tested on Rock Pi 4b, based on 5.10-rc3.

I think I've gone through most of the patches. When you respin this
series, you may want to do so on top of my host EL2 entry rework [1],
which changes a few things you currently rely on.

If anything in there doesn't work for you, please let me know.

Thanks,

         M.

[1] https://lore.kernel.org/kvm/20201109175923.445945-1-maz@kernel.org/
Janne Karhunen Jan. 19, 2021, 1:17 p.m. UTC | #4
On Tue, Nov 10, 2020 at 1:19 PM Marc Zyngier <maz@kernel.org> wrote:

> > Why?  I thought we were trying to kill nVHE off now that newer CPUs
> > provide the saner virtualization extensions?
>
> We can't kill nVHE at all, because that is the only game in town.
> You can't even buy a decent machine with VHE, no matter how much money
> you put on the table.

As I mentioned earlier, we did this type of nVHE hypervisor and the
proof of concept is here:
https://github.com/jkrh/kvms

See the README. It runs successfully on multiple pieces of arm64
hardware and provides a tiny QEMU-based development environment via
the makefiles for the QEMU 'max' CPU. The code is rough and the number
of man-hours put into it is not sky-high, but it does run. I'll add an
updated kernel patch to the patches/ dir for one of the later kernels,
hopefully next week; up to now we have only supported kernels between
4.9 and 5.6, as that is what our development hardware runs. It requires
a handful of hooks in the KVM code, but the actual KVM calls are just
rerouted back to the kernel symbols. This way the hypervisor itself can
be kept very tiny.

The s2 page tables are fully owned by the hyp, and the guests are
unmapped from host memory when the option is enabled (we call it host
blinding). Multiple VMs can be run without pinning them into memory.
It also provides a tiny out-of-tree driver prototype stub to protect
the critical sections of kernel memory beyond the kernel's own reach.
There are still holes in the implementation, such as the virtio-mapback
handling via whitelisting and paging integrity checks, and many things
are not quite all the way there yet. One step at a time.
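
For readers unfamiliar with the term, "host blinding" here essentially
means walking the guest's memory and removing it from the host's
stage-2 tables. Conceptually something like the following (purely
illustrative, not the kvms code; all names are made up):

  /* Illustration only: hide a range of guest memory from the host by
   * unmapping it from the host's stage-2 translation tables. */
  static int blind_guest_range(struct s2_mmu *host_mmu, /* hypothetical */
                               phys_addr_t base, size_t size)
  {
          phys_addr_t addr;
          int ret;

          for (addr = base; addr < base + size; addr += PAGE_SIZE) {
                  ret = s2_unmap_page(host_mmu, addr); /* hypothetical */
                  if (ret)
                          return ret;
          }

          /* Stale host translations must be invalidated before the
           * pages are handed to the guest. */
          s2_flush_tlb(host_mmu);                      /* hypothetical */
          return 0;
  }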


--
Janne