mbox series

[RFC,00/27] KVM Address Space Isolation

Message ID 1557758315-12667-1-git-send-email-alexandre.chartre@oracle.com (mailing list archive)
Headers show
Series KVM Address Space Isolation | expand

Message

Alexandre Chartre May 13, 2019, 2:38 p.m. UTC
Hi,

This series aims to introduce the concept of KVM address space isolation.
This is done as part of the upstream community effort to have exploit
mitigations for CPU info-leaks vulnerabilities such as L1TF. 

These patches are based on an original patches from Liran Alon, completed
with additional patches to effectively create KVM address space different
from the full kernel address space.

The current code is just an early POC, and it is not fully stable at the
moment (unfortunately you can expect crashes/hangs, see the "Issues"
section below). However I would like to start a discussion get feedback
and opinions about this approach.

Context
=======

The most naive approach to handle L1TF SMT-variant exploit is to just disable
hyper-threading. But that is not practical for public cloud providers. As a
second next best alternative, there is an approach to combine coscheduling
together with flushing L1D cache on every VMEntry. By coscheduling I refer
to some mechanism which on every VMExit from guest, kicks all sibling
hyperthreads from guest aswell.

However, this approach have some open issues:

1. Kicking all sibling hyperthreads for every VMExit have significant
   performance hit for some compute shapes (e.g. Emulated and PV).

2. It assumes only CPU core resource which could be leaked by some
   vulnerability is L1D cache. But future vulnerabilities may also be able
   to leak other CPU core resources. Therefore, we would prefer to have a
   mechanism which prevents these resources to be able to be loaded with
   sensitive data to begin with.

To better address (2), upstream community has discussed some mechanisms
related to reducing data that is mapped on kernel virtual address space.
Specifically:

a. XPFO: Removes from physmap pages that currently should only be accessed
   by userspace.

b. Process-local memory allocations: Allows having a memory area in kernel
   virtual address space that maps different content per-process. Then,
   allocations made on this memory area can be hidden from other tasks in
   the system running in kernel space. Most obvious use it to allocate
   there per-vCPU and per-VM KVM structures.

However, both (a)+(b) work in a black-list approach (where we decide which
data is considered dangerous and remove it from kernel virtual address
space) and don't address performance hit described at (1).


Proposal
========

To handle both these points, this series introduce the mechanism of KVM
address space isolation. Note that this mechanism completes (a)+(b) and
don't contradict. In case this mechanism is also applied, (a)+(b) should
still be applied to the full virtual address space as a defence-in-depth).

The idea is that most of KVM #VMExit handlers code will run in a special
KVM isolated address space which maps only KVM required code and per-VM
information. Only once KVM needs to architectually access other (sensitive)
data, it will switch from KVM isolated address space to full standard
host address space. At this point, KVM will also need to kick all sibling
hyperthreads to get out of guest (note that kicking all sibling hyperthreads
is not implemented in this serie).

Basically, we will have the following flow:

  - qemu issues KVM_RUN ioctl
  - KVM handles the ioctl and calls vcpu_run():
    . KVM switches from the kernel address to the KVM address space
    . KVM transfers control to VM (VMLAUNCH/VMRESUME)
    . VM returns to KVM
    . KVM handles VM-Exit:
      . if handling need full kernel then switch to kernel address space
      . else continues with KVM address space
    . KVM loops in vcpu_run() or return
  - KVM_RUN ioctl returns

So, the KVM_RUN core function will mainly be executed using the KVM address
space. The handling of a VM-Exit can require access to the kernel space
and, in that case, we will switch back to the kernel address space.

The high-level idea of how this is implemented is to create a separate
struct_mm for KVM such that a vCPU thread will switch active_mm between
it's original active_mm and kvm_mm when needed as described above. The
idea is very similar to how kernel switches between task active_mm and
efi_mm when calling EFI Runtime Services.

Note that because we use the kernel TLB Manager to switch between kvm_mm
and host_mm, we will effectively use TLB with PCID if enabled to make
these switches fast. As all of this is managed internally in TLB Manager's
switch_mm().


Patches
=======

The proposed patches implement the necessary framework for creating kvm_mm
and switching between host_mm and kvm_mm at appropriate times. They also
provide functions for populating the KVM address space, and implement an
actual KVM address space much smaller than the full kernel address space.

- 01-08: add framework for switching between the kernel address space and
  the KVM address space at appropriate times. Note that these patches do
  not create or switch the address space yet. Address space switching is
  implemented in patch 25.

- 09-18: add a framework for populating and managing the KVM page table;
  this also include mechanisms to ensure changes are effectively limited
  to the KVM page table and no change is mistakenly propagated to the
  kernel page table.

- 19-23: populate the KVM page table.

- 24: add page fault handler to handle and report missing mappings when
  running with the KVM address space. This is based on an original idea
  from Paul Turner.

- 25: implement the actual switch between the kernel address space and
  the KVM address space.

- 26-27: populate the KVM page table with more data.


If a fault occurs while running with the KVM address space, it will be
reported on the console like this:

[ 4840.727476] KVM isolation: page fault #0 (0) at fast_page_fault+0x13e/0x3e0 [kvm] on ffffea00005331f0 (0xffffea00005331f0)

If the KVM page_fault_stack module parameter is set to non-zero (that's
the default) then the stack of the fault will also be reported:

[ 5025.630374] KVM isolation: page fault #0 (0) at fast_page_fault+0x100/0x3e0 [kvm] on ffff88003c718000 (0xffff88003c718000)
[ 5025.631918] Call Trace:
[ 5025.632782]  tdp_page_fault+0xec/0x260 [kvm]
[ 5025.633395]  kvm_mmu_page_fault+0x74/0x5f0 [kvm]
[ 5025.644467]  handle_ept_violation+0xc3/0x1a0 [kvm_intel]
[ 5025.645218]  vmx_handle_exit+0xb9/0x600 [kvm_intel]
[ 5025.645917]  vcpu_enter_guest+0xb88/0x1580 [kvm]
[ 5025.646577]  kvm_arch_vcpu_ioctl_run+0x403/0x610 [kvm]
[ 5025.647313]  kvm_vcpu_ioctl+0x3d5/0x650 [kvm]
[ 5025.648538]  do_vfs_ioctl+0xaa/0x602
[ 5025.650502]  SyS_ioctl+0x79/0x84
[ 5025.650966]  do_syscall_64+0x79/0x1ae
[ 5025.651487]  entry_SYSCALL_64_after_hwframe+0x151/0x0
[ 5025.652200] RIP: 0033:0x7f74a2f1d997
[ 5025.652710] RSP: 002b:00007f749f3ec998 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[ 5025.653769] RAX: ffffffffffffffda RBX: 0000562caa83e110 RCX: 00007f74a2f1d997
[ 5025.654769] RDX: 0000000000000000 RSI: 000000000000ae80 RDI: 000000000000000c
[ 5025.655769] RBP: 0000562caa83e1b3 R08: 0000562ca9b6fa50 R09: 0000000000000006
[ 5025.656766] R10: 0000000000000000 R11: 0000000000000246 R12: 0000562ca9b552c0
[ 5025.657764] R13: 0000000000801000 R14: 00007f74a59d4000 R15: 0000562caa83e110

This allows to find out what is missing in the KVM address space.


Issues
======

Limited tests have been done so far, and mostly with an empty single-vcpu
VM (i.e. qemu-system-i386 -enable-kvm -smp 1). Single-vcpu VM is able to
start and run a full OS but the system will eventually crash/hang at some
point. Multiple vcpus will crash/hang much faster.


Performance Impact
==================

As this is a RFC, the effective performance impact hasn't been measured
yet. Current patches introduce two additional context switches (kernel to
KVM, and KVM to kernel) on each KVM_RUN ioctl. Also additional context
switches are added if a VM-Exit has to be handled using the full kernel
address space.

I expect that the KVM address space can eventually be expanded to include
the ioctl syscall entries. By doing so, and also adding the KVM page table
to the process userland page table (which should be safe to do because the
KVM address space doesn't have any secret), we could potentially handle the
KVM ioctl without having to switch to the kernel pagetable (thus effectively
eliminating KPTI for KVM). Then the only overhead would be if a VM-Exit has
to be handled using the full kernel address space.


Thanks,

alex.

---

Alexandre Chartre (18):
  kvm/isolation: function to track buffers allocated for the KVM page
    table
  kvm/isolation: add KVM page table entry free functions
  kvm/isolation: add KVM page table entry offset functions
  kvm/isolation: add KVM page table entry allocation functions
  kvm/isolation: add KVM page table entry set functions
  kvm/isolation: functions to copy page table entries for a VA range
  kvm/isolation: keep track of VA range mapped in KVM address space
  kvm/isolation: functions to clear page table entries for a VA range
  kvm/isolation: improve mapping copy when mapping is already present
  kvm/isolation: function to copy page table entries for percpu buffer
  kvm/isolation: initialize the KVM page table with core mappings
  kvm/isolation: initialize the KVM page table with vmx specific data
  kvm/isolation: initialize the KVM page table with vmx VM data
  kvm/isolation: initialize the KVM page table with vmx cpu data
  kvm/isolation: initialize the KVM page table with the vcpu tasks
  kvm/isolation: KVM page fault handler
  kvm/isolation: initialize the KVM page table with KVM memslots
  kvm/isolation: initialize the KVM page table with KVM buses

Liran Alon (9):
  kernel: Export memory-management symbols required for KVM address
    space isolation
  KVM: x86: Introduce address_space_isolation module parameter
  KVM: x86: Introduce KVM separate virtual address space
  KVM: x86: Switch to KVM address space on entry to guest
  KVM: x86: Add handler to exit kvm isolation
  KVM: x86: Exit KVM isolation on IRQ entry
  KVM: x86: Switch to host address space when may access sensitive data
  KVM: x86: Optimize branches which checks if address space isolation
    enabled
  kvm/isolation: implement actual KVM isolation enter/exit

 arch/x86/include/asm/apic.h    |    4 +-
 arch/x86/include/asm/hardirq.h |   10 +
 arch/x86/include/asm/irq.h     |    1 +
 arch/x86/kernel/cpu/common.c   |    2 +
 arch/x86/kernel/dumpstack.c    |    1 +
 arch/x86/kernel/irq.c          |   11 +
 arch/x86/kernel/ldt.c          |    1 +
 arch/x86/kernel/smp.c          |    2 +-
 arch/x86/kvm/Makefile          |    2 +-
 arch/x86/kvm/isolation.c       | 1773 ++++++++++++++++++++++++++++++++++++++++
 arch/x86/kvm/isolation.h       |   40 +
 arch/x86/kvm/mmu.c             |    3 +-
 arch/x86/kvm/vmx/vmx.c         |  123 +++-
 arch/x86/kvm/x86.c             |   44 +-
 arch/x86/mm/fault.c            |   12 +
 arch/x86/mm/tlb.c              |    4 +-
 arch/x86/platform/uv/tlb_uv.c  |    2 +-
 include/linux/kvm_host.h       |    2 +
 include/linux/percpu.h         |    2 +
 include/linux/sched.h          |    6 +
 mm/memory.c                    |    5 +
 mm/percpu.c                    |    6 +-
 virt/kvm/arm/arm.c             |    4 +
 virt/kvm/kvm_main.c            |    4 +-
 24 files changed, 2051 insertions(+), 13 deletions(-)
 create mode 100644 arch/x86/kvm/isolation.c
 create mode 100644 arch/x86/kvm/isolation.h

Comments

Liran Alon May 13, 2019, 4:42 p.m. UTC | #1
> On 13 May 2019, at 17:38, Alexandre Chartre <alexandre.chartre@oracle.com> wrote:
> 
> Hi,
> 
> This series aims to introduce the concept of KVM address space isolation.
> This is done as part of the upstream community effort to have exploit
> mitigations for CPU info-leaks vulnerabilities such as L1TF. 
> 
> These patches are based on an original patches from Liran Alon, completed
> with additional patches to effectively create KVM address space different
> from the full kernel address space.

Great job for pushing this forward! Thank you!

> 
> The current code is just an early POC, and it is not fully stable at the
> moment (unfortunately you can expect crashes/hangs, see the "Issues"
> section below). However I would like to start a discussion get feedback
> and opinions about this approach.
> 
> Context
> =======
> 
> The most naive approach to handle L1TF SMT-variant exploit is to just disable
> hyper-threading. But that is not practical for public cloud providers. As a
> second next best alternative, there is an approach to combine coscheduling
> together with flushing L1D cache on every VMEntry. By coscheduling I refer
> to some mechanism which on every VMExit from guest, kicks all sibling
> hyperthreads from guest aswell.
> 
> However, this approach have some open issues:
> 
> 1. Kicking all sibling hyperthreads for every VMExit have significant
>   performance hit for some compute shapes (e.g. Emulated and PV).
> 
> 2. It assumes only CPU core resource which could be leaked by some
>   vulnerability is L1D cache. But future vulnerabilities may also be able
>   to leak other CPU core resources. Therefore, we would prefer to have a
>   mechanism which prevents these resources to be able to be loaded with
>   sensitive data to begin with.
> 
> To better address (2), upstream community has discussed some mechanisms
> related to reducing data that is mapped on kernel virtual address space.
> Specifically:
> 
> a. XPFO: Removes from physmap pages that currently should only be accessed
>   by userspace.
> 
> b. Process-local memory allocations: Allows having a memory area in kernel
>   virtual address space that maps different content per-process. Then,
>   allocations made on this memory area can be hidden from other tasks in
>   the system running in kernel space. Most obvious use it to allocate
>   there per-vCPU and per-VM KVM structures.
> 
> However, both (a)+(b) work in a black-list approach (where we decide which
> data is considered dangerous and remove it from kernel virtual address
> space) and don't address performance hit described at (1).

+Cc Stefan from AWS and Kaya from Google.
(I have sent them my original patch series for review and discuss with them about this subject)
Stefan: Do you know what is Julian's current email address to Cc him as-well?

> 
> 
> Proposal
> ========
> 
> To handle both these points, this series introduce the mechanism of KVM
> address space isolation. Note that this mechanism completes (a)+(b) and
> don't contradict. In case this mechanism is also applied, (a)+(b) should
> still be applied to the full virtual address space as a defence-in-depth).
> 
> The idea is that most of KVM #VMExit handlers code will run in a special
> KVM isolated address space which maps only KVM required code and per-VM
> information. Only once KVM needs to architectually access other (sensitive)
> data, it will switch from KVM isolated address space to full standard
> host address space. At this point, KVM will also need to kick all sibling
> hyperthreads to get out of guest (note that kicking all sibling hyperthreads
> is not implemented in this serie).
> 
> Basically, we will have the following flow:
> 
>  - qemu issues KVM_RUN ioctl
>  - KVM handles the ioctl and calls vcpu_run():
>    . KVM switches from the kernel address to the KVM address space
>    . KVM transfers control to VM (VMLAUNCH/VMRESUME)
>    . VM returns to KVM
>    . KVM handles VM-Exit:
>      . if handling need full kernel then switch to kernel address space

*AND* kick sibling hyperthreads before switching to that address space.
I think it’s important to emphasise that one of the main points of this KVM address space isolation mechanism is to minimise number of times we require to kick sibling hyperthreads outside of guest. By hopefully having the vast majority of VMExits handled on KVM isolated address space.

>      . else continues with KVM address space
>    . KVM loops in vcpu_run() or return
>  - KVM_RUN ioctl returns
> 
> So, the KVM_RUN core function will mainly be executed using the KVM address
> space. The handling of a VM-Exit can require access to the kernel space
> and, in that case, we will switch back to the kernel address space.
> 
> The high-level idea of how this is implemented is to create a separate
> struct_mm for KVM such that a vCPU thread will switch active_mm between
> it's original active_mm and kvm_mm when needed as described above. The
> idea is very similar to how kernel switches between task active_mm and
> efi_mm when calling EFI Runtime Services.
> 
> Note that because we use the kernel TLB Manager to switch between kvm_mm
> and host_mm, we will effectively use TLB with PCID if enabled to make
> these switches fast. As all of this is managed internally in TLB Manager's
> switch_mm().
> 
> 
> Patches
> =======
> 
> The proposed patches implement the necessary framework for creating kvm_mm
> and switching between host_mm and kvm_mm at appropriate times. They also
> provide functions for populating the KVM address space, and implement an
> actual KVM address space much smaller than the full kernel address space.
> 
> - 01-08: add framework for switching between the kernel address space and
>  the KVM address space at appropriate times. Note that these patches do
>  not create or switch the address space yet. Address space switching is
>  implemented in patch 25.
> 
> - 09-18: add a framework for populating and managing the KVM page table;
>  this also include mechanisms to ensure changes are effectively limited
>  to the KVM page table and no change is mistakenly propagated to the
>  kernel page table.
> 
> - 19-23: populate the KVM page table.
> 
> - 24: add page fault handler to handle and report missing mappings when
>  running with the KVM address space. This is based on an original idea
>  from Paul Turner.
> 
> - 25: implement the actual switch between the kernel address space and
>  the KVM address space.
> 
> - 26-27: populate the KVM page table with more data.
> 
> 
> If a fault occurs while running with the KVM address space, it will be
> reported on the console like this:
> 
> [ 4840.727476] KVM isolation: page fault #0 (0) at fast_page_fault+0x13e/0x3e0 [kvm] on ffffea00005331f0 (0xffffea00005331f0)
> 
> If the KVM page_fault_stack module parameter is set to non-zero (that's
> the default) then the stack of the fault will also be reported:
> 
> [ 5025.630374] KVM isolation: page fault #0 (0) at fast_page_fault+0x100/0x3e0 [kvm] on ffff88003c718000 (0xffff88003c718000)
> [ 5025.631918] Call Trace:
> [ 5025.632782]  tdp_page_fault+0xec/0x260 [kvm]
> [ 5025.633395]  kvm_mmu_page_fault+0x74/0x5f0 [kvm]
> [ 5025.644467]  handle_ept_violation+0xc3/0x1a0 [kvm_intel]
> [ 5025.645218]  vmx_handle_exit+0xb9/0x600 [kvm_intel]
> [ 5025.645917]  vcpu_enter_guest+0xb88/0x1580 [kvm]
> [ 5025.646577]  kvm_arch_vcpu_ioctl_run+0x403/0x610 [kvm]
> [ 5025.647313]  kvm_vcpu_ioctl+0x3d5/0x650 [kvm]
> [ 5025.648538]  do_vfs_ioctl+0xaa/0x602
> [ 5025.650502]  SyS_ioctl+0x79/0x84
> [ 5025.650966]  do_syscall_64+0x79/0x1ae
> [ 5025.651487]  entry_SYSCALL_64_after_hwframe+0x151/0x0
> [ 5025.652200] RIP: 0033:0x7f74a2f1d997
> [ 5025.652710] RSP: 002b:00007f749f3ec998 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
> [ 5025.653769] RAX: ffffffffffffffda RBX: 0000562caa83e110 RCX: 00007f74a2f1d997
> [ 5025.654769] RDX: 0000000000000000 RSI: 000000000000ae80 RDI: 000000000000000c
> [ 5025.655769] RBP: 0000562caa83e1b3 R08: 0000562ca9b6fa50 R09: 0000000000000006
> [ 5025.656766] R10: 0000000000000000 R11: 0000000000000246 R12: 0000562ca9b552c0
> [ 5025.657764] R13: 0000000000801000 R14: 00007f74a59d4000 R15: 0000562caa83e110
> 
> This allows to find out what is missing in the KVM address space.
> 
> 
> Issues
> ======
> 
> Limited tests have been done so far, and mostly with an empty single-vcpu
> VM (i.e. qemu-system-i386 -enable-kvm -smp 1). Single-vcpu VM is able to
> start and run a full OS but the system will eventually crash/hang at some
> point. Multiple vcpus will crash/hang much faster.
> 
> 
> Performance Impact
> ==================
> 
> As this is a RFC, the effective performance impact hasn't been measured
> yet.

> Current patches introduce two additional context switches (kernel to
> KVM, and KVM to kernel) on each KVM_RUN ioctl.

I have never considered this to be an issue.
By design of this patch series, I treated exits to userspace VMM as slow-path that should not be important to optimise.

> Also additional context
> switches are added if a VM-Exit has to be handled using the full kernel
> address space.

This is by design as well.
The KVM address space should contain enough data that is not-sensitive to be leaked by guest from one hand while still be able to handle the vast majority of exits without exiting the address space on the other hand. If we cannot achieve such a KVM isolated address space, the PoC of the series failed.

> 
> I expect that the KVM address space can eventually be expanded to include
> the ioctl syscall entries.

As mentioned above, I do not see a strong reason to do so. The ioctl syscalls are considered slow-path that shouldn’t be important to optimise.

> By doing so, and also adding the KVM page table
> to the process userland page table (which should be safe to do because the
> KVM address space doesn't have any secret), we could potentially handle the
> KVM ioctl without having to switch to the kernel pagetable (thus effectively
> eliminating KPTI for KVM).

From above reasons, I don’t think this is important.

> Then the only overhead would be if a VM-Exit has
> to be handled using the full kernel address space.

This was always by design the only overhead. And a major one. Because not only it switches address space but also it will require in the future to kick all sibling hyperthreads outside of guest. The purpose of the series is to created an address space such that most VMExits won’t require to do such kick to the sibling hyperthreads.

-Liran

> 
> 
> Thanks,
> 
> alex.
> 
> ---
> 
> Alexandre Chartre (18):
>  kvm/isolation: function to track buffers allocated for the KVM page
>    table
>  kvm/isolation: add KVM page table entry free functions
>  kvm/isolation: add KVM page table entry offset functions
>  kvm/isolation: add KVM page table entry allocation functions
>  kvm/isolation: add KVM page table entry set functions
>  kvm/isolation: functions to copy page table entries for a VA range
>  kvm/isolation: keep track of VA range mapped in KVM address space
>  kvm/isolation: functions to clear page table entries for a VA range
>  kvm/isolation: improve mapping copy when mapping is already present
>  kvm/isolation: function to copy page table entries for percpu buffer
>  kvm/isolation: initialize the KVM page table with core mappings
>  kvm/isolation: initialize the KVM page table with vmx specific data
>  kvm/isolation: initialize the KVM page table with vmx VM data
>  kvm/isolation: initialize the KVM page table with vmx cpu data
>  kvm/isolation: initialize the KVM page table with the vcpu tasks
>  kvm/isolation: KVM page fault handler
>  kvm/isolation: initialize the KVM page table with KVM memslots
>  kvm/isolation: initialize the KVM page table with KVM buses
> 
> Liran Alon (9):
>  kernel: Export memory-management symbols required for KVM address
>    space isolation
>  KVM: x86: Introduce address_space_isolation module parameter
>  KVM: x86: Introduce KVM separate virtual address space
>  KVM: x86: Switch to KVM address space on entry to guest
>  KVM: x86: Add handler to exit kvm isolation
>  KVM: x86: Exit KVM isolation on IRQ entry
>  KVM: x86: Switch to host address space when may access sensitive data
>  KVM: x86: Optimize branches which checks if address space isolation
>    enabled
>  kvm/isolation: implement actual KVM isolation enter/exit
> 
> arch/x86/include/asm/apic.h    |    4 +-
> arch/x86/include/asm/hardirq.h |   10 +
> arch/x86/include/asm/irq.h     |    1 +
> arch/x86/kernel/cpu/common.c   |    2 +
> arch/x86/kernel/dumpstack.c    |    1 +
> arch/x86/kernel/irq.c          |   11 +
> arch/x86/kernel/ldt.c          |    1 +
> arch/x86/kernel/smp.c          |    2 +-
> arch/x86/kvm/Makefile          |    2 +-
> arch/x86/kvm/isolation.c       | 1773 ++++++++++++++++++++++++++++++++++++++++
> arch/x86/kvm/isolation.h       |   40 +
> arch/x86/kvm/mmu.c             |    3 +-
> arch/x86/kvm/vmx/vmx.c         |  123 +++-
> arch/x86/kvm/x86.c             |   44 +-
> arch/x86/mm/fault.c            |   12 +
> arch/x86/mm/tlb.c              |    4 +-
> arch/x86/platform/uv/tlb_uv.c  |    2 +-
> include/linux/kvm_host.h       |    2 +
> include/linux/percpu.h         |    2 +
> include/linux/sched.h          |    6 +
> mm/memory.c                    |    5 +
> mm/percpu.c                    |    6 +-
> virt/kvm/arm/arm.c             |    4 +
> virt/kvm/kvm_main.c            |    4 +-
> 24 files changed, 2051 insertions(+), 13 deletions(-)
> create mode 100644 arch/x86/kvm/isolation.c
> create mode 100644 arch/x86/kvm/isolation.h
>
Andy Lutomirski May 13, 2019, 6:17 p.m. UTC | #2
> I expect that the KVM address space can eventually be expanded to include
> the ioctl syscall entries. By doing so, and also adding the KVM page table
> to the process userland page table (which should be safe to do because the
> KVM address space doesn't have any secret), we could potentially handle the
> KVM ioctl without having to switch to the kernel pagetable (thus effectively
> eliminating KPTI for KVM). Then the only overhead would be if a VM-Exit has
> to be handled using the full kernel address space.
>

In the hopefully common case where a VM exits and then gets re-entered
without needing to load full page tables, what code actually runs?
I'm trying to understand when the optimization of not switching is
actually useful.

Allowing ioctl() without switching to kernel tables sounds...
extremely complicated.  It also makes the dubious assumption that user
memory contains no secrets.
Nakajima, Jun May 13, 2019, 7:31 p.m. UTC | #3
On 5/13/19, 7:43 AM, "kvm-owner@vger.kernel.org on behalf of Alexandre Chartre" wrote:

    Proposal
    ========
    
    To handle both these points, this series introduce the mechanism of KVM
    address space isolation. Note that this mechanism completes (a)+(b) and
    don't contradict. In case this mechanism is also applied, (a)+(b) should
    still be applied to the full virtual address space as a defence-in-depth).
    
    The idea is that most of KVM #VMExit handlers code will run in a special
    KVM isolated address space which maps only KVM required code and per-VM
    information. Only once KVM needs to architectually access other (sensitive)
    data, it will switch from KVM isolated address space to full standard
    host address space. At this point, KVM will also need to kick all sibling
    hyperthreads to get out of guest (note that kicking all sibling hyperthreads
    is not implemented in this serie).
    
    Basically, we will have the following flow:
    
      - qemu issues KVM_RUN ioctl
      - KVM handles the ioctl and calls vcpu_run():
        . KVM switches from the kernel address to the KVM address space
        . KVM transfers control to VM (VMLAUNCH/VMRESUME)
        . VM returns to KVM
        . KVM handles VM-Exit:
          . if handling need full kernel then switch to kernel address space
          . else continues with KVM address space
        . KVM loops in vcpu_run() or return
      - KVM_RUN ioctl returns
    
    So, the KVM_RUN core function will mainly be executed using the KVM address
    space. The handling of a VM-Exit can require access to the kernel space
    and, in that case, we will switch back to the kernel address space.
    
Once all sibling hyperthreads are in the host (either using the full kernel address space or user address space), what happens to the other sibling hyperthreads if one of them tries to do VM entry? That VCPU will switch to the KVM address space prior to VM entry, but others continue to run? Do you think (a) + (b) would be sufficient for that case?
 
---
Jun
Intel Open Source Technology Center
Liran Alon May 13, 2019, 9:08 p.m. UTC | #4
> On 13 May 2019, at 21:17, Andy Lutomirski <luto@kernel.org> wrote:
> 
>> I expect that the KVM address space can eventually be expanded to include
>> the ioctl syscall entries. By doing so, and also adding the KVM page table
>> to the process userland page table (which should be safe to do because the
>> KVM address space doesn't have any secret), we could potentially handle the
>> KVM ioctl without having to switch to the kernel pagetable (thus effectively
>> eliminating KPTI for KVM). Then the only overhead would be if a VM-Exit has
>> to be handled using the full kernel address space.
>> 
> 
> In the hopefully common case where a VM exits and then gets re-entered
> without needing to load full page tables, what code actually runs?
> I'm trying to understand when the optimization of not switching is
> actually useful.
> 
> Allowing ioctl() without switching to kernel tables sounds...
> extremely complicated.  It also makes the dubious assumption that user
> memory contains no secrets.

Let me attempt to clarify what we were thinking when creating this patch series:

1) It is never safe to execute one hyperthread inside guest while it’s sibling hyperthread runs in a virtual address space which contains secrets of host or other guests.
This is because we assume that using some speculative gadget (such as half-Spectrev2 gadget), it will be possible to populate *some* CPU core resource which could then be *somehow* leaked by the hyperthread running inside guest. In case of L1TF, this would be data populated to the L1D cache.

2) Because of (1), every time a hyperthread runs inside host kernel, we must make sure it’s sibling is not running inside guest. i.e. We must kick the sibling hyperthread outside of guest using IPI.

3) From (2), we should have theoretically deduced that for every #VMExit, there is a need to kick the sibling hyperthread also outside of guest until the #VMExit is completed. Such a patch series was implemented at some point but it had (obviously) significant performance hit.

4) The main goal of this patch series is to preserve (2), but to avoid the overhead specified in (3).

The way this patch series achieves (4) is by observing that during the run of a VM, most #VMExits can be handled rather quickly and locally inside KVM and doesn’t need to reference any data that is not relevant to this VM or KVM code. Therefore, if we will run these #VMExits in an isolated virtual address space (i.e. KVM isolated address space), there is no need to kick the sibling hyperthread from guest while these #VMExits handlers run.
The hope is that the very vast majority of #VMExit handlers will be able to completely run without requiring to switch to full address space. Therefore, avoiding the performance hit of (2).
However, for the very few #VMExits that does require to run in full kernel address space, we must first kick the sibling hyperthread outside of guest and only then switch to full kernel address space and only once all hyperthreads return to KVM address space, then allow then to enter into guest.

From this reason, I think the above paragraph (that was added to my original cover letter) is incorrect.
I believe that we should by design treat all exits to userspace VMM (e.g. QEMU) as slow-path that should not be optimised and therefore ok to switch address space (and therefore also kick sibling hyperthread). Similarly, all IOCTLs handlers are also slow-path and therefore it should be ok for them to also not run in KVM isolated address space.

-Liran
Liran Alon May 13, 2019, 9:16 p.m. UTC | #5
> On 13 May 2019, at 22:31, Nakajima, Jun <jun.nakajima@intel.com> wrote:
> 
> On 5/13/19, 7:43 AM, "kvm-owner@vger.kernel.org on behalf of Alexandre Chartre" wrote:
> 
>    Proposal
>    ========
> 
>    To handle both these points, this series introduce the mechanism of KVM
>    address space isolation. Note that this mechanism completes (a)+(b) and
>    don't contradict. In case this mechanism is also applied, (a)+(b) should
>    still be applied to the full virtual address space as a defence-in-depth).
> 
>    The idea is that most of KVM #VMExit handlers code will run in a special
>    KVM isolated address space which maps only KVM required code and per-VM
>    information. Only once KVM needs to architectually access other (sensitive)
>    data, it will switch from KVM isolated address space to full standard
>    host address space. At this point, KVM will also need to kick all sibling
>    hyperthreads to get out of guest (note that kicking all sibling hyperthreads
>    is not implemented in this serie).
> 
>    Basically, we will have the following flow:
> 
>      - qemu issues KVM_RUN ioctl
>      - KVM handles the ioctl and calls vcpu_run():
>        . KVM switches from the kernel address to the KVM address space
>        . KVM transfers control to VM (VMLAUNCH/VMRESUME)
>        . VM returns to KVM
>        . KVM handles VM-Exit:
>          . if handling need full kernel then switch to kernel address space
>          . else continues with KVM address space
>        . KVM loops in vcpu_run() or return
>      - KVM_RUN ioctl returns
> 
>    So, the KVM_RUN core function will mainly be executed using the KVM address
>    space. The handling of a VM-Exit can require access to the kernel space
>    and, in that case, we will switch back to the kernel address space.
> 
> Once all sibling hyperthreads are in the host (either using the full kernel address space or user address space), what happens to the other sibling hyperthreads if one of them tries to do VM entry? That VCPU will switch to the KVM address space prior to VM entry, but others continue to run? Do you think (a) + (b) would be sufficient for that case?

The description here is missing and important part: When a hyperthread needs to switch from KVM isolated address space to kernel full address space, it should first kick all sibling hyperthreads outside of guest and only then safety switch to full kernel address space. Only once all sibling hyperthreads are running with KVM isolated address space, it is safe to enter guest.

The main point of this address space is to avoid kicking all sibling hyperthreads on *every* VMExit from guest but instead only kick them when switching address space. The assumption is that the vast majority of exits can be handled in KVM isolated address space and therefore do not require to kick the sibling hyperthreads outside of guest.

-Liran

> 
> ---
> Jun
> Intel Open Source Technology Center
> 
>
Liran Alon May 13, 2019, 9:53 p.m. UTC | #6
> On 14 May 2019, at 0:42, Nakajima, Jun <jun.nakajima@intel.com> wrote:
> 
> 
> 
>> On May 13, 2019, at 2:16 PM, Liran Alon <liran.alon@oracle.com> wrote:
>> 
>>> On 13 May 2019, at 22:31, Nakajima, Jun <jun.nakajima@intel.com> wrote:
>>> 
>>> On 5/13/19, 7:43 AM, "kvm-owner@vger.kernel.org on behalf of Alexandre Chartre" wrote:
>>> 
>>>   Proposal
>>>   ========
>>> 
>>>   To handle both these points, this series introduce the mechanism of KVM
>>>   address space isolation. Note that this mechanism completes (a)+(b) and
>>>   don't contradict. In case this mechanism is also applied, (a)+(b) should
>>>   still be applied to the full virtual address space as a defence-in-depth).
>>> 
>>>   The idea is that most of KVM #VMExit handlers code will run in a special
>>>   KVM isolated address space which maps only KVM required code and per-VM
>>>   information. Only once KVM needs to architectually access other (sensitive)
>>>   data, it will switch from KVM isolated address space to full standard
>>>   host address space. At this point, KVM will also need to kick all sibling
>>>   hyperthreads to get out of guest (note that kicking all sibling hyperthreads
>>>   is not implemented in this serie).
>>> 
>>>   Basically, we will have the following flow:
>>> 
>>>     - qemu issues KVM_RUN ioctl
>>>     - KVM handles the ioctl and calls vcpu_run():
>>>       . KVM switches from the kernel address to the KVM address space
>>>       . KVM transfers control to VM (VMLAUNCH/VMRESUME)
>>>       . VM returns to KVM
>>>       . KVM handles VM-Exit:
>>>         . if handling need full kernel then switch to kernel address space
>>>         . else continues with KVM address space
>>>       . KVM loops in vcpu_run() or return
>>>     - KVM_RUN ioctl returns
>>> 
>>>   So, the KVM_RUN core function will mainly be executed using the KVM address
>>>   space. The handling of a VM-Exit can require access to the kernel space
>>>   and, in that case, we will switch back to the kernel address space.
>>> 
>>> Once all sibling hyperthreads are in the host (either using the full kernel address space or user address space), what happens to the other sibling hyperthreads if one of them tries to do VM entry? That VCPU will switch to the KVM address space prior to VM entry, but others continue to run? Do you think (a) + (b) would be sufficient for that case?
>> 
>> The description here is missing and important part: When a hyperthread needs to switch from KVM isolated address space to kernel full address space, it should first kick all sibling hyperthreads outside of guest and only then safety switch to full kernel address space. Only once all sibling hyperthreads are running with KVM isolated address space, it is safe to enter guest.
>> 
> 
> Okay, it makes sense. So, it will require some synchronization among the siblings there.

Definitely.
Currently the kicking of sibling hyperthreads is not integrated yet with this patch series. But it should be at some point.

-Liran

> 
>> The main point of this address space is to avoid kicking all sibling hyperthreads on *every* VMExit from guest but instead only kick them when switching address space. The assumption is that the vast majority of exits can be handled in KVM isolated address space and therefore do not require to kick the sibling hyperthreads outside of guest.
> 
> 
> ---
> Jun
> Intel Open Source Technology Center
Andy Lutomirski May 14, 2019, 2:07 a.m. UTC | #7
On Mon, May 13, 2019 at 2:09 PM Liran Alon <liran.alon@oracle.com> wrote:
>
>
>
> > On 13 May 2019, at 21:17, Andy Lutomirski <luto@kernel.org> wrote:
> >
> >> I expect that the KVM address space can eventually be expanded to include
> >> the ioctl syscall entries. By doing so, and also adding the KVM page table
> >> to the process userland page table (which should be safe to do because the
> >> KVM address space doesn't have any secret), we could potentially handle the
> >> KVM ioctl without having to switch to the kernel pagetable (thus effectively
> >> eliminating KPTI for KVM). Then the only overhead would be if a VM-Exit has
> >> to be handled using the full kernel address space.
> >>
> >
> > In the hopefully common case where a VM exits and then gets re-entered
> > without needing to load full page tables, what code actually runs?
> > I'm trying to understand when the optimization of not switching is
> > actually useful.
> >
> > Allowing ioctl() without switching to kernel tables sounds...
> > extremely complicated.  It also makes the dubious assumption that user
> > memory contains no secrets.
>
> Let me attempt to clarify what we were thinking when creating this patch series:
>
> 1) It is never safe to execute one hyperthread inside guest while it’s sibling hyperthread runs in a virtual address space which contains secrets of host or other guests.
> This is because we assume that using some speculative gadget (such as half-Spectrev2 gadget), it will be possible to populate *some* CPU core resource which could then be *somehow* leaked by the hyperthread running inside guest. In case of L1TF, this would be data populated to the L1D cache.
>
> 2) Because of (1), every time a hyperthread runs inside host kernel, we must make sure it’s sibling is not running inside guest. i.e. We must kick the sibling hyperthread outside of guest using IPI.
>
> 3) From (2), we should have theoretically deduced that for every #VMExit, there is a need to kick the sibling hyperthread also outside of guest until the #VMExit is completed. Such a patch series was implemented at some point but it had (obviously) significant performance hit.
>
>
4) The main goal of this patch series is to preserve (2), but to avoid
the overhead specified in (3).
>
> The way this patch series achieves (4) is by observing that during the run of a VM, most #VMExits can be handled rather quickly and locally inside KVM and doesn’t need to reference any data that is not relevant to this VM or KVM code. Therefore, if we will run these #VMExits in an isolated virtual address space (i.e. KVM isolated address space), there is no need to kick the sibling hyperthread from guest while these #VMExits handlers run.

Thanks!  This clarifies a lot of things.

> The hope is that the very vast majority of #VMExit handlers will be able to completely run without requiring to switch to full address space. Therefore, avoiding the performance hit of (2).
> However, for the very few #VMExits that does require to run in full kernel address space, we must first kick the sibling hyperthread outside of guest and only then switch to full kernel address space and only once all hyperthreads return to KVM address space, then allow then to enter into guest.

What exactly does "kick" mean in this context?  It sounds like you're
going to need to be able to kick sibling VMs from extremely atomic
contexts like NMI and MCE.
Peter Zijlstra May 14, 2019, 7:29 a.m. UTC | #8
(please, wrap our emails at 78 chars)

On Tue, May 14, 2019 at 12:08:23AM +0300, Liran Alon wrote:

> 3) From (2), we should have theoretically deduced that for every
> #VMExit, there is a need to kick the sibling hyperthread also outside
> of guest until the #VMExit is completed.

That's not in fact quite true; all you have to do is send the IPI.
Having one sibling IPI the other sibling carries enough guarantees that
the receiving sibling will not execute any further guest instructions.

That is, you don't have to wait on the VMExit to complete; you can just
IPI and get on with things. Now, this is still expensive, But it is
heaps better than doing a full sync up between siblings.
Peter Zijlstra May 14, 2019, 7:37 a.m. UTC | #9
On Mon, May 13, 2019 at 07:07:36PM -0700, Andy Lutomirski wrote:
> On Mon, May 13, 2019 at 2:09 PM Liran Alon <liran.alon@oracle.com> wrote:

> > The hope is that the very vast majority of #VMExit handlers will be
> > able to completely run without requiring to switch to full address
> > space. Therefore, avoiding the performance hit of (2).

> > However, for the very few #VMExits that does require to run in full
> > kernel address space, we must first kick the sibling hyperthread
> > outside of guest and only then switch to full kernel address space
> > and only once all hyperthreads return to KVM address space, then
> > allow then to enter into guest.
> 
> What exactly does "kick" mean in this context?  It sounds like you're
> going to need to be able to kick sibling VMs from extremely atomic
> contexts like NMI and MCE.

Yeah, doing the full synchronous thing from NMI/MCE context sounds
exceedingly dodgy, howver..

Realistically they only need to send an IPI to the other sibling; they
don't need to wait for the VMExit to complete or anything else.

And that is something we can do from NMI context -- with a bit of care.
See also arch_irq_work_raise(); specifically we need to ensure we leave
the APIC in an idle state, such that if we interrupted an APIC sequence
it will not suddenly fail/violate the APIC write/state etc.
Liran Alon May 14, 2019, 7:57 a.m. UTC | #10
> On 14 May 2019, at 10:29, Peter Zijlstra <peterz@infradead.org> wrote:
> 
> 
> (please, wrap our emails at 78 chars)
> 
> On Tue, May 14, 2019 at 12:08:23AM +0300, Liran Alon wrote:
> 
>> 3) From (2), we should have theoretically deduced that for every
>> #VMExit, there is a need to kick the sibling hyperthread also outside
>> of guest until the #VMExit is completed.
> 
> That's not in fact quite true; all you have to do is send the IPI.
> Having one sibling IPI the other sibling carries enough guarantees that
> the receiving sibling will not execute any further guest instructions.
> 
> That is, you don't have to wait on the VMExit to complete; you can just
> IPI and get on with things. Now, this is still expensive, But it is
> heaps better than doing a full sync up between siblings.
> 

I agree.

I didn’t say you need to do full sync. You just need to IPI the sibling
hyperthreads before switching to the full kernel address space.
But you need to make sure these sibling hyperthreads don’t get back into
the guest until all hyperthreads are running with KVM isolated address space.

It is still very expensive if done for every #VMExit. Which as I explained,
can be avoided in case we use the KVM isolated address space technique.

-Liran
Liran Alon May 14, 2019, 8:05 a.m. UTC | #11
> On 14 May 2019, at 5:07, Andy Lutomirski <luto@kernel.org> wrote:
> 
> On Mon, May 13, 2019 at 2:09 PM Liran Alon <liran.alon@oracle.com> wrote:
>> 
>> 
>> 
>>> On 13 May 2019, at 21:17, Andy Lutomirski <luto@kernel.org> wrote:
>>> 
>>>> I expect that the KVM address space can eventually be expanded to include
>>>> the ioctl syscall entries. By doing so, and also adding the KVM page table
>>>> to the process userland page table (which should be safe to do because the
>>>> KVM address space doesn't have any secret), we could potentially handle the
>>>> KVM ioctl without having to switch to the kernel pagetable (thus effectively
>>>> eliminating KPTI for KVM). Then the only overhead would be if a VM-Exit has
>>>> to be handled using the full kernel address space.
>>>> 
>>> 
>>> In the hopefully common case where a VM exits and then gets re-entered
>>> without needing to load full page tables, what code actually runs?
>>> I'm trying to understand when the optimization of not switching is
>>> actually useful.
>>> 
>>> Allowing ioctl() without switching to kernel tables sounds...
>>> extremely complicated.  It also makes the dubious assumption that user
>>> memory contains no secrets.
>> 
>> Let me attempt to clarify what we were thinking when creating this patch series:
>> 
>> 1) It is never safe to execute one hyperthread inside guest while it’s sibling hyperthread runs in a virtual address space which contains secrets of host or other guests.
>> This is because we assume that using some speculative gadget (such as half-Spectrev2 gadget), it will be possible to populate *some* CPU core resource which could then be *somehow* leaked by the hyperthread running inside guest. In case of L1TF, this would be data populated to the L1D cache.
>> 
>> 2) Because of (1), every time a hyperthread runs inside host kernel, we must make sure it’s sibling is not running inside guest. i.e. We must kick the sibling hyperthread outside of guest using IPI.
>> 
>> 3) From (2), we should have theoretically deduced that for every #VMExit, there is a need to kick the sibling hyperthread also outside of guest until the #VMExit is completed. Such a patch series was implemented at some point but it had (obviously) significant performance hit.
>> 
>> 
> 4) The main goal of this patch series is to preserve (2), but to avoid
> the overhead specified in (3).
>> 
>> The way this patch series achieves (4) is by observing that during the run of a VM, most #VMExits can be handled rather quickly and locally inside KVM and doesn’t need to reference any data that is not relevant to this VM or KVM code. Therefore, if we will run these #VMExits in an isolated virtual address space (i.e. KVM isolated address space), there is no need to kick the sibling hyperthread from guest while these #VMExits handlers run.
> 
> Thanks!  This clarifies a lot of things.
> 
>> The hope is that the very vast majority of #VMExit handlers will be able to completely run without requiring to switch to full address space. Therefore, avoiding the performance hit of (2).
>> However, for the very few #VMExits that does require to run in full kernel address space, we must first kick the sibling hyperthread outside of guest and only then switch to full kernel address space and only once all hyperthreads return to KVM address space, then allow then to enter into guest.
> 
> What exactly does "kick" mean in this context?  It sounds like you're
> going to need to be able to kick sibling VMs from extremely atomic
> contexts like NMI and MCE.

Yes that’s true.
“kick” in this context will probably mean sending an IPI to all sibling hyperthreads.
This IPI will cause these sibling hyperthreads to exit from guest to host on EXTERNAL_INTERRUPT
and wait for a condition that again allows to enter back into guest.
This condition will be once all hyperthreads of CPU core is again running only within KVM isolated address space of this VM.

-Liran
Alexandre Chartre May 14, 2019, 8:33 a.m. UTC | #12
On 5/13/19 11:08 PM, Liran Alon wrote:
> 
> 
>> On 13 May 2019, at 21:17, Andy Lutomirski <luto@kernel.org> wrote:
>>
>>> I expect that the KVM address space can eventually be expanded to include
>>> the ioctl syscall entries. By doing so, and also adding the KVM page table
>>> to the process userland page table (which should be safe to do because the
>>> KVM address space doesn't have any secret), we could potentially handle the
>>> KVM ioctl without having to switch to the kernel pagetable (thus effectively
>>> eliminating KPTI for KVM). Then the only overhead would be if a VM-Exit has
>>> to be handled using the full kernel address space.
>>>
>>
>> In the hopefully common case where a VM exits and then gets re-entered
>> without needing to load full page tables, what code actually runs?
>> I'm trying to understand when the optimization of not switching is
>> actually useful.
>>
>> Allowing ioctl() without switching to kernel tables sounds...
>> extremely complicated.  It also makes the dubious assumption that user
>> memory contains no secrets.
> 
> Let me attempt to clarify what we were thinking when creating this patch series:
> 
> 1) It is never safe to execute one hyperthread inside guest while it’s sibling hyperthread runs in a virtual address space which contains secrets of host or other guests.
> This is because we assume that using some speculative gadget (such as half-Spectrev2 gadget), it will be possible to populate *some* CPU core resource which could then be *somehow* leaked by the hyperthread running inside guest. In case of L1TF, this would be data populated to the L1D cache.
> 
> 2) Because of (1), every time a hyperthread runs inside host kernel, we must make sure it’s sibling is not running inside guest. i.e. We must kick the sibling hyperthread outside of guest using IPI.
> 
> 3) From (2), we should have theoretically deduced that for every #VMExit, there is a need to kick the sibling hyperthread also outside of guest until the #VMExit is completed. Such a patch series was implemented at some point but it had (obviously) significant performance hit.
> 
> 4) The main goal of this patch series is to preserve (2), but to avoid the overhead specified in (3).
> 
> The way this patch series achieves (4) is by observing that during the run of a VM, most #VMExits can be handled rather quickly and locally inside KVM and doesn’t need to reference any data that is not relevant to this VM or KVM code. Therefore, if we will run these #VMExits in an isolated virtual address space (i.e. KVM isolated address space), there is no need to kick the sibling hyperthread from guest while these #VMExits handlers run.
> The hope is that the very vast majority of #VMExit handlers will be able to completely run without requiring to switch to full address space. Therefore, avoiding the performance hit of (2).
> However, for the very few #VMExits that does require to run in full kernel address space, we must first kick the sibling hyperthread outside of guest and only then switch to full kernel address space and only once all hyperthreads return to KVM address space, then allow then to enter into guest.
> 
>  From this reason, I think the above paragraph (that was added to my original cover letter) is incorrect.

Yes, I am wrong. The KVM page table can't be added to the process userland page
table because this can leak secrets from userland. I was only thinking about
performances to reduce the number of context switches. So just forget that
paragraph :-)

alex.


> I believe that we should by design treat all exits to userspace VMM (e.g. QEMU) as slow-path that should not be optimised and therefore ok to switch address space (and therefore also kick sibling hyperthread). Similarly, all IOCTLs handlers are also slow-path and therefore it should be ok for them to also not run in KVM isolated address space.
> 
> -Liran
>
Jan Setje-Eilers May 14, 2019, 9:32 p.m. UTC | #13
On 5/14/19 12:37 AM, Peter Zijlstra wrote:
> On Mon, May 13, 2019 at 07:07:36PM -0700, Andy Lutomirski wrote:
>> On Mon, May 13, 2019 at 2:09 PM Liran Alon <liran.alon@oracle.com> wrote:
>>> The hope is that the very vast majority of #VMExit handlers will be
>>> able to completely run without requiring to switch to full address
>>> space. Therefore, avoiding the performance hit of (2).
>>> However, for the very few #VMExits that does require to run in full
>>> kernel address space, we must first kick the sibling hyperthread
>>> outside of guest and only then switch to full kernel address space
>>> and only once all hyperthreads return to KVM address space, then
>>> allow then to enter into guest.
>> What exactly does "kick" mean in this context?  It sounds like you're
>> going to need to be able to kick sibling VMs from extremely atomic
>> contexts like NMI and MCE.
> Yeah, doing the full synchronous thing from NMI/MCE context sounds
> exceedingly dodgy, howver..
>
> Realistically they only need to send an IPI to the other sibling; they
> don't need to wait for the VMExit to complete or anything else.
>
> And that is something we can do from NMI context -- with a bit of care.
> See also arch_irq_work_raise(); specifically we need to ensure we leave
> the APIC in an idle state, such that if we interrupted an APIC sequence
> it will not suddenly fail/violate the APIC write/state etc.
>
  I've been experimenting with IPI'ing siblings on vmexit, primarily 
because we know we'll need it if ASI turns out to be viable, but also 
because I wanted to understand why previous experiments resulted in such 
poor performance.

  You're correct that you don't need to wait for the sibling to come out 
once you send the IPI. That hardware thread will not do anything other 
than process the IPI once it's sent. There is still some need for 
synchronization, at least for the every vmexit case, since you always 
want to make sure that one thread is actually doing work while the other 
one is held. I have this working for some cases, but not enough to call 
it a general solution. I'm not at all sure that the every vmexit case 
can be made to perform for the general case. Even the non-general case 
uses synchronization that I fear might be overly complex.

  For the cases I do have working, simply not pinning the sibling when 
we exit due to the quest idling is a big enough win to put performance 
into a much more reasonable range.

  Base on this, I believe that pining a sibling HT in a subset of cases, 
when we interact with full kernel address space, is almost certainly 
reasonable.

-jan
Alexandre Chartre May 15, 2019, 12:52 p.m. UTC | #14
Thanks all for your replies and comments. I am trying to summarize main
feedback below, and define next steps.

But first, let me clarify what should happen when exiting the KVM isolated
address space (i.e. when we need to access to the full kernel). There was
some confusion because this was not clearly described in the cover letter.
Thanks to Liran for this better explanation:

   When a hyperthread needs to switch from KVM isolated address space to
   kernel full address space, it should first kick all sibling hyperthreads
   outside of guest and only then safety switch to full kernel address
   space. Only once all sibling hyperthreads are running with KVM isolated
   address space, it is safe to enter guest.

   The main point of this address space is to avoid kicking all sibling
   hyperthreads on *every* VMExit from guest but instead only kick them when
   switching address space. The assumption is that the vast majority of exits
   can be handled in KVM isolated address space and therefore do not require
   to kick the sibling hyperthreads outside of guest.

   “kick” in this context means sending an IPI to all sibling hyperthreads.
   This IPI will cause these sibling hyperthreads to exit from guest to host
   on EXTERNAL_INTERRUPT and wait for a condition that again allows to enter
   back into guest. This condition will be once all hyperthreads of CPU core
   is again running only within KVM isolated address space of this VM.


Feedback
========

Page-table Management

- Need to cleanup terminology mm vs page-table. It looks like we just need
   a KVM page-table, not a KVM mm.

- Interfaces for creating and managing page-table should be provided by
   the kernel, and not implemented in KVM. KVM shouldn't access kernel
   low-level memory management functions.

KVM Isolation Enter/Exit

- Changing CR3 in #PF could be a natural extension as #PF can already
   change page-tables, but we need a very coherent design and strong
   rules.

- Reduce kernel code running without the whole kernel mapping to the
   minimum.

- Avoid using current and task_struct while running with KVM page table.

- Ensure KVM page-table is not used with vmalloc.

- Try to avoid copying parts of the vmalloc page tables. This
   interacts unpleasantly with having the kernel stack.  We can freely
   use a different stack (the IRQ stack, for example) as long as
   we don't schedule, but that means we can't run preemptable code.

- Potential issues with tracing, kprobes... A solution would be to
   compile the isolated code with tracing off.

- Better centralize KVM isolation exit on IRQ, NMI, MCE, faults...
   Switch back to full kernel before switching to IRQ stack or
   shorlty after.

- Can we disable IRQ while running with KVM page-table?

   For IRQs it's somewhat feasible, but not for NMIs since NMIs are
   unblocked on VMX immediately after VM-Exit

   Exits due to INTR, NMI and #MC are considered high priority and are
   serviced before re-enabling IRQs and preemption[1].  All other exits
   are handled after IRQs and preemption are re-enabled.

   A decent number of exit handlers are quite short, but many exit
   handlers require significantly longer flows. In short, leaving
   IRQs disabled across all exits is not practical.

   It makes sense to pinpoint exactly what exits are:
   a) in the hot path for the use case (configuration)
   b) can be handled fast enough that they can run with IRQs disabled.

   Generating that list might allow us to tightly bound the contents
   of kvm_mm and sidestep many of the corner cases, i.e. select VM-Exits
   are handle with IRQs disabled using KVM's mm, while "slow" VM-Exits
   go through the full context switch.


KVM Page Table Content

- Check and reduce core mappings (kernel text size, cpu_entry_area,
   espfix64, IRQ stack...)

- Check and reduce percpu mapping, percpu memory can contain secrets (e.g.
   percpu random pool)


Next Steps
==========

I will investigate Sean's suggestion to see which VM-Exits can be handled
fast enough so that they can run with IRQs disabled (fast VM-Exits),
and which slow VM-Exits are in the hot path.

So I will work on a new POC which just handles fast VM-Exits with IRQs
disabled. This should largely reduce mappings required in the KVM page
table. I will also try to just have a KVM page-table and not a KVM mm.

After this new POC, we should be able to evaluate the need for handling
slow VM-Exits. And if there's an actual need, we can investigate how
to handle them with IRQs enabled.


Thanks,

alex.