[Part2,v5,00/45] Add AMD Secure Nested Paging (SEV-SNP) Hypervisor Support

Message ID 20210820155918.7518-1-brijesh.singh@amd.com (mailing list archive)

Message

Brijesh Singh Aug. 20, 2021, 3:58 p.m. UTC
This part of the Secure Nested Paging (SEV-SNP) series focuses on the
changes required in a host OS for SEV-SNP support. The series builds upon
SEV-SNP Part-1.

This series provides the basic building blocks to support booting SEV-SNP
VMs; it does not cover all of the security enhancements introduced by
SEV-SNP, such as interrupt protection.

The CCP driver is enhanced to provide new APIs that use the SEV-SNP
specific commands defined in the SEV-SNP firmware specification. The KVM
driver uses those APIs to create and manage the SEV-SNP guests.

The GHCB specification version 2 introduces a new set of NAE events that
are used by the SEV-SNP guest to communicate with the hypervisor. The
series provides support to handle the following new NAE events:
- Register GHCB GPA
- Page State Change Request
- Hypervisor feature
- Guest message request

The RMP check is enforced as soon as SEV-SNP is enabled. Not every memory
access requires an RMP check. In particular, read accesses from the
hypervisor do not require RMP checks because data confidentiality is
already protected via memory encryption. When hardware encounters an RMP
check failure, it raises a page-fault exception. If the RMP check failure
is due to a page-size mismatch, the large page is split to resolve the
fault.
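To illustrate the page-size-mismatch case, here is a rough user-space model of the split. Everything here is an invented stand-in for illustration only: the `struct rmp_entry` layout, `psmash()`, and `handle_rmp_size_mismatch()` are toy versions of the real RMP table, the PSMASH instruction, and the kernel's #PF path, not the actual code:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define PAGES_PER_2M 512

/* Toy RMP entry; the real layout is defined by the AMD APM. */
struct rmp_entry {
	bool assigned;	/* page is guest-private */
	bool is_2m;	/* entry covers a 2M page */
};

/* Model of PSMASH: split a 2M RMP entry into 512 4K entries,
 * preserving the "assigned" attribute on each sub-page. */
static void psmash(struct rmp_entry *rmp, size_t base_pfn)
{
	for (size_t i = 0; i < PAGES_PER_2M; i++) {
		rmp[base_pfn + i].assigned = true;
		rmp[base_pfn + i].is_2m = false;
	}
}

/* Model of the host #PF path for an RMP page-size mismatch:
 * the host mapped 4K but the RMP entry is 2M, so split it. */
static void handle_rmp_size_mismatch(struct rmp_entry *rmp, size_t pfn)
{
	size_t base = pfn & ~(size_t)(PAGES_PER_2M - 1);

	if (rmp[base].is_2m)
		psmash(rmp, base);
}
```

After the split, retrying the faulting access succeeds because the 4K RMP entry now matches the host's 4K mapping.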

The series does not provide support for interrupt security and migration;
those features will be added after the base support.

The series is based on the commit:
 SNP part1 commit and
 fa7a549d321a (kvm/next, next) KVM: x86: accept userspace interrupt only if no event is injected

TODO:
  * Add support for command to ratelimit the guest message request.

Changes since v4:
 * Move the RMP entry definition to x86 specific header file.
 * Move the dump RMP entry function to SEV specific file.
 * Use BIT_ULL while defining the #PF bit fields.
 * Add helper function to check the IOMMU support for SEV-SNP feature.
 * Add helper functions for the page state transition.
 * Map and unmap the pages from the direct map after page is added or
   removed in RMP table.
 * Enforce the minimum SEV-SNP firmware version.
 * Extend the LAUNCH_UPDATE to accept the base_gfn and remove the
   logic to calculate the gfn from the hva.
 * Add a check in LAUNCH_UPDATE to ensure that all the pages are
   shared before calling the PSP.
 * Mark the memory failure when failing to remove the page from the
   RMP table or clearing the immutable bit.
 * Exclude the encrypted hva range from the KSM.
 * Remove the gfn tracking during the kvm_gfn_map() and use SRCU to
   synchronize the PSC and gfn mapping.
 * Allow PSC on the registered hva range only.
 * Add support for the Preferred GPA VMGEXIT.
 * Simplify the PSC handling routines.
 * Use the static_call() for the newly added kvm_x86_ops.
 * Remove the long-lived GHCB map.
 * Move the snp enable module parameter to the end of the file.
 * Remove the kvm_x86_op for the RMP fault handling. Call the
   fault handler directly from the #NPF interception.

Changes since v3:
 * Add support for extended guest message request.
 * Add ioctl to query the SNP Platform status.
 * Add ioctl to get and set the SNP config.
 * Add check to verify that memory reserved for the RMP covers the full system RAM.
 * Start the SNP specific commands from 256 instead of 255.
 * Multiple cleanup and fixes based on the review feedback.

Changes since v2:
 * Add AP creation support.
 * Drop the patch to handle the RMP fault for the kernel address.
 * Add functions to track the write access from the hypervisor.
 * Do not enable the SNP feature when IOMMU is disabled or is in passthrough mode.
 * Dump the RMP entry on RMP violation for the debug.
 * Shorten the GHCB macro names.
 * Start the SNP_INIT command id from 255 to give some gap for the legacy SEV.
 * Sync the header with the latest 0.9 SNP spec.
 
Changes since v1:
 * Add AP reset MSR protocol VMGEXIT NAE.
 * Add Hypervisor features VMGEXIT NAE.
 * Move the RMP table initialization and RMPUPDATE/PSMASH helper in
   arch/x86/kernel/sev.c.
 * Add support to map/unmap SEV legacy command buffer to firmware state when
   SNP is active.
 * Enhance PSP driver to provide helper to allocate/free memory used for the
   firmware context page.
 * Add support to handle RMP fault for the kernel address.
 * Add support to handle GUEST_REQUEST NAE event for attestation.
 * Rename RMP table lookup helper.
 * Drop typedef from rmpentry struct definition.
 * Drop SNP static key and use cpu_feature_enabled() to check whether SEV-SNP
   is active.
 * Multiple cleanup/fixes to address Boris review feedback.

Brijesh Singh (40):
  x86/cpufeatures: Add SEV-SNP CPU feature
  iommu/amd: Introduce function to check SEV-SNP support
  x86/sev: Add the host SEV-SNP initialization support
  x86/sev: Add RMP entry lookup helpers
  x86/sev: Add helper functions for RMPUPDATE and PSMASH instruction
  x86/sev: Invalid pages from direct map when adding it to RMP table
  x86/traps: Define RMP violation #PF error code
  x86/fault: Add support to handle the RMP fault for user address
  x86/fault: Add support to dump RMP entry on fault
  crypto: ccp: shutdown SEV firmware on kexec
  crypto:ccp: Define the SEV-SNP commands
  crypto: ccp: Add support to initialize the AMD-SP for SEV-SNP
  crypto:ccp: Provide APIs to issue SEV-SNP commands
  crypto: ccp: Handle the legacy TMR allocation when SNP is enabled
  crypto: ccp: Handle the legacy SEV command when SNP is enabled
  crypto: ccp: Add the SNP_PLATFORM_STATUS command
  crypto: ccp: Add the SNP_{SET,GET}_EXT_CONFIG command
  crypto: ccp: Provide APIs to query extended attestation report
  KVM: SVM: Provide the Hypervisor Feature support VMGEXIT
  KVM: SVM: Make AVIC backing, VMSA and VMCB memory allocation SNP safe
  KVM: SVM: Add initial SEV-SNP support
  KVM: SVM: Add KVM_SNP_INIT command
  KVM: SVM: Add KVM_SEV_SNP_LAUNCH_START command
  KVM: SVM: Add KVM_SEV_SNP_LAUNCH_UPDATE command
  KVM: SVM: Mark the private vma unmerable for SEV-SNP guests
  KVM: SVM: Add KVM_SEV_SNP_LAUNCH_FINISH command
  KVM: X86: Keep the NPT and RMP page level in sync
  KVM: x86: Introduce kvm_mmu_get_tdp_walk() for SEV-SNP use
  KVM: x86: Define RMP page fault error bits for #NPF
  KVM: x86: Update page-fault trace to log full 64-bit error code
  KVM: SVM: Do not use long-lived GHCB map while setting scratch area
  KVM: SVM: Remove the long-lived GHCB host map
  KVM: SVM: Add support to handle GHCB GPA register VMGEXIT
  KVM: SVM: Add support to handle MSR based Page State Change VMGEXIT
  KVM: SVM: Add support to handle Page State Change VMGEXIT
  KVM: SVM: Introduce ops for the post gfn map and unmap
  KVM: x86: Export the kvm_zap_gfn_range() for the SNP use
  KVM: SVM: Add support to handle the RMP nested page fault
  KVM: SVM: Provide support for SNP_GUEST_REQUEST NAE event
  KVM: SVM: Add module parameter to enable the SEV-SNP

Sean Christopherson (2):
  KVM: x86/mmu: Move 'pfn' variable to caller of direct_page_fault()
  KVM: x86/mmu: Introduce kvm_mmu_map_tdp_page() for use by TDX and SNP

Tom Lendacky (3):
  KVM: SVM: Add support to handle AP reset MSR protocol
  KVM: SVM: Use a VMSA physical address variable for populating VMCB
  KVM: SVM: Support SEV-SNP AP Creation NAE event

 Documentation/virt/coco/sevguest.rst          |   55 +
 .../virt/kvm/amd-memory-encryption.rst        |  102 +
 arch/x86/include/asm/cpufeatures.h            |    1 +
 arch/x86/include/asm/disabled-features.h      |    8 +-
 arch/x86/include/asm/kvm-x86-ops.h            |    5 +
 arch/x86/include/asm/kvm_host.h               |   20 +
 arch/x86/include/asm/msr-index.h              |    6 +
 arch/x86/include/asm/sev-common.h             |   28 +
 arch/x86/include/asm/sev.h                    |   45 +
 arch/x86/include/asm/svm.h                    |    7 +
 arch/x86/include/asm/trap_pf.h                |   18 +-
 arch/x86/kernel/cpu/amd.c                     |    3 +-
 arch/x86/kernel/sev.c                         |  361 ++++
 arch/x86/kvm/lapic.c                          |    5 +-
 arch/x86/kvm/mmu.h                            |    7 +-
 arch/x86/kvm/mmu/mmu.c                        |   84 +-
 arch/x86/kvm/svm/sev.c                        | 1676 ++++++++++++++++-
 arch/x86/kvm/svm/svm.c                        |   62 +-
 arch/x86/kvm/svm/svm.h                        |   74 +-
 arch/x86/kvm/trace.h                          |   40 +-
 arch/x86/kvm/x86.c                            |   92 +-
 arch/x86/mm/fault.c                           |   84 +-
 drivers/crypto/ccp/sev-dev.c                  |  924 ++++++++-
 drivers/crypto/ccp/sev-dev.h                  |   17 +
 drivers/crypto/ccp/sp-pci.c                   |   12 +
 drivers/iommu/amd/init.c                      |   30 +
 include/linux/iommu.h                         |    9 +
 include/linux/mm.h                            |    6 +-
 include/linux/psp-sev.h                       |  346 ++++
 include/linux/sev.h                           |   32 +
 include/uapi/linux/kvm.h                      |   56 +
 include/uapi/linux/psp-sev.h                  |   60 +
 mm/memory.c                                   |   13 +
 tools/arch/x86/include/asm/cpufeatures.h      |    1 +
 34 files changed, 4088 insertions(+), 201 deletions(-)
 create mode 100644 include/linux/sev.h

Comments

Peter Gonda Nov. 12, 2021, 3:43 p.m. UTC | #1
Hi Brijesh,

One high level discussion I'd like to have on these SNP KVM patches.

In these patches (v5), if a host userspace process writes a guest
private page, a SIGBUS is issued to that process. If the kernel writes
a guest private page, the kernel panics due to the unhandled RMP
page fault. This is an issue because not all writes into guest
memory come from a bug in the host. For instance, a malicious or
even buggy guest could easily point the host at writing a private page
during the emulation of many virtual devices (virtio, NVMe, etc.). For
example, a well-behaved guest's behavior is to: start up a driver,
select some pages to share with the host, ask the host to convert
them to shared, then use those pages for virtual device DMA. If a
buggy guest forgets the step of requesting that the pages be converted
to shared, it's easy to see how the host could rightfully write to
private memory. I think we can better guarantee host reliability when
running SNP guests without changing SNP’s security properties.

Here is an alternative to the current approach: On RMP violation (host
or userspace) the page fault handler converts the page from private to
shared to allow the write to continue. This pulls from s390’s error
handling which does exactly this. See ‘arch_make_page_accessible()’.
Additionally it adds less complexity to the SNP kernel patches, and
requires no new ABI.

In the current (v5) KVM implementation, if a userspace process
generates an RMP violation (writes to guest private memory), the
process receives a SIGBUS. At first glance, it would appear that
user-space shouldn’t write to private memory. However, guaranteeing
this in a generic fashion requires locking the RMP entries (via locks
external to the RMP). Otherwise, a user-space process emulating
guest device IO may be vulnerable to having the guest memory
(maliciously or by guest bug) converted to private while user-space
emulation is happening. This results in a well-behaved userspace
process receiving a SIGBUS.

This proposal allows buggy and malicious guests to run under SNP
without jeopardizing the reliability / safety of host processes. This
is very important to a cloud service provider (CSP) since it’s common
to have host-wide daemons that read/write all guests' memory, i.e. a single
process could manage the networking for all VMs on the host. Crashing
that singleton process kills networking for all VMs on the system.

This proposal also allows for minimal changes to the kexec flow and
kdump. The new kexec'd kernel can simply update private pages to shared
as it encounters them during boot. This avoids needing to
propagate the RMP state from kernel to kernel. Of course this doesn’t
preserve any running VMs, but it is still useful for kdump crash dumps or
quicker rekerneling for development with kexec.

This proposal does cause guest memory corruption for some bugs, but one
of SEV-SNP’s goals, extended from SEV-ES’s goals, is for guests to be
able to detect when their memory has been corrupted / replayed by the
host. So SNP already has features for allowing guests to detect this
kind of memory corruption. Additionally, this is very similar to a page
of memory generating a machine check because of 2-bit memory
corruption. In other words, SNP guests must be enlightened and ready
for these kinds of errors.

For an SNP guest running under this proposal, the flow would look like this:
* Host gets a #PF because it's trying to write to a private page.
* Host #PF handler updates the page to shared.
* Write continues normally.
* Guest accesses the memory (r/w).
* Guest gets a #VC error because the page is not PVALIDATED.
* Guest is now in control. The guest can terminate because its memory has
been corrupted, or it could try to continue and log the error to its
owner.
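The proposed flow can be modeled as a small state machine. This is purely illustrative; `struct page_state`, `host_write()`, and `guest_access_raises_vc()` are invented names, not kernel, KVM, or firmware APIs:

```c
#include <assert.h>
#include <stdbool.h>

/* Invented per-page state for illustration only. */
struct page_state {
	bool is_private;	/* RMP: page assigned to the guest */
	bool validated;		/* guest has PVALIDATEd the page */
};

/* Proposed host #PF behavior: implicitly convert private -> shared
 * (modeling an RMPUPDATE to shared) so the write can continue. */
static void host_write(struct page_state *p)
{
	if (p->is_private) {
		p->is_private = false;
		p->validated = false;	/* validation is lost */
	}
	/* the write then proceeds normally */
}

/* Guest access: returns true if a #VC (page not validated) is raised,
 * putting the guest in control of how to react. */
static bool guest_access_raises_vc(const struct page_state *p)
{
	return !p->validated;
}
```

The key property being modeled: the host write always completes, and the corruption is surfaced to the guest as a #VC on its next access rather than as a host crash.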

A similar approach was introduced in the SNP patches V1 and V2 for
kernel page fault handling. The pushback around this convert-to-shared
approach was largely focused on the idea that the kernel has all
the information about which pages are shared vs. private, so it should
be able to check shared status before writing to pages. After V2, the
patches were updated to not have a kernel page fault handler for RMP
violations (other than dumping state during a panic). The current
patches protect the host with new post_{map,unmap}_gfn() functions that
check if a page is shared before mapping it, then lock the page
shared until unmapped. Given the discussions on ‘[Part2,v5,39/45] KVM:
SVM: Introduce ops for the post gfn map and unmap’, building a solution
to do this is non-trivial and adds new overheads to KVM. Additionally,
the current solution is local to the kernel, so a new ABI must now be
created to allow the userspace VMM to access the kernel-side locks for
this to work generically for the whole host. This is more complicated
than this proposal, and adding more lock holders seems like it could
reduce performance further.

There are a couple of corner cases with this approach. Under SNP, guests
can request that their memory be changed into a VMSA. This VMSA page cannot
be changed to shared while the vCPU associated with it is running. So
KVM + the #PF handler will need something to kick vCPUs out of running.
Joerg believes that a possible fix for this could be a new MMU
notifier in the kernel; then on the #PF we can go through the RMP and
execute this vCPU-kick callback.

Another corner case is that the RMPUPDATE instruction is not guaranteed to
succeed on the first iteration. As noted above, if the page is a VMSA it
cannot be updated while the vCPU is running. Another issue is that if the
guest is running RMPADJUST on a page, it cannot be RMPUPDATEd at that
time. There is a lock for each RMP entry, so there is a race between these
instructions. The vCPU kicking can solve this issue by kicking all
guest vCPUs, which removes the chance of the race.

Since this proposal probably results in SNP guests terminating due to
a page unexpectedly needing PVALIDATE, the approach could be
simplified to KVM just killing the guest. However, I think it's nicer to
users to deliver the unvalidated #VC exception instead of unilaterally
killing the guest, allowing them to collect some additional
debug information and perform any additional clean-up work they would
like to.

Thanks
Peter

On Fri, Aug 20, 2021 at 9:59 AM Brijesh Singh <brijesh.singh@amd.com> wrote:
> [...]
Dave Hansen Nov. 12, 2021, 5:59 p.m. UTC | #2
On 11/12/21 7:43 AM, Peter Gonda wrote:
...
> Here is an alternative to the current approach: On RMP violation (host
> or userspace) the page fault handler converts the page from private to
> shared to allow the write to continue. This pulls from s390’s error
> handling which does exactly this. See ‘arch_make_page_accessible()’.
> Additionally it adds less complexity to the SNP kernel patches, and
> requires no new ABI.

I think it's important to very carefully describe where these RMP page
faults can occur within the kernel.  They can obviously occur on things like:

	copy_to_user(&user_buf, &kernel_buf, len);

That's not a big deal.  Those can obviously handle page faults.  We know
exactly the instruction on which the fault can occur and we handle it
gracefully.

*But*, these are harder:

	get_user_pages(addr, len, &pages);
	spin_lock(&lock);
	ptr = kmap_atomic(pages[0]);
	*ptr = foo; // #PF here
	kunmap_atomic(ptr);
	spin_unlock(&lock);
	put_page(pages[0]);

In this case, the place where the fault happens is not especially well
bounded.  It can be in compiler-generated code.  It can happen on any
random instruction in there.

Or, is there some mechanism that prevents guest-private memory from being
accessed in random host kernel code?

> This proposal does cause guest memory corruption for some bugs but one
> of SEV-SNP’s goals extended from SEV-ES’s goals is for guest’s to be
> able to detect when its memory has been corrupted / replayed by the
> host. So SNP already has features for allowing guests to detect this
> kind of memory corruption. Additionally this is very similar to a page
> of memory generating a machine check because of 2-bit memory
> corruption. In other words SNP guests must be enlightened and ready
> for these kinds of errors.
> 
> For an SNP guest running under this proposal the flow would look like this:
> * Host gets a #PF because its trying to write to a private page.
> * Host #PF handler updates the page to shared.
> * Write continues normally.
> * Guest accesses memory (r/w).
> * Guest gets a #VC error because the page is not PVALIDATED
> * Guest is now in control. Guest can terminate because its memory has
> been corrupted. Guest could try and continue to log the error to its
> owner.

This sounds like a _possible_ opportunity for the guest to do some extra
handling.  It's also quite possible that this #VC happens in a place
where the guest can't handle it.


> A similar approach was introduced in the SNP patches V1 and V2 for
> kernel page fault handling. The pushback around this convert to shared
> approach was largely focused around the idea that the kernel has all
> the information about which pages are shared vs private so it should
> be able to check shared status before write to pages. After V2 the
> patches were updated to not have a kernel page fault handler for RMP
> violations (other than dumping state during a panic). The current
> patches protect the host with new post_{map,unmap}_gfn() function that
> checks if a page is shared before mapping it, then locks the page
> shared until unmapped. Given the discussions on ‘[Part2,v5,39/45] KVM:
> SVM: Introduce ops for the post gfn map and unmap’ building a solution
> to do this is non trivial and adds new overheads to KVM. Additionally
> the current solution is local to the kernel. So a new ABI just now be
> created to allow the userspace VMM to access the kernel-side locks for
> this to work generically for the whole host. This is more complicated
> than this proposal and adding more lock holders seems like it could
> reduce performance further.

The locking is complicated.  But, I think it is necessary.  Once the
kernel can map private memory, it can access it in random spots.  It's
hard to make random kernel code recoverable from faults.
Borislav Petkov Nov. 12, 2021, 6:35 p.m. UTC | #3
On Fri, Nov 12, 2021 at 09:59:46AM -0800, Dave Hansen wrote:
> Or, is there some mechanism that prevent guest-private memory from being
> accessed in random host kernel code?

So I'm currently under the impression that random host->guest accesses
should not happen if not previously agreed upon by both.

Because, as explained on IRC, if host touches a private guest page,
whatever the host does to that page, the next time the guest runs, it'll
get a #VC where it will see that that page doesn't belong to it anymore
and then, out of paranoia, it will simply terminate to protect itself.

So cloud providers should have an interest to prevent such random stray
accesses if they wanna have guests. :)

> This sounds like a _possible_ opportunity for the guest to do some extra
> handling.  It's also quite possible that this #VC happens in a place
> that the guest can't handle.

How? It'll get a #VC when it first touches that page.

I'd say the #VC handler should be able to deal with it...

Thx.
Sean Christopherson Nov. 12, 2021, 7:48 p.m. UTC | #4
On Fri, Nov 12, 2021, Borislav Petkov wrote:
> On Fri, Nov 12, 2021 at 09:59:46AM -0800, Dave Hansen wrote:
> > Or, is there some mechanism that prevent guest-private memory from being
> > accessed in random host kernel code?

Or random host userspace code...

> So I'm currently under the impression that random host->guest accesses
> should not happen if not previously agreed upon by both.

Key word "should".

> Because, as explained on IRC, if host touches a private guest page,
> whatever the host does to that page, the next time the guest runs, it'll
> get a #VC where it will see that that page doesn't belong to it anymore
> and then, out of paranoia, it will simply terminate to protect itself.
> 
> So cloud providers should have an interest to prevent such random stray
> accesses if they wanna have guests. :)

Yes, but IMO inducing a fault in the guest because of a _host_ bug is wrong.

On Fri, Nov 12, 2021, Peter Gonda wrote:
> Here is an alternative to the current approach: On RMP violation (host
> or userspace) the page fault handler converts the page from private to
> shared to allow the write to continue. This pulls from s390’s error
> handling which does exactly this. See ‘arch_make_page_accessible()’.

Ah, after further reading, s390 does _not_ do implicit private=>shared conversions.

s390's arch_make_page_accessible() is somewhat similar, but it is not a direct
comparison.  IIUC, it exports and integrity protects the data and thus preserves
the guest's data in an encrypted form, e.g. so that it can be swapped to disk.
And if the host corrupts the data, attempting to convert it back to secure on a
subsequent guest access will fail.

The host kernel's handling of the "convert to secure" failures doesn't appear to
be all that robust, e.g. it looks like there are multiple paths where the error
is dropped on the floor and the guest is resumed, but IMO soft hanging the guest
is still better than inducing a fault in the guest, and far better than potentially
coercing the guest into reading corrupted memory ("spurious" PVALIDATE).  And s390's
behavior is fixable since it's purely a host error handling problem.

To truly make a page shared, s390 requires the guest to call into the ultravisor
to make a page shared.  And on the host side, the host can pin a page as shared
to prevent the guest from unsharing it while the host is accessing it as a shared
page.

So, inducing #VC is similar in the sense that a malicious s390 guest can also DoS itself,
but is quite different in that (AFAICT) s390 does not create an attack surface where
a malicious or buggy host userspace can induce faults in the guest, or worst case in
SNP, exploit a buggy guest into accepting and accessing corrupted data.

It's also different in that s390 doesn't implicitly convert between shared and
private.  Functionally, it doesn't really change the end result because a buggy
host that writes guest private memory will DoS the guest (by inducing a #VC or
corrupting exported data), but at least for s390 there's a sane, legitimate use
case for accessing guest private memory (swap and maybe migration?), whereas for
SNP, IMO implicitly converting to shared on a host access is straight up wrong.

> Additionally it adds less complexity to the SNP kernel patches, and
> requires no new ABI.

I disagree, this would require "new" ABI in the sense that it commits KVM to
supporting SNP without requiring userspace to initiate any and all conversions
between shared and private.  Which in my mind is the big elephant in the room:
do we want to require new KVM (and kernel?) ABI to allow/force userspace to
explicitly declare guest private memory for TDX _and_ SNP, or just TDX?
Borislav Petkov Nov. 12, 2021, 8:04 p.m. UTC | #5
On Fri, Nov 12, 2021 at 07:48:17PM +0000, Sean Christopherson wrote:
> Yes, but IMO inducing a fault in the guest because of _host_ bug is wrong.

What do you suggest instead?
Sean Christopherson Nov. 12, 2021, 8:37 p.m. UTC | #6
On Fri, Nov 12, 2021, Borislav Petkov wrote:
> On Fri, Nov 12, 2021 at 07:48:17PM +0000, Sean Christopherson wrote:
> > Yes, but IMO inducing a fault in the guest because of _host_ bug is wrong.
> 
> What do you suggest instead?

Let userspace decide what is mapped shared and what is mapped private.  The kernel
and KVM provide the APIs/infrastructure to do the actual conversions in a thread-safe
fashion and also to enforce the current state, but userspace is the control plane.

It would require non-trivial changes in userspace if there are multiple processes
accessing guest memory, e.g. Peter's networking daemon example, but it _is_ fully
solvable.  The exit to userspace means all three components (guest, kernel, 
and userspace) have full knowledge of what is shared and what is private.  There
is zero ambiguity:

  - if userspace accesses guest private memory, it gets SIGSEGV or whatever.  
  - if kernel accesses guest private memory, it does BUG/panic/oops[*]
  - if guest accesses memory with the incorrect C/SHARED-bit, it gets killed.

This is the direction KVM TDX support is headed, though it's obviously still a WIP.

And ideally, to avoid implicit conversions at any level, hardware vendors' ABIs
define that:

  a) All convertible memory, i.e. RAM, starts as private.
  b) Conversions between private and shared must be done via explicit hypercall.

Without (b), userspace and thus KVM have to treat guest accesses to the incorrect
type as implicit conversions.

[*] Sadly, fully preventing kernel access to guest private is not possible with
    TDX, especially if the direct map is left intact.  But maybe in the future
    TDX will signal a fault instead of poisoning memory and leaving a #MC mine.
Borislav Petkov Nov. 12, 2021, 8:53 p.m. UTC | #7
On Fri, Nov 12, 2021 at 08:37:59PM +0000, Sean Christopherson wrote:
> Let userspace decide what is mapped shared and what is mapped private. 

With "userspace", you mean the *host* userspace?

> The kernel and KVM provide the APIs/infrastructure to do the actual
> conversions in a thread-safe fashion and also to enforce the current
> state, but userspace is the control plane.
>
> It would require non-trivial changes in userspace if there are multiple processes
> accessing guest memory, e.g. Peter's networking daemon example, but it _is_ fully
> solvable.  The exit to userspace means all three components (guest, kernel, 
> and userspace) have full knowledge of what is shared and what is private.  There
> is zero ambiguity:
> 
>   - if userspace accesses guest private memory, it gets SIGSEGV or whatever.  

That SIGSEGV is generated by the host kernel, I presume, after it checks
whether the memory belongs to the guest?

>   - if kernel accesses guest private memory, it does BUG/panic/oops[*]

If *it* is the host kernel, then you probably shouldn't do that -
otherwise you just killed the host kernel on which all those guests are
running.

>   - if guest accesses memory with the incorrect C/SHARED-bit, it gets killed.

Yah, that's the easy one.

> This is the direction KVM TDX support is headed, though it's obviously still a WIP.
> 
> And ideally, to avoid implicit conversions at any level, hardware vendors' ABIs
> define that:
> 
>   a) All convertible memory, i.e. RAM, starts as private.
>   b) Conversions between private and shared must be done via explicit hypercall.

I like the explicit nature of this but devil's in the detail and I'm no
virt guy...

> Without (b), userspace and thus KVM have to treat guest accesses to the incorrect
> type as implicit conversions.
> 
> [*] Sadly, fully preventing kernel access to guest private is not possible with
>     TDX, especially if the direct map is left intact.  But maybe in the future
>     TDX will signal a fault instead of poisoning memory and leaving a #MC mine.

Yah, the #MC thing sounds like someone didn't think things through. ;-\

Thx.
Peter Gonda Nov. 12, 2021, 9:12 p.m. UTC | #8
On Fri, Nov 12, 2021 at 1:55 PM Borislav Petkov <bp@alien8.de> wrote:
>
> On Fri, Nov 12, 2021 at 08:37:59PM +0000, Sean Christopherson wrote:
> > Let userspace decide what is mapped shared and what is mapped private.
>
> With "userspace", you mean the *host* userspace?
>
> > The kernel and KVM provide the APIs/infrastructure to do the actual
> > conversions in a thread-safe fashion and also to enforce the current
> > state, but userspace is the control plane.
> >
> > It would require non-trivial changes in userspace if there are multiple processes
> > accessing guest memory, e.g. Peter's networking daemon example, but it _is_ fully
> > solvable.  The exit to userspace means all three components (guest, kernel,
> > and userspace) have full knowledge of what is shared and what is private.  There
> > is zero ambiguity:
> >
> >   - if userspace accesses guest private memory, it gets SIGSEGV or whatever.
>
> That SIGSEGV is generated by the host kernel, I presume, after it checks
> whether the memory belongs to the guest?
>
> >   - if kernel accesses guest private memory, it does BUG/panic/oops[*]
>
> If *it* is the host kernel, then you probably shouldn't do that -
> otherwise you just killed the host kernel on which all those guests are
> running.

I agree, it seems better to terminate the single guest with an issue
rather than killing the host (and therefore all guests). So I'd
suggest even in this case we do the 'convert to shared' approach or
just outright terminate the guest.

Are there already examples in KVM where a KVM bug in servicing a VM's
request results in a BUG/panic/oops? That never seems ideal.

>
> >   - if guest accesses memory with the incorrect C/SHARED-bit, it gets killed.
>
> Yah, that's the easy one.
>
> > This is the direction KVM TDX support is headed, though it's obviously still a WIP.
> >
> > And ideally, to avoid implicit conversions at any level, hardware vendors' ABIs
> > define that:
> >
> >   a) All convertible memory, i.e. RAM, starts as private.
> >   b) Conversions between private and shared must be done via explicit hypercall.
>
> I like the explicit nature of this but devil's in the detail and I'm no
> virt guy...

This seems like a reasonable approach that can help with the issue of
terminating the entity behaving poorly. Could this feature be an
improvement that comes later? This improvement could be gated behind a
per VM KVM CAP, a kvm module param, or insert other solution here, to
not blind side userspace with this change?

>
> > Without (b), userspace and thus KVM have to treat guest accesses to the incorrect
> > type as implicit conversions.
> >
> > [*] Sadly, fully preventing kernel access to guest private is not possible with
> >     TDX, especially if the direct map is left intact.  But maybe in the future
> >     TDX will signal a fault instead of poisoning memory and leaving a #MC mine.
>
> Yah, the #MC thing sounds like someone didn't think things through. ;-\

Yes #MC in TDX seems much harder to deal with than the #PF for SNP.
I'd propose TDX keeps the host kernel safe bug not have that solution
block SNP. As suggested above I like the idea but it seems like it can
come as a future improvement to SNP support.

>
> Thx.
>
> --
> Regards/Gruss,
>     Boris.
>
> https://people.kernel.org/tglx/notes-about-netiquette
Marc Orr Nov. 12, 2021, 9:16 p.m. UTC | #9
> > So cloud providers should have an interest to prevent such random stray
> > accesses if they wanna have guests. :)
>
> Yes, but IMO inducing a fault in the guest because of _host_ bug is wrong.

I want to push back on "inducing a fault in the guest because of
_host_ bug is wrong.". The guest is _required_ to be robust against
the host maliciously (or accidentally) writing its memory. SNP
security depends on the guest detecting such writes. Therefore, why is
leveraging this system property that the guest will detect when its
private memory has been written wrong? Especially when it's orders of
magnitude simpler than the alternative to have everything in the
system -- kernel, user-space, and guest -- all coordinate to agree
what's private and what's shared. Such a complex approach is likely to
bring a lot of bugs, vulnerabilities, and limitations on future design
into the picture.
Andy Lutomirski Nov. 12, 2021, 9:20 p.m. UTC | #10
On 11/12/21 13:12, Peter Gonda wrote:
> On Fri, Nov 12, 2021 at 1:55 PM Borislav Petkov <bp@alien8.de> wrote:
>>
>> On Fri, Nov 12, 2021 at 08:37:59PM +0000, Sean Christopherson wrote:
>>> Let userspace decide what is mapped shared and what is mapped private.
>>
>> With "userspace", you mean the *host* userspace?
>>
>>> The kernel and KVM provide the APIs/infrastructure to do the actual
>>> conversions in a thread-safe fashion and also to enforce the current
>>> state, but userspace is the control plane.
>>>
>>> It would require non-trivial changes in userspace if there are multiple processes
>>> accessing guest memory, e.g. Peter's networking daemon example, but it _is_ fully
>>> solvable.  The exit to userspace means all three components (guest, kernel,
>>> and userspace) have full knowledge of what is shared and what is private.  There
>>> is zero ambiguity:
>>>
>>>    - if userspace accesses guest private memory, it gets SIGSEGV or whatever.
>>
>> That SIGSEGV is generated by the host kernel, I presume, after it checks
>> whether the memory belongs to the guest?
>>
>>>    - if kernel accesses guest private memory, it does BUG/panic/oops[*]
>>
>> If *it* is the host kernel, then you probably shouldn't do that -
>> otherwise you just killed the host kernel on which all those guests are
>> running.
> 
> I agree, it seems better to terminate the single guest with an issue
> rather than killing the host (and therefore all guests). So I'd
> suggest even in this case we do the 'convert to shared' approach or
> just outright terminate the guest.
> 
> Are there already examples in KVM where a KVM bug in servicing a VM's
> request results in a BUG/panic/oops? That never seems ideal.
> 
>>
>>>    - if guest accesses memory with the incorrect C/SHARED-bit, it gets killed.
>>
>> Yah, that's the easy one.
>>
>>> This is the direction KVM TDX support is headed, though it's obviously still a WIP.
>>>
>>> And ideally, to avoid implicit conversions at any level, hardware vendors' ABIs
>>> define that:
>>>
>>>    a) All convertible memory, i.e. RAM, starts as private.
>>>    b) Conversions between private and shared must be done via explicit hypercall.
>>
>> I like the explicit nature of this but devil's in the detail and I'm no
>> virt guy...
> 
> This seems like a reasonable approach that can help with the issue of
> terminating the entity behaving poorly. Could this feature be an
> improvement that comes later? This improvement could be gated behind a
> per VM KVM CAP, a kvm module param, or insert other solution here, to
> not blind side userspace with this change?
> 
>>
>>> Without (b), userspace and thus KVM have to treat guest accesses to the incorrect
>>> type as implicit conversions.
>>>
>>> [*] Sadly, fully preventing kernel access to guest private is not possible with
>>>      TDX, especially if the direct map is left intact.  But maybe in the future
>>>      TDX will signal a fault instead of poisoning memory and leaving a #MC mine.
>>
>> Yah, the #MC thing sounds like someone didn't think things through. ;-\
> 
> Yes #MC in TDX seems much harder to deal with than the #PF for SNP.
I'd propose TDX keeps the host kernel safe but not have that solution
> block SNP. As suggested above I like the idea but it seems like it can
> come as a future improvement to SNP support.
> 

Can you clarify what type of host bug you're talking about here?  We're 
talking about the host kernel reading or writing guest private memory 
through the direct map or kmap, right?  It seems to me that the way this 
happens is:

struct page *guest_page = (get and refcount a guest page);

...

guest switches the page to private;

...

read or write the page via direct map or kmap.


This naively *looks* like something a malicious or buggy guest could 
induce the host kernel to do.  But I think that, if this actually 
occurs, it's an example of a much more severe host bug.  In particular, 
there is nothing special about the guest switching a page to private -- 
the guest or QEMU could just as easily have freed (hotunplugged) a 
shared guest page or ballooned it or swapped it or otherwise deallocated 
it.  And, if the host touches a page/pfn that the guest has freed, this 
is UAF, is a huge security hole if the guest has any degree of control, 
and needs to kill the host kernel if detected.

Now we can kind of sweep it under the rug by saying that changing a page 
from shared to private isn't really freeing the page -- it's just 
modifying the usage of the page.  But to me that sounds like saying 
"reusing a former user memory page as a pagetable isn't *really* freeing 
it if the kernel kept a pointer around the whole time and the page 
allocator wasn't involved".

So let's please just consider these mode transitions to be equivalent to 
actually freeing the page at least from a locking perspective, and then 
the way to prevent these bugs and the correct action to take if they 
occur is clear.

(On TDX, changing a page shared/private state is very much like freeing 
and reallocating -- they're in different page tables and, in principle 
anyway, there is no actual relationship between a shared page and a 
supposedly matching private page except that some bits of the GPA happen 
to match.  The hardware and the TDX module fully support both being mapped at
once at the same "address", at least according to the specs I've read.)

--Andy
Andy Lutomirski Nov. 12, 2021, 9:23 p.m. UTC | #11
On 11/12/21 13:16, Marc Orr wrote:
>>> So cloud providers should have an interest to prevent such random stray
>>> accesses if they wanna have guests. :)
>>
>> Yes, but IMO inducing a fault in the guest because of _host_ bug is wrong.
> 
> I want to push back on "inducing a fault in the guest because of
> _host_ bug is wrong.". The guest is _required_ to be robust against
> the host maliciously (or accidentally) writing its memory. SNP
> security depends on the guest detecting such writes. Therefore, why is
> leveraging this system property that the guest will detect when its
> private memory has been written wrong?

 >
> Especially when it's orders of
> magnitude simpler than the alternative to have everything in the
> system -- kernel, user-space, and guest -- all coordinate to agree
> what's private and what's shared. Such a complex approach is likely to
> bring a lot of bugs, vulnerabilities, and limitations on future design
> into the picture.
> 

SEV-SNP, TDX, and any reasonable software solution all require that the 
host know which pages are private and which pages are shared.  Sure, the 
old SEV-ES Linux host implementation was very simple, but it's nasty and 
fundamentally can't support migration.
Marc Orr Nov. 12, 2021, 9:30 p.m. UTC | #12
On Fri, Nov 12, 2021 at 12:38 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Fri, Nov 12, 2021, Borislav Petkov wrote:
> > On Fri, Nov 12, 2021 at 07:48:17PM +0000, Sean Christopherson wrote:
> > > Yes, but IMO inducing a fault in the guest because of _host_ bug is wrong.
> >
> > What do you suggest instead?
>
> Let userspace decide what is mapped shared and what is mapped private.  The kernel
> and KVM provide the APIs/infrastructure to do the actual conversions in a thread-safe
> fashion and also to enforce the current state, but userspace is the control plane.
>
> It would require non-trivial changes in userspace if there are multiple processes
> accessing guest memory, e.g. Peter's networking daemon example, but it _is_ fully
> solvable.  The exit to userspace means all three components (guest, kernel,
> and userspace) have full knowledge of what is shared and what is private.  There
> is zero ambiguity:
>
>   - if userspace accesses guest private memory, it gets SIGSEGV or whatever.
>   - if kernel accesses guest private memory, it does BUG/panic/oops[*]
>   - if guest accesses memory with the incorrect C/SHARED-bit, it gets killed.
>
> This is the direction KVM TDX support is headed, though it's obviously still a WIP.
>
> And ideally, to avoid implicit conversions at any level, hardware vendors' ABIs
> define that:
>
>   a) All convertible memory, i.e. RAM, starts as private.
>   b) Conversions between private and shared must be done via explicit hypercall.
>
> Without (b), userspace and thus KVM have to treat guest accesses to the incorrect
> type as implicit conversions.
>
> [*] Sadly, fully preventing kernel access to guest private is not possible with
>     TDX, especially if the direct map is left intact.  But maybe in the future
>     TDX will signal a fault instead of poisoning memory and leaving a #MC mine.

In this proposal, consider a guest driver instructing a device to DMA
write a 1 GB memory buffer. A well-behaved guest driver will ensure
that the entire 1 GB is marked shared. But what about a malicious or
buggy guest? Let's assume a bad guest driver instructs the device to
write guest private memory.

So now, the virtual device, which might be implemented as some host
side process, needs to (1) check and lock all 4k constituent RMP
entries (so they're not converted to private while the DMA write is
taking place), (2) write the 1 GB buffer, and (3) unlock all 4k
constituent RMP entries? If I'm understanding this correctly, then the
synchronization will be prohibitively expensive.
Borislav Petkov Nov. 12, 2021, 9:35 p.m. UTC | #13
On Fri, Nov 12, 2021 at 01:23:25PM -0800, Andy Lutomirski wrote:
> SEV-SNP, TDX, and any reasonable software solution all require that the host
> know which pages are private and which pages are shared.  Sure, the old
> SEV-ES Linux host implementation was very simple, but it's nasty and
> fundamentally can't support migration.

Right, so at least SNP guests need to track which pages have been
already PVALIDATEd by them so that they don't validate them again. So if
we track that somewhere in struct page or wherever, that same bit can be
used to state, page is private or shared.
Dave Hansen Nov. 12, 2021, 9:37 p.m. UTC | #14
On 11/12/21 1:30 PM, Marc Orr wrote:
> In this proposal, consider a guest driver instructing a device to DMA
> write a 1 GB memory buffer. A well-behaved guest driver will ensure
> that the entire 1 GB is marked shared. But what about a malicious or
> buggy guest? Let's assume a bad guest driver instructs the device to
> write guest private memory.
> 
> So now, the virtual device, which might be implemented as some host
> side process, needs to (1) check and lock all 4k constituent RMP
> entries (so they're not converted to private while the DMA write is
> > taking place), (2) write the 1 GB buffer, and (3) unlock all 4k
> constituent RMP entries? If I'm understanding this correctly, then the
> synchronization will be prohibitively expensive.

Are you talking about a 1GB *mapping* here?  As in, something is using a
1GB page table entry to map the 1GB memory buffer?  That was the only
case where I knew we needed coordination between neighbor RMP entries
and host memory accesses.

That 1GB problem _should_ be impossible.  I thought we settled on
disabling hugetlbfs and fracturing the whole of the direct map down to 4k.
Andy Lutomirski Nov. 12, 2021, 9:39 p.m. UTC | #15
On Fri, Nov 12, 2021, at 1:30 PM, Marc Orr wrote:
> On Fri, Nov 12, 2021 at 12:38 PM Sean Christopherson <seanjc@google.com> wrote:
>>
>> On Fri, Nov 12, 2021, Borislav Petkov wrote:
>> > On Fri, Nov 12, 2021 at 07:48:17PM +0000, Sean Christopherson wrote:
>> > > Yes, but IMO inducing a fault in the guest because of _host_ bug is wrong.
>> >
>> > What do you suggest instead?
>>
>> Let userspace decide what is mapped shared and what is mapped private.  The kernel
>> and KVM provide the APIs/infrastructure to do the actual conversions in a thread-safe
>> fashion and also to enforce the current state, but userspace is the control plane.
>>
>> It would require non-trivial changes in userspace if there are multiple processes
>> accessing guest memory, e.g. Peter's networking daemon example, but it _is_ fully
>> solvable.  The exit to userspace means all three components (guest, kernel,
>> and userspace) have full knowledge of what is shared and what is private.  There
>> is zero ambiguity:
>>
>>   - if userspace accesses guest private memory, it gets SIGSEGV or whatever.
>>   - if kernel accesses guest private memory, it does BUG/panic/oops[*]
>>   - if guest accesses memory with the incorrect C/SHARED-bit, it gets killed.
>>
>> This is the direction KVM TDX support is headed, though it's obviously still a WIP.
>>
>> And ideally, to avoid implicit conversions at any level, hardware vendors' ABIs
>> define that:
>>
>>   a) All convertible memory, i.e. RAM, starts as private.
>>   b) Conversions between private and shared must be done via explicit hypercall.
>>
>> Without (b), userspace and thus KVM have to treat guest accesses to the incorrect
>> type as implicit conversions.
>>
>> [*] Sadly, fully preventing kernel access to guest private is not possible with
>>     TDX, especially if the direct map is left intact.  But maybe in the future
>>     TDX will signal a fault instead of poisoning memory and leaving a #MC mine.
>
> In this proposal, consider a guest driver instructing a device to DMA
> write a 1 GB memory buffer. A well-behaved guest driver will ensure
> that the entire 1 GB is marked shared. But what about a malicious or
> buggy guest? Let's assume a bad guest driver instructs the device to
> write guest private memory.
>
> So now, the virtual device, which might be implemented as some host
> side process, needs to (1) check and lock all 4k constituent RMP
> entries (so they're not converted to private while the DMA write is
> taking place), (2) write the 1 GB buffer, and (3) unlock all 4k
> constituent RMP entries? If I'm understanding this correctly, then the
> synchronization will be prohibitively expensive.

Let's consider a very very similar scenario: consider a guest driver setting up a 1 GB DMA buffer.  The virtual device, implemented as host process, needs to (1) map (and thus lock *or* be prepared for faults) in 1GB / 4k pages of guest memory (so they're not *freed* while the DMA write is taking place), (2) write the buffer, and (3) unlock all the pages.  Or it can lock them at setup time and keep them locked for a long time if that's appropriate.

Sure, the locking is expensive, but it's nonnegotiable.  The RMP issue is just a special case of the more general issue that the host MUST NOT ACCESS GUEST MEMORY AFTER IT'S FREED.

--Andy
Marc Orr Nov. 12, 2021, 9:40 p.m. UTC | #16
On Fri, Nov 12, 2021 at 1:38 PM Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 11/12/21 1:30 PM, Marc Orr wrote:
> > In this proposal, consider a guest driver instructing a device to DMA
> > write a 1 GB memory buffer. A well-behaved guest driver will ensure
> > that the entire 1 GB is marked shared. But what about a malicious or
> > buggy guest? Let's assume a bad guest driver instructs the device to
> > write guest private memory.
> >
> > So now, the virtual device, which might be implemented as some host
> > side process, needs to (1) check and lock all 4k constituent RMP
> > entries (so they're not converted to private while the DMA write is
> > taking place), (2) write the 1 GB buffer, and (3) unlock all 4k
> > constituent RMP entries? If I'm understanding this correctly, then the
> > synchronization will be prohibitively expensive.
>
> Are you talking about a 1GB *mapping* here?  As in, something is using a
> 1GB page table entry to map the 1GB memory buffer?  That was the only
> case where I knew we needed coordination between neighbor RMP entries
> and host memory accesses.
>
> That 1GB problem _should_ be impossible.  I thought we settled on
> disabling hugetlbfs and fracturing the whole of the direct map down to 4k.

No. I was trying to give an example where a host-side process is
virtualizing a DMA write over a large buffer that consists of a lot of
4k or 2MB RMP entries. I picked 1 GB as an arbitrary example.
Marc Orr Nov. 12, 2021, 9:43 p.m. UTC | #17
On Fri, Nov 12, 2021 at 1:39 PM Andy Lutomirski <luto@kernel.org> wrote:
>
>
>
> On Fri, Nov 12, 2021, at 1:30 PM, Marc Orr wrote:
> > On Fri, Nov 12, 2021 at 12:38 PM Sean Christopherson <seanjc@google.com> wrote:
> >>
> >> On Fri, Nov 12, 2021, Borislav Petkov wrote:
> >> > On Fri, Nov 12, 2021 at 07:48:17PM +0000, Sean Christopherson wrote:
> >> > > Yes, but IMO inducing a fault in the guest because of _host_ bug is wrong.
> >> >
> >> > What do you suggest instead?
> >>
> >> Let userspace decide what is mapped shared and what is mapped private.  The kernel
> >> and KVM provide the APIs/infrastructure to do the actual conversions in a thread-safe
> >> fashion and also to enforce the current state, but userspace is the control plane.
> >>
> >> It would require non-trivial changes in userspace if there are multiple processes
> >> accessing guest memory, e.g. Peter's networking daemon example, but it _is_ fully
> >> solvable.  The exit to userspace means all three components (guest, kernel,
> >> and userspace) have full knowledge of what is shared and what is private.  There
> >> is zero ambiguity:
> >>
> >>   - if userspace accesses guest private memory, it gets SIGSEGV or whatever.
> >>   - if kernel accesses guest private memory, it does BUG/panic/oops[*]
> >>   - if guest accesses memory with the incorrect C/SHARED-bit, it gets killed.
> >>
> >> This is the direction KVM TDX support is headed, though it's obviously still a WIP.
> >>
> >> And ideally, to avoid implicit conversions at any level, hardware vendors' ABIs
> >> define that:
> >>
> >>   a) All convertible memory, i.e. RAM, starts as private.
> >>   b) Conversions between private and shared must be done via explicit hypercall.
> >>
> >> Without (b), userspace and thus KVM have to treat guest accesses to the incorrect
> >> type as implicit conversions.
> >>
> >> [*] Sadly, fully preventing kernel access to guest private is not possible with
> >>     TDX, especially if the direct map is left intact.  But maybe in the future
> >>     TDX will signal a fault instead of poisoning memory and leaving a #MC mine.
> >
> > In this proposal, consider a guest driver instructing a device to DMA
> > write a 1 GB memory buffer. A well-behaved guest driver will ensure
> > that the entire 1 GB is marked shared. But what about a malicious or
> > buggy guest? Let's assume a bad guest driver instructs the device to
> > write guest private memory.
> >
> > So now, the virtual device, which might be implemented as some host
> > side process, needs to (1) check and lock all 4k constituent RMP
> > entries (so they're not converted to private while the DMA write is
> > > taking place), (2) write the 1 GB buffer, and (3) unlock all 4k
> > constituent RMP entries? If I'm understanding this correctly, then the
> > synchronization will be prohibitively expensive.
>
> Let's consider a very very similar scenario: consider a guest driver setting up a 1 GB DMA buffer.  The virtual device, implemented as host process, needs to (1) map (and thus lock *or* be prepared for faults) in 1GB / 4k pages of guest memory (so they're not *freed* while the DMA write is taking place), (2) write the buffer, and (3) unlock all the pages.  Or it can lock them at setup time and keep them locked for a long time if that's appropriate.
>
> Sure, the locking is expensive, but it's nonnegotiable.  The RMP issue is just a special case of the more general issue that the host MUST NOT ACCESS GUEST MEMORY AFTER IT'S FREED.

Good point.
Borislav Petkov Nov. 12, 2021, 10:04 p.m. UTC | #18
On Fri, Nov 12, 2021 at 01:20:53PM -0800, Andy Lutomirski wrote:
> Can you clarify what type of host bug you're talking about here?  We're
> talking about the host kernel reading or writing guest private memory
> through the direct map or kmap, right?  It seems to me that the way this
> happens is:
> 
> struct page *guest_page = (get and refcount a guest page);
> 
> ...
> 
> guest switches the page to private;
> 
> ...
> 
> read or write the page via direct map or kmap.

Maybe I don't understand your example properly but on this access here,
the page is not gonna be validated anymore in the RMP table, i.e., on
the next access, the guest will get a #VC.

Or are you talking about a case where the host gets a reference to a
guest page and the guest - in the meantime - PVALIDATEs that page and
thus converts it into a private page?

So that case is simple: if the guest really deliberately gave that page
to the host to do stuff to it and it converted it to private in the
meantime, guest dies.

That's why it would make sense to have explicit consensus between guest
and host on which pages get shared. Everything else is not going to be
allowed and the entity at fault pays the consequences.
Peter Gonda Nov. 12, 2021, 10:52 p.m. UTC | #19
On Fri, Nov 12, 2021 at 2:20 PM Andy Lutomirski <luto@kernel.org> wrote:
>
> On 11/12/21 13:12, Peter Gonda wrote:
> > On Fri, Nov 12, 2021 at 1:55 PM Borislav Petkov <bp@alien8.de> wrote:
> >>
> >> On Fri, Nov 12, 2021 at 08:37:59PM +0000, Sean Christopherson wrote:
> >>> Let userspace decide what is mapped shared and what is mapped private.
> >>
> >> With "userspace", you mean the *host* userspace?
> >>
> >>> The kernel and KVM provide the APIs/infrastructure to do the actual
> >>> conversions in a thread-safe fashion and also to enforce the current
> >>> state, but userspace is the control plane.
> >>>
> >>> It would require non-trivial changes in userspace if there are multiple processes
> >>> accessing guest memory, e.g. Peter's networking daemon example, but it _is_ fully
> >>> solvable.  The exit to userspace means all three components (guest, kernel,
> >>> and userspace) have full knowledge of what is shared and what is private.  There
> >>> is zero ambiguity:
> >>>
> >>>    - if userspace accesses guest private memory, it gets SIGSEGV or whatever.
> >>
> >> That SIGSEGV is generated by the host kernel, I presume, after it checks
> >> whether the memory belongs to the guest?
> >>
> >>>    - if kernel accesses guest private memory, it does BUG/panic/oops[*]
> >>
> >> If *it* is the host kernel, then you probably shouldn't do that -
> >> otherwise you just killed the host kernel on which all those guests are
> >> running.
> >
> > I agree, it seems better to terminate the single guest with an issue.
> > Rather than killing the host (and therefore all guests). So I'd
> > suggest even in this case we do the 'convert to shared' approach or
> > just outright terminate the guest.
> >
> > Are there already examples in KVM of a KVM bug in servicing a VM's
> > request results in a BUG/panic/oops? That seems not ideal ever.
> >
> >>
> >>>    - if guest accesses memory with the incorrect C/SHARED-bit, it gets killed.
> >>
> >> Yah, that's the easy one.
> >>
> >>> This is the direction KVM TDX support is headed, though it's obviously still a WIP.
> >>>
> >>> And ideally, to avoid implicit conversions at any level, hardware vendors' ABIs
> >>> define that:
> >>>
> >>>    a) All convertible memory, i.e. RAM, starts as private.
> >>>    b) Conversions between private and shared must be done via explicit hypercall.
> >>
> >> I like the explicit nature of this but devil's in the detail and I'm no
> >> virt guy...
> >
> > This seems like a reasonable approach that can help with the issue of
> > terminating the entity behaving poorly. Could this feature be an
> > improvement that comes later? This improvement could be gated behind a
> > per VM KVM CAP, a kvm module param, or insert other solution here, to
> > not blind side userspace with this change?
> >
> >>
> >>> Without (b), userspace and thus KVM have to treat guest accesses to the incorrect
> >>> type as implicit conversions.
> >>>
> >>> [*] Sadly, fully preventing kernel access to guest private is not possible with
> >>>      TDX, especially if the direct map is left intact.  But maybe in the future
> >>>      TDX will signal a fault instead of poisoning memory and leaving a #MC mine.
> >>
> >> Yah, the #MC thing sounds like someone didn't think things through. ;-\
> >
> > Yes #MC in TDX seems much harder to deal with than the #PF for SNP.
> > I'd propose TDX keeps the host kernel safe but not have that solution
> > block SNP. As suggested above I like the idea but it seems like it can
> > come as a future improvement to SNP support.
> >
>
> Can you clarify what type of host bug you're talking about here?  We're
> talking about the host kernel reading or writing guest private memory
> through the direct map or kmap, right?  It seems to me that the way this
> happens is:

An example bug I am discussing looks like this:
* Guest wants to use a virt device on some shared memory.
* Guest forgets to ask the host to RMPUPDATE the pages to the shared
(hypervisor-owned) state.
* Guest points the virt device at the pages.
* The virt device writes to the pages; since they are still guest-private,
host userspace takes an RMP fault (#PF).

This seems like a trivial attack a malicious guest could mount.

>
> struct page *guest_page = (get and refcount a guest page);
>
> ...
>
> guest switches the page to private;
>
> ...
>
> read or write the page via direct map or kmap.
>
>
> This naively *looks* like something a malicious or buggy guest could
> induce the host kernel to do.  But I think that, if this actually
> occurs, it's an example of a much more severe host bug.  In particular,
> there is nothing special about the guest switching a page to private --
> the guest or QEMU could just as easily have freed (hotunplugged) a
> shared guest page or ballooned it or swapped it or otherwise deallocated
> it.  And, if the host touches a page/pfn that the guest has freed, this
> is UAF, is a huge security hole if the guest has any degree of control,
> and needs to kill the host kernel if detected.
>
> Now we can kind of sweep it under the rug by saying that changing a page
> from shared to private isn't really freeing the page -- it's just
> modifying the usage of the page.  But to me that sounds like saying
> "reusing a former user memory page as a pagetable isn't *really* freeing
> it if the kernel kept a pointer around the whole time and the page
> allocator wasn't involved".

This is different from the QEMU hotunplug case because there QEMU is
aware of the memory change. Currently QEMU (userspace) is completely
unaware of which pages are shared and which are private. So how does
userspace know the memory has been "freed"?

>
> So let's please just consider these mode transitions to be equivalent to
> actually freeing the page at least from a locking perspective, and then
> the way to prevent these bugs and the correct action to take if they
> occur is clear.
>
> (On TDX, changing a page shared/private state is very much like freeing
> and reallocating -- they're in different page tables and, in principle
> anyway, there is no actual relationship between a shared page and a
> supposedly matching private page except that some bits of the GPA happen
> to match.  The hardware and the TD module fully support both being mapped at
> once at the same "address", at least according to the specs I've read.)
>
> --Andy
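The map/lock, write, unlock discipline Andy describes, and the reason a mid-DMA conversion is the guest's problem rather than the host's, can be sketched as a toy Python model (all helper names invented; a real device would rely on mmap and the existing page refcounting):

```python
# Sketch of the locking discipline for a host-side virt device doing "DMA":
# pin the shared pages for the write window so the guest can't yank them.
# Illustrative only; names are invented for this sketch.

class GuestMemory:
    def __init__(self):
        self.shared = set()   # gfns the guest has made shared
        self.locked = set()   # gfns a host "device" currently holds

    def lock_shared(self, gfns):
        # Step (1): map/lock, failing up front if any page isn't shared.
        if not set(gfns) <= self.shared:
            raise PermissionError("region contains private pages")
        self.locked |= set(gfns)

    def guest_convert_private(self, gfn):
        # A conversion racing with a locked DMA buffer is the guest's bug:
        # the host keeps using the old shared page it already locked.
        if gfn in self.locked:
            return "deferred: host still holds the shared page"
        self.shared.discard(gfn)
        return "converted"

    def unlock(self, gfns):
        # Step (3): unlock once the DMA write completes.
        self.locked -= set(gfns)

mem = GuestMemory()
mem.shared = {0, 1, 2}
mem.lock_shared([0, 1, 2])            # (1) lock the buffer
print(mem.guest_convert_private(1))   # guest misbehaves mid-DMA
mem.unlock([0, 1, 2])                 # (3) unlock after the write
```

This is the sense in which the RMP issue is a special case of use-after-free: the lock protects the host exactly as it would against hotunplug or ballooning, and a guest that converts pages out from under its own DMA buffer only hurts itself.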
Peter Gonda Nov. 12, 2021, 10:54 p.m. UTC | #20
On Fri, Nov 12, 2021 at 2:43 PM Marc Orr <marcorr@google.com> wrote:
>
> On Fri, Nov 12, 2021 at 1:39 PM Andy Lutomirski <luto@kernel.org> wrote:
> >
> >
> >
> > On Fri, Nov 12, 2021, at 1:30 PM, Marc Orr wrote:
> > > On Fri, Nov 12, 2021 at 12:38 PM Sean Christopherson <seanjc@google.com> wrote:
> > >>
> > >> On Fri, Nov 12, 2021, Borislav Petkov wrote:
> > >> > On Fri, Nov 12, 2021 at 07:48:17PM +0000, Sean Christopherson wrote:
> > >> > > Yes, but IMO inducing a fault in the guest because of _host_ bug is wrong.
> > >> >
> > >> > What do you suggest instead?
> > >>
> > >> Let userspace decide what is mapped shared and what is mapped private.  The kernel
> > >> and KVM provide the APIs/infrastructure to do the actual conversions in a thread-safe
> > >> fashion and also to enforce the current state, but userspace is the control plane.
> > >>
> > >> It would require non-trivial changes in userspace if there are multiple processes
> > >> accessing guest memory, e.g. Peter's networking daemon example, but it _is_ fully
> > >> solvable.  The exit to userspace means all three components (guest, kernel,
> > >> and userspace) have full knowledge of what is shared and what is private.  There
> > >> is zero ambiguity:
> > >>
> > >>   - if userspace accesses guest private memory, it gets SIGSEGV or whatever.
> > >>   - if kernel accesses guest private memory, it does BUG/panic/oops[*]
> > >>   - if guest accesses memory with the incorrect C/SHARED-bit, it gets killed.
> > >>
> > >> This is the direction KVM TDX support is headed, though it's obviously still a WIP.
> > >>
> > >> And ideally, to avoid implicit conversions at any level, hardware vendors' ABIs
> > >> define that:
> > >>
> > >>   a) All convertible memory, i.e. RAM, starts as private.
> > >>   b) Conversions between private and shared must be done via explicit hypercall.
> > >>
> > >> Without (b), userspace and thus KVM have to treat guest accesses to the incorrect
> > >> type as implicit conversions.
> > >>
> > >> [*] Sadly, fully preventing kernel access to guest private is not possible with
> > >>     TDX, especially if the direct map is left intact.  But maybe in the future
> > >>     TDX will signal a fault instead of poisoning memory and leaving a #MC mine.
> > >
> > > In this proposal, consider a guest driver instructing a device to DMA
> > > write a 1 GB memory buffer. A well-behaved guest driver will ensure
> > > that the entire 1 GB is marked shared. But what about a malicious or
> > > buggy guest? Let's assume a bad guest driver instructs the device to
> > > write guest private memory.
> > >
> > > So now, the virtual device, which might be implemented as some host
> > > side process, needs to (1) check and lock all 4k constituent RMP
> > > entries (so they're not converted to private while the DMA write is
> > > taking place), (2) write the 1 GB buffer, and (3) unlock all 4k
> > > constituent RMP entries? If I'm understanding this correctly, then the
> > > synchronization will be prohibitively expensive.
> >
> > Let's consider a very very similar scenario: consider a guest driver setting up a 1 GB DMA buffer.  The virtual device, implemented as host process, needs to (1) map (and thus lock *or* be prepared for faults) in 1GB / 4k pages of guest memory (so they're not *freed* while the DMA write is taking place), (2) write the buffer, and (3) unlock all the pages.  Or it can lock them at setup time and keep them locked for a long time if that's appropriate.
> >
> > Sure, the locking is expensive, but it's nonnegotiable.  The RMP issue is just a special case of the more general issue that the host MUST NOT ACCESS GUEST MEMORY AFTER IT'S FREED.
>
> Good point.

Thanks for the responses Andy.

Having a way for userspace to lock pages as shared was an idea I just
proposed the simplest solution to start the conversation. So what we
could do here is have the map operation fail if the selected region has
private pages; if the region does map, we can then lock the pages
shared. Well-behaved processes mapping guest memory would then be
safe from RMP violations. That seems like a reasonable solution for
letting userspace know whether guest memory is accessible or not. Or do
you have other ideas to meet your other comment:

> SEV-SNP, TDX, and any reasonable software solution all require that the
> host know which pages are private and which pages are shared.  Sure, the
> old SEV-ES Linux host implementation was very simple, but it's nasty and
> fundamentally can't support migration.
Sean Christopherson Nov. 13, 2021, midnight UTC | #21
On Fri, Nov 12, 2021, Peter Gonda wrote:
> On Fri, Nov 12, 2021 at 1:55 PM Borislav Petkov <bp@alien8.de> wrote:
> >
> > On Fri, Nov 12, 2021 at 08:37:59PM +0000, Sean Christopherson wrote:
> > > Let userspace decide what is mapped shared and what is mapped private.
> >
> > With "userspace", you mean the *host* userspace?

Yep.

> > > The kernel and KVM provide the APIs/infrastructure to do the actual
> > > conversions in a thread-safe fashion and also to enforce the current
> > > state, but userspace is the control plane.
> > >
> > > It would require non-trivial changes in userspace if there are multiple processes
> > > accessing guest memory, e.g. Peter's networking daemon example, but it _is_ fully
> > > solvable.  The exit to userspace means all three components (guest, kernel,
> > > and userspace) have full knowledge of what is shared and what is private.  There
> > > is zero ambiguity:
> > >
> > >   - if userspace accesses guest private memory, it gets SIGSEGV or whatever.
> >
> > That SIGSEGV is generated by the host kernel, I presume, after it checks
> > whether the memory belongs to the guest?

Yep.

> > >   - if kernel accesses guest private memory, it does BUG/panic/oops[*]
> >
> > If *it* is the host kernel, then you probably shouldn't do that -
> > otherwise you just killed the host kernel on which all those guests are
> > running.
> 
> I agree, it seems better to terminate the single guest with an issue.
> Rather than killing the host (and therefore all guests). So I'd
> suggest even in this case we do the 'convert to shared' approach or
> just outright terminate the guest.
> 
> Are there already examples in KVM of a KVM bug in servicing a VM's
> request results in a BUG/panic/oops? That seems not ideal ever.

Plenty of examples.  kvm_spurious_fault() is the obvious one.  Any NULL pointer
deref will lead to a BUG, etc...  And it's not just KVM, e.g. it's possible, if
unlikely, for the core kernel to run into guest private memory (e.g. if the kernel
botches an RMP change), and if that happens there's no guarantee that the kernel
can recover.

I fully agree that ideally KVM would have a better sense of self-preservation,
but IMO that's an orthogonal discussion.
Marc Orr Nov. 13, 2021, 12:10 a.m. UTC | #22
> > > If *it* is the host kernel, then you probably shouldn't do that -
> > > otherwise you just killed the host kernel on which all those guests are
> > > running.
> >
> > I agree, it seems better to terminate the single guest with an issue.
> > Rather than killing the host (and therefore all guests). So I'd
> > suggest even in this case we do the 'convert to shared' approach or
> > just outright terminate the guest.
> >
> > Are there already examples in KVM of a KVM bug in servicing a VM's
> > request results in a BUG/panic/oops? That seems not ideal ever.
>
> Plenty of examples.  kvm_spurious_fault() is the obvious one.  Any NULL pointer
> deref will lead to a BUG, etc...  And it's not just KVM, e.g. it's possible, if
> unlikely, for the core kernel to run into guest private memory (e.g. if the kernel
> botches an RMP change), and if that happens there's no guarantee that the kernel
> can recover.
>
> I fully agree that ideally KVM would have a better sense of self-preservation,
> but IMO that's an orthogonal discussion.

I don't think we should treat the possibility of crashing the host
with live VMs nonchalantly. It's a big deal. Doing so has big
implications for the probability that any cloud vendor will be able to
deploy this code to production. And aren't cloud vendors one of the
main use cases for all of this confidential compute stuff? I'm
honestly surprised that so many people are OK with crashing the host.
Sean Christopherson Nov. 13, 2021, 12:53 a.m. UTC | #23
On Fri, Nov 12, 2021, Peter Gonda wrote:
> On Fri, Nov 12, 2021 at 2:43 PM Marc Orr <marcorr@google.com> wrote:
> >
> > On Fri, Nov 12, 2021 at 1:39 PM Andy Lutomirski <luto@kernel.org> wrote:
> > > Let's consider a very very similar scenario: consider a guest driver
> > > setting up a 1 GB DMA buffer.  The virtual device, implemented as host
> > > process, needs to (1) map (and thus lock *or* be prepared for faults) in
> > > 1GB / 4k pages of guest memory (so they're not *freed* while the DMA
> > > write is taking place), (2) write the buffer, and (3) unlock all the
> > > pages.  Or it can lock them at setup time and keep them locked for a long
> > > time if that's appropriate.
> > >
> > > Sure, the locking is expensive, but it's nonnegotiable.  The RMP issue is
> > > just a special case of the more general issue that the host MUST NOT
> > > ACCESS GUEST MEMORY AFTER IT'S FREED.
> >
> > Good point.
> 
> Thanks for the responses Andy.
> 
> Having a way for userspace to lock pages as shared was an idea I just
> proposed the simplest solution to start the conversation.

Assuming you meant that to read:

  Having a way for userspace to lock pages as shared is an alternative idea; I
  just proposed the simplest solution to start the conversation.

The unmapping[*] guest private memory proposal is essentially that, a way for userspace
to "lock" the state of a page by requiring all conversions to be initiated by userspace
and by providing APIs to associate a pfn 1:1 with a KVM instance, i.e. lock a pfn to
a guest.

Andy's DMA example brings up a very good point though.  If the shared and private
variants of a given GPA are _not_ required to point at a single PFN, which is the
case in the current unmapping proposal, userspace doesn't need to do any additional
juggling to track guest conversions across multiple processes.

Any process that's accessing guest (shared!) memory simply does its locking as normal,
which as Andy pointed out, is needed for correctness today.  If the guest requests a
conversion from shared=>private without first ensuring the gfn is unused (by a host
"device"), the host side will continue accessing the old, shared memory, which it
locked, while the guest will be doing who knows what.  And if the guest provides a GPA
that isn't mapped shared in the VMM's address space, it's conceptually no different
than if the guest provided a completely bogus GPA, which again needs to be handled today.

In other words, if done properly, differentiating private from shared shouldn't be a
heavy lift for host userspace.

[*] Actually unmapping memory may not be strictly necessary for SNP because a
    #PF(RMP) is likely just as good as a #PF(!PRESENT) when both are treated as
    fatal, but the rest of the proposal that allows KVM to understand the state
    of a page and exit to userspace accordingly applies.
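Sean's control-plane flow (every conversion exits to host userspace, and the shared and private variants of a GPA need not point at the same pfn) could look roughly like the following toy model. All names here are invented for illustration; the real interface would be KVM APIs, not these methods:

```python
# Rough sketch of the "userspace as control plane" proposal:
# conversions exit to the VMM, which may back the shared and private
# variants of a GPA with different pfns. Names invented for this sketch.

class Vmm:
    def __init__(self):
        self.private_pfn = {}   # gpa -> pfn locked 1:1 to this guest
        self.shared_pfn = {}    # gpa -> ordinary host-mapped pfn
        self.next_pfn = 100

    def handle_conversion_exit(self, gpa, to_private):
        # KVM exits here on a guest conversion request; userspace
        # decides, then instructs the kernel to enforce the new state.
        table = self.private_pfn if to_private else self.shared_pfn
        if gpa not in table:
            table[gpa] = self.next_pfn
            self.next_pfn += 1
        return table[gpa]

vmm = Vmm()
pfn_priv = vmm.handle_conversion_exit(0x1000, True)
pfn_shared = vmm.handle_conversion_exit(0x1000, False)
assert pfn_priv != pfn_shared   # both variants can coexist: no juggling needed
```

Because the two variants are independent, a process touching guest shared memory just locks the pages it mapped, as today; a shared=>private conversion never invalidates a pfn some other host process is legitimately using.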
Marc Orr Nov. 13, 2021, 1:04 a.m. UTC | #24
On Fri, Nov 12, 2021 at 4:53 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Fri, Nov 12, 2021, Peter Gonda wrote:
> > On Fri, Nov 12, 2021 at 2:43 PM Marc Orr <marcorr@google.com> wrote:
> > >
> > > On Fri, Nov 12, 2021 at 1:39 PM Andy Lutomirski <luto@kernel.org> wrote:
> > > > Let's consider a very very similar scenario: consider a guest driver
> > > > setting up a 1 GB DMA buffer.  The virtual device, implemented as host
> > > > process, needs to (1) map (and thus lock *or* be prepared for faults) in
> > > > 1GB / 4k pages of guest memory (so they're not *freed* while the DMA
> > > > write is taking place), (2) write the buffer, and (3) unlock all the
> > > > pages.  Or it can lock them at setup time and keep them locked for a long
> > > > time if that's appropriate.
> > > >
> > > > Sure, the locking is expensive, but it's nonnegotiable.  The RMP issue is
> > > > just a special case of the more general issue that the host MUST NOT
> > > > ACCESS GUEST MEMORY AFTER IT'S FREED.
> > >
> > > Good point.
> >
> > Thanks for the responses Andy.
> >
> > Having a way for userspace to lock pages as shared was an idea I just
> > proposed the simplest solution to start the conversation.
>
> Assuming you meant that to read:
>
>   Having a way for userspace to lock pages as shared is an alternative idea; I
>   just proposed the simplest solution to start the conversation.
>
> The unmapping[*] guest private memory proposal is essentially that, a way for userspace
> to "lock" the state of a page by requiring all conversions to be initiated by userspace
> and by providing APIs to associate a pfn 1:1 with a KVM instance, i.e. lock a pfn to
> a guest.
>
> Andy's DMA example brings up a very good point though.  If the shared and private
> variants of a given GPA are _not_ required to point at a single PFN, which is the
> case in the current unmapping proposal, userspace doesn't need to do any additional
> juggling to track guest conversions across multiple processes.
>
> Any process that's accessing guest (shared!) memory simply does its locking as normal,
> which as Andy pointed out, is needed for correctness today.  If the guest requests a
> conversion from shared=>private without first ensuring the gfn is unused (by a host
> "device"), the host side will continue accessing the old, shared memory, which it
> locked, while the guest will be doing who knows what.  And if the guest provides a GPA
> that isn't mapped shared in the VMM's address space, it's conceptually no different
> than if the guest provided a completely bogus GPA, which again needs to be handled today.
>
> In other words, if done properly, differentiating private from shared shouldn't be a
> heavy lift for host userspace.
>
> [*] Actually unmapping memory may not be strictly necessary for SNP because a
>     #PF(RMP) is likely just as good as a #PF(!PRESENT) when both are treated as
>     fatal, but the rest of the proposal that allows KVM to understand the state
>     of a page and exit to userspace accordingly applies.

Thanks for this explanation. When you write "while the guest will be
doing who knows what":

Isn't that a large weakness of this proposal? To me, it seems better
for debuggability to corrupt the private memory (i.e., convert the
page to shared) so the guest can detect the issue via a PVALIDATE
failure.

The main issue I see with corrupting the guest memory is that we may
not know whether the host is at fault or the guest. Though, we can
probably in many cases be sure it's the host, if the pid associated
with the page fault is NOT a process associated with virtualization.
But if it is a process associated with virtualization, we legitimately
might not know. (I think if the pid is the kernel itself, it's
probably a host-side bug, but I'm still not confident on this; for
example, the guest might be able to coerce KVM's built-in emulator to
write guest private memory.)
Sean Christopherson Nov. 13, 2021, 6:28 p.m. UTC | #25
On Fri, Nov 12, 2021, Marc Orr wrote:
> On Fri, Nov 12, 2021 at 4:53 PM Sean Christopherson <seanjc@google.com> wrote:
> > On Fri, Nov 12, 2021, Peter Gonda wrote:
> > > Having a way for userspace to lock pages as shared was an idea I just
> > > proposed the simplest solution to start the conversation.
> >
> > Assuming you meant that to read:
> >
> >   Having a way for userspace to lock pages as shared is an alternative idea; I
> >   just proposed the simplest solution to start the conversation.
> >
> > The unmapping[*] guest private memory proposal is essentially that, a way for userspace
> > to "lock" the state of a page by requiring all conversions to be initiated by userspace
> > and by providing APIs to associate a pfn 1:1 with a KVM instance, i.e. lock a pfn to
> > a guest.
> >
> > Andy's DMA example brings up a very good point though.  If the shared and private
> > variants of a given GPA are _not_ required to point at a single PFN, which is the
> > case in the current unmapping proposal, userspace doesn't need to do any additional
> > juggling to track guest conversions across multiple processes.
> >
> > Any process that's accessing guest (shared!) memory simply does its locking as normal,
> > which as Andy pointed out, is needed for correctness today.  If the guest requests a
> > conversion from shared=>private without first ensuring the gfn is unused (by a host
> > "device"), the host side will continue accessing the old, shared memory, which it
> > locked, while the guest will be doing who knows what.  And if the guest provides a GPA
> > that isn't mapped shared in the VMM's address space, it's conceptually no different
> > than if the guest provided a completely bogus GPA, which again needs to be handled today.
> >
> > In other words, if done properly, differentiating private from shared shouldn't be a
> > heavy lift for host userspace.
> >
> > [*] Actually unmapping memory may not be strictly necessary for SNP because a
> >     #PF(RMP) is likely just as good as a #PF(!PRESENT) when both are treated as
> >     fatal, but the rest of the proposal that allows KVM to understand the state
> >     of a page and exit to userspace accordingly applies.
> 
> Thanks for this explanation. When you write "while the guest will be
> doing who knows what":
> 
> Isn't that a large weakness of this proposal? To me, it seems better
> for debuggability to corrupt the private memory (i.e., convert the
> page to shared) so the guest can detect the issue via a PVALIDATE
> failure.

The behavior is no different than it is today for regular VMs.

> The main issue I see with corrupting the guest memory is that we may
> not know whether the host is at fault or the guest.

Yes, one issue is that bugs in the host will result in downstream errors in the
guest, as opposed to immediate, synchronous detection in the guest.  IMO that is
a significant flaw.

Another issue is that the host kernel, which despite being "untrusted", absolutely
should be acting in the best interests of the guest.  Allowing userspace to inject
#VC, e.g. to attempt to attack the guest by triggering a spurious PVALIDATE, means
the kernel is failing miserably on that front.
Sean Christopherson Nov. 13, 2021, 6:34 p.m. UTC | #26
On Fri, Nov 12, 2021, Marc Orr wrote:
> > > > If *it* is the host kernel, then you probably shouldn't do that -
> > > > otherwise you just killed the host kernel on which all those guests are
> > > > running.
> > >
> > > I agree, it seems better to terminate the single guest with an issue.
> > > Rather than killing the host (and therefore all guests). So I'd
> > > suggest even in this case we do the 'convert to shared' approach or
> > > just outright terminate the guest.
> > >
> > > Are there already examples in KVM of a KVM bug in servicing a VM's
> > > request results in a BUG/panic/oops? That seems not ideal ever.
> >
> > Plenty of examples.  kvm_spurious_fault() is the obvious one.  Any NULL pointer
> > deref will lead to a BUG, etc...  And it's not just KVM, e.g. it's possible, if
> > unlikely, for the core kernel to run into guest private memory (e.g. if the kernel
> > botches an RMP change), and if that happens there's no guarantee that the kernel
> > can recover.
> >
> > I fully agree that ideally KVM would have a better sense of self-preservation,
> > but IMO that's an orthogonal discussion.
> 
> I don't think we should treat the possibility of crashing the host
> with live VMs nonchalantly. It's a big deal. Doing so has big
> implications for the probability that any cloud vendor will be able to
> deploy this code to production. And aren't cloud vendors one of the
> main use cases for all of this confidential compute stuff? I'm
> honestly surprised that so many people are OK with crashing the host.

I'm not treating it nonchalantly, merely acknowledging that (a) some flavors of kernel
bugs (or hardware issues!) are inherently fatal to the system, and (b) crashing the
host may be preferable to continuing on in certain cases, e.g. if continuing on has a
high probablity of corrupting guest data.
Marc Orr Nov. 14, 2021, 7:41 a.m. UTC | #27
On Sat, Nov 13, 2021 at 10:28 AM Sean Christopherson <seanjc@google.com> wrote:
>
> On Fri, Nov 12, 2021, Marc Orr wrote:
> > On Fri, Nov 12, 2021 at 4:53 PM Sean Christopherson <seanjc@google.com> wrote:
> > > On Fri, Nov 12, 2021, Peter Gonda wrote:
> > > > Having a way for userspace to lock pages as shared was an idea I just
> > > > proposed the simplest solution to start the conversation.
> > >
> > > Assuming you meant that to read:
> > >
> > >   Having a way for userspace to lock pages as shared is an alternative idea; I
> > >   just proposed the simplest solution to start the conversation.
> > >
> > > The unmapping[*] guest private memory proposal is essentially that, a way for userspace
> > > to "lock" the state of a page by requiring all conversions to be initiated by userspace
> > > and by providing APIs to associate a pfn 1:1 with a KVM instance, i.e. lock a pfn to
> > > a guest.
> > >
> > > Andy's DMA example brings up a very good point though.  If the shared and private
> > > variants of a given GPA are _not_ required to point at a single PFN, which is the
> > > case in the current unmapping proposal, userspace doesn't need to do any additional
> > > juggling to track guest conversions across multiple processes.
> > >
> > > Any process that's accessing guest (shared!) memory simply does its locking as normal,
> > > which as Andy pointed out, is needed for correctness today.  If the guest requests a
> > > conversion from shared=>private without first ensuring the gfn is unused (by a host
> > > "device"), the host side will continue accessing the old, shared memory, which it
> > > locked, while the guest will be doing who knows what.  And if the guest provides a GPA
> > > that isn't mapped shared in the VMM's address space, it's conceptually no different
> > > than if the guest provided a completely bogus GPA, which again needs to be handled today.
> > >
> > > In other words, if done properly, differentiating private from shared shouldn't be a
> > > heavy lift for host userspace.
> > >
> > > [*] Actually unmapping memory may not be strictly necessary for SNP because a
> > >     #PF(RMP) is likely just as good as a #PF(!PRESENT) when both are treated as
> > >     fatal, but the rest of the proposal that allows KVM to understand the state
> > >     of a page and exit to userspace accordingly applies.
> >
> > Thanks for this explanation. When you write "while the guest will be
> > doing who knows what":
> >
> > Isn't that a large weakness of this proposal? To me, it seems better
> > for debuggability to corrupt the private memory (i.e., convert the
> > page to shared) so the guest can detect the issue via a PVALIDATE
> > failure.
>
> The behavior is no different than it is today for regular VMs.

Isn't this counter to the sketch you laid out earlier where you wrote:

--- QUOTE START ---
  - if userspace accesses guest private memory, it gets SIGSEGV or whatever.
  - if kernel accesses guest private memory, it does BUG/panic/oops[*]
  - if guest accesses memory with the incorrect C/SHARED-bit, it gets killed.
--- QUOTE END ---

Here, the guest does not get killed. Which seems hard to debug.

> > The main issue I see with corrupting the guest memory is that we may
> > not know whether the host is at fault or the guest.
>
> Yes, one issue is that bugs in the host will result in downstream errors in the
> guest, as opposed to immediate, synchronous detection in the guest.  IMO that is
> a significant flaw.

Nobody wants bugs in the host. Once we've hit one -- and we will --
we're in a bad situation. The question is how will we handle these
bugs. I'm arguing that we should design the system to be as robust as
possible.

I agree that immediate, synchronous detection in the guest is ideal.
But at what cost? Is it worth killing _ALL_ VMs on the system? That's
what we're doing when we crash host-wide processes, or even worse, the
kernel itself.

The reality is that guests must be able to detect writes to their
private memory long after they have occurred. Both SNP and TDX are
designed this way! For SNP the guest gets a PVALIDATE failure when it
reads the corrupted memory. For TDX, my understanding is that the
hardware fails to verify an HMAC over a page when it's read by the
guest.

The argument I'm making is that the success of confidential VMs -- in
both SNP and TDX -- depends on the guest being able to detect that its
private memory has been corrupted in a delayed and asynchronous
fashion. We should be leveraging this fact to make the entire system
more robust and reliable.

> Another issue is that the host kernel, which despite being "untrusted", absolutely
> should be acting in the best interests of the guest.  Allowing userspace to inject
> #VC, e.g. to attempt to attack the guest by triggering a spurious PVALIDATE, means
> the kernel is failing miserably on that front.

The host is acting in the best interests of the guest. That's why
we're having this debate :-). No-one here is trying to write host code
to sabotage the guest. Quite the opposite.

But what happens when we mess up? That's what this conversation is really about.

If allowing userspace to inject #VC into the guest means that the host
can continue to serve other guests, that seems like a win. The
alternative, to blow up the host, essentially expands the blast radius
from a single guest to all guests.

Also, these attacks go both ways. As we already discussed, the guest
may try to trick the host into writing its own private memory. Yes,
the entire idea to literally make that impossible is a good one -- in
theory. But it's also very complicated. And what happens if we get
that wrong? Now our entire host is at risk from a single guest.

Leveraging the system-wide design of SNP and TDX -- where detecting
writes to private memory asynchronously is table stakes -- increases
the reliability of the entire system. And, as Peter mentioned earlier
on, we can always incorporate any future work to make writing private
memory impossible into SNP, on top of the code to convert the shared
page to private. This way, we get reliability in depth, and minimize
the odds of crashing the host -- and its VMs.
Marc Orr Nov. 14, 2021, 7:54 a.m. UTC | #28
On Sat, Nov 13, 2021 at 10:35 AM Sean Christopherson <seanjc@google.com> wrote:
>
> On Fri, Nov 12, 2021, Marc Orr wrote:
> > > > > If *it* is the host kernel, then you probably shouldn't do that -
> > > > > otherwise you just killed the host kernel on which all those guests are
> > > > > running.
> > > >
> > > > I agree, it seems better to terminate the single guest with an issue.
> > > > Rather than killing the host (and therefore all guests). So I'd
> > > > suggest even in this case we do the 'convert to shared' approach or
> > > > just outright terminate the guest.
> > > >
> > > > Are there already examples in KVM of a KVM bug in servicing a VM's
> > > > request results in a BUG/panic/oops? That seems not ideal ever.
> > >
> > > Plenty of examples.  kvm_spurious_fault() is the obvious one.  Any NULL pointer
> > > deref will lead to a BUG, etc...  And it's not just KVM, e.g. it's possible, if
> > > unlikely, for the core kernel to run into guest private memory (e.g. if the kernel
> > > botches an RMP change), and if that happens there's no guarantee that the kernel
> > > can recover.
> > >
> > > I fully agree that ideally KVM would have a better sense of self-preservation,
> > > but IMO that's an orthogonal discussion.
> >
> > I don't think we should treat the possibility of crashing the host
> > with live VMs nonchalantly. It's a big deal. Doing so has big
> > implications on the probability that any cloud vendor will be able to
> > deploy this code to production. And aren't cloud vendors one of the
> > main use cases for all of this confidential compute stuff? I'm
> > honestly surprised that so many people are OK with crashing the host.
>
> I'm not treating it nonchalantly, merely acknowledging that (a) some flavors of kernel
> bugs (or hardware issues!) are inherently fatal to the system, and (b) crashing the
> host may be preferable to continuing on in certain cases, e.g. if continuing on has a
> high probability of corrupting guest data.

I disagree. Crashing the host -- and _ALL_ of its VMs (including
non-confidential VMs) -- is not preferable to crashing a single SNP
VM. Especially when that SNP VM is guaranteed to detect the memory
corruption and react accordingly.
Dr. David Alan Gilbert Nov. 15, 2021, 12:30 p.m. UTC | #29
* Sean Christopherson (seanjc@google.com) wrote:
> On Fri, Nov 12, 2021, Borislav Petkov wrote:
> > On Fri, Nov 12, 2021 at 09:59:46AM -0800, Dave Hansen wrote:
> > > Or, is there some mechanism that prevent guest-private memory from being
> > > accessed in random host kernel code?
> 
> Or random host userspace code...
> 
> > So I'm currently under the impression that random host->guest accesses
> > should not happen if not previously agreed upon by both.
> 
> Key word "should".
> 
> > Because, as explained on IRC, if host touches a private guest page,
> > whatever the host does to that page, the next time the guest runs, it'll
> > get a #VC where it will see that that page doesn't belong to it anymore
> > and then, out of paranoia, it will simply terminate to protect itself.
> > 
> > So cloud providers should have an interest to prevent such random stray
> > accesses if they wanna have guests. :)
> 
> Yes, but IMO inducing a fault in the guest because of _host_ bug is wrong.

Would it necessarily have been a host bug?  A guest telling the host a
bad GPA to DMA into would trigger this wouldn't it?

Still; I wonder if it's best to kill the guest - maybe it's best for
the host to kill the guest and leave behind diagnostics of what
happened; for someone debugging the crash, it's going to be less useful
to know that page X was wrongly accessed (which is what the guest would
see), and more useful to know that it was the kernel's vhost-... driver
that accessed it.

Dave

> On Fri, Nov 12, 2021, Peter Gonda wrote:
> > Here is an alternative to the current approach: On RMP violation (host
> > or userspace) the page fault handler converts the page from private to
> > shared to allow the write to continue. This pulls from s390’s error
> > handling which does exactly this. See ‘arch_make_page_accessible()’.
> 
> Ah, after further reading, s390 does _not_ do implicit private=>shared conversions.
> 
> s390's arch_make_page_accessible() is somewhat similar, but it is not a direct
> comparison.  IIUC, it exports and integrity protects the data and thus preserves
> the guest's data in an encrypted form, e.g. so that it can be swapped to disk.
> And if the host corrupts the data, attempting to convert it back to secure on a
> subsequent guest access will fail.
> 
> The host kernel's handling of the "convert to secure" failures doesn't appear to
> be all that robust, e.g. it looks like there are multiple paths where the error
> is dropped on the floor and the guest is resumed, but IMO soft hanging the guest
> is still better than inducing a fault in the guest, and far better than potentially
> coercing the guest into reading corrupted memory ("spurious" PVALIDATE).  And s390's
> behavior is fixable since it's purely a host error handling problem.
> 
> To truly make a page shared, s390 requires the guest to call into the ultravisor
> to make a page shared.  And on the host side, the host can pin a page as shared
> to prevent the guest from unsharing it while the host is accessing it as a shared
> page.
> 
> So, inducing #VC is similar in the sense that a malicious s390 can also DoS itself,
> but is quite different in that (AFAICT) s390 does not create an attack surface where
> a malicious or buggy host userspace can induce faults in the guest, or worst case in
> SNP, exploit a buggy guest into accepting and accessing corrupted data.
> 
> It's also different in that s390 doesn't implicitly convert between shared and
> private.  Functionally, it doesn't really change the end result because a buggy
> host that writes guest private memory will DoS the guest (by inducing a #VC or
> corrupting exported data), but at least for s390 there's a sane, legitimate use
> case for accessing guest private memory (swap and maybe migration?), whereas for
> SNP, IMO implicitly converting to shared on a host access is straight up wrong.
> 
> > Additionally it adds less complexity to the SNP kernel patches, and
> > requires no new ABI.
> 
> I disagree, this would require "new" ABI in the sense that it commits KVM to
> supporting SNP without requiring userspace to initiate any and all conversions
> between shared and private.  Which in my mind is the big elephant in the room:
> do we want to require new KVM (and kernel?) ABI to allow/force userspace to
> explicitly declare guest private memory for TDX _and_ SNP, or just TDX?
>
Joerg Roedel Nov. 15, 2021, 2:42 p.m. UTC | #30
On Mon, Nov 15, 2021 at 12:30:59PM +0000, Dr. David Alan Gilbert wrote:
> Still; I wonder if it's best to kill the guest - maybe it's best for
> the host to kill the guest and leave behind diagnostics of what
> happened; for someone debugging the crash, it's going to be less useful
> to know that page X was wrongly accessed (which is what the guest would
> see), and more useful to know that it was the kernel's vhost-... driver
> that accessed it.

It is best to let the guest #VC on the page when this happens. If it
happened because of a guest bug, all necessary debugging data is in the
guest, and only the guest owner can obtain it.

Then the guest owner can do a kdump on this unexpected #VC and collect
the data to debug the issue. With just killing the guest from the host
side this data would be lost.

Regards,

	Joerg
Dr. David Alan Gilbert Nov. 15, 2021, 3:33 p.m. UTC | #31
* Joerg Roedel (jroedel@suse.de) wrote:
> On Mon, Nov 15, 2021 at 12:30:59PM +0000, Dr. David Alan Gilbert wrote:
> > Still; I wonder if it's best to kill the guest - maybe it's best for
> > the host to kill the guest and leave behind diagnostics of what
> > happened; for someone debugging the crash, it's going to be less useful
> > to know that page X was wrongly accessed (which is what the guest would
> > see), and more useful to know that it was the kernel's vhost-... driver
> > that accessed it.
> 
> It is best to let the guest #VC on the page when this happens. If it
> happened because of a guest bug, all necessary debugging data is in the
> guest, and only the guest owner can obtain it.
> 
> Then the guest owner can do a kdump on this unexpected #VC and collect
> the data to debug the issue. With just killing the guest from the host
> side this data would be lost.

How would you debug an unexpected access by the host kernel using a
guest's kdump?

Dave

> Regards,
> 
> 	Joerg
>
Joerg Roedel Nov. 15, 2021, 4:16 p.m. UTC | #32
On Fri, Nov 12, 2021 at 07:48:17PM +0000, Sean Christopherson wrote:
> Yes, but IMO inducing a fault in the guest because of _host_ bug is wrong.

And what is the plan for handling this host bug? Can it be handled in a
way that keeps the guest running?

IMO the best way to handle this is to do it the way Peter proposed:

	* Convert the page from private to shared on host write access
	  and log this event on the host side (e.g. via a trace event)
	* The guest will notice what happened and can decide on its own
	  what to do, either poison the page or panic with doing a
	  kdump that can be used for bug analysis by guest and host
	  owner

At the time the fault happens we cannot reliably determine the reason. It
can be a host bug, a guest bug (or attack), or whatnot. So the best that
can be done is collecting debug data without impacting other guests.

This also saves lots of code for avoiding these faults when the outcome
would be the same: A dead VM.

> I disagree, this would require "new" ABI in the sense that it commits KVM to
> supporting SNP without requiring userspace to initiate any and all conversions
> between shared and private.  Which in my mind is the big elephant in the room:
> do we want to require new KVM (and kernel?) ABI to allow/force userspace to
> explicitly declare guest private memory for TDX _and_ SNP, or just TDX?

No, not for SNP. User-space has no say in what guest memory is private
and shared; that should fundamentally be the guest's decision. The host
has no idea about the guests workload and how much shared memory it
needs. It might be that the guest temporarily needs to share more
memory. I see no reason to cut this flexibility out for SNP guests.

Regards,
Brijesh Singh Nov. 15, 2021, 4:18 p.m. UTC | #33
On 11/12/21 2:37 PM, Sean Christopherson wrote:
> On Fri, Nov 12, 2021, Borislav Petkov wrote:
>> On Fri, Nov 12, 2021 at 07:48:17PM +0000, Sean Christopherson wrote:
>>> Yes, but IMO inducing a fault in the guest because of _host_ bug is wrong.
>>

In the automatic conversion proposal, both a host bug and a guest bug
will cause the guest to get a #VC, and then the guest can decide whether
it wants to proceed or terminate. If it chooses to proceed, it can
poison the page and log it for future examination.

>> What do you suggest instead?
> 
> Let userspace decide what is mapped shared and what is mapped private.  The kernel
> and KVM provide the APIs/infrastructure to do the actual conversions in a thread-safe
> fashion and also to enforce the current state, but userspace is the control plane.
> 
> It would require non-trivial changes in userspace if there are multiple processes
> accessing guest memory, e.g. Peter's networking daemon example, but it _is_ fully
> solvable.  The exit to userspace means all three components (guest, kernel,
> and userspace) have full knowledge of what is shared and what is private.  There
> is zero ambiguity:
> 
>    - if userspace accesses guest private memory, it gets SIGSEGV or whatever.
>    - if kernel accesses guest private memory, it does BUG/panic/oops[*]
>    - if guest accesses memory with the incorrect C/SHARED-bit, it gets killed.
> 
> This is the direction KVM TDX support is headed, though it's obviously still a WIP.
> 

Just curious, in this approach, how do you propose handling the host 
kexec/kdump? If a kexec/kdump occurs while the VM is still active, the 
new kernel will encounter the #PF (RMP violation) because some pages are 
still marked 'private' in the RMP table.



> And ideally, to avoid implicit conversions at any level, hardware vendors' ABIs
> define that:
> 
>    a) All convertible memory, i.e. RAM, starts as private.
>    b) Conversions between private and shared must be done via explicit hypercall.
> 
> Without (b), userspace and thus KVM have to treat guest accesses to the incorrect
> type as implicit conversions.
> 
> [*] Sadly, fully preventing kernel access to guest private is not possible with
>      TDX, especially if the direct map is left intact.  But maybe in the future
>      TDX will signal a fault instead of poisoning memory and leaving a #MC mine.
> 

thanks
Joerg Roedel Nov. 15, 2021, 4:20 p.m. UTC | #34
On Mon, Nov 15, 2021 at 03:33:03PM +0000, Dr. David Alan Gilbert wrote:
> How would you debug an unexpected access by the host kernel using a
> guest's kdump?

The host needs to log the access in some way (trace-event, pr_err) with
relevant information.

And with the guest-side kdump you can rule out that it was a guest bug
which triggered the access. And you learn what the guest was trying to
do when it triggered the access. This also helps find the issue on
the host side (if it is a host-side issue).

(Guest and host owner need to work together for debugging, of course).

Regards,
Dr. David Alan Gilbert Nov. 15, 2021, 4:32 p.m. UTC | #35
* Joerg Roedel (jroedel@suse.de) wrote:
> On Mon, Nov 15, 2021 at 03:33:03PM +0000, Dr. David Alan Gilbert wrote:
> > How would you debug an unexpected access by the host kernel using a
> > guest's kdump?
> 
> The host needs to log the access in some way (trace-event, pr_err) with
> relevant information.

Yeh OK that makes sense if the host-owner could then enable some type of
debugging based on that event for the unfortunate guest owner.

> And with the guest-side kdump you can rule out that it was a guest bug
> which triggered the access. And you learn what the guest was trying to
> do when it triggered the access. This also helps find the issue on
> the host side (if it is a host-side issue).
> 
> (Guest and host owner need to work together for debugging, of course).

Yeh.

Dave

> Regards,
> 
> -- 
> Jörg Rödel
> jroedel@suse.de
> 
> SUSE Software Solutions Germany GmbH
> Maxfeldstr. 5
> 90409 Nürnberg
> Germany
>  
> (HRB 36809, AG Nürnberg)
> Geschäftsführer: Ivo Totev
>
Joerg Roedel Nov. 15, 2021, 4:36 p.m. UTC | #36
On Sat, Nov 13, 2021 at 06:34:52PM +0000, Sean Christopherson wrote:
> I'm not treating it nonchalantly, merely acknowledging that (a) some flavors of kernel
> bugs (or hardware issues!) are inherently fatal to the system, and (b) crashing the
> host may be preferable to continuing on in certain cases, e.g. if continuing on has a
> high probablity of corrupting guest data.

The problem here is that for SNP host-side RMP faults it will often not
be clear at fault time whether it was caused by wrong guest or host behavior.

I agree with Marc that crashing the host is not the right thing to do in
this situation. Instead debug data should be collected to do further
post-mortem analysis.

Regards,
Joerg Roedel Nov. 15, 2021, 4:52 p.m. UTC | #37
On Sat, Nov 13, 2021 at 06:28:16PM +0000, Sean Christopherson wrote:
> Another issue is that the host kernel, which despite being "untrusted", absolutely
> should be acting in the best interests of the guest.  Allowing userspace to inject
> #VC, e.g. to attempt to attack the guest by triggering a spurious PVALIDATE, means
> the kernel is failing miserably on that front.

Well, no. The kernel is only a part of the hypervisor, KVM userspace is
another. It is possible today for the userspace part(s) to interact in bad
ways with the guest and trick or kill it. Allowing user-space to cause a
#VC in the guest is no different from that.

Regards,
Sean Christopherson Nov. 15, 2021, 5:16 p.m. UTC | #38
On Sat, Nov 13, 2021, Marc Orr wrote:
> On Sat, Nov 13, 2021 at 10:35 AM Sean Christopherson <seanjc@google.com> wrote:
> >
> > On Fri, Nov 12, 2021, Marc Orr wrote:
> > > > > > If *it* is the host kernel, then you probably shouldn't do that -
> > > > > > otherwise you just killed the host kernel on which all those guests are
> > > > > > running.
> > > > >
> > > > > I agree, it seems better to terminate the single guest with an issue.
> > > > > Rather than killing the host (and therefore all guests). So I'd
> > > > > suggest even in this case we do the 'convert to shared' approach or
> > > > > just outright terminate the guest.
> > > > >
> > > > > Are there already examples in KVM of a KVM bug in servicing a VM's
> > > > > request results in a BUG/panic/oops? That seems not ideal ever.
> > > >
> > > > Plenty of examples.  kvm_spurious_fault() is the obvious one.  Any NULL pointer
> > > > deref will lead to a BUG, etc...  And it's not just KVM, e.g. it's possible, if
> > > > unlikely, for the core kernel to run into guest private memory (e.g. if the kernel
> > > > botches an RMP change), and if that happens there's no guarantee that the kernel
> > > > can recover.
> > > >
> > > > I fully agree that ideally KVM would have a better sense of self-preservation,
> > > > but IMO that's an orthogonal discussion.
> > >
> > > I don't think we should treat the possibility of crashing the host
> > > with live VMs nonchalantly. It's a big deal. Doing so has big
> > > implications on the probability that any cloud vendor will be able to
> > > deploy this code to production. And aren't cloud vendors one of the
> > > main use cases for all of this confidential compute stuff? I'm
> > > honestly surprised that so many people are OK with crashing the host.
> >
> > I'm not treating it nonchalantly, merely acknowledging that (a) some flavors of kernel
> > bugs (or hardware issues!) are inherently fatal to the system, and (b) crashing the
> > host may be preferable to continuing on in certain cases, e.g. if continuing on has a
> > high probability of corrupting guest data.
> 
> I disagree. Crashing the host -- and _ALL_ of its VMs (including
> non-confidential VMs) -- is not preferable to crashing a single SNP
> VM.

We're in violent agreement.  I fully agree that, when allowed by the architecture,
injecting an error into the guest is preferable to killing the VM, which is in turn
preferable to crashing the host.

What I'm saying is that there are classes of bugs where injecting an error is not
allowed/feasible, and where killing an individual VM is not correct/feasible.

The canonical example of this escalating behavior is an uncorrectable ECC #MC.  If
the bad page is guest memory and the guest vCPU model supports MCA, then userspace
can inject an #MC into the guest so that the guest can take action and hopefully
not simply die.  If the bad page is in the guest but the guest doesn't support #MC
injection, the guest effectively gets killed.  And if the #MC is in host kernel
memory that can't be offlined, e.g. hits the IDT, then the whole system comes
crashing down.

> Especially when that SNP VM is guaranteed to detect the memory corruption and
> react accordingly.

For the record, we can make no such guarantees about the SNP VM.  Yes, the VM
_should_ do the right thing when handed a #VC, but bugs happen, otherwise we
wouldn't be having this discussion.
Sean Christopherson Nov. 15, 2021, 5:25 p.m. UTC | #39
On Mon, Nov 15, 2021, Joerg Roedel wrote:
> On Sat, Nov 13, 2021 at 06:34:52PM +0000, Sean Christopherson wrote:
> > I'm not treating it nonchalantly, merely acknowledging that (a) some flavors of kernel
> > bugs (or hardware issues!) are inherently fatal to the system, and (b) crashing the
> > host may be preferable to continuing on in certain cases, e.g. if continuing on has a
> > high probability of corrupting guest data.
> 
> The problem here is that for SNP host-side RMP faults it will often not
> be clear at fault-time if it was caused by wrong guest or host behavior. 
> 
> I agree with Marc that crashing the host is not the right thing to do in
> this situation. Instead debug data should be collected to do further
> post-mortem analysis.

Again, I am not saying that any RMP #PF violation is an immediate, "crash the
host".  It should be handled exactly like any other #PF due to permission violation.
The only wrinkle added by the RMP is that the #PF can be due to permissions on the
GPA itself, but even that is not unique, e.g. see the proposed KVM XO support that
will hopefully still land someday.

If the guest violates the current permissions, it (indirectly) gets a #VC.  If host
userspace violates permissions, it gets SIGSEGV.  If the host kernel violates
permissions, then it reacts to the #PF in whatever way it can.  What I am saying is
that in some cases, there is _zero_ chance of recovery in the host and so crashing
the entire system is inevitable. E.g. if the host kernel hits an RMP #PF when
vectoring a #GP because the IDT lookup somehow triggers an RMP violation, then the
host is going into triple fault shutdown.

[*] https://lore.kernel.org/linux-mm/20191003212400.31130-1-rick.p.edgecombe@intel.com/
Sean Christopherson Nov. 15, 2021, 6:17 p.m. UTC | #40
On Sat, Nov 13, 2021, Marc Orr wrote:
> On Sat, Nov 13, 2021 at 10:28 AM Sean Christopherson <seanjc@google.com> wrote:
> > The behavior is no different than it is today for regular VMs.
> 
> Isn't this counter to the sketch you laid out earlier where you wrote:
> 
> --- QUOTE START ---
>   - if userspace accesses guest private memory, it gets SIGSEGV or whatever.
>   - if kernel accesses guest private memory, it does BUG/panic/oops[*]
>   - if guest accesses memory with the incorrect C/SHARED-bit, it gets killed.
> --- QUOTE END ---
> 
> Here, the guest does not get killed. Which seems hard to debug.

No, it does not contradict that statement.
 
  If the guest requests a conversion from shared=>private without first ensuring
  the gfn is unused (by a host "device"), the host side will continue accessing
  the old, shared memory, which it locked, while the guest will be doing who knows
  what.

In this case, the guest will have converted a GPA from shared=>private, i.e. changed
the effective GFN for a e.g. a shared queue, without informing host userspace that
the GFN, and thus the associated HVA in the host, has changed. For TDX that is
literally the same bug as the guest changing the GFN without informing the host, as
the SHARED bit is just an address bit with some extra meaning piled on top.  For SNP,
it's slightly different because the C-bit isn't strictly required to be an address
bit, but for all intents and purposes it's the same type of bug.

I phrased it "guest will be doing who knows what" because from a host userspace
perspective, it can't know what the guest behavior will be, and more importantly,
it doesn't care because (a) the guest is buggy and (b) the host itself is _not_ in
danger.

Yes, those types of bugs suck to debug.  But they really should be few and far
between.  The only reason I called out this specific scenario was to note that host
userspace doesn't need to take extra steps to guard against bogus shared=>private
conversions, because host userspace already needs to have such guards in place.  In
prior (offline?) conversations, we had assumed that host userspace would need to
propagate the shared vs. private status to any and all processes that map guest
memory, i.e. would require substantial enabling, but that assumption was wrong.

> If allowing userspace to inject #VC into the guest means that the host
> can continue to serve other guests, that seems like a win. The
> alternative, to blow up the host, essentially expands the blast radius
> from a single guest to all guests.

As mentioned in other threads of this conversation, when I say "host crashes", I
am specifically talking about scenarios where it is simply not possible for the
host kernel to recover, e.g. an RMP #PF violation on the IDT.

Setting that aside, injecting a #VC into the guest is not in any way necessary for
a buggy host userspace to terminate a single guest, host userspace can simply stop
running that specific guest.
Sean Christopherson Nov. 15, 2021, 6:26 p.m. UTC | #41
On Mon, Nov 15, 2021, Dr. David Alan Gilbert wrote:
> * Sean Christopherson (seanjc@google.com) wrote:
> > On Fri, Nov 12, 2021, Borislav Petkov wrote:
> > > On Fri, Nov 12, 2021 at 09:59:46AM -0800, Dave Hansen wrote:
> > > > Or, is there some mechanism that prevent guest-private memory from being
> > > > accessed in random host kernel code?
> > 
> > Or random host userspace code...
> > 
> > > So I'm currently under the impression that random host->guest accesses
> > > should not happen if not previously agreed upon by both.
> > 
> > Key word "should".
> > 
> > > Because, as explained on IRC, if host touches a private guest page,
> > > whatever the host does to that page, the next time the guest runs, it'll
> > > get a #VC where it will see that that page doesn't belong to it anymore
> > > and then, out of paranoia, it will simply terminate to protect itself.
> > > 
> > > So cloud providers should have an interest to prevent such random stray
> > > accesses if they wanna have guests. :)
> > 
> > Yes, but IMO inducing a fault in the guest because of _host_ bug is wrong.
> 
> Would it necessarily have been a host bug?  A guest telling the host a
> bad GPA to DMA into would trigger this wouldn't it?

No, because as Andy pointed out, host userspace must already guard against a bad
GPA, i.e. this is just a variant of the guest telling the host to DMA to a GPA
that is completely bogus.  The shared vs. private behavior just means that when
host userspace is doing a GPA=>HVA lookup, it needs to incorporate the "shared"
state of the GPA.  If the host goes and DMAs into the completely wrong HVA=>PFN,
then that is a host bug; that the bug happened to be exploited by a buggy/malicious
guest doesn't change the fact that the host messed up.
Marc Orr Nov. 15, 2021, 6:41 p.m. UTC | #42
On Mon, Nov 15, 2021 at 10:26 AM Sean Christopherson <seanjc@google.com> wrote:
>
> On Mon, Nov 15, 2021, Dr. David Alan Gilbert wrote:
> > * Sean Christopherson (seanjc@google.com) wrote:
> > > On Fri, Nov 12, 2021, Borislav Petkov wrote:
> > > > On Fri, Nov 12, 2021 at 09:59:46AM -0800, Dave Hansen wrote:
> > > > > Or, is there some mechanism that prevent guest-private memory from being
> > > > > accessed in random host kernel code?
> > >
> > > Or random host userspace code...
> > >
> > > > So I'm currently under the impression that random host->guest accesses
> > > > should not happen if not previously agreed upon by both.
> > >
> > > Key word "should".
> > >
> > > > Because, as explained on IRC, if host touches a private guest page,
> > > > whatever the host does to that page, the next time the guest runs, it'll
> > > > get a #VC where it will see that that page doesn't belong to it anymore
> > > > and then, out of paranoia, it will simply terminate to protect itself.
> > > >
> > > > So cloud providers should have an interest to prevent such random stray
> > > > accesses if they wanna have guests. :)
> > >
> > > Yes, but IMO inducing a fault in the guest because of _host_ bug is wrong.
> >
> > Would it necessarily have been a host bug?  A guest telling the host a
> > bad GPA to DMA into would trigger this wouldn't it?
>
> No, because as Andy pointed out, host userspace must already guard against a bad
> GPA, i.e. this is just a variant of the guest telling the host to DMA to a GPA
> that is completely bogus.  The shared vs. private behavior just means that when
> host userspace is doing a GPA=>HVA lookup, it needs to incorporate the "shared"
> state of the GPA.  If the host goes and DMAs into the completely wrong HVA=>PFN,
> then that is a host bug; that the bug happened to be exploited by a buggy/malicious
> guest doesn't change the fact that the host messed up.

"If the host goes and DMAs into the completely wrong HVA=>PFN, then
that is a host bug; that the bug happened to be exploited by a
buggy/malicious guest doesn't change the fact that the host messed
up."
^^^
Again, I'm flabbergasted that you are arguing that it's OK for a guest
to exploit a host bug to take down host-side processes or the host
itself, either of which could bring down all other VMs on the machine.

I'm going to repeat -- this is not OK! Period.

Again, if the community wants to layer some orchestration scheme
between host userspace, host kernel, and guest, on top of the code to
inject the #VC into the guest, that's fine. This proposal is not
stopping that. In fact, the two approaches are completely orthogonal
and compatible.

But so far I have heard zero reasons why injecting a #VC into the
guest is wrong. Other than just stating that it's wrong.

Again, the guest must be able to detect buggy and malicious host-side
writes to private memory. Or else "confidential computing" doesn't
work. Assuming that's not true is not a valid argument to dismiss
injecting a #VC exception into the guest.
Sean Christopherson Nov. 15, 2021, 6:44 p.m. UTC | #43
On Mon, Nov 15, 2021, Brijesh Singh wrote:
> 
> On 11/12/21 2:37 PM, Sean Christopherson wrote:
> > This is the direction KVM TDX support is headed, though it's obviously still a WIP.
> > 
> 
> Just curious, in this approach, how do you propose handling the host
> kexec/kdump? If a kexec/kdump occurs while the VM is still active, the new
> kernel will encounter the #PF (RMP violation) because some pages are still
> marked 'private' in the RMP table.

There are two basic options: a) eagerly purge the RMP or b) lazily fixup the RMP
on #PF.  Either approach can be made to work.  I'm not opposed to fixing up the RMP
on #PF in the kexec/kdump case, I'm opposed to blindly updating the RMP on _all_
RMP #PFs, i.e. the kernel should modify the RMP if and only if it knows that doing
so is correct.  E.g. a naive lazy-fixup solution would be to track which pages have
been sanitized and adjust the RMP on #PF to a page that hasn't yet been sanitized.
Brijesh Singh Nov. 15, 2021, 6:58 p.m. UTC | #44
On 11/15/21 12:44 PM, Sean Christopherson wrote:
> On Mon, Nov 15, 2021, Brijesh Singh wrote:
>>
>> On 11/12/21 2:37 PM, Sean Christopherson wrote:
>>> This is the direction KVM TDX support is headed, though it's obviously still a WIP.
>>>
>>
>> Just curious, in this approach, how do you propose handling the host
>> kexec/kdump? If a kexec/kdump occurs while the VM is still active, the new
>> kernel will encounter the #PF (RMP violation) because some pages are still
>> marked 'private' in the RMP table.
> 
> There are two basic options: a) eagerly purge the RMP or b) lazily fixup the RMP
> on #PF.  Either approach can be made to work.  I'm not opposed to fixing up the RMP
> on #PF in the kexec/kdump case, I'm opposed to blindly updating the RMP on _all_
> RMP #PFs, i.e. the kernel should modify the RMP if and only if it knows that doing
> so is correct.  E.g. a naive lazy-fixup solution would be to track which pages have
> been sanitized and adjust the RMP on #PF to a page that hasn't yet been sanitized.
> 

Yep, I think option #a will require the current kernel to iterate
through all of memory and make it shared before booting the kexec
kernel. It may add another requirement to track guest private/shared
state on the host to minimize the iterations.

thanks
Sean Christopherson Nov. 15, 2021, 7:15 p.m. UTC | #45
+arm64 KVM folks

On Mon, Nov 15, 2021, Marc Orr wrote:
> On Mon, Nov 15, 2021 at 10:26 AM Sean Christopherson <seanjc@google.com> wrote:
> >
> > On Mon, Nov 15, 2021, Dr. David Alan Gilbert wrote:
> > > * Sean Christopherson (seanjc@google.com) wrote:
> > > > On Fri, Nov 12, 2021, Borislav Petkov wrote:
> > > > > On Fri, Nov 12, 2021 at 09:59:46AM -0800, Dave Hansen wrote:
> > > > > > Or, is there some mechanism that prevent guest-private memory from being
> > > > > > accessed in random host kernel code?
> > > >
> > > > Or random host userspace code...
> > > >
> > > > > So I'm currently under the impression that random host->guest accesses
> > > > > should not happen if not previously agreed upon by both.
> > > >
> > > > Key word "should".
> > > >
> > > > > Because, as explained on IRC, if host touches a private guest page,
> > > > > whatever the host does to that page, the next time the guest runs, it'll
> > > > > get a #VC where it will see that that page doesn't belong to it anymore
> > > > > and then, out of paranoia, it will simply terminate to protect itself.
> > > > >
> > > > > So cloud providers should have an interest to prevent such random stray
> > > > > accesses if they wanna have guests. :)
> > > >
> > > > Yes, but IMO inducing a fault in the guest because of _host_ bug is wrong.
> > >
> > > Would it necessarily have been a host bug?  A guest telling the host a
> > > bad GPA to DMA into would trigger this wouldn't it?
> >
> > No, because as Andy pointed out, host userspace must already guard against a bad
> > GPA, i.e. this is just a variant of the guest telling the host to DMA to a GPA
> > that is completely bogus.  The shared vs. private behavior just means that when
> > host userspace is doing a GPA=>HVA lookup, it needs to incorporate the "shared"
> > state of the GPA.  If the host goes and DMAs into the completely wrong HVA=>PFN,
> > then that is a host bug; that the bug happened to be exploited by a buggy/malicious
> > guest doesn't change the fact that the host messed up.
> 
> "If the host goes and DMAs into the completely wrong HVA=>PFN, then
> that is a host bug; that the bug happened to be exploited by a
> buggy/malicious guest doesn't change the fact that the host messed
> up."
> ^^^
> Again, I'm flabbergasted that you are arguing that it's OK for a guest
> to exploit a host bug to take down host-side processes or the host
> itself, either of which could bring down all other VMs on the machine.
> 
> I'm going to repeat -- this is not OK! Period.

Huh?  At which point did I suggest it's ok to ship software with bugs?  Of course
it's not ok to introduce host bugs that let the guest crash the host (or host
processes).  But _if_ someone does ship buggy host software, it's not like we can
wave a magic wand and stop the guest from exploiting the bug.  That's why they're
such a big deal.

Yes, in this case a very specific flavor of host userspace bug could be morphed
into a guest exception, but as mentioned ad nauseam, _if_ host userspace has a bug
where it does not properly validate a GPA=>HVA, then any such bug exists and is
exploitable today irrespective of SNP.

> Again, if the community wants to layer some orchestration scheme
> between host userspace, host kernel, and guest, on top of the code to
> inject the #VC into the guest, that's fine. This proposal is not
> stopping that. In fact, the two approaches are completely orthogonal
> and compatible.
> 
> But so far I have heard zero reasons why injecting a #VC into the
> guest is wrong. Other than just stating that it's wrong.

It creates a new attack surface, e.g. if the guest mishandles the #VC and does
PVALIDATE on memory that it previously accepted, then userspace can attack the
guest by accessing guest private memory to coerce the guest into consuming corrupted
data.

> Again, the guest must be able to detect buggy and malicious host-side
> writes to private memory. Or else "confidential computing" doesn't
> work.

That assertion assumes the host _hypervisor_ is untrusted, which does not hold true
for all use cases.  The Cc'd arm64 folks are working on a protected VM model where
the host kernel at large is untrusted, but the "hypervisor" (KVM plus a few other
bits), is still trusted by the guest.  Because the hypervisor is trusted, the guest
doesn't need to be hardened against event injection attacks from the host.

Note, SNP already has a similar concept in its VMPLs.  VMPL3 runs a confidential VM
that is not hardened in any way, and fully trusts VMPL0 to not inject bogus faults.

And along the lines of arm64's pKVM, I would very much like to get KVM to a point
where it can remove host userspace from the guest's TCB without relying on hardware.
Anything that can be done in hardware absolutely can be done in the kernel, and likely
can be done with significantly less performance overhead.  Confidential computing is
not a binary thing where the only valid use case is removing the host kernel from
the TCB and trusting only hardware.  There are undoubtedly use cases where trusting
the host kernel but not host userspace brings tangible value, but the overhead of
TDX/SNP to get the host kernel out of the TCB is the wrong tradeoff for performance
vs. security.

> Assuming that's not true is not a valid argument to dismiss
> injecting a #VC exception into the guest.
Marc Orr Nov. 16, 2021, 3:07 a.m. UTC | #46
On Mon, Nov 15, 2021 at 11:15 AM Sean Christopherson <seanjc@google.com> wrote:
>
> +arm64 KVM folks
>
> On Mon, Nov 15, 2021, Marc Orr wrote:
> > On Mon, Nov 15, 2021 at 10:26 AM Sean Christopherson <seanjc@google.com> wrote:
> > >
> > > On Mon, Nov 15, 2021, Dr. David Alan Gilbert wrote:
> > > > * Sean Christopherson (seanjc@google.com) wrote:
> > > > > On Fri, Nov 12, 2021, Borislav Petkov wrote:
> > > > > > On Fri, Nov 12, 2021 at 09:59:46AM -0800, Dave Hansen wrote:
> > > > > > > Or, is there some mechanism that prevent guest-private memory from being
> > > > > > > accessed in random host kernel code?
> > > > >
> > > > > Or random host userspace code...
> > > > >
> > > > > > So I'm currently under the impression that random host->guest accesses
> > > > > > should not happen if not previously agreed upon by both.
> > > > >
> > > > > Key word "should".
> > > > >
> > > > > > Because, as explained on IRC, if host touches a private guest page,
> > > > > > whatever the host does to that page, the next time the guest runs, it'll
> > > > > > get a #VC where it will see that that page doesn't belong to it anymore
> > > > > > and then, out of paranoia, it will simply terminate to protect itself.
> > > > > >
> > > > > > So cloud providers should have an interest to prevent such random stray
> > > > > > accesses if they wanna have guests. :)
> > > > >
> > > > > Yes, but IMO inducing a fault in the guest because of _host_ bug is wrong.
> > > >
> > > > Would it necessarily have been a host bug?  A guest telling the host a
> > > > bad GPA to DMA into would trigger this wouldn't it?
> > >
> > > No, because as Andy pointed out, host userspace must already guard against a bad
> > > GPA, i.e. this is just a variant of the guest telling the host to DMA to a GPA
> > > that is completely bogus.  The shared vs. private behavior just means that when
> > > host userspace is doing a GPA=>HVA lookup, it needs to incorporate the "shared"
> > > state of the GPA.  If the host goes and DMAs into the completely wrong HVA=>PFN,
> > > then that is a host bug; that the bug happened to be exploited by a buggy/malicious
> > > guest doesn't change the fact that the host messed up.
> >
> > "If the host goes and DMAs into the completely wrong HVA=>PFN, then
> > that is a host bug; that the bug happened to be exploited by a
> > buggy/malicious guest doesn't change the fact that the host messed
> > up."
> > ^^^
> > Again, I'm flabbergasted that you are arguing that it's OK for a guest
> > to exploit a host bug to take down host-side processes or the host
> > itself, either of which could bring down all other VMs on the machine.
> >
> > I'm going to repeat -- this is not OK! Period.
>
> Huh?  At which point did I suggest it's ok to ship software with bugs?  Of course
> it's not ok to introduce host bugs that let the guest crash the host (or host
> processes).  But _if_ someone does ship buggy host software, it's not like we can
> wave a magic wand and stop the guest from exploiting the bug.  That's why they're
> such a big deal.
>
> Yes, in this case a very specific flavor of host userspace bug could be morphed
> into a guest exception, but as mentioned ad nauseum, _if_ host userspace has bug
> where it does not properly validate a GPA=>HVA, then any such bug exists and is
> exploitable today irrespective of SNP.

If I'm understanding you correctly, you're saying that we will never
get into the host's page fault handler due to an RMP violation if we
implement the unmapping guest private memory proposal (without bugs).

However, bugs do happen. And the host-side page fault handler will
have code to react to an RMP violation (even if it's supposedly
impossible to hit). I'm saying that the host-side page fault handler
should NOT handle an RMP violation by killing host-side processes or
the kernel itself. This is detrimental to host reliability.

There are two ways to handle this. (1) Convert the private page
causing the RMP violation to shared, (2) Kill the guest.

Converting the private page to shared is a good solution in SNP's
threat model. And overall, it's better for debuggability than
immediately terminating the guest.
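As a sketch only, the convert-to-shared policy described above could be expressed as the following decision helper; the names and the fault-context structure are hypothetical, not the actual fault.c code.

```c
#include <assert.h>
#include <stdbool.h>

/* The two possible reactions to an RMP violation named above. */
enum rmp_fault_action {
    RMP_CONVERT_TO_SHARED, /* (1) flip the page and let the write continue */
    RMP_KILL_GUEST,        /* (2) terminate the offending guest */
};

struct rmp_fault_ctx {
    bool page_is_private; /* RMP table says the faulting page is private */
};

/*
 * Option (1): convert a private page to shared so the host-side access
 * can proceed; the guest then detects the corruption via #VC on its
 * next access because the page is no longer validated.
 */
enum rmp_fault_action handle_rmp_violation(const struct rmp_fault_ctx *ctx)
{
    if (ctx->page_is_private)
        return RMP_CONVERT_TO_SHARED;
    /*
     * An RMP fault on a page the RMP already reports as shared would
     * indicate some other bug; fall back to killing only the guest
     * rather than host processes or the kernel.
     */
    return RMP_KILL_GUEST;
}
```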

> > Again, if the community wants to layer some orchestration scheme
> > between host userspace, host kernel, and guest, on top of the code to
> > inject the #VC into the guest, that's fine. This proposal is not
> > stopping that. In fact, the two approaches are completely orthogonal
> > and compatible.
> >
> > But so far I have heard zero reasons why injecting a #VC into the
> > guest is wrong. Other than just stating that it's wrong.
>
> It creates a new attack surface, e.g. if the guest mishandles the #VC and does
> PVALIDATE on memory that it previously accepted, then userspace can attack the
> guest by accessing guest private memory to coerce the guest into consuming corrupted
> data.

We should handle RMP violations as best possible from within the
host-side page fault handler, independent of the proposal to unmap
private guest memory for all CVM architectures. Otherwise, if someone
figures out how to trigger an RMP violation by writing guest private
memory (despite unmapping guest private memory's goal to make this
impossible), the attack surface has now increased. Because now we're
either killing host processes or the kernel. Which is worse than
killing the single guest.

Second, I don't think it's correct to say that the host-side
implementation changes the attack surface. The guest must already be
hardened against host-side bugs and attacks. From the guest's
perspective, the attack surface is the same. Also, unmapping private
guest memory is going to require coordination across host userspace,
host kernel, and guest. That's a lot of code. And typically more code
means more attack surface. That being said, if all that code is
implemented perfectly, you might be right, that unmapping guest
private memory makes it harder for the host to attack the guest. But I
think it's a moot point. Because in the end, converting the page from
private to shared is entirely reasonable to solve this problem for
SNP, and we should be doing it regardless, even if we do get
unmapping private memory working on SNP.
Andy Lutomirski Nov. 16, 2021, 5 a.m. UTC | #47
On Mon, Nov 15, 2021, at 10:41 AM, Marc Orr wrote:
> On Mon, Nov 15, 2021 at 10:26 AM Sean Christopherson <seanjc@google.com> wrote:
>>
>> On Mon, Nov 15, 2021, Dr. David Alan Gilbert wrote:
>> > * Sean Christopherson (seanjc@google.com) wrote:
>> > > On Fri, Nov 12, 2021, Borislav Petkov wrote:
>> > > > On Fri, Nov 12, 2021 at 09:59:46AM -0800, Dave Hansen wrote:
>> > > > > Or, is there some mechanism that prevent guest-private memory from being
>> > > > > accessed in random host kernel code?
>> > >
>> > > Or random host userspace code...
>> > >
>> > > > So I'm currently under the impression that random host->guest accesses
>> > > > should not happen if not previously agreed upon by both.
>> > >
>> > > Key word "should".
>> > >
>> > > > Because, as explained on IRC, if host touches a private guest page,
>> > > > whatever the host does to that page, the next time the guest runs, it'll
>> > > > get a #VC where it will see that that page doesn't belong to it anymore
>> > > > and then, out of paranoia, it will simply terminate to protect itself.
>> > > >
>> > > > So cloud providers should have an interest to prevent such random stray
>> > > > accesses if they wanna have guests. :)
>> > >
>> > > Yes, but IMO inducing a fault in the guest because of _host_ bug is wrong.
>> >
>> > Would it necessarily have been a host bug?  A guest telling the host a
>> > bad GPA to DMA into would trigger this wouldn't it?
>>
>> No, because as Andy pointed out, host userspace must already guard against a bad
>> GPA, i.e. this is just a variant of the guest telling the host to DMA to a GPA
>> that is completely bogus.  The shared vs. private behavior just means that when
>> host userspace is doing a GPA=>HVA lookup, it needs to incorporate the "shared"
>> state of the GPA.  If the host goes and DMAs into the completely wrong HVA=>PFN,
>> then that is a host bug; that the bug happened to be exploited by a buggy/malicious
>> guest doesn't change the fact that the host messed up.
>
> "If the host goes and DMAs into the completely wrong HVA=>PFN, then
> that is a host bug; that the bug happened to be exploited by a
> buggy/malicious guest doesn't change the fact that the host messed
> up."
> ^^^
> Again, I'm flabbergasted that you are arguing that it's OK for a guest
> to exploit a host bug to take down host-side processes or the host
> itself, either of which could bring down all other VMs on the machine.
>
> I'm going to repeat -- this is not OK! Period.

I don’t understand the point you’re trying to make. If the host _kernel_ has a bug that allows a guest to trigger an invalid host memory access, this is bad. We want to know about it and fix it, and the security folks want to minimize the chance that such a bug exists.

If host _userspace_ has such a bug, the kernel should not crash if it’s exploited.
Andy Lutomirski Nov. 16, 2021, 5:14 a.m. UTC | #48
On Mon, Nov 15, 2021, at 7:07 PM, Marc Orr wrote:
> On Mon, Nov 15, 2021 at 11:15 AM Sean Christopherson <seanjc@google.com> wrote:
>>
>> +arm64 KVM folks
>>
>> On Mon, Nov 15, 2021, Marc Orr wrote:
>> > On Mon, Nov 15, 2021 at 10:26 AM Sean Christopherson <seanjc@google.com> wrote:
>> > >
>> > > On Mon, Nov 15, 2021, Dr. David Alan Gilbert wrote:
>> > > > * Sean Christopherson (seanjc@google.com) wrote:
>> > > > > On Fri, Nov 12, 2021, Borislav Petkov wrote:
>> > > > > > On Fri, Nov 12, 2021 at 09:59:46AM -0800, Dave Hansen wrote:
>> > > > > > > Or, is there some mechanism that prevent guest-private memory from being
>> > > > > > > accessed in random host kernel code?
>> > > > >
>> > > > > Or random host userspace code...
>> > > > >
>> > > > > > So I'm currently under the impression that random host->guest accesses
>> > > > > > should not happen if not previously agreed upon by both.
>> > > > >
>> > > > > Key word "should".
>> > > > >
>> > > > > > Because, as explained on IRC, if host touches a private guest page,
>> > > > > > whatever the host does to that page, the next time the guest runs, it'll
>> > > > > > get a #VC where it will see that that page doesn't belong to it anymore
>> > > > > > and then, out of paranoia, it will simply terminate to protect itself.
>> > > > > >
>> > > > > > So cloud providers should have an interest to prevent such random stray
>> > > > > > accesses if they wanna have guests. :)
>> > > > >
>> > > > > Yes, but IMO inducing a fault in the guest because of _host_ bug is wrong.
>> > > >
>> > > > Would it necessarily have been a host bug?  A guest telling the host a
>> > > > bad GPA to DMA into would trigger this wouldn't it?
>> > >
>> > > No, because as Andy pointed out, host userspace must already guard against a bad
>> > > GPA, i.e. this is just a variant of the guest telling the host to DMA to a GPA
>> > > that is completely bogus.  The shared vs. private behavior just means that when
>> > > host userspace is doing a GPA=>HVA lookup, it needs to incorporate the "shared"
>> > > state of the GPA.  If the host goes and DMAs into the completely wrong HVA=>PFN,
>> > > then that is a host bug; that the bug happened to be exploited by a buggy/malicious
>> > > guest doesn't change the fact that the host messed up.
>> >
>> > "If the host goes and DMAs into the completely wrong HVA=>PFN, then
>> > that is a host bug; that the bug happened to be exploited by a
>> > buggy/malicious guest doesn't change the fact that the host messed
>> > up."
>> > ^^^
>> > Again, I'm flabbergasted that you are arguing that it's OK for a guest
>> > to exploit a host bug to take down host-side processes or the host
>> > itself, either of which could bring down all other VMs on the machine.
>> >
>> > I'm going to repeat -- this is not OK! Period.
>>
>> Huh?  At which point did I suggest it's ok to ship software with bugs?  Of course
>> it's not ok to introduce host bugs that let the guest crash the host (or host
>> processes).  But _if_ someone does ship buggy host software, it's not like we can
>> wave a magic wand and stop the guest from exploiting the bug.  That's why they're
>> such a big deal.
>>
>> Yes, in this case a very specific flavor of host userspace bug could be morphed
>> into a guest exception, but as mentioned ad nauseum, _if_ host userspace has bug
>> where it does not properly validate a GPA=>HVA, then any such bug exists and is
>> exploitable today irrespective of SNP.
>
> If I'm understanding you correctly, you're saying that we will never
> get into the host's page fault handler due to an RMP violation if we
> implement the unmapping guest private memory proposal (without bugs).
>
> However, bugs do happen. And the host-side page fault handler will
> have code to react to an RMP violation (even if it's supposedly
> impossible to hit). I'm saying that the host-side page fault handler
> should NOT handle an RMP violation by killing host-side processes or
> the kernel itself. This is detrimental to host reliability.
>
> There are two ways to handle this. (1) Convert the private page
> causing the RMP violation to shared, (2) Kill the guest.

It’s time to put on my maintainer hat. This is solidly in my territory, and NAK.  A kernel privilege fault, from who-knows-what context (interrupts off? NMI? locks held?) that gets an RMP violation with no exception handler is *not* going to blindly write the RMP and retry.  It’s not going to send flush IPIs or call into KVM to “fix” things.  Just the locking issues alone are probably showstopping, even ignoring the security and sanity issues.

You are welcome to submit patches to make the panic_on_oops=0 case as robust as practical, and you are welcome to keep running other VMs after we die() and taint the kernel, but that is strictly best effort and is a bad idea in a highly security sensitive environment.

And that goes for everyone else here too. If you all have a brilliant idea to lazily fix RMP faults (and the actual semantics are reasonable) and you want my NAK to go away, I want to see a crystal clear explanation of what you plan to put in fault.c, how the locking is supposed to work, and how you bound retries such that the system will reliably OOPS if something goes wrong instead of retrying forever or deadlocking.

Otherwise can we please get on with designing a reasonable model for guest-private memory please?
Joerg Roedel Nov. 16, 2021, 1:02 p.m. UTC | #49
On Mon, Nov 15, 2021 at 06:26:16PM +0000, Sean Christopherson wrote:
> No, because as Andy pointed out, host userspace must already guard against a bad
> GPA, i.e. this is just a variant of the guest telling the host to DMA to a GPA
> that is completely bogus.  The shared vs. private behavior just means that when
> host userspace is doing a GPA=>HVA lookup, it needs to incorporate the "shared"
> state of the GPA.  If the host goes and DMAs into the completely wrong HVA=>PFN,
> then that is a host bug; that the bug happened to be exploited by a buggy/malicious
> guest doesn't change the fact that the host messed up.

The thing is that the usual checking mechanisms can't be applied to
guest-private pages. For user-space the GPA is valid if it fits into the
guest memory layout user-space set up before. But whether a page is
shared or private is the guest's business. And without an expensive
reporting/query mechanism user-space doesn't have the information to do
the check.

A mechanism to lock pages to shared is also needed, and that creates the
next problems:

	* Who can release the lock, only the process which created it or
	  anyone who has the memory mapped?

	* What happens when a process has locked guest regions and then
	  dies with SIGSEGV? Will its locks on guest memory be released
	  or stay around forever?

And this is only what comes to mind immediately; I'm sure there are more
problematic details in such an interface.

Regards,
Joerg Roedel Nov. 16, 2021, 1:21 p.m. UTC | #50
On Mon, Nov 15, 2021 at 09:14:14PM -0800, Andy Lutomirski wrote:
> It’s time to put on my maintainer hat. This is solidly in my
> territory, and NAK.  A kernel privilege fault, from who-knows-what
> context (interrupts off? NMI? locks held?) that gets an RMP violation
> with no exception handler is *not* going to blindly write the RMP and
> retry.  It’s not going to send flush IPIs or call into KVM to “fix”
> things.  Just the locking issues alone are probably showstopping, even
> ignoring the security and sanity issues.

RMP faults are expected from two contexts:

	* User-space
	* KVM running in task context

The only situation where RMP faults could happen outside of these
contexts is when running a kexec'ed kernel, which was launched while SNP
guests were still running (that needs to be taken care of as well).

And from the locking side, which lock does the #PF handler need to take?
Processors supporting SNP also have hardware support for flushing remote
TLBs, so locks taken in the flush path are not strictly required.

Calling into KVM is another story and needs some more thought, I agree
with that.

> Otherwise can we please get on with designing a reasonable model for
> guest-private memory please?

It is fine to unmap guest-private memory from the host kernel, even if
it is not required by SNP. TDX needs to do that because of the #MC thing
that happens otherwise, but that is also just a way to emulate an
RMP-like fault with TDX.

But as Marc already pointed out, the kernel needs a plan B for when an
RMP fault happens anyway due to some bug.

Regards,
Joerg Roedel Nov. 16, 2021, 1:30 p.m. UTC | #51
On Mon, Nov 15, 2021 at 07:15:07PM +0000, Sean Christopherson wrote:
> It creates a new attack surface, e.g. if the guest mishandles the #VC and does
> PVALIDATE on memory that it previously accepted, then userspace can attack the
> guest by accessing guest private memory to coerce the guest into consuming corrupted
> data.

If a guest can be tricked into a double PVALIDATE or otherwise
misbehaves on a #VC exception, then it is a guest bug and needs to be
fixed there.

It is a core requirement for the #VC handler that it cannot be tricked
that way.

Regards,
Sean Christopherson Nov. 16, 2021, 6:26 p.m. UTC | #52
On Tue, Nov 16, 2021, Joerg Roedel wrote:
> But as Marc already pointed out, the kernel needs a plan B when an RMP
> happens anyway due to some bug.

I don't see why unexpected RMP #PF is a special snowflake that needs a different
plan than literally every other type of unexpected #PF in the kernel.
Peter Gonda Nov. 16, 2021, 6:39 p.m. UTC | #53
On Tue, Nov 16, 2021 at 11:26 AM Sean Christopherson <seanjc@google.com> wrote:
>
> On Tue, Nov 16, 2021, Joerg Roedel wrote:
> > But as Marc already pointed out, the kernel needs a plan B when an RMP
> > happens anyway due to some bug.
>
> I don't see why unexpected RMP #PF is a special snowflake that needs a different
> plan than literally every other type of unexpected #PF in the kernel.

When I started this thread I was not trying to say we *need* to do
something different for RMP faults, but that we *could* improve host
reliability by doing something. Since it is possible to special case
an RMP fault and prevent a panic, I thought it was worth discussing.
Sean Christopherson Nov. 16, 2021, 8:08 p.m. UTC | #54
On Tue, Nov 16, 2021, Joerg Roedel wrote:
> On Mon, Nov 15, 2021 at 06:26:16PM +0000, Sean Christopherson wrote:
> > No, because as Andy pointed out, host userspace must already guard against a bad
> > GPA, i.e. this is just a variant of the guest telling the host to DMA to a GPA
> > that is completely bogus.  The shared vs. private behavior just means that when
> > host userspace is doing a GPA=>HVA lookup, it needs to incorporate the "shared"
> > state of the GPA.  If the host goes and DMAs into the completely wrong HVA=>PFN,
> > then that is a host bug; that the bug happened to be exploited by a buggy/malicious
> > guest doesn't change the fact that the host messed up.
> 
> The thing is that the usual checking mechanisms can't be applied to
> guest-private pages. For user-space the GPA is valid if it fits into the
> guest memory layout user-space set up before. But whether a page is
> shared or private is the guests business.

And that's where we fundamentally disagree.  Whether a page is shared or private
is very much the host's business.  The guest can _ask_ to convert a page, but the
host ultimately owns the state of a page.  Even in this proposed auto-convert
approach, the host has final say over the state of the page.

The main difference between auto-converting in the #PF handler and an unmapping
approach is that, as you note below, the latter requires an explicit action from
host userspace.  But again, the host kernel has final say over the state of any
given page.

> And without an expensive reporting/query mechanism user-space doesn't have the
> information to do the check.

The cost of exiting to userspace isn't all that expensive relative to the cost of
the RMP update itself, e.g. IIRC updating the RMP is several thousand cycles.
TDX will have a similar cost to modify S-EPT entries.

Actually updating the backing store (see below) might be relatively expensive, but
I highly doubt it will be orders of magnitude slower than updating the RMP, or that
it will have a meaningful impact on guest performance.

> A mechanism to lock pages to shared is also needed, and that creates the
> next problems:

The most recent proposal for unmapping guest private memory doesn't require new
locking at the page level.  The high level idea is to treat shared and private
variations of GPAs as two unrelated addresses from a host memory management
perspective.  They are only a "single" address in the context of KVM's MMU, i.e.
the NPT for SNP.

For shared pages, no new locking is required as the PFN associated with a VMA will
not be freed until all mappings go away.  Any access after all mappings/references
have been dropped is a nothing more than a use-after-free bug, and the guilty party
is punished accordingly.

For private pages, the proposed idea is to require that all guest private memory
be backed by an enlightened backing store, e.g. the initial RFC enhances memfd and
shmem to support sealing the file as guest-only:

  : The new seal is only allowed if there's no pre-existing pages in the fd
  : and there's no existing mapping of the file. After the seal is set, no
  : read/write/mmap from userspace is allowed.

It is KVM's responsibility to ensure it doesn't map a shared PFN into a private
GPA and vice versa, and that TDP entries are unmapped appropriately, e.g. when
userspace punches a hole in the backing store, but that can all be done using
existing locks, e.g. KVM's mmu_lock.  No new locking mechanisms are required.
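The sealing rule quoted above can be restated as a tiny state machine. This is only an illustrative model: the real RFC extends memfd/shmem, and the structure and function names here are invented.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of a guest-memory backing file with a guest-only seal. */
struct guest_memfd {
    bool has_pages;   /* pages already populated in the fd */
    bool has_mapping; /* an existing userspace mapping of the file */
    bool sealed;      /* guest-only seal has been applied */
};

/* Per the quoted RFC text: the seal is only allowed on an empty, unmapped file. */
bool seal_guest_only(struct guest_memfd *f)
{
    if (f->has_pages || f->has_mapping)
        return false;
    f->sealed = true;
    return true;
}

/* After the seal is set, no read/write/mmap from userspace is allowed. */
bool userspace_access_allowed(const struct guest_memfd *f)
{
    return !f->sealed;
}
```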

> 	* Who can release the lock, only the process which created it or
> 	  anyone who has the memory mapped?
> 
> 	* What happens when a process has locked guest regions and then
> 	  dies with SIGSEGV, will its locks on guest memory be released
> 	  stay around forever?

> And this is only what comes to mind immediatly, I sure there are more
> problematic details in such an interface.

Please read through this proposal/RFC, more eyeballs would certainly be welcome.

https://lkml.kernel.org/r/20211111141352.26311-1-chao.p.peng@linux.intel.com
Brijesh Singh Nov. 22, 2021, 3:23 p.m. UTC | #55
Hi Peter,

On 11/12/21 9:43 AM, Peter Gonda wrote:
> Hi Brijesh,,
> 
> One high level discussion I'd like to have on these SNP KVM patches.
> 
> In these patches (V5) if a host userspace process writes a guest
> private page a SIGBUS is issued to that process. If the kernel writes
> a guest private page then the kernel panics due to the unhandled RMP
> fault page fault. This is an issue because not all writes into guest
> memory may come from a bug in the host. For instance a malicious or
> even buggy guest could easily point the host to writing a private page
> during the emulation of many virtual devices (virtio, NVMe, etc). For
> example if a well-behaved guest's behavior is to: start up a driver,
> select some pages to share with the host, ask the host to convert
> them to shared, then use those pages for virtual device DMA, if a
> buggy guest forgets the step to request the pages be converted to
> shared it's easy to see how the host could rightfully write to private
> memory. I think we can better guarantee host reliability when running
> SNP guests without changing SNP’s security properties.
> 
> Here is an alternative to the current approach: On RMP violation (host
> or userspace) the page fault handler converts the page from private to
> shared to allow the write to continue. This pulls from s390’s error
> handling which does exactly this. See ‘arch_make_page_accessible()’.
> Additionally it adds less complexity to the SNP kernel patches, and
> requires no new ABI.
> 
> In the current (V5) KVM implementation if a userspace process
> generates an RMP violation (writes to guest private memory) the
> process receives a SIGBUS. At first glance, it would appear that
> user-space shouldn’t write to private memory. However, guaranteeing
> this in a generic fashion requires locking the RMP entries (via locks
> external to the RMP). Otherwise, a user-space process emulating a
> guest device IO may be vulnerable to having the guest memory
> (maliciously or by guest bug) converted to private while user-space
> emulation is happening. This results in a well behaved userspace
> process receiving a SIGBUS.
> 
> This proposal allows buggy and malicious guests to run under SNP
> without jeopardizing the reliability / safety of host processes. This
> is very important to a cloud service provider (CSP) since it’s common
> to have host wide daemons that write/read all guests, i.e. a single
> process could manage the networking for all VMs on the host. Crashing
> that singleton process kills networking for all VMs on the system.
> 
Thank you for starting the thread; based on the discussion, I am keeping 
the current implementation as-is and *not* going with the auto 
conversion from private to shared. To summarize what we are doing in the 
current SNP series:

- If userspace accesses guest private memory, it gets SIGBUS.
- If the kernel accesses[*] guest private memory, it panics.

[*] The kernel consults the RMP table for page ownership before the 
access. If the page is shared, then it uses the locking mechanism to 
ensure that a guest will not be able to change the page ownership while 
the kernel has it mapped.
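As a rough model only, the check-and-lock described in [*] might look like this; all names here are invented (the actual series implements this via the post_{map,unmap}_gfn hooks discussed elsewhere in the thread):

```c
#include <assert.h>
#include <stdbool.h>

/* Toy per-page state: RMP ownership plus a host-side "mapped" lock. */
struct page_state {
    bool is_private; /* RMP says the page is guest-private */
    bool locked;     /* host currently has the page mapped */
};

/*
 * Before touching guest memory the host consults the RMP; only shared
 * pages may be mapped, and a mapped page is locked so the guest cannot
 * flip its ownership while the host is using it.
 */
bool host_map_page(struct page_state *p)
{
    if (p->is_private)
        return false; /* refusing here avoids the RMP #PF entirely */
    p->locked = true;
    return true;
}

/* A guest-requested shared->private conversion fails while the lock is held. */
bool guest_make_private(struct page_state *p)
{
    if (p->locked)
        return false;
    p->is_private = true;
    return true;
}

void host_unmap_page(struct page_state *p)
{
    p->locked = false;
}
```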

thanks

> This proposal also allows for minimal changes to the kexec flow and
> kdump. The new kexec kernel can simply update private pages to shared
> as it encounters them during boot. This avoids needing to propagate
> the RMP state from kernel to kernel. Of course this doesn’t preserve
> any running VMs but is still useful for kdump crash dumps or quicker
> rekerneling for development with kexec.
> 
> This proposal does cause guest memory corruption for some bugs, but one
> of SEV-SNP’s goals, extended from SEV-ES’s goals, is for guests to be
> able to detect when their memory has been corrupted / replayed by the
> host. So SNP already has features for allowing guests to detect this
> kind of memory corruption. Additionally this is very similar to a page
> of memory generating a machine check because of 2-bit memory
> corruption. In other words SNP guests must be enlightened and ready
> for these kinds of errors.
> 
> For an SNP guest running under this proposal the flow would look like this:
> * Host gets a #PF because it’s trying to write to a private page.
> * Host #PF handler updates the page to shared.
> * Write continues normally.
> * Guest accesses memory (r/w).
> * Guest gets a #VC error because the page is not PVALIDATED.
> * Guest is now in control. Guest can terminate because its memory has
> been corrupted, or it can try to continue and log the error to its
> owner.
> 
> A similar approach was introduced in the SNP patches V1 and V2 for
> kernel page fault handling. The pushback around this convert-to-shared
> approach was largely focused on the idea that the kernel has all the
> information about which pages are shared vs private, so it should be
> able to check shared status before writing to pages. After V2 the
> patches were updated to not have a kernel page fault handler for RMP
> violations (other than dumping state during a panic). The current
> patches protect the host with new post_{map,unmap}_gfn() functions
> that check if a page is shared before mapping it, then lock the page
> shared until unmapped. Given the discussions on ‘[Part2,v5,39/45] KVM:
> SVM: Introduce ops for the post gfn map and unmap’, building a solution
> to do this is non-trivial and adds new overheads to KVM. Additionally
> the current solution is local to the kernel, so a new ABI would need to
> be created to allow the userspace VMM to access the kernel-side locks
> for this to work generically for the whole host. This is more
> complicated than this proposal, and adding more lock holders seems like
> it could reduce performance further.
> 
> There are a couple corner cases with this approach. Under SNP guests
> can request their memory be changed into a VMSA. This VMSA page cannot
> be changed to shared while the vCPU associated with it is running. So
> KVM + the #PF handler will need something to kick vCPUs from running.
> Joerg believes that a possible fix for this could be a new MMU
> notifier in the kernel, then on the #PF we can go through the rmp and
> execute this vCPU kick callback.
> 
> Another corner case is that the RMPUPDATE instruction is not guaranteed
> to succeed on the first iteration. As noted above, if the page is a
> VMSA it cannot be updated while the vCPU is running. Another issue is
> that if the guest is running an RMPADJUST on a page, it cannot be
> RMPUPDATEd at that time. There is a lock for each RMP entry, so there
> is a race for these instructions. The vCPU kicking can solve this issue
> by kicking all guest vCPUs, which removes the chance of the race.
> 
> Since this proposal probably results in SNP guests terminating due to
> a page unexpectedly needing PVALIDATE, the approach could be simplified
> to just having KVM kill the guest. But I think it's nicer to users to,
> instead of unilaterally killing the guest, deliver the unvalidated #VC
> exception and let them collect some additional debug information and
> perform any additional clean-up work they would like.
> 
> Thanks
> Peter
> 
> On Fri, Aug 20, 2021 at 9:59 AM Brijesh Singh <brijesh.singh@amd.com> wrote:
>>
>> This part of the Secure Encrypted Paging (SEV-SNP) series focuses on the
>> changes required in a host OS for SEV-SNP support. The series builds upon
>> SEV-SNP Part-1.
>>
>> This series provides the basic building blocks to support booting SEV-SNP
>> VMs; it does not cover all the security enhancements introduced by SEV-SNP,
>> such as interrupt protection.
>>
>> The CCP driver is enhanced to provide new APIs that use the SEV-SNP
>> specific commands defined in the SEV-SNP firmware specification. The KVM
>> driver uses those APIs to create and manage the SEV-SNP guests.
>>
>> The GHCB specification version 2 introduces a new set of NAEs that are
>> used by the SEV-SNP guest to communicate with the hypervisor. The series
>> provides support to handle the following new NAE events:
>> - Register GHCB GPA
>> - Page State Change Request
>> - Hypervisor feature
>> - Guest message request
>>
>> The RMP check is enforced as soon as SEV-SNP is enabled. Not every memory
>> access requires an RMP check. In particular, read accesses from the
>> hypervisor do not require RMP checks because data confidentiality is
>> already protected via memory encryption. When hardware encounters an RMP
>> check failure, it raises a page-fault exception. If the RMP check failure
>> is due to a page-size mismatch, the large page is split to resolve
>> the fault.
>>
>> The series does not provide support for interrupt security and migration;
>> those features will be added after the base support.
>>
>> The series is based on the commit:
>>   SNP part1 commit and
>>   fa7a549d321a (kvm/next, next) KVM: x86: accept userspace interrupt only if no event is injected
>>
>> TODO:
>>    * Add support for command to ratelimit the guest message request.
>>
>> Changes since v4:
>>   * Move the RMP entry definition to x86 specific header file.
>>   * Move the dump RMP entry function to SEV specific file.
>>   * Use BIT_ULL while defining the #PF bit fields.
>>   * Add helper function to check the IOMMU support for SEV-SNP feature.
>>   * Add helper functions for the page state transition.
>>   * Map and unmap the pages from the direct map after page is added or
>>     removed in RMP table.
>>   * Enforce the minimum SEV-SNP firmware version.
>>   * Extend the LAUNCH_UPDATE to accept the base_gfn and remove the
>>     logic to calculate the gfn from the hva.
>>   * Add a check in LAUNCH_UPDATE to ensure that all the pages are
>>     shared before calling the PSP.
>>   * Mark the memory failure when failing to remove the page from the
>>     RMP table or clearing the immutable bit.
>>   * Exclude the encrypted hva range from the KSM.
>>   * Remove the gfn tracking during the kvm_gfn_map() and use SRCU to
>>     synchronize the PSC and gfn mapping.
>>   * Allow PSC on the registered hva range only.
>>   * Add support for the Preferred GPA VMGEXIT.
>>   * Simplify the PSC handling routines.
>>   * Use the static_call() for the newly added kvm_x86_ops.
>>   * Remove the long-lived GHCB map.
>>   * Move the snp enable module parameter to the end of the file.
>>   * Remove the kvm_x86_op for the RMP fault handling. Call the
>>     fault handler directly from the #NPF interception.
>>
>> Changes since v3:
>>   * Add support for extended guest message request.
>>   * Add ioctl to query the SNP Platform status.
>>   * Add ioctl to get and set the SNP config.
>>   * Add check to verify that memory reserved for the RMP covers the full system RAM.
>>   * Start the SNP specific commands from 256 instead of 255.
>>   * Multiple cleanup and fixes based on the review feedback.
>>
>> Changes since v2:
>>   * Add AP creation support.
>>   * Drop the patch to handle the RMP fault for the kernel address.
>>   * Add functions to track the write access from the hypervisor.
>>   * Do not enable the SNP feature when IOMMU is disabled or is in passthrough mode.
>>   * Dump the RMP entry on RMP violation for the debug.
>>   * Shorten the GHCB macro names.
>>   * Start the SNP_INIT command id from 255 to give some gap for the legacy SEV.
>>   * Sync the header with the latest 0.9 SNP spec.
>>
>> Changes since v1:
>>   * Add AP reset MSR protocol VMGEXIT NAE.
>>   * Add Hypervisor features VMGEXIT NAE.
>>   * Move the RMP table initialization and RMPUPDATE/PSMASH helper in
>>     arch/x86/kernel/sev.c.
>>   * Add support to map/unmap SEV legacy command buffer to firmware state when
>>     SNP is active.
>>   * Enhance PSP driver to provide helper to allocate/free memory used for the
>>     firmware context page.
>>   * Add support to handle RMP fault for the kernel address.
>>   * Add support to handle GUEST_REQUEST NAE event for attestation.
>>   * Rename RMP table lookup helper.
>>   * Drop typedef from rmpentry struct definition.
>>   * Drop SNP static key and use cpu_feature_enabled() to check whether SEV-SNP
>>     is active.
>>   * Multiple cleanup/fixes to address Boris review feedback.
>>
>> Brijesh Singh (40):
>>    x86/cpufeatures: Add SEV-SNP CPU feature
>>    iommu/amd: Introduce function to check SEV-SNP support
>>    x86/sev: Add the host SEV-SNP initialization support
>>    x86/sev: Add RMP entry lookup helpers
>>    x86/sev: Add helper functions for RMPUPDATE and PSMASH instruction
>>    x86/sev: Invalid pages from direct map when adding it to RMP table
>>    x86/traps: Define RMP violation #PF error code
>>    x86/fault: Add support to handle the RMP fault for user address
>>    x86/fault: Add support to dump RMP entry on fault
>>    crypto: ccp: shutdown SEV firmware on kexec
>>    crypto:ccp: Define the SEV-SNP commands
>>    crypto: ccp: Add support to initialize the AMD-SP for SEV-SNP
>>    crypto:ccp: Provide APIs to issue SEV-SNP commands
>>    crypto: ccp: Handle the legacy TMR allocation when SNP is enabled
>>    crypto: ccp: Handle the legacy SEV command when SNP is enabled
>>    crypto: ccp: Add the SNP_PLATFORM_STATUS command
>>    crypto: ccp: Add the SNP_{SET,GET}_EXT_CONFIG command
>>    crypto: ccp: Provide APIs to query extended attestation report
>>    KVM: SVM: Provide the Hypervisor Feature support VMGEXIT
>>    KVM: SVM: Make AVIC backing, VMSA and VMCB memory allocation SNP safe
>>    KVM: SVM: Add initial SEV-SNP support
>>    KVM: SVM: Add KVM_SNP_INIT command
>>    KVM: SVM: Add KVM_SEV_SNP_LAUNCH_START command
>>    KVM: SVM: Add KVM_SEV_SNP_LAUNCH_UPDATE command
>>    KVM: SVM: Mark the private vma unmerable for SEV-SNP guests
>>    KVM: SVM: Add KVM_SEV_SNP_LAUNCH_FINISH command
>>    KVM: X86: Keep the NPT and RMP page level in sync
>>    KVM: x86: Introduce kvm_mmu_get_tdp_walk() for SEV-SNP use
>>    KVM: x86: Define RMP page fault error bits for #NPF
>>    KVM: x86: Update page-fault trace to log full 64-bit error code
>>    KVM: SVM: Do not use long-lived GHCB map while setting scratch area
>>    KVM: SVM: Remove the long-lived GHCB host map
>>    KVM: SVM: Add support to handle GHCB GPA register VMGEXIT
>>    KVM: SVM: Add support to handle MSR based Page State Change VMGEXIT
>>    KVM: SVM: Add support to handle Page State Change VMGEXIT
>>    KVM: SVM: Introduce ops for the post gfn map and unmap
>>    KVM: x86: Export the kvm_zap_gfn_range() for the SNP use
>>    KVM: SVM: Add support to handle the RMP nested page fault
>>    KVM: SVM: Provide support for SNP_GUEST_REQUEST NAE event
>>    KVM: SVM: Add module parameter to enable the SEV-SNP
>>
>> Sean Christopherson (2):
>>    KVM: x86/mmu: Move 'pfn' variable to caller of direct_page_fault()
>>    KVM: x86/mmu: Introduce kvm_mmu_map_tdp_page() for use by TDX and SNP
>>
>> Tom Lendacky (3):
>>    KVM: SVM: Add support to handle AP reset MSR protocol
>>    KVM: SVM: Use a VMSA physical address variable for populating VMCB
>>    KVM: SVM: Support SEV-SNP AP Creation NAE event
>>
>>   Documentation/virt/coco/sevguest.rst          |   55 +
>>   .../virt/kvm/amd-memory-encryption.rst        |  102 +
>>   arch/x86/include/asm/cpufeatures.h            |    1 +
>>   arch/x86/include/asm/disabled-features.h      |    8 +-
>>   arch/x86/include/asm/kvm-x86-ops.h            |    5 +
>>   arch/x86/include/asm/kvm_host.h               |   20 +
>>   arch/x86/include/asm/msr-index.h              |    6 +
>>   arch/x86/include/asm/sev-common.h             |   28 +
>>   arch/x86/include/asm/sev.h                    |   45 +
>>   arch/x86/include/asm/svm.h                    |    7 +
>>   arch/x86/include/asm/trap_pf.h                |   18 +-
>>   arch/x86/kernel/cpu/amd.c                     |    3 +-
>>   arch/x86/kernel/sev.c                         |  361 ++++
>>   arch/x86/kvm/lapic.c                          |    5 +-
>>   arch/x86/kvm/mmu.h                            |    7 +-
>>   arch/x86/kvm/mmu/mmu.c                        |   84 +-
>>   arch/x86/kvm/svm/sev.c                        | 1676 ++++++++++++++++-
>>   arch/x86/kvm/svm/svm.c                        |   62 +-
>>   arch/x86/kvm/svm/svm.h                        |   74 +-
>>   arch/x86/kvm/trace.h                          |   40 +-
>>   arch/x86/kvm/x86.c                            |   92 +-
>>   arch/x86/mm/fault.c                           |   84 +-
>>   drivers/crypto/ccp/sev-dev.c                  |  924 ++++++++-
>>   drivers/crypto/ccp/sev-dev.h                  |   17 +
>>   drivers/crypto/ccp/sp-pci.c                   |   12 +
>>   drivers/iommu/amd/init.c                      |   30 +
>>   include/linux/iommu.h                         |    9 +
>>   include/linux/mm.h                            |    6 +-
>>   include/linux/psp-sev.h                       |  346 ++++
>>   include/linux/sev.h                           |   32 +
>>   include/uapi/linux/kvm.h                      |   56 +
>>   include/uapi/linux/psp-sev.h                  |   60 +
>>   mm/memory.c                                   |   13 +
>>   tools/arch/x86/include/asm/cpufeatures.h      |    1 +
>>   34 files changed, 4088 insertions(+), 201 deletions(-)
>>   create mode 100644 include/linux/sev.h
>>
>> --
>> 2.17.1
>>
Vlastimil Babka Nov. 22, 2021, 5:03 p.m. UTC | #56
On 11/22/21 16:23, Brijesh Singh wrote:
> Hi Peter,
> 
> On 11/12/21 9:43 AM, Peter Gonda wrote:
>> Hi Brijesh,,
>>
>> One high level discussion I'd like to have on these SNP KVM patches.
>>
>> In these patches (V5) if a host userspace process writes a guest
>> private page a SIGBUS is issued to that process. If the kernel writes
>> a guest private page then the kernel panics due to the unhandled RMP
>> fault page fault. This is an issue because not all writes into guest
>> memory may come from a bug in the host. For instance a malicious or
>> even buggy guest could easily point the host to writing a private page
>> during the emulation of many virtual devices (virtio, NVMe, etc). For
>> example if a well behaved guests behavior is to: start up a driver,
>> select some pages to share with the guest, ask the host to convert
>> them to shared, then use those pages for virtual device DMA, if a
>> buggy guest forget the step to request the pages be converted to
>> shared its easy to see how the host could rightfully write to private
>> memory. I think we can better guarantee host reliability when running
>> SNP guests without changing SNP’s security properties.
>>
>> Here is an alternative to the current approach: On RMP violation (host
>> or userspace) the page fault handler converts the page from private to
>> shared to allow the write to continue. This pulls from s390’s error
>> handling which does exactly this. See ‘arch_make_page_accessible()’.
>> Additionally it adds less complexity to the SNP kernel patches, and
>> requires no new ABI.
>>
>> In the current (V5) KVM implementation if a userspace process
>> generates an RMP violation (writes to guest private memory) the
>> process receives a SIGBUS. At first glance, it would appear that
>> user-space shouldn’t write to private memory. However, guaranteeing
>> this in a generic fashion requires locking the RMP entries (via locks
>> external to the RMP). Otherwise, a user-space process emulating a
>> guest device IO may be vulnerable to having the guest memory
>> (maliciously or by guest bug) converted to private while user-space
>> emulation is happening. This results in a well behaved userspace
>> process receiving a SIGBUS.
>>
>> This proposal allows buggy and malicious guests to run under SNP
>> without jeopardizing the reliability / safety of host processes. This
>> is very important to a cloud service provider (CSP) since it’s common
>> to have host wide daemons that write/read all guests, i.e. a single
>> process could manage the networking for all VMs on the host. Crashing
>> that singleton process kills networking for all VMs on the system.
>>
> Thank you for starting the thread; based on the discussion, I am keeping the
> current implementation as-is and *not* going with the auto conversion from
> private to shared. To summarize what we are doing in the current SNP series:
> 
> - If userspace accesses guest private memory, it gets SIGBUS.

So, is there anything protecting host userspace processes from malicious guests?

> - If kernel accesses[*] guest private memory, it does panic.
> 
> [*] Kernel consults the RMP table for the page ownership before the access.
> If the page is shared, then it uses the locking mechanism to ensure that a
> guest will not be able to change the page ownership while kernel has it mapped.
> 
> thanks
>
Brijesh Singh Nov. 22, 2021, 6:01 p.m. UTC | #57
On 11/22/21 11:03 AM, Vlastimil Babka wrote:
> On 11/22/21 16:23, Brijesh Singh wrote:
>> Hi Peter,
>>
>> On 11/12/21 9:43 AM, Peter Gonda wrote:
>>> Hi Brijesh,,
>>>
>>> One high level discussion I'd like to have on these SNP KVM patches.
>>>
>>> In these patches (V5) if a host userspace process writes a guest
>>> private page a SIGBUS is issued to that process. If the kernel writes
>>> a guest private page then the kernel panics due to the unhandled RMP
>>> fault page fault. This is an issue because not all writes into guest
>>> memory may come from a bug in the host. For instance a malicious or
>>> even buggy guest could easily point the host to writing a private page
>>> during the emulation of many virtual devices (virtio, NVMe, etc). For
>>> example if a well behaved guests behavior is to: start up a driver,
>>> select some pages to share with the guest, ask the host to convert
>>> them to shared, then use those pages for virtual device DMA, if a
>>> buggy guest forget the step to request the pages be converted to
>>> shared its easy to see how the host could rightfully write to private
>>> memory. I think we can better guarantee host reliability when running
>>> SNP guests without changing SNP’s security properties.
>>>
>>> Here is an alternative to the current approach: On RMP violation (host
>>> or userspace) the page fault handler converts the page from private to
>>> shared to allow the write to continue. This pulls from s390’s error
>>> handling which does exactly this. See ‘arch_make_page_accessible()’.
>>> Additionally it adds less complexity to the SNP kernel patches, and
>>> requires no new ABI.
>>>
>>> In the current (V5) KVM implementation if a userspace process
>>> generates an RMP violation (writes to guest private memory) the
>>> process receives a SIGBUS. At first glance, it would appear that
>>> user-space shouldn’t write to private memory. However, guaranteeing
>>> this in a generic fashion requires locking the RMP entries (via locks
>>> external to the RMP). Otherwise, a user-space process emulating a
>>> guest device IO may be vulnerable to having the guest memory
>>> (maliciously or by guest bug) converted to private while user-space
>>> emulation is happening. This results in a well behaved userspace
>>> process receiving a SIGBUS.
>>>
>>> This proposal allows buggy and malicious guests to run under SNP
>>> without jeopardizing the reliability / safety of host processes. This
>>> is very important to a cloud service provider (CSP) since it’s common
>>> to have host wide daemons that write/read all guests, i.e. a single
>>> process could manage the networking for all VMs on the host. Crashing
>>> that singleton process kills networking for all VMs on the system.
>>>
>> Thank you for starting the thread; based on the discussion, I am keeping the
>> current implementation as-is and *not* going with the auto conversion from
>> private to shared. To summarize what we are doing in the current SNP series:
>>
>> - If userspace accesses guest private memory, it gets SIGBUS.
> 
> So, is there anything protecting host userspace processes from malicious guests?
> 

Unfortunately, no.

In the future, we could look into Sean's suggestion to come up with an 
ABI that userspace can use to lock the guest pages before the access 
and notify the caller of the access violation. It seems that TDX may 
need something similar, but I cannot tell for sure. This proposal seems 
good at first glance, but the devil is in the details; once implemented 
we also need to measure its performance implications.

Should we consider using SIGSEGV (SEGV_ACCERR) instead of SIGBUS? In 
other words, treat a guest's private pages as read-only, so that 
writing to them generates a standard SIGSEGV.

thanks


>> - If kernel accesses[*] guest private memory, it does panic.
>>
>> [*] Kernel consults the RMP table for the page ownership before the access.
>> If the page is shared, then it uses the locking mechanism to ensure that a
>> guest will not be able to change the page ownership while kernel has it mapped.
>>
>> thanks
>>
Dave Hansen Nov. 22, 2021, 6:30 p.m. UTC | #58
On 11/22/21 7:23 AM, Brijesh Singh wrote:
> Thank you for starting the thread; based on the discussion, I am keeping
> the current implementation as-is and *not* going with the auto
> conversion from private to shared. To summarize what we are doing in the
> current SNP series:
> 
> - If userspace accesses guest private memory, it gets SIGBUS.
> - If kernel accesses[*] guest private memory, it does panic.

There's a subtlety here, though.  There are really three *different*
kinds of kernel accesses that matter:

1. Kernel bugs.  Kernel goes off and touches some guest private memory
   when it didn't mean to.  Say, it runs off the end of a slab page and
   runs into a guest page.  panic() is expected here.
2. Kernel accesses guest private memory via a userspace mapping, in a
   place where it is known to be accessing userspace and is prepared to
   fault.  copy_to_user() is the most straightforward example.  Kernel
   must *not* panic().  Returning an error to the syscall is a good
   way to handle these (if in a syscall).
3. Kernel accesses guest private memory via a kernel mapping.  This one
   is tricky.  These probably *do* result in a panic() today, but
   ideally shouldn't.

Could you explicitly clarify what the current behavior is?
Brijesh Singh Nov. 22, 2021, 7:06 p.m. UTC | #59
On 11/22/21 12:30 PM, Dave Hansen wrote:
> On 11/22/21 7:23 AM, Brijesh Singh wrote:
>> Thank you for starting the thread; based on the discussion, I am keeping
>> the current implementation as-is and *not* going with the auto
>> conversion from private to shared. To summarize what we are doing in the
>> current SNP series:
>>
>> - If userspace accesses guest private memory, it gets SIGBUS.
>> - If kernel accesses[*] guest private memory, it does panic.
> There's a subtlety here, though.  There are really three *different*
> kinds of kernel accesses that matter:
> 1. Kernel bugs.  Kernel goes off and touches some guest private memory
>    when it didn't mean to.  Say, it runs off the end of a slab page and
>    runs into a guest page.  panic() is expected here.

In the current implementation, a write to guest private memory will
trigger a kernel panic().


> 2. Kernel accesses guest private memory via a userspace mapping, in a
>    place where it is known to be accessing userspace and is prepared to
>    fault.  copy_to_user() is the most straightforward example.  Kernel
>    must *not* panic().  Returning an error to the syscall is a good
>    way to handle these (if in a syscall).

In the current implementation, copy_to_user() on guest private memory
fails with -EFAULT.


> 3. Kernel accesses guest private memory via a kernel mapping.  This one
>    is tricky.  These probably *do* result in a panic() today, but
>    ideally shouldn't.

KVM has defined some helper functions to map and unmap the guest pages.
Those helper functions do the GPA to PFN lookup before calling the
kmap(). The helpers are enhanced to check the RMP table before the
kmap() and to acquire a lock that prevents a page state change until
kunmap() is called. So, in the current implementation, we should *not*
see a panic() unless there is a KVM driver bug that didn't use the
helper functions, or a bug in the helper functions themselves.
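
For the archive, the pattern described above looks roughly like this
(pure pseudocode, not the actual patch; lock_rmp_entry() and
rmp_entry_private() are made-up names standing in for the real RMP
lookup and locking helpers):

```c
/* Pseudocode sketch only -- illustrates check-then-lock, not real code. */
static void *snp_safe_map_gfn(struct kvm *kvm, gfn_t gfn)
{
	kvm_pfn_t pfn = gfn_to_pfn(kvm, gfn);

	lock_rmp_entry(pfn);			/* block page state changes */
	if (rmp_entry_private(pfn)) {		/* consult the RMP first */
		unlock_rmp_entry(pfn);
		return NULL;			/* caller sees an error, no panic */
	}

	return kmap(pfn_to_page(pfn));		/* page stays shared until unmap */
}

static void snp_safe_unmap_gfn(kvm_pfn_t pfn, void *addr)
{
	kunmap(pfn_to_page(pfn));
	unlock_rmp_entry(pfn);			/* guest may flip the page again */
}
```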


> Could you explicitly clarify what the current behavior is?
Dave Hansen Nov. 22, 2021, 7:14 p.m. UTC | #60
On 11/22/21 11:06 AM, Brijesh Singh wrote:
>> 3. Kernel accesses guest private memory via a kernel mapping.  This one
>>    is tricky.  These probably *do* result in a panic() today, but
>>    ideally shouldn't.
> KVM has defined some helper functions to maps and unmap the guest pages.
> Those helper functions do the GPA to PFN lookup before calling the
> kmap(). Those helpers are enhanced such that it check the RMP table
> before the kmap() and acquire a lock to prevent a page state change
> until the kunmap() is called. So, in the current implementation, we
> should *not* see a panic() unless there is a KVM driver bug that didn't
> use the helper functions or a bug in the helper function itself.

I don't think this is really KVM specific.

Think of a remote process doing ptrace(PTRACE_POKEUSER) or pretty much
any generic get_user_pages() instance.  As long as the memory is mapped
into the page tables, you're exposed to users that walk the page tables.

How do we, for example, prevent ptrace() from inducing a panic()?
Brijesh Singh Nov. 22, 2021, 8:33 p.m. UTC | #61
On 11/22/21 1:14 PM, Dave Hansen wrote:
> On 11/22/21 11:06 AM, Brijesh Singh wrote:
>>> 3. Kernel accesses guest private memory via a kernel mapping.  This one
>>>     is tricky.  These probably *do* result in a panic() today, but
>>>     ideally shouldn't.
>> KVM has defined some helper functions to maps and unmap the guest pages.
>> Those helper functions do the GPA to PFN lookup before calling the
>> kmap(). Those helpers are enhanced such that it check the RMP table
>> before the kmap() and acquire a lock to prevent a page state change
>> until the kunmap() is called. So, in the current implementation, we
>> should *not* see a panic() unless there is a KVM driver bug that didn't
>> use the helper functions or a bug in the helper function itself.
> 
> I don't think this is really KVM specific.
> 
> Think of a remote process doing ptrace(PTRACE_POKEUSER) or pretty much
> any generic get_user_pages() instance.  As long as the memory is mapped
> into the page tables, you're exposed to users that walk the page tables.
> 
> How do we, for example, prevent ptrace() from inducing a panic()?
> 

In the current approach, this access will induce a panic(). In general, 
supporting ptrace() for the encrypted VM region is going to be 
difficult. The upcoming TDX work to unmap the guest memory region from 
the current process page table can easily be extended to SNP to cover 
the current limitations.

thanks
Sean Christopherson Nov. 22, 2021, 9:34 p.m. UTC | #62
On Mon, Nov 22, 2021, Brijesh Singh wrote:
> 
> On 11/22/21 1:14 PM, Dave Hansen wrote:
> > On 11/22/21 11:06 AM, Brijesh Singh wrote:
> > > > 3. Kernel accesses guest private memory via a kernel mapping.  This one
> > > >     is tricky.  These probably *do* result in a panic() today, but
> > > >     ideally shouldn't.
> > > KVM has defined some helper functions to maps and unmap the guest pages.
> > > Those helper functions do the GPA to PFN lookup before calling the
> > > kmap(). Those helpers are enhanced such that it check the RMP table
> > > before the kmap() and acquire a lock to prevent a page state change
> > > until the kunmap() is called. So, in the current implementation, we
> > > should *not* see a panic() unless there is a KVM driver bug that didn't
> > > use the helper functions or a bug in the helper function itself.
> > 
> > I don't think this is really KVM specific.
> > 
> > Think of a remote process doing ptrace(PTRACE_POKEUSER) or pretty much
> > any generic get_user_pages() instance.  As long as the memory is mapped
> > into the page tables, you're exposed to users that walk the page tables.
> > 
> > How do we, for example, prevent ptrace() from inducing a panic()?
> > 
> 
> In the current approach, this access will induce a panic(). In general,
> supporting the ptrace() for the encrypted VM region is going to be
> difficult.

But ptrace() is just an example, any path in the kernel that accesses a gup'd
page through a kernel mapping will explode if handed a guest private page.

> The upcoming TDX work to unmap the guest memory region from the current process
> page table can easily extend for the SNP to cover the current limitations.

That represents an ABI change though.  If KVM allows userspace to create SNP guests
without any guarantees that userspace cannot coerce the kernel into accessing guest
private memory, then we are stuck supporting that behavior even if KVM later gains
the ability to provide such guarantees through new APIs.

If allowing this behavior was only a matter of the system admin opting into a
dangerous configuration, I would probably be ok merging SNP with it buried behind
EXPERT or something scarier, but this impacts KVM's ABI as well as kernel internals,
e.g. the hooks in kvm_vcpu_map() and friends are unnecessary if KVM can differentiate
between shared and private gfns in its memslots, as gfn_to_pfn() will either fail or
point at memory that is guaranteed to be in the shared state.
Dave Hansen Nov. 22, 2021, 10:51 p.m. UTC | #63
On 11/22/21 12:33 PM, Brijesh Singh wrote:
>> How do we, for example, prevent ptrace() from inducing a panic()?
> 
> In the current approach, this access will induce a panic(). 

That needs to get fixed before SEV-SNP is merged, IMNHO.  This behavior
would effectively mean that any userspace given access to create SNP
guests would panic the kernel.

> In general, supporting the ptrace() for the encrypted VM region is
> going to be difficult.

By "supporting", do you mean doing something functional?  I don't really
care if ptrace() to guest private memory returns -EINVAL or whatever.
The most important thing is not crashing the host.

Also, as Sean mentioned, this isn't really about ptrace() itself.  It's
really about ensuring that no kernel or devices accesses to guest
private memory can induce bad behavior.

> The upcoming TDX work to unmap the guest memory region from the
> current process page table can easily extend for the SNP to cover the
> current limitations.

My preference would be that we never have SEV-SNP code in the kernel
that can panic() the host from guest userspace.  If that means waiting
until there's common guest unmapping infrastructure around, then I think
we should wait.
Tony Luck Nov. 23, 2021, 5:15 a.m. UTC | #64
> My preference would be that we never have SEV-SNP code in the kernel
> that can panic() the host from guest userspace.  If that means waiting
> until there's common guest unmapping infrastructure around, then I think
> we should wait.

Perhaps I'm missing some context ... but guests must NEVER be allowed to
panic the host.

-Tony
Borislav Petkov Nov. 23, 2021, 7:18 a.m. UTC | #65
On Mon, Nov 22, 2021 at 02:51:35PM -0800, Dave Hansen wrote:
> By "supporting", do you mean doing something functional?  I don't really
> care if ptrace() to guest private memory returns -EINVAL or whatever.
> The most important thing is not crashing the host.
> 
> Also, as Sean mentioned, this isn't really about ptrace() itself.  It's
> really about ensuring that no kernel or devices accesses to guest
> private memory can induce bad behavior.

I keep repeating this suggestion of mine that we should treat
guest-private pages as hw-poisoned pages which have experienced an
uncorrectable error in the past.

mm already knows how to stay away from those.
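In kernel-style pseudocode, that idea might look like the sketch below.
The helper names are hypothetical, loosely modeled on Kirill's earlier
RFC; this is not an API in this series:

```c
/*
 * Reuse the memory-failure machinery to keep the rest of mm away from
 * guest-private pages: mark them the same way a page with a real
 * uncorrectable error would be marked.
 */
static void kvm_protect_guest_page(struct page *page)
{
	/*
	 * gup(), ptrace() and friends already refuse to touch
	 * hwpoisoned pages, so this blocks stray host accesses.
	 */
	SetPageHWPoison(page);
}

static void kvm_unprotect_guest_page(struct page *page)
{
	/* Page went private->shared again; mm may touch it. */
	ClearPageHWPoison(page);
}
```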
Vlastimil Babka Nov. 23, 2021, 8:55 a.m. UTC | #66
On 11/22/21 23:51, Dave Hansen wrote:
> On 11/22/21 12:33 PM, Brijesh Singh wrote:
>>> How do we, for example, prevent ptrace() from inducing a panic()?
>> 
>> In the current approach, this access will induce a panic(). 
> 
> That needs to get fixed before SEV-SNP is merged, IMHO.  This behavior
> would effectively mean that any userspace given access to create SNP
> guests would panic the kernel.
> 
>> In general, supporting the ptrace() for the encrypted VM region is
>> going to be difficult.
> 
> By "supporting", do you mean doing something functional?  I don't really
> care if ptrace() to guest private memory returns -EINVAL or whatever.
> The most important thing is not crashing the host.
> 
> Also, as Sean mentioned, this isn't really about ptrace() itself.  It's
> really about ensuring that no kernel or device accesses to guest
> private memory can induce bad behavior.

Then we need gup to block any changes from shared to guest private? I assume
there will be the usual issues of recognizing temporary elevated refcount vs
long-term gup, etc.

>> The upcoming TDX work to unmap the guest memory region from the
>> current process page table can easily extend for the SNP to cover the
>> current limitations.

By "current process page table" you mean userspace page tables?

> My preference would be that we never have SEV-SNP code in the kernel
> that can panic() the host from guest userspace.  If that means waiting
> until there's common guest unmapping infrastructure around, then I think
> we should wait.
>
Sean Christopherson Nov. 23, 2021, 3:36 p.m. UTC | #67
On Tue, Nov 23, 2021, Borislav Petkov wrote:
> On Mon, Nov 22, 2021 at 02:51:35PM -0800, Dave Hansen wrote:
> > By "supporting", do you mean doing something functional?  I don't really
> > care if ptrace() to guest private memory returns -EINVAL or whatever.
> > The most important thing is not crashing the host.
> > 
> > Also, as Sean mentioned, this isn't really about ptrace() itself.  It's
> > really about ensuring that no kernel or device accesses to guest
> > private memory can induce bad behavior.
> 
> I keep repeating this suggestion of mine that we should treat
> guest-private pages as hw-poisoned pages which have experienced an
> uncorrectable error in the past.
> 
> mm already knows how to stay away from those.

Kirill posted a few RFCs that did exactly that.  It's definitely a viable approach,
but it's a bit of a dead end, e.g. doesn't help solve page migration, is limited to
struct page, doesn't capture which KVM guest owns the memory, etc...

https://lore.kernel.org/kvm/20210416154106.23721-1-kirill.shutemov@linux.intel.com/
Borislav Petkov Nov. 23, 2021, 4:26 p.m. UTC | #68
On Tue, Nov 23, 2021 at 03:36:35PM +0000, Sean Christopherson wrote:
> Kirill posted a few RFCs that did exactly that.  It's definitely a viable approach,
> but it's a bit of a dead end,

One thing at a time...

> e.g. doesn't help solve page migration,

AFAICR, that needs a whole explicit and concerted effort with the
migration helper - that was one of the approaches, at least, guest's
explicit involvement, remote attestation and a bunch of other things...

> is limited to struct page

I'm no mm guy so maybe you can elaborate further.

> doesn't capture which KVM guest owns the memory, etc...

So I don't think we need this for the problem at hand. But from the
sound of it, it probably is a good idea to be able to map the guest
owner to the memory anyway.

> https://lore.kernel.org/kvm/20210416154106.23721-1-kirill.shutemov@linux.intel.com/

Right, there it is in the last patch.

Hmmkay, so we need some generic machinery which unmaps memory from
the host kernel's pagetables so that it doesn't do any stray/unwanted
accesses to it. I'd look in the direction of mm folks for what to do
exactly, though.

Thx.
Joerg Roedel Nov. 24, 2021, 4:03 p.m. UTC | #69
On Mon, Nov 22, 2021 at 02:51:35PM -0800, Dave Hansen wrote:
> My preference would be that we never have SEV-SNP code in the kernel
> that can panic() the host from guest userspace.  If that means waiting
> until there's common guest unmapping infrastructure around, then I think
> we should wait.

Can you elaborate on how to crash the host kernel from guest user-space?
If I understood correctly, it was about crashing the host kernel from
_host_ user-space.

I think the RMP-fault path in the page-fault handler needs to take the
uaccess exception tables into account before actually causing a panic.
This should solve most of the problems discussed here.

Maybe we also need the previously suggested copy_from/to_guest()
interfaces.

Regards,
Dave Hansen Nov. 24, 2021, 5:48 p.m. UTC | #70
On 11/24/21 8:03 AM, Joerg Roedel wrote:
> On Mon, Nov 22, 2021 at 02:51:35PM -0800, Dave Hansen wrote:
>> My preference would be that we never have SEV-SNP code in the kernel
>> that can panic() the host from guest userspace.  If that means waiting
>> until there's common guest unmapping infrastructure around, then I think
>> we should wait.
> Can you elaborate how to crash host kernel from guest user-space? If I
> understood correctly it was about crashing host kernel from _host_
> user-space.

Sorry, I misspoke there.

My concern is about crashing the host kernel.  It appears that *host*
userspace can do that quite easily by inducing the host kernel to access
some guest private memory via a kernel mapping.

> I think the RMP-fault path in the page-fault handler needs to take the
> uaccess exception tables into account before actually causing a panic.
> This should solve most of the problems discussed here.

That covers things like copy_from_user().  It does not account for
things where kernel mappings are used, like where a
get_user_pages()/kmap() is in play.
Vlastimil Babka Nov. 24, 2021, 7:34 p.m. UTC | #71
On 11/24/21 18:48, Dave Hansen wrote:
> On 11/24/21 8:03 AM, Joerg Roedel wrote:
>> On Mon, Nov 22, 2021 at 02:51:35PM -0800, Dave Hansen wrote:
>>> My preference would be that we never have SEV-SNP code in the kernel
>>> that can panic() the host from guest userspace.  If that means waiting
>>> until there's common guest unmapping infrastructure around, then I think
>>> we should wait.
>> Can you elaborate how to crash host kernel from guest user-space? If I
>> understood correctly it was about crashing host kernel from _host_
>> user-space.
> 
> Sorry, I misspoke there.
> 
> My concern is about crashing the host kernel.  It appears that *host*
> userspace can do that quite easily by inducing the host kernel to access
> some guest private memory via a kernel mapping.

I thought some of the scenarios discussed here also went along "guest
(doesn't matter if userspace or kernel) shares a page with host, invokes
some host kernel operation and in parallel makes the page private again".

>> I think the RMP-fault path in the page-fault handler needs to take the
>> uaccess exception tables into account before actually causing a panic.
>> This should solve most of the problems discussed here.
> 
> That covers things like copy_from_user().  It does not account for
> things where kernel mappings are used, like where a
> get_user_pages()/kmap() is in play.
>
Joerg Roedel Nov. 25, 2021, 10:05 a.m. UTC | #72
On Wed, Nov 24, 2021 at 09:48:14AM -0800, Dave Hansen wrote:
> That covers things like copy_from_user().  It does not account for
> things where kernel mappings are used, like where a
> get_user_pages()/kmap() is in play.

The kmap case is guarded by KVM code, which locks the page first so that
the guest can't change the page state, then checks the page state, and
if it is shared does the kmap and the access.
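A sketch of that lock/check/map sequence, with hypothetical helper
names (snp_lock_page(), snp_page_is_shared()) standing in for the
series' actual locking:

```c
/*
 * Guarded access pattern for KVM's own reads of guest memory: lock the
 * page so the guest cannot flip its state, verify it is shared, and
 * only then kmap and touch it.
 */
static int kvm_read_shared_page(struct page *page, void *dst, size_t len)
{
	void *vaddr;
	int ret = 0;

	snp_lock_page(page);		/* blocks shared->private flips */

	if (!snp_page_is_shared(page)) {
		ret = -EPERM;		/* never touch private memory */
		goto out;
	}

	vaddr = kmap_local_page(page);
	memcpy(dst, vaddr, len);
	kunmap_local(vaddr);
out:
	snp_unlock_page(page);
	return ret;
}
```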

This should turn an RMP fault in the kernel which is not covered in the
uaccess exception table into a fatal error.

Regards,
Brijesh Singh Nov. 29, 2021, 2:44 p.m. UTC | #73
On 11/25/21 4:05 AM, Joerg Roedel wrote:
> On Wed, Nov 24, 2021 at 09:48:14AM -0800, Dave Hansen wrote:
>> That covers things like copy_from_user().  It does not account for
>> things where kernel mappings are used, like where a
>> get_user_pages()/kmap() is in play.
> 
> The kmap case is guarded by KVM code, which locks the page first so that
> the guest can't change the page state, then checks the page state, and
> if it is shared does the kmap and the access.


The KVM use-case is well covered in the series, but I believe Dave is 
highlighting what happens if the access occurs outside of the KVM 
driver (such as ptrace() or others).

One possible approach to fix this is to enlighten kmap()/kunmap(). 
Basically, move the per-page locking mechanism used by KVM into the 
arch-specific code and have kmap()/kunmap() call the arch hooks. The 
arch hooks will do this:

Before the map, check whether the page is marked shared in the RMP 
table. If it is not shared, then error.
Acquire a per-page map_lock.
Release the per-page map_lock on kunmap().

The current patch set provides helpers to change a page from private 
to shared. Enhance the helpers to check for the per-page map_lock; if 
the map_lock is held then do not allow changing the page from shared to 
private.

Thoughts ?
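Roughly, in kernel-style pseudocode (arch_kmap_enter()/arch_kmap_exit(),
page_map_lock() and rmpupdate_private() are all hypothetical names,
modeled on the helpers in this series):

```c
/* Called from kmap()/kmap_local_page() before the mapping is created. */
static int arch_kmap_enter(struct page *page)
{
	/* Refuse to map a page the RMP says is guest-private. */
	if (!rmp_page_is_shared(page))
		return -EPERM;

	/* Held until kunmap(); blocks shared->private conversion. */
	mutex_lock(page_map_lock(page));
	return 0;
}

/* Called from kunmap()/kunmap_local(). */
static void arch_kmap_exit(struct page *page)
{
	mutex_unlock(page_map_lock(page));
}

/* In the shared->private helper: */
static int rmp_make_private(struct page *page, u64 gpa, int level)
{
	int ret;

	/* Fail if someone still holds a kernel mapping of the page. */
	if (!mutex_trylock(page_map_lock(page)))
		return -EBUSY;

	ret = rmpupdate_private(page, gpa, level);
	mutex_unlock(page_map_lock(page));
	return ret;
}
```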

> 
> This should turn an RMP fault in the kernel which is not covered in the
> uaccess exception table into a fatal error.
> 
> Regards,
>
Vlastimil Babka Nov. 29, 2021, 2:58 p.m. UTC | #74
On 11/29/21 15:44, Brijesh Singh wrote:
> 
> 
> On 11/25/21 4:05 AM, Joerg Roedel wrote:
>> On Wed, Nov 24, 2021 at 09:48:14AM -0800, Dave Hansen wrote:
>>> That covers things like copy_from_user().  It does not account for
>>> things where kernel mappings are used, like where a
>>> get_user_pages()/kmap() is in play.
>>
>> The kmap case is guarded by KVM code, which locks the page first so that
>> the guest can't change the page state, then checks the page state, and
>> if it is shared does the kmap and the access.
> 
> 
> The KVM use-case is well covered in the series, but I believe Dave is
> highlighting what if the access happens outside of the KVM driver (such as a
> ptrace() or others).

AFAIU ptrace() is a scenario where the userspace mapping is being gup-ped,
not a kernel page being kmap()ed?

> One possible approach to fix this is to enlighten the kmap/unmap().
> Basically, move the per page locking mechanism used by the KVM in the
> arch-specific code and have kmap/kunmap() call the arch hooks. The arch
> hooks will do this:
> 
> Before the map, check whether the page is added as a shared in the RMP
> table. If not shared, then error.
> Acquire a per-page map_lock.
> Release the per-page map_lock on the kunmap().
> 
> The current patch set provides helpers to change the page from private to
> shared. Enhance the helpers to check for the per-page map_lock, if the
> map_lock is held then do not allow changing the page from shared to private.

That could work for the kmap() context.
What to do for the userspace context (host userspace)?
- shared->private transition - page has to be unmapped from all userspace,
elevated refcount (gup() in progress) can block this unmap until it goes
away - could be doable
- still, what to do if host userspace then tries to access the unmapped
page? SIGSEGV instead of SIGBUS and it can recover?



> Thoughts ?
> 
>>
>> This should turn an RMP fault in the kernel which is not covered in the
>> uaccess exception table into a fatal error.
>>
>> Regards,
>>
Brijesh Singh Nov. 29, 2021, 4:13 p.m. UTC | #75
On 11/29/21 8:58 AM, Vlastimil Babka wrote:
> On 11/29/21 15:44, Brijesh Singh wrote:
>>
>>
>> On 11/25/21 4:05 AM, Joerg Roedel wrote:
>>> On Wed, Nov 24, 2021 at 09:48:14AM -0800, Dave Hansen wrote:
>>>> That covers things like copy_from_user().  It does not account for
>>>> things where kernel mappings are used, like where a
>>>> get_user_pages()/kmap() is in play.
>>>
>>> The kmap case is guarded by KVM code, which locks the page first so that
>>> the guest can't change the page state, then checks the page state, and
>>> if it is shared does the kmap and the access.
>>
>>
>> The KVM use-case is well covered in the series, but I believe Dave is
>> highlighting what if the access happens outside of the KVM driver (such as a
>> ptrace() or others).
> 
> AFAIU ptrace() is a scenario where the userspace mapping is being gup-ped,
> not a kernel page being kmap()ed?
> 

Yes that is correct.

>> One possible approach to fix this is to enlighten the kmap/unmap().
>> Basically, move the per page locking mechanism used by the KVM in the
>> arch-specific code and have kmap/kunmap() call the arch hooks. The arch
>> hooks will do this:
>>
>> Before the map, check whether the page is added as a shared in the RMP
>> table. If not shared, then error.
>> Acquire a per-page map_lock.
>> Release the per-page map_lock on the kunmap().
>>
>> The current patch set provides helpers to change the page from private to
>> shared. Enhance the helpers to check for the per-page map_lock, if the
>> map_lock is held then do not allow changing the page from shared to private.
> 
> That could work for the kmap() context.
> What to do for the userspace context (host userspace)?
> - shared->private transition - page has to be unmapped from all userspace,
> elevated refcount (gup() in progress) can block this unmap until it goes
> away - could be doable

An unmap of the page from all userspace processes during the page 
state transition would be great. If we can somehow store the state 
information in the 'struct page' then it can later be used to make 
better decisions. I am not sure that relying on an elevated refcount is 
the correct approach, e.g. in the case of encrypted guests, the HV may 
pin the page to prevent it from migrating.

Thoughts on how you want to approach unmapping the page from the 
userspace page table?


> - still, what to do if host userspace then tries to access the unmapped
> page? SIGSEGV instead of SIGBUS and it can recover?
> 

Yes, SIGSEGV makes sense to me.


> 
> 
>> Thoughts ?
>>
>>>
>>> This should turn an RMP fault in the kernel which is not covered in the
>>> uaccess exception table into a fatal error.
>>>
>>> Regards,
>>>
>
Dave Hansen Nov. 29, 2021, 4:41 p.m. UTC | #76
On 11/25/21 2:05 AM, Joerg Roedel wrote:
> On Wed, Nov 24, 2021 at 09:48:14AM -0800, Dave Hansen wrote:
>> That covers things like copy_from_user().  It does not account for
>> things where kernel mappings are used, like where a
>> get_user_pages()/kmap() is in play.
> The kmap case is guarded by KVM code, which locks the page first so that
> the guest can't change the page state, then checks the page state, and
> if it is shared does the kmap and the access.
> 
> This should turn an RMP fault in the kernel which is not covered in the
> uaccess exception table into a fatal error.

Let's say something does process_vm_readv() where the pid is a qemu
process and it is writing to a guest private memory area.  The syscall
will eventually end up in process_vm_rw_single_vec() which does:

>                 pinned_pages = pin_user_pages_remote(mm, pa, pinned_pages,
>                                                      flags, process_pages,
>                                                      NULL, &locked);
...
>                 rc = process_vm_rw_pages(process_pages,
>                                          start_offset, bytes, iter,
>                                          vm_write);


and eventually in copy_page_from_iter():

>                 void *kaddr = kmap_local_page(page);
>                 size_t wanted = _copy_from_iter(kaddr + offset, bytes, i);
>                 kunmap_local(kaddr);

The kernel access to 'kaddr+offset' shouldn't fault.  How does the KVM
code thwart that kmap_local_page()?
Vlastimil Babka Nov. 30, 2021, 7:40 p.m. UTC | #77
On 11/29/21 17:13, Brijesh Singh wrote:
>>
>> That could work for the kmap() context.
>> What to do for the userspace context (host userspace)?
>> - shared->private transition - page has to be unmapped from all userspace,
>> elevated refcount (gup() in progress) can block this unmap until it goes
>> away - could be doable
> 
> An unmap of the page from all the userspace process during the page state
> transition will be great. If we can somehow store the state information in
> the 'struct page' then it can be later used to make better decision. I am
> not sure that relying on the elevated refcount is the correct approach. e.g
> in the case of encrypted guests, the HV may pin the page to prevent it from
> migration.
> 
> Thoughts on how you want to approach unmapping the page from userspace page
> table?

After giving it more thought and rereading the threads here it seems I
thought it would be easier than it really is, and it would have to be
something at least like Kirill's hwpoison based approach.

>> - still, what to do if host userspace then tries to access the unmapped
>> page? SIGSEGV instead of SIGBUS and it can recover?
>>
> 
> Yes, SIGSEGV makes sense to me.

OTOH the newer fd-based proposal also, IIUC, takes care of this part better -
the host userspace controls the guest's shared->private conversion requests,
so it can't be tricked into accessing a page that has changed under it.

>>
>>
>>> Thoughts ?
>>>
>>>>
>>>> This should turn an RMP fault in the kernel which is not covered in the
>>>> uaccess exception table into a fatal error.
>>>>
>>>> Regards,
>>>>
>>