[0/5] Enforce CPU cache flush for non-coherent device assignment

Message ID 20240507061802.20184-1-yan.y.zhao@intel.com (mailing list archive)

Yan Zhao May 7, 2024, 6:18 a.m. UTC
This is a follow-up series to fix the security risk for non-coherent device
assignment raised by Jason in [1].

When the IOMMU does not enforce cache coherency, devices are allowed to
perform non-coherent DMAs (DMAs that lack CPU cache snooping). This scenario
poses a risk of information leakage when the device is assigned to a VM.
Specifically, a malicious guest could potentially retrieve stale host data
through non-coherent DMA reads of physical memory, while data initialized
by the host (e.g., zeros) still resides in the cache.
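
The leak described above can be illustrated with a toy userspace model. This
is only a sketch: struct toy_line and the toy_* functions are hypothetical
names invented for illustration, and the model reduces a write-back cache to
a single line with a dirty bit.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define TOY_LINE 64

/* Toy model: one line of "physical memory" plus its write-back cache copy. */
struct toy_line {
	uint8_t mem[TOY_LINE];    /* what DRAM holds */
	uint8_t cache[TOY_LINE];  /* what the CPU cache holds */
	bool dirty;               /* cache holds data not yet written back */
};

/* CPU store: with write-back caching the data lands in the cache only. */
static void toy_cpu_fill(struct toy_line *l, uint8_t val)
{
	memset(l->cache, val, TOY_LINE);
	l->dirty = true;
}

/* Non-coherent DMA read: no snooping, so only DRAM contents are visible. */
static uint8_t toy_noncoherent_dma_read(const struct toy_line *l, size_t off)
{
	assert(off < TOY_LINE);
	return l->mem[off];
}

/* Flush: write the dirty line back to DRAM, as a cache flush would. */
static void toy_flush(struct toy_line *l)
{
	if (l->dirty) {
		memcpy(l->mem, l->cache, TOY_LINE);
		l->dirty = false;
	}
}
```

In this model, if DRAM holds old data and the host zeroes the page with
toy_cpu_fill(), toy_noncoherent_dma_read() still returns the old bytes until
toy_flush() runs.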

Furthermore, the host kernel (e.g., a ksm thread) might encounter
inconsistent data between the CPU cache and physical memory (left by a
malicious guest) after a page is unpinned for DMA but before the page is
recycled.

Therefore, a mitigation in VFIO/IOMMUFD is required to flush CPU caches on
pages involved in non-coherent DMAs, both before they are mapped into the
IOMMU and after they are unmapped from it.

The mitigation is not implemented in the DMA API layer, so as to avoid
slowing down DMA API users. Users of the DMA API are expected to take care
of CPU cache flushing in one of two ways: (a) by using the DMA API, which is
aware of the non-coherence and does the flushes internally, or (b) by being
aware of their own flushing needs and handling them themselves if they
override the platform using no-snoop. A general mitigation in the DMA API
layer will only come when non-coherent DMAs are common, which is not the
case today (currently only Intel GPUs and some ARM devices).

Nor is the mitigation implemented in the IOMMU core exclusively for VMs,
because that would force a large IOTLB flush range to be split, as the
IOMMU core has no information about the IOVA-PFN relationship.

Given that non-coherent devices exist on both x86 and ARM, this series
introduces an arch helper to flush CPU caches for non-coherent DMAs, which
is available to both VFIO and IOMMUFD, though currently only an
implementation for x86 is provided.


Series Layout:

Patch 1 first fixes an error in pat_pfn_immune_to_uc_mtrr() which always
        returns WB for untracked PAT ranges. This error leads to KVM
        treating all PFNs within these untracked PAT ranges as cacheable
        memory types, even when a PFN's MTRR type is UC. (One example is
        the VGA range 0xa0000-0xbffff.)
        Patch 3 will use pat_pfn_immune_to_uc_mtrr() to determine
        uncacheable PFNs.
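
        The difference Patch 1 describes can be sketched with a toy model.
        This is not the patch's code: every toy_* name below is
        hypothetical, and the untracked range (the legacy region below
        1 MiB) and the MTRR layout (UC for the VGA range, WB elsewhere) are
        assumptions chosen to match the example above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_PAGE_SHIFT 12

enum toy_memtype { TOY_WB, TOY_WC, TOY_UC_MINUS, TOY_UC };

/* Assumption: the legacy region below 1 MiB is not tracked by PAT. */
static bool toy_is_untracked_pat_range(uint64_t phys)
{
	return phys < 0x100000;
}

/* Assumption: MTRRs mark the VGA range UC and everything else WB. */
static enum toy_memtype toy_mtrr_type(uint64_t phys)
{
	if (phys >= 0xa0000 && phys <= 0xbffff)
		return TOY_UC;
	return TOY_WB;
}

/* Stand-in for a PAT lookup on tracked ranges; the toy always says WB. */
static enum toy_memtype toy_pat_type(uint64_t phys)
{
	(void)phys;
	return TOY_WB;
}

static bool toy_is_uncacheable(enum toy_memtype t)
{
	return t != TOY_WB;
}

/* Behaviour described as the error: untracked ranges always report WB. */
static bool toy_pfn_uncacheable_before(uint64_t pfn)
{
	uint64_t phys = pfn << TOY_PAGE_SHIFT;
	enum toy_memtype t = toy_is_untracked_pat_range(phys) ?
			     TOY_WB : toy_pat_type(phys);

	return toy_is_uncacheable(t);
}

/* Behaviour after the fix: untracked ranges consult the MTRR type. */
static bool toy_pfn_uncacheable_after(uint64_t pfn)
{
	uint64_t phys = pfn << TOY_PAGE_SHIFT;
	enum toy_memtype t = toy_is_untracked_pat_range(phys) ?
			     toy_mtrr_type(phys) : toy_pat_type(phys);

	return toy_is_uncacheable(t);
}
```

        For PFN 0xa0 (physical address 0xa0000), the "before" variant
        reports a cacheable type while the "after" variant reports an
        uncacheable one.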

Patch 2 is a side fix in KVM to prevent guest cacheable access to PFNs
        mapped as UC in host.

Patch 3 introduces and exports an arch helper arch_clean_nonsnoop_dma() to
        flush CPU cachelines. It takes physical address and size as inputs
        and provides an implementation for x86.
        Given that executing CLFLUSH on certain MMIO ranges on x86 can be
        problematic, potentially causing machine check exceptions on some
        platforms, while flushing is necessary on some other MMIO ranges
        (e.g., some MMIO ranges for PMEM), this patch determines
        cacheability by consulting the PAT (if enabled) or MTRR type (if
        PAT is disabled). It assesses whether a PFN is considered
        uncacheable by the host. For reserved pages or !pfn_valid() PFNs,
        CLFLUSH is avoided if the PFN is recognized as uncacheable on the
        host.
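
        The skip rule and the per-cacheline walk can be sketched as below.
        This is a userspace illustration under stated assumptions, not the
        helper itself: struct toy_pfn_info and the toy_* functions are
        hypothetical, the 64-byte line size is assumed, and the sketch only
        counts lines where a real helper would flush them.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_CACHELINE 64u  /* assumed cache line size */

/* Host view of one PFN; all fields are toy inputs. */
struct toy_pfn_info {
	bool valid;        /* stands in for pfn_valid() */
	bool reserved;     /* stands in for a reserved page */
	bool uncacheable;  /* host PAT/MTRR lookup says uncacheable */
};

/*
 * Skip the flush only for reserved or !valid PFNs that the host treats
 * as uncacheable; flush everything else.
 */
static bool toy_should_flush(const struct toy_pfn_info *p)
{
	if ((!p->valid || p->reserved) && p->uncacheable)
		return false;
	return true;
}

/* Number of cache lines touched by the byte range [phys, phys + size). */
static size_t toy_lines_to_flush(uint64_t phys, size_t size)
{
	uint64_t start, end;

	if (size == 0)
		return 0;
	assert(phys + size >= phys);  /* no wrap-around */

	start = phys & ~(uint64_t)(TOY_CACHELINE - 1);  /* align down */
	end = phys + size;
	return (size_t)((end - start + TOY_CACHELINE - 1) / TOY_CACHELINE);
}
```

        A 4 KiB page starting on a page boundary covers 64 such lines, and
        an unaligned 8-byte range starting at offset 60 covers two.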

Patches 4 and 5 implement a mitigation in vfio/iommufd to flush CPU caches
         - before a page is accessible to non-coherent DMAs,
         - after the page is inaccessible to non-coherent DMAs, and right
           before it's unpinned for DMAs.
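
         The ordering of the two flush points can be sketched as an event
         trace. The names here (toy_trace, toy_map_noncoherent(),
         toy_unmap_noncoherent()) are hypothetical, and placing the pin
         before the first flush is an assumption of this sketch rather than
         something stated above.

```c
#include <assert.h>
#include <stddef.h>

enum toy_event { EV_PIN, EV_FLUSH, EV_MAP, EV_UNMAP, EV_UNPIN };

#define TOY_MAX_EVENTS 16

struct toy_trace {
	enum toy_event ev[TOY_MAX_EVENTS];
	size_t n;
};

static void toy_record(struct toy_trace *t, enum toy_event e)
{
	assert(t->n < TOY_MAX_EVENTS);
	t->ev[t->n++] = e;
}

/* Map path: flush before the page becomes accessible to non-coherent DMA. */
static void toy_map_noncoherent(struct toy_trace *t)
{
	toy_record(t, EV_PIN);    /* assumed to precede the flush */
	toy_record(t, EV_FLUSH);
	toy_record(t, EV_MAP);    /* device can reach the page from here on */
}

/* Unmap path: flush once DMA can no longer reach the page, then unpin. */
static void toy_unmap_noncoherent(struct toy_trace *t)
{
	toy_record(t, EV_UNMAP);  /* device can no longer reach the page */
	toy_record(t, EV_FLUSH);  /* right before the unpin */
	toy_record(t, EV_UNPIN);
}
```

         Running the map path followed by the unmap path yields the trace
         pin, flush, map, unmap, flush, unpin.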


Performance data:

The overhead of flushing CPU caches is measured below:
CPU MHz:4494.377, 4 vCPU, 8G guest memory
Pass-through GPU: 1G aperture

Across each VM boot up and tear down,

IOMMUFD     |     Map        |   Unmap        | Teardown 
------------|----------------|----------------|-------------
w/o clflush | 1167M          |   40M          |  201M
w/  clflush | 2400M (+1233M) |  276M (+236M)  | 1160M (+959M)

Map = total cycles of iommufd_ioas_map() during VM boot up
Unmap = total cycles of iommufd_ioas_unmap() during VM boot up
Teardown = total cycles of iommufd_hwpt_paging_destroy() at VM teardown

VFIO        |     Map        |   Unmap        | Teardown 
------------|----------------|----------------|-------------
w/o clflush | 3058M          |  379M          |  448M
w/  clflush | 5664M (+2606M) | 1653M (+1274M) | 1522M (+1074M)

Map = total cycles of vfio_dma_do_map() during VM boot up
Unmap = total cycles of vfio_dma_do_unmap() during VM boot up
Teardown = total cycles of vfio_iommu_type1_detach_group() at VM teardown
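
To put the cycle counts above into wall-clock terms, the following sketch
converts them using the reported CPU frequency. It assumes the cycles were
counted at the listed 4494.377 MHz; the helper names are hypothetical.

```c
#include <assert.h>

/* CPU frequency from the measurement setup above, in MHz. */
#define TOY_CPU_MHZ 4494.377

/* Millions of cycles divided by MHz gives seconds; scale to milliseconds. */
static double toy_mcycles_to_ms(double mcycles)
{
	assert(mcycles >= 0.0);
	return mcycles / TOY_CPU_MHZ * 1000.0;
}

/* Relative overhead of the flushing variant over the baseline, in percent. */
static double toy_overhead_pct(double without_flush, double with_flush)
{
	assert(without_flush > 0.0);
	return (with_flush - without_flush) / without_flush * 100.0;
}
```

For example, toy_mcycles_to_ms(1233) puts the extra IOMMUFD map cost at
roughly 274 ms in total over a VM boot, and toy_overhead_pct(1167, 2400)
gives a relative increase of a little over 100%.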

[1] https://lore.kernel.org/lkml/20240109002220.GA439767@nvidia.com

Yan Zhao (5):
  x86/pat: Let pat_pfn_immune_to_uc_mtrr() check MTRR for untracked PAT
    range
  KVM: x86/mmu: Fine-grained check of whether a invalid & RAM PFN is
    MMIO
  x86/mm: Introduce and export interface arch_clean_nonsnoop_dma()
  vfio/type1: Flush CPU caches on DMA pages in non-coherent domains
  iommufd: Flush CPU caches on DMA pages in non-coherent domains

 arch/x86/include/asm/cacheflush.h       |  3 +
 arch/x86/kvm/mmu/spte.c                 | 14 +++-
 arch/x86/mm/pat/memtype.c               | 12 +++-
 arch/x86/mm/pat/set_memory.c            | 88 +++++++++++++++++++++++++
 drivers/iommu/iommufd/hw_pagetable.c    | 19 +++++-
 drivers/iommu/iommufd/io_pagetable.h    |  5 ++
 drivers/iommu/iommufd/iommufd_private.h |  1 +
 drivers/iommu/iommufd/pages.c           | 44 ++++++++++++-
 drivers/vfio/vfio_iommu_type1.c         | 51 ++++++++++++++
 include/linux/cacheflush.h              |  6 ++
 10 files changed, 237 insertions(+), 6 deletions(-)


base-commit: e67572cd2204894179d89bd7b984072f19313b03