[0/5] Make the iommu driver no-snoop block feature consistent

Message ID 0-v1-ef02c60ddb76+12ca2-intel_no_snoop_jgg@nvidia.com (mailing list archive)

Message

Jason Gunthorpe April 5, 2022, 4:15 p.m. UTC
PCIe defines a 'no-snoop' bit in each TLP which is usually implemented
by a platform as bypassing elements in the DMA coherent CPU cache
hierarchy. A driver can command a device to set this bit on some of its
transactions as a micro-optimization.
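
For illustration, the device-side knob is the "Enable No Snoop" bit in
the PCIe Device Control register. A minimal sketch of how a driver
could gate it with the standard PCI capability helpers (illustrative
only, not part of this series; the example_* names are made up):

 #include <linux/pci.h>

 /* Permit the device to emit TLPs with the no-snoop bit set */
 static void example_allow_no_snoop(struct pci_dev *pdev)
 {
	pcie_capability_set_word(pdev, PCI_EXP_DEVCTL,
				 PCI_EXP_DEVCTL_NOSNOOP_EN);
 }

 /* Force every TLP from the device to snoop the CPU caches */
 static void example_forbid_no_snoop(struct pci_dev *pdev)
 {
	pcie_capability_clear_word(pdev, PCI_EXP_DEVCTL,
				   PCI_EXP_DEVCTL_NOSNOOP_EN);
 }

Whether the device then sets no-snoop on a given transaction is
device-specific; the DEVCTL bit only permits it.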

However, the driver is now responsible for synchronizing the CPU cache
with the DMA that bypassed it. On x86 this is done through the wbinvd
instruction, and the i915 GPU driver is the only Linux DMA driver that
calls it.
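
As a concrete sketch of that synchronization on x86
(wbinvd_on_all_cpus() is the real arch helper; the wrapper is
illustrative):

 #include <asm/smp.h>

 /* After a device wrote memory with no-snoop TLPs the CPU may hold
  * stale cachelines for the buffer; write back and invalidate the
  * whole cache hierarchy on every CPU. */
 static void example_sync_cpu_cache_after_no_snoop(void)
 {
	wbinvd_on_all_cpus();
 }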

The problem is that KVM on x86 will normally disable the wbinvd
instruction in the guest and render it a NOP. As the driver running in
the guest is not aware that wbinvd doesn't work, it may still cause the
device to set the no-snoop bit, and the platform will bypass the CPU
cache. Without a working wbinvd there is no way to re-synchronize the
CPU cache and the driver in the VM is left with data corruption.

Thus, we see a general direction on x86 that the IOMMU HW is able to
block the no-snoop bit in the TLP. This NOPs the optimization and
allows KVM to NOP the wbinvd without causing any data corruption.

For the Intel IOMMU this control was exposed through IOMMU_CACHE and
IOMMU_CAP_CACHE_COHERENCY; however, these two values now have multiple
meanings and usages beyond blocking no-snoop, and the whole thing has
become confused.

Change it so that:
 - IOMMU_CACHE is only about the DMA coherence of normal DMAs from a
   device. It is used by the DMA API and set when the DMA API will not be
   doing manual cache coherency operations.

 - dev_is_dma_coherent() indicates if IOMMU_CACHE can be used with the
   device

 - The new optional domain op enforce_cache_coherency() will cause the
   entire domain to block no-snoop requests - ie there is no way for any
   device attached to the domain to opt out of the IOMMU_CACHE behavior.

An iommu driver should implement enforce_cache_coherency() so that by
default domains allow the no-snoop optimization. This leaves it available
to kernel drivers like i915. VFIO will call enforce_cache_coherency()
before establishing any mappings and the domain should then permanently
block no-snoop.
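
A sketch of the VFIO side of that contract (simplified from the type1
changes in this series; the example_* wrapper is illustrative):

 /* Called before the first iommu_map() on the domain */
 static bool example_vfio_enforce_coherency(struct iommu_domain *domain)
 {
	/* false means the HW cannot block no-snoop; the domain is
	 * still usable, but its DMA is not guaranteed coherent */
	return domain->ops->enforce_cache_coherency &&
	       domain->ops->enforce_cache_coherency(domain);
 }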

If enforce_cache_coherency() fails, VFIO will communicate this back to
KVM, which reaches into the arch code via
kvm_arch_register_noncoherent_dma() (only implemented by x86) to make a
working wbinvd available to the VM.
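
On x86, KVM implements this as a simple counter that gates whether
wbinvd is emulated for the guest; roughly (simplified from
arch/x86/kvm):

 /* A device whose DMA can bypass the CPU cache was attached, so stop
  * NOPing wbinvd for this VM. */
 void kvm_arch_register_noncoherent_dma(struct kvm *kvm)
 {
	atomic_inc(&kvm->arch.noncoherent_dma_count);
 }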

While other arches are certainly welcome to implement
enforce_cache_coherency(), it is not clear there is any benefit in doing
so.

After this series there are only two calls left to iommu_capable() with
a bus argument, which should help Robin's work here.

This is on github: https://github.com/jgunthorpe/linux/commits/intel_no_snoop

Cc: "Tian, Kevin" <kevin.tian@intel.com>
Cc: Robin Murphy <robin.murphy@arm.com>
Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>

Jason Gunthorpe (5):
  iommu: Replace uses of IOMMU_CAP_CACHE_COHERENCY with
    dev_is_dma_coherent()
  vfio: Require that devices support DMA cache coherence
  iommu: Introduce the domain op enforce_cache_coherency()
  vfio: Move the Intel no-snoop control off of IOMMU_CACHE
  iommu: Delete IOMMU_CAP_CACHE_COHERENCY

 drivers/infiniband/hw/usnic/usnic_uiom.c    | 16 +++++------
 drivers/iommu/amd/iommu.c                   |  9 +++++--
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c |  2 --
 drivers/iommu/arm/arm-smmu/arm-smmu.c       |  6 -----
 drivers/iommu/arm/arm-smmu/qcom_iommu.c     |  6 -----
 drivers/iommu/fsl_pamu_domain.c             |  6 -----
 drivers/iommu/intel/iommu.c                 | 15 ++++++++---
 drivers/iommu/s390-iommu.c                  |  2 --
 drivers/vfio/vfio.c                         |  6 +++++
 drivers/vfio/vfio_iommu_type1.c             | 30 +++++++++++++--------
 drivers/vhost/vdpa.c                        |  3 ++-
 include/linux/intel-iommu.h                 |  1 +
 include/linux/iommu.h                       |  6 +++--
 13 files changed, 58 insertions(+), 50 deletions(-)


base-commit: 3123109284176b1532874591f7c81f3837bbdc17

Comments

Alex Williamson April 5, 2022, 7:50 p.m. UTC | #1
On Tue,  5 Apr 2022 13:16:02 -0300
Jason Gunthorpe <jgg@nvidia.com> wrote:

> This new mechanism will replace using IOMMU_CAP_CACHE_COHERENCY and
> IOMMU_CACHE to control the no-snoop blocking behavior of the IOMMU.
> 
> Currently only Intel and AMD IOMMUs are known to support this
> feature. They both implement it as an IOPTE bit that, when set, will cause
> PCIe TLPs to that IOVA with the no-snoop bit set to be treated as though
> the no-snoop bit was clear.
> 
> The new API is triggered by calling enforce_cache_coherency() before
> mapping any IOVA to the domain, which globally switches on no-snoop
> blocking. This allows other implementations that might block no-snoop
> globally and outside the IOPTE - AMD also documents such an HW capability.
> 
> Leave AMD out of sync with Intel and have it block no-snoop even for
> in-kernel users. This can be trivially resolved in a follow-up patch.
> 
> Only VFIO will call this new API.
> 
> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> ---
>  drivers/iommu/amd/iommu.c   |  7 +++++++
>  drivers/iommu/intel/iommu.c | 14 +++++++++++++-
>  include/linux/intel-iommu.h |  1 +
>  include/linux/iommu.h       |  4 ++++
>  4 files changed, 25 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c
> index a1ada7bff44e61..e500b487eb3429 100644
> --- a/drivers/iommu/amd/iommu.c
> +++ b/drivers/iommu/amd/iommu.c
> @@ -2271,6 +2271,12 @@ static int amd_iommu_def_domain_type(struct device *dev)
>  	return 0;
>  }
>  
> +static bool amd_iommu_enforce_cache_coherency(struct iommu_domain *domain)
> +{
> +	/* IOMMU_PTE_FC is always set */
> +	return true;
> +}
> +
>  const struct iommu_ops amd_iommu_ops = {
>  	.capable = amd_iommu_capable,
>  	.domain_alloc = amd_iommu_domain_alloc,
> @@ -2293,6 +2299,7 @@ const struct iommu_ops amd_iommu_ops = {
>  		.flush_iotlb_all = amd_iommu_flush_iotlb_all,
>  		.iotlb_sync	= amd_iommu_iotlb_sync,
>  		.free		= amd_iommu_domain_free,
> +		.enforce_cache_coherency = amd_iommu_enforce_cache_coherency,
>  	}
>  };
>  
> diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
> index df5c62ecf942b8..f08611a6cc4799 100644
> --- a/drivers/iommu/intel/iommu.c
> +++ b/drivers/iommu/intel/iommu.c
> @@ -4422,7 +4422,8 @@ static int intel_iommu_map(struct iommu_domain *domain,
>  		prot |= DMA_PTE_READ;
>  	if (iommu_prot & IOMMU_WRITE)
>  		prot |= DMA_PTE_WRITE;
> -	if ((iommu_prot & IOMMU_CACHE) && dmar_domain->iommu_snooping)
> +	if (((iommu_prot & IOMMU_CACHE) && dmar_domain->iommu_snooping) ||
> +	    dmar_domain->enforce_no_snoop)
>  		prot |= DMA_PTE_SNP;
>  
>  	max_addr = iova + size;
> @@ -4545,6 +4546,16 @@ static phys_addr_t intel_iommu_iova_to_phys(struct iommu_domain *domain,
>  	return phys;
>  }
>  
> +static bool intel_iommu_enforce_cache_coherency(struct iommu_domain *domain)
> +{
> +	struct dmar_domain *dmar_domain = to_dmar_domain(domain);
> +
> +	if (!dmar_domain->iommu_snooping)
> +		return false;
> +	dmar_domain->enforce_no_snoop = true;
> +	return true;
> +}

Don't we have issues if we try to set DMA_PTE_SNP on DMARs that don't
support it, ie. reserved register bit set in pte faults?  It seems
really inconsistent here that I could make a domain that supports
iommu_snooping, set enforce_no_snoop = true, then add another DMAR to
the domain that may not support iommu_snooping, I'd get false on the
subsequent enforcement test, but the dmar_domain is still trying to use
DMA_PTE_SNP.

There's also a disconnect, maybe just in the naming or documentation,
but if I call enforce_cache_coherency for a domain, that seems like the
domain should retain those semantics regardless of how it's modified,
ie. "enforced".  For example, if I tried to perform the above operation,
I should get a failure attaching the device that brings in the less
capable DMAR because the domain has been set to enforce this feature.

If the API is that I need to re-enforce_cache_coherency on every
modification of the domain, shouldn't dmar_domain->enforce_no_snoop
also return to a default value on domain changes?

Maybe this should be something like set_no_snoop_squashing with the
above semantics, it needs to be re-applied whenever the domain:device
composition changes?  Thanks,

Alex

> +
>  static bool intel_iommu_capable(enum iommu_cap cap)
>  {
>  	if (cap == IOMMU_CAP_CACHE_COHERENCY)
> @@ -4898,6 +4909,7 @@ const struct iommu_ops intel_iommu_ops = {
>  		.iotlb_sync		= intel_iommu_tlb_sync,
>  		.iova_to_phys		= intel_iommu_iova_to_phys,
>  		.free			= intel_iommu_domain_free,
> +		.enforce_cache_coherency = intel_iommu_enforce_cache_coherency,
>  	}
>  };
>  
> diff --git a/include/linux/intel-iommu.h b/include/linux/intel-iommu.h
> index 2f9891cb3d0014..1f930c0c225d94 100644
> --- a/include/linux/intel-iommu.h
> +++ b/include/linux/intel-iommu.h
> @@ -540,6 +540,7 @@ struct dmar_domain {
>  	u8 has_iotlb_device: 1;
>  	u8 iommu_coherency: 1;		/* indicate coherency of iommu access */
>  	u8 iommu_snooping: 1;		/* indicate snooping control feature */
> +	u8 enforce_no_snoop : 1;        /* Create IOPTEs with snoop control */
>  
>  	struct list_head devices;	/* all devices' list */
>  	struct iova_domain iovad;	/* iova's that belong to this domain */
> diff --git a/include/linux/iommu.h b/include/linux/iommu.h
> index 9208eca4b0d1ac..fe4f24c469c373 100644
> --- a/include/linux/iommu.h
> +++ b/include/linux/iommu.h
> @@ -272,6 +272,9 @@ struct iommu_ops {
>   * @iotlb_sync: Flush all queued ranges from the hardware TLBs and empty flush
>   *            queue
>   * @iova_to_phys: translate iova to physical address
> + * @enforce_cache_coherency: Prevent any kind of DMA from bypassing IOMMU_CACHE,
> + *                           including no-snoop TLPs on PCIe or other platform
> + *                           specific mechanisms.
>   * @enable_nesting: Enable nesting
>   * @set_pgtable_quirks: Set io page table quirks (IO_PGTABLE_QUIRK_*)
>   * @free: Release the domain after use.
> @@ -300,6 +303,7 @@ struct iommu_domain_ops {
>  	phys_addr_t (*iova_to_phys)(struct iommu_domain *domain,
>  				    dma_addr_t iova);
>  
> +	bool (*enforce_cache_coherency)(struct iommu_domain *domain);
>  	int (*enable_nesting)(struct iommu_domain *domain);
>  	int (*set_pgtable_quirks)(struct iommu_domain *domain,
>  				  unsigned long quirks);
Jason Gunthorpe April 5, 2022, 10:57 p.m. UTC | #2
On Tue, Apr 05, 2022 at 01:50:36PM -0600, Alex Williamson wrote:
> >  
> > +static bool intel_iommu_enforce_cache_coherency(struct iommu_domain *domain)
> > +{
> > +	struct dmar_domain *dmar_domain = to_dmar_domain(domain);
> > +
> > +	if (!dmar_domain->iommu_snooping)
> > +		return false;
> > +	dmar_domain->enforce_no_snoop = true;
> > +	return true;
> > +}
> 
> Don't we have issues if we try to set DMA_PTE_SNP on DMARs that don't
> support it, ie. reserved register bit set in pte faults?  

The way the Intel driver is set up, that is not possible. Currently it
does:

 static bool intel_iommu_capable(enum iommu_cap cap)
 {
	if (cap == IOMMU_CAP_CACHE_COHERENCY)
		return domain_update_iommu_snooping(NULL);

Which is a global property unrelated to any device.

Thus either all devices and all domains support iommu_snooping, or
none do.

It is unclear why, but the driver recalculates this almost-constant
value on every device attach.

> There's also a disconnect, maybe just in the naming or documentation,
> but if I call enforce_cache_coherency for a domain, that seems like the
> domain should retain those semantics regardless of how it's
> modified,

Right, this is how I would expect it to work.

> ie. "enforced".  For example, if I tried to perform the above operation,
> I should get a failure attaching the device that brings in the less
> capable DMAR because the domain has been set to enforce this
> feature.

We don't have any code causing a failure like this because no driver
needs it.

> Maybe this should be something like set_no_snoop_squashing with the
> above semantics, it needs to be re-applied whenever the domain:device
> composition changes?  Thanks,

If we get a real driver that needs non-uniformity here we can revisit
what to do. There are a couple of good options depending on exactly
what the HW behavior is.

Is it clearer if I fold in the below? It helps show that the
decision to use DMA_PTE_SNP is a global choice based on
domain_update_iommu_snooping():

diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index e5062461ab0640..fc789a9d955645 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -641,7 +641,6 @@ static unsigned long domain_super_pgsize_bitmap(struct dmar_domain *domain)
 static void domain_update_iommu_cap(struct dmar_domain *domain)
 {
 	domain_update_iommu_coherency(domain);
-	domain->iommu_snooping = domain_update_iommu_snooping(NULL);
 	domain->iommu_superpage = domain_update_iommu_superpage(domain, NULL);
 
 	/*
@@ -4283,7 +4282,6 @@ static int md_domain_init(struct dmar_domain *domain, int guest_width)
 	domain->agaw = width_to_agaw(adjust_width);
 
 	domain->iommu_coherency = false;
-	domain->iommu_snooping = false;
 	domain->iommu_superpage = 0;
 	domain->max_addr = 0;
 
@@ -4549,7 +4547,7 @@ static bool intel_iommu_enforce_cache_coherency(struct iommu_domain *domain)
 {
 	struct dmar_domain *dmar_domain = to_dmar_domain(domain);
 
-	if (!dmar_domain->iommu_snooping)
+	if (!domain_update_iommu_snooping(NULL))
 		return false;
 	dmar_domain->enforce_no_snoop = true;
 	return true;
diff --git a/include/linux/intel-iommu.h b/include/linux/intel-iommu.h
index 1f930c0c225d94..bc39f633efdf03 100644
--- a/include/linux/intel-iommu.h
+++ b/include/linux/intel-iommu.h
@@ -539,7 +539,6 @@ struct dmar_domain {
 
 	u8 has_iotlb_device: 1;
 	u8 iommu_coherency: 1;		/* indicate coherency of iommu access */
-	u8 iommu_snooping: 1;		/* indicate snooping control feature */
 	u8 enforce_no_snoop : 1;        /* Create IOPTEs with snoop control */
 
 	struct list_head devices;	/* all devices' list */
Tian, Kevin April 5, 2022, 11:31 p.m. UTC | #3
> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Wednesday, April 6, 2022 6:58 AM
> 
> On Tue, Apr 05, 2022 at 01:50:36PM -0600, Alex Williamson wrote:
> > >
> > > +static bool intel_iommu_enforce_cache_coherency(struct
> iommu_domain *domain)
> > > +{
> > > +	struct dmar_domain *dmar_domain = to_dmar_domain(domain);
> > > +
> > > +	if (!dmar_domain->iommu_snooping)
> > > +		return false;
> > > +	dmar_domain->enforce_no_snoop = true;
> > > +	return true;
> > > +}
> >
> > Don't we have issues if we try to set DMA_PTE_SNP on DMARs that don't
> > support it, ie. reserved register bit set in pte faults?
> 
> The way the Intel driver is setup that is not possible. Currently it
> does:
> 
>  static bool intel_iommu_capable(enum iommu_cap cap)
>  {
> 	if (cap == IOMMU_CAP_CACHE_COHERENCY)
> 		return domain_update_iommu_snooping(NULL);
> 
> Which is a global property unrelated to any device.
> 
> Thus either all devices and all domains support iommu_snooping, or
> none do.
> 
> It is unclear why, but the driver recalculates this almost-constant
> value on every device attach.

The reason is simply because iommu capability is a global flag 
Tian, Kevin April 6, 2022, 12:08 a.m. UTC | #4
> From: Tian, Kevin
> Sent: Wednesday, April 6, 2022 7:32 AM
> 
> > From: Jason Gunthorpe <jgg@nvidia.com>
> > Sent: Wednesday, April 6, 2022 6:58 AM
> >
> > On Tue, Apr 05, 2022 at 01:50:36PM -0600, Alex Williamson wrote:
> > > >
> > > > +static bool intel_iommu_enforce_cache_coherency(struct
> > iommu_domain *domain)
> > > > +{
> > > > +	struct dmar_domain *dmar_domain = to_dmar_domain(domain);
> > > > +
> > > > +	if (!dmar_domain->iommu_snooping)
> > > > +		return false;
> > > > +	dmar_domain->enforce_no_snoop = true;
> > > > +	return true;
> > > > +}
> > >
> > > Don't we have issues if we try to set DMA_PTE_SNP on DMARs that don't
> > > support it, ie. reserved register bit set in pte faults?
> >
> > The way the Intel driver is setup that is not possible. Currently it
> > does:
> >
> >  static bool intel_iommu_capable(enum iommu_cap cap)
> >  {
> > 	if (cap == IOMMU_CAP_CACHE_COHERENCY)
> > 		return domain_update_iommu_snooping(NULL);
> >
> > Which is a global property unrelated to any device.
> >
> > Thus either all devices and all domains support iommu_snooping, or
> > none do.
> >
> > It is unclear why, but the driver recalculates this almost-constant
> > value on every device attach.
> 
> The reason is simply because iommu capability is a global flag 
Tian, Kevin April 6, 2022, 6:52 a.m. UTC | #5
> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Wednesday, April 6, 2022 12:16 AM
> 
> PCIe defines a 'no-snoop' bit in each TLP which is usually implemented
> by a platform as bypassing elements in the DMA coherent CPU cache
> hierarchy. A driver can command a device to set this bit on some of its
> transactions as a micro-optimization.
> 
> However, the driver is now responsible for synchronizing the CPU cache
> with the DMA that bypassed it. On x86 this is done through the wbinvd
> instruction, and the i915 GPU driver is the only Linux DMA driver that
> calls it.

More accurately, x86 supports both unprivileged clflush instructions
to invalidate one cacheline and a privileged wbinvd instruction to
invalidate the entire cache. Replacing 'this is done' with 'this may
be done' is clearer.
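
For reference, the cacheline-granular path Kevin describes is usually
reached through a helper such as clflush_cache_range() (the wrapper
below is a made-up illustration):

 #include <asm/cacheflush.h>

 /* Flush only the DMA buffer rather than the entire cache */
 static void example_sync_buffer(void *vaddr, unsigned int size)
 {
	clflush_cache_range(vaddr, size);
 }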

> 
> The problem is that KVM on x86 will normally disable the wbinvd
> instruction in the guest and render it a NOP. As the driver running in
> the guest is not aware that wbinvd doesn't work, it may still cause the
> device to set the no-snoop bit, and the platform will bypass the CPU
> cache. Without a working wbinvd there is no way to re-synchronize the
> CPU cache and the driver in the VM is left with data corruption.
> 
> Thus, we see a general direction on x86 that the IOMMU HW is able to
> block the no-snoop bit in the TLP. This NOPs the optimization and
> allows KVM to NOP the wbinvd without causing any data corruption.
> 
> For the Intel IOMMU this control was exposed through IOMMU_CACHE and
> IOMMU_CAP_CACHE_COHERENCY; however, these two values now have multiple
> meanings and usages beyond blocking no-snoop, and the whole thing has
> become confused.

Also point out your finding about AMD IOMMU?

> 
> Change it so that:
>  - IOMMU_CACHE is only about the DMA coherence of normal DMAs from a
>    device. It is used by the DMA API and set when the DMA API will not be
>    doing manual cache coherency operations.
> 
>  - dev_is_dma_coherent() indicates if IOMMU_CACHE can be used with the
>    device
> 
>  - The new optional domain op enforce_cache_coherency() will cause the
>    entire domain to block no-snoop requests - ie there is no way for any
>    device attached to the domain to opt out of the IOMMU_CACHE behavior.
> 
> An iommu driver should implement enforce_cache_coherency() so that by
> default domains allow the no-snoop optimization. This leaves it available
> to kernel drivers like i915. VFIO will call enforce_cache_coherency()
> before establishing any mappings and the domain should then permanently
> block no-snoop.
> 
> If enforce_cache_coherency() fails, VFIO will communicate this back to
> KVM, which reaches into the arch code via
> kvm_arch_register_noncoherent_dma() (only implemented by x86) to make a
> working wbinvd available to the VM.
> 
> While other arches are certainly welcome to implement
> enforce_cache_coherency(), it is not clear there is any benefit in doing
> so.
> 
> After this series there are only two calls left to iommu_capable() with a
> bus argument which should help Robin's work here.
> 
> This is on github:
> https://github.com/jgunthorpe/linux/commits/intel_no_snoop
> 
> Cc: "Tian, Kevin" <kevin.tian@intel.com>
> Cc: Robin Murphy <robin.murphy@arm.com>
> Cc: Alex Williamson <alex.williamson@redhat.com>
> Cc: Christoph Hellwig <hch@lst.de>
> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> 
> Jason Gunthorpe (5):
>   iommu: Replace uses of IOMMU_CAP_CACHE_COHERENCY with
>     dev_is_dma_coherent()
>   vfio: Require that devices support DMA cache coherence
>   iommu: Introduce the domain op enforce_cache_coherency()
>   vfio: Move the Intel no-snoop control off of IOMMU_CACHE
>   iommu: Delete IOMMU_CAP_CACHE_COHERENCY
> 
>  drivers/infiniband/hw/usnic/usnic_uiom.c    | 16 +++++------
>  drivers/iommu/amd/iommu.c                   |  9 +++++--
>  drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c |  2 --
>  drivers/iommu/arm/arm-smmu/arm-smmu.c       |  6 -----
>  drivers/iommu/arm/arm-smmu/qcom_iommu.c     |  6 -----
>  drivers/iommu/fsl_pamu_domain.c             |  6 -----
>  drivers/iommu/intel/iommu.c                 | 15 ++++++++---
>  drivers/iommu/s390-iommu.c                  |  2 --
>  drivers/vfio/vfio.c                         |  6 +++++
>  drivers/vfio/vfio_iommu_type1.c             | 30 +++++++++++++--------
>  drivers/vhost/vdpa.c                        |  3 ++-
>  include/linux/intel-iommu.h                 |  1 +
>  include/linux/iommu.h                       |  6 +++--
>  13 files changed, 58 insertions(+), 50 deletions(-)
> 
> 
> base-commit: 3123109284176b1532874591f7c81f3837bbdc17
> --
> 2.35.1
Jason Gunthorpe April 7, 2022, 2:56 p.m. UTC | #6
On Wed, Apr 06, 2022 at 06:52:04AM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe <jgg@nvidia.com>
> > Sent: Wednesday, April 6, 2022 12:16 AM
> > 
> > PCIe defines a 'no-snoop' bit in each TLP which is usually implemented
> > by a platform as bypassing elements in the DMA coherent CPU cache
> > hierarchy. A driver can command a device to set this bit on some of its
> > transactions as a micro-optimization.
> > 
> > However, the driver is now responsible for synchronizing the CPU cache
> > with the DMA that bypassed it. On x86 this is done through the wbinvd
> > instruction, and the i915 GPU driver is the only Linux DMA driver that
> > calls it.
> 
> More accurately, x86 supports both unprivileged clflush instructions
> to invalidate one cacheline and a privileged wbinvd instruction to
> invalidate the entire cache. Replacing 'this is done' with 'this may
> be done' is clearer.
> 
> > 
> > The problem is that KVM on x86 will normally disable the wbinvd
> > instruction in the guest and render it a NOP. As the driver running in
> > the guest is not aware that wbinvd doesn't work, it may still cause the
> > device to set the no-snoop bit, and the platform will bypass the CPU
> > cache. Without a working wbinvd there is no way to re-synchronize the
> > CPU cache and the driver in the VM is left with data corruption.
> > 
> > Thus, we see a general direction on x86 that the IOMMU HW is able to
> > block the no-snoop bit in the TLP. This NOPs the optimization and
> > allows KVM to NOP the wbinvd without causing any data corruption.
> > 
> > For the Intel IOMMU this control was exposed through IOMMU_CACHE and
> > IOMMU_CAP_CACHE_COHERENCY; however, these two values now have multiple
> > meanings and usages beyond blocking no-snoop, and the whole thing has
> > become confused.
>
> Also point out your finding about AMD IOMMU?

Done, thanks

Jason