Message ID: 20240507062138.20465-1-yan.y.zhao@intel.com (mailing list archive)
State: New, archived
Series: Enforce CPU cache flush for non-coherent device assignment
On Tue, 7 May 2024 14:21:38 +0800 Yan Zhao <yan.y.zhao@intel.com> wrote: > Flush CPU cache on DMA pages before mapping them into the first > non-coherent domain (domain that does not enforce cache coherency, i.e. CPU > caches are not force-snooped) and after unmapping them from the last > domain. > > Devices attached to non-coherent domains can execute non-coherent DMAs > (DMAs that lack CPU cache snooping) to access physical memory with CPU > caches bypassed. > > Such a scenario could be exploited by a malicious guest, allowing them to > access stale host data in memory rather than the data initialized by the > host (e.g., zeros) in the cache, thus posing a risk of information leakage > attack. > > Furthermore, the host kernel (e.g. a ksm thread) might encounter > inconsistent data between the CPU cache and memory (left by a malicious > guest) after a page is unpinned for DMA but before it's recycled. > > Therefore, it is required to flush the CPU cache before a page is > accessible to non-coherent DMAs and after the page is inaccessible to > non-coherent DMAs. > > However, the CPU cache is not flushed immediately when the page is unmapped > from the last non-coherent domain. Instead, the flushing is performed > lazily, right before the page is unpinned. > Take the following example to illustrate the process. The CPU cache is > flushed right before step 2 and step 5. > 1. A page is mapped into a coherent domain. > 2. The page is mapped into a non-coherent domain. > 3. The page is unmapped from the non-coherent domain e.g.due to hot-unplug. > 4. The page is unmapped from the coherent domain. > 5. The page is unpinned. > > Reasons for adopting this lazily flushing design include: > - There're several unmap paths and only one unpin path. Lazily flush before > unpin wipes out the inconsistency between cache and physical memory > before a page is globally visible and produces code that is simpler, more > maintainable and easier to backport. > - Avoid dividing a large unmap range into several smaller ones or > allocating additional memory to hold IOVA to HPA relationship. > > Reported-by: Jason Gunthorpe <jgg@nvidia.com> > Closes: https://lore.kernel.org/lkml/20240109002220.GA439767@nvidia.com > Fixes: 73fa0d10d077 ("vfio: Type1 IOMMU implementation") > Cc: Alex Williamson <alex.williamson@redhat.com> > Cc: Jason Gunthorpe <jgg@nvidia.com> > Cc: Kevin Tian <kevin.tian@intel.com> > Signed-off-by: Yan Zhao <yan.y.zhao@intel.com> > --- > drivers/vfio/vfio_iommu_type1.c | 51 +++++++++++++++++++++++++++++++++ > 1 file changed, 51 insertions(+) > > diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c > index b5c15fe8f9fc..ce873f4220bf 100644 > --- a/drivers/vfio/vfio_iommu_type1.c > +++ b/drivers/vfio/vfio_iommu_type1.c > @@ -74,6 +74,7 @@ struct vfio_iommu { > bool v2; > bool nesting; > bool dirty_page_tracking; > + bool has_noncoherent_domain; > struct list_head emulated_iommu_groups; > }; > > @@ -99,6 +100,7 @@ struct vfio_dma { > unsigned long *bitmap; > struct mm_struct *mm; > size_t locked_vm; > + bool cache_flush_required; /* For noncoherent domain */ Poor packing, minimally this should be grouped with the other bools in the structure, longer term they should likely all be converted to bit fields. 
> }; > > struct vfio_batch { > @@ -716,6 +718,9 @@ static long vfio_unpin_pages_remote(struct vfio_dma *dma, dma_addr_t iova, > long unlocked = 0, locked = 0; > long i; > > + if (dma->cache_flush_required) > + arch_clean_nonsnoop_dma(pfn << PAGE_SHIFT, npage << PAGE_SHIFT); > + > for (i = 0; i < npage; i++, iova += PAGE_SIZE) { > if (put_pfn(pfn++, dma->prot)) { > unlocked++; > @@ -1099,6 +1104,8 @@ static long vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma *dma, > &iotlb_gather); > } > > + dma->cache_flush_required = false; > + > if (do_accounting) { > vfio_lock_acct(dma, -unlocked, true); > return 0; > @@ -1120,6 +1127,21 @@ static void vfio_remove_dma(struct vfio_iommu *iommu, struct vfio_dma *dma) > iommu->dma_avail++; > } > > +static void vfio_update_noncoherent_domain_state(struct vfio_iommu *iommu) > +{ > + struct vfio_domain *domain; > + bool has_noncoherent = false; > + > + list_for_each_entry(domain, &iommu->domain_list, next) { > + if (domain->enforce_cache_coherency) > + continue; > + > + has_noncoherent = true; > + break; > + } > + iommu->has_noncoherent_domain = has_noncoherent; > +} This should be merged with vfio_domains_have_enforce_cache_coherency() and the VFIO_DMA_CC_IOMMU extension (if we keep it, see below). > + > static void vfio_update_pgsize_bitmap(struct vfio_iommu *iommu) > { > struct vfio_domain *domain; > @@ -1455,6 +1477,12 @@ static int vfio_pin_map_dma(struct vfio_iommu *iommu, struct vfio_dma *dma, > > vfio_batch_init(&batch); > > + /* > + * Record necessity to flush CPU cache to make sure CPU cache is flushed > + * for both pin & map and unmap & unpin (for unwind) paths. > + */ > + dma->cache_flush_required = iommu->has_noncoherent_domain; > + > while (size) { > /* Pin a contiguous chunk of memory */ > npage = vfio_pin_pages_remote(dma, vaddr + dma->size, > @@ -1466,6 +1494,10 @@ static int vfio_pin_map_dma(struct vfio_iommu *iommu, struct vfio_dma *dma, > break; > } > > + if (dma->cache_flush_required) > + arch_clean_nonsnoop_dma(pfn << PAGE_SHIFT, > + npage << PAGE_SHIFT); > + > /* Map it! */ > ret = vfio_iommu_map(iommu, iova + dma->size, pfn, npage, > dma->prot); > @@ -1683,9 +1715,14 @@ static int vfio_iommu_replay(struct vfio_iommu *iommu, > for (; n; n = rb_next(n)) { > struct vfio_dma *dma; > dma_addr_t iova; > + bool cache_flush_required; > > dma = rb_entry(n, struct vfio_dma, node); > iova = dma->iova; > + cache_flush_required = !domain->enforce_cache_coherency && > + !dma->cache_flush_required; > + if (cache_flush_required) > + dma->cache_flush_required = true; The variable name here isn't accurate and the logic is confusing. If the domain does not enforce coherency and the mapping is not tagged as requiring a cache flush, then we need to mark the mapping as requiring a cache flush. So the variable state is something more akin to set_cache_flush_required. But all we're saving with this is a redundant set if the mapping is already tagged as requiring a cache flush, so it could really be simplified to: dma->cache_flush_required = !domain->enforce_cache_coherency; It might add more clarity to just name the mapping flag dma->mapped_noncoherent. 
> > while (iova < dma->iova + dma->size) { > phys_addr_t phys; > @@ -1737,6 +1774,9 @@ static int vfio_iommu_replay(struct vfio_iommu *iommu, > size = npage << PAGE_SHIFT; > } > > + if (cache_flush_required) > + arch_clean_nonsnoop_dma(phys, size); > + I agree with others as well that this arch callback should be named something relative to the cache-flush/write-back operation that it actually performs instead of the overall reason for us requiring it. > ret = iommu_map(domain->domain, iova, phys, size, > dma->prot | IOMMU_CACHE, > GFP_KERNEL_ACCOUNT); > @@ -1801,6 +1841,7 @@ static int vfio_iommu_replay(struct vfio_iommu *iommu, > vfio_unpin_pages_remote(dma, iova, phys >> PAGE_SHIFT, > size >> PAGE_SHIFT, true); > } > + dma->cache_flush_required = false; > } > > vfio_batch_fini(&batch); > @@ -1828,6 +1869,9 @@ static void vfio_test_domain_fgsp(struct vfio_domain *domain, struct list_head * > if (!pages) > return; > > + if (!domain->enforce_cache_coherency) > + arch_clean_nonsnoop_dma(page_to_phys(pages), PAGE_SIZE * 2); > + > list_for_each_entry(region, regions, list) { > start = ALIGN(region->start, PAGE_SIZE * 2); > if (start >= region->end || (region->end - start < PAGE_SIZE * 2)) > @@ -1847,6 +1891,9 @@ static void vfio_test_domain_fgsp(struct vfio_domain *domain, struct list_head * > break; > } > > + if (!domain->enforce_cache_coherency) > + arch_clean_nonsnoop_dma(page_to_phys(pages), PAGE_SIZE * 2); > + Seems like this use case isn't subject to the unmap aspect since these are kernel allocated and freed pages rather than userspace pages. There's not an "ongoing use of the page" concern. The window of opportunity for a device to discover and exploit the mapping side issue appears almost impossibly small. > __free_pages(pages, order); > } > > @@ -2308,6 +2355,8 @@ static int vfio_iommu_type1_attach_group(void *iommu_data, > > list_add(&domain->next, &iommu->domain_list); > vfio_update_pgsize_bitmap(iommu); > + if (!domain->enforce_cache_coherency) > + vfio_update_noncoherent_domain_state(iommu); Why isn't this simply: if (!domain->enforce_cache_coherency) iommu->has_noncoherent_domain = true; Or maybe: if (!domain->enforce_cache_coherency) iommu->noncoherent_domains++; > done: > /* Delete the old one and insert new iova list */ > vfio_iommu_iova_insert_copy(iommu, &iova_copy); > @@ -2508,6 +2557,8 @@ static void vfio_iommu_type1_detach_group(void *iommu_data, > } > iommu_domain_free(domain->domain); > list_del(&domain->next); > + if (!domain->enforce_cache_coherency) > + vfio_update_noncoherent_domain_state(iommu); If we were to just track the number of noncoherent domains, this could simply be iommu->noncoherent_domains-- and VFIO_DMA_CC_DMA could be: return iommu->noncoherent_domains ? 1 : 0; Maybe there should be wrappers for list_add() and list_del() relative to the iommu domain list to make it just be a counter. Thanks, Alex > kfree(domain); > vfio_iommu_aper_expand(iommu, &iova_copy); > vfio_update_pgsize_bitmap(iommu);
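For illustration, the counter-based merge suggested above could look roughly like the sketch below; the field name follows the suggestion in this thread, while the helper shape and locking are assumptions rather than the posted patch.

/* In struct vfio_iommu, a counter instead of the has_noncoherent_domain bool: */
	unsigned int	noncoherent_domains;	/* protected by iommu->lock */

/* One check usable both internally and for the VFIO_DMA_CC_IOMMU extension. */
static bool vfio_domains_have_enforce_cache_coherency(struct vfio_iommu *iommu)
{
	bool ret;

	mutex_lock(&iommu->lock);
	ret = !iommu->noncoherent_domains;
	mutex_unlock(&iommu->lock);

	return ret;
}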
On Thu, May 09, 2024 at 12:10:49PM -0600, Alex Williamson wrote: > On Tue, 7 May 2024 14:21:38 +0800 > Yan Zhao <yan.y.zhao@intel.com> wrote: ... > > drivers/vfio/vfio_iommu_type1.c | 51 +++++++++++++++++++++++++++++++++ > > 1 file changed, 51 insertions(+) > > > > diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c > > index b5c15fe8f9fc..ce873f4220bf 100644 > > --- a/drivers/vfio/vfio_iommu_type1.c > > +++ b/drivers/vfio/vfio_iommu_type1.c > > @@ -74,6 +74,7 @@ struct vfio_iommu { > > bool v2; > > bool nesting; > > bool dirty_page_tracking; > > + bool has_noncoherent_domain; > > struct list_head emulated_iommu_groups; > > }; > > > > @@ -99,6 +100,7 @@ struct vfio_dma { > > unsigned long *bitmap; > > struct mm_struct *mm; > > size_t locked_vm; > > + bool cache_flush_required; /* For noncoherent domain */ > > Poor packing, minimally this should be grouped with the other bools in > the structure, longer term they should likely all be converted to > bit fields. Yes. Will do! > > > }; > > > > struct vfio_batch { > > @@ -716,6 +718,9 @@ static long vfio_unpin_pages_remote(struct vfio_dma *dma, dma_addr_t iova, > > long unlocked = 0, locked = 0; > > long i; > > > > + if (dma->cache_flush_required) > > + arch_clean_nonsnoop_dma(pfn << PAGE_SHIFT, npage << PAGE_SHIFT); > > + > > for (i = 0; i < npage; i++, iova += PAGE_SIZE) { > > if (put_pfn(pfn++, dma->prot)) { > > unlocked++; > > @@ -1099,6 +1104,8 @@ static long vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma *dma, > > &iotlb_gather); > > } > > > > + dma->cache_flush_required = false; > > + > > if (do_accounting) { > > vfio_lock_acct(dma, -unlocked, true); > > return 0; > > @@ -1120,6 +1127,21 @@ static void vfio_remove_dma(struct vfio_iommu *iommu, struct vfio_dma *dma) > > iommu->dma_avail++; > > } > > > > +static void vfio_update_noncoherent_domain_state(struct vfio_iommu *iommu) > > +{ > > + struct vfio_domain *domain; > > + bool has_noncoherent = false; > > + > > + list_for_each_entry(domain, &iommu->domain_list, next) { > > + if (domain->enforce_cache_coherency) > > + continue; > > + > > + has_noncoherent = true; > > + break; > > + } > > + iommu->has_noncoherent_domain = has_noncoherent; > > +} > > This should be merged with vfio_domains_have_enforce_cache_coherency() > and the VFIO_DMA_CC_IOMMU extension (if we keep it, see below). Will convert it to a counter and do the merge. Thanks for pointing it out! > > > + > > static void vfio_update_pgsize_bitmap(struct vfio_iommu *iommu) > > { > > struct vfio_domain *domain; > > @@ -1455,6 +1477,12 @@ static int vfio_pin_map_dma(struct vfio_iommu *iommu, struct vfio_dma *dma, > > > > vfio_batch_init(&batch); > > > > + /* > > + * Record necessity to flush CPU cache to make sure CPU cache is flushed > > + * for both pin & map and unmap & unpin (for unwind) paths. > > + */ > > + dma->cache_flush_required = iommu->has_noncoherent_domain; > > + > > while (size) { > > /* Pin a contiguous chunk of memory */ > > npage = vfio_pin_pages_remote(dma, vaddr + dma->size, > > @@ -1466,6 +1494,10 @@ static int vfio_pin_map_dma(struct vfio_iommu *iommu, struct vfio_dma *dma, > > break; > > } > > > > + if (dma->cache_flush_required) > > + arch_clean_nonsnoop_dma(pfn << PAGE_SHIFT, > > + npage << PAGE_SHIFT); > > + > > /* Map it! 
*/ > > ret = vfio_iommu_map(iommu, iova + dma->size, pfn, npage, > > dma->prot); > > @@ -1683,9 +1715,14 @@ static int vfio_iommu_replay(struct vfio_iommu *iommu, > > for (; n; n = rb_next(n)) { > > struct vfio_dma *dma; > > dma_addr_t iova; > > + bool cache_flush_required; > > > > dma = rb_entry(n, struct vfio_dma, node); > > iova = dma->iova; > > + cache_flush_required = !domain->enforce_cache_coherency && > > + !dma->cache_flush_required; > > + if (cache_flush_required) > > + dma->cache_flush_required = true; > > The variable name here isn't accurate and the logic is confusing. If > the domain does not enforce coherency and the mapping is not tagged as > requiring a cache flush, then we need to mark the mapping as requiring > a cache flush. So the variable state is something more akin to > set_cache_flush_required. But all we're saving with this is a > redundant set if the mapping is already tagged as requiring a cache > flush, so it could really be simplified to: > > dma->cache_flush_required = !domain->enforce_cache_coherency; Sorry about the confusion. If dma->cache_flush_required is set to true by a domain not enforcing cache coherency, we hope it will not be reset to false by a later attaching to domain enforcing cache coherency due to the lazily flushing design. > It might add more clarity to just name the mapping flag > dma->mapped_noncoherent. The dma->cache_flush_required is to mark whether pages in a vfio_dma requires cache flush in the subsequence mapping into the first non-coherent domain and page unpinning. So, mapped_noncoherent may not be accurate. Do you think it's better to put a comment for explanation? struct vfio_dma { ... bool iommu_mapped; bool lock_cap; /* capable(CAP_IPC_LOCK) */ bool vaddr_invalid; /* * Mark whether it is required to flush CPU caches when mapping pages * of the vfio_dma to the first non-coherent domain and when unpinning * pages of the vfio_dma */ bool cache_flush_required; ... }; > > > > > while (iova < dma->iova + dma->size) { > > phys_addr_t phys; > > @@ -1737,6 +1774,9 @@ static int vfio_iommu_replay(struct vfio_iommu *iommu, > > size = npage << PAGE_SHIFT; > > } > > > > + if (cache_flush_required) > > + arch_clean_nonsnoop_dma(phys, size); > > + > > I agree with others as well that this arch callback should be named > something relative to the cache-flush/write-back operation that it > actually performs instead of the overall reason for us requiring it. > Ok. If there are no objections, I'll rename it to arch_flush_cache_phys() as suggested by Kevin. 
> > ret = iommu_map(domain->domain, iova, phys, size, > > dma->prot | IOMMU_CACHE, > > GFP_KERNEL_ACCOUNT); > > @@ -1801,6 +1841,7 @@ static int vfio_iommu_replay(struct vfio_iommu *iommu, > > vfio_unpin_pages_remote(dma, iova, phys >> PAGE_SHIFT, > > size >> PAGE_SHIFT, true); > > } > > + dma->cache_flush_required = false; > > } > > > > vfio_batch_fini(&batch); > > @@ -1828,6 +1869,9 @@ static void vfio_test_domain_fgsp(struct vfio_domain *domain, struct list_head * > > if (!pages) > > return; > > > > + if (!domain->enforce_cache_coherency) > > + arch_clean_nonsnoop_dma(page_to_phys(pages), PAGE_SIZE * 2); > > + > > list_for_each_entry(region, regions, list) { > > start = ALIGN(region->start, PAGE_SIZE * 2); > > if (start >= region->end || (region->end - start < PAGE_SIZE * 2)) > > @@ -1847,6 +1891,9 @@ static void vfio_test_domain_fgsp(struct vfio_domain *domain, struct list_head * > > break; > > } > > > > + if (!domain->enforce_cache_coherency) > > + arch_clean_nonsnoop_dma(page_to_phys(pages), PAGE_SIZE * 2); > > + > > Seems like this use case isn't subject to the unmap aspect since these > are kernel allocated and freed pages rather than userspace pages. > There's not an "ongoing use of the page" concern. > > The window of opportunity for a device to discover and exploit the > mapping side issue appears almost impossibly small. > The concern is for a malicious device attempting DMAs automatically. Do you think this concern is valid? As there're only extra flushes for 4 pages, what about keeping it for safety? > > __free_pages(pages, order); > > } > > > > @@ -2308,6 +2355,8 @@ static int vfio_iommu_type1_attach_group(void *iommu_data, > > > > list_add(&domain->next, &iommu->domain_list); > > vfio_update_pgsize_bitmap(iommu); > > + if (!domain->enforce_cache_coherency) > > + vfio_update_noncoherent_domain_state(iommu); > > Why isn't this simply: > > if (!domain->enforce_cache_coherency) > iommu->has_noncoherent_domain = true; Yes, it's simpler during attach. > Or maybe: > > if (!domain->enforce_cache_coherency) > iommu->noncoherent_domains++; Yes, this counter is better. I previously thought a bool can save some space. > > done: > > /* Delete the old one and insert new iova list */ > > vfio_iommu_iova_insert_copy(iommu, &iova_copy); > > @@ -2508,6 +2557,8 @@ static void vfio_iommu_type1_detach_group(void *iommu_data, > > } > > iommu_domain_free(domain->domain); > > list_del(&domain->next); > > + if (!domain->enforce_cache_coherency) > > + vfio_update_noncoherent_domain_state(iommu); > > If we were to just track the number of noncoherent domains, this could > simply be iommu->noncoherent_domains-- and VFIO_DMA_CC_DMA could be: > > return iommu->noncoherent_domains ? 1 : 0; > > Maybe there should be wrappers for list_add() and list_del() relative > to the iommu domain list to make it just be a counter. Thanks, Do you think we can skip the "iommu->noncoherent_domains--" in vfio_iommu_type1_release() when iommu is about to be freed. Asking that is also because it's hard for me to find a good name for the wrapper around list_del(). :) It follows vfio_release_domain() in vfio_iommu_type1_release(), but not in vfio_iommu_type1_detach_group(). > > > > kfree(domain); > > vfio_iommu_aper_expand(iommu, &iova_copy); > > vfio_update_pgsize_bitmap(iommu); >
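To make the proposed rename concrete, a minimal x86 sketch of such a write-back helper is shown below; only the name comes from this thread, and the body (clflush over a temporarily kmap'd page) is an assumption rather than the code in this series.

#include <linux/highmem.h>
#include <asm/cacheflush.h>

/* Write back CPU cache lines covering a physical range, one page at a time. */
static void arch_flush_cache_phys(phys_addr_t phys, size_t size)
{
	while (size) {
		size_t len = min_t(size_t, size,
				   PAGE_SIZE - offset_in_page(phys));
		void *va = kmap_local_pfn(PHYS_PFN(phys));

		clflush_cache_range(va + offset_in_page(phys), len);
		kunmap_local(va);

		phys += len;
		size -= len;
	}
}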
On Fri, 10 May 2024 18:31:13 +0800 Yan Zhao <yan.y.zhao@intel.com> wrote: > On Thu, May 09, 2024 at 12:10:49PM -0600, Alex Williamson wrote: > > On Tue, 7 May 2024 14:21:38 +0800 > > Yan Zhao <yan.y.zhao@intel.com> wrote: > ... > > > drivers/vfio/vfio_iommu_type1.c | 51 +++++++++++++++++++++++++++++++++ > > > 1 file changed, 51 insertions(+) > > > > > > diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c > > > index b5c15fe8f9fc..ce873f4220bf 100644 > > > --- a/drivers/vfio/vfio_iommu_type1.c > > > +++ b/drivers/vfio/vfio_iommu_type1.c > > > @@ -74,6 +74,7 @@ struct vfio_iommu { > > > bool v2; > > > bool nesting; > > > bool dirty_page_tracking; > > > + bool has_noncoherent_domain; > > > struct list_head emulated_iommu_groups; > > > }; > > > > > > @@ -99,6 +100,7 @@ struct vfio_dma { > > > unsigned long *bitmap; > > > struct mm_struct *mm; > > > size_t locked_vm; > > > + bool cache_flush_required; /* For noncoherent domain */ > > > > Poor packing, minimally this should be grouped with the other bools in > > the structure, longer term they should likely all be converted to > > bit fields. > Yes. Will do! > > > > > > }; > > > > > > struct vfio_batch { > > > @@ -716,6 +718,9 @@ static long vfio_unpin_pages_remote(struct vfio_dma *dma, dma_addr_t iova, > > > long unlocked = 0, locked = 0; > > > long i; > > > > > > + if (dma->cache_flush_required) > > > + arch_clean_nonsnoop_dma(pfn << PAGE_SHIFT, npage << PAGE_SHIFT); > > > + > > > for (i = 0; i < npage; i++, iova += PAGE_SIZE) { > > > if (put_pfn(pfn++, dma->prot)) { > > > unlocked++; > > > @@ -1099,6 +1104,8 @@ static long vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma *dma, > > > &iotlb_gather); > > > } > > > > > > + dma->cache_flush_required = false; > > > + > > > if (do_accounting) { > > > vfio_lock_acct(dma, -unlocked, true); > > > return 0; > > > @@ -1120,6 +1127,21 @@ static void vfio_remove_dma(struct vfio_iommu *iommu, struct vfio_dma *dma) > > > iommu->dma_avail++; > > > } > > > > > > +static void vfio_update_noncoherent_domain_state(struct vfio_iommu *iommu) > > > +{ > > > + struct vfio_domain *domain; > > > + bool has_noncoherent = false; > > > + > > > + list_for_each_entry(domain, &iommu->domain_list, next) { > > > + if (domain->enforce_cache_coherency) > > > + continue; > > > + > > > + has_noncoherent = true; > > > + break; > > > + } > > > + iommu->has_noncoherent_domain = has_noncoherent; > > > +} > > > > This should be merged with vfio_domains_have_enforce_cache_coherency() > > and the VFIO_DMA_CC_IOMMU extension (if we keep it, see below). > Will convert it to a counter and do the merge. > Thanks for pointing it out! > > > > > > + > > > static void vfio_update_pgsize_bitmap(struct vfio_iommu *iommu) > > > { > > > struct vfio_domain *domain; > > > @@ -1455,6 +1477,12 @@ static int vfio_pin_map_dma(struct vfio_iommu *iommu, struct vfio_dma *dma, > > > > > > vfio_batch_init(&batch); > > > > > > + /* > > > + * Record necessity to flush CPU cache to make sure CPU cache is flushed > > > + * for both pin & map and unmap & unpin (for unwind) paths. 
> > > + */ > > > + dma->cache_flush_required = iommu->has_noncoherent_domain; > > > + > > > while (size) { > > > /* Pin a contiguous chunk of memory */ > > > npage = vfio_pin_pages_remote(dma, vaddr + dma->size, > > > @@ -1466,6 +1494,10 @@ static int vfio_pin_map_dma(struct vfio_iommu *iommu, struct vfio_dma *dma, > > > break; > > > } > > > > > > + if (dma->cache_flush_required) > > > + arch_clean_nonsnoop_dma(pfn << PAGE_SHIFT, > > > + npage << PAGE_SHIFT); > > > + > > > /* Map it! */ > > > ret = vfio_iommu_map(iommu, iova + dma->size, pfn, npage, > > > dma->prot); > > > @@ -1683,9 +1715,14 @@ static int vfio_iommu_replay(struct vfio_iommu *iommu, > > > for (; n; n = rb_next(n)) { > > > struct vfio_dma *dma; > > > dma_addr_t iova; > > > + bool cache_flush_required; > > > > > > dma = rb_entry(n, struct vfio_dma, node); > > > iova = dma->iova; > > > + cache_flush_required = !domain->enforce_cache_coherency && > > > + !dma->cache_flush_required; > > > + if (cache_flush_required) > > > + dma->cache_flush_required = true; > > > > The variable name here isn't accurate and the logic is confusing. If > > the domain does not enforce coherency and the mapping is not tagged as > > requiring a cache flush, then we need to mark the mapping as requiring > > a cache flush. So the variable state is something more akin to > > set_cache_flush_required. But all we're saving with this is a > > redundant set if the mapping is already tagged as requiring a cache > > flush, so it could really be simplified to: > > > > dma->cache_flush_required = !domain->enforce_cache_coherency; > Sorry about the confusion. > > If dma->cache_flush_required is set to true by a domain not enforcing cache > coherency, we hope it will not be reset to false by a later attaching to domain > enforcing cache coherency due to the lazily flushing design. Right, ok, the vfio_dma objects are shared between domains so we never want to set 'dma->cache_flush_required = false' due to the addition of a 'domain->enforce_cache_coherent == true'. So this could be: if (!dma->cache_flush_required) dma->cache_flush_required = !domain->enforce_cache_coherency; > > It might add more clarity to just name the mapping flag > > dma->mapped_noncoherent. > > The dma->cache_flush_required is to mark whether pages in a vfio_dma requires > cache flush in the subsequence mapping into the first non-coherent domain > and page unpinning. How do we arrive at a sequence where we have dma->cache_flush_required that isn't the result of being mapped into a domain with !domain->enforce_cache_coherency? It seems to me that we only get 'dma->cache_flush_required == true' as a result of being mapped into a 'domain->enforce_cache_coherency == false' domain. In that case the flush-on-map is handled at the time we're setting dma->cache_flush_required and what we're actually tracking with the flag is that the dma object has been mapped into a noncoherent domain. > So, mapped_noncoherent may not be accurate. > Do you think it's better to put a comment for explanation? > > struct vfio_dma { > ... > bool iommu_mapped; > bool lock_cap; /* capable(CAP_IPC_LOCK) */ > bool vaddr_invalid; > /* > * Mark whether it is required to flush CPU caches when mapping pages > * of the vfio_dma to the first non-coherent domain and when unpinning > * pages of the vfio_dma > */ > bool cache_flush_required; > ... 
> }; > > > > > > > > while (iova < dma->iova + dma->size) { > > > phys_addr_t phys; > > > @@ -1737,6 +1774,9 @@ static int vfio_iommu_replay(struct vfio_iommu *iommu, > > > size = npage << PAGE_SHIFT; > > > } > > > > > > + if (cache_flush_required) > > > + arch_clean_nonsnoop_dma(phys, size); > > > + > > > > I agree with others as well that this arch callback should be named > > something relative to the cache-flush/write-back operation that it > > actually performs instead of the overall reason for us requiring it. > > > Ok. If there are no objections, I'll rename it to arch_flush_cache_phys() as > suggested by Kevin. Yes, better. > > > ret = iommu_map(domain->domain, iova, phys, size, > > > dma->prot | IOMMU_CACHE, > > > GFP_KERNEL_ACCOUNT); > > > @@ -1801,6 +1841,7 @@ static int vfio_iommu_replay(struct vfio_iommu *iommu, > > > vfio_unpin_pages_remote(dma, iova, phys >> PAGE_SHIFT, > > > size >> PAGE_SHIFT, true); > > > } > > > + dma->cache_flush_required = false; > > > } > > > > > > vfio_batch_fini(&batch); > > > @@ -1828,6 +1869,9 @@ static void vfio_test_domain_fgsp(struct vfio_domain *domain, struct list_head * > > > if (!pages) > > > return; > > > > > > + if (!domain->enforce_cache_coherency) > > > + arch_clean_nonsnoop_dma(page_to_phys(pages), PAGE_SIZE * 2); > > > + > > > list_for_each_entry(region, regions, list) { > > > start = ALIGN(region->start, PAGE_SIZE * 2); > > > if (start >= region->end || (region->end - start < PAGE_SIZE * 2)) > > > @@ -1847,6 +1891,9 @@ static void vfio_test_domain_fgsp(struct vfio_domain *domain, struct list_head * > > > break; > > > } > > > > > > + if (!domain->enforce_cache_coherency) > > > + arch_clean_nonsnoop_dma(page_to_phys(pages), PAGE_SIZE * 2); > > > + > > > > Seems like this use case isn't subject to the unmap aspect since these > > are kernel allocated and freed pages rather than userspace pages. > > There's not an "ongoing use of the page" concern. > > > > The window of opportunity for a device to discover and exploit the > > mapping side issue appears almost impossibly small. > > > The concern is for a malicious device attempting DMAs automatically. > Do you think this concern is valid? > As there're only extra flushes for 4 pages, what about keeping it for safety? Userspace doesn't know anything about these mappings, so to exploit them the device would somehow need to discover and interact with the mapping in the split second that the mapping exists, without exposing itself with mapping faults at the IOMMU. I don't mind keeping the flush before map so that infinitesimal gap where previous data in physical memory exposed to the device is closed, but I have a much harder time seeing that the flush on unmap to synchronize physical memory is required. For example, the potential KSM use case doesn't exist since the pages are not owned by the user. Any subsequent use of the pages would be subject to the same condition we assumed after allocation, where the physical data may be inconsistent with the cached data. It's easy to flush 2 pages, but I think it obscures the function of the flush if we can't articulate the value in this case. 
> > > __free_pages(pages, order); > > > } > > > > > > @@ -2308,6 +2355,8 @@ static int vfio_iommu_type1_attach_group(void *iommu_data, > > > > > > list_add(&domain->next, &iommu->domain_list); > > > vfio_update_pgsize_bitmap(iommu); > > > + if (!domain->enforce_cache_coherency) > > > + vfio_update_noncoherent_domain_state(iommu); > > > > Why isn't this simply: > > > > if (!domain->enforce_cache_coherency) > > iommu->has_noncoherent_domain = true; > Yes, it's simpler during attach. > > > Or maybe: > > > > if (!domain->enforce_cache_coherency) > > iommu->noncoherent_domains++; > Yes, this counter is better. > I previously thought a bool can save some space. > > > > done: > > > /* Delete the old one and insert new iova list */ > > > vfio_iommu_iova_insert_copy(iommu, &iova_copy); > > > @@ -2508,6 +2557,8 @@ static void vfio_iommu_type1_detach_group(void *iommu_data, > > > } > > > iommu_domain_free(domain->domain); > > > list_del(&domain->next); > > > + if (!domain->enforce_cache_coherency) > > > + vfio_update_noncoherent_domain_state(iommu); > > > > If we were to just track the number of noncoherent domains, this could > > simply be iommu->noncoherent_domains-- and VFIO_DMA_CC_DMA could be: > > > > return iommu->noncoherent_domains ? 1 : 0; > > > > Maybe there should be wrappers for list_add() and list_del() relative > > to the iommu domain list to make it just be a counter. Thanks, > > Do you think we can skip the "iommu->noncoherent_domains--" in > vfio_iommu_type1_release() when iommu is about to be freed. > > Asking that is also because it's hard for me to find a good name for the wrapper > around list_del(). :) vfio_iommu_link_domain(), vfio_iommu_unlink_domain()? > > It follows vfio_release_domain() in vfio_iommu_type1_release(), but not in > vfio_iommu_type1_detach_group(). I'm not sure I understand the concern here, detach_group is performed under the iommu->lock where the value of iommu->noncohernet_domains is only guaranteed while this lock is held. In the release callback the iommu->lock is not held, but we have no external users at this point. It's not strictly required that we decrement each domain, but it's also not a bad sanity test that iommu->noncoherent_domains should be zero after unlinking the domains. Thanks, Alex > > > kfree(domain); > > > vfio_iommu_aper_expand(iommu, > > > &iova_copy); vfio_update_pgsize_bitmap(iommu); > > >
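A minimal sketch of the wrappers named above, assuming the noncoherent-domain counter discussed earlier in the thread:

static void vfio_iommu_link_domain(struct vfio_iommu *iommu,
				   struct vfio_domain *domain)
{
	list_add(&domain->next, &iommu->domain_list);
	if (!domain->enforce_cache_coherency)
		iommu->noncoherent_domains++;
}

static void vfio_iommu_unlink_domain(struct vfio_iommu *iommu,
				     struct vfio_domain *domain)
{
	list_del(&domain->next);
	if (!domain->enforce_cache_coherency)
		iommu->noncoherent_domains--;
}

With attach, detach and release all going through these wrappers, the counter naturally reaches zero once every domain is unlinked, which is the sanity property mentioned above.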
On Fri, May 10, 2024 at 10:57:28AM -0600, Alex Williamson wrote: > On Fri, 10 May 2024 18:31:13 +0800 > Yan Zhao <yan.y.zhao@intel.com> wrote: > > > On Thu, May 09, 2024 at 12:10:49PM -0600, Alex Williamson wrote: > > > On Tue, 7 May 2024 14:21:38 +0800 > > > Yan Zhao <yan.y.zhao@intel.com> wrote: > > ... > > > > drivers/vfio/vfio_iommu_type1.c | 51 +++++++++++++++++++++++++++++++++ > > > > 1 file changed, 51 insertions(+) > > > > > > > > diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c > > > > index b5c15fe8f9fc..ce873f4220bf 100644 > > > > --- a/drivers/vfio/vfio_iommu_type1.c > > > > +++ b/drivers/vfio/vfio_iommu_type1.c > > > > @@ -74,6 +74,7 @@ struct vfio_iommu { > > > > bool v2; > > > > bool nesting; > > > > bool dirty_page_tracking; > > > > + bool has_noncoherent_domain; > > > > struct list_head emulated_iommu_groups; > > > > }; > > > > > > > > @@ -99,6 +100,7 @@ struct vfio_dma { > > > > unsigned long *bitmap; > > > > struct mm_struct *mm; > > > > size_t locked_vm; > > > > + bool cache_flush_required; /* For noncoherent domain */ > > > > > > Poor packing, minimally this should be grouped with the other bools in > > > the structure, longer term they should likely all be converted to > > > bit fields. > > Yes. Will do! > > > > > > > > > }; > > > > > > > > struct vfio_batch { > > > > @@ -716,6 +718,9 @@ static long vfio_unpin_pages_remote(struct vfio_dma *dma, dma_addr_t iova, > > > > long unlocked = 0, locked = 0; > > > > long i; > > > > > > > > + if (dma->cache_flush_required) > > > > + arch_clean_nonsnoop_dma(pfn << PAGE_SHIFT, npage << PAGE_SHIFT); > > > > + > > > > for (i = 0; i < npage; i++, iova += PAGE_SIZE) { > > > > if (put_pfn(pfn++, dma->prot)) { > > > > unlocked++; > > > > @@ -1099,6 +1104,8 @@ static long vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma *dma, > > > > &iotlb_gather); > > > > } > > > > > > > > + dma->cache_flush_required = false; > > > > + > > > > if (do_accounting) { > > > > vfio_lock_acct(dma, -unlocked, true); > > > > return 0; > > > > @@ -1120,6 +1127,21 @@ static void vfio_remove_dma(struct vfio_iommu *iommu, struct vfio_dma *dma) > > > > iommu->dma_avail++; > > > > } > > > > > > > > +static void vfio_update_noncoherent_domain_state(struct vfio_iommu *iommu) > > > > +{ > > > > + struct vfio_domain *domain; > > > > + bool has_noncoherent = false; > > > > + > > > > + list_for_each_entry(domain, &iommu->domain_list, next) { > > > > + if (domain->enforce_cache_coherency) > > > > + continue; > > > > + > > > > + has_noncoherent = true; > > > > + break; > > > > + } > > > > + iommu->has_noncoherent_domain = has_noncoherent; > > > > +} > > > > > > This should be merged with vfio_domains_have_enforce_cache_coherency() > > > and the VFIO_DMA_CC_IOMMU extension (if we keep it, see below). > > Will convert it to a counter and do the merge. > > Thanks for pointing it out! > > > > > > > > > + > > > > static void vfio_update_pgsize_bitmap(struct vfio_iommu *iommu) > > > > { > > > > struct vfio_domain *domain; > > > > @@ -1455,6 +1477,12 @@ static int vfio_pin_map_dma(struct vfio_iommu *iommu, struct vfio_dma *dma, > > > > > > > > vfio_batch_init(&batch); > > > > > > > > + /* > > > > + * Record necessity to flush CPU cache to make sure CPU cache is flushed > > > > + * for both pin & map and unmap & unpin (for unwind) paths. 
> > > > + */ > > > > + dma->cache_flush_required = iommu->has_noncoherent_domain; > > > > + > > > > while (size) { > > > > /* Pin a contiguous chunk of memory */ > > > > npage = vfio_pin_pages_remote(dma, vaddr + dma->size, > > > > @@ -1466,6 +1494,10 @@ static int vfio_pin_map_dma(struct vfio_iommu *iommu, struct vfio_dma *dma, > > > > break; > > > > } > > > > > > > > + if (dma->cache_flush_required) > > > > + arch_clean_nonsnoop_dma(pfn << PAGE_SHIFT, > > > > + npage << PAGE_SHIFT); > > > > + > > > > /* Map it! */ > > > > ret = vfio_iommu_map(iommu, iova + dma->size, pfn, npage, > > > > dma->prot); > > > > @@ -1683,9 +1715,14 @@ static int vfio_iommu_replay(struct vfio_iommu *iommu, > > > > for (; n; n = rb_next(n)) { > > > > struct vfio_dma *dma; > > > > dma_addr_t iova; > > > > + bool cache_flush_required; > > > > > > > > dma = rb_entry(n, struct vfio_dma, node); > > > > iova = dma->iova; > > > > + cache_flush_required = !domain->enforce_cache_coherency && > > > > + !dma->cache_flush_required; > > > > + if (cache_flush_required) > > > > + dma->cache_flush_required = true; > > > > > > The variable name here isn't accurate and the logic is confusing. If > > > the domain does not enforce coherency and the mapping is not tagged as > > > requiring a cache flush, then we need to mark the mapping as requiring > > > a cache flush. So the variable state is something more akin to > > > set_cache_flush_required. But all we're saving with this is a > > > redundant set if the mapping is already tagged as requiring a cache > > > flush, so it could really be simplified to: > > > > > > dma->cache_flush_required = !domain->enforce_cache_coherency; > > Sorry about the confusion. > > > > If dma->cache_flush_required is set to true by a domain not enforcing cache > > coherency, we hope it will not be reset to false by a later attaching to domain > > enforcing cache coherency due to the lazily flushing design. > > Right, ok, the vfio_dma objects are shared between domains so we never > want to set 'dma->cache_flush_required = false' due to the addition of a > 'domain->enforce_cache_coherent == true'. So this could be: > > if (!dma->cache_flush_required) > dma->cache_flush_required = !domain->enforce_cache_coherency; Though this code is easier for understanding, it leads to unnecessary setting of dma->cache_flush_required to false, given domain->enforce_cache_coherency is true at the most time. > > > It might add more clarity to just name the mapping flag > > > dma->mapped_noncoherent. > > > > The dma->cache_flush_required is to mark whether pages in a vfio_dma requires > > cache flush in the subsequence mapping into the first non-coherent domain > > and page unpinning. > > How do we arrive at a sequence where we have dma->cache_flush_required > that isn't the result of being mapped into a domain with > !domain->enforce_cache_coherency? Hmm, dma->cache_flush_required IS the result of being mapped into a domain with !domain->enforce_cache_coherency. My concern only arrives from the actual code sequence, i.e. dma->cache_flush_required is set to true before the actual mapping. If we rename it to dma->mapped_noncoherent and only set it to true after the actual successful mapping, it would lead to more code to handle flushing for the unwind case. Currently, flush for unwind is handled centrally in vfio_unpin_pages_remote() by checking dma->cache_flush_required, which is true even before a full successful mapping, so we won't miss flush on any pages that are mapped into a non-coherent domain in a short window. 
> > It seems to me that we only get 'dma->cache_flush_required == true' as > a result of being mapped into a 'domain->enforce_cache_coherency == > false' domain. In that case the flush-on-map is handled at the time > we're setting dma->cache_flush_required and what we're actually > tracking with the flag is that the dma object has been mapped into a > noncoherent domain. > > > So, mapped_noncoherent may not be accurate. > > Do you think it's better to put a comment for explanation? > > > > struct vfio_dma { > > ... > > bool iommu_mapped; > > bool lock_cap; /* capable(CAP_IPC_LOCK) */ > > bool vaddr_invalid; > > /* > > * Mark whether it is required to flush CPU caches when mapping pages > > * of the vfio_dma to the first non-coherent domain and when unpinning > > * pages of the vfio_dma > > */ > > bool cache_flush_required; > > ... > > }; > > > > > > > > > > > while (iova < dma->iova + dma->size) { > > > > phys_addr_t phys; > > > > @@ -1737,6 +1774,9 @@ static int vfio_iommu_replay(struct vfio_iommu *iommu, > > > > size = npage << PAGE_SHIFT; > > > > } > > > > > > > > + if (cache_flush_required) > > > > + arch_clean_nonsnoop_dma(phys, size); > > > > + > > > > > > I agree with others as well that this arch callback should be named > > > something relative to the cache-flush/write-back operation that it > > > actually performs instead of the overall reason for us requiring it. > > > > > Ok. If there are no objections, I'll rename it to arch_flush_cache_phys() as > > suggested by Kevin. > > Yes, better. > > > > > ret = iommu_map(domain->domain, iova, phys, size, > > > > dma->prot | IOMMU_CACHE, > > > > GFP_KERNEL_ACCOUNT); > > > > @@ -1801,6 +1841,7 @@ static int vfio_iommu_replay(struct vfio_iommu *iommu, > > > > vfio_unpin_pages_remote(dma, iova, phys >> PAGE_SHIFT, > > > > size >> PAGE_SHIFT, true); > > > > } > > > > + dma->cache_flush_required = false; > > > > } > > > > > > > > vfio_batch_fini(&batch); > > > > @@ -1828,6 +1869,9 @@ static void vfio_test_domain_fgsp(struct vfio_domain *domain, struct list_head * > > > > if (!pages) > > > > return; > > > > > > > > + if (!domain->enforce_cache_coherency) > > > > + arch_clean_nonsnoop_dma(page_to_phys(pages), PAGE_SIZE * 2); > > > > + > > > > list_for_each_entry(region, regions, list) { > > > > start = ALIGN(region->start, PAGE_SIZE * 2); > > > > if (start >= region->end || (region->end - start < PAGE_SIZE * 2)) > > > > @@ -1847,6 +1891,9 @@ static void vfio_test_domain_fgsp(struct vfio_domain *domain, struct list_head * > > > > break; > > > > } > > > > > > > > + if (!domain->enforce_cache_coherency) > > > > + arch_clean_nonsnoop_dma(page_to_phys(pages), PAGE_SIZE * 2); > > > > + > > > > > > Seems like this use case isn't subject to the unmap aspect since these > > > are kernel allocated and freed pages rather than userspace pages. > > > There's not an "ongoing use of the page" concern. > > > > > > The window of opportunity for a device to discover and exploit the > > > mapping side issue appears almost impossibly small. > > > > > The concern is for a malicious device attempting DMAs automatically. > > Do you think this concern is valid? > > As there're only extra flushes for 4 pages, what about keeping it for safety? > > Userspace doesn't know anything about these mappings, so to exploit > them the device would somehow need to discover and interact with the > mapping in the split second that the mapping exists, without exposing > itself with mapping faults at the IOMMU. 
> > I don't mind keeping the flush before map so that infinitesimal gap > where previous data in physical memory exposed to the device is closed, > but I have a much harder time seeing that the flush on unmap to > synchronize physical memory is required. > > For example, the potential KSM use case doesn't exist since the pages > are not owned by the user. Any subsequent use of the pages would be > subject to the same condition we assumed after allocation, where the > physical data may be inconsistent with the cached data. It's easy to > flush 2 pages, but I think it obscures the function of the flush if we > can't articulate the value in this case. > I agree the second flush is not necessary if we are confident that functions in between the two flushes do not and will not touch the page in CPU side. However, can we guarantee this? For instance, is it possible for some IOMMU driver to read/write the page for some quirks? (Or is it just a totally paranoid?) If that's not impossible, then ensuring cache and memory coherency before page reclaiming is better? > > > > > __free_pages(pages, order); > > > > } > > > > > > > > @@ -2308,6 +2355,8 @@ static int vfio_iommu_type1_attach_group(void *iommu_data, > > > > > > > > list_add(&domain->next, &iommu->domain_list); > > > > vfio_update_pgsize_bitmap(iommu); > > > > + if (!domain->enforce_cache_coherency) > > > > + vfio_update_noncoherent_domain_state(iommu); > > > > > > Why isn't this simply: > > > > > > if (!domain->enforce_cache_coherency) > > > iommu->has_noncoherent_domain = true; > > Yes, it's simpler during attach. > > > > > Or maybe: > > > > > > if (!domain->enforce_cache_coherency) > > > iommu->noncoherent_domains++; > > Yes, this counter is better. > > I previously thought a bool can save some space. > > > > > > done: > > > > /* Delete the old one and insert new iova list */ > > > > vfio_iommu_iova_insert_copy(iommu, &iova_copy); > > > > @@ -2508,6 +2557,8 @@ static void vfio_iommu_type1_detach_group(void *iommu_data, > > > > } > > > > iommu_domain_free(domain->domain); > > > > list_del(&domain->next); > > > > + if (!domain->enforce_cache_coherency) > > > > + vfio_update_noncoherent_domain_state(iommu); > > > > > > If we were to just track the number of noncoherent domains, this could > > > simply be iommu->noncoherent_domains-- and VFIO_DMA_CC_DMA could be: > > > > > > return iommu->noncoherent_domains ? 1 : 0; > > > > > > Maybe there should be wrappers for list_add() and list_del() relative > > > to the iommu domain list to make it just be a counter. Thanks, > > > > Do you think we can skip the "iommu->noncoherent_domains--" in > > vfio_iommu_type1_release() when iommu is about to be freed. > > > > Asking that is also because it's hard for me to find a good name for the wrapper > > around list_del(). :) > > vfio_iommu_link_domain(), vfio_iommu_unlink_domain()? Ah, this is a good name! > > > > It follows vfio_release_domain() in vfio_iommu_type1_release(), but not in > > vfio_iommu_type1_detach_group(). > > I'm not sure I understand the concern here, detach_group is performed > under the iommu->lock where the value of iommu->noncohernet_domains is > only guaranteed while this lock is held. In the release callback the > iommu->lock is not held, but we have no external users at this point. > It's not strictly required that we decrement each domain, but it's also > not a bad sanity test that iommu->noncoherent_domains should be zero > after unlinking the domains. 
Thanks, I previously thought I couldn't find a name for a domain operation that's called after vfio_release_domain(), and I couldn't merge list_del() into vfio_release_domain() given it's not in vfio_iommu_type1_detach_group(). But vfio_iommu_unlink_domain() is a good one. I'll rename list_del() to vfio_iommu_unlink_domain(). Thanks! Yan
> From: Zhao, Yan Y <yan.y.zhao@intel.com> > Sent: Monday, May 13, 2024 3:11 PM > On Fri, May 10, 2024 at 10:57:28AM -0600, Alex Williamson wrote: > > On Fri, 10 May 2024 18:31:13 +0800 > > Yan Zhao <yan.y.zhao@intel.com> wrote: > > > > > > The dma->cache_flush_required is to mark whether pages in a vfio_dma > requires > > > cache flush in the subsequence mapping into the first non-coherent > domain > > > and page unpinning. > > > > How do we arrive at a sequence where we have dma- > >cache_flush_required > > that isn't the result of being mapped into a domain with > > !domain->enforce_cache_coherency? > Hmm, dma->cache_flush_required IS the result of being mapped into a > domain with > !domain->enforce_cache_coherency. > My concern only arrives from the actual code sequence, i.e. > dma->cache_flush_required is set to true before the actual mapping. > > If we rename it to dma->mapped_noncoherent and only set it to true after > the > actual successful mapping, it would lead to more code to handle flushing for > the > unwind case. > Currently, flush for unwind is handled centrally in vfio_unpin_pages_remote() > by checking dma->cache_flush_required, which is true even before a full > successful mapping, so we won't miss flush on any pages that are mapped > into a > non-coherent domain in a short window. > What about storing a vfio_iommu pointer in vfio_dma? Or pass an extra parameter to vfio_unpin_pages_remote()...
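One possible reading of the extra-parameter option is sketched below; the new parameter name and the call-site choices are assumptions, not something spelled out in this thread.

static long vfio_unpin_pages_remote(struct vfio_dma *dma, dma_addr_t iova,
				    unsigned long pfn, long npage,
				    bool do_accounting, bool flush_cache)
{
	long unlocked = 0, locked = 0;
	long i;

	/*
	 * The caller decides whether a write-back is needed: the unwind
	 * path in vfio_pin_map_dma() could pass the iommu's noncoherent
	 * state, while vfio_unmap_unpin() could pass the per-dma flag.
	 */
	if (flush_cache)
		arch_flush_cache_phys((phys_addr_t)pfn << PAGE_SHIFT,
				      (size_t)npage << PAGE_SHIFT);

	for (i = 0; i < npage; i++, iova += PAGE_SIZE) {
		if (put_pfn(pfn++, dma->prot)) {
			unlocked++;
			if (vfio_find_vpfn(dma, iova))
				locked++;
		}
	}

	if (do_accounting)
		vfio_lock_acct(dma, locked - unlocked, true);

	return unlocked;
}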
> From: Zhao, Yan Y <yan.y.zhao@intel.com> > Sent: Monday, May 13, 2024 3:11 PM > On Fri, May 10, 2024 at 10:57:28AM -0600, Alex Williamson wrote: > > On Fri, 10 May 2024 18:31:13 +0800 > > Yan Zhao <yan.y.zhao@intel.com> wrote: > > > > > On Thu, May 09, 2024 at 12:10:49PM -0600, Alex Williamson wrote: > > > > On Tue, 7 May 2024 14:21:38 +0800 > > > > Yan Zhao <yan.y.zhao@intel.com> wrote: > > > > > @@ -1847,6 +1891,9 @@ static void vfio_test_domain_fgsp(struct > vfio_domain *domain, struct list_head * > > > > > break; > > > > > } > > > > > > > > > > + if (!domain->enforce_cache_coherency) > > > > > + arch_clean_nonsnoop_dma(page_to_phys(pages), > PAGE_SIZE * 2); > > > > > + > > > > > > > > Seems like this use case isn't subject to the unmap aspect since these > > > > are kernel allocated and freed pages rather than userspace pages. > > > > There's not an "ongoing use of the page" concern. > > > > > > > > The window of opportunity for a device to discover and exploit the > > > > mapping side issue appears almost impossibly small. > > > > > > > The concern is for a malicious device attempting DMAs automatically. > > > Do you think this concern is valid? > > > As there're only extra flushes for 4 pages, what about keeping it for safety? > > > > Userspace doesn't know anything about these mappings, so to exploit > > them the device would somehow need to discover and interact with the > > mapping in the split second that the mapping exists, without exposing > > itself with mapping faults at the IOMMU. Userspace could guess the attacking ranges based on code, e.g. currently the code just tries to use the 1st available IOVA region which likely starts at address 0. and mapping faults don't stop the attack. Just some after-the-fact hint revealing the possibility of being attacked.
On Thu, 16 May 2024 08:34:20 +0000 "Tian, Kevin" <kevin.tian@intel.com> wrote: > > From: Zhao, Yan Y <yan.y.zhao@intel.com> > > Sent: Monday, May 13, 2024 3:11 PM > > On Fri, May 10, 2024 at 10:57:28AM -0600, Alex Williamson wrote: > > > On Fri, 10 May 2024 18:31:13 +0800 > > > Yan Zhao <yan.y.zhao@intel.com> wrote: > > > > > > > On Thu, May 09, 2024 at 12:10:49PM -0600, Alex Williamson wrote: > > > > > On Tue, 7 May 2024 14:21:38 +0800 > > > > > Yan Zhao <yan.y.zhao@intel.com> wrote: > > > > > > @@ -1847,6 +1891,9 @@ static void vfio_test_domain_fgsp(struct > > vfio_domain *domain, struct list_head * > > > > > > break; > > > > > > } > > > > > > > > > > > > + if (!domain->enforce_cache_coherency) > > > > > > + arch_clean_nonsnoop_dma(page_to_phys(pages), > > PAGE_SIZE * 2); > > > > > > + > > > > > > > > > > Seems like this use case isn't subject to the unmap aspect since these > > > > > are kernel allocated and freed pages rather than userspace pages. > > > > > There's not an "ongoing use of the page" concern. > > > > > > > > > > The window of opportunity for a device to discover and exploit the > > > > > mapping side issue appears almost impossibly small. > > > > > > > > > The concern is for a malicious device attempting DMAs automatically. > > > > Do you think this concern is valid? > > > > As there're only extra flushes for 4 pages, what about keeping it for safety? > > > > > > Userspace doesn't know anything about these mappings, so to exploit > > > them the device would somehow need to discover and interact with the > > > mapping in the split second that the mapping exists, without exposing > > > itself with mapping faults at the IOMMU. > > Userspace could guess the attacking ranges based on code, e.g. currently > the code just tries to use the 1st available IOVA region which likely starts > at address 0. > > and mapping faults don't stop the attack. Just some after-the-fact hint > revealing the possibility of being attacked.
On Mon, 13 May 2024 15:11:28 +0800 Yan Zhao <yan.y.zhao@intel.com> wrote: > On Fri, May 10, 2024 at 10:57:28AM -0600, Alex Williamson wrote: > > On Fri, 10 May 2024 18:31:13 +0800 > > Yan Zhao <yan.y.zhao@intel.com> wrote: > > > > > On Thu, May 09, 2024 at 12:10:49PM -0600, Alex Williamson wrote: > > > > On Tue, 7 May 2024 14:21:38 +0800 > > > > Yan Zhao <yan.y.zhao@intel.com> wrote: > > > ... > > > > > drivers/vfio/vfio_iommu_type1.c | 51 +++++++++++++++++++++++++++++++++ > > > > > 1 file changed, 51 insertions(+) > > > > > > > > > > diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c > > > > > index b5c15fe8f9fc..ce873f4220bf 100644 > > > > > --- a/drivers/vfio/vfio_iommu_type1.c > > > > > +++ b/drivers/vfio/vfio_iommu_type1.c > > > > > @@ -74,6 +74,7 @@ struct vfio_iommu { > > > > > bool v2; > > > > > bool nesting; > > > > > bool dirty_page_tracking; > > > > > + bool has_noncoherent_domain; > > > > > struct list_head emulated_iommu_groups; > > > > > }; > > > > > > > > > > @@ -99,6 +100,7 @@ struct vfio_dma { > > > > > unsigned long *bitmap; > > > > > struct mm_struct *mm; > > > > > size_t locked_vm; > > > > > + bool cache_flush_required; /* For noncoherent domain */ > > > > > > > > Poor packing, minimally this should be grouped with the other bools in > > > > the structure, longer term they should likely all be converted to > > > > bit fields. > > > Yes. Will do! > > > > > > > > > > > > }; > > > > > > > > > > struct vfio_batch { > > > > > @@ -716,6 +718,9 @@ static long vfio_unpin_pages_remote(struct vfio_dma *dma, dma_addr_t iova, > > > > > long unlocked = 0, locked = 0; > > > > > long i; > > > > > > > > > > + if (dma->cache_flush_required) > > > > > + arch_clean_nonsnoop_dma(pfn << PAGE_SHIFT, npage << PAGE_SHIFT); > > > > > + > > > > > for (i = 0; i < npage; i++, iova += PAGE_SIZE) { > > > > > if (put_pfn(pfn++, dma->prot)) { > > > > > unlocked++; > > > > > @@ -1099,6 +1104,8 @@ static long vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma *dma, > > > > > &iotlb_gather); > > > > > } > > > > > > > > > > + dma->cache_flush_required = false; > > > > > + > > > > > if (do_accounting) { > > > > > vfio_lock_acct(dma, -unlocked, true); > > > > > return 0; > > > > > @@ -1120,6 +1127,21 @@ static void vfio_remove_dma(struct vfio_iommu *iommu, struct vfio_dma *dma) > > > > > iommu->dma_avail++; > > > > > } > > > > > > > > > > +static void vfio_update_noncoherent_domain_state(struct vfio_iommu *iommu) > > > > > +{ > > > > > + struct vfio_domain *domain; > > > > > + bool has_noncoherent = false; > > > > > + > > > > > + list_for_each_entry(domain, &iommu->domain_list, next) { > > > > > + if (domain->enforce_cache_coherency) > > > > > + continue; > > > > > + > > > > > + has_noncoherent = true; > > > > > + break; > > > > > + } > > > > > + iommu->has_noncoherent_domain = has_noncoherent; > > > > > +} > > > > > > > > This should be merged with vfio_domains_have_enforce_cache_coherency() > > > > and the VFIO_DMA_CC_IOMMU extension (if we keep it, see below). > > > Will convert it to a counter and do the merge. > > > Thanks for pointing it out! 
> > > > > > > > > > > > + > > > > > static void vfio_update_pgsize_bitmap(struct vfio_iommu *iommu) > > > > > { > > > > > struct vfio_domain *domain; > > > > > @@ -1455,6 +1477,12 @@ static int vfio_pin_map_dma(struct vfio_iommu *iommu, struct vfio_dma *dma, > > > > > > > > > > vfio_batch_init(&batch); > > > > > > > > > > + /* > > > > > + * Record necessity to flush CPU cache to make sure CPU cache is flushed > > > > > + * for both pin & map and unmap & unpin (for unwind) paths. > > > > > + */ > > > > > + dma->cache_flush_required = iommu->has_noncoherent_domain; > > > > > + > > > > > while (size) { > > > > > /* Pin a contiguous chunk of memory */ > > > > > npage = vfio_pin_pages_remote(dma, vaddr + dma->size, > > > > > @@ -1466,6 +1494,10 @@ static int vfio_pin_map_dma(struct vfio_iommu *iommu, struct vfio_dma *dma, > > > > > break; > > > > > } > > > > > > > > > > + if (dma->cache_flush_required) > > > > > + arch_clean_nonsnoop_dma(pfn << PAGE_SHIFT, > > > > > + npage << PAGE_SHIFT); > > > > > + > > > > > /* Map it! */ > > > > > ret = vfio_iommu_map(iommu, iova + dma->size, pfn, npage, > > > > > dma->prot); > > > > > @@ -1683,9 +1715,14 @@ static int vfio_iommu_replay(struct vfio_iommu *iommu, > > > > > for (; n; n = rb_next(n)) { > > > > > struct vfio_dma *dma; > > > > > dma_addr_t iova; > > > > > + bool cache_flush_required; > > > > > > > > > > dma = rb_entry(n, struct vfio_dma, node); > > > > > iova = dma->iova; > > > > > + cache_flush_required = !domain->enforce_cache_coherency && > > > > > + !dma->cache_flush_required; > > > > > + if (cache_flush_required) > > > > > + dma->cache_flush_required = true; > > > > > > > > The variable name here isn't accurate and the logic is confusing. If > > > > the domain does not enforce coherency and the mapping is not tagged as > > > > requiring a cache flush, then we need to mark the mapping as requiring > > > > a cache flush. So the variable state is something more akin to > > > > set_cache_flush_required. But all we're saving with this is a > > > > redundant set if the mapping is already tagged as requiring a cache > > > > flush, so it could really be simplified to: > > > > > > > > dma->cache_flush_required = !domain->enforce_cache_coherency; > > > Sorry about the confusion. > > > > > > If dma->cache_flush_required is set to true by a domain not enforcing cache > > > coherency, we hope it will not be reset to false by a later attaching to domain > > > enforcing cache coherency due to the lazily flushing design. > > > > Right, ok, the vfio_dma objects are shared between domains so we never > > want to set 'dma->cache_flush_required = false' due to the addition of a > > 'domain->enforce_cache_coherent == true'. So this could be: > > > > if (!dma->cache_flush_required) > > dma->cache_flush_required = !domain->enforce_cache_coherency; > > Though this code is easier for understanding, it leads to unnecessary setting of > dma->cache_flush_required to false, given domain->enforce_cache_coherency is > true at the most time. I don't really see that as an issue, but the variable name originally chosen above, cache_flush_required, also doesn't convey that it's only attempting to set the value if it wasn't previously set and is now required by a noncoherent domain. > > > > It might add more clarity to just name the mapping flag > > > > dma->mapped_noncoherent. > > > > > > The dma->cache_flush_required is to mark whether pages in a vfio_dma requires > > > cache flush in the subsequence mapping into the first non-coherent domain > > > and page unpinning. 
> > > > How do we arrive at a sequence where we have dma->cache_flush_required > > that isn't the result of being mapped into a domain with > > !domain->enforce_cache_coherency? > Hmm, dma->cache_flush_required IS the result of being mapped into a domain with > !domain->enforce_cache_coherency. > My concern only arrives from the actual code sequence, i.e. > dma->cache_flush_required is set to true before the actual mapping. > > If we rename it to dma->mapped_noncoherent and only set it to true after the > actual successful mapping, it would lead to more code to handle flushing for the > unwind case. > Currently, flush for unwind is handled centrally in vfio_unpin_pages_remote() > by checking dma->cache_flush_required, which is true even before a full > successful mapping, so we won't miss flush on any pages that are mapped into a > non-coherent domain in a short window. I don't think we need to be so literal that "mapped_noncoherent" can only be set after the vfio_dma is fully mapped to a noncoherent domain, but also we can come up with other names for the flag. Perhaps "is_noncoherent". My suggestion was more from the perspective of what does the flag represent rather than what we intend to do as a result of the flag being set. Thanks, Alex
On Thu, May 16, 2024 at 02:50:09PM -0600, Alex Williamson wrote: > On Mon, 13 May 2024 15:11:28 +0800 > Yan Zhao <yan.y.zhao@intel.com> wrote: > > > On Fri, May 10, 2024 at 10:57:28AM -0600, Alex Williamson wrote: > > > On Fri, 10 May 2024 18:31:13 +0800 > > > Yan Zhao <yan.y.zhao@intel.com> wrote: > > > > > > > On Thu, May 09, 2024 at 12:10:49PM -0600, Alex Williamson wrote: > > > > > On Tue, 7 May 2024 14:21:38 +0800 > > > > > Yan Zhao <yan.y.zhao@intel.com> wrote: ... > > > > > > @@ -1683,9 +1715,14 @@ static int vfio_iommu_replay(struct vfio_iommu *iommu, > > > > > > for (; n; n = rb_next(n)) { > > > > > > struct vfio_dma *dma; > > > > > > dma_addr_t iova; > > > > > > + bool cache_flush_required; > > > > > > > > > > > > dma = rb_entry(n, struct vfio_dma, node); > > > > > > iova = dma->iova; > > > > > > + cache_flush_required = !domain->enforce_cache_coherency && > > > > > > + !dma->cache_flush_required; > > > > > > + if (cache_flush_required) > > > > > > + dma->cache_flush_required = true; > > > > > > > > > > The variable name here isn't accurate and the logic is confusing. If > > > > > the domain does not enforce coherency and the mapping is not tagged as > > > > > requiring a cache flush, then we need to mark the mapping as requiring > > > > > a cache flush. So the variable state is something more akin to > > > > > set_cache_flush_required. But all we're saving with this is a > > > > > redundant set if the mapping is already tagged as requiring a cache > > > > > flush, so it could really be simplified to: > > > > > > > > > > dma->cache_flush_required = !domain->enforce_cache_coherency; > > > > Sorry about the confusion. > > > > > > > > If dma->cache_flush_required is set to true by a domain not enforcing cache > > > > coherency, we hope it will not be reset to false by a later attaching to domain > > > > enforcing cache coherency due to the lazily flushing design. > > > > > > Right, ok, the vfio_dma objects are shared between domains so we never > > > want to set 'dma->cache_flush_required = false' due to the addition of a > > > 'domain->enforce_cache_coherent == true'. So this could be: > > > > > > if (!dma->cache_flush_required) > > > dma->cache_flush_required = !domain->enforce_cache_coherency; > > > > Though this code is easier for understanding, it leads to unnecessary setting of > > dma->cache_flush_required to false, given domain->enforce_cache_coherency is > > true at the most time. > > I don't really see that as an issue, but the variable name originally > chosen above, cache_flush_required, also doesn't convey that it's only > attempting to set the value if it wasn't previously set and is now > required by a noncoherent domain. Agreed, the old name is too vague. What about update_to_noncoherent_required? Then in vfio_iommu_replay(), it's like update_to_noncoherent_required = !domain->enforce_cache_coherency && !dma->is_noncoherent; if (update_to_noncoherent_required) dma->is_noncoherent = true; ... if (update_to_noncoherent_required) arch_flush_cache_phys((phys, size); > > > > > > It might add more clarity to just name the mapping flag > > > > > dma->mapped_noncoherent. > > > > > > > > The dma->cache_flush_required is to mark whether pages in a vfio_dma requires > > > > cache flush in the subsequence mapping into the first non-coherent domain > > > > and page unpinning. > > > > > > How do we arrive at a sequence where we have dma->cache_flush_required > > > that isn't the result of being mapped into a domain with > > > !domain->enforce_cache_coherency? 
> > Hmm, dma->cache_flush_required IS the result of being mapped into a domain with > > !domain->enforce_cache_coherency. > > My concern only arrives from the actual code sequence, i.e. > > dma->cache_flush_required is set to true before the actual mapping. > > > > If we rename it to dma->mapped_noncoherent and only set it to true after the > > actual successful mapping, it would lead to more code to handle flushing for the > > unwind case. > > Currently, flush for unwind is handled centrally in vfio_unpin_pages_remote() > > by checking dma->cache_flush_required, which is true even before a full > > successful mapping, so we won't miss flush on any pages that are mapped into a > > non-coherent domain in a short window. > > I don't think we need to be so literal that "mapped_noncoherent" can > only be set after the vfio_dma is fully mapped to a noncoherent domain, > but also we can come up with other names for the flag. Perhaps > "is_noncoherent". My suggestion was more from the perspective of what > does the flag represent rather than what we intend to do as a result of > the flag being set. Thanks, Makes sense! I like the name "is_noncoherent" :)
On Fri, 17 May 2024 11:11:48 +0800 Yan Zhao <yan.y.zhao@intel.com> wrote: > On Thu, May 16, 2024 at 02:50:09PM -0600, Alex Williamson wrote: > > On Mon, 13 May 2024 15:11:28 +0800 > > Yan Zhao <yan.y.zhao@intel.com> wrote: > > > > > On Fri, May 10, 2024 at 10:57:28AM -0600, Alex Williamson wrote: > > > > On Fri, 10 May 2024 18:31:13 +0800 > > > > Yan Zhao <yan.y.zhao@intel.com> wrote: > > > > > > > > > On Thu, May 09, 2024 at 12:10:49PM -0600, Alex Williamson wrote: > > > > > > On Tue, 7 May 2024 14:21:38 +0800 > > > > > > Yan Zhao <yan.y.zhao@intel.com> wrote: > ... > > > > > > > @@ -1683,9 +1715,14 @@ static int vfio_iommu_replay(struct vfio_iommu *iommu, > > > > > > > for (; n; n = rb_next(n)) { > > > > > > > struct vfio_dma *dma; > > > > > > > dma_addr_t iova; > > > > > > > + bool cache_flush_required; > > > > > > > > > > > > > > dma = rb_entry(n, struct vfio_dma, node); > > > > > > > iova = dma->iova; > > > > > > > + cache_flush_required = !domain->enforce_cache_coherency && > > > > > > > + !dma->cache_flush_required; > > > > > > > + if (cache_flush_required) > > > > > > > + dma->cache_flush_required = true; > > > > > > > > > > > > The variable name here isn't accurate and the logic is confusing. If > > > > > > the domain does not enforce coherency and the mapping is not tagged as > > > > > > requiring a cache flush, then we need to mark the mapping as requiring > > > > > > a cache flush. So the variable state is something more akin to > > > > > > set_cache_flush_required. But all we're saving with this is a > > > > > > redundant set if the mapping is already tagged as requiring a cache > > > > > > flush, so it could really be simplified to: > > > > > > > > > > > > dma->cache_flush_required = !domain->enforce_cache_coherency; > > > > > Sorry about the confusion. > > > > > > > > > > If dma->cache_flush_required is set to true by a domain not enforcing cache > > > > > coherency, we hope it will not be reset to false by a later attaching to domain > > > > > enforcing cache coherency due to the lazily flushing design. > > > > > > > > Right, ok, the vfio_dma objects are shared between domains so we never > > > > want to set 'dma->cache_flush_required = false' due to the addition of a > > > > 'domain->enforce_cache_coherent == true'. So this could be: > > > > > > > > if (!dma->cache_flush_required) > > > > dma->cache_flush_required = !domain->enforce_cache_coherency; > > > > > > Though this code is easier for understanding, it leads to unnecessary setting of > > > dma->cache_flush_required to false, given domain->enforce_cache_coherency is > > > true at the most time. > > > > I don't really see that as an issue, but the variable name originally > > chosen above, cache_flush_required, also doesn't convey that it's only > > attempting to set the value if it wasn't previously set and is now > > required by a noncoherent domain. > Agreed, the old name is too vague. > What about update_to_noncoherent_required? set_noncoherent? Thanks, Alex > Then in vfio_iommu_replay(), it's like > > update_to_noncoherent_required = !domain->enforce_cache_coherency && !dma->is_noncoherent; > if (update_to_noncoherent_required) > dma->is_noncoherent = true; > > ... > if (update_to_noncoherent_required) > arch_flush_cache_phys((phys, size); > > > > > > > > It might add more clarity to just name the mapping flag > > > > > > dma->mapped_noncoherent. 
> > > > > > > > > > The dma->cache_flush_required is to mark whether pages in a vfio_dma requires > > > > > cache flush in the subsequence mapping into the first non-coherent domain > > > > > and page unpinning. > > > > > > > > How do we arrive at a sequence where we have dma->cache_flush_required > > > > that isn't the result of being mapped into a domain with > > > > !domain->enforce_cache_coherency? > > > Hmm, dma->cache_flush_required IS the result of being mapped into a domain with > > > !domain->enforce_cache_coherency. > > > My concern only arrives from the actual code sequence, i.e. > > > dma->cache_flush_required is set to true before the actual mapping. > > > > > > If we rename it to dma->mapped_noncoherent and only set it to true after the > > > actual successful mapping, it would lead to more code to handle flushing for the > > > unwind case. > > > Currently, flush for unwind is handled centrally in vfio_unpin_pages_remote() > > > by checking dma->cache_flush_required, which is true even before a full > > > successful mapping, so we won't miss flush on any pages that are mapped into a > > > non-coherent domain in a short window. > > > > I don't think we need to be so literal that "mapped_noncoherent" can > > only be set after the vfio_dma is fully mapped to a noncoherent domain, > > but also we can come up with other names for the flag. Perhaps > > "is_noncoherent". My suggestion was more from the perspective of what > > does the flag represent rather than what we intend to do as a result of > > the flag being set. Thanks, > Makes sense! > I like the name "is_noncoherent" :) >
On Thu, May 16, 2024 at 10:44:42PM -0600, Alex Williamson wrote: > On Fri, 17 May 2024 11:11:48 +0800 > Yan Zhao <yan.y.zhao@intel.com> wrote: > > > On Thu, May 16, 2024 at 02:50:09PM -0600, Alex Williamson wrote: > > > On Mon, 13 May 2024 15:11:28 +0800 > > > Yan Zhao <yan.y.zhao@intel.com> wrote: > > > > > > > On Fri, May 10, 2024 at 10:57:28AM -0600, Alex Williamson wrote: > > > > > On Fri, 10 May 2024 18:31:13 +0800 > > > > > Yan Zhao <yan.y.zhao@intel.com> wrote: > > > > > > > > > > > On Thu, May 09, 2024 at 12:10:49PM -0600, Alex Williamson wrote: > > > > > > > On Tue, 7 May 2024 14:21:38 +0800 > > > > > > > Yan Zhao <yan.y.zhao@intel.com> wrote: > > ... > > > > > > > > @@ -1683,9 +1715,14 @@ static int vfio_iommu_replay(struct vfio_iommu *iommu, > > > > > > > > for (; n; n = rb_next(n)) { > > > > > > > > struct vfio_dma *dma; > > > > > > > > dma_addr_t iova; > > > > > > > > + bool cache_flush_required; > > > > > > > > > > > > > > > > dma = rb_entry(n, struct vfio_dma, node); > > > > > > > > iova = dma->iova; > > > > > > > > + cache_flush_required = !domain->enforce_cache_coherency && > > > > > > > > + !dma->cache_flush_required; > > > > > > > > + if (cache_flush_required) > > > > > > > > + dma->cache_flush_required = true; > > > > > > > > > > > > > > The variable name here isn't accurate and the logic is confusing. If > > > > > > > the domain does not enforce coherency and the mapping is not tagged as > > > > > > > requiring a cache flush, then we need to mark the mapping as requiring > > > > > > > a cache flush. So the variable state is something more akin to > > > > > > > set_cache_flush_required. But all we're saving with this is a > > > > > > > redundant set if the mapping is already tagged as requiring a cache > > > > > > > flush, so it could really be simplified to: > > > > > > > > > > > > > > dma->cache_flush_required = !domain->enforce_cache_coherency; > > > > > > Sorry about the confusion. > > > > > > > > > > > > If dma->cache_flush_required is set to true by a domain not enforcing cache > > > > > > coherency, we hope it will not be reset to false by a later attaching to domain > > > > > > enforcing cache coherency due to the lazily flushing design. > > > > > > > > > > Right, ok, the vfio_dma objects are shared between domains so we never > > > > > want to set 'dma->cache_flush_required = false' due to the addition of a > > > > > 'domain->enforce_cache_coherent == true'. So this could be: > > > > > > > > > > if (!dma->cache_flush_required) > > > > > dma->cache_flush_required = !domain->enforce_cache_coherency; > > > > > > > > Though this code is easier for understanding, it leads to unnecessary setting of > > > > dma->cache_flush_required to false, given domain->enforce_cache_coherency is > > > > true at the most time. > > > > > > I don't really see that as an issue, but the variable name originally > > > chosen above, cache_flush_required, also doesn't convey that it's only > > > attempting to set the value if it wasn't previously set and is now > > > required by a noncoherent domain. > > Agreed, the old name is too vague. > > What about update_to_noncoherent_required? > > set_noncoherent? Thanks, > Concise! > > > Then in vfio_iommu_replay(), it's like > > > > update_to_noncoherent_required = !domain->enforce_cache_coherency && !dma->is_noncoherent; > > if (update_to_noncoherent_required) > > dma->is_noncoherent = true; > > > > ... 
> > if (update_to_noncoherent_required) > > arch_flush_cache_phys((phys, size); > > > > > > > > > > It might add more clarity to just name the mapping flag > > > > > > > dma->mapped_noncoherent. > > > > > > > > > > > > The dma->cache_flush_required is to mark whether pages in a vfio_dma requires > > > > > > cache flush in the subsequence mapping into the first non-coherent domain > > > > > > and page unpinning. > > > > > > > > > > How do we arrive at a sequence where we have dma->cache_flush_required > > > > > that isn't the result of being mapped into a domain with > > > > > !domain->enforce_cache_coherency? > > > > Hmm, dma->cache_flush_required IS the result of being mapped into a domain with > > > > !domain->enforce_cache_coherency. > > > > My concern only arrives from the actual code sequence, i.e. > > > > dma->cache_flush_required is set to true before the actual mapping. > > > > > > > > If we rename it to dma->mapped_noncoherent and only set it to true after the > > > > actual successful mapping, it would lead to more code to handle flushing for the > > > > unwind case. > > > > Currently, flush for unwind is handled centrally in vfio_unpin_pages_remote() > > > > by checking dma->cache_flush_required, which is true even before a full > > > > successful mapping, so we won't miss flush on any pages that are mapped into a > > > > non-coherent domain in a short window. > > > > > > I don't think we need to be so literal that "mapped_noncoherent" can > > > only be set after the vfio_dma is fully mapped to a noncoherent domain, > > > but also we can come up with other names for the flag. Perhaps > > > "is_noncoherent". My suggestion was more from the perspective of what > > > does the flag represent rather than what we intend to do as a result of > > > the flag being set. Thanks, > > Makes sense! > > I like the name "is_noncoherent" :) > > >
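Putting the agreed pieces together, the replay-path hunk might end up looking roughly like the fragment below. This is only an illustration of the naming discussed above (set_noncoherent for the local, is_noncoherent for the per-vfio_dma flag); arch_flush_cache_phys() is the helper name used in Yan's sketch (with the stray extra parenthesis dropped) and is assumed to belong to a later revision, not the posted patch:

	/* per-vfio_dma logic in vfio_iommu_replay(), sketch only */
	bool set_noncoherent = !domain->enforce_cache_coherency &&
			       !dma->is_noncoherent;

	if (set_noncoherent)
		dma->is_noncoherent = true;	/* sticky: never cleared by a later
						 * attach of a coherent domain */

	/* ... and for each chunk newly mapped into the non-coherent domain: */
	if (set_noncoherent)
		arch_flush_cache_phys(phys, size);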
On Thu, May 16, 2024 at 02:31:59PM -0600, Alex Williamson wrote: > Yes, exactly. Zero'ing the page would obviously reestablish the > coherency, but the page could be reallocated without being zero'd and as > you describe the owner of that page could then get inconsistent > results. I think if we care about the performance of this stuff enough to try and remove flushes we'd be better off figuring out how to disable no snoop in PCI config space and trust the device not to use it and avoid these flushes. iommu enforcement is nice, but at least ARM has been assuming that the PCI config space bit is sufficient. Intel/AMD are probably fine here as they will only flush for weird GPU cases, but I expect ARM is going to be unhappy. Jason
+Daniel > From: Jason Gunthorpe <jgg@nvidia.com> > Sent: Saturday, May 18, 2024 1:11 AM > > On Thu, May 16, 2024 at 02:31:59PM -0600, Alex Williamson wrote: > > > Yes, exactly. Zero'ing the page would obviously reestablish the > > coherency, but the page could be reallocated without being zero'd and as > > you describe the owner of that page could then get inconsistent > > results. > > I think if we care about the performance of this stuff enough to try > and remove flushes we'd be better off figuring out how to disable no > snoop in PCI config space and trust the device not to use it and avoid > these flushes. > > iommu enforcement is nice, but at least ARM has been assuming that the > PCI config space bit is sufficient. > > Intel/AMD are probably fine here as they will only flush for weird GPU > cases, but I expect ARM is going to be unhappy. > My impression was that the Intel GPU is not usable w/o non-coherent DMA, but I don't remember whether "unusable" means a functional breakage or a user-experience breakage. E.g. I vaguely recall that the display engine cannot afford high resolution/high refresh rate with snooped DMA, so the IOMMU dedicated to the GPU doesn't implement the force-snoop capability. Daniel, can you help explain the behavior of the Intel GPU in case nosnoop is disabled in the PCI config space? Overall it sounds like we are talking about different requirements. For the Intel GPU nosnoop is a must but it is not currently done securely, so we need to add a proper flush to fix it, while ARM doesn't seem to have a case which relies on nosnoop, so finding a way to disable it is more straightforward?
On Mon, May 20, 2024 at 02:52:43AM +0000, Tian, Kevin wrote: > +Daniel > > > From: Jason Gunthorpe <jgg@nvidia.com> > > Sent: Saturday, May 18, 2024 1:11 AM > > > > On Thu, May 16, 2024 at 02:31:59PM -0600, Alex Williamson wrote: > > > > > Yes, exactly. Zero'ing the page would obviously reestablish the > > > coherency, but the page could be reallocated without being zero'd and as > > > you describe the owner of that page could then get inconsistent > > > results. > > > > I think if we care about the performance of this stuff enough to try > > and remove flushes we'd be better off figuring out how to disable no > > snoop in PCI config space and trust the device not to use it and avoid > > these flushes. > > > > iommu enforcement is nice, but at least ARM has been assuming that the > > PCI config space bit is sufficient. > > > > Intel/AMD are probably fine here as they will only flush for weird GPU > > cases, but I expect ARM is going to be unhappy. > > > > My impression was that Intel GPU is not usable w/o non-coherent DMA, > but I don't remember whether it's unusable being a functional breakage > or a user experience breakage. e.g. I vaguely recalled that the display > engine cannot afford high resolution/high refresh rate using the snoop > way so the IOMMU dedicated for the GPU doesn't implement the force > snoop capability. > > Daniel, can you help explain the behavior of Intel GPU in case nosnoop > is disabled in the PCI config space? > > Overall it sounds that we are talking about different requirements. For > Intel GPU nosnoop is a must but it is not currently done securely so we > need add proper flush to fix it, while for ARM looks you don't have a > case which relies on nosnoop so finding a way to disable it is more > straightforward? Intel GPU weirdness should not leak into making other devices insecure/slow. If necessary Intel GPU only should get some variant override to keep no snoop working. It would make alot of good sense if VFIO made the default to disable no-snoop via the config space. Jason
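For reference, forcing the bit off from the host side only takes the standard config-space helpers; whether and where vfio-pci should actually do this by default is exactly what is being debated here, so the snippet below is a sketch of the proposal rather than existing vfio-pci behavior:

	#include <linux/pci.h>

	/*
	 * Sketch: stop a function from emitting no-snoop TLPs, assuming it
	 * honors PCI_EXP_DEVCTL_NOSNOOP_EN as the spec requires.
	 */
	static int disable_nosnoop(struct pci_dev *pdev)
	{
		return pcie_capability_clear_word(pdev, PCI_EXP_DEVCTL,
						  PCI_EXP_DEVCTL_NOSNOOP_EN);
	}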
On Tue, 21 May 2024 13:07:14 -0300 Jason Gunthorpe <jgg@nvidia.com> wrote: > On Mon, May 20, 2024 at 02:52:43AM +0000, Tian, Kevin wrote: > > +Daniel > > > > > From: Jason Gunthorpe <jgg@nvidia.com> > > > Sent: Saturday, May 18, 2024 1:11 AM > > > > > > On Thu, May 16, 2024 at 02:31:59PM -0600, Alex Williamson wrote: > > > > > > > Yes, exactly. Zero'ing the page would obviously reestablish the > > > > coherency, but the page could be reallocated without being zero'd and as > > > > you describe the owner of that page could then get inconsistent > > > > results. > > > > > > I think if we care about the performance of this stuff enough to try > > > and remove flushes we'd be better off figuring out how to disable no > > > snoop in PCI config space and trust the device not to use it and avoid > > > these flushes. > > > > > > iommu enforcement is nice, but at least ARM has been assuming that the > > > PCI config space bit is sufficient. > > > > > > Intel/AMD are probably fine here as they will only flush for weird GPU > > > cases, but I expect ARM is going to be unhappy. > > > > > > > My impression was that Intel GPU is not usable w/o non-coherent DMA, > > but I don't remember whether it's unusable being a functional breakage > > or a user experience breakage. e.g. I vaguely recalled that the display > > engine cannot afford high resolution/high refresh rate using the snoop > > way so the IOMMU dedicated for the GPU doesn't implement the force > > snoop capability. > > > > Daniel, can you help explain the behavior of Intel GPU in case nosnoop > > is disabled in the PCI config space? > > > > Overall it sounds that we are talking about different requirements. For > > Intel GPU nosnoop is a must but it is not currently done securely so we > > need add proper flush to fix it, while for ARM looks you don't have a > > case which relies on nosnoop so finding a way to disable it is more > > straightforward? > > Intel GPU weirdness should not leak into making other devices > insecure/slow. If necessary Intel GPU only should get some variant > override to keep no snoop working. > > It would make alot of good sense if VFIO made the default to disable > no-snoop via the config space. We can certainly virtualize the config space no-snoop enable bit, but I'm not sure what it actually accomplishes. We'd then be relying on the device to honor the bit and not have any backdoors to twiddle the bit otherwise (where we know that GPUs often have multiple paths to get to config space). We also then have the question of does the device function correctly if we disable no-snoop. The more secure approach might be that we need to do these cache flushes for any IOMMU that doesn't maintain coherency, even for no-snoop transactions. Thanks, Alex
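The flushes themselves come from the arch_clean_nonsnoop_dma() helper this patch calls but does not define. A minimal x86-flavoured sketch of what such a helper could look like, assuming the range is physically contiguous and covered by the kernel direct map (the real helper would also have to cope with pages that are not in the direct map):

	#include <linux/io.h>
	#include <asm/cacheflush.h>

	/* Sketch only: write back/invalidate CPU caches for a DMA range. */
	static void clean_nonsnoop_range(phys_addr_t phys, size_t size)
	{
		clflush_cache_range(phys_to_virt(phys), size);
	}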
On Tue, May 21, 2024 at 10:21:23AM -0600, Alex Williamson wrote: > > Intel GPU weirdness should not leak into making other devices > > insecure/slow. If necessary Intel GPU only should get some variant > > override to keep no snoop working. > > > > It would make alot of good sense if VFIO made the default to disable > > no-snoop via the config space. > > We can certainly virtualize the config space no-snoop enable bit, but > I'm not sure what it actually accomplishes. We'd then be relying on > the device to honor the bit and not have any backdoors to twiddle the > bit otherwise (where we know that GPUs often have multiple paths to get > to config space). I'm OK with this. If devices are insecure then they need quirks in vfio to disclose their problems, we shouldn't punish everyone who followed the spec because of some bad actors. But more broadly in a security engineered environment we can trust the no-snoop bit to work properly. > We also then have the question of does the device function > correctly if we disable no-snoop. Other than the GPU BW issue the no-snoop is not a functional behavior. > The more secure approach might be that we need to do these cache > flushes for any IOMMU that doesn't maintain coherency, even for > no-snoop transactions. Thanks, Did you mean 'even for snoop transactions'? That is where this series is, it assumes a no-snoop transaction took place even if that is impossible, because of config space, and then does pessimistic flushes. Jason
On Tue, 21 May 2024 13:34:00 -0300 Jason Gunthorpe <jgg@nvidia.com> wrote: > On Tue, May 21, 2024 at 10:21:23AM -0600, Alex Williamson wrote: > > > > Intel GPU weirdness should not leak into making other devices > > > insecure/slow. If necessary Intel GPU only should get some variant > > > override to keep no snoop working. > > > > > > It would make alot of good sense if VFIO made the default to disable > > > no-snoop via the config space. > > > > We can certainly virtualize the config space no-snoop enable bit, but > > I'm not sure what it actually accomplishes. We'd then be relying on > > the device to honor the bit and not have any backdoors to twiddle the > > bit otherwise (where we know that GPUs often have multiple paths to get > > to config space). > > I'm OK with this. If devices are insecure then they need quirks in > vfio to disclose their problems, we shouldn't punish everyone who > followed the spec because of some bad actors. > > But more broadly in a security engineered environment we can trust the > no-snoop bit to work properly. The spec has an interesting requirement on devices sending no-snoop transactions anyway (regarding PCI_EXP_DEVCTL_NOSNOOP_EN): "Even when this bit is Set, a Function is only permitted to Set the No Snoop attribute on a transaction when it can guarantee that the address of the transaction is not stored in any cache in the system." I wouldn't think the function itself has such visibility and it would leave the problem of reestablishing coherency to the driver, but am I overlooking something that implicitly makes this safe? ie. if the function isn't permitted to perform no-snoop to an address stored in cache, there's nothing we need to do here. > > We also then have the question of does the device function > > correctly if we disable no-snoop. > > Other than the GPU BW issue the no-snoop is not a functional behavior. As with some other config space bits though, I think we're kind of hoping for sloppy driver behavior to virtualize this. The spec does allow the bit to be hardwired to zero: "This bit is permitted to be hardwired to 0b if a Function would never Set the No Snoop attribute in transactions it initiates." But there's no capability bit that allows us to report whether the device supports no-snoop, we're just hoping that a driver writing to the bit doesn't generate a fault if the bit doesn't stick. For example the no-snoop bit in the TLP itself may only be a bandwidth issue, but if the driver thinks no-snoop support is enabled it may request the device use the attribute for a specific transaction and the device could fault if it cannot comply. > > The more secure approach might be that we need to do these cache > > flushes for any IOMMU that doesn't maintain coherency, even for > > no-snoop transactions. Thanks, > > Did you mean 'even for snoop transactions'? I was referring to IOMMUs that maintain coherency regardless of no-snoop transactions, ie domain->enforce_cache_coherency (ex. snoop control/SNP on Intel), so I meant as typed, the IOMMU maintaining coherency even for no-snoop transactions. That's essentially the case we expect and we don't need to virtualize no-snoop enable on the device. > That is where this series is, it assumes a no-snoop transaction took > place even if that is impossible, because of config space, and then > does pessimistic flushes. 
So are you proposing that we can trust devices to honor the PCI_EXP_DEVCTL_NOSNOOP_EN bit and virtualize it to be hardwired to zero on IOMMUs that do not enforce coherency as the entire solution? Or maybe we trap on setting the bit to make the flushing less pessimistic? Intel folks might be able to comment on the performance hit relative to iGPU assignment of denying the device the ability to use no-snoop transactions (assuming the device control bit is actually honored). The latency of flushing caches on touching no-snoop enable might be prohibitive in the latter case. Thanks, Alex
On Tue, May 21, 2024 at 12:19:45PM -0600, Alex Williamson wrote: > > I'm OK with this. If devices are insecure then they need quirks in > > vfio to disclose their problems, we shouldn't punish everyone who > > followed the spec because of some bad actors. > > > > But more broadly in a security engineered environment we can trust the > > no-snoop bit to work properly. > > The spec has an interesting requirement on devices sending no-snoop > transactions anyway (regarding PCI_EXP_DEVCTL_NOSNOOP_EN): > > "Even when this bit is Set, a Function is only permitted to Set the No > Snoop attribute on a transaction when it can guarantee that the > address of the transaction is not stored in any cache in the system." > > I wouldn't think the function itself has such visibility and it would > leave the problem of reestablishing coherency to the driver, but am I > overlooking something that implicitly makes this safe? I think it is just bad spec language! People are clearly using no-snoop on cachable memory today. The authors must have had some other usage in mind than what the industry actually did. > But there's no capability bit that allows us to report whether the > device supports no-snoop, we're just hoping that a driver writing to > the bit doesn't generate a fault if the bit doesn't stick. For example > the no-snoop bit in the TLP itself may only be a bandwidth issue, but > if the driver thinks no-snoop support is enabled it may request the > device use the attribute for a specific transaction and the device > could fault if it cannot comply. It could, but that is another wierdo quirk IMHO. We already see things in config space under hypervisor control because the VF don't have the bits :\ > > > The more secure approach might be that we need to do these cache > > > flushes for any IOMMU that doesn't maintain coherency, even for > > > no-snoop transactions. Thanks, > > > > Did you mean 'even for snoop transactions'? > > I was referring to IOMMUs that maintain coherency regardless of > no-snoop transactions, ie domain->enforce_cache_coherency (ex. snoop > control/SNP on Intel), so I meant as typed, the IOMMU maintaining > coherency even for no-snoop transactions. > > That's essentially the case we expect and we don't need to virtualize > no-snoop enable on the device. It is the most robust case to be sure, and then we don't need flushing. My point was we could extend the cases where we don't need to flush if we pay attention to, or virtualize, the PCI_EXP_DEVCTL_NOSNOOP_EN. > > That is where this series is, it assumes a no-snoop transaction took > > place even if that is impossible, because of config space, and then > > does pessimistic flushes. > > So are you proposing that we can trust devices to honor the > PCI_EXP_DEVCTL_NOSNOOP_EN bit and virtualize it to be hardwired to zero > on IOMMUs that do not enforce coherency as the entire solution? Maybe not entire, but as an additional step to reduce the cost of this. ARM would like this for instance. > Or maybe we trap on setting the bit to make the flushing less > pessimistic? Also a good idea. The VMM could then decide on policy. Jason
On Tue, May 21, 2024 at 01:34:00PM -0300, Jason Gunthorpe wrote: > On Tue, May 21, 2024 at 10:21:23AM -0600, Alex Williamson wrote: > > > > Intel GPU weirdness should not leak into making other devices > > > insecure/slow. If necessary Intel GPU only should get some variant > > > override to keep no snoop working. > > > > > > It would make alot of good sense if VFIO made the default to disable > > > no-snoop via the config space. > > > > We can certainly virtualize the config space no-snoop enable bit, but > > I'm not sure what it actually accomplishes. We'd then be relying on > > the device to honor the bit and not have any backdoors to twiddle the > > bit otherwise (where we know that GPUs often have multiple paths to get > > to config space). > > I'm OK with this. If devices are insecure then they need quirks in > vfio to disclose their problems, we shouldn't punish everyone who > followed the spec because of some bad actors. Does that mean a malicious device that does not honor the bit could read uninitialized host data?
On Tue, May 21, 2024 at 12:19:45PM -0600, Alex Williamson wrote: > On Tue, 21 May 2024 13:34:00 -0300 > Jason Gunthorpe <jgg@nvidia.com> wrote: > > > On Tue, May 21, 2024 at 10:21:23AM -0600, Alex Williamson wrote: > Intel folks might be able to comment on the performance hit relative to > iGPU assignment of denying the device the ability to use no-snoop > transactions (assuming the device control bit is actually honored). I don't have direct data for iGPU assignment, but I do have reference data for virtio GPU: when the backend GPU for a virtio GPU is an iGPU, following the non-coherent path can increase performance by 20%+ on some platforms. > The latency of flushing caches on touching no-snoop enable might be > prohibitive in the latter case. Thanks,
> From: Jason Gunthorpe <jgg@nvidia.com> > Sent: Wednesday, May 22, 2024 2:38 AM > > On Tue, May 21, 2024 at 12:19:45PM -0600, Alex Williamson wrote: > > > I'm OK with this. If devices are insecure then they need quirks in > > > vfio to disclose their problems, we shouldn't punish everyone who > > > followed the spec because of some bad actors. > > > > > > But more broadly in a security engineered environment we can trust the > > > no-snoop bit to work properly. > > > > The spec has an interesting requirement on devices sending no-snoop > > transactions anyway (regarding PCI_EXP_DEVCTL_NOSNOOP_EN): > > > > "Even when this bit is Set, a Function is only permitted to Set the No > > Snoop attribute on a transaction when it can guarantee that the > > address of the transaction is not stored in any cache in the system." > > > > I wouldn't think the function itself has such visibility and it would > > leave the problem of reestablishing coherency to the driver, but am I > > overlooking something that implicitly makes this safe? > > I think it is just bad spec language! People are clearly using > no-snoop on cachable memory today. The authors must have had some > other usage in mind than what the industry actually did. sure no-snoop can be used on cacheable memory but then the driver needs to flush the cache before triggering the no-snoop DMA so it still meets the spec "the address of the transaction is not stored in any cache in the system". but as Alex said the function itself has no such visibility so it's really a guarantee made by the driver. > > > That is where this series is, it assumes a no-snoop transaction took > > > place even if that is impossible, because of config space, and then > > > does pessimistic flushes. > > > > So are you proposing that we can trust devices to honor the > > PCI_EXP_DEVCTL_NOSNOOP_EN bit and virtualize it to be hardwired to > zero > > on IOMMUs that do not enforce coherency as the entire solution? > > Maybe not entire, but as an additional step to reduce the cost of > this. ARM would like this for instance. I searched PCI_EXP_DEVCTL_NOSNOOP_EN but surprisingly it's not touched by i915 driver. sort of suggesting that Intel GPU doesn't follow the spec to honor that bit... > > > Or maybe we trap on setting the bit to make the flushing less > > pessimistic? > > Also a good idea. The VMM could then decide on policy. > On Intel platform there is no pessimistic flush. Only Intel GPUs are exempted from IOMMU force snoop (either being lacking of the capability on the IOMMU dedicated to the GPU or having a special flag bit < REQ_WO_PASID_PGSNP_NOTALLOWED> in the ACPI structure for the IOMMU hosting many devices) to require the additional flushes in this series. We just need to avoid such flushes on other platforms e.g. ARM. I'm fine to do a special check in the attach path to enable the flush only for Intel GPU. or alternatively could ARM SMMU driver implement @enforce_cache_coherency by disabling PCI nosnoop cap when the SMMU itself cannot force snoop? Then VFIO/IOMMUFD could still check enforce_cache_coherency generally to apply the cache flush trick...
On Wed, May 22, 2024 at 11:24:20AM +0800, Yan Zhao wrote: > On Tue, May 21, 2024 at 01:34:00PM -0300, Jason Gunthorpe wrote: > > On Tue, May 21, 2024 at 10:21:23AM -0600, Alex Williamson wrote: > > > > > > Intel GPU weirdness should not leak into making other devices > > > > insecure/slow. If necessary Intel GPU only should get some variant > > > > override to keep no snoop working. > > > > > > > > It would make alot of good sense if VFIO made the default to disable > > > > no-snoop via the config space. > > > > > > We can certainly virtualize the config space no-snoop enable bit, but > > > I'm not sure what it actually accomplishes. We'd then be relying on > > > the device to honor the bit and not have any backdoors to twiddle the > > > bit otherwise (where we know that GPUs often have multiple paths to get > > > to config space). > > > > I'm OK with this. If devices are insecure then they need quirks in > > vfio to disclose their problems, we shouldn't punish everyone who > > followed the spec because of some bad actors. > Does that mean a malicous device that does not honor the bit could read > uninitialized host data? Yes, but a malicious device could also just do DMA with the PF RID and break everything. VFIO substantially trusts the device already, I'm not sure trusting it to do no-snoop blocking is a big reach. Jason
On Wed, May 22, 2024 at 06:24:14AM +0000, Tian, Kevin wrote: > > From: Jason Gunthorpe <jgg@nvidia.com> > > Sent: Wednesday, May 22, 2024 2:38 AM > > > > On Tue, May 21, 2024 at 12:19:45PM -0600, Alex Williamson wrote: > > > > I'm OK with this. If devices are insecure then they need quirks in > > > > vfio to disclose their problems, we shouldn't punish everyone who > > > > followed the spec because of some bad actors. > > > > > > > > But more broadly in a security engineered environment we can trust the > > > > no-snoop bit to work properly. > > > > > > The spec has an interesting requirement on devices sending no-snoop > > > transactions anyway (regarding PCI_EXP_DEVCTL_NOSNOOP_EN): > > > > > > "Even when this bit is Set, a Function is only permitted to Set the No > > > Snoop attribute on a transaction when it can guarantee that the > > > address of the transaction is not stored in any cache in the system." > > > > > > I wouldn't think the function itself has such visibility and it would > > > leave the problem of reestablishing coherency to the driver, but am I > > > overlooking something that implicitly makes this safe? > > > > I think it is just bad spec language! People are clearly using > > no-snoop on cachable memory today. The authors must have had some > > other usage in mind than what the industry actually did. > > sure no-snoop can be used on cacheable memory but then the driver > needs to flush the cache before triggering the no-snoop DMA so it > still meets the spec "the address of the transaction is not stored > in any cache in the system". Flush does not mean evict. The way I read the above, it is trying to say the driver must map all the memory non-cacheable to ensure it never gets pulled into a cache in the first place. > > Maybe not entire, but as an additional step to reduce the cost of > > this. ARM would like this for instance. > > I searched PCI_EXP_DEVCTL_NOSNOOP_EN but surprisingly it's not > touched by i915 driver. sort of suggesting that Intel GPU doesn't follow > the spec to honor that bit... Or the BIOS turns it on and the OS just leaves it. > I'm fine to do a special check in the attach path to enable the flush > only for Intel GPU. We already effectively do this by checking the domain capabilities. Only the Intel GPU will have a non-coherent domain. > or alternatively could ARM SMMU driver implement > @enforce_cache_coherency by disabling PCI nosnoop cap when > the SMMU itself cannot force snoop? Then VFIO/IOMMUFD could > still check enforce_cache_coherency generally to apply the cache > flush trick...
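For context, the "domain capabilities" check referred to here is what vfio_iommu_type1 already does at attach time (simplified below from the existing driver, not something added by this series); a domain is treated as non-coherent only when the IOMMU driver cannot promise to upgrade no-snoop DMA to snooped accesses:

	/* at group-attach time, roughly: */
	if (domain->domain->ops->enforce_cache_coherency)
		domain->enforce_cache_coherency =
			domain->domain->ops->enforce_cache_coherency(domain->domain);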
On Wed, 22 May 2024 09:29:39 -0300 Jason Gunthorpe <jgg@nvidia.com> wrote: > On Wed, May 22, 2024 at 06:24:14AM +0000, Tian, Kevin wrote: > > > From: Jason Gunthorpe <jgg@nvidia.com> > > > Sent: Wednesday, May 22, 2024 2:38 AM > > > > > > On Tue, May 21, 2024 at 12:19:45PM -0600, Alex Williamson wrote: > > > > > I'm OK with this. If devices are insecure then they need quirks in > > > > > vfio to disclose their problems, we shouldn't punish everyone who > > > > > followed the spec because of some bad actors. > > > > > > > > > > But more broadly in a security engineered environment we can trust the > > > > > no-snoop bit to work properly. > > > > > > > > The spec has an interesting requirement on devices sending no-snoop > > > > transactions anyway (regarding PCI_EXP_DEVCTL_NOSNOOP_EN): > > > > > > > > "Even when this bit is Set, a Function is only permitted to Set the No > > > > Snoop attribute on a transaction when it can guarantee that the > > > > address of the transaction is not stored in any cache in the system." > > > > > > > > I wouldn't think the function itself has such visibility and it would > > > > leave the problem of reestablishing coherency to the driver, but am I > > > > overlooking something that implicitly makes this safe? > > > > > > I think it is just bad spec language! People are clearly using > > > no-snoop on cachable memory today. The authors must have had some > > > other usage in mind than what the industry actually did. > > > > sure no-snoop can be used on cacheable memory but then the driver > > needs to flush the cache before triggering the no-snoop DMA so it > > still meets the spec "the address of the transaction is not stored > > in any cache in the system". > > Flush does not mean evict.. The way I read the above it is trying to > say the driver must map all the memory non-cachable to ensure it never > gets pulled into a cache in the first place. I think we should probably just fall back to your previous interpretation, it's bad spec language. It may not be possible to map the memory uncachable, it's a driver issue to sync the DMA as needed for coherency. > > > Maybe not entire, but as an additional step to reduce the cost of > > > this. ARM would like this for instance. > > > > I searched PCI_EXP_DEVCTL_NOSNOOP_EN but surprisingly it's not > > touched by i915 driver. sort of suggesting that Intel GPU doesn't follow > > the spec to honor that bit... > > Or the BIOS turns it on and the OS just leaves it.. This is kind of an unusual feature in that sense, the default value of PCI_EXP_DEVCTL_NOSNOOP_EN is enabled. It therefore might make sense that the i915 driver assumes that it can do no-snoop. The interesting case would be if it still does no-snoop if that bit were cleared prior to the driver binding or while the device is running. But I think this also means that regardless of virtualizing PCI_EXP_DEVCTL_NOSNOOP_EN, there will be momentary gaps around device resets where a device could legitimately perform no-snoop transactions. Thanks, Alex
On Wed, May 22, 2024 at 08:43:18AM -0600, Alex Williamson wrote: > But I think this also means that regardless of virtualizing > PCI_EXP_DEVCTL_NOSNOOP_EN, there will be momentary gaps around device > resets where a device could legitimately perform no-snoop > transactions. Isn't memory enable turned off after FLR? If not, do we have to turn it off before doing FLR? I'm not sure how a no-snoop could leak out around FLR? Jason
On Wed, 22 May 2024 13:52:21 -0300 Jason Gunthorpe <jgg@nvidia.com> wrote: > On Wed, May 22, 2024 at 08:43:18AM -0600, Alex Williamson wrote: > > > But I think this also means that regardless of virtualizing > > PCI_EXP_DEVCTL_NOSNOOP_EN, there will be momentary gaps around device > > resets where a device could legitimately perform no-snoop > > transactions. > > Isn't memory enable turned off after FLR? If not do we have to make it > off before doing FLR? > > I'm not sure how a no-snoop could leak out around FLR? Good point, modulo s/memory/bus master/. Yes, we'd likely need to make sure we enter pci_reset_function() with BM disabled so that we don't have an ordering issue between restoring the PCIe capability and the command register. Likewise no-snoop handling would need to avoid gaps around backdoor resets like we try to do when we're masking INTx support on the device (vfio_bar_restore). Thanks, Alex
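A sketch of the ordering being discussed, using the generic PCI helpers; the actual vfio-pci reset paths are more involved, so the helper name and placement here are purely illustrative:

	#include <linux/pci.h>

	/*
	 * Sketch: quiesce DMA (including any in-flight no-snoop writes) before
	 * the FLR, so nothing can race the later restore of COMMAND/DEVCTL.
	 */
	static int quiesce_and_flr(struct pci_dev *pdev)
	{
		pci_clear_master(pdev);		/* clear Bus Master Enable */
		return pcie_flr(pdev);		/* Function Level Reset */
	}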
> From: Jason Gunthorpe <jgg@nvidia.com> > Sent: Wednesday, May 22, 2024 8:30 PM > > On Wed, May 22, 2024 at 06:24:14AM +0000, Tian, Kevin wrote: > > I'm fine to do a special check in the attach path to enable the flush > > only for Intel GPU. > > We already effectively do this already by checking the domain > capabilities. Only the Intel GPU will have a non-coherent domain. > I'm confused. In earlier discussions you wanted to find a way not to punish others because of the non-coherent domain check, e.g. some ARM SMMUs cannot force snoop. Then you and Alex discussed the possibility of reducing pessimistic flushes by virtualizing the PCI NOSNOOP bit. With that in mind I was thinking whether we should explicitly enable this flush only for the Intel GPU instead of checking for a non-coherent domain in the attach path, since it's the only device with such a requirement. Did I misunderstand the concern here?
On Wed, May 22, 2024 at 11:26:21PM +0000, Tian, Kevin wrote: > > From: Jason Gunthorpe <jgg@nvidia.com> > > Sent: Wednesday, May 22, 2024 8:30 PM > > > > On Wed, May 22, 2024 at 06:24:14AM +0000, Tian, Kevin wrote: > > > I'm fine to do a special check in the attach path to enable the flush > > > only for Intel GPU. > > > > We already effectively do this already by checking the domain > > capabilities. Only the Intel GPU will have a non-coherent domain. > > > > I'm confused. In earlier discussions you wanted to find a way to not > publish others due to the check of non-coherent domain, e.g. some > ARM SMMU cannot force snoop. > > Then you and Alex discussed the possibility of reducing pessimistic > flushes by virtualizing the PCI NOSNOOP bit. > > With that in mind I was thinking whether we explicitly enable this > flush only for Intel GPU instead of checking non-coherent domain > in the attach path, since it's the only device with such requirement. I am suggesting to do both checks: - If the iommu domain indicates it has force coherency then leave PCI no-snoop alone and no flush - If the PCI NOSNOOP bit is or can be 0 then no flush - Otherwise flush I'm not sure there is a good reason to ignore the data we get from the iommu domain that it enforces coherency? Jason
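Taken together, the three checks could be packaged as something like the hypothetical helper below. Note that vfio_iommu_type1 is bus-agnostic, so getting a struct pci_dev down to this layer is itself an open question, and the second check glosses over the later discussion about when the bit can be trusted:

	#include <linux/pci.h>

	/* Hypothetical: decide whether a mapping needs CPU cache flushes. */
	static bool needs_cache_flush(struct vfio_domain *domain, struct pci_dev *pdev)
	{
		u16 devctl;

		/* 1) IOMMU upgrades no-snoop TLPs to snooped accesses: no flush. */
		if (domain->enforce_cache_coherency)
			return false;

		/* 2) PCI function with No Snoop Enable clear (or hardwired to 0):
		 *    a spec-compliant device will not emit no-snoop TLPs. */
		if (pdev &&
		    !pcie_capability_read_word(pdev, PCI_EXP_DEVCTL, &devctl) &&
		    !(devctl & PCI_EXP_DEVCTL_NOSNOOP_EN))
			return false;

		/* 3) Otherwise assume non-coherent DMA can happen. */
		return true;
	}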
> From: Jason Gunthorpe <jgg@nvidia.com> > Sent: Thursday, May 23, 2024 7:32 AM > > On Wed, May 22, 2024 at 11:26:21PM +0000, Tian, Kevin wrote: > > > From: Jason Gunthorpe <jgg@nvidia.com> > > > Sent: Wednesday, May 22, 2024 8:30 PM > > > > > > On Wed, May 22, 2024 at 06:24:14AM +0000, Tian, Kevin wrote: > > > > I'm fine to do a special check in the attach path to enable the flush > > > > only for Intel GPU. > > > > > > We already effectively do this already by checking the domain > > > capabilities. Only the Intel GPU will have a non-coherent domain. > > > > > > > I'm confused. In earlier discussions you wanted to find a way to not > > publish others due to the check of non-coherent domain, e.g. some > > ARM SMMU cannot force snoop. > > > > Then you and Alex discussed the possibility of reducing pessimistic > > flushes by virtualizing the PCI NOSNOOP bit. > > > > With that in mind I was thinking whether we explicitly enable this > > flush only for Intel GPU instead of checking non-coherent domain > > in the attach path, since it's the only device with such requirement. > > I am suggesting to do both checks: > - If the iommu domain indicates it has force coherency then leave PCI > no-snoop alone and no flush > - If the PCI NOSNOOP bit is or can be 0 then no flush > - Otherwise flush How to judge whether PCI NOSNOOP can be 0? If following PCI spec it can always be set to 0 but then we break the requirement for Intel GPU. If we explicitly exempt Intel GPU in 2nd check then what'd be the value of doing that generic check?
On Wed, May 22, 2024 at 11:40:58PM +0000, Tian, Kevin wrote: > > From: Jason Gunthorpe <jgg@nvidia.com> > > Sent: Thursday, May 23, 2024 7:32 AM > > > > On Wed, May 22, 2024 at 11:26:21PM +0000, Tian, Kevin wrote: > > > > From: Jason Gunthorpe <jgg@nvidia.com> > > > > Sent: Wednesday, May 22, 2024 8:30 PM > > > > > > > > On Wed, May 22, 2024 at 06:24:14AM +0000, Tian, Kevin wrote: > > > > > I'm fine to do a special check in the attach path to enable the flush > > > > > only for Intel GPU. > > > > > > > > We already effectively do this already by checking the domain > > > > capabilities. Only the Intel GPU will have a non-coherent domain. > > > > > > > > > > I'm confused. In earlier discussions you wanted to find a way to not > > > publish others due to the check of non-coherent domain, e.g. some > > > ARM SMMU cannot force snoop. > > > > > > Then you and Alex discussed the possibility of reducing pessimistic > > > flushes by virtualizing the PCI NOSNOOP bit. > > > > > > With that in mind I was thinking whether we explicitly enable this > > > flush only for Intel GPU instead of checking non-coherent domain > > > in the attach path, since it's the only device with such requirement. > > > > I am suggesting to do both checks: > > - If the iommu domain indicates it has force coherency then leave PCI > > no-snoop alone and no flush > > - If the PCI NOSNOOP bit is or can be 0 then no flush > > - Otherwise flush > > How to judge whether PCI NOSNOOP can be 0? If following PCI spec > it can always be set to 0 but then we break the requirement for Intel > GPU. If we explicitly exempt Intel GPU in 2nd check then what'd be > the value of doing that generic check? Non-PCI environments still have this problem, and the first check does help them since we don't have PCI config space there. PCI can supply more information (no snoop impossible) and variant drivers can add in too (want no snoop) Jason
On Thu, 23 May 2024 11:58:48 -0300 Jason Gunthorpe <jgg@nvidia.com> wrote: > On Wed, May 22, 2024 at 11:40:58PM +0000, Tian, Kevin wrote: > > > From: Jason Gunthorpe <jgg@nvidia.com> > > > Sent: Thursday, May 23, 2024 7:32 AM > > > > > > On Wed, May 22, 2024 at 11:26:21PM +0000, Tian, Kevin wrote: > > > > > From: Jason Gunthorpe <jgg@nvidia.com> > > > > > Sent: Wednesday, May 22, 2024 8:30 PM > > > > > > > > > > On Wed, May 22, 2024 at 06:24:14AM +0000, Tian, Kevin wrote: > > > > > > I'm fine to do a special check in the attach path to enable the flush > > > > > > only for Intel GPU. > > > > > > > > > > We already effectively do this already by checking the domain > > > > > capabilities. Only the Intel GPU will have a non-coherent domain. > > > > > > > > > > > > > I'm confused. In earlier discussions you wanted to find a way to not > > > > publish others due to the check of non-coherent domain, e.g. some > > > > ARM SMMU cannot force snoop. > > > > > > > > Then you and Alex discussed the possibility of reducing pessimistic > > > > flushes by virtualizing the PCI NOSNOOP bit. > > > > > > > > With that in mind I was thinking whether we explicitly enable this > > > > flush only for Intel GPU instead of checking non-coherent domain > > > > in the attach path, since it's the only device with such requirement. > > > > > > I am suggesting to do both checks: > > > - If the iommu domain indicates it has force coherency then leave PCI > > > no-snoop alone and no flush > > > - If the PCI NOSNOOP bit is or can be 0 then no flush > > > - Otherwise flush > > > > How to judge whether PCI NOSNOOP can be 0? If following PCI spec > > it can always be set to 0 but then we break the requirement for Intel > > GPU. If we explicitly exempt Intel GPU in 2nd check then what'd be > > the value of doing that generic check? > > Non-PCI environments still have this problem, and the first check does > help them since we don't have PCI config space there. > > PCI can supply more information (no snoop impossible) and variant > drivers can add in too (want no snoop) I'm not sure I follow either. Since i915 doesn't set or test no-snoop enable, I think we need to assume drivers expect the reset value, so a device that supports no-snoop expects to use it, ie. we can't trap on no-snoop enable being set, the device is more likely to just operate with reduced performance if we surreptitiously clear the bit. The current proposal is to enable flushing based only on the domain enforcement of coherency. I think the augmentation is therefore that if the device is PCI and the no-snoop enable bit is zero after reset (indicating hardwired to zero), we also don't need to flush. I'm not sure the polarity of the variant drive statement above is correct. If the no-snoop enable bit is set after reset, we'd assume no-snoop is possible, so the variant driver would only need a way to indicate the device doesn't actually use no-snoop. For that it might just virtualize the no-snoop enable setting to vfio-pci-core. Thanks, Alex
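Because the PCIe-defined default of Enable No Snoop is 1b, reading the bit as 0 immediately after reset (before any driver writes DEVCTL) is a reasonable hint that it is hardwired to zero. A sketch of that probe, assuming it runs right after vfio resets the device:

	/* Sketch: detect a function whose No Snoop Enable is hardwired to 0. */
	static bool nosnoop_hardwired_zero(struct pci_dev *pdev)
	{
		u16 devctl;

		if (pcie_capability_read_word(pdev, PCI_EXP_DEVCTL, &devctl))
			return false;	/* no PCIe capability: stay pessimistic */

		return !(devctl & PCI_EXP_DEVCTL_NOSNOOP_EN);
	}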
> From: Alex Williamson <alex.williamson@redhat.com> > Sent: Friday, May 24, 2024 6:48 AM > > On Thu, 23 May 2024 11:58:48 -0300 > Jason Gunthorpe <jgg@nvidia.com> wrote: > > > On Wed, May 22, 2024 at 11:40:58PM +0000, Tian, Kevin wrote: > > > > From: Jason Gunthorpe <jgg@nvidia.com> > > > > Sent: Thursday, May 23, 2024 7:32 AM > > > > > > > > On Wed, May 22, 2024 at 11:26:21PM +0000, Tian, Kevin wrote: > > > > > > From: Jason Gunthorpe <jgg@nvidia.com> > > > > > > Sent: Wednesday, May 22, 2024 8:30 PM > > > > > > > > > > > > On Wed, May 22, 2024 at 06:24:14AM +0000, Tian, Kevin wrote: > > > > > > > I'm fine to do a special check in the attach path to enable the flush > > > > > > > only for Intel GPU. > > > > > > > > > > > > We already effectively do this already by checking the domain > > > > > > capabilities. Only the Intel GPU will have a non-coherent domain. > > > > > > > > > > > > > > > > I'm confused. In earlier discussions you wanted to find a way to not > > > > > publish others due to the check of non-coherent domain, e.g. some > > > > > ARM SMMU cannot force snoop. > > > > > > > > > > Then you and Alex discussed the possibility of reducing pessimistic > > > > > flushes by virtualizing the PCI NOSNOOP bit. > > > > > > > > > > With that in mind I was thinking whether we explicitly enable this > > > > > flush only for Intel GPU instead of checking non-coherent domain > > > > > in the attach path, since it's the only device with such requirement. > > > > > > > > I am suggesting to do both checks: > > > > - If the iommu domain indicates it has force coherency then leave PCI > > > > no-snoop alone and no flush > > > > - If the PCI NOSNOOP bit is or can be 0 then no flush > > > > - Otherwise flush > > > > > > How to judge whether PCI NOSNOOP can be 0? If following PCI spec > > > it can always be set to 0 but then we break the requirement for Intel > > > GPU. If we explicitly exempt Intel GPU in 2nd check then what'd be > > > the value of doing that generic check? > > > > Non-PCI environments still have this problem, and the first check does > > help them since we don't have PCI config space there. > > > > PCI can supply more information (no snoop impossible) and variant > > drivers can add in too (want no snoop) > > I'm not sure I follow either. Since i915 doesn't set or test no-snoop > enable, I think we need to assume drivers expect the reset value, so a > device that supports no-snoop expects to use it, ie. we can't trap on > no-snoop enable being set, the device is more likely to just operate > with reduced performance if we surreptitiously clear the bit. > > The current proposal is to enable flushing based only on the domain > enforcement of coherency. I think the augmentation is therefore that > if the device is PCI and the no-snoop enable bit is zero after reset > (indicating hardwired to zero), we also don't need to flush. > > I'm not sure the polarity of the variant drive statement above is > correct. If the no-snoop enable bit is set after reset, we'd assume > no-snoop is possible, so the variant driver would only need a way to > indicate the device doesn't actually use no-snoop. For that it might > just virtualize the no-snoop enable setting to vfio-pci-core. Thanks, > Yeah. 
I re-checked the uses of PCI_EXP_DEVCTL_NOSNOOP_EN and all references are about clearing the bit, echoing the point that a driver that wants to use nosnoop expects the reset value without doing an explicit set, so virtualizing the no-snoop enable bit makes more sense for catching the intention to 'clear'.
On Wed, May 22, 2024 at 09:26:05AM -0300, Jason Gunthorpe wrote: > On Wed, May 22, 2024 at 11:24:20AM +0800, Yan Zhao wrote: > > On Tue, May 21, 2024 at 01:34:00PM -0300, Jason Gunthorpe wrote: > > > On Tue, May 21, 2024 at 10:21:23AM -0600, Alex Williamson wrote: > > > > > > > > Intel GPU weirdness should not leak into making other devices > > > > > insecure/slow. If necessary Intel GPU only should get some variant > > > > > override to keep no snoop working. > > > > > > > > > > It would make alot of good sense if VFIO made the default to disable > > > > > no-snoop via the config space. > > > > > > > > We can certainly virtualize the config space no-snoop enable bit, but > > > > I'm not sure what it actually accomplishes. We'd then be relying on > > > > the device to honor the bit and not have any backdoors to twiddle the > > > > bit otherwise (where we know that GPUs often have multiple paths to get > > > > to config space). > > > > > > I'm OK with this. If devices are insecure then they need quirks in > > > vfio to disclose their problems, we shouldn't punish everyone who > > > followed the spec because of some bad actors. > > Does that mean a malicous device that does not honor the bit could read > > uninitialized host data? > > Yes, but a malicious device could also just do DMA with the PF RID and > break everything. VFIO substantially trusts the device already, I'm > not sure trusting it to do no-snoop blocking is a big reach. There are minor differences between the two trusts though. With no-snoop, the page is possible to be critical data previously used by kernel core. But with a fake RID, a malicious device can at least access memory allowed for other devices.
On Thu, May 23, 2024 at 04:47:53PM -0600, Alex Williamson wrote:
> > > > I am suggesting to do both checks:
> > > >  - If the iommu domain indicates it has force coherency then leave PCI
> > > >    no-snoop alone and no flush
> > > >  - If the PCI NOSNOOP bit is or can be 0 then no flush
> > > >  - Otherwise flush
> > >
> > > How to judge whether PCI NOSNOOP can be 0? If following the PCI spec
> > > it can always be set to 0 but then we break the requirement for Intel
> > > GPU. If we explicitly exempt Intel GPU in the 2nd check then what'd be
> > > the value of doing that generic check?
> >
> > Non-PCI environments still have this problem, and the first check does
> > help them since we don't have PCI config space there.
> >
> > PCI can supply more information (no snoop impossible) and variant
> > drivers can add in too (want no snoop)
>
> I'm not sure I follow either. Since i915 doesn't set or test no-snoop
> enable, I think we need to assume drivers expect the reset value, so a
> device that supports no-snoop expects to use it, ie. we can't trap on
> no-snoop enable being set, the device is more likely to just operate
> with reduced performance if we surreptitiously clear the bit.

I'm not sure I understand this paragraph?

> The current proposal is to enable flushing based only on the domain
> enforcement of coherency. I think the augmentation is therefore that
> if the device is PCI and the no-snoop enable bit is zero after reset
> (indicating hardwired to zero), we also don't need to flush.

Yes, that is a good additional starting point.

> I'm not sure the polarity of the variant driver statement above is
> correct. If the no-snoop enable bit is set after reset, we'd assume
> no-snoop is possible, so the variant driver would only need a way to
> indicate the device doesn't actually use no-snoop. For that it might
> just virtualize the no-snoop enable setting to vfio-pci-core. Thanks,

I wrote that with the idea that VFIO would always force no-snoop to 0.
The variant driver could opt out of this.

We could also do the reverse and leave no-snoop alone and have
something force it off.

The other issue to keep in mind is that if no-snoop is disabled when we
attach the domains we shouldn't allow the VMM to turn it on later.

Jason
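Putting the checks discussed above together (domain coherency first, then the PCI no-snoop enable reset value, otherwise flush), the decision could be sketched roughly as below. The helper and its parameters are hypothetical and only illustrate the ordering; they are not part of the posted patch, and nosnoop_en_at_reset is assumed to have been sampled from PCI_EXP_DEVCTL before anything could have cleared it:

#include <linux/types.h>

/* Hypothetical sketch of the flush decision, not actual VFIO code. */
static bool example_flush_needed(bool domain_enforces_coherency,
				 bool is_pci, bool nosnoop_en_at_reset)
{
	/* 1. The IOMMU force-snoops, so no-snoop TLPs are coherent anyway. */
	if (domain_enforces_coherency)
		return false;

	/* 2. PCI device whose No Snoop Enable bit is hardwired to zero. */
	if (is_pci && !nosnoop_en_at_reset)
		return false;

	/* 3. Otherwise be pessimistic and flush the CPU cache. */
	return true;
}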
diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index b5c15fe8f9fc..ce873f4220bf 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -74,6 +74,7 @@ struct vfio_iommu {
 	bool			v2;
 	bool			nesting;
 	bool			dirty_page_tracking;
+	bool			has_noncoherent_domain;
 	struct list_head	emulated_iommu_groups;
 };
 
@@ -99,6 +100,7 @@ struct vfio_dma {
 	unsigned long		*bitmap;
 	struct mm_struct	*mm;
 	size_t			locked_vm;
+	bool			cache_flush_required; /* For noncoherent domain */
 };
 
 struct vfio_batch {
@@ -716,6 +718,9 @@ static long vfio_unpin_pages_remote(struct vfio_dma *dma, dma_addr_t iova,
 	long unlocked = 0, locked = 0;
 	long i;
 
+	if (dma->cache_flush_required)
+		arch_clean_nonsnoop_dma(pfn << PAGE_SHIFT, npage << PAGE_SHIFT);
+
 	for (i = 0; i < npage; i++, iova += PAGE_SIZE) {
 		if (put_pfn(pfn++, dma->prot)) {
 			unlocked++;
@@ -1099,6 +1104,8 @@ static long vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma *dma,
 						    &iotlb_gather);
 	}
 
+	dma->cache_flush_required = false;
+
 	if (do_accounting) {
 		vfio_lock_acct(dma, -unlocked, true);
 		return 0;
@@ -1120,6 +1127,21 @@ static void vfio_remove_dma(struct vfio_iommu *iommu, struct vfio_dma *dma)
 	iommu->dma_avail++;
 }
 
+static void vfio_update_noncoherent_domain_state(struct vfio_iommu *iommu)
+{
+	struct vfio_domain *domain;
+	bool has_noncoherent = false;
+
+	list_for_each_entry(domain, &iommu->domain_list, next) {
+		if (domain->enforce_cache_coherency)
+			continue;
+
+		has_noncoherent = true;
+		break;
+	}
+	iommu->has_noncoherent_domain = has_noncoherent;
+}
+
 static void vfio_update_pgsize_bitmap(struct vfio_iommu *iommu)
 {
 	struct vfio_domain *domain;
@@ -1455,6 +1477,12 @@ static int vfio_pin_map_dma(struct vfio_iommu *iommu, struct vfio_dma *dma,
 
 	vfio_batch_init(&batch);
 
+	/*
+	 * Record necessity to flush CPU cache to make sure CPU cache is flushed
+	 * for both pin & map and unmap & unpin (for unwind) paths.
+	 */
+	dma->cache_flush_required = iommu->has_noncoherent_domain;
+
 	while (size) {
 		/* Pin a contiguous chunk of memory */
 		npage = vfio_pin_pages_remote(dma, vaddr + dma->size,
@@ -1466,6 +1494,10 @@ static int vfio_pin_map_dma(struct vfio_iommu *iommu, struct vfio_dma *dma,
 			break;
 		}
 
+		if (dma->cache_flush_required)
+			arch_clean_nonsnoop_dma(pfn << PAGE_SHIFT,
+						npage << PAGE_SHIFT);
+
 		/* Map it! */
 		ret = vfio_iommu_map(iommu, iova + dma->size, pfn, npage,
 				     dma->prot);
@@ -1683,9 +1715,14 @@ static int vfio_iommu_replay(struct vfio_iommu *iommu,
 	for (; n; n = rb_next(n)) {
 		struct vfio_dma *dma;
 		dma_addr_t iova;
+		bool cache_flush_required;
 
 		dma = rb_entry(n, struct vfio_dma, node);
 		iova = dma->iova;
+		cache_flush_required = !domain->enforce_cache_coherency &&
+				       !dma->cache_flush_required;
+		if (cache_flush_required)
+			dma->cache_flush_required = true;
 
 		while (iova < dma->iova + dma->size) {
 			phys_addr_t phys;
@@ -1737,6 +1774,9 @@ static int vfio_iommu_replay(struct vfio_iommu *iommu,
 				size = npage << PAGE_SHIFT;
 			}
 
+			if (cache_flush_required)
+				arch_clean_nonsnoop_dma(phys, size);
+
 			ret = iommu_map(domain->domain, iova, phys, size,
 					dma->prot | IOMMU_CACHE,
 					GFP_KERNEL_ACCOUNT);
@@ -1801,6 +1841,7 @@ static int vfio_iommu_replay(struct vfio_iommu *iommu,
 			vfio_unpin_pages_remote(dma, iova, phys >> PAGE_SHIFT,
 						size >> PAGE_SHIFT, true);
 		}
+		dma->cache_flush_required = false;
 	}
 
 	vfio_batch_fini(&batch);
@@ -1828,6 +1869,9 @@ static void vfio_test_domain_fgsp(struct vfio_domain *domain, struct list_head *
 	if (!pages)
 		return;
 
+	if (!domain->enforce_cache_coherency)
+		arch_clean_nonsnoop_dma(page_to_phys(pages), PAGE_SIZE * 2);
+
 	list_for_each_entry(region, regions, list) {
 		start = ALIGN(region->start, PAGE_SIZE * 2);
 		if (start >= region->end || (region->end - start < PAGE_SIZE * 2))
@@ -1847,6 +1891,9 @@ static void vfio_test_domain_fgsp(struct vfio_domain *domain, struct list_head *
 			break;
 	}
 
+	if (!domain->enforce_cache_coherency)
+		arch_clean_nonsnoop_dma(page_to_phys(pages), PAGE_SIZE * 2);
+
 	__free_pages(pages, order);
 }
 
@@ -2308,6 +2355,8 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
 
 	list_add(&domain->next, &iommu->domain_list);
 	vfio_update_pgsize_bitmap(iommu);
+	if (!domain->enforce_cache_coherency)
+		vfio_update_noncoherent_domain_state(iommu);
 done:
 	/* Delete the old one and insert new iova list */
 	vfio_iommu_iova_insert_copy(iommu, &iova_copy);
@@ -2508,6 +2557,8 @@ static void vfio_iommu_type1_detach_group(void *iommu_data,
 		}
 		iommu_domain_free(domain->domain);
 		list_del(&domain->next);
+		if (!domain->enforce_cache_coherency)
+			vfio_update_noncoherent_domain_state(iommu);
 		kfree(domain);
 		vfio_iommu_aper_expand(iommu, &iova_copy);
 		vfio_update_pgsize_bitmap(iommu);
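arch_clean_nonsnoop_dma() is introduced by an earlier patch in this series and is not shown here. As a rough idea of the semantics the type1 code above relies on (write back the CPU cache lines covering a physical range), an x86-flavoured sketch might look like the snippet below; the function name and loop are illustrative assumptions, not the series' actual implementation:

#include <linux/highmem.h>
#include <linux/mm.h>
#include <linux/minmax.h>
#include <asm/cacheflush.h>

/* Illustrative only: write back cache lines covering [phys, phys + size). */
static void example_clean_nonsnoop_dma(phys_addr_t phys, size_t size)
{
	phys_addr_t end = phys + size;

	while (phys < end) {
		size_t len = min_t(size_t, end - phys,
				   PAGE_SIZE - offset_in_page(phys));
		void *va = kmap_local_pfn(phys >> PAGE_SHIFT);

		/* Flush the lines backing this page-bounded chunk. */
		clflush_cache_range(va + offset_in_page(phys), len);
		kunmap_local(va);
		phys += len;
	}
}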