[v2,2/2] vfio/pci: Use unmap_mapping_range()

Message ID 20240530045236.1005864-3-alex.williamson@redhat.com
State New, archived
Series vfio/pci: vfio device address space mapping

Commit Message

Alex Williamson May 30, 2024, 4:52 a.m. UTC
With the vfio device fd tied to the address space of the pseudo fs
inode, we can use the mm to track all vmas that might be mmap'ing
device BARs, which removes our vma_list and all the complicated lock
ordering necessary to manually zap each related vma.

Note that we can no longer store the pfn in vm_pgoff if we want to use
unmap_mapping_range() to zap a selective portion of the device fd
corresponding to BAR mappings.
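
For reference, vm_pgoff here retains the fixed region encoding of the
device fd offset used by vfio-pci (see include/linux/vfio_pci_core.h):

#define VFIO_PCI_OFFSET_SHIFT   40
#define VFIO_PCI_OFFSET_TO_INDEX(off)   ((off) >> VFIO_PCI_OFFSET_SHIFT)
#define VFIO_PCI_INDEX_TO_OFFSET(index) ((u64)(index) << VFIO_PCI_OFFSET_SHIFT)
#define VFIO_PCI_OFFSET_MASK    (((u64)(1) << VFIO_PCI_OFFSET_SHIFT) - 1)

Keeping this encoding intact is what allows vfio_pci_zap_bars() below
to express "all BAR mappings" as a single [start, start + len) range of
the inode's address space, while the pfn is instead recovered on demand
by vma_to_pfn().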

This also converts our mmap fault handler to use vmf_insert_pfn()
because we no longer have a vma_list to avoid the concurrency problem
with io_remap_pfn_range().  The goal is to eventually use the vm_ops
huge_fault handler to avoid the additional faulting overhead, but
vmf_insert_pfn_{pmd,pud}() need to learn about pfnmaps first.
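
As a rough, hypothetical sketch (not part of this series, and assuming
vmf_insert_pfn_pmd() gains support for raw pfnmap pfns), such a
huge_fault handler might look like:

static vm_fault_t vfio_pci_mmap_huge_fault(struct vm_fault *vmf,
					   unsigned int order)
{
	struct vm_area_struct *vma = vmf->vma;
	unsigned long pfn = vma_to_pfn(vma) + (vmf->pgoff - vma->vm_pgoff);

	/* Fall back to PTE faults for unaligned addresses or pfns */
	if (order && (vmf->address & ((PAGE_SIZE << order) - 1) ||
		      vmf->address + (PAGE_SIZE << order) > vma->vm_end ||
		      pfn & ((1UL << order) - 1)))
		return VM_FAULT_FALLBACK;

	/* plus the memory_lock/pm_runtime_engaged checks shown below */

	switch (order) {
	case 0:
		return vmf_insert_pfn(vma, vmf->address, pfn);
	case PMD_ORDER:
		return vmf_insert_pfn_pmd(vmf, pfn_to_pfn_t(pfn),
					  vmf->flags & FAULT_FLAG_WRITE);
	default:
		return VM_FAULT_FALLBACK;
	}
}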

Also, Jason notes that a race exists between unmap_mapping_range() and
the fops mmap callback if we were to call io_remap_pfn_range() to
populate the vma on mmap.  Specifically, mmap_region() does call_mmap()
before it does vma_link_file() which gives a window where the vma is
populated but invisible to unmap_mapping_range().
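
Concretely, the ordering inside mm/mmap.c:mmap_region() is roughly the
following (simplified, details vary by kernel version):

	vma->vm_file = get_file(file);
	error = call_mmap(file, vma);	/* fops->mmap could populate PTEs */
	...
	vma_link_file(vma);		/* vma becomes visible on the
					 * i_mapping rmap, i.e. to
					 * unmap_mapping_range(), only here */

Any PTEs installed by fops->mmap in that window could be missed by a
concurrent unmap_mapping_range(), so the vma must be populated
exclusively from the fault handler.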

Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
---
 drivers/vfio/pci/vfio_pci_core.c | 264 +++++++------------------------
 include/linux/vfio_pci_core.h    |   2 -
 2 files changed, 55 insertions(+), 211 deletions(-)

Comments

Yan Zhao May 31, 2024, 11:41 p.m. UTC | #1
On Wed, May 29, 2024 at 10:52:31PM -0600, Alex Williamson wrote:
> With the vfio device fd tied to the address space of the pseudo fs
> inode, we can use the mm to track all vmas that might be mmap'ing
> device BARs, which removes our vma_list and all the complicated lock
> ordering necessary to manually zap each related vma.
> 
> Note that we can no longer store the pfn in vm_pgoff if we want to use
> unmap_mapping_range() to zap a selective portion of the device fd
> corresponding to BAR mappings.
> 
> This also converts our mmap fault handler to use vmf_insert_pfn()
> because we no longer have a vma_list to avoid the concurrency problem
> with io_remap_pfn_range().  The goal is to eventually use the vm_ops
> huge_fault handler to avoid the additional faulting overhead, but
> vmf_insert_pfn_{pmd,pud}() need to learn about pfnmaps first.
>
Do we also want to consider a looped vmf_insert_pfn() in the mmap fault handler? e.g.
for (i = vma->vm_start; i < vma->vm_end; i += PAGE_SIZE) {
	offset = (i - vma->vm_start) >> PAGE_SHIFT;
	ret = vmf_insert_pfn(vma, i, base_pfn + offset);
	if (ret != VM_FAULT_NOPAGE) {
		zap_vma_ptes(vma, vma->vm_start, vma->vm_end - vma->vm_start);
		goto up_out;
	}
}

Thanks
Yan
Alex Williamson June 1, 2024, 5:18 a.m. UTC | #2
On Sat, 1 Jun 2024 07:41:27 +0800
Yan Zhao <yan.y.zhao@intel.com> wrote:

> On Wed, May 29, 2024 at 10:52:31PM -0600, Alex Williamson wrote:
> > With the vfio device fd tied to the address space of the pseudo fs
> > inode, we can use the mm to track all vmas that might be mmap'ing
> > device BARs, which removes our vma_list and all the complicated lock
> > ordering necessary to manually zap each related vma.
> > 
> > Note that we can no longer store the pfn in vm_pgoff if we want to use
> > unmap_mapping_range() to zap a selective portion of the device fd
> > corresponding to BAR mappings.
> > 
> > This also converts our mmap fault handler to use vmf_insert_pfn()
> > because we no longer have a vma_list to avoid the concurrency problem
> > with io_remap_pfn_range().  The goal is to eventually use the vm_ops
> > huge_fault handler to avoid the additional faulting overhead, but
> > vmf_insert_pfn_{pmd,pud}() need to learn about pfnmaps first.
> >  
> Do we also want to consider a looped vmf_insert_pfn() in the mmap fault handler? e.g.
> for (i = vma->vm_start; i < vma->vm_end; i += PAGE_SIZE) {
> 	offset = (i - vma->vm_start) >> PAGE_SHIFT;
> 	ret = vmf_insert_pfn(vma, i, base_pfn + offset);
> 	if (ret != VM_FAULT_NOPAGE) {
> 		zap_vma_ptes(vma, vma->vm_start, vma->vm_end - vma->vm_start);
> 		goto up_out;
> 	}
> }

We can have concurrent faults, so doing this means that we need to
continue to maintain a locked list of vmas that have faulted to avoid
duplicate insertions and all the nasty lock gymnastics that go along
with it.  I'd rather we just support huge_fault.  Thanks,

Alex
Yan Zhao June 4, 2024, 4:26 a.m. UTC | #3
On Fri, May 31, 2024 at 11:18:15PM -0600, Alex Williamson wrote:
> On Sat, 1 Jun 2024 07:41:27 +0800
> Yan Zhao <yan.y.zhao@intel.com> wrote:
> 
> > On Wed, May 29, 2024 at 10:52:31PM -0600, Alex Williamson wrote:
> > > With the vfio device fd tied to the address space of the pseudo fs
> > > inode, we can use the mm to track all vmas that might be mmap'ing
> > > device BARs, which removes our vma_list and all the complicated lock
> > > ordering necessary to manually zap each related vma.
> > > 
> > > Note that we can no longer store the pfn in vm_pgoff if we want to use
> > > unmap_mapping_range() to zap a selective portion of the device fd
> > > corresponding to BAR mappings.
> > > 
> > > This also converts our mmap fault handler to use vmf_insert_pfn()
> > > because we no longer have a vma_list to avoid the concurrency problem
> > > with io_remap_pfn_range().  The goal is to eventually use the vm_ops
> > > huge_fault handler to avoid the additional faulting overhead, but
> > > vmf_insert_pfn_{pmd,pud}() need to learn about pfnmaps first.
> > >  
> > Do we also want to consider a looped vmf_insert_pfn() in the mmap fault handler? e.g.
> > for (i = vma->vm_start; i < vma->vm_end; i += PAGE_SIZE) {
> > 	offset = (i - vma->vm_start) >> PAGE_SHIFT;
> > 	ret = vmf_insert_pfn(vma, i, base_pfn + offset);
> > 	if (ret != VM_FAULT_NOPAGE) {
> > 		zap_vma_ptes(vma, vma->vm_start, vma->vm_end - vma->vm_start);
> > 		goto up_out;
> > 	}
> > }
>
What about the prefault version? e.g.

        ret = vmf_insert_pfn(vma, vmf->address, pfn + pgoff);
+       if (ret & VM_FAULT_ERROR)
+               goto out_disabled;
+
+       /* prefault */
+       for (i = vma->vm_start; i < vma->vm_end; i += PAGE_SIZE, pfn++) {
+               if (i == vmf->address)
+                       continue;
+
+               /* Don't return error on prefault */
+               if (vmf_insert_pfn(vma, i, pfn) & VM_FAULT_ERROR)
+                       break;
+       }
+
 out_disabled:
        up_read(&vdev->memory_lock);


> We can have concurrent faults, so doing this means that we need to
> continue to maintain a locked list of vmas that have faulted to avoid
But it looks like vfio_pci_zap_bars() always zaps the full PCI BAR range
for vmas in core_vdev->inode->i_mapping.
So, it doesn't matter whether a vma is fully mapped or partly mapped?

> duplicate insertions and all the nasty lock gymnastics that go along
> with it.  I'd rather we just support huge_fault.  Thanks,
huge_fault is better. But before that lands, is this prefault approach good?
Alex Williamson June 4, 2024, 2:07 p.m. UTC | #4
On Tue, 4 Jun 2024 12:26:29 +0800
Yan Zhao <yan.y.zhao@intel.com> wrote:

> On Fri, May 31, 2024 at 11:18:15PM -0600, Alex Williamson wrote:
> > On Sat, 1 Jun 2024 07:41:27 +0800
> > Yan Zhao <yan.y.zhao@intel.com> wrote:
> >   
> > > On Wed, May 29, 2024 at 10:52:31PM -0600, Alex Williamson wrote:  
> > > > With the vfio device fd tied to the address space of the pseudo fs
> > > > inode, we can use the mm to track all vmas that might be mmap'ing
> > > > device BARs, which removes our vma_list and all the complicated lock
> > > > ordering necessary to manually zap each related vma.
> > > > 
> > > > Note that we can no longer store the pfn in vm_pgoff if we want to use
> > > > unmap_mapping_range() to zap a selective portion of the device fd
> > > > corresponding to BAR mappings.
> > > > 
> > > > This also converts our mmap fault handler to use vmf_insert_pfn()
> > > > because we no longer have a vma_list to avoid the concurrency problem
> > > > with io_remap_pfn_range().  The goal is to eventually use the vm_ops
> > > > huge_fault handler to avoid the additional faulting overhead, but
> > > > vmf_insert_pfn_{pmd,pud}() need to learn about pfnmaps first.
> > > >    
> > > Do we also want to consider a looped vmf_insert_pfn() in the mmap fault handler? e.g.
> > > for (i = vma->vm_start; i < vma->vm_end; i += PAGE_SIZE) {
> > > 	offset = (i - vma->vm_start) >> PAGE_SHIFT;
> > > 	ret = vmf_insert_pfn(vma, i, base_pfn + offset);
> > > 	if (ret != VM_FAULT_NOPAGE) {
> > > 		zap_vma_ptes(vma, vma->vm_start, vma->vm_end - vma->vm_start);
> > > 		goto up_out;
> > > 	}
> > > }  
> >  
> What about the prefault version? e.g.
> 
>         ret = vmf_insert_pfn(vma, vmf->address, pfn + pgoff);
> +       if (ret & VM_FAULT_ERROR)
> +               goto out_disabled;
> +
> +       /* prefault */
> +       for (i = vma->vm_start; i < vma->vm_end; i += PAGE_SIZE, pfn++) {
> +               if (i == vmf->address)
> +                       continue;
> +
> +               /* Don't return error on prefault */
> +               if (vmf_insert_pfn(vma, i, pfn) & VM_FAULT_ERROR)
> +                       break;
> +       }
> +
>  out_disabled:
>         up_read(&vdev->memory_lock);
> 
> 
> > We can have concurrent faults, so doing this means that we need to
> > continue to maintain a locked list of vmas that have faulted to avoid  
> But it looks like vfio_pci_zap_bars() always zaps the full PCI BAR range
> for vmas in core_vdev->inode->i_mapping.
> So, it doesn't matter whether a vma is fully mapped or partly mapped?

Yes, but without locking, concurrent faults would all be re-inserting
the pfns from the fault handler.

> > duplicate insertions and all the nasty lock gymnastics that go along
> > with it.  I'd rather we just support huge_fault.  Thanks,  
> huge_fault is better. But before that lands, is this prefault approach good?
> 

It seems like this would still be subject to the race that Jason noted
here[1], ie. call_mmap() occurs before vma_link_file(), therefore we
need to exclusively rely on fault to populate the vma or we risk a race
with unmap_mapping_range().  Thanks,

Alex


[1]https://lore.kernel.org/all/20240522183046.GG20229@nvidia.com/
Yan Zhao June 5, 2024, 2:22 a.m. UTC | #5
On Tue, Jun 04, 2024 at 08:07:28AM -0600, Alex Williamson wrote:
> On Tue, 4 Jun 2024 12:26:29 +0800
> Yan Zhao <yan.y.zhao@intel.com> wrote:
> 
> > On Fri, May 31, 2024 at 11:18:15PM -0600, Alex Williamson wrote:
> > > On Sat, 1 Jun 2024 07:41:27 +0800
> > > Yan Zhao <yan.y.zhao@intel.com> wrote:
> > >   
> > > > On Wed, May 29, 2024 at 10:52:31PM -0600, Alex Williamson wrote:  
> > > > > With the vfio device fd tied to the address space of the pseudo fs
> > > > > inode, we can use the mm to track all vmas that might be mmap'ing
> > > > > device BARs, which removes our vma_list and all the complicated lock
> > > > > ordering necessary to manually zap each related vma.
> > > > > 
> > > > > Note that we can no longer store the pfn in vm_pgoff if we want to use
> > > > > unmap_mapping_range() to zap a selective portion of the device fd
> > > > > corresponding to BAR mappings.
> > > > > 
> > > > > This also converts our mmap fault handler to use vmf_insert_pfn()
> > > > > because we no longer have a vma_list to avoid the concurrency problem
> > > > > with io_remap_pfn_range().  The goal is to eventually use the vm_ops
> > > > > huge_fault handler to avoid the additional faulting overhead, but
> > > > > vmf_insert_pfn_{pmd,pud}() need to learn about pfnmaps first.
> > > > >    
> > > > Do we also want to consider a looped vmf_insert_pfn() in the mmap fault handler? e.g.
> > > > for (i = vma->vm_start; i < vma->vm_end; i += PAGE_SIZE) {
> > > > 	offset = (i - vma->vm_start) >> PAGE_SHIFT;
> > > > 	ret = vmf_insert_pfn(vma, i, base_pfn + offset);
> > > > 	if (ret != VM_FAULT_NOPAGE) {
> > > > 		zap_vma_ptes(vma, vma->vm_start, vma->vm_end - vma->vm_start);
> > > > 		goto up_out;
> > > > 	}
> > > > }  
> > >  
> > What about the prefault version? e.g.
> > 
> >         ret = vmf_insert_pfn(vma, vmf->address, pfn + pgoff);
> > +       if (ret & VM_FAULT_ERROR)
> > +               goto out_disabled;
> > +
> > +       /* prefault */
> > +       for (i = vma->vm_start; i < vma->vm_end; i += PAGE_SIZE, pfn++) {
> > +               if (i == vmf->address)
> > +                       continue;
> > +
> > +               /* Don't return error on prefault */
> > +               if (vmf_insert_pfn(vma, i, pfn) & VM_FAULT_ERROR)
> > +                       break;
> > +       }
> > +
> >  out_disabled:
> >         up_read(&vdev->memory_lock);
> > 
> > 
> > > We can have concurrent faults, so doing this means that we need to
> > > continue to maintain a locked list of vmas that have faulted to avoid  
> > But it looks like vfio_pci_zap_bars() always zaps the full PCI BAR range
> > for vmas in core_vdev->inode->i_mapping.
> > So, it doesn't matter whether a vma is fully mapped or partly mapped?
> 
> Yes, but without locking, concurrent faults would all be re-inserting
> the pfns from the fault handler.
insert_pfn() takes the ptl via get_locked_pte(), and it returns
VM_FAULT_NOPAGE directly, without re-inserting, when it finds
!pte_none(), right?
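
For reference, the relevant path in mm/memory.c looks roughly like this
(abridged, exact code varies by kernel version):

static vm_fault_t insert_pfn(struct vm_area_struct *vma, unsigned long addr,
			     pfn_t pfn, pgprot_t prot, bool mkwrite)
{
	struct mm_struct *mm = vma->vm_mm;
	pte_t *pte, entry;
	spinlock_t *ptl;

	pte = get_locked_pte(mm, addr, &ptl);	/* takes the PTE lock */
	if (!pte)
		return VM_FAULT_OOM;
	entry = ptep_get(pte);
	if (!pte_none(entry)) {		/* pte already populated */
		if (mkwrite) {
			...
		}
		goto out_unlock;	/* no re-insert, benign no-op */
	}
	...
out_unlock:
	pte_unmap_unlock(pte, ptl);
	return VM_FAULT_NOPAGE;
}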

> 
> > > duplicate insertions and all the nasty lock gymnastics that go along
> > > with it.  I'd rather we just support huge_fault.  Thanks,  
> > huge_fault is better. But before that lands, is this prefault approach good?
> > 
> 
> It seems like this would still be subject to the race that Jason noted
> here[1], ie. call_mmap() occurs before vma_link_file(), therefore we
> need to exclusively rely on fault to populate the vma or we risk a race
> with unmap_mapping_range().  Thanks,
> 
Thanks. But the proposed prefault approach is also in the fault handler,
which runs after vma_link_file().
Similar implementations can be found in
drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c and
drivers/gpu/drm/vmwgfx/vmwgfx_ttm_glue.c.

> 
> 
> [1]https://lore.kernel.org/all/20240522183046.GG20229@nvidia.com/
>
Patch

diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index 80cae87fff36..db31c27bf78b 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -1610,100 +1610,20 @@  ssize_t vfio_pci_core_write(struct vfio_device *core_vdev, const char __user *bu
 }
 EXPORT_SYMBOL_GPL(vfio_pci_core_write);
 
-/* Return 1 on zap and vma_lock acquired, 0 on contention (only with @try) */
-static int vfio_pci_zap_and_vma_lock(struct vfio_pci_core_device *vdev, bool try)
+static void vfio_pci_zap_bars(struct vfio_pci_core_device *vdev)
 {
-	struct vfio_pci_mmap_vma *mmap_vma, *tmp;
+	struct vfio_device *core_vdev = &vdev->vdev;
+	loff_t start = VFIO_PCI_INDEX_TO_OFFSET(VFIO_PCI_BAR0_REGION_INDEX);
+	loff_t end = VFIO_PCI_INDEX_TO_OFFSET(VFIO_PCI_ROM_REGION_INDEX);
+	loff_t len = end - start;
 
-	/*
-	 * Lock ordering:
-	 * vma_lock is nested under mmap_lock for vm_ops callback paths.
-	 * The memory_lock semaphore is used by both code paths calling
-	 * into this function to zap vmas and the vm_ops.fault callback
-	 * to protect the memory enable state of the device.
-	 *
-	 * When zapping vmas we need to maintain the mmap_lock => vma_lock
-	 * ordering, which requires using vma_lock to walk vma_list to
-	 * acquire an mm, then dropping vma_lock to get the mmap_lock and
-	 * reacquiring vma_lock.  This logic is derived from similar
-	 * requirements in uverbs_user_mmap_disassociate().
-	 *
-	 * mmap_lock must always be the top-level lock when it is taken.
-	 * Therefore we can only hold the memory_lock write lock when
-	 * vma_list is empty, as we'd need to take mmap_lock to clear
-	 * entries.  vma_list can only be guaranteed empty when holding
-	 * vma_lock, thus memory_lock is nested under vma_lock.
-	 *
-	 * This enables the vm_ops.fault callback to acquire vma_lock,
-	 * followed by memory_lock read lock, while already holding
-	 * mmap_lock without risk of deadlock.
-	 */
-	while (1) {
-		struct mm_struct *mm = NULL;
-
-		if (try) {
-			if (!mutex_trylock(&vdev->vma_lock))
-				return 0;
-		} else {
-			mutex_lock(&vdev->vma_lock);
-		}
-		while (!list_empty(&vdev->vma_list)) {
-			mmap_vma = list_first_entry(&vdev->vma_list,
-						    struct vfio_pci_mmap_vma,
-						    vma_next);
-			mm = mmap_vma->vma->vm_mm;
-			if (mmget_not_zero(mm))
-				break;
-
-			list_del(&mmap_vma->vma_next);
-			kfree(mmap_vma);
-			mm = NULL;
-		}
-		if (!mm)
-			return 1;
-		mutex_unlock(&vdev->vma_lock);
-
-		if (try) {
-			if (!mmap_read_trylock(mm)) {
-				mmput(mm);
-				return 0;
-			}
-		} else {
-			mmap_read_lock(mm);
-		}
-		if (try) {
-			if (!mutex_trylock(&vdev->vma_lock)) {
-				mmap_read_unlock(mm);
-				mmput(mm);
-				return 0;
-			}
-		} else {
-			mutex_lock(&vdev->vma_lock);
-		}
-		list_for_each_entry_safe(mmap_vma, tmp,
-					 &vdev->vma_list, vma_next) {
-			struct vm_area_struct *vma = mmap_vma->vma;
-
-			if (vma->vm_mm != mm)
-				continue;
-
-			list_del(&mmap_vma->vma_next);
-			kfree(mmap_vma);
-
-			zap_vma_ptes(vma, vma->vm_start,
-				     vma->vm_end - vma->vm_start);
-		}
-		mutex_unlock(&vdev->vma_lock);
-		mmap_read_unlock(mm);
-		mmput(mm);
-	}
+	unmap_mapping_range(core_vdev->inode->i_mapping, start, len, true);
 }
 
 void vfio_pci_zap_and_down_write_memory_lock(struct vfio_pci_core_device *vdev)
 {
-	vfio_pci_zap_and_vma_lock(vdev, false);
 	down_write(&vdev->memory_lock);
-	mutex_unlock(&vdev->vma_lock);
+	vfio_pci_zap_bars(vdev);
 }
 
 u16 vfio_pci_memory_lock_and_enable(struct vfio_pci_core_device *vdev)
@@ -1725,99 +1645,41 @@  void vfio_pci_memory_unlock_and_restore(struct vfio_pci_core_device *vdev, u16 c
 	up_write(&vdev->memory_lock);
 }
 
-/* Caller holds vma_lock */
-static int __vfio_pci_add_vma(struct vfio_pci_core_device *vdev,
-			      struct vm_area_struct *vma)
-{
-	struct vfio_pci_mmap_vma *mmap_vma;
-
-	mmap_vma = kmalloc(sizeof(*mmap_vma), GFP_KERNEL_ACCOUNT);
-	if (!mmap_vma)
-		return -ENOMEM;
-
-	mmap_vma->vma = vma;
-	list_add(&mmap_vma->vma_next, &vdev->vma_list);
-
-	return 0;
-}
-
-/*
- * Zap mmaps on open so that we can fault them in on access and therefore
- * our vma_list only tracks mappings accessed since last zap.
- */
-static void vfio_pci_mmap_open(struct vm_area_struct *vma)
-{
-	zap_vma_ptes(vma, vma->vm_start, vma->vm_end - vma->vm_start);
-}
-
-static void vfio_pci_mmap_close(struct vm_area_struct *vma)
+static unsigned long vma_to_pfn(struct vm_area_struct *vma)
 {
 	struct vfio_pci_core_device *vdev = vma->vm_private_data;
-	struct vfio_pci_mmap_vma *mmap_vma;
+	int index = vma->vm_pgoff >> (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT);
+	u64 pgoff;
 
-	mutex_lock(&vdev->vma_lock);
-	list_for_each_entry(mmap_vma, &vdev->vma_list, vma_next) {
-		if (mmap_vma->vma == vma) {
-			list_del(&mmap_vma->vma_next);
-			kfree(mmap_vma);
-			break;
-		}
-	}
-	mutex_unlock(&vdev->vma_lock);
+	pgoff = vma->vm_pgoff &
+		((1U << (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT)) - 1);
+
+	return (pci_resource_start(vdev->pdev, index) >> PAGE_SHIFT) + pgoff;
 }
 
 static vm_fault_t vfio_pci_mmap_fault(struct vm_fault *vmf)
 {
 	struct vm_area_struct *vma = vmf->vma;
 	struct vfio_pci_core_device *vdev = vma->vm_private_data;
-	struct vfio_pci_mmap_vma *mmap_vma;
-	vm_fault_t ret = VM_FAULT_NOPAGE;
+	unsigned long pfn, pgoff = vmf->pgoff - vma->vm_pgoff;
+	vm_fault_t ret = VM_FAULT_SIGBUS;
 
-	mutex_lock(&vdev->vma_lock);
-	down_read(&vdev->memory_lock);
+	pfn = vma_to_pfn(vma);
 
-	/*
-	 * Memory region cannot be accessed if the low power feature is engaged
-	 * or memory access is disabled.
-	 */
-	if (vdev->pm_runtime_engaged || !__vfio_pci_memory_enabled(vdev)) {
-		ret = VM_FAULT_SIGBUS;
-		goto up_out;
-	}
+	down_read(&vdev->memory_lock);
 
-	/*
-	 * We populate the whole vma on fault, so we need to test whether
-	 * the vma has already been mapped, such as for concurrent faults
-	 * to the same vma.  io_remap_pfn_range() will trigger a BUG_ON if
-	 * we ask it to fill the same range again.
-	 */
-	list_for_each_entry(mmap_vma, &vdev->vma_list, vma_next) {
-		if (mmap_vma->vma == vma)
-			goto up_out;
-	}
+	if (vdev->pm_runtime_engaged || !__vfio_pci_memory_enabled(vdev))
+		goto out_disabled;
 
-	if (io_remap_pfn_range(vma, vma->vm_start, vma->vm_pgoff,
-			       vma->vm_end - vma->vm_start,
-			       vma->vm_page_prot)) {
-		ret = VM_FAULT_SIGBUS;
-		zap_vma_ptes(vma, vma->vm_start, vma->vm_end - vma->vm_start);
-		goto up_out;
-	}
+	ret = vmf_insert_pfn(vma, vmf->address, pfn + pgoff);
 
-	if (__vfio_pci_add_vma(vdev, vma)) {
-		ret = VM_FAULT_OOM;
-		zap_vma_ptes(vma, vma->vm_start, vma->vm_end - vma->vm_start);
-	}
-
-up_out:
+out_disabled:
 	up_read(&vdev->memory_lock);
-	mutex_unlock(&vdev->vma_lock);
+
 	return ret;
 }
 
 static const struct vm_operations_struct vfio_pci_mmap_ops = {
-	.open = vfio_pci_mmap_open,
-	.close = vfio_pci_mmap_close,
 	.fault = vfio_pci_mmap_fault,
 };
 
@@ -1880,11 +1742,12 @@  int vfio_pci_core_mmap(struct vfio_device *core_vdev, struct vm_area_struct *vma
 
 	vma->vm_private_data = vdev;
 	vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
-	vma->vm_pgoff = (pci_resource_start(pdev, index) >> PAGE_SHIFT) + pgoff;
+	vma->vm_page_prot = pgprot_decrypted(vma->vm_page_prot);
 
 	/*
-	 * See remap_pfn_range(), called from vfio_pci_fault() but we can't
-	 * change vm_flags within the fault handler.  Set them now.
+	 * Set vm_flags now, they should not be changed in the fault handler.
+	 * We want the same flags and page protection (decrypted above) as
+	 * io_remap_pfn_range() would set.
 	 *
 	 * VM_ALLOW_ANY_UNCACHED: The VMA flag is implemented for ARM64,
 	 * allowing KVM stage 2 device mapping attributes to use Normal-NC
@@ -2202,8 +2065,6 @@  int vfio_pci_core_init_dev(struct vfio_device *core_vdev)
 	mutex_init(&vdev->ioeventfds_lock);
 	INIT_LIST_HEAD(&vdev->dummy_resources_list);
 	INIT_LIST_HEAD(&vdev->ioeventfds_list);
-	mutex_init(&vdev->vma_lock);
-	INIT_LIST_HEAD(&vdev->vma_list);
 	INIT_LIST_HEAD(&vdev->sriov_pfs_item);
 	init_rwsem(&vdev->memory_lock);
 	xa_init(&vdev->ctx);
@@ -2219,7 +2080,6 @@  void vfio_pci_core_release_dev(struct vfio_device *core_vdev)
 
 	mutex_destroy(&vdev->igate);
 	mutex_destroy(&vdev->ioeventfds_lock);
-	mutex_destroy(&vdev->vma_lock);
 	kfree(vdev->region);
 	kfree(vdev->pm_save);
 }
@@ -2497,26 +2357,15 @@  static int vfio_pci_dev_set_pm_runtime_get(struct vfio_device_set *dev_set)
 	return ret;
 }
 
-/*
- * We need to get memory_lock for each device, but devices can share mmap_lock,
- * therefore we need to zap and hold the vma_lock for each device, and only then
- * get each memory_lock.
- */
 static int vfio_pci_dev_set_hot_reset(struct vfio_device_set *dev_set,
 				      struct vfio_pci_group_info *groups,
 				      struct iommufd_ctx *iommufd_ctx)
 {
-	struct vfio_pci_core_device *cur_mem;
-	struct vfio_pci_core_device *cur_vma;
-	struct vfio_pci_core_device *cur;
+	struct vfio_pci_core_device *vdev;
 	struct pci_dev *pdev;
-	bool is_mem = true;
 	int ret;
 
 	mutex_lock(&dev_set->lock);
-	cur_mem = list_first_entry(&dev_set->device_list,
-				   struct vfio_pci_core_device,
-				   vdev.dev_set_list);
 
 	pdev = vfio_pci_dev_set_resettable(dev_set);
 	if (!pdev) {
@@ -2533,7 +2382,7 @@  static int vfio_pci_dev_set_hot_reset(struct vfio_device_set *dev_set,
 	if (ret)
 		goto err_unlock;
 
-	list_for_each_entry(cur_vma, &dev_set->device_list, vdev.dev_set_list) {
+	list_for_each_entry(vdev, &dev_set->device_list, vdev.dev_set_list) {
 		bool owned;
 
 		/*
@@ -2557,38 +2406,38 @@  static int vfio_pci_dev_set_hot_reset(struct vfio_device_set *dev_set,
 		 * Otherwise, reset is not allowed.
 		 */
 		if (iommufd_ctx) {
-			int devid = vfio_iommufd_get_dev_id(&cur_vma->vdev,
+			int devid = vfio_iommufd_get_dev_id(&vdev->vdev,
 							    iommufd_ctx);
 
 			owned = (devid > 0 || devid == -ENOENT);
 		} else {
-			owned = vfio_dev_in_groups(&cur_vma->vdev, groups);
+			owned = vfio_dev_in_groups(&vdev->vdev, groups);
 		}
 
 		if (!owned) {
 			ret = -EINVAL;
-			goto err_undo;
+			break;
 		}
 
 		/*
-		 * Locking multiple devices is prone to deadlock, runaway and
-		 * unwind if we hit contention.
+		 * Take the memory write lock for each device and zap BAR
+		 * mappings to prevent the user accessing the device while in
+		 * reset.  Locking multiple devices is prone to deadlock,
+		 * runaway and unwind if we hit contention.
 		 */
-		if (!vfio_pci_zap_and_vma_lock(cur_vma, true)) {
+		if (!down_write_trylock(&vdev->memory_lock)) {
 			ret = -EBUSY;
-			goto err_undo;
+			break;
 		}
+
+		vfio_pci_zap_bars(vdev);
 	}
-	cur_vma = NULL;
 
-	list_for_each_entry(cur_mem, &dev_set->device_list, vdev.dev_set_list) {
-		if (!down_write_trylock(&cur_mem->memory_lock)) {
-			ret = -EBUSY;
-			goto err_undo;
-		}
-		mutex_unlock(&cur_mem->vma_lock);
+	if (!list_entry_is_head(vdev,
+				&dev_set->device_list, vdev.dev_set_list)) {
+		vdev = list_prev_entry(vdev, vdev.dev_set_list);
+		goto err_undo;
 	}
-	cur_mem = NULL;
 
 	/*
 	 * The pci_reset_bus() will reset all the devices in the bus.
@@ -2599,25 +2448,22 @@  static int vfio_pci_dev_set_hot_reset(struct vfio_device_set *dev_set,
 	 * cause the PCI config space reset without restoring the original
 	 * state (saved locally in 'vdev->pm_save').
 	 */
-	list_for_each_entry(cur, &dev_set->device_list, vdev.dev_set_list)
-		vfio_pci_set_power_state(cur, PCI_D0);
+	list_for_each_entry(vdev, &dev_set->device_list, vdev.dev_set_list)
+		vfio_pci_set_power_state(vdev, PCI_D0);
 
 	ret = pci_reset_bus(pdev);
 
+	vdev = list_last_entry(&dev_set->device_list,
+			       struct vfio_pci_core_device, vdev.dev_set_list);
+
 err_undo:
-	list_for_each_entry(cur, &dev_set->device_list, vdev.dev_set_list) {
-		if (cur == cur_mem)
-			is_mem = false;
-		if (cur == cur_vma)
-			break;
-		if (is_mem)
-			up_write(&cur->memory_lock);
-		else
-			mutex_unlock(&cur->vma_lock);
-	}
+	list_for_each_entry_from_reverse(vdev, &dev_set->device_list,
+					 vdev.dev_set_list)
+		up_write(&vdev->memory_lock);
+
+	list_for_each_entry(vdev, &dev_set->device_list, vdev.dev_set_list)
+		pm_runtime_put(&vdev->pdev->dev);
 
-	list_for_each_entry(cur, &dev_set->device_list, vdev.dev_set_list)
-		pm_runtime_put(&cur->pdev->dev);
 err_unlock:
 	mutex_unlock(&dev_set->lock);
 	return ret;
diff --git a/include/linux/vfio_pci_core.h b/include/linux/vfio_pci_core.h
index a2c8b8bba711..f87067438ed4 100644
--- a/include/linux/vfio_pci_core.h
+++ b/include/linux/vfio_pci_core.h
@@ -93,8 +93,6 @@  struct vfio_pci_core_device {
 	struct list_head		sriov_pfs_item;
 	struct vfio_pci_core_device	*sriov_pf_core_dev;
 	struct notifier_block	nb;
-	struct mutex		vma_lock;
-	struct list_head	vma_list;
 	struct rw_semaphore	memory_lock;
 };