Message ID | 20240530045236.1005864-3-alex.williamson@redhat.com (mailing list archive) |
---|---|
State | New, archived |
Series | vfio/pci: vfio device address space mapping |
On Wed, May 29, 2024 at 10:52:31PM -0600, Alex Williamson wrote:
> With the vfio device fd tied to the address space of the pseudo fs
> inode, we can use the mm to track all vmas that might be mmap'ing
> device BARs, which removes our vma_list and all the complicated lock
> ordering necessary to manually zap each related vma.
> 
> Note that we can no longer store the pfn in vm_pgoff if we want to use
> unmap_mapping_range() to zap a selective portion of the device fd
> corresponding to BAR mappings.
> 
> This also converts our mmap fault handler to use vmf_insert_pfn()
> because we no longer have a vma_list to avoid the concurrency problem
> with io_remap_pfn_range(). The goal is to eventually use the vm_ops
> huge_fault handler to avoid the additional faulting overhead, but
> vmf_insert_pfn_{pmd,pud}() need to learn about pfnmaps first.
> 
Do we also consider looped vmf_insert_pfn() in mmap fault handler? e.g.
for (i = vma->vm_start; i < vma->vm_end; i += PAGE_SIZE) {
	offset = (i - vma->vm_start) >> PAGE_SHIFT;
	ret = vmf_insert_pfn(vma, i, base_pfn + offset);
	if (ret != VM_FAULT_NOPAGE) {
		zap_vma_ptes(vma, vma->vm_start, vma->vm_end - vma->vm_start);
		goto up_out;
	}
}

Thanks
Yan
On Sat, 1 Jun 2024 07:41:27 +0800
Yan Zhao <yan.y.zhao@intel.com> wrote:

> On Wed, May 29, 2024 at 10:52:31PM -0600, Alex Williamson wrote:
> > With the vfio device fd tied to the address space of the pseudo fs
> > inode, we can use the mm to track all vmas that might be mmap'ing
> > device BARs, which removes our vma_list and all the complicated lock
> > ordering necessary to manually zap each related vma.
> > 
> > Note that we can no longer store the pfn in vm_pgoff if we want to use
> > unmap_mapping_range() to zap a selective portion of the device fd
> > corresponding to BAR mappings.
> > 
> > This also converts our mmap fault handler to use vmf_insert_pfn()
> > because we no longer have a vma_list to avoid the concurrency problem
> > with io_remap_pfn_range(). The goal is to eventually use the vm_ops
> > huge_fault handler to avoid the additional faulting overhead, but
> > vmf_insert_pfn_{pmd,pud}() need to learn about pfnmaps first.
> > 
> Do we also consider looped vmf_insert_pfn() in mmap fault handler? e.g.
> for (i = vma->vm_start; i < vma->vm_end; i += PAGE_SIZE) {
> 	offset = (i - vma->vm_start) >> PAGE_SHIFT;
> 	ret = vmf_insert_pfn(vma, i, base_pfn + offset);
> 	if (ret != VM_FAULT_NOPAGE) {
> 		zap_vma_ptes(vma, vma->vm_start, vma->vm_end - vma->vm_start);
> 		goto up_out;
> 	}
> }

We can have concurrent faults, so doing this means that we need to
continue to maintain a locked list of vmas that have faulted to avoid
duplicate insertions and all the nasty lock gymnastics that go along
with it.  I'd rather we just support huge_fault.  Thanks,

Alex
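For readers unfamiliar with the interface Alex refers to: vm_ops->huge_fault is handed the fault plus the order of the mapping being attempted, so a driver can install one PMD- or PUD-sized entry instead of, for example, 512 separate 4K faults for a 2M range on x86. Below is a rough sketch of the shape such a handler could take. It is illustrative only — as the commit message says, vmf_insert_pfn_pmd()/vmf_insert_pfn_pud() could not yet accept this kind of pfnmap at the time of this thread, and my_vma_base_pfn() is a made-up helper standing in for something like the vma_to_pfn() added by this patch.

/* Illustrative shape only -- not functional against this series. */
static vm_fault_t my_huge_fault(struct vm_fault *vmf, unsigned int order)
{
	struct vm_area_struct *vma = vmf->vma;
	unsigned long pfn = my_vma_base_pfn(vma) +		/* hypothetical */
			    (vmf->pgoff - vma->vm_pgoff);

	if (order == 0)
		return vmf_insert_pfn(vma, vmf->address, pfn);

#ifdef CONFIG_TRANSPARENT_HUGEPAGE
	/* PMD-sized mapping: address and pfn must both be PMD aligned. */
	if (order == PMD_SHIFT - PAGE_SHIFT &&
	    !(vmf->address & ~PMD_MASK) && !(pfn & ((1UL << order) - 1)))
		return vmf_insert_pfn_pmd(vmf, __pfn_to_pfn_t(pfn, PFN_DEV),
					  vmf->flags & FAULT_FLAG_WRITE);
#endif

	/* Anything else: let the core retry with a smaller order. */
	return VM_FAULT_FALLBACK;
}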
On Fri, May 31, 2024 at 11:18:15PM -0600, Alex Williamson wrote:
> On Sat, 1 Jun 2024 07:41:27 +0800
> Yan Zhao <yan.y.zhao@intel.com> wrote:
> 
> > On Wed, May 29, 2024 at 10:52:31PM -0600, Alex Williamson wrote:
> > > With the vfio device fd tied to the address space of the pseudo fs
> > > inode, we can use the mm to track all vmas that might be mmap'ing
> > > device BARs, which removes our vma_list and all the complicated lock
> > > ordering necessary to manually zap each related vma.
> > > 
> > > Note that we can no longer store the pfn in vm_pgoff if we want to use
> > > unmap_mapping_range() to zap a selective portion of the device fd
> > > corresponding to BAR mappings.
> > > 
> > > This also converts our mmap fault handler to use vmf_insert_pfn()
> > > because we no longer have a vma_list to avoid the concurrency problem
> > > with io_remap_pfn_range(). The goal is to eventually use the vm_ops
> > > huge_fault handler to avoid the additional faulting overhead, but
> > > vmf_insert_pfn_{pmd,pud}() need to learn about pfnmaps first.
> > > 
> > Do we also consider looped vmf_insert_pfn() in mmap fault handler? e.g.
> > for (i = vma->vm_start; i < vma->vm_end; i += PAGE_SIZE) {
> > 	offset = (i - vma->vm_start) >> PAGE_SHIFT;
> > 	ret = vmf_insert_pfn(vma, i, base_pfn + offset);
> > 	if (ret != VM_FAULT_NOPAGE) {
> > 		zap_vma_ptes(vma, vma->vm_start, vma->vm_end - vma->vm_start);
> > 		goto up_out;
> > 	}
> > }

What about the prefault version? e.g.

 	ret = vmf_insert_pfn(vma, vmf->address, pfn + pgoff);
+	if (ret & VM_FAULT_ERROR)
+		goto out_disabled;
+
+	/* prefault */
+	for (i = vma->vm_start; i < vma->vm_end; i += PAGE_SIZE, pfn++) {
+		if (i == vmf->address)
+			continue;
+
+		/* Don't return error on prefault */
+		if (vmf_insert_pfn(vma, i, pfn) & VM_FAULT_ERROR)
+			break;
+	}
+
 out_disabled:
 	up_read(&vdev->memory_lock);
> 
> We can have concurrent faults, so doing this means that we need to
> continue to maintain a locked list of vmas that have faulted to avoid
But looks vfio_pci_zap_bars() always zap full PCI BAR ranges for vmas in
core_vdev->inode->i_mapping.
So, it doesn't matter whether a vma is fully mapped or partly mapped?

> duplicate insertions and all the nasty lock gymnastics that go along
> with it.  I'd rather we just support huge_fault.  Thanks,
huge_fault is better. But before that, is this prefault one good?
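Spelled out against the fault handler introduced by this patch, Yan's suggestion amounts to roughly the sketch below — only a sketch, splicing the posted prefault loop into vfio_pci_mmap_fault() from this series and adding a loop variable i:

/* Sketch: this series' vfio_pci_mmap_fault() with the prefault loop spliced in. */
static vm_fault_t vfio_pci_mmap_fault(struct vm_fault *vmf)
{
	struct vm_area_struct *vma = vmf->vma;
	struct vfio_pci_core_device *vdev = vma->vm_private_data;
	unsigned long i, pfn, pgoff = vmf->pgoff - vma->vm_pgoff;
	vm_fault_t ret = VM_FAULT_SIGBUS;

	pfn = vma_to_pfn(vma);

	down_read(&vdev->memory_lock);

	if (vdev->pm_runtime_engaged || !__vfio_pci_memory_enabled(vdev))
		goto out_disabled;

	/* Fault in the page actually being accessed first. */
	ret = vmf_insert_pfn(vma, vmf->address, pfn + pgoff);
	if (ret & VM_FAULT_ERROR)
		goto out_disabled;

	/* Opportunistically prefault the rest of the BAR mapping. */
	for (i = vma->vm_start; i < vma->vm_end; i += PAGE_SIZE, pfn++) {
		if (i == vmf->address)
			continue;

		/* Don't return an error on prefault. */
		if (vmf_insert_pfn(vma, i, pfn) & VM_FAULT_ERROR)
			break;
	}

out_disabled:
	up_read(&vdev->memory_lock);

	return ret;
}

Alex's replies below raise two concerns with this: concurrent faults would all re-run the prefault loop, and whether it truly avoids the call_mmap() vs. vma_link_file() ordering race that Jason noted.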
On Tue, 4 Jun 2024 12:26:29 +0800
Yan Zhao <yan.y.zhao@intel.com> wrote:

> On Fri, May 31, 2024 at 11:18:15PM -0600, Alex Williamson wrote:
> > On Sat, 1 Jun 2024 07:41:27 +0800
> > Yan Zhao <yan.y.zhao@intel.com> wrote:
> > 
> > > On Wed, May 29, 2024 at 10:52:31PM -0600, Alex Williamson wrote:
> > > > With the vfio device fd tied to the address space of the pseudo fs
> > > > inode, we can use the mm to track all vmas that might be mmap'ing
> > > > device BARs, which removes our vma_list and all the complicated lock
> > > > ordering necessary to manually zap each related vma.
> > > > 
> > > > Note that we can no longer store the pfn in vm_pgoff if we want to use
> > > > unmap_mapping_range() to zap a selective portion of the device fd
> > > > corresponding to BAR mappings.
> > > > 
> > > > This also converts our mmap fault handler to use vmf_insert_pfn()
> > > > because we no longer have a vma_list to avoid the concurrency problem
> > > > with io_remap_pfn_range(). The goal is to eventually use the vm_ops
> > > > huge_fault handler to avoid the additional faulting overhead, but
> > > > vmf_insert_pfn_{pmd,pud}() need to learn about pfnmaps first.
> > > > 
> > > Do we also consider looped vmf_insert_pfn() in mmap fault handler? e.g.
> > > for (i = vma->vm_start; i < vma->vm_end; i += PAGE_SIZE) {
> > > 	offset = (i - vma->vm_start) >> PAGE_SHIFT;
> > > 	ret = vmf_insert_pfn(vma, i, base_pfn + offset);
> > > 	if (ret != VM_FAULT_NOPAGE) {
> > > 		zap_vma_ptes(vma, vma->vm_start, vma->vm_end - vma->vm_start);
> > > 		goto up_out;
> > > 	}
> > > }
> 
> What about the prefault version? e.g.
> 
> 	ret = vmf_insert_pfn(vma, vmf->address, pfn + pgoff);
> +	if (ret & VM_FAULT_ERROR)
> +		goto out_disabled;
> +
> +	/* prefault */
> +	for (i = vma->vm_start; i < vma->vm_end; i += PAGE_SIZE, pfn++) {
> +		if (i == vmf->address)
> +			continue;
> +
> +		/* Don't return error on prefault */
> +		if (vmf_insert_pfn(vma, i, pfn) & VM_FAULT_ERROR)
> +			break;
> +	}
> +
> out_disabled:
> 	up_read(&vdev->memory_lock);
> 
> 
> > We can have concurrent faults, so doing this means that we need to
> > continue to maintain a locked list of vmas that have faulted to avoid
> But looks vfio_pci_zap_bars() always zap full PCI BAR ranges for vmas in
> core_vdev->inode->i_mapping.
> So, it doesn't matter whether a vma is fully mapped or partly mapped?

Yes, but without locking concurrent faults would all be re-inserting
the pfns concurrently from the fault handler.

> > duplicate insertions and all the nasty lock gymnastics that go along
> > with it.  I'd rather we just support huge_fault.  Thanks,
> huge_fault is better. But before that, is this prefault one good?
> 

It seems like this would still be subject to the race that Jason noted
here[1], ie. call_mmap() occurs before vma_link_file(), therefore we
need to exclusively rely on fault to populate the vma or we risk a race
with unmap_mapping_range().  Thanks,

Alex

[1]https://lore.kernel.org/all/20240522183046.GG20229@nvidia.com/
On Tue, Jun 04, 2024 at 08:07:28AM -0600, Alex Williamson wrote:
> On Tue, 4 Jun 2024 12:26:29 +0800
> Yan Zhao <yan.y.zhao@intel.com> wrote:
> 
> > On Fri, May 31, 2024 at 11:18:15PM -0600, Alex Williamson wrote:
> > > On Sat, 1 Jun 2024 07:41:27 +0800
> > > Yan Zhao <yan.y.zhao@intel.com> wrote:
> > > 
> > > > On Wed, May 29, 2024 at 10:52:31PM -0600, Alex Williamson wrote:
> > > > > With the vfio device fd tied to the address space of the pseudo fs
> > > > > inode, we can use the mm to track all vmas that might be mmap'ing
> > > > > device BARs, which removes our vma_list and all the complicated lock
> > > > > ordering necessary to manually zap each related vma.
> > > > > 
> > > > > Note that we can no longer store the pfn in vm_pgoff if we want to use
> > > > > unmap_mapping_range() to zap a selective portion of the device fd
> > > > > corresponding to BAR mappings.
> > > > > 
> > > > > This also converts our mmap fault handler to use vmf_insert_pfn()
> > > > > because we no longer have a vma_list to avoid the concurrency problem
> > > > > with io_remap_pfn_range(). The goal is to eventually use the vm_ops
> > > > > huge_fault handler to avoid the additional faulting overhead, but
> > > > > vmf_insert_pfn_{pmd,pud}() need to learn about pfnmaps first.
> > > > > 
> > > > Do we also consider looped vmf_insert_pfn() in mmap fault handler? e.g.
> > > > for (i = vma->vm_start; i < vma->vm_end; i += PAGE_SIZE) {
> > > > 	offset = (i - vma->vm_start) >> PAGE_SHIFT;
> > > > 	ret = vmf_insert_pfn(vma, i, base_pfn + offset);
> > > > 	if (ret != VM_FAULT_NOPAGE) {
> > > > 		zap_vma_ptes(vma, vma->vm_start, vma->vm_end - vma->vm_start);
> > > > 		goto up_out;
> > > > 	}
> > > > }
> > 
> > What about the prefault version? e.g.
> > 
> > 	ret = vmf_insert_pfn(vma, vmf->address, pfn + pgoff);
> > +	if (ret & VM_FAULT_ERROR)
> > +		goto out_disabled;
> > +
> > +	/* prefault */
> > +	for (i = vma->vm_start; i < vma->vm_end; i += PAGE_SIZE, pfn++) {
> > +		if (i == vmf->address)
> > +			continue;
> > +
> > +		/* Don't return error on prefault */
> > +		if (vmf_insert_pfn(vma, i, pfn) & VM_FAULT_ERROR)
> > +			break;
> > +	}
> > +
> > out_disabled:
> > 	up_read(&vdev->memory_lock);
> > 
> > 
> > > We can have concurrent faults, so doing this means that we need to
> > > continue to maintain a locked list of vmas that have faulted to avoid
> > But looks vfio_pci_zap_bars() always zap full PCI BAR ranges for vmas in
> > core_vdev->inode->i_mapping.
> > So, it doesn't matter whether a vma is fully mapped or partly mapped?
> 
> Yes, but without locking concurrent faults would all be re-inserting
> the pfns concurrently from the fault handler.
insert_pfn() holds a ptl lock with get_locked_pte() and it will return
VM_FAULT_NOPAGE directly without re-inserting if !pte_none() is found,
right?

> 
> > > duplicate insertions and all the nasty lock gymnastics that go along
> > > with it.  I'd rather we just support huge_fault.  Thanks,
> > huge_fault is better. But before that, is this prefault one good?
> > 
> 
> It seems like this would still be subject to the race that Jason noted
> here[1], ie. call_mmap() occurs before vma_link_file(), therefore we
> need to exclusively rely on fault to populate the vma or we risk a race
> with unmap_mapping_range().  Thanks,
Thanks. But the proposed prefault way is also in the fault handler, which is
after the vma_link_file().
Similar implementations can be found at
drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
drivers/gpu/drm/vmwgfx/vmwgfx_ttm_glue.c.

> 
> [1]https://lore.kernel.org/all/20240522183046.GG20229@nvidia.com/
> 
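For context on the point Yan makes about insert_pfn(): the serialization being relied on lives inside the insert path itself. Very roughly — a heavily simplified paraphrase of insert_pfn() in mm/memory.c, not the actual source (the real function also handles write-notify, devmap pfns, and architecture details):

/* Simplified paraphrase of insert_pfn(): the per-page-table lock serializes
 * racing callers, and an already-populated PTE is simply left alone. */
static vm_fault_t insert_pfn_simplified(struct vm_area_struct *vma,
					unsigned long addr, unsigned long pfn,
					pgprot_t prot)
{
	spinlock_t *ptl;
	pte_t *pte;

	pte = get_locked_pte(vma->vm_mm, addr, &ptl);
	if (!pte)
		return VM_FAULT_OOM;

	/* Someone else already populated this PTE: nothing to do. */
	if (!pte_none(ptep_get(pte)))
		goto out_unlock;

	set_pte_at(vma->vm_mm, addr, pte,
		   pte_mkspecial(pfn_pte(pfn, prot)));
	update_mmu_cache(vma, addr, pte);

out_unlock:
	pte_unmap_unlock(pte, ptl);
	return VM_FAULT_NOPAGE;
}

So racing vmf_insert_pfn() callers are serialized on the page-table lock, and duplicates degrade into finding a populated PTE and returning VM_FAULT_NOPAGE — wasted work rather than corruption, which is the overhead Alex objects to alongside the mapping-ordering race.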
diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index 80cae87fff36..db31c27bf78b 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -1610,100 +1610,20 @@ ssize_t vfio_pci_core_write(struct vfio_device *core_vdev, const char __user *bu
 }
 EXPORT_SYMBOL_GPL(vfio_pci_core_write);
 
-/* Return 1 on zap and vma_lock acquired, 0 on contention (only with @try) */
-static int vfio_pci_zap_and_vma_lock(struct vfio_pci_core_device *vdev, bool try)
+static void vfio_pci_zap_bars(struct vfio_pci_core_device *vdev)
 {
-	struct vfio_pci_mmap_vma *mmap_vma, *tmp;
+	struct vfio_device *core_vdev = &vdev->vdev;
+	loff_t start = VFIO_PCI_INDEX_TO_OFFSET(VFIO_PCI_BAR0_REGION_INDEX);
+	loff_t end = VFIO_PCI_INDEX_TO_OFFSET(VFIO_PCI_ROM_REGION_INDEX);
+	loff_t len = end - start;
 
-	/*
-	 * Lock ordering:
-	 * vma_lock is nested under mmap_lock for vm_ops callback paths.
-	 * The memory_lock semaphore is used by both code paths calling
-	 * into this function to zap vmas and the vm_ops.fault callback
-	 * to protect the memory enable state of the device.
-	 *
-	 * When zapping vmas we need to maintain the mmap_lock => vma_lock
-	 * ordering, which requires using vma_lock to walk vma_list to
-	 * acquire an mm, then dropping vma_lock to get the mmap_lock and
-	 * reacquiring vma_lock.  This logic is derived from similar
-	 * requirements in uverbs_user_mmap_disassociate().
-	 *
-	 * mmap_lock must always be the top-level lock when it is taken.
-	 * Therefore we can only hold the memory_lock write lock when
-	 * vma_list is empty, as we'd need to take mmap_lock to clear
-	 * entries.  vma_list can only be guaranteed empty when holding
-	 * vma_lock, thus memory_lock is nested under vma_lock.
-	 *
-	 * This enables the vm_ops.fault callback to acquire vma_lock,
-	 * followed by memory_lock read lock, while already holding
-	 * mmap_lock without risk of deadlock.
-	 */
-	while (1) {
-		struct mm_struct *mm = NULL;
-
-		if (try) {
-			if (!mutex_trylock(&vdev->vma_lock))
-				return 0;
-		} else {
-			mutex_lock(&vdev->vma_lock);
-		}
-		while (!list_empty(&vdev->vma_list)) {
-			mmap_vma = list_first_entry(&vdev->vma_list,
-						    struct vfio_pci_mmap_vma,
-						    vma_next);
-			mm = mmap_vma->vma->vm_mm;
-			if (mmget_not_zero(mm))
-				break;
-
-			list_del(&mmap_vma->vma_next);
-			kfree(mmap_vma);
-			mm = NULL;
-		}
-		if (!mm)
-			return 1;
-		mutex_unlock(&vdev->vma_lock);
-
-		if (try) {
-			if (!mmap_read_trylock(mm)) {
-				mmput(mm);
-				return 0;
-			}
-		} else {
-			mmap_read_lock(mm);
-		}
-		if (try) {
-			if (!mutex_trylock(&vdev->vma_lock)) {
-				mmap_read_unlock(mm);
-				mmput(mm);
-				return 0;
-			}
-		} else {
-			mutex_lock(&vdev->vma_lock);
-		}
-		list_for_each_entry_safe(mmap_vma, tmp,
-					 &vdev->vma_list, vma_next) {
-			struct vm_area_struct *vma = mmap_vma->vma;
-
-			if (vma->vm_mm != mm)
-				continue;
-
-			list_del(&mmap_vma->vma_next);
-			kfree(mmap_vma);
-
-			zap_vma_ptes(vma, vma->vm_start,
-				     vma->vm_end - vma->vm_start);
-		}
-		mutex_unlock(&vdev->vma_lock);
-		mmap_read_unlock(mm);
-		mmput(mm);
-	}
+	unmap_mapping_range(core_vdev->inode->i_mapping, start, len, true);
 }
 
 void vfio_pci_zap_and_down_write_memory_lock(struct vfio_pci_core_device *vdev)
 {
-	vfio_pci_zap_and_vma_lock(vdev, false);
 	down_write(&vdev->memory_lock);
-	mutex_unlock(&vdev->vma_lock);
+	vfio_pci_zap_bars(vdev);
 }
 
 u16 vfio_pci_memory_lock_and_enable(struct vfio_pci_core_device *vdev)
@@ -1725,99 +1645,41 @@ void vfio_pci_memory_unlock_and_restore(struct vfio_pci_core_device *vdev, u16 c
 	up_write(&vdev->memory_lock);
 }
 
-/* Caller holds vma_lock */
-static int __vfio_pci_add_vma(struct vfio_pci_core_device *vdev,
-			      struct vm_area_struct *vma)
-{
-	struct vfio_pci_mmap_vma *mmap_vma;
-
-	mmap_vma = kmalloc(sizeof(*mmap_vma), GFP_KERNEL_ACCOUNT);
-	if (!mmap_vma)
-		return -ENOMEM;
-
-	mmap_vma->vma = vma;
-	list_add(&mmap_vma->vma_next, &vdev->vma_list);
-
-	return 0;
-}
-
-/*
- * Zap mmaps on open so that we can fault them in on access and therefore
- * our vma_list only tracks mappings accessed since last zap.
- */
-static void vfio_pci_mmap_open(struct vm_area_struct *vma)
-{
-	zap_vma_ptes(vma, vma->vm_start, vma->vm_end - vma->vm_start);
-}
-
-static void vfio_pci_mmap_close(struct vm_area_struct *vma)
+static unsigned long vma_to_pfn(struct vm_area_struct *vma)
 {
 	struct vfio_pci_core_device *vdev = vma->vm_private_data;
-	struct vfio_pci_mmap_vma *mmap_vma;
+	int index = vma->vm_pgoff >> (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT);
+	u64 pgoff;
 
-	mutex_lock(&vdev->vma_lock);
-	list_for_each_entry(mmap_vma, &vdev->vma_list, vma_next) {
-		if (mmap_vma->vma == vma) {
-			list_del(&mmap_vma->vma_next);
-			kfree(mmap_vma);
-			break;
-		}
-	}
-	mutex_unlock(&vdev->vma_lock);
+	pgoff = vma->vm_pgoff &
+		((1U << (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT)) - 1);
+
+	return (pci_resource_start(vdev->pdev, index) >> PAGE_SHIFT) + pgoff;
 }
 
 static vm_fault_t vfio_pci_mmap_fault(struct vm_fault *vmf)
 {
 	struct vm_area_struct *vma = vmf->vma;
 	struct vfio_pci_core_device *vdev = vma->vm_private_data;
-	struct vfio_pci_mmap_vma *mmap_vma;
-	vm_fault_t ret = VM_FAULT_NOPAGE;
+	unsigned long pfn, pgoff = vmf->pgoff - vma->vm_pgoff;
+	vm_fault_t ret = VM_FAULT_SIGBUS;
 
-	mutex_lock(&vdev->vma_lock);
-	down_read(&vdev->memory_lock);
+	pfn = vma_to_pfn(vma);
 
-	/*
-	 * Memory region cannot be accessed if the low power feature is engaged
-	 * or memory access is disabled.
-	 */
-	if (vdev->pm_runtime_engaged || !__vfio_pci_memory_enabled(vdev)) {
-		ret = VM_FAULT_SIGBUS;
-		goto up_out;
-	}
+	down_read(&vdev->memory_lock);
 
-	/*
-	 * We populate the whole vma on fault, so we need to test whether
-	 * the vma has already been mapped, such as for concurrent faults
-	 * to the same vma.  io_remap_pfn_range() will trigger a BUG_ON if
-	 * we ask it to fill the same range again.
-	 */
-	list_for_each_entry(mmap_vma, &vdev->vma_list, vma_next) {
-		if (mmap_vma->vma == vma)
-			goto up_out;
-	}
+	if (vdev->pm_runtime_engaged || !__vfio_pci_memory_enabled(vdev))
+		goto out_disabled;
 
-	if (io_remap_pfn_range(vma, vma->vm_start, vma->vm_pgoff,
-			       vma->vm_end - vma->vm_start,
-			       vma->vm_page_prot)) {
-		ret = VM_FAULT_SIGBUS;
-		zap_vma_ptes(vma, vma->vm_start, vma->vm_end - vma->vm_start);
-		goto up_out;
-	}
+	ret = vmf_insert_pfn(vma, vmf->address, pfn + pgoff);
 
-	if (__vfio_pci_add_vma(vdev, vma)) {
-		ret = VM_FAULT_OOM;
-		zap_vma_ptes(vma, vma->vm_start, vma->vm_end - vma->vm_start);
-	}
-
-up_out:
+out_disabled:
 	up_read(&vdev->memory_lock);
-	mutex_unlock(&vdev->vma_lock);
+
 	return ret;
 }
 
 static const struct vm_operations_struct vfio_pci_mmap_ops = {
-	.open = vfio_pci_mmap_open,
-	.close = vfio_pci_mmap_close,
 	.fault = vfio_pci_mmap_fault,
 };
 
@@ -1880,11 +1742,12 @@ int vfio_pci_core_mmap(struct vfio_device *core_vdev, struct vm_area_struct *vma
 
 	vma->vm_private_data = vdev;
 	vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
-	vma->vm_pgoff = (pci_resource_start(pdev, index) >> PAGE_SHIFT) + pgoff;
+	vma->vm_page_prot = pgprot_decrypted(vma->vm_page_prot);
 
 	/*
-	 * See remap_pfn_range(), called from vfio_pci_fault() but we can't
-	 * change vm_flags within the fault handler.  Set them now.
+	 * Set vm_flags now, they should not be changed in the fault handler.
+	 * We want the same flags and page protection (decrypted above) as
+	 * io_remap_pfn_range() would set.
 	 *
 	 * VM_ALLOW_ANY_UNCACHED: The VMA flag is implemented for ARM64,
 	 * allowing KVM stage 2 device mapping attributes to use Normal-NC
@@ -2202,8 +2065,6 @@ int vfio_pci_core_init_dev(struct vfio_device *core_vdev)
 	mutex_init(&vdev->ioeventfds_lock);
 	INIT_LIST_HEAD(&vdev->dummy_resources_list);
 	INIT_LIST_HEAD(&vdev->ioeventfds_list);
-	mutex_init(&vdev->vma_lock);
-	INIT_LIST_HEAD(&vdev->vma_list);
 	INIT_LIST_HEAD(&vdev->sriov_pfs_item);
 	init_rwsem(&vdev->memory_lock);
 	xa_init(&vdev->ctx);
@@ -2219,7 +2080,6 @@ void vfio_pci_core_release_dev(struct vfio_device *core_vdev)
 
 	mutex_destroy(&vdev->igate);
 	mutex_destroy(&vdev->ioeventfds_lock);
-	mutex_destroy(&vdev->vma_lock);
 	kfree(vdev->region);
 	kfree(vdev->pm_save);
 }
@@ -2497,26 +2357,15 @@ static int vfio_pci_dev_set_pm_runtime_get(struct vfio_device_set *dev_set)
 	return ret;
 }
 
-/*
- * We need to get memory_lock for each device, but devices can share mmap_lock,
- * therefore we need to zap and hold the vma_lock for each device, and only then
- * get each memory_lock.
- */
 static int vfio_pci_dev_set_hot_reset(struct vfio_device_set *dev_set,
 				      struct vfio_pci_group_info *groups,
 				      struct iommufd_ctx *iommufd_ctx)
 {
-	struct vfio_pci_core_device *cur_mem;
-	struct vfio_pci_core_device *cur_vma;
-	struct vfio_pci_core_device *cur;
+	struct vfio_pci_core_device *vdev;
 	struct pci_dev *pdev;
-	bool is_mem = true;
 	int ret;
 
 	mutex_lock(&dev_set->lock);
-	cur_mem = list_first_entry(&dev_set->device_list,
-				   struct vfio_pci_core_device,
-				   vdev.dev_set_list);
 
 	pdev = vfio_pci_dev_set_resettable(dev_set);
 	if (!pdev) {
@@ -2533,7 +2382,7 @@ static int vfio_pci_dev_set_hot_reset(struct vfio_device_set *dev_set,
 	if (ret)
 		goto err_unlock;
 
-	list_for_each_entry(cur_vma, &dev_set->device_list, vdev.dev_set_list) {
+	list_for_each_entry(vdev, &dev_set->device_list, vdev.dev_set_list) {
 		bool owned;
 
 		/*
@@ -2557,38 +2406,38 @@ static int vfio_pci_dev_set_hot_reset(struct vfio_device_set *dev_set,
 		 * Otherwise, reset is not allowed.
 		 */
 		if (iommufd_ctx) {
-			int devid = vfio_iommufd_get_dev_id(&cur_vma->vdev,
+			int devid = vfio_iommufd_get_dev_id(&vdev->vdev,
 							    iommufd_ctx);
 
 			owned = (devid > 0 || devid == -ENOENT);
 		} else {
-			owned = vfio_dev_in_groups(&cur_vma->vdev, groups);
+			owned = vfio_dev_in_groups(&vdev->vdev, groups);
 		}
 
 		if (!owned) {
			ret = -EINVAL;
-			goto err_undo;
+			break;
 		}
 
 		/*
-		 * Locking multiple devices is prone to deadlock, runaway and
-		 * unwind if we hit contention.
+		 * Take the memory write lock for each device and zap BAR
+		 * mappings to prevent the user accessing the device while in
+		 * reset.  Locking multiple devices is prone to deadlock,
+		 * runaway and unwind if we hit contention.
 		 */
-		if (!vfio_pci_zap_and_vma_lock(cur_vma, true)) {
+		if (!down_write_trylock(&vdev->memory_lock)) {
 			ret = -EBUSY;
-			goto err_undo;
+			break;
 		}
+
+		vfio_pci_zap_bars(vdev);
 	}
 
-	cur_vma = NULL;
-
-	list_for_each_entry(cur_mem, &dev_set->device_list, vdev.dev_set_list) {
-		if (!down_write_trylock(&cur_mem->memory_lock)) {
-			ret = -EBUSY;
-			goto err_undo;
-		}
-		mutex_unlock(&cur_mem->vma_lock);
+	if (!list_entry_is_head(vdev,
+				&dev_set->device_list, vdev.dev_set_list)) {
+		vdev = list_prev_entry(vdev, vdev.dev_set_list);
+		goto err_undo;
 	}
-	cur_mem = NULL;
 
 	/*
	 * The pci_reset_bus() will reset all the devices in the bus.
@@ -2599,25 +2448,22 @@ static int vfio_pci_dev_set_hot_reset(struct vfio_device_set *dev_set,
 	 * cause the PCI config space reset without restoring the original
 	 * state (saved locally in 'vdev->pm_save').
 	 */
-	list_for_each_entry(cur, &dev_set->device_list, vdev.dev_set_list)
-		vfio_pci_set_power_state(cur, PCI_D0);
+	list_for_each_entry(vdev, &dev_set->device_list, vdev.dev_set_list)
+		vfio_pci_set_power_state(vdev, PCI_D0);
 
 	ret = pci_reset_bus(pdev);
 
+	vdev = list_last_entry(&dev_set->device_list,
+			       struct vfio_pci_core_device, vdev.dev_set_list);
+
 err_undo:
-	list_for_each_entry(cur, &dev_set->device_list, vdev.dev_set_list) {
-		if (cur == cur_mem)
-			is_mem = false;
-		if (cur == cur_vma)
-			break;
-		if (is_mem)
-			up_write(&cur->memory_lock);
-		else
-			mutex_unlock(&cur->vma_lock);
-	}
+	list_for_each_entry_from_reverse(vdev, &dev_set->device_list,
+					 vdev.dev_set_list)
+		up_write(&vdev->memory_lock);
+
+	list_for_each_entry(vdev, &dev_set->device_list, vdev.dev_set_list)
+		pm_runtime_put(&vdev->pdev->dev);
 
-	list_for_each_entry(cur, &dev_set->device_list, vdev.dev_set_list)
-		pm_runtime_put(&cur->pdev->dev);
 err_unlock:
 	mutex_unlock(&dev_set->lock);
 	return ret;
diff --git a/include/linux/vfio_pci_core.h b/include/linux/vfio_pci_core.h
index a2c8b8bba711..f87067438ed4 100644
--- a/include/linux/vfio_pci_core.h
+++ b/include/linux/vfio_pci_core.h
@@ -93,8 +93,6 @@ struct vfio_pci_core_device {
 	struct list_head	sriov_pfs_item;
 	struct vfio_pci_core_device	*sriov_pf_core_dev;
 	struct notifier_block	nb;
-	struct mutex		vma_lock;
-	struct list_head	vma_list;
 	struct rw_semaphore	memory_lock;
 };
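A quick way to see what the new vma_to_pfn() and vfio_pci_zap_bars() are doing with offsets: the vfio-pci device fd encodes the region index in the high bits of the file offset (VFIO_PCI_OFFSET_SHIFT is 40 in vfio_pci_core.h, so VFIO_PCI_INDEX_TO_OFFSET(i) is i << 40), and because this patch leaves that encoded offset in vm_pgoff instead of overwriting it with the pfn, the fault handler can recover the pfn from the region's resource start plus the in-region page offset, while the zap path can unmap just the BAR0..ROM slice of the file. The stand-alone arithmetic below walks through a hypothetical example (BAR2 at physical 0xfe000000, mmap offset 0x3000 into the BAR, 4K pages); the constants mirror the uapi/header definitions but are restated here as assumptions.

#include <stdio.h>
#include <stdint.h>

#define PAGE_SHIFT		12	/* assumed 4K pages */
#define VFIO_PCI_OFFSET_SHIFT	40	/* assumed, per vfio_pci_core.h */
#define VFIO_PCI_INDEX_TO_OFFSET(i)	((uint64_t)(i) << VFIO_PCI_OFFSET_SHIFT)
#define VFIO_PCI_BAR0_REGION_INDEX	0	/* assumed, per uapi vfio.h */
#define VFIO_PCI_BAR2_REGION_INDEX	2
#define VFIO_PCI_ROM_REGION_INDEX	6

int main(void)
{
	/* Hypothetical device: BAR2 at 0xfe000000, user mmaps offset 0x3000 into it. */
	uint64_t bar2_phys = 0xfe000000ULL;
	uint64_t file_offset = VFIO_PCI_INDEX_TO_OFFSET(VFIO_PCI_BAR2_REGION_INDEX) + 0x3000;

	/* What the kernel sees in vma->vm_pgoff after mmap(). */
	uint64_t vm_pgoff = file_offset >> PAGE_SHIFT;

	/* vma_to_pfn(): split vm_pgoff into region index and in-region page offset. */
	unsigned int index = vm_pgoff >> (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT);
	uint64_t pgoff = vm_pgoff & ((1ULL << (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT)) - 1);
	uint64_t pfn = (bar2_phys >> PAGE_SHIFT) + pgoff;	/* index=2, pgoff=3, pfn=0xfe003 */

	printf("index=%u pgoff=0x%llx pfn=0x%llx\n",
	       index, (unsigned long long)pgoff, (unsigned long long)pfn);

	/* vfio_pci_zap_bars(): the BAR0..ROM slice of the device fd that gets unmapped. */
	uint64_t start = VFIO_PCI_INDEX_TO_OFFSET(VFIO_PCI_BAR0_REGION_INDEX);
	uint64_t len = VFIO_PCI_INDEX_TO_OFFSET(VFIO_PCI_ROM_REGION_INDEX) - start;

	printf("unmap_mapping_range(mapping, 0x%llx, 0x%llx, true)\n",
	       (unsigned long long)start, (unsigned long long)len);
	return 0;
}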