Message ID: 20230508125842.28193-1-yan.y.zhao@intel.com (mailing list archive)
State: New, archived
Series: vfio/pci: take mmap write lock for io_remap_pfn_range
On Mon, May 08, 2023 at 08:58:42PM +0800, Yan Zhao wrote:
> In VFIO type1, vaddr_get_pfns() will try to fault in MMIO PFNs after
> pin_user_pages_remote() returns -EFAULT.
>
> follow_fault_pfn
>  fixup_user_fault
>   handle_mm_fault
>    handle_mm_fault
>     do_fault
>      do_shared_fault
>       do_fault
>        __do_fault
>         vfio_pci_mmap_fault
>          io_remap_pfn_range
>           remap_pfn_range
>            track_pfn_remap
>             vm_flags_set ==> mmap_assert_write_locked(vma->vm_mm)
>            remap_pfn_range_notrack
>             vm_flags_set ==> mmap_assert_write_locked(vma->vm_mm)
>
> As io_remap_pfn_range() will call vm_flags_set() to update vm_flags [1],
> holding the mmap write lock is required.
> So, update vfio_pci_mmap_fault() to drop the mmap read lock and take the
> mmap write lock.
>
> [1] https://lkml.kernel.org/r/20230126193752.297968-3-surenb@google.com
>     commit bc292ab00f6c ("mm: introduce vma->vm_flags wrapper functions")
>     commit 1c71222e5f23
>     ("mm: replace vma->vm_flags direct modifications with modifier calls")
>
> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> ---
>  drivers/vfio/pci/vfio_pci_core.c | 17 +++++++++++++++++
>  1 file changed, 17 insertions(+)
>
> diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
> index a5ab416cf476..5082f89152b3 100644
> --- a/drivers/vfio/pci/vfio_pci_core.c
> +++ b/drivers/vfio/pci/vfio_pci_core.c
> @@ -1687,6 +1687,12 @@ static vm_fault_t vfio_pci_mmap_fault(struct vm_fault *vmf)
>  	struct vfio_pci_mmap_vma *mmap_vma;
>  	vm_fault_t ret = VM_FAULT_NOPAGE;
>
> +	mmap_assert_locked(vma->vm_mm);
> +	mmap_read_unlock(vma->vm_mm);
> +
> +	if (mmap_write_lock_killable(vma->vm_mm))
> +		return VM_FAULT_RETRY;

Certainly not..

I'm not sure how to resolve this properly, set the flags in advance?

The address space conversion?

Jason
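A minimal sketch of the "set the flags in advance" idea raised here,
assuming the flags can be fixed at mmap() time, where the syscall path
already holds the mmap write lock, so the fault handler never needs
vm_flags_set(). The function and vm_ops names are hypothetical, not the
actual vfio entry points; as the next reply notes, arch-specific state
like x86 VM_PAT, which only the io_remap_pfn_range()/track_pfn_remap()
path sets up, is what keeps this from being sufficient on its own.

#include <linux/fs.h>
#include <linux/mm.h>

static const struct vm_operations_struct sketch_vm_ops; /* .fault fills PTEs lazily */

static int sketch_mmap(struct file *filp, struct vm_area_struct *vma)
{
	/* the mmap() syscall path holds mmap_write_lock here, so
	 * vm_flags_set() is permitted; nothing in the fault path
	 * needs to touch vm_flags afterwards
	 */
	vm_flags_set(vma, VM_IO | VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP);
	vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
	vma->vm_ops = &sketch_vm_ops;
	return 0;
}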
On Mon, 8 May 2023 13:48:30 -0300
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Mon, May 08, 2023 at 08:58:42PM +0800, Yan Zhao wrote:
> > In VFIO type1, vaddr_get_pfns() will try to fault in MMIO PFNs after
> > pin_user_pages_remote() returns -EFAULT.
> >
> > follow_fault_pfn
> >  fixup_user_fault
> >   handle_mm_fault
> >    handle_mm_fault
> >     do_fault
> >      do_shared_fault
> >       do_fault
> >        __do_fault
> >         vfio_pci_mmap_fault
> >          io_remap_pfn_range
> >           remap_pfn_range
> >            track_pfn_remap
> >             vm_flags_set ==> mmap_assert_write_locked(vma->vm_mm)
> >            remap_pfn_range_notrack
> >             vm_flags_set ==> mmap_assert_write_locked(vma->vm_mm)
> >
> > As io_remap_pfn_range() will call vm_flags_set() to update vm_flags [1],
> > holding the mmap write lock is required.
> > So, update vfio_pci_mmap_fault() to drop the mmap read lock and take the
> > mmap write lock.
> >
> > [1] https://lkml.kernel.org/r/20230126193752.297968-3-surenb@google.com
> >     commit bc292ab00f6c ("mm: introduce vma->vm_flags wrapper functions")
> >     commit 1c71222e5f23
> >     ("mm: replace vma->vm_flags direct modifications with modifier calls")
> >
> > Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> > ---
> >  drivers/vfio/pci/vfio_pci_core.c | 17 +++++++++++++++++
> >  1 file changed, 17 insertions(+)
> >
> > diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
> > index a5ab416cf476..5082f89152b3 100644
> > --- a/drivers/vfio/pci/vfio_pci_core.c
> > +++ b/drivers/vfio/pci/vfio_pci_core.c
> > @@ -1687,6 +1687,12 @@ static vm_fault_t vfio_pci_mmap_fault(struct vm_fault *vmf)
> >  	struct vfio_pci_mmap_vma *mmap_vma;
> >  	vm_fault_t ret = VM_FAULT_NOPAGE;
> >
> > +	mmap_assert_locked(vma->vm_mm);
> > +	mmap_read_unlock(vma->vm_mm);
> > +
> > +	if (mmap_write_lock_killable(vma->vm_mm))
> > +		return VM_FAULT_RETRY;
>
> Certainly not..
>
> I'm not sure how to resolve this properly, set the flags in advance?
>
> The address space conversion?

We already try to set the flags in advance, but there are some
architectural flags like VM_PAT that make that tricky. Cedric has been
looking at inserting individual pages with vmf_insert_pfn(), but that
incurs a lot more faults and therefore latency vs remapping the entire
vma on fault. I'm not convinced that we shouldn't just attempt to
remove the fault handler entirely, but I haven't tried it yet to know
what gotchas are down that path. Thanks,

Alex
On Mon, May 08, 2023 at 02:57:15PM -0600, Alex Williamson wrote:

> We already try to set the flags in advance, but there are some
> architectural flags like VM_PAT that make that tricky. Cedric has been
> looking at inserting individual pages with vmf_insert_pfn(), but that
> incurs a lot more faults and therefore latency vs remapping the entire
> vma on fault. I'm not convinced that we shouldn't just attempt to
> remove the fault handler entirely, but I haven't tried it yet to know
> what gotchas are down that path. Thanks,

I thought we did it like this because there were races otherwise with
PTE insertion and zapping? I don't remember well anymore.

I vaguely remember the address_space conversion might help remove the
fault handler?

Jason
On Wed, May 10, 2023 at 05:41:06PM -0300, Jason Gunthorpe wrote:
> On Mon, May 08, 2023 at 02:57:15PM -0600, Alex Williamson wrote:
>
> > We already try to set the flags in advance, but there are some
> > architectural flags like VM_PAT that make that tricky. Cedric has been
> > looking at inserting individual pages with vmf_insert_pfn(), but that
> > incurs a lot more faults and therefore latency vs remapping the entire
> > vma on fault. I'm not convinced that we shouldn't just attempt to
> > remove the fault handler entirely, but I haven't tried it yet to know
> > what gotchas are down that path. Thanks,
>
> I thought we did it like this because there were races otherwise with
> PTE insertion and zapping? I don't remember well anymore.
>
> I vaguely remember the address_space conversion might help remove the
> fault handler?
>
What about calling vmf_insert_pfn() in bulk as below?
And what is the address_space conversion?


diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index a5ab416cf476..1476e537f593 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -1686,6 +1686,7 @@ static vm_fault_t vfio_pci_mmap_fault(struct vm_fault *vmf)
 	struct vfio_pci_core_device *vdev = vma->vm_private_data;
 	struct vfio_pci_mmap_vma *mmap_vma;
 	vm_fault_t ret = VM_FAULT_NOPAGE;
+	unsigned long base_pfn, offset, i;
 
 	mutex_lock(&vdev->vma_lock);
 	down_read(&vdev->memory_lock);
@@ -1710,12 +1711,15 @@ static vm_fault_t vfio_pci_mmap_fault(struct vm_fault *vmf)
 		goto up_out;
 	}
 
-	if (io_remap_pfn_range(vma, vma->vm_start, vma->vm_pgoff,
-			       vma->vm_end - vma->vm_start,
-			       vma->vm_page_prot)) {
-		ret = VM_FAULT_SIGBUS;
-		zap_vma_ptes(vma, vma->vm_start, vma->vm_end - vma->vm_start);
-		goto up_out;
+	base_pfn = (vmf->address - vma->vm_start) >> PAGE_SHIFT;
+	base_pfn += vma->vm_pgoff;
+	for (i = vma->vm_start; i < vma->vm_end; i += PAGE_SIZE) {
+		offset = (i - vma->vm_start) >> PAGE_SHIFT;
+		ret = vmf_insert_pfn(vma, i, base_pfn + offset);
+		if (ret != VM_FAULT_NOPAGE) {
+			zap_vma_ptes(vma, vma->vm_start, vma->vm_end - vma->vm_start);
+			goto up_out;
+		}
 	}
 
 	if (__vfio_pci_add_vma(vdev, vma)) {
On 5/10/23 22:41, Jason Gunthorpe wrote:
> On Mon, May 08, 2023 at 02:57:15PM -0600, Alex Williamson wrote:
>
>> We already try to set the flags in advance, but there are some
>> architectural flags like VM_PAT that make that tricky. Cedric has been
>> looking at inserting individual pages with vmf_insert_pfn(), but that
>> incurs a lot more faults and therefore latency vs remapping the entire
>> vma on fault. I'm not convinced that we shouldn't just attempt to
>> remove the fault handler entirely, but I haven't tried it yet to know
>> what gotchas are down that path. Thanks,

OTOH I didn't see any noticeable slowdowns in the tests I did with
NICs, IGD and NVIDIA GPU pass-through. I lack devices with large BARs
though.

If anyone has some time for it, here are the commits:

  https://github.com/legoater/linux/commits/vfio

The first does a one-page insert and fixes the lockdep issues we are
seeing with the recent series:

  https://lore.kernel.org/all/20230126193752.297968-1-surenb@google.com/

The second adds some statistics. For the NVIDIA GPU, faults reach 800k.

Thanks,

C.

> I thought we did it like this because there were races otherwise with
> PTE insertion and zapping? I don't remember well anymore.
>
> I vaguely remember the address_space conversion might help remove the
> fault handler?
>
> Jason
On 5/11/23 08:56, Yan Zhao wrote:
> On Wed, May 10, 2023 at 05:41:06PM -0300, Jason Gunthorpe wrote:
>> On Mon, May 08, 2023 at 02:57:15PM -0600, Alex Williamson wrote:
>>
>>> We already try to set the flags in advance, but there are some
>>> architectural flags like VM_PAT that make that tricky. Cedric has been
>>> looking at inserting individual pages with vmf_insert_pfn(), but that
>>> incurs a lot more faults and therefore latency vs remapping the entire
>>> vma on fault. I'm not convinced that we shouldn't just attempt to
>>> remove the fault handler entirely, but I haven't tried it yet to know
>>> what gotchas are down that path. Thanks,
>>
>> I thought we did it like this because there were races otherwise with
>> PTE insertion and zapping? I don't remember well anymore.
>>
>> I vaguely remember the address_space conversion might help remove the
>> fault handler?
>>
> What about calling vmf_insert_pfn() in bulk as below?

This works too; it is slightly slower than the io_remap_pfn_range()
call but doesn't have the lockdep issues.

Thanks,

C.

> And what is the address_space conversion?
>
>
> diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
> index a5ab416cf476..1476e537f593 100644
> --- a/drivers/vfio/pci/vfio_pci_core.c
> +++ b/drivers/vfio/pci/vfio_pci_core.c
> @@ -1686,6 +1686,7 @@ static vm_fault_t vfio_pci_mmap_fault(struct vm_fault *vmf)
>  	struct vfio_pci_core_device *vdev = vma->vm_private_data;
>  	struct vfio_pci_mmap_vma *mmap_vma;
>  	vm_fault_t ret = VM_FAULT_NOPAGE;
> +	unsigned long base_pfn, offset, i;
> 
>  	mutex_lock(&vdev->vma_lock);
>  	down_read(&vdev->memory_lock);
> @@ -1710,12 +1711,15 @@ static vm_fault_t vfio_pci_mmap_fault(struct vm_fault *vmf)
>  		goto up_out;
>  	}
> 
> -	if (io_remap_pfn_range(vma, vma->vm_start, vma->vm_pgoff,
> -			       vma->vm_end - vma->vm_start,
> -			       vma->vm_page_prot)) {
> -		ret = VM_FAULT_SIGBUS;
> -		zap_vma_ptes(vma, vma->vm_start, vma->vm_end - vma->vm_start);
> -		goto up_out;
> +	base_pfn = (vmf->address - vma->vm_start) >> PAGE_SHIFT;
> +	base_pfn += vma->vm_pgoff;
> +	for (i = vma->vm_start; i < vma->vm_end; i += PAGE_SIZE) {
> +		offset = (i - vma->vm_start) >> PAGE_SHIFT;
> +		ret = vmf_insert_pfn(vma, i, base_pfn + offset);
> +		if (ret != VM_FAULT_NOPAGE) {
> +			zap_vma_ptes(vma, vma->vm_start, vma->vm_end - vma->vm_start);
> +			goto up_out;
> +		}
>  	}
> 
>  	if (__vfio_pci_add_vma(vdev, vma)) {
>
On Wed, 10 May 2023 17:41:06 -0300
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Mon, May 08, 2023 at 02:57:15PM -0600, Alex Williamson wrote:
>
> > We already try to set the flags in advance, but there are some
> > architectural flags like VM_PAT that make that tricky. Cedric has been
> > looking at inserting individual pages with vmf_insert_pfn(), but that
> > incurs a lot more faults and therefore latency vs remapping the entire
> > vma on fault. I'm not convinced that we shouldn't just attempt to
> > remove the fault handler entirely, but I haven't tried it yet to know
> > what gotchas are down that path. Thanks,
>
> I thought we did it like this because there were races otherwise with
> PTE insertion and zapping? I don't remember well anymore.

TBH, I don't recall if we tried a synchronous approach previously. The
benefit of the faulting approach was that we could track the minimum
set of vmas which are actually making use of the mapping and throw that
tracking list away when zapping. Without that, we need to add vmas
both on mmap and in vm_ops.open, removing only in vm_ops.close, and
acquire all the proper mm locking for each vma to re-insert the
mappings.

> I vaguely remember the address_space conversion might help remove the
> fault handler?

Yes, this did remove the fault handler entirely; it's (obviously)
dropped off my radar, but perhaps in the interim we could switch to
vmf_insert_pfn() and revive the address space series to eventually
remove the fault handling and vma list altogether.

For reference, I think this was the last posting of the address space
series:

https://lore.kernel.org/all/162818167535.1511194.6614962507750594786.stgit@omen/

Thanks,
Alex
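A rough illustration of the bookkeeping Alex describes above, with
hypothetical types and names (sketch_dev, sketch_vma_entry); the point
is that without lazy faulting, every vma using the mapping must be
recorded in vm_ops.open, which cannot fail, and dropped only in
vm_ops.close.

#include <linux/list.h>
#include <linux/mm.h>
#include <linux/mutex.h>
#include <linux/slab.h>

struct sketch_dev {			/* hypothetical device state */
	struct mutex vma_lock;
	struct list_head vma_list;
};

struct sketch_vma_entry {
	struct list_head next;
	struct vm_area_struct *vma;
};

/* vm_ops.open: forks and splits create new vmas for the mapping; they
 * must be tracked here, and open() has no way to report a failure.
 */
static void sketch_vm_open(struct vm_area_struct *vma)
{
	struct sketch_dev *sdev = vma->vm_private_data;
	struct sketch_vma_entry *entry = kzalloc(sizeof(*entry), GFP_KERNEL);

	if (!entry)
		return;		/* a real driver needs a failure strategy */

	entry->vma = vma;
	mutex_lock(&sdev->vma_lock);
	list_add(&entry->next, &sdev->vma_list);
	mutex_unlock(&sdev->vma_lock);
}

/* vm_ops.close: the only point at which a vma can leave the list */
static void sketch_vm_close(struct vm_area_struct *vma)
{
	struct sketch_dev *sdev = vma->vm_private_data;
	struct sketch_vma_entry *entry, *tmp;

	mutex_lock(&sdev->vma_lock);
	list_for_each_entry_safe(entry, tmp, &sdev->vma_list, next) {
		if (entry->vma == vma) {
			list_del(&entry->next);
			kfree(entry);
			break;
		}
	}
	mutex_unlock(&sdev->vma_lock);
}

Re-inserting mappings then means walking vma_list and taking each vma's
mmap lock individually, which is the "proper mm locking" pain referred
to above.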
On Thu, May 11, 2023 at 10:07:06AM -0600, Alex Williamson wrote:

> > I vaguely remember the address_space conversion might help remove the
> > fault handler?
>
> Yes, this did remove the fault handler entirely; it's (obviously)
> dropped off my radar, but perhaps in the interim we could switch to
> vmf_insert_pfn() and revive the address space series to eventually
> remove the fault handling and vma list altogether.

vmf_insert_pfn() technically isn't supposed to be used for MMIO.. Eg
it doesn't do the PAT stuff on x86 that is causing this problem in the
first place.

So doing the address_space conversion series seems like the best fix.
It has been mislocked for a long time; I suspect there isn't a real
urgent problem beyond the fact that we now actually have lockdep
annotations to catch the mislocking.

Jason
On Thu, May 11, 2023 at 10:07:06AM -0600, Alex Williamson wrote:
> On Wed, 10 May 2023 17:41:06 -0300
> Jason Gunthorpe <jgg@nvidia.com> wrote:
>
> > On Mon, May 08, 2023 at 02:57:15PM -0600, Alex Williamson wrote:
> >
> > > We already try to set the flags in advance, but there are some
> > > architectural flags like VM_PAT that make that tricky. Cedric has been
> > > looking at inserting individual pages with vmf_insert_pfn(), but that
> > > incurs a lot more faults and therefore latency vs remapping the entire
> > > vma on fault. I'm not convinced that we shouldn't just attempt to
> > > remove the fault handler entirely, but I haven't tried it yet to know
> > > what gotchas are down that path. Thanks,
> >
> > I thought we did it like this because there were races otherwise with
> > PTE insertion and zapping? I don't remember well anymore.
>
> TBH, I don't recall if we tried a synchronous approach previously. The
> benefit of the faulting approach was that we could track the minimum
> set of vmas which are actually making use of the mapping and throw that
> tracking list away when zapping. Without that, we need to add vmas
> both on mmap and in vm_ops.open, removing only in vm_ops.close, and
> acquire all the proper mm locking for each vma to re-insert the
> mappings.
>
> > I vaguely remember the address_space conversion might help remove the
> > fault handler?
>
> Yes, this did remove the fault handler entirely; it's (obviously)
> dropped off my radar, but perhaps in the interim we could switch to
> vmf_insert_pfn() and revive the address space series to eventually
> remove the fault handling and vma list altogether.
>
> For reference, I think this was the last posting of the address space
> series:
>
> https://lore.kernel.org/all/162818167535.1511194.6614962507750594786.stgit@omen/

Just took a quick look at this series.
One question: it looks like it still needs to call io_remap_pfn_range()
in places like vfio_basic_config_write() for PCI_COMMAND and on device
reset, so the mmap write lock is still required around
vdev->memory_lock.
On Mon, May 08, 2023 at 08:58:42PM GMT, Yan Zhao wrote:
> In VFIO type1, vaddr_get_pfns() will try to fault in MMIO PFNs after
> pin_user_pages_remote() returns -EFAULT.
>
> follow_fault_pfn
>  fixup_user_fault
>   handle_mm_fault
>    handle_mm_fault
>     do_fault
>      do_shared_fault
>       do_fault
>        __do_fault
>         vfio_pci_mmap_fault
>          io_remap_pfn_range
>           remap_pfn_range
>            track_pfn_remap
>             vm_flags_set ==> mmap_assert_write_locked(vma->vm_mm)
>            remap_pfn_range_notrack
>             vm_flags_set ==> mmap_assert_write_locked(vma->vm_mm)
>
> As io_remap_pfn_range() will call vm_flags_set() to update vm_flags [1],
> holding the mmap write lock is required.
> So, update vfio_pci_mmap_fault() to drop the mmap read lock and take the
> mmap write lock.
>
> [1] https://lkml.kernel.org/r/20230126193752.297968-3-surenb@google.com
>     commit bc292ab00f6c ("mm: introduce vma->vm_flags wrapper functions")
>     commit 1c71222e5f23
>     ("mm: replace vma->vm_flags direct modifications with modifier calls")
>

With linux-next I started noticing traces similar to the above even
without lockdep, since it has ba168b52bf8e ("mm: use rwsem assertion
macros for mmap_lock"). Were there any follow-ups to this? Sorry if my
quick searching missed it.

Thanks,
drew
Hey Drew,

On Wed, 22 May 2024 18:56:29 +0200
Andrew Jones <ajones@ventanamicro.com> wrote:

> On Mon, May 08, 2023 at 08:58:42PM GMT, Yan Zhao wrote:
> > In VFIO type1, vaddr_get_pfns() will try to fault in MMIO PFNs after
> > pin_user_pages_remote() returns -EFAULT.
> >
> > follow_fault_pfn
> >  fixup_user_fault
> >   handle_mm_fault
> >    handle_mm_fault
> >     do_fault
> >      do_shared_fault
> >       do_fault
> >        __do_fault
> >         vfio_pci_mmap_fault
> >          io_remap_pfn_range
> >           remap_pfn_range
> >            track_pfn_remap
> >             vm_flags_set ==> mmap_assert_write_locked(vma->vm_mm)
> >            remap_pfn_range_notrack
> >             vm_flags_set ==> mmap_assert_write_locked(vma->vm_mm)
> >
> > As io_remap_pfn_range() will call vm_flags_set() to update vm_flags [1],
> > holding the mmap write lock is required.
> > So, update vfio_pci_mmap_fault() to drop the mmap read lock and take the
> > mmap write lock.
> >
> > [1] https://lkml.kernel.org/r/20230126193752.297968-3-surenb@google.com
> >     commit bc292ab00f6c ("mm: introduce vma->vm_flags wrapper functions")
> >     commit 1c71222e5f23
> >     ("mm: replace vma->vm_flags direct modifications with modifier calls")
> >
>
> With linux-next I started noticing traces similar to the above even
> without lockdep, since it has ba168b52bf8e ("mm: use rwsem assertion
> macros for mmap_lock"). Were there any follow-ups to this? Sorry if my
> quick searching missed it.

We've been working on it in the background, but we're not there yet.

I have a branch[1] that converts to an address space on the vfio
device, making unmap_mapping_range() available to easily zap mappings
to the BARs without all the ugly vma tracking, but that still leaves us
with the problem that remap_pfn_range() shouldn't be called from the
fault handler, resulting in this lockdep warning. We can switch to
vmf_insert_pfn() in the fault handler, but that's a noticeable
performance penalty.

The angle I've been working on recently is to try taking advantage of
huge_fault support, because we do have vmf_insert_pfn_{pmd,pud}, but
these don't currently support pfnmap pfns. I've been working with
Peter Xu as this aligns with some other work of his. It's working and
resolves the performance issue, especially with a little alignment
tweaking in userspace to take advantage of pud mappings.

I'm not sure if there are any outstanding blockers on Peter's side, but
this seems like a good route from the vfio side. If we're seeing this
now without lockdep, we might need to bite the bullet and take the hit
with vmf_insert_pfn() while the pmd/pud paths learn about pfnmaps.
Thanks,

Alex

[1] https://github.com/awilliam/linux-vfio/commits/vfio-address-space/
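A sketch of the huge_fault direction Alex describes, assuming
vmf_insert_pfn_pmd() gains support for pfnmap pfns (the ongoing work
referenced above) and relying on vfio encoding the BAR base pfn in
vm_pgoff; sketch_huge_fault() is an illustrative handler, not the
eventual vfio-pci implementation, and the signature assumes kernels
where huge_fault takes an order rather than a page entry size.

#include <linux/huge_mm.h>
#include <linux/mm.h>
#include <linux/pfn_t.h>

static vm_fault_t sketch_huge_fault(struct vm_fault *vmf, unsigned int order)
{
	struct vm_area_struct *vma = vmf->vma;
	unsigned long addr = vmf->address & ~((PAGE_SIZE << order) - 1);
	unsigned long pfn = vma->vm_pgoff +
			    ((addr - vma->vm_start) >> PAGE_SHIFT);

	/* map huge only when the range fits the vma and the pfn is
	 * aligned; otherwise fall back to ordinary 4K PTE faults
	 */
	if (order != PMD_SHIFT - PAGE_SHIFT ||
	    addr < vma->vm_start ||
	    addr + (PAGE_SIZE << order) > vma->vm_end ||
	    !IS_ALIGNED(pfn, 1 << order))
		return VM_FAULT_FALLBACK;

	return vmf_insert_pfn_pmd(vmf, __pfn_to_pfn_t(pfn, PFN_DEV),
				  vmf->flags & FAULT_FLAG_WRITE);
}

The userspace alignment tweaking Alex mentions matters here: the pfn
alignment check decides how often the 4K fallback path is taken.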
On Wed, May 22, 2024 at 11:50:06AM -0600, Alex Williamson wrote:
> I'm not sure if there are any outstanding blockers on Peter's side, but
> this seems like a good route from the vfio side. If we're seeing this
> now without lockdep, we might need to bite the bullet and take the hit
> with vmf_insert_pfn() while the pmd/pud paths learn about pfnmaps.

There is another alternative...

Ideally we wouldn't use the fault handler.

Instead, when the MMIO becomes available again we'd iterate over all
the VMAs and do remap_pfn_range(). When the MMIO becomes unavailable
we do unmap_mapping_range() and remove it. This whole thing is
synchronous, and the fault handler should simply trigger SIGBUS if
userspace races things.

unmap_mapping_range() is easy, but the remap side doesn't have a
helper today..

Something grotesque like this perhaps?

	while (1) {
		struct mm_struct *cur_mm = NULL;

		i_mmap_lock_read(mapping);
		vma_interval_tree_foreach(vma, &mapping->i_mmap, 0, ULONG_MAX) {
			if (vma_populated(vma))
				continue;

			/* grab the first mm that still needs populating */
			cur_mm = vma->vm_mm;
			mmgrab(cur_mm);
			break;
		}
		i_mmap_unlock_read(mapping);

		if (!cur_mm)
			return;

		mmap_write_lock(cur_mm);
		i_mmap_lock_read(mapping);
		vma_interval_tree_foreach(vma, &mapping->i_mmap, 0, ULONG_MAX) {
			if (vma->vm_mm == cur_mm)
				vfio_remap_vma(vma);
		}
		i_mmap_unlock_read(mapping);
		mmap_write_unlock(cur_mm);
		mmdrop(cur_mm);
	}

I'm pretty sure we need to hold the mmap write lock to do the
remap_pfn..

vma_populated() would have to do an RCU read of the page table to check
if page 0 is present.

Also there is a race in mmap if you call remap_pfn_range() from the
mmap fops and also use unmap_mapping_range(). mmap_region() does
call_mmap() before it does vma_link_file(), which gives a window where
the VMA is populated but invisible to unmap_mapping_range(). We'd need
to add another fop to call after vma_link_file() to populate the mmap,
or rely exclusively on the fault handler to populate.

Jason
On Wed, 22 May 2024 15:30:46 -0300
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Wed, May 22, 2024 at 11:50:06AM -0600, Alex Williamson wrote:
> > I'm not sure if there are any outstanding blockers on Peter's side, but
> > this seems like a good route from the vfio side. If we're seeing this
> > now without lockdep, we might need to bite the bullet and take the hit
> > with vmf_insert_pfn() while the pmd/pud paths learn about pfnmaps.
>
> There is another alternative...
>
> Ideally we wouldn't use the fault handler.
>
> Instead, when the MMIO becomes available again we'd iterate over all
> the VMAs and do remap_pfn_range(). When the MMIO becomes unavailable
> we do unmap_mapping_range() and remove it. This whole thing is
> synchronous, and the fault handler should simply trigger SIGBUS if
> userspace races things.
>
> unmap_mapping_range() is easy, but the remap side doesn't have a
> helper today..
>
> Something grotesque like this perhaps?
>
> 	while (1) {
> 		struct mm_struct *cur_mm = NULL;
>
> 		i_mmap_lock_read(mapping);
> 		vma_interval_tree_foreach(vma, &mapping->i_mmap, 0, ULONG_MAX) {
> 			if (vma_populated(vma))
> 				continue;
>
> 			cur_mm = vma->vm_mm;
> 			mmgrab(cur_mm);
> 			break;
> 		}
> 		i_mmap_unlock_read(mapping);
>
> 		if (!cur_mm)
> 			return;
>
> 		mmap_write_lock(cur_mm);
> 		i_mmap_lock_read(mapping);
> 		vma_interval_tree_foreach(vma, &mapping->i_mmap, 0, ULONG_MAX) {
> 			if (vma->vm_mm == cur_mm)
> 				vfio_remap_vma(vma);
> 		}
> 		i_mmap_unlock_read(mapping);
> 		mmap_write_unlock(cur_mm);
> 		mmdrop(cur_mm);
> 	}

Yes, I've played with something similar; it's a pretty ugly and painful
path versus lazily faulting.

> I'm pretty sure we need to hold the mmap write lock to do the
> remap_pfn..
>
> vma_populated() would have to do an RCU read of the page table to check
> if page 0 is present.
>
> Also there is a race in mmap if you call remap_pfn_range() from the
> mmap fops and also use unmap_mapping_range(). mmap_region() does
> call_mmap() before it does vma_link_file(), which gives a window where
> the VMA is populated but invisible to unmap_mapping_range(). We'd need
> to add another fop to call after vma_link_file() to populate the mmap,
> or rely exclusively on the fault handler to populate.

Thanks, the branch I linked added back remap_pfn_range() on mmap,
conditional on locking/enabled, but I've predominantly been testing
with that disabled to exercise the pmd/pud faults. I would have missed
the gap vs unmap_mapping_range().

It's hard to beat the faulting mechanism from a simplicity standpoint,
so long as the mm folks don't balk at pfnmap faults. I think the vma
address space walking ends up with tracking flags to avoid redundant
mappings that get complicated and add to the ugliness. The mm handles
all that if we only use faulting and we get a simple test on fault to
proceed or trigger a SIGBUS. Using pmd/pud faults should have some
efficiency benefits as well. Thanks,

Alex
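For contrast, a sketch of the lazy-faulting shape Alex favors: a simple
test on fault to proceed or SIGBUS, using vmf_insert_pfn() so nothing
in the fault path touches vm_flags. This assumes access to vfio's
internal __vfio_pci_memory_enabled() helper (declared in the driver's
private header) and is illustrative, not the in-tree handler of the
time.

#include <linux/mm.h>
#include <linux/vfio_pci_core.h>

static vm_fault_t sketch_fault(struct vm_fault *vmf)
{
	struct vm_area_struct *vma = vmf->vma;
	struct vfio_pci_core_device *vdev = vma->vm_private_data;
	unsigned long pfn = vma->vm_pgoff +
			    ((vmf->address - vma->vm_start) >> PAGE_SHIFT);
	vm_fault_t ret = VM_FAULT_SIGBUS;

	down_read(&vdev->memory_lock);
	/* proceed only while device memory is enabled; disabling memory
	 * would zap mappings with unmap_mapping_range(), and later
	 * faults repopulate them here, one page at a time
	 */
	if (__vfio_pci_memory_enabled(vdev))
		ret = vmf_insert_pfn(vma, vmf->address, pfn);
	up_read(&vdev->memory_lock);

	return ret;
}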
On Wed, May 22, 2024 at 11:50:06AM -0600, Alex Williamson wrote:
> I'm not sure if there are any outstanding blockers on Peter's side, but
> this seems like a good route from the vfio side. If we're seeing this
> now without lockdep, we might need to bite the bullet and take the hit
> with vmf_insert_pfn() while the pmd/pud paths learn about pfnmaps.

No immediate blockers; it's just that there are some small details I
may still need to look into. The current TBD is the pfn tracking
implications on PAT. Here I see at least two issues to be
investigated.

First, when vfio zaps BARs it can end up removing the VM_PAT flag. To
be explicit, unmap_single_vma() has:

	if (unlikely(vma->vm_flags & VM_PFNMAP))
		untrack_pfn(vma, 0, 0, mm_wr_locked);

I believe it'll also erase the entry in the memtype_rbroot.. I'm not
sure whether that's correct at all, and if it is, how we should
re-inject it. So far I feel like we should leave the pfn tracking
alone when only tearing down pgtables, but I'll need to double check.
E.g. I at least checked that MADV_DONTNEED won't apply to PFNMAPs, so
vfio zapping the vma should be the first thing that can do that
besides munmap().

The other thing I noticed very recently is that the PAT bit on x86_64
is not always the same one.. on 4K PTEs it's bit 7, but bit 7 is
reused as PSE at the higher levels, moving PAT to bit 12:

#define _PAGE_BIT_PSE		7	/* 4 MB (or 2MB) page */
#define _PAGE_BIT_PAT		7	/* on 4KB pages */
#define _PAGE_BIT_PAT_LARGE	12	/* On 2MB or 1GB pages */

We may need something like protval_4k_2_large() when injecting huge
mappings.

Schedule-wise, the plan is to continue work on this after I flush the
inbox from the past two weeks and get some spare time. ~160 emails
left.. but I'm getting there.

If there are comments on either of the above, please shoot.

Thanks,
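For reference, a sketch of the bit shuffle Peter describes, modeled on
the x86 protval_4k_2_large() helper he names (check
arch/x86/include/asm/pgtable_types.h for the authoritative version):
the 4K PAT bit has to be relocated from bit 7, which means PSE at the
PMD/PUD level, up to bit 12 before a 4K protval can be used for a huge
mapping.

#include <asm/pgtable_types.h>

static inline pgprotval_t sketch_protval_4k_2_large(pgprotval_t val)
{
	/* clear both PAT positions, then re-inject the 4K PAT bit (7)
	 * at the large-page PAT position (12)
	 */
	return (val & ~(_PAGE_PAT | _PAGE_PAT_LARGE)) |
	       ((val & _PAGE_PAT) << (_PAGE_BIT_PAT_LARGE - _PAGE_BIT_PAT));
}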
diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index a5ab416cf476..5082f89152b3 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -1687,6 +1687,12 @@ static vm_fault_t vfio_pci_mmap_fault(struct vm_fault *vmf)
 	struct vfio_pci_mmap_vma *mmap_vma;
 	vm_fault_t ret = VM_FAULT_NOPAGE;
 
+	mmap_assert_locked(vma->vm_mm);
+	mmap_read_unlock(vma->vm_mm);
+
+	if (mmap_write_lock_killable(vma->vm_mm))
+		return VM_FAULT_RETRY;
+
 	mutex_lock(&vdev->vma_lock);
 	down_read(&vdev->memory_lock);
 
@@ -1726,6 +1732,17 @@ static vm_fault_t vfio_pci_mmap_fault(struct vm_fault *vmf)
 up_out:
 	up_read(&vdev->memory_lock);
 	mutex_unlock(&vdev->vma_lock);
+	mmap_write_unlock(vma->vm_mm);
+
+	/* If PTE is installed successfully, add the completed flag to
+	 * indicate mmap lock is released,
+	 * otherwise retake the read lock
+	 */
+	if (ret == VM_FAULT_NOPAGE)
+		ret |= VM_FAULT_COMPLETED;
+	else
+		mmap_read_lock(vma->vm_mm);
+
 	return ret;
 }
In VFIO type1, vaddr_get_pfns() will try to fault in MMIO PFNs after
pin_user_pages_remote() returns -EFAULT.

follow_fault_pfn
 fixup_user_fault
  handle_mm_fault
   handle_mm_fault
    do_fault
     do_shared_fault
      do_fault
       __do_fault
        vfio_pci_mmap_fault
         io_remap_pfn_range
          remap_pfn_range
           track_pfn_remap
            vm_flags_set ==> mmap_assert_write_locked(vma->vm_mm)
           remap_pfn_range_notrack
            vm_flags_set ==> mmap_assert_write_locked(vma->vm_mm)

As io_remap_pfn_range() will call vm_flags_set() to update vm_flags [1],
holding the mmap write lock is required.
So, update vfio_pci_mmap_fault() to drop the mmap read lock and take the
mmap write lock.

[1] https://lkml.kernel.org/r/20230126193752.297968-3-surenb@google.com
    commit bc292ab00f6c ("mm: introduce vma->vm_flags wrapper functions")
    commit 1c71222e5f23
    ("mm: replace vma->vm_flags direct modifications with modifier calls")

Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
 drivers/vfio/pci/vfio_pci_core.c | 17 +++++++++++++++++
 1 file changed, 17 insertions(+)

base-commit: 705b004ee377b789e39ae237519bab714297ac83