Message ID | 161539852724.8302.17137130175894127401.stgit@gimli.home (mailing list archive)
---|---
State | New, archived
Series | vfio/pci: Handle concurrent vma faults
On Wed, Mar 10, 2021 at 10:53:29AM -0700, Alex Williamson wrote:

> diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
> index 65e7e6b44578..ae723808e08b 100644
> --- a/drivers/vfio/pci/vfio_pci.c
> +++ b/drivers/vfio/pci/vfio_pci.c
> @@ -1573,6 +1573,11 @@ static int __vfio_pci_add_vma(struct vfio_pci_device *vdev,
>  {
>  	struct vfio_pci_mmap_vma *mmap_vma;
>
> +	list_for_each_entry(mmap_vma, &vdev->vma_list, vma_next) {
> +		if (mmap_vma->vma == vma)
> +			return 0; /* Swallow the error, the vma is tracked */
> +	}
> +
>  	mmap_vma = kmalloc(sizeof(*mmap_vma), GFP_KERNEL);
>  	if (!mmap_vma)
>  		return -ENOMEM;
> @@ -1612,31 +1617,32 @@ static vm_fault_t vfio_pci_mmap_fault(struct vm_fault *vmf)
>  {
>  	struct vm_area_struct *vma = vmf->vma;
>  	struct vfio_pci_device *vdev = vma->vm_private_data;
> -	vm_fault_t ret = VM_FAULT_NOPAGE;
> +	unsigned long vaddr = vma->vm_start, pfn = vma->vm_pgoff;
> +	vm_fault_t ret = VM_FAULT_SIGBUS;
>
>  	mutex_lock(&vdev->vma_lock);
>  	down_read(&vdev->memory_lock);
>
> -	if (!__vfio_pci_memory_enabled(vdev)) {
> -		ret = VM_FAULT_SIGBUS;
> -		mutex_unlock(&vdev->vma_lock);
> +	if (!__vfio_pci_memory_enabled(vdev))
>  		goto up_out;
> +
> +	for (; vaddr < vma->vm_end; vaddr += PAGE_SIZE, pfn++) {
> +		ret = vmf_insert_pfn_prot(vma, vaddr, pfn,
> +					  pgprot_decrypted(vma->vm_page_prot));

I investigated this, I think the above pgprot_decrypted() should be
moved here:

static int vfio_pci_mmap(void *device_data, struct vm_area_struct *vma)
{
	vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
+	vma->vm_page_prot = pgprot_decrypted(vma->vm_page_prot);

And since:

vm_fault_t vmf_insert_pfn(struct vm_area_struct *vma, unsigned long addr,
			  unsigned long pfn)
{
	return vmf_insert_pfn_prot(vma, addr, pfn, vma->vm_page_prot);

The above can just use vmf_insert_pfn()

The only thing that makes me nervous about this arrangement is losing
the track_pfn_remap() which was in remap_pfn_range() - I think it
means we miss out on certain PAT manipulations..
I *suspect* this is not a problem for VFIO because it will rely on the
MTRRs generally on x86 - but I also don't know this mechanism too well.

I think after the address_space changes this should try to stick with
a normal io_remap_pfn_range() done outside the fault handler.

Jason
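Concretely, the change Jason suggests would look roughly like the following. This is an untested editorial sketch, not part of the posted series: both protection modifiers are applied once at mmap() time, which lets the fault loop drop the `_prot` variant.

```c
/*
 * Sketch only (not the posted patch): pgprot_decrypted() joins
 * pgprot_noncached() at mmap time, so every later insertion can use
 * the vma's own vm_page_prot.  Surrounding setup code is elided.
 */
static int vfio_pci_mmap(void *device_data, struct vm_area_struct *vma)
{
	/* ... existing validation and vm_pgoff setup ... */
	vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
	vma->vm_page_prot = pgprot_decrypted(vma->vm_page_prot);
	/* ... */
}

/* ... and the fault-time loop simplifies to vmf_insert_pfn(): */
	for (; vaddr < vma->vm_end; vaddr += PAGE_SIZE, pfn++) {
		ret = vmf_insert_pfn(vma, vaddr, pfn);
		if (ret != VM_FAULT_NOPAGE)
			break;
	}
```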
On Wed, 10 Mar 2021 14:14:46 -0400
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Wed, Mar 10, 2021 at 10:53:29AM -0700, Alex Williamson wrote:
>
> > diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
> > index 65e7e6b44578..ae723808e08b 100644
> > --- a/drivers/vfio/pci/vfio_pci.c
> > +++ b/drivers/vfio/pci/vfio_pci.c
> > @@ -1573,6 +1573,11 @@ static int __vfio_pci_add_vma(struct vfio_pci_device *vdev,
> >  {
> >  	struct vfio_pci_mmap_vma *mmap_vma;
> >
> > +	list_for_each_entry(mmap_vma, &vdev->vma_list, vma_next) {
> > +		if (mmap_vma->vma == vma)
> > +			return 0; /* Swallow the error, the vma is tracked */
> > +	}
> > +
> >  	mmap_vma = kmalloc(sizeof(*mmap_vma), GFP_KERNEL);
> >  	if (!mmap_vma)
> >  		return -ENOMEM;
> > @@ -1612,31 +1617,32 @@ static vm_fault_t vfio_pci_mmap_fault(struct vm_fault *vmf)
> >  {
> >  	struct vm_area_struct *vma = vmf->vma;
> >  	struct vfio_pci_device *vdev = vma->vm_private_data;
> > -	vm_fault_t ret = VM_FAULT_NOPAGE;
> > +	unsigned long vaddr = vma->vm_start, pfn = vma->vm_pgoff;
> > +	vm_fault_t ret = VM_FAULT_SIGBUS;
> >
> >  	mutex_lock(&vdev->vma_lock);
> >  	down_read(&vdev->memory_lock);
> >
> > -	if (!__vfio_pci_memory_enabled(vdev)) {
> > -		ret = VM_FAULT_SIGBUS;
> > -		mutex_unlock(&vdev->vma_lock);
> > +	if (!__vfio_pci_memory_enabled(vdev))
> >  		goto up_out;
> > +
> > +	for (; vaddr < vma->vm_end; vaddr += PAGE_SIZE, pfn++) {
> > +		ret = vmf_insert_pfn_prot(vma, vaddr, pfn,
> > +					  pgprot_decrypted(vma->vm_page_prot));
>
> I investigated this, I think the above pgprot_decrypted() should be
> moved here:
>
> static int vfio_pci_mmap(void *device_data, struct vm_area_struct *vma)
> {
> 	vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
> +	vma->vm_page_prot = pgprot_decrypted(vma->vm_page_prot);
>
> And since:
>
> vm_fault_t vmf_insert_pfn(struct vm_area_struct *vma, unsigned long addr,
> 			  unsigned long pfn)
> {
> 	return vmf_insert_pfn_prot(vma, addr, pfn, vma->vm_page_prot);
>
> The above can just use vmf_insert_pfn()

Cool, easy enough.
Thanks for looking.

> The only thing that makes me nervous about this arrangement is losing
> the track_pfn_remap() which was in remap_pfn_range() - I think it
> means we miss out on certain PAT manipulations.. I *suspect* this is
> not a problem for VFIO because it will rely on the MTRRs generally on
> x86 - but I also don't know this mechanism too well.

Yeah, for VM use cases the MTRRs generally override.

> I think after the address_space changes this should try to stick with
> a normal io_remap_pfn_range() done outside the fault handler.

I assume you're suggesting calling io_remap_pfn_range() when device
memory is enabled; do you mean using vma_interval_tree_foreach() like
unmap_mapping_range() does to avoid re-adding a vma list?  Thanks,

Alex
On Wed, Mar 10, 2021 at 11:34:06AM -0700, Alex Williamson wrote:

> > I think after the address_space changes this should try to stick with
> > a normal io_remap_pfn_range() done outside the fault handler.
>
> I assume you're suggesting calling io_remap_pfn_range() when device
> memory is enabled,

Yes, I think I saw Peter thinking along these lines too

Then fault just always causes SIGBUS if it gets called

Jason
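In other words, once the BAR mapping is populated at device-memory enable time, the fault handler degenerates to a pure error path. A minimal sketch of that end state (an editorial illustration, not code from the posted series):

```c
/*
 * Sketch: if io_remap_pfn_range() runs when device memory is enabled,
 * reaching fault() can only mean the BAR is currently disabled (the
 * ptes were zapped), so the handler unconditionally raises SIGBUS.
 */
static vm_fault_t vfio_pci_mmap_fault(struct vm_fault *vmf)
{
	return VM_FAULT_SIGBUS;
}
```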
On Wed, Mar 10, 2021 at 02:40:11PM -0400, Jason Gunthorpe wrote:
> On Wed, Mar 10, 2021 at 11:34:06AM -0700, Alex Williamson wrote:
>
> > > I think after the address_space changes this should try to stick with
> > > a normal io_remap_pfn_range() done outside the fault handler.
> >
> > I assume you're suggesting calling io_remap_pfn_range() when device
> > memory is enabled,
>
> Yes, I think I saw Peter thinking along these lines too
>
> Then fault just always causes SIGBUS if it gets called

Indeed that looks better than looping in the fault().

But I don't know whether it'll be easy to move io_remap_pfn_range() to
device memory enablement.  If it's a two-step thing, we can fix the
BUG_ON and vma duplication issue first, then the full rework can be done
in the bigger series as whatever approach is chosen in the end.

Thanks,
On Wed, Mar 10, 2021 at 03:06:07PM -0500, Peter Xu wrote:
> On Wed, Mar 10, 2021 at 02:40:11PM -0400, Jason Gunthorpe wrote:
> > On Wed, Mar 10, 2021 at 11:34:06AM -0700, Alex Williamson wrote:
> >
> > > > I think after the address_space changes this should try to stick with
> > > > a normal io_remap_pfn_range() done outside the fault handler.
> > >
> > > I assume you're suggesting calling io_remap_pfn_range() when device
> > > memory is enabled,
> >
> > Yes, I think I saw Peter thinking along these lines too
> >
> > Then fault just always causes SIGBUS if it gets called

I feel much more comfortable having the io_remap_pfn_range in place.

> Indeed that looks better than looping in the fault().
>
> But I don't know whether it'll be easy to move io_remap_pfn_range() to
> device memory enablement.  If it's a two-step thing, we can fix the
> BUG_ON and vma duplication issue first, then the full rework can be
> done in the bigger series as whatever approach is chosen in the end.

What kind of problems do you envision?  It seems pretty simple to do,
at least when combined with the unmap_mapping_range patch.
On Thu, Mar 11, 2021 at 11:35:24AM +0000, Christoph Hellwig wrote:
> On Wed, Mar 10, 2021 at 03:06:07PM -0500, Peter Xu wrote:
> > On Wed, Mar 10, 2021 at 02:40:11PM -0400, Jason Gunthorpe wrote:
> > > On Wed, Mar 10, 2021 at 11:34:06AM -0700, Alex Williamson wrote:
> > >
> > > > > I think after the address_space changes this should try to stick with
> > > > > a normal io_remap_pfn_range() done outside the fault handler.
> > > >
> > > > I assume you're suggesting calling io_remap_pfn_range() when device
> > > > memory is enabled,
> > >
> > > Yes, I think I saw Peter thinking along these lines too
> > >
> > > Then fault just always causes SIGBUS if it gets called
>
> I feel much more comfortable having the io_remap_pfn_range in place.

It's just that Jason convinced me with the fact that io_remap_pfn_range()
will modify vma flags, and I tend to agree that's not a good thing to do
during a fault() handler (in remap_pfn_range):

	vma->vm_flags |= VM_IO | VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP;

Although this case is special and it seems to do no harm, since all four
of these flags are already set by vfio_pci_mmap() anyway, the flags don't
actually change, at least with the current code base.  It's still cleaner
not to use io_remap_pfn_range() in the vfio fault() handler, since future
changes to io_remap_pfn_range() are not guaranteed to stay in sync with
vfio's mmap().

> > Indeed that looks better than looping in the fault().
> >
> > But I don't know whether it'll be easy to move io_remap_pfn_range() to
> > device memory enablement.  If it's a two-step thing, we can fix the
> > BUG_ON and vma duplication issue first, then the full rework can be
> > done in the bigger series as whatever approach is chosen in the end.
>
> What kind of problems do you envision?  It seems pretty simple to do,
> at least when combined with the unmap_mapping_range patch.
Moving the prefault into device memory enablement will even remove the
first-fault delay on the initial MMIO access that triggers this fault().
Also in that case I think we can call io_remap_pfn_range() directly and
safely, rather than looping over vmf_insert_pfn_prot().

Thanks,
On Wed, 10 Mar 2021 14:40:11 -0400
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Wed, Mar 10, 2021 at 11:34:06AM -0700, Alex Williamson wrote:
>
> > > I think after the address_space changes this should try to stick with
> > > a normal io_remap_pfn_range() done outside the fault handler.
> >
> > I assume you're suggesting calling io_remap_pfn_range() when device
> > memory is enabled,
>
> Yes, I think I saw Peter thinking along these lines too
>
> Then fault just always causes SIGBUS if it gets called

Trying to use the address_space approach because otherwise we'd just be
adding back vma list tracking, it looks like we can't call
io_remap_pfn_range() while holding the address_space i_mmap_rwsem via
i_mmap_lock_write(), like done in unmap_mapping_range().  lockdep
identifies a circular lock order issue against fs_reclaim.  Minimally we
also need vma_interval_tree_iter_{first,next} exported in order to use
vma_interval_tree_foreach().  Suggestions?  Thanks,

Alex
On Fri, Mar 12, 2021 at 12:16:11PM -0700, Alex Williamson wrote:
> On Wed, 10 Mar 2021 14:40:11 -0400
> Jason Gunthorpe <jgg@nvidia.com> wrote:
>
> > On Wed, Mar 10, 2021 at 11:34:06AM -0700, Alex Williamson wrote:
> >
> > > > I think after the address_space changes this should try to stick with
> > > > a normal io_remap_pfn_range() done outside the fault handler.
> > >
> > > I assume you're suggesting calling io_remap_pfn_range() when device
> > > memory is enabled,
> >
> > Yes, I think I saw Peter thinking along these lines too
> >
> > Then fault just always causes SIGBUS if it gets called
>
> Trying to use the address_space approach because otherwise we'd just be
> adding back vma list tracking, it looks like we can't call
> io_remap_pfn_range() while holding the address_space i_mmap_rwsem via
> i_mmap_lock_write(), like done in unmap_mapping_range().  lockdep
> identifies a circular lock order issue against fs_reclaim.  Minimally we
> also need vma_interval_tree_iter_{first,next} exported in order to use
> vma_interval_tree_foreach().  Suggestions?  Thanks,

You are asking how to put the BAR back into every VMA when it is
enabled again after it has been zap'd?

What did the lockdep splat look like?  Is it a memory allocation?

Does current_gfp_context()/memalloc_nofs_save()/etc solve it?

The easiest answer is to continue to use fault and the
vmf_insert_page()..

But it feels like it would be OK to export enough i_mmap machinery to
enable this.  Cleaner than building your own tracking, which would
still have the same ugly mmap_sem inversion issue which was preventing
this last time.

Jason
On Fri, 12 Mar 2021 15:41:47 -0400
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Fri, Mar 12, 2021 at 12:16:11PM -0700, Alex Williamson wrote:
> > On Wed, 10 Mar 2021 14:40:11 -0400
> > Jason Gunthorpe <jgg@nvidia.com> wrote:
> >
> > > On Wed, Mar 10, 2021 at 11:34:06AM -0700, Alex Williamson wrote:
> > >
> > > > > I think after the address_space changes this should try to stick with
> > > > > a normal io_remap_pfn_range() done outside the fault handler.
> > > >
> > > > I assume you're suggesting calling io_remap_pfn_range() when device
> > > > memory is enabled,
> > >
> > > Yes, I think I saw Peter thinking along these lines too
> > >
> > > Then fault just always causes SIGBUS if it gets called
> >
> > Trying to use the address_space approach because otherwise we'd just be
> > adding back vma list tracking, it looks like we can't call
> > io_remap_pfn_range() while holding the address_space i_mmap_rwsem via
> > i_mmap_lock_write(), like done in unmap_mapping_range().  lockdep
> > identifies a circular lock order issue against fs_reclaim.  Minimally we
> > also need vma_interval_tree_iter_{first,next} exported in order to use
> > vma_interval_tree_foreach().  Suggestions?  Thanks,
>
> You are asking how to put the BAR back into every VMA when it is
> enabled again after it has been zap'd?

Exactly.

> What did the lockdep splat look like?  Is it a memory allocation?

======================================================
WARNING: possible circular locking dependency detected
5.12.0-rc1+ #18 Not tainted
------------------------------------------------------
CPU 0/KVM/1406 is trying to acquire lock:
ffffffffa5a58d60 (fs_reclaim){+.+.}-{0:0}, at: fs_reclaim_acquire+0x83/0xd0

but task is already holding lock:
ffff94c0f3e8fb08 (&mapping->i_mmap_rwsem){++++}-{3:3}, at: vfio_device_io_remap_mapping_range+0x31/0x120 [vfio]

which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:

-> #1 (&mapping->i_mmap_rwsem){++++}-{3:3}:
       down_write+0x3d/0x70
       dma_resv_lockdep+0x1b0/0x298
       do_one_initcall+0x5b/0x2d0
       kernel_init_freeable+0x251/0x298
       kernel_init+0xa/0x111
       ret_from_fork+0x22/0x30

-> #0 (fs_reclaim){+.+.}-{0:0}:
       __lock_acquire+0x111f/0x1e10
       lock_acquire+0xb5/0x380
       fs_reclaim_acquire+0xa3/0xd0
       kmem_cache_alloc_trace+0x30/0x2c0
       memtype_reserve+0xc3/0x280
       reserve_pfn_range+0x86/0x160
       track_pfn_remap+0xa6/0xe0
       remap_pfn_range+0xa8/0x610
       vfio_device_io_remap_mapping_range+0x93/0x120 [vfio]
       vfio_pci_test_and_up_write_memory_lock+0x34/0x40 [vfio_pci]
       vfio_basic_config_write+0x12d/0x230 [vfio_pci]
       vfio_pci_config_rw+0x1b7/0x3a0 [vfio_pci]
       vfs_write+0xea/0x390
       __x64_sys_pwrite64+0x72/0xb0
       do_syscall_64+0x33/0x40
       entry_SYSCALL_64_after_hwframe+0x44/0xae

other info that might help us debug this:

 Possible unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(&mapping->i_mmap_rwsem);
                               lock(fs_reclaim);
                               lock(&mapping->i_mmap_rwsem);
  lock(fs_reclaim);

 *** DEADLOCK ***

2 locks held by CPU 0/KVM/1406:
 #0: ffff94c0f9c71ef0 (&vdev->memory_lock){++++}-{3:3}, at: vfio_basic_config_write+0x19a/0x230 [vfio_pci]
 #1: ffff94c0f3e8fb08 (&mapping->i_mmap_rwsem){++++}-{3:3}, at: vfio_device_io_remap_mapping_range+0x31/0x120 [vfio]

stack backtrace:
CPU: 3 PID: 1406 Comm: CPU 0/KVM Not tainted 5.12.0-rc1+ #18
Hardware name: System manufacturer System Product Name/P8H67-M PRO, BIOS 3904 04/27/2013
Call Trace:
 dump_stack+0x7f/0xa1
 check_noncircular+0xcf/0xf0
 __lock_acquire+0x111f/0x1e10
 lock_acquire+0xb5/0x380
 ? fs_reclaim_acquire+0x83/0xd0
 ? pat_enabled+0x10/0x10
 ? memtype_reserve+0xc3/0x280
 fs_reclaim_acquire+0xa3/0xd0
 ? fs_reclaim_acquire+0x83/0xd0
 kmem_cache_alloc_trace+0x30/0x2c0
 memtype_reserve+0xc3/0x280
 reserve_pfn_range+0x86/0x160
 track_pfn_remap+0xa6/0xe0
 remap_pfn_range+0xa8/0x610
 ? lock_acquire+0xb5/0x380
 ? vfio_device_io_remap_mapping_range+0x31/0x120 [vfio]
 ? lock_is_held_type+0xa5/0x120
 vfio_device_io_remap_mapping_range+0x93/0x120 [vfio]
 vfio_pci_test_and_up_write_memory_lock+0x34/0x40 [vfio_pci]
 vfio_basic_config_write+0x12d/0x230 [vfio_pci]
 vfio_pci_config_rw+0x1b7/0x3a0 [vfio_pci]
 vfs_write+0xea/0x390
 __x64_sys_pwrite64+0x72/0xb0
 do_syscall_64+0x33/0x40
 entry_SYSCALL_64_after_hwframe+0x44/0xae
RIP: 0033:0x7f80176152ff
Code: 08 89 3c 24 48 89 4c 24 18 e8 3d f3 ff ff 4c 8b 54 24 18 48 8b 54 24 10 41 89 c0 48 8b 74 24 08 8b 3c 24 b8 12 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 31 44 89 c7 48 89 04 24 e8 6d f3 ff ff 48 8b
RSP: 002b:00007f7efa5f72f0 EFLAGS: 00000293 ORIG_RAX: 0000000000000012
RAX: ffffffffffffffda RBX: 0000000000000004 RCX: 00007f80176152ff
RDX: 0000000000000002 RSI: 00007f7efa5f736c RDI: 000000000000002d
RBP: 000055b66913d530 R08: 0000000000000000 R09: 000000000000ffff
R10: 0000070000000004 R11: 0000000000000293 R12: 0000000000000004
R13: 0000000000000102 R14: 0000000000000002 R15: 000055b66913d530

> Does current_gfp_context()/memalloc_nofs_save()/etc solve it?

Will investigate...

> The easiest answer is to continue to use fault and the
> vmf_insert_page()..
>
> But it feels like it would be OK to export enough i_mmap machinery to
> enable this.  Cleaner than building your own tracking, which would
> still have the same ugly mmap_sem inversion issue which was preventing
> this last time.

Yup, I'd rather fault than add that back, but I'm not sure we have a
mapping function compatible with this framework.  Thanks,

Alex
On Fri, 12 Mar 2021 13:09:38 -0700
Alex Williamson <alex.williamson@redhat.com> wrote:

> On Fri, 12 Mar 2021 15:41:47 -0400
> Jason Gunthorpe <jgg@nvidia.com> wrote:
>
> ======================================================
> WARNING: possible circular locking dependency detected
> 5.12.0-rc1+ #18 Not tainted
> ------------------------------------------------------
> CPU 0/KVM/1406 is trying to acquire lock:
> ffffffffa5a58d60 (fs_reclaim){+.+.}-{0:0}, at: fs_reclaim_acquire+0x83/0xd0
>
> but task is already holding lock:
> ffff94c0f3e8fb08 (&mapping->i_mmap_rwsem){++++}-{3:3}, at: vfio_device_io_remap_mapping_range+0x31/0x120 [vfio]
>
> which lock already depends on the new lock.
>
>
> the existing dependency chain (in reverse order) is:
>
> -> #1 (&mapping->i_mmap_rwsem){++++}-{3:3}:
>        down_write+0x3d/0x70
>        dma_resv_lockdep+0x1b0/0x298
>        do_one_initcall+0x5b/0x2d0
>        kernel_init_freeable+0x251/0x298
>        kernel_init+0xa/0x111
>        ret_from_fork+0x22/0x30
>
> -> #0 (fs_reclaim){+.+.}-{0:0}:
>        __lock_acquire+0x111f/0x1e10
>        lock_acquire+0xb5/0x380
>        fs_reclaim_acquire+0xa3/0xd0
>        kmem_cache_alloc_trace+0x30/0x2c0
>        memtype_reserve+0xc3/0x280
>        reserve_pfn_range+0x86/0x160
>        track_pfn_remap+0xa6/0xe0
>        remap_pfn_range+0xa8/0x610
>        vfio_device_io_remap_mapping_range+0x93/0x120 [vfio]
>        vfio_pci_test_and_up_write_memory_lock+0x34/0x40 [vfio_pci]
>        vfio_basic_config_write+0x12d/0x230 [vfio_pci]
>        vfio_pci_config_rw+0x1b7/0x3a0 [vfio_pci]
>        vfs_write+0xea/0x390
>        __x64_sys_pwrite64+0x72/0xb0
>        do_syscall_64+0x33/0x40
>        entry_SYSCALL_64_after_hwframe+0x44/0xae
> ..
>
> > Does current_gfp_context()/memalloc_nofs_save()/etc solve it?

Yeah, we can indeed use memalloc_nofs_save/restore().  It seems we're
trying to allocate something for pfnmap tracking, and that allocation
enables lots of lockdep-specific tests.  Is it valid to wrap
io_remap_pfn_range() in a region that clears this flag, or am I just
masking a bug?  Thanks,

Alex
On Fri, Mar 12, 2021 at 01:58:44PM -0700, Alex Williamson wrote:

> Yeah, we can indeed use memalloc_nofs_save/restore().  It seems we're
> trying to allocate something for pfnmap tracking and that enables lots
> of lockdep specific tests.  Is it valid to wrap io_remap_pfn_range()
> around clearing this flag or am I just masking a bug?  Thanks,

Yes, I think it is fine.  Those functions are meant to be used in a
no-fs kind of region exactly like this.  no-fs is telling the allocator
not to do reclaim, which is forbidden under the locks here (as reclaim
will also attempt to get these locks).

I would defer to Michal Hocko though, maybe cc him on the final patch
series version.

Jason
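Putting the pieces of this sub-thread together, the remap-at-enable helper would look roughly like the following. This is an untested editorial sketch: the function name is taken from the lockdep trace above, but its signature and the per-vma pfn math are assumptions, not code from a posted patch.

```c
/*
 * Sketch (untested): replay the BAR mapping into every user vma when
 * device memory is re-enabled.  memalloc_nofs_save() marks a no-fs
 * region so the allocation inside track_pfn_remap()/memtype_reserve()
 * cannot recurse into reclaim while i_mmap_rwsem is held, which is the
 * inversion lockdep reported.  Deriving the correct pfn for each vma
 * is elided here.
 */
static void vfio_device_io_remap_mapping_range(struct address_space *mapping,
					       loff_t start, loff_t len)
{
	struct vm_area_struct *vma;
	unsigned int nofs_flags;

	i_mmap_lock_write(mapping);
	nofs_flags = memalloc_nofs_save();
	vma_interval_tree_foreach(vma, &mapping->i_mmap,
				  start >> PAGE_SHIFT,
				  (start + len - 1) >> PAGE_SHIFT) {
		/* per-vma BAR pfn computation elided in this sketch */
		io_remap_pfn_range(vma, vma->vm_start, vma->vm_pgoff,
				   vma->vm_end - vma->vm_start,
				   vma->vm_page_prot);
	}
	memalloc_nofs_restore(nofs_flags);
	i_mmap_unlock_write(mapping);
}
```

Note the iteration requires vma_interval_tree_iter_{first,next} to be exported for modules, as Alex points out above.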
diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index 65e7e6b44578..ae723808e08b 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -1573,6 +1573,11 @@ static int __vfio_pci_add_vma(struct vfio_pci_device *vdev,
 {
 	struct vfio_pci_mmap_vma *mmap_vma;
 
+	list_for_each_entry(mmap_vma, &vdev->vma_list, vma_next) {
+		if (mmap_vma->vma == vma)
+			return 0; /* Swallow the error, the vma is tracked */
+	}
+
 	mmap_vma = kmalloc(sizeof(*mmap_vma), GFP_KERNEL);
 	if (!mmap_vma)
 		return -ENOMEM;
@@ -1612,31 +1617,32 @@ static vm_fault_t vfio_pci_mmap_fault(struct vm_fault *vmf)
 {
 	struct vm_area_struct *vma = vmf->vma;
 	struct vfio_pci_device *vdev = vma->vm_private_data;
-	vm_fault_t ret = VM_FAULT_NOPAGE;
+	unsigned long vaddr = vma->vm_start, pfn = vma->vm_pgoff;
+	vm_fault_t ret = VM_FAULT_SIGBUS;
 
 	mutex_lock(&vdev->vma_lock);
 	down_read(&vdev->memory_lock);
 
-	if (!__vfio_pci_memory_enabled(vdev)) {
-		ret = VM_FAULT_SIGBUS;
-		mutex_unlock(&vdev->vma_lock);
+	if (!__vfio_pci_memory_enabled(vdev))
 		goto up_out;
+
+	for (; vaddr < vma->vm_end; vaddr += PAGE_SIZE, pfn++) {
+		ret = vmf_insert_pfn_prot(vma, vaddr, pfn,
+					  pgprot_decrypted(vma->vm_page_prot));
+		if (ret != VM_FAULT_NOPAGE) {
+			zap_vma_ptes(vma, vma->vm_start, vaddr - vma->vm_start);
+			goto up_out;
+		}
 	}
 
 	if (__vfio_pci_add_vma(vdev, vma)) {
 		ret = VM_FAULT_OOM;
-		mutex_unlock(&vdev->vma_lock);
-		goto up_out;
+		zap_vma_ptes(vma, vma->vm_start, vma->vm_end - vma->vm_start);
 	}
 
-	mutex_unlock(&vdev->vma_lock);
-
-	if (io_remap_pfn_range(vma, vma->vm_start, vma->vm_pgoff,
-			       vma->vm_end - vma->vm_start, vma->vm_page_prot))
-		ret = VM_FAULT_SIGBUS;
-
 up_out:
 	up_read(&vdev->memory_lock);
+	mutex_unlock(&vdev->vma_lock);
 	return ret;
 }
vfio_pci_mmap_fault() incorrectly makes use of io_remap_pfn_range()
from within a vm_ops fault handler.  This function will trigger a
BUG_ON if it encounters a populated pte within the remapped range,
where any fault is meant to populate the entire vma.  Concurrent
inflight faults to the same vma will therefore hit this issue,
triggering traces such as:

[ 1591.733256] kernel BUG at mm/memory.c:2177!
[ 1591.739515] Internal error: Oops - BUG: 0 [#1] PREEMPT SMP
[ 1591.747381] Modules linked in: vfio_iommu_type1 vfio_pci vfio_virqfd vfio pv680_mii(O)
[ 1591.760536] CPU: 2 PID: 227 Comm: lcore-worker-2 Tainted: G O 5.11.0-rc3+ #1
[ 1591.770735] Hardware name: , BIOS HixxxxFPGA 1P B600 V121-1
[ 1591.778872] pstate: 40400009 (nZcv daif +PAN -UAO -TCO BTYPE=--)
[ 1591.786134] pc : remap_pfn_range+0x214/0x340
[ 1591.793564] lr : remap_pfn_range+0x1b8/0x340
[ 1591.799117] sp : ffff80001068bbd0
[ 1591.803476] x29: ffff80001068bbd0 x28: 0000042eff6f0000
[ 1591.810404] x27: 0000001100910000 x26: 0000001300910000
[ 1591.817457] x25: 0068000000000fd3 x24: ffffa92f1338e358
[ 1591.825144] x23: 0000001140000000 x22: 0000000000000041
[ 1591.832506] x21: 0000001300910000 x20: ffffa92f141a4000
[ 1591.839520] x19: 0000001100a00000 x18: 0000000000000000
[ 1591.846108] x17: 0000000000000000 x16: ffffa92f11844540
[ 1591.853570] x15: 0000000000000000 x14: 0000000000000000
[ 1591.860768] x13: fffffc0000000000 x12: 0000000000000880
[ 1591.868053] x11: ffff0821bf3d01d0 x10: ffff5ef2abd89000
[ 1591.875932] x9 : ffffa92f12ab0064 x8 : ffffa92f136471c0
[ 1591.883208] x7 : 0000001140910000 x6 : 0000000200000000
[ 1591.890177] x5 : 0000000000000001 x4 : 0000000000000001
[ 1591.896656] x3 : 0000000000000000 x2 : 0168044000000fd3
[ 1591.903215] x1 : ffff082126261880 x0 : fffffc2084989868
[ 1591.910234] Call trace:
[ 1591.914837]  remap_pfn_range+0x214/0x340
[ 1591.921765]  vfio_pci_mmap_fault+0xac/0x130 [vfio_pci]
[ 1591.931200]  __do_fault+0x44/0x12c
[ 1591.937031]  handle_mm_fault+0xcc8/0x1230
[ 1591.942475]  do_page_fault+0x16c/0x484
[ 1591.948635]  do_translation_fault+0xbc/0xd8
[ 1591.954171]  do_mem_abort+0x4c/0xc0
[ 1591.960316]  el0_da+0x40/0x80
[ 1591.965585]  el0_sync_handler+0x168/0x1b0
[ 1591.971608]  el0_sync+0x174/0x180
[ 1591.978312] Code: eb1b027f 540000c0 f9400022 b4fffe02 (d4210000)

Switch to using vmf_insert_pfn_prot() so that we can retain the
decrypted memory protection from io_remap_pfn_range(), but allow
concurrent page table updates.  Tracking of vmas is also updated to
prevent duplicate entries.

Fixes: 11c4cd07ba11 ("vfio-pci: Fault mmaps to enable vma tracking")
Reported-by: Zeng Tao <prime.zeng@hisilicon.com>
Suggested-by: Zeng Tao <prime.zeng@hisilicon.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
---

Zeng Tao, I hope you don't mind me sending a new version to keep this
moving.  Testing and review appreciated, thanks!

 drivers/vfio/pci/vfio_pci.c | 30 ++++++++++++++++++------------
 1 file changed, 18 insertions(+), 12 deletions(-)