Message ID | 161539852724.8302.17137130175894127401.stgit@gimli.home (mailing list archive)
---|---
State | New, archived
Series | vfio/pci: Handle concurrent vma faults
On Wed, Mar 10, 2021 at 10:53:29AM -0700, Alex Williamson wrote:

> diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
> index 65e7e6b44578..ae723808e08b 100644
> --- a/drivers/vfio/pci/vfio_pci.c
> +++ b/drivers/vfio/pci/vfio_pci.c
> @@ -1573,6 +1573,11 @@ static int __vfio_pci_add_vma(struct vfio_pci_device *vdev,
>  {
>  	struct vfio_pci_mmap_vma *mmap_vma;
>
> +	list_for_each_entry(mmap_vma, &vdev->vma_list, vma_next) {
> +		if (mmap_vma->vma == vma)
> +			return 0; /* Swallow the error, the vma is tracked */
> +	}
> +
>  	mmap_vma = kmalloc(sizeof(*mmap_vma), GFP_KERNEL);
>  	if (!mmap_vma)
>  		return -ENOMEM;
> @@ -1612,31 +1617,32 @@ static vm_fault_t vfio_pci_mmap_fault(struct vm_fault *vmf)
>  {
>  	struct vm_area_struct *vma = vmf->vma;
>  	struct vfio_pci_device *vdev = vma->vm_private_data;
> -	vm_fault_t ret = VM_FAULT_NOPAGE;
> +	unsigned long vaddr = vma->vm_start, pfn = vma->vm_pgoff;
> +	vm_fault_t ret = VM_FAULT_SIGBUS;
>
>  	mutex_lock(&vdev->vma_lock);
>  	down_read(&vdev->memory_lock);
>
> -	if (!__vfio_pci_memory_enabled(vdev)) {
> -		ret = VM_FAULT_SIGBUS;
> -		mutex_unlock(&vdev->vma_lock);
> +	if (!__vfio_pci_memory_enabled(vdev))
>  		goto up_out;
> +
> +	for (; vaddr < vma->vm_end; vaddr += PAGE_SIZE, pfn++) {
> +		ret = vmf_insert_pfn_prot(vma, vaddr, pfn,
> +					  pgprot_decrypted(vma->vm_page_prot));

I investigated this, I think the above pgprot_decrypted() should be
moved here:

static int vfio_pci_mmap(void *device_data, struct vm_area_struct *vma)
{
	vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
+	vma->vm_page_prot = pgprot_decrypted(vma->vm_page_prot);

And since:

vm_fault_t vmf_insert_pfn(struct vm_area_struct *vma, unsigned long addr,
			  unsigned long pfn)
{
	return vmf_insert_pfn_prot(vma, addr, pfn, vma->vm_page_prot);

The above can just use vmf_insert_pfn()

The only thing that makes me nervous about this arrangement is losing
the track_pfn_remap() which was in remap_pfn_range() - I think it
means we miss out on certain PAT manipulations..
I *suspect* this is not a problem for VFIO because it will rely on the
MTRRs generally on x86 - but I also don't know this mechanism too well.

I think after the address_space changes this should try to stick with
a normal io_remap_pfn_range() done outside the fault handler.

Jason
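Concretely, the change Jason suggests would look roughly like the following. This is an untested editorial sketch, not part of the posted series: both protection modifiers are applied once at mmap() time, which lets the fault loop drop the `_prot` variant.

```c
/*
 * Sketch only (not the posted patch): pgprot_decrypted() joins
 * pgprot_noncached() at mmap time, so every later insertion can use
 * the vma's own vm_page_prot.  Surrounding setup code is elided.
 */
static int vfio_pci_mmap(void *device_data, struct vm_area_struct *vma)
{
	/* ... existing validation and vm_pgoff setup ... */
	vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
	vma->vm_page_prot = pgprot_decrypted(vma->vm_page_prot);
	/* ... */
}

/* ... and the fault-time loop simplifies to vmf_insert_pfn(): */
	for (; vaddr < vma->vm_end; vaddr += PAGE_SIZE, pfn++) {
		ret = vmf_insert_pfn(vma, vaddr, pfn);
		if (ret != VM_FAULT_NOPAGE)
			break;
	}
```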
On Wed, 10 Mar 2021 14:14:46 -0400
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Wed, Mar 10, 2021 at 10:53:29AM -0700, Alex Williamson wrote:
>
> > diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
> > index 65e7e6b44578..ae723808e08b 100644
> > --- a/drivers/vfio/pci/vfio_pci.c
> > +++ b/drivers/vfio/pci/vfio_pci.c
> > @@ -1573,6 +1573,11 @@ static int __vfio_pci_add_vma(struct vfio_pci_device *vdev,
> >  {
> >  	struct vfio_pci_mmap_vma *mmap_vma;
> >
> > +	list_for_each_entry(mmap_vma, &vdev->vma_list, vma_next) {
> > +		if (mmap_vma->vma == vma)
> > +			return 0; /* Swallow the error, the vma is tracked */
> > +	}
> > +
> >  	mmap_vma = kmalloc(sizeof(*mmap_vma), GFP_KERNEL);
> >  	if (!mmap_vma)
> >  		return -ENOMEM;
> > @@ -1612,31 +1617,32 @@ static vm_fault_t vfio_pci_mmap_fault(struct vm_fault *vmf)
> >  {
> >  	struct vm_area_struct *vma = vmf->vma;
> >  	struct vfio_pci_device *vdev = vma->vm_private_data;
> > -	vm_fault_t ret = VM_FAULT_NOPAGE;
> > +	unsigned long vaddr = vma->vm_start, pfn = vma->vm_pgoff;
> > +	vm_fault_t ret = VM_FAULT_SIGBUS;
> >
> >  	mutex_lock(&vdev->vma_lock);
> >  	down_read(&vdev->memory_lock);
> >
> > -	if (!__vfio_pci_memory_enabled(vdev)) {
> > -		ret = VM_FAULT_SIGBUS;
> > -		mutex_unlock(&vdev->vma_lock);
> > +	if (!__vfio_pci_memory_enabled(vdev))
> >  		goto up_out;
> > +
> > +	for (; vaddr < vma->vm_end; vaddr += PAGE_SIZE, pfn++) {
> > +		ret = vmf_insert_pfn_prot(vma, vaddr, pfn,
> > +					  pgprot_decrypted(vma->vm_page_prot));
>
> I investigated this, I think the above pgprot_decrypted() should be
> moved here:
>
> static int vfio_pci_mmap(void *device_data, struct vm_area_struct *vma)
> {
> 	vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
> +	vma->vm_page_prot = pgprot_decrypted(vma->vm_page_prot);
>
> And since:
>
> vm_fault_t vmf_insert_pfn(struct vm_area_struct *vma, unsigned long addr,
> 			  unsigned long pfn)
> {
> 	return vmf_insert_pfn_prot(vma, addr, pfn, vma->vm_page_prot);
>
> The above can just use vmf_insert_pfn()

Cool, easy enough.
Thanks for looking.

> The only thing that makes me nervous about this arrangement is losing
> the track_pfn_remap() which was in remap_pfn_range() - I think it
> means we miss out on certain PAT manipulations.. I *suspect* this is
> not a problem for VFIO because it will rely on the MTRRs generally on
> x86 - but I also don't know this mechanism too well.

Yeah, for VM use cases the MTRRs generally override.

> I think after the address_space changes this should try to stick with
> a normal io_remap_pfn_range() done outside the fault handler.

I assume you're suggesting calling io_remap_pfn_range() when device
memory is enabled; do you mean using vma_interval_tree_foreach() like
unmap_mapping_range() does to avoid re-adding a vma list?  Thanks,

Alex
On Wed, Mar 10, 2021 at 11:34:06AM -0700, Alex Williamson wrote:

> > I think after the address_space changes this should try to stick with
> > a normal io_remap_pfn_range() done outside the fault handler.
>
> I assume you're suggesting calling io_remap_pfn_range() when device
> memory is enabled,

Yes, I think I saw Peter thinking along these lines too

Then fault just always causes SIGBUS if it gets called

Jason
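In other words, once the BAR mapping is populated at device-memory enable time, the fault handler degenerates to a pure error path. A minimal sketch of that end state (an editorial illustration, not code from the posted series):

```c
/*
 * Sketch: if io_remap_pfn_range() runs when device memory is enabled,
 * reaching fault() can only mean the BAR is currently disabled (the
 * ptes were zapped), so the handler unconditionally raises SIGBUS.
 */
static vm_fault_t vfio_pci_mmap_fault(struct vm_fault *vmf)
{
	return VM_FAULT_SIGBUS;
}
```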
On Wed, Mar 10, 2021 at 02:40:11PM -0400, Jason Gunthorpe wrote:
> On Wed, Mar 10, 2021 at 11:34:06AM -0700, Alex Williamson wrote:
>
> > > I think after the address_space changes this should try to stick with
> > > a normal io_remap_pfn_range() done outside the fault handler.
> >
> > I assume you're suggesting calling io_remap_pfn_range() when device
> > memory is enabled,
>
> Yes, I think I saw Peter thinking along these lines too
>
> Then fault just always causes SIGBUS if it gets called

Indeed that looks better than looping in the fault().

But I don't know whether it'll be easy to move io_remap_pfn_range() to
device memory enablement.  If it's a two-step thing, we can fix the
BUG_ON and vma duplication issue first, then the full rework can be done
in the bigger series as whatever approach is chosen in the end.

Thanks,
On Wed, Mar 10, 2021 at 03:06:07PM -0500, Peter Xu wrote:
> On Wed, Mar 10, 2021 at 02:40:11PM -0400, Jason Gunthorpe wrote:
> > On Wed, Mar 10, 2021 at 11:34:06AM -0700, Alex Williamson wrote:
> >
> > > > I think after the address_space changes this should try to stick with
> > > > a normal io_remap_pfn_range() done outside the fault handler.
> > >
> > > I assume you're suggesting calling io_remap_pfn_range() when device
> > > memory is enabled,
> >
> > Yes, I think I saw Peter thinking along these lines too
> >
> > Then fault just always causes SIGBUS if it gets called

I feel much more comfortable having the io_remap_pfn_range in place.

> Indeed that looks better than looping in the fault().
>
> But I don't know whether it'll be easy to move io_remap_pfn_range() to
> device memory enablement.  If it's a two-step thing, we can fix the
> BUG_ON and vma duplication issue first, then the full rework can be
> done in the bigger series as whatever approach is chosen in the end.

What kind of problems do you envision?  It seems pretty simple to do,
at least when combined with the unmap_mapping_range patch.
On Thu, Mar 11, 2021 at 11:35:24AM +0000, Christoph Hellwig wrote:
> On Wed, Mar 10, 2021 at 03:06:07PM -0500, Peter Xu wrote:
> > On Wed, Mar 10, 2021 at 02:40:11PM -0400, Jason Gunthorpe wrote:
> > > On Wed, Mar 10, 2021 at 11:34:06AM -0700, Alex Williamson wrote:
> > >
> > > > > I think after the address_space changes this should try to stick with
> > > > > a normal io_remap_pfn_range() done outside the fault handler.
> > > >
> > > > I assume you're suggesting calling io_remap_pfn_range() when device
> > > > memory is enabled,
> > >
> > > Yes, I think I saw Peter thinking along these lines too
> > >
> > > Then fault just always causes SIGBUS if it gets called
>
> I feel much more comfortable having the io_remap_pfn_range in place.

It's just that Jason convinced me with the fact that io_remap_pfn_range()
will modify vma flags, and I tend to agree that's not a good thing to do
during a fault() handler (in remap_pfn_range):

	vma->vm_flags |= VM_IO | VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP;

Although this case is special and it seems to do no harm, since all four
of these flags are already set by vfio_pci_mmap() anyway, the flags don't
actually change, at least with the current code base.  It's still cleaner
not to use io_remap_pfn_range() in the vfio fault() handler, since future
changes to io_remap_pfn_range() are not guaranteed to stay in sync with
vfio's mmap().

> > Indeed that looks better than looping in the fault().
> >
> > But I don't know whether it'll be easy to move io_remap_pfn_range() to
> > device memory enablement.  If it's a two-step thing, we can fix the
> > BUG_ON and vma duplication issue first, then the full rework can be
> > done in the bigger series as whatever approach is chosen in the end.
>
> What kind of problems do you envision?  It seems pretty simple to do,
> at least when combined with the unmap_mapping_range patch.
Moving the prefault into device memory enablement will even remove the
first-fault delay on the initial MMIO access that triggers this fault().
Also in that case I think we can call io_remap_pfn_range() directly and
safely, rather than looping over vmf_insert_pfn_prot().

Thanks,
On Wed, 10 Mar 2021 14:40:11 -0400
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Wed, Mar 10, 2021 at 11:34:06AM -0700, Alex Williamson wrote:
>
> > > I think after the address_space changes this should try to stick with
> > > a normal io_remap_pfn_range() done outside the fault handler.
> >
> > I assume you're suggesting calling io_remap_pfn_range() when device
> > memory is enabled,
>
> Yes, I think I saw Peter thinking along these lines too
>
> Then fault just always causes SIGBUS if it gets called

Trying to use the address_space approach because otherwise we'd just be
adding back vma list tracking, it looks like we can't call
io_remap_pfn_range() while holding the address_space i_mmap_rwsem via
i_mmap_lock_write(), like done in unmap_mapping_range().  lockdep
identifies a circular lock order issue against fs_reclaim.  Minimally we
also need vma_interval_tree_iter_{first,next} exported in order to use
vma_interval_tree_foreach().  Suggestions?  Thanks,

Alex
On Fri, Mar 12, 2021 at 12:16:11PM -0700, Alex Williamson wrote:
> On Wed, 10 Mar 2021 14:40:11 -0400
> Jason Gunthorpe <jgg@nvidia.com> wrote:
>
> > On Wed, Mar 10, 2021 at 11:34:06AM -0700, Alex Williamson wrote:
> >
> > > > I think after the address_space changes this should try to stick with
> > > > a normal io_remap_pfn_range() done outside the fault handler.
> > >
> > > I assume you're suggesting calling io_remap_pfn_range() when device
> > > memory is enabled,
> >
> > Yes, I think I saw Peter thinking along these lines too
> >
> > Then fault just always causes SIGBUS if it gets called
>
> Trying to use the address_space approach because otherwise we'd just be
> adding back vma list tracking, it looks like we can't call
> io_remap_pfn_range() while holding the address_space i_mmap_rwsem via
> i_mmap_lock_write(), like done in unmap_mapping_range().  lockdep
> identifies a circular lock order issue against fs_reclaim.  Minimally we
> also need vma_interval_tree_iter_{first,next} exported in order to use
> vma_interval_tree_foreach().  Suggestions?  Thanks,

You are asking how to put the BAR back into every VMA when it is
enabled again after it has been zap'd?

What did the lockdep splat look like?  Is it a memory allocation?

Does current_gfp_context()/memalloc_nofs_save()/etc solve it?

The easiest answer is to continue to use fault and the
vmf_insert_page()..

But it feels like it would be OK to export enough i_mmap machinery to
enable this.  Cleaner than building your own tracking, which would
still have the same ugly mmap_sem inversion issue which was preventing
this last time.

Jason
On Fri, 12 Mar 2021 15:41:47 -0400
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Fri, Mar 12, 2021 at 12:16:11PM -0700, Alex Williamson wrote:
> > On Wed, 10 Mar 2021 14:40:11 -0400
> > Jason Gunthorpe <jgg@nvidia.com> wrote:
> >
> > > On Wed, Mar 10, 2021 at 11:34:06AM -0700, Alex Williamson wrote:
> > >
> > > > > I think after the address_space changes this should try to stick with
> > > > > a normal io_remap_pfn_range() done outside the fault handler.
> > > >
> > > > I assume you're suggesting calling io_remap_pfn_range() when device
> > > > memory is enabled,
> > >
> > > Yes, I think I saw Peter thinking along these lines too
> > >
> > > Then fault just always causes SIGBUS if it gets called
> >
> > Trying to use the address_space approach because otherwise we'd just be
> > adding back vma list tracking, it looks like we can't call
> > io_remap_pfn_range() while holding the address_space i_mmap_rwsem via
> > i_mmap_lock_write(), like done in unmap_mapping_range().  lockdep
> > identifies a circular lock order issue against fs_reclaim.  Minimally we
> > also need vma_interval_tree_iter_{first,next} exported in order to use
> > vma_interval_tree_foreach().  Suggestions?  Thanks,
>
> You are asking how to put the BAR back into every VMA when it is
> enabled again after it has been zap'd?

Exactly.

> What did the lockdep splat look like?  Is it a memory allocation?

======================================================
WARNING: possible circular locking dependency detected
5.12.0-rc1+ #18 Not tainted
------------------------------------------------------
CPU 0/KVM/1406 is trying to acquire lock:
ffffffffa5a58d60 (fs_reclaim){+.+.}-{0:0}, at: fs_reclaim_acquire+0x83/0xd0

but task is already holding lock:
ffff94c0f3e8fb08 (&mapping->i_mmap_rwsem){++++}-{3:3}, at: vfio_device_io_remap_mapping_range+0x31/0x120 [vfio]

which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:

-> #1 (&mapping->i_mmap_rwsem){++++}-{3:3}:
       down_write+0x3d/0x70
       dma_resv_lockdep+0x1b0/0x298
       do_one_initcall+0x5b/0x2d0
       kernel_init_freeable+0x251/0x298
       kernel_init+0xa/0x111
       ret_from_fork+0x22/0x30

-> #0 (fs_reclaim){+.+.}-{0:0}:
       __lock_acquire+0x111f/0x1e10
       lock_acquire+0xb5/0x380
       fs_reclaim_acquire+0xa3/0xd0
       kmem_cache_alloc_trace+0x30/0x2c0
       memtype_reserve+0xc3/0x280
       reserve_pfn_range+0x86/0x160
       track_pfn_remap+0xa6/0xe0
       remap_pfn_range+0xa8/0x610
       vfio_device_io_remap_mapping_range+0x93/0x120 [vfio]
       vfio_pci_test_and_up_write_memory_lock+0x34/0x40 [vfio_pci]
       vfio_basic_config_write+0x12d/0x230 [vfio_pci]
       vfio_pci_config_rw+0x1b7/0x3a0 [vfio_pci]
       vfs_write+0xea/0x390
       __x64_sys_pwrite64+0x72/0xb0
       do_syscall_64+0x33/0x40
       entry_SYSCALL_64_after_hwframe+0x44/0xae

other info that might help us debug this:

 Possible unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(&mapping->i_mmap_rwsem);
                               lock(fs_reclaim);
                               lock(&mapping->i_mmap_rwsem);
  lock(fs_reclaim);

 *** DEADLOCK ***

2 locks held by CPU 0/KVM/1406:
 #0: ffff94c0f9c71ef0 (&vdev->memory_lock){++++}-{3:3}, at: vfio_basic_config_write+0x19a/0x230 [vfio_pci]
 #1: ffff94c0f3e8fb08 (&mapping->i_mmap_rwsem){++++}-{3:3}, at: vfio_device_io_remap_mapping_range+0x31/0x120 [vfio]

stack backtrace:
CPU: 3 PID: 1406 Comm: CPU 0/KVM Not tainted 5.12.0-rc1+ #18
Hardware name: System manufacturer System Product Name/P8H67-M PRO, BIOS 3904 04/27/2013
Call Trace:
 dump_stack+0x7f/0xa1
 check_noncircular+0xcf/0xf0
 __lock_acquire+0x111f/0x1e10
 lock_acquire+0xb5/0x380
 ? fs_reclaim_acquire+0x83/0xd0
 ? pat_enabled+0x10/0x10
 ? memtype_reserve+0xc3/0x280
 fs_reclaim_acquire+0xa3/0xd0
 ? fs_reclaim_acquire+0x83/0xd0
 kmem_cache_alloc_trace+0x30/0x2c0
 memtype_reserve+0xc3/0x280
 reserve_pfn_range+0x86/0x160
 track_pfn_remap+0xa6/0xe0
 remap_pfn_range+0xa8/0x610
 ? lock_acquire+0xb5/0x380
 ? vfio_device_io_remap_mapping_range+0x31/0x120 [vfio]
 ? lock_is_held_type+0xa5/0x120
 vfio_device_io_remap_mapping_range+0x93/0x120 [vfio]
 vfio_pci_test_and_up_write_memory_lock+0x34/0x40 [vfio_pci]
 vfio_basic_config_write+0x12d/0x230 [vfio_pci]
 vfio_pci_config_rw+0x1b7/0x3a0 [vfio_pci]
 vfs_write+0xea/0x390
 __x64_sys_pwrite64+0x72/0xb0
 do_syscall_64+0x33/0x40
 entry_SYSCALL_64_after_hwframe+0x44/0xae
RIP: 0033:0x7f80176152ff
Code: 08 89 3c 24 48 89 4c 24 18 e8 3d f3 ff ff 4c 8b 54 24 18 48 8b 54 24 10 41 89 c0 48 8b 74 24 08 8b 3c 24 b8 12 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 31 44 89 c7 48 89 04 24 e8 6d f3 ff ff 48 8b
RSP: 002b:00007f7efa5f72f0 EFLAGS: 00000293 ORIG_RAX: 0000000000000012
RAX: ffffffffffffffda RBX: 0000000000000004 RCX: 00007f80176152ff
RDX: 0000000000000002 RSI: 00007f7efa5f736c RDI: 000000000000002d
RBP: 000055b66913d530 R08: 0000000000000000 R09: 000000000000ffff
R10: 0000070000000004 R11: 0000000000000293 R12: 0000000000000004
R13: 0000000000000102 R14: 0000000000000002 R15: 000055b66913d530

> Does current_gfp_context()/memalloc_nofs_save()/etc solve it?

Will investigate...

> The easiest answer is to continue to use fault and the
> vmf_insert_page()..
>
> But it feels like it would be OK to export enough i_mmap machinery to
> enable this.  Cleaner than building your own tracking, which would
> still have the same ugly mmap_sem inversion issue which was preventing
> this last time.

Yup, I'd rather fault than add that back, but I'm not sure we have a
mapping function compatible with this framework.  Thanks,

Alex
On Fri, 12 Mar 2021 13:09:38 -0700
Alex Williamson <alex.williamson@redhat.com> wrote:

> On Fri, 12 Mar 2021 15:41:47 -0400
> Jason Gunthorpe <jgg@nvidia.com> wrote:
>
> ======================================================
> WARNING: possible circular locking dependency detected
> 5.12.0-rc1+ #18 Not tainted
> ------------------------------------------------------
> CPU 0/KVM/1406 is trying to acquire lock:
> ffffffffa5a58d60 (fs_reclaim){+.+.}-{0:0}, at: fs_reclaim_acquire+0x83/0xd0
>
> but task is already holding lock:
> ffff94c0f3e8fb08 (&mapping->i_mmap_rwsem){++++}-{3:3}, at: vfio_device_io_remap_mapping_range+0x31/0x120 [vfio]
>
> which lock already depends on the new lock.
>
>
> the existing dependency chain (in reverse order) is:
>
> -> #1 (&mapping->i_mmap_rwsem){++++}-{3:3}:
>        down_write+0x3d/0x70
>        dma_resv_lockdep+0x1b0/0x298
>        do_one_initcall+0x5b/0x2d0
>        kernel_init_freeable+0x251/0x298
>        kernel_init+0xa/0x111
>        ret_from_fork+0x22/0x30
>
> -> #0 (fs_reclaim){+.+.}-{0:0}:
>        __lock_acquire+0x111f/0x1e10
>        lock_acquire+0xb5/0x380
>        fs_reclaim_acquire+0xa3/0xd0
>        kmem_cache_alloc_trace+0x30/0x2c0
>        memtype_reserve+0xc3/0x280
>        reserve_pfn_range+0x86/0x160
>        track_pfn_remap+0xa6/0xe0
>        remap_pfn_range+0xa8/0x610
>        vfio_device_io_remap_mapping_range+0x93/0x120 [vfio]
>        vfio_pci_test_and_up_write_memory_lock+0x34/0x40 [vfio_pci]
>        vfio_basic_config_write+0x12d/0x230 [vfio_pci]
>        vfio_pci_config_rw+0x1b7/0x3a0 [vfio_pci]
>        vfs_write+0xea/0x390
>        __x64_sys_pwrite64+0x72/0xb0
>        do_syscall_64+0x33/0x40
>        entry_SYSCALL_64_after_hwframe+0x44/0xae
> ..
>
> > Does current_gfp_context()/memalloc_nofs_save()/etc solve it?

Yeah, we can indeed use memalloc_nofs_save/restore().  It seems we're
trying to allocate something for pfnmap tracking, and that allocation
enables lots of lockdep-specific tests.  Is it valid to wrap
io_remap_pfn_range() in a region that clears this flag, or am I just
masking a bug?  Thanks,

Alex
On Fri, Mar 12, 2021 at 01:58:44PM -0700, Alex Williamson wrote:

> Yeah, we can indeed use memalloc_nofs_save/restore().  It seems we're
> trying to allocate something for pfnmap tracking and that enables lots
> of lockdep specific tests.  Is it valid to wrap io_remap_pfn_range()
> around clearing this flag or am I just masking a bug?  Thanks,

Yes, I think it is fine.  Those functions are meant to be used in a
no-fs kind of region exactly like this.  no-fs is telling the allocator
not to do reclaim, which is forbidden under the locks here (as reclaim
will also attempt to get these locks).

I would defer to Michal Hocko though, maybe cc him on the final patch
series version.

Jason
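Putting the pieces of this sub-thread together, the remap-at-enable helper would look roughly like the following. This is an untested editorial sketch: the function name is taken from the lockdep trace above, but its signature and the per-vma pfn math are assumptions, not code from a posted patch.

```c
/*
 * Sketch (untested): replay the BAR mapping into every user vma when
 * device memory is re-enabled.  memalloc_nofs_save() marks a no-fs
 * region so the allocation inside track_pfn_remap()/memtype_reserve()
 * cannot recurse into reclaim while i_mmap_rwsem is held, which is the
 * inversion lockdep reported.  Deriving the correct pfn for each vma
 * is elided here.
 */
static void vfio_device_io_remap_mapping_range(struct address_space *mapping,
					       loff_t start, loff_t len)
{
	struct vm_area_struct *vma;
	unsigned int nofs_flags;

	i_mmap_lock_write(mapping);
	nofs_flags = memalloc_nofs_save();
	vma_interval_tree_foreach(vma, &mapping->i_mmap,
				  start >> PAGE_SHIFT,
				  (start + len - 1) >> PAGE_SHIFT) {
		/* per-vma BAR pfn computation elided in this sketch */
		io_remap_pfn_range(vma, vma->vm_start, vma->vm_pgoff,
				   vma->vm_end - vma->vm_start,
				   vma->vm_page_prot);
	}
	memalloc_nofs_restore(nofs_flags);
	i_mmap_unlock_write(mapping);
}
```

Note the iteration requires vma_interval_tree_iter_{first,next} to be exported for modules, as Alex points out above.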
diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index 65e7e6b44578..ae723808e08b 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -1573,6 +1573,11 @@ static int __vfio_pci_add_vma(struct vfio_pci_device *vdev,
 {
 	struct vfio_pci_mmap_vma *mmap_vma;
 
+	list_for_each_entry(mmap_vma, &vdev->vma_list, vma_next) {
+		if (mmap_vma->vma == vma)
+			return 0; /* Swallow the error, the vma is tracked */
+	}
+
 	mmap_vma = kmalloc(sizeof(*mmap_vma), GFP_KERNEL);
 	if (!mmap_vma)
 		return -ENOMEM;
@@ -1612,31 +1617,32 @@ static vm_fault_t vfio_pci_mmap_fault(struct vm_fault *vmf)
 {
 	struct vm_area_struct *vma = vmf->vma;
 	struct vfio_pci_device *vdev = vma->vm_private_data;
-	vm_fault_t ret = VM_FAULT_NOPAGE;
+	unsigned long vaddr = vma->vm_start, pfn = vma->vm_pgoff;
+	vm_fault_t ret = VM_FAULT_SIGBUS;
 
 	mutex_lock(&vdev->vma_lock);
 	down_read(&vdev->memory_lock);
 
-	if (!__vfio_pci_memory_enabled(vdev)) {
-		ret = VM_FAULT_SIGBUS;
-		mutex_unlock(&vdev->vma_lock);
+	if (!__vfio_pci_memory_enabled(vdev))
 		goto up_out;
+
+	for (; vaddr < vma->vm_end; vaddr += PAGE_SIZE, pfn++) {
+		ret = vmf_insert_pfn_prot(vma, vaddr, pfn,
+					  pgprot_decrypted(vma->vm_page_prot));
+		if (ret != VM_FAULT_NOPAGE) {
+			zap_vma_ptes(vma, vma->vm_start, vaddr - vma->vm_start);
+			goto up_out;
+		}
 	}
 
 	if (__vfio_pci_add_vma(vdev, vma)) {
 		ret = VM_FAULT_OOM;
-		mutex_unlock(&vdev->vma_lock);
-		goto up_out;
+		zap_vma_ptes(vma, vma->vm_start, vma->vm_end - vma->vm_start);
 	}
 
-	mutex_unlock(&vdev->vma_lock);
-
-	if (io_remap_pfn_range(vma, vma->vm_start, vma->vm_pgoff,
-			       vma->vm_end - vma->vm_start, vma->vm_page_prot))
-		ret = VM_FAULT_SIGBUS;
-
 up_out:
 	up_read(&vdev->memory_lock);
+	mutex_unlock(&vdev->vma_lock);
 	return ret;
 }
vfio_pci_mmap_fault() incorrectly makes use of io_remap_pfn_range()
from within a vm_ops fault handler.  This function will trigger a
BUG_ON if it encounters a populated pte within the remapped range,
where any fault is meant to populate the entire vma.  Concurrent
inflight faults to the same vma will therefore hit this issue,
triggering traces such as:

[ 1591.733256] kernel BUG at mm/memory.c:2177!
[ 1591.739515] Internal error: Oops - BUG: 0 [#1] PREEMPT SMP
[ 1591.747381] Modules linked in: vfio_iommu_type1 vfio_pci vfio_virqfd vfio pv680_mii(O)
[ 1591.760536] CPU: 2 PID: 227 Comm: lcore-worker-2 Tainted: G O 5.11.0-rc3+ #1
[ 1591.770735] Hardware name: , BIOS HixxxxFPGA 1P B600 V121-1
[ 1591.778872] pstate: 40400009 (nZcv daif +PAN -UAO -TCO BTYPE=--)
[ 1591.786134] pc : remap_pfn_range+0x214/0x340
[ 1591.793564] lr : remap_pfn_range+0x1b8/0x340
[ 1591.799117] sp : ffff80001068bbd0
[ 1591.803476] x29: ffff80001068bbd0 x28: 0000042eff6f0000
[ 1591.810404] x27: 0000001100910000 x26: 0000001300910000
[ 1591.817457] x25: 0068000000000fd3 x24: ffffa92f1338e358
[ 1591.825144] x23: 0000001140000000 x22: 0000000000000041
[ 1591.832506] x21: 0000001300910000 x20: ffffa92f141a4000
[ 1591.839520] x19: 0000001100a00000 x18: 0000000000000000
[ 1591.846108] x17: 0000000000000000 x16: ffffa92f11844540
[ 1591.853570] x15: 0000000000000000 x14: 0000000000000000
[ 1591.860768] x13: fffffc0000000000 x12: 0000000000000880
[ 1591.868053] x11: ffff0821bf3d01d0 x10: ffff5ef2abd89000
[ 1591.875932] x9 : ffffa92f12ab0064 x8 : ffffa92f136471c0
[ 1591.883208] x7 : 0000001140910000 x6 : 0000000200000000
[ 1591.890177] x5 : 0000000000000001 x4 : 0000000000000001
[ 1591.896656] x3 : 0000000000000000 x2 : 0168044000000fd3
[ 1591.903215] x1 : ffff082126261880 x0 : fffffc2084989868
[ 1591.910234] Call trace:
[ 1591.914837]  remap_pfn_range+0x214/0x340
[ 1591.921765]  vfio_pci_mmap_fault+0xac/0x130 [vfio_pci]
[ 1591.931200]  __do_fault+0x44/0x12c
[ 1591.937031]  handle_mm_fault+0xcc8/0x1230
[ 1591.942475]  do_page_fault+0x16c/0x484
[ 1591.948635]  do_translation_fault+0xbc/0xd8
[ 1591.954171]  do_mem_abort+0x4c/0xc0
[ 1591.960316]  el0_da+0x40/0x80
[ 1591.965585]  el0_sync_handler+0x168/0x1b0
[ 1591.971608]  el0_sync+0x174/0x180
[ 1591.978312] Code: eb1b027f 540000c0 f9400022 b4fffe02 (d4210000)

Switch to using vmf_insert_pfn_prot() so that we can retain the
decrypted memory protection from io_remap_pfn_range(), but allow
concurrent page table updates.  Tracking of vmas is also updated to
prevent duplicate entries.

Fixes: 11c4cd07ba11 ("vfio-pci: Fault mmaps to enable vma tracking")
Reported-by: Zeng Tao <prime.zeng@hisilicon.com>
Suggested-by: Zeng Tao <prime.zeng@hisilicon.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
---

Zeng Tao, I hope you don't mind me sending a new version to keep this
moving.  Testing and review appreciated, thanks!

 drivers/vfio/pci/vfio_pci.c | 30 ++++++++++++++++++------------
 1 file changed, 18 insertions(+), 12 deletions(-)