Message ID: 20230508125842.28193-1-yan.y.zhao@intel.com (mailing list archive)
State: New, archived
Series: vfio/pci: take mmap write lock for io_remap_pfn_range
On Mon, May 08, 2023 at 08:58:42PM +0800, Yan Zhao wrote:
> In VFIO type1, vaddr_get_pfns() will try to fault in MMIO PFNs after
> pin_user_pages_remote() returns -EFAULT.
>
> follow_fault_pfn
>  fixup_user_fault
>   handle_mm_fault
>    handle_mm_fault
>     do_fault
>      do_shared_fault
>       do_fault
>        __do_fault
>         vfio_pci_mmap_fault
>          io_remap_pfn_range
>           remap_pfn_range
>            track_pfn_remap
>             vm_flags_set ==> mmap_assert_write_locked(vma->vm_mm)
>            remap_pfn_range_notrack
>             vm_flags_set ==> mmap_assert_write_locked(vma->vm_mm)
>
> As io_remap_pfn_range() will call vm_flags_set() to update vm_flags [1],
> holding the mmap write lock is required.
> So, update vfio_pci_mmap_fault() to drop the mmap read lock and take the
> mmap write lock.
>
> [1] https://lkml.kernel.org/r/20230126193752.297968-3-surenb@google.com
>     commit bc292ab00f6c ("mm: introduce vma->vm_flags wrapper functions")
>     commit 1c71222e5f23
>     ("mm: replace vma->vm_flags direct modifications with modifier calls")
>
> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> ---
>  drivers/vfio/pci/vfio_pci_core.c | 17 +++++++++++++++++
>  1 file changed, 17 insertions(+)
>
> diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
> index a5ab416cf476..5082f89152b3 100644
> --- a/drivers/vfio/pci/vfio_pci_core.c
> +++ b/drivers/vfio/pci/vfio_pci_core.c
> @@ -1687,6 +1687,12 @@ static vm_fault_t vfio_pci_mmap_fault(struct vm_fault *vmf)
>  	struct vfio_pci_mmap_vma *mmap_vma;
>  	vm_fault_t ret = VM_FAULT_NOPAGE;
>
> +	mmap_assert_locked(vma->vm_mm);
> +	mmap_read_unlock(vma->vm_mm);
> +
> +	if (mmap_write_lock_killable(vma->vm_mm))
> +		return VM_FAULT_RETRY;

Certainly not..

I'm not sure how to resolve this properly, set the flags in advance?

The address space conversion?

Jason
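A minimal sketch of the "set the flags in advance" idea raised here,
assuming the flags can be fixed at mmap() time, where the syscall path
already holds the mmap write lock, so the fault handler never needs
vm_flags_set(). The function and vm_ops names are hypothetical, not the
actual vfio entry points; as the next reply notes, arch-specific state
like x86 VM_PAT, which only the io_remap_pfn_range()/track_pfn_remap()
path sets up, is what keeps this from being sufficient on its own.

#include <linux/fs.h>
#include <linux/mm.h>

static const struct vm_operations_struct sketch_vm_ops; /* .fault fills PTEs lazily */

static int sketch_mmap(struct file *filp, struct vm_area_struct *vma)
{
	/* the mmap() syscall path holds mmap_write_lock here, so
	 * vm_flags_set() is permitted; nothing in the fault path
	 * needs to touch vm_flags afterwards
	 */
	vm_flags_set(vma, VM_IO | VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP);
	vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
	vma->vm_ops = &sketch_vm_ops;
	return 0;
}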
On Mon, 8 May 2023 13:48:30 -0300
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Mon, May 08, 2023 at 08:58:42PM +0800, Yan Zhao wrote:
> > In VFIO type1, vaddr_get_pfns() will try to fault in MMIO PFNs after
> > pin_user_pages_remote() returns -EFAULT.
> >
> > follow_fault_pfn
> >  fixup_user_fault
> >   handle_mm_fault
> >    handle_mm_fault
> >     do_fault
> >      do_shared_fault
> >       do_fault
> >        __do_fault
> >         vfio_pci_mmap_fault
> >          io_remap_pfn_range
> >           remap_pfn_range
> >            track_pfn_remap
> >             vm_flags_set ==> mmap_assert_write_locked(vma->vm_mm)
> >            remap_pfn_range_notrack
> >             vm_flags_set ==> mmap_assert_write_locked(vma->vm_mm)
> >
> > As io_remap_pfn_range() will call vm_flags_set() to update vm_flags [1],
> > holding the mmap write lock is required.
> > So, update vfio_pci_mmap_fault() to drop the mmap read lock and take the
> > mmap write lock.
> >
> > [1] https://lkml.kernel.org/r/20230126193752.297968-3-surenb@google.com
> >     commit bc292ab00f6c ("mm: introduce vma->vm_flags wrapper functions")
> >     commit 1c71222e5f23
> >     ("mm: replace vma->vm_flags direct modifications with modifier calls")
> >
> > Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> > ---
> >  drivers/vfio/pci/vfio_pci_core.c | 17 +++++++++++++++++
> >  1 file changed, 17 insertions(+)
> >
> > diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
> > index a5ab416cf476..5082f89152b3 100644
> > --- a/drivers/vfio/pci/vfio_pci_core.c
> > +++ b/drivers/vfio/pci/vfio_pci_core.c
> > @@ -1687,6 +1687,12 @@ static vm_fault_t vfio_pci_mmap_fault(struct vm_fault *vmf)
> >  	struct vfio_pci_mmap_vma *mmap_vma;
> >  	vm_fault_t ret = VM_FAULT_NOPAGE;
> >
> > +	mmap_assert_locked(vma->vm_mm);
> > +	mmap_read_unlock(vma->vm_mm);
> > +
> > +	if (mmap_write_lock_killable(vma->vm_mm))
> > +		return VM_FAULT_RETRY;
>
> Certainly not..
>
> I'm not sure how to resolve this properly, set the flags in advance?
>
> The address space conversion?

We already try to set the flags in advance, but there are some
architectural flags like VM_PAT that make that tricky. Cedric has been
looking at inserting individual pages with vmf_insert_pfn(), but that
incurs a lot more faults and therefore latency vs remapping the entire
vma on fault. I'm not convinced that we shouldn't just attempt to
remove the fault handler entirely, but I haven't tried it yet to know
what gotchas are down that path. Thanks,

Alex
On Mon, May 08, 2023 at 02:57:15PM -0600, Alex Williamson wrote:

> We already try to set the flags in advance, but there are some
> architectural flags like VM_PAT that make that tricky. Cedric has been
> looking at inserting individual pages with vmf_insert_pfn(), but that
> incurs a lot more faults and therefore latency vs remapping the entire
> vma on fault. I'm not convinced that we shouldn't just attempt to
> remove the fault handler entirely, but I haven't tried it yet to know
> what gotchas are down that path. Thanks,

I thought we did it like this because there were races otherwise with
PTE insertion and zapping? I don't remember well anymore.

I vaguely remember the address_space conversion might help remove the
fault handler?

Jason
On Wed, May 10, 2023 at 05:41:06PM -0300, Jason Gunthorpe wrote:
> On Mon, May 08, 2023 at 02:57:15PM -0600, Alex Williamson wrote:
>
> > We already try to set the flags in advance, but there are some
> > architectural flags like VM_PAT that make that tricky. Cedric has been
> > looking at inserting individual pages with vmf_insert_pfn(), but that
> > incurs a lot more faults and therefore latency vs remapping the entire
> > vma on fault. I'm not convinced that we shouldn't just attempt to
> > remove the fault handler entirely, but I haven't tried it yet to know
> > what gotchas are down that path. Thanks,
>
> I thought we did it like this because there were races otherwise with
> PTE insertion and zapping? I don't remember well anymore.
>
> I vaguely remember the address_space conversion might help remove the
> fault handler?
>
What about calling vmf_insert_pfn() in bulk as below?
And what is the address_space conversion?


diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index a5ab416cf476..1476e537f593 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -1686,6 +1686,7 @@ static vm_fault_t vfio_pci_mmap_fault(struct vm_fault *vmf)
 	struct vfio_pci_core_device *vdev = vma->vm_private_data;
 	struct vfio_pci_mmap_vma *mmap_vma;
 	vm_fault_t ret = VM_FAULT_NOPAGE;
+	unsigned long base_pfn, offset, i;
 
 	mutex_lock(&vdev->vma_lock);
 	down_read(&vdev->memory_lock);
@@ -1710,12 +1711,15 @@ static vm_fault_t vfio_pci_mmap_fault(struct vm_fault *vmf)
 		goto up_out;
 	}
 
-	if (io_remap_pfn_range(vma, vma->vm_start, vma->vm_pgoff,
-			       vma->vm_end - vma->vm_start,
-			       vma->vm_page_prot)) {
-		ret = VM_FAULT_SIGBUS;
-		zap_vma_ptes(vma, vma->vm_start, vma->vm_end - vma->vm_start);
-		goto up_out;
+	base_pfn = (vmf->address - vma->vm_start) >> PAGE_SHIFT;
+	base_pfn += vma->vm_pgoff;
+	for (i = vma->vm_start; i < vma->vm_end; i += PAGE_SIZE) {
+		offset = (i - vma->vm_start) >> PAGE_SHIFT;
+		ret = vmf_insert_pfn(vma, i, base_pfn + offset);
+		if (ret != VM_FAULT_NOPAGE) {
+			zap_vma_ptes(vma, vma->vm_start, vma->vm_end - vma->vm_start);
+			goto up_out;
+		}
 	}
 
 	if (__vfio_pci_add_vma(vdev, vma)) {
On 5/10/23 22:41, Jason Gunthorpe wrote:
> On Mon, May 08, 2023 at 02:57:15PM -0600, Alex Williamson wrote:
>
>> We already try to set the flags in advance, but there are some
>> architectural flags like VM_PAT that make that tricky. Cedric has been
>> looking at inserting individual pages with vmf_insert_pfn(), but that
>> incurs a lot more faults and therefore latency vs remapping the entire
>> vma on fault. I'm not convinced that we shouldn't just attempt to
>> remove the fault handler entirely, but I haven't tried it yet to know
>> what gotchas are down that path. Thanks,

OTOH I didn't see any noticeable slowdowns in the tests I did with
NICs, IGD and NVIDIA GPU pass-through. I lack devices with large BARs
though.

If anyone has some time for it, here are the commits:

  https://github.com/legoater/linux/commits/vfio

The first does a one-page insert and fixes the lockdep issues we are
seeing with the recent series:

  https://lore.kernel.org/all/20230126193752.297968-1-surenb@google.com/

The second adds some statistics. For the NVIDIA GPU, faults reach 800k.

Thanks,

C.

> I thought we did it like this because there were races otherwise with
> PTE insertion and zapping? I don't remember well anymore.
>
> I vaguely remember the address_space conversion might help remove the
> fault handler?
>
> Jason
On 5/11/23 08:56, Yan Zhao wrote:
> On Wed, May 10, 2023 at 05:41:06PM -0300, Jason Gunthorpe wrote:
>> On Mon, May 08, 2023 at 02:57:15PM -0600, Alex Williamson wrote:
>>
>>> We already try to set the flags in advance, but there are some
>>> architectural flags like VM_PAT that make that tricky. Cedric has been
>>> looking at inserting individual pages with vmf_insert_pfn(), but that
>>> incurs a lot more faults and therefore latency vs remapping the entire
>>> vma on fault. I'm not convinced that we shouldn't just attempt to
>>> remove the fault handler entirely, but I haven't tried it yet to know
>>> what gotchas are down that path. Thanks,
>>
>> I thought we did it like this because there were races otherwise with
>> PTE insertion and zapping? I don't remember well anymore.
>>
>> I vaguely remember the address_space conversion might help remove the
>> fault handler?
>>
> What about calling vmf_insert_pfn() in bulk as below?

This works too; it is slightly slower than the io_remap_pfn_range()
call but doesn't have the lockdep issues.

Thanks,

C.

> And what is the address_space conversion?
>
>
> diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
> index a5ab416cf476..1476e537f593 100644
> --- a/drivers/vfio/pci/vfio_pci_core.c
> +++ b/drivers/vfio/pci/vfio_pci_core.c
> @@ -1686,6 +1686,7 @@ static vm_fault_t vfio_pci_mmap_fault(struct vm_fault *vmf)
>  	struct vfio_pci_core_device *vdev = vma->vm_private_data;
>  	struct vfio_pci_mmap_vma *mmap_vma;
>  	vm_fault_t ret = VM_FAULT_NOPAGE;
> +	unsigned long base_pfn, offset, i;
> 
>  	mutex_lock(&vdev->vma_lock);
>  	down_read(&vdev->memory_lock);
> @@ -1710,12 +1711,15 @@ static vm_fault_t vfio_pci_mmap_fault(struct vm_fault *vmf)
>  		goto up_out;
>  	}
> 
> -	if (io_remap_pfn_range(vma, vma->vm_start, vma->vm_pgoff,
> -			       vma->vm_end - vma->vm_start,
> -			       vma->vm_page_prot)) {
> -		ret = VM_FAULT_SIGBUS;
> -		zap_vma_ptes(vma, vma->vm_start, vma->vm_end - vma->vm_start);
> -		goto up_out;
> +	base_pfn = (vmf->address - vma->vm_start) >> PAGE_SHIFT;
> +	base_pfn += vma->vm_pgoff;
> +	for (i = vma->vm_start; i < vma->vm_end; i += PAGE_SIZE) {
> +		offset = (i - vma->vm_start) >> PAGE_SHIFT;
> +		ret = vmf_insert_pfn(vma, i, base_pfn + offset);
> +		if (ret != VM_FAULT_NOPAGE) {
> +			zap_vma_ptes(vma, vma->vm_start, vma->vm_end - vma->vm_start);
> +			goto up_out;
> +		}
>  	}
> 
>  	if (__vfio_pci_add_vma(vdev, vma)) {
>
On Wed, 10 May 2023 17:41:06 -0300
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Mon, May 08, 2023 at 02:57:15PM -0600, Alex Williamson wrote:
>
> > We already try to set the flags in advance, but there are some
> > architectural flags like VM_PAT that make that tricky. Cedric has been
> > looking at inserting individual pages with vmf_insert_pfn(), but that
> > incurs a lot more faults and therefore latency vs remapping the entire
> > vma on fault. I'm not convinced that we shouldn't just attempt to
> > remove the fault handler entirely, but I haven't tried it yet to know
> > what gotchas are down that path. Thanks,
>
> I thought we did it like this because there were races otherwise with
> PTE insertion and zapping? I don't remember well anymore.

TBH, I don't recall if we tried a synchronous approach previously. The
benefit of the faulting approach was that we could track the minimum
set of vmas which are actually making use of the mapping and throw that
tracking list away when zapping. Without that, we need to add vmas
both on mmap and in vm_ops.open, removing only in vm_ops.close, and
acquire all the proper mm locking for each vma to re-insert the
mappings.

> I vaguely remember the address_space conversion might help remove the
> fault handler?

Yes, this did remove the fault handler entirely; it's (obviously)
dropped off my radar, but perhaps in the interim we could switch to
vmf_insert_pfn() and revive the address space series to eventually
remove the fault handling and vma list altogether.

For reference, I think this was the last posting of the address space
series:

https://lore.kernel.org/all/162818167535.1511194.6614962507750594786.stgit@omen/

Thanks,
Alex
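A rough illustration of the bookkeeping Alex describes above, with
hypothetical types and names (sketch_dev, sketch_vma_entry); the point
is that without lazy faulting, every vma using the mapping must be
recorded in vm_ops.open, which cannot fail, and dropped only in
vm_ops.close.

#include <linux/list.h>
#include <linux/mm.h>
#include <linux/mutex.h>
#include <linux/slab.h>

struct sketch_dev {			/* hypothetical device state */
	struct mutex vma_lock;
	struct list_head vma_list;
};

struct sketch_vma_entry {
	struct list_head next;
	struct vm_area_struct *vma;
};

/* vm_ops.open: forks and splits create new vmas for the mapping; they
 * must be tracked here, and open() has no way to report a failure.
 */
static void sketch_vm_open(struct vm_area_struct *vma)
{
	struct sketch_dev *sdev = vma->vm_private_data;
	struct sketch_vma_entry *entry = kzalloc(sizeof(*entry), GFP_KERNEL);

	if (!entry)
		return;		/* a real driver needs a failure strategy */

	entry->vma = vma;
	mutex_lock(&sdev->vma_lock);
	list_add(&entry->next, &sdev->vma_list);
	mutex_unlock(&sdev->vma_lock);
}

/* vm_ops.close: the only point at which a vma can leave the list */
static void sketch_vm_close(struct vm_area_struct *vma)
{
	struct sketch_dev *sdev = vma->vm_private_data;
	struct sketch_vma_entry *entry, *tmp;

	mutex_lock(&sdev->vma_lock);
	list_for_each_entry_safe(entry, tmp, &sdev->vma_list, next) {
		if (entry->vma == vma) {
			list_del(&entry->next);
			kfree(entry);
			break;
		}
	}
	mutex_unlock(&sdev->vma_lock);
}

Re-inserting mappings then means walking vma_list and taking each vma's
mmap lock individually, which is the "proper mm locking" pain referred
to above.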
On Thu, May 11, 2023 at 10:07:06AM -0600, Alex Williamson wrote:

> > I vaguely remember the address_space conversion might help remove the
> > fault handler?
>
> Yes, this did remove the fault handler entirely; it's (obviously)
> dropped off my radar, but perhaps in the interim we could switch to
> vmf_insert_pfn() and revive the address space series to eventually
> remove the fault handling and vma list altogether.

vmf_insert_pfn() technically isn't supposed to be used for MMIO.. Eg
it doesn't do the PAT stuff on x86 that is causing this problem in the
first place.

So doing the address_space conversion series seems like the best fix.
It has been mislocked for a long time; I suspect there isn't a real
urgent problem beyond the fact that we now actually have lockdep
annotations to catch the mislocking.

Jason
On Thu, May 11, 2023 at 10:07:06AM -0600, Alex Williamson wrote:
> On Wed, 10 May 2023 17:41:06 -0300
> Jason Gunthorpe <jgg@nvidia.com> wrote:
>
> > On Mon, May 08, 2023 at 02:57:15PM -0600, Alex Williamson wrote:
> >
> > > We already try to set the flags in advance, but there are some
> > > architectural flags like VM_PAT that make that tricky. Cedric has been
> > > looking at inserting individual pages with vmf_insert_pfn(), but that
> > > incurs a lot more faults and therefore latency vs remapping the entire
> > > vma on fault. I'm not convinced that we shouldn't just attempt to
> > > remove the fault handler entirely, but I haven't tried it yet to know
> > > what gotchas are down that path. Thanks,
> >
> > I thought we did it like this because there were races otherwise with
> > PTE insertion and zapping? I don't remember well anymore.
>
> TBH, I don't recall if we tried a synchronous approach previously. The
> benefit of the faulting approach was that we could track the minimum
> set of vmas which are actually making use of the mapping and throw that
> tracking list away when zapping. Without that, we need to add vmas
> both on mmap and in vm_ops.open, removing only in vm_ops.close, and
> acquire all the proper mm locking for each vma to re-insert the
> mappings.
>
> > I vaguely remember the address_space conversion might help remove the
> > fault handler?
>
> Yes, this did remove the fault handler entirely; it's (obviously)
> dropped off my radar, but perhaps in the interim we could switch to
> vmf_insert_pfn() and revive the address space series to eventually
> remove the fault handling and vma list altogether.
>
> For reference, I think this was the last posting of the address space
> series:
>
> https://lore.kernel.org/all/162818167535.1511194.6614962507750594786.stgit@omen/

Just took a quick look at this series.
One question: it looks like it still needs to call io_remap_pfn_range()
in places like vfio_basic_config_write() for PCI_COMMAND and on device
reset, so the mmap write lock is still required around
vdev->memory_lock.
On Mon, May 08, 2023 at 08:58:42PM GMT, Yan Zhao wrote:
> In VFIO type1, vaddr_get_pfns() will try to fault in MMIO PFNs after
> pin_user_pages_remote() returns -EFAULT.
>
> follow_fault_pfn
>  fixup_user_fault
>   handle_mm_fault
>    handle_mm_fault
>     do_fault
>      do_shared_fault
>       do_fault
>        __do_fault
>         vfio_pci_mmap_fault
>          io_remap_pfn_range
>           remap_pfn_range
>            track_pfn_remap
>             vm_flags_set ==> mmap_assert_write_locked(vma->vm_mm)
>            remap_pfn_range_notrack
>             vm_flags_set ==> mmap_assert_write_locked(vma->vm_mm)
>
> As io_remap_pfn_range() will call vm_flags_set() to update vm_flags [1],
> holding the mmap write lock is required.
> So, update vfio_pci_mmap_fault() to drop the mmap read lock and take the
> mmap write lock.
>
> [1] https://lkml.kernel.org/r/20230126193752.297968-3-surenb@google.com
>     commit bc292ab00f6c ("mm: introduce vma->vm_flags wrapper functions")
>     commit 1c71222e5f23
>     ("mm: replace vma->vm_flags direct modifications with modifier calls")
>

With linux-next I started noticing traces similar to the above even
without lockdep, since it has ba168b52bf8e ("mm: use rwsem assertion
macros for mmap_lock"). Were there any follow-ups to this? Sorry if my
quick searching missed it.

Thanks,
drew
Hey Drew,

On Wed, 22 May 2024 18:56:29 +0200
Andrew Jones <ajones@ventanamicro.com> wrote:

> On Mon, May 08, 2023 at 08:58:42PM GMT, Yan Zhao wrote:
> > In VFIO type1, vaddr_get_pfns() will try to fault in MMIO PFNs after
> > pin_user_pages_remote() returns -EFAULT.
> >
> > follow_fault_pfn
> >  fixup_user_fault
> >   handle_mm_fault
> >    handle_mm_fault
> >     do_fault
> >      do_shared_fault
> >       do_fault
> >        __do_fault
> >         vfio_pci_mmap_fault
> >          io_remap_pfn_range
> >           remap_pfn_range
> >            track_pfn_remap
> >             vm_flags_set ==> mmap_assert_write_locked(vma->vm_mm)
> >            remap_pfn_range_notrack
> >             vm_flags_set ==> mmap_assert_write_locked(vma->vm_mm)
> >
> > As io_remap_pfn_range() will call vm_flags_set() to update vm_flags [1],
> > holding the mmap write lock is required.
> > So, update vfio_pci_mmap_fault() to drop the mmap read lock and take the
> > mmap write lock.
> >
> > [1] https://lkml.kernel.org/r/20230126193752.297968-3-surenb@google.com
> >     commit bc292ab00f6c ("mm: introduce vma->vm_flags wrapper functions")
> >     commit 1c71222e5f23
> >     ("mm: replace vma->vm_flags direct modifications with modifier calls")
> >
>
> With linux-next I started noticing traces similar to the above even
> without lockdep, since it has ba168b52bf8e ("mm: use rwsem assertion
> macros for mmap_lock"). Were there any follow-ups to this? Sorry if my
> quick searching missed it.

We've been working on it in the background, but we're not there yet.

I have a branch[1] that converts to an address space on the vfio
device, making unmap_mapping_range() available to easily zap mappings
to the BARs without all the ugly vma tracking, but that still leaves us
with the problem that remap_pfn_range() shouldn't be called from the
fault handler, resulting in this lockdep warning. We can switch to
vmf_insert_pfn() in the fault handler, but that's a noticeable
performance penalty.

The angle I've been working on recently is to try taking advantage of
huge_fault support, because we do have vmf_insert_pfn_{pmd,pud}, but
these don't currently support pfnmap pfns. I've been working with
Peter Xu as this aligns with some other work of his. It's working and
resolves the performance issue, especially with a little alignment
tweaking in userspace to take advantage of pud mappings.

I'm not sure if there are any outstanding blockers on Peter's side, but
this seems like a good route from the vfio side. If we're seeing this
now without lockdep, we might need to bite the bullet and take the hit
with vmf_insert_pfn() while the pmd/pud paths learn about pfnmaps.
Thanks,

Alex

[1] https://github.com/awilliam/linux-vfio/commits/vfio-address-space/
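A sketch of the huge_fault direction Alex describes, assuming
vmf_insert_pfn_pmd() gains support for pfnmap pfns (the ongoing work
referenced above) and relying on vfio encoding the BAR base pfn in
vm_pgoff; sketch_huge_fault() is an illustrative handler, not the
eventual vfio-pci implementation, and the signature assumes kernels
where huge_fault takes an order rather than a page entry size.

#include <linux/huge_mm.h>
#include <linux/mm.h>
#include <linux/pfn_t.h>

static vm_fault_t sketch_huge_fault(struct vm_fault *vmf, unsigned int order)
{
	struct vm_area_struct *vma = vmf->vma;
	unsigned long addr = vmf->address & ~((PAGE_SIZE << order) - 1);
	unsigned long pfn = vma->vm_pgoff +
			    ((addr - vma->vm_start) >> PAGE_SHIFT);

	/* map huge only when the range fits the vma and the pfn is
	 * aligned; otherwise fall back to ordinary 4K PTE faults
	 */
	if (order != PMD_SHIFT - PAGE_SHIFT ||
	    addr < vma->vm_start ||
	    addr + (PAGE_SIZE << order) > vma->vm_end ||
	    !IS_ALIGNED(pfn, 1 << order))
		return VM_FAULT_FALLBACK;

	return vmf_insert_pfn_pmd(vmf, __pfn_to_pfn_t(pfn, PFN_DEV),
				  vmf->flags & FAULT_FLAG_WRITE);
}

The userspace alignment tweaking Alex mentions matters here: the pfn
alignment check decides how often the 4K fallback path is taken.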
On Wed, May 22, 2024 at 11:50:06AM -0600, Alex Williamson wrote:
> I'm not sure if there are any outstanding blockers on Peter's side, but
> this seems like a good route from the vfio side. If we're seeing this
> now without lockdep, we might need to bite the bullet and take the hit
> with vmf_insert_pfn() while the pmd/pud paths learn about pfnmaps.

There is another alternative...

Ideally we wouldn't use the fault handler.

Instead, when the MMIO becomes available again we'd iterate over all
the VMAs and do remap_pfn_range(). When the MMIO becomes unavailable
we do unmap_mapping_range() and remove it. This whole thing is
synchronous, and the fault handler should simply trigger SIGBUS if
userspace races things.

unmap_mapping_range() is easy, but the remap side doesn't have a
helper today..

Something grotesque like this perhaps?

	while (1) {
		struct mm_struct *cur_mm = NULL;

		i_mmap_lock_read(mapping);
		vma_interval_tree_foreach(vma, &mapping->i_mmap, 0, ULONG_MAX) {
			if (vma_populated(vma))
				continue;

			/* grab the first mm that still needs populating */
			cur_mm = vma->vm_mm;
			mmgrab(cur_mm);
			break;
		}
		i_mmap_unlock_read(mapping);

		if (!cur_mm)
			return;

		mmap_write_lock(cur_mm);
		i_mmap_lock_read(mapping);
		vma_interval_tree_foreach(vma, &mapping->i_mmap, 0, ULONG_MAX) {
			if (vma->vm_mm == cur_mm)
				vfio_remap_vma(vma);
		}
		i_mmap_unlock_read(mapping);
		mmap_write_unlock(cur_mm);
		mmdrop(cur_mm);
	}

I'm pretty sure we need to hold the mmap write lock to do the
remap_pfn..

vma_populated() would have to do an RCU read of the page table to check
if page 0 is present.

Also there is a race in mmap if you call remap_pfn_range() from the
mmap fops and also use unmap_mapping_range(). mmap_region() does
call_mmap() before it does vma_link_file(), which gives a window where
the VMA is populated but invisible to unmap_mapping_range(). We'd need
to add another fop to call after vma_link_file() to populate the mmap,
or rely exclusively on the fault handler to populate.

Jason
On Wed, 22 May 2024 15:30:46 -0300
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Wed, May 22, 2024 at 11:50:06AM -0600, Alex Williamson wrote:
> > I'm not sure if there are any outstanding blockers on Peter's side, but
> > this seems like a good route from the vfio side. If we're seeing this
> > now without lockdep, we might need to bite the bullet and take the hit
> > with vmf_insert_pfn() while the pmd/pud paths learn about pfnmaps.
>
> There is another alternative...
>
> Ideally we wouldn't use the fault handler.
>
> Instead, when the MMIO becomes available again we'd iterate over all
> the VMAs and do remap_pfn_range(). When the MMIO becomes unavailable
> we do unmap_mapping_range() and remove it. This whole thing is
> synchronous, and the fault handler should simply trigger SIGBUS if
> userspace races things.
>
> unmap_mapping_range() is easy, but the remap side doesn't have a
> helper today..
>
> Something grotesque like this perhaps?
>
> 	while (1) {
> 		struct mm_struct *cur_mm = NULL;
>
> 		i_mmap_lock_read(mapping);
> 		vma_interval_tree_foreach(vma, &mapping->i_mmap, 0, ULONG_MAX) {
> 			if (vma_populated(vma))
> 				continue;
>
> 			cur_mm = vma->vm_mm;
> 			mmgrab(cur_mm);
> 			break;
> 		}
> 		i_mmap_unlock_read(mapping);
>
> 		if (!cur_mm)
> 			return;
>
> 		mmap_write_lock(cur_mm);
> 		i_mmap_lock_read(mapping);
> 		vma_interval_tree_foreach(vma, &mapping->i_mmap, 0, ULONG_MAX) {
> 			if (vma->vm_mm == cur_mm)
> 				vfio_remap_vma(vma);
> 		}
> 		i_mmap_unlock_read(mapping);
> 		mmap_write_unlock(cur_mm);
> 		mmdrop(cur_mm);
> 	}

Yes, I've played with something similar; it's a pretty ugly and painful
path versus lazily faulting.

> I'm pretty sure we need to hold the mmap write lock to do the
> remap_pfn..
>
> vma_populated() would have to do an RCU read of the page table to check
> if page 0 is present.
>
> Also there is a race in mmap if you call remap_pfn_range() from the
> mmap fops and also use unmap_mapping_range(). mmap_region() does
> call_mmap() before it does vma_link_file(), which gives a window where
> the VMA is populated but invisible to unmap_mapping_range(). We'd need
> to add another fop to call after vma_link_file() to populate the mmap,
> or rely exclusively on the fault handler to populate.

Thanks, the branch I linked added back remap_pfn_range() on mmap,
conditional on locking/enabled, but I've predominantly been testing
with that disabled to exercise the pmd/pud faults. I would have missed
the gap vs unmap_mapping_range().

It's hard to beat the faulting mechanism from a simplicity standpoint,
so long as the mm folks don't balk at pfnmap faults. I think the vma
address space walking ends up with tracking flags to avoid redundant
mappings that get complicated and add to the ugliness. The mm handles
all that if we only use faulting and we get a simple test on fault to
proceed or trigger a SIGBUS. Using pmd/pud faults should have some
efficiency benefits as well. Thanks,

Alex
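For contrast, a sketch of the lazy-faulting shape Alex favors: a simple
test on fault to proceed or SIGBUS, using vmf_insert_pfn() so nothing
in the fault path touches vm_flags. This assumes access to vfio's
internal __vfio_pci_memory_enabled() helper (declared in the driver's
private header) and is illustrative, not the in-tree handler of the
time.

#include <linux/mm.h>
#include <linux/vfio_pci_core.h>

static vm_fault_t sketch_fault(struct vm_fault *vmf)
{
	struct vm_area_struct *vma = vmf->vma;
	struct vfio_pci_core_device *vdev = vma->vm_private_data;
	unsigned long pfn = vma->vm_pgoff +
			    ((vmf->address - vma->vm_start) >> PAGE_SHIFT);
	vm_fault_t ret = VM_FAULT_SIGBUS;

	down_read(&vdev->memory_lock);
	/* proceed only while device memory is enabled; disabling memory
	 * would zap mappings with unmap_mapping_range(), and later
	 * faults repopulate them here, one page at a time
	 */
	if (__vfio_pci_memory_enabled(vdev))
		ret = vmf_insert_pfn(vma, vmf->address, pfn);
	up_read(&vdev->memory_lock);

	return ret;
}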
On Wed, May 22, 2024 at 11:50:06AM -0600, Alex Williamson wrote:
> I'm not sure if there are any outstanding blockers on Peter's side, but
> this seems like a good route from the vfio side. If we're seeing this
> now without lockdep, we might need to bite the bullet and take the hit
> with vmf_insert_pfn() while the pmd/pud paths learn about pfnmaps.

No immediate blockers; it's just that there are some small details I
may still need to look into. The current TBD is the pfn tracking
implications on PAT. Here I see at least two issues to be
investigated.

First, when vfio zaps BARs it can end up removing the VM_PAT flag. To
be explicit, unmap_single_vma() has:

	if (unlikely(vma->vm_flags & VM_PFNMAP))
		untrack_pfn(vma, 0, 0, mm_wr_locked);

I believe it'll also erase the entry in the memtype_rbroot.. I'm not
sure whether that's correct at all, and if it is, how we should
re-inject it. So far I feel like we should leave the pfn tracking
alone when only tearing down pgtables, but I'll need to double check.
E.g. I at least checked that MADV_DONTNEED won't apply to PFNMAPs, so
vfio zapping the vma should be the first thing that can do that
besides munmap().

The other thing I noticed very recently is that the PAT bit on x86_64
is not always the same one.. on 4K PTEs it's bit 7, but bit 7 is
reused as PSE at the higher levels, moving PAT to bit 12:

#define _PAGE_BIT_PSE		7	/* 4 MB (or 2MB) page */
#define _PAGE_BIT_PAT		7	/* on 4KB pages */
#define _PAGE_BIT_PAT_LARGE	12	/* On 2MB or 1GB pages */

We may need something like protval_4k_2_large() when injecting huge
mappings.

Schedule-wise, the plan is to continue work on this after I flush the
inbox from the past two weeks and get some spare time. ~160 emails
left.. but I'm getting there.

If there are comments on either of the above, please shoot.

Thanks,
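For reference, a sketch of the bit shuffle Peter describes, modeled on
the x86 protval_4k_2_large() helper he names (check
arch/x86/include/asm/pgtable_types.h for the authoritative version):
the 4K PAT bit has to be relocated from bit 7, which means PSE at the
PMD/PUD level, up to bit 12 before a 4K protval can be used for a huge
mapping.

#include <asm/pgtable_types.h>

static inline pgprotval_t sketch_protval_4k_2_large(pgprotval_t val)
{
	/* clear both PAT positions, then re-inject the 4K PAT bit (7)
	 * at the large-page PAT position (12)
	 */
	return (val & ~(_PAGE_PAT | _PAGE_PAT_LARGE)) |
	       ((val & _PAGE_PAT) << (_PAGE_BIT_PAT_LARGE - _PAGE_BIT_PAT));
}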
diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index a5ab416cf476..5082f89152b3 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -1687,6 +1687,12 @@ static vm_fault_t vfio_pci_mmap_fault(struct vm_fault *vmf)
 	struct vfio_pci_mmap_vma *mmap_vma;
 	vm_fault_t ret = VM_FAULT_NOPAGE;
 
+	mmap_assert_locked(vma->vm_mm);
+	mmap_read_unlock(vma->vm_mm);
+
+	if (mmap_write_lock_killable(vma->vm_mm))
+		return VM_FAULT_RETRY;
+
 	mutex_lock(&vdev->vma_lock);
 	down_read(&vdev->memory_lock);
 
@@ -1726,6 +1732,17 @@ static vm_fault_t vfio_pci_mmap_fault(struct vm_fault *vmf)
 up_out:
 	up_read(&vdev->memory_lock);
 	mutex_unlock(&vdev->vma_lock);
+	mmap_write_unlock(vma->vm_mm);
+
+	/* If PTE is installed successfully, add the completed flag to
+	 * indicate mmap lock is released,
+	 * otherwise retake the read lock
+	 */
+	if (ret == VM_FAULT_NOPAGE)
+		ret |= VM_FAULT_COMPLETED;
+	else
+		mmap_read_lock(vma->vm_mm);
+
 	return ret;
 }
In VFIO type1, vaddr_get_pfns() will try to fault in MMIO PFNs after
pin_user_pages_remote() returns -EFAULT.

follow_fault_pfn
 fixup_user_fault
  handle_mm_fault
   handle_mm_fault
    do_fault
     do_shared_fault
      do_fault
       __do_fault
        vfio_pci_mmap_fault
         io_remap_pfn_range
          remap_pfn_range
           track_pfn_remap
            vm_flags_set ==> mmap_assert_write_locked(vma->vm_mm)
           remap_pfn_range_notrack
            vm_flags_set ==> mmap_assert_write_locked(vma->vm_mm)

As io_remap_pfn_range() will call vm_flags_set() to update vm_flags [1],
holding the mmap write lock is required.
So, update vfio_pci_mmap_fault() to drop the mmap read lock and take the
mmap write lock.

[1] https://lkml.kernel.org/r/20230126193752.297968-3-surenb@google.com
    commit bc292ab00f6c ("mm: introduce vma->vm_flags wrapper functions")
    commit 1c71222e5f23
    ("mm: replace vma->vm_flags direct modifications with modifier calls")

Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
 drivers/vfio/pci/vfio_pci_core.c | 17 +++++++++++++++++
 1 file changed, 17 insertions(+)

base-commit: 705b004ee377b789e39ae237519bab714297ac83