diff mbox series

[RFC,v1,1/3] mm/mmu_notifier: Add a new notifier for mapping updates (new pages)

Message ID 20230718082858.1570809-2-vivek.kasireddy@intel.com (mailing list archive)
State New, archived
Headers show
Series udmabuf: Replace pages when there is FALLOC_FL_PUNCH_HOLE in memfd | expand

Commit Message

Kasireddy, Vivek July 18, 2023, 8:28 a.m. UTC
Currently, there does not appear to be any mechanism for letting
drivers or other kernel entities know about updates made in a
mapping particularly when a new page is faulted in. Providing
notifications for such situations is really useful when using
memfds backed by ram-based filesystems such as shmem or hugetlbfs
that also allow FALLOC_FL_PUNCH_HOLE.

More specifically, when a hole is punched in a memfd (that is
backed by shmem or hugetlbfs), a driver can register for
notifications associated with range invalidations. However, it
would also be useful to have notifications when new pages are
faulted in as a result of writes made to the mapping region that
overlaps with a previously punched hole.

Cc: David Hildenbrand <david@redhat.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Gerd Hoffmann <kraxel@redhat.com>
Cc: Dongwon Kim <dongwon.kim@intel.com>
Cc: Junxiao Chang <junxiao.chang@intel.com>
Signed-off-by: Vivek Kasireddy <vivek.kasireddy@intel.com>
---
 include/linux/mmu_notifier.h | 27 +++++++++++++++++++++++++++
 mm/hugetlb.c                 |  9 ++++++++-
 mm/mmu_notifier.c            | 17 +++++++++++++++++
 mm/shmem.c                   |  7 ++++++-
 4 files changed, 58 insertions(+), 2 deletions(-)

Comments

Jason Gunthorpe July 18, 2023, 3:36 p.m. UTC | #1
On Tue, Jul 18, 2023 at 01:28:56AM -0700, Vivek Kasireddy wrote:
> Currently, there does not appear to be any mechanism for letting
> drivers or other kernel entities know about updates made in a
> mapping particularly when a new page is faulted in. Providing
> notifications for such situations is really useful when using
> memfds backed by ram-based filesystems such as shmem or hugetlbfs
> that also allow FALLOC_FL_PUNCH_HOLE.

Huh? You get an invalidate when this happens and the address becomes
non-present.

> More specifically, when a hole is punched in a memfd (that is
> backed by shmem or hugetlbfs), a driver can register for
> notifications associated with range invalidations. However, it
> would also be useful to have notifications when new pages are
> faulted in as a result of writes made to the mapping region that
> overlaps with a previously punched hole.

If there is no change to the PTEs then it is hard to see why this
would be part of a mmu_notifier.

Jason
Kasireddy, Vivek July 19, 2023, 12:05 a.m. UTC | #2
Hi Jason,

> 
> On Tue, Jul 18, 2023 at 01:28:56AM -0700, Vivek Kasireddy wrote:
> > Currently, there does not appear to be any mechanism for letting
> > drivers or other kernel entities know about updates made in a
> > mapping particularly when a new page is faulted in. Providing
> > notifications for such situations is really useful when using
> > memfds backed by ram-based filesystems such as shmem or hugetlbfs
> > that also allow FALLOC_FL_PUNCH_HOLE.
> 
> Huh? You get an invalidate when this happens and the address becomes
> non-present.
Yes, we do get an invalidate (range) but it is not going to help given my
use-case. This is because the invalidate only indicates that the old pages
are gone (and not about the availability of new pages). IIUC, after a hole
gets punched, it appears the new pages are faulted in only when there
are writes made to that region where the hole was punched. So, I think
what would really help is to get notified when a new page becomes part
of the mapping at a given offset. 

> 
> > More specifically, when a hole is punched in a memfd (that is
> > backed by shmem or hugetlbfs), a driver can register for
> > notifications associated with range invalidations. However, it
> > would also be useful to have notifications when new pages are
> > faulted in as a result of writes made to the mapping region that
> > overlaps with a previously punched hole.
> 
> If there is no change to the PTEs then it is hard to see why this
> would be part of a mmu_notifier.
IIUC, the PTEs do get changed but only when a new page is faulted in.
For shmem, it looks like the PTEs are updated in handle_pte_fault()
after shmem_fault() gets called and for hugetlbfs, this seems to
happen in hugetlb_fault().

Instead of introducing a new notifier, I did think about reusing
(or overloading) .change_pte() but I did not fully understand the impact
it would have on KVM, the only user of .change_pte(). 

Thanks,
Vivek

> 
> Jason
Jason Gunthorpe July 19, 2023, 12:24 a.m. UTC | #3
On Wed, Jul 19, 2023 at 12:05:29AM +0000, Kasireddy, Vivek wrote:

> > If there is no change to the PTEs then it is hard to see why this
> > would be part of a mmu_notifier.
> IIUC, the PTEs do get changed but only when a new page is faulted in.
> For shmem, it looks like the PTEs are updated in handle_pte_fault()
> after shmem_fault() gets called and for hugetlbfs, this seems to
> happen in hugetlb_fault().

That sounds about right

> Instead of introducing a new notifier, I did think about reusing
> (or overloading) .change_pte() but I did not fully understand the impact
> it would have on KVM, the only user of .change_pte().

Yes, change_pte will be called, I think, but under various locks. Why
would you need to change it?

What you are doing here doesn't make any sense within the design of
mmu_notifiers, eg:

> @ -2164,8 +2165,12 @@ static vm_fault_t shmem_fault(struct vm_fault *vmf)
>  				  gfp, vma, vmf, &ret);
>  	if (err)
>  		return vmf_error(err);
> -	if (folio)
> +	if (folio) {
>  		vmf->page = folio_file_page(folio, vmf->pgoff);
> +		if (ret == VM_FAULT_LOCKED)
> +			mmu_notifier_update_mapping(vma->vm_mm, vmf->address,
> +						    page_to_pfn(vmf->page));
> +	}
>  	return ret;

Hasn't even updated the PTEs yet, but it is invoking a callback??

Jason
Alistair Popple July 19, 2023, 2:08 a.m. UTC | #4
Vivek Kasireddy <vivek.kasireddy@intel.com> writes:

> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 64a3239b6407..1f2f0209101a 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -6096,8 +6096,12 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
>  		 * hugetlb_no_page will drop vma lock and hugetlb fault
>  		 * mutex internally, which make us return immediately.
>  		 */
> -		return hugetlb_no_page(mm, vma, mapping, idx, address, ptep,
> +		ret = hugetlb_no_page(mm, vma, mapping, idx, address, ptep,
>  				      entry, flags);
> +		if (!ret)
> +			mmu_notifier_update_mapping(vma->vm_mm, address,
> +						    pte_pfn(*ptep));

The next patch ends up calling pfn_to_page() on the result of
pte_pfn(*ptep). I don't think that's safe because couldn't the PTE have
already changed and/or the new page have been freed?

> +		return ret;
>  
>  	ret = 0;
>  
> @@ -6223,6 +6227,9 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
>  	 */
>  	if (need_wait_lock)
>  		folio_wait_locked(folio);
> +	if (!ret)
> +		mmu_notifier_update_mapping(vma->vm_mm, address,
> +					    pte_pfn(*ptep));
>  	return ret;
>  }
>  
> diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
> index 50c0dde1354f..6421405334b9 100644
> --- a/mm/mmu_notifier.c
> +++ b/mm/mmu_notifier.c
> @@ -441,6 +441,23 @@ void __mmu_notifier_change_pte(struct mm_struct *mm, unsigned long address,
>  	srcu_read_unlock(&srcu, id);
>  }
>  
> +void __mmu_notifier_update_mapping(struct mm_struct *mm, unsigned long address,
> +				   unsigned long pfn)
> +{
> +	struct mmu_notifier *subscription;
> +	int id;
> +
> +	id = srcu_read_lock(&srcu);
> +	hlist_for_each_entry_rcu(subscription,
> +				 &mm->notifier_subscriptions->list, hlist,
> +				 srcu_read_lock_held(&srcu)) {
> +		if (subscription->ops->update_mapping)
> +			subscription->ops->update_mapping(subscription, mm,
> +							  address, pfn);
> +	}
> +	srcu_read_unlock(&srcu, id);
> +}
> +
>  static int mn_itree_invalidate(struct mmu_notifier_subscriptions *subscriptions,
>  			       const struct mmu_notifier_range *range)
>  {
> diff --git a/mm/shmem.c b/mm/shmem.c
> index 2f2e0e618072..e59eb5fafadb 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -77,6 +77,7 @@ static struct vfsmount *shm_mnt;
>  #include <linux/fcntl.h>
>  #include <uapi/linux/memfd.h>
>  #include <linux/rmap.h>
> +#include <linux/mmu_notifier.h>
>  #include <linux/uuid.h>
>  
>  #include <linux/uaccess.h>
> @@ -2164,8 +2165,12 @@ static vm_fault_t shmem_fault(struct vm_fault *vmf)
>  				  gfp, vma, vmf, &ret);
>  	if (err)
>  		return vmf_error(err);
> -	if (folio)
> +	if (folio) {
>  		vmf->page = folio_file_page(folio, vmf->pgoff);
> +		if (ret == VM_FAULT_LOCKED)
> +			mmu_notifier_update_mapping(vma->vm_mm, vmf->address,
> +						    page_to_pfn(vmf->page));
> +	}
>  	return ret;
>  }
Kasireddy, Vivek July 19, 2023, 6:19 a.m. UTC | #5
Hi Jason,

> 
> On Wed, Jul 19, 2023 at 12:05:29AM +0000, Kasireddy, Vivek wrote:
> 
> > > If there is no change to the PTEs then it is hard to see why this
> > > would be part of a mmu_notifier.
> > IIUC, the PTEs do get changed but only when a new page is faulted in.
> > For shmem, it looks like the PTEs are updated in handle_pte_fault()
> > after shmem_fault() gets called and for hugetlbfs, this seems to
> > happen in hugetlb_fault().
> 
> That sounds about right
> 
> > Instead of introducing a new notifier, I did think about reusing
> > (or overloading) .change_pte() but I did not fully understand the impact
> > it would have on KVM, the only user of .change_pte().
> 
> Yes, change_pte will be called, I think, but under various locks. 
AFAICT, change_pte does not seem to get called in my use-case either
during invalidate or when a new page is faulted in.

>Why would you need to change it?
What I meant to say is instead of adding a new notifier for mapping updates,
I initially considered just calling change_pte() when a new page is faulted in
but I was concerned that doing so might adversely impact existing users (of
change_pte) such as KVM.

> 
> What you are doing here doesn't make any sense within the design of
> mmu_notifiers, eg:
> 
> > @ -2164,8 +2165,12 @@ static vm_fault_t shmem_fault(struct vm_fault
> *vmf)
> >  				  gfp, vma, vmf, &ret);
> >  	if (err)
> >  		return vmf_error(err);
> > -	if (folio)
> > +	if (folio) {
> >  		vmf->page = folio_file_page(folio, vmf->pgoff);
> > +		if (ret == VM_FAULT_LOCKED)
> > +			mmu_notifier_update_mapping(vma->vm_mm, vmf-
> >address,
> > +						    page_to_pfn(vmf->page));
> > +	}
> >  	return ret;
> 
> Hasn't even updated the PTEs yet, but it is invoking a callback??
I was counting on the fragile assumption that once we have a valid page,
the PTE would be eventually updated after shmem_fault(), which doesn't
make sense in retrospect. Instead, would something like below be OK?
@@ -5234,6 +5237,14 @@ vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address,

        lru_gen_exit_fault();

+       if (vma_is_shmem(vma) || is_vm_hugetlb_page(vma)) {
+               if (!follow_pte(vma->vm_mm, address, &ptep, &ptl)) {
+                       pfn = pte_pfn(ptep_get(ptep));
+                       pte_unmap_unlock(ptep, ptl);
+                       mmu_notifier_update_mapping(vma->vm_mm, address, pfn);
+               }
+       }


Thanks,
Vivek

> 
> Jason
Kasireddy, Vivek July 20, 2023, 7:43 a.m. UTC | #6
Hi Alistair,

> 
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index 64a3239b6407..1f2f0209101a 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -6096,8 +6096,12 @@ vm_fault_t hugetlb_fault(struct mm_struct
> *mm, struct vm_area_struct *vma,
> >  		 * hugetlb_no_page will drop vma lock and hugetlb fault
> >  		 * mutex internally, which make us return immediately.
> >  		 */
> > -		return hugetlb_no_page(mm, vma, mapping, idx, address,
> ptep,
> > +		ret = hugetlb_no_page(mm, vma, mapping, idx, address,
> ptep,
> >  				      entry, flags);
> > +		if (!ret)
> > +			mmu_notifier_update_mapping(vma->vm_mm,
> address,
> > +						    pte_pfn(*ptep));
> 
> The next patch ends up calling pfn_to_page() on the result of
> pte_pfn(*ptep). I don't think that's safe because couldn't the PTE have
> already changed and/or the new page have been freed?
Yeah, that might be possible; I believe the right thing to do would be:
-               return hugetlb_no_page(mm, vma, mapping, idx, address, ptep,
+               ret = hugetlb_no_page(mm, vma, mapping, idx, address, ptep,
                                      entry, flags);
+               if (!ret) {
+                       ptl = huge_pte_lock(h, mm, ptep);
+                       mmu_notifier_update_mapping(vma->vm_mm, address,
+                                                    pte_pfn(*ptep));
+                       spin_unlock(ptl);
+               }

In which case I'd need to make a similar change in the shmem path as well.
And, also redo (or eliminate) the locking in udmabuf (patch) which seems a
bit excessive on a second look given our use-case (where reads and writes do
not happen simultaneously due to fence synchronization in the guest driver). 

Thanks,
Vivek

> 
> > +		return ret;
> >
> >  	ret = 0;
> >
> > @@ -6223,6 +6227,9 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm,
> struct vm_area_struct *vma,
> >  	 */
> >  	if (need_wait_lock)
> >  		folio_wait_locked(folio);
> > +	if (!ret)
> > +		mmu_notifier_update_mapping(vma->vm_mm, address,
> > +					    pte_pfn(*ptep));
> >  	return ret;
> >  }
> >
> > diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
> > index 50c0dde1354f..6421405334b9 100644
> > --- a/mm/mmu_notifier.c
> > +++ b/mm/mmu_notifier.c
> > @@ -441,6 +441,23 @@ void __mmu_notifier_change_pte(struct
> mm_struct *mm, unsigned long address,
> >  	srcu_read_unlock(&srcu, id);
> >  }
> >
> > +void __mmu_notifier_update_mapping(struct mm_struct *mm, unsigned
> long address,
> > +				   unsigned long pfn)
> > +{
> > +	struct mmu_notifier *subscription;
> > +	int id;
> > +
> > +	id = srcu_read_lock(&srcu);
> > +	hlist_for_each_entry_rcu(subscription,
> > +				 &mm->notifier_subscriptions->list, hlist,
> > +				 srcu_read_lock_held(&srcu)) {
> > +		if (subscription->ops->update_mapping)
> > +			subscription->ops->update_mapping(subscription,
> mm,
> > +							  address, pfn);
> > +	}
> > +	srcu_read_unlock(&srcu, id);
> > +}
> > +
> >  static int mn_itree_invalidate(struct mmu_notifier_subscriptions
> *subscriptions,
> >  			       const struct mmu_notifier_range *range)
> >  {
> > diff --git a/mm/shmem.c b/mm/shmem.c
> > index 2f2e0e618072..e59eb5fafadb 100644
> > --- a/mm/shmem.c
> > +++ b/mm/shmem.c
> > @@ -77,6 +77,7 @@ static struct vfsmount *shm_mnt;
> >  #include <linux/fcntl.h>
> >  #include <uapi/linux/memfd.h>
> >  #include <linux/rmap.h>
> > +#include <linux/mmu_notifier.h>
> >  #include <linux/uuid.h>
> >
> >  #include <linux/uaccess.h>
> > @@ -2164,8 +2165,12 @@ static vm_fault_t shmem_fault(struct vm_fault
> *vmf)
> >  				  gfp, vma, vmf, &ret);
> >  	if (err)
> >  		return vmf_error(err);
> > -	if (folio)
> > +	if (folio) {
> >  		vmf->page = folio_file_page(folio, vmf->pgoff);
> > +		if (ret == VM_FAULT_LOCKED)
> > +			mmu_notifier_update_mapping(vma->vm_mm, vmf-
> >address,
> > +						    page_to_pfn(vmf->page));
> > +	}
> >  	return ret;
> >  }
Alistair Popple July 20, 2023, 9 a.m. UTC | #7
"Kasireddy, Vivek" <vivek.kasireddy@intel.com> writes:

> Hi Alistair,
>
>> 
>> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>> > index 64a3239b6407..1f2f0209101a 100644
>> > --- a/mm/hugetlb.c
>> > +++ b/mm/hugetlb.c
>> > @@ -6096,8 +6096,12 @@ vm_fault_t hugetlb_fault(struct mm_struct
>> *mm, struct vm_area_struct *vma,
>> >  		 * hugetlb_no_page will drop vma lock and hugetlb fault
>> >  		 * mutex internally, which make us return immediately.
>> >  		 */
>> > -		return hugetlb_no_page(mm, vma, mapping, idx, address,
>> ptep,
>> > +		ret = hugetlb_no_page(mm, vma, mapping, idx, address,
>> ptep,
>> >  				      entry, flags);
>> > +		if (!ret)
>> > +			mmu_notifier_update_mapping(vma->vm_mm,
>> address,
>> > +						    pte_pfn(*ptep));
>> 
>> The next patch ends up calling pfn_to_page() on the result of
>> pte_pfn(*ptep). I don't think that's safe because couldn't the PTE have
>> already changed and/or the new page have been freed?
> Yeah, that might be possible; I believe the right thing to do would be:
> -               return hugetlb_no_page(mm, vma, mapping, idx, address, ptep,
> +               ret = hugetlb_no_page(mm, vma, mapping, idx, address, ptep,
>                                       entry, flags);
> +               if (!ret) {
> +                       ptl = huge_pte_lock(h, mm, ptep);
> +                       mmu_notifier_update_mapping(vma->vm_mm, address,
> +                                                    pte_pfn(*ptep));
> +                       spin_unlock(ptl);
> +               }

Yes, although obviously as I think you point out below you wouldn't be
able to take any sleeping locks in mmu_notifier_update_mapping().

> In which case I'd need to make a similar change in the shmem path as well.
> And, also redo (or eliminate) the locking in udmabuf (patch) which seems a
> bit excessive on a second look given our use-case (where reads and writes do
> not happen simultaneously due to fence synchronization in the guest driver). 

I'm not at all familiar with the udmabuf use case but that sounds
brittle and effectively makes this notifier udmabuf specific right?

The name gives the impression it is more general though. I have
contemplated adding a notifier for PTE updates for drivers using
hmm_range_fault() as it would save some expensive device faults and it
this could be useful for that.

So if we're adding a notifier for PTE updates I think it would be good
if it covered all cases and was robust enough to allow mirroring of the
correct PTE value (ie. by being called under PTL or via some other
synchronisation like hmm_range_fault()).

Thanks.

> Thanks,
> Vivek
>
>> 
>> > +		return ret;
>> >
>> >  	ret = 0;
>> >
>> > @@ -6223,6 +6227,9 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm,
>> struct vm_area_struct *vma,
>> >  	 */
>> >  	if (need_wait_lock)
>> >  		folio_wait_locked(folio);
>> > +	if (!ret)
>> > +		mmu_notifier_update_mapping(vma->vm_mm, address,
>> > +					    pte_pfn(*ptep));
>> >  	return ret;
>> >  }
>> >
>> > diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
>> > index 50c0dde1354f..6421405334b9 100644
>> > --- a/mm/mmu_notifier.c
>> > +++ b/mm/mmu_notifier.c
>> > @@ -441,6 +441,23 @@ void __mmu_notifier_change_pte(struct
>> mm_struct *mm, unsigned long address,
>> >  	srcu_read_unlock(&srcu, id);
>> >  }
>> >
>> > +void __mmu_notifier_update_mapping(struct mm_struct *mm, unsigned
>> long address,
>> > +				   unsigned long pfn)
>> > +{
>> > +	struct mmu_notifier *subscription;
>> > +	int id;
>> > +
>> > +	id = srcu_read_lock(&srcu);
>> > +	hlist_for_each_entry_rcu(subscription,
>> > +				 &mm->notifier_subscriptions->list, hlist,
>> > +				 srcu_read_lock_held(&srcu)) {
>> > +		if (subscription->ops->update_mapping)
>> > +			subscription->ops->update_mapping(subscription,
>> mm,
>> > +							  address, pfn);
>> > +	}
>> > +	srcu_read_unlock(&srcu, id);
>> > +}
>> > +
>> >  static int mn_itree_invalidate(struct mmu_notifier_subscriptions
>> *subscriptions,
>> >  			       const struct mmu_notifier_range *range)
>> >  {
>> > diff --git a/mm/shmem.c b/mm/shmem.c
>> > index 2f2e0e618072..e59eb5fafadb 100644
>> > --- a/mm/shmem.c
>> > +++ b/mm/shmem.c
>> > @@ -77,6 +77,7 @@ static struct vfsmount *shm_mnt;
>> >  #include <linux/fcntl.h>
>> >  #include <uapi/linux/memfd.h>
>> >  #include <linux/rmap.h>
>> > +#include <linux/mmu_notifier.h>
>> >  #include <linux/uuid.h>
>> >
>> >  #include <linux/uaccess.h>
>> > @@ -2164,8 +2165,12 @@ static vm_fault_t shmem_fault(struct vm_fault
>> *vmf)
>> >  				  gfp, vma, vmf, &ret);
>> >  	if (err)
>> >  		return vmf_error(err);
>> > -	if (folio)
>> > +	if (folio) {
>> >  		vmf->page = folio_file_page(folio, vmf->pgoff);
>> > +		if (ret == VM_FAULT_LOCKED)
>> > +			mmu_notifier_update_mapping(vma->vm_mm, vmf-
>> >address,
>> > +						    page_to_pfn(vmf->page));
>> > +	}
>> >  	return ret;
>> >  }
Kasireddy, Vivek July 24, 2023, 7:54 a.m. UTC | #8
Hi Alistair,

> 
> 
> "Kasireddy, Vivek" <vivek.kasireddy@intel.com> writes:
> 
> > Hi Alistair,
> >
> >>
> >> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> >> > index 64a3239b6407..1f2f0209101a 100644
> >> > --- a/mm/hugetlb.c
> >> > +++ b/mm/hugetlb.c
> >> > @@ -6096,8 +6096,12 @@ vm_fault_t hugetlb_fault(struct mm_struct
> >> *mm, struct vm_area_struct *vma,
> >> >  		 * hugetlb_no_page will drop vma lock and hugetlb fault
> >> >  		 * mutex internally, which make us return immediately.
> >> >  		 */
> >> > -		return hugetlb_no_page(mm, vma, mapping, idx, address,
> >> ptep,
> >> > +		ret = hugetlb_no_page(mm, vma, mapping, idx, address,
> >> ptep,
> >> >  				      entry, flags);
> >> > +		if (!ret)
> >> > +			mmu_notifier_update_mapping(vma->vm_mm,
> >> address,
> >> > +						    pte_pfn(*ptep));
> >>
> >> The next patch ends up calling pfn_to_page() on the result of
> >> pte_pfn(*ptep). I don't think that's safe because couldn't the PTE have
> >> already changed and/or the new page have been freed?
> > Yeah, that might be possible; I believe the right thing to do would be:
> > -               return hugetlb_no_page(mm, vma, mapping, idx, address, ptep,
> > +               ret = hugetlb_no_page(mm, vma, mapping, idx, address, ptep,
> >                                       entry, flags);
> > +               if (!ret) {
> > +                       ptl = huge_pte_lock(h, mm, ptep);
> > +                       mmu_notifier_update_mapping(vma->vm_mm, address,
> > +                                                    pte_pfn(*ptep));
> > +                       spin_unlock(ptl);
> > +               }
> 
> Yes, although obviously as I think you point out below you wouldn't be
> able to take any sleeping locks in mmu_notifier_update_mapping().
Yes, I understand that, but I am not sure how we can prevent any potential
notifier callback from taking sleeping locks other than adding clear comments.

> 
> > In which case I'd need to make a similar change in the shmem path as well.
> > And, also redo (or eliminate) the locking in udmabuf (patch) which seems a
> > bit excessive on a second look given our use-case (where reads and writes
> do
> > not happen simultaneously due to fence synchronization in the guest
> driver).
> 
> I'm not at all familiar with the udmabuf use case but that sounds
> brittle and effectively makes this notifier udmabuf specific right?
Oh, Qemu uses the udmabuf driver to provide Host Graphics components
(such as Spice, Gstreamer, UI, etc) zero-copy access to Guest created
buffers. In other words, from a core mm standpoint, udmabuf just
collects a bunch of pages (associated with buffers) scattered inside
the memfd (Guest ram backed by shmem or hugetlbfs) and wraps
them in a dmabuf fd. And, since we provide zero-copy access, we
use DMA fences to ensure that the components on the Host and
Guest do not access the buffer simultaneously.

> 
> The name gives the impression it is more general though. I have
I'd like to make it suitable for general usage.

> contemplated adding a notifier for PTE updates for drivers using
> hmm_range_fault() as it would save some expensive device faults and it
> this could be useful for that.
> 
> So if we're adding a notifier for PTE updates I think it would be good
> if it covered all cases and was robust enough to allow mirroring of the
> correct PTE value (ie. by being called under PTL or via some other
> synchronisation like hmm_range_fault()).
Ok; in order to make it clear that the notifier is associated with PTE updates,
I think it needs to have a more suitable name such as mmu_notifier_update_pte()
or mmu_notifier_new_pte(). But we already have mmu_notifier_change_pte,
which IIUC is used mainly for PTE updates triggered by KSM. So, I am inclining
towards dropping this new notifier and instead adding a new flag to change_pte
to distinguish between KSM triggered notifications and others. Something along
the lines of:
diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index 218ddc3b4bc7..6afce2287143 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -129,7 +129,8 @@ struct mmu_notifier_ops {
        void (*change_pte)(struct mmu_notifier *subscription,
                           struct mm_struct *mm,
                           unsigned long address,
-                          pte_t pte);
+                          pte_t pte,
+                          bool ksm_update);
@@ -658,7 +659,7 @@ static inline void mmu_notifier_range_init_owner(
        unsigned long ___address = __address;                           \
        pte_t ___pte = __pte;                                           \
                                                                        \
-       mmu_notifier_change_pte(___mm, ___address, ___pte);             \
+       mmu_notifier_change_pte(___mm, ___address, ___pte, true);       \

And replace mmu_notifier_update_mapping(vma->vm_mm, address, pte_pfn(*ptep))
in the current patch with
mmu_notifier_change_pte(vma->vm_mm, address, ptep, false));

Would that work for your HMM use-case -- assuming we call change_pte after
taking PTL?

Thanks,
Vivek

> 
> Thanks.
> 
> > Thanks,
> > Vivek
> >
> >>
> >> > +		return ret;
> >> >
> >> >  	ret = 0;
> >> >
> >> > @@ -6223,6 +6227,9 @@ vm_fault_t hugetlb_fault(struct mm_struct
> *mm,
> >> struct vm_area_struct *vma,
> >> >  	 */
> >> >  	if (need_wait_lock)
> >> >  		folio_wait_locked(folio);
> >> > +	if (!ret)
> >> > +		mmu_notifier_update_mapping(vma->vm_mm, address,
> >> > +					    pte_pfn(*ptep));
> >> >  	return ret;
> >> >  }
> >> >
> >> > diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
> >> > index 50c0dde1354f..6421405334b9 100644
> >> > --- a/mm/mmu_notifier.c
> >> > +++ b/mm/mmu_notifier.c
> >> > @@ -441,6 +441,23 @@ void __mmu_notifier_change_pte(struct
> >> mm_struct *mm, unsigned long address,
> >> >  	srcu_read_unlock(&srcu, id);
> >> >  }
> >> >
> >> > +void __mmu_notifier_update_mapping(struct mm_struct *mm,
> unsigned
> >> long address,
> >> > +				   unsigned long pfn)
> >> > +{
> >> > +	struct mmu_notifier *subscription;
> >> > +	int id;
> >> > +
> >> > +	id = srcu_read_lock(&srcu);
> >> > +	hlist_for_each_entry_rcu(subscription,
> >> > +				 &mm->notifier_subscriptions->list, hlist,
> >> > +				 srcu_read_lock_held(&srcu)) {
> >> > +		if (subscription->ops->update_mapping)
> >> > +			subscription->ops->update_mapping(subscription,
> >> mm,
> >> > +							  address, pfn);
> >> > +	}
> >> > +	srcu_read_unlock(&srcu, id);
> >> > +}
> >> > +
> >> >  static int mn_itree_invalidate(struct mmu_notifier_subscriptions
> >> *subscriptions,
> >> >  			       const struct mmu_notifier_range *range)
> >> >  {
> >> > diff --git a/mm/shmem.c b/mm/shmem.c
> >> > index 2f2e0e618072..e59eb5fafadb 100644
> >> > --- a/mm/shmem.c
> >> > +++ b/mm/shmem.c
> >> > @@ -77,6 +77,7 @@ static struct vfsmount *shm_mnt;
> >> >  #include <linux/fcntl.h>
> >> >  #include <uapi/linux/memfd.h>
> >> >  #include <linux/rmap.h>
> >> > +#include <linux/mmu_notifier.h>
> >> >  #include <linux/uuid.h>
> >> >
> >> >  #include <linux/uaccess.h>
> >> > @@ -2164,8 +2165,12 @@ static vm_fault_t shmem_fault(struct
> vm_fault
> >> *vmf)
> >> >  				  gfp, vma, vmf, &ret);
> >> >  	if (err)
> >> >  		return vmf_error(err);
> >> > -	if (folio)
> >> > +	if (folio) {
> >> >  		vmf->page = folio_file_page(folio, vmf->pgoff);
> >> > +		if (ret == VM_FAULT_LOCKED)
> >> > +			mmu_notifier_update_mapping(vma->vm_mm, vmf-
> >> >address,
> >> > +						    page_to_pfn(vmf->page));
> >> > +	}
> >> >  	return ret;
> >> >  }
Jason Gunthorpe July 24, 2023, 1:35 p.m. UTC | #9
On Mon, Jul 24, 2023 at 07:54:38AM +0000, Kasireddy, Vivek wrote:

> > I'm not at all familiar with the udmabuf use case but that sounds
> > brittle and effectively makes this notifier udmabuf specific right?
> Oh, Qemu uses the udmabuf driver to provide Host Graphics components
> (such as Spice, Gstreamer, UI, etc) zero-copy access to Guest created
> buffers. In other words, from a core mm standpoint, udmabuf just
> collects a bunch of pages (associated with buffers) scattered inside
> the memfd (Guest ram backed by shmem or hugetlbfs) and wraps
> them in a dmabuf fd. And, since we provide zero-copy access, we
> use DMA fences to ensure that the components on the Host and
> Guest do not access the buffer simultaneously.

So why do you need to track updates proactively like this?

Trigger a move when the backing memory changes and re-acquire it with
hmm_range_fault like everything else does.

> And replace mmu_notifier_update_mapping(vma->vm_mm, address, pte_pfn(*ptep))
> in the current patch with
> mmu_notifier_change_pte(vma->vm_mm, address, ptep, false));

It isn't very useful because nothing can do anything meaningful under
the PTLs. Can't allocate memory for instance. Which makes me wonder
what it is udmabuf plans to actually do here.

JAson
Alistair Popple July 24, 2023, 1:36 p.m. UTC | #10
"Kasireddy, Vivek" <vivek.kasireddy@intel.com> writes:

> Hi Alistair,
>
>> 
>> 
>> "Kasireddy, Vivek" <vivek.kasireddy@intel.com> writes:
>>
>> Yes, although obviously as I think you point out below you wouldn't be
>> able to take any sleeping locks in mmu_notifier_update_mapping().
> Yes, I understand that, but I am not sure how we can prevent any potential
> notifier callback from taking sleeping locks other than adding clear comments.

Oh of course not, but is such a restriction on not taking sleeping locks
acceptable for your implementation of the notifier callback? I notice in
patch 2 update_udmabuf() takes a mutex so I assumed not being able to
sleep in the callback would be an issue.

>> 
>> > In which case I'd need to make a similar change in the shmem path as well.
>> > And, also redo (or eliminate) the locking in udmabuf (patch) which seems a
>> > bit excessive on a second look given our use-case (where reads and writes
>> do
>> > not happen simultaneously due to fence synchronization in the guest
>> driver).
>> 
>> I'm not at all familiar with the udmabuf use case but that sounds
>> brittle and effectively makes this notifier udmabuf specific right?
> Oh, Qemu uses the udmabuf driver to provide Host Graphics components
> (such as Spice, Gstreamer, UI, etc) zero-copy access to Guest created
> buffers. In other words, from a core mm standpoint, udmabuf just
> collects a bunch of pages (associated with buffers) scattered inside
> the memfd (Guest ram backed by shmem or hugetlbfs) and wraps
> them in a dmabuf fd. And, since we provide zero-copy access, we
> use DMA fences to ensure that the components on the Host and
> Guest do not access the buffer simultaneously.

Thanks for the background!

>> contemplated adding a notifier for PTE updates for drivers using
>> hmm_range_fault() as it would save some expensive device faults and it
>> this could be useful for that.
>> 
>> So if we're adding a notifier for PTE updates I think it would be good
>> if it covered all cases and was robust enough to allow mirroring of the
>> correct PTE value (ie. by being called under PTL or via some other
>> synchronisation like hmm_range_fault()).
> Ok; in order to make it clear that the notifier is associated with PTE updates,
> I think it needs to have a more suitable name such as mmu_notifier_update_pte()
> or mmu_notifier_new_pte(). But we already have mmu_notifier_change_pte,
> which IIUC is used mainly for PTE updates triggered by KSM. So, I am inclining
> towards dropping this new notifier and instead adding a new flag to change_pte
> to distinguish between KSM triggered notifications and others. Something along
> the lines of:
> diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
> index 218ddc3b4bc7..6afce2287143 100644
> --- a/include/linux/mmu_notifier.h
> +++ b/include/linux/mmu_notifier.h
> @@ -129,7 +129,8 @@ struct mmu_notifier_ops {
>         void (*change_pte)(struct mmu_notifier *subscription,
>                            struct mm_struct *mm,
>                            unsigned long address,
> -                          pte_t pte);
> +                          pte_t pte,
> +                          bool ksm_update);
> @@ -658,7 +659,7 @@ static inline void mmu_notifier_range_init_owner(
>         unsigned long ___address = __address;                           \
>         pte_t ___pte = __pte;                                           \
>                                                                         \
> -       mmu_notifier_change_pte(___mm, ___address, ___pte);             \
> +       mmu_notifier_change_pte(___mm, ___address, ___pte, true);       \
>
> And replace mmu_notifier_update_mapping(vma->vm_mm, address, pte_pfn(*ptep))
> in the current patch with
> mmu_notifier_change_pte(vma->vm_mm, address, ptep, false));

I wonder if we actually need the flag? IIUC it is already used for more
than just KSM. For example it can be called as part of fault handling by
set_pte_at_notify() in in wp_page_copy().

> Would that work for your HMM use-case -- assuming we call change_pte after
> taking PTL?

I suspect being called under the PTL could be an issue. For HMM we use a
combination of sequence numbers and a mutex to synchronise PTEs. To
avoid calling the notifier while holding PTL we might be able to record
the sequence number (subscriptions->invalidate_seq) while holding PTL,
release the PTL and provide that sequence number to the notifier
callback along with the PTE.

Some form of mmu_interval_read_retry() could then be used to detect if
the PTE has changed between dropping the PTL and calling the
update_pte()/change_pte() notifier.

Of course if your design can handle being called while holding the PTL
then the above is probably unnecessarily complex for your use-case.

My primary issue with this patch is the notifier is called without the
PTL while providing a PTE value. Without some form of synchronisation it
isn't safe to use the result of eg. pte_page(pte) or pte_write(pte) in
the notifier callback. Based on your comments it seems udmabuf might
have some other synchronisation that makes it safe, but being external
to the notifier calls make it's hard to reason about.

 - Alistair

> Thanks,
> Vivek
>
>> 
>> Thanks.
>> 
>> > Thanks,
>> > Vivek
>> >
>> >>
>> >> > +		return ret;
>> >> >
>> >> >  	ret = 0;
>> >> >
>> >> > @@ -6223,6 +6227,9 @@ vm_fault_t hugetlb_fault(struct mm_struct
>> *mm,
>> >> struct vm_area_struct *vma,
>> >> >  	 */
>> >> >  	if (need_wait_lock)
>> >> >  		folio_wait_locked(folio);
>> >> > +	if (!ret)
>> >> > +		mmu_notifier_update_mapping(vma->vm_mm, address,
>> >> > +					    pte_pfn(*ptep));
>> >> >  	return ret;
>> >> >  }
>> >> >
>> >> > diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
>> >> > index 50c0dde1354f..6421405334b9 100644
>> >> > --- a/mm/mmu_notifier.c
>> >> > +++ b/mm/mmu_notifier.c
>> >> > @@ -441,6 +441,23 @@ void __mmu_notifier_change_pte(struct
>> >> mm_struct *mm, unsigned long address,
>> >> >  	srcu_read_unlock(&srcu, id);
>> >> >  }
>> >> >
>> >> > +void __mmu_notifier_update_mapping(struct mm_struct *mm,
>> unsigned
>> >> long address,
>> >> > +				   unsigned long pfn)
>> >> > +{
>> >> > +	struct mmu_notifier *subscription;
>> >> > +	int id;
>> >> > +
>> >> > +	id = srcu_read_lock(&srcu);
>> >> > +	hlist_for_each_entry_rcu(subscription,
>> >> > +				 &mm->notifier_subscriptions->list, hlist,
>> >> > +				 srcu_read_lock_held(&srcu)) {
>> >> > +		if (subscription->ops->update_mapping)
>> >> > +			subscription->ops->update_mapping(subscription,
>> >> mm,
>> >> > +							  address, pfn);
>> >> > +	}
>> >> > +	srcu_read_unlock(&srcu, id);
>> >> > +}
>> >> > +
>> >> >  static int mn_itree_invalidate(struct mmu_notifier_subscriptions
>> >> *subscriptions,
>> >> >  			       const struct mmu_notifier_range *range)
>> >> >  {
>> >> > diff --git a/mm/shmem.c b/mm/shmem.c
>> >> > index 2f2e0e618072..e59eb5fafadb 100644
>> >> > --- a/mm/shmem.c
>> >> > +++ b/mm/shmem.c
>> >> > @@ -77,6 +77,7 @@ static struct vfsmount *shm_mnt;
>> >> >  #include <linux/fcntl.h>
>> >> >  #include <uapi/linux/memfd.h>
>> >> >  #include <linux/rmap.h>
>> >> > +#include <linux/mmu_notifier.h>
>> >> >  #include <linux/uuid.h>
>> >> >
>> >> >  #include <linux/uaccess.h>
>> >> > @@ -2164,8 +2165,12 @@ static vm_fault_t shmem_fault(struct
>> vm_fault
>> >> *vmf)
>> >> >  				  gfp, vma, vmf, &ret);
>> >> >  	if (err)
>> >> >  		return vmf_error(err);
>> >> > -	if (folio)
>> >> > +	if (folio) {
>> >> >  		vmf->page = folio_file_page(folio, vmf->pgoff);
>> >> > +		if (ret == VM_FAULT_LOCKED)
>> >> > +			mmu_notifier_update_mapping(vma->vm_mm, vmf-
>> >> >address,
>> >> > +						    page_to_pfn(vmf->page));
>> >> > +	}
>> >> >  	return ret;
>> >> >  }
Jason Gunthorpe July 24, 2023, 1:37 p.m. UTC | #11
On Mon, Jul 24, 2023 at 11:36:47PM +1000, Alistair Popple wrote:

> My primary issue with this patch is the notifier is called without the
> PTL while providing a PTE value. 

Right, this is no-go. The PTE value must be protected by the PTLs at
all times. We've made enough bugs already by being lazy with this in
the past..

And I don't think the existing sequence locking helps make sense of
this at all..

Jason
Kasireddy, Vivek July 24, 2023, 8:32 p.m. UTC | #12
Hi Jason,

> 
> On Mon, Jul 24, 2023 at 07:54:38AM +0000, Kasireddy, Vivek wrote:
> 
> > > I'm not at all familiar with the udmabuf use case but that sounds
> > > brittle and effectively makes this notifier udmabuf specific right?
> > Oh, Qemu uses the udmabuf driver to provide Host Graphics components
> > (such as Spice, Gstreamer, UI, etc) zero-copy access to Guest created
> > buffers. In other words, from a core mm standpoint, udmabuf just
> > collects a bunch of pages (associated with buffers) scattered inside
> > the memfd (Guest ram backed by shmem or hugetlbfs) and wraps
> > them in a dmabuf fd. And, since we provide zero-copy access, we
> > use DMA fences to ensure that the components on the Host and
> > Guest do not access the buffer simultaneously.
> 
> So why do you need to track updates proactively like this?
As David noted in the earlier series, if Qemu punches a hole in its memfd
that goes through pages that are registered against a udmabuf fd, then
udmabuf needs to update its list with new pages when the hole gets
filled after (guest) writes. Otherwise, we'd run into the coherency 
problem (between udmabuf and memfd) as demonstrated in the selftest
(patch #3 in this series).

> 
> Trigger a move when the backing memory changes and re-acquire it with
AFAICS, without this patch or adding new change_pte calls, there is no way to
get notified when a new page is mapped into the backing memory of a memfd
(backed by shmem or hugetlbfs) which happens after a hole punch followed
by writes. We can definitely get notified when a hole is punched via the
invalidate notifiers though, but as I described earlier this is not very helpful
for the udmabuf use-case.

> hmm_range_fault like everything else does.
> 
> > And replace mmu_notifier_update_mapping(vma->vm_mm, address,
> pte_pfn(*ptep))
> > in the current patch with
> > mmu_notifier_change_pte(vma->vm_mm, address, ptep, false));
> 
> It isn't very useful because nothing can do anything meaningful under
> the PTLs. Can't allocate memory for instance. Which makes me wonder
> what it is udmabuf plans to actually do here.
It is useful for udmabuf because it helps ensure coherency with the memfd.
If you look at patch #2 in this series particularly the notifier callback (update_udmabuf),
it just updates its list of pages and does not allocate any memory or do anything
that would cause it to go to sleep other than taking a mutex which I plan to drop
in v2 as it is not really needed. With that removed, I think it seems ok to call the
notifier callback under the PTL.

Thanks,
Vivek

> 
> JAson
Kasireddy, Vivek July 24, 2023, 8:42 p.m. UTC | #13
Hi Alistair,

> >>
> >> Yes, although obviously as I think you point out below you wouldn't be
> >> able to take any sleeping locks in mmu_notifier_update_mapping().
> > Yes, I understand that, but I am not sure how we can prevent any potential
> > notifier callback from taking sleeping locks other than adding clear
> comments.
> 
> Oh of course not, but is such a restriction on not taking sleeping locks
> acceptable for your implementation of the notifier callback? I notice in
> patch 2 update_udmabuf() takes a mutex so I assumed not being able to
> sleep in the callback would be an issue.
I plan to drop the mutex in v2 which is not really needed as I described in
my previous reply because we ensure Guest and Host synchronization via
other means. 

> 
> >>
> >> > In which case I'd need to make a similar change in the shmem path as
> well.
> >> > And, also redo (or eliminate) the locking in udmabuf (patch) which
> seems a
> >> > bit excessive on a second look given our use-case (where reads and
> writes
> >> do
> >> > not happen simultaneously due to fence synchronization in the guest
> >> driver).
> >>
> >> I'm not at all familiar with the udmabuf use case but that sounds
> >> brittle and effectively makes this notifier udmabuf specific right?
> > Oh, Qemu uses the udmabuf driver to provide Host Graphics components
> > (such as Spice, Gstreamer, UI, etc) zero-copy access to Guest created
> > buffers. In other words, from a core mm standpoint, udmabuf just
> > collects a bunch of pages (associated with buffers) scattered inside
> > the memfd (Guest ram backed by shmem or hugetlbfs) and wraps
> > them in a dmabuf fd. And, since we provide zero-copy access, we
> > use DMA fences to ensure that the components on the Host and
> > Guest do not access the buffer simultaneously.
> 
> Thanks for the background!
> 
> >> contemplated adding a notifier for PTE updates for drivers using
> >> hmm_range_fault() as it would save some expensive device faults and it
> >> this could be useful for that.
> >>
> >> So if we're adding a notifier for PTE updates I think it would be good
> >> if it covered all cases and was robust enough to allow mirroring of the
> >> correct PTE value (ie. by being called under PTL or via some other
> >> synchronisation like hmm_range_fault()).
> > Ok; in order to make it clear that the notifier is associated with PTE
> updates,
> > I think it needs to have a more suitable name such as
> mmu_notifier_update_pte()
> > or mmu_notifier_new_pte(). But we already have
> mmu_notifier_change_pte,
> > which IIUC is used mainly for PTE updates triggered by KSM. So, I am
> inclining
> > towards dropping this new notifier and instead adding a new flag to
> change_pte
> > to distinguish between KSM triggered notifications and others. Something
> along
> > the lines of:
> > diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
> > index 218ddc3b4bc7..6afce2287143 100644
> > --- a/include/linux/mmu_notifier.h
> > +++ b/include/linux/mmu_notifier.h
> > @@ -129,7 +129,8 @@ struct mmu_notifier_ops {
> >         void (*change_pte)(struct mmu_notifier *subscription,
> >                            struct mm_struct *mm,
> >                            unsigned long address,
> > -                          pte_t pte);
> > +                          pte_t pte,
> > +                          bool ksm_update);
> > @@ -658,7 +659,7 @@ static inline void mmu_notifier_range_init_owner(
> >         unsigned long ___address = __address;                           \
> >         pte_t ___pte = __pte;                                           \
> >                                                                         \
> > -       mmu_notifier_change_pte(___mm, ___address, ___pte);             \
> > +       mmu_notifier_change_pte(___mm, ___address, ___pte, true);       \
> >
> > And replace mmu_notifier_update_mapping(vma->vm_mm, address,
> pte_pfn(*ptep))
> > in the current patch with
> > mmu_notifier_change_pte(vma->vm_mm, address, ptep, false));
> 
> I wonder if we actually need the flag? IIUC it is already used for more
> than just KSM. For example it can be called as part of fault handling by
> set_pte_at_notify() in in wp_page_copy().
Yes, I noticed that but what I really meant is I'd put all these prior instances
of change_pte in one category using the flag. Without the flag, KVM, the only
user that currently has a callback for change_pte would get notified which
may not be appropriate. Note that the change_pte callback for KVM was
added (based on Git log) for KSM updates and it is not clear to me if that
is still the case.

> 
> > Would that work for your HMM use-case -- assuming we call change_pte
> after
> > taking PTL?
> 
> I suspect being called under the PTL could be an issue. For HMM we use a
> combination of sequence numbers and a mutex to synchronise PTEs. To
> avoid calling the notifier while holding PTL we might be able to record
> the sequence number (subscriptions->invalidate_seq) while holding PTL,
> release the PTL and provide that sequence number to the notifier
> callback along with the PTE.
> 
> Some form of mmu_interval_read_retry() could then be used to detect if
> the PTE has changed between dropping the PTL and calling the
> update_pte()/change_pte() notifier.
> 
> Of course if your design can handle being called while holding the PTL
> then the above is probably unnecessarily complex for your use-case.
Yes, I believe we can handle it while holding the PTL. 

> 
> My primary issue with this patch is the notifier is called without the
> PTL while providing a PTE value. Without some form of synchronisation it
> isn't safe to use the result of eg. pte_page(pte) or pte_write(pte) in
> the notifier callback. Based on your comments it seems udmabuf might
> have some other synchronisation that makes it safe, but being external
> to the notifier calls make it's hard to reason about.
I intend to fix the PTL issue in v2 but I am still not sure what is the best
thing to do as far as the notifier is concerned given the following options:
- Keep this patch (and notifier name) but ensure that it is called under PTL
- Drop this patch and expand the use of change_pte but add the flag to
  distinguish between prior usage and new usage
- Keep this patch but don't include the PTE or the pfn of the new page as
  part of the notifier. In other words, just have this:
mmu_notifier_update_mapping(struct mm_struct *mm, unsigned long address)
This way, in udmabuf driver, we could get the new page from the page cache
as soon as we get notified:
	mapoff = linear_page_index(vma, address);
	new_page = find_get_page(vma->vm_file->f_mapping, mapoff);
This last option would probably limit the new notifier to the udmabuf
use-case but I do not intend to pursue it as you suggested that you are
also interested in a new notifier associated with PTE updates.

Thanks,
Vivek

> 
>  - Alistair
> 
> > Thanks,
> > Vivek
> >
> >>
> >> Thanks.
> >>
> >> > Thanks,
> >> > Vivek
> >> >
> >> >>
> >> >> > +		return ret;
> >> >> >
> >> >> >  	ret = 0;
> >> >> >
> >> >> > @@ -6223,6 +6227,9 @@ vm_fault_t hugetlb_fault(struct mm_struct
> >> *mm,
> >> >> struct vm_area_struct *vma,
> >> >> >  	 */
> >> >> >  	if (need_wait_lock)
> >> >> >  		folio_wait_locked(folio);
> >> >> > +	if (!ret)
> >> >> > +		mmu_notifier_update_mapping(vma->vm_mm,
> address,
> >> >> > +					    pte_pfn(*ptep));
> >> >> >  	return ret;
> >> >> >  }
> >> >> >
> >> >> > diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
> >> >> > index 50c0dde1354f..6421405334b9 100644
> >> >> > --- a/mm/mmu_notifier.c
> >> >> > +++ b/mm/mmu_notifier.c
> >> >> > @@ -441,6 +441,23 @@ void __mmu_notifier_change_pte(struct
> >> >> mm_struct *mm, unsigned long address,
> >> >> >  	srcu_read_unlock(&srcu, id);
> >> >> >  }
> >> >> >
> >> >> > +void __mmu_notifier_update_mapping(struct mm_struct *mm,
> >> unsigned
> >> >> long address,
> >> >> > +				   unsigned long pfn)
> >> >> > +{
> >> >> > +	struct mmu_notifier *subscription;
> >> >> > +	int id;
> >> >> > +
> >> >> > +	id = srcu_read_lock(&srcu);
> >> >> > +	hlist_for_each_entry_rcu(subscription,
> >> >> > +				 &mm->notifier_subscriptions->list,
> hlist,
> >> >> > +				 srcu_read_lock_held(&srcu)) {
> >> >> > +		if (subscription->ops->update_mapping)
> >> >> > +			subscription->ops-
> >update_mapping(subscription,
> >> >> mm,
> >> >> > +							  address,
> pfn);
> >> >> > +	}
> >> >> > +	srcu_read_unlock(&srcu, id);
> >> >> > +}
> >> >> > +
> >> >> >  static int mn_itree_invalidate(struct mmu_notifier_subscriptions
> >> >> *subscriptions,
> >> >> >  			       const struct mmu_notifier_range *range)
> >> >> >  {
> >> >> > diff --git a/mm/shmem.c b/mm/shmem.c
> >> >> > index 2f2e0e618072..e59eb5fafadb 100644
> >> >> > --- a/mm/shmem.c
> >> >> > +++ b/mm/shmem.c
> >> >> > @@ -77,6 +77,7 @@ static struct vfsmount *shm_mnt;
> >> >> >  #include <linux/fcntl.h>
> >> >> >  #include <uapi/linux/memfd.h>
> >> >> >  #include <linux/rmap.h>
> >> >> > +#include <linux/mmu_notifier.h>
> >> >> >  #include <linux/uuid.h>
> >> >> >
> >> >> >  #include <linux/uaccess.h>
> >> >> > @@ -2164,8 +2165,12 @@ static vm_fault_t shmem_fault(struct
> >> vm_fault
> >> >> *vmf)
> >> >> >  				  gfp, vma, vmf, &ret);
> >> >> >  	if (err)
> >> >> >  		return vmf_error(err);
> >> >> > -	if (folio)
> >> >> > +	if (folio) {
> >> >> >  		vmf->page = folio_file_page(folio, vmf->pgoff);
> >> >> > +		if (ret == VM_FAULT_LOCKED)
> >> >> > +			mmu_notifier_update_mapping(vma-
> >vm_mm, vmf-
> >> >> >address,
> >> >> > +						    page_to_pfn(vmf-
> >page));
> >> >> > +	}
> >> >> >  	return ret;
> >> >> >  }
Alistair Popple July 25, 2023, 3:14 a.m. UTC | #14
"Kasireddy, Vivek" <vivek.kasireddy@intel.com> writes:

> Hi Alistair,

Hi Vivek,

>> I wonder if we actually need the flag? IIUC it is already used for more
>> than just KSM. For example it can be called as part of fault handling by
>> set_pte_at_notify() in in wp_page_copy().
> Yes, I noticed that but what I really meant is I'd put all these prior instances
> of change_pte in one category using the flag. Without the flag, KVM, the only
> user that currently has a callback for change_pte would get notified which
> may not be appropriate. Note that the change_pte callback for KVM was
> added (based on Git log) for KSM updates and it is not clear to me if that
> is still the case.

It is certainly now called from contexts other than KSM though. I have
no idea whether that's a problem, nor if adding more callers would
actually be an issue though so understand the motivation for the flag.

>> > Would that work for your HMM use-case -- assuming we call change_pte
>> after
>> > taking PTL?
>> 
>> I suspect being called under the PTL could be an issue. For HMM we use a
>> combination of sequence numbers and a mutex to synchronise PTEs. To
>> avoid calling the notifier while holding PTL we might be able to record
>> the sequence number (subscriptions->invalidate_seq) while holding PTL,
>> release the PTL and provide that sequence number to the notifier
>> callback along with the PTE.
>> 
>> Some form of mmu_interval_read_retry() could then be used to detect if
>> the PTE has changed between dropping the PTL and calling the
>> update_pte()/change_pte() notifier.
>> 
>> Of course if your design can handle being called while holding the PTL
>> then the above is probably unnecessarily complex for your use-case.
> Yes, I believe we can handle it while holding the PTL. 
>
>> 
>> My primary issue with this patch is the notifier is called without the
>> PTL while providing a PTE value. Without some form of synchronisation it
>> isn't safe to use the result of eg. pte_page(pte) or pte_write(pte) in
>> the notifier callback. Based on your comments it seems udmabuf might
>> have some other synchronisation that makes it safe, but being external
>> to the notifier calls make it's hard to reason about.
> I intend to fix the PTL issue in v2 but I am still not sure what is the best
> thing to do as far as the notifier is concerned given the following options:
> - Keep this patch (and notifier name) but ensure that it is called under PTL

I think that's preferable to adding a flag so long as it's implemented
and documented that this is called whenever a PTE is updated. Otherwise
a third user will come along and have the same problem we've currently
got with KVMs usage.

> - Drop this patch and expand the use of change_pte but add the flag to
>   distinguish between prior usage and new usage
> - Keep this patch but don't include the PTE or the pfn of the new page as
>   part of the notifier. In other words, just have this:
> mmu_notifier_update_mapping(struct mm_struct *mm, unsigned long address)
> This way, in udmabuf driver, we could get the new page from the page cache
> as soon as we get notified:
> 	mapoff = linear_page_index(vma, address);
> 	new_page = find_get_page(vma->vm_file->f_mapping, mapoff);
> This last option would probably limit the new notifier to the udmabuf
> use-case but I do not intend to pursue it as you suggested that you are
> also interested in a new notifier associated with PTE updates.

Actually the last option isn't limiting assuming it's sent whenever the
PTE is updated. It just means users have to use hmm_range_fault() or
some equivalent that already enforces proper synchronisation if they
need the actual PTE value. That seems fine to me.

> Thanks,
> Vivek
>
>> 
>>  - Alistair
>> 
>> > Thanks,
>> > Vivek
>> >
>> >>
>> >> Thanks.
>> >>
>> >> > Thanks,
>> >> > Vivek
>> >> >
>> >> >>
>> >> >> > +		return ret;
>> >> >> >
>> >> >> >  	ret = 0;
>> >> >> >
>> >> >> > @@ -6223,6 +6227,9 @@ vm_fault_t hugetlb_fault(struct mm_struct
>> >> *mm,
>> >> >> struct vm_area_struct *vma,
>> >> >> >  	 */
>> >> >> >  	if (need_wait_lock)
>> >> >> >  		folio_wait_locked(folio);
>> >> >> > +	if (!ret)
>> >> >> > +		mmu_notifier_update_mapping(vma->vm_mm,
>> address,
>> >> >> > +					    pte_pfn(*ptep));
>> >> >> >  	return ret;
>> >> >> >  }
>> >> >> >
>> >> >> > diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
>> >> >> > index 50c0dde1354f..6421405334b9 100644
>> >> >> > --- a/mm/mmu_notifier.c
>> >> >> > +++ b/mm/mmu_notifier.c
>> >> >> > @@ -441,6 +441,23 @@ void __mmu_notifier_change_pte(struct
>> >> >> mm_struct *mm, unsigned long address,
>> >> >> >  	srcu_read_unlock(&srcu, id);
>> >> >> >  }
>> >> >> >
>> >> >> > +void __mmu_notifier_update_mapping(struct mm_struct *mm,
>> >> unsigned
>> >> >> long address,
>> >> >> > +				   unsigned long pfn)
>> >> >> > +{
>> >> >> > +	struct mmu_notifier *subscription;
>> >> >> > +	int id;
>> >> >> > +
>> >> >> > +	id = srcu_read_lock(&srcu);
>> >> >> > +	hlist_for_each_entry_rcu(subscription,
>> >> >> > +				 &mm->notifier_subscriptions->list,
>> hlist,
>> >> >> > +				 srcu_read_lock_held(&srcu)) {
>> >> >> > +		if (subscription->ops->update_mapping)
>> >> >> > +			subscription->ops-
>> >update_mapping(subscription,
>> >> >> mm,
>> >> >> > +							  address,
>> pfn);
>> >> >> > +	}
>> >> >> > +	srcu_read_unlock(&srcu, id);
>> >> >> > +}
>> >> >> > +
>> >> >> >  static int mn_itree_invalidate(struct mmu_notifier_subscriptions
>> >> >> *subscriptions,
>> >> >> >  			       const struct mmu_notifier_range *range)
>> >> >> >  {
>> >> >> > diff --git a/mm/shmem.c b/mm/shmem.c
>> >> >> > index 2f2e0e618072..e59eb5fafadb 100644
>> >> >> > --- a/mm/shmem.c
>> >> >> > +++ b/mm/shmem.c
>> >> >> > @@ -77,6 +77,7 @@ static struct vfsmount *shm_mnt;
>> >> >> >  #include <linux/fcntl.h>
>> >> >> >  #include <uapi/linux/memfd.h>
>> >> >> >  #include <linux/rmap.h>
>> >> >> > +#include <linux/mmu_notifier.h>
>> >> >> >  #include <linux/uuid.h>
>> >> >> >
>> >> >> >  #include <linux/uaccess.h>
>> >> >> > @@ -2164,8 +2165,12 @@ static vm_fault_t shmem_fault(struct
>> >> vm_fault
>> >> >> *vmf)
>> >> >> >  				  gfp, vma, vmf, &ret);
>> >> >> >  	if (err)
>> >> >> >  		return vmf_error(err);
>> >> >> > -	if (folio)
>> >> >> > +	if (folio) {
>> >> >> >  		vmf->page = folio_file_page(folio, vmf->pgoff);
>> >> >> > +		if (ret == VM_FAULT_LOCKED)
>> >> >> > +			mmu_notifier_update_mapping(vma-
>> >vm_mm, vmf-
>> >> >> >address,
>> >> >> > +						    page_to_pfn(vmf-
>> >page));
>> >> >> > +	}
>> >> >> >  	return ret;
>> >> >> >  }
Alistair Popple July 25, 2023, 3:38 a.m. UTC | #15
Jason Gunthorpe <jgg@nvidia.com> writes:

> On Mon, Jul 24, 2023 at 07:54:38AM +0000, Kasireddy, Vivek wrote:

>> And replace mmu_notifier_update_mapping(vma->vm_mm, address, pte_pfn(*ptep))
>> in the current patch with
>> mmu_notifier_change_pte(vma->vm_mm, address, ptep, false));
>
> It isn't very useful because nothing can do anything meaningful under
> the PTLs. Can't allocate memory for instance.

Yeah, although you already have to be pretty careful allocating memory
from a notifier while holding any lock as it may trigger direct reclaim
which can call the same notifier leading to a recursive lock.

A totally different discussion though.

> JAson
Hugh Dickins July 25, 2023, 4:30 a.m. UTC | #16
On Mon, 24 Jul 2023, Kasireddy, Vivek wrote:
> Hi Jason,
> > On Mon, Jul 24, 2023 at 07:54:38AM +0000, Kasireddy, Vivek wrote:
> > 
> > > > I'm not at all familiar with the udmabuf use case but that sounds
> > > > brittle and effectively makes this notifier udmabuf specific right?
> > > Oh, Qemu uses the udmabuf driver to provide Host Graphics components
> > > (such as Spice, Gstreamer, UI, etc) zero-copy access to Guest created
> > > buffers. In other words, from a core mm standpoint, udmabuf just
> > > collects a bunch of pages (associated with buffers) scattered inside
> > > the memfd (Guest ram backed by shmem or hugetlbfs) and wraps
> > > them in a dmabuf fd. And, since we provide zero-copy access, we
> > > use DMA fences to ensure that the components on the Host and
> > > Guest do not access the buffer simultaneously.
> > 
> > So why do you need to track updates proactively like this?
> As David noted in the earlier series, if Qemu punches a hole in its memfd
> that goes through pages that are registered against a udmabuf fd, then
> udmabuf needs to update its list with new pages when the hole gets
> filled after (guest) writes. Otherwise, we'd run into the coherency 
> problem (between udmabuf and memfd) as demonstrated in the selftest
> (patch #3 in this series).

Wouldn't this all be very much better if Qemu stopped punching holes there?

Hugh
Jason Gunthorpe July 25, 2023, 12:36 p.m. UTC | #17
On Mon, Jul 24, 2023 at 08:32:45PM +0000, Kasireddy, Vivek wrote:
> Hi Jason,
> 
> > 
> > On Mon, Jul 24, 2023 at 07:54:38AM +0000, Kasireddy, Vivek wrote:
> > 
> > > > I'm not at all familiar with the udmabuf use case but that sounds
> > > > brittle and effectively makes this notifier udmabuf specific right?
> > > Oh, Qemu uses the udmabuf driver to provide Host Graphics components
> > > (such as Spice, Gstreamer, UI, etc) zero-copy access to Guest created
> > > buffers. In other words, from a core mm standpoint, udmabuf just
> > > collects a bunch of pages (associated with buffers) scattered inside
> > > the memfd (Guest ram backed by shmem or hugetlbfs) and wraps
> > > them in a dmabuf fd. And, since we provide zero-copy access, we
> > > use DMA fences to ensure that the components on the Host and
> > > Guest do not access the buffer simultaneously.
> > 
> > So why do you need to track updates proactively like this?
> As David noted in the earlier series, if Qemu punches a hole in its memfd
> that goes through pages that are registered against a udmabuf fd, then
> udmabuf needs to update its list with new pages when the hole gets
> filled after (guest) writes. Otherwise, we'd run into the coherency 
> problem (between udmabuf and memfd) as demonstrated in the selftest
> (patch #3 in this series).

Holes created in VMA are tracked by invalidation, you haven't
explained why this needs to also see change.

BTW it is very jarring to hear you talk about files when working with
mmu notifiers. MMU notifiers do not track hole punches or memfds, they
track VMAs and PTEs. Punching a hole in a mmapped memfd will
invalidate the convering PTEs.

> > Trigger a move when the backing memory changes and re-acquire it with
> AFAICS, without this patch or adding new change_pte calls, there is no way to
> get notified when a new page is mapped into the backing memory of a memfd
> (backed by shmem or hugetlbfs) which happens after a hole punch followed
> by writes. 

Yes, we have never wanted to do this because is it racy.

If you still need the memory mapped then you re-call hmm_range_fault
and re-obtain it. hmm_range_fault will resolve all the races and you
get new pages.

> We can definitely get notified when a hole is punched via the
> invalidate notifiers though, but as I described earlier this is not very helpful
> for the udmabuf use-case.

I still don't understand why, or what makes udmabuf so special
compared to all the other places tracking VMA changes and using
hmm_range_fault.

Jason
Kasireddy, Vivek July 25, 2023, 10:24 p.m. UTC | #18
Hi Hugh,

> 
> On Mon, 24 Jul 2023, Kasireddy, Vivek wrote:
> > Hi Jason,
> > > On Mon, Jul 24, 2023 at 07:54:38AM +0000, Kasireddy, Vivek wrote:
> > >
> > > > > I'm not at all familiar with the udmabuf use case but that sounds
> > > > > brittle and effectively makes this notifier udmabuf specific right?
> > > > Oh, Qemu uses the udmabuf driver to provide Host Graphics
> components
> > > > (such as Spice, Gstreamer, UI, etc) zero-copy access to Guest created
> > > > buffers. In other words, from a core mm standpoint, udmabuf just
> > > > collects a bunch of pages (associated with buffers) scattered inside
> > > > the memfd (Guest ram backed by shmem or hugetlbfs) and wraps
> > > > them in a dmabuf fd. And, since we provide zero-copy access, we
> > > > use DMA fences to ensure that the components on the Host and
> > > > Guest do not access the buffer simultaneously.
> > >
> > > So why do you need to track updates proactively like this?
> > As David noted in the earlier series, if Qemu punches a hole in its memfd
> > that goes through pages that are registered against a udmabuf fd, then
> > udmabuf needs to update its list with new pages when the hole gets
> > filled after (guest) writes. Otherwise, we'd run into the coherency
> > problem (between udmabuf and memfd) as demonstrated in the selftest
> > (patch #3 in this series).
> 
> Wouldn't this all be very much better if Qemu stopped punching holes there?
I think holes can be punched anywhere in the memfd for various reasons. Some
of the use-cases where this would be done were identified by David. Here is what
he said in an earlier discussion:
"There are *probably* more issues on the QEMU side when udmabuf is paired 
with things like MADV_DONTNEED/FALLOC_FL_PUNCH_HOLE used for 
virtio-balloon, virtio-mem, postcopy live migration, ... for example, in"

Thanks,
Vivek

> 
> Hugh
Kasireddy, Vivek July 25, 2023, 10:44 p.m. UTC | #19
Hi Jason,

> > >
> > > > > I'm not at all familiar with the udmabuf use case but that sounds
> > > > > brittle and effectively makes this notifier udmabuf specific right?
> > > > Oh, Qemu uses the udmabuf driver to provide Host Graphics
> components
> > > > (such as Spice, Gstreamer, UI, etc) zero-copy access to Guest created
> > > > buffers. In other words, from a core mm standpoint, udmabuf just
> > > > collects a bunch of pages (associated with buffers) scattered inside
> > > > the memfd (Guest ram backed by shmem or hugetlbfs) and wraps
> > > > them in a dmabuf fd. And, since we provide zero-copy access, we
> > > > use DMA fences to ensure that the components on the Host and
> > > > Guest do not access the buffer simultaneously.
> > >
> > > So why do you need to track updates proactively like this?
> > As David noted in the earlier series, if Qemu punches a hole in its memfd
> > that goes through pages that are registered against a udmabuf fd, then
> > udmabuf needs to update its list with new pages when the hole gets
> > filled after (guest) writes. Otherwise, we'd run into the coherency
> > problem (between udmabuf and memfd) as demonstrated in the selftest
> > (patch #3 in this series).
> 
> Holes created in VMA are tracked by invalidation, you haven't
> explained why this needs to also see change.
Oh, the invalidation part is ok and does not need any changes. My concern
(and the reason for this new notifier patch) is only about the lack of a
notification when a PTE is updated because of a fault (new page). In other
words, if something like change_pte() would have been called after
handle_pte_fault() or hugetlb_fault(), then this patch would not be needed.

> 
> BTW it is very jarring to hear you talk about files when working with
> mmu notifiers. MMU notifiers do not track hole punches or memfds, they
> track VMAs and PTEs. Punching a hole in a mmapped memfd will
> invalidate the convering PTEs.
I figured describing the problem in terms of memfds or hole punches would
provide more context; but, ok, I'll refrain from mentioning memfds or holes
and limit the discussion of this patch to VMAs and PTEs. 

> 
> > > Trigger a move when the backing memory changes and re-acquire it with
> > AFAICS, without this patch or adding new change_pte calls, there is no way
> to
> > get notified when a new page is mapped into the backing memory of a
> memfd
> > (backed by shmem or hugetlbfs) which happens after a hole punch
> followed
> > by writes.
> 
> Yes, we have never wanted to do this because is it racy.
> 
> If you still need the memory mapped then you re-call hmm_range_fault
> and re-obtain it. hmm_range_fault will resolve all the races and you
> get new pages.
IIUC, for my udmabuf use-case, it looks like calling hmm_range_fault
immediately after an invalidate (range notification) would preemptively fault in
new pages before a write. The problem with that is if a read occurs on those
new pages, then the data is incorrect as a write may not have happened yet.
Ideally, what I am looking for is for getting new pages at the time of or after
a write; until then, it is ok to use the old pages given my use-case.

> 
> > We can definitely get notified when a hole is punched via the
> > invalidate notifiers though, but as I described earlier this is not very helpful
> > for the udmabuf use-case.
> 
> I still don't understand why, or what makes udmabuf so special
> compared to all the other places tracking VMA changes and using
> hmm_range_fault.
I think the difference comes down to whether we (udmabuf driver) want to
grab the new pages after getting notified about a PTE update because of a fault
triggered by a write vs proactively obtaining the new pages by triggering the
fault (since hmm_range_fault() seems to call handle_mm_fault()) before a
potential write.

Thanks,
Vivek

> 
> Jason
Jason Gunthorpe July 25, 2023, 10:53 p.m. UTC | #20
On Tue, Jul 25, 2023 at 10:44:09PM +0000, Kasireddy, Vivek wrote:
> > If you still need the memory mapped then you re-call hmm_range_fault
> > and re-obtain it. hmm_range_fault will resolve all the races and you
> > get new pages.

> IIUC, for my udmabuf use-case, it looks like calling hmm_range_fault
> immediately after an invalidate (range notification) would preemptively fault in
> new pages before a write. The problem with that is if a read occurs on those
> new pages, then the data is incorrect as a write may not have
> happened yet.

It cannot be, if you use hmm_range_fault correctly you cannot get
corruption no matter what is done to the mmap'd memfd. If there is
otherwise it is a hmm_range_fault bug plain and simple.

> Ideally, what I am looking for is for getting new pages at the time of or after
> a write; until then, it is ok to use the old pages given my use-case.

It is wrong, if you are synchronizing the vma then you must use the
latest copy. If your use case can tolerate it then keep a 'not
present' indication for the missing pages until you actually need
them, but dmabuf doesn't really provide an API for that.

> I think the difference comes down to whether we (udmabuf driver) want to
> grab the new pages after getting notified about a PTE update because
> of a fault

Why? You still haven't explained why you want this.

If you are writing to the pages then you have to do this

If you are reading from the pages then hmm_range_fault should return
the zero page for a hole until it is written too

Jason
Kasireddy, Vivek July 27, 2023, 7:34 a.m. UTC | #21
Hi Jason,

> 
> On Tue, Jul 25, 2023 at 10:44:09PM +0000, Kasireddy, Vivek wrote:
> > > If you still need the memory mapped then you re-call hmm_range_fault
> > > and re-obtain it. hmm_range_fault will resolve all the races and you
> > > get new pages.
> 
> > IIUC, for my udmabuf use-case, it looks like calling hmm_range_fault
> > immediately after an invalidate (range notification) would preemptively
> fault in
> > new pages before a write. The problem with that is if a read occurs on
> those
> > new pages, then the data is incorrect as a write may not have
> > happened yet.
> 
> It cannot be, if you use hmm_range_fault correctly you cannot get
> corruption no matter what is done to the mmap'd memfd. If there is
> otherwise it is a hmm_range_fault bug plain and simple.
> 
> > Ideally, what I am looking for is for getting new pages at the time of or after
> > a write; until then, it is ok to use the old pages given my use-case.
> 
> It is wrong, if you are synchronizing the vma then you must use the
> latest copy. If your use case can tolerate it then keep a 'not
> present' indication for the missing pages until you actually need
> them, but dmabuf doesn't really provide an API for that.
> 
> > I think the difference comes down to whether we (udmabuf driver) want to
> > grab the new pages after getting notified about a PTE update because
> > of a fault
> 
> Why? You still haven't explained why you want this.
Ok, let me explain using one of the udmabuf selftests (added in patch #3)
to describe the problem (sorry, I'd have to use the terms memfd, hole, etc)
I am trying to solve:
        size = MEMFD_SIZE * page_size;
        memfd = create_memfd_with_seals(size, false);
        addr1 = mmap_fd(memfd, size);
        write_to_memfd(addr1, size, 'a');
        buf = create_udmabuf_list(devfd, memfd, size);
        addr2 = mmap_fd(buf, NUM_PAGES * NUM_ENTRIES * getpagesize());
        punch_hole(memfd, MEMFD_SIZE / 2);
   -> At this point, if I were to read addr1, it'd still have "a" in relevant areas
        because a new write hasn't happened yet. And, since this results in an
        invalidation (notification) of the associated VMA range, I could register
        a callback in udmabuf driver and get notified but I am not sure how or
        why that would be useful.

        write_to_memfd(addr1, size, 'b');
   -> Here, the hole gets refilled as a result of the above writes which trigger
        faults and the PTEs are updated to point to new pages. When this happens,
        the udmabuf driver needs to be made aware of the new pages that were
        faulted in because of the new writes. Since there does not appear to be
        a way to get notified when the hole is written to, the solution I came up
        with is to either add a new notifier or add calls to change_pte() when the
        PTEs do get updated. However, considering your suggestion to use
        hmm_range_fault(), it is not clear to me how it would help while the hole
        is being written to as the writes occur outside of the udmabuf driver. And,
        there is no way to get notified or track them either, AFAICS from inside the
        udmabuf driver.

Thanks,
Vivek

> 
> If you are writing to the pages then you have to do this
> 
> If you are reading from the pages then hmm_range_fault should return
> the zero page for a hole until it is written too
> 
> Jason
Jason Gunthorpe July 27, 2023, 11:58 a.m. UTC | #22
On Thu, Jul 27, 2023 at 07:34:30AM +0000, Kasireddy, Vivek wrote:
> Hi Jason,
> 
> > 
> > On Tue, Jul 25, 2023 at 10:44:09PM +0000, Kasireddy, Vivek wrote:
> > > > If you still need the memory mapped then you re-call hmm_range_fault
> > > > and re-obtain it. hmm_range_fault will resolve all the races and you
> > > > get new pages.
> > 
> > > IIUC, for my udmabuf use-case, it looks like calling hmm_range_fault
> > > immediately after an invalidate (range notification) would preemptively
> > fault in
> > > new pages before a write. The problem with that is if a read occurs on
> > those
> > > new pages, then the data is incorrect as a write may not have
> > > happened yet.
> > 
> > It cannot be, if you use hmm_range_fault correctly you cannot get
> > corruption no matter what is done to the mmap'd memfd. If there is
> > otherwise it is a hmm_range_fault bug plain and simple.
> > 
> > > Ideally, what I am looking for is for getting new pages at the time of or after
> > > a write; until then, it is ok to use the old pages given my use-case.
> > 
> > It is wrong, if you are synchronizing the vma then you must use the
> > latest copy. If your use case can tolerate it then keep a 'not
> > present' indication for the missing pages until you actually need
> > them, but dmabuf doesn't really provide an API for that.
> > 
> > > I think the difference comes down to whether we (udmabuf driver) want to
> > > grab the new pages after getting notified about a PTE update because
> > > of a fault
> > 
> > Why? You still haven't explained why you want this.
> Ok, let me explain using one of the udmabuf selftests (added in patch #3)
> to describe the problem (sorry, I'd have to use the terms memfd, hole, etc)
> I am trying to solve:
>         size = MEMFD_SIZE * page_size;
>         memfd = create_memfd_with_seals(size, false);
>         addr1 = mmap_fd(memfd, size);
>         write_to_memfd(addr1, size, 'a');
>         buf = create_udmabuf_list(devfd, memfd, size);
>         addr2 = mmap_fd(buf, NUM_PAGES * NUM_ENTRIES * getpagesize());
>         punch_hole(memfd, MEMFD_SIZE / 2);
>    -> At this point, if I were to read addr1, it'd still have "a" in relevant areas
>         because a new write hasn't happened yet. And, since this results in an
>         invalidation (notification) of the associated VMA range, I could register
>         a callback in udmabuf driver and get notified but I am not sure how or
>         why that would be useful.

When you get an invalidation you trigger dmabuf move, which revokes
the importes use of the dmabuf because the underlying memory has
changed. This is exactly the same as a GPU driver migrating memory
to/fro CPU memory.

> 
>         write_to_memfd(addr1, size, 'b');
>    -> Here, the hole gets refilled as a result of the above writes which trigger
>         faults and the PTEs are updated to point to new pages. When this happens,
>         the udmabuf driver needs to be made aware of the new pages that were
>         faulted in because of the new writes. 

You only need this because you are not processing the invalidate.

>         a way to get notified when the hole is written to, the solution I came up
>         with is to either add a new notifier or add calls to change_pte() when the
>         PTEs do get updated. However, considering your suggestion to use
>         hmm_range_fault(), it is not clear to me how it would help while the hole
>         is being written to as the writes occur outside of the
>         udmabuf driver. 

You have the design backwards.

When a dmabuf importer asks for the dmabuf to be present you call
hmm_range_fault() and you get back whatever memory is appropriate. The
importer can then use it.

If the underlying memory changes then you get the invalidation and you
trigger move. The importer stops using the memory and the underlying
pages change.

Later the importer decides it needs the memory again so it again asks
for the dmabuf to be present, which does hmm_range_fault and gets
whatever is appropriate at the time.

Jason
Peter Xu July 27, 2023, 9:43 p.m. UTC | #23
Hi, Vivek,

On Tue, Jul 25, 2023 at 10:24:21PM +0000, Kasireddy, Vivek wrote:
> Hi Hugh,
> 
> > 
> > On Mon, 24 Jul 2023, Kasireddy, Vivek wrote:
> > > Hi Jason,
> > > > On Mon, Jul 24, 2023 at 07:54:38AM +0000, Kasireddy, Vivek wrote:
> > > >
> > > > > > I'm not at all familiar with the udmabuf use case but that sounds
> > > > > > brittle and effectively makes this notifier udmabuf specific right?
> > > > > Oh, Qemu uses the udmabuf driver to provide Host Graphics
> > components
> > > > > (such as Spice, Gstreamer, UI, etc) zero-copy access to Guest created
> > > > > buffers. In other words, from a core mm standpoint, udmabuf just
> > > > > collects a bunch of pages (associated with buffers) scattered inside
> > > > > the memfd (Guest ram backed by shmem or hugetlbfs) and wraps
> > > > > them in a dmabuf fd. And, since we provide zero-copy access, we
> > > > > use DMA fences to ensure that the components on the Host and
> > > > > Guest do not access the buffer simultaneously.
> > > >
> > > > So why do you need to track updates proactively like this?
> > > As David noted in the earlier series, if Qemu punches a hole in its memfd
> > > that goes through pages that are registered against a udmabuf fd, then
> > > udmabuf needs to update its list with new pages when the hole gets
> > > filled after (guest) writes. Otherwise, we'd run into the coherency
> > > problem (between udmabuf and memfd) as demonstrated in the selftest
> > > (patch #3 in this series).
> > 
> > Wouldn't this all be very much better if Qemu stopped punching holes there?
> I think holes can be punched anywhere in the memfd for various reasons. Some

I just start to read this thread, even haven't finished all of them.. but
so far I'm not sure whether this is right at all..

udmabuf is a file, it means it should follow the file semantics. mmu
notifier is per-mm, otoh.

Imagine for some reason QEMU mapped the guest pages twice, udmabuf is
created with vma1, so udmabuf registers the mm changes over vma1 only.

However the shmem/hugetlb page cache can be populated in either vma1, or
vma2.  It means when populating on vma2 udmabuf won't get update notify at
all, udmabuf pages can still be obsolete.  Same thing to when multi-process
QEMU is used, where we can have vma1 in QEMU while vma2 in the other
process like vhost-user.

I think the trick here is we tried to "hide" the fact that these are
actually normal file pages, but we're doing PFNMAP on them... then we want
the file features back, like hole punching..

If we used normal file operations, everything will just work fine; TRUNCATE
will unmap the host mapped frame buffers when needed, and when accessed
it'll fault on demand from the page cache.  We seem to be trying to
reinvent "truncation" for pfnmap but mmu notifier doesn't sound right to
this at least..

> of the use-cases where this would be done were identified by David. Here is what
> he said in an earlier discussion:
> "There are *probably* more issues on the QEMU side when udmabuf is paired 
> with things like MADV_DONTNEED/FALLOC_FL_PUNCH_HOLE used for 
> virtio-balloon, virtio-mem, postcopy live migration, ... for example, in"

Now after seething this, I'm truly wondering whether we can still simply
use the file semantics we already have (for either shmem/hugetlb/...), or
is it a must we need to use a single fd to represent all?

Say, can we just use a tuple (fd, page_array) rather than the udmabuf
itself to do host zero-copy mapping?  the page_array can be e.g. a list of
file offsets that points to the pages (rather than pinning the pages using
FOLL_GET).  The good thing is then the fd can be the guest memory file
itself.  With that, we can mmap() over the shmem/hugetlb in whatever vma
and whatever process.  Truncation (and actually everything... e.g. page
migration, swapping, ... which will be disabled if we use PFNMAP pins) will
just all start to work, afaiu.

Thanks,
Kasireddy, Vivek July 29, 2023, 12:08 a.m. UTC | #24
Hi Peter,

> > > > > > > I'm not at all familiar with the udmabuf use case but that sounds
> > > > > > > brittle and effectively makes this notifier udmabuf specific right?
> > > > > > Oh, Qemu uses the udmabuf driver to provide Host Graphics
> > > components
> > > > > > (such as Spice, Gstreamer, UI, etc) zero-copy access to Guest created
> > > > > > buffers. In other words, from a core mm standpoint, udmabuf just
> > > > > > collects a bunch of pages (associated with buffers) scattered inside
> > > > > > the memfd (Guest ram backed by shmem or hugetlbfs) and wraps
> > > > > > them in a dmabuf fd. And, since we provide zero-copy access, we
> > > > > > use DMA fences to ensure that the components on the Host and
> > > > > > Guest do not access the buffer simultaneously.
> > > > >
> > > > > So why do you need to track updates proactively like this?
> > > > As David noted in the earlier series, if Qemu punches a hole in its
> memfd
> > > > that goes through pages that are registered against a udmabuf fd, then
> > > > udmabuf needs to update its list with new pages when the hole gets
> > > > filled after (guest) writes. Otherwise, we'd run into the coherency
> > > > problem (between udmabuf and memfd) as demonstrated in the
> selftest
> > > > (patch #3 in this series).
> > >
> > > Wouldn't this all be very much better if Qemu stopped punching holes
> there?
> > I think holes can be punched anywhere in the memfd for various reasons.
> Some
> 
> I just start to read this thread, even haven't finished all of them.. but
> so far I'm not sure whether this is right at all..
> 
> udmabuf is a file, it means it should follow the file semantics. Mmu
Right, it is a file but a special type of file given that it is a dmabuf. So, AFAIK,
operations such as truncate, FALLOC_FL_PUNCH_HOLE, etc cannot be done
on it. And, in our use-case, since udmabuf driver is sharing (or exporting) its
buffer (via the fd), consumers (or importers) of the dmabuf fd are expected
to only read from it.

> notifier is per-mm, otoh.
> 
> Imagine for some reason QEMU mapped the guest pages twice, udmabuf is
> created with vma1, so udmabuf registers the mm changes over vma1 only.
Udmabufs are created with pages obtained from the mapping using offsets
provided by Qemu. 

> 
> However the shmem/hugetlb page cache can be populated in either vma1, or
> vma2.  It means when populating on vma2 udmabuf won't get update notify
> at
> all, udmabuf pages can still be obsolete.  Same thing to when multi-process
In this (unlikely) scenario you described above, I think we could still find all the
VMAs (and ranges) where the guest buffer pages are mapped (and register
for PTE updates) using Qemu's mm_struct. The below code can be modified
to create a list of VMAs where the guest buffer pages are mapped.
static struct vm_area_struct *find_guest_ram_vma(struct udmabuf *ubuf,
                                                 struct mm_struct *vmm_mm)
{
        struct vm_area_struct *vma = NULL;
        MA_STATE(mas, &vmm_mm->mm_mt, 0, 0);
        unsigned long addr;
        pgoff_t pg;

        mas_set(&mas, 0);
        mmap_read_lock(vmm_mm);
        mas_for_each(&mas, vma, ULONG_MAX) {
                for (pg = 0; pg < ubuf->pagecount; pg++) {
                        addr = page_address_in_vma(ubuf->pages[pg], vma);
                        if (addr == -EFAULT)
                                break;
                }
                if (addr != -EFAULT)
                        break;
        }
        mmap_read_unlock(vmm_mm);

        return vma;
}

> QEMU is used, where we can have vma1 in QEMU while vma2 in the other
> process like vhost-user.
> 
> I think the trick here is we tried to "hide" the fact that these are
> actually normal file pages, but we're doing PFNMAP on them... then we want
> the file features back, like hole punching..
> 
> If we used normal file operations, everything will just work fine; TRUNCATE
> will unmap the host mapped frame buffers when needed, and when
> accessed
> it'll fault on demand from the page cache.  We seem to be trying to
> reinvent "truncation" for pfnmap but mmu notifier doesn't sound right to
> this at least..
If we can figure out the VMA ranges where the guest buffer pages are mapped,
we should be able to register mmu notifiers for those ranges right?

> 
> > of the use-cases where this would be done were identified by David. Here
> is what
> > he said in an earlier discussion:
> > "There are *probably* more issues on the QEMU side when udmabuf is
> paired
> > with things like MADV_DONTNEED/FALLOC_FL_PUNCH_HOLE used for
> > virtio-balloon, virtio-mem, postcopy live migration, ... for example, in"
> 
> Now after seething this, I'm truly wondering whether we can still simply
> use the file semantics we already have (for either shmem/hugetlb/...), or
> is it a must we need to use a single fd to represent all?
> 
> Say, can we just use a tuple (fd, page_array) rather than the udmabuf
> itself to do host zero-copy mapping?  the page_array can be e.g. a list of
That (tuple) is essentially what we are doing (with udmabuf) but in a
standardized way that follows convention using the dmabuf buffer sharing
framework that all the importers (other drivers and userspace components)
know and understand.

> file offsets that points to the pages (rather than pinning the pages using
If we are using the dmabuf framework, the pages must be pinned when the
importers map them.

> FOLL_GET).  The good thing is then the fd can be the guest memory file
> itself.  With that, we can mmap() over the shmem/hugetlb in whatever vma
> and whatever process.  Truncation (and actually everything... e.g. page
> migration, swapping, ... which will be disabled if we use PFNMAP pins) will
> just all start to work, afaiu.
IIUC, we'd not be able to use the fd of the guest memory file because the
dmabuf fds are expected to have constant size that reflects the size of the
buffer that is being shared. I just don't think it'd be feasible given all the
other restrictions:
https://www.kernel.org/doc/html/latest/driver-api/dma-buf.html?highlight=dma_buf#userspace-interface-notes

Thanks,
Vivek


> 
> Thanks,
> 
> --
> Peter Xu
Kasireddy, Vivek July 29, 2023, 12:46 a.m. UTC | #25
Hi Jason,

> > > > > If you still need the memory mapped then you re-call
> hmm_range_fault
> > > > > and re-obtain it. hmm_range_fault will resolve all the races and you
> > > > > get new pages.
> > >
> > > > IIUC, for my udmabuf use-case, it looks like calling hmm_range_fault
> > > > immediately after an invalidate (range notification) would preemptively
> > > fault in
> > > > new pages before a write. The problem with that is if a read occurs on
> > > those
> > > > new pages, then the data is incorrect as a write may not have
> > > > happened yet.
> > >
> > > It cannot be, if you use hmm_range_fault correctly you cannot get
> > > corruption no matter what is done to the mmap'd memfd. If there is
> > > otherwise it is a hmm_range_fault bug plain and simple.
> > >
> > > > Ideally, what I am looking for is for getting new pages at the time of or
> after
> > > > a write; until then, it is ok to use the old pages given my use-case.
> > >
> > > It is wrong, if you are synchronizing the vma then you must use the
> > > latest copy. If your use case can tolerate it then keep a 'not
> > > present' indication for the missing pages until you actually need
> > > them, but dmabuf doesn't really provide an API for that.
> > >
> > > > I think the difference comes down to whether we (udmabuf driver)
> want to
> > > > grab the new pages after getting notified about a PTE update because
> > > > of a fault
> > >
> > > Why? You still haven't explained why you want this.
> > Ok, let me explain using one of the udmabuf selftests (added in patch #3)
> > to describe the problem (sorry, I'd have to use the terms memfd, hole, etc)
> > I am trying to solve:
> >         size = MEMFD_SIZE * page_size;
> >         memfd = create_memfd_with_seals(size, false);
> >         addr1 = mmap_fd(memfd, size);
> >         write_to_memfd(addr1, size, 'a');
> >         buf = create_udmabuf_list(devfd, memfd, size);
> >         addr2 = mmap_fd(buf, NUM_PAGES * NUM_ENTRIES * getpagesize());
> >         punch_hole(memfd, MEMFD_SIZE / 2);
> >    -> At this point, if I were to read addr1, it'd still have "a" in relevant areas
> >         because a new write hasn't happened yet. And, since this results in an
> >         invalidation (notification) of the associated VMA range, I could register
> >         a callback in udmabuf driver and get notified but I am not sure how or
> >         why that would be useful.
> 
> When you get an invalidation you trigger dmabuf move, which revokes
> the importes use of the dmabuf because the underlying memory has
> changed. This is exactly the same as a GPU driver migrating memory
> to/fro CPU memory.
> 
> >
> >         write_to_memfd(addr1, size, 'b');
> >    -> Here, the hole gets refilled as a result of the above writes which trigger
> >         faults and the PTEs are updated to point to new pages. When this
> happens,
> >         the udmabuf driver needs to be made aware of the new pages that
> were
> >         faulted in because of the new writes.
> 
> You only need this because you are not processing the invalidate.
> 
> >         a way to get notified when the hole is written to, the solution I came
> up
> >         with is to either add a new notifier or add calls to change_pte() when
> the
> >         PTEs do get updated. However, considering your suggestion to use
> >         hmm_range_fault(), it is not clear to me how it would help while the
> hole
> >         is being written to as the writes occur outside of the
> >         udmabuf driver.
> 
> You have the design backwards.
> 
> When a dmabuf importer asks for the dmabuf to be present you call
> hmm_range_fault() and you get back whatever memory is appropriate. The
> importer can then use it.
> 
> If the underlying memory changes then you get the invalidation and you
> trigger move. The importer stops using the memory and the underlying
> pages change.
> 
> Later the importer decides it needs the memory again so it again asks
> for the dmabuf to be present, which does hmm_range_fault and gets
> whatever is appropriate at the time.
Unless I am missing something, I think just doing the above still won't solve
the problem. Consider this sequence:
     write_to_memfd(addr1, size, 'a');
     buf = create_udmabuf_list(devfd, memfd, size);
     addr2 = mmap_fd(buf, NUM_PAGES * NUM_ENTRIES * getpagesize());
     read(addr2);
     write_to_memfd(addr1, size, 'b');
     punch_hole(memfd, MEMFD_SIZE / 2);
-> Since we can process the invalidate at this point, as per your suggestion,
     we can trigger dmabuf move to let the importers know that the dmabuf's
     backing memory has changed (or moved).

     read(addr2);
-> Because there is a hole, we can handle the read by either providing the
     old pages or zero pages (if using hmm_range_fault()) to the importers.
     Maybe it is against convention, but I think it makes sense to provide old
     pages (that were mapped before the hole punch) because the importers
     have not read the data in these pages ('b' above) yet. And, another reason
     to provide old pages is because the data in these pages is shown in a window
     on the Host's screen so it doesn't make sense to show zero page data.

-> write_to_memfd(addr1, size, 'c');
     As the hole gets refilled (with new pages) after the above write, AFAIU, we
     have to tell the importers again that since the backing memory has changed,
     (new pages) they need to recreate their mappings. But herein lies the problem:
     from inside the udmabuf driver, we cannot know when this write occurs, so we
     would not be able to notify the importers of the dmabuf move. Since Qemu knows
     about this write, I was initially thinking of adding a new udmabuf IOCTL (something
     like UDMABUF_PREPARE) to have Qemu let udmabuf know after writes occur.
     This would provide an opportunity after an invalidate to notify the importers of
     the (dmabuf) move. I think this would solve the problem with changes only to the
     udmabuf driver (including adding the new IOCTL + handling the invalidate) but I
     was hoping to solve it in a generic way by adding a new notifier or using change_pte()
     to get notified about PTE updates (when faults occur).

Thanks,
Vivek

> Jason
Jason Gunthorpe July 30, 2023, 11:09 p.m. UTC | #26
On Sat, Jul 29, 2023 at 12:46:59AM +0000, Kasireddy, Vivek wrote:

> > Later the importer decides it needs the memory again so it again asks
> > for the dmabuf to be present, which does hmm_range_fault and gets
> > whatever is appropriate at the time.
> Unless I am missing something, I think just doing the above still won't solve
> the problem. Consider this sequence:
>      write_to_memfd(addr1, size, 'a');
>      buf = create_udmabuf_list(devfd, memfd, size);
>      addr2 = mmap_fd(buf, NUM_PAGES * NUM_ENTRIES * getpagesize());
>      read(addr2);
>      write_to_memfd(addr1, size, 'b');
>      punch_hole(memfd, MEMFD_SIZE / 2);
> -> Since we can process the invalidate at this point, as per your suggestion,
>      we can trigger dmabuf move to let the importers know that the dmabuf's
>      backing memory has changed (or moved).
> 
>      read(addr2);
> -> Because there is a hole, we can handle the read by either providing the
>      old pages or zero pages (if using hmm_range_fault()) to the
> importers.

You never provide the old pages. After trunctate the only correct
value to read is zero.

>      Maybe it is against convention, but I think it makes sense to provide old
>      pages (that were mapped before the hole punch) because the importers
>      have not read the data in these pages ('b' above) yet.

Nope.

>      And, another reason to provide old pages is because the data in
>      these pages is shown in a window on the Host's screen so it
>      doesn't make sense to show zero page data.

So why did you trucate it if you want to keep the data?


> -> write_to_memfd(addr1, size, 'c');
>      As the hole gets refilled (with new pages) after the above write, AFAIU, we
>      have to tell the importers again that since the backing memory has changed,
>      (new pages) they need to recreate their mappings. But herein lies the problem:
>      from inside the udmabuf driver, we cannot know when this write occurs, so we
>      would not be able to notify the importers of the dmabuf move.

You get another invalidate because the memfd removes the zero pages
that hmm_range_fault installed in the PTEs before replacing them with
actual writable pages. Then you do the move, and another
hmm_range_fault, and basically the whole thing over again. Except this
time instead of returning zero pages it returns actual writable page.

Jason
Peter Xu July 31, 2023, 5:05 p.m. UTC | #27
On Sat, Jul 29, 2023 at 12:08:25AM +0000, Kasireddy, Vivek wrote:
> Hi Peter,
> 
> > > > > > > > I'm not at all familiar with the udmabuf use case but that sounds
> > > > > > > > brittle and effectively makes this notifier udmabuf specific right?
> > > > > > > Oh, Qemu uses the udmabuf driver to provide Host Graphics
> > > > components
> > > > > > > (such as Spice, Gstreamer, UI, etc) zero-copy access to Guest created
> > > > > > > buffers. In other words, from a core mm standpoint, udmabuf just
> > > > > > > collects a bunch of pages (associated with buffers) scattered inside
> > > > > > > the memfd (Guest ram backed by shmem or hugetlbfs) and wraps
> > > > > > > them in a dmabuf fd. And, since we provide zero-copy access, we
> > > > > > > use DMA fences to ensure that the components on the Host and
> > > > > > > Guest do not access the buffer simultaneously.
> > > > > >
> > > > > > So why do you need to track updates proactively like this?
> > > > > As David noted in the earlier series, if Qemu punches a hole in its
> > memfd
> > > > > that goes through pages that are registered against a udmabuf fd, then
> > > > > udmabuf needs to update its list with new pages when the hole gets
> > > > > filled after (guest) writes. Otherwise, we'd run into the coherency
> > > > > problem (between udmabuf and memfd) as demonstrated in the
> > selftest
> > > > > (patch #3 in this series).
> > > >
> > > > Wouldn't this all be very much better if Qemu stopped punching holes
> > there?
> > > I think holes can be punched anywhere in the memfd for various reasons.
> > Some
> > 
> > I just start to read this thread, even haven't finished all of them.. but
> > so far I'm not sure whether this is right at all..
> > 
> > udmabuf is a file, it means it should follow the file semantics. Mmu
> Right, it is a file but a special type of file given that it is a dmabuf. So, AFAIK,
> operations such as truncate, FALLOC_FL_PUNCH_HOLE, etc cannot be done
> on it. And, in our use-case, since udmabuf driver is sharing (or exporting) its
> buffer (via the fd), consumers (or importers) of the dmabuf fd are expected
> to only read from it.
> 
> > notifier is per-mm, otoh.
> > 
> > Imagine for some reason QEMU mapped the guest pages twice, udmabuf is
> > created with vma1, so udmabuf registers the mm changes over vma1 only.
> Udmabufs are created with pages obtained from the mapping using offsets
> provided by Qemu. 
> 
> > 
> > However the shmem/hugetlb page cache can be populated in either vma1, or
> > vma2.  It means when populating on vma2 udmabuf won't get update notify
> > at
> > all, udmabuf pages can still be obsolete.  Same thing to when multi-process
> In this (unlikely) scenario you described above,

IMHO it's very legal for qemu to do that, we won't want this to break so
easily and silently simply because qemu mapped it twice.  I would hope
it'll not be myself to debug something like that. :)

I actually personally have a tree that does exactly that:

https://github.com/xzpeter/qemu/commit/62050626d6e511d022953165cc0f604bf90c5324

But that's definitely not in main line.. it shouldn't need special
attention, either.  Just want to say that it can always happen for various
reasons especially in an relatively involved software piece like QEMU.

> I think we could still find all the
> VMAs (and ranges) where the guest buffer pages are mapped (and register
> for PTE updates) using Qemu's mm_struct. The below code can be modified
> to create a list of VMAs where the guest buffer pages are mapped.
> static struct vm_area_struct *find_guest_ram_vma(struct udmabuf *ubuf,
>                                                  struct mm_struct *vmm_mm)
> {
>         struct vm_area_struct *vma = NULL;
>         MA_STATE(mas, &vmm_mm->mm_mt, 0, 0);
>         unsigned long addr;
>         pgoff_t pg;
> 
>         mas_set(&mas, 0);
>         mmap_read_lock(vmm_mm);
>         mas_for_each(&mas, vma, ULONG_MAX) {
>                 for (pg = 0; pg < ubuf->pagecount; pg++) {
>                         addr = page_address_in_vma(ubuf->pages[pg], vma);
>                         if (addr == -EFAULT)
>                                 break;
>                 }
>                 if (addr != -EFAULT)
>                         break;
>         }
>         mmap_read_unlock(vmm_mm);
> 
>         return vma;
> }

This is hackish to me, and not working when across mm (multi-proc qemu).

> 
> > QEMU is used, where we can have vma1 in QEMU while vma2 in the other
> > process like vhost-user.
> > 
> > I think the trick here is we tried to "hide" the fact that these are
> > actually normal file pages, but we're doing PFNMAP on them... then we want
> > the file features back, like hole punching..
> > 
> > If we used normal file operations, everything will just work fine; TRUNCATE
> > will unmap the host mapped frame buffers when needed, and when
> > accessed
> > it'll fault on demand from the page cache.  We seem to be trying to
> > reinvent "truncation" for pfnmap but mmu notifier doesn't sound right to
> > this at least..
> If we can figure out the VMA ranges where the guest buffer pages are mapped,
> we should be able to register mmu notifiers for those ranges right?

In general, sorry to say that, but, mmu notifiers still do not sound like
the right approach here.

> 
> > 
> > > of the use-cases where this would be done were identified by David. Here
> > is what
> > > he said in an earlier discussion:
> > > "There are *probably* more issues on the QEMU side when udmabuf is
> > paired
> > > with things like MADV_DONTNEED/FALLOC_FL_PUNCH_HOLE used for
> > > virtio-balloon, virtio-mem, postcopy live migration, ... for example, in"
> > 
> > Now after seething this, I'm truly wondering whether we can still simply
> > use the file semantics we already have (for either shmem/hugetlb/...), or
> > is it a must we need to use a single fd to represent all?
> > 
> > Say, can we just use a tuple (fd, page_array) rather than the udmabuf
> > itself to do host zero-copy mapping?  the page_array can be e.g. a list of
> That (tuple) is essentially what we are doing (with udmabuf) but in a
> standardized way that follows convention using the dmabuf buffer sharing
> framework that all the importers (other drivers and userspace components)
> know and understand.
> 
> > file offsets that points to the pages (rather than pinning the pages using
> If we are using the dmabuf framework, the pages must be pinned when the
> importers map them.

Oh so the pages are for DMAs from hardwares, rather than accessed by the
host programs?

I really have merely zero knowledge from that aspect, sorry.  If so I don't
know how truncation can work with that, while keeping the page coherent.

Hugh asked why not QEMU just doesn't do that truncation, I'll then ask the
same.  Probably virtio-mem will not be able to work. I think postcopy will
not be affected - postcopy only drops pages at very early stage of dest
QEMU, not after VM started there, so either not affected or maybe there's
chance it'll work.

IIUC it's then the same as VFIO attached then we try to blow some pages
away from anything like virtio-balloon - AFAIR qemu just explicitly don't
allow that to happen.  See vfio_ram_block_discard_disable().

> 
> > FOLL_GET).  The good thing is then the fd can be the guest memory file
> > itself.  With that, we can mmap() over the shmem/hugetlb in whatever vma
> > and whatever process.  Truncation (and actually everything... e.g. page
> > migration, swapping, ... which will be disabled if we use PFNMAP pins) will
> > just all start to work, afaiu.
> IIUC, we'd not be able to use the fd of the guest memory file because the
> dmabuf fds are expected to have constant size that reflects the size of the
> buffer that is being shared. I just don't think it'd be feasible given all the
> other restrictions:
> https://www.kernel.org/doc/html/latest/driver-api/dma-buf.html?highlight=dma_buf#userspace-interface-notes

Yeah I also don't know well on the dmabuf APIs, but I think if the page
must be pinned for real world DMA then it's already another story to me..
what I said on the [guest_mem_fd, offset_array] tuple idea could only (if
still possible..) work if the udmabuf access is only from the processor
side, never from the device.

Thanks,
Kasireddy, Vivek Aug. 1, 2023, 5:32 a.m. UTC | #28
Hi Jason,

> 
> > > Later the importer decides it needs the memory again so it again asks
> > > for the dmabuf to be present, which does hmm_range_fault and gets
> > > whatever is appropriate at the time.
> > Unless I am missing something, I think just doing the above still won't solve
> > the problem. Consider this sequence:
> >      write_to_memfd(addr1, size, 'a');
> >      buf = create_udmabuf_list(devfd, memfd, size);
> >      addr2 = mmap_fd(buf, NUM_PAGES * NUM_ENTRIES * getpagesize());
> >      read(addr2);
> >      write_to_memfd(addr1, size, 'b');
> >      punch_hole(memfd, MEMFD_SIZE / 2);
> > -> Since we can process the invalidate at this point, as per your suggestion,
> >      we can trigger dmabuf move to let the importers know that the
> dmabuf's
> >      backing memory has changed (or moved).
> >
> >      read(addr2);
> > -> Because there is a hole, we can handle the read by either providing the
> >      old pages or zero pages (if using hmm_range_fault()) to the
> > importers.
> 
> You never provide the old pages. After trunctate the only correct
> value to read is zero.
> 
> >      Maybe it is against convention, but I think it makes sense to provide old
> >      pages (that were mapped before the hole punch) because the importers
> >      have not read the data in these pages ('b' above) yet.
> 
> Nope.
> 
> >      And, another reason to provide old pages is because the data in
> >      these pages is shown in a window on the Host's screen so it
> >      doesn't make sense to show zero page data.
> 
> So why did you trucate it if you want to keep the data?
> 
> 
> > -> write_to_memfd(addr1, size, 'c');
> >      As the hole gets refilled (with new pages) after the above write, AFAIU,
> we
> >      have to tell the importers again that since the backing memory has
> changed,
> >      (new pages) they need to recreate their mappings. But herein lies the
> problem:
> >      from inside the udmabuf driver, we cannot know when this write occurs,
> so we
> >      would not be able to notify the importers of the dmabuf move.
> 
> You get another invalidate because the memfd removes the zero pages
> that hmm_range_fault installed in the PTEs before replacing them with
> actual writable pages. Then you do the move, and another
> hmm_range_fault, and basically the whole thing over again. Except this
> time instead of returning zero pages it returns actual writable page.
Ok, when I tested earlier (by registering an invalidate callback) but without
hmm_range_fault(), I did not find this additional invalidate getting triggered.
Let me try with hmm_range_fault() and see if everything works as expected.
Thank you for your help.


Thanks,
Vivek

> 
> Jason
Kasireddy, Vivek Aug. 1, 2023, 7:11 a.m. UTC | #29
Hi Peter,

> >
> > > > > > > > > I'm not at all familiar with the udmabuf use case but that
> sounds
> > > > > > > > > brittle and effectively makes this notifier udmabuf specific
> right?
> > > > > > > > Oh, Qemu uses the udmabuf driver to provide Host Graphics
> > > > > components
> > > > > > > > (such as Spice, Gstreamer, UI, etc) zero-copy access to Guest
> created
> > > > > > > > buffers. In other words, from a core mm standpoint, udmabuf
> just
> > > > > > > > collects a bunch of pages (associated with buffers) scattered
> inside
> > > > > > > > the memfd (Guest ram backed by shmem or hugetlbfs) and
> wraps
> > > > > > > > them in a dmabuf fd. And, since we provide zero-copy access,
> we
> > > > > > > > use DMA fences to ensure that the components on the Host and
> > > > > > > > Guest do not access the buffer simultaneously.
> > > > > > >
> > > > > > > So why do you need to track updates proactively like this?
> > > > > > As David noted in the earlier series, if Qemu punches a hole in its
> > > memfd
> > > > > > that goes through pages that are registered against a udmabuf fd,
> then
> > > > > > udmabuf needs to update its list with new pages when the hole gets
> > > > > > filled after (guest) writes. Otherwise, we'd run into the coherency
> > > > > > problem (between udmabuf and memfd) as demonstrated in the
> > > selftest
> > > > > > (patch #3 in this series).
> > > > >
> > > > > Wouldn't this all be very much better if Qemu stopped punching holes
> > > there?
> > > > I think holes can be punched anywhere in the memfd for various
> reasons.
> > > Some
> > >
> > > I just start to read this thread, even haven't finished all of them.. but
> > > so far I'm not sure whether this is right at all..
> > >
> > > udmabuf is a file, it means it should follow the file semantics. Mmu
> > Right, it is a file but a special type of file given that it is a dmabuf. So, AFAIK,
> > operations such as truncate, FALLOC_FL_PUNCH_HOLE, etc cannot be done
> > on it. And, in our use-case, since udmabuf driver is sharing (or exporting)
> its
> > buffer (via the fd), consumers (or importers) of the dmabuf fd are expected
> > to only read from it.
> >
> > > notifier is per-mm, otoh.
> > >
> > > Imagine for some reason QEMU mapped the guest pages twice, udmabuf
> is
> > > created with vma1, so udmabuf registers the mm changes over vma1
> only.
> > Udmabufs are created with pages obtained from the mapping using offsets
> > provided by Qemu.
> >
> > >
> > > However the shmem/hugetlb page cache can be populated in either
> vma1, or
> > > vma2.  It means when populating on vma2 udmabuf won't get update
> notify
> > > at
> > > all, udmabuf pages can still be obsolete.  Same thing to when multi-
> process
> > In this (unlikely) scenario you described above,
> 
> IMHO it's very legal for qemu to do that, we won't want this to break so
> easily and silently simply because qemu mapped it twice.  I would hope
> it'll not be myself to debug something like that. :)
> 
> I actually personally have a tree that does exactly that:
> 
> https://github.com/xzpeter/qemu/commit/62050626d6e511d022953165cc0f
> 604bf90c5324
> 
> But that's definitely not in main line.. it shouldn't need special
> attention, either.  Just want to say that it can always happen for various
> reasons especially in an relatively involved software piece like QEMU.
Ok, I'll keep your use-case in mind but AFAICS, the process that creates
the udmabuf can be considered the owner. So, I think it makes sense that
the owner's VMA range can be registered (via mmu_notifiers) for updates.

> 
> > I think we could still find all the
> > VMAs (and ranges) where the guest buffer pages are mapped (and register
> > for PTE updates) using Qemu's mm_struct. The below code can be
> modified
> > to create a list of VMAs where the guest buffer pages are mapped.
> > static struct vm_area_struct *find_guest_ram_vma(struct udmabuf *ubuf,
> >                                                  struct mm_struct *vmm_mm)
> > {
> >         struct vm_area_struct *vma = NULL;
> >         MA_STATE(mas, &vmm_mm->mm_mt, 0, 0);
> >         unsigned long addr;
> >         pgoff_t pg;
> >
> >         mas_set(&mas, 0);
> >         mmap_read_lock(vmm_mm);
> >         mas_for_each(&mas, vma, ULONG_MAX) {
> >                 for (pg = 0; pg < ubuf->pagecount; pg++) {
> >                         addr = page_address_in_vma(ubuf->pages[pg], vma);
> >                         if (addr == -EFAULT)
> >                                 break;
> >                 }
> >                 if (addr != -EFAULT)
> >                         break;
> >         }
> >         mmap_read_unlock(vmm_mm);
> >
> >         return vma;
> > }
> 
> This is hackish to me, and not working when across mm (multi-proc qemu).
Udmabuf backend is still considered experimental for multi-proc qemu (i.e, Qemu +
vhost-user-gpu given our use-case). And, it looks like the usage of the udmabuf
driver in both cases is different. 

> 
> >
> > > QEMU is used, where we can have vma1 in QEMU while vma2 in the
> other
> > > process like vhost-user.
> > >
> > > I think the trick here is we tried to "hide" the fact that these are
> > > actually normal file pages, but we're doing PFNMAP on them... then we
> want
> > > the file features back, like hole punching..
> > >
> > > If we used normal file operations, everything will just work fine;
> TRUNCATE
> > > will unmap the host mapped frame buffers when needed, and when
> > > accessed
> > > it'll fault on demand from the page cache.  We seem to be trying to
> > > reinvent "truncation" for pfnmap but mmu notifier doesn't sound right to
> > > this at least..
> > If we can figure out the VMA ranges where the guest buffer pages are
> mapped,
> > we should be able to register mmu notifiers for those ranges right?
> 
> In general, sorry to say that, but, mmu notifiers still do not sound like
> the right approach here.
What limitation do you see with the usage of mmu notifiers for this use-case?
And, if using mmu notifiers is not the right approach, how do you suggest we
can solve this problem?

> 
> >
> > >
> > > > of the use-cases where this would be done were identified by David.
> Here
> > > is what
> > > > he said in an earlier discussion:
> > > > "There are *probably* more issues on the QEMU side when udmabuf is
> > > paired
> > > > with things like MADV_DONTNEED/FALLOC_FL_PUNCH_HOLE used for
> > > > virtio-balloon, virtio-mem, postcopy live migration, ... for example, in"
> > >
> > > Now after seething this, I'm truly wondering whether we can still simply
> > > use the file semantics we already have (for either shmem/hugetlb/...), or
> > > is it a must we need to use a single fd to represent all?
> > >
> > > Say, can we just use a tuple (fd, page_array) rather than the udmabuf
> > > itself to do host zero-copy mapping?  the page_array can be e.g. a list of
> > That (tuple) is essentially what we are doing (with udmabuf) but in a
> > standardized way that follows convention using the dmabuf buffer sharing
> > framework that all the importers (other drivers and userspace
> components)
> > know and understand.
> >
> > > file offsets that points to the pages (rather than pinning the pages using
> > If we are using the dmabuf framework, the pages must be pinned when the
> > importers map them.
> 
> Oh so the pages are for DMAs from hardwares, rather than accessed by the
> host programs?
GPU DMA is the main use-case but the fd (i.e, pages) can be consumed in
different ways. For local display support, the fd can be imported by the Host
GPU driver for DMA (if Qemu is launched with gl=on) or mmap'd by Qemu
UI module (if gl=off). For remote display support, Qemu shares the fd with
Spice which can either encode it using CPU based algorithms or GPU based
ones (H264/H265/VP8/VP9) using Gstreamer. 

> 
> I really have merely zero knowledge from that aspect, sorry.  If so I don't
> know how truncation can work with that, while keeping the page coherent.
> 
> Hugh asked why not QEMU just doesn't do that truncation, I'll then ask the
It is not just about truncation. My goal with this patch is to ensure that when
one or more (guest buffer) pages in the memfd are affected in any way (moved,
migrated, etc), udmabuf would take corrective action after getting notified.
And, given that the guest buffer pages can be scattered anywhere in the rather
large memfd, it seems likely that one or more pages might be impacted when
various other features (such as virto-mem/balloon, memory unplug) are enabled.

> same.  Probably virtio-mem will not be able to work. I think postcopy will
> not be affected - postcopy only drops pages at very early stage of dest
> QEMU, not after VM started there, so either not affected or maybe there's
> chance it'll work.
> 
> IIUC it's then the same as VFIO attached then we try to blow some pages
> away from anything like virtio-balloon - AFAIR qemu just explicitly don't
> allow that to happen.  See vfio_ram_block_discard_disable().
> 
> >
> > > FOLL_GET).  The good thing is then the fd can be the guest memory file
> > > itself.  With that, we can mmap() over the shmem/hugetlb in whatever
> vma
> > > and whatever process.  Truncation (and actually everything... e.g. page
> > > migration, swapping, ... which will be disabled if we use PFNMAP pins)
> will
> > > just all start to work, afaiu.
> > IIUC, we'd not be able to use the fd of the guest memory file because the
> > dmabuf fds are expected to have constant size that reflects the size of the
> > buffer that is being shared. I just don't think it'd be feasible given all the
> > other restrictions:
> > https://www.kernel.org/doc/html/latest/driver-api/dma-
> buf.html?highlight=dma_buf#userspace-interface-notes
> 
> Yeah I also don't know well on the dmabuf APIs, but I think if the page
> must be pinned for real world DMA then it's already another story to me..
Right, the pages need to be kept pinned as long as an importer is using
(mapping exists) them.

> what I said on the [guest_mem_fd, offset_array] tuple idea could only (if
> still possible..) work if the udmabuf access is only from the processor
> side, never from the device.
As of now, GPU is the device that would access the pages directly in addition
to the CPU but there are already patches (on qemu-devel) to facilitate DMA
access from other devices on the Host.

Thanks,
Vivek

> 
> Thanks,
> 
> --
> Peter Xu
>
Jason Gunthorpe Aug. 1, 2023, 12:19 p.m. UTC | #30
On Tue, Aug 01, 2023 at 05:32:38AM +0000, Kasireddy, Vivek wrote:

> > You get another invalidate because the memfd removes the zero pages
> > that hmm_range_fault installed in the PTEs before replacing them with
> > actual writable pages. Then you do the move, and another
> > hmm_range_fault, and basically the whole thing over again. Except this
> > time instead of returning zero pages it returns actual writable
> > page.

> Ok, when I tested earlier (by registering an invalidate callback) but without
> hmm_range_fault(), I did not find this additional invalidate getting triggered.
> Let me try with hmm_range_fault() and see if everything works as expected.
> Thank you for your help.

If you do not get an invalidate then there is a pretty serious bug in
the mm that needs fixing.

Anything hmm_range_fault() returns must be invalidated if the
underying CPU mapping changes for any reasons. Since hmm_range_fault()
will populate zero pages when reading from a hole in a memfd, it must
also get an invalidation when the zero pages are changed into writable
pages.

If you do not fault in the zero pages to the PTEs then you may not get
invalidate callbacks.

Jason
David Hildenbrand Aug. 1, 2023, 12:22 p.m. UTC | #31
On 01.08.23 14:19, Jason Gunthorpe wrote:
> On Tue, Aug 01, 2023 at 05:32:38AM +0000, Kasireddy, Vivek wrote:
> 
>>> You get another invalidate because the memfd removes the zero pages
>>> that hmm_range_fault installed in the PTEs before replacing them with
>>> actual writable pages. Then you do the move, and another
>>> hmm_range_fault, and basically the whole thing over again. Except this
>>> time instead of returning zero pages it returns actual writable
>>> page.
> 
>> Ok, when I tested earlier (by registering an invalidate callback) but without
>> hmm_range_fault(), I did not find this additional invalidate getting triggered.
>> Let me try with hmm_range_fault() and see if everything works as expected.
>> Thank you for your help.
> 
> If you do not get an invalidate then there is a pretty serious bug in
> the mm that needs fixing.
> 
> Anything hmm_range_fault() returns must be invalidated if the
> underying CPU mapping changes for any reasons. Since hmm_range_fault()
> will populate zero pages when reading from a hole in a memfd, it must
> also get an invalidation when the zero pages are changed into writable
> pages.

Can you point me at the code that returns that (shared) zero page?
Jason Gunthorpe Aug. 1, 2023, 12:23 p.m. UTC | #32
On Tue, Aug 01, 2023 at 02:22:12PM +0200, David Hildenbrand wrote:
> On 01.08.23 14:19, Jason Gunthorpe wrote:
> > On Tue, Aug 01, 2023 at 05:32:38AM +0000, Kasireddy, Vivek wrote:
> > 
> > > > You get another invalidate because the memfd removes the zero pages
> > > > that hmm_range_fault installed in the PTEs before replacing them with
> > > > actual writable pages. Then you do the move, and another
> > > > hmm_range_fault, and basically the whole thing over again. Except this
> > > > time instead of returning zero pages it returns actual writable
> > > > page.
> > 
> > > Ok, when I tested earlier (by registering an invalidate callback) but without
> > > hmm_range_fault(), I did not find this additional invalidate getting triggered.
> > > Let me try with hmm_range_fault() and see if everything works as expected.
> > > Thank you for your help.
> > 
> > If you do not get an invalidate then there is a pretty serious bug in
> > the mm that needs fixing.
> > 
> > Anything hmm_range_fault() returns must be invalidated if the
> > underying CPU mapping changes for any reasons. Since hmm_range_fault()
> > will populate zero pages when reading from a hole in a memfd, it must
> > also get an invalidation when the zero pages are changed into writable
> > pages.
> 
> Can you point me at the code that returns that (shared) zero page?

It calls handle_mm_fault() - shouldn't that do it? Same as if the CPU
read faulted the page?

Jason
David Hildenbrand Aug. 1, 2023, 12:26 p.m. UTC | #33
On 01.08.23 14:23, Jason Gunthorpe wrote:
> On Tue, Aug 01, 2023 at 02:22:12PM +0200, David Hildenbrand wrote:
>> On 01.08.23 14:19, Jason Gunthorpe wrote:
>>> On Tue, Aug 01, 2023 at 05:32:38AM +0000, Kasireddy, Vivek wrote:
>>>
>>>>> You get another invalidate because the memfd removes the zero pages
>>>>> that hmm_range_fault installed in the PTEs before replacing them with
>>>>> actual writable pages. Then you do the move, and another
>>>>> hmm_range_fault, and basically the whole thing over again. Except this
>>>>> time instead of returning zero pages it returns actual writable
>>>>> page.
>>>
>>>> Ok, when I tested earlier (by registering an invalidate callback) but without
>>>> hmm_range_fault(), I did not find this additional invalidate getting triggered.
>>>> Let me try with hmm_range_fault() and see if everything works as expected.
>>>> Thank you for your help.
>>>
>>> If you do not get an invalidate then there is a pretty serious bug in
>>> the mm that needs fixing.
>>>
>>> Anything hmm_range_fault() returns must be invalidated if the
>>> underying CPU mapping changes for any reasons. Since hmm_range_fault()
>>> will populate zero pages when reading from a hole in a memfd, it must
>>> also get an invalidation when the zero pages are changed into writable
>>> pages.
>>
>> Can you point me at the code that returns that (shared) zero page?
> 
> It calls handle_mm_fault() - shouldn't that do it? Same as if the CPU
> read faulted the page?

To the best of my knowledge, the shared zeropage is only used in 
MAP_PRIVATE|MAP_AON mappings and in weird DAX mappings.

If that changed, we have to fix FOLL_PIN|FOLL_LONGTERM for MAP_SHARED VMAs.

If you read-fault on a memfd hole, you should get a proper "zeroed" 
pagecache page that effectively "filled that hole" -- so there is no 
file hole anymore.
Jason Gunthorpe Aug. 1, 2023, 12:26 p.m. UTC | #34
On Tue, Aug 01, 2023 at 02:26:03PM +0200, David Hildenbrand wrote:
> On 01.08.23 14:23, Jason Gunthorpe wrote:
> > On Tue, Aug 01, 2023 at 02:22:12PM +0200, David Hildenbrand wrote:
> > > On 01.08.23 14:19, Jason Gunthorpe wrote:
> > > > On Tue, Aug 01, 2023 at 05:32:38AM +0000, Kasireddy, Vivek wrote:
> > > > 
> > > > > > You get another invalidate because the memfd removes the zero pages
> > > > > > that hmm_range_fault installed in the PTEs before replacing them with
> > > > > > actual writable pages. Then you do the move, and another
> > > > > > hmm_range_fault, and basically the whole thing over again. Except this
> > > > > > time instead of returning zero pages it returns actual writable
> > > > > > page.
> > > > 
> > > > > Ok, when I tested earlier (by registering an invalidate callback) but without
> > > > > hmm_range_fault(), I did not find this additional invalidate getting triggered.
> > > > > Let me try with hmm_range_fault() and see if everything works as expected.
> > > > > Thank you for your help.
> > > > 
> > > > If you do not get an invalidate then there is a pretty serious bug in
> > > > the mm that needs fixing.
> > > > 
> > > > Anything hmm_range_fault() returns must be invalidated if the
> > > > underying CPU mapping changes for any reasons. Since hmm_range_fault()
> > > > will populate zero pages when reading from a hole in a memfd, it must
> > > > also get an invalidation when the zero pages are changed into writable
> > > > pages.
> > > 
> > > Can you point me at the code that returns that (shared) zero page?
> > 
> > It calls handle_mm_fault() - shouldn't that do it? Same as if the CPU
> > read faulted the page?
> 
> To the best of my knowledge, the shared zeropage is only used in
> MAP_PRIVATE|MAP_AON mappings and in weird DAX mappings.
> 
> If that changed, we have to fix FOLL_PIN|FOLL_LONGTERM for MAP_SHARED VMAs.
> 
> If you read-fault on a memfd hole, you should get a proper "zeroed"
> pagecache page that effectively "filled that hole" -- so there is no file
> hole anymore.

Sounds fine then :)

Jason
David Hildenbrand Aug. 1, 2023, 12:28 p.m. UTC | #35
On 01.08.23 14:26, Jason Gunthorpe wrote:
> On Tue, Aug 01, 2023 at 02:26:03PM +0200, David Hildenbrand wrote:
>> On 01.08.23 14:23, Jason Gunthorpe wrote:
>>> On Tue, Aug 01, 2023 at 02:22:12PM +0200, David Hildenbrand wrote:
>>>> On 01.08.23 14:19, Jason Gunthorpe wrote:
>>>>> On Tue, Aug 01, 2023 at 05:32:38AM +0000, Kasireddy, Vivek wrote:
>>>>>
>>>>>>> You get another invalidate because the memfd removes the zero pages
>>>>>>> that hmm_range_fault installed in the PTEs before replacing them with
>>>>>>> actual writable pages. Then you do the move, and another
>>>>>>> hmm_range_fault, and basically the whole thing over again. Except this
>>>>>>> time instead of returning zero pages it returns actual writable
>>>>>>> page.
>>>>>
>>>>>> Ok, when I tested earlier (by registering an invalidate callback) but without
>>>>>> hmm_range_fault(), I did not find this additional invalidate getting triggered.
>>>>>> Let me try with hmm_range_fault() and see if everything works as expected.
>>>>>> Thank you for your help.
>>>>>
>>>>> If you do not get an invalidate then there is a pretty serious bug in
>>>>> the mm that needs fixing.
>>>>>
>>>>> Anything hmm_range_fault() returns must be invalidated if the
>>>>> underying CPU mapping changes for any reasons. Since hmm_range_fault()
>>>>> will populate zero pages when reading from a hole in a memfd, it must
>>>>> also get an invalidation when the zero pages are changed into writable
>>>>> pages.
>>>>
>>>> Can you point me at the code that returns that (shared) zero page?
>>>
>>> It calls handle_mm_fault() - shouldn't that do it? Same as if the CPU
>>> read faulted the page?
>>
>> To the best of my knowledge, the shared zeropage is only used in
>> MAP_PRIVATE|MAP_AON mappings and in weird DAX mappings.
>>
>> If that changed, we have to fix FOLL_PIN|FOLL_LONGTERM for MAP_SHARED VMAs.
>>
>> If you read-fault on a memfd hole, you should get a proper "zeroed"
>> pagecache page that effectively "filled that hole" -- so there is no file
>> hole anymore.
> 
> Sounds fine then :)

Right, the "the zero pages are changed into writable pages" in your 
above comment just might not apply, because there won't be any page 
replacement (hopefully :) ).
Kasireddy, Vivek Aug. 1, 2023, 5:53 p.m. UTC | #36
Hi David,

> 
> On 01.08.23 14:26, Jason Gunthorpe wrote:
> > On Tue, Aug 01, 2023 at 02:26:03PM +0200, David Hildenbrand wrote:
> >> On 01.08.23 14:23, Jason Gunthorpe wrote:
> >>> On Tue, Aug 01, 2023 at 02:22:12PM +0200, David Hildenbrand wrote:
> >>>> On 01.08.23 14:19, Jason Gunthorpe wrote:
> >>>>> On Tue, Aug 01, 2023 at 05:32:38AM +0000, Kasireddy, Vivek wrote:
> >>>>>
> >>>>>>> You get another invalidate because the memfd removes the zero
> pages
> >>>>>>> that hmm_range_fault installed in the PTEs before replacing them
> with
> >>>>>>> actual writable pages. Then you do the move, and another
> >>>>>>> hmm_range_fault, and basically the whole thing over again. Except
> this
> >>>>>>> time instead of returning zero pages it returns actual writable
> >>>>>>> page.
> >>>>>
> >>>>>> Ok, when I tested earlier (by registering an invalidate callback) but
> without
> >>>>>> hmm_range_fault(), I did not find this additional invalidate getting
> triggered.
> >>>>>> Let me try with hmm_range_fault() and see if everything works as
> expected.
> >>>>>> Thank you for your help.
> >>>>>
> >>>>> If you do not get an invalidate then there is a pretty serious bug in
> >>>>> the mm that needs fixing.
> >>>>>
> >>>>> Anything hmm_range_fault() returns must be invalidated if the
> >>>>> underying CPU mapping changes for any reasons. Since
> hmm_range_fault()
> >>>>> will populate zero pages when reading from a hole in a memfd, it must
> >>>>> also get an invalidation when the zero pages are changed into writable
> >>>>> pages.
> >>>>
> >>>> Can you point me at the code that returns that (shared) zero page?
> >>>
> >>> It calls handle_mm_fault() - shouldn't that do it? Same as if the CPU
> >>> read faulted the page?
> >>
> >> To the best of my knowledge, the shared zeropage is only used in
> >> MAP_PRIVATE|MAP_AON mappings and in weird DAX mappings.
> >>
> >> If that changed, we have to fix FOLL_PIN|FOLL_LONGTERM for
> MAP_SHARED VMAs.
> >>
> >> If you read-fault on a memfd hole, you should get a proper "zeroed"
> >> pagecache page that effectively "filled that hole" -- so there is no file
> >> hole anymore.
> >
> > Sounds fine then :)
> 
> Right, the "the zero pages are changed into writable pages" in your
> above comment just might not apply, because there won't be any page
> replacement (hopefully :) ).
If the page replacement does not happen when there are new writes to the
area where the hole previously existed, then would we still get an invalidate
when this happens? Is there any other way to get notified when the zeroed
page is written to if the invalidate does not get triggered?

Thanks,
Vivek

> 
> --
> Cheers,
> 
> David / dhildenb
Jason Gunthorpe Aug. 1, 2023, 6:19 p.m. UTC | #37
On Tue, Aug 01, 2023 at 05:53:32PM +0000, Kasireddy, Vivek wrote:

> > Right, the "the zero pages are changed into writable pages" in your
> > above comment just might not apply, because there won't be any page
> > replacement (hopefully :) ).

> If the page replacement does not happen when there are new writes to the
> area where the hole previously existed, then would we still get an invalidate
> when this happens? Is there any other way to get notified when the zeroed
> page is written to if the invalidate does not get triggered?

What David is saying is that memfd does not use the zero page
optimization for hole punches. Any access to the memory, including
read-only access through hmm_range_fault() will allocate unique
pages. Since there is no zero page and no zero-page replacement there
is no issue with invalidations.

Jason
Peter Xu Aug. 1, 2023, 9:57 p.m. UTC | #38
On Tue, Aug 01, 2023 at 07:11:09AM +0000, Kasireddy, Vivek wrote:
> Ok, I'll keep your use-case in mind but AFAICS, the process that creates
> the udmabuf can be considered the owner. So, I think it makes sense that
> the owner's VMA range can be registered (via mmu_notifiers) for updates.

No need to have your special attention on this; my use case is not anything
useful with details, just wanted to show the idea that virtual address
range based notification might not work.

[...]

> What limitation do you see with the usage of mmu notifiers for this use-case?
> And, if using mmu notifiers is not the right approach, how do you suggest we
> can solve this problem?

AFAIU, even if there'll be a notification chanism, it needs to be at least
in per-file address space (probably in file offsets) rather than per-mm for
a shmem backend, so that any mapping of the file should notify that.

Isn't it already too late though to wait that notification until page is
installed?  Because here you pinned the page for DMA, I think it means
before a new page installed (but after the page is invalidated) the device
can DMA to an invalid buffer.

To come back to the original question: I don't know how that could work at
all, the userapp should just never do that invalidation, because right
after it does, the dma buffer will be invalid, and the device can update
data into trash.  So.. I don't have an easy way to do this right.. besides
disabling ram discard just like what vfio does already.

Thanks,
Kasireddy, Vivek Aug. 3, 2023, 7:35 a.m. UTC | #39
Hi Jason,

> > > Right, the "the zero pages are changed into writable pages" in your
> > > above comment just might not apply, because there won't be any page
> > > replacement (hopefully :) ).
> 
> > If the page replacement does not happen when there are new writes to the
> > area where the hole previously existed, then would we still get an
> invalidate
> > when this happens? Is there any other way to get notified when the zeroed
> > page is written to if the invalidate does not get triggered?
> 
> What David is saying is that memfd does not use the zero page
> optimization for hole punches. Any access to the memory, including
> read-only access through hmm_range_fault() will allocate unique
> pages. Since there is no zero page and no zero-page replacement there
> is no issue with invalidations.
It looks like even with hmm_range_fault(), the invalidate does not get
triggered when the hole is refilled with new pages because of writes.
This is probably because hmm_range_fault() does not fault in any pages
that get invalidated later when writes occur. Not sure if there is a way to
request it to fill a hole with zero pages. Here is what I have in the
invalidate callback (added on top of this series):
static bool invalidate_udmabuf(struct mmu_interval_notifier *mn,
                               const struct mmu_notifier_range *range_mn,
                               unsigned long cur_seq)
{
        struct udmabuf_vma_range *range =
                        container_of(mn, struct udmabuf_vma_range, range_mn);
        struct udmabuf *ubuf = range->ubuf;
        struct hmm_range hrange = {0};
        unsigned long *pfns, num_pages, timeout;
        int i, ret;

        printk("invalidate; start = %lu, end = %lu\n",
               range->start, range->end);

        hrange.notifier = mn;
        hrange.default_flags = HMM_PFN_REQ_FAULT;
        hrange.start = max(range_mn->start, range->start);
        hrange.end = min(range_mn->end, range->end);
        num_pages = (hrange.end - hrange.start) >> PAGE_SHIFT;

        pfns = kmalloc_array(num_pages, sizeof(*pfns), GFP_KERNEL);
        if (!pfns)
                return true;

        printk("invalidate; num pages = %lu\n", num_pages);

        hrange.hmm_pfns = pfns;
        timeout = jiffies + msecs_to_jiffies(HMM_RANGE_DEFAULT_TIMEOUT);
        do {
                hrange.notifier_seq = mmu_interval_read_begin(mn);

                mmap_read_lock(ubuf->vmm_mm);
                ret = hmm_range_fault(&hrange);
                mmap_read_unlock(ubuf->vmm_mm);
                if (ret) {
                        if (ret == -EBUSY && !time_after(jiffies, timeout))
                                continue;
                        break;
                }

                if (mmu_interval_read_retry(mn, hrange.notifier_seq))
                        continue;
        } while (ret);

        if (!ret) {
                for (i = 0; i < num_pages; i++) {
                        printk("hmm returned page = %p; pfn = %lu\n",
                               hmm_pfn_to_page(pfns[i]),
                               pfns[i] & ~HMM_PFN_FLAGS);
                }
        }
        return true;
}

static const struct mmu_interval_notifier_ops udmabuf_invalidate_ops = {
        .invalidate = invalidate_udmabuf,
};

Here are the log messages I see when I run the udmabuf (shmem-based) selftest:
[  132.662863] invalidate; start = 140737347612672, end = 140737347629056
[  132.672953] invalidate; num pages = 4
[  132.676690] hmm returned page = 000000000483755d; pfn = 2595360
[  132.682676] hmm returned page = 00000000d5a87cc6; pfn = 2588133
[  132.688651] hmm returned page = 00000000f9eb8d20; pfn = 2673429
[  132.694629] hmm returned page = 000000005b44da27; pfn = 2588481
[  132.700605] invalidate; start = 140737348661248, end = 140737348677632
[  132.710672] invalidate; num pages = 4
[  132.714412] hmm returned page = 0000000002867206; pfn = 2680737
[  132.720394] hmm returned page = 00000000778a48f0; pfn = 2680738
[  132.726366] hmm returned page = 00000000d8adf162; pfn = 2680739
[  132.732350] hmm returned page = 00000000671769ff; pfn = 2680740

The above log messages are seen immediately after the hole is punched. As
you can see, hmm_range_fault() returns the pfns of old pages and not zero
pages. And, I see the below messages (with patch #2 in this series applied)
as the hole is refilled after writes:
[  160.279227] udpate mapping; old page = 000000000483755d; pfn = 2595360
[  160.285809] update mapping; new page = 00000000080e9595; pfn = 2680991
[  160.292402] udpate mapping; old page = 00000000d5a87cc6; pfn = 2588133
[  160.298979] update mapping; new page = 000000000483755d; pfn = 2595360
[  160.305574] udpate mapping; old page = 00000000f9eb8d20; pfn = 2673429
[  160.312154] update mapping; new page = 00000000d5a87cc6; pfn = 2588133
[  160.318744] udpate mapping; old page = 000000005b44da27; pfn = 2588481
[  160.325320] update mapping; new page = 00000000f9eb8d20; pfn = 2673429
[  160.333022] udpate mapping; old page = 0000000002867206; pfn = 2680737
[  160.339603] update mapping; new page = 000000003e2e9628; pfn = 2674703
[  160.346201] udpate mapping; old page = 00000000778a48f0; pfn = 2680738
[  160.352789] update mapping; new page = 0000000002867206; pfn = 2680737
[  160.359394] udpate mapping; old page = 00000000d8adf162; pfn = 2680739
[  160.365966] update mapping; new page = 00000000778a48f0; pfn = 2680738
[  160.372552] udpate mapping; old page = 00000000671769ff; pfn = 2680740
[  160.379131] update mapping; new page = 00000000d8adf162; pfn = 2680739

FYI, I ran this experiment with the kernel (6.5.0 RC1) from drm-tip.

Thanks,
Vivek

> 
> Jason
Kasireddy, Vivek Aug. 3, 2023, 8:08 a.m. UTC | #40
Hi Peter,

> > Ok, I'll keep your use-case in mind but AFAICS, the process that creates
> > the udmabuf can be considered the owner. So, I think it makes sense that
> > the owner's VMA range can be registered (via mmu_notifiers) for updates.
> 
> No need to have your special attention on this; my use case is not anything
> useful with details, just wanted to show the idea that virtual address
> range based notification might not work.
> 
> [...]
> 
> > What limitation do you see with the usage of mmu notifiers for this use-
> case?
> > And, if using mmu notifiers is not the right approach, how do you suggest
> we
> > can solve this problem?
> 
> AFAIU, even if there'll be a notification chanism, it needs to be at least
> in per-file address space (probably in file offsets) rather than per-mm for
> a shmem backend, so that any mapping of the file should notify that.
Yes, it makes sense that the notification in this case is a combination of
(mapping, offset). Not sure how challenging it'd be to add such a notification
mechanism that would be per-file address space. However, as discussed
earlier with Alistair, it appears there is some value in having something
similar with mmu notifiers:
mmu_notifier_update_mapping(struct mm_struct *mm, unsigned long address)
And, in the callback, we could get the new page either using hmm_range_fault()
or through the page cache as we get notified after the PTE gets updated:
 	mapoff = linear_page_index(vma, address);
 	new_page = find_get_page(vma->vm_file->f_mapping, mapoff);

> 
> Isn't it already too late though to wait that notification until page is
> installed?  Because here you pinned the page for DMA, I think it means
> before a new page installed (but after the page is invalidated) the device
> can DMA to an invalid buffer.
The page is only invalidated in the memfd. Until the hole is written to,
we (udmabuf) can choose to handle any reads (or DMA) using old pages
if needed.

> 
> To come back to the original question: I don't know how that could work at
> all, the userapp should just never do that invalidation, because right
> after it does, the dma buffer will be invalid, and the device can update
> data into trash.  So.. I don't have an easy way to do this right.. besides
> disabling ram discard just like what vfio does already.
Yeah, disabling ram discard is the last option if we cannot find a way to
safely get notified about mapping updates.

Thanks,
Vivek

> 
> Thanks,
> 
> --
> Peter Xu
>
Jason Gunthorpe Aug. 3, 2023, 12:14 p.m. UTC | #41
On Thu, Aug 03, 2023 at 07:35:51AM +0000, Kasireddy, Vivek wrote:
> Hi Jason,
> 
> > > > Right, the "the zero pages are changed into writable pages" in your
> > > > above comment just might not apply, because there won't be any page
> > > > replacement (hopefully :) ).
> > 
> > > If the page replacement does not happen when there are new writes to the
> > > area where the hole previously existed, then would we still get an
> > invalidate
> > > when this happens? Is there any other way to get notified when the zeroed
> > > page is written to if the invalidate does not get triggered?
> > 
> > What David is saying is that memfd does not use the zero page
> > optimization for hole punches. Any access to the memory, including
> > read-only access through hmm_range_fault() will allocate unique
> > pages. Since there is no zero page and no zero-page replacement there
> > is no issue with invalidations.

> It looks like even with hmm_range_fault(), the invalidate does not get
> triggered when the hole is refilled with new pages because of writes.
> This is probably because hmm_range_fault() does not fault in any pages
> that get invalidated later when writes occur.

hmm_range_fault() returns the current content of the VMAs, or it
faults. If it returns pages then it came from one of these two places.

If your VMA is incoherent with what you are doing then you have bigger
problems, or maybe you found a bug.

> The above log messages are seen immediately after the hole is punched. As
> you can see, hmm_range_fault() returns the pfns of old pages and not zero
> pages. And, I see the below messages (with patch #2 in this series applied)
> as the hole is refilled after writes:

I don't know what you are doing, but it is something wrong or you've
found a bug in the memfds.

Jason
David Hildenbrand Aug. 3, 2023, 12:32 p.m. UTC | #42
On 03.08.23 14:14, Jason Gunthorpe wrote:
> On Thu, Aug 03, 2023 at 07:35:51AM +0000, Kasireddy, Vivek wrote:
>> Hi Jason,
>>
>>>>> Right, the "the zero pages are changed into writable pages" in your
>>>>> above comment just might not apply, because there won't be any page
>>>>> replacement (hopefully :) ).
>>>
>>>> If the page replacement does not happen when there are new writes to the
>>>> area where the hole previously existed, then would we still get an
>>> invalidate
>>>> when this happens? Is there any other way to get notified when the zeroed
>>>> page is written to if the invalidate does not get triggered?
>>>
>>> What David is saying is that memfd does not use the zero page
>>> optimization for hole punches. Any access to the memory, including
>>> read-only access through hmm_range_fault() will allocate unique
>>> pages. Since there is no zero page and no zero-page replacement there
>>> is no issue with invalidations.
> 
>> It looks like even with hmm_range_fault(), the invalidate does not get
>> triggered when the hole is refilled with new pages because of writes.
>> This is probably because hmm_range_fault() does not fault in any pages
>> that get invalidated later when writes occur.
> 
> hmm_range_fault() returns the current content of the VMAs, or it
> faults. If it returns pages then it came from one of these two places.
> 
> If your VMA is incoherent with what you are doing then you have bigger
> problems, or maybe you found a bug.
> 
>> The above log messages are seen immediately after the hole is punched. As
>> you can see, hmm_range_fault() returns the pfns of old pages and not zero
>> pages. And, I see the below messages (with patch #2 in this series applied)
>> as the hole is refilled after writes:
> 
> I don't know what you are doing, but it is something wrong or you've
> found a bug in the memfds.


Maybe THP is involved? I recently had to dig that out for an internal 
discussion:

"Currently when truncating shmem file, if the range is partial of THP
(start or end is in the middle of THP), the pages actually will just get
cleared rather than being freed unless the range cover the whole THP.
Even though all the subpages are truncated (randomly or sequentially),
the THP may still be kept in page cache.  This might be fine for some
usecases which prefer preserving THP."

My recollection is that this behavior was never changed.

https://lore.kernel.org/all/1575420174-19171-1-git-send-email-yang.shi@linux.alibaba.com/
Peter Xu Aug. 3, 2023, 1:02 p.m. UTC | #43
Hi, Vivek,

On Thu, Aug 03, 2023 at 08:08:41AM +0000, Kasireddy, Vivek wrote:
> > Isn't it already too late though to wait that notification until page is
> > installed?  Because here you pinned the page for DMA, I think it means
> > before a new page installed (but after the page is invalidated) the device
> > can DMA to an invalid buffer.
>
> The page is only invalidated in the memfd. Until the hole is written to,
> we (udmabuf) can choose to handle any reads (or DMA) using old pages
> if needed.

But what happens if there's DMA writes?  I don't see anything that will
stop the device from doing so - the whole design looks fully transparent, I
just still don't see how it can be done without synchronizing with the
device.

IIUC, we need to e.g. quiesce the device when any page got invalidated in
some way like hole punching, and should happen right before it happens
(comparing to the notification of new page update which should be right
after the installation OTOH).

I think the vfio use case currently face the same condition and challenge,
assuming there's currently no easy solution so that was just prohibited.  I
guess people are just waiting for hardware vendors to support device page
faults, like processors - then we can synchronize with the device using the
device IOMMU page tables (by clearing it at proper time and blocks DMA
writes).

Thanks,
Alistair Popple Aug. 4, 2023, 12:14 a.m. UTC | #44
David Hildenbrand <david@redhat.com> writes:

> On 03.08.23 14:14, Jason Gunthorpe wrote:
>> On Thu, Aug 03, 2023 at 07:35:51AM +0000, Kasireddy, Vivek wrote:
>>> Hi Jason,
>>>
>>>>>> Right, the "the zero pages are changed into writable pages" in your
>>>>>> above comment just might not apply, because there won't be any page
>>>>>> replacement (hopefully :) ).
>>>>
>>>>> If the page replacement does not happen when there are new writes to the
>>>>> area where the hole previously existed, then would we still get an
>>>> invalidate
>>>>> when this happens? Is there any other way to get notified when the zeroed
>>>>> page is written to if the invalidate does not get triggered?
>>>>
>>>> What David is saying is that memfd does not use the zero page
>>>> optimization for hole punches. Any access to the memory, including
>>>> read-only access through hmm_range_fault() will allocate unique
>>>> pages. Since there is no zero page and no zero-page replacement there
>>>> is no issue with invalidations.
>> 
>>> It looks like even with hmm_range_fault(), the invalidate does not get
>>> triggered when the hole is refilled with new pages because of writes.
>>> This is probably because hmm_range_fault() does not fault in any pages
>>> that get invalidated later when writes occur.
>> hmm_range_fault() returns the current content of the VMAs, or it
>> faults. If it returns pages then it came from one of these two places.
>> If your VMA is incoherent with what you are doing then you have
>> bigger
>> problems, or maybe you found a bug.

Note it will only fault in pages if HMM_PFN_REQ_FAULT is specified. You
are setting that however you aren't setting HMM_PFN_REQ_WRITE which is
what would trigger a fault to bring in the new pages. Does setting that
fix the issue you are seeing?

>>> The above log messages are seen immediately after the hole is punched. As
>>> you can see, hmm_range_fault() returns the pfns of old pages and not zero
>>> pages. And, I see the below messages (with patch #2 in this series applied)
>>> as the hole is refilled after writes:
>> I don't know what you are doing, but it is something wrong or you've
>> found a bug in the memfds.
>
>
> Maybe THP is involved? I recently had to dig that out for an internal
> discussion:
>
> "Currently when truncating shmem file, if the range is partial of THP
> (start or end is in the middle of THP), the pages actually will just get
> cleared rather than being freed unless the range cover the whole THP.
> Even though all the subpages are truncated (randomly or sequentially),
> the THP may still be kept in page cache.  This might be fine for some
> usecases which prefer preserving THP."
>
> My recollection is that this behavior was never changed.
>
> https://lore.kernel.org/all/1575420174-19171-1-git-send-email-yang.shi@linux.alibaba.com/
Kasireddy, Vivek Aug. 4, 2023, 6:39 a.m. UTC | #45
Hi Alistair, David, Jason,

> >>>>>> Right, the "the zero pages are changed into writable pages" in your
> >>>>>> above comment just might not apply, because there won't be any
> page
> >>>>>> replacement (hopefully :) ).
> >>>>
> >>>>> If the page replacement does not happen when there are new writes
> to the
> >>>>> area where the hole previously existed, then would we still get an
> >>>> invalidate
> >>>>> when this happens? Is there any other way to get notified when the
> zeroed
> >>>>> page is written to if the invalidate does not get triggered?
> >>>>
> >>>> What David is saying is that memfd does not use the zero page
> >>>> optimization for hole punches. Any access to the memory, including
> >>>> read-only access through hmm_range_fault() will allocate unique
> >>>> pages. Since there is no zero page and no zero-page replacement there
> >>>> is no issue with invalidations.
> >>
> >>> It looks like even with hmm_range_fault(), the invalidate does not get
> >>> triggered when the hole is refilled with new pages because of writes.
> >>> This is probably because hmm_range_fault() does not fault in any pages
> >>> that get invalidated later when writes occur.
> >> hmm_range_fault() returns the current content of the VMAs, or it
> >> faults. If it returns pages then it came from one of these two places.
> >> If your VMA is incoherent with what you are doing then you have
> >> bigger
> >> problems, or maybe you found a bug.
> 
> Note it will only fault in pages if HMM_PFN_REQ_FAULT is specified. You
> are setting that however you aren't setting HMM_PFN_REQ_WRITE which is
> what would trigger a fault to bring in the new pages. Does setting that
> fix the issue you are seeing?
No, adding HMM_PFN_REQ_WRITE still doesn't help in fixing the issue.
Although, I do not have THP enabled (or built-in), shmem does not evict
the pages after hole punch as noted in the comment in shmem_fallocate():
                if ((u64)unmap_end > (u64)unmap_start)
                        unmap_mapping_range(mapping, unmap_start,
                                            1 + unmap_end - unmap_start, 0);
                shmem_truncate_range(inode, offset, offset + len - 1);
                /* No need to unmap again: hole-punching leaves COWed pages */

As a result, the pfn is still valid and the pte is pte_present() and pte_write().
This is the reason why adding in HMM_PFN_REQ_WRITE does not help;
because, it fails the below condition in hmm_pte_need_fault():
        if ((pfn_req_flags & HMM_PFN_REQ_WRITE) &&
            !(cpu_flags & HMM_PFN_WRITE))
                return HMM_NEED_FAULT | HMM_NEED_WRITE_FAULT;

If I force it to read-fault or write-fault (by hacking hmm_pte_need_fault()),
it gets indefinitely stuck in the do while loop in hmm_range_fault().
AFAIU, unless there is a way to fault-in zero pages (or any scratch pages)
after hole punch that get invalidated because of writes, I do not see how
using hmm_range_fault() can help with my use-case. 

Thanks,
Vivek

> 
> >>> The above log messages are seen immediately after the hole is punched.
> As
> >>> you can see, hmm_range_fault() returns the pfns of old pages and not
> zero
> >>> pages. And, I see the below messages (with patch #2 in this series
> applied)
> >>> as the hole is refilled after writes:
> >> I don't know what you are doing, but it is something wrong or you've
> >> found a bug in the memfds.
> >
> >
> > Maybe THP is involved? I recently had to dig that out for an internal
> > discussion:
> >
> > "Currently when truncating shmem file, if the range is partial of THP
> > (start or end is in the middle of THP), the pages actually will just get
> > cleared rather than being freed unless the range cover the whole THP.
> > Even though all the subpages are truncated (randomly or sequentially),
> > the THP may still be kept in page cache.  This might be fine for some
> > usecases which prefer preserving THP."
> >
> > My recollection is that this behavior was never changed.
> >
> > https://lore.kernel.org/all/1575420174-19171-1-git-send-email-
> yang.shi@linux.alibaba.com/
David Hildenbrand Aug. 4, 2023, 7:23 a.m. UTC | #46
On 04.08.23 08:39, Kasireddy, Vivek wrote:
> Hi Alistair, David, Jason,
> 
>>>>>>>> Right, the "the zero pages are changed into writable pages" in your
>>>>>>>> above comment just might not apply, because there won't be any
>> page
>>>>>>>> replacement (hopefully :) ).
>>>>>>
>>>>>>> If the page replacement does not happen when there are new writes
>> to the
>>>>>>> area where the hole previously existed, then would we still get an
>>>>>> invalidate
>>>>>>> when this happens? Is there any other way to get notified when the
>> zeroed
>>>>>>> page is written to if the invalidate does not get triggered?
>>>>>>
>>>>>> What David is saying is that memfd does not use the zero page
>>>>>> optimization for hole punches. Any access to the memory, including
>>>>>> read-only access through hmm_range_fault() will allocate unique
>>>>>> pages. Since there is no zero page and no zero-page replacement there
>>>>>> is no issue with invalidations.
>>>>
>>>>> It looks like even with hmm_range_fault(), the invalidate does not get
>>>>> triggered when the hole is refilled with new pages because of writes.
>>>>> This is probably because hmm_range_fault() does not fault in any pages
>>>>> that get invalidated later when writes occur.
>>>> hmm_range_fault() returns the current content of the VMAs, or it
>>>> faults. If it returns pages then it came from one of these two places.
>>>> If your VMA is incoherent with what you are doing then you have
>>>> bigger
>>>> problems, or maybe you found a bug.
>>
>> Note it will only fault in pages if HMM_PFN_REQ_FAULT is specified. You
>> are setting that however you aren't setting HMM_PFN_REQ_WRITE which is
>> what would trigger a fault to bring in the new pages. Does setting that
>> fix the issue you are seeing?
> No, adding HMM_PFN_REQ_WRITE still doesn't help in fixing the issue.
> Although, I do not have THP enabled (or built-in), shmem does not evict
> the pages after hole punch as noted in the comment in shmem_fallocate():
>                  if ((u64)unmap_end > (u64)unmap_start)
>                          unmap_mapping_range(mapping, unmap_start,
>                                              1 + unmap_end - unmap_start, 0);
>                  shmem_truncate_range(inode, offset, offset + len - 1);
>                  /* No need to unmap again: hole-punching leaves COWed pages */
> 
> As a result, the pfn is still valid and the pte is pte_present() and pte_write().
> This is the reason why adding in HMM_PFN_REQ_WRITE does not help;

Just to understand your setup: you are definitely using a MAP_SHARED 
shmem mapping, and not accidentally a MAP_PRIVATE mapping?
Jason Gunthorpe Aug. 4, 2023, 12:49 p.m. UTC | #47
On Fri, Aug 04, 2023 at 06:39:22AM +0000, Kasireddy, Vivek wrote:

> No, adding HMM_PFN_REQ_WRITE still doesn't help in fixing the issue.
> Although, I do not have THP enabled (or built-in), shmem does not evict
> the pages after hole punch as noted in the comment in shmem_fallocate():

This is the source of all your problems.

Things that are mm-centric are supposed to track the VMAs and changes to
the PTEs. If you do something in userspace and it doesn't cause the
CPU page tables to change then it certainly shouldn't cause any mmu
notifiers or hmm_range_fault changes.

There should still be an invalidation notifier at some point when the
CPU tables do eventually change, whenever that is. Missing that
notification would be a bug.

> If I force it to read-fault or write-fault (by hacking hmm_pte_need_fault()),
> it gets indefinitely stuck in the do while loop in hmm_range_fault().
> AFAIU, unless there is a way to fault-in zero pages (or any scratch pages)
> after hole punch that get invalidated because of writes, I do not see how
> using hmm_range_fault() can help with my use-case. 

hmm_range_fault() is the correct API to use if you are working with
notifiers. Do not hack something together using pin_user_pages.

Jason
Kasireddy, Vivek Aug. 4, 2023, 9:53 p.m. UTC | #48
Hi David,

> >
> >>>>>>>> Right, the "the zero pages are changed into writable pages" in your
> >>>>>>>> above comment just might not apply, because there won't be any
> >> page
> >>>>>>>> replacement (hopefully :) ).
> >>>>>>
> >>>>>>> If the page replacement does not happen when there are new
> writes
> >> to the
> >>>>>>> area where the hole previously existed, then would we still get an
> >>>>>> invalidate
> >>>>>>> when this happens? Is there any other way to get notified when the
> >> zeroed
> >>>>>>> page is written to if the invalidate does not get triggered?
> >>>>>>
> >>>>>> What David is saying is that memfd does not use the zero page
> >>>>>> optimization for hole punches. Any access to the memory, including
> >>>>>> read-only access through hmm_range_fault() will allocate unique
> >>>>>> pages. Since there is no zero page and no zero-page replacement
> there
> >>>>>> is no issue with invalidations.
> >>>>
> >>>>> It looks like even with hmm_range_fault(), the invalidate does not get
> >>>>> triggered when the hole is refilled with new pages because of writes.
> >>>>> This is probably because hmm_range_fault() does not fault in any
> pages
> >>>>> that get invalidated later when writes occur.
> >>>> hmm_range_fault() returns the current content of the VMAs, or it
> >>>> faults. If it returns pages then it came from one of these two places.
> >>>> If your VMA is incoherent with what you are doing then you have
> >>>> bigger
> >>>> problems, or maybe you found a bug.
> >>
> >> Note it will only fault in pages if HMM_PFN_REQ_FAULT is specified. You
> >> are setting that however you aren't setting HMM_PFN_REQ_WRITE which
> is
> >> what would trigger a fault to bring in the new pages. Does setting that
> >> fix the issue you are seeing?
> > No, adding HMM_PFN_REQ_WRITE still doesn't help in fixing the issue.
> > Although, I do not have THP enabled (or built-in), shmem does not evict
> > the pages after hole punch as noted in the comment in shmem_fallocate():
> >                  if ((u64)unmap_end > (u64)unmap_start)
> >                          unmap_mapping_range(mapping, unmap_start,
> >                                              1 + unmap_end - unmap_start, 0);
> >                  shmem_truncate_range(inode, offset, offset + len - 1);
> >                  /* No need to unmap again: hole-punching leaves COWed pages
> */
> >
> > As a result, the pfn is still valid and the pte is pte_present() and pte_write().
> > This is the reason why adding in HMM_PFN_REQ_WRITE does not help;
> 
> Just to understand your setup: you are definitely using a MAP_SHARED
> shmem mapping, and not accidentally a MAP_PRIVATE mapping?
In terms of setup, I am just running the udmabuf selftest (shmem-based)
introduced in patch #3 of this series:
https://lore.kernel.org/all/20230718082858.1570809-4-vivek.kasireddy@intel.com/

And, it indeed uses a MAP_SHARED mapping.

Thanks,
Vivek

> 
> --
> Cheers,
> 
> David / dhildenb
Kasireddy, Vivek Aug. 8, 2023, 7:37 a.m. UTC | #49
Hi Jason,

> 
> > No, adding HMM_PFN_REQ_WRITE still doesn't help in fixing the issue.
> > Although, I do not have THP enabled (or built-in), shmem does not evict
> > the pages after hole punch as noted in the comment in shmem_fallocate():
> 
> This is the source of all your problems.
> 
> Things that are mm-centric are supposed to track the VMAs and changes to
> the PTEs. If you do something in userspace and it doesn't cause the
> CPU page tables to change then it certainly shouldn't cause any mmu
> notifiers or hmm_range_fault changes.
I am not doing anything out of the blue in the userspace. I think the behavior
I am seeing with shmem (where an invalidation event (MMU_NOTIFY_CLEAR)
does occur because of a hole punch but the PTEs don't really get updated)
can arguably be considered an optimization. 

> 
> There should still be an invalidation notifier at some point when the
> CPU tables do eventually change, whenever that is. Missing that
> notification would be a bug.
I clearly do not see any notification getting triggered (from both shmem_fault()
and hugetlb_fault()) when the PTEs do get updated as the hole is refilled
due to writes. Are you saying that there needs to be an invalidation event
(MMU_NOTIFY_CLEAR?) dispatched at this point?

> 
> > If I force it to read-fault or write-fault (by hacking hmm_pte_need_fault()),
> > it gets indefinitely stuck in the do while loop in hmm_range_fault().
> > AFAIU, unless there is a way to fault-in zero pages (or any scratch pages)
> > after hole punch that get invalidated because of writes, I do not see how
> > using hmm_range_fault() can help with my use-case.
> 
> hmm_range_fault() is the correct API to use if you are working with
> notifiers. Do not hack something together using pin_user_pages.
I noticed that hmm_range_fault() does not seem to be working as expected
given that it gets stuck(hangs) while walking hugetlb pages. Regardless,
as I mentioned above, the lack of notification when PTEs do get updated due
to writes is the crux of the issue here. Therefore, AFAIU, triggering an
invalidation event or some other kind of notification would help in fixing
this issue.

Thanks,
Vivek

> 
> Jason
Jason Gunthorpe Aug. 8, 2023, 12:42 p.m. UTC | #50
On Tue, Aug 08, 2023 at 07:37:19AM +0000, Kasireddy, Vivek wrote:
> Hi Jason,
> 
> > 
> > > No, adding HMM_PFN_REQ_WRITE still doesn't help in fixing the issue.
> > > Although, I do not have THP enabled (or built-in), shmem does not evict
> > > the pages after hole punch as noted in the comment in shmem_fallocate():
> > 
> > This is the source of all your problems.
> > 
> > Things that are mm-centric are supposed to track the VMAs and changes to
> > the PTEs. If you do something in userspace and it doesn't cause the
> > CPU page tables to change then it certainly shouldn't cause any mmu
> > notifiers or hmm_range_fault changes.
> I am not doing anything out of the blue in the userspace. I think the behavior
> I am seeing with shmem (where an invalidation event (MMU_NOTIFY_CLEAR)
> does occur because of a hole punch but the PTEs don't really get updated)
> can arguably be considered an optimization. 

Your explanations don't make sense.

If MMU_NOTIFER_CLEAR was sent but the PTEs were left present then:

> > There should still be an invalidation notifier at some point when the
> > CPU tables do eventually change, whenever that is. Missing that
> > notification would be a bug.
> I clearly do not see any notification getting triggered (from both shmem_fault()
> and hugetlb_fault()) when the PTEs do get updated as the hole is refilled
> due to writes. Are you saying that there needs to be an invalidation event
> (MMU_NOTIFY_CLEAR?) dispatched at this point?

You don't get to get shmem_fault in the first place.

If they were marked non-prsent during the CLEAR then the shadow side
remains non-present until it gets its own fault.

If they were made non-present without an invalidation then that is a
bug.

> > hmm_range_fault() is the correct API to use if you are working with
> > notifiers. Do not hack something together using pin_user_pages.

> I noticed that hmm_range_fault() does not seem to be working as expected
> given that it gets stuck(hangs) while walking hugetlb pages.

You are the first to report that, it sounds like a serious bug. Please
try to fix it.

> Regardless, as I mentioned above, the lack of notification when PTEs
> do get updated due to writes is the crux of the issue
> here. Therefore, AFAIU, triggering an invalidation event or some
> other kind of notification would help in fixing this issue.

You seem to be facing some kind of bug in the mm, it sounds pretty
serious, and it almost certainly is a missing invalidation.

Basically, anything that changes a PTE must eventually trigger an
invalidation. It is illegal to change a PTE from one present value to
another present value without invalidation notification.

It is not surprising something would be missed here.

Jason
Kasireddy, Vivek Aug. 16, 2023, 6:43 a.m. UTC | #51
Hi Jason,

> > >
> > > > No, adding HMM_PFN_REQ_WRITE still doesn't help in fixing the issue.
> > > > Although, I do not have THP enabled (or built-in), shmem does not evict
> > > > the pages after hole punch as noted in the comment in
> shmem_fallocate():
> > >
> > > This is the source of all your problems.
> > >
> > > Things that are mm-centric are supposed to track the VMAs and changes
> to
> > > the PTEs. If you do something in userspace and it doesn't cause the
> > > CPU page tables to change then it certainly shouldn't cause any mmu
> > > notifiers or hmm_range_fault changes.
> > I am not doing anything out of the blue in the userspace. I think the
> behavior
> > I am seeing with shmem (where an invalidation event
> (MMU_NOTIFY_CLEAR)
> > does occur because of a hole punch but the PTEs don't really get updated)
> > can arguably be considered an optimization.
> 
> Your explanations don't make sense.
> 
> If MMU_NOTIFER_CLEAR was sent but the PTEs were left present then:
> 
> > > There should still be an invalidation notifier at some point when the
> > > CPU tables do eventually change, whenever that is. Missing that
> > > notification would be a bug.
> > I clearly do not see any notification getting triggered (from both
> shmem_fault()
> > and hugetlb_fault()) when the PTEs do get updated as the hole is refilled
> > due to writes. Are you saying that there needs to be an invalidation event
> > (MMU_NOTIFY_CLEAR?) dispatched at this point?
> 
> You don't get to get shmem_fault in the first place.
What I am observing is that even after MMU_NOTIFY_CLEAR (hole punch) is sent,
hmm_range_fault() finds that the PTEs associated with the hole are still pte_present().
I think it remains this way as long as there are reads on the hole. Once there are
writes, it triggers shmem_fault() which results in PTEs getting updated but without
any notification.

> 
> If they were marked non-prsent during the CLEAR then the shadow side
> remains non-present until it gets its own fault.
> 
> If they were made non-present without an invalidation then that is a
> bug.
> 
> > > hmm_range_fault() is the correct API to use if you are working with
> > > notifiers. Do not hack something together using pin_user_pages.
> 
> > I noticed that hmm_range_fault() does not seem to be working as expected
> > given that it gets stuck(hangs) while walking hugetlb pages.
> 
> You are the first to report that, it sounds like a serious bug. Please
> try to fix it.
> 
> > Regardless, as I mentioned above, the lack of notification when PTEs
> > do get updated due to writes is the crux of the issue
> > here. Therefore, AFAIU, triggering an invalidation event or some
> > other kind of notification would help in fixing this issue.
> 
> You seem to be facing some kind of bug in the mm, it sounds pretty
> serious, and it almost certainly is a missing invalidation.
> 
> Basically, anything that changes a PTE must eventually trigger an
> invalidation. It is illegal to change a PTE from one present value to
> another present value without invalidation notification.
> 
> It is not surprising something would be missed here.
As you suggest, it looks like the root-cause of this issue is the missing
invalidation notification when the PTEs are changed from one present
value to another. I'd like to fix this issue eventually but I first need to
focus on addressing udmabuf page migration (out of movable zone)
and also look into the locking concerns Daniel mentioned about pairing
static and dynamic dmabuf exporters and importers.

Thanks,
Vivek
Alistair Popple Aug. 21, 2023, 9:02 a.m. UTC | #52
"Kasireddy, Vivek" <vivek.kasireddy@intel.com> writes:

> Hi Jason,
>
>> > >
>> > > > No, adding HMM_PFN_REQ_WRITE still doesn't help in fixing the issue.
>> > > > Although, I do not have THP enabled (or built-in), shmem does not evict
>> > > > the pages after hole punch as noted in the comment in
>> shmem_fallocate():
>> > >
>> > > This is the source of all your problems.
>> > >
>> > > Things that are mm-centric are supposed to track the VMAs and changes
>> to
>> > > the PTEs. If you do something in userspace and it doesn't cause the
>> > > CPU page tables to change then it certainly shouldn't cause any mmu
>> > > notifiers or hmm_range_fault changes.
>> > I am not doing anything out of the blue in the userspace. I think the
>> behavior
>> > I am seeing with shmem (where an invalidation event
>> (MMU_NOTIFY_CLEAR)
>> > does occur because of a hole punch but the PTEs don't really get updated)
>> > can arguably be considered an optimization.
>> 
>> Your explanations don't make sense.
>> 
>> If MMU_NOTIFER_CLEAR was sent but the PTEs were left present then:
>> 
>> > > There should still be an invalidation notifier at some point when the
>> > > CPU tables do eventually change, whenever that is. Missing that
>> > > notification would be a bug.
>> > I clearly do not see any notification getting triggered (from both
>> shmem_fault()
>> > and hugetlb_fault()) when the PTEs do get updated as the hole is refilled
>> > due to writes. Are you saying that there needs to be an invalidation event
>> > (MMU_NOTIFY_CLEAR?) dispatched at this point?
>> 
>> You don't get to get shmem_fault in the first place.
> What I am observing is that even after MMU_NOTIFY_CLEAR (hole punch) is sent,
> hmm_range_fault() finds that the PTEs associated with the hole are still pte_present().
> I think it remains this way as long as there are reads on the hole. Once there are
> writes, it triggers shmem_fault() which results in PTEs getting updated but without
> any notification.

Oh wait, this is shmem. The read from hmm_range_fault() (assuming you
specified HMM_PFN_REQ_FAULT) will trigger shmem_fault() due to the
missing PTE. Subsequent writes will just upgrade PTE permissions
assuming the read didn't map them RW to begin with. If you want to
actually see the hole with hmm_range_fault() don't specify
HMM_PFN_REQ_FAULT (or _WRITE).

>> 
>> If they were marked non-prsent during the CLEAR then the shadow side
>> remains non-present until it gets its own fault.
>> 
>> If they were made non-present without an invalidation then that is a
>> bug.
>> 
>> > > hmm_range_fault() is the correct API to use if you are working with
>> > > notifiers. Do not hack something together using pin_user_pages.
>> 
>> > I noticed that hmm_range_fault() does not seem to be working as expected
>> > given that it gets stuck(hangs) while walking hugetlb pages.
>> 
>> You are the first to report that, it sounds like a serious bug. Please
>> try to fix it.
>> 
>> > Regardless, as I mentioned above, the lack of notification when PTEs
>> > do get updated due to writes is the crux of the issue
>> > here. Therefore, AFAIU, triggering an invalidation event or some
>> > other kind of notification would help in fixing this issue.
>> 
>> You seem to be facing some kind of bug in the mm, it sounds pretty
>> serious, and it almost certainly is a missing invalidation.
>> 
>> Basically, anything that changes a PTE must eventually trigger an
>> invalidation. It is illegal to change a PTE from one present value to
>> another present value without invalidation notification.
>> 
>> It is not surprising something would be missed here.
> As you suggest, it looks like the root-cause of this issue is the missing
> invalidation notification when the PTEs are changed from one present

I don't think there's a missing invalidation here. You say you're seeing
the MMU_NOTIFY_CLEAR when hole punching which is when the PTE is
cleared. When else do you expect a notification?

> value to another. I'd like to fix this issue eventually but I first need to
> focus on addressing udmabuf page migration (out of movable zone)
> and also look into the locking concerns Daniel mentioned about pairing
> static and dynamic dmabuf exporters and importers.
>
> Thanks,
> Vivek
Kasireddy, Vivek Aug. 22, 2023, 6:14 a.m. UTC | #53
Hi Alistair,

> >> > > > No, adding HMM_PFN_REQ_WRITE still doesn't help in fixing the
> issue.
> >> > > > Although, I do not have THP enabled (or built-in), shmem does not
> evict
> >> > > > the pages after hole punch as noted in the comment in
> >> shmem_fallocate():
> >> > >
> >> > > This is the source of all your problems.
> >> > >
> >> > > Things that are mm-centric are supposed to track the VMAs and
> changes
> >> to
> >> > > the PTEs. If you do something in userspace and it doesn't cause the
> >> > > CPU page tables to change then it certainly shouldn't cause any mmu
> >> > > notifiers or hmm_range_fault changes.
> >> > I am not doing anything out of the blue in the userspace. I think the
> >> behavior
> >> > I am seeing with shmem (where an invalidation event
> >> (MMU_NOTIFY_CLEAR)
> >> > does occur because of a hole punch but the PTEs don't really get
> updated)
> >> > can arguably be considered an optimization.
> >>
> >> Your explanations don't make sense.
> >>
> >> If MMU_NOTIFER_CLEAR was sent but the PTEs were left present then:
> >>
> >> > > There should still be an invalidation notifier at some point when the
> >> > > CPU tables do eventually change, whenever that is. Missing that
> >> > > notification would be a bug.
> >> > I clearly do not see any notification getting triggered (from both
> >> shmem_fault()
> >> > and hugetlb_fault()) when the PTEs do get updated as the hole is refilled
> >> > due to writes. Are you saying that there needs to be an invalidation
> event
> >> > (MMU_NOTIFY_CLEAR?) dispatched at this point?
> >>
> >> You don't get to get shmem_fault in the first place.
> > What I am observing is that even after MMU_NOTIFY_CLEAR (hole punch)
> is sent,
> > hmm_range_fault() finds that the PTEs associated with the hole are still
> pte_present().
> > I think it remains this way as long as there are reads on the hole. Once
> there are
> > writes, it triggers shmem_fault() which results in PTEs getting updated but
> without
> > any notification.
> 
> Oh wait, this is shmem. The read from hmm_range_fault() (assuming you
> specified HMM_PFN_REQ_FAULT) will trigger shmem_fault() due to the
> missing PTE. 
When running one of the udmabuf subtests (introduced in the third patch of
this series), I see that MMU_NOTIFY_CLEAR is sent when a hole is punched.
As a response, hmm_range_fault() is called from the udmabuf invalidate callback,
to walk over the PTEs associated with the hole. When this happens, I noticed that
the below function returns HMM_PFN_VALID | HMM_PFN_WRITE for all the
PTEs associated with the hole. 
static inline unsigned long pte_to_hmm_pfn_flags(struct hmm_range *range,
                                                 pte_t pte)
{
        if (pte_none(pte) || !pte_present(pte) || pte_protnone(pte))
                return 0;
        return pte_write(pte) ? (HMM_PFN_VALID | HMM_PFN_WRITE) : HMM_PFN_VALID;
}

As a result, hmm_pte_need_fault() always returns 0 and shmem_fault()
never gets triggered despite specifying HMM_PFN_REQ_FAULT | HMM_PFN_REQ_WRITE.
And, the set of PFNs returned by hmm_range_fault() are the same ones
that existed before the hole was punched.

> Subsequent writes will just upgrade PTE permissions
> assuming the read didn't map them RW to begin with. If you want to
> actually see the hole with hmm_range_fault() don't specify
> HMM_PFN_REQ_FAULT (or _WRITE).
> 
> >>
> >> If they were marked non-prsent during the CLEAR then the shadow side
> >> remains non-present until it gets its own fault.
> >>
> >> If they were made non-present without an invalidation then that is a
> >> bug.
> >>
> >> > > hmm_range_fault() is the correct API to use if you are working with
> >> > > notifiers. Do not hack something together using pin_user_pages.
> >>
> >> > I noticed that hmm_range_fault() does not seem to be working as
> expected
> >> > given that it gets stuck(hangs) while walking hugetlb pages.
> >>
> >> You are the first to report that, it sounds like a serious bug. Please
> >> try to fix it.
> >>
> >> > Regardless, as I mentioned above, the lack of notification when PTEs
> >> > do get updated due to writes is the crux of the issue
> >> > here. Therefore, AFAIU, triggering an invalidation event or some
> >> > other kind of notification would help in fixing this issue.
> >>
> >> You seem to be facing some kind of bug in the mm, it sounds pretty
> >> serious, and it almost certainly is a missing invalidation.
> >>
> >> Basically, anything that changes a PTE must eventually trigger an
> >> invalidation. It is illegal to change a PTE from one present value to
> >> another present value without invalidation notification.
> >>
> >> It is not surprising something would be missed here.
> > As you suggest, it looks like the root-cause of this issue is the missing
> > invalidation notification when the PTEs are changed from one present
> 
> I don't think there's a missing invalidation here. You say you're seeing
> the MMU_NOTIFY_CLEAR when hole punching which is when the PTE is
> cleared. When else do you expect a notification?
Oh, given that we are finding PTEs that are still pte_present() even after
MMU_NOTIFY_CLEAR is sent, the theory is that another MMU_NOTIFY_CLEAR
needs to be sent after the PTEs are updated when new pages are faulted-in.

However, it just occurred to me that maybe the behavior I am seeing is not
unexpected as it might be a timing issue that has to do with when the PTEs
are walked. Let me explain. Here is what shmem does when a hole is punched:
                if ((u64)unmap_end > (u64)unmap_start)
                        unmap_mapping_range(mapping, unmap_start,
                                            1 + unmap_end - unmap_start, 0);
                shmem_truncate_range(inode, offset, offset + len - 1);

IIUC, the invalidate callback is called from unmap_mapping_range() but
the page removal does not happen until shmem_truncate_range() gets
called. So, if I were to call hmm_range_fault() after shmem_truncate_range(),
I might see different results as the PTEs would probably no longer be present.
In order to test this theory, I would have to schedule a wq thread func from the
invalidate callback (to walk the PTEs after a slight delay). I'll try this out when
I get a chance after addressing some of the locking concerns associated with
pairing static/dynamic dmabuf exporters and importers.

Thanks,
Vivek

> 
> > value to another. I'd like to fix this issue eventually but I first need to
> > focus on addressing udmabuf page migration (out of movable zone)
> > and also look into the locking concerns Daniel mentioned about pairing
> > static and dynamic dmabuf exporters and importers.
> >
> > Thanks,
> > Vivek
Alistair Popple Aug. 22, 2023, 8:15 a.m. UTC | #54
"Kasireddy, Vivek" <vivek.kasireddy@intel.com> writes:

> Hi Alistair,
>
>> >> > > > No, adding HMM_PFN_REQ_WRITE still doesn't help in fixing the
>> issue.
>> >> > > > Although, I do not have THP enabled (or built-in), shmem does not
>> evict
>> >> > > > the pages after hole punch as noted in the comment in
>> >> shmem_fallocate():
>> >> > >
>> >> > > This is the source of all your problems.
>> >> > >
>> >> > > Things that are mm-centric are supposed to track the VMAs and
>> changes
>> >> to
>> >> > > the PTEs. If you do something in userspace and it doesn't cause the
>> >> > > CPU page tables to change then it certainly shouldn't cause any mmu
>> >> > > notifiers or hmm_range_fault changes.
>> >> > I am not doing anything out of the blue in the userspace. I think the
>> >> behavior
>> >> > I am seeing with shmem (where an invalidation event
>> >> (MMU_NOTIFY_CLEAR)
>> >> > does occur because of a hole punch but the PTEs don't really get
>> updated)
>> >> > can arguably be considered an optimization.
>> >>
>> >> Your explanations don't make sense.
>> >>
>> >> If MMU_NOTIFER_CLEAR was sent but the PTEs were left present then:
>> >>
>> >> > > There should still be an invalidation notifier at some point when the
>> >> > > CPU tables do eventually change, whenever that is. Missing that
>> >> > > notification would be a bug.
>> >> > I clearly do not see any notification getting triggered (from both
>> >> shmem_fault()
>> >> > and hugetlb_fault()) when the PTEs do get updated as the hole is refilled
>> >> > due to writes. Are you saying that there needs to be an invalidation
>> event
>> >> > (MMU_NOTIFY_CLEAR?) dispatched at this point?
>> >>
>> >> You don't get to get shmem_fault in the first place.
>> > What I am observing is that even after MMU_NOTIFY_CLEAR (hole punch)
>> is sent,
>> > hmm_range_fault() finds that the PTEs associated with the hole are still
>> pte_present().
>> > I think it remains this way as long as there are reads on the hole. Once
>> there are
>> > writes, it triggers shmem_fault() which results in PTEs getting updated but
>> without
>> > any notification.
>> 
>> Oh wait, this is shmem. The read from hmm_range_fault() (assuming you
>> specified HMM_PFN_REQ_FAULT) will trigger shmem_fault() due to the
>> missing PTE. 
> When running one of the udmabuf subtests (introduced in the third patch of
> this series), I see that MMU_NOTIFY_CLEAR is sent when a hole is punched.
> As a response, hmm_range_fault() is called from the udmabuf invalidate callback,

Actually I'm suprised that works. If you've setup an interval notifier
and are updating the notifier sequence numbers correctly I would expect
hmm_range_fault() to return -EBUSY until
mmu_notifier_invalidate_range_end() is called.

It might be helpful to post the code you're testing with somewhere but
are you calling mmu_interval_read_begin() to start the critical section
and mmu_interval_set_seq() to update the sequence in another notifier?
I'm not at all convinced calling hmm_range_fault() from a notifier can
be made to work though.

> to walk over the PTEs associated with the hole. When this happens, I noticed that
> the below function returns HMM_PFN_VALID | HMM_PFN_WRITE for all the
> PTEs associated with the hole. 
> static inline unsigned long pte_to_hmm_pfn_flags(struct hmm_range *range,
>                                                  pte_t pte)
> {
>         if (pte_none(pte) || !pte_present(pte) || pte_protnone(pte))
>                 return 0;
>         return pte_write(pte) ? (HMM_PFN_VALID | HMM_PFN_WRITE) : HMM_PFN_VALID;
> }
>
> As a result, hmm_pte_need_fault() always returns 0 and shmem_fault()
> never gets triggered despite specifying HMM_PFN_REQ_FAULT | HMM_PFN_REQ_WRITE.
> And, the set of PFNs returned by hmm_range_fault() are the same ones
> that existed before the hole was punched.
>
>> Subsequent writes will just upgrade PTE permissions
>> assuming the read didn't map them RW to begin with. If you want to
>> actually see the hole with hmm_range_fault() don't specify
>> HMM_PFN_REQ_FAULT (or _WRITE).
>> 
>> >>
>> >> If they were marked non-prsent during the CLEAR then the shadow side
>> >> remains non-present until it gets its own fault.
>> >>
>> >> If they were made non-present without an invalidation then that is a
>> >> bug.
>> >>
>> >> > > hmm_range_fault() is the correct API to use if you are working with
>> >> > > notifiers. Do not hack something together using pin_user_pages.
>> >>
>> >> > I noticed that hmm_range_fault() does not seem to be working as
>> expected
>> >> > given that it gets stuck(hangs) while walking hugetlb pages.
>> >>
>> >> You are the first to report that, it sounds like a serious bug. Please
>> >> try to fix it.
>> >>
>> >> > Regardless, as I mentioned above, the lack of notification when PTEs
>> >> > do get updated due to writes is the crux of the issue
>> >> > here. Therefore, AFAIU, triggering an invalidation event or some
>> >> > other kind of notification would help in fixing this issue.
>> >>
>> >> You seem to be facing some kind of bug in the mm, it sounds pretty
>> >> serious, and it almost certainly is a missing invalidation.
>> >>
>> >> Basically, anything that changes a PTE must eventually trigger an
>> >> invalidation. It is illegal to change a PTE from one present value to
>> >> another present value without invalidation notification.
>> >>
>> >> It is not surprising something would be missed here.
>> > As you suggest, it looks like the root-cause of this issue is the missing
>> > invalidation notification when the PTEs are changed from one present
>> 
>> I don't think there's a missing invalidation here. You say you're seeing
>> the MMU_NOTIFY_CLEAR when hole punching which is when the PTE is
>> cleared. When else do you expect a notification?
> Oh, given that we are finding PTEs that are still pte_present() even after
> MMU_NOTIFY_CLEAR is sent, the theory is that another MMU_NOTIFY_CLEAR
> needs to be sent after the PTEs are updated when new pages are faulted-in.
>
> However, it just occurred to me that maybe the behavior I am seeing is not
> unexpected as it might be a timing issue that has to do with when the PTEs
> are walked. Let me explain. Here is what shmem does when a hole is punched:
>                 if ((u64)unmap_end > (u64)unmap_start)
>                         unmap_mapping_range(mapping, unmap_start,
>                                             1 + unmap_end - unmap_start, 0);
>                 shmem_truncate_range(inode, offset, offset + len - 1);
>
> IIUC, the invalidate callback is called from unmap_mapping_range() but
> the page removal does not happen until shmem_truncate_range() gets
> called. So, if I were to call hmm_range_fault() after shmem_truncate_range(),
> I might see different results as the PTEs would probably no longer be present.
> In order to test this theory, I would have to schedule a wq thread func from the
> invalidate callback (to walk the PTEs after a slight delay). I'll try this out when
> I get a chance after addressing some of the locking concerns associated with
> pairing static/dynamic dmabuf exporters and importers.

That sounds plausible. The PTE will actually be cleared in
unmap_mapping_range() after the mmu notifier is called. I'm curious how
hmm_range_fault() passes though.

> Thanks,
> Vivek
>
>> 
>> > value to another. I'd like to fix this issue eventually but I first need to
>> > focus on addressing udmabuf page migration (out of movable zone)
>> > and also look into the locking concerns Daniel mentioned about pairing
>> > static and dynamic dmabuf exporters and importers.
>> >
>> > Thanks,
>> > Vivek
Kasireddy, Vivek Aug. 24, 2023, 6:48 a.m. UTC | #55
Hi Alistair,

> >
> >> >> > > > No, adding HMM_PFN_REQ_WRITE still doesn't help in fixing the
> >> issue.
> >> >> > > > Although, I do not have THP enabled (or built-in), shmem does
> not
> >> evict
> >> >> > > > the pages after hole punch as noted in the comment in
> >> >> shmem_fallocate():
> >> >> > >
> >> >> > > This is the source of all your problems.
> >> >> > >
> >> >> > > Things that are mm-centric are supposed to track the VMAs and
> >> changes
> >> >> to
> >> >> > > the PTEs. If you do something in userspace and it doesn't cause the
> >> >> > > CPU page tables to change then it certainly shouldn't cause any
> mmu
> >> >> > > notifiers or hmm_range_fault changes.
> >> >> > I am not doing anything out of the blue in the userspace. I think the
> >> >> behavior
> >> >> > I am seeing with shmem (where an invalidation event
> >> >> (MMU_NOTIFY_CLEAR)
> >> >> > does occur because of a hole punch but the PTEs don't really get
> >> updated)
> >> >> > can arguably be considered an optimization.
> >> >>
> >> >> Your explanations don't make sense.
> >> >>
> >> >> If MMU_NOTIFER_CLEAR was sent but the PTEs were left present then:
> >> >>
> >> >> > > There should still be an invalidation notifier at some point when the
> >> >> > > CPU tables do eventually change, whenever that is. Missing that
> >> >> > > notification would be a bug.
> >> >> > I clearly do not see any notification getting triggered (from both
> >> >> shmem_fault()
> >> >> > and hugetlb_fault()) when the PTEs do get updated as the hole is
> refilled
> >> >> > due to writes. Are you saying that there needs to be an invalidation
> >> event
> >> >> > (MMU_NOTIFY_CLEAR?) dispatched at this point?
> >> >>
> >> >> You don't get to get shmem_fault in the first place.
> >> > What I am observing is that even after MMU_NOTIFY_CLEAR (hole
> punch)
> >> is sent,
> >> > hmm_range_fault() finds that the PTEs associated with the hole are still
> >> pte_present().
> >> > I think it remains this way as long as there are reads on the hole. Once
> >> there are
> >> > writes, it triggers shmem_fault() which results in PTEs getting updated
> but
> >> without
> >> > any notification.
> >>
> >> Oh wait, this is shmem. The read from hmm_range_fault() (assuming you
> >> specified HMM_PFN_REQ_FAULT) will trigger shmem_fault() due to the
> >> missing PTE.
> > When running one of the udmabuf subtests (introduced in the third patch
> of
> > this series), I see that MMU_NOTIFY_CLEAR is sent when a hole is punched.
> > As a response, hmm_range_fault() is called from the udmabuf invalidate
> callback,
> 
> Actually I'm suprised that works. If you've setup an interval notifier
> and are updating the notifier sequence numbers correctly I would expect
> hmm_range_fault() to return -EBUSY until
> mmu_notifier_invalidate_range_end() is called.
> 
> It might be helpful to post the code you're testing with somewhere but
> are you calling mmu_interval_read_begin() to start the critical section
> and mmu_interval_set_seq() to update the sequence in another notifier?
> I'm not at all convinced calling hmm_range_fault() from a notifier can
> be made to work though.
That could be part of the problem. I mean the way hmm_range_fault()
is invoked from the invalidate callback is probably incorrect as you are
suggesting. Anyway, here is the code I am testing with:
static bool invalidate_udmabuf(struct mmu_interval_notifier *mn,
                               const struct mmu_notifier_range *range_mn,
                               unsigned long cur_seq)
{
        struct udmabuf_vma_range *range =
                        container_of(mn, struct udmabuf_vma_range, range_mn);
        struct udmabuf *ubuf = range->ubuf;
        struct hmm_range hrange = {0};
        unsigned long *pfns, num_pages, timeout;
        int i, ret;

        printk("invalidate; start = %lu, end = %lu\n",
               range->start, range->end);

        hrange.notifier = mn;
        hrange.default_flags = HMM_PFN_REQ_FAULT;
        hrange.start = max(range_mn->start, range->start);
        hrange.end = min(range_mn->end, range->end);
        num_pages = (hrange.end - hrange.start) >> PAGE_SHIFT;

        pfns = kmalloc_array(num_pages, sizeof(*pfns), GFP_KERNEL);
        if (!pfns)
                return true;

        printk("invalidate; num pages = %lu\n", num_pages);

        hrange.hmm_pfns = pfns;
        timeout = jiffies + msecs_to_jiffies(HMM_RANGE_DEFAULT_TIMEOUT);
        do {
                hrange.notifier_seq = mmu_interval_read_begin(mn);

                mmap_read_lock(ubuf->vmm_mm);
                ret = hmm_range_fault(&hrange);
                mmap_read_unlock(ubuf->vmm_mm);
                if (ret) {
                        if (ret == -EBUSY && !time_after(jiffies, timeout))
                                continue;
                        break;
                }

                if (mmu_interval_read_retry(mn, hrange.notifier_seq))
                        continue;
        } while (ret);

        if (!ret) {
                for (i = 0; i < num_pages; i++) {
                        printk("hmm returned page = %p; pfn = %lu\n",
                               hmm_pfn_to_page(pfns[i]),
                               pfns[i] & ~HMM_PFN_FLAGS);
                }
        }
        return true;
}

static const struct mmu_interval_notifier_ops udmabuf_invalidate_ops = {
        .invalidate = invalidate_udmabuf,
};

Thanks,
Vivek

> 
> > to walk over the PTEs associated with the hole. When this happens, I
> noticed that
> > the below function returns HMM_PFN_VALID | HMM_PFN_WRITE for all
> the
> > PTEs associated with the hole.
> > static inline unsigned long pte_to_hmm_pfn_flags(struct hmm_range
> *range,
> >                                                  pte_t pte)
> > {
> >         if (pte_none(pte) || !pte_present(pte) || pte_protnone(pte))
> >                 return 0;
> >         return pte_write(pte) ? (HMM_PFN_VALID | HMM_PFN_WRITE) :
> HMM_PFN_VALID;
> > }
> >
> > As a result, hmm_pte_need_fault() always returns 0 and shmem_fault()
> > never gets triggered despite specifying HMM_PFN_REQ_FAULT |
> HMM_PFN_REQ_WRITE.
> > And, the set of PFNs returned by hmm_range_fault() are the same ones
> > that existed before the hole was punched.
> >
> >> Subsequent writes will just upgrade PTE permissions
> >> assuming the read didn't map them RW to begin with. If you want to
> >> actually see the hole with hmm_range_fault() don't specify
> >> HMM_PFN_REQ_FAULT (or _WRITE).
> >>
> >> >>
> >> >> If they were marked non-prsent during the CLEAR then the shadow side
> >> >> remains non-present until it gets its own fault.
> >> >>
> >> >> If they were made non-present without an invalidation then that is a
> >> >> bug.
> >> >>
> >> >> > > hmm_range_fault() is the correct API to use if you are working with
> >> >> > > notifiers. Do not hack something together using pin_user_pages.
> >> >>
> >> >> > I noticed that hmm_range_fault() does not seem to be working as
> >> expected
> >> >> > given that it gets stuck(hangs) while walking hugetlb pages.
> >> >>
> >> >> You are the first to report that, it sounds like a serious bug. Please
> >> >> try to fix it.
> >> >>
> >> >> > Regardless, as I mentioned above, the lack of notification when PTEs
> >> >> > do get updated due to writes is the crux of the issue
> >> >> > here. Therefore, AFAIU, triggering an invalidation event or some
> >> >> > other kind of notification would help in fixing this issue.
> >> >>
> >> >> You seem to be facing some kind of bug in the mm, it sounds pretty
> >> >> serious, and it almost certainly is a missing invalidation.
> >> >>
> >> >> Basically, anything that changes a PTE must eventually trigger an
> >> >> invalidation. It is illegal to change a PTE from one present value to
> >> >> another present value without invalidation notification.
> >> >>
> >> >> It is not surprising something would be missed here.
> >> > As you suggest, it looks like the root-cause of this issue is the missing
> >> > invalidation notification when the PTEs are changed from one present
> >>
> >> I don't think there's a missing invalidation here. You say you're seeing
> >> the MMU_NOTIFY_CLEAR when hole punching which is when the PTE is
> >> cleared. When else do you expect a notification?
> > Oh, given that we are finding PTEs that are still pte_present() even after
> > MMU_NOTIFY_CLEAR is sent, the theory is that another
> MMU_NOTIFY_CLEAR
> > needs to be sent after the PTEs are updated when new pages are faulted-in.
> >
> > However, it just occurred to me that maybe the behavior I am seeing is not
> > unexpected as it might be a timing issue that has to do with when the PTEs
> > are walked. Let me explain. Here is what shmem does when a hole is
> punched:
> >                 if ((u64)unmap_end > (u64)unmap_start)
> >                         unmap_mapping_range(mapping, unmap_start,
> >                                             1 + unmap_end - unmap_start, 0);
> >                 shmem_truncate_range(inode, offset, offset + len - 1);
> >
> > IIUC, the invalidate callback is called from unmap_mapping_range() but
> > the page removal does not happen until shmem_truncate_range() gets
> > called. So, if I were to call hmm_range_fault() after
> shmem_truncate_range(),
> > I might see different results as the PTEs would probably no longer be
> present.
> > In order to test this theory, I would have to schedule a wq thread func from
> the
> > invalidate callback (to walk the PTEs after a slight delay). I'll try this out
> when
> > I get a chance after addressing some of the locking concerns associated with
> > pairing static/dynamic dmabuf exporters and importers.
> 
> That sounds plausible. The PTE will actually be cleared in
> unmap_mapping_range() after the mmu notifier is called. I'm curious how
> hmm_range_fault() passes though.
> 
> > Thanks,
> > Vivek
> >
> >>
> >> > value to another. I'd like to fix this issue eventually but I first need to
> >> > focus on addressing udmabuf page migration (out of movable zone)
> >> > and also look into the locking concerns Daniel mentioned about pairing
> >> > static and dynamic dmabuf exporters and importers.
> >> >
> >> > Thanks,
> >> > Vivek
Kasireddy, Vivek Aug. 28, 2023, 4:38 a.m. UTC | #56
Hi Alistair,

> 
> > >
> > >> >> > > > No, adding HMM_PFN_REQ_WRITE still doesn't help in fixing the
> > >> issue.
> > >> >> > > > Although, I do not have THP enabled (or built-in), shmem does
> > not
> > >> evict
> > >> >> > > > the pages after hole punch as noted in the comment in
> > >> >> shmem_fallocate():
> > >> >> > >
> > >> >> > > This is the source of all your problems.
> > >> >> > >
> > >> >> > > Things that are mm-centric are supposed to track the VMAs and
> > >> changes
> > >> >> to
> > >> >> > > the PTEs. If you do something in userspace and it doesn't cause
> the
> > >> >> > > CPU page tables to change then it certainly shouldn't cause any
> > mmu
> > >> >> > > notifiers or hmm_range_fault changes.
> > >> >> > I am not doing anything out of the blue in the userspace. I think the
> > >> >> behavior
> > >> >> > I am seeing with shmem (where an invalidation event
> > >> >> (MMU_NOTIFY_CLEAR)
> > >> >> > does occur because of a hole punch but the PTEs don't really get
> > >> updated)
> > >> >> > can arguably be considered an optimization.
> > >> >>
> > >> >> Your explanations don't make sense.
> > >> >>
> > >> >> If MMU_NOTIFER_CLEAR was sent but the PTEs were left present
> then:
> > >> >>
> > >> >> > > There should still be an invalidation notifier at some point when
> the
> > >> >> > > CPU tables do eventually change, whenever that is. Missing that
> > >> >> > > notification would be a bug.
> > >> >> > I clearly do not see any notification getting triggered (from both
> > >> >> shmem_fault()
> > >> >> > and hugetlb_fault()) when the PTEs do get updated as the hole is
> > refilled
> > >> >> > due to writes. Are you saying that there needs to be an invalidation
> > >> event
> > >> >> > (MMU_NOTIFY_CLEAR?) dispatched at this point?
> > >> >>
> > >> >> You don't get to get shmem_fault in the first place.
> > >> > What I am observing is that even after MMU_NOTIFY_CLEAR (hole
> > punch)
> > >> is sent,
> > >> > hmm_range_fault() finds that the PTEs associated with the hole are still
> > >> pte_present().
> > >> > I think it remains this way as long as there are reads on the hole. Once
> > >> there are
> > >> > writes, it triggers shmem_fault() which results in PTEs getting updated
> > but
> > >> without
> > >> > any notification.
> > >>
> > >> Oh wait, this is shmem. The read from hmm_range_fault() (assuming
> you
> > >> specified HMM_PFN_REQ_FAULT) will trigger shmem_fault() due to the
> > >> missing PTE.
> > > When running one of the udmabuf subtests (introduced in the third patch
> > of
> > > this series), I see that MMU_NOTIFY_CLEAR is sent when a hole is
> punched.
> > > As a response, hmm_range_fault() is called from the udmabuf invalidate
> > callback,
> >
> > Actually I'm suprised that works. If you've setup an interval notifier
> > and are updating the notifier sequence numbers correctly I would expect
> > hmm_range_fault() to return -EBUSY until
> > mmu_notifier_invalidate_range_end() is called.
> >
> > It might be helpful to post the code you're testing with somewhere but
> > are you calling mmu_interval_read_begin() to start the critical section
> > and mmu_interval_set_seq() to update the sequence in another notifier?
> > I'm not at all convinced calling hmm_range_fault() from a notifier can
> > be made to work though.
Turns out, calling hmm_range_fault() from the invalidate callback was indeed
a problem and the reason why new pages were not faulted-in. In other words,
it looks like the invalidate callback is not the right place to invoke hmm_range_fault()
as the PTEs may not have been cleared.

> That could be part of the problem. I mean the way hmm_range_fault()
> is invoked from the invalidate callback is probably incorrect as you are
> suggesting. Anyway, here is the code I am testing with:
> static bool invalidate_udmabuf(struct mmu_interval_notifier *mn,
>                                const struct mmu_notifier_range *range_mn,
>                                unsigned long cur_seq)
> {
>         struct udmabuf_vma_range *range =
>                         container_of(mn, struct udmabuf_vma_range, range_mn);
>         struct udmabuf *ubuf = range->ubuf;
>         struct hmm_range hrange = {0};
>         unsigned long *pfns, num_pages, timeout;
>         int i, ret;
> 
>         printk("invalidate; start = %lu, end = %lu\n",
>                range->start, range->end);
> 
>         hrange.notifier = mn;
>         hrange.default_flags = HMM_PFN_REQ_FAULT;
>         hrange.start = max(range_mn->start, range->start);
>         hrange.end = min(range_mn->end, range->end);
>         num_pages = (hrange.end - hrange.start) >> PAGE_SHIFT;
> 
>         pfns = kmalloc_array(num_pages, sizeof(*pfns), GFP_KERNEL);
>         if (!pfns)
>                 return true;
> 
>         printk("invalidate; num pages = %lu\n", num_pages);
> 
>         hrange.hmm_pfns = pfns;
>         timeout = jiffies + msecs_to_jiffies(HMM_RANGE_DEFAULT_TIMEOUT);
>         do {
>                 hrange.notifier_seq = mmu_interval_read_begin(mn);
> 
>                 mmap_read_lock(ubuf->vmm_mm);
>                 ret = hmm_range_fault(&hrange);
>                 mmap_read_unlock(ubuf->vmm_mm);
>                 if (ret) {
>                         if (ret == -EBUSY && !time_after(jiffies, timeout))
>                                 continue;
>                         break;
>                 }
> 
>                 if (mmu_interval_read_retry(mn, hrange.notifier_seq))
>                         continue;
>         } while (ret);
> 
>         if (!ret) {
>                 for (i = 0; i < num_pages; i++) {
>                         printk("hmm returned page = %p; pfn = %lu\n",
>                                hmm_pfn_to_page(pfns[i]),
>                                pfns[i] & ~HMM_PFN_FLAGS);
>                 }
>         }
>         return true;
> }
> 
Doing the above from a wq worker func (scheduled after invalidate event)
instead of the invalidate callback lets hmm_range_fault() fault-in new pages.
What this means is that, at-least in my use-case, getting MMU_NOTIFY_CLEAR
indicates that the invalidation is still ongoing and that it is not done yet.
Sorry for the confusion.

Thanks,
Vivek

> static const struct mmu_interval_notifier_ops udmabuf_invalidate_ops = {
>         .invalidate = invalidate_udmabuf,
> };
> 
> >
> > > to walk over the PTEs associated with the hole. When this happens, I
> > noticed that
> > > the below function returns HMM_PFN_VALID | HMM_PFN_WRITE for all
> > the
> > > PTEs associated with the hole.
> > > static inline unsigned long pte_to_hmm_pfn_flags(struct hmm_range
> > *range,
> > >                                                  pte_t pte)
> > > {
> > >         if (pte_none(pte) || !pte_present(pte) || pte_protnone(pte))
> > >                 return 0;
> > >         return pte_write(pte) ? (HMM_PFN_VALID | HMM_PFN_WRITE) :
> > HMM_PFN_VALID;
> > > }
> > >
> > > As a result, hmm_pte_need_fault() always returns 0 and shmem_fault()
> > > never gets triggered despite specifying HMM_PFN_REQ_FAULT |
> > HMM_PFN_REQ_WRITE.
> > > And, the set of PFNs returned by hmm_range_fault() are the same ones
> > > that existed before the hole was punched.
> > >
> > >> Subsequent writes will just upgrade PTE permissions
> > >> assuming the read didn't map them RW to begin with. If you want to
> > >> actually see the hole with hmm_range_fault() don't specify
> > >> HMM_PFN_REQ_FAULT (or _WRITE).
> > >>
> > >> >>
> > >> >> If they were marked non-prsent during the CLEAR then the shadow
> side
> > >> >> remains non-present until it gets its own fault.
> > >> >>
> > >> >> If they were made non-present without an invalidation then that is a
> > >> >> bug.
> > >> >>
> > >> >> > > hmm_range_fault() is the correct API to use if you are working
> with
> > >> >> > > notifiers. Do not hack something together using pin_user_pages.
> > >> >>
> > >> >> > I noticed that hmm_range_fault() does not seem to be working as
> > >> expected
> > >> >> > given that it gets stuck(hangs) while walking hugetlb pages.
> > >> >>
> > >> >> You are the first to report that, it sounds like a serious bug. Please
> > >> >> try to fix it.
> > >> >>
> > >> >> > Regardless, as I mentioned above, the lack of notification when PTEs
> > >> >> > do get updated due to writes is the crux of the issue
> > >> >> > here. Therefore, AFAIU, triggering an invalidation event or some
> > >> >> > other kind of notification would help in fixing this issue.
> > >> >>
> > >> >> You seem to be facing some kind of bug in the mm, it sounds pretty
> > >> >> serious, and it almost certainly is a missing invalidation.
> > >> >>
> > >> >> Basically, anything that changes a PTE must eventually trigger an
> > >> >> invalidation. It is illegal to change a PTE from one present value to
> > >> >> another present value without invalidation notification.
> > >> >>
> > >> >> It is not surprising something would be missed here.
> > >> > As you suggest, it looks like the root-cause of this issue is the missing
> > >> > invalidation notification when the PTEs are changed from one present
> > >>
> > >> I don't think there's a missing invalidation here. You say you're seeing
> > >> the MMU_NOTIFY_CLEAR when hole punching which is when the PTE is
> > >> cleared. When else do you expect a notification?
> > > Oh, given that we are finding PTEs that are still pte_present() even after
> > > MMU_NOTIFY_CLEAR is sent, the theory is that another
> > MMU_NOTIFY_CLEAR
> > > needs to be sent after the PTEs are updated when new pages are faulted-
> in.
> > >
> > > However, it just occurred to me that maybe the behavior I am seeing is not
> > > unexpected as it might be a timing issue that has to do with when the
> PTEs
> > > are walked. Let me explain. Here is what shmem does when a hole is
> > punched:
> > >                 if ((u64)unmap_end > (u64)unmap_start)
> > >                         unmap_mapping_range(mapping, unmap_start,
> > >                                             1 + unmap_end - unmap_start, 0);
> > >                 shmem_truncate_range(inode, offset, offset + len - 1);
> > >
> > > IIUC, the invalidate callback is called from unmap_mapping_range() but
> > > the page removal does not happen until shmem_truncate_range() gets
> > > called. So, if I were to call hmm_range_fault() after
> > shmem_truncate_range(),
> > > I might see different results as the PTEs would probably no longer be
> > present.
> > > In order to test this theory, I would have to schedule a wq thread func
> from
> > the
> > > invalidate callback (to walk the PTEs after a slight delay). I'll try this out
> > when
> > > I get a chance after addressing some of the locking concerns associated
> with
> > > pairing static/dynamic dmabuf exporters and importers.
> >
> > That sounds plausible. The PTE will actually be cleared in
> > unmap_mapping_range() after the mmu notifier is called. I'm curious how
> > hmm_range_fault() passes though.
> >
> > > Thanks,
> > > Vivek
> > >
> > >>
> > >> > value to another. I'd like to fix this issue eventually but I first need to
> > >> > focus on addressing udmabuf page migration (out of movable zone)
> > >> > and also look into the locking concerns Daniel mentioned about pairing
> > >> > static and dynamic dmabuf exporters and importers.
> > >> >
> > >> > Thanks,
> > >> > Vivek
Jason Gunthorpe Aug. 30, 2023, 4:02 p.m. UTC | #57
On Mon, Aug 28, 2023 at 04:38:01AM +0000, Kasireddy, Vivek wrote:

> Turns out, calling hmm_range_fault() from the invalidate callback was indeed
> a problem and the reason why new pages were not faulted-in. In other words,
> it looks like the invalidate callback is not the right place to invoke hmm_range_fault()
> as the PTEs may not have been cleared.

Yes, that is correct, you can't even do the locking properly if you
invoke things like this. mmu_interval_read_retry() should continually
fail while you are within the invalidate callback.

Jason
diff mbox series

Patch

diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index 64a3e051c3c4..218ddc3b4bc7 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -131,6 +131,16 @@  struct mmu_notifier_ops {
 			   unsigned long address,
 			   pte_t pte);
 
+	/*
+	 * update_mapping is called when a page is replaced (at a given offset)
+	 * in a mapping backed by shmem or hugetlbfs. The new page's pfn will
+	 * be contained in the pfn field.
+	 */
+	void (*update_mapping)(struct mmu_notifier *subscription,
+			       struct mm_struct *mm,
+			       unsigned long address,
+			       unsigned long pfn);
+
 	/*
 	 * invalidate_range_start() and invalidate_range_end() must be
 	 * paired and are called only when the mmap_lock and/or the
@@ -394,6 +404,9 @@  extern int __mmu_notifier_test_young(struct mm_struct *mm,
 				     unsigned long address);
 extern void __mmu_notifier_change_pte(struct mm_struct *mm,
 				      unsigned long address, pte_t pte);
+extern void __mmu_notifier_update_mapping(struct mm_struct *mm,
+					  unsigned long address,
+					  unsigned long pfn);
 extern int __mmu_notifier_invalidate_range_start(struct mmu_notifier_range *r);
 extern void __mmu_notifier_invalidate_range_end(struct mmu_notifier_range *r,
 				  bool only_end);
@@ -447,6 +460,14 @@  static inline void mmu_notifier_change_pte(struct mm_struct *mm,
 		__mmu_notifier_change_pte(mm, address, pte);
 }
 
+static inline void mmu_notifier_update_mapping(struct mm_struct *mm,
+					       unsigned long address,
+					       unsigned long pfn)
+{
+	if (mm_has_notifiers(mm))
+		__mmu_notifier_update_mapping(mm, address, pfn);
+}
+
 static inline void
 mmu_notifier_invalidate_range_start(struct mmu_notifier_range *range)
 {
@@ -695,6 +716,12 @@  static inline void mmu_notifier_change_pte(struct mm_struct *mm,
 {
 }
 
+static inline void mmu_notifier_update_mapping(struct mm_struct *mm,
+					       unsigned long address,
+					       unsigned long pfn)
+{
+}
+
 static inline void
 mmu_notifier_invalidate_range_start(struct mmu_notifier_range *range)
 {
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 64a3239b6407..1f2f0209101a 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -6096,8 +6096,12 @@  vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 		 * hugetlb_no_page will drop vma lock and hugetlb fault
 		 * mutex internally, which make us return immediately.
 		 */
-		return hugetlb_no_page(mm, vma, mapping, idx, address, ptep,
+		ret = hugetlb_no_page(mm, vma, mapping, idx, address, ptep,
 				      entry, flags);
+		if (!ret)
+			mmu_notifier_update_mapping(vma->vm_mm, address,
+						    pte_pfn(*ptep));
+		return ret;
 
 	ret = 0;
 
@@ -6223,6 +6227,9 @@  vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	 */
 	if (need_wait_lock)
 		folio_wait_locked(folio);
+	if (!ret)
+		mmu_notifier_update_mapping(vma->vm_mm, address,
+					    pte_pfn(*ptep));
 	return ret;
 }
 
diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
index 50c0dde1354f..6421405334b9 100644
--- a/mm/mmu_notifier.c
+++ b/mm/mmu_notifier.c
@@ -441,6 +441,23 @@  void __mmu_notifier_change_pte(struct mm_struct *mm, unsigned long address,
 	srcu_read_unlock(&srcu, id);
 }
 
+void __mmu_notifier_update_mapping(struct mm_struct *mm, unsigned long address,
+				   unsigned long pfn)
+{
+	struct mmu_notifier *subscription;
+	int id;
+
+	id = srcu_read_lock(&srcu);
+	hlist_for_each_entry_rcu(subscription,
+				 &mm->notifier_subscriptions->list, hlist,
+				 srcu_read_lock_held(&srcu)) {
+		if (subscription->ops->update_mapping)
+			subscription->ops->update_mapping(subscription, mm,
+							  address, pfn);
+	}
+	srcu_read_unlock(&srcu, id);
+}
+
 static int mn_itree_invalidate(struct mmu_notifier_subscriptions *subscriptions,
 			       const struct mmu_notifier_range *range)
 {
diff --git a/mm/shmem.c b/mm/shmem.c
index 2f2e0e618072..e59eb5fafadb 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -77,6 +77,7 @@  static struct vfsmount *shm_mnt;
 #include <linux/fcntl.h>
 #include <uapi/linux/memfd.h>
 #include <linux/rmap.h>
+#include <linux/mmu_notifier.h>
 #include <linux/uuid.h>
 
 #include <linux/uaccess.h>
@@ -2164,8 +2165,12 @@  static vm_fault_t shmem_fault(struct vm_fault *vmf)
 				  gfp, vma, vmf, &ret);
 	if (err)
 		return vmf_error(err);
-	if (folio)
+	if (folio) {
 		vmf->page = folio_file_page(folio, vmf->pgoff);
+		if (ret == VM_FAULT_LOCKED)
+			mmu_notifier_update_mapping(vma->vm_mm, vmf->address,
+						    page_to_pfn(vmf->page));
+	}
 	return ret;
 }