diff mbox series

[2/6] mm/userfaultfd: Fix uffd-wp special cases for fork()

Message ID 20210428225030.9708-3-peterx@redhat.com (mailing list archive)
State New, archived
Headers show
Series mm/uffd: Misc fix for uffd-wp and one more test | expand

Commit Message

Peter Xu April 28, 2021, 10:50 p.m. UTC
We tried to do something similar in b569a1760782 ("userfaultfd: wp: drop
_PAGE_UFFD_WP properly when fork") previously, but it's not doing it all
right..  A few fixes around the code path:

  1. We were referencing VM_UFFD_WP vm_flags on the _old_ vma rather than the
     new vma.  That's overlooked in b569a1760782, so it won't work as expected.
     Thanks to the recent rework on fork code (7a4830c380f3a8b3), we can easily
     get the new vma now, so switch the checks to that.

  2. Dropping the uffd-wp bit in copy_huge_pmd() could be wrong if the huge pmd
     is a migration huge pmd.  When it happens, instead of using pmd_uffd_wp(),
     we should use pmd_swp_uffd_wp(). The fix is simply to handle them separately.

  3. Forget to carry over uffd-wp bit for a write migration huge pmd entry.
     This also happens in copy_huge_pmd(), where we converted a write huge
     migration entry into a read one.

  4. In copy_nonpresent_pte(), drop uffd-wp if necessary for swap ptes.

  5. In copy_present_page() when COW is enforced when fork(), we also need to
     pass over the uffd-wp bit if VM_UFFD_WP is armed on the new vma, and when
     the pte to be copied has uffd-wp bit set.

Remove the comment in copy_present_pte() about this.  It won't help a huge lot
to only comment there, but comment everywhere would be an overkill.  Let's
assume the commit messages would help.

Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Fixes: b569a1760782f3da03ff718d61f74163dea599ff
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 include/linux/huge_mm.h |  2 +-
 mm/huge_memory.c        | 23 ++++++++++-------------
 mm/memory.c             | 25 +++++++++++++------------
 3 files changed, 24 insertions(+), 26 deletions(-)

Comments

Andrew Morton May 26, 2021, 12:15 a.m. UTC | #1
On Wed, 28 Apr 2021 18:50:26 -0400 Peter Xu <peterx@redhat.com> wrote:

> We tried to do something similar in b569a1760782 ("userfaultfd: wp: drop
> _PAGE_UFFD_WP properly when fork") previously, but it's not doing it all
> right..  A few fixes around the code path:
> 
>   1. We were referencing VM_UFFD_WP vm_flags on the _old_ vma rather than the
>      new vma.  That's overlooked in b569a1760782, so it won't work as expected.
>      Thanks to the recent rework on fork code (7a4830c380f3a8b3), we can easily
>      get the new vma now, so switch the checks to that.
> 
>   2. Dropping the uffd-wp bit in copy_huge_pmd() could be wrong if the huge pmd
>      is a migration huge pmd.  When it happens, instead of using pmd_uffd_wp(),
>      we should use pmd_swp_uffd_wp(). The fix is simply to handle them separately.
> 
>   3. Forget to carry over uffd-wp bit for a write migration huge pmd entry.
>      This also happens in copy_huge_pmd(), where we converted a write huge
>      migration entry into a read one.
> 
>   4. In copy_nonpresent_pte(), drop uffd-wp if necessary for swap ptes.
> 
>   5. In copy_present_page() when COW is enforced when fork(), we also need to
>      pass over the uffd-wp bit if VM_UFFD_WP is armed on the new vma, and when
>      the pte to be copied has uffd-wp bit set.
> 
> Remove the comment in copy_present_pte() about this.  It won't help a huge lot
> to only comment there, but comment everywhere would be an overkill.  Let's
> assume the commit messages would help.
> 

This run afoul of Alistair's "mm: Device exclusive memory access",
https://lkml.kernel.org/r/20210524132725.12697-8-apopple@nvidia.com

`vma' is now undeclared.  I think this?

--- a/mm/memory.c~mm-userfaultfd-fix-uffd-wp-special-cases-for-fork-fix
+++ a/mm/memory.c
@@ -850,8 +850,8 @@ copy_nonpresent_pte(struct mm_struct *ds
 		 * exclusive entries currently only support private writable
 		 * (ie. COW) mappings.
 		 */
-		VM_BUG_ON(!is_cow_mapping(vma->vm_flags));
-		if (try_restore_exclusive_pte(src_mm, src_pte, vma, addr))
+		VM_BUG_ON(!is_cow_mapping(dst_vma->vm_flags));
+		if (try_restore_exclusive_pte(src_mm, src_pte, dst_vma, addr))
 			return -EBUSY;
 		return -ENOENT;
 	}
Peter Xu May 26, 2021, 12:36 a.m. UTC | #2
On Tue, May 25, 2021 at 05:15:58PM -0700, Andrew Morton wrote:
> This run afoul of Alistair's "mm: Device exclusive memory access",
> https://lkml.kernel.org/r/20210524132725.12697-8-apopple@nvidia.com
> 
> `vma' is now undeclared.  I think this?
> 
> --- a/mm/memory.c~mm-userfaultfd-fix-uffd-wp-special-cases-for-fork-fix
> +++ a/mm/memory.c
> @@ -850,8 +850,8 @@ copy_nonpresent_pte(struct mm_struct *ds
>  		 * exclusive entries currently only support private writable
>  		 * (ie. COW) mappings.
>  		 */
> -		VM_BUG_ON(!is_cow_mapping(vma->vm_flags));
> -		if (try_restore_exclusive_pte(src_mm, src_pte, vma, addr))
> +		VM_BUG_ON(!is_cow_mapping(dst_vma->vm_flags));

This one looks good, as both src_vma/dst_vma should have the same flags related
to is_cow.

> +		if (try_restore_exclusive_pte(src_mm, src_pte, dst_vma, addr))

Should this be s/dst_vma/src_vma/ perhaps?  Alistairs please correct me
otherwise, as it tries to restore the pte for src mm not dst (the child).

I haven't yet got time to look at the new series, planning to do it tomorrow
maybe.. but I see that it's already queued in -mm.  Andrew, we do have chance
to go back if necessary, right?

I haven't looked at the rest, but I think try_restore_exclusive_pte() can at
least drop the *mm pointer as it's never used (even if we need, we've got
vma->vm_mm too)..
Andrew Morton May 26, 2021, 3:04 a.m. UTC | #3
On Tue, 25 May 2021 20:36:18 -0400 Peter Xu <peterx@redhat.com> wrote:

> On Tue, May 25, 2021 at 05:15:58PM -0700, Andrew Morton wrote:
> > This run afoul of Alistair's "mm: Device exclusive memory access",
> > https://lkml.kernel.org/r/20210524132725.12697-8-apopple@nvidia.com
> > 
> > `vma' is now undeclared.  I think this?
> > 
> > --- a/mm/memory.c~mm-userfaultfd-fix-uffd-wp-special-cases-for-fork-fix
> > +++ a/mm/memory.c
> > @@ -850,8 +850,8 @@ copy_nonpresent_pte(struct mm_struct *ds
> >  		 * exclusive entries currently only support private writable
> >  		 * (ie. COW) mappings.
> >  		 */
> > -		VM_BUG_ON(!is_cow_mapping(vma->vm_flags));
> > -		if (try_restore_exclusive_pte(src_mm, src_pte, vma, addr))
> > +		VM_BUG_ON(!is_cow_mapping(dst_vma->vm_flags));
> 
> This one looks good, as both src_vma/dst_vma should have the same flags related
> to is_cow.
> 
> > +		if (try_restore_exclusive_pte(src_mm, src_pte, dst_vma, addr))
> 
> Should this be s/dst_vma/src_vma/ perhaps?  Alistairs please correct me
> otherwise, as it tries to restore the pte for src mm not dst (the child).
> 
> I haven't yet got time to look at the new series, planning to do it tomorrow
> maybe.. but I see that it's already queued in -mm.  Andrew, we do have chance
> to go back if necessary, right?

Sure.

> I haven't looked at the rest, but I think try_restore_exclusive_pte() can at
> least drop the *mm pointer as it's never used (even if we need, we've got
> vma->vm_mm too)..

OK, thanks.  I just released a tree into linux-next (hopefully this
blooper won't cause too much damage).  Please send a suitable fixup.
diff mbox series

Patch

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 9626fda5efcea..60dad7c88d72b 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -10,7 +10,7 @@ 
 vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf);
 int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 		  pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long addr,
-		  struct vm_area_struct *vma);
+		  struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma);
 void huge_pmd_set_accessed(struct vm_fault *vmf, pmd_t orig_pmd);
 int copy_huge_pud(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 		  pud_t *dst_pud, pud_t *src_pud, unsigned long addr,
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 22bf2d0fff79b..20a4569895254 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1014,7 +1014,7 @@  struct page *follow_devmap_pmd(struct vm_area_struct *vma, unsigned long addr,
 
 int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 		  pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long addr,
-		  struct vm_area_struct *vma)
+		  struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
 {
 	spinlock_t *dst_ptl, *src_ptl;
 	struct page *src_page;
@@ -1023,7 +1023,7 @@  int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	int ret = -ENOMEM;
 
 	/* Skip if can be re-fill on fault */
-	if (!vma_is_anonymous(vma))
+	if (!vma_is_anonymous(dst_vma))
 		return 0;
 
 	pgtable = pte_alloc_one(dst_mm);
@@ -1037,14 +1037,6 @@  int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	ret = -EAGAIN;
 	pmd = *src_pmd;
 
-	/*
-	 * Make sure the _PAGE_UFFD_WP bit is cleared if the new VMA
-	 * does not have the VM_UFFD_WP, which means that the uffd
-	 * fork event is not enabled.
-	 */
-	if (!(vma->vm_flags & VM_UFFD_WP))
-		pmd = pmd_clear_uffd_wp(pmd);
-
 #ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
 	if (unlikely(is_swap_pmd(pmd))) {
 		swp_entry_t entry = pmd_to_swp_entry(pmd);
@@ -1055,11 +1047,15 @@  int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 			pmd = swp_entry_to_pmd(entry);
 			if (pmd_swp_soft_dirty(*src_pmd))
 				pmd = pmd_swp_mksoft_dirty(pmd);
+			if (pmd_swp_uffd_wp(*src_pmd))
+				pmd = pmd_swp_mkuffd_wp(pmd);
 			set_pmd_at(src_mm, addr, src_pmd, pmd);
 		}
 		add_mm_counter(dst_mm, MM_ANONPAGES, HPAGE_PMD_NR);
 		mm_inc_nr_ptes(dst_mm);
 		pgtable_trans_huge_deposit(dst_mm, dst_pmd, pgtable);
+		if (!userfaultfd_wp(dst_vma))
+			pmd = pmd_swp_clear_uffd_wp(pmd);
 		set_pmd_at(dst_mm, addr, dst_pmd, pmd);
 		ret = 0;
 		goto out_unlock;
@@ -1095,11 +1091,11 @@  int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	 * best effort that the pinned pages won't be replaced by another
 	 * random page during the coming copy-on-write.
 	 */
-	if (unlikely(page_needs_cow_for_dma(vma, src_page))) {
+	if (unlikely(page_needs_cow_for_dma(src_vma, src_page))) {
 		pte_free(dst_mm, pgtable);
 		spin_unlock(src_ptl);
 		spin_unlock(dst_ptl);
-		__split_huge_pmd(vma, src_pmd, addr, false, NULL);
+		__split_huge_pmd(src_vma, src_pmd, addr, false, NULL);
 		return -EAGAIN;
 	}
 
@@ -1109,8 +1105,9 @@  int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 out_zero_page:
 	mm_inc_nr_ptes(dst_mm);
 	pgtable_trans_huge_deposit(dst_mm, dst_pmd, pgtable);
-
 	pmdp_set_wrprotect(src_mm, addr, src_pmd);
+	if (!userfaultfd_wp(dst_vma))
+		pmd = pmd_clear_uffd_wp(pmd);
 	pmd = pmd_mkold(pmd_wrprotect(pmd));
 	set_pmd_at(dst_mm, addr, dst_pmd, pmd);
 
diff --git a/mm/memory.c b/mm/memory.c
index 045daf58608f7..a17a53a7dade6 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -708,10 +708,10 @@  struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr,
 
 static unsigned long
 copy_nonpresent_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
-		pte_t *dst_pte, pte_t *src_pte, struct vm_area_struct *vma,
-		unsigned long addr, int *rss)
+		pte_t *dst_pte, pte_t *src_pte, struct vm_area_struct *dst_vma,
+		struct vm_area_struct *src_vma, unsigned long addr, int *rss)
 {
-	unsigned long vm_flags = vma->vm_flags;
+	unsigned long vm_flags = dst_vma->vm_flags;
 	pte_t pte = *src_pte;
 	struct page *page;
 	swp_entry_t entry = pte_to_swp_entry(pte);
@@ -780,6 +780,8 @@  copy_nonpresent_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 			set_pte_at(src_mm, addr, src_pte, pte);
 		}
 	}
+	if (!userfaultfd_wp(dst_vma))
+		pte = pte_swp_clear_uffd_wp(pte);
 	set_pte_at(dst_mm, addr, dst_pte, pte);
 	return 0;
 }
@@ -845,6 +847,9 @@  copy_present_page(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma
 	/* All done, just insert the new page copy in the child */
 	pte = mk_pte(new_page, dst_vma->vm_page_prot);
 	pte = maybe_mkwrite(pte_mkdirty(pte), dst_vma);
+	if (userfaultfd_pte_wp(dst_vma, *src_pte))
+		/* Uffd-wp needs to be delivered to dest pte as well */
+		pte = pte_wrprotect(pte_mkuffd_wp(pte));
 	set_pte_at(dst_vma->vm_mm, addr, dst_pte, pte);
 	return 0;
 }
@@ -894,12 +899,7 @@  copy_present_pte(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
 		pte = pte_mkclean(pte);
 	pte = pte_mkold(pte);
 
-	/*
-	 * Make sure the _PAGE_UFFD_WP bit is cleared if the new VMA
-	 * does not have the VM_UFFD_WP, which means that the uffd
-	 * fork event is not enabled.
-	 */
-	if (!(vm_flags & VM_UFFD_WP))
+	if (!userfaultfd_wp(dst_vma))
 		pte = pte_clear_uffd_wp(pte);
 
 	set_pte_at(dst_vma->vm_mm, addr, dst_pte, pte);
@@ -974,7 +974,8 @@  copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
 		if (unlikely(!pte_present(*src_pte))) {
 			entry.val = copy_nonpresent_pte(dst_mm, src_mm,
 							dst_pte, src_pte,
-							src_vma, addr, rss);
+							dst_vma, src_vma,
+							addr, rss);
 			if (entry.val)
 				break;
 			progress += 8;
@@ -1051,8 +1052,8 @@  copy_pmd_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
 			|| pmd_devmap(*src_pmd)) {
 			int err;
 			VM_BUG_ON_VMA(next-addr != HPAGE_PMD_SIZE, src_vma);
-			err = copy_huge_pmd(dst_mm, src_mm,
-					    dst_pmd, src_pmd, addr, src_vma);
+			err = copy_huge_pmd(dst_mm, src_mm, dst_pmd, src_pmd,
+					    addr, dst_vma, src_vma);
 			if (err == -ENOMEM)
 				return -ENOMEM;
 			if (!err)