diff mbox series

[1/1] userfaultfd: fix move_pages_pte() splitting folio under RCU read lock

Message ID 20231230025607.2476912-1-surenb@google.com (mailing list archive)
State New, archived
Headers show
Series [1/1] userfaultfd: fix move_pages_pte() splitting folio under RCU read lock | expand

Commit Message

Suren Baghdasaryan Dec. 30, 2023, 2:56 a.m. UTC
While testing the split PMD path with lockdep enabled I've got an
"Invalid wait context" error caused by split_huge_page_to_list() trying
to lock anon_vma->rwsem while inside RCU read section. The issues is due
to move_pages_pte() calling split_folio() under RCU read lock. Fix this
by unmapping the PTEs and exiting RCU read section before splitting the
folio and then retrying. The same retry pattern is used when locking the
folio or anon_vma in this function.

Fixes: 94b01c885131 ("userfaultfd: UFFDIO_MOVE uABI")
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
Patch applies over mm-unstable.
Please note that the SHA in Fixes tag is unstable.

 mm/userfaultfd.c | 5 +++++
 1 file changed, 5 insertions(+)

Comments

Peter Xu Jan. 2, 2024, 8:59 a.m. UTC | #1
On Fri, Dec 29, 2023 at 06:56:07PM -0800, Suren Baghdasaryan wrote:
> @@ -1078,9 +1078,14 @@ static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd,
>  
>  		/* at this point we have src_folio locked */
>  		if (folio_test_large(src_folio)) {
> +			/* split_folio() can block */
> +			pte_unmap(&orig_src_pte);
> +			pte_unmap(&orig_dst_pte);
> +			src_pte = dst_pte = NULL;
>  			err = split_folio(src_folio);
>  			if (err)
>  				goto out;
> +			goto retry;
>  		}

Do we also need to clear src_folio and src_folio_pte?  If the folio is a
thp, I think it means it's pte mapped here. Then after the split we may
want to fetch the small folio after the split, not the head one?
Suren Baghdasaryan Jan. 2, 2024, 4:58 p.m. UTC | #2
On Tue, Jan 2, 2024 at 1:00 AM Peter Xu <peterx@redhat.com> wrote:
>
> On Fri, Dec 29, 2023 at 06:56:07PM -0800, Suren Baghdasaryan wrote:
> > @@ -1078,9 +1078,14 @@ static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd,
> >
> >               /* at this point we have src_folio locked */
> >               if (folio_test_large(src_folio)) {
> > +                     /* split_folio() can block */
> > +                     pte_unmap(&orig_src_pte);
> > +                     pte_unmap(&orig_dst_pte);
> > +                     src_pte = dst_pte = NULL;
> >                       err = split_folio(src_folio);
> >                       if (err)
> >                               goto out;
> > +                     goto retry;
> >               }
>
> Do we also need to clear src_folio and src_folio_pte?  If the folio is a
> thp, I think it means it's pte mapped here. Then after the split we may
> want to fetch the small folio after the split, not the head one?

I think we need to re-fetch the src_folio only if the src_addr falls
into a non-head page. Looking at the __split_huge_page(), the head
page is skipped in the last loop, so I think it should stay valid.
That said, maybe it's just an implementation detail of the
__split_huge_page() and I should not rely on that and refetch anyway?

>
> --
> Peter Xu
>
Suren Baghdasaryan Jan. 2, 2024, 11:16 p.m. UTC | #3
On Tue, Jan 2, 2024 at 8:58 AM Suren Baghdasaryan <surenb@google.com> wrote:
>
> On Tue, Jan 2, 2024 at 1:00 AM Peter Xu <peterx@redhat.com> wrote:
> >
> > On Fri, Dec 29, 2023 at 06:56:07PM -0800, Suren Baghdasaryan wrote:
> > > @@ -1078,9 +1078,14 @@ static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd,
> > >
> > >               /* at this point we have src_folio locked */
> > >               if (folio_test_large(src_folio)) {
> > > +                     /* split_folio() can block */
> > > +                     pte_unmap(&orig_src_pte);
> > > +                     pte_unmap(&orig_dst_pte);
> > > +                     src_pte = dst_pte = NULL;
> > >                       err = split_folio(src_folio);
> > >                       if (err)
> > >                               goto out;
> > > +                     goto retry;
> > >               }
> >
> > Do we also need to clear src_folio and src_folio_pte?  If the folio is a
> > thp, I think it means it's pte mapped here. Then after the split we may
> > want to fetch the small folio after the split, not the head one?
>
> I think we need to re-fetch the src_folio only if the src_addr falls
> into a non-head page. Looking at the __split_huge_page(), the head
> page is skipped in the last loop, so I think it should stay valid.
> That said, maybe it's just an implementation detail of the
> __split_huge_page() and I should not rely on that and refetch anyway?

I'll post a v2 with this fix and re-fetching the folio
unconditionally. We also don't need to reset src_folio_pte value
because it's used only if src_folio is not NULL.
Thanks for catching this, Peter!

>
> >
> > --
> > Peter Xu
> >
Suren Baghdasaryan Jan. 2, 2024, 11:34 p.m. UTC | #4
On Tue, Jan 2, 2024 at 3:16 PM Suren Baghdasaryan <surenb@google.com> wrote:
>
> On Tue, Jan 2, 2024 at 8:58 AM Suren Baghdasaryan <surenb@google.com> wrote:
> >
> > On Tue, Jan 2, 2024 at 1:00 AM Peter Xu <peterx@redhat.com> wrote:
> > >
> > > On Fri, Dec 29, 2023 at 06:56:07PM -0800, Suren Baghdasaryan wrote:
> > > > @@ -1078,9 +1078,14 @@ static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd,
> > > >
> > > >               /* at this point we have src_folio locked */
> > > >               if (folio_test_large(src_folio)) {
> > > > +                     /* split_folio() can block */
> > > > +                     pte_unmap(&orig_src_pte);
> > > > +                     pte_unmap(&orig_dst_pte);
> > > > +                     src_pte = dst_pte = NULL;
> > > >                       err = split_folio(src_folio);
> > > >                       if (err)
> > > >                               goto out;
> > > > +                     goto retry;
> > > >               }
> > >
> > > Do we also need to clear src_folio and src_folio_pte?  If the folio is a
> > > thp, I think it means it's pte mapped here. Then after the split we may
> > > want to fetch the small folio after the split, not the head one?
> >
> > I think we need to re-fetch the src_folio only if the src_addr falls
> > into a non-head page. Looking at the __split_huge_page(), the head
> > page is skipped in the last loop, so I think it should stay valid.
> > That said, maybe it's just an implementation detail of the
> > __split_huge_page() and I should not rely on that and refetch anyway?
>
> I'll post a v2 with this fix and re-fetching the folio
> unconditionally. We also don't need to reset src_folio_pte value
> because it's used only if src_folio is not NULL.

Posted at https://lore.kernel.org/all/20240102233256.1077959-1-surenb@google.com/

> Thanks for catching this, Peter!
>
> >
> > >
> > > --
> > > Peter Xu
> > >
diff mbox series

Patch

diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 5e718014e671..71393410e028 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -1078,9 +1078,14 @@  static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd,
 
 		/* at this point we have src_folio locked */
 		if (folio_test_large(src_folio)) {
+			/* split_folio() can block */
+			pte_unmap(&orig_src_pte);
+			pte_unmap(&orig_dst_pte);
+			src_pte = dst_pte = NULL;
 			err = split_folio(src_folio);
 			if (err)
 				goto out;
+			goto retry;
 		}
 
 		if (!src_anon_vma) {