
[1/1] userfaultfd: do not block on locking a large folio with raised refcount

Message ID 20250225204613.2316092-1-surenb@google.com (mailing list archive)
State New

Commit Message

Suren Baghdasaryan Feb. 25, 2025, 8:46 p.m. UTC
Lokesh recently raised an issue about UFFDIO_MOVE getting into a
livelocked state when it goes into split_folio() with a raised folio
refcount. split_folio() expects the reference count to be exactly
mapcount + num_pages_in_folio + 1 (see can_split_folio()) and fails
with -EAGAIN otherwise. If multiple processes are trying to move the
same large folio, each of them raises the refcount (all tasks succeed
in that), then one of them succeeds in locking the folio while the
others block in folio_lock() with their refcounts still raised. The
winner of this race proceeds to call split_folio(), fails because of
the extra references, and returns -EAGAIN to the caller after unlocking
the folio. The next competing process then gets the folio locked and
goes through the same flow. In the meantime the original winner is
retried and blocks in folio_lock() again, joining the queue of waiting
processes only to repeat the same path. All of this results in a
livelock.

An easy fix is to avoid waiting for the folio lock while holding the
folio refcount, similar to madvise_free_huge_pmd(), where the folio
lock is acquired before raising the folio refcount.

Modify move_pages_pte() to try locking the folio first; if that fails
and the folio is large, return -EAGAIN without touching the folio
refcount. If the folio is single-page then split_folio() is not called,
so we do not have this issue.

Lokesh has a reproducer [1] and I verified that this change fixes the
issue.

[1] https://github.com/lokeshgidra/uffd_move_ioctl_deadlock

Reported-by: Lokesh Gidra <lokeshgidra@google.com>
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
 mm/userfaultfd.c | 17 ++++++++++++++++-
 1 file changed, 16 insertions(+), 1 deletion(-)


base-commit: 801d47bd96ce22acd43809bc09e004679f707c39
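As a hedged illustration of the approach (not the kernel code itself), the trylock-before-refcount pattern can be sketched in userspace with pthreads and a toy reference count; all names in this sketch are made up:

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Toy stand-in for a large folio: a lock plus a reference count.
 * As with can_split_folio(), a "split" only succeeds when the count
 * is exactly the holder's own reference (here: 1). */
struct toy_folio {
	pthread_mutex_t lock;
	atomic_int refcount;
};

/* The pattern from the patch: trylock BEFORE taking a reference.
 * A loser backs off (what would be -EAGAIN in the kernel) without
 * ever raising the refcount, so the lock holder observes the exact
 * expected count and its split cannot fail for this reason. */
static bool try_move(struct toy_folio *f)
{
	bool can_split;

	if (pthread_mutex_trylock(&f->lock) != 0)
		return false;			/* back off; no ref taken */

	atomic_fetch_add(&f->refcount, 1);	/* like folio_get() */
	can_split = (atomic_load(&f->refcount) == 1);
	atomic_fetch_sub(&f->refcount, 1);
	pthread_mutex_unlock(&f->lock);
	return can_split;
}
```

In the old ordering (refcount first, blocking lock second), every loser would hold an extra reference while sleeping in the lock queue, so the winner's exact-count check would keep failing: the livelock described in the commit message.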

Comments

Lokesh Gidra Feb. 25, 2025, 8:56 p.m. UTC | #1
On Tue, Feb 25, 2025 at 12:46 PM Suren Baghdasaryan <surenb@google.com> wrote:
>
> Lokesh recently raised an issue about UFFDIO_MOVE getting into a deadlock
> [...]
>
Thanks so much for fixing this, Suren.

Fixes: adef440691ba ("userfaultfd: UFFDIO_MOVE uABI")
> Reported-by: Lokesh Gidra <lokeshgidra@google.com>
> Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> [...]
Peter Xu Feb. 25, 2025, 9:32 p.m. UTC | #2
On Tue, Feb 25, 2025 at 12:46:13PM -0800, Suren Baghdasaryan wrote:
> Lokesh recently raised an issue about UFFDIO_MOVE getting into a deadlock
> [...]
> Reported-by: Lokesh Gidra <lokeshgidra@google.com>
> Signed-off-by: Suren Baghdasaryan <surenb@google.com>

Reviewed-by: Peter Xu <peterx@redhat.com>

One question unrelated to this change below..

> [...]
>  			folio_get(folio);
>  			src_folio = folio;
>  			src_folio_pte = orig_src_pte;
>  			spin_unlock(src_ptl);
>  
> -			if (!folio_trylock(src_folio)) {
> +			if (!locked) {
>  				pte_unmap(&orig_src_pte);
>  				pte_unmap(&orig_dst_pte);

.. just noticed this.  Are these problematic?  I mean, orig_*_pte are stack
variables, afaict.  I'd expect these things to blow up on HIGHPTE..

Suren Baghdasaryan Feb. 25, 2025, 10:12 p.m. UTC | #3
On Tue, Feb 25, 2025 at 1:32 PM Peter Xu <peterx@redhat.com> wrote:
>
> On Tue, Feb 25, 2025 at 12:46:13PM -0800, Suren Baghdasaryan wrote:
> > [...]
> > -                     if (!folio_trylock(src_folio)) {
> > +                     if (!locked) {
> >                               pte_unmap(&orig_src_pte);
> >                               pte_unmap(&orig_dst_pte);
>
> .. just notice this.  Are these problematic?  I mean, orig_*_pte are stack
> variables, afaict.  I'd expect these things blow on HIGHPTE..

Ugh! Yes, I think so. From a quick look, move_pages_pte() is the only
place we have this issue and I don't see a reason for copying src_pte
and dst_pte values. I'll spend some more time trying to understand if
we really need these local copies.
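For context, the reason such local copies usually exist is an optimistic-concurrency pattern: snapshot a PTE-like value, drop the locks, and later verify nothing changed before committing. A hypothetical userspace sketch (made-up names, a plain atomic word standing in for a PTE read under the PTL):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Snapshot the PTE-like word while it is stable, analogous to
 * saving orig_src_pte from *src_pte under the PTL. */
static long pte_snapshot(atomic_long *slot)
{
	return atomic_load(slot);
}

/* After dropping and re-taking locks, proceed only if the word still
 * matches the snapshot; otherwise the caller must retry, just as
 * move_pages_pte() bails out when the PTEs changed under it. */
static bool pte_still_same(atomic_long *slot, long orig)
{
	return atomic_load(slot) == orig;
}
```

The snapshot itself is needed for the recheck; the question in this thread is only whether the *address* of the snapshot should ever be passed to pte_unmap().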

Suren Baghdasaryan Feb. 25, 2025, 10:21 p.m. UTC | #4
On Tue, Feb 25, 2025 at 2:12 PM Suren Baghdasaryan <surenb@google.com> wrote:
>
> On Tue, Feb 25, 2025 at 1:32 PM Peter Xu <peterx@redhat.com> wrote:
> >
> > On Tue, Feb 25, 2025 at 12:46:13PM -0800, Suren Baghdasaryan wrote:
> > > [...]
> > > -                     if (!folio_trylock(src_folio)) {
> > > +                     if (!locked) {
> > >                               pte_unmap(&orig_src_pte);
> > >                               pte_unmap(&orig_dst_pte);
> >
> > .. just notice this.  Are these problematic?  I mean, orig_*_pte are stack
> > variables, afaict.  I'd expect these things blow on HIGHPTE..
>
> Ugh! Yes, I think so. From a quick look, move_pages_pte() is the only
> place we have this issue and I don't see a reason for copying src_pte
> and dst_pte values. I'll spend some more time trying to understand if
> we really need these local copies.

Ah, we copy the values to later check if PTEs changed from under us.
But I see no reason we need to use orig_{src|dst}_pte instead of
{src|dst}_pte when doing pte_unmap(). I think we can safely replace
them with the original ones. WDYT?
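The pairing concern here can be sketched in a hypothetical userspace analogy (made-up names; a toy table standing in for HIGHPTE kmap slots): an unmap must be given the address the map call handed out, not the address of a stack copy of the value.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_MAPS 8

/* Toy kmap: remember which addresses were handed out. */
static const void *mapped[MAX_MAPS];
static int nr_mapped;

static const void *toy_kmap(const void *page)
{
	mapped[nr_mapped++] = page;
	return page;
}

/* Toy kunmap: only succeeds for an address we actually mapped.
 * With HIGHPTE, pte_unmap() similarly needs the mapped address;
 * the address of a stack copy (like &orig_src_pte) is not it. */
static bool toy_kunmap(const void *addr)
{
	for (int i = 0; i < nr_mapped; i++) {
		if (mapped[i] == addr) {
			mapped[i] = mapped[--nr_mapped];
			return true;
		}
	}
	return false;	/* bogus unmap: not a mapped address */
}
```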

Peter Xu Feb. 25, 2025, 10:48 p.m. UTC | #5
On Tue, Feb 25, 2025 at 02:21:39PM -0800, Suren Baghdasaryan wrote:
> On Tue, Feb 25, 2025 at 2:12 PM Suren Baghdasaryan <surenb@google.com> wrote:
> >
> > On Tue, Feb 25, 2025 at 1:32 PM Peter Xu <peterx@redhat.com> wrote:
> > >
> > > On Tue, Feb 25, 2025 at 12:46:13PM -0800, Suren Baghdasaryan wrote:
> > > > [...]
> > > > -                     if (!folio_trylock(src_folio)) {
> > > > +                     if (!locked) {
> > > >                               pte_unmap(&orig_src_pte);
> > > >                               pte_unmap(&orig_dst_pte);
> > >
> > > .. just notice this.  Are these problematic?  I mean, orig_*_pte are stack
> > > variables, afaict.  I'd expect these things blow on HIGHPTE..
> >
> > Ugh! Yes, I think so. From a quick look, move_pages_pte() is the only
> > place we have this issue and I don't see a reason for copying src_pte
> > and dst_pte values. I'll spend some more time trying to understand if
> > we really need these local copies.
> 
> Ah, we copy the values to later check if PTEs changed from under us.
> But I see no reason we need to use orig_{src|dst}_pte instead of
> {src|dst}_pte when doing pte_unmap(). I think we can safely replace

That looks like something we just overlooked before; meanwhile it's
undetectable on !HIGHPTE anyway, in which case the addr is ignored and
the unmap always turns into an RCU unlock.

> them with the original ones. WDYT?

Agreed.  Maybe not "the original ones" if we're looking for words to put
into the commit message: it could be "we should kunmap() whatever we
kmap()ed before", or something better.

Thanks,
Suren Baghdasaryan Feb. 25, 2025, 11:01 p.m. UTC | #6
On Tue, Feb 25, 2025 at 2:48 PM Peter Xu <peterx@redhat.com> wrote:
>
> On Tue, Feb 25, 2025 at 02:21:39PM -0800, Suren Baghdasaryan wrote:
> > On Tue, Feb 25, 2025 at 2:12 PM Suren Baghdasaryan <surenb@google.com> wrote:
> > >
> > > On Tue, Feb 25, 2025 at 1:32 PM Peter Xu <peterx@redhat.com> wrote:
> > > >
> > > > On Tue, Feb 25, 2025 at 12:46:13PM -0800, Suren Baghdasaryan wrote:
> > > > > [...]
> > > > > -                     if (!folio_trylock(src_folio)) {
> > > > > +                     if (!locked) {
> > > > >                               pte_unmap(&orig_src_pte);
> > > > >                               pte_unmap(&orig_dst_pte);
> > > >
> > > > .. just notice this.  Are these problematic?  I mean, orig_*_pte are stack
> > > > variables, afaict.  I'd expect these things blow on HIGHPTE..
> > >
> > > Ugh! Yes, I think so. From a quick look, move_pages_pte() is the only
> > > place we have this issue and I don't see a reason for copying src_pte
> > > and dst_pte values. I'll spend some more time trying to understand if
> > > we really need these local copies.
> >
> > Ah, we copy the values to later check if PTEs changed from under us.
> > But I see no reason we need to use orig_{src|dst}_pte instead of
> > {src|dst}_pte when doing pte_unmap(). I think we can safely replace
>
> That looks like something we just overlooked before, meanwhile it's
> undetectable on !HIGHPTE anyway.. in which case the addr ignored, and that
> turns always into an rcu unlock.
>
> > them with the original ones. WDYT?
>
> Agreed.  Maybe not "the original ones" if we're looking for words to put
> into the commit message: it could be "we should kunmap() whatever we
> kmap()ed before", or something better.

Sounds good to me. I'll give others one day to provide their input and
will post a fix.
Thanks!

Liam R. Howlett Feb. 26, 2025, 2:59 p.m. UTC | #7
* Suren Baghdasaryan <surenb@google.com> [250225 15:46]:
> [...]
> @@ -1255,12 +1256,26 @@ static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd,
>  				goto out;
>  			}
>  
> +			locked = folio_trylock(folio);
> +			/*
> +			 * We avoid waiting for folio lock with a raised refcount
> +			 * for large folios because extra refcounts will result in
> +			 * split_folio() failing later and retrying. If multiple
> +			 * tasks are trying to move a large folio we can end
> +			 * livelocking.
> +			 */
> +			if (!locked && folio_test_large(folio)) {
> +				spin_unlock(src_ptl);
> +				err = -EAGAIN;
> +				goto out;
> +			}
> +

Reversing the locking/folio_get() is okay because of the src_ptl spin
lock, right?  It might be worth saying something about it in the
comment?

>  			folio_get(folio);
>  			src_folio = folio;
>  			src_folio_pte = orig_src_pte;
>  			spin_unlock(src_ptl);
>  
> -			if (!folio_trylock(src_folio)) {
> +			if (!locked) {
>  				pte_unmap(&orig_src_pte);
>  				pte_unmap(&orig_dst_pte);
>  				src_pte = dst_pte = NULL;
> 
> base-commit: 801d47bd96ce22acd43809bc09e004679f707c39
> -- 
> 2.48.1.658.g4767266eb4-goog
>
Suren Baghdasaryan Feb. 26, 2025, 4:11 p.m. UTC | #8
On Wed, Feb 26, 2025 at 6:59 AM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
>
> * Suren Baghdasaryan <surenb@google.com> [250225 15:46]:
> > Lokesh recently raised an issue about UFFDIO_MOVE getting into a deadlock
> > state when it goes into split_folio() with raised folio refcount.
> > split_folio() expects the reference count to be exactly
> > mapcount + num_pages_in_folio + 1 (see can_split_folio()) and fails with
> > EAGAIN otherwise. If multiple processes are trying to move the same
> > large folio, they raise the refcount (all tasks succeed in that) then
> > one of them succeeds in locking the folio, while others will block in
> > folio_lock() while keeping the refcount raised. The winner of this
> > race will proceed with calling split_folio() and will fail returning
> > EAGAIN to the caller and unlocking the folio. The next competing process
> > will get the folio locked and will go through the same flow. In the
> > meantime the original winner will be retried and will block in
> > folio_lock(), getting into the queue of waiting processes only to repeat
> > the same path. All this results in a livelock.
> > An easy fix would be to avoid waiting for the folio lock while holding
> > folio refcount, similar to madvise_free_huge_pmd() where folio lock is
> > acquired before raising the folio refcount.
> > Modify move_pages_pte() to try locking the folio first and if that fails
> > and the folio is large then return EAGAIN without touching the folio
> > refcount. If the folio is single-page then split_folio() is not called,
> > so we don't have this issue.
> > Lokesh has a reproducer [1] and I verified that this change fixes the
> > issue.
> >
> > [1] https://github.com/lokeshgidra/uffd_move_ioctl_deadlock
> >
> > Reported-by: Lokesh Gidra <lokeshgidra@google.com>
> > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> > ---
> >  mm/userfaultfd.c | 17 ++++++++++++++++-
> >  1 file changed, 16 insertions(+), 1 deletion(-)
> >
> > diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> > index 867898c4e30b..f17f8290c523 100644
> > --- a/mm/userfaultfd.c
> > +++ b/mm/userfaultfd.c
> > @@ -1236,6 +1236,7 @@ static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd,
> >                */
> >               if (!src_folio) {
> >                       struct folio *folio;
> > +                     bool locked;
> >
> >                       /*
> >                        * Pin the page while holding the lock to be sure the
> > @@ -1255,12 +1256,26 @@ static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd,
> >                               goto out;
> >                       }
> >
> > +                     locked = folio_trylock(folio);
> > +                     /*
> > +                      * We avoid waiting for folio lock with a raised refcount
> > +                      * for large folios because extra refcounts will result in
> > +                      * split_folio() failing later and retrying. If multiple
> > +                      * tasks are trying to move a large folio we can end
> > +                      * livelocking.
> > +                      */
> > +                     if (!locked && folio_test_large(folio)) {
> > +                             spin_unlock(src_ptl);
> > +                             err = -EAGAIN;
> > +                             goto out;
> > +                     }
> > +
>
> Reversing the locking/folio_get() is okay because of the src_ptl spin
> lock, right?  It might be worth saying something about it in the
> comment?

That is correct. We take both folio lock and refcount before we drop
PTL. I'll add a comment. Thanks!

>
> >                       folio_get(folio);
> >                       src_folio = folio;
> >                       src_folio_pte = orig_src_pte;
> >                       spin_unlock(src_ptl);
> >
> > -                     if (!folio_trylock(src_folio)) {
> > +                     if (!locked) {
> >                               pte_unmap(&orig_src_pte);
> >                               pte_unmap(&orig_dst_pte);
> >                               src_pte = dst_pte = NULL;
> >
> > base-commit: 801d47bd96ce22acd43809bc09e004679f707c39
> > --
> > 2.48.1.658.g4767266eb4-goog
> >
Matthew Wilcox Feb. 26, 2025, 4:16 p.m. UTC | #9
On Wed, Feb 26, 2025 at 08:11:25AM -0800, Suren Baghdasaryan wrote:
> On Wed, Feb 26, 2025 at 6:59 AM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
> > Reversing the locking/folio_get() is okay because of the src_ptl spin
> > lock, right?  It might be worth saying something about it in the
> > comment?
> 
> That is correct. We take both folio lock and refcount before we drop
> PTL. I'll add a comment. Thanks!

In the commit message, not in the code, please.
Suren Baghdasaryan Feb. 26, 2025, 4:22 p.m. UTC | #10
On Wed, Feb 26, 2025 at 8:16 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Wed, Feb 26, 2025 at 08:11:25AM -0800, Suren Baghdasaryan wrote:
> > On Wed, Feb 26, 2025 at 6:59 AM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
> > > Reversing the locking/folio_get() is okay because of the src_ptl spin
> > > lock, right?  It might be worth saying something about it in the
> > > comment?
> >
> > That is correct. We take both folio lock and refcount before we drop
> > PTL. I'll add a comment. Thanks!
>
> In the commit message, not in the code, please.

Ack.
Suren Baghdasaryan Feb. 26, 2025, 6:59 p.m. UTC | #11
On Wed, Feb 26, 2025 at 8:22 AM Suren Baghdasaryan <surenb@google.com> wrote:
>
> On Wed, Feb 26, 2025 at 8:16 AM Matthew Wilcox <willy@infradead.org> wrote:
> >
> > On Wed, Feb 26, 2025 at 08:11:25AM -0800, Suren Baghdasaryan wrote:
> > > On Wed, Feb 26, 2025 at 6:59 AM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
> > > > Reversing the locking/folio_get() is okay because of the src_ptl spin
> > > > lock, right?  It might be worth saying something about it in the
> > > > comment?
> > >
> > > That is correct. We take both folio lock and refcount before we drop
> > > PTL. I'll add a comment. Thanks!
> >
> > In the commit message, not in the code, please.
>
> Ack.

I posted v2 at https://lore.kernel.org/all/20250226185510.2732648-2-surenb@google.com/
as part of a patchset which includes the fix for the PTE unmapping issue
that Peter reported. The patchset is rebased over mm-hotfixes-unstable,
which includes Barry's fix to the nearby code; this avoids merge conflicts.

Patch

diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 867898c4e30b..f17f8290c523 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -1236,6 +1236,7 @@  static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd,
 		 */
 		if (!src_folio) {
 			struct folio *folio;
+			bool locked;
 
 			/*
 			 * Pin the page while holding the lock to be sure the
@@ -1255,12 +1256,26 @@  static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd,
 				goto out;
 			}
 
+			locked = folio_trylock(folio);
+			/*
+			 * We avoid waiting for folio lock with a raised refcount
+			 * for large folios because extra refcounts will result in
+			 * split_folio() failing later and retrying. If multiple
+			 * tasks are trying to move a large folio we can end
+			 * livelocking.
+			 */
+			if (!locked && folio_test_large(folio)) {
+				spin_unlock(src_ptl);
+				err = -EAGAIN;
+				goto out;
+			}
+
 			folio_get(folio);
 			src_folio = folio;
 			src_folio_pte = orig_src_pte;
 			spin_unlock(src_ptl);
 
-			if (!folio_trylock(src_folio)) {
+			if (!locked) {
 				pte_unmap(&orig_src_pte);
 				pte_unmap(&orig_dst_pte);
 				src_pte = dst_pte = NULL;