
not issuing vma_start_write() in dup_mmap() if the caller is single-threaded

Message ID: CAGudoHFdPd5a7fiAutTFAxQz5f1fP2n9rwORm7Hj9fPzuVhiKw@mail.gmail.com (mailing list archive)
State: New
Series: not issuing vma_start_write() in dup_mmap() if the caller is single-threaded

Commit Message

Mateusz Guzik March 27, 2025, 5:46 a.m. UTC
Hi Suren,

I brought this up a long time ago; you agreed the idea works and stated
you would sort it out, but it apparently fell by the wayside (no hard
feelings though :>)

Here is the original:
https://lore.kernel.org/all/20230804214620.btgwhsszsd7rh6nf@f/

The thread is weirdly long and I recommend not opening it without a
good reason; I link it only for reference.

At the time there were woes concerning the stability of the new locking
scheme, resulting in the CC list being rather excessive. As such I'm
starting a new thread with a modest list, though I'm not sure who to add.

So dup_mmap() holds the caller's mmap_sem for writing and calls
vma_start_write() to protect against fault handling in other threads
using the same mm.

If this is the only thread with this ->mm in place, there are no
sibling threads to worry about, and this can be checked with
mm_users == 1.

AFAICS all remote accesses require the mmap_sem to also be held, which
provides exclusion against dup_mmap, meaning they don't pose a problem
either.
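
(For illustration, the remote path in condensed form -- shape only, not
the verbatim kernel code; access_remote_vm() and friends boil down to:

	if (mmap_read_lock_killable(mm))	/* waits while dup_mmap() */
		return 0;			/* holds the write lock   */
	/* ... look up the vma and touch its pages ... */
	mmap_read_unlock(mm);

so a remote accessor cannot make progress until dup_mmap() drops the
lock.)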

The patch to merely skip the locking is a few-liner and I would
officially submit it myself, but for my taste an assert is needed in
fault handling to runtime-test the above invariant. As it happens, I
really can't be bothered to figure out how to sort that out and was
hoping you would step in. ;) Alternatively, if you guys don't think the
assert is warranted, that's your business.

As for whether this can save any locking -- yes:

I added a probe (below for reference) with two args: whether we are
single-threaded, and whether vma_start_write() took the down/up cycle.
Then I ran make -j 20 in the kernel dir.

The lock was taken for every single vma (377913 in total), while all
forking processes were single-threaded. Or to put it differently, all
of these were skippable.

the probe (total hack):
bpftrace -e 'kprobe:dup_probe { @[arg0, arg1] = count(); }'
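
For the run described above every fork was single-threaded (arg0 == 1)
and every call took the down/up cycle (arg1 == 1), so the map collapses
to a single bucket:

@[1, 1]: 377913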

probe diff:
 static __latent_entropy int dup_mmap(struct mm_struct *mm,
                                        struct mm_struct *oldmm)
@@ -638,9 +640,11 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
        unsigned long charge = 0;
        LIST_HEAD(uf);
        VMA_ITERATOR(vmi, mm, 0);
+       bool only_user;

        if (mmap_write_lock_killable(oldmm))
                return -EINTR;
+       only_user = atomic_read(&oldmm->mm_users) == 1;
        flush_cache_dup_mm(oldmm);
        uprobe_dup_mmap(oldmm, mm);
        /*
@@ -664,8 +668,11 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
        mt_clear_in_rcu(vmi.mas.tree);
        for_each_vma(vmi, mpnt) {
                struct file *file;
+               bool locked;
+
+               locked = vma_start_write(mpnt);
+               dup_probe(only_user ? 1 : 0, locked ? 1 : 0);

-               vma_start_write(mpnt);
                if (mpnt->vm_flags & VM_DONTCOPY) {
                        retval = vma_iter_clear_gfp(&vmi, mpnt->vm_start,
                                                    mpnt->vm_end, GFP_KERNEL);

Comments

Suren Baghdasaryan March 29, 2025, 12:56 a.m. UTC | #1
Hi Mateusz,

On Wed, Mar 26, 2025 at 10:46 PM Mateusz Guzik <mjguzik@gmail.com> wrote:
>
> Hi Suren,
>
> I brought this up long time ago, you agreed the idea works and stated
> you will sort it out, but it apparently fell to the wayside (no hard
> feelings though :>)

Sorry about that. Yeah, that was a hectic time and I completely forgot
to look into the single-user case.

>
> Here is the original:
> https://lore.kernel.org/all/20230804214620.btgwhsszsd7rh6nf@f/
>
> the thread is weirdly long and I recommend not opening it without a
> good reason, I link it for reference if needed.

I had to re-read it to remember what it was all about :) To bring
others up-to-speed, the suggested change would look something like
this:

@@ -638,9 +640,11 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
        unsigned long charge = 0;
        LIST_HEAD(uf);
        VMA_ITERATOR(vmi, mm, 0);
+       bool only_user;

        if (mmap_write_lock_killable(oldmm))
                return -EINTR;
+       only_user = atomic_read(&oldmm->mm_users) == 1;
        flush_cache_dup_mm(oldmm);
        uprobe_dup_mmap(oldmm, mm);
        /*
@@ -664,8 +668,11 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
        mt_clear_in_rcu(vmi.mas.tree);
        for_each_vma(vmi, mpnt) {
                struct file *file;

-               vma_start_write(mpnt);
+               if (!only_user)
+                       vma_start_write(mpnt);
                if (mpnt->vm_flags & VM_DONTCOPY) {
                        retval = vma_iter_clear_gfp(&vmi, mpnt->vm_start,
                                                    mpnt->vm_end, GFP_KERNEL);

>
> At the time there were woes concerning stability of the new locking
> scheme, resulting in CC list being rather excessive. As such I'm
> starting a new thread with a modest list, not sure who to add though.
>
> So dup_mmap() holds the caller mmap_sem for writing and calls
> vma_start_write() to protect against fault handling in another threads
> using the same mm.
>
> If this is the only thread with this ->mm in place, there are no
> sibling threads to worry about and this can be checked with mm_users
> == 1.
>
> AFAICS all remote accesses require the mmap_sem to also be held, which
> provides exclusion against dup_mmap, meaning they don't pose a problem
> either.

Yes, I believe the optimization you proposed would be safe.

>
> The patch to merely skip locking is a few liner and I would officially
> submit it myself, but for my taste an assert is needed in fault
> handling to runtime test the above invariant.

Hmm. Adding an assert in the pagefault path that checks whether the
fault is triggered during dup_mmap() && mm_users==1 would not be
trivial. We would need to indicate that we expect no page faults while
in that section of dup_mmap() (maybe via a flag inside mm_struct) and
then assert that this flag is not set inside the pagefault handling
path. I'm not sure it's worth the complexity... As was discussed in
that thread, the only other task that might fault a page would be
external and therefore would have to go through
access_remote_vm()->mmap_read_lock_killable(), and since dup_mmap()
already holds mmap_write_lock, the new user would have to wait.
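
For illustration only, a minimal sketch of that flag idea (the
in_dup_mmap field and its placement are hypothetical, just to show the
moving parts):

	/* hypothetical field in mm_struct */
	bool in_dup_mmap;	/* set while faults are not expected */

	/* in dup_mmap(), after mmap_write_lock_killable(oldmm): */
	WRITE_ONCE(oldmm->in_dup_mmap, true);
	/* ... copy vmas ... */
	WRITE_ONCE(oldmm->in_dup_mmap, false);

	/* in the pagefault path: */
	VM_WARN_ON_ONCE(READ_ONCE(mm->in_dup_mmap));

That is extra state to set, clear, and reason about, all for a
debug-only assertion.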

> So happens I really
> can't be bothered to figure out how to sort it out and was hoping you
> would step in. ;) Alternatively if you guys don't think the assert is
> warranted, that's your business.

I don't think it's worth it but I'll CC Matthew and Lorenzo (you
already CC'ed Liam) to get their opinion.

>
> As for whether this can save any locking -- yes:

Yeah, I'm sure it will make a difference in performance. While forking
we are currently locking each VMA separately, so skipping that would
be nice.

Folks, WDYT? Do we need a separate assertion that a pagefault can't
happen if mm_users==1 and we are holding mmap_write_lock
(access_remote_vm() will block)?
Thanks,
Suren.

> [remainder of quoted message snipped]
Mateusz Guzik March 29, 2025, 1:15 a.m. UTC | #2
On Sat, Mar 29, 2025 at 1:57 AM Suren Baghdasaryan <surenb@google.com> wrote:
> On Wed, Mar 26, 2025 at 10:46 PM Mateusz Guzik <mjguzik@gmail.com> wrote:
> [earlier discussion snipped]
> I don't think it's worth it but I'll CC Matthew and Lorenzo (you
> already CC'ed Liam) to get their opinion.
>
> >
> > As for whether this can save any locking -- yes:
>
> Yeah, I'm sure it will make a difference in performance. While forking
> we are currently locking each VMA separately, so skipping that would
> be nice.
> Folks, WDYT? Do we need a separate assertion that a pagefault can't
> happen if mm_users==1 and we are holding mmap_write_lock
> (access_remote_vm() will block)?

Pseudocode-wise I was thinking along the lines of the following in the fault path:

if (current->mm != vma->vm_mm)
       mmap_assert_locked(vma->vm_mm);

with the assumption that mmap_assert_locked expands to nothing without debug

>
> [remainder of quoted message snipped]
Suren Baghdasaryan March 29, 2025, 1:35 a.m. UTC | #3
On Fri, Mar 28, 2025 at 6:16 PM Mateusz Guzik <mjguzik@gmail.com> wrote:
>
> [earlier discussion snipped]
>
> Pseudocode-wise I was thinking along the lines of the following in the fault path:
>
> if (current->mm != vma->vm_mm)
>        mmap_assert_locked(vma->vm_mm);
>
> with the assumption that mmap_assert_locked expands to nothing without debug

I see. IOW, if someone external is faulting a page then it has to be
holding at least mmap_read_lock. So, it's a more general check but
seems reasonable. I think adding it at the end of lock_vma_under_rcu()
under the "#ifdef CONFIG_DEBUG_VM" condition would be enough.
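
Something along these lines at the tail of lock_vma_under_rcu(), right
before returning the locked vma (a sketch of the suggested placement,
not a final patch):

#ifdef CONFIG_DEBUG_VM
	/* anyone faulting on an mm not their own must hold mmap_read_lock */
	if (current->mm != vma->vm_mm)
		mmap_assert_locked(vma->vm_mm);
#endif
	return vma;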

>
> >
> > >
> > > I added a probe (below for reference) with two args: whether we are
> > > single-threaded and vma_start_write() returning whether it took the
> > > down/up cycle and ran make -j 20 in the kernel dir.
> > >
> > > The lock was taken for every single vma (377913 in total), while all
> > > forking processes were single-threaded. Or to put it differently all
> > > of these were skippable.
> > >
> > > the probe (total hack):
> > > bpftrace -e 'kprobe:dup_probe { @[arg0, arg1] = count(); }'
> > >
> > > probe diff:
> > > diff --git a/fs/namei.c b/fs/namei.c
> > > index ecb7b95c2ca3..d6cde76eda81 100644
> > > --- a/fs/namei.c
> > > +++ b/fs/namei.c
> > > @@ -5459,3 +5459,7 @@ const struct inode_operations
> > > page_symlink_inode_operations = {
> > >         .get_link       = page_get_link,
> > >  };
> > >  EXPORT_SYMBOL(page_symlink_inode_operations);
> > > +
> > > +void dup_probe(int, int);
> > > +void dup_probe(int, int) { }
> > > +EXPORT_SYMBOL(dup_probe);
> > > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > > index 1f80baddacc5..f7b1f0a02f2e 100644
> > > --- a/include/linux/mm.h
> > > +++ b/include/linux/mm.h
> > > @@ -760,12 +760,12 @@ static bool __is_vma_write_locked(struct
> > > vm_area_struct *vma, unsigned int *mm_l
> > >   * Exclude concurrent readers under the per-VMA lock until the currently
> > >   * write-locked mmap_lock is dropped or downgraded.
> > >   */
> > > -static inline void vma_start_write(struct vm_area_struct *vma)
> > > +static inline bool vma_start_write(struct vm_area_struct *vma)
> > >  {
> > >         unsigned int mm_lock_seq;
> > >
> > >         if (__is_vma_write_locked(vma, &mm_lock_seq))
> > > -               return;
> > > +               return false;
> > >
> > >         down_write(&vma->vm_lock->lock);
> > >         /*
> > > @@ -776,6 +776,7 @@ static inline void vma_start_write(struct
> > > vm_area_struct *vma)
> > >          */
> > >         WRITE_ONCE(vma->vm_lock_seq, mm_lock_seq);
> > >         up_write(&vma->vm_lock->lock);
> > > +       return true;
> > >  }
> > >
> > >  static inline void vma_assert_write_locked(struct vm_area_struct *vma)
> > > diff --git a/kernel/fork.c b/kernel/fork.c
> > > index 735405a9c5f3..0cc56255a339 100644
> > > --- a/kernel/fork.c
> > > +++ b/kernel/fork.c
> > > @@ -629,6 +629,8 @@ static void dup_mm_exe_file(struct mm_struct *mm,
> > > struct mm_struct *oldmm)
> > >                 pr_warn_once("exe_file_deny_write_access() failed in
> > > %s\n", __func__);
> > >  }
> > >
> > > +void dup_probe(int, int);
> > > +
> > >  #ifdef CONFIG_MMU
> > >  static __latent_entropy int dup_mmap(struct mm_struct *mm,
> > >                                         struct mm_struct *oldmm)
> > > @@ -638,9 +640,11 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
> > >         unsigned long charge = 0;
> > >         LIST_HEAD(uf);
> > >         VMA_ITERATOR(vmi, mm, 0);
> > > +       bool only_user;
> > >
> > >         if (mmap_write_lock_killable(oldmm))
> > >                 return -EINTR;
> > > +       only_user = atomic_read(&oldmm->mm_users) == 1;
> > >         flush_cache_dup_mm(oldmm);
> > >         uprobe_dup_mmap(oldmm, mm);
> > >         /*
> > > @@ -664,8 +668,11 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
> > >         mt_clear_in_rcu(vmi.mas.tree);
> > >         for_each_vma(vmi, mpnt) {
> > >                 struct file *file;
> > > +               bool locked;
> > > +
> > > +               locked = vma_start_write(mpnt);
> > > +               dup_probe(only_user ? 1 :0, locked ? 1 : 0);
> > >
> > > -               vma_start_write(mpnt);
> > >                 if (mpnt->vm_flags & VM_DONTCOPY) {
> > >                         retval = vma_iter_clear_gfp(&vmi, mpnt->vm_start,
> > >                                                     mpnt->vm_end, GFP_KERNEL);
> > >
> > > --
> > > Mateusz Guzik <mjguzik gmail.com>
>
>
>
> --
> Mateusz Guzik <mjguzik gmail.com>
Mateusz Guzik March 29, 2025, 1:51 a.m. UTC | #4
top post warning

It just hit me that userfaultfd() may be a problem. Creating it only
bumps mm_count, while mm_users gets transiently modified during
various ops.

So in particular you may end up handing off the userfaultfd instance
to another process while being single-threaded; then mm_users == 1 and
userfaultfd may be trying to do something. I have no idea if this one
is guaranteed to take the lock.

However, the good news is that mm_count tends to be 1. If both
mm_count and mm_users are 1, then there is no userfaultfd in use and
nobody to add it either.
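
A sketch of the strengthened test in dup_mmap() (same names as the
diffs above; just illustrating the claim, not a tested patch):

	if (mmap_write_lock_killable(oldmm))
		return -EINTR;
	/* no other user, and no stashed reference that could become one */
	only_user = atomic_read(&oldmm->mm_users) == 1 &&
		    atomic_read(&oldmm->mm_count) == 1;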

State of mm_count verified with: bpftrace -e 'kprobe:copy_process {
@[curtask->mm->mm_count.counter] = count(); }'

On Sat, Mar 29, 2025 at 2:35 AM Suren Baghdasaryan <surenb@google.com> wrote:
> [quoted thread snipped]
Suren Baghdasaryan March 30, 2025, 7:23 p.m. UTC | #5
On Fri, Mar 28, 2025 at 6:51 PM Mateusz Guzik <mjguzik@gmail.com> wrote:
>
> top post warning
>
> It just hit me that userfaultfd() may be a problem. Creating it only
> bumps mm_count, while mm_users gets transiently modified during
> various ops.
>
> > So in particular you may end up handing off the userfaultfd instance
> > to another process while being single-threaded; then mm_users == 1 and
> userfaultfd may be trying to do something. I have no idea if this one
> is guaranteed to take the lock.

Hmm, yeah. uffd is nasty. Even in the cases when it does
mmget_not_zero(), there is still the possibility of a race. For
example:

Task 1                                       Task 2
dup_mmap()
    only_user = atomic_read(&oldmm->mm_users) == 1;

                                             userfaultfd_move()
                                                 mmget_not_zero() <-- inc mm_users

    if (!only_user)
        vma_start_write(mpnt); <-- gets skipped

                                                    move_pages()
                                                         uffd_move_lock()
                                                             uffd_lock_vma()
                                                                 lock_vma_under_rcu() <-- succeeds
                                                         move_pages_pte()
    copy_page_range()

So, userfaultfd_move() will happily move pages while dup_mmap() is
doing copy_page_range(). The copied range might look quite
interesting...

>
> However, the good news is that mm_count tends to be 1. If both
> mm_count and mm_users are 1, then there is no userfaultfd in use and
> nobody to add it either.

I'm not sure... IIUC new_userfaultfd() does not take mmap_lock while
calling mmgrab(), therefore I think it can race with the code checking
its value.

>
> [remainder of quoted thread snipped]
Suren Baghdasaryan March 30, 2025, 7:25 p.m. UTC | #6
On Sun, Mar 30, 2025 at 12:23 PM Suren Baghdasaryan <surenb@google.com> wrote:
>
> [quoted message snipped]

Sorry about formatting. This might look a bit better:

Task 1                           Task 2
dup_mmap
     only_user = atomic_read(&oldmm->mm_users) == 1;

                                      userfaultfd_move()
                                          mmget_not_zero()

     if (!only_user)
         vma_start_write(mpnt); <-- gets skipped

                                             move_pages()
                                                  uffd_move_lock()
                                                      uffd_lock_vma()
                                                          lock_vma_under_rcu()
                                             move_pages_pte()
     copy_page_range()

> [remainder of quoted thread snipped]
Mateusz Guzik March 30, 2025, 7:43 p.m. UTC | #7
On Sun, Mar 30, 2025 at 9:23 PM Suren Baghdasaryan <surenb@google.com> wrote:
> > However, the good news is that mm_count tends to be 1. If both
> > mm_count and mm_users are 1, then there is no userfaultfd in use and
> > nobody to add it either.
>
> I'm not sure... IIUC new_userfaultfd() does not take mmap_lock while
> calling mmgrab(), therefore I think it can race with the code checking
> its value.
>

It issues:
ctx->mm = current->mm;
...
mmgrab(ctx->mm);

Thus I claim that if mm_count is 1 *and* mm_users is 1 *and* we are in
dup_mmap(), nobody has a userfaultfd for our mm, there is nobody to
create one either, and the optimization is saved.

Patch

diff --git a/fs/namei.c b/fs/namei.c
index ecb7b95c2ca3..d6cde76eda81 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -5459,3 +5459,7 @@ const struct inode_operations page_symlink_inode_operations = {
        .get_link       = page_get_link,
 };
 EXPORT_SYMBOL(page_symlink_inode_operations);
+
+void dup_probe(int, int);
+void dup_probe(int, int) { }
+EXPORT_SYMBOL(dup_probe);
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 1f80baddacc5..f7b1f0a02f2e 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -760,12 +760,12 @@ static bool __is_vma_write_locked(struct vm_area_struct *vma, unsigned int *mm_lock_seq)
  * Exclude concurrent readers under the per-VMA lock until the currently
  * write-locked mmap_lock is dropped or downgraded.
  */
-static inline void vma_start_write(struct vm_area_struct *vma)
+static inline bool vma_start_write(struct vm_area_struct *vma)
 {
        unsigned int mm_lock_seq;

        if (__is_vma_write_locked(vma, &mm_lock_seq))
-               return;
+               return false;

        down_write(&vma->vm_lock->lock);
        /*
@@ -776,6 +776,7 @@ static inline void vma_start_write(struct vm_area_struct *vma)
         */
        WRITE_ONCE(vma->vm_lock_seq, mm_lock_seq);
        up_write(&vma->vm_lock->lock);
+       return true;
 }

 static inline void vma_assert_write_locked(struct vm_area_struct *vma)
diff --git a/kernel/fork.c b/kernel/fork.c
index 735405a9c5f3..0cc56255a339 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -629,6 +629,8 @@ static void dup_mm_exe_file(struct mm_struct *mm, struct mm_struct *oldmm)
                pr_warn_once("exe_file_deny_write_access() failed in %s\n", __func__);
 }

+void dup_probe(int, int);
+
 #ifdef CONFIG_MMU
 static __latent_entropy int dup_mmap(struct mm_struct *mm,
                                        struct mm_struct *oldmm)
@@ -638,9 +640,11 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
        unsigned long charge = 0;
        LIST_HEAD(uf);
        VMA_ITERATOR(vmi, mm, 0);
+       bool only_user;

        if (mmap_write_lock_killable(oldmm))
                return -EINTR;
+       only_user = atomic_read(&oldmm->mm_users) == 1;
        flush_cache_dup_mm(oldmm);
        uprobe_dup_mmap(oldmm, mm);
        /*
@@ -664,8 +668,11 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
        mt_clear_in_rcu(vmi.mas.tree);
        for_each_vma(vmi, mpnt) {
                struct file *file;
+               bool locked;
+
+               locked = vma_start_write(mpnt);
+               dup_probe(only_user ? 1 : 0, locked ? 1 : 0);

-               vma_start_write(mpnt);
                if (mpnt->vm_flags & VM_DONTCOPY) {
                        retval = vma_iter_clear_gfp(&vmi, mpnt->vm_start,
                                                    mpnt->vm_end, GFP_KERNEL);