
not issuing vma_start_write() in dup_mmap() if the caller is single-threaded

Message ID CAGudoHFdPd5a7fiAutTFAxQz5f1fP2n9rwORm7Hj9fPzuVhiKw@mail.gmail.com (mailing list archive)
State New

Commit Message

Mateusz Guzik March 27, 2025, 5:46 a.m. UTC
Hi Suren,

I brought this up a long time ago; you agreed the idea works and said
you would sort it out, but it apparently fell by the wayside (no hard
feelings though :>)

Here is the original:
https://lore.kernel.org/all/20230804214620.btgwhsszsd7rh6nf@f/

The thread is weirdly long and I recommend not opening it without a
good reason; I only link it in case it is needed for reference.

At the time there were woes concerning the stability of the new locking
scheme, resulting in the CC list being rather excessive. As such I'm
starting a new thread with a modest list, though I'm not sure who to add.

So dup_mmap() holds the caller's mmap_sem for writing and calls
vma_start_write() to protect against fault handling in other threads
using the same mm.

If this is the only thread with this ->mm in place, there are no
sibling threads to worry about and this can be checked with mm_users
== 1.

AFAICS all remote accesses require the mmap_sem to also be held, which
provides exclusion against dup_mmap, meaning they don't pose a problem
either.

The patch to merely skip the locking is a few-liner and I would
officially submit it myself, but for my taste an assert is needed in
fault handling to test the above invariant at runtime. As it happens I
really can't be bothered to figure out how to sort it out and was
hoping you would step in. ;) Alternatively, if you guys don't think the
assert is warranted, that's your business.

As for whether this can save any locking -- yes:

I added a probe (below for reference) with two args: whether we are
single-threaded and whether vma_start_write() took the down/up cycle,
then ran make -j 20 in the kernel dir.

The lock was taken for every single vma (377913 in total), while all
forking processes were single-threaded. Or, to put it differently, all
of these were skippable.

the probe (total hack):
bpftrace -e 'kprobe:dup_probe { @[arg0, arg1] = count(); }'

probe diff:
 static __latent_entropy int dup_mmap(struct mm_struct *mm,
                                        struct mm_struct *oldmm)
@@ -638,9 +640,11 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
        unsigned long charge = 0;
        LIST_HEAD(uf);
        VMA_ITERATOR(vmi, mm, 0);
+       bool only_user;

        if (mmap_write_lock_killable(oldmm))
                return -EINTR;
+       only_user = atomic_read(&oldmm->mm_users) == 1;
        flush_cache_dup_mm(oldmm);
        uprobe_dup_mmap(oldmm, mm);
        /*
@@ -664,8 +668,11 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
        mt_clear_in_rcu(vmi.mas.tree);
        for_each_vma(vmi, mpnt) {
                struct file *file;
+               bool locked;
+
+               locked = vma_start_write(mpnt);
+               dup_probe(only_user ? 1 :0, locked ? 1 : 0);

-               vma_start_write(mpnt);
                if (mpnt->vm_flags & VM_DONTCOPY) {
                        retval = vma_iter_clear_gfp(&vmi, mpnt->vm_start,
                                                    mpnt->vm_end, GFP_KERNEL);

Comments

Suren Baghdasaryan March 29, 2025, 12:56 a.m. UTC | #1
Hi Mateusz,

On Wed, Mar 26, 2025 at 10:46 PM Mateusz Guzik <mjguzik@gmail.com> wrote:
>
> Hi Suren,
>
> I brought this up long time ago, you agreed the idea works and stated
> you will sort it out, but it apparently fell to the wayside (no hard
> feelings though :>)

Sorry about that. Yeah, that was a hectic time and I completely forgot
to look into the single-user case.

>
> Here is the original:
> https://lore.kernel.org/all/20230804214620.btgwhsszsd7rh6nf@f/
>
> the thread is weirdly long and I recommend not opening it without a
> good reason, I link it for reference if needed.

I had to re-read it to remember what it was all about :) To bring
others up-to-speed, the suggested change would look something like
this:

@@ -638,9 +640,11 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
        unsigned long charge = 0;
        LIST_HEAD(uf);
        VMA_ITERATOR(vmi, mm, 0);
+       bool only_user;

        if (mmap_write_lock_killable(oldmm))
                return -EINTR;
+       only_user = atomic_read(&oldmm->mm_users) == 1;
        flush_cache_dup_mm(oldmm);
        uprobe_dup_mmap(oldmm, mm);
        /*
@@ -664,8 +668,11 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
        mt_clear_in_rcu(vmi.mas.tree);
        for_each_vma(vmi, mpnt) {
                struct file *file;

-               vma_start_write(mpnt);
+               if (!only_user)
+                       vma_start_write(mpnt);
                if (mpnt->vm_flags & VM_DONTCOPY) {
                        retval = vma_iter_clear_gfp(&vmi, mpnt->vm_start,
                                                    mpnt->vm_end, GFP_KERNEL);

>
> At the time there were woes concerning stability of the new locking
> scheme, resulting in CC list being rather excessive. As such I'm
> starting a new thread with a modest list, not sure who to add though.
>
> So dup_mmap() holds the caller mmap_sem for writing and calls
> vma_start_write() to protect against fault handling in another threads
> using the same mm.
>
> If this is the only thread with this ->mm in place, there are no
> sibling threads to worry about and this can be checked with mm_users
> == 1.
>
> AFAICS all remote accesses require the mmap_sem to also be held, which
> provides exclusion against dup_mmap, meaning they don't pose a problem
> either.

Yes, I believe the optimization you proposed would be safe.

>
> The patch to merely skip locking is a few liner and I would officially
> submit it myself, but for my taste an assert is needed in fault
> handling to runtime test the above invariant.

Hmm. Adding an assert in the pagefault path that checks whether the
fault is triggered during dup_mmap() && mm_users==1 would not be
trivial. We would need to indicate that we expect no page faults while
in that section of dup_mmap() (maybe a flag inside mm_struct) and then
assert that this flag is not set inside the pagefault handling path.
I'm not sure it's worth the complexity... As was discussed in that
thread, the only other task that might fault a page would be external
and would therefore have to go through
access_remote_vm()->mmap_read_lock_killable(), and since dup_mmap()
already holds the mmap_write_lock, the new user would have to wait.
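
(To spell out the flag idea, purely for illustration -- the names below
are made up and this is not a real patch:)

    /* hypothetical field in mm_struct, only written under mmap_write_lock */
    bool in_dup_mmap;

    /* in dup_mmap(), around the VMA copy loop */
    oldmm->in_dup_mmap = true;
    ... copy VMAs ...
    oldmm->in_dup_mmap = false;

    /* in the page fault handling path */
    VM_WARN_ON_ONCE(vma->vm_mm->in_dup_mmap);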

> So happens I really
> can't be bothered to figure out how to sort it out and was hoping you
> would step in. ;) Alternatively if you guys don't think the assert is
> warranted, that's your business.

I don't think it's worth it but I'll CC Matthew and Lorenzo (you
already CC'ed Liam) to get their opinion.

>
> As for whether this can save any locking -- yes:

Yeah, I'm sure it will make a difference in performance. While forking
we are currently locking each VMA separately, so skipping that would
be nice.
Folks, WDYT? Do we need a separate assertion that a pagefault can't
happen if mm_users==1 and we are holding the mmap_write_lock
(access_remote_vm() will block)?
Thanks,
Suren.

>
> I added a probe (below for reference) with two args: whether we are
> single-threaded and vma_start_write() returning whether it took the
> down/up cycle and ran make -j 20 in the kernel dir.
>
> The lock was taken for every single vma (377913 in total), while all
> forking processes were single-threaded. Or to put it differently all
> of these were skippable.
>
> the probe (total hack):
> bpftrace -e 'kprobe:dup_probe { @[arg0, arg1] = count(); }'
>
> probe diff:
> diff --git a/fs/namei.c b/fs/namei.c
> index ecb7b95c2ca3..d6cde76eda81 100644
> --- a/fs/namei.c
> +++ b/fs/namei.c
> @@ -5459,3 +5459,7 @@ const struct inode_operations
> page_symlink_inode_operations = {
>         .get_link       = page_get_link,
>  };
>  EXPORT_SYMBOL(page_symlink_inode_operations);
> +
> +void dup_probe(int, int);
> +void dup_probe(int, int) { }
> +EXPORT_SYMBOL(dup_probe);
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 1f80baddacc5..f7b1f0a02f2e 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -760,12 +760,12 @@ static bool __is_vma_write_locked(struct
> vm_area_struct *vma, unsigned int *mm_l
>   * Exclude concurrent readers under the per-VMA lock until the currently
>   * write-locked mmap_lock is dropped or downgraded.
>   */
> -static inline void vma_start_write(struct vm_area_struct *vma)
> +static inline bool vma_start_write(struct vm_area_struct *vma)
>  {
>         unsigned int mm_lock_seq;
>
>         if (__is_vma_write_locked(vma, &mm_lock_seq))
> -               return;
> +               return false;
>
>         down_write(&vma->vm_lock->lock);
>         /*
> @@ -776,6 +776,7 @@ static inline void vma_start_write(struct
> vm_area_struct *vma)
>          */
>         WRITE_ONCE(vma->vm_lock_seq, mm_lock_seq);
>         up_write(&vma->vm_lock->lock);
> +       return true;
>  }
>
>  static inline void vma_assert_write_locked(struct vm_area_struct *vma)
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 735405a9c5f3..0cc56255a339 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -629,6 +629,8 @@ static void dup_mm_exe_file(struct mm_struct *mm,
> struct mm_struct *oldmm)
>                 pr_warn_once("exe_file_deny_write_access() failed in
> %s\n", __func__);
>  }
>
> +void dup_probe(int, int);
> +
>  #ifdef CONFIG_MMU
>  static __latent_entropy int dup_mmap(struct mm_struct *mm,
>                                         struct mm_struct *oldmm)
> @@ -638,9 +640,11 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
>         unsigned long charge = 0;
>         LIST_HEAD(uf);
>         VMA_ITERATOR(vmi, mm, 0);
> +       bool only_user;
>
>         if (mmap_write_lock_killable(oldmm))
>                 return -EINTR;
> +       only_user = atomic_read(&oldmm->mm_users) == 1;
>         flush_cache_dup_mm(oldmm);
>         uprobe_dup_mmap(oldmm, mm);
>         /*
> @@ -664,8 +668,11 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
>         mt_clear_in_rcu(vmi.mas.tree);
>         for_each_vma(vmi, mpnt) {
>                 struct file *file;
> +               bool locked;
> +
> +               locked = vma_start_write(mpnt);
> +               dup_probe(only_user ? 1 :0, locked ? 1 : 0);
>
> -               vma_start_write(mpnt);
>                 if (mpnt->vm_flags & VM_DONTCOPY) {
>                         retval = vma_iter_clear_gfp(&vmi, mpnt->vm_start,
>                                                     mpnt->vm_end, GFP_KERNEL);
>
> --
> Mateusz Guzik <mjguzik gmail.com>
Mateusz Guzik March 29, 2025, 1:15 a.m. UTC | #2
On Sat, Mar 29, 2025 at 1:57 AM Suren Baghdasaryan <surenb@google.com> wrote:
> On Wed, Mar 26, 2025 at 10:46 PM Mateusz Guzik <mjguzik@gmail.com> wrote:
> > Here is the original:
> > https://lore.kernel.org/all/20230804214620.btgwhsszsd7rh6nf@f/
> >
> > the thread is weirdly long and I recommend not opening it without a
> > good reason, I link it for reference if needed.
>
> I had to re-read it to remember what it was all about :) To bring
> others up-to-speed, the suggested change would look something like
> this:
>
> @@ -638,9 +640,11 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
>         unsigned long charge = 0;
>         LIST_HEAD(uf);
>         VMA_ITERATOR(vmi, mm, 0);
> +       bool only_user;
>
>         if (mmap_write_lock_killable(oldmm))
>                 return -EINTR;
> +       only_user = atomic_read(&oldmm->mm_users) == 1;
>         flush_cache_dup_mm(oldmm);
>         uprobe_dup_mmap(oldmm, mm);
>         /*
> @@ -664,8 +668,11 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
>         mt_clear_in_rcu(vmi.mas.tree);
>         for_each_vma(vmi, mpnt) {
>                 struct file *file;
>
> -               vma_start_write(mpnt);
> +               if (!only_user)
> +                       vma_start_write(mpnt);
>                 if (mpnt->vm_flags & VM_DONTCOPY) {
>                         retval = vma_iter_clear_gfp(&vmi, mpnt->vm_start,
>                                                     mpnt->vm_end, GFP_KERNEL);
>
> >
> > At the time there were woes concerning stability of the new locking
> > scheme, resulting in CC list being rather excessive. As such I'm
> > starting a new thread with a modest list, not sure who to add though.
> >
> > So dup_mmap() holds the caller mmap_sem for writing and calls
> > vma_start_write() to protect against fault handling in another threads
> > using the same mm.
> >
> > If this is the only thread with this ->mm in place, there are no
> > sibling threads to worry about and this can be checked with mm_users
> > == 1.
> >
> > AFAICS all remote accesses require the mmap_sem to also be held, which
> > provides exclusion against dup_mmap, meaning they don't pose a problem
> > either.
>
> Yes, I believe the optimization you proposed would be safe.
>
> >
> > The patch to merely skip locking is a few liner and I would officially
> > submit it myself, but for my taste an assert is needed in fault
> > handling to runtime test the above invariant.
>
> Hmm. Adding an assert in the pagefault path that checks whether the
> fault is triggered during dup_mmap() && mm_users==1 would not be
> trivial. We would need to indicate that we expect no page faults while
> in that section of dup_mmap() (maybe a flag inside mm_struct) and then
> assert that this flag is not set inside the pagefault handling path.
> I'm not sure it's worth the complexity... As was discussed in that
> thread, the only other task that might fault a page would be external
> and therefore would have to go through
> access_remote_vm()->mmap_read_lock_killable() and since dup_mmap()
> already holds mmap_write_lock the new user would have to wait.
>
> > So happens I really
> > can't be bothered to figure out how to sort it out and was hoping you
> > would step in. ;) Alternatively if you guys don't think the assert is
> > warranted, that's your business.
>
> I don't think it's worth it but I'll CC Matthew and Lorenzo (you
> already CC'ed Liam) to get their opinion.
>
> >
> > As for whether this can save any locking -- yes:
>
> Yeah, I'm sure it will make a difference in performance. While forking
> we are currently locking each VMA separately, so skipping that would
> be nice.
> Folks, WDYT? Do we need a separate assertion that pagefault can't
> happen if mm_users==1 and we are holding mmap_write_lock
> (access_remote_vm() will block)?

Pseudocode-wise I was thinking along the lines of the following in the fault path:

if (current->mm != vma->vm_mm)
       mmap_assert_locked(vma->vm_mm);

with the assumption that mmap_assert_locked() expands to nothing without debug enabled.

> [...]
Suren Baghdasaryan March 29, 2025, 1:35 a.m. UTC | #3
On Fri, Mar 28, 2025 at 6:16 PM Mateusz Guzik <mjguzik@gmail.com> wrote:
>
> On Sat, Mar 29, 2025 at 1:57 AM Suren Baghdasaryan <surenb@google.com> wrote:
> > On Wed, Mar 26, 2025 at 10:46 PM Mateusz Guzik <mjguzik@gmail.com> wrote:
> > > Here is the original:
> > > https://lore.kernel.org/all/20230804214620.btgwhsszsd7rh6nf@f/
> > >
> > > the thread is weirdly long and I recommend not opening it without a
> > > good reason, I link it for reference if needed.
> >
> > I had to re-read it to remember what it was all about :) To bring
> > others up-to-speed, the suggested change would look something like
> > this:
> >
> > @@ -638,9 +640,11 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
> >         unsigned long charge = 0;
> >         LIST_HEAD(uf);
> >         VMA_ITERATOR(vmi, mm, 0);
> > +       bool only_user;
> >
> >         if (mmap_write_lock_killable(oldmm))
> >                 return -EINTR;
> > +       only_user = atomic_read(&oldmm->mm_users) == 1;
> >         flush_cache_dup_mm(oldmm);
> >         uprobe_dup_mmap(oldmm, mm);
> >         /*
> > @@ -664,8 +668,11 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
> >         mt_clear_in_rcu(vmi.mas.tree);
> >         for_each_vma(vmi, mpnt) {
> >                 struct file *file;
> >
> > -               vma_start_write(mpnt);
> > +               if (!only_user)
> > +                       vma_start_write(mpnt);
> >                 if (mpnt->vm_flags & VM_DONTCOPY) {
> >                         retval = vma_iter_clear_gfp(&vmi, mpnt->vm_start,
> >                                                     mpnt->vm_end, GFP_KERNEL);
> >
> > >
> > > At the time there were woes concerning stability of the new locking
> > > scheme, resulting in CC list being rather excessive. As such I'm
> > > starting a new thread with a modest list, not sure who to add though.
> > >
> > > So dup_mmap() holds the caller mmap_sem for writing and calls
> > > vma_start_write() to protect against fault handling in another threads
> > > using the same mm.
> > >
> > > If this is the only thread with this ->mm in place, there are no
> > > sibling threads to worry about and this can be checked with mm_users
> > > == 1.
> > >
> > > AFAICS all remote accesses require the mmap_sem to also be held, which
> > > provides exclusion against dup_mmap, meaning they don't pose a problem
> > > either.
> >
> > Yes, I believe the optimization you proposed would be safe.
> >
> > >
> > > The patch to merely skip locking is a few liner and I would officially
> > > submit it myself, but for my taste an assert is needed in fault
> > > handling to runtime test the above invariant.
> >
> > Hmm. Adding an assert in the pagefault path that checks whether the
> > fault is triggered during dup_mmap() && mm_users==1 would not be
> > trivial. We would need to indicate that we expect no page faults while
> > in that section of dup_mmap() (maybe a flag inside mm_struct) and then
> > assert that this flag is not set inside the pagefault handling path.
> > I'm not sure it's worth the complexity... As was discussed in that
> > thread, the only other task that might fault a page would be external
> > and therefore would have to go through
> > access_remote_vm()->mmap_read_lock_killable() and since dup_mmap()
> > already holds mmap_write_lock the new user would have to wait.
> >
> > > So happens I really
> > > can't be bothered to figure out how to sort it out and was hoping you
> > > would step in. ;) Alternatively if you guys don't think the assert is
> > > warranted, that's your business.
> >
> > I don't think it's worth it but I'll CC Matthew and Lorenzo (you
> > already CC'ed Liam) to get their opinion.
> >
> > >
> > > As for whether this can save any locking -- yes:
> >
> > Yeah, I'm sure it will make a difference in performance. While forking
> > we are currently locking each VMA separately, so skipping that would
> > be nice.
> > Folks, WDYT? Do we need a separate assertion that pagefault can't
> > happen if mm_users==1 and we are holding mmap_write_lock
> > (access_remote_vm() will block)?
>
> pseudocode-wise I was thinking in the lines of the following in the fault path:
>
> if (current->mm != vma->vm_mm)
>        mmap_assert_locked(vma->vm_mm);
>
> with the assumption that mmap_assert_locked expands to nothing without debug

I see. IOW, if someone external is faulting a page then it has to be
holding at least the mmap_read_lock. So it's a more general check, but
it seems reasonable. I think adding it at the end of lock_vma_under_rcu()
under the "#ifdef CONFIG_DEBUG_VM" condition would be enough.

> [...]
Mateusz Guzik March 29, 2025, 1:51 a.m. UTC | #4
top post warning

It just hit me that userfaultfd() may be a problem. Creating it only
bumps mm_count, while mm_users gets transiently modified during
various ops.

So in particular you may end up handing off the userfaultfd instance
to another process while being single-threaded; then mm_users == 1 and
userfaultfd may be trying to do something. I have no idea if that path
is guaranteed to take the lock.

However, the good news is that mm_count tends to be 1. If both
mm_count and mm_users are 1, then there is no userfaultfd in use and
nobody to add it either.

State of mm_count verified with: bpftrace -e 'kprobe:copy_process {
@[curtask->mm->mm_count.counter] = count(); }'
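
(For reference, the combined check in dup_mmap() would presumably look
something like the below, reusing the only_user variable from the
earlier snippet -- purely illustrative:)

    /* illustrative only: both counts at 1 means no other users of this mm
     * and no extra references (e.g. userfaultfd) that could show up later */
    only_user = atomic_read(&oldmm->mm_users) == 1 &&
                atomic_read(&oldmm->mm_count) == 1;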

On Sat, Mar 29, 2025 at 2:35 AM Suren Baghdasaryan <surenb@google.com> wrote:
> [...]
Suren Baghdasaryan March 30, 2025, 7:23 p.m. UTC | #5
On Fri, Mar 28, 2025 at 6:51 PM Mateusz Guzik <mjguzik@gmail.com> wrote:
>
> top post warning
>
> It just hit me that userfaultfd() may be a problem. Creating it only
> bumps mm_count, while mm_users gets transiently modified during
> various ops.
>
> So in particular you may end up pounding off the userfaultfd instance
> to another process while being single-threaded, then mm_users == 1 and
> userfaultfd may be trying to do something. I have no idea if this one
> is guaranteed to take the lock.

Hmm, yeah. uffd is nasty. Even in the cases when it does
mmget_not_zero(), there is still the possibility of a race. For
example:

dup_mmap
    only_user = atomic_read(&oldmm->mm_users) == 1;

                                             userfaultfd_move()
                                                 mmget_not_zero() <-- inc mm_users

    if (!only_user)
        vma_start_write(mpnt); <-- gets skipped

                                                    move_pages()
                                                         uffd_move_lock()
                                                             uffd_lock_vma()
                                                                 lock_vma_under_rcu() <-- succeeds
                                                         move_pages_pte()
    copy_page_range()

So, userfaultfd_move() will happily move pages while dup_mmap() is
doing copy_page_range(). The copied range might look quite
interesting...

>
> However, the good news is that mm_count tends to be 1. If both
> mm_count and mm_users are 1, then there is no usefaultfd in use and
> nobody to add it either.

I'm not sure... IIUC new_userfaultfd() does not take mmap_lock while
calling mmgrab(), therefore I think it can race with the code checking
its value.

> [...]
Suren Baghdasaryan March 30, 2025, 7:25 p.m. UTC | #6
On Sun, Mar 30, 2025 at 12:23 PM Suren Baghdasaryan <surenb@google.com> wrote:
>
> On Fri, Mar 28, 2025 at 6:51 PM Mateusz Guzik <mjguzik@gmail.com> wrote:
> >
> > top post warning
> >
> > It just hit me that userfaultfd() may be a problem. Creating it only
> > bumps mm_count, while mm_users gets transiently modified during
> > various ops.
> >
> > So in particular you may end up pounding off the userfaultfd instance
> > to another process while being single-threaded, then mm_users == 1 and
> > userfaultfd may be trying to do something. I have no idea if this one
> > is guaranteed to take the lock.
>
> Hmm, yeah. uffd is nasty. Even in the cases when it does
> mmget_not_zero(), there is still the possibility of a race. For
> example:
>
> dup_mmap
>     only_user = atomic_read(&oldmm->mm_users) == 1;
>
>                                              userfaultfd_move()
>                                                  mmget_not_zero() <--
> inc mm_users
>
>     if (!only_user)
>         vma_start_write(mpnt); <-- gets skipped
>
>                                                     move_pages()
>                                                          uffd_move_lock()
>                                                              uffd_lock_vma()
>
> lock_vma_under_rcu() <-- succeeds
>                                                          move_pages_pte()
>     copy_page_range()

Sorry about the formatting. This might look a bit better:

Task 1                           Task 2
dup_mmap
     only_user = atomic_read(&oldmm->mm_users) == 1;

                                      userfaultfd_move()
                                          mmget_not_zero()

     if (!only_user)
         vma_start_write(mpnt); <-- gets skipped

                                             move_pages()
                                                  uffd_move_lock()
                                                      uffd_lock_vma()
                                                          lock_vma_under_rcu()
                                             move_pages_pte()
     copy_page_range()

> [...]
> > > > > > +               bool locked;
> > > > > > +
> > > > > > +               locked = vma_start_write(mpnt);
> > > > > > +               dup_probe(only_user ? 1 :0, locked ? 1 : 0);
> > > > > >
> > > > > > -               vma_start_write(mpnt);
> > > > > >                 if (mpnt->vm_flags & VM_DONTCOPY) {
> > > > > >                         retval = vma_iter_clear_gfp(&vmi, mpnt->vm_start,
> > > > > >                                                     mpnt->vm_end, GFP_KERNEL);
> > > > > >
> > > > > > --
> > > > > > Mateusz Guzik <mjguzik gmail.com>
> > > >
> > > >
> > > >
> > > > --
> > > > Mateusz Guzik <mjguzik gmail.com>
> >
> >
> >
> > --
> > Mateusz Guzik <mjguzik gmail.com>
Mateusz Guzik March 30, 2025, 7:43 p.m. UTC | #7
On Sun, Mar 30, 2025 at 9:23 PM Suren Baghdasaryan <surenb@google.com> wrote:
> > However, the good news is that mm_count tends to be 1. If both
> > mm_count and mm_users are 1, then there is no userfaultfd in use and
> > nobody to add it either.
>
> I'm not sure... IIUC new_userfaultfd() does not take mmap_lock while
> calling mmgrab(), therefore I think it can race with the code checking
> its value.
>

It issues:
ctx->mm = current->mm;
...
mmgrab(ctx->mm);

Thus I claim if mm_count is 1 *and* mm_users is 1 *and* we are in
dup_mmap(), nobody has a userfaultfd for our mm and there is nobody to
create it either and the optimization is saved.
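
Spelled out, the check I have in mind is on the order of the following
(a sketch only, untested, reusing the only_user variable from the probe
diff at the bottom of this page; mm_count and mm_users are both atomic_t,
so a plain atomic_read() under mmap_write_lock is assumed to be enough):

	/* in dup_mmap(), once mmap_write_lock_killable(oldmm) succeeded */
	only_user = atomic_read(&oldmm->mm_count) == 1 &&
		    atomic_read(&oldmm->mm_users) == 1;
	...
	for_each_vma(vmi, mpnt) {
		if (!only_user)
			vma_start_write(mpnt);
		...
	}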
Liam R. Howlett March 31, 2025, 4:43 p.m. UTC | #8
* Mateusz Guzik <mjguzik@gmail.com> [250330 15:43]:
> On Sun, Mar 30, 2025 at 9:23 PM Suren Baghdasaryan <surenb@google.com> wrote:
> > > However, the good news is that mm_count tends to be 1. If both
> > > mm_count and mm_users are 1, then there is no userfaultfd in use and
> > > nobody to add it either.
> >
> > I'm not sure... IIUC new_userfaultfd() does not take mmap_lock while
> > calling mmgrab(), therefore I think it can race with the code checking
> > its value.
> >

Considering the mm struct isn't the only way to find the vmas and there
are users who use other locks to ensure the mm and vma don't go away
(rmap, for example).  It is reasonable to think that other users may use
the vma lock to avoid mm_struct accesses racing.

Although I don't know of a way this is unsafe today, we are complicating
the locking story of the mm with this change and no data has been given
on the benefit.  I don't recall any regression caused by the addition of
per-vma locking?

> 
> It issues:
> ctx->mm = current->mm;
> ...
> mmgrab(ctx->mm);
> 
> Thus I claim if mm_count is 1 *and* mm_users is 1 *and* we are in
> dup_mmap(), nobody has a userfaultfd for our mm and there is nobody to
> create it either and the optimization is saved.

mm_count is lazy, so I am not entirely sure we can trust what it says.
But maybe that's only true of mmgrab_lazy_tlb() now?

Thanks,
Liam
Mateusz Guzik March 31, 2025, 5:50 p.m. UTC | #9
On Mon, Mar 31, 2025 at 6:43 PM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
>
> * Mateusz Guzik <mjguzik@gmail.com> [250330 15:43]:
> > On Sun, Mar 30, 2025 at 9:23 PM Suren Baghdasaryan <surenb@google.com> wrote:
> > > > However, the good news is that mm_count tends to be 1. If both
> > > > mm_count and mm_users are 1, then there is no userfaultfd in use and
> > > > nobody to add it either.
> > >
> > > I'm not sure... IIUC new_userfaultfd() does not take mmap_lock while
> > > calling mmgrab(), therefore I think it can race with the code checking
> > > its value.
> > >
>
> Considering the mm struct isn't the only way to find the vmas and there
> are users who use other locks to ensure the mm and vma don't go away
> (rmap, for example).  It is reasonable to think that other users may use
> the vma lock to avoid mm_struct accesses racing.
>
> Although I don't know of a way this is unsafe today, we are complicating
> the locking story of the mm with this change and no data has been given
> on the benefit.  I don't recall any regression caused by the addition of
> per-vma locking?
>

I was not involved in any of this, but I learned about the issue from lwn:
https://lwn.net/Articles/937943/

the new (at the time) per-vma locking was suffering weird crashes in
multithreaded programs and Suren ultimately fixed it by locking parent
vma at a 5% hit,
see fb49c455323f ("fork: lock VMAs of the parent process when
forking"). The patch merely adds vma_start_write(mpnt) in dup_mmap.

What I'm proposing here remedies the problem for most commonly forking
consumers (single-threaded), assuming it does work. ;)

To that end see below.

> >
> > It issues:
> > ctx->mm = current->mm;
> > ...
> > mmgrab(ctx->mm);
> >
> > Thus I claim if mm_count is 1 *and* mm_users is 1 *and* we are in
> > dup_mmap(), nobody has a userfaultfd for our mm and there is nobody to
> > create it either and the optimization is saved.
>
> mm_count is lazy, so I am not entirely sure we can trust what it says.
> But maybe that's only true of mmgrab_lazy_tlb() now?
>

warning: I don't know the Linux nomenclature here. I'm going to
outline how I see it.

There is an idiomatic way of splitting ref counts into two:
- something to prevent the struct itself from getting freed ("hold
count" where I'm from, in this case ->mm_count)
- something to prevent data used by the structure from getting freed
("use count" where I'm from, in this case ->mm_users)

mm_users > 0 keeps one ref on ->mm_count

AFAICS the scheme employed for mm follows the mold.
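
To map my terms onto the stock helpers (as I understand them, so take
this with a grain of salt):

	mmgrab(mm);	/* ->mm_count++: keeps struct mm_struct itself around */
	mmdrop(mm);	/* ->mm_count--: frees the struct once it hits zero */
	mmget(mm);	/* ->mm_users++: keeps the address space (vmas etc.) around */
	mmput(mm);	/* ->mm_users--: on zero, tears down the mappings and drops
			 * the mm_count reference held on behalf of mm_users */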

So with that mmgrab_lazy_tlb() example, the call bumps the count on first use.

Suppose we are left with one thread in the process and a lazy tlb
somewhere as the only consumers. mm_users is 1 because of the only
thread and mm_count is 2 -- one ref for mm_users > 0 and one ref for
lazy tlb.

Then my proposed check: ->mm_count == 1 && mm->mm_users == 1

... fails and the optimization is not used.

Now, per the above, the lock was added to protect against faults
happening in parallel. The only cases I found where this is of note
are remote accesses (e.g., from /proc/pid/cmdline) and userfaultfd.

I'm not an mm person and this is why I referred to Suren to sort this
out, hoping he would both have interest and enough knowledge about mm
to validate it.

That is to say I don't *vouch* for the idea myself (otherwise I would
sign off on a patch), I am merely bringing it up again long after the
dust has settled. If the idea is a nogo, then so be it, but then it
would be nice to document somewhere why that is.
Suren Baghdasaryan March 31, 2025, 6:42 p.m. UTC | #10
On Mon, Mar 31, 2025 at 10:51 AM Mateusz Guzik <mjguzik@gmail.com> wrote:
>
> On Mon, Mar 31, 2025 at 6:43 PM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
> >
> > * Mateusz Guzik <mjguzik@gmail.com> [250330 15:43]:
> > > On Sun, Mar 30, 2025 at 9:23 PM Suren Baghdasaryan <surenb@google.com> wrote:
> > > > > However, the good news is that mm_count tends to be 1. If both
> > > > > mm_count and mm_users are 1, then there is no userfaultfd in use and
> > > > > nobody to add it either.
> > > >
> > > > I'm not sure... IIUC new_userfaultfd() does not take mmap_lock while
> > > > calling mmgrab(), therefore I think it can race with the code checking
> > > > its value.
> > > >
> >
> > Considering the mm struct isn't the only way to find the vmas and there
> > are users who use other locks to ensure the mm and vma don't go away
> > (rmap, for example).  It is reasonable to think that other users may use
> > the vma lock to avoid mm_struct accesses racing.
> >
> > Although I don't know of a way this is unsafe today, we are complicating
> > the locking story of the mm with this change and no data has been given
> > on the benefit.  I don't recall any regression caused by the addition of
> > per-vma locking?
> >
>
> I was not involved in any of this, but I learned about the issue from lwn:
> https://lwn.net/Articles/937943/
>
> the new (at the time) per-vma locking was suffering weird crashes in
> multithreaded programs and Suren ultimately fixed it by locking parent
> vma at a 5% hit,
> see fb49c455323f ("fork: lock VMAs of the parent process when
> forking"). The patch merely adds vma_start_write(mpnt) in dup_mmap.
>
> What I'm proposing here remedies the problem for most commonly forking
> consumers (single-threaded), assuming it does work. ;)
>
> To that end see below.
>
> > >
> > > It issues:
> > > ctx->mm = current->mm;
> > > ...
> > > mmgrab(ctx->mm);
> > >
> > > Thus I claim if mm_count is 1 *and* mm_users is 1 *and* we are in
> > > dup_mmap(), nobody has a userfaultfd for our mm and there is nobody to
> > > create it either and the optimization is saved.
> >
> > mm_count is lazy, so I am not entirely sure we can trust what it says.
> > But maybe that's only true of mmgrab_lazy_tlb() now?
> >
>
> warning: I don't know the Linux nomenclature here. I'm going to
> outline how I see it.
>
> There is an idiomatic way of splitting ref counts into two:
> - something to prevent the struct itself from getting freed ("hold
> count" where I'm from, in this case ->mm_count)
> - something to prevent data used by the structure from getting freed
> ("use count" where I'm from, in this case ->mm_users)
>
> mm_users > 0 keeps one ref on ->mm_count
>
> AFAICS the scheme employed for mm follows the mold.
>
> So with that mmgrab_lazy_tlb() example, the call bumps the count on first use.
>
> Suppose we are left with one thread in the process and a lazy tlb
> somewhere as the only consumers. mm_users is 1 because of the only
> thread and mm_count is 2 -- one ref for mm_users > 0 and one ref for
> lazy tlb.
>
> Then my proposed check: ->mm_count == 1 && mm->mm_users == 1
>
> ... fails and the optimization is not used.
>
> Now, per the above, the lock was added to protect against faults
> happening in parallel. The only cases I found where this is of note
> are remote accesses (e.g., from /proc/pid/cmdline) and userfaultfd.
>
> I'm not an mm person and this is why I referred to Suren to sort this
> out, hoping he would both have interest and enough knowledge about mm
> to validate it.
>
> That is to say I don't *vouch* the idea myself (otherwise I would sign
> off on a patch), I am merely bringing it up again long after the dust
> has settled. If the idea is a nogo, then so be it, but then it would
> be nice to document somewhere why is it so.

I think it would be worth optimizing if it's as straight-forward as we
think so far. I would like to spend some more time checking uffd code
to see if anything else might be lurking there before posting the
patch. If I don't find anything new I'll post a patch sometime next
week.
Thanks,
Suren.

> --
> Mateusz Guzik <mjguzik gmail.com>
Liam R. Howlett March 31, 2025, 7:24 p.m. UTC | #11
* Mateusz Guzik <mjguzik@gmail.com> [250331 13:51]:
> On Mon, Mar 31, 2025 at 6:43 PM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
> >
> > * Mateusz Guzik <mjguzik@gmail.com> [250330 15:43]:
> > > On Sun, Mar 30, 2025 at 9:23 PM Suren Baghdasaryan <surenb@google.com> wrote:
> > > > > However, the good news is that mm_count tends to be 1. If both
> > > > > mm_count and mm_users are 1, then there is no userfaultfd in use and
> > > > > nobody to add it either.
> > > >
> > > > I'm not sure... IIUC new_userfaultfd() does not take mmap_lock while
> > > > calling mmgrab(), therefore I think it can race with the code checking
> > > > its value.
> > > >
> >
> > Considering the mm struct isn't the only way to find the vmas and there
> > are users who use other locks to ensure the mm and vma don't go away
> > (rmap, for example).  It is reasonable to think that other users may use
> > the vma lock to avoid mm_struct accesses racing.
> >
> > Although I don't know of a way this is unsafe today, we are complicating
> > the locking story of the mm with this change and no data has been given
> > on the benefit.  I don't recall any regression caused by the addition of
> > per-vma locking?
> >
> 
> I was not involved in any of this, but I learned about the issue from lwn:
> https://lwn.net/Articles/937943/
> 
> the new (at the time) per-vma locking was suffering weird crashes in
> multithreaded programs and Suren ultimately fixed it by locking parent
> vma at a 5% hit,
> see fb49c455323f ("fork: lock VMAs of the parent process when
> forking"). The patch merely adds vma_start_write(mpnt) in dup_mmap.

Ah, thanks for the pointer.  I forgot about that part.

You are referencing the 5% on 5,000 forks of 10,000 vmas in a tight
loop.  The kernel build time was not affected.  I'm sticking with this
being an inconsequential increase on any meaningful benchmark.

Reading the commit did trigger my memory about the 5% regression, but I
disregarded it as it seems so unlikely in real life that the benefits
outweighed the upper bounds of 5% negative.

The 5,000 forks of 10,000 vmas benchmark would also be at least a bit
faster with Suren's more recent rcu safe vma change [1].

Although, that doesn't rule out changing this to get higher number, I
still don't think the complexity of the change is worth it for a
contrived benchmark.

> 
> What I'm proposing here remedies the problem for most commonly forking
> consumers (single-threaded), assuming it does work. ;)
> 
> To that end see below.
> 
> > >
> > > It issues:
> > > ctx->mm = current->mm;
> > > ...
> > > mmgrab(ctx->mm);
> > >
> > > Thus I claim if mm_count is 1 *and* mm_users is 1 *and* we are in
> > > dup_mmap(), nobody has a userfaultfd for our mm and there is nobody to
> > > create it either and the optimization is saved.
> >
> > mm_count is lazy, so I am not entirely sure we can trust what it says.
> > But maybe that's only true of mmgrab_lazy_tlb() now?
> >
> 
> warning: I don't know the Linux nomenclature here. I'm going to
> outline how I see it.
> 
> There is an idiomatic way of splitting ref counts into two:
> - something to prevent the struct itself from getting freed ("hold
> count" where I'm from, in this case ->mm_count)
> - something to prevent data used by the structure from getting freed
> ("use count" where I'm from, in this case ->mm_users)
> 
> mm_users > 0 keeps one ref on ->mm_count
> 
> AFAICS the scheme employed for mm follows the mold.
> 
> So with that mmgrab_lazy_tlb() example, the call bumps the count on first use.
> 
> Suppose we are left with one thread in the process and a lazy tlb
> somewhere as the only consumers. mm_users is 1 because of the only
> thread and mm_count is 2 -- one ref for mm_users > 0 and one ref for
> lazy tlb.
> 
> Then my proposed check: ->mm_count == 1 && mm->mm_users == 1
> 
> ... fails and the optimization is not used.
> 
> Now, per the above, the lock was added to protect against faults
> happening in parallel. The only cases I found where this is of note
> are remote accesses (e.g., from /proc/pid/cmdline) and userfaultfd.
> 
> I'm not an mm person and this is why I referred to Suren to sort this
> out, hoping he would both have interest and enough knowledge about mm
> to validate it.
> 
> That is to say I don't *vouch* the idea myself (otherwise I would sign
> off on a patch), I am merely bringing it up again long after the dust
> has settled. If the idea is a nogo, then so be it, but then it would
> be nice to document somewhere why is it so.

Thanks for the breakdown here.

I am also not sure, but I felt it worth the mention.

Reading Documentation/mm/active_mm.rst makes me uncertain of the
mm_count == 1 and mm_users == 1 test.  Since mm_count is the number of
'lazy' users, and it may no longer include the lazy users..

Since there are no real workloads that suffer, is this worth it?

[1]. https://lore.kernel.org/all/202503311656.e3596aaf-lkp@intel.com/

Thanks,
Liam
Mateusz Guzik March 31, 2025, 8:27 p.m. UTC | #12
On Mon, Mar 31, 2025 at 9:24 PM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
> Reading the commit did trigger my memory about the 5% regression, but I
> disregarded it as it seems so unlikely in real life that the benefits
> outweighed the upper bounds of 5% negative.
>
> The 5,000 forks of 10,000 vmas benchmark would also be at least a bit
> faster with Suren's more recent rcu safe vma change [1].
>

Well it follows it would be more than 5% now? ;)

> Although, that doesn't rule out changing this to get higher number, I
> still don't think the complexity of the change is worth it for a
> contrived benchmark.
>

I assumed this would be easy enough to reason through for someone
familiar with mm (which I'm not).

If this is deemed too hairy, then I completely agree it should not be
pursued (at least for now, see below).

> Reading Documentation/mm/active_mm.rst makes me uncertain of the
> mm_count == 1 and mm_users == 1 test.  Since mm_count is the number of
> 'lazy' users, and it may no longer include the lazy users..
>

The doc states interested callers need to use mmgrab_lazy_tlb() and
mmdrop_lazy_tlb() and that the *exact* count won't be known.

This is perfectly fine for my proposal as we don't care specifically
how many other users are out there (lazy tlb or otherwise); what we do
care to know is whether there is at least one, and that much we are
being told in the expected manner: handing the mm off to lazy tlb also
grabs a ->mm_count reference on it.

To be more exact, mmgrab_lazy_tlb() starts with pinning the struct
while the mmap semaphore is held. This synchronizes ->mm_count
visibility against dup_mmap() which write-locks it.

Suppose a process with two threads races with one issuing dup_mmap()
and the other one exiting. Further suppose the forking thread got the
lock after the thread executing exit_mmap() finished and the mm
remains in active use for lazy tlb. Per my previous e-mail it is an
invariant that in this case ->mm_count is at least two:
1. the forking thread uses the mm, which implies mm_users > 0, which
implies mm_count of at least one.
2. employing the mm for lazy tlb use bumped the count, which means it
has to be at least two, regardless of how many lazy uses managed to
elide refcount manipulation afterwards

But the count being two disables the optimization. Also note that since
the forking thread got the lock first, ->mm_count cannot transition
1 -> 2 on account of a lazy tlb flush, as in the worst case the other
thread is still waiting for the lock (besides, that would imply
->mm_users of at least two due to the waiting thread, which also
disables the optimization).

> Since there are no real workloads that suffer, is this worth it?
>

It is unclear to me if you are criticizing this idea specifically
(given your doubts of whether it even works) or the notion of this
kind of optimization to begin with.

And this brings me back to kernel build, which I'm looking at for kicks.

There is a lot of evil going on there from userspace pov, which
partially obfuscates kernel-side slowdowns, of which there are plenty
(and I whacked some of them on the VFS side, more in the pipeline). I
don't expect dodging vma locking to be worth much in isolation in a
real workload. I do expect consistent effort to provide smooth
execution in commonly used code paths (fork being one) to add up.
Apart from being algorithmically sound in its own right, smooth
execution means not stalling on things which can be easily avoided
(notably cache misses and lock-prefixed ops).

To give you a specific example, compilation of C files run by gnu make
goes through the shell -- as in extra fork + exec trip on top of
spawning the compiler. In order to model this, with focus on the
kernel side, I slapped in a system("gcc -c -o /tmp/crap.o src.c")
testcase into will-it-scale (src.c is hello world with one #include
clause, important for later). The use of system() suffers the shell
trip in the same manner.
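
For reference, a bare-bones approximation of that testcase outside of
the will-it-scale harness (loop count arbitrary, src.c as described
above):

#include <stdlib.h>

int main(void)
{
	int i;

	for (i = 0; i < 100000; i++)
		system("gcc -c -o /tmp/crap.o src.c");
	return 0;
}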

I'm seeing about 20% system time, about a third of which looks
suspicious at best. For example over 1% of the total (or over 4% of
kernel time) is spent in sync_regs() copying 168 bytes using rep
movsq. Convincing the compiler to use regular stores instead dropped
the routine to about 0.1% and gave me a measurable boost in throughput
of these compiles. Will this speed up compilation of a non-toy file?
In isolation, no. I do however claim that:
1. consistent effort to whack avoidable slowdowns adds up, you don't
need to do anything revolutionary (albeit that's welcome)
2. the kernel is chock full of slowdowns which don't need to be there,
but I'm not going to rant about specific examples
3. the famed real workloads are shooting themselves in their feet on
the regular, making it really hard to show improvement on the kernel
side, apart from massive changes or fixing utter breakage

So, for the vma lock avoidance idea, it may be that it requires too
much analysis to justify. If so, screw it. I do stand by what I tried
to accomplish though (whacking atomics in the fast path in the common
case).
diff mbox series

Patch

diff --git a/fs/namei.c b/fs/namei.c
index ecb7b95c2ca3..d6cde76eda81 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -5459,3 +5459,7 @@  const struct inode_operations
page_symlink_inode_operations = {
        .get_link       = page_get_link,
 };
 EXPORT_SYMBOL(page_symlink_inode_operations);
+
+void dup_probe(int, int);
+void dup_probe(int, int) { }
+EXPORT_SYMBOL(dup_probe);
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 1f80baddacc5..f7b1f0a02f2e 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -760,12 +760,12 @@  static bool __is_vma_write_locked(struct
vm_area_struct *vma, unsigned int *mm_l
  * Exclude concurrent readers under the per-VMA lock until the currently
  * write-locked mmap_lock is dropped or downgraded.
  */
-static inline void vma_start_write(struct vm_area_struct *vma)
+static inline bool vma_start_write(struct vm_area_struct *vma)
 {
        unsigned int mm_lock_seq;

        if (__is_vma_write_locked(vma, &mm_lock_seq))
-               return;
+               return false;

        down_write(&vma->vm_lock->lock);
        /*
@@ -776,6 +776,7 @@  static inline void vma_start_write(struct
vm_area_struct *vma)
         */
        WRITE_ONCE(vma->vm_lock_seq, mm_lock_seq);
        up_write(&vma->vm_lock->lock);
+       return true;
 }

 static inline void vma_assert_write_locked(struct vm_area_struct *vma)
diff --git a/kernel/fork.c b/kernel/fork.c
index 735405a9c5f3..0cc56255a339 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -629,6 +629,8 @@  static void dup_mm_exe_file(struct mm_struct *mm,
struct mm_struct *oldmm)
                pr_warn_once("exe_file_deny_write_access() failed in
%s\n", __func__);
 }

+void dup_probe(int, int);
+
 #ifdef CONFIG_MMU