Message ID | CAGudoHFdPd5a7fiAutTFAxQz5f1fP2n9rwORm7Hj9fPzuVhiKw@mail.gmail.com (mailing list archive)
---|---
State | New
Series | not issuing vma_start_write() in dup_mmap() if the caller is single-threaded
Hi Mateusz, On Wed, Mar 26, 2025 at 10:46 PM Mateusz Guzik <mjguzik@gmail.com> wrote: > > Hi Suren, > > I brought this up long time ago, you agreed the idea works and stated > you will sort it out, but it apparently fell to the wayside (no hard > feelings though :>) Sorry about that. Yeah, that was a hectic time and I completely forgot to look into the single-user case. > > Here is the original: > https://lore.kernel.org/all/20230804214620.btgwhsszsd7rh6nf@f/ > > the thread is weirdly long and I recommend not opening it without a > good reason, I link it for reference if needed. I had to re-read it to remember what it was all about :) To bring others up-to-speed, the suggested change would look something like this: @@ -638,9 +640,11 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm, unsigned long charge = 0; LIST_HEAD(uf); VMA_ITERATOR(vmi, mm, 0); + bool only_user; if (mmap_write_lock_killable(oldmm)) return -EINTR; + only_user = atomic_read(&oldmm->mm_users) == 1; flush_cache_dup_mm(oldmm); uprobe_dup_mmap(oldmm, mm); /* @@ -664,8 +668,11 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm, mt_clear_in_rcu(vmi.mas.tree); for_each_vma(vmi, mpnt) { struct file *file; - vma_start_write(mpnt); + if (!only_user) + vma_start_write(mpnt); if (mpnt->vm_flags & VM_DONTCOPY) { retval = vma_iter_clear_gfp(&vmi, mpnt->vm_start, mpnt->vm_end, GFP_KERNEL); > > At the time there were woes concerning stability of the new locking > scheme, resulting in CC list being rather excessive. As such I'm > starting a new thread with a modest list, not sure who to add though. > > So dup_mmap() holds the caller mmap_sem for writing and calls > vma_start_write() to protect against fault handling in another threads > using the same mm. > > If this is the only thread with this ->mm in place, there are no > sibling threads to worry about and this can be checked with mm_users > == 1. > > AFAICS all remote accesses require the mmap_sem to also be held, which > provides exclusion against dup_mmap, meaning they don't pose a problem > either. Yes, I believe the optimization you proposed would be safe. > > The patch to merely skip locking is a few liner and I would officially > submit it myself, but for my taste an assert is needed in fault > handling to runtime test the above invariant. Hmm. Adding an assert in the pagefault path that checks whether the fault is triggered during dup_mmap() && mm_users==1 would not be trivial. We would need to indicate that we expect no page faults while in that section of dup_mmap() (maybe a flag inside mm_struct) and then assert that this flag is not set inside the pagefault handling path. I'm not sure it's worth the complexity... As was discussed in that thread, the only other task that might fault a page would be external and therefore would have to go through access_remote_vm()->mmap_read_lock_killable() and since dup_mmap() already holds mmap_write_lock the new user would have to wait. > So happens I really > can't be bothered to figure out how to sort it out and was hoping you > would step in. ;) Alternatively if you guys don't think the assert is > warranted, that's your business. I don't think it's worth it but I'll CC Matthew and Lorenzo (you already CC'ed Liam) to get their opinion. > > As for whether this can save any locking -- yes: Yeah, I'm sure it will make a difference in performance. While forking we are currently locking each VMA separately, so skipping that would be nice. Folks, WDYT? 
Do we need a separate assertion that pagefault can't happen if mm_users==1 and we are holding mmap_write_lock (access_remote_vm() will block)? Thanks, Suren. > > I added a probe (below for reference) with two args: whether we are > single-threaded and vma_start_write() returning whether it took the > down/up cycle and ran make -j 20 in the kernel dir. > > The lock was taken for every single vma (377913 in total), while all > forking processes were single-threaded. Or to put it differently all > of these were skippable. > > the probe (total hack): > bpftrace -e 'kprobe:dup_probe { @[arg0, arg1] = count(); }' > > probe diff: > diff --git a/fs/namei.c b/fs/namei.c > index ecb7b95c2ca3..d6cde76eda81 100644 > --- a/fs/namei.c > +++ b/fs/namei.c > @@ -5459,3 +5459,7 @@ const struct inode_operations > page_symlink_inode_operations = { > .get_link = page_get_link, > }; > EXPORT_SYMBOL(page_symlink_inode_operations); > + > +void dup_probe(int, int); > +void dup_probe(int, int) { } > +EXPORT_SYMBOL(dup_probe); > diff --git a/include/linux/mm.h b/include/linux/mm.h > index 1f80baddacc5..f7b1f0a02f2e 100644 > --- a/include/linux/mm.h > +++ b/include/linux/mm.h > @@ -760,12 +760,12 @@ static bool __is_vma_write_locked(struct > vm_area_struct *vma, unsigned int *mm_l > * Exclude concurrent readers under the per-VMA lock until the currently > * write-locked mmap_lock is dropped or downgraded. > */ > -static inline void vma_start_write(struct vm_area_struct *vma) > +static inline bool vma_start_write(struct vm_area_struct *vma) > { > unsigned int mm_lock_seq; > > if (__is_vma_write_locked(vma, &mm_lock_seq)) > - return; > + return false; > > down_write(&vma->vm_lock->lock); > /* > @@ -776,6 +776,7 @@ static inline void vma_start_write(struct > vm_area_struct *vma) > */ > WRITE_ONCE(vma->vm_lock_seq, mm_lock_seq); > up_write(&vma->vm_lock->lock); > + return true; > } > > static inline void vma_assert_write_locked(struct vm_area_struct *vma) > diff --git a/kernel/fork.c b/kernel/fork.c > index 735405a9c5f3..0cc56255a339 100644 > --- a/kernel/fork.c > +++ b/kernel/fork.c > @@ -629,6 +629,8 @@ static void dup_mm_exe_file(struct mm_struct *mm, > struct mm_struct *oldmm) > pr_warn_once("exe_file_deny_write_access() failed in > %s\n", __func__); > } > > +void dup_probe(int, int); > + > #ifdef CONFIG_MMU > static __latent_entropy int dup_mmap(struct mm_struct *mm, > struct mm_struct *oldmm) > @@ -638,9 +640,11 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm, > unsigned long charge = 0; > LIST_HEAD(uf); > VMA_ITERATOR(vmi, mm, 0); > + bool only_user; > > if (mmap_write_lock_killable(oldmm)) > return -EINTR; > + only_user = atomic_read(&oldmm->mm_users) == 1; > flush_cache_dup_mm(oldmm); > uprobe_dup_mmap(oldmm, mm); > /* > @@ -664,8 +668,11 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm, > mt_clear_in_rcu(vmi.mas.tree); > for_each_vma(vmi, mpnt) { > struct file *file; > + bool locked; > + > + locked = vma_start_write(mpnt); > + dup_probe(only_user ? 1 :0, locked ? 1 : 0); > > - vma_start_write(mpnt); > if (mpnt->vm_flags & VM_DONTCOPY) { > retval = vma_iter_clear_gfp(&vmi, mpnt->vm_start, > mpnt->vm_end, GFP_KERNEL); > > -- > Mateusz Guzik <mjguzik gmail.com>
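For readers skimming the thread: the exclusion Suren relies on above is that a remote accessor has to take the target mm's mmap lock for reading before it can touch any VMA. A minimal sketch of that path, written from memory of __access_remote_vm() rather than copied from any particular kernel tree (name and body abridged here), looks roughly like this:

	/*
	 * Sketch only: a remote accessor (e.g. ptrace via access_remote_vm())
	 * starts by taking the mmap lock for reading, so it blocks behind the
	 * mmap_write_lock() held for the whole of dup_mmap().
	 */
	static int access_remote_vm_sketch(struct mm_struct *mm, unsigned long addr,
					   void *buf, int len, unsigned int gup_flags)
	{
		if (mmap_read_lock_killable(mm))
			return 0;	/* killed while waiting for the lock */

		/* ... get_user_pages_remote() and the actual copy elided ... */

		mmap_read_unlock(mm);
		return len;
	}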
On Sat, Mar 29, 2025 at 1:57 AM Suren Baghdasaryan <surenb@google.com> wrote: > On Wed, Mar 26, 2025 at 10:46 PM Mateusz Guzik <mjguzik@gmail.com> wrote: > > Here is the original: > > https://lore.kernel.org/all/20230804214620.btgwhsszsd7rh6nf@f/ > > > > the thread is weirdly long and I recommend not opening it without a > > good reason, I link it for reference if needed. > > I had to re-read it to remember what it was all about :) To bring > others up-to-speed, the suggested change would look something like > this: > > @@ -638,9 +640,11 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm, > unsigned long charge = 0; > LIST_HEAD(uf); > VMA_ITERATOR(vmi, mm, 0); > + bool only_user; > > if (mmap_write_lock_killable(oldmm)) > return -EINTR; > + only_user = atomic_read(&oldmm->mm_users) == 1; > flush_cache_dup_mm(oldmm); > uprobe_dup_mmap(oldmm, mm); > /* > @@ -664,8 +668,11 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm, > mt_clear_in_rcu(vmi.mas.tree); > for_each_vma(vmi, mpnt) { > struct file *file; > > - vma_start_write(mpnt); > + if (!only_user) > + vma_start_write(mpnt); > if (mpnt->vm_flags & VM_DONTCOPY) { > retval = vma_iter_clear_gfp(&vmi, mpnt->vm_start, > mpnt->vm_end, GFP_KERNEL); > > > > > At the time there were woes concerning stability of the new locking > > scheme, resulting in CC list being rather excessive. As such I'm > > starting a new thread with a modest list, not sure who to add though. > > > > So dup_mmap() holds the caller mmap_sem for writing and calls > > vma_start_write() to protect against fault handling in another threads > > using the same mm. > > > > If this is the only thread with this ->mm in place, there are no > > sibling threads to worry about and this can be checked with mm_users > > == 1. > > > > AFAICS all remote accesses require the mmap_sem to also be held, which > > provides exclusion against dup_mmap, meaning they don't pose a problem > > either. > > Yes, I believe the optimization you proposed would be safe. > > > > > The patch to merely skip locking is a few liner and I would officially > > submit it myself, but for my taste an assert is needed in fault > > handling to runtime test the above invariant. > > Hmm. Adding an assert in the pagefault path that checks whether the > fault is triggered during dup_mmap() && mm_users==1 would not be > trivial. We would need to indicate that we expect no page faults while > in that section of dup_mmap() (maybe a flag inside mm_struct) and then > assert that this flag is not set inside the pagefault handling path. > I'm not sure it's worth the complexity... As was discussed in that > thread, the only other task that might fault a page would be external > and therefore would have to go through > access_remote_vm()->mmap_read_lock_killable() and since dup_mmap() > already holds mmap_write_lock the new user would have to wait. > > > So happens I really > > can't be bothered to figure out how to sort it out and was hoping you > > would step in. ;) Alternatively if you guys don't think the assert is > > warranted, that's your business. > > I don't think it's worth it but I'll CC Matthew and Lorenzo (you > already CC'ed Liam) to get their opinion. > > > > > As for whether this can save any locking -- yes: > > Yeah, I'm sure it will make a difference in performance. While forking > we are currently locking each VMA separately, so skipping that would > be nice. > Folks, WDYT? 
Do we need a separate assertion that pagefault can't > happen if mm_users==1 and we are holding mmap_write_lock > (access_remote_vm() will block)? pseudocode-wise I was thinking in the lines of the following in the fault path: if (current->mm != vma->vm_mm) mmap_assert_locked(vma->vm_mm); with the assumption that mmap_assert_locked expands to nothing without debug > > > > > I added a probe (below for reference) with two args: whether we are > > single-threaded and vma_start_write() returning whether it took the > > down/up cycle and ran make -j 20 in the kernel dir. > > > > The lock was taken for every single vma (377913 in total), while all > > forking processes were single-threaded. Or to put it differently all > > of these were skippable. > > > > the probe (total hack): > > bpftrace -e 'kprobe:dup_probe { @[arg0, arg1] = count(); }' > > > > probe diff: > > diff --git a/fs/namei.c b/fs/namei.c > > index ecb7b95c2ca3..d6cde76eda81 100644 > > --- a/fs/namei.c > > +++ b/fs/namei.c > > @@ -5459,3 +5459,7 @@ const struct inode_operations > > page_symlink_inode_operations = { > > .get_link = page_get_link, > > }; > > EXPORT_SYMBOL(page_symlink_inode_operations); > > + > > +void dup_probe(int, int); > > +void dup_probe(int, int) { } > > +EXPORT_SYMBOL(dup_probe); > > diff --git a/include/linux/mm.h b/include/linux/mm.h > > index 1f80baddacc5..f7b1f0a02f2e 100644 > > --- a/include/linux/mm.h > > +++ b/include/linux/mm.h > > @@ -760,12 +760,12 @@ static bool __is_vma_write_locked(struct > > vm_area_struct *vma, unsigned int *mm_l > > * Exclude concurrent readers under the per-VMA lock until the currently > > * write-locked mmap_lock is dropped or downgraded. > > */ > > -static inline void vma_start_write(struct vm_area_struct *vma) > > +static inline bool vma_start_write(struct vm_area_struct *vma) > > { > > unsigned int mm_lock_seq; > > > > if (__is_vma_write_locked(vma, &mm_lock_seq)) > > - return; > > + return false; > > > > down_write(&vma->vm_lock->lock); > > /* > > @@ -776,6 +776,7 @@ static inline void vma_start_write(struct > > vm_area_struct *vma) > > */ > > WRITE_ONCE(vma->vm_lock_seq, mm_lock_seq); > > up_write(&vma->vm_lock->lock); > > + return true; > > } > > > > static inline void vma_assert_write_locked(struct vm_area_struct *vma) > > diff --git a/kernel/fork.c b/kernel/fork.c > > index 735405a9c5f3..0cc56255a339 100644 > > --- a/kernel/fork.c > > +++ b/kernel/fork.c > > @@ -629,6 +629,8 @@ static void dup_mm_exe_file(struct mm_struct *mm, > > struct mm_struct *oldmm) > > pr_warn_once("exe_file_deny_write_access() failed in > > %s\n", __func__); > > } > > > > +void dup_probe(int, int); > > + > > #ifdef CONFIG_MMU > > static __latent_entropy int dup_mmap(struct mm_struct *mm, > > struct mm_struct *oldmm) > > @@ -638,9 +640,11 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm, > > unsigned long charge = 0; > > LIST_HEAD(uf); > > VMA_ITERATOR(vmi, mm, 0); > > + bool only_user; > > > > if (mmap_write_lock_killable(oldmm)) > > return -EINTR; > > + only_user = atomic_read(&oldmm->mm_users) == 1; > > flush_cache_dup_mm(oldmm); > > uprobe_dup_mmap(oldmm, mm); > > /* > > @@ -664,8 +668,11 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm, > > mt_clear_in_rcu(vmi.mas.tree); > > for_each_vma(vmi, mpnt) { > > struct file *file; > > + bool locked; > > + > > + locked = vma_start_write(mpnt); > > + dup_probe(only_user ? 1 :0, locked ? 
1 : 0); > > > > - vma_start_write(mpnt); > > if (mpnt->vm_flags & VM_DONTCOPY) { > > retval = vma_iter_clear_gfp(&vmi, mpnt->vm_start, > > mpnt->vm_end, GFP_KERNEL); > > > > -- > > Mateusz Guzik <mjguzik gmail.com>
On Fri, Mar 28, 2025 at 6:16 PM Mateusz Guzik <mjguzik@gmail.com> wrote: > > On Sat, Mar 29, 2025 at 1:57 AM Suren Baghdasaryan <surenb@google.com> wrote: > > On Wed, Mar 26, 2025 at 10:46 PM Mateusz Guzik <mjguzik@gmail.com> wrote: > > > Here is the original: > > > https://lore.kernel.org/all/20230804214620.btgwhsszsd7rh6nf@f/ > > > > > > the thread is weirdly long and I recommend not opening it without a > > > good reason, I link it for reference if needed. > > > > I had to re-read it to remember what it was all about :) To bring > > others up-to-speed, the suggested change would look something like > > this: > > > > @@ -638,9 +640,11 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm, > > unsigned long charge = 0; > > LIST_HEAD(uf); > > VMA_ITERATOR(vmi, mm, 0); > > + bool only_user; > > > > if (mmap_write_lock_killable(oldmm)) > > return -EINTR; > > + only_user = atomic_read(&oldmm->mm_users) == 1; > > flush_cache_dup_mm(oldmm); > > uprobe_dup_mmap(oldmm, mm); > > /* > > @@ -664,8 +668,11 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm, > > mt_clear_in_rcu(vmi.mas.tree); > > for_each_vma(vmi, mpnt) { > > struct file *file; > > > > - vma_start_write(mpnt); > > + if (!only_user) > > + vma_start_write(mpnt); > > if (mpnt->vm_flags & VM_DONTCOPY) { > > retval = vma_iter_clear_gfp(&vmi, mpnt->vm_start, > > mpnt->vm_end, GFP_KERNEL); > > > > > > > > At the time there were woes concerning stability of the new locking > > > scheme, resulting in CC list being rather excessive. As such I'm > > > starting a new thread with a modest list, not sure who to add though. > > > > > > So dup_mmap() holds the caller mmap_sem for writing and calls > > > vma_start_write() to protect against fault handling in another threads > > > using the same mm. > > > > > > If this is the only thread with this ->mm in place, there are no > > > sibling threads to worry about and this can be checked with mm_users > > > == 1. > > > > > > AFAICS all remote accesses require the mmap_sem to also be held, which > > > provides exclusion against dup_mmap, meaning they don't pose a problem > > > either. > > > > Yes, I believe the optimization you proposed would be safe. > > > > > > > > The patch to merely skip locking is a few liner and I would officially > > > submit it myself, but for my taste an assert is needed in fault > > > handling to runtime test the above invariant. > > > > Hmm. Adding an assert in the pagefault path that checks whether the > > fault is triggered during dup_mmap() && mm_users==1 would not be > > trivial. We would need to indicate that we expect no page faults while > > in that section of dup_mmap() (maybe a flag inside mm_struct) and then > > assert that this flag is not set inside the pagefault handling path. > > I'm not sure it's worth the complexity... As was discussed in that > > thread, the only other task that might fault a page would be external > > and therefore would have to go through > > access_remote_vm()->mmap_read_lock_killable() and since dup_mmap() > > already holds mmap_write_lock the new user would have to wait. > > > > > So happens I really > > > can't be bothered to figure out how to sort it out and was hoping you > > > would step in. ;) Alternatively if you guys don't think the assert is > > > warranted, that's your business. > > > > I don't think it's worth it but I'll CC Matthew and Lorenzo (you > > already CC'ed Liam) to get their opinion. 
> > > > > > > > As for whether this can save any locking -- yes: > > > > Yeah, I'm sure it will make a difference in performance. While forking > > we are currently locking each VMA separately, so skipping that would > > be nice. > > Folks, WDYT? Do we need a separate assertion that pagefault can't > > happen if mm_users==1 and we are holding mmap_write_lock > > (access_remote_vm() will block)? > > pseudocode-wise I was thinking in the lines of the following in the fault path: > > if (current->mm != vma->vm_mm) > mmap_assert_locked(vma->vm_mm); > > with the assumption that mmap_assert_locked expands to nothing without debug I see. IOW, if someone external is faulting a page then it has to be holding at least mmap_read_lock. So, it's a more general check but seems reasonable. I think adding it at the end of lock_vma_under_rcu() under the "#ifdef CONFIG_DEBUG_VM" condition would be enough. > > > > > > > > > I added a probe (below for reference) with two args: whether we are > > > single-threaded and vma_start_write() returning whether it took the > > > down/up cycle and ran make -j 20 in the kernel dir. > > > > > > The lock was taken for every single vma (377913 in total), while all > > > forking processes were single-threaded. Or to put it differently all > > > of these were skippable. > > > > > > the probe (total hack): > > > bpftrace -e 'kprobe:dup_probe { @[arg0, arg1] = count(); }' > > > > > > probe diff: > > > diff --git a/fs/namei.c b/fs/namei.c > > > index ecb7b95c2ca3..d6cde76eda81 100644 > > > --- a/fs/namei.c > > > +++ b/fs/namei.c > > > @@ -5459,3 +5459,7 @@ const struct inode_operations > > > page_symlink_inode_operations = { > > > .get_link = page_get_link, > > > }; > > > EXPORT_SYMBOL(page_symlink_inode_operations); > > > + > > > +void dup_probe(int, int); > > > +void dup_probe(int, int) { } > > > +EXPORT_SYMBOL(dup_probe); > > > diff --git a/include/linux/mm.h b/include/linux/mm.h > > > index 1f80baddacc5..f7b1f0a02f2e 100644 > > > --- a/include/linux/mm.h > > > +++ b/include/linux/mm.h > > > @@ -760,12 +760,12 @@ static bool __is_vma_write_locked(struct > > > vm_area_struct *vma, unsigned int *mm_l > > > * Exclude concurrent readers under the per-VMA lock until the currently > > > * write-locked mmap_lock is dropped or downgraded. 
> > > */ > > > -static inline void vma_start_write(struct vm_area_struct *vma) > > > +static inline bool vma_start_write(struct vm_area_struct *vma) > > > { > > > unsigned int mm_lock_seq; > > > > > > if (__is_vma_write_locked(vma, &mm_lock_seq)) > > > - return; > > > + return false; > > > > > > down_write(&vma->vm_lock->lock); > > > /* > > > @@ -776,6 +776,7 @@ static inline void vma_start_write(struct > > > vm_area_struct *vma) > > > */ > > > WRITE_ONCE(vma->vm_lock_seq, mm_lock_seq); > > > up_write(&vma->vm_lock->lock); > > > + return true; > > > } > > > > > > static inline void vma_assert_write_locked(struct vm_area_struct *vma) > > > diff --git a/kernel/fork.c b/kernel/fork.c > > > index 735405a9c5f3..0cc56255a339 100644 > > > --- a/kernel/fork.c > > > +++ b/kernel/fork.c > > > @@ -629,6 +629,8 @@ static void dup_mm_exe_file(struct mm_struct *mm, > > > struct mm_struct *oldmm) > > > pr_warn_once("exe_file_deny_write_access() failed in > > > %s\n", __func__); > > > } > > > > > > +void dup_probe(int, int); > > > + > > > #ifdef CONFIG_MMU > > > static __latent_entropy int dup_mmap(struct mm_struct *mm, > > > struct mm_struct *oldmm) > > > @@ -638,9 +640,11 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm, > > > unsigned long charge = 0; > > > LIST_HEAD(uf); > > > VMA_ITERATOR(vmi, mm, 0); > > > + bool only_user; > > > > > > if (mmap_write_lock_killable(oldmm)) > > > return -EINTR; > > > + only_user = atomic_read(&oldmm->mm_users) == 1; > > > flush_cache_dup_mm(oldmm); > > > uprobe_dup_mmap(oldmm, mm); > > > /* > > > @@ -664,8 +668,11 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm, > > > mt_clear_in_rcu(vmi.mas.tree); > > > for_each_vma(vmi, mpnt) { > > > struct file *file; > > > + bool locked; > > > + > > > + locked = vma_start_write(mpnt); > > > + dup_probe(only_user ? 1 :0, locked ? 1 : 0); > > > > > > - vma_start_write(mpnt); > > > if (mpnt->vm_flags & VM_DONTCOPY) { > > > retval = vma_iter_clear_gfp(&vmi, mpnt->vm_start, > > > mpnt->vm_end, GFP_KERNEL); > > > > > > -- > > > Mateusz Guzik <mjguzik gmail.com> > > > > -- > Mateusz Guzik <mjguzik gmail.com>
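To make the suggested placement concrete: the check would sit at the tail of lock_vma_under_rcu(), right before the function hands back the locked VMA. A minimal sketch, assuming the function ends by returning vma (the surrounding body is elided and the exact spot may differ between kernel versions):

#ifdef CONFIG_DEBUG_VM
	/* Anyone faulting on an mm that is not their own must hold its mmap lock. */
	if (current->mm != vma->vm_mm)
		mmap_assert_locked(vma->vm_mm);
#endif
	return vma;

Since mmap_assert_locked() already compiles down to almost nothing without lockdep/CONFIG_DEBUG_VM, the explicit #ifdef mostly documents the intent, matching the assumption in the pseudocode above.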
top post warning It just hit me that userfaultfd() may be a problem. Creating it only bumps mm_count, while mm_users gets transiently modified during various ops. So in particular you may end up pounding off the userfaultfd instance to another process while being single-threaded, then mm_users == 1 and userfaultfd may be trying to do something. I have no idea if this one is guaranteed to take the lock. However, the good news is that mm_count tends to be 1. If both mm_count and mm_users are 1, then there is no usefaultfd in use and nobody to add it either. State of mm_count verified with: bpftrace -e 'kprobe:copy_process { @[curtask->mm->mm_count.counter] = count(); }' On Sat, Mar 29, 2025 at 2:35 AM Suren Baghdasaryan <surenb@google.com> wrote: > > On Fri, Mar 28, 2025 at 6:16 PM Mateusz Guzik <mjguzik@gmail.com> wrote: > > > > On Sat, Mar 29, 2025 at 1:57 AM Suren Baghdasaryan <surenb@google.com> wrote: > > > On Wed, Mar 26, 2025 at 10:46 PM Mateusz Guzik <mjguzik@gmail.com> wrote: > > > > Here is the original: > > > > https://lore.kernel.org/all/20230804214620.btgwhsszsd7rh6nf@f/ > > > > > > > > the thread is weirdly long and I recommend not opening it without a > > > > good reason, I link it for reference if needed. > > > > > > I had to re-read it to remember what it was all about :) To bring > > > others up-to-speed, the suggested change would look something like > > > this: > > > > > > @@ -638,9 +640,11 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm, > > > unsigned long charge = 0; > > > LIST_HEAD(uf); > > > VMA_ITERATOR(vmi, mm, 0); > > > + bool only_user; > > > > > > if (mmap_write_lock_killable(oldmm)) > > > return -EINTR; > > > + only_user = atomic_read(&oldmm->mm_users) == 1; > > > flush_cache_dup_mm(oldmm); > > > uprobe_dup_mmap(oldmm, mm); > > > /* > > > @@ -664,8 +668,11 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm, > > > mt_clear_in_rcu(vmi.mas.tree); > > > for_each_vma(vmi, mpnt) { > > > struct file *file; > > > > > > - vma_start_write(mpnt); > > > + if (!only_user) > > > + vma_start_write(mpnt); > > > if (mpnt->vm_flags & VM_DONTCOPY) { > > > retval = vma_iter_clear_gfp(&vmi, mpnt->vm_start, > > > mpnt->vm_end, GFP_KERNEL); > > > > > > > > > > > At the time there were woes concerning stability of the new locking > > > > scheme, resulting in CC list being rather excessive. As such I'm > > > > starting a new thread with a modest list, not sure who to add though. > > > > > > > > So dup_mmap() holds the caller mmap_sem for writing and calls > > > > vma_start_write() to protect against fault handling in another threads > > > > using the same mm. > > > > > > > > If this is the only thread with this ->mm in place, there are no > > > > sibling threads to worry about and this can be checked with mm_users > > > > == 1. > > > > > > > > AFAICS all remote accesses require the mmap_sem to also be held, which > > > > provides exclusion against dup_mmap, meaning they don't pose a problem > > > > either. > > > > > > Yes, I believe the optimization you proposed would be safe. > > > > > > > > > > > The patch to merely skip locking is a few liner and I would officially > > > > submit it myself, but for my taste an assert is needed in fault > > > > handling to runtime test the above invariant. > > > > > > Hmm. Adding an assert in the pagefault path that checks whether the > > > fault is triggered during dup_mmap() && mm_users==1 would not be > > > trivial. 
We would need to indicate that we expect no page faults while > > > in that section of dup_mmap() (maybe a flag inside mm_struct) and then > > > assert that this flag is not set inside the pagefault handling path. > > > I'm not sure it's worth the complexity... As was discussed in that > > > thread, the only other task that might fault a page would be external > > > and therefore would have to go through > > > access_remote_vm()->mmap_read_lock_killable() and since dup_mmap() > > > already holds mmap_write_lock the new user would have to wait. > > > > > > > So happens I really > > > > can't be bothered to figure out how to sort it out and was hoping you > > > > would step in. ;) Alternatively if you guys don't think the assert is > > > > warranted, that's your business. > > > > > > I don't think it's worth it but I'll CC Matthew and Lorenzo (you > > > already CC'ed Liam) to get their opinion. > > > > > > > > > > > As for whether this can save any locking -- yes: > > > > > > Yeah, I'm sure it will make a difference in performance. While forking > > > we are currently locking each VMA separately, so skipping that would > > > be nice. > > > Folks, WDYT? Do we need a separate assertion that pagefault can't > > > happen if mm_users==1 and we are holding mmap_write_lock > > > (access_remote_vm() will block)? > > > > pseudocode-wise I was thinking in the lines of the following in the fault path: > > > > if (current->mm != vma->vm_mm) > > mmap_assert_locked(vma->vm_mm); > > > > with the assumption that mmap_assert_locked expands to nothing without debug > > I see. IOW, if someone external is faulting a page then it has to be > holding at least mmap_read_lock. So, it's a more general check but > seems reasonable. I think adding it at the end of lock_vma_under_rcu() > under the "#ifdef CONFIG_DEBUG_VM" condition would be enough. > > > > > > > > > > > > > > I added a probe (below for reference) with two args: whether we are > > > > single-threaded and vma_start_write() returning whether it took the > > > > down/up cycle and ran make -j 20 in the kernel dir. > > > > > > > > The lock was taken for every single vma (377913 in total), while all > > > > forking processes were single-threaded. Or to put it differently all > > > > of these were skippable. > > > > > > > > the probe (total hack): > > > > bpftrace -e 'kprobe:dup_probe { @[arg0, arg1] = count(); }' > > > > > > > > probe diff: > > > > diff --git a/fs/namei.c b/fs/namei.c > > > > index ecb7b95c2ca3..d6cde76eda81 100644 > > > > --- a/fs/namei.c > > > > +++ b/fs/namei.c > > > > @@ -5459,3 +5459,7 @@ const struct inode_operations > > > > page_symlink_inode_operations = { > > > > .get_link = page_get_link, > > > > }; > > > > EXPORT_SYMBOL(page_symlink_inode_operations); > > > > + > > > > +void dup_probe(int, int); > > > > +void dup_probe(int, int) { } > > > > +EXPORT_SYMBOL(dup_probe); > > > > diff --git a/include/linux/mm.h b/include/linux/mm.h > > > > index 1f80baddacc5..f7b1f0a02f2e 100644 > > > > --- a/include/linux/mm.h > > > > +++ b/include/linux/mm.h > > > > @@ -760,12 +760,12 @@ static bool __is_vma_write_locked(struct > > > > vm_area_struct *vma, unsigned int *mm_l > > > > * Exclude concurrent readers under the per-VMA lock until the currently > > > > * write-locked mmap_lock is dropped or downgraded. 
> > > > */ > > > > -static inline void vma_start_write(struct vm_area_struct *vma) > > > > +static inline bool vma_start_write(struct vm_area_struct *vma) > > > > { > > > > unsigned int mm_lock_seq; > > > > > > > > if (__is_vma_write_locked(vma, &mm_lock_seq)) > > > > - return; > > > > + return false; > > > > > > > > down_write(&vma->vm_lock->lock); > > > > /* > > > > @@ -776,6 +776,7 @@ static inline void vma_start_write(struct > > > > vm_area_struct *vma) > > > > */ > > > > WRITE_ONCE(vma->vm_lock_seq, mm_lock_seq); > > > > up_write(&vma->vm_lock->lock); > > > > + return true; > > > > } > > > > > > > > static inline void vma_assert_write_locked(struct vm_area_struct *vma) > > > > diff --git a/kernel/fork.c b/kernel/fork.c > > > > index 735405a9c5f3..0cc56255a339 100644 > > > > --- a/kernel/fork.c > > > > +++ b/kernel/fork.c > > > > @@ -629,6 +629,8 @@ static void dup_mm_exe_file(struct mm_struct *mm, > > > > struct mm_struct *oldmm) > > > > pr_warn_once("exe_file_deny_write_access() failed in > > > > %s\n", __func__); > > > > } > > > > > > > > +void dup_probe(int, int); > > > > + > > > > #ifdef CONFIG_MMU > > > > static __latent_entropy int dup_mmap(struct mm_struct *mm, > > > > struct mm_struct *oldmm) > > > > @@ -638,9 +640,11 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm, > > > > unsigned long charge = 0; > > > > LIST_HEAD(uf); > > > > VMA_ITERATOR(vmi, mm, 0); > > > > + bool only_user; > > > > > > > > if (mmap_write_lock_killable(oldmm)) > > > > return -EINTR; > > > > + only_user = atomic_read(&oldmm->mm_users) == 1; > > > > flush_cache_dup_mm(oldmm); > > > > uprobe_dup_mmap(oldmm, mm); > > > > /* > > > > @@ -664,8 +668,11 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm, > > > > mt_clear_in_rcu(vmi.mas.tree); > > > > for_each_vma(vmi, mpnt) { > > > > struct file *file; > > > > + bool locked; > > > > + > > > > + locked = vma_start_write(mpnt); > > > > + dup_probe(only_user ? 1 :0, locked ? 1 : 0); > > > > > > > > - vma_start_write(mpnt); > > > > if (mpnt->vm_flags & VM_DONTCOPY) { > > > > retval = vma_iter_clear_gfp(&vmi, mpnt->vm_start, > > > > mpnt->vm_end, GFP_KERNEL); > > > > > > > > -- > > > > Mateusz Guzik <mjguzik gmail.com> > > > > > > > > -- > > Mateusz Guzik <mjguzik gmail.com>
On Fri, Mar 28, 2025 at 6:51 PM Mateusz Guzik <mjguzik@gmail.com> wrote: > > top post warning > > It just hit me that userfaultfd() may be a problem. Creating it only > bumps mm_count, while mm_users gets transiently modified during > various ops. > > So in particular you may end up pounding off the userfaultfd instance > to another process while being single-threaded, then mm_users == 1 and > userfaultfd may be trying to do something. I have no idea if this one > is guaranteed to take the lock. Hmm, yeah. uffd is nasty. Even in the cases when it does mmget_not_zero(), there is still the possibility of a race. For example: dup_mmap only_user = atomic_read(&oldmm->mm_users) == 1; userfaultfd_move() mmget_not_zero() <-- inc mm_users if (!only_user) vma_start_write(mpnt); <-- gets skipped move_pages() uffd_move_lock() uffd_lock_vma() lock_vma_under_rcu() <-- succeeds move_pages_pte() copy_page_range() So, userfaultfd_move() will happily move pages while dup_mmap() is doing copy_page_range(). The copied range might look quite interesting... > > However, the good news is that mm_count tends to be 1. If both > mm_count and mm_users are 1, then there is no usefaultfd in use and > nobody to add it either. I'm not sure... IIUC new_userfaultfd() does not take mmap_lock while calling mmgrab(), therefore I think it can race with the code checking its value. > > State of mm_count verified with: bpftrace -e 'kprobe:copy_process { > @[curtask->mm->mm_count.counter] = count(); }' > > On Sat, Mar 29, 2025 at 2:35 AM Suren Baghdasaryan <surenb@google.com> wrote: > > > > On Fri, Mar 28, 2025 at 6:16 PM Mateusz Guzik <mjguzik@gmail.com> wrote: > > > > > > On Sat, Mar 29, 2025 at 1:57 AM Suren Baghdasaryan <surenb@google.com> wrote: > > > > On Wed, Mar 26, 2025 at 10:46 PM Mateusz Guzik <mjguzik@gmail.com> wrote: > > > > > Here is the original: > > > > > https://lore.kernel.org/all/20230804214620.btgwhsszsd7rh6nf@f/ > > > > > > > > > > the thread is weirdly long and I recommend not opening it without a > > > > > good reason, I link it for reference if needed. > > > > > > > > I had to re-read it to remember what it was all about :) To bring > > > > others up-to-speed, the suggested change would look something like > > > > this: > > > > > > > > @@ -638,9 +640,11 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm, > > > > unsigned long charge = 0; > > > > LIST_HEAD(uf); > > > > VMA_ITERATOR(vmi, mm, 0); > > > > + bool only_user; > > > > > > > > if (mmap_write_lock_killable(oldmm)) > > > > return -EINTR; > > > > + only_user = atomic_read(&oldmm->mm_users) == 1; > > > > flush_cache_dup_mm(oldmm); > > > > uprobe_dup_mmap(oldmm, mm); > > > > /* > > > > @@ -664,8 +668,11 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm, > > > > mt_clear_in_rcu(vmi.mas.tree); > > > > for_each_vma(vmi, mpnt) { > > > > struct file *file; > > > > > > > > - vma_start_write(mpnt); > > > > + if (!only_user) > > > > + vma_start_write(mpnt); > > > > if (mpnt->vm_flags & VM_DONTCOPY) { > > > > retval = vma_iter_clear_gfp(&vmi, mpnt->vm_start, > > > > mpnt->vm_end, GFP_KERNEL); > > > > > > > > > > > > > > At the time there were woes concerning stability of the new locking > > > > > scheme, resulting in CC list being rather excessive. As such I'm > > > > > starting a new thread with a modest list, not sure who to add though. > > > > > > > > > > So dup_mmap() holds the caller mmap_sem for writing and calls > > > > > vma_start_write() to protect against fault handling in another threads > > > > > using the same mm. 
> > > > > > > > > > If this is the only thread with this ->mm in place, there are no > > > > > sibling threads to worry about and this can be checked with mm_users > > > > > == 1. > > > > > > > > > > AFAICS all remote accesses require the mmap_sem to also be held, which > > > > > provides exclusion against dup_mmap, meaning they don't pose a problem > > > > > either. > > > > > > > > Yes, I believe the optimization you proposed would be safe. > > > > > > > > > > > > > > The patch to merely skip locking is a few liner and I would officially > > > > > submit it myself, but for my taste an assert is needed in fault > > > > > handling to runtime test the above invariant. > > > > > > > > Hmm. Adding an assert in the pagefault path that checks whether the > > > > fault is triggered during dup_mmap() && mm_users==1 would not be > > > > trivial. We would need to indicate that we expect no page faults while > > > > in that section of dup_mmap() (maybe a flag inside mm_struct) and then > > > > assert that this flag is not set inside the pagefault handling path. > > > > I'm not sure it's worth the complexity... As was discussed in that > > > > thread, the only other task that might fault a page would be external > > > > and therefore would have to go through > > > > access_remote_vm()->mmap_read_lock_killable() and since dup_mmap() > > > > already holds mmap_write_lock the new user would have to wait. > > > > > > > > > So happens I really > > > > > can't be bothered to figure out how to sort it out and was hoping you > > > > > would step in. ;) Alternatively if you guys don't think the assert is > > > > > warranted, that's your business. > > > > > > > > I don't think it's worth it but I'll CC Matthew and Lorenzo (you > > > > already CC'ed Liam) to get their opinion. > > > > > > > > > > > > > > As for whether this can save any locking -- yes: > > > > > > > > Yeah, I'm sure it will make a difference in performance. While forking > > > > we are currently locking each VMA separately, so skipping that would > > > > be nice. > > > > Folks, WDYT? Do we need a separate assertion that pagefault can't > > > > happen if mm_users==1 and we are holding mmap_write_lock > > > > (access_remote_vm() will block)? > > > > > > pseudocode-wise I was thinking in the lines of the following in the fault path: > > > > > > if (current->mm != vma->vm_mm) > > > mmap_assert_locked(vma->vm_mm); > > > > > > with the assumption that mmap_assert_locked expands to nothing without debug > > > > I see. IOW, if someone external is faulting a page then it has to be > > holding at least mmap_read_lock. So, it's a more general check but > > seems reasonable. I think adding it at the end of lock_vma_under_rcu() > > under the "#ifdef CONFIG_DEBUG_VM" condition would be enough. > > > > > > > > > > > > > > > > > > > I added a probe (below for reference) with two args: whether we are > > > > > single-threaded and vma_start_write() returning whether it took the > > > > > down/up cycle and ran make -j 20 in the kernel dir. > > > > > > > > > > The lock was taken for every single vma (377913 in total), while all > > > > > forking processes were single-threaded. Or to put it differently all > > > > > of these were skippable. 
> > > > > > > > > > the probe (total hack): > > > > > bpftrace -e 'kprobe:dup_probe { @[arg0, arg1] = count(); }' > > > > > > > > > > probe diff: > > > > > diff --git a/fs/namei.c b/fs/namei.c > > > > > index ecb7b95c2ca3..d6cde76eda81 100644 > > > > > --- a/fs/namei.c > > > > > +++ b/fs/namei.c > > > > > @@ -5459,3 +5459,7 @@ const struct inode_operations > > > > > page_symlink_inode_operations = { > > > > > .get_link = page_get_link, > > > > > }; > > > > > EXPORT_SYMBOL(page_symlink_inode_operations); > > > > > + > > > > > +void dup_probe(int, int); > > > > > +void dup_probe(int, int) { } > > > > > +EXPORT_SYMBOL(dup_probe); > > > > > diff --git a/include/linux/mm.h b/include/linux/mm.h > > > > > index 1f80baddacc5..f7b1f0a02f2e 100644 > > > > > --- a/include/linux/mm.h > > > > > +++ b/include/linux/mm.h > > > > > @@ -760,12 +760,12 @@ static bool __is_vma_write_locked(struct > > > > > vm_area_struct *vma, unsigned int *mm_l > > > > > * Exclude concurrent readers under the per-VMA lock until the currently > > > > > * write-locked mmap_lock is dropped or downgraded. > > > > > */ > > > > > -static inline void vma_start_write(struct vm_area_struct *vma) > > > > > +static inline bool vma_start_write(struct vm_area_struct *vma) > > > > > { > > > > > unsigned int mm_lock_seq; > > > > > > > > > > if (__is_vma_write_locked(vma, &mm_lock_seq)) > > > > > - return; > > > > > + return false; > > > > > > > > > > down_write(&vma->vm_lock->lock); > > > > > /* > > > > > @@ -776,6 +776,7 @@ static inline void vma_start_write(struct > > > > > vm_area_struct *vma) > > > > > */ > > > > > WRITE_ONCE(vma->vm_lock_seq, mm_lock_seq); > > > > > up_write(&vma->vm_lock->lock); > > > > > + return true; > > > > > } > > > > > > > > > > static inline void vma_assert_write_locked(struct vm_area_struct *vma) > > > > > diff --git a/kernel/fork.c b/kernel/fork.c > > > > > index 735405a9c5f3..0cc56255a339 100644 > > > > > --- a/kernel/fork.c > > > > > +++ b/kernel/fork.c > > > > > @@ -629,6 +629,8 @@ static void dup_mm_exe_file(struct mm_struct *mm, > > > > > struct mm_struct *oldmm) > > > > > pr_warn_once("exe_file_deny_write_access() failed in > > > > > %s\n", __func__); > > > > > } > > > > > > > > > > +void dup_probe(int, int); > > > > > + > > > > > #ifdef CONFIG_MMU > > > > > static __latent_entropy int dup_mmap(struct mm_struct *mm, > > > > > struct mm_struct *oldmm) > > > > > @@ -638,9 +640,11 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm, > > > > > unsigned long charge = 0; > > > > > LIST_HEAD(uf); > > > > > VMA_ITERATOR(vmi, mm, 0); > > > > > + bool only_user; > > > > > > > > > > if (mmap_write_lock_killable(oldmm)) > > > > > return -EINTR; > > > > > + only_user = atomic_read(&oldmm->mm_users) == 1; > > > > > flush_cache_dup_mm(oldmm); > > > > > uprobe_dup_mmap(oldmm, mm); > > > > > /* > > > > > @@ -664,8 +668,11 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm, > > > > > mt_clear_in_rcu(vmi.mas.tree); > > > > > for_each_vma(vmi, mpnt) { > > > > > struct file *file; > > > > > + bool locked; > > > > > + > > > > > + locked = vma_start_write(mpnt); > > > > > + dup_probe(only_user ? 1 :0, locked ? 1 : 0); > > > > > > > > > > - vma_start_write(mpnt); > > > > > if (mpnt->vm_flags & VM_DONTCOPY) { > > > > > retval = vma_iter_clear_gfp(&vmi, mpnt->vm_start, > > > > > mpnt->vm_end, GFP_KERNEL); > > > > > > > > > > -- > > > > > Mateusz Guzik <mjguzik gmail.com> > > > > > > > > > > > > -- > > > Mateusz Guzik <mjguzik gmail.com> > > > > -- > Mateusz Guzik <mjguzik gmail.com>
On Sun, Mar 30, 2025 at 12:23 PM Suren Baghdasaryan <surenb@google.com> wrote: > > On Fri, Mar 28, 2025 at 6:51 PM Mateusz Guzik <mjguzik@gmail.com> wrote: > > > > top post warning > > > > It just hit me that userfaultfd() may be a problem. Creating it only > > bumps mm_count, while mm_users gets transiently modified during > > various ops. > > > > So in particular you may end up pounding off the userfaultfd instance > > to another process while being single-threaded, then mm_users == 1 and > > userfaultfd may be trying to do something. I have no idea if this one > > is guaranteed to take the lock. > > Hmm, yeah. uffd is nasty. Even in the cases when it does > mmget_not_zero(), there is still the possibility of a race. For > example: > > dup_mmap > only_user = atomic_read(&oldmm->mm_users) == 1; > > userfaultfd_move() > mmget_not_zero() <-- > inc mm_users > > if (!only_user) > vma_start_write(mpnt); <-- gets skipped > > move_pages() > uffd_move_lock() > uffd_lock_vma() > > lock_vma_under_rcu() <-- succeeds > move_pages_pte() > copy_page_range() Sorry about formatting. This might look a bit better: Task 1 Task 2 dup_mmap only_user = atomic_read(&oldmm->mm_users) == 1; userfaultfd_move() mmget_not_zero() if (!only_user) vma_start_write(mpnt); <-- gets skipped move_pages() uffd_move_lock() uffd_lock_vma() lock_vma_under_rcu() move_pages_pte() copy_page_range() > > So, userfaultfd_move() will happily move pages while dup_mmap() is > doing copy_page_range(). The copied range might look quite > interesting... > > > > > However, the good news is that mm_count tends to be 1. If both > > mm_count and mm_users are 1, then there is no usefaultfd in use and > > nobody to add it either. > > I'm not sure... IIUC new_userfaultfd() does not take mmap_lock while > calling mmgrab(), therefore I think it can race with the code checking > its value. > > > > > State of mm_count verified with: bpftrace -e 'kprobe:copy_process { > > @[curtask->mm->mm_count.counter] = count(); }' > > > > On Sat, Mar 29, 2025 at 2:35 AM Suren Baghdasaryan <surenb@google.com> wrote: > > > > > > On Fri, Mar 28, 2025 at 6:16 PM Mateusz Guzik <mjguzik@gmail.com> wrote: > > > > > > > > On Sat, Mar 29, 2025 at 1:57 AM Suren Baghdasaryan <surenb@google.com> wrote: > > > > > On Wed, Mar 26, 2025 at 10:46 PM Mateusz Guzik <mjguzik@gmail.com> wrote: > > > > > > Here is the original: > > > > > > https://lore.kernel.org/all/20230804214620.btgwhsszsd7rh6nf@f/ > > > > > > > > > > > > the thread is weirdly long and I recommend not opening it without a > > > > > > good reason, I link it for reference if needed. 
> > > > > > > > > > I had to re-read it to remember what it was all about :) To bring > > > > > others up-to-speed, the suggested change would look something like > > > > > this: > > > > > > > > > > @@ -638,9 +640,11 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm, > > > > > unsigned long charge = 0; > > > > > LIST_HEAD(uf); > > > > > VMA_ITERATOR(vmi, mm, 0); > > > > > + bool only_user; > > > > > > > > > > if (mmap_write_lock_killable(oldmm)) > > > > > return -EINTR; > > > > > + only_user = atomic_read(&oldmm->mm_users) == 1; > > > > > flush_cache_dup_mm(oldmm); > > > > > uprobe_dup_mmap(oldmm, mm); > > > > > /* > > > > > @@ -664,8 +668,11 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm, > > > > > mt_clear_in_rcu(vmi.mas.tree); > > > > > for_each_vma(vmi, mpnt) { > > > > > struct file *file; > > > > > > > > > > - vma_start_write(mpnt); > > > > > + if (!only_user) > > > > > + vma_start_write(mpnt); > > > > > if (mpnt->vm_flags & VM_DONTCOPY) { > > > > > retval = vma_iter_clear_gfp(&vmi, mpnt->vm_start, > > > > > mpnt->vm_end, GFP_KERNEL); > > > > > > > > > > > > > > > > > At the time there were woes concerning stability of the new locking > > > > > > scheme, resulting in CC list being rather excessive. As such I'm > > > > > > starting a new thread with a modest list, not sure who to add though. > > > > > > > > > > > > So dup_mmap() holds the caller mmap_sem for writing and calls > > > > > > vma_start_write() to protect against fault handling in another threads > > > > > > using the same mm. > > > > > > > > > > > > If this is the only thread with this ->mm in place, there are no > > > > > > sibling threads to worry about and this can be checked with mm_users > > > > > > == 1. > > > > > > > > > > > > AFAICS all remote accesses require the mmap_sem to also be held, which > > > > > > provides exclusion against dup_mmap, meaning they don't pose a problem > > > > > > either. > > > > > > > > > > Yes, I believe the optimization you proposed would be safe. > > > > > > > > > > > > > > > > > The patch to merely skip locking is a few liner and I would officially > > > > > > submit it myself, but for my taste an assert is needed in fault > > > > > > handling to runtime test the above invariant. > > > > > > > > > > Hmm. Adding an assert in the pagefault path that checks whether the > > > > > fault is triggered during dup_mmap() && mm_users==1 would not be > > > > > trivial. We would need to indicate that we expect no page faults while > > > > > in that section of dup_mmap() (maybe a flag inside mm_struct) and then > > > > > assert that this flag is not set inside the pagefault handling path. > > > > > I'm not sure it's worth the complexity... As was discussed in that > > > > > thread, the only other task that might fault a page would be external > > > > > and therefore would have to go through > > > > > access_remote_vm()->mmap_read_lock_killable() and since dup_mmap() > > > > > already holds mmap_write_lock the new user would have to wait. > > > > > > > > > > > So happens I really > > > > > > can't be bothered to figure out how to sort it out and was hoping you > > > > > > would step in. ;) Alternatively if you guys don't think the assert is > > > > > > warranted, that's your business. > > > > > > > > > > I don't think it's worth it but I'll CC Matthew and Lorenzo (you > > > > > already CC'ed Liam) to get their opinion. > > > > > > > > > > > > > > > > > As for whether this can save any locking -- yes: > > > > > > > > > > Yeah, I'm sure it will make a difference in performance. 
While forking > > > > > we are currently locking each VMA separately, so skipping that would > > > > > be nice. > > > > > Folks, WDYT? Do we need a separate assertion that pagefault can't > > > > > happen if mm_users==1 and we are holding mmap_write_lock > > > > > (access_remote_vm() will block)? > > > > > > > > pseudocode-wise I was thinking in the lines of the following in the fault path: > > > > > > > > if (current->mm != vma->vm_mm) > > > > mmap_assert_locked(vma->vm_mm); > > > > > > > > with the assumption that mmap_assert_locked expands to nothing without debug > > > > > > I see. IOW, if someone external is faulting a page then it has to be > > > holding at least mmap_read_lock. So, it's a more general check but > > > seems reasonable. I think adding it at the end of lock_vma_under_rcu() > > > under the "#ifdef CONFIG_DEBUG_VM" condition would be enough. > > > > > > > > > > > > > > > > > > > > > > > > I added a probe (below for reference) with two args: whether we are > > > > > > single-threaded and vma_start_write() returning whether it took the > > > > > > down/up cycle and ran make -j 20 in the kernel dir. > > > > > > > > > > > > The lock was taken for every single vma (377913 in total), while all > > > > > > forking processes were single-threaded. Or to put it differently all > > > > > > of these were skippable. > > > > > > > > > > > > the probe (total hack): > > > > > > bpftrace -e 'kprobe:dup_probe { @[arg0, arg1] = count(); }' > > > > > > > > > > > > probe diff: > > > > > > diff --git a/fs/namei.c b/fs/namei.c > > > > > > index ecb7b95c2ca3..d6cde76eda81 100644 > > > > > > --- a/fs/namei.c > > > > > > +++ b/fs/namei.c > > > > > > @@ -5459,3 +5459,7 @@ const struct inode_operations > > > > > > page_symlink_inode_operations = { > > > > > > .get_link = page_get_link, > > > > > > }; > > > > > > EXPORT_SYMBOL(page_symlink_inode_operations); > > > > > > + > > > > > > +void dup_probe(int, int); > > > > > > +void dup_probe(int, int) { } > > > > > > +EXPORT_SYMBOL(dup_probe); > > > > > > diff --git a/include/linux/mm.h b/include/linux/mm.h > > > > > > index 1f80baddacc5..f7b1f0a02f2e 100644 > > > > > > --- a/include/linux/mm.h > > > > > > +++ b/include/linux/mm.h > > > > > > @@ -760,12 +760,12 @@ static bool __is_vma_write_locked(struct > > > > > > vm_area_struct *vma, unsigned int *mm_l > > > > > > * Exclude concurrent readers under the per-VMA lock until the currently > > > > > > * write-locked mmap_lock is dropped or downgraded. 
> > > > > > */ > > > > > > -static inline void vma_start_write(struct vm_area_struct *vma) > > > > > > +static inline bool vma_start_write(struct vm_area_struct *vma) > > > > > > { > > > > > > unsigned int mm_lock_seq; > > > > > > > > > > > > if (__is_vma_write_locked(vma, &mm_lock_seq)) > > > > > > - return; > > > > > > + return false; > > > > > > > > > > > > down_write(&vma->vm_lock->lock); > > > > > > /* > > > > > > @@ -776,6 +776,7 @@ static inline void vma_start_write(struct > > > > > > vm_area_struct *vma) > > > > > > */ > > > > > > WRITE_ONCE(vma->vm_lock_seq, mm_lock_seq); > > > > > > up_write(&vma->vm_lock->lock); > > > > > > + return true; > > > > > > } > > > > > > > > > > > > static inline void vma_assert_write_locked(struct vm_area_struct *vma) > > > > > > diff --git a/kernel/fork.c b/kernel/fork.c > > > > > > index 735405a9c5f3..0cc56255a339 100644 > > > > > > --- a/kernel/fork.c > > > > > > +++ b/kernel/fork.c > > > > > > @@ -629,6 +629,8 @@ static void dup_mm_exe_file(struct mm_struct *mm, > > > > > > struct mm_struct *oldmm) > > > > > > pr_warn_once("exe_file_deny_write_access() failed in > > > > > > %s\n", __func__); > > > > > > } > > > > > > > > > > > > +void dup_probe(int, int); > > > > > > + > > > > > > #ifdef CONFIG_MMU > > > > > > static __latent_entropy int dup_mmap(struct mm_struct *mm, > > > > > > struct mm_struct *oldmm) > > > > > > @@ -638,9 +640,11 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm, > > > > > > unsigned long charge = 0; > > > > > > LIST_HEAD(uf); > > > > > > VMA_ITERATOR(vmi, mm, 0); > > > > > > + bool only_user; > > > > > > > > > > > > if (mmap_write_lock_killable(oldmm)) > > > > > > return -EINTR; > > > > > > + only_user = atomic_read(&oldmm->mm_users) == 1; > > > > > > flush_cache_dup_mm(oldmm); > > > > > > uprobe_dup_mmap(oldmm, mm); > > > > > > /* > > > > > > @@ -664,8 +668,11 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm, > > > > > > mt_clear_in_rcu(vmi.mas.tree); > > > > > > for_each_vma(vmi, mpnt) { > > > > > > struct file *file; > > > > > > + bool locked; > > > > > > + > > > > > > + locked = vma_start_write(mpnt); > > > > > > + dup_probe(only_user ? 1 :0, locked ? 1 : 0); > > > > > > > > > > > > - vma_start_write(mpnt); > > > > > > if (mpnt->vm_flags & VM_DONTCOPY) { > > > > > > retval = vma_iter_clear_gfp(&vmi, mpnt->vm_start, > > > > > > mpnt->vm_end, GFP_KERNEL); > > > > > > > > > > > > -- > > > > > > Mateusz Guzik <mjguzik gmail.com> > > > > > > > > > > > > > > > > -- > > > > Mateusz Guzik <mjguzik gmail.com> > > > > > > > > -- > > Mateusz Guzik <mjguzik gmail.com>
On Sun, Mar 30, 2025 at 9:23 PM Suren Baghdasaryan <surenb@google.com> wrote:
> > However, the good news is that mm_count tends to be 1. If both
> > mm_count and mm_users are 1, then there is no usefaultfd in use and
> > nobody to add it either.
>
> I'm not sure... IIUC new_userfaultfd() does not take mmap_lock while
> calling mmgrab(), therefore I think it can race with the code checking
> its value.

It issues:

	ctx->mm = current->mm;
	...
	mmgrab(ctx->mm);

Thus I claim if mm_count is 1 *and* mm_users is 1 *and* we are in dup_mmap(), nobody has a userfaultfd for our mm and there is nobody to create it either and the optimization is saved.
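If that reasoning holds, the dup_mmap() check from earlier in the thread would grow a second condition. Hypothetical sketch only — the helper name is invented here, and whether sampling mm_count with a plain atomic_read() under mmap_write_lock is really race-free is exactly what is being debated:

	/*
	 * Hypothetical helper: true if this mm has no other users and no other
	 * references (such as a userfaultfd context) that could start operating
	 * on it while dup_mmap() runs with per-VMA locking skipped.
	 */
	static inline bool mm_is_exclusively_ours(struct mm_struct *mm)
	{
		return atomic_read(&mm->mm_users) == 1 &&
		       atomic_read(&mm->mm_count) == 1;
	}

	/* in dup_mmap(), instead of testing mm_users alone: */
	only_user = mm_is_exclusively_ours(oldmm);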