| Message ID | 20211116215715.645231-1-surenb@google.com (mailing list archive) |
|---|---|
| State | New |
| Series | [1/2] mm: protect free_pgtables with mmap_lock write lock in exit_mmap |
On Tue, Nov 16, 2021 at 1:57 PM Suren Baghdasaryan <surenb@google.com> wrote:
>
> The oom-reaper and process_mrelease system call should protect against
> races with exit_mmap which can destroy page tables while they
> walk the VMA tree. oom-reaper protects from that race by setting
> MMF_OOM_VICTIM and by relying on exit_mmap to set MMF_OOM_SKIP
> before taking and releasing mmap_write_lock. process_mrelease has
> to elevate mm->mm_users to prevent such a race. Both oom-reaper and
> process_mrelease hold mmap_read_lock when walking the VMA tree.
> The locking rules and mechanisms could be simpler if exit_mmap takes
> mmap_write_lock while executing destructive operations such as
> free_pgtables.
>
> Change exit_mmap to hold the mmap_write_lock when calling
> free_pgtables. Operations like unmap_vmas() and unlock_range() are not
> destructive and could run under mmap_read_lock, but for simplicity we
> take one mmap_write_lock during almost the entire operation. Note
> also that because oom-reaper checks the VM_LOCKED flag, unlock_range()
> should not be allowed to race with it.
>
> In most cases this lock should be uncontended. Previously, Kirill
> reported a ~4% regression caused by a similar change [1]. We reran the
> same test and although the individual results are quite noisy, the
> percentiles show a lower regression, with 1.6% being the worst case [2].
> The change allows oom-reaper and process_mrelease to execute safely
> under mmap_read_lock without worries that exit_mmap might destroy page
> tables from under them.
>
> [1] https://lore.kernel.org/all/20170725141723.ivukwhddk2voyhuc@node.shutemov.name/
> [2] https://lore.kernel.org/all/CAJuCfpGC9-c9P40x7oy=jy5SphMcd0o0G_6U1-+JAziGKG6dGA@mail.gmail.com/

Friendly nudge. Michal, Matthew, from our discussion in
https://lore.kernel.org/all/YXKhOKIIngIuJaYi@casper.infradead.org
I was under the impression this change would be interesting to you.
Any feedback?

> Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> ---
>  mm/mmap.c | 11 +++++++----
>  1 file changed, 7 insertions(+), 4 deletions(-)
>
> diff --git a/mm/mmap.c b/mm/mmap.c
> index bfb0ea164a90..69b3036c6dee 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -3142,25 +3142,27 @@ void exit_mmap(struct mm_struct *mm)
>  	 * to mmu_notifier_release(mm) ensures mmu notifier callbacks in
>  	 * __oom_reap_task_mm() will not block.
>  	 *
> -	 * This needs to be done before calling munlock_vma_pages_all(),
> +	 * This needs to be done before calling unlock_range(),
>  	 * which clears VM_LOCKED, otherwise the oom reaper cannot
>  	 * reliably test it.
>  	 */
>  	(void)__oom_reap_task_mm(mm);
>
>  	set_bit(MMF_OOM_SKIP, &mm->flags);
> -	mmap_write_lock(mm);
> -	mmap_write_unlock(mm);
>  }
>
> +	mmap_write_lock(mm);
>  	if (mm->locked_vm)
>  		unlock_range(mm->mmap, ULONG_MAX);
>
>  	arch_exit_mmap(mm);
>
>  	vma = mm->mmap;
> -	if (!vma)	/* Can happen if dup_mmap() received an OOM */
> +	if (!vma) {
> +		/* Can happen if dup_mmap() received an OOM */
> +		mmap_write_unlock(mm);
>  		return;
> +	}
>
>  	lru_add_drain();
>  	flush_cache_mm(mm);
> @@ -3170,6 +3172,7 @@ void exit_mmap(struct mm_struct *mm)
>  	unmap_vmas(&tlb, vma, 0, -1);
>  	free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);
>  	tlb_finish_mmu(&tlb);
> +	mmap_write_unlock(mm);
>
>  	/*
>  	 * Walk the list again, actually closing and freeing it,
> --
> 2.34.0.rc1.387.gb447b232ab-goog
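[Editor's note: a condensed sketch of the exit_mmap() locking shape this patch produces may help when reading the thread below. It paraphrases the diff above; tlb_gather_mmu_fullmm() and the final remove_vma() walk come from the surrounding kernel code of that era, not from the diff, and accounting and cond_resched() are omitted. Treat it as an illustration, not the actual kernel function.]

```c
/* Condensed sketch (not the real kernel code) of exit_mmap() after this patch. */
void exit_mmap_sketch(struct mm_struct *mm)
{
	struct mmu_gather tlb;
	struct vm_area_struct *vma;

	if (unlikely(mm_is_oom_victim(mm))) {
		/*
		 * Reap before unlock_range() clears VM_LOCKED so the
		 * oom reaper can still test the flag reliably.
		 */
		(void)__oom_reap_task_mm(mm);
		set_bit(MMF_OOM_SKIP, &mm->flags);
		/* The old mmap_write_lock()/unlock() "handshake" is gone. */
	}

	mmap_write_lock(mm);		/* held across all destructive work */
	if (mm->locked_vm)
		unlock_range(mm->mmap, ULONG_MAX);
	arch_exit_mmap(mm);

	vma = mm->mmap;
	if (!vma) {			/* dup_mmap() received an OOM */
		mmap_write_unlock(mm);
		return;
	}

	lru_add_drain();
	flush_cache_mm(mm);
	tlb_gather_mmu_fullmm(&tlb, mm);
	unmap_vmas(&tlb, vma, 0, -1);
	free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);
	tlb_finish_mmu(&tlb);
	mmap_write_unlock(mm);		/* page tables are gone; safe to drop */

	/* Final walk, closing and freeing each VMA (still lock-free in v1). */
	while (vma)
		vma = remove_vma(vma);
}
```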
On Tue, Nov 16, 2021 at 01:57:14PM -0800, Suren Baghdasaryan wrote:
> @@ -3170,6 +3172,7 @@ void exit_mmap(struct mm_struct *mm)
>  	unmap_vmas(&tlb, vma, 0, -1);
>  	free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);
>  	tlb_finish_mmu(&tlb);
> +	mmap_write_unlock(mm);
>
>  	/*
>  	 * Walk the list again, actually closing and freeing it,

Is there a reason to unlock here instead of after the remove_vma loop?
We'll need the mmap sem held during that loop when VMAs are stored in
the maple tree.
On Tue, Nov 23, 2021 at 5:19 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Tue, Nov 16, 2021 at 01:57:14PM -0800, Suren Baghdasaryan wrote:
> > @@ -3170,6 +3172,7 @@ void exit_mmap(struct mm_struct *mm)
> >  	unmap_vmas(&tlb, vma, 0, -1);
> >  	free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);
> >  	tlb_finish_mmu(&tlb);
> > +	mmap_write_unlock(mm);
> >
> >  	/*
> >  	 * Walk the list again, actually closing and freeing it,
>
> Is there a reason to unlock here instead of after the remove_vma loop?
> We'll need the mmap sem held during that loop when VMAs are stored in
> the maple tree.

I didn't realize remove_vma() would need to be protected as well. I
think I can move mmap_write_unlock down to cover the last walk too
with no impact.

Does anyone know if there was any specific reason to perform that last
walk with no locks held (as the comment states)? I can trace that
comment back to the Linux-2.6.12-rc2 merge with no earlier history, so
I'm not sure whether it's critical not to hold any locks at this
point. It seems to me that holding mmap_write_lock there is fine, but
maybe I'm missing something?
On Tue 23-11-21 09:56:41, Suren Baghdasaryan wrote:
> On Tue, Nov 23, 2021 at 5:19 AM Matthew Wilcox <willy@infradead.org> wrote:
> >
> > On Tue, Nov 16, 2021 at 01:57:14PM -0800, Suren Baghdasaryan wrote:
> > > @@ -3170,6 +3172,7 @@ void exit_mmap(struct mm_struct *mm)
> > >  	unmap_vmas(&tlb, vma, 0, -1);
> > >  	free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);
> > >  	tlb_finish_mmu(&tlb);
> > > +	mmap_write_unlock(mm);
> > >
> > >  	/*
> > >  	 * Walk the list again, actually closing and freeing it,
> >
> > Is there a reason to unlock here instead of after the remove_vma loop?
> > We'll need the mmap sem held during that loop when VMAs are stored in
> > the maple tree.
>
> I didn't realize remove_vma() would need to be protected as well. I
> think I can move mmap_write_unlock down to cover the last walk too
> with no impact.
> Does anyone know if there was any specific reason to perform that last
> walk with no locks held (as the comment states)? I can trace that
> comment back to the Linux-2.6.12-rc2 merge with no earlier history, so
> I'm not sure whether it's critical not to hold any locks at this
> point. It seems to me that holding mmap_write_lock there is fine, but
> maybe I'm missing something?

I suspect the primary reason was that neither fput (and callbacks
invoked from it) nor vm_close would need to be very careful about
interacting with mm locks. fput is async these days, so it shouldn't
be problematic. vm_ops->close doesn't have any real contract
definition AFAIK, but taking mmap_sem from those would be really
surprising. They should mostly be destructing internal vma state, and
that shouldn't really require address space protection.
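[Editor's note: Michal's point about the (lack of a) vm_ops->close contract is easier to see with an example. The sketch below is a hypothetical driver callback invented for this note, not taken from any real driver: it tears down only per-VMA driver state and never touches mmap_lock, which is why remove_vma() can safely invoke it while exit_mmap() holds mmap_write_lock.]

```c
#include <linux/mm.h>
#include <linux/slab.h>

/* Invented per-VMA driver state, for illustration only. */
struct example_vma_state {
	void *hw_context;
};

static void example_vma_close(struct vm_area_struct *vma)
{
	struct example_vma_state *state = vma->vm_private_data;

	/*
	 * Only internal VMA state is destroyed here; mmap_lock is not
	 * taken, so running this under mmap_write_lock cannot deadlock.
	 */
	kfree(state);
	vma->vm_private_data = NULL;
}

static const struct vm_operations_struct example_vm_ops = {
	.close = example_vma_close,
};
```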
On Wed, Nov 24, 2021 at 4:20 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Tue 23-11-21 09:56:41, Suren Baghdasaryan wrote:
> > On Tue, Nov 23, 2021 at 5:19 AM Matthew Wilcox <willy@infradead.org> wrote:
> > >
> > > On Tue, Nov 16, 2021 at 01:57:14PM -0800, Suren Baghdasaryan wrote:
> > > > @@ -3170,6 +3172,7 @@ void exit_mmap(struct mm_struct *mm)
> > > >  	unmap_vmas(&tlb, vma, 0, -1);
> > > >  	free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);
> > > >  	tlb_finish_mmu(&tlb);
> > > > +	mmap_write_unlock(mm);
> > > >
> > > >  	/*
> > > >  	 * Walk the list again, actually closing and freeing it,
> > >
> > > Is there a reason to unlock here instead of after the remove_vma loop?
> > > We'll need the mmap sem held during that loop when VMAs are stored in
> > > the maple tree.
> >
> > I didn't realize remove_vma() would need to be protected as well. I
> > think I can move mmap_write_unlock down to cover the last walk too
> > with no impact.
> > Does anyone know if there was any specific reason to perform that last
> > walk with no locks held (as the comment states)? I can trace that
> > comment back to the Linux-2.6.12-rc2 merge with no earlier history, so
> > I'm not sure whether it's critical not to hold any locks at this
> > point. It seems to me that holding mmap_write_lock there is fine, but
> > maybe I'm missing something?
>
> I suspect the primary reason was that neither fput (and callbacks
> invoked from it) nor vm_close would need to be very careful about
> interacting with mm locks. fput is async these days, so it shouldn't
> be problematic. vm_ops->close doesn't have any real contract
> definition AFAIK, but taking mmap_sem from those would be really
> surprising. They should mostly be destructing internal vma state, and
> that shouldn't really require address space protection.

Thanks for the clarification, Michal. I'll post an updated patch with
the remove_vma() loop executed under mmap_write_lock protection.

> --
> Michal Hocko
> SUSE Labs
On Wed, Nov 24, 2021 at 7:25 AM Suren Baghdasaryan <surenb@google.com> wrote:
>
> On Wed, Nov 24, 2021 at 4:20 AM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Tue 23-11-21 09:56:41, Suren Baghdasaryan wrote:
> > > On Tue, Nov 23, 2021 at 5:19 AM Matthew Wilcox <willy@infradead.org> wrote:
> > > >
> > > > On Tue, Nov 16, 2021 at 01:57:14PM -0800, Suren Baghdasaryan wrote:
> > > > > @@ -3170,6 +3172,7 @@ void exit_mmap(struct mm_struct *mm)
> > > > >  	unmap_vmas(&tlb, vma, 0, -1);
> > > > >  	free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);
> > > > >  	tlb_finish_mmu(&tlb);
> > > > > +	mmap_write_unlock(mm);
> > > > >
> > > > >  	/*
> > > > >  	 * Walk the list again, actually closing and freeing it,
> > > >
> > > > Is there a reason to unlock here instead of after the remove_vma loop?
> > > > We'll need the mmap sem held during that loop when VMAs are stored in
> > > > the maple tree.
> > >
> > > I didn't realize remove_vma() would need to be protected as well. I
> > > think I can move mmap_write_unlock down to cover the last walk too
> > > with no impact.
> > > Does anyone know if there was any specific reason to perform that last
> > > walk with no locks held (as the comment states)? I can trace that
> > > comment back to the Linux-2.6.12-rc2 merge with no earlier history, so
> > > I'm not sure whether it's critical not to hold any locks at this
> > > point. It seems to me that holding mmap_write_lock there is fine, but
> > > maybe I'm missing something?
> >
> > I suspect the primary reason was that neither fput (and callbacks
> > invoked from it) nor vm_close would need to be very careful about
> > interacting with mm locks. fput is async these days, so it shouldn't
> > be problematic. vm_ops->close doesn't have any real contract
> > definition AFAIK, but taking mmap_sem from those would be really
> > surprising. They should mostly be destructing internal vma state, and
> > that shouldn't really require address space protection.
>
> Thanks for the clarification, Michal. I'll post an updated patch with
> the remove_vma() loop executed under mmap_write_lock protection.

v2 is posted at
https://lore.kernel.org/all/20211124235906.14437-1-surenb@google.com/

> > --
> > Michal Hocko
> > SUSE Labs
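[Editor's note: moving the unlock past the final walk, as agreed above, would give the tail of exit_mmap() roughly the following shape. This is a sketch reconstructed from the discussion, with earlier declarations (tlb, vma, nr_accounted) omitted; it is not the v2 patch itself, so see the linked posting for the authoritative version.]

```c
	unmap_vmas(&tlb, vma, 0, -1);
	free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);
	tlb_finish_mmu(&tlb);

	/*
	 * Walk the list again, actually closing and freeing it, now under
	 * mmap_write_lock so that remove_vma() is protected as well,
	 * which the maple-tree conversion will depend on.
	 */
	while (vma) {
		if (vma->vm_flags & VM_ACCOUNT)
			nr_accounted += vma_pages(vma);
		vma = remove_vma(vma);
		cond_resched();
	}
	mmap_write_unlock(mm);	/* dropped only after the last VMA is freed */
	vm_unacct_memory(nr_accounted);
```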
diff --git a/mm/mmap.c b/mm/mmap.c
index bfb0ea164a90..69b3036c6dee 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -3142,25 +3142,27 @@ void exit_mmap(struct mm_struct *mm)
 	 * to mmu_notifier_release(mm) ensures mmu notifier callbacks in
 	 * __oom_reap_task_mm() will not block.
 	 *
-	 * This needs to be done before calling munlock_vma_pages_all(),
+	 * This needs to be done before calling unlock_range(),
 	 * which clears VM_LOCKED, otherwise the oom reaper cannot
 	 * reliably test it.
 	 */
 	(void)__oom_reap_task_mm(mm);

 	set_bit(MMF_OOM_SKIP, &mm->flags);
-	mmap_write_lock(mm);
-	mmap_write_unlock(mm);
 }

+	mmap_write_lock(mm);
 	if (mm->locked_vm)
 		unlock_range(mm->mmap, ULONG_MAX);

 	arch_exit_mmap(mm);

 	vma = mm->mmap;
-	if (!vma)	/* Can happen if dup_mmap() received an OOM */
+	if (!vma) {
+		/* Can happen if dup_mmap() received an OOM */
+		mmap_write_unlock(mm);
 		return;
+	}

 	lru_add_drain();
 	flush_cache_mm(mm);
@@ -3170,6 +3172,7 @@ void exit_mmap(struct mm_struct *mm)
 	unmap_vmas(&tlb, vma, 0, -1);
 	free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);
 	tlb_finish_mmu(&tlb);
+	mmap_write_unlock(mm);

 	/*
 	 * Walk the list again, actually closing and freeing it,
The oom-reaper and process_mrelease system call should protect against
races with exit_mmap which can destroy page tables while they walk the
VMA tree. oom-reaper protects from that race by setting MMF_OOM_VICTIM
and by relying on exit_mmap to set MMF_OOM_SKIP before taking and
releasing mmap_write_lock. process_mrelease has to elevate
mm->mm_users to prevent such a race. Both oom-reaper and
process_mrelease hold mmap_read_lock when walking the VMA tree. The
locking rules and mechanisms could be simpler if exit_mmap takes
mmap_write_lock while executing destructive operations such as
free_pgtables.

Change exit_mmap to hold the mmap_write_lock when calling
free_pgtables. Operations like unmap_vmas() and unlock_range() are not
destructive and could run under mmap_read_lock, but for simplicity we
take one mmap_write_lock during almost the entire operation. Note also
that because oom-reaper checks the VM_LOCKED flag, unlock_range()
should not be allowed to race with it.

In most cases this lock should be uncontended. Previously, Kirill
reported a ~4% regression caused by a similar change [1]. We reran the
same test and although the individual results are quite noisy, the
percentiles show a lower regression, with 1.6% being the worst case
[2]. The change allows oom-reaper and process_mrelease to execute
safely under mmap_read_lock without worries that exit_mmap might
destroy page tables from under them.

[1] https://lore.kernel.org/all/20170725141723.ivukwhddk2voyhuc@node.shutemov.name/
[2] https://lore.kernel.org/all/CAJuCfpGC9-c9P40x7oy=jy5SphMcd0o0G_6U1-+JAziGKG6dGA@mail.gmail.com/

Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
 mm/mmap.c | 11 +++++++----
 1 file changed, 7 insertions(+), 4 deletions(-)