Message ID | 20220215201922.1908156-1-surenb@google.com (mailing list archive) |
---|---|
State | New |
Series | [1/1] mm: fix use-after-free bug when mm->mmap is reused after being freed |
On Tue, 2022-02-15 at 12:19 -0800, Suren Baghdasaryan wrote: > After exit_mmap frees all vmas in the mm, mm->mmap needs to be reset, > otherwise it points to a vma that was freed and when reused leads to > a use-after-free bug. > > Reported-by: syzbot+2ccf63a4bd07cf39cab0@syzkaller.appspotmail.com > Suggested-by: Michal Hocko <mhocko@suse.com> > Signed-off-by: Suren Baghdasaryan <surenb@google.com> > Reviewed-by: Rik van Riel <riel@surriel.com>
On Tue, 15 Feb 2022 12:19:22 -0800 Suren Baghdasaryan <surenb@google.com> wrote: > After exit_mmap frees all vmas in the mm, mm->mmap needs to be reset, > otherwise it points to a vma that was freed and when reused leads to > a use-after-free bug. > > ... > > --- a/mm/mmap.c > +++ b/mm/mmap.c > @@ -3186,6 +3186,7 @@ void exit_mmap(struct mm_struct *mm) > vma = remove_vma(vma); > cond_resched(); > } > + mm->mmap = NULL; > mmap_write_unlock(mm); > vm_unacct_memory(nr_accounted); > } https://lore.kernel.org/all/00000000000072ef2c05d7f81950@google.com/ It would be nice to have a Fixes: for this. Is it specific to process_mrelease(), or should we backport further?
On Tue, Feb 15, 2022 at 12:19 PM Suren Baghdasaryan <surenb@google.com> wrote: > > After exit_mmap frees all vmas in the mm, mm->mmap needs to be reset, > otherwise it points to a vma that was freed and when reused leads to > a use-after-free bug. > > Reported-by: syzbot+2ccf63a4bd07cf39cab0@syzkaller.appspotmail.com > Suggested-by: Michal Hocko <mhocko@suse.com> > Signed-off-by: Suren Baghdasaryan <surenb@google.com> Looks good to me. Reviewed-by: Yang Shi <shy828301@gmail.com> > --- > mm/mmap.c | 1 + > 1 file changed, 1 insertion(+) > > diff --git a/mm/mmap.c b/mm/mmap.c > index 1e8fdb0b51ed..d445c1b9d606 100644 > --- a/mm/mmap.c > +++ b/mm/mmap.c > @@ -3186,6 +3186,7 @@ void exit_mmap(struct mm_struct *mm) > vma = remove_vma(vma); > cond_resched(); > } > + mm->mmap = NULL; > mmap_write_unlock(mm); > vm_unacct_memory(nr_accounted); > } > -- > 2.35.1.265.g69c8d7142f-goog >
On Tue, Feb 15, 2022 at 12:37 PM Andrew Morton <akpm@linux-foundation.org> wrote: > > On Tue, 15 Feb 2022 12:19:22 -0800 Suren Baghdasaryan <surenb@google.com> wrote: > > > After exit_mmap frees all vmas in the mm, mm->mmap needs to be reset, > > otherwise it points to a vma that was freed and when reused leads to > > a use-after-free bug. > > > > ... > > > > --- a/mm/mmap.c > > +++ b/mm/mmap.c > > @@ -3186,6 +3186,7 @@ void exit_mmap(struct mm_struct *mm) > > vma = remove_vma(vma); > > cond_resched(); > > } > > + mm->mmap = NULL; > > mmap_write_unlock(mm); > > vm_unacct_memory(nr_accounted); > > } > > https://lore.kernel.org/all/00000000000072ef2c05d7f81950@google.com/ > > It would be nice to have a Fixes: for this. Oh, right. Should be: Fixes: 64591e8605d6 ("mm: protect free_pgtables with mmap_lock write lock in exit_mmap") > > Is it specific to process_mrelease(), or should we backport further? The broken change is recent and was introduced in v5.17-rc1.
On Tue, Feb 15, 2022 at 12:45 PM Suren Baghdasaryan <surenb@google.com> wrote: > > On Tue, Feb 15, 2022 at 12:37 PM Andrew Morton > <akpm@linux-foundation.org> wrote: > > > > On Tue, 15 Feb 2022 12:19:22 -0800 Suren Baghdasaryan <surenb@google.com> wrote: > > > > > After exit_mmap frees all vmas in the mm, mm->mmap needs to be reset, > > > otherwise it points to a vma that was freed and when reused leads to > > > a use-after-free bug. > > > > > > ... > > > > > > --- a/mm/mmap.c > > > +++ b/mm/mmap.c > > > @@ -3186,6 +3186,7 @@ void exit_mmap(struct mm_struct *mm) > > > vma = remove_vma(vma); > > > cond_resched(); > > > } > > > + mm->mmap = NULL; > > > mmap_write_unlock(mm); > > > vm_unacct_memory(nr_accounted); > > > } > > > > https://lore.kernel.org/all/00000000000072ef2c05d7f81950@google.com/ > > > > It would be nice to have a Fixes: for this. > > Oh, right. Should be: > > Fixes: 64591e8605d6 ("mm: protect free_pgtables with mmap_lock write > lock in exit_mmap") Andrew, do you want me to post another version with Fixes: 64591e8605d6 ("mm: protect free_pgtables with mmap_lock write lock in exit_mmap") added or you can add it directly? > > > > > Is it specific to process_mrelease(), or should we backport further? > > The broken change is recent and was introduced in v5.17-rc1.
On Tue, 15 Feb 2022 12:46:57 -0800 Suren Baghdasaryan <surenb@google.com> wrote: > > Fixes: 64591e8605d6 ("mm: protect free_pgtables with mmap_lock write > > lock in exit_mmap") > > Andrew, do you want me to post another version with Fixes: > 64591e8605d6 ("mm: protect free_pgtables with mmap_lock write lock in > exit_mmap") added or you can add it directly? I added it, thanks.
On Tue 15-02-22 12:19:22, Suren Baghdasaryan wrote:
> After exit_mmap frees all vmas in the mm, mm->mmap needs to be reset,
> otherwise it points to a vma that was freed and when reused leads to
> a use-after-free bug.

OK, so I have dived into this again. exit_mmap doesn't reset mmap indeed. That doesn't really matter for _oom victims_. Both the oom reaper and mrelease do check for MMF_OOM_SKIP before calling __oom_reap_task_mm. exit_mmap still sets MMF_OOM_SKIP before taking the mmap_lock for oom victims, so those paths should still be properly synchronized. I have proposed to get rid of this (http://lkml.kernel.org/r/YbHIaq9a0CtqRulE@dhcp22.suse.cz) but we haven't agreed on that.

The mrelease path is broken because it doesn't mark the process an oom victim, so the MMF_OOM_SKIP synchronization doesn't work. So we really need this.

I would propose to rephrase the changelog to be more specific because I do not want to have to remember all those details later on. What about:

"
oom reaping (__oom_reap_task_mm) relies on a 2-way synchronization with exit_mmap. First it relies on the mmap_lock to exclude it from the unlock path[1], page table tear down (free_pgtables) and vma destruction. This alone is not sufficient because mm->mmap is never reset. For historical reasons[2], MMF_OOM_SKIP is also set for oom victims before the lock is taken there.

The oom reaper only ever looks at oom victims, so the whole scheme works properly, but process_mrelease can operate on any task (with fatal signals pending), which doesn't really imply oom victims. That means the MMF_OOM_SKIP part of the synchronization doesn't work and process_mrelease can see a task after the whole address space has been demolished and traverse an already released mm->mmap list. This leads to a use-after-free, as properly caught by the KASAN report.

Fix the issue by resetting mm->mmap so that the MMF_OOM_SKIP synchronization is not needed anymore. MMF_OOM_SKIP is not removed from exit_mmap yet, but it acts mostly as an optimization now.

[1] 27ae357fa82b ("mm, oom: fix concurrent munlock and oom reaper unmap, v3")
[2] 212925802454 ("mm: oom: let oom_reap_task and exit_mmap run concurrently")
"

I really have to say that I hate how complex this has grown in the name of optimizations. It has backfired several times already, resulting in 2 security issues. I really hope to get rid of any trace of the oom reaper in exit_mmap.

> Reported-by: syzbot+2ccf63a4bd07cf39cab0@syzkaller.appspotmail.com
> Suggested-by: Michal Hocko <mhocko@suse.com>
> Signed-off-by: Suren Baghdasaryan <surenb@google.com>

Acked-by: Michal Hocko <mhocko@suse.com>
Thanks!

> ---
>  mm/mmap.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 1e8fdb0b51ed..d445c1b9d606 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -3186,6 +3186,7 @@ void exit_mmap(struct mm_struct *mm)
>  		vma = remove_vma(vma);
>  		cond_resched();
>  	}
> +	mm->mmap = NULL;
>  	mmap_write_unlock(mm);
>  	vm_unacct_memory(nr_accounted);
>  }
> --
> 2.35.1.265.g69c8d7142f-goog
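For readers without the source handy, here is a rough, condensed sketch of the two paths this changelog describes, loosely modelled on v5.17-era mm/oom_kill.c; the wrapper name mrelease_reap() is a hypothetical stand-in for the relevant part of process_mrelease(), and the actual vma handling is elided:

/* Reap side: pre maple tree, the walk follows the list headed by mm->mmap. */
static bool __oom_reap_task_mm(struct mm_struct *mm)
{
	struct vm_area_struct *vma;

	for (vma = mm->mmap; vma; vma = vma->vm_next) {
		/* unmap eligible private ranges of this vma */
	}
	return true;
}

/* process_mrelease() side, condensed (hypothetical wrapper name): */
static int mrelease_reap(struct mm_struct *mm)
{
	int ret = 0;

	if (mmap_read_lock_killable(mm))
		return -EINTR;
	/*
	 * exit_mmap() sets MMF_OOM_SKIP (before taking mmap_lock) only for
	 * oom victims.  A task that merely has fatal signals pending never
	 * gets the flag, so nothing stops the walk above from following a
	 * stale mm->mmap into freed vmas unless exit_mmap() resets it.
	 */
	if (!test_bit(MMF_OOM_SKIP, &mm->flags) && !__oom_reap_task_mm(mm))
		ret = -EAGAIN;
	mmap_read_unlock(mm);
	return ret;
}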
On Tue, Feb 15, 2022 at 11:54 PM Michal Hocko <mhocko@suse.com> wrote: > > On Tue 15-02-22 12:19:22, Suren Baghdasaryan wrote: > > After exit_mmap frees all vmas in the mm, mm->mmap needs to be reset, > > otherwise it points to a vma that was freed and when reused leads to > > a use-after-free bug. > > OK, so I have dived into this again. exit_mmap doesn't reset mmap > indeed. That doesn't really matter for _oom victims_. Both the oom reaper and > mrelease do check for MMF_OOM_SKIP before calling __oom_reap_task_mm. > exit_mmap still sets MMF_OOM_SKIP before taking the mmap_lock for oom > victims so those paths should be still properly synchronized. I have > proposed to get rid of this > http://lkml.kernel.org/r/YbHIaq9a0CtqRulE@dhcp22.suse.cz but we haven't > agreed on that. > > mrelease path is broken because it doesn't mark the process oom_victim > and so the MMF_OOM_SKIP synchronization doesn't work. So we really need > this. > > I would propose to rephrase the changelog to be more specific because I > do not want to remember all those details later on. > What about > " > oom reaping (__oom_reap_task_mm) relies on a 2 way synchronization with > exit_mmap. First it relies on the mmap_lock to exclude from unlock > path[1], page tables tear down (free_pgtables) and vma destruction. > This alone is not sufficient because mm->mmap is never reset. For > historical reasons[2] the lock is taken there is also MMF_OOM_SKIP set > for oom victims before. > > The oom reaper only ever looks at oom victims so the whole scheme works > properly but process_mrelease can opearate on any task (with fatal > signals pending) which doesn't really imply oom victims. That means that > the MMF_OOM_SKIP part of the synchronization doesn't work and it can > see a task after the whole address space has been demolished and > traverse an already released mm->mmap list. This leads to use after free > as properly caught up by KASAN report. > > Fix the issue by reseting mm->mmap so that MMF_OOM_SKIP synchronization > is not needed anymore. The MMF_OOM_SKIP is not removed from exit_mmap > yet but it acts mostly as an optimization now. > > [1] 27ae357fa82b ("mm, oom: fix concurrent munlock and oom reaper unmap, > v3") > [2] 212925802454 ("mm: oom: let oom_reap_task and exit_mmap run > concurrently") > " This changelog is very detailed and I support switching to it. Andrew, please let me know if I should re-post the patch with this description or you will just amend the one in your tree. > > I really have to say that I hate how complex this has grown in the name > of optimizations. This has backfired several times already resulting in > 2 security issues. I really hope to get read any note of the oom reaper > from exit_mmap. Agree. I want to take another stab at removing __oom_reap_task_mm from exit_mmap. Now that Hugh made changes to mlock mechanisms and __oom_reap_task_mm does not skip locked vmas I think that should be possible. Planning to look into that sometimes next week. > > > Reported-by: syzbot+2ccf63a4bd07cf39cab0@syzkaller.appspotmail.com > > Suggested-by: Michal Hocko <mhocko@suse.com> > > Signed-off-by: Suren Baghdasaryan <surenb@google.com> > > Acked-by: Michal Hocko <mhocko@suse.com> Thanks! > > Thanks! 
> > --- > > mm/mmap.c | 1 + > > 1 file changed, 1 insertion(+) > > > > diff --git a/mm/mmap.c b/mm/mmap.c > > index 1e8fdb0b51ed..d445c1b9d606 100644 > > --- a/mm/mmap.c > > +++ b/mm/mmap.c > > @@ -3186,6 +3186,7 @@ void exit_mmap(struct mm_struct *mm) > > vma = remove_vma(vma); > > cond_resched(); > > } > > + mm->mmap = NULL; > > mmap_write_unlock(mm); > > vm_unacct_memory(nr_accounted); > > } > > -- > > 2.35.1.265.g69c8d7142f-goog > > -- > Michal Hocko > SUSE Labs
On Thu, 17 Feb 2022 11:51:13 -0800 Suren Baghdasaryan <surenb@google.com> wrote:
> This changelog is very detailed and I support switching to it. Andrew,
> please let me know if I should re-post the patch with this description
> or you will just amend the one in your tree.

I have made that change.

From: Suren Baghdasaryan <surenb@google.com>
Subject: mm: fix use-after-free bug when mm->mmap is reused after being freed

oom reaping (__oom_reap_task_mm) relies on a 2-way synchronization with exit_mmap. First it relies on the mmap_lock to exclude it from the unlock path[1], page table tear down (free_pgtables) and vma destruction. This alone is not sufficient because mm->mmap is never reset. For historical reasons[2], MMF_OOM_SKIP is also set for oom victims before the lock is taken there.

The oom reaper only ever looks at oom victims, so the whole scheme works properly, but process_mrelease can operate on any task (with fatal signals pending), which doesn't really imply oom victims. That means the MMF_OOM_SKIP part of the synchronization doesn't work and process_mrelease can see a task after the whole address space has been demolished and traverse an already released mm->mmap list. This leads to a use-after-free, as properly caught by the KASAN report.

Fix the issue by resetting mm->mmap so that the MMF_OOM_SKIP synchronization is not needed anymore. MMF_OOM_SKIP is not removed from exit_mmap yet, but it acts mostly as an optimization now.

[1] 27ae357fa82b ("mm, oom: fix concurrent munlock and oom reaper unmap, v3")
[2] 212925802454 ("mm: oom: let oom_reap_task and exit_mmap run concurrently")

[mhocko@suse.com: changelog rewrite]
Link: https://lore.kernel.org/all/00000000000072ef2c05d7f81950@google.com/
Link: https://lkml.kernel.org/r/20220215201922.1908156-1-surenb@google.com
Fixes: 64591e8605d6 ("mm: protect free_pgtables with mmap_lock write lock in exit_mmap")
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Reported-by: syzbot+2ccf63a4bd07cf39cab0@syzkaller.appspotmail.com
Suggested-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Rik van Riel <riel@surriel.com>
Reviewed-by: Yang Shi <shy828301@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Roman Gushchin <guro@fb.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Jann Horn <jannh@google.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Cc: Florian Weimer <fweimer@redhat.com>
Cc: Jan Engelhardt <jengelh@inai.de>
Cc: Tim Murray <timmurray@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/mmap.c | 1 +
 1 file changed, 1 insertion(+)

--- a/mm/mmap.c~mm-fix-use-after-free-bug-when-mm-mmap-is-reused-after-being-freed
+++ a/mm/mmap.c
@@ -3186,6 +3186,7 @@ void exit_mmap(struct mm_struct *mm)
 		vma = remove_vma(vma);
 		cond_resched();
 	}
+	mm->mmap = NULL;
 	mmap_write_unlock(mm);
 	vm_unacct_memory(nr_accounted);
 }
On Thu 17-02-22 11:51:13, Suren Baghdasaryan wrote: > On Tue, Feb 15, 2022 at 11:54 PM Michal Hocko <mhocko@suse.com> wrote: [...] > > I really have to say that I hate how complex this has grown in the name > > of optimizations. This has backfired several times already resulting in > > 2 security issues. I really hope to get read any note of the oom reaper > > from exit_mmap. > > Agree. I want to take another stab at removing __oom_reap_task_mm from > exit_mmap. Now that Hugh made changes to mlock mechanisms and > __oom_reap_task_mm does not skip locked vmas I think that should be > possible. Planning to look into that sometimes next week. Thanks!
On Thu 17-02-22 12:50:56, Andrew Morton wrote: > On Thu, 17 Feb 2022 11:51:13 -0800 Suren Baghdasaryan <surenb@google.com> wrote: > > > This changelog is very detailed and I support switching to it. Andrew, > > please let me know if I should re-post the patch with this description > > or you will just amend the one in your tree. > > I have made that change. LGTM. Thanks!
On Tue, 15 Feb 2022 12:19:22 -0800 Suren Baghdasaryan <surenb@google.com> wrote: > After exit_mmap frees all vmas in the mm, mm->mmap needs to be reset, > otherwise it points to a vma that was freed and when reused leads to > a use-after-free bug. > > ... > > --- a/mm/mmap.c > +++ b/mm/mmap.c > @@ -3186,6 +3186,7 @@ void exit_mmap(struct mm_struct *mm) > vma = remove_vma(vma); > cond_resched(); > } > + mm->mmap = NULL; > mmap_write_unlock(mm); > vm_unacct_memory(nr_accounted); > } After the Maple tree patches, mm_struct.mmap doesn't exist. So I'll revert this fix as part of merging the maple-tree parts of linux-next. I'll be sending this fix to Linus this week. All of which means that the thusly-resolved Maple tree patches might reintroduce this use-after-free bug.
On Thu, Feb 24, 2022 at 08:18:59PM -0800, Andrew Morton wrote: > On Tue, 15 Feb 2022 12:19:22 -0800 Suren Baghdasaryan <surenb@google.com> wrote: > > > After exit_mmap frees all vmas in the mm, mm->mmap needs to be reset, > > otherwise it points to a vma that was freed and when reused leads to > > a use-after-free bug. > > > > ... > > > > --- a/mm/mmap.c > > +++ b/mm/mmap.c > > @@ -3186,6 +3186,7 @@ void exit_mmap(struct mm_struct *mm) > > vma = remove_vma(vma); > > cond_resched(); > > } > > + mm->mmap = NULL; > > mmap_write_unlock(mm); > > vm_unacct_memory(nr_accounted); > > } > > After the Maple tree patches, mm_struct.mmap doesn't exist. So I'll > revert this fix as part of merging the maple-tree parts of linux-next. > I'll be sending this fix to Linus this week. > > All of which means that the thusly-resolved Maple tree patches might > reintroduce this use-after-free bug. I don't think so? The problem is that VMAs are (currently) part of two data structures -- the rbtree and the linked list. remove_vma() only removes VMAs from the rbtree; it doesn't set mm->mmap to NULL. With maple tree, the linked list goes away. remove_vma() removes VMAs from the maple tree. So anyone looking to iterate over all VMAs has to go and look in the maple tree for them ... and there's nothing there. But maybe I misunderstood the bug that's being solved here.
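To illustrate Matthew's point, a minimal sketch of the two iteration styles, shown side by side for contrast only (they never coexist in one tree); the MA_STATE()/mas_for_each() names follow the maple tree API as eventually merged and may differ slightly from the linux-next series discussed here, and handle_vma() is a hypothetical placeholder:

	struct vm_area_struct *vma;
	MA_STATE(mas, &mm->mm_mt, 0, 0);

	/* Pre maple tree: walk the linked list headed by mm->mmap. */
	for (vma = mm->mmap; vma; vma = vma->vm_next)
		handle_vma(vma);

	/*
	 * With the maple tree: the walk queries the tree itself, so once
	 * remove_vma() has emptied it there is nothing left to visit.
	 */
	mas_for_each(&mas, vma, ULONG_MAX)
		handle_vma(vma);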
On Thu, Feb 24, 2022 at 8:23 PM Matthew Wilcox <willy@infradead.org> wrote: > > On Thu, Feb 24, 2022 at 08:18:59PM -0800, Andrew Morton wrote: > > On Tue, 15 Feb 2022 12:19:22 -0800 Suren Baghdasaryan <surenb@google.com> wrote: > > > > > After exit_mmap frees all vmas in the mm, mm->mmap needs to be reset, > > > otherwise it points to a vma that was freed and when reused leads to > > > a use-after-free bug. > > > > > > ... > > > > > > --- a/mm/mmap.c > > > +++ b/mm/mmap.c > > > @@ -3186,6 +3186,7 @@ void exit_mmap(struct mm_struct *mm) > > > vma = remove_vma(vma); > > > cond_resched(); > > > } > > > + mm->mmap = NULL; > > > mmap_write_unlock(mm); > > > vm_unacct_memory(nr_accounted); > > > } > > > > After the Maple tree patches, mm_struct.mmap doesn't exist. So I'll > > revert this fix as part of merging the maple-tree parts of linux-next. > > I'll be sending this fix to Linus this week. > > > > All of which means that the thusly-resolved Maple tree patches might > > reintroduce this use-after-free bug. > > I don't think so? The problem is that VMAs are (currently) part of > two data structures -- the rbtree and the linked list. remove_vma() > only removes VMAs from the rbtree; it doesn't set mm->mmap to NULL. > > With maple tree, the linked list goes away. remove_vma() removes VMAs > from the maple tree. So anyone looking to iterate over all VMAs has to > go and look in the maple tree for them ... and there's nothing there. Yes, I think you are right. With maple trees we don't need this fix. > > But maybe I misunderstood the bug that's being solved here.
On Thu 24-02-22 20:18:59, Andrew Morton wrote: > On Tue, 15 Feb 2022 12:19:22 -0800 Suren Baghdasaryan <surenb@google.com> wrote: > > > After exit_mmap frees all vmas in the mm, mm->mmap needs to be reset, > > otherwise it points to a vma that was freed and when reused leads to > > a use-after-free bug. > > > > ... > > > > --- a/mm/mmap.c > > +++ b/mm/mmap.c > > @@ -3186,6 +3186,7 @@ void exit_mmap(struct mm_struct *mm) > > vma = remove_vma(vma); > > cond_resched(); > > } > > + mm->mmap = NULL; > > mmap_write_unlock(mm); > > vm_unacct_memory(nr_accounted); > > } > > After the Maple tree patches, mm_struct.mmap doesn't exist. So I'll > revert this fix as part of merging the maple-tree parts of linux-next. > I'll be sending this fix to Linus this week. But this is a regression introduced in this release cycle so the patch should be merged before Maple tree patches, no?
On Fri, 25 Feb 2022 11:17:34 +0100 Michal Hocko <mhocko@suse.com> wrote: > On Thu 24-02-22 20:18:59, Andrew Morton wrote: > > On Tue, 15 Feb 2022 12:19:22 -0800 Suren Baghdasaryan <surenb@google.com> wrote: > > > > > After exit_mmap frees all vmas in the mm, mm->mmap needs to be reset, > > > otherwise it points to a vma that was freed and when reused leads to > > > a use-after-free bug. > > > > > > ... > > > > > > --- a/mm/mmap.c > > > +++ b/mm/mmap.c > > > @@ -3186,6 +3186,7 @@ void exit_mmap(struct mm_struct *mm) > > > vma = remove_vma(vma); > > > cond_resched(); > > > } > > > + mm->mmap = NULL; > > > mmap_write_unlock(mm); > > > vm_unacct_memory(nr_accounted); > > > } > > > > After the Maple tree patches, mm_struct.mmap doesn't exist. So I'll > > revert this fix as part of merging the maple-tree parts of linux-next. > > I'll be sending this fix to Linus this week. > > But this is a regression introduced in this release cycle so the patch > should be merged before Maple tree patches, no? Yes, I'll be sending this one-liner upstream very soon and we'll then undo it in the maple-tree patchset.
* Suren Baghdasaryan <surenb@google.com> [220225 00:51]: > On Thu, Feb 24, 2022 at 8:23 PM Matthew Wilcox <willy@infradead.org> wrote: > > > > On Thu, Feb 24, 2022 at 08:18:59PM -0800, Andrew Morton wrote: > > > On Tue, 15 Feb 2022 12:19:22 -0800 Suren Baghdasaryan <surenb@google.com> wrote: > > > > > > > After exit_mmap frees all vmas in the mm, mm->mmap needs to be reset, > > > > otherwise it points to a vma that was freed and when reused leads to > > > > a use-after-free bug. > > > > > > > > ... > > > > > > > > --- a/mm/mmap.c > > > > +++ b/mm/mmap.c > > > > @@ -3186,6 +3186,7 @@ void exit_mmap(struct mm_struct *mm) > > > > vma = remove_vma(vma); > > > > cond_resched(); > > > > } > > > > + mm->mmap = NULL; > > > > mmap_write_unlock(mm); > > > > vm_unacct_memory(nr_accounted); > > > > } > > > > > > After the Maple tree patches, mm_struct.mmap doesn't exist. So I'll > > > revert this fix as part of merging the maple-tree parts of linux-next. > > > I'll be sending this fix to Linus this week. > > > > > > All of which means that the thusly-resolved Maple tree patches might > > > reintroduce this use-after-free bug. > > > > I don't think so? The problem is that VMAs are (currently) part of > > two data structures -- the rbtree and the linked list. remove_vma() > > only removes VMAs from the rbtree; it doesn't set mm->mmap to NULL. > > > > With maple tree, the linked list goes away. remove_vma() removes VMAs > > from the maple tree. So anyone looking to iterate over all VMAs has to > > go and look in the maple tree for them ... and there's nothing there. > > Yes, I think you are right. With maple trees we don't need this fix. Yes, this is correct. The maple tree removes the entire linked list... but since the mm is unstable in the exit_mmap(), I had added the destruction of the maple tree there. Maybe this is the wrong place to be destroying the tree tracking the VMAs (althought this patch partially destroys the VMA tracking linked list), but it brought my attention to the race that this patch solves and the process_mrelease() function. Couldn't this be avoided by using mmget_not_zero() instead of mmgrab() in process_mrelease()? That would ensure we aren't stepping on an exit_mmap() and potentially the locking change in exit_mmap() wouldn't be needed either? Logically, I view this as process_mrelease() having issue with the fact that the mmaps are no longer stable in tear down regardless of the data structure that is used. Thanks, Liam
On Thu, Mar 10, 2022 at 7:55 AM Liam Howlett <liam.howlett@oracle.com> wrote: > > * Suren Baghdasaryan <surenb@google.com> [220225 00:51]: > > On Thu, Feb 24, 2022 at 8:23 PM Matthew Wilcox <willy@infradead.org> wrote: > > > > > > On Thu, Feb 24, 2022 at 08:18:59PM -0800, Andrew Morton wrote: > > > > On Tue, 15 Feb 2022 12:19:22 -0800 Suren Baghdasaryan <surenb@google.com> wrote: > > > > > > > > > After exit_mmap frees all vmas in the mm, mm->mmap needs to be reset, > > > > > otherwise it points to a vma that was freed and when reused leads to > > > > > a use-after-free bug. > > > > > > > > > > ... > > > > > > > > > > --- a/mm/mmap.c > > > > > +++ b/mm/mmap.c > > > > > @@ -3186,6 +3186,7 @@ void exit_mmap(struct mm_struct *mm) > > > > > vma = remove_vma(vma); > > > > > cond_resched(); > > > > > } > > > > > + mm->mmap = NULL; > > > > > mmap_write_unlock(mm); > > > > > vm_unacct_memory(nr_accounted); > > > > > } > > > > > > > > After the Maple tree patches, mm_struct.mmap doesn't exist. So I'll > > > > revert this fix as part of merging the maple-tree parts of linux-next. > > > > I'll be sending this fix to Linus this week. > > > > > > > > All of which means that the thusly-resolved Maple tree patches might > > > > reintroduce this use-after-free bug. > > > > > > I don't think so? The problem is that VMAs are (currently) part of > > > two data structures -- the rbtree and the linked list. remove_vma() > > > only removes VMAs from the rbtree; it doesn't set mm->mmap to NULL. > > > > > > With maple tree, the linked list goes away. remove_vma() removes VMAs > > > from the maple tree. So anyone looking to iterate over all VMAs has to > > > go and look in the maple tree for them ... and there's nothing there. > > > > Yes, I think you are right. With maple trees we don't need this fix. > > > Yes, this is correct. The maple tree removes the entire linked list... > but since the mm is unstable in the exit_mmap(), I had added the > destruction of the maple tree there. Maybe this is the wrong place to > be destroying the tree tracking the VMAs (althought this patch partially > destroys the VMA tracking linked list), but it brought my attention to > the race that this patch solves and the process_mrelease() function. > Couldn't this be avoided by using mmget_not_zero() instead of mmgrab() > in process_mrelease()? That's what we were doing before [1]. That unfortunately has a problem of process_mrelease possibly calling the last mmput and being blocked on IO completion in exit_aio. The race between exit_mmap and process_mrelease is solved by using mmap_lock. I think by destroying the maple tree in exit_mmap before the mmap_write_unlock call, you keep things working and functionality intact. Is there any reason this can't be done? [1] ba535c1caf3ee78a ("mm/oom_kill: allow process_mrelease to run under mmap_lock protection") > That would ensure we aren't stepping on an > exit_mmap() and potentially the locking change in exit_mmap() wouldn't > be needed either? Logically, I view this as process_mrelease() having > issue with the fact that the mmaps are no longer stable in tear down > regardless of the data structure that is used. > > Thanks, > Liam > > -- > To unsubscribe from this group and stop receiving emails from it, send an email to kernel-team+unsubscribe@android.com. >
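A condensed contrast of the two approaches Suren describes, with error handling stripped and reap() standing in for the actual reaping call; this is a sketch only, not the exact code, and the mmgrab()-based form is meant to reflect the direction of ba535c1caf3ee78a referenced above:

	/* Earlier approach: take a user reference on the address space. */
	if (mmget_not_zero(mm)) {
		reap(mm);
		mmput(mm);	/* may be the final mmput() and block in exit_aio() */
	}

	/* Later approach: pin only the mm_struct and serialize with exit_mmap(). */
	mmgrab(mm);		/* mm_count reference; the address space can still be torn down */
	if (!mmap_read_lock_killable(mm)) {
		if (!test_bit(MMF_OOM_SKIP, &mm->flags))
			reap(mm);	/* safe only if exit_mmap() leaves nothing stale to walk */
		mmap_read_unlock(mm);
	}
	mmdrop(mm);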
* Suren Baghdasaryan <surenb@google.com> [220310 11:28]: > On Thu, Mar 10, 2022 at 7:55 AM Liam Howlett <liam.howlett@oracle.com> wrote: > > > > * Suren Baghdasaryan <surenb@google.com> [220225 00:51]: > > > On Thu, Feb 24, 2022 at 8:23 PM Matthew Wilcox <willy@infradead.org> wrote: > > > > > > > > On Thu, Feb 24, 2022 at 08:18:59PM -0800, Andrew Morton wrote: > > > > > On Tue, 15 Feb 2022 12:19:22 -0800 Suren Baghdasaryan <surenb@google.com> wrote: > > > > > > > > > > > After exit_mmap frees all vmas in the mm, mm->mmap needs to be reset, > > > > > > otherwise it points to a vma that was freed and when reused leads to > > > > > > a use-after-free bug. > > > > > > > > > > > > ... > > > > > > > > > > > > --- a/mm/mmap.c > > > > > > +++ b/mm/mmap.c > > > > > > @@ -3186,6 +3186,7 @@ void exit_mmap(struct mm_struct *mm) > > > > > > vma = remove_vma(vma); > > > > > > cond_resched(); > > > > > > } > > > > > > + mm->mmap = NULL; > > > > > > mmap_write_unlock(mm); > > > > > > vm_unacct_memory(nr_accounted); > > > > > > } > > > > > > > > > > After the Maple tree patches, mm_struct.mmap doesn't exist. So I'll > > > > > revert this fix as part of merging the maple-tree parts of linux-next. > > > > > I'll be sending this fix to Linus this week. > > > > > > > > > > All of which means that the thusly-resolved Maple tree patches might > > > > > reintroduce this use-after-free bug. > > > > > > > > I don't think so? The problem is that VMAs are (currently) part of > > > > two data structures -- the rbtree and the linked list. remove_vma() > > > > only removes VMAs from the rbtree; it doesn't set mm->mmap to NULL. > > > > > > > > With maple tree, the linked list goes away. remove_vma() removes VMAs > > > > from the maple tree. So anyone looking to iterate over all VMAs has to > > > > go and look in the maple tree for them ... and there's nothing there. > > > > > > Yes, I think you are right. With maple trees we don't need this fix. > > > > > > Yes, this is correct. The maple tree removes the entire linked list... > > but since the mm is unstable in the exit_mmap(), I had added the > > destruction of the maple tree there. Maybe this is the wrong place to > > be destroying the tree tracking the VMAs (althought this patch partially > > destroys the VMA tracking linked list), but it brought my attention to > > the race that this patch solves and the process_mrelease() function. > > Couldn't this be avoided by using mmget_not_zero() instead of mmgrab() > > in process_mrelease()? > > That's what we were doing before [1]. That unfortunately has a problem > of process_mrelease possibly calling the last mmput and being blocked > on IO completion in exit_aio. Oh, I see. Thanks. > The race between exit_mmap and > process_mrelease is solved by using mmap_lock. I think an important part of the race fix isn't just the lock holding but the setting of the start of the linked list to NULL above. That means the code in __oom_reap_task_mm() via process_mrelease() will continue to execute but iterate for zero VMAs. > I think by destroying the maple tree in exit_mmap before the > mmap_write_unlock call, you keep things working and functionality > intact. Is there any reason this can't be done? Yes, unfortunately. If MMF_OOM_SKIP is not set, then process_mrelease() will call __oom_reap_task_mm() which will get a null pointer dereference or a use after free in the vma iterator as it tries to iterate the maple tree. I think the best plan is to set MMF_OOM_SKIP unconditionally when the mmap_write_lock() is acquired. 
Doing so will ensure nothing will try to gain memory by reaping a task that no longer has memory to yield - or at least won't shortly. If we do use MMF_OOM_SKIP in such a way, then I think it is safe to quickly drop the lock? Also, should process_mrelease() be setting MMF_OOM_VICTIM on this mm? It would enable the fast path on a race with exit_mmap() - though that may not be desirable? > > [1] ba535c1caf3ee78a ("mm/oom_kill: allow process_mrelease to run > under mmap_lock protection") > > > That would ensure we aren't stepping on an > > > exit_mmap() and potentially the locking change in exit_mmap() wouldn't > > > be needed either? Logically, I view this as process_mrelease() having > > > issue with the fact that the mmaps are no longer stable in tear down > > > regardless of the data structure that is used. > > > > > > Thanks, > > > Liam > > > > > > -- > > > To unsubscribe from this group and stop receiving emails from it, send an email to kernel-team+unsubscribe@android.com. > > >
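Liam's suggestion, as a purely illustrative sketch of where such a flag setting could sit in exit_mmap(); the exact placement relative to unmap_vmas()/free_pgtables() and the existing oom-victim handling is precisely what would still need to be agreed on:

void exit_mmap(struct mm_struct *mm)
{
	/* ... unmap_vmas(), existing oom-victim special casing ... */

	mmap_write_lock(mm);
	/*
	 * Unconditionally mark the mm as not worth reaping before the vmas
	 * are torn down, so a racing process_mrelease() backs off on
	 * MMF_OOM_SKIP instead of walking a list or tree that is being freed.
	 */
	set_bit(MMF_OOM_SKIP, &mm->flags);

	/* ... free_pgtables(), remove_vma() loop / maple tree destruction ... */
	mmap_write_unlock(mm);
}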
On Thu, Mar 10, 2022 at 2:22 PM Liam Howlett <liam.howlett@oracle.com> wrote: > > * Suren Baghdasaryan <surenb@google.com> [220310 11:28]: > > On Thu, Mar 10, 2022 at 7:55 AM Liam Howlett <liam.howlett@oracle.com> wrote: > > > > > > * Suren Baghdasaryan <surenb@google.com> [220225 00:51]: > > > > On Thu, Feb 24, 2022 at 8:23 PM Matthew Wilcox <willy@infradead.org> wrote: > > > > > > > > > > On Thu, Feb 24, 2022 at 08:18:59PM -0800, Andrew Morton wrote: > > > > > > On Tue, 15 Feb 2022 12:19:22 -0800 Suren Baghdasaryan <surenb@google.com> wrote: > > > > > > > > > > > > > After exit_mmap frees all vmas in the mm, mm->mmap needs to be reset, > > > > > > > otherwise it points to a vma that was freed and when reused leads to > > > > > > > a use-after-free bug. > > > > > > > > > > > > > > ... > > > > > > > > > > > > > > --- a/mm/mmap.c > > > > > > > +++ b/mm/mmap.c > > > > > > > @@ -3186,6 +3186,7 @@ void exit_mmap(struct mm_struct *mm) > > > > > > > vma = remove_vma(vma); > > > > > > > cond_resched(); > > > > > > > } > > > > > > > + mm->mmap = NULL; > > > > > > > mmap_write_unlock(mm); > > > > > > > vm_unacct_memory(nr_accounted); > > > > > > > } > > > > > > > > > > > > After the Maple tree patches, mm_struct.mmap doesn't exist. So I'll > > > > > > revert this fix as part of merging the maple-tree parts of linux-next. > > > > > > I'll be sending this fix to Linus this week. > > > > > > > > > > > > All of which means that the thusly-resolved Maple tree patches might > > > > > > reintroduce this use-after-free bug. > > > > > > > > > > I don't think so? The problem is that VMAs are (currently) part of > > > > > two data structures -- the rbtree and the linked list. remove_vma() > > > > > only removes VMAs from the rbtree; it doesn't set mm->mmap to NULL. > > > > > > > > > > With maple tree, the linked list goes away. remove_vma() removes VMAs > > > > > from the maple tree. So anyone looking to iterate over all VMAs has to > > > > > go and look in the maple tree for them ... and there's nothing there. > > > > > > > > Yes, I think you are right. With maple trees we don't need this fix. > > > > > > > > > Yes, this is correct. The maple tree removes the entire linked list... > > > but since the mm is unstable in the exit_mmap(), I had added the > > > destruction of the maple tree there. Maybe this is the wrong place to > > > be destroying the tree tracking the VMAs (althought this patch partially > > > destroys the VMA tracking linked list), but it brought my attention to > > > the race that this patch solves and the process_mrelease() function. > > > Couldn't this be avoided by using mmget_not_zero() instead of mmgrab() > > > in process_mrelease()? > > > > That's what we were doing before [1]. That unfortunately has a problem > > of process_mrelease possibly calling the last mmput and being blocked > > on IO completion in exit_aio. > > Oh, I see. Thanks. > > > > The race between exit_mmap and > > process_mrelease is solved by using mmap_lock. > > I think an important part of the race fix isn't just the lock holding > but the setting of the start of the linked list to NULL above. That > means the code in __oom_reap_task_mm() via process_mrelease() will > continue to execute but iterate for zero VMAs. > > > I think by destroying the maple tree in exit_mmap before the > > mmap_write_unlock call, you keep things working and functionality > > intact. Is there any reason this can't be done? > > Yes, unfortunately. 
If MMF_OOM_SKIP is not set, then process_mrelease() > will call __oom_reap_task_mm() which will get a null pointer dereference > or a use after free in the vma iterator as it tries to iterate the maple > tree. I think the best plan is to set MMF_OOM_SKIP unconditionally > when the mmap_write_lock() is acquired. Doing so will ensure nothing > will try to gain memory by reaping a task that no longer has memory to > yield - or at least won't shortly. If we do use MMF_OOM_SKIP in such a > way, then I think it is safe to quickly drop the lock? That technically would work but it changes the semantics of MMF_OOM_SKIP flag from "mm is of no interest for the OOM killer" to something like "mm is empty" akin to mm->mmap == NULL. So, there is no way for maple tree to indicate that it is empty? > > Also, should process_mrelease() be setting MMF_OOM_VICTIM on this mm? > It would enable the fast path on a race with exit_mmap() - thought that > may not be desirable? Michal does not like that approach because again, process_mrelease is not oom-killer to set MMF_OOM_VICTIM flag. Besides, we want to get rid of that special mm_is_oom_victim(mm) branch inside exit_mmap. Which reminds me to look into it again. > > > > > [1] ba535c1caf3ee78a ("mm/oom_kill: allow process_mrelease to run > > under mmap_lock protection") > > > > > That would ensure we aren't stepping on an > > > exit_mmap() and potentially the locking change in exit_mmap() wouldn't > > > be needed either? Logically, I view this as process_mrelease() having > > > issue with the fact that the mmaps are no longer stable in tear down > > > regardless of the data structure that is used. > > > > > > Thanks, > > > Liam > > > > > > -- > > > To unsubscribe from this group and stop receiving emails from it, send an email to kernel-team+unsubscribe@android.com. > > > > > -- > To unsubscribe from this group and stop receiving emails from it, send an email to kernel-team+unsubscribe@android.com. >
* Suren Baghdasaryan <surenb@google.com> [220310 18:31]: > On Thu, Mar 10, 2022 at 2:22 PM Liam Howlett <liam.howlett@oracle.com> wrote: > > > > * Suren Baghdasaryan <surenb@google.com> [220310 11:28]: > > > On Thu, Mar 10, 2022 at 7:55 AM Liam Howlett <liam.howlett@oracle.com> wrote: > > > > > > > > * Suren Baghdasaryan <surenb@google.com> [220225 00:51]: > > > > > On Thu, Feb 24, 2022 at 8:23 PM Matthew Wilcox <willy@infradead.org> wrote: > > > > > > > > > > > > On Thu, Feb 24, 2022 at 08:18:59PM -0800, Andrew Morton wrote: > > > > > > > On Tue, 15 Feb 2022 12:19:22 -0800 Suren Baghdasaryan <surenb@google.com> wrote: > > > > > > > > > > > > > > > After exit_mmap frees all vmas in the mm, mm->mmap needs to be reset, > > > > > > > > otherwise it points to a vma that was freed and when reused leads to > > > > > > > > a use-after-free bug. > > > > > > > > > > > > > > > > ... > > > > > > > > > > > > > > > > --- a/mm/mmap.c > > > > > > > > +++ b/mm/mmap.c > > > > > > > > @@ -3186,6 +3186,7 @@ void exit_mmap(struct mm_struct *mm) > > > > > > > > vma = remove_vma(vma); > > > > > > > > cond_resched(); > > > > > > > > } > > > > > > > > + mm->mmap = NULL; > > > > > > > > mmap_write_unlock(mm); > > > > > > > > vm_unacct_memory(nr_accounted); > > > > > > > > } > > > > > > > > > > > > > > After the Maple tree patches, mm_struct.mmap doesn't exist. So I'll > > > > > > > revert this fix as part of merging the maple-tree parts of linux-next. > > > > > > > I'll be sending this fix to Linus this week. > > > > > > > > > > > > > > All of which means that the thusly-resolved Maple tree patches might > > > > > > > reintroduce this use-after-free bug. > > > > > > > > > > > > I don't think so? The problem is that VMAs are (currently) part of > > > > > > two data structures -- the rbtree and the linked list. remove_vma() > > > > > > only removes VMAs from the rbtree; it doesn't set mm->mmap to NULL. > > > > > > > > > > > > With maple tree, the linked list goes away. remove_vma() removes VMAs > > > > > > from the maple tree. So anyone looking to iterate over all VMAs has to > > > > > > go and look in the maple tree for them ... and there's nothing there. > > > > > > > > > > Yes, I think you are right. With maple trees we don't need this fix. > > > > > > > > > > > > Yes, this is correct. The maple tree removes the entire linked list... > > > > but since the mm is unstable in the exit_mmap(), I had added the > > > > destruction of the maple tree there. Maybe this is the wrong place to > > > > be destroying the tree tracking the VMAs (althought this patch partially > > > > destroys the VMA tracking linked list), but it brought my attention to > > > > the race that this patch solves and the process_mrelease() function. > > > > Couldn't this be avoided by using mmget_not_zero() instead of mmgrab() > > > > in process_mrelease()? > > > > > > That's what we were doing before [1]. That unfortunately has a problem > > > of process_mrelease possibly calling the last mmput and being blocked > > > on IO completion in exit_aio. > > > > Oh, I see. Thanks. > > > > > > > The race between exit_mmap and > > > process_mrelease is solved by using mmap_lock. > > > > I think an important part of the race fix isn't just the lock holding > > but the setting of the start of the linked list to NULL above. That > > means the code in __oom_reap_task_mm() via process_mrelease() will > > continue to execute but iterate for zero VMAs. 
> > > > > I think by destroying the maple tree in exit_mmap before the > > > mmap_write_unlock call, you keep things working and functionality > > > intact. Is there any reason this can't be done? > > > > Yes, unfortunately. If MMF_OOM_SKIP is not set, then process_mrelease() > > will call __oom_reap_task_mm() which will get a null pointer dereference > > or a use after free in the vma iterator as it tries to iterate the maple > > tree. I think the best plan is to set MMF_OOM_SKIP unconditionally > > when the mmap_write_lock() is acquired. Doing so will ensure nothing > > will try to gain memory by reaping a task that no longer has memory to > > yield - or at least won't shortly. If we do use MMF_OOM_SKIP in such a > > way, then I think it is safe to quickly drop the lock? > > That technically would work but it changes the semantics of > MMF_OOM_SKIP flag from "mm is of no interest for the OOM killer" to > something like "mm is empty" akin to mm->mmap == NULL. Well, an empty mm is of no interest to the oom killer was my thought. > So, there is no way for maple tree to indicate that it is empty? On second look, the tree is part of the mm_struct. Destroying will clear the flags and remove all VMAs, but that should be fine as long as nothing tries to add anything back to the tree. I don't think there is a dereference issue here and it will continue to run through the motions on an empty set as it does right now. > > > > > Also, should process_mrelease() be setting MMF_OOM_VICTIM on this mm? > > It would enable the fast path on a race with exit_mmap() - thought that > > may not be desirable? > > Michal does not like that approach because again, process_mrelease is > not oom-killer to set MMF_OOM_VICTIM flag. Besides, we want to get rid > of that special mm_is_oom_victim(mm) branch inside exit_mmap. Which > reminds me to look into it again. > > > > > > > > > [1] ba535c1caf3ee78a ("mm/oom_kill: allow process_mrelease to run > > > under mmap_lock protection") > > > > > > > That would ensure we aren't stepping on an > > > > exit_mmap() and potentially the locking change in exit_mmap() wouldn't > > > > be needed either? Logically, I view this as process_mrelease() having > > > > issue with the fact that the mmaps are no longer stable in tear down > > > > regardless of the data structure that is used. > > > > > > > > Thanks, > > > > Liam > > > > > > > > -- > > > > To unsubscribe from this group and stop receiving emails from it, send an email to kernel-team+unsubscribe@android.com. > > > > > > > > -- > > To unsubscribe from this group and stop receiving emails from it, send an email to kernel-team+unsubscribe@android.com. > >
diff --git a/mm/mmap.c b/mm/mmap.c
index 1e8fdb0b51ed..d445c1b9d606 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -3186,6 +3186,7 @@ void exit_mmap(struct mm_struct *mm)
 		vma = remove_vma(vma);
 		cond_resched();
 	}
+	mm->mmap = NULL;
 	mmap_write_unlock(mm);
 	vm_unacct_memory(nr_accounted);
 }
After exit_mmap frees all vmas in the mm, mm->mmap needs to be reset,
otherwise it points to a vma that was freed and when reused leads to
a use-after-free bug.

Reported-by: syzbot+2ccf63a4bd07cf39cab0@syzkaller.appspotmail.com
Suggested-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
 mm/mmap.c | 1 +
 1 file changed, 1 insertion(+)