Message ID | 20190119005022.61321-1-shakeelb@google.com (mailing list archive)
---|---
State | New, archived
Series | [RFC] mm, oom: fix use-after-free in oom_kill_process
Hi Shakeel!

> On looking further it seems like the process selected to be oom-killed
> has exited even before reaching read_lock(&tasklist_lock) in
> oom_kill_process(). More specifically, tsk->usage is 1, which is due to
> the get_task_struct() in oom_evaluate_task(), and the put_task_struct()
> within for_each_thread() frees the tsk while for_each_thread() still
> tries to access it. The easiest fix is to do a get/put across the
> for_each_thread() on the selected task.

Please, feel free to add

Reviewed-by: Roman Gushchin <guro@fb.com>

for this part.

> Now the next question is: should we continue with the oom-kill, as the
> previously selected task has exited? However, before adding more
> complexity and heuristics, let's answer why we even look at the
> children of the oom-kill selected task. The select_bad_process() has
> already selected the worst process in the system/memcg. Due to a race,
> the selected process might not be the worst at kill time, but does that
> matter? The userspace can play with oom_score_adj to prefer children to
> be killed before the parent. I looked at the history, but it seems this
> predates git history.

I'd totally support you in an attempt to remove this logic, unless
someone has a good example of its usefulness.

I believe it's a very old hack to select children over parents in case
they have the same oom badness (e.g. they share most of the memory).

Maybe we can prefer older processes in case of equal oom badness, and
it will be enough.

Thanks!
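For concreteness, a minimal sketch of what such an age-based tie-break
could look like. This is purely illustrative, not an existing or
proposed kernel function: the helper name is invented, and it assumes
task_struct::start_time (boot-based nanoseconds) as the age measure. It
also encodes one reading of the suggestion, namely that on equal
badness the younger task is sacrificed so that the older one's work
survives:

/* Invented helper: returns true if 'cand' should replace 'chosen'. */
static bool oom_next_victim(struct task_struct *cand,
			    unsigned long cand_points,
			    struct task_struct *chosen,
			    unsigned long chosen_points)
{
	if (cand_points != chosen_points)
		return cand_points > chosen_points;
	/*
	 * Equal badness: sacrifice the task forked later. This
	 * generalizes "kill the child before the parent" without ever
	 * walking the children list.
	 */
	return cand->start_time > chosen->start_time;
}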
On 2019/01/19 9:50, Shakeel Butt wrote:
> On looking further it seems like the process selected to be oom-killed
> has exited even before reaching read_lock(&tasklist_lock) in
> oom_kill_process(). More specifically, tsk->usage is 1, which is due to
> the get_task_struct() in oom_evaluate_task(), and the put_task_struct()
> within for_each_thread() frees the tsk while for_each_thread() still
> tries to access it. The easiest fix is to do a get/put across the
> for_each_thread() on the selected task.

Good catch. p->usage can become 1 while printk()ing a lot at dump_header().

> @@ -981,6 +981,13 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
>  	 * still freeing memory.
>  	 */
>  	read_lock(&tasklist_lock);
> +
> +	/*
> +	 * The task 'p' might have already exited before reaching here. The
> +	 * put_task_struct() will free task_struct 'p' while the loop still
> +	 * tries to access fields of 'p', so get an extra reference.
> +	 */
> +	get_task_struct(p);
>  	for_each_thread(p, t) {
>  		list_for_each_entry(child, &t->children, sibling) {
>  			unsigned int child_points;
> @@ -1000,6 +1007,7 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
>  			}
>  		}
>  	}
> +	put_task_struct(p);

Moving put_task_struct(p) to after read_unlock(&tasklist_lock) will
reduce the latency of a write_lock(&tasklist_lock) waiter.

>  	read_unlock(&tasklist_lock);
>
>  	/*

By the way, p->usage already being 1 implies that p->mm == NULL due to
an already completed exit_mm(p). Then process_shares_mm(child, p->mm)
might fail to return true for some of the children. Not critical, but
it might lead to unnecessary oom_badness() calls for child selection.
Maybe we want to use the same logic __oom_kill_process() uses (i.e.
bail out if find_lock_task_mm(p) failed)?
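Folding both of these suggestions into the patch might shape the region
roughly as follows. This is an untested sketch, not a proposed hunk:
the bail-out placement and the 'skip' label are assumptions, while
find_lock_task_mm() is the existing helper that __oom_kill_process()
already relies on:

	read_lock(&tasklist_lock);
	get_task_struct(p);

	/*
	 * If the whole thread group has passed exit_mm(), no thread has
	 * an mm left and no child can pass process_shares_mm(child,
	 * p->mm), so the child walk below would be pointless.
	 */
	t = find_lock_task_mm(p);
	if (!t) {
		read_unlock(&tasklist_lock);
		goto skip;
	}
	task_unlock(t);

	for_each_thread(p, t) {
		/* unchanged child selection loop */
	}
	read_unlock(&tasklist_lock);
skip:
	/*
	 * Dropped only after the lock is released, so a potential final
	 * free does not delay a write_lock(&tasklist_lock) waiter.
	 */
	put_task_struct(p);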
On Fri 18-01-19 16:50:22, Shakeel Butt wrote:
[...]
> On looking further it seems like the process selected to be oom-killed
> has exited even before reaching read_lock(&tasklist_lock) in
> oom_kill_process(). More specifically, tsk->usage is 1, which is due to
> the get_task_struct() in oom_evaluate_task(), and the put_task_struct()
> within for_each_thread() frees the tsk while for_each_thread() still
> tries to access it. The easiest fix is to do a get/put across the
> for_each_thread() on the selected task.

Very well spotted! The code seems safe because we are careful to
transfer the victim along with its reference count, but I've totally
missed that the loop itself needs a reference. It seems that this has
been broken since the heuristic was introduced, but I haven't checked
closely. I am still on vacation.

> Now the next question is: should we continue with the oom-kill, as the
> previously selected task has exited? However, before adding more
> complexity and heuristics, let's answer why we even look at the
> children of the oom-kill selected task.

The objective was to protect done work, assuming that children have
done less work than their parent. I find this argument a bit
questionable because it highly depends on the specific workload, while
at the same time it opens the door to problematic behavior. If you have
a fork-bomb-like workload, it is basically hard to resolve the OOM
condition, because the children have barely any memory, so we keep
looping, killing tasks that will not free up much. So I am all for
removing this heuristic.

> The select_bad_process() has already selected the worst process in the
> system/memcg. Due to a race, the selected process might not be the
> worst at kill time, but does that matter?

No, it doesn't, I believe. The aim of the oom killer is to kill
something. We will never be ideal here because this is a land of races.

> The userspace can play with oom_score_adj to prefer children to be
> killed before the parent. I looked at the history, but it seems this
> predates git history.
>
> Signed-off-by: Shakeel Butt <shakeelb@google.com>

Fixes: 5e9d834a0e0c ("oom: sacrifice child with highest badness score for parent")
Cc: stable

Acked-by: Michal Hocko <mhocko@suse.com>

Thanks!

> ---
>  mm/oom_kill.c | 8 ++++++++
>  1 file changed, 8 insertions(+)
>
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 0930b4365be7..1a007dae1e8f 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -981,6 +981,13 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
>  	 * still freeing memory.
>  	 */
>  	read_lock(&tasklist_lock);
> +
> +	/*
> +	 * The task 'p' might have already exited before reaching here. The
> +	 * put_task_struct() will free task_struct 'p' while the loop still
> +	 * tries to access fields of 'p', so get an extra reference.
> +	 */
> +	get_task_struct(p);
>  	for_each_thread(p, t) {
>  		list_for_each_entry(child, &t->children, sibling) {
>  			unsigned int child_points;
> @@ -1000,6 +1007,7 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
>  			}
>  		}
>  	}
> +	put_task_struct(p);
>  	read_unlock(&tasklist_lock);
>
>  	/*
> --
> 2.20.1.321.g9e740568ce-goog
On Fri, Jan 18, 2019 at 5:58 PM Roman Gushchin <guro@fb.com> wrote:
>
> Hi Shakeel!
>
> > On looking further it seems like the process selected to be oom-killed
> > has exited even before reaching read_lock(&tasklist_lock) in
> > oom_kill_process(). More specifically, tsk->usage is 1, which is due to
> > the get_task_struct() in oom_evaluate_task(), and the put_task_struct()
> > within for_each_thread() frees the tsk while for_each_thread() still
> > tries to access it. The easiest fix is to do a get/put across the
> > for_each_thread() on the selected task.
>
> Please, feel free to add
>
> Reviewed-by: Roman Gushchin <guro@fb.com>
>
> for this part.
>

Thanks.

> > Now the next question is: should we continue with the oom-kill, as the
> > previously selected task has exited? However, before adding more
> > complexity and heuristics, let's answer why we even look at the
> > children of the oom-kill selected task. The select_bad_process() has
> > already selected the worst process in the system/memcg. Due to a race,
> > the selected process might not be the worst at kill time, but does that
> > matter? The userspace can play with oom_score_adj to prefer children to
> > be killed before the parent. I looked at the history, but it seems this
> > predates git history.
>
> I'd totally support you in an attempt to remove this logic, unless
> someone has a good example of its usefulness.
>
> I believe it's a very old hack to select children over parents in case
> they have the same oom badness (e.g. they share most of the memory).
>
> Maybe we can prefer older processes in case of equal oom badness, and
> it will be enough.
>
> Thanks!

I am thinking of removing the whole logic of selecting children.

Shakeel
On Fri, Jan 18, 2019 at 7:35 PM Tetsuo Handa
<penguin-kernel@i-love.sakura.ne.jp> wrote:
>
> On 2019/01/19 9:50, Shakeel Butt wrote:
> > On looking further it seems like the process selected to be oom-killed
> > has exited even before reaching read_lock(&tasklist_lock) in
> > oom_kill_process(). More specifically, tsk->usage is 1, which is due to
> > the get_task_struct() in oom_evaluate_task(), and the put_task_struct()
> > within for_each_thread() frees the tsk while for_each_thread() still
> > tries to access it. The easiest fix is to do a get/put across the
> > for_each_thread() on the selected task.
>
> Good catch. p->usage can become 1 while printk()ing a lot at dump_header().
>
> > @@ -981,6 +981,13 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
> >  	 * still freeing memory.
> >  	 */
> >  	read_lock(&tasklist_lock);
> > +
> > +	/*
> > +	 * The task 'p' might have already exited before reaching here. The
> > +	 * put_task_struct() will free task_struct 'p' while the loop still
> > +	 * tries to access fields of 'p', so get an extra reference.
> > +	 */
> > +	get_task_struct(p);
> >  	for_each_thread(p, t) {
> >  		list_for_each_entry(child, &t->children, sibling) {
> >  			unsigned int child_points;
> > @@ -1000,6 +1007,7 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
> >  			}
> >  		}
> >  	}
> > +	put_task_struct(p);
>
> Moving put_task_struct(p) to after read_unlock(&tasklist_lock) will
> reduce the latency of a write_lock(&tasklist_lock) waiter.
>
> >  	read_unlock(&tasklist_lock);
> >
> >  	/*
>
> By the way, p->usage already being 1 implies that p->mm == NULL due to
> an already completed exit_mm(p). Then process_shares_mm(child, p->mm)
> might fail to return true for some of the children. Not critical, but
> it might lead to unnecessary oom_badness() calls for child selection.
> Maybe we want to use the same logic __oom_kill_process() uses (i.e.
> bail out if find_lock_task_mm(p) failed)?

Thanks for the review. I am thinking of removing the whole children
selection heuristic for now.

Shakeel
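For orientation, this is what dropping the heuristic would mean
structurally. A schematic only, using the function names visible in the
hunks above; it is not the actual cleanup commit, and the call into the
existing kill path is an assumption about its 5.0-era signature:

static void oom_kill_process(struct oom_control *oc, const char *message)
{
	struct task_struct *victim = oc->chosen;

	/* task_will_free_mem() fast path and dump_header() stay as-is */

	/*
	 * No tasklist walk, no child badness comparison, and therefore
	 * no extra get/put on 'p': just kill whatever task
	 * select_bad_process() already chose.
	 */
	__oom_kill_process(victim);
}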
On Fri, Jan 18, 2019 at 11:09 PM Michal Hocko <mhocko@kernel.org> wrote:
>
> On Fri 18-01-19 16:50:22, Shakeel Butt wrote:
> [...]
> > On looking further it seems like the process selected to be oom-killed
> > has exited even before reaching read_lock(&tasklist_lock) in
> > oom_kill_process(). More specifically, tsk->usage is 1, which is due to
> > the get_task_struct() in oom_evaluate_task(), and the put_task_struct()
> > within for_each_thread() frees the tsk while for_each_thread() still
> > tries to access it. The easiest fix is to do a get/put across the
> > for_each_thread() on the selected task.
>
> Very well spotted! The code seems safe because we are careful to
> transfer the victim along with its reference count, but I've totally
> missed that the loop itself needs a reference. It seems that this has
> been broken since the heuristic was introduced, but I haven't checked
> closely. I am still on vacation.
>
> > Now the next question is: should we continue with the oom-kill, as the
> > previously selected task has exited? However, before adding more
> > complexity and heuristics, let's answer why we even look at the
> > children of the oom-kill selected task.
>
> The objective was to protect done work, assuming that children have
> done less work than their parent. I find this argument a bit
> questionable because it highly depends on the specific workload, while
> at the same time it opens the door to problematic behavior. If you have
> a fork-bomb-like workload, it is basically hard to resolve the OOM
> condition, because the children have barely any memory, so we keep
> looping, killing tasks that will not free up much. So I am all for
> removing this heuristic.
>
> > The select_bad_process() has already selected the worst process in the
> > system/memcg. Due to a race, the selected process might not be the
> > worst at kill time, but does that matter?
>
> No, it doesn't, I believe. The aim of the oom killer is to kill
> something. We will never be ideal here because this is a land of races.
>
> > The userspace can play with oom_score_adj to prefer children to be
> > killed before the parent. I looked at the history, but it seems this
> > predates git history.
> >
> > Signed-off-by: Shakeel Butt <shakeelb@google.com>
>
> Fixes: 5e9d834a0e0c ("oom: sacrifice child with highest badness score for parent")
> Cc: stable
>
> Acked-by: Michal Hocko <mhocko@suse.com>
>
> Thanks!

Thanks for the review. I will keep this fix for the stable branches, and
for the next release I will remove this whole children selection
heuristic.

Shakeel

> > ---
> >  mm/oom_kill.c | 8 ++++++++
> >  1 file changed, 8 insertions(+)
> >
> > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > index 0930b4365be7..1a007dae1e8f 100644
> > --- a/mm/oom_kill.c
> > +++ b/mm/oom_kill.c
> > @@ -981,6 +981,13 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
> >  	 * still freeing memory.
> >  	 */
> >  	read_lock(&tasklist_lock);
> > +
> > +	/*
> > +	 * The task 'p' might have already exited before reaching here. The
> > +	 * put_task_struct() will free task_struct 'p' while the loop still
> > +	 * tries to access fields of 'p', so get an extra reference.
> > +	 */
> > +	get_task_struct(p);
> >  	for_each_thread(p, t) {
> >  		list_for_each_entry(child, &t->children, sibling) {
> >  			unsigned int child_points;
> > @@ -1000,6 +1007,7 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
> >  			}
> >  		}
> >  	}
> > +	put_task_struct(p);
> >  	read_unlock(&tasklist_lock);
> >
> >  	/*
> > --
> > 2.20.1.321.g9e740568ce-goog
>
> --
> Michal Hocko
> SUSE Labs
On Sat 19-01-19 08:09:34, Michal Hocko wrote:
[...]
> Fixes: 5e9d834a0e0c ("oom: sacrifice child with highest badness score for parent")

So I've double-checked, and I was wrong to blame this commit. Back then
it was tasklist_lock that protected us from releasing the task. It's
only been since 6b0c81b3be11 ("mm, oom: reduce dependency on
tasklist_lock") that we rely on the reference counting, and unless I am
missing something, this is also the commit which introduced this bug.

> Cc: stable
==================================================================
BUG: KASAN: use-after-free in oom_kill_process.cold+0x484/0x9d4 mm/oom_kill.c:978
Read of size 8 at addr ffff8880595f6c40 by task syz-executor1/15817

CPU: 1 PID: 15817 Comm: syz-executor1 Not tainted 5.0.0-rc2+ #29
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0x1db/0x2d0 lib/dump_stack.c:113
 print_address_description.cold+0x7c/0x20d mm/kasan/report.c:187
 kasan_report.cold+0x1b/0x40 mm/kasan/report.c:317
 __asan_report_load8_noabort+0x14/0x20 mm/kasan/generic_report.c:135
 oom_kill_process.cold+0x484/0x9d4 mm/oom_kill.c:978
 out_of_memory+0x885/0x1420 mm/oom_kill.c:1133
 mem_cgroup_out_of_memory+0x160/0x210 mm/memcontrol.c:1393
 mem_cgroup_oom mm/memcontrol.c:1721 [inline]
 try_charge+0xd44/0x19b0 mm/memcontrol.c:2283
 memcg_kmem_charge_memcg+0x7c/0x130 mm/memcontrol.c:2591
 memcg_kmem_charge+0x13b/0x340 mm/memcontrol.c:2624
 __alloc_pages_nodemask+0x7b8/0xdc0 mm/page_alloc.c:4559
 __alloc_pages include/linux/gfp.h:473 [inline]
 __alloc_pages_node include/linux/gfp.h:486 [inline]
 alloc_pages_node include/linux/gfp.h:500 [inline]
 alloc_thread_stack_node kernel/fork.c:246 [inline]
 dup_task_struct kernel/fork.c:849 [inline]
 copy_process+0x847/0x8710 kernel/fork.c:1753
 _do_fork+0x1a9/0x1170 kernel/fork.c:2227
 __do_sys_clone kernel/fork.c:2334 [inline]
 __se_sys_clone kernel/fork.c:2328 [inline]
 __x64_sys_clone+0xbf/0x150 kernel/fork.c:2328
 do_syscall_64+0x1a3/0x800 arch/x86/entry/common.c:290
 entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x457ec9
Code: 6d b7 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 3b b7 fb ff c3 66 2e 0f 1f 84 00 00 00 00
RSP: 002b:00007f36f091cc78 EFLAGS: 00000246 ORIG_RAX: 0000000000000038
RAX: ffffffffffffffda RBX: 0000000000000005 RCX: 0000000000457ec9
RDX: 9999999999999999 RSI: 0000000000000000 RDI: 0000000000000000
RBP: 000000000073bf00 R08: ffffffffffffffff R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 00007f36f091d6d4
R13: 00000000004be2a0 R14: 00000000004ce760 R15: 00000000ffffffff

Allocated by task 15809:
 save_stack+0x45/0xd0 mm/kasan/common.c:73
 set_track mm/kasan/common.c:85 [inline]
 __kasan_kmalloc mm/kasan/common.c:496 [inline]
 __kasan_kmalloc.constprop.0+0xcf/0xe0 mm/kasan/common.c:469
 kasan_kmalloc mm/kasan/common.c:504 [inline]
 kasan_slab_alloc+0xf/0x20 mm/kasan/common.c:411
 kmem_cache_alloc_node+0x144/0x710 mm/slab.c:3633
 alloc_task_struct_node kernel/fork.c:158 [inline]
 dup_task_struct kernel/fork.c:845 [inline]
 copy_process+0x405b/0x8710 kernel/fork.c:1753
 _do_fork+0x1a9/0x1170 kernel/fork.c:2227
 __do_sys_clone kernel/fork.c:2334 [inline]
 __se_sys_clone kernel/fork.c:2328 [inline]
 __x64_sys_clone+0xbf/0x150 kernel/fork.c:2328
 do_syscall_64+0x1a3/0x800 arch/x86/entry/common.c:290
 entry_SYSCALL_64_after_hwframe+0x49/0xbe

Freed by task 15817:
 save_stack+0x45/0xd0 mm/kasan/common.c:73
 set_track mm/kasan/common.c:85 [inline]
 __kasan_slab_free+0x102/0x150 mm/kasan/common.c:458
 kasan_slab_free+0xe/0x10 mm/kasan/common.c:466
 __cache_free mm/slab.c:3487 [inline]
 kmem_cache_free+0x86/0x260 mm/slab.c:3749
 free_task_struct kernel/fork.c:163 [inline]
 free_task+0x170/0x1f0 kernel/fork.c:458
 __put_task_struct+0x2e0/0x630 kernel/fork.c:731
 put_task_struct+0x4b/0x60 include/linux/sched/task.h:98
 oom_kill_process.cold+0x93a/0x9d4 mm/oom_kill.c:990
 out_of_memory+0x885/0x1420 mm/oom_kill.c:1133
 mem_cgroup_out_of_memory+0x160/0x210 mm/memcontrol.c:1393
 mem_cgroup_oom mm/memcontrol.c:1721 [inline]
 try_charge+0xd44/0x19b0 mm/memcontrol.c:2283
 memcg_kmem_charge_memcg+0x7c/0x130 mm/memcontrol.c:2591
 memcg_kmem_charge+0x13b/0x340 mm/memcontrol.c:2624
 __alloc_pages_nodemask+0x7b8/0xdc0 mm/page_alloc.c:4559
 __alloc_pages include/linux/gfp.h:473 [inline]
 __alloc_pages_node include/linux/gfp.h:486 [inline]
 alloc_pages_node include/linux/gfp.h:500 [inline]
 alloc_thread_stack_node kernel/fork.c:246 [inline]
 dup_task_struct kernel/fork.c:849 [inline]
 copy_process+0x847/0x8710 kernel/fork.c:1753
 _do_fork+0x1a9/0x1170 kernel/fork.c:2227
 __do_sys_clone kernel/fork.c:2334 [inline]
 __se_sys_clone kernel/fork.c:2328 [inline]
 __x64_sys_clone+0xbf/0x150 kernel/fork.c:2328
 do_syscall_64+0x1a3/0x800 arch/x86/entry/common.c:290
 entry_SYSCALL_64_after_hwframe+0x49/0xbe

The buggy address belongs to the object at ffff8880595f6540
 which belongs to the cache task_struct(33:syz1) of size 6080
The buggy address is located 1792 bytes inside of
 6080-byte region [ffff8880595f6540, ffff8880595f7d00)
The buggy address belongs to the page:
page:ffffea0001657d80 count:1 mapcount:0 mapping:ffff888091f65840 index:0x0 compound_mapcount: 0
flags: 0x1fffc0000010200(slab|head)
raw: 01fffc0000010200 ffffea00028b3288 ffffea0002612788 ffff888091f65840
raw: 0000000000000000 ffff8880595f6540 0000000100000001 ffff888057fe2b00
page dumped because: kasan: bad access detected
page->mem_cgroup:ffff888057fe2b00
--------------------

On looking further it seems like the process selected to be oom-killed
has exited even before reaching read_lock(&tasklist_lock) in
oom_kill_process(). More specifically, tsk->usage is 1, which is due to
the get_task_struct() in oom_evaluate_task(), and the put_task_struct()
within for_each_thread() frees the tsk while for_each_thread() still
tries to access it. The easiest fix is to do a get/put across the
for_each_thread() on the selected task.

Now the next question is: should we continue with the oom-kill, as the
previously selected task has exited? However, before adding more
complexity and heuristics, let's answer why we even look at the
children of the oom-kill selected task. The select_bad_process() has
already selected the worst process in the system/memcg. Due to a race,
the selected process might not be the worst at kill time, but does that
matter? The userspace can play with oom_score_adj to prefer children to
be killed before the parent. I looked at the history, but it seems this
predates git history.

Signed-off-by: Shakeel Butt <shakeelb@google.com>
---
 mm/oom_kill.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 0930b4365be7..1a007dae1e8f 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -981,6 +981,13 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
 	 * still freeing memory.
 	 */
 	read_lock(&tasklist_lock);
+
+	/*
+	 * The task 'p' might have already exited before reaching here. The
+	 * put_task_struct() will free task_struct 'p' while the loop still
+	 * tries to access fields of 'p', so get an extra reference.
+	 */
+	get_task_struct(p);
 	for_each_thread(p, t) {
 		list_for_each_entry(child, &t->children, sibling) {
 			unsigned int child_points;
@@ -1000,6 +1007,7 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
 			}
 		}
 	}
+	put_task_struct(p);
 	read_unlock(&tasklist_lock);
 
 	/*
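The reference-counting shape of the bug is easy to reproduce outside
the kernel. Below is a minimal userspace analogy in plain C — not
kernel code; 'obj' stands in for task_struct and obj_put() for
put_task_struct():

#include <stdio.h>
#include <stdlib.h>

struct obj {
	int refs;
	int data;
};

static void obj_get(struct obj *o)
{
	o->refs++;
}

static void obj_put(struct obj *o)
{
	if (--o->refs == 0)
		free(o);	/* last reference: object is gone */
}

int main(void)
{
	struct obj *p = malloc(sizeof(*p));

	p->refs = 1;	/* the reference oom_evaluate_task() took */
	p->data = 42;

	/*
	 * Without this obj_get(), the obj_put() below drops the last
	 * reference and the p->data read afterwards is a use-after-free,
	 * the same shape KASAN flagged above.
	 */
	obj_get(p);

	obj_put(p);	/* the put inside the loop body */
	printf("still valid: %d\n", p->data);

	obj_put(p);	/* drop the pin; now the object really frees */
	return 0;
}

Compile with ASan (cc -fsanitize=address) and delete the obj_get() plus
the final obj_put() to see a heap-use-after-free report on the printf
read.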