Message ID: 20191029005405.201986-1-shakeelb@google.com (mailing list archive)
State:      New, archived
Series:     mm: memcontrol: fix data race in mem_cgroup_select_victim_node
On Mon 28-10-19 17:54:05, Shakeel Butt wrote:
> Syzbot reported the following bug:
>
> BUG: KCSAN: data-race in mem_cgroup_select_victim_node / mem_cgroup_select_victim_node
>
> write to 0xffff88809fade9b0 of 4 bytes by task 8603 on cpu 0:
>  mem_cgroup_select_victim_node+0xb5/0x3d0 mm/memcontrol.c:1686
>  try_to_free_mem_cgroup_pages+0x175/0x4c0 mm/vmscan.c:3376
>  reclaim_high.constprop.0+0xf7/0x140 mm/memcontrol.c:2349
>  mem_cgroup_handle_over_high+0x96/0x180 mm/memcontrol.c:2430
>  tracehook_notify_resume include/linux/tracehook.h:197 [inline]
>  exit_to_usermode_loop+0x20c/0x2c0 arch/x86/entry/common.c:163
>  prepare_exit_to_usermode+0x180/0x1a0 arch/x86/entry/common.c:194
>  swapgs_restore_regs_and_return_to_usermode+0x0/0x40
>
> read to 0xffff88809fade9b0 of 4 bytes by task 7290 on cpu 1:
>  mem_cgroup_select_victim_node+0x92/0x3d0 mm/memcontrol.c:1675
>  try_to_free_mem_cgroup_pages+0x175/0x4c0 mm/vmscan.c:3376
>  reclaim_high.constprop.0+0xf7/0x140 mm/memcontrol.c:2349
>  mem_cgroup_handle_over_high+0x96/0x180 mm/memcontrol.c:2430
>  tracehook_notify_resume include/linux/tracehook.h:197 [inline]
>  exit_to_usermode_loop+0x20c/0x2c0 arch/x86/entry/common.c:163
>  prepare_exit_to_usermode+0x180/0x1a0 arch/x86/entry/common.c:194
>  swapgs_restore_regs_and_return_to_usermode+0x0/0x40
>
> mem_cgroup_select_victim_node() can be called concurrently which reads
> and modifies memcg->last_scanned_node without any synchronization. So,
> read and modify memcg->last_scanned_node with READ_ONCE()/WRITE_ONCE()
> to stop potential reordering.

I am sorry but I do not understand the problem and the fix. Why does the
race happen, and why does _ONCE fix it? There is still no
synchronization. Do you want to prevent memcg->last_scanned_node from
being reloaded?
> Signed-off-by: Shakeel Butt <shakeelb@google.com>
> Suggested-by: Eric Dumazet <edumazet@google.com>
> Cc: Greg Thelen <gthelen@google.com>
> Reported-by: syzbot+13f93c99c06988391efe@syzkaller.appspotmail.com
> ---
>  mm/memcontrol.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index c4c555055a72..5a06739dd3e4 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1667,7 +1667,7 @@ int mem_cgroup_select_victim_node(struct mem_cgroup *memcg)
>  	int node;
>
>  	mem_cgroup_may_update_nodemask(memcg);
> -	node = memcg->last_scanned_node;
> +	node = READ_ONCE(memcg->last_scanned_node);
>
>  	node = next_node_in(node, memcg->scan_nodes);
>  	/*
> @@ -1678,7 +1678,7 @@ int mem_cgroup_select_victim_node(struct mem_cgroup *memcg)
>  	if (unlikely(node == MAX_NUMNODES))
>  		node = numa_node_id();
>
> -	memcg->last_scanned_node = node;
> +	WRITE_ONCE(memcg->last_scanned_node, node);
>  	return node;
>  }
>  #else
> --
> 2.24.0.rc0.303.g954a862665-goog
+Marco

On Tue, Oct 29, 2019 at 2:03 AM Michal Hocko <mhocko@kernel.org> wrote:
>
> On Mon 28-10-19 17:54:05, Shakeel Butt wrote:
> > Syzbot reported the following bug:
> >
> > BUG: KCSAN: data-race in mem_cgroup_select_victim_node / mem_cgroup_select_victim_node
> [...]
> > mem_cgroup_select_victim_node() can be called concurrently which reads
> > and modifies memcg->last_scanned_node without any synchronization. So,
> > read and modify memcg->last_scanned_node with READ_ONCE()/WRITE_ONCE()
> > to stop potential reordering.
>
> I am sorry but I do not understand the problem and the fix. Why does the
> race happen and why does _ONCE fix it? There is still no
> synchronization. Do you want to prevent memcg->last_scanned_node
> reloading?

The problem is that memcg->last_scanned_node can be read and modified
concurrently. To me it seems like a tolerable race, not worth adding an
explicit lock for. My aim was to make KCSAN happy here so it can point
us at other concurrency bugs. However, I see that it might complain
next about memcg->scan_nodes.

Now taking a step back, I am questioning the whole motivation behind
mem_cgroup_select_victim_node(). Since we pass the ZONELIST_FALLBACK
zonelist to the reclaimer, shrink_node() will be called for all
potential nodes. Also, we don't short-circuit the shrink_node()
traversal of all nodes on nr_reclaimed, and we scan
(size_on_node >> priority) pages on every node, so I don't see the
reason for having a round-robin order of node traversal.

I am thinking of removing the whole mem_cgroup_select_victim_node()
heuristic. Please let me know if there are any objections.

thanks,
Shakeel
On Tue, 29 Oct 2019, Shakeel Butt wrote:
> +Marco
>
> On Tue, Oct 29, 2019 at 2:03 AM Michal Hocko <mhocko@kernel.org> wrote:
> >
> > On Mon 28-10-19 17:54:05, Shakeel Butt wrote:
> > > Syzbot reported the following bug:
> > >
> > > BUG: KCSAN: data-race in mem_cgroup_select_victim_node / mem_cgroup_select_victim_node
> > [...]
> > > mem_cgroup_select_victim_node() can be called concurrently which reads
> > > and modifies memcg->last_scanned_node without any synchronization. So,
> > > read and modify memcg->last_scanned_node with READ_ONCE()/WRITE_ONCE()
> > > to stop potential reordering.

Strictly speaking, READ_ONCE/WRITE_ONCE alone avoid various bad
compiler optimizations, including store tearing, load tearing, etc.
They do not add memory barriers to constrain memory ordering. (If this
code needs memory ordering guarantees w.r.t. previous loads/stores,
then this alone is not enough.)

> > I am sorry but I do not understand the problem and the fix. Why does the
> > race happen and why does _ONCE fix it? There is still no
> > synchronization. Do you want to prevent memcg->last_scanned_node
> > reloading?
>
> The problem is that memcg->last_scanned_node can be read and modified
> concurrently. Though to me it seems like a tolerable race and not
> worth adding an explicit lock. My aim was to make KCSAN happy here to
> look elsewhere for the concurrency bugs. However I see that it might
> complain next on memcg->scan_nodes.

The plain concurrent reads/writes are a data race, which may manifest
in various undefined behaviour due to compiler optimizations. The _ONCE
accessors will prevent these (KCSAN only reports data races). Note that
"data race" does not necessarily imply "race condition"; some data
races are race conditions (usually the more interesting bugs) -- but
not *all* data races are race conditions. If there is no race condition
here that warrants heavier synchronization (locking etc.), then this
patch is all that should be needed.

I can't comment on the rest.

Thanks,
-- Marco
On Tue, Oct 29, 2019 at 11:09:29AM -0700, Shakeel Butt wrote:
> +Marco
>
> On Tue, Oct 29, 2019 at 2:03 AM Michal Hocko <mhocko@kernel.org> wrote:
> >
> > On Mon 28-10-19 17:54:05, Shakeel Butt wrote:
> > > Syzbot reported the following bug:
> > >
> > > BUG: KCSAN: data-race in mem_cgroup_select_victim_node / mem_cgroup_select_victim_node
> > [...]
> >
> > I am sorry but I do not understand the problem and the fix. [...]
>
> The problem is that memcg->last_scanned_node can be read and modified
> concurrently. [...]
>
> Now taking a step back, I am questioning the whole motivation behind
> mem_cgroup_select_victim_node(). Since we pass the ZONELIST_FALLBACK
> zonelist to the reclaimer, shrink_node() will be called for all
> potential nodes. Also we don't short circuit the traversal of
> shrink_node for all nodes on nr_reclaimed and we scan (size_on_node >>
> priority) for all nodes, I don't see the reason behind having round
> robin order of node traversal.

It's actually only very recently that we don't bail out of the reclaim
loop anymore - if I'm not missing anything, it was only 1ba6fc9af35b
("mm: vmscan: do not share cgroup iteration between reclaimers") that
removed the last bailout condition on sc->nr_reclaimed.

> I am thinking of removing the whole mem_cgroup_select_victim_node()
> heuristic. Please let me know if there are any objections.

In the current state, I don't see any reason to keep it, either. We can
always just start the zonelist walk from the current node.

A nice cleanup, actually. Good catch!
On Tue, Oct 29, 2019 at 11:28 AM Marco Elver <elver@google.com> wrote:
>
> On Tue, 29 Oct 2019, Shakeel Butt wrote:
> > [...]
>
> Strictly speaking, READ_ONCE/WRITE_ONCE alone avoid various bad compiler
> optimizations, including store tearing, load tearing, etc. This does not
> add memory barriers to constrain memory ordering. (If this code needs
> some memory ordering guarantees w.r.t. previous loads/stores then this
> alone is not enough.)
>
> [...]
>
> The plain concurrent reads/writes are a data race, which may manifest in
> various undefined behaviour due to compiler optimizations. The _ONCE
> will prevent these (KCSAN only reports data races). Note that "data
> race" does not necessarily imply "race condition"; some data races are
> race conditions (usually the more interesting bugs) -- but not *all*
> data races are race conditions. If there is no race condition here that
> warrants heavier synchronization (locking etc.), then this patch is all
> that should be needed.
>
> I can't comment on the rest.

Thanks Marco for the explanation.
On Tue, Oct 29, 2019 at 11:34 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Tue, Oct 29, 2019 at 11:09:29AM -0700, Shakeel Butt wrote:
> > [...]
> >
> > Now taking a step back, I am questioning the whole motivation behind
> > mem_cgroup_select_victim_node(). [...]
>
> It's actually only very recently that we don't bail out of the reclaim
> loop anymore - if I'm not missing anything, it was only 1ba6fc9af35b
> ("mm: vmscan: do not share cgroup iteration between reclaimers") that
> removed the last bailout condition on sc->nr_reclaimed.
>
> > I am thinking of removing the whole mem_cgroup_select_victim_node()
> > heuristic. Please let me know if there are any objections.
>
> In the current state, I don't see any reason to keep it, either. We
> can always just start the zonelist walk from the current node.
>
> A nice cleanup, actually. Good catch!

Thanks, I will follow up with the removal of this heuristic.
On Tue 29-10-19 11:09:29, Shakeel Butt wrote:
> +Marco
>
> On Tue, Oct 29, 2019 at 2:03 AM Michal Hocko <mhocko@kernel.org> wrote:
> >
> > On Mon 28-10-19 17:54:05, Shakeel Butt wrote:
> > > Syzbot reported the following bug:
> > >
> > > BUG: KCSAN: data-race in mem_cgroup_select_victim_node / mem_cgroup_select_victim_node
> > [...]
> >
> > I am sorry but I do not understand the problem and the fix. Why does the
> > race happen and why does _ONCE fix it? There is still no
> > synchronization. Do you want to prevent memcg->last_scanned_node
> > reloading?
>
> The problem is that memcg->last_scanned_node can be read and modified
> concurrently. Though to me it seems like a tolerable race and not
> worth adding an explicit lock.

Agreed.

> My aim was to make KCSAN happy here to
> look elsewhere for the concurrency bugs. However I see that it might
> complain next on memcg->scan_nodes.

I would really refrain from adding whatever measure it takes to silence
some tool without a deeper understanding of why that is needed.
$FOO_ONCE will prevent the compiler from doing funky stuff, but this is
an int, and I would be really surprised if $FOO_ONCE made any practical
difference here.

> Now taking a step back, I am questioning the whole motivation behind
> mem_cgroup_select_victim_node(). [...]
>
> I am thinking of removing the whole mem_cgroup_select_victim_node()
> heuristic. Please let me know if there are any objections.

I would have to think more about this, but it surely sounds like a
preferable way forward compared to adding $FOO_ONCE to silence the
tool.

Thanks!
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index c4c555055a72..5a06739dd3e4 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1667,7 +1667,7 @@ int mem_cgroup_select_victim_node(struct mem_cgroup *memcg)
 	int node;

 	mem_cgroup_may_update_nodemask(memcg);
-	node = memcg->last_scanned_node;
+	node = READ_ONCE(memcg->last_scanned_node);

 	node = next_node_in(node, memcg->scan_nodes);
 	/*
@@ -1678,7 +1678,7 @@ int mem_cgroup_select_victim_node(struct mem_cgroup *memcg)
 	if (unlikely(node == MAX_NUMNODES))
 		node = numa_node_id();

-	memcg->last_scanned_node = node;
+	WRITE_ONCE(memcg->last_scanned_node, node);
 	return node;
 }
 #else
Syzbot reported the following bug:

BUG: KCSAN: data-race in mem_cgroup_select_victim_node / mem_cgroup_select_victim_node

write to 0xffff88809fade9b0 of 4 bytes by task 8603 on cpu 0:
 mem_cgroup_select_victim_node+0xb5/0x3d0 mm/memcontrol.c:1686
 try_to_free_mem_cgroup_pages+0x175/0x4c0 mm/vmscan.c:3376
 reclaim_high.constprop.0+0xf7/0x140 mm/memcontrol.c:2349
 mem_cgroup_handle_over_high+0x96/0x180 mm/memcontrol.c:2430
 tracehook_notify_resume include/linux/tracehook.h:197 [inline]
 exit_to_usermode_loop+0x20c/0x2c0 arch/x86/entry/common.c:163
 prepare_exit_to_usermode+0x180/0x1a0 arch/x86/entry/common.c:194
 swapgs_restore_regs_and_return_to_usermode+0x0/0x40

read to 0xffff88809fade9b0 of 4 bytes by task 7290 on cpu 1:
 mem_cgroup_select_victim_node+0x92/0x3d0 mm/memcontrol.c:1675
 try_to_free_mem_cgroup_pages+0x175/0x4c0 mm/vmscan.c:3376
 reclaim_high.constprop.0+0xf7/0x140 mm/memcontrol.c:2349
 mem_cgroup_handle_over_high+0x96/0x180 mm/memcontrol.c:2430
 tracehook_notify_resume include/linux/tracehook.h:197 [inline]
 exit_to_usermode_loop+0x20c/0x2c0 arch/x86/entry/common.c:163
 prepare_exit_to_usermode+0x180/0x1a0 arch/x86/entry/common.c:194
 swapgs_restore_regs_and_return_to_usermode+0x0/0x40

mem_cgroup_select_victim_node() can be called concurrently which reads
and modifies memcg->last_scanned_node without any synchronization. So,
read and modify memcg->last_scanned_node with READ_ONCE()/WRITE_ONCE()
to stop potential reordering.

Signed-off-by: Shakeel Butt <shakeelb@google.com>
Suggested-by: Eric Dumazet <edumazet@google.com>
Cc: Greg Thelen <gthelen@google.com>
Reported-by: syzbot+13f93c99c06988391efe@syzkaller.appspotmail.com
---
 mm/memcontrol.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)