Message ID | 20250211081819.33307-1-chenridong@huaweicloud.com |
---|---|
State | New |
Series | memcg: avoid dead loop when setting memory.max |
On Tue 11-02-25 08:18:19, Chen Ridong wrote:
> From: Chen Ridong <chenridong@huawei.com>
>
> A softlockup issue was found with stress test:
>  watchdog: BUG: soft lockup - CPU#27 stuck for 26s! [migration/27:181]
>  CPU: 27 UID: 0 PID: 181 Comm: migration/27 6.14.0-rc2-next-20250210 #1
>  Stopper: multi_cpu_stop <- stop_machine_from_inactive_cpu
>  RIP: 0010:stop_machine_yield+0x2/0x10
>  RSP: 0000:ff4a0dcecd19be48 EFLAGS: 00000246
>  RAX: ffffffff89c0108f RBX: ff4a0dcec03afe44 RCX: 0000000000000000
>  RDX: ff1cdaaf6eba5808 RSI: 0000000000000282 RDI: ff1cda80c1775a40
>  RBP: 0000000000000001 R08: 00000011620096c6 R09: 7fffffffffffffff
>  R10: 0000000000000001 R11: 0000000000000100 R12: ff1cda80c1775a40
>  R13: 0000000000000000 R14: 0000000000000001 R15: ff4a0dcec03afe20
>  FS:  0000000000000000(0000) GS:ff1cdaaf6eb80000(0000)
>  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>  CR2: 0000000000000000 CR3: 00000025e2c2a001 CR4: 0000000000773ef0
>  DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>  DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>  PKRU: 55555554
>  Call Trace:
>   multi_cpu_stop+0x8f/0x100
>   cpu_stopper_thread+0x90/0x140
>   smpboot_thread_fn+0xad/0x150
>   kthread+0xc2/0x100
>   ret_from_fork+0x2d/0x50
>
> The stress test involves CPU hotplug operations and memory control group
> (memcg) operations. The scenario can be described as follows:
>
>  echo xx > memory.max   cache_ap_online                oom_reaper
>  (CPU23)                (CPU50)
>  xx < usage             stop_machine_from_inactive_cpu
>  for(;;)                // all active cpus
>    trigger OOM          queue_stop_cpus_work
>    // waiting oom_reaper
>                         multi_cpu_stop(migration/xx)
>                         // sync all active cpus ack
>                         // waiting cpu23 ack
>                                                        // CPU50 loops in multi_cpu_stop
>                                                        waiting cpu50
>
> Detailed explanation:
> 1. When the usage is larger than xx, an OOM may be triggered. If the
>    process does not handle the kill signal immediately, it will loop
>    in memory_max_write.

Do I get it right that the issue is that mem_cgroup_out_of_memory
doesn't have any cond_resched, so it cannot yield to the stopper kthread?
oom itself cannot make any progress because the oom victim is blocked as
per 3).

> 2. When cache_ap_online is triggered, the multi_cpu_stop is queued to the
>    active cpus. Within the multi_cpu_stop function, it attempts to
>    synchronize the CPU states. However, CPU23 didn't acknowledge
>    because it is stuck in a loop within the for(;;).
> 3. The oom_reaper process is blocked because CPU50 is in a loop, waiting
>    for CPU23 to acknowledge the synchronization request.
> 4. Finally, this formed a cyclic dependency and led to a softlockup and
>    dead loop.
>
> To fix this issue, add cond_resched() in memory_max_write, so that
> it will not block the migration task.

My first question was why this is not a problem in other
allocation/charge paths, but this one is different because it doesn't
ever try to reclaim again after MAX_RECLAIM_RETRIES reclaim rounds.
We do have scheduling points in the reclaim path which are no longer
triggered after we hit the oom situation in this case.

I was thinking about having a guaranteed cond_resched when the oom killer
fails to find a victim, but it seems the simplest fix for this particular
corner case is to add cond_resched as you did here. Hopefully we will
get rid of it very soon when !PREEMPT is removed.

Btw. this could be a problem on a single CPU machine even without CPU
hotplug, as the oom_reaper won't run until memory_max_write yields the
cpu.
> Fixes: b6e6edcfa405 ("mm: memcontrol: reclaim and OOM kill when shrinking memory.max below usage")
> Signed-off-by: Chen Ridong <chenridong@huawei.com>

Acked-by: Michal Hocko <mhocko@suse.com>

> ---
>  mm/memcontrol.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 8d21c1a44220..16f3bdbd37d8 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -4213,6 +4213,7 @@ static ssize_t memory_max_write(struct kernfs_open_file *of,
>  		memcg_memory_event(memcg, MEMCG_OOM);
>  		if (!mem_cgroup_out_of_memory(memcg, GFP_KERNEL, 0))
>  			break;
> +		cond_resched();
>  	}
>
>  	memcg_wb_domain_size_changed(memcg);
> --
> 2.34.1
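For readers who want to see where the scheduling points disappear, below is a condensed, non-verbatim sketch of the memory_max_write() loop under discussion. The helper names are the ones mentioned in this thread, but setup, error handling, and argument lists are abbreviated and may not match any exact tree; treat it as illustrative only.

static ssize_t memory_max_write(struct kernfs_open_file *of,
				char *buf, size_t nbytes, loff_t off)
{
	struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
	unsigned int nr_reclaims = MAX_RECLAIM_RETRIES;
	bool drained = false;
	unsigned long max;

	/* ... parse "max" from buf and publish the new limit ... */

	for (;;) {
		unsigned long nr_pages = page_counter_read(&memcg->memory);

		if (nr_pages <= max)
			break;
		if (signal_pending(current))
			break;

		if (!drained) {
			drain_all_stock(memcg);	/* flush per-cpu charge caches */
			drained = true;
			continue;
		}

		if (nr_reclaims) {
			/* direct reclaim; its shrink loops contain cond_resched() */
			if (!try_to_free_mem_cgroup_pages(memcg, nr_pages - max,
							  GFP_KERNEL, 0 /* options elided */))
				nr_reclaims--;
			continue;
		}

		/*
		 * Reclaim retries exhausted: only the OOM killer runs from
		 * here on.  If the previous victim has not been reaped yet,
		 * mem_cgroup_out_of_memory() returns true without sleeping,
		 * so without the cond_resched() below this loop never yields
		 * to the migration/NN stopper thread.
		 */
		memcg_memory_event(memcg, MEMCG_OOM);
		if (!mem_cgroup_out_of_memory(memcg, GFP_KERNEL, 0))
			break;
		cond_resched();		/* the fix from this patch */
	}

	memcg_wb_domain_size_changed(memcg);
	return nbytes;
}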
On 2025/2/11 17:02, Michal Hocko wrote:
> On Tue 11-02-25 08:18:19, Chen Ridong wrote:
>> From: Chen Ridong <chenridong@huawei.com>
>>
>> A softlockup issue was found with stress test:
>>  watchdog: BUG: soft lockup - CPU#27 stuck for 26s! [migration/27:181]
[...]
>> Detailed explanation:
>> 1. When the usage is larger than xx, an OOM may be triggered. If the
>>    process does not handle the kill signal immediately, it will loop
>>    in memory_max_write.
>
> Do I get it right that the issue is that mem_cgroup_out_of_memory
> doesn't have any cond_resched, so it cannot yield to the stopper kthread?
> oom itself cannot make any progress because the oom victim is blocked as
> per 3).
>

Yes, the same task was evaluated as the victim again, and it is blocked as
described in point 3). Consequently, oom_evaluate_task set
oc->chosen = (void *)-1UL, out_of_memory returned !!oc->chosen as "progress",
and no cond_resched() was ever invoked:

	for(;;) {
		...
		mem_cgroup_out_of_memory
		  out_of_memory
		    select_bad_process
		      oom_evaluate_task
		        oc->chosen = (void *)-1UL;
		    return !!oc->chosen;
	}

>> 2. When cache_ap_online is triggered, the multi_cpu_stop is queued to the
>>    active cpus. Within the multi_cpu_stop function, it attempts to
>>    synchronize the CPU states. However, CPU23 didn't acknowledge
>>    because it is stuck in a loop within the for(;;).
>> 3. The oom_reaper process is blocked because CPU50 is in a loop, waiting
>>    for CPU23 to acknowledge the synchronization request.
>> 4. Finally, this formed a cyclic dependency and led to a softlockup and
>>    dead loop.
>>
>> To fix this issue, add cond_resched() in memory_max_write, so that
>> it will not block the migration task.
>
> My first question was why this is not a problem in other
> allocation/charge paths, but this one is different because it doesn't
> ever try to reclaim again after MAX_RECLAIM_RETRIES reclaim rounds.
> We do have scheduling points in the reclaim path which are no longer
> triggered after we hit the oom situation in this case.
>
> I was thinking about having a guaranteed cond_resched when the oom killer
> fails to find a victim, but it seems the simplest fix for this particular
> corner case is to add cond_resched as you did here. Hopefully we will
> get rid of it very soon when !PREEMPT is removed.
>
> Btw. this could be a problem on a single CPU machine even without CPU
> hotplug, as the oom_reaper won't run until memory_max_write yields the
> cpu.
>
>> Fixes: b6e6edcfa405 ("mm: memcontrol: reclaim and OOM kill when shrinking memory.max below usage")
>> Signed-off-by: Chen Ridong <chenridong@huawei.com>
>
> Acked-by: Michal Hocko <mhocko@suse.com>
>

Thank you very much.

Best regards,
Ridong

>> ---
>>  mm/memcontrol.c | 1 +
>>  1 file changed, 1 insertion(+)
[...]
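The abort path Ridong points at lives in oom_evaluate_task() in mm/oom_kill.c. A condensed, assumed sketch of it (not verbatim kernel source; unrelated checks elided and details may differ between trees) is shown below: a victim that still owns its mm and has not had MMF_OOM_SKIP set by the oom_reaper aborts the whole scan, so the OOM path reports success without killing anything or sleeping.

/* Condensed, assumed sketch of oom_evaluate_task(); illustrative only. */
static int oom_evaluate_task(struct task_struct *task, void *arg)
{
	struct oom_control *oc = arg;

	/* ... cpuset/memcg eligibility checks elided ... */

	if (!is_sysrq_oom(oc) && tsk_is_oom_victim(task)) {
		if (test_bit(MMF_OOM_SKIP, &task->signal->oom_mm->flags))
			return 0;		/* already reaped, keep scanning */
		oc->chosen = (void *)-1UL;	/* victim not reaped yet: abort scan */
		return 1;
	}

	/* ... oom_badness() scoring and oc->chosen selection elided ... */
	return 0;
}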
On Tue, Feb 11, 2025 at 08:18:19AM +0000, Chen Ridong wrote:
> From: Chen Ridong <chenridong@huawei.com>
>
> A softlockup issue was found with stress test:
>  watchdog: BUG: soft lockup - CPU#27 stuck for 26s! [migration/27:181]
[...]
> To fix this issue, add cond_resched() in memory_max_write, so that
> it will not block the migration task.
>
> Fixes: b6e6edcfa405 ("mm: memcontrol: reclaim and OOM kill when shrinking memory.max below usage")
> Signed-off-by: Chen Ridong <chenridong@huawei.com>
> ---
>  mm/memcontrol.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 8d21c1a44220..16f3bdbd37d8 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -4213,6 +4213,7 @@ static ssize_t memory_max_write(struct kernfs_open_file *of,
>  		memcg_memory_event(memcg, MEMCG_OOM);
>  		if (!mem_cgroup_out_of_memory(memcg, GFP_KERNEL, 0))

Wouldn't it be more robust if we put an upper bound on the else case of
the above condition, i.e. a fixed number of retries? As you have discovered,
there is a hidden dependency on the forward progress of oom_reaper in this
check/code-path, which I think is not needed.

>  			break;
> +		cond_resched();
>  	}
>
>  	memcg_wb_domain_size_changed(memcg);
> --
> 2.34.1
>
On Tue 11-02-25 11:04:21, Shakeel Butt wrote:
> On Tue, Feb 11, 2025 at 08:18:19AM +0000, Chen Ridong wrote:
[...]
> Wouldn't it be more robust if we put an upper bound on the else case of
> the above condition, i.e. a fixed number of retries? As you have discovered,
> there is a hidden dependency on the forward progress of oom_reaper in this
> check/code-path, which I think is not needed.

Any OOM path has a dependency on the oom_reaper or the task exiting. Is
there any reason why this path should be special? With cond_resched we can
look forward to the day when this is simply removed and the code will
still work. With a number of retries we would get non-deterministic,
time-dependent behavior, because the number of retries rather than forward
progress would define the failure mode.
On Tue, Feb 11, 2025 at 09:35:47PM +0100, Michal Hocko wrote:
> On Tue 11-02-25 11:04:21, Shakeel Butt wrote:
> > On Tue, Feb 11, 2025 at 08:18:19AM +0000, Chen Ridong wrote:
> [...]
> > Wouldn't it be more robust if we put an upper bound on the else case of
> > the above condition, i.e. a fixed number of retries? As you have discovered,
> > there is a hidden dependency on the forward progress of oom_reaper in this
> > check/code-path, which I think is not needed.
>
> Any OOM path has a dependency on the oom_reaper or the task exiting.

Personally I would say any allocation (or charge) has a dependency on
oom_reaper making progress (but not arguing on this point).

> Is there any reason why this path should be special? With cond_resched
> we can look forward to the day when this is simply removed and the code
> will still work. With a number of retries we would get non-deterministic,
> time-dependent behavior, because the number of retries rather than forward
> progress would define the failure mode.

I am not against adding cond_resched(), which might/will be removed in the
future. To me it is just that we are leaving our fate to the cpu scheduler
here, which maybe is ok as we (MM) do it all over the place. Anyways, no
objection from me.
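For illustration only, the bounded-retries alternative Shakeel floats could look something like the fragment below. It is not a proposed patch; nr_oom_retries and its initial value are invented for this sketch.

	/*
	 * Hypothetical alternative (illustration only, not a proposed
	 * patch): cap the number of OOM-kill attempts so memory_max_write
	 * does not depend on oom_reaper's forward progress.
	 */
	unsigned int nr_oom_retries = MAX_RECLAIM_RETRIES;	/* invented bound */

	for (;;) {
		/* ... drain and reclaim rounds as before ... */

		memcg_memory_event(memcg, MEMCG_OOM);
		if (!mem_cgroup_out_of_memory(memcg, GFP_KERNEL, 0))
			break;
		if (!nr_oom_retries--)
			break;	/* give up instead of looping forever */
	}

As Michal notes above, the failure mode of such a variant would depend on how many retries fit into a given time window rather than on actual forward progress, which is the main argument against it.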
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 8d21c1a44220..16f3bdbd37d8 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4213,6 +4213,7 @@ static ssize_t memory_max_write(struct kernfs_open_file *of,
 		memcg_memory_event(memcg, MEMCG_OOM);
 		if (!mem_cgroup_out_of_memory(memcg, GFP_KERNEL, 0))
 			break;
+		cond_resched();
 	}
 
 	memcg_wb_domain_size_changed(memcg);