
memcg: avoid dead loop when setting memory.max

Message ID 20250211081819.33307-1-chenridong@huaweicloud.com (mailing list archive)
State New
Series memcg: avoid dead loop when setting memory.max

Commit Message

Chen Ridong Feb. 11, 2025, 8:18 a.m. UTC
From: Chen Ridong <chenridong@huawei.com>

A softlockup issue was found with stress test:
 watchdog: BUG: soft lockup - CPU#27 stuck for 26s! [migration/27:181]
 CPU: 27 UID: 0 PID: 181 Comm: migration/27 6.14.0-rc2-next-20250210 #1
 Stopper: multi_cpu_stop <- stop_machine_from_inactive_cpu
 RIP: 0010:stop_machine_yield+0x2/0x10
 RSP: 0000:ff4a0dcecd19be48 EFLAGS: 00000246
 RAX: ffffffff89c0108f RBX: ff4a0dcec03afe44 RCX: 0000000000000000
 RDX: ff1cdaaf6eba5808 RSI: 0000000000000282 RDI: ff1cda80c1775a40
 RBP: 0000000000000001 R08: 00000011620096c6 R09: 7fffffffffffffff
 R10: 0000000000000001 R11: 0000000000000100 R12: ff1cda80c1775a40
 R13: 0000000000000000 R14: 0000000000000001 R15: ff4a0dcec03afe20
 FS:  0000000000000000(0000) GS:ff1cdaaf6eb80000(0000)
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 0000000000000000 CR3: 00000025e2c2a001 CR4: 0000000000773ef0
 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
 PKRU: 55555554
 Call Trace:
  multi_cpu_stop+0x8f/0x100
  cpu_stopper_thread+0x90/0x140
  smpboot_thread_fn+0xad/0x150
  kthread+0xc2/0x100
  ret_from_fork+0x2d/0x50

The stress test involves CPU hotplug operations and memory control group
(memcg) operations. The scenario can be described as follows:

 echo xx > memory.max 	cache_ap_online			oom_reaper
 (CPU23)						(CPU50)
 xx < usage		stop_machine_from_inactive_cpu
 for(;;)			// all active cpus
 trigger OOM		queue_stop_cpus_work
 // waiting oom_reaper
 			multi_cpu_stop(migration/xx)
 			// sync all active cpus ack
 			// waiting cpu23 ack
 			// CPU50 loops in multi_cpu_stop
 							waiting cpu50

Detailed explanation:
1. When the usage is larger than xx, an OOM may be triggered. If the
   process does not handle the kill signal immediately, it will loop
   in memory_max_write.
2. When cache_ap_online is triggered, multi_cpu_stop is queued to the
   active CPUs. Within the multi_cpu_stop function, it attempts to
   synchronize the CPU states. However, CPU23 does not acknowledge
   because it is stuck in the for(;;) loop.
3. The oom_reaper process is blocked because CPU50 is in a loop, waiting
   for CPU23 to acknowledge the synchronization request.
4. Finally, a cyclic dependency is formed, leading to the softlockup and
   dead loop.

To fix this issue, add cond_resched() in memory_max_write(), so that
it will not block the migration task.
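
For reference, the loop in question looks roughly like this (a condensed,
simplified sketch of memory_max_write() in mm/memcontrol.c; parsing, error
handling and exact argument lists are trimmed), with the new cond_resched()
marked:

/* Condensed sketch of memory_max_write() -- not the verbatim source. */
static ssize_t memory_max_write(struct kernfs_open_file *of,
				char *buf, size_t nbytes, loff_t off)
{
	struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
	unsigned int nr_reclaims = MAX_RECLAIM_RETRIES;
	bool drained = false;
	unsigned long max;

	/* parse "xx" into max and publish it as the new limit */

	for (;;) {
		unsigned long nr_pages = page_counter_read(&memcg->memory);

		if (nr_pages <= max)
			break;

		if (signal_pending(current))
			break;

		if (!drained) {
			drain_all_stock(memcg);
			drained = true;
			continue;
		}

		if (nr_reclaims) {
			/* reclaim itself has scheduling points */
			if (!try_to_free_mem_cgroup_pages(memcg, nr_pages - max,
							  GFP_KERNEL, MEMCG_RECLAIM_MAY_SWAP))
				nr_reclaims--;
			continue;
		}

		/*
		 * Once the reclaim retries are exhausted only this branch
		 * runs.  If the victim cannot exit and the oom_reaper cannot
		 * run, nothing below yields the CPU without the fix.
		 */
		memcg_memory_event(memcg, MEMCG_OOM);
		if (!mem_cgroup_out_of_memory(memcg, GFP_KERNEL, 0))
			break;
		cond_resched();		/* <-- the fix */
	}

	memcg_wb_domain_size_changed(memcg);
	return nbytes;
}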

Fixes: b6e6edcfa405 ("mm: memcontrol: reclaim and OOM kill when shrinking memory.max below usage")
Signed-off-by: Chen Ridong <chenridong@huawei.com>
---
 mm/memcontrol.c | 1 +
 1 file changed, 1 insertion(+)

Comments

Michal Hocko Feb. 11, 2025, 9:02 a.m. UTC | #1
On Tue 11-02-25 08:18:19, Chen Ridong wrote:
> From: Chen Ridong <chenridong@huawei.com>
> 
> A softlockup issue was found with stress test:
>  watchdog: BUG: soft lockup - CPU#27 stuck for 26s! [migration/27:181]
>  CPU: 27 UID: 0 PID: 181 Comm: migration/27 6.14.0-rc2-next-20250210 #1
>  Stopper: multi_cpu_stop <- stop_machine_from_inactive_cpu
>  RIP: 0010:stop_machine_yield+0x2/0x10
>  RSP: 0000:ff4a0dcecd19be48 EFLAGS: 00000246
>  RAX: ffffffff89c0108f RBX: ff4a0dcec03afe44 RCX: 0000000000000000
>  RDX: ff1cdaaf6eba5808 RSI: 0000000000000282 RDI: ff1cda80c1775a40
>  RBP: 0000000000000001 R08: 00000011620096c6 R09: 7fffffffffffffff
>  R10: 0000000000000001 R11: 0000000000000100 R12: ff1cda80c1775a40
>  R13: 0000000000000000 R14: 0000000000000001 R15: ff4a0dcec03afe20
>  FS:  0000000000000000(0000) GS:ff1cdaaf6eb80000(0000)
>  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>  CR2: 0000000000000000 CR3: 00000025e2c2a001 CR4: 0000000000773ef0
>  DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>  DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>  PKRU: 55555554
>  Call Trace:
>   multi_cpu_stop+0x8f/0x100
>   cpu_stopper_thread+0x90/0x140
>   smpboot_thread_fn+0xad/0x150
>   kthread+0xc2/0x100
>   ret_from_fork+0x2d/0x50
> 
> The stress test involves CPU hotplug operations and memory control group
> (memcg) operations. The scenario can be described as follows:
> 
>  echo xx > memory.max 	cache_ap_online			oom_reaper
>  (CPU23)						(CPU50)
>  xx < usage		stop_machine_from_inactive_cpu
>  for(;;)			// all active cpus
>  trigger OOM		queue_stop_cpus_work
>  // waiting oom_reaper
>  			multi_cpu_stop(migration/xx)
>  			// sync all active cpus ack
>  			// waiting cpu23 ack
>  			// CPU50 loops in multi_cpu_stop
>  							waiting cpu50
> 
> Detailed explanation:
> 1. When the usage is larger than xx, an OOM may be triggered. If the
>    process does not handle the kill signal immediately, it will loop
>    in memory_max_write.

Do I get it right that the issue is that mem_cgroup_out_of_memory
doesn't have any cond_resched, so it cannot yield to the stopper kthread?
oom itself cannot make any progress because the oom victim is blocked as
per 3).

> 2. When cache_ap_online is triggered, multi_cpu_stop is queued to the
>    active CPUs. Within the multi_cpu_stop function, it attempts to
>    synchronize the CPU states. However, CPU23 does not acknowledge
>    because it is stuck in the for(;;) loop.
> 3. The oom_reaper process is blocked because CPU50 is in a loop, waiting
>    for CPU23 to acknowledge the synchronization request.
> 4. Finally, a cyclic dependency is formed, leading to the softlockup and
>    dead loop.
> 
> To fix this issue, add cond_resched() in memory_max_write(), so that
> it will not block the migration task.

My first question was why this is not a problem in other
allocation/charge paths but this one is different because it doesn't
ever try to reclaim after MAX_RECLAIM_RETRIES reclaim rounds.
We do have scheduling points in the reclaim path, which are no longer
triggered after we hit the oom situation in this case.
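
For reference, one such scheduling point sits in the main per-lruvec
reclaim loop; roughly (a condensed sketch of shrink_lruvec() in
mm/vmscan.c, details trimmed):

/* Condensed sketch from mm/vmscan.c. */
static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
{
	unsigned long nr[NR_LRU_LISTS];

	/* ... compute per-LRU scan targets into nr[] ... */

	while (nr[LRU_INACTIVE_ANON] || nr[LRU_ACTIVE_FILE] ||
					nr[LRU_INACTIVE_FILE]) {
		/* ... shrink each evictable LRU list ... */

		cond_resched();		/* reclaim yields regularly */
	}

	/* ... */
}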

I was thinking about having a guaranteed cond_resched when the oom killer
fails to find a victim, but it seems the simplest fix for this particular
corner case is to add cond_resched as you did here. Hopefully we will
get rid of it very soon when !PREEMPT is removed.

Btw. this could be a problem on a single CPU machine even without CPU
hotplug, as the oom reaper won't run until memory_max_write yields the
cpu.

> Fixes: b6e6edcfa405 ("mm: memcontrol: reclaim and OOM kill when shrinking memory.max below usage")
> Signed-off-by: Chen Ridong <chenridong@huawei.com>

Acked-by: Michal Hocko <mhocko@suse.com>

> ---
>  mm/memcontrol.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 8d21c1a44220..16f3bdbd37d8 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -4213,6 +4213,7 @@ static ssize_t memory_max_write(struct kernfs_open_file *of,
>  		memcg_memory_event(memcg, MEMCG_OOM);
>  		if (!mem_cgroup_out_of_memory(memcg, GFP_KERNEL, 0))
>  			break;
> +		cond_resched();
>  	}
>  
>  	memcg_wb_domain_size_changed(memcg);
> -- 
> 2.34.1
Chen Ridong Feb. 11, 2025, 11:29 a.m. UTC | #2
On 2025/2/11 17:02, Michal Hocko wrote:
> On Tue 11-02-25 08:18:19, Chen Ridong wrote:
>> From: Chen Ridong <chenridong@huawei.com>
>>
>> A softlockup issue was found with stress test:
>>  watchdog: BUG: soft lockup - CPU#27 stuck for 26s! [migration/27:181]
>>  CPU: 27 UID: 0 PID: 181 Comm: migration/27 6.14.0-rc2-next-20250210 #1
>>  Stopper: multi_cpu_stop <- stop_machine_from_inactive_cpu
>>  RIP: 0010:stop_machine_yield+0x2/0x10
>>  RSP: 0000:ff4a0dcecd19be48 EFLAGS: 00000246
>>  RAX: ffffffff89c0108f RBX: ff4a0dcec03afe44 RCX: 0000000000000000
>>  RDX: ff1cdaaf6eba5808 RSI: 0000000000000282 RDI: ff1cda80c1775a40
>>  RBP: 0000000000000001 R08: 00000011620096c6 R09: 7fffffffffffffff
>>  R10: 0000000000000001 R11: 0000000000000100 R12: ff1cda80c1775a40
>>  R13: 0000000000000000 R14: 0000000000000001 R15: ff4a0dcec03afe20
>>  FS:  0000000000000000(0000) GS:ff1cdaaf6eb80000(0000)
>>  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>  CR2: 0000000000000000 CR3: 00000025e2c2a001 CR4: 0000000000773ef0
>>  DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>>  DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>>  PKRU: 55555554
>>  Call Trace:
>>   multi_cpu_stop+0x8f/0x100
>>   cpu_stopper_thread+0x90/0x140
>>   smpboot_thread_fn+0xad/0x150
>>   kthread+0xc2/0x100
>>   ret_from_fork+0x2d/0x50
>>
>> The stress test involves CPU hotplug operations and memory control group
>> (memcg) operations. The scenario can be described as follows:
>>
>>  echo xx > memory.max 	cache_ap_online			oom_reaper
>>  (CPU23)						(CPU50)
>>  xx < usage		stop_machine_from_inactive_cpu
>>  for(;;)			// all active cpus
>>  trigger OOM		queue_stop_cpus_work
>>  // waiting oom_reaper
>>  			multi_cpu_stop(migration/xx)
>>  			// sync all active cpus ack
>>  			// waiting cpu23 ack
>>  			// CPU50 loops in multi_cpu_stop
>>  							waiting cpu50
>>
>> Detailed explanation:
>> 1. When the usage is larger than xx, an OOM may be triggered. If the
>>    process does not handle the kill signal immediately, it will loop
>>    in memory_max_write.
> 
> Do I get it right that the issue is that mem_cgroup_out_of_memory
> doesn't have any cond_resched, so it cannot yield to the stopper kthread?
> oom itself cannot make any progress because the oom victim is blocked as
> per 3).
> 

Yes, the same task was evaluated as the victim, which is blocked as
described in point 3). Consequently, oom_evaluate_task set oc->chosen =
(void *)-1UL, and no cond_resched() was invoked.

for(;;) {
...
mem_cgroup_out_of_memory
  out_of_memory
    select_bad_process
      oom_evaluate_task
	oc->chosen = (void *)-1UL;
  return !!oc->chosen;
}
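
For reference, the abort described here comes from the existing-victim
check in oom_evaluate_task(); roughly (a condensed sketch from
mm/oom_kill.c, unrelated checks trimmed):

static int oom_evaluate_task(struct task_struct *task, void *arg)
{
	struct oom_control *oc = arg;

	/* ... eligibility checks trimmed ... */

	/*
	 * The task is already an OOM victim.  Until the oom_reaper (or the
	 * exiting task itself) sets MMF_OOM_SKIP on its mm, abort the
	 * selection instead of picking another victim.
	 */
	if (!is_sysrq_oom(oc) && tsk_is_oom_victim(task)) {
		if (test_bit(MMF_OOM_SKIP, &task->signal->oom_mm->flags))
			goto next;
		goto abort;
	}

	/* ... badness scoring trimmed ... */
next:
	return 0;
abort:
	if (oc->chosen)
		put_task_struct(oc->chosen);
	oc->chosen = (void *)-1UL;	/* tells out_of_memory() to bail out */
	return 1;
}

Since the victim is never reaped here, out_of_memory() keeps returning
!!oc->chosen == true, mem_cgroup_out_of_memory() reports success, and the
for(;;) loop in memory_max_write() neither breaks nor sleeps.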

>> 2. When cache_ap_online is triggered, multi_cpu_stop is queued to the
>>    active CPUs. Within the multi_cpu_stop function, it attempts to
>>    synchronize the CPU states. However, CPU23 does not acknowledge
>>    because it is stuck in the for(;;) loop.
>> 3. The oom_reaper process is blocked because CPU50 is in a loop, waiting
>>    for CPU23 to acknowledge the synchronization request.
>> 4. Finally, a cyclic dependency is formed, leading to the softlockup and
>>    dead loop.
>>
>> To fix this issue, add cond_resched() in memory_max_write(), so that
>> it will not block the migration task.
> 
> My first question was why this is not a problem in other
> allocation/charge paths but this one is different because it doesn't
> ever try to reclaim after MAX_RECLAIM_RETRIES reclaim rounds.
> We do have scheduling points in the reclaim path, which are no longer
> triggered after we hit the oom situation in this case.
> 
> I was thinking about having a guaranteed cond_resched when the oom killer
> fails to find a victim, but it seems the simplest fix for this particular
> corner case is to add cond_resched as you did here. Hopefully we will
> get rid of it very soon when !PREEMPT is removed.
> 
> Btw. this could be a problem on a single CPU machine even without CPU
> hotplug, as the oom reaper won't run until memory_max_write yields the
> cpu.
> 
>> Fixes: b6e6edcfa405 ("mm: memcontrol: reclaim and OOM kill when shrinking memory.max below usage")
>> Signed-off-by: Chen Ridong <chenridong@huawei.com>
> 
> Acked-by: Michal Hocko <mhocko@suse.com>
> 

Thank you very much.

Best regards,
Ridong

>> ---
>>  mm/memcontrol.c | 1 +
>>  1 file changed, 1 insertion(+)
>>
>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> index 8d21c1a44220..16f3bdbd37d8 100644
>> --- a/mm/memcontrol.c
>> +++ b/mm/memcontrol.c
>> @@ -4213,6 +4213,7 @@ static ssize_t memory_max_write(struct kernfs_open_file *of,
>>  		memcg_memory_event(memcg, MEMCG_OOM);
>>  		if (!mem_cgroup_out_of_memory(memcg, GFP_KERNEL, 0))
>>  			break;
>> +		cond_resched();
>>  	}
>>  
>>  	memcg_wb_domain_size_changed(memcg);
>> -- 
>> 2.34.1
>
Shakeel Butt Feb. 11, 2025, 7:04 p.m. UTC | #3
On Tue, Feb 11, 2025 at 08:18:19AM +0000, Chen Ridong wrote:
> From: Chen Ridong <chenridong@huawei.com>
> 
> A softlockup issue was found with stress test:
>  watchdog: BUG: soft lockup - CPU#27 stuck for 26s! [migration/27:181]
>  CPU: 27 UID: 0 PID: 181 Comm: migration/27 6.14.0-rc2-next-20250210 #1
>  Stopper: multi_cpu_stop <- stop_machine_from_inactive_cpu
>  RIP: 0010:stop_machine_yield+0x2/0x10
>  RSP: 0000:ff4a0dcecd19be48 EFLAGS: 00000246
>  RAX: ffffffff89c0108f RBX: ff4a0dcec03afe44 RCX: 0000000000000000
>  RDX: ff1cdaaf6eba5808 RSI: 0000000000000282 RDI: ff1cda80c1775a40
>  RBP: 0000000000000001 R08: 00000011620096c6 R09: 7fffffffffffffff
>  R10: 0000000000000001 R11: 0000000000000100 R12: ff1cda80c1775a40
>  R13: 0000000000000000 R14: 0000000000000001 R15: ff4a0dcec03afe20
>  FS:  0000000000000000(0000) GS:ff1cdaaf6eb80000(0000)
>  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>  CR2: 0000000000000000 CR3: 00000025e2c2a001 CR4: 0000000000773ef0
>  DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>  DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>  PKRU: 55555554
>  Call Trace:
>   multi_cpu_stop+0x8f/0x100
>   cpu_stopper_thread+0x90/0x140
>   smpboot_thread_fn+0xad/0x150
>   kthread+0xc2/0x100
>   ret_from_fork+0x2d/0x50
> 
> The stress test involves CPU hotplug operations and memory control group
> (memcg) operations. The scenario can be described as follows:
> 
>  echo xx > memory.max 	cache_ap_online			oom_reaper
>  (CPU23)						(CPU50)
>  xx < usage		stop_machine_from_inactive_cpu
>  for(;;)			// all active cpus
>  trigger OOM		queue_stop_cpus_work
>  // waiting oom_reaper
>  			multi_cpu_stop(migration/xx)
>  			// sync all active cpus ack
>  			// waiting cpu23 ack
>  			// CPU50 loops in multi_cpu_stop
>  							waiting cpu50
> 
> Detailed explanation:
> 1. When the usage is larger than xx, an OOM may be triggered. If the
>    process does not handle the kill signal immediately, it will loop
>    in memory_max_write.
> 2. When cache_ap_online is triggered, multi_cpu_stop is queued to the
>    active CPUs. Within the multi_cpu_stop function, it attempts to
>    synchronize the CPU states. However, CPU23 does not acknowledge
>    because it is stuck in the for(;;) loop.
> 3. The oom_reaper process is blocked because CPU50 is in a loop, waiting
>    for CPU23 to acknowledge the synchronization request.
> 4. Finally, a cyclic dependency is formed, leading to the softlockup and
>    dead loop.
> 
> To fix this issue, add cond_resched() in memory_max_write(), so that
> it will not block the migration task.
> 
> Fixes: b6e6edcfa405 ("mm: memcontrol: reclaim and OOM kill when shrinking memory.max below usage")
> Signed-off-by: Chen Ridong <chenridong@huawei.com>
> ---
>  mm/memcontrol.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 8d21c1a44220..16f3bdbd37d8 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -4213,6 +4213,7 @@ static ssize_t memory_max_write(struct kernfs_open_file *of,
>  		memcg_memory_event(memcg, MEMCG_OOM);
>  		if (!mem_cgroup_out_of_memory(memcg, GFP_KERNEL, 0))

Wouldn't it be more robust if we put an upper bound on the else case of
the above condition, i.e. a fixed number of retries? As you have
discovered, there is a hidden dependency between the forward progress of
oom_reaper and this check/code-path, which I think is not needed.

>  			break;
> +		cond_resched();
>  	}
>  
>  	memcg_wb_domain_size_changed(memcg);
> -- 
> 2.34.1
>
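
A rough sketch of the bounded-retries idea above (purely illustrative;
the retry counter name and its placement are assumptions, not part of the
posted patch):

for (;;) {
	/* ... drain and reclaim steps as before ... */

	memcg_memory_event(memcg, MEMCG_OOM);
	if (!mem_cgroup_out_of_memory(memcg, GFP_KERNEL, 0))
		break;
	/*
	 * Hypothetical: give up after a fixed number of OOM-kill attempts
	 * instead of relying on the oom_reaper's forward progress to
	 * eventually make the kill attempt fail.
	 */
	if (!nr_oom_retries--)
		break;
	cond_resched();
}

Here nr_oom_retries would be initialised to some small constant before the
loop. Whether such a bound is preferable is what the follow-up replies
below discuss.
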
Michal Hocko Feb. 11, 2025, 8:35 p.m. UTC | #4
On Tue 11-02-25 11:04:21, Shakeel Butt wrote:
> On Tue, Feb 11, 2025 at 08:18:19AM +0000, Chen Ridong wrote:
[...]
> Wouldn't it be more robust if we put an upper bound on the else case of
> the above condition, i.e. a fixed number of retries? As you have
> discovered, there is a hidden dependency between the forward progress of
> oom_reaper and this check/code-path, which I think is not needed.

Any OOM path has a dependency on oom_reaper or task exiting. Is there
any reason why this path should be special? With cond_resched we can
look forward to the day when this will simply be removed and the code
will still work. With a number of retries we will have non-deterministic,
time-dependent behavior, because the number of retries rather than
forward progress would define the failure mode.
Shakeel Butt Feb. 12, 2025, 12:29 a.m. UTC | #5
On Tue, Feb 11, 2025 at 09:35:47PM +0100, Michal Hocko wrote:
> On Tue 11-02-25 11:04:21, Shakeel Butt wrote:
> > On Tue, Feb 11, 2025 at 08:18:19AM +0000, Chen Ridong wrote:
> [...]
> > Wouldn't it be more robust if we put an upper bound on the else case of
> > the above condition, i.e. a fixed number of retries? As you have
> > discovered, there is a hidden dependency between the forward progress of
> > oom_reaper and this check/code-path, which I think is not needed.
> 
> Any OOM path has a dependency on oom_reaper or task exiting.

Personally I would say any allocation (or charge) has a dependency on
oom_reaper making progress (but not arguing on this point).

> Is there
> any reason why this path should be special? With cond_resched we can
> look forward to the day when this will simply be removed and the code
> will still work. With a number of retries we will have non-deterministic,
> time-dependent behavior, because the number of retries rather than
> forward progress would define the failure mode.

I am not against adding cond_resched(), which might/will be removed in
the future. To me it is just that we are leaving our fate to the cpu
scheduler here, which maybe is ok as we (MM) do it all over the place.
Anyway, no objection from me.

Patch

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 8d21c1a44220..16f3bdbd37d8 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4213,6 +4213,7 @@  static ssize_t memory_max_write(struct kernfs_open_file *of,
 		memcg_memory_event(memcg, MEMCG_OOM);
 		if (!mem_cgroup_out_of_memory(memcg, GFP_KERNEL, 0))
 			break;
+		cond_resched();
 	}
 
 	memcg_wb_domain_size_changed(memcg);