mm, oom: prevent soft lockup on memcg oom for UP systems

Message ID alpine.DEB.2.21.2003101438510.161160@chino.kir.corp.google.com (mailing list archive)
State New, archived
Series mm, oom: prevent soft lockup on memcg oom for UP systems

Commit Message

David Rientjes March 10, 2020, 9:39 p.m. UTC
When a process is oom killed as a result of memcg limits and the victim
is waiting to exit, nothing ends up actually yielding the processor back
to the victim on UP systems with preemption disabled.  Instead, the
charging process simply loops in memcg reclaim and eventually soft
lockups.

Memory cgroup out of memory: Killed process 808 (repro) total-vm:41944kB, anon-rss:35344kB, file-rss:504kB, shmem-rss:0kB, UID:0 pgtables:108kB oom_score_adj:0
watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [repro:806]
CPU: 0 PID: 806 Comm: repro Not tainted 5.6.0-rc5+ #136
RIP: 0010:shrink_lruvec+0x4e9/0xa40
...
Call Trace:
 shrink_node+0x40d/0x7d0
 do_try_to_free_pages+0x13f/0x470
 try_to_free_mem_cgroup_pages+0x16d/0x230
 try_charge+0x247/0xac0
 mem_cgroup_try_charge+0x10a/0x220
 mem_cgroup_try_charge_delay+0x1e/0x40
 handle_mm_fault+0xdf2/0x15f0
 do_user_addr_fault+0x21f/0x420
 page_fault+0x2f/0x40

Make sure that something ends up actually yielding the processor back to
the victim to allow for memory freeing.  The most appropriate place appears
to be shrink_node_memcgs(), where the iteration over all descendant memcgs
could be particularly lengthy.

Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: stable@vger.kernel.org
Signed-off-by: David Rientjes <rientjes@google.com>
---
 mm/vmscan.c | 2 ++
 1 file changed, 2 insertions(+)

Comments

Tetsuo Handa March 10, 2020, 10:05 p.m. UTC | #1
On 2020/03/11 6:39, David Rientjes wrote:
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2637,6 +2637,8 @@ static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
>  		unsigned long reclaimed;
>  		unsigned long scanned;
>  
> +		cond_resched();
> +

Is this safe for CONFIG_PREEMPTION case? If current thread has realtime priority,
can we guarantee that the OOM victim (well, the OOM reaper kernel thread rather
than the OOM victim ?) gets scheduled?

>  		switch (mem_cgroup_protected(target_memcg, memcg)) {
>  		case MEMCG_PROT_MIN:
>  			/*
>
Michal Hocko March 10, 2020, 10:10 p.m. UTC | #2
On Tue 10-03-20 14:39:48, David Rientjes wrote:
> When a process is oom killed as a result of memcg limits and the victim
> is waiting to exit, nothing ends up actually yielding the processor back
> to the victim on UP systems with preemption disabled.  Instead, the
> charging process simply loops in memcg reclaim and eventually soft
> lockups.
> 
> Memory cgroup out of memory: Killed process 808 (repro) total-vm:41944kB, anon-rss:35344kB, file-rss:504kB, shmem-rss:0kB, UID:0 pgtables:108kB oom_score_adj:0
> watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [repro:806]
> CPU: 0 PID: 806 Comm: repro Not tainted 5.6.0-rc5+ #136
> RIP: 0010:shrink_lruvec+0x4e9/0xa40
> ...
> Call Trace:
>  shrink_node+0x40d/0x7d0
>  do_try_to_free_pages+0x13f/0x470
>  try_to_free_mem_cgroup_pages+0x16d/0x230
>  try_charge+0x247/0xac0
>  mem_cgroup_try_charge+0x10a/0x220
>  mem_cgroup_try_charge_delay+0x1e/0x40
>  handle_mm_fault+0xdf2/0x15f0
>  do_user_addr_fault+0x21f/0x420
>  page_fault+0x2f/0x40
> 
> Make sure that something ends up actually yielding the processor back to
> the victim to allow for memory freeing.  Most appropriate place appears to
> be shrink_node_memcgs() where the iteration of all descendant memcgs could
> be particularly lengthy.

There is a cond_resched in shrink_lruvec and another one in
shrink_page_list. Why doesn't any of them hit? Is it because there are
no pages on the LRU list? Because rss data suggests there should be
enough pages to go that path. Or maybe it is shrink_slab path that takes
too long?

The patch itself makes sense to me but I would like to see more
explanation on how that happens.

Thanks.

> Cc: Vlastimil Babka <vbabka@suse.cz>
> Cc: Michal Hocko <mhocko@kernel.org>
> Cc: stable@vger.kernel.org
> Signed-off-by: David Rientjes <rientjes@google.com>
> ---
>  mm/vmscan.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2637,6 +2637,8 @@ static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
>  		unsigned long reclaimed;
>  		unsigned long scanned;
>  
> +		cond_resched();
> +
>  		switch (mem_cgroup_protected(target_memcg, memcg)) {
>  		case MEMCG_PROT_MIN:
>  			/*
David Rientjes March 10, 2020, 10:55 p.m. UTC | #3
On Wed, 11 Mar 2020, Tetsuo Handa wrote:

> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -2637,6 +2637,8 @@ static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
> >  		unsigned long reclaimed;
> >  		unsigned long scanned;
> >  
> > +		cond_resched();
> > +
> 
> Is this safe for CONFIG_PREEMPTION case? If current thread has realtime priority,
> can we guarantee that the OOM victim (well, the OOM reaper kernel thread rather
> than the OOM victim ?) gets scheduled?
> 

I think it's the best we can do that immediately solves the issue unless 
you have another idea in mind?

> >  		switch (mem_cgroup_protected(target_memcg, memcg)) {
> >  		case MEMCG_PROT_MIN:
> >  			/*
> > 
>
David Rientjes March 10, 2020, 11:02 p.m. UTC | #4
On Tue, 10 Mar 2020, Michal Hocko wrote:

> > When a process is oom killed as a result of memcg limits and the victim
> > is waiting to exit, nothing ends up actually yielding the processor back
> > to the victim on UP systems with preemption disabled.  Instead, the
> > charging process simply loops in memcg reclaim and eventually soft
> > lockups.
> > 
> > Memory cgroup out of memory: Killed process 808 (repro) total-vm:41944kB, anon-rss:35344kB, file-rss:504kB, shmem-rss:0kB, UID:0 pgtables:108kB oom_score_adj:0
> > watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [repro:806]
> > CPU: 0 PID: 806 Comm: repro Not tainted 5.6.0-rc5+ #136
> > RIP: 0010:shrink_lruvec+0x4e9/0xa40
> > ...
> > Call Trace:
> >  shrink_node+0x40d/0x7d0
> >  do_try_to_free_pages+0x13f/0x470
> >  try_to_free_mem_cgroup_pages+0x16d/0x230
> >  try_charge+0x247/0xac0
> >  mem_cgroup_try_charge+0x10a/0x220
> >  mem_cgroup_try_charge_delay+0x1e/0x40
> >  handle_mm_fault+0xdf2/0x15f0
> >  do_user_addr_fault+0x21f/0x420
> >  page_fault+0x2f/0x40
> > 
> > Make sure that something ends up actually yielding the processor back to
> > the victim to allow for memory freeing.  Most appropriate place appears to
> > be shrink_node_memcgs() where the iteration of all descendant memcgs could
> > be particularly lengthy.
> 
> There is a cond_resched in shrink_lruvec and another one in
> shrink_page_list. Why doesn't any of them hit? Is it because there are
> no pages on the LRU list? Because rss data suggests there should be
> enough pages to go that path. Or maybe it is shrink_slab path that takes
> too long?
> 

I think it can be a number of cases, most notably mem_cgroup_protected() 
checks which is why the cond_resched() is added above it.  Rather than add 
cond_resched() only for MEMCG_PROT_MIN and for certain MEMCG_PROT_LOW, the 
cond_resched() is added above the switch clause because the iteration 
itself may be potentially very lengthy.
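
For context, here is a rough sketch of the loop in question with the proposed
change in place (reconstructed around the v5.6 hunk quoted above; helpers and
surrounding details are approximate rather than an exact copy of mm/vmscan.c):

static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
{
	struct mem_cgroup *target_memcg = sc->target_mem_cgroup;
	struct mem_cgroup *memcg;

	memcg = mem_cgroup_iter(target_memcg, NULL, NULL);
	do {
		struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat);
		unsigned long reclaimed;
		unsigned long scanned;

		cond_resched();		/* proposed yield point, once per memcg */

		switch (mem_cgroup_protected(target_memcg, memcg)) {
		case MEMCG_PROT_MIN:
			/*
			 * Hard protection: skip this memcg entirely, so
			 * shrink_lruvec()/shrink_slab() and the scheduling
			 * points inside them are never reached.
			 */
			continue;
		case MEMCG_PROT_LOW:
			/* soft protection handling elided */
			break;
		case MEMCG_PROT_NONE:
			break;
		}

		reclaimed = sc->nr_reclaimed;
		scanned = sc->nr_scanned;

		shrink_lruvec(lruvec, sc);
		shrink_slab(sc->gfp_mask, pgdat->node_id, memcg, sc->priority);

		/* vmpressure accounting based on reclaimed/scanned elided */
	} while ((memcg = mem_cgroup_iter(target_memcg, memcg, NULL)));
}

If every memcg in the hierarchy takes the MEMCG_PROT_MIN continue, the walk
does no reclaim work at all, so none of the pre-existing scheduling points
are ever reached.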

We could also do it in shrink_zones() or the priority based 
do_try_to_free_pages() loop, but I'd be nervous about the lengthy memcg 
iteration in shrink_node_memcgs() independent of this.

Any other ideas on how to ensure we actually try to resched for the 
benefit of an oom victim to prevent this soft lockup?

> The patch itself makes sense to me but I would like to see more
> explanation on how that happens.
> 
> Thanks.
> 
> > Cc: Vlastimil Babka <vbabka@suse.cz>
> > Cc: Michal Hocko <mhocko@kernel.org>
> > Cc: stable@vger.kernel.org
> > Signed-off-by: David Rientjes <rientjes@google.com>
> > ---
> >  mm/vmscan.c | 2 ++
> >  1 file changed, 2 insertions(+)
> > 
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -2637,6 +2637,8 @@ static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
> >  		unsigned long reclaimed;
> >  		unsigned long scanned;
> >  
> > +		cond_resched();
> > +
> >  		switch (mem_cgroup_protected(target_memcg, memcg)) {
> >  		case MEMCG_PROT_MIN:
> >  			/*
> 
> -- 
> Michal Hocko
> SUSE Labs
>
Andrew Morton March 11, 2020, 12:18 a.m. UTC | #5
On Tue, 10 Mar 2020 14:39:48 -0700 (PDT) David Rientjes <rientjes@google.com> wrote:

> When a process is oom killed as a result of memcg limits and the victim
> is waiting to exit, nothing ends up actually yielding the processor back
> to the victim on UP systems with preemption disabled.  Instead, the
> charging process simply loops in memcg reclaim and eventually soft
> lockups.
> 
> Memory cgroup out of memory: Killed process 808 (repro) total-vm:41944kB, anon-rss:35344kB, file-rss:504kB, shmem-rss:0kB, UID:0 pgtables:108kB oom_score_adj:0
> watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [repro:806]
> CPU: 0 PID: 806 Comm: repro Not tainted 5.6.0-rc5+ #136
> RIP: 0010:shrink_lruvec+0x4e9/0xa40
> ...
> Call Trace:
>  shrink_node+0x40d/0x7d0
>  do_try_to_free_pages+0x13f/0x470
>  try_to_free_mem_cgroup_pages+0x16d/0x230
>  try_charge+0x247/0xac0
>  mem_cgroup_try_charge+0x10a/0x220
>  mem_cgroup_try_charge_delay+0x1e/0x40
>  handle_mm_fault+0xdf2/0x15f0
>  do_user_addr_fault+0x21f/0x420
>  page_fault+0x2f/0x40
> 
> Make sure that something ends up actually yielding the processor back to
> the victim to allow for memory freeing.  Most appropriate place appears to
> be shrink_node_memcgs() where the iteration of all descendant memcgs could
> be particularly lengthy.
>

That's a bit sad.

> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2637,6 +2637,8 @@ static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
>  		unsigned long reclaimed;
>  		unsigned long scanned;
>  
> +		cond_resched();
> +
>  		switch (mem_cgroup_protected(target_memcg, memcg)) {
>  		case MEMCG_PROT_MIN:
>  			/*


Obviously better, but this will still spin wheels until this task's
timeslice expires, and we might want to do something to help ensure
that the victim runs next (or soon)?

(And why is shrink_node_memcgs compiled in when CONFIG_MEMCG=n?)
David Rientjes March 11, 2020, 12:34 a.m. UTC | #6
On Tue, 10 Mar 2020, Andrew Morton wrote:

> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -2637,6 +2637,8 @@ static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
> >  		unsigned long reclaimed;
> >  		unsigned long scanned;
> >  
> > +		cond_resched();
> > +
> >  		switch (mem_cgroup_protected(target_memcg, memcg)) {
> >  		case MEMCG_PROT_MIN:
> >  			/*
> 
> 
> Obviously better, but this will still spin wheels until this task's
> timeslice expires, and we might want to do something to help ensure
> that the victim runs next (or soon)?
> 

We used to have a schedule_timeout_killable(1) to address exactly that 
scenario but it was removed in 4.19:

commit 9bfe5ded054b8e28a94c78580f233d6879a00146
Author: Michal Hocko <mhocko@suse.com>
Date:   Fri Aug 17 15:49:04 2018 -0700

    mm, oom: remove sleep from under oom_lock

This is why we don't see this issue on 4.14 guests but we do on 4.19.  I 
had assumed the issue Tetsuo reported, which resulted in that patch, was 
still present, and I preferred to fix the weird UP issue by adding a 
cond_resched() that is likely needed for the iteration in 
shrink_node_memcgs() anyway.  Do we care to optimize for UP systems 
encountering memcg oom kills?  Eh, maybe, but I'm not very interested in 
opening up a centithread about this.

> (And why is shrink_node_memcgs compiled in when CONFIG_MEMCG=n?)
> 

This guest does have CONFIG_MEMCG enabled, it's a memcg oom condition.

But unrelated to this patch, I think it's just a weird naming for it.  The 
do-while loop in shrink_node_memcgs() actually uses memcg = NULL for the 
non-memcg case and is responsible for calling into page and slab reclaim.
Michal Hocko March 11, 2020, 8:27 a.m. UTC | #7
On Tue 10-03-20 16:02:23, David Rientjes wrote:
> On Tue, 10 Mar 2020, Michal Hocko wrote:
> 
> > > When a process is oom killed as a result of memcg limits and the victim
> > > is waiting to exit, nothing ends up actually yielding the processor back
> > > to the victim on UP systems with preemption disabled.  Instead, the
> > > charging process simply loops in memcg reclaim and eventually soft
> > > lockups.
> > > 
> > > Memory cgroup out of memory: Killed process 808 (repro) total-vm:41944kB, anon-rss:35344kB, file-rss:504kB, shmem-rss:0kB, UID:0 pgtables:108kB oom_score_adj:0
> > > watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [repro:806]
> > > CPU: 0 PID: 806 Comm: repro Not tainted 5.6.0-rc5+ #136
> > > RIP: 0010:shrink_lruvec+0x4e9/0xa40
> > > ...
> > > Call Trace:
> > >  shrink_node+0x40d/0x7d0
> > >  do_try_to_free_pages+0x13f/0x470
> > >  try_to_free_mem_cgroup_pages+0x16d/0x230
> > >  try_charge+0x247/0xac0
> > >  mem_cgroup_try_charge+0x10a/0x220
> > >  mem_cgroup_try_charge_delay+0x1e/0x40
> > >  handle_mm_fault+0xdf2/0x15f0
> > >  do_user_addr_fault+0x21f/0x420
> > >  page_fault+0x2f/0x40
> > > 
> > > Make sure that something ends up actually yielding the processor back to
> > > the victim to allow for memory freeing.  Most appropriate place appears to
> > > be shrink_node_memcgs() where the iteration of all descendant memcgs could
> > > be particularly lengthy.
> > 
> > There is a cond_resched in shrink_lruvec and another one in
> > shrink_page_list. Why doesn't any of them hit? Is it because there are
> > no pages on the LRU list? Because rss data suggests there should be
> > enough pages to go that path. Or maybe it is shrink_slab path that takes
> > too long?
> > 
> 
> I think it can be a number of cases, most notably mem_cgroup_protected() 
> checks which is why the cond_resched() is added above it.  Rather than add 
> cond_resched() only for MEMCG_PROT_MIN and for certain MEMCG_PROT_LOW, the 
> cond_resched() is added above the switch clause because the iteration 
> itself may be potentially very lengthy.

Was any of the above the case for your soft lockup case? How have you
managed to trigger it? As I've said I am not against the patch but I
would really like to see an actual explanation what happened rather than
speculations of what might have happened. If for nothing else then for
the future reference.

If this is really about all the hierarchy being MEMCG_PROT_MIN protected
and that results in a very expensive and pointless reclaim walk that can
trigger soft lockup then it should be explicitly mentioned in the
changelog.
Michal Hocko March 11, 2020, 8:36 a.m. UTC | #8
On Tue 10-03-20 17:18:02, Andrew Morton wrote:
[...]
> (And why is shrink_node_memcgs compiled in when CONFIG_MEMCG=n?)

Because there won't be anything memcg specific with the config disabled.
mem_cgroup_iter will expand to a NULL memcg, the mem_cgroup_protected switch
is compiled out, mem_cgroup_lruvec will return the lruvec abstraction which
resolves to the pgdat, and the rest is not memcg dependent. We could have
split the reclaim protection or the loop out of line, but I believe it is
better for it to be clearly visible.
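
To illustrate (a rough sketch of the relevant CONFIG_MEMCG=n stubs from
include/linux/memcontrol.h around v5.6; exact signatures may differ), the
do-while in shrink_node_memcgs() then runs exactly once with memcg == NULL
and nothing is ever skipped as protected:

static inline struct mem_cgroup *
mem_cgroup_iter(struct mem_cgroup *root, struct mem_cgroup *prev,
		struct mem_cgroup_reclaim_cookie *reclaim)
{
	return NULL;	/* the reclaim loop body runs exactly once */
}

static inline enum mem_cgroup_protection
mem_cgroup_protected(struct mem_cgroup *root, struct mem_cgroup *memcg)
{
	return MEMCG_PROT_NONE;	/* the protection switch never skips reclaim */
}

static inline struct lruvec *mem_cgroup_lruvec(struct mem_cgroup *memcg,
					       struct pglist_data *pgdat)
{
	return &pgdat->__lruvec;	/* the plain per-node lruvec */
}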
Tetsuo Handa March 11, 2020, 9:34 a.m. UTC | #9
On 2020/03/11 7:55, David Rientjes wrote:
> On Wed, 11 Mar 2020, Tetsuo Handa wrote:
> 
>>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>>> --- a/mm/vmscan.c
>>> +++ b/mm/vmscan.c
>>> @@ -2637,6 +2637,8 @@ static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
>>>  		unsigned long reclaimed;
>>>  		unsigned long scanned;
>>>  
>>> +		cond_resched();
>>> +
>>
>> Is this safe for CONFIG_PREEMPTION case? If current thread has realtime priority,
>> can we guarantee that the OOM victim (well, the OOM reaper kernel thread rather
>> than the OOM victim ?) gets scheduled?
>>
> 
> I think it's the best we can do that immediately solves the issue unless 
> you have another idea in mind?

"schedule_timeout_killable(1) outside of oom_lock" or "the OOM reaper grabs oom_lock
so that allocating threads guarantee that the OOM reaper gets scheduled" or "direct OOM
reaping so that allocating threads guarantee that some memory is reclaimed".

> 
>>>  		switch (mem_cgroup_protected(target_memcg, memcg)) {
>>>  		case MEMCG_PROT_MIN:
>>>  			/*
>>>
>>
David Rientjes March 11, 2020, 7:38 p.m. UTC | #10
On Wed, 11 Mar 2020, Tetsuo Handa wrote:

> >>> diff --git a/mm/vmscan.c b/mm/vmscan.c
> >>> --- a/mm/vmscan.c
> >>> +++ b/mm/vmscan.c
> >>> @@ -2637,6 +2637,8 @@ static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
> >>>  		unsigned long reclaimed;
> >>>  		unsigned long scanned;
> >>>  
> >>> +		cond_resched();
> >>> +
> >>
> >> Is this safe for CONFIG_PREEMPTION case? If current thread has realtime priority,
> >> can we guarantee that the OOM victim (well, the OOM reaper kernel thread rather
> >> than the OOM victim ?) gets scheduled?
> >>
> > 
> > I think it's the best we can do that immediately solves the issue unless 
> > you have another idea in mind?
> 
> "schedule_timeout_killable(1) outside of oom_lock" or "the OOM reaper grabs oom_lock
> so that allocating threads guarantee that the OOM reaper gets scheduled" or "direct OOM
> reaping so that allocating threads guarantee that some memory is reclaimed".
> 

The cond_resched() here is needed if the iteration is lengthy depending on 
the number of descendant memcgs already.

schedule_timeout_killable(1) does not make any guarantees that current 
will be scheduled after the victim or oom_reaper on UP systems.

If you have an alternate patch to try, we can test it.  But since this 
cond_resched() is needed anyway, I'm not sure it will change the result.

> > 
> >>>  		switch (mem_cgroup_protected(target_memcg, memcg)) {
> >>>  		case MEMCG_PROT_MIN:
> >>>  			/*
> >>>
> >>
> 
>
David Rientjes March 11, 2020, 7:45 p.m. UTC | #11
On Wed, 11 Mar 2020, Michal Hocko wrote:

> > > > When a process is oom killed as a result of memcg limits and the victim
> > > > is waiting to exit, nothing ends up actually yielding the processor back
> > > > to the victim on UP systems with preemption disabled.  Instead, the
> > > > charging process simply loops in memcg reclaim and eventually soft
> > > > lockups.
> > > > 
> > > > Memory cgroup out of memory: Killed process 808 (repro) total-vm:41944kB, anon-rss:35344kB, file-rss:504kB, shmem-rss:0kB, UID:0 pgtables:108kB oom_score_adj:0
> > > > watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [repro:806]
> > > > CPU: 0 PID: 806 Comm: repro Not tainted 5.6.0-rc5+ #136
> > > > RIP: 0010:shrink_lruvec+0x4e9/0xa40
> > > > ...
> > > > Call Trace:
> > > >  shrink_node+0x40d/0x7d0
> > > >  do_try_to_free_pages+0x13f/0x470
> > > >  try_to_free_mem_cgroup_pages+0x16d/0x230
> > > >  try_charge+0x247/0xac0
> > > >  mem_cgroup_try_charge+0x10a/0x220
> > > >  mem_cgroup_try_charge_delay+0x1e/0x40
> > > >  handle_mm_fault+0xdf2/0x15f0
> > > >  do_user_addr_fault+0x21f/0x420
> > > >  page_fault+0x2f/0x40
> > > > 
> > > > Make sure that something ends up actually yielding the processor back to
> > > > the victim to allow for memory freeing.  Most appropriate place appears to
> > > > be shrink_node_memcgs() where the iteration of all descendant memcgs could
> > > > be particularly lengthy.
> > > 
> > > There is a cond_resched in shrink_lruvec and another one in
> > > shrink_page_list. Why doesn't any of them hit? Is it because there are
> > > no pages on the LRU list? Because rss data suggests there should be
> > > enough pages to go that path. Or maybe it is shrink_slab path that takes
> > > too long?
> > > 
> > 
> > I think it can be a number of cases, most notably mem_cgroup_protected() 
> > checks which is why the cond_resched() is added above it.  Rather than add 
> > cond_resched() only for MEMCG_PROT_MIN and for certain MEMCG_PROT_LOW, the 
> > cond_resched() is added above the switch clause because the iteration 
> > itself may be potentially very lengthy.
> 
> Was any of the above the case for your soft lockup case? How have you
> managed to trigger it? As I've said I am not against the patch but I
> would really like to see an actual explanation what happened rather than
> speculations of what might have happened. If for nothing else then for
> the future reference.
> 

Yes, this is how it was triggered in my own testing.

> If this is really about all the hierarchy being MEMCG_PROT_MIN protected
> and that results in a very expensive and pointless reclaim walk that can
> trigger soft lockup then it should be explicitly mentioned in the
> changelog.

I think the changelog clearly states that we need to guarantee that a 
reclaimer will yield the processor back to allow a victim to exit.  This 
is where we make the guarantee.  If it helps for the specific reason it 
triggered in my testing, we could add:

"For example, mem_cgroup_protected() can prohibit reclaim and thus any 
yielding in page reclaim would not address the issue."
Tetsuo Handa March 11, 2020, 10:04 p.m. UTC | #12
On 2020/03/12 4:38, David Rientjes wrote:
> On Wed, 11 Mar 2020, Tetsuo Handa wrote:
> 
>>>>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>>>>> --- a/mm/vmscan.c
>>>>> +++ b/mm/vmscan.c
>>>>> @@ -2637,6 +2637,8 @@ static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
>>>>>  		unsigned long reclaimed;
>>>>>  		unsigned long scanned;
>>>>>  
>>>>> +		cond_resched();
>>>>> +
>>>>
>>>> Is this safe for CONFIG_PREEMPTION case? If current thread has realtime priority,
>>>> can we guarantee that the OOM victim (well, the OOM reaper kernel thread rather
>>>> than the OOM victim ?) gets scheduled?
>>>>
>>>
>>> I think it's the best we can do that immediately solves the issue unless 
>>> you have another idea in mind?
>>
>> "schedule_timeout_killable(1) outside of oom_lock" or "the OOM reaper grabs oom_lock
>> so that allocating threads guarantee that the OOM reaper gets scheduled" or "direct OOM
>> reaping so that allocating threads guarantee that some memory is reclaimed".
>>
> 
> The cond_resched() here is needed if the iteration is lengthy depending on 
> the number of descendant memcgs already.

No. cond_resched() here will become no-op if CONFIG_PREEMPTION=y and current
thread has realtime priority.

> 
> schedule_timeout_killable(1) does not make any guarantees that current 
> will be scheduled after the victim or oom_reaper on UP systems.

The point of schedule_timeout_*(1) is to guarantee that the current thread
will yield the CPU to other threads even in the case where CONFIG_PREEMPTION=y
and the current thread has realtime priority. That there is no guarantee the
current thread will be rescheduled immediately after the sleep is irrelevant.
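
For reference, a simplified sketch of how cond_resched() is defined
(include/linux/sched.h around v5.6; details approximate):

#ifndef CONFIG_PREEMPTION
extern int _cond_resched(void);	/* reschedules only if need_resched is set */
#else
static inline int _cond_resched(void) { return 0; }	/* preemptible: no-op */
#endif

#define cond_resched() ({			\
	___might_sleep(__FILE__, __LINE__, 0);	\
	_cond_resched();			\
})

So with CONFIG_PREEMPTION=y it compiles down to nothing, and even without it
it only yields when need_resched has been set, which a realtime-priority
allocator may never see; schedule_timeout_killable(1) sleeps unconditionally.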

> 
> If you have an alternate patch to try, we can test it.  But since this 
> cond_resched() is needed anyway, I'm not sure it will change the result.

schedule_timeout_killable(1) is an alternate patch to try; I don't think
that this cond_resched() is needed anyway.

> 
>>>
>>>>>  		switch (mem_cgroup_protected(target_memcg, memcg)) {
>>>>>  		case MEMCG_PROT_MIN:
>>>>>  			/*
>>>>>
>>>>
>>
>>
David Rientjes March 11, 2020, 10:14 p.m. UTC | #13
On Thu, 12 Mar 2020, Tetsuo Handa wrote:

> > The cond_resched() here is needed if the iteration is lengthy depending on 
> > the number of descendant memcgs already.
> 
> No. cond_resched() here will become no-op if CONFIG_PREEMPTION=y and current
> thread has realtime priority.
> 

It's helpful without CONFIG_PREEMPTION for excessively long memcg 
iterations to avoid starving need_resched.

> > schedule_timeout_killable(1) does not make any guarantees that current 
> > will be scheduled after the victim or oom_reaper on UP systems.
> 
> The point of schedule_timeout_*(1) is to guarantee that the current thread
> will yield the CPU to other threads even in the case where CONFIG_PREEMPTION=y
> and the current thread has realtime priority. That there is no guarantee the
> current thread will be rescheduled immediately after the sleep is irrelevant.
> 
> > 
> > If you have an alternate patch to try, we can test it.  But since this 
> > cond_resched() is needed anyway, I'm not sure it will change the result.
> 
> schedule_timeout_killable(1) is an alternate patch to try; I don't think
> that this cond_resched() is needed anyway.
> 

You are suggesting schedule_timeout_killable(1) in shrink_node_memcgs()?
Tetsuo Handa March 12, 2020, 12:12 a.m. UTC | #14
> On Thu, 12 Mar 2020, Tetsuo Handa wrote:
> > > If you have an alternate patch to try, we can test it.  But since this 
> > > cond_resched() is needed anyway, I'm not sure it will change the result.
> > 
> > schedule_timeout_killable(1) is an alternate patch to try; I don't think
> > that this cond_resched() is needed anyway.
> > 
> 
> You are suggesting schedule_timeout_killable(1) in shrink_node_memcgs()?
> 

Andrew Morton also mentioned whether cond_resched() in shrink_node_memcgs()
is enough. But like you mentioned,

David Rientjes wrote:
> On Tue, 10 Mar 2020, Andrew Morton wrote:
> 
> > > --- a/mm/vmscan.c
> > > +++ b/mm/vmscan.c
> > > @@ -2637,6 +2637,8 @@ static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
> > >  		unsigned long reclaimed;
> > >  		unsigned long scanned;
> > >  
> > > +		cond_resched();
> > > +
> > >  		switch (mem_cgroup_protected(target_memcg, memcg)) {
> > >  		case MEMCG_PROT_MIN:
> > >  			/*
> > 
> > 
> > Obviously better, but this will still spin wheels until this task's
> > timeslice expires, and we might want to do something to help ensure
> > that the victim runs next (or soon)?
> > 
> 
> We used to have a schedule_timeout_killable(1) to address exactly that 
> scenario but it was removed in 4.19:
> 
> commit 9bfe5ded054b8e28a94c78580f233d6879a00146
> Author: Michal Hocko <mhocko@suse.com>
> Date:   Fri Aug 17 15:49:04 2018 -0700
> 
>     mm, oom: remove sleep from under oom_lock

you can try re-adding sleep outside of oom_lock:

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index d09776cd6e10..3aee7e0eca4e 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1576,6 +1576,7 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	 */
 	ret = should_force_charge() || out_of_memory(&oc);
 	mutex_unlock(&oom_lock);
+	schedule_timeout_killable(1);
 	return ret;
 }
 
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3c4eb750a199..e80158049651 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3797,7 +3797,6 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
 	 */
 	if (!mutex_trylock(&oom_lock)) {
 		*did_some_progress = 1;
-		schedule_timeout_uninterruptible(1);
 		return NULL;
 	}
 
@@ -4590,6 +4589,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 
 	/* Retry as long as the OOM killer is making progress */
 	if (did_some_progress) {
+		schedule_timeout_uninterruptible(1);
 		no_progress_loops = 0;
 		goto retry;
 	}

By the way, will you share the reproducer (and how to use the reproducer) ?
Tetsuo Handa March 12, 2020, 4:23 a.m. UTC | #15
> > We used to have a schedule_timeout_killable(1) to address exactly that 
> > scenario but it was removed in 4.19:
> > 
> > commit 9bfe5ded054b8e28a94c78580f233d6879a00146
> > Author: Michal Hocko <mhocko@suse.com>
> > Date:   Fri Aug 17 15:49:04 2018 -0700
> > 
> >     mm, oom: remove sleep from under oom_lock
> 

For the record: If I revert that commit, I can observe that
current thread sleeps for more than one minute with oom_lock held.
That is, I don't want to revert that commit.

[ 1629.930998][T14021] idle-priority invoked oom-killer: gfp_mask=0x100dca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), order=0, oom_score_adj=0
[ 1629.934944][T14021] CPU: 0 PID: 14021 Comm: idle-priority Kdump: loaded Not tainted 5.6.0-rc5+ #968
[ 1629.938352][T14021] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/29/2019
[ 1629.942211][T14021] Call Trace:
[ 1629.943956][T14021]  dump_stack+0x71/0xa5
[ 1629.946041][T14021]  dump_header+0x4d/0x3e0
[ 1629.948097][T14021]  oom_kill_process+0x179/0x210
[ 1629.950191][T14021]  out_of_memory+0x109/0x3d0
[ 1629.952154][T14021]  ? out_of_memory+0x1ea/0x3d0
[ 1629.954349][T14021]  __alloc_pages_slowpath+0x934/0xb89
[ 1629.956959][T14021]  __alloc_pages_nodemask+0x333/0x370
[ 1629.959121][T14021]  handle_pte_fault+0x4ce/0xb40
[ 1629.961274][T14021]  __handle_mm_fault+0x2b2/0x5e0
[ 1629.963363][T14021]  handle_mm_fault+0x177/0x360
[ 1629.965404][T14021]  ? handle_mm_fault+0x46/0x360
[ 1629.967451][T14021]  do_page_fault+0x2d5/0x680
[ 1629.969351][T14021]  page_fault+0x34/0x40
[ 1629.971290][T14021] RIP: 0010:__clear_user+0x36/0x70
[ 1629.973621][T14021] Code: 0c a7 e2 95 53 48 89 f3 be 14 00 00 00 e8 82 87 b4 ff 0f 01 cb 48 89 d8 48 c1 eb 03 4c 89 e7 83 e0 07 48 89 d9 48 85 c9 74 0f <48> c7 07 00 00 00 00 48 83 c7 08 ff c9 75 f1 48 89 c1 85 c9 74 0a
[ 1629.980104][T14021] RSP: 0000:ffffa06e0742bd40 EFLAGS: 00050202
[ 1629.982543][T14021] RAX: 0000000000000000 RBX: 0000000000000200 RCX: 0000000000000002
[ 1629.985322][T14021] RDX: 0000000000000000 RSI: 00000000a8057a95 RDI: 00007f0ca8605000
[ 1629.999883][T14021] RBP: ffffa06e0742bd50 R08: 0000000000000001 R09: 0000000000000000
[ 1630.018164][T14021] R10: 0000000000000000 R11: ffff953843a80978 R12: 00007f0ca8604010
[ 1630.021491][T14021] R13: 000000000c988000 R14: ffffa06e0742be08 R15: 0000000000001000
[ 1630.024881][T14021]  clear_user+0x47/0x60
[ 1630.026651][T14021]  iov_iter_zero+0x95/0x3d0
[ 1630.028455][T14021]  read_iter_zero+0x38/0xb0
[ 1630.030234][T14021]  new_sync_read+0x112/0x1b0
[ 1630.032025][T14021]  __vfs_read+0x27/0x40
[ 1630.033639][T14021]  vfs_read+0xaf/0x160
[ 1630.035243][T14021]  ksys_read+0x62/0xe0
[ 1630.036869][T14021]  __x64_sys_read+0x15/0x20
[ 1630.038827][T14021]  do_syscall_64+0x4a/0x1e0
[ 1630.040515][T14021]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ 1630.042598][T14021] RIP: 0033:0x7f109bb78950
[ 1630.044208][T14021] Code: Bad RIP value.
[ 1630.045798][T14021] RSP: 002b:00007fff499db458 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
[ 1630.048558][T14021] RAX: ffffffffffffffda RBX: 0000000200000000 RCX: 00007f109bb78950
[ 1630.051117][T14021] RDX: 0000000200000000 RSI: 00007f0c9bc7c010 RDI: 0000000000000003
[ 1630.053715][T14021] RBP: 00007f0c9bc7c010 R08: 0000000000000000 R09: 0000000000021000
[ 1630.056488][T14021] R10: 00007fff499daea0 R11: 0000000000000246 R12: 00007f0c9bc7c010
[ 1630.059124][T14021] R13: 0000000000000005 R14: 0000000000000003 R15: 0000000000000000
[ 1630.061935][T14021] Mem-Info:
[ 1630.063226][T14021] active_anon:1314192 inactive_anon:5198 isolated_anon:0
[ 1630.063226][T14021]  active_file:9 inactive_file:0 isolated_file:0
[ 1630.063226][T14021]  unevictable:0 dirty:0 writeback:0 unstable:0
[ 1630.063226][T14021]  slab_reclaimable:6649 slab_unreclaimable:268898
[ 1630.063226][T14021]  mapped:1094 shmem:10475 pagetables:127946 bounce:0
[ 1630.063226][T14021]  free:25585 free_pcp:62 free_cma:0
[ 1630.075882][T14021] Node 0 active_anon:5256768kB inactive_anon:20792kB active_file:36kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:4376kB dirty:0kB writeback:0kB shmem:41900kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 3158016kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
[ 1630.084961][T14021] DMA free:15360kB min:132kB low:164kB high:196kB reserved_highatomic:0KB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15960kB managed:15360kB mlocked:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[ 1630.094169][T14021] lowmem_reserve[]: 0 2717 7644 7644
[ 1630.096174][T14021] DMA32 free:43680kB min:23972kB low:29964kB high:35956kB reserved_highatomic:0KB active_anon:2733524kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:3129152kB managed:2782468kB mlocked:0kB kernel_stack:0kB pagetables:4764kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[ 1630.106132][T14021] lowmem_reserve[]: 0 0 4927 4927
[ 1630.108104][T14021] Normal free:43300kB min:43476kB low:54344kB high:65212kB reserved_highatomic:4096KB active_anon:2523244kB inactive_anon:20792kB active_file:36kB inactive_file:0kB unevictable:0kB writepending:0kB present:5242880kB managed:5045744kB mlocked:0kB kernel_stack:313872kB pagetables:507020kB bounce:0kB free_pcp:248kB local_pcp:248kB free_cma:0kB
[ 1630.118473][T14021] lowmem_reserve[]: 0 0 0 0
[ 1630.120348][T14021] DMA: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 1*1024kB (U) 1*2048kB (M) 3*4096kB (M) = 15360kB
[ 1630.124316][T14021] DMA32: 0*4kB 0*8kB 2*16kB (UM) 2*32kB (UM) 1*64kB (U) 0*128kB 2*256kB (UM) 2*512kB (UM) 1*1024kB (M) 2*2048kB (UM) 9*4096kB (M) = 43680kB
[ 1630.128773][T14021] Normal: 1231*4kB (UE) 991*8kB (UE) 343*16kB (UME) 230*32kB (UE) 117*64kB (UME) 65*128kB (UME) 7*256kB (UM) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 43300kB
[ 1630.134207][T14021] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
[ 1630.137435][T14021] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[ 1630.140770][T14021] 10478 total pagecache pages
[ 1630.142789][T14021] 0 pages in swap cache
[ 1630.144572][T14021] Swap cache stats: add 0, delete 0, find 0/0
[ 1630.146921][T14021] Free swap  = 0kB
[ 1630.148627][T14021] Total swap = 0kB
[ 1630.150298][T14021] 2096998 pages RAM
[ 1630.152078][T14021] 0 pages HighMem/MovableOnly
[ 1630.154073][T14021] 136105 pages reserved
[ 1630.156165][T14021] 0 pages cma reserved
[ 1630.158028][T14021] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),global_oom,task_memcg=/,task=normal-priority,pid=23173,uid=0
[ 1630.161867][T14021] Out of memory: Killed process 23173 (normal-priority) total-vm:4228kB, anon-rss:88kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:52kB oom_score_adj:1000
[ 1630.171681][   T18] oom_reaper: reaped process 23173 (normal-priority), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
(...snipped...)
[ 1800.548694][    C0] idle-priority   R  running task    12416 14021   1183 0x80004080
[ 1800.550942][    C0] Call Trace:
[ 1800.552116][    C0]  __schedule+0x28c/0x860
[ 1800.553531][    C0]  ? _raw_spin_unlock_irqrestore+0x2c/0x50
[ 1800.555293][    C0]  schedule+0x5f/0xd0
[ 1800.556618][    C0]  schedule_timeout+0x1ab/0x340
[ 1800.558151][    C0]  ? __next_timer_interrupt+0xd0/0xd0
[ 1800.560069][    C0]  schedule_timeout_killable+0x25/0x30
[ 1800.561836][    C0]  out_of_memory+0x113/0x3d0
[ 1800.563519][    C0]  ? out_of_memory+0x1ea/0x3d0
[ 1800.565040][    C0]  __alloc_pages_slowpath+0x934/0xb89
[ 1800.566727][    C0]  __alloc_pages_nodemask+0x333/0x370
[ 1800.568397][    C0]  handle_pte_fault+0x4ce/0xb40
[ 1800.569944][    C0]  __handle_mm_fault+0x2b2/0x5e0
[ 1800.571527][    C0]  handle_mm_fault+0x177/0x360
[ 1800.573046][    C0]  ? handle_mm_fault+0x46/0x360
[ 1800.574592][    C0]  do_page_fault+0x2d5/0x680
[ 1800.576306][    C0]  page_fault+0x34/0x40
[ 1800.578413][    C0] RIP: 0010:__clear_user+0x36/0x70
Michal Hocko March 12, 2020, 8:32 a.m. UTC | #16
On Wed 11-03-20 12:45:40, David Rientjes wrote:
> On Wed, 11 Mar 2020, Michal Hocko wrote:
> 
> > > > > When a process is oom killed as a result of memcg limits and the victim
> > > > > is waiting to exit, nothing ends up actually yielding the processor back
> > > > > to the victim on UP systems with preemption disabled.  Instead, the
> > > > > charging process simply loops in memcg reclaim and eventually soft
> > > > > lockups.
> > > > > 
> > > > > Memory cgroup out of memory: Killed process 808 (repro) total-vm:41944kB, anon-rss:35344kB, file-rss:504kB, shmem-rss:0kB, UID:0 pgtables:108kB oom_score_adj:0
> > > > > watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [repro:806]
> > > > > CPU: 0 PID: 806 Comm: repro Not tainted 5.6.0-rc5+ #136
> > > > > RIP: 0010:shrink_lruvec+0x4e9/0xa40
> > > > > ...
> > > > > Call Trace:
> > > > >  shrink_node+0x40d/0x7d0
> > > > >  do_try_to_free_pages+0x13f/0x470
> > > > >  try_to_free_mem_cgroup_pages+0x16d/0x230
> > > > >  try_charge+0x247/0xac0
> > > > >  mem_cgroup_try_charge+0x10a/0x220
> > > > >  mem_cgroup_try_charge_delay+0x1e/0x40
> > > > >  handle_mm_fault+0xdf2/0x15f0
> > > > >  do_user_addr_fault+0x21f/0x420
> > > > >  page_fault+0x2f/0x40
> > > > > 
> > > > > Make sure that something ends up actually yielding the processor back to
> > > > > the victim to allow for memory freeing.  Most appropriate place appears to
> > > > > be shrink_node_memcgs() where the iteration of all descendant memcgs could
> > > > > be particularly lengthy.
> > > > 
> > > > There is a cond_resched in shrink_lruvec and another one in
> > > > shrink_page_list. Why doesn't any of them hit? Is it because there are
> > > > no pages on the LRU list? Because rss data suggests there should be
> > > > enough pages to go that path. Or maybe it is shrink_slab path that takes
> > > > too long?
> > > > 
> > > 
> > > I think it can be a number of cases, most notably mem_cgroup_protected() 
> > > checks which is why the cond_resched() is added above it.  Rather than add 
> > > cond_resched() only for MEMCG_PROT_MIN and for certain MEMCG_PROT_LOW, the 
> > > cond_resched() is added above the switch clause because the iteration 
> > > itself may be potentially very lengthy.
> > 
> > Was any of the above the case for your soft lockup case? How have you
> > managed to trigger it? As I've said I am not against the patch but I
> > would really like to see an actual explanation what happened rather than
> > speculations of what might have happened. If for nothing else then for
> > the future reference.
> > 
> 
> Yes, this is how it was triggered in my own testing.
> 
> > If this is really about all the hierarchy being MEMCG_PROT_MIN protected
> > and that results in a very expensive and pointless reclaim walk that can
> > trigger soft lockup then it should be explicitly mentioned in the
> > changelog.
> 
> I think the changelog clearly states that we need to guarantee that a 
> reclaimer will yield the processor back to allow a victim to exit.  This 
> is where we make the guarantee.  If it helps for the specific reason it 
> triggered in my testing, we could add:
> 
> "For example, mem_cgroup_protected() can prohibit reclaim and thus any 
> yielding in page reclaim would not address the issue."

I would suggest something like the following:
"
The reclaim path (including the OOM) relies on explicit scheduling
points to hand over execution to tasks which could help with the reclaim
process. Currently it is mostly shrink_page_list which yields CPU for
each reclaimed page. This might be insufficient though in some
configurations. E.g. when a memcg OOM path is triggered in a hierarchy
which doesn't have any reclaimable memory because of memory reclaim
protection (MEMCG_PROT_MIN), then it is possible to trigger a soft
lockup during an out of memory situation on non-preemptible kernels
<PUT YOUR SOFT LOCKUP SPLAT HERE>

Fix this by adding a cond_resched() up in the reclaim path and make sure
there is a yield point regardless of reclaimability of the target
hierarchy.
"
David Rientjes March 12, 2020, 6:07 p.m. UTC | #17
On Thu, 12 Mar 2020, Tetsuo Handa wrote:

> > On Thu, 12 Mar 2020, Tetsuo Handa wrote:
> > > > If you have an alternate patch to try, we can test it.  But since this 
> > > > cond_resched() is needed anyway, I'm not sure it will change the result.
> > > 
> > > schedule_timeout_killable(1) is an alternate patch to try; I don't think
> > > that this cond_resched() is needed anyway.
> > > 
> > 
> > You are suggesting schedule_timeout_killable(1) in shrink_node_memcgs()?
> > 
> 
> Andrew Morton also mentioned whether cond_resched() in shrink_node_memcgs()
> is enough. But like you mentioned,
> 

It passes our testing because this is where the allocator is looping while 
the victim is trying to exit if only it could be scheduled.

> you can try re-adding sleep outside of oom_lock:
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index d09776cd6e10..3aee7e0eca4e 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1576,6 +1576,7 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
>  	 */
>  	ret = should_force_charge() || out_of_memory(&oc);
>  	mutex_unlock(&oom_lock);
> +	schedule_timeout_killable(1);
>  	return ret;
>  }
>  

If current was process chosen for oom kill, this would actually induce the 
problem, not fix it.

> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 3c4eb750a199..e80158049651 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3797,7 +3797,6 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
>  	 */
>  	if (!mutex_trylock(&oom_lock)) {
>  		*did_some_progress = 1;
> -		schedule_timeout_uninterruptible(1);
>  		return NULL;
>  	}
>  
> @@ -4590,6 +4589,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>  
>  	/* Retry as long as the OOM killer is making progress */
>  	if (did_some_progress) {
> +		schedule_timeout_uninterruptible(1);
>  		no_progress_loops = 0;
>  		goto retry;
>  	}
> 
> By the way, will you share the reproducer (and how to use the reproducer) ?
> 

On an UP kernel with swap disabled, you limit a memcg to 100MB and start 
three processes that each fault 40MB attached to it.  Same reproducer as 
the "mm, oom: make a last minute check to prevent unnecessary memcg oom 
kills" patch except in that case there are two cores.
David Rientjes March 12, 2020, 6:20 p.m. UTC | #18
On Thu, 12 Mar 2020, Michal Hocko wrote:

> > I think the changelog clearly states that we need to guarantee that a 
> > reclaimer will yield the processor back to allow a victim to exit.  This 
> > is where we make the guarantee.  If it helps for the specific reason it 
> > triggered in my testing, we could add:
> > 
> > "For example, mem_cgroup_protected() can prohibit reclaim and thus any 
> > yielding in page reclaim would not address the issue."
> 
> I would suggest something like the following:
> "
> The reclaim path (including the OOM) relies on explicit scheduling
> points to hand over execution to tasks which could help with the reclaim
> process.

Are there other examples where yielding in the reclaim path would "help 
with the reclaim process" other than oom victims?  This sentence seems 
vague.

> Currently it is mostly shrink_page_list which yields CPU for
> each reclaimed page. This might be insufficient though in some
> configurations. E.g. when a memcg OOM path is triggered in a hierarchy
> which doesn't have any reclaimable memory because of memory reclaim
> protection (MEMCG_PROT_MIN) then there is possible to trigger a soft
> lockup during an out of memory situation on non preemptible kernels
> <PUT YOUR SOFT LOCKUP SPLAT HERE>
> 
> Fix this by adding a cond_resched up in the reclaim path and make sure
> there is a yield point regardless of reclaimability of the target
> hierarchy.
> "
>
Michal Hocko March 12, 2020, 8:16 p.m. UTC | #19
On Thu 12-03-20 11:20:33, David Rientjes wrote:
> On Thu, 12 Mar 2020, Michal Hocko wrote:
> 
> > > I think the changelog clearly states that we need to guarantee that a 
> > > reclaimer will yield the processor back to allow a victim to exit.  This 
> > > is where we make the guarantee.  If it helps for the specific reason it 
> > > triggered in my testing, we could add:
> > > 
> > > "For example, mem_cgroup_protected() can prohibit reclaim and thus any 
> > > yielding in page reclaim would not address the issue."
> > 
> > I would suggest something like the following:
> > "
> > The reclaim path (including the OOM) relies on explicit scheduling
> > points to hand over execution to tasks which could help with the reclaim
> > process.
> 
> Are there other examples where yielding in the reclaim path would "help 
> with the reclaim process" other than oom victims?  This sentence seems 
> vague.

In the context of UP and !PREEMPT this also includes IO flushers;
filesystems rely on workers, and there are things I am very likely not
aware of. If you think this is vague then feel free to reformulate.
All I really care about is what the next paragraph is explaining.

> > Currently it is mostly shrink_page_list which yields CPU for
> > each reclaimed page. This might be insufficient though in some
> > configurations. E.g. when a memcg OOM path is triggered in a hierarchy
> > which doesn't have any reclaimable memory because of memory reclaim
> > protection (MEMCG_PROT_MIN) then there is possible to trigger a soft
> > lockup during an out of memory situation on non preemptible kernels
> > <PUT YOUR SOFT LOCKUP SPLAT HERE>
> > 
> > Fix this by adding a cond_resched up in the reclaim path and make sure
> > there is a yield point regardless of reclaimability of the target
> > hierarchy.
> > "
> >
Andrew Morton March 12, 2020, 10:32 p.m. UTC | #20
On Thu, 12 Mar 2020 11:07:15 -0700 (PDT) David Rientjes <rientjes@google.com> wrote:

> On Thu, 12 Mar 2020, Tetsuo Handa wrote:
> 
> > > On Thu, 12 Mar 2020, Tetsuo Handa wrote:
> > > > > If you have an alternate patch to try, we can test it.  But since this 
> > > > > cond_resched() is needed anyway, I'm not sure it will change the result.
> > > > 
> > > > schedule_timeout_killable(1) is an alternate patch to try; I don't think
> > > > that this cond_resched() is needed anyway.
> > > > 
> > > 
> > > You are suggesting schedule_timeout_killable(1) in shrink_node_memcgs()?
> > > 
> > 
> > Andrew Morton also mentioned whether cond_resched() in shrink_node_memcgs()
> > is enough. But like you mentioned,
> > 
> 
> It passes our testing because this is where the allocator is looping while 
> the victim is trying to exit if only it could be scheduled.

What happens if the allocator has SCHED_FIFO?
Tetsuo Handa March 13, 2020, 12:15 a.m. UTC | #21
David Rientjes wrote:
> > By the way, will you share the reproducer (and how to use the reproducer) ?
> > 
> 
> On an UP kernel with swap disabled, you limit a memcg to 100MB and start 
> three processes that each fault 40MB attached to it.  Same reproducer as 
> the "mm, oom: make a last minute check to prevent unnecessary memcg oom 
> kills" patch except in that case there are two cores.
> 

I'm not a heavy memcg user. Please provide steps for reproducing your problem
in a "copy and pastable" way (e.g. bash script, C program).

> > @@ -1576,6 +1576,7 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
> >  	 */
> >  	ret = should_force_charge() || out_of_memory(&oc);
> >  	mutex_unlock(&oom_lock);
> > +	schedule_timeout_killable(1);
> >  	return ret;
> >  }
> >  
> 
> If current was process chosen for oom kill, this would actually induce the 
> problem, not fix it.
> 

Why? Memcg OOM path allows using forced charge path if should_force_charge() == true.

Since your lockup report

  Call Trace:
   shrink_node+0x40d/0x7d0
   do_try_to_free_pages+0x13f/0x470
   try_to_free_mem_cgroup_pages+0x16d/0x230
   try_charge+0x247/0xac0
   mem_cgroup_try_charge+0x10a/0x220
   mem_cgroup_try_charge_delay+0x1e/0x40
   handle_mm_fault+0xdf2/0x15f0
   do_user_addr_fault+0x21f/0x420
   page_fault+0x2f/0x40

says that allocating thread was calling try_to_free_mem_cgroup_pages() from try_charge(),
allocating thread must be able to reach mem_cgroup_out_of_memory() from mem_cgroup_oom()
 from try_charge(). And actually

  Memory cgroup out of memory: Killed process 808 (repro) total-vm:41944kB, anon-rss:35344kB, file-rss:504kB, shmem-rss:0kB, UID:0 pgtables:108kB oom_score_adj:0

says that the allocating thread did reach mem_cgroup_out_of_memory(). Then, the allocating thread
must be able to sleep at mem_cgroup_out_of_memory() if schedule_timeout_killable(1) is added to
mem_cgroup_out_of_memory().

Also, if current process was chosen for OOM-kill, current process will be able to leave
try_charge() due to should_force_charge() == true, won't it?

Thus, how can "this would actually induce the problem, not fix it." happen?
If your problem is that something keeps allocating threads from reaching the
should_force_charge() check, please explain the mechanism. If that is explained,
I would agree that schedule_timeout_killable(1) in mem_cgroup_out_of_memory()
won't help.
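
For reference, should_force_charge() at this point looks roughly like the
following (sketched from mm/memcontrol.c around v5.6; details may differ
slightly):

static bool should_force_charge(void)
{
	return tsk_is_oom_victim(current) || fatal_signal_pending(current) ||
		(current->flags & PF_EXITING);
}

That is, once the charging task has itself been marked as an OOM victim (or
is dying for any other reason), mem_cgroup_out_of_memory() returns early and
try_charge() force-charges instead of looping in reclaim.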
David Rientjes March 13, 2020, 10:01 p.m. UTC | #22
On Fri, 13 Mar 2020, Tetsuo Handa wrote:

> > On an UP kernel with swap disabled, you limit a memcg to 100MB and start 
> > three processes that each fault 40MB attached to it.  Same reproducer as 
> > the "mm, oom: make a last minute check to prevent unnecessary memcg oom 
> > kills" patch except in that case there are two cores.
> > 
> 
> I'm not a heavy memcg user. Please provide steps for reproducing your problem
> in a "copy and pastable" way (e.g. bash script, C program).
> 

Use Documentation/admin-guide/cgroup-v1/memory.rst or 
Documentation/admin-guide/cgroup-v2.rst to setup a memcg depending on 
which cgroup version you use, limit it to 100MB, and attach your process 
to it.

Run three programs that fault 40MB.  To do that, you need to use mmap:

	(void)mmap(NULL, 40 << 20, PROT_READ|PROT_WRITE,
		MAP_PRIVATE|MAP_ANONYMOUS|MAP_POPULATE, 0, 0);

Have it stall after populating the memory:

	for (;;);
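
Putting that together, a minimal reproducer sketch based on the description
above (the cgroup paths and file name are illustrative, not taken from the
thread):

/*
 * repro.c: fault 40MB and then spin.  Run three copies inside a memcg
 * limited to 100MB, e.g. with cgroup v2 (paths illustrative):
 *
 *   mkdir /sys/fs/cgroup/repro
 *   echo 100M > /sys/fs/cgroup/repro/memory.max
 *   echo $$ > /sys/fs/cgroup/repro/cgroup.procs
 *   ./repro & ./repro & ./repro &
 */
#include <stdlib.h>
#include <sys/mman.h>

int main(void)
{
	void *p = mmap(NULL, 40 << 20, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);

	if (p == MAP_FAILED)
		return EXIT_FAILURE;

	for (;;)	/* stall after populating the memory */
		;
}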

> > > @@ -1576,6 +1576,7 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
> > >  	 */
> > >  	ret = should_force_charge() || out_of_memory(&oc);
> > >  	mutex_unlock(&oom_lock);
> > > +	schedule_timeout_killable(1);
> > >  	return ret;
> > >  }
> > >  
> > 
> > If current was process chosen for oom kill, this would actually induce the 
> > problem, not fix it.
> > 
> 
> Why? Memcg OOM path allows using forced charge path if should_force_charge() == true.
> 
> Since your lockup report
> 
>   Call Trace:
>    shrink_node+0x40d/0x7d0
>    do_try_to_free_pages+0x13f/0x470
>    try_to_free_mem_cgroup_pages+0x16d/0x230
>    try_charge+0x247/0xac0
>    mem_cgroup_try_charge+0x10a/0x220
>    mem_cgroup_try_charge_delay+0x1e/0x40
>    handle_mm_fault+0xdf2/0x15f0
>    do_user_addr_fault+0x21f/0x420
>    page_fault+0x2f/0x40
> 
> says that allocating thread was calling try_to_free_mem_cgroup_pages() from try_charge(),
> allocating thread must be able to reach mem_cgroup_out_of_memory() from mem_cgroup_oom()
>  from try_charge(). And actually
> 
>   Memory cgroup out of memory: Killed process 808 (repro) total-vm:41944kB, anon-rss:35344kB, file-rss:504kB, shmem-rss:0kB, UID:0 pgtables:108kB oom_score_adj:0
> 
> says that allocating thread did reach mem_cgroup_out_of_memory(). Then, allocating thread
> must be able to sleep at mem_cgroup_out_of_memory() if schedule_timeout_killable(1) is
> mem_cgroup_out_of_memory().
> 
> Also, if current process was chosen for OOM-kill, current process will be able to leave
> try_charge() due to should_force_charge() == true, won't it?
> 
> Thus, how can "this would actually induce the problem, not fix it." happen?

The entire issue is that the victim never gets a chance to run because the 
allocator doesn't give it a chance to run on an UP system.  Your patch is 
broken because if the victim is current, you've lost your golden 
opportunity to actually exit and ceded control to the allocator that will 
now starve the victim.
Tetsuo Handa March 13, 2020, 11:15 p.m. UTC | #23
On 2020/03/14 7:01, David Rientjes wrote:
> The entire issue is that the victim never gets a chance to run because the 
> allocator doesn't give it a chance to run on an UP system.  Your patch is 
> broken because if the victim is current, you've lost your golden 
> opportunity to actually exit and ceded control to the allocator that will 
> now starve the victim.
> 

I still cannot understand. There is no need to give CPU time to OOM victims.
We just need to give CPU time to the OOM reaper kernel thread till the OOM
reaper kernel thread sets MMF_OOM_SKIP to OOM victims. If current thread is
an OOM victim, schedule_timeout_killable(1) will give other threads (including
the OOM reaper kernel thread) CPU time to run. That is similar to your
cond_resched() patch (except that cond_resched() might fail to give other
threads CPU time to run if current thread has realtime priority), isn't it?

So, please explain the mechanism by which cond_resched() works but
schedule_timeout_killable(1) cannot.
Tetsuo Handa March 13, 2020, 11:32 p.m. UTC | #24
On 2020/03/14 8:15, Tetsuo Handa wrote:
> If current thread is
> an OOM victim, schedule_timeout_killable(1) will give other threads (including
> the OOM reaper kernel thread) CPU time to run.

If current thread is an OOM victim, schedule_timeout_killable(1) will give other
threads (including the OOM reaper kernel thread) CPU time to run, by leaving
try_charge() path due to should_force_charge() == true and reaching do_exit() path
instead of returning to userspace code doing "for (;;);".

Unless the problem is that current thread cannot reach should_force_charge() check,
schedule_timeout_killable(1) should work.
Michal Hocko March 16, 2020, 9:31 a.m. UTC | #25
On Thu 12-03-20 15:32:38, Andrew Morton wrote:
> On Thu, 12 Mar 2020 11:07:15 -0700 (PDT) David Rientjes <rientjes@google.com> wrote:
> 
> > On Thu, 12 Mar 2020, Tetsuo Handa wrote:
> > 
> > > > On Thu, 12 Mar 2020, Tetsuo Handa wrote:
> > > > > > If you have an alternate patch to try, we can test it.  But since this 
> > > > > > cond_resched() is needed anyway, I'm not sure it will change the result.
> > > > > 
> > > > > schedule_timeout_killable(1) is an alternate patch to try; I don't think
> > > > > that this cond_resched() is needed anyway.
> > > > > 
> > > > 
> > > > You are suggesting schedule_timeout_killable(1) in shrink_node_memcgs()?
> > > > 
> > > 
> > > Andrew Morton also mentioned whether cond_resched() in shrink_node_memcgs()
> > > is enough. But like you mentioned,
> > > 
> > 
> > It passes our testing because this is where the allocator is looping while 
> > the victim is trying to exit if only it could be scheduled.
> 
> What happens if the allocator has SCHED_FIFO?

The same thing as a SCHED_FIFO running in a tight loop in the userspace.

As long as a high priority context depends on a resource held by a low
priority task then we have a priority inversion problem and the page
allocator is no real exception here. But I do not see the allocator
is much different from any other code in the kernel. We do not add
random sleeps here and there to push a high priority FIFO or RT tasks
out of the execution context. We do cond_resched to help !PREEMPT
kernels but priority related issues are really out of scope of that
facility.
Michal Hocko March 16, 2020, 9:32 a.m. UTC | #26
On Thu 12-03-20 21:16:27, Michal Hocko wrote:
> On Thu 12-03-20 11:20:33, David Rientjes wrote:
> > On Thu, 12 Mar 2020, Michal Hocko wrote:
> > 
> > > > I think the changelog clearly states that we need to guarantee that a 
> > > > reclaimer will yield the processor back to allow a victim to exit.  This 
> > > > is where we make the guarantee.  If it helps for the specific reason it 
> > > > triggered in my testing, we could add:
> > > > 
> > > > "For example, mem_cgroup_protected() can prohibit reclaim and thus any 
> > > > yielding in page reclaim would not address the issue."
> > > 
> > > I would suggest something like the following:
> > > "
> > > The reclaim path (including the OOM) relies on explicit scheduling
> > > points to hand over execution to tasks which could help with the reclaim
> > > process.
> > 
> > Are there other examples where yielding in the reclaim path would "help 
> > with the reclaim process" other than oom victims?  This sentence seems 
> > vague.
> 
> In the context of UP and !PREEMPT this also includes IO flushers,
> filesystems rely on workers and there are things I am very likely not
> aware of. If you think this is vague then feel free to reformulate.
> All I really do care about is what the next paragraph is explaining.

Btw. do you plan to send a patch with an updated changelog?

> > > Currently it is mostly shrink_page_list which yields CPU for
> > > each reclaimed page. This might be insufficient though in some
> > > configurations. E.g. when a memcg OOM path is triggered in a hierarchy
> > > which doesn't have any reclaimable memory because of memory reclaim
> > > protection (MEMCG_PROT_MIN) then there is possible to trigger a soft
> > > lockup during an out of memory situation on non preemptible kernels
> > > <PUT YOUR SOFT LOCKUP SPLAT HERE>
> > > 
> > > Fix this by adding a cond_resched up in the reclaim path and make sure
> > > there is a yield point regardless of reclaimability of the target
> > > hierarchy.
> > > "
> > > 
> 
> -- 
> Michal Hocko
> SUSE Labs
Tetsuo Handa March 16, 2020, 10:04 a.m. UTC | #27
On 2020/03/16 18:31, Michal Hocko wrote:
>> What happens if the allocator has SCHED_FIFO?
> 
> The same thing as a SCHED_FIFO task running in a tight loop in userspace.
> 
> As long as a high priority context depends on a resource held by a low
> priority task we have a priority inversion problem, and the page
> allocator is no real exception here. But I do not see that the allocator
> is much different from any other code in the kernel. We do not add
> random sleeps here and there to push high priority FIFO or RT tasks
> out of the execution context. We do cond_resched() to help !PREEMPT
> kernels, but priority related issues are really out of scope of that
> facility.
> 

Spinning with realtime priority in userspace is a userspace bug.
Spinning with realtime priority in kernelspace until the watchdog fires
is a kernel bug. We are not responsible for userspace bugs, and I'm
asking whether the memory allocator kernel code can give enough CPU
time to other threads even when the current thread has realtime priority.
Michal Hocko March 16, 2020, 10:14 a.m. UTC | #28
On Mon 16-03-20 19:04:44, Tetsuo Handa wrote:
> On 2020/03/16 18:31, Michal Hocko wrote:
> >> What happens if the allocator has SCHED_FIFO?
> > 
> > The same thing as a SCHED_FIFO task running in a tight loop in userspace.
> > 
> > As long as a high priority context depends on a resource held by a low
> > priority task we have a priority inversion problem, and the page
> > allocator is no real exception here. But I do not see that the allocator
> > is much different from any other code in the kernel. We do not add
> > random sleeps here and there to push high priority FIFO or RT tasks
> > out of the execution context. We do cond_resched() to help !PREEMPT
> > kernels, but priority related issues are really out of scope of that
> > facility.
> > 
> 
> Spinning with realtime priority in userspace is a userspace bug.
> Spinning with realtime priority in kernelspace until the watchdog fires
> is a kernel bug. We are not responsible for userspace bugs, and I'm
> asking whether the memory allocator kernel code can give enough CPU
> time to other threads even when the current thread has realtime priority.

We've been through that discussion many times and the core point is that
this requires major surgery to work properly. It is not as simple as
adding a sleep into the page allocator and being done with it. The page
allocator cannot really do much on its own; it relies on many other
contexts to make forward progress. What you really demand is far from
trivial. Maybe you are looking for something much closer to the RT kernel
than what other preemption modes can offer currently.

Right now, you really have to be careful when running FIFO/RT processes
and plan their resources very carefully. Is that ideal? Not really, but
considering that this has been the status quo for many years, it seems
that the use cases tend to find their way around that restriction.
David Rientjes March 16, 2020, 11:59 p.m. UTC | #29
On Sat, 14 Mar 2020, Tetsuo Handa wrote:

> > If current thread is
> > an OOM victim, schedule_timeout_killable(1) will give other threads (including
> > the OOM reaper kernel thread) CPU time to run.
> 
> If current thread is an OOM victim, schedule_timeout_killable(1) will give other
> threads (including the OOM reaper kernel thread) CPU time to run, by leaving
> try_charge() path due to should_force_charge() == true and reaching do_exit() path
> instead of returning to userspace code doing "for (;;);".
> 
> Unless the problem is that current thread cannot reach should_force_charge() check,
> schedule_timeout_killable(1) should work.
> 

There is no need to yield if current is the oom victim; allowing the oom
reaper to run when it may not actually be able to free memory is not
required.  It increases the likelihood that some other process schedules
and is unable to yield back due to the memcg oom condition, such that the
victim doesn't get a chance to run again.

This happens because the victim is allowed to overcharge but other 
processes attached to an oom memcg hierarchy simply fail the charge.  We 
are then reliant on all memory chargers in the kernel to yield if their 
charges fail due to oom.  It's the only way to allow the victim to 
eventually run.
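
To illustrate the asymmetry being described here, a simplified sketch of
the charge-forcing logic (based on the v5.6-era mm/memcontrol.c; abridged
and approximate, not the exact upstream code):

static bool should_force_charge(void)
{
	/* OOM victims and exiting or fatally signalled tasks. */
	return tsk_is_oom_victim(current) || fatal_signal_pending(current) ||
		(current->flags & PF_EXITING);
}

static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
		      unsigned int nr_pages)
{
	/* ... reclaim, oom kill and retry logic elided ... */

	if (unlikely(should_force_charge()))
		goto force;

	/* Everybody else simply sees the charge fail. */
	return -ENOMEM;

force:
	/*
	 * The victim is allowed to overcharge past the limit so that it
	 * can reach do_exit() and release its memory.
	 */
	page_counter_charge(&memcg->memory, nr_pages);
	return 0;
}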

So the only change that I would make to your patch is to do this in 
mem_cgroup_out_of_memory() instead:

	if (!fatal_signal_pending(current))
		schedule_timeout_killable(1);

So we don't have this reliance on all other memory chargers to yield when 
their charge fails and there is no delay for victims themselves.

 [ I'll still propose my change that adds cond_resched() to 
   shrink_node_memcgs() because we can see need_resched set for a 
   prolonged period of time without scheduling. ]

If you agree, I'll propose your patch with a changelog that indicates it 
can fix the soft lockup issue for UP and can likely get a tested-by for 
it.
Tetsuo Handa March 17, 2020, 3:18 a.m. UTC | #30
David Rientjes wrote:
> On Sat, 14 Mar 2020, Tetsuo Handa wrote:
> > If current thread is an OOM victim, schedule_timeout_killable(1) will give other
> > threads (including the OOM reaper kernel thread) CPU time to run, by leaving
> > try_charge() path due to should_force_charge() == true and reaching do_exit() path
> > instead of returning to userspace code doing "for (;;);".
> > 
> > Unless the problem is that current thread cannot reach should_force_charge() check,
> > schedule_timeout_killable(1) should work.
> > 
> 
> There is no need to yield if current is the oom victim; allowing the oom
> reaper to run when it may not actually be able to free memory is not
> required.  It increases the likelihood that some other process schedules
> and is unable to yield back due to the memcg oom condition, such that the
> victim doesn't get a chance to run again.
> 
> This happens because the victim is allowed to overcharge but other 
> processes attached to an oom memcg hierarchy simply fail the charge.  We 
> are then reliant on all memory chargers in the kernel to yield if their 
> charges fail due to oom.  It's the only way to allow the victim to 
> eventually run.
> 
> So the only change that I would make to your patch is to do this in 
> mem_cgroup_out_of_memory() instead:
> 
> 	if (!fatal_signal_pending(current))
> 		schedule_timeout_killable(1);
> 
> So we don't have this reliance on all other memory chargers to yield when 
> their charge fails and there is no delay for victims themselves.

I see. You want functions like the ones below for environments where the
current thread can fail to resume execution for a long time once it
reschedules (e.g. a UP kernel, or many threads contending on one CPU).

/*
 * Give other threads CPU time, unless current thread was already killed.
 * Used when we prefer killed threads to continue execution (in a hope that
 * killed threads terminate quickly) over giving other threads CPU time.
 */
signed long __sched schedule_timeout_killable_expedited(signed long timeout)
{
	if (unlikely(fatal_signal_pending(current)))
		return timeout;
	return schedule_timeout_killable(timeout);
}

/*
 * Latency reduction via explicit rescheduling in places that are safe,
 * but becomes no-op if current thread was already killed. Used when we
 * prefer killed threads to continue execution (in a hope that killed
 * threads terminate quickly) over giving other threads CPU time.
 */
int cond_resched_expedited(void)
{
	if (unlikely(fatal_signal_pending(current)))
		return 0;
	return cond_resched();
}

> 
>  [ I'll still propose my change that adds cond_resched() to 
>    shrink_node_memcgs() because we can see need_resched set for a 
>    prolonged period of time without scheduling. ]

As long as there is schedule_timeout_killable(), I'm fine with adding
cond_resched() in other places.

> 
> If you agree, I'll propose your patch with a changelog that indicates it 
> can fix the soft lockup issue for UP and can likely get a tested-by for 
> it.
> 

Please go ahead.
David Rientjes March 17, 2020, 4:09 a.m. UTC | #31
On Tue, 17 Mar 2020, Tetsuo Handa wrote:

> > 	if (!fatal_signal_pending(current))
> > 		schedule_timeout_killable(1);
> > 
> > So we don't have this reliance on all other memory chargers to yield when 
> > their charge fails and there is no delay for victims themselves.
> 
> I see. You want functions like the ones below for environments where the
> current thread can fail to resume execution for a long time once it
> reschedules (e.g. a UP kernel, or many threads contending on one CPU).
> 
> /*
>  * Give other threads CPU time, unless current thread was already killed.
>  * Used when we prefer killed threads to continue execution (in a hope that
>  * killed threads terminate quickly) over giving other threads CPU time.
>  */
> signed long __sched schedule_timeout_killable_expedited(signed long timeout)
> {
> 	if (unlikely(fatal_signal_pending(current)))
> 		return timeout;
> 	return schedule_timeout_killable(timeout);
> }
> 

I simply want the

	if (!fatal_signal_pending(current))
		schedule_timeout_killable(1);

after dropping oom_lock because I don't know that a generic function would 
be useful outside of this single place.  If it becomes a regular pattern, 
for whatever reason, I think we can consider a new schedule_timeout 
variant.
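
For concreteness, a sketch of where that would land, assuming the
v5.6-era structure of mem_cgroup_out_of_memory() in mm/memcontrol.c
(abridged, not an exact patch):

static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
				     int order)
{
	struct oom_control oc = {
		.memcg = memcg,
		.gfp_mask = gfp_mask,
		.order = order,
	};
	bool ret;

	if (mutex_lock_killable(&oom_lock))
		return true;
	/*
	 * Threads that were not waiting at mutex_lock_killable() can fail
	 * to bail out, so check again under oom_lock.
	 */
	ret = should_force_charge() || out_of_memory(&oc);
	mutex_unlock(&oom_lock);

	/*
	 * Proposed addition: after dropping oom_lock, give the victim (and
	 * the oom reaper) a chance to run on UP / !PREEMPT kernels, but do
	 * not delay a task that is itself already dying.
	 */
	if (!fatal_signal_pending(current))
		schedule_timeout_killable(1);

	return ret;
}

Doing the sleep after mutex_unlock() also means oom_lock is never held
across the timeout.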

> /*
>  * Latency reduction via explicit rescheduling in places that are safe,
>  * but becomes no-op if current thread was already killed. Used when we
>  * prefer killed threads to continue execution (in a hope that killed
>  * threads terminate quickly) over giving other threads CPU time.
>  */
> int cond_resched_expedited(void)
> {
> 	if (unlikely(fatal_signal_pending(current)))
> 		return 0;
> 	return cond_resched();
> }
> 
> > 
> >  [ I'll still propose my change that adds cond_resched() to 
> >    shrink_node_memcgs() because we can see need_resched set for a 
> >    prolonged period of time without scheduling. ]
> 
> As long as there is schedule_timeout_killable(), I'm fine with adding
> cond_resched() in other places.
> 

Sounds good, thanks Tetsuo.
diff mbox series

Patch

diff --git a/mm/vmscan.c b/mm/vmscan.c
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2637,6 +2637,8 @@  static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
 		unsigned long reclaimed;
 		unsigned long scanned;
 
+		cond_resched();
+
 		switch (mem_cgroup_protected(target_memcg, memcg)) {
 		case MEMCG_PROT_MIN:
 			/*