Message ID | alpine.DEB.2.21.2003101438510.161160@chino.kir.corp.google.com (mailing list archive) |
---|---|
State | New, archived |
Series | mm, oom: prevent soft lockup on memcg oom for UP systems |
On 2020/03/11 6:39, David Rientjes wrote:
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2637,6 +2637,8 @@ static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
>  		unsigned long reclaimed;
>  		unsigned long scanned;
>
> +		cond_resched();
> +

Is this safe for CONFIG_PREEMPTION case? If current thread has realtime priority,
can we guarantee that the OOM victim (well, the OOM reaper kernel thread rather
than the OOM victim ?) gets scheduled?

>  		switch (mem_cgroup_protected(target_memcg, memcg)) {
>  		case MEMCG_PROT_MIN:
>  			/*
>
On Tue 10-03-20 14:39:48, David Rientjes wrote: > When a process is oom killed as a result of memcg limits and the victim > is waiting to exit, nothing ends up actually yielding the processor back > to the victim on UP systems with preemption disabled. Instead, the > charging process simply loops in memcg reclaim and eventually soft > lockups. > > Memory cgroup out of memory: Killed process 808 (repro) total-vm:41944kB, anon-rss:35344kB, file-rss:504kB, shmem-rss:0kB, UID:0 pgtables:108kB oom_score_adj:0 > watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [repro:806] > CPU: 0 PID: 806 Comm: repro Not tainted 5.6.0-rc5+ #136 > RIP: 0010:shrink_lruvec+0x4e9/0xa40 > ... > Call Trace: > shrink_node+0x40d/0x7d0 > do_try_to_free_pages+0x13f/0x470 > try_to_free_mem_cgroup_pages+0x16d/0x230 > try_charge+0x247/0xac0 > mem_cgroup_try_charge+0x10a/0x220 > mem_cgroup_try_charge_delay+0x1e/0x40 > handle_mm_fault+0xdf2/0x15f0 > do_user_addr_fault+0x21f/0x420 > page_fault+0x2f/0x40 > > Make sure that something ends up actually yielding the processor back to > the victim to allow for memory freeing. Most appropriate place appears to > be shrink_node_memcgs() where the iteration of all decendant memcgs could > be particularly lengthy. There is a cond_resched in shrink_lruvec and another one in shrink_page_list. Why doesn't any of them hit? Is it because there are no pages on the LRU list? Because rss data suggests there should be enough pages to go that path. Or maybe it is shrink_slab path that takes too long? The patch itself makes sense to me but I would like to see more explanation on how that happens. Thanks. > Cc: Vlastimil Babka <vbabka@suse.cz> > Cc: Michal Hocko <mhocko@kernel.org> > Cc: stable@vger.kernel.org > Signed-off-by: David Rientjes <rientjes@google.com> > --- > mm/vmscan.c | 2 ++ > 1 file changed, 2 insertions(+) > > diff --git a/mm/vmscan.c b/mm/vmscan.c > --- a/mm/vmscan.c > +++ b/mm/vmscan.c > @@ -2637,6 +2637,8 @@ static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc) > unsigned long reclaimed; > unsigned long scanned; > > + cond_resched(); > + > switch (mem_cgroup_protected(target_memcg, memcg)) { > case MEMCG_PROT_MIN: > /*
On Wed, 11 Mar 2020, Tetsuo Handa wrote: > > diff --git a/mm/vmscan.c b/mm/vmscan.c > > --- a/mm/vmscan.c > > +++ b/mm/vmscan.c > > @@ -2637,6 +2637,8 @@ static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc) > > unsigned long reclaimed; > > unsigned long scanned; > > > > + cond_resched(); > > + > > Is this safe for CONFIG_PREEMPTION case? If current thread has realtime priority, > can we guarantee that the OOM victim (well, the OOM reaper kernel thread rather > than the OOM victim ?) gets scheduled? > I think it's the best we can do that immediately solves the issue unless you have another idea in mind? > > switch (mem_cgroup_protected(target_memcg, memcg)) { > > case MEMCG_PROT_MIN: > > /* > > >
On Tue, 10 Mar 2020, Michal Hocko wrote: > > When a process is oom killed as a result of memcg limits and the victim > > is waiting to exit, nothing ends up actually yielding the processor back > > to the victim on UP systems with preemption disabled. Instead, the > > charging process simply loops in memcg reclaim and eventually soft > > lockups. > > > > Memory cgroup out of memory: Killed process 808 (repro) total-vm:41944kB, anon-rss:35344kB, file-rss:504kB, shmem-rss:0kB, UID:0 pgtables:108kB oom_score_adj:0 > > watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [repro:806] > > CPU: 0 PID: 806 Comm: repro Not tainted 5.6.0-rc5+ #136 > > RIP: 0010:shrink_lruvec+0x4e9/0xa40 > > ... > > Call Trace: > > shrink_node+0x40d/0x7d0 > > do_try_to_free_pages+0x13f/0x470 > > try_to_free_mem_cgroup_pages+0x16d/0x230 > > try_charge+0x247/0xac0 > > mem_cgroup_try_charge+0x10a/0x220 > > mem_cgroup_try_charge_delay+0x1e/0x40 > > handle_mm_fault+0xdf2/0x15f0 > > do_user_addr_fault+0x21f/0x420 > > page_fault+0x2f/0x40 > > > > Make sure that something ends up actually yielding the processor back to > > the victim to allow for memory freeing. Most appropriate place appears to > > be shrink_node_memcgs() where the iteration of all decendant memcgs could > > be particularly lengthy. > > There is a cond_resched in shrink_lruvec and another one in > shrink_page_list. Why doesn't any of them hit? Is it because there are > no pages on the LRU list? Because rss data suggests there should be > enough pages to go that path. Or maybe it is shrink_slab path that takes > too long? > I think it can be a number of cases, most notably mem_cgroup_protected() checks which is why the cond_resched() is added above it. Rather than add cond_resched() only for MEMCG_PROT_MIN and for certain MEMCG_PROT_LOW, the cond_resched() is added above the switch clause because the iteration itself may be potentially very lengthy. We could also do it in shrink_zones() or the priority based do_try_to_free_pages() loop, but I'd be nervous about the lengthy memcg iteration in shrink_node_memcgs() independent of this. Any other ideas on how to ensure we actually try to resched for the benefit of an oom victim to prevent this soft lockup? > The patch itself makes sense to me but I would like to see more > explanation on how that happens. > > Thanks. > > > Cc: Vlastimil Babka <vbabka@suse.cz> > > Cc: Michal Hocko <mhocko@kernel.org> > > Cc: stable@vger.kernel.org > > Signed-off-by: David Rientjes <rientjes@google.com> > > --- > > mm/vmscan.c | 2 ++ > > 1 file changed, 2 insertions(+) > > > > diff --git a/mm/vmscan.c b/mm/vmscan.c > > --- a/mm/vmscan.c > > +++ b/mm/vmscan.c > > @@ -2637,6 +2637,8 @@ static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc) > > unsigned long reclaimed; > > unsigned long scanned; > > > > + cond_resched(); > > + > > switch (mem_cgroup_protected(target_memcg, memcg)) { > > case MEMCG_PROT_MIN: > > /* > > -- > Michal Hocko > SUSE Labs >
On Tue, 10 Mar 2020 14:39:48 -0700 (PDT) David Rientjes <rientjes@google.com> wrote: > When a process is oom killed as a result of memcg limits and the victim > is waiting to exit, nothing ends up actually yielding the processor back > to the victim on UP systems with preemption disabled. Instead, the > charging process simply loops in memcg reclaim and eventually soft > lockups. > > Memory cgroup out of memory: Killed process 808 (repro) total-vm:41944kB, anon-rss:35344kB, file-rss:504kB, shmem-rss:0kB, UID:0 pgtables:108kB oom_score_adj:0 > watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [repro:806] > CPU: 0 PID: 806 Comm: repro Not tainted 5.6.0-rc5+ #136 > RIP: 0010:shrink_lruvec+0x4e9/0xa40 > ... > Call Trace: > shrink_node+0x40d/0x7d0 > do_try_to_free_pages+0x13f/0x470 > try_to_free_mem_cgroup_pages+0x16d/0x230 > try_charge+0x247/0xac0 > mem_cgroup_try_charge+0x10a/0x220 > mem_cgroup_try_charge_delay+0x1e/0x40 > handle_mm_fault+0xdf2/0x15f0 > do_user_addr_fault+0x21f/0x420 > page_fault+0x2f/0x40 > > Make sure that something ends up actually yielding the processor back to > the victim to allow for memory freeing. Most appropriate place appears to > be shrink_node_memcgs() where the iteration of all decendant memcgs could > be particularly lengthy. > That's a bit sad. > --- a/mm/vmscan.c > +++ b/mm/vmscan.c > @@ -2637,6 +2637,8 @@ static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc) > unsigned long reclaimed; > unsigned long scanned; > > + cond_resched(); > + > switch (mem_cgroup_protected(target_memcg, memcg)) { > case MEMCG_PROT_MIN: > /* Obviously better, but this will still spin wheels until this tasks's timeslice expires, and we might want to do something to help ensure that the victim runs next (or soon)? (And why is shrink_node_memcgs compiled in when CONFIG_MEMCG=n?)
On Tue, 10 Mar 2020, Andrew Morton wrote: > > --- a/mm/vmscan.c > > +++ b/mm/vmscan.c > > @@ -2637,6 +2637,8 @@ static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc) > > unsigned long reclaimed; > > unsigned long scanned; > > > > + cond_resched(); > > + > > switch (mem_cgroup_protected(target_memcg, memcg)) { > > case MEMCG_PROT_MIN: > > /* > > > Obviously better, but this will still spin wheels until this tasks's > timeslice expires, and we might want to do something to help ensure > that the victim runs next (or soon)? > We used to have a schedule_timeout_killable(1) to address exactly that scenario but it was removed in 4.19: commit 9bfe5ded054b8e28a94c78580f233d6879a00146 Author: Michal Hocko <mhocko@suse.com> Date: Fri Aug 17 15:49:04 2018 -0700 mm, oom: remove sleep from under oom_lock This is why we don't see this issue on 4.14 guests but we do on 4.19. I had assumed the issue Tetsuo reported that resulted in that patch was still an issue and I preferred to fix the weird UP issue by adding a cond_resched() that is likely needed for the iteration in shrink_node_memcg() anyway. Do we care to optimize for UP systems encountering memcg oom kills? Eh, maybe, but I'm not very interested in opening up a centithread about this. > (And why is shrink_node_memcgs compiled in when CONFIG_MEMCG=n?) > This guest does have CONFIG_MEMCG enabled, it's a memcg oom condition. But unrelated to this patch, I think it's just a weird naming for it. The do-while loop in shrink_node_memcgs() actually uses memcg = NULL for the non-memcg case and is responsible for calling into page and slab reclaim.
On Tue 10-03-20 16:02:23, David Rientjes wrote: > On Tue, 10 Mar 2020, Michal Hocko wrote: > > > > When a process is oom killed as a result of memcg limits and the victim > > > is waiting to exit, nothing ends up actually yielding the processor back > > > to the victim on UP systems with preemption disabled. Instead, the > > > charging process simply loops in memcg reclaim and eventually soft > > > lockups. > > > > > > Memory cgroup out of memory: Killed process 808 (repro) total-vm:41944kB, anon-rss:35344kB, file-rss:504kB, shmem-rss:0kB, UID:0 pgtables:108kB oom_score_adj:0 > > > watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [repro:806] > > > CPU: 0 PID: 806 Comm: repro Not tainted 5.6.0-rc5+ #136 > > > RIP: 0010:shrink_lruvec+0x4e9/0xa40 > > > ... > > > Call Trace: > > > shrink_node+0x40d/0x7d0 > > > do_try_to_free_pages+0x13f/0x470 > > > try_to_free_mem_cgroup_pages+0x16d/0x230 > > > try_charge+0x247/0xac0 > > > mem_cgroup_try_charge+0x10a/0x220 > > > mem_cgroup_try_charge_delay+0x1e/0x40 > > > handle_mm_fault+0xdf2/0x15f0 > > > do_user_addr_fault+0x21f/0x420 > > > page_fault+0x2f/0x40 > > > > > > Make sure that something ends up actually yielding the processor back to > > > the victim to allow for memory freeing. Most appropriate place appears to > > > be shrink_node_memcgs() where the iteration of all decendant memcgs could > > > be particularly lengthy. > > > > There is a cond_resched in shrink_lruvec and another one in > > shrink_page_list. Why doesn't any of them hit? Is it because there are > > no pages on the LRU list? Because rss data suggests there should be > > enough pages to go that path. Or maybe it is shrink_slab path that takes > > too long? > > > > I think it can be a number of cases, most notably mem_cgroup_protected() > checks which is why the cond_resched() is added above it. Rather than add > cond_resched() only for MEMCG_PROT_MIN and for certain MEMCG_PROT_LOW, the > cond_resched() is added above the switch clause because the iteration > itself may be potentially very lengthy. Was any of the above the case for your soft lockup case? How have you managed to trigger it? As I've said I am not against the patch but I would really like to see an actual explanation what happened rather than speculations of what might have happened. If for nothing else then for the future reference. If this is really about all the hierarchy being MEMCG_PROT_MIN protected and that results in a very expensive and pointless reclaim walk that can trigger soft lockup then it should be explicitly mentioned in the changelog.
On Tue 10-03-20 17:18:02, Andrew Morton wrote:
[...]
> (And why is shrink_node_memcgs compiled in when CONFIG_MEMCG=n?)
Because there won't be anything memcg specific with the config disabled.
mem_cgroup_iter will expand to NULL memcg, mem_cgroup_protected switch
compiled out, mem_cgroup_lruvec will return the lruvec abstraction which
resolves to pgdat and the rest is not memcg dependent. We could have
split the reclaim protection checks or the loop out of line, but I
believe it is better to have them clearly visible.
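
To illustrate the point, the !CONFIG_MEMCG stubs look roughly like the following
(paraphrased from include/linux/memcontrol.h around v5.6; a sketch rather than
the verbatim upstream definitions):

/* With CONFIG_MEMCG=n the hierarchy iterator yields a single NULL memcg ... */
static inline struct mem_cgroup *
mem_cgroup_iter(struct mem_cgroup *root, struct mem_cgroup *prev,
		struct mem_cgroup_reclaim_cookie *reclaim)
{
	return NULL;
}

/*
 * ... and the protection check constant-folds to "no protection", so the
 * switch in shrink_node_memcgs() is optimized away entirely.
 */
static inline enum mem_cgroup_protection
mem_cgroup_protected(struct mem_cgroup *root, struct mem_cgroup *memcg)
{
	return MEMCG_PROT_NONE;
}

mem_cgroup_lruvec(NULL, pgdat) then resolves to the node-wide lruvec, so the
do-while loop in shrink_node_memcgs() runs exactly once per node.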
On 2020/03/11 7:55, David Rientjes wrote: > On Wed, 11 Mar 2020, Tetsuo Handa wrote: > >>> diff --git a/mm/vmscan.c b/mm/vmscan.c >>> --- a/mm/vmscan.c >>> +++ b/mm/vmscan.c >>> @@ -2637,6 +2637,8 @@ static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc) >>> unsigned long reclaimed; >>> unsigned long scanned; >>> >>> + cond_resched(); >>> + >> >> Is this safe for CONFIG_PREEMPTION case? If current thread has realtime priority, >> can we guarantee that the OOM victim (well, the OOM reaper kernel thread rather >> than the OOM victim ?) gets scheduled? >> > > I think it's the best we can do that immediately solves the issue unless > you have another idea in mind? "schedule_timeout_killable(1) outside of oom_lock" or "the OOM reaper grabs oom_lock so that allocating threads guarantee that the OOM reaper gets scheduled" or "direct OOM reaping so that allocating threads guarantee that some memory is reclaimed". > >>> switch (mem_cgroup_protected(target_memcg, memcg)) { >>> case MEMCG_PROT_MIN: >>> /* >>> >>
On Wed, 11 Mar 2020, Tetsuo Handa wrote: > >>> diff --git a/mm/vmscan.c b/mm/vmscan.c > >>> --- a/mm/vmscan.c > >>> +++ b/mm/vmscan.c > >>> @@ -2637,6 +2637,8 @@ static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc) > >>> unsigned long reclaimed; > >>> unsigned long scanned; > >>> > >>> + cond_resched(); > >>> + > >> > >> Is this safe for CONFIG_PREEMPTION case? If current thread has realtime priority, > >> can we guarantee that the OOM victim (well, the OOM reaper kernel thread rather > >> than the OOM victim ?) gets scheduled? > >> > > > > I think it's the best we can do that immediately solves the issue unless > > you have another idea in mind? > > "schedule_timeout_killable(1) outside of oom_lock" or "the OOM reaper grabs oom_lock > so that allocating threads guarantee that the OOM reaper gets scheduled" or "direct OOM > reaping so that allocating threads guarantee that some memory is reclaimed". > The cond_resched() here is needed if the iteration is lengthy depending on the number of descendant memcgs already. schedule_timeout_killable(1) does not make any guarantees that current will be scheduled after the victim or oom_reaper on UP systems. If you have an alternate patch to try, we can test it. But since this cond_resched() is needed anyway, I'm not sure it will change the result. > > > >>> switch (mem_cgroup_protected(target_memcg, memcg)) { > >>> case MEMCG_PROT_MIN: > >>> /* > >>> > >> > >
On Wed, 11 Mar 2020, Michal Hocko wrote: > > > > When a process is oom killed as a result of memcg limits and the victim > > > > is waiting to exit, nothing ends up actually yielding the processor back > > > > to the victim on UP systems with preemption disabled. Instead, the > > > > charging process simply loops in memcg reclaim and eventually soft > > > > lockups. > > > > > > > > Memory cgroup out of memory: Killed process 808 (repro) total-vm:41944kB, anon-rss:35344kB, file-rss:504kB, shmem-rss:0kB, UID:0 pgtables:108kB oom_score_adj:0 > > > > watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [repro:806] > > > > CPU: 0 PID: 806 Comm: repro Not tainted 5.6.0-rc5+ #136 > > > > RIP: 0010:shrink_lruvec+0x4e9/0xa40 > > > > ... > > > > Call Trace: > > > > shrink_node+0x40d/0x7d0 > > > > do_try_to_free_pages+0x13f/0x470 > > > > try_to_free_mem_cgroup_pages+0x16d/0x230 > > > > try_charge+0x247/0xac0 > > > > mem_cgroup_try_charge+0x10a/0x220 > > > > mem_cgroup_try_charge_delay+0x1e/0x40 > > > > handle_mm_fault+0xdf2/0x15f0 > > > > do_user_addr_fault+0x21f/0x420 > > > > page_fault+0x2f/0x40 > > > > > > > > Make sure that something ends up actually yielding the processor back to > > > > the victim to allow for memory freeing. Most appropriate place appears to > > > > be shrink_node_memcgs() where the iteration of all decendant memcgs could > > > > be particularly lengthy. > > > > > > There is a cond_resched in shrink_lruvec and another one in > > > shrink_page_list. Why doesn't any of them hit? Is it because there are > > > no pages on the LRU list? Because rss data suggests there should be > > > enough pages to go that path. Or maybe it is shrink_slab path that takes > > > too long? > > > > > > > I think it can be a number of cases, most notably mem_cgroup_protected() > > checks which is why the cond_resched() is added above it. Rather than add > > cond_resched() only for MEMCG_PROT_MIN and for certain MEMCG_PROT_LOW, the > > cond_resched() is added above the switch clause because the iteration > > itself may be potentially very lengthy. > > Was any of the above the case for your soft lockup case? How have you > managed to trigger it? As I've said I am not against the patch but I > would really like to see an actual explanation what happened rather than > speculations of what might have happened. If for nothing else then for > the future reference. > Yes, this is how it was triggered in my own testing. > If this is really about all the hierarchy being MEMCG_PROT_MIN protected > and that results in a very expensive and pointless reclaim walk that can > trigger soft lockup then it should be explicitly mentioned in the > changelog. I think the changelog clearly states that we need to guarantee that a reclaimer will yield the processor back to allow a victim to exit. This is where we make the guarantee. If it helps for the specific reason it triggered in my testing, we could add: "For example, mem_cgroup_protected() can prohibit reclaim and thus any yielding in page reclaim would not address the issue."
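
To make the explanation concrete, shrink_node_memcgs() is roughly the loop below
(paraphrased and simplified from mm/vmscan.c around v5.6, with the proposed hunk
applied). When the whole hierarchy is MEMCG_PROT_MIN protected, every iteration
takes the "continue" branch and never reaches shrink_lruvec()/shrink_page_list(),
where the pre-existing cond_resched() calls live, so the call added above the
switch becomes the only scheduling point:

static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
{
	struct mem_cgroup *target_memcg = sc->target_mem_cgroup;
	struct mem_cgroup *memcg;

	memcg = mem_cgroup_iter(target_memcg, NULL, NULL);
	do {
		struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat);

		cond_resched();		/* proposed: yield even if this memcg is skipped */

		switch (mem_cgroup_protected(target_memcg, memcg)) {
		case MEMCG_PROT_MIN:
			/* Hard protection: nothing to reclaim here, skip it. */
			continue;
		case MEMCG_PROT_LOW:
			/* Soft protection: usually skipped as well. */
			if (!sc->memcg_low_reclaim) {
				sc->memcg_low_skipped = 1;
				continue;
			}
			break;
		case MEMCG_PROT_NONE:
			break;
		}

		/* Only from here on do the existing scheduling points run. */
		shrink_lruvec(lruvec, sc);
		shrink_slab(sc->gfp_mask, pgdat->node_id, memcg, sc->priority);
	} while ((memcg = mem_cgroup_iter(target_memcg, memcg, NULL)));
}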
On 2020/03/12 4:38, David Rientjes wrote: > On Wed, 11 Mar 2020, Tetsuo Handa wrote: > >>>>> diff --git a/mm/vmscan.c b/mm/vmscan.c >>>>> --- a/mm/vmscan.c >>>>> +++ b/mm/vmscan.c >>>>> @@ -2637,6 +2637,8 @@ static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc) >>>>> unsigned long reclaimed; >>>>> unsigned long scanned; >>>>> >>>>> + cond_resched(); >>>>> + >>>> >>>> Is this safe for CONFIG_PREEMPTION case? If current thread has realtime priority, >>>> can we guarantee that the OOM victim (well, the OOM reaper kernel thread rather >>>> than the OOM victim ?) gets scheduled? >>>> >>> >>> I think it's the best we can do that immediately solves the issue unless >>> you have another idea in mind? >> >> "schedule_timeout_killable(1) outside of oom_lock" or "the OOM reaper grabs oom_lock >> so that allocating threads guarantee that the OOM reaper gets scheduled" or "direct OOM >> reaping so that allocating threads guarantee that some memory is reclaimed". >> > > The cond_resched() here is needed if the iteration is lengthy depending on > the number of descendant memcgs already. No. cond_resched() here will become no-op if CONFIG_PREEMPTION=y and current thread has realtime priority. > > schedule_timeout_killable(1) does not make any guarantees that current > will be scheduled after the victim or oom_reaper on UP systems. The point of schedule_timeout_*(1) is to guarantee that current thread will yield CPU to other threads even if CONFIG_PREEMPTION=y and current thread has realtime priority case. There is no guarantee that current thread will be rescheduled immediately after a sleep is irrelevant. > > If you have an alternate patch to try, we can test it. But since this > cond_resched() is needed anyway, I'm not sure it will change the result. schedule_timeout_killable(1) is an alternate patch to try; I don't think that this cond_resched() is needed anyway. > >>> >>>>> switch (mem_cgroup_protected(target_memcg, memcg)) { >>>>> case MEMCG_PROT_MIN: >>>>> /* >>>>> >>>> >> >>
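
For reference on the CONFIG_PREEMPTION point being argued here, cond_resched()
reduces to roughly the following in include/linux/sched.h of that era (a
paraphrase, shown only to illustrate the disagreement):

#ifndef CONFIG_PREEMPTION
extern int _cond_resched(void);			/* may call into the scheduler */
#else
static inline int _cond_resched(void) { return 0; }	/* no-op: kernel is already preemptible */
#endif

#define cond_resched() ({			\
	___might_sleep(__FILE__, __LINE__, 0);	\
	_cond_resched();			\
})

Even on a !CONFIG_PREEMPTION kernel, where _cond_resched() does call the
scheduler, a SCHED_FIFO caller is simply picked again ahead of a SCHED_OTHER
victim, whereas schedule_timeout_killable(1) blocks for a jiffy and therefore
gives up the CPU regardless of priority; that appears to be the distinction
Tetsuo is drawing.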
On Thu, 12 Mar 2020, Tetsuo Handa wrote: > > The cond_resched() here is needed if the iteration is lengthy depending on > > the number of descendant memcgs already. > > No. cond_resched() here will become no-op if CONFIG_PREEMPTION=y and current > thread has realtime priority. > It's helpful without CONFIG_PREEMPTION for excessively long memcg iterations to avoid starving need_resched. > > schedule_timeout_killable(1) does not make any guarantees that current > > will be scheduled after the victim or oom_reaper on UP systems. > > The point of schedule_timeout_*(1) is to guarantee that current thread > will yield CPU to other threads even if CONFIG_PREEMPTION=y and current > thread has realtime priority case. There is no guarantee that current > thread will be rescheduled immediately after a sleep is irrelevant. > > > > > If you have an alternate patch to try, we can test it. But since this > > cond_resched() is needed anyway, I'm not sure it will change the result. > > schedule_timeout_killable(1) is an alternate patch to try; I don't think > that this cond_resched() is needed anyway. > You are suggesting schedule_timeout_killable(1) in shrink_node_memcgs()?
> On Thu, 12 Mar 2020, Tetsuo Handa wrote:
> > > If you have an alternate patch to try, we can test it. But since this
> > > cond_resched() is needed anyway, I'm not sure it will change the result.
> >
> > schedule_timeout_killable(1) is an alternate patch to try; I don't think
> > that this cond_resched() is needed anyway.
> >
>
> You are suggesting schedule_timeout_killable(1) in shrink_node_memcgs()?
>

Andrew Morton also mentioned whether cond_resched() in shrink_node_memcgs()
is enough. But like you mentioned,

David Rientjes wrote:
> On Tue, 10 Mar 2020, Andrew Morton wrote:
> > > --- a/mm/vmscan.c
> > > +++ b/mm/vmscan.c
> > > @@ -2637,6 +2637,8 @@ static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
> > >  		unsigned long reclaimed;
> > >  		unsigned long scanned;
> > >
> > > +		cond_resched();
> > > +
> > >  		switch (mem_cgroup_protected(target_memcg, memcg)) {
> > >  		case MEMCG_PROT_MIN:
> > >  			/*
> > >
> >
> > Obviously better, but this will still spin wheels until this tasks's
> > timeslice expires, and we might want to do something to help ensure
> > that the victim runs next (or soon)?
> >
> We used to have a schedule_timeout_killable(1) to address exactly that
> scenario but it was removed in 4.19:
>
> commit 9bfe5ded054b8e28a94c78580f233d6879a00146
> Author: Michal Hocko <mhocko@suse.com>
> Date:   Fri Aug 17 15:49:04 2018 -0700
>
>     mm, oom: remove sleep from under oom_lock

you can try re-adding sleep outside of oom_lock:

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index d09776cd6e10..3aee7e0eca4e 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1576,6 +1576,7 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	 */
 	ret = should_force_charge() || out_of_memory(&oc);
 	mutex_unlock(&oom_lock);
+	schedule_timeout_killable(1);
 	return ret;
 }

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3c4eb750a199..e80158049651 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3797,7 +3797,6 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
 	 */
 	if (!mutex_trylock(&oom_lock)) {
 		*did_some_progress = 1;
-		schedule_timeout_uninterruptible(1);
 		return NULL;
 	}

@@ -4590,6 +4589,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,

 	/* Retry as long as the OOM killer is making progress */
 	if (did_some_progress) {
+		schedule_timeout_uninterruptible(1);
 		no_progress_loops = 0;
 		goto retry;
 	}

By the way, will you share the reproducer (and how to use the reproducer) ?
> > We used to have a schedule_timeout_killable(1) to address exactly that > > scenario but it was removed in 4.19: > > > > commit 9bfe5ded054b8e28a94c78580f233d6879a00146 > > Author: Michal Hocko <mhocko@suse.com> > > Date: Fri Aug 17 15:49:04 2018 -0700 > > > > mm, oom: remove sleep from under oom_lock > For the record: If I revert that commit, I can observe that current thread sleeps for more than one minute with oom_lock held. That is, I don't want to revert that commit. [ 1629.930998][T14021] idle-priority invoked oom-killer: gfp_mask=0x100dca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), order=0, oom_score_adj=0 [ 1629.934944][T14021] CPU: 0 PID: 14021 Comm: idle-priority Kdump: loaded Not tainted 5.6.0-rc5+ #968 [ 1629.938352][T14021] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/29/2019 [ 1629.942211][T14021] Call Trace: [ 1629.943956][T14021] dump_stack+0x71/0xa5 [ 1629.946041][T14021] dump_header+0x4d/0x3e0 [ 1629.948097][T14021] oom_kill_process+0x179/0x210 [ 1629.950191][T14021] out_of_memory+0x109/0x3d0 [ 1629.952154][T14021] ? out_of_memory+0x1ea/0x3d0 [ 1629.954349][T14021] __alloc_pages_slowpath+0x934/0xb89 [ 1629.956959][T14021] __alloc_pages_nodemask+0x333/0x370 [ 1629.959121][T14021] handle_pte_fault+0x4ce/0xb40 [ 1629.961274][T14021] __handle_mm_fault+0x2b2/0x5e0 [ 1629.963363][T14021] handle_mm_fault+0x177/0x360 [ 1629.965404][T14021] ? handle_mm_fault+0x46/0x360 [ 1629.967451][T14021] do_page_fault+0x2d5/0x680 [ 1629.969351][T14021] page_fault+0x34/0x40 [ 1629.971290][T14021] RIP: 0010:__clear_user+0x36/0x70 [ 1629.973621][T14021] Code: 0c a7 e2 95 53 48 89 f3 be 14 00 00 00 e8 82 87 b4 ff 0f 01 cb 48 89 d8 48 c1 eb 03 4c 89 e7 83 e0 07 48 89 d9 48 85 c9 74 0f <48> c7 07 00 00 00 00 48 83 c7 08 ff c9 75 f1 48 89 c1 85 c9 74 0a [ 1629.980104][T14021] RSP: 0000:ffffa06e0742bd40 EFLAGS: 00050202 [ 1629.982543][T14021] RAX: 0000000000000000 RBX: 0000000000000200 RCX: 0000000000000002 [ 1629.985322][T14021] RDX: 0000000000000000 RSI: 00000000a8057a95 RDI: 00007f0ca8605000 [ 1629.999883][T14021] RBP: ffffa06e0742bd50 R08: 0000000000000001 R09: 0000000000000000 [ 1630.018164][T14021] R10: 0000000000000000 R11: ffff953843a80978 R12: 00007f0ca8604010 [ 1630.021491][T14021] R13: 000000000c988000 R14: ffffa06e0742be08 R15: 0000000000001000 [ 1630.024881][T14021] clear_user+0x47/0x60 [ 1630.026651][T14021] iov_iter_zero+0x95/0x3d0 [ 1630.028455][T14021] read_iter_zero+0x38/0xb0 [ 1630.030234][T14021] new_sync_read+0x112/0x1b0 [ 1630.032025][T14021] __vfs_read+0x27/0x40 [ 1630.033639][T14021] vfs_read+0xaf/0x160 [ 1630.035243][T14021] ksys_read+0x62/0xe0 [ 1630.036869][T14021] __x64_sys_read+0x15/0x20 [ 1630.038827][T14021] do_syscall_64+0x4a/0x1e0 [ 1630.040515][T14021] entry_SYSCALL_64_after_hwframe+0x49/0xbe [ 1630.042598][T14021] RIP: 0033:0x7f109bb78950 [ 1630.044208][T14021] Code: Bad RIP value. 
[ 1630.045798][T14021] RSP: 002b:00007fff499db458 EFLAGS: 00000246 ORIG_RAX: 0000000000000000 [ 1630.048558][T14021] RAX: ffffffffffffffda RBX: 0000000200000000 RCX: 00007f109bb78950 [ 1630.051117][T14021] RDX: 0000000200000000 RSI: 00007f0c9bc7c010 RDI: 0000000000000003 [ 1630.053715][T14021] RBP: 00007f0c9bc7c010 R08: 0000000000000000 R09: 0000000000021000 [ 1630.056488][T14021] R10: 00007fff499daea0 R11: 0000000000000246 R12: 00007f0c9bc7c010 [ 1630.059124][T14021] R13: 0000000000000005 R14: 0000000000000003 R15: 0000000000000000 [ 1630.061935][T14021] Mem-Info: [ 1630.063226][T14021] active_anon:1314192 inactive_anon:5198 isolated_anon:0 [ 1630.063226][T14021] active_file:9 inactive_file:0 isolated_file:0 [ 1630.063226][T14021] unevictable:0 dirty:0 writeback:0 unstable:0 [ 1630.063226][T14021] slab_reclaimable:6649 slab_unreclaimable:268898 [ 1630.063226][T14021] mapped:1094 shmem:10475 pagetables:127946 bounce:0 [ 1630.063226][T14021] free:25585 free_pcp:62 free_cma:0 [ 1630.075882][T14021] Node 0 active_anon:5256768kB inactive_anon:20792kB active_file:36kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:4376kB dirty:0kB writeback:0kB shmem:41900kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 3158016kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no [ 1630.084961][T14021] DMA free:15360kB min:132kB low:164kB high:196kB reserved_highatomic:0KB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15960kB managed:15360kB mlocked:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB [ 1630.094169][T14021] lowmem_reserve[]: 0 2717 7644 7644 [ 1630.096174][T14021] DMA32 free:43680kB min:23972kB low:29964kB high:35956kB reserved_highatomic:0KB active_anon:2733524kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:3129152kB managed:2782468kB mlocked:0kB kernel_stack:0kB pagetables:4764kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB [ 1630.106132][T14021] lowmem_reserve[]: 0 0 4927 4927 [ 1630.108104][T14021] Normal free:43300kB min:43476kB low:54344kB high:65212kB reserved_highatomic:4096KB active_anon:2523244kB inactive_anon:20792kB active_file:36kB inactive_file:0kB unevictable:0kB writepending:0kB present:5242880kB managed:5045744kB mlocked:0kB kernel_stack:313872kB pagetables:507020kB bounce:0kB free_pcp:248kB local_pcp:248kB free_cma:0kB [ 1630.118473][T14021] lowmem_reserve[]: 0 0 0 0 [ 1630.120348][T14021] DMA: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 1*1024kB (U) 1*2048kB (M) 3*4096kB (M) = 15360kB [ 1630.124316][T14021] DMA32: 0*4kB 0*8kB 2*16kB (UM) 2*32kB (UM) 1*64kB (U) 0*128kB 2*256kB (UM) 2*512kB (UM) 1*1024kB (M) 2*2048kB (UM) 9*4096kB (M) = 43680kB [ 1630.128773][T14021] Normal: 1231*4kB (UE) 991*8kB (UE) 343*16kB (UME) 230*32kB (UE) 117*64kB (UME) 65*128kB (UME) 7*256kB (UM) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 43300kB [ 1630.134207][T14021] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB [ 1630.137435][T14021] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB [ 1630.140770][T14021] 10478 total pagecache pages [ 1630.142789][T14021] 0 pages in swap cache [ 1630.144572][T14021] Swap cache stats: add 0, delete 0, find 0/0 [ 1630.146921][T14021] Free swap = 0kB [ 1630.148627][T14021] Total swap = 0kB [ 1630.150298][T14021] 2096998 pages RAM [ 1630.152078][T14021] 0 pages HighMem/MovableOnly [ 1630.154073][T14021] 136105 
pages reserved [ 1630.156165][T14021] 0 pages cma reserved [ 1630.158028][T14021] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),global_oom,task_memcg=/,task=normal-priority,pid=23173,uid=0 [ 1630.161867][T14021] Out of memory: Killed process 23173 (normal-priority) total-vm:4228kB, anon-rss:88kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:52kB oom_score_adj:1000 [ 1630.171681][ T18] oom_reaper: reaped process 23173 (normal-priority), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB (...snipped...) [ 1800.548694][ C0] idle-priority R running task 12416 14021 1183 0x80004080 [ 1800.550942][ C0] Call Trace: [ 1800.552116][ C0] __schedule+0x28c/0x860 [ 1800.553531][ C0] ? _raw_spin_unlock_irqrestore+0x2c/0x50 [ 1800.555293][ C0] schedule+0x5f/0xd0 [ 1800.556618][ C0] schedule_timeout+0x1ab/0x340 [ 1800.558151][ C0] ? __next_timer_interrupt+0xd0/0xd0 [ 1800.560069][ C0] schedule_timeout_killable+0x25/0x30 [ 1800.561836][ C0] out_of_memory+0x113/0x3d0 [ 1800.563519][ C0] ? out_of_memory+0x1ea/0x3d0 [ 1800.565040][ C0] __alloc_pages_slowpath+0x934/0xb89 [ 1800.566727][ C0] __alloc_pages_nodemask+0x333/0x370 [ 1800.568397][ C0] handle_pte_fault+0x4ce/0xb40 [ 1800.569944][ C0] __handle_mm_fault+0x2b2/0x5e0 [ 1800.571527][ C0] handle_mm_fault+0x177/0x360 [ 1800.573046][ C0] ? handle_mm_fault+0x46/0x360 [ 1800.574592][ C0] do_page_fault+0x2d5/0x680 [ 1800.576306][ C0] page_fault+0x34/0x40 [ 1800.578413][ C0] RIP: 0010:__clear_user+0x36/0x70
On Wed 11-03-20 12:45:40, David Rientjes wrote: > On Wed, 11 Mar 2020, Michal Hocko wrote: > > > > > > When a process is oom killed as a result of memcg limits and the victim > > > > > is waiting to exit, nothing ends up actually yielding the processor back > > > > > to the victim on UP systems with preemption disabled. Instead, the > > > > > charging process simply loops in memcg reclaim and eventually soft > > > > > lockups. > > > > > > > > > > Memory cgroup out of memory: Killed process 808 (repro) total-vm:41944kB, anon-rss:35344kB, file-rss:504kB, shmem-rss:0kB, UID:0 pgtables:108kB oom_score_adj:0 > > > > > watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [repro:806] > > > > > CPU: 0 PID: 806 Comm: repro Not tainted 5.6.0-rc5+ #136 > > > > > RIP: 0010:shrink_lruvec+0x4e9/0xa40 > > > > > ... > > > > > Call Trace: > > > > > shrink_node+0x40d/0x7d0 > > > > > do_try_to_free_pages+0x13f/0x470 > > > > > try_to_free_mem_cgroup_pages+0x16d/0x230 > > > > > try_charge+0x247/0xac0 > > > > > mem_cgroup_try_charge+0x10a/0x220 > > > > > mem_cgroup_try_charge_delay+0x1e/0x40 > > > > > handle_mm_fault+0xdf2/0x15f0 > > > > > do_user_addr_fault+0x21f/0x420 > > > > > page_fault+0x2f/0x40 > > > > > > > > > > Make sure that something ends up actually yielding the processor back to > > > > > the victim to allow for memory freeing. Most appropriate place appears to > > > > > be shrink_node_memcgs() where the iteration of all decendant memcgs could > > > > > be particularly lengthy. > > > > > > > > There is a cond_resched in shrink_lruvec and another one in > > > > shrink_page_list. Why doesn't any of them hit? Is it because there are > > > > no pages on the LRU list? Because rss data suggests there should be > > > > enough pages to go that path. Or maybe it is shrink_slab path that takes > > > > too long? > > > > > > > > > > I think it can be a number of cases, most notably mem_cgroup_protected() > > > checks which is why the cond_resched() is added above it. Rather than add > > > cond_resched() only for MEMCG_PROT_MIN and for certain MEMCG_PROT_LOW, the > > > cond_resched() is added above the switch clause because the iteration > > > itself may be potentially very lengthy. > > > > Was any of the above the case for your soft lockup case? How have you > > managed to trigger it? As I've said I am not against the patch but I > > would really like to see an actual explanation what happened rather than > > speculations of what might have happened. If for nothing else then for > > the future reference. > > > > Yes, this is how it was triggered in my own testing. > > > If this is really about all the hierarchy being MEMCG_PROT_MIN protected > > and that results in a very expensive and pointless reclaim walk that can > > trigger soft lockup then it should be explicitly mentioned in the > > changelog. > > I think the changelog clearly states that we need to guarantee that a > reclaimer will yield the processor back to allow a victim to exit. This > is where we make the guarantee. If it helps for the specific reason it > triggered in my testing, we could add: > > "For example, mem_cgroup_protected() can prohibit reclaim and thus any > yielding in page reclaim would not address the issue." I would suggest something like the following: " The reclaim path (including the OOM) relies on explicit scheduling points to hand over execution to tasks which could help with the reclaim process. Currently it is mostly shrink_page_list which yields CPU for each reclaimed page. 
This might be insuficient though in some configurations. E.g. when a memcg OOM path is triggered in a hierarchy which doesn't have any reclaimable memory because of memory reclaim protection (MEMCG_PROT_MIN) then there is possible to trigger a soft lockup during an out of memory situation on non preemptible kernels <PUT YOUR SOFT LOCKUP SPLAT HERE> Fix this by adding a cond_resched up in the reclaim path and make sure there is a yield point regardless of reclaimability of the target hierarchy. "
On Thu, 12 Mar 2020, Tetsuo Handa wrote: > > On Thu, 12 Mar 2020, Tetsuo Handa wrote: > > > > If you have an alternate patch to try, we can test it. But since this > > > > cond_resched() is needed anyway, I'm not sure it will change the result. > > > > > > schedule_timeout_killable(1) is an alternate patch to try; I don't think > > > that this cond_resched() is needed anyway. > > > > > > > You are suggesting schedule_timeout_killable(1) in shrink_node_memcgs()? > > > > Andrew Morton also mentioned whether cond_resched() in shrink_node_memcgs() > is enough. But like you mentioned, > It passes our testing because this is where the allocator is looping while the victim is trying to exit if only it could be scheduled. > you can try re-adding sleep outside of oom_lock: > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > index d09776cd6e10..3aee7e0eca4e 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c > @@ -1576,6 +1576,7 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask, > */ > ret = should_force_charge() || out_of_memory(&oc); > mutex_unlock(&oom_lock); > + schedule_timeout_killable(1); > return ret; > } > If current was process chosen for oom kill, this would actually induce the problem, not fix it. > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > index 3c4eb750a199..e80158049651 100644 > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -3797,7 +3797,6 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order, > */ > if (!mutex_trylock(&oom_lock)) { > *did_some_progress = 1; > - schedule_timeout_uninterruptible(1); > return NULL; > } > > @@ -4590,6 +4589,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order, > > /* Retry as long as the OOM killer is making progress */ > if (did_some_progress) { > + schedule_timeout_uninterruptible(1); > no_progress_loops = 0; > goto retry; > } > > By the way, will you share the reproducer (and how to use the reproducer) ? > On an UP kernel with swap disabled, you limit a memcg to 100MB and start three processes that each fault 40MB attached to it. Same reproducer as the "mm, oom: make a last minute check to prevent unnecessary memcg oom kills" patch except in that case there are two cores.
On Thu, 12 Mar 2020, Michal Hocko wrote: > > I think the changelog clearly states that we need to guarantee that a > > reclaimer will yield the processor back to allow a victim to exit. This > > is where we make the guarantee. If it helps for the specific reason it > > triggered in my testing, we could add: > > > > "For example, mem_cgroup_protected() can prohibit reclaim and thus any > > yielding in page reclaim would not address the issue." > > I would suggest something like the following: > " > The reclaim path (including the OOM) relies on explicit scheduling > points to hand over execution to tasks which could help with the reclaim > process. Are there other examples where yielding in the reclaim path would "help with the reclaim process" other than oom victims? This sentence seems vague. > Currently it is mostly shrink_page_list which yields CPU for > each reclaimed page. This might be insuficient though in some > configurations. E.g. when a memcg OOM path is triggered in a hierarchy > which doesn't have any reclaimable memory because of memory reclaim > protection (MEMCG_PROT_MIN) then there is possible to trigger a soft > lockup during an out of memory situation on non preemptible kernels > <PUT YOUR SOFT LOCKUP SPLAT HERE> > > Fix this by adding a cond_resched up in the reclaim path and make sure > there is a yield point regardless of reclaimability of the target > hierarchy. > " >
On Thu 12-03-20 11:20:33, David Rientjes wrote: > On Thu, 12 Mar 2020, Michal Hocko wrote: > > > > I think the changelog clearly states that we need to guarantee that a > > > reclaimer will yield the processor back to allow a victim to exit. This > > > is where we make the guarantee. If it helps for the specific reason it > > > triggered in my testing, we could add: > > > > > > "For example, mem_cgroup_protected() can prohibit reclaim and thus any > > > yielding in page reclaim would not address the issue." > > > > I would suggest something like the following: > > " > > The reclaim path (including the OOM) relies on explicit scheduling > > points to hand over execution to tasks which could help with the reclaim > > process. > > Are there other examples where yielding in the reclaim path would "help > with the reclaim process" other than oom victims? This sentence seems > vague. In the context of UP and !PREEMPT this also includes IO flushers, filesystems rely on workers and there are things I am very likely not aware of. If you think this is vaague then feel free to reformulate. All I really do care about is what the next paragraph is explaining. > > Currently it is mostly shrink_page_list which yields CPU for > > each reclaimed page. This might be insuficient though in some > > configurations. E.g. when a memcg OOM path is triggered in a hierarchy > > which doesn't have any reclaimable memory because of memory reclaim > > protection (MEMCG_PROT_MIN) then there is possible to trigger a soft > > lockup during an out of memory situation on non preemptible kernels > > <PUT YOUR SOFT LOCKUP SPLAT HERE> > > > > Fix this by adding a cond_resched up in the reclaim path and make sure > > there is a yield point regardless of reclaimability of the target > > hierarchy. > > " > >
On Thu, 12 Mar 2020 11:07:15 -0700 (PDT) David Rientjes <rientjes@google.com> wrote: > On Thu, 12 Mar 2020, Tetsuo Handa wrote: > > > > On Thu, 12 Mar 2020, Tetsuo Handa wrote: > > > > > If you have an alternate patch to try, we can test it. But since this > > > > > cond_resched() is needed anyway, I'm not sure it will change the result. > > > > > > > > schedule_timeout_killable(1) is an alternate patch to try; I don't think > > > > that this cond_resched() is needed anyway. > > > > > > > > > > You are suggesting schedule_timeout_killable(1) in shrink_node_memcgs()? > > > > > > > Andrew Morton also mentioned whether cond_resched() in shrink_node_memcgs() > > is enough. But like you mentioned, > > > > It passes our testing because this is where the allocator is looping while > the victim is trying to exit if only it could be scheduled. What happens if the allocator has SCHED_FIFO?
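
One way to exercise the SCHED_FIFO case directly is to give the allocating task
realtime priority before it starts faulting memory. A hypothetical helper for
the reproducer (the priority value is arbitrary, and CAP_SYS_NICE is required)
might look like:

#include <sched.h>
#include <stdio.h>

/* Hypothetical: switch the calling process to SCHED_FIFO so the charging/
 * reclaim loop runs at realtime priority while the OOM victim does not. */
static void make_fifo(void)
{
	struct sched_param sp = { .sched_priority = 1 };

	if (sched_setscheduler(0, SCHED_FIFO, &sp) != 0)
		perror("sched_setscheduler");
}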
David Rientjes wrote: > > By the way, will you share the reproducer (and how to use the reproducer) ? > > > > On an UP kernel with swap disabled, you limit a memcg to 100MB and start > three processes that each fault 40MB attached to it. Same reproducer as > the "mm, oom: make a last minute check to prevent unnecessary memcg oom > kills" patch except in that case there are two cores. > I'm not a heavy memcg user. Please provide steps for reproducing your problem in a "copy and pastable" way (e.g. bash script, C program). > > @@ -1576,6 +1576,7 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask, > > */ > > ret = should_force_charge() || out_of_memory(&oc); > > mutex_unlock(&oom_lock); > > + schedule_timeout_killable(1); > > return ret; > > } > > > > If current was process chosen for oom kill, this would actually induce the > problem, not fix it. > Why? Memcg OOM path allows using forced charge path if should_force_charge() == true. Since your lockup report Call Trace: shrink_node+0x40d/0x7d0 do_try_to_free_pages+0x13f/0x470 try_to_free_mem_cgroup_pages+0x16d/0x230 try_charge+0x247/0xac0 mem_cgroup_try_charge+0x10a/0x220 mem_cgroup_try_charge_delay+0x1e/0x40 handle_mm_fault+0xdf2/0x15f0 do_user_addr_fault+0x21f/0x420 page_fault+0x2f/0x40 says that allocating thread was calling try_to_free_mem_cgroup_pages() from try_charge(), allocating thread must be able to reach mem_cgroup_out_of_memory() from mem_cgroup_oom() from try_charge(). And actually Memory cgroup out of memory: Killed process 808 (repro) total-vm:41944kB, anon-rss:35344kB, file-rss:504kB, shmem-rss:0kB, UID:0 pgtables:108kB oom_score_adj:0 says that allocating thread did reach mem_cgroup_out_of_memory(). Then, allocating thread must be able to sleep at mem_cgroup_out_of_memory() if schedule_timeout_killable(1) is mem_cgroup_out_of_memory(). Also, if current process was chosen for OOM-kill, current process will be able to leave try_charge() due to should_force_charge() == true, won't it? Thus, how can "this would actually induce the problem, not fix it." happen? If your problem is that something keeps allocating threads away from reaching should_force_charge() check, please explain the mechanism. If that is explained, I would agree that schedule_timeout_killable(1) in mem_cgroup_out_of_memory() won't help.
On Fri, 13 Mar 2020, Tetsuo Handa wrote: > > On an UP kernel with swap disabled, you limit a memcg to 100MB and start > > three processes that each fault 40MB attached to it. Same reproducer as > > the "mm, oom: make a last minute check to prevent unnecessary memcg oom > > kills" patch except in that case there are two cores. > > > > I'm not a heavy memcg user. Please provide steps for reproducing your problem > in a "copy and pastable" way (e.g. bash script, C program). > Use Documentation/admin-guide/cgroup-v1/memory.rst or Documentation/admin-guide/cgroup-v2.rst to setup a memcg depending on which cgroup version you use, limit it to 100MB, and attach your process to it. Run three programs that fault 40MB. To do that, you need to use mmap: (void)mmap(NULL, 40 << 20, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_POPULATE, 0, 0); Have it stall after populating the memory: for (;;); > > > @@ -1576,6 +1576,7 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask, > > > */ > > > ret = should_force_charge() || out_of_memory(&oc); > > > mutex_unlock(&oom_lock); > > > + schedule_timeout_killable(1); > > > return ret; > > > } > > > > > > > If current was process chosen for oom kill, this would actually induce the > > problem, not fix it. > > > > Why? Memcg OOM path allows using forced charge path if should_force_charge() == true. > > Since your lockup report > > Call Trace: > shrink_node+0x40d/0x7d0 > do_try_to_free_pages+0x13f/0x470 > try_to_free_mem_cgroup_pages+0x16d/0x230 > try_charge+0x247/0xac0 > mem_cgroup_try_charge+0x10a/0x220 > mem_cgroup_try_charge_delay+0x1e/0x40 > handle_mm_fault+0xdf2/0x15f0 > do_user_addr_fault+0x21f/0x420 > page_fault+0x2f/0x40 > > says that allocating thread was calling try_to_free_mem_cgroup_pages() from try_charge(), > allocating thread must be able to reach mem_cgroup_out_of_memory() from mem_cgroup_oom() > from try_charge(). And actually > > Memory cgroup out of memory: Killed process 808 (repro) total-vm:41944kB, anon-rss:35344kB, file-rss:504kB, shmem-rss:0kB, UID:0 pgtables:108kB oom_score_adj:0 > > says that allocating thread did reach mem_cgroup_out_of_memory(). Then, allocating thread > must be able to sleep at mem_cgroup_out_of_memory() if schedule_timeout_killable(1) is > mem_cgroup_out_of_memory(). > > Also, if current process was chosen for OOM-kill, current process will be able to leave > try_charge() due to should_force_charge() == true, won't it? > > Thus, how can "this would actually induce the problem, not fix it." happen? The entire issue is that the victim never gets a chance to run because the allocator doesn't give it a chance to run on an UP system. Your patch is broken because if the victim is current, you've lost your golden opportunity to actually exit and ceded control to the allocator that will now starve the victim.
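
For a copy-and-pasteable version of what is being described, a reproducer sketch
might look like the following. The 100MB limit, the 40MB MAP_POPULATE mapping,
the busy loop and "three processes on a UP kernel with swap disabled" come from
the thread; the cgroup-v1 paths in the comment and the file name are assumptions,
so adjust them to your setup:

/*
 * repro.c - fault ~40MB of anonymous memory inside a memcg, then spin.
 *
 * Suggested setup (cgroup v1, as root; paths are assumptions):
 *   swapoff -a
 *   mkdir /sys/fs/cgroup/memory/repro
 *   echo 100M > /sys/fs/cgroup/memory/repro/memory.limit_in_bytes
 *   for i in 1 2 3; do
 *     (echo $BASHPID > /sys/fs/cgroup/memory/repro/cgroup.procs; exec ./repro) &
 *   done
 */
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>

int main(void)
{
	/* MAP_POPULATE faults the whole 40MB range in at mmap() time */
	void *p = mmap(NULL, 40 << 20, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);

	if (p == MAP_FAILED) {
		perror("mmap");
		return EXIT_FAILURE;
	}

	/* Stall after populating the memory, as in the thread */
	for (;;)
		;
}

(The thread passes 0 for the fd argument; -1 is the portable form for
MAP_ANONYMOUS and behaves the same on Linux.)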
On 2020/03/14 7:01, David Rientjes wrote:
> The entire issue is that the victim never gets a chance to run because the
> allocator doesn't give it a chance to run on an UP system. Your patch is
> broken because if the victim is current, you've lost your golden
> opportunity to actually exit and ceded control to the allocator that will
> now starve the victim.
>

I still cannot understand. There is no need to give CPU time to OOM victims.
We just need to give CPU time to the OOM reaper kernel thread until the OOM
reaper kernel thread sets MMF_OOM_SKIP on OOM victims. If current thread is
an OOM victim, schedule_timeout_killable(1) will give other threads (including
the OOM reaper kernel thread) CPU time to run. That is similar to your
cond_resched() patch (except that cond_resched() might fail to give other
threads CPU time to run if current thread has realtime priority), isn't it?
So, please explain the mechanism by which cond_resched() works but
schedule_timeout_killable(1) cannot.
On 2020/03/14 8:15, Tetsuo Handa wrote:
> If current thread is
> an OOM victim, schedule_timeout_killable(1) will give other threads (including
> the OOM reaper kernel thread) CPU time to run.

If current thread is an OOM victim, schedule_timeout_killable(1) will give other
threads (including the OOM reaper kernel thread) CPU time to run, by leaving
try_charge() path due to should_force_charge() == true and reaching do_exit() path
instead of returning to userspace code doing "for (;;);".

Unless the problem is that current thread cannot reach should_force_charge() check,
schedule_timeout_killable(1) should work.
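
For reference, the should_force_charge() check being relied on here reads roughly
as follows in mm/memcontrol.c of that era (paraphrased): a task that is an OOM
victim or is otherwise dying skips further reclaim, is allowed to overcharge, and
can proceed towards do_exit().

static bool should_force_charge(void)
{
	return tsk_is_oom_victim(current) || fatal_signal_pending(current) ||
		(current->flags & PF_EXITING);
}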
On Thu 12-03-20 15:32:38, Andrew Morton wrote: > On Thu, 12 Mar 2020 11:07:15 -0700 (PDT) David Rientjes <rientjes@google.com> wrote: > > > On Thu, 12 Mar 2020, Tetsuo Handa wrote: > > > > > > On Thu, 12 Mar 2020, Tetsuo Handa wrote: > > > > > > If you have an alternate patch to try, we can test it. But since this > > > > > > cond_resched() is needed anyway, I'm not sure it will change the result. > > > > > > > > > > schedule_timeout_killable(1) is an alternate patch to try; I don't think > > > > > that this cond_resched() is needed anyway. > > > > > > > > > > > > > You are suggesting schedule_timeout_killable(1) in shrink_node_memcgs()? > > > > > > > > > > Andrew Morton also mentioned whether cond_resched() in shrink_node_memcgs() > > > is enough. But like you mentioned, > > > > > > > It passes our testing because this is where the allocator is looping while > > the victim is trying to exit if only it could be scheduled. > > What happens if the allocator has SCHED_FIFO? The same thing as a SCHED_FIFO running in a tight loop in the userspace. As long as a high priority context depends on a resource held by a low priority task then we have a priority inversion problem and the page allocator is no real exception here. But I do not see the allocator is much different from any other code in the kernel. We do not add random sleeps here and there to push a high priority FIFO or RT tasks out of the execution context. We do cond_resched to help !PREEMPT kernels but priority related issues are really out of scope of that facility.
On Thu 12-03-20 21:16:27, Michal Hocko wrote: > On Thu 12-03-20 11:20:33, David Rientjes wrote: > > On Thu, 12 Mar 2020, Michal Hocko wrote: > > > > > > I think the changelog clearly states that we need to guarantee that a > > > > reclaimer will yield the processor back to allow a victim to exit. This > > > > is where we make the guarantee. If it helps for the specific reason it > > > > triggered in my testing, we could add: > > > > > > > > "For example, mem_cgroup_protected() can prohibit reclaim and thus any > > > > yielding in page reclaim would not address the issue." > > > > > > I would suggest something like the following: > > > " > > > The reclaim path (including the OOM) relies on explicit scheduling > > > points to hand over execution to tasks which could help with the reclaim > > > process. > > > > Are there other examples where yielding in the reclaim path would "help > > with the reclaim process" other than oom victims? This sentence seems > > vague. > > In the context of UP and !PREEMPT this also includes IO flushers, > filesystems rely on workers and there are things I am very likely not > aware of. If you think this is vaague then feel free to reformulate. > All I really do care about is what the next paragraph is explaining. Btw. do you plan to send a patch with an updated changelog? > > > Currently it is mostly shrink_page_list which yields CPU for > > > each reclaimed page. This might be insuficient though in some > > > configurations. E.g. when a memcg OOM path is triggered in a hierarchy > > > which doesn't have any reclaimable memory because of memory reclaim > > > protection (MEMCG_PROT_MIN) then there is possible to trigger a soft > > > lockup during an out of memory situation on non preemptible kernels > > > <PUT YOUR SOFT LOCKUP SPLAT HERE> > > > > > > Fix this by adding a cond_resched up in the reclaim path and make sure > > > there is a yield point regardless of reclaimability of the target > > > hierarchy. > > > " > > > > > -- > Michal Hocko > SUSE Labs
On 2020/03/16 18:31, Michal Hocko wrote: >> What happens if the allocator has SCHED_FIFO? > > The same thing as a SCHED_FIFO running in a tight loop in the userspace. > > As long as a high priority context depends on a resource held by a low > priority task then we have a priority inversion problem and the page > allocator is no real exception here. But I do not see the allocator > is much different from any other code in the kernel. We do not add > random sleeps here and there to push a high priority FIFO or RT tasks > out of the execution context. We do cond_resched to help !PREEMPT > kernels but priority related issues are really out of scope of that > facility. > Spinning with realtime priority in userspace is a userspace's bug. Spinning with realtime priority in kernelspace until watchdog fires is a kernel's bug. We are not responsible for userspace's bug, and I'm asking whether the memory allocator kernel code can give enough CPU time to other threads even if current thread has realtime priority.
On Mon 16-03-20 19:04:44, Tetsuo Handa wrote: > On 2020/03/16 18:31, Michal Hocko wrote: > >> What happens if the allocator has SCHED_FIFO? > > > > The same thing as a SCHED_FIFO running in a tight loop in the userspace. > > > > As long as a high priority context depends on a resource held by a low > > priority task then we have a priority inversion problem and the page > > allocator is no real exception here. But I do not see the allocator > > is much different from any other code in the kernel. We do not add > > random sleeps here and there to push a high priority FIFO or RT tasks > > out of the execution context. We do cond_resched to help !PREEMPT > > kernels but priority related issues are really out of scope of that > > facility. > > > > Spinning with realtime priority in userspace is a userspace's bug. > Spinning with realtime priority in kernelspace until watchdog fires is > a kernel's bug. We are not responsible for userspace's bug, and I'm > asking whether the memory allocator kernel code can give enough CPU > time to other threads even if current thread has realtime priority. We've been through that discussion many times and the core point is that this requires a large surgery to work properly. It is not just to add a sleep into the page allocator and be done with that. Page allocator cannot really do much on its own. It relies on many other contexts to make a forward progress. What you really demand is far from trivial. Maybe you are looking something much closer to the RT kernel than what other preemption modes can offer currently. Right now, you really have to be careful when running FIFO/RT processes and plan their resources very carefully. Is that ideal? Not really but considering that this is the status quo for many years it seems that the usecases tend to find their way around that restriction.
On Sat, 14 Mar 2020, Tetsuo Handa wrote:
> > If current thread is
> > an OOM victim, schedule_timeout_killable(1) will give other threads (including
> > the OOM reaper kernel thread) CPU time to run.
>
> If current thread is an OOM victim, schedule_timeout_killable(1) will give other
> threads (including the OOM reaper kernel thread) CPU time to run, by leaving
> try_charge() path due to should_force_charge() == true and reaching do_exit() path
> instead of returning to userspace code doing "for (;;);".
>
> Unless the problem is that current thread cannot reach should_force_charge() check,
> schedule_timeout_killable(1) should work.
>

No need to yield if current is the oom victim, allowing the oom reaper to
run when it may not actually be able to free memory is not required. It
increases the likelihood that some other process schedules and is unable
to yield back due to the memcg oom condition such that the victim doesn't
get a chance to run again.

This happens because the victim is allowed to overcharge but other
processes attached to an oom memcg hierarchy simply fail the charge. We
are then reliant on all memory chargers in the kernel to yield if their
charges fail due to oom. It's the only way to allow the victim to
eventually run.

So the only change that I would make to your patch is to do this in
mem_cgroup_out_of_memory() instead:

	if (!fatal_signal_pending(current))
		schedule_timeout_killable(1);

So we don't have this reliance on all other memory chargers to yield when
their charge fails and there is no delay for victims themselves.

[ I'll still propose my change that adds cond_resched() to
  shrink_node_memcgs() because we can see need_resched set for a
  prolonged period of time without scheduling. ]

If you agree, I'll propose your patch with a changelog that indicates it
can fix the soft lockup issue for UP and can likely get a tested-by for
it.
David Rientjes wrote:
> On Sat, 14 Mar 2020, Tetsuo Handa wrote:
> > If current thread is an OOM victim, schedule_timeout_killable(1) will give other
> > threads (including the OOM reaper kernel thread) CPU time to run, by leaving
> > try_charge() path due to should_force_charge() == true and reaching do_exit() path
> > instead of returning to userspace code doing "for (;;);".
> > 
> > Unless the problem is that current thread cannot reach should_force_charge() check,
> > schedule_timeout_killable(1) should work.
> 
> No need to yield if current is the oom victim, allowing the oom reaper to
> run when it may not actually be able to free memory is not required. It
> increases the likelihood that some other process schedules and is unable
> to yield back due to the memcg oom condition such that the victim doesn't
> get a chance to run again.
> 
> This happens because the victim is allowed to overcharge but other
> processes attached to an oom memcg hierarchy simply fail the charge. We
> are then reliant on all memory chargers in the kernel to yield if their
> charges fail due to oom. It's the only way to allow the victim to
> eventually run.
> 
> So the only change that I would make to your patch is to do this in
> mem_cgroup_out_of_memory() instead:
> 
> 	if (!fatal_signal_pending(current))
> 		schedule_timeout_killable(1);
> 
> So we don't have this reliance on all other memory chargers to yield when
> their charge fails and there is no delay for victims themselves.

I see. You want below functions for environments where current thread can
fail to resume execution for long if current thread once reschedules (e.g.
UP kernel, many threads contended on one CPU).

/*
 * Give other threads CPU time, unless current thread was already killed.
 * Used when we prefer killed threads to continue execution (in a hope that
 * killed threads terminate quickly) over giving other threads CPU time.
 */
signed long __sched schedule_timeout_killable_expedited(signed long timeout)
{
	if (unlikely(fatal_signal_pending(current)))
		return timeout;
	return schedule_timeout_killable(timeout);
}

/*
 * Latency reduction via explicit rescheduling in places that are safe,
 * but becomes no-op if current thread was already killed. Used when we
 * prefer killed threads to continue execution (in a hope that killed
 * threads terminate quickly) over giving other threads CPU time.
 */
int cond_resched_expedited(void)
{
	if (unlikely(fatal_signal_pending(current)))
		return 0;
	return cond_resched();
}

> 
> [ I'll still propose my change that adds cond_resched() to
>   shrink_node_memcgs() because we can see need_resched set for a
>   prolonged period of time without scheduling. ]

As long as there is schedule_timeout_killable(), I'm fine with adding
cond_resched() in other places.

> 
> If you agree, I'll propose your patch with a changelog that indicates it
> can fix the soft lockup issue for UP and can likely get a tested-by for
> it.
> 

Please go ahead.
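[ For illustration only: a hedged sketch of how such helpers would be used
  at a back-off point in a charge/reclaim retry path. The call site below
  is hypothetical and not taken from this thread; the intent is simply
  that an already-killed task (the OOM victim) never ends up being the one
  that yields. ]

	/*
	 * Hypothetical back-off point after a failed charge/reclaim pass:
	 * the plain primitive makes even an OOM victim sleep for a jiffy...
	 */
	schedule_timeout_killable(1);

	/*
	 * ...whereas the proposed expedited variant returns immediately for
	 * a killed task, so the victim keeps the CPU and can proceed to
	 * exit and free its memory.
	 */
	schedule_timeout_killable_expedited(1);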
On Tue, 17 Mar 2020, Tetsuo Handa wrote:
> > 	if (!fatal_signal_pending(current))
> > 		schedule_timeout_killable(1);
> > 
> > So we don't have this reliance on all other memory chargers to yield when
> > their charge fails and there is no delay for victims themselves.
> 
> I see. You want below functions for environments where current thread can
> fail to resume execution for long if current thread once reschedules (e.g.
> UP kernel, many threads contended on one CPU).
> 
> /*
>  * Give other threads CPU time, unless current thread was already killed.
>  * Used when we prefer killed threads to continue execution (in a hope that
>  * killed threads terminate quickly) over giving other threads CPU time.
>  */
> signed long __sched schedule_timeout_killable_expedited(signed long timeout)
> {
> 	if (unlikely(fatal_signal_pending(current)))
> 		return timeout;
> 	return schedule_timeout_killable(timeout);
> }
> 

I simply want the

	if (!fatal_signal_pending(current))
		schedule_timeout_killable(1);

after dropping oom_lock because I don't know that a generic function would
be useful outside of this single place. If it becomes a regular pattern,
for whatever reason, I think we can consider a new schedule_timeout
variant.

> /*
>  * Latency reduction via explicit rescheduling in places that are safe,
>  * but becomes no-op if current thread was already killed. Used when we
>  * prefer killed threads to continue execution (in a hope that killed
>  * threads terminate quickly) over giving other threads CPU time.
>  */
> int cond_resched_expedited(void)
> {
> 	if (unlikely(fatal_signal_pending(current)))
> 		return 0;
> 	return cond_resched();
> }
> 
> > [ I'll still propose my change that adds cond_resched() to
> >   shrink_node_memcgs() because we can see need_resched set for a
> >   prolonged period of time without scheduling. ]
> 
> As long as there is schedule_timeout_killable(), I'm fine with adding
> cond_resched() in other places.
> 

Sounds good, thanks Tetsuo.
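[ For reference: a minimal sketch of the placement being settled on above,
  i.e. delaying in mem_cgroup_out_of_memory() after oom_lock is dropped.
  The surrounding context lines are an assumption based on the v5.6-era
  layout of mm/memcontrol.c, not the patch that was actually posted. ]

--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	ret = should_force_charge() || out_of_memory(&oc);
 	mutex_unlock(&oom_lock);
+
+	/*
+	 * Give the OOM victim time to run and exit: sleep for one jiffy
+	 * unless current itself has a fatal signal pending (i.e. is the
+	 * victim), in which case it must not be delayed.
+	 */
+	if (!fatal_signal_pending(current))
+		schedule_timeout_killable(1);
+
 	return ret;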
diff --git a/mm/vmscan.c b/mm/vmscan.c
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2637,6 +2637,8 @@ static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
 		unsigned long reclaimed;
 		unsigned long scanned;
 
+		cond_resched();
+
 		switch (mem_cgroup_protected(target_memcg, memcg)) {
 		case MEMCG_PROT_MIN:
 			/*
When a process is oom killed as a result of memcg limits and the victim
is waiting to exit, nothing ends up actually yielding the processor back
to the victim on UP systems with preemption disabled. Instead, the
charging process simply loops in memcg reclaim and eventually soft
lockups.

Memory cgroup out of memory: Killed process 808 (repro) total-vm:41944kB, anon-rss:35344kB, file-rss:504kB, shmem-rss:0kB, UID:0 pgtables:108kB oom_score_adj:0
watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [repro:806]
CPU: 0 PID: 806 Comm: repro Not tainted 5.6.0-rc5+ #136
RIP: 0010:shrink_lruvec+0x4e9/0xa40
...
Call Trace:
 shrink_node+0x40d/0x7d0
 do_try_to_free_pages+0x13f/0x470
 try_to_free_mem_cgroup_pages+0x16d/0x230
 try_charge+0x247/0xac0
 mem_cgroup_try_charge+0x10a/0x220
 mem_cgroup_try_charge_delay+0x1e/0x40
 handle_mm_fault+0xdf2/0x15f0
 do_user_addr_fault+0x21f/0x420
 page_fault+0x2f/0x40

Make sure that something ends up actually yielding the processor back to
the victim to allow for memory freeing. Most appropriate place appears to
be shrink_node_memcgs() where the iteration of all descendant memcgs could
be particularly lengthy.

Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: stable@vger.kernel.org
Signed-off-by: David Rientjes <rientjes@google.com>
---
 mm/vmscan.c | 2 ++
 1 file changed, 2 insertions(+)