Message ID | 20181022071323.9550-3-mhocko@kernel.org (mailing list archive)
---|---
State | New, archived
Series | oom, memcg: do not report racy no-eligible OOM
On 2018/10/22 16:13, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
>
> Tetsuo has reported [1] that a single process group memcg might easily
> swamp the log with no-eligible oom victim reports due to race between
> the memcg charge and oom_reaper
>
> Thread 1                Thread2                         oom_reaper
> try_charge              try_charge
>                           mem_cgroup_out_of_memory
>                             mutex_lock(oom_lock)
>   mem_cgroup_out_of_memory
>     mutex_lock(oom_lock)
>                             out_of_memory
>                               select_bad_process
>                               oom_kill_process(current)
>                               wake_oom_reaper
>                                                         oom_reap_task
>                                                         MMF_OOM_SKIP->victim
>                             mutex_unlock(oom_lock)
>     out_of_memory
>       select_bad_process # no task
>
> If Thread1 didn't race it would bail out from try_charge and force the
> charge. We can achieve the same by checking tsk_is_oom_victim inside
> the oom_lock and therefore close the race.
>
> [1] http://lkml.kernel.org/r/bb2074c0-34fe-8c2c-1c7d-db71338f1e7f@i-love.sakura.ne.jp
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> ---
>  mm/memcontrol.c | 14 +++++++++++++-
>  1 file changed, 13 insertions(+), 1 deletion(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index e79cb59552d9..a9dfed29967b 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1380,10 +1380,22 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
>  		.gfp_mask = gfp_mask,
>  		.order = order,
>  	};
> -	bool ret;
> +	bool ret = true;
>
>  	mutex_lock(&oom_lock);
> +
> +	/*
> +	 * multi-threaded tasks might race with oom_reaper and gain
> +	 * MMF_OOM_SKIP before reaching out_of_memory which can lead
> +	 * to out_of_memory failure if the task is the last one in
> +	 * memcg which would be a false positive failure reported
> +	 */
> +	if (tsk_is_oom_victim(current))
> +		goto unlock;
> +
>  	ret = out_of_memory(&oc);
> +
> +unlock:
>  	mutex_unlock(&oom_lock);
>  	return ret;
>  }

This is not wrong but is strange. We can use mutex_lock_killable(&oom_lock)
so that any killed threads no longer wait for oom_lock.

Also, closing this race for only the memcg OOM path is strange. The global
OOM path (for tasks which are CLONE_VM without CLONE_THREAD) still suffers
from this race (though the frequency is lower than for memcg OOM due to the
use of mutex_trylock()). Either checking before calling out_of_memory() or
checking task_will_free_mem(current) inside out_of_memory() will close this
race for both paths.

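The report in [1] came from a multi-threaded memory eater running inside a
hard-limited memcg. Below is a minimal sketch of that kind of reproducer;
it is hypothetical (the cgroup v1 mount point, group name, limit, and
thread count are assumptions, not Tetsuo's actual program), but it
exercises the same window: many threads charging concurrently after one of
them has already been selected as a victim and reaped.

/*
 * Hypothetical reproducer sketch: N threads keep charging anonymous
 * memory inside a small memcg, so that threads queued on oom_lock keep
 * invoking the memcg OOM killer after the victim was already reaped.
 * Assumes a cgroup v1 memory controller mounted at /sys/fs/cgroup/memory
 * and root privileges. Build with: cc -o eater eater.c -lpthread
 */
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

#define NR_THREADS 16

static void *eat_memory(void *unused)
{
	for (;;) {
		char *p = malloc(1 << 20);

		if (p)
			memset(p, 0, 1 << 20);	/* fault pages in to charge them */
	}
	return NULL;
}

static void write_file(const char *path, const char *val)
{
	int fd = open(path, O_WRONLY);

	if (fd < 0 || write(fd, val, strlen(val)) < 0) {
		perror(path);
		exit(1);
	}
	close(fd);
}

int main(void)
{
	pthread_t th[NR_THREADS];
	char buf[16];
	int i;

	mkdir("/sys/fs/cgroup/memory/test", 0755);
	write_file("/sys/fs/cgroup/memory/test/memory.limit_in_bytes", "64M");
	snprintf(buf, sizeof(buf), "%d", getpid());
	write_file("/sys/fs/cgroup/memory/test/tasks", buf);

	for (i = 0; i < NR_THREADS; i++)
		pthread_create(&th[i], NULL, eat_memory, NULL);
	pause();	/* the memcg OOM killer eventually terminates the process */
	return 0;
}
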
On Mon 22-10-18 20:45:17, Tetsuo Handa wrote:
> On 2018/10/22 16:13, Michal Hocko wrote:
[...]
> This is not wrong but is strange. We can use mutex_lock_killable(&oom_lock)
> so that any killed threads no longer wait for oom_lock.

tsk_is_oom_victim is stronger because it doesn't depend on
fatal_signal_pending, which might be cleared throughout the exit process.

> Also, closing this race for only the memcg OOM path is strange. The
> global OOM path (for tasks which are CLONE_VM without CLONE_THREAD)
> still suffers from this race (though the frequency is lower than for
> memcg OOM due to the use of mutex_trylock()). Either checking before
> calling out_of_memory() or checking task_will_free_mem(current) inside
> out_of_memory() will close this race for both paths.

The global case is much more complicated because we know that memcg might
bypass the charge, so we do not have to care about a potential endless
loop like in the page allocator path.

Moreover, I am not even sure the race is all that interesting in the
global case. I have never heard about a premature panic due to no killable
task. The racing oom task would have to be the last eligible process in
the system, and that is quite unlikely. We can think about a more involved
solution if we ever hear about this being a real problem.

So a simple memcg specific fix sounds like a reasonable way forward.

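For reference, the two checks being compared look roughly as follows
(simplified sketches based on the kernel sources of that era, not the
verbatim definitions): tsk_is_oom_victim() tests state that
mark_oom_victim() sets and that stays set until the victim's mm is finally
torn down, while fatal_signal_pending() tests the pending signal set,
which the exit path can clear.

/* Roughly include/linux/oom.h (~v4.19): set by mark_oom_victim() and
 * stable for the rest of the task's lifetime, because signal->oom_mm
 * pins the mm until the victim is fully torn down. */
static inline bool tsk_is_oom_victim(struct task_struct *tsk)
{
	return tsk->signal->oom_mm;
}

/* Roughly include/linux/sched/signal.h (~v4.19): true only while SIGKILL
 * is still queued; once the exiting task dequeues it, this can flip back
 * to false, which is why it is the weaker check. */
static inline int fatal_signal_pending(struct task_struct *p)
{
	return signal_pending(p) && __fatal_signal_pending(p);
}
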
On 2018/10/22 21:03, Michal Hocko wrote:
> On Mon 22-10-18 20:45:17, Tetsuo Handa wrote:
[...]
>> This is not wrong but is strange. We can use mutex_lock_killable(&oom_lock)
>> so that any killed threads no longer wait for oom_lock.
>
> tsk_is_oom_victim is stronger because it doesn't depend on
> fatal_signal_pending, which might be cleared throughout the exit process.

I mean:

 mm/memcontrol.c |   3 +-
 mm/oom_kill.c   | 111 +++++---------------------------------------------------
 2 files changed, 12 insertions(+), 102 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index e79cb59..2c1e1ac 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1382,7 +1382,8 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	};
 	bool ret;
 
-	mutex_lock(&oom_lock);
+	if (mutex_lock_killable(&oom_lock))
+		return true;
 	ret = out_of_memory(&oc);
 	mutex_unlock(&oom_lock);
 	return ret;
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index f10aa53..22fd6b3 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -755,81 +755,6 @@ bool oom_killer_disable(signed long timeout)
 	return true;
 }
 
-static inline bool __task_will_free_mem(struct task_struct *task)
-{
-	struct signal_struct *sig = task->signal;
-
-	/*
-	 * A coredumping process may sleep for an extended period in exit_mm(),
-	 * so the oom killer cannot assume that the process will promptly exit
-	 * and release memory.
-	 */
-	if (sig->flags & SIGNAL_GROUP_COREDUMP)
-		return false;
-
-	if (sig->flags & SIGNAL_GROUP_EXIT)
-		return true;
-
-	if (thread_group_empty(task) && (task->flags & PF_EXITING))
-		return true;
-
-	return false;
-}
-
-/*
- * Checks whether the given task is dying or exiting and likely to
- * release its address space. This means that all threads and processes
- * sharing the same mm have to be killed or exiting.
- * Caller has to make sure that task->mm is stable (hold task_lock or
- * it operates on the current).
- */
-static bool task_will_free_mem(struct task_struct *task)
-{
-	struct mm_struct *mm = task->mm;
-	struct task_struct *p;
-	bool ret = true;
-
-	/*
-	 * Skip tasks without mm because it might have passed its exit_mm and
-	 * exit_oom_victim. oom_reaper could have rescued that but do not rely
-	 * on that for now. We can consider find_lock_task_mm in future.
-	 */
-	if (!mm)
-		return false;
-
-	if (!__task_will_free_mem(task))
-		return false;
-
-	/*
-	 * This task has already been drained by the oom reaper so there are
-	 * only small chances it will free some more
-	 */
-	if (test_bit(MMF_OOM_SKIP, &mm->flags))
-		return false;
-
-	if (atomic_read(&mm->mm_users) <= 1)
-		return true;
-
-	/*
-	 * Make sure that all tasks which share the mm with the given tasks
-	 * are dying as well to make sure that a) nobody pins its mm and
-	 * b) the task is also reapable by the oom reaper.
-	 */
-	rcu_read_lock();
-	for_each_process(p) {
-		if (!process_shares_mm(p, mm))
-			continue;
-		if (same_thread_group(task, p))
-			continue;
-		ret = __task_will_free_mem(p);
-		if (!ret)
-			break;
-	}
-	rcu_read_unlock();
-
-	return ret;
-}
-
 static void __oom_kill_process(struct task_struct *victim)
 {
 	struct task_struct *p;
@@ -879,6 +804,8 @@ static void __oom_kill_process(struct task_struct *victim)
 	 */
 	rcu_read_lock();
 	for_each_process(p) {
+		struct task_struct *t;
+
 		if (!process_shares_mm(p, mm))
 			continue;
 		if (same_thread_group(p, victim))
@@ -898,6 +825,11 @@ static void __oom_kill_process(struct task_struct *victim)
 		if (unlikely(p->flags & PF_KTHREAD))
 			continue;
 		do_send_sig_info(SIGKILL, SEND_SIG_FORCED, p, PIDTYPE_TGID);
+		t = find_lock_task_mm(p);
+		if (!t)
+			continue;
+		mark_oom_victim(t);
+		task_unlock(t);
 	}
 	rcu_read_unlock();
 
@@ -934,21 +866,6 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
 	static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
 					      DEFAULT_RATELIMIT_BURST);
 
-	/*
-	 * If the task is already exiting, don't alarm the sysadmin or kill
-	 * its children or threads, just give it access to memory reserves
-	 * so it can die quickly
-	 */
-	task_lock(p);
-	if (task_will_free_mem(p)) {
-		mark_oom_victim(p);
-		wake_oom_reaper(p);
-		task_unlock(p);
-		put_task_struct(p);
-		return;
-	}
-	task_unlock(p);
-
 	if (__ratelimit(&oom_rs))
 		dump_header(oc, p);
 
@@ -1055,6 +972,9 @@ bool out_of_memory(struct oom_control *oc)
 	unsigned long freed = 0;
 	enum oom_constraint constraint = CONSTRAINT_NONE;
 
+	if (tsk_is_oom_victim(current) && !(oc->gfp_mask & __GFP_NOFAIL))
+		return true;
+
 	if (oom_killer_disabled)
 		return false;
 
@@ -1066,17 +986,6 @@ bool out_of_memory(struct oom_control *oc)
 	}
 
 	/*
-	 * If current has a pending SIGKILL or is exiting, then automatically
-	 * select it.  The goal is to allow it to allocate so that it may
-	 * quickly exit and free its memory.
-	 */
-	if (task_will_free_mem(current)) {
-		mark_oom_victim(current);
-		wake_oom_reaper(current);
-		return true;
-	}
-
-	/*
 	 * The OOM killer does not compensate for IO-less reclaim.
 	 * pagefault_out_of_memory lost its gfp context so we have to
 	 * make sure exclude 0 mask - all other users should have at least

On Mon 22-10-18 22:20:36, Tetsuo Handa wrote:
> On 2018/10/22 21:03, Michal Hocko wrote:
[...]
> I mean:
>
>  mm/memcontrol.c |   3 +-
>  mm/oom_kill.c   | 111 +++++---------------------------------------------------
>  2 files changed, 12 insertions(+), 102 deletions(-)

This is a much larger change than I feel comfortable with to plug this
specific issue. A simple and easy to understand fix which doesn't add
maintenance burden should be preferred in general.

The code reduction looks attractive, but considering it is based on
removing one of the heuristics that prevent OOM reports in some cases, it
should be done on its own with a careful and thorough justification, e.g.
how often the heuristic is really helpful.

In principle I do not oppose removing the shortcut once all due diligence
is done, because this particular one has given us quite a lot of headaches
in the past.

On 2018/10/22 22:43, Michal Hocko wrote:
> On Mon 22-10-18 22:20:36, Tetsuo Handa wrote:
>> I mean:
>>
>>  mm/memcontrol.c |   3 +-
>>  mm/oom_kill.c   | 111 +++++---------------------------------------------------
>>  2 files changed, 12 insertions(+), 102 deletions(-)
>
> This is a much larger change than I feel comfortable with to plug this
> specific issue. A simple and easy to understand fix which doesn't add
> maintenance burden should be preferred in general.
>
> The code reduction looks attractive, but considering it is based on
> removing one of the heuristics that prevent OOM reports in some cases,
> it should be done on its own with a careful and thorough justification,
> e.g. how often the heuristic is really helpful.

I think the heuristic is hardly helpful.

The task_will_free_mem(current) condition in out_of_memory() served two
purposes. One is the case where mark_oom_victim() has not yet been called
on the current thread group while it was already called on other thread
groups. But that situation disappears once the task_will_free_mem()
shortcuts are removed and the for_each_process(p) loop in
__oom_kill_process() is forced. The other is the case where
mark_oom_victim() has not yet been called on any thread group while all
thread groups are exiting. In that case, we will fail to wait for the
current thread group to release its mm... but it is unlikely that
out_of_memory() is only ever called by threads for which
task_will_free_mem(current) returns true (note that task_will_free_mem(p)
returns false if p->mm == NULL).

I also think it is highly unlikely to hit the task_will_free_mem(p)
condition in oom_kill_process(). To hit it, the candidate chosen as the
largest memory user has to be already exiting. But if it is already
exiting, it has likely already released its mm (and hence is no longer
the largest memory user). I can't say such a race never happens, but I
think it is unlikely. Also, since task_will_free_mem(p) returns false if
the thread group leader's mm is NULL, whereas oom_badness() called from
select_bad_process() evaluates any mm in that thread group and returns
the thread group leader, this heuristic is incomplete after all.

> In principle I do not oppose removing the shortcut once all due
> diligence is done, because this particular one has given us quite a lot
> of headaches in the past.

Michal Hocko wrote:
> On Mon 22-10-18 20:45:17, Tetsuo Handa wrote:
[...]
>> This is not wrong but is strange. We can use mutex_lock_killable(&oom_lock)
>> so that any killed threads no longer wait for oom_lock.
>
> tsk_is_oom_victim is stronger because it doesn't depend on
> fatal_signal_pending, which might be cleared throughout the exit process.

I still want to propose this. No need to be memcg OOM specific.

 mm/memcontrol.c |  3 ++-
 mm/oom_kill.c   | 10 ++++++++++
 2 files changed, 12 insertions(+), 1 deletion(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index e79cb59..2c1e1ac 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1382,7 +1382,8 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	};
 	bool ret;
 
-	mutex_lock(&oom_lock);
+	if (mutex_lock_killable(&oom_lock))
+		return true;
 	ret = out_of_memory(&oc);
 	mutex_unlock(&oom_lock);
 	return ret;
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index f10aa53..e453bad 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -1055,6 +1055,16 @@ bool out_of_memory(struct oom_control *oc)
 	unsigned long freed = 0;
 	enum oom_constraint constraint = CONSTRAINT_NONE;
 
+	/*
+	 * It is possible that multi-threaded OOM victims get
+	 * task_will_free_mem(current) == false when the OOM reaper quickly
+	 * set MMF_OOM_SKIP. But since we know that tsk_is_oom_victim() == true
+	 * tasks won't loop forever (unless it is a __GFP_NOFAIL allocation
+	 * request), we don't need to select next OOM victim.
+	 */
+	if (tsk_is_oom_victim(current) && !(oc->gfp_mask & __GFP_NOFAIL))
+		return true;
+
 	if (oom_killer_disabled)
 		return false;

On Tue 23-10-18 10:01:08, Tetsuo Handa wrote:
> Michal Hocko wrote:
[...]
> I still want to propose this. No need to be memcg OOM specific.

Well, I maintain what I've said [1] about simplicity and a specific fix
for a specific issue, especially in tricky code like this, where all the
consequences are far more subtle than they seem to be.

This is obviously a matter of taste, but I don't see much point in
discussing this back and forth forever. If there is a general agreement
that the above is less appropriate, I am willing to consider a different
change, but I simply do not have the energy to nitpick forever.

[1] http://lkml.kernel.org/r/20181022134315.GF18839@dhcp22.suse.cz

On Tue 23-10-18 13:42:46, Michal Hocko wrote:
> On Tue 23-10-18 10:01:08, Tetsuo Handa wrote:
[...]
> Well, I maintain what I've said [1] about simplicity and a specific fix
> for a specific issue, especially in tricky code like this, where all the
> consequences are far more subtle than they seem to be.
>
> [1] http://lkml.kernel.org/r/20181022134315.GF18839@dhcp22.suse.cz

In other words: having a memcg specific fix means, well, a memcg
maintenance burden, like any other memcg specific oom decisions we already
have. So are you OK with that, Johannes, or would you like to see a more
generic fix which might turn out to be more complex?

On 2018/10/23 21:10, Michal Hocko wrote:
> On Tue 23-10-18 13:42:46, Michal Hocko wrote:
[...]
> In other words: having a memcg specific fix means, well, a memcg
> maintenance burden, like any other memcg specific oom decisions we
> already have. So are you OK with that, Johannes, or would you like to
> see a more generic fix which might turn out to be more complex?

I don't know what "that Johannes" refers to. If you don't want to affect
the SysRq-OOM and pagefault-OOM cases, are you OK with having a
global-OOM-specific fix?

 mm/page_alloc.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index e2ef1c1..f59f029 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3518,6 +3518,17 @@ void warn_alloc(gfp_t gfp_mask, nodemask_t *nodemask, const char *fmt, ...)
 	if (gfp_mask & __GFP_THISNODE)
 		goto out;
 
+	/*
+	 * It is possible that multi-threaded OOM victims get
+	 * task_will_free_mem(current) == false when the OOM reaper quickly
+	 * set MMF_OOM_SKIP. But since we know that tsk_is_oom_victim() == true
+	 * tasks won't loop forever (unless it is a __GFP_NOFAIL allocation
+	 * request), we don't need to select next OOM victim.
+	 */
+	if (tsk_is_oom_victim(current) && !(gfp_mask & __GFP_NOFAIL)) {
+		*did_some_progress = 1;
+		goto out;
+	}
 	/* Exhausted what can be done so it's blame time */
 	if (out_of_memory(&oc) || WARN_ON_ONCE(gfp_mask & __GFP_NOFAIL)) {
 		*did_some_progress = 1;

On Tue 23-10-18 21:33:43, Tetsuo Handa wrote:
> On 2018/10/23 21:10, Michal Hocko wrote:
[...]
> I don't know what "that Johannes" refers to.

Let me rephrase: Johannes, are you OK with that (a memcg specific fix), or
would you like to see a more generic fix which might turn out to be more
complex?

On Mon, Oct 22, 2018 at 09:13:23AM +0200, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
>
> Tetsuo has reported [1] that a single process group memcg might easily
> swamp the log with no-eligible oom victim reports due to race between
> the memcg charge and oom_reaper
[...]
>  	mutex_lock(&oom_lock);
> +
> +	/*
> +	 * multi-threaded tasks might race with oom_reaper and gain
> +	 * MMF_OOM_SKIP before reaching out_of_memory which can lead
> +	 * to out_of_memory failure if the task is the last one in
> +	 * memcg which would be a false positive failure reported
> +	 */
> +	if (tsk_is_oom_victim(current))
> +		goto unlock;
> +
>  	ret = out_of_memory(&oc);

We already check tsk_is_oom_victim(current) in try_charge() before looping
on the OOM killer, so at most we'd have a single "no eligible tasks"
message from such a race before we force the charge - right?

While that's not perfect, I don't think it warrants complicating this code
even more. I honestly find it near-impossible to follow the code and the
possible scenarios at this point.

out_of_memory() bails on task_will_free_mem(current), which specifically
*excludes* already reaped tasks. Why are we then adding a separate check
before that to bail on already reaped victims? Do we want to bail if
current is a reaped victim or not?

I don't see how we could skip it safely in general: the current task might
have been killed and reaped and gotten access to the memory reserve and
still fail to allocate on its way out. It needs to kill the next task if
there is one, or warn if there isn't another one. Because we're genuinely
oom without reclaimable tasks.

There is of course the scenario brought forward in this thread, where
multiple threads of a process race and the second one enters oom even
though it doesn't need to anymore. What the global case does to catch this
is to grab the oom lock and do one last alloc attempt. Should memcg lock
the oom_lock and try one more time to charge the memcg?

Some simplification in this area would really be great. I'm reluctant to
ack patches like the above, even if they have some optical benefits for
the user, because the code is already too tricky for what it does.

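For context, the "one last alloc attempt" pattern Johannes refers to lives
in __alloc_pages_may_oom() in mm/page_alloc.c. A trimmed sketch of the
kernel of that era (~v4.19) follows; it is not the verbatim function (the
various bail-out checks are elided), but it shows the shape: serialize on
oom_lock, retry the allocation once at a high watermark, and only then
call out_of_memory().

/*
 * Simplified sketch of __alloc_pages_may_oom() (~v4.19), trimmed to the
 * locking and last-attempt pattern under discussion.
 */
static inline struct page *
__alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
		      const struct alloc_context *ac,
		      unsigned long *did_some_progress)
{
	struct oom_control oc = {
		.zonelist = ac->zonelist,
		.nodemask = ac->nodemask,
		.memcg = NULL,
		.gfp_mask = gfp_mask,
		.order = order,
	};
	struct page *page;

	*did_some_progress = 0;

	/* Serialize concurrent OOM invocations; back off on contention. */
	if (!mutex_trylock(&oom_lock)) {
		*did_some_progress = 1;
		schedule_timeout_uninterruptible(1);
		return NULL;
	}

	/*
	 * One last allocation attempt under oom_lock: a parallel OOM kill
	 * or the oom_reaper may have freed memory while we were waiting.
	 */
	page = get_page_from_freelist(gfp_mask | __GFP_HARDWALL, order,
				      ALLOC_WMARK_HIGH | ALLOC_CPUSET, ac);
	if (page)
		goto out;

	/* ... various bail-out checks elided ... */

	if (out_of_memory(&oc) || WARN_ON_ONCE(gfp_mask & __GFP_NOFAIL))
		*did_some_progress = 1;
out:
	mutex_unlock(&oom_lock);
	return page;
}
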
On Fri 26-10-18 10:25:31, Johannes Weiner wrote:
> On Mon, Oct 22, 2018 at 09:13:23AM +0200, Michal Hocko wrote:
[...]
> We already check tsk_is_oom_victim(current) in try_charge() before
> looping on the OOM killer, so at most we'd have a single "no eligible
> tasks" message from such a race before we force the charge - right?

Not really. You can have many threads blocked on the oom_lock and being
reaped while they are waiting. So the check without the lock will always
be racy. This is what Tetsuo's test case actually triggers, I believe.

> While that's not perfect, I don't think it warrants complicating this
> code even more. I honestly find it near-impossible to follow the code
> and the possible scenarios at this point.

I do agree that the code is quite far from easy to follow. The set of
events that might happen in a different context is not trivial.

> out_of_memory() bails on task_will_free_mem(current), which specifically
> *excludes* already reaped tasks. Why are we then adding a separate check
> before that to bail on already reaped victims?

696453e66630a has introduced the bail out.

> Do we want to bail if current is a reaped victim or not?
>
> I don't see how we could skip it safely in general: the current task
> might have been killed and reaped and gotten access to the memory
> reserve and still fail to allocate on its way out. It needs to kill the
> next task if there is one, or warn if there isn't another one. Because
> we're genuinely oom without reclaimable tasks.

Yes, this would be the case for the global case, which is a real OOM
situation. Memcg oom is somewhat more relaxed because the oom is local.

> There is of course the scenario brought forward in this thread, where
> multiple threads of a process race and the second one enters oom even
> though it doesn't need to anymore. What the global case does to catch
> this is to grab the oom lock and do one last alloc attempt. Should memcg
> lock the oom_lock and try one more time to charge the memcg?

That would be another option. I agree that making it more like the global
case makes it more attractive. My tsk_is_oom_victim check is more of a
"plug this particular case" fix. So does this look better to you?

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index e79cb59552d9..4abb66efe806 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1380,10 +1380,22 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
 		.gfp_mask = gfp_mask,
 		.order = order,
 	};
-	bool ret;
+	bool ret = true;
 
 	mutex_lock(&oom_lock);
+
+	/*
+	 * Make a last moment check while we were waiting for the oom_lock
+	 * just in case the oom_reaper could have released some memory in
+	 * the meantime. This mimics the last moment allocation in
+	 * __alloc_pages_may_oom.
+	 */
+	if (mem_cgroup_margin(memcg) >= 1 << order)
+		goto unlock;
+
 	ret = out_of_memory(&oc);
+
+unlock:
 	mutex_unlock(&oom_lock);
 	return ret;
 }

> Some simplification in this area would really be great. I'm reluctant to
> ack patches like the above, even if they have some optical benefits for
> the user, because the code is already too tricky for what it does.

I am open to different ideas, unless they are a crazy timeout here and a
timeout there wrapped with duct tape.

On Fri 26-10-18 21:25:51, Michal Hocko wrote:
> On Fri 26-10-18 10:25:31, Johannes Weiner wrote:
[...]
> > There is of course the scenario brought forward in this thread, where
> > multiple threads of a process race and the second one enters oom even
> > though it doesn't need to anymore. What the global case does to catch
> > this is to grab the oom lock and do one last alloc attempt. Should
> > memcg lock the oom_lock and try one more time to charge the memcg?
>
> That would be another option. I agree that making it more like the
> global case makes it more attractive. My tsk_is_oom_victim check is
> more of a "plug this particular case" fix.

Nevertheless, let me emphasise that tsk_is_oom_victim will close the race
completely, while mem_cgroup_margin will always be racy. So the question
is whether we want to close the race because it is just too easy for
userspace to hit it, or keep the global and memcg oom handling as close
as possible.

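For context on why the margin check can only narrow the race, not close
it: mem_cgroup_margin() just samples the page counters, so a concurrent
charge can consume the headroom immediately after the check returns. A
simplified sketch (based on mm/memcontrol.c of that era, with the memsw
accounting branch trimmed; not the verbatim function):

/*
 * Returns the maximum amount of memory @memcg can be charged with,
 * in pages. The counter read and any later charge are not atomic with
 * respect to other tasks, which is why this check is inherently racy.
 */
static unsigned long mem_cgroup_margin(struct mem_cgroup *memcg)
{
	unsigned long margin = 0;
	unsigned long count;
	unsigned long limit;

	count = page_counter_read(&memcg->memory);
	limit = READ_ONCE(memcg->memory.max);
	if (count < limit)
		margin = limit - count;	/* can be consumed right after this */

	return margin;
}
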
On 2018/10/27 4:25, Michal Hocko wrote:
>> out_of_memory() bails on task_will_free_mem(current), which
>> specifically *excludes* already reaped tasks. Why are we then adding a
>> separate check before that to bail on already reaped victims?
>
> 696453e66630a has introduced the bail out.
>
>> Do we want to bail if current is a reaped victim or not?
>>
>> I don't see how we could skip it safely in general: the current task
>> might have been killed and reaped and gotten access to the memory
>> reserve and still fail to allocate on its way out. It needs to kill
>> the next task if there is one, or warn if there isn't another one.
>> Because we're genuinely oom without reclaimable tasks.
>
> Yes, this would be the case for the global case, which is a real OOM
> situation. Memcg oom is somewhat more relaxed because the oom is local.

We can handle the possibility of a genuine OOM without reclaimable tasks.
Only a __GFP_NOFAIL OOM has to select the next OOM victim; there is no
need to select the next OOM victim unless __GFP_NOFAIL. Commit
696453e66630ad45 ("mm, oom: task_will_free_mem should skip oom_reaped
tasks") was too simple.

On 2018/10/27 4:33, Michal Hocko wrote:
> On Fri 26-10-18 21:25:51, Michal Hocko wrote:
[...]
> Nevertheless, let me emphasise that tsk_is_oom_victim will close the
> race completely, while mem_cgroup_margin will always be racy. So the
> question is whether we want to close the race because it is just too
> easy for userspace to hit it, or keep the global and memcg oom handling
> as close as possible.

Yes, adding tsk_is_oom_victim(current) before calling out_of_memory() from
both the global OOM and memcg OOM paths can close the race completely.
(But note that tsk_is_oom_victim(current) in the global OOM path needs to
check for __GFP_NOFAIL in order to handle the genuine OOM case.)

From dc0d9ec3205a28680dcf2213c0ffe8820e8b5913 Mon Sep 17 00:00:00 2001
From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Date: Tue, 6 Nov 2018 12:27:36 +0900
Subject: [PATCH] memcg: killed threads should not invoke memcg OOM killer

It is possible that a single process group memcg easily swamps the log
with no-eligible OOM victim messages after the current thread has been
OOM-killed, due to a race between the memcg charge and the OOM reaper [1].

  Thread-1                 Thread-2                   OOM reaper
  try_charge()
    mem_cgroup_out_of_memory()
      mutex_lock(oom_lock)
                           try_charge()
                             mem_cgroup_out_of_memory()
                               mutex_lock(oom_lock)
      out_of_memory()
        select_bad_process()
        oom_kill_process(current)
        wake_oom_reaper()
                                                      oom_reap_task()
                                                      # sets MMF_OOM_SKIP
      mutex_unlock(oom_lock)
                               out_of_memory()
                                 select_bad_process() # no task
                               mutex_unlock(oom_lock)

We don't need to invoke the memcg OOM killer if the current thread was
killed while waiting for oom_lock, for mem_cgroup_oom_synchronize(true)
and memory_max_write() can bail out upon SIGKILL, and try_charge() allows
already killed/exiting threads to make forward progress.

[1] https://lkml.kernel.org/r/ea637f9a-5dd0-f927-d26d-d0b4fd8ccb6f@i-love.sakura.ne.jp

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
---
 mm/memcontrol.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 6e1469b..a97648a 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1382,8 +1382,13 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	};
 	bool ret;
 
-	mutex_lock(&oom_lock);
-	ret = out_of_memory(&oc);
+	if (mutex_lock_killable(&oom_lock))
+		return true;
+	/*
+	 * A few threads which were not waiting at mutex_lock_killable() can
+	 * fail to bail out. Therefore, check again after holding oom_lock.
+	 */
+	ret = fatal_signal_pending(current) || out_of_memory(&oc);
 	mutex_unlock(&oom_lock);
 	return ret;
 }

On Tue 06-11-18 18:44:43, Tetsuo Handa wrote:
[...]
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 6e1469b..a97648a 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1382,8 +1382,13 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
>  	};
>  	bool ret;
>  
> -	mutex_lock(&oom_lock);
> -	ret = out_of_memory(&oc);
> +	if (mutex_lock_killable(&oom_lock))
> +		return true;
> +	/*
> +	 * A few threads which were not waiting at mutex_lock_killable() can
> +	 * fail to bail out. Therefore, check again after holding oom_lock.
> +	 */
> +	ret = fatal_signal_pending(current) || out_of_memory(&oc);
>  	mutex_unlock(&oom_lock);
>  	return ret;
>  }

If we are going with a memcg specific thingy, then I really prefer the
tsk_is_oom_victim approach. Or is there any reason why this is not
suitable?

On 2018/11/06 21:42, Michal Hocko wrote:
> On Tue 06-11-18 18:44:43, Tetsuo Handa wrote:
[...]
> If we are going with a memcg specific thingy, then I really prefer the
> tsk_is_oom_victim approach. Or is there any reason why this is not
> suitable?

Why do we need to wait for mark_oom_victim(), which is called only after
the slow printk() messages?

If the current thread got Ctrl-C and thus can terminate, what is nice
about waiting for the OOM killer? What if there are several OOM events in
multiple memcg domains waiting for completion of the printk() messages? I
don't see the point of waiting for oom_lock, for try_charge() already
allows the current thread to terminate due to the fatal_signal_pending()
test.

On Wed 07-11-18 18:45:27, Tetsuo Handa wrote:
> On 2018/11/06 21:42, Michal Hocko wrote:
[...]
> Why do we need to wait for mark_oom_victim(), which is called only after
> the slow printk() messages?
>
> If the current thread got Ctrl-C and thus can terminate, what is nice
> about waiting for the OOM killer? What if there are several OOM events
> in multiple memcg domains waiting for completion of the printk()
> messages? I don't see the point of waiting for oom_lock, for
> try_charge() already allows the current thread to terminate due to the
> fatal_signal_pending() test.

mutex_lock_killable would take care of the exiting task already. I would
then still prefer to check for mark_oom_victim because that is not racy
with the exit path clearing signals. I can update my patch to use the
_killable lock variant if we are really going with the memcg specific fix.

Johannes?

On 2018/11/07 19:08, Michal Hocko wrote:
> On Wed 07-11-18 18:45:27, Tetsuo Handa wrote:
[...]
> mutex_lock_killable would take care of the exiting task already. I
> would then still prefer to check for mark_oom_victim because that is
> not racy with the exit path clearing signals. I can update my patch to
> use the _killable lock variant if we are really going with the memcg
> specific fix.
>
> Johannes?

No response for one month. When can we get to the RCU stall problem that
syzbot reported?

On 2018/12/07 21:43, Tetsuo Handa wrote:
> No response for one month. When can we get to the RCU stall problem
> that syzbot reported?

Why not apply this patch and then think about how to address
https://lore.kernel.org/lkml/201810100012.w9A0Cjtn047782@www262.sakura.ne.jp/ ?

From 0fb58415770a83d6c40d471e1840f8bc4a35ca83 Mon Sep 17 00:00:00 2001
From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Date: Wed, 12 Dec 2018 19:15:51 +0900
Subject: [PATCH] memcg: killed threads should not invoke memcg OOM killer

It is possible that a single process group memcg easily swamps the log
with no-eligible OOM victim messages after the current thread has been
OOM-killed, due to a race between the memcg charge and the OOM reaper [1].

  Thread-1                 Thread-2                   OOM reaper
  try_charge()
    mem_cgroup_out_of_memory()
      mutex_lock(oom_lock)
                           try_charge()
                             mem_cgroup_out_of_memory()
                               mutex_lock(oom_lock)
      out_of_memory()
        select_bad_process()
        oom_kill_process(current)
        wake_oom_reaper()
                                                      oom_reap_task()
                                                      # sets MMF_OOM_SKIP
      mutex_unlock(oom_lock)
                               out_of_memory()
                                 select_bad_process() # no task
                               mutex_unlock(oom_lock)

We don't need to invoke the memcg OOM killer if the current thread was
killed while waiting for oom_lock, for mem_cgroup_oom_synchronize(true)
and memory_max_write() can bail out upon SIGKILL, and try_charge() allows
already killed/exiting threads to make forward progress.

Michal has a plan to use tsk_is_oom_victim() by calling mark_oom_victim()
on all thread groups sharing the victim's mm. But fatal_signal_pending()
in this patch helps regardless of Michal's plan, because it avoids
needlessly calling out_of_memory() when the current thread is already
terminating (e.g. it got SIGINT after passing the fatal_signal_pending()
check in try_charge(), and mutex_lock_killable() did not block).

[1] https://lkml.kernel.org/r/ea637f9a-5dd0-f927-d26d-d0b4fd8ccb6f@i-love.sakura.ne.jp

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
---
 mm/memcontrol.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index b860dd4f7..b0d3bf3 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1389,8 +1389,13 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	};
 	bool ret;
 
-	mutex_lock(&oom_lock);
-	ret = out_of_memory(&oc);
+	if (mutex_lock_killable(&oom_lock))
+		return true;
+	/*
+	 * A few threads which were not waiting at mutex_lock_killable() can
+	 * fail to bail out. Therefore, check again after holding oom_lock.
+	 */
+	ret = fatal_signal_pending(current) || out_of_memory(&oc);
 	mutex_unlock(&oom_lock);
 	return ret;
 }

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index e79cb59552d9..a9dfed29967b 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1380,10 +1380,22 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
 		.gfp_mask = gfp_mask,
 		.order = order,
 	};
-	bool ret;
+	bool ret = true;
 
 	mutex_lock(&oom_lock);
+
+	/*
+	 * multi-threaded tasks might race with oom_reaper and gain
+	 * MMF_OOM_SKIP before reaching out_of_memory which can lead
+	 * to out_of_memory failure if the task is the last one in
+	 * memcg which would be a false positive failure reported
+	 */
+	if (tsk_is_oom_victim(current))
+		goto unlock;
+
 	ret = out_of_memory(&oc);
+
+unlock:
 	mutex_unlock(&oom_lock);
 	return ret;
 }