Message ID | 20181022071323.9550-3-mhocko@kernel.org (mailing list archive)
---|---
State | New, archived
Series | oom, memcg: do not report racy no-eligible OOM
On 2018/10/22 16:13, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
>
> Tetsuo has reported [1] that a single process group memcg might easily
> swamp the log with no-eligible oom victim reports due to race between
> the memcg charge and oom_reaper
>
> Thread 1                Thread2                         oom_reaper
> try_charge              try_charge
>                           mem_cgroup_out_of_memory
>                             mutex_lock(oom_lock)
>   mem_cgroup_out_of_memory
>     mutex_lock(oom_lock)
>                             out_of_memory
>                               select_bad_process
>                               oom_kill_process(current)
>                               wake_oom_reaper
>                                                         oom_reap_task
>                                                         MMF_OOM_SKIP->victim
>                             mutex_unlock(oom_lock)
>     out_of_memory
>       select_bad_process # no task
>
> If Thread1 didn't race it would bail out from try_charge and force the
> charge. We can achieve the same by checking tsk_is_oom_victim inside
> the oom_lock and therefore close the race.
>
> [1] http://lkml.kernel.org/r/bb2074c0-34fe-8c2c-1c7d-db71338f1e7f@i-love.sakura.ne.jp
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> ---
>  mm/memcontrol.c | 14 +++++++++++++-
>  1 file changed, 13 insertions(+), 1 deletion(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index e79cb59552d9..a9dfed29967b 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1380,10 +1380,22 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
>  		.gfp_mask = gfp_mask,
>  		.order = order,
>  	};
> -	bool ret;
> +	bool ret = true;
>
>  	mutex_lock(&oom_lock);
> +
> +	/*
> +	 * multi-threaded tasks might race with oom_reaper and gain
> +	 * MMF_OOM_SKIP before reaching out_of_memory which can lead
> +	 * to out_of_memory failure if the task is the last one in
> +	 * memcg which would be a false positive failure reported
> +	 */
> +	if (tsk_is_oom_victim(current))
> +		goto unlock;
> +
>  	ret = out_of_memory(&oc);
> +
> +unlock:
>  	mutex_unlock(&oom_lock);
>  	return ret;
>  }

This is not wrong but is strange. We can use mutex_lock_killable(&oom_lock)
so that any killed threads no longer wait for oom_lock.

Also, closing this race for only the memcg OOM path is strange. The global
OOM path (for tasks which are CLONE_VM without CLONE_THREAD) still suffers
from this race (though the frequency is lower than for memcg OOM due to the
use of mutex_trylock()). Either checking before calling out_of_memory() or
checking task_will_free_mem(current) inside out_of_memory() will close this
race for both paths.

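The report in [1] came from a multi-threaded memory eater running inside a
hard-limited memcg. Below is a minimal sketch of that kind of reproducer;
it is hypothetical (the cgroup v1 mount point, group name, limit, and
thread count are assumptions, not Tetsuo's actual program), but it
exercises the same window: many threads charging concurrently after one of
them has already been selected as a victim and reaped.

/*
 * Hypothetical reproducer sketch: N threads keep charging anonymous
 * memory inside a small memcg, so that threads queued on oom_lock keep
 * invoking the memcg OOM killer after the victim was already reaped.
 * Assumes a cgroup v1 memory controller mounted at /sys/fs/cgroup/memory
 * and root privileges. Build with: cc -o eater eater.c -lpthread
 */
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

#define NR_THREADS 16

static void *eat_memory(void *unused)
{
	for (;;) {
		char *p = malloc(1 << 20);

		if (p)
			memset(p, 0, 1 << 20);	/* fault pages in to charge them */
	}
	return NULL;
}

static void write_file(const char *path, const char *val)
{
	int fd = open(path, O_WRONLY);

	if (fd < 0 || write(fd, val, strlen(val)) < 0) {
		perror(path);
		exit(1);
	}
	close(fd);
}

int main(void)
{
	pthread_t th[NR_THREADS];
	char buf[16];
	int i;

	mkdir("/sys/fs/cgroup/memory/test", 0755);
	write_file("/sys/fs/cgroup/memory/test/memory.limit_in_bytes", "64M");
	snprintf(buf, sizeof(buf), "%d", getpid());
	write_file("/sys/fs/cgroup/memory/test/tasks", buf);

	for (i = 0; i < NR_THREADS; i++)
		pthread_create(&th[i], NULL, eat_memory, NULL);
	pause();	/* the memcg OOM killer eventually terminates the process */
	return 0;
}
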
On Mon 22-10-18 20:45:17, Tetsuo Handa wrote:
> On 2018/10/22 16:13, Michal Hocko wrote:
[...]
> This is not wrong but is strange. We can use mutex_lock_killable(&oom_lock)
> so that any killed threads no longer wait for oom_lock.

tsk_is_oom_victim is stronger because it doesn't depend on
fatal_signal_pending, which might be cleared throughout the exit process.

> Also, closing this race for only the memcg OOM path is strange. The
> global OOM path (for tasks which are CLONE_VM without CLONE_THREAD)
> still suffers from this race (though the frequency is lower than for
> memcg OOM due to the use of mutex_trylock()). Either checking before
> calling out_of_memory() or checking task_will_free_mem(current) inside
> out_of_memory() will close this race for both paths.

The global case is much more complicated because we know that memcg might
bypass the charge, so we do not have to care about a potential endless
loop like in the page allocator path.

Moreover, I am not even sure the race is all that interesting in the
global case. I have never heard about a premature panic due to no killable
task. The racing oom task would have to be the last eligible process in
the system, and that is quite unlikely. We can think about a more involved
solution if we ever hear about this being a real problem.

So a simple memcg specific fix sounds like a reasonable way forward.

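For reference, the two checks being compared look roughly as follows
(simplified sketches based on the kernel sources of that era, not the
verbatim definitions): tsk_is_oom_victim() tests state that
mark_oom_victim() sets and that stays set until the victim's mm is finally
torn down, while fatal_signal_pending() tests the pending signal set,
which the exit path can clear.

/* Roughly include/linux/oom.h (~v4.19): set by mark_oom_victim() and
 * stable for the rest of the task's lifetime, because signal->oom_mm
 * pins the mm until the victim is fully torn down. */
static inline bool tsk_is_oom_victim(struct task_struct *tsk)
{
	return tsk->signal->oom_mm;
}

/* Roughly include/linux/sched/signal.h (~v4.19): true only while SIGKILL
 * is still queued; once the exiting task dequeues it, this can flip back
 * to false, which is why it is the weaker check. */
static inline int fatal_signal_pending(struct task_struct *p)
{
	return signal_pending(p) && __fatal_signal_pending(p);
}
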
On 2018/10/22 21:03, Michal Hocko wrote:
> On Mon 22-10-18 20:45:17, Tetsuo Handa wrote:
[...]
>> This is not wrong but is strange. We can use mutex_lock_killable(&oom_lock)
>> so that any killed threads no longer wait for oom_lock.
>
> tsk_is_oom_victim is stronger because it doesn't depend on
> fatal_signal_pending, which might be cleared throughout the exit process.

I mean:

 mm/memcontrol.c |   3 +-
 mm/oom_kill.c   | 111 +++++---------------------------------------------------
 2 files changed, 12 insertions(+), 102 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index e79cb59..2c1e1ac 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1382,7 +1382,8 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	};
 	bool ret;
 
-	mutex_lock(&oom_lock);
+	if (mutex_lock_killable(&oom_lock))
+		return true;
 	ret = out_of_memory(&oc);
 	mutex_unlock(&oom_lock);
 	return ret;
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index f10aa53..22fd6b3 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -755,81 +755,6 @@ bool oom_killer_disable(signed long timeout)
 	return true;
 }
 
-static inline bool __task_will_free_mem(struct task_struct *task)
-{
-	struct signal_struct *sig = task->signal;
-
-	/*
-	 * A coredumping process may sleep for an extended period in exit_mm(),
-	 * so the oom killer cannot assume that the process will promptly exit
-	 * and release memory.
-	 */
-	if (sig->flags & SIGNAL_GROUP_COREDUMP)
-		return false;
-
-	if (sig->flags & SIGNAL_GROUP_EXIT)
-		return true;
-
-	if (thread_group_empty(task) && (task->flags & PF_EXITING))
-		return true;
-
-	return false;
-}
-
-/*
- * Checks whether the given task is dying or exiting and likely to
- * release its address space. This means that all threads and processes
- * sharing the same mm have to be killed or exiting.
- * Caller has to make sure that task->mm is stable (hold task_lock or
- * it operates on the current).
- */
-static bool task_will_free_mem(struct task_struct *task)
-{
-	struct mm_struct *mm = task->mm;
-	struct task_struct *p;
-	bool ret = true;
-
-	/*
-	 * Skip tasks without mm because it might have passed its exit_mm and
-	 * exit_oom_victim. oom_reaper could have rescued that but do not rely
-	 * on that for now. We can consider find_lock_task_mm in future.
-	 */
-	if (!mm)
-		return false;
-
-	if (!__task_will_free_mem(task))
-		return false;
-
-	/*
-	 * This task has already been drained by the oom reaper so there are
-	 * only small chances it will free some more
-	 */
-	if (test_bit(MMF_OOM_SKIP, &mm->flags))
-		return false;
-
-	if (atomic_read(&mm->mm_users) <= 1)
-		return true;
-
-	/*
-	 * Make sure that all tasks which share the mm with the given tasks
-	 * are dying as well to make sure that a) nobody pins its mm and
-	 * b) the task is also reapable by the oom reaper.
-	 */
-	rcu_read_lock();
-	for_each_process(p) {
-		if (!process_shares_mm(p, mm))
-			continue;
-		if (same_thread_group(task, p))
-			continue;
-		ret = __task_will_free_mem(p);
-		if (!ret)
-			break;
-	}
-	rcu_read_unlock();
-
-	return ret;
-}
-
 static void __oom_kill_process(struct task_struct *victim)
 {
 	struct task_struct *p;
@@ -879,6 +804,8 @@ static void __oom_kill_process(struct task_struct *victim)
 	 */
 	rcu_read_lock();
 	for_each_process(p) {
+		struct task_struct *t;
+
 		if (!process_shares_mm(p, mm))
 			continue;
 		if (same_thread_group(p, victim))
@@ -898,6 +825,11 @@ static void __oom_kill_process(struct task_struct *victim)
 		if (unlikely(p->flags & PF_KTHREAD))
 			continue;
 		do_send_sig_info(SIGKILL, SEND_SIG_FORCED, p, PIDTYPE_TGID);
+		t = find_lock_task_mm(p);
+		if (!t)
+			continue;
+		mark_oom_victim(t);
+		task_unlock(t);
 	}
 	rcu_read_unlock();
 
@@ -934,21 +866,6 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
 	static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
 					      DEFAULT_RATELIMIT_BURST);
 
-	/*
-	 * If the task is already exiting, don't alarm the sysadmin or kill
-	 * its children or threads, just give it access to memory reserves
-	 * so it can die quickly
-	 */
-	task_lock(p);
-	if (task_will_free_mem(p)) {
-		mark_oom_victim(p);
-		wake_oom_reaper(p);
-		task_unlock(p);
-		put_task_struct(p);
-		return;
-	}
-	task_unlock(p);
-
 	if (__ratelimit(&oom_rs))
 		dump_header(oc, p);
 
@@ -1055,6 +972,9 @@ bool out_of_memory(struct oom_control *oc)
 	unsigned long freed = 0;
 	enum oom_constraint constraint = CONSTRAINT_NONE;
 
+	if (tsk_is_oom_victim(current) && !(oc->gfp_mask & __GFP_NOFAIL))
+		return true;
+
 	if (oom_killer_disabled)
 		return false;
 
@@ -1066,17 +986,6 @@ bool out_of_memory(struct oom_control *oc)
 	}
 
 	/*
-	 * If current has a pending SIGKILL or is exiting, then automatically
-	 * select it.  The goal is to allow it to allocate so that it may
-	 * quickly exit and free its memory.
-	 */
-	if (task_will_free_mem(current)) {
-		mark_oom_victim(current);
-		wake_oom_reaper(current);
-		return true;
-	}
-
-	/*
 	 * The OOM killer does not compensate for IO-less reclaim.
 	 * pagefault_out_of_memory lost its gfp context so we have to
 	 * make sure exclude 0 mask - all other users should have at least

On Mon 22-10-18 22:20:36, Tetsuo Handa wrote:
> On 2018/10/22 21:03, Michal Hocko wrote:
[...]
> I mean:
>
>  mm/memcontrol.c |   3 +-
>  mm/oom_kill.c   | 111 +++++---------------------------------------------------
>  2 files changed, 12 insertions(+), 102 deletions(-)

This is a much larger change than I feel comfortable with to plug this
specific issue. A simple and easy to understand fix which doesn't add
maintenance burden should be preferred in general.

The code reduction looks attractive, but considering it is based on
removing one of the heuristics that prevent OOM reports in some cases, it
should be done on its own with a careful and thorough justification, e.g.
how often the heuristic is really helpful.

In principle I do not oppose removing the shortcut once all due diligence
is done, because this particular one has given us quite a lot of headaches
in the past.

On 2018/10/22 22:43, Michal Hocko wrote:
> On Mon 22-10-18 22:20:36, Tetsuo Handa wrote:
>> I mean:
>>
>>  mm/memcontrol.c |   3 +-
>>  mm/oom_kill.c   | 111 +++++---------------------------------------------------
>>  2 files changed, 12 insertions(+), 102 deletions(-)
>
> This is a much larger change than I feel comfortable with to plug this
> specific issue. A simple and easy to understand fix which doesn't add
> maintenance burden should be preferred in general.
>
> The code reduction looks attractive, but considering it is based on
> removing one of the heuristics that prevent OOM reports in some cases,
> it should be done on its own with a careful and thorough justification,
> e.g. how often the heuristic is really helpful.

I think the heuristic is hardly helpful.

The task_will_free_mem(current) condition in out_of_memory() served two
purposes. One is the case where mark_oom_victim() has not yet been called
on the current thread group while it was already called on other thread
groups. But that situation disappears once the task_will_free_mem()
shortcuts are removed and the for_each_process(p) loop in
__oom_kill_process() is forced. The other is the case where
mark_oom_victim() has not yet been called on any thread group while all
thread groups are exiting. In that case, we will fail to wait for the
current thread group to release its mm... but it is unlikely that
out_of_memory() is only ever called by threads for which
task_will_free_mem(current) returns true (note that task_will_free_mem(p)
returns false if p->mm == NULL).

I also think it is highly unlikely to hit the task_will_free_mem(p)
condition in oom_kill_process(). To hit it, the candidate chosen as the
largest memory user has to be already exiting. But if it is already
exiting, it has likely already released its mm (and hence is no longer
the largest memory user). I can't say such a race never happens, but I
think it is unlikely. Also, since task_will_free_mem(p) returns false if
the thread group leader's mm is NULL, whereas oom_badness() called from
select_bad_process() evaluates any mm in that thread group and returns
the thread group leader, this heuristic is incomplete after all.

> In principle I do not oppose removing the shortcut once all due
> diligence is done, because this particular one has given us quite a lot
> of headaches in the past.

Michal Hocko wrote:
> On Mon 22-10-18 20:45:17, Tetsuo Handa wrote:
[...]
>> This is not wrong but is strange. We can use mutex_lock_killable(&oom_lock)
>> so that any killed threads no longer wait for oom_lock.
>
> tsk_is_oom_victim is stronger because it doesn't depend on
> fatal_signal_pending, which might be cleared throughout the exit process.

I still want to propose this. No need to be memcg OOM specific.

 mm/memcontrol.c |  3 ++-
 mm/oom_kill.c   | 10 ++++++++++
 2 files changed, 12 insertions(+), 1 deletion(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index e79cb59..2c1e1ac 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1382,7 +1382,8 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	};
 	bool ret;
 
-	mutex_lock(&oom_lock);
+	if (mutex_lock_killable(&oom_lock))
+		return true;
 	ret = out_of_memory(&oc);
 	mutex_unlock(&oom_lock);
 	return ret;
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index f10aa53..e453bad 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -1055,6 +1055,16 @@ bool out_of_memory(struct oom_control *oc)
 	unsigned long freed = 0;
 	enum oom_constraint constraint = CONSTRAINT_NONE;
 
+	/*
+	 * It is possible that multi-threaded OOM victims get
+	 * task_will_free_mem(current) == false when the OOM reaper quickly
+	 * set MMF_OOM_SKIP. But since we know that tsk_is_oom_victim() == true
+	 * tasks won't loop forever (unless it is a __GFP_NOFAIL allocation
+	 * request), we don't need to select next OOM victim.
+	 */
+	if (tsk_is_oom_victim(current) && !(oc->gfp_mask & __GFP_NOFAIL))
+		return true;
+
 	if (oom_killer_disabled)
 		return false;

On Tue 23-10-18 10:01:08, Tetsuo Handa wrote:
> Michal Hocko wrote:
[...]
> I still want to propose this. No need to be memcg OOM specific.

Well, I maintain what I've said [1] about simplicity and a specific fix
for a specific issue, especially in tricky code like this, where all the
consequences are far more subtle than they seem to be.

This is obviously a matter of taste, but I don't see much point in
discussing this back and forth forever. If there is a general agreement
that the above is less appropriate, I am willing to consider a different
change, but I simply do not have the energy to nitpick forever.

[1] http://lkml.kernel.org/r/20181022134315.GF18839@dhcp22.suse.cz

On Tue 23-10-18 13:42:46, Michal Hocko wrote:
> On Tue 23-10-18 10:01:08, Tetsuo Handa wrote:
[...]
> Well, I maintain what I've said [1] about simplicity and a specific fix
> for a specific issue, especially in tricky code like this, where all the
> consequences are far more subtle than they seem to be.
>
> [1] http://lkml.kernel.org/r/20181022134315.GF18839@dhcp22.suse.cz

In other words: having a memcg specific fix means, well, a memcg
maintenance burden, like any other memcg specific oom decisions we already
have. So are you OK with that, Johannes, or would you like to see a more
generic fix which might turn out to be more complex?

On 2018/10/23 21:10, Michal Hocko wrote:
> On Tue 23-10-18 13:42:46, Michal Hocko wrote:
[...]
> In other words: having a memcg specific fix means, well, a memcg
> maintenance burden, like any other memcg specific oom decisions we
> already have. So are you OK with that, Johannes, or would you like to
> see a more generic fix which might turn out to be more complex?

I don't know what "that Johannes" refers to. If you don't want to affect
the SysRq-OOM and pagefault-OOM cases, are you OK with having a
global-OOM-specific fix?

 mm/page_alloc.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index e2ef1c1..f59f029 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3518,6 +3518,17 @@ void warn_alloc(gfp_t gfp_mask, nodemask_t *nodemask, const char *fmt, ...)
 	if (gfp_mask & __GFP_THISNODE)
 		goto out;
 
+	/*
+	 * It is possible that multi-threaded OOM victims get
+	 * task_will_free_mem(current) == false when the OOM reaper quickly
+	 * set MMF_OOM_SKIP. But since we know that tsk_is_oom_victim() == true
+	 * tasks won't loop forever (unless it is a __GFP_NOFAIL allocation
+	 * request), we don't need to select next OOM victim.
+	 */
+	if (tsk_is_oom_victim(current) && !(gfp_mask & __GFP_NOFAIL)) {
+		*did_some_progress = 1;
+		goto out;
+	}
 	/* Exhausted what can be done so it's blame time */
 	if (out_of_memory(&oc) || WARN_ON_ONCE(gfp_mask & __GFP_NOFAIL)) {
 		*did_some_progress = 1;

On Tue 23-10-18 21:33:43, Tetsuo Handa wrote:
> On 2018/10/23 21:10, Michal Hocko wrote:
[...]
> I don't know what "that Johannes" refers to.

Let me rephrase: Johannes, are you OK with that (a memcg specific fix), or
would you like to see a more generic fix which might turn out to be more
complex?

On Mon, Oct 22, 2018 at 09:13:23AM +0200, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
>
> Tetsuo has reported [1] that a single process group memcg might easily
> swamp the log with no-eligible oom victim reports due to race between
> the memcg charge and oom_reaper
[...]
>  	mutex_lock(&oom_lock);
> +
> +	/*
> +	 * multi-threaded tasks might race with oom_reaper and gain
> +	 * MMF_OOM_SKIP before reaching out_of_memory which can lead
> +	 * to out_of_memory failure if the task is the last one in
> +	 * memcg which would be a false positive failure reported
> +	 */
> +	if (tsk_is_oom_victim(current))
> +		goto unlock;
> +
>  	ret = out_of_memory(&oc);

We already check tsk_is_oom_victim(current) in try_charge() before looping
on the OOM killer, so at most we'd have a single "no eligible tasks"
message from such a race before we force the charge - right?

While that's not perfect, I don't think it warrants complicating this code
even more. I honestly find it near-impossible to follow the code and the
possible scenarios at this point.

out_of_memory() bails on task_will_free_mem(current), which specifically
*excludes* already reaped tasks. Why are we then adding a separate check
before that to bail on already reaped victims? Do we want to bail if
current is a reaped victim or not?

I don't see how we could skip it safely in general: the current task might
have been killed and reaped and gotten access to the memory reserve and
still fail to allocate on its way out. It needs to kill the next task if
there is one, or warn if there isn't another one. Because we're genuinely
oom without reclaimable tasks.

There is of course the scenario brought forward in this thread, where
multiple threads of a process race and the second one enters oom even
though it doesn't need to anymore. What the global case does to catch this
is to grab the oom lock and do one last alloc attempt. Should memcg lock
the oom_lock and try one more time to charge the memcg?

Some simplification in this area would really be great. I'm reluctant to
ack patches like the above, even if they have some optical benefits for
the user, because the code is already too tricky for what it does.

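For context, the "one last alloc attempt" pattern Johannes refers to lives
in __alloc_pages_may_oom() in mm/page_alloc.c. A trimmed sketch of the
kernel of that era (~v4.19) follows; it is not the verbatim function (the
various bail-out checks are elided), but it shows the shape: serialize on
oom_lock, retry the allocation once at a high watermark, and only then
call out_of_memory().

/*
 * Simplified sketch of __alloc_pages_may_oom() (~v4.19), trimmed to the
 * locking and last-attempt pattern under discussion.
 */
static inline struct page *
__alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
		      const struct alloc_context *ac,
		      unsigned long *did_some_progress)
{
	struct oom_control oc = {
		.zonelist = ac->zonelist,
		.nodemask = ac->nodemask,
		.memcg = NULL,
		.gfp_mask = gfp_mask,
		.order = order,
	};
	struct page *page;

	*did_some_progress = 0;

	/* Serialize concurrent OOM invocations; back off on contention. */
	if (!mutex_trylock(&oom_lock)) {
		*did_some_progress = 1;
		schedule_timeout_uninterruptible(1);
		return NULL;
	}

	/*
	 * One last allocation attempt under oom_lock: a parallel OOM kill
	 * or the oom_reaper may have freed memory while we were waiting.
	 */
	page = get_page_from_freelist(gfp_mask | __GFP_HARDWALL, order,
				      ALLOC_WMARK_HIGH | ALLOC_CPUSET, ac);
	if (page)
		goto out;

	/* ... various bail-out checks elided ... */

	if (out_of_memory(&oc) || WARN_ON_ONCE(gfp_mask & __GFP_NOFAIL))
		*did_some_progress = 1;
out:
	mutex_unlock(&oom_lock);
	return page;
}
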
On Fri 26-10-18 10:25:31, Johannes Weiner wrote:
> On Mon, Oct 22, 2018 at 09:13:23AM +0200, Michal Hocko wrote:
[...]
> We already check tsk_is_oom_victim(current) in try_charge() before
> looping on the OOM killer, so at most we'd have a single "no eligible
> tasks" message from such a race before we force the charge - right?

Not really. You can have many threads blocked on the oom_lock and being
reaped while they are waiting. So the check without the lock will always
be racy. This is what Tetsuo's test case actually triggers, I believe.

> While that's not perfect, I don't think it warrants complicating this
> code even more. I honestly find it near-impossible to follow the code
> and the possible scenarios at this point.

I do agree that the code is quite far from easy to follow. The set of
events that might happen in a different context is not trivial.

> out_of_memory() bails on task_will_free_mem(current), which specifically
> *excludes* already reaped tasks. Why are we then adding a separate check
> before that to bail on already reaped victims?

696453e66630a has introduced the bail out.

> Do we want to bail if current is a reaped victim or not?
>
> I don't see how we could skip it safely in general: the current task
> might have been killed and reaped and gotten access to the memory
> reserve and still fail to allocate on its way out. It needs to kill the
> next task if there is one, or warn if there isn't another one. Because
> we're genuinely oom without reclaimable tasks.

Yes, this would be the case for the global case, which is a real OOM
situation. Memcg oom is somewhat more relaxed because the oom is local.

> There is of course the scenario brought forward in this thread, where
> multiple threads of a process race and the second one enters oom even
> though it doesn't need to anymore. What the global case does to catch
> this is to grab the oom lock and do one last alloc attempt. Should memcg
> lock the oom_lock and try one more time to charge the memcg?

That would be another option. I agree that making it more like the global
case makes it more attractive. My tsk_is_oom_victim check is more of a
"plug this particular case" fix. So does this look better to you?

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index e79cb59552d9..4abb66efe806 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1380,10 +1380,22 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
 		.gfp_mask = gfp_mask,
 		.order = order,
 	};
-	bool ret;
+	bool ret = true;
 
 	mutex_lock(&oom_lock);
+
+	/*
+	 * Make a last moment check while we were waiting for the oom_lock
+	 * just in case the oom_reaper could have released some memory in
+	 * the meantime. This mimics the last moment allocation in
+	 * __alloc_pages_may_oom.
+	 */
+	if (mem_cgroup_margin(memcg) >= 1 << order)
+		goto unlock;
+
 	ret = out_of_memory(&oc);
+
+unlock:
 	mutex_unlock(&oom_lock);
 	return ret;
 }

> Some simplification in this area would really be great. I'm reluctant to
> ack patches like the above, even if they have some optical benefits for
> the user, because the code is already too tricky for what it does.

I am open to different ideas, unless they are a crazy timeout here and a
timeout there wrapped with duct tape.

On Fri 26-10-18 21:25:51, Michal Hocko wrote:
> On Fri 26-10-18 10:25:31, Johannes Weiner wrote:
[...]
> > There is of course the scenario brought forward in this thread, where
> > multiple threads of a process race and the second one enters oom even
> > though it doesn't need to anymore. What the global case does to catch
> > this is to grab the oom lock and do one last alloc attempt. Should
> > memcg lock the oom_lock and try one more time to charge the memcg?
>
> That would be another option. I agree that making it more like the
> global case makes it more attractive. My tsk_is_oom_victim check is
> more of a "plug this particular case" fix.

Nevertheless, let me emphasise that tsk_is_oom_victim will close the race
completely, while mem_cgroup_margin will always be racy. So the question
is whether we want to close the race because it is just too easy for
userspace to hit it, or keep the global and memcg oom handling as close
as possible.

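For context on why the margin check can only narrow the race, not close
it: mem_cgroup_margin() just samples the page counters, so a concurrent
charge can consume the headroom immediately after the check returns. A
simplified sketch (based on mm/memcontrol.c of that era, with the memsw
accounting branch trimmed; not the verbatim function):

/*
 * Returns the maximum amount of memory @memcg can be charged with,
 * in pages. The counter read and any later charge are not atomic with
 * respect to other tasks, which is why this check is inherently racy.
 */
static unsigned long mem_cgroup_margin(struct mem_cgroup *memcg)
{
	unsigned long margin = 0;
	unsigned long count;
	unsigned long limit;

	count = page_counter_read(&memcg->memory);
	limit = READ_ONCE(memcg->memory.max);
	if (count < limit)
		margin = limit - count;	/* can be consumed right after this */

	return margin;
}
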
On 2018/10/27 4:25, Michal Hocko wrote:
>> out_of_memory() bails on task_will_free_mem(current), which
>> specifically *excludes* already reaped tasks. Why are we then adding a
>> separate check before that to bail on already reaped victims?
>
> 696453e66630a has introduced the bail out.
>
>> Do we want to bail if current is a reaped victim or not?
>>
>> I don't see how we could skip it safely in general: the current task
>> might have been killed and reaped and gotten access to the memory
>> reserve and still fail to allocate on its way out. It needs to kill
>> the next task if there is one, or warn if there isn't another one.
>> Because we're genuinely oom without reclaimable tasks.
>
> Yes, this would be the case for the global case, which is a real OOM
> situation. Memcg oom is somewhat more relaxed because the oom is local.

We can handle the possibility of a genuine OOM without reclaimable tasks.
Only a __GFP_NOFAIL OOM has to select the next OOM victim; there is no
need to select the next OOM victim unless __GFP_NOFAIL. Commit
696453e66630ad45 ("mm, oom: task_will_free_mem should skip oom_reaped
tasks") was too simple.

On 2018/10/27 4:33, Michal Hocko wrote:
> On Fri 26-10-18 21:25:51, Michal Hocko wrote:
[...]
> Nevertheless, let me emphasise that tsk_is_oom_victim will close the
> race completely, while mem_cgroup_margin will always be racy. So the
> question is whether we want to close the race because it is just too
> easy for userspace to hit it, or keep the global and memcg oom handling
> as close as possible.

Yes, adding tsk_is_oom_victim(current) before calling out_of_memory() from
both the global OOM and memcg OOM paths can close the race completely.
(But note that tsk_is_oom_victim(current) in the global OOM path needs to
check for __GFP_NOFAIL in order to handle the genuine OOM case.)

From dc0d9ec3205a28680dcf2213c0ffe8820e8b5913 Mon Sep 17 00:00:00 2001
From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Date: Tue, 6 Nov 2018 12:27:36 +0900
Subject: [PATCH] memcg: killed threads should not invoke memcg OOM killer

It is possible that a single process group memcg easily swamps the log
with no-eligible OOM victim messages after the current thread has been
OOM-killed, due to a race between the memcg charge and the OOM reaper [1].

  Thread-1                 Thread-2                   OOM reaper
  try_charge()
    mem_cgroup_out_of_memory()
      mutex_lock(oom_lock)
                           try_charge()
                             mem_cgroup_out_of_memory()
                               mutex_lock(oom_lock)
      out_of_memory()
        select_bad_process()
        oom_kill_process(current)
        wake_oom_reaper()
                                                      oom_reap_task()
                                                      # sets MMF_OOM_SKIP
      mutex_unlock(oom_lock)
                               out_of_memory()
                                 select_bad_process() # no task
                               mutex_unlock(oom_lock)

We don't need to invoke the memcg OOM killer if the current thread was
killed while waiting for oom_lock, for mem_cgroup_oom_synchronize(true)
and memory_max_write() can bail out upon SIGKILL, and try_charge() allows
already killed/exiting threads to make forward progress.

[1] https://lkml.kernel.org/r/ea637f9a-5dd0-f927-d26d-d0b4fd8ccb6f@i-love.sakura.ne.jp

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
---
 mm/memcontrol.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 6e1469b..a97648a 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1382,8 +1382,13 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	};
 	bool ret;
 
-	mutex_lock(&oom_lock);
-	ret = out_of_memory(&oc);
+	if (mutex_lock_killable(&oom_lock))
+		return true;
+	/*
+	 * A few threads which were not waiting at mutex_lock_killable() can
+	 * fail to bail out. Therefore, check again after holding oom_lock.
+	 */
+	ret = fatal_signal_pending(current) || out_of_memory(&oc);
 	mutex_unlock(&oom_lock);
 	return ret;
 }

On Tue 06-11-18 18:44:43, Tetsuo Handa wrote:
[...]
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 6e1469b..a97648a 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1382,8 +1382,13 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
>  	};
>  	bool ret;
>  
> -	mutex_lock(&oom_lock);
> -	ret = out_of_memory(&oc);
> +	if (mutex_lock_killable(&oom_lock))
> +		return true;
> +	/*
> +	 * A few threads which were not waiting at mutex_lock_killable() can
> +	 * fail to bail out. Therefore, check again after holding oom_lock.
> +	 */
> +	ret = fatal_signal_pending(current) || out_of_memory(&oc);
>  	mutex_unlock(&oom_lock);
>  	return ret;
>  }

If we are going with a memcg specific thingy, then I really prefer the
tsk_is_oom_victim approach. Or is there any reason why this is not
suitable?

On 2018/11/06 21:42, Michal Hocko wrote:
> On Tue 06-11-18 18:44:43, Tetsuo Handa wrote:
[...]
> If we are going with a memcg specific thingy, then I really prefer the
> tsk_is_oom_victim approach. Or is there any reason why this is not
> suitable?

Why do we need to wait for mark_oom_victim(), which is called only after
the slow printk() messages?

If the current thread got Ctrl-C and thus can terminate, what is nice
about waiting for the OOM killer? What if there are several OOM events in
multiple memcg domains waiting for completion of the printk() messages? I
don't see the point of waiting for oom_lock, for try_charge() already
allows the current thread to terminate due to the fatal_signal_pending()
test.

On Wed 07-11-18 18:45:27, Tetsuo Handa wrote:
> On 2018/11/06 21:42, Michal Hocko wrote:
[...]
> Why do we need to wait for mark_oom_victim(), which is called only after
> the slow printk() messages?
>
> If the current thread got Ctrl-C and thus can terminate, what is nice
> about waiting for the OOM killer? What if there are several OOM events
> in multiple memcg domains waiting for completion of the printk()
> messages? I don't see the point of waiting for oom_lock, for
> try_charge() already allows the current thread to terminate due to the
> fatal_signal_pending() test.

mutex_lock_killable would take care of the exiting task already. I would
then still prefer to check for mark_oom_victim because that is not racy
with the exit path clearing signals. I can update my patch to use the
_killable lock variant if we are really going with the memcg specific fix.

Johannes?

On 2018/11/07 19:08, Michal Hocko wrote:
> On Wed 07-11-18 18:45:27, Tetsuo Handa wrote:
[...]
> mutex_lock_killable would take care of the exiting task already. I
> would then still prefer to check for mark_oom_victim because that is
> not racy with the exit path clearing signals. I can update my patch to
> use the _killable lock variant if we are really going with the memcg
> specific fix.
>
> Johannes?

No response for one month. When can we get to the RCU stall problem that
syzbot reported?

On 2018/12/07 21:43, Tetsuo Handa wrote:
> No response for one month. When can we get to the RCU stall problem
> that syzbot reported?

Why not apply this patch and then think about how to address
https://lore.kernel.org/lkml/201810100012.w9A0Cjtn047782@www262.sakura.ne.jp/ ?

From 0fb58415770a83d6c40d471e1840f8bc4a35ca83 Mon Sep 17 00:00:00 2001
From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Date: Wed, 12 Dec 2018 19:15:51 +0900
Subject: [PATCH] memcg: killed threads should not invoke memcg OOM killer

It is possible that a single process group memcg easily swamps the log
with no-eligible OOM victim messages after the current thread has been
OOM-killed, due to a race between the memcg charge and the OOM reaper [1].

  Thread-1                 Thread-2                   OOM reaper
  try_charge()
    mem_cgroup_out_of_memory()
      mutex_lock(oom_lock)
                           try_charge()
                             mem_cgroup_out_of_memory()
                               mutex_lock(oom_lock)
      out_of_memory()
        select_bad_process()
        oom_kill_process(current)
        wake_oom_reaper()
                                                      oom_reap_task()
                                                      # sets MMF_OOM_SKIP
      mutex_unlock(oom_lock)
                               out_of_memory()
                                 select_bad_process() # no task
                               mutex_unlock(oom_lock)

We don't need to invoke the memcg OOM killer if the current thread was
killed while waiting for oom_lock, for mem_cgroup_oom_synchronize(true)
and memory_max_write() can bail out upon SIGKILL, and try_charge() allows
already killed/exiting threads to make forward progress.

Michal has a plan to use tsk_is_oom_victim() by calling mark_oom_victim()
on all thread groups sharing the victim's mm. But fatal_signal_pending()
in this patch helps regardless of Michal's plan, because it avoids
needlessly calling out_of_memory() when the current thread is already
terminating (e.g. it got SIGINT after passing the fatal_signal_pending()
check in try_charge(), and mutex_lock_killable() did not block).

[1] https://lkml.kernel.org/r/ea637f9a-5dd0-f927-d26d-d0b4fd8ccb6f@i-love.sakura.ne.jp

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
---
 mm/memcontrol.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index b860dd4f7..b0d3bf3 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1389,8 +1389,13 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	};
 	bool ret;
 
-	mutex_lock(&oom_lock);
-	ret = out_of_memory(&oc);
+	if (mutex_lock_killable(&oom_lock))
+		return true;
+	/*
+	 * A few threads which were not waiting at mutex_lock_killable() can
+	 * fail to bail out. Therefore, check again after holding oom_lock.
+	 */
+	ret = fatal_signal_pending(current) || out_of_memory(&oc);
 	mutex_unlock(&oom_lock);
 	return ret;
 }

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index e79cb59552d9..a9dfed29967b 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1380,10 +1380,22 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
 		.gfp_mask = gfp_mask,
 		.order = order,
 	};
-	bool ret;
+	bool ret = true;
 
 	mutex_lock(&oom_lock);
+
+	/*
+	 * multi-threaded tasks might race with oom_reaper and gain
+	 * MMF_OOM_SKIP before reaching out_of_memory which can lead
+	 * to out_of_memory failure if the task is the last one in
+	 * memcg which would be a false positive failure reported
+	 */
+	if (tsk_is_oom_victim(current))
+		goto unlock;
+
 	ret = out_of_memory(&oc);
+
+unlock:
 	mutex_unlock(&oom_lock);
 	return ret;
 }