
mm,oom: Don't call schedule_timeout_killable() with oom_lock held.

Message ID 20180528124313.GC27180@dhcp22.suse.cz (mailing list archive)
State New, archived

Commit Message

Michal Hocko May 28, 2018, 12:43 p.m. UTC
On Fri 25-05-18 20:46:21, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Fri 25-05-18 19:57:32, Tetsuo Handa wrote:
> > > Michal Hocko wrote:
> > > > What is wrong with the following? should_reclaim_retry should be a
> > > > natural reschedule point. PF_WQ_WORKER is a special case which needs a
> > > > stronger rescheduling policy. Doing that unconditionally seems more
> > > > straightforward than depending on a zone being a good candidate for a
> > > > further reclaim.
> > > 
> > > Where is schedule_timeout_uninterruptible(1) for !PF_KTHREAD threads?
> > 
> > Re-read what I've said.
> 
> Please show me as a complete patch. Then, I will test your patch.

So how about we start as simple as the following? If we _really_ need to
touch should_reclaim_retry then it should be done in a separate patch
with some numbers/tracing data backing that story.
---
From fa3b165c34937bf628f2eee7e6f0483c611167d0 Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.com>
Date: Mon, 28 May 2018 14:30:37 +0200
Subject: [PATCH] mm, oom: remove sleep from under oom_lock

Tetsuo has pointed out that since 27ae357fa82b ("mm, oom: fix concurrent
munlock and oom reaper unmap, v3") we have strong synchronization
between the oom killer and the victim's exit path because both have to
take the oom_lock. Therefore the original heuristic to sleep for a short
time in out_of_memory no longer serves its original purpose.

Moreover Tetsuo has noticed that the short sleep can be more harmful
than actually useful. Hammering the system with many processes can lead
to starvation when the task holding the oom_lock blocks for a long time
(minutes), stalling any further progress because the oom_reaper depends
on the oom_lock as well.

Drop the short sleep from out_of_memory when we hold the lock. Keep the
sleep when the trylock fails to throttle the concurrent OOM paths a bit.
This should be solved in a more reasonable way (e.g. a sleep proportional
to the time spent in active reclaim), but that is a much more complex
thing to achieve. This is a quick fixup to remove stale code.

Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 mm/oom_kill.c | 17 ++---------------
 1 file changed, 2 insertions(+), 15 deletions(-)
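
For context, the two sleeps the changelog refers to live in different
places; the sketch below is illustrative only (the wrapper function name
is made up and the real allocation slow path has more detail), showing
which sleep is kept and which one is removed:

	/*
	 * Simplified flow of the OOM path around oom_lock.  The sleep on
	 * trylock failure (no lock held) is the one this patch keeps to
	 * throttle concurrent OOM paths; the sleep that used to sit at
	 * the end of out_of_memory(), i.e. still under oom_lock, is the
	 * one being removed.
	 */
	static bool oom_path_sketch(struct oom_control *oc)
	{
		bool ret;

		if (!mutex_trylock(&oom_lock)) {
			/* someone else is handling the OOM, back off briefly */
			schedule_timeout_uninterruptible(1);
			return false;
		}

		ret = out_of_memory(oc);	/* runs with oom_lock held */
		/*
		 * Before this patch out_of_memory() finished with
		 * schedule_timeout_killable(1), which therefore ran while
		 * oom_lock was still held and could stall the oom_reaper.
		 */
		mutex_unlock(&oom_lock);
		return ret;
	}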

Comments

Tetsuo Handa May 28, 2018, 8:57 p.m. UTC | #1
Michal Hocko wrote:
> On Fri 25-05-18 20:46:21, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> > > On Fri 25-05-18 19:57:32, Tetsuo Handa wrote:
> > > > Michal Hocko wrote:
> > > > > What is wrong with the following? should_reclaim_retry should be a
> > > > > natural reschedule point. PF_WQ_WORKER is a special case which needs a
> > > > > stronger rescheduling policy. Doing that unconditionally seems more
> > > > > straightforward than depending on a zone being a good candidate for a
> > > > > further reclaim.
> > > > 
> > > > Where is schedule_timeout_uninterruptible(1) for !PF_KTHREAD threads?
> > > 
> > > Re-read what I've said.
> > 
> > Please show me as a complete patch. Then, I will test your patch.
> 
> So how about we start as simple as the following? If we _really_ need to
> touch should_reclaim_retry then it should be done in a separate patch
> with some numbers/tracing data backing that story.

This patch is incorrect in that it ignores the bug in Roman's
"mm, oom: cgroup-aware OOM killer" patch in linux-next. I suggest applying
this patch first, and then fix "mm, oom: cgroup-aware OOM killer" patch.
Michal Hocko May 29, 2018, 7:17 a.m. UTC | #2
On Tue 29-05-18 05:57:16, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Fri 25-05-18 20:46:21, Tetsuo Handa wrote:
> > > Michal Hocko wrote:
> > > > On Fri 25-05-18 19:57:32, Tetsuo Handa wrote:
> > > > > Michal Hocko wrote:
> > > > > > What is wrong with the following? should_reclaim_retry should be a
> > > > > > natural reschedule point. PF_WQ_WORKER is a special case which needs a
> > > > > > stronger rescheduling policy. Doing that unconditionally seems more
> > > > > > straightforward than depending on a zone being a good candidate for a
> > > > > > further reclaim.
> > > > > 
> > > > > Where is schedule_timeout_uninterruptible(1) for !PF_KTHREAD threads?
> > > > 
> > > > Re-read what I've said.
> > > 
> > > Please show me as a complete patch. Then, I will test your patch.
> > 
> > So how about we start as simple as the following? If we _really_ need to
> > touch should_reclaim_retry then it should be done in a separate patch
> > with some numbers/tracing data backing that story.
> 
> This patch is incorrect in that it ignores the bug in Roman's
> "mm, oom: cgroup-aware OOM killer" patch in linux-next.

I expected Roman to comment on that. The fix should be trivial. But
does that prevent further testing of this patch? Are you actually
using the cgroup OOM killer? If not, the bug should be a non-issue, right?

> I suggest applying
> this patch first, and then fix "mm, oom: cgroup-aware OOM killer" patch.

Well, I hope the whole pile gets merged in the upcoming merge window
rather than stall even more. This patch however can wait some more.
There is no hurry to merge it right away.
Andrew Morton May 29, 2018, 11:07 p.m. UTC | #3
On Tue, 29 May 2018 09:17:41 +0200 Michal Hocko <mhocko@kernel.org> wrote:

> > I suggest applying
> > this patch first, and then fix "mm, oom: cgroup-aware OOM killer" patch.
> 
> Well, I hope the whole pile gets merged in the upcoming merge window
> rather than stall even more.

I'm more inclined to drop it all.  David has identified significant
shortcomings and I'm not seeing a way of addressing those shortcomings
in a backward-compatible fashion.  Therefore there is no way forward
at present.
Michal Hocko June 1, 2018, 3:28 p.m. UTC | #4
On Tue 29-05-18 16:07:00, Andrew Morton wrote:
> On Tue, 29 May 2018 09:17:41 +0200 Michal Hocko <mhocko@kernel.org> wrote:
> 
> > > I suggest applying
> > > this patch first, and then fix "mm, oom: cgroup-aware OOM killer" patch.
> > 
> > Well, I hope the whole pile gets merged in the upcoming merge window
> > rather than stall even more.
> 
> I'm more inclined to drop it all.  David has identified significant
> shortcomings and I'm not seeing a way of addressing those shortcomings
> in a backward-compatible fashion.  Therefore there is no way forward
> at present.

Well, I thought we have argued about those "shortcomings" back and forth
and expressed that they are not really a problem for workloads which are
going to use the feature. The backward compatibility has been explained
as well AFAICT. Anyway, if this is your position on the matter then I
just give up. I've tried to do my best to review the feature (as neither
the author nor the end user) and I cannot really do much more. I find it
quite sad though, to be honest.
Andrew Morton June 1, 2018, 9:11 p.m. UTC | #5
On Fri, 1 Jun 2018 17:28:01 +0200 Michal Hocko <mhocko@kernel.org> wrote:

> On Tue 29-05-18 16:07:00, Andrew Morton wrote:
> > On Tue, 29 May 2018 09:17:41 +0200 Michal Hocko <mhocko@kernel.org> wrote:
> > 
> > > > I suggest applying
> > > > this patch first, and then fix "mm, oom: cgroup-aware OOM killer" patch.
> > > 
> > > Well, I hope the whole pile gets merged in the upcoming merge window
> > > rather than stall even more.
> > 
> > I'm more inclined to drop it all.  David has identified significant
> > shortcomings and I'm not seeing a way of addressing those shortcomings
> > in a backward-compatible fashion.  Therefore there is no way forward
> > at present.
> 
> Well, I thought we have argued about those "shortcomings" back and forth
> and expressed that they are not really a problem for workloads which are
> going to use the feature. The backward compatibility has been explained
> as well AFAICT.

Feel free to re-explain.  It's the only way we'll get there.

David has proposed an alternative patchset.  IIRC Roman gave that a
one-line positive response but I don't think it has seen a lot of
attention?
Michal Hocko June 4, 2018, 7:04 a.m. UTC | #6
On Fri 01-06-18 14:11:10, Andrew Morton wrote:
> On Fri, 1 Jun 2018 17:28:01 +0200 Michal Hocko <mhocko@kernel.org> wrote:
> 
> > On Tue 29-05-18 16:07:00, Andrew Morton wrote:
> > > On Tue, 29 May 2018 09:17:41 +0200 Michal Hocko <mhocko@kernel.org> wrote:
> > > 
> > > > > I suggest applying
> > > > > this patch first, and then fix "mm, oom: cgroup-aware OOM killer" patch.
> > > > 
> > > > Well, I hope the whole pile gets merged in the upcoming merge window
> > > > rather than stall even more.
> > > 
> > > I'm more inclined to drop it all.  David has identified significant
> > > shortcomings and I'm not seeing a way of addressing those shortcomings
> > > in a backward-compatible fashion.  Therefore there is no way forward
> > > at present.
> > 
> > Well, I thought we have argued about those "shortcomings" back and forth
> > and expressed that they are not really a problem for workloads which are
> > going to use the feature. The backward compatibility has been explained
> > as well AFAICT.
> 
> Feel free to re-explain.  It's the only way we'll get there.

OK, I will go and add my points to the last version of the patchset.

> David has proposed an alternative patchset.  IIRC Roman gave that a
> one-line positive response but I don't think it has seen a lot of
> attention?

I plan to go and revisit that. My preliminary feedback is that a more
generic policy API is really tricky and the patchset has many holes
there. But I will come with a more specific feedback in the respective
thread.
Tetsuo Handa June 4, 2018, 10:41 a.m. UTC | #7
On 2018/06/04 16:04, Michal Hocko wrote:
> On Fri 01-06-18 14:11:10, Andrew Morton wrote:
>> On Fri, 1 Jun 2018 17:28:01 +0200 Michal Hocko <mhocko@kernel.org> wrote:
>>
>>> On Tue 29-05-18 16:07:00, Andrew Morton wrote:
>>>> On Tue, 29 May 2018 09:17:41 +0200 Michal Hocko <mhocko@kernel.org> wrote:
>>>>
>>>>>> I suggest applying
>>>>>> this patch first, and then fix "mm, oom: cgroup-aware OOM killer" patch.
>>>>>
>>>>> Well, I hope the whole pile gets merged in the upcoming merge window
>>>>> rather than stall even more.
>>>>
>>>> I'm more inclined to drop it all.  David has identified significant
>>>> shortcomings and I'm not seeing a way of addressing those shortcomings
>>>> in a backward-compatible fashion.  Therefore there is no way forward
>>>> at present.
>>>
>>> Well, I thought we have argued about those "shortcomings" back and forth
>>> and expressed that they are not really a problem for workloads which are
>>> going to use the feature. The backward compatibility has been explained
>>> as well AFAICT.
>>
>> Feel free to re-explain.  It's the only way we'll get there.
> 
> OK, I will go and my points to the last version of the patchset.
> 
>> David has proposed an alternative patchset.  IIRC Roman gave that a
>> one-line positive response but I don't think it has seen a lot of
>> attention?
> 
> I plan to go and revisit that. My preliminary feedback is that a more
> generic policy API is really tricky and the patchset has many holes
> there. But I will come with a more specific feedback in the respective
> thread.
> 
Is the current version of the "mm, oom: cgroup-aware OOM killer" patchset going to be
dropped for now? I want to know which state I should use as the baseline for my patch.
Michal Hocko June 4, 2018, 11:22 a.m. UTC | #8
On Mon 04-06-18 19:41:01, Tetsuo Handa wrote:
> On 2018/06/04 16:04, Michal Hocko wrote:
> > On Fri 01-06-18 14:11:10, Andrew Morton wrote:
> >> On Fri, 1 Jun 2018 17:28:01 +0200 Michal Hocko <mhocko@kernel.org> wrote:
> >>
> >>> On Tue 29-05-18 16:07:00, Andrew Morton wrote:
> >>>> On Tue, 29 May 2018 09:17:41 +0200 Michal Hocko <mhocko@kernel.org> wrote:
> >>>>
> >>>>>> I suggest applying
> >>>>>> this patch first, and then fix "mm, oom: cgroup-aware OOM killer" patch.
> >>>>>
> >>>>> Well, I hope the whole pile gets merged in the upcoming merge window
> >>>>> rather than stall even more.
> >>>>
> >>>> I'm more inclined to drop it all.  David has identified significant
> >>>> shortcomings and I'm not seeing a way of addressing those shortcomings
> >>>> in a backward-compatible fashion.  Therefore there is no way forward
> >>>> at present.
> >>>
> >>> Well, I thought we have argued about those "shortcomings" back and forth
> >>> and expressed that they are not really a problem for workloads which are
> >>> going to use the feature. The backward compatibility has been explained
> >>> as well AFAICT.
> >>
> >> Feel free to re-explain.  It's the only way we'll get there.
> > 
> > OK, I will go and my points to the last version of the patchset.
> > 
> >> David has proposed an alternative patchset.  IIRC Roman gave that a
> >> one-line positive response but I don't think it has seen a lot of
> >> attention?
> > 
> > I plan to go and revisit that. My preliminary feedback is that a more
> > generic policy API is really tricky and the patchset has many holes
> > there. But I will come with a more specific feedback in the respective
> > thread.
> > 
> Is current version of "mm, oom: cgroup-aware OOM killer" patchset going to be
> dropped for now? I want to know which state should I use for baseline for my patch.

Is it that urgent that it cannot wait until after the merge window when
things should settle down?
Tetsuo Handa June 4, 2018, 11:30 a.m. UTC | #9
On 2018/06/04 20:22, Michal Hocko wrote:
> On Mon 04-06-18 19:41:01, Tetsuo Handa wrote:
>> On 2018/06/04 16:04, Michal Hocko wrote:
>>> On Fri 01-06-18 14:11:10, Andrew Morton wrote:
>>>> On Fri, 1 Jun 2018 17:28:01 +0200 Michal Hocko <mhocko@kernel.org> wrote:
>>>>
>>>>> On Tue 29-05-18 16:07:00, Andrew Morton wrote:
>>>>>> On Tue, 29 May 2018 09:17:41 +0200 Michal Hocko <mhocko@kernel.org> wrote:
>>>>>>
>>>>>>>> I suggest applying
>>>>>>>> this patch first, and then fix "mm, oom: cgroup-aware OOM killer" patch.
>>>>>>>
>>>>>>> Well, I hope the whole pile gets merged in the upcoming merge window
>>>>>>> rather than stall even more.
>>>>>>
>>>>>> I'm more inclined to drop it all.  David has identified significant
>>>>>> shortcomings and I'm not seeing a way of addressing those shortcomings
>>>>>> in a backward-compatible fashion.  Therefore there is no way forward
>>>>>> at present.
>>>>>
>>>>> Well, I thought we have argued about those "shortcomings" back and forth
>>>>> and expressed that they are not really a problem for workloads which are
>>>>> going to use the feature. The backward compatibility has been explained
>>>>> as well AFAICT.
>>>>
>>>> Feel free to re-explain.  It's the only way we'll get there.
>>>
>>> OK, I will go and my points to the last version of the patchset.
>>>
>>>> David has proposed an alternative patchset.  IIRC Roman gave that a
>>>> one-line positive response but I don't think it has seen a lot of
>>>> attention?
>>>
>>> I plan to go and revisit that. My preliminary feedback is that a more
>>> generic policy API is really tricky and the patchset has many holes
>>> there. But I will come with a more specific feedback in the respective
>>> thread.
>>>
>> Is current version of "mm, oom: cgroup-aware OOM killer" patchset going to be
>> dropped for now? I want to know which state should I use for baseline for my patch.
> 
> Is it that urgent that it cannot wait until after the merge window when
> things should settle down?
> 
Yes, because it is a regression fix which I expected to make it in time for 4.17.
I want to apply it before the OOM killer code gets complicated by the cgroup-aware OOM killer.
David Rientjes June 6, 2018, 9:02 a.m. UTC | #10
On Mon, 4 Jun 2018, Tetsuo Handa wrote:

> Is current version of "mm, oom: cgroup-aware OOM killer" patchset going to be
> dropped for now? I want to know which state should I use for baseline for my patch.
> 

My patchset addressing the issues with the cgroup-aware oom killer was
posted on March 22 at https://marc.info/?l=linux-kernel&m=152175564104466.
It fixes its calculations (the current version in -mm is completely buggy
for oom_score_adj, fixed in my patch 4/6) and its context based errors
(discounting mempolicy oom kills, fixed in my patch 6/6), and makes it
generally useful beyond highly specialized usecases in a backwards
compatible way.

The base patchset is seemingly abandoned in -mm, unfortunately, so I think 
all oom killer patches should be based on Linus's tree.
Tetsuo Handa June 6, 2018, 1:37 p.m. UTC | #11
On 2018/06/06 18:02, David Rientjes wrote:
> On Mon, 4 Jun 2018, Tetsuo Handa wrote:
> 
>> Is current version of "mm, oom: cgroup-aware OOM killer" patchset going to be
>> dropped for now? I want to know which state should I use for baseline for my patch.
>>
> 
> My patchset addressing the issues with the cgroup-aware oom killer was
> posted on March 22 at https://marc.info/?l=linux-kernel&m=152175564104466.
> It fixes its calculations (the current version in -mm is completely buggy
> for oom_score_adj, fixed in my patch 4/6) and its context based errors
> (discounting mempolicy oom kills, fixed in my patch 6/6), and makes it
> generally useful beyond highly specialized usecases in a backwards
> compatible way.
> 
> The base patchset is seemingly abandoned in -mm, unfortunately, so I think 
> all oom killer patches should be based on Linus's tree.
> 
OK. I will use linux.git as a base.

By the way, does "[RFC] Getting rid of INFLIGHT_VICTIM" simplify or break
your cgroup-aware oom killer? If it simplifies your work, I'd like to apply
it as well.
David Rientjes June 6, 2018, 6:44 p.m. UTC | #12
On Wed, 6 Jun 2018, Tetsuo Handa wrote:

> OK. I will use linux.git as a base.
> 
> By the way, does "[RFC] Getting rid of INFLIGHT_VICTIM" simplify or break
> your cgroup-aware oom killer? If it simplifies your work, I'd like to apply
> it as well.
> 

I think it impacts the proposal to allow the oom reaper to operate over
several different mm's in its list without having to process one, wait to
give up, remove it, and move on to the next one.  It doesn't impact the
cgroup-aware oom killer extension that I made, other than trivial patch
conflicts.  I think it's simpler if we can iterate over the oom reaper
list to determine inflight victims.
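
As a rough sketch of that last idea (illustrative only, not code from any
posted patch; it assumes the existing oom_reaper_list/oom_reaper_lock from
mm/oom_kill.c, the signal->oom_mm pointer and the MMF_OOM_SKIP flag, and
the helper name is made up):

	/* Treat any queued, not-yet-reaped victim as "inflight". */
	static bool oom_reaper_has_inflight_victim(void)
	{
		struct task_struct *tsk;
		bool inflight = false;

		spin_lock(&oom_reaper_lock);
		for (tsk = oom_reaper_list; tsk; tsk = tsk->oom_reaper_list) {
			struct mm_struct *mm = tsk->signal->oom_mm;

			/* mm not reaped or given up on yet -> hold off */
			if (mm && !test_bit(MMF_OOM_SKIP, &mm->flags)) {
				inflight = true;
				break;
			}
		}
		spin_unlock(&oom_reaper_lock);
		return inflight;
	}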

Patch

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 565e7da55318..7f12e346e477 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -1099,7 +1099,6 @@  bool out_of_memory(struct oom_control *oc)
 {
 	unsigned long freed = 0;
 	enum oom_constraint constraint = CONSTRAINT_NONE;
-	bool delay = false; /* if set, delay next allocation attempt */
 
 	if (oom_killer_disabled)
 		return false;
@@ -1149,10 +1148,8 @@  bool out_of_memory(struct oom_control *oc)
 		return true;
 	}
 
-	if (mem_cgroup_select_oom_victim(oc) && oom_kill_memcg_victim(oc)) {
-		delay = true;
+	if (mem_cgroup_select_oom_victim(oc) && oom_kill_memcg_victim(oc))
 		goto out;
-	}
 
 	select_bad_process(oc);
 	/* Found nothing?!?! Either we hang forever, or we panic. */
@@ -1160,20 +1157,10 @@  bool out_of_memory(struct oom_control *oc)
 		dump_header(oc, NULL);
 		panic("Out of memory and no killable processes...\n");
 	}
-	if (oc->chosen_task && oc->chosen_task != INFLIGHT_VICTIM) {
+	if (oc->chosen_task && oc->chosen_task != INFLIGHT_VICTIM)
 		oom_kill_process(oc, !is_memcg_oom(oc) ? "Out of memory" :
 				 "Memory cgroup out of memory");
-		delay = true;
-	}
-
 out:
-	/*
-	 * Give the killed process a good chance to exit before trying
-	 * to allocate memory again.
-	 */
-	if (delay)
-		schedule_timeout_killable(1);
-
 	return !!oc->chosen_task;
 }
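
For reference, with the hunks above applied the tail of out_of_memory()
reduces to roughly the following (assembled from the added and context
lines of the diff; the unchanged "found nothing" path is elided):

	if (mem_cgroup_select_oom_victim(oc) && oom_kill_memcg_victim(oc))
		goto out;

	select_bad_process(oc);
	/* ... dump_header()/panic() when nothing is killable, as above ... */
	if (oc->chosen_task && oc->chosen_task != INFLIGHT_VICTIM)
		oom_kill_process(oc, !is_memcg_oom(oc) ? "Out of memory" :
				 "Memory cgroup out of memory");
out:
	return !!oc->chosen_task;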