[3/4] OOM, PM: OOM killed task shouldn't escape PM suspend

Message ID: 20141106130543.GE7202@dhcp22.suse.cz (mailing list archive)
State: Not Applicable, archived

Commit Message

Michal Hocko Nov. 6, 2014, 1:05 p.m. UTC
On Wed 05-11-14 12:01:11, Tejun Heo wrote:
> On Wed, Nov 05, 2014 at 11:54:28AM -0500, Tejun Heo wrote:
> > > Still not following. How do you want to detect an on-going OOM without
> > > any interface around out_of_memory?
> > 
> > I thought you were using oom_killer_allowed_start() outside OOM path.
> > Ugh.... why is everything weirdly structured?  oom_killer_disabled
> > implies that oom killer may fail, right?  Why is
> > __alloc_pages_slowpath() checking it directly?  If whether oom killing
> > failed or not is relevant to its users, make out_of_memory() return an
> > error code.  There's no reason for the exclusion detail to leak out of
> > the oom killer proper.  The only interface should be disable/enable
> > and whether oom killing failed or not.
> 
> And what's implemented is wrong.  What happens if oom killing is
> already in progress and then a task blocks trying to write-lock the
> rwsem and then that task is selected as the OOM victim?

But this is nothing new. Suspend hasn't been checking for fatal signals
nor for TIF_MEMDIE since the OOM disabling was introduced and I suppose
even before.

This is not harmful though. The previous OOM kill attempt would kick the
current task, mark it with TIF_MEMDIE and retry the allocation. After
OOM is disabled the allocation simply fails. The current task will then
die on its way out of the kernel. Definitely worth fixing, but in a
separate patch.
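
A rough sketch of the control flow being described (hedged: simplified
signatures, oom_kill_and_retry() is a made-up stand-in for the
__alloc_pages_may_oom() path, and the loop only loosely follows
mm/page_alloc.c):

/*
 * Slowpath retry loop, schematically: an OOM kill marks the victim
 * (possibly current) with TIF_MEMDIE and the allocation is retried;
 * once the OOM killer is disabled the loop fails the allocation
 * instead, and an already killed task dies on its way out of the
 * kernel.
 */
static struct page *slowpath_sketch(gfp_t gfp_mask, unsigned int order)
{
	struct page *page;

	for (;;) {
		page = get_page_from_freelist(gfp_mask, order);
		if (page)
			return page;
		if (oom_killer_disabled)
			return NULL;	/* fail rather than OOM kill */
		oom_kill_and_retry(gfp_mask, order);	/* may pick current */
	}
}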

> disable() call must be able to fail.

This would be a way to do it without requiring the caller to check for
TIF_MEMDIE explicitly. The fewer such checks we have, the better.
---
From 3a7e18144a369bfc537c1cda4c7c2c548e9114b8 Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.cz>
Date: Thu, 6 Nov 2014 11:51:34 +0100
Subject: [PATCH] OOM, PM: handle pm freezer as an OOM victim correctly

The PM freezer doesn't check whether it has been killed by the OOM
killer after it disables the OOM killer, which means that it continues
with the suspend even though it should die as soon as possible. This
has been the case ever since PM suspend started disabling the OOM
killer, and I suppose OOM was ignored even before.

This is not harmful though. The allocation which triggered the OOM
will be retried after a process is killed, and the next attempt will
fail because the OOM killer is disabled by then, so there is no risk
of an endless loop even though the OOM victim doesn't die.

But this is a correctness issue because no task should ignore OOM.
As suggested by Tejun, oom_killer_disable now returns a success status.
If the current task has fatal signals pending or TIF_MEMDIE set after
oom_sem is taken, the caller should bail out, and that is what
freeze_processes does with this patch.

Signed-off-by: Michal Hocko <mhocko@suse.cz>
---
 include/linux/oom.h    |  4 +++-
 kernel/power/process.c | 16 ++++++++++------
 mm/oom_kill.c          | 12 +++++++++++-
 3 files changed, 24 insertions(+), 8 deletions(-)

Comments

Tejun Heo Nov. 6, 2014, 3:09 p.m. UTC | #1
On Thu, Nov 06, 2014 at 02:05:43PM +0100, Michal Hocko wrote:
> But this is nothing new. Suspend hasn't been checking for fatal signals
> nor for TIF_MEMDIE since the OOM disabling was introduced and I suppose
> even before.
> 
> This is not harmful though. The previous OOM kill attempt would kick the
> current task, mark it with TIF_MEMDIE and retry the allocation. After
> OOM is disabled the allocation simply fails. The current task will then
> die on its way out of the kernel. Definitely worth fixing, but in a
> separate patch.

Hah?  Isn't this a new outright A-B B-A deadlock involving the rwsem
you added?

> > disable() call must be able to fail.
> 
> This would be a way to do it without requiring the caller to check for
> TIF_MEMDIE explicitly. The fewer such checks we have, the better.

Why the hell would the caller ever even KNOW about this?  This is
something which must be encapsulated in the OOM killer disable/enable
interface.

> +bool oom_killer_disable(void)
>  {
> +	bool ret = true;
> +
>  	down_write(&oom_sem);

How would this task pass the above down_write() if the OOM killer is
already read locking oom_sem?  Or is the OOM killer guaranteed to make
forward progress even if the killed task can't make forward progress?
But, if so, what are we talking about in this thread?

> +
> +	/* We might have been killed while waiting for the oom_sem. */
> +	if (fatal_signal_pending(current) || test_thread_flag(TIF_MEMDIE)) {
> +		up_write(&oom_sem);
> +		ret = false;
> +	}

This is pointless.  What does the above do?
Michal Hocko Nov. 6, 2014, 4:01 p.m. UTC | #2
On Thu 06-11-14 10:09:27, Tejun Heo wrote:
> On Thu, Nov 06, 2014 at 02:05:43PM +0100, Michal Hocko wrote:
> > But this is nothing new. Suspend hasn't been checking for fatal signals
> > nor for TIF_MEMDIE since the OOM disabling was introduced and I suppose
> > even before.
> > 
> > This is not harmful though. The previous OOM kill attempt would kick the
> > current task, mark it with TIF_MEMDIE and retry the allocation. After
> > OOM is disabled the allocation simply fails. The current task will then
> > die on its way out of the kernel. Definitely worth fixing, but in a
> > separate patch.
> 
> Hah?  Isn't this a new outright A-B B-A deadlock involving the rwsem
> you added?

No, see below.
 
> > > disable() call must be able to fail.
> > 
> > This would be a way to do it without requiring the caller to check for
> > TIF_MEMDIE explicitly. The fewer such checks we have, the better.
> 
> Why the hell would the caller ever even KNOW about this?  This is
> something which must be encapsulated in the OOM killer disable/enable
> interface.
> 
> > +bool oom_killer_disable(void)
> >  {
> > +	bool ret = true;
> > +
> >  	down_write(&oom_sem);
> 
> How would this task pass the above down_write() if the OOM killer is
> already read locking oom_sem?  Or is the OOM killer guaranteed to make
> forward progress even if the killed task can't make forward progress?
> But, if so, what are we talking about in this thread?

Yes, the OOM killer simply kicks the process, sets TIF_MEMDIE and
terminates. That releases the read lock, allows this path to take the
write lock and check whether the current task has been killed, all
without races. The OOM killer doesn't wait for the killed task. The
allocation is retried.
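
Schematically, the two sides of oom_sem pair up like this (a simplified
sketch of this series' locking, not the exact or complete code):

/* OOM killer path (read side): kill, mark, return - never waits
 * for the victim to exit. */
down_read(&oom_sem);
if (!oom_killer_disabled)
	oom_kill_process(victim);	/* SIGKILL + TIF_MEMDIE */
up_read(&oom_sem);			/* lets the write locker in */

/* freeze_processes path (write side): may itself be the victim */
if (!oom_killer_disable())		/* takes down_write(&oom_sem) */
	return -EBUSY;			/* killed while waiting */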

Does this explain your concern?

[...]
Tejun Heo Nov. 6, 2014, 4:12 p.m. UTC | #3
On Thu, Nov 06, 2014 at 05:01:58PM +0100, Michal Hocko wrote:
> Yes, the OOM killer simply kicks the process, sets TIF_MEMDIE and
> terminates. That releases the read lock, allows this path to take the
> write lock and check whether the current task has been killed, all
> without races. The OOM killer doesn't wait for the killed task. The
> allocation is retried.
> 
> Does this explain your concern?

Draining oom killer then doesn't mean anything, no?  OOM killer may
have been disabled and drained but the killed tasks might wake up
after the PM freezer considers them to be frozen, right?  What am I
missing?
Michal Hocko Nov. 6, 2014, 4:31 p.m. UTC | #4
On Thu 06-11-14 11:12:11, Tejun Heo wrote:
> On Thu, Nov 06, 2014 at 05:01:58PM +0100, Michal Hocko wrote:
> > Yes, the OOM killer simply kicks the process, sets TIF_MEMDIE and
> > terminates. That releases the read lock, allows this path to take the
> > write lock and check whether the current task has been killed, all
> > without races. The OOM killer doesn't wait for the killed task. The
> > allocation is retried.
> > 
> > Does this explain your concern?
> 
> Draining oom killer then doesn't mean anything, no?  OOM killer may
> have been disabled and drained but the killed tasks might wake up
> after the PM freezer considers them to be frozen, right?  What am I
> missing?

The mutual exclusion between OOM and the freezer ensures that the
victim will already have TIF_MEMDIE set when try_to_freeze_tasks even
starts. Then freezing_slow_path won't allow the task to enter the
fridge, so the wake-up moment is not really that important.
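
For reference, the check being referred to is the TIF_MEMDIE test in
freezing_slow_path() (kernel/freezer.c; shown roughly as of this time
frame, some versions tested current via test_thread_flag() instead):

bool freezing_slow_path(struct task_struct *p)
{
	if (p->flags & (PF_NOFREEZE | PF_SUSPEND_TASK))
		return false;

	/* an OOM victim is never considered freezable */
	if (test_tsk_thread_flag(p, TIF_MEMDIE))
		return false;

	if (pm_nosig_freezing || cgroup_freezing(p))
		return true;

	if (pm_freezing && !(p->flags & PF_KTHREAD))
		return true;

	return false;
}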
Tejun Heo Nov. 6, 2014, 4:33 p.m. UTC | #5
On Thu, Nov 06, 2014 at 05:31:24PM +0100, Michal Hocko wrote:
> On Thu 06-11-14 11:12:11, Tejun Heo wrote:
> > On Thu, Nov 06, 2014 at 05:01:58PM +0100, Michal Hocko wrote:
> > > Yes, the OOM killer simply kicks the process, sets TIF_MEMDIE and
> > > terminates. That releases the read lock, allows this path to take the
> > > write lock and check whether the current task has been killed, all
> > > without races. The OOM killer doesn't wait for the killed task. The
> > > allocation is retried.
> > > 
> > > Does this explain your concern?
> > 
> > Draining oom killer then doesn't mean anything, no?  OOM killer may
> > have been disabled and drained but the killed tasks might wake up
> > after the PM freezer considers them to be frozen, right?  What am I
> > missing?
> 
> The mutual exclusion between OOM and the freezer ensures that the
> victim will already have TIF_MEMDIE set when try_to_freeze_tasks even
> starts. Then freezing_slow_path won't allow the task to enter the
> fridge, so the wake-up moment is not really that important.

What if it was already in the freezer?
Michal Hocko Nov. 6, 2014, 4:58 p.m. UTC | #6
On Thu 06-11-14 11:33:04, Tejun Heo wrote:
> On Thu, Nov 06, 2014 at 05:31:24PM +0100, Michal Hocko wrote:
> > On Thu 06-11-14 11:12:11, Tejun Heo wrote:
[...]
> > > Draining oom killer then doesn't mean anything, no?  OOM killer may
> > > have been disabled and drained but the killed tasks might wake up
> > > after the PM freezer considers them to be frozen, right?  What am I
> > > missing?
> > 
> > The mutual exclusion between OOM and the freezer ensures that the
> > victim will already have TIF_MEMDIE set when try_to_freeze_tasks even
> > starts. Then freezing_slow_path won't allow the task to enter the
> > fridge, so the wake-up moment is not really that important.
> 
> What if it was already in the freezer?

Good question! You are right that there is a race window until the wake
up in that case. I will think about it some more. There is simply no
control over when the task wakes up, and the freezer will see it as
frozen until then. An immediate way around would be to check for
TIF_MEMDIE in try_to_freeze_tasks, along the lines of the sketch below.
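
(A hypothetical sketch only, nothing that has been posted; the helper
name is made up and it would be called after the freezing loop in
try_to_freeze_tasks():)

/*
 * Fail the freeze if any task has been OOM-killed and may still
 * wake up later only to exit and free its memory.
 */
static bool memdie_victim_exists(void)
{
	struct task_struct *g, *p;
	bool ret = false;

	read_lock(&tasklist_lock);
	for_each_process_thread(g, p) {
		if (test_tsk_thread_flag(p, TIF_MEMDIE)) {
			ret = true;
			goto out;
		}
	}
out:
	read_unlock(&tasklist_lock);
	return ret;
}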

I have to call it a day unfortunately and will be back on Monday.

Patch

diff --git a/include/linux/oom.h b/include/linux/oom.h
index 4af99a9b543b..a978bf2b02a1 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -77,8 +77,10 @@  extern int unregister_oom_notifier(struct notifier_block *nb);
  * oom_killer_disable - disable OOM killer in page allocator
  *
  * Forces all page allocations to fail rather than trigger OOM killer.
+ * Returns true on success and false if the OOM killer couldn't be
+ * disabled (e.g. because the current task has been killed meanwhile)
  */
-extern void oom_killer_disable(void);
+extern bool oom_killer_disable(void);
 
 /**
  * oom_killer_enable - enable OOM killer
diff --git a/kernel/power/process.c b/kernel/power/process.c
index 7d08d56cbf3f..0f8b782f9215 100644
--- a/kernel/power/process.c
+++ b/kernel/power/process.c
@@ -123,6 +123,16 @@  int freeze_processes(void)
 	if (error)
 		return error;
 
+	/*
+	 * Need to exclude OOM killer from triggering while tasks are
+	 * getting frozen to make sure none of them gets killed after
+	 * try_to_freeze_tasks is done.
+	 */
+	if (!oom_killer_disable()) {
+		usermodehelper_enable();
+		return -EBUSY;
+	}
+
 	/* Make sure this task doesn't get frozen */
 	current->flags |= PF_SUSPEND_TASK;
 
@@ -133,12 +143,6 @@  int freeze_processes(void)
 	printk("Freezing user space processes ... ");
 	pm_freezing = true;
 
-	/*
-	 * Need to exclude OOM killer from triggering while tasks are
-	 * getting frozen to make sure none of them gets killed after
-	 * try_to_freeze_tasks is done.
-	 */
-	oom_killer_disable();
 	error = try_to_freeze_tasks(true);
 	if (!error) {
 		__usermodehelper_set_disable_depth(UMH_DISABLED);
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index f80c5b777f05..58ade54ee421 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -600,9 +600,19 @@  void oom_zonelist_unlock(struct zonelist *zonelist, gfp_t gfp_mask)
 
 static DECLARE_RWSEM(oom_sem);
 
-void oom_killer_disable(void)
+bool oom_killer_disable(void)
 {
+	bool ret = true;
+
 	down_write(&oom_sem);
+
+	/* We might have been killed while waiting for the oom_sem. */
+	if (fatal_signal_pending(current) || test_thread_flag(TIF_MEMDIE)) {
+		up_write(&oom_sem);
+		ret = false;
+	}
+
+	return ret;
 }
 
 void oom_killer_enable(void)