
[3/4] OOM, PM: OOM killed task shouldn't escape PM suspend

Message ID 20141105174609.GE28226@dhcp22.suse.cz (mailing list archive)
State Not Applicable, archived

Commit Message

Michal Hocko Nov. 5, 2014, 5:46 p.m. UTC
On Wed 05-11-14 11:54:28, Tejun Heo wrote:
> On Wed, Nov 05, 2014 at 05:39:56PM +0100, Michal Hocko wrote:
> > On Wed 05-11-14 11:29:29, Tejun Heo wrote:
> > > Hello, Michal.
> > > 
> > > On Wed, Nov 05, 2014 at 05:01:15PM +0100, Michal Hocko wrote:
> > > > I am not sure I am following. With the latest patch OOM path is no
> > > > longer blocked by the PM (aka oom_killer_disable()). Allocations simply
> > > > fail if the read_trylock fails.
> > > > oom_killer_disable is moved before tasks are frozen and it will wait for
> > > > all on-going OOM killers on the write lock. OOM killer is enabled again
> > > > on the resume path.
> > > 
> > > Sure, but why are we exposing new interfaces?  Can't we just make
> > > oom_killer_disable() first set the disable flag and wait for the
> > > on-going ones to finish (and make the function fail if it gets chosen
> > > as an OOM victim)?
> > 
> > Still not following. How do you want to detect an on-going OOM without
> > any interface around out_of_memory?
> 
> I thought you were using oom_killer_allowed_start() outside OOM path.
> Ugh.... why is everything weirdly structured?  oom_killer_disabled
> implies that oom killer may fail, right?  Why is
> __alloc_pages_slowpath() checking it directly?

Because out_of_memory can be called from multiple paths, and the only
interesting one should be the page allocation path.
pagefault_out_of_memory is not interesting because it cannot happen for
a frozen task.

Now that I am looking at it, maybe even the sysrq OOM trigger should
respect it as well.

> If whether oom killing failed or not is relevant to its users, make
> out_of_memory() return an error code.  There's no reason for the
> exclusion detail to leak out of the oom killer proper.  The only
> interface should be disable/enable and whether oom killing failed or
> not.

Got your point. I can reshuffle the code and move the trylock thingy
inside oom_kill.c. I am not sure it is so much better, because the OOM
knowledge is already spread around (e.g. the oom_zonelist_trylock check
outside of out_of_memory, or oom_gfp_allowed before we enter
__alloc_pages_may_oom). Anyway, I do not care much and I am OK with your
return code convention, since the only other way OOM might fail is when
there is no victim, and we panic then.
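
For context, the oom_sem scheme described above boils down to roughly
the following (a sketch only, not the actual mm/oom_kill.c code; the
enable side and the trylock in out_of_memory are in the patch below):

	static DECLARE_RWSEM(oom_sem);

	void oom_killer_disable(void)
	{
		/*
		 * Exclude new OOM killer invocations and wait for the
		 * on-going ones (readers) to finish.
		 */
		down_write(&oom_sem);
	}

	void oom_killer_enable(void)
	{
		up_write(&oom_sem);
	}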

Something like the patch below (not even compile tested)
---

Comments

Tejun Heo Nov. 5, 2014, 5:55 p.m. UTC | #1
On Wed, Nov 05, 2014 at 06:46:09PM +0100, Michal Hocko wrote:
> Because out_of_memory can be called from multiple paths, and the only
> interesting one should be the page allocation path.
> pagefault_out_of_memory is not interesting because it cannot happen for
> a frozen task.

Hmmm.... wouldn't that be broken by definition tho?  So, if the oom
killer is invoked from somewhere else than page allocation path, it
would proceed ignoring the disabled setting and would race against PM
freeze path all the same.  Why are things broken at such basic levels?
Something named oom_killer_disable makes only a lame attempt at it, and
not even that depending on who's calling.  There probably is a history
leading to the current situation, but the level at which things are
broken is too basic and baffling.  :(
Michal Hocko Nov. 6, 2014, 12:49 p.m. UTC | #2
On Wed 05-11-14 12:55:27, Tejun Heo wrote:
> On Wed, Nov 05, 2014 at 06:46:09PM +0100, Michal Hocko wrote:
> > Because out_of_memory can be called from multiple paths, and the only
> > interesting one should be the page allocation path.
> > pagefault_out_of_memory is not interesting because it cannot happen for
> > a frozen task.
> 
> Hmmm.... wouldn't that be broken by definition tho?  So, if the oom
> killer is invoked from somewhere else than page allocation path, it
> would proceed ignoring the disabled setting and would race against PM
> freeze path all the same. 

Not really because try_to_freeze_tasks doesn't finish until _all_ tasks
are frozen and a task in the page fault path cannot be frozen, can it?

I mean there shouldn't be any problem with not invoking the OOM killer
from the page fault path as well, but that might lead to looping in the
page fault path without any progress until the freezer re-enables the
OOM killer on its failure path, because the said task cannot be frozen.
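
To illustrate, a change along those lines would look roughly like this
(just a sketch on top of the patch below; the faulting task would keep
re-faulting and spinning here until oom_killer_enable() is called on the
resume/failure path):

	void pagefault_out_of_memory(void)
	{
		struct zonelist *zonelist;

		/*
		 * OOM killer disabled (PM freezing in progress): back
		 * off and let the task re-fault until it is enabled
		 * again.
		 */
		if (!down_read_trylock(&oom_sem)) {
			schedule_timeout_killable(1);
			return;
		}

		zonelist = node_zonelist(first_memory_node, GFP_KERNEL);
		if (oom_zonelist_trylock(zonelist, GFP_KERNEL)) {
			__out_of_memory(NULL, 0, 0, NULL, false);
			oom_zonelist_unlock(zonelist, GFP_KERNEL);
		}
		up_read(&oom_sem);
	}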

Is this preferable?
Tejun Heo Nov. 6, 2014, 3:01 p.m. UTC | #3
On Thu, Nov 06, 2014 at 01:49:53PM +0100, Michal Hocko wrote:
> On Wed 05-11-14 12:55:27, Tejun Heo wrote:
> > On Wed, Nov 05, 2014 at 06:46:09PM +0100, Michal Hocko wrote:
> > > Because out_of_memory can be called from multiple paths, and the only
> > > interesting one should be the page allocation path.
> > > pagefault_out_of_memory is not interesting because it cannot happen for
> > > a frozen task.
> > 
> > Hmmm.... wouldn't that be broken by definition tho?  So, if the oom
> > killer is invoked from somewhere else than page allocation path, it
> > would proceed ignoring the disabled setting and would race against PM
> > freeze path all the same. 
> 
> Not really because try_to_freeze_tasks doesn't finish until _all_ tasks
> are frozen and a task in the page fault path cannot be frozen, can it?

We used to have freezing points deep in file system code which may be
reachable from a page fault.  Please take a step back and look at the
paragraph above.  Doesn't it sound extremely contrived and brittle
even if it's not outright broken?  What if somebody adds another oom
killing site somewhere else?  How can this possibly be a solution that
we intentionally implement?

> I mean there shouldn't be any problem with not invoking the OOM killer
> from the page fault path as well, but that might lead to looping in the
> page fault path without any progress until the freezer re-enables the
> OOM killer on its failure path, because the said task cannot be frozen.
> 
> Is this preferable?

Why would PM freezing make OOM killing fail?  That doesn't make much
sense.  Sure, it can block it for a finite duration for sync purposes
but making OOM killing fail seems the wrong way around.  We're doing
one thing for non-PM freezing and the other way around for PM
freezing, which indicates one of the two directions is wrong.

Shouldn't it be that OOM killing happening while PM freezing is in
progress cancels PM freezing rather than the other way around?  Find a
point in the PM suspend/hibernation operation where everything must be
stable, disable OOM killing there, and check whether OOM killing
happened in between; if so, back out (something like the sketch at the
end of this mail).  It seems rather obvious to me that OOM killing has
to have precedence over PM freezing.

Sure, once the system reaches a point where the whole system must be in
a stable state for snapshotting or whatever, disabling OOM killing is
fine; but at that point the system is in a very limited execution mode,
it certainly won't be processing page faults from userland for example,
and we can actually disable OOM killing knowing that anything afterwards
is ready to handle memory allocation failures.
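
Roughly something along these lines (just a sketch; oom_kills is a
hypothetical counter that the OOM killer would bump on every kill, and
the function name is made up):

	/* hypothetical: incremented by the OOM killer for each kill */
	extern atomic_t oom_kills;

	static int freeze_processes_and_disable_oom(void)
	{
		int saved_kills = atomic_read(&oom_kills);
		int error;

		error = freeze_processes();
		if (error)
			return error;

		/*
		 * Past this point everything must cope with allocation
		 * failures instead of relying on the OOM killer.
		 */
		oom_killer_disable();

		/* an OOM kill raced with the freeze: back out and retry */
		if (atomic_read(&oom_kills) != saved_kills) {
			oom_killer_enable();
			thaw_processes();
			return -EBUSY;
		}

		return 0;
	}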

Patch

diff --git a/drivers/tty/sysrq.c b/drivers/tty/sysrq.c
index 42bad18c66c9..14f3d7fd961f 100644
--- a/drivers/tty/sysrq.c
+++ b/drivers/tty/sysrq.c
@@ -355,8 +355,10 @@  static struct sysrq_key_op sysrq_term_op = {
 
 static void moom_callback(struct work_struct *ignored)
 {
-	out_of_memory(node_zonelist(first_memory_node, GFP_KERNEL), GFP_KERNEL,
-		      0, NULL, true);
+	if (!out_of_memory(node_zonelist(first_memory_node, GFP_KERNEL),
+			   GFP_KERNEL, 0, NULL, true)) {
+		printk(KERN_INFO "OOM killer disabled\n");
+	}
 }
 
 static DECLARE_WORK(moom_work, moom_callback);
diff --git a/include/linux/oom.h b/include/linux/oom.h
index 850f7f653eb7..4af99a9b543b 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -68,7 +68,7 @@  extern enum oom_scan_t oom_scan_process_thread(struct task_struct *task,
 		unsigned long totalpages, const nodemask_t *nodemask,
 		bool force_kill);
 
-extern void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
+extern bool out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
 		int order, nodemask_t *mask, bool force_kill);
 extern int register_oom_notifier(struct notifier_block *nb);
 extern int unregister_oom_notifier(struct notifier_block *nb);
@@ -85,21 +85,6 @@  extern void oom_killer_disable(void);
  */
 extern void oom_killer_enable(void);
 
-/**
- * oom_killer_allowed_start - start OOM killer section
- *
- * Synchronise with oom_killer_{disable,enable} sections.
- * Returns 1 if oom_killer is allowed.
- */
-extern int oom_killer_allowed_start(void);
-
-/**
- * oom_killer_allowed_end - end OOM killer section
- *
- * previously started by oom_killer_allowed_end.
- */
-extern void oom_killer_allowed_end(void);
-
 static inline bool oom_gfp_allowed(gfp_t gfp_mask)
 {
 	return (gfp_mask & __GFP_FS) && !(gfp_mask & __GFP_NORETRY);
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 126e7da17cf9..3e136a2c0b1f 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -610,18 +610,8 @@  void oom_killer_enable(void)
 	up_write(&oom_sem);
 }
 
-int oom_killer_allowed_start(void)
-{
-	return down_read_trylock(&oom_sem);
-}
-
-void oom_killer_allowed_end(void)
-{
-	up_read(&oom_sem);
-}
-
 /**
- * out_of_memory - kill the "best" process when we run out of memory
+ * __out_of_memory - kill the "best" process when we run out of memory
  * @zonelist: zonelist pointer
  * @gfp_mask: memory allocation flags
  * @order: amount of memory being requested as a power of 2
@@ -633,7 +623,7 @@  void oom_killer_allowed_end(void)
  * OR try to be smart about which process to kill. Note that we
  * don't have to be perfect here, we just have to be good.
  */
-void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
+static void __out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
 		int order, nodemask_t *nodemask, bool force_kill)
 {
 	const nodemask_t *mpol_mask;
@@ -698,6 +688,27 @@  out:
 		schedule_timeout_killable(1);
 }
 
+/** out_of_memory - tries to invoke the OOM killer.
+ * @zonelist: zonelist pointer
+ * @gfp_mask: memory allocation flags
+ * @order: amount of memory being requested as a power of 2
+ * @nodemask: nodemask passed to page allocator
+ * @force_kill: true if a task must be killed, even if others are exiting
+ *
+ * Invokes __out_of_memory unless the OOM killer has been disabled by
+ * oom_killer_disable(), in which case it returns false. Returns true otherwise.
+ */
+bool out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
+		int order, nodemask_t *nodemask, bool force_kill)
+{
+	if (!down_read_trylock(&oom_sem))
+		return false;
+	__out_of_memory(zonelist, gfp_mask, order, nodemask, force_kill);
+	up_read(&oom_sem);
+
+	return true;
+}
+
 /*
  * The pagefault handler calls here because it is out of memory, so kill a
  * memory-hogging task.  If any populated zone has ZONE_OOM_LOCKED set, a
@@ -712,7 +723,7 @@  void pagefault_out_of_memory(void)
 
 	zonelist = node_zonelist(first_memory_node, GFP_KERNEL);
 	if (oom_zonelist_trylock(zonelist, GFP_KERNEL)) {
-		out_of_memory(NULL, 0, 0, NULL, false);
+		__out_of_memory(NULL, 0, 0, NULL, false);
 		oom_zonelist_unlock(zonelist, GFP_KERNEL);
 	}
 }
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 206ce46ce975..fdbcdd9cd1a9 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2239,10 +2239,11 @@  static inline struct page *
 __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
 	struct zonelist *zonelist, enum zone_type high_zoneidx,
 	nodemask_t *nodemask, struct zone *preferred_zone,
-	int classzone_idx, int migratetype)
+	int classzone_idx, int migratetype, bool *oom_failed)
 {
 	struct page *page;
 
+	*oom_failed = false;
 	/* Acquire the per-zone oom lock for each zone */
 	if (!oom_zonelist_trylock(zonelist, gfp_mask)) {
 		schedule_timeout_uninterruptible(1);
@@ -2279,8 +2280,8 @@  __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
 			goto out;
 	}
 	/* Exhausted what can be done so it's blamo time */
-	out_of_memory(zonelist, gfp_mask, order, nodemask, false);
-
+	if (!out_of_memory(zonelist, gfp_mask, order, nodemask, false))
+		*oom_failed = true;
 out:
 	oom_zonelist_unlock(zonelist, gfp_mask);
 	return page;
@@ -2706,26 +2707,28 @@  rebalance:
 	 */
 	if (!did_some_progress) {
 		if (oom_gfp_allowed(gfp_mask)) {
+			bool oom_failed;
+
 			/* Coredumps can quickly deplete all memory reserves */
 			if ((current->flags & PF_DUMPCORE) &&
 			    !(gfp_mask & __GFP_NOFAIL))
 				goto nopage;
-			/*
-			 * Just make sure that we cannot race with oom_killer
-			 * disabling e.g. PM freezer needs to make sure that
-			 * no OOM happens after all tasks are frozen.
-			 */
-			if (!oom_killer_allowed_start())
-				goto nopage;
 			page = __alloc_pages_may_oom(gfp_mask, order,
 					zonelist, high_zoneidx,
 					nodemask, preferred_zone,
-					classzone_idx, migratetype);
-			oom_killer_allowed_end();
+					classzone_idx, migratetype,
+					&oom_failed);
 
 			if (page)
 				goto got_pg;
 
+			/*
+			 * The OOM killer might be disabled, in which case
+			 * we have to fail the allocation.
+			 */
+			if (oom_failed)
+				goto nopage;
+
 			if (!(gfp_mask & __GFP_NOFAIL)) {
 				/*
 				 * The oom killer is not called for high-order