
[RFC,2/2] OOM, PM: make OOM detection in the freezer path raceless

Message ID 1416345006-8284-2-git-send-email-mhocko@suse.cz (mailing list archive)
State RFC, archived

Commit Message

Michal Hocko Nov. 18, 2014, 9:10 p.m. UTC
Commit 5695be142e20 (OOM, PM: OOM killed task shouldn't escape PM suspend)
has left a race window open: the OOM killer may still manage to
note_oom_kill after freeze_processes has checked the counter. The race
window is quite small and really unlikely to be hit, so a partial solution
was deemed sufficient at the time of submission.

Tejun wasn't happy with this partial solution, though, and insisted on a
full one. That requires full exclusion between the OOM killer and the
freezer's task freezing. This is what this patch does: it introduces an
oom_sem RW semaphore and turns oom_killer_disable() into a full OOM
barrier.

oom_killer_disabled is now checked at the out_of_memory level, which takes
the lock for reading. This also means that the page fault path is covered
now as well, although it was assumed to be safe before. As per Tejun, "We
used to have freezing points deep in file system code which may be
reachable from page fault.", so it is better and more robust not to rely
on freezing points here. The same applies to the memcg OOM killer.

out_of_memory now tells the caller whether the OOM killer was allowed to
trigger, and the callers are supposed to handle the situation. The page
allocation path simply fails the allocation, same as before. The page
fault path will keep retrying the fault until the freezer fails, and the
Sysrq OOM trigger will simply complain to the log.

oom_killer_disable takes oom_sem for writing and, after it disables
further OOM killer invocations, it checks for any OOM victims which are
still alive (because they haven't woken up to handle the pending signal
yet). Victims are counted via {un}mark_tsk_oom_victim. The last victim
signals the completion via oom_victims_wait, on which oom_killer_disable()
waits if it sees a non-zero oom_victims count. This is safe because
mark_tsk_oom_victim cannot be called after oom_killer_disabled is set,
unmark_tsk_oom_victim signals the completion only for the last OOM victim
while OOM is disabled, and oom_killer_disable waits for the completion
only if there was at least one victim at the time it disabled the OOM
killer.
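Condensed, the disable side of the handshake (see the diff below) looks
like this:

bool oom_killer_disable(void)
{
	int count;

	/* make sure we do not race with an ongoing OOM kill */
	down_write(&oom_sem);
	if (!test_tsk_thread_flag(current, TIF_MEMDIE))
		oom_killer_disabled = true;
	count = atomic_read(&oom_victims);
	up_write(&oom_sem);

	/* wait for the victims which are already on their way out */
	if (count && oom_killer_disabled)
		wait_for_completion(&oom_victims_wait);

	return oom_killer_disabled;
}

and the last exiting victim wakes it up from unmark_tsk_oom_victim via
complete(&oom_victims_wait).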

As oom_killer_disable is a full OOM barrier now, we can postpone it until
after all freezable tasks have been frozen by the PM freezer. This reduces
the time during which the OOM killer is put out of order and so reduces
the chances of misbehavior due to unexpected allocation failures.

TODO:
The Android low memory killer abuses mark_tsk_oom_victim in lowmem_scan
and has to learn about the oom_disable logic as well.

Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Michal Hocko <mhocko@suse.cz>
---
 drivers/tty/sysrq.c    |  6 ++--
 include/linux/oom.h    | 26 ++++++++------
 kernel/power/process.c | 60 +++++++++-----------------------
 mm/memcontrol.c        |  4 ++-
 mm/oom_kill.c          | 94 +++++++++++++++++++++++++++++++++++++++++---------
 mm/page_alloc.c        | 32 ++++++++---------
 6 files changed, 132 insertions(+), 90 deletions(-)

Comments

Rafael J. Wysocki Nov. 27, 2014, 12:47 a.m. UTC | #1
On Tuesday, November 18, 2014 10:10:06 PM Michal Hocko wrote:
> 5695be142e20 (OOM, PM: OOM killed task shouldn't escape PM suspend)
> has left a race window when OOM killer manages to note_oom_kill after
> freeze_processes checks the counter. The race window is quite small and
> really unlikely and partial solution deemed sufficient at the time of
> submission.
> 
> Tejun wasn't happy about this partial solution though and insisted on a
> full solution. That requires the full OOM and freezer's task freezing
> exclusion, though. This is done by this patch which introduces oom_sem
> RW lock and turns oom_killer_disable() into a full OOM barrier.
> 
> oom_killer_disabled is now checked at out_of_memory level which takes
> the lock for reading. This also means that the page fault path is
> covered now as well although it was assumed to be safe before. As per
> Tejun, "We used to have freezing points deep in file system code which
> may be reacheable from page fault." so it would be better and more
> robust to not rely on freezing points here. Same applies to the memcg
> OOM killer.
> 
> out_of_memory tells the caller whether the OOM was allowed to
> trigger and the callers are supposed to handle the situation. The page
> allocation path simply fails the allocation same as before. The page
> fault path will be retrying the fault until the freezer fails and Sysrq
> OOM trigger will simply complain to the log.
> 
> oom_killer_disable takes oom_sem for writing and after it disables
> further OOM killer invocations it checks for any OOM victims which
> are still alive (because they haven't woken up to handle the pending
> signal). Victims are counted via {un}mark_tsk_oom_victim. The
> last victim signals the completion via oom_victims_wait on which
> oom_killer_disable() waits if it sees non zero oom_victims.
> This is safe against both mark_tsk_oom_victim which cannot be called
> after oom_killer_disabled is set and unmark_tsk_oom_victim signals the
> completion only for the last oom_victim when oom is disabled and
> oom_killer_disable waits for completion only of there was at least one
> victim at the time it disabled the oom.
> 
> As oom_killer_disable is a full OOM barrier now we can postpone it to
> later after all freezable tasks are frozen during PM freezer. This
> reduces the time when OOM is put out order and so reduces chances of
> misbehavior due to unexpected allocation failures.
> 
> TODO:
> Android lowmemory killer abuses mark_tsk_oom_victim in lowmem_scan
> and it has to learn about oom_disable logic as well.
> 
> Suggested-by: Tejun Heo <tj@kernel.org>
> Signed-off-by: Michal Hocko <mhocko@suse.cz>

This appears to do the right thing to me, although I admit I haven't checked
the details very carefully.

Tejun?

> ---
>  drivers/tty/sysrq.c    |  6 ++--
>  include/linux/oom.h    | 26 ++++++++------
>  kernel/power/process.c | 60 +++++++++-----------------------
>  mm/memcontrol.c        |  4 ++-
>  mm/oom_kill.c          | 94 +++++++++++++++++++++++++++++++++++++++++---------
>  mm/page_alloc.c        | 32 ++++++++---------
>  6 files changed, 132 insertions(+), 90 deletions(-)
> 
> diff --git a/drivers/tty/sysrq.c b/drivers/tty/sysrq.c
> index 42bad18c66c9..6818589c1004 100644
> --- a/drivers/tty/sysrq.c
> +++ b/drivers/tty/sysrq.c
> @@ -355,8 +355,10 @@ static struct sysrq_key_op sysrq_term_op = {
>  
>  static void moom_callback(struct work_struct *ignored)
>  {
> -	out_of_memory(node_zonelist(first_memory_node, GFP_KERNEL), GFP_KERNEL,
> -		      0, NULL, true);
> +	if (!out_of_memory(node_zonelist(first_memory_node, GFP_KERNEL),
> +			   GFP_KERNEL, 0, NULL, true)) {
> +		printk(KERN_INFO "OOM request ignored because killer is disabled\n");
> +	}
>  }
>  
>  static DECLARE_WORK(moom_work, moom_callback);
> diff --git a/include/linux/oom.h b/include/linux/oom.h
> index 8f7e74f8ab3a..d802575c9307 100644
> --- a/include/linux/oom.h
> +++ b/include/linux/oom.h
> @@ -72,22 +72,26 @@ extern enum oom_scan_t oom_scan_process_thread(struct task_struct *task,
>  		unsigned long totalpages, const nodemask_t *nodemask,
>  		bool force_kill);
>  
> -extern void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
> +extern bool out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
>  		int order, nodemask_t *mask, bool force_kill);
>  extern int register_oom_notifier(struct notifier_block *nb);
>  extern int unregister_oom_notifier(struct notifier_block *nb);
>  
> -extern bool oom_killer_disabled;
> -
> -static inline void oom_killer_disable(void)
> -{
> -	oom_killer_disabled = true;
> -}
> +/**
> + * oom_killer_disable - disable OOM killer
> + *
> + * Forces all page allocations to fail rather than trigger OOM killer.
> + * Will block and wait until all OOM victims are dead.
> + *
> + * Returns true if successfull and false if the OOM killer cannot be
> + * disabled.
> + */
> +extern bool oom_killer_disable(void);
>  
> -static inline void oom_killer_enable(void)
> -{
> -	oom_killer_disabled = false;
> -}
> +/**
> + * oom_killer_enable - enable OOM killer
> + */
> +extern void oom_killer_enable(void);
>  
>  static inline bool oom_gfp_allowed(gfp_t gfp_mask)
>  {
> diff --git a/kernel/power/process.c b/kernel/power/process.c
> index 5a6ec8678b9a..a4306e39f35c 100644
> --- a/kernel/power/process.c
> +++ b/kernel/power/process.c
> @@ -108,30 +108,6 @@ static int try_to_freeze_tasks(bool user_only)
>  	return todo ? -EBUSY : 0;
>  }
>  
> -static bool __check_frozen_processes(void)
> -{
> -	struct task_struct *g, *p;
> -
> -	for_each_process_thread(g, p)
> -		if (p != current && !freezer_should_skip(p) && !frozen(p))
> -			return false;
> -
> -	return true;
> -}
> -
> -/*
> - * Returns true if all freezable tasks (except for current) are frozen already
> - */
> -static bool check_frozen_processes(void)
> -{
> -	bool ret;
> -
> -	read_lock(&tasklist_lock);
> -	ret = __check_frozen_processes();
> -	read_unlock(&tasklist_lock);
> -	return ret;
> -}
> -
>  /**
>   * freeze_processes - Signal user space processes to enter the refrigerator.
>   * The current thread will not be frozen.  The same process that calls
> @@ -142,7 +118,6 @@ static bool check_frozen_processes(void)
>  int freeze_processes(void)
>  {
>  	int error;
> -	int oom_kills_saved;
>  
>  	error = __usermodehelper_disable(UMH_FREEZING);
>  	if (error)
> @@ -157,27 +132,11 @@ int freeze_processes(void)
>  	pm_wakeup_clear();
>  	printk("Freezing user space processes ... ");
>  	pm_freezing = true;
> -	oom_kills_saved = oom_kills_count();
>  	error = try_to_freeze_tasks(true);
>  	if (!error) {
>  		__usermodehelper_set_disable_depth(UMH_DISABLED);
> -		oom_killer_disable();
> -
> -		/*
> -		 * There might have been an OOM kill while we were
> -		 * freezing tasks and the killed task might be still
> -		 * on the way out so we have to double check for race.
> -		 */
> -		if (oom_kills_count() != oom_kills_saved &&
> -		    !check_frozen_processes()) {
> -			__usermodehelper_set_disable_depth(UMH_ENABLED);
> -			printk("OOM in progress.");
> -			error = -EBUSY;
> -		} else {
> -			printk("done.");
> -		}
> +		printk("done.\n");
>  	}
> -	printk("\n");
>  	BUG_ON(in_atomic());
>  
>  	if (error)
> @@ -206,6 +165,18 @@ int freeze_kernel_threads(void)
>  	printk("\n");
>  	BUG_ON(in_atomic());
>  
> +	/*
> +	 * Now that everything freezable is handled we need to disbale
> +	 * the OOM killer to disallow any further interference with
> +	 * killable tasks.
> +	 */
> +	printk("Disabling OOM killer ... ");
> +	if (!oom_killer_disable()) {
> +		printk("failed.\n");
> +		error = -EAGAIN;
> +	} else
> +		printk("done.\n");
> +
>  	if (error)
>  		thaw_kernel_threads();
>  	return error;
> @@ -222,8 +193,6 @@ void thaw_processes(void)
>  	pm_freezing = false;
>  	pm_nosig_freezing = false;
>  
> -	oom_killer_enable();
> -
>  	printk("Restarting tasks ... ");
>  
>  	__usermodehelper_set_disable_depth(UMH_FREEZING);
> @@ -251,6 +220,9 @@ void thaw_kernel_threads(void)
>  {
>  	struct task_struct *g, *p;
>  
> +	printk("Enabling OOM killer again.\n");
> +	oom_killer_enable();
> +
>  	pm_nosig_freezing = false;
>  	printk("Restarting kernel threads ... ");
>  
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 302e0fc6d121..34bcbb053132 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2128,6 +2128,8 @@ static void mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int order)
>  	current->memcg_oom.order = order;
>  }
>  
> +extern bool oom_killer_disabled;
> +
>  /**
>   * mem_cgroup_oom_synchronize - complete memcg OOM handling
>   * @handle: actually kill/wait or just clean up the OOM state
> @@ -2155,7 +2157,7 @@ bool mem_cgroup_oom_synchronize(bool handle)
>  	if (!memcg)
>  		return false;
>  
> -	if (!handle)
> +	if (!handle || oom_killer_disabled)
>  		goto cleanup;
>  
>  	owait.memcg = memcg;
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 8b6e14136f4f..b3ccd92bc6dc 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -405,30 +405,63 @@ static void dump_header(struct task_struct *p, gfp_t gfp_mask, int order,
>  }
>  
>  /*
> - * Number of OOM killer invocations (including memcg OOM killer).
> - * Primarily used by PM freezer to check for potential races with
> - * OOM killed frozen task.
> + * Number of OOM victims in flight
>   */
> -static atomic_t oom_kills = ATOMIC_INIT(0);
> +static atomic_t oom_victims = ATOMIC_INIT(0);
> +static DECLARE_COMPLETION(oom_victims_wait);
>  
> -int oom_kills_count(void)
> +bool oom_killer_disabled __read_mostly;
> +static DECLARE_RWSEM(oom_sem);
> +
> +void mark_tsk_oom_victim(struct task_struct *tsk)
>  {
> -	return atomic_read(&oom_kills);
> +	BUG_ON(oom_killer_disabled);
> +	if (test_and_set_tsk_thread_flag(tsk, TIF_MEMDIE))
> +		return;
> +	atomic_inc(&oom_victims);
>  }
>  
> -void note_oom_kill(void)
> +void unmark_tsk_oom_victim(struct task_struct *tsk)
>  {
> -	atomic_inc(&oom_kills);
> +	int count;
> +
> +	if (!test_and_clear_tsk_thread_flag(tsk, TIF_MEMDIE))
> +		return;
> +
> +	down_read(&oom_sem);
> +	/*
> +	 * There is no need to signal the lasst oom_victim if there
> +	 * is nobody who cares.
> +	 */
> +	if (!atomic_dec_return(&oom_victims) && oom_killer_disabled)
> +		complete(&oom_victims_wait);
> +	up_read(&oom_sem);
>  }
>  
> -void mark_tsk_oom_victim(struct task_struct *tsk)
> +bool oom_killer_disable(void)
>  {
> -	set_tsk_thread_flag(tsk, TIF_MEMDIE);
> +	/*
> +	 * Make sure to not race with an ongoing OOM killer
> +	 * and that the current is not the victim.
> +	 */
> +	down_write(&oom_sem);
> +	if (!test_tsk_thread_flag(current, TIF_MEMDIE))
> +		oom_killer_disabled = true;
> +
> +	count = atomic_read(&oom_victims);
> +	up_write(&oom_sem);
> +
> +	if (count && oom_killer_disabled)
> +		wait_for_completion(&oom_victims_wait);
> +
> +	return oom_killer_disabled;
>  }
>  
> -void unmark_tsk_oom_victim(struct task_struct *tsk)
> +void oom_killer_enable(void)
>  {
> -	clear_thread_flag(TIF_MEMDIE);
> +	down_write(&oom_sem);
> +	oom_killer_disabled = false;
> +	up_write(&oom_sem);
>  }
>  
>  #define K(x) ((x) << (PAGE_SHIFT-10))
> @@ -626,7 +659,7 @@ void oom_zonelist_unlock(struct zonelist *zonelist, gfp_t gfp_mask)
>  }
>  
>  /**
> - * out_of_memory - kill the "best" process when we run out of memory
> + * __out_of_memory - kill the "best" process when we run out of memory
>   * @zonelist: zonelist pointer
>   * @gfp_mask: memory allocation flags
>   * @order: amount of memory being requested as a power of 2
> @@ -638,7 +671,7 @@ void oom_zonelist_unlock(struct zonelist *zonelist, gfp_t gfp_mask)
>   * OR try to be smart about which process to kill. Note that we
>   * don't have to be perfect here, we just have to be good.
>   */
> -void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
> +static void __out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
>  		int order, nodemask_t *nodemask, bool force_kill)
>  {
>  	const nodemask_t *mpol_mask;
> @@ -703,6 +736,31 @@ out:
>  		schedule_timeout_killable(1);
>  }
>  
> +/** out_of_memory -  tries to invoke OOM killer.
> + * @zonelist: zonelist pointer
> + * @gfp_mask: memory allocation flags
> + * @order: amount of memory being requested as a power of 2
> + * @nodemask: nodemask passed to page allocator
> + * @force_kill: true if a task must be killed, even if others are exiting
> + *
> + * invokes __out_of_memory if the OOM is not disabled by oom_killer_disable()
> + * when it returns false. Otherwise returns true.
> + */
> +bool out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
> +		int order, nodemask_t *nodemask, bool force_kill)
> +{
> +	bool ret = false;
> +
> +	down_read(&oom_sem);
> +	if (!oom_killer_disabled) {
> +		__out_of_memory(zonelist, gfp_mask, order, nodemask, force_kill);
> +		ret = true;
> +	}
> +	up_read(&oom_sem);
> +
> +	return ret;
> +}
> +
>  /*
>   * The pagefault handler calls here because it is out of memory, so kill a
>   * memory-hogging task.  If any populated zone has ZONE_OOM_LOCKED set, a
> @@ -712,12 +770,16 @@ void pagefault_out_of_memory(void)
>  {
>  	struct zonelist *zonelist;
>  
> +	down_read(&oom_sem);
>  	if (mem_cgroup_oom_synchronize(true))
> -		return;
> +		goto unlock;
>  
>  	zonelist = node_zonelist(first_memory_node, GFP_KERNEL);
>  	if (oom_zonelist_trylock(zonelist, GFP_KERNEL)) {
> -		out_of_memory(NULL, 0, 0, NULL, false);
> +		if (!oom_killer_disabled)
> +			__out_of_memory(NULL, 0, 0, NULL, false);
>  		oom_zonelist_unlock(zonelist, GFP_KERNEL);
>  	}
> +unlock:
> +	up_read(&oom_sem);
>  }
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 9cd36b822444..d44d69aa7b70 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -242,8 +242,6 @@ void set_pageblock_migratetype(struct page *page, int migratetype)
>  					PB_migrate, PB_migrate_end);
>  }
>  
> -bool oom_killer_disabled __read_mostly;
> -
>  #ifdef CONFIG_DEBUG_VM
>  static int page_outside_zone_boundaries(struct zone *zone, struct page *page)
>  {
> @@ -2241,10 +2239,11 @@ static inline struct page *
>  __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
>  	struct zonelist *zonelist, enum zone_type high_zoneidx,
>  	nodemask_t *nodemask, struct zone *preferred_zone,
> -	int classzone_idx, int migratetype)
> +	int classzone_idx, int migratetype, bool *oom_failed)
>  {
>  	struct page *page;
>  
> +	*oom_failed = false;
>  	/* Acquire the per-zone oom lock for each zone */
>  	if (!oom_zonelist_trylock(zonelist, gfp_mask)) {
>  		schedule_timeout_uninterruptible(1);
> @@ -2252,14 +2251,6 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
>  	}
>  
>  	/*
> -	 * PM-freezer should be notified that there might be an OOM killer on
> -	 * its way to kill and wake somebody up. This is too early and we might
> -	 * end up not killing anything but false positives are acceptable.
> -	 * See freeze_processes.
> -	 */
> -	note_oom_kill();
> -
> -	/*
>  	 * Go through the zonelist yet one more time, keep very high watermark
>  	 * here, this is only to catch a parallel oom killing, we must fail if
>  	 * we're still under heavy pressure.
> @@ -2289,8 +2280,8 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
>  			goto out;
>  	}
>  	/* Exhausted what can be done so it's blamo time */
> -	out_of_memory(zonelist, gfp_mask, order, nodemask, false);
> -
> +	if (!out_of_memory(zonelist, gfp_mask, order, nodemask, false))
> +		*oom_failed = true;
>  out:
>  	oom_zonelist_unlock(zonelist, gfp_mask);
>  	return page;
> @@ -2716,8 +2707,8 @@ rebalance:
>  	 */
>  	if (!did_some_progress) {
>  		if (oom_gfp_allowed(gfp_mask)) {
> -			if (oom_killer_disabled)
> -				goto nopage;
> +			bool oom_failed;
> +
>  			/* Coredumps can quickly deplete all memory reserves */
>  			if ((current->flags & PF_DUMPCORE) &&
>  			    !(gfp_mask & __GFP_NOFAIL))
> @@ -2725,10 +2716,19 @@ rebalance:
>  			page = __alloc_pages_may_oom(gfp_mask, order,
>  					zonelist, high_zoneidx,
>  					nodemask, preferred_zone,
> -					classzone_idx, migratetype);
> +					classzone_idx, migratetype,
> +					&oom_failed);
> +
>  			if (page)
>  				goto got_pg;
>  
> +			/*
> +			 * OOM killer might be disabled and then we have to
> +			 * fail the allocation
> +			 */
> +			if (oom_failed)
> +				goto nopage;
> +
>  			if (!(gfp_mask & __GFP_NOFAIL)) {
>  				/*
>  				 * The oom killer is not called for high-order
>
Tejun Heo Dec. 2, 2014, 10:08 p.m. UTC | #2
Hello, sorry about the delay.  Was on vacation.

Generally looks good to me.  Some comments below.

> @@ -355,8 +355,10 @@ static struct sysrq_key_op sysrq_term_op = {
>  
>  static void moom_callback(struct work_struct *ignored)
>  {
> -	out_of_memory(node_zonelist(first_memory_node, GFP_KERNEL), GFP_KERNEL,
> -		      0, NULL, true);
> +	if (!out_of_memory(node_zonelist(first_memory_node, GFP_KERNEL),
> +			   GFP_KERNEL, 0, NULL, true)) {
> +		printk(KERN_INFO "OOM request ignored because killer is disabled\n");
> +	}
>  }

CodingStyle line 157 says "Do not unnecessarily use braces where a
single statement will do.".
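I.e. just:

	if (!out_of_memory(node_zonelist(first_memory_node, GFP_KERNEL),
			   GFP_KERNEL, 0, NULL, true))
		printk(KERN_INFO "OOM request ignored because killer is disabled\n");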

> +/**
> + * oom_killer_disable - disable OOM killer
> + *
> + * Forces all page allocations to fail rather than trigger OOM killer.
> + * Will block and wait until all OOM victims are dead.
> + *
> + * Returns true if successfull and false if the OOM killer cannot be
> + * disabled.
> + */
> +extern bool oom_killer_disable(void);

And function comments usually go where the function body is, not where
the function is declared, no?

> @@ -157,27 +132,11 @@ int freeze_processes(void)
>  	pm_wakeup_clear();
>  	printk("Freezing user space processes ... ");
>  	pm_freezing = true;
> -	oom_kills_saved = oom_kills_count();
>  	error = try_to_freeze_tasks(true);
>  	if (!error) {
>  		__usermodehelper_set_disable_depth(UMH_DISABLED);
> -		oom_killer_disable();
> -
> -		/*
> -		 * There might have been an OOM kill while we were
> -		 * freezing tasks and the killed task might be still
> -		 * on the way out so we have to double check for race.
> -		 */
> -		if (oom_kills_count() != oom_kills_saved &&
> -		    !check_frozen_processes()) {
> -			__usermodehelper_set_disable_depth(UMH_ENABLED);
> -			printk("OOM in progress.");
> -			error = -EBUSY;
> -		} else {
> -			printk("done.");
> -		}
> +		printk("done.\n");

A delta but shouldn't it be pr_cont()?

...
> @@ -206,6 +165,18 @@ int freeze_kernel_threads(void)
>  	printk("\n");
>  	BUG_ON(in_atomic());
>  
> +	/*
> +	 * Now that everything freezable is handled we need to disbale
> +	 * the OOM killer to disallow any further interference with
> +	 * killable tasks.
> +	 */
> +	printk("Disabling OOM killer ... ");
> +	if (!oom_killer_disable()) {
> +		printk("failed.\n");
> +		error = -EAGAIN;
> +	} else
> +		printk("done.\n");

Ditto on pr_cont() and CodingStyle line 169 says "This does not apply
if only one branch of a conditional statement is a single statement;
in the latter case use braces in both branches:"
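I.e. something like:

	if (!oom_killer_disable()) {
		pr_cont("failed.\n");
		error = -EAGAIN;
	} else {
		pr_cont("done.\n");
	}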

> @@ -251,6 +220,9 @@ void thaw_kernel_threads(void)
>  {
>  	struct task_struct *g, *p;
>  
> +	printk("Enabling OOM killer again.\n");

Do we really need this printk?  The same goes for Disabling OOM
killer.  For freezing it makes some sense because freezing may take a
considerable amount of time and even occasionally fail due to
timeout.  We aren't really expecting those to happen for OOM victims.

> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 302e0fc6d121..34bcbb053132 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2128,6 +2128,8 @@ static void mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int order)
>  	current->memcg_oom.order = order;
>  }
>  
> +extern bool oom_killer_disabled;

Ugh... don't we wanna put this in a header file?

> +void mark_tsk_oom_victim(struct task_struct *tsk)
>  {
> -	return atomic_read(&oom_kills);
> +	BUG_ON(oom_killer_disabled);

WARN_ON_ONCE() is prolly a better option here?

> +	if (test_and_set_tsk_thread_flag(tsk, TIF_MEMDIE))

Can a task actually be selected as an OOM victim multiple times?

> +		return;
> +	atomic_inc(&oom_victims);
>  }
>  
> -void note_oom_kill(void)
> +void unmark_tsk_oom_victim(struct task_struct *tsk)
>  {
> -	atomic_inc(&oom_kills);
> +	int count;
> +
> +	if (!test_and_clear_tsk_thread_flag(tsk, TIF_MEMDIE))
> +		return;

Maybe test this inline in exit_mm()?  e.g.

	if (test_thread_flag(TIF_MEMDIE))
		unmark_tsk_oom_victim(current);

Also, can the function ever be called by someone other than current?
If not, why would it take @task?

> +
> +	down_read(&oom_sem);
> +	/*
> +	 * There is no need to signal the lasst oom_victim if there
> +	 * is nobody who cares.
> +	 */
> +	if (!atomic_dec_return(&oom_victims) && oom_killer_disabled)
> +		complete(&oom_victims_wait);

I don't think using completion this way is safe.  Please read on.

> +	up_read(&oom_sem);
>  }
>  
> -void mark_tsk_oom_victim(struct task_struct *tsk)
> +bool oom_killer_disable(void)
>  {
> -	set_tsk_thread_flag(tsk, TIF_MEMDIE);
> +	/*
> +	 * Make sure to not race with an ongoing OOM killer
> +	 * and that the current is not the victim.
> +	 */
> +	down_write(&oom_sem);
> +	if (!test_tsk_thread_flag(current, TIF_MEMDIE))
> +		oom_killer_disabled = true;

Prolly "if (TIF_MEMDIE) { unlock; return; }" is easier to follow.

> +
> +	count = atomic_read(&oom_victims);
> +	up_write(&oom_sem);
> +
> +	if (count && oom_killer_disabled)
> +		wait_for_completion(&oom_victims_wait);

So, each complete() increments the done count and wait decs.  The
above code works iff the complete()'s and wait()'s are always balanced
which usually isn't true in this type of wait code.  Either use
reinit_completion() / complete_all() combos or wait_event().

> +
> +	return oom_killer_disabled;

Maybe 0 / -errno would be a better choice as return values?

> +/** out_of_memory -  tries to invoke OOM killer.

Formatting?

> + * @zonelist: zonelist pointer
> + * @gfp_mask: memory allocation flags
> + * @order: amount of memory being requested as a power of 2
> + * @nodemask: nodemask passed to page allocator
> + * @force_kill: true if a task must be killed, even if others are exiting
> + *
> + * invokes __out_of_memory if the OOM is not disabled by oom_killer_disable()
> + * when it returns false. Otherwise returns true.
> + */
> +bool out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
> +		int order, nodemask_t *nodemask, bool force_kill)
> +{
> +	bool ret = false;
> +
> +	down_read(&oom_sem);
> +	if (!oom_killer_disabled) {
> +		__out_of_memory(zonelist, gfp_mask, order, nodemask, force_kill);
> +		ret = true;
> +	}
> +	up_read(&oom_sem);
> +
> +	return ret;

Ditto on return value.  0 / -EBUSY seem like a better choice to me.

> @@ -712,12 +770,16 @@ void pagefault_out_of_memory(void)
>  {
>  	struct zonelist *zonelist;
>  
> +	down_read(&oom_sem);
>  	if (mem_cgroup_oom_synchronize(true))
> -		return;
> +		goto unlock;
>  
>  	zonelist = node_zonelist(first_memory_node, GFP_KERNEL);
>  	if (oom_zonelist_trylock(zonelist, GFP_KERNEL)) {
> -		out_of_memory(NULL, 0, 0, NULL, false);
> +		if (!oom_killer_disabled)
> +			__out_of_memory(NULL, 0, 0, NULL, false);
>  		oom_zonelist_unlock(zonelist, GFP_KERNEL);

Is this a condition which can happen and we can deal with?  With
userland fully frozen, there shouldn't be page faults which lead to
memory allocation, right?  Shouldn't we document how oom
disable/enable is supposed to be used (it only makes sense while the
whole system is in quiescent state) and at least trigger
WARN_ON_ONCE() if the above code path gets triggered while oom killer
is disabled?

Thanks.
Michal Hocko Dec. 4, 2014, 2:16 p.m. UTC | #3
On Tue 02-12-14 17:08:04, Tejun Heo wrote:
> Hello, sorry about the delay.  Was on vacation.
> 
> Generally looks good to me.  Some comments below.
> 
> > @@ -355,8 +355,10 @@ static struct sysrq_key_op sysrq_term_op = {
> >  
> >  static void moom_callback(struct work_struct *ignored)
> >  {
> > -	out_of_memory(node_zonelist(first_memory_node, GFP_KERNEL), GFP_KERNEL,
> > -		      0, NULL, true);
> > +	if (!out_of_memory(node_zonelist(first_memory_node, GFP_KERNEL),
> > +			   GFP_KERNEL, 0, NULL, true)) {
> > +		printk(KERN_INFO "OOM request ignored because killer is disabled\n");
> > +	}
> >  }
> 
> CodingStyle line 157 says "Do not unnecessarily use braces where a
> single statement will do.".

Sure. Fixed

> > +/**
> > + * oom_killer_disable - disable OOM killer
> > + *
> > + * Forces all page allocations to fail rather than trigger OOM killer.
> > + * Will block and wait until all OOM victims are dead.
> > + *
> > + * Returns true if successfull and false if the OOM killer cannot be
> > + * disabled.
> > + */
> > +extern bool oom_killer_disable(void);
> 
> And function comments usually go where the function body is, not where
> the function is declared, no?

Fixed

> > @@ -157,27 +132,11 @@ int freeze_processes(void)
> >  	pm_wakeup_clear();
> >  	printk("Freezing user space processes ... ");
> >  	pm_freezing = true;
> > -	oom_kills_saved = oom_kills_count();
> >  	error = try_to_freeze_tasks(true);
> >  	if (!error) {
> >  		__usermodehelper_set_disable_depth(UMH_DISABLED);
> > -		oom_killer_disable();
> > -
> > -		/*
> > -		 * There might have been an OOM kill while we were
> > -		 * freezing tasks and the killed task might be still
> > -		 * on the way out so we have to double check for race.
> > -		 */
> > -		if (oom_kills_count() != oom_kills_saved &&
> > -		    !check_frozen_processes()) {
> > -			__usermodehelper_set_disable_depth(UMH_ENABLED);
> > -			printk("OOM in progress.");
> > -			error = -EBUSY;
> > -		} else {
> > -			printk("done.");
> > -		}
> > +		printk("done.\n");
> 
> A delta but shouldn't it be pr_cont()?

kernel/power/process.c doesn't use pr_* so I've stayed with what the
rest of the file is using. I can add a patch which transforms all of
them.
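Something along these lines, for example:

-	printk("Freezing user space processes ... ");
+	pr_info("Freezing user space processes ... ");
[...]
-		printk("done.\n");
+		pr_cont("done.\n");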

> ...
> > @@ -206,6 +165,18 @@ int freeze_kernel_threads(void)
> >  	printk("\n");
> >  	BUG_ON(in_atomic());
> >  
> > +	/*
> > +	 * Now that everything freezable is handled we need to disbale
> > +	 * the OOM killer to disallow any further interference with
> > +	 * killable tasks.
> > +	 */
> > +	printk("Disabling OOM killer ... ");
> > +	if (!oom_killer_disable()) {
> > +		printk("failed.\n");
> > +		error = -EAGAIN;
> > +	} else
> > +		printk("done.\n");
> 
> Ditto on pr_cont() and
>
> CodingStyle line 169 says "This does not apply
> if only one branch of a conditional statement is a single statement;
> in the latter case use braces in both branches:"

Fixed

> > @@ -251,6 +220,9 @@ void thaw_kernel_threads(void)
> >  {
> >  	struct task_struct *g, *p;
> >  
> > +	printk("Enabling OOM killer again.\n");
> 
> Do we really need this printk?  The same goes for Disabling OOM
> killer.  For freezing it makes some sense because freezing may take a
> considerable amount of time and even occassionally fail due to
> timeout.  We aren't really expecting those to happen for OOM victims.

I just considered them useful so that if there are follow-up allocation
failure messages it is clear they are due to the OOM killer being disabled.
I can remove them.

> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index 302e0fc6d121..34bcbb053132 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -2128,6 +2128,8 @@ static void mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int order)
> >  	current->memcg_oom.order = order;
> >  }
> >  
> > +extern bool oom_killer_disabled;
> 
> Ugh... don't we wanna put this in a header file?

Who else would need the declaration? This is not something random code
should look at.

> > +void mark_tsk_oom_victim(struct task_struct *tsk)
> >  {
> > -	return atomic_read(&oom_kills);
> > +	BUG_ON(oom_killer_disabled);
> 
> WARN_ON_ONCE() is prolly a better option here?

Well, something fishy is going on when oom_killer_disabled is set and we
mark a new OOM victim. This is a clear bug. Why would we only warn and
allow the follow-up breakage?

> > +	if (test_and_set_tsk_thread_flag(tsk, TIF_MEMDIE))
> 
> Can a task actually be selected as an OOM victim multiple times?

AFAICS nothing prevents the global OOM and memcg OOM killers from racing.
 
> > +		return;
> > +	atomic_inc(&oom_victims);
> >  }
> >  
> > -void note_oom_kill(void)
> > +void unmark_tsk_oom_victim(struct task_struct *tsk)
> >  {
> > -	atomic_inc(&oom_kills);
> > +	int count;
> > +
> > +	if (!test_and_clear_tsk_thread_flag(tsk, TIF_MEMDIE))
> > +		return;
> 
> Maybe test this inline in exit_mm()?  e.g.
> 
> 	if (test_thread_flag(TIF_MEMDIE))
> 		unmark_tsk_oom_victim(current);

Why do you think testing TIF_MEMDIE in exit_mm is better? I would like
to reduce the usage of the flag as much as possible.

> Also, can the function ever be called by someone other than current?
> If not, why would it take @task?

Changed to use current only. If there is anybody who needs that we can
change that later. I wanted to have it symmetric to mark_tsk_oom_victim
but that is not that important.

> > +
> > +	down_read(&oom_sem);
> > +	/*
> > +	 * There is no need to signal the lasst oom_victim if there
> > +	 * is nobody who cares.
> > +	 */
> > +	if (!atomic_dec_return(&oom_victims) && oom_killer_disabled)
> > +		complete(&oom_victims_wait);
> 
> I don't think using completion this way is safe.  Please read on.
> 
> > +	up_read(&oom_sem);
> >  }
> >  
> > -void mark_tsk_oom_victim(struct task_struct *tsk)
> > +bool oom_killer_disable(void)
> >  {
> > -	set_tsk_thread_flag(tsk, TIF_MEMDIE);
> > +	/*
> > +	 * Make sure to not race with an ongoing OOM killer
> > +	 * and that the current is not the victim.
> > +	 */
> > +	down_write(&oom_sem);
> > +	if (!test_tsk_thread_flag(current, TIF_MEMDIE))
> > +		oom_killer_disabled = true;
> 
> Prolly "if (TIF_MEMDIE) { unlock; return; }" is easier to follow.

OK

> > +
> > +	count = atomic_read(&oom_victims);
> > +	up_write(&oom_sem);
> > +
> > +	if (count && oom_killer_disabled)
> > +		wait_for_completion(&oom_victims_wait);
> 
> So, each complete() increments the done count and wait decs.  The
> above code works iff the complete()'s and wait()'s are always balanced
> which usually isn't true in this type of wait code.  Either use
> reinit_completion() / complete_all() combos or wait_event().

Hmm, I thought that only a single instance of freeze_kernel_threads
(which calls oom_killer_disable) can run at a time. But I am currently
not sure that all paths are called under lock_system_sleep.
I am not familiar with reinit_completion API. Is the following correct?
[...]
@@ -434,10 +434,23 @@ void unmark_tsk_oom_victim(struct task_struct *tsk)
 	 * is nobody who cares.
 	 */
 	if (!atomic_dec_return(&oom_victims) && oom_killer_disabled)
-		complete(&oom_victims_wait);
+		complete_all(&oom_victims_wait);
 	up_read(&oom_sem);
 }
[...]
@@ -445,16 +458,23 @@ bool oom_killer_disable(void)
 	 * and that the current is not the victim.
 	 */
 	down_write(&oom_sem);
-	if (!test_tsk_thread_flag(current, TIF_MEMDIE))
-		oom_killer_disabled = true;
+	if (test_thread_flag(TIF_MEMDIE)) {
+		up_write(&oom_sem);
+		return false;
+	}
+
+	/* unmark_tsk_oom_victim is calling complete_all */
+	if (!oom_killer_disabled)
+		reinit_completion(&oom_victims_wait);
 
+	oom_killer_disabled = true;
 	count = atomic_read(&oom_victims);
 	up_write(&oom_sem);
 
-	if (count && oom_killer_disabled)
+	if (count)
 		wait_for_completion(&oom_victims_wait);
 
-	return oom_killer_disabled;
+	return true;
 }

> > +
> > +	return oom_killer_disabled;
> 
> Maybe 0 / -errno is better choice as return values?

I do not have a problem changing this if you feel strongly about it, but
true/false sounds easier to me and it allows the caller to decide what to
report. If there were multiple reasons to fail then sure, but that is not
the case.
 
> > +/** out_of_memory -  tries to invoke OOM killer.
> 
> Formatting?

fixed

> > + * @zonelist: zonelist pointer
> > + * @gfp_mask: memory allocation flags
> > + * @order: amount of memory being requested as a power of 2
> > + * @nodemask: nodemask passed to page allocator
> > + * @force_kill: true if a task must be killed, even if others are exiting
> > + *
> > + * invokes __out_of_memory if the OOM is not disabled by oom_killer_disable()
> > + * when it returns false. Otherwise returns true.
> > + */
> > +bool out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
> > +		int order, nodemask_t *nodemask, bool force_kill)
> > +{
> > +	bool ret = false;
> > +
> > +	down_read(&oom_sem);
> > +	if (!oom_killer_disabled) {
> > +		__out_of_memory(zonelist, gfp_mask, order, nodemask, force_kill);
> > +		ret = true;
> > +	}
> > +	up_read(&oom_sem);
> > +
> > +	return ret;
> 
> Ditto on return value.  0 / -EBUSY seem like a better choice to me.
> 
> > @@ -712,12 +770,16 @@ void pagefault_out_of_memory(void)
> >  {
> >  	struct zonelist *zonelist;
> >  
> > +	down_read(&oom_sem);
> >  	if (mem_cgroup_oom_synchronize(true))
> > -		return;
> > +		goto unlock;
> >  
> >  	zonelist = node_zonelist(first_memory_node, GFP_KERNEL);
> >  	if (oom_zonelist_trylock(zonelist, GFP_KERNEL)) {
> > -		out_of_memory(NULL, 0, 0, NULL, false);
> > +		if (!oom_killer_disabled)
> > +			__out_of_memory(NULL, 0, 0, NULL, false);
> >  		oom_zonelist_unlock(zonelist, GFP_KERNEL);
> 
> Is this a condition which can happen and we can deal with? With
> userland fully frozen, there shouldn't be page faults which lead to
> memory allocation, right?

Except for racing OOM victims which were missed by try_to_freeze_tasks
because they didn't get a CPU slice to wake up from the freezer. The task
would die on the way out from the page fault exception. I have updated
the changelog to be more verbose about this.

> Shouldn't we document how oom disable/enable is supposed to be used

Well, the API shouldn't be used outside of the PM freezer IMO. This is not
a general API that other parts of the kernel should be using. I can surely
add more documentation for the PM usage though. I have rewritten the
changelog:
"
    As oom_killer_disable() is a full OOM barrier now, we can postpone it in
    the PM freezer until after all freezable user tasks are considered
    frozen (i.e. to freeze_kernel_threads).

    Normally there wouldn't be any unfrozen user tasks at this moment so
    the function will not block. But if there was an OOM killer racing with
    try_to_freeze_tasks and the OOM victim didn't finish yet then we have to
    wait for it. This should complete in a finite time, though, because
        - the victim cannot loop in the page fault handler (it would die
          on the way out from the exception)
        - it cannot loop in the page allocator because all further
          allocations would fail
        - it shouldn't be blocked on any locks held by frozen tasks
          (try_to_freeze expects lockless context) and kernel threads and
          work queues are not frozen yet
"

And I've added:
+/**
+ * oom_killer_disable - disable OOM killer
+ *
+ * Forces all page allocations to fail rather than trigger OOM killer.
+ * Will block and wait until all OOM victims are dead.
+ *
+ * The function cannot be called when there are runnable user tasks because
+ * the userspace would see unexpected allocation failures as a result. Any
+ * new usage of this function should be consulted with MM people.
+ *
+ * Returns true if successful and false if the OOM killer cannot be
+ * disabled.
+ */
 bool oom_killer_disable(void)

> (it only makes sense while the whole system is in quiescent state)
> and at least trigger WARN_ON_ONCE() if the above code path gets
> triggered while oom killer is disabled?

I can add a WARN_ON(!test_thread_flag(TIF_MEMDIE)).

Thanks for the review!
Tejun Heo Dec. 4, 2014, 2:44 p.m. UTC | #4
On Thu, Dec 04, 2014 at 03:16:23PM +0100, Michal Hocko wrote:
> > A delta but shouldn't it be pr_cont()?
> 
> kernel/power/process.c doesn't use pr_* so I've stayed with what the
> rest of the file is using. I can add a patch which transforms all of
> them.

The console output becomes wrong when printk() is used on
continuation.  So, yeah, it'd be great to fix it.

> > > +extern bool oom_killer_disabled;
> > 
> > Ugh... don't we wanna put this in a header file?
> 
> Who else would need the declaration? This is not something random code
> should look at.

Let's say, somebody changes the type to ulong for whatever reason
later and forgets to update this declaration.  What happens then on a
big endian machine?

Jesus, this is basic C programming.  You don't sprinkle external
declarations which the compiler can't verify against the actual
definitions.  There's absolutely no compelling reason to do that here.
Why would you take out compiler verification for no reason?
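E.g. just keep it in include/linux/oom.h next to the other declarations:

	extern bool oom_killer_disabled;

	extern bool oom_killer_disable(void);
	extern void oom_killer_enable(void);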

> > > +void mark_tsk_oom_victim(struct task_struct *tsk)
> > >  {
> > > -	return atomic_read(&oom_kills);
> > > +	BUG_ON(oom_killer_disabled);
> > 
> > WARN_ON_ONCE() is prolly a better option here?
> 
> Well, something fishy is going on when oom_killer_disabled is set and we
> mark new OOM victim. This is a clear bug. Why would be warning and a
> allow the follow up breakage?

Because the system is more likely to be able to go on and we don't BUG
when we can WARN as a general rule.  A working system is almost always
better than a dead one, even for debugging.

> > > +	if (test_and_set_tsk_thread_flag(tsk, TIF_MEMDIE))
> > 
> > Can a task actually be selected as an OOM victim multiple times?
> 
> AFAICS nothing prevents from global OOM and memcg OOM killers racing.

Maybe it'd be a good idea to note that in the comment?

> > > -void note_oom_kill(void)
> > > +void unmark_tsk_oom_victim(struct task_struct *tsk)
> > >  {
> > > -	atomic_inc(&oom_kills);
> > > +	int count;
> > > +
> > > +	if (!test_and_clear_tsk_thread_flag(tsk, TIF_MEMDIE))
> > > +		return;
> > 
> > Maybe test this inline in exit_mm()?  e.g.
> > 
> > 	if (test_thread_flag(TIF_MEMDIE))
> > 		unmark_tsk_oom_victim(current);
> 
> Why do you think testing TIF_MEMDIE in exit_mm is better? I would like
> to reduce the usage of the flag as much as possible.

Because it's adding a function call/return to a hot path for everybody.
It sure is a minuscule cost but we're adding it for no good reason.

> > So, each complete() increments the done count and wait decs.  The
> > above code works iff the complete()'s and wait()'s are always balanced
> > which usually isn't true in this type of wait code.  Either use
> > reinit_completion() / complete_all() combos or wait_event().
> 
> Hmm, I thought that only a single instance of freeze_kernel_threads
> (which calls oom_killer_disable) can run at a time. But I am currently
> not sure that all paths are called under lock_system_sleep.
> I am not familiar with reinit_completion API. Is the following correct?

Hmmm... wouldn't wait_event() be easier to read in this case?
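Something along these lines, for example (untested sketch, reusing the
existing oom_victims counter, with oom_victims_wait turned into a
waitqueue):

	static DECLARE_WAIT_QUEUE_HEAD(oom_victims_wait);

	/* unmark_tsk_oom_victim(), instead of complete() */
	if (!atomic_dec_return(&oom_victims))
		wake_up_all(&oom_victims_wait);

	/* oom_killer_disable(), instead of wait_for_completion() */
	wait_event(oom_victims_wait, !atomic_read(&oom_victims));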

...
> > Maybe 0 / -errno is better choice as return values?
> 
> I do not have problem to change this if you feel strong about it but
> true/false sounds easier to me and it allows the caller to decide what to
> report. If there were multiple reasons to fail then sure but that is not
> the case.

It's not a big deal, but except for functions which have clear boolean
behavior - functions which try/attempt something or query or decide
certain things - randomly thrown-in bool returns tend to become
confusing, especially because the bool fail value is the opposite of the
0/-errno fail value.  So, "this function only fails for one reason" is
usually a bad and arbitrary reason for choosing a bool return; it causes
confusion at callsites and headaches when the function develops more
reasons to fail.

...
> > > @@ -712,12 +770,16 @@ void pagefault_out_of_memory(void)
> > >  {
> > >  	struct zonelist *zonelist;
> > >  
> > > +	down_read(&oom_sem);
> > >  	if (mem_cgroup_oom_synchronize(true))
> > > -		return;
> > > +		goto unlock;
> > >  
> > >  	zonelist = node_zonelist(first_memory_node, GFP_KERNEL);
> > >  	if (oom_zonelist_trylock(zonelist, GFP_KERNEL)) {
> > > -		out_of_memory(NULL, 0, 0, NULL, false);
> > > +		if (!oom_killer_disabled)
> > > +			__out_of_memory(NULL, 0, 0, NULL, false);
> > >  		oom_zonelist_unlock(zonelist, GFP_KERNEL);
> > 
> > Is this a condition which can happen and we can deal with? With
> > userland fully frozen, there shouldn't be page faults which lead to
> > memory allocation, right?
> 
> Except for racing OOM victims which were missed by try_to_freeze_tasks
> because they didn't get cpu slice to wake up from the freezer. The task
> would die on the way out from the page fault exception. I have updated
> the changelog to be more verbose about this.

That's really not obvious.  Let's please add a comment
explaining that.
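E.g. something like:

	if (!oom_killer_disabled) {
		__out_of_memory(NULL, 0, 0, NULL, false);
	} else {
		/*
		 * Only an OOM victim which raced with the freezer can get
		 * here while the OOM killer is disabled; it will die on its
		 * way out of the exception handler.
		 */
		WARN_ON(!test_thread_flag(TIF_MEMDIE));
	}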

> > (it only makes sense while the whole system is in quiescent state)
> > and at least trigger WARN_ON_ONCE() if the above code path gets
> > triggered while oom killer is disabled?
> 
> I can add a WARN_ON(!test_thread_flag(tsk, TIF_MEMDIE)).

Yeah, that makes sense to me.

Thanks.

Patch

diff --git a/drivers/tty/sysrq.c b/drivers/tty/sysrq.c
index 42bad18c66c9..6818589c1004 100644
--- a/drivers/tty/sysrq.c
+++ b/drivers/tty/sysrq.c
@@ -355,8 +355,10 @@  static struct sysrq_key_op sysrq_term_op = {
 
 static void moom_callback(struct work_struct *ignored)
 {
-	out_of_memory(node_zonelist(first_memory_node, GFP_KERNEL), GFP_KERNEL,
-		      0, NULL, true);
+	if (!out_of_memory(node_zonelist(first_memory_node, GFP_KERNEL),
+			   GFP_KERNEL, 0, NULL, true)) {
+		printk(KERN_INFO "OOM request ignored because killer is disabled\n");
+	}
 }
 
 static DECLARE_WORK(moom_work, moom_callback);
diff --git a/include/linux/oom.h b/include/linux/oom.h
index 8f7e74f8ab3a..d802575c9307 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -72,22 +72,26 @@  extern enum oom_scan_t oom_scan_process_thread(struct task_struct *task,
 		unsigned long totalpages, const nodemask_t *nodemask,
 		bool force_kill);
 
-extern void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
+extern bool out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
 		int order, nodemask_t *mask, bool force_kill);
 extern int register_oom_notifier(struct notifier_block *nb);
 extern int unregister_oom_notifier(struct notifier_block *nb);
 
-extern bool oom_killer_disabled;
-
-static inline void oom_killer_disable(void)
-{
-	oom_killer_disabled = true;
-}
+/**
+ * oom_killer_disable - disable OOM killer
+ *
+ * Forces all page allocations to fail rather than trigger OOM killer.
+ * Will block and wait until all OOM victims are dead.
+ *
+ * Returns true if successful and false if the OOM killer cannot be
+ * disabled.
+ */
+extern bool oom_killer_disable(void);
 
-static inline void oom_killer_enable(void)
-{
-	oom_killer_disabled = false;
-}
+/**
+ * oom_killer_enable - enable OOM killer
+ */
+extern void oom_killer_enable(void);
 
 static inline bool oom_gfp_allowed(gfp_t gfp_mask)
 {
diff --git a/kernel/power/process.c b/kernel/power/process.c
index 5a6ec8678b9a..a4306e39f35c 100644
--- a/kernel/power/process.c
+++ b/kernel/power/process.c
@@ -108,30 +108,6 @@  static int try_to_freeze_tasks(bool user_only)
 	return todo ? -EBUSY : 0;
 }
 
-static bool __check_frozen_processes(void)
-{
-	struct task_struct *g, *p;
-
-	for_each_process_thread(g, p)
-		if (p != current && !freezer_should_skip(p) && !frozen(p))
-			return false;
-
-	return true;
-}
-
-/*
- * Returns true if all freezable tasks (except for current) are frozen already
- */
-static bool check_frozen_processes(void)
-{
-	bool ret;
-
-	read_lock(&tasklist_lock);
-	ret = __check_frozen_processes();
-	read_unlock(&tasklist_lock);
-	return ret;
-}
-
 /**
  * freeze_processes - Signal user space processes to enter the refrigerator.
  * The current thread will not be frozen.  The same process that calls
@@ -142,7 +118,6 @@  static bool check_frozen_processes(void)
 int freeze_processes(void)
 {
 	int error;
-	int oom_kills_saved;
 
 	error = __usermodehelper_disable(UMH_FREEZING);
 	if (error)
@@ -157,27 +132,11 @@  int freeze_processes(void)
 	pm_wakeup_clear();
 	printk("Freezing user space processes ... ");
 	pm_freezing = true;
-	oom_kills_saved = oom_kills_count();
 	error = try_to_freeze_tasks(true);
 	if (!error) {
 		__usermodehelper_set_disable_depth(UMH_DISABLED);
-		oom_killer_disable();
-
-		/*
-		 * There might have been an OOM kill while we were
-		 * freezing tasks and the killed task might be still
-		 * on the way out so we have to double check for race.
-		 */
-		if (oom_kills_count() != oom_kills_saved &&
-		    !check_frozen_processes()) {
-			__usermodehelper_set_disable_depth(UMH_ENABLED);
-			printk("OOM in progress.");
-			error = -EBUSY;
-		} else {
-			printk("done.");
-		}
+		printk("done.\n");
 	}
-	printk("\n");
 	BUG_ON(in_atomic());
 
 	if (error)
@@ -206,6 +165,18 @@  int freeze_kernel_threads(void)
 	printk("\n");
 	BUG_ON(in_atomic());
 
+	/*
+	 * Now that everything freezable is handled we need to disable
+	 * the OOM killer to disallow any further interference with
+	 * killable tasks.
+	 */
+	printk("Disabling OOM killer ... ");
+	if (!oom_killer_disable()) {
+		printk("failed.\n");
+		error = -EAGAIN;
+	} else
+		printk("done.\n");
+
 	if (error)
 		thaw_kernel_threads();
 	return error;
@@ -222,8 +193,6 @@  void thaw_processes(void)
 	pm_freezing = false;
 	pm_nosig_freezing = false;
 
-	oom_killer_enable();
-
 	printk("Restarting tasks ... ");
 
 	__usermodehelper_set_disable_depth(UMH_FREEZING);
@@ -251,6 +220,9 @@  void thaw_kernel_threads(void)
 {
 	struct task_struct *g, *p;
 
+	printk("Enabling OOM killer again.\n");
+	oom_killer_enable();
+
 	pm_nosig_freezing = false;
 	printk("Restarting kernel threads ... ");
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 302e0fc6d121..34bcbb053132 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2128,6 +2128,8 @@  static void mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int order)
 	current->memcg_oom.order = order;
 }
 
+extern bool oom_killer_disabled;
+
 /**
  * mem_cgroup_oom_synchronize - complete memcg OOM handling
  * @handle: actually kill/wait or just clean up the OOM state
@@ -2155,7 +2157,7 @@  bool mem_cgroup_oom_synchronize(bool handle)
 	if (!memcg)
 		return false;
 
-	if (!handle)
+	if (!handle || oom_killer_disabled)
 		goto cleanup;
 
 	owait.memcg = memcg;
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 8b6e14136f4f..b3ccd92bc6dc 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -405,30 +405,63 @@  static void dump_header(struct task_struct *p, gfp_t gfp_mask, int order,
 }
 
 /*
- * Number of OOM killer invocations (including memcg OOM killer).
- * Primarily used by PM freezer to check for potential races with
- * OOM killed frozen task.
+ * Number of OOM victims in flight
  */
-static atomic_t oom_kills = ATOMIC_INIT(0);
+static atomic_t oom_victims = ATOMIC_INIT(0);
+static DECLARE_COMPLETION(oom_victims_wait);
 
-int oom_kills_count(void)
+bool oom_killer_disabled __read_mostly;
+static DECLARE_RWSEM(oom_sem);
+
+void mark_tsk_oom_victim(struct task_struct *tsk)
 {
-	return atomic_read(&oom_kills);
+	BUG_ON(oom_killer_disabled);
+	if (test_and_set_tsk_thread_flag(tsk, TIF_MEMDIE))
+		return;
+	atomic_inc(&oom_victims);
 }
 
-void note_oom_kill(void)
+void unmark_tsk_oom_victim(struct task_struct *tsk)
 {
-	atomic_inc(&oom_kills);
+	int count;
+
+	if (!test_and_clear_tsk_thread_flag(tsk, TIF_MEMDIE))
+		return;
+
+	down_read(&oom_sem);
+	/*
+	 * There is no need to signal the last oom_victim if there
+	 * is nobody who cares.
+	 */
+	if (!atomic_dec_return(&oom_victims) && oom_killer_disabled)
+		complete(&oom_victims_wait);
+	up_read(&oom_sem);
 }
 
-void mark_tsk_oom_victim(struct task_struct *tsk)
+bool oom_killer_disable(void)
 {
-	set_tsk_thread_flag(tsk, TIF_MEMDIE);
+	/*
+	 * Make sure to not race with an ongoing OOM killer
+	 * and that the current is not the victim.
+	 */
+	down_write(&oom_sem);
+	if (!test_tsk_thread_flag(current, TIF_MEMDIE))
+		oom_killer_disabled = true;
+
+	count = atomic_read(&oom_victims);
+	up_write(&oom_sem);
+
+	if (count && oom_killer_disabled)
+		wait_for_completion(&oom_victims_wait);
+
+	return oom_killer_disabled;
 }
 
-void unmark_tsk_oom_victim(struct task_struct *tsk)
+void oom_killer_enable(void)
 {
-	clear_thread_flag(TIF_MEMDIE);
+	down_write(&oom_sem);
+	oom_killer_disabled = false;
+	up_write(&oom_sem);
 }
 
 #define K(x) ((x) << (PAGE_SHIFT-10))
@@ -626,7 +659,7 @@  void oom_zonelist_unlock(struct zonelist *zonelist, gfp_t gfp_mask)
 }
 
 /**
- * out_of_memory - kill the "best" process when we run out of memory
+ * __out_of_memory - kill the "best" process when we run out of memory
  * @zonelist: zonelist pointer
  * @gfp_mask: memory allocation flags
  * @order: amount of memory being requested as a power of 2
@@ -638,7 +671,7 @@  void oom_zonelist_unlock(struct zonelist *zonelist, gfp_t gfp_mask)
  * OR try to be smart about which process to kill. Note that we
  * don't have to be perfect here, we just have to be good.
  */
-void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
+static void __out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
 		int order, nodemask_t *nodemask, bool force_kill)
 {
 	const nodemask_t *mpol_mask;
@@ -703,6 +736,31 @@  out:
 		schedule_timeout_killable(1);
 }
 
+/** out_of_memory -  tries to invoke OOM killer.
+ * @zonelist: zonelist pointer
+ * @gfp_mask: memory allocation flags
+ * @order: amount of memory being requested as a power of 2
+ * @nodemask: nodemask passed to page allocator
+ * @force_kill: true if a task must be killed, even if others are exiting
+ *
+ * invokes __out_of_memory if the OOM is not disabled by oom_killer_disable()
+ * when it returns false. Otherwise returns true.
+ */
+bool out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
+		int order, nodemask_t *nodemask, bool force_kill)
+{
+	bool ret = false;
+
+	down_read(&oom_sem);
+	if (!oom_killer_disabled) {
+		__out_of_memory(zonelist, gfp_mask, order, nodemask, force_kill);
+		ret = true;
+	}
+	up_read(&oom_sem);
+
+	return ret;
+}
+
 /*
  * The pagefault handler calls here because it is out of memory, so kill a
  * memory-hogging task.  If any populated zone has ZONE_OOM_LOCKED set, a
@@ -712,12 +770,16 @@  void pagefault_out_of_memory(void)
 {
 	struct zonelist *zonelist;
 
+	down_read(&oom_sem);
 	if (mem_cgroup_oom_synchronize(true))
-		return;
+		goto unlock;
 
 	zonelist = node_zonelist(first_memory_node, GFP_KERNEL);
 	if (oom_zonelist_trylock(zonelist, GFP_KERNEL)) {
-		out_of_memory(NULL, 0, 0, NULL, false);
+		if (!oom_killer_disabled)
+			__out_of_memory(NULL, 0, 0, NULL, false);
 		oom_zonelist_unlock(zonelist, GFP_KERNEL);
 	}
+unlock:
+	up_read(&oom_sem);
 }
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 9cd36b822444..d44d69aa7b70 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -242,8 +242,6 @@  void set_pageblock_migratetype(struct page *page, int migratetype)
 					PB_migrate, PB_migrate_end);
 }
 
-bool oom_killer_disabled __read_mostly;
-
 #ifdef CONFIG_DEBUG_VM
 static int page_outside_zone_boundaries(struct zone *zone, struct page *page)
 {
@@ -2241,10 +2239,11 @@  static inline struct page *
 __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
 	struct zonelist *zonelist, enum zone_type high_zoneidx,
 	nodemask_t *nodemask, struct zone *preferred_zone,
-	int classzone_idx, int migratetype)
+	int classzone_idx, int migratetype, bool *oom_failed)
 {
 	struct page *page;
 
+	*oom_failed = false;
 	/* Acquire the per-zone oom lock for each zone */
 	if (!oom_zonelist_trylock(zonelist, gfp_mask)) {
 		schedule_timeout_uninterruptible(1);
@@ -2252,14 +2251,6 @@  __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
 	}
 
 	/*
-	 * PM-freezer should be notified that there might be an OOM killer on
-	 * its way to kill and wake somebody up. This is too early and we might
-	 * end up not killing anything but false positives are acceptable.
-	 * See freeze_processes.
-	 */
-	note_oom_kill();
-
-	/*
 	 * Go through the zonelist yet one more time, keep very high watermark
 	 * here, this is only to catch a parallel oom killing, we must fail if
 	 * we're still under heavy pressure.
@@ -2289,8 +2280,8 @@  __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
 			goto out;
 	}
 	/* Exhausted what can be done so it's blamo time */
-	out_of_memory(zonelist, gfp_mask, order, nodemask, false);
-
+	if (!out_of_memory(zonelist, gfp_mask, order, nodemask, false))
+		*oom_failed = true;
 out:
 	oom_zonelist_unlock(zonelist, gfp_mask);
 	return page;
@@ -2716,8 +2707,8 @@  rebalance:
 	 */
 	if (!did_some_progress) {
 		if (oom_gfp_allowed(gfp_mask)) {
-			if (oom_killer_disabled)
-				goto nopage;
+			bool oom_failed;
+
 			/* Coredumps can quickly deplete all memory reserves */
 			if ((current->flags & PF_DUMPCORE) &&
 			    !(gfp_mask & __GFP_NOFAIL))
@@ -2725,10 +2716,19 @@  rebalance:
 			page = __alloc_pages_may_oom(gfp_mask, order,
 					zonelist, high_zoneidx,
 					nodemask, preferred_zone,
-					classzone_idx, migratetype);
+					classzone_idx, migratetype,
+					&oom_failed);
+
 			if (page)
 				goto got_pg;
 
+			/*
+			 * OOM killer might be disabled and then we have to
+			 * fail the allocation
+			 */
+			if (oom_failed)
+				goto nopage;
+
 			if (!(gfp_mask & __GFP_NOFAIL)) {
 				/*
 				 * The oom killer is not called for high-order