From patchwork Fri Jul 20 20:14:53 2018
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: David Rientjes <rientjes@google.com>
X-Patchwork-Id: 10538213
Date: Fri, 20 Jul 2018 13:14:53 -0700 (PDT)
From: David Rientjes <rientjes@google.com>
X-X-Sender: rientjes@chino.kir.corp.google.com
To: Andrew Morton <akpm@linux-foundation.org>
cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>,
	Michal Hocko <mhocko@suse.com>, linux-kernel@vger.kernel.org,
	linux-mm@kvack.org
Subject: [patch v4] mm, oom: fix unnecessary killing of additional processes
In-Reply-To: <alpine.DEB.2.21.1807200133310.119737@chino.kir.corp.google.com>
Message-ID: <alpine.DEB.2.21.1807201314230.231119@chino.kir.corp.google.com>
References: <alpine.DEB.2.21.1806211434420.51095@chino.kir.corp.google.com>
	<d19d44c3-c8cf-70a1-9b15-c98df233d5f0@i-love.sakura.ne.jp>
	<alpine.DEB.2.21.1807181317540.49359@chino.kir.corp.google.com>
	<a78fb992-ad59-0cdb-3c38-8284b2245f21@i-love.sakura.ne.jp>
	<alpine.DEB.2.21.1807200133310.119737@chino.kir.corp.google.com>
User-Agent: Alpine 2.21 (DEB 202 2017-01-01)
MIME-Version: 1.0
X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4
Sender: owner-linux-mm@kvack.org
Precedence: bulk
X-Loop: owner-majordomo@kvack.org
List-ID: <linux-mm.kvack.org>
X-Virus-Scanned: ClamAV using ClamSMTP

The oom reaper ensures forward progress by setting MMF_OOM_SKIP itself if
it cannot reap an mm.  This can happen for a variety of reasons,
including:

 - the inability to grab mm->mmap_sem in a sufficient amount of time,

 - when the mm has blockable mmu notifiers that could cause the oom reaper
   to stall indefinitely,

but there is also a third case, where the oom reaper can "reap" an mm but
doing so is unlikely to free any meaningful amount of memory:

 - when the mm's memory is mostly mlocked.

When all memory is mlocked, the oom reaper will not be able to free any
substantial amount of memory.  It nonetheless sets MMF_OOM_SKIP before the
victim can unmap and free its memory in exit_mmap(), so subsequent oom
victims are chosen unnecessarily.  This is trivial to reproduce if all
eligible processes on the system have mlocked their memory: the oom killer
calls panic() even though forward progress can be made.

The exit path has the same issue: it sets MMF_OOM_SKIP before unmapping
memory, so additional processes can be chosen unnecessarily because the
oom killer is racing with exit_mmap().  That race is separate from the
oom reaper setting MMF_OOM_SKIP prematurely.

We can't simply defer setting MMF_OOM_SKIP, however, because if there is
a true oom livelock in progress, it never gets set and no additional
killing is possible.

To fix this, this patch introduces a per-mm reaping period, which is
configurable through the new oom_free_timeout_ms file in debugfs and
defaults to one second to match the current heuristics.  This support
requires that the oom reaper's list becomes a proper linked list so that
other mm's may be reaped while waiting for an mm's timeout to expire.

This replaces the current timeouts in the oom reaper: (1) trying to grab
mm->mmap_sem 10 times in a row with HZ/10 sleeps in between and (2) a HZ
sleep if there are blockable mmu notifiers.  It extends them with a
timeout that allows an oom victim to reach exit_mmap() before additional
processes are chosen unnecessarily.

The exit path will now set MMF_OOM_SKIP only after all memory has been
freed, so additional oom killing is justified, and rely on MMF_UNSTABLE to
determine when it can race with the oom reaper.

The oom reaper will now set MMF_OOM_SKIP only after the reap timeout has
lapsed because it can no longer guarantee forward progress.  Since the
default oom_free_timeout_ms is one second, the same as current heuristics,
there should be no functional change with this patch for users who do not
tune it to be longer, other than that MMF_OOM_SKIP is now set by
exit_mmap() after free_pgtables(), which is the preferred behavior.

The reaping timeout can intentionally be set to a substantial amount of
time, such as 10s, since oom livelock is a very rare occurrence and it's
better to optimize for preventing additional (unnecessary) oom killing
than for a scenario that is much less likely.

Signed-off-by: David Rientjes <rientjes@google.com>
---
 v4:
  - fix double set_bit() per Tetsuo
  - fix jiffies wraparound per Tetsuo

 v3:
  - oom_free_timeout_ms is now static per kbuild test robot

 v2:
  - configurable timeout period through debugfs
  - change mm->reap_timeout to mm->oom_free_expire and add more
    descriptive comment per akpm
  - add comment to describe task->oom_reap_list locking based on
    oom_reaper_lock per akpm
  - rework the exit_mmap() comment and split into two parts to be more
    descriptive about the locking and the issue with the oom reaper
    racing with munlock_vma_pages_all() per akpm
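
As a reading aid for the oom_reap_task() changes, the per-pass decision
described in the changelog can be approximated by the single-threaded
userspace model below.  It is a sketch under simplifying assumptions, not
the kernel code: struct sketch_mm and the sketch_* names are hypothetical,
reaping is modelled as always succeeding (the real oom_reap_task_mm() can
bail out on mmap_sem contention or blockable mmu notifiers), and the
second MMF_OOM_SKIP check that catches a concurrent exit_mmap() is
omitted because nothing runs concurrently here.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the mm state the reaper looks at. */
struct sketch_mm {
	bool oom_skip;		/* models MMF_OOM_SKIP */
	bool unstable;		/* models MMF_UNSTABLE, set once reaped */
	unsigned long expire;	/* models mm->oom_free_expire */
};

enum sketch_action { SKETCH_DROP, SKETCH_REQUEUE };

/* One pass of the reaper over an mm at time "now" (in jiffies). */
static enum sketch_action sketch_reap_pass(struct sketch_mm *mm,
					   unsigned long now)
{
	/* Fully unmapped by exit_mmap(), or already given up on. */
	if (mm->oom_skip)
		return SKETCH_DROP;

	/* Reap at most once; reaping again is unlikely to free more. */
	if (!mm->unstable)
		mm->unstable = true;

	/* Timeout lapsed: give up so another victim may be chosen. */
	if ((long)(now - mm->expire) >= 0) {
		mm->oom_skip = true;
		return SKETCH_DROP;
	}

	/* Otherwise keep waiting for the victim to reach exit_mmap(). */
	return SKETCH_REQUEUE;
}
```

The point of the model is that MMF_OOM_SKIP is only ever set by the
reaper once the timeout has lapsed, so an mm that was reaped without
freeing much (for example, mostly mlocked memory) still gets the full
period to reach exit_mmap() before another process is killed.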

 include/linux/mm_types.h |   7 ++
 include/linux/sched.h    |   3 +-
 kernel/fork.c            |   3 +
 mm/mmap.c                |  19 +++---
 mm/oom_kill.c            | 143 ++++++++++++++++++++++++++-------------
 5 files changed, 118 insertions(+), 57 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -449,6 +449,13 @@ struct mm_struct {
 #ifdef CONFIG_MMU_NOTIFIER
 	struct mmu_notifier_mm *mmu_notifier_mm;
 #endif
+#ifdef CONFIG_MMU
+	/*
+	 * When to give up on memory freeing from this mm after its
+	 * threads have been oom killed, in jiffies.
+	 */
+	unsigned long oom_free_expire;
+#endif
 #if defined(CONFIG_TRANSPARENT_HUGEPAGE) && !USE_SPLIT_PMD_PTLOCKS
 	pgtable_t pmd_huge_pte; /* protected by page_table_lock */
 #endif
diff --git a/include/linux/sched.h b/include/linux/sched.h
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1184,7 +1184,8 @@ struct task_struct {
 #endif
 	int				pagefault_disabled;
 #ifdef CONFIG_MMU
-	struct task_struct		*oom_reaper_list;
+	/* OOM victim queue for oom reaper, protected by oom_reaper_lock */
+	struct list_head		oom_reap_list;
 #endif
 #ifdef CONFIG_VMAP_STACK
 	struct vm_struct		*stack_vm_area;
diff --git a/kernel/fork.c b/kernel/fork.c
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -843,6 +843,9 @@ static struct task_struct *dup_task_struct(struct task_struct *orig, int node)
 #ifdef CONFIG_FAULT_INJECTION
 	tsk->fail_nth = 0;
 #endif
+#ifdef CONFIG_MMU
+	INIT_LIST_HEAD(&tsk->oom_reap_list);
+#endif
 
 #ifdef CONFIG_MEMCG
 	tsk->active_memcg = NULL;
diff --git a/mm/mmap.c b/mm/mmap.c
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -3066,25 +3066,27 @@ void exit_mmap(struct mm_struct *mm)
 	if (unlikely(mm_is_oom_victim(mm))) {
 		/*
 		 * Manually reap the mm to free as much memory as possible.
-		 * Then, as the oom reaper does, set MMF_OOM_SKIP to disregard
-		 * this mm from further consideration.  Taking mm->mmap_sem for
-		 * write after setting MMF_OOM_SKIP will guarantee that the oom
-		 * reaper will not run on this mm again after mmap_sem is
-		 * dropped.
-		 *
 		 * Nothing can be holding mm->mmap_sem here and the above call
 		 * to mmu_notifier_release(mm) ensures mmu notifier callbacks in
 		 * __oom_reap_task_mm() will not block.
 		 *
+		 * This sets MMF_UNSTABLE to avoid racing with the oom reaper.
 		 * This needs to be done before calling munlock_vma_pages_all(),
 		 * which clears VM_LOCKED, otherwise the oom reaper cannot
-		 * reliably test it.
+		 * reliably test for it.  If the oom reaper races with
+		 * munlock_vma_pages_all(), this can result in a kernel oops if
+		 * a pmd is zapped, for example, after follow_page_mask() has
+		 * checked pmd_none().
 		 */
 		mutex_lock(&oom_lock);
 		__oom_reap_task_mm(mm);
 		mutex_unlock(&oom_lock);
 
-		set_bit(MMF_OOM_SKIP, &mm->flags);
+		/*
+		 * Taking mm->mmap_sem for write after setting MMF_UNSTABLE will
+		 * guarantee that the oom reaper will not run on this mm again
+		 * after mmap_sem is dropped.
+		 */
 		down_write(&mm->mmap_sem);
 		up_write(&mm->mmap_sem);
 	}
@@ -3112,6 +3114,7 @@ void exit_mmap(struct mm_struct *mm)
 	unmap_vmas(&tlb, vma, 0, -1);
 	free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);
 	tlb_finish_mmu(&tlb, 0, -1);
+	set_bit(MMF_OOM_SKIP, &mm->flags);
 
 	/*
 	 * Walk the list again, actually closing and freeing it,
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -41,6 +41,7 @@
 #include <linux/kthread.h>
 #include <linux/init.h>
 #include <linux/mmu_notifier.h>
+#include <linux/debugfs.h>
 
 #include <asm/tlb.h>
 #include "internal.h"
@@ -484,7 +485,7 @@ bool process_shares_mm(struct task_struct *p, struct mm_struct *mm)
  */
 static struct task_struct *oom_reaper_th;
 static DECLARE_WAIT_QUEUE_HEAD(oom_reaper_wait);
-static struct task_struct *oom_reaper_list;
+static LIST_HEAD(oom_reaper_list);
 static DEFINE_SPINLOCK(oom_reaper_lock);
 
 void __oom_reap_task_mm(struct mm_struct *mm)
@@ -527,10 +528,8 @@ void __oom_reap_task_mm(struct mm_struct *mm)
 	}
 }
 
-static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
+static void oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
 {
-	bool ret = true;
-
 	/*
 	 * We have to make sure to not race with the victim exit path
 	 * and cause premature new oom victim selection:
@@ -548,9 +547,8 @@ static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
 	mutex_lock(&oom_lock);
 
 	if (!down_read_trylock(&mm->mmap_sem)) {
-		ret = false;
 		trace_skip_task_reaping(tsk->pid);
-		goto unlock_oom;
+		goto out_oom;
 	}
 
 	/*
@@ -559,69 +557,81 @@ static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
 	 * TODO: we really want to get rid of this ugly hack and make sure that
 	 * notifiers cannot block for unbounded amount of time
 	 */
-	if (mm_has_blockable_invalidate_notifiers(mm)) {
-		up_read(&mm->mmap_sem);
-		schedule_timeout_idle(HZ);
-		goto unlock_oom;
-	}
+	if (mm_has_blockable_invalidate_notifiers(mm))
+		goto out_mm;
 
 	/*
-	 * MMF_OOM_SKIP is set by exit_mmap when the OOM reaper can't
-	 * work on the mm anymore. The check for MMF_OOM_SKIP must run
+	 * MMF_UNSTABLE is set by exit_mmap when the OOM reaper can't
+	 * work on the mm anymore. The check for MMF_UNSTABLE must run
 	 * under mmap_sem for reading because it serializes against the
 	 * down_write();up_write() cycle in exit_mmap().
 	 */
-	if (test_bit(MMF_OOM_SKIP, &mm->flags)) {
-		up_read(&mm->mmap_sem);
+	if (test_bit(MMF_UNSTABLE, &mm->flags)) {
 		trace_skip_task_reaping(tsk->pid);
-		goto unlock_oom;
+		goto out_mm;
 	}
 
 	trace_start_task_reaping(tsk->pid);
-
 	__oom_reap_task_mm(mm);
+	trace_finish_task_reaping(tsk->pid);
 
 	pr_info("oom_reaper: reaped process %d (%s), now anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB\n",
 			task_pid_nr(tsk), tsk->comm,
 			K(get_mm_counter(mm, MM_ANONPAGES)),
 			K(get_mm_counter(mm, MM_FILEPAGES)),
 			K(get_mm_counter(mm, MM_SHMEMPAGES)));
+out_mm:
 	up_read(&mm->mmap_sem);
-
-	trace_finish_task_reaping(tsk->pid);
-unlock_oom:
+out_oom:
 	mutex_unlock(&oom_lock);
-	return ret;
 }
 
-#define MAX_OOM_REAP_RETRIES 10
 static void oom_reap_task(struct task_struct *tsk)
 {
-	int attempts = 0;
 	struct mm_struct *mm = tsk->signal->oom_mm;
 
-	/* Retry the down_read_trylock(mmap_sem) a few times */
-	while (attempts++ < MAX_OOM_REAP_RETRIES && !oom_reap_task_mm(tsk, mm))
-		schedule_timeout_idle(HZ/10);
+	/*
+	 * If this mm has either been fully unmapped, or the oom reaper has
+	 * given up on it, nothing left to do except drop the refcount.
+	 */
+	if (test_bit(MMF_OOM_SKIP, &mm->flags))
+		goto drop;
 
-	if (attempts <= MAX_OOM_REAP_RETRIES ||
-	    test_bit(MMF_OOM_SKIP, &mm->flags))
-		goto done;
+	/*
+	 * If this mm has already been reaped, doing so again will not likely
+	 * free additional memory.
+	 */
+	if (!test_bit(MMF_UNSTABLE, &mm->flags))
+		oom_reap_task_mm(tsk, mm);
 
-	pr_info("oom_reaper: unable to reap pid:%d (%s)\n",
-		task_pid_nr(tsk), tsk->comm);
-	debug_show_all_locks();
+	if (time_after_eq(jiffies, mm->oom_free_expire)) {
+		if (!test_bit(MMF_OOM_SKIP, &mm->flags)) {
+			pr_info("oom_reaper: unable to reap pid:%d (%s)\n",
+				task_pid_nr(tsk), tsk->comm);
+			debug_show_all_locks();
 
-done:
-	tsk->oom_reaper_list = NULL;
+			/*
+			 * Reaping has failed for the timeout period, so give up
+			 * and allow additional processes to be oom killed.
+			 */
+			set_bit(MMF_OOM_SKIP, &mm->flags);
+		}
+		goto drop;
+	}
 
-	/*
-	 * Hide this mm from OOM killer because it has been either reaped or
-	 * somebody can't call up_write(mmap_sem).
-	 */
-	set_bit(MMF_OOM_SKIP, &mm->flags);
+	if (test_bit(MMF_OOM_SKIP, &mm->flags))
+		goto drop;
+
+	/* Enqueue to be reaped again */
+	spin_lock(&oom_reaper_lock);
+	list_add_tail(&tsk->oom_reap_list, &oom_reaper_list);
+	spin_unlock(&oom_reaper_lock);
+
+	schedule_timeout_idle(HZ/10);
+	return;
 
-	/* Drop a reference taken by wake_oom_reaper */
+drop:
+	/* Drop the reference taken by wake_oom_reaper */
 	put_task_struct(tsk);
 }
 
@@ -630,11 +640,13 @@ static int oom_reaper(void *unused)
 	while (true) {
 		struct task_struct *tsk = NULL;
 
-		wait_event_freezable(oom_reaper_wait, oom_reaper_list != NULL);
+		wait_event_freezable(oom_reaper_wait,
+				     !list_empty(&oom_reaper_list));
 		spin_lock(&oom_reaper_lock);
-		if (oom_reaper_list != NULL) {
-			tsk = oom_reaper_list;
-			oom_reaper_list = tsk->oom_reaper_list;
+		if (!list_empty(&oom_reaper_list)) {
+			tsk = list_entry(oom_reaper_list.next,
+					 struct task_struct, oom_reap_list);
+			list_del(&tsk->oom_reap_list);
 		}
 		spin_unlock(&oom_reaper_lock);
 
@@ -645,25 +657,60 @@ static int oom_reaper(void *unused)
 	return 0;
 }
 
+/*
+ * Millisecs to wait for an oom mm to free memory before selecting another
+ * victim.
+ */
+static u64 oom_free_timeout_ms = 1000;
 static void wake_oom_reaper(struct task_struct *tsk)
 {
-	/* tsk is already queued? */
-	if (tsk == oom_reaper_list || tsk->oom_reaper_list)
+	unsigned long expire = jiffies + msecs_to_jiffies(oom_free_timeout_ms);
+
+	if (!expire)
+		expire++;
+	/*
+	 * Set the reap timeout; if it's already set, the mm is enqueued and
+	 * this tsk can be ignored.
+	 */
+	if (cmpxchg(&tsk->signal->oom_mm->oom_free_expire, 0UL, expire))
 		return;
 
 	get_task_struct(tsk);
 
 	spin_lock(&oom_reaper_lock);
-	tsk->oom_reaper_list = oom_reaper_list;
-	oom_reaper_list = tsk;
+	list_add(&tsk->oom_reap_list, &oom_reaper_list);
 	spin_unlock(&oom_reaper_lock);
 	trace_wake_reaper(tsk->pid);
 	wake_up(&oom_reaper_wait);
 }
 
+#ifdef CONFIG_DEBUG_FS
+static int oom_free_timeout_ms_read(void *data, u64 *val)
+{
+	*val = oom_free_timeout_ms;
+	return 0;
+}
+
+static int oom_free_timeout_ms_write(void *data, u64 val)
+{
+	if (val > 60 * 1000)
+		return -EINVAL;
+
+	oom_free_timeout_ms = val;
+	return 0;
+}
+DEFINE_SIMPLE_ATTRIBUTE(oom_free_timeout_ms_fops, oom_free_timeout_ms_read,
+			oom_free_timeout_ms_write, "%llu\n");
+#endif /* CONFIG_DEBUG_FS */
+
 static int __init oom_init(void)
 {
 	oom_reaper_th = kthread_run(oom_reaper, NULL, "oom_reaper");
+#ifdef CONFIG_DEBUG_FS
+	if (!IS_ERR(oom_reaper_th))
+		debugfs_create_file("oom_free_timeout_ms", 0200, NULL, NULL,
+				    &oom_free_timeout_ms_fops);
+#endif
 	return 0;
 }
 subsys_initcall(oom_init)