From patchwork Wed Jun 20 10:37:36 2018
X-Patchwork-Submitter: Michal Hocko
X-Patchwork-Id: 10476869
From: Michal Hocko
Cc: Johannes Weiner, Greg Thelen, Shakeel Butt, Andrew Morton, LKML,
 Michal Hocko
Subject: [RFC PATCH] memcg, oom: move out_of_memory back to the charge path
Date: Wed, 20 Jun 2018 12:37:36 +0200
Message-Id: <20180620103736.13880-1-mhocko@kernel.org>
X-Mailer: git-send-email 2.17.1

From: Michal Hocko

3812c8c8f395 ("mm: memcg: do not trap chargers with full callstack on
OOM") has changed the ENOMEM semantic of memcg charges. Rather than
invoking the oom killer from the charging context it delays the oom
killer to the page fault path (pagefault_out_of_memory). This in turn
means that many users (e.g. slab or g-u-p) will get ENOMEM when the
corresponding memcg hits the hard limit and the memcg is OOM. This
behavior is inconsistent with the !memcg case, where the oom killer is
invoked from the allocation context and the allocator keeps retrying
until it succeeds.

The difference in the behavior is user visible. mmap(MAP_POPULATE)
might result in not fully populated ranges while the mmap return code
doesn't tell that to userspace. Random syscalls might fail with ENOMEM
etc.
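For illustration, here is a minimal reproducer sketch of the
MAP_POPULATE inconsistency (not part of the patch; it assumes it runs
inside a memcg whose hard limit is below the mapping size, and the
512M size is an arbitrary stand-in):

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/mman.h>

#define LEN (512UL << 20)	/* pick something above the memcg hard limit */

int main(void)
{
	size_t pages = LEN / sysconf(_SC_PAGESIZE);
	unsigned char *vec = malloc(pages);
	size_t i, resident = 0;
	char *buf;

	if (!vec)
		return 1;

	/* MAP_POPULATE asks the kernel to fault in the whole range */
	buf = mmap(NULL, LEN, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);
	if (buf == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	/* count how much of the "populated" range is actually resident */
	if (mincore(buf, LEN, vec)) {
		perror("mincore");
		return 1;
	}
	for (i = 0; i < pages; i++)
		resident += vec[i] & 1;

	/* with the delayed memcg oom semantic this can print < pages */
	printf("mmap succeeded, %zu/%zu pages resident\n", resident, pages);
	return 0;
}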
The primary motivation of the different memcg oom semantic was deadlock
avoidance. Things have changed since then, though. We have an async oom
teardown by the oom reaper now and so we do not have to rely on the
victim to tear down its memory anymore. Therefore we can return to the
original semantic as long as the memcg oom killer is not handed over to
user space.

There is still one thing to be careful about here though. If the oom
killer is not able to make any forward progress - e.g. because there is
no eligible task to kill - then we have to bail out of the charge path
to prevent the same class of deadlocks. We have basically two options
here. Either we fail the charge with ENOMEM or force the charge and
allow overcharge. The first option has been considered more harmful
than useful because rare inconsistencies in the ENOMEM behavior are
hard to test for and error prone. This is basically the same reason why
the page allocator doesn't fail allocations under such conditions. The
latter might allow runaways but those should be really unlikely unless
somebody misconfigures the system, e.g. by allowing tasks to migrate
away from the memcg to a different unlimited memcg with
move_charge_at_immigrate disabled.

Signed-off-by: Michal Hocko
---
Hi,
we have discussed this at LSFMM this year and my recollection is that
we have agreed that we should do this. So I am sending the patch as an
RFC. Please note I have only forward ported the patch without any
testing yet. I would like to see a general agreement before I spend
more time on this. A stand-alone toy model of the resulting charge
semantic is appended after the patch for illustration.

Thoughts? Objections?

 mm/memcontrol.c | 68 ++++++++++++++++++++++++++++++++++++++++++--------
 1 file changed, 57 insertions(+), 11 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index e6f0d5ef320a..7fe3ce1fd625 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1483,28 +1483,54 @@ static void memcg_oom_recover(struct mem_cgroup *memcg)
 	__wake_up(&memcg_oom_waitq, TASK_NORMAL, 0, memcg);
 }
 
-static void mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int order)
+enum oom_status {
+	OOM_SUCCESS,
+	OOM_FAILED,
+	OOM_ASYNC,
+	OOM_SKIPPED
+};
+
+static enum oom_status mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int order)
 {
-	if (!current->memcg_may_oom || order > PAGE_ALLOC_COSTLY_ORDER)
-		return;
+	if (order > PAGE_ALLOC_COSTLY_ORDER)
+		return OOM_SKIPPED;
 	/*
 	 * We are in the middle of the charge context here, so we
 	 * don't want to block when potentially sitting on a callstack
 	 * that holds all kinds of filesystem and mm locks.
 	 *
-	 * Also, the caller may handle a failed allocation gracefully
-	 * (like optional page cache readahead) and so an OOM killer
-	 * invocation might not even be necessary.
+	 * cgroup1 allows disabling the OOM killer and waiting for outside
+	 * handling until the charge can succeed; remember the context and put
+	 * the task to sleep at the end of the page fault when all locks are
+	 * released.
+	 *
+	 * On the other hand, in-kernel OOM killer allows for an async victim
+	 * memory reclaim (oom_reaper) and that means that we are not solely
+	 * relying on the oom victim to make a forward progress and we can
+	 * invoke the oom killer here.
 	 *
-	 * That's why we don't do anything here except remember the
-	 * OOM context and then deal with it at the end of the page
-	 * fault when the stack is unwound, the locks are released,
-	 * and when we know whether the fault was overall successful.
+	 * Please note that mem_cgroup_out_of_memory might fail to find a victim
+	 * and then we have to rely on mem_cgroup_oom_synchronize, otherwise we
+	 * would fall back to the global oom killer in pagefault_out_of_memory.
 	 */
+	if (!memcg->oom_kill_disable) {
+		if (mem_cgroup_out_of_memory(memcg, mask, order))
+			return OOM_SUCCESS;
+
+		WARN(!current->memcg_may_oom,
+				"Memory cgroup charge failed because of no reclaimable memory! "
+				"This looks like a misconfiguration or a kernel bug.");
+		return OOM_FAILED;
+	}
+
+	if (!current->memcg_may_oom)
+		return OOM_SKIPPED;
+
 	css_get(&memcg->css);
 	current->memcg_in_oom = memcg;
 	current->memcg_oom_gfp_mask = mask;
 	current->memcg_oom_order = order;
+
+	return OOM_ASYNC;
 }
 
 /**
@@ -1899,6 +1925,8 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	unsigned long nr_reclaimed;
 	bool may_swap = true;
 	bool drained = false;
+	bool oomed = false;
+	enum oom_status oom_status;
 
 	if (mem_cgroup_is_root(memcg))
 		return 0;
@@ -1986,6 +2014,9 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	if (nr_retries--)
 		goto retry;
 
+	if (gfp_mask & __GFP_RETRY_MAYFAIL && oomed)
+		goto nomem;
+
 	if (gfp_mask & __GFP_NOFAIL)
 		goto force;
 
@@ -1994,8 +2025,23 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	memcg_memory_event(mem_over_limit, MEMCG_OOM);
 
-	mem_cgroup_oom(mem_over_limit, gfp_mask,
+	/*
+	 * keep retrying as long as the memcg oom killer is able to make
+	 * a forward progress or bypass the charge if the oom killer
+	 * couldn't make any progress.
+	 */
+	oom_status = mem_cgroup_oom(mem_over_limit, gfp_mask,
 		       get_order(nr_pages * PAGE_SIZE));
+	switch (oom_status) {
+	case OOM_SUCCESS:
+		nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
+		oomed = true;
+		goto retry;
+	case OOM_FAILED:
+		goto force;
+	default:
+		goto nomem;
+	}
 
 nomem:
 	if (!(gfp_mask & __GFP_NOFAIL))
 		return -ENOMEM;
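As promised above, here is a stand-alone toy model of the decision
structure the patch introduces (a userspace sketch; the helper names,
the retry count and the return convention are made up here, this is
not the kernel code):

#include <stdbool.h>
#include <stdio.h>

enum oom_status { OOM_SUCCESS, OOM_FAILED, OOM_ASYNC, OOM_SKIPPED };

/* stand-in for the in-kernel memcg oom killer invocation */
static enum oom_status toy_mem_cgroup_oom(bool oom_kill_disable, bool victim)
{
	if (oom_kill_disable)
		return OOM_ASYNC;	/* defer to userspace oom handling */
	return victim ? OOM_SUCCESS : OOM_FAILED;
}

/* returns 0 on a successful charge, 1 for a forced overcharge, -1 for ENOMEM */
static int toy_try_charge(bool oom_kill_disable, bool victim)
{
	int nr_retries = 5;	/* MEM_CGROUP_RECLAIM_RETRIES stand-in */
	bool oomed = false;

	while (nr_retries--) {
		/* a killed victim freed memory, so the retried charge succeeds */
		if (oomed)
			return 0;

		switch (toy_mem_cgroup_oom(oom_kill_disable, victim)) {
		case OOM_SUCCESS:	/* a victim was killed, retry the charge */
			oomed = true;
			continue;
		case OOM_FAILED:	/* no eligible victim, overcharge instead */
			return 1;
		default:		/* OOM_ASYNC/OOM_SKIPPED, bail out */
			return -1;
		}
	}
	return -1;
}

int main(void)
{
	printf("victim available : %d\n", toy_try_charge(false, true));
	printf("no eligible task : %d\n", toy_try_charge(false, false));
	printf("oom_kill_disable : %d\n", toy_try_charge(true, true));
	return 0;
}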