From patchwork Fri Jul 13 23:07:31 2018
X-Patchwork-Submitter: David Rientjes
X-Patchwork-Id: 10524231
Date: Fri, 13 Jul 2018 16:07:31 -0700 (PDT)
From: David Rientjes <rientjes@google.com>
To: Andrew Morton, Roman Gushchin
Cc: Michal Hocko, Vladimir Davydov, Johannes Weiner, Tejun Heo,
    cgroups@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: [patch v3 -mm 4/6] mm, memcg: evaluate root and leaf memcgs fairly on oom

There are several downsides to the current implementation that compares
the root mem cgroup with leaf mem cgroups for the cgroup-aware oom killer.
For example, /proc/pid/oom_score_adj is accounted for processes attached
to the root mem cgroup but not for leaves.  This leads to wild
inconsistencies that unfairly bias for or against the root mem cgroup.

Assume a 728KB bash shell is attached to the root mem cgroup without any
other processes having a non-default /proc/pid/oom_score_adj.  At the time
of system oom, the root mem cgroup evaluated to 43,474 pages after boot.
If the bash shell adjusts its /proc/self/oom_score_adj to 1000, however,
the root mem cgroup evaluates to 24,765,482 pages.  It would take a cgroup
using 95GB of memory to outweigh the root mem cgroup's evaluation.
The reverse is even more confusing: if the bash shell adjusts its
/proc/self/oom_score_adj to -999, the root mem cgroup evaluates to 42,268
pages, a basically meaningless transformation.

/proc/pid/oom_score_adj is discounted, however, for processes attached to
leaf mem cgroups.  If a sole process using 250MB of memory is attached to
a mem cgroup, it evaluates to 250MB >> PAGE_SHIFT.  If its
/proc/pid/oom_score_adj is changed to -999, or even 1000, the evaluation
remains the same for the mem cgroup.

The heuristic that is used for the root mem cgroup also differs from that
of leaf mem cgroups.  For the root mem cgroup, the evaluation is the sum
of all processes' /proc/pid/oom_score.  Besides factoring in
oom_score_adj, it is based on the sum of rss + swap + page tables for all
processes attached to it.  For leaf mem cgroups, it is based on the amount
of anonymous or unevictable memory + unreclaimable slab + kernel stack +
sock + swap.

There's also an exemption for root mem cgroup processes that do not
intersect the allocating process's mems_allowed.  Because the current
heuristic is based on oom_badness(), the evaluation of the root mem cgroup
disregards all processes attached to it that have disjoint mems_allowed,
making oom selection specifically dependent on the allocating process for
system oom conditions!

This patch introduces a completely fair comparison between the root mem
cgroup and leaf mem cgroups.  It compares them with the same heuristic and
does not prefer one over the other.  It disregards oom_score_adj, as the
cgroup-aware oom killer should when enabled by memory.oom_policy.  The
goal is to target the most memory-consuming cgroup on the system, not to
consider per-process adjustments.

The fact that the evaluation of all mem cgroups depends on the mempolicy
of the allocating process, which is completely undocumented for the
cgroup-aware oom killer, will be addressed in a subsequent patch.
Signed-off-by: David Rientjes
---
 Documentation/admin-guide/cgroup-v2.rst |   7 +-
 mm/memcontrol.c                         | 149 ++++++++++++------------
 2 files changed, 76 insertions(+), 80 deletions(-)

diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -1382,12 +1382,7 @@ OOM killer to kill all processes attached to the cgroup, except processes
 with /proc/pid/oom_score_adj set to -1000 (oom disabled).
 
 The root cgroup is treated as a leaf memory cgroup as well, so it is
-compared with other leaf memory cgroups. Due to internal implementation
-restrictions the size of the root cgroup is the cumulative sum of
-oom_badness of all its tasks (in other words oom_score_adj of each task
-is obeyed). Relying on oom_score_adj (apart from OOM_SCORE_ADJ_MIN) can
-lead to over- or underestimation of the root cgroup consumption and it is
-therefore discouraged. This might change in the future, however.
+compared with other leaf memory cgroups.
 
 Please, note that memory charges are not migrating if tasks are moved
 between different memory cgroups. Moving tasks with
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -94,6 +94,8 @@ int do_swap_account __read_mostly;
 #define do_swap_account		0
 #endif
 
+static atomic_long_t total_sock_pages;
+
 /* Whether legacy memory+swap accounting is active */
 static bool do_memsw_account(void)
 {
@@ -2818,9 +2820,9 @@ static inline bool memcg_has_children(struct mem_cgroup *memcg)
 }
 
 static long memcg_oom_badness(struct mem_cgroup *memcg,
-			      const nodemask_t *nodemask,
-			      unsigned long totalpages)
+			      const nodemask_t *nodemask)
 {
+	const bool is_root_memcg = memcg == root_mem_cgroup;
 	long points = 0;
 	int nid;
 	pg_data_t *pgdat;
@@ -2829,92 +2831,65 @@ static long memcg_oom_badness(struct mem_cgroup *memcg,
 		if (nodemask && !node_isset(nid, *nodemask))
 			continue;
 
-		points += mem_cgroup_node_nr_lru_pages(memcg, nid,
-				LRU_ALL_ANON | BIT(LRU_UNEVICTABLE));
-
 		pgdat = NODE_DATA(nid);
-		points += lruvec_page_state(mem_cgroup_lruvec(pgdat, memcg),
-					    NR_SLAB_UNRECLAIMABLE);
+		if (is_root_memcg) {
+			points += node_page_state(pgdat, NR_ACTIVE_ANON) +
+				  node_page_state(pgdat, NR_INACTIVE_ANON);
+			points += node_page_state(pgdat, NR_SLAB_UNRECLAIMABLE);
+		} else {
+			points += mem_cgroup_node_nr_lru_pages(memcg, nid,
+							       LRU_ALL_ANON);
+			points += lruvec_page_state(mem_cgroup_lruvec(pgdat, memcg),
+						    NR_SLAB_UNRECLAIMABLE);
+		}
 	}
 
-	points += memcg_page_state(memcg, MEMCG_KERNEL_STACK_KB) /
-			(PAGE_SIZE / 1024);
-	points += memcg_page_state(memcg, MEMCG_SOCK);
-	points += memcg_page_state(memcg, MEMCG_SWAP);
-
+	if (is_root_memcg) {
+		points += global_zone_page_state(NR_KERNEL_STACK_KB) /
+				(PAGE_SIZE / 1024);
+		points += atomic_long_read(&total_sock_pages);
+		points += total_swap_pages - get_nr_swap_pages();
+	} else {
+		points += memcg_page_state(memcg, MEMCG_KERNEL_STACK_KB) /
+				(PAGE_SIZE / 1024);
+		points += memcg_page_state(memcg, MEMCG_SOCK);
+		points += memcg_page_state(memcg, MEMCG_SWAP);
+	}
 	return points;
 }
 
 /*
- * Checks if the given memcg is a valid OOM victim and returns a number,
- * which means the folowing:
- *   -1: there are inflight OOM victim tasks, belonging to the memcg
- *    0: memcg is not eligible, e.g. all belonging tasks are protected
- *       by oom_score_adj set to OOM_SCORE_ADJ_MIN
+ * Checks if the given non-root memcg has a valid OOM victim and returns a
+ * number, which means the following:
+ *   -1: there is an inflight OOM victim process attached to the memcg
+ *    0: memcg is not eligible because all tasks attached are unkillable
+ *       (kthreads or oom_score_adj set to OOM_SCORE_ADJ_MIN)
 *   >0: memcg is eligible, and the returned value is an estimation
 *       of the memory footprint
 */
 static long oom_evaluate_memcg(struct mem_cgroup *memcg,
-			       const nodemask_t *nodemask,
-			       unsigned long totalpages)
+			       const nodemask_t *nodemask)
 {
 	struct css_task_iter it;
 	struct task_struct *task;
 	int eligible = 0;
 
 	/*
-	 * Root memory cgroup is a special case:
-	 * we don't have necessary stats to evaluate it exactly as
-	 * leaf memory cgroups, so we approximate it's oom_score
-	 * by summing oom_score of all belonging tasks, which are
-	 * owners of their mm structs.
-	 *
-	 * If there are inflight OOM victim tasks inside
-	 * the root memcg, we return -1.
-	 */
-	if (memcg == root_mem_cgroup) {
-		struct css_task_iter it;
-		struct task_struct *task;
-		long score = 0;
-
-		css_task_iter_start(&memcg->css, 0, &it);
-		while ((task = css_task_iter_next(&it))) {
-			if (tsk_is_oom_victim(task) &&
-			    !test_bit(MMF_OOM_SKIP,
-				      &task->signal->oom_mm->flags)) {
-				score = -1;
-				break;
-			}
-
-			task_lock(task);
-			if (!task->mm) {
-				task_unlock(task);
-				continue;
-			}
-			task_unlock(task);
-
-			score += oom_badness(task, memcg, nodemask,
-					     totalpages);
-		}
-		css_task_iter_end(&it);
-
-		return score;
-	}
-
-	/*
-	 * Memcg is OOM eligible if there are OOM killable tasks inside.
-	 *
-	 * We treat tasks with oom_score_adj set to OOM_SCORE_ADJ_MIN
-	 * as unkillable.
-	 *
-	 * If there are inflight OOM victim tasks inside the memcg,
-	 * we return -1.
+	 * Memcg is eligible for oom kill if at least one process is eligible
+	 * to be killed.  Processes with oom_score_adj of OOM_SCORE_ADJ_MIN
+	 * are unkillable.
 	 */
 	css_task_iter_start(&memcg->css, 0, &it);
 	while ((task = css_task_iter_next(&it))) {
+		task_lock(task);
+		if (!task->mm) {
+			task_unlock(task);
+			continue;
+		}
 		if (!eligible &&
 		    task->signal->oom_score_adj != OOM_SCORE_ADJ_MIN)
 			eligible = 1;
+		task_unlock(task);
 
 		if (tsk_is_oom_victim(task) &&
 		    !test_bit(MMF_OOM_SKIP, &task->signal->oom_mm->flags)) {
@@ -2927,13 +2902,14 @@ static long oom_evaluate_memcg(struct mem_cgroup *memcg,
 	if (eligible <= 0)
 		return eligible;
 
-	return memcg_oom_badness(memcg, nodemask, totalpages);
+	return memcg_oom_badness(memcg, nodemask);
 }
 
 static void select_victim_memcg(struct mem_cgroup *root, struct oom_control *oc)
 {
 	struct mem_cgroup *iter, *group = NULL;
 	long group_score = 0;
+	long leaf_score = 0;
 
 	oc->chosen_memcg = NULL;
 	oc->chosen_points = 0;
@@ -2959,12 +2935,18 @@ static void select_victim_memcg(struct mem_cgroup *root, struct oom_control *oc)
 	for_each_mem_cgroup_tree(iter, root) {
 		long score;
 
+		/*
+		 * Root memory cgroup will be considered after iteration,
+		 * if eligible.
+		 */
+		if (iter == root_mem_cgroup)
+			continue;
+
 		/*
 		 * We don't consider non-leaf non-oom_group memory cgroups
 		 * without the oom policy of "tree" as OOM victims.
 		 */
-		if (memcg_has_children(iter) && iter != root_mem_cgroup &&
-		    !mem_cgroup_oom_group(iter) &&
+		if (memcg_has_children(iter) && !mem_cgroup_oom_group(iter) &&
 		    iter->oom_policy != MEMCG_OOM_POLICY_TREE)
 			continue;
 
@@ -2972,16 +2954,15 @@ static void select_victim_memcg(struct mem_cgroup *root, struct oom_control *oc)
		 * If group is not set or we've ran out of the group's sub-tree,
		 * we should set group and reset group_score.
		 */
-		if (!group || group == root_mem_cgroup ||
-		    !mem_cgroup_is_descendant(iter, group)) {
+		if (!group || !mem_cgroup_is_descendant(iter, group)) {
 			group = iter;
 			group_score = 0;
 		}
 
-		if (memcg_has_children(iter) && iter != root_mem_cgroup)
+		if (memcg_has_children(iter))
 			continue;
 
-		score = oom_evaluate_memcg(iter, oc->nodemask, oc->totalpages);
+		score = oom_evaluate_memcg(iter, oc->nodemask);
 
 		/*
 		 * Ignore empty and non-eligible memory cgroups.
@@ -3000,6 +2981,7 @@ static void select_victim_memcg(struct mem_cgroup *root, struct oom_control *oc)
 		}
 
 		group_score += score;
+		leaf_score += score;
 
 		if (group_score > oc->chosen_points) {
 			oc->chosen_points = group_score;
@@ -3007,8 +2989,25 @@ static void select_victim_memcg(struct mem_cgroup *root, struct oom_control *oc)
 		}
 	}
 
-	if (oc->chosen_memcg && oc->chosen_memcg != INFLIGHT_VICTIM)
-		css_get(&oc->chosen_memcg->css);
+	if (oc->chosen_memcg != INFLIGHT_VICTIM) {
+		if (root == root_mem_cgroup) {
+			group_score = oom_evaluate_memcg(root_mem_cgroup,
+							 oc->nodemask);
+			if (group_score > leaf_score) {
+				/*
+				 * Discount the sum of all leaf scores to find
+				 * root score.
+				 */
+				group_score -= leaf_score;
+				if (group_score > oc->chosen_points) {
+					oc->chosen_points = group_score;
+					oc->chosen_memcg = root_mem_cgroup;
+				}
+			}
+		}
+		if (oc->chosen_memcg)
+			css_get(&oc->chosen_memcg->css);
+	}
 
 	rcu_read_unlock();
 }
@@ -6491,6 +6490,7 @@ bool mem_cgroup_charge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages)
 		gfp_mask = GFP_NOWAIT;
 
 	mod_memcg_state(memcg, MEMCG_SOCK, nr_pages);
+	atomic_long_add(nr_pages, &total_sock_pages);
 	if (try_charge(memcg, gfp_mask, nr_pages) == 0)
 		return true;
@@ -6512,6 +6512,7 @@ void mem_cgroup_uncharge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages)
 	}
 
 	mod_memcg_state(memcg, MEMCG_SOCK, -nr_pages);
+	atomic_long_add(-nr_pages, &total_sock_pages);
 
 	refill_stock(memcg, nr_pages);
 }