From patchwork Sun Oct 13 00:30:16 2019
X-Patchwork-Submitter: Mina Almasry
X-Patchwork-Id: 11187165
Date: Sat, 12 Oct 2019 17:30:16 -0700
Message-Id: <20191013003024.215429-1-almasrymina@google.com>
Subject: [PATCH v6 1/9] hugetlb_cgroup: Add hugetlb_cgroup reservation counter
From: Mina Almasry
To: mike.kravetz@oracle.com
Cc: shuah@kernel.org, almasrymina@google.com, rientjes@google.com, shakeelb@google.com, gthelen@google.com, akpm@linux-foundation.org, khalid.aziz@oracle.com, linux-kernel@vger.kernel.org, linux-mm@kvack.org, linux-kselftest@vger.kernel.org, cgroups@vger.kernel.org, aneesh.kumar@linux.vnet.ibm.com, mkoutny@suse.com, Hillf Danton

These counters will track hugetlb reservations rather than hugetlb memory
faulted in. This patch only adds the counter; following patches add the
charging and uncharging of the counter.

Problem:
Currently, tasks attempting to allocate more hugetlb memory than is
available get a failure at mmap/shmget time. This is thanks to Hugetlbfs
Reservations [1]. However, if a task attempts to allocate more hugetlb
memory than its hugetlb_cgroup limit allows, the kernel will allow the
mmap/shmget call, but will SIGBUS the task when it attempts to fault the
memory in. We have developers interested in using hugetlb_cgroups, and
they have expressed dissatisfaction with this behavior. We'd like to
improve it such that tasks violating the hugetlb_cgroup limits get an
error at mmap/shmget time, rather than getting SIGBUS'd when they try to
fault the excess memory in.

The underlying problem is that today's hugetlb_cgroup accounting happens
at hugetlb memory *fault* time, rather than at *reservation* time. Thus,
enforcing the hugetlb_cgroup limit only happens at fault time, and the
offending task gets SIGBUS'd.

Proposed Solution:
A new page counter named hugetlb.xMB.reservation_[limit|usage]_in_bytes.
This counter has slightly different semantics than
hugetlb.xMB.[limit|usage]_in_bytes:

- While usage_in_bytes tracks all *faulted* hugetlb memory,
  reservation_usage_in_bytes tracks all *reserved* hugetlb memory and
  hugetlb memory faulted in without a prior reservation.

- If a task attempts to reserve more memory than limit_in_bytes allows,
  the kernel will allow it to do so. But if a task attempts to reserve
  more memory than reservation_limit_in_bytes allows, the kernel will
  fail the reservation.
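To make the proposed interface concrete, here is a minimal userspace
sketch (illustration only, not part of this series) that reads the new
reservation usage counter on a system with this series applied. The
cgroup mount point and cgroup name are assumptions for illustration:

/* Sketch: read hugetlb.2MB.reservation_usage_in_bytes for a cgroup.
 * Assumes a cgroup-v1 hugetlb hierarchy mounted at /sys/fs/cgroup/hugetlb
 * and a cgroup named "test"; adjust paths for your setup.
 */
#include <stdio.h>

int main(void)
{
	const char *path = "/sys/fs/cgroup/hugetlb/test/"
			   "hugetlb.2MB.reservation_usage_in_bytes";
	FILE *f = fopen(path, "r");
	unsigned long long usage;

	if (!f || fscanf(f, "%llu", &usage) != 1) {
		perror(path);
		return 1;
	}
	fclose(f);
	printf("reserved hugetlb: %llu bytes\n", usage);
	return 0;
}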
This proposal is implemented in this patch series, with tests to verify
functionality and show the usage. We also added cgroup-v2 support to
hugetlb_cgroup, so that the new use cases can be extended to v2.

Alternatives considered:
1. A new cgroup, instead of only a new page_counter attached to the
   existing hugetlb_cgroup. Adding a new cgroup seemed like a lot of code
   duplication with hugetlb_cgroup. Keeping the hugetlb related page
   counters under hugetlb_cgroup seemed cleaner as well.

2. Instead of adding a new counter, we considered adding a sysctl that
   modifies the behavior of hugetlb.xMB.[limit|usage]_in_bytes to do
   accounting at reservation time rather than fault time. Adding a new
   page_counter seems better, as userspace could, if it wants, choose to
   enforce different cgroups differently: one via limit_in_bytes and
   another via reservation_limit_in_bytes. This could be very useful if
   you're transitioning how hugetlb memory is partitioned on your system
   one cgroup at a time, for example. Also, someone may find use for both
   limit_in_bytes and reservation_limit_in_bytes concurrently, and this
   approach gives them the option to do so.

Testing:
- Added tests passing.
- libhugetlbfs tests mostly passing, but some tests have trouble with and
  without this patch series. This seems to be an environment issue rather
  than a code issue:
- Overall results:
  ********** TEST SUMMARY
  *                            2M
  *                       32-bit 64-bit
  *     Total testcases:      84      0
  *             Skipped:       0      0
  *                PASS:      66      0
  *                FAIL:      14      0
  *    Killed by signal:       0      0
  *   Bad configuration:       4      0
  *       Expected FAIL:       0      0
  *     Unexpected PASS:       0      0
  *    Test not present:       0      0
  * Strange test result:       0      0
  **********
- Failing tests:
- elflink_rw_and_share_test("linkhuge_rw") segfaults with and without
  this patch series.
- LD_PRELOAD=libhugetlbfs.so HUGETLB_MORECORE=yes malloc (2M: 32):
  FAIL Address is not hugepage
- LD_PRELOAD=libhugetlbfs.so HUGETLB_RESTRICT_EXE=unknown:malloc
  HUGETLB_MORECORE=yes malloc (2M: 32): FAIL Address is not hugepage
- LD_PRELOAD=libhugetlbfs.so HUGETLB_MORECORE=yes malloc_manysmall
  (2M: 32): FAIL Address is not hugepage
- GLIBC_TUNABLES=glibc.malloc.tcache_count=0 LD_PRELOAD=libhugetlbfs.so
  HUGETLB_MORECORE=yes heapshrink (2M: 32): FAIL Heap not on hugepages
- GLIBC_TUNABLES=glibc.malloc.tcache_count=0 LD_PRELOAD=libhugetlbfs.so
  libheapshrink.so HUGETLB_MORECORE=yes heapshrink (2M: 32): FAIL Heap
  not on hugepages
- HUGETLB_ELFMAP=RW linkhuge_rw (2M: 32): FAIL small_data is not hugepage
- HUGETLB_ELFMAP=RW HUGETLB_MINIMAL_COPY=no linkhuge_rw (2M: 32): FAIL
  small_data is not hugepage
- alloc-instantiate-race shared (2M: 32): Bad configuration:
  sched_setaffinity(cpu1): Invalid argument - FAIL Child 1 killed by
  signal Killed
- shmoverride_linked (2M: 32): FAIL shmget failed size 2097152 from line
  176: Invalid argument
- HUGETLB_SHM=yes shmoverride_linked (2M: 32): FAIL shmget failed size
  2097152 from line 176: Invalid argument
- shmoverride_linked_static (2M: 32): FAIL shmget failed size 2097152
  from line 176: Invalid argument
- HUGETLB_SHM=yes shmoverride_linked_static (2M: 32): FAIL shmget failed
  size 2097152 from line 176: Invalid argument
- LD_PRELOAD=libhugetlbfs.so shmoverride_unlinked (2M: 32): FAIL shmget
  failed size 2097152 from line 176: Invalid argument
- LD_PRELOAD=libhugetlbfs.so HUGETLB_SHM=yes shmoverride_unlinked
  (2M: 32): FAIL shmget failed size 2097152 from line 176: Invalid
  argument

[1]: https://www.kernel.org/doc/html/latest/vm/hugetlbfs_reserv.html

Signed-off-by: Mina Almasry
Acked-by: Hillf Danton
---
 include/linux/hugetlb.h |  23 ++++++++-
 mm/hugetlb_cgroup.c     | 111 ++++++++++++++++++++++++++++++----------
 2 files changed, 107 insertions(+), 27 deletions(-)

--
2.23.0.700.g56cf767bdb-goog

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 53fc34f930d08..9c49a0ba894d3 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -320,6 +320,27 @@ unsigned long hugetlb_get_unmapped_area(struct file *file, unsigned long addr,

 #ifdef CONFIG_HUGETLB_PAGE

+enum {
+	/* Tracks hugetlb memory faulted in. */
+	HUGETLB_RES_USAGE,
+	/* Tracks hugetlb memory reserved. */
+	HUGETLB_RES_RESERVATION_USAGE,
+	/* Limit for hugetlb memory faulted in. */
+	HUGETLB_RES_LIMIT,
+	/* Limit for hugetlb memory reserved. */
+	HUGETLB_RES_RESERVATION_LIMIT,
+	/* Max usage for hugetlb memory faulted in. */
+	HUGETLB_RES_MAX_USAGE,
+	/* Max usage for hugetlb memory reserved. */
+	HUGETLB_RES_RESERVATION_MAX_USAGE,
+	/* Faulted memory accounting fail count. */
+	HUGETLB_RES_FAILCNT,
+	/* Reserved memory accounting fail count. */
+	HUGETLB_RES_RESERVATION_FAILCNT,
+	HUGETLB_RES_NULL,
+	HUGETLB_RES_MAX,
+};
+
 #define HSTATE_NAME_LEN 32
 /* Defines one hugetlb page size */
 struct hstate {
@@ -340,7 +361,7 @@ struct hstate {
 	unsigned int surplus_huge_pages_node[MAX_NUMNODES];
 #ifdef CONFIG_CGROUP_HUGETLB
 	/* cgroup control files */
-	struct cftype cgroup_files[5];
+	struct cftype cgroup_files[HUGETLB_RES_MAX];
 #endif
 	char name[HSTATE_NAME_LEN];
 };
diff --git a/mm/hugetlb_cgroup.c b/mm/hugetlb_cgroup.c
index f1930fa0b445d..1ed4448ca41d3 100644
--- a/mm/hugetlb_cgroup.c
+++ b/mm/hugetlb_cgroup.c
@@ -25,6 +25,10 @@ struct hugetlb_cgroup {
	 * the counter to account for hugepages from hugetlb.
	 */
 	struct page_counter hugepage[HUGE_MAX_HSTATE];
+	/*
+	 * the counter to account for hugepage reservations from hugetlb.
+	 */
+	struct page_counter reserved_hugepage[HUGE_MAX_HSTATE];
 };

 #define MEMFILE_PRIVATE(x, val)	(((x) << 16) | (val))
@@ -33,6 +37,14 @@ struct hugetlb_cgroup {

 static struct hugetlb_cgroup *root_h_cgroup __read_mostly;

+static inline struct page_counter *
+hugetlb_cgroup_get_counter(struct hugetlb_cgroup *h_cg, int idx, bool reserved)
+{
+	if (reserved)
+		return &h_cg->reserved_hugepage[idx];
+	return &h_cg->hugepage[idx];
+}
+
 static inline struct hugetlb_cgroup *hugetlb_cgroup_from_css(struct cgroup_subsys_state *s)
 {
@@ -254,30 +266,33 @@ void hugetlb_cgroup_uncharge_cgroup(int idx, unsigned long nr_pages,
 	return;
 }

-enum {
-	RES_USAGE,
-	RES_LIMIT,
-	RES_MAX_USAGE,
-	RES_FAILCNT,
-};
-
 static u64 hugetlb_cgroup_read_u64(struct cgroup_subsys_state *css,
 				   struct cftype *cft)
 {
 	struct page_counter *counter;
+	struct page_counter *reserved_counter;
 	struct hugetlb_cgroup *h_cg = hugetlb_cgroup_from_css(css);

 	counter = &h_cg->hugepage[MEMFILE_IDX(cft->private)];
+	reserved_counter = &h_cg->reserved_hugepage[MEMFILE_IDX(cft->private)];

 	switch (MEMFILE_ATTR(cft->private)) {
-	case RES_USAGE:
+	case HUGETLB_RES_USAGE:
 		return (u64)page_counter_read(counter) * PAGE_SIZE;
-	case RES_LIMIT:
+	case HUGETLB_RES_RESERVATION_USAGE:
+		return (u64)page_counter_read(reserved_counter) * PAGE_SIZE;
+	case HUGETLB_RES_LIMIT:
 		return (u64)counter->max * PAGE_SIZE;
-	case RES_MAX_USAGE:
+	case HUGETLB_RES_RESERVATION_LIMIT:
+		return (u64)reserved_counter->max * PAGE_SIZE;
+	case HUGETLB_RES_MAX_USAGE:
 		return (u64)counter->watermark * PAGE_SIZE;
-	case RES_FAILCNT:
+	case HUGETLB_RES_RESERVATION_MAX_USAGE:
+		return (u64)reserved_counter->watermark * PAGE_SIZE;
+	case HUGETLB_RES_FAILCNT:
 		return counter->failcnt;
+	case HUGETLB_RES_RESERVATION_FAILCNT:
+		return reserved_counter->failcnt;
 	default:
 		BUG();
 	}
@@ -291,6 +306,7 @@ static ssize_t hugetlb_cgroup_write(struct kernfs_open_file *of,
 	int ret, idx;
 	unsigned long nr_pages;
 	struct hugetlb_cgroup *h_cg = hugetlb_cgroup_from_css(of_css(of));
+	bool reserved = false;

 	if (hugetlb_cgroup_is_root(h_cg)) /* Can't set limit on root */
 		return -EINVAL;
@@ -304,9 +320,14 @@ static ssize_t hugetlb_cgroup_write(struct kernfs_open_file *of,
 	nr_pages = round_down(nr_pages, 1 << huge_page_order(&hstates[idx]));

 	switch (MEMFILE_ATTR(of_cft(of)->private)) {
-	case RES_LIMIT:
+	case HUGETLB_RES_RESERVATION_LIMIT:
+		reserved = true;
+		/* Fall through. */
+	case HUGETLB_RES_LIMIT:
 		mutex_lock(&hugetlb_limit_mutex);
-		ret = page_counter_set_max(&h_cg->hugepage[idx], nr_pages);
+		ret = page_counter_set_max(hugetlb_cgroup_get_counter(h_cg, idx,
+								      reserved),
+					   nr_pages);
 		mutex_unlock(&hugetlb_limit_mutex);
 		break;
 	default:
@@ -320,18 +341,26 @@ static ssize_t hugetlb_cgroup_reset(struct kernfs_open_file *of, char *buf,
 				    size_t nbytes, loff_t off)
 {
 	int ret = 0;
-	struct page_counter *counter;
+	struct page_counter *counter, *reserved_counter;
 	struct hugetlb_cgroup *h_cg = hugetlb_cgroup_from_css(of_css(of));

 	counter = &h_cg->hugepage[MEMFILE_IDX(of_cft(of)->private)];
+	reserved_counter =
+		&h_cg->reserved_hugepage[MEMFILE_IDX(of_cft(of)->private)];

 	switch (MEMFILE_ATTR(of_cft(of)->private)) {
-	case RES_MAX_USAGE:
+	case HUGETLB_RES_MAX_USAGE:
 		page_counter_reset_watermark(counter);
 		break;
-	case RES_FAILCNT:
+	case HUGETLB_RES_RESERVATION_MAX_USAGE:
+		page_counter_reset_watermark(reserved_counter);
+		break;
+	case HUGETLB_RES_FAILCNT:
 		counter->failcnt = 0;
 		break;
+	case HUGETLB_RES_RESERVATION_FAILCNT:
+		reserved_counter->failcnt = 0;
+		break;
 	default:
 		ret = -EINVAL;
 		break;
@@ -357,37 +386,67 @@ static void __init __hugetlb_cgroup_file_init(int idx)
 	struct hstate *h = &hstates[idx];

 	/* format the size */
-	mem_fmt(buf, 32, huge_page_size(h));
+	mem_fmt(buf, sizeof(buf), huge_page_size(h));

 	/* Add the limit file */
-	cft = &h->cgroup_files[0];
+	cft = &h->cgroup_files[HUGETLB_RES_LIMIT];
 	snprintf(cft->name, MAX_CFTYPE_NAME, "%s.limit_in_bytes", buf);
-	cft->private = MEMFILE_PRIVATE(idx, RES_LIMIT);
+	cft->private = MEMFILE_PRIVATE(idx, HUGETLB_RES_LIMIT);
+	cft->read_u64 = hugetlb_cgroup_read_u64;
+	cft->write = hugetlb_cgroup_write;
+
+	/* Add the reservation limit file */
+	cft = &h->cgroup_files[HUGETLB_RES_RESERVATION_LIMIT];
+	snprintf(cft->name, MAX_CFTYPE_NAME, "%s.reservation_limit_in_bytes",
+		 buf);
+	cft->private = MEMFILE_PRIVATE(idx, HUGETLB_RES_RESERVATION_LIMIT);
 	cft->read_u64 = hugetlb_cgroup_read_u64;
 	cft->write = hugetlb_cgroup_write;

 	/* Add the usage file */
-	cft = &h->cgroup_files[1];
+	cft = &h->cgroup_files[HUGETLB_RES_USAGE];
 	snprintf(cft->name, MAX_CFTYPE_NAME, "%s.usage_in_bytes", buf);
-	cft->private = MEMFILE_PRIVATE(idx, RES_USAGE);
+	cft->private = MEMFILE_PRIVATE(idx, HUGETLB_RES_USAGE);
+	cft->read_u64 = hugetlb_cgroup_read_u64;
+
+	/* Add the reservation usage file */
+	cft = &h->cgroup_files[HUGETLB_RES_RESERVATION_USAGE];
+	snprintf(cft->name, MAX_CFTYPE_NAME, "%s.reservation_usage_in_bytes",
+		 buf);
+	cft->private = MEMFILE_PRIVATE(idx, HUGETLB_RES_RESERVATION_USAGE);
 	cft->read_u64 = hugetlb_cgroup_read_u64;

 	/* Add the MAX usage file */
-	cft = &h->cgroup_files[2];
+	cft = &h->cgroup_files[HUGETLB_RES_MAX_USAGE];
 	snprintf(cft->name, MAX_CFTYPE_NAME, "%s.max_usage_in_bytes", buf);
-	cft->private = MEMFILE_PRIVATE(idx, RES_MAX_USAGE);
+	cft->private = MEMFILE_PRIVATE(idx, HUGETLB_RES_MAX_USAGE);
+	cft->write = hugetlb_cgroup_reset;
+	cft->read_u64 = hugetlb_cgroup_read_u64;
+
+	/* Add the MAX reservation usage file */
+	cft = &h->cgroup_files[HUGETLB_RES_RESERVATION_MAX_USAGE];
+	snprintf(cft->name, MAX_CFTYPE_NAME,
+		 "%s.reservation_max_usage_in_bytes", buf);
+	cft->private = MEMFILE_PRIVATE(idx, HUGETLB_RES_RESERVATION_MAX_USAGE);
 	cft->write = hugetlb_cgroup_reset;
 	cft->read_u64 = hugetlb_cgroup_read_u64;

 	/* Add the failcnt file */
-	cft = &h->cgroup_files[3];
+	cft = &h->cgroup_files[HUGETLB_RES_FAILCNT];
 	snprintf(cft->name, MAX_CFTYPE_NAME, "%s.failcnt", buf);
-	cft->private = MEMFILE_PRIVATE(idx, RES_FAILCNT);
+	cft->private = MEMFILE_PRIVATE(idx, HUGETLB_RES_FAILCNT);
+	cft->write = hugetlb_cgroup_reset;
+	cft->read_u64 = hugetlb_cgroup_read_u64;
+
+	/* Add the reservation failcnt file */
+	cft = &h->cgroup_files[HUGETLB_RES_RESERVATION_FAILCNT];
+	snprintf(cft->name, MAX_CFTYPE_NAME, "%s.reservation_failcnt", buf);
+	cft->private = MEMFILE_PRIVATE(idx, HUGETLB_RES_RESERVATION_FAILCNT);
 	cft->write = hugetlb_cgroup_reset;
 	cft->read_u64 = hugetlb_cgroup_read_u64;

 	/* NULL terminate the last cft */
-	cft = &h->cgroup_files[4];
+	cft = &h->cgroup_files[HUGETLB_RES_NULL];
 	memset(cft, 0, sizeof(*cft));

 	WARN_ON(cgroup_add_legacy_cftypes(&hugetlb_cgrp_subsys,
 					  h->cgroup_files));
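A side note on the cft->private values used throughout this patch:
MEMFILE_PRIVATE() packs the hstate index and the attribute into one
integer, and MEMFILE_IDX()/MEMFILE_ATTR() unpack them again in the
read/write/reset handlers. A standalone sketch of the same packing
scheme (the macros mirror mm/hugetlb_cgroup.c; the example values are
arbitrary):

#include <assert.h>
#include <stdio.h>

/* Same packing as mm/hugetlb_cgroup.c: high 16 bits hold the hstate
 * index, low 16 bits hold the attribute (e.g. HUGETLB_RES_LIMIT).
 */
#define MEMFILE_PRIVATE(x, val)	(((x) << 16) | (val))
#define MEMFILE_IDX(val)	(((val) >> 16) & 0xffff)
#define MEMFILE_ATTR(val)	((val) & 0xffff)

int main(void)
{
	int idx = 1;	/* say, the second hstate */
	int attr = 3;	/* HUGETLB_RES_RESERVATION_LIMIT, per the enum order */
	int packed = MEMFILE_PRIVATE(idx, attr);

	assert(MEMFILE_IDX(packed) == idx);
	assert(MEMFILE_ATTR(packed) == attr);
	printf("packed=0x%x idx=%d attr=%d\n", packed,
	       MEMFILE_IDX(packed), MEMFILE_ATTR(packed));
	return 0;
}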
From patchwork Sun Oct 13 00:30:17 2019
X-Patchwork-Submitter: Mina Almasry
X-Patchwork-Id: 11187169
Date: Sat, 12 Oct 2019 17:30:17 -0700
In-Reply-To: <20191013003024.215429-1-almasrymina@google.com>
Message-Id: <20191013003024.215429-2-almasrymina@google.com>
Subject: [PATCH v6 2/9] hugetlb_cgroup: add interface for charge/uncharge hugetlb reservations
From: Mina Almasry
To: mike.kravetz@oracle.com
Cc: shuah@kernel.org, almasrymina@google.com, rientjes@google.com, shakeelb@google.com, gthelen@google.com, akpm@linux-foundation.org, khalid.aziz@oracle.com, linux-kernel@vger.kernel.org, linux-mm@kvack.org, linux-kselftest@vger.kernel.org, cgroups@vger.kernel.org, aneesh.kumar@linux.vnet.ibm.com, mkoutny@suse.com

Augments hugetlb_cgroup_charge_cgroup to be able to charge either the
hugetlb usage counter or the hugetlb reservation counter.

Adds a new interface to uncharge a hugetlb_cgroup counter via
hugetlb_cgroup_uncharge_counter.

Integrates the new counter with hugetlb_cgroup, via hugetlb_cgroup_init,
hugetlb_cgroup_have_usage, and hugetlb_cgroup_css_offline.
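Both counters sit on top of the kernel's page_counter, whose semantics
the charge/uncharge interface relies on. A self-contained model of those
semantics (usage, limit, watermark, failcnt), simplified and unlocked,
for illustration only:

#include <stdbool.h>
#include <stdio.h>

struct counter {
	long usage, max, watermark, failcnt;
};

static bool try_charge(struct counter *c, long nr)
{
	if (c->usage + nr > c->max) {
		c->failcnt++;		/* surfaced as ...failcnt in cgroupfs */
		return false;
	}
	c->usage += nr;
	if (c->usage > c->watermark)
		c->watermark = c->usage;	/* ...max_usage_in_bytes */
	return true;
}

static void uncharge(struct counter *c, long nr)
{
	c->usage -= nr;
}

int main(void)
{
	struct counter resv = { .max = 512 };	/* reservation limit, in pages */

	if (!try_charge(&resv, 256))		/* reserve 256 huge pages */
		return 1;
	if (!try_charge(&resv, 512))		/* would exceed the limit */
		printf("reservation denied, failcnt=%ld\n", resv.failcnt);
	uncharge(&resv, 256);			/* e.g. when the mapping goes away */
	printf("usage=%ld watermark=%ld\n", resv.usage, resv.watermark);
	return 0;
}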
Signed-off-by: Mina Almasry --- include/linux/hugetlb_cgroup.h | 67 +++++++++++++--------- mm/hugetlb.c | 17 +++--- mm/hugetlb_cgroup.c | 100 +++++++++++++++++++++++++-------- 3 files changed, 130 insertions(+), 54 deletions(-) -- 2.23.0.700.g56cf767bdb-goog diff --git a/include/linux/hugetlb_cgroup.h b/include/linux/hugetlb_cgroup.h index 063962f6dfc6a..1bb58a63af586 100644 --- a/include/linux/hugetlb_cgroup.h +++ b/include/linux/hugetlb_cgroup.h @@ -22,27 +22,35 @@ struct hugetlb_cgroup; * Minimum page order trackable by hugetlb cgroup. * At least 3 pages are necessary for all the tracking information. */ -#define HUGETLB_CGROUP_MIN_ORDER 2 +#define HUGETLB_CGROUP_MIN_ORDER 3 #ifdef CONFIG_CGROUP_HUGETLB -static inline struct hugetlb_cgroup *hugetlb_cgroup_from_page(struct page *page) +static inline struct hugetlb_cgroup *hugetlb_cgroup_from_page(struct page *page, + bool reserved) { VM_BUG_ON_PAGE(!PageHuge(page), page); if (compound_order(page) < HUGETLB_CGROUP_MIN_ORDER) return NULL; - return (struct hugetlb_cgroup *)page[2].private; + if (reserved) + return (struct hugetlb_cgroup *)page[3].private; + else + return (struct hugetlb_cgroup *)page[2].private; } -static inline -int set_hugetlb_cgroup(struct page *page, struct hugetlb_cgroup *h_cg) +static inline int set_hugetlb_cgroup(struct page *page, + struct hugetlb_cgroup *h_cg, + bool reservation) { VM_BUG_ON_PAGE(!PageHuge(page), page); if (compound_order(page) < HUGETLB_CGROUP_MIN_ORDER) return -1; - page[2].private = (unsigned long)h_cg; + if (reservation) + page[3].private = (unsigned long)h_cg; + else + page[2].private = (unsigned long)h_cg; return 0; } @@ -52,26 +60,33 @@ static inline bool hugetlb_cgroup_disabled(void) } extern int hugetlb_cgroup_charge_cgroup(int idx, unsigned long nr_pages, - struct hugetlb_cgroup **ptr); + struct hugetlb_cgroup **ptr, + bool reserved); extern void hugetlb_cgroup_commit_charge(int idx, unsigned long nr_pages, struct hugetlb_cgroup *h_cg, - struct page *page); + struct page *page, bool reserved); extern void hugetlb_cgroup_uncharge_page(int idx, unsigned long nr_pages, - struct page *page); + struct page *page, bool reserved); + extern void hugetlb_cgroup_uncharge_cgroup(int idx, unsigned long nr_pages, - struct hugetlb_cgroup *h_cg); + struct hugetlb_cgroup *h_cg, + bool reserved); +extern void hugetlb_cgroup_uncharge_counter(struct page_counter *p, + unsigned long nr_pages); + extern void hugetlb_cgroup_file_init(void) __init; extern void hugetlb_cgroup_migrate(struct page *oldhpage, struct page *newhpage); #else -static inline struct hugetlb_cgroup *hugetlb_cgroup_from_page(struct page *page) +static inline struct hugetlb_cgroup *hugetlb_cgroup_from_page(struct page *page, + bool reserved) { return NULL; } -static inline -int set_hugetlb_cgroup(struct page *page, struct hugetlb_cgroup *h_cg) +static inline int set_hugetlb_cgroup(struct page *page, + struct hugetlb_cgroup *h_cg, bool reserved) { return 0; } @@ -81,28 +96,30 @@ static inline bool hugetlb_cgroup_disabled(void) return true; } -static inline int -hugetlb_cgroup_charge_cgroup(int idx, unsigned long nr_pages, - struct hugetlb_cgroup **ptr) +static inline int hugetlb_cgroup_charge_cgroup(int idx, unsigned long nr_pages, + struct hugetlb_cgroup **ptr, + bool reserved) { return 0; } -static inline void -hugetlb_cgroup_commit_charge(int idx, unsigned long nr_pages, - struct hugetlb_cgroup *h_cg, - struct page *page) +static inline void hugetlb_cgroup_commit_charge(int idx, unsigned long nr_pages, + struct hugetlb_cgroup *h_cg, + 
struct page *page, + bool reserved) { } -static inline void -hugetlb_cgroup_uncharge_page(int idx, unsigned long nr_pages, struct page *page) +static inline void hugetlb_cgroup_uncharge_page(int idx, unsigned long nr_pages, + struct page *page, + bool reserved) { } -static inline void -hugetlb_cgroup_uncharge_cgroup(int idx, unsigned long nr_pages, - struct hugetlb_cgroup *h_cg) +static inline void hugetlb_cgroup_uncharge_cgroup(int idx, + unsigned long nr_pages, + struct hugetlb_cgroup *h_cg, + bool reserved) { } diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 1e759b57364ab..324859170463b 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -1141,7 +1141,7 @@ static void update_and_free_page(struct hstate *h, struct page *page) 1 << PG_active | 1 << PG_private | 1 << PG_writeback); } - VM_BUG_ON_PAGE(hugetlb_cgroup_from_page(page), page); + VM_BUG_ON_PAGE(hugetlb_cgroup_from_page(page, false), page); set_compound_page_dtor(page, NULL_COMPOUND_DTOR); set_page_refcounted(page); if (hstate_is_gigantic(h)) { @@ -1251,8 +1251,8 @@ void free_huge_page(struct page *page) spin_lock(&hugetlb_lock); clear_page_huge_active(page); - hugetlb_cgroup_uncharge_page(hstate_index(h), - pages_per_huge_page(h), page); + hugetlb_cgroup_uncharge_page(hstate_index(h), pages_per_huge_page(h), + page, false); if (restore_reserve) h->resv_huge_pages++; @@ -1278,7 +1278,7 @@ static void prep_new_huge_page(struct hstate *h, struct page *page, int nid) INIT_LIST_HEAD(&page->lru); set_compound_page_dtor(page, HUGETLB_PAGE_DTOR); spin_lock(&hugetlb_lock); - set_hugetlb_cgroup(page, NULL); + set_hugetlb_cgroup(page, NULL, false); h->nr_huge_pages++; h->nr_huge_pages_node[nid]++; spin_unlock(&hugetlb_lock); @@ -2064,7 +2064,8 @@ struct page *alloc_huge_page(struct vm_area_struct *vma, gbl_chg = 1; } - ret = hugetlb_cgroup_charge_cgroup(idx, pages_per_huge_page(h), &h_cg); + ret = hugetlb_cgroup_charge_cgroup(idx, pages_per_huge_page(h), &h_cg, + false); if (ret) goto out_subpool_put; @@ -2088,7 +2089,8 @@ struct page *alloc_huge_page(struct vm_area_struct *vma, list_move(&page->lru, &h->hugepage_activelist); /* Fall through */ } - hugetlb_cgroup_commit_charge(idx, pages_per_huge_page(h), h_cg, page); + hugetlb_cgroup_commit_charge(idx, pages_per_huge_page(h), h_cg, page, + false); spin_unlock(&hugetlb_lock); set_page_private(page, (unsigned long)spool); @@ -2112,7 +2114,8 @@ struct page *alloc_huge_page(struct vm_area_struct *vma, return page; out_uncharge_cgroup: - hugetlb_cgroup_uncharge_cgroup(idx, pages_per_huge_page(h), h_cg); + hugetlb_cgroup_uncharge_cgroup(idx, pages_per_huge_page(h), h_cg, + false); out_subpool_put: if (map_chg || avoid_reserve) hugepage_subpool_put_pages(spool, 1); diff --git a/mm/hugetlb_cgroup.c b/mm/hugetlb_cgroup.c index 1ed4448ca41d3..854117513979b 100644 --- a/mm/hugetlb_cgroup.c +++ b/mm/hugetlb_cgroup.c @@ -73,8 +73,12 @@ static inline bool hugetlb_cgroup_have_usage(struct hugetlb_cgroup *h_cg) int idx; for (idx = 0; idx < hugetlb_max_hstate; idx++) { - if (page_counter_read(&h_cg->hugepage[idx])) + if (page_counter_read( + hugetlb_cgroup_get_counter(h_cg, idx, true)) || + page_counter_read( + hugetlb_cgroup_get_counter(h_cg, idx, false))) { return true; + } } return false; } @@ -85,18 +89,32 @@ static void hugetlb_cgroup_init(struct hugetlb_cgroup *h_cgroup, int idx; for (idx = 0; idx < HUGE_MAX_HSTATE; idx++) { - struct page_counter *counter = &h_cgroup->hugepage[idx]; struct page_counter *parent = NULL; + struct page_counter *reserved_parent = NULL; unsigned long limit; int ret; - if 
(parent_h_cgroup) - parent = &parent_h_cgroup->hugepage[idx]; - page_counter_init(counter, parent); + if (parent_h_cgroup) { + parent = hugetlb_cgroup_get_counter(parent_h_cgroup, + idx, false); + reserved_parent = hugetlb_cgroup_get_counter( + parent_h_cgroup, idx, true); + } + page_counter_init(hugetlb_cgroup_get_counter(h_cgroup, idx, + false), + parent); + page_counter_init(hugetlb_cgroup_get_counter(h_cgroup, idx, + true), + reserved_parent); limit = round_down(PAGE_COUNTER_MAX, 1 << huge_page_order(&hstates[idx])); - ret = page_counter_set_max(counter, limit); + + ret = page_counter_set_max( + hugetlb_cgroup_get_counter(h_cgroup, idx, false), + limit); + ret = page_counter_set_max( + hugetlb_cgroup_get_counter(h_cgroup, idx, true), limit); VM_BUG_ON(ret); } } @@ -126,6 +144,26 @@ static void hugetlb_cgroup_css_free(struct cgroup_subsys_state *css) kfree(h_cgroup); } +static void hugetlb_cgroup_move_parent_reservation(int idx, + struct hugetlb_cgroup *h_cg) +{ + struct hugetlb_cgroup *parent = parent_hugetlb_cgroup(h_cg); + + /* Move the reservation counters. */ + if (!parent_hugetlb_cgroup(h_cg)) { + parent = root_h_cgroup; + /* root has no limit */ + page_counter_charge( + &root_h_cgroup->reserved_hugepage[idx], + page_counter_read( + hugetlb_cgroup_get_counter(h_cg, idx, true))); + } + + /* Take the pages off the local counter */ + page_counter_cancel( + hugetlb_cgroup_get_counter(h_cg, idx, true), + page_counter_read(hugetlb_cgroup_get_counter(h_cg, idx, true))); +} /* * Should be called with hugetlb_lock held. @@ -142,7 +180,7 @@ static void hugetlb_cgroup_move_parent(int idx, struct hugetlb_cgroup *h_cg, struct hugetlb_cgroup *page_hcg; struct hugetlb_cgroup *parent = parent_hugetlb_cgroup(h_cg); - page_hcg = hugetlb_cgroup_from_page(page); + page_hcg = hugetlb_cgroup_from_page(page, false); /* * We can have pages in active list without any cgroup * ie, hugepage with less than 3 pages. 
We can safely @@ -161,7 +199,7 @@ static void hugetlb_cgroup_move_parent(int idx, struct hugetlb_cgroup *h_cg, /* Take the pages off the local counter */ page_counter_cancel(counter, nr_pages); - set_hugetlb_cgroup(page, parent); + set_hugetlb_cgroup(page, parent, false); out: return; } @@ -180,6 +218,7 @@ static void hugetlb_cgroup_css_offline(struct cgroup_subsys_state *css) do { for_each_hstate(h) { spin_lock(&hugetlb_lock); + hugetlb_cgroup_move_parent_reservation(idx, h_cg); list_for_each_entry(page, &h->hugepage_activelist, lru) hugetlb_cgroup_move_parent(idx, h_cg, page); @@ -191,7 +230,7 @@ static void hugetlb_cgroup_css_offline(struct cgroup_subsys_state *css) } int hugetlb_cgroup_charge_cgroup(int idx, unsigned long nr_pages, - struct hugetlb_cgroup **ptr) + struct hugetlb_cgroup **ptr, bool reserved) { int ret = 0; struct page_counter *counter; @@ -214,8 +253,11 @@ int hugetlb_cgroup_charge_cgroup(int idx, unsigned long nr_pages, } rcu_read_unlock(); - if (!page_counter_try_charge(&h_cg->hugepage[idx], nr_pages, &counter)) + if (!page_counter_try_charge(hugetlb_cgroup_get_counter(h_cg, idx, + reserved), + nr_pages, &counter)) { ret = -ENOMEM; + } css_put(&h_cg->css); done: *ptr = h_cg; @@ -225,12 +267,12 @@ int hugetlb_cgroup_charge_cgroup(int idx, unsigned long nr_pages, /* Should be called with hugetlb_lock held */ void hugetlb_cgroup_commit_charge(int idx, unsigned long nr_pages, struct hugetlb_cgroup *h_cg, - struct page *page) + struct page *page, bool reserved) { if (hugetlb_cgroup_disabled() || !h_cg) return; - set_hugetlb_cgroup(page, h_cg); + set_hugetlb_cgroup(page, h_cg, reserved); return; } @@ -238,23 +280,26 @@ void hugetlb_cgroup_commit_charge(int idx, unsigned long nr_pages, * Should be called with hugetlb_lock held */ void hugetlb_cgroup_uncharge_page(int idx, unsigned long nr_pages, - struct page *page) + struct page *page, bool reserved) { struct hugetlb_cgroup *h_cg; if (hugetlb_cgroup_disabled()) return; lockdep_assert_held(&hugetlb_lock); - h_cg = hugetlb_cgroup_from_page(page); + h_cg = hugetlb_cgroup_from_page(page, reserved); if (unlikely(!h_cg)) return; - set_hugetlb_cgroup(page, NULL); - page_counter_uncharge(&h_cg->hugepage[idx], nr_pages); + set_hugetlb_cgroup(page, NULL, reserved); + + page_counter_uncharge(hugetlb_cgroup_get_counter(h_cg, idx, reserved), + nr_pages); + return; } void hugetlb_cgroup_uncharge_cgroup(int idx, unsigned long nr_pages, - struct hugetlb_cgroup *h_cg) + struct hugetlb_cgroup *h_cg, bool reserved) { if (hugetlb_cgroup_disabled() || !h_cg) return; @@ -262,8 +307,17 @@ void hugetlb_cgroup_uncharge_cgroup(int idx, unsigned long nr_pages, if (huge_page_order(&hstates[idx]) < HUGETLB_CGROUP_MIN_ORDER) return; - page_counter_uncharge(&h_cg->hugepage[idx], nr_pages); - return; + page_counter_uncharge(hugetlb_cgroup_get_counter(h_cg, idx, reserved), + nr_pages); +} + +void hugetlb_cgroup_uncharge_counter(struct page_counter *p, + unsigned long nr_pages) +{ + if (hugetlb_cgroup_disabled() || !p) + return; + + page_counter_uncharge(p, nr_pages); } static u64 hugetlb_cgroup_read_u64(struct cgroup_subsys_state *css, @@ -475,6 +529,7 @@ void __init hugetlb_cgroup_file_init(void) void hugetlb_cgroup_migrate(struct page *oldhpage, struct page *newhpage) { struct hugetlb_cgroup *h_cg; + struct hugetlb_cgroup *h_cg_reservation; struct hstate *h = page_hstate(oldhpage); if (hugetlb_cgroup_disabled()) @@ -482,11 +537,12 @@ void hugetlb_cgroup_migrate(struct page *oldhpage, struct page *newhpage) VM_BUG_ON_PAGE(!PageHuge(oldhpage), oldhpage); 
 	spin_lock(&hugetlb_lock);
-	h_cg = hugetlb_cgroup_from_page(oldhpage);
-	set_hugetlb_cgroup(oldhpage, NULL);
+	h_cg = hugetlb_cgroup_from_page(oldhpage, false);
+	h_cg_reservation = hugetlb_cgroup_from_page(oldhpage, true);
+	set_hugetlb_cgroup(oldhpage, NULL, false);

 	/* move the h_cg details to new cgroup */
-	set_hugetlb_cgroup(newhpage, h_cg);
+	set_hugetlb_cgroup(newhpage, h_cg_reservation, true);
 	list_move(&newhpage->lru, &h->hugepage_activelist);
 	spin_unlock(&hugetlb_lock);
 	return;
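Worth noting from the hugetlb_cgroup.h hunk in this patch: the usage
cgroup pointer lives in page[2].private of the compound page, the
reservation cgroup pointer is parked in page[3].private, and
HUGETLB_CGROUP_MIN_ORDER is bumped from 2 to 3. A self-contained model
of that storage scheme (struct page reduced to one field; the pointer
value is fake, for illustration only):

#include <stdbool.h>
#include <stdio.h>

struct page { unsigned long private; };
struct hugetlb_cgroup;

/* Mirrors set_hugetlb_cgroup()/hugetlb_cgroup_from_page(): subpage 2
 * holds the usage cgroup, subpage 3 the reservation cgroup.
 */
static void set_cg(struct page *page, struct hugetlb_cgroup *h_cg,
		   bool reservation)
{
	page[reservation ? 3 : 2].private = (unsigned long)h_cg;
}

static struct hugetlb_cgroup *get_cg(struct page *page, bool reserved)
{
	return (struct hugetlb_cgroup *)page[reserved ? 3 : 2].private;
}

int main(void)
{
	struct page huge[8] = { 0 };	/* stand-in for a compound huge page */
	struct hugetlb_cgroup *fake = (struct hugetlb_cgroup *)0xdeadbeef;

	set_cg(huge, fake, true);
	printf("reservation cgroup: %p\n", (void *)get_cg(huge, true));
	printf("usage cgroup:       %p\n", (void *)get_cg(huge, false));
	return 0;
}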
From patchwork Sun Oct 13 00:30:18 2019
X-Patchwork-Submitter: Mina Almasry
X-Patchwork-Id: 11187171
Date: Sat, 12 Oct 2019 17:30:18 -0700
In-Reply-To: <20191013003024.215429-1-almasrymina@google.com>
Message-Id: <20191013003024.215429-3-almasrymina@google.com>
Subject: [PATCH v6 3/9] hugetlb_cgroup: add cgroup-v2 support
From: Mina Almasry
To: mike.kravetz@oracle.com
Cc: shuah@kernel.org, almasrymina@google.com, rientjes@google.com, shakeelb@google.com, gthelen@google.com, akpm@linux-foundation.org, khalid.aziz@oracle.com, linux-kernel@vger.kernel.org, linux-mm@kvack.org, linux-kselftest@vger.kernel.org, cgroups@vger.kernel.org, aneesh.kumar@linux.vnet.ibm.com, mkoutny@suse.com

---
 mm/hugetlb_cgroup.c | 15 +++++++++++++--
 1 file changed, 13 insertions(+), 2 deletions(-)

--
2.23.0.700.g56cf767bdb-goog

diff --git a/mm/hugetlb_cgroup.c b/mm/hugetlb_cgroup.c
index 854117513979b..ac1500205faf7 100644
--- a/mm/hugetlb_cgroup.c
+++ b/mm/hugetlb_cgroup.c
@@ -503,8 +503,13 @@ static void __init __hugetlb_cgroup_file_init(int idx)
 	cft = &h->cgroup_files[HUGETLB_RES_NULL];
 	memset(cft, 0, sizeof(*cft));

-	WARN_ON(cgroup_add_legacy_cftypes(&hugetlb_cgrp_subsys,
-					  h->cgroup_files));
+	if (cgroup_subsys_on_dfl(hugetlb_cgrp_subsys)) {
+		WARN_ON(cgroup_add_dfl_cftypes(&hugetlb_cgrp_subsys,
+					       h->cgroup_files));
+	} else {
+		WARN_ON(cgroup_add_legacy_cftypes(&hugetlb_cgrp_subsys,
+						  h->cgroup_files));
+	}
 }

 void __init hugetlb_cgroup_file_init(void)
@@ -548,8 +553,14 @@ void hugetlb_cgroup_migrate(struct page *oldhpage, struct page *newhpage)
 	return;
 }

+static struct cftype hugetlb_files[] = {
+	{} /* terminate */
+};
+
 struct cgroup_subsys hugetlb_cgrp_subsys = {
 	.css_alloc = hugetlb_cgroup_css_alloc,
 	.css_offline = hugetlb_cgroup_css_offline,
 	.css_free = hugetlb_cgroup_css_free,
+	.dfl_cftypes = hugetlb_files,
+	.legacy_cftypes = hugetlb_files,
 };
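The registration above relies on the kernel convention that a cftype
array ends with a zeroed sentinel entry; hugetlb_files[] is only that
sentinel, since the real per-hstate files are built dynamically in
__hugetlb_cgroup_file_init(). A self-contained model of walking such an
array (struct cftype reduced to its name; the entries are illustrative):

#include <stdio.h>

struct cftype { char name[64]; };

static void register_files(const struct cftype *cfts)
{
	/* Like the cgroup core: stop at the first empty-named entry. */
	for (; cfts->name[0] != '\0'; cfts++)
		printf("registering %s\n", cfts->name);
}

int main(void)
{
	struct cftype files[] = {
		{ "hugetlb.2MB.limit_in_bytes" },	/* illustrative entry */
		{ "" },					/* zeroed sentinel */
	};

	register_files(files);
	return 0;
}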
From patchwork Sun Oct 13 00:30:19 2019
X-Patchwork-Submitter: Mina Almasry
X-Patchwork-Id: 11187173
Date: Sat, 12 Oct 2019 17:30:19 -0700
In-Reply-To: <20191013003024.215429-1-almasrymina@google.com>
Message-Id: <20191013003024.215429-4-almasrymina@google.com>
Subject: [PATCH v6 4/9] hugetlb_cgroup: add reservation accounting for private mappings
From: Mina Almasry
To: mike.kravetz@oracle.com
Cc: shuah@kernel.org, almasrymina@google.com, rientjes@google.com, shakeelb@google.com, gthelen@google.com, akpm@linux-foundation.org, khalid.aziz@oracle.com, linux-kernel@vger.kernel.org, linux-mm@kvack.org, linux-kselftest@vger.kernel.org, cgroups@vger.kernel.org, aneesh.kumar@linux.vnet.ibm.com, mkoutny@suse.com, Hillf Danton

Normally the pointer to the cgroup to uncharge hangs off the struct
page, and gets queried when it's time to free the page. With
hugetlb_cgroup reservations this is not possible, because a page can be
reserved by one task and actually faulted in by another task.

The best place to put the hugetlb_cgroup pointer to uncharge for
reservations is in the resv_map. But because the resv_map has different
semantics for private and shared mappings, the code path to
charge/uncharge shared and private mappings is different. This patch
implements charging and uncharging for private mappings.

For private mappings, the counter to uncharge is
resv_map->reservation_counter. On initialization of the resv_map this is
set to NULL. On reservation of a region in a private mapping, the task's
hugetlb_cgroup is charged and the hugetlb_cgroup counter is placed in
resv_map->reservation_counter. On hugetlb_vm_op_close, we uncharge
resv_map->reservation_counter.
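A self-contained sketch of the lifecycle described above, with the
resv_map reduced to the two fields this patch adds (names mirror the
patch; the counter type and the numbers are simplified for illustration):

#include <stddef.h>
#include <stdio.h>

struct page_counter { long usage; };

/* The two fields this patch adds to resv_map: NULL/0 on shared
 * mappings, set on private mappings when the reservation is charged.
 */
struct resv_map_lite {
	struct page_counter *reservation_counter;
	unsigned long pages_per_hpage;
};

static void uncharge_counter(struct page_counter *p, unsigned long nr_pages)
{
	if (!p)		/* shared mapping: nothing to uncharge here */
		return;
	p->usage -= nr_pages;
}

int main(void)
{
	struct page_counter cg = { .usage = 512 };	/* charged at reserve time */
	struct resv_map_lite resv = {
		.reservation_counter = &cg,
		.pages_per_hpage = 512,		/* 2MB huge page, 4K base pages */
	};
	unsigned long from = 0, to = 1;		/* one huge page reserved */

	/* What hugetlb_vm_op_close() does for a private mapping: */
	uncharge_counter(resv.reservation_counter,
			 (to - from) * resv.pages_per_hpage);
	printf("usage after close: %ld pages\n", cg.usage);
	return 0;
}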
Signed-off-by: Mina Almasry
Acked-by: Hillf Danton
---
 include/linux/hugetlb.h        |  8 +++++++
 include/linux/hugetlb_cgroup.h | 11 +++++++++
 mm/hugetlb.c                   | 44 +++++++++++++++++++++++++++++++++-
 mm/hugetlb_cgroup.c            | 12 ----------
 4 files changed, 62 insertions(+), 13 deletions(-)

--
2.23.0.700.g56cf767bdb-goog

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 9c49a0ba894d3..36dcda7be4b0e 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -46,6 +46,14 @@ struct resv_map {
 	long adds_in_progress;
 	struct list_head region_cache;
 	long region_cache_count;
+#ifdef CONFIG_CGROUP_HUGETLB
+	/*
+	 * On private mappings, the counter to uncharge reservations is stored
+	 * here. If these fields are 0, then the mapping is shared.
+	 */
+	struct page_counter *reservation_counter;
+	unsigned long pages_per_hpage;
+#endif
 };
 extern struct resv_map *resv_map_alloc(void);
 void resv_map_release(struct kref *ref);
diff --git a/include/linux/hugetlb_cgroup.h b/include/linux/hugetlb_cgroup.h
index 1bb58a63af586..f6e3d74a02536 100644
--- a/include/linux/hugetlb_cgroup.h
+++ b/include/linux/hugetlb_cgroup.h
@@ -25,6 +25,17 @@ struct hugetlb_cgroup;
 #define HUGETLB_CGROUP_MIN_ORDER 3

 #ifdef CONFIG_CGROUP_HUGETLB
+struct hugetlb_cgroup {
+	struct cgroup_subsys_state css;
+	/*
+	 * the counter to account for hugepages from hugetlb.
+	 */
+	struct page_counter hugepage[HUGE_MAX_HSTATE];
+	/*
+	 * the counter to account for hugepage reservations from hugetlb.
+	 */
+	struct page_counter reserved_hugepage[HUGE_MAX_HSTATE];
+};

 static inline struct hugetlb_cgroup *hugetlb_cgroup_from_page(struct page *page,
							       bool reserved)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 324859170463b..4a60d7d44b4c3 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -665,6 +665,16 @@ struct resv_map *resv_map_alloc(void)
 	INIT_LIST_HEAD(&resv_map->regions);
 	resv_map->adds_in_progress = 0;
+#ifdef CONFIG_CGROUP_HUGETLB
+	/*
+	 * Initialize these to 0. On shared mappings, 0's here indicate these
+	 * fields don't do cgroup accounting. On private mappings, these will
+	 * be re-initialized to the proper values, to indicate that hugetlb
+	 * cgroup reservations are to be un-charged from here.
+	 */
+	resv_map->reservation_counter = NULL;
+	resv_map->pages_per_hpage = 0;
+#endif

 	INIT_LIST_HEAD(&resv_map->region_cache);
 	list_add(&rg->link, &resv_map->region_cache);
@@ -3217,7 +3227,18 @@ static void hugetlb_vm_op_close(struct vm_area_struct *vma)

 	reserve = (end - start) - region_count(resv, start, end);

-	kref_put(&resv->refs, resv_map_release);
+#ifdef CONFIG_CGROUP_HUGETLB
+	/*
+	 * Since we check for HPAGE_RESV_OWNER above, this must be a private
+	 * mapping, and these values should be non-zero, and should point to
+	 * the hugetlb_cgroup counter to uncharge for this reservation.
+	 */
+	WARN_ON(!resv->reservation_counter);
+	WARN_ON(!resv->pages_per_hpage);
+
+	hugetlb_cgroup_uncharge_counter(resv->reservation_counter,
+					(end - start) * resv->pages_per_hpage);
+#endif

 	if (reserve) {
 		/*
@@ -3227,6 +3248,8 @@ static void hugetlb_vm_op_close(struct vm_area_struct *vma)
 		gbl_reserve = hugepage_subpool_put_pages(spool, reserve);
 		hugetlb_acct_memory(h, -gbl_reserve);
 	}
+
+	kref_put(&resv->refs, resv_map_release);
 }

 static int hugetlb_vm_op_split(struct vm_area_struct *vma, unsigned long addr)
@@ -4560,6 +4583,7 @@ int hugetlb_reserve_pages(struct inode *inode,
 	struct hstate *h = hstate_inode(inode);
 	struct hugepage_subpool *spool = subpool_inode(inode);
 	struct resv_map *resv_map;
+	struct hugetlb_cgroup *h_cg;
 	long gbl_reserve;

 	/* This should never happen */
@@ -4593,12 +4617,30 @@ int hugetlb_reserve_pages(struct inode *inode,
 		chg = region_chg(resv_map, from, to);

 	} else {
+		/* Private mapping. */
 		resv_map = resv_map_alloc();
 		if (!resv_map)
 			return -ENOMEM;

 		chg = to - from;

+		if (hugetlb_cgroup_charge_cgroup(hstate_index(h),
+						 chg * pages_per_huge_page(h),
+						 &h_cg, true)) {
+			kref_put(&resv_map->refs, resv_map_release);
+			return -ENOMEM;
+		}
+
+#ifdef CONFIG_CGROUP_HUGETLB
+		/*
+		 * Since this branch handles private mappings, we attach the
+		 * counter to uncharge for this reservation off resv_map.
+		 */
+		resv_map->reservation_counter =
+			&h_cg->reserved_hugepage[hstate_index(h)];
+		resv_map->pages_per_hpage = pages_per_huge_page(h);
+#endif
+
 		set_vma_resv_map(vma, resv_map);
 		set_vma_resv_flags(vma, HPAGE_RESV_OWNER);
 	}
diff --git a/mm/hugetlb_cgroup.c b/mm/hugetlb_cgroup.c
index ac1500205faf7..a0ca4024888e1 100644
--- a/mm/hugetlb_cgroup.c
+++ b/mm/hugetlb_cgroup.c
@@ -19,18 +19,6 @@
 #include
 #include

-struct hugetlb_cgroup {
-	struct cgroup_subsys_state css;
-	/*
-	 * the counter to account for hugepages from hugetlb.
-	 */
-	struct page_counter hugepage[HUGE_MAX_HSTATE];
-	/*
-	 * the counter to account for hugepage reservations from hugetlb.
-	 */
-	struct page_counter reserved_hugepage[HUGE_MAX_HSTATE];
-};
-
 #define MEMFILE_PRIVATE(x, val)	(((x) << 16) | (val))
 #define MEMFILE_IDX(val)	(((val) >> 16) & 0xffff)
 #define MEMFILE_ATTR(val)	((val) & 0xffff)
From patchwork Sun Oct 13 00:30:20 2019
Date: Sat, 12 Oct 2019 17:30:20 -0700
In-Reply-To: <20191013003024.215429-1-almasrymina@google.com>
Message-Id: <20191013003024.215429-5-almasrymina@google.com>
References: <20191013003024.215429-1-almasrymina@google.com>
Subject: [PATCH v6 5/9] hugetlb: disable region_add file_region coalescing
From: Mina Almasry
To: mike.kravetz@oracle.com

A follow-up patch in this series adds hugetlb cgroup uncharge info to the
file_region entries in resv->regions. The cgroup uncharge info may differ
for different regions, so they can no longer be coalesced at region_add
time. So, disable region coalescing in region_add in this patch.

Behavior change: say a resv_map exists like this: [0->1], [2->3], and
[5->6]. Then a region_chg/add call comes in for region_chg/add(f=0, t=5).
The old code would generate resv->regions: [0->5], [5->6]. The new code
generates resv->regions: [0->1], [1->2], [2->3], [3->5], [5->6].

Special care needs to be taken to handle the resv->adds_in_progress
variable correctly. In the past, only one region would be added for every
region_chg and region_add call. But now, each call may add multiple
regions, so we can no longer increment adds_in_progress by 1 in
region_chg, or decrement adds_in_progress by 1 after region_add or
region_abort. Instead, region_chg calls add_reservation_in_range() to
count the number of regions needed, allocates them, and that info is
passed to region_add and region_abort to decrement adds_in_progress
correctly.
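To make the behavior change concrete, here is a minimal user-space sketch (illustration only, with region_add reduced to a gap-filling walk) that reproduces the example above: given existing regions [0->1], [2->3], [5->6], adding [0, 5) inserts the gap entries [1->2] and [3->5] rather than coalescing into [0->5]:

#include <stdio.h>

/* Existing reservations, sorted and non-overlapping: [from, to). */
struct region { long from, to; };

/* Print the gap entries a non-coalescing add over [f, t) would insert,
 * and return how many pages they cover.
 */
static long add_gaps(const struct region *rg, int n, long f, long t)
{
	long last = f, added = 0;

	for (int i = 0; i < n && rg[i].from < t; i++) {
		if (rg[i].to <= last)
			continue; /* region lies entirely before our range */
		if (rg[i].from > last) {
			printf("new region [%ld, %ld)\n", last, rg[i].from);
			added += rg[i].from - last;
		}
		last = rg[i].to;
	}
	if (last < t) {
		printf("new region [%ld, %ld)\n", last, t);
		added += t - last;
	}
	return added;
}

int main(void)
{
	struct region resv[] = { { 0, 1 }, { 2, 3 }, { 5, 6 } };

	/* Matches the example above: inserts [1, 2) and [3, 5), 3 pages. */
	printf("pages added: %ld\n", add_gaps(resv, 3, 0, 5));
	return 0;
}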
Signed-off-by: Mina Almasry --- Changes in v6: - Fix bug in number of region_caches allocated by region_chg --- mm/hugetlb.c | 256 +++++++++++++++++++++++++++++---------------------- 1 file changed, 147 insertions(+), 109 deletions(-) -- 2.23.0.700.g56cf767bdb-goog diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 4a60d7d44b4c3..f9c1947925bb9 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -244,6 +244,12 @@ struct file_region { long to; }; +/* Helper that removes a struct file_region from the resv_map cache and returns + * it for use. + */ +static struct file_region * +get_file_region_entry_from_cache(struct resv_map *resv, long from, long to); + /* Must be called with resv->lock held. Calling this with count_only == true * will count the number of pages to be added but will not modify the linked * list. @@ -251,51 +257,61 @@ struct file_region { static long add_reservation_in_range(struct resv_map *resv, long f, long t, bool count_only) { - long chg = 0; + long add = 0; struct list_head *head = &resv->regions; + long last_accounted_offset = f; struct file_region *rg = NULL, *trg = NULL, *nrg = NULL; - /* Locate the region we are before or in. */ - list_for_each_entry (rg, head, link) - if (f <= rg->to) - break; - - /* Round our left edge to the current segment if it encloses us. */ - if (f > rg->from) - f = rg->from; - - chg = t - f; + /* In this loop, we essentially handle an entry for the range + * last_accounted_offset -> rg->from, at every iteration, with some + * bounds checking. + */ + list_for_each_entry_safe(rg, trg, head, link) { + /* Skip irrelevant regions that start before our range. */ + if (rg->from < f) { + /* If this region ends after the last accounted offset, + * then we need to update last_accounted_offset. + */ + if (rg->to > last_accounted_offset) + last_accounted_offset = rg->to; + continue; + } - /* Check for and consume any regions we now overlap with. */ - nrg = rg; - list_for_each_entry_safe (rg, trg, rg->link.prev, link) { - if (&rg->link == head) - break; + /* When we find a region that starts beyond our range, we've + * finished. + */ if (rg->from > t) break; - /* We overlap with this area, if it extends further than - * us then we must extend ourselves. Account for its - * existing reservation. + /* Add an entry for last_accounted_offset -> rg->from, and + * update last_accounted_offset. */ - if (rg->to > t) { - chg += rg->to - t; - t = rg->to; + if (rg->from > last_accounted_offset) { + add += rg->from - last_accounted_offset; + if (!count_only) { + nrg = get_file_region_entry_from_cache( + resv, last_accounted_offset, rg->from); + list_add(&nrg->link, rg->link.prev); + } } - chg -= rg->to - rg->from; - if (!count_only && rg != nrg) { - list_del(&rg->link); - kfree(rg); - } + last_accounted_offset = rg->to; } - if (!count_only) { - nrg->from = f; - nrg->to = t; + /* Handle the case where our range extends beyond + * last_accounted_offset. + */ + if (last_accounted_offset < t) { + add += t - last_accounted_offset; + if (!count_only) { + nrg = get_file_region_entry_from_cache( + resv, last_accounted_offset, t); + list_add(&nrg->link, rg->link.prev); + } + last_accounted_offset = t; } - return chg; + return add; } /* @@ -305,46 +321,24 @@ static long add_reservation_in_range(struct resv_map *resv, long f, long t, * must exist in the cache due to the previous call to region_chg with * the same range. * + * regions_needed is the out value provided by a previous + * call to region_chg. + * * Return the number of new huge pages added to the map. 
This
 * number is greater than or equal to zero.
 */
-static long region_add(struct resv_map *resv, long f, long t)
+static long region_add(struct resv_map *resv, long f, long t,
+		       long regions_needed)
 {
-	struct list_head *head = &resv->regions;
-	struct file_region *rg, *nrg;
 	long add = 0;

 	spin_lock(&resv->lock);
-	/* Locate the region we are either in or before. */
-	list_for_each_entry(rg, head, link)
-		if (f <= rg->to)
-			break;
-
-	/*
-	 * If no region exists which can be expanded to include the
-	 * specified range, pull a region descriptor from the cache
-	 * and use it for this range.
-	 */
-	if (&rg->link == head || t < rg->from) {
-		VM_BUG_ON(resv->region_cache_count <= 0);
-
-		resv->region_cache_count--;
-		nrg = list_first_entry(&resv->region_cache, struct file_region,
-					link);
-		list_del(&nrg->link);
-		nrg->from = f;
-		nrg->to = t;
-		list_add(&nrg->link, rg->link.prev);
-
-		add += t - f;
-		goto out_locked;
-	}
+	VM_BUG_ON(resv->region_cache_count < regions_needed);

 	add = add_reservation_in_range(resv, f, t, false);
+	resv->adds_in_progress -= regions_needed;

-out_locked:
-	resv->adds_in_progress--;
 	spin_unlock(&resv->lock);
 	VM_BUG_ON(add < 0);
 	return add;
@@ -361,31 +355,39 @@ static long region_add(struct resv_map *resv, long f, long t)
 * as a placeholder, so that the subsequent region_add
 * call will have all the regions it needs and will not fail.
 *
+ * out_regions_needed is the number of regions added to the
+ * resv->region_cache_count. This value needs to be provided to a follow-up
+ * call to region_add or region_abort for proper accounting.
+ *
 * Returns the number of huge pages that need to be added to the existing
 * reservation map for the range [f, t). This number is greater than or
 * equal to zero. -ENOMEM is returned if a new file_region structure or
 * cache entry is needed and cannot be allocated.
 */
-static long region_chg(struct resv_map *resv, long f, long t)
+static long region_chg(struct resv_map *resv, long f, long t,
+		       long *out_regions_needed)
 {
+	struct file_region *trg = NULL;
 	long chg = 0;

+	/* Allocate the maximum number of regions we're going to need for this
+	 * reservation. The maximum number of regions we're going to need is
+	 * (t - f) / 2 + 1, which corresponds to a region with alternating
+	 * reserved and unreserved pages.
+	 */
+	*out_regions_needed = (t - f) / 2 + 1;
+
 	spin_lock(&resv->lock);
-retry_locked:
-	resv->adds_in_progress++;
+
+	resv->adds_in_progress += *out_regions_needed;

 	/*
 	 * Check for sufficient descriptors in the cache to accommodate
 	 * the number of in progress add operations.
 	 */
-	if (resv->adds_in_progress > resv->region_cache_count) {
-		struct file_region *trg;
-
-		VM_BUG_ON(resv->adds_in_progress - resv->region_cache_count > 1);
+	while (resv->region_cache_count < resv->adds_in_progress) {
 		/* Must drop lock to allocate a new descriptor. */
-		resv->adds_in_progress--;
 		spin_unlock(&resv->lock);
-
 		trg = kmalloc(sizeof(*trg), GFP_KERNEL);
 		if (!trg)
 			return -ENOMEM;
@@ -393,9 +395,9 @@ static long region_chg(struct resv_map *resv, long f, long t)
 		spin_lock(&resv->lock);
 		list_add(&trg->link, &resv->region_cache);
 		resv->region_cache_count++;
-		goto retry_locked;
 	}

+	/* Count how many hugepages in this range are NOT represented. */
 	chg = add_reservation_in_range(resv, f, t, true);

 	spin_unlock(&resv->lock);
 	return chg;
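The (t - f) / 2 + 1 upper bound used by region_chg() above corresponds to the worst case of alternating reserved and unreserved pages, where every unreserved page becomes its own single-page file_region. A quick standalone check of that bound (illustration only):

#include <assert.h>
#include <stdio.h>

/* Number of gap entries needed for [f, t) when every other page is
 * already reserved (single-page gaps at, say, the even offsets).
 */
static long gaps_for_alternating(long f, long t)
{
	long gaps = 0;

	for (long pg = f; pg < t; pg++)
		if (pg % 2 == 0)
			gaps++;
	return gaps;
}

int main(void)
{
	for (long f = 0; f < 4; f++)
		for (long t = f + 1; t < 64; t++)
			assert(gaps_for_alternating(f, t) <= (t - f) / 2 + 1);
	printf("bound (t - f) / 2 + 1 holds\n");
	return 0;
}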
@@ -407,17 +409,19 @@ static long region_chg(struct resv_map *resv, long f, long t)
 * of the resv_map keeps track of the operations in progress between
 * calls to region_chg and region_add. Operations are sometimes
 * aborted after the call to region_chg. In such cases, region_abort
- * is called to decrement the adds_in_progress counter.
+ * is called to decrement the adds_in_progress counter. regions_needed
+ * is the value returned by the region_chg call; it is used to decrement
+ * the adds_in_progress counter.
 *
 * NOTE: The range arguments [f, t) are not needed or used in this
 * routine. They are kept to make reading the calling code easier as
 * arguments will match the associated region_chg call.
 */
-static void region_abort(struct resv_map *resv, long f, long t)
+static void region_abort(struct resv_map *resv, long f, long t,
+			 long regions_needed)
 {
 	spin_lock(&resv->lock);
-	VM_BUG_ON(!resv->region_cache_count);
-	resv->adds_in_progress--;
+	resv->adds_in_progress -= regions_needed;
 	spin_unlock(&resv->lock);
 }

@@ -1898,9 +1902,10 @@ enum vma_resv_mode {
 	VMA_END_RESV,
 	VMA_ADD_RESV,
 };
-static long __vma_reservation_common(struct hstate *h,
-			struct vm_area_struct *vma, unsigned long addr,
-			enum vma_resv_mode mode)
+static long
+__vma_reservation_common(struct hstate *h, struct vm_area_struct *vma,
+			 unsigned long addr, enum vma_resv_mode mode,
+			 long *out_regions_needed, long in_regions_needed)
 {
 	struct resv_map *resv;
 	pgoff_t idx;
@@ -1913,20 +1918,24 @@ static long __vma_reservation_common(struct hstate *h,
 	idx = vma_hugecache_offset(h, vma, addr);
 	switch (mode) {
 	case VMA_NEEDS_RESV:
-		ret = region_chg(resv, idx, idx + 1);
+		VM_BUG_ON(!out_regions_needed);
+		ret = region_chg(resv, idx, idx + 1, out_regions_needed);
 		break;
 	case VMA_COMMIT_RESV:
-		ret = region_add(resv, idx, idx + 1);
+		VM_BUG_ON(in_regions_needed == -1);
+		ret = region_add(resv, idx, idx + 1, in_regions_needed);
 		break;
 	case VMA_END_RESV:
-		region_abort(resv, idx, idx + 1);
+		VM_BUG_ON(in_regions_needed == -1);
+		region_abort(resv, idx, idx + 1, in_regions_needed);
 		ret = 0;
 		break;
 	case VMA_ADD_RESV:
+		VM_BUG_ON(in_regions_needed == -1);
 		if (vma->vm_flags & VM_MAYSHARE)
-			ret = region_add(resv, idx, idx + 1);
+			ret = region_add(resv, idx, idx + 1, in_regions_needed);
 		else {
-			region_abort(resv, idx, idx + 1);
+			region_abort(resv, idx, idx + 1, in_regions_needed);
 			ret = region_del(resv, idx, idx + 1);
 		}
 		break;
@@ -1959,28 +1968,32 @@ static long __vma_reservation_common(struct hstate *h,
 	return ret < 0 ?
ret : 0; } -static long vma_needs_reservation(struct hstate *h, - struct vm_area_struct *vma, unsigned long addr) +static long vma_needs_reservation(struct hstate *h, struct vm_area_struct *vma, + unsigned long addr, long *out_regions_needed) { - return __vma_reservation_common(h, vma, addr, VMA_NEEDS_RESV); + return __vma_reservation_common(h, vma, addr, VMA_NEEDS_RESV, + out_regions_needed, -1); } -static long vma_commit_reservation(struct hstate *h, - struct vm_area_struct *vma, unsigned long addr) +static long vma_commit_reservation(struct hstate *h, struct vm_area_struct *vma, + unsigned long addr, long regions_needed) { - return __vma_reservation_common(h, vma, addr, VMA_COMMIT_RESV); + return __vma_reservation_common(h, vma, addr, VMA_COMMIT_RESV, NULL, + regions_needed); } -static void vma_end_reservation(struct hstate *h, - struct vm_area_struct *vma, unsigned long addr) +static void vma_end_reservation(struct hstate *h, struct vm_area_struct *vma, + unsigned long addr, long regions_needed) { - (void)__vma_reservation_common(h, vma, addr, VMA_END_RESV); + (void)__vma_reservation_common(h, vma, addr, VMA_END_RESV, NULL, + regions_needed); } -static long vma_add_reservation(struct hstate *h, - struct vm_area_struct *vma, unsigned long addr) +static long vma_add_reservation(struct hstate *h, struct vm_area_struct *vma, + unsigned long addr, long regions_needed) { - return __vma_reservation_common(h, vma, addr, VMA_ADD_RESV); + return __vma_reservation_common(h, vma, addr, VMA_ADD_RESV, NULL, + regions_needed); } /* @@ -1998,8 +2011,10 @@ static void restore_reserve_on_error(struct hstate *h, struct vm_area_struct *vma, unsigned long address, struct page *page) { + long regions_needed = 0; if (unlikely(PagePrivate(page))) { - long rc = vma_needs_reservation(h, vma, address); + long rc = + vma_needs_reservation(h, vma, address, ®ions_needed); if (unlikely(rc < 0)) { /* @@ -2015,7 +2030,8 @@ static void restore_reserve_on_error(struct hstate *h, */ ClearPagePrivate(page); } else if (rc) { - rc = vma_add_reservation(h, vma, address); + rc = vma_add_reservation(h, vma, address, + regions_needed); if (unlikely(rc < 0)) /* * See above comment about rare out of @@ -2023,7 +2039,7 @@ static void restore_reserve_on_error(struct hstate *h, */ ClearPagePrivate(page); } else - vma_end_reservation(h, vma, address); + vma_end_reservation(h, vma, address, regions_needed); } } @@ -2037,6 +2053,7 @@ struct page *alloc_huge_page(struct vm_area_struct *vma, long gbl_chg; int ret, idx; struct hugetlb_cgroup *h_cg; + long regions_needed = 0; idx = hstate_index(h); /* @@ -2044,7 +2061,8 @@ struct page *alloc_huge_page(struct vm_area_struct *vma, * has a reservation for the page to be allocated. A return * code of zero indicates a reservation exists (no change). 
*/ - map_chg = gbl_chg = vma_needs_reservation(h, vma, addr); + map_chg = gbl_chg = + vma_needs_reservation(h, vma, addr, ®ions_needed); if (map_chg < 0) return ERR_PTR(-ENOMEM); @@ -2058,7 +2076,7 @@ struct page *alloc_huge_page(struct vm_area_struct *vma, if (map_chg || avoid_reserve) { gbl_chg = hugepage_subpool_get_pages(spool, 1); if (gbl_chg < 0) { - vma_end_reservation(h, vma, addr); + vma_end_reservation(h, vma, addr, regions_needed); return ERR_PTR(-ENOSPC); } @@ -2105,7 +2123,7 @@ struct page *alloc_huge_page(struct vm_area_struct *vma, set_page_private(page, (unsigned long)spool); - map_commit = vma_commit_reservation(h, vma, addr); + map_commit = vma_commit_reservation(h, vma, addr, regions_needed); if (unlikely(map_chg > map_commit)) { /* * The page was added to the reservation map between @@ -2129,7 +2147,7 @@ struct page *alloc_huge_page(struct vm_area_struct *vma, out_subpool_put: if (map_chg || avoid_reserve) hugepage_subpool_put_pages(spool, 1); - vma_end_reservation(h, vma, addr); + vma_end_reservation(h, vma, addr, regions_needed); return ERR_PTR(-ENOSPC); } @@ -3850,6 +3868,7 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm, spinlock_t *ptl; unsigned long haddr = address & huge_page_mask(h); bool new_page = false; + long regions_needed = 0; /* * Currently, we are forced to kill the process in the event the @@ -3967,12 +3986,12 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm, * the spinlock. */ if ((flags & FAULT_FLAG_WRITE) && !(vma->vm_flags & VM_SHARED)) { - if (vma_needs_reservation(h, vma, haddr) < 0) { + if (vma_needs_reservation(h, vma, haddr, ®ions_needed) < 0) { ret = VM_FAULT_OOM; goto backout_unlocked; } /* Just decrements count, does not deallocate */ - vma_end_reservation(h, vma, haddr); + vma_end_reservation(h, vma, haddr, regions_needed); } ptl = huge_pte_lock(h, mm, ptep); @@ -4062,6 +4081,7 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma, struct address_space *mapping; int need_wait_lock = 0; unsigned long haddr = address & huge_page_mask(h); + long regions_needed = 0; ptep = huge_pte_offset(mm, haddr, huge_page_size(h)); if (ptep) { @@ -4116,12 +4136,12 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma, * consumed. */ if ((flags & FAULT_FLAG_WRITE) && !huge_pte_write(entry)) { - if (vma_needs_reservation(h, vma, haddr) < 0) { + if (vma_needs_reservation(h, vma, haddr, ®ions_needed) < 0) { ret = VM_FAULT_OOM; goto out_mutex; } /* Just decrements count, does not deallocate */ - vma_end_reservation(h, vma, haddr); + vma_end_reservation(h, vma, haddr, regions_needed); if (!(vma->vm_flags & VM_MAYSHARE)) pagecache_page = hugetlbfs_pagecache_page(h, @@ -4584,7 +4604,7 @@ int hugetlb_reserve_pages(struct inode *inode, struct hugepage_subpool *spool = subpool_inode(inode); struct resv_map *resv_map; struct hugetlb_cgroup *h_cg; - long gbl_reserve; + long gbl_reserve, regions_needed = 0; /* This should never happen */ if (from > to) { @@ -4614,7 +4634,7 @@ int hugetlb_reserve_pages(struct inode *inode, */ resv_map = inode_resv_map(inode); - chg = region_chg(resv_map, from, to); + chg = region_chg(resv_map, from, to, ®ions_needed); } else { /* Private mapping. 
 */
@@ -4684,7 +4704,7 @@ int hugetlb_reserve_pages(struct inode *inode,
 	 * else has to be done for private mappings here
 	 */
 	if (!vma || vma->vm_flags & VM_MAYSHARE) {
-		long add = region_add(resv_map, from, to);
+		long add = region_add(resv_map, from, to, regions_needed);

 		if (unlikely(chg > add)) {
 			/*
@@ -4706,7 +4726,7 @@ int hugetlb_reserve_pages(struct inode *inode,
 	if (!vma || vma->vm_flags & VM_MAYSHARE)
 		/* Don't call region_abort if region_chg failed */
 		if (chg >= 0)
-			region_abort(resv_map, from, to);
+			region_abort(resv_map, from, to, regions_needed);
 	if (vma && is_vma_resv_set(vma, HPAGE_RESV_OWNER))
 		kref_put(&resv_map->refs, resv_map_release);
 	return ret;
@@ -5130,3 +5150,21 @@ void move_hugetlb_state(struct page *oldpage, struct page *newpage, int reason)
 		spin_unlock(&hugetlb_lock);
 	}
 }
+
+static struct file_region *
+get_file_region_entry_from_cache(struct resv_map *resv, long from, long to)
+{
+	struct file_region *nrg = NULL;
+
+	VM_BUG_ON(resv->region_cache_count <= 0);
+
+	resv->region_cache_count--;
+	nrg = list_first_entry(&resv->region_cache, struct file_region, link);
+	VM_BUG_ON(!nrg);
+	list_del(&nrg->link);
+
+	nrg->from = from;
+	nrg->to = to;
+
+	return nrg;
+}
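get_file_region_entry_from_cache() is the consuming half of the scheme: region_chg() tops up the cache while it may still sleep and allocate, so the pop in region_add() can never fail. A user-space sketch of that pattern (illustration only; the kernel version uses a list_head protected by resv->lock):

#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

struct file_region {
	struct file_region *next; /* stand-in for the kernel's list_head */
	long from, to;
};

static struct file_region *cache;
static long cache_count;

/* Fill the cache up front, while allocation is still allowed. */
static void cache_fill(long needed)
{
	while (cache_count < needed) {
		struct file_region *rg = malloc(sizeof(*rg));

		assert(rg);
		rg->next = cache;
		cache = rg;
		cache_count++;
	}
}

/* Pop a pre-allocated entry; mirrors the VM_BUG_ON above. */
static struct file_region *cache_pop(long from, long to)
{
	struct file_region *rg = cache;

	assert(cache_count > 0);
	cache = rg->next;
	cache_count--;
	rg->from = from;
	rg->to = to;
	return rg;
}

int main(void)
{
	cache_fill(2);				  /* region_chg side */
	struct file_region *rg = cache_pop(1, 2); /* region_add side */

	printf("[%ld, %ld), %ld cached entries left\n",
	       rg->from, rg->to, cache_count);
	free(rg);
	while (cache) {
		struct file_region *next = cache->next;

		free(cache);
		cache = next;
	}
	return 0;
}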
From patchwork Sun Oct 13 00:30:21 2019
Date: Sat, 12 Oct 2019 17:30:21 -0700
In-Reply-To: <20191013003024.215429-1-almasrymina@google.com>
Message-Id: <20191013003024.215429-6-almasrymina@google.com>
References: <20191013003024.215429-1-almasrymina@google.com>
Subject: [PATCH v6 6/9] hugetlb_cgroup: add accounting for shared mappings
From: Mina Almasry
To: mike.kravetz@oracle.com
For shared mappings, the pointer to the hugetlb_cgroup to uncharge lives
in the resv_map entries, in file_region->reservation_counter.

After a call to region_chg, we charge the appropriate hugetlb_cgroup, and
if successful, we pass on the hugetlb_cgroup info to a follow-up
region_add call. When a file_region entry is added to the resv_map via
region_add, we put the pointer to that cgroup in
file_region->reservation_counter. If charging doesn't succeed, we report
the error to the caller, so that the kernel fails the reservation.

On region_del, which is when the hugetlb memory is unreserved, we also
uncharge the file_region->reservation_counter.

Signed-off-by: Mina Almasry

---

 mm/hugetlb.c | 147 ++++++++++++++++++++++++++++++++++++++++-----------
 1 file changed, 116 insertions(+), 31 deletions(-)

--
2.23.0.700.g56cf767bdb-goog

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index f9c1947925bb9..af336bf227fb6 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -242,6 +242,15 @@ struct file_region {
 	struct list_head link;
 	long from;
 	long to;
+#ifdef CONFIG_CGROUP_HUGETLB
+	/*
+	 * On shared mappings, each reserved region appears as a struct
+	 * file_region in resv_map. These fields hold the info needed to
+	 * uncharge each reservation.
+	 */
+	struct page_counter *reservation_counter;
+	unsigned long pages_per_hpage;
+#endif
 };

 /* Helper that removes a struct file_region from the resv_map cache and returns
@@ -250,12 +259,30 @@ struct file_region {
 static struct file_region *
 get_file_region_entry_from_cache(struct resv_map *resv, long from, long to);

+/* Helper that records hugetlb_cgroup uncharge info. */
+static void record_hugetlb_cgroup_uncharge_info(struct hugetlb_cgroup *h_cg,
+						struct file_region *nrg,
+						struct hstate *h)
+{
+#ifdef CONFIG_CGROUP_HUGETLB
+	if (h_cg) {
+		nrg->reservation_counter =
+			&h_cg->reserved_hugepage[hstate_index(h)];
+		nrg->pages_per_hpage = pages_per_huge_page(h);
+	} else {
+		nrg->reservation_counter = NULL;
+		nrg->pages_per_hpage = 0;
+	}
+#endif
+}
+
 /* Must be called with resv->lock held. Calling this with count_only == true
 * will count the number of pages to be added but will not modify the linked
 * list.
 */
 static long add_reservation_in_range(struct resv_map *resv, long f, long t,
-				     bool count_only)
+				     struct hugetlb_cgroup *h_cg,
+				     struct hstate *h, bool count_only)
 {
 	long add = 0;
 	struct list_head *head = &resv->regions;
@@ -291,6 +318,8 @@ static long add_reservation_in_range(struct resv_map *resv, long f, long t,
 			if (!count_only) {
 				nrg = get_file_region_entry_from_cache(
 					resv, last_accounted_offset, rg->from);
+				record_hugetlb_cgroup_uncharge_info(h_cg, nrg,
+								    h);
 				list_add(&nrg->link, rg->link.prev);
 			}
 		}
@@ -306,11 +335,13 @@ static long add_reservation_in_range(struct resv_map *resv, long f, long t,
 		if (!count_only) {
 			nrg = get_file_region_entry_from_cache(
 				resv, last_accounted_offset, t);
+			record_hugetlb_cgroup_uncharge_info(h_cg, nrg, h);
 			list_add(&nrg->link, rg->link.prev);
 		}
 		last_accounted_offset = t;
 	}

+	VM_BUG_ON(add < 0);
 	return add;
 }

@@ -327,7 +358,8 @@ static long add_reservation_in_range(struct resv_map *resv, long f, long t,
 * Return the number of new huge pages added to the map. This
 * number is greater than or equal to zero.
 */
-static long region_add(struct resv_map *resv, long f, long t,
+static long region_add(struct hstate *h, struct hugetlb_cgroup *h_cg,
+		       struct resv_map *resv, long f, long t,
 		       long regions_needed)
 {
 	long add = 0;
@@ -336,7 +368,7 @@ static long region_add(struct resv_map *resv, long f, long t,

 	VM_BUG_ON(resv->region_cache_count < regions_needed);

-	add = add_reservation_in_range(resv, f, t, false);
+	add = add_reservation_in_range(resv, f, t, h_cg, h, false);
 	resv->adds_in_progress -= regions_needed;

 	spin_unlock(&resv->lock);
@@ -398,7 +430,7 @@ static long region_chg(struct resv_map *resv, long f, long t,
 	}

 	/* Count how many hugepages in this range are NOT represented. */
-	chg = add_reservation_in_range(resv, f, t, true);
+	chg = add_reservation_in_range(resv, f, t, NULL, NULL, true);

 	spin_unlock(&resv->lock);
 	return chg;
@@ -425,6 +457,24 @@ static void region_abort(struct resv_map *resv, long f, long t,
 	spin_unlock(&resv->lock);
 }

+static void uncharge_cgroup_if_shared_mapping(struct resv_map *resv,
+					      struct file_region *rg,
+					      unsigned long nr_pages)
+{
+#ifdef CONFIG_CGROUP_HUGETLB
+	/*
+	 * If resv->reservation_counter is NULL, then this is a shared
+	 * reservation, and the reserved memory is tracked in the file_region
+	 * entries inside of resv_map. So we need to uncharge the memory here.
+	 */
+	if (rg->reservation_counter && rg->pages_per_hpage && nr_pages > 0 &&
+	    !resv->reservation_counter) {
+		hugetlb_cgroup_uncharge_counter(rg->reservation_counter,
+						nr_pages * rg->pages_per_hpage);
+	}
+#endif
+}
+
 /*
 * Delete the specified range [f, t) from the reserve map. If the
 * t parameter is LONG_MAX, this indicates that ALL regions after f
@@ -494,6 +544,9 @@ static long region_del(struct resv_map *resv, long f, long t)
 			/* Original entry is trimmed */
 			rg->to = f;

+			uncharge_cgroup_if_shared_mapping(resv, rg,
+							  nrg->to - nrg->from);
+
 			list_add(&nrg->link, &rg->link);
 			nrg = NULL;
 			break;

@@ -501,6 +554,8 @@ static long region_del(struct resv_map *resv, long f, long t)
 		if (f <= rg->from && t >= rg->to) { /* Remove entire region */
 			del += rg->to - rg->from;
+			uncharge_cgroup_if_shared_mapping(resv, rg,
+							  rg->to - rg->from);
 			list_del(&rg->link);
 			kfree(rg);
 			continue;

@@ -509,14 +564,20 @@ static long region_del(struct resv_map *resv, long f, long t)
 		if (f <= rg->from) {	/* Trim beginning of region */
 			del += t - rg->from;
+
+			/* Uncharge before rg->from is overwritten, so the
+			 * deleted page count is still available.
+			 */
+			uncharge_cgroup_if_shared_mapping(resv, rg,
+							  t - rg->from);
 			rg->from = t;
 		} else {		/* Trim end of region */
 			del += rg->to - f;
+
+			/* Likewise, uncharge before rg->to is overwritten. */
+			uncharge_cgroup_if_shared_mapping(resv, rg,
+							  rg->to - f);
 			rg->to = f;
 		}
 	}

 	spin_unlock(&resv->lock);
 	kfree(nrg);
+
 	return del;
 }

@@ -1923,7 +1984,8 @@ __vma_reservation_common(struct hstate *h, struct vm_area_struct *vma,
 		break;
 	case VMA_COMMIT_RESV:
 		VM_BUG_ON(in_regions_needed == -1);
-		ret = region_add(resv, idx, idx + 1, in_regions_needed);
+		ret = region_add(NULL, NULL, resv, idx, idx + 1,
+				 in_regions_needed);
 		break;
 	case VMA_END_RESV:
 		VM_BUG_ON(in_regions_needed == -1);
@@ -1933,7 +1995,8 @@ __vma_reservation_common(struct hstate *h, struct vm_area_struct *vma,
 	case VMA_ADD_RESV:
 		VM_BUG_ON(in_regions_needed == -1);
 		if (vma->vm_flags & VM_MAYSHARE)
-			ret = region_add(resv, idx, idx + 1, in_regions_needed);
+			ret = region_add(NULL, NULL, resv, idx, idx + 1,
+					 in_regions_needed);
 		else {
 			region_abort(resv, idx, idx + 1, in_regions_needed);
 			ret = region_del(resv, idx, idx + 1);
@@ -4603,7 +4666,7 @@ int hugetlb_reserve_pages(struct inode *inode,
 	struct hstate *h = hstate_inode(inode);
 	struct hugepage_subpool *spool = subpool_inode(inode);
 	struct resv_map *resv_map;
-	struct hugetlb_cgroup *h_cg;
+	struct hugetlb_cgroup *h_cg = NULL;
 	long gbl_reserve, regions_needed = 0;

 	/* This should never happen */
@@ -4644,23 +4707,6 @@ int hugetlb_reserve_pages(struct inode *inode,

 		chg = to - from;

-		if (hugetlb_cgroup_charge_cgroup(hstate_index(h),
-						 chg * pages_per_huge_page(h),
-						 &h_cg, true)) {
-			kref_put(&resv_map->refs, resv_map_release);
-			return -ENOMEM;
-		}
-
-#ifdef CONFIG_CGROUP_HUGETLB
-		/*
-		 * Since this branch handles private mappings, we attach the
-		 * counter to uncharge for this reservation off resv_map.
-		 */
-		resv_map->reservation_counter =
-			&h_cg->reserved_hugepage[hstate_index(h)];
-		resv_map->pages_per_hpage = pages_per_huge_page(h);
-#endif
-
 		set_vma_resv_map(vma, resv_map);
 		set_vma_resv_flags(vma, HPAGE_RESV_OWNER);
 	}
@@ -4670,6 +4716,25 @@ int hugetlb_reserve_pages(struct inode *inode,
 		goto out_err;
 	}

+	ret = hugetlb_cgroup_charge_cgroup(
+		hstate_index(h), chg * pages_per_huge_page(h), &h_cg, true);
+
+	if (ret < 0) {
+		ret = -ENOMEM;
+		goto out_err;
+	}
+
+#ifdef CONFIG_CGROUP_HUGETLB
+	if (vma && !(vma->vm_flags & VM_MAYSHARE)) {
+		/* For private mappings, the hugetlb_cgroup uncharge info hangs
+		 * off the resv_map.
+		 */
+		resv_map->reservation_counter =
+			&h_cg->reserved_hugepage[hstate_index(h)];
+		resv_map->pages_per_hpage = pages_per_huge_page(h);
+	}
+#endif
+
 	/*
 	 * There must be enough pages in the subpool for the mapping. If
 	 * the subpool has a minimum size, there may be some global
@@ -4678,7 +4743,7 @@ int hugetlb_reserve_pages(struct inode *inode,
 	gbl_reserve = hugepage_subpool_get_pages(spool, chg);
 	if (gbl_reserve < 0) {
 		ret = -ENOSPC;
-		goto out_err;
+		goto out_uncharge_cgroup;
 	}

 	/*
@@ -4687,9 +4752,7 @@ int hugetlb_reserve_pages(struct inode *inode,
 	ret = hugetlb_acct_memory(h, gbl_reserve);
 	if (ret < 0) {
-		/* put back original number of pages, chg */
-		(void)hugepage_subpool_put_pages(spool, chg);
-		goto out_err;
+		goto out_put_pages;
 	}

 	/*
@@ -4704,7 +4767,8 @@ int hugetlb_reserve_pages(struct inode *inode,
 	 * else has to be done for private mappings here
 	 */
 	if (!vma || vma->vm_flags & VM_MAYSHARE) {
-		long add = region_add(resv_map, from, to, regions_needed);
+		long add =
+			region_add(h, h_cg, resv_map, from, to, regions_needed);

 		if (unlikely(chg > add)) {
 			/*
@@ -4716,12 +4780,33 @@ int hugetlb_reserve_pages(struct inode *inode,
 			 */
 			long rsv_adjust;

-			rsv_adjust = hugepage_subpool_put_pages(spool,
-								chg - add);
+			hugetlb_cgroup_uncharge_cgroup(
+				hstate_index(h),
+				(chg - add) * pages_per_huge_page(h), h_cg,
+				true);
+
+			rsv_adjust =
+				hugepage_subpool_put_pages(spool, chg - add);
 			hugetlb_acct_memory(h, -rsv_adjust);
 		}
+	} else {
+#ifdef CONFIG_CGROUP_HUGETLB
+		/*
+		 * Since this branch handles private mappings, we attach the
+		 * counter to uncharge for this reservation off resv_map.
+		 */
+		resv_map->reservation_counter =
+			&h_cg->reserved_hugepage[hstate_index(h)];
+		resv_map->pages_per_hpage = pages_per_huge_page(h);
+#endif
 	}
 	return 0;
+out_put_pages:
+	/* put back original number of pages, chg */
+	(void)hugepage_subpool_put_pages(spool, chg);
+out_uncharge_cgroup:
+	hugetlb_cgroup_uncharge_cgroup(
+		hstate_index(h), chg * pages_per_huge_page(h), h_cg, true);
 out_err:
 	if (!vma || vma->vm_flags & VM_MAYSHARE)
 		/* Don't call region_abort if region_chg failed */
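The bookkeeping above boils down to: when a file_region is created, stash the page_counter to uncharge and the huge page size in it; when the region is deleted, multiply the removed page range by pages_per_hpage to know how much to give back. A user-space sketch (illustration only; page_counter is reduced to a bare integer):

#include <stdio.h>

struct page_counter { long usage; };

struct file_region {
	long from, to;
	struct page_counter *reservation_counter; /* NULL if never charged */
	unsigned long pages_per_hpage;
};

static void record_uncharge_info(struct file_region *rg,
				 struct page_counter *counter,
				 unsigned long pages_per_hpage)
{
	rg->reservation_counter = counter;
	rg->pages_per_hpage = pages_per_hpage;
}

static void uncharge_region(struct file_region *rg)
{
	if (!rg->reservation_counter)
		return; /* nothing was charged for this region */
	rg->reservation_counter->usage -=
		(rg->to - rg->from) * rg->pages_per_hpage;
}

int main(void)
{
	struct page_counter reserved_2mb = { .usage = 3 * 512 };
	struct file_region rg = { .from = 0, .to = 3 };

	record_uncharge_info(&rg, &reserved_2mb, 512); /* region_add path */
	uncharge_region(&rg);                          /* region_del path */
	printf("usage after uncharge: %ld\n", reserved_2mb.usage);
	return 0;
}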
From patchwork Sun Oct 13 00:30:22 2019
Date: Sat, 12 Oct 2019 17:30:22 -0700
In-Reply-To: <20191013003024.215429-1-almasrymina@google.com>
Message-Id: <20191013003024.215429-7-almasrymina@google.com>
References: <20191013003024.215429-1-almasrymina@google.com>
Subject: [PATCH v6 7/9] hugetlb_cgroup: support noreserve mappings
From: Mina Almasry
To: mike.kravetz@oracle.com

Support MAP_NORESERVE accounting as part of the new counter. For each
hugepage allocation, at allocation time we check if there is a
reservation for this allocation or not. If there is a reservation for
this allocation, then this allocation was charged at reservation time,
and we don't re-account it. If there is no reservation for this
allocation, we charge the appropriate hugetlb_cgroup.

The hugetlb_cgroup to uncharge for this allocation is stored in
page[3].private. We use the new APIs added in an earlier patch to set
this pointer.
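The check described above reduces to a single predicate: charge the reservation counter at allocation time only when the allocation does not consume a prior reservation. A sketch of that predicate (illustration only; the names mirror the variables used in alloc_huge_page() below):

#include <stdbool.h>
#include <stdio.h>

/* map_chg != 0 means the reserve map had no entry for this page;
 * avoid_reserve means the caller refuses to consume a reservation;
 * no resv_map at all means nothing was ever reserved.
 */
static bool charge_reservation_at_alloc(long map_chg, bool avoid_reserve,
					bool has_resv_map)
{
	return map_chg || avoid_reserve || !has_resv_map;
}

int main(void)
{
	/* Covered by an existing reservation: already charged earlier. */
	printf("%d\n", charge_reservation_at_alloc(0, false, true)); /* 0 */
	/* MAP_NORESERVE-style allocation: charge now. */
	printf("%d\n", charge_reservation_at_alloc(1, false, true)); /* 1 */
	return 0;
}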
---

 mm/hugetlb.c | 25 ++++++++++++++++++++++++-
 1 file changed, 24 insertions(+), 1 deletion(-)

--
2.23.0.700.g56cf767bdb-goog

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index af336bf227fb6..79b99878ce6f9 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1217,6 +1217,7 @@ static void update_and_free_page(struct hstate *h, struct page *page)
 				1 << PG_writeback);
 	}
 	VM_BUG_ON_PAGE(hugetlb_cgroup_from_page(page, false), page);
+	VM_BUG_ON_PAGE(hugetlb_cgroup_from_page(page, true), page);
 	set_compound_page_dtor(page, NULL_COMPOUND_DTOR);
 	set_page_refcounted(page);
 	if (hstate_is_gigantic(h)) {
@@ -1328,6 +1329,9 @@ void free_huge_page(struct page *page)
 	clear_page_huge_active(page);
 	hugetlb_cgroup_uncharge_page(hstate_index(h), pages_per_huge_page(h),
 				     page, false);
+	hugetlb_cgroup_uncharge_page(hstate_index(h), pages_per_huge_page(h),
+				     page, true);
+
 	if (restore_reserve)
 		h->resv_huge_pages++;

@@ -1354,6 +1358,7 @@ static void prep_new_huge_page(struct hstate *h, struct page *page, int nid)
 	set_compound_page_dtor(page, HUGETLB_PAGE_DTOR);
 	spin_lock(&hugetlb_lock);
 	set_hugetlb_cgroup(page, NULL, false);
+	set_hugetlb_cgroup(page, NULL, true);
 	h->nr_huge_pages++;
 	h->nr_huge_pages_node[nid]++;
 	spin_unlock(&hugetlb_lock);
@@ -2155,10 +2160,19 @@ struct page *alloc_huge_page(struct vm_area_struct *vma,
 		gbl_chg = 1;
 	}

+	/* If this allocation is not consuming a reservation, charge it now. */
+	if (map_chg || avoid_reserve || !vma_resv_map(vma)) {
+		ret = hugetlb_cgroup_charge_cgroup(idx, pages_per_huge_page(h),
+						   &h_cg, true);
+		if (ret)
+			goto out_subpool_put;
+	}
+
 	ret = hugetlb_cgroup_charge_cgroup(idx, pages_per_huge_page(h), &h_cg,
 					   false);
 	if (ret)
-		goto out_subpool_put;
+		goto out_uncharge_cgroup_reservation;

 	spin_lock(&hugetlb_lock);
 	/*
@@ -2182,6 +2196,11 @@ struct page *alloc_huge_page(struct vm_area_struct *vma,
 	}
 	hugetlb_cgroup_commit_charge(idx, pages_per_huge_page(h), h_cg, page,
 				     false);
+	if (!vma_resv_map(vma) || map_chg || avoid_reserve) {
+		hugetlb_cgroup_commit_charge(idx, pages_per_huge_page(h), h_cg,
+					     page, true);
+	}
+
 	spin_unlock(&hugetlb_lock);

 	set_page_private(page, (unsigned long)spool);
@@ -2207,6 +2226,10 @@ struct page *alloc_huge_page(struct vm_area_struct *vma,
 out_uncharge_cgroup:
 	hugetlb_cgroup_uncharge_cgroup(idx, pages_per_huge_page(h), h_cg,
 				       false);
+out_uncharge_cgroup_reservation:
+	if (map_chg || avoid_reserve || !vma_resv_map(vma))
+		hugetlb_cgroup_uncharge_cgroup(idx, pages_per_huge_page(h),
+					       h_cg, true);
 out_subpool_put:
 	if (map_chg || avoid_reserve)
 		hugepage_subpool_put_pages(spool, 1);
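The out_uncharge_cgroup_reservation label above follows the usual kernel goto-unwind convention: resources are released in reverse order of acquisition, and each label undoes exactly one step, so an error at any point unwinds everything taken so far. A standalone sketch of the shape (hypothetical helpers, illustration only):

#include <stdio.h>

static int get_subpool(void)    { return 0; }
static void put_subpool(void)   { puts("put subpool"); }
static int charge_resv(void)    { return 0; }
static void uncharge_resv(void) { puts("uncharge reservation"); }
static int charge_fault(void)   { return -1; /* simulate a failure */ }

static int alloc_path(void)
{
	if (get_subpool())
		goto out;
	if (charge_resv())
		goto out_subpool_put;
	if (charge_fault())
		goto out_uncharge_reservation;
	return 0;

out_uncharge_reservation:
	uncharge_resv();
out_subpool_put:
	put_subpool();
out:
	return -1;
}

int main(void)
{
	alloc_path(); /* prints the unwind steps in reverse acquire order */
	return 0;
}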
From patchwork Sun Oct 13 00:30:23 2019
Date: Sat, 12 Oct 2019 17:30:23 -0700
In-Reply-To: <20191013003024.215429-1-almasrymina@google.com>
Message-Id: <20191013003024.215429-8-almasrymina@google.com>
References: <20191013003024.215429-1-almasrymina@google.com>
Subject: [PATCH v6 8/9] hugetlb_cgroup: Add hugetlb_cgroup reservation tests
From: Mina Almasry
To: mike.kravetz@oracle.com

The tests use both shared and private mapped hugetlb memory, and monitor
the hugetlb usage counter as well as the hugetlb reservation counter.
They test different configurations such as hugetlb memory usage via
hugetlbfs, or MAP_HUGETLB, or shmget/shmat, and with and without
MAP_POPULATE.

Signed-off-by: Mina Almasry

---

Changes in v6:
- Updates tests for cgroups-v2 and NORESERVE allocations.

---

 tools/testing/selftests/vm/.gitignore         |   1 +
 tools/testing/selftests/vm/Makefile           |   1 +
 .../selftests/vm/charge_reserved_hugetlb.sh   | 527 ++++++++++++++++++
 .../selftests/vm/write_hugetlb_memory.sh      |  23 +
 .../testing/selftests/vm/write_to_hugetlbfs.c | 261 +++++++++
 5 files changed, 813 insertions(+)
 create mode 100755 tools/testing/selftests/vm/charge_reserved_hugetlb.sh
 create mode 100644 tools/testing/selftests/vm/write_hugetlb_memory.sh
 create mode 100644 tools/testing/selftests/vm/write_to_hugetlbfs.c

--
2.23.0.700.g56cf767bdb-goog

diff --git a/tools/testing/selftests/vm/.gitignore b/tools/testing/selftests/vm/.gitignore
index 31b3c98b6d34d..d3bed9407773c 100644
--- a/tools/testing/selftests/vm/.gitignore
+++ b/tools/testing/selftests/vm/.gitignore
@@ -14,3 +14,4 @@ virtual_address_range
 gup_benchmark
 va_128TBswitch
 map_fixed_noreplace
+write_to_hugetlbfs
diff --git a/tools/testing/selftests/vm/Makefile b/tools/testing/selftests/vm/Makefile
index 9534dc2bc9295..31c2cc5cf30b5 100644
--- a/tools/testing/selftests/vm/Makefile
+++ b/tools/testing/selftests/vm/Makefile
@@ -18,6 +18,7 @@ TEST_GEN_FILES += transhuge-stress
 TEST_GEN_FILES += userfaultfd
 TEST_GEN_FILES += va_128TBswitch
 TEST_GEN_FILES += virtual_address_range
+TEST_GEN_FILES += write_to_hugetlbfs

 TEST_PROGS := run_vmtests
diff --git a/tools/testing/selftests/vm/charge_reserved_hugetlb.sh b/tools/testing/selftests/vm/charge_reserved_hugetlb.sh
new file mode 100755
index 0000000000000..278dd6475cd0f
--- /dev/null
+++ b/tools/testing/selftests/vm/charge_reserved_hugetlb.sh
@@ -0,0 +1,527 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+
+set -e
+
+if [[ $(id -u) -ne 0 ]]; then
+	echo "This test must be run as root. Skipping..."
+	exit 0
+fi
+
+cgroup_path=/dev/cgroup/memory
+if [[ ! -e $cgroup_path ]]; then
+	mkdir -p $cgroup_path
+	mount -t cgroup2 none $cgroup_path
+fi
+
+echo "+hugetlb" > $cgroup_path/cgroup.subtree_control
+
+cleanup () {
+	echo $$ > $cgroup_path/cgroup.procs
+
+	if [[ -e /mnt/huge ]]; then
+		rm -rf /mnt/huge/*
+		umount /mnt/huge || echo error
+		rmdir /mnt/huge
+	fi
+	if [[ -e $cgroup_path/hugetlb_cgroup_test ]]; then
+		rmdir $cgroup_path/hugetlb_cgroup_test
+	fi
+	if [[ -e $cgroup_path/hugetlb_cgroup_test1 ]]; then
+		rmdir $cgroup_path/hugetlb_cgroup_test1
+	fi
+	if [[ -e $cgroup_path/hugetlb_cgroup_test2 ]]; then
+		rmdir $cgroup_path/hugetlb_cgroup_test2
+	fi
+	echo 0 > /proc/sys/vm/nr_hugepages
+	echo CLEANUP DONE
+}
+
+function expect_equal() {
+	local expected="$1"
+	local actual="$2"
+	local error="$3"
+
+	if [[ "$expected" != "$actual" ]]; then
+		echo "expected ($expected) != actual ($actual): $error"
+		cleanup
+		exit 1
+	fi
+}
+
+function setup_cgroup() {
+	local name="$1"
+	local cgroup_limit="$2"
+	local reservation_limit="$3"
+
+	mkdir $cgroup_path/$name
+
+	echo writing cgroup limit: "$cgroup_limit"
+	echo "$cgroup_limit" > $cgroup_path/$name/hugetlb.2MB.limit_in_bytes
+
+	echo writing reservation limit: "$reservation_limit"
+	echo "$reservation_limit" > \
+		$cgroup_path/$name/hugetlb.2MB.reservation_limit_in_bytes
+
+	if [ -e "$cgroup_path/$name/cpuset.cpus" ]; then
+		echo 0 > $cgroup_path/$name/cpuset.cpus
+	fi
+	if [ -e "$cgroup_path/$name/cpuset.mems" ]; then
+		echo 0 > $cgroup_path/$name/cpuset.mems
+	fi
+}
+
+function wait_for_hugetlb_memory_to_get_depleted {
+	local cgroup="$1"
+	local path="$cgroup_path/$cgroup/hugetlb.2MB.reservation_usage_in_bytes"
+	# Wait for hugetlbfs memory to get depleted.
+	while [ $(cat $path) != 0 ]; do
+		echo Waiting for hugetlb memory to get depleted.
+		cat $path
+		sleep 0.5
+	done
+}
+
+function wait_for_hugetlb_memory_to_get_reserved {
+	local cgroup="$1"
+	local size="$2"
+
+	local path="$cgroup_path/$cgroup/hugetlb.2MB.reservation_usage_in_bytes"
+	# Wait for hugetlbfs memory to get reserved.
+	while [ $(cat $path) != $size ]; do
+		echo Waiting for hugetlb memory to reach size $size.
+		cat $path
+		sleep 0.5
+	done
+}
+
+function wait_for_hugetlb_memory_to_get_written {
+	local cgroup="$1"
+	local size="$2"
+
+	local path="$cgroup_path/$cgroup/hugetlb.2MB.usage_in_bytes"
+	# Wait for hugetlbfs memory to get written.
+	while [ $(cat $path) != $size ]; do
+		echo Waiting for hugetlb memory to reach size $size.
+		cat $path
+		sleep 0.5
+	done
+}
+
+function write_hugetlbfs_and_get_usage() {
+	local cgroup="$1"
+	local size="$2"
+	local populate="$3"
+	local write="$4"
+	local path="$5"
+	local method="$6"
+	local private="$7"
+	local expect_failure="$8"
+	local reserve="$9"
+
+	# Function return values.
+	reservation_failed=0
+	oom_killed=0
+	hugetlb_difference=0
+	reserved_difference=0
+
+	local hugetlb_usage=$cgroup_path/$cgroup/hugetlb.2MB.usage_in_bytes
+	local reserved_usage=$cgroup_path/$cgroup/hugetlb.2MB.reservation_usage_in_bytes
+
+	local hugetlb_before=$(cat $hugetlb_usage)
+	local reserved_before=$(cat $reserved_usage)
+
+	echo
+	echo Starting:
+	echo hugetlb_usage="$hugetlb_before"
+	echo reserved_usage="$reserved_before"
+	echo expect_failure is "$expect_failure"
+
+	set +e
+	if [[ "$method" == "1" ]] || [[ "$method" == 2 ]] || \
+	   [[ "$private" == "-r" ]] && [[ "$expect_failure" != 1 ]]; then
+
+		bash write_hugetlb_memory.sh "$size" "$populate" "$write" \
+			"$cgroup" "$path" "$method" "$private" "-l" "$reserve" &
+		local write_result=$?
+
+		if [[ "$reserve" != "-n" ]]; then
+			wait_for_hugetlb_memory_to_get_reserved "$cgroup" "$size"
+		elif [[ "$populate" == "-o" ]] || [[ "$write" == "-w" ]]; then
+			wait_for_hugetlb_memory_to_get_written "$cgroup" "$size"
+		else
+			# This case doesn't produce visible effects, but we
+			# still have to wait for the async process to start
+			# and execute...
+			sleep 0.5
+		fi
+
+		echo write_result is $write_result
+	else
+		bash write_hugetlb_memory.sh "$size" "$populate" "$write" \
+			"$cgroup" "$path" "$method" "$private" "$reserve"
+		local write_result=$?
+
+		if [[ "$reserve" != "-n" ]]; then
+			wait_for_hugetlb_memory_to_get_reserved "$cgroup" "$size"
+		fi
+	fi
+	set -e
+
+	if [[ "$write_result" == 1 ]]; then
+		reservation_failed=1
+	fi
+
+	# On linus/master, the above process gets SIGBUS'd on oomkill, with
+	# return code 135. On earlier kernels, it gets an actual oomkill, with
+	# return code 137, so just check for both conditions in case we're
+	# testing against an earlier kernel.
+	if [[ "$write_result" == 135 ]] || [[ "$write_result" == 137 ]]; then
+		oom_killed=1
+	fi
+
+	local hugetlb_after=$(cat $hugetlb_usage)
+	local reserved_after=$(cat $reserved_usage)
+
+	echo After write:
+	echo hugetlb_usage="$hugetlb_after"
+	echo reserved_usage="$reserved_after"
+
+	hugetlb_difference=$(($hugetlb_after - $hugetlb_before))
+	reserved_difference=$(($reserved_after - $reserved_before))
+}
+
+function cleanup_hugetlb_memory() {
+	set +e
+	local cgroup="$1"
+	if [[ "$(pgrep write_to_hugetlbfs)" != "" ]]; then
+		echo killing write_to_hugetlbfs
+		killall -2 write_to_hugetlbfs
+		wait_for_hugetlb_memory_to_get_depleted $cgroup
+	fi
+	set -e
+
+	if [[ -e /mnt/huge ]]; then
+		rm -rf /mnt/huge/*
+		umount /mnt/huge
+		rmdir /mnt/huge
+	fi
+}
+
+function run_test() {
+	local size="$1"
+	local populate="$2"
+	local write="$3"
+	local cgroup_limit="$4"
+	local reservation_limit="$5"
+	local nr_hugepages="$6"
+	local method="$7"
+	local private="$8"
+	local expect_failure="$9"
+	local reserve="${10}"
+
+	# Function return values.
+	hugetlb_difference=0
+	reserved_difference=0
+	reservation_failed=0
+	oom_killed=0
+
+	echo nr hugepages = "$nr_hugepages"
+	echo "$nr_hugepages" > /proc/sys/vm/nr_hugepages
+
+	setup_cgroup "hugetlb_cgroup_test" "$cgroup_limit" "$reservation_limit"
+
+	mkdir -p /mnt/huge
+	mount -t hugetlbfs -o pagesize=2M,size=256M none /mnt/huge
+
+	write_hugetlbfs_and_get_usage "hugetlb_cgroup_test" "$size" "$populate" \
+		"$write" "/mnt/huge/test" "$method" "$private" "$expect_failure" \
+		"$reserve"
+
+	cleanup_hugetlb_memory "hugetlb_cgroup_test"
+
+	local final_hugetlb=$(cat $cgroup_path/hugetlb_cgroup_test/hugetlb.2MB.usage_in_bytes)
+	local final_reservation=$(cat $cgroup_path/hugetlb_cgroup_test/hugetlb.2MB.reservation_usage_in_bytes)
+
+	echo $hugetlb_difference
+	echo $reserved_difference
+	expect_equal "0" "$final_hugetlb" "final hugetlb is not zero"
+	expect_equal "0" "$final_reservation" "final reservation is not zero"
+}
+
+function run_multiple_cgroup_test() {
+	local size1="$1"
+	local populate1="$2"
+	local write1="$3"
+	local cgroup_limit1="$4"
+	local reservation_limit1="$5"
+
+	local size2="$6"
+	local populate2="$7"
+	local write2="$8"
+	local cgroup_limit2="$9"
+	local reservation_limit2="${10}"
+
+	local nr_hugepages="${11}"
+	local method="${12}"
+	local private="${13}"
+	local expect_failure="${14}"
+	local reserve="${15}"
+
+	# Function return values.
+function run_multiple_cgroup_test() {
+  local size1="$1"
+  local populate1="$2"
+  local write1="$3"
+  local cgroup_limit1="$4"
+  local reservation_limit1="$5"
+
+  local size2="$6"
+  local populate2="$7"
+  local write2="$8"
+  local cgroup_limit2="$9"
+  local reservation_limit2="${10}"
+
+  local nr_hugepages="${11}"
+  local method="${12}"
+  local private="${13}"
+  local expect_failure="${14}"
+  local reserve="${15}"
+
+  # Function return values.
+  hugetlb_difference1=0
+  reserved_difference1=0
+  reservation_failed1=0
+  oom_killed1=0
+
+  hugetlb_difference2=0
+  reserved_difference2=0
+  reservation_failed2=0
+  oom_killed2=0
+
+  echo nr hugepages = "$nr_hugepages"
+  echo "$nr_hugepages" > /proc/sys/vm/nr_hugepages
+
+  setup_cgroup "hugetlb_cgroup_test1" "$cgroup_limit1" "$reservation_limit1"
+  setup_cgroup "hugetlb_cgroup_test2" "$cgroup_limit2" "$reservation_limit2"
+
+  mkdir -p /mnt/huge
+  mount -t hugetlbfs -o pagesize=2M,size=256M none /mnt/huge
+
+  write_hugetlbfs_and_get_usage "hugetlb_cgroup_test1" "$size1" \
+    "$populate1" "$write1" "/mnt/huge/test1" "$method" "$private" \
+    "$expect_failure" "$reserve"
+
+  hugetlb_difference1=$hugetlb_difference
+  reserved_difference1=$reserved_difference
+  reservation_failed1=$reservation_failed
+  oom_killed1=$oom_killed
+
+  local cgroup1_hugetlb_usage=$cgroup_path/hugetlb_cgroup_test1/hugetlb.2MB.usage_in_bytes
+  local cgroup1_reservation_usage=$cgroup_path/hugetlb_cgroup_test1/hugetlb.2MB.reservation_usage_in_bytes
+  local cgroup2_hugetlb_usage=$cgroup_path/hugetlb_cgroup_test2/hugetlb.2MB.usage_in_bytes
+  local cgroup2_reservation_usage=$cgroup_path/hugetlb_cgroup_test2/hugetlb.2MB.reservation_usage_in_bytes
+
+  local usage_before_second_write=$(cat $cgroup1_hugetlb_usage)
+  local reservation_usage_before_second_write=$(cat $cgroup1_reservation_usage)
+
+  write_hugetlbfs_and_get_usage "hugetlb_cgroup_test2" "$size2" \
+    "$populate2" "$write2" "/mnt/huge/test2" "$method" "$private" \
+    "$expect_failure" "$reserve"
+
+  hugetlb_difference2=$hugetlb_difference
+  reserved_difference2=$reserved_difference
+  reservation_failed2=$reservation_failed
+  oom_killed2=$oom_killed
+
+  expect_equal "$usage_before_second_write" \
+    "$(cat $cgroup1_hugetlb_usage)" "Usage changed."
+  expect_equal "$reservation_usage_before_second_write" \
+    "$(cat $cgroup1_reservation_usage)" "Reservation usage changed."
+
+  cleanup_hugetlb_memory "hugetlb_cgroup_test1"
+
+  local final_hugetlb=$(cat $cgroup1_hugetlb_usage)
+  local final_reservation=$(cat $cgroup1_reservation_usage)
+
+  expect_equal "0" "$final_hugetlb" \
+    "hugetlb_cgroup_test1 final hugetlb is not zero"
+  expect_equal "0" "$final_reservation" \
+    "hugetlb_cgroup_test1 final reservation is not zero"
+
+  final_hugetlb=$(cat $cgroup2_hugetlb_usage)
+  final_reservation=$(cat $cgroup2_reservation_usage)
+
+  expect_equal "0" "$final_hugetlb" \
+    "hugetlb_cgroup_test2 final hugetlb is not zero"
+  expect_equal "0" "$final_reservation" \
+    "hugetlb_cgroup_test2 final reservation is not zero"
+}
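The driver loop below sweeps the full scenario matrix. As a reading aid, the
flag meanings (taken from write_to_hugetlbfs's option parsing further down)
are roughly:

    # populate: ""  = lazy faulting,       "-o" = MAP_POPULATE
    # method:   0   = hugetlbfs file,      1    = mmap(MAP_HUGETLB), 2 = SHM
    # private:  ""  = shared mapping,      "-r" = private mapping
    # reserve:  ""  = normal reservation,  "-n" = MAP_NORESERVE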
+cleanup
+
+for populate in "" "-o"; do
+for method in 0 1 2; do
+for private in "" "-r"; do
+for reserve in "" "-n"; do
+
+# Skip mmap(MAP_HUGETLB | MAP_SHARED). Doesn't seem to be supported.
+if [[ "$method" == 1 ]] && [[ "$private" == "" ]]; then
+  continue
+fi
+
+# Skip populated shmem tests. Doesn't seem to be supported.
+if [[ "$method" == 2 ]] && [[ "$populate" == "-o" ]]; then
+  continue
+fi
+
+if [[ "$method" == 2 ]] && [[ "$reserve" == "-n" ]]; then
+  continue
+fi
+
+cleanup
+echo
+echo
+echo
+echo Test normal case.
+echo private=$private, populate=$populate, method=$method, reserve=$reserve
+run_test $((10 * 1024 * 1024)) "$populate" "" $((20 * 1024 * 1024)) \
+  $((20 * 1024 * 1024)) 10 "$method" "$private" "0" "$reserve"
+
+echo Memory charged to hugetlb=$hugetlb_difference
+echo Memory charged to reservation=$reserved_difference
+
+if [[ "$populate" == "-o" ]]; then
+  expect_equal "$((10 * 1024 * 1024))" "$hugetlb_difference" \
+    "Reserved memory charged to hugetlb cgroup."
+else
+  expect_equal "0" "$hugetlb_difference" \
+    "Reserved memory charged to hugetlb cgroup."
+fi
+
+if [[ "$reserve" != "-n" ]] || [[ "$populate" == "-o" ]]; then
+  expect_equal "$((10 * 1024 * 1024))" "$reserved_difference" \
+    "Reserved memory not charged to reservation usage."
+else
+  expect_equal "0" "$reserved_difference" \
+    "Reserved memory not charged to reservation usage."
+fi
+
+echo 'PASS'
+
+cleanup
+echo
+echo
+echo
+echo Test normal case with write.
+echo private=$private, populate=$populate, method=$method, reserve=$reserve
+run_test $((10 * 1024 * 1024)) "$populate" '-w' $((20 * 1024 * 1024)) \
+  $((20 * 1024 * 1024)) 10 "$method" "$private" "0" "$reserve"
+
+echo Memory charged to hugetlb=$hugetlb_difference
+echo Memory charged to reservation=$reserved_difference
+
+expect_equal "$((10 * 1024 * 1024))" "$hugetlb_difference" \
+  "Reserved memory charged to hugetlb cgroup."
+expect_equal "$((10 * 1024 * 1024))" "$reserved_difference" \
+  "Reserved memory not charged to reservation usage."
+
+echo 'PASS'
+
+cleanup
+echo
+echo
+echo
+echo Test more than reservation case.
+echo private=$private, populate=$populate, method=$method, reserve=$reserve
+
+if [ "$reserve" != "-n" ]; then
+  run_test "$((10 * 1024 * 1024))" "$populate" '' "$((20 * 1024 * 1024))" \
+    "$((5 * 1024 * 1024))" "10" "$method" "$private" "1" "$reserve"
+
+  expect_equal "1" "$reservation_failed" "Reservation succeeded."
+fi
+
+echo 'PASS'
+
+cleanup
+
+echo
+echo
+echo
+echo Test more than cgroup limit case.
+echo private=$private, populate=$populate, method=$method, reserve=$reserve
+
+# Not sure if shm memory can be cleaned up when the process gets sigbus'd.
+if [[ "$method" != 2 ]]; then
+  run_test $((10 * 1024 * 1024)) "$populate" "-w" $((5 * 1024 * 1024)) \
+    $((20 * 1024 * 1024)) 10 "$method" "$private" "1" "$reserve"
+
+  expect_equal "1" "$oom_killed" "Not oom killed."
+fi
+echo 'PASS'
+
+cleanup
+
+echo
+echo
+echo
+echo Test normal case, multiple cgroups.
+echo private=$private, populate=$populate, method=$method, reserve=$reserve
+run_multiple_cgroup_test "$((6 * 1024 * 1024))" "$populate" "" \
+  "$((20 * 1024 * 1024))" "$((20 * 1024 * 1024))" "$((10 * 1024 * 1024))" \
+  "$populate" "" "$((20 * 1024 * 1024))" "$((20 * 1024 * 1024))" "10" \
+  "$method" "$private" "0" "$reserve"
+
+echo Memory charged to hugetlb1=$hugetlb_difference1
+echo Memory charged to reservation1=$reserved_difference1
+echo Memory charged to hugetlb2=$hugetlb_difference2
+echo Memory charged to reservation2=$reserved_difference2
+
+if [[ "$reserve" != "-n" ]] || [[ "$populate" == "-o" ]]; then
+  expect_equal "$((6 * 1024 * 1024))" "$reserved_difference1" \
+    "Incorrect reservations charged to cgroup 1."
+  expect_equal "$((10 * 1024 * 1024))" "$reserved_difference2" \
+    "Incorrect reservation charged to cgroup 2."
+else
+  expect_equal "0" "$reserved_difference1" \
+    "Incorrect reservations charged to cgroup 1."
+  expect_equal "0" "$reserved_difference2" \
+    "Incorrect reservation charged to cgroup 2."
+fi
+
+if [[ "$populate" == "-o" ]]; then
+  expect_equal "$((6 * 1024 * 1024))" "$hugetlb_difference1" \
+    "Incorrect hugetlb charged to cgroup 1."
+  expect_equal "$((10 * 1024 * 1024))" "$hugetlb_difference2" \
+    "Incorrect hugetlb charged to cgroup 2."
+else
+  expect_equal "0" "$hugetlb_difference1" \
+    "Incorrect hugetlb charged to cgroup 1."
+  expect_equal "0" "$hugetlb_difference2" \
+    "Incorrect hugetlb charged to cgroup 2."
+fi
+echo 'PASS'
+
+cleanup
+echo
+echo
+echo
+echo Test normal case with write, multiple cgroups.
+echo private=$private, populate=$populate, method=$method, reserve=$reserve
+run_multiple_cgroup_test "$((6 * 1024 * 1024))" "$populate" "-w" \
+  "$((20 * 1024 * 1024))" "$((20 * 1024 * 1024))" "$((10 * 1024 * 1024))" \
+  "$populate" "-w" "$((20 * 1024 * 1024))" "$((20 * 1024 * 1024))" "10" \
+  "$method" "$private" "0" "$reserve"
+
+echo Memory charged to hugetlb1=$hugetlb_difference1
+echo Memory charged to reservation1=$reserved_difference1
+echo Memory charged to hugetlb2=$hugetlb_difference2
+echo Memory charged to reservation2=$reserved_difference2
+
+expect_equal "$((6 * 1024 * 1024))" "$hugetlb_difference1" \
+  "Incorrect hugetlb charged to cgroup 1."
+expect_equal "$((6 * 1024 * 1024))" "$reserved_difference1" \
+  "Incorrect reservation charged to cgroup 1."
+expect_equal "$((10 * 1024 * 1024))" "$hugetlb_difference2" \
+  "Incorrect hugetlb charged to cgroup 2."
+expect_equal "$((10 * 1024 * 1024))" "$reserved_difference2" \
+  "Incorrect reservation charged to cgroup 2."
+echo 'PASS'
+
+cleanup
+
+done # reserve
+done # private
+done # populate
+done # method
+
+umount $cgroup_path
+rmdir $cgroup_path
diff --git a/tools/testing/selftests/vm/write_hugetlb_memory.sh b/tools/testing/selftests/vm/write_hugetlb_memory.sh
new file mode 100644
index 0000000000000..d3d0d108924d4
--- /dev/null
+++ b/tools/testing/selftests/vm/write_hugetlb_memory.sh
@@ -0,0 +1,23 @@
+#!/bin/sh
+# SPDX-License-Identifier: GPL-2.0
+
+set -e
+
+size=$1
+populate=$2
+write=$3
+cgroup=$4
+path=$5
+method=$6
+private=$7
+want_sleep=$8
+reserve=$9
+
+echo "Putting task in cgroup '$cgroup'"
+echo $$ > /dev/cgroup/memory/"$cgroup"/cgroup.procs
+
+echo "Method is $method"
+
+set +e
+./write_to_hugetlbfs -p "$path" -s "$size" "$write" "$populate" -m "$method" \
+  "$private" "$want_sleep" "$reserve"
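The wrapper's arguments are purely positional, so callers must pass empty
placeholders for unused flags. A representative invocation, assuming the
cgroup and hugetlbfs mount already exist (values hypothetical):

    # 10MB, populate (-o), write (-w), cgroup "hugetlb_cgroup_test",
    # hugetlbfs file method (0), shared, no sleep, default reservation:
    bash write_hugetlb_memory.sh $((10 * 1024 * 1024)) "-o" "-w" \
      "hugetlb_cgroup_test" "/mnt/huge/test" 0 "" "" ""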
diff --git a/tools/testing/selftests/vm/write_to_hugetlbfs.c b/tools/testing/selftests/vm/write_to_hugetlbfs.c
new file mode 100644
index 0000000000000..85811c3384a10
--- /dev/null
+++ b/tools/testing/selftests/vm/write_to_hugetlbfs.c
@@ -0,0 +1,261 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * This program reserves and uses hugetlb memory, supporting a bunch of
+ * scenarios needed by the charged_reserved_hugetlb.sh test.
+ */
+
+#include <err.h>
+#include <errno.h>
+#include <signal.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <unistd.h>
+#include <fcntl.h>
+#include <sys/types.h>
+#include <sys/shm.h>
+#include <sys/stat.h>
+#include <sys/mman.h>
+
+/* Global definitions. */
+enum method {
+        HUGETLBFS,
+        MMAP_MAP_HUGETLB,
+        SHM,
+        MAX_METHOD
+};
+
+/* Global variables. */
+static const char *self;
+static char *shmaddr;
+static int shmid;
+
+/*
+ * Show usage and exit.
+ */
+static void exit_usage(void)
+{
+        printf("Usage: %s -p <path to file> -s <size to map> "
+               "[-m <0=hugetlbfs | 1=mmap(MAP_HUGETLB)>] [-l] [-r] "
+               "[-o] [-w] [-n]\n",
+               self);
+        exit(EXIT_FAILURE);
+}
+
+void sig_handler(int signo)
+{
+        printf("Received %d.\n", signo);
+        if (signo == SIGINT) {
+                printf("Deleting the memory\n");
+                if (shmdt((const void *)shmaddr) != 0) {
+                        perror("Detach failure");
+                        shmctl(shmid, IPC_RMID, NULL);
+                        exit(4);
+                }
+
+                shmctl(shmid, IPC_RMID, NULL);
+                printf("Done deleting the memory\n");
+        }
+        exit(2);
+}
+
+int main(int argc, char **argv)
+{
+        int fd = 0;
+        int key = 0;
+        int *ptr = NULL;
+        int c = 0;
+        int size = 0;
+        char path[256] = "";
+        enum method method = MAX_METHOD;
+        int want_sleep = 0, private = 0;
+        int populate = 0;
+        int write = 0;
+        int reserve = 1;
+
+        unsigned long i;
+
+        if (signal(SIGINT, sig_handler) == SIG_ERR)
+                err(1, "\ncan't catch SIGINT\n");
+
+        /* Parse command-line arguments. */
+        setvbuf(stdout, NULL, _IONBF, 0);
+        self = argv[0];
+
+        while ((c = getopt(argc, argv, "s:p:m:owlrn")) != -1) {
+                switch (c) {
+                case 's':
+                        size = atoi(optarg);
+                        break;
+                case 'p':
+                        strncpy(path, optarg, sizeof(path));
+                        break;
+                case 'm':
+                        if (atoi(optarg) >= MAX_METHOD) {
+                                errno = EINVAL;
+                                perror("Invalid -m.");
+                                exit_usage();
+                        }
+                        method = atoi(optarg);
+                        break;
+                case 'o':
+                        populate = 1;
+                        break;
+                case 'w':
+                        write = 1;
+                        break;
+                case 'l':
+                        want_sleep = 1;
+                        break;
+                case 'r':
+                        private = 1;
+                        break;
+                case 'n':
+                        reserve = 0;
+                        break;
+                default:
+                        errno = EINVAL;
+                        perror("Invalid arg");
+                        exit_usage();
+                }
+        }
+
+        if (strncmp(path, "", sizeof(path)) != 0) {
+                printf("Writing to this path: %s\n", path);
+        } else {
+                errno = EINVAL;
+                perror("path not found");
+                exit_usage();
+        }
+
+        if (size != 0) {
+                printf("Writing this size: %d\n", size);
+        } else {
+                errno = EINVAL;
+                perror("size not found");
+                exit_usage();
+        }
+
+        if (!populate)
+                printf("Not populating.\n");
+        else
+                printf("Populating.\n");
+
+        if (!write)
+                printf("Not writing to memory.\n");
+
+        if (method == MAX_METHOD) {
+                errno = EINVAL;
+                perror("-m Invalid");
+                exit_usage();
+        } else
+                printf("Using method=%d\n", method);
+
+        if (!private)
+                printf("Shared mapping.\n");
+        else
+                printf("Private mapping.\n");
+
+        if (!reserve)
+                printf("NO_RESERVE mapping.\n");
+        else
+                printf("RESERVE mapping.\n");
+
+        switch (method) {
+        case HUGETLBFS:
+                printf("Allocating using HUGETLBFS.\n");
+                fd = open(path, O_CREAT | O_RDWR, 0777);
+                if (fd == -1)
+                        err(1, "Failed to open file.");
+
+                ptr = mmap(NULL, size, PROT_READ | PROT_WRITE,
+                           (private ? MAP_PRIVATE : MAP_SHARED) |
+                           (populate ? MAP_POPULATE : 0) |
+                           (reserve ? 0 : MAP_NORESERVE),
+                           fd, 0);
+
+                if (ptr == MAP_FAILED) {
+                        close(fd);
+                        err(1, "Error mapping the file");
+                }
+                break;
+        case MMAP_MAP_HUGETLB:
+                printf("Allocating using MAP_HUGETLB.\n");
+                ptr = mmap(NULL, size, PROT_READ | PROT_WRITE,
+                           (private ? (MAP_PRIVATE | MAP_ANONYMOUS) :
+                            MAP_SHARED) |
+                           MAP_HUGETLB | (populate ? MAP_POPULATE : 0) |
+                           (reserve ? 0 : MAP_NORESERVE),
+                           -1, 0);
+
+                if (ptr == MAP_FAILED)
+                        err(1, "mmap");
+
+                printf("Returned address is %p\n", ptr);
+                break;
+        case SHM:
+                printf("Allocating using SHM.\n");
+                shmid = shmget(key, size,
+                               SHM_HUGETLB | IPC_CREAT | SHM_R | SHM_W);
+                if (shmid < 0) {
+                        shmid = shmget(++key, size,
+                                       SHM_HUGETLB | IPC_CREAT | SHM_R | SHM_W);
+                        if (shmid < 0)
+                                err(1, "shmget");
+                }
+                printf("shmid: 0x%x, shmget key:%d\n", shmid, key);
+
+                shmaddr = shmat(shmid, NULL, 0);
+                if (shmaddr == (char *)-1) {
+                        perror("Shared memory attach failure");
+                        shmctl(shmid, IPC_RMID, NULL);
+                        exit(2);
+                }
+                printf("shmaddr: %p\n", shmaddr);
+
+                break;
+        default:
+                errno = EINVAL;
+                err(1, "Invalid method.");
+        }
+
+        if (write) {
+                printf("Writing to memory.\n");
+                if (method != SHM) {
+                        memset(ptr, 1, size);
+                } else {
+                        printf("Starting the writes:\n");
+                        for (i = 0; i < size; i++) {
+                                shmaddr[i] = (char)(i);
+                                if (!(i % (1024 * 1024)))
+                                        printf(".");
+                        }
+                        printf("\n");
+
+                        printf("Starting the Check...");
+                        for (i = 0; i < size; i++)
+                                if (shmaddr[i] != (char)i) {
+                                        printf("\nIndex %lu mismatched\n", i);
+                                        exit(3);
+                                }
+                        printf("Done.\n");
+                }
+        }
+
+        if (want_sleep) {
+                /* Signal to caller that we're done. */
+                printf("DONE\n");
+
+                /* Hold memory until external kill signal is delivered. */
+                while (1)
+                        sleep(100);
+        }
+
+        if (method == HUGETLBFS)
+                close(fd);
+
+        return 0;
+}
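For reference, a hand-compiled sketch of a typical standalone run of the
program above (paths hypothetical; a 2MB hugetlbfs mount is assumed):

    gcc -o write_to_hugetlbfs write_to_hugetlbfs.c
    # Reserve and write 10MB through a hugetlbfs file (-m 0), then hold
    # the memory until SIGINT arrives (-l):
    ./write_to_hugetlbfs -p /mnt/huge/test -s $((10 * 1024 * 1024)) -w -m 0 -l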
From patchwork Sun Oct 13 00:30:24 2019
Date: Sat, 12 Oct 2019 17:30:24 -0700
In-Reply-To: <20191013003024.215429-1-almasrymina@google.com>
Message-Id: <20191013003024.215429-9-almasrymina@google.com>
References: <20191013003024.215429-1-almasrymina@google.com>
Subject: [PATCH v6 9/9] hugetlb_cgroup: Add hugetlb_cgroup reservation docs
From: Mina Almasry
To: mike.kravetz@oracle.com
Cc: shuah@kernel.org, almasrymina@google.com, rientjes@google.com,
    shakeelb@google.com, gthelen@google.com, akpm@linux-foundation.org,
    khalid.aziz@oracle.com, linux-kernel@vger.kernel.org, linux-mm@kvack.org,
    linux-kselftest@vger.kernel.org, cgroups@vger.kernel.org,
    aneesh.kumar@linux.vnet.ibm.com, mkoutny@suse.com,
    Hillf Danton

Add docs for how to use hugetlb_cgroup reservations and their behavior.

Signed-off-by: Mina Almasry
Acked-by: Hillf Danton

---

Changes in v6:
- Updated docs to reflect the new design based on a new counter that
  tracks both reservations and faults.

---
 .../admin-guide/cgroup-v1/hugetlb.rst | 64 +++++++++++++++----
 1 file changed, 53 insertions(+), 11 deletions(-)

--
2.23.0.700.g56cf767bdb-goog

diff --git a/Documentation/admin-guide/cgroup-v1/hugetlb.rst b/Documentation/admin-guide/cgroup-v1/hugetlb.rst
index a3902aa253a96..efb94e4db9d5a 100644
--- a/Documentation/admin-guide/cgroup-v1/hugetlb.rst
+++ b/Documentation/admin-guide/cgroup-v1/hugetlb.rst
@@ -2,13 +2,6 @@
 HugeTLB Controller
 ==================
 
-The HugeTLB controller allows to limit the HugeTLB usage per control group and
-enforces the controller limit during page fault. Since HugeTLB doesn't
-support page reclaim, enforcing the limit at page fault time implies that,
-the application will get SIGBUS signal if it tries to access HugeTLB pages
-beyond its limit. This requires the application to know beforehand how much
-HugeTLB pages it would require for its use.
-
 HugeTLB controller can be created by first mounting the cgroup filesystem.
 
 # mount -t cgroup -o hugetlb none /sys/fs/cgroup
@@ -28,10 +21,14 @@ process (bash) into it.
 
 Brief summary of control files::
 
-  hugetlb.<hugepagesize>.limit_in_bytes     # set/show limit of "hugepagesize" hugetlb usage
-  hugetlb.<hugepagesize>.max_usage_in_bytes # show max "hugepagesize" hugetlb usage recorded
-  hugetlb.<hugepagesize>.usage_in_bytes     # show current usage for "hugepagesize" hugetlb
-  hugetlb.<hugepagesize>.failcnt            # show the number of allocation failure due to HugeTLB limit
+  hugetlb.<hugepagesize>.reservation_limit_in_bytes     # set/show limit of "hugepagesize" hugetlb reservations
+  hugetlb.<hugepagesize>.reservation_max_usage_in_bytes # show max "hugepagesize" hugetlb reservations and no-reserve faults
+  hugetlb.<hugepagesize>.reservation_usage_in_bytes     # show current reservations and no-reserve faults for "hugepagesize" hugetlb
+  hugetlb.<hugepagesize>.reservation_failcnt            # show the number of allocation failure due to HugeTLB reservation limit
+  hugetlb.<hugepagesize>.limit_in_bytes                 # set/show limit of "hugepagesize" hugetlb faults
+  hugetlb.<hugepagesize>.max_usage_in_bytes             # show max "hugepagesize" hugetlb usage recorded
+  hugetlb.<hugepagesize>.usage_in_bytes                 # show current usage for "hugepagesize" hugetlb
+  hugetlb.<hugepagesize>.failcnt                        # show the number of allocation failure due to HugeTLB usage limit
 
 For a system supporting three hugepage sizes (64k, 32M and 1G), the control
 files include::
@@ -40,11 +37,56 @@ files include::
   hugetlb.1GB.max_usage_in_bytes
   hugetlb.1GB.usage_in_bytes
   hugetlb.1GB.failcnt
+  hugetlb.1GB.reservation_limit_in_bytes
+  hugetlb.1GB.reservation_max_usage_in_bytes
+  hugetlb.1GB.reservation_usage_in_bytes
+  hugetlb.1GB.reservation_failcnt
   hugetlb.64KB.limit_in_bytes
   hugetlb.64KB.max_usage_in_bytes
   hugetlb.64KB.usage_in_bytes
   hugetlb.64KB.failcnt
+  hugetlb.64KB.reservation_limit_in_bytes
+  hugetlb.64KB.reservation_max_usage_in_bytes
+  hugetlb.64KB.reservation_usage_in_bytes
+  hugetlb.64KB.reservation_failcnt
   hugetlb.32MB.limit_in_bytes
   hugetlb.32MB.max_usage_in_bytes
   hugetlb.32MB.usage_in_bytes
   hugetlb.32MB.failcnt
+  hugetlb.32MB.reservation_limit_in_bytes
+  hugetlb.32MB.reservation_max_usage_in_bytes
+  hugetlb.32MB.reservation_usage_in_bytes
+  hugetlb.32MB.reservation_failcnt
+
+
+1. Reservation limits
+
+The HugeTLB controller allows one to limit HugeTLB reservations per control
+group and enforces the limit at reservation time, as well as at fault time for
+hugetlb memory for which no reservation exists. Reservation limits are
+superior to page fault limits (see section 2), since reservation limits are
+enforced at reservation time (on mmap or shmget) and never cause the
+application to get a SIGBUS signal if the memory was reserved beforehand. For
+MAP_NORESERVE allocations, the reservation limit behaves the same as the fault
+limit, enforcing memory usage at fault time and causing the application to
+receive a SIGBUS if it crosses its limit.
+
+2. Page fault limits
+
+The HugeTLB controller allows one to limit HugeTLB usage (page faults) per
+control group and enforces the limit during page fault. Since HugeTLB doesn't
+support page reclaim, enforcing the limit at page fault time implies that the
+application will get a SIGBUS signal if it tries to access HugeTLB pages
+beyond its limit. This requires the application to know beforehand how many
+HugeTLB pages it would require for its use.
+
+
+3. Caveats with shared memory
+
+For shared hugetlb memory, both hugetlb reservations and page faults are
+charged to the first task that causes the memory to be reserved or faulted,
+and all subsequent uses of this reserved or faulted memory are done without
+charging.
+
+Shared hugetlb memory is only uncharged when it is unreserved or deallocated.
+This is usually when the hugetlbfs file is deleted, and not when the task that
+caused the reservation or fault has exited.
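To make the reservation-limit semantics concrete, a minimal cgroup-v1 session
might look like the following sketch (paths and the 2MB page size are
illustrative; the controller mount from the setup section is assumed):

    # Create a cgroup and cap its 2MB hugetlb reservations at 10 pages:
    mkdir /sys/fs/cgroup/test
    echo $((10 * 2 * 1024 * 1024)) > \
      /sys/fs/cgroup/test/hugetlb.2MB.reservation_limit_in_bytes
    echo $$ > /sys/fs/cgroup/test/cgroup.procs

    # A mapping of 11 pages now fails at mmap/reservation time instead of
    # SIGBUS'ing the task at fault time:
    ./write_to_hugetlbfs -p /mnt/huge/test -s $((11 * 2 * 1024 * 1024)) -m 0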