From patchwork Wed Nov 21 10:14:11 2018
X-Patchwork-Submitter: Mel Gorman
X-Patchwork-Id: 10692279
From: Mel Gorman
To: Linux-MM
Cc: Andrew Morton, Vlastimil Babka, David Rientjes, Andrea Arcangeli, Zi Yan, Michal Hocko, LKML, Mel Gorman
Subject: [PATCH 1/4] mm, page_alloc: Spread allocations across zones before introducing fragmentation
Date: Wed, 21 Nov 2018 10:14:11 +0000
Message-Id: <20181121101414.21301-2-mgorman@techsingularity.net>
In-Reply-To: <20181121101414.21301-1-mgorman@techsingularity.net>
References:
<20181121101414.21301-1-mgorman@techsingularity.net>

The page allocator zone lists are iterated based on the watermarks of each
zone, which does not take anti-fragmentation into account. On x86, node 0
may have multiple zones while other nodes have one zone. A consequence is
that tasks running on node 0 may fragment ZONE_NORMAL even though
ZONE_DMA32 has plenty of free memory. This patch special-cases the
allocator fast path such that it will try an allocation from a lower local
zone before fragmenting a higher zone. In this case, stealing of pageblocks
or orders larger than a pageblock is still allowed in the fast path as it
is uninteresting from a fragmentation point of view.

This was evaluated using a benchmark designed to fragment memory before
attempting THP allocations. It is implemented in mmtests as the following
configurations

configs/config-global-dhp__workload_thpfioscale
configs/config-global-dhp__workload_thpfioscale-defrag
configs/config-global-dhp__workload_thpfioscale-madvhugepage

e.g. from mmtests

./run-mmtests.sh --run-monitor --config configs/config-global-dhp__workload_thpfioscale test-run-1

The broad details of the workload are as follows;

1. Create an XFS filesystem (not specified in the configuration but done
   as part of the testing for this patch).
2. Start 4 fio threads that write a number of 64K files inefficiently.
   Inefficiently means that files are created on first access rather than
   in advance (fio parameter create_on_open=1) and fallocate is not used
   (fallocate=none). With multiple IO issuers this creates a mix of slab
   and page cache allocations over time. The total size of the files is
   150% of physical memory so that the slab and page cache pages get
   mixed.
3. Warm up a number of fio read-only threads accessing the same files
   created in step 2. This part runs for the same length of time it took
   to create the files. It faults back in old data and further
   interleaves slab and page cache allocations. As memory is now low due
   to step 2, fragmentation occurs as pageblocks get stolen.
4. While step 3 is still running, start a process that tries to allocate
   75% of memory as huge pages with a number of threads. The number of
   threads is based on (NR_CPUS_SOCKET - NR_FIO_THREADS) / 4 to avoid THP
   threads contending with fio, any other threads, or forcing cross-NUMA
   scheduling. Note that the test has not been used on a machine with
   fewer than 8 cores. The benchmark records whether huge pages were
   allocated and what the fault latency was in microseconds.
5. Measure the number of events potentially causing external
   fragmentation, the fault latency and the huge page allocation success
   rate.
6. Cleanup.

Note that due to the use of IO and page cache, this benchmark is not
suitable for running on large machines where the time to fragment memory
may be excessive. Also note that while this is one mix that generates
fragmentation, it is not the only such mix. Workloads that are more
slab-intensive, or where SLUB is used with high-order pages, may yield
different results.

When the page allocator fragments memory, it records the event using the
mm_page_alloc_extfrag tracepoint. If the fallback_order is smaller than a
pageblock order (order-9 on 64-bit x86) then it is considered an event
that may cause external fragmentation issues in the future. Hence, the
primary metric here is the number of external fragmentation events that
occur with order < 9. The secondary metrics are allocation latency and
huge page allocation success rate, but note that differences in latency
and success rate can themselves affect the number of external
fragmentation events, which is why they are secondary metrics.
1-socket Skylake machine
config-global-dhp__workload_thpfioscale XFS (no special madvise)
4 fio threads, 1 THP allocating thread
--------------------------------------

4.20-rc1 extfrag events < order 9:  1023463
4.20-rc1+patch:                      358574 (65% reduction)

thpfioscale Fault Latencies
                                   4.20.0-rc1             4.20.0-rc1
                                      vanilla           lowzone-v2r4
Min       fault-base-1      588.00 (   0.00%)      557.00 (   5.27%)
Min       fault-huge-1        0.00 (   0.00%)        0.00 (   0.00%)
Amean     fault-base-1      663.58 (   0.00%)      663.65 (  -0.01%)
Amean     fault-huge-1        0.00 (   0.00%)        0.00 (   0.00%)

                              4.20.0-rc1       4.20.0-rc1
                                 vanilla     lowzone-v2r4
Percentage huge-1        0.00 (   0.00%)  0.00 (   0.00%)

Fault latencies are reduced while allocation success rates remain at zero,
as this configuration does not make any heavy effort to allocate THP and
fio is heavily active at the time, filling memory. However, a 65%
reduction in serious fragmentation events reduces the chances of external
fragmentation being a problem in the future.

1-socket Skylake machine
global-dhp__workload_thpfioscale-madvhugepage-xfs (MADV_HUGEPAGE)
-----------------------------------------------------------------

4.20-rc1 extfrag events < order 9:  342549
4.20-rc1+patch:                     337890 (1% reduction)

thpfioscale Fault Latencies
                                   4.20.0-rc1             4.20.0-rc1
                                      vanilla           lowzone-v2r4
Amean     fault-base-1     1517.06 (   0.00%)     1531.37 (  -0.94%)
Amean     fault-huge-1     1129.50 (   0.00%)     1160.95 (  -2.78%)

thpfioscale Percentage Faults Huge
                              4.20.0-rc1       4.20.0-rc1
                                 vanilla     lowzone-v2r4
Percentage huge-1       78.01 (   0.00%)  78.97 (   1.23%)

Nothing dramatic. Fragmentation events were only reduced slightly, which
is very different from what was reported in V1. A big difference from V1
is the relative size of the Normal zone to the DMA32 zone. This machine
has double the memory, so the impact of using a small zone to avoid
fragmentation events is much lower.
2-socket Haswell machine
config-global-dhp__workload_thpfioscale XFS (no special madvise)
4 fio threads, 5 THP allocating threads
----------------------------------------------------------------

4.20-rc1 extfrag events < order 9:  209820
4.20-rc1+patch:                     185923 (11% reduction)

thpfioscale Fault Latencies
                                   4.20.0-rc1             4.20.0-rc1
                                      vanilla           lowzone-v2r4
Amean     fault-base-5     1324.93 (   0.00%)     1334.99 (  -0.76%)
Amean     fault-huge-5     4681.71 (   0.00%)     2428.43 (  48.13%)

                              4.20.0-rc1       4.20.0-rc1
                                 vanilla     lowzone-v2r4
Percentage huge-5        1.05 (   0.00%)  1.13 (   7.94%)

The reduction of external fragmentation events is expected. A careful
reader may spot that the reduction is lower than it was in V1. This is
due to the removal of __GFP_THISNODE in commit ac5b2c18911f ("mm: thp:
relax __GFP_THISNODE for MADV_HUGEPAGE mappings"), as THP allocations can
now spill over to remote nodes instead of fragmenting local memory. This
reduces the impact of using a lower zone to avoid fragmentation. It is
also worth noting, relative to V1, that the allocation success rate is
slightly higher.

2-socket Haswell machine
global-dhp__workload_thpfioscale-madvhugepage-xfs (MADV_HUGEPAGE)
-----------------------------------------------------------------

4.20-rc1 extfrag events < order 9:  167464
4.20-rc1+patch:                     130081 (22% reduction)

thpfioscale Fault Latencies
                                   4.20.0-rc1             4.20.0-rc1
                                      vanilla           lowzone-v2r4
Amean     fault-base-5     7721.82 (   0.00%)     6652.67 (  13.85%)
Amean     fault-huge-5     3896.10 (   0.00%)     2486.89 *  36.17%*

thpfioscale Percentage Faults Huge
                              4.20.0-rc1       4.20.0-rc1
                                 vanilla     lowzone-v2r4
Percentage huge-5       95.02 (   0.00%)  94.49 (  -0.56%)

In this case, there was a reduction both in the events causing external
fragmentation and in the huge page allocation latency, with little change
in the allocation success rates, which were already high. A careful
reader will note that V1 had very different outcomes both in terms of the
number of fragmentation events and the allocation success rates.
In this case, it is due to the baseline including the THP __GFP_THISNODE
removal patch.

Overall, the patch significantly reduces the number of events causing
external fragmentation, so the success of THP allocation over long
periods of time would be improved for this adverse workload. While there
are large differences compared to how V1 behaved, this is almost entirely
accounted for by ac5b2c18911f ("mm: thp: relax __GFP_THISNODE for
MADV_HUGEPAGE mappings").

Signed-off-by: Mel Gorman
---
 mm/internal.h   |  13 +++++---
 mm/page_alloc.c | 100 +++++++++++++++++++++++++++++++++++++++++++++++++-------
 2 files changed, 98 insertions(+), 15 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index 291eb2b6d1d8..544355156c92 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -480,10 +480,15 @@ unsigned long reclaim_clean_pages_from_list(struct zone *zone,
 #define ALLOC_OOM		ALLOC_NO_WATERMARKS
 #endif

-#define ALLOC_HARDER		0x10 /* try to alloc harder */
-#define ALLOC_HIGH		0x20 /* __GFP_HIGH set */
-#define ALLOC_CPUSET		0x40 /* check for correct cpuset */
-#define ALLOC_CMA		0x80 /* allow allocations from CMA areas */
+#define ALLOC_HARDER		0x10 /* try to alloc harder */
+#define ALLOC_HIGH		0x20 /* __GFP_HIGH set */
+#define ALLOC_CPUSET		0x40 /* check for correct cpuset */
+#define ALLOC_CMA		0x80 /* allow allocations from CMA areas */
+#ifdef CONFIG_ZONE_DMA32
+#define ALLOC_NOFRAGMENT	0x100 /* avoid mixing pageblock types */
+#else
+#define ALLOC_NOFRAGMENT	0x0
+#endif

 enum ttu_flags;
 struct tlbflush_unmap_batch;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 6847177dc4a1..828fcccbc5c5 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2375,20 +2375,30 @@ static bool unreserve_highatomic_pageblock(const struct alloc_context *ac,
  * condition simpler.
  */
 static __always_inline bool
-__rmqueue_fallback(struct zone *zone, int order, int start_migratetype)
+__rmqueue_fallback(struct zone *zone, int order, int start_migratetype,
+						unsigned int alloc_flags)
 {
 	struct free_area *area;
 	int current_order;
+	int min_order = order;
 	struct page *page;
 	int fallback_mt;
 	bool can_steal;

+	/*
+	 * Do not steal pages from freelists belonging to other pageblocks
+	 * i.e. orders < pageblock_order. In the event there is no local
+	 * zone free, the allocation will retry later.
+	 */
+	if (alloc_flags & ALLOC_NOFRAGMENT)
+		min_order = pageblock_order;
+
 	/*
 	 * Find the largest available free page in the other list. This roughly
 	 * approximates finding the pageblock with the most free pages, which
 	 * would be too costly to do exactly.
 	 */
-	for (current_order = MAX_ORDER - 1; current_order >= order;
+	for (current_order = MAX_ORDER - 1; current_order >= min_order;
 				--current_order) {
 		area = &(zone->free_area[current_order]);
 		fallback_mt = find_suitable_fallback(area, current_order,
@@ -2447,7 +2457,8 @@ __rmqueue_fallback(struct zone *zone, int order, int start_migratetype)
  * Call me with the zone->lock already held.
  */
 static __always_inline struct page *
-__rmqueue(struct zone *zone, unsigned int order, int migratetype)
+__rmqueue(struct zone *zone, unsigned int order, int migratetype,
+						unsigned int alloc_flags)
 {
 	struct page *page;

@@ -2457,7 +2468,8 @@ __rmqueue(struct zone *zone, unsigned int order, int migratetype)
 		if (migratetype == MIGRATE_MOVABLE)
 			page = __rmqueue_cma_fallback(zone, order);

-		if (!page && __rmqueue_fallback(zone, order, migratetype))
+		if (!page && __rmqueue_fallback(zone, order, migratetype,
+								alloc_flags))
 			goto retry;
 	}

@@ -2472,13 +2484,14 @@ __rmqueue(struct zone *zone, unsigned int order, int migratetype)
  */
 static int rmqueue_bulk(struct zone *zone, unsigned int order,
 			unsigned long count, struct list_head *list,
-			int migratetype)
+			int migratetype, unsigned int alloc_flags)
 {
 	int i, alloced = 0;

 	spin_lock(&zone->lock);
 	for (i = 0; i < count; ++i) {
-		struct page *page = __rmqueue(zone, order, migratetype);
+		struct page *page = __rmqueue(zone, order, migratetype,
+								alloc_flags);
 		if (unlikely(page == NULL))
 			break;

@@ -2934,6 +2947,7 @@ static inline void zone_statistics(struct zone *preferred_zone, struct zone *z)

 /* Remove page from the per-cpu list, caller must protect the list */
 static struct page *__rmqueue_pcplist(struct zone *zone, int migratetype,
+			unsigned int alloc_flags,
 			struct per_cpu_pages *pcp,
 			struct list_head *list)
 {
@@ -2943,7 +2957,7 @@ static struct page *__rmqueue_pcplist(struct zone *zone, int migratetype,
 		if (list_empty(list)) {
 			pcp->count += rmqueue_bulk(zone, 0,
 					pcp->batch, list,
-					migratetype);
+					migratetype, alloc_flags);
 			if (unlikely(list_empty(list)))
 				return NULL;
 		}
@@ -2959,7 +2973,8 @@ static struct page *__rmqueue_pcplist(struct zone *zone, int migratetype,
 /* Lock and remove page from the per-cpu list */
 static struct page *rmqueue_pcplist(struct zone *preferred_zone,
 			struct zone *zone, unsigned int order,
-			gfp_t gfp_flags, int migratetype)
+			gfp_t gfp_flags, int migratetype,
+			unsigned int alloc_flags)
 {
 	struct per_cpu_pages *pcp;
 	struct list_head *list;
@@ -2969,7 +2984,7 @@ static struct page *rmqueue_pcplist(struct zone *preferred_zone,
 	local_irq_save(flags);
 	pcp = &this_cpu_ptr(zone->pageset)->pcp;
 	list = &pcp->lists[migratetype];
-	page = __rmqueue_pcplist(zone, migratetype, pcp, list);
+	page = __rmqueue_pcplist(zone, migratetype, alloc_flags, pcp, list);
 	if (page) {
 		__count_zid_vm_events(PGALLOC, page_zonenum(page), 1 << order);
 		zone_statistics(preferred_zone, zone);
@@ -2992,7 +3007,7 @@ struct page *rmqueue(struct zone *preferred_zone,

 	if (likely(order == 0)) {
 		page = rmqueue_pcplist(preferred_zone, zone, order,
-				gfp_flags, migratetype);
+				gfp_flags, migratetype, alloc_flags);
 		goto out;
 	}

@@ -3011,7 +3026,7 @@ struct page *rmqueue(struct zone *preferred_zone,
 			trace_mm_page_alloc_zone_locked(page, order, migratetype);
 		}
 		if (!page)
-			page = __rmqueue(zone, order, migratetype);
+			page = __rmqueue(zone, order, migratetype, alloc_flags);
 	} while (page && check_new_pages(page, order));
 	spin_unlock(&zone->lock);
 	if (!page)
@@ -3253,6 +3268,36 @@ static bool zone_allows_reclaim(struct zone *local_zone, struct zone *zone)
 }
 #endif	/* CONFIG_NUMA */

+#ifdef CONFIG_ZONE_DMA32
+/*
+ * The restriction on ZONE_DMA32 as being a suitable zone to use to avoid
+ * fragmentation is subtle. If the preferred zone was HIGHMEM then
+ * premature use of a lower zone may cause lowmem pressure problems that
+ * are worse than fragmentation. If the next zone is ZONE_DMA then it is
+ * probably too small. It only makes sense to spread allocations to avoid
+ * fragmentation between the Normal and DMA32 zones.
+ */
+static inline unsigned int alloc_flags_nofragment(struct zone *zone)
+{
+	if (zone_idx(zone) != ZONE_NORMAL)
+		return 0;
+
+	/*
+	 * If ZONE_DMA32 exists, assume it is the one after ZONE_NORMAL and
+	 * the pointer is within zone->zone_pgdat->node_zones[].
+	 */
+	if (!populated_zone(--zone))
+		return 0;
+
+	return ALLOC_NOFRAGMENT;
+}
+#else
+static inline unsigned int alloc_flags_nofragment(struct zone *zone)
+{
+	return 0;
+}
+#endif
+
 /*
  * get_page_from_freelist goes through the zonelist trying to allocate
  * a page.
@@ -3264,11 +3309,14 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
 	struct zoneref *z = ac->preferred_zoneref;
 	struct zone *zone;
 	struct pglist_data *last_pgdat_dirty_limit = NULL;
+	bool no_fallback;

+retry:
 	/*
 	 * Scan zonelist, looking for a zone with enough free.
 	 * See also __cpuset_node_allowed() comment in kernel/cpuset.c.
 	 */
+	no_fallback = alloc_flags & ALLOC_NOFRAGMENT;
 	for_next_zone_zonelist_nodemask(zone, z, ac->zonelist, ac->high_zoneidx,
 								ac->nodemask) {
 		struct page *page;
@@ -3307,6 +3355,21 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
 			}
 		}

+		if (no_fallback) {
+			int local_nid;
+
+			/*
+			 * If moving to a remote node, retry but allow
+			 * fragmenting fallbacks. Locality is more important
+			 * than fragmentation avoidance.
+			 */
+			local_nid = zone_to_nid(ac->preferred_zoneref->zone);
+			if (zone_to_nid(zone) != local_nid) {
+				alloc_flags &= ~ALLOC_NOFRAGMENT;
+				goto retry;
+			}
+		}
+
 		mark = zone->watermark[alloc_flags & ALLOC_WMARK_MASK];
 		if (!zone_watermark_fast(zone, order, mark,
 				       ac_classzone_idx(ac), alloc_flags)) {
@@ -3374,6 +3437,15 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
 		}
 	}

+	/*
+	 * It's possible on a UMA machine to get through all zones that are
+	 * fragmented. If avoiding fragmentation, reset and try again.
+	 */
+	if (no_fallback) {
+		alloc_flags &= ~ALLOC_NOFRAGMENT;
+		goto retry;
+	}
+
 	return NULL;
 }

@@ -4369,6 +4441,12 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, int preferred_nid,

 	finalise_ac(gfp_mask, &ac);

+	/*
+	 * Forbid the first pass from falling back to types that fragment
+	 * memory until all local zones are considered.
+	 */
+	alloc_flags |= alloc_flags_nofragment(ac.preferred_zoneref->zone);
+
 	/* First allocation attempt */
 	page = get_page_from_freelist(alloc_mask, order, alloc_flags, &ac);
 	if (likely(page))

From patchwork Wed Nov 21 10:14:12 2018
X-Patchwork-Submitter: Mel Gorman
X-Patchwork-Id: 10692275
From: Mel Gorman
To: Linux-MM
Cc: Andrew Morton, Vlastimil Babka, David Rientjes, Andrea Arcangeli, Zi Yan, Michal Hocko, LKML, Mel Gorman
Subject: [PATCH 2/4] mm: Move zone watermark accesses behind an accessor
Date: Wed, 21 Nov 2018 10:14:12 +0000
Message-Id:
<20181121101414.21301-3-mgorman@techsingularity.net>
In-Reply-To: <20181121101414.21301-1-mgorman@techsingularity.net>
References: <20181121101414.21301-1-mgorman@techsingularity.net>

This is a preparation patch only, no functional change.

Signed-off-by: Mel Gorman
Acked-by: Vlastimil Babka
---
 include/linux/mmzone.h | 9 +++++----
 mm/compaction.c        | 2 +-
 mm/page_alloc.c        | 12 ++++++------
 3 files changed, 12 insertions(+), 11 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 847705a6d0ec..e43e8e79db99 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -269,9 +269,10 @@ enum zone_watermarks {
 	NR_WMARK
 };

-#define min_wmark_pages(z) (z->watermark[WMARK_MIN])
-#define low_wmark_pages(z) (z->watermark[WMARK_LOW])
-#define high_wmark_pages(z) (z->watermark[WMARK_HIGH])
+#define min_wmark_pages(z) (z->_watermark[WMARK_MIN])
+#define low_wmark_pages(z) (z->_watermark[WMARK_LOW])
+#define high_wmark_pages(z) (z->_watermark[WMARK_HIGH])
+#define wmark_pages(z, i) (z->_watermark[i])

 struct per_cpu_pages {
 	int count;		/* number of pages in the list */
@@ -362,7 +363,7 @@ struct zone {
 	/* Read-mostly fields */

 	/* zone watermarks, access with *_wmark_pages(zone) macros */
-	unsigned long watermark[NR_WMARK];
+	unsigned long _watermark[NR_WMARK];

 	unsigned long nr_reserved_highatomic;

diff --git a/mm/compaction.c b/mm/compaction.c
index 7c607479de4a..ef29490b0f46 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1431,7 +1431,7 @@ static enum compact_result __compaction_suitable(struct zone *zone, int order,
 	if (is_via_compact_memory(order))
 		return COMPACT_CONTINUE;

-	watermark = zone->watermark[alloc_flags & ALLOC_WMARK_MASK];
+	watermark = wmark_pages(zone, alloc_flags & ALLOC_WMARK_MASK);
 	/*
 	 * If watermarks for high-order allocation are already met, there
 	 * should be no need for compaction at all.
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 828fcccbc5c5..9ea2d828d20c 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3370,7 +3370,7 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
 		}
 	}

-	mark = zone->watermark[alloc_flags & ALLOC_WMARK_MASK];
+	mark = wmark_pages(zone, alloc_flags & ALLOC_WMARK_MASK);
 	if (!zone_watermark_fast(zone, order, mark,
 			       ac_classzone_idx(ac), alloc_flags)) {
 		int ret;
@@ -4790,7 +4790,7 @@ long si_mem_available(void)
 		pages[lru] = global_node_page_state(NR_LRU_BASE + lru);

 	for_each_zone(zone)
-		wmark_low += zone->watermark[WMARK_LOW];
+		wmark_low += low_wmark_pages(zone);

 	/*
 	 * Estimate the amount of memory available for userspace allocations,
@@ -7416,13 +7416,13 @@ static void __setup_per_zone_wmarks(void)
 			min_pages = zone->managed_pages / 1024;
 			min_pages = clamp(min_pages, SWAP_CLUSTER_MAX, 128UL);
-			zone->watermark[WMARK_MIN] = min_pages;
+			zone->_watermark[WMARK_MIN] = min_pages;
 		} else {
 			/*
 			 * If it's a lowmem zone, reserve a number of pages
 			 * proportionate to the zone's size.
 			 */
-			zone->watermark[WMARK_MIN] = tmp;
+			zone->_watermark[WMARK_MIN] = tmp;
 		}

 		/*
@@ -7434,8 +7434,8 @@ static void __setup_per_zone_wmarks(void)
 			mult_frac(zone->managed_pages,
 					watermark_scale_factor, 10000));

-		zone->watermark[WMARK_LOW]  = min_wmark_pages(zone) + tmp;
-		zone->watermark[WMARK_HIGH] = min_wmark_pages(zone) + tmp * 2;
+		zone->_watermark[WMARK_LOW]  = min_wmark_pages(zone) + tmp;
+		zone->_watermark[WMARK_HIGH] = min_wmark_pages(zone) + tmp * 2;

 		spin_unlock_irqrestore(&zone->lock, flags);
 	}

From patchwork Wed Nov 21 10:14:13 2018
X-Patchwork-Submitter: Mel Gorman
X-Patchwork-Id: 10692281
From: Mel Gorman <mgorman@techsingularity.net>
To: Linux-MM <linux-mm@kvack.org>
Cc: Andrew Morton, Vlastimil Babka, David Rientjes, Andrea Arcangeli,
    Zi Yan, Michal Hocko, LKML
Subject: [PATCH 3/4] mm: Reclaim small amounts of memory when an external
 fragmentation event occurs
Date: Wed, 21 Nov 2018 10:14:13 +0000
Message-Id: <20181121101414.21301-4-mgorman@techsingularity.net>
X-Mailer: git-send-email 2.16.4
In-Reply-To: <20181121101414.21301-1-mgorman@techsingularity.net>
References: <20181121101414.21301-1-mgorman@techsingularity.net>

An external fragmentation event was previously described as

    When the page allocator fragments memory, it records the event using
    the mm_page_alloc_extfrag event. If the fallback_order is smaller
    than a pageblock order (order-9 on 64-bit x86) then it's considered
    an event that will cause external fragmentation issues in the future.
The kernel reduces the probability of such events by increasing the
watermark sizes by calling set_recommended_min_free_kbytes early in the
lifetime of the system. This works reasonably well in general but if
there are enough sparsely populated pageblocks then the problem can still
occur as enough memory is free overall and kswapd stays asleep.

This patch introduces a watermark_boost_factor sysctl that allows a zone
watermark to be temporarily boosted when an external fragmentation-causing
event occurs. The boosting will stall allocations that would decrease
free memory below the boosted low watermark and kswapd is woken
unconditionally to reclaim an amount of memory relative to the size of
the high watermark and the watermark_boost_factor until the boost is
cleared. When kswapd finishes, it wakes kcompactd at the pageblock order
to clean some of the pageblocks that may have been affected by the
fragmentation event. kswapd avoids any writeback or swap from reclaim
context during this operation to avoid excessive system disruption in
the name of fragmentation avoidance. Care is taken so that kswapd will
do normal reclaim work if the system is really low on memory.

This was evaluated using the same workloads as "mm, page_alloc: Spread
allocations across zones before introducing fragmentation".

1-socket Skylake machine
config-global-dhp__workload_thpfioscale XFS (no special madvise)
4 fio threads, 1 THP allocating thread
--------------------------------------

4.20-rc1 extfrag events < order 9:  1023463
4.20-rc1+patch:                      358574 (65% reduction)
4.20-rc1+patch1-3:                    19274 (98% reduction)

                               4.20.0-rc1            4.20.0-rc1
                             lowzone-v2r4            boost-v2r4
Amean  fault-base-1    663.65 (   0.00%)    659.85 *   0.57%*
Amean  fault-huge-1      0.00 (   0.00%)    172.19 * -99.00%*

                               4.20.0-rc1            4.20.0-rc1
                             lowzone-v2r4            boost-v2r4
Percentage huge-1        0.00 (   0.00%)      1.68 (  100.00%)

Note that external fragmentation-causing events are massively reduced by
this patch whether in comparison to the previous kernel or the vanilla
kernel.
The fault latency for huge pages appears to be increased but that is only
because THP allocations were successful with the patch applied.

1-socket Skylake machine
global-dhp__workload_thpfioscale-madvhugepage-xfs (MADV_HUGEPAGE)
-----------------------------------------------------------------

4.20-rc1 extfrag events < order 9:   342549
4.20-rc1+patch:                      337890 ( 1% reduction)
4.20-rc1+patch1-3:                    12801 (96% reduction)

thpfioscale Fault Latencies
                               4.20.0-rc1            4.20.0-rc1
                             lowzone-v2r4            boost-v2r4
Amean  fault-base-1   1531.37 (   0.00%)   1578.91 (  -3.10%)
Amean  fault-huge-1   1160.95 (   0.00%)   1090.23 *   6.09%*

                               4.20.0-rc1            4.20.0-rc1
                             lowzone-v2r4            boost-v2r4
Percentage huge-1       78.97 (   0.00%)     82.59 (   4.58%)

As before, massive reduction in external fragmentation events, some
jitter on latencies and an increase in THP allocation success rates.

2-socket Haswell machine
config-global-dhp__workload_thpfioscale XFS (no special madvise)
4 fio threads, 5 THP allocating threads
----------------------------------------------------------------

4.20-rc1 extfrag events < order 9:   209820
4.20-rc1+patch:                      185923 (11% reduction)
4.20-rc1+patch1-3:                    11240 (95% reduction)

                               4.20.0-rc1            4.20.0-rc1
                             lowzone-v2r4            boost-v2r4
Amean  fault-base-5   1334.99 (   0.00%)   1395.28 (  -4.52%)
Amean  fault-huge-5   2428.43 (   0.00%)    539.69 (  77.78%)

                               4.20.0-rc1            4.20.0-rc1
                             lowzone-v2r4            boost-v2r4
Percentage huge-5        1.13 (   0.00%)      0.53 ( -52.94%)

This is an illustration of why latencies are not the primary metric.
There is a 95% reduction in fragmentation-causing events but the huge
page latencies look fantastic until you account for the fact it might be
because the success rate was lower. Given how low it was initially, this
is partially down to luck.
2-socket Haswell machine
global-dhp__workload_thpfioscale-madvhugepage-xfs (MADV_HUGEPAGE)
-----------------------------------------------------------------

4.20-rc1 extfrag events < order 9:   167464
4.20-rc1+patch:                      130081 (22% reduction)
4.20-rc1+patch1-3:                    12057 (92% reduction)

thpfioscale Fault Latencies
                               4.20.0-rc1            4.20.0-rc1
                             lowzone-v2r4            boost-v2r4
Amean  fault-base-5   6652.67 (   0.00%)   8691.83 * -30.65%*
Amean  fault-huge-5   2486.89 (   0.00%)   2899.83 * -16.60%*

                               4.20.0-rc1            4.20.0-rc1
                             lowzone-v2r4            boost-v2r4
Percentage huge-5       94.49 (   0.00%)     95.55 (   1.13%)

There is a large reduction in fragmentation events with a very slightly
higher THP allocation success rate. The latencies look bad but a closer
look at the data seems to indicate the problem is at the tails. Given
the high THP allocation success rate, the system is under quite some
pressure.

Signed-off-by: Mel Gorman
---
 Documentation/sysctl/vm.txt |  19 +++++++
 include/linux/mm.h          |   1 +
 include/linux/mmzone.h      |  11 ++--
 kernel/sysctl.c             |   8 +++
 mm/page_alloc.c             |  53 +++++++++++++++++--
 mm/vmscan.c                 | 123 ++++++++++++++++++++++++++++++++++++++++----
 6 files changed, 199 insertions(+), 16 deletions(-)

diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
index 7d73882e2c27..4dff1b75229b 100644
--- a/Documentation/sysctl/vm.txt
+++ b/Documentation/sysctl/vm.txt
@@ -63,6 +63,7 @@ files can be found in mm/swap.c.
 - swappiness
 - user_reserve_kbytes
 - vfs_cache_pressure
+- watermark_boost_factor
 - watermark_scale_factor
 - zone_reclaim_mode

@@ -856,6 +857,24 @@ ten times more freeable objects than there are.

 =============================================================

+watermark_boost_factor:
+
+This factor controls the level of reclaim when memory is being fragmented.
+It defines the percentage of the high watermark of a zone that will be
+reclaimed if pages of different mobility are being mixed within pageblocks.
+The intent is that compaction has less work to do in the future and to
+increase the success rate of future high-order allocations such as SLUB
+allocations, THP and hugetlbfs pages.
+
+To make it sensible with respect to the watermark_scale_factor
+parameter, the unit is in fractions of 10,000. The default value of
+15000 means that 150% of the high watermark will be reclaimed in the
+event of a pageblock being mixed due to fragmentation. If this value is
+smaller than a pageblock then a pageblock's worth of pages will be
+reclaimed (e.g. 2MB on 64-bit x86). A boost factor of 0 will disable
+the feature.
+
+=============================================================
+
 watermark_scale_factor:

 This factor controls the aggressiveness of kswapd. It defines the

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 5411de93a363..2c4c69508413 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2202,6 +2202,7 @@ extern void zone_pcp_reset(struct zone *zone);

 /* page_alloc.c */
 extern int min_free_kbytes;
+extern int watermark_boost_factor;
 extern int watermark_scale_factor;

 /* nommu.c */
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index e43e8e79db99..d352c1dab486 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -269,10 +269,10 @@ enum zone_watermarks {
 	NR_WMARK
 };

-#define min_wmark_pages(z) (z->_watermark[WMARK_MIN])
-#define low_wmark_pages(z) (z->_watermark[WMARK_LOW])
-#define high_wmark_pages(z) (z->_watermark[WMARK_HIGH])
-#define wmark_pages(z, i) (z->_watermark[i])
+#define min_wmark_pages(z) (z->_watermark[WMARK_MIN] + z->watermark_boost)
+#define low_wmark_pages(z) (z->_watermark[WMARK_LOW] + z->watermark_boost)
+#define high_wmark_pages(z) (z->_watermark[WMARK_HIGH] + z->watermark_boost)
+#define wmark_pages(z, i) (z->_watermark[i] + z->watermark_boost)

 struct per_cpu_pages {
 	int count;		/* number of pages in the list */
@@ -364,6 +364,7 @@ struct zone {

 	/* zone watermarks, access with *_wmark_pages(zone) macros */
 	unsigned
long _watermark[NR_WMARK];
+	unsigned long watermark_boost;

 	unsigned long nr_reserved_highatomic;
@@ -885,6 +886,8 @@ static inline int is_highmem(struct zone *zone)
 struct ctl_table;
 int min_free_kbytes_sysctl_handler(struct ctl_table *, int,
 					void __user *, size_t *, loff_t *);
+int watermark_boost_factor_sysctl_handler(struct ctl_table *, int,
+					void __user *, size_t *, loff_t *);
 int watermark_scale_factor_sysctl_handler(struct ctl_table *, int,
 					void __user *, size_t *, loff_t *);
 extern int sysctl_lowmem_reserve_ratio[MAX_NR_ZONES];
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 5fc724e4e454..1825f712e73b 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1462,6 +1462,14 @@ static struct ctl_table vm_table[] = {
 		.proc_handler	= min_free_kbytes_sysctl_handler,
 		.extra1		= &zero,
 	},
+	{
+		.procname	= "watermark_boost_factor",
+		.data		= &watermark_boost_factor,
+		.maxlen		= sizeof(watermark_boost_factor),
+		.mode		= 0644,
+		.proc_handler	= watermark_boost_factor_sysctl_handler,
+		.extra1		= &zero,
+	},
 	{
 		.procname	= "watermark_scale_factor",
 		.data		= &watermark_scale_factor,
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 9ea2d828d20c..04b29228e9f0 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -263,6 +263,7 @@ compound_page_dtor * const compound_page_dtors[] = {

 int min_free_kbytes = 1024;
 int user_min_free_kbytes = -1;
+int watermark_boost_factor __read_mostly = 15000;
 int watermark_scale_factor = 10;

 static unsigned long nr_kernel_pages __meminitdata;
@@ -2129,6 +2130,21 @@ static bool can_steal_fallback(unsigned int order, int start_mt)
 	return false;
 }

+static inline void boost_watermark(struct zone *zone)
+{
+	unsigned long max_boost;
+
+	if (!watermark_boost_factor)
+		return;
+
+	max_boost = mult_frac(wmark_pages(zone, WMARK_HIGH),
+			watermark_boost_factor, 10000);
+	max_boost = max(pageblock_nr_pages, max_boost);
+
+	zone->watermark_boost = min(zone->watermark_boost + pageblock_nr_pages,
+		max_boost);
+}
+
 /*
  * This function implements
actual steal behaviour. If order is large enough,
 * we can steal whole pageblock. If not, we first move freepages in this
@@ -2160,6 +2176,14 @@ static void steal_suitable_fallback(struct zone *zone, struct page *page,
 		goto single_page;
 	}

+	/*
+	 * Boost watermarks to increase reclaim pressure to reduce the
+	 * likelihood of future fallbacks. Wake kswapd now as the node
+	 * may be balanced overall and kswapd will not wake naturally.
+	 */
+	boost_watermark(zone);
+	wakeup_kswapd(zone, 0, 0, zone_idx(zone));
+
 	/* We are not allowed to try stealing from the whole block */
 	if (!whole_block)
 		goto single_page;
@@ -3277,11 +3301,19 @@ static bool zone_allows_reclaim(struct zone *local_zone, struct zone *zone)
 * probably too small. It only makes sense to spread allocations to avoid
 * fragmentation between the Normal and DMA32 zones.
 */
-static inline unsigned int alloc_flags_nofragment(struct zone *zone)
+static inline unsigned int
+alloc_flags_nofragment(struct zone *zone, gfp_t gfp_mask)
 {
 	if (zone_idx(zone) != ZONE_NORMAL)
 		return 0;

+	/*
+	 * A fragmenting fallback will try waking kswapd. ALLOC_NOFRAGMENT
+	 * may break that so such callers can introduce fragmentation.
+	 */
+	if (!(gfp_mask & __GFP_KSWAPD_RECLAIM))
+		return 0;
+
 	/*
 	 * If ZONE_DMA32 exists, assume it is the one after ZONE_NORMAL and
 	 * the pointer is within zone->zone_pgdat->node_zones[].
@@ -3292,7 +3324,8 @@ static inline unsigned int alloc_flags_nofragment(struct zone *zone)
 	return ALLOC_NOFRAGMENT;
 }
 #else
-static inline unsigned int alloc_flags_nofragment(struct zone *zone)
+static inline unsigned int
+alloc_flags_nofragment(struct zone *zone, gfp_t gfp_mask)
 {
 	return 0;
 }
@@ -4445,7 +4478,8 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, int preferred_nid,
 	 * Forbid the first pass from falling back to types that fragment
 	 * memory until all local zones are considered.
	 */
-	alloc_flags |= alloc_flags_nofragment(ac.preferred_zoneref->zone);
+	alloc_flags |= alloc_flags_nofragment(ac.preferred_zoneref->zone,
+					gfp_mask);

 	/* First allocation attempt */
 	page = get_page_from_freelist(alloc_mask, order, alloc_flags, &ac);
@@ -7436,6 +7470,7 @@ static void __setup_per_zone_wmarks(void)

 	zone->_watermark[WMARK_LOW] = min_wmark_pages(zone) + tmp;
 	zone->_watermark[WMARK_HIGH] = min_wmark_pages(zone) + tmp * 2;
+	zone->watermark_boost = 0;

 	spin_unlock_irqrestore(&zone->lock, flags);
 }
@@ -7536,6 +7571,18 @@ int min_free_kbytes_sysctl_handler(struct ctl_table *table, int write,
 	return 0;
 }

+int watermark_boost_factor_sysctl_handler(struct ctl_table *table, int write,
+	void __user *buffer, size_t *length, loff_t *ppos)
+{
+	int rc;
+
+	rc = proc_dointvec_minmax(table, write, buffer, length, ppos);
+	if (rc)
+		return rc;
+
+	return 0;
+}
+
 int watermark_scale_factor_sysctl_handler(struct ctl_table *table, int write,
 	void __user *buffer, size_t *length, loff_t *ppos)
 {
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 62ac0c488624..5ba76ec4f01e 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3378,6 +3378,30 @@ static void age_active_anon(struct pglist_data *pgdat,
 	} while (memcg);
 }

+static bool pgdat_watermark_boosted(pg_data_t *pgdat, int classzone_idx)
+{
+	int i;
+	struct zone *zone;
+
+	/*
+	 * Check for watermark boosts top-down as the higher zones
+	 * are more likely to be boosted. Both watermarks and boosts
+	 * should not be checked at the same time as reclaim would
+	 * start prematurely when there is no boosting and a lower
+	 * zone is balanced.
+	 */
+	for (i = classzone_idx; i >= 0; i--) {
+		zone = pgdat->node_zones + i;
+		if (!managed_zone(zone))
+			continue;
+
+		if (zone->watermark_boost)
+			return true;
+	}
+
+	return false;
+}
+
 /*
  * Returns true if there is an eligible zone balanced for the request order
  * and classzone_idx
@@ -3388,9 +3412,12 @@ static bool pgdat_balanced(pg_data_t *pgdat, int order, int classzone_idx)
 	unsigned long mark = -1;
 	struct zone *zone;

+	/*
+	 * Check watermarks bottom-up as lower zones are more likely to
+	 * meet watermarks.
+	 */
 	for (i = 0; i <= classzone_idx; i++) {
 		zone = pgdat->node_zones + i;
-
 		if (!managed_zone(zone))
 			continue;

@@ -3516,14 +3543,14 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
 	unsigned long nr_soft_reclaimed;
 	unsigned long nr_soft_scanned;
 	unsigned long pflags;
+	unsigned long nr_boost_reclaim;
+	unsigned long zone_boosts[MAX_NR_ZONES] = { 0, };
+	bool boosted;
 	struct zone *zone;
 	struct scan_control sc = {
 		.gfp_mask = GFP_KERNEL,
 		.order = order,
-		.priority = DEF_PRIORITY,
-		.may_writepage = !laptop_mode,
 		.may_unmap = 1,
-		.may_swap = 1,
 	};

 	psi_memstall_enter(&pflags);
@@ -3531,9 +3558,28 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)

 	count_vm_event(PAGEOUTRUN);

+	/*
+	 * Account for the reclaim boost. Note that the zone boost is left in
+	 * place so that parallel allocations that are near the watermark will
+	 * stall or direct reclaim until kswapd is finished.
+	 */
+	nr_boost_reclaim = 0;
+	for (i = 0; i <= classzone_idx; i++) {
+		zone = pgdat->node_zones + i;
+		if (!managed_zone(zone))
+			continue;
+
+		nr_boost_reclaim += zone->watermark_boost;
+		zone_boosts[i] = zone->watermark_boost;
+	}
+	boosted = nr_boost_reclaim;
+
+restart:
+	sc.priority = DEF_PRIORITY;
 	do {
 		unsigned long nr_reclaimed = sc.nr_reclaimed;
 		bool raise_priority = true;
+		bool balanced;
 		bool ret;

 		sc.reclaim_idx = classzone_idx;
@@ -3560,13 +3606,39 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
 		}

 		/*
-		 * Only reclaim if there are no eligible zones. Note that
-		 * sc.reclaim_idx is not used as buffer_heads_over_limit may
-		 * have adjusted it.
+		 * If the pgdat is imbalanced then ignore boosting and preserve
+		 * the watermarks for a later time and restart. Note that the
+		 * zone watermarks will be still reset at the end of balancing
+		 * on the grounds that the normal reclaim should be enough to
+		 * re-evaluate if boosting is required when kswapd next wakes.
+		 */
+		balanced = pgdat_balanced(pgdat, sc.order, classzone_idx);
+		if (!balanced && nr_boost_reclaim) {
+			nr_boost_reclaim = 0;
+			goto restart;
+		}
+
+		/*
+		 * If boosting is not active then only reclaim if there are no
+		 * eligible zones. Note that sc.reclaim_idx is not used as
+		 * buffer_heads_over_limit may have adjusted it.
 		 */
-		if (pgdat_balanced(pgdat, sc.order, classzone_idx))
+		if (!nr_boost_reclaim && balanced)
 			goto out;

+		/* Limit the priority of boosting to avoid reclaim writeback */
+		if (nr_boost_reclaim && sc.priority == DEF_PRIORITY - 2)
+			raise_priority = false;
+
+		/*
+		 * Do not writeback or swap pages for boosted reclaim. The
+		 * intent is to relieve pressure not issue sub-optimal IO
+		 * from reclaim context. If no pages are reclaimed, the
+		 * reclaim will be aborted.
+		 */
+		sc.may_writepage = !laptop_mode && !nr_boost_reclaim;
+		sc.may_swap = !nr_boost_reclaim;
+
 		/*
 		 * Do some background aging of the anon list, to give
 		 * pages a chance to be referenced before reclaiming. All
@@ -3618,6 +3690,16 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
 		 * progress in reclaiming pages
 		 */
 		nr_reclaimed = sc.nr_reclaimed - nr_reclaimed;
+		nr_boost_reclaim -= min(nr_boost_reclaim, nr_reclaimed);
+
+		/*
+		 * If reclaim made no progress for a boost, stop reclaim as
+		 * IO cannot be queued and it could be an infinite loop in
+		 * extreme circumstances.
+		 */
+		if (nr_boost_reclaim && !nr_reclaimed)
+			break;
+
 		if (raise_priority || !nr_reclaimed)
 			sc.priority--;
 	} while (sc.priority >= 1);
@@ -3626,6 +3708,28 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
 		pgdat->kswapd_failures++;

 out:
+	/* If reclaim was boosted, account for the reclaim done in this pass */
+	if (boosted) {
+		unsigned long flags;
+
+		for (i = 0; i <= classzone_idx; i++) {
+			if (!zone_boosts[i])
+				continue;
+
+			/* Increments are under the zone lock */
+			zone = pgdat->node_zones + i;
+			spin_lock_irqsave(&zone->lock, flags);
+			zone->watermark_boost -= min(zone->watermark_boost, zone_boosts[i]);
+			spin_unlock_irqrestore(&zone->lock, flags);
+		}
+
+		/*
+		 * As there is now likely space, wakeup kcompactd to defragment
+		 * pageblocks.
+		 */
+		wakeup_kcompactd(pgdat, pageblock_order, classzone_idx);
+	}
+
 	snapshot_refaults(NULL, pgdat);
 	__fs_reclaim_release();
 	psi_memstall_leave(&pflags);
@@ -3854,7 +3958,8 @@ void wakeup_kswapd(struct zone *zone, gfp_t gfp_flags, int order,

 	/* Hopeless node, leave it to direct reclaim if possible */
 	if (pgdat->kswapd_failures >= MAX_RECLAIM_RETRIES ||
-	    pgdat_balanced(pgdat, order, classzone_idx)) {
+	    (pgdat_balanced(pgdat, order, classzone_idx) &&
+	     !pgdat_watermark_boosted(pgdat, classzone_idx))) {
 		/*
 		 * There may be plenty of free memory available, but it's too
 		 * fragmented for high-order allocations.
Wake up kcompactd

From patchwork Wed Nov 21 10:14:14 2018
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Mel Gorman
X-Patchwork-Id: 10692283
From: Mel Gorman <mgorman@techsingularity.net>
To: Linux-MM <linux-mm@kvack.org>
Cc: Andrew Morton, Vlastimil Babka, David Rientjes, Andrea Arcangeli,
    Zi Yan, Michal Hocko, LKML
Subject: [PATCH 4/4] mm: Stall movable allocations until kswapd progresses
 during serious external fragmentation event
Date: Wed, 21 Nov 2018 10:14:14 +0000
Message-Id: <20181121101414.21301-5-mgorman@techsingularity.net>
X-Mailer: git-send-email 2.16.4
In-Reply-To:
 <20181121101414.21301-1-mgorman@techsingularity.net>
References: <20181121101414.21301-1-mgorman@techsingularity.net>

An event that potentially causes external fragmentation problems has
already been described but there are degrees of severity. A "serious"
event is defined as one that steals a contiguous range of pages of an
order lower than fragment_stall_order (PAGE_ALLOC_COSTLY_ORDER by
default). If a movable allocation request that is allowed to sleep needs
to steal a small block then it schedules until kswapd makes progress or
a timeout passes. The watermarks are also boosted slightly faster so
that kswapd makes a greater effort to reclaim enough pages to avoid the
fragmentation event.

This stall is not guaranteed to avoid serious fragmentation events. If
memory pressure is high enough, the pages freed by kswapd may be
reallocated, or the free pages may not be in pageblocks that contain
only movable pages. Furthermore, an allocation request that cannot stall
(e.g. atomic allocations) or unmovable/reclaimable allocations will
still proceed without stalling.

The worst-case scenario for stalling is a combination of high memory
pressure, where kswapd is having trouble keeping free pages over the
pfmemalloc_reserve, and movable allocations that are fragmenting memory.
In this case, an allocation request may sleep for longer. There are
vmstats to identify that stalls are happening and a tracepoint to
quantify the stall durations. Note that the granularity of the stall
detection is a jiffy so the delay accounting is not precise.
1-socket Skylake machine
config-global-dhp__workload_thpfioscale XFS (no special madvise)
4 fio threads, 1 THP allocating thread
--------------------------------------

4.20-rc1 extfrag events < order 9:  1023463
4.20-rc1+patch:                      358574 (65%   reduction)
4.20-rc1+patch1-3:                    19274 (98%   reduction)
4.20-rc1+patch1-4:                     1094 (99.9% reduction)

                               4.20.0-rc1            4.20.0-rc1
                               boost-v3r1            stall-v3r1
Amean  fault-base-1    659.85 (   0.00%)    658.74 (   0.17%)
Amean  fault-huge-1    172.19 (   0.00%)    168.00 (   2.43%)

thpfioscale Percentage Faults Huge
                               4.20.0-rc1            4.20.0-rc1
                               boost-v3r1            stall-v3r1
Percentage huge-1        1.68 (   0.00%)      0.88 ( -47.52%)

Fragmentation events are now reduced to negligible levels. The latencies
and allocation success rates are roughly similar. Over the course of 16
minutes, there were 52 stalls due to fragmentation avoidance with a
total stall time of 0.2 seconds.

1-socket Skylake machine
global-dhp__workload_thpfioscale-madvhugepage-xfs (MADV_HUGEPAGE)
-----------------------------------------------------------------

4.20-rc1 extfrag events < order 9:   342549
4.20-rc1+patch:                      337890 ( 1%   reduction)
4.20-rc1+patch1-3:                    12801 (96%   reduction)
4.20-rc1+patch1-4:                     1112 (99.7% reduction)

                               4.20.0-rc1            4.20.0-rc1
                               boost-v3r1            stall-v3r1
Amean  fault-base-1   1578.91 (   0.00%)   1647.00 (  -4.31%)
Amean  fault-huge-1   1090.23 (   0.00%)    559.31 *  48.70%*

                               4.20.0-rc1            4.20.0-rc1
                               boost-v3r1            stall-v3r1
Percentage huge-1       82.59 (   0.00%)     99.98 (  21.05%)

The fragmentation events were reduced and the latencies are good. This
is a big difference between v2 and v3 of the series as v2 had stalls
that reached the timeout of HZ/10 whereas a timeout of HZ/50 has better
latencies without compromising on fragmentation events or allocation
success rates. There were 219 stalls over the course of 16 minutes for a
total stall time of roughly 1 second (as opposed to 11 seconds with
HZ/10). The distribution of stalls is as follows

    209 4000
      1 8000
      9 20000

This shows the majority of stalls were for just one jiffy.
2-socket Haswell machine
config-global-dhp__workload_thpfioscale XFS (no special madvise)
4 fio threads, 5 THP allocating threads
----------------------------------------------------------------

4.20-rc1            extfrag events < order 9:  209820
4.20-rc1+patch:     185923 (11% reduction)
4.20-rc1+patch1-3:  11240  (95% reduction)
4.20-rc1+patch1-4:  8709   (96% reduction)

                                   4.20.0-rc1             4.20.0-rc1
                                   boost-v3r1             stall-v3r1
Amean     fault-base-5     1395.28 (   0.00%)     1335.23 (   4.30%)
Amean     fault-huge-5      539.69 (   0.00%)      614.88 * -13.93%*

                              4.20.0-rc1             4.20.0-rc1
                              boost-v3r1             stall-v3r1
Percentage huge-5        0.53 (   0.00%)        2.16 ( 306.25%)

There is a slight reduction in fragmentation events, but it is slight
enough that it may be due to luck. There is a small increase in latencies,
which is partially offset by a slight increase in THP allocation success
rates. There were 65 stalls over the course of 63 minutes with a total
stall time of roughly 0.2 seconds.

2-socket Haswell machine
global-dhp__workload_thpfioscale-madvhugepage-xfs (MADV_HUGEPAGE)
-----------------------------------------------------------------

4.20-rc1            extfrag events < order 9:  167464
4.20-rc1+patch:     130081 (22% reduction)
4.20-rc1+patch1-3:  12057  (92% reduction)
4.20-rc1+patch1-4:  11494  (93% reduction)

thpfioscale Fault Latencies
                                   4.20.0-rc1             4.20.0-rc1
                                   boost-v3r1             stall-v3r1
Amean     fault-base-5     8691.83 (   0.00%)     7380.80 (  15.08%)
Amean     fault-huge-5     2899.83 (   0.00%)     4066.94 * -40.25%*

                              4.20.0-rc1             4.20.0-rc1
                              boost-v3r1             stall-v3r1
Percentage huge-5       95.55 (   0.00%)       98.98 (   3.59%)

The fragmentation events are reduced and, while there is some wobble in
the latency, the success rate is near 100% while under heavy pressure.
There were 2016 stalls over the course of 85 minutes with a total stall
time of roughly 8 seconds.

This patch does reduce fragmentation rates overall, but it is not free:
some allocations can stall for short periods of time, and there are
knock-on effects to latency when THP allocation success rates are higher.
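The stall counts and durations quoted above come from the vmstat counter and tracepoint this patch adds; on a patched kernel they could be sampled roughly like this (the tracefs mount point may vary by distribution):

```shell
# Cumulative stall count since boot, from the new vmstat counter.
grep fragment_stall /proc/vmstat

# Per-stall durations (reported in usecs) via the new tracepoint.
echo 1 > /sys/kernel/debug/tracing/events/kmem/mm_fragmentation_stall/enable
cat /sys/kernel/debug/tracing/trace_pipe
```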
While it's within acceptable limits for the adverse test case, there may
be other workloads that cannot tolerate the stalls. If this occurs, the
tunable can be used to disable the feature or, more ideally, the test case
can be made available for analysis to see if the stall behaviour can be
reduced while still limiting the fragmentation events. On the flip side,
it has been checked that setting fragment_stall_order to 9 eliminated
fragmentation events entirely.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 Documentation/sysctl/vm.txt   | 23 +++++++++++
 include/linux/mm.h            |  1 +
 include/linux/mmzone.h        |  2 +
 include/linux/vm_event_item.h |  1 +
 include/trace/events/kmem.h   | 21 ++++++++++
 kernel/sysctl.c               | 10 +++++
 mm/internal.h                 |  1 +
 mm/page_alloc.c               | 93 +++++++++++++++++++++++++++++++++++++------
 mm/vmstat.c                   |  1 +
 9 files changed, 141 insertions(+), 12 deletions(-)

diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
index 4dff1b75229b..7d4f41bf7902 100644
--- a/Documentation/sysctl/vm.txt
+++ b/Documentation/sysctl/vm.txt
@@ -31,6 +31,7 @@ files can be found in mm/swap.c.
 - dirty_writeback_centisecs
 - drop_caches
 - extfrag_threshold
+- fragment_stall_order
 - hugetlb_shm_group
 - laptop_mode
 - legacy_va_layout
@@ -275,6 +276,28 @@ any throttling.
 
 ==============================================================
 
+fragment_stall_order
+
+External fragmentation control is managed on a pageblock level where the
+page allocator tries to avoid mixing pages of different mobility within page
+blocks (e.g. order 9 on 64-bit x86). If external fragmentation is perfectly
+controlled then a THP allocation will often succeed up to the number of
+movable pageblocks in the system as reported by /proc/pagetypeinfo.
+
+When memory is low, the system may have to mix pageblocks and will wake
+kswapd to try control future fragmentation. fragment_stall_order controls if
+the allocating task will stall if possible until kswapd makes some progress
+in preference to fragmenting the system. This incurs a small stall penalty
+in exchange for future success at allocating huge pages. If the stalls
+are undesirable and high-order allocations are irrelevant then this can
+be disabled by writing 0 to the tunable. Writing the pageblock order will
+strongly (but not perfectly) control external fragmentation.
+
+The default will stall for fragmenting allocations smaller than the
+PAGE_ALLOC_COSTLY_ORDER (defined as order-3 at the time of writing).
+
+==============================================================
+
 hugetlb_shm_group
 
 hugetlb_shm_group contains group id that is allowed to create SysV
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 2c4c69508413..2b72de790ef9 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2204,6 +2204,7 @@ extern void zone_pcp_reset(struct zone *zone);
 extern int min_free_kbytes;
 extern int watermark_boost_factor;
 extern int watermark_scale_factor;
+extern int fragment_stall_order;
 
 /* nommu.c */
 extern atomic_long_t mmap_pages_allocated;
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index d352c1dab486..cffec484ac8a 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -890,6 +890,8 @@ int watermark_boost_factor_sysctl_handler(struct ctl_table *, int,
 					void __user *, size_t *, loff_t *);
 int watermark_scale_factor_sysctl_handler(struct ctl_table *, int,
 					void __user *, size_t *, loff_t *);
+int fragment_stall_order_sysctl_handler(struct ctl_table *, int,
+					void __user *, size_t *, loff_t *);
 extern int sysctl_lowmem_reserve_ratio[MAX_NR_ZONES];
 int lowmem_reserve_ratio_sysctl_handler(struct ctl_table *, int,
 					void __user *, size_t *, loff_t *);
diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index 47a3441cf4c4..7661abe5236e 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -43,6 +43,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
 		PAGEOUTRUN, PGROTATED,
 		DROP_PAGECACHE, DROP_SLAB,
 		OOM_KILL,
+		FRAGMENTSTALL,
 #ifdef CONFIG_NUMA_BALANCING
 		NUMA_PTE_UPDATES,
 		NUMA_HUGE_PTE_UPDATES,
diff --git a/include/trace/events/kmem.h b/include/trace/events/kmem.h
index eb57e3037deb..caadd8681ac5 100644
--- a/include/trace/events/kmem.h
+++ b/include/trace/events/kmem.h
@@ -315,6 +315,27 @@ TRACE_EVENT(mm_page_alloc_extfrag,
 		__entry->change_ownership)
 );
 
+TRACE_EVENT(mm_fragmentation_stall,
+
+	TP_PROTO(int nid, unsigned long duration),
+
+	TP_ARGS(nid, duration),
+
+	TP_STRUCT__entry(
+		__field(	int,		nid		)
+		__field(	unsigned long,	duration	)
+	),
+
+	TP_fast_assign(
+		__entry->nid		= nid;
+		__entry->duration	= duration
+	),
+
+	TP_printk("nid=%d duration=%lu",
+		__entry->nid,
+		__entry->duration)
+);
+
 #endif /* _TRACE_KMEM_H */
 
 /* This part must be outside protection */
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 1825f712e73b..eb09c79ddbef 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -126,6 +126,7 @@ static int zero;
 static int __maybe_unused one = 1;
 static int __maybe_unused two = 2;
 static int __maybe_unused four = 4;
+static int __maybe_unused max_order = MAX_ORDER;
 static unsigned long one_ul = 1;
 static int one_hundred = 100;
 static int one_thousand = 1000;
@@ -1479,6 +1480,15 @@ static struct ctl_table vm_table[] = {
 		.extra1		= &one,
 		.extra2		= &one_thousand,
 	},
+	{
+		.procname	= "fragment_stall_order",
+		.data		= &fragment_stall_order,
+		.maxlen		= sizeof(fragment_stall_order),
+		.mode		= 0644,
+		.proc_handler	= fragment_stall_order_sysctl_handler,
+		.extra1		= &zero,
+		.extra2		= &max_order,
+	},
 	{
 		.procname	= "percpu_pagelist_fraction",
 		.data		= &percpu_pagelist_fraction,
diff --git a/mm/internal.h b/mm/internal.h
index 544355156c92..5506a4596d59 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -489,6 +489,7 @@ unsigned long reclaim_clean_pages_from_list(struct zone *zone,
 #else
 #define ALLOC_NOFRAGMENT	0x0
 #endif
+#define ALLOC_FRAGMENT_STALL	0x200 /* stall if fragmenting heavily */
 
 enum ttu_flags;
 struct tlbflush_unmap_batch;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 04b29228e9f0..e8b0691e8971 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -265,6 +265,7 @@ int min_free_kbytes = 1024;
 int user_min_free_kbytes = -1;
 int watermark_boost_factor __read_mostly = 15000;
 int watermark_scale_factor = 10;
+int fragment_stall_order __read_mostly = (PAGE_ALLOC_COSTLY_ORDER + 1);
 
 static unsigned long nr_kernel_pages __meminitdata;
 static unsigned long nr_all_pages __meminitdata;
@@ -2130,9 +2131,10 @@ static bool can_steal_fallback(unsigned int order, int start_mt)
 	return false;
 }
 
-static inline void boost_watermark(struct zone *zone)
+static inline void boost_watermark(struct zone *zone, bool fast_boost)
 {
 	unsigned long max_boost;
+	unsigned long nr;
 
 	if (!watermark_boost_factor)
 		return;
@@ -2140,9 +2142,36 @@ static inline void boost_watermark(struct zone *zone)
 	max_boost = mult_frac(wmark_pages(zone, WMARK_HIGH),
 			watermark_boost_factor, 10000);
 	max_boost = max(pageblock_nr_pages, max_boost);
+	nr = pageblock_nr_pages;
 
-	zone->watermark_boost = min(zone->watermark_boost + pageblock_nr_pages,
-		max_boost);
+	/* Scale relative to the MIGRATE_PCPTYPES similar to min_free_kbytes */
+	if (fast_boost)
+		nr += pageblock_nr_pages * (MIGRATE_PCPTYPES << 1);
+
+	zone->watermark_boost = min(zone->watermark_boost + nr, max_boost);
+}
+
+static void stall_fragmentation(struct zone *pzone)
+{
+	DEFINE_WAIT(wait);
+	long remaining = 0;
+	long timeout = HZ/50;
+	pg_data_t *pgdat = pzone->zone_pgdat;
+
+	if (current->flags & PF_MEMALLOC)
+		return;
+
+	boost_watermark(pzone, true);
+	prepare_to_wait(&pgdat->pfmemalloc_wait, &wait, TASK_INTERRUPTIBLE);
+	if (waitqueue_active(&pgdat->kswapd_wait))
+		wake_up_interruptible(&pgdat->kswapd_wait);
+	remaining = schedule_timeout(timeout);
+	finish_wait(&pgdat->pfmemalloc_wait, &wait);
+	if (remaining != timeout) {
+		trace_mm_fragmentation_stall(pgdat->node_id,
+			jiffies_to_usecs(timeout - remaining));
+		count_vm_event(FRAGMENTSTALL);
+	}
 }
 
 /*
@@ -2153,8 +2182,9 @@ static inline void boost_watermark(struct zone *zone)
 * of pages are free or compatible, we can change migratetype of the pageblock
 * itself, so pages freed in the future will be put on the correct free list.
 */
-static void steal_suitable_fallback(struct zone *zone, struct page *page,
-		int start_type, bool whole_block)
+static bool steal_suitable_fallback(struct zone *zone, struct page *page,
+		int start_type, bool whole_block,
+		unsigned int alloc_flags)
 {
 	unsigned int current_order = page_order(page);
 	struct free_area *area;
@@ -2181,9 +2211,14 @@ static void steal_suitable_fallback(struct zone *zone, struct page *page,
 	 * likelihood of future fallbacks. Wake kswapd now as the node
 	 * may be balanced overall and kswapd will not wake naturally.
 	 */
-	boost_watermark(zone);
+	boost_watermark(zone, false);
 	wakeup_kswapd(zone, 0, 0, zone_idx(zone));
 
+	if ((alloc_flags & ALLOC_FRAGMENT_STALL) &&
+	    current_order < fragment_stall_order) {
+		return false;
+	}
+
 	/* We are not allowed to try stealing from the whole block */
 	if (!whole_block)
 		goto single_page;
@@ -2224,11 +2259,12 @@ static void steal_suitable_fallback(struct zone *zone, struct page *page,
 			page_group_by_mobility_disabled)
 		set_pageblock_migratetype(page, start_type);
 
-	return;
+	return true;
 
 single_page:
 	area = &zone->free_area[current_order];
 	list_move(&page->lru, &area->free_list[start_type]);
+	return true;
 }
 
 /*
@@ -2467,13 +2503,14 @@ __rmqueue_fallback(struct zone *zone, int order, int start_migratetype,
 
 	page = list_first_entry(&area->free_list[fallback_mt],
 							struct page, lru);
-	steal_suitable_fallback(zone, page, start_migratetype, can_steal);
+	if (!steal_suitable_fallback(zone, page, start_migratetype, can_steal,
+				     alloc_flags))
+		return false;
 
 	trace_mm_page_alloc_extfrag(page, order, current_order,
 		start_migratetype, fallback_mt);
 
 	return true;
-
 }
 
 /*
@@ -3340,9 +3377,11 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
 						const struct alloc_context *ac)
 {
 	struct zoneref *z = ac->preferred_zoneref;
+	struct zone *pzone = z->zone;
 	struct zone *zone;
 	struct pglist_data *last_pgdat_dirty_limit = NULL;
 	bool no_fallback;
+	bool fragment_stall;
 
 retry:
 	/*
@@ -3350,6 +3389,8 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
 	 * See also __cpuset_node_allowed() comment in kernel/cpuset.c.
 	 */
 	no_fallback = alloc_flags & ALLOC_NOFRAGMENT;
+	fragment_stall = alloc_flags & ALLOC_FRAGMENT_STALL;
+
 	for_next_zone_zonelist_nodemask(zone, z, ac->zonelist, ac->high_zoneidx,
 								ac->nodemask) {
 		struct page *page;
@@ -3388,7 +3429,7 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
 			}
 		}
 
-		if (no_fallback) {
+		if (no_fallback || fragment_stall) {
 			int local_nid;
 
 			/*
@@ -3396,9 +3437,12 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
 			 * fragmenting fallbacks. Locality is more important
 			 * than fragmentation avoidance.
 			 */
-			local_nid = zone_to_nid(ac->preferred_zoneref->zone);
+			local_nid = zone_to_nid(pzone);
 			if (zone_to_nid(zone) != local_nid) {
+				if (fragment_stall)
+					stall_fragmentation(pzone);
 				alloc_flags &= ~ALLOC_NOFRAGMENT;
+				alloc_flags &= ~ALLOC_FRAGMENT_STALL;
 				goto retry;
 			}
 		}
@@ -3474,8 +3518,12 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
 	 * It's possible on a UMA machine to get through all zones that are
 	 * fragmented. If avoiding fragmentation, reset and try again
 	 */
-	if (no_fallback) {
+	if (no_fallback || fragment_stall) {
+		if (fragment_stall)
+			stall_fragmentation(pzone);
+
 		alloc_flags &= ~ALLOC_NOFRAGMENT;
+		alloc_flags &= ~ALLOC_FRAGMENT_STALL;
 		goto retry;
 	}
 
@@ -4186,6 +4234,14 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	 */
 	alloc_flags = gfp_to_alloc_flags(gfp_mask);
 
+	/*
+	 * Consider stalling on heavy for movable allocations in preference to
+	 * fragmenting unmovable/reclaimable pageblocks.
+	 */
+	if ((gfp_mask & (__GFP_MOVABLE|__GFP_DIRECT_RECLAIM)) ==
+			(__GFP_MOVABLE|__GFP_DIRECT_RECLAIM))
+		alloc_flags |= ALLOC_FRAGMENT_STALL;
+
 	/*
 	 * We need to recalculate the starting point for the zonelist iterator
 	 * because we might have used different nodemask in the fast path, or
@@ -4207,6 +4263,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	page = get_page_from_freelist(gfp_mask, order, alloc_flags, ac);
 	if (page)
 		goto got_pg;
+	alloc_flags &= ~ALLOC_FRAGMENT_STALL;
 
 	/*
 	 * For costly allocations, try direct compaction first, as it's likely
@@ -7583,6 +7640,18 @@ int watermark_boost_factor_sysctl_handler(struct ctl_table *table, int write,
 	return 0;
 }
 
+int fragment_stall_order_sysctl_handler(struct ctl_table *table, int write,
+	void __user *buffer, size_t *length, loff_t *ppos)
+{
+	int rc;
+
+	rc = proc_dointvec_minmax(table, write, buffer, length, ppos);
+	if (rc)
+		return rc;
+
+	return 0;
+}
+
 int watermark_scale_factor_sysctl_handler(struct ctl_table *table, int write,
 	void __user *buffer, size_t *length, loff_t *ppos)
 {
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 9c624595e904..6cc7755c6eb1 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1211,6 +1211,7 @@ const char * const vmstat_text[] = {
 	"drop_pagecache",
 	"drop_slab",
 	"oom_kill",
+	"fragment_stall",
 #ifdef CONFIG_NUMA_BALANCING
 	"numa_pte_updates",