From patchwork Fri Nov 23 11:45:24 2018
X-Patchwork-Submitter: Mel Gorman
X-Patchwork-Id: 10695665
From: Mel Gorman <mgorman@techsingularity.net>
To: Andrew Morton
Cc: Vlastimil Babka, David Rientjes, Andrea Arcangeli, Zi Yan, Michal Hocko,
    LKML, Linux-MM, Mel Gorman
Subject: [PATCH 1/5] mm, page_alloc: Spread allocations across zones before
 introducing fragmentation
Date: Fri, 23 Nov 2018 11:45:24 +0000
Message-Id: <20181123114528.28802-2-mgorman@techsingularity.net>
In-Reply-To: <20181123114528.28802-1-mgorman@techsingularity.net>
References: <20181123114528.28802-1-mgorman@techsingularity.net>

The page allocator zone lists are iterated based on the watermarks of each
zone, which does not take anti-fragmentation into account. On x86, node 0
may have multiple zones while other nodes have one zone. A consequence is
that tasks running on node 0 may fragment ZONE_NORMAL even though
ZONE_DMA32 has plenty of free memory. This patch special cases the
allocator fast path such that it will try an allocation from a lower local
zone before fragmenting a higher zone. In this case, stealing of
pageblocks or orders larger than a pageblock is still allowed in the fast
path as they are uninteresting from a fragmentation point of view.

This was evaluated using a benchmark designed to fragment memory before
attempting THP allocations.
It's implemented in mmtests as the following configurations:

  configs/config-global-dhp__workload_thpfioscale
  configs/config-global-dhp__workload_thpfioscale-defrag
  configs/config-global-dhp__workload_thpfioscale-madvhugepage

e.g. from mmtests

  ./run-mmtests.sh --run-monitor --config configs/config-global-dhp__workload_thpfioscale test-run-1

The broad details of the workload are as follows:

1. Create an XFS filesystem (not specified in the configuration but done
   as part of the testing for this patch).

2. Start 4 fio threads that write a number of 64K files inefficiently.
   Inefficiently means that files are created on first access and not
   created in advance (fio parameter create_on_open=1) and fallocate is
   not used (fallocate=none). With multiple IO issuers this creates a mix
   of slab and page cache allocations over time. The total size of the
   files is 150% of physical memory so that the slabs and page cache pages
   get mixed.

3. Warm up a number of fio read-only processes accessing the same files
   created in step 2. This part runs for the same length of time it took
   to create the files. It'll refault old data and further interleave slab
   and page cache allocations. As the system is now low on memory due to
   step 2, fragmentation occurs as pageblocks get stolen.

4. While step 3 is still running, start a process that tries to allocate
   75% of memory as huge pages with a number of threads. The number of
   threads is based on (NR_CPUS_SOCKET - NR_FIO_THREADS)/4 to avoid THP
   threads contending with fio, any other threads or forcing cross-NUMA
   scheduling. Note that the test has not been used on a machine with less
   than 8 cores. The benchmark records whether huge pages were allocated
   and what the fault latency was in microseconds.

5. Measure the number of events potentially causing external
   fragmentation, the fault latency and the huge page allocation success
   rate.

6. Cleanup the test files.

Note that due to the use of IO and page cache, this benchmark is not
suitable for running on large machines where the time to fragment memory
may be excessive. Also note that while this is one mix that generates
fragmentation, it is not the only mix that does so. Differences in
workloads that are more slab-intensive, or whether SLUB is used with
high-order pages, may yield different results.

When the page allocator fragments memory, it records the event using the
mm_page_alloc_extfrag ftrace event. If the fallback_order is smaller than
a pageblock order (order-9 on 64-bit x86) then it's considered to be an
"external fragmentation event" that may cause issues in the future.

Hence, the primary metric here is the number of external fragmentation
events that occur with order < 9. The secondary metrics are allocation
latency and huge page allocation success rates, but note that differences
in latencies and success rates can also affect the number of external
fragmentation events, which is why they are secondary metrics.
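To make the primary metric concrete, the following is a minimal sketch
(not part of the patch) of how a post-processing step over the
mm_page_alloc_extfrag trace records could classify external fragmentation
events. The struct, the constant and the values in main() are illustrative
assumptions based on the description above, not kernel code.

	#include <stdbool.h>
	#include <stdio.h>

	#define PAGEBLOCK_ORDER	9	/* assumption: 64-bit x86, 4K pages, 2M pageblocks */

	struct extfrag_record {
		int fallback_order;	/* order of the free area actually stolen from */
		int alloc_order;	/* order that was originally requested */
	};

	/* "External fragmentation event": the stolen area is smaller than a pageblock. */
	static bool is_extfrag_event(const struct extfrag_record *r)
	{
		return r->fallback_order < PAGEBLOCK_ORDER;
	}

	int main(void)
	{
		struct extfrag_record sample = { .fallback_order = 4, .alloc_order = 0 };

		printf("counts as extfrag event: %d\n", is_extfrag_event(&sample));
		return 0;
	}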
1-socket Skylake machine
config-global-dhp__workload_thpfioscale XFS (no special madvise)
4 fio threads, 1 THP allocating thread
--------------------------------------

4.20-rc3 extfrag events < order 9:   804694
4.20-rc3+patch:                      408912 (49% reduction)

thpfioscale Fault Latencies
                                  4.20.0-rc3            4.20.0-rc3
                                     vanilla          lowzone-v5r8
Amean     fault-base-1     662.92 (   0.00%)     653.58 *   1.41%*
Amean     fault-huge-1       0.00 (   0.00%)       0.00 (   0.00%)

                                  4.20.0-rc3            4.20.0-rc3
                                     vanilla          lowzone-v5r8
Percentage huge-1            0.00 (   0.00%)       0.00 (   0.00%)

Fault latencies are slightly reduced while allocation success rates remain
at zero as this configuration does not make any special effort to allocate
THP and fio is heavily active at the time, either filling memory or
keeping pages resident. However, a 49% reduction of serious fragmentation
events reduces the chances of external fragmentation being a problem in
the future.

Vlastimil asked during review for a breakdown of the allocation types
that are falling back.

vanilla
   3816 MIGRATE_UNMOVABLE
 800845 MIGRATE_MOVABLE
     33 MIGRATE_UNRECLAIMABLE

patch
    735 MIGRATE_UNMOVABLE
 408135 MIGRATE_MOVABLE
     42 MIGRATE_UNRECLAIMABLE

The majority of the fallbacks are due to movable allocations and this is
consistent for the workload throughout the series, so it will not be
presented again; the primary source of fallbacks is movable allocations.

Movable fallbacks are sometimes considered "ok" because the pages can be
migrated. The problem is that they can fill an unmovable/reclaimable
pageblock, causing those allocations to fall back later and polluting
pageblocks with pages that cannot move. If there is a movable fallback, it
is pretty much guaranteed to affect an unmovable/reclaimable pageblock,
and while it might not be enough to actually cause an
unmovable/reclaimable fallback in the future, we cannot know that in
advance, so the patch takes the only option available to it. Hence, it's
important to control them. This point is also consistent throughout the
series and will not be repeated.

1-socket Skylake machine
global-dhp__workload_thpfioscale-madvhugepage-xfs (MADV_HUGEPAGE)
-----------------------------------------------------------------

4.20-rc3 extfrag events < order 9:   291392
4.20-rc3+patch:                      191187 (34% reduction)

thpfioscale Fault Latencies
                                  4.20.0-rc3            4.20.0-rc3
                                     vanilla          lowzone-v5r8
Amean     fault-base-1    1495.14 (   0.00%)    1467.55 (   1.85%)
Amean     fault-huge-1    1098.48 (   0.00%)    1127.11 (  -2.61%)

thpfioscale Percentage Faults Huge
                                  4.20.0-rc3            4.20.0-rc3
                                     vanilla          lowzone-v5r8
Percentage huge-1           78.57 (   0.00%)      77.64 (  -1.18%)

Fragmentation events were reduced quite a bit, although this is known to
be a little variable. The latencies and allocation success rates are
similar, but they were already quite high.
2-socket Haswell machine
config-global-dhp__workload_thpfioscale XFS (no special madvise)
4 fio threads, 5 THP allocating threads
----------------------------------------------------------------

4.20-rc3 extfrag events < order 9:   215698
4.20-rc3+patch:                      200210 (7% reduction)

thpfioscale Fault Latencies
                                  4.20.0-rc3            4.20.0-rc3
                                     vanilla          lowzone-v5r8
Amean     fault-base-5    1350.05 (   0.00%)    1346.45 (   0.27%)
Amean     fault-huge-5    4181.01 (   0.00%)    3418.60 (  18.24%)

                                  4.20.0-rc3            4.20.0-rc3
                                     vanilla          lowzone-v5r8
Percentage huge-5            1.15 (   0.00%)       0.78 ( -31.88%)

The reduction of external fragmentation events is slight and this is
partially due to the removal of __GFP_THISNODE in commit ac5b2c18911f
("mm: thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings") as THP
allocations can now spill over to remote nodes instead of fragmenting
local memory.

2-socket Haswell machine
global-dhp__workload_thpfioscale-madvhugepage-xfs (MADV_HUGEPAGE)
-----------------------------------------------------------------

4.20-rc3 extfrag events < order 9:   166352
4.20-rc3+patch:                      147463 (11% reduction)

thpfioscale Fault Latencies
                                  4.20.0-rc3            4.20.0-rc3
                                     vanilla          lowzone-v5r8
Amean     fault-base-5    6138.97 (   0.00%)    6217.43 (  -1.28%)
Amean     fault-huge-5    2294.28 (   0.00%)    3163.33 * -37.88%*

thpfioscale Percentage Faults Huge
                                  4.20.0-rc3            4.20.0-rc3
                                     vanilla          lowzone-v5r8
Percentage huge-5           96.82 (   0.00%)      95.14 (  -1.74%)

There was a slight reduction in external fragmentation events although the
latencies were higher. The allocation success rate is high enough that the
system is struggling and there is quite a lot of parallel reclaim and
compaction activity. There is also a certain degree of luck on whether
processes start on node 0 or not for this patch, but the relevance is
reduced later in the series.

Overall, the patch reduces the number of external fragmentation causing
events, so the success of THP over long periods of time would be improved
for this adverse workload.

Signed-off-by: Mel Gorman
Acked-by: Vlastimil Babka
---
 mm/internal.h   |  13 ++++---
 mm/page_alloc.c | 108 +++++++++++++++++++++++++++++++++++++++++++++++++-------
 2 files changed, 105 insertions(+), 16 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index 291eb2b6d1d8..544355156c92 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -480,10 +480,15 @@ unsigned long reclaim_clean_pages_from_list(struct zone *zone,
 #define ALLOC_OOM		ALLOC_NO_WATERMARKS
 #endif

-#define ALLOC_HARDER		0x10 /* try to alloc harder */
-#define ALLOC_HIGH		0x20 /* __GFP_HIGH set */
-#define ALLOC_CPUSET		0x40 /* check for correct cpuset */
-#define ALLOC_CMA		0x80 /* allow allocations from CMA areas */
+#define ALLOC_HARDER		 0x10 /* try to alloc harder */
+#define ALLOC_HIGH		 0x20 /* __GFP_HIGH set */
+#define ALLOC_CPUSET		 0x40 /* check for correct cpuset */
+#define ALLOC_CMA		 0x80 /* allow allocations from CMA areas */
+#ifdef CONFIG_ZONE_DMA32
+#define ALLOC_NOFRAGMENT	0x100 /* avoid mixing pageblock types */
+#else
+#define ALLOC_NOFRAGMENT	  0x0
+#endif

 enum ttu_flags;
 struct tlbflush_unmap_batch;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 6847177dc4a1..f86638aee96a 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2375,20 +2375,30 @@ static bool unreserve_highatomic_pageblock(const struct alloc_context *ac,
  * condition simpler.
  */
 static __always_inline bool
-__rmqueue_fallback(struct zone *zone, int order, int start_migratetype)
+__rmqueue_fallback(struct zone *zone, int order, int start_migratetype,
+						unsigned int alloc_flags)
 {
 	struct free_area *area;
 	int current_order;
+	int min_order = order;
 	struct page *page;
 	int fallback_mt;
 	bool can_steal;

+	/*
+	 * Do not steal pages from freelists belonging to other pageblocks
+	 * i.e. orders < pageblock_order. If there are no local zones free,
+	 * the zonelists will be reiterated without ALLOC_NOFRAGMENT.
+	 */
+	if (alloc_flags & ALLOC_NOFRAGMENT)
+		min_order = pageblock_order;
+
 	/*
 	 * Find the largest available free page in the other list. This roughly
 	 * approximates finding the pageblock with the most free pages, which
 	 * would be too costly to do exactly.
 	 */
-	for (current_order = MAX_ORDER - 1; current_order >= order;
+	for (current_order = MAX_ORDER - 1; current_order >= min_order;
 				--current_order) {
 		area = &(zone->free_area[current_order]);
 		fallback_mt = find_suitable_fallback(area, current_order,
@@ -2447,7 +2457,8 @@ __rmqueue_fallback(struct zone *zone, int order, int start_migratetype)
  * Call me with the zone->lock already held.
  */
 static __always_inline struct page *
-__rmqueue(struct zone *zone, unsigned int order, int migratetype)
+__rmqueue(struct zone *zone, unsigned int order, int migratetype,
+						unsigned int alloc_flags)
 {
 	struct page *page;

@@ -2457,7 +2468,8 @@ __rmqueue(struct zone *zone, unsigned int order, int migratetype)
 		if (migratetype == MIGRATE_MOVABLE)
 			page = __rmqueue_cma_fallback(zone, order);

-		if (!page && __rmqueue_fallback(zone, order, migratetype))
+		if (!page && __rmqueue_fallback(zone, order, migratetype,
+								alloc_flags))
 			goto retry;
 	}

@@ -2472,13 +2484,14 @@ __rmqueue(struct zone *zone, unsigned int order, int migratetype)
  */
 static int rmqueue_bulk(struct zone *zone, unsigned int order,
 			unsigned long count, struct list_head *list,
-			int migratetype)
+			int migratetype, unsigned int alloc_flags)
 {
 	int i, alloced = 0;

 	spin_lock(&zone->lock);
 	for (i = 0; i < count; ++i) {
-		struct page *page = __rmqueue(zone, order, migratetype);
+		struct page *page = __rmqueue(zone, order, migratetype,
+								alloc_flags);
 		if (unlikely(page == NULL))
 			break;

@@ -2934,6 +2947,7 @@ static inline void zone_statistics(struct zone *preferred_zone, struct zone *z)

 /* Remove page from the per-cpu list, caller must protect the list */
 static struct page *__rmqueue_pcplist(struct zone *zone, int migratetype,
+			unsigned int alloc_flags,
 			struct per_cpu_pages *pcp,
 			struct list_head *list)
 {
@@ -2943,7 +2957,7 @@ static struct page *__rmqueue_pcplist(struct zone *zone, int migratetype,
 	if (list_empty(list)) {
 		pcp->count += rmqueue_bulk(zone, 0,
 				pcp->batch, list,
-				migratetype);
+				migratetype, alloc_flags);
 		if (unlikely(list_empty(list)))
 			return NULL;
 	}
@@ -2959,7 +2973,8 @@ static struct page *__rmqueue_pcplist(struct zone *zone, int migratetype,
 /* Lock and remove page from the per-cpu list */
 static struct page *rmqueue_pcplist(struct zone *preferred_zone,
 			struct zone *zone, unsigned int order,
-			gfp_t gfp_flags, int migratetype)
+			gfp_t gfp_flags, int migratetype,
+			unsigned int alloc_flags)
 {
 	struct per_cpu_pages *pcp;
 	struct list_head *list;
@@ -2969,7 +2984,7 @@ static struct page *rmqueue_pcplist(struct zone *preferred_zone,
 	local_irq_save(flags);
 	pcp = &this_cpu_ptr(zone->pageset)->pcp;
 	list = &pcp->lists[migratetype];
-	page = __rmqueue_pcplist(zone, migratetype, pcp, list);
+	page = __rmqueue_pcplist(zone, migratetype, alloc_flags, pcp, list);
 	if (page) {
 		__count_zid_vm_events(PGALLOC, page_zonenum(page), 1 << order);
 		zone_statistics(preferred_zone, zone);
@@ -2992,7 +3007,7 @@ struct page *rmqueue(struct zone *preferred_zone,

 	if (likely(order == 0)) {
 		page = rmqueue_pcplist(preferred_zone, zone, order,
-				gfp_flags, migratetype);
+				gfp_flags, migratetype, alloc_flags);
 		goto out;
 	}

@@ -3011,7 +3026,7 @@ struct page *rmqueue(struct zone *preferred_zone,
 				trace_mm_page_alloc_zone_locked(page, order, migratetype);
 		}
 		if (!page)
-			page = __rmqueue(zone, order, migratetype);
+			page = __rmqueue(zone, order, migratetype, alloc_flags);
 	} while (page && check_new_pages(page, order));
 	spin_unlock(&zone->lock);
 	if (!page)
@@ -3253,6 +3268,40 @@ static bool zone_allows_reclaim(struct zone *local_zone, struct zone *zone)
 }
 #endif	/* CONFIG_NUMA */

+#ifdef CONFIG_ZONE_DMA32
+/*
+ * The restriction on ZONE_DMA32 as being a suitable zone to use to avoid
+ * fragmentation is subtle. If the preferred zone was HIGHMEM then
+ * premature use of a lower zone may cause lowmem pressure problems that
+ * are worse than fragmentation. If the next zone is ZONE_DMA then it is
+ * probably too small. It only makes sense to spread allocations to avoid
+ * fragmentation between the Normal and DMA32 zones.
+ */
+static inline unsigned int
+alloc_flags_nofragment(struct zone *zone)
+{
+	if (zone_idx(zone) != ZONE_NORMAL)
+		return 0;
+
+	/*
+	 * If ZONE_DMA32 exists, assume it is the one after ZONE_NORMAL and
+	 * the pointer is within zone->zone_pgdat->node_zones[]. Also assume
+	 * on UMA that if Normal is populated then so is DMA32.
+	 */
+	BUILD_BUG_ON(ZONE_NORMAL - ZONE_DMA32 != 1);
+	if (nr_online_nodes > 1 && !populated_zone(--zone))
+		return 0;
+
+	return ALLOC_NOFRAGMENT;
+}
+#else
+static inline unsigned int
+alloc_flags_nofragment(struct zone *zone)
+{
+	return 0;
+}
+#endif
+
 /*
  * get_page_from_freelist goes through the zonelist trying to allocate
  * a page.
@@ -3261,14 +3310,18 @@ static struct page *
 get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
 						const struct alloc_context *ac)
 {
-	struct zoneref *z = ac->preferred_zoneref;
+	struct zoneref *z;
 	struct zone *zone;
 	struct pglist_data *last_pgdat_dirty_limit = NULL;
+	bool no_fallback;

+retry:
 	/*
 	 * Scan zonelist, looking for a zone with enough free.
 	 * See also __cpuset_node_allowed() comment in kernel/cpuset.c.
 	 */
+	no_fallback = alloc_flags & ALLOC_NOFRAGMENT;
+	z = ac->preferred_zoneref;
 	for_next_zone_zonelist_nodemask(zone, z, ac->zonelist, ac->high_zoneidx,
 								ac->nodemask) {
 		struct page *page;
@@ -3307,6 +3360,22 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
 			}
 		}

+		if (no_fallback && nr_online_nodes > 1 &&
+		    zone != ac->preferred_zoneref->zone) {
+			int local_nid;
+
+			/*
+			 * If moving to a remote node, retry but allow
+			 * fragmenting fallbacks. Locality is more important
+			 * than fragmentation avoidance.
+			 */
+			local_nid = zone_to_nid(ac->preferred_zoneref->zone);
+			if (zone_to_nid(zone) != local_nid) {
+				alloc_flags &= ~ALLOC_NOFRAGMENT;
+				goto retry;
+			}
+		}
+
 		mark = zone->watermark[alloc_flags & ALLOC_WMARK_MASK];
 		if (!zone_watermark_fast(zone, order, mark,
 				       ac_classzone_idx(ac), alloc_flags)) {
@@ -3374,6 +3443,15 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
 		}
 	}

+	/*
+	 * It's possible on a UMA machine to get through all zones that are
+	 * fragmented. If avoiding fragmentation, reset and try again.
+	 */
+	if (no_fallback) {
+		alloc_flags &= ~ALLOC_NOFRAGMENT;
+		goto retry;
+	}
+
 	return NULL;
 }

@@ -4369,6 +4447,12 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, int preferred_nid,

 	finalise_ac(gfp_mask, &ac);

+	/*
+	 * Forbid the first pass from falling back to types that fragment
+	 * memory until all local zones are considered.
+	 */
+	alloc_flags |= alloc_flags_nofragment(ac.preferred_zoneref->zone);
+
 	/* First allocation attempt */
 	page = get_page_from_freelist(alloc_mask, order, alloc_flags, &ac);
 	if (likely(page))

From patchwork Fri Nov 23 11:45:25 2018
X-Patchwork-Submitter: Mel Gorman
X-Patchwork-Id: 10695661
From: Mel Gorman <mgorman@techsingularity.net>
To: Andrew Morton
Cc: Vlastimil Babka, David Rientjes, Andrea Arcangeli, Zi Yan, Michal Hocko,
    LKML, Linux-MM, Mel Gorman
Subject: [PATCH 2/5] mm: Move zone watermark accesses behind an accessor
Date: Fri, 23 Nov 2018 11:45:25 +0000
Message-Id: <20181123114528.28802-3-mgorman@techsingularity.net>
In-Reply-To: <20181123114528.28802-1-mgorman@techsingularity.net>
References: <20181123114528.28802-1-mgorman@techsingularity.net>

This is a preparation patch only, no functional change.
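As an aside (not part of the patch): the point of the accessor is that
callers written against wmark_pages() do not need to change when a later
patch in this series folds a per-zone boost into the value. A
compile-checkable sketch of that progression, using stand-in types:

	/* Sketch only; mirrors the macros in this patch and how patch 4 extends them. */
	enum zone_watermarks { WMARK_MIN, WMARK_LOW, WMARK_HIGH, NR_WMARK };

	struct zone_sketch {
		unsigned long _watermark[NR_WMARK];
		unsigned long watermark_boost;	/* only used from patch 4 onwards */
	};

	/* This patch: one accessor in front of the raw array. */
	#define wmark_pages(z, i)	((z)->_watermark[i])
	/*
	 * Patch 4 then becomes a one-line change here:
	 * #define wmark_pages(z, i)	((z)->_watermark[i] + (z)->watermark_boost)
	 */

	static unsigned long low_watermark(const struct zone_sketch *z)
	{
		return wmark_pages(z, WMARK_LOW);	/* call sites stay the same */
	}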
Signed-off-by: Mel Gorman
Acked-by: Vlastimil Babka
---
 include/linux/mmzone.h |  9 +++++----
 mm/compaction.c        |  2 +-
 mm/page_alloc.c        | 12 ++++++------
 3 files changed, 12 insertions(+), 11 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 847705a6d0ec..e43e8e79db99 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -269,9 +269,10 @@ enum zone_watermarks {
 	NR_WMARK
 };

-#define min_wmark_pages(z) (z->watermark[WMARK_MIN])
-#define low_wmark_pages(z) (z->watermark[WMARK_LOW])
-#define high_wmark_pages(z) (z->watermark[WMARK_HIGH])
+#define min_wmark_pages(z) (z->_watermark[WMARK_MIN])
+#define low_wmark_pages(z) (z->_watermark[WMARK_LOW])
+#define high_wmark_pages(z) (z->_watermark[WMARK_HIGH])
+#define wmark_pages(z, i) (z->_watermark[i])

 struct per_cpu_pages {
 	int count;		/* number of pages in the list */
@@ -362,7 +363,7 @@ struct zone {
 	/* Read-mostly fields */

 	/* zone watermarks, access with *_wmark_pages(zone) macros */
-	unsigned long watermark[NR_WMARK];
+	unsigned long _watermark[NR_WMARK];

 	unsigned long nr_reserved_highatomic;

diff --git a/mm/compaction.c b/mm/compaction.c
index 7c607479de4a..ef29490b0f46 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1431,7 +1431,7 @@ static enum compact_result __compaction_suitable(struct zone *zone, int order,
 	if (is_via_compact_memory(order))
 		return COMPACT_CONTINUE;

-	watermark = zone->watermark[alloc_flags & ALLOC_WMARK_MASK];
+	watermark = wmark_pages(zone, alloc_flags & ALLOC_WMARK_MASK);
 	/*
 	 * If watermarks for high-order allocation are already met, there
 	 * should be no need for compaction at all.
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index f86638aee96a..4ba84cd2977a 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3376,7 +3376,7 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
 		}
 	}

-	mark = zone->watermark[alloc_flags & ALLOC_WMARK_MASK];
+	mark = wmark_pages(zone, alloc_flags & ALLOC_WMARK_MASK);
 	if (!zone_watermark_fast(zone, order, mark,
 		       ac_classzone_idx(ac), alloc_flags)) {
 		int ret;
@@ -4796,7 +4796,7 @@ long si_mem_available(void)
 		pages[lru] = global_node_page_state(NR_LRU_BASE + lru);

 	for_each_zone(zone)
-		wmark_low += zone->watermark[WMARK_LOW];
+		wmark_low += low_wmark_pages(zone);

 	/*
 	 * Estimate the amount of memory available for userspace allocations,
@@ -7422,13 +7422,13 @@ static void __setup_per_zone_wmarks(void)
 			min_pages = zone->managed_pages / 1024;
 			min_pages = clamp(min_pages, SWAP_CLUSTER_MAX, 128UL);

-			zone->watermark[WMARK_MIN] = min_pages;
+			zone->_watermark[WMARK_MIN] = min_pages;
 		} else {
 			/*
 			 * If it's a lowmem zone, reserve a number of pages
 			 * proportionate to the zone's size.
 			 */
-			zone->watermark[WMARK_MIN] = tmp;
+			zone->_watermark[WMARK_MIN] = tmp;
 		}

 		/*
@@ -7440,8 +7440,8 @@ static void __setup_per_zone_wmarks(void)
 			    mult_frac(zone->managed_pages,
 				      watermark_scale_factor, 10000));

-		zone->watermark[WMARK_LOW]  = min_wmark_pages(zone) + tmp;
-		zone->watermark[WMARK_HIGH] = min_wmark_pages(zone) + tmp * 2;
+		zone->_watermark[WMARK_LOW]  = min_wmark_pages(zone) + tmp;
+		zone->_watermark[WMARK_HIGH] = min_wmark_pages(zone) + tmp * 2;

 		spin_unlock_irqrestore(&zone->lock, flags);
 	}

From patchwork Fri Nov 23 11:45:26 2018
X-Patchwork-Submitter: Mel Gorman
X-Patchwork-Id: 10695663
From: Mel Gorman <mgorman@techsingularity.net>
To: Andrew Morton
Cc: Vlastimil Babka, David Rientjes, Andrea Arcangeli, Zi Yan, Michal Hocko,
    LKML, Linux-MM, Mel Gorman
Subject: [PATCH 3/5] mm: Use alloc_flags to record if kswapd can wake
Date: Fri, 23 Nov 2018 11:45:26 +0000
Message-Id: <20181123114528.28802-4-mgorman@techsingularity.net>
In-Reply-To: <20181123114528.28802-1-mgorman@techsingularity.net>
References: <20181123114528.28802-1-mgorman@techsingularity.net>

This is a preparation patch that copies the GFP flag __GFP_KSWAPD_RECLAIM
into alloc_flags. It is a preparation patch only in that it avoids having
to pass gfp_mask through a long callchain in a future patch.

Note that the setting in the fast path happens in alloc_flags_nofragment()
and it may be claimed that this has nothing to do with ALLOC_NOFRAGMENT.
That's true in this patch but is not true later, so it's done now for
easier review to show where the flag needs to be recorded.

No functional change.

Signed-off-by: Mel Gorman
Acked-by: Vlastimil Babka
---
 mm/internal.h   |  1 +
 mm/page_alloc.c | 25 +++++++++++++++++--------
 2 files changed, 18 insertions(+), 8 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index 544355156c92..be826ee9dc7f 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -489,6 +489,7 @@ unsigned long reclaim_clean_pages_from_list(struct zone *zone,
 #else
 #define ALLOC_NOFRAGMENT	  0x0
 #endif
+#define ALLOC_KSWAPD		0x200 /* allow waking of kswapd */

 enum ttu_flags;
 struct tlbflush_unmap_batch;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 4ba84cd2977a..e44eb68744ed 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3278,10 +3278,15 @@ static bool zone_allows_reclaim(struct zone *local_zone, struct zone *zone)
  * fragmentation between the Normal and DMA32 zones.
  */
 static inline unsigned int
-alloc_flags_nofragment(struct zone *zone)
+alloc_flags_nofragment(struct zone *zone, gfp_t gfp_mask)
 {
+	unsigned int alloc_flags = 0;
+
+	if (gfp_mask & __GFP_KSWAPD_RECLAIM)
+		alloc_flags |= ALLOC_KSWAPD;
+
 	if (zone_idx(zone) != ZONE_NORMAL)
-		return 0;
+		goto out;

 	/*
 	 * If ZONE_DMA32 exists, assume it is the one after ZONE_NORMAL and
@@ -3290,13 +3295,14 @@ alloc_flags_nofragment(struct zone *zone)
 	 */
 	BUILD_BUG_ON(ZONE_NORMAL - ZONE_DMA32 != 1);
 	if (nr_online_nodes > 1 && !populated_zone(--zone))
-		return 0;
+		goto out;

-	return ALLOC_NOFRAGMENT;
+out:
+	return alloc_flags;
 }
 #else
 static inline unsigned int
-alloc_flags_nofragment(struct zone *zone)
+alloc_flags_nofragment(struct zone *zone, gfp_t gfp_mask)
 {
 	return 0;
 }
@@ -3939,6 +3945,9 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
 	} else if (unlikely(rt_task(current)) && !in_interrupt())
 		alloc_flags |= ALLOC_HARDER;

+	if (gfp_mask & __GFP_KSWAPD_RECLAIM)
+		alloc_flags |= ALLOC_KSWAPD;
+
 #ifdef CONFIG_CMA
 	if (gfpflags_to_migratetype(gfp_mask) == MIGRATE_MOVABLE)
 		alloc_flags |= ALLOC_CMA;
@@ -4170,7 +4179,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	if (!ac->preferred_zoneref->zone)
 		goto nopage;

-	if (gfp_mask & __GFP_KSWAPD_RECLAIM)
+	if (alloc_flags & ALLOC_KSWAPD)
 		wake_all_kswapds(order, gfp_mask, ac);

 	/*
@@ -4228,7 +4237,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 retry:
 	/* Ensure kswapd doesn't accidentally go to sleep as long as we loop */
-	if (gfp_mask & __GFP_KSWAPD_RECLAIM)
+	if (alloc_flags & ALLOC_KSWAPD)
 		wake_all_kswapds(order, gfp_mask, ac);

 	reserve_flags = __gfp_pfmemalloc_flags(gfp_mask);
@@ -4451,7 +4460,7 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, int preferred_nid,
 	 * Forbid the first pass from falling back to types that fragment
 	 * memory until all local zones are considered.
 	 */
-	alloc_flags |= alloc_flags_nofragment(ac.preferred_zoneref->zone);
+	alloc_flags |= alloc_flags_nofragment(ac.preferred_zoneref->zone, gfp_mask);

 	/* First allocation attempt */
 	page = get_page_from_freelist(alloc_mask, order, alloc_flags, &ac);

From patchwork Fri Nov 23 11:45:27 2018
X-Patchwork-Submitter: Mel Gorman
X-Patchwork-Id: 10695669
From: Mel Gorman <mgorman@techsingularity.net>
To: Andrew Morton
Cc: Vlastimil Babka, David Rientjes, Andrea Arcangeli, Zi Yan, Michal Hocko,
    LKML, Linux-MM, Mel Gorman
Subject: [PATCH 4/5] mm: Reclaim small amounts of memory when an external
 fragmentation event occurs
Date: Fri, 23 Nov 2018 11:45:27 +0000
Message-Id: <20181123114528.28802-5-mgorman@techsingularity.net>
In-Reply-To: <20181123114528.28802-1-mgorman@techsingularity.net>
References: <20181123114528.28802-1-mgorman@techsingularity.net>

An external fragmentation event was previously described as

  When the page allocator fragments memory, it records the event using
  the mm_page_alloc_extfrag event. If the fallback_order is smaller than
  a pageblock order (order-9 on 64-bit x86) then it's considered an event
  that will cause external fragmentation issues in the future.

The kernel reduces the probability of such events by increasing the
watermark sizes by calling set_recommended_min_free_kbytes early in the
lifetime of the system.
This works reasonably well in general but if there are enough sparsely
populated pageblocks then the problem can still occur as enough memory is
free overall and kswapd stays asleep.

This patch introduces a watermark_boost_factor sysctl that allows a zone
watermark to be temporarily boosted when an external fragmentation-causing
event occurs. The boosting will stall allocations that would decrease free
memory below the boosted low watermark, and kswapd is woken if the calling
context allows, to reclaim an amount of memory relative to the size of the
high watermark and the watermark_boost_factor until the boost is cleared.
When kswapd finishes, it wakes kcompactd at the pageblock order to clean
some of the pageblocks that may have been affected by the fragmentation
event. kswapd avoids any writeback, slab shrinkage and swap from reclaim
context during this operation to avoid excessive system disruption in the
name of fragmentation avoidance. Care is taken so that kswapd will do
normal reclaim work if the system is really low on memory.

This was evaluated using the same workloads as "mm, page_alloc: Spread
allocations across zones before introducing fragmentation".

1-socket Skylake machine
config-global-dhp__workload_thpfioscale XFS (no special madvise)
4 fio threads, 1 THP allocating thread
--------------------------------------

4.20-rc3 extfrag events < order 9:   804694
4.20-rc3+patch:                      408912 (49% reduction)
4.20-rc3+patch1-4:                    18421 (98% reduction)

                                  4.20.0-rc3            4.20.0-rc3
                                lowzone-v5r8            boost-v5r8
Amean     fault-base-1     653.58 (   0.00%)     652.71 (   0.13%)
Amean     fault-huge-1       0.00 (   0.00%)     178.93 * -99.00%*

                                  4.20.0-rc3            4.20.0-rc3
                                lowzone-v5r8            boost-v5r8
Percentage huge-1            0.00 (   0.00%)       5.12 ( 100.00%)

Note that external fragmentation-causing events are massively reduced by
this patch whether in comparison to the previous kernel or the vanilla
kernel. The fault latency for huge pages appears to be increased but that
is only because THP allocations were successful with the patch applied.

1-socket Skylake machine
global-dhp__workload_thpfioscale-madvhugepage-xfs (MADV_HUGEPAGE)
-----------------------------------------------------------------

4.20-rc3 extfrag events < order 9:   291392
4.20-rc3+patch:                      191187 (34% reduction)
4.20-rc3+patch1-4:                    13464 (95% reduction)

thpfioscale Fault Latencies
                                  4.20.0-rc3            4.20.0-rc3
                                lowzone-v5r8            boost-v5r8
Min       fault-base-1     912.00 (   0.00%)     905.00 (   0.77%)
Min       fault-huge-1     127.00 (   0.00%)     135.00 (  -6.30%)
Amean     fault-base-1    1467.55 (   0.00%)    1481.67 (  -0.96%)
Amean     fault-huge-1    1127.11 (   0.00%)    1063.88 *   5.61%*

                                  4.20.0-rc3            4.20.0-rc3
                                lowzone-v5r8            boost-v5r8
Percentage huge-1           77.64 (   0.00%)      83.46 (   7.49%)

As before, there is a massive reduction in external fragmentation events,
some jitter on latencies and an increase in THP allocation success rates.

2-socket Haswell machine
config-global-dhp__workload_thpfioscale XFS (no special madvise)
4 fio threads, 5 THP allocating threads
----------------------------------------------------------------

4.20-rc3 extfrag events < order 9:   215698
4.20-rc3+patch:                      200210 (7% reduction)
4.20-rc3+patch1-4:                    14263 (93% reduction)

                                  4.20.0-rc3            4.20.0-rc3
                                lowzone-v5r8            boost-v5r8
Amean     fault-base-5    1346.45 (   0.00%)    1306.87 (   2.94%)
Amean     fault-huge-5    3418.60 (   0.00%)    1348.94 (  60.54%)

                                  4.20.0-rc3            4.20.0-rc3
                                lowzone-v5r8            boost-v5r8
Percentage huge-5            0.78 (   0.00%)       7.91 ( 910.64%)

There is a 93% reduction in fragmentation-causing events, a big reduction
in the huge page fault latency and the allocation success rate is higher.
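As an aside before the remaining results, a worked example of the boost
arithmetic described above may help. This is an illustrative userspace
sketch of the same clamping that boost_watermark() performs in the patch
below; the watermark and event counts in main() are made-up numbers, not
measurements.

	#include <stdio.h>

	#define PAGEBLOCK_NR_PAGES	512UL	/* assumption: 2M pageblocks, 4K pages */

	/* One fragmentation event: add a pageblock worth of boost, clamped to the cap. */
	static unsigned long boost_once(unsigned long boost, unsigned long high_wmark,
					unsigned long factor)
	{
		unsigned long max_boost;

		if (!factor)
			return boost;

		max_boost = high_wmark * factor / 10000;	/* mult_frac() in the kernel */
		if (max_boost < PAGEBLOCK_NR_PAGES)
			max_boost = PAGEBLOCK_NR_PAGES;

		boost += PAGEBLOCK_NR_PAGES;
		return boost < max_boost ? boost : max_boost;
	}

	int main(void)
	{
		unsigned long boost = 0;

		/* e.g. a high watermark of 10000 pages and the default factor of 15000 */
		for (int i = 0; i < 40; i++)
			boost = boost_once(boost, 10000, 15000);

		printf("boost after 40 events: %lu pages (cap is 15000)\n", boost);
		return 0;
	}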
2-socket Haswell machine
global-dhp__workload_thpfioscale-madvhugepage-xfs (MADV_HUGEPAGE)
-----------------------------------------------------------------

4.20-rc3 extfrag events < order 9:   166352
4.20-rc3+patch:                      147463 (11% reduction)
4.20-rc3+patch1-4:                    11095 (93% reduction)

thpfioscale Fault Latencies
                                  4.20.0-rc3            4.20.0-rc3
                                lowzone-v5r8            boost-v5r8
Amean     fault-base-5    6217.43 (   0.00%)    7419.67 * -19.34%*
Amean     fault-huge-5    3163.33 (   0.00%)    3263.80 (  -3.18%)

                                  4.20.0-rc3            4.20.0-rc3
                                lowzone-v5r8            boost-v5r8
Percentage huge-5           95.14 (   0.00%)      87.98 (  -7.53%)

There is a large reduction in fragmentation events, with some jitter
around the latencies and success rates. As before, the high THP allocation
success rate does mean the system is under a lot of pressure. However, as
the fragmentation events are reduced, it would be expected that the
long-term allocation success rate would be higher.

Signed-off-by: Mel Gorman
Acked-by: Vlastimil Babka
---
 Documentation/sysctl/vm.txt |  21 +++++++
 include/linux/mm.h          |   1 +
 include/linux/mmzone.h      |  11 ++--
 kernel/sysctl.c             |   8 +++
 mm/page_alloc.c             |  43 +++++++++++++-
 mm/vmscan.c                 | 133 +++++++++++++++++++++++++++++++++++++++++---
 6 files changed, 202 insertions(+), 15 deletions(-)

diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
index 7d73882e2c27..187ce4f599a2 100644
--- a/Documentation/sysctl/vm.txt
+++ b/Documentation/sysctl/vm.txt
@@ -63,6 +63,7 @@ files can be found in mm/swap.c.
 - swappiness
 - user_reserve_kbytes
 - vfs_cache_pressure
+- watermark_boost_factor
 - watermark_scale_factor
 - zone_reclaim_mode

@@ -856,6 +857,26 @@ ten times more freeable objects than there are.

 =============================================================

+watermark_boost_factor:
+
+This factor controls the level of reclaim when memory is being fragmented.
+It defines the percentage of the high watermark of a zone that will be
+reclaimed if pages of different mobility are being mixed within pageblocks.
+The intent is that compaction has less work to do in the future and to
+increase the success rate of future high-order allocations such as SLUB
+allocations, THP and hugetlbfs pages.
+
+To make it sensible with respect to the watermark_scale_factor parameter,
+the unit is in fractions of 10,000. The default value of 15,000 means
+that up to 150% of the high watermark will be reclaimed in the event of
+a pageblock being mixed due to fragmentation. The level of reclaim is
+determined by the number of fragmentation events that occurred in the
+recent past. If this value is smaller than a pageblock then a pageblock's
+worth of pages will be reclaimed (e.g. 2MB on 64-bit x86). A boost factor
+of 0 will disable the feature.
+
+=============================================================
+
 watermark_scale_factor:

 This factor controls the aggressiveness of kswapd. It defines the
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 5411de93a363..2c4c69508413 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2202,6 +2202,7 @@ extern void zone_pcp_reset(struct zone *zone);

 /* page_alloc.c */
 extern int min_free_kbytes;
+extern int watermark_boost_factor;
 extern int watermark_scale_factor;

 /* nommu.c */
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index e43e8e79db99..d352c1dab486 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -269,10 +269,10 @@ enum zone_watermarks {
 	NR_WMARK
 };

-#define min_wmark_pages(z) (z->_watermark[WMARK_MIN])
-#define low_wmark_pages(z) (z->_watermark[WMARK_LOW])
-#define high_wmark_pages(z) (z->_watermark[WMARK_HIGH])
-#define wmark_pages(z, i) (z->_watermark[i])
+#define min_wmark_pages(z) (z->_watermark[WMARK_MIN] + z->watermark_boost)
+#define low_wmark_pages(z) (z->_watermark[WMARK_LOW] + z->watermark_boost)
+#define high_wmark_pages(z) (z->_watermark[WMARK_HIGH] + z->watermark_boost)
+#define wmark_pages(z, i) (z->_watermark[i] + z->watermark_boost)

 struct per_cpu_pages {
 	int count;		/* number of pages in the list */
@@ -364,6 +364,7 @@ struct zone {

 	/* zone watermarks, access with *_wmark_pages(zone) macros */
 	unsigned long _watermark[NR_WMARK];
+	unsigned long watermark_boost;

 	unsigned long nr_reserved_highatomic;

@@ -885,6 +886,8 @@ static inline int is_highmem(struct zone *zone)
 struct ctl_table;
 int min_free_kbytes_sysctl_handler(struct ctl_table *, int,
 					void __user *, size_t *, loff_t *);
+int watermark_boost_factor_sysctl_handler(struct ctl_table *, int,
+					void __user *, size_t *, loff_t *);
 int watermark_scale_factor_sysctl_handler(struct ctl_table *, int,
 					void __user *, size_t *, loff_t *);
 extern int sysctl_lowmem_reserve_ratio[MAX_NR_ZONES];
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 5fc724e4e454..1825f712e73b 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1462,6 +1462,14 @@ static struct ctl_table vm_table[] = {
 		.proc_handler	= min_free_kbytes_sysctl_handler,
 		.extra1		= &zero,
 	},
+	{
+		.procname	= "watermark_boost_factor",
+		.data		= &watermark_boost_factor,
+		.maxlen		= sizeof(watermark_boost_factor),
+		.mode		= 0644,
+		.proc_handler	= watermark_boost_factor_sysctl_handler,
+		.extra1		= &zero,
+	},
 	{
 		.procname	= "watermark_scale_factor",
 		.data		= &watermark_scale_factor,
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index e44eb68744ed..f9af2a79b1cd 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -263,6 +263,7 @@ compound_page_dtor * const compound_page_dtors[] = {

 int min_free_kbytes = 1024;
 int user_min_free_kbytes = -1;
+int watermark_boost_factor __read_mostly = 15000;
 int watermark_scale_factor = 10;

 static unsigned long nr_kernel_pages __meminitdata;
@@ -2129,6 +2130,21 @@ static bool can_steal_fallback(unsigned int order, int start_mt)
 	return false;
 }

+static inline void boost_watermark(struct zone *zone)
+{
+	unsigned long max_boost;
+
+	if (!watermark_boost_factor)
+		return;
+
+	max_boost = mult_frac(zone->_watermark[WMARK_HIGH],
+			watermark_boost_factor, 10000);
+	max_boost = max(pageblock_nr_pages, max_boost);
+
+	zone->watermark_boost = min(zone->watermark_boost + pageblock_nr_pages,
+		max_boost);
+}
+
 /*
  * This function implements actual steal behaviour. If order is large enough,
  * we can steal whole pageblock. If not, we first move freepages in this
@@ -2138,7 +2154,7 @@ static bool can_steal_fallback(unsigned int order, int start_mt)
  * itself, so pages freed in the future will be put on the correct free list.
*/ static void steal_suitable_fallback(struct zone *zone, struct page *page, - int start_type, bool whole_block) + unsigned int alloc_flags, int start_type, bool whole_block) { unsigned int current_order = page_order(page); struct free_area *area; @@ -2160,6 +2176,15 @@ static void steal_suitable_fallback(struct zone *zone, struct page *page, goto single_page; } + /* + * Boost watermarks to increase reclaim pressure to reduce the + * likelihood of future fallbacks. Wake kswapd now as the node + * may be balanced overall and kswapd will not wake naturally. + */ + boost_watermark(zone); + if (alloc_flags & ALLOC_KSWAPD) + wakeup_kswapd(zone, 0, 0, zone_idx(zone)); + /* We are not allowed to try stealing from the whole block */ if (!whole_block) goto single_page; @@ -2443,7 +2468,8 @@ __rmqueue_fallback(struct zone *zone, int order, int start_migratetype, page = list_first_entry(&area->free_list[fallback_mt], struct page, lru); - steal_suitable_fallback(zone, page, start_migratetype, can_steal); + steal_suitable_fallback(zone, page, alloc_flags, start_migratetype, + can_steal); trace_mm_page_alloc_extfrag(page, order, current_order, start_migratetype, fallback_mt); @@ -7451,6 +7477,7 @@ static void __setup_per_zone_wmarks(void) zone->_watermark[WMARK_LOW] = min_wmark_pages(zone) + tmp; zone->_watermark[WMARK_HIGH] = min_wmark_pages(zone) + tmp * 2; + zone->watermark_boost = 0; spin_unlock_irqrestore(&zone->lock, flags); } @@ -7551,6 +7578,18 @@ int min_free_kbytes_sysctl_handler(struct ctl_table *table, int write, return 0; } +int watermark_boost_factor_sysctl_handler(struct ctl_table *table, int write, + void __user *buffer, size_t *length, loff_t *ppos) +{ + int rc; + + rc = proc_dointvec_minmax(table, write, buffer, length, ppos); + if (rc) + return rc; + + return 0; +} + int watermark_scale_factor_sysctl_handler(struct ctl_table *table, int write, void __user *buffer, size_t *length, loff_t *ppos) { diff --git a/mm/vmscan.c b/mm/vmscan.c index 62ac0c488624..4c96b6356398 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -87,6 +87,9 @@ struct scan_control { /* Can pages be swapped as part of reclaim? */ unsigned int may_swap:1; + /* e.g. boosted watermark reclaim leaves slabs alone */ + unsigned int may_shrinkslab:1; + /* * Cgroups are not reclaimed below their configured memory.low, * unless we threaten to OOM. 
If any cgroups are skipped due to @@ -2755,8 +2758,10 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc) shrink_node_memcg(pgdat, memcg, sc, &lru_pages); node_lru_pages += lru_pages; - shrink_slab(sc->gfp_mask, pgdat->node_id, + if (sc->may_shrinkslab) { + shrink_slab(sc->gfp_mask, pgdat->node_id, memcg, sc->priority); + } /* Record the group's reclaim efficiency */ vmpressure(sc->gfp_mask, memcg, false, @@ -3238,6 +3243,7 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order, .may_writepage = !laptop_mode, .may_unmap = 1, .may_swap = 1, + .may_shrinkslab = 1, }; /* @@ -3282,6 +3288,7 @@ unsigned long mem_cgroup_shrink_node(struct mem_cgroup *memcg, .may_unmap = 1, .reclaim_idx = MAX_NR_ZONES - 1, .may_swap = !noswap, + .may_shrinkslab = 1, }; unsigned long lru_pages; @@ -3328,6 +3335,7 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg, .may_writepage = !laptop_mode, .may_unmap = 1, .may_swap = may_swap, + .may_shrinkslab = 1, }; /* @@ -3378,6 +3386,30 @@ static void age_active_anon(struct pglist_data *pgdat, } while (memcg); } +static bool pgdat_watermark_boosted(pg_data_t *pgdat, int classzone_idx) +{ + int i; + struct zone *zone; + + /* + * Check for watermark boosts top-down as the higher zones + * are more likely to be boosted. Both watermarks and boosts + * should not be checked at the time time as reclaim would + * start prematurely when there is no boosting and a lower + * zone is balanced. + */ + for (i = classzone_idx; i >= 0; i--) { + zone = pgdat->node_zones + i; + if (!managed_zone(zone)) + continue; + + if (zone->watermark_boost) + return true; + } + + return false; +} + /* * Returns true if there is an eligible zone balanced for the request order * and classzone_idx @@ -3388,6 +3420,10 @@ static bool pgdat_balanced(pg_data_t *pgdat, int order, int classzone_idx) unsigned long mark = -1; struct zone *zone; + /* + * Check watermarks bottom-up as lower zones are more likely to + * meet watermarks. + */ for (i = 0; i <= classzone_idx; i++) { zone = pgdat->node_zones + i; @@ -3516,14 +3552,14 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx) unsigned long nr_soft_reclaimed; unsigned long nr_soft_scanned; unsigned long pflags; + unsigned long nr_boost_reclaim; + unsigned long zone_boosts[MAX_NR_ZONES] = { 0, }; + bool boosted; struct zone *zone; struct scan_control sc = { .gfp_mask = GFP_KERNEL, .order = order, - .priority = DEF_PRIORITY, - .may_writepage = !laptop_mode, .may_unmap = 1, - .may_swap = 1, }; psi_memstall_enter(&pflags); @@ -3531,9 +3567,28 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx) count_vm_event(PAGEOUTRUN); + /* + * Account for the reclaim boost. Note that the zone boost is left in + * place so that parallel allocations that are near the watermark will + * stall or direct reclaim until kswapd is finished. + */ + nr_boost_reclaim = 0; + for (i = 0; i <= classzone_idx; i++) { + zone = pgdat->node_zones + i; + if (!managed_zone(zone)) + continue; + + nr_boost_reclaim += zone->watermark_boost; + zone_boosts[i] = zone->watermark_boost; + } + boosted = nr_boost_reclaim; + +restart: + sc.priority = DEF_PRIORITY; do { unsigned long nr_reclaimed = sc.nr_reclaimed; bool raise_priority = true; + bool balanced; bool ret; sc.reclaim_idx = classzone_idx; @@ -3560,13 +3615,40 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx) } /* - * Only reclaim if there are no eligible zones. 
Note that - * sc.reclaim_idx is not used as buffer_heads_over_limit may - * have adjusted it. + * If the pgdat is imbalanced then ignore boosting and preserve + * the watermarks for a later time and restart. Note that the + * zone watermarks will be still reset at the end of balancing + * on the grounds that the normal reclaim should be enough to + * re-evaluate if boosting is required when kswapd next wakes. + */ + balanced = pgdat_balanced(pgdat, sc.order, classzone_idx); + if (!balanced && nr_boost_reclaim) { + nr_boost_reclaim = 0; + goto restart; + } + + /* + * If boosting is not active then only reclaim if there are no + * eligible zones. Note that sc.reclaim_idx is not used as + * buffer_heads_over_limit may have adjusted it. */ - if (pgdat_balanced(pgdat, sc.order, classzone_idx)) + if (!nr_boost_reclaim && balanced) goto out; + /* Limit the priority of boosting to avoid reclaim writeback */ + if (nr_boost_reclaim && sc.priority == DEF_PRIORITY - 2) + raise_priority = false; + + /* + * Do not writeback or swap pages for boosted reclaim. The + * intent is to relieve pressure not issue sub-optimal IO + * from reclaim context. If no pages are reclaimed, the + * reclaim will be aborted. + */ + sc.may_writepage = !laptop_mode && !nr_boost_reclaim; + sc.may_swap = !nr_boost_reclaim; + sc.may_shrinkslab = !nr_boost_reclaim; + /* * Do some background aging of the anon list, to give * pages a chance to be referenced before reclaiming. All @@ -3618,6 +3700,16 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx) * progress in reclaiming pages */ nr_reclaimed = sc.nr_reclaimed - nr_reclaimed; + nr_boost_reclaim -= min(nr_boost_reclaim, nr_reclaimed); + + /* + * If reclaim made no progress for a boost, stop reclaim as + * IO cannot be queued and it could be an infinite loop in + * extreme circumstances. + */ + if (nr_boost_reclaim && !nr_reclaimed) + break; + if (raise_priority || !nr_reclaimed) sc.priority--; } while (sc.priority >= 1); @@ -3626,6 +3718,28 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx) pgdat->kswapd_failures++; out: + /* If reclaim was boosted, account for the reclaim done in this pass */ + if (boosted) { + unsigned long flags; + + for (i = 0; i <= classzone_idx; i++) { + if (!zone_boosts[i]) + continue; + + /* Increments are under the zone lock */ + zone = pgdat->node_zones + i; + spin_lock_irqsave(&zone->lock, flags); + zone->watermark_boost -= min(zone->watermark_boost, zone_boosts[i]); + spin_unlock_irqrestore(&zone->lock, flags); + } + + /* + * As there is now likely space, wakeup kcompact to defragment + * pageblocks. + */ + wakeup_kcompactd(pgdat, pageblock_order, classzone_idx); + } + snapshot_refaults(NULL, pgdat); __fs_reclaim_release(); psi_memstall_leave(&pflags); @@ -3854,7 +3968,8 @@ void wakeup_kswapd(struct zone *zone, gfp_t gfp_flags, int order, /* Hopeless node, leave it to direct reclaim if possible */ if (pgdat->kswapd_failures >= MAX_RECLAIM_RETRIES || - pgdat_balanced(pgdat, order, classzone_idx)) { + (pgdat_balanced(pgdat, order, classzone_idx) && + !pgdat_watermark_boosted(pgdat, classzone_idx))) { /* * There may be plenty of free memory available, but it's too * fragmented for high-order allocations. 
Wake up kcompactd

From patchwork Fri Nov 23 11:45:28 2018
X-Patchwork-Submitter: Mel Gorman
X-Patchwork-Id: 10695667
From: Mel Gorman
To: Andrew Morton
Cc: Vlastimil Babka, David Rientjes, Andrea Arcangeli, Zi Yan, Michal Hocko,
 LKML, Linux-MM, Mel Gorman
Subject: [PATCH 5/5] mm: Stall movable allocations until kswapd progresses
 during serious external fragmentation event
Date: Fri, 23 Nov 2018 11:45:28 +0000
Message-Id: <20181123114528.28802-6-mgorman@techsingularity.net>
X-Mailer: git-send-email 2.16.4
In-Reply-To: <20181123114528.28802-1-mgorman@techsingularity.net>
References: <20181123114528.28802-1-mgorman@techsingularity.net>

An event that potentially causes external fragmentation problems has
already been described but there are degrees of severity. A "serious"
event is defined as one that steals a contiguous range of pages of an
order lower than fragment_stall_order (PAGE_ALLOC_COSTLY_ORDER by
default). If a movable allocation request that is allowed to sleep needs
to steal a small block then it schedules until kswapd makes progress or
a timeout passes. The watermarks are also boosted slightly faster so
that kswapd makes greater effort to reclaim enough pages to avoid the
fragmentation event.

This stall is not guaranteed to avoid serious fragmentation events. If
memory pressure is high enough, the pages freed by kswapd may be
reallocated or the free pages may not be in pageblocks that contain only
movable pages.
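As an editorial illustration (not part of the patch), the eligibility rule
described above, that only movable requests which are already allowed to
enter direct reclaim may stall, can be sketched as a standalone check. The
bit values below are stand-ins rather than the kernel's gfp.h definitions;
only the mask test mirrors the __alloc_pages_slowpath() hunk later in this
patch.

/*
 * Editorial sketch, not part of the patch: which requests may stall.
 * The bit values are stand-ins, not the kernel's gfp.h definitions.
 */
#include <stdbool.h>
#include <stdio.h>

#define GFP_MOVABLE_BIT		0x01u	/* stand-in for __GFP_MOVABLE */
#define GFP_DIRECT_RECLAIM_BIT	0x02u	/* stand-in for __GFP_DIRECT_RECLAIM */

/* Stall only for movable requests that are already willing to sleep */
static bool may_fragment_stall(unsigned int gfp_mask)
{
	unsigned int need = GFP_MOVABLE_BIT | GFP_DIRECT_RECLAIM_BIT;

	return (gfp_mask & need) == need;
}

int main(void)
{
	printf("movable, can sleep: %d\n",
	       may_fragment_stall(GFP_MOVABLE_BIT | GFP_DIRECT_RECLAIM_BIT));	/* 1 */
	printf("movable, atomic:    %d\n",
	       may_fragment_stall(GFP_MOVABLE_BIT));				/* 0 */
	printf("unmovable:          %d\n",
	       may_fragment_stall(GFP_DIRECT_RECLAIM_BIT));			/* 0 */
	return 0;
}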
Furthermore, an allocation request that cannot stall (e.g. atomic
allocations), as well as unmovable/reclaimable allocations, will still
proceed without stalling. The reason is that movable allocations can be
migrated and stalling for kswapd to make progress means that compaction
has targets. Unmovable/reclaimable allocations, on the other hand, do not
benefit from stalling as their pages cannot move.

The worst-case scenario for stalling is a combination of high memory
pressure, where kswapd is having trouble keeping free pages above the
pfmemalloc_reserve, and movable allocations that are fragmenting memory.
In this case, an allocation request may sleep for longer. There are
vmstat counters to identify that stalls are happening and a tracepoint
to quantify the stall durations. Note that the granularity of the stall
detection is a jiffy, so the delay accounting is not precise.

1-socket Skylake machine
config-global-dhp__workload_thpfioscale XFS (no special madvise)
4 fio threads, 1 THP allocating thread
--------------------------------------
4.20-rc3 extfrag events < order 9:   804694
4.20-rc3+patch:                      408912 (49% reduction)
4.20-rc3+patch1-4:                    18421 (98% reduction)
4.20-rc3+patch1-5:                    16788 (98% reduction)

                                   4.20.0-rc3             4.20.0-rc3
                                   boost-v5r8             stall-v5r8
Amean     fault-base-1      652.71 (   0.00%)      651.40 (   0.20%)
Amean     fault-huge-1      178.93 (   0.00%)      174.49 *   2.48%*

thpfioscale Percentage Faults Huge
                                   4.20.0-rc3             4.20.0-rc3
                                   boost-v5r8             stall-v5r8
Percentage huge-1                5.12 (   0.00%)        5.56 (   8.77%)

Fragmentation events are further reduced. Note that in previous versions
they were reduced to negligible levels, but the logic has since been
corrected to avoid excessive reclaim and slab shrinkage and the IO
regressions that would not have been tolerable. The latencies and
allocation success rates are roughly similar. Over the course of 16
minutes, there were 2 stalls due to fragmentation avoidance, for a total
of 8 microseconds.

1-socket Skylake machine
global-dhp__workload_thpfioscale-madvhugepage-xfs (MADV_HUGEPAGE)
-----------------------------------------------------------------
4.20-rc3 extfrag events < order 9:   291392
4.20-rc3+patch:                      191187 (34% reduction)
4.20-rc3+patch1-4:                    13464 (95% reduction)
4.20-rc3+patch1-5:                    15089 (95% reduction)

                                   4.20.0-rc3             4.20.0-rc3
                                   boost-v5r8             stall-v5r8
Amean     fault-base-1     1481.67 (   0.00%)        0.00 * 100.00%*
Amean     fault-huge-1     1063.88 (   0.00%)      540.81 *  49.17%*

                                   4.20.0-rc3             4.20.0-rc3
                                   boost-v5r8             stall-v5r8
Percentage huge-1               83.46 (   0.00%)      100.00 (  19.82%)

The fragmentation events increased, which is bad, but this is offset by
the fact that THP allocations had lower latency and a perfect success
rate. There were 102 stalls over the course of 16 minutes for a total
stall time of roughly 0.4 seconds.

2-socket Haswell machine
config-global-dhp__workload_thpfioscale XFS (no special madvise)
4 fio threads, 5 THP allocating threads
----------------------------------------------------------------
4.20-rc3 extfrag events < order 9:   215698
4.20-rc3+patch:                      200210 (7% reduction)
4.20-rc3+patch1-4:                    14263 (93% reduction)
4.20-rc3+patch1-5:                    11702 (95% reduction)

                                   4.20.0-rc3             4.20.0-rc3
                                   boost-v5r8             stall-v5r8
Amean     fault-base-5     1306.87 (   0.00%)     1340.96 (  -2.61%)
Amean     fault-huge-5     1348.94 (   0.00%)     2089.44 ( -54.89%)

                                   4.20.0-rc3             4.20.0-rc3
                                   boost-v5r8             stall-v5r8
Percentage huge-5                7.91 (   0.00%)        2.43 ( -69.26%)

There is a slight reduction in fragmentation events, but it is slight
enough that it may be due to luck. Unfortunately, both the latencies and
the success rates were worse.
However, this is highly likely to be due to luck given that there were
just 12 stalls, for a total of 76 microseconds. Direct reclaim was also
eliminated, but that is likely a coincidence.

2-socket Haswell machine
global-dhp__workload_thpfioscale-madvhugepage-xfs (MADV_HUGEPAGE)
-----------------------------------------------------------------
4.20-rc3 extfrag events < order 9:   166352
4.20-rc3+patch:                      147463 (11% reduction)
4.20-rc3+patch1-4:                    11095 (93% reduction)
4.20-rc3+patch1-5:                    10677 (94% reduction)

thpfioscale Fault Latencies
                                   4.20.0-rc3             4.20.0-rc3
                                   boost-v5r8             stall-v5r8
Amean     fault-base-5     7419.67 (   0.00%)     6853.97 (   7.62%)
Amean     fault-huge-5     3263.80 (   0.00%)     1799.26 *  44.87%*

                                   4.20.0-rc3             4.20.0-rc3
                                   boost-v5r8             stall-v5r8
Percentage huge-5               87.98 (   0.00%)       98.97 (  12.49%)

The fragmentation events are slightly reduced, with the latencies and
allocation success rates much improved. There were 462 stalls over the
course of 68 minutes with a total stall time of roughly 1.9 seconds.

This patch has a marginal impact on fragmentation rates as it is rare for
the stall logic to actually trigger, but the small stalls can be enough
for kswapd to catch up. How much that helps is variable, but it is
probably worthwhile for long-term allocation success rates. With this
patch it is possible to eliminate fragmentation events entirely through
tuning, although that would require careful evaluation to determine
whether it is worthwhile.

Signed-off-by: Mel Gorman
---
 Documentation/sysctl/vm.txt   |  23 +++++++
 include/linux/mm.h            |   1 +
 include/linux/mmzone.h        |   2 +
 include/linux/vm_event_item.h |   1 +
 include/trace/events/kmem.h   |  21 +++++++
 kernel/sysctl.c               |  10 ++++
 mm/internal.h                 |   1 +
 mm/page_alloc.c               | 105 +++++++++++++++++++++++++++++++++++++-----
 mm/vmscan.c                   |   3 +-
 mm/vmstat.c                   |   1 +
 10 files changed, 155 insertions(+), 13 deletions(-)

diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
index 187ce4f599a2..d07e2f01f429 100644
--- a/Documentation/sysctl/vm.txt
+++ b/Documentation/sysctl/vm.txt
@@ -31,6 +31,7 @@ files can be found in mm/swap.c.
 - dirty_writeback_centisecs
 - drop_caches
 - extfrag_threshold
+- fragment_stall_order
 - hugetlb_shm_group
 - laptop_mode
 - legacy_va_layout
@@ -275,6 +276,28 @@ any throttling.
 
 ==============================================================
 
+fragment_stall_order
+
+External fragmentation control is managed on a pageblock level where the
+page allocator tries to avoid mixing pages of different mobility within page
+blocks (e.g. order 9 on 64-bit x86). If external fragmentation is perfectly
+controlled then a THP allocation will often succeed up to the number of
+movable pageblocks in the system as reported by /proc/pagetypeinfo.
+
+When memory is low, the system may have to mix pageblocks and will wake
+kswapd to try to control future fragmentation. fragment_stall_order controls
+whether the allocating task will stall, if possible, until kswapd makes some
+progress in preference to fragmenting the system. This incurs a small stall
+penalty in exchange for future success at allocating huge pages. If the
+stalls are undesirable and high-order allocations are irrelevant then this
+can be disabled by writing 0 to the tunable. Writing the pageblock order will
+strongly (but not perfectly) control external fragmentation.
+
+The default (4) will stall for fragmenting allocations less than or equal
+to the PAGE_ALLOC_COSTLY_ORDER (defined as order-3 at the time of writing).
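As another editorial illustration (not part of the patch), the effect of the
tunable just documented can be sketched with assumed constants (order-9
pageblocks and a PAGE_ALLOC_COSTLY_ORDER of 3); only the comparison mirrors
the steal_suitable_fallback() change later in this patch.

/*
 * Editorial sketch, not part of the patch: which fallback steals count
 * as "serious" for a given fragment_stall_order.  Constants are assumed.
 */
#include <stdbool.h>
#include <stdio.h>

static unsigned int fragment_stall_order = 4;	/* default: PAGE_ALLOC_COSTLY_ORDER + 1 */

/* A steal of a contiguous range below this order is a stall candidate */
static bool steal_is_serious(unsigned int stolen_order)
{
	return stolen_order < fragment_stall_order;
}

int main(void)
{
	unsigned int order;

	for (order = 0; order <= 9; order++)	/* order 9 == pageblock on 64-bit x86 */
		printf("order-%u steal: %s\n", order,
		       steal_is_serious(order) ? "stall candidate" : "no stall");

	fragment_stall_order = 0;		/* writing 0 disables stalling */
	printf("disabled, order-0 steal: %d\n", steal_is_serious(0));
	return 0;
}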
+ +============================================================== + hugetlb_shm_group hugetlb_shm_group contains group id that is allowed to create SysV diff --git a/include/linux/mm.h b/include/linux/mm.h index 2c4c69508413..2b72de790ef9 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -2204,6 +2204,7 @@ extern void zone_pcp_reset(struct zone *zone); extern int min_free_kbytes; extern int watermark_boost_factor; extern int watermark_scale_factor; +extern int fragment_stall_order; /* nommu.c */ extern atomic_long_t mmap_pages_allocated; diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index d352c1dab486..cffec484ac8a 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -890,6 +890,8 @@ int watermark_boost_factor_sysctl_handler(struct ctl_table *, int, void __user *, size_t *, loff_t *); int watermark_scale_factor_sysctl_handler(struct ctl_table *, int, void __user *, size_t *, loff_t *); +int fragment_stall_order_sysctl_handler(struct ctl_table *, int, + void __user *, size_t *, loff_t *); extern int sysctl_lowmem_reserve_ratio[MAX_NR_ZONES]; int lowmem_reserve_ratio_sysctl_handler(struct ctl_table *, int, void __user *, size_t *, loff_t *); diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h index 47a3441cf4c4..7661abe5236e 100644 --- a/include/linux/vm_event_item.h +++ b/include/linux/vm_event_item.h @@ -43,6 +43,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT, PAGEOUTRUN, PGROTATED, DROP_PAGECACHE, DROP_SLAB, OOM_KILL, + FRAGMENTSTALL, #ifdef CONFIG_NUMA_BALANCING NUMA_PTE_UPDATES, NUMA_HUGE_PTE_UPDATES, diff --git a/include/trace/events/kmem.h b/include/trace/events/kmem.h index eb57e3037deb..caadd8681ac5 100644 --- a/include/trace/events/kmem.h +++ b/include/trace/events/kmem.h @@ -315,6 +315,27 @@ TRACE_EVENT(mm_page_alloc_extfrag, __entry->change_ownership) ); +TRACE_EVENT(mm_fragmentation_stall, + + TP_PROTO(int nid, unsigned long duration), + + TP_ARGS(nid, duration), + + TP_STRUCT__entry( + __field( int, nid ) + __field( unsigned long, duration ) + ), + + TP_fast_assign( + __entry->nid = nid; + __entry->duration = duration + ), + + TP_printk("nid=%d duration=%lu", + __entry->nid, + __entry->duration) +); + #endif /* _TRACE_KMEM_H */ /* This part must be outside protection */ diff --git a/kernel/sysctl.c b/kernel/sysctl.c index 1825f712e73b..eb09c79ddbef 100644 --- a/kernel/sysctl.c +++ b/kernel/sysctl.c @@ -126,6 +126,7 @@ static int zero; static int __maybe_unused one = 1; static int __maybe_unused two = 2; static int __maybe_unused four = 4; +static int __maybe_unused max_order = MAX_ORDER; static unsigned long one_ul = 1; static int one_hundred = 100; static int one_thousand = 1000; @@ -1479,6 +1480,15 @@ static struct ctl_table vm_table[] = { .extra1 = &one, .extra2 = &one_thousand, }, + { + .procname = "fragment_stall_order", + .data = &fragment_stall_order, + .maxlen = sizeof(fragment_stall_order), + .mode = 0644, + .proc_handler = fragment_stall_order_sysctl_handler, + .extra1 = &zero, + .extra2 = &max_order, + }, { .procname = "percpu_pagelist_fraction", .data = &percpu_pagelist_fraction, diff --git a/mm/internal.h b/mm/internal.h index be826ee9dc7f..236fa39b9835 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -490,6 +490,7 @@ unsigned long reclaim_clean_pages_from_list(struct zone *zone, #define ALLOC_NOFRAGMENT 0x0 #endif #define ALLOC_KSWAPD 0x200 /* allow waking of kswapd */ +#define ALLOC_FRAGMENT_STALL 0x400 /* stall if fragmenting heavily */ enum ttu_flags; struct tlbflush_unmap_batch; diff 
--git a/mm/page_alloc.c b/mm/page_alloc.c index f9af2a79b1cd..da56ecdfffa7 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -265,6 +265,7 @@ int min_free_kbytes = 1024; int user_min_free_kbytes = -1; int watermark_boost_factor __read_mostly = 15000; int watermark_scale_factor = 10; +int fragment_stall_order __read_mostly = (PAGE_ALLOC_COSTLY_ORDER + 1); static unsigned long nr_kernel_pages __meminitdata; static unsigned long nr_all_pages __meminitdata; @@ -2130,9 +2131,10 @@ static bool can_steal_fallback(unsigned int order, int start_mt) return false; } -static inline void boost_watermark(struct zone *zone) +static inline void __boost_watermark(struct zone *zone, bool fast_boost) { unsigned long max_boost; + unsigned long nr; if (!watermark_boost_factor) return; @@ -2140,9 +2142,45 @@ static inline void boost_watermark(struct zone *zone) max_boost = mult_frac(zone->_watermark[WMARK_HIGH], watermark_boost_factor, 10000); max_boost = max(pageblock_nr_pages, max_boost); + nr = pageblock_nr_pages; - zone->watermark_boost = min(zone->watermark_boost + pageblock_nr_pages, - max_boost); + /* Scale relative to the MIGRATE_PCPTYPES similar to min_free_kbytes */ + if (fast_boost) + nr += pageblock_nr_pages * (MIGRATE_PCPTYPES << 1); + + zone->watermark_boost = min(zone->watermark_boost + nr, max_boost); +} + +static inline void boost_watermark(struct zone *zone, bool fast_boost) +{ + unsigned long flags; + + spin_lock_irqsave(&zone->lock, flags); + __boost_watermark(zone, fast_boost); + spin_unlock_irqrestore(&zone->lock, flags); +} + +static void stall_fragmentation(struct zone *pzone) +{ + DEFINE_WAIT(wait); + long remaining = 0; + long timeout = HZ/50; + pg_data_t *pgdat = pzone->zone_pgdat; + + if (current->flags & PF_MEMALLOC) + return; + + boost_watermark(pzone, true); + prepare_to_wait(&pgdat->pfmemalloc_wait, &wait, TASK_INTERRUPTIBLE); + if (waitqueue_active(&pgdat->kswapd_wait)) + wake_up_interruptible(&pgdat->kswapd_wait); + remaining = schedule_timeout(timeout); + finish_wait(&pgdat->pfmemalloc_wait, &wait); + if (remaining != timeout) { + trace_mm_fragmentation_stall(pgdat->node_id, + jiffies_to_usecs(timeout - remaining)); + count_vm_event(FRAGMENTSTALL); + } } /* @@ -2153,7 +2191,7 @@ static inline void boost_watermark(struct zone *zone) * of pages are free or compatible, we can change migratetype of the pageblock * itself, so pages freed in the future will be put on the correct free list. */ -static void steal_suitable_fallback(struct zone *zone, struct page *page, +static bool steal_suitable_fallback(struct zone *zone, struct page *page, unsigned int alloc_flags, int start_type, bool whole_block) { unsigned int current_order = page_order(page); @@ -2181,10 +2219,15 @@ static void steal_suitable_fallback(struct zone *zone, struct page *page, * likelihood of future fallbacks. Wake kswapd now as the node * may be balanced overall and kswapd will not wake naturally. 
*/ - boost_watermark(zone); + __boost_watermark(zone, false); if (alloc_flags & ALLOC_KSWAPD) wakeup_kswapd(zone, 0, 0, zone_idx(zone)); + if ((alloc_flags & ALLOC_FRAGMENT_STALL) && + current_order < fragment_stall_order) { + return false; + } + /* We are not allowed to try stealing from the whole block */ if (!whole_block) goto single_page; @@ -2225,11 +2268,12 @@ static void steal_suitable_fallback(struct zone *zone, struct page *page, page_group_by_mobility_disabled) set_pageblock_migratetype(page, start_type); - return; + return true; single_page: area = &zone->free_area[current_order]; list_move(&page->lru, &area->free_list[start_type]); + return true; } /* @@ -2468,14 +2512,15 @@ __rmqueue_fallback(struct zone *zone, int order, int start_migratetype, page = list_first_entry(&area->free_list[fallback_mt], struct page, lru); - steal_suitable_fallback(zone, page, alloc_flags, start_migratetype, - can_steal); + if (!steal_suitable_fallback(zone, page, alloc_flags, + start_migratetype, can_steal)) { + return false; + } trace_mm_page_alloc_extfrag(page, order, current_order, start_migratetype, fallback_mt); return true; - } /* @@ -3343,9 +3388,11 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags, const struct alloc_context *ac) { struct zoneref *z; + struct zone *pzone; struct zone *zone; struct pglist_data *last_pgdat_dirty_limit = NULL; bool no_fallback; + bool fragment_stall; retry: /* @@ -3353,7 +3400,10 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags, * See also __cpuset_node_allowed() comment in kernel/cpuset.c. */ no_fallback = alloc_flags & ALLOC_NOFRAGMENT; + fragment_stall = alloc_flags & ALLOC_FRAGMENT_STALL; z = ac->preferred_zoneref; + pzone = z->zone; + for_next_zone_zonelist_nodemask(zone, z, ac->zonelist, ac->high_zoneidx, ac->nodemask) { struct page *page; @@ -3392,7 +3442,7 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags, } } - if (no_fallback && nr_online_nodes > 1 && + if ((no_fallback || fragment_stall) && nr_online_nodes > 1 && zone != ac->preferred_zoneref->zone) { int local_nid; @@ -3401,9 +3451,12 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags, * fragmenting fallbacks. Locality is more important * than fragmentation avoidance. */ - local_nid = zone_to_nid(ac->preferred_zoneref->zone); + local_nid = zone_to_nid(pzone); if (zone_to_nid(zone) != local_nid) { + if (fragment_stall) + stall_fragmentation(pzone); alloc_flags &= ~ALLOC_NOFRAGMENT; + alloc_flags &= ~ALLOC_FRAGMENT_STALL; goto retry; } } @@ -3479,8 +3532,12 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags, * It's possible on a UMA machine to get through all zones that are * fragmented. If avoiding fragmentation, reset and try again. */ - if (no_fallback) { + if (no_fallback || fragment_stall) { + if (fragment_stall) + stall_fragmentation(pzone); + alloc_flags &= ~ALLOC_NOFRAGMENT; + alloc_flags &= ~ALLOC_FRAGMENT_STALL; goto retry; } @@ -4194,6 +4251,17 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order, */ alloc_flags = gfp_to_alloc_flags(gfp_mask); + /* + * Consider stalling on heavy for movable allocations in preference to + * fragmenting unmovable/reclaimable pageblocks. A caller that is + * willing to direct reclaim is already willing to stall. + * Unmovable/reclaimable allocations do not stall as kswapd is not + * guaranteed to free pages in their respective pageblocks. 
+ */ + if ((gfp_mask & (__GFP_MOVABLE|__GFP_DIRECT_RECLAIM)) == + (__GFP_MOVABLE|__GFP_DIRECT_RECLAIM)) + alloc_flags |= ALLOC_FRAGMENT_STALL; + /* * We need to recalculate the starting point for the zonelist iterator * because we might have used different nodemask in the fast path, or @@ -4215,6 +4283,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order, page = get_page_from_freelist(gfp_mask, order, alloc_flags, ac); if (page) goto got_pg; + alloc_flags &= ~ALLOC_FRAGMENT_STALL; /* * For costly allocations, try direct compaction first, as it's likely @@ -7590,6 +7659,18 @@ int watermark_boost_factor_sysctl_handler(struct ctl_table *table, int write, return 0; } +int fragment_stall_order_sysctl_handler(struct ctl_table *table, int write, + void __user *buffer, size_t *length, loff_t *ppos) +{ + int rc; + + rc = proc_dointvec_minmax(table, write, buffer, length, ppos); + if (rc) + return rc; + + return 0; +} + int watermark_scale_factor_sysctl_handler(struct ctl_table *table, int write, void __user *buffer, size_t *length, loff_t *ppos) { diff --git a/mm/vmscan.c b/mm/vmscan.c index 4c96b6356398..c3350ed4ff7e 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -3685,7 +3685,8 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx) * able to safely make forward progress. Wake them */ if (waitqueue_active(&pgdat->pfmemalloc_wait) && - allow_direct_reclaim(pgdat)) + ((!raise_priority && nr_boost_reclaim) || + allow_direct_reclaim(pgdat))) wake_up_all(&pgdat->pfmemalloc_wait); /* Check if kswapd should be suspending */ diff --git a/mm/vmstat.c b/mm/vmstat.c index 9c624595e904..6cc7755c6eb1 100644 --- a/mm/vmstat.c +++ b/mm/vmstat.c @@ -1211,6 +1211,7 @@ const char * const vmstat_text[] = { "drop_pagecache", "drop_slab", "oom_kill", + "fragment_stall", #ifdef CONFIG_NUMA_BALANCING "numa_pte_updates",