From patchwork Wed Oct 31 16:06:41 2018
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Mel Gorman <mgorman@techsingularity.net>
X-Patchwork-Id: 10662893
Return-Path: <owner-linux-mm@kvack.org>
Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org
 [172.30.200.125])
	by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 352A41751
	for <patchwork-linux-mm@patchwork.kernel.org>;
 Wed, 31 Oct 2018 16:06:57 +0000 (UTC)
Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1])
	by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 2463426D08
	for <patchwork-linux-mm@patchwork.kernel.org>;
 Wed, 31 Oct 2018 16:06:57 +0000 (UTC)
Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486)
	id 180A12929C; Wed, 31 Oct 2018 16:06:57 +0000 (UTC)
X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on
	pdx-wl-mail.web.codeaurora.org
X-Spam-Level: 
X-Spam-Status: No, score=-2.9 required=2.0 tests=BAYES_00,MAILING_LIST_MULTI,
	RCVD_IN_DNSWL_NONE autolearn=ham version=3.3.1
Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17])
	by mail.wl.linuxfoundation.org (Postfix) with ESMTP id DD0B428BAA
	for <patchwork-linux-mm@patchwork.kernel.org>;
 Wed, 31 Oct 2018 16:06:55 +0000 (UTC)
Received: by kanga.kvack.org (Postfix)
	id 20EFC6B026D; Wed, 31 Oct 2018 12:06:50 -0400 (EDT)
Delivered-To: linux-mm-outgoing@kvack.org
Received: by kanga.kvack.org (Postfix, from userid 40)
	id 1D10F6B026E; Wed, 31 Oct 2018 12:06:50 -0400 (EDT)
X-Original-To: int-list-linux-mm@kvack.org
X-Delivered-To: int-list-linux-mm@kvack.org
Received: by kanga.kvack.org (Postfix, from userid 63042)
	id EDC476B026E; Wed, 31 Oct 2018 12:06:49 -0400 (EDT)
X-Original-To: linux-mm@kvack.org
X-Delivered-To: linux-mm@kvack.org
Received: from mail-ed1-f71.google.com (mail-ed1-f71.google.com
 [209.85.208.71])
	by kanga.kvack.org (Postfix) with ESMTP id 7D4046B026A
	for <linux-mm@kvack.org>; Wed, 31 Oct 2018 12:06:49 -0400 (EDT)
Received: by mail-ed1-f71.google.com with SMTP id y5-v6so8327716edp.7
        for <linux-mm@kvack.org>; Wed, 31 Oct 2018 09:06:49 -0700 (PDT)
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20161025;
        h=x-original-authentication-results:x-gm-message-state:from:to:cc
         :subject:date:message-id:in-reply-to:references;
        bh=a+33b9G/K//tZKJhcBuROou5CybzSzWoXKv/Zi0wBv0=;
        b=pdypl3B18mbY9f5p8kq7C+APUwi+w83o2OFnh3Vy4JRkzvFCXh6K6YKYatmYTyzls0
         gYg1mphsau2RKyadGvqwCwwW7Xm/FLZfqw/4afaw8wycKF/CchWSQvlIY+jVpN45aMXM
         PXyNHVYOhg0uQre8Oj5mTc7KXOYvZF5eEYJQk3lD85eSKHW0Q3+wjuxQCSe0VfrlVZ1G
         /gwexWn0oFE0r25Oj6jeDO4WU9oMyKQkdM31ZH6tJ0FJbIggqD4smtxs8c/43woCe1Vg
         wZYwzVyag1RCpKABEgBBnB02na00IaOw8xI4VGvkRM7X9YUFOkPEvNcZ+kFid3abqowl
         66tw==
X-Original-Authentication-Results: mx.google.com;
       spf=pass (google.com: domain of mgorman@techsingularity.net designates
 81.17.249.195 as permitted sender) smtp.mailfrom=mgorman@techsingularity.net
X-Gm-Message-State: AGRZ1gLKNFqV5lHEdiFVHurEl0r8MrAc59sSdDGElA6x96dVFtQTk6Zo
	FmEfvoONNEVNtFauLsCSHHeoSAY8X1gIkLlebVfezjUqXi6qXFLvOpoI6iynGVFJ39bGvk8CAiB
	lRXHZ4S3ro6XKPSNYWc7v8vrZa1OpIj2R3bWpjqswH+mcVZpYlh4qkIqy31GCBtaakA==
X-Received: by 2002:a17:906:c149:: with SMTP id
 bp9-v6mr1963096ejb.82.1541002008877;
        Wed, 31 Oct 2018 09:06:48 -0700 (PDT)
X-Google-Smtp-Source: 
 AJdET5d382bOIzeTgBwk/Nrk1BS1+1mEM2k0f3F2TQ+u7RGauwDN28zPE5w4o4P7ASNqmmpAiF1s
X-Received: by 2002:a17:906:c149:: with SMTP id
 bp9-v6mr1963013ejb.82.1541002006684;
        Wed, 31 Oct 2018 09:06:46 -0700 (PDT)
ARC-Seal: i=1; a=rsa-sha256; t=1541002006; cv=none;
        d=google.com; s=arc-20160816;
        b=nToDhybnM+Hr4TBmRlQyjBxfyyQvatWHuc3L7TCEc0ny74K55Gh9wj1gi9z/AS5vSU
         ALpRPekhTULukPyCSWQQRcNRCjbRSg/4VUmrE4rM7SM84V9pu5vx7GwOuLEN8xH47zda
         gTTUWXEQujFlMLpUHkgqQPX+3W4KaF/YhCb+JMjl/0gN7VQeQq9zJioouzlIAiIiIPWA
         MZTKR0w6LcdwD0uHmVh/TefI4Y5PC8wJym5te2YeI5STmbHoTyQUd4z++rF//uCicGB6
         yQAsyu8f1ndQVXa8WImmafVLuHNUAJtwmaTSA8A9jr2eod1EALpsb7rGPxgFHFa57KjU
         GN6A==
ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com;
 s=arc-20160816;
        h=references:in-reply-to:message-id:date:subject:cc:to:from;
        bh=a+33b9G/K//tZKJhcBuROou5CybzSzWoXKv/Zi0wBv0=;
        b=W+t5SFt0Mtvid4Ga8+kvipONNAefv3cgnCMsXFqhLxghHx3/VEgDpYn36nb+cEodmU
         hVtyb9Mfvl9Aq/H1sJJEtdQThDKjdooU8W2YGqXVQtbTfF/EOq7XlU0SBx3XxLhsmGXs
         IjutNUh6AZzBl/jRtv8XTVAUMio0GKfu17C/T8LhhSOiDpUNNjiUCUmKoIIA76lr1/O9
         D2qQycZxk8ZCELICGQKhjGwH7kB8+5UzrGXI2vU3R5sdZ0uMwe8nwN+0eUjkFUiBwHzt
         l6ELpFXRJXDbdGCDSfHUfanMIak5ZlSeZaP2FbfOpTX0tk2qRr2Osf11r4BtBhPwPzVy
         j46A==
ARC-Authentication-Results: i=1; mx.google.com;
       spf=pass (google.com: domain of mgorman@techsingularity.net designates
 81.17.249.195 as permitted sender) smtp.mailfrom=mgorman@techsingularity.net
Received: from outbound-smtp27.blacknight.com (outbound-smtp27.blacknight.com.
 [81.17.249.195])
        by mx.google.com with ESMTPS id
 c29-v6si1922507eda.227.2018.10.31.09.06.46
        for <linux-mm@kvack.org>
        (version=TLS1 cipher=AES128-SHA bits=128/128);
        Wed, 31 Oct 2018 09:06:46 -0700 (PDT)
Received-SPF: pass (google.com: domain of mgorman@techsingularity.net
 designates 81.17.249.195 as permitted sender) client-ip=81.17.249.195;
Authentication-Results: mx.google.com;
       spf=pass (google.com: domain of mgorman@techsingularity.net designates
 81.17.249.195 as permitted sender) smtp.mailfrom=mgorman@techsingularity.net
Received: from mail.blacknight.com (pemlinmail04.blacknight.ie [81.17.254.17])
	by outbound-smtp27.blacknight.com (Postfix) with ESMTPS id 5CDCAB88E8
	for <linux-mm@kvack.org>; Wed, 31 Oct 2018 16:06:46 +0000 (GMT)
Received: (qmail 5561 invoked from network); 31 Oct 2018 16:06:46 -0000
Received: from unknown (HELO stampy.163woodhaven.lan)
 (mgorman@techsingularity.net@[37.228.229.142])
  by 81.17.254.9 with ESMTPA; 31 Oct 2018 16:06:46 -0000
From: Mel Gorman <mgorman@techsingularity.net>
To: Linux-MM <linux-mm@kvack.org>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	Vlastimil Babka <vbabka@suse.cz>,
	David Rientjes <rientjes@google.com>,
	Andrea Arcangeli <aarcange@redhat.com>,
	Zi Yan <zi.yan@cs.rutgers.edu>,
	LKML <linux-kernel@vger.kernel.org>,
	Mel Gorman <mgorman@techsingularity.net>
Subject: [PATCH 1/5] mm,
 page_alloc: Spread allocations across zones before introducing fragmentation
Date: Wed, 31 Oct 2018 16:06:41 +0000
Message-Id: <20181031160645.7633-2-mgorman@techsingularity.net>
X-Mailer: git-send-email 2.16.4
In-Reply-To: <20181031160645.7633-1-mgorman@techsingularity.net>
References: <20181031160645.7633-1-mgorman@techsingularity.net>
X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4
Sender: owner-linux-mm@kvack.org
Precedence: bulk
X-Loop: owner-majordomo@kvack.org
List-ID: <linux-mm.kvack.org>
X-Virus-Scanned: ClamAV using ClamSMTP

The page allocator zone lists are iterated based on the watermarks
of each zone which does not take anti-fragmentation into account. On
x86, node 0 may have multiple zones while other nodes have one zone. A
consequence is that tasks running on node 0 may fragment ZONE_NORMAL even
though ZONE_DMA32 has plenty of free memory. This patch special cases
the allocator fast path such that it'll try an allocation from a lower
local zone before fragmenting a higher zone. In this case, stealing of
pageblocks or orders larger than a pageblock are still allowed in the
fast path as they are uninteresting from a fragmentation point of view.
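
The policy described above can be illustrated with a small userspace toy
model. This is only a sketch of the intent, not the kernel implementation:
the struct toy_zone type, its fields and the toy_alloc() helper are all
hypothetical names invented for the example, and real zones track far more
state than two counters.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Toy model (assumption, not kernel code): a zone has free pages of the
 * requested migratetype, plus free pages that can only be obtained by
 * stealing from another migratetype, which is a fragmenting fallback.
 */
struct toy_zone {
	const char *name;
	int nid;		/* NUMA node the zone belongs to */
	long free_same_type;	/* pages usable without fragmenting */
	long free_other_type;	/* pages usable only via a fallback */
};

/* Try a single zone; with nofragment set, refuse fragmenting fallbacks. */
static bool toy_try_zone(struct toy_zone *z, bool nofragment)
{
	if (z->free_same_type > 0) {
		z->free_same_type--;
		return true;
	}
	if (!nofragment && z->free_other_type > 0) {
		z->free_other_type--;
		return true;
	}
	return false;
}

/*
 * zones[] is ordered like a zonelist with the preferred zone first.
 * Returns the zone used, or NULL. *fragmented reports whether the
 * allocation had to use a fragmenting fallback.
 */
static struct toy_zone *toy_alloc(struct toy_zone *zones, size_t nr,
				  bool *fragmented)
{
	bool nofragment = true;
	size_t i;

	assert(fragmented != NULL);
	*fragmented = false;
	if (nr == 0)
		return NULL;
retry:
	for (i = 0; i < nr; i++) {
		long before;

		/* Leaving the local node: locality wins, allow fallbacks. */
		if (nofragment && zones[i].nid != zones[0].nid) {
			nofragment = false;
			goto retry;
		}

		before = zones[i].free_same_type;
		if (toy_try_zone(&zones[i], nofragment)) {
			*fragmented = (zones[i].free_same_type == before);
			return &zones[i];
		}
	}

	/* Every local zone would fragment: second pass with fallbacks. */
	if (nofragment) {
		nofragment = false;
		goto retry;
	}
	return NULL;
}
```

With a "Normal" zone that only has fallback pages and a "DMA32" zone on the
same node that still has pages of the right type, toy_alloc() picks DMA32
first and only fragments Normal once DMA32 is exhausted.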

This was evaluated using a benchmark designed to fragment memory
before attempting THPs.  It's implemented in mmtests as the following
configurations

configs/config-global-dhp__workload_thpfioscale
configs/config-global-dhp__workload_thpfioscale-defrag
configs/config-global-dhp__workload_thpfioscale-madvhugepage

e.g. from mmtests
./run-mmtests.sh --run-monitor --config configs/config-global-dhp__workload_thpfioscale test-run-1

The broad details of the workload are as follows:

1. Create an XFS filesystem (not specified in the configuration but done
   as part of the testing for this patch)
2. Start 4 fio threads that write a number of 64K files inefficiently.
   Inefficiently means that files are created on first access and not
   created in advance (fio parameter create_on_open=1) and fallocate
   is not used (fallocate=none). With multiple IO issuers this creates
   a mix of slab and page cache allocations over time. The total size
   of the files is 150% physical memory so that the slabs and page cache
   pages get mixed
3. Warm up a number of fio read-only threads accessing the same files
   created in step 2. This part runs for the same length of time it
   took to create the files. It'll fault back in old data and further
   interleave slab and page cache allocations. As it's now low on
   memory due to step 2, fragmentation occurs as pageblocks get
   stolen.
4. While step 3 is still running, start a process that tries to allocate
   75% of memory as huge pages with a number of threads. The number of
   threads is based on (NR_CPUS_SOCKET - NR_FIO_THREADS)/4 to avoid THP
   threads contending with fio, any other threads or forcing cross-NUMA
   scheduling. Note that the test has not been used on a machine with fewer
   than 8 cores. The benchmark records whether huge pages were allocated
   and what the fault latency was in microseconds
5. Measure the number of events potentially causing external fragmentation,
   the fault latency and the huge page allocation success rate.
6. Cleanup
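
The thread-count rule in step 4 can be written out as a one-line helper.
This is a sketch only: thp_threads() is a hypothetical name, and clamping
the result to at least one thread is an assumption of this example rather
than something taken from the mmtests scripts.

```c
#include <assert.h>

/*
 * Hypothetical helper: number of THP-allocating threads for a socket,
 * following (NR_CPUS_SOCKET - NR_FIO_THREADS)/4 from the description.
 * The clamp to a minimum of 1 is an assumption made for this sketch.
 */
static int thp_threads(int cpus_per_socket, int fio_threads)
{
	int n;

	assert(cpus_per_socket > 0 && fio_threads >= 0);
	n = (cpus_per_socket - fio_threads) / 4;
	return n < 1 ? 1 : n;
}
```

For example, 8 CPUs per socket with 4 fio threads gives one THP thread.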

Note that due to the use of IO and page cache, this benchmark is not
suitable for running on large machines where the time to fragment memory
may be excessive. Also note that while this is one mix that generates
fragmentation, it is not the only one. Workloads that are more
slab-intensive, or configurations where SLUB is used with high-order
pages, may yield different results.

When the page allocator fragments memory, it records the event using the
mm_page_alloc_extfrag event. If the fallback_order is smaller than a
pageblock order (order-9 on 64-bit x86) then it's considered an event
that may cause external fragmentation issues in the future. Hence, the
primary metric here is the number of external fragmentation events that
occur with order < 9. The secondary metrics are allocation latency and huge
page allocation success rates but note that differences in latencies and
success rates can also affect the number of external fragmentation
events, which is why they are secondary metrics.
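
As a sketch of how the primary metric is derived, the filter below counts
only those fallback events whose order is below the pageblock order. The
names TOY_PAGEBLOCK_ORDER and count_serious_extfrag() are hypothetical, the
value 9 is the 64-bit x86 case mentioned above, and the input is assumed to
be an array of fallback_order values already extracted from the
mm_page_alloc_extfrag tracepoint.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumption for this sketch: pageblock order on 64-bit x86. */
#define TOY_PAGEBLOCK_ORDER 9

/* A fallback below pageblock order may cause fragmentation later. */
static bool extfrag_is_serious(int fallback_order)
{
	return fallback_order < TOY_PAGEBLOCK_ORDER;
}

/* Count the events that feed the primary metric. */
static size_t count_serious_extfrag(const int *fallback_orders, size_t nr)
{
	size_t i, count = 0;

	assert(fallback_orders != NULL || nr == 0);
	for (i = 0; i < nr; i++)
		if (extfrag_is_serious(fallback_orders[i]))
			count++;
	return count;
}
```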

1-socket Skylake machine
config-global-dhp__workload_thpfioscale XFS (no special madvise)
4 fio threads, 1 THP allocating thread
--------------------------------------

4.19 extfrag events < order 9:	71227
4.19+patch:                     36456 (49% reduction)

thpfioscale Fault Latencies
                                       4.19.0                 4.19.0
                                      vanilla           lowzone-v1r1
Amean     fault-base-1      605.84 (   0.00%)      599.92 *   0.98%*
Amean     fault-huge-1      296.00 (   0.00%)      179.84 *  39.24%*

                                  4.19.0                 4.19.0
                                 vanilla           lowzone-v1r1
Percentage huge-1        0.44 (   0.00%)        1.08 ( 146.15%)

Fault latencies are reduced. While allocation success rates are not much
higher, this configuration does not make any heavy effort to allocate
THP and fio is heavily active at the time and filling memory.  However,
a 49% reduction of serious fragmentation events reduces the chances of
external fragmentation being a problem in the future.
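
The reduction percentages quoted alongside the event counts are plain
relative differences. A minimal sketch of that arithmetic, using a
hypothetical helper name percent_reduction():

```c
#include <assert.h>

/* Relative reduction from 'before' to 'after', as a percentage. */
static double percent_reduction(unsigned long before, unsigned long after)
{
	assert(before > 0);
	return 100.0 * ((double)before - (double)after) / (double)before;
}
```

Feeding in the 1-socket counts above, percent_reduction(71227, 36456) comes
out at roughly 48.8, which rounds to the 49% quoted.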

1-socket Skylake machine
global-dhp__workload_thpfioscale-madvhugepage-xfs (MADV_HUGEPAGE)
-----------------------------------------------------------------

4.19 extfrag events < order 9:  40761
4.19+patch:                     36085 (11.47% reduction)

thpfioscale Fault Latencies
                                       4.19.0                 4.19.0
                                      vanilla           lowzone-v1r1
Amean     fault-base-1     1938.77 (   0.00%)     1938.47 (   0.02%)
Amean     fault-huge-1      774.80 (   0.00%)      749.40 *   3.28%*

thpfioscale Percentage Faults Huge
                                  4.19.0                 4.19.0
                                 vanilla           lowzone-v1r1
Percentage huge-1       83.59 (   0.00%)       83.79 (   0.24%)

Nothing dramatic. Fragmentation events are still reduced but fault
latencies and allocation success rates are similar.

2-socket Haswell machine
config-global-dhp__workload_thpfioscale XFS (no special madvise)
4 fio threads, 5 THP allocating threads
----------------------------------------------------------------

4.19 extfrag events < order 9:  882868
4.19+patch:                     476937 (46% reduction)

thpfioscale Fault Latencies
                                       4.19.0                 4.19.0
                                      vanilla           lowzone-v1r1
Amean     fault-base-5     1505.76 (   0.00%)     1602.01 (  -6.39%)
Amean     fault-huge-5      687.00 (   0.00%)        0.00 * 100.00%*

                                  4.19.0                 4.19.0
                                 vanilla           lowzone-v1r1
Percentage huge-5        0.07 (   0.00%)        0.00 (   0.00%)

The reduction of external fragmentation events is expected. The
latencies are off because the huge page allocations generally
failed and the patch does not have a direct impact on success
rates.

2-socket Haswell machine
global-dhp__workload_thpfioscale-madvhugepage-xfs (MADV_HUGEPAGE)
-----------------------------------------------------------------

4.19 extfrag events < order 9: 803099
4.19+patch:                    654671 (18% reduction)

thpfioscale Fault Latencies
                                       4.19.0                 4.19.0
                                      vanilla           lowzone-v1r1
Amean     fault-base-5     5389.23 (   0.00%)     6678.61 * -23.93%*
Amean     fault-huge-5     5039.32 (   0.00%)     2796.35 *  44.51%*

thpfioscale Percentage Faults Huge
                                  4.19.0                 4.19.0
                                 vanilla           lowzone-v1r1
Percentage huge-5       30.69 (   0.00%)       57.92 (  88.71%)

In this case, external fragmentation events were reduced and the huge
page allocation success rate increased substantially, from 30.69% of
attempts to 57.92%.

Overall, the patch significantly reduces the number of events that cause
external fragmentation, so the success of THP allocations over long
periods of time should improve for this adverse workload.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 mm/internal.h   |  13 +++++---
 mm/page_alloc.c | 101 ++++++++++++++++++++++++++++++++++++++++++++++++++------
 2 files changed, 99 insertions(+), 15 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index 87256ae1bef8..0dd659cf2a7e 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -480,10 +480,15 @@ unsigned long reclaim_clean_pages_from_list(struct zone *zone,
 #define ALLOC_OOM		ALLOC_NO_WATERMARKS
 #endif
 
-#define ALLOC_HARDER		0x10 /* try to alloc harder */
-#define ALLOC_HIGH		0x20 /* __GFP_HIGH set */
-#define ALLOC_CPUSET		0x40 /* check for correct cpuset */
-#define ALLOC_CMA		0x80 /* allow allocations from CMA areas */
+#define ALLOC_HARDER		 0x10 /* try to alloc harder */
+#define ALLOC_HIGH		 0x20 /* __GFP_HIGH set */
+#define ALLOC_CPUSET		 0x40 /* check for correct cpuset */
+#define ALLOC_CMA		 0x80 /* allow allocations from CMA areas */
+#ifdef CONFIG_ZONE_DMA32
+#define ALLOC_NOFRAGMENT	0x100 /* avoid mixing pageblock types */
+#else
+#define ALLOC_NOFRAGMENT	  0x0
+#endif
 
 enum ttu_flags;
 struct tlbflush_unmap_batch;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index e2ef1c17942f..db5d61868c96 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2364,20 +2364,30 @@ static bool unreserve_highatomic_pageblock(const struct alloc_context *ac,
  * condition simpler.
  */
 static __always_inline bool
-__rmqueue_fallback(struct zone *zone, int order, int start_migratetype)
+__rmqueue_fallback(struct zone *zone, int order, int start_migratetype,
+						unsigned int alloc_flags)
 {
 	struct free_area *area;
 	int current_order;
+	int min_order = order;
 	struct page *page;
 	int fallback_mt;
 	bool can_steal;
 
+	/*
+	 * Do not steal pages from freelists belonging to other pageblocks
+	 * i.e. orders < pageblock_order. In the event there is no local
+	 * zone free, the allocation will retry later.
+	 */
+	if (alloc_flags & ALLOC_NOFRAGMENT)
+		min_order = pageblock_order;
+
 	/*
 	 * Find the largest available free page in the other list. This roughly
 	 * approximates finding the pageblock with the most free pages, which
 	 * would be too costly to do exactly.
 	 */
-	for (current_order = MAX_ORDER - 1; current_order >= order;
+	for (current_order = MAX_ORDER - 1; current_order >= min_order;
 				--current_order) {
 		area = &(zone->free_area[current_order]);
 		fallback_mt = find_suitable_fallback(area, current_order,
@@ -2436,7 +2446,8 @@ __rmqueue_fallback(struct zone *zone, int order, int start_migratetype)
  * Call me with the zone->lock already held.
  */
 static __always_inline struct page *
-__rmqueue(struct zone *zone, unsigned int order, int migratetype)
+__rmqueue(struct zone *zone, unsigned int order, int migratetype,
+						unsigned int alloc_flags)
 {
 	struct page *page;
 
@@ -2446,7 +2457,8 @@ __rmqueue(struct zone *zone, unsigned int order, int migratetype)
 		if (migratetype == MIGRATE_MOVABLE)
 			page = __rmqueue_cma_fallback(zone, order);
 
-		if (!page && __rmqueue_fallback(zone, order, migratetype))
+		if (!page && __rmqueue_fallback(zone, order, migratetype,
+								alloc_flags))
 			goto retry;
 	}
 
@@ -2461,13 +2473,14 @@ __rmqueue(struct zone *zone, unsigned int order, int migratetype)
  */
 static int rmqueue_bulk(struct zone *zone, unsigned int order,
 			unsigned long count, struct list_head *list,
-			int migratetype)
+			int migratetype, unsigned int alloc_flags)
 {
 	int i, alloced = 0;
 
 	spin_lock(&zone->lock);
 	for (i = 0; i < count; ++i) {
-		struct page *page = __rmqueue(zone, order, migratetype);
+		struct page *page = __rmqueue(zone, order, migratetype,
+								alloc_flags);
 		if (unlikely(page == NULL))
 			break;
 
@@ -2923,6 +2936,7 @@ static inline void zone_statistics(struct zone *preferred_zone, struct zone *z)
 
 /* Remove page from the per-cpu list, caller must protect the list */
 static struct page *__rmqueue_pcplist(struct zone *zone, int migratetype,
+			unsigned int alloc_flags,
 			struct per_cpu_pages *pcp,
 			struct list_head *list)
 {
@@ -2932,7 +2946,7 @@ static struct page *__rmqueue_pcplist(struct zone *zone, int migratetype,
 		if (list_empty(list)) {
 			pcp->count += rmqueue_bulk(zone, 0,
 					pcp->batch, list,
-					migratetype);
+					migratetype, alloc_flags);
 			if (unlikely(list_empty(list)))
 				return NULL;
 		}
@@ -2948,7 +2962,8 @@ static struct page *__rmqueue_pcplist(struct zone *zone, int migratetype,
 /* Lock and remove page from the per-cpu list */
 static struct page *rmqueue_pcplist(struct zone *preferred_zone,
 			struct zone *zone, unsigned int order,
-			gfp_t gfp_flags, int migratetype)
+			gfp_t gfp_flags, int migratetype,
+			unsigned int alloc_flags)
 {
 	struct per_cpu_pages *pcp;
 	struct list_head *list;
@@ -2958,7 +2973,7 @@ static struct page *rmqueue_pcplist(struct zone *preferred_zone,
 	local_irq_save(flags);
 	pcp = &this_cpu_ptr(zone->pageset)->pcp;
 	list = &pcp->lists[migratetype];
-	page = __rmqueue_pcplist(zone,  migratetype, pcp, list);
+	page = __rmqueue_pcplist(zone,  migratetype, alloc_flags, pcp, list);
 	if (page) {
 		__count_zid_vm_events(PGALLOC, page_zonenum(page), 1 << order);
 		zone_statistics(preferred_zone, zone);
@@ -2981,7 +2996,7 @@ struct page *rmqueue(struct zone *preferred_zone,
 
 	if (likely(order == 0)) {
 		page = rmqueue_pcplist(preferred_zone, zone, order,
-				gfp_flags, migratetype);
+				gfp_flags, migratetype, alloc_flags);
 		goto out;
 	}
 
@@ -3000,7 +3015,7 @@ struct page *rmqueue(struct zone *preferred_zone,
 				trace_mm_page_alloc_zone_locked(page, order, migratetype);
 		}
 		if (!page)
-			page = __rmqueue(zone, order, migratetype);
+			page = __rmqueue(zone, order, migratetype, alloc_flags);
 	} while (page && check_new_pages(page, order));
 	spin_unlock(&zone->lock);
 	if (!page)
@@ -3242,6 +3257,36 @@ static bool zone_allows_reclaim(struct zone *local_zone, struct zone *zone)
 }
 #endif	/* CONFIG_NUMA */
 
+#ifdef CONFIG_ZONE_DMA32
+/*
+ * The restriction on ZONE_DMA32 as being a suitable zone to use to avoid
+ * fragmentation is subtle. If the preferred zone was HIGHMEM then
+ * premature use of a lower zone may cause lowmem pressure problems that
+ * are worse than fragmentation. If the next zone is ZONE_DMA then it is
+ * probably too small. It only makes sense to spread allocations to avoid
+ * fragmentation between the Normal and DMA32 zones.
+ */
+static inline unsigned int alloc_flags_nofragment(struct zone *zone)
+{
+	if (zone_idx(zone) != ZONE_NORMAL)
+		return 0;
+
+	/*
+	 * If ZONE_DMA32 exists, assume it is the one after ZONE_NORMAL and
+	 * the pointer is within zone->zone_pgdat->node_zones[].
+	 */
+	if (!populated_zone(--zone))
+		return 0;
+
+	return ALLOC_NOFRAGMENT;
+}
+#else
+static inline unsigned int alloc_flags_nofragment(struct zone *zone)
+{
+	return 0;
+}
+#endif
+
 /*
  * get_page_from_freelist goes through the zonelist trying to allocate
  * a page.
@@ -3253,11 +3298,14 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
 	struct zoneref *z = ac->preferred_zoneref;
 	struct zone *zone;
 	struct pglist_data *last_pgdat_dirty_limit = NULL;
+	bool no_fallback;
 
+retry:
 	/*
 	 * Scan zonelist, looking for a zone with enough free.
 	 * See also __cpuset_node_allowed() comment in kernel/cpuset.c.
 	 */
+	no_fallback = alloc_flags & ALLOC_NOFRAGMENT;
 	for_next_zone_zonelist_nodemask(zone, z, ac->zonelist, ac->high_zoneidx,
 								ac->nodemask) {
 		struct page *page;
@@ -3296,6 +3344,22 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
 			}
 		}
 
+		if (no_fallback) {
+			int local_nid;
+
+			/*
+			 * If moving to a remote node, retry but allow
+			 * fragmenting fallbacks. Locality is more important
+			 * than fragmentation avoidance.
+			 *
+			 */
+			local_nid = zone_to_nid(ac->preferred_zoneref->zone);
+			if (zone_to_nid(zone) != local_nid) {
+				alloc_flags &= ~ALLOC_NOFRAGMENT;
+				goto retry;
+			}
+		}
+
 		mark = zone->watermark[alloc_flags & ALLOC_WMARK_MASK];
 		if (!zone_watermark_fast(zone, order, mark,
 				       ac_classzone_idx(ac), alloc_flags)) {
@@ -3363,6 +3427,15 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
 		}
 	}
 
+	/*
+	 * It's possible on a UMA machine to get through all zones that are
+	 * fragmented. If avoiding fragmentation, reset and try again
+	 */
+	if (no_fallback) {
+		alloc_flags &= ~ALLOC_NOFRAGMENT;
+		goto retry;
+	}
+
 	return NULL;
 }
 
@@ -4366,6 +4439,12 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, int preferred_nid,
 
 	finalise_ac(gfp_mask, &ac);
 
+	/*
+	 * Forbid the first pass from falling back to types that fragment
+	 * memory until all local zones are considered.
+	 */
+	alloc_flags |= alloc_flags_nofragment(ac.preferred_zoneref->zone);
+
 	/* First allocation attempt */
 	page = get_page_from_freelist(alloc_mask, order, alloc_flags, &ac);
 	if (likely(page))

From patchwork Wed Oct 31 16:06:42 2018
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Mel Gorman <mgorman@techsingularity.net>
X-Patchwork-Id: 10662891
Return-Path: <owner-linux-mm@kvack.org>
Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org
 [172.30.200.125])
	by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 0A3D41751
	for <patchwork-linux-mm@patchwork.kernel.org>;
 Wed, 31 Oct 2018 16:06:53 +0000 (UTC)
Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1])
	by mail.wl.linuxfoundation.org (Postfix) with ESMTP id EF0B426D08
	for <patchwork-linux-mm@patchwork.kernel.org>;
 Wed, 31 Oct 2018 16:06:52 +0000 (UTC)
Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486)
	id E207D2929C; Wed, 31 Oct 2018 16:06:52 +0000 (UTC)
X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on
	pdx-wl-mail.web.codeaurora.org
X-Spam-Level: 
X-Spam-Status: No, score=-2.9 required=2.0 tests=BAYES_00,MAILING_LIST_MULTI,
	RCVD_IN_DNSWL_NONE autolearn=ham version=3.3.1
Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17])
	by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 670A326D08
	for <patchwork-linux-mm@patchwork.kernel.org>;
 Wed, 31 Oct 2018 16:06:52 +0000 (UTC)
Received: by kanga.kvack.org (Postfix)
	id 234D16B0269; Wed, 31 Oct 2018 12:06:49 -0400 (EDT)
Delivered-To: linux-mm-outgoing@kvack.org
Received: by kanga.kvack.org (Postfix, from userid 40)
	id 1E5E16B026A; Wed, 31 Oct 2018 12:06:49 -0400 (EDT)
X-Original-To: int-list-linux-mm@kvack.org
X-Delivered-To: int-list-linux-mm@kvack.org
Received: by kanga.kvack.org (Postfix, from userid 63042)
	id 063436B026C; Wed, 31 Oct 2018 12:06:48 -0400 (EDT)
X-Original-To: linux-mm@kvack.org
X-Delivered-To: linux-mm@kvack.org
Received: from mail-ed1-f70.google.com (mail-ed1-f70.google.com
 [209.85.208.70])
	by kanga.kvack.org (Postfix) with ESMTP id 8798B6B0269
	for <linux-mm@kvack.org>; Wed, 31 Oct 2018 12:06:48 -0400 (EDT)
Received: by mail-ed1-f70.google.com with SMTP id k17-v6so11024363edr.18
        for <linux-mm@kvack.org>; Wed, 31 Oct 2018 09:06:48 -0700 (PDT)
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20161025;
        h=x-original-authentication-results:x-gm-message-state:from:to:cc
         :subject:date:message-id:in-reply-to:references;
        bh=47NV/eSAsgivwMtEcSAchPKAX6gtitBeIucoqSw9wVo=;
        b=Sii9J4LmabUMCapJuD7YRQYDqoE8C0aK+5LpapS69kIrqX3Qo8RFB0PPNmSNiP817i
         o+hV4OA2gUl25/HGsdyf5B87vibg5QIHN+FhaTsEgvVzubYNCxAgp9nQNFd4CqMjcPDX
         ogONjkq23dwJuhfBuTkmmdGH78jj8b9axNjYLj9asg5k1d1HuuEDRnGI2Bf9SvSct+7K
         BX6CEKHGdDLzTT6OtIEbYfhg3OigzPJgyZNw/SRM3awaVkcGSbl2Km2ONk0IVMz8167f
         JynEUwX80pphbGXb3ToZyVXqsk/7VY0eU9nt/zw6gQIqipBP2J1YtGifpjdqLPrdxiX3
         yPLQ==
X-Original-Authentication-Results: mx.google.com;
       spf=pass (google.com: domain of mgorman@techsingularity.net designates
 81.17.249.8 as permitted sender) smtp.mailfrom=mgorman@techsingularity.net
X-Gm-Message-State: AGRZ1gJmaIoFcd34ux8dS2mOyViL8pRcP3SJj0mCxIxZMOOM7l+1oNZO
	xD8hQ7iuY8TQDqa/J+OJiPqwy7VLyjlO7cOEV/nVDoBGZDu5Fa6kEAcwLsBgtSZFhVRRFU6AKIV
	gbeuCBDklppvoisgUJGobtwNde7wddeT2ZBzsUzfwpCJCdi2emgR2ompaj0aIvzogTg==
X-Received: by 2002:a17:906:64c:: with SMTP id
 t12-v6mr1997045ejb.113.1541002007943;
        Wed, 31 Oct 2018 09:06:47 -0700 (PDT)
X-Google-Smtp-Source: 
 AJdET5cgOfJgTVyoCvhLNr81/CmZqGYlTMKfbSkem1g69zeFlyrc9G68ySxhyJMwxSYqNKo23rSL
X-Received: by 2002:a17:906:64c:: with SMTP id
 t12-v6mr1996996ejb.113.1541002006859;
        Wed, 31 Oct 2018 09:06:46 -0700 (PDT)
ARC-Seal: i=1; a=rsa-sha256; t=1541002006; cv=none;
        d=google.com; s=arc-20160816;
        b=KbzU1otJveLxZ8CoNvZxJO0ydIfVIxB+aw3kL/tEFhVtyGFVLfQyWaOxpEHjdmfq4i
         XJONQzWE6NgGBBhyZrNnGkufMgOVMGt+9uPB2VoRtjyOBz0pyMWRZirSd5J7m/pzW/iz
         xhrcNh1JaR/6AnoCteJ1jJKz+CjEznMDQsZ1TEW4Iipk9lU/g/6UyXYqwua6mzT2hboU
         I7xbYTxj6eRlv7K34LaXzGno7CkmxPKtjM+RgzsdXX+9DDyKRqXWC3sP9J649wndPMoK
         CUGwS9NWbJG9Wvm6GpbHaW5ckT0pY5NKDT6rBtrTYKXPZyf8zORm3a1TGWfgnbZgQebT
         vOMw==
ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com;
 s=arc-20160816;
        h=references:in-reply-to:message-id:date:subject:cc:to:from;
        bh=47NV/eSAsgivwMtEcSAchPKAX6gtitBeIucoqSw9wVo=;
        b=ndp/ef/wLi8N1zoJhZJ6iBamxnELOXJy/W8KVbjzWgL7qPvn5xfYwxuDY9sYWWLtM7
         m8TboLAJTnKotsvIt04FFv30RqB9ZJatdpCRycZ4BVJ1XJdexdItYrGZp3o1jfZ7vBGu
         GuL+ZyLbxxFMyyISdQYEHdwr96sl1J/iAxdE+I+3pF5N0vlHmH3sPwL0RmKYrT4/d8cl
         rb1X31pYHsYuW17kPjlljZTcFo4RQtjFM13KnO3wEXg+hzSrRTHCCnd6nb0e9q207VdR
         Fv1LfZxSV8d4kerXIezPQWzACsCmQklB9gm1F3eHs/4ndedsEBYrAP/xcWMCy2uzrF8A
         XsZw==
ARC-Authentication-Results: i=1; mx.google.com;
       spf=pass (google.com: domain of mgorman@techsingularity.net designates
 81.17.249.8 as permitted sender) smtp.mailfrom=mgorman@techsingularity.net
Received: from outbound-smtp02.blacknight.com (outbound-smtp02.blacknight.com.
 [81.17.249.8])
        by mx.google.com with ESMTPS id
 b12-v6si5575339ejd.100.2018.10.31.09.06.46
        for <linux-mm@kvack.org>
        (version=TLS1 cipher=AES128-SHA bits=128/128);
        Wed, 31 Oct 2018 09:06:46 -0700 (PDT)
Received-SPF: pass (google.com: domain of mgorman@techsingularity.net
 designates 81.17.249.8 as permitted sender) client-ip=81.17.249.8;
Authentication-Results: mx.google.com;
       spf=pass (google.com: domain of mgorman@techsingularity.net designates
 81.17.249.8 as permitted sender) smtp.mailfrom=mgorman@techsingularity.net
Received: from mail.blacknight.com (pemlinmail04.blacknight.ie [81.17.254.17])
	by outbound-smtp02.blacknight.com (Postfix) with ESMTPS id 870B998955
	for <linux-mm@kvack.org>; Wed, 31 Oct 2018 16:06:46 +0000 (UTC)
Received: (qmail 5573 invoked from network); 31 Oct 2018 16:06:46 -0000
Received: from unknown (HELO stampy.163woodhaven.lan)
 (mgorman@techsingularity.net@[37.228.229.142])
  by 81.17.254.9 with ESMTPA; 31 Oct 2018 16:06:46 -0000
From: Mel Gorman <mgorman@techsingularity.net>
To: Linux-MM <linux-mm@kvack.org>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	Vlastimil Babka <vbabka@suse.cz>,
	David Rientjes <rientjes@google.com>,
	Andrea Arcangeli <aarcange@redhat.com>,
	Zi Yan <zi.yan@cs.rutgers.edu>,
	LKML <linux-kernel@vger.kernel.org>,
	Mel Gorman <mgorman@techsingularity.net>
Subject: [PATCH 2/5] mm: Move zone watermark accesses behind an accessor
Date: Wed, 31 Oct 2018 16:06:42 +0000
Message-Id: <20181031160645.7633-3-mgorman@techsingularity.net>
X-Mailer: git-send-email 2.16.4
In-Reply-To: <20181031160645.7633-1-mgorman@techsingularity.net>
References: <20181031160645.7633-1-mgorman@techsingularity.net>
X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4
Sender: owner-linux-mm@kvack.org
Precedence: bulk
X-Loop: owner-majordomo@kvack.org
List-ID: <linux-mm.kvack.org>
X-Virus-Scanned: ClamAV using ClamSMTP

This is a preparation patch only, no functional change.
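
As a rough illustration of the accessor pattern, the userspace sketch below
keeps the array private by convention (leading underscore) and routes every
read through a macro. The toy_ prefixed names and struct toy_zone_wm are
hypothetical and exist only for this example; the assumed benefit is that
callers need no changes if the way a watermark is computed changes later.

```c
#include <assert.h>
#include <stddef.h>

enum toy_wmark { TOY_WMARK_MIN, TOY_WMARK_LOW, TOY_WMARK_HIGH, TOY_NR_WMARK };

/* Toy stand-in for struct zone: only the watermark array is modelled. */
struct toy_zone_wm {
	unsigned long _watermark[TOY_NR_WMARK];
};

/* All readers go through these accessors, never the array itself. */
#define toy_wmark_pages(z, i)	((z)->_watermark[i])
#define toy_low_wmark_pages(z)	toy_wmark_pages(z, TOY_WMARK_LOW)

/* Example reader: sum the low watermark over a set of zones. */
static unsigned long toy_sum_low(const struct toy_zone_wm *zones, int nr)
{
	unsigned long sum = 0;
	int i;

	assert(zones != NULL || nr == 0);
	for (i = 0; i < nr; i++)
		sum += toy_low_wmark_pages(&zones[i]);
	return sum;
}
```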

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 include/linux/mmzone.h |  9 +++++----
 mm/compaction.c        |  2 +-
 mm/page_alloc.c        | 12 ++++++------
 3 files changed, 12 insertions(+), 11 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index d4b0c79d2924..854d6c188888 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -267,9 +267,10 @@ enum zone_watermarks {
 	NR_WMARK
 };
 
-#define min_wmark_pages(z) (z->watermark[WMARK_MIN])
-#define low_wmark_pages(z) (z->watermark[WMARK_LOW])
-#define high_wmark_pages(z) (z->watermark[WMARK_HIGH])
+#define min_wmark_pages(z) (z->_watermark[WMARK_MIN])
+#define low_wmark_pages(z) (z->_watermark[WMARK_LOW])
+#define high_wmark_pages(z) (z->_watermark[WMARK_HIGH])
+#define wmark_pages(z, i) (z->_watermark[i])
 
 struct per_cpu_pages {
 	int count;		/* number of pages in the list */
@@ -360,7 +361,7 @@ struct zone {
 	/* Read-mostly fields */
 
 	/* zone watermarks, access with *_wmark_pages(zone) macros */
-	unsigned long watermark[NR_WMARK];
+	unsigned long _watermark[NR_WMARK];
 
 	unsigned long nr_reserved_highatomic;
 
diff --git a/mm/compaction.c b/mm/compaction.c
index faca45ebe62d..aa9473a64915 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1430,7 +1430,7 @@ static enum compact_result __compaction_suitable(struct zone *zone, int order,
 	if (is_via_compact_memory(order))
 		return COMPACT_CONTINUE;
 
-	watermark = zone->watermark[alloc_flags & ALLOC_WMARK_MASK];
+	watermark = wmark_pages(zone, alloc_flags & ALLOC_WMARK_MASK);
 	/*
 	 * If watermarks for high-order allocation are already met, there
 	 * should be no need for compaction at all.
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index db5d61868c96..a51887765abc 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3360,7 +3360,7 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
 			}
 		}
 
-		mark = zone->watermark[alloc_flags & ALLOC_WMARK_MASK];
+		mark = wmark_pages(zone, alloc_flags & ALLOC_WMARK_MASK);
 		if (!zone_watermark_fast(zone, order, mark,
 				       ac_classzone_idx(ac), alloc_flags)) {
 			int ret;
@@ -4787,7 +4787,7 @@ long si_mem_available(void)
 		pages[lru] = global_node_page_state(NR_LRU_BASE + lru);
 
 	for_each_zone(zone)
-		wmark_low += zone->watermark[WMARK_LOW];
+		wmark_low += low_wmark_pages(zone);
 
 	/*
 	 * Estimate the amount of memory available for userspace allocations,
@@ -7323,13 +7323,13 @@ static void __setup_per_zone_wmarks(void)
 
 			min_pages = zone->managed_pages / 1024;
 			min_pages = clamp(min_pages, SWAP_CLUSTER_MAX, 128UL);
-			zone->watermark[WMARK_MIN] = min_pages;
+			zone->_watermark[WMARK_MIN] = min_pages;
 		} else {
 			/*
 			 * If it's a lowmem zone, reserve a number of pages
 			 * proportionate to the zone's size.
 			 */
-			zone->watermark[WMARK_MIN] = tmp;
+			zone->_watermark[WMARK_MIN] = tmp;
 		}
 
 		/*
@@ -7341,8 +7341,8 @@ static void __setup_per_zone_wmarks(void)
 			    mult_frac(zone->managed_pages,
 				      watermark_scale_factor, 10000));
 
-		zone->watermark[WMARK_LOW]  = min_wmark_pages(zone) + tmp;
-		zone->watermark[WMARK_HIGH] = min_wmark_pages(zone) + tmp * 2;
+		zone->_watermark[WMARK_LOW]  = min_wmark_pages(zone) + tmp;
+		zone->_watermark[WMARK_HIGH] = min_wmark_pages(zone) + tmp * 2;
 
 		spin_unlock_irqrestore(&zone->lock, flags);
 	}

From patchwork Wed Oct 31 16:06:43 2018
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Mel Gorman <mgorman@techsingularity.net>
X-Patchwork-Id: 10662897
Return-Path: <owner-linux-mm@kvack.org>
Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org
 [172.30.200.125])
	by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 6AD1F14DE
	for <patchwork-linux-mm@patchwork.kernel.org>;
 Wed, 31 Oct 2018 16:07:06 +0000 (UTC)
Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1])
	by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 595CE26D08
	for <patchwork-linux-mm@patchwork.kernel.org>;
 Wed, 31 Oct 2018 16:07:06 +0000 (UTC)
Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486)
	id 4CFEA2929C; Wed, 31 Oct 2018 16:07:06 +0000 (UTC)
X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on
	pdx-wl-mail.web.codeaurora.org
X-Spam-Level: 
X-Spam-Status: No, score=-2.9 required=2.0 tests=BAYES_00,MAILING_LIST_MULTI,
	RCVD_IN_DNSWL_NONE autolearn=ham version=3.3.1
Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17])
	by mail.wl.linuxfoundation.org (Postfix) with ESMTP id E951C26D08
	for <patchwork-linux-mm@patchwork.kernel.org>;
 Wed, 31 Oct 2018 16:07:04 +0000 (UTC)
Received: by kanga.kvack.org (Postfix)
	id 950656B026C; Wed, 31 Oct 2018 12:06:50 -0400 (EDT)
Delivered-To: linux-mm-outgoing@kvack.org
Received: by kanga.kvack.org (Postfix, from userid 40)
	id 71BAE6B026B; Wed, 31 Oct 2018 12:06:50 -0400 (EDT)
X-Original-To: int-list-linux-mm@kvack.org
X-Delivered-To: int-list-linux-mm@kvack.org
Received: by kanga.kvack.org (Postfix, from userid 63042)
	id 58E9B6B026E; Wed, 31 Oct 2018 12:06:50 -0400 (EDT)
X-Original-To: linux-mm@kvack.org
X-Delivered-To: linux-mm@kvack.org
Received: from mail-ed1-f69.google.com (mail-ed1-f69.google.com
 [209.85.208.69])
	by kanga.kvack.org (Postfix) with ESMTP id E02C26B026C
	for <linux-mm@kvack.org>; Wed, 31 Oct 2018 12:06:49 -0400 (EDT)
Received: by mail-ed1-f69.google.com with SMTP id x44-v6so10727529edd.17
        for <linux-mm@kvack.org>; Wed, 31 Oct 2018 09:06:49 -0700 (PDT)
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20161025;
        h=x-original-authentication-results:x-gm-message-state:from:to:cc
         :subject:date:message-id:in-reply-to:references;
        bh=KWJEhkEX6JvntJ+v8ohUvcZ67FLtyYKozzooioiu6e0=;
        b=aHLD8boJeAar/q7l6AYz+7aMccUm611XjsbFAOoH7xHtR57kvm1OSKTXUQLi1OGiuk
         FV2UgLH7SaesJ6ZnbhdOZjm/mYNnha9W9jSYIbpB2RTaW36GWKdmQnWJeHYqoJalaS6r
         LXqzs3XMsXgPTpvAB4hCj1sp8Vj03AWtQr40tEalFLHqPcSPKTfSBcyWRFmER4fFauhu
         uIsUWR2yszeAb3XnZVEQ1WtyOj1AObXeXICBUz1hQglAXtfis0VmPsVMQVWPaD0b+TJ5
         BhGjEHG+btO+plamLDEzH4rzLAd6bDOIeC96xDfHVm9V5ENMqE8vaXZn1KeP+w9ES4hd
         XnaA==
X-Original-Authentication-Results: mx.google.com;
       spf=pass (google.com: domain of mgorman@techsingularity.net designates
 81.17.249.194 as permitted sender) smtp.mailfrom=mgorman@techsingularity.net
X-Gm-Message-State: AGRZ1gKJ+weH1mguM4UxK/tdGWruTwlUzxEyiZbfHaDPguNFSACOADuT
	cfbTAwNDOLywKqlbOpX26qh4oJf/XU/avS/CDU5JGWUeOrkvLRew0N4exo4iTO4d5f87layF2iG
	xxwMdipMHcbFhSDfEqcGMm1/ZNAOvCTB89pFX/vP6DSuDLI1pj1FMQpLbWuq1Y56pBA==
X-Received: by 2002:a50:8f23:: with SMTP id
 32-v6mr2539928edy.158.1541002009259;
        Wed, 31 Oct 2018 09:06:49 -0700 (PDT)
X-Google-Smtp-Source: 
 AJdET5cM8dDHsqZJj9Tytjr6/FRBf1emZKYrgClb2VpxHW550nKC9ukXY7Oj29aPw8VNF0RL/KBb
X-Received: by 2002:a50:8f23:: with SMTP id
 32-v6mr2539813edy.158.1541002007074;
        Wed, 31 Oct 2018 09:06:47 -0700 (PDT)
ARC-Seal: i=1; a=rsa-sha256; t=1541002007; cv=none;
        d=google.com; s=arc-20160816;
        b=FZaNelfJhTmyzxSmqEi4f3P7OxY5X5+AkgEI1BD9Kzq8pwc9UoZLkIddx55hT8G7CR
         emNN3cCgfRu+z4kknrHKeqbPXG1pw1qtGTEqH01UPHzvQuFp0CUIdVOPDSwPHwAwq8S4
         rlhv9folJPeGuRWQ8+HapqNmPyvIoE2A2dCvelGuFf08Dgt0expwnIbQXIRL7voqUN74
         tyOlL+e+YGYRCv85aA+37ahczqYK2ovXAGToSHl0y+ntxgrMir1pU7zvNSDi82lDFblY
         NGsCL7ffpHASPUaehJ1W8sb8ArqtE6Ash0uP6VXqigqRMZ+Mdd45e30St2hbMoZZzthE
         EQhw==
ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com;
 s=arc-20160816;
        h=references:in-reply-to:message-id:date:subject:cc:to:from;
        bh=KWJEhkEX6JvntJ+v8ohUvcZ67FLtyYKozzooioiu6e0=;
        b=Dj5e6r/Ni6mKAmUfgBGurIEMOXTMY3pjHwm6CbKhXCYyygZ7N3OmEcGKer0s0wYXrJ
         55x4qdc9AFYeh3l+99GDNShCIVwUoHGGUHWOmzYS7AFz0JVCFOfmghqYj7E90j5z7MUk
         z4wETh/WJmQzypzVoJJHzYe7DPuwQgQ17cJG8NAhNAqD+eoYuKMsdRwQIGopUihHFYN+
         FqApepNbH1ifAyuIAGfy6Pk1VK6s81+KD6mqsM5K88z/r79v9U9mD/Uh8HVLBFqtsaOX
         sTWTtItvrdRtchL9+10Z9oiuYl7z0gCvwwtTdM52uG5vom2tzghFTAqWKVim1D9S90qa
         hY+w==
ARC-Authentication-Results: i=1; mx.google.com;
       spf=pass (google.com: domain of mgorman@techsingularity.net designates
 81.17.249.194 as permitted sender) smtp.mailfrom=mgorman@techsingularity.net
Received: from outbound-smtp26.blacknight.com (outbound-smtp26.blacknight.com.
 [81.17.249.194])
        by mx.google.com with ESMTPS id
 p14-v6si3143228edi.343.2018.10.31.09.06.46
        for <linux-mm@kvack.org>
        (version=TLS1 cipher=AES128-SHA bits=128/128);
        Wed, 31 Oct 2018 09:06:47 -0700 (PDT)
Received-SPF: pass (google.com: domain of mgorman@techsingularity.net
 designates 81.17.249.194 as permitted sender) client-ip=81.17.249.194;
Authentication-Results: mx.google.com;
       spf=pass (google.com: domain of mgorman@techsingularity.net designates
 81.17.249.194 as permitted sender) smtp.mailfrom=mgorman@techsingularity.net
Received: from mail.blacknight.com (pemlinmail04.blacknight.ie [81.17.254.17])
	by outbound-smtp26.blacknight.com (Postfix) with ESMTPS id B6EF1B88F8
	for <linux-mm@kvack.org>; Wed, 31 Oct 2018 16:06:46 +0000 (GMT)
Received: (qmail 5622 invoked from network); 31 Oct 2018 16:06:46 -0000
Received: from unknown (HELO stampy.163woodhaven.lan)
 (mgorman@techsingularity.net@[37.228.229.142])
  by 81.17.254.9 with ESMTPA; 31 Oct 2018 16:06:46 -0000
From: Mel Gorman <mgorman@techsingularity.net>
To: Linux-MM <linux-mm@kvack.org>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	Vlastimil Babka <vbabka@suse.cz>,
	David Rientjes <rientjes@google.com>,
	Andrea Arcangeli <aarcange@redhat.com>,
	Zi Yan <zi.yan@cs.rutgers.edu>,
	LKML <linux-kernel@vger.kernel.org>,
	Mel Gorman <mgorman@techsingularity.net>
Subject: [PATCH 3/5] mm: Reclaim small amounts of memory when an external
 fragmentation event occurs
Date: Wed, 31 Oct 2018 16:06:43 +0000
Message-Id: <20181031160645.7633-4-mgorman@techsingularity.net>
X-Mailer: git-send-email 2.16.4
In-Reply-To: <20181031160645.7633-1-mgorman@techsingularity.net>
References: <20181031160645.7633-1-mgorman@techsingularity.net>
X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4
Sender: owner-linux-mm@kvack.org
Precedence: bulk
X-Loop: owner-majordomo@kvack.org
List-ID: <linux-mm.kvack.org>
X-Virus-Scanned: ClamAV using ClamSMTP

An external fragmentation event was previously described as

    When the page allocator fragments memory, it records the event using
    the mm_page_alloc_extfrag event. If the fallback_order is smaller
    than a pageblock order (order-9 on 64-bit x86) then it's considered
    an event that will cause external fragmentation issues in the future.

The kernel reduces the probability of such events by increasing the
watermark sizes by calling set_recommended_min_free_kbytes early in the
lifetime of the system. This works reasonably well in general but if there
are enough sparsely populated pageblocks then the problem can still occur
as enough memory is free overall and kswapd stays asleep.

This patch introduces a watermark_boost_factor sysctl that allows a zone
watermark to be temporarily boosted when an external fragmentation causing
event occurs. The boosting will stall allocations below the boosted low
watermark and kswapd is woken unconditionally to reclaim an amount of
memory relative to the size of the high watermark and the
watermark_boost_factor until the boost is cleared. When kswapd finishes,
it wakes kcompactd at the pageblock order to clean some of the pageblocks
that may have been affected by the fragmentation event. kswapd avoids
any writeback or swap from reclaim context during this operation to avoid
excessive system disruption in the name of fragmentation avoidance. Care
is taken so that kswapd will do normal reclaim work if the system is
really low on memory.

This was evaluated using the same workloads as "mm, page_alloc: Spread
allocations across zones before introducing fragmentation".

1-socket Skylake machine
config-global-dhp__workload_thpfioscale XFS (no special madvise)
4 fio threads, 1 THP allocating thread
--------------------------------------

4.19 extfrag events < order 9:  71227
4.19+patch1:                    36456 (49% reduction)
4.19+patch1-3:                   4510 (94% reduction)

                                       4.19.0                 4.19.0
                                 lowzone-v1r1             boost-v1r5
Amean     fault-base-1      599.92 (   0.00%)      630.44 *  -5.09%*
Amean     fault-huge-1      179.84 (   0.00%)      179.22 (   0.35%)

                                  4.19.0                 4.19.0
                            lowzone-v1r1             boost-v1r5
Percentage huge-1        1.08 (   0.00%)        2.89 ( 168.75%)

Note that external fragmentation causing events are massively reduced
by this patch whether in comparison to the previous kernel or the vanilla
kernel. There is some jitter in the fault latencies and they are a bit
more variable but the slight increase in THP allocation success rates
would account for some of that.

1-socket Skylake machine
global-dhp__workload_thpfioscale-madvhugepage-xfs (MADV_HUGEPAGE)
-----------------------------------------------------------------

4.19 extfrag events < order 9:  40761
4.19+patch1:                    36085 (11% reduction)
4.19+patch1-3:                   1887 (95% reduction)

thpfioscale Fault Latencies
                                       4.19.0                 4.19.0
                                 lowzone-v1r1             boost-v1r5
Amean     fault-base-1     1938.47 (   0.00%)     1863.70 *   3.86%*
Amean     fault-huge-1      749.40 (   0.00%)      776.07 *  -3.56%*

thpfioscale Percentage Faults Huge
                                  4.19.0                 4.19.0
                            lowzone-v1r1             boost-v1r5
Percentage huge-1       83.79 (   0.00%)       86.92 (   3.73%)

As before, massive reduction in external fragmentation events, some
jitter on latencies and a slight increase in THP allocation success
rates.

2-socket Haswell machine
config-global-dhp__workload_thpfioscale XFS (no special madvise)
4 fio threads, 5 THP allocating threads
----------------------------------------------------------------

4.19 extfrag events < order 9:  882868
4.19+patch1:                    476937 (46% reduction)
4.19+patch1-3:                   29044 (97% reduction)

                                       4.19.0                 4.19.0
                                 lowzone-v1r1             boost-v1r5
Amean     fault-base-5     1602.01 (   0.00%)     1595.28 (   0.42%)
Amean     fault-huge-5        0.00 (   0.00%)      435.67 * -99.00%*

                                  4.19.0                 4.19.0
                            lowzone-v1r1             boost-v1r5
Percentage huge-5        0.00 (   0.00%)        0.15 ( 100.00%)

This is an illustration of why latencies are not the primary metric.
There is a 97% reduction in fragmentation causing events but the
huge page latencies are much higher because they went from never
succeeding to a small success.

2-socket Haswell machine
global-dhp__workload_thpfioscale-madvhugepage-xfs (MADV_HUGEPAGE)
-----------------------------------------------------------------

4.19 extfrag events < order 9: 803099
4.19+patch1:                   654671 (23% reduction)
4.19+patch1-3:                  24352 (97% reduction)

thpfioscale Fault Latencies
                                       4.19.0                 4.19.0
                                 lowzone-v1r1             boost-v1r5
Amean     fault-base-5     6678.61 (   0.00%)     5935.74 (  11.12%)
Amean     fault-huge-5     2796.35 (   0.00%)     2611.69 (   6.60%)

                                  4.19.0                 4.19.0
                            lowzone-v1r1             boost-v1r5
Percentage huge-5       57.92 (   0.00%)       66.18 (  14.26%)

There is a large reduction in fragmentation events and this is reflected
by a higher THP allocation success rate without a negative impact
on fault latencies.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
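Note for readers, not part of the commit: the boost arithmetic described
in the changelog can be sketched in user space as below. This is an
illustration under stated assumptions, not the kernel code. All toy_*
names are hypothetical, TOY_PAGEBLOCK_NR_PAGES assumes an order-9
pageblock of 4K pages, and the high watermark is passed in as a plain
number. In the patch as posted the value comes from
wmark_pages(zone, WMARK_HIGH), which already includes any boost in place.

```c
#include <assert.h>

/* Assumption: order-9 pageblocks with 4K pages (2MB), as on 64-bit x86. */
#define TOY_PAGEBLOCK_NR_PAGES	512UL

/* x * numer / denom without overflowing the product, like mult_frac(). */
static unsigned long toy_mult_frac(unsigned long x, unsigned long numer,
				   unsigned long denom)
{
	unsigned long quot, rem;

	assert(denom != 0);
	quot = x / denom;
	rem = x % denom;

	return quot * numer + (rem * numer) / denom;
}

/* Return the boost after one fragmenting fallback, given the current one. */
static unsigned long toy_boost_watermark(unsigned long cur_boost,
					 unsigned long high_wmark,
					 unsigned long boost_factor)
{
	unsigned long max_boost;

	/* A factor of 0 disables boosting. */
	if (!boost_factor)
		return cur_boost;

	/* The cap is a fraction of the high watermark in units of 1/10000. */
	max_boost = toy_mult_frac(high_wmark, boost_factor, 10000);

	/* Never cap below one pageblock. */
	if (max_boost < TOY_PAGEBLOCK_NR_PAGES)
		max_boost = TOY_PAGEBLOCK_NR_PAGES;

	/* Each event adds one pageblock's worth of pages, up to the cap. */
	cur_boost += TOY_PAGEBLOCK_NR_PAGES;

	return cur_boost < max_boost ? cur_boost : max_boost;
}
```

With a high watermark of 1000 pages and the default factor of 15000, the
cap is 1500 pages, so successive events move the boost from 0 to 512, then
1024, then 1500.
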
 Documentation/sysctl/vm.txt |  19 +++++++
 include/linux/mm.h          |   1 +
 include/linux/mmzone.h      |  11 ++--
 kernel/sysctl.c             |   8 +++
 mm/page_alloc.c             |  50 +++++++++++++++++-
 mm/vmscan.c                 | 123 ++++++++++++++++++++++++++++++++++++++++----
 6 files changed, 197 insertions(+), 15 deletions(-)
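
Note for readers, not part of the commit: the way balance_pgdat() accounts
for boosted reclaim can be modelled in user space as below. This is a
simplified sketch: locking, priorities and the reclaim itself are left
out, and struct toy_zone and the toy_* helpers are hypothetical names. It
shows why the per-zone boost is snapshotted on entry and only the
snapshot is subtracted on exit.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct zone; only the boost is modelled. */
struct toy_zone {
	unsigned long watermark_boost;
};

/* Record each zone's boost and return the total number of pages to reclaim. */
static unsigned long toy_snapshot_boosts(const struct toy_zone *zones,
					 unsigned long *snap, size_t nr)
{
	unsigned long total = 0;
	size_t i;

	for (i = 0; i < nr; i++) {
		snap[i] = zones[i].watermark_boost;
		total += snap[i];
	}

	return total;
}

/* Reduce the outstanding boost target by what one pass reclaimed. */
static unsigned long toy_account_reclaim(unsigned long nr_boost_reclaim,
					 unsigned long nr_reclaimed)
{
	unsigned long done = nr_reclaimed < nr_boost_reclaim ?
			     nr_reclaimed : nr_boost_reclaim;

	return nr_boost_reclaim - done;
}

/* Subtract only the snapshot so boosts added in the meantime survive. */
static void toy_clear_boosts(struct toy_zone *zones,
			     const unsigned long *snap, size_t nr)
{
	size_t i;

	for (i = 0; i < nr; i++) {
		unsigned long cur = zones[i].watermark_boost;

		zones[i].watermark_boost -= cur < snap[i] ? cur : snap[i];
	}
}
```

If a zone is boosted again while reclaim is running, the extra boost stays
in place after the snapshot is subtracted, so the next kswapd wakeup
still sees it.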

diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
index 7d73882e2c27..2244520d7913 100644
--- a/Documentation/sysctl/vm.txt
+++ b/Documentation/sysctl/vm.txt
@@ -63,6 +63,7 @@ files can be found in mm/swap.c.
 - swappiness
 - user_reserve_kbytes
 - vfs_cache_pressure
+- watermark_boost_factor
 - watermark_scale_factor
 - zone_reclaim_mode
 
@@ -856,6 +857,24 @@ ten times more freeable objects than there are.
 
 =============================================================
 
+watermark_boost_factor:
+
+This factor controls the level of reclaim when memory is being fragmented.
+It defines the percentage of the high watermark of a zone that will be
+reclaimed if pages of different mobility are being mixed within pageblocks.
+The intent is that compaction has less work to do and that future
+high-order allocations such as SLUB allocations, THP and hugetlbfs pages
+succeed more often.
+
+To make it sensible with respect to the watermark_scale_factor parameter,
+the unit is in fractions of 10,000. The default value of 15000 means
+that up to 150% of the high watermark will be reclaimed in the event of a
+pageblock being mixed due to fragmentation. If the resulting amount is
+smaller than a pageblock then a pageblock's worth of pages will be reclaimed
+(e.g. 2MB on 64-bit x86). A boost factor of 0 will disable the feature.
+
+=============================================================
+
 watermark_scale_factor:
 
 This factor controls the aggressiveness of kswapd. It defines the
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 0416a7204be3..036bba4b84af 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2174,6 +2174,7 @@ extern void zone_pcp_reset(struct zone *zone);
 
 /* page_alloc.c */
 extern int min_free_kbytes;
+extern int watermark_boost_factor;
 extern int watermark_scale_factor;
 
 /* nommu.c */
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 854d6c188888..30595df513c4 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -267,10 +267,10 @@ enum zone_watermarks {
 	NR_WMARK
 };
 
-#define min_wmark_pages(z) (z->_watermark[WMARK_MIN])
-#define low_wmark_pages(z) (z->_watermark[WMARK_LOW])
-#define high_wmark_pages(z) (z->_watermark[WMARK_HIGH])
-#define wmark_pages(z, i) (z->_watermark[i])
+#define min_wmark_pages(z) (z->_watermark[WMARK_MIN] + z->watermark_boost)
+#define low_wmark_pages(z) (z->_watermark[WMARK_LOW] + z->watermark_boost)
+#define high_wmark_pages(z) (z->_watermark[WMARK_HIGH] + z->watermark_boost)
+#define wmark_pages(z, i) (z->_watermark[i] + z->watermark_boost)
 
 struct per_cpu_pages {
 	int count;		/* number of pages in the list */
@@ -362,6 +362,7 @@ struct zone {
 
 	/* zone watermarks, access with *_wmark_pages(zone) macros */
 	unsigned long _watermark[NR_WMARK];
+	unsigned long watermark_boost;
 
 	unsigned long nr_reserved_highatomic;
 
@@ -886,6 +887,8 @@ static inline int is_highmem(struct zone *zone)
 struct ctl_table;
 int min_free_kbytes_sysctl_handler(struct ctl_table *, int,
 					void __user *, size_t *, loff_t *);
+int watermark_boost_factor_sysctl_handler(struct ctl_table *, int,
+					void __user *, size_t *, loff_t *);
 int watermark_scale_factor_sysctl_handler(struct ctl_table *, int,
 					void __user *, size_t *, loff_t *);
 extern int sysctl_lowmem_reserve_ratio[MAX_NR_ZONES];
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index cc02050fd0c4..6886c7928bb4 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1450,6 +1450,14 @@ static struct ctl_table vm_table[] = {
 		.proc_handler	= min_free_kbytes_sysctl_handler,
 		.extra1		= &zero,
 	},
+	{
+		.procname	= "watermark_boost_factor",
+		.data		= &watermark_boost_factor,
+		.maxlen		= sizeof(watermark_boost_factor),
+		.mode		= 0644,
+		.proc_handler	= watermark_boost_factor_sysctl_handler,
+		.extra1		= &zero,
+	},
 	{
 		.procname	= "watermark_scale_factor",
 		.data		= &watermark_scale_factor,
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index a51887765abc..f799c5510789 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -263,6 +263,7 @@ compound_page_dtor * const compound_page_dtors[] = {
 
 int min_free_kbytes = 1024;
 int user_min_free_kbytes = -1;
+int watermark_boost_factor __read_mostly = 15000;
 int watermark_scale_factor = 10;
 
 static unsigned long nr_kernel_pages __meminitdata;
@@ -2118,6 +2119,21 @@ static bool can_steal_fallback(unsigned int order, int start_mt)
 	return false;
 }
 
+static inline void boost_watermark(struct zone *zone)
+{
+	unsigned long max_boost;
+
+	if (!watermark_boost_factor)
+		return;
+
+	max_boost = mult_frac(wmark_pages(zone, WMARK_HIGH),
+			watermark_boost_factor, 10000);
+	max_boost = max(pageblock_nr_pages, max_boost);
+
+	zone->watermark_boost = min(zone->watermark_boost + pageblock_nr_pages,
+		max_boost);
+}
+
 /*
  * This function implements actual steal behaviour. If order is large enough,
  * we can steal whole pageblock. If not, we first move freepages in this
@@ -2149,6 +2165,14 @@ static void steal_suitable_fallback(struct zone *zone, struct page *page,
 		goto single_page;
 	}
 
+	/*
+	 * Boost watermarks to increase reclaim pressure to reduce the
+	 * likelihood of future fallbacks. Wake kswapd now as the node
+	 * may be balanced overall and kswapd will not wake naturally.
+	 */
+	boost_watermark(zone);
+	wakeup_kswapd(zone, 0, 0, zone_idx(zone));
+
 	/* We are not allowed to try stealing from the whole block */
 	if (!whole_block)
 		goto single_page;
@@ -3266,11 +3290,19 @@ static bool zone_allows_reclaim(struct zone *local_zone, struct zone *zone)
  * probably too small. It only makes sense to spread allocations to avoid
  * fragmentation between the Normal and DMA32 zones.
  */
-static inline unsigned int alloc_flags_nofragment(struct zone *zone)
+static inline unsigned int alloc_flags_nofragment(struct zone *zone,
+							gfp_t gfp_mask)
 {
 	if (zone_idx(zone) != ZONE_NORMAL)
 		return 0;
 
+	/*
+	 * A fragmenting fallback will try waking kswapd. ALLOC_NOFRAGMENT
+	 * may break that so such callers can introduce fragmentation.
+	 */
+	if (!(gfp_mask & __GFP_KSWAPD_RECLAIM))
+		return 0;
+
 	/*
 	 * If ZONE_DMA32 exists, assume it is the one after ZONE_NORMAL and
 	 * the pointer is within zone->zone_pgdat->node_zones[].
@@ -4443,7 +4475,8 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, int preferred_nid,
 	 * Forbid the first pass from falling back to types that fragment
 	 * memory until all local zones are considered.
 	 */
-	alloc_flags |= alloc_flags_nofragment(ac.preferred_zoneref->zone);
+	alloc_flags |= alloc_flags_nofragment(ac.preferred_zoneref->zone,
+								gfp_mask);
 
 	/* First allocation attempt */
 	page = get_page_from_freelist(alloc_mask, order, alloc_flags, &ac);
@@ -7343,6 +7376,7 @@ static void __setup_per_zone_wmarks(void)
 
 		zone->_watermark[WMARK_LOW]  = min_wmark_pages(zone) + tmp;
 		zone->_watermark[WMARK_HIGH] = min_wmark_pages(zone) + tmp * 2;
+		zone->watermark_boost = 0;
 
 		spin_unlock_irqrestore(&zone->lock, flags);
 	}
@@ -7443,6 +7477,18 @@ int min_free_kbytes_sysctl_handler(struct ctl_table *table, int write,
 	return 0;
 }
 
+int watermark_boost_factor_sysctl_handler(struct ctl_table *table, int write,
+	void __user *buffer, size_t *length, loff_t *ppos)
+{
+	int rc;
+
+	rc = proc_dointvec_minmax(table, write, buffer, length, ppos);
+	if (rc)
+		return rc;
+
+	return 0;
+}
+
 int watermark_scale_factor_sysctl_handler(struct ctl_table *table, int write,
 	void __user *buffer, size_t *length, loff_t *ppos)
 {
diff --git a/mm/vmscan.c b/mm/vmscan.c
index c5ef7240cbcb..7a8161258f0d 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3360,6 +3360,30 @@ static void age_active_anon(struct pglist_data *pgdat,
 	} while (memcg);
 }
 
+static bool pgdat_watermark_boosted(pg_data_t *pgdat, int classzone_idx)
+{
+	int i;
+	struct zone *zone;
+
+	/*
+	 * Check for watermark boosts top-down as the higher zones
+	 * are more likely to be boosted. Both watermarks and boosts
+	 * should not be checked at the same time as reclaim would
+	 * start prematurely when there is no boosting and a lower
+	 * zone is balanced.
+	 */
+	for (i = classzone_idx; i >= 0; i--) {
+		zone = pgdat->node_zones + i;
+		if (!managed_zone(zone))
+			continue;
+
+		if (zone->watermark_boost)
+			return true;
+	}
+
+	return false;
+}
+
 /*
  * Returns true if there is an eligible zone balanced for the request order
  * and classzone_idx
@@ -3370,9 +3394,12 @@ static bool pgdat_balanced(pg_data_t *pgdat, int order, int classzone_idx)
 	unsigned long mark = -1;
 	struct zone *zone;
 
+	/*
+	 * Check watermarks bottom-up as lower zones are more likely to
+	 * meet watermarks.
+	 */
 	for (i = 0; i <= classzone_idx; i++) {
 		zone = pgdat->node_zones + i;
-
 		if (!managed_zone(zone))
 			continue;
 
@@ -3497,23 +3524,42 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
 	int i;
 	unsigned long nr_soft_reclaimed;
 	unsigned long nr_soft_scanned;
+	unsigned long nr_boost_reclaim;
+	unsigned long zone_boosts[MAX_NR_ZONES] = { 0, };
+	bool boosted;
 	struct zone *zone;
 	struct scan_control sc = {
 		.gfp_mask = GFP_KERNEL,
 		.order = order,
-		.priority = DEF_PRIORITY,
-		.may_writepage = !laptop_mode,
 		.may_unmap = 1,
-		.may_swap = 1,
 	};
 
 	__fs_reclaim_acquire();
 
 	count_vm_event(PAGEOUTRUN);
 
+	/*
+	 * Account for the reclaim boost. Note that the zone boost is left in
+	 * place so that parallel allocations that are near the watermark will
+	 * stall or direct reclaim until kswapd is finished.
+	 */
+	nr_boost_reclaim = 0;
+	for (i = 0; i <= classzone_idx; i++) {
+		zone = pgdat->node_zones + i;
+		if (!managed_zone(zone))
+			continue;
+
+		nr_boost_reclaim += zone->watermark_boost;
+		zone_boosts[i] = zone->watermark_boost;
+	}
+	boosted = nr_boost_reclaim;
+
+restart:
+	sc.priority = DEF_PRIORITY;
 	do {
 		unsigned long nr_reclaimed = sc.nr_reclaimed;
 		bool raise_priority = true;
+		bool balanced;
 		bool ret;
 
 		sc.reclaim_idx = classzone_idx;
@@ -3540,13 +3586,39 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
 		}
 
 		/*
-		 * Only reclaim if there are no eligible zones. Note that
-		 * sc.reclaim_idx is not used as buffer_heads_over_limit may
-		 * have adjusted it.
+		 * If the pgdat is imbalanced then ignore boosting and preserve
+		 * the watermarks for a later time and restart. Note that the
+		 * zone watermarks will still be reset at the end of balancing
+		 * on the grounds that the normal reclaim should be enough to
+		 * re-evaluate if boosting is required when kswapd next wakes.
+		 */
+		balanced = pgdat_balanced(pgdat, sc.order, classzone_idx);
+		if (!balanced && nr_boost_reclaim) {
+			nr_boost_reclaim = 0;
+			goto restart;
+		}
+
+		/*
+		 * If boosting is not active then only reclaim if there are no
+		 * eligible zones. Note that sc.reclaim_idx is not used as
+		 * buffer_heads_over_limit may have adjusted it.
 		 */
-		if (pgdat_balanced(pgdat, sc.order, classzone_idx))
+		if (!nr_boost_reclaim && balanced)
 			goto out;
 
+		/* Limit the priority of boosting to avoid reclaim writeback */
+		if (nr_boost_reclaim && sc.priority == DEF_PRIORITY - 2)
+			raise_priority = false;
+
+		/*
+		 * Do not writeback or swap pages for boosted reclaim. The
+		 * intent is to relieve pressure not issue sub-optimal IO
+		 * from reclaim context. If no pages are reclaimed, the
+		 * reclaim will be aborted.
+		 */
+		sc.may_writepage = !laptop_mode && !nr_boost_reclaim;
+		sc.may_swap = !nr_boost_reclaim;
+
 		/*
 		 * Do some background aging of the anon list, to give
 		 * pages a chance to be referenced before reclaiming. All
@@ -3598,6 +3670,16 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
 		 * progress in reclaiming pages
 		 */
 		nr_reclaimed = sc.nr_reclaimed - nr_reclaimed;
+		nr_boost_reclaim -= min(nr_boost_reclaim, nr_reclaimed);
+
+		/*
+		 * If reclaim made no progress for a boost, stop reclaim as
+		 * IO cannot be queued and it could be an infinite loop in
+		 * extreme circumstances.
+		 */
+		if (nr_boost_reclaim && !nr_reclaimed)
+			break;
+
 		if (raise_priority || !nr_reclaimed)
 			sc.priority--;
 	} while (sc.priority >= 1);
@@ -3606,6 +3688,28 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
 		pgdat->kswapd_failures++;
 
 out:
+	/* If reclaim was boosted, account for the reclaim done in this pass */
+	if (boosted) {
+		unsigned long flags;
+
+		for (i = 0; i <= classzone_idx; i++) {
+			if (!zone_boosts[i])
+				continue;
+
+			/* Increments are under the zone lock */
+			zone = pgdat->node_zones + i;
+			spin_lock_irqsave(&zone->lock, flags);
+			zone->watermark_boost -= min(zone->watermark_boost, zone_boosts[i]);
+			spin_unlock_irqrestore(&zone->lock, flags);
+		}
+
+		/*
+		 * As there is now likely space, wake up kcompactd to defragment
+		 * pageblocks.
+		 */
+		wakeup_kcompactd(pgdat, pageblock_order, classzone_idx);
+	}
+
 	snapshot_refaults(NULL, pgdat);
 	__fs_reclaim_release();
 	/*
@@ -3833,7 +3937,8 @@ void wakeup_kswapd(struct zone *zone, gfp_t gfp_flags, int order,
 
 	/* Hopeless node, leave it to direct reclaim if possible */
 	if (pgdat->kswapd_failures >= MAX_RECLAIM_RETRIES ||
-	    pgdat_balanced(pgdat, order, classzone_idx)) {
+	    (pgdat_balanced(pgdat, order, classzone_idx) &&
+	     !pgdat_watermark_boosted(pgdat, classzone_idx))) {
 		/*
 		 * There may be plenty of free memory available, but it's too
 		 * fragmented for high-order allocations.  Wake up kcompactd

From patchwork Wed Oct 31 16:06:44 2018
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Mel Gorman <mgorman@techsingularity.net>
X-Patchwork-Id: 10662895
Return-Path: <owner-linux-mm@kvack.org>
Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org
 [172.30.200.125])
	by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 1787514DE
	for <patchwork-linux-mm@patchwork.kernel.org>;
 Wed, 31 Oct 2018 16:07:02 +0000 (UTC)
Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1])
	by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 069D426D08
	for <patchwork-linux-mm@patchwork.kernel.org>;
 Wed, 31 Oct 2018 16:07:02 +0000 (UTC)
Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486)
	id EEDDD2929C; Wed, 31 Oct 2018 16:07:01 +0000 (UTC)
X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on
	pdx-wl-mail.web.codeaurora.org
X-Spam-Level: 
X-Spam-Status: No, score=-2.9 required=2.0 tests=BAYES_00,MAILING_LIST_MULTI,
	RCVD_IN_DNSWL_NONE autolearn=ham version=3.3.1
Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17])
	by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 78E0826D08
	for <patchwork-linux-mm@patchwork.kernel.org>;
 Wed, 31 Oct 2018 16:07:00 +0000 (UTC)
Received: by kanga.kvack.org (Postfix)
	id 71E176B0270; Wed, 31 Oct 2018 12:06:50 -0400 (EDT)
Delivered-To: linux-mm-outgoing@kvack.org
Received: by kanga.kvack.org (Postfix, from userid 40)
	id 601B46B026C; Wed, 31 Oct 2018 12:06:50 -0400 (EDT)
X-Original-To: int-list-linux-mm@kvack.org
X-Delivered-To: int-list-linux-mm@kvack.org
Received: by kanga.kvack.org (Postfix, from userid 63042)
	id 4A5346B026F; Wed, 31 Oct 2018 12:06:50 -0400 (EDT)
X-Original-To: linux-mm@kvack.org
X-Delivered-To: linux-mm@kvack.org
Received: from mail-ed1-f71.google.com (mail-ed1-f71.google.com
 [209.85.208.71])
	by kanga.kvack.org (Postfix) with ESMTP id D1E016B026B
	for <linux-mm@kvack.org>; Wed, 31 Oct 2018 12:06:49 -0400 (EDT)
Received: by mail-ed1-f71.google.com with SMTP id g26-v6so11075871edp.13
        for <linux-mm@kvack.org>; Wed, 31 Oct 2018 09:06:49 -0700 (PDT)
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20161025;
        h=x-original-authentication-results:x-gm-message-state:from:to:cc
         :subject:date:message-id:in-reply-to:references;
        bh=NqPOevcOqsgWRK24OyEdef9iNyoASlSSrNZmmzFJKMA=;
        b=s84YgyFaZd9fOQuc2vuRchbLhzqIOzFUJAUp41fP4KenQU35G7a/OieX2KgoAMBaYj
         +diCh/nktyTh6I3AClTP52T3LovxjePN4T+37J+ObX7zMdwO0eQGjmE+KR1QlRKxchH1
         DLzSxkssG1rLFKhNoJHQ+JPGOWUrJgVl/K8gHtSY/e+PMdZNsPfP4Uk7ayGiG/ZqOp/p
         O1Z6jPZ2gow0QfvvUnvmSt4jNEE5fTSM7SC8ZBnsRLDF2DK6PUCqpGtSJWfdO/T9jD95
         S4dB+eK20UNVC7cRZC5CvqKk9vcEGhmhxQEI9EBn0pN05MpI6/Z9AQgTinI2J8CzjyZp
         kSsA==
X-Original-Authentication-Results: mx.google.com;
       spf=pass (google.com: domain of mgorman@techsingularity.net designates
 46.22.139.15 as permitted sender) smtp.mailfrom=mgorman@techsingularity.net
X-Gm-Message-State: AGRZ1gI2bCZl7vsoGAJ9mBVPLZLdKf9+8mgFnSqJKg7PLVQT7n551PGt
	WUbLC6p0His4jqXmfIKNQYQ7JP8QPzgTK9mDh0rYU2ha8pB2wIfaE3y9Zpj8RHV6q0KyMzgcEDj
	MyODd4KrVsEj395rYbtva2hP/WKExbNaOWU+SvooexCTYV6dp6l1FoU0eOst4fNMdzw==
X-Received: by 2002:a17:906:359b:: with SMTP id
 o27-v6mr2007653ejb.14.1541002009212;
        Wed, 31 Oct 2018 09:06:49 -0700 (PDT)
X-Google-Smtp-Source: 
 AJdET5ciVLCB9F9MzMeHxoYZJ3cerKo3yqow2I6dXy3YwpNMAHl9XcB0KbPgUYlc/vSBf1YnweRq
X-Received: by 2002:a17:906:359b:: with SMTP id
 o27-v6mr2007575ejb.14.1541002007412;
        Wed, 31 Oct 2018 09:06:47 -0700 (PDT)
ARC-Seal: i=1; a=rsa-sha256; t=1541002007; cv=none;
        d=google.com; s=arc-20160816;
        b=Mgdz3mlDGngqEzBHJ6ic7UBCaDQjStJnz+duLFFSxh7q+S5cdi+/tIRiPXkGCk++xD
         LD/N4xEfDlnm2VBfNcEDgUbUGYxYgqP6YAqGS5m4YgRv1lV8Oo59wZXQx4TUOIvAaRYM
         zjfn/Wa4SgSOP1qLrHLpykM+xFNalZ2xHO939jWbZj9/4jN0iHWEadqRIClsilEZzIlu
         EnyzWu960nQOZh/sGsHIpGZ2pOGa6zD4dQZZKskU3HQM3KnVmmW3hyV08wZX6WhM5e4p
         sQkgK9qdbjfqUvhjJ9KfleT7Cl3ZeQC3Ywi10Rx8iHD9vhA875HVnx8Nfvvjex5zhfi4
         4EgQ==
ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com;
 s=arc-20160816;
        h=references:in-reply-to:message-id:date:subject:cc:to:from;
        bh=NqPOevcOqsgWRK24OyEdef9iNyoASlSSrNZmmzFJKMA=;
        b=E4ORpv1tUymquvWFOK0GNbarOjeMEKhNVpjVoQZnv1jPzi59+gYbyjEtvONWDrL126
         7oPebQRkWS7u93UpwXdFEFNx66PcQk2Tvm6lsJVKVcXItQ7/Yg3EsPB+QBFuQVdna+Z0
         DMLhjzVxcczQOwMUc2al7uJ3+oz21aJMO2/YHAIXjm4vyWQJuJakRwanVb9hbvhKQdia
         8YVTiMIcGl/RdFsC1JSHNDEIu/BJGn2JiRxoS7dhu2FiKvxO9ZbfRPeeGYz7DyHyH72u
         Efnwz362VcVYczLwhlCS2q+HGXilaOK2JXft4AwqY1TSLMNiSQKwd+iMbKrDxx1O/Zsx
         lPqA==
ARC-Authentication-Results: i=1; mx.google.com;
       spf=pass (google.com: domain of mgorman@techsingularity.net designates
 46.22.139.15 as permitted sender) smtp.mailfrom=mgorman@techsingularity.net
Received: from outbound-smtp10.blacknight.com (outbound-smtp10.blacknight.com.
 [46.22.139.15])
        by mx.google.com with ESMTPS id
 10-v6si3530421ejo.30.2018.10.31.09.06.47
        for <linux-mm@kvack.org>
        (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128);
        Wed, 31 Oct 2018 09:06:47 -0700 (PDT)
Received-SPF: pass (google.com: domain of mgorman@techsingularity.net
 designates 46.22.139.15 as permitted sender) client-ip=46.22.139.15;
Authentication-Results: mx.google.com;
       spf=pass (google.com: domain of mgorman@techsingularity.net designates
 46.22.139.15 as permitted sender) smtp.mailfrom=mgorman@techsingularity.net
Received: from mail.blacknight.com (pemlinmail04.blacknight.ie [81.17.254.17])
	by outbound-smtp10.blacknight.com (Postfix) with ESMTPS id E514D1C23C5
	for <linux-mm@kvack.org>; Wed, 31 Oct 2018 16:06:46 +0000 (GMT)
Received: (qmail 5640 invoked from network); 31 Oct 2018 16:06:46 -0000
Received: from unknown (HELO stampy.163woodhaven.lan)
 (mgorman@techsingularity.net@[37.228.229.142])
  by 81.17.254.9 with ESMTPA; 31 Oct 2018 16:06:46 -0000
From: Mel Gorman <mgorman@techsingularity.net>
To: Linux-MM <linux-mm@kvack.org>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	Vlastimil Babka <vbabka@suse.cz>,
	David Rientjes <rientjes@google.com>,
	Andrea Arcangeli <aarcange@redhat.com>,
	Zi Yan <zi.yan@cs.rutgers.edu>,
	LKML <linux-kernel@vger.kernel.org>,
	Mel Gorman <mgorman@techsingularity.net>
Subject: [PATCH 4/5] mm: Stall movable allocations until kswapd progresses
 during serious external fragmentation event
Date: Wed, 31 Oct 2018 16:06:44 +0000
Message-Id: <20181031160645.7633-5-mgorman@techsingularity.net>
X-Mailer: git-send-email 2.16.4
In-Reply-To: <20181031160645.7633-1-mgorman@techsingularity.net>
References: <20181031160645.7633-1-mgorman@techsingularity.net>
X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4
Sender: owner-linux-mm@kvack.org
Precedence: bulk
X-Loop: owner-majordomo@kvack.org
List-ID: <linux-mm.kvack.org>
X-Virus-Scanned: ClamAV using ClamSMTP

External fragmentation causing events have already been described. A
serious external fragmentation causing event is one that steals a
contiguous range of pages of an order lower than fragment_stall_order
(PAGE_ALLOC_COSTLY_ORDER by default). If fragmentation would steal a
block smaller than this, this patch causes a movable allocation request
that is allowed to sleep to stall until kswapd makes progress. As kswapd
has just been woken due to a boosted watermark, the stall is expected to
return quickly.
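
As a rough illustration of the condition above, the following userspace C
sketch shows when a request would stall. It is not the kernel code: the
SKETCH_* flag values and the should_stall() helper are invented for this
example, and the default of 4 assumes PAGE_ALLOC_COSTLY_ORDER is 3 and
follows the initialiser this patch adds to mm/page_alloc.c
(PAGE_ALLOC_COSTLY_ORDER + 1).

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the kernel's GFP bits; values are illustrative. */
#define SKETCH_GFP_MOVABLE		0x1u
#define SKETCH_GFP_DIRECT_RECLAIM	0x2u

/* Assumed default: PAGE_ALLOC_COSTLY_ORDER (3) + 1, as in the patch. */
static int fragment_stall_order = 4;

/*
 * Return true if a request that is about to steal a free page of order
 * 'fallback_order' from another migratetype should stall first.
 */
static bool should_stall(unsigned int gfp, int fallback_order)
{
	const unsigned int need = SKETCH_GFP_MOVABLE | SKETCH_GFP_DIRECT_RECLAIM;

	assert(fallback_order >= 0);

	/* Only movable requests that are allowed to sleep may stall. */
	if ((gfp & need) != need)
		return false;

	/* Stealing a block below the threshold is the "serious" event. */
	return fallback_order < fragment_stall_order;
}
```

With the assumed default, should_stall() is true for a movable, sleepable
request falling back to an order-0 to order-3 page and false at order 4
and above. Setting fragment_stall_order to 0 disables the stall because
no order is lower than 0.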

This stall is not guaranteed to avoid serious fragmentation causing events.
If memory pressure is high enough, the pages freed by kswapd may still
be used or they may not be in pageblocks that contain only movable
pages. Furthermore, an allocation request that cannot stall (e.g. atomic
allocations) or that is for unmovable/reclaimable pages will still proceed
without stalling.

1-socket Skylake machine
config-global-dhp__workload_thpfioscale XFS (no special madvise)
4 fio threads, 1 THP allocating thread
--------------------------------------

4.19 extfrag events < order 0:  71227
4.19+patch1:                    36456 (49% reduction)
4.19+patch1-3:                   4510 (94% reduction)
4.19+patch1-4:                    548 (99% reduction)

Fragmentation events reduced further. The latency and allocation rates
were similar so are not included for brevity.
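
The reduction percentages in the table above are relative to the
unpatched 4.19 baseline. A minimal sketch of that arithmetic, assuming
round-to-nearest (which matches the figures shown for this table); the
helper name reduction_pct() is made up for this example:

```c
#include <assert.h>

/* Percentage reduction from 'base' to 'now', rounded to the nearest integer. */
static long reduction_pct(long base, long now)
{
	assert(base > 0 && now >= 0 && now <= base);

	/* Integer round-to-nearest: add half the divisor before dividing. */
	return ((base - now) * 100 + base / 2) / base;
}
```

For example, reduction_pct(71227, 548) evaluates to 99, which is the
figure reported for patches 1-4 on this configuration.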

1-socket Skylake machine
global-dhp__workload_thpfioscale-madvhugepage-xfs (MADV_HUGEPAGE)
-----------------------------------------------------------------

4.19 extfrag events < order 0:  40761
4.19+patch1:                    36085 (11% reduction)
4.19+patch1-3:                   1887 (95% reduction)
4.19+patch1-4:                    394 (99% reduction)

thpfioscale Fault Latencies
                                       4.19.0                 4.19.0
                                   boost-v1r5             stall-v1r6
Amean     fault-base-1     1863.70 (   0.00%)     3943.28 *-111.58%*
Amean     fault-huge-1      776.07 (   0.00%)     2739.80 *-253.03%*

                                  4.19.0                 4.19.0
                              boost-v1r5             stall-v1r6
Percentage huge-1       86.92 (   0.00%)       98.55 (  13.39%)

Similar to the first case, the reduction in fragmentation events
is notable. However, on this occasion the latencies are much higher
but the allocation success rate is also much higher at 98%. This is
a case where the increased success rate causes pressure elsewhere but
the reduced external fragmentation events mean that compaction is more
effective. This is a classic trade-off between latency and allocation
success rate but, if problematic, the behaviour can be tuned.

2-socket Haswell machine
config-global-dhp__workload_thpfioscale XFS (no special madvise)
4 fio threads, 5 THP allocating threads
----------------------------------------------------------------

4.19 extfrag events < order 0:  882868
4.19+patch1:                    476937 (46% reduction)
4.19+patch1-3:                   29044 (97% reduction)
4.19+patch1-4:                   29290 (97% reduction)

There is little impact on fragmentation causing events and the
latency and allocation rates were similar.

2-socket Haswell machine
global-dhp__workload_thpfioscale-madvhugepage-xfs (MADV_HUGEPAGE)
-----------------------------------------------------------------

4.19 extfrag events < order 0: 803099
4.19+patch1:                   654671 (23% reduction)
4.19+patch1-3:                  24352 (97% reduction)
4.19+patch1-4:                  16698 (98% reduction)

thpfioscale Fault Latencies
                                       4.19.0                 4.19.0
                                   boost-v1r5             stall-v1r6
Amean     fault-base-5     5935.74 (   0.00%)     8649.60 * -45.72%*
Amean     fault-huge-5     2611.69 (   0.00%)     2799.82 (  -7.20%)

                                  4.19.0                 4.19.0
                              boost-v1r5             stall-v1r6
Percentage huge-5       66.18 (   0.00%)       77.80 (  17.56%)

Similar to the 1-socket case, the fragmentation events are reduced
but the higher THP allocation success rates also impact the latencies
as compaction goes to work.

This patch does reduce fragmentation rates overall but it's not free as
some allocations can stall for short periods of time. While it's within
acceptable limits for the adverse test case, there may be other workloads
that cannot tolerate the stalls. Either it can be tuned to disable the
feature or, more ideally, the test case is made available for analysis
to see if the stall behaviour can be reduced while still limiting the
fragmentation events. On the flip-side, it has been checked that setting
fragment_stall_order to 9 eliminated fragmentation events entirely on
the 1-socket machine and reduced them by 99.71% on the 2-socket machine.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 Documentation/sysctl/vm.txt | 23 +++++++++++++++
 include/linux/mm.h          |  1 +
 include/linux/mmzone.h      |  2 ++
 kernel/sysctl.c             | 10 +++++++
 mm/internal.h               |  1 +
 mm/page_alloc.c             | 68 +++++++++++++++++++++++++++++++++++++++------
 6 files changed, 97 insertions(+), 8 deletions(-)

diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
index 2244520d7913..f7d3fcb9d4ce 100644
--- a/Documentation/sysctl/vm.txt
+++ b/Documentation/sysctl/vm.txt
@@ -31,6 +31,7 @@ files can be found in mm/swap.c.
 - dirty_writeback_centisecs
 - drop_caches
 - extfrag_threshold
+- fragment_stall_order
 - hugetlb_shm_group
 - laptop_mode
 - legacy_va_layout
@@ -275,6 +276,28 @@ any throttling.
 
 ==============================================================
 
+fragment_stall_order
+
+External fragmentation control is managed on a pageblock level where the
+page allocator tries to avoid mixing pages of different mobility within page
+blocks (e.g. order 9 on 64-bit x86). If external fragmentation is perfectly
+controlled then a THP allocation will often succeed up to the number of
+movable pageblocks in the system as reported by /proc/pagetypeinfo.
+
+When memory is low, the system may have to mix pageblocks and will wake
+kswapd to try to control future fragmentation. fragment_stall_order controls if
+the allocating task will stall if possible until kswapd makes some progress
+in preference to fragmenting the system. This incurs a small stall penalty
+in exchange for future success at allocating huge pages. If the stalls
+are undesirable and high-order allocations are irrelevant then this can
+be disabled by writing 0 to the tunable. Writing the pageblock order will
+strongly (but not perfectly) control external fragmentation.
+
+The default will stall for fragmenting allocations smaller than the
+PAGE_ALLOC_COSTLY_ORDER (defined as order-3 at the time of writing).
+
+==============================================================
+
 hugetlb_shm_group
 
 hugetlb_shm_group contains group id that is allowed to create SysV
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 036bba4b84af..a1a2e2833986 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2176,6 +2176,7 @@ extern void zone_pcp_reset(struct zone *zone);
 extern int min_free_kbytes;
 extern int watermark_boost_factor;
 extern int watermark_scale_factor;
+extern int fragment_stall_order;
 
 /* nommu.c */
 extern atomic_long_t mmap_pages_allocated;
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 30595df513c4..66e71a8ac8a6 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -891,6 +891,8 @@ int watermark_boost_factor_sysctl_handler(struct ctl_table *, int,
 					void __user *, size_t *, loff_t *);
 int watermark_scale_factor_sysctl_handler(struct ctl_table *, int,
 					void __user *, size_t *, loff_t *);
+int fragment_stall_order_sysctl_handler(struct ctl_table *, int,
+					void __user *, size_t *, loff_t *);
 extern int sysctl_lowmem_reserve_ratio[MAX_NR_ZONES];
 int lowmem_reserve_ratio_sysctl_handler(struct ctl_table *, int,
 					void __user *, size_t *, loff_t *);
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 6886c7928bb4..d26f3d9a6400 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -125,6 +125,7 @@ static int zero;
 static int __maybe_unused one = 1;
 static int __maybe_unused two = 2;
 static int __maybe_unused four = 4;
+static int __maybe_unused max_order = MAX_ORDER;
 static unsigned long one_ul = 1;
 static int one_hundred = 100;
 static int one_thousand = 1000;
@@ -1467,6 +1468,15 @@ static struct ctl_table vm_table[] = {
 		.extra1		= &one,
 		.extra2		= &one_thousand,
 	},
+	{
+		.procname	= "fragment_stall_order",
+		.data		= &fragment_stall_order,
+		.maxlen		= sizeof(fragment_stall_order),
+		.mode		= 0644,
+		.proc_handler	= fragment_stall_order_sysctl_handler,
+		.extra1		= &zero,
+		.extra2		= &max_order,
+	},
 	{
 		.procname	= "percpu_pagelist_fraction",
 		.data		= &percpu_pagelist_fraction,
diff --git a/mm/internal.h b/mm/internal.h
index 0dd659cf2a7e..4f159a3b5c4f 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -489,6 +489,7 @@ unsigned long reclaim_clean_pages_from_list(struct zone *zone,
 #else
 #define ALLOC_NOFRAGMENT	  0x0
 #endif
+#define ALLOC_FRAGMENT_STALL	0x200 /* stall if fragmenting heavily */
 
 enum ttu_flags;
 struct tlbflush_unmap_batch;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index f799c5510789..63de66b893d3 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -265,6 +265,7 @@ int min_free_kbytes = 1024;
 int user_min_free_kbytes = -1;
 int watermark_boost_factor __read_mostly = 15000;
 int watermark_scale_factor = 10;
+int fragment_stall_order __read_mostly = (PAGE_ALLOC_COSTLY_ORDER + 1);
 
 static unsigned long nr_kernel_pages __meminitdata;
 static unsigned long nr_all_pages __meminitdata;
@@ -2134,6 +2135,21 @@ static inline void boost_watermark(struct zone *zone)
 		max_boost);
 }
 
+static void stall_fragmentation(pg_data_t *pgdat)
+{
+	DEFINE_WAIT(wait);
+	long remaining = 0;
+
+	if (current->flags & PF_MEMALLOC)
+		return;
+
+	prepare_to_wait(&pgdat->pfmemalloc_wait, &wait, TASK_INTERRUPTIBLE);
+	if (waitqueue_active(&pgdat->kswapd_wait))
+		wake_up_interruptible(&pgdat->kswapd_wait);
+	remaining = schedule_timeout(HZ/10);
+	finish_wait(&pgdat->pfmemalloc_wait, &wait);
+}
+
 /*
  * This function implements actual steal behaviour. If order is large enough,
  * we can steal whole pageblock. If not, we first move freepages in this
@@ -2142,8 +2158,9 @@ static inline void boost_watermark(struct zone *zone)
  * of pages are free or compatible, we can change migratetype of the pageblock
  * itself, so pages freed in the future will be put on the correct free list.
  */
-static void steal_suitable_fallback(struct zone *zone, struct page *page,
-					int start_type, bool whole_block)
+static bool steal_suitable_fallback(struct zone *zone, struct page *page,
+					int start_type, bool whole_block,
+					unsigned int alloc_flags)
 {
 	unsigned int current_order = page_order(page);
 	struct free_area *area;
@@ -2173,6 +2190,11 @@ static void steal_suitable_fallback(struct zone *zone, struct page *page,
 	boost_watermark(zone);
 	wakeup_kswapd(zone, 0, 0, zone_idx(zone));
 
+	if ((alloc_flags & ALLOC_FRAGMENT_STALL) &&
+	    current_order < fragment_stall_order) {
+		return false;
+	}
+
 	/* We are not allowed to try stealing from the whole block */
 	if (!whole_block)
 		goto single_page;
@@ -2213,11 +2235,12 @@ static void steal_suitable_fallback(struct zone *zone, struct page *page,
 			page_group_by_mobility_disabled)
 		set_pageblock_migratetype(page, start_type);
 
-	return;
+	return true;
 
 single_page:
 	area = &zone->free_area[current_order];
 	list_move(&page->lru, &area->free_list[start_type]);
+	return true;
 }
 
 /*
@@ -2456,13 +2479,14 @@ __rmqueue_fallback(struct zone *zone, int order, int start_migratetype,
 	page = list_first_entry(&area->free_list[fallback_mt],
 							struct page, lru);
 
-	steal_suitable_fallback(zone, page, start_migratetype, can_steal);
+	if (!steal_suitable_fallback(zone, page, start_migratetype, can_steal,
+								alloc_flags))
+		return false;
 
 	trace_mm_page_alloc_extfrag(page, order, current_order,
 		start_migratetype, fallback_mt);
 
 	return true;
-
 }
 
 /*
@@ -3331,6 +3355,7 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
 	struct zone *zone;
 	struct pglist_data *last_pgdat_dirty_limit = NULL;
 	bool no_fallback;
+	bool fragment_stall;
 
 retry:
 	/*
@@ -3338,6 +3363,8 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
 	 * See also __cpuset_node_allowed() comment in kernel/cpuset.c.
 	 */
 	no_fallback = alloc_flags & ALLOC_NOFRAGMENT;
+	fragment_stall = alloc_flags & ALLOC_FRAGMENT_STALL;
+
 	for_next_zone_zonelist_nodemask(zone, z, ac->zonelist, ac->high_zoneidx,
 								ac->nodemask) {
 		struct page *page;
@@ -3376,18 +3403,21 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
 			}
 		}
 
-		if (no_fallback) {
+		if (no_fallback || fragment_stall) {
+			pg_data_t *pgdat = zone->zone_pgdat;
 			int local_nid;
 
 			/*
 			 * If moving to a remote node, retry but allow
 			 * fragmenting fallbacks. Locality is more important
 			 * than fragmentation avoidance.
-			 *
 			 */
+			if (fragment_stall)
+				stall_fragmentation(pgdat);
 			local_nid = zone_to_nid(ac->preferred_zoneref->zone);
 			if (zone_to_nid(zone) != local_nid) {
 				alloc_flags &= ~ALLOC_NOFRAGMENT;
+				alloc_flags &= ~ALLOC_FRAGMENT_STALL;
 				goto retry;
 			}
 		}
@@ -3463,8 +3493,9 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
 	 * It's possible on a UMA machine to get through all zones that are
 	 * fragmented. If avoiding fragmentation, reset and try again
 	 */
-	if (no_fallback) {
+	if (no_fallback || fragment_stall) {
 		alloc_flags &= ~ALLOC_NOFRAGMENT;
+		alloc_flags &= ~ALLOC_FRAGMENT_STALL;
 		goto retry;
 	}
 
@@ -4192,6 +4223,14 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	 */
 	alloc_flags = gfp_to_alloc_flags(gfp_mask);
 
+	/*
+	 * Consider stalling on heavy fragmentation for movable allocations in
+	 * preference to fragmenting unmovable/reclaimable pageblocks.
+	 */
+	if ((gfp_mask & (__GFP_MOVABLE|__GFP_DIRECT_RECLAIM)) ==
+			(__GFP_MOVABLE|__GFP_DIRECT_RECLAIM))
+		alloc_flags |= ALLOC_FRAGMENT_STALL;
+
 	/*
 	 * We need to recalculate the starting point for the zonelist iterator
 	 * because we might have used different nodemask in the fast path, or
@@ -4213,6 +4252,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	page = get_page_from_freelist(gfp_mask, order, alloc_flags, ac);
 	if (page)
 		goto got_pg;
+	alloc_flags &= ~ALLOC_FRAGMENT_STALL;
 
 	/*
 	 * For costly allocations, try direct compaction first, as it's likely
@@ -7489,6 +7529,18 @@ int watermark_boost_factor_sysctl_handler(struct ctl_table *table, int write,
 	return 0;
 }
 
+int fragment_stall_order_sysctl_handler(struct ctl_table *table, int write,
+	void __user *buffer, size_t *length, loff_t *ppos)
+{
+	int rc;
+
+	rc = proc_dointvec_minmax(table, write, buffer, length, ppos);
+	if (rc)
+		return rc;
+
+	return 0;
+}
+
 int watermark_scale_factor_sysctl_handler(struct ctl_table *table, int write,
 	void __user *buffer, size_t *length, loff_t *ppos)
 {

From patchwork Wed Oct 31 16:06:45 2018
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Mel Gorman <mgorman@techsingularity.net>
X-Patchwork-Id: 10662899
Return-Path: <owner-linux-mm@kvack.org>
Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org
 [172.30.200.125])
	by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id E722114DE
	for <patchwork-linux-mm@patchwork.kernel.org>;
 Wed, 31 Oct 2018 16:07:10 +0000 (UTC)
Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1])
	by mail.wl.linuxfoundation.org (Postfix) with ESMTP id D518E26D08
	for <patchwork-linux-mm@patchwork.kernel.org>;
 Wed, 31 Oct 2018 16:07:10 +0000 (UTC)
Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486)
	id C637E29EB2; Wed, 31 Oct 2018 16:07:10 +0000 (UTC)
X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on
	pdx-wl-mail.web.codeaurora.org
X-Spam-Level: 
X-Spam-Status: No, score=-2.9 required=2.0 tests=BAYES_00,MAILING_LIST_MULTI,
	RCVD_IN_DNSWL_NONE autolearn=ham version=3.3.1
Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17])
	by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 7A3B226D08
	for <patchwork-linux-mm@patchwork.kernel.org>;
 Wed, 31 Oct 2018 16:07:09 +0000 (UTC)
Received: by kanga.kvack.org (Postfix)
	id C95B96B026B; Wed, 31 Oct 2018 12:06:50 -0400 (EDT)
Delivered-To: linux-mm-outgoing@kvack.org
Received: by kanga.kvack.org (Postfix, from userid 40)
	id C45A66B026E; Wed, 31 Oct 2018 12:06:50 -0400 (EDT)
X-Original-To: int-list-linux-mm@kvack.org
X-Delivered-To: int-list-linux-mm@kvack.org
Received: by kanga.kvack.org (Postfix, from userid 63042)
	id 9AEDF6B026F; Wed, 31 Oct 2018 12:06:50 -0400 (EDT)
X-Original-To: linux-mm@kvack.org
X-Delivered-To: linux-mm@kvack.org
Received: from mail-ed1-f70.google.com (mail-ed1-f70.google.com
 [209.85.208.70])
	by kanga.kvack.org (Postfix) with ESMTP id 1988F6B026A
	for <linux-mm@kvack.org>; Wed, 31 Oct 2018 12:06:50 -0400 (EDT)
Received: by mail-ed1-f70.google.com with SMTP id x1-v6so10976061eds.16
        for <linux-mm@kvack.org>; Wed, 31 Oct 2018 09:06:50 -0700 (PDT)
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20161025;
        h=x-original-authentication-results:x-gm-message-state:from:to:cc
         :subject:date:message-id:in-reply-to:references;
        bh=GAUan2SuAD2Tp76K2SuQfT4XdYilWRU+Ql0ZD4Ov4r4=;
        b=kOq6EdJRde4Chui7NAOLvVsB8yz3iG72I+I/qO4qRTP2FVYXG1dkqu3Gj6crHono6F
         +JXXfogKE/tat5pgeZEdacAhR/DZxbSZY6rSSUPsZ77yjDAYtG6cnHKWr/xKIOeWrYd7
         6L7AamcxUTkIZGQag19pvRYT6aHIC92u2YvkUHuEmJM8kB0reH2DWsOQhwgPXOpfmR46
         tKcHJmbJjXWPCHVFS/Ut1NLbgvwi7oKoCQKaO9f3Irbwis02KncXmYHU+rgRiXJJpFGr
         QMDofLpKDHjB1cpNtDFqECmvmaaeI2UJ7ptPXzzjWuhG6yxSOPl0Lun/jyTsn0sQVDyS
         US9Q==
X-Original-Authentication-Results: mx.google.com;
       spf=pass (google.com: domain of mgorman@techsingularity.net designates
 81.17.249.35 as permitted sender) smtp.mailfrom=mgorman@techsingularity.net
X-Gm-Message-State: AGRZ1gITJ/GpU/ejst3c/tStMSPCwSnfpNydzTfJ2kMRfVd6xvFQdlRS
	fQmMIzFHK0PEoSsyuGWPYYuR5JYb9AXL1Yfl005XSPNejL71NNxEEruwWbX0NireGcFd3laJ11s
	SYyndkN2KRzov3/Y8BX7Zuhq4tmT8/K1ybGIJ6uOYlKOk4D+dWsxn1UMpw3+wQpz4iw==
X-Received: by 2002:a17:906:f14e:: with SMTP id
 gw14-v6mr1997760ejb.231.1541002009456;
        Wed, 31 Oct 2018 09:06:49 -0700 (PDT)
X-Google-Smtp-Source: 
 AJdET5eswuw5sKbzazoQZ3u5/sH2CC5wbCdOjIZn9ad2G9B+1xFhAi8RNVrDSTbW9R78S3fCRin7
X-Received: by 2002:a17:906:f14e:: with SMTP id
 gw14-v6mr1997669ejb.231.1541002007414;
        Wed, 31 Oct 2018 09:06:47 -0700 (PDT)
ARC-Seal: i=1; a=rsa-sha256; t=1541002007; cv=none;
        d=google.com; s=arc-20160816;
        b=T+xoU0TwFRTbVI7lxNNAB+KzHtun60kmNHcdHyb6unFT1iblqqGC9YL/sA7aFnZoz6
         sFNb+Rkqmr9BrP7CEJPNGuBiqaj8P+cJ6ocmWjdPC6GJdiYxBjyK1M62likzT8m/ZAT3
         s1VV5EhabvqRycy3EDLkP+lGwzInCPXO+ULuFbqr1tE7jyFJNFU+b4hoPzXuiU7VVe+C
         U17siK6HHT9Y6CxuJkoC6cT95OG2sRWU5rU8SWG6ddo4S7lQvyvv8tJn0vw/wpbSh+k/
         cAg/Y2DlK7AIUJAqvR3RyWgWeGJ9Kc/n5PCGgpT7WTYueBMMOV8bdQJO53QsKFvo9x2f
         YUOA==
ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com;
 s=arc-20160816;
        h=references:in-reply-to:message-id:date:subject:cc:to:from;
        bh=GAUan2SuAD2Tp76K2SuQfT4XdYilWRU+Ql0ZD4Ov4r4=;
        b=j9A5zTl9wdgjbsmKYrUsGyPPjWUv2aXKktSDfonzHtxoWMho77LSiMY2PaxC9Jn9Ux
         LVWfsFGMS6HcbwKGTySRGcpiwTXeD3aJDwcbDZJPnNECobmYFaMs7KRhtSGSLV0TS66Y
         WHGV9fViweWWZALpU5GN0clJSfvY64SbhjrotQ35Hl5kHQwAT48tGPqtSPcBZuVMA8Df
         ihm02SkuSTbmD9/xl3enomRGrcSsYKsI48fXC6XUl9+wyGYxIFxjP7lIoVr06fm83aQ1
         B+VfWBPV9HTeFnVUAhrjtOVx4VMdCFB/NphcsfD51xThr2ZIFdIkEhoOX0XwXq9Lefbr
         ++cA==
ARC-Authentication-Results: i=1; mx.google.com;
       spf=pass (google.com: domain of mgorman@techsingularity.net designates
 81.17.249.35 as permitted sender) smtp.mailfrom=mgorman@techsingularity.net
Received: from outbound-smtp04.blacknight.com (outbound-smtp04.blacknight.com.
 [81.17.249.35])
        by mx.google.com with ESMTPS id a2si481537edv.415.2018.10.31.09.06.47
        for <linux-mm@kvack.org>
        (version=TLS1 cipher=AES128-SHA bits=128/128);
        Wed, 31 Oct 2018 09:06:47 -0700 (PDT)
Received-SPF: pass (google.com: domain of mgorman@techsingularity.net
 designates 81.17.249.35 as permitted sender) client-ip=81.17.249.35;
Authentication-Results: mx.google.com;
       spf=pass (google.com: domain of mgorman@techsingularity.net designates
 81.17.249.35 as permitted sender) smtp.mailfrom=mgorman@techsingularity.net
Received: from mail.blacknight.com (pemlinmail04.blacknight.ie [81.17.254.17])
	by outbound-smtp04.blacknight.com (Postfix) with ESMTPS id 1A24B9896F
	for <linux-mm@kvack.org>; Wed, 31 Oct 2018 16:06:47 +0000 (UTC)
Received: (qmail 5659 invoked from network); 31 Oct 2018 16:06:47 -0000
Received: from unknown (HELO stampy.163woodhaven.lan)
 (mgorman@techsingularity.net@[37.228.229.142])
  by 81.17.254.9 with ESMTPA; 31 Oct 2018 16:06:47 -0000
From: Mel Gorman <mgorman@techsingularity.net>
To: Linux-MM <linux-mm@kvack.org>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	Vlastimil Babka <vbabka@suse.cz>,
	David Rientjes <rientjes@google.com>,
	Andrea Arcangeli <aarcange@redhat.com>,
	Zi Yan <zi.yan@cs.rutgers.edu>,
	LKML <linux-kernel@vger.kernel.org>,
	Mel Gorman <mgorman@techsingularity.net>
Subject: [PATCH 5/5] mm: Target compaction on pageblocks that were recently
 fragmented
Date: Wed, 31 Oct 2018 16:06:45 +0000
Message-Id: <20181031160645.7633-6-mgorman@techsingularity.net>
X-Mailer: git-send-email 2.16.4
In-Reply-To: <20181031160645.7633-1-mgorman@techsingularity.net>
References: <20181031160645.7633-1-mgorman@techsingularity.net>
X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4
Sender: owner-linux-mm@kvack.org
Precedence: bulk
X-Loop: owner-majordomo@kvack.org
List-ID: <linux-mm.kvack.org>
X-Virus-Scanned: ClamAV using ClamSMTP

Despite the earlier patches, external fragmentation events are still
inevitable as not all callers can stall or are appropriate to stall
(e.g. unmovable allocations that kswapd reclaim will not necessarily
help). In the event there is a mixed pageblock, it's desirable to move all
movable pages from that block so that unmovable/unreclaimable allocations
do not further pollute the address space.

This patch queues such pageblocks for early compaction and relies on
kswapd to wake kcompactd when some pages are reclaimed. Waking kcompactd
after kswapd makes progress is so that the compaction is more likely to
have a suitable migration destination.
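
The queue-then-compact idea can be sketched in userspace C as below.
This is purely illustrative and is not the data structure used by the
patch: the bounded array, the SKETCH_* constants and both helper names
are assumptions made for the example. Only the overall behaviour (record
a recently fragmented pageblock, act on it once reclaim has made
progress) comes from the description above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_PAGEBLOCK_ORDER	9	/* e.g. order 9 on 64-bit x86 */
#define SKETCH_QUEUE_LEN	16	/* arbitrary bound for the example */

/* Hypothetical per-zone queue of pageblock start PFNs. */
struct sketch_queue {
	unsigned long pfn[SKETCH_QUEUE_LEN];
	size_t nr;
};

/* Round a PFN down to the start of its pageblock. */
static unsigned long pageblock_start(unsigned long pfn)
{
	return pfn & ~((1UL << SKETCH_PAGEBLOCK_ORDER) - 1);
}

/*
 * Queue the pageblock containing 'pfn'. A block that is already queued
 * is not added twice. Returns false if the queue is full, in which case
 * the block is simply not tracked (best effort only).
 */
static bool queue_pageblock(struct sketch_queue *q, unsigned long pfn)
{
	unsigned long start = pageblock_start(pfn);
	size_t i;

	assert(q != NULL);

	for (i = 0; i < q->nr; i++)
		if (q->pfn[i] == start)
			return true;

	if (q->nr == SKETCH_QUEUE_LEN)
		return false;

	q->pfn[q->nr++] = start;
	return true;
}

/*
 * Hand out the oldest queued pageblock, but only once reclaim has made
 * progress so that migration is more likely to find a destination.
 */
static bool dequeue_pageblock(struct sketch_queue *q, bool reclaim_progress,
			      unsigned long *start)
{
	size_t i;

	assert(q != NULL && start != NULL);

	if (!reclaim_progress || q->nr == 0)
		return false;

	*start = q->pfn[0];
	for (i = 1; i < q->nr; i++)
		q->pfn[i - 1] = q->pfn[i];
	q->nr--;
	return true;
}
```

In this sketch a full queue drops new entries rather than blocking, on
the assumption that tracking is opportunistic and a missed pageblock is
still reachable by normal compaction.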

This patch may be controversial as there are multiple other design
decisions that can be made. We could refuse to change pageblock ownership
in some cases but great care would need to be taken to avoid premature
OOMs or a livelock. Similarly, we could tag pageblocks as mixed and
search for them but that would increase scanning costs. Finally, there
is a corner case that a mixed pageblock that is after the point where a
free scanner can operate may fail to clean the pageblock but addressing
that would require a fundamental alteration to how compaction works.

Unlike the earlier patches, it is harder to prove that this one is a
benefit because it ideally requires a very long-lived fragmenting
workload to show if it's really effective. The timing of such an
allocation stream would be critical and detecting the change would be
difficult as it can be within the noise. Hence, the potential benefit of
this patch is more conceptual than quantitative even though there are
some positive results.

1-socket Skylake machine
config-global-dhp__workload_thpfioscale XFS (no special madvise)
4 fio threads, 1 THP allocating thread
--------------------------------------

4.19 extfrag events < order 0:  71227
4.19+patch1:                    36456 (49% reduction)
4.19+patch1-3:                   4510 (94% reduction)
4.19+patch1-4:                    548 (99% reduction)
4.19+patch1-5:                    422 (99% reduction)

                                       4.19.0                 4.19.0
                                   stall-v1r6         proactive-v1r6
Amean     fault-base-1      839.48 (   0.00%)      860.89 *  -2.55%*
Amean     fault-huge-1      172.74 (   0.00%)      159.49 (   7.67%)

                                  4.19.0                 4.19.0
                              stall-v1r6         proactive-v1r6
Percentage huge-1        1.04 (   0.00%)        2.29 ( 119.35%)

While there is an improvement in the reduction of fragmentation events
and allocation success rates, the differences are marginal enough that
they may not be significant.

1-socket Skylake machine
global-dhp__workload_thpfioscale-madvhugepage-xfs (MADV_HUGEPAGE)
-----------------------------------------------------------------

4.19 extfrag events < order 0:  40761
4.19+patch1:                    36085 (11% reduction)
4.19+patch1-3:                   1887 (95% reduction)
4.19+patch1-4:                    394 (99% reduction)
4.19+patch1-5:                    440 (99% reduction)

thpfioscale Fault Latencies
                                       4.19.0                 4.19.0
                                   stall-v1r6         proactive-v1r6
Amean     fault-base-1     3943.28 (   0.00%)     2704.46 *  31.42%*
Amean     fault-huge-1     2739.80 (   0.00%)     2552.13 *   6.85%*

thpfioscale Percentage Faults Huge
                                  4.19.0                 4.19.0
                              stall-v1r6         proactive-v1r6
Percentage huge-1       98.55 (   0.00%)       98.76 (   0.20%)

There is a slight increase in fragmentation events, albeit a very small
one. The latency is much improved and there is a slight increase in
allocation success rates but this may be a coincidence of the system state.

2-socket Haswell machine
config-global-dhp__workload_thpfioscale XFS (no special madvise)
4 fio threads, 5 THP allocating threads
----------------------------------------------------------------

4.19 extfrag events < order 0:  882868
4.19+patch1:                    476937 (46% reduction)
4.19+patch1-3:                   29044 (97% reduction)
4.19+patch1-4:                   29290 (97% reduction)
4.19+patch1-5:                   30791 (97% reduction)

thpfioscale Fault Latencies
                                       4.19.0                 4.19.0
                                   stall-v1r6         proactive-v1r6
Amean     fault-base-5     1773.24 (   0.00%)     1519.89 *  14.29%*
Amean     fault-huge-5    17791.20 (   0.00%)      536.44 (  96.98%)

                                  4.19.0                 4.19.0
                              stall-v1r6         proactive-v1r6
Percentage huge-5        0.17 (   0.00%)        0.98 ( 490.00%)

Again, the number of fragmentation causing events is slightly increased
although this is likely within the noise. The latency is massively
improved but the success rate is only marginally improved. Given the low
success rate, it may be a coincidence of the exact system state during
the test but the fact it happened on both 1 and 2 socket machines is
encouraging.

2-socket Haswell machine
global-dhp__workload_thpfioscale-madvhugepage-xfs (MADV_HUGEPAGE)
-----------------------------------------------------------------

4.19 extfrag events < order 0: 803099
4.19+patch1:                   654671 (23% reduction)
4.19+patch1-3:                  24352 (97% reduction)
4.19+patch1-4:                  16698 (98% reduction)
4.19+patch1-5:                  32623 (96% reduction)

thpfioscale Fault Latencies
                                       4.19.0                 4.19.0
                                   stall-v1r6         proactive-v1r6
Amean     fault-base-5     8649.60 (   0.00%)    13074.71 * -51.16%*
Amean     fault-huge-5     2799.82 (   0.00%)     3410.02 * -21.79%*

thpfioscale Percentage Faults Huge
                                  4.19.0                 4.19.0
                              stall-v1r6         proactive-v1r6
Percentage huge-5       77.80 (   0.00%)       83.30 (   7.06%)

This shows an increase in both fragmentation events and latency. However,
it is somewhat balanced by the higher allocation success rates, which in
themselves can increase fragmentation pressure.

This is less obviously a universal win. It does control fragmentation
better to some extent in that pageblocks can be found faster in some
cases, but the nature of the workload makes it less clear-cut.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 include/linux/compaction.h        |   4 ++
 include/linux/migrate.h           |   7 +-
 include/linux/mmzone.h            |   4 ++
 include/trace/events/compaction.h |  62 ++++++++++++++++
 mm/compaction.c                   | 146 +++++++++++++++++++++++++++++++++++---
 mm/migrate.c                      |   6 +-
 mm/page_alloc.c                   |   7 ++
 7 files changed, 225 insertions(+), 11 deletions(-)
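
Illustration only, not part of the patch: the per-zone queue this patch
adds can be modelled in user space as a small fixed-size array with a
linear duplicate check. The sketch below is a simplified model under
stated assumptions: struct zone_model, pageblock_start(), queue_block(),
pop_block() and the PAGEBLOCK_NR_PAGES value of 512 are hypothetical
stand-ins, and locking, tracing and the migration itself are omitted.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed value for illustration; the kernel derives it per architecture. */
#define PAGEBLOCK_NR_PAGES	512UL
#define COMPACT_QUEUE_LENGTH	16

/* The mask in pageblock_start() only works for power-of-two block sizes. */
static_assert((PAGEBLOCK_NR_PAGES & (PAGEBLOCK_NR_PAGES - 1)) == 0,
	      "pageblock size must be a power of two");

/* Hypothetical stand-in for the two fields added to struct zone. */
struct zone_model {
	unsigned long compact_queue[COMPACT_QUEUE_LENGTH];
	int nr_compact;
};

/* Round a pfn down to the first pfn of its pageblock. */
static unsigned long pageblock_start(unsigned long pfn)
{
	return pfn & ~(PAGEBLOCK_NR_PAGES - 1);
}

/*
 * Queue the pageblock containing pfn. Returns true if a new entry was
 * added, false if the queue is full or the block is already queued.
 */
static bool queue_block(struct zone_model *z, unsigned long pfn)
{
	unsigned long start = pageblock_start(pfn);
	int i;

	/* Do not overflow the queue */
	if (z->nr_compact == COMPACT_QUEUE_LENGTH)
		return false;

	/* Only queue a pageblock once */
	for (i = 0; i < z->nr_compact; i++)
		if (z->compact_queue[i] == start)
			return false;

	z->compact_queue[z->nr_compact++] = start;
	return true;
}

/* Pop the most recently queued block. Returns false if the queue is empty. */
static bool pop_block(struct zone_model *z, unsigned long *pfn)
{
	if (!z->nr_compact)
		return false;

	*pfn = z->compact_queue[--z->nr_compact];
	return true;
}
```

In this model the drain side pops from the tail, so the most recently
queued pageblock is handled first, in the same way as the
compact_queue[--zone->nr_compact] access in kcompactd_do_queue().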

diff --git a/include/linux/compaction.h b/include/linux/compaction.h
index 68250a57aace..1fc1ad055f66 100644
--- a/include/linux/compaction.h
+++ b/include/linux/compaction.h
@@ -177,6 +177,7 @@ bool compaction_zonelist_suitable(struct alloc_context *ac, int order,
 extern int kcompactd_run(int nid);
 extern void kcompactd_stop(int nid);
 extern void wakeup_kcompactd(pg_data_t *pgdat, int order, int classzone_idx);
+extern void kcompactd_queue_migration(struct zone *zone, struct page *page);
 
 #else
 static inline void reset_isolation_suitable(pg_data_t *pgdat)
@@ -225,6 +226,9 @@ static inline void wakeup_kcompactd(pg_data_t *pgdat, int order, int classzone_i
 {
 }
 
+static inline void kcompactd_queue_migration(struct zone *zone, struct page *page)
+{
+}
 #endif /* CONFIG_COMPACTION */
 
 #if defined(CONFIG_COMPACTION) && defined(CONFIG_SYSFS) && defined(CONFIG_NUMA)
diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index f2b4abbca55e..f12cee38c0f0 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -61,7 +61,7 @@ static inline struct page *new_page_nodemask(struct page *page,
 
 #ifdef CONFIG_MIGRATION
 
-extern void putback_movable_pages(struct list_head *l);
+extern unsigned int putback_movable_pages(struct list_head *l);
 extern int migrate_page(struct address_space *mapping,
 			struct page *newpage, struct page *page,
 			enum migrate_mode mode);
@@ -82,7 +82,10 @@ extern int migrate_page_move_mapping(struct address_space *mapping,
 		int extra_count);
 #else
 
-static inline void putback_movable_pages(struct list_head *l) {}
+static inline unsigned int putback_movable_pages(struct list_head *l)
+{
+	return 0;
+}
 static inline int migrate_pages(struct list_head *l, new_page_t new,
 		free_page_t free, unsigned long private, enum migrate_mode mode,
 		int reason)
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 66e71a8ac8a6..0a905add8112 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -495,6 +495,10 @@ struct zone {
 	unsigned int		compact_considered;
 	unsigned int		compact_defer_shift;
 	int			compact_order_failed;
+
+#define COMPACT_QUEUE_LENGTH 16
+	unsigned long		compact_queue[COMPACT_QUEUE_LENGTH];
+	int			nr_compact;
 #endif
 
 #if defined CONFIG_COMPACTION || defined CONFIG_CMA
diff --git a/include/trace/events/compaction.h b/include/trace/events/compaction.h
index 6074eff3d766..6b5b61177d8c 100644
--- a/include/trace/events/compaction.h
+++ b/include/trace/events/compaction.h
@@ -353,6 +353,68 @@ DEFINE_EVENT(kcompactd_wake_template, mm_compaction_kcompactd_wake,
 	TP_ARGS(nid, order, classzone_idx)
 );
 
+TRACE_EVENT(mm_compaction_wakeup_kcompactd_queue,
+
+	TP_PROTO(
+		int nid,
+		enum zone_type zoneid,
+		unsigned long pfn,
+		int nr_queued),
+
+	TP_ARGS(nid, zoneid, pfn, nr_queued),
+
+	TP_STRUCT__entry(
+		__field(int, nid)
+		__field(enum zone_type, zoneid)
+		__field(unsigned long, pfn)
+		__field(int, nr_queued)
+	),
+
+	TP_fast_assign(
+		__entry->nid = nid;
+		__entry->zoneid = zoneid;
+		__entry->pfn = pfn;
+		__entry->nr_queued = nr_queued;
+	),
+
+	TP_printk("nid=%d zoneid=%-8s pfn=%lu nr_queued=%d",
+		__entry->nid,
+		__print_symbolic(__entry->zoneid, ZONE_TYPE),
+		__entry->pfn,
+		__entry->nr_queued)
+);
+
+TRACE_EVENT(mm_compaction_kcompactd_migrated,
+
+	TP_PROTO(
+		int nid,
+		enum zone_type zoneid,
+		int nr_migrated,
+		int nr_failed),
+
+	TP_ARGS(nid, zoneid, nr_migrated, nr_failed),
+
+	TP_STRUCT__entry(
+		__field(int, nid)
+		__field(enum zone_type, zoneid)
+		__field(int, nr_migrated)
+		__field(int, nr_failed)
+	),
+
+	TP_fast_assign(
+		__entry->nid = nid;
+		__entry->zoneid = zoneid;
+		__entry->nr_migrated = nr_migrated;
+		__entry->nr_failed = nr_failed;
+	),
+
+	TP_printk("nid=%d zoneid=%-8s nr_migrated=%d nr_failed=%d",
+		__entry->nid,
+		__print_symbolic(__entry->zoneid, ZONE_TYPE),
+		__entry->nr_migrated,
+		__entry->nr_failed)
+);
+
 #endif /* _TRACE_COMPACTION_H */
 
 /* This part must be outside protection */
diff --git a/mm/compaction.c b/mm/compaction.c
index aa9473a64915..853538e568d9 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1914,6 +1914,12 @@ void compaction_unregister_node(struct node *node)
 
 static inline bool kcompactd_work_requested(pg_data_t *pgdat)
 {
+	int zoneid;
+
+	for (zoneid = 0; zoneid < MAX_NR_ZONES; zoneid++)
+		if (pgdat->node_zones[zoneid].nr_compact)
+			return true;
+
 	return pgdat->kcompactd_max_order > 0 || kthread_should_stop();
 }
 
@@ -1937,6 +1943,93 @@ static bool kcompactd_node_suitable(pg_data_t *pgdat)
 	return false;
 }
 
+static void kcompactd_migrate_block(struct compact_control *cc,
+	unsigned long pfn)
+{
+	unsigned long end = min(pfn + pageblock_nr_pages, zone_end_pfn(cc->zone));
+	unsigned long total_migrated = 0, total_failed = 0;
+
+	cc->migrate_pfn = pfn;
+	while (pfn && pfn < end) {
+		int err;
+		unsigned long nr_migrated, nr_failed = 0;
+
+		pfn = isolate_migratepages_range(cc, pfn, end);
+		if (!pfn)
+			break;
+
+		nr_migrated = cc->nr_migratepages;
+		err = migrate_pages(&cc->migratepages, compaction_alloc,
+				compaction_free, (unsigned long)cc,
+				cc->mode, MR_COMPACTION);
+		if (err) {
+			nr_failed = putback_movable_pages(&cc->migratepages);
+			nr_migrated -= nr_failed;
+		}
+		cc->nr_migratepages = 0;
+		total_migrated += nr_migrated;
+		total_failed += nr_failed;
+	}
+
+	trace_mm_compaction_kcompactd_migrated(zone_to_nid(cc->zone),
+		zone_idx(cc->zone), total_migrated, total_failed);
+	return;
+}
+
+static void kcompactd_init_cc(struct compact_control *cc, struct zone *zone)
+{
+	cc->nr_freepages = 0;
+	cc->nr_migratepages = 0;
+	cc->total_migrate_scanned = 0;
+	cc->total_free_scanned = 0;
+	cc->zone = zone;
+	INIT_LIST_HEAD(&cc->freepages);
+	INIT_LIST_HEAD(&cc->migratepages);
+}
+
+static void kcompactd_do_queue(pg_data_t *pgdat)
+{
+	/*
+	 * Drain each zone's queue of pageblocks and migrate the movable
+	 * pages out of every queued block.
+	 */
+	int zoneid;
+	struct zone *zone;
+	struct compact_control cc = {
+		.order = 0,
+		.total_migrate_scanned = 0,
+		.total_free_scanned = 0,
+		.classzone_idx = 0,
+		.mode = MIGRATE_SYNC,
+		.ignore_skip_hint = true,
+		.gfp_mask = GFP_KERNEL,
+	};
+	trace_mm_compaction_kcompactd_wake(pgdat->node_id, 0, -1);
+
+	migrate_prep();
+	for (zoneid = 0; zoneid < MAX_NR_ZONES; zoneid++) {
+		unsigned long pfn = ULONG_MAX;
+		int limit;
+
+		zone = &pgdat->node_zones[zoneid];
+		if (!populated_zone(zone))
+			continue;
+
+		kcompactd_init_cc(&cc, zone);
+		cc.free_pfn = pageblock_start_pfn(zone_end_pfn(zone) - 1);
+		limit = zone->nr_compact;
+		while (zone->nr_compact && limit--) {
+			unsigned long flags;
+
+			spin_lock_irqsave(&zone->lock, flags);
+			if (zone->nr_compact)
+				pfn = zone->compact_queue[--zone->nr_compact];
+			spin_unlock_irqrestore(&zone->lock, flags);
+			kcompactd_migrate_block(&cc, pfn);
+		}
+	}
+}
+
 static void kcompactd_do_work(pg_data_t *pgdat)
 {
 	/*
@@ -1956,7 +2049,6 @@ static void kcompactd_do_work(pg_data_t *pgdat)
 	};
 	trace_mm_compaction_kcompactd_wake(pgdat->node_id, cc.order,
 							cc.classzone_idx);
-	count_compact_event(KCOMPACTD_WAKE);
 
 	for (zoneid = 0; zoneid <= cc.classzone_idx; zoneid++) {
 		int status;
@@ -1972,13 +2064,7 @@ static void kcompactd_do_work(pg_data_t *pgdat)
 							COMPACT_CONTINUE)
 			continue;
 
-		cc.nr_freepages = 0;
-		cc.nr_migratepages = 0;
-		cc.total_migrate_scanned = 0;
-		cc.total_free_scanned = 0;
-		cc.zone = zone;
-		INIT_LIST_HEAD(&cc.freepages);
-		INIT_LIST_HEAD(&cc.migratepages);
+		kcompactd_init_cc(&cc, zone);
 
 		if (kthread_should_stop())
 			return;
@@ -2024,6 +2110,19 @@ static void kcompactd_do_work(pg_data_t *pgdat)
 
 void wakeup_kcompactd(pg_data_t *pgdat, int order, int classzone_idx)
 {
+	int i;
+
+	/* Wake kcompactd if there are compaction queue entries */
+	for (i = 0; i < MAX_NR_ZONES; i++) {
+		struct zone *zone = &pgdat->node_zones[i];
+
+		if (!managed_zone(zone))
+			continue;
+
+		if (zone->nr_compact)
+			goto wake;
+	}
+
 	if (!order)
 		return;
 
@@ -2043,6 +2142,7 @@ void wakeup_kcompactd(pg_data_t *pgdat, int order, int classzone_idx)
 	if (!kcompactd_node_suitable(pgdat))
 		return;
 
+wake:
 	trace_mm_compaction_wakeup_kcompactd(pgdat->node_id, order,
 							classzone_idx);
 	wake_up_interruptible(&pgdat->kcompactd_wait);
@@ -2072,12 +2172,42 @@ static int kcompactd(void *p)
 		wait_event_freezable(pgdat->kcompactd_wait,
 				kcompactd_work_requested(pgdat));
 
+		count_compact_event(KCOMPACTD_WAKE);
+		kcompactd_do_queue(pgdat);
 		kcompactd_do_work(pgdat);
 	}
 
 	return 0;
 }
 
+/*
+ * Queue a pageblock to have all of its movable pages migrated out. Note that
+ * kcompactd is not woken at this point. This assumes that kswapd has
+ * been woken to reclaim pages above the boosted watermark. kcompactd
+ * will be woken when kswapd has made progress.
+ */
+void kcompactd_queue_migration(struct zone *zone, struct page *page)
+{
+	unsigned long pfn = page_to_pfn(page) & ~(pageblock_nr_pages - 1);
+	int nr_queued = -1;
+
+	/* Do not overflow the queue */
+	if (zone->nr_compact == COMPACT_QUEUE_LENGTH)
+		goto trace;
+
+	/* Only queue a pageblock once */
+	for (nr_queued = 0; nr_queued < zone->nr_compact; nr_queued++) {
+		if (zone->compact_queue[nr_queued] == pfn)
+			return;
+	}
+
+	zone->compact_queue[zone->nr_compact++] = pfn;
+
+trace:
+	trace_mm_compaction_wakeup_kcompactd_queue(zone_to_nid(zone),
+		zone_idx(zone), pfn, nr_queued);
+}
+
 /*
  * This kcompactd start function will be called by init and node-hot-add.
  * On node-hot-add, kcompactd will moved to proper cpus if cpus are hot-added.
diff --git a/mm/migrate.c b/mm/migrate.c
index 84381b55b2bd..b8ce5b56a2a9 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -164,12 +164,14 @@ void putback_movable_page(struct page *page)
  * built from lru, balloon, hugetlbfs page. See isolate_migratepages_range()
  * and isolate_huge_page().
  */
-void putback_movable_pages(struct list_head *l)
+unsigned int putback_movable_pages(struct list_head *l)
 {
 	struct page *page;
 	struct page *page2;
+	unsigned int nr_putback = 0;
 
 	list_for_each_entry_safe(page, page2, l, lru) {
+		nr_putback++;
 		if (unlikely(PageHuge(page))) {
 			putback_active_hugepage(page);
 			continue;
@@ -195,6 +197,8 @@ void putback_movable_pages(struct list_head *l)
 			putback_lru_page(page);
 		}
 	}
+
+	return nr_putback;
 }
 
 /*
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 63de66b893d3..77bcc35903e0 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2190,6 +2190,9 @@ static bool steal_suitable_fallback(struct zone *zone, struct page *page,
 	boost_watermark(zone);
 	wakeup_kswapd(zone, 0, 0, zone_idx(zone));
 
+	if (start_type == MIGRATE_MOVABLE || old_block_type == MIGRATE_MOVABLE)
+		kcompactd_queue_migration(zone, page);
+
 	if ((alloc_flags & ALLOC_FRAGMENT_STALL) &&
 	    current_order < fragment_stall_order) {
 		return false;
@@ -6359,7 +6362,11 @@ static void pgdat_init_split_queue(struct pglist_data *pgdat) {}
 #ifdef CONFIG_COMPACTION
 static void pgdat_init_kcompactd(struct pglist_data *pgdat)
 {
+	int i;
+
 	init_waitqueue_head(&pgdat->kcompactd_wait);
+	for (i = 0; i < MAX_NR_ZONES; i++)
+		pgdat->node_zones[i].nr_compact = 0;
 }
 #else
 static void pgdat_init_kcompactd(struct pglist_data *pgdat) {}