[1/4] mm, page_alloc: Spread allocations across zones before introducing fragmentation

The page allocator iterates zone lists based on the watermarks of each
zone, which does not take anti-fragmentation into account. On x86, node 0
may have multiple zones while other nodes have one zone. A consequence is
that tasks running on node 0 may fragment ZONE_NORMAL even though
ZONE_DMA32 has plenty of free memory. This patch special cases the
allocator fast path such that it'll try an allocation from a lower local
zone before fragmenting a higher zone. In this case, stealing of
pageblocks or orders larger than a pageblock is still allowed in the fast
path as such steals are uninteresting from a fragmentation point of view.
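
As a rough illustration of the ordering described above, here is a toy
Python model. It is not the kernel code: the Zone class, the per-migratetype
lists of free block orders and the allocate() helper are all hypothetical,
watermark checks are left out and a free block is consumed whole rather
than split. The first pass refuses any steal below pageblock order and
moves on to the lower zone; only the second pass is allowed to fragment.

```python
from dataclasses import dataclass, field

PAGEBLOCK_ORDER = 9  # pageblock order on 64-bit x86, as stated in the text


@dataclass
class Zone:
    """Hypothetical stand-in for a zone: free block orders per migratetype."""
    name: str
    free: dict = field(default_factory=dict)  # e.g. {"movable": [9, 3]}

    def take(self, mtype, order, min_order=0):
        """Remove one free block of `mtype` that is at least `order` and at
        least `min_order`. The whole block is consumed (no buddy split)."""
        blocks = self.free.get(mtype, [])
        for i, block_order in enumerate(blocks):
            if block_order >= max(order, min_order):
                del blocks[i]
                return True
        return False


def allocate(zonelist, mtype, order):
    """Return (zone name, fragmented) or None if nothing is free."""
    for allow_fragment in (False, True):
        # Steals below pageblock order are only allowed on the second pass.
        min_steal = 0 if allow_fragment else PAGEBLOCK_ORDER
        for zone in zonelist:
            if zone.take(mtype, order):
                return zone.name, False
            for other in zone.free:
                if other != mtype and zone.take(other, order, min_steal):
                    # A steal that only succeeds on the second pass is
                    # necessarily smaller than a pageblock.
                    return zone.name, allow_fragment
    return None


normal = Zone("Normal", {"unmovable": [3]})
dma32 = Zone("DMA32", {"movable": [4]})
zonelist = [normal, dma32]  # higher zone first

first = allocate(zonelist, "movable", 0)   # served from DMA32, no steal
second = allocate(zonelist, "movable", 0)  # only now steals from Normal
```

In this model the first request is satisfied from DMA32 without touching
the unmovable block in Normal, and the steal from Normal happens only once
the lower zone has nothing suitable left.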

This was evaluated using a benchmark designed to fragment memory
before attempting THP allocations.  It's implemented in mmtests as the
following configurations:

configs/config-global-dhp__workload_thpfioscale
configs/config-global-dhp__workload_thpfioscale-defrag
configs/config-global-dhp__workload_thpfioscale-madvhugepage

e.g. from mmtests
./run-mmtests.sh --run-monitor --config configs/config-global-dhp__workload_thpfioscale test-run-1

The broad details of the workload are as follows:

1. Create an XFS filesystem (not specified in the configuration but done
   as part of the testing for this patch)
2. Start 4 fio threads that write a number of 64K files inefficiently.
   Inefficiently means that files are created on first access and not
   created in advance (fio parameter create_on_open=1) and fallocate
   is not used (fallocate=none). With multiple IO issuers this creates
   a mix of slab and page cache allocations over time. The total size
   of the files is 150% physical memory so that the slabs and page cache
   pages get mixed
3. Warm up a number of fio read-only threads accessing the same files
   created in step 2. This part runs for the same length of time it
   took to create the files. It'll fault back in old data and further
   interleave slab and page cache allocations. As it's now low on
   memory due to step 2, fragmentation occurs as pageblocks get
   stolen.
4. While step 3 is still running, start a process that tries to allocate
   75% of memory as huge pages with a number of threads. The number of
   threads is based on (NR_CPUS_SOCKET - NR_FIO_THREADS)/4 to avoid THP
   threads contending with fio, any other threads or forcing cross-NUMA
   scheduling. Note that the test has not been used on a machine with less
   than 8 cores. The benchmark records whether huge pages were allocated
   and what the fault latency was in microseconds
5. Measure the number of events potentially causing external fragmentation,
   the fault latency and the huge page allocation success rate.
6. Cleanup
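
The thread count in step 4 can be sketched as below. This is only a
worked example of the formula, not the mmtests implementation: the
function name, the use of integer division and the floor of one thread
are assumptions, and the CPU counts used are hypothetical rather than
those of the test machines.

```python
def thp_threads(nr_cpus_socket: int, nr_fio_threads: int = 4) -> int:
    """(NR_CPUS_SOCKET - NR_FIO_THREADS)/4 from step 4 of the workload.

    Integer division and the minimum of one thread are assumptions made
    for this sketch.
    """
    return max(1, (nr_cpus_socket - nr_fio_threads) // 4)


# Hypothetical socket sizes, each with the 4 fio threads used in the text.
small = thp_threads(8)    # (8 - 4) // 4
large = thp_threads(24)   # (24 - 4) // 4
```

With 4 fio threads, a hypothetical 8-CPU socket yields a single THP
thread and a hypothetical 24-CPU socket yields five.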

Note that due to the use of IO and page cache, this benchmark is not
suitable for running on large machines where the time to fragment memory
may be excessive. Also note that while this is one mix that generates
fragmentation, it is not the only one. Workloads that are more
slab-intensive, or setups where SLUB is used with high-order pages, may
yield different results.

When the page allocator fragments memory, it records the event using the
mm_page_alloc_extfrag event. If the fallback_order is smaller than a
pageblock order (order-9 on 64-bit x86) then it's considered an event
that may cause external fragmentation issues in the future. Hence, the
primary metric here is the number of external fragmentation events that
occur with order < 9. The secondary metrics are allocation latency and huge
page allocation success rate. Note that differences in latency and in
success rate can themselves affect the number of external fragmentation
events, which is why they are secondary metrics.
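
The primary metric can be approximated from trace output along the
following lines. This is a minimal sketch and not how mmtests gathers the
number: the helper name is made up, the sample lines are invented for
illustration, and it assumes each mm_page_alloc_extfrag line carries a
fallback_order=N field.

```python
import re

PAGEBLOCK_ORDER = 9  # order-9 on 64-bit x86, per the text

# Assumes a "fallback_order=N" field on each mm_page_alloc_extfrag line.
FALLBACK_RE = re.compile(r"mm_page_alloc_extfrag:.*?\bfallback_order=(\d+)")


def count_extfrag(lines, pageblock_order=PAGEBLOCK_ORDER):
    """Return (events below pageblock order, all extfrag events)."""
    serious = total = 0
    for line in lines:
        match = FALLBACK_RE.search(line)
        if match is None:
            continue  # not an extfrag event
        total += 1
        if int(match.group(1)) < pageblock_order:
            serious += 1
    return serious, total


# Invented sample lines, trimmed to the fields that matter here.
sample = [
    "fio-1 [000] 1.0: mm_page_alloc_extfrag: alloc_order=0 fallback_order=3",
    "fio-2 [001] 1.1: mm_page_alloc_extfrag: alloc_order=0 fallback_order=9",
    "fio-1 [000] 1.2: mm_page_alloc: order=0 migratetype=1",
    "fio-3 [002] 1.3: mm_page_alloc_extfrag: alloc_order=2 fallback_order=0",
]
serious, total = count_extfrag(sample)
```

Only the events with fallback_order below the pageblock order are counted
towards the primary metric; the order-9 steal in the sample is ignored.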

1-socket Skylake machine
config-global-dhp__workload_thpfioscale XFS (no special madvise)
4 fio threads, 1 THP allocating thread
--------------------------------------

4.20-rc1 extfrag events < order 9:  1023463
4.20-rc1+patch:                      358574 (65% reduction)

thpfioscale Fault Latencies
                                   4.20.0-rc1             4.20.0-rc1
                                      vanilla           lowzone-v2r4
Min       fault-base-1      588.00 (   0.00%)      557.00 (   5.27%)
Min       fault-huge-1        0.00 (   0.00%)        0.00 (   0.00%)
Amean     fault-base-1      663.58 (   0.00%)      663.65 (  -0.01%)
Amean     fault-huge-1        0.00 (   0.00%)        0.00 (   0.00%)

thpfioscale Percentage Faults Huge
                              4.20.0-rc1             4.20.0-rc1
                                 vanilla           lowzone-v2r4
Percentage huge-1        0.00 (   0.00%)        0.00 (   0.00%)

Fault latencies are largely unchanged (the minimum improves by roughly 5%
while the mean is flat) and allocation success rates remain at zero as
this configuration does not make any heavy effort to allocate THP and fio
is heavily active at the time and filling memory.  However, a 65%
reduction of serious fragmentation events reduces the chances of external
fragmentation being a problem in the future.

1-socket Skylake machine
global-dhp__workload_thpfioscale-madvhugepage-xfs (MADV_HUGEPAGE)
-----------------------------------------------------------------

4.20-rc1 extfrag events < order 9:  342549
4.20-rc1+patch:                     337890 (1% reduction)

thpfioscale Fault Latencies
                                   4.20.0-rc1             4.20.0-rc1
                                      vanilla           lowzone-v2r4
Amean     fault-base-1     1517.06 (   0.00%)     1531.37 (  -0.94%)
Amean     fault-huge-1     1129.50 (   0.00%)     1160.95 (  -2.78%)

thpfioscale Percentage Faults Huge
                              4.20.0-rc1             4.20.0-rc1
                                 vanilla           lowzone-v2r4
Percentage huge-1       78.01 (   0.00%)       78.97 (   1.23%)

Nothing dramatic. Fragmentation events were only reduced slightly,
which is very different to what was reported in V1. A big difference
with V1 is the size of the Normal zone relative to the DMA32 zone. This
machine has double the memory so the impact of using a small zone to avoid
fragmentation events is much lower.

2-socket Haswell machine
config-global-dhp__workload_thpfioscale XFS (no special madvise)
4 fio threads, 5 THP allocating threads
----------------------------------------------------------------

4.20-rc1 extfrag events < order 9:  209820
4.20-rc1+patch:                     185923 (11% reduction)

thpfioscale Fault Latencies
                                   4.20.0-rc1             4.20.0-rc1
                                      vanilla           lowzone-v2r4
Amean     fault-base-5     1324.93 (   0.00%)     1334.99 (  -0.76%)
Amean     fault-huge-5     4681.71 (   0.00%)     2428.43 (  48.13%)

thpfioscale Percentage Faults Huge
                              4.20.0-rc1             4.20.0-rc1
                                 vanilla           lowzone-v2r4
Percentage huge-5        1.05 (   0.00%)        1.13 (   7.94%)

The reduction of external fragmentation events is expected. A careful
reader may spot that the reduction is lower than it was on v1. This is due
to the removal of __GFP_THISNODE in commit ac5b2c18911f ("mm: thp: relax
__GFP_THISNODE for MADV_HUGEPAGE mappings") as THP allocations can now spill
over to remote nodes instead of fragmenting local memory.  This reduces the
impact of the use of a lower zone to avoid fragmentation. It's also worth
noting relative to v1 that the allocation success rate is slightly higher.

2-socket Haswell machine
global-dhp__workload_thpfioscale-madvhugepage-xfs (MADV_HUGEPAGE)
-----------------------------------------------------------------

4.20-rc1 extfrag events < order 9: 167464
4.20-rc1+patch:                    130081 (22% reduction)

thpfioscale Fault Latencies
                                   4.20.0-rc1             4.20.0-rc1
                                      vanilla           lowzone-v2r4
Amean     fault-base-5     7721.82 (   0.00%)     6652.67 (  13.85%)
Amean     fault-huge-5     3896.10 (   0.00%)     2486.89 *  36.17%*

thpfioscale Percentage Faults Huge
                              4.20.0-rc1             4.20.0-rc1
                                 vanilla           lowzone-v2r4
Percentage huge-5       95.02 (   0.00%)       94.49 (  -0.56%)

In this case, there was a reduction both in the number of events causing
external fragmentation and in the latency of successful huge page
allocations, with little change in the allocation success rates, which
were already high. A careful reader will note that V1 had very different
outcomes both in terms of the number of fragmentation events and the
allocation success rates. In this case, it's due to the baseline including
the THP __GFP_THISNODE removal patch.

Overall, the patch significantly reduces the number of external
fragmentation causing events so the success of THP over long periods of
time would be improved for this adverse workload. While there are large
differences compared to how V1 behaved, this is almost entirely accounted
for by ac5b2c18911f ("mm: thp: relax __GFP_THISNODE for MADV_HUGEPAGE
mappings").
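
For reference, the reduction percentages quoted for each configuration
follow directly from the event counts reported above. The short script
below redoes that arithmetic; the dictionary labels and the helper name
are just illustrative.

```python
# (vanilla, patched) counts of extfrag events with order < 9, copied from
# the results reported above.
counts = {
    "1-socket, no madvise": (1023463, 358574),
    "1-socket, MADV_HUGEPAGE": (342549, 337890),
    "2-socket, no madvise": (209820, 185923),
    "2-socket, MADV_HUGEPAGE": (167464, 130081),
}


def reduction_pct(before: int, after: int) -> float:
    """Percentage drop in events relative to the vanilla kernel."""
    return 100.0 * (before - after) / before


reductions = {name: reduction_pct(*pair) for name, pair in counts.items()}

for name, pct in reductions.items():
    print(f"{name}: {pct:.0f}% reduction")
```

Rounded to whole percentages this gives 65%, 1%, 11% and 22%, matching
the figures quoted alongside each set of results.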

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 mm/internal.h   |  13 +++++---
 mm/page_alloc.c | 100 +++++++++++++++++++++++++++++++++++++++++++++++++-------
 2 files changed, 98 insertions(+), 15 deletions(-)

Message-Id: <20181108091218.32715-2-mgorman@techsingularity.net>
In-Reply-To: <20181108091218.32715-1-mgorman@techsingularity.net>
From: Mel Gorman <mgorman@techsingularity.net>
To: Linux-MM <linux-mm@kvack.org>
Cc: Andrew Morton <akpm@linux-foundation.org>, Vlastimil Babka <vbabka@suse.cz>, David Rientjes <rientjes@google.com>, Andrea Arcangeli <aarcange@redhat.com>, Zi Yan <zi.yan@cs.rutgers.edu>, LKML <linux-kernel@vger.kernel.org>, Mel Gorman <mgorman@techsingularity.net>
Date: Thu, 8 Nov 2018 09:12:15 +0000
Series: Fragmentation avoidance improvements v3
State: New, archived
