[1/5] mm, page_alloc: Spread allocations across zones before introducing fragmentation

The page allocator zone lists are iterated based on the watermarks
of each zone which does not take anti-fragmentation into account. On
x86, node 0 may have multiple zones while other nodes have one zone. A
consequence is that tasks running on node 0 may fragment ZONE_NORMAL even
though ZONE_DMA32 has plenty of free memory. This patch special cases
the allocator fast path such that it'll try an allocation from a lower
local zone before fragmenting a higher zone. In this case, stealing of
pageblocks or orders larger than a pageblock is still allowed in the
fast path as they are uninteresting from a fragmentation point of view.
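
The fast path policy described above can be illustrated with a toy
model. This is only a sketch and not the code in mm/page_alloc.c: the
Zone class, its two counters and the allocate() helper are hypothetical
names, and treating requests of pageblock order and above as harmless
is a simplification of the rule stated in the text.

```python
from dataclasses import dataclass

PAGEBLOCK_ORDER = 9  # order-9 pageblocks, as on 64-bit x86


@dataclass
class Zone:
    name: str
    free_native: int   # free base pages in pageblocks of the requested type
    free_foreign: int  # free base pages in pageblocks of other types


def allocate(zonelist, order):
    """Return (zone name, fragmented?) or None if nothing is free."""
    pages = 1 << order
    # Pass 1: walk the zonelist (highest zone first) and refuse any
    # fallback that would mix migrate types inside a pageblock.
    for zone in zonelist:
        if zone.free_native >= pages:
            zone.free_native -= pages
            return zone.name, False
        # Large requests take whole pageblocks, so using another type's
        # free memory is uninteresting from a fragmentation point of view.
        if order >= PAGEBLOCK_ORDER and zone.free_foreign >= pages:
            zone.free_foreign -= pages
            return zone.name, False
    # Pass 2: no zone could serve the request cleanly, so allow a
    # fragmenting fallback instead of failing the allocation.
    for zone in zonelist:
        if zone.free_foreign >= pages:
            zone.free_foreign -= pages
            return zone.name, True
    return None


# ZONE_NORMAL has no suitable free pages but ZONE_DMA32 has plenty, so
# the request is served from the lower zone without fragmenting.
zonelist = [Zone("Normal", 0, 1024), Zone("DMA32", 4096, 0)]
print(allocate(zonelist, 0))  # ('DMA32', False)
```

Only when every local zone would have to fragment does the second pass
accept the fallback, which mirrors the ordering the patch introduces.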

This was evaluated using a benchmark designed to fragment memory before
attempting THP allocations. It's implemented in mmtests as the following
configurations:

configs/config-global-dhp__workload_thpfioscale
configs/config-global-dhp__workload_thpfioscale-defrag
configs/config-global-dhp__workload_thpfioscale-madvhugepage

For example, from mmtests:
./run-mmtests.sh --run-monitor --config configs/config-global-dhp__workload_thpfioscale test-run-1

The broad details of the workload are as follows:

1. Create an XFS filesystem (not specified in the configuration but done
   as part of the testing for this patch).
2. Start 4 fio threads that write a number of 64K files inefficiently.
   Inefficiently means that files are created on first access and not
   created in advance (fio parameter create_on_open=1) and fallocate
   is not used (fallocate=none). With multiple IO issuers this creates
   a mix of slab and page cache allocations over time. The total size
   of the files is 150% physical memory so that the slabs and page cache
   pages get mixed.
3. Warm up a number of fio read-only processes accessing the same files
   created in step 2. This part runs for the same length of time it
   took to create the files. It'll refault old data and further
   interleave slab and page cache allocations. As it's now low on
   memory due to step 2, fragmentation occurs as pageblocks get
   stolen.
4. While step 3 is still running, start a process that tries to allocate
   75% of memory as huge pages with a number of threads. The number of
   threads is based on (NR_CPUS_SOCKET - NR_FIO_THREADS)/4 to avoid THP
   threads contending with fio, any other threads or forcing cross-NUMA
   scheduling. Note that the test has not been used on a machine with fewer
   than 8 cores. The benchmark records whether huge pages were allocated
   and what the fault latency was in microseconds.
5. Measure the number of events potentially causing external fragmentation,
   the fault latency and the huge page allocation success rate.
6. Cleanup the test files.
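
The sizing rules in steps 2 and 4 can be written down as a small helper.
This is a sketch only: thpfioscale_sizing() is a hypothetical name, not
part of mmtests, and clamping the thread count to at least one is an
assumption that the text does not state.

```python
def thpfioscale_sizing(mem_bytes, nr_cpus_socket, nr_fio_threads=4):
    """Derive the workload sizes described in steps 2 and 4."""
    return {
        # Step 2: total file size is 150% of physical memory.
        "fio_total_bytes": mem_bytes * 3 // 2,
        # Step 4: try to fault 75% of memory as huge pages.
        "thp_target_bytes": mem_bytes * 3 // 4,
        # Step 4: (NR_CPUS_SOCKET - NR_FIO_THREADS)/4 allocating threads.
        # The lower bound of one thread is an assumption of this sketch.
        "thp_threads": max(1, (nr_cpus_socket - nr_fio_threads) // 4),
    }


GiB = 1 << 30
sizes = thpfioscale_sizing(16 * GiB, nr_cpus_socket=8)
print(sizes["thp_threads"])  # 1
```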

Note that due to the use of IO and page cache, this benchmark is not
suitable for running on large machines where the time to fragment memory
may be excessive. Also note that while this is one mix that generates
fragmentation, it's not the only one. Workloads that are more
slab-intensive, or configurations where SLUB is used with high-order
pages, may yield different results.

When the page allocator fragments memory, it records the event using the
mm_page_alloc_extfrag ftrace event. If the fallback_order is smaller than
a pageblock order (order-9 on 64-bit x86) then it's considered to be an
"external fragmentation event" that may cause issues in the future. Hence,
the primary metric here is the number of external fragmentation events that
occur with order < 9. The secondary metrics are allocation latency and huge
page allocation success rates, but note that differences in latencies and
success rates can also affect the number of external fragmentation events,
which is why they are secondary metrics.
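
As a rough illustration of the primary metric, the sketch below counts
events whose fallback_order is below the pageblock order. The sample
lines are made up and simplified, and parse_fields() and
count_extfrag_events() are hypothetical helpers, not mmtests code. Only
the fallback_order field and the order-9 threshold come from the
description above.

```python
import re

PAGEBLOCK_ORDER = 9  # pageblock order on 64-bit x86


def parse_fields(line):
    """Collect key=value pairs from one trace line into a dict."""
    return dict(re.findall(r"(\w+)=(\S+)", line))


def count_extfrag_events(lines, pageblock_order=PAGEBLOCK_ORDER):
    """Count fallbacks that are smaller than a pageblock."""
    count = 0
    for line in lines:
        fields = parse_fields(line)
        if "fallback_order" not in fields:
            continue  # not an extfrag record in the assumed layout
        if int(fields["fallback_order"]) < pageblock_order:
            count += 1
    return count


# Hypothetical, simplified records for illustration only.
sample = [
    "mm_page_alloc_extfrag: alloc_order=0 fallback_order=3",
    "mm_page_alloc_extfrag: alloc_order=9 fallback_order=9",
    "mm_page_alloc_extfrag: alloc_order=0 fallback_order=0",
]
print(count_extfrag_events(sample))  # 2
```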

1-socket Skylake machine
config-global-dhp__workload_thpfioscale XFS (no special madvise)
4 fio threads, 1 THP allocating thread
--------------------------------------

4.20-rc3 extfrag events < order 9:   804694
4.20-rc3+patch:                      408912 (49% reduction)

thpfioscale Fault Latencies
                                   4.20.0-rc3             4.20.0-rc3
                                      vanilla           lowzone-v5r8
Amean     fault-base-1      662.92 (   0.00%)      653.58 *   1.41%*
Amean     fault-huge-1        0.00 (   0.00%)        0.00 (   0.00%)

thpfioscale Percentage Faults Huge
                              4.20.0-rc3             4.20.0-rc3
                                 vanilla           lowzone-v5r8
Percentage huge-1        0.00 (   0.00%)        0.00 (   0.00%)

Fault latencies are slightly reduced while allocation success rates remain
at zero as this configuration does not make any special effort to allocate
THP and fio is heavily active at the time and either filling memory or
keeping pages resident. However, a 49% reduction of serious fragmentation
events reduces the chances of external fragmentation being a problem in
the future.

Vlastimil asked during review for a breakdown of the allocation types
that are falling back.

vanilla
   3816 MIGRATE_UNMOVABLE
 800845 MIGRATE_MOVABLE
     33 MIGRATE_RECLAIMABLE

patch
    735 MIGRATE_UNMOVABLE
 408135 MIGRATE_MOVABLE
     42 MIGRATE_RECLAIMABLE

The majority of the fallbacks are due to movable allocations. This is
consistent for the workload throughout the series so the breakdown will
not be presented again.
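
As a quick cross-check, the per-type counts above add up to the event
totals reported for the 1-socket machine. The snippet is only
arithmetic on the published figures; the dictionary layout and key
names are chosen for illustration.

```python
fallbacks = {
    "vanilla": {"unmovable": 3816, "movable": 800845, "reclaimable": 33},
    "patch": {"unmovable": 735, "movable": 408135, "reclaimable": 42},
}

# Sum the per-type fallbacks for each kernel.
totals = {name: sum(counts.values()) for name, counts in fallbacks.items()}
print(totals)  # {'vanilla': 804694, 'patch': 408912}

# Share of fallbacks caused by movable allocations, in percent.
for name, counts in fallbacks.items():
    print(name, f"{100.0 * counts['movable'] / totals[name]:.1f}")

# Relative reduction in events with the patch applied.
reduction = 100.0 * (totals["vanilla"] - totals["patch"]) / totals["vanilla"]
print(round(reduction))  # 49
```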

Movable fallbacks are sometimes considered "ok" to fall back because they can
be migrated. The problem is that they can fill an unmovable/reclaimable
pageblock, causing those allocations to fall back later and pollute
pageblocks with pages that cannot move. If there is a movable fallback,
it is pretty much guaranteed to affect an unmovable/reclaimable pageblock
and while it might not be enough to actually cause an unmovable/reclaimable
fallback in the future, we cannot know that in advance so the patch takes
the only option available to it. Hence, it's important to control them. This
point is also consistent throughout the series and will not be repeated.

1-socket Skylake machine
global-dhp__workload_thpfioscale-madvhugepage-xfs (MADV_HUGEPAGE)
-----------------------------------------------------------------

4.20-rc3 extfrag events < order 9:  291392
4.20-rc3+patch:                     191187 (34% reduction)

thpfioscale Fault Latencies
                                   4.20.0-rc3             4.20.0-rc3
                                      vanilla           lowzone-v5r8
Amean     fault-base-1     1495.14 (   0.00%)     1467.55 (   1.85%)
Amean     fault-huge-1     1098.48 (   0.00%)     1127.11 (  -2.61%)

thpfioscale Percentage Faults Huge
                              4.20.0-rc3             4.20.0-rc3
                                 vanilla           lowzone-v5r8
Percentage huge-1       78.57 (   0.00%)       77.64 (  -1.18%)

Fragmentation events were reduced quite a bit although this is known
to be a little variable. The latencies and allocation success rates
are similar but they were already quite high.
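
The percentages in brackets appear to follow a "positive is better"
convention: a lower latency and a higher success rate both show up as a
gain. That convention is inferred from the figures in this changelog,
not taken from the mmtests source, and the helper names are
hypothetical.

```python
def latency_gain(base, new):
    """Lower latency is better, so a drop is reported as positive."""
    return (base - new) / base * 100.0


def rate_gain(base, new):
    """Higher success rate is better, so a rise is reported as positive."""
    return (new - base) / base * 100.0


# Figures from the 1-socket MADV_HUGEPAGE tables above.
print(round(latency_gain(1495.14, 1467.55), 2))  # 1.85
print(round(latency_gain(1098.48, 1127.11), 2))  # -2.61
print(round(rate_gain(78.57, 77.64), 2))         # -1.18
```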

2-socket Haswell machine
config-global-dhp__workload_thpfioscale XFS (no special madvise)
4 fio threads, 5 THP allocating threads
----------------------------------------------------------------

4.20-rc3 extfrag events < order 9:  215698
4.20-rc3+patch:                     200210 (7% reduction)

thpfioscale Fault Latencies
                                   4.20.0-rc3             4.20.0-rc3
                                      vanilla           lowzone-v5r8
Amean     fault-base-5     1350.05 (   0.00%)     1346.45 (   0.27%)
Amean     fault-huge-5     4181.01 (   0.00%)     3418.60 (  18.24%)

thpfioscale Percentage Faults Huge
                              4.20.0-rc3             4.20.0-rc3
                                 vanilla           lowzone-v5r8
Percentage huge-5        1.15 (   0.00%)        0.78 ( -31.88%)

The reduction of external fragmentation events is slight and this is
partially due to the removal of __GFP_THISNODE in commit ac5b2c18911f ("mm:
thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings") as THP allocations
can now spill over to remote nodes instead of fragmenting local memory.

2-socket Haswell machine
global-dhp__workload_thpfioscale-madvhugepage-xfs (MADV_HUGEPAGE)
-----------------------------------------------------------------

4.20-rc3 extfrag events < order 9: 166352
4.20-rc3+patch:                    147463 (11% reduction)

thpfioscale Fault Latencies
                                   4.20.0-rc3             4.20.0-rc3
                                      vanilla           lowzone-v5r8
Amean     fault-base-5     6138.97 (   0.00%)     6217.43 (  -1.28%)
Amean     fault-huge-5     2294.28 (   0.00%)     3163.33 * -37.88%*

thpfioscale Percentage Faults Huge
                              4.20.0-rc3             4.20.0-rc3
                                 vanilla           lowzone-v5r8
Percentage huge-5       96.82 (   0.00%)       95.14 (  -1.74%)

There was a slight reduction in external fragmentation events although
the latencies were higher. The allocation success rate is high enough that
the system is struggling and there is quite a lot of parallel reclaim and
compaction activity. There is also a certain degree of luck in whether
processes start on node 0 or not for this patch but the relevance is
reduced later in the series.

Overall, the patch reduces the number of events that cause external
fragmentation so the success of THP over long periods of time would be
improved for this adverse workload.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 mm/internal.h   |  13 ++++---
 mm/page_alloc.c | 108 +++++++++++++++++++++++++++++++++++++++++++++++++-------
 2 files changed, 105 insertions(+), 16 deletions(-)

Message-Id: <20181123114528.28802-2-mgorman@techsingularity.net>
In-Reply-To: <20181123114528.28802-1-mgorman@techsingularity.net>
From: Mel Gorman <mgorman@techsingularity.net>
To: Andrew Morton <akpm@linux-foundation.org>
Cc: Vlastimil Babka <vbabka@suse.cz>, David Rientjes <rientjes@google.com>, Andrea Arcangeli <aarcange@redhat.com>, Zi Yan <zi.yan@cs.rutgers.edu>, Michal Hocko <mhocko@kernel.org>, LKML <linux-kernel@vger.kernel.org>, Linux-MM <linux-mm@kvack.org>, Mel Gorman <mgorman@techsingularity.net>
Subject: [PATCH 1/5] mm, page_alloc: Spread allocations across zones before introducing fragmentation
Date: Fri, 23 Nov 2018 11:45:24 +0000