From patchwork Wed Oct 31 16:06:40 2018
X-Patchwork-Submitter: Mel Gorman <mgorman@techsingularity.net>
X-Patchwork-Id: 10662889
From: Mel Gorman <mgorman@techsingularity.net>
To: Linux-MM <linux-mm@kvack.org>
Cc: Andrew Morton, Vlastimil Babka, David Rientjes, Andrea Arcangeli,
    Zi Yan, LKML, Mel Gorman
Subject: [PATCH 0/5] Fragmentation avoidance improvements
Date: Wed, 31 Oct 2018 16:06:40 +0000
Message-Id: <20181031160645.7633-1-mgorman@techsingularity.net>

Warning: this is a long intro with long changelogs, and this is not a
trivial area to either analyse or fix. TL;DR -- a 95% reduction in
fragmentation events. Patches 1-3 should be relatively ok; patches 4
and 5 need scrutiny, but they can also be treated independently or
dropped.

It has been noted before that fragmentation avoidance (aka
anti-fragmentation) is far from perfect. Given a long enough time or an
adverse enough workload, memory still gets fragmented and the long-term
success rate of high-order allocations degrades. This series defines an
adverse workload and a definition of external fragmentation events
(including serious ones), and reduces the number of those fragmentation
events.

This series is *not* directly related to the recent __GFP_THISNODE
discussion and has no impact on the trivial test cases that were
discussed there. The series was also evaluated without the candidate
fixes from that discussion. It does, however, have consequences for
high-order and THP allocations that are important to consider, so the
same people are cc'd. It is also far from a complete solution: side
issues such as compaction, usability and other factors would require
separate series. It is also extremely important to note that the
analysis is in the context of one adverse workload. Other patterns of
fragmentation are possible (and workloads that are mostly slab
allocations have a completely different solution space), but they would
need test cases to be properly considered.

The details of the workload and the consequences are described in more
detail in the changelogs. However, from patch 1, this is a high-level
summary of the adverse workload; the exact details are found in the
mmtests implementation. The broad steps of the workload are as follows:

1. Create an XFS filesystem (not specified in the configuration but
   done as part of the testing for this patch).

2. Start 4 fio threads that write a number of 64K files inefficiently.
   Inefficiently means that files are created on first access rather
   than in advance (fio parameter create_on_open=1) and fallocate is
   not used (fallocate=none). With multiple IO issuers this creates a
   mix of slab and page cache allocations over time. The total size of
   the files is 150% of physical memory so that slab and page cache
   pages get mixed.

3. Warm up a number of fio read-only threads accessing the same files
   created in step 2. This part runs for the same length of time it
   took to create the files. It faults back in old data and further
   interleaves slab and page cache allocations. As the system is now
   low on memory due to step 2, fragmentation occurs as pageblocks get
   stolen.

4. While step 3 is still running, start a process that tries to
   allocate 75% of memory as huge pages with a number of threads. The
   number of threads is based on (NR_CPUS_SOCKET - NR_FIO_THREADS)/4
   to avoid the THP threads contending with fio, any other threads, or
   forcing cross-NUMA scheduling. Note that the test has not been used
   on a machine with fewer than 8 cores. The benchmark records whether
   huge pages were allocated and the fault latency in microseconds.

5. Measure the number of events potentially causing external
   fragmentation, the fault latency and the huge page allocation
   success rate.

6. Cleanup.

Overall, the series reduces external fragmentation-causing events by
over 95% on 1- and 2-socket machines, which in turn impacts high-order
allocation success rates over the long term. There are differences in
latencies and high-order allocation success rates. Latencies are a
mixed bag, as they are vulnerable to the exact system state and whether
allocations succeeded, so they are treated as a secondary metric.

Patch 1 uses lower zones if they are populated and have free memory
instead of fragmenting a higher zone. It is special-cased to handle a
Normal->DMA32 fallback, with the reasons explained in the changelog.
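For reference, the "inefficient writer" phase from step 2 could be
sketched as an fio job file along the following lines. Only
create_on_open=1 and fallocate=none are taken from the description
above; every other value (mount point, file count and so on) is an
illustrative guess, not the actual mmtests configuration:

```ini
[global]
directory=/mnt/xfs        ; hypothetical mount point of the XFS filesystem
create_on_open=1          ; create files on first access, not in advance
fallocate=none            ; no preallocation via fallocate
bs=64k
filesize=64k              ; many small 64K files
nrfiles=16384             ; illustrative; scale so the total is ~150% of RAM

[writer]
rw=write
numjobs=4                 ; the 4 IO issuers from step 2
```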
Patches 2 and 3 boost watermarks temporarily when an external
fragmentation event occurs. kswapd wakes to reclaim a small amount of
old memory and then wakes kcompactd on completion to recover the system
slightly. This introduces some overhead in the slowpath. The level of
boosting can be tuned or disabled depending on the tolerance for
fragmentation versus allocation latency.

Patch 4 is more heavy-handed. In the event of a movable allocation
request that can stall, it wakes kswapd as in patch 3. However, if the
expected fragmentation event is serious, the request stalls briefly on
pfmemalloc_wait until kswapd completes light reclaim work and then
retries the allocation without stalling. This can avoid the
fragmentation event entirely in some cases. The definition of a serious
fragmentation event can be tuned or disabled.

Patch 5 is the hardest to prove is a real benefit. In the event that
fragmentation was unavoidable, it queues a pageblock for kcompactd to
clean. The queue is fixed-length and is guaranteed neither to have a
slot available nor to successfully clean a pageblock.

Patches 4 and 5 can be treated independently or dropped. The bulk of
the improvement in fragmentation avoidance is from patches 1-3 (a
94-97% reduction in fragmentation events for an adverse workload on
both a 1-socket and a 2-socket machine).

 Documentation/sysctl/vm.txt       |  42 +++++++
 include/linux/compaction.h        |   4 +
 include/linux/migrate.h           |   7 +-
 include/linux/mm.h                |   2 +
 include/linux/mmzone.h            |  18 ++-
 include/trace/events/compaction.h |  62 +++++++++++
 kernel/sysctl.c                   |  18 +++
 mm/compaction.c                   | 148 +++++++++++++++++++++++--
 mm/internal.h                     |  14 ++-
 mm/migrate.c                      |   6 +-
 mm/page_alloc.c                   | 228 ++++++++++++++++++++++++++++++++++----
 mm/vmscan.c                       | 123 ++++++++++++++++++--
 12 files changed, 621 insertions(+), 51 deletions(-)
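As a rough user-space illustration of the watermark-boost idea behind
patches 2 and 3: a fragmentation event temporarily raises the effective
watermark so reclaim starts earlier, and kswapd progress decays the
boost back to zero. All names, the boost step and the cap below are
hypothetical and chosen for the sketch; this is not the kernel
implementation.

```c
#include <assert.h>

struct zone_model {
	long watermark_min;   /* baseline watermark, in pages */
	long watermark_boost; /* temporary boost after a fragmentation event */
};

/* Effective watermark the allocator would check against. */
long effective_watermark(const struct zone_model *z)
{
	return z->watermark_min + z->watermark_boost;
}

/*
 * Called on an external fragmentation event (e.g. a pageblock stolen by
 * an unmovable allocation): raise the boost by a fraction of the
 * baseline, capped so repeated events cannot grow it without bound.
 */
void boost_watermark(struct zone_model *z)
{
	long max_boost = z->watermark_min * 2;

	z->watermark_boost += z->watermark_min / 8;
	if (z->watermark_boost > max_boost)
		z->watermark_boost = max_boost;
}

/* Called as kswapd makes progress: decay the boost back towards zero. */
void decay_boost(struct zone_model *z, long reclaimed)
{
	z->watermark_boost -= reclaimed;
	if (z->watermark_boost < 0)
		z->watermark_boost = 0;
}
```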