From patchwork Tue Sep 25 12:03:25 2018
X-Patchwork-Submitter: Michal Hocko
X-Patchwork-Id: 10613887
From: Michal Hocko
To: Andrew Morton
Cc: Mel Gorman, Vlastimil Babka, David Rientjes, Andrea Argangeli, Zi Yan,
 Stefan Priebe - Profihost AG, "Kirill A. Shutemov", linux-mm@kvack.org,
 LKML, Andrea Arcangeli, Stable tree, Michal Hocko
Subject: [PATCH 1/2] mm: thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings
Date: Tue, 25 Sep 2018 14:03:25 +0200
Message-Id: <20180925120326.24392-2-mhocko@kernel.org>
In-Reply-To: <20180925120326.24392-1-mhocko@kernel.org>
References: <20180925120326.24392-1-mhocko@kernel.org>

From: Andrea Arcangeli

THP allocation can be really disruptive on a NUMA system when the local
node is full or hard to reclaim.

Stefan has posted an allocation stall report on a 4.12 based SLES kernel
which suggests the same issue:

[245513.362669] kvm: page allocation stalls for 194572ms, order:9, mode:0x4740ca(__GFP_HIGHMEM|__GFP_IO|__GFP_FS|__GFP_COMP|__GFP_NOMEMALLOC|__GFP_HARDWALL|__GFP_THISNODE|__GFP_MOVABLE|__GFP_DIRECT_RECLAIM), nodemask=(null)
[245513.363983] kvm cpuset=/ mems_allowed=0-1
[245513.364604] CPU: 10 PID: 84752 Comm: kvm Tainted: G W 4.12.0+98-ph 0000001 SLE15 (unreleased)
[245513.365258] Hardware name: Supermicro SYS-1029P-WTRT/X11DDW-NT, BIOS 2.0 12/05/2017
[245513.365905] Call Trace:
[245513.366535]  dump_stack+0x5c/0x84
[245513.367148]  warn_alloc+0xe0/0x180
[245513.367769]  __alloc_pages_slowpath+0x820/0xc90
[245513.368406]  ? __slab_free+0xa9/0x2f0
[245513.369048]  ? __slab_free+0xa9/0x2f0
[245513.369671]  __alloc_pages_nodemask+0x1cc/0x210
[245513.370300]  alloc_pages_vma+0x1e5/0x280
[245513.370921]  do_huge_pmd_wp_page+0x83f/0xf00
[245513.371554]  ? set_huge_zero_page.isra.52.part.53+0x9b/0xb0
[245513.372184]  ? do_huge_pmd_anonymous_page+0x631/0x6d0
[245513.372812]  __handle_mm_fault+0x93d/0x1060
[245513.373439]  handle_mm_fault+0xc6/0x1b0
[245513.374042]  __do_page_fault+0x230/0x430
[245513.374679]  ? get_vtime_delta+0x13/0xb0
[245513.375411]  do_page_fault+0x2a/0x70
[245513.376145]  ? page_fault+0x65/0x80
[245513.376882]  page_fault+0x7b/0x80
[...]
[245513.382056] Mem-Info:
[245513.382634] active_anon:126315487 inactive_anon:1612476 isolated_anon:5
                 active_file:60183 inactive_file:245285 isolated_file:0
                 unevictable:15657 dirty:286 writeback:1 unstable:0
                 slab_reclaimable:75543 slab_unreclaimable:2509111
                 mapped:81814 shmem:31764 pagetables:370616 bounce:0
                 free:32294031 free_pcp:6233 free_cma:0
[245513.386615] Node 0 active_anon:254680388kB inactive_anon:1112760kB active_file:240648kB inactive_file:981168kB unevictable:13368kB isolated(anon):0kB isolated(file):0kB mapped:280240kB dirty:1144kB writeback:0kB shmem:95832kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 81225728kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
[245513.388650] Node 1 active_anon:250583072kB inactive_anon:5337144kB active_file:84kB inactive_file:0kB unevictable:49260kB isolated(anon):20kB isolated(file):0kB mapped:47016kB dirty:0kB writeback:4kB shmem:31224kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 31897600kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no

The defrag mode is "madvise" and from the above report it is clear that
the THP has been allocated for a MADV_HUGEPAGE vma.

Andrea has identified that the main source of the problem is the
__GFP_THISNODE usage:

: The problem is that direct compaction combined with the NUMA
: __GFP_THISNODE logic in mempolicy.c is telling reclaim to swap very
: hard the local node, instead of failing the allocation if there's no
: THP available in the local node.
:
: Such logic was ok until __GFP_THISNODE was added to the THP allocation
: path even with MPOL_DEFAULT.
:
: The idea behind the __GFP_THISNODE addition, is that it is better to
: provide local memory in PAGE_SIZE units than to use remote NUMA THP
: backed memory. That largely depends on the remote latency though, on
: threadrippers for example the overhead is relatively low in my
: experience.
:
: The combination of __GFP_THISNODE and __GFP_DIRECT_RECLAIM results in
: extremely slow qemu startup with vfio, if the VM is larger than the
: size of one host NUMA node.
: This is because it will try very hard to unsuccessfully swapout
: get_user_pages pinned pages as result of the __GFP_THISNODE being set,
: instead of falling back to PAGE_SIZE allocations and instead of trying
: to allocate THP on other nodes (it would be even worse without vfio
: type1 GUP pins of course, except it'd be swapping heavily instead).

Fix this by removing __GFP_THISNODE for THP requests which also request
direct reclaim. This effectively reverts 5265047ac301 on the grounds
that the zone/node reclaim was known to be disruptive due to premature
reclaim when there was memory free. While it made sense at the time for
HPC workloads without NUMA awareness on rare machines, it was
ultimately harmful in the majority of cases. The existing behaviour is
similar, if not as widespread, as it applies to a corner case but
crucially, it cannot be tuned around like zone_reclaim_mode can. The
default behaviour should always be to cause the least harm for the
common case.

If there are specialised use cases out there that want zone_reclaim_mode
in specific cases, then it can be built on top. Longterm we should
consider a memory policy which allows for the node reclaim like behavior
for the specific memory ranges which would allow a

[1] http://lkml.kernel.org/r/20180820032204.9591-1-aarcange@redhat.com

[mhocko@suse.com: rewrote the changelog based on the one from Andrea]
Fixes: 5265047ac301 ("mm, thp: really limit transparent hugepage allocation to local node")
Cc: Zi Yan
Cc: stable # 4.1+
Reported-by: Stefan Priebe
Debugged-by: Andrea Arcangeli
Reported-by: Alex Williamson
Signed-off-by: Andrea Arcangeli
Signed-off-by: Michal Hocko
Reviewed-by: Mel Gorman
Nacked-by: David Rientjes
---
 mm/mempolicy.c | 32 ++++++++++++++++++++++++++++++--
 1 file changed, 30 insertions(+), 2 deletions(-)

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index da858f794eb6..149b6f4cf023 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2046,8 +2046,36 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
 		nmask = policy_nodemask(gfp, pol);
 		if (!nmask || node_isset(hpage_node, *nmask)) {
 			mpol_cond_put(pol);
-			page = __alloc_pages_node(hpage_node,
-						gfp | __GFP_THISNODE, order);
+			/*
+			 * We cannot invoke reclaim if __GFP_THISNODE
+			 * is set. Invoking reclaim with
+			 * __GFP_THISNODE set, would cause THP
+			 * allocations to trigger heavy swapping
+			 * despite there may be tons of free memory
+			 * (including potentially plenty of THP
+			 * already available in the buddy) on all the
+			 * other NUMA nodes.
+			 *
+			 * At most we could invoke compaction when
+			 * __GFP_THISNODE is set (but we would need to
+			 * refrain from invoking reclaim even if
+			 * compaction returned COMPACT_SKIPPED because
+			 * there wasn't not enough memory to succeed
+			 * compaction). For now just avoid
+			 * __GFP_THISNODE instead of limiting the
+			 * allocation path to a strict and single
+			 * compaction invocation.
+			 *
+			 * Supposedly if direct reclaim was enabled by
+			 * the caller, the app prefers THP regardless
+			 * of the node it comes from so this would be
+			 * more desiderable behavior than only
+			 * providing THP originated from the local
+			 * node in such case.
+			 */
+			if (!(gfp & __GFP_DIRECT_RECLAIM))
+				gfp |= __GFP_THISNODE;
+			page = __alloc_pages_node(hpage_node, gfp, order);
 			goto out;
 		}
 	}
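
The behavioural change above boils down to one decision: pin the THP
attempt to the local node only when the caller did not ask for direct
reclaim, so that a failed local allocation falls back instead of swapping
the local node. A minimal user-space C sketch of that decision follows;
the flag values and the thp_gfp_for_local_node() helper are made-up
illustrations, not the kernel's GFP definitions.

#include <stdio.h>

/* Illustrative placeholder bits; not the real kernel GFP values. */
#define GFP_DIRECT_RECLAIM  0x1u
#define GFP_THISNODE        0x2u

/*
 * Sketch of the patched logic in alloc_pages_vma(): prefer the local
 * node only for "light" (no direct reclaim) THP attempts.
 */
static unsigned int thp_gfp_for_local_node(unsigned int gfp)
{
        if (!(gfp & GFP_DIRECT_RECLAIM))
                gfp |= GFP_THISNODE;
        return gfp;
}

int main(void)
{
        unsigned int light = 0;                  /* e.g. no reclaim requested */
        unsigned int heavy = GFP_DIRECT_RECLAIM; /* e.g. MADV_HUGEPAGE with defrag=madvise */

        printf("light attempt pinned to local node: %s\n",
               thp_gfp_for_local_node(light) & GFP_THISNODE ? "yes" : "no");
        printf("reclaiming attempt pinned to local node: %s\n",
               thp_gfp_for_local_node(heavy) & GFP_THISNODE ? "yes" : "no");
        return 0;
}

Before the patch both cases carried __GFP_THISNODE, which is exactly what
made the reclaiming case swap the local node so hard.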

From patchwork Tue Sep 25 12:03:26 2018
X-Patchwork-Submitter: Michal Hocko
X-Patchwork-Id: 10613889
From: Michal Hocko
To: Andrew Morton
Cc: Mel Gorman, Vlastimil Babka, David Rientjes, Andrea Argangeli, Zi Yan,
 Stefan Priebe - Profihost AG, "Kirill A. Shutemov", linux-mm@kvack.org,
 LKML, Michal Hocko
Subject: [PATCH 2/2] mm, thp: consolidate THP gfp handling into alloc_hugepage_direct_gfpmask
Date: Tue, 25 Sep 2018 14:03:26 +0200
Message-Id: <20180925120326.24392-3-mhocko@kernel.org>
In-Reply-To: <20180925120326.24392-1-mhocko@kernel.org>
References: <20180925120326.24392-1-mhocko@kernel.org>

From: Michal Hocko

THP allocation mode is quite complex and depends on the defrag mode.
Most of this complexity is currently hidden in
alloc_hugepage_direct_gfpmask. The NUMA special casing (namely
__GFP_THISNODE), however, is independent and currently lives in
alloc_pages_vma. This adds an unnecessary branch to every vma based page
allocation request and makes the code more complex than it needs to be.
Not to mention that e.g.
shmem THP used to do the node reclaim unconditionally regardless of the
defrag mode until recently. This was not only unexpected behaviour, it
was also hardly a good default, and I strongly suspect it was merely a
side effect of the code sharing rather than a deliberate decision, which
suggests that such a layering is wrong.

Get rid of the THP special casing in alloc_pages_vma and move the logic
to alloc_hugepage_direct_gfpmask. __GFP_THISNODE is applied to the
resulting gfp mask only when direct reclaim is not requested and when
there is no explicit NUMA binding, which preserves the current logic.
This also allows alloc_hugepage_vma to be removed.

Signed-off-by: Michal Hocko
Acked-by: Vlastimil Babka
---
 include/linux/gfp.h       | 12 +++-----
 include/linux/mempolicy.h |  2 ++
 mm/huge_memory.c          | 38 +++++++++++++++++------
 mm/mempolicy.c            | 63 +++------------------------------------
 mm/shmem.c                |  2 +-
 5 files changed, 40 insertions(+), 77 deletions(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 24bcc5eec6b4..76f8db0b0e71 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -510,22 +510,18 @@ alloc_pages(gfp_t gfp_mask, unsigned int order)
 }
 extern struct page *alloc_pages_vma(gfp_t gfp_mask, int order,
 			struct vm_area_struct *vma, unsigned long addr,
-			int node, bool hugepage);
-#define alloc_hugepage_vma(gfp_mask, vma, addr, order) \
-	alloc_pages_vma(gfp_mask, order, vma, addr, numa_node_id(), true)
+			int node);
 #else
 #define alloc_pages(gfp_mask, order) \
 		alloc_pages_node(numa_node_id(), gfp_mask, order)
-#define alloc_pages_vma(gfp_mask, order, vma, addr, node, false)\
-	alloc_pages(gfp_mask, order)
-#define alloc_hugepage_vma(gfp_mask, vma, addr, order) \
+#define alloc_pages_vma(gfp_mask, order, vma, addr, node)\
 	alloc_pages(gfp_mask, order)
 #endif
 #define alloc_page(gfp_mask) alloc_pages(gfp_mask, 0)
 #define alloc_page_vma(gfp_mask, vma, addr)			\
-	alloc_pages_vma(gfp_mask, 0, vma, addr, numa_node_id(), false)
+	alloc_pages_vma(gfp_mask, 0, vma, addr, numa_node_id())
 #define alloc_page_vma_node(gfp_mask, vma, addr, node)		\
-	alloc_pages_vma(gfp_mask, 0, vma, addr, node, false)
+	alloc_pages_vma(gfp_mask, 0, vma, addr, node)

 extern unsigned long __get_free_pages(gfp_t gfp_mask, unsigned int order);
 extern unsigned long get_zeroed_page(gfp_t gfp_mask);
diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
index 5228c62af416..bac395f1d00a 100644
--- a/include/linux/mempolicy.h
+++ b/include/linux/mempolicy.h
@@ -139,6 +139,8 @@ struct mempolicy *mpol_shared_policy_lookup(struct shared_policy *sp,
 struct mempolicy *get_task_policy(struct task_struct *p);
 struct mempolicy *__get_vma_policy(struct vm_area_struct *vma,
 		unsigned long addr);
+struct mempolicy *get_vma_policy(struct vm_area_struct *vma,
+		unsigned long addr);
 bool vma_policy_mof(struct vm_area_struct *vma);

 extern void numa_default_policy(void);
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index c3bc7e9c9a2a..c0bcede31930 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -629,21 +629,40 @@ static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf,
  *	    available
  * never: never stall for any thp allocation
  */
-static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma)
+static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma, unsigned long addr)
 {
 	const bool vma_madvised = !!(vma->vm_flags & VM_HUGEPAGE);
+	gfp_t this_node = 0;
+
+#ifdef CONFIG_NUMA
+	struct mempolicy *pol;
+	/*
+	 * __GFP_THISNODE is used only when
+	 * __GFP_DIRECT_RECLAIM is not specified, to express a general
+	 * desire to stay on the current node for optimistic allocation
+	 * attempts. If the defrag mode and/or madvise hint requires the
+	 * direct reclaim then we prefer to fallback to other node rather
+	 * than node reclaim because that can lead to excessive reclaim
+	 * even though there is free memory on other nodes. We expect that
+	 * NUMA preferences are specified by memory policies.
+	 */
+	pol = get_vma_policy(vma, addr);
+	if (pol->mode != MPOL_BIND)
+		this_node = __GFP_THISNODE;
+	mpol_cond_put(pol);
+#endif

 	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_DIRECT_FLAG, &transparent_hugepage_flags))
 		return GFP_TRANSHUGE | (vma_madvised ? 0 : __GFP_NORETRY);
 	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_FLAG, &transparent_hugepage_flags))
-		return GFP_TRANSHUGE_LIGHT | __GFP_KSWAPD_RECLAIM;
+		return GFP_TRANSHUGE_LIGHT | __GFP_KSWAPD_RECLAIM | this_node;
 	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_OR_MADV_FLAG, &transparent_hugepage_flags))
 		return GFP_TRANSHUGE_LIGHT | (vma_madvised ? __GFP_DIRECT_RECLAIM :
-							     __GFP_KSWAPD_RECLAIM);
+							     __GFP_KSWAPD_RECLAIM | this_node);
 	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG, &transparent_hugepage_flags))
 		return GFP_TRANSHUGE_LIGHT | (vma_madvised ? __GFP_DIRECT_RECLAIM :
-							     0);
-	return GFP_TRANSHUGE_LIGHT;
+							     this_node);
+	return GFP_TRANSHUGE_LIGHT | this_node;
 }

 /* Caller must hold page table lock. */
@@ -715,8 +734,8 @@ vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf)
 		pte_free(vma->vm_mm, pgtable);
 		return ret;
 	}
-	gfp = alloc_hugepage_direct_gfpmask(vma);
-	page = alloc_hugepage_vma(gfp, vma, haddr, HPAGE_PMD_ORDER);
+	gfp = alloc_hugepage_direct_gfpmask(vma, haddr);
+	page = alloc_pages_vma(gfp, HPAGE_PMD_ORDER, vma, haddr, numa_node_id());
 	if (unlikely(!page)) {
 		count_vm_event(THP_FAULT_FALLBACK);
 		return VM_FAULT_FALLBACK;
@@ -1290,8 +1309,9 @@ vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf, pmd_t orig_pmd)
 alloc:
 	if (transparent_hugepage_enabled(vma) &&
 	    !transparent_hugepage_debug_cow()) {
-		huge_gfp = alloc_hugepage_direct_gfpmask(vma);
-		new_page = alloc_hugepage_vma(huge_gfp, vma, haddr, HPAGE_PMD_ORDER);
+		huge_gfp = alloc_hugepage_direct_gfpmask(vma, haddr);
+		new_page = alloc_pages_vma(huge_gfp, HPAGE_PMD_ORDER, vma,
+				haddr, numa_node_id());
 	} else
 		new_page = NULL;

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 149b6f4cf023..b7b564751165 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1102,8 +1102,8 @@ static struct page *new_page(struct page *page, unsigned long start)
 	} else if (PageTransHuge(page)) {
 		struct page *thp;

-		thp = alloc_hugepage_vma(GFP_TRANSHUGE, vma, address,
-					 HPAGE_PMD_ORDER);
+		thp = alloc_pages_vma(GFP_TRANSHUGE, HPAGE_PMD_ORDER, vma,
+				address, numa_node_id());
 		if (!thp)
 			return NULL;
 		prep_transhuge_page(thp);
@@ -1648,7 +1648,7 @@ struct mempolicy *__get_vma_policy(struct vm_area_struct *vma,
  * freeing by another task. It is the caller's responsibility to free the
  * extra reference for shared policies.
  */
-static struct mempolicy *get_vma_policy(struct vm_area_struct *vma,
+struct mempolicy *get_vma_policy(struct vm_area_struct *vma,
 						unsigned long addr)
 {
 	struct mempolicy *pol = __get_vma_policy(vma, addr);
@@ -1997,7 +1997,6 @@ static struct page *alloc_page_interleave(gfp_t gfp, unsigned order,
  * @vma:  Pointer to VMA or NULL if not available.
  * @addr: Virtual Address of the allocation. Must be inside the VMA.
  * @node: Which node to prefer for allocation (modulo policy).
- * @hugepage: for hugepages try only the preferred node if possible
  *
  * This function allocates a page from the kernel page pool and applies
  * a NUMA policy associated with the VMA or the current process.
@@ -2008,7 +2007,7 @@ static struct page *alloc_page_interleave(gfp_t gfp, unsigned order,
  */
 struct page *
 alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
-		unsigned long addr, int node, bool hugepage)
+		unsigned long addr, int node)
 {
 	struct mempolicy *pol;
 	struct page *page;
@@ -2026,60 +2025,6 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
 		goto out;
 	}

-	if (unlikely(IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) && hugepage)) {
-		int hpage_node = node;
-
-		/*
-		 * For hugepage allocation and non-interleave policy which
-		 * allows the current node (or other explicitly preferred
-		 * node) we only try to allocate from the current/preferred
-		 * node and don't fall back to other nodes, as the cost of
-		 * remote accesses would likely offset THP benefits.
-		 *
-		 * If the policy is interleave, or does not allow the current
-		 * node in its nodemask, we allocate the standard way.
-		 */
-		if (pol->mode == MPOL_PREFERRED &&
-						!(pol->flags & MPOL_F_LOCAL))
-			hpage_node = pol->v.preferred_node;
-
-		nmask = policy_nodemask(gfp, pol);
-		if (!nmask || node_isset(hpage_node, *nmask)) {
-			mpol_cond_put(pol);
-			/*
-			 * We cannot invoke reclaim if __GFP_THISNODE
-			 * is set. Invoking reclaim with
-			 * __GFP_THISNODE set, would cause THP
-			 * allocations to trigger heavy swapping
-			 * despite there may be tons of free memory
-			 * (including potentially plenty of THP
-			 * already available in the buddy) on all the
-			 * other NUMA nodes.
-			 *
-			 * At most we could invoke compaction when
-			 * __GFP_THISNODE is set (but we would need to
-			 * refrain from invoking reclaim even if
-			 * compaction returned COMPACT_SKIPPED because
-			 * there wasn't not enough memory to succeed
-			 * compaction). For now just avoid
-			 * __GFP_THISNODE instead of limiting the
-			 * allocation path to a strict and single
-			 * compaction invocation.
-			 *
-			 * Supposedly if direct reclaim was enabled by
-			 * the caller, the app prefers THP regardless
-			 * of the node it comes from so this would be
-			 * more desiderable behavior than only
-			 * providing THP originated from the local
-			 * node in such case.
-			 */
-			if (!(gfp & __GFP_DIRECT_RECLAIM))
-				gfp |= __GFP_THISNODE;
-			page = __alloc_pages_node(hpage_node, gfp, order);
-			goto out;
-		}
-	}
-
 	nmask = policy_nodemask(gfp, pol);
 	preferred_nid = policy_node(gfp, pol, node);
 	page = __alloc_pages_nodemask(gfp, order, preferred_nid, nmask);
diff --git a/mm/shmem.c b/mm/shmem.c
index 0376c124b043..371e291f5d4b 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1473,7 +1473,7 @@ static struct page *shmem_alloc_hugepage(gfp_t gfp,
 	shmem_pseudo_vma_init(&pvma, info, hindex);
 	page = alloc_pages_vma(gfp | __GFP_COMP | __GFP_NORETRY | __GFP_NOWARN,
-			HPAGE_PMD_ORDER, &pvma, 0, numa_node_id(), true);
+			HPAGE_PMD_ORDER, &pvma, 0, numa_node_id());
 	shmem_pseudo_vma_destroy(&pvma);
 	if (page)
 		prep_transhuge_page(page);
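
The consolidated helper can be summarised as: pick the reclaim mode from
the defrag setting and the madvise hint, then add the local-node
preference only when the attempt will not direct reclaim and the vma has
no explicit MPOL_BIND policy. A small, self-contained C sketch of that
mapping follows; the defrag_mode enum, the flag bits, and the
thp_gfpmask() helper are illustrative stand-ins, not the kernel's
definitions.

#include <stdio.h>
#include <stdbool.h>

/* Illustrative placeholder bits; not the kernel's GFP definitions. */
#define GFP_TRANSHUGE_LIGHT   0x01u
#define GFP_DIRECT_RECLAIM    0x02u
#define GFP_KSWAPD_RECLAIM    0x04u
#define GFP_THISNODE          0x08u
#define GFP_NORETRY           0x10u
#define GFP_TRANSHUGE         (GFP_TRANSHUGE_LIGHT | GFP_DIRECT_RECLAIM)

enum defrag_mode { DEFRAG_ALWAYS, DEFRAG_DEFER, DEFRAG_DEFER_MADVISE,
                   DEFRAG_MADVISE, DEFRAG_NEVER };

/*
 * Rough re-expression of the patched alloc_hugepage_direct_gfpmask():
 * the local-node hint is added only to attempts that will not direct
 * reclaim and only when there is no explicit NUMA binding.
 */
static unsigned int thp_gfpmask(enum defrag_mode mode, bool madvised, bool mpol_bind)
{
        unsigned int this_node = mpol_bind ? 0 : GFP_THISNODE;

        switch (mode) {
        case DEFRAG_ALWAYS:
                return GFP_TRANSHUGE | (madvised ? 0 : GFP_NORETRY);
        case DEFRAG_DEFER:
                return GFP_TRANSHUGE_LIGHT | GFP_KSWAPD_RECLAIM | this_node;
        case DEFRAG_DEFER_MADVISE:
                return GFP_TRANSHUGE_LIGHT |
                       (madvised ? GFP_DIRECT_RECLAIM : GFP_KSWAPD_RECLAIM | this_node);
        case DEFRAG_MADVISE:
                return GFP_TRANSHUGE_LIGHT | (madvised ? GFP_DIRECT_RECLAIM : this_node);
        default: /* DEFRAG_NEVER */
                return GFP_TRANSHUGE_LIGHT | this_node;
        }
}

int main(void)
{
        unsigned int m = thp_gfpmask(DEFRAG_MADVISE, true, false);
        printf("madvise + MADV_HUGEPAGE: direct reclaim=%d, local node=%d\n",
               !!(m & GFP_DIRECT_RECLAIM), !!(m & GFP_THISNODE));
        m = thp_gfpmask(DEFRAG_MADVISE, false, false);
        printf("madvise, no hint:        direct reclaim=%d, local node=%d\n",
               !!(m & GFP_DIRECT_RECLAIM), !!(m & GFP_THISNODE));
        return 0;
}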