From patchwork Wed Sep 4 19:54:18 2019
Date: Wed, 4 Sep 2019 12:54:18 -0700 (PDT)
From: David Rientjes <rientjes@google.com>
To: Linus Torvalds <torvalds@linux-foundation.org>, Andrew Morton <akpm@linux-foundation.org>
cc: Andrea Arcangeli <aarcange@redhat.com>, Michal Hocko <mhocko@suse.com>,
    Mel Gorman <mgorman@suse.de>, Vlastimil Babka <vbabka@suse.cz>,
    "Kirill A. Shutemov" <kirill@shutemov.name>,
    linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: [patch for-5.3 1/4] Revert "Revert "mm, thp: restore node-local hugepage allocations""

This reverts commit a8282608c88e08b1782141026eab61204c1e533f.

The reverted commit cites the original intended semantic for MADV_HUGEPAGE,
which has subsequently taken on three distinct purposes:

 - enables or disables thp for a range of memory depending on the system's
   config (is thp "enabled" set to "always" or "madvise"),

 - determines the synchronous compaction behavior for thp allocations at
   fault (is thp "defrag" set to "always", "defer+madvise", or "madvise"),
   and

 - reverts a previous MADV_NOHUGEPAGE (there is no madvise mode that only
   clears previous hugepage advice).

These are the three purposes that exist in 5.2 and that userspace has been
written around over the past several years.  Adding a NUMA locality
preference adds a fourth dimension to an already conflated advice mode.

Based on the semantic that MADV_HUGEPAGE has provided over the past several
years, there exist workloads that use the advice on the principle that the
allocation should attempt to defragment the local node before falling back.
It is agreed that remote hugepages typically (but not always) have better
access latency than remote native pages, although on Naples the two are at
parity for intersocket accesses.

The commit that this patch reverts allows hugepage allocation to
immediately allocate remotely when local memory is fragmented.  This is
contrary to the semantic of MADV_HUGEPAGE over the past several years:
memory compaction should be attempted locally before falling back.
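(Illustrative aside, not part of this patch: a minimal userspace sketch of
the advice modes listed above, assuming only the standard madvise(2)
interface; the mapping size is arbitrary.)

#define _GNU_SOURCE
#include <string.h>
#include <sys/mman.h>

int main(void)
{
	size_t len = 512UL << 20;	/* arbitrary 512MB anonymous region */
	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED)
		return 1;
	/* enable thp for the range and request defrag effort at fault */
	madvise(p, len, MADV_HUGEPAGE);
	memset(p, 0, len);	/* first faults are where hugepages get allocated */
	/*
	 * The only way to withdraw the advice is MADV_NOHUGEPAGE, which
	 * forbids hugepages rather than merely clearing the earlier hint.
	 */
	madvise(p, len, MADV_NOHUGEPAGE);
	munmap(p, len);
	return 0;
}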
The performance degradation of remote hugepages over local hugepages on
Rome, for example, is a 53.5% increase in access latency.  For this reason,
the goal is to return to the 5.2 and earlier behavior of attempting local
defragmentation before falling back.  With the patch that is reverted by
this patch, we see performance degradations at the tail because the
allocator happily allocates the remote hugepage rather than even attempting
to make a local hugepage available.

zone_reclaim_mode is not a solution to this problem since it does not only
affect hugepage allocations but rather changes the memory allocation
strategy for *all* page allocations.

Signed-off-by: David Rientjes <rientjes@google.com>
---
 include/linux/mempolicy.h |  2 --
 mm/huge_memory.c          | 42 +++++++++++++++------------------
 mm/mempolicy.c            |  2 +-
 3 files changed, 17 insertions(+), 29 deletions(-)

diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
--- a/include/linux/mempolicy.h
+++ b/include/linux/mempolicy.h
@@ -139,8 +139,6 @@ struct mempolicy *mpol_shared_policy_lookup(struct shared_policy *sp,
 struct mempolicy *get_task_policy(struct task_struct *p);
 struct mempolicy *__get_vma_policy(struct vm_area_struct *vma,
 		unsigned long addr);
-struct mempolicy *get_vma_policy(struct vm_area_struct *vma,
-		unsigned long addr);
 bool vma_policy_mof(struct vm_area_struct *vma);
 
 extern void numa_default_policy(void);
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -648,37 +648,27 @@ static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf,
 static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma, unsigned long addr)
 {
 	const bool vma_madvised = !!(vma->vm_flags & VM_HUGEPAGE);
-	gfp_t this_node = 0;
-
-#ifdef CONFIG_NUMA
-	struct mempolicy *pol;
-	/*
-	 * __GFP_THISNODE is used only when __GFP_DIRECT_RECLAIM is not
-	 * specified, to express a general desire to stay on the current
-	 * node for optimistic allocation attempts. If the defrag mode
-	 * and/or madvise hint requires the direct reclaim then we prefer
-	 * to fallback to other node rather than node reclaim because that
-	 * can lead to excessive reclaim even though there is free memory
-	 * on other nodes. We expect that NUMA preferences are specified
-	 * by memory policies.
-	 */
-	pol = get_vma_policy(vma, addr);
-	if (pol->mode != MPOL_BIND)
-		this_node = __GFP_THISNODE;
-	mpol_cond_put(pol);
-#endif
+	const gfp_t gfp_mask = GFP_TRANSHUGE_LIGHT | __GFP_THISNODE;
 
+	/* Always do synchronous compaction */
 	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_DIRECT_FLAG, &transparent_hugepage_flags))
-		return GFP_TRANSHUGE | (vma_madvised ? 0 : __GFP_NORETRY);
+		return GFP_TRANSHUGE | __GFP_THISNODE |
+		       (vma_madvised ? 0 : __GFP_NORETRY);
+
+	/* Kick kcompactd and fail quickly */
 	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_FLAG, &transparent_hugepage_flags))
-		return GFP_TRANSHUGE_LIGHT | __GFP_KSWAPD_RECLAIM | this_node;
+		return gfp_mask | __GFP_KSWAPD_RECLAIM;
+
+	/* Synchronous compaction if madvised, otherwise kick kcompactd */
 	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_OR_MADV_FLAG, &transparent_hugepage_flags))
-		return GFP_TRANSHUGE_LIGHT | (vma_madvised ? __GFP_DIRECT_RECLAIM :
-							     __GFP_KSWAPD_RECLAIM | this_node);
+		return gfp_mask | (vma_madvised ? __GFP_DIRECT_RECLAIM :
+						  __GFP_KSWAPD_RECLAIM);
+
+	/* Only do synchronous compaction if madvised */
 	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG, &transparent_hugepage_flags))
-		return GFP_TRANSHUGE_LIGHT | (vma_madvised ? __GFP_DIRECT_RECLAIM :
-							     this_node);
-	return GFP_TRANSHUGE_LIGHT | this_node;
+		return gfp_mask | (vma_madvised ? __GFP_DIRECT_RECLAIM : 0);
+
+	return gfp_mask;
 }
 
 /* Caller must hold page table lock. */
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1734,7 +1734,7 @@ struct mempolicy *__get_vma_policy(struct vm_area_struct *vma,
  * freeing by another task. It is the caller's responsibility to free the
  * extra reference for shared policies.
  */
-struct mempolicy *get_vma_policy(struct vm_area_struct *vma,
+static struct mempolicy *get_vma_policy(struct vm_area_struct *vma,
 						unsigned long addr)
 {
 	struct mempolicy *pol = __get_vma_policy(vma, addr);

From patchwork Wed Sep 4 19:54:20 2019
Date: Wed, 4 Sep 2019 12:54:20 -0700 (PDT)
From: David Rientjes <rientjes@google.com>
To: Linus Torvalds <torvalds@linux-foundation.org>, Andrew Morton <akpm@linux-foundation.org>
cc: Andrea Arcangeli <aarcange@redhat.com>, Michal Hocko <mhocko@suse.com>,
    Mel Gorman <mgorman@suse.de>, Vlastimil Babka <vbabka@suse.cz>,
    "Kirill A. Shutemov" <kirill@shutemov.name>,
    linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: [patch for-5.3 2/4] Revert "Revert "Revert "mm, thp: consolidate THP gfp handling into alloc_hugepage_direct_gfpmask"""

This reverts commit 92717d429b38e4f9f934eed7e605cc42858f1839.

Since commit a8282608c88e ("Revert "mm, thp: restore node-local hugepage
allocations"") is reverted in this series, it is better to restore the
previous 5.2 interaction between thp allocation and the page allocator
than to attempt any consolidation or cleanup of a policy that is now
reverted.  It is less risky during an rc cycle, and subsequent patches in
this series further modify the same policy that the pre-5.3 behavior
implements.

Consolidation and cleanup can be done once a sane default page allocation
strategy is in place, so this patch reverts a cleanup of a strategy that
is now reverted and is thus the least risky option for 5.3.
Signed-off-by: David Rientjes <rientjes@google.com>
---
 include/linux/gfp.h | 12 ++++++++----
 mm/huge_memory.c    | 27 +++++++++++++--------------
 mm/mempolicy.c      | 32 +++++++++++++++++++++++++++++---
 mm/shmem.c          |  2 +-
 4 files changed, 51 insertions(+), 22 deletions(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -510,18 +510,22 @@ alloc_pages(gfp_t gfp_mask, unsigned int order)
 }
 extern struct page *alloc_pages_vma(gfp_t gfp_mask, int order,
 			struct vm_area_struct *vma, unsigned long addr,
-			int node);
+			int node, bool hugepage);
+#define alloc_hugepage_vma(gfp_mask, vma, addr, order) \
+	alloc_pages_vma(gfp_mask, order, vma, addr, numa_node_id(), true)
 #else
 #define alloc_pages(gfp_mask, order) \
 		alloc_pages_node(numa_node_id(), gfp_mask, order)
-#define alloc_pages_vma(gfp_mask, order, vma, addr, node)\
+#define alloc_pages_vma(gfp_mask, order, vma, addr, node, false)\
+	alloc_pages(gfp_mask, order)
+#define alloc_hugepage_vma(gfp_mask, vma, addr, order) \
 	alloc_pages(gfp_mask, order)
 #endif
 #define alloc_page(gfp_mask) alloc_pages(gfp_mask, 0)
 #define alloc_page_vma(gfp_mask, vma, addr)			\
-	alloc_pages_vma(gfp_mask, 0, vma, addr, numa_node_id())
+	alloc_pages_vma(gfp_mask, 0, vma, addr, numa_node_id(), false)
 #define alloc_page_vma_node(gfp_mask, vma, addr, node)		\
-	alloc_pages_vma(gfp_mask, 0, vma, addr, node)
+	alloc_pages_vma(gfp_mask, 0, vma, addr, node, false)
 
 extern unsigned long __get_free_pages(gfp_t gfp_mask, unsigned int order);
 extern unsigned long get_zeroed_page(gfp_t gfp_mask);
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -645,30 +645,30 @@ static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf,
  *	    available
  * never: never stall for any thp allocation
  */
-static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma, unsigned long addr)
+static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma)
 {
 	const bool vma_madvised = !!(vma->vm_flags & VM_HUGEPAGE);
-	const gfp_t gfp_mask = GFP_TRANSHUGE_LIGHT | __GFP_THISNODE;
 
 	/* Always do synchronous compaction */
 	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_DIRECT_FLAG, &transparent_hugepage_flags))
-		return GFP_TRANSHUGE | __GFP_THISNODE |
-		       (vma_madvised ? 0 : __GFP_NORETRY);
+		return GFP_TRANSHUGE | (vma_madvised ? 0 : __GFP_NORETRY);
 
 	/* Kick kcompactd and fail quickly */
 	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_FLAG, &transparent_hugepage_flags))
-		return gfp_mask | __GFP_KSWAPD_RECLAIM;
+		return GFP_TRANSHUGE_LIGHT | __GFP_KSWAPD_RECLAIM;
 
 	/* Synchronous compaction if madvised, otherwise kick kcompactd */
 	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_OR_MADV_FLAG, &transparent_hugepage_flags))
-		return gfp_mask | (vma_madvised ? __GFP_DIRECT_RECLAIM :
-						  __GFP_KSWAPD_RECLAIM);
+		return GFP_TRANSHUGE_LIGHT |
+			(vma_madvised ? __GFP_DIRECT_RECLAIM :
+					__GFP_KSWAPD_RECLAIM);
 
 	/* Only do synchronous compaction if madvised */
 	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG, &transparent_hugepage_flags))
-		return gfp_mask | (vma_madvised ? __GFP_DIRECT_RECLAIM : 0);
+		return GFP_TRANSHUGE_LIGHT |
+			(vma_madvised ? __GFP_DIRECT_RECLAIM : 0);
 
-	return gfp_mask;
+	return GFP_TRANSHUGE_LIGHT;
 }
 
 /* Caller must hold page table lock. */
@@ -740,8 +740,8 @@ vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf)
 		pte_free(vma->vm_mm, pgtable);
 		return ret;
 	}
-	gfp = alloc_hugepage_direct_gfpmask(vma, haddr);
-	page = alloc_pages_vma(gfp, HPAGE_PMD_ORDER, vma, haddr, numa_node_id());
+	gfp = alloc_hugepage_direct_gfpmask(vma);
+	page = alloc_hugepage_vma(gfp, vma, haddr, HPAGE_PMD_ORDER);
 	if (unlikely(!page)) {
 		count_vm_event(THP_FAULT_FALLBACK);
 		return VM_FAULT_FALLBACK;
@@ -1348,9 +1348,8 @@ vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf, pmd_t orig_pmd)
 alloc:
 	if (__transparent_hugepage_enabled(vma) &&
 	    !transparent_hugepage_debug_cow()) {
-		huge_gfp = alloc_hugepage_direct_gfpmask(vma, haddr);
-		new_page = alloc_pages_vma(huge_gfp, HPAGE_PMD_ORDER, vma,
-				haddr, numa_node_id());
+		huge_gfp = alloc_hugepage_direct_gfpmask(vma);
+		new_page = alloc_hugepage_vma(huge_gfp, vma, haddr, HPAGE_PMD_ORDER);
 	} else
 		new_page = NULL;
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1180,8 +1180,8 @@ static struct page *new_page(struct page *page, unsigned long start)
 	} else if (PageTransHuge(page)) {
 		struct page *thp;
 
-		thp = alloc_pages_vma(GFP_TRANSHUGE, HPAGE_PMD_ORDER, vma,
-				address, numa_node_id());
+		thp = alloc_hugepage_vma(GFP_TRANSHUGE, vma, address,
+					 HPAGE_PMD_ORDER);
 		if (!thp)
 			return NULL;
 		prep_transhuge_page(thp);
@@ -2083,6 +2083,7 @@ static struct page *alloc_page_interleave(gfp_t gfp, unsigned order,
  * @vma:  Pointer to VMA or NULL if not available.
  * @addr: Virtual Address of the allocation. Must be inside the VMA.
  * @node: Which node to prefer for allocation (modulo policy).
+ * @hugepage: for hugepages try only the preferred node if possible
  *
  * This function allocates a page from the kernel page pool and applies
  * a NUMA policy associated with the VMA or the current process.
@@ -2093,7 +2094,7 @@ static struct page *alloc_page_interleave(gfp_t gfp, unsigned order,
  */
 struct page *
 alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
-		unsigned long addr, int node)
+		unsigned long addr, int node, bool hugepage)
 {
 	struct mempolicy *pol;
 	struct page *page;
@@ -2111,6 +2112,31 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
 		goto out;
 	}
 
+	if (unlikely(IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) && hugepage)) {
+		int hpage_node = node;
+
+		/*
+		 * For hugepage allocation and non-interleave policy which
+		 * allows the current node (or other explicitly preferred
+		 * node) we only try to allocate from the current/preferred
+		 * node and don't fall back to other nodes, as the cost of
+		 * remote accesses would likely offset THP benefits.
+		 *
+		 * If the policy is interleave, or does not allow the current
+		 * node in its nodemask, we allocate the standard way.
+		 */
+		if (pol->mode == MPOL_PREFERRED && !(pol->flags & MPOL_F_LOCAL))
+			hpage_node = pol->v.preferred_node;
+
+		nmask = policy_nodemask(gfp, pol);
+		if (!nmask || node_isset(hpage_node, *nmask)) {
+			mpol_cond_put(pol);
+			page = __alloc_pages_node(hpage_node,
+						gfp | __GFP_THISNODE, order);
+			goto out;
+		}
+	}
+
 	nmask = policy_nodemask(gfp, pol);
 	preferred_nid = policy_node(gfp, pol, node);
 	page = __alloc_pages_nodemask(gfp, order, preferred_nid, nmask);
diff --git a/mm/shmem.c b/mm/shmem.c
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1466,7 +1466,7 @@ static struct page *shmem_alloc_hugepage(gfp_t gfp,
 
 	shmem_pseudo_vma_init(&pvma, info, hindex);
 	page = alloc_pages_vma(gfp | __GFP_COMP | __GFP_NORETRY | __GFP_NOWARN,
-			HPAGE_PMD_ORDER, &pvma, 0, numa_node_id());
+			HPAGE_PMD_ORDER, &pvma, 0, numa_node_id(), true);
 	shmem_pseudo_vma_destroy(&pvma);
 	if (page)
 		prep_transhuge_page(page);

From patchwork Wed Sep 4 19:54:22 2019
Date: Wed, 4 Sep 2019 12:54:22 -0700 (PDT)
From: David Rientjes <rientjes@google.com>
To: Linus Torvalds <torvalds@linux-foundation.org>, Andrew Morton <akpm@linux-foundation.org>
cc: Andrea Arcangeli <aarcange@redhat.com>, Michal Hocko <mhocko@suse.com>,
    Mel Gorman <mgorman@suse.de>, Vlastimil Babka <vbabka@suse.cz>,
    "Kirill A. Shutemov" <kirill@shutemov.name>,
    linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: [rfc 3/4] mm, page_alloc: avoid expensive reclaim when compaction may not succeed

Memory compaction has a couple of significant drawbacks as the allocation
order increases, specifically:

 - isolate_freepages() is responsible for finding free pages to use as
   migration targets and is implemented as a linear scan of memory
   starting at the end of a zone,

 - failing order-0 watermark checks in memory compaction does not account
   for how far below the watermarks the zone actually is: to enable
   migration, there must be *some* free memory available.  Per the above,
   watermarks are not always sufficient if isolate_freepages() cannot find
   the free memory, and it could require hundreds of MBs of reclaim to
   even reach this threshold (read: potentially very expensive reclaim
   with no indication compaction can be successful), and

 - if compaction at this order has failed recently so that it does not
   even run as a result of deferred compaction, looping through reclaim
   is often pointless.

For hugepage allocations, these are substantial drawbacks because thp
allocations are very high order (order-9 on x86) and falling back to
reclaim can be *very* expensive without any indication that compaction
would even be successful.  Reclaim itself is unlikely to free entire
pageblocks and certainly no reliance should be put on it to do so in
isolation (recall lumpy reclaim).  This means we should avoid reclaim and
simply fail the hugepage allocation if compaction is deferred.

It is also not helpful to thrash a zone by doing excessive reclaim if
compaction may not be able to access that memory.  If order-0 watermarks
fail and the allocation order is sufficiently large, it is likely better
to fail the allocation than to thrash the zone.

Signed-off-by: David Rientjes <rientjes@google.com>
---
 mm/page_alloc.c | 22 ++++++++++++++++++++++
 1 file changed, 22 insertions(+)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4458,6 +4458,28 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	if (page)
 		goto got_pg;
 
+	if (order >= pageblock_order && (gfp_mask & __GFP_IO)) {
+		/*
+		 * If allocating entire pageblock(s) and compaction
+		 * failed because all zones are below low watermarks
+		 * or is prohibited because it recently failed at this
+		 * order, fail immediately.
+		 *
+		 * Reclaim is
+		 *  - potentially very expensive because zones are far
+		 *    below their low watermarks or this is part of very
+		 *    bursty high order allocations,
+		 *  - not guaranteed to help because isolate_freepages()
+		 *    may not iterate over freed pages as part of its
+		 *    linear scan, and
+		 *  - unlikely to make entire pageblocks free on its
+		 *    own.
+		 */
+		if (compact_result == COMPACT_SKIPPED ||
+		    compact_result == COMPACT_DEFERRED)
+			goto nopage;
+	}
+
 	/*
 	 * Checks for costly allocations with __GFP_NORETRY, which
 	 * includes THP page fault allocations

From patchwork Wed Sep 4 19:54:25 2019
Date: Wed, 4 Sep 2019 12:54:25 -0700 (PDT)
From: David Rientjes <rientjes@google.com>
To: Linus Torvalds <torvalds@linux-foundation.org>, Andrew Morton <akpm@linux-foundation.org>
cc: Andrea Arcangeli <aarcange@redhat.com>, Michal Hocko <mhocko@suse.com>,
    Mel Gorman <mgorman@suse.de>, Vlastimil Babka <vbabka@suse.cz>,
    "Kirill A. Shutemov" <kirill@shutemov.name>,
    linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: [rfc 4/4] mm, page_alloc: allow hugepage fallback to remote nodes when madvised

For systems configured to always try hard to allocate transparent
hugepages (thp defrag setting of "always"), or for memory that has been
explicitly madvised with MADV_HUGEPAGE, it is often better to fall back to
remote memory to allocate the hugepage if the local allocation fails.

The point is to allow the initial call to __alloc_pages_node() to attempt
to defragment local memory to make a hugepage available, if possible,
rather than immediately falling back to remote memory.  Local hugepages
will always have better access latency than remote (huge)pages, so an
attempt to make a hugepage available locally is always preferred.

If memory compaction cannot be successful locally, however, it is likely
better to fall back to remote memory.  This could take one of two forms:
either allow immediate fallback to remote memory, or do per-zone watermark
checks.  It would be possible to fall back only when per-zone watermarks
fail for order-0 memory, since that would require local reclaim for all
subsequent faults, so remote huge allocation is likely better than
thrashing the local zone for large workloads.

In this case, it is assumed that, because the system is configured to try
hard to allocate hugepages or the vma is explicitly madvised to try hard
for hugepages, remote allocation is better once local allocation and
memory compaction have both failed.
Signed-off-by: David Rientjes <rientjes@google.com>
---
 mm/mempolicy.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2133,6 +2133,17 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
 			mpol_cond_put(pol);
 			page = __alloc_pages_node(hpage_node,
 						gfp | __GFP_THISNODE, order);
+
+			/*
+			 * If hugepage allocations are configured to always
+			 * synchronous compact or the vma has been madvised
+			 * to prefer hugepage backing, retry allowing remote
+			 * memory as well.
+			 */
+			if (!page && (gfp & __GFP_DIRECT_RECLAIM))
+				page = __alloc_pages_node(hpage_node,
+						gfp | __GFP_NORETRY, order);
+
 			goto out;
 		}
 	}
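
(Illustrative aside, not part of this patch: a simplified userspace sketch
of how a workload could observe whether its madvised memory actually ended
up backed by transparent hugepages, by summing the AnonHugePages counters
in /proc/self/smaps.  A real check would restrict itself to the madvised
range by matching the mapping's address interval first.)

#include <stdio.h>

int main(void)
{
	FILE *f = fopen("/proc/self/smaps", "r");
	char line[256];
	unsigned long kb, total_kb = 0;

	if (!f)
		return 1;
	while (fgets(line, sizeof(line), f)) {
		/* one AnonHugePages: line per mapping; sum them all */
		if (sscanf(line, "AnonHugePages: %lu kB", &kb) == 1)
			total_kb += kb;
	}
	fclose(f);
	printf("AnonHugePages total: %lu kB\n", total_kb);
	return 0;
}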