From patchwork Fri Jun 19 16:24:23 2020
X-Patchwork-Submitter: Ben Widawsky <ben.widawsky@intel.com>
X-Patchwork-Id: 11614607
From: Ben Widawsky <ben.widawsky@intel.com>
To: linux-mm <linux-mm@kvack.org>
Cc: Ben Widawsky, Andrew Morton, Dave Hansen, Jason Gunthorpe,
	Michal Hocko, Mike Kravetz
Subject: [PATCH 16/18] alloc_pages_nodemask: turn preferred nid into a nodemask
Date: Fri, 19 Jun 2020 09:24:23 -0700
Message-Id: <20200619162425.1052382-17-ben.widawsky@intel.com>
X-Mailer: git-send-email 2.27.0
In-Reply-To: <20200619162425.1052382-1-ben.widawsky@intel.com>
References: <20200619162425.1052382-1-ben.widawsky@intel.com>

The guts of the page allocator already understand that the memory policy
might provide multiple preferred nodes. Ideally, the alloc function itself
wouldn't take multiple nodes until one of the callers decided it would be
useful. Unfortunately, as the call stack stands today, the caller of
__alloc_pages_nodemask is responsible for figuring out the preferred nodes
(almost always, with no policy in place, this is numa_node_id()).

The purpose of this patch is to allow multiple preferred nodes while
keeping the existing logical preference assignments in place. In other
words, everything at and below __alloc_pages_nodemask() has no concept of
policy, and this patch maintains that division.

Like bindmask, both NULL and the empty set are allowed for the preference
mask.

A note on allocation: one obvious fallout from this is that some callers
now allocate nodemasks on their stack. When no policy is in place, these
nodemasks are simply nodemask_of_node(numa_node_id()). Some of this is
addressed in the next patch. The alternatives are kmalloc, which is unsafe
in these paths; a percpu variable, which can't work because a nodemask can
be 128B at the maximum NODES_SHIFT of 10 on x86 and ia64, too large for a
percpu variable; or a lookup table. There's no reason a lookup table can't
work, but it seems like a premature optimization: in the more extreme case
of a large system, each nodemask would be 128B and there are 1024 nodes,
so the size of just that table is 128K. I'm very open to better solutions.
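As an aside, here is a minimal sketch of the stack-allocated nodemask
pattern described above. It is illustrative only and not part of this
patch: the function name is made up, and the real call sites are the ones
converted in the diff below.

  #include <linux/gfp.h>
  #include <linux/nodemask.h>
  #include <linux/topology.h>

  /*
   * Hypothetical policy-free caller: build a single-node preference mask
   * on the stack (~128 bytes with NODES_SHIFT == 10) and hand it, together
   * with the binding mask, to the new interface.
   */
  static struct page *example_alloc_local_pref(gfp_t gfp_mask,
                                               unsigned int order,
                                               nodemask_t *bindmask)
  {
          nodemask_t prefmask = nodemask_of_node(numa_node_id());

          /* Like bindmask, a NULL or empty preference mask is allowed. */
          return __alloc_pages_nodemask(gfp_mask, order, &prefmask, bindmask);
  }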
Cc: Andrew Morton
Cc: Dave Hansen
Cc: Jason Gunthorpe
Cc: Michal Hocko
Cc: Mike Kravetz
Signed-off-by: Ben Widawsky <ben.widawsky@intel.com>
---
 include/linux/gfp.h     |  8 +++-----
 include/linux/migrate.h |  4 ++--
 mm/hugetlb.c            |  3 +--
 mm/mempolicy.c          | 27 ++++++---------------------
 mm/page_alloc.c         | 10 ++++------
 5 files changed, 16 insertions(+), 36 deletions(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 9ab5c07579bd..47e9c02c17ae 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -499,15 +499,13 @@ static inline int arch_make_page_accessible(struct page *page)
 }
 #endif
 
-struct page *
-__alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, int preferred_nid,
-			nodemask_t *nodemask);
+struct page *__alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
+				    nodemask_t *prefmask, nodemask_t *nodemask);
 
 static inline struct page *
 __alloc_pages_nodes(nodemask_t *nodes, gfp_t gfp_mask, unsigned int order)
 {
-	return __alloc_pages_nodemask(gfp_mask, order, first_node(*nodes),
-				      NULL);
+	return __alloc_pages_nodemask(gfp_mask, order, nodes, NULL);
 }
 
 /*
diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index 3e546cbf03dd..91b399ec9249 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -37,6 +37,7 @@ static inline struct page *new_page_nodemask(struct page *page,
 	gfp_t gfp_mask = GFP_USER | __GFP_MOVABLE | __GFP_RETRY_MAYFAIL;
 	unsigned int order = 0;
 	struct page *new_page = NULL;
+	nodemask_t pmask = nodemask_of_node(preferred_nid);
 
 	if (PageHuge(page))
 		return alloc_huge_page_nodemask(page_hstate(compound_head(page)),
@@ -50,8 +51,7 @@ static inline struct page *new_page_nodemask(struct page *page,
 	if (PageHighMem(page) || (zone_idx(page_zone(page)) == ZONE_MOVABLE))
 		gfp_mask |= __GFP_HIGHMEM;
 
-	new_page = __alloc_pages_nodemask(gfp_mask, order,
-				preferred_nid, nodemask);
+	new_page = __alloc_pages_nodemask(gfp_mask, order, &pmask, nodemask);
 
 	if (new_page && PageTransHuge(new_page))
 		prep_transhuge_page(new_page);
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 71b6750661df..52e097aed7ed 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1706,8 +1706,7 @@ static struct page *alloc_buddy_huge_page(struct hstate *h,
 	gfp_mask |= __GFP_COMP|__GFP_NOWARN;
 	if (alloc_try_hard)
 		gfp_mask |= __GFP_RETRY_MAYFAIL;
-	page = __alloc_pages_nodemask(gfp_mask, order, first_node(pmask),
-				      nmask);
+	page = __alloc_pages_nodemask(gfp_mask, order, &pmask, nmask);
 	if (page)
 		__count_vm_event(HTLB_BUDDY_PGALLOC);
 	else
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 9521bb46aa00..fb49bea41ab8 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2274,7 +2274,6 @@ static struct page *alloc_pages_vma_thp(gfp_t gfp, struct mempolicy *pol,
 {
 	nodemask_t *nmask;
 	struct page *page;
-	int hpage_node = first_node(*prefmask);
 
 	/*
 	 * For hugepage allocation and non-interleave policy which allows the
@@ -2282,9 +2281,6 @@ static struct page *alloc_pages_vma_thp(gfp_t gfp, struct mempolicy *pol,
 	 * allocate from the current/preferred node and don't fall back to other
 	 * nodes, as the cost of remote accesses would likely offset THP
 	 * benefits.
-	 *
-	 * If the policy is interleave or multiple preferred nodes, or does not
-	 * allow the current node in its nodemask, we allocate the standard way.
 	 */
 	nmask = policy_nodemask(gfp, pol);
 
@@ -2293,7 +2289,7 @@ static struct page *alloc_pages_vma_thp(gfp_t gfp, struct mempolicy *pol,
 	 * unnecessarily, just compact.
 	 */
 	page = __alloc_pages_nodemask(gfp | __GFP_THISNODE | __GFP_NORETRY,
-				      order, hpage_node, nmask);
+				      order, prefmask, nmask);
 
 	/*
 	 * If hugepage allocations are configured to always synchronous compact
@@ -2301,7 +2297,7 @@ static struct page *alloc_pages_vma_thp(gfp_t gfp, struct mempolicy *pol,
 	 * allowing remote memory with both reclaim and compact as well.
 	 */
 	if (!page && (gfp & __GFP_DIRECT_RECLAIM))
-		page = __alloc_pages_nodemask(gfp, order, hpage_node, nmask);
+		page = __alloc_pages_nodemask(gfp, order, prefmask, nmask);
 
 	VM_BUG_ON(page && nmask && !node_isset(page_to_nid(page), *nmask));
 
@@ -2337,14 +2333,10 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
 {
 	struct mempolicy *pol;
 	struct page *page;
-	nodemask_t *nmask, *pmask, tmp;
+	nodemask_t *nmask, *pmask;
 
 	pol = get_vma_policy(vma, addr);
 	pmask = policy_preferred_nodes(gfp, pol);
-	if (!pmask) {
-		tmp = nodemask_of_node(node);
-		pmask = &tmp;
-	}
 
 	if (pol->mode == MPOL_INTERLEAVE) {
 		unsigned nid;
@@ -2358,9 +2350,8 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
 		mpol_cond_put(pol);
 	} else {
 		nmask = policy_nodemask(gfp, pol);
-		page = __alloc_pages_nodemask(gfp, order, first_node(*pmask),
-					      nmask);
 		mpol_cond_put(pol);
+		page = __alloc_pages_nodemask(gfp, order, pmask, nmask);
 	}
 
 	return page;
@@ -2397,14 +2388,8 @@ struct page *alloc_pages_current(gfp_t gfp, unsigned order)
 	if (pol->mode == MPOL_INTERLEAVE) {
 		page = alloc_page_interleave(gfp, order, interleave_nodes(pol));
 	} else {
-		nodemask_t tmp, *pmask;
-
-		pmask = policy_preferred_nodes(gfp, pol);
-		if (!pmask) {
-			tmp = nodemask_of_node(numa_node_id());
-			pmask = &tmp;
-		}
-		page = __alloc_pages_nodemask(gfp, order, first_node(*pmask),
+		page = __alloc_pages_nodemask(gfp, order,
+					      policy_preferred_nodes(gfp, pol),
 					      policy_nodemask(gfp, pol));
 	}
 
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index c6f8f112a5d4..0f90419fe0d8 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4967,15 +4967,13 @@ static inline void finalise_ac(gfp_t gfp_mask, struct alloc_context *ac)
 /*
  * This is the 'heart' of the zoned buddy allocator.
  */
-struct page *
-__alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, int preferred_nid,
-							nodemask_t *nodemask)
+struct page *__alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
+				    nodemask_t *prefmask, nodemask_t *nodemask)
 {
 	struct page *page;
 	unsigned int alloc_flags = ALLOC_WMARK_LOW;
 	gfp_t alloc_mask; /* The gfp_t that was actually used for allocation */
 	struct alloc_context ac = { };
-	nodemask_t prefmask = nodemask_of_node(preferred_nid);
 
 	/*
 	 * There are several places where we assume that the order value is sane
@@ -4988,11 +4986,11 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, int preferred_nid,
 	gfp_mask &= gfp_allowed_mask;
 	alloc_mask = gfp_mask;
-	if (!prepare_alloc_pages(gfp_mask, order, &prefmask, nodemask, &ac,
+	if (!prepare_alloc_pages(gfp_mask, order, prefmask, nodemask, &ac,
 				 &alloc_mask, &alloc_flags))
 		return NULL;
 
-	ac.prefmask = &prefmask;
+	ac.prefmask = prefmask;
 
 	finalise_ac(gfp_mask, &ac);