From patchwork Wed Oct 13 09:45:39 2021
X-Patchwork-Submitter: "Aneesh Kumar K.V"
X-Patchwork-Id: 12555273
From: "Aneesh Kumar K.V"
To: linux-mm@kvack.org
Cc: akpm@linux-foundation.org, "Aneesh Kumar K.V", Ben Widawsky, Dave Hansen,
    Feng Tang, Michal Hocko, Andrea Arcangeli, Mel Gorman, Mike Kravetz,
    Randy Dunlap, Vlastimil Babka, Andi Kleen, Dan Williams, Huang Ying
Subject: [RFC PATCH] mm/mempolicy: add MPOL_PREFERRED_STRICT memory policy
Date: Wed, 13 Oct 2021 15:15:39 +0530
Message-Id: <20211013094539.962357-1-aneesh.kumar@linux.ibm.com>
X-Mailer: git-send-email 2.31.1

This mempolicy mode can be used with either the set_mempolicy(2) or mbind(2)
interface. Like MPOL_PREFERRED, it allows an application to set a preferred
node from which the kernel will fulfill memory allocation requests. Unlike
MPOL_PREFERRED, it takes a set of nodes: the remaining nodes in the nodemask
are used as fallback allocation nodes if memory is not available on the
preferred node. Unlike MPOL_PREFERRED_MANY, it will not fall back to all
nodes in the system. Like MPOL_BIND, it operates over a set of nodes and
will cause a SIGSEGV or invoke the OOM killer if memory is not available on
any of those nodes.

This patch lets applications hint a preferred memory allocation node and
fall back to _only_ a restricted set of nodes if memory is not available on
the preferred node. Fallback allocation is attempted from the node in the
nodemask that is nearest to the preferred node.

This new memory policy gives applications explicit control over allocation
from slow memory and avoids the default fallback to slow-memory NUMA nodes.
The difference from MPOL_BIND is the ability to specify a preferred node,
which is the first (lowest-numbered) node in the nodemask argument passed.
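For illustration only (not part of the patch), here is a minimal userspace
sketch of setting a task-wide policy with the new mode, assuming a kernel
with this RFC applied. MPOL_PREFERRED_STRICT is not in any released uapi
header, so the fallback #define of its value (the slot after
MPOL_PREFERRED_MANY in the patched enum) is an assumption:

	/* Hypothetical example: prefer node 0, allow fallback only to node 2. */
	#include <stdio.h>
	#include <stdlib.h>
	#include <numaif.h>		/* set_mempolicy(), MPOL_* constants */

	#ifndef MPOL_PREFERRED_STRICT
	#define MPOL_PREFERRED_STRICT 6	/* assumed value from the patched uapi enum */
	#endif

	int main(void)
	{
		/*
		 * Bit i set => node i is allowed. The lowest-numbered node in
		 * the mask (node 0 here) is the preferred node; node 2 is the
		 * only permitted fallback.
		 */
		unsigned long nodemask = (1UL << 0) | (1UL << 2);

		if (set_mempolicy(MPOL_PREFERRED_STRICT, &nodemask,
				  sizeof(nodemask) * 8) != 0) {
			perror("set_mempolicy");
			return EXIT_FAILURE;
		}

		/*
		 * Later allocations by this task come from node 0 when
		 * possible, fall back to node 2, and hit MPOL_BIND-like
		 * OOM/SIGSEGV behaviour if neither node can satisfy them.
		 */
		char *buf = malloc(64UL << 20);
		if (buf)
			for (size_t i = 0; i < (64UL << 20); i += 4096)
				buf[i] = 0;	/* touch so pages are allocated */
		free(buf);
		return 0;
	}

(Build against libnuma's numaif.h, e.g. "gcc example.c -lnuma"; the node
numbers are placeholders and depend on the machine's topology.)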
Cc: Ben Widawsky
Cc: Dave Hansen
Cc: Feng Tang
Cc: Michal Hocko
Cc: Andrea Arcangeli
Cc: Mel Gorman
Cc: Mike Kravetz
Cc: Randy Dunlap
Cc: Vlastimil Babka
Cc: Andi Kleen
Cc: Dan Williams
Cc: Huang Ying
Signed-off-by: Aneesh Kumar K.V
---
 .../admin-guide/mm/numa_memory_policy.rst |  7 +++
 include/uapi/linux/mempolicy.h            |  1 +
 mm/mempolicy.c                            | 43 +++++++++++++++++--
 3 files changed, 48 insertions(+), 3 deletions(-)

diff --git a/Documentation/admin-guide/mm/numa_memory_policy.rst b/Documentation/admin-guide/mm/numa_memory_policy.rst
index 64fd0ba0d057..4dfdcbd22d67 100644
--- a/Documentation/admin-guide/mm/numa_memory_policy.rst
+++ b/Documentation/admin-guide/mm/numa_memory_policy.rst
@@ -252,6 +252,13 @@ MPOL_PREFERRED_MANY
 	can fall back to all existing numa nodes.  This is effectively
 	MPOL_PREFERRED allowed for a mask rather than a single node.
 
+MPOL_PREFERRED_STRICT
+	This mode specifies that the allocation should be attempted
+	from the first node specified in the nodemask of the policy.
+	If that allocation fails, the kernel will search other nodes
+	in the nodemask, in order of increasing distance from the
+	preferred node based on information provided by the platform firmware.
+
 NUMA memory policy supports the following optional mode flags:
 
 MPOL_F_STATIC_NODES
diff --git a/include/uapi/linux/mempolicy.h b/include/uapi/linux/mempolicy.h
index 046d0ccba4cd..8aa1d1963235 100644
--- a/include/uapi/linux/mempolicy.h
+++ b/include/uapi/linux/mempolicy.h
@@ -23,6 +23,7 @@ enum {
 	MPOL_INTERLEAVE,
 	MPOL_LOCAL,
 	MPOL_PREFERRED_MANY,
+	MPOL_PREFERRED_STRICT,
 	MPOL_MAX,	/* always last member of enum */
 };
 
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 1592b081c58e..59080dd1ea69 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -407,6 +407,10 @@ static const struct mempolicy_operations mpol_ops[MPOL_MAX] = {
 		.create = mpol_new_nodemask,
 		.rebind = mpol_rebind_preferred,
 	},
+	[MPOL_PREFERRED_STRICT] = {
+		.create = mpol_new_nodemask,
+		.rebind = mpol_rebind_preferred,
+	},
 };
 
 static int migrate_page_add(struct page *page, struct list_head *pagelist,
@@ -900,6 +904,7 @@ static void get_policy_nodemask(struct mempolicy *p, nodemask_t *nodes)
 	case MPOL_INTERLEAVE:
 	case MPOL_PREFERRED:
 	case MPOL_PREFERRED_MANY:
+	case MPOL_PREFERRED_STRICT:
 		*nodes = p->nodes;
 		break;
 	case MPOL_LOCAL:
@@ -1781,7 +1786,7 @@ nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *policy)
 			cpuset_nodemask_valid_mems_allowed(&policy->nodes))
 		return &policy->nodes;
 
-	if (mode == MPOL_PREFERRED_MANY)
+	if (mode == MPOL_PREFERRED_MANY || mode == MPOL_PREFERRED_STRICT)
 		return &policy->nodes;
 
 	return NULL;
@@ -1796,7 +1801,7 @@ nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *policy)
  */
 static int policy_node(gfp_t gfp, struct mempolicy *policy, int nd)
 {
-	if (policy->mode == MPOL_PREFERRED) {
+	if (policy->mode == MPOL_PREFERRED || policy->mode == MPOL_PREFERRED_STRICT) {
 		nd = first_node(policy->nodes);
 	} else {
 		/*
@@ -1840,6 +1845,7 @@ unsigned int mempolicy_slab_node(void)
 
 	switch (policy->mode) {
 	case MPOL_PREFERRED:
+	case MPOL_PREFERRED_STRICT:
 		return first_node(policy->nodes);
 
 	case MPOL_INTERLEAVE:
@@ -1952,7 +1958,8 @@ int huge_node(struct vm_area_struct *vma, unsigned long addr, gfp_t gfp_flags,
 					huge_page_shift(hstate_vma(vma)));
 	} else {
 		nid = policy_node(gfp_flags, *mpol, numa_node_id());
-		if (mode == MPOL_BIND || mode == MPOL_PREFERRED_MANY)
+		if (mode == MPOL_BIND || mode == MPOL_PREFERRED_MANY ||
+		    mode == MPOL_PREFERRED_STRICT)
 			*nodemask = &(*mpol)->nodes;
 	}
 	return nid;
@@ -1986,6 +1993,7 @@ bool init_nodemask_of_mempolicy(nodemask_t *mask)
 	switch (mempolicy->mode) {
 	case MPOL_PREFERRED:
 	case MPOL_PREFERRED_MANY:
+	case MPOL_PREFERRED_STRICT:
 	case MPOL_BIND:
 	case MPOL_INTERLEAVE:
 		*mask = mempolicy->nodes;
@@ -2072,6 +2080,23 @@ static struct page *alloc_pages_preferred_many(gfp_t gfp, unsigned int order,
 	return page;
 }
 
+static struct page *alloc_pages_preferred_strict(gfp_t gfp, unsigned int order,
+						 struct mempolicy *pol)
+{
+	int nid;
+	gfp_t preferred_gfp;
+
+	/*
+	 * With MPOL_PREFERRED_STRICT first node in the policy nodemask
+	 * is picked as the preferred node id and the fallback allocation
+	 * is still restricted to the preferred nodes in the nodemask.
+	 */
+	preferred_gfp = gfp | __GFP_NOWARN;
+	preferred_gfp &= ~(__GFP_DIRECT_RECLAIM | __GFP_NOFAIL);
+	nid = first_node(pol->nodes);
+	return __alloc_pages(preferred_gfp, order, nid, &pol->nodes);
+}
+
 /**
  * alloc_pages_vma - Allocate a page for a VMA.
  * @gfp: GFP flags.
@@ -2113,6 +2138,12 @@ struct page *alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
 		goto out;
 	}
 
+	if (pol->mode == MPOL_PREFERRED_STRICT) {
+		page = alloc_pages_preferred_strict(gfp, order, pol);
+		mpol_cond_put(pol);
+		goto out;
+	}
+
 	if (unlikely(IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) && hugepage)) {
 		int hpage_node = node;
@@ -2193,6 +2224,8 @@ struct page *alloc_pages(gfp_t gfp, unsigned order)
 	else if (pol->mode == MPOL_PREFERRED_MANY)
 		page = alloc_pages_preferred_many(gfp, order,
 				numa_node_id(), pol);
+	else if (pol->mode == MPOL_PREFERRED_STRICT)
+		page = alloc_pages_preferred_strict(gfp, order, pol);
 	else
 		page = __alloc_pages(gfp, order,
 				policy_node(gfp, pol, numa_node_id()),
@@ -2265,6 +2298,7 @@ bool __mpol_equal(struct mempolicy *a, struct mempolicy *b)
 	case MPOL_INTERLEAVE:
 	case MPOL_PREFERRED:
 	case MPOL_PREFERRED_MANY:
+	case MPOL_PREFERRED_STRICT:
 		return !!nodes_equal(a->nodes, b->nodes);
 	case MPOL_LOCAL:
 		return true;
@@ -2405,6 +2439,7 @@ int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long
 		break;
 
 	case MPOL_PREFERRED:
+	case MPOL_PREFERRED_STRICT:
 		if (node_isset(curnid, pol->nodes))
 			goto out;
 		polnid = first_node(pol->nodes);
@@ -2866,6 +2901,7 @@ int mpol_parse_str(char *str, struct mempolicy **mpol)
 		err = 0;
 		goto out;
 	case MPOL_PREFERRED_MANY:
+	case MPOL_PREFERRED_STRICT:
 	case MPOL_BIND:
 		/*
 		 * Insist on a nodelist
@@ -2953,6 +2989,7 @@ void mpol_to_str(char *buffer, int maxlen, struct mempolicy *pol)
 		break;
 	case MPOL_PREFERRED:
 	case MPOL_PREFERRED_MANY:
+	case MPOL_PREFERRED_STRICT:
 	case MPOL_BIND:
 	case MPOL_INTERLEAVE:
 		nodes = pol->nodes;
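For completeness, a similarly hypothetical sketch of applying the new mode to
a single mapping with mbind(2), which would exercise the alloc_pages_vma()
path touched above. As before, the constant's numeric value is an assumption
taken from this patch's uapi change, and the example only works on a kernel
carrying the RFC:

	#include <stdio.h>
	#include <string.h>
	#include <sys/mman.h>
	#include <numaif.h>		/* mbind(), MPOL_* constants */

	#ifndef MPOL_PREFERRED_STRICT
	#define MPOL_PREFERRED_STRICT 6	/* assumed value, see the mempolicy.h hunk */
	#endif

	int main(void)
	{
		size_t len = 16UL << 20;	/* 16 MiB anonymous region */
		void *addr = mmap(NULL, len, PROT_READ | PROT_WRITE,
				  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		if (addr == MAP_FAILED) {
			perror("mmap");
			return 1;
		}

		/* Prefer node 1 (lowest set bit); allow fallback only to node 3. */
		unsigned long nodemask = (1UL << 1) | (1UL << 3);
		if (mbind(addr, len, MPOL_PREFERRED_STRICT, &nodemask,
			  sizeof(nodemask) * 8, 0) != 0) {
			perror("mbind");
			munmap(addr, len);
			return 1;
		}

		/* Faulting the pages in now allocates them under the new policy. */
		memset(addr, 0, len);
		munmap(addr, len);
		return 0;
	}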