From patchwork Fri Jun 3 13:42:29 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Aneesh Kumar K.V" X-Patchwork-Id: 12869078 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 43F1BC433EF for ; Fri, 3 Jun 2022 13:43:44 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id D09CD8D0003; Fri, 3 Jun 2022 09:43:43 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id CE05F8D0001; Fri, 3 Jun 2022 09:43:43 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id B81BA8D0003; Fri, 3 Jun 2022 09:43:43 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0013.hostedemail.com [216.40.44.13]) by kanga.kvack.org (Postfix) with ESMTP id AB1C98D0001 for ; Fri, 3 Jun 2022 09:43:43 -0400 (EDT) Received: from smtpin22.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay01.hostedemail.com (Postfix) with ESMTP id 8CF4161649 for ; Fri, 3 Jun 2022 13:43:43 +0000 (UTC) X-FDA: 79537042326.22.C639ACD Received: from mx0a-001b2d01.pphosted.com (mx0a-001b2d01.pphosted.com [148.163.156.1]) by imf20.hostedemail.com (Postfix) with ESMTP id D0FBD1C0061 for ; Fri, 3 Jun 2022 13:43:23 +0000 (UTC) Received: from pps.filterd (m0098399.ppops.net [127.0.0.1]) by mx0a-001b2d01.pphosted.com (8.17.1.5/8.17.1.5) with ESMTP id 253DSsxV011341; Fri, 3 Jun 2022 13:43:27 GMT DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=ibm.com; h=from : to : cc : subject : date : message-id : in-reply-to : references : content-transfer-encoding : mime-version; s=pp1; bh=/oA9LGIP88Gpnp11nljD/CJTAamm36Hcf8YkiGAoLxU=; b=C613BK2CafRGr2nma/5h7BHrCcz0tHTrcYyHed1gt6SyQhMs0hMfZG1j6xf3AK4qNCEM k8xSUqxGmCa1Qdb7e0fSe5fnMLTSpXPshrX47VsPUojnyheI/1tH1D8be6ibK5eAA2tx EnENdJ77hzr8RwW7nMz7vu+xxtDgnQv71n//frIe3X0UBOpKNLzrDL2P3Ct+CWhyrn0p r1XC3ceX/Js6FCZsyzdfDdEN5y/0mHb667hUveotZYghzCATJb/yQTEIK1PAvLJo/qjA Ev0dTQEuY21bc6odP2AHJmCdhzfjiXN3171KFGD2erBXx07rcQldAGHD5M58wIeVS6sZ dQ== Received: from pps.reinject (localhost [127.0.0.1]) by mx0a-001b2d01.pphosted.com (PPS) with ESMTPS id 3gfk7kr8et-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Fri, 03 Jun 2022 13:43:27 +0000 Received: from m0098399.ppops.net (m0098399.ppops.net [127.0.0.1]) by pps.reinject (8.17.1.5/8.17.1.5) with ESMTP id 253DZQq3003156; Fri, 3 Jun 2022 13:43:26 GMT Received: from ppma04dal.us.ibm.com (7a.29.35a9.ip4.static.sl-reverse.com [169.53.41.122]) by mx0a-001b2d01.pphosted.com (PPS) with ESMTPS id 3gfk7kr8ec-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Fri, 03 Jun 2022 13:43:26 +0000 Received: from pps.filterd (ppma04dal.us.ibm.com [127.0.0.1]) by ppma04dal.us.ibm.com (8.16.1.2/8.16.1.2) with SMTP id 253DaXQ2027627; Fri, 3 Jun 2022 13:43:25 GMT Received: from b01cxnp22036.gho.pok.ibm.com (b01cxnp22036.gho.pok.ibm.com [9.57.198.26]) by ppma04dal.us.ibm.com with ESMTP id 3gd4n68dhx-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Fri, 03 Jun 2022 13:43:25 +0000 Received: from b01ledav004.gho.pok.ibm.com (b01ledav004.gho.pok.ibm.com [9.57.199.109]) by b01cxnp22036.gho.pok.ibm.com (8.14.9/8.14.9/NCO v10.0) with ESMTP id 253DhOXm6030034 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-GCM-SHA384 bits=256 verify=OK); Fri, 3 Jun 2022 13:43:24 GMT Received: from b01ledav004.gho.pok.ibm.com (unknown [127.0.0.1]) by IMSVA (Postfix) with ESMTP id 903F5112064; Fri, 3 Jun 2022 13:43:24 +0000 (GMT) Received: from b01ledav004.gho.pok.ibm.com (unknown [127.0.0.1]) by IMSVA (Postfix) with ESMTP id 0291E112065; Fri, 3 Jun 2022 13:43:17 +0000 (GMT) Received: from skywalker.ibmuc.com (unknown [9.43.93.173]) by b01ledav004.gho.pok.ibm.com (Postfix) with ESMTP; Fri, 3 Jun 2022 13:43:16 +0000 (GMT) From: "Aneesh Kumar K.V" To: linux-mm@kvack.org, akpm@linux-foundation.org Cc: Wei Xu , Huang Ying , Greg Thelen , Yang Shi , Davidlohr Bueso , Tim C Chen , Brice Goglin , Michal Hocko , Linux Kernel Mailing List , Hesham Almatary , Dave Hansen , Jonathan Cameron , Alistair Popple , Dan Williams , Feng Tang , Jagdish Gediya , Baolin Wang , David Rientjes , "Aneesh Kumar K.V" Subject: [PATCH v5 1/9] mm/demotion: Add support for explicit memory tiers Date: Fri, 3 Jun 2022 19:12:29 +0530 Message-Id: <20220603134237.131362-2-aneesh.kumar@linux.ibm.com> X-Mailer: git-send-email 2.36.1 In-Reply-To: <20220603134237.131362-1-aneesh.kumar@linux.ibm.com> References: <20220603134237.131362-1-aneesh.kumar@linux.ibm.com> X-TM-AS-GCONF: 00 X-Proofpoint-ORIG-GUID: A_gPILNQDguDnqU3jwfkNfGenZR6HMNq X-Proofpoint-GUID: hm4m7ynA733JJRgb55UsFu1ziIMX7G9v X-Proofpoint-UnRewURL: 0 URL was un-rewritten MIME-Version: 1.0 X-Proofpoint-Virus-Version: vendor=baseguard engine=ICAP:2.0.205,Aquarius:18.0.874,Hydra:6.0.517,FMLib:17.11.64.514 definitions=2022-06-03_04,2022-06-03_01,2022-02-23_01 X-Proofpoint-Spam-Details: rule=outbound_notspam policy=outbound score=0 clxscore=1015 mlxscore=0 adultscore=0 bulkscore=0 mlxlogscore=999 priorityscore=1501 impostorscore=0 malwarescore=0 spamscore=0 lowpriorityscore=0 phishscore=0 suspectscore=0 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.12.0-2204290000 definitions=main-2206030059 X-Stat-Signature: f8c3g79b4o4h6jzt933amwgyr8jm43yz X-Rspam-User: Authentication-Results: imf20.hostedemail.com; dkim=pass header.d=ibm.com header.s=pp1 header.b=C613BK2C; dmarc=pass (policy=none) header.from=ibm.com; spf=pass (imf20.hostedemail.com: domain of aneesh.kumar@linux.ibm.com designates 148.163.156.1 as permitted sender) smtp.mailfrom=aneesh.kumar@linux.ibm.com X-Rspamd-Server: rspam09 X-Rspamd-Queue-Id: D0FBD1C0061 X-HE-Tag: 1654263803-367628 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: In the current kernel, memory tiers are defined implicitly via a demotion path relationship between NUMA nodes, which is created during the kernel initialization and updated when a NUMA node is hot-added or hot-removed. The current implementation puts all nodes with CPU into the top tier, and builds the tier hierarchy tier-by-tier by establishing the per-node demotion targets based on the distances between nodes. This current memory tier kernel interface needs to be improved for several important use cases, The current tier initialization code always initializes each memory-only NUMA node into a lower tier. But a memory-only NUMA node may have a high performance memory device (e.g. a DRAM device attached via CXL.mem or a DRAM-backed memory-only node on a virtual machine) and should be put into a higher tier. The current tier hierarchy always puts CPU nodes into the top tier. But on a system with HBM or GPU devices, the memory-only NUMA nodes mapping these devices should be in the top tier, and DRAM nodes with CPUs are better to be placed into the next lower tier. With current kernel higher tier node can only be demoted to selected nodes on the next lower tier as defined by the demotion path, not any other node from any lower tier. This strict, hard-coded demotion order does not work in all use cases (e.g. some use cases may want to allow cross-socket demotion to another node in the same demotion tier as a fallback when the preferred demotion node is out of space), This demotion order is also inconsistent with the page allocation fallback order when all the nodes in a higher tier are out of space: The page allocation can fall back to any node from any lower tier, whereas the demotion order doesn't allow that. The current kernel also don't provide any interfaces for the userspace to learn about the memory tier hierarchy in order to optimize its memory allocations. This patch series address the above by defining memory tiers explicitly. This patch introduce explicity memory tiers with ranks. The rank value of a memory tier is used to derive the demotion order between NUMA nodes. The memory tiers present in a system can be found at /sys/devices/system/memtier/memtierN/ The nodes which are part of a specific memory tier can be listed via /sys/devices/system/memtier/memtierN/nodelist "Rank" is an opaque value. Its absolute value doesn't have any special meaning. But the rank values of different memtiers can be compared with each other to determine the memory tier order. For example, if we have 3 memtiers: memtier0, memtier1, memiter2, and their rank values are 300, 200, 100, then the memory tier order is: memtier0 -> memtier2 -> memtier1, where memtier0 is the highest tier and memtier1 is the lowest tier. The rank value of each memtier should be unique. A higher rank memory tier will appear first in the demotion order than a lower rank memory tier. ie. while reclaim we choose a node in higher rank memory tier to demote pages to as compared to a node in a lower rank memory tier. For now we are not adding the dynamic number of memory tiers. But a future series supporting that is possible. Currently number of tiers supported is limitted to MAX_MEMORY_TIERS(3). When doing memory hotplug, if not added to a memory tier, the NUMA node gets added to DEFAULT_MEMORY_TIER(1). This patch is based on the proposal sent by Wei Xu at [1]. [1] https://lore.kernel.org/linux-mm/CAAPL-u9Wv+nH1VOZTj=9p9S70Y3Qz3+63EkqncRDdHfubsrjfw@mail.gmail.com Suggested-by: Wei Xu Signed-off-by: Jagdish Gediya Signed-off-by: Aneesh Kumar K.V --- include/linux/memory-tiers.h | 20 ++++ mm/Kconfig | 11 ++ mm/Makefile | 1 + mm/memory-tiers.c | 188 +++++++++++++++++++++++++++++++++++ 4 files changed, 220 insertions(+) create mode 100644 include/linux/memory-tiers.h create mode 100644 mm/memory-tiers.c diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h new file mode 100644 index 000000000000..e17f6b4ee177 --- /dev/null +++ b/include/linux/memory-tiers.h @@ -0,0 +1,20 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +#ifndef _LINUX_MEMORY_TIERS_H +#define _LINUX_MEMORY_TIERS_H + +#ifdef CONFIG_TIERED_MEMORY + +#define MEMORY_TIER_HBM_GPU 0 +#define MEMORY_TIER_DRAM 1 +#define MEMORY_TIER_PMEM 2 + +#define MEMORY_RANK_HBM_GPU 300 +#define MEMORY_RANK_DRAM 200 +#define MEMORY_RANK_PMEM 100 + +#define DEFAULT_MEMORY_TIER MEMORY_TIER_DRAM +#define MAX_MEMORY_TIERS 3 + +#endif /* CONFIG_TIERED_MEMORY */ + +#endif diff --git a/mm/Kconfig b/mm/Kconfig index 169e64192e48..08a3d330740b 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -614,6 +614,17 @@ config ARCH_ENABLE_HUGEPAGE_MIGRATION config ARCH_ENABLE_THP_MIGRATION bool +config TIERED_MEMORY + bool "Support for explicit memory tiers" + def_bool n + depends on MIGRATION && NUMA + help + Support to split nodes into memory tiers explicitly and + to demote pages on reclaim to lower tiers. This option + also exposes sysfs interface to read nodes available in + specific tier and to move specific node among different + possible tiers. + config HUGETLB_PAGE_SIZE_VARIABLE def_bool n help diff --git a/mm/Makefile b/mm/Makefile index 6f9ffa968a1a..482557fbc9d1 100644 --- a/mm/Makefile +++ b/mm/Makefile @@ -92,6 +92,7 @@ obj-$(CONFIG_KFENCE) += kfence/ obj-$(CONFIG_FAILSLAB) += failslab.o obj-$(CONFIG_MEMTEST) += memtest.o obj-$(CONFIG_MIGRATION) += migrate.o +obj-$(CONFIG_TIERED_MEMORY) += memory-tiers.o obj-$(CONFIG_DEVICE_MIGRATION) += migrate_device.o obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o khugepaged.o obj-$(CONFIG_PAGE_COUNTER) += page_counter.o diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c new file mode 100644 index 000000000000..7de18d94a08d --- /dev/null +++ b/mm/memory-tiers.c @@ -0,0 +1,188 @@ +// SPDX-License-Identifier: GPL-2.0 +#include +#include +#include +#include +#include + +struct memory_tier { + struct list_head list; + struct device dev; + nodemask_t nodelist; + int rank; +}; + +#define to_memory_tier(device) container_of(device, struct memory_tier, dev) + +static struct bus_type memory_tier_subsys = { + .name = "memtier", + .dev_name = "memtier", +}; + +static DEFINE_MUTEX(memory_tier_lock); +static LIST_HEAD(memory_tiers); + + +static ssize_t nodelist_show(struct device *dev, + struct device_attribute *attr, char *buf) +{ + struct memory_tier *memtier = to_memory_tier(dev); + + return sysfs_emit(buf, "%*pbl\n", + nodemask_pr_args(&memtier->nodelist)); +} +static DEVICE_ATTR_RO(nodelist); + +static ssize_t rank_show(struct device *dev, + struct device_attribute *attr, char *buf) +{ + struct memory_tier *memtier = to_memory_tier(dev); + + return sysfs_emit(buf, "%d\n", memtier->rank); +} +static DEVICE_ATTR_RO(rank); + +static struct attribute *memory_tier_dev_attrs[] = { + &dev_attr_nodelist.attr, + &dev_attr_rank.attr, + NULL +}; + +static const struct attribute_group memory_tier_dev_group = { + .attrs = memory_tier_dev_attrs, +}; + +static const struct attribute_group *memory_tier_dev_groups[] = { + &memory_tier_dev_group, + NULL +}; + +static void memory_tier_device_release(struct device *dev) +{ + struct memory_tier *tier = to_memory_tier(dev); + + kfree(tier); +} + +/* + * Keep it simple by having direct mapping between + * tier index and rank value. + */ +static inline int get_rank_from_tier(unsigned int tier) +{ + switch (tier) { + case MEMORY_TIER_HBM_GPU: + return MEMORY_RANK_HBM_GPU; + case MEMORY_TIER_DRAM: + return MEMORY_RANK_DRAM; + case MEMORY_TIER_PMEM: + return MEMORY_RANK_PMEM; + } + + return 0; +} + +static void insert_memory_tier(struct memory_tier *memtier) +{ + struct list_head *ent; + struct memory_tier *tmp_memtier; + + list_for_each(ent, &memory_tiers) { + tmp_memtier = list_entry(ent, struct memory_tier, list); + if (tmp_memtier->rank < memtier->rank) { + list_add_tail(&memtier->list, ent); + return; + } + } + list_add_tail(&memtier->list, &memory_tiers); +} + +static struct memory_tier *register_memory_tier(unsigned int tier) +{ + int error; + struct memory_tier *memtier; + + if (tier >= MAX_MEMORY_TIERS) + return NULL; + + memtier = kzalloc(sizeof(struct memory_tier), GFP_KERNEL); + if (!memtier) + return NULL; + + memtier->dev.id = tier; + memtier->rank = get_rank_from_tier(tier); + memtier->dev.bus = &memory_tier_subsys; + memtier->dev.release = memory_tier_device_release; + memtier->dev.groups = memory_tier_dev_groups; + + insert_memory_tier(memtier); + + error = device_register(&memtier->dev); + if (error) { + list_del(&memtier->list); + put_device(&memtier->dev); + return NULL; + } + return memtier; +} + +__maybe_unused // temporay to prevent warnings during bisects +static void unregister_memory_tier(struct memory_tier *memtier) +{ + list_del(&memtier->list); + device_unregister(&memtier->dev); +} + +static ssize_t +max_tier_show(struct device *dev, struct device_attribute *attr, char *buf) +{ + return sysfs_emit(buf, "%d\n", MAX_MEMORY_TIERS); +} +static DEVICE_ATTR_RO(max_tier); + +static ssize_t +default_tier_show(struct device *dev, struct device_attribute *attr, char *buf) +{ + return sysfs_emit(buf, "memtier%d\n", DEFAULT_MEMORY_TIER); +} +static DEVICE_ATTR_RO(default_tier); + +static struct attribute *memory_tier_attrs[] = { + &dev_attr_max_tier.attr, + &dev_attr_default_tier.attr, + NULL +}; + +static const struct attribute_group memory_tier_attr_group = { + .attrs = memory_tier_attrs, +}; + +static const struct attribute_group *memory_tier_attr_groups[] = { + &memory_tier_attr_group, + NULL, +}; + +static int __init memory_tier_init(void) +{ + int ret; + struct memory_tier *memtier; + + ret = subsys_system_register(&memory_tier_subsys, memory_tier_attr_groups); + if (ret) + panic("%s() failed to register subsystem: %d\n", __func__, ret); + + /* + * Register only default memory tier to hide all empty + * memory tier from sysfs. + */ + memtier = register_memory_tier(DEFAULT_MEMORY_TIER); + if (!memtier) + panic("%s() failed to register memory tier: %d\n", __func__, ret); + + /* CPU only nodes are not part of memory tiers. */ + memtier->nodelist = node_states[N_MEMORY]; + + return 0; +} +subsys_initcall(memory_tier_init); + From patchwork Fri Jun 3 13:42:30 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Aneesh Kumar K.V" X-Patchwork-Id: 12869079 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 6D6E4C43334 for ; Fri, 3 Jun 2022 13:43:49 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id F0F438D0005; Fri, 3 Jun 2022 09:43:48 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id EC2228D0001; Fri, 3 Jun 2022 09:43:48 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id D125E8D0005; Fri, 3 Jun 2022 09:43:48 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0017.hostedemail.com [216.40.44.17]) by kanga.kvack.org (Postfix) with ESMTP id C24BB8D0001 for ; Fri, 3 Jun 2022 09:43:48 -0400 (EDT) Received: from smtpin12.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay12.hostedemail.com (Postfix) with ESMTP id 8195112145B for ; Fri, 3 Jun 2022 13:43:48 +0000 (UTC) X-FDA: 79537042536.12.31D0EBF Received: from mx0a-001b2d01.pphosted.com (mx0b-001b2d01.pphosted.com [148.163.158.5]) by imf09.hostedemail.com (Postfix) with ESMTP id 163DF14006E for ; Fri, 3 Jun 2022 13:43:31 +0000 (UTC) Received: from pps.filterd (m0098414.ppops.net [127.0.0.1]) by mx0b-001b2d01.pphosted.com (8.17.1.5/8.17.1.5) with ESMTP id 253DUkP8016766; Fri, 3 Jun 2022 13:43:36 GMT DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=ibm.com; h=from : to : cc : subject : date : message-id : in-reply-to : references : mime-version : content-transfer-encoding; s=pp1; bh=K1VdzzCOxCgpv74NdzVcO4YxHjbdKq+GNt2rMe64fVs=; b=EmGsS21hZwH1jPIIYT7eBSKfIkOvpy/S/1/Y8s6hZUYqmgds/pmV44LbXZWnbWaVKLvE M1ZKxuV/qRpb3hVbZb/KIGzzCRy8cT00hDPsftf9NPZpEPRel9SHjdGdnEYCdREaetDc +1W45wLwozoie8/t++XBSFKITl6jbsVlit+HJ19knad1Mga3xtg45lTqJazICpAlHBOX 1u9pplOM3sjWCCx1PczCyBB9fy1JVh2pFz+X3FF046jYJ4HvFlkwY1qCQvFzGNdWUs9o 3n1sNjOs6GoODPIinVezuBPG62E4Rtgb0UcDX21CsjZak7aYwki6Bajp14c8tf5+y4Xi LQ== Received: from pps.reinject (localhost [127.0.0.1]) by mx0b-001b2d01.pphosted.com (PPS) with ESMTPS id 3gfk8h08wm-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Fri, 03 Jun 2022 13:43:35 +0000 Received: from m0098414.ppops.net (m0098414.ppops.net [127.0.0.1]) by pps.reinject (8.17.1.5/8.17.1.5) with ESMTP id 253DUpN3016953; Fri, 3 Jun 2022 13:43:35 GMT Received: from ppma01dal.us.ibm.com (83.d6.3fa9.ip4.static.sl-reverse.com [169.63.214.131]) by mx0b-001b2d01.pphosted.com (PPS) with ESMTPS id 3gfk8h08wf-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Fri, 03 Jun 2022 13:43:35 +0000 Received: from pps.filterd (ppma01dal.us.ibm.com [127.0.0.1]) by ppma01dal.us.ibm.com (8.16.1.2/8.16.1.2) with SMTP id 253Da8S9005529; Fri, 3 Jun 2022 13:43:34 GMT Received: from b01cxnp22033.gho.pok.ibm.com (b01cxnp22033.gho.pok.ibm.com [9.57.198.23]) by ppma01dal.us.ibm.com with ESMTP id 3gcxt62ygh-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Fri, 03 Jun 2022 13:43:34 +0000 Received: from b01ledav004.gho.pok.ibm.com (b01ledav004.gho.pok.ibm.com [9.57.199.109]) by b01cxnp22033.gho.pok.ibm.com (8.14.9/8.14.9/NCO v10.0) with ESMTP id 253DhXAi31850762 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-GCM-SHA384 bits=256 verify=OK); Fri, 3 Jun 2022 13:43:33 GMT Received: from b01ledav004.gho.pok.ibm.com (unknown [127.0.0.1]) by IMSVA (Postfix) with ESMTP id F0BBE112064; Fri, 3 Jun 2022 13:43:32 +0000 (GMT) Received: from b01ledav004.gho.pok.ibm.com (unknown [127.0.0.1]) by IMSVA (Postfix) with ESMTP id 4B677112063; Fri, 3 Jun 2022 13:43:25 +0000 (GMT) Received: from skywalker.ibmuc.com (unknown [9.43.93.173]) by b01ledav004.gho.pok.ibm.com (Postfix) with ESMTP; Fri, 3 Jun 2022 13:43:24 +0000 (GMT) From: "Aneesh Kumar K.V" To: linux-mm@kvack.org, akpm@linux-foundation.org Cc: Wei Xu , Huang Ying , Greg Thelen , Yang Shi , Davidlohr Bueso , Tim C Chen , Brice Goglin , Michal Hocko , Linux Kernel Mailing List , Hesham Almatary , Dave Hansen , Jonathan Cameron , Alistair Popple , Dan Williams , Feng Tang , Jagdish Gediya , Baolin Wang , David Rientjes , "Aneesh Kumar K.V" Subject: [PATCH v5 2/9] mm/demotion: Expose per node memory tier to sysfs Date: Fri, 3 Jun 2022 19:12:30 +0530 Message-Id: <20220603134237.131362-3-aneesh.kumar@linux.ibm.com> X-Mailer: git-send-email 2.36.1 In-Reply-To: <20220603134237.131362-1-aneesh.kumar@linux.ibm.com> References: <20220603134237.131362-1-aneesh.kumar@linux.ibm.com> MIME-Version: 1.0 X-TM-AS-GCONF: 00 X-Proofpoint-GUID: w-kEa5wmFAHsGWYIrUkC23cIWnbXumAg X-Proofpoint-ORIG-GUID: z_canjlGlQrqFBIESd2oOLNFbxerRxsS X-Proofpoint-Virus-Version: vendor=baseguard engine=ICAP:2.0.205,Aquarius:18.0.874,Hydra:6.0.517,FMLib:17.11.64.514 definitions=2022-06-03_04,2022-06-03_01,2022-02-23_01 X-Proofpoint-Spam-Details: rule=outbound_notspam policy=outbound score=0 mlxscore=0 priorityscore=1501 suspectscore=0 bulkscore=0 adultscore=0 clxscore=1015 phishscore=0 spamscore=0 mlxlogscore=999 malwarescore=0 impostorscore=0 lowpriorityscore=0 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.12.0-2204290000 definitions=main-2206030059 X-Rspamd-Server: rspam06 X-Rspamd-Queue-Id: 163DF14006E X-Stat-Signature: irhcufxrt51gdhyhjbdij3phyu748jh5 X-Rspam-User: Authentication-Results: imf09.hostedemail.com; dkim=pass header.d=ibm.com header.s=pp1 header.b=EmGsS21h; spf=pass (imf09.hostedemail.com: domain of aneesh.kumar@linux.ibm.com designates 148.163.158.5 as permitted sender) smtp.mailfrom=aneesh.kumar@linux.ibm.com; dmarc=pass (policy=none) header.from=ibm.com X-HE-Tag: 1654263811-468809 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: Add support to modify the memory tier for a NUMA node. /sys/devices/system/node/nodeN/memtier where N = node id When read, It list the memory tier that the node belongs to. When written, the kernel moves the node into the specified memory tier, the tier assignment of all other nodes are not affected. If the memory tier does not exist, writing to the above file creates the tier and assign the NUMA node to that tier. mutex memory_tier_lock is introduced to protect memory tier related chanegs as it can happen from sysfs as well on hot plug events. Suggested-by: Wei Xu Signed-off-by: Jagdish Gediya Signed-off-by: Aneesh Kumar K.V --- drivers/base/node.c | 39 +++++++++++ include/linux/memory-tiers.h | 3 + mm/memory-tiers.c | 123 ++++++++++++++++++++++++++++++++++- 3 files changed, 164 insertions(+), 1 deletion(-) diff --git a/drivers/base/node.c b/drivers/base/node.c index 0ac6376ef7a1..599ed64d910f 100644 --- a/drivers/base/node.c +++ b/drivers/base/node.c @@ -20,6 +20,7 @@ #include #include #include +#include static struct bus_type node_subsys = { .name = "node", @@ -560,11 +561,49 @@ static ssize_t node_read_distance(struct device *dev, } static DEVICE_ATTR(distance, 0444, node_read_distance, NULL); +#ifdef CONFIG_TIERED_MEMORY +static ssize_t memtier_show(struct device *dev, + struct device_attribute *attr, + char *buf) +{ + int node = dev->id; + int tier_index = node_get_memory_tier_id(node); + + if (tier_index != -1) + return sysfs_emit(buf, "%d\n", tier_index); + return 0; +} + +static ssize_t memtier_store(struct device *dev, + struct device_attribute *attr, + const char *buf, size_t count) +{ + unsigned long tier; + int node = dev->id; + int ret; + + ret = kstrtoul(buf, 10, &tier); + if (ret) + return ret; + + ret = node_reset_memory_tier(node, tier); + if (ret) + return ret; + + return count; +} + +static DEVICE_ATTR_RW(memtier); +#endif + static struct attribute *node_dev_attrs[] = { &dev_attr_meminfo.attr, &dev_attr_numastat.attr, &dev_attr_distance.attr, &dev_attr_vmstat.attr, +#ifdef CONFIG_TIERED_MEMORY + &dev_attr_memtier.attr, +#endif NULL }; diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h index e17f6b4ee177..91f071804476 100644 --- a/include/linux/memory-tiers.h +++ b/include/linux/memory-tiers.h @@ -15,6 +15,9 @@ #define DEFAULT_MEMORY_TIER MEMORY_TIER_DRAM #define MAX_MEMORY_TIERS 3 +int node_get_memory_tier_id(int node); +int node_set_memory_tier(int node, int tier); +int node_reset_memory_tier(int node, int tier); #endif /* CONFIG_TIERED_MEMORY */ #endif diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c index 7de18d94a08d..9c78c47ad030 100644 --- a/mm/memory-tiers.c +++ b/mm/memory-tiers.c @@ -126,7 +126,6 @@ static struct memory_tier *register_memory_tier(unsigned int tier) return memtier; } -__maybe_unused // temporay to prevent warnings during bisects static void unregister_memory_tier(struct memory_tier *memtier) { list_del(&memtier->list); @@ -162,6 +161,128 @@ static const struct attribute_group *memory_tier_attr_groups[] = { NULL, }; +static struct memory_tier *__node_get_memory_tier(int node) +{ + struct memory_tier *memtier; + + list_for_each_entry(memtier, &memory_tiers, list) { + if (node_isset(node, memtier->nodelist)) + return memtier; + } + return NULL; +} + +static struct memory_tier *__get_memory_tier_from_id(int id) +{ + struct memory_tier *memtier; + + list_for_each_entry(memtier, &memory_tiers, list) { + if (memtier->dev.id == id) + return memtier; + } + return NULL; +} + +__maybe_unused // temporay to prevent warnings during bisects +static void node_remove_from_memory_tier(int node) +{ + struct memory_tier *memtier; + + mutex_lock(&memory_tier_lock); + + memtier = __node_get_memory_tier(node); + if (!memtier) + goto out; + /* + * Remove node from tier, if tier becomes + * empty then unregister it to make it invisible + * in sysfs. + */ + node_clear(node, memtier->nodelist); + if (nodes_empty(memtier->nodelist)) + unregister_memory_tier(memtier); + +out: + mutex_unlock(&memory_tier_lock); +} + +int node_get_memory_tier_id(int node) +{ + int tier = -1; + struct memory_tier *memtier; + /* + * Make sure memory tier is not unregistered + * while it is being read. + */ + mutex_lock(&memory_tier_lock); + memtier = __node_get_memory_tier(node); + if (memtier) + tier = memtier->dev.id; + mutex_unlock(&memory_tier_lock); + + return tier; +} + +static int __node_set_memory_tier(int node, int tier) +{ + int ret = 0; + struct memory_tier *memtier; + + memtier = __get_memory_tier_from_id(tier); + if (!memtier) { + memtier = register_memory_tier(tier); + if (!memtier) { + ret = -EINVAL; + goto out; + } + } + node_set(node, memtier->nodelist); +out: + return ret; +} + +int node_reset_memory_tier(int node, int tier) +{ + struct memory_tier *current_tier; + int ret = 0; + + mutex_lock(&memory_tier_lock); + + current_tier = __node_get_memory_tier(node); + if (!current_tier || current_tier->dev.id == tier) + goto out; + + node_clear(node, current_tier->nodelist); + + ret = __node_set_memory_tier(node, tier); + if (ret) { + /* reset it back to older tier */ + node_set(node, current_tier->nodelist); + goto out; + } + + if (nodes_empty(current_tier->nodelist)) + unregister_memory_tier(current_tier); +out: + mutex_unlock(&memory_tier_lock); + + return ret; +} + +int node_set_memory_tier(int node, int tier) +{ + struct memory_tier *memtier; + int ret = 0; + + mutex_lock(&memory_tier_lock); + memtier = __node_get_memory_tier(node); + if (!memtier) + ret = __node_set_memory_tier(node, tier); + mutex_unlock(&memory_tier_lock); + + return ret; +} + static int __init memory_tier_init(void) { int ret; From patchwork Fri Jun 3 13:42:31 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Aneesh Kumar K.V" X-Patchwork-Id: 12869080 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id A3A5AC433EF for ; Fri, 3 Jun 2022 13:43:59 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 3DA3A8D0006; Fri, 3 Jun 2022 09:43:59 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id 38BBE8D0001; Fri, 3 Jun 2022 09:43:59 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 227D98D0006; Fri, 3 Jun 2022 09:43:59 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0016.hostedemail.com [216.40.44.16]) by kanga.kvack.org (Postfix) with ESMTP id 16B548D0001 for ; Fri, 3 Jun 2022 09:43:59 -0400 (EDT) Received: from smtpin14.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay06.hostedemail.com (Postfix) with ESMTP id DF388340FA for ; Fri, 3 Jun 2022 13:43:58 +0000 (UTC) X-FDA: 79537042956.14.2DF3B29 Received: from mx0a-001b2d01.pphosted.com (mx0a-001b2d01.pphosted.com [148.163.156.1]) by imf03.hostedemail.com (Postfix) with ESMTP id D299D20055 for ; Fri, 3 Jun 2022 13:43:42 +0000 (UTC) Received: from pps.filterd (m0098404.ppops.net [127.0.0.1]) by mx0a-001b2d01.pphosted.com (8.17.1.5/8.17.1.5) with ESMTP id 253DGraV021315; Fri, 3 Jun 2022 13:43:45 GMT DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=ibm.com; h=from : to : cc : subject : date : message-id : in-reply-to : references : mime-version : content-transfer-encoding; s=pp1; bh=dkro1rlH6gMB3Xr+XLl+CbwUIq9jbXlSu+phhuXyiLc=; b=mCDFpIEqkwpPOApw8gqo24L6mofZWjRpvFYh5v7S8kyBc+4Ac6WHApXZGzqKfL0SpxFW Kav6uSngbNvlwT+mnCgq3HyYqVtEHt5zXP1J8tam0JHI4GPw9o8WC+/xgAlY8QwReuMh 5tTal/URd1E/oW094+D9UG4yXM/s9505vpgrvN3jVwmvp5K3dcXvXgTy1xQVBjA99In1 fZAang3aDo11Mm52LY+v9PtAHIknSNR+j77BC3GqNK6OCdzTeleYMv4WY2hbgTaJauli kYXZlT0Il0DaWXqOLTIsUP2iLEyXCv+1Rh2Ql9rUDU0zoY572c7YsIvfQQanYCmbnUVJ bQ== Received: from pps.reinject (localhost [127.0.0.1]) by mx0a-001b2d01.pphosted.com (PPS) with ESMTPS id 3gfk1x8ebq-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Fri, 03 Jun 2022 13:43:44 +0000 Received: from m0098404.ppops.net (m0098404.ppops.net [127.0.0.1]) by pps.reinject (8.17.1.5/8.17.1.5) with ESMTP id 253DSUHJ031862; Fri, 3 Jun 2022 13:43:44 GMT Received: from ppma01dal.us.ibm.com (83.d6.3fa9.ip4.static.sl-reverse.com [169.63.214.131]) by mx0a-001b2d01.pphosted.com (PPS) with ESMTPS id 3gfk1x8ebf-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Fri, 03 Jun 2022 13:43:44 +0000 Received: from pps.filterd (ppma01dal.us.ibm.com [127.0.0.1]) by ppma01dal.us.ibm.com (8.16.1.2/8.16.1.2) with SMTP id 253Da8SF005529; Fri, 3 Jun 2022 13:43:43 GMT Received: from b01cxnp23032.gho.pok.ibm.com (b01cxnp23032.gho.pok.ibm.com [9.57.198.27]) by ppma01dal.us.ibm.com with ESMTP id 3gcxt62yhd-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Fri, 03 Jun 2022 13:43:43 +0000 Received: from b01ledav004.gho.pok.ibm.com (b01ledav004.gho.pok.ibm.com [9.57.199.109]) by b01cxnp23032.gho.pok.ibm.com (8.14.9/8.14.9/NCO v10.0) with ESMTP id 253Dhfgq29622554 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-GCM-SHA384 bits=256 verify=OK); Fri, 3 Jun 2022 13:43:42 GMT Received: from b01ledav004.gho.pok.ibm.com (unknown [127.0.0.1]) by IMSVA (Postfix) with ESMTP id DB61F112067; Fri, 3 Jun 2022 13:43:41 +0000 (GMT) Received: from b01ledav004.gho.pok.ibm.com (unknown [127.0.0.1]) by IMSVA (Postfix) with ESMTP id AABAC112061; Fri, 3 Jun 2022 13:43:33 +0000 (GMT) Received: from skywalker.ibmuc.com (unknown [9.43.93.173]) by b01ledav004.gho.pok.ibm.com (Postfix) with ESMTP; Fri, 3 Jun 2022 13:43:33 +0000 (GMT) From: "Aneesh Kumar K.V" To: linux-mm@kvack.org, akpm@linux-foundation.org Cc: Wei Xu , Huang Ying , Greg Thelen , Yang Shi , Davidlohr Bueso , Tim C Chen , Brice Goglin , Michal Hocko , Linux Kernel Mailing List , Hesham Almatary , Dave Hansen , Jonathan Cameron , Alistair Popple , Dan Williams , Feng Tang , Jagdish Gediya , Baolin Wang , David Rientjes , "Aneesh Kumar K.V" Subject: [PATCH v5 3/9] mm/demotion: Move memory demotion related code Date: Fri, 3 Jun 2022 19:12:31 +0530 Message-Id: <20220603134237.131362-4-aneesh.kumar@linux.ibm.com> X-Mailer: git-send-email 2.36.1 In-Reply-To: <20220603134237.131362-1-aneesh.kumar@linux.ibm.com> References: <20220603134237.131362-1-aneesh.kumar@linux.ibm.com> MIME-Version: 1.0 X-TM-AS-GCONF: 00 X-Proofpoint-GUID: eRc9AmmPVq1txSjdGWhwjIaTHonnXf6D X-Proofpoint-ORIG-GUID: QEXIDvhhB8M8RVlFClW7P3w6W6tuy8s7 X-Proofpoint-Virus-Version: vendor=baseguard engine=ICAP:2.0.205,Aquarius:18.0.874,Hydra:6.0.517,FMLib:17.11.64.514 definitions=2022-06-03_04,2022-06-03_01,2022-02-23_01 X-Proofpoint-Spam-Details: rule=outbound_notspam policy=outbound score=0 spamscore=0 bulkscore=0 mlxscore=0 malwarescore=0 adultscore=0 priorityscore=1501 clxscore=1015 suspectscore=0 impostorscore=0 mlxlogscore=999 lowpriorityscore=0 phishscore=0 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.12.0-2204290000 definitions=main-2206030059 X-Rspamd-Queue-Id: D299D20055 X-Stat-Signature: tkocedwu1fgnxf5p9g9pyh96g49uq54w X-Rspam-User: Authentication-Results: imf03.hostedemail.com; dkim=pass header.d=ibm.com header.s=pp1 header.b=mCDFpIEq; spf=pass (imf03.hostedemail.com: domain of aneesh.kumar@linux.ibm.com designates 148.163.156.1 as permitted sender) smtp.mailfrom=aneesh.kumar@linux.ibm.com; dmarc=pass (policy=none) header.from=ibm.com X-Rspamd-Server: rspam08 X-HE-Tag: 1654263822-631266 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: This move memory demotion related code to mm/demotion.c. No functional change in this patch. Signed-off-by: Aneesh Kumar K.V --- include/linux/memory-tiers.h | 4 +++ include/linux/migrate.h | 2 -- mm/memory-tiers.c | 60 ++++++++++++++++++++++++++++++++++++ mm/migrate.c | 60 +----------------------------------- mm/vmscan.c | 1 + 5 files changed, 66 insertions(+), 61 deletions(-) diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h index 91f071804476..33ef36395a20 100644 --- a/include/linux/memory-tiers.h +++ b/include/linux/memory-tiers.h @@ -15,9 +15,13 @@ #define DEFAULT_MEMORY_TIER MEMORY_TIER_DRAM #define MAX_MEMORY_TIERS 3 +extern bool numa_demotion_enabled; int node_get_memory_tier_id(int node); int node_set_memory_tier(int node, int tier); int node_reset_memory_tier(int node, int tier); +#else +#define numa_demotion_enabled false + #endif /* CONFIG_TIERED_MEMORY */ #endif diff --git a/include/linux/migrate.h b/include/linux/migrate.h index 069a89e847f3..43e737215f33 100644 --- a/include/linux/migrate.h +++ b/include/linux/migrate.h @@ -78,7 +78,6 @@ static inline int migrate_huge_page_move_mapping(struct address_space *mapping, #if defined(CONFIG_MIGRATION) && defined(CONFIG_NUMA) extern void set_migration_target_nodes(void); extern void migrate_on_reclaim_init(void); -extern bool numa_demotion_enabled; extern int next_demotion_node(int node); #else static inline void set_migration_target_nodes(void) {} @@ -87,7 +86,6 @@ static inline int next_demotion_node(int node) { return NUMA_NO_NODE; } -#define numa_demotion_enabled false #endif #ifdef CONFIG_COMPACTION diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c index 9c78c47ad030..3f382d1f844a 100644 --- a/mm/memory-tiers.c +++ b/mm/memory-tiers.c @@ -307,3 +307,63 @@ static int __init memory_tier_init(void) } subsys_initcall(memory_tier_init); +bool numa_demotion_enabled = false; + +#ifdef CONFIG_SYSFS +static ssize_t numa_demotion_enabled_show(struct kobject *kobj, + struct kobj_attribute *attr, char *buf) +{ + return sysfs_emit(buf, "%s\n", + numa_demotion_enabled ? "true" : "false"); +} + +static ssize_t numa_demotion_enabled_store(struct kobject *kobj, + struct kobj_attribute *attr, + const char *buf, size_t count) +{ + ssize_t ret; + + ret = kstrtobool(buf, &numa_demotion_enabled); + if (ret) + return ret; + + return count; +} + +static struct kobj_attribute numa_demotion_enabled_attr = + __ATTR(demotion_enabled, 0644, numa_demotion_enabled_show, + numa_demotion_enabled_store); + +static struct attribute *numa_attrs[] = { + &numa_demotion_enabled_attr.attr, + NULL, +}; + +static const struct attribute_group numa_attr_group = { + .attrs = numa_attrs, +}; + +static int __init numa_init_sysfs(void) +{ + int err; + struct kobject *numa_kobj; + + numa_kobj = kobject_create_and_add("numa", mm_kobj); + if (!numa_kobj) { + pr_err("failed to create numa kobject\n"); + return -ENOMEM; + } + err = sysfs_create_group(numa_kobj, &numa_attr_group); + if (err) { + pr_err("failed to register numa group\n"); + goto delete_obj; + } + return 0; + +delete_obj: + kobject_put(numa_kobj); + return err; +} +subsys_initcall(numa_init_sysfs); +#endif /* CONFIG_SYSFS */ + diff --git a/mm/migrate.c b/mm/migrate.c index e51588e95f57..29cacc217e38 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -2508,64 +2508,6 @@ void __init migrate_on_reclaim_init(void) set_migration_target_nodes(); cpus_read_unlock(); } +#endif /* CONFIG_NUMA */ -bool numa_demotion_enabled = false; - -#ifdef CONFIG_SYSFS -static ssize_t numa_demotion_enabled_show(struct kobject *kobj, - struct kobj_attribute *attr, char *buf) -{ - return sysfs_emit(buf, "%s\n", - numa_demotion_enabled ? "true" : "false"); -} - -static ssize_t numa_demotion_enabled_store(struct kobject *kobj, - struct kobj_attribute *attr, - const char *buf, size_t count) -{ - ssize_t ret; - - ret = kstrtobool(buf, &numa_demotion_enabled); - if (ret) - return ret; - - return count; -} - -static struct kobj_attribute numa_demotion_enabled_attr = - __ATTR(demotion_enabled, 0644, numa_demotion_enabled_show, - numa_demotion_enabled_store); - -static struct attribute *numa_attrs[] = { - &numa_demotion_enabled_attr.attr, - NULL, -}; - -static const struct attribute_group numa_attr_group = { - .attrs = numa_attrs, -}; - -static int __init numa_init_sysfs(void) -{ - int err; - struct kobject *numa_kobj; - numa_kobj = kobject_create_and_add("numa", mm_kobj); - if (!numa_kobj) { - pr_err("failed to create numa kobject\n"); - return -ENOMEM; - } - err = sysfs_create_group(numa_kobj, &numa_attr_group); - if (err) { - pr_err("failed to register numa group\n"); - goto delete_obj; - } - return 0; - -delete_obj: - kobject_put(numa_kobj); - return err; -} -subsys_initcall(numa_init_sysfs); -#endif /* CONFIG_SYSFS */ -#endif /* CONFIG_NUMA */ diff --git a/mm/vmscan.c b/mm/vmscan.c index f7d9a683e3a7..3a8f78277f99 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -50,6 +50,7 @@ #include #include #include +#include #include #include From patchwork Fri Jun 3 13:42:32 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Aneesh Kumar K.V" X-Patchwork-Id: 12869081 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 3A35FC433EF for ; Fri, 3 Jun 2022 13:44:05 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id C05BF8D0007; Fri, 3 Jun 2022 09:44:04 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id BB7738D0001; Fri, 3 Jun 2022 09:44:04 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 9DC898D0007; Fri, 3 Jun 2022 09:44:04 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0013.hostedemail.com [216.40.44.13]) by kanga.kvack.org (Postfix) with ESMTP id 8AC008D0001 for ; Fri, 3 Jun 2022 09:44:04 -0400 (EDT) Received: from smtpin21.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay10.hostedemail.com (Postfix) with ESMTP id 61FE615A3 for ; Fri, 3 Jun 2022 13:44:04 +0000 (UTC) X-FDA: 79537043208.21.6E91F11 Received: from mx0a-001b2d01.pphosted.com (mx0b-001b2d01.pphosted.com [148.163.158.5]) by imf31.hostedemail.com (Postfix) with ESMTP id E55F620083 for ; Fri, 3 Jun 2022 13:43:20 +0000 (UTC) Received: from pps.filterd (m0098413.ppops.net [127.0.0.1]) by mx0b-001b2d01.pphosted.com (8.17.1.5/8.17.1.5) with ESMTP id 253CnSUJ026159; Fri, 3 Jun 2022 13:43:54 GMT DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=ibm.com; h=from : to : cc : subject : date : message-id : in-reply-to : references : mime-version : content-transfer-encoding; s=pp1; bh=zScmpxuF1DrE777qAlX7tuYo2tUSkNGuwWqIMUILFO0=; b=nIqP7qhVHLQocEPcQBsT/qyIiRWu1pRyONkliR5KGjp2xgXCplhiW7vmUmadkKX22h0K mfYxFw90CvZOtqps3G2HSWtL59YcC/nrzgdwNCQ1dAS32TJ99m5zWq+44Asx030extHS f9srGDqHxEr8PO7O1FCKK0//E1YF8n8TlMnX5fMA1Rarskac/WbexXZa/4SrYTgka/sa KDQ44nrKYwNzTwyj8OBg3qpH8cZchU2LCIDSJCZ3x+io8LQua8bdOrpVyftLv4MXew2F eoBRcRsdfeLZ+aykQlBv2bOK+5erUVYX7tsMgZ7e5ght0a+hSTfViINrG1Kw6OrCZtFp MQ== Received: from pps.reinject (localhost [127.0.0.1]) by mx0b-001b2d01.pphosted.com (PPS) with ESMTPS id 3gfjmt125r-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Fri, 03 Jun 2022 13:43:53 +0000 Received: from m0098413.ppops.net (m0098413.ppops.net [127.0.0.1]) by pps.reinject (8.17.1.5/8.17.1.5) with ESMTP id 253DS6ji023059; Fri, 3 Jun 2022 13:43:53 GMT Received: from ppma03dal.us.ibm.com (b.bd.3ea9.ip4.static.sl-reverse.com [169.62.189.11]) by mx0b-001b2d01.pphosted.com (PPS) with ESMTPS id 3gfjmt125f-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Fri, 03 Jun 2022 13:43:52 +0000 Received: from pps.filterd (ppma03dal.us.ibm.com [127.0.0.1]) by ppma03dal.us.ibm.com (8.16.1.2/8.16.1.2) with SMTP id 253DamLO010706; Fri, 3 Jun 2022 13:43:52 GMT Received: from b01cxnp23034.gho.pok.ibm.com (b01cxnp23034.gho.pok.ibm.com [9.57.198.29]) by ppma03dal.us.ibm.com with ESMTP id 3gd3yn8t1t-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Fri, 03 Jun 2022 13:43:51 +0000 Received: from b01ledav004.gho.pok.ibm.com (b01ledav004.gho.pok.ibm.com [9.57.199.109]) by b01cxnp23034.gho.pok.ibm.com (8.14.9/8.14.9/NCO v10.0) with ESMTP id 253Dhogt24052078 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-GCM-SHA384 bits=256 verify=OK); Fri, 3 Jun 2022 13:43:50 GMT Received: from b01ledav004.gho.pok.ibm.com (unknown [127.0.0.1]) by IMSVA (Postfix) with ESMTP id C4AD3112061; Fri, 3 Jun 2022 13:43:50 +0000 (GMT) Received: from b01ledav004.gho.pok.ibm.com (unknown [127.0.0.1]) by IMSVA (Postfix) with ESMTP id ADD85112065; Fri, 3 Jun 2022 13:43:42 +0000 (GMT) Received: from skywalker.ibmuc.com (unknown [9.43.93.173]) by b01ledav004.gho.pok.ibm.com (Postfix) with ESMTP; Fri, 3 Jun 2022 13:43:42 +0000 (GMT) From: "Aneesh Kumar K.V" To: linux-mm@kvack.org, akpm@linux-foundation.org Cc: Wei Xu , Huang Ying , Greg Thelen , Yang Shi , Davidlohr Bueso , Tim C Chen , Brice Goglin , Michal Hocko , Linux Kernel Mailing List , Hesham Almatary , Dave Hansen , Jonathan Cameron , Alistair Popple , Dan Williams , Feng Tang , Jagdish Gediya , Baolin Wang , David Rientjes , "Aneesh Kumar K.V" Subject: [PATCH v5 4/9] mm/demotion: Build demotion targets based on explicit memory tiers Date: Fri, 3 Jun 2022 19:12:32 +0530 Message-Id: <20220603134237.131362-5-aneesh.kumar@linux.ibm.com> X-Mailer: git-send-email 2.36.1 In-Reply-To: <20220603134237.131362-1-aneesh.kumar@linux.ibm.com> References: <20220603134237.131362-1-aneesh.kumar@linux.ibm.com> MIME-Version: 1.0 X-TM-AS-GCONF: 00 X-Proofpoint-GUID: oVYFXimJ6ASD-eWTcUnwQ48HJAhOKJXB X-Proofpoint-ORIG-GUID: A-3LUT8v1U8O_OsyDbAYGhza784l9aOX X-Proofpoint-Virus-Version: vendor=baseguard engine=ICAP:2.0.205,Aquarius:18.0.874,Hydra:6.0.517,FMLib:17.11.64.514 definitions=2022-06-03_04,2022-06-03_01,2022-02-23_01 X-Proofpoint-Spam-Details: rule=outbound_notspam policy=outbound score=0 mlxscore=0 impostorscore=0 phishscore=0 adultscore=0 spamscore=0 suspectscore=0 clxscore=1015 lowpriorityscore=0 mlxlogscore=999 priorityscore=1501 malwarescore=0 bulkscore=0 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.12.0-2204290000 definitions=main-2206030059 X-Stat-Signature: xadarygiibg5ddd939h8ywrfjchcdsfa X-Rspam-User: Authentication-Results: imf31.hostedemail.com; dkim=pass header.d=ibm.com header.s=pp1 header.b=nIqP7qhV; spf=pass (imf31.hostedemail.com: domain of aneesh.kumar@linux.ibm.com designates 148.163.158.5 as permitted sender) smtp.mailfrom=aneesh.kumar@linux.ibm.com; dmarc=pass (policy=none) header.from=ibm.com X-Rspamd-Server: rspam12 X-Rspamd-Queue-Id: E55F620083 X-HE-Tag: 1654263800-248448 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: This patch switch the demotion target building logic to use memory tiers instead of NUMA distance. All N_MEMORY NUMA nodes will be placed in the default tier 1 and additional memory tiers will be added by drivers like dax kmem. This patch builds the demotion target for a NUMA node by looking at all memory tiers below the tier to which the NUMA node belongs. The closest node in the immediately following memory tier is used as a demotion target. Since we are now only building demotion target for N_MEMORY NUMA nodes the CPU hotplug calls are removed in this patch. The rank approach allows us to keep memory tier device IDs stable even if there is a need to change the tier ordering among different memory tiers. e.g. DRAM nodes with CPUs will always be on memtier1, no matter how many tiers are higher or lower than these nodes. A new memory tier can be inserted into the tier hierarchy for a new set of nodes without affecting the node assignment of any existing memtier, provided that there is enough gap in the rank values for the new memtier. The absolute value of "rank" of a memtier doesn't necessarily carry any meaning. Its value relative to other memtiers decides the level of this memtier in the tier hierarchy. For now, This patch supports hardcoded rank values which are 300, 200, & 100 for memory tiers 0,1 & 2 respectively. Below is the sysfs interface to read the rank values of memory tier, /sys/devices/system/memtier/memtierN/rank This interface is read only for now. Write support can be added when there is a need of flexibility of more number of memory tiers(> 3) with flexibile ordering requirement among them. Suggested-by: Wei Xu Signed-off-by: Aneesh Kumar K.V --- include/linux/memory-tiers.h | 5 + include/linux/migrate.h | 13 -- mm/memory-tiers.c | 269 ++++++++++++++++++++++++ mm/migrate.c | 394 ----------------------------------- mm/vmstat.c | 4 - 5 files changed, 274 insertions(+), 411 deletions(-) diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h index 33ef36395a20..adc2cb3bf5f8 100644 --- a/include/linux/memory-tiers.h +++ b/include/linux/memory-tiers.h @@ -16,11 +16,16 @@ #define MAX_MEMORY_TIERS 3 extern bool numa_demotion_enabled; +int next_demotion_node(int node); int node_get_memory_tier_id(int node); int node_set_memory_tier(int node, int tier); int node_reset_memory_tier(int node, int tier); #else #define numa_demotion_enabled false +static inline int next_demotion_node(int node) +{ + return NUMA_NO_NODE; +} #endif /* CONFIG_TIERED_MEMORY */ diff --git a/include/linux/migrate.h b/include/linux/migrate.h index 43e737215f33..93fab62e6548 100644 --- a/include/linux/migrate.h +++ b/include/linux/migrate.h @@ -75,19 +75,6 @@ static inline int migrate_huge_page_move_mapping(struct address_space *mapping, #endif /* CONFIG_MIGRATION */ -#if defined(CONFIG_MIGRATION) && defined(CONFIG_NUMA) -extern void set_migration_target_nodes(void); -extern void migrate_on_reclaim_init(void); -extern int next_demotion_node(int node); -#else -static inline void set_migration_target_nodes(void) {} -static inline void migrate_on_reclaim_init(void) {} -static inline int next_demotion_node(int node) -{ - return NUMA_NO_NODE; -} -#endif - #ifdef CONFIG_COMPACTION extern int PageMovable(struct page *page); extern void __SetPageMovable(struct page *page, struct address_space *mapping); diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c index 3f382d1f844a..0d05c0bfb79b 100644 --- a/mm/memory-tiers.c +++ b/mm/memory-tiers.c @@ -4,6 +4,10 @@ #include #include #include +#include +#include + +#include "internal.h" struct memory_tier { struct list_head list; @@ -12,6 +16,10 @@ struct memory_tier { int rank; }; +struct demotion_nodes { + nodemask_t preferred; +}; + #define to_memory_tier(device) container_of(device, struct memory_tier, dev) static struct bus_type memory_tier_subsys = { @@ -19,9 +27,71 @@ static struct bus_type memory_tier_subsys = { .dev_name = "memtier", }; +static void establish_migration_targets(void); static DEFINE_MUTEX(memory_tier_lock); static LIST_HEAD(memory_tiers); +/* + * node_demotion[] examples: + * + * Example 1: + * + * Node 0 & 1 are CPU + DRAM nodes, node 2 & 3 are PMEM nodes. + * + * node distances: + * node 0 1 2 3 + * 0 10 20 30 40 + * 1 20 10 40 30 + * 2 30 40 10 40 + * 3 40 30 40 10 + * + * memory_tiers[0] = + * memory_tiers[1] = 0-1 + * memory_tiers[2] = 2-3 + * + * node_demotion[0].preferred = 2 + * node_demotion[1].preferred = 3 + * node_demotion[2].preferred = + * node_demotion[3].preferred = + * + * Example 2: + * + * Node 0 & 1 are CPU + DRAM nodes, node 2 is memory-only DRAM node. + * + * node distances: + * node 0 1 2 + * 0 10 20 30 + * 1 20 10 30 + * 2 30 30 10 + * + * memory_tiers[0] = + * memory_tiers[1] = 0-2 + * memory_tiers[2] = + * + * node_demotion[0].preferred = + * node_demotion[1].preferred = + * node_demotion[2].preferred = + * + * Example 3: + * + * Node 0 is CPU + DRAM nodes, Node 1 is HBM node, node 2 is PMEM node. + * + * node distances: + * node 0 1 2 + * 0 10 20 30 + * 1 20 10 40 + * 2 30 40 10 + * + * memory_tiers[0] = 1 + * memory_tiers[1] = 0 + * memory_tiers[2] = 2 + * + * node_demotion[0].preferred = 2 + * node_demotion[1].preferred = 0 + * node_demotion[2].preferred = + * + */ +static struct demotion_nodes *node_demotion __read_mostly; static ssize_t nodelist_show(struct device *dev, struct device_attribute *attr, char *buf) @@ -202,6 +272,7 @@ static void node_remove_from_memory_tier(int node) if (nodes_empty(memtier->nodelist)) unregister_memory_tier(memtier); + establish_migration_targets(); out: mutex_unlock(&memory_tier_lock); } @@ -263,6 +334,8 @@ int node_reset_memory_tier(int node, int tier) if (nodes_empty(current_tier->nodelist)) unregister_memory_tier(current_tier); + + establish_migration_targets(); out: mutex_unlock(&memory_tier_lock); @@ -276,13 +349,208 @@ int node_set_memory_tier(int node, int tier) mutex_lock(&memory_tier_lock); memtier = __node_get_memory_tier(node); + /* + * if node is already part of the tier proceed with the + * current tier value, because we might want to establish + * new migration paths now. The node might be added to a tier + * before it was made part of N_MEMORY, hence estabilish_migration_targets + * will have skipped this node. + */ if (!memtier) ret = __node_set_memory_tier(node, tier); + establish_migration_targets(); + mutex_unlock(&memory_tier_lock); return ret; } +/** + * next_demotion_node() - Get the next node in the demotion path + * @node: The starting node to lookup the next node + * + * Return: node id for next memory node in the demotion path hierarchy + * from @node; NUMA_NO_NODE if @node is terminal. This does not keep + * @node online or guarantee that it *continues* to be the next demotion + * target. + */ +int next_demotion_node(int node) +{ + struct demotion_nodes *nd; + int target, nnodes, i; + + if (!node_demotion) + return NUMA_NO_NODE; + + nd = &node_demotion[node]; + + /* + * node_demotion[] is updated without excluding this + * function from running. + * + * Make sure to use RCU over entire code blocks if + * node_demotion[] reads need to be consistent. + */ + rcu_read_lock(); + + nnodes = nodes_weight(nd->preferred); + if (!nnodes) + return NUMA_NO_NODE; + + /* + * If there are multiple target nodes, just select one + * target node randomly. + * + * In addition, we can also use round-robin to select + * target node, but we should introduce another variable + * for node_demotion[] to record last selected target node, + * that may cause cache ping-pong due to the changing of + * last target node. Or introducing per-cpu data to avoid + * caching issue, which seems more complicated. So selecting + * target node randomly seems better until now. + */ + nnodes = get_random_int() % nnodes; + target = first_node(nd->preferred); + for (i = 0; i < nnodes; i++) + target = next_node(target, nd->preferred); + + rcu_read_unlock(); + + return target; +} + +/* Disable reclaim-based migration. */ +static void __disable_all_migrate_targets(void) +{ + int node; + + for_each_node_mask(node, node_states[N_MEMORY]) + node_demotion[node].preferred = NODE_MASK_NONE; +} + +static void disable_all_migrate_targets(void) +{ + __disable_all_migrate_targets(); + + /* + * Ensure that the "disable" is visible across the system. + * Readers will see either a combination of before+disable + * state or disable+after. They will never see before and + * after state together. + */ + synchronize_rcu(); +} + +/* + * Find an automatic demotion target for all memory + * nodes. Failing here is OK. It might just indicate + * being at the end of a chain. + */ +static void establish_migration_targets(void) +{ + struct memory_tier *memtier; + struct demotion_nodes *nd; + int target = NUMA_NO_NODE, node; + int distance, best_distance; + nodemask_t used; + + if (!node_demotion) + return; + + disable_all_migrate_targets(); + + for_each_node_mask(node, node_states[N_MEMORY]) { + best_distance = -1; + nd = &node_demotion[node]; + + memtier = __node_get_memory_tier(node); + if (!memtier || list_is_last(&memtier->list, &memory_tiers)) + continue; + /* + * Get the next memtier to find the demotion node list. + */ + memtier = list_next_entry(memtier, list); + + /* + * find_next_best_node, use 'used' nodemask as a skip list. + * Add all memory nodes except the selected memory tier + * nodelist to skip list so that we find the best node from the + * memtier nodelist. + */ + nodes_andnot(used, node_states[N_MEMORY], memtier->nodelist); + + /* + * Find all the nodes in the memory tier node list of same best distance. + * add them to the preferred mask. We randomly select between nodes + * in the preferred mask when allocating pages during demotion. + */ + do { + target = find_next_best_node(node, &used); + if (target == NUMA_NO_NODE) + break; + + distance = node_distance(node, target); + if (distance == best_distance || best_distance == -1) { + best_distance = distance; + node_set(target, nd->preferred); + } else { + break; + } + } while (1); + } +} + +/* + * This runs whether reclaim-based migration is enabled or not, + * which ensures that the user can turn reclaim-based migration + * at any time without needing to recalculate migration targets. + */ +static int __meminit migrate_on_reclaim_callback(struct notifier_block *self, + unsigned long action, void *_arg) +{ + struct memory_notify *arg = _arg; + + /* + * Only update the node migration order when a node is + * changing status, like online->offline. + */ + if (arg->status_change_nid < 0) + return notifier_from_errno(0); + + switch (action) { + case MEM_OFFLINE: + /* + * In case we are moving out of N_MEMORY. Keep the node + * in the memory tier so that when we bring memory online, + * they appear in the right memory tier. We still need + * to rebuild the demotion order. + */ + mutex_lock(&memory_tier_lock); + establish_migration_targets(); + mutex_unlock(&memory_tier_lock); + break; + case MEM_ONLINE: + /* + * We ignore the error here, if the node already have the tier + * registered, we will continue to use that for the new memory + * we are adding here. + */ + node_set_memory_tier(arg->status_change_nid, DEFAULT_MEMORY_TIER); + break; + } + + return notifier_from_errno(0); +} + +static void __init migrate_on_reclaim_init(void) +{ + node_demotion = kcalloc(MAX_NUMNODES, sizeof(struct demotion_nodes), + GFP_KERNEL); + WARN_ON(!node_demotion); + + hotplug_memory_notifier(migrate_on_reclaim_callback, 100); +} + static int __init memory_tier_init(void) { int ret; @@ -302,6 +570,7 @@ static int __init memory_tier_init(void) /* CPU only nodes are not part of memory tiers. */ memtier->nodelist = node_states[N_MEMORY]; + migrate_on_reclaim_init(); return 0; } diff --git a/mm/migrate.c b/mm/migrate.c index 29cacc217e38..0b554625a219 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -2116,398 +2116,4 @@ int migrate_misplaced_page(struct page *page, struct vm_area_struct *vma, return 0; } #endif /* CONFIG_NUMA_BALANCING */ - -/* - * node_demotion[] example: - * - * Consider a system with two sockets. Each socket has - * three classes of memory attached: fast, medium and slow. - * Each memory class is placed in its own NUMA node. The - * CPUs are placed in the node with the "fast" memory. The - * 6 NUMA nodes (0-5) might be split among the sockets like - * this: - * - * Socket A: 0, 1, 2 - * Socket B: 3, 4, 5 - * - * When Node 0 fills up, its memory should be migrated to - * Node 1. When Node 1 fills up, it should be migrated to - * Node 2. The migration path start on the nodes with the - * processors (since allocations default to this node) and - * fast memory, progress through medium and end with the - * slow memory: - * - * 0 -> 1 -> 2 -> stop - * 3 -> 4 -> 5 -> stop - * - * This is represented in the node_demotion[] like this: - * - * { nr=1, nodes[0]=1 }, // Node 0 migrates to 1 - * { nr=1, nodes[0]=2 }, // Node 1 migrates to 2 - * { nr=0, nodes[0]=-1 }, // Node 2 does not migrate - * { nr=1, nodes[0]=4 }, // Node 3 migrates to 4 - * { nr=1, nodes[0]=5 }, // Node 4 migrates to 5 - * { nr=0, nodes[0]=-1 }, // Node 5 does not migrate - * - * Moreover some systems may have multiple slow memory nodes. - * Suppose a system has one socket with 3 memory nodes, node 0 - * is fast memory type, and node 1/2 both are slow memory - * type, and the distance between fast memory node and slow - * memory node is same. So the migration path should be: - * - * 0 -> 1/2 -> stop - * - * This is represented in the node_demotion[] like this: - * { nr=2, {nodes[0]=1, nodes[1]=2} }, // Node 0 migrates to node 1 and node 2 - * { nr=0, nodes[0]=-1, }, // Node 1 dose not migrate - * { nr=0, nodes[0]=-1, }, // Node 2 does not migrate - */ - -/* - * Writes to this array occur without locking. Cycles are - * not allowed: Node X demotes to Y which demotes to X... - * - * If multiple reads are performed, a single rcu_read_lock() - * must be held over all reads to ensure that no cycles are - * observed. - */ -#define DEFAULT_DEMOTION_TARGET_NODES 15 - -#if MAX_NUMNODES < DEFAULT_DEMOTION_TARGET_NODES -#define DEMOTION_TARGET_NODES (MAX_NUMNODES - 1) -#else -#define DEMOTION_TARGET_NODES DEFAULT_DEMOTION_TARGET_NODES -#endif - -struct demotion_nodes { - unsigned short nr; - short nodes[DEMOTION_TARGET_NODES]; -}; - -static struct demotion_nodes *node_demotion __read_mostly; - -/** - * next_demotion_node() - Get the next node in the demotion path - * @node: The starting node to lookup the next node - * - * Return: node id for next memory node in the demotion path hierarchy - * from @node; NUMA_NO_NODE if @node is terminal. This does not keep - * @node online or guarantee that it *continues* to be the next demotion - * target. - */ -int next_demotion_node(int node) -{ - struct demotion_nodes *nd; - unsigned short target_nr, index; - int target; - - if (!node_demotion) - return NUMA_NO_NODE; - - nd = &node_demotion[node]; - - /* - * node_demotion[] is updated without excluding this - * function from running. RCU doesn't provide any - * compiler barriers, so the READ_ONCE() is required - * to avoid compiler reordering or read merging. - * - * Make sure to use RCU over entire code blocks if - * node_demotion[] reads need to be consistent. - */ - rcu_read_lock(); - target_nr = READ_ONCE(nd->nr); - - switch (target_nr) { - case 0: - target = NUMA_NO_NODE; - goto out; - case 1: - index = 0; - break; - default: - /* - * If there are multiple target nodes, just select one - * target node randomly. - * - * In addition, we can also use round-robin to select - * target node, but we should introduce another variable - * for node_demotion[] to record last selected target node, - * that may cause cache ping-pong due to the changing of - * last target node. Or introducing per-cpu data to avoid - * caching issue, which seems more complicated. So selecting - * target node randomly seems better until now. - */ - index = get_random_int() % target_nr; - break; - } - - target = READ_ONCE(nd->nodes[index]); - -out: - rcu_read_unlock(); - return target; -} - -/* Disable reclaim-based migration. */ -static void __disable_all_migrate_targets(void) -{ - int node, i; - - if (!node_demotion) - return; - - for_each_online_node(node) { - node_demotion[node].nr = 0; - for (i = 0; i < DEMOTION_TARGET_NODES; i++) - node_demotion[node].nodes[i] = NUMA_NO_NODE; - } -} - -static void disable_all_migrate_targets(void) -{ - __disable_all_migrate_targets(); - - /* - * Ensure that the "disable" is visible across the system. - * Readers will see either a combination of before+disable - * state or disable+after. They will never see before and - * after state together. - * - * The before+after state together might have cycles and - * could cause readers to do things like loop until this - * function finishes. This ensures they can only see a - * single "bad" read and would, for instance, only loop - * once. - */ - synchronize_rcu(); -} - -/* - * Find an automatic demotion target for 'node'. - * Failing here is OK. It might just indicate - * being at the end of a chain. - */ -static int establish_migrate_target(int node, nodemask_t *used, - int best_distance) -{ - int migration_target, index, val; - struct demotion_nodes *nd; - - if (!node_demotion) - return NUMA_NO_NODE; - - nd = &node_demotion[node]; - - migration_target = find_next_best_node(node, used); - if (migration_target == NUMA_NO_NODE) - return NUMA_NO_NODE; - - /* - * If the node has been set a migration target node before, - * which means it's the best distance between them. Still - * check if this node can be demoted to other target nodes - * if they have a same best distance. - */ - if (best_distance != -1) { - val = node_distance(node, migration_target); - if (val > best_distance) - goto out_clear; - } - - index = nd->nr; - if (WARN_ONCE(index >= DEMOTION_TARGET_NODES, - "Exceeds maximum demotion target nodes\n")) - goto out_clear; - - nd->nodes[index] = migration_target; - nd->nr++; - - return migration_target; -out_clear: - node_clear(migration_target, *used); - return NUMA_NO_NODE; -} - -/* - * When memory fills up on a node, memory contents can be - * automatically migrated to another node instead of - * discarded at reclaim. - * - * Establish a "migration path" which will start at nodes - * with CPUs and will follow the priorities used to build the - * page allocator zonelists. - * - * The difference here is that cycles must be avoided. If - * node0 migrates to node1, then neither node1, nor anything - * node1 migrates to can migrate to node0. Also one node can - * be migrated to multiple nodes if the target nodes all have - * a same best-distance against the source node. - * - * This function can run simultaneously with readers of - * node_demotion[]. However, it can not run simultaneously - * with itself. Exclusion is provided by memory hotplug events - * being single-threaded. - */ -static void __set_migration_target_nodes(void) -{ - nodemask_t next_pass; - nodemask_t this_pass; - nodemask_t used_targets = NODE_MASK_NONE; - int node, best_distance; - - /* - * Avoid any oddities like cycles that could occur - * from changes in the topology. This will leave - * a momentary gap when migration is disabled. - */ - disable_all_migrate_targets(); - - /* - * Allocations go close to CPUs, first. Assume that - * the migration path starts at the nodes with CPUs. - */ - next_pass = node_states[N_CPU]; -again: - this_pass = next_pass; - next_pass = NODE_MASK_NONE; - /* - * To avoid cycles in the migration "graph", ensure - * that migration sources are not future targets by - * setting them in 'used_targets'. Do this only - * once per pass so that multiple source nodes can - * share a target node. - * - * 'used_targets' will become unavailable in future - * passes. This limits some opportunities for - * multiple source nodes to share a destination. - */ - nodes_or(used_targets, used_targets, this_pass); - - for_each_node_mask(node, this_pass) { - best_distance = -1; - - /* - * Try to set up the migration path for the node, and the target - * migration nodes can be multiple, so doing a loop to find all - * the target nodes if they all have a best node distance. - */ - do { - int target_node = - establish_migrate_target(node, &used_targets, - best_distance); - - if (target_node == NUMA_NO_NODE) - break; - - if (best_distance == -1) - best_distance = node_distance(node, target_node); - - /* - * Visit targets from this pass in the next pass. - * Eventually, every node will have been part of - * a pass, and will become set in 'used_targets'. - */ - node_set(target_node, next_pass); - } while (1); - } - /* - * 'next_pass' contains nodes which became migration - * targets in this pass. Make additional passes until - * no more migrations targets are available. - */ - if (!nodes_empty(next_pass)) - goto again; -} - -/* - * For callers that do not hold get_online_mems() already. - */ -void set_migration_target_nodes(void) -{ - get_online_mems(); - __set_migration_target_nodes(); - put_online_mems(); -} - -/* - * This leaves migrate-on-reclaim transiently disabled between - * the MEM_GOING_OFFLINE and MEM_OFFLINE events. This runs - * whether reclaim-based migration is enabled or not, which - * ensures that the user can turn reclaim-based migration at - * any time without needing to recalculate migration targets. - * - * These callbacks already hold get_online_mems(). That is why - * __set_migration_target_nodes() can be used as opposed to - * set_migration_target_nodes(). - */ -#ifdef CONFIG_MEMORY_HOTPLUG -static int __meminit migrate_on_reclaim_callback(struct notifier_block *self, - unsigned long action, void *_arg) -{ - struct memory_notify *arg = _arg; - - /* - * Only update the node migration order when a node is - * changing status, like online->offline. This avoids - * the overhead of synchronize_rcu() in most cases. - */ - if (arg->status_change_nid < 0) - return notifier_from_errno(0); - - switch (action) { - case MEM_GOING_OFFLINE: - /* - * Make sure there are not transient states where - * an offline node is a migration target. This - * will leave migration disabled until the offline - * completes and the MEM_OFFLINE case below runs. - */ - disable_all_migrate_targets(); - break; - case MEM_OFFLINE: - case MEM_ONLINE: - /* - * Recalculate the target nodes once the node - * reaches its final state (online or offline). - */ - __set_migration_target_nodes(); - break; - case MEM_CANCEL_OFFLINE: - /* - * MEM_GOING_OFFLINE disabled all the migration - * targets. Reenable them. - */ - __set_migration_target_nodes(); - break; - case MEM_GOING_ONLINE: - case MEM_CANCEL_ONLINE: - break; - } - - return notifier_from_errno(0); -} -#endif - -void __init migrate_on_reclaim_init(void) -{ - node_demotion = kcalloc(nr_node_ids, - sizeof(struct demotion_nodes), - GFP_KERNEL); - WARN_ON(!node_demotion); -#ifdef CONFIG_MEMORY_HOTPLUG - hotplug_memory_notifier(migrate_on_reclaim_callback, 100); -#endif - /* - * At this point, all numa nodes with memory/CPus have their state - * properly set, so we can build the demotion order now. - * Let us hold the cpu_hotplug lock just, as we could possibily have - * CPU hotplug events during boot. - */ - cpus_read_lock(); - set_migration_target_nodes(); - cpus_read_unlock(); -} #endif /* CONFIG_NUMA */ - - diff --git a/mm/vmstat.c b/mm/vmstat.c index da525bfb6f4a..835e3c028f35 100644 --- a/mm/vmstat.c +++ b/mm/vmstat.c @@ -28,7 +28,6 @@ #include #include #include -#include #include "internal.h" @@ -2060,7 +2059,6 @@ static int vmstat_cpu_online(unsigned int cpu) if (!node_state(cpu_to_node(cpu), N_CPU)) { node_set_state(cpu_to_node(cpu), N_CPU); - set_migration_target_nodes(); } return 0; @@ -2085,7 +2083,6 @@ static int vmstat_cpu_dead(unsigned int cpu) return 0; node_clear_state(node, N_CPU); - set_migration_target_nodes(); return 0; } @@ -2118,7 +2115,6 @@ void __init init_mm_internals(void) start_shepherd_timer(); #endif - migrate_on_reclaim_init(); #ifdef CONFIG_PROC_FS proc_create_seq("buddyinfo", 0444, NULL, &fragmentation_op); proc_create_seq("pagetypeinfo", 0400, NULL, &pagetypeinfo_op); From patchwork Fri Jun 3 13:42:33 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Aneesh Kumar K.V" X-Patchwork-Id: 12869082 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 43154C433EF for ; Fri, 3 Jun 2022 13:44:11 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 93A818D0008; Fri, 3 Jun 2022 09:44:10 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id 8EB998D0001; Fri, 3 Jun 2022 09:44:10 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 7B2498D0008; Fri, 3 Jun 2022 09:44:10 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0017.hostedemail.com [216.40.44.17]) by kanga.kvack.org (Postfix) with ESMTP id 6D5F98D0001 for ; Fri, 3 Jun 2022 09:44:10 -0400 (EDT) Received: from smtpin05.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay11.hostedemail.com (Postfix) with ESMTP id 425F0814E7 for ; Fri, 3 Jun 2022 13:44:10 +0000 (UTC) X-FDA: 79537043460.05.ABC2F2F Received: from mx0a-001b2d01.pphosted.com (mx0a-001b2d01.pphosted.com [148.163.156.1]) by imf05.hostedemail.com (Postfix) with ESMTP id 8774110005B for ; Fri, 3 Jun 2022 13:43:33 +0000 (UTC) Received: from pps.filterd (m0187473.ppops.net [127.0.0.1]) by mx0a-001b2d01.pphosted.com (8.17.1.5/8.17.1.5) with ESMTP id 253AacwD001424; Fri, 3 Jun 2022 13:44:02 GMT DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=ibm.com; h=from : to : cc : subject : date : message-id : in-reply-to : references : mime-version : content-transfer-encoding; s=pp1; bh=6JnFH6Ge2DqnYCy9p4M/QpIYwkZi2m2iZlpzOFDI2qU=; b=bN2bC5I0eXblOAEaGNm11SXMP/VWA4sA188NRfcEFNnqyumOWRm8ARhUjOnW+gmiNrDH FiywIb8w/NZvE130mx//VhfTxO4itfMSXWHeZSF/CvsIBDQAFqhcwYbyRZhxYSo15G0Z bEUGpTDefxG230zAK80qPqtcwpCeAvwwdbgIqBKANubqK8cvIJJh5//O/XZNK0Ug3sQl sXOMeAbrFS+l46YlA19/EvESTF9qJti5nWCHqJ9c9XJoEU5np9SCJlWHUxly3C67DEVP E9vqSRAepOYpl+MyiQxBfqjq7jXEUczhObDn/fqSXgSqAaY/bm8bfesYARKVhaQ20KUA 0Q== Received: from pps.reinject (localhost [127.0.0.1]) by mx0a-001b2d01.pphosted.com (PPS) with ESMTPS id 3gfeksdewt-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Fri, 03 Jun 2022 13:44:02 +0000 Received: from m0187473.ppops.net (m0187473.ppops.net [127.0.0.1]) by pps.reinject (8.17.1.5/8.17.1.5) with ESMTP id 253BZDYD031309; Fri, 3 Jun 2022 13:44:01 GMT Received: from ppma01dal.us.ibm.com (83.d6.3fa9.ip4.static.sl-reverse.com [169.63.214.131]) by mx0a-001b2d01.pphosted.com (PPS) with ESMTPS id 3gfeksdew8-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Fri, 03 Jun 2022 13:44:01 +0000 Received: from pps.filterd (ppma01dal.us.ibm.com [127.0.0.1]) by ppma01dal.us.ibm.com (8.16.1.2/8.16.1.2) with SMTP id 253Da8xw005528; Fri, 3 Jun 2022 13:44:00 GMT Received: from b01cxnp22036.gho.pok.ibm.com (b01cxnp22036.gho.pok.ibm.com [9.57.198.26]) by ppma01dal.us.ibm.com with ESMTP id 3gcxt62yjn-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Fri, 03 Jun 2022 13:44:00 +0000 Received: from b01ledav004.gho.pok.ibm.com (b01ledav004.gho.pok.ibm.com [9.57.199.109]) by b01cxnp22036.gho.pok.ibm.com (8.14.9/8.14.9/NCO v10.0) with ESMTP id 253DhxoG3801862 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-GCM-SHA384 bits=256 verify=OK); Fri, 3 Jun 2022 13:43:59 GMT Received: from b01ledav004.gho.pok.ibm.com (unknown [127.0.0.1]) by IMSVA (Postfix) with ESMTP id 63A86112062; Fri, 3 Jun 2022 13:43:59 +0000 (GMT) Received: from b01ledav004.gho.pok.ibm.com (unknown [127.0.0.1]) by IMSVA (Postfix) with ESMTP id 981C2112063; Fri, 3 Jun 2022 13:43:51 +0000 (GMT) Received: from skywalker.ibmuc.com (unknown [9.43.93.173]) by b01ledav004.gho.pok.ibm.com (Postfix) with ESMTP; Fri, 3 Jun 2022 13:43:51 +0000 (GMT) From: "Aneesh Kumar K.V" To: linux-mm@kvack.org, akpm@linux-foundation.org Cc: Wei Xu , Huang Ying , Greg Thelen , Yang Shi , Davidlohr Bueso , Tim C Chen , Brice Goglin , Michal Hocko , Linux Kernel Mailing List , Hesham Almatary , Dave Hansen , Jonathan Cameron , Alistair Popple , Dan Williams , Feng Tang , Jagdish Gediya , Baolin Wang , David Rientjes , "Aneesh Kumar K.V" Subject: [PATCH v5 5/9] mm/demotion/dax/kmem: Set node's memory tier to MEMORY_TIER_PMEM Date: Fri, 3 Jun 2022 19:12:33 +0530 Message-Id: <20220603134237.131362-6-aneesh.kumar@linux.ibm.com> X-Mailer: git-send-email 2.36.1 In-Reply-To: <20220603134237.131362-1-aneesh.kumar@linux.ibm.com> References: <20220603134237.131362-1-aneesh.kumar@linux.ibm.com> MIME-Version: 1.0 X-TM-AS-GCONF: 00 X-Proofpoint-GUID: Ng9BJzbA3aQjPHA8NpZ2baLKJ6Aib5bK X-Proofpoint-ORIG-GUID: XdqiEYOM5eBOErlTnTvrzuuKbVYkljLC X-Proofpoint-Virus-Version: vendor=baseguard engine=ICAP:2.0.205,Aquarius:18.0.874,Hydra:6.0.517,FMLib:17.11.64.514 definitions=2022-06-03_04,2022-06-03_01,2022-02-23_01 X-Proofpoint-Spam-Details: rule=outbound_notspam policy=outbound score=0 spamscore=0 adultscore=0 clxscore=1015 lowpriorityscore=0 mlxlogscore=999 malwarescore=0 suspectscore=0 impostorscore=0 bulkscore=0 mlxscore=0 phishscore=0 priorityscore=1501 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.12.0-2204290000 definitions=main-2206030059 X-Rspam-User: X-Rspamd-Server: rspam02 X-Rspamd-Queue-Id: 8774110005B X-Stat-Signature: 1o6mh838s9wq5x9xwiymeq9gbgncpmkc Authentication-Results: imf05.hostedemail.com; dkim=pass header.d=ibm.com header.s=pp1 header.b=bN2bC5I0; dmarc=pass (policy=none) header.from=ibm.com; spf=pass (imf05.hostedemail.com: domain of aneesh.kumar@linux.ibm.com designates 148.163.156.1 as permitted sender) smtp.mailfrom=aneesh.kumar@linux.ibm.com X-HE-Tag: 1654263813-110878 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: By default, all nodes are assigned to DEFAULT_MEMORY_TIER which is memory tier 1 which is designated for nodes with DRAM, so it is not the right tier for dax devices. Set dax kmem device node's tier to MEMORY_TIER_PMEM, In future, support should be added to distinguish the dax-devices which should not be MEMORY_TIER_PMEM and right memory tier should be set for them. Signed-off-by: Jagdish Gediya Signed-off-by: Aneesh Kumar K.V --- drivers/dax/kmem.c | 4 ++++ mm/memory-tiers.c | 1 + 2 files changed, 5 insertions(+) diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c index a37622060fff..7a11c387fbbc 100644 --- a/drivers/dax/kmem.c +++ b/drivers/dax/kmem.c @@ -11,6 +11,7 @@ #include #include #include +#include #include "dax-private.h" #include "bus.h" @@ -147,6 +148,9 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax) dev_set_drvdata(dev, data); +#ifdef CONFIG_TIERED_MEMORY + node_set_memory_tier(numa_node, MEMORY_TIER_PMEM); +#endif return 0; err_request_mem: diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c index 0d05c0bfb79b..9c82cf4c4bca 100644 --- a/mm/memory-tiers.c +++ b/mm/memory-tiers.c @@ -364,6 +364,7 @@ int node_set_memory_tier(int node, int tier) return ret; } +EXPORT_SYMBOL_GPL(node_set_memory_tier); /** * next_demotion_node() - Get the next node in the demotion path From patchwork Fri Jun 3 13:42:34 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Aneesh Kumar K.V" X-Patchwork-Id: 12869083 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 3D1E9CCA473 for ; Fri, 3 Jun 2022 13:44:20 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id D06408D0009; Fri, 3 Jun 2022 09:44:19 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id CB8428D0001; Fri, 3 Jun 2022 09:44:19 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id B571B8D0009; Fri, 3 Jun 2022 09:44:19 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0010.hostedemail.com [216.40.44.10]) by kanga.kvack.org (Postfix) with ESMTP id A92628D0001 for ; Fri, 3 Jun 2022 09:44:19 -0400 (EDT) Received: from smtpin16.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay01.hostedemail.com (Postfix) with ESMTP id 85BBE61649 for ; Fri, 3 Jun 2022 13:44:19 +0000 (UTC) X-FDA: 79537043838.16.1C38A3C Received: from mx0a-001b2d01.pphosted.com (mx0a-001b2d01.pphosted.com [148.163.156.1]) by imf29.hostedemail.com (Postfix) with ESMTP id 3148B12006B for ; Fri, 3 Jun 2022 13:44:04 +0000 (UTC) Received: from pps.filterd (m0098394.ppops.net [127.0.0.1]) by mx0a-001b2d01.pphosted.com (8.17.1.5/8.17.1.5) with ESMTP id 253CFKio002787; Fri, 3 Jun 2022 13:44:11 GMT DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=ibm.com; h=from : to : cc : subject : date : message-id : in-reply-to : references : mime-version : content-transfer-encoding; s=pp1; bh=D5i6K4ZN5mUiBnlrxfjgVIkpayfc1VEiR533YidgkH8=; b=aj4+6juGm3/ikZRfPPsKDQtK0/2ru3EB22pWXsjlQZs/JC+e597EjxTZgmPZolYYrZcG f/H96qt5U/52vYdO/RIydksVVefboRzQIRg3oeuOfCfDv6KTLEct9BB3g21l4G0BorT3 N9Y/LFrxbRhJJz7obb81RXkgHrkKOaVQJdzoTVttC440OTaW8XWpxYmVNy9Q7sjynqrM vUE6IcJq7XU2zDI6SxJsPASguUUOb66T4+RBumL1yT/9AJakzs6+cw81OSgSfbOs/ciE qmDDUz6z4bK70waeOOmqdUfNdb93v/ZH5zKieQuUPBEsrkWuNto16s4Y9ncifIMi6rzI YA== Received: from pps.reinject (localhost [127.0.0.1]) by mx0a-001b2d01.pphosted.com (PPS) with ESMTPS id 3gfgg8b912-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Fri, 03 Jun 2022 13:44:10 +0000 Received: from m0098394.ppops.net (m0098394.ppops.net [127.0.0.1]) by pps.reinject (8.17.1.5/8.17.1.5) with ESMTP id 253DNhoT008644; Fri, 3 Jun 2022 13:44:10 GMT Received: from ppma02dal.us.ibm.com (a.bd.3ea9.ip4.static.sl-reverse.com [169.62.189.10]) by mx0a-001b2d01.pphosted.com (PPS) with ESMTPS id 3gfgg8b90r-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Fri, 03 Jun 2022 13:44:10 +0000 Received: from pps.filterd (ppma02dal.us.ibm.com [127.0.0.1]) by ppma02dal.us.ibm.com (8.16.1.2/8.16.1.2) with SMTP id 253DZ5eh008331; Fri, 3 Jun 2022 13:44:09 GMT Received: from b01cxnp22033.gho.pok.ibm.com (b01cxnp22033.gho.pok.ibm.com [9.57.198.23]) by ppma02dal.us.ibm.com with ESMTP id 3gd1adht8b-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Fri, 03 Jun 2022 13:44:09 +0000 Received: from b01ledav004.gho.pok.ibm.com (b01ledav004.gho.pok.ibm.com [9.57.199.109]) by b01cxnp22033.gho.pok.ibm.com (8.14.9/8.14.9/NCO v10.0) with ESMTP id 253Di8OL34734366 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-GCM-SHA384 bits=256 verify=OK); Fri, 3 Jun 2022 13:44:08 GMT Received: from b01ledav004.gho.pok.ibm.com (unknown [127.0.0.1]) by IMSVA (Postfix) with ESMTP id 021E5112063; Fri, 3 Jun 2022 13:44:08 +0000 (GMT) Received: from b01ledav004.gho.pok.ibm.com (unknown [127.0.0.1]) by IMSVA (Postfix) with ESMTP id 1D58E112062; Fri, 3 Jun 2022 13:44:00 +0000 (GMT) Received: from skywalker.ibmuc.com (unknown [9.43.93.173]) by b01ledav004.gho.pok.ibm.com (Postfix) with ESMTP; Fri, 3 Jun 2022 13:43:59 +0000 (GMT) From: "Aneesh Kumar K.V" To: linux-mm@kvack.org, akpm@linux-foundation.org Cc: Wei Xu , Huang Ying , Greg Thelen , Yang Shi , Davidlohr Bueso , Tim C Chen , Brice Goglin , Michal Hocko , Linux Kernel Mailing List , Hesham Almatary , Dave Hansen , Jonathan Cameron , Alistair Popple , Dan Williams , Feng Tang , Jagdish Gediya , Baolin Wang , David Rientjes , "Aneesh Kumar K.V" Subject: [PATCH v5 6/9] mm/demotion: Add support for removing node from demotion memory tiers Date: Fri, 3 Jun 2022 19:12:34 +0530 Message-Id: <20220603134237.131362-7-aneesh.kumar@linux.ibm.com> X-Mailer: git-send-email 2.36.1 In-Reply-To: <20220603134237.131362-1-aneesh.kumar@linux.ibm.com> References: <20220603134237.131362-1-aneesh.kumar@linux.ibm.com> MIME-Version: 1.0 X-TM-AS-GCONF: 00 X-Proofpoint-ORIG-GUID: 1MPPeNNf2X6BKdTqRxMT25SHQ8oKwC9E X-Proofpoint-GUID: pnAm_g5t95g4B-34eM7X5D2EouxevYVk X-Proofpoint-Virus-Version: vendor=baseguard engine=ICAP:2.0.205,Aquarius:18.0.874,Hydra:6.0.517,FMLib:17.11.64.514 definitions=2022-06-03_04,2022-06-03_01,2022-02-23_01 X-Proofpoint-Spam-Details: rule=outbound_notspam policy=outbound score=0 priorityscore=1501 clxscore=1015 impostorscore=0 lowpriorityscore=0 malwarescore=0 bulkscore=0 mlxlogscore=999 mlxscore=0 adultscore=0 phishscore=0 suspectscore=0 spamscore=0 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.12.0-2204290000 definitions=main-2206030059 X-Rspam-User: Authentication-Results: imf29.hostedemail.com; dkim=pass header.d=ibm.com header.s=pp1 header.b=aj4+6juG; dmarc=pass (policy=none) header.from=ibm.com; spf=pass (imf29.hostedemail.com: domain of aneesh.kumar@linux.ibm.com designates 148.163.156.1 as permitted sender) smtp.mailfrom=aneesh.kumar@linux.ibm.com X-Rspamd-Server: rspam10 X-Rspamd-Queue-Id: 3148B12006B X-Stat-Signature: sdrdyt3puoto3wbgerd1jnm5gy34c7pa X-HE-Tag: 1654263844-448523 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: This patch adds the special string "none" as a supported memtier value that we can use to remove a specific node from being using as demotion target. For ex: :/sys/devices/system/node/node1# cat memtier 1 :/sys/devices/system/node/node1# cat ../../memtier/memtier1/nodelist 1-3 :/sys/devices/system/node/node1# echo none > memtier :/sys/devices/system/node/node1# :/sys/devices/system/node/node1# cat memtier :/sys/devices/system/node/node1# cat ../../memtier/memtier1/nodelist 2-3 :/sys/devices/system/node/node1# Signed-off-by: Aneesh Kumar K.V --- drivers/base/node.c | 4 ++++ include/linux/memory-tiers.h | 1 + mm/memory-tiers.c | 18 +++++++++++++++--- 3 files changed, 20 insertions(+), 3 deletions(-) diff --git a/drivers/base/node.c b/drivers/base/node.c index 599ed64d910f..344786290149 100644 --- a/drivers/base/node.c +++ b/drivers/base/node.c @@ -582,6 +582,10 @@ static ssize_t memtier_store(struct device *dev, int node = dev->id; int ret; + if (!strncmp(buf, "none", strlen("none"))) { + node_remove_from_memory_tier(node); + return count; + } ret = kstrtoul(buf, 10, &tier); if (ret) return ret; diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h index adc2cb3bf5f8..79bd8d26feb2 100644 --- a/include/linux/memory-tiers.h +++ b/include/linux/memory-tiers.h @@ -17,6 +17,7 @@ extern bool numa_demotion_enabled; int next_demotion_node(int node); +void node_remove_from_memory_tier(int node); int node_get_memory_tier_id(int node); int node_set_memory_tier(int node, int tier); int node_reset_memory_tier(int node, int tier); diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c index 9c82cf4c4bca..b4e72b672d4d 100644 --- a/mm/memory-tiers.c +++ b/mm/memory-tiers.c @@ -253,8 +253,7 @@ static struct memory_tier *__get_memory_tier_from_id(int id) return NULL; } -__maybe_unused // temporay to prevent warnings during bisects -static void node_remove_from_memory_tier(int node) +void node_remove_from_memory_tier(int node) { struct memory_tier *memtier; @@ -320,7 +319,20 @@ int node_reset_memory_tier(int node, int tier) mutex_lock(&memory_tier_lock); current_tier = __node_get_memory_tier(node); - if (!current_tier || current_tier->dev.id == tier) + if (!current_tier) { + /* + * If a N_MEMORY node doesn't have a tier index, then + * we removed it from demotion earlier and we are trying + * add it back. Just add the node to requested tier. + */ + if (node_state(node, N_MEMORY)) { + ret = __node_set_memory_tier(node, tier); + establish_migration_targets(); + } + goto out; + } + + if (current_tier->dev.id == tier) goto out; node_clear(node, current_tier->nodelist); From patchwork Fri Jun 3 13:42:35 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Aneesh Kumar K.V" X-Patchwork-Id: 12869084 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id CAFBCC43334 for ; Fri, 3 Jun 2022 13:44:28 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 064458D0003; Fri, 3 Jun 2022 09:44:28 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id 015B08D0001; Fri, 3 Jun 2022 09:44:27 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id DA6518D0003; Fri, 3 Jun 2022 09:44:27 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0017.hostedemail.com [216.40.44.17]) by kanga.kvack.org (Postfix) with ESMTP id CCCF98D0001 for ; Fri, 3 Jun 2022 09:44:27 -0400 (EDT) Received: from smtpin27.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay12.hostedemail.com (Postfix) with ESMTP id ABFB112146E for ; Fri, 3 Jun 2022 13:44:27 +0000 (UTC) X-FDA: 79537044174.27.51EDE40 Received: from mx0a-001b2d01.pphosted.com (mx0a-001b2d01.pphosted.com [148.163.156.1]) by imf21.hostedemail.com (Postfix) with ESMTP id 0BB7F1C0062 for ; Fri, 3 Jun 2022 13:44:11 +0000 (UTC) Received: from pps.filterd (m0098410.ppops.net [127.0.0.1]) by mx0a-001b2d01.pphosted.com (8.17.1.5/8.17.1.5) with ESMTP id 253Cugre026540; Fri, 3 Jun 2022 13:44:19 GMT DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=ibm.com; h=from : to : cc : subject : date : message-id : in-reply-to : references : mime-version : content-transfer-encoding; s=pp1; bh=ZOQN208AaTMt4Y+rjYXoDEA9HvNVnabUjL2wDLR89+M=; b=SVwRTlydakEiEI2ExzaCjiQoCkbSDUJbzD0j/unXVRqXFXBRHUDGJUZxCVucIU27iRRT tWGPNpYzTDXSNYMBI8ySXFb/lw+4xmYpOSYhANF00fgiIkyg6DYPK7fDDZIQB7sfCGCO klc6fW7m1CsB65Pp3TUUDwh7r0axd8pThyKWDVrgcza207YDyDict+Nh6bBZjxmwSkPP GhTkBW8oFo/RccO3OciudbM0czOPgrPDFdPPOfc0WjXiPXhmOXVMlheE71BA0qUSvZ/T cuf1h/V19n8oRnbat1Ibvakw5EkQlHH3Qbrt76BIqDAQFKqzkz6RogjgnsEDNXnqzmFn wQ== Received: from pps.reinject (localhost [127.0.0.1]) by mx0a-001b2d01.pphosted.com (PPS) with ESMTPS id 3gfjrgrwnf-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Fri, 03 Jun 2022 13:44:19 +0000 Received: from m0098410.ppops.net (m0098410.ppops.net [127.0.0.1]) by pps.reinject (8.17.1.5/8.17.1.5) with ESMTP id 253D0hf7010880; Fri, 3 Jun 2022 13:44:18 GMT Received: from ppma03dal.us.ibm.com (b.bd.3ea9.ip4.static.sl-reverse.com [169.62.189.11]) by mx0a-001b2d01.pphosted.com (PPS) with ESMTPS id 3gfjrgrwn5-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Fri, 03 Jun 2022 13:44:18 +0000 Received: from pps.filterd (ppma03dal.us.ibm.com [127.0.0.1]) by ppma03dal.us.ibm.com (8.16.1.2/8.16.1.2) with SMTP id 253DamLW010706; Fri, 3 Jun 2022 13:44:17 GMT Received: from b01cxnp23032.gho.pok.ibm.com (b01cxnp23032.gho.pok.ibm.com [9.57.198.27]) by ppma03dal.us.ibm.com with ESMTP id 3gd3yn8t3n-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Fri, 03 Jun 2022 13:44:17 +0000 Received: from b01ledav004.gho.pok.ibm.com (b01ledav004.gho.pok.ibm.com [9.57.199.109]) by b01cxnp23032.gho.pok.ibm.com (8.14.9/8.14.9/NCO v10.0) with ESMTP id 253DiG6k17498544 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-GCM-SHA384 bits=256 verify=OK); Fri, 3 Jun 2022 13:44:16 GMT Received: from b01ledav004.gho.pok.ibm.com (unknown [127.0.0.1]) by IMSVA (Postfix) with ESMTP id 9E96D112065; Fri, 3 Jun 2022 13:44:16 +0000 (GMT) Received: from b01ledav004.gho.pok.ibm.com (unknown [127.0.0.1]) by IMSVA (Postfix) with ESMTP id C9920112063; Fri, 3 Jun 2022 13:44:08 +0000 (GMT) Received: from skywalker.ibmuc.com (unknown [9.43.93.173]) by b01ledav004.gho.pok.ibm.com (Postfix) with ESMTP; Fri, 3 Jun 2022 13:44:08 +0000 (GMT) From: "Aneesh Kumar K.V" To: linux-mm@kvack.org, akpm@linux-foundation.org Cc: Wei Xu , Huang Ying , Greg Thelen , Yang Shi , Davidlohr Bueso , Tim C Chen , Brice Goglin , Michal Hocko , Linux Kernel Mailing List , Hesham Almatary , Dave Hansen , Jonathan Cameron , Alistair Popple , Dan Williams , Feng Tang , Jagdish Gediya , Baolin Wang , David Rientjes , "Aneesh Kumar K . V" Subject: [PATCH v5 7/9] mm/demotion: Demote pages according to allocation fallback order Date: Fri, 3 Jun 2022 19:12:35 +0530 Message-Id: <20220603134237.131362-8-aneesh.kumar@linux.ibm.com> X-Mailer: git-send-email 2.36.1 In-Reply-To: <20220603134237.131362-1-aneesh.kumar@linux.ibm.com> References: <20220603134237.131362-1-aneesh.kumar@linux.ibm.com> MIME-Version: 1.0 X-TM-AS-GCONF: 00 X-Proofpoint-ORIG-GUID: mX0Gsxn3vtIVcf4XO3jJ852AA3picwmx X-Proofpoint-GUID: B5_lxBaSxwSVGbsYsGyLNsySyo553IWv X-Proofpoint-Virus-Version: vendor=baseguard engine=ICAP:2.0.205,Aquarius:18.0.874,Hydra:6.0.517,FMLib:17.11.64.514 definitions=2022-06-03_04,2022-06-03_01,2022-02-23_01 X-Proofpoint-Spam-Details: rule=outbound_notspam policy=outbound score=0 clxscore=1015 lowpriorityscore=0 phishscore=0 malwarescore=0 adultscore=0 impostorscore=0 mlxscore=0 spamscore=0 priorityscore=1501 bulkscore=0 mlxlogscore=999 suspectscore=0 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.12.0-2204290000 definitions=main-2206030059 X-Rspamd-Server: rspam01 X-Rspamd-Queue-Id: 0BB7F1C0062 Authentication-Results: imf21.hostedemail.com; dkim=pass header.d=ibm.com header.s=pp1 header.b=SVwRTlyd; dmarc=pass (policy=none) header.from=ibm.com; spf=pass (imf21.hostedemail.com: domain of aneesh.kumar@linux.ibm.com designates 148.163.156.1 as permitted sender) smtp.mailfrom=aneesh.kumar@linux.ibm.com X-Stat-Signature: sgikb4hmdqxwrb6d44aigqdrwko3j56r X-Rspam-User: X-HE-Tag: 1654263851-134121 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: From: Jagdish Gediya currently, a higher tier node can only be demoted to selected nodes on the next lower tier as defined by the demotion path, not any other node from any lower tier. This strict, hard-coded demotion order does not work in all use cases (e.g. some use cases may want to allow cross-socket demotion to another node in the same demotion tier as a fallback when the preferred demotion node is out of space). This demotion order is also inconsistent with the page allocation fallback order when all the nodes in a higher tier are out of space: The page allocation can fall back to any node from any lower tier, whereas the demotion order doesn't allow that currently. This patch adds support to get all the allowed demotion targets mask for node, also demote_page_list() function is modified to utilize this allowed node mask by filling it in migration_target_control structure before passing it to migrate_pages(). Signed-off-by: Jagdish Gediya Signed-off-by: Aneesh Kumar K.V --- include/linux/memory-tiers.h | 5 ++++ mm/memory-tiers.c | 49 ++++++++++++++++++++++++++++++++++-- mm/vmscan.c | 38 +++++++++++++--------------- 3 files changed, 70 insertions(+), 22 deletions(-) diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h index 79bd8d26feb2..cd6e71f702ad 100644 --- a/include/linux/memory-tiers.h +++ b/include/linux/memory-tiers.h @@ -21,6 +21,7 @@ void node_remove_from_memory_tier(int node); int node_get_memory_tier_id(int node); int node_set_memory_tier(int node, int tier); int node_reset_memory_tier(int node, int tier); +void node_get_allowed_targets(int node, nodemask_t *targets); #else #define numa_demotion_enabled false static inline int next_demotion_node(int node) @@ -28,6 +29,10 @@ static inline int next_demotion_node(int node) return NUMA_NO_NODE; } +static inline void node_get_allowed_targets(int node, nodemask_t *targets) +{ + *targets = NODE_MASK_NONE; +} #endif /* CONFIG_TIERED_MEMORY */ #endif diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c index b4e72b672d4d..592d939ec28d 100644 --- a/mm/memory-tiers.c +++ b/mm/memory-tiers.c @@ -18,6 +18,7 @@ struct memory_tier { struct demotion_nodes { nodemask_t preferred; + nodemask_t allowed; }; #define to_memory_tier(device) container_of(device, struct memory_tier, dev) @@ -378,6 +379,25 @@ int node_set_memory_tier(int node, int tier) } EXPORT_SYMBOL_GPL(node_set_memory_tier); +void node_get_allowed_targets(int node, nodemask_t *targets) +{ + /* + * node_demotion[] is updated without excluding this + * function from running. + * + * If any node is moving to lower tiers then modifications + * in node_demotion[] are still valid for this node, if any + * node is moving to higher tier then moving node may be + * used once for demotion which should be ok so rcu should + * be enough here. + */ + rcu_read_lock(); + + *targets = node_demotion[node].allowed; + + rcu_read_unlock(); +} + /** * next_demotion_node() - Get the next node in the demotion path * @node: The starting node to lookup the next node @@ -437,8 +457,10 @@ static void __disable_all_migrate_targets(void) { int node; - for_each_node_mask(node, node_states[N_MEMORY]) + for_each_node_mask(node, node_states[N_MEMORY]) { node_demotion[node].preferred = NODE_MASK_NONE; + node_demotion[node].allowed = NODE_MASK_NONE; + } } static void disable_all_migrate_targets(void) @@ -465,7 +487,7 @@ static void establish_migration_targets(void) struct demotion_nodes *nd; int target = NUMA_NO_NODE, node; int distance, best_distance; - nodemask_t used; + nodemask_t used, allowed = NODE_MASK_NONE; if (!node_demotion) return; @@ -511,6 +533,29 @@ static void establish_migration_targets(void) } } while (1); } + /* + * Now build the allowed mask for each node collecting node mask from + * all memory tier below it. This allows us to fallback demotion page + * allocation to a set of nodes that is closer the above selected + * perferred node. + */ + list_for_each_entry(memtier, &memory_tiers, list) + nodes_or(allowed, allowed, memtier->nodelist); + /* + * Removes nodes not yet in N_MEMORY. + */ + nodes_and(allowed, node_states[N_MEMORY], allowed); + + list_for_each_entry(memtier, &memory_tiers, list) { + /* + * Keep removing current tier from allowed nodes, + * This will remove all nodes in current and above + * memory tier from the allowed mask. + */ + nodes_andnot(allowed, allowed, memtier->nodelist); + for_each_node_mask(node, memtier->nodelist) + node_demotion[node].allowed = allowed; + } } /* diff --git a/mm/vmscan.c b/mm/vmscan.c index 3a8f78277f99..d424b7af2f26 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -1460,23 +1460,6 @@ static void folio_check_dirty_writeback(struct folio *folio, mapping->a_ops->is_dirty_writeback(folio, dirty, writeback); } -static struct page *alloc_demote_page(struct page *page, unsigned long node) -{ - struct migration_target_control mtc = { - /* - * Allocate from 'node', or fail quickly and quietly. - * When this happens, 'page' will likely just be discarded - * instead of migrated. - */ - .gfp_mask = (GFP_HIGHUSER_MOVABLE & ~__GFP_RECLAIM) | - __GFP_THISNODE | __GFP_NOWARN | - __GFP_NOMEMALLOC | GFP_NOWAIT, - .nid = node - }; - - return alloc_migration_target(page, (unsigned long)&mtc); -} - /* * Take pages on @demote_list and attempt to demote them to * another node. Pages which are not demoted are left on @@ -1487,6 +1470,19 @@ static unsigned int demote_page_list(struct list_head *demote_pages, { int target_nid = next_demotion_node(pgdat->node_id); unsigned int nr_succeeded; + nodemask_t allowed_mask; + + struct migration_target_control mtc = { + /* + * Allocate from 'node', or fail quickly and quietly. + * When this happens, 'page' will likely just be discarded + * instead of migrated. + */ + .gfp_mask = (GFP_HIGHUSER_MOVABLE & ~__GFP_RECLAIM) | __GFP_NOWARN | + __GFP_NOMEMALLOC | GFP_NOWAIT, + .nid = target_nid, + .nmask = &allowed_mask + }; if (list_empty(demote_pages)) return 0; @@ -1494,10 +1490,12 @@ static unsigned int demote_page_list(struct list_head *demote_pages, if (target_nid == NUMA_NO_NODE) return 0; + node_get_allowed_targets(pgdat->node_id, &allowed_mask); + /* Demotion ignores all cpuset and mempolicy settings */ - migrate_pages(demote_pages, alloc_demote_page, NULL, - target_nid, MIGRATE_ASYNC, MR_DEMOTION, - &nr_succeeded); + migrate_pages(demote_pages, alloc_migration_target, NULL, + (unsigned long)&mtc, MIGRATE_ASYNC, MR_DEMOTION, + &nr_succeeded); if (current_is_kswapd()) __count_vm_events(PGDEMOTE_KSWAPD, nr_succeeded); From patchwork Fri Jun 3 13:42:36 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Aneesh Kumar K.V" X-Patchwork-Id: 12869085 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 37D4ECCA473 for ; Fri, 3 Jun 2022 13:44:43 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id CF7648D0002; Fri, 3 Jun 2022 09:44:42 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id C82AE8D0001; Fri, 3 Jun 2022 09:44:42 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id AD2CE8D0002; Fri, 3 Jun 2022 09:44:42 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0013.hostedemail.com [216.40.44.13]) by kanga.kvack.org (Postfix) with ESMTP id 9D0BB8D0001 for ; Fri, 3 Jun 2022 09:44:42 -0400 (EDT) Received: from smtpin25.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay01.hostedemail.com (Postfix) with ESMTP id 7FDF061671 for ; Fri, 3 Jun 2022 13:44:42 +0000 (UTC) X-FDA: 79537044804.25.D603DBE Received: from mx0a-001b2d01.pphosted.com (mx0b-001b2d01.pphosted.com [148.163.158.5]) by imf02.hostedemail.com (Postfix) with ESMTP id 1A6E98007E for ; Fri, 3 Jun 2022 13:44:29 +0000 (UTC) Received: from pps.filterd (m0098416.ppops.net [127.0.0.1]) by mx0b-001b2d01.pphosted.com (8.17.1.5/8.17.1.5) with ESMTP id 253CLWHE020315; Fri, 3 Jun 2022 13:44:28 GMT DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=ibm.com; h=from : to : cc : subject : date : message-id : in-reply-to : references : mime-version : content-transfer-encoding; s=pp1; bh=AH6mkUwA+QjCqwLZVtpQSiVa/wTru13R4bBfOxu/+tY=; b=MTNYJHg0s0IeyUQsVkmKQIG1l75PjXPWv7f0eGHaAlr4N12OmwQrgqNT6RH4bMwdkgWQ FhzII9ukDl1QkydFjaUJq9ABTplUxmkiPwvtmMJTeQqIFHg0MhJWwJC04CkE6hvcaxK8 yxxwtfCUi6rpXZqv5uWGi6sqdEioEzArASD3xMFjd6BLZMJXBozu/lioI2fhUH6vvj8e jP5b6efj4XMIArTPtwTB7Ryfj1a1XohX4Keqm8aPqB9nmDoJWNiAWxGkYmQJLCVcTGbF xn0AehJ5slfZH+QZqUBMuYVMxdOrvmffyO6kGJPDOB47eFhWbddL4NyeH4Qg2lc+F9l7 eQ== Received: from pps.reinject (localhost [127.0.0.1]) by mx0b-001b2d01.pphosted.com (PPS) with ESMTPS id 3gfhcjjb9r-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Fri, 03 Jun 2022 13:44:28 +0000 Received: from m0098416.ppops.net (m0098416.ppops.net [127.0.0.1]) by pps.reinject (8.17.1.5/8.17.1.5) with ESMTP id 253CPLLO005641; Fri, 3 Jun 2022 13:44:27 GMT Received: from ppma01wdc.us.ibm.com (fd.55.37a9.ip4.static.sl-reverse.com [169.55.85.253]) by mx0b-001b2d01.pphosted.com (PPS) with ESMTPS id 3gfhcjjb9g-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Fri, 03 Jun 2022 13:44:27 +0000 Received: from pps.filterd (ppma01wdc.us.ibm.com [127.0.0.1]) by ppma01wdc.us.ibm.com (8.16.1.2/8.16.1.2) with SMTP id 253DaMt9031229; Fri, 3 Jun 2022 13:44:26 GMT Received: from b01cxnp23034.gho.pok.ibm.com (b01cxnp23034.gho.pok.ibm.com [9.57.198.29]) by ppma01wdc.us.ibm.com with ESMTP id 3gbcbjhj6r-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Fri, 03 Jun 2022 13:44:26 +0000 Received: from b01ledav004.gho.pok.ibm.com (b01ledav004.gho.pok.ibm.com [9.57.199.109]) by b01cxnp23034.gho.pok.ibm.com (8.14.9/8.14.9/NCO v10.0) with ESMTP id 253DiPWX18547142 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-GCM-SHA384 bits=256 verify=OK); Fri, 3 Jun 2022 13:44:25 GMT Received: from b01ledav004.gho.pok.ibm.com (unknown [127.0.0.1]) by IMSVA (Postfix) with ESMTP id D5FFC112067; Fri, 3 Jun 2022 13:44:25 +0000 (GMT) Received: from b01ledav004.gho.pok.ibm.com (unknown [127.0.0.1]) by IMSVA (Postfix) with ESMTP id 68659112064; Fri, 3 Jun 2022 13:44:17 +0000 (GMT) Received: from skywalker.ibmuc.com (unknown [9.43.93.173]) by b01ledav004.gho.pok.ibm.com (Postfix) with ESMTP; Fri, 3 Jun 2022 13:44:16 +0000 (GMT) From: "Aneesh Kumar K.V" To: linux-mm@kvack.org, akpm@linux-foundation.org Cc: Wei Xu , Huang Ying , Greg Thelen , Yang Shi , Davidlohr Bueso , Tim C Chen , Brice Goglin , Michal Hocko , Linux Kernel Mailing List , Hesham Almatary , Dave Hansen , Jonathan Cameron , Alistair Popple , Dan Williams , Feng Tang , Jagdish Gediya , Baolin Wang , David Rientjes , "Aneesh Kumar K . V" Subject: [PATCH v5 8/9] mm/demotion: Add documentation for memory tiering Date: Fri, 3 Jun 2022 19:12:36 +0530 Message-Id: <20220603134237.131362-9-aneesh.kumar@linux.ibm.com> X-Mailer: git-send-email 2.36.1 In-Reply-To: <20220603134237.131362-1-aneesh.kumar@linux.ibm.com> References: <20220603134237.131362-1-aneesh.kumar@linux.ibm.com> MIME-Version: 1.0 X-TM-AS-GCONF: 00 X-Proofpoint-ORIG-GUID: dc_m-Uxzs3C13ricJFlPdqJoArj-P2aN X-Proofpoint-GUID: kG7U20EY_9fnH35KKDdv9Uf-bO_ymnLB X-Proofpoint-Virus-Version: vendor=baseguard engine=ICAP:2.0.205,Aquarius:18.0.874,Hydra:6.0.517,FMLib:17.11.64.514 definitions=2022-06-03_04,2022-06-03_01,2022-02-23_01 X-Proofpoint-Spam-Details: rule=outbound_notspam policy=outbound score=0 impostorscore=0 phishscore=0 clxscore=1015 spamscore=0 mlxlogscore=999 suspectscore=0 priorityscore=1501 malwarescore=0 adultscore=0 mlxscore=0 bulkscore=0 lowpriorityscore=0 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.12.0-2204290000 definitions=main-2206030059 Authentication-Results: imf02.hostedemail.com; dkim=pass header.d=ibm.com header.s=pp1 header.b=MTNYJHg0; dmarc=pass (policy=none) header.from=ibm.com; spf=pass (imf02.hostedemail.com: domain of aneesh.kumar@linux.ibm.com designates 148.163.158.5 as permitted sender) smtp.mailfrom=aneesh.kumar@linux.ibm.com X-Rspamd-Server: rspam03 X-Rspamd-Queue-Id: 1A6E98007E X-Rspam-User: X-Stat-Signature: q16s9ypc1rjjfbuy3at945o3kxtowzws X-HE-Tag: 1654263869-842367 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: From: Jagdish Gediya All N_MEMORY nodes are divided into 3 memoty tiers with rank value MEMORY_RANK_HBM_GPU, MEMORY_RANK_DRAM and MEMORY_RANK_PMEM. By default, All nodes are assigned to memory tier with rank value MEMORY_RANK_DRAM. Demotion path for all N_MEMORY nodes is prepared based on the rank value of memory tiers. This patch adds documention for memory tiering introduction, its sysfs interfaces and how demotion is performed based on memory tiers. Suggested-by: Wei Xu Signed-off-by: Jagdish Gediya Signed-off-by: Aneesh Kumar K.V --- Documentation/admin-guide/mm/index.rst | 1 + .../admin-guide/mm/memory-tiering.rst | 175 ++++++++++++++++++ 2 files changed, 176 insertions(+) create mode 100644 Documentation/admin-guide/mm/memory-tiering.rst diff --git a/Documentation/admin-guide/mm/index.rst b/Documentation/admin-guide/mm/index.rst index c21b5823f126..3f211cbca8c3 100644 --- a/Documentation/admin-guide/mm/index.rst +++ b/Documentation/admin-guide/mm/index.rst @@ -32,6 +32,7 @@ the Linux memory management. idle_page_tracking ksm memory-hotplug + memory-tiering nommu-mmap numa_memory_policy numaperf diff --git a/Documentation/admin-guide/mm/memory-tiering.rst b/Documentation/admin-guide/mm/memory-tiering.rst new file mode 100644 index 000000000000..afbb9591f301 --- /dev/null +++ b/Documentation/admin-guide/mm/memory-tiering.rst @@ -0,0 +1,175 @@ +.. SPDX-License-Identifier: GPL-2.0 + +.. _admin_guide_memory_tiering: + +============ +Memory tiers +============ + +This document describes explicit memory tiering support along with +demotion based on memory tiers. + +Introduction +============ + +Many systems have multiple type of memory devices e.g. GPU, DRAM and +PMEM. The memory subsystem of these systems can be called memory +tiering system because the performance of the different types of +memory is different. Memory tiers are defined based on hardware +capabilities of memory nodes. Each memory tier is assigned a rank +value that determines the memory tier position in demotion order. + +The memory tier assignment of each node is independent from each +other. Moving a node from one tier to another tier doesn't affect +the tier assignment of any other node. + +Memory tiers are used to build the demotion targets for nodes, a node +can demote its pages to any node of any lower tiers. + +Memory tier rank +================= + +Memory nodes are divided into below 3 types of memory tiers with rank value +as shown base on their hardware characteristics. + +MEMORY_RANK_HBM_GPU +MEMORY_RANK_DRAM +MEMORY_RANK_PMEM + +Memory tiers initialization and (re)assignments +=============================================== + +By default, all nodes are assigned to memory tier with default rank +DEFAULT_MEMORY_RANK which is 1 (MEMORY_RANK_DRAM). Memory tier of +memory node can be either modified through sysfs or from driver. On +hotplug, memory tier with default rank is assigned to memory node. + +Sysfs interfaces +================ + +Nodes belonging to specific tier can be read from, +/sys/devices/system/memtier/memtierN/nodelist (Read-Only) + +Where N is 0 - 2. + +Example 1: +For a system where Node 0 is CPU + DRAM nodes, Node 1 is HBM node, +node 2 is PMEM node an ideal tier layout will be + +$ cat /sys/devices/system/memtier/memtier0/nodelist +1 +$ cat /sys/devices/system/memtier/memtier1/nodelist +0 +$ cat /sys/devices/system/memtier/memtier2/nodelist +2 + +Example 2: +For a system where Node 0 & 1 are CPU + DRAM nodes, node 2 & 3 are PMEM +nodes. + +$ cat /sys/devices/system/memtier/memtier0/nodelist +cat: /sys/devices/system/memtier/memtier0/nodelist: No such file or +directory +$ cat /sys/devices/system/memtier/memtier1/nodelist +0-1 +$ cat /sys/devices/system/memtier/memtier2/nodelist +2-3 + +Default memory tier can be read from, +/sys/devices/system/memtier/default_tier (Read-Only) + +e.g. +$ cat /sys/devices/system/memtier/default_tier +memtier1 + +Max memory tier can be read from, +/sys/devices/system/memtier/max_tier (Read-Only) + +e.g. +$ cat /sys/devices/system/memtier/max_tier +3 + +Individual node's memory tier can be read of set using, +/sys/devices/system/node/nodeN/memtier (Read-Write) + +where N = node id + +When this interface is written, Node is moved from old memory tier +to new memory tier and demotion targets for all N_MEMORY nodes are +built again. + +For example 1 mentioned above, +$ cat /sys/devices/system/node/node0/memtier +1 +$ cat /sys/devices/system/node/node1/memtier +0 +$ cat /sys/devices/system/node/node2/memtier +2 + +Demotion +======== + +In a system with DRAM and persistent memory, once DRAM +fills up, reclaim will start and some of the DRAM contents will be +thrown out even if there is a space in persistent memory. +Consequently allocations will, at some point, start falling over to the slower +persistent memory. + +That has two nasty properties. First, the newer allocations can end up in +the slower persistent memory. Second, reclaimed data in DRAM are just +discarded even if there are gobs of space in persistent memory that could +be used. + +Instead of page being discarded during reclaim, it can be moved to +persistent memory. Allowing page migration during reclaim enables +these systems to migrate pages from fast(higher) tiers to slow(lower) +tiers when the fast(higher) tier is under pressure. + + +Enable/Disable demotion +----------------------- + +By default demotion is disabled, it can be enabled/disabled using +below sysfs interface, + +$ echo 0/1 or false/true > /sys/kernel/mm/numa/demotion_enabled + +preferred and allowed demotion nodes +------------------------------------ + +Preffered nodes for a specific N_MEMORY nodes are best nodes +from next possible lower memory tier. Allowed nodes for any +node are all the node available in all possible lower memory +tiers. + +Example: + +For a system where Node 0 & 1 are CPU + DRAM nodes, node 2 & 3 are PMEM +nodes, + +node distances: +node 0 1 2 3 + 0 10 20 30 40 + 1 20 10 40 30 + 2 30 40 10 40 + 3 40 30 40 10 + +memory_tiers[0] = +memory_tiers[1] = 0-1 +memory_tiers[2] = 2-3 + +node_demotion[0].preferred = 2 +node_demotion[0].allowed = 2, 3 +node_demotion[1].preferred = 3 +node_demotion[1].allowed = 3, 2 +node_demotion[2].preferred = +node_demotion[2].allowed = +node_demotion[3].preferred = +node_demotion[3].allowed = + +Memory allocation for demotion +------------------------------ + +If page needs to be demoted from any node, the kernel 1st tries +to allocate new page from node's preferred node and fallbacks to +node's allowed targets in allocation fallback order. From patchwork Fri Jun 3 13:42:37 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Aneesh Kumar K.V" X-Patchwork-Id: 12869086 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 39A67C433EF for ; Fri, 3 Jun 2022 13:44:44 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id CC1DE8D0006; Fri, 3 Jun 2022 09:44:43 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id C75E38D0001; Fri, 3 Jun 2022 09:44:43 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 9FD678D0006; Fri, 3 Jun 2022 09:44:43 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0010.hostedemail.com [216.40.44.10]) by kanga.kvack.org (Postfix) with ESMTP id 916F88D0001 for ; Fri, 3 Jun 2022 09:44:43 -0400 (EDT) Received: from smtpin16.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay10.hostedemail.com (Postfix) with ESMTP id 77CA315E5 for ; Fri, 3 Jun 2022 13:44:43 +0000 (UTC) X-FDA: 79537044846.16.7B34892 Received: from mx0a-001b2d01.pphosted.com (mx0a-001b2d01.pphosted.com [148.163.156.1]) by imf01.hostedemail.com (Postfix) with ESMTP id 8D1BD4007B for ; Fri, 3 Jun 2022 13:44:34 +0000 (UTC) Received: from pps.filterd (m0098394.ppops.net [127.0.0.1]) by mx0a-001b2d01.pphosted.com (8.17.1.5/8.17.1.5) with ESMTP id 253CFLgN002903; Fri, 3 Jun 2022 13:44:37 GMT DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=ibm.com; h=from : to : cc : subject : date : message-id : in-reply-to : references : mime-version : content-transfer-encoding; s=pp1; bh=BUCn0BY2ZXsWri4dRdU95JjN/8kb3U+L/F9XlYqa2Bg=; b=BVnWdjjulZzHbLLJyBuVL9VJJrL5p7elwwnZV1I9qfuEPb78HGwg8g7g0KD6NX1rarjq 9qjNMLIwKb0RIrS5nNaHca8k78ZVCHLQ8UDKstvqXGXStThoW9crb6NyCYXUCVV/8bka OB3d55aXErE9cfaAdwDxqTijwxoiRq/45WXA+FnJSVjE/rm44Pq50a/CU+MRsqjJyW90 YySszjPwPz0HZ1eP1MxJwKLzIzuE+XYxSSyCOeKXXdLaBzDqyeH3hQN0kkwc8Qs30sYY eVN1TZCHKm0067GYxYT+DlTtyX1psgvifpxlXWehSaq3qrw5t95OfOlIfTfQo/dCdGXq zQ== Received: from pps.reinject (localhost [127.0.0.1]) by mx0a-001b2d01.pphosted.com (PPS) with ESMTPS id 3gfgg8b99f-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Fri, 03 Jun 2022 13:44:36 +0000 Received: from m0098394.ppops.net (m0098394.ppops.net [127.0.0.1]) by pps.reinject (8.17.1.5/8.17.1.5) with ESMTP id 253Cwqok004481; Fri, 3 Jun 2022 13:44:36 GMT Received: from ppma04dal.us.ibm.com (7a.29.35a9.ip4.static.sl-reverse.com [169.53.41.122]) by mx0a-001b2d01.pphosted.com (PPS) with ESMTPS id 3gfgg8b996-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Fri, 03 Jun 2022 13:44:36 +0000 Received: from pps.filterd (ppma04dal.us.ibm.com [127.0.0.1]) by ppma04dal.us.ibm.com (8.16.1.2/8.16.1.2) with SMTP id 253DaUVL027611; Fri, 3 Jun 2022 13:44:35 GMT Received: from b01cxnp22036.gho.pok.ibm.com (b01cxnp22036.gho.pok.ibm.com [9.57.198.26]) by ppma04dal.us.ibm.com with ESMTP id 3gd4n68dq9-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Fri, 03 Jun 2022 13:44:35 +0000 Received: from b01ledav004.gho.pok.ibm.com (b01ledav004.gho.pok.ibm.com [9.57.199.109]) by b01cxnp22036.gho.pok.ibm.com (8.14.9/8.14.9/NCO v10.0) with ESMTP id 253DiYHU9830668 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-GCM-SHA384 bits=256 verify=OK); Fri, 3 Jun 2022 13:44:34 GMT Received: from b01ledav004.gho.pok.ibm.com (unknown [127.0.0.1]) by IMSVA (Postfix) with ESMTP id 190F0112063; Fri, 3 Jun 2022 13:44:34 +0000 (GMT) Received: from b01ledav004.gho.pok.ibm.com (unknown [127.0.0.1]) by IMSVA (Postfix) with ESMTP id 84AC2112062; Fri, 3 Jun 2022 13:44:26 +0000 (GMT) Received: from skywalker.ibmuc.com (unknown [9.43.93.173]) by b01ledav004.gho.pok.ibm.com (Postfix) with ESMTP; Fri, 3 Jun 2022 13:44:26 +0000 (GMT) From: "Aneesh Kumar K.V" To: linux-mm@kvack.org, akpm@linux-foundation.org Cc: Wei Xu , Huang Ying , Greg Thelen , Yang Shi , Davidlohr Bueso , Tim C Chen , Brice Goglin , Michal Hocko , Linux Kernel Mailing List , Hesham Almatary , Dave Hansen , Jonathan Cameron , Alistair Popple , Dan Williams , Feng Tang , Jagdish Gediya , Baolin Wang , David Rientjes , "Aneesh Kumar K.V" Subject: [PATCH v5 9/9] mm/demotion: Update node_is_toptier to work with memory tiers Date: Fri, 3 Jun 2022 19:12:37 +0530 Message-Id: <20220603134237.131362-10-aneesh.kumar@linux.ibm.com> X-Mailer: git-send-email 2.36.1 In-Reply-To: <20220603134237.131362-1-aneesh.kumar@linux.ibm.com> References: <20220603134237.131362-1-aneesh.kumar@linux.ibm.com> MIME-Version: 1.0 X-TM-AS-GCONF: 00 X-Proofpoint-ORIG-GUID: h6x5ag3c0QcohwVDA3Yn5j1SGQueDgDx X-Proofpoint-GUID: xxrlc1sSbC-F3CxfgTlo2b9hGvdng3ck X-Proofpoint-Virus-Version: vendor=baseguard engine=ICAP:2.0.205,Aquarius:18.0.874,Hydra:6.0.517,FMLib:17.11.64.514 definitions=2022-06-03_04,2022-06-03_01,2022-02-23_01 X-Proofpoint-Spam-Details: rule=outbound_notspam policy=outbound score=0 priorityscore=1501 clxscore=1015 impostorscore=0 lowpriorityscore=0 malwarescore=0 bulkscore=0 mlxlogscore=999 mlxscore=0 adultscore=0 phishscore=0 suspectscore=0 spamscore=0 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.12.0-2204290000 definitions=main-2206030059 X-Rspamd-Server: rspam06 X-Rspamd-Queue-Id: 8D1BD4007B X-Stat-Signature: isti94t4g1q51jomweaunemy5zhxtcub X-Rspam-User: Authentication-Results: imf01.hostedemail.com; dkim=pass header.d=ibm.com header.s=pp1 header.b=BVnWdjju; spf=pass (imf01.hostedemail.com: domain of aneesh.kumar@linux.ibm.com designates 148.163.156.1 as permitted sender) smtp.mailfrom=aneesh.kumar@linux.ibm.com; dmarc=pass (policy=none) header.from=ibm.com X-HE-Tag: 1654263874-535798 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: With memory tiers support we can have memory on NUMA nodes in the top tier from which we want to avoid promotion tracking NUMA faults. Update node_is_toptier to work with memory tiers. To avoid taking locks, a nodemask is maintained for all demotion targets. All NUMA nodes are by default top tier nodes and as we add new lower memory tiers NUMA nodes get added to the demotion targets thereby moving them out of the top tier. Signed-off-by: Aneesh Kumar K.V --- include/linux/memory-tiers.h | 16 ++++++++++++++++ include/linux/node.h | 5 ----- mm/huge_memory.c | 1 + mm/memory-tiers.c | 10 ++++++++++ mm/migrate.c | 1 + mm/mprotect.c | 1 + 6 files changed, 29 insertions(+), 5 deletions(-) diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h index cd6e71f702ad..32e0e6fabf02 100644 --- a/include/linux/memory-tiers.h +++ b/include/linux/memory-tiers.h @@ -16,12 +16,23 @@ #define MAX_MEMORY_TIERS 3 extern bool numa_demotion_enabled; +extern nodemask_t demotion_target_mask; int next_demotion_node(int node); void node_remove_from_memory_tier(int node); int node_get_memory_tier_id(int node); int node_set_memory_tier(int node, int tier); int node_reset_memory_tier(int node, int tier); void node_get_allowed_targets(int node, nodemask_t *targets); + +/* + * By default all nodes are top tiper. As we create new memory tiers + * we below top tiers we add them to NON_TOP_TIER state. + */ +static inline bool node_is_toptier(int node) +{ + return !node_isset(node, demotion_target_mask); +} + #else #define numa_demotion_enabled false static inline int next_demotion_node(int node) @@ -33,6 +44,11 @@ static inline void node_get_allowed_targets(int node, nodemask_t *targets) { *targets = NODE_MASK_NONE; } + +static inline bool node_is_toptier(int node) +{ + return true; +} #endif /* CONFIG_TIERED_MEMORY */ #endif diff --git a/include/linux/node.h b/include/linux/node.h index 40d641a8bfb0..9ec680dd607f 100644 --- a/include/linux/node.h +++ b/include/linux/node.h @@ -185,9 +185,4 @@ static inline void register_hugetlbfs_with_node(node_registration_func_t reg, #define to_node(device) container_of(device, struct node, dev) -static inline bool node_is_toptier(int node) -{ - return node_state(node, N_CPU); -} - #endif /* _LINUX_NODE_H_ */ diff --git a/mm/huge_memory.c b/mm/huge_memory.c index a77c78a2b6b5..294873d4be2b 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -35,6 +35,7 @@ #include #include #include +#include #include #include diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c index 592d939ec28d..df8e9910165a 100644 --- a/mm/memory-tiers.c +++ b/mm/memory-tiers.c @@ -31,6 +31,7 @@ static struct bus_type memory_tier_subsys = { static void establish_migration_targets(void); static DEFINE_MUTEX(memory_tier_lock); static LIST_HEAD(memory_tiers); +nodemask_t demotion_target_mask; /* * node_demotion[] examples: @@ -541,6 +542,15 @@ static void establish_migration_targets(void) */ list_for_each_entry(memtier, &memory_tiers, list) nodes_or(allowed, allowed, memtier->nodelist); + /* + * Add nodes to demotion target mask so that we can find + * top tier easily. + */ + memtier = list_first_entry(&memory_tiers, struct memory_tier, list); + if (memtier) + nodes_andnot(demotion_target_mask, allowed, memtier->nodelist); + else + demotion_target_mask = NODE_MASK_NONE; /* * Removes nodes not yet in N_MEMORY. */ diff --git a/mm/migrate.c b/mm/migrate.c index 0b554625a219..78615c48fc0f 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -50,6 +50,7 @@ #include #include #include +#include #include diff --git a/mm/mprotect.c b/mm/mprotect.c index ba5592655ee3..92a2fc0fa88b 100644 --- a/mm/mprotect.c +++ b/mm/mprotect.c @@ -31,6 +31,7 @@ #include #include #include +#include #include #include #include