From patchwork Wed Jul 20 02:59:13 2022
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: "Aneesh Kumar K.V"
X-Patchwork-Id: 12923294
From: "Aneesh Kumar K.V"
To: linux-mm@kvack.org, akpm@linux-foundation.org
Cc: Wei Xu, Huang Ying, Yang Shi, Davidlohr Bueso, Tim C Chen,
 Michal Hocko, Linux Kernel Mailing List, Hesham Almatary, Dave Hansen,
 Jonathan Cameron, Alistair Popple, Dan Williams, Johannes Weiner,
 jvgediya.oss@gmail.com, "Aneesh Kumar K.V", Jagdish Gediya
Subject: [PATCH v10 1/8] mm/demotion: Add support for explicit memory tiers
Date: Wed, 20 Jul 2022 08:29:13 +0530
Message-Id: <20220720025920.1373558-2-aneesh.kumar@linux.ibm.com>
X-Mailer: git-send-email 2.36.1
In-Reply-To: <20220720025920.1373558-1-aneesh.kumar@linux.ibm.com>
References: <20220720025920.1373558-1-aneesh.kumar@linux.ibm.com>

In the current kernel, memory tiers are defined implicitly via a
demotion path relationship between NUMA nodes, which is created during
the kernel initialization and updated when a NUMA node is hot-added or
hot-removed. The current implementation puts all nodes with CPU into
the top tier, and builds the tier hierarchy tier-by-tier by
establishing the per-node demotion targets based on the distances
between nodes.

This current memory tier kernel interface needs to be improved for
several important use cases:

The current tier initialization code always initializes each
memory-only NUMA node into a lower tier. But a memory-only NUMA node
may have a high performance memory device (e.g.
a DRAM device attached via CXL.mem or a DRAM-backed memory-only node on
a virtual machine) and should be put into a higher tier.

The current tier hierarchy always puts CPU nodes into the top tier.
But on a system with HBM or GPU devices, the memory-only NUMA nodes
mapping these devices should be in the top tier, and DRAM nodes with
CPUs are better placed into the next lower tier.

With the current kernel, a higher tier node can only be demoted to
selected nodes on the next lower tier as defined by the demotion path,
not to any other node from any lower tier. This strict, hard-coded
demotion order does not work in all use cases (e.g. some use cases may
want to allow cross-socket demotion to another node in the same
demotion tier as a fallback when the preferred demotion node is out of
space). This demotion order is also inconsistent with the page
allocation fallback order when all the nodes in a higher tier are out
of space: the page allocation can fall back to any node from any lower
tier, whereas the demotion order doesn't allow that.

The current kernel also doesn't provide any interfaces for userspace
to learn about the memory tier hierarchy in order to optimize its
memory allocations.

This patch series addresses the above by defining memory tiers
explicitly.

The Linux kernel presents memory devices as NUMA nodes and each memory
device is of a specific type. The memory type of a device is
represented by its performance level. A memory tier corresponds to a
range of performance levels. This allows for classifying memory
devices with a specific performance range into a memory tier.

This patch configures the range/chunk size to be 128. The default DRAM
performance level is 512. We can have 4 memory tiers below the default
DRAM performance level, which cover the ranges 0 - 127, 128 - 255,
256 - 383, and 384 - 511. Slower memory devices like persistent memory
will have performance levels below the default DRAM level and hence
will be placed in these 4 lower tiers.
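
As a worked illustration of the ranges above, the following is a
minimal userspace sketch, not code from this patch. It reuses the
MEMTIER_* constants the patch defines and mirrors the round-down step
in find_create_memory_tier(); the helper names memtier_base() and
memtier_check_ranges() are hypothetical and exist only for this
example.

```c
#include <assert.h>

/* Constants as defined in include/linux/memory-tiers.h by this patch. */
#define MEMTIER_CHUNK_BITS	7
#define MEMTIER_PERF_LEVEL_DRAM	(1 << (MEMTIER_CHUNK_BITS + 2))	/* 512 */
#define MEMTIER_PERF_LEVEL_PMEM	(1 << MEMTIER_CHUNK_BITS)	/* 128 */

/*
 * Hypothetical helper (not in the patch): return the lowest perf level
 * of the tier that covers perf_level. As in the patch, zero means "not
 * set by the driver" and falls back to the default DRAM level. Rounding
 * down to a multiple of the chunk size (128) is done with a mask, which
 * is valid because the chunk size is a power of two.
 */
unsigned int memtier_base(unsigned int perf_level)
{
	unsigned int chunk = 1u << MEMTIER_CHUNK_BITS;

	if (!perf_level)
		perf_level = MEMTIER_PERF_LEVEL_DRAM;
	return perf_level & ~(chunk - 1);
}

/* Hypothetical self-check of the four ranges below the DRAM level. */
void memtier_check_ranges(void)
{
	assert(memtier_base(1) == 0 && memtier_base(127) == 0);
	assert(memtier_base(128) == 128 && memtier_base(255) == 128);
	assert(memtier_base(256) == 256 && memtier_base(383) == 256);
	assert(memtier_base(384) == 384 && memtier_base(511) == 384);
	/* Default DRAM starts its own tier above those four. */
	assert(memtier_base(MEMTIER_PERF_LEVEL_DRAM) == 512);
	/* The PMEM level sits in 128 - 255, leaving one tier below it. */
	assert(memtier_base(MEMTIER_PERF_LEVEL_PMEM) == 128);
}
```

Two devices share a tier exactly when memtier_base() returns the same
value for both, which is why a device at level 200 and one at level
130 would be grouped together under this scheme.
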
During reclaim, we migrate pages from fast (higher) tiers to slow
(lower) tiers when the fast (higher) tier is under memory pressure.

A kernel parameter is provided to override the default memory tier.

Link: https://lore.kernel.org/linux-mm/CAAPL-u9Wv+nH1VOZTj=9p9S70Y3Qz3+63EkqncRDdHfubsrjfw@mail.gmail.com
Link: https://lore.kernel.org/linux-mm/7b72ccf4-f4ae-cb4e-f411-74d055482026@linux.ibm.com

Signed-off-by: Jagdish Gediya
Signed-off-by: Aneesh Kumar K.V
---
 include/linux/memory-tiers.h |  18 +++++++
 include/linux/node.h         |   6 +++
 mm/Makefile                  |   1 +
 mm/memory-tiers.c            | 101 +++++++++++++++++++++++++++++++++++
 4 files changed, 126 insertions(+)
 create mode 100644 include/linux/memory-tiers.h
 create mode 100644 mm/memory-tiers.c

diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
new file mode 100644
index 000000000000..f28f9910a4e7
--- /dev/null
+++ b/include/linux/memory-tiers.h
@@ -0,0 +1,18 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_MEMORY_TIERS_H
+#define _LINUX_MEMORY_TIERS_H
+
+#ifdef CONFIG_NUMA
+/*
+ * Each tier cover a performance level chunk size of 128
+ */
+#define MEMTIER_CHUNK_BITS	7
+/*
+ * For now let's have 4 memory tier below default DRAM tier.
+ */
+#define MEMTIER_PERF_LEVEL_DRAM	(1 << (MEMTIER_CHUNK_BITS + 2))
+/* leave one tier below this slow pmem */
+#define MEMTIER_PERF_LEVEL_PMEM	(1 << MEMTIER_CHUNK_BITS)
+
+#endif	/* CONFIG_NUMA */
+#endif  /* _LINUX_MEMORY_TIERS_H */
diff --git a/include/linux/node.h b/include/linux/node.h
index 40d641a8bfb0..a2a16d4104fd 100644
--- a/include/linux/node.h
+++ b/include/linux/node.h
@@ -92,6 +92,12 @@ struct node {
 	struct list_head cache_attrs;
 	struct device *cache_dev;
 #endif
+	/*
+	 * For memory devices, perf_level describes
+	 * the device performance and how it should be used
+	 * while building a memory hierarchy.
+	 */
+	int perf_level;
 };
 
 struct memory_block;
diff --git a/mm/Makefile b/mm/Makefile
index 6f9ffa968a1a..d30acebc2164 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -92,6 +92,7 @@ obj-$(CONFIG_KFENCE) += kfence/
 obj-$(CONFIG_FAILSLAB) += failslab.o
 obj-$(CONFIG_MEMTEST) += memtest.o
 obj-$(CONFIG_MIGRATION) += migrate.o
+obj-$(CONFIG_NUMA) += memory-tiers.o
 obj-$(CONFIG_DEVICE_MIGRATION) += migrate_device.o
 obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o khugepaged.o
 obj-$(CONFIG_PAGE_COUNTER) += page_counter.o
diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
new file mode 100644
index 000000000000..61bb84c54091
--- /dev/null
+++ b/mm/memory-tiers.c
@@ -0,0 +1,101 @@
+// SPDX-License-Identifier: GPL-2.0
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+
+struct memory_tier {
+	struct list_head list;
+	int perf_level;
+	nodemask_t nodelist;
+};
+
+static LIST_HEAD(memory_tiers);
+static DEFINE_MUTEX(memory_tier_lock);
+
+/*
+ * For now let's have 4 memory tier below default DRAM tier.
+ */
+static unsigned int default_memtier_perf_level = MEMTIER_PERF_LEVEL_DRAM;
+core_param(default_memory_tier_perf_level, default_memtier_perf_level, uint, 0644);
+/*
+ * performance levels are grouped into memtiers each of chunk size
+ * memtier_perf_chunk
+ */
+static struct memory_tier *find_create_memory_tier(unsigned int perf_level)
+{
+	bool found_slot = false;
+	struct list_head *ent;
+	struct memory_tier *memtier, *new_memtier;
+	unsigned int memtier_perf_chunk_size = 1 << MEMTIER_CHUNK_BITS;
+	/*
+	 * zero is special in that it indicates uninitialized
+	 * perf level by respective driver. Pick default memory
+	 * tier perf level for that.
+	 */
+	if (!perf_level)
+		perf_level = default_memtier_perf_level;
+
+	lockdep_assert_held_once(&memory_tier_lock);
+
+	perf_level = round_down(perf_level, memtier_perf_chunk_size);
+	list_for_each(ent, &memory_tiers) {
+		memtier = list_entry(ent, struct memory_tier, list);
+		if (perf_level == memtier->perf_level) {
+			return memtier;
+		} else if (perf_level < memtier->perf_level) {
+			found_slot = true;
+			break;
+		}
+	}
+
+	new_memtier = kzalloc(sizeof(struct memory_tier), GFP_KERNEL);
+	if (!new_memtier)
+		return ERR_PTR(-ENOMEM);
+
+	new_memtier->perf_level = perf_level;
+	if (found_slot)
+		list_add_tail(&new_memtier->list, ent);
+	else
+		list_add_tail(&new_memtier->list, &memory_tiers);
+	return new_memtier;
+}
+
+static int __init memory_tier_init(void)
+{
+	int node;
+	struct memory_tier *memtier;
+
+	/*
+	 * Since this is early during boot, we could avoid
+	 * holding memtory_tier_lock. But keep it simple by
+	 * holding locks. So we can add lock held debug checks
+	 * in other functions.
+	 */
+	mutex_lock(&memory_tier_lock);
+	memtier = find_create_memory_tier(default_memtier_perf_level);
+	if (IS_ERR(memtier))
+		panic("%s() failed to register memory tier: %ld\n",
+		      __func__, PTR_ERR(memtier));
+
+	/* CPU only nodes are not part of memory tiers. */
+	memtier->nodelist = node_states[N_MEMORY];
+
+	/*
+	 * nodes that are already online and that doesn't
+	 * have perf level assigned is assigned a default perf
+	 * level.
+	 */
+	for_each_node_state(node, N_MEMORY) {
+		struct node *node_property = node_devices[node];
+
+		if (!node_property->perf_level)
+			node_property->perf_level = default_memtier_perf_level;
+	}
+	mutex_unlock(&memory_tier_lock);
+	return 0;
+}
+subsys_initcall(memory_tier_init);

From patchwork Wed Jul 20 02:59:14 2022
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: "Aneesh Kumar K.V"
X-Patchwork-Id: 12923295
From: "Aneesh Kumar K.V"
To: linux-mm@kvack.org, akpm@linux-foundation.org
Cc: Wei Xu, Huang Ying, Yang Shi, Davidlohr Bueso, Tim C Chen,
 Michal Hocko, Linux Kernel Mailing List, Hesham Almatary, Dave Hansen,
 Jonathan Cameron, Alistair Popple, Dan Williams, Johannes Weiner,
 jvgediya.oss@gmail.com, "Aneesh Kumar K.V"
Subject: [PATCH v10 2/8] mm/demotion: Move memory demotion related code
Date: Wed, 20 Jul 2022 08:29:14 +0530
Message-Id: <20220720025920.1373558-3-aneesh.kumar@linux.ibm.com>
X-Mailer: git-send-email 2.36.1
In-Reply-To: <20220720025920.1373558-1-aneesh.kumar@linux.ibm.com>
References: <20220720025920.1373558-1-aneesh.kumar@linux.ibm.com>

This moves memory demotion related code to mm/memory-tiers.c.
No functional change in this patch.

Signed-off-by: Aneesh Kumar K.V
---
 include/linux/memory-tiers.h |  6 ++++
 include/linux/migrate.h      |  2 --
 mm/memory-tiers.c            | 62 ++++++++++++++++++++++++++++++++++++
 mm/migrate.c                 | 60 +---------------------------------
 mm/vmscan.c                  |  1 +
 5 files changed, 70 insertions(+), 61 deletions(-)

diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
index f28f9910a4e7..ef380a39db3a 100644
--- a/include/linux/memory-tiers.h
+++ b/include/linux/memory-tiers.h
@@ -2,6 +2,7 @@
 #ifndef _LINUX_MEMORY_TIERS_H
 #define _LINUX_MEMORY_TIERS_H
 
+#include
 #ifdef CONFIG_NUMA
 /*
  * Each tier cover a performance level chunk size of 128
@@ -14,5 +15,10 @@
 /* leave one tier below this slow pmem */
 #define MEMTIER_PERF_LEVEL_PMEM	(1 << MEMTIER_CHUNK_BITS)
 
+extern bool numa_demotion_enabled;
+
+#else
+
+#define numa_demotion_enabled	false
 #endif	/* CONFIG_NUMA */
 #endif  /* _LINUX_MEMORY_TIERS_H */
diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index 069a89e847f3..43e737215f33 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -78,7 +78,6 @@ static inline int migrate_huge_page_move_mapping(struct address_space *mapping,
 #if defined(CONFIG_MIGRATION) && defined(CONFIG_NUMA)
 extern void set_migration_target_nodes(void);
 extern void migrate_on_reclaim_init(void);
-extern bool numa_demotion_enabled;
 extern int next_demotion_node(int node);
 #else
 static inline void set_migration_target_nodes(void) {}
@@ -87,7 +86,6 @@ static inline int next_demotion_node(int node)
 {
 	return NUMA_NO_NODE;
 }
-#define numa_demotion_enabled	false
 #endif
 
 #ifdef CONFIG_COMPACTION
diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
index 61bb84c54091..41a21cc5ae55 100644
--- a/mm/memory-tiers.c
+++ b/mm/memory-tiers.c
@@ -99,3 +99,65 @@ static int __init memory_tier_init(void)
 	return 0;
 }
 subsys_initcall(memory_tier_init);
+
+bool numa_demotion_enabled = false;
+
+#ifdef CONFIG_MIGRATION
+#ifdef CONFIG_SYSFS
+static ssize_t numa_demotion_enabled_show(struct kobject *kobj,
+					  struct kobj_attribute *attr, char *buf)
+{
+	return sysfs_emit(buf, "%s\n",
+			  numa_demotion_enabled ? "true" : "false");
+}
+
+static ssize_t numa_demotion_enabled_store(struct kobject *kobj,
+					   struct kobj_attribute *attr,
+					   const char *buf, size_t count)
+{
+	ssize_t ret;
+
+	ret = kstrtobool(buf, &numa_demotion_enabled);
+	if (ret)
+		return ret;
+
+	return count;
+}
+
+static struct kobj_attribute numa_demotion_enabled_attr =
+	__ATTR(demotion_enabled, 0644, numa_demotion_enabled_show,
+	       numa_demotion_enabled_store);
+
+static struct attribute *numa_attrs[] = {
+	&numa_demotion_enabled_attr.attr,
+	NULL,
+};
+
+static const struct attribute_group numa_attr_group = {
+	.attrs = numa_attrs,
+};
+
+static int __init numa_init_sysfs(void)
+{
+	int err;
+	struct kobject *numa_kobj;
+
+	numa_kobj = kobject_create_and_add("numa", mm_kobj);
+	if (!numa_kobj) {
+		pr_err("failed to create numa kobject\n");
+		return -ENOMEM;
+	}
+	err = sysfs_create_group(numa_kobj, &numa_attr_group);
+	if (err) {
+		pr_err("failed to register numa group\n");
+		goto delete_obj;
+	}
+	return 0;
+
+delete_obj:
+	kobject_put(numa_kobj);
+	return err;
+}
+subsys_initcall(numa_init_sysfs);
+#endif /* CONFIG_SYSFS */
+#endif
diff --git a/mm/migrate.c b/mm/migrate.c
index 6c1ea61f39d8..fce7d4a9e940 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2509,64 +2509,6 @@ void __init migrate_on_reclaim_init(void)
 	set_migration_target_nodes();
 	cpus_read_unlock();
 }
+#endif /* CONFIG_NUMA */
 
-bool numa_demotion_enabled = false;
-
-#ifdef CONFIG_SYSFS
-static ssize_t numa_demotion_enabled_show(struct kobject *kobj,
-					  struct kobj_attribute *attr, char *buf)
-{
-	return sysfs_emit(buf, "%s\n",
-			  numa_demotion_enabled ? "true" : "false");
-}
-
-static ssize_t numa_demotion_enabled_store(struct kobject *kobj,
-					   struct kobj_attribute *attr,
-					   const char *buf, size_t count)
-{
-	ssize_t ret;
-
-	ret = kstrtobool(buf, &numa_demotion_enabled);
-	if (ret)
-		return ret;
-
-	return count;
-}
-
-static struct kobj_attribute numa_demotion_enabled_attr =
-	__ATTR(demotion_enabled, 0644, numa_demotion_enabled_show,
-	       numa_demotion_enabled_store);
-
-static struct attribute *numa_attrs[] = {
-	&numa_demotion_enabled_attr.attr,
-	NULL,
-};
-
-static const struct attribute_group numa_attr_group = {
-	.attrs = numa_attrs,
-};
-
-static int __init numa_init_sysfs(void)
-{
-	int err;
-	struct kobject *numa_kobj;
 
-	numa_kobj = kobject_create_and_add("numa", mm_kobj);
-	if (!numa_kobj) {
-		pr_err("failed to create numa kobject\n");
-		return -ENOMEM;
-	}
-	err = sysfs_create_group(numa_kobj, &numa_attr_group);
-	if (err) {
-		pr_err("failed to register numa group\n");
-		goto delete_obj;
-	}
-	return 0;
-
-delete_obj:
-	kobject_put(numa_kobj);
-	return err;
-}
-subsys_initcall(numa_init_sysfs);
-#endif /* CONFIG_SYSFS */
-#endif /* CONFIG_NUMA */
diff --git a/mm/vmscan.c b/mm/vmscan.c
index f7d9a683e3a7..3a8f78277f99 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -50,6 +50,7 @@
 #include
 #include
 #include
+#include
 
 #include
 #include

From patchwork Wed Jul 20 02:59:15 2022
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: "Aneesh Kumar K.V"
X-Patchwork-Id: 12923296
From: "Aneesh Kumar K.V"
To: linux-mm@kvack.org, akpm@linux-foundation.org
Cc: Wei Xu, Huang Ying, Yang Shi, Davidlohr Bueso, Tim C Chen,
 Michal Hocko, Linux Kernel Mailing List, Hesham Almatary, Dave Hansen,
 Jonathan Cameron, Alistair Popple, Dan Williams, Johannes Weiner,
 jvgediya.oss@gmail.com, "Aneesh Kumar K.V"
Subject: [PATCH v10 3/8] mm/demotion: Add hotplug callbacks to handle new numa node onlined
Date: Wed, 20 Jul 2022 08:29:15 +0530
Message-Id: <20220720025920.1373558-4-aneesh.kumar@linux.ibm.com>
X-Mailer: git-send-email 2.36.1
In-Reply-To: <20220720025920.1373558-1-aneesh.kumar@linux.ibm.com>
References: <20220720025920.1373558-1-aneesh.kumar@linux.ibm.com>
Authentication-Results:
imf22.hostedemail.com; dkim=pass header.d=ibm.com header.s=pp1 header.b=THJC44ww; dmarc=pass (policy=none) header.from=ibm.com; spf=pass (imf22.hostedemail.com: domain of aneesh.kumar@linux.ibm.com designates 148.163.156.1 as permitted sender) smtp.mailfrom=aneesh.kumar@linux.ibm.com X-Rspam-User: X-Rspamd-Server: rspam04 X-Stat-Signature: odrysfzbbwqsiuc7cor37afoxc4wz1p3 X-HE-Tag: 1658286013-923051 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: If a newly onlined NUMA node doesn't have a performance level assigned, the kernel adds the node to the default memory tier. Signed-off-by: Aneesh Kumar K.V --- include/linux/memory-tiers.h | 1 + mm/memory-tiers.c | 75 ++++++++++++++++++++++++++++++++++++ 2 files changed, 76 insertions(+) diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h index ef380a39db3a..3d5f14d57ae6 100644 --- a/include/linux/memory-tiers.h +++ b/include/linux/memory-tiers.h @@ -14,6 +14,7 @@ #define MEMTIER_PERF_LEVEL_DRAM (1 << (MEMTIER_CHUNK_BITS + 2)) /* leave one tier below this slow pmem */ #define MEMTIER_PERF_LEVEL_PMEM (1 << MEMTIER_CHUNK_BITS) +#define MEMTIER_HOTPLUG_PRIO 100 extern bool numa_demotion_enabled; diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c index 41a21cc5ae55..cc3a47ec18e4 100644 --- a/mm/memory-tiers.c +++ b/mm/memory-tiers.c @@ -5,6 +5,7 @@ #include #include #include +#include #include struct memory_tier { @@ -64,6 +65,78 @@ static struct memory_tier *find_create_memory_tier(unsigned int perf_level) return new_memtier; } +static struct memory_tier *__node_get_memory_tier(int node) +{ + struct memory_tier *memtier; + + list_for_each_entry(memtier, &memory_tiers, list) { + if (node_isset(node, memtier->nodelist)) + return memtier; + } + return NULL; +} + +static void init_node_memory_tier(int node) +{ + int perf_level; + struct memory_tier *memtier; + + mutex_lock(&memory_tier_lock); +
+ memtier = __node_get_memory_tier(node); + if (!memtier) { + perf_level = node_devices[node]->perf_level; + memtier = find_create_memory_tier(perf_level); + node_set(node, memtier->nodelist); + } + mutex_unlock(&memory_tier_lock); +} + +static void clear_node_memory_tier(int node) +{ + struct memory_tier *memtier; + + mutex_lock(&memory_tier_lock); + memtier = __node_get_memory_tier(node); + if (memtier) + node_clear(node, memtier->nodelist); + mutex_unlock(&memory_tier_lock); +} + +/* + * This runs whether reclaim-based migration is enabled or not, + * which ensures that the user can turn reclaim-based migration + * at any time without needing to recalculate migration targets. + */ +static int __meminit migrate_on_reclaim_callback(struct notifier_block *self, + unsigned long action, void *_arg) +{ + struct memory_notify *arg = _arg; + + /* + * Only update the node migration order when a node is + * changing status, like online->offline. + */ + if (arg->status_change_nid < 0) + return notifier_from_errno(0); + + switch (action) { + case MEM_OFFLINE: + clear_node_memory_tier(arg->status_change_nid); + break; + case MEM_ONLINE: + init_node_memory_tier(arg->status_change_nid); + break; + } + + return notifier_from_errno(0); +} + +static void __init migrate_on_reclaim_init(void) +{ + hotplug_memory_notifier(migrate_on_reclaim_callback, MEMTIER_HOTPLUG_PRIO); +} + static int __init memory_tier_init(void) { int node; @@ -96,6 +169,8 @@ static int __init memory_tier_init(void) node_property->perf_level = default_memtier_perf_level; } mutex_unlock(&memory_tier_lock); + + migrate_on_reclaim_init(); return 0; } subsys_initcall(memory_tier_init); From patchwork Wed Jul 20 02:59:16 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Aneesh Kumar K.V" X-Patchwork-Id: 12923297 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from 
kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 75DDFC43334 for ; Wed, 20 Jul 2022 03:00:22 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 0F62F6B007D; Tue, 19 Jul 2022 23:00:22 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id 07FAF6B0080; Tue, 19 Jul 2022 23:00:22 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id E48516B007D; Tue, 19 Jul 2022 23:00:21 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0014.hostedemail.com [216.40.44.14]) by kanga.kvack.org (Postfix) with ESMTP id CEDE46B007D for ; Tue, 19 Jul 2022 23:00:21 -0400 (EDT) Received: from smtpin25.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay07.hostedemail.com (Postfix) with ESMTP id A30FB1603D7 for ; Wed, 20 Jul 2022 03:00:21 +0000 (UTC) X-FDA: 79705974642.25.E3BAFB4 Received: from mx0b-001b2d01.pphosted.com (mx0b-001b2d01.pphosted.com [148.163.158.5]) by imf04.hostedemail.com (Postfix) with ESMTP id 3BB564006B for ; Wed, 20 Jul 2022 03:00:21 +0000 (UTC) Received: from pps.filterd (m0098417.ppops.net [127.0.0.1]) by mx0a-001b2d01.pphosted.com (8.17.1.5/8.17.1.5) with ESMTP id 26K2tcVF009739; Wed, 20 Jul 2022 03:00:14 GMT DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=ibm.com; h=from : to : cc : subject : date : message-id : in-reply-to : references : mime-version : content-transfer-encoding; s=pp1; bh=TOqzPn2YLz1Y8ysnYylImZnW1OJnHJTjryVGOSd1RE8=; b=jDJ29vPBCMYcXZVJHyBqsGnsPNU1qBmWK3k2FJo+oxW6sij2TX2yJG5dvbxmwNdiAxTH jvlj1APxRmzgEbtogkGUFEOh11mVQCj2cGSnNleIPwCsHLRDSv8NoFglO8f/UMRAJKTY MV2ws937mH6rkNeETXOTHiMsyyrqbIqkDNW1g5nME6lxRiIx2Acaz447KQdCNtCEfUfO m834hhn1lSszZIvi0LSs/u3HVmKhixZ3FXTfY/lDh61UUOCai3qKU1mf0YSxvZIgTJ3P 0tWlcmeOJOw5fKhKbehITZAYf1wFbS0LYHVz7s6DwkPGv0tXF5kX+kJpPmxHFCHvvTcf vQ== Received: from pps.reinject (localhost [127.0.0.1]) by mx0a-001b2d01.pphosted.com (PPS) with ESMTPS 
id 3he9bsr3bv-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Wed, 20 Jul 2022 03:00:14 +0000 Received: from m0098417.ppops.net (m0098417.ppops.net [127.0.0.1]) by pps.reinject (8.17.1.5/8.17.1.5) with ESMTP id 26K2upS6018571; Wed, 20 Jul 2022 03:00:13 GMT Received: from ppma02dal.us.ibm.com (a.bd.3ea9.ip4.static.sl-reverse.com [169.62.189.10]) by mx0a-001b2d01.pphosted.com (PPS) with ESMTPS id 3he9bsr3ag-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Wed, 20 Jul 2022 03:00:13 +0000 Received: from pps.filterd (ppma02dal.us.ibm.com [127.0.0.1]) by ppma02dal.us.ibm.com (8.16.1.2/8.16.1.2) with SMTP id 26K2phI3010779; Wed, 20 Jul 2022 03:00:12 GMT Received: from b03cxnp07027.gho.boulder.ibm.com (b03cxnp07027.gho.boulder.ibm.com [9.17.130.14]) by ppma02dal.us.ibm.com with ESMTP id 3hbmy9ms5u-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Wed, 20 Jul 2022 03:00:12 +0000 Received: from b03ledav002.gho.boulder.ibm.com (b03ledav002.gho.boulder.ibm.com [9.17.130.233]) by b03cxnp07027.gho.boulder.ibm.com (8.14.9/8.14.9/NCO v10.0) with ESMTP id 26K30ApM38470110 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-GCM-SHA384 bits=256 verify=OK); Wed, 20 Jul 2022 03:00:11 GMT Received: from b03ledav002.gho.boulder.ibm.com (unknown [127.0.0.1]) by IMSVA (Postfix) with ESMTP id D9C6713604F; Wed, 20 Jul 2022 03:00:10 +0000 (GMT) Received: from b03ledav002.gho.boulder.ibm.com (unknown [127.0.0.1]) by IMSVA (Postfix) with ESMTP id 612CB136059; Wed, 20 Jul 2022 03:00:04 +0000 (GMT) Received: from skywalker.ibmuc.com (unknown [9.43.15.129]) by b03ledav002.gho.boulder.ibm.com (Postfix) with ESMTP; Wed, 20 Jul 2022 03:00:04 +0000 (GMT) From: "Aneesh Kumar K.V" To: linux-mm@kvack.org, akpm@linux-foundation.org Cc: Wei Xu , Huang Ying , Yang Shi , Davidlohr Bueso , Tim C Chen , Michal Hocko , Linux Kernel Mailing List , Hesham Almatary , Dave Hansen , Jonathan Cameron , Alistair Popple , Dan Williams , Johannes 
Weiner , jvgediya.oss@gmail.com, "Aneesh Kumar K.V" Subject: [PATCH v10 4/8] mm/demotion/dax/kmem: Set node's performance level to MEMTIER_PERF_LEVEL_PMEM Date: Wed, 20 Jul 2022 08:29:16 +0530 Message-Id: <20220720025920.1373558-5-aneesh.kumar@linux.ibm.com> X-Mailer: git-send-email 2.36.1 In-Reply-To: <20220720025920.1373558-1-aneesh.kumar@linux.ibm.com> References: <20220720025920.1373558-1-aneesh.kumar@linux.ibm.com> MIME-Version: 1.0 X-TM-AS-GCONF: 00 X-Proofpoint-GUID: zvG1MzLRgBMzfwyym8kNHOGIQn3O2wnN X-Proofpoint-ORIG-GUID: 4-N5HJDyapfuv0NGvi71b9_FK8alNmqz X-Proofpoint-Virus-Version: vendor=baseguard engine=ICAP:2.0.205,Aquarius:18.0.883,Hydra:6.0.517,FMLib:17.11.122.1 definitions=2022-07-19_10,2022-07-19_01,2022-06-22_01 X-Proofpoint-Spam-Details: rule=outbound_notspam policy=outbound score=0 mlxscore=0 phishscore=0 impostorscore=0 malwarescore=0 suspectscore=0 mlxlogscore=999 priorityscore=1501 clxscore=1015 lowpriorityscore=0 bulkscore=0 adultscore=0 spamscore=0 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.12.0-2206140000 definitions=main-2207200008 ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1658286021; a=rsa-sha256; cv=none; b=O2tgGJuwOqCkjKOKDAGoF/TWw9f9o4uFZbTmHH1r8W+2C30wCdOh1sJenLE0EqA4RoHZl1 p3i6qRtDUfnq7yQG5W/GdUqfFGm80OzCr/l7sl/wPHGAWp0gDriubcmSaMldzEeYJkrec8 52QS1pU26iF+TIKrQsEZBG6G9kx4TiU= ARC-Authentication-Results: i=1; imf04.hostedemail.com; dkim=pass header.d=ibm.com header.s=pp1 header.b=jDJ29vPB; spf=pass (imf04.hostedemail.com: domain of aneesh.kumar@linux.ibm.com designates 148.163.158.5 as permitted sender) smtp.mailfrom=aneesh.kumar@linux.ibm.com; dmarc=pass (policy=none) header.from=ibm.com ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1658286021; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-transfer-encoding:content-transfer-encoding: 
in-reply-to:in-reply-to:references:references:dkim-signature; bh=TOqzPn2YLz1Y8ysnYylImZnW1OJnHJTjryVGOSd1RE8=; b=bunArfDFtpe7eZ/V4PlPAqBllvsXGASQjbT4endmZZvfn0mf1G8QoCv9zFwvjz9qSC16gI DBzy1KObKlB/Bba+qvaV0RUjtEWGu21GiQIQ6p0Y6bgRaYq8ve+qa1yTnKoEbZlp0xhaZJ iNO4fwEVu/HyT2DEWFDHzafOWsZpbNY= X-Stat-Signature: 51dhwq1psfthro639ibhw89dmj11bfjz X-Rspamd-Queue-Id: 3BB564006B Authentication-Results: imf04.hostedemail.com; dkim=pass header.d=ibm.com header.s=pp1 header.b=jDJ29vPB; spf=pass (imf04.hostedemail.com: domain of aneesh.kumar@linux.ibm.com designates 148.163.158.5 as permitted sender) smtp.mailfrom=aneesh.kumar@linux.ibm.com; dmarc=pass (policy=none) header.from=ibm.com X-Rspam-User: X-Rspamd-Server: rspam10 X-HE-Tag: 1658286021-848110 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: By default, all nodes are assigned to the default memory tier, which is the memory tier designated for nodes with DRAM. Set the dax kmem device node's tier to a slower memory tier by assigning it the performance level MEMTIER_PERF_LEVEL_PMEM. The PMEM tier appears below the default memory tier in the demotion order.
Signed-off-by: Aneesh Kumar K.V Reported-by: kernel test robot --- arch/powerpc/platforms/pseries/papr_scm.c | 41 ++++++++++++++++++++--- drivers/acpi/nfit/core.c | 41 ++++++++++++++++++++++- 2 files changed, 76 insertions(+), 6 deletions(-) diff --git a/arch/powerpc/platforms/pseries/papr_scm.c b/arch/powerpc/platforms/pseries/papr_scm.c index 82cae08976bc..3b6164418d6f 100644 --- a/arch/powerpc/platforms/pseries/papr_scm.c +++ b/arch/powerpc/platforms/pseries/papr_scm.c @@ -14,6 +14,8 @@ #include #include #include +#include +#include #include #include @@ -98,6 +100,7 @@ struct papr_scm_priv { bool hcall_flush_required; uint64_t bound_addr; + int target_node; struct nvdimm_bus_descriptor bus_desc; struct nvdimm_bus *bus; @@ -1278,6 +1281,7 @@ static int papr_scm_nvdimm_init(struct papr_scm_priv *p) p->bus_desc.module = THIS_MODULE; p->bus_desc.of_node = p->pdev->dev.of_node; p->bus_desc.provider_name = kstrdup(p->pdev->name, GFP_KERNEL); + p->target_node = dev_to_node(&p->pdev->dev); /* Set the dimm command family mask to accept PDSMs */ set_bit(NVDIMM_FAMILY_PAPR, &p->bus_desc.dimm_family_mask); @@ -1322,7 +1326,7 @@ static int papr_scm_nvdimm_init(struct papr_scm_priv *p) mapping.size = p->blocks * p->block_size; // XXX: potential overflow? 
memset(&ndr_desc, 0, sizeof(ndr_desc)); - target_nid = dev_to_node(&p->pdev->dev); + target_nid = p->target_node; online_nid = numa_map_to_online_node(target_nid); ndr_desc.numa_node = online_nid; ndr_desc.target_node = target_nid; @@ -1582,15 +1586,42 @@ static struct platform_driver papr_scm_driver = { }, }; +static int papr_scm_callback(struct notifier_block *self, + unsigned long action, void *arg) +{ + struct memory_notify *mnb = arg; + int nid = mnb->status_change_nid; + struct papr_scm_priv *p; + + if (nid == NUMA_NO_NODE || action != MEM_ONLINE) + return NOTIFY_OK; + + mutex_lock(&papr_ndr_lock); + list_for_each_entry(p, &papr_nd_regions, region_list) { + if (p->target_node == nid) { + node_devices[nid]->perf_level = MEMTIER_PERF_LEVEL_PMEM; + break; + } + } + + mutex_unlock(&papr_ndr_lock); + return NOTIFY_OK; +} + static int __init papr_scm_init(void) { int ret; ret = platform_driver_register(&papr_scm_driver); - if (!ret) - mce_register_notifier(&mce_ue_nb); - - return ret; + if (ret) + return ret; + mce_register_notifier(&mce_ue_nb); + /* + * Register a memory hotplug notifier one priority above + * MEMTIER_HOTPLUG_PRIO so that the node's perf level is updated first.
+ */ + hotplug_memory_notifier(papr_scm_callback, MEMTIER_HOTPLUG_PRIO + 1); + return 0; } module_init(papr_scm_init); diff --git a/drivers/acpi/nfit/core.c b/drivers/acpi/nfit/core.c index ae5f4acf2675..7ea1017ef790 100644 --- a/drivers/acpi/nfit/core.c +++ b/drivers/acpi/nfit/core.c @@ -15,6 +15,8 @@ #include #include #include +#include +#include #include #include #include "intel.h" @@ -3470,6 +3472,39 @@ static struct acpi_driver acpi_nfit_driver = { }, }; +static int nfit_callback(struct notifier_block *self, + unsigned long action, void *arg) +{ + bool found = false; + struct memory_notify *mnb = arg; + int nid = mnb->status_change_nid; + struct nfit_spa *nfit_spa; + struct acpi_nfit_desc *acpi_desc; + + if (nid == NUMA_NO_NODE || action != MEM_ONLINE) + return NOTIFY_OK; + + mutex_lock(&acpi_desc_lock); + list_for_each_entry(acpi_desc, &acpi_descs, list) { + mutex_lock(&acpi_desc->init_mutex); + list_for_each_entry(nfit_spa, &acpi_desc->spas, list) { + struct acpi_nfit_system_address *spa = nfit_spa->spa; + int target_node = pxm_to_node(spa->proximity_domain); + + if (target_node == nid) { + node_devices[nid]->perf_level = MEMTIER_PERF_LEVEL_PMEM; + found = true; + break; + } + } + mutex_unlock(&acpi_desc->init_mutex); + if (found) + break; + } + mutex_unlock(&acpi_desc_lock); + return NOTIFY_OK; +} + static __init int nfit_init(void) { int ret; @@ -3509,7 +3544,11 @@ static __init int nfit_init(void) nfit_mce_unregister(); destroy_workqueue(nfit_wq); } - + /* + * Register a memory hotplug notifier one priority above + * MEMTIER_HOTPLUG_PRIO so that the node's perf level is updated first.
+ */ + hotplug_memory_notifier(nfit_callback, MEMTIER_HOTPLUG_PRIO + 1); return ret; } From patchwork Wed Jul 20 02:59:17 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Aneesh Kumar K.V" X-Patchwork-Id: 12923298 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 2F90CC433EF for ; Wed, 20 Jul 2022 03:00:33 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id BCC106B0073; Tue, 19 Jul 2022 23:00:32 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id B550E6B0074; Tue, 19 Jul 2022 23:00:32 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 9CE076B007E; Tue, 19 Jul 2022 23:00:32 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0017.hostedemail.com [216.40.44.17]) by kanga.kvack.org (Postfix) with ESMTP id 865446B0073 for ; Tue, 19 Jul 2022 23:00:32 -0400 (EDT) Received: from smtpin12.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay05.hostedemail.com (Postfix) with ESMTP id 4C4F840449 for ; Wed, 20 Jul 2022 03:00:32 +0000 (UTC) X-FDA: 79705975104.12.E304B5B Received: from mx0a-001b2d01.pphosted.com (mx0a-001b2d01.pphosted.com [148.163.156.1]) by imf03.hostedemail.com (Postfix) with ESMTP id BFA7A2005E for ; Wed, 20 Jul 2022 03:00:29 +0000 (UTC) Received: from pps.filterd (m0187473.ppops.net [127.0.0.1]) by mx0a-001b2d01.pphosted.com (8.17.1.5/8.17.1.5) with ESMTP id 26K2gPGw031910; Wed, 20 Jul 2022 03:00:21 GMT DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=ibm.com; h=from : to : cc : subject : date : message-id : in-reply-to : references : mime-version : content-transfer-encoding; s=pp1; bh=mWatOUojGiY3Mv9cKY//8XSXezhcQ1U3OtxZveaCE6M=; 
b=emdzvEBYDVqCScGYDqiO+SXlIJZrmYCQSaCX0EpqctfD/FzwUqkau1cw4TRkd2nDRNvK UjDAhz+fcGDRVq6dTspr5Anuhr7K233oUaxITyq3Jw2A0wbHn3mx5gMyd7UUrlx8nfxv bhDtImcy6SbO3N2KUkbjP+wiQrl5cBre/W89dWg1++Ib/qndhT2WtvI0aCCVwhlMFyP1 aQ0kZbp4DL/zYH5eLI1YSS6nK7LNrtmyHUhtuKsMdOMp+rvnAmgMizkmgbD/cScqBo+X HpuCelxgUQy/ouPkvYba7U56OhH/piHa+ecrsXXnUFTvUCzxRmyfs51OQA5GOWXNU1HB fw== Received: from pps.reinject (localhost [127.0.0.1]) by mx0a-001b2d01.pphosted.com (PPS) with ESMTPS id 3he95b0cm6-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Wed, 20 Jul 2022 03:00:21 +0000 Received: from m0187473.ppops.net (m0187473.ppops.net [127.0.0.1]) by pps.reinject (8.17.1.5/8.17.1.5) with ESMTP id 26K30KIe039028; Wed, 20 Jul 2022 03:00:20 GMT Received: from ppma04dal.us.ibm.com (7a.29.35a9.ip4.static.sl-reverse.com [169.53.41.122]) by mx0a-001b2d01.pphosted.com (PPS) with ESMTPS id 3he95b0ck5-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Wed, 20 Jul 2022 03:00:20 +0000 Received: from pps.filterd (ppma04dal.us.ibm.com [127.0.0.1]) by ppma04dal.us.ibm.com (8.16.1.2/8.16.1.2) with SMTP id 26K2p3Dw027862; Wed, 20 Jul 2022 03:00:19 GMT Received: from b03cxnp08027.gho.boulder.ibm.com (b03cxnp08027.gho.boulder.ibm.com [9.17.130.19]) by ppma04dal.us.ibm.com with ESMTP id 3hbmy9vswc-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Wed, 20 Jul 2022 03:00:19 +0000 Received: from b03ledav002.gho.boulder.ibm.com (b03ledav002.gho.boulder.ibm.com [9.17.130.233]) by b03cxnp08027.gho.boulder.ibm.com (8.14.9/8.14.9/NCO v10.0) with ESMTP id 26K30IMS14353030 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-GCM-SHA384 bits=256 verify=OK); Wed, 20 Jul 2022 03:00:18 GMT Received: from b03ledav002.gho.boulder.ibm.com (unknown [127.0.0.1]) by IMSVA (Postfix) with ESMTP id 43778136055; Wed, 20 Jul 2022 03:00:18 +0000 (GMT) Received: from b03ledav002.gho.boulder.ibm.com (unknown [127.0.0.1]) by IMSVA (Postfix) with ESMTP id 8128F13605D; Wed, 20 Jul 2022 
03:00:11 +0000 (GMT) Received: from skywalker.ibmuc.com (unknown [9.43.15.129]) by b03ledav002.gho.boulder.ibm.com (Postfix) with ESMTP; Wed, 20 Jul 2022 03:00:11 +0000 (GMT) From: "Aneesh Kumar K.V" To: linux-mm@kvack.org, akpm@linux-foundation.org Cc: Wei Xu , Huang Ying , Yang Shi , Davidlohr Bueso , Tim C Chen , Michal Hocko , Linux Kernel Mailing List , Hesham Almatary , Dave Hansen , Jonathan Cameron , Alistair Popple , Dan Williams , Johannes Weiner , jvgediya.oss@gmail.com, "Aneesh Kumar K.V" Subject: [PATCH v10 5/8] mm/demotion: Build demotion targets based on explicit memory tiers Date: Wed, 20 Jul 2022 08:29:17 +0530 Message-Id: <20220720025920.1373558-6-aneesh.kumar@linux.ibm.com> X-Mailer: git-send-email 2.36.1 In-Reply-To: <20220720025920.1373558-1-aneesh.kumar@linux.ibm.com> References: <20220720025920.1373558-1-aneesh.kumar@linux.ibm.com> MIME-Version: 1.0 X-TM-AS-GCONF: 00 X-Proofpoint-GUID: EOuayJLqmijO06Zu38pzJiY69cZfEs0L X-Proofpoint-ORIG-GUID: GpenNnr6acy8gvndrUCxpWNZ0ogRQBi7 X-Proofpoint-Virus-Version: vendor=baseguard engine=ICAP:2.0.205,Aquarius:18.0.883,Hydra:6.0.517,FMLib:17.11.122.1 definitions=2022-07-19_10,2022-07-19_01,2022-06-22_01 X-Proofpoint-Spam-Details: rule=outbound_notspam policy=outbound score=0 adultscore=0 lowpriorityscore=0 suspectscore=0 malwarescore=0 impostorscore=0 phishscore=0 mlxscore=0 spamscore=0 bulkscore=0 priorityscore=1501 clxscore=1015 mlxlogscore=999 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.12.0-2206140000 definitions=main-2207200008 ARC-Authentication-Results: i=1; imf03.hostedemail.com; dkim=pass header.d=ibm.com header.s=pp1 header.b=emdzvEBY; dmarc=pass (policy=none) header.from=ibm.com; spf=pass (imf03.hostedemail.com: domain of aneesh.kumar@linux.ibm.com designates 148.163.156.1 as permitted sender) smtp.mailfrom=aneesh.kumar@linux.ibm.com ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1658286030; a=rsa-sha256; cv=none; 
b=TBNCjK7fad0EOlQME8bftn5KbkZl9h/0mz2lmVv5yxIzGdXyyFsGTHhK5SCs828OD3Tj2r 2jx0ow8iFvOX7u/ESpJffTpZZHYpZWqiF5vw8Zc336Kj8ZZW6+I6IVADHyXXCalcI9DCtF pWOGdGAbTWM4Y+gPYlID3oDPTaK79Tg= ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1658286030; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=mWatOUojGiY3Mv9cKY//8XSXezhcQ1U3OtxZveaCE6M=; b=Xlj7KyKl+Ti8PXthRM2HB7G3SCbJdI/OPFyZML2DwAp2bd+sarMzPa/FqyIeYPNZMlKeVA mtr26tKoWD/ElpQWZXJZAuK618i5ZJqI2kjjS2F6TPmED1qukS38UegmNyjh8x4eqmMOmI 7u+INq/ZaK5G5+78RcwCwu5+ohfkFj0= X-Rspam-User: X-Rspamd-Server: rspam09 X-Rspamd-Queue-Id: BFA7A2005E Authentication-Results: imf03.hostedemail.com; dkim=pass header.d=ibm.com header.s=pp1 header.b=emdzvEBY; dmarc=pass (policy=none) header.from=ibm.com; spf=pass (imf03.hostedemail.com: domain of aneesh.kumar@linux.ibm.com designates 148.163.156.1 as permitted sender) smtp.mailfrom=aneesh.kumar@linux.ibm.com X-Stat-Signature: 5fw7i6uj58bdwrzxqgoqsgcqrndcghdu X-HE-Tag: 1658286029-835325 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: This patch switches the demotion target building logic to use memory tiers instead of NUMA distance. All N_MEMORY NUMA nodes will be placed in the default memory tier and additional memory tiers will be added by drivers like dax kmem. This patch builds the demotion target for a NUMA node by looking at all memory tiers below the tier to which the NUMA node belongs. The closest node in the immediately following memory tier is used as a demotion target. Since we are now only building demotion targets for N_MEMORY NUMA nodes, the CPU hotplug calls are removed in this patch.
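The selection rule described above can be illustrated with a small userspace sketch. It is not the kernel code: the fixed tables, NR_NODES, tier_of[] and build_preferred() are hypothetical names chosen for illustration, the distances are those of Example 1 in the node_demotion[] comment of this patch, and the sketch assumes the "immediately following" tier is the nearest slower tier that actually contains nodes.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

#define NR_NODES 4	/* hypothetical: nodes 0-1 are DRAM, nodes 2-3 are PMEM */

/* Tier index per node; a larger index means a slower tier (assumption). */
static const int tier_of[NR_NODES] = { 0, 0, 1, 1 };

/* NUMA distance table taken from Example 1 in the patch comment. */
static const int dist[NR_NODES][NR_NODES] = {
	{ 10, 20, 30, 40 },
	{ 20, 10, 40, 30 },
	{ 30, 40, 10, 40 },
	{ 40, 30, 40, 10 },
};

/*
 * Mark in preferred[] every node of the next slower tier that sits at
 * the smallest distance from @node. Returns the number of targets, or
 * 0 when @node is already in the slowest tier.
 */
static int build_preferred(int node, bool preferred[NR_NODES])
{
	int next_tier = INT_MAX;
	int best = INT_MAX;
	int count = 0;
	int n;

	assert(node >= 0 && node < NR_NODES);

	for (n = 0; n < NR_NODES; n++)
		preferred[n] = false;

	/* Find the nearest slower tier that has at least one node. */
	for (n = 0; n < NR_NODES; n++)
		if (tier_of[n] > tier_of[node] && tier_of[n] < next_tier)
			next_tier = tier_of[n];
	if (next_tier == INT_MAX)
		return 0;

	/* Smallest distance from @node to any node in that tier. */
	for (n = 0; n < NR_NODES; n++)
		if (tier_of[n] == next_tier && dist[node][n] < best)
			best = dist[node][n];

	/* All nodes at that distance are equally preferred targets. */
	for (n = 0; n < NR_NODES; n++) {
		if (tier_of[n] == next_tier && dist[node][n] == best) {
			preferred[n] = true;
			count++;
		}
	}
	return count;
}
```

With these tables, node 0 ends up with node 2 as its only preferred target and node 1 with node 3, while nodes 2 and 3 have no target, which agrees with Example 1 in the node_demotion[] comment.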
A new memory tier can be inserted into the tier hierarchy for a new set of nodes without affecting the node assignment of any existing memtier, provided that there is enough gap in the performance level values for the new memtier. The absolute value of performance level of a memtier doesn't necessarily carry any meaning. Its value relative to other memtiers decides the level of this memtier in the tier hierarchy. Signed-off-by: Aneesh Kumar K.V Reported-by: kernel test robot --- include/linux/memory-tiers.h | 12 ++ include/linux/migrate.h | 13 -- mm/memory-tiers.c | 218 ++++++++++++++++++- mm/migrate.c | 394 ----------------------------------- mm/vmstat.c | 4 - 5 files changed, 229 insertions(+), 412 deletions(-) diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h index 3d5f14d57ae6..852e86bd0a23 100644 --- a/include/linux/memory-tiers.h +++ b/include/linux/memory-tiers.h @@ -17,9 +17,21 @@ #define MEMTIER_HOTPLUG_PRIO 100 extern bool numa_demotion_enabled; +#ifdef CONFIG_MIGRATION +int next_demotion_node(int node); +#else +static inline int next_demotion_node(int node) +{ + return NUMA_NO_NODE; +} +#endif #else #define numa_demotion_enabled false +static inline int next_demotion_node(int node) +{ + return NUMA_NO_NODE; +} #endif /* CONFIG_NUMA */ #endif /* _LINUX_MEMORY_TIERS_H */ diff --git a/include/linux/migrate.h b/include/linux/migrate.h index 43e737215f33..93fab62e6548 100644 --- a/include/linux/migrate.h +++ b/include/linux/migrate.h @@ -75,19 +75,6 @@ static inline int migrate_huge_page_move_mapping(struct address_space *mapping, #endif /* CONFIG_MIGRATION */ -#if defined(CONFIG_MIGRATION) && defined(CONFIG_NUMA) -extern void set_migration_target_nodes(void); -extern void migrate_on_reclaim_init(void); -extern int next_demotion_node(int node); -#else -static inline void set_migration_target_nodes(void) {} -static inline void migrate_on_reclaim_init(void) {} -static inline int next_demotion_node(int node) -{ - return NUMA_NO_NODE; -} 
-#endif - #ifdef CONFIG_COMPACTION extern int PageMovable(struct page *page); extern void __SetPageMovable(struct page *page, struct address_space *mapping); diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c index cc3a47ec18e4..a8cfe2ca3903 100644 --- a/mm/memory-tiers.c +++ b/mm/memory-tiers.c @@ -6,17 +6,88 @@ #include #include #include +#include #include +#include "internal.h" + struct memory_tier { struct list_head list; int perf_level; nodemask_t nodelist; }; +struct demotion_nodes { + nodemask_t preferred; +}; + static LIST_HEAD(memory_tiers); static DEFINE_MUTEX(memory_tier_lock); +#ifdef CONFIG_MIGRATION +/* + * node_demotion[] examples: + * + * Example 1: + * + * Node 0 & 1 are CPU + DRAM nodes, node 2 & 3 are PMEM nodes. + * + * node distances: + * node 0 1 2 3 + * 0 10 20 30 40 + * 1 20 10 40 30 + * 2 30 40 10 40 + * 3 40 30 40 10 + * + * memory_tiers[0] = + * memory_tiers[1] = 0-1 + * memory_tiers[2] = 2-3 + * + * node_demotion[0].preferred = 2 + * node_demotion[1].preferred = 3 + * node_demotion[2].preferred = + * node_demotion[3].preferred = + * + * Example 2: + * + * Node 0 & 1 are CPU + DRAM nodes, node 2 is memory-only DRAM node. + * + * node distances: + * node 0 1 2 + * 0 10 20 30 + * 1 20 10 30 + * 2 30 30 10 + * + * memory_tiers[0] = + * memory_tiers[1] = 0-2 + * memory_tiers[2] = + * + * node_demotion[0].preferred = + * node_demotion[1].preferred = + * node_demotion[2].preferred = + * + * Example 3: + * + * Node 0 is CPU + DRAM nodes, Node 1 is HBM node, node 2 is PMEM node. + * + * node distances: + * node 0 1 2 + * 0 10 20 30 + * 1 20 10 40 + * 2 30 40 10 + * + * memory_tiers[0] = 1 + * memory_tiers[1] = 0 + * memory_tiers[2] = 2 + * + * node_demotion[0].preferred = 2 + * node_demotion[1].preferred = 0 + * node_demotion[2].preferred = + * + */ +static struct demotion_nodes *node_demotion __read_mostly; +#endif /* CONFIG_MIGRATION */ + /* * For now let's have 4 memory tier below default DRAM tier. 
*/ @@ -76,6 +147,136 @@ static struct memory_tier *__node_get_memory_tier(int node) return NULL; } +#ifdef CONFIG_MIGRATION +/** + * next_demotion_node() - Get the next node in the demotion path + * @node: The starting node to lookup the next node + * + * Return: node id for next memory node in the demotion path hierarchy + * from @node; NUMA_NO_NODE if @node is terminal. This does not keep + * @node online or guarantee that it *continues* to be the next demotion + * target. + */ +int next_demotion_node(int node) +{ + struct demotion_nodes *nd; + int target; + + if (!node_demotion) + return NUMA_NO_NODE; + + nd = &node_demotion[node]; + + /* + * node_demotion[] is updated without excluding this + * function from running. + * + * Make sure to use RCU over entire code blocks if + * node_demotion[] reads need to be consistent. + */ + rcu_read_lock(); + /* + * If there are multiple target nodes, just select one + * target node randomly. + * + * In addition, we can also use round-robin to select + * target node, but we should introduce another variable + * for node_demotion[] to record last selected target node, + * that may cause cache ping-pong due to the changing of + * last target node. Or introducing per-cpu data to avoid + * caching issue, which seems more complicated. So selecting + * target node randomly seems better until now. + */ + target = node_random(&nd->preferred); + rcu_read_unlock(); + + return target; +} + +/* Disable reclaim-based migration. */ +static void __disable_all_migrate_targets(void) +{ + int node; + + for_each_node_state(node, N_MEMORY) + node_demotion[node].preferred = NODE_MASK_NONE; +} + +static void disable_all_migrate_targets(void) +{ + __disable_all_migrate_targets(); + + /* + * Ensure that the "disable" is visible across the system. + * Readers will see either a combination of before+disable + * state or disable+after. They will never see before and + * after state together. 
+ */ + synchronize_rcu(); +} +/* + * Find an automatic demotion target for all memory + * nodes. Failing here is OK. It might just indicate + * being at the end of a chain. + */ +static void establish_migration_targets(void) +{ + struct memory_tier *memtier; + struct demotion_nodes *nd; + int target = NUMA_NO_NODE, node; + int distance, best_distance; + nodemask_t used; + + if (!node_demotion || !IS_ENABLED(CONFIG_MIGRATION)) + return; + + disable_all_migrate_targets(); + + for_each_node_state(node, N_MEMORY) { + best_distance = -1; + nd = &node_demotion[node]; + + memtier = __node_get_memory_tier(node); + if (!memtier || list_is_last(&memtier->list, &memory_tiers)) + continue; + /* + * Get the next memtier to find the demotion node list. + */ + memtier = list_next_entry(memtier, list); + + /* + * find_next_best_node, use 'used' nodemask as a skip list. + * Add all memory nodes except the selected memory tier + * nodelist to skip list so that we find the best node from the + * memtier nodelist. + */ + nodes_andnot(used, node_states[N_MEMORY], memtier->nodelist); + + /* + * Find all the nodes in the memory tier node list of same best distance. + * add them to the preferred mask. We randomly select between nodes + * in the preferred mask when allocating pages during demotion. 
+ */ + do { + target = find_next_best_node(node, &used); + if (target == NUMA_NO_NODE) + break; + + distance = node_distance(node, target); + if (distance == best_distance || best_distance == -1) { + best_distance = distance; + node_set(target, nd->preferred); + } else { + break; + } + } while (1); + } +} +#else +static inline void disable_all_migrate_targets(void) {} +static inline void establish_migration_targets(void) {} +#endif /* CONFIG_MIGRATION */ + static void init_node_memory_tier(int node) { int perf_level; @@ -84,11 +285,19 @@ static void init_node_memory_tier(int node) mutex_lock(&memory_tier_lock); memtier = __node_get_memory_tier(node); + /* + * If the node is already part of a tier, proceed with the + * current tier value, because we might want to establish + * new migration paths now. The node might have been added to a tier + * before it was made part of N_MEMORY, hence establish_migration_targets() + * will have skipped this node. + */ if (!memtier) { perf_level = node_devices[node]->perf_level; memtier = find_create_memory_tier(perf_level); node_set(node, memtier->nodelist); } + establish_migration_targets(); mutex_unlock(&memory_tier_lock); } @@ -98,8 +307,10 @@ static void clear_node_memory_tier(int node) mutex_lock(&memory_tier_lock); memtier = __node_get_memory_tier(node); - if (memtier) + if (memtier) { node_clear(node, memtier->nodelist); + establish_migration_targets(); + } mutex_unlock(&memory_tier_lock); } @@ -134,6 +345,11 @@ static int __meminit migrate_on_reclaim_callback(struct notifier_block *self, static void __init migrate_on_reclaim_init(void) { + if (IS_ENABLED(CONFIG_MIGRATION)) { + node_demotion = kcalloc(MAX_NUMNODES, sizeof(struct demotion_nodes), + GFP_KERNEL); + WARN_ON(!node_demotion); + } hotplug_memory_notifier(migrate_on_reclaim_callback, MEMTIER_HOTPLUG_PRIO); } diff --git a/mm/migrate.c b/mm/migrate.c index fce7d4a9e940..c758c9c21d7d 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -2117,398 +2117,4 @@ int
migrate_misplaced_page(struct page *page, struct vm_area_struct *vma, return 0; } #endif /* CONFIG_NUMA_BALANCING */ - -/* - * node_demotion[] example: - * - * Consider a system with two sockets. Each socket has - * three classes of memory attached: fast, medium and slow. - * Each memory class is placed in its own NUMA node. The - * CPUs are placed in the node with the "fast" memory. The - * 6 NUMA nodes (0-5) might be split among the sockets like - * this: - * - * Socket A: 0, 1, 2 - * Socket B: 3, 4, 5 - * - * When Node 0 fills up, its memory should be migrated to - * Node 1. When Node 1 fills up, it should be migrated to - * Node 2. The migration path start on the nodes with the - * processors (since allocations default to this node) and - * fast memory, progress through medium and end with the - * slow memory: - * - * 0 -> 1 -> 2 -> stop - * 3 -> 4 -> 5 -> stop - * - * This is represented in the node_demotion[] like this: - * - * { nr=1, nodes[0]=1 }, // Node 0 migrates to 1 - * { nr=1, nodes[0]=2 }, // Node 1 migrates to 2 - * { nr=0, nodes[0]=-1 }, // Node 2 does not migrate - * { nr=1, nodes[0]=4 }, // Node 3 migrates to 4 - * { nr=1, nodes[0]=5 }, // Node 4 migrates to 5 - * { nr=0, nodes[0]=-1 }, // Node 5 does not migrate - * - * Moreover some systems may have multiple slow memory nodes. - * Suppose a system has one socket with 3 memory nodes, node 0 - * is fast memory type, and node 1/2 both are slow memory - * type, and the distance between fast memory node and slow - * memory node is same. So the migration path should be: - * - * 0 -> 1/2 -> stop - * - * This is represented in the node_demotion[] like this: - * { nr=2, {nodes[0]=1, nodes[1]=2} }, // Node 0 migrates to node 1 and node 2 - * { nr=0, nodes[0]=-1, }, // Node 1 dose not migrate - * { nr=0, nodes[0]=-1, }, // Node 2 does not migrate - */ - -/* - * Writes to this array occur without locking. Cycles are - * not allowed: Node X demotes to Y which demotes to X... 
- * - * If multiple reads are performed, a single rcu_read_lock() - * must be held over all reads to ensure that no cycles are - * observed. - */ -#define DEFAULT_DEMOTION_TARGET_NODES 15 - -#if MAX_NUMNODES < DEFAULT_DEMOTION_TARGET_NODES -#define DEMOTION_TARGET_NODES (MAX_NUMNODES - 1) -#else -#define DEMOTION_TARGET_NODES DEFAULT_DEMOTION_TARGET_NODES -#endif - -struct demotion_nodes { - unsigned short nr; - short nodes[DEMOTION_TARGET_NODES]; -}; - -static struct demotion_nodes *node_demotion __read_mostly; - -/** - * next_demotion_node() - Get the next node in the demotion path - * @node: The starting node to lookup the next node - * - * Return: node id for next memory node in the demotion path hierarchy - * from @node; NUMA_NO_NODE if @node is terminal. This does not keep - * @node online or guarantee that it *continues* to be the next demotion - * target. - */ -int next_demotion_node(int node) -{ - struct demotion_nodes *nd; - unsigned short target_nr, index; - int target; - - if (!node_demotion) - return NUMA_NO_NODE; - - nd = &node_demotion[node]; - - /* - * node_demotion[] is updated without excluding this - * function from running. RCU doesn't provide any - * compiler barriers, so the READ_ONCE() is required - * to avoid compiler reordering or read merging. - * - * Make sure to use RCU over entire code blocks if - * node_demotion[] reads need to be consistent. - */ - rcu_read_lock(); - target_nr = READ_ONCE(nd->nr); - - switch (target_nr) { - case 0: - target = NUMA_NO_NODE; - goto out; - case 1: - index = 0; - break; - default: - /* - * If there are multiple target nodes, just select one - * target node randomly. - * - * In addition, we can also use round-robin to select - * target node, but we should introduce another variable - * for node_demotion[] to record last selected target node, - * that may cause cache ping-pong due to the changing of - * last target node. Or introducing per-cpu data to avoid - * caching issue, which seems more complicated. 
So selecting - * target node randomly seems better until now. - */ - index = get_random_int() % target_nr; - break; - } - - target = READ_ONCE(nd->nodes[index]); - -out: - rcu_read_unlock(); - return target; -} - -/* Disable reclaim-based migration. */ -static void __disable_all_migrate_targets(void) -{ - int node, i; - - if (!node_demotion) - return; - - for_each_online_node(node) { - node_demotion[node].nr = 0; - for (i = 0; i < DEMOTION_TARGET_NODES; i++) - node_demotion[node].nodes[i] = NUMA_NO_NODE; - } -} - -static void disable_all_migrate_targets(void) -{ - __disable_all_migrate_targets(); - - /* - * Ensure that the "disable" is visible across the system. - * Readers will see either a combination of before+disable - * state or disable+after. They will never see before and - * after state together. - * - * The before+after state together might have cycles and - * could cause readers to do things like loop until this - * function finishes. This ensures they can only see a - * single "bad" read and would, for instance, only loop - * once. - */ - synchronize_rcu(); -} - -/* - * Find an automatic demotion target for 'node'. - * Failing here is OK. It might just indicate - * being at the end of a chain. - */ -static int establish_migrate_target(int node, nodemask_t *used, - int best_distance) -{ - int migration_target, index, val; - struct demotion_nodes *nd; - - if (!node_demotion) - return NUMA_NO_NODE; - - nd = &node_demotion[node]; - - migration_target = find_next_best_node(node, used); - if (migration_target == NUMA_NO_NODE) - return NUMA_NO_NODE; - - /* - * If the node has been set a migration target node before, - * which means it's the best distance between them. Still - * check if this node can be demoted to other target nodes - * if they have a same best distance. 
- */ - if (best_distance != -1) { - val = node_distance(node, migration_target); - if (val > best_distance) - goto out_clear; - } - - index = nd->nr; - if (WARN_ONCE(index >= DEMOTION_TARGET_NODES, - "Exceeds maximum demotion target nodes\n")) - goto out_clear; - - nd->nodes[index] = migration_target; - nd->nr++; - - return migration_target; -out_clear: - node_clear(migration_target, *used); - return NUMA_NO_NODE; -} - -/* - * When memory fills up on a node, memory contents can be - * automatically migrated to another node instead of - * discarded at reclaim. - * - * Establish a "migration path" which will start at nodes - * with CPUs and will follow the priorities used to build the - * page allocator zonelists. - * - * The difference here is that cycles must be avoided. If - * node0 migrates to node1, then neither node1, nor anything - * node1 migrates to can migrate to node0. Also one node can - * be migrated to multiple nodes if the target nodes all have - * a same best-distance against the source node. - * - * This function can run simultaneously with readers of - * node_demotion[]. However, it can not run simultaneously - * with itself. Exclusion is provided by memory hotplug events - * being single-threaded. - */ -static void __set_migration_target_nodes(void) -{ - nodemask_t next_pass; - nodemask_t this_pass; - nodemask_t used_targets = NODE_MASK_NONE; - int node, best_distance; - - /* - * Avoid any oddities like cycles that could occur - * from changes in the topology. This will leave - * a momentary gap when migration is disabled. - */ - disable_all_migrate_targets(); - - /* - * Allocations go close to CPUs, first. Assume that - * the migration path starts at the nodes with CPUs. - */ - next_pass = node_states[N_CPU]; -again: - this_pass = next_pass; - next_pass = NODE_MASK_NONE; - /* - * To avoid cycles in the migration "graph", ensure - * that migration sources are not future targets by - * setting them in 'used_targets'. 
Do this only - * once per pass so that multiple source nodes can - * share a target node. - * - * 'used_targets' will become unavailable in future - * passes. This limits some opportunities for - * multiple source nodes to share a destination. - */ - nodes_or(used_targets, used_targets, this_pass); - - for_each_node_mask(node, this_pass) { - best_distance = -1; - - /* - * Try to set up the migration path for the node, and the target - * migration nodes can be multiple, so doing a loop to find all - * the target nodes if they all have a best node distance. - */ - do { - int target_node = - establish_migrate_target(node, &used_targets, - best_distance); - - if (target_node == NUMA_NO_NODE) - break; - - if (best_distance == -1) - best_distance = node_distance(node, target_node); - - /* - * Visit targets from this pass in the next pass. - * Eventually, every node will have been part of - * a pass, and will become set in 'used_targets'. - */ - node_set(target_node, next_pass); - } while (1); - } - /* - * 'next_pass' contains nodes which became migration - * targets in this pass. Make additional passes until - * no more migrations targets are available. - */ - if (!nodes_empty(next_pass)) - goto again; -} - -/* - * For callers that do not hold get_online_mems() already. - */ -void set_migration_target_nodes(void) -{ - get_online_mems(); - __set_migration_target_nodes(); - put_online_mems(); -} - -/* - * This leaves migrate-on-reclaim transiently disabled between - * the MEM_GOING_OFFLINE and MEM_OFFLINE events. This runs - * whether reclaim-based migration is enabled or not, which - * ensures that the user can turn reclaim-based migration at - * any time without needing to recalculate migration targets. - * - * These callbacks already hold get_online_mems(). That is why - * __set_migration_target_nodes() can be used as opposed to - * set_migration_target_nodes(). 
- */ -#ifdef CONFIG_MEMORY_HOTPLUG -static int __meminit migrate_on_reclaim_callback(struct notifier_block *self, - unsigned long action, void *_arg) -{ - struct memory_notify *arg = _arg; - - /* - * Only update the node migration order when a node is - * changing status, like online->offline. This avoids - * the overhead of synchronize_rcu() in most cases. - */ - if (arg->status_change_nid < 0) - return notifier_from_errno(0); - - switch (action) { - case MEM_GOING_OFFLINE: - /* - * Make sure there are not transient states where - * an offline node is a migration target. This - * will leave migration disabled until the offline - * completes and the MEM_OFFLINE case below runs. - */ - disable_all_migrate_targets(); - break; - case MEM_OFFLINE: - case MEM_ONLINE: - /* - * Recalculate the target nodes once the node - * reaches its final state (online or offline). - */ - __set_migration_target_nodes(); - break; - case MEM_CANCEL_OFFLINE: - /* - * MEM_GOING_OFFLINE disabled all the migration - * targets. Reenable them. - */ - __set_migration_target_nodes(); - break; - case MEM_GOING_ONLINE: - case MEM_CANCEL_ONLINE: - break; - } - - return notifier_from_errno(0); -} -#endif - -void __init migrate_on_reclaim_init(void) -{ - node_demotion = kcalloc(nr_node_ids, - sizeof(struct demotion_nodes), - GFP_KERNEL); - WARN_ON(!node_demotion); -#ifdef CONFIG_MEMORY_HOTPLUG - hotplug_memory_notifier(migrate_on_reclaim_callback, 100); -#endif - /* - * At this point, all numa nodes with memory/CPus have their state - * properly set, so we can build the demotion order now. - * Let us hold the cpu_hotplug lock just, as we could possibily have - * CPU hotplug events during boot. 
- */ - cpus_read_lock(); - set_migration_target_nodes(); - cpus_read_unlock(); -} #endif /* CONFIG_NUMA */ - - diff --git a/mm/vmstat.c b/mm/vmstat.c index 373d2730fcf2..35c6ff97cf29 100644 --- a/mm/vmstat.c +++ b/mm/vmstat.c @@ -28,7 +28,6 @@ #include #include #include -#include #include "internal.h" @@ -2060,7 +2059,6 @@ static int vmstat_cpu_online(unsigned int cpu) if (!node_state(cpu_to_node(cpu), N_CPU)) { node_set_state(cpu_to_node(cpu), N_CPU); - set_migration_target_nodes(); } return 0; @@ -2085,7 +2083,6 @@ static int vmstat_cpu_dead(unsigned int cpu) return 0; node_clear_state(node, N_CPU); - set_migration_target_nodes(); return 0; } @@ -2118,7 +2115,6 @@ void __init init_mm_internals(void) start_shepherd_timer(); #endif - migrate_on_reclaim_init(); #ifdef CONFIG_PROC_FS proc_create_seq("buddyinfo", 0444, NULL, &fragmentation_op); proc_create_seq("pagetypeinfo", 0400, NULL, &pagetypeinfo_op); From patchwork Wed Jul 20 02:59:18 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Aneesh Kumar K.V" X-Patchwork-Id: 12923299 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id E1A5CC43334 for ; Wed, 20 Jul 2022 03:00:35 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 75F9B6B0074; Tue, 19 Jul 2022 23:00:35 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id 70E426B007E; Tue, 19 Jul 2022 23:00:35 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 5AF5C6B0080; Tue, 19 Jul 2022 23:00:35 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0016.hostedemail.com [216.40.44.16]) by kanga.kvack.org (Postfix) with ESMTP id 4808F6B0074 for ; Tue, 19 Jul 2022 23:00:35 -0400 (EDT) Received: from 
b03cxnp07027.gho.boulder.ibm.com (b03cxnp07027.gho.boulder.ibm.com [9.17.130.14]) by ppma03dal.us.ibm.com with ESMTP id 3hbmy9ctas-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Wed, 20 Jul 2022 03:00:25 +0000 Received: from b03ledav002.gho.boulder.ibm.com (b03ledav002.gho.boulder.ibm.com [9.17.130.233]) by b03cxnp07027.gho.boulder.ibm.com (8.14.9/8.14.9/NCO v10.0) with ESMTP id 26K30Okb38470138 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-GCM-SHA384 bits=256 verify=OK); Wed, 20 Jul 2022 03:00:24 GMT Received: from b03ledav002.gho.boulder.ibm.com (unknown [127.0.0.1]) by IMSVA (Postfix) with ESMTP id 8B006136061; Wed, 20 Jul 2022 03:00:24 +0000 (GMT) Received: from b03ledav002.gho.boulder.ibm.com (unknown [127.0.0.1]) by IMSVA (Postfix) with ESMTP id F201113605D; Wed, 20 Jul 2022 03:00:18 +0000 (GMT) Received: from skywalker.ibmuc.com (unknown [9.43.15.129]) by b03ledav002.gho.boulder.ibm.com (Postfix) with ESMTP; Wed, 20 Jul 2022 03:00:18 +0000 (GMT) From: "Aneesh Kumar K.V" To: linux-mm@kvack.org, akpm@linux-foundation.org Cc: Wei Xu , Huang Ying , Yang Shi , Davidlohr Bueso , Tim C Chen , Michal Hocko , Linux Kernel Mailing List , Hesham Almatary , Dave Hansen , Jonathan Cameron , Alistair Popple , Dan Williams , Johannes Weiner , jvgediya.oss@gmail.com, "Aneesh Kumar K.V" Subject: [PATCH v10 6/8] mm/demotion: Add pg_data_t member to track node memory tier details Date: Wed, 20 Jul 2022 08:29:18 +0530 Message-Id: <20220720025920.1373558-7-aneesh.kumar@linux.ibm.com> X-Mailer: git-send-email 2.36.1 In-Reply-To: <20220720025920.1373558-1-aneesh.kumar@linux.ibm.com> References: <20220720025920.1373558-1-aneesh.kumar@linux.ibm.com> MIME-Version: 1.0 X-TM-AS-GCONF: 00 X-Proofpoint-GUID: mnoErwtpBcVIvJixFDrbKn0hgAgvvcYR X-Proofpoint-ORIG-GUID: sWCGTEv4EguCla7l9lLVxz9zEQdkDjDq X-Proofpoint-Virus-Version: vendor=baseguard engine=ICAP:2.0.205,Aquarius:18.0.883,Hydra:6.0.517,FMLib:17.11.122.1 definitions=2022-07-19_10,2022-07-19_01,2022-06-22_01 
X-Proofpoint-Spam-Details: rule=outbound_notspam policy=outbound score=0 clxscore=1015 spamscore=0 impostorscore=0 suspectscore=0 phishscore=0 lowpriorityscore=0 bulkscore=0 malwarescore=0 mlxscore=0 priorityscore=1501 mlxlogscore=999 adultscore=0 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.12.0-2206140000 definitions=main-2207200008 ARC-Authentication-Results: i=1; imf28.hostedemail.com; dkim=pass header.d=ibm.com header.s=pp1 header.b="W/vxl+AC"; dmarc=pass (policy=none) header.from=ibm.com; spf=pass (imf28.hostedemail.com: domain of aneesh.kumar@linux.ibm.com designates 148.163.156.1 as permitted sender) smtp.mailfrom=aneesh.kumar@linux.ibm.com ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1658286034; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=FuOtZSr8KoRXEeUwczPYyLcJHjQ8n/U557lGv6Ad/2Q=; b=GCQiGb6l30M6F0jvP8rMQ3FHOX5bUfolSSZTOXWW/0QfN06PumcyEoq+Kn9FtkIseVJy37 0QZdUR4INGWAyQJGySHgpV9aEEBxdehjC/ICLw2McEojtG0XDNaEvectzm1vHLJdajswPV RbWysJJT7SKs72TdQyotmsCJCio3Wvk= ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1658286034; a=rsa-sha256; cv=none; b=LLz3FS6OErMiNWU2qydvkwVhTCZGMf0inxdeqC0N+x07OJWdFC+BLEKlh68VIswW/1biTK IchThbgin49NX39/l1zon3zRzZOMM8YXJsaRdsbnUvMOo8Ixu+7/SFgcmlqTTxQCEFfJTU moy5gSOgqUvCSkK5Zv5SaaD3YxgW0zk= X-Rspamd-Queue-Id: 9EA63C0091 Authentication-Results: imf28.hostedemail.com; dkim=pass header.d=ibm.com header.s=pp1 header.b="W/vxl+AC"; dmarc=pass (policy=none) header.from=ibm.com; spf=pass (imf28.hostedemail.com: domain of aneesh.kumar@linux.ibm.com designates 148.163.156.1 as permitted sender) smtp.mailfrom=aneesh.kumar@linux.ibm.com X-Rspam-User: X-Rspamd-Server: rspam03 X-Stat-Signature: 7hbt9muwz4abd53d453guticw135fnit X-HE-Tag: 1658286034-188413 X-Bogosity: Ham, 
tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: Also update the various helpers to use NODE_DATA()->memtier. Since the memtier of a node can change when the NUMA node is reassigned to a different memory tier, NODE_DATA()->memtier must be accessed under either an RCU read lock or memory_tier_lock. Signed-off-by: Aneesh Kumar K.V --- include/linux/mmzone.h | 3 ++ mm/memory-tiers.c | 65 +++++++++++++++++++++++++++++++++++------- 2 files changed, 57 insertions(+), 11 deletions(-) diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index aab70355d64f..353812495a70 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -928,6 +928,9 @@ typedef struct pglist_data { /* Per-node vmstats */ struct per_cpu_nodestat __percpu *per_cpu_nodestats; atomic_long_t vm_stat[NR_VM_NODE_STAT_ITEMS]; +#ifdef CONFIG_NUMA + struct memory_tier __rcu *memtier; +#endif } pg_data_t; #define node_present_pages(nid) (NODE_DATA(nid)->node_present_pages) diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c index a8cfe2ca3903..4715f9b96a44 100644 --- a/mm/memory-tiers.c +++ b/mm/memory-tiers.c @@ -138,13 +138,18 @@ static struct memory_tier *find_create_memory_tier(unsigned int perf_level) static struct memory_tier *__node_get_memory_tier(int node) { - struct memory_tier *memtier; + pg_data_t *pgdat; - list_for_each_entry(memtier, &memory_tiers, list) { - if (node_isset(node, memtier->nodelist)) - return memtier; - } - return NULL; + pgdat = NODE_DATA(node); + if (!pgdat) + return NULL; + /* + * Since we hold memory_tier_lock, we can avoid + * RCU read locks when accessing the details. No + * parallel updates are possible here.
+ */ + return rcu_dereference_check(pgdat->memtier, + lockdep_is_held(&memory_tier_lock)); } #ifdef CONFIG_MIGRATION @@ -277,6 +282,29 @@ static inline void disable_all_migrate_targets(void) {} static inline void establish_migration_targets(void) {} #endif /* CONFIG_MIGRATION */ +static void memtier_node_set(int node, struct memory_tier *memtier) +{ + pg_data_t *pgdat; + struct memory_tier *current_memtier; + + pgdat = NODE_DATA(node); + if (!pgdat) + return; + /* + * Make sure we mark the memtier NULL before we assign the new memory tier + * to the NUMA node. This ensures that anybody looking at NODE_DATA + * finds either a NULL memtier or one that is still valid. + */ + current_memtier = rcu_dereference_check(pgdat->memtier, + lockdep_is_held(&memory_tier_lock)); + rcu_assign_pointer(pgdat->memtier, NULL); + synchronize_rcu(); + if (current_memtier) + node_clear(node, current_memtier->nodelist); + node_set(node, memtier->nodelist); + rcu_assign_pointer(pgdat->memtier, memtier); +} + static void init_node_memory_tier(int node) { int perf_level; @@ -295,7 +323,7 @@ static void init_node_memory_tier(int node) if (!memtier) { perf_level = node_devices[node]->perf_level; memtier = find_create_memory_tier(perf_level); - node_set(node, memtier->nodelist); + memtier_node_set(node, memtier); } establish_migration_targets(); mutex_unlock(&memory_tier_lock); @@ -303,12 +331,25 @@ static void init_node_memory_tier(int node) static void clear_node_memory_tier(int node) { - struct memory_tier *memtier; + pg_data_t *pgdat; + struct memory_tier *current_memtier; + + pgdat = NODE_DATA(node); + if (!pgdat) + return; mutex_lock(&memory_tier_lock); - memtier = __node_get_memory_tier(node); - if (memtier) { - node_clear(node, memtier->nodelist); + /* + * Make sure we mark the memtier NULL before we remove the NUMA node + * from its memory tier. This ensures that anybody looking at NODE_DATA + * finds either a NULL memtier or one that is still valid.
+ */ + current_memtier = rcu_dereference_check(pgdat->memtier, + lockdep_is_held(&memory_tier_lock)); + rcu_assign_pointer(pgdat->memtier, NULL); + synchronize_rcu(); + if (current_memtier) { + node_clear(node, current_memtier->nodelist); establish_migration_targets(); } mutex_unlock(&memory_tier_lock); @@ -383,6 +424,8 @@ static int __init memory_tier_init(void) if (!node_property->perf_level) node_property->perf_level = default_memtier_perf_level; + + rcu_assign_pointer(NODE_DATA(node)->memtier, memtier); } mutex_unlock(&memory_tier_lock); From patchwork Wed Jul 20 02:59:19 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Aneesh Kumar K.V" X-Patchwork-Id: 12923300 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 14C15C43334 for ; Wed, 20 Jul 2022 03:00:46 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id ABAD66B007E; Tue, 19 Jul 2022 23:00:45 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id A69FC6B0080; Tue, 19 Jul 2022 23:00:45 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 8E34F6B0081; Tue, 19 Jul 2022 23:00:45 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0017.hostedemail.com [216.40.44.17]) by kanga.kvack.org (Postfix) with ESMTP id 7C7CB6B007E for ; Tue, 19 Jul 2022 23:00:45 -0400 (EDT) Received: from smtpin01.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay01.hostedemail.com (Postfix) with ESMTP id 597A11C5DAF for ; Wed, 20 Jul 2022 03:00:45 +0000 (UTC) X-FDA: 79705975650.01.B8C8FF5 Received: from mx0a-001b2d01.pphosted.com (mx0a-001b2d01.pphosted.com [148.163.156.1]) by imf10.hostedemail.com (Postfix) with ESMTP id C7486C007D for ; Wed, 20 Jul 2022 
03:00:44 +0000 (UTC) Received: from pps.filterd (m0187473.ppops.net [127.0.0.1]) by mx0a-001b2d01.pphosted.com (8.17.1.5/8.17.1.5) with ESMTP id 26K2gTla032074; Wed, 20 Jul 2022 03:00:35 GMT DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=ibm.com; h=from : to : cc : subject : date : message-id : in-reply-to : references : mime-version : content-transfer-encoding; s=pp1; bh=jizW4FVpNeZa0xrWud6JBfLQaVR7XKFZ/NHzS4bk8aw=; b=G/0atQmtv1oAYizd3vx1WnqaK7b67R1RJJdhO+M/VInOx9yYNlSP0qwdrzSSot8mnnxc owLMfKLx5332sinjeIunebiY44nQLfg6cRzUfZiXn8c1egt+d2DcFdeOY/dwih4PI2gl XyPMA0QyEq+Uvnc1ua6h+M4xkm4dSwvIHFimgWmgS5XdNrXZ04+yU9izUBLmAHFXN3cM b50KIqrGq2zmyu9RfwYYFIgWCeBIegoXcXc4UGa9MubAors/ncI5fWFqo+mZmoIUkOVL ru5s2Vmd3PX/Jpkvvugrqjj2s8QBQ0d7EAEOU8puButhl6QcHLMHOP6PYmvdOKSznyTe kQ== Received: from pps.reinject (localhost [127.0.0.1]) by mx0a-001b2d01.pphosted.com (PPS) with ESMTPS id 3he95b0cwe-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Wed, 20 Jul 2022 03:00:35 +0000 Received: from m0187473.ppops.net (m0187473.ppops.net [127.0.0.1]) by pps.reinject (8.17.1.5/8.17.1.5) with ESMTP id 26K2xIS8034999; Wed, 20 Jul 2022 03:00:34 GMT Received: from ppma02wdc.us.ibm.com (aa.5b.37a9.ip4.static.sl-reverse.com [169.55.91.170]) by mx0a-001b2d01.pphosted.com (PPS) with ESMTPS id 3he95b0cv2-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Wed, 20 Jul 2022 03:00:34 +0000 Received: from pps.filterd (ppma02wdc.us.ibm.com [127.0.0.1]) by ppma02wdc.us.ibm.com (8.16.1.2/8.16.1.2) with SMTP id 26K2njYi016972; Wed, 20 Jul 2022 03:00:32 GMT Received: from b03cxnp07027.gho.boulder.ibm.com (b03cxnp07027.gho.boulder.ibm.com [9.17.130.14]) by ppma02wdc.us.ibm.com with ESMTP id 3hbmy9f4d6-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Wed, 20 Jul 2022 03:00:32 +0000 Received: from b03ledav002.gho.boulder.ibm.com (b03ledav002.gho.boulder.ibm.com [9.17.130.233]) by b03cxnp07027.gho.boulder.ibm.com (8.14.9/8.14.9/NCO 
v10.0) with ESMTP id 26K30VGN31457766 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-GCM-SHA384 bits=256 verify=OK); Wed, 20 Jul 2022 03:00:31 GMT Received: from b03ledav002.gho.boulder.ibm.com (unknown [127.0.0.1]) by IMSVA (Postfix) with ESMTP id 90715136055; Wed, 20 Jul 2022 03:00:31 +0000 (GMT) Received: from b03ledav002.gho.boulder.ibm.com (unknown [127.0.0.1]) by IMSVA (Postfix) with ESMTP id 3A8F7136060; Wed, 20 Jul 2022 03:00:25 +0000 (GMT) Received: from skywalker.ibmuc.com (unknown [9.43.15.129]) by b03ledav002.gho.boulder.ibm.com (Postfix) with ESMTP; Wed, 20 Jul 2022 03:00:24 +0000 (GMT) From: "Aneesh Kumar K.V" To: linux-mm@kvack.org, akpm@linux-foundation.org Cc: Wei Xu , Huang Ying , Yang Shi , Davidlohr Bueso , Tim C Chen , Michal Hocko , Linux Kernel Mailing List , Hesham Almatary , Dave Hansen , Jonathan Cameron , Alistair Popple , Dan Williams , Johannes Weiner , jvgediya.oss@gmail.com, Jagdish Gediya , "Aneesh Kumar K . V" Subject: [PATCH v10 7/8] mm/demotion: Demote pages according to allocation fallback order Date: Wed, 20 Jul 2022 08:29:19 +0530 Message-Id: <20220720025920.1373558-8-aneesh.kumar@linux.ibm.com> X-Mailer: git-send-email 2.36.1 In-Reply-To: <20220720025920.1373558-1-aneesh.kumar@linux.ibm.com> References: <20220720025920.1373558-1-aneesh.kumar@linux.ibm.com> MIME-Version: 1.0 X-TM-AS-GCONF: 00 X-Proofpoint-GUID: 0hDc90MJJxHqizivltaYLcTja-1KBXP1 X-Proofpoint-ORIG-GUID: 6HtB90fU3mO_dtzZILFnCI6WJZjHdIRc X-Proofpoint-Virus-Version: vendor=baseguard engine=ICAP:2.0.205,Aquarius:18.0.883,Hydra:6.0.517,FMLib:17.11.122.1 definitions=2022-07-19_10,2022-07-19_01,2022-06-22_01 X-Proofpoint-Spam-Details: rule=outbound_notspam policy=outbound score=0 adultscore=0 lowpriorityscore=0 suspectscore=0 malwarescore=0 impostorscore=0 phishscore=0 mlxscore=0 spamscore=0 bulkscore=0 priorityscore=1501 clxscore=1015 mlxlogscore=999 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.12.0-2206140000 definitions=main-2207200008 ARC-Message-Signature: 
i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1658286045; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=jizW4FVpNeZa0xrWud6JBfLQaVR7XKFZ/NHzS4bk8aw=; b=j64s6nWjdJIcEIRlOGomCy4eGzH6YIe7x/8+InhaAo0jqT2KkT0sVewBIFWjbgZNkk0aeJ FdkZ3bio4uOBf2cBl7ZDepQwubp0Ft2+LhLe0XJ7hU35jkvCap87y4I7dbpPGULdYKfsIZ io/AcnOfYNnVvnNzu4IyDHUuaokaAi0= ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1658286045; a=rsa-sha256; cv=none; b=ix1nl8T2tQuvr200IP40N8du6sG2DC/Hh3XREnYFPpomXlrvODavRnWq47NRHbG3Vh92fc 9Xr09wgjyH0IgrdBf57Khl9Usznsgufn/ApJ81BxCTeAYCSjwZ1pP448D9hKnb5A3v+jbe f3D/fkbxVsfairzcYEP8pK+u7BY/n8g= ARC-Authentication-Results: i=1; imf10.hostedemail.com; dkim=pass header.d=ibm.com header.s=pp1 header.b="G/0atQmt"; dmarc=pass (policy=none) header.from=ibm.com; spf=pass (imf10.hostedemail.com: domain of aneesh.kumar@linux.ibm.com designates 148.163.156.1 as permitted sender) smtp.mailfrom=aneesh.kumar@linux.ibm.com X-Rspamd-Queue-Id: C7486C007D Authentication-Results: imf10.hostedemail.com; dkim=pass header.d=ibm.com header.s=pp1 header.b="G/0atQmt"; dmarc=pass (policy=none) header.from=ibm.com; spf=pass (imf10.hostedemail.com: domain of aneesh.kumar@linux.ibm.com designates 148.163.156.1 as permitted sender) smtp.mailfrom=aneesh.kumar@linux.ibm.com X-Rspamd-Server: rspam12 X-Rspam-User: X-Stat-Signature: sttcpto396cof79aaqx64if968typsw8 X-HE-Tag: 1658286044-563740 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: From: Jagdish Gediya Currently, a higher tier node can only be demoted to selected nodes on the next lower tier as defined by the demotion path. This strict, hard-coded demotion order does not work in all use cases (e.g. 
some use cases may want to allow cross-socket demotion to another node
in the same demotion tier as a fallback when the preferred demotion node
is out of space). This demotion order is also inconsistent with the page
allocation fallback order when all the nodes in a higher tier are out of
space: page allocation can fall back to any node from any lower tier,
whereas the demotion order currently does not allow that.

This patch adds support for getting all the allowed demotion targets for
a memory tier. demote_page_list() is modified to use this allowed node
mask as the fallback allocation mask.

Signed-off-by: Jagdish Gediya
Signed-off-by: Aneesh Kumar K.V
---
 include/linux/memory-tiers.h | 11 +++++++
 mm/memory-tiers.c            | 54 +++++++++++++++++++++++++++++++--
 mm/vmscan.c                  | 58 ++++++++++++++++++++++++++----------
 3 files changed, 106 insertions(+), 17 deletions(-)

diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
index 852e86bd0a23..0e58588fa066 100644
--- a/include/linux/memory-tiers.h
+++ b/include/linux/memory-tiers.h
@@ -19,11 +19,17 @@ extern bool numa_demotion_enabled;
 #ifdef CONFIG_MIGRATION
 int next_demotion_node(int node);
+void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets);
 #else
 static inline int next_demotion_node(int node)
 {
 	return NUMA_NO_NODE;
 }
+
+static inline void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets)
+{
+	*targets = NODE_MASK_NONE;
+}
 #endif
 #else
@@ -33,5 +39,10 @@ static inline int next_demotion_node(int node)
 {
 	return NUMA_NO_NODE;
 }
+
+static inline void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets)
+{
+	*targets = NODE_MASK_NONE;
+}
 #endif /* CONFIG_NUMA */
 #endif /* _LINUX_MEMORY_TIERS_H */
diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
index 4715f9b96a44..4a96e4213d66 100644
--- a/mm/memory-tiers.c
+++ b/mm/memory-tiers.c
@@ -15,6 +15,7 @@ struct memory_tier {
 	struct list_head list;
 	int perf_level;
 	nodemask_t nodelist;
+	nodemask_t
lower_tier_mask;
 };
 struct demotion_nodes {
@@ -153,6 +154,24 @@ static struct memory_tier *__node_get_memory_tier(int node)
 }
 #ifdef CONFIG_MIGRATION
+void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets)
+{
+	struct memory_tier *memtier;
+
+	/*
+	 * pg_data_t.memtier updates include a synchronize_rcu(),
+	 * which ensures that we find either NULL or a valid memtier
+	 * in NODE_DATA. Protect the access via rcu_read_lock().
+	 */
+	rcu_read_lock();
+	memtier = rcu_dereference(pgdat->memtier);
+	if (memtier)
+		*targets = memtier->lower_tier_mask;
+	else
+		*targets = NODE_MASK_NONE;
+	rcu_read_unlock();
+}
+
 /**
  * next_demotion_node() - Get the next node in the demotion path
  * @node: The starting node to lookup the next node
@@ -201,10 +220,19 @@ int next_demotion_node(int node)
 /* Disable reclaim-based migration. */
 static void __disable_all_migrate_targets(void)
 {
+	struct memory_tier *memtier;
 	int node;
-	for_each_node_state(node, N_MEMORY)
+	for_each_node_state(node, N_MEMORY) {
 		node_demotion[node].preferred = NODE_MASK_NONE;
+		/*
+		 * We are holding memory_tier_lock, so it is safe
+		 * to access pgdat->memtier.
+		 */
+		memtier = rcu_dereference_check(NODE_DATA(node)->memtier,
+						lockdep_is_held(&memory_tier_lock));
+		memtier->lower_tier_mask = NODE_MASK_NONE;
+	}
 }
 static void disable_all_migrate_targets(void)
@@ -230,7 +258,7 @@ static void establish_migration_targets(void)
 	struct demotion_nodes *nd;
 	int target = NUMA_NO_NODE, node;
 	int distance, best_distance;
-	nodemask_t used;
+	nodemask_t used, lower_tier = NODE_MASK_NONE;
 	if (!node_demotion || !IS_ENABLED(CONFIG_MIGRATION))
 		return;
@@ -276,6 +304,28 @@ static void establish_migration_targets(void)
 			}
 		} while (1);
 	}
+	/*
+	 * Now build the lower_tier mask for each node collecting node mask from
+	 * all memory tier below it. This allows us to fallback demotion page
+	 * allocation to a set of nodes that is closer to the above selected
+	 * preferred node.
+	 */
+	list_for_each_entry(memtier, &memory_tiers, list)
+		nodes_or(lower_tier, lower_tier, memtier->nodelist);
+	/*
+	 * Remove nodes not yet in N_MEMORY.
+	 */
+	nodes_and(lower_tier, node_states[N_MEMORY], lower_tier);
+
+	list_for_each_entry(memtier, &memory_tiers, list) {
+		/*
+		 * Keep removing the current tier from the lower_tier nodes.
+		 * This removes all nodes in the current and higher
+		 * memory tiers from the lower_tier mask.
+		 */
+		nodes_andnot(lower_tier, lower_tier, memtier->nodelist);
+		memtier->lower_tier_mask = lower_tier;
+	}
 }
 #else
 static inline void disable_all_migrate_targets(void) {}
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 3a8f78277f99..60a5235dd639 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1460,21 +1460,34 @@ static void folio_check_dirty_writeback(struct folio *folio,
 		mapping->a_ops->is_dirty_writeback(folio, dirty, writeback);
 }
-static struct page *alloc_demote_page(struct page *page, unsigned long node)
+static struct page *alloc_demote_page(struct page *page, unsigned long private)
 {
-	struct migration_target_control mtc = {
-		/*
-		 * Allocate from 'node', or fail quickly and quietly.
-		 * When this happens, 'page' will likely just be discarded
-		 * instead of migrated.
-		 */
-		.gfp_mask = (GFP_HIGHUSER_MOVABLE & ~__GFP_RECLAIM) |
-			    __GFP_THISNODE | __GFP_NOWARN |
-			    __GFP_NOMEMALLOC | GFP_NOWAIT,
-		.nid = node
-	};
+	struct page *target_page;
+	nodemask_t *allowed_mask;
+	struct migration_target_control *mtc;
+
+	mtc = (struct migration_target_control *)private;
+
+	allowed_mask = mtc->nmask;
+	/*
+	 * Make sure we allocate from the target node first, also trying to
+	 * reclaim pages from the target node via kswapd if we are low on
+	 * free memory on target node. If we don't do this and if we have low
+	 * free memory on the target memtier, we would start allocating pages
+	 * from lower (slower) memory tiers without even forcing a demotion of cold
+	 * pages from the target memtier.
This can result in the kernel placing
+	 * hot pages in lower (slower) memory tiers.
+	 */
+	mtc->nmask = NULL;
+	mtc->gfp_mask |= __GFP_THISNODE;
+	target_page = alloc_migration_target(page, (unsigned long)mtc);
+	if (target_page)
+		return target_page;
-	return alloc_migration_target(page, (unsigned long)&mtc);
+	mtc->gfp_mask &= ~__GFP_THISNODE;
+	mtc->nmask = allowed_mask;
+
+	return alloc_migration_target(page, (unsigned long)mtc);
 }
 /*
@@ -1487,6 +1500,19 @@ static unsigned int demote_page_list(struct list_head *demote_pages,
 {
 	int target_nid = next_demotion_node(pgdat->node_id);
 	unsigned int nr_succeeded;
+	nodemask_t allowed_mask;
+
+	struct migration_target_control mtc = {
+		/*
+		 * Allocate from 'node', or fail quickly and quietly.
+		 * When this happens, 'page' will likely just be discarded
+		 * instead of migrated.
+		 */
+		.gfp_mask = (GFP_HIGHUSER_MOVABLE & ~__GFP_RECLAIM) | __GFP_NOWARN |
+			__GFP_NOMEMALLOC | GFP_NOWAIT,
+		.nid = target_nid,
+		.nmask = &allowed_mask
+	};
 	if (list_empty(demote_pages))
 		return 0;
@@ -1494,10 +1520,12 @@ static unsigned int demote_page_list(struct list_head *demote_pages,
 	if (target_nid == NUMA_NO_NODE)
 		return 0;
+	node_get_allowed_targets(pgdat, &allowed_mask);
+
 	/* Demotion ignores all cpuset and mempolicy settings */
 	migrate_pages(demote_pages, alloc_demote_page, NULL,
-			    target_nid, MIGRATE_ASYNC, MR_DEMOTION,
-			    &nr_succeeded);
+		      (unsigned long)&mtc, MIGRATE_ASYNC, MR_DEMOTION,
+		      &nr_succeeded);
 	if (current_is_kswapd())
 		__count_vm_events(PGDEMOTE_KSWAPD, nr_succeeded);

From patchwork Wed Jul 20 02:59:20 2022
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: "Aneesh Kumar K.V"
X-Patchwork-Id: 12923301
Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id EE131C43334 for ; Wed, 20 Jul 2022 03:00:50
+0000 (UTC)
Received: from
m0098404.ppops.net (m0098404.ppops.net [127.0.0.1]) by pps.reinject (8.17.1.5/8.17.1.5) with ESMTP id 26K2xsqr022449; Wed, 20 Jul 2022 03:00:41 GMT
From: "Aneesh Kumar K.V"
To: linux-mm@kvack.org, akpm@linux-foundation.org
Cc: Wei Xu, Huang Ying, Yang Shi, Davidlohr Bueso, Tim C Chen, Michal Hocko, Linux Kernel Mailing List, Hesham Almatary, Dave Hansen, Jonathan Cameron, Alistair Popple, Dan Williams, Johannes Weiner, jvgediya.oss@gmail.com, "Aneesh Kumar K.V"
Subject: [PATCH v10 8/8] mm/demotion: Update node_is_toptier to work with memory tiers
Date: Wed, 20 Jul 2022 08:29:20 +0530
Message-Id: <20220720025920.1373558-9-aneesh.kumar@linux.ibm.com>
X-Mailer: git-send-email 2.36.1
In-Reply-To: <20220720025920.1373558-1-aneesh.kumar@linux.ibm.com>
References: <20220720025920.1373558-1-aneesh.kumar@linux.ibm.com>
MIME-Version: 1.0
ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1658286049; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=CoWRC8zLpfrq8J5yaBXe8uR3c59RwBfJjTSs16zKYno=; b=Wk49lQRO0CSowtDgRcIvVOnUvM6nXosfRNQYiyVV9E7Jv7eQho0v00BQvtbtM9v145S1aP
oYvXj48Ro+0TreTMdPLqlmxeTnrAcZ0hOTZxDZuJVTEoGB9YdOpLqjQIIXS7E63xWmz0QR qMoHUvzNhVB4rbfggMmV44I0HRO36xw=

With memory tiers support, we can have memory-only NUMA nodes in the top
tier, from which we want to avoid promotion-tracking NUMA faults. Update
node_is_toptier() to work with memory tiers. By default, all NUMA nodes
are top-tier nodes. Once lower memory tiers are added, the lowest memory
tier that contains CPU NUMA nodes, and every memory tier above it, is
considered a top memory tier.

Signed-off-by: Aneesh Kumar K.V
---
 include/linux/memory-tiers.h | 11 +++++++++
 include/linux/node.h         |  5 -----
 mm/huge_memory.c             |  1 +
 mm/memory-tiers.c            | 43 ++++++++++++++++++++++++++++++++++++
 mm/migrate.c                 |  1 +
 mm/mprotect.c                |  1 +
 6 files changed, 57 insertions(+), 5 deletions(-)

diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
index 0e58588fa066..085dd815bf73 100644
--- a/include/linux/memory-tiers.h
+++ b/include/linux/memory-tiers.h
@@ -20,6 +20,7 @@ extern bool numa_demotion_enabled;
 #ifdef CONFIG_MIGRATION
 int next_demotion_node(int node);
 void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets);
+bool node_is_toptier(int node);
 #else
 static inline int next_demotion_node(int node)
 {
@@ -30,6 +31,11 @@ static inline void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *target
 {
 	*targets = NODE_MASK_NONE;
 }
+
+static inline bool node_is_toptier(int node)
+{
+
	return true;
+}
 #endif
 #else
@@ -44,5 +50,10 @@ static inline void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *target
 {
 	*targets = NODE_MASK_NONE;
 }
+
+static inline bool node_is_toptier(int node)
+{
+	return true;
+}
 #endif /* CONFIG_NUMA */
 #endif /* _LINUX_MEMORY_TIERS_H */
diff --git a/include/linux/node.h b/include/linux/node.h
index a2a16d4104fd..d0432db18094 100644
--- a/include/linux/node.h
+++ b/include/linux/node.h
@@ -191,9 +191,4 @@ static inline void register_hugetlbfs_with_node(node_registration_func_t reg,
 #define to_node(device) container_of(device, struct node, dev)
-static inline bool node_is_toptier(int node)
-{
-	return node_state(node, N_CPU);
-}
-
 #endif /* _LINUX_NODE_H_ */
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 834f288b3769..8405662646e9 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -35,6 +35,7 @@
 #include
 #include
 #include
+#include
 #include
 #include
diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
index 4a96e4213d66..f0515bfd4051 100644
--- a/mm/memory-tiers.c
+++ b/mm/memory-tiers.c
@@ -13,6 +13,7 @@
 struct memory_tier {
 	struct list_head list;
+	int id;
 	int perf_level;
 	nodemask_t nodelist;
 	nodemask_t lower_tier_mask;
@@ -26,6 +27,7 @@ static LIST_HEAD(memory_tiers);
 static DEFINE_MUTEX(memory_tier_lock);
 #ifdef CONFIG_MIGRATION
+static int top_tier_id;
 /*
  * node_demotion[] examples:
  *
@@ -129,6 +131,7 @@ static struct memory_tier *find_create_memory_tier(unsigned int perf_level)
 	if (!new_memtier)
 		return ERR_PTR(-ENOMEM);
+	new_memtier->id = perf_level >> MEMTIER_CHUNK_BITS;
 	new_memtier->perf_level = perf_level;
 	if (found_slot)
 		list_add_tail(&new_memtier->list, ent);
@@ -154,6 +157,31 @@ static struct memory_tier *__node_get_memory_tier(int node)
 }
 #ifdef CONFIG_MIGRATION
+bool node_is_toptier(int node)
+{
+	bool toptier;
+	pg_data_t *pgdat;
+	struct memory_tier *memtier;
+
+	pgdat = NODE_DATA(node);
+	if (!pgdat)
+		return false;
+
+	rcu_read_lock();
+	memtier = rcu_dereference(pgdat->memtier);
+	if
(!memtier) {
+		toptier = true;
+		goto out;
+	}
+	if (memtier->id >= top_tier_id)
+		toptier = true;
+	else
+		toptier = false;
+out:
+	rcu_read_unlock();
+	return toptier;
+}
+
 void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets)
 {
 	struct memory_tier *memtier;
@@ -304,6 +332,21 @@ static void establish_migration_targets(void)
 			}
 		} while (1);
 	}
+	/*
+	 * Promotion is allowed from a memory tier to a higher
+	 * memory tier only if the memory tier doesn't include
+	 * compute. We want to skip promotion from a memory tier
+	 * if any node that is part of the memory tier has CPUs.
+	 * Once we detect such a memory tier, we consider that tier
+	 * as the top tier, from which promotion is not allowed.
+	 */
+	list_for_each_entry_reverse(memtier, &memory_tiers, list) {
+		nodes_and(used, node_states[N_CPU], memtier->nodelist);
+		if (!nodes_empty(used)) {
+			top_tier_id = memtier->id;
+			break;
+		}
+	}
 	/*
 	 * Now build the lower_tier mask for each node collecting node mask from
 	 * all memory tier below it. This allows us to fallback demotion page
diff --git a/mm/migrate.c b/mm/migrate.c
index c758c9c21d7d..1da81136eaaa 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -50,6 +50,7 @@
 #include
 #include
 #include
+#include
 #include
diff --git a/mm/mprotect.c b/mm/mprotect.c
index ba5592655ee3..92a2fc0fa88b 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -31,6 +31,7 @@
 #include
 #include
 #include
+#include
 #include
 #include
 #include
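
Illustration (not part of the patch): the two list walks in
establish_migration_targets() can be modelled in userspace to see what
masks and top tier they produce. This is only a sketch under stated
assumptions. struct tier_model and the three helper functions are
hypothetical names, a uint64_t bitmask stands in for nodemask_t (bit n is
NUMA node n), the array is assumed to be ordered from the highest tier to
the lowest, and a larger id is assumed to mean a higher tier. The
fallback argument of find_top_tier_id() is an addition for the case where
no tier has CPUs; the patch itself simply leaves top_tier_id unchanged
then.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical userspace model of struct memory_tier (not kernel code).
 * Bit n of a mask stands for NUMA node n, replacing nodemask_t.
 */
struct tier_model {
	int id;                   /* assumed: larger id means a higher tier */
	uint64_t nodelist;        /* nodes that belong to this tier */
	uint64_t lower_tier_mask; /* output: nodes in every tier below this one */
};

/*
 * tiers[] is assumed to be ordered from the highest tier to the lowest,
 * matching the forward list walk in establish_migration_targets().
 */
static void build_lower_tier_masks(struct tier_model *tiers, size_t n,
				   uint64_t nodes_with_memory)
{
	uint64_t lower = 0;
	size_t i;

	assert(tiers != NULL || n == 0);

	/* Start from the union of all tiers, limited to nodes with memory. */
	for (i = 0; i < n; i++)
		lower |= tiers[i].nodelist;
	lower &= nodes_with_memory;

	/* Peel off each tier in turn; what remains is strictly below it. */
	for (i = 0; i < n; i++) {
		lower &= ~tiers[i].nodelist;
		tiers[i].lower_tier_mask = lower;
	}
}

/*
 * Walk from the lowest tier upward and return the id of the first tier
 * that contains a CPU node. The fallback value is an illustrative
 * addition, returned when no tier has CPUs.
 */
static int find_top_tier_id(const struct tier_model *tiers, size_t n,
			    uint64_t nodes_with_cpu, int fallback)
{
	size_t i;

	assert(tiers != NULL || n == 0);

	for (i = n; i-- > 0; ) {
		if (tiers[i].nodelist & nodes_with_cpu)
			return tiers[i].id;
	}
	return fallback;
}

/* A tier counts as top tier when its id is at or above top_tier_id. */
static int tier_is_toptier(const struct tier_model *tier, int top_tier_id)
{
	return tier->id >= top_tier_id;
}
```

For example, take a memory-only node 3 in tier id 2, CPU nodes 0 and 1 in
tier id 1, and a slow memory-only node 2 in tier id 0. The lower-tier
masks come out as 0x7, 0x4 and 0x0, and the top tier id is 1, so the two
upper tiers both count as top tier while the slow tier does not.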