From patchwork Thu Jul 28 19:04:29 2022
X-Patchwork-Submitter: "Aneesh Kumar K.V"
X-Patchwork-Id: 12931691
From: "Aneesh Kumar K.V"
To: linux-mm@kvack.org, akpm@linux-foundation.org
Cc: Wei Xu, Huang Ying, Yang Shi, Davidlohr Bueso, Tim C Chen,
 Michal Hocko, Linux Kernel Mailing List, Hesham Almatary, Dave Hansen,
 Jonathan Cameron, Alistair Popple, Dan Williams, Johannes Weiner,
 jvgediya.oss@gmail.com, "Aneesh Kumar K.V", Jagdish Gediya
Subject: [PATCH v11 1/8] mm/demotion: Add support for explicit memory tiers
Date: Fri, 29 Jul 2022 00:34:29 +0530
Message-Id: <20220728190436.858458-2-aneesh.kumar@linux.ibm.com>
X-Mailer: git-send-email 2.37.1
In-Reply-To: <20220728190436.858458-1-aneesh.kumar@linux.ibm.com>
References: <20220728190436.858458-1-aneesh.kumar@linux.ibm.com>

In the current kernel, memory tiers are defined implicitly via a demotion
path relationship between NUMA nodes, which is created during kernel
initialization and updated when a NUMA node is hot-added or hot-removed.
The current implementation puts all nodes with CPUs into the highest tier,
and builds the tier hierarchy tier-by-tier by establishing the per-node
demotion targets based on the distances between nodes.

This current memory tier kernel implementation needs to be improved for
several important use cases:

The current tier initialization code always initializes each memory-only
NUMA node into a lower tier.
But a memory-only NUMA node may have a high performance memory device
(e.g. a DRAM-backed memory-only node on a virtual machine) that should be
put into a higher tier.

The current tier hierarchy always puts CPU nodes into the top tier. But on
a system with HBM or GPU devices, the memory-only NUMA nodes mapping these
devices should be in the top tier, and DRAM nodes with CPUs are better
placed in the next lower tier.

With the current kernel, a higher-tier node can only be demoted to the
nodes with the shortest distance on the next lower tier, as defined by the
demotion path, and not to any other node from any lower tier. This strict
demotion order does not work in all use cases (e.g. some use cases may
want to allow cross-socket demotion to another node in the same demotion
tier as a fallback when the preferred demotion node is out of space). This
demotion order is also inconsistent with the page allocation fallback
order when all the nodes in a higher tier are out of space: the page
allocation can fall back to any node from any lower tier, whereas the
demotion order doesn't allow that.

This patch series addresses the above by defining memory tiers explicitly.

The Linux kernel presents memory devices as NUMA nodes, and each memory
device is of a specific type. The memory type of a device is represented
by its abstract distance. A memory tier corresponds to a range of abstract
distances. This allows for classifying memory devices with a specific
performance range into a memory tier.

This patch configures the range/chunk size to be 128. The default DRAM
abstract distance is 512. We can have 4 memory tiers below the default
DRAM abstract distance, which cover the ranges 0 - 127, 128 - 255,
256 - 383 and 384 - 511. Slower memory devices like persistent memory will
have an abstract distance below the default DRAM level and hence will be
placed in these 4 lower tiers.

A kernel parameter is provided to override the default memory tier.

Link: https://lore.kernel.org/linux-mm/CAAPL-u9Wv+nH1VOZTj=9p9S70Y3Qz3+63EkqncRDdHfubsrjfw@mail.gmail.com
Link: https://lore.kernel.org/linux-mm/7b72ccf4-f4ae-cb4e-f411-74d055482026@linux.ibm.com
Signed-off-by: Jagdish Gediya
Signed-off-by: Aneesh Kumar K.V
---
 include/linux/memory-tiers.h |  17 ++++++
 mm/Makefile                  |   1 +
 mm/memory-tiers.c            | 102 +++++++++++++++++++++++++++++++++++
 3 files changed, 120 insertions(+)
 create mode 100644 include/linux/memory-tiers.h
 create mode 100644 mm/memory-tiers.c

diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
new file mode 100644
index 000000000000..8d7884b7a3f0
--- /dev/null
+++ b/include/linux/memory-tiers.h
@@ -0,0 +1,17 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_MEMORY_TIERS_H
+#define _LINUX_MEMORY_TIERS_H
+
+/*
+ * Each tier covers an abstract distance chunk size of 128
+ */
+#define MEMTIER_CHUNK_BITS	7
+#define MEMTIER_CHUNK_SIZE	(1 << MEMTIER_CHUNK_BITS)
+/*
+ * For now let's have 4 memory tiers below the default DRAM tier.
+ */
+#define MEMTIER_ADISTANCE_DRAM	(1 << (MEMTIER_CHUNK_BITS + 2))
+/* leave one tier below this slow pmem */
+#define MEMTIER_ADISTANCE_PMEM	(1 << MEMTIER_CHUNK_BITS)
+
+#endif  /* _LINUX_MEMORY_TIERS_H */
diff --git a/mm/Makefile b/mm/Makefile
index 6f9ffa968a1a..d30acebc2164 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -92,6 +92,7 @@ obj-$(CONFIG_KFENCE) += kfence/
 obj-$(CONFIG_FAILSLAB) += failslab.o
 obj-$(CONFIG_MEMTEST)		+= memtest.o
 obj-$(CONFIG_MIGRATION) += migrate.o
+obj-$(CONFIG_NUMA) += memory-tiers.o
 obj-$(CONFIG_DEVICE_MIGRATION) += migrate_device.o
 obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o khugepaged.o
 obj-$(CONFIG_PAGE_COUNTER) += page_counter.o
diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
new file mode 100644
index 000000000000..01cfd514c192
--- /dev/null
+++ b/mm/memory-tiers.c
@@ -0,0 +1,102 @@
+// SPDX-License-Identifier: GPL-2.0
+#include
+#include
+#include
+#include
+#include
+
+struct memory_tier {
+	/* hierarchy of memory tiers */
+	struct list_head list;
+	/* list of all memory types that are part of this tier */
+	struct list_head memory_types;
+	/*
+	 * start value of abstract distance. memory tier maps
+	 * an abstract distance range,
+	 * adistance_start .. adistance_start + MEMTIER_CHUNK_SIZE
+	 */
+	int adistance_start;
+};
+
+struct memory_dev_type {
+	/* list of memory types that are part of the same tier as this type */
+	struct list_head tier_sibiling;
+	/* abstract distance for this specific memory type */
+	int adistance;
+	/* Nodes of same abstract distance */
+	nodemask_t nodes;
+	struct memory_tier *memtier;
+};
+
+static DEFINE_MUTEX(memory_tier_lock);
+static LIST_HEAD(memory_tiers);
+struct memory_dev_type *node_memory_types[MAX_NUMNODES];
+/*
+ * For now let's have 4 memory tiers below the default DRAM tier.
+ */
+static struct memory_dev_type default_dram_type  = {
+	.adistance = MEMTIER_ADISTANCE_DRAM,
+	.tier_sibiling = LIST_HEAD_INIT(default_dram_type.tier_sibiling),
+};
+
+static struct memory_tier *find_create_memory_tier(struct memory_dev_type *memtype)
+{
+	bool found_slot = false;
+	struct memory_tier *memtier, *new_memtier;
+	int adistance = memtype->adistance;
+	unsigned int memtier_adistance_chunk_size = MEMTIER_CHUNK_SIZE;
+
+	lockdep_assert_held_once(&memory_tier_lock);
+
+	/*
+	 * If the memtype is already part of a memory tier,
+	 * just return that.
+	 */
+	if (memtype->memtier)
+		return memtype->memtier;
+
+	adistance = round_down(adistance, memtier_adistance_chunk_size);
+	list_for_each_entry(memtier, &memory_tiers, list) {
+		if (adistance == memtier->adistance_start) {
+			memtype->memtier = memtier;
+			list_add(&memtype->tier_sibiling, &memtier->memory_types);
+			return memtier;
+		} else if (adistance < memtier->adistance_start) {
+			found_slot = true;
+			break;
+		}
+	}
+
+	new_memtier = kzalloc(sizeof(struct memory_tier), GFP_KERNEL);
+	if (!new_memtier)
+		return ERR_PTR(-ENOMEM);
+
+	new_memtier->adistance_start = adistance;
+	INIT_LIST_HEAD(&new_memtier->list);
+	INIT_LIST_HEAD(&new_memtier->memory_types);
+	if (found_slot)
+		list_add_tail(&new_memtier->list, &memtier->list);
+	else
+		list_add_tail(&new_memtier->list, &memory_tiers);
+	memtype->memtier = new_memtier;
+	list_add(&memtype->tier_sibiling, &new_memtier->memory_types);
+	return new_memtier;
+}
+
+static int __init memory_tier_init(void)
+{
+	struct memory_tier *memtier;
+
+	mutex_lock(&memory_tier_lock);
+	/* CPU-only nodes are not part of memory tiers. */
+	default_dram_type.nodes = node_states[N_MEMORY];
+
+	memtier = find_create_memory_tier(&default_dram_type);
+	if (IS_ERR(memtier))
+		panic("%s() failed to register memory tier: %ld\n",
+		      __func__, PTR_ERR(memtier));
+	mutex_unlock(&memory_tier_lock);
+
+	return 0;
+}
+subsys_initcall(memory_tier_init);