From patchwork Thu Aug 18 13:10:33 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Aneesh Kumar K.V" X-Patchwork-Id: 12947864 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id A7A03C28B2B for ; Thu, 18 Aug 2022 20:39:48 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 1E3AD8D0003; Thu, 18 Aug 2022 16:39:48 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id 16C4F8D0002; Thu, 18 Aug 2022 16:39:48 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id F00D28D0003; Thu, 18 Aug 2022 16:39:47 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0017.hostedemail.com [216.40.44.17]) by kanga.kvack.org (Postfix) with ESMTP id DBB538D0002 for ; Thu, 18 Aug 2022 16:39:47 -0400 (EDT) Received: from smtpin02.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay06.hostedemail.com (Postfix) with ESMTP id A8EC1AD2A7 for ; Thu, 18 Aug 2022 20:39:47 +0000 (UTC) X-FDA: 79813879614.02.8AB48C0 Received: from mx0a-001b2d01.pphosted.com (mx0a-001b2d01.pphosted.com [148.163.156.1]) by imf06.hostedemail.com (Postfix) with ESMTP id 12084184CD3 for ; Thu, 18 Aug 2022 20:28:28 +0000 (UTC) Received: from pps.filterd (m0187473.ppops.net [127.0.0.1]) by mx0a-001b2d01.pphosted.com (8.17.1.5/8.17.1.5) with ESMTP id 27ICitT4025834; Thu, 18 Aug 2022 13:11:03 GMT DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=ibm.com; h=from : to : cc : subject : date : message-id : in-reply-to : references : mime-version : content-transfer-encoding; s=pp1; bh=4MzbymPQ/qh0NBVwON+9WHYwENubdq5Fjkr2wzaXZNg=; b=Hw71tHpOa+jvR+mcNMlDqoR9TAs2+rdU1e8KEN2oNmhabnt2QiV73gFs3z03Kbu3ewxI 99SuQ/cDtorPYlOWv2pJ6iwnY+voICsqXAldEYucb/QriaJYkJ/pCphhxKYlb9qe71TB tUAGRUOXwDSIl+rWemSN6cKcgd2xsgLjo2/5CpoPzZ4oXsc28aCHpY9qju1wrxrTltPb XkcnvtRM75sG+72MEZh0naC3rjUARVLeZkfWIxIHc+DG/XUUOjI5XQg3cBhbZW82U95c wSZ7SqNq0Lh7cY8iA0IlOlyff37h/6pZOn0OwJcRoOxl4z96z+KJnWvKo7NVJnWxueZz 8g== Received: from pps.reinject (localhost [127.0.0.1]) by mx0a-001b2d01.pphosted.com (PPS) with ESMTPS id 3j1npxgx21-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Thu, 18 Aug 2022 13:11:03 +0000 Received: from m0187473.ppops.net (m0187473.ppops.net [127.0.0.1]) by pps.reinject (8.17.1.5/8.17.1.5) with ESMTP id 27ICnXv9040230; Thu, 18 Aug 2022 13:11:02 GMT Received: from ppma01dal.us.ibm.com (83.d6.3fa9.ip4.static.sl-reverse.com [169.63.214.131]) by mx0a-001b2d01.pphosted.com (PPS) with ESMTPS id 3j1npxgx10-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Thu, 18 Aug 2022 13:11:02 +0000 Received: from pps.filterd (ppma01dal.us.ibm.com [127.0.0.1]) by ppma01dal.us.ibm.com (8.16.1.2/8.16.1.2) with SMTP id 27ID68Mv007590; Thu, 18 Aug 2022 13:11:01 GMT Received: from b01cxnp22034.gho.pok.ibm.com (b01cxnp22034.gho.pok.ibm.com [9.57.198.24]) by ppma01dal.us.ibm.com with ESMTP id 3hx3kaxgar-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Thu, 18 Aug 2022 13:11:01 +0000 Received: from b01ledav002.gho.pok.ibm.com (b01ledav002.gho.pok.ibm.com [9.57.199.107]) by b01cxnp22034.gho.pok.ibm.com (8.14.9/8.14.9/NCO v10.0) with ESMTP id 27IDB0Zk3080772 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-GCM-SHA384 bits=256 verify=OK); Thu, 18 Aug 2022 13:11:00 GMT Received: from b01ledav002.gho.pok.ibm.com (unknown [127.0.0.1]) by IMSVA (Postfix) with ESMTP id 45673124055; Thu, 18 Aug 2022 13:11:00 +0000 (GMT) Received: from b01ledav002.gho.pok.ibm.com (unknown [127.0.0.1]) by IMSVA (Postfix) with ESMTP id 556A9124054; Thu, 18 Aug 2022 13:10:55 +0000 (GMT) Received: from skywalker.ibmuc.com (unknown [9.43.111.107]) by b01ledav002.gho.pok.ibm.com (Postfix) with ESMTP; Thu, 18 Aug 2022 13:10:55 +0000 (GMT) From: "Aneesh Kumar K.V" To: linux-mm@kvack.org, akpm@linux-foundation.org Cc: Wei Xu , Huang Ying , Yang Shi , Davidlohr Bueso , Tim C Chen , Michal Hocko , Linux Kernel Mailing List , Hesham Almatary , Dave Hansen , Jonathan Cameron , Alistair Popple , Dan Williams , Johannes Weiner , jvgediya.oss@gmail.com, Bharata B Rao , "Aneesh Kumar K.V" Subject: [PATCH v15 01/10] mm/demotion: Add support for explicit memory tiers Date: Thu, 18 Aug 2022 18:40:33 +0530 Message-Id: <20220818131042.113280-2-aneesh.kumar@linux.ibm.com> X-Mailer: git-send-email 2.37.2 In-Reply-To: <20220818131042.113280-1-aneesh.kumar@linux.ibm.com> References: <20220818131042.113280-1-aneesh.kumar@linux.ibm.com> MIME-Version: 1.0 X-TM-AS-GCONF: 00 X-Proofpoint-ORIG-GUID: qnO-NOLRzjSnMq5r0oR9KL8StF-DTgeu X-Proofpoint-GUID: sAUN-Rd9BJ2AqOrsHqw9IPLUJWpi7vbY X-Proofpoint-Virus-Version: vendor=baseguard engine=ICAP:2.0.205,Aquarius:18.0.895,Hydra:6.0.517,FMLib:17.11.122.1 definitions=2022-08-18_12,2022-08-18_01,2022-06-22_01 X-Proofpoint-Spam-Details: rule=outbound_notspam policy=outbound score=0 malwarescore=0 lowpriorityscore=0 clxscore=1015 bulkscore=0 mlxlogscore=999 adultscore=0 suspectscore=0 mlxscore=0 impostorscore=0 priorityscore=1501 phishscore=0 spamscore=0 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.12.0-2207270000 definitions=main-2208180045 ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1660854511; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=4MzbymPQ/qh0NBVwON+9WHYwENubdq5Fjkr2wzaXZNg=; b=PKFomdHg/vR9YUXKmg2egYPLw8nT9nPz+USTU5E4zHFiYN+jNnK36DQX9tyvzhQDB9EYzk j3z/AtocXrl8g0MY+cE+/Qps7ru7EKlk1SoiRK9t0DMHfZQ46uakeeunhFuS9oA1utwxJ+ 4y26gcw/3VY7L8nwMZRwS87WQLySMlg= ARC-Authentication-Results: i=1; imf06.hostedemail.com; dkim=pass header.d=ibm.com header.s=pp1 header.b=Hw71tHpO; spf=pass (imf06.hostedemail.com: domain of aneesh.kumar@linux.ibm.com designates 148.163.156.1 as permitted sender) smtp.mailfrom=aneesh.kumar@linux.ibm.com; dmarc=pass (policy=none) header.from=ibm.com ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1660854511; a=rsa-sha256; cv=none; b=fbxefdASr+aywuy/UtkKJZqsL1ChXqPE5DR7IPmt2Mygz94i1ajhoaVuuTaGAnyeGto6C6 yk4FRqUhpBttKBeTXM2r7wA+QclW61bj2RXEWLzs4D6oxyUKzOSRz6xVJbNO8Fvsoha2Gi 5h4dTn1VIRjhDJjaOazcs0vETETkGFk= X-Stat-Signature: 3s3mo95kfmn9i5cagexafihhkychwekd X-Rspamd-Queue-Id: 12084184CD3 Authentication-Results: imf06.hostedemail.com; dkim=pass header.d=ibm.com header.s=pp1 header.b=Hw71tHpO; spf=pass (imf06.hostedemail.com: domain of aneesh.kumar@linux.ibm.com designates 148.163.156.1 as permitted sender) smtp.mailfrom=aneesh.kumar@linux.ibm.com; dmarc=pass (policy=none) header.from=ibm.com X-Rspam-User: X-Rspamd-Server: rspam07 X-HE-Tag: 1660854508-672950 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: In the current kernel, memory tiers are defined implicitly via a demotion path relationship between NUMA nodes, which is created during the kernel initialization and updated when a NUMA node is hot-added or hot-removed. The current implementation puts all nodes with CPU into the highest tier, and builds the tier hierarchy by establishing the per-node demotion targets based on the distances between nodes. This current memory tier kernel implementation needs to be improved for several important use cases, The current tier initialization code always initializes each memory-only NUMA node into a lower tier. But a memory-only NUMA node may have a high performance memory device (e.g. a DRAM-backed memory-only node on a virtual machine) that should be put into a higher tier. The current tier hierarchy always puts CPU nodes into the top tier. But on a system with HBM or GPU devices, the memory-only NUMA nodes mapping these devices should be in the top tier, and DRAM nodes with CPUs are better to be placed into the next lower tier. With current kernel higher tier node can only be demoted to nodes with shortest distance on the next lower tier as defined by the demotion path, not any other node from any lower tier. This strict, demotion order does not work in all use cases (e.g. some use cases may want to allow cross-socket demotion to another node in the same demotion tier as a fallback when the preferred demotion node is out of space), This demotion order is also inconsistent with the page allocation fallback order when all the nodes in a higher tier are out of space: The page allocation can fall back to any node from any lower tier, whereas the demotion order doesn't allow that. This patch series address the above by defining memory tiers explicitly. Linux kernel presents memory devices as NUMA nodes and each memory device is of a specific type. The memory type of a device is represented by its abstract distance. A memory tier corresponds to a range of abstract distance. This allows for classifying memory devices with a specific performance range into a memory tier. This patch configures the range/chunk size to be 128. The default DRAM abstract distance is 512. We can have 4 memory tiers below the default DRAM with abstract distance range 0 - 127, 127 - 255, 256- 383, 384 - 511. Faster memory devices can be placed in these faster(higher) memory tiers. Slower memory devices like persistent memory will have abstract distance higher than the default DRAM level. Reviewed-by: "Huang, Ying" Signed-off-by: Aneesh Kumar K.V --- include/linux/memory-tiers.h | 18 +++++ mm/Makefile | 1 + mm/memory-tiers.c | 129 +++++++++++++++++++++++++++++++++++ 3 files changed, 148 insertions(+) create mode 100644 include/linux/memory-tiers.h create mode 100644 mm/memory-tiers.c diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h new file mode 100644 index 000000000000..5206568f34ed --- /dev/null +++ b/include/linux/memory-tiers.h @@ -0,0 +1,18 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +#ifndef _LINUX_MEMORY_TIERS_H +#define _LINUX_MEMORY_TIERS_H + +/* + * Each tier cover a abstrace distance chunk size of 128 + */ +#define MEMTIER_CHUNK_BITS 7 +#define MEMTIER_CHUNK_SIZE (1 << MEMTIER_CHUNK_BITS) +/* + * Smaller abstract distance values imply faster(higher) memory tiers. Offset + * the DRAM adistance so that we can accommodate devices with a slightly lower + * adistance value (slightly slower) than default DRAM adistance to be part of + * the same memory tier. + */ +#define MEMTIER_ADISTANCE_DRAM ((4 * MEMTIER_CHUNK_SIZE) + (MEMTIER_CHUNK_SIZE >> 1)) + +#endif /* _LINUX_MEMORY_TIERS_H */ diff --git a/mm/Makefile b/mm/Makefile index 9a564f836403..488f604e77e0 100644 --- a/mm/Makefile +++ b/mm/Makefile @@ -92,6 +92,7 @@ obj-$(CONFIG_KFENCE) += kfence/ obj-$(CONFIG_FAILSLAB) += failslab.o obj-$(CONFIG_MEMTEST) += memtest.o obj-$(CONFIG_MIGRATION) += migrate.o +obj-$(CONFIG_NUMA) += memory-tiers.o obj-$(CONFIG_DEVICE_MIGRATION) += migrate_device.o obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o khugepaged.o obj-$(CONFIG_PAGE_COUNTER) += page_counter.o diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c new file mode 100644 index 000000000000..1f494e69776a --- /dev/null +++ b/mm/memory-tiers.c @@ -0,0 +1,129 @@ +// SPDX-License-Identifier: GPL-2.0 +#include +#include +#include +#include +#include + +struct memory_tier { + /* hierarchy of memory tiers */ + struct list_head list; + /* list of all memory types part of this tier */ + struct list_head memory_types; + /* + * start value of abstract distance. memory tier maps + * an abstract distance range, + * adistance_start .. adistance_start + MEMTIER_CHUNK_SIZE + */ + int adistance_start; +}; + +struct memory_dev_type { + /* list of memory types that are part of same tier as this type */ + struct list_head tier_sibiling; + /* abstract distance for this specific memory type */ + int adistance; + /* Nodes of same abstract distance */ + nodemask_t nodes; + struct memory_tier *memtier; +}; + +static DEFINE_MUTEX(memory_tier_lock); +static LIST_HEAD(memory_tiers); +static struct memory_dev_type *node_memory_types[MAX_NUMNODES]; +/* + * For now we can have 4 faster memory tiers with smaller adistance + * than default DRAM tier. + */ +static struct memory_dev_type default_dram_type = { + .adistance = MEMTIER_ADISTANCE_DRAM, + .tier_sibiling = LIST_HEAD_INIT(default_dram_type.tier_sibiling), +}; + +static struct memory_tier *find_create_memory_tier(struct memory_dev_type *memtype) +{ + bool found_slot = false; + struct memory_tier *memtier, *new_memtier; + int adistance = memtype->adistance; + unsigned int memtier_adistance_chunk_size = MEMTIER_CHUNK_SIZE; + + lockdep_assert_held_once(&memory_tier_lock); + + /* + * If the memtype is already part of a memory tier, + * just return that. + */ + if (memtype->memtier) + return memtype->memtier; + + adistance = round_down(adistance, memtier_adistance_chunk_size); + list_for_each_entry(memtier, &memory_tiers, list) { + if (adistance == memtier->adistance_start) { + memtype->memtier = memtier; + list_add(&memtype->tier_sibiling, &memtier->memory_types); + return memtier; + } else if (adistance < memtier->adistance_start) { + found_slot = true; + break; + } + } + + new_memtier = kmalloc(sizeof(struct memory_tier), GFP_KERNEL); + if (!new_memtier) + return ERR_PTR(-ENOMEM); + + new_memtier->adistance_start = adistance; + INIT_LIST_HEAD(&new_memtier->list); + INIT_LIST_HEAD(&new_memtier->memory_types); + if (found_slot) + list_add_tail(&new_memtier->list, &memtier->list); + else + list_add_tail(&new_memtier->list, &memory_tiers); + memtype->memtier = new_memtier; + list_add(&memtype->tier_sibiling, &new_memtier->memory_types); + return new_memtier; +} + +static struct memory_tier *set_node_memory_tier(int node) +{ + struct memory_tier *memtier; + struct memory_dev_type *memtype; + + lockdep_assert_held_once(&memory_tier_lock); + + if (!node_state(node, N_MEMORY)) + return ERR_PTR(-EINVAL); + + if (!node_memory_types[node]) + node_memory_types[node] = &default_dram_type; + + memtype = node_memory_types[node]; + node_set(node, memtype->nodes); + memtier = find_create_memory_tier(memtype); + return memtier; +} + +static int __init memory_tier_init(void) +{ + int node; + struct memory_tier *memtier; + + mutex_lock(&memory_tier_lock); + /* + * Look at all the existing N_MEMORY nodes and add them to + * default memory tier or to a tier if we already have memory + * types assigned. + */ + for_each_node_state(node, N_MEMORY) { + memtier = set_node_memory_tier(node); + if (IS_ERR(memtier)) + /* + * Continue with memtiers we are able to setup + */ + break; + } + mutex_unlock(&memory_tier_lock); + + return 0; +} +subsys_initcall(memory_tier_init); From patchwork Thu Aug 18 13:10:34 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Aneesh Kumar K.V" X-Patchwork-Id: 12948011 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 9F1EBC00140 for ; Thu, 18 Aug 2022 21:49:00 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 383198D0002; Thu, 18 Aug 2022 17:49:00 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id 2E5EA8D0001; Thu, 18 Aug 2022 17:49:00 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 110858D0002; Thu, 18 Aug 2022 17:49:00 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0016.hostedemail.com [216.40.44.16]) by kanga.kvack.org (Postfix) with ESMTP id 0041B8D0001 for ; Thu, 18 Aug 2022 17:48:59 -0400 (EDT) Received: from smtpin20.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay09.hostedemail.com (Postfix) with ESMTP id B50A081B40 for ; Thu, 18 Aug 2022 21:48:59 +0000 (UTC) X-FDA: 79814053998.20.84EEAC8 Received: by imf20.hostedemail.com (Postfix, from userid 200) id B2B8E1C0546; Thu, 18 Aug 2022 15:58:04 +0000 (UTC) Received: from mx0b-001b2d01.pphosted.com (mx0b-001b2d01.pphosted.com [148.163.158.5]) by imf20.hostedemail.com (Postfix) with ESMTP id 8CE9A1C2920 for ; Thu, 18 Aug 2022 15:58:00 +0000 (UTC) Received: from pps.filterd (m0098417.ppops.net [127.0.0.1]) by mx0a-001b2d01.pphosted.com (8.17.1.5/8.17.1.5) with ESMTP id 27ICxohl026681; Thu, 18 Aug 2022 13:11:12 GMT DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=ibm.com; h=from : to : cc : subject : date : message-id : in-reply-to : references : mime-version : content-transfer-encoding; s=pp1; bh=XmLg6OLOMq6BKTRPoejiLXiFZksM5cUcBPKRSS+HmWI=; b=Mf/a2ILdJP/m9tEdXU8wXm58B7HzZ8VvuijI35e3LzHfXVSH3ywfkbzvPgnQ6CedUFtv UYjAS1uRX/VgMstJqPW4sLuSRya1Z6FEegKBPOwT2YTzuah9vcMnSmCQWNtHuSs8k5yA PtEcoxWWUMjvvv2nR66CGn3t+m9OoPc5M9w68bw+UHJGP82KmRQFdu4gKp5nlDhw+kaF QG1nVzUjHoaFxKPU0UiSSYYV63dnpkuVa6FDBH5ojsQ1G560196/6zXge9JD1PKWjxmC zXrWsL6/wN2L3+l/Bpv7C2x8+qfZdbUWP3ddpLFe+uLT7xUgU/rqDMxiYm1P9uSfbmmd 3Q== Received: from pps.reinject (localhost [127.0.0.1]) by mx0a-001b2d01.pphosted.com (PPS) with ESMTPS id 3j1nx0rhu3-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Thu, 18 Aug 2022 13:11:12 +0000 Received: from m0098417.ppops.net (m0098417.ppops.net [127.0.0.1]) by pps.reinject (8.17.1.5/8.17.1.5) with ESMTP id 27ICxoYA026692; Thu, 18 Aug 2022 13:11:11 GMT Received: from ppma03dal.us.ibm.com (b.bd.3ea9.ip4.static.sl-reverse.com [169.62.189.11]) by mx0a-001b2d01.pphosted.com (PPS) with ESMTPS id 3j1nx0rhtr-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Thu, 18 Aug 2022 13:11:11 +0000 Received: from pps.filterd (ppma03dal.us.ibm.com [127.0.0.1]) by ppma03dal.us.ibm.com (8.16.1.2/8.16.1.2) with SMTP id 27ID5dqu002960; Thu, 18 Aug 2022 13:11:10 GMT Received: from b01cxnp22033.gho.pok.ibm.com (b01cxnp22033.gho.pok.ibm.com [9.57.198.23]) by ppma03dal.us.ibm.com with ESMTP id 3hx3kaehs5-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Thu, 18 Aug 2022 13:11:10 +0000 Received: from b01ledav002.gho.pok.ibm.com (b01ledav002.gho.pok.ibm.com [9.57.199.107]) by b01cxnp22033.gho.pok.ibm.com (8.14.9/8.14.9/NCO v10.0) with ESMTP id 27IDB97764618834 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-GCM-SHA384 bits=256 verify=OK); Thu, 18 Aug 2022 13:11:09 GMT Received: from b01ledav002.gho.pok.ibm.com (unknown [127.0.0.1]) by IMSVA (Postfix) with ESMTP id 87B5A124058; Thu, 18 Aug 2022 13:11:09 +0000 (GMT) Received: from b01ledav002.gho.pok.ibm.com (unknown [127.0.0.1]) by IMSVA (Postfix) with ESMTP id C4279124054; Thu, 18 Aug 2022 13:11:00 +0000 (GMT) Received: from skywalker.ibmuc.com (unknown [9.43.111.107]) by b01ledav002.gho.pok.ibm.com (Postfix) with ESMTP; Thu, 18 Aug 2022 13:11:00 +0000 (GMT) From: "Aneesh Kumar K.V" To: linux-mm@kvack.org, akpm@linux-foundation.org Cc: Wei Xu , Huang Ying , Yang Shi , Davidlohr Bueso , Tim C Chen , Michal Hocko , Linux Kernel Mailing List , Hesham Almatary , Dave Hansen , Jonathan Cameron , Alistair Popple , Dan Williams , Johannes Weiner , jvgediya.oss@gmail.com, Bharata B Rao , "Aneesh Kumar K.V" Subject: [PATCH v15 02/10] mm/demotion: Move memory demotion related code Date: Thu, 18 Aug 2022 18:40:34 +0530 Message-Id: <20220818131042.113280-3-aneesh.kumar@linux.ibm.com> X-Mailer: git-send-email 2.37.2 In-Reply-To: <20220818131042.113280-1-aneesh.kumar@linux.ibm.com> References: <20220818131042.113280-1-aneesh.kumar@linux.ibm.com> MIME-Version: 1.0 X-TM-AS-GCONF: 00 X-Proofpoint-GUID: CpO2Oa9ZtExkPCTLe1HWPdhlpl4JKX3p X-Proofpoint-ORIG-GUID: h7FhnIlYQ17fQbTDnNq1nnY-3Ql-oF1h X-Proofpoint-Virus-Version: vendor=baseguard engine=ICAP:2.0.205,Aquarius:18.0.895,Hydra:6.0.517,FMLib:17.11.122.1 definitions=2022-08-18_12,2022-08-18_01,2022-06-22_01 X-Proofpoint-Spam-Details: rule=outbound_notspam policy=outbound score=0 malwarescore=0 priorityscore=1501 mlxscore=0 spamscore=0 suspectscore=0 lowpriorityscore=0 phishscore=0 clxscore=1015 bulkscore=0 impostorscore=0 adultscore=0 mlxlogscore=999 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.12.0-2207270000 definitions=main-2208180045 ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1660838281; a=rsa-sha256; cv=none; b=NNitVkOJu+ug9j5sH09c4xAR5ZHoA9tvlxwpp/TGWH924dw2+8ci31Qt5S8lYokC7BEvPV 7RP+vzZ3nH/fSgB5P6QNQonf7W1L/UgCnt0s0Qw/KPjrIIm52NbuzvumtSue1QCTq0WwGE IbyI4VWKb5ZF0VBt6Mo5zMXwGUuXLeA= ARC-Authentication-Results: i=1; imf20.hostedemail.com; dkim=pass header.d=ibm.com header.s=pp1 header.b="Mf/a2ILd"; dmarc=pass (policy=none) header.from=ibm.com; spf=pass (imf20.hostedemail.com: domain of aneesh.kumar@linux.ibm.com designates 148.163.158.5 as permitted sender) smtp.mailfrom=aneesh.kumar@linux.ibm.com ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1660838281; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=XmLg6OLOMq6BKTRPoejiLXiFZksM5cUcBPKRSS+HmWI=; b=Vsn4uOrE4lOeVLHuZ5b/ESNS9n4LS98QnMKuc5/4RvgfhqC/LOUPeWAoKST/r8l4LO5FtO Qht99KBjK1xBHQZUOgwarfWWHRT5inIIMMOUJcuqbFE+L/mmYkDPwLp5ZYmGTCsbWyYJrJ bXG2F22czdwb56TMeRbEk98rMv8BVrY= Authentication-Results: imf20.hostedemail.com; dkim=pass header.d=ibm.com header.s=pp1 header.b="Mf/a2ILd"; dmarc=pass (policy=none) header.from=ibm.com; spf=pass (imf20.hostedemail.com: domain of aneesh.kumar@linux.ibm.com designates 148.163.158.5 as permitted sender) smtp.mailfrom=aneesh.kumar@linux.ibm.com X-Rspam-User: X-Stat-Signature: 4r1d47cc1ag791575gy91pjpon6z3fzs X-Rspamd-Queue-Id: 8CE9A1C2920 X-Rspamd-Server: rspam12 X-HE-Tag: 1660838280-764924 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: This move memory demotion related code to mm/memory-tiers.c. No functional change in this patch. Reviewed-by: "Huang, Ying" Signed-off-by: Aneesh Kumar K.V --- include/linux/memory-tiers.h | 8 +++++ include/linux/migrate.h | 2 -- mm/memory-tiers.c | 64 ++++++++++++++++++++++++++++++++++++ mm/migrate.c | 60 +-------------------------------- mm/vmscan.c | 1 + 5 files changed, 74 insertions(+), 61 deletions(-) diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h index 5206568f34ed..c82a1abefca0 100644 --- a/include/linux/memory-tiers.h +++ b/include/linux/memory-tiers.h @@ -15,4 +15,12 @@ */ #define MEMTIER_ADISTANCE_DRAM ((4 * MEMTIER_CHUNK_SIZE) + (MEMTIER_CHUNK_SIZE >> 1)) +#ifdef CONFIG_NUMA +#include +extern bool numa_demotion_enabled; + +#else + +#define numa_demotion_enabled false +#endif /* CONFIG_NUMA */ #endif /* _LINUX_MEMORY_TIERS_H */ diff --git a/include/linux/migrate.h b/include/linux/migrate.h index 22c0a0cf5e0c..96f8c84413fe 100644 --- a/include/linux/migrate.h +++ b/include/linux/migrate.h @@ -103,7 +103,6 @@ static inline int migrate_huge_page_move_mapping(struct address_space *mapping, #if defined(CONFIG_MIGRATION) && defined(CONFIG_NUMA) extern void set_migration_target_nodes(void); extern void migrate_on_reclaim_init(void); -extern bool numa_demotion_enabled; extern int next_demotion_node(int node); #else static inline void set_migration_target_nodes(void) {} @@ -112,7 +111,6 @@ static inline int next_demotion_node(int node) { return NUMA_NO_NODE; } -#define numa_demotion_enabled false #endif #ifdef CONFIG_COMPACTION diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c index 1f494e69776a..f3dc3318d931 100644 --- a/mm/memory-tiers.c +++ b/mm/memory-tiers.c @@ -3,6 +3,8 @@ #include #include #include +#include +#include #include struct memory_tier { @@ -127,3 +129,65 @@ static int __init memory_tier_init(void) return 0; } subsys_initcall(memory_tier_init); + +bool numa_demotion_enabled = false; + +#ifdef CONFIG_MIGRATION +#ifdef CONFIG_SYSFS +static ssize_t numa_demotion_enabled_show(struct kobject *kobj, + struct kobj_attribute *attr, char *buf) +{ + return sysfs_emit(buf, "%s\n", + numa_demotion_enabled ? "true" : "false"); +} + +static ssize_t numa_demotion_enabled_store(struct kobject *kobj, + struct kobj_attribute *attr, + const char *buf, size_t count) +{ + ssize_t ret; + + ret = kstrtobool(buf, &numa_demotion_enabled); + if (ret) + return ret; + + return count; +} + +static struct kobj_attribute numa_demotion_enabled_attr = + __ATTR(demotion_enabled, 0644, numa_demotion_enabled_show, + numa_demotion_enabled_store); + +static struct attribute *numa_attrs[] = { + &numa_demotion_enabled_attr.attr, + NULL, +}; + +static const struct attribute_group numa_attr_group = { + .attrs = numa_attrs, +}; + +static int __init numa_init_sysfs(void) +{ + int err; + struct kobject *numa_kobj; + + numa_kobj = kobject_create_and_add("numa", mm_kobj); + if (!numa_kobj) { + pr_err("failed to create numa kobject\n"); + return -ENOMEM; + } + err = sysfs_create_group(numa_kobj, &numa_attr_group); + if (err) { + pr_err("failed to register numa group\n"); + goto delete_obj; + } + return 0; + +delete_obj: + kobject_put(numa_kobj); + return err; +} +subsys_initcall(numa_init_sysfs); +#endif /* CONFIG_SYSFS */ +#endif diff --git a/mm/migrate.c b/mm/migrate.c index 6a1597c92261..5d7fb417edbf 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -2562,64 +2562,6 @@ void __init migrate_on_reclaim_init(void) set_migration_target_nodes(); cpus_read_unlock(); } +#endif /* CONFIG_NUMA */ -bool numa_demotion_enabled = false; - -#ifdef CONFIG_SYSFS -static ssize_t numa_demotion_enabled_show(struct kobject *kobj, - struct kobj_attribute *attr, char *buf) -{ - return sysfs_emit(buf, "%s\n", - numa_demotion_enabled ? "true" : "false"); -} - -static ssize_t numa_demotion_enabled_store(struct kobject *kobj, - struct kobj_attribute *attr, - const char *buf, size_t count) -{ - ssize_t ret; - - ret = kstrtobool(buf, &numa_demotion_enabled); - if (ret) - return ret; - - return count; -} - -static struct kobj_attribute numa_demotion_enabled_attr = - __ATTR(demotion_enabled, 0644, numa_demotion_enabled_show, - numa_demotion_enabled_store); - -static struct attribute *numa_attrs[] = { - &numa_demotion_enabled_attr.attr, - NULL, -}; - -static const struct attribute_group numa_attr_group = { - .attrs = numa_attrs, -}; - -static int __init numa_init_sysfs(void) -{ - int err; - struct kobject *numa_kobj; - numa_kobj = kobject_create_and_add("numa", mm_kobj); - if (!numa_kobj) { - pr_err("failed to create numa kobject\n"); - return -ENOMEM; - } - err = sysfs_create_group(numa_kobj, &numa_attr_group); - if (err) { - pr_err("failed to register numa group\n"); - goto delete_obj; - } - return 0; - -delete_obj: - kobject_put(numa_kobj); - return err; -} -subsys_initcall(numa_init_sysfs); -#endif /* CONFIG_SYSFS */ -#endif /* CONFIG_NUMA */ diff --git a/mm/vmscan.c b/mm/vmscan.c index b2b1431352dc..224de380ac88 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -49,6 +49,7 @@ #include #include #include +#include #include #include From patchwork Thu Aug 18 13:10:35 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Aneesh Kumar K.V" X-Patchwork-Id: 12947984 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 048FCC00140 for ; Thu, 18 Aug 2022 21:21:47 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 96CB38D0002; Thu, 18 Aug 2022 17:21:46 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id 91C748D0001; Thu, 18 Aug 2022 17:21:46 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 7BD288D0002; Thu, 18 Aug 2022 17:21:46 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0011.hostedemail.com [216.40.44.11]) by kanga.kvack.org (Postfix) with ESMTP id 171A78D0001 for ; Thu, 18 Aug 2022 17:21:46 -0400 (EDT) Received: from smtpin17.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay05.hostedemail.com (Postfix) with ESMTP id F04C841FEC for ; Thu, 18 Aug 2022 21:21:45 +0000 (UTC) X-FDA: 79813985370.17.5120EB4 Received: by imf31.hostedemail.com (Postfix, from userid 200) id 540F320075; Thu, 18 Aug 2022 16:07:58 +0000 (UTC) Received: from mx0a-001b2d01.pphosted.com (mx0a-001b2d01.pphosted.com [148.163.156.1]) by imf31.hostedemail.com (Postfix) with ESMTP id DB90E20EE7 for ; Thu, 18 Aug 2022 16:07:54 +0000 (UTC) Received: from pps.filterd (m0098409.ppops.net [127.0.0.1]) by mx0a-001b2d01.pphosted.com (8.17.1.5/8.17.1.5) with ESMTP id 27ICBtSt006495; Thu, 18 Aug 2022 13:11:18 GMT DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=ibm.com; h=from : to : cc : subject : date : message-id : in-reply-to : references : mime-version : content-transfer-encoding; s=pp1; bh=DxxrFOycvBHQk5/4QI1h0OEG5diEikgeH+qCa3Uf3eI=; b=ojC9LQDd5lsUqZ16jKWK/Yody57nBuglYb83ImLtmsQAIPMIpadzYSTTe+T9kPcpBjLR gwsySZ8mqqHSfeR54gyQvY672X3cSeiG84FWKItL2gQXKGz94Jd/RbpkHtfJWbn2o52N 5UBrAzw9R3YxnGKDa76UcoiWK3YDN7hILrRIx46NfEQiYrEi7Smi8+64uGX5GnV6EL7H bHHeOwPTKdslGsVFCN2PM2Q0BoO0eKK1mLIvRaesjdW49EB80XcMVo3xBM5A+lF2npWH DwDvOSQaFHj2I4NnQOWgFcOzd7tza94YTePZ7/Nh8DUi7FL3aQuZ6g54aaKdD0SUZUM5 Sw== Received: from pps.reinject (localhost [127.0.0.1]) by mx0a-001b2d01.pphosted.com (PPS) with ESMTPS id 3j1n7cj53g-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Thu, 18 Aug 2022 13:11:18 +0000 Received: from m0098409.ppops.net (m0098409.ppops.net [127.0.0.1]) by pps.reinject (8.17.1.5/8.17.1.5) with ESMTP id 27ICDAFu009666; Thu, 18 Aug 2022 13:11:17 GMT Received: from ppma03dal.us.ibm.com (b.bd.3ea9.ip4.static.sl-reverse.com [169.62.189.11]) by mx0a-001b2d01.pphosted.com (PPS) with ESMTPS id 3j1n7cj51v-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Thu, 18 Aug 2022 13:11:17 +0000 Received: from pps.filterd (ppma03dal.us.ibm.com [127.0.0.1]) by ppma03dal.us.ibm.com (8.16.1.2/8.16.1.2) with SMTP id 27ID5cDD002945; Thu, 18 Aug 2022 13:11:16 GMT Received: from b01cxnp22035.gho.pok.ibm.com (b01cxnp22035.gho.pok.ibm.com [9.57.198.25]) by ppma03dal.us.ibm.com with ESMTP id 3hx3kaehsf-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Thu, 18 Aug 2022 13:11:15 +0000 Received: from b01ledav002.gho.pok.ibm.com (b01ledav002.gho.pok.ibm.com [9.57.199.107]) by b01cxnp22035.gho.pok.ibm.com (8.14.9/8.14.9/NCO v10.0) with ESMTP id 27IDBFUI6423158 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-GCM-SHA384 bits=256 verify=OK); Thu, 18 Aug 2022 13:11:15 GMT Received: from b01ledav002.gho.pok.ibm.com (unknown [127.0.0.1]) by IMSVA (Postfix) with ESMTP id 05C73124052; Thu, 18 Aug 2022 13:11:15 +0000 (GMT) Received: from b01ledav002.gho.pok.ibm.com (unknown [127.0.0.1]) by IMSVA (Postfix) with ESMTP id 14032124055; Thu, 18 Aug 2022 13:11:10 +0000 (GMT) Received: from skywalker.ibmuc.com (unknown [9.43.111.107]) by b01ledav002.gho.pok.ibm.com (Postfix) with ESMTP; Thu, 18 Aug 2022 13:11:09 +0000 (GMT) From: "Aneesh Kumar K.V" To: linux-mm@kvack.org, akpm@linux-foundation.org Cc: Wei Xu , Huang Ying , Yang Shi , Davidlohr Bueso , Tim C Chen , Michal Hocko , Linux Kernel Mailing List , Hesham Almatary , Dave Hansen , Jonathan Cameron , Alistair Popple , Dan Williams , Johannes Weiner , jvgediya.oss@gmail.com, Bharata B Rao , "Aneesh Kumar K.V" Subject: [PATCH v15 03/10] mm/demotion: Add hotplug callbacks to handle new numa node onlined Date: Thu, 18 Aug 2022 18:40:35 +0530 Message-Id: <20220818131042.113280-4-aneesh.kumar@linux.ibm.com> X-Mailer: git-send-email 2.37.2 In-Reply-To: <20220818131042.113280-1-aneesh.kumar@linux.ibm.com> References: <20220818131042.113280-1-aneesh.kumar@linux.ibm.com> MIME-Version: 1.0 X-TM-AS-GCONF: 00 X-Proofpoint-GUID: tELrMvoFiCSu17nbcBUc6ZVRIvn7mR_T X-Proofpoint-ORIG-GUID: lScm0d_d-SkSpiUOKWETZzOpgiZm_JbL X-Proofpoint-Virus-Version: vendor=baseguard engine=ICAP:2.0.205,Aquarius:18.0.895,Hydra:6.0.517,FMLib:17.11.122.1 definitions=2022-08-18_12,2022-08-18_01,2022-06-22_01 X-Proofpoint-Spam-Details: rule=outbound_notspam policy=outbound score=0 adultscore=0 malwarescore=0 bulkscore=0 impostorscore=0 phishscore=0 priorityscore=1501 spamscore=0 clxscore=1015 lowpriorityscore=0 mlxlogscore=999 suspectscore=0 mlxscore=0 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.12.0-2207270000 definitions=main-2208180045 ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1660838876; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=DxxrFOycvBHQk5/4QI1h0OEG5diEikgeH+qCa3Uf3eI=; b=hJMWPFnXVwCMaBocXDqcUXwhBqDpSkkIS0kuBEwQicUcEZcpT3sCeD6u2IPw2lWwMpSPEf fUK4b9EQLHeImGxAhXM8Wm2DiRdxEYv/MkGn8xNTw8RgAWPQjqWDtKx/l4W6/fc98v9ArO fxN0XCaGnnaRkEqeLh6KTk8nk12oOLg= ARC-Authentication-Results: i=1; imf31.hostedemail.com; dkim=pass header.d=ibm.com header.s=pp1 header.b=ojC9LQDd; dmarc=pass (policy=none) header.from=ibm.com; spf=pass (imf31.hostedemail.com: domain of aneesh.kumar@linux.ibm.com designates 148.163.156.1 as permitted sender) smtp.mailfrom=aneesh.kumar@linux.ibm.com ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1660838876; a=rsa-sha256; cv=none; b=7NHTx4Lyw2+qZoZOKYhMJB51qaJ+S3RAa4zTUK4udmXWVwbVbkYUh6zalQn7xaNHPdjl3z iXUpqCPeWtLaq5dTBs9BhTrDfAXx8MABnLSyf1xxddmWeUJes81W02BDF5XNMycjLwNtM8 t5qOGWKizla1Hdcq41mqOZq3Z2VUtT0= X-Rspamd-Queue-Id: DB90E20EE7 X-Rspam-User: Authentication-Results: imf31.hostedemail.com; dkim=pass header.d=ibm.com header.s=pp1 header.b=ojC9LQDd; dmarc=pass (policy=none) header.from=ibm.com; spf=pass (imf31.hostedemail.com: domain of aneesh.kumar@linux.ibm.com designates 148.163.156.1 as permitted sender) smtp.mailfrom=aneesh.kumar@linux.ibm.com X-Rspamd-Server: rspam02 X-Stat-Signature: to1g4ktfdtoa94xnca1c8btxbyukt44j X-HE-Tag: 1660838874-221179 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: If the new NUMA node onlined doesn't have a abstract distance assigned, the kernel adds the NUMA node to default memory tier. Reviewed-by: "Huang, Ying" Signed-off-by: Aneesh Kumar K.V --- include/linux/memory-tiers.h | 1 + mm/memory-tiers.c | 68 ++++++++++++++++++++++++++++++++++++ 2 files changed, 69 insertions(+) diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h index c82a1abefca0..17b41e592be6 100644 --- a/include/linux/memory-tiers.h +++ b/include/linux/memory-tiers.h @@ -14,6 +14,7 @@ * the same memory tier. */ #define MEMTIER_ADISTANCE_DRAM ((4 * MEMTIER_CHUNK_SIZE) + (MEMTIER_CHUNK_SIZE >> 1)) +#define MEMTIER_HOTPLUG_PRIO 100 #ifdef CONFIG_NUMA #include diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c index f3dc3318d931..05f05395468a 100644 --- a/mm/memory-tiers.c +++ b/mm/memory-tiers.c @@ -5,6 +5,7 @@ #include #include #include +#include #include struct memory_tier { @@ -105,6 +106,72 @@ static struct memory_tier *set_node_memory_tier(int node) return memtier; } +static struct memory_tier *__node_get_memory_tier(int node) +{ + struct memory_dev_type *memtype; + + memtype = node_memory_types[node]; + if (memtype && node_isset(node, memtype->nodes)) + return memtype->memtier; + return NULL; +} + +static void destroy_memory_tier(struct memory_tier *memtier) +{ + list_del(&memtier->list); + kfree(memtier); +} + +static bool clear_node_memory_tier(int node) +{ + bool cleared = false; + struct memory_tier *memtier; + + memtier = __node_get_memory_tier(node); + if (memtier) { + struct memory_dev_type *memtype; + + memtype = node_memory_types[node]; + node_clear(node, memtype->nodes); + if (nodes_empty(memtype->nodes)) { + list_del(&memtype->tier_sibiling); + memtype->memtier = NULL; + if (list_empty(&memtier->memory_types)) + destroy_memory_tier(memtier); + } + cleared = true; + } + return cleared; +} + +static int __meminit memtier_hotplug_callback(struct notifier_block *self, + unsigned long action, void *_arg) +{ + struct memory_notify *arg = _arg; + + /* + * Only update the node migration order when a node is + * changing status, like online->offline. + */ + if (arg->status_change_nid < 0) + return notifier_from_errno(0); + + switch (action) { + case MEM_OFFLINE: + mutex_lock(&memory_tier_lock); + clear_node_memory_tier(arg->status_change_nid); + mutex_unlock(&memory_tier_lock); + break; + case MEM_ONLINE: + mutex_lock(&memory_tier_lock); + set_node_memory_tier(arg->status_change_nid); + mutex_unlock(&memory_tier_lock); + break; + } + + return notifier_from_errno(0); +} + static int __init memory_tier_init(void) { int node; @@ -126,6 +193,7 @@ static int __init memory_tier_init(void) } mutex_unlock(&memory_tier_lock); + hotplug_memory_notifier(memtier_hotplug_callback, MEMTIER_HOTPLUG_PRIO); return 0; } subsys_initcall(memory_tier_init); From patchwork Thu Aug 18 13:10:36 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Aneesh Kumar K.V" X-Patchwork-Id: 12948046 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 5CF6CC00140 for ; Thu, 18 Aug 2022 22:09:27 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id EFC218D0002; Thu, 18 Aug 2022 18:09:26 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id EAD238D0001; Thu, 18 Aug 2022 18:09:26 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id D4D238D0002; Thu, 18 Aug 2022 18:09:26 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0016.hostedemail.com [216.40.44.16]) by kanga.kvack.org (Postfix) with ESMTP id C6E458D0001 for ; Thu, 18 Aug 2022 18:09:26 -0400 (EDT) Received: from smtpin03.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay10.hostedemail.com (Postfix) with ESMTP id A2761C02EC for ; Thu, 18 Aug 2022 22:09:26 +0000 (UTC) X-FDA: 79814105532.03.6B68E74 Received: by imf10.hostedemail.com (Postfix, from userid 200) id DDB05C412C; Thu, 18 Aug 2022 15:54:40 +0000 (UTC) Received: from mx0a-001b2d01.pphosted.com (mx0a-001b2d01.pphosted.com [148.163.156.1]) by imf10.hostedemail.com (Postfix) with ESMTP id 9FD3AC0A87 for ; Thu, 18 Aug 2022 15:54:40 +0000 (UTC) Received: from pps.filterd (m0098409.ppops.net [127.0.0.1]) by mx0a-001b2d01.pphosted.com (8.17.1.5/8.17.1.5) with ESMTP id 27ICBt1L006499; Thu, 18 Aug 2022 13:11:24 GMT DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=ibm.com; h=from : to : cc : subject : date : message-id : in-reply-to : references : mime-version : content-transfer-encoding; s=pp1; bh=fH58GLDZU2WeWZkq6YRNl4Y4y3kC8OPowjQaO6LlzkQ=; b=WQ+ZIBTvV0WhbFWV9a78P5fhvKR6zhQive+THRr8Ng+2DAyyV5ij0QMOoH1aYuh31IP7 5a/KtLdfWs7zzaZNycQM2IH4lhHW/n9+Xf5HLVhffHHufYK0pKkTrOmcYhq92+FbfjQB 8l07yzC25XpV0LYiIL7TbVmaG0OxVvEZEsOzv0WTMIWUG1PHnCitf2xUT6hjty45qTmg ceQWjfAppksCIhCaC7E42HxzfRkhdaZvqfgPezI1KquB385iAxTfQF25JTBqoROP0Yz1 hW6Ozi8Mq/+kCw66TGv8thhgNW7K10tIc/Bi3gYeBIL9gh2V/KlQudQAqVZOpTPsvwgh +A== Received: from pps.reinject (localhost [127.0.0.1]) by mx0a-001b2d01.pphosted.com (PPS) with ESMTPS id 3j1n7cj58u-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Thu, 18 Aug 2022 13:11:23 +0000 Received: from m0098409.ppops.net (m0098409.ppops.net [127.0.0.1]) by pps.reinject (8.17.1.5/8.17.1.5) with ESMTP id 27ICCCPg006971; Thu, 18 Aug 2022 13:11:22 GMT Received: from ppma05wdc.us.ibm.com (1b.90.2fa9.ip4.static.sl-reverse.com [169.47.144.27]) by mx0a-001b2d01.pphosted.com (PPS) with ESMTPS id 3j1n7cj57k-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Thu, 18 Aug 2022 13:11:22 +0000 Received: from pps.filterd (ppma05wdc.us.ibm.com [127.0.0.1]) by ppma05wdc.us.ibm.com (8.16.1.2/8.16.1.2) with SMTP id 27ID5VHc002393; Thu, 18 Aug 2022 13:11:21 GMT Received: from b01cxnp23032.gho.pok.ibm.com (b01cxnp23032.gho.pok.ibm.com [9.57.198.27]) by ppma05wdc.us.ibm.com with ESMTP id 3hx3kaexyf-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Thu, 18 Aug 2022 13:11:21 +0000 Received: from b01ledav002.gho.pok.ibm.com (b01ledav002.gho.pok.ibm.com [9.57.199.107]) by b01cxnp23032.gho.pok.ibm.com (8.14.9/8.14.9/NCO v10.0) with ESMTP id 27IDBK3u3211938 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-GCM-SHA384 bits=256 verify=OK); Thu, 18 Aug 2022 13:11:20 GMT Received: from b01ledav002.gho.pok.ibm.com (unknown [127.0.0.1]) by IMSVA (Postfix) with ESMTP id 8CDD5124052; Thu, 18 Aug 2022 13:11:20 +0000 (GMT) Received: from b01ledav002.gho.pok.ibm.com (unknown [127.0.0.1]) by IMSVA (Postfix) with ESMTP id 8C67A124053; Thu, 18 Aug 2022 13:11:15 +0000 (GMT) Received: from skywalker.ibmuc.com (unknown [9.43.111.107]) by b01ledav002.gho.pok.ibm.com (Postfix) with ESMTP; Thu, 18 Aug 2022 13:11:15 +0000 (GMT) From: "Aneesh Kumar K.V" To: linux-mm@kvack.org, akpm@linux-foundation.org Cc: Wei Xu , Huang Ying , Yang Shi , Davidlohr Bueso , Tim C Chen , Michal Hocko , Linux Kernel Mailing List , Hesham Almatary , Dave Hansen , Jonathan Cameron , Alistair Popple , Dan Williams , Johannes Weiner , jvgediya.oss@gmail.com, Bharata B Rao , "Aneesh Kumar K.V" Subject: [PATCH v15 04/10] mm/demotion/dax/kmem: Set node's abstract distance to MEMTIER_DEFAULT_DAX_ADISTANCE Date: Thu, 18 Aug 2022 18:40:36 +0530 Message-Id: <20220818131042.113280-5-aneesh.kumar@linux.ibm.com> X-Mailer: git-send-email 2.37.2 In-Reply-To: <20220818131042.113280-1-aneesh.kumar@linux.ibm.com> References: <20220818131042.113280-1-aneesh.kumar@linux.ibm.com> MIME-Version: 1.0 X-TM-AS-GCONF: 00 X-Proofpoint-GUID: gQLVWZOKJyTCBgbEDICuLfrsaZdruVDd X-Proofpoint-ORIG-GUID: M7YE5VNrYOmXd492QHsmU4pLrMKSSxqW X-Proofpoint-Virus-Version: vendor=baseguard engine=ICAP:2.0.205,Aquarius:18.0.895,Hydra:6.0.517,FMLib:17.11.122.1 definitions=2022-08-18_12,2022-08-18_01,2022-06-22_01 X-Proofpoint-Spam-Details: rule=outbound_notspam policy=outbound score=0 adultscore=0 malwarescore=0 bulkscore=0 impostorscore=0 phishscore=0 priorityscore=1501 spamscore=0 clxscore=1015 lowpriorityscore=0 mlxlogscore=999 suspectscore=0 mlxscore=0 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.12.0-2207270000 definitions=main-2208180045 ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1660838081; a=rsa-sha256; cv=none; b=jk3a5eaRV5KES9jmdWaeu6kXyI6LlhXcqhyZ4afNWFimfPXUVaRDuE5eKeqORM0w2eH4Zn Z2wL9doxS+i8w9GnXLf2gq7faWCi+9zYDpn2BR5LUQ8lt85NeFz2FRYUoRno86T1xQOGIG rfsgLUaWIL4IrL90dSnwGpZAcFJhM6Q= ARC-Authentication-Results: i=1; imf10.hostedemail.com; dkim=pass header.d=ibm.com header.s=pp1 header.b=WQ+ZIBTv; spf=pass (imf10.hostedemail.com: domain of aneesh.kumar@linux.ibm.com designates 148.163.156.1 as permitted sender) smtp.mailfrom=aneesh.kumar@linux.ibm.com; dmarc=pass (policy=none) header.from=ibm.com ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1660838081; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=fH58GLDZU2WeWZkq6YRNl4Y4y3kC8OPowjQaO6LlzkQ=; b=6r/CYCUsG38VT2H8cRBUtRwLIsIxNv5mur1A9YcxcBniHdKCiECY2Edj6iQQsnmDokwJ3q r16usqtf5VPAH4doI6wE+d6j2oKVK5e0+orY42YVKZvUszfLhK/bXzV6th8Ohny/V6DmpN QGYCfpEX5XuZRMyUO6hdLAI+6wZrdMc= X-Rspam-User: X-Stat-Signature: p1sxym19p3je9pai45nds8cbxjk4mggb X-Rspamd-Queue-Id: 9FD3AC0A87 X-Rspamd-Server: rspam06 Authentication-Results: imf10.hostedemail.com; dkim=pass header.d=ibm.com header.s=pp1 header.b=WQ+ZIBTv; spf=pass (imf10.hostedemail.com: domain of aneesh.kumar@linux.ibm.com designates 148.163.156.1 as permitted sender) smtp.mailfrom=aneesh.kumar@linux.ibm.com; dmarc=pass (policy=none) header.from=ibm.com X-HE-Tag: 1660838080-842091 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: By default, all nodes are assigned to the default memory tier which is the memory tier designated for nodes with DRAM Set dax kmem device node's tier to slower memory tier by assigning abstract distance to MEMTIER_DEFAULT_DAX_ADISTANCE. Low-level drivers like papr_scm or ACPI NFIT can initialize memory device type to a more accurate value based on device tree details or HMAT. If the kernel doesn't find the memory type initialized, a default slower memory type is assigned by the kmem driver. Reviewed-by: "Huang, Ying" Signed-off-by: Aneesh Kumar K.V --- drivers/dax/kmem.c | 42 +++++++++++++++-- include/linux/memory-tiers.h | 42 ++++++++++++++++- mm/memory-tiers.c | 91 +++++++++++++++++++++++++++--------- 3 files changed, 149 insertions(+), 26 deletions(-) diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c index a37622060fff..4852a2dbdb27 100644 --- a/drivers/dax/kmem.c +++ b/drivers/dax/kmem.c @@ -11,9 +11,17 @@ #include #include #include +#include #include "dax-private.h" #include "bus.h" +/* + * Default abstract distance assigned to the NUMA node onlined + * by DAX/kmem if the low level platform driver didn't initialize + * one for this NUMA node. + */ +#define MEMTIER_DEFAULT_DAX_ADISTANCE (MEMTIER_ADISTANCE_DRAM * 5) + /* Memory resource name used for add_memory_driver_managed(). */ static const char *kmem_name; /* Set if any memory will remain added when the driver will be unloaded. */ @@ -41,6 +49,7 @@ struct dax_kmem_data { struct resource *res[]; }; +static struct memory_dev_type *dax_slowmem_type; static int dev_dax_kmem_probe(struct dev_dax *dev_dax) { struct device *dev = &dev_dax->dev; @@ -79,11 +88,13 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax) return -EINVAL; } + init_node_memory_type(numa_node, dax_slowmem_type); + + rc = -ENOMEM; data = kzalloc(struct_size(data, res, dev_dax->nr_range), GFP_KERNEL); if (!data) - return -ENOMEM; + goto err_dax_kmem_data; - rc = -ENOMEM; data->res_name = kstrdup(dev_name(dev), GFP_KERNEL); if (!data->res_name) goto err_res_name; @@ -155,6 +166,8 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax) kfree(data->res_name); err_res_name: kfree(data); +err_dax_kmem_data: + clear_node_memory_type(numa_node, dax_slowmem_type); return rc; } @@ -162,6 +175,7 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax) static void dev_dax_kmem_remove(struct dev_dax *dev_dax) { int i, success = 0; + int node = dev_dax->target_node; struct device *dev = &dev_dax->dev; struct dax_kmem_data *data = dev_get_drvdata(dev); @@ -198,6 +212,14 @@ static void dev_dax_kmem_remove(struct dev_dax *dev_dax) kfree(data->res_name); kfree(data); dev_set_drvdata(dev, NULL); + /* + * Clear the memtype association on successful unplug. + * If not, we have memory blocks left which can be + * offlined/onlined later. We need to keep memory_dev_type + * for that. This implies this reference will be around + * till next reboot. + */ + clear_node_memory_type(node, dax_slowmem_type); } } #else @@ -228,9 +250,22 @@ static int __init dax_kmem_init(void) if (!kmem_name) return -ENOMEM; + dax_slowmem_type = alloc_memory_type(MEMTIER_DEFAULT_DAX_ADISTANCE); + if (IS_ERR(dax_slowmem_type)) { + rc = PTR_ERR(dax_slowmem_type); + goto err_dax_slowmem_type; + } + rc = dax_driver_register(&device_dax_kmem_driver); if (rc) - kfree_const(kmem_name); + goto error_dax_driver; + + return rc; + +error_dax_driver: + destroy_memory_type(dax_slowmem_type); +err_dax_slowmem_type: + kfree_const(kmem_name); return rc; } @@ -239,6 +274,7 @@ static void __exit dax_kmem_exit(void) dax_driver_unregister(&device_dax_kmem_driver); if (!any_hotremove_failed) kfree_const(kmem_name); + destroy_memory_type(dax_slowmem_type); } MODULE_AUTHOR("Intel Corporation"); diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h index 17b41e592be6..30aecff9ae79 100644 --- a/include/linux/memory-tiers.h +++ b/include/linux/memory-tiers.h @@ -2,6 +2,9 @@ #ifndef _LINUX_MEMORY_TIERS_H #define _LINUX_MEMORY_TIERS_H +#include +#include +#include /* * Each tier cover a abstrace distance chunk size of 128 */ @@ -16,12 +19,49 @@ #define MEMTIER_ADISTANCE_DRAM ((4 * MEMTIER_CHUNK_SIZE) + (MEMTIER_CHUNK_SIZE >> 1)) #define MEMTIER_HOTPLUG_PRIO 100 +struct memory_tier; +struct memory_dev_type { + /* list of memory types that are part of same tier as this type */ + struct list_head tier_sibiling; + /* abstract distance for this specific memory type */ + int adistance; + /* Nodes of same abstract distance */ + nodemask_t nodes; + struct kref kref; + struct memory_tier *memtier; +}; + #ifdef CONFIG_NUMA -#include extern bool numa_demotion_enabled; +struct memory_dev_type *alloc_memory_type(int adistance); +void destroy_memory_type(struct memory_dev_type *memtype); +void init_node_memory_type(int node, struct memory_dev_type *default_type); +void clear_node_memory_type(int node, struct memory_dev_type *memtype); #else #define numa_demotion_enabled false +/* + * CONFIG_NUMA implementation returns non NULL error. + */ +static inline struct memory_dev_type *alloc_memory_type(int adistance) +{ + return NULL; +} + +static inline void destroy_memory_type(struct memory_dev_type *memtype) +{ + +} + +static inline void init_node_memory_type(int node, struct memory_dev_type *default_type) +{ + +} + +static inline void clear_node_memory_type(int node, struct memory_dev_type *memtype) +{ + +} #endif /* CONFIG_NUMA */ #endif /* _LINUX_MEMORY_TIERS_H */ diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c index 05f05395468a..3ddf305df7d1 100644 --- a/mm/memory-tiers.c +++ b/mm/memory-tiers.c @@ -1,6 +1,4 @@ // SPDX-License-Identifier: GPL-2.0 -#include -#include #include #include #include @@ -21,27 +19,10 @@ struct memory_tier { int adistance_start; }; -struct memory_dev_type { - /* list of memory types that are part of same tier as this type */ - struct list_head tier_sibiling; - /* abstract distance for this specific memory type */ - int adistance; - /* Nodes of same abstract distance */ - nodemask_t nodes; - struct memory_tier *memtier; -}; - static DEFINE_MUTEX(memory_tier_lock); static LIST_HEAD(memory_tiers); static struct memory_dev_type *node_memory_types[MAX_NUMNODES]; -/* - * For now we can have 4 faster memory tiers with smaller adistance - * than default DRAM tier. - */ -static struct memory_dev_type default_dram_type = { - .adistance = MEMTIER_ADISTANCE_DRAM, - .tier_sibiling = LIST_HEAD_INIT(default_dram_type.tier_sibiling), -}; +static struct memory_dev_type *default_dram_type; static struct memory_tier *find_create_memory_tier(struct memory_dev_type *memtype) { @@ -87,6 +68,14 @@ static struct memory_tier *find_create_memory_tier(struct memory_dev_type *memty return new_memtier; } +static inline void __init_node_memory_type(int node, struct memory_dev_type *memtype) +{ + if (!node_memory_types[node]) { + node_memory_types[node] = memtype; + kref_get(&memtype->kref); + } +} + static struct memory_tier *set_node_memory_tier(int node) { struct memory_tier *memtier; @@ -97,8 +86,7 @@ static struct memory_tier *set_node_memory_tier(int node) if (!node_state(node, N_MEMORY)) return ERR_PTR(-EINVAL); - if (!node_memory_types[node]) - node_memory_types[node] = &default_dram_type; + __init_node_memory_type(node, default_dram_type); memtype = node_memory_types[node]; node_set(node, memtype->nodes); @@ -144,6 +132,57 @@ static bool clear_node_memory_tier(int node) return cleared; } +static void release_memtype(struct kref *kref) +{ + struct memory_dev_type *memtype; + + memtype = container_of(kref, struct memory_dev_type, kref); + kfree(memtype); +} + +struct memory_dev_type *alloc_memory_type(int adistance) +{ + struct memory_dev_type *memtype; + + memtype = kmalloc(sizeof(*memtype), GFP_KERNEL); + if (!memtype) + return ERR_PTR(-ENOMEM); + + memtype->adistance = adistance; + INIT_LIST_HEAD(&memtype->tier_sibiling); + memtype->nodes = NODE_MASK_NONE; + memtype->memtier = NULL; + kref_init(&memtype->kref); + return memtype; +} +EXPORT_SYMBOL_GPL(alloc_memory_type); + +void destroy_memory_type(struct memory_dev_type *memtype) +{ + kref_put(&memtype->kref, release_memtype); +} +EXPORT_SYMBOL_GPL(destroy_memory_type); + +void init_node_memory_type(int node, struct memory_dev_type *memtype) +{ + + mutex_lock(&memory_tier_lock); + __init_node_memory_type(node, memtype); + mutex_unlock(&memory_tier_lock); +} +EXPORT_SYMBOL_GPL(init_node_memory_type); + +void clear_node_memory_type(int node, struct memory_dev_type *memtype) +{ + mutex_lock(&memory_tier_lock); + if (node_memory_types[node] == memtype) { + node_memory_types[node] = NULL; + kref_put(&memtype->kref, release_memtype); + } + mutex_unlock(&memory_tier_lock); +} +EXPORT_SYMBOL_GPL(clear_node_memory_type); + static int __meminit memtier_hotplug_callback(struct notifier_block *self, unsigned long action, void *_arg) { @@ -178,6 +217,14 @@ static int __init memory_tier_init(void) struct memory_tier *memtier; mutex_lock(&memory_tier_lock); + /* + * For now we can have 4 faster memory tiers with smaller adistance + * than default DRAM tier. + */ + default_dram_type = alloc_memory_type(MEMTIER_ADISTANCE_DRAM); + if (!default_dram_type) + panic("%s() failed to allocate default DRAM tier\n", __func__); + /* * Look at all the existing N_MEMORY nodes and add them to * default memory tier or to a tier if we already have memory From patchwork Thu Aug 18 13:10:37 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Aneesh Kumar K.V" X-Patchwork-Id: 12947996 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 32728C28B2B for ; Thu, 18 Aug 2022 21:38:10 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id A0FC48D0002; Thu, 18 Aug 2022 17:38:09 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id 99A448D0001; Thu, 18 Aug 2022 17:38:09 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 79BD38D0002; Thu, 18 Aug 2022 17:38:09 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0014.hostedemail.com [216.40.44.14]) by kanga.kvack.org (Postfix) with ESMTP id 61D9F8D0001 for ; Thu, 18 Aug 2022 17:38:09 -0400 (EDT) Received: from smtpin01.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay05.hostedemail.com (Postfix) with ESMTP id 3AFDE40A8B for ; Thu, 18 Aug 2022 21:38:09 +0000 (UTC) X-FDA: 79814026698.01.2C072D3 Received: by imf13.hostedemail.com (Postfix, from userid 200) id F2D1F203B1; Thu, 18 Aug 2022 15:58:59 +0000 (UTC) Received: from mx0b-001b2d01.pphosted.com (mx0b-001b2d01.pphosted.com [148.163.158.5]) by imf13.hostedemail.com (Postfix) with ESMTP id A7D5F21BDD for ; Thu, 18 Aug 2022 15:58:50 +0000 (UTC) Received: from pps.filterd (m0098421.ppops.net [127.0.0.1]) by mx0a-001b2d01.pphosted.com (8.17.1.5/8.17.1.5) with ESMTP id 27ID0Cua027149; Thu, 18 Aug 2022 13:11:29 GMT DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=ibm.com; h=from : to : cc : subject : date : message-id : in-reply-to : references : mime-version : content-transfer-encoding; s=pp1; bh=u22J3NPZgvu/uyJmjjlmQdkMVo1PtXvwUJT36pS8Pq0=; b=TMoHPpsfCvW2Szb2kqUT/HmgbiwmTv2pJycNzUNQHU+9cBMPENNOvlQ5I1c5fbzuUZdI qFrt43Im/mPZ3GzQnDSh1j6VXKfF+vY4l0K3vAGcMYvemuBBLF0zRAzfwx8dYEDNE6o5 ZlI0N5R8mZKIibPsORnEmxCjjnteZK1WpiqN0MeOcxeD/GD6ap5JQonZIJmApHPm8SPi DqTWX5f1JyzHY8TuMRc2w6S44P29EJ6b3OtvNvAksYl9Y3TUr4OceL+y7+oVL8L32hn4 uI1dwKwVhQwI1yHyPQE0uDMfUzq35oaZoGc55R5G0rsBeuTlni2JAog5SOa7jxVnB2H2 tA== Received: from pps.reinject (localhost [127.0.0.1]) by mx0a-001b2d01.pphosted.com (PPS) with ESMTPS id 3j1nx6ggmh-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Thu, 18 Aug 2022 13:11:28 +0000 Received: from m0098421.ppops.net (m0098421.ppops.net [127.0.0.1]) by pps.reinject (8.17.1.5/8.17.1.5) with ESMTP id 27ID27qQ003574; Thu, 18 Aug 2022 13:11:28 GMT Received: from ppma03dal.us.ibm.com (b.bd.3ea9.ip4.static.sl-reverse.com [169.62.189.11]) by mx0a-001b2d01.pphosted.com (PPS) with ESMTPS id 3j1nx6ggky-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Thu, 18 Aug 2022 13:11:28 +0000 Received: from pps.filterd (ppma03dal.us.ibm.com [127.0.0.1]) by ppma03dal.us.ibm.com (8.16.1.2/8.16.1.2) with SMTP id 27ID5dcs002963; Thu, 18 Aug 2022 13:11:27 GMT Received: from b01cxnp22036.gho.pok.ibm.com (b01cxnp22036.gho.pok.ibm.com [9.57.198.26]) by ppma03dal.us.ibm.com with ESMTP id 3hx3kaeht8-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Thu, 18 Aug 2022 13:11:27 +0000 Received: from b01ledav002.gho.pok.ibm.com (b01ledav002.gho.pok.ibm.com [9.57.199.107]) by b01cxnp22036.gho.pok.ibm.com (8.14.9/8.14.9/NCO v10.0) with ESMTP id 27IDBQHS31654386 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-GCM-SHA384 bits=256 verify=OK); Thu, 18 Aug 2022 13:11:26 GMT Received: from b01ledav002.gho.pok.ibm.com (unknown [127.0.0.1]) by IMSVA (Postfix) with ESMTP id 4DCB2124053; Thu, 18 Aug 2022 13:11:26 +0000 (GMT) Received: from b01ledav002.gho.pok.ibm.com (unknown [127.0.0.1]) by IMSVA (Postfix) with ESMTP id 1DF4E124052; Thu, 18 Aug 2022 13:11:21 +0000 (GMT) Received: from skywalker.ibmuc.com (unknown [9.43.111.107]) by b01ledav002.gho.pok.ibm.com (Postfix) with ESMTP; Thu, 18 Aug 2022 13:11:20 +0000 (GMT) From: "Aneesh Kumar K.V" To: linux-mm@kvack.org, akpm@linux-foundation.org Cc: Wei Xu , Huang Ying , Yang Shi , Davidlohr Bueso , Tim C Chen , Michal Hocko , Linux Kernel Mailing List , Hesham Almatary , Dave Hansen , Jonathan Cameron , Alistair Popple , Dan Williams , Johannes Weiner , jvgediya.oss@gmail.com, Bharata B Rao , "Aneesh Kumar K.V" Subject: [PATCH v15 05/10] mm/demotion: Build demotion targets based on explicit memory tiers Date: Thu, 18 Aug 2022 18:40:37 +0530 Message-Id: <20220818131042.113280-6-aneesh.kumar@linux.ibm.com> X-Mailer: git-send-email 2.37.2 In-Reply-To: <20220818131042.113280-1-aneesh.kumar@linux.ibm.com> References: <20220818131042.113280-1-aneesh.kumar@linux.ibm.com> MIME-Version: 1.0 X-TM-AS-GCONF: 00 X-Proofpoint-GUID: TMTRUdi2SfGKzMZvbNHws2Z60MwyQKMs X-Proofpoint-ORIG-GUID: cAp1tJmBvOxZvhjGhZVkR2qhYfioFNZB X-Proofpoint-Virus-Version: vendor=baseguard engine=ICAP:2.0.205,Aquarius:18.0.895,Hydra:6.0.517,FMLib:17.11.122.1 definitions=2022-08-18_12,2022-08-18_01,2022-06-22_01 X-Proofpoint-Spam-Details: rule=outbound_notspam policy=outbound score=0 bulkscore=0 impostorscore=0 malwarescore=0 lowpriorityscore=0 mlxlogscore=999 priorityscore=1501 suspectscore=0 mlxscore=0 spamscore=0 adultscore=0 clxscore=1015 phishscore=0 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.12.0-2207270000 definitions=main-2208180045 ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1660838336; a=rsa-sha256; cv=none; b=jcrv/+dFiBXl9+gkRqnJn3AglWzEdJaterFGKwE+YY4DYYzFUFkJPiEDlWrqEMeQn76z92 6iTuTOqrwB7VRoQzLkAK7zR4HphQf6P6Z/UXNloEkVelZhM48Ud7e5fw63oXVgwbVB1LL9 +ZdYy6XEb0Q6WbFjiusGndPrlddW+7s= ARC-Authentication-Results: i=1; imf13.hostedemail.com; dkim=pass header.d=ibm.com header.s=pp1 header.b=TMoHPpsf; spf=pass (imf13.hostedemail.com: domain of aneesh.kumar@linux.ibm.com designates 148.163.158.5 as permitted sender) smtp.mailfrom=aneesh.kumar@linux.ibm.com; dmarc=pass (policy=none) header.from=ibm.com ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1660838336; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=u22J3NPZgvu/uyJmjjlmQdkMVo1PtXvwUJT36pS8Pq0=; b=SMX5wPoE9cyecuz9/+ukL26S4FfSAFM8Fdql55K1IU+gsnP8kn2HGk9BPbVSWom0lV/A3B hQXNk9LBIIvLYH031hrMGZP+fm9OTA9oOKi+ZwG6MFFaR8LdpIEvJ0/GBnq+4lhHjdqsHA 8VXn3ecg+Iw3gpPydyyHSXxxzh9WSTI= X-Rspamd-Server: rspam05 X-Rspamd-Queue-Id: A7D5F21BDD Authentication-Results: imf13.hostedemail.com; dkim=pass header.d=ibm.com header.s=pp1 header.b=TMoHPpsf; spf=pass (imf13.hostedemail.com: domain of aneesh.kumar@linux.ibm.com designates 148.163.158.5 as permitted sender) smtp.mailfrom=aneesh.kumar@linux.ibm.com; dmarc=pass (policy=none) header.from=ibm.com X-Stat-Signature: x6kr733whfzrfc3ft5yepqeudth1un34 X-Rspam-User: X-HE-Tag: 1660838330-945763 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: This patch switch the demotion target building logic to use memory tiers instead of NUMA distance. All N_MEMORY NUMA nodes will be placed in the default memory tier and additional memory tiers will be added by drivers like dax kmem. This patch builds the demotion target for a NUMA node by looking at all memory tiers below the tier to which the NUMA node belongs. The closest node in the immediately following memory tier is used as a demotion target. Since we are now only building demotion target for N_MEMORY NUMA nodes the CPU hotplug calls are removed in this patch. Reviewed-by: "Huang, Ying" Signed-off-by: Aneesh Kumar K.V --- include/linux/memory-tiers.h | 13 ++ include/linux/migrate.h | 13 -- mm/memory-tiers.c | 238 +++++++++++++++++++-- mm/migrate.c | 394 ----------------------------------- mm/vmstat.c | 4 - 5 files changed, 239 insertions(+), 423 deletions(-) diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h index 30aecff9ae79..548e69f23727 100644 --- a/include/linux/memory-tiers.h +++ b/include/linux/memory-tiers.h @@ -37,6 +37,14 @@ struct memory_dev_type *alloc_memory_type(int adistance); void destroy_memory_type(struct memory_dev_type *memtype); void init_node_memory_type(int node, struct memory_dev_type *default_type); void clear_node_memory_type(int node, struct memory_dev_type *memtype); +#ifdef CONFIG_MIGRATION +int next_demotion_node(int node); +#else +static inline int next_demotion_node(int node) +{ + return NUMA_NO_NODE; +} +#endif #else @@ -63,5 +71,10 @@ static inline void clear_node_memory_type(int node, struct memory_dev_type *memt { } + +static inline int next_demotion_node(int node) +{ + return NUMA_NO_NODE; +} #endif /* CONFIG_NUMA */ #endif /* _LINUX_MEMORY_TIERS_H */ diff --git a/include/linux/migrate.h b/include/linux/migrate.h index 96f8c84413fe..704a04f5a074 100644 --- a/include/linux/migrate.h +++ b/include/linux/migrate.h @@ -100,19 +100,6 @@ static inline int migrate_huge_page_move_mapping(struct address_space *mapping, #endif /* CONFIG_MIGRATION */ -#if defined(CONFIG_MIGRATION) && defined(CONFIG_NUMA) -extern void set_migration_target_nodes(void); -extern void migrate_on_reclaim_init(void); -extern int next_demotion_node(int node); -#else -static inline void set_migration_target_nodes(void) {} -static inline void migrate_on_reclaim_init(void) {} -static inline int next_demotion_node(int node) -{ - return NUMA_NO_NODE; -} -#endif - #ifdef CONFIG_COMPACTION bool PageMovable(struct page *page); void __SetPageMovable(struct page *page, const struct movable_operations *ops); diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c index 3ddf305df7d1..c29bb24449b8 100644 --- a/mm/memory-tiers.c +++ b/mm/memory-tiers.c @@ -6,6 +6,8 @@ #include #include +#include "internal.h" + struct memory_tier { /* hierarchy of memory tiers */ struct list_head list; @@ -19,10 +21,74 @@ struct memory_tier { int adistance_start; }; +struct demotion_nodes { + nodemask_t preferred; +}; + static DEFINE_MUTEX(memory_tier_lock); static LIST_HEAD(memory_tiers); static struct memory_dev_type *node_memory_types[MAX_NUMNODES]; static struct memory_dev_type *default_dram_type; +#ifdef CONFIG_MIGRATION +/* + * node_demotion[] examples: + * + * Example 1: + * + * Node 0 & 1 are CPU + DRAM nodes, node 2 & 3 are PMEM nodes. + * + * node distances: + * node 0 1 2 3 + * 0 10 20 30 40 + * 1 20 10 40 30 + * 2 30 40 10 40 + * 3 40 30 40 10 + * + * memory_tiers0 = 0-1 + * memory_tiers1 = 2-3 + * + * node_demotion[0].preferred = 2 + * node_demotion[1].preferred = 3 + * node_demotion[2].preferred = + * node_demotion[3].preferred = + * + * Example 2: + * + * Node 0 & 1 are CPU + DRAM nodes, node 2 is memory-only DRAM node. + * + * node distances: + * node 0 1 2 + * 0 10 20 30 + * 1 20 10 30 + * 2 30 30 10 + * + * memory_tiers0 = 0-2 + * + * node_demotion[0].preferred = + * node_demotion[1].preferred = + * node_demotion[2].preferred = + * + * Example 3: + * + * Node 0 is CPU + DRAM nodes, Node 1 is HBM node, node 2 is PMEM node. + * + * node distances: + * node 0 1 2 + * 0 10 20 30 + * 1 20 10 40 + * 2 30 40 10 + * + * memory_tiers0 = 1 + * memory_tiers1 = 0 + * memory_tiers2 = 2 + * + * node_demotion[0].preferred = 2 + * node_demotion[1].preferred = 0 + * node_demotion[2].preferred = + * + */ +static struct demotion_nodes *node_demotion __read_mostly; +#endif /* CONFIG_MIGRATION */ static struct memory_tier *find_create_memory_tier(struct memory_dev_type *memtype) { @@ -68,6 +134,154 @@ static struct memory_tier *find_create_memory_tier(struct memory_dev_type *memty return new_memtier; } +static struct memory_tier *__node_get_memory_tier(int node) +{ + struct memory_dev_type *memtype; + + memtype = node_memory_types[node]; + if (memtype && node_isset(node, memtype->nodes)) + return memtype->memtier; + return NULL; +} + +#ifdef CONFIG_MIGRATION +/** + * next_demotion_node() - Get the next node in the demotion path + * @node: The starting node to lookup the next node + * + * Return: node id for next memory node in the demotion path hierarchy + * from @node; NUMA_NO_NODE if @node is terminal. This does not keep + * @node online or guarantee that it *continues* to be the next demotion + * target. + */ +int next_demotion_node(int node) +{ + struct demotion_nodes *nd; + int target; + + if (!node_demotion) + return NUMA_NO_NODE; + + nd = &node_demotion[node]; + + /* + * node_demotion[] is updated without excluding this + * function from running. + * + * Make sure to use RCU over entire code blocks if + * node_demotion[] reads need to be consistent. + */ + rcu_read_lock(); + /* + * If there are multiple target nodes, just select one + * target node randomly. + * + * In addition, we can also use round-robin to select + * target node, but we should introduce another variable + * for node_demotion[] to record last selected target node, + * that may cause cache ping-pong due to the changing of + * last target node. Or introducing per-cpu data to avoid + * caching issue, which seems more complicated. So selecting + * target node randomly seems better until now. + */ + target = node_random(&nd->preferred); + rcu_read_unlock(); + + return target; +} + +static void disable_all_demotion_targets(void) +{ + int node; + + for_each_node_state(node, N_MEMORY) + node_demotion[node].preferred = NODE_MASK_NONE; + /* + * Ensure that the "disable" is visible across the system. + * Readers will see either a combination of before+disable + * state or disable+after. They will never see before and + * after state together. + */ + synchronize_rcu(); +} + +static __always_inline nodemask_t get_memtier_nodemask(struct memory_tier *memtier) +{ + nodemask_t nodes = NODE_MASK_NONE; + struct memory_dev_type *memtype; + + list_for_each_entry(memtype, &memtier->memory_types, tier_sibiling) + nodes_or(nodes, nodes, memtype->nodes); + + return nodes; +} + +/* + * Find an automatic demotion target for all memory + * nodes. Failing here is OK. It might just indicate + * being at the end of a chain. + */ +static void establish_demotion_targets(void) +{ + struct memory_tier *memtier; + struct demotion_nodes *nd; + int target = NUMA_NO_NODE, node; + int distance, best_distance; + nodemask_t tier_nodes; + + lockdep_assert_held_once(&memory_tier_lock); + + if (!node_demotion || !IS_ENABLED(CONFIG_MIGRATION)) + return; + + disable_all_demotion_targets(); + + for_each_node_state(node, N_MEMORY) { + best_distance = -1; + nd = &node_demotion[node]; + + memtier = __node_get_memory_tier(node); + if (!memtier || list_is_last(&memtier->list, &memory_tiers)) + continue; + /* + * Get the lower memtier to find the demotion node list. + */ + memtier = list_next_entry(memtier, list); + tier_nodes = get_memtier_nodemask(memtier); + /* + * find_next_best_node, use 'used' nodemask as a skip list. + * Add all memory nodes except the selected memory tier + * nodelist to skip list so that we find the best node from the + * memtier nodelist. + */ + nodes_andnot(tier_nodes, node_states[N_MEMORY], tier_nodes); + + /* + * Find all the nodes in the memory tier node list of same best distance. + * add them to the preferred mask. We randomly select between nodes + * in the preferred mask when allocating pages during demotion. + */ + do { + target = find_next_best_node(node, &tier_nodes); + if (target == NUMA_NO_NODE) + break; + + distance = node_distance(node, target); + if (distance == best_distance || best_distance == -1) { + best_distance = distance; + node_set(target, nd->preferred); + } else { + break; + } + } while (1); + } +} + +#else +static inline void disable_all_demotion_targets(void) {} +static inline void establish_demotion_targets(void) {} +#endif /* CONFIG_MIGRATION */ + static inline void __init_node_memory_type(int node, struct memory_dev_type *memtype) { if (!node_memory_types[node]) { @@ -94,16 +308,6 @@ static struct memory_tier *set_node_memory_tier(int node) return memtier; } -static struct memory_tier *__node_get_memory_tier(int node) -{ - struct memory_dev_type *memtype; - - memtype = node_memory_types[node]; - if (memtype && node_isset(node, memtype->nodes)) - return memtype->memtier; - return NULL; -} - static void destroy_memory_tier(struct memory_tier *memtier) { list_del(&memtier->list); @@ -186,6 +390,7 @@ EXPORT_SYMBOL_GPL(clear_node_memory_type); static int __meminit memtier_hotplug_callback(struct notifier_block *self, unsigned long action, void *_arg) { + struct memory_tier *memtier; struct memory_notify *arg = _arg; /* @@ -198,12 +403,15 @@ static int __meminit memtier_hotplug_callback(struct notifier_block *self, switch (action) { case MEM_OFFLINE: mutex_lock(&memory_tier_lock); - clear_node_memory_tier(arg->status_change_nid); + if (clear_node_memory_tier(arg->status_change_nid)) + establish_demotion_targets(); mutex_unlock(&memory_tier_lock); break; case MEM_ONLINE: mutex_lock(&memory_tier_lock); - set_node_memory_tier(arg->status_change_nid); + memtier = set_node_memory_tier(arg->status_change_nid); + if (!IS_ERR(memtier)) + establish_demotion_targets(); mutex_unlock(&memory_tier_lock); break; } @@ -216,6 +424,11 @@ static int __init memory_tier_init(void) int node; struct memory_tier *memtier; +#ifdef CONFIG_MIGRATION + node_demotion = kcalloc(nr_node_ids, sizeof(struct demotion_nodes), + GFP_KERNEL); + WARN_ON(!node_demotion); +#endif mutex_lock(&memory_tier_lock); /* * For now we can have 4 faster memory tiers with smaller adistance @@ -238,6 +451,7 @@ static int __init memory_tier_init(void) */ break; } + establish_demotion_targets(); mutex_unlock(&memory_tier_lock); hotplug_memory_notifier(memtier_hotplug_callback, MEMTIER_HOTPLUG_PRIO); diff --git a/mm/migrate.c b/mm/migrate.c index 5d7fb417edbf..ea86594f4bc5 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -2170,398 +2170,4 @@ int migrate_misplaced_page(struct page *page, struct vm_area_struct *vma, return 0; } #endif /* CONFIG_NUMA_BALANCING */ - -/* - * node_demotion[] example: - * - * Consider a system with two sockets. Each socket has - * three classes of memory attached: fast, medium and slow. - * Each memory class is placed in its own NUMA node. The - * CPUs are placed in the node with the "fast" memory. The - * 6 NUMA nodes (0-5) might be split among the sockets like - * this: - * - * Socket A: 0, 1, 2 - * Socket B: 3, 4, 5 - * - * When Node 0 fills up, its memory should be migrated to - * Node 1. When Node 1 fills up, it should be migrated to - * Node 2. The migration path start on the nodes with the - * processors (since allocations default to this node) and - * fast memory, progress through medium and end with the - * slow memory: - * - * 0 -> 1 -> 2 -> stop - * 3 -> 4 -> 5 -> stop - * - * This is represented in the node_demotion[] like this: - * - * { nr=1, nodes[0]=1 }, // Node 0 migrates to 1 - * { nr=1, nodes[0]=2 }, // Node 1 migrates to 2 - * { nr=0, nodes[0]=-1 }, // Node 2 does not migrate - * { nr=1, nodes[0]=4 }, // Node 3 migrates to 4 - * { nr=1, nodes[0]=5 }, // Node 4 migrates to 5 - * { nr=0, nodes[0]=-1 }, // Node 5 does not migrate - * - * Moreover some systems may have multiple slow memory nodes. - * Suppose a system has one socket with 3 memory nodes, node 0 - * is fast memory type, and node 1/2 both are slow memory - * type, and the distance between fast memory node and slow - * memory node is same. So the migration path should be: - * - * 0 -> 1/2 -> stop - * - * This is represented in the node_demotion[] like this: - * { nr=2, {nodes[0]=1, nodes[1]=2} }, // Node 0 migrates to node 1 and node 2 - * { nr=0, nodes[0]=-1, }, // Node 1 dose not migrate - * { nr=0, nodes[0]=-1, }, // Node 2 does not migrate - */ - -/* - * Writes to this array occur without locking. Cycles are - * not allowed: Node X demotes to Y which demotes to X... - * - * If multiple reads are performed, a single rcu_read_lock() - * must be held over all reads to ensure that no cycles are - * observed. - */ -#define DEFAULT_DEMOTION_TARGET_NODES 15 - -#if MAX_NUMNODES < DEFAULT_DEMOTION_TARGET_NODES -#define DEMOTION_TARGET_NODES (MAX_NUMNODES - 1) -#else -#define DEMOTION_TARGET_NODES DEFAULT_DEMOTION_TARGET_NODES -#endif - -struct demotion_nodes { - unsigned short nr; - short nodes[DEMOTION_TARGET_NODES]; -}; - -static struct demotion_nodes *node_demotion __read_mostly; - -/** - * next_demotion_node() - Get the next node in the demotion path - * @node: The starting node to lookup the next node - * - * Return: node id for next memory node in the demotion path hierarchy - * from @node; NUMA_NO_NODE if @node is terminal. This does not keep - * @node online or guarantee that it *continues* to be the next demotion - * target. - */ -int next_demotion_node(int node) -{ - struct demotion_nodes *nd; - unsigned short target_nr, index; - int target; - - if (!node_demotion) - return NUMA_NO_NODE; - - nd = &node_demotion[node]; - - /* - * node_demotion[] is updated without excluding this - * function from running. RCU doesn't provide any - * compiler barriers, so the READ_ONCE() is required - * to avoid compiler reordering or read merging. - * - * Make sure to use RCU over entire code blocks if - * node_demotion[] reads need to be consistent. - */ - rcu_read_lock(); - target_nr = READ_ONCE(nd->nr); - - switch (target_nr) { - case 0: - target = NUMA_NO_NODE; - goto out; - case 1: - index = 0; - break; - default: - /* - * If there are multiple target nodes, just select one - * target node randomly. - * - * In addition, we can also use round-robin to select - * target node, but we should introduce another variable - * for node_demotion[] to record last selected target node, - * that may cause cache ping-pong due to the changing of - * last target node. Or introducing per-cpu data to avoid - * caching issue, which seems more complicated. So selecting - * target node randomly seems better until now. - */ - index = get_random_int() % target_nr; - break; - } - - target = READ_ONCE(nd->nodes[index]); - -out: - rcu_read_unlock(); - return target; -} - -/* Disable reclaim-based migration. */ -static void __disable_all_migrate_targets(void) -{ - int node, i; - - if (!node_demotion) - return; - - for_each_online_node(node) { - node_demotion[node].nr = 0; - for (i = 0; i < DEMOTION_TARGET_NODES; i++) - node_demotion[node].nodes[i] = NUMA_NO_NODE; - } -} - -static void disable_all_migrate_targets(void) -{ - __disable_all_migrate_targets(); - - /* - * Ensure that the "disable" is visible across the system. - * Readers will see either a combination of before+disable - * state or disable+after. They will never see before and - * after state together. - * - * The before+after state together might have cycles and - * could cause readers to do things like loop until this - * function finishes. This ensures they can only see a - * single "bad" read and would, for instance, only loop - * once. - */ - synchronize_rcu(); -} - -/* - * Find an automatic demotion target for 'node'. - * Failing here is OK. It might just indicate - * being at the end of a chain. - */ -static int establish_migrate_target(int node, nodemask_t *used, - int best_distance) -{ - int migration_target, index, val; - struct demotion_nodes *nd; - - if (!node_demotion) - return NUMA_NO_NODE; - - nd = &node_demotion[node]; - - migration_target = find_next_best_node(node, used); - if (migration_target == NUMA_NO_NODE) - return NUMA_NO_NODE; - - /* - * If the node has been set a migration target node before, - * which means it's the best distance between them. Still - * check if this node can be demoted to other target nodes - * if they have a same best distance. - */ - if (best_distance != -1) { - val = node_distance(node, migration_target); - if (val > best_distance) - goto out_clear; - } - - index = nd->nr; - if (WARN_ONCE(index >= DEMOTION_TARGET_NODES, - "Exceeds maximum demotion target nodes\n")) - goto out_clear; - - nd->nodes[index] = migration_target; - nd->nr++; - - return migration_target; -out_clear: - node_clear(migration_target, *used); - return NUMA_NO_NODE; -} - -/* - * When memory fills up on a node, memory contents can be - * automatically migrated to another node instead of - * discarded at reclaim. - * - * Establish a "migration path" which will start at nodes - * with CPUs and will follow the priorities used to build the - * page allocator zonelists. - * - * The difference here is that cycles must be avoided. If - * node0 migrates to node1, then neither node1, nor anything - * node1 migrates to can migrate to node0. Also one node can - * be migrated to multiple nodes if the target nodes all have - * a same best-distance against the source node. - * - * This function can run simultaneously with readers of - * node_demotion[]. However, it can not run simultaneously - * with itself. Exclusion is provided by memory hotplug events - * being single-threaded. - */ -static void __set_migration_target_nodes(void) -{ - nodemask_t next_pass; - nodemask_t this_pass; - nodemask_t used_targets = NODE_MASK_NONE; - int node, best_distance; - - /* - * Avoid any oddities like cycles that could occur - * from changes in the topology. This will leave - * a momentary gap when migration is disabled. - */ - disable_all_migrate_targets(); - - /* - * Allocations go close to CPUs, first. Assume that - * the migration path starts at the nodes with CPUs. - */ - next_pass = node_states[N_CPU]; -again: - this_pass = next_pass; - next_pass = NODE_MASK_NONE; - /* - * To avoid cycles in the migration "graph", ensure - * that migration sources are not future targets by - * setting them in 'used_targets'. Do this only - * once per pass so that multiple source nodes can - * share a target node. - * - * 'used_targets' will become unavailable in future - * passes. This limits some opportunities for - * multiple source nodes to share a destination. - */ - nodes_or(used_targets, used_targets, this_pass); - - for_each_node_mask(node, this_pass) { - best_distance = -1; - - /* - * Try to set up the migration path for the node, and the target - * migration nodes can be multiple, so doing a loop to find all - * the target nodes if they all have a best node distance. - */ - do { - int target_node = - establish_migrate_target(node, &used_targets, - best_distance); - - if (target_node == NUMA_NO_NODE) - break; - - if (best_distance == -1) - best_distance = node_distance(node, target_node); - - /* - * Visit targets from this pass in the next pass. - * Eventually, every node will have been part of - * a pass, and will become set in 'used_targets'. - */ - node_set(target_node, next_pass); - } while (1); - } - /* - * 'next_pass' contains nodes which became migration - * targets in this pass. Make additional passes until - * no more migrations targets are available. - */ - if (!nodes_empty(next_pass)) - goto again; -} - -/* - * For callers that do not hold get_online_mems() already. - */ -void set_migration_target_nodes(void) -{ - get_online_mems(); - __set_migration_target_nodes(); - put_online_mems(); -} - -/* - * This leaves migrate-on-reclaim transiently disabled between - * the MEM_GOING_OFFLINE and MEM_OFFLINE events. This runs - * whether reclaim-based migration is enabled or not, which - * ensures that the user can turn reclaim-based migration at - * any time without needing to recalculate migration targets. - * - * These callbacks already hold get_online_mems(). That is why - * __set_migration_target_nodes() can be used as opposed to - * set_migration_target_nodes(). - */ -#ifdef CONFIG_MEMORY_HOTPLUG -static int __meminit migrate_on_reclaim_callback(struct notifier_block *self, - unsigned long action, void *_arg) -{ - struct memory_notify *arg = _arg; - - /* - * Only update the node migration order when a node is - * changing status, like online->offline. This avoids - * the overhead of synchronize_rcu() in most cases. - */ - if (arg->status_change_nid < 0) - return notifier_from_errno(0); - - switch (action) { - case MEM_GOING_OFFLINE: - /* - * Make sure there are not transient states where - * an offline node is a migration target. This - * will leave migration disabled until the offline - * completes and the MEM_OFFLINE case below runs. - */ - disable_all_migrate_targets(); - break; - case MEM_OFFLINE: - case MEM_ONLINE: - /* - * Recalculate the target nodes once the node - * reaches its final state (online or offline). - */ - __set_migration_target_nodes(); - break; - case MEM_CANCEL_OFFLINE: - /* - * MEM_GOING_OFFLINE disabled all the migration - * targets. Reenable them. - */ - __set_migration_target_nodes(); - break; - case MEM_GOING_ONLINE: - case MEM_CANCEL_ONLINE: - break; - } - - return notifier_from_errno(0); -} -#endif - -void __init migrate_on_reclaim_init(void) -{ - node_demotion = kcalloc(nr_node_ids, - sizeof(struct demotion_nodes), - GFP_KERNEL); - WARN_ON(!node_demotion); -#ifdef CONFIG_MEMORY_HOTPLUG - hotplug_memory_notifier(migrate_on_reclaim_callback, 100); -#endif - /* - * At this point, all numa nodes with memory/CPus have their state - * properly set, so we can build the demotion order now. - * Let us hold the cpu_hotplug lock just, as we could possibily have - * CPU hotplug events during boot. - */ - cpus_read_lock(); - set_migration_target_nodes(); - cpus_read_unlock(); -} #endif /* CONFIG_NUMA */ - - diff --git a/mm/vmstat.c b/mm/vmstat.c index 373d2730fcf2..35c6ff97cf29 100644 --- a/mm/vmstat.c +++ b/mm/vmstat.c @@ -28,7 +28,6 @@ #include #include #include -#include #include "internal.h" @@ -2060,7 +2059,6 @@ static int vmstat_cpu_online(unsigned int cpu) if (!node_state(cpu_to_node(cpu), N_CPU)) { node_set_state(cpu_to_node(cpu), N_CPU); - set_migration_target_nodes(); } return 0; @@ -2085,7 +2083,6 @@ static int vmstat_cpu_dead(unsigned int cpu) return 0; node_clear_state(node, N_CPU); - set_migration_target_nodes(); return 0; } @@ -2118,7 +2115,6 @@ void __init init_mm_internals(void) start_shepherd_timer(); #endif - migrate_on_reclaim_init(); #ifdef CONFIG_PROC_FS proc_create_seq("buddyinfo", 0444, NULL, &fragmentation_op); proc_create_seq("pagetypeinfo", 0400, NULL, &pagetypeinfo_op); From patchwork Thu Aug 18 13:10:38 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Aneesh Kumar K.V" X-Patchwork-Id: 12947992 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 95098C00140 for ; Thu, 18 Aug 2022 21:29:09 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 1CD658E0001; Thu, 18 Aug 2022 17:29:09 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id 17E148D0001; Thu, 18 Aug 2022 17:29:09 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 01F6C8E0001; Thu, 18 Aug 2022 17:29:08 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0016.hostedemail.com [216.40.44.16]) by kanga.kvack.org (Postfix) with ESMTP id E489D8D0001 for ; Thu, 18 Aug 2022 17:29:08 -0400 (EDT) Received: from smtpin25.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay09.hostedemail.com (Postfix) with ESMTP id B73798236A for ; Thu, 18 Aug 2022 21:29:08 +0000 (UTC) X-FDA: 79814003976.25.F14404E Received: by imf11.hostedemail.com (Postfix, from userid 200) id 2172140034; Thu, 18 Aug 2022 15:45:09 +0000 (UTC) Received: from mx0a-001b2d01.pphosted.com (mx0b-001b2d01.pphosted.com [148.163.158.5]) by imf11.hostedemail.com (Postfix) with ESMTP id 28E5040D76 for ; Thu, 18 Aug 2022 15:45:05 +0000 (UTC) Received: from pps.filterd (m0098420.ppops.net [127.0.0.1]) by mx0b-001b2d01.pphosted.com (8.17.1.5/8.17.1.5) with ESMTP id 27ID0jUx004836; Thu, 18 Aug 2022 13:11:35 GMT DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=ibm.com; h=from : to : cc : subject : date : message-id : in-reply-to : references : mime-version : content-transfer-encoding; s=pp1; bh=SgB1OBJ+aT4fKrCirlggMMGrMcGU1lu023bf/o8Sw4M=; b=Wlv78d4F+GU0V4YStCFHd5GtSHDiHSgU8ovslbjVH7A6swQTUkOfQLAXqDVylDzP7gfN W6yM/gGjBYoMmR1lNBkMs4e6LDwMay+CY7a0koX+x8T6uzRdvSgKdnFl41pD3bxTm7Xe /RIFZLwV02L2aOFPdUn2QBua3ElmahtYhM3OTIifT5d4N60WHB7vPCN61OLInJBzM0Cq K+WrO4ppXhlw3US8FDSwpffKX6m37kZgQTd5por4EvnI/rnPntwXV2YIhdT8RxKoV4nX UARoawmgwlthWy5VreaxP9ZTi6QgBtdhshnVpTDvNf1pTpnuOhigat7Npw/B6S8XkCkO Ag== Received: from pps.reinject (localhost [127.0.0.1]) by mx0b-001b2d01.pphosted.com (PPS) with ESMTPS id 3j1nxerf3j-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Thu, 18 Aug 2022 13:11:35 +0000 Received: from m0098420.ppops.net (m0098420.ppops.net [127.0.0.1]) by pps.reinject (8.17.1.5/8.17.1.5) with ESMTP id 27ID1DvF006707; Thu, 18 Aug 2022 13:11:34 GMT Received: from ppma03dal.us.ibm.com (b.bd.3ea9.ip4.static.sl-reverse.com [169.62.189.11]) by mx0b-001b2d01.pphosted.com (PPS) with ESMTPS id 3j1nxerf2n-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Thu, 18 Aug 2022 13:11:33 +0000 Received: from pps.filterd (ppma03dal.us.ibm.com [127.0.0.1]) by ppma03dal.us.ibm.com (8.16.1.2/8.16.1.2) with SMTP id 27ID5dD3002954; Thu, 18 Aug 2022 13:11:32 GMT Received: from b01cxnp23033.gho.pok.ibm.com (b01cxnp23033.gho.pok.ibm.com [9.57.198.28]) by ppma03dal.us.ibm.com with ESMTP id 3hx3kaehtn-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Thu, 18 Aug 2022 13:11:32 +0000 Received: from b01ledav002.gho.pok.ibm.com (b01ledav002.gho.pok.ibm.com [9.57.199.107]) by b01cxnp23033.gho.pok.ibm.com (8.14.9/8.14.9/NCO v10.0) with ESMTP id 27IDBVIV1311356 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-GCM-SHA384 bits=256 verify=OK); Thu, 18 Aug 2022 13:11:32 GMT Received: from b01ledav002.gho.pok.ibm.com (unknown [127.0.0.1]) by IMSVA (Postfix) with ESMTP id DCED5124053; Thu, 18 Aug 2022 13:11:31 +0000 (GMT) Received: from b01ledav002.gho.pok.ibm.com (unknown [127.0.0.1]) by IMSVA (Postfix) with ESMTP id DAA8F124052; Thu, 18 Aug 2022 13:11:26 +0000 (GMT) Received: from skywalker.ibmuc.com (unknown [9.43.111.107]) by b01ledav002.gho.pok.ibm.com (Postfix) with ESMTP; Thu, 18 Aug 2022 13:11:26 +0000 (GMT) From: "Aneesh Kumar K.V" To: linux-mm@kvack.org, akpm@linux-foundation.org Cc: Wei Xu , Huang Ying , Yang Shi , Davidlohr Bueso , Tim C Chen , Michal Hocko , Linux Kernel Mailing List , Hesham Almatary , Dave Hansen , Jonathan Cameron , Alistair Popple , Dan Williams , Johannes Weiner , jvgediya.oss@gmail.com, Bharata B Rao , "Aneesh Kumar K.V" Subject: [PATCH v15 06/10] mm/demotion: Add pg_data_t member to track node memory tier details Date: Thu, 18 Aug 2022 18:40:38 +0530 Message-Id: <20220818131042.113280-7-aneesh.kumar@linux.ibm.com> X-Mailer: git-send-email 2.37.2 In-Reply-To: <20220818131042.113280-1-aneesh.kumar@linux.ibm.com> References: <20220818131042.113280-1-aneesh.kumar@linux.ibm.com> MIME-Version: 1.0 X-TM-AS-GCONF: 00 X-Proofpoint-GUID: v9nEluJdfDt2HY6hBTW5UjrvGDtDlk4J X-Proofpoint-ORIG-GUID: iRErj2z7V2Pqzx7rEKjNCSlj6_HoxJMH X-Proofpoint-Virus-Version: vendor=baseguard engine=ICAP:2.0.205,Aquarius:18.0.895,Hydra:6.0.517,FMLib:17.11.122.1 definitions=2022-08-18_12,2022-08-18_01,2022-06-22_01 X-Proofpoint-Spam-Details: rule=outbound_notspam policy=outbound score=0 mlxscore=0 priorityscore=1501 mlxlogscore=999 malwarescore=0 lowpriorityscore=0 impostorscore=0 suspectscore=0 phishscore=0 bulkscore=0 spamscore=0 clxscore=1015 adultscore=0 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.12.0-2207270000 definitions=main-2208180045 ARC-Authentication-Results: i=1; imf11.hostedemail.com; dkim=pass header.d=ibm.com header.s=pp1 header.b=Wlv78d4F; spf=pass (imf11.hostedemail.com: domain of aneesh.kumar@linux.ibm.com designates 148.163.158.5 as permitted sender) smtp.mailfrom=aneesh.kumar@linux.ibm.com; dmarc=pass (policy=none) header.from=ibm.com ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1660837509; a=rsa-sha256; cv=none; b=3Nnf8sGUEvVe2E5SF9lsVEdp3nOa1b6EzvRaDLQguF+VHe7X/odl7coCrV9xexYzBKFizx 0sbLss2UiCsMFrw2s82kv2K0PznQG8fxZzYKVwcDfMWmGeNNRaehlj4QrefiXnHHmDBawq 2JqhYklUaEsbtjEPQOEBczLw2C3l9pU= ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1660837509; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=SgB1OBJ+aT4fKrCirlggMMGrMcGU1lu023bf/o8Sw4M=; b=vz7RKqzZ6l6q0zaqACuZA3oPx/Z2OWeGRPoR8qTVlzqCJmny6cMQKpveEL3FcoTBlCyqdw itaY3dJwluEeNekRxzN6TcN+9jOXBA681hxyiz8vDaawojgq8Uu+AiWpMYEtUhgQbZTIRE iGrXXX1l+edrQuY69rJjJ34Mh71YcC0= X-Rspamd-Server: rspam01 X-Rspamd-Queue-Id: 28E5040D76 Authentication-Results: imf11.hostedemail.com; dkim=pass header.d=ibm.com header.s=pp1 header.b=Wlv78d4F; spf=pass (imf11.hostedemail.com: domain of aneesh.kumar@linux.ibm.com designates 148.163.158.5 as permitted sender) smtp.mailfrom=aneesh.kumar@linux.ibm.com; dmarc=pass (policy=none) header.from=ibm.com X-Stat-Signature: oszj37iubzbm4nag9uz6nr9ge5dsnegr X-Rspam-User: X-HE-Tag: 1660837505-285173 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: Also update different helpes to use NODE_DATA()->memtier. Since node specific memtier can change based on the reassignment of NUMA node to a different memory tiers, accessing NODE_DATA()->memtier needs to happen under an rcu read lock or memory_tier_lock. Reviewed-by: "Huang, Ying" Signed-off-by: Aneesh Kumar K.V --- include/linux/mmzone.h | 3 +++ mm/memory-tiers.c | 40 +++++++++++++++++++++++++++++++++++----- 2 files changed, 38 insertions(+), 5 deletions(-) diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index e24b40c52468..7d78133fe8dd 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -1012,6 +1012,9 @@ typedef struct pglist_data { /* Per-node vmstats */ struct per_cpu_nodestat __percpu *per_cpu_nodestats; atomic_long_t vm_stat[NR_VM_NODE_STAT_ITEMS]; +#ifdef CONFIG_NUMA + struct memory_tier __rcu *memtier; +#endif } pg_data_t; #define node_present_pages(nid) (NODE_DATA(nid)->node_present_pages) diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c index c29bb24449b8..2b7e91b45a75 100644 --- a/mm/memory-tiers.c +++ b/mm/memory-tiers.c @@ -4,6 +4,7 @@ #include #include #include +#include #include #include "internal.h" @@ -136,12 +137,18 @@ static struct memory_tier *find_create_memory_tier(struct memory_dev_type *memty static struct memory_tier *__node_get_memory_tier(int node) { - struct memory_dev_type *memtype; + pg_data_t *pgdat; - memtype = node_memory_types[node]; - if (memtype && node_isset(node, memtype->nodes)) - return memtype->memtier; - return NULL; + pgdat = NODE_DATA(node); + if (!pgdat) + return NULL; + /* + * Since we hold memory_tier_lock, we can avoid + * RCU read locks when accessing the details. No + * parallel updates are possible here. + */ + return rcu_dereference_check(pgdat->memtier, + lockdep_is_held(&memory_tier_lock)); } #ifdef CONFIG_MIGRATION @@ -294,6 +301,8 @@ static struct memory_tier *set_node_memory_tier(int node) { struct memory_tier *memtier; struct memory_dev_type *memtype; + pg_data_t *pgdat = NODE_DATA(node); + lockdep_assert_held_once(&memory_tier_lock); @@ -305,24 +314,45 @@ static struct memory_tier *set_node_memory_tier(int node) memtype = node_memory_types[node]; node_set(node, memtype->nodes); memtier = find_create_memory_tier(memtype); + if (!IS_ERR(memtier)) + rcu_assign_pointer(pgdat->memtier, memtier); return memtier; } static void destroy_memory_tier(struct memory_tier *memtier) { list_del(&memtier->list); + /* + * synchronize_rcu in clear_node_memory_tier makes sure + * we don't have rcu access to this memory tier. + */ kfree(memtier); } static bool clear_node_memory_tier(int node) { bool cleared = false; + pg_data_t *pgdat; struct memory_tier *memtier; + pgdat = NODE_DATA(node); + if (!pgdat) + return false; + + /* + * Make sure that anybody looking at NODE_DATA who finds + * a valid memtier finds memory_dev_types with nodes still + * linked to the memtier. We achieve this by waiting for + * rcu read section to finish using synchronize_rcu. + * This also enables us to free the destroyed memory tier + * with kfree instead of kfree_rcu + */ memtier = __node_get_memory_tier(node); if (memtier) { struct memory_dev_type *memtype; + rcu_assign_pointer(pgdat->memtier, NULL); + synchronize_rcu(); memtype = node_memory_types[node]; node_clear(node, memtype->nodes); if (nodes_empty(memtype->nodes)) { From patchwork Thu Aug 18 13:10:39 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Aneesh Kumar K.V" X-Patchwork-Id: 12948111 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 8E54BC00140 for ; Thu, 18 Aug 2022 23:00:33 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 265E38D0003; Thu, 18 Aug 2022 19:00:33 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id 1EE028D0002; Thu, 18 Aug 2022 19:00:33 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 040268D0003; Thu, 18 Aug 2022 19:00:32 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0015.hostedemail.com [216.40.44.15]) by kanga.kvack.org (Postfix) with ESMTP id E51E98D0002 for ; Thu, 18 Aug 2022 19:00:32 -0400 (EDT) Received: from smtpin01.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay06.hostedemail.com (Postfix) with ESMTP id A60E3AB321 for ; Thu, 18 Aug 2022 23:00:31 +0000 (UTC) X-FDA: 79814234262.01.FA9468A Received: from mx0a-001b2d01.pphosted.com (mx0a-001b2d01.pphosted.com [148.163.156.1]) by imf22.hostedemail.com (Postfix) with ESMTP id 8F34CC7323 for ; Thu, 18 Aug 2022 21:55:54 +0000 (UTC) Received: from pps.filterd (m0187473.ppops.net [127.0.0.1]) by mx0a-001b2d01.pphosted.com (8.17.1.5/8.17.1.5) with ESMTP id 27ICijSP025680; Thu, 18 Aug 2022 13:11:40 GMT DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=ibm.com; h=from : to : cc : subject : date : message-id : in-reply-to : references : mime-version : content-transfer-encoding; s=pp1; bh=LIDo6CqydH8lG4GOng8n1y97XLjFtFkKiq31uguX6/0=; b=nMKsU+ARNQiYqZvXpnW/kzFMz06fu9QR913yYDC6tDT8r8RxlToakrdHy1Fu9MdbIOF3 2BALmMSwVlCEBbd5uzntqIOe2MMbpHxN2tZeeVxhVFhmbQEZXXbs13szjASn6c7s0Gsn ajGP3KqzPDOmaxx7KQ6YhSmEOpvpkWDriAOFPgmUKi53KtPFDGop3uXsoG9+RNVOllxx CWWzU7c+UhkCn5Z6hV3gc3sOg50sarfHt0DigNkeELZWdIYtsA8/Qd6S3aFX70wtvpw9 X2J8uTnytXNh86th3etjQhksW/SEt8Ud4dqzDtKW3IiozDf+vydGE3GtdXHNtKspamGS mA== Received: from pps.reinject (localhost [127.0.0.1]) by mx0a-001b2d01.pphosted.com (PPS) with ESMTPS id 3j1npxgxrg-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Thu, 18 Aug 2022 13:11:40 +0000 Received: from m0187473.ppops.net (m0187473.ppops.net [127.0.0.1]) by pps.reinject (8.17.1.5/8.17.1.5) with ESMTP id 27ICnQhJ039833; Thu, 18 Aug 2022 13:11:39 GMT Received: from ppma04wdc.us.ibm.com (1a.90.2fa9.ip4.static.sl-reverse.com [169.47.144.26]) by mx0a-001b2d01.pphosted.com (PPS) with ESMTPS id 3j1npxgxqr-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Thu, 18 Aug 2022 13:11:39 +0000 Received: from pps.filterd (ppma04wdc.us.ibm.com [127.0.0.1]) by ppma04wdc.us.ibm.com (8.16.1.2/8.16.1.2) with SMTP id 27ID6iTt000651; Thu, 18 Aug 2022 13:11:37 GMT Received: from b01cxnp22033.gho.pok.ibm.com (b01cxnp22033.gho.pok.ibm.com [9.57.198.23]) by ppma04wdc.us.ibm.com with ESMTP id 3hx3k9y180-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Thu, 18 Aug 2022 13:11:37 +0000 Received: from b01ledav002.gho.pok.ibm.com (b01ledav002.gho.pok.ibm.com [9.57.199.107]) by b01cxnp22033.gho.pok.ibm.com (8.14.9/8.14.9/NCO v10.0) with ESMTP id 27IDBbNS262846 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-GCM-SHA384 bits=256 verify=OK); Thu, 18 Aug 2022 13:11:37 GMT Received: from b01ledav002.gho.pok.ibm.com (unknown [127.0.0.1]) by IMSVA (Postfix) with ESMTP id 7920E124053; Thu, 18 Aug 2022 13:11:37 +0000 (GMT) Received: from b01ledav002.gho.pok.ibm.com (unknown [127.0.0.1]) by IMSVA (Postfix) with ESMTP id 84B19124052; Thu, 18 Aug 2022 13:11:32 +0000 (GMT) Received: from skywalker.ibmuc.com (unknown [9.43.111.107]) by b01ledav002.gho.pok.ibm.com (Postfix) with ESMTP; Thu, 18 Aug 2022 13:11:32 +0000 (GMT) From: "Aneesh Kumar K.V" To: linux-mm@kvack.org, akpm@linux-foundation.org Cc: Wei Xu , Huang Ying , Yang Shi , Davidlohr Bueso , Tim C Chen , Michal Hocko , Linux Kernel Mailing List , Hesham Almatary , Dave Hansen , Jonathan Cameron , Alistair Popple , Dan Williams , Johannes Weiner , jvgediya.oss@gmail.com, Bharata B Rao , "Aneesh Kumar K.V" Subject: [PATCH v15 07/10] mm/demotion: Drop memtier from memtype Date: Thu, 18 Aug 2022 18:40:39 +0530 Message-Id: <20220818131042.113280-8-aneesh.kumar@linux.ibm.com> X-Mailer: git-send-email 2.37.2 In-Reply-To: <20220818131042.113280-1-aneesh.kumar@linux.ibm.com> References: <20220818131042.113280-1-aneesh.kumar@linux.ibm.com> MIME-Version: 1.0 X-TM-AS-GCONF: 00 X-Proofpoint-ORIG-GUID: LppKjg5JzG-s2wSh-KGKpU7kVJGuptNi X-Proofpoint-GUID: kXSCFxrp07fCnSLcM_3IuSn8kxNCc6_X X-Proofpoint-Virus-Version: vendor=baseguard engine=ICAP:2.0.205,Aquarius:18.0.895,Hydra:6.0.517,FMLib:17.11.122.1 definitions=2022-08-18_12,2022-08-18_01,2022-06-22_01 X-Proofpoint-Spam-Details: rule=outbound_notspam policy=outbound score=0 malwarescore=0 lowpriorityscore=0 clxscore=1015 bulkscore=0 mlxlogscore=999 adultscore=0 suspectscore=0 mlxscore=0 impostorscore=0 priorityscore=1501 phishscore=0 spamscore=0 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.12.0-2207270000 definitions=main-2208180045 ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1660859755; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=LIDo6CqydH8lG4GOng8n1y97XLjFtFkKiq31uguX6/0=; b=yveNjdlKkkwmnw/6eOP82igTTwvQvQYB3a7/i28APeNxq+d6YHYL6YjPgWXDJyofieH2sg 7UgnpyuSFdtMiTW76PRWezpiZ49g1e0WK/d/v5RMmU5eHMRL2GutbRw2bzevwW94cf3Ml0 ErC6nGgHhlFzlDrvcqjujiSIcoJ3YEE= ARC-Authentication-Results: i=1; imf22.hostedemail.com; dkim=pass header.d=ibm.com header.s=pp1 header.b=nMKsU+AR; spf=pass (imf22.hostedemail.com: domain of aneesh.kumar@linux.ibm.com designates 148.163.156.1 as permitted sender) smtp.mailfrom=aneesh.kumar@linux.ibm.com; dmarc=pass (policy=none) header.from=ibm.com ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1660859755; a=rsa-sha256; cv=none; b=PSKrOK7pw/1rgZDvg6o4FyerwtwWCPrbFpAbxnx8/lDOmJPc+YzlVzJoJRmnOxKbRAu/5P tS69nhJuqcHPTBKXPmvBuXfxLlUbwB2KhO/IUijU22bJjc+8ieSGs0Ll4bpfa9IDVa/C36 dhWl2/tEMkANi3me6pRQ9ys29EQo5Xs= X-Rspam-User: X-Rspamd-Queue-Id: 8F34CC7323 Authentication-Results: imf22.hostedemail.com; dkim=pass header.d=ibm.com header.s=pp1 header.b=nMKsU+AR; spf=pass (imf22.hostedemail.com: domain of aneesh.kumar@linux.ibm.com designates 148.163.156.1 as permitted sender) smtp.mailfrom=aneesh.kumar@linux.ibm.com; dmarc=pass (policy=none) header.from=ibm.com X-Stat-Signature: gk7xn67gb754m9awr3wbxo9qaat4ca1n X-Rspamd-Server: rspam10 X-HE-Tag: 1660859754-150501 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: Now that we track node-specific memtier in pg_data_t, we can drop memtier from memtype. Reviewed-by: "Huang, Ying" Signed-off-by: Aneesh Kumar K.V --- include/linux/memory-tiers.h | 1 - mm/memory-tiers.c | 16 +++++++++------- 2 files changed, 9 insertions(+), 8 deletions(-) diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h index 548e69f23727..108083d74557 100644 --- a/include/linux/memory-tiers.h +++ b/include/linux/memory-tiers.h @@ -28,7 +28,6 @@ struct memory_dev_type { /* Nodes of same abstract distance */ nodemask_t nodes; struct kref kref; - struct memory_tier *memtier; }; #ifdef CONFIG_NUMA diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c index 2b7e91b45a75..455c104fab5d 100644 --- a/mm/memory-tiers.c +++ b/mm/memory-tiers.c @@ -100,17 +100,22 @@ static struct memory_tier *find_create_memory_tier(struct memory_dev_type *memty lockdep_assert_held_once(&memory_tier_lock); + adistance = round_down(adistance, memtier_adistance_chunk_size); /* * If the memtype is already part of a memory tier, * just return that. */ - if (memtype->memtier) - return memtype->memtier; + if (!list_empty(&memtype->tier_sibiling)) { + list_for_each_entry(memtier, &memory_tiers, list) { + if (adistance == memtier->adistance_start) + return memtier; + } + WARN_ON(1); + return ERR_PTR(-EINVAL); + } - adistance = round_down(adistance, memtier_adistance_chunk_size); list_for_each_entry(memtier, &memory_tiers, list) { if (adistance == memtier->adistance_start) { - memtype->memtier = memtier; list_add(&memtype->tier_sibiling, &memtier->memory_types); return memtier; } else if (adistance < memtier->adistance_start) { @@ -130,7 +135,6 @@ static struct memory_tier *find_create_memory_tier(struct memory_dev_type *memty list_add_tail(&new_memtier->list, &memtier->list); else list_add_tail(&new_memtier->list, &memory_tiers); - memtype->memtier = new_memtier; list_add(&memtype->tier_sibiling, &new_memtier->memory_types); return new_memtier; } @@ -357,7 +361,6 @@ static bool clear_node_memory_tier(int node) node_clear(node, memtype->nodes); if (nodes_empty(memtype->nodes)) { list_del(&memtype->tier_sibiling); - memtype->memtier = NULL; if (list_empty(&memtier->memory_types)) destroy_memory_tier(memtier); } @@ -385,7 +388,6 @@ struct memory_dev_type *alloc_memory_type(int adistance) memtype->adistance = adistance; INIT_LIST_HEAD(&memtype->tier_sibiling); memtype->nodes = NODE_MASK_NONE; - memtype->memtier = NULL; kref_init(&memtype->kref); return memtype; } From patchwork Thu Aug 18 13:10:40 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Aneesh Kumar K.V" X-Patchwork-Id: 12947947 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 14569C28B2B for ; Thu, 18 Aug 2022 21:09:32 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 71E658D0003; Thu, 18 Aug 2022 17:09:32 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id 6CF288D0002; Thu, 18 Aug 2022 17:09:32 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 547E08D0003; Thu, 18 Aug 2022 17:09:32 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0015.hostedemail.com [216.40.44.15]) by kanga.kvack.org (Postfix) with ESMTP id 4633D8D0002 for ; Thu, 18 Aug 2022 17:09:32 -0400 (EDT) Received: from smtpin14.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay09.hostedemail.com (Postfix) with ESMTP id 44F1382558 for ; Thu, 18 Aug 2022 21:09:30 +0000 (UTC) X-FDA: 79813954500.14.5A970E1 Received: from mx0b-001b2d01.pphosted.com (mx0b-001b2d01.pphosted.com [148.163.158.5]) by imf09.hostedemail.com (Postfix) with ESMTP id B23A9140A32 for ; Thu, 18 Aug 2022 21:04:38 +0000 (UTC) Received: from pps.filterd (m0098417.ppops.net [127.0.0.1]) by mx0a-001b2d01.pphosted.com (8.17.1.5/8.17.1.5) with ESMTP id 27ICxpka026757; Thu, 18 Aug 2022 13:11:45 GMT DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=ibm.com; h=from : to : cc : subject : date : message-id : in-reply-to : references : mime-version : content-transfer-encoding; s=pp1; bh=ZGi0zkTBJ3xmi2kyZ6V3Nh+fxC21FPCvnjIKT7n9plU=; b=DJ2J1jy7w9oUUYK6M3yXr3r6TSe3MXgzF35ELtC+bt6H+nFIfr+2Wn9ZtPJMFxxW66Xw /YwrOMa72YImUrhNjahk2hcVpzZaSQwpuO30Slts74jJ5oeCjreVM0kHIxRRsLfXHeaT 4BS7q5+Y1VVpDWmN2sXXgZnY/U/2Ns3SFljkH0lZGcwSJ1ZM7KWtJAU79Ut0f6uLZ+2h AIHbPR20RZVpMNFTlLVEc+eOfXUuJmjVmhoHoFoYPAWk8QqZ+cCmJ/A0rT3hoH1UVFvc ZFd1yyyr3g8j6hSDRxb8ES+mQyyPtfYWgTQ0E/dxKph8YuEG6l+EBZyX9RtSDfJXGYEs rw== Received: from pps.reinject (localhost [127.0.0.1]) by mx0a-001b2d01.pphosted.com (PPS) with ESMTPS id 3j1nx0rjg1-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Thu, 18 Aug 2022 13:11:45 +0000 Received: from m0098417.ppops.net (m0098417.ppops.net [127.0.0.1]) by pps.reinject (8.17.1.5/8.17.1.5) with ESMTP id 27ID095U028151; Thu, 18 Aug 2022 13:11:44 GMT Received: from ppma01wdc.us.ibm.com (fd.55.37a9.ip4.static.sl-reverse.com [169.55.85.253]) by mx0a-001b2d01.pphosted.com (PPS) with ESMTPS id 3j1nx0rjfd-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Thu, 18 Aug 2022 13:11:44 +0000 Received: from pps.filterd (ppma01wdc.us.ibm.com [127.0.0.1]) by ppma01wdc.us.ibm.com (8.16.1.2/8.16.1.2) with SMTP id 27ID6UKO003231; Thu, 18 Aug 2022 13:11:43 GMT Received: from b01cxnp22035.gho.pok.ibm.com (b01cxnp22035.gho.pok.ibm.com [9.57.198.25]) by ppma01wdc.us.ibm.com with ESMTP id 3hx3k9pygp-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Thu, 18 Aug 2022 13:11:43 +0000 Received: from b01ledav002.gho.pok.ibm.com (b01ledav002.gho.pok.ibm.com [9.57.199.107]) by b01cxnp22035.gho.pok.ibm.com (8.14.9/8.14.9/NCO v10.0) with ESMTP id 27IDBheY9175686 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-GCM-SHA384 bits=256 verify=OK); Thu, 18 Aug 2022 13:11:43 GMT Received: from b01ledav002.gho.pok.ibm.com (unknown [127.0.0.1]) by IMSVA (Postfix) with ESMTP id 28664124052; Thu, 18 Aug 2022 13:11:43 +0000 (GMT) Received: from b01ledav002.gho.pok.ibm.com (unknown [127.0.0.1]) by IMSVA (Postfix) with ESMTP id 0A166124053; Thu, 18 Aug 2022 13:11:38 +0000 (GMT) Received: from skywalker.ibmuc.com (unknown [9.43.111.107]) by b01ledav002.gho.pok.ibm.com (Postfix) with ESMTP; Thu, 18 Aug 2022 13:11:37 +0000 (GMT) From: "Aneesh Kumar K.V" To: linux-mm@kvack.org, akpm@linux-foundation.org Cc: Wei Xu , Huang Ying , Yang Shi , Davidlohr Bueso , Tim C Chen , Michal Hocko , Linux Kernel Mailing List , Hesham Almatary , Dave Hansen , Jonathan Cameron , Alistair Popple , Dan Williams , Johannes Weiner , jvgediya.oss@gmail.com, Bharata B Rao , "Aneesh Kumar K . V" Subject: [PATCH v15 08/10] mm/demotion: Demote pages according to allocation fallback order Date: Thu, 18 Aug 2022 18:40:40 +0530 Message-Id: <20220818131042.113280-9-aneesh.kumar@linux.ibm.com> X-Mailer: git-send-email 2.37.2 In-Reply-To: <20220818131042.113280-1-aneesh.kumar@linux.ibm.com> References: <20220818131042.113280-1-aneesh.kumar@linux.ibm.com> MIME-Version: 1.0 X-TM-AS-GCONF: 00 X-Proofpoint-GUID: F5aMTFKmc3FEEfgEfGUGLUpnSyjaqyTG X-Proofpoint-ORIG-GUID: ycA9iy6aLJXGEFNgGqqIdVFejkkEvzfk X-Proofpoint-Virus-Version: vendor=baseguard engine=ICAP:2.0.205,Aquarius:18.0.895,Hydra:6.0.517,FMLib:17.11.122.1 definitions=2022-08-18_12,2022-08-18_01,2022-06-22_01 X-Proofpoint-Spam-Details: rule=outbound_notspam policy=outbound score=0 malwarescore=0 priorityscore=1501 mlxscore=0 spamscore=0 suspectscore=0 lowpriorityscore=0 phishscore=0 clxscore=1015 bulkscore=0 impostorscore=0 adultscore=0 mlxlogscore=999 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.12.0-2207270000 definitions=main-2208180045 ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1660856678; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=ZGi0zkTBJ3xmi2kyZ6V3Nh+fxC21FPCvnjIKT7n9plU=; b=xOl2ffQXXT5C4zFpyszjUne0iJOqrxNtPuo/ZzW0dhzIds7Enf8M5pi5vW2YnsgRCeK/Kr xyxzHfenRIpOImkvCCX/1bOowAI7KVkxt3VcHNaRJuMeLAxpFsCDAbmEazwMC8W6sr2R4G +QQl+bcleCnOtsBpG0b5eb0xcTZ1CUc= ARC-Authentication-Results: i=1; imf09.hostedemail.com; dkim=pass header.d=ibm.com header.s=pp1 header.b=DJ2J1jy7; spf=pass (imf09.hostedemail.com: domain of aneesh.kumar@linux.ibm.com designates 148.163.158.5 as permitted sender) smtp.mailfrom=aneesh.kumar@linux.ibm.com; dmarc=pass (policy=none) header.from=ibm.com ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1660856678; a=rsa-sha256; cv=none; b=2Qn8JKPdxOviYMVccexFr04xKOV++oxePD+pUVqCt9dhX8a5pMzrOqZUQSXJUQG/qYTmLS 5wE+b8oh+k0Sqqgtt7Wv7rVd4g7RV+5rXzNPYyzQ6pHYoau2b8l3md4mlfHXiGw8NpmNaF 1bxg/MiY8MXp7YCp12dwpuNGJjRYfLg= X-Rspam-User: Authentication-Results: imf09.hostedemail.com; dkim=pass header.d=ibm.com header.s=pp1 header.b=DJ2J1jy7; spf=pass (imf09.hostedemail.com: domain of aneesh.kumar@linux.ibm.com designates 148.163.158.5 as permitted sender) smtp.mailfrom=aneesh.kumar@linux.ibm.com; dmarc=pass (policy=none) header.from=ibm.com X-Stat-Signature: 77y618ro4u9fxtxarxcfdb9ojt9bua1f X-Rspamd-Queue-Id: B23A9140A32 X-Rspamd-Server: rspam05 X-HE-Tag: 1660856678-289690 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: From: Jagdish Gediya Currently, a higher tier node can only be demoted to selected nodes on the next lower tier as defined by the demotion path. This strict demotion order does not work in all use cases (e.g. some use cases may want to allow cross-socket demotion to another node in the same demotion tier as a fallback when the preferred demotion node is out of space). This demotion order is also inconsistent with the page allocation fallback order when all the nodes in a higher tier are out of space: The page allocation can fall back to any node from any lower tier, whereas the demotion order doesn't allow that currently. This patch adds support to get all the allowed demotion targets for a memory tier. demote_page_list() function is now modified to utilize this allowed node mask as the fallback allocation mask. Reviewed-by: "Huang, Ying" Signed-off-by: Jagdish Gediya Signed-off-by: Aneesh Kumar K.V --- include/linux/memory-tiers.h | 12 ++++++++ mm/memory-tiers.c | 51 +++++++++++++++++++++++++++++-- mm/vmscan.c | 58 ++++++++++++++++++++++++++---------- 3 files changed, 103 insertions(+), 18 deletions(-) diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h index 108083d74557..a2f8f4c250b9 100644 --- a/include/linux/memory-tiers.h +++ b/include/linux/memory-tiers.h @@ -5,6 +5,7 @@ #include #include #include +#include /* * Each tier cover a abstrace distance chunk size of 128 */ @@ -38,11 +39,17 @@ void init_node_memory_type(int node, struct memory_dev_type *default_type); void clear_node_memory_type(int node, struct memory_dev_type *memtype); #ifdef CONFIG_MIGRATION int next_demotion_node(int node); +void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets); #else static inline int next_demotion_node(int node) { return NUMA_NO_NODE; } + +static inline void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets) +{ + *targets = NODE_MASK_NONE; +} #endif #else @@ -75,5 +82,10 @@ static inline int next_demotion_node(int node) { return NUMA_NO_NODE; } + +static inline void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets) +{ + *targets = NODE_MASK_NONE; +} #endif /* CONFIG_NUMA */ #endif /* _LINUX_MEMORY_TIERS_H */ diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c index 455c104fab5d..89b9f317be99 100644 --- a/mm/memory-tiers.c +++ b/mm/memory-tiers.c @@ -4,7 +4,6 @@ #include #include #include -#include #include #include "internal.h" @@ -20,6 +19,8 @@ struct memory_tier { * adistance_start .. adistance_start + MEMTIER_CHUNK_SIZE */ int adistance_start; + /* All the nodes that are part of all the lower memory tiers. */ + nodemask_t lower_tier_mask; }; struct demotion_nodes { @@ -156,6 +157,24 @@ static struct memory_tier *__node_get_memory_tier(int node) } #ifdef CONFIG_MIGRATION +void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets) +{ + struct memory_tier *memtier; + + /* + * pg_data_t.memtier updates includes a synchronize_rcu() + * which ensures that we either find NULL or a valid memtier + * in NODE_DATA. protect the access via rcu_read_lock(); + */ + rcu_read_lock(); + memtier = rcu_dereference(pgdat->memtier); + if (memtier) + *targets = memtier->lower_tier_mask; + else + *targets = NODE_MASK_NONE; + rcu_read_unlock(); +} + /** * next_demotion_node() - Get the next node in the demotion path * @node: The starting node to lookup the next node @@ -203,10 +222,19 @@ int next_demotion_node(int node) static void disable_all_demotion_targets(void) { + struct memory_tier *memtier; int node; - for_each_node_state(node, N_MEMORY) + for_each_node_state(node, N_MEMORY) { node_demotion[node].preferred = NODE_MASK_NONE; + /* + * We are holding memory_tier_lock, it is safe + * to access pgda->memtier. + */ + memtier = __node_get_memory_tier(node); + if (memtier) + memtier->lower_tier_mask = NODE_MASK_NONE; + } /* * Ensure that the "disable" is visible across the system. * Readers will see either a combination of before+disable @@ -238,7 +266,7 @@ static void establish_demotion_targets(void) struct demotion_nodes *nd; int target = NUMA_NO_NODE, node; int distance, best_distance; - nodemask_t tier_nodes; + nodemask_t tier_nodes, lower_tier; lockdep_assert_held_once(&memory_tier_lock); @@ -286,6 +314,23 @@ static void establish_demotion_targets(void) } } while (1); } + /* + * Now build the lower_tier mask for each node collecting node mask from + * all memory tier below it. This allows us to fallback demotion page + * allocation to a set of nodes that is closer the above selected + * perferred node. + */ + lower_tier = node_states[N_MEMORY]; + list_for_each_entry(memtier, &memory_tiers, list) { + /* + * Keep removing current tier from lower_tier nodes, + * This will remove all nodes in current and above + * memory tier from the lower_tier mask. + */ + tier_nodes = get_memtier_nodemask(memtier); + nodes_andnot(lower_tier, lower_tier, tier_nodes); + memtier->lower_tier_mask = lower_tier; + } } #else diff --git a/mm/vmscan.c b/mm/vmscan.c index 224de380ac88..500b9054be18 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -1521,21 +1521,34 @@ static void folio_check_dirty_writeback(struct folio *folio, mapping->a_ops->is_dirty_writeback(folio, dirty, writeback); } -static struct page *alloc_demote_page(struct page *page, unsigned long node) +static struct page *alloc_demote_page(struct page *page, unsigned long private) { - struct migration_target_control mtc = { - /* - * Allocate from 'node', or fail quickly and quietly. - * When this happens, 'page' will likely just be discarded - * instead of migrated. - */ - .gfp_mask = (GFP_HIGHUSER_MOVABLE & ~__GFP_RECLAIM) | - __GFP_THISNODE | __GFP_NOWARN | - __GFP_NOMEMALLOC | GFP_NOWAIT, - .nid = node - }; + struct page *target_page; + nodemask_t *allowed_mask; + struct migration_target_control *mtc; + + mtc = (struct migration_target_control *)private; + + allowed_mask = mtc->nmask; + /* + * make sure we allocate from the target node first also trying to + * demote or reclaim pages from the target node via kswapd if we are + * low on free memory on target node. If we don't do this and if + * we have free memory on the slower(lower) memtier, we would start + * allocating pages from slower(lower) memory tiers without even forcing + * a demotion of cold pages from the target memtier. This can result + * in the kernel placing hot pages in slower(lower) memory tiers. + */ + mtc->nmask = NULL; + mtc->gfp_mask |= __GFP_THISNODE; + target_page = alloc_migration_target(page, (unsigned long)mtc); + if (target_page) + return target_page; - return alloc_migration_target(page, (unsigned long)&mtc); + mtc->gfp_mask &= ~__GFP_THISNODE; + mtc->nmask = allowed_mask; + + return alloc_migration_target(page, (unsigned long)mtc); } /* @@ -1548,6 +1561,19 @@ static unsigned int demote_page_list(struct list_head *demote_pages, { int target_nid = next_demotion_node(pgdat->node_id); unsigned int nr_succeeded; + nodemask_t allowed_mask; + + struct migration_target_control mtc = { + /* + * Allocate from 'node', or fail quickly and quietly. + * When this happens, 'page' will likely just be discarded + * instead of migrated. + */ + .gfp_mask = (GFP_HIGHUSER_MOVABLE & ~__GFP_RECLAIM) | __GFP_NOWARN | + __GFP_NOMEMALLOC | GFP_NOWAIT, + .nid = target_nid, + .nmask = &allowed_mask + }; if (list_empty(demote_pages)) return 0; @@ -1555,10 +1581,12 @@ static unsigned int demote_page_list(struct list_head *demote_pages, if (target_nid == NUMA_NO_NODE) return 0; + node_get_allowed_targets(pgdat, &allowed_mask); + /* Demotion ignores all cpuset and mempolicy settings */ migrate_pages(demote_pages, alloc_demote_page, NULL, - target_nid, MIGRATE_ASYNC, MR_DEMOTION, - &nr_succeeded); + (unsigned long)&mtc, MIGRATE_ASYNC, MR_DEMOTION, + &nr_succeeded); if (current_is_kswapd()) __count_vm_events(PGDEMOTE_KSWAPD, nr_succeeded); From patchwork Thu Aug 18 13:10:41 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Aneesh Kumar K.V" X-Patchwork-Id: 12948012 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 1D67BC00140 for ; Thu, 18 Aug 2022 22:03:33 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 8437F8D0002; Thu, 18 Aug 2022 18:03:32 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id 7CC6F8D0001; Thu, 18 Aug 2022 18:03:32 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 644E58D0002; Thu, 18 Aug 2022 18:03:32 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0012.hostedemail.com [216.40.44.12]) by kanga.kvack.org (Postfix) with ESMTP id 50FE48D0001 for ; Thu, 18 Aug 2022 18:03:32 -0400 (EDT) Received: from smtpin22.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay07.hostedemail.com (Postfix) with ESMTP id 2A2A4160BE7 for ; Thu, 18 Aug 2022 22:03:32 +0000 (UTC) X-FDA: 79814090664.22.EDC0CD2 Received: by imf05.hostedemail.com (Postfix, from userid 200) id 4CFDF1024C0; Thu, 18 Aug 2022 15:55:21 +0000 (UTC) Received: from mx0a-001b2d01.pphosted.com (mx0b-001b2d01.pphosted.com [148.163.158.5]) by imf05.hostedemail.com (Postfix) with ESMTP id 0A8AB10254F for ; Thu, 18 Aug 2022 15:55:19 +0000 (UTC) Received: from pps.filterd (m0098419.ppops.net [127.0.0.1]) by mx0b-001b2d01.pphosted.com (8.17.1.5/8.17.1.5) with ESMTP id 27ID152b016265; Thu, 18 Aug 2022 13:11:51 GMT DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=ibm.com; h=from : to : cc : subject : date : message-id : in-reply-to : references : mime-version : content-transfer-encoding; s=pp1; bh=JYAsT46EGOr9dd0WGx844S7yI9r04cEcc0KwzcFDO6I=; b=N4Y5K5vaCFAfBsj1b5pMOMNNpaXCNKDvgsrntx2TAht+/aaxmY2429sK1nZEXe+pX0aR Qv5bKIzIoOLmz2L0s1bYwhoyMujpunH4KlQWbmUsMyDFHhslqhsrFg8CTHsX9ZE2lIqv daDi0e0sVVs1r2B+9NKabf+cCBmbtOyvDWpQG1Svh24kCvHtWxlJ2lDLmMFeXiCWGVqH FV8xAakxYG7X6rqyDp2/+H/gJSgnsRJ8T4BIU83dSDxHvRl0Tp/xwRT7K9FWAAFCn+4k cBIC86npymqOz1BvZPocvr0RJfBaolw7FZIFHwK/s90i1oKWcQfgHL+Hr/5t5BwvRZUy Og== Received: from pps.reinject (localhost [127.0.0.1]) by mx0b-001b2d01.pphosted.com (PPS) with ESMTPS id 3j1nxkgegg-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Thu, 18 Aug 2022 13:11:50 +0000 Received: from m0098419.ppops.net (m0098419.ppops.net [127.0.0.1]) by pps.reinject (8.17.1.5/8.17.1.5) with ESMTP id 27ID1BVp016772; Thu, 18 Aug 2022 13:11:50 GMT Received: from ppma05wdc.us.ibm.com (1b.90.2fa9.ip4.static.sl-reverse.com [169.47.144.27]) by mx0b-001b2d01.pphosted.com (PPS) with ESMTPS id 3j1nxkgefc-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Thu, 18 Aug 2022 13:11:50 +0000 Received: from pps.filterd (ppma05wdc.us.ibm.com [127.0.0.1]) by ppma05wdc.us.ibm.com (8.16.1.2/8.16.1.2) with SMTP id 27ID5Rm9002360; Thu, 18 Aug 2022 13:11:49 GMT Received: from b01cxnp23032.gho.pok.ibm.com (b01cxnp23032.gho.pok.ibm.com [9.57.198.27]) by ppma05wdc.us.ibm.com with ESMTP id 3hx3kaey0r-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Thu, 18 Aug 2022 13:11:49 +0000 Received: from b01ledav002.gho.pok.ibm.com (b01ledav002.gho.pok.ibm.com [9.57.199.107]) by b01cxnp23032.gho.pok.ibm.com (8.14.9/8.14.9/NCO v10.0) with ESMTP id 27IDBmgw59048336 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-GCM-SHA384 bits=256 verify=OK); Thu, 18 Aug 2022 13:11:48 GMT Received: from b01ledav002.gho.pok.ibm.com (unknown [127.0.0.1]) by IMSVA (Postfix) with ESMTP id D49E7124053; Thu, 18 Aug 2022 13:11:48 +0000 (GMT) Received: from b01ledav002.gho.pok.ibm.com (unknown [127.0.0.1]) by IMSVA (Postfix) with ESMTP id BAE99124052; Thu, 18 Aug 2022 13:11:43 +0000 (GMT) Received: from skywalker.ibmuc.com (unknown [9.43.111.107]) by b01ledav002.gho.pok.ibm.com (Postfix) with ESMTP; Thu, 18 Aug 2022 13:11:43 +0000 (GMT) From: "Aneesh Kumar K.V" To: linux-mm@kvack.org, akpm@linux-foundation.org Cc: Wei Xu , Huang Ying , Yang Shi , Davidlohr Bueso , Tim C Chen , Michal Hocko , Linux Kernel Mailing List , Hesham Almatary , Dave Hansen , Jonathan Cameron , Alistair Popple , Dan Williams , Johannes Weiner , jvgediya.oss@gmail.com, Bharata B Rao , "Aneesh Kumar K.V" Subject: [PATCH v15 09/10] mm/demotion: Update node_is_toptier to work with memory tiers Date: Thu, 18 Aug 2022 18:40:41 +0530 Message-Id: <20220818131042.113280-10-aneesh.kumar@linux.ibm.com> X-Mailer: git-send-email 2.37.2 In-Reply-To: <20220818131042.113280-1-aneesh.kumar@linux.ibm.com> References: <20220818131042.113280-1-aneesh.kumar@linux.ibm.com> MIME-Version: 1.0 X-TM-AS-GCONF: 00 X-Proofpoint-GUID: L_9LzCUdiNDQER5hC2xFI3393BWJpNYz X-Proofpoint-ORIG-GUID: uFTLE3b55H5HtWN50XFgJ86kZGrwSLCt X-Proofpoint-Virus-Version: vendor=baseguard engine=ICAP:2.0.205,Aquarius:18.0.895,Hydra:6.0.517,FMLib:17.11.122.1 definitions=2022-08-18_12,2022-08-18_01,2022-06-22_01 X-Proofpoint-Spam-Details: rule=outbound_notspam policy=outbound score=0 lowpriorityscore=0 mlxlogscore=999 impostorscore=0 bulkscore=0 suspectscore=0 adultscore=0 phishscore=0 priorityscore=1501 clxscore=1015 malwarescore=0 mlxscore=0 spamscore=0 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.12.0-2207270000 definitions=main-2208180045 ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1660838120; a=rsa-sha256; cv=none; b=YRCX6ituukJnfwTjpM+34jbiKstLyVnZEOMOY36AJo7KITJwIFRr4AbxEkRBt2rwTCdBiQ NrjWeP1GTfHVPyjCqGcEsrgttSD7iUMRdPVX9Z15PkwMksPmTgy31xJSbIlbyriLbPHgfB NoA7eCUxduesNSwmEzQf5G54HtxtorY= ARC-Authentication-Results: i=1; imf05.hostedemail.com; dkim=pass header.d=ibm.com header.s=pp1 header.b=N4Y5K5va; spf=pass (imf05.hostedemail.com: domain of aneesh.kumar@linux.ibm.com designates 148.163.158.5 as permitted sender) smtp.mailfrom=aneesh.kumar@linux.ibm.com; dmarc=pass (policy=none) header.from=ibm.com ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1660838120; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=JYAsT46EGOr9dd0WGx844S7yI9r04cEcc0KwzcFDO6I=; b=qtL1wQdDOiIvrlAmaPGX/pReLAgGorMf0WNwvUxlhRL4/0Xy0Lg27Ly9syqQT3aO/4aIRW mENxxEde7PDuF3j9gDj6KwoK9HKU/hWPDsgvJFsYWvVIkCAsskuU5+akcscs9/+X6enuhY Qrp9nTJqcVaXjxaBaeGbwiRUHpjD8SQ= X-Rspamd-Server: rspam08 X-Rspamd-Queue-Id: 0A8AB10254F X-Stat-Signature: aeci4xut4mgpwbm9zufsmx58jwgrznte Authentication-Results: imf05.hostedemail.com; dkim=pass header.d=ibm.com header.s=pp1 header.b=N4Y5K5va; spf=pass (imf05.hostedemail.com: domain of aneesh.kumar@linux.ibm.com designates 148.163.158.5 as permitted sender) smtp.mailfrom=aneesh.kumar@linux.ibm.com; dmarc=pass (policy=none) header.from=ibm.com X-Rspam-User: X-HE-Tag: 1660838119-825716 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: With memory tier support we can have memory only NUMA nodes in the top tier from which we want to avoid promotion tracking NUMA faults. Update node_is_toptier to work with memory tiers. All NUMA nodes are by default top tier nodes. With lower(slower) memory tiers added we consider all memory tiers above a memory tier having CPU NUMA nodes as a top memory tier Reviewed-by: "Huang, Ying" Signed-off-by: Aneesh Kumar K.V --- include/linux/memory-tiers.h | 11 +++++++++ include/linux/node.h | 5 ---- mm/huge_memory.c | 1 + mm/memory-tiers.c | 46 ++++++++++++++++++++++++++++++++++++ mm/migrate.c | 1 + mm/mprotect.c | 1 + 6 files changed, 60 insertions(+), 5 deletions(-) diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h index a2f8f4c250b9..1592d374b5bf 100644 --- a/include/linux/memory-tiers.h +++ b/include/linux/memory-tiers.h @@ -40,6 +40,7 @@ void clear_node_memory_type(int node, struct memory_dev_type *memtype); #ifdef CONFIG_MIGRATION int next_demotion_node(int node); void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets); +bool node_is_toptier(int node); #else static inline int next_demotion_node(int node) { @@ -50,6 +51,11 @@ static inline void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *target { *targets = NODE_MASK_NONE; } + +static inline bool node_is_toptier(int node) +{ + return true; +} #endif #else @@ -87,5 +93,10 @@ static inline void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *target { *targets = NODE_MASK_NONE; } + +static inline bool node_is_toptier(int node) +{ + return true; +} #endif /* CONFIG_NUMA */ #endif /* _LINUX_MEMORY_TIERS_H */ diff --git a/include/linux/node.h b/include/linux/node.h index 40d641a8bfb0..9ec680dd607f 100644 --- a/include/linux/node.h +++ b/include/linux/node.h @@ -185,9 +185,4 @@ static inline void register_hugetlbfs_with_node(node_registration_func_t reg, #define to_node(device) container_of(device, struct node, dev) -static inline bool node_is_toptier(int node) -{ - return node_state(node, N_CPU); -} - #endif /* _LINUX_NODE_H_ */ diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 8a7c1b344abe..1e9357576f2d 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -36,6 +36,7 @@ #include #include #include +#include #include #include diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c index 89b9f317be99..a20795bb0e07 100644 --- a/mm/memory-tiers.c +++ b/mm/memory-tiers.c @@ -32,6 +32,7 @@ static LIST_HEAD(memory_tiers); static struct memory_dev_type *node_memory_types[MAX_NUMNODES]; static struct memory_dev_type *default_dram_type; #ifdef CONFIG_MIGRATION +static int top_tier_adistance; /* * node_demotion[] examples: * @@ -157,6 +158,31 @@ static struct memory_tier *__node_get_memory_tier(int node) } #ifdef CONFIG_MIGRATION +bool node_is_toptier(int node) +{ + bool toptier; + pg_data_t *pgdat; + struct memory_tier *memtier; + + pgdat = NODE_DATA(node); + if (!pgdat) + return false; + + rcu_read_lock(); + memtier = rcu_dereference(pgdat->memtier); + if (!memtier) { + toptier = true; + goto out; + } + if (memtier->adistance_start < top_tier_adistance) + toptier = true; + else + toptier = false; +out: + rcu_read_unlock(); + return toptier; +} + void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets) { struct memory_tier *memtier; @@ -314,6 +340,26 @@ static void establish_demotion_targets(void) } } while (1); } + /* + * Promotion is allowed from a memory tier to higher + * memory tier only if the memory tier doesn't include + * compute. We want to skip promotion from a memory tier, + * if any node that is part of the memory tier have CPUs. + * Once we detect such a memory tier, we consider that tier + * as top tiper from which promotion is not allowed. + */ + list_for_each_entry_reverse(memtier, &memory_tiers, list) { + tier_nodes = get_memtier_nodemask(memtier); + nodes_and(tier_nodes, node_states[N_CPU], tier_nodes); + if (!nodes_empty(tier_nodes)) { + /* + * abstract distance below the max value of this memtier + * is considered toptier. + */ + top_tier_adistance = memtier->adistance_start + MEMTIER_CHUNK_SIZE; + break; + } + } /* * Now build the lower_tier mask for each node collecting node mask from * all memory tier below it. This allows us to fallback demotion page diff --git a/mm/migrate.c b/mm/migrate.c index ea86594f4bc5..55e7718cfe45 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -50,6 +50,7 @@ #include #include #include +#include #include diff --git a/mm/mprotect.c b/mm/mprotect.c index 3a23dde73723..61cd80831b04 100644 --- a/mm/mprotect.c +++ b/mm/mprotect.c @@ -31,6 +31,7 @@ #include #include #include +#include #include #include #include From patchwork Thu Aug 18 13:10:42 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Aneesh Kumar K.V" X-Patchwork-Id: 12947384 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 95DDAC00140 for ; Thu, 18 Aug 2022 16:11:51 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 1081E8E0002; Thu, 18 Aug 2022 12:11:51 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id 0902E8E0001; Thu, 18 Aug 2022 12:11:51 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id E73CA8E0002; Thu, 18 Aug 2022 12:11:50 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0016.hostedemail.com [216.40.44.16]) by kanga.kvack.org (Postfix) with ESMTP id D17128E0001 for ; Thu, 18 Aug 2022 12:11:50 -0400 (EDT) Received: from smtpin07.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay05.hostedemail.com (Postfix) with ESMTP id A7A4741962 for ; Thu, 18 Aug 2022 16:11:50 +0000 (UTC) X-FDA: 79813204380.07.1331EA8 Received: from mx0a-001b2d01.pphosted.com (mx0a-001b2d01.pphosted.com [148.163.156.1]) by imf29.hostedemail.com (Postfix) with ESMTP id BBEA8120F85 for ; Thu, 18 Aug 2022 15:54:26 +0000 (UTC) Received: from pps.filterd (m0098404.ppops.net [127.0.0.1]) by mx0a-001b2d01.pphosted.com (8.17.1.5/8.17.1.5) with ESMTP id 27ICg54K006566; Thu, 18 Aug 2022 13:11:57 GMT DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=ibm.com; h=from : to : cc : subject : date : message-id : in-reply-to : references : mime-version : content-transfer-encoding; s=pp1; bh=m2fn92MjFJdmD8qI1OD1ZZfckb8jiPcZlxGR7KNDcb8=; b=GEuaBSeZu7+6VtM7qT0GuZZuObdNf/fQi2RHM42ceUb6GRxWr42z5qq3Yrl6GQxVe7jp /WyiH7j2bB/PkEANZliSjNfSB2ERtIsxsgIB/Q2Lemk77Dbs9DXfs16aJOIN/3FlQaRV 5iJ5lwLF0XVXMybd1VnrGGThpBbBeBdtx7+IV9Qu9eqFu8HVb85/V+1yC7AJgbzQCQYx owkvzvYGPHKOw59LsKgs9a9JGEd7w2KL4A09SSl8JYGy6TaCpmq1so0szGsO+2ZEtSWX OfjGQCiGBKt0LcJMT0af77qEOkaQYrVxT3LGlc1HyhdRiyug9j8ia5zUrgBjCBHryyY+ EA== Received: from pps.reinject (localhost [127.0.0.1]) by mx0a-001b2d01.pphosted.com (PPS) with ESMTPS id 3j1nne93m4-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Thu, 18 Aug 2022 13:11:57 +0000 Received: from m0098404.ppops.net (m0098404.ppops.net [127.0.0.1]) by pps.reinject (8.17.1.5/8.17.1.5) with ESMTP id 27ICgCEV007242; Thu, 18 Aug 2022 13:11:56 GMT Received: from ppma05wdc.us.ibm.com (1b.90.2fa9.ip4.static.sl-reverse.com [169.47.144.27]) by mx0a-001b2d01.pphosted.com (PPS) with ESMTPS id 3j1nne93k9-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Thu, 18 Aug 2022 13:11:56 +0000 Received: from pps.filterd (ppma05wdc.us.ibm.com [127.0.0.1]) by ppma05wdc.us.ibm.com (8.16.1.2/8.16.1.2) with SMTP id 27ID5Q0k002071; Thu, 18 Aug 2022 13:11:54 GMT Received: from b01cxnp22036.gho.pok.ibm.com (b01cxnp22036.gho.pok.ibm.com [9.57.198.26]) by ppma05wdc.us.ibm.com with ESMTP id 3hx3kaey17-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Thu, 18 Aug 2022 13:11:54 +0000 Received: from b01ledav002.gho.pok.ibm.com (b01ledav002.gho.pok.ibm.com [9.57.199.107]) by b01cxnp22036.gho.pok.ibm.com (8.14.9/8.14.9/NCO v10.0) with ESMTP id 27IDBsnA17695410 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-GCM-SHA384 bits=256 verify=OK); Thu, 18 Aug 2022 13:11:54 GMT Received: from b01ledav002.gho.pok.ibm.com (unknown [127.0.0.1]) by IMSVA (Postfix) with ESMTP id 56B60124053; Thu, 18 Aug 2022 13:11:54 +0000 (GMT) Received: from b01ledav002.gho.pok.ibm.com (unknown [127.0.0.1]) by IMSVA (Postfix) with ESMTP id 69E29124052; Thu, 18 Aug 2022 13:11:49 +0000 (GMT) Received: from skywalker.ibmuc.com (unknown [9.43.111.107]) by b01ledav002.gho.pok.ibm.com (Postfix) with ESMTP; Thu, 18 Aug 2022 13:11:49 +0000 (GMT) From: "Aneesh Kumar K.V" To: linux-mm@kvack.org, akpm@linux-foundation.org Cc: Wei Xu , Huang Ying , Yang Shi , Davidlohr Bueso , Tim C Chen , Michal Hocko , Linux Kernel Mailing List , Hesham Almatary , Dave Hansen , Jonathan Cameron , Alistair Popple , Dan Williams , Johannes Weiner , jvgediya.oss@gmail.com, Bharata B Rao , "Aneesh Kumar K.V" Subject: [PATCH v15 10/10] lib/nodemask: Optimize node_random for nodemask with single NUMA node Date: Thu, 18 Aug 2022 18:40:42 +0530 Message-Id: <20220818131042.113280-11-aneesh.kumar@linux.ibm.com> X-Mailer: git-send-email 2.37.2 In-Reply-To: <20220818131042.113280-1-aneesh.kumar@linux.ibm.com> References: <20220818131042.113280-1-aneesh.kumar@linux.ibm.com> MIME-Version: 1.0 X-TM-AS-GCONF: 00 X-Proofpoint-ORIG-GUID: h821cm1iY4VxjPEtr8AlCehQ-kEpQgcV X-Proofpoint-GUID: cPF8qtUhy1sZdH0c1B30i54NexpZhcNG X-Proofpoint-Virus-Version: vendor=baseguard engine=ICAP:2.0.205,Aquarius:18.0.895,Hydra:6.0.517,FMLib:17.11.122.1 definitions=2022-08-18_12,2022-08-18_01,2022-06-22_01 X-Proofpoint-Spam-Details: rule=outbound_notspam policy=outbound score=0 clxscore=1015 bulkscore=0 mlxlogscore=999 adultscore=0 lowpriorityscore=0 suspectscore=0 malwarescore=0 priorityscore=1501 phishscore=0 impostorscore=0 spamscore=0 mlxscore=0 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.12.0-2207270000 definitions=main-2208180045 ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1660838066; a=rsa-sha256; cv=none; b=2rUbaThh7JvsUD8N5GyfwzdpCu/ZyWynKNIJZzfy4EjrWJSxypOkVw1GwPY6UctVhYutff CIl1RkmeTn95bfwrL41wQVxmrWNRr5PNx4vMdt7WPDgESTpSFw9hlys8iqKiPfDeDRyWyt /9T0AamjMTLOB64Ad7NtoxtOmDax7ts= ARC-Authentication-Results: i=1; imf29.hostedemail.com; dkim=pass header.d=ibm.com header.s=pp1 header.b=GEuaBSeZ; spf=pass (imf29.hostedemail.com: domain of aneesh.kumar@linux.ibm.com designates 148.163.156.1 as permitted sender) smtp.mailfrom=aneesh.kumar@linux.ibm.com; dmarc=pass (policy=none) header.from=ibm.com ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1660838066; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=m2fn92MjFJdmD8qI1OD1ZZfckb8jiPcZlxGR7KNDcb8=; b=biO8L+KkR8wu1aPul0PXTDYc06Ui9Q36FVAFg8DWdwW0AqZqursBrJXca5ljiVNWJYl/4n 0XuJH/f+w54Y4stH/f6izqfp4VxrY3bASf28FnbvAPSPJbKybuVlok87l3YqhvYGyt+3aC VtuO8iBEixQwzw7ijJjjBLfoRupZsIs= Authentication-Results: imf29.hostedemail.com; dkim=pass header.d=ibm.com header.s=pp1 header.b=GEuaBSeZ; spf=pass (imf29.hostedemail.com: domain of aneesh.kumar@linux.ibm.com designates 148.163.156.1 as permitted sender) smtp.mailfrom=aneesh.kumar@linux.ibm.com; dmarc=pass (policy=none) header.from=ibm.com X-Rspam-User: X-Rspamd-Server: rspam11 X-Rspamd-Queue-Id: BBEA8120F85 X-Stat-Signature: 6gj9847didyk6ofrwjakx7chy7je975q X-HE-Tag: 1660838066-643108 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: The most common case for certain node_random usage (demotion nodemask) is with nodemask weight 1. We can avoid calling get_random_init() in that case and always return the only node set in the nodemask. A simple test as below before = rdtsc_ordered(); for (i= 0; i < 100; i++) { rand = node_random(&nmask); } after = rdtsc_ordered(); Without fix after - before : 16438 With fix after - before : 816 Reviewed-by: "Huang, Ying" Signed-off-by: Aneesh Kumar K.V --- include/linux/nodemask.h | 15 ++++++++++++--- 1 file changed, 12 insertions(+), 3 deletions(-) diff --git a/include/linux/nodemask.h b/include/linux/nodemask.h index 4b71a96190a8..ac5b6a371be5 100644 --- a/include/linux/nodemask.h +++ b/include/linux/nodemask.h @@ -504,12 +504,21 @@ static inline int num_node_state(enum node_states state) static inline int node_random(const nodemask_t *maskp) { #if defined(CONFIG_NUMA) && (MAX_NUMNODES > 1) - int w, bit = NUMA_NO_NODE; + int w, bit; w = nodes_weight(*maskp); - if (w) + switch (w) { + case 0: + bit = NUMA_NO_NODE; + break; + case 1: + bit = first_node(*maskp); + break; + default: bit = bitmap_ord_to_pos(maskp->bits, - get_random_int() % w, MAX_NUMNODES); + get_random_int() % w, MAX_NUMNODES); + break; + } return bit; #else return 0;