From patchwork Thu Oct 12 02:48:38 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Rongwei Wang X-Patchwork-Id: 13418181 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id A4312CDB465 for ; Thu, 12 Oct 2023 02:48:59 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 768288D0104; Wed, 11 Oct 2023 22:48:57 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id 6A2128D0002; Wed, 11 Oct 2023 22:48:57 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 4CC1E8D0104; Wed, 11 Oct 2023 22:48:57 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0010.hostedemail.com [216.40.44.10]) by kanga.kvack.org (Postfix) with ESMTP id 360318D0002 for ; Wed, 11 Oct 2023 22:48:57 -0400 (EDT) Received: from smtpin26.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay04.hostedemail.com (Postfix) with ESMTP id 01FF81A0209 for ; Thu, 12 Oct 2023 02:48:55 +0000 (UTC) X-FDA: 81335277072.26.69F8F0C Received: from out30-119.freemail.mail.aliyun.com (out30-119.freemail.mail.aliyun.com [115.124.30.119]) by imf07.hostedemail.com (Postfix) with ESMTP id A989240006 for ; Thu, 12 Oct 2023 02:48:53 +0000 (UTC) Authentication-Results: imf07.hostedemail.com; dkim=none; dmarc=pass (policy=none) header.from=alibaba.com; spf=pass (imf07.hostedemail.com: domain of rongwei.wang@linux.alibaba.com designates 115.124.30.119 as permitted sender) smtp.mailfrom=rongwei.wang@linux.alibaba.com ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1697078934; a=rsa-sha256; cv=none; b=PdYimsg98LpMcUqo6a/s3s39OCBU85Hxj46IbwvqjSTeWQD0J7sUbiKReRV0JEx4OTZWPE 2igXtoK3tA70EncIbbkcvDnttwtDItWpi8OrWdabKYxRtc8y19Qd04DD+xf9bq3Sz+TodB jcdE4+iIaay15daw6uydPye9+Xhyt7c= ARC-Authentication-Results: i=1; imf07.hostedemail.com; dkim=none; dmarc=pass (policy=none) header.from=alibaba.com; spf=pass (imf07.hostedemail.com: domain of rongwei.wang@linux.alibaba.com designates 115.124.30.119 as permitted sender) smtp.mailfrom=rongwei.wang@linux.alibaba.com ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1697078934; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=1ZoA1bxAgGQl99u5HNz8LjmvFZa51SZA+oaCz1PJgWc=; b=m7cHwD3abMnFatvDHD3xyVv8ZAw/J/XreQE24QHpyfXfqmly555CtJ/7W6bWF95zfpByw7 lm3ehWAvmec7t/V3xgUlcLytlMoqY0BpkebHzWYr/kH1ZGQ3iGWhz3FNAoGowkIERtHXgd PyBFiOb0QzeUFLYQA99Icr0vww6MoiQ= X-Alimail-AntiSpam: AC=PASS;BC=-1|-1;BR=01201311R111e4;CH=green;DM=||false|;DS=||;FP=0|-1|-1|-1|0|-1|-1|-1;HT=ay29a033018045168;MF=rongwei.wang@linux.alibaba.com;NM=1;PH=DS;RN=9;SR=0;TI=SMTPD_---0VtykMfs_1697078928; Received: from localhost.localdomain(mailfrom:rongwei.wang@linux.alibaba.com fp:SMTPD_---0VtykMfs_1697078928) by smtp.aliyun-inc.com; Thu, 12 Oct 2023 10:48:48 +0800 From: Rongwei Wang To: linux-arm-kernel@lists.infradead.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org Cc: akpm@linux-foundation.org, willy@infradead.org, catalin.marinas@arm.com, dave.hansen@linux.intel.com, tj@kernel.org, mingo@redhat.com Subject: [PATCH RFC 1/5] mm/numa: move numa emulation APIs into generic files Date: Thu, 12 Oct 2023 10:48:38 +0800 Message-Id: <20231012024842.99703-2-rongwei.wang@linux.alibaba.com> X-Mailer: git-send-email 2.40.0 In-Reply-To: <20231012024842.99703-1-rongwei.wang@linux.alibaba.com> References: <20231012024842.99703-1-rongwei.wang@linux.alibaba.com> MIME-Version: 1.0 X-Rspamd-Server: rspam08 X-Rspamd-Queue-Id: A989240006 X-Stat-Signature: drim7tj4fum7ei3gisxefrfgjaewzi1g X-Rspam-User: X-HE-Tag: 1697078933-508583 X-HE-Meta: U2FsdGVkX1845JltJqy5bkhLjvADGvw9PuwHpzZPYepxuKeB8Q9OlaWG7XbvZfp7Q8ac44qg5bWchAclFNlcbIDhYwwNq3CHplueSxZ8Hx/9V2S+dEOCR7i63dGieCPdh0IIkB/+mO1MhNnABbSOYwvvzJWZZhRbLJaN5LGN4gKGFGEjeu7x/CK6m2k3Qvmvw/FweRl4ZgBiWYQqBfnT7BFvll56A/U+yexSKi88aTs26oZzBbFwVae0Pr44Qd40pxZ1qZrl82LP5xxaTE3ZEe7ab/iNCDL3PO2UM27v/k/Es7v7jXSCwGvpN9Siralm52tkNUyuESikQ4oPxMHrTOxTyZUUijxK86TwZ8Iw2CUIhQDaI9LoCiV5mFQ4J648lpIFICtX0lfuQf960ze3jc7FCQok6C2gk8ixh0IIthC1BGRPPLCFKDvyYukjriQgvfY2sUs+mHApWaUKIqL4D5be0i2vCjb5FBKNQpmgb+LRQuUWHcdXB/L8r1ASM3QF9JAlHvYzIZUnVuglSqMi8xKL86YaG2wMrQ2YOvE3Ue/3ZCVbeQ4CyhVVecNkHpw24L8VrMekxakCCISrIoyRGQCBTE3QeZfLjEEIOr9iO/k7l/Xl40FuFCym7lHVjZiQRrQCYumNhm0hc6ohBOIy7tUk4Pc3mI+ZjDAckcYjKSF/nuunFb10xnvXXDJDQbsdMS565dtPLAHWPlS92XU6yKFZKvZo2m0SpvcIZk8w5kXE5ca4IhvgSvrN0QwbxzssUdKfrge8ykElQSESq4L2iyYglrj5dbzdkok4rgiXW6hS8he6IIUSwrTqrDgM2uw2+fT/GmK81qDT7p79JEYAxSJjSTSvIb/wRogQtx0Hzo99an51+TIEXpEQRHPdri24WuHNN1tGPQk480CkNrFxv/Rk/arYc5GefjSNZsAP+z9bzOPZjfZT6XedSEFEc2QpCe60QrKDtANe1f50AAp heOC0V0k xQyN4MVhZnT9d8RX5KbCl7STshxpsV028U4athZ4wPRU6RH2BysKJGn8nESzLn3RaZyTqG2vmc2f6q+XIP9fTlsWqug== X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: In order to support NUMA EMU for other arch, some functions that used by numa_meminfo should be moved out x86 arch. mm/numa.c created to place above API. CONFIG_NUMA_EMU will be handled later. Signed-off-by: Rongwei Wang --- arch/x86/include/asm/numa.h | 3 - arch/x86/mm/numa.c | 216 +------------------------- arch/x86/mm/numa_internal.h | 14 +- include/asm-generic/numa.h | 18 +++ mm/Makefile | 1 + mm/numa.c | 298 ++++++++++++++++++++++++++++++++++++ 6 files changed, 323 insertions(+), 227 deletions(-) create mode 100644 mm/numa.c diff --git a/arch/x86/include/asm/numa.h b/arch/x86/include/asm/numa.h index e3bae2b60a0d..8d79be8095d5 100644 --- a/arch/x86/include/asm/numa.h +++ b/arch/x86/include/asm/numa.h @@ -9,9 +9,6 @@ #include #ifdef CONFIG_NUMA - -#define NR_NODE_MEMBLKS (MAX_NUMNODES*2) - /* * Too small node sizes may confuse the VM badly. Usually they * result from BIOS bugs. So dont recognize nodes as standalone diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c index 2aadb2019b4f..969b11fff03f 100644 --- a/arch/x86/mm/numa.c +++ b/arch/x86/mm/numa.c @@ -25,8 +25,8 @@ nodemask_t numa_nodes_parsed __initdata; struct pglist_data *node_data[MAX_NUMNODES] __read_mostly; EXPORT_SYMBOL(node_data); -static struct numa_meminfo numa_meminfo __initdata_or_meminfo; -static struct numa_meminfo numa_reserved_meminfo __initdata_or_meminfo; +extern struct numa_meminfo numa_meminfo; +extern struct numa_meminfo numa_reserved_meminfo; static int numa_distance_cnt; static u8 *numa_distance; @@ -148,34 +148,6 @@ static int __init numa_add_memblk_to(int nid, u64 start, u64 end, return 0; } -/** - * numa_remove_memblk_from - Remove one numa_memblk from a numa_meminfo - * @idx: Index of memblk to remove - * @mi: numa_meminfo to remove memblk from - * - * Remove @idx'th numa_memblk from @mi by shifting @mi->blk[] and - * decrementing @mi->nr_blks. - */ -void __init numa_remove_memblk_from(int idx, struct numa_meminfo *mi) -{ - mi->nr_blks--; - memmove(&mi->blk[idx], &mi->blk[idx + 1], - (mi->nr_blks - idx) * sizeof(mi->blk[0])); -} - -/** - * numa_move_tail_memblk - Move a numa_memblk from one numa_meminfo to another - * @dst: numa_meminfo to append block to - * @idx: Index of memblk to remove - * @src: numa_meminfo to remove memblk from - */ -static void __init numa_move_tail_memblk(struct numa_meminfo *dst, int idx, - struct numa_meminfo *src) -{ - dst->blk[dst->nr_blks++] = src->blk[idx]; - numa_remove_memblk_from(idx, src); -} - /** * numa_add_memblk - Add one numa_memblk to numa_meminfo * @nid: NUMA node ID of the new memblk @@ -225,124 +197,6 @@ static void __init alloc_node_data(int nid) node_set_online(nid); } -/** - * numa_cleanup_meminfo - Cleanup a numa_meminfo - * @mi: numa_meminfo to clean up - * - * Sanitize @mi by merging and removing unnecessary memblks. Also check for - * conflicts and clear unused memblks. - * - * RETURNS: - * 0 on success, -errno on failure. - */ -int __init numa_cleanup_meminfo(struct numa_meminfo *mi) -{ - const u64 low = 0; - const u64 high = PFN_PHYS(max_pfn); - int i, j, k; - - /* first, trim all entries */ - for (i = 0; i < mi->nr_blks; i++) { - struct numa_memblk *bi = &mi->blk[i]; - - /* move / save reserved memory ranges */ - if (!memblock_overlaps_region(&memblock.memory, - bi->start, bi->end - bi->start)) { - numa_move_tail_memblk(&numa_reserved_meminfo, i--, mi); - continue; - } - - /* make sure all non-reserved blocks are inside the limits */ - bi->start = max(bi->start, low); - - /* preserve info for non-RAM areas above 'max_pfn': */ - if (bi->end > high) { - numa_add_memblk_to(bi->nid, high, bi->end, - &numa_reserved_meminfo); - bi->end = high; - } - - /* and there's no empty block */ - if (bi->start >= bi->end) - numa_remove_memblk_from(i--, mi); - } - - /* merge neighboring / overlapping entries */ - for (i = 0; i < mi->nr_blks; i++) { - struct numa_memblk *bi = &mi->blk[i]; - - for (j = i + 1; j < mi->nr_blks; j++) { - struct numa_memblk *bj = &mi->blk[j]; - u64 start, end; - - /* - * See whether there are overlapping blocks. Whine - * about but allow overlaps of the same nid. They - * will be merged below. - */ - if (bi->end > bj->start && bi->start < bj->end) { - if (bi->nid != bj->nid) { - pr_err("node %d [mem %#010Lx-%#010Lx] overlaps with node %d [mem %#010Lx-%#010Lx]\n", - bi->nid, bi->start, bi->end - 1, - bj->nid, bj->start, bj->end - 1); - return -EINVAL; - } - pr_warn("Warning: node %d [mem %#010Lx-%#010Lx] overlaps with itself [mem %#010Lx-%#010Lx]\n", - bi->nid, bi->start, bi->end - 1, - bj->start, bj->end - 1); - } - - /* - * Join together blocks on the same node, holes - * between which don't overlap with memory on other - * nodes. - */ - if (bi->nid != bj->nid) - continue; - start = min(bi->start, bj->start); - end = max(bi->end, bj->end); - for (k = 0; k < mi->nr_blks; k++) { - struct numa_memblk *bk = &mi->blk[k]; - - if (bi->nid == bk->nid) - continue; - if (start < bk->end && end > bk->start) - break; - } - if (k < mi->nr_blks) - continue; - printk(KERN_INFO "NUMA: Node %d [mem %#010Lx-%#010Lx] + [mem %#010Lx-%#010Lx] -> [mem %#010Lx-%#010Lx]\n", - bi->nid, bi->start, bi->end - 1, bj->start, - bj->end - 1, start, end - 1); - bi->start = start; - bi->end = end; - numa_remove_memblk_from(j--, mi); - } - } - - /* clear unused ones */ - for (i = mi->nr_blks; i < ARRAY_SIZE(mi->blk); i++) { - mi->blk[i].start = mi->blk[i].end = 0; - mi->blk[i].nid = NUMA_NO_NODE; - } - - return 0; -} - -/* - * Set nodes, which have memory in @mi, in *@nodemask. - */ -static void __init numa_nodemask_from_meminfo(nodemask_t *nodemask, - const struct numa_meminfo *mi) -{ - int i; - - for (i = 0; i < ARRAY_SIZE(mi->blk); i++) - if (mi->blk[i].start != mi->blk[i].end && - mi->blk[i].nid != NUMA_NO_NODE) - node_set(mi->blk[i].nid, *nodemask); -} - /** * numa_reset_distance - Reset NUMA distance table * @@ -478,72 +332,6 @@ static bool __init numa_meminfo_cover_memory(const struct numa_meminfo *mi) return true; } -/* - * Mark all currently memblock-reserved physical memory (which covers the - * kernel's own memory ranges) as hot-unswappable. - */ -static void __init numa_clear_kernel_node_hotplug(void) -{ - nodemask_t reserved_nodemask = NODE_MASK_NONE; - struct memblock_region *mb_region; - int i; - - /* - * We have to do some preprocessing of memblock regions, to - * make them suitable for reservation. - * - * At this time, all memory regions reserved by memblock are - * used by the kernel, but those regions are not split up - * along node boundaries yet, and don't necessarily have their - * node ID set yet either. - * - * So iterate over all memory known to the x86 architecture, - * and use those ranges to set the nid in memblock.reserved. - * This will split up the memblock regions along node - * boundaries and will set the node IDs as well. - */ - for (i = 0; i < numa_meminfo.nr_blks; i++) { - struct numa_memblk *mb = numa_meminfo.blk + i; - int ret; - - ret = memblock_set_node(mb->start, mb->end - mb->start, &memblock.reserved, mb->nid); - WARN_ON_ONCE(ret); - } - - /* - * Now go over all reserved memblock regions, to construct a - * node mask of all kernel reserved memory areas. - * - * [ Note, when booting with mem=nn[kMG] or in a kdump kernel, - * numa_meminfo might not include all memblock.reserved - * memory ranges, because quirks such as trim_snb_memory() - * reserve specific pages for Sandy Bridge graphics. ] - */ - for_each_reserved_mem_region(mb_region) { - int nid = memblock_get_region_node(mb_region); - - if (nid != MAX_NUMNODES) - node_set(nid, reserved_nodemask); - } - - /* - * Finally, clear the MEMBLOCK_HOTPLUG flag for all memory - * belonging to the reserved node mask. - * - * Note that this will include memory regions that reside - * on nodes that contain kernel memory - entire nodes - * become hot-unpluggable: - */ - for (i = 0; i < numa_meminfo.nr_blks; i++) { - struct numa_memblk *mb = numa_meminfo.blk + i; - - if (!node_isset(mb->nid, reserved_nodemask)) - continue; - - memblock_clear_hotplug(mb->start, mb->end - mb->start); - } -} - static int __init numa_register_memblks(struct numa_meminfo *mi) { int i, nid; diff --git a/arch/x86/mm/numa_internal.h b/arch/x86/mm/numa_internal.h index 86860f279662..b6053beb81b1 100644 --- a/arch/x86/mm/numa_internal.h +++ b/arch/x86/mm/numa_internal.h @@ -16,19 +16,13 @@ struct numa_meminfo { struct numa_memblk blk[NR_NODE_MEMBLKS]; }; -void __init numa_remove_memblk_from(int idx, struct numa_meminfo *mi); -int __init numa_cleanup_meminfo(struct numa_meminfo *mi); +extern int __init numa_cleanup_meminfo(struct numa_meminfo *mi); void __init numa_reset_distance(void); void __init x86_numa_init(void); -#ifdef CONFIG_NUMA_EMU -void __init numa_emulation(struct numa_meminfo *numa_meminfo, - int numa_dist_cnt); -#else -static inline void numa_emulation(struct numa_meminfo *numa_meminfo, - int numa_dist_cnt) -{ } -#endif +extern void __init numa_emulation(struct numa_meminfo *numa_meminfo, + int numa_dist_cnt); + #endif /* __X86_MM_NUMA_INTERNAL_H */ diff --git a/include/asm-generic/numa.h b/include/asm-generic/numa.h index 1a3ad6d29833..929d7c582a73 100644 --- a/include/asm-generic/numa.h +++ b/include/asm-generic/numa.h @@ -39,6 +39,24 @@ void numa_store_cpu_info(unsigned int cpu); void numa_add_cpu(unsigned int cpu); void numa_remove_cpu(unsigned int cpu); +struct numa_memblk { + u64 start; + u64 end; + int nid; +}; + +struct numa_meminfo { + int nr_blks; + struct numa_memblk blk[NR_NODE_MEMBLKS]; +}; + +extern struct numa_meminfo numa_meminfo; + +int __init numa_register_memblks(struct numa_meminfo *mi); +int __init numa_cleanup_meminfo(struct numa_meminfo *mi); +void __init numa_emulation(struct numa_meminfo *numa_meminfo, + int numa_dist_cnt); + #else /* CONFIG_NUMA */ static inline void numa_store_cpu_info(unsigned int cpu) { } diff --git a/mm/Makefile b/mm/Makefile index ec65984e2ade..6fc1bd7c9f5b 100644 --- a/mm/Makefile +++ b/mm/Makefile @@ -138,3 +138,4 @@ obj-$(CONFIG_IO_MAPPING) += io-mapping.o obj-$(CONFIG_HAVE_BOOTMEM_INFO_NODE) += bootmem_info.o obj-$(CONFIG_GENERIC_IOREMAP) += ioremap.o obj-$(CONFIG_SHRINKER_DEBUG) += shrinker_debug.o +obj-$(CONFIG_NUMA) += numa.o diff --git a/mm/numa.c b/mm/numa.c new file mode 100644 index 000000000000..88277e8404f0 --- /dev/null +++ b/mm/numa.c @@ -0,0 +1,298 @@ +// SPDX-License-Identifier: GPL-2.0-only +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include + +struct numa_meminfo numa_meminfo __initdata_or_meminfo; +struct numa_meminfo numa_reserved_meminfo __initdata_or_meminfo; + +/* + * Set nodes, which have memory in @mi, in *@nodemask. + */ +void __init numa_nodemask_from_meminfo(nodemask_t *nodemask, + const struct numa_meminfo *mi) +{ + int i; + + for (i = 0; i < ARRAY_SIZE(mi->blk); i++) + if (mi->blk[i].start != mi->blk[i].end && + mi->blk[i].nid != NUMA_NO_NODE) + node_set(mi->blk[i].nid, *nodemask); +} + +/** + * numa_remove_memblk_from - Remove one numa_memblk from a numa_meminfo + * @idx: Index of memblk to remove + * @mi: numa_meminfo to remove memblk from + * + * Remove @idx'th numa_memblk from @mi by shifting @mi->blk[] and + * decrementing @mi->nr_blks. + */ +static void __init numa_remove_memblk_from(int idx, struct numa_meminfo *mi) +{ + mi->nr_blks--; + memmove(&mi->blk[idx], &mi->blk[idx + 1], + (mi->nr_blks - idx) * sizeof(mi->blk[0])); +} + +/** + * numa_move_tail_memblk - Move a numa_memblk from one numa_meminfo to another + * @dst: numa_meminfo to append block to + * @idx: Index of memblk to remove + * @src: numa_meminfo to remove memblk from + */ +static void __init numa_move_tail_memblk(struct numa_meminfo *dst, int idx, + struct numa_meminfo *src) +{ + dst->blk[dst->nr_blks++] = src->blk[idx]; + numa_remove_memblk_from(idx, src); +} + +int __init numa_add_memblk_to(int nid, u64 start, u64 end, + struct numa_meminfo *mi) +{ + /* ignore zero length blks */ + if (start == end) + return 0; + + /* whine about and ignore invalid blks */ + if (start > end || nid < 0 || nid >= MAX_NUMNODES) { + pr_warn("Warning: invalid memblk node %d [mem %#010Lx-%#010Lx]\n", + nid, start, end - 1); + return 0; + } + + if (mi->nr_blks >= NR_NODE_MEMBLKS) { + pr_err("too many memblk ranges\n"); + return -EINVAL; + } + + mi->blk[mi->nr_blks].start = start; + mi->blk[mi->nr_blks].end = end; + mi->blk[mi->nr_blks].nid = nid; + mi->nr_blks++; + return 0; +} + +/** + * numa_cleanup_meminfo - Cleanup a numa_meminfo + * @mi: numa_meminfo to clean up + * + * Sanitize @mi by merging and removing unnecessary memblks. Also check for + * conflicts and clear unused memblks. + * + * RETURNS: + * 0 on success, -errno on failure. + */ +int __init numa_cleanup_meminfo(struct numa_meminfo *mi) +{ + const u64 low = 0; + const u64 high = PFN_PHYS(max_pfn); + int i, j, k; + + /* first, trim all entries */ + for (i = 0; i < mi->nr_blks; i++) { + struct numa_memblk *bi = &mi->blk[i]; + + /* move / save reserved memory ranges */ + if (!memblock_overlaps_region(&memblock.memory, + bi->start, bi->end - bi->start)) { + numa_move_tail_memblk(&numa_reserved_meminfo, i--, mi); + continue; + } + + /* make sure all non-reserved blocks are inside the limits */ + bi->start = max(bi->start, low); + + /* preserve info for non-RAM areas above 'max_pfn': */ + if (bi->end > high) { + numa_add_memblk_to(bi->nid, high, bi->end, + &numa_reserved_meminfo); + bi->end = high; + } + + /* and there's no empty block */ + if (bi->start >= bi->end) + numa_remove_memblk_from(i--, mi); + } + + /* merge neighboring / overlapping entries */ + for (i = 0; i < mi->nr_blks; i++) { + struct numa_memblk *bi = &mi->blk[i]; + + for (j = i + 1; j < mi->nr_blks; j++) { + struct numa_memblk *bj = &mi->blk[j]; + u64 start, end; + + /* + * See whether there are overlapping blocks. Whine + * about but allow overlaps of the same nid. They + * will be merged below. + */ + if (bi->end > bj->start && bi->start < bj->end) { + if (bi->nid != bj->nid) { + pr_err("node %d [mem %#010Lx-%#010Lx] overlaps with node %d [mem %#010Lx-%#010Lx]\n", + bi->nid, bi->start, bi->end - 1, + bj->nid, bj->start, bj->end - 1); + return -EINVAL; + } + pr_warn("Warning: node %d [mem %#010Lx-%#010Lx] overlaps with itself [mem %#010Lx-%#010Lx]\n", + bi->nid, bi->start, bi->end - 1, + bj->start, bj->end - 1); + } + + /* + * Join together blocks on the same node, holes + * between which don't overlap with memory on other + * nodes. + */ + if (bi->nid != bj->nid) + continue; + start = min(bi->start, bj->start); + end = max(bi->end, bj->end); + for (k = 0; k < mi->nr_blks; k++) { + struct numa_memblk *bk = &mi->blk[k]; + + if (bi->nid == bk->nid) + continue; + if (start < bk->end && end > bk->start) + break; + } + if (k < mi->nr_blks) + continue; + printk(KERN_INFO "NUMA: Node %d [mem %#010Lx-%#010Lx] + [mem %#010Lx-%#010Lx] -> [mem %#010Lx-%#010Lx]\n", + bi->nid, bi->start, bi->end - 1, bj->start, + bj->end - 1, start, end - 1); + bi->start = start; + bi->end = end; + numa_remove_memblk_from(j--, mi); + } + } + + /* clear unused ones */ + for (i = mi->nr_blks; i < ARRAY_SIZE(mi->blk); i++) { + mi->blk[i].start = mi->blk[i].end = 0; + mi->blk[i].nid = NUMA_NO_NODE; + } + + return 0; +} + +/* + * Mark all currently memblock-reserved physical memory (which covers the + * kernel's own memory ranges) as hot-unswappable. + */ +static void __init numa_clear_kernel_node_hotplug(void) +{ + nodemask_t reserved_nodemask = NODE_MASK_NONE; + struct memblock_region *mb_region; + int i; + + /* + * We have to do some preprocessing of memblock regions, to + * make them suitable for reservation. + * + * At this time, all memory regions reserved by memblock are + * used by the kernel, but those regions are not split up + * along node boundaries yet, and don't necessarily have their + * node ID set yet either. + * + * So iterate over all memory known to the x86 architecture, + * and use those ranges to set the nid in memblock.reserved. + * This will split up the memblock regions along node + * boundaries and will set the node IDs as well. + */ + for (i = 0; i < numa_meminfo.nr_blks; i++) { + struct numa_memblk *mb = numa_meminfo.blk + i; + int ret; + + ret = memblock_set_node(mb->start, mb->end - mb->start, &memblock.reserved, mb->nid); + WARN_ON_ONCE(ret); + } + + /* + * Now go over all reserved memblock regions, to construct a + * node mask of all kernel reserved memory areas. + * + * [ Note, when booting with mem=nn[kMG] or in a kdump kernel, + * numa_meminfo might not include all memblock.reserved + * memory ranges, because quirks such as trim_snb_memory() + * reserve specific pages for Sandy Bridge graphics. ] + */ + for_each_reserved_mem_region(mb_region) { + int nid = memblock_get_region_node(mb_region); + + if (nid != MAX_NUMNODES) + node_set(nid, reserved_nodemask); + } + + /* + * Finally, clear the MEMBLOCK_HOTPLUG flag for all memory + * belonging to the reserved node mask. + * + * Note that this will include memory regions that reside + * on nodes that contain kernel memory - entire nodes + * become hot-unpluggable: + */ + for (i = 0; i < numa_meminfo.nr_blks; i++) { + struct numa_memblk *mb = numa_meminfo.blk + i; + + if (!node_isset(mb->nid, reserved_nodemask)) + continue; + + memblock_clear_hotplug(mb->start, mb->end - mb->start); + } +} + +int __weak __init numa_register_memblks(struct numa_meminfo *mi) +{ + int i; + + /* Account for nodes with cpus and no memory */ + node_possible_map = numa_nodes_parsed; + numa_nodemask_from_meminfo(&node_possible_map, mi); + if (WARN_ON(nodes_empty(node_possible_map))) + return -EINVAL; + + for (i = 0; i < mi->nr_blks; i++) { + struct numa_memblk *mb = &mi->blk[i]; + memblock_set_node(mb->start, mb->end - mb->start, + &memblock.memory, mb->nid); + } + + /* + * At very early time, the kernel have to use some memory such as + * loading the kernel image. We cannot prevent this anyway. So any + * node the kernel resides in should be un-hotpluggable. + * + * And when we come here, alloc node data won't fail. + */ + numa_clear_kernel_node_hotplug(); + + /* + * If sections array is gonna be used for pfn -> nid mapping, check + * whether its granularity is fine enough. + */ + if (IS_ENABLED(NODE_NOT_IN_PAGE_FLAGS)) { + unsigned long pfn_align = node_map_pfn_alignment(); + + if (pfn_align && pfn_align < PAGES_PER_SECTION) { + pr_warn("Node alignment %LuMB < min %LuMB, rejecting NUMA config\n", + PFN_PHYS(pfn_align) >> 20, + PFN_PHYS(PAGES_PER_SECTION) >> 20); + return -EINVAL; + } + } + + return 0; +} From patchwork Thu Oct 12 02:48:39 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Rongwei Wang X-Patchwork-Id: 13418180 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 46AD2CDB46E for ; Thu, 12 Oct 2023 02:48:57 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 0FD298D00F9; Wed, 11 Oct 2023 22:48:56 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id 0AD088D0002; Wed, 11 Oct 2023 22:48:56 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id E41088D00F9; Wed, 11 Oct 2023 22:48:55 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0017.hostedemail.com [216.40.44.17]) by kanga.kvack.org (Postfix) with ESMTP id D60738D0002 for ; Wed, 11 Oct 2023 22:48:55 -0400 (EDT) Received: from smtpin25.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay08.hostedemail.com (Postfix) with ESMTP id 9B4231403D7 for ; Thu, 12 Oct 2023 02:48:55 +0000 (UTC) X-FDA: 81335277030.25.D4C7FCC Received: from out30-110.freemail.mail.aliyun.com (out30-110.freemail.mail.aliyun.com [115.124.30.110]) by imf22.hostedemail.com (Postfix) with ESMTP id 8DA35C0022 for ; Thu, 12 Oct 2023 02:48:53 +0000 (UTC) Authentication-Results: imf22.hostedemail.com; dkim=none; spf=pass (imf22.hostedemail.com: domain of rongwei.wang@linux.alibaba.com designates 115.124.30.110 as permitted sender) smtp.mailfrom=rongwei.wang@linux.alibaba.com; dmarc=pass (policy=none) header.from=alibaba.com ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1697078934; a=rsa-sha256; cv=none; b=r7TrYIytHeyX7HrQHwE+/87DsDB3vmXdloTnI42VePeLg7T2692pWpNMT/0DvlxtuJ5vcl d2RAVFQgGcfaYXnq8kgAejlKXZf++ESc9oL6r3VR3j2DSCcjpQuRZuiTw1RSMdl9Edmovd HusgQSCjrIjVQ8ueNQpEEGlWchmdsyA= ARC-Authentication-Results: i=1; imf22.hostedemail.com; dkim=none; spf=pass (imf22.hostedemail.com: domain of rongwei.wang@linux.alibaba.com designates 115.124.30.110 as permitted sender) smtp.mailfrom=rongwei.wang@linux.alibaba.com; dmarc=pass (policy=none) header.from=alibaba.com ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1697078934; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=VR7wPq0C0mPYrOmbzB0ZF5iIStSetPqgvjgy4RyVjjM=; b=a/KJosz6hCaMNDOm4I3GKNiDiqcrY0BICiBQwuJIUZCddhC1+Ch8LQCgqZB91acigJjV09 X65qvwVAA0NqLz4wxbV5OPfkdCPpSyl1SJ27XKPpVS0SZE1MQOpOE2lp+5KResGBy21Q5l NM4So/WCcJJzKI7u9m3MjwlW0IMHQLs= X-Alimail-AntiSpam: AC=PASS;BC=-1|-1;BR=01201311R131e4;CH=green;DM=||false|;DS=||;FP=0|-1|-1|-1|0|-1|-1|-1;HT=ay29a033018045192;MF=rongwei.wang@linux.alibaba.com;NM=1;PH=DS;RN=9;SR=0;TI=SMTPD_---0VtykMgL_1697078929; Received: from localhost.localdomain(mailfrom:rongwei.wang@linux.alibaba.com fp:SMTPD_---0VtykMgL_1697078929) by smtp.aliyun-inc.com; Thu, 12 Oct 2023 10:48:49 +0800 From: Rongwei Wang To: linux-arm-kernel@lists.infradead.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org Cc: akpm@linux-foundation.org, willy@infradead.org, catalin.marinas@arm.com, dave.hansen@linux.intel.com, tj@kernel.org, mingo@redhat.com Subject: [PATCH RFC 2/5] mm: percpu: fix variable type of cpu Date: Thu, 12 Oct 2023 10:48:39 +0800 Message-Id: <20231012024842.99703-3-rongwei.wang@linux.alibaba.com> X-Mailer: git-send-email 2.40.0 In-Reply-To: <20231012024842.99703-1-rongwei.wang@linux.alibaba.com> References: <20231012024842.99703-1-rongwei.wang@linux.alibaba.com> MIME-Version: 1.0 X-Rspam-User: X-Rspamd-Server: rspam06 X-Rspamd-Queue-Id: 8DA35C0022 X-Stat-Signature: 5x5xbxtp8uw51mocweutc496gauxjm49 X-HE-Tag: 1697078933-399163 X-HE-Meta: U2FsdGVkX18EBwxgMmclBSLmeFm540THMdWAQnv7w8311tHf6PtZbJ5pGPRubNMi90FN1X8hiFkzWjS0WC6xEI8eiY9ai/23z9Ij/Wc/Hpg/1ErkNesXvvcJxZq2oeYsPDYWh5GMURt8p4eis4yrZX4AYqqLaDXmXYwyvgNtiMdzbYtZiN3wKUwmqM2lnXuXO/UMMPvPSo5IkRGX5k5trZlN/BYFGuFhLxCaR4dgUAvB26WmQemP9O+hOWRFAxxRwtL1N360H/3yK/9MNx7CLPxGHLGOtunSCRZjC5Y9m3F8uaPdFcIhAfDTxEdsdxJ1BdS8XbVRDwJjd5OikENoMmGFNREDTRukDfcZF93xQjiBBP8WVRbQydy7P/gz4I0dW+0Q7IRhMK0hdZX7pPsYLCi8WH6YkqfEr2dmp3/TI9qHUFU8nxrz9MXl50HiynhkuC4KadoXLRCxKW6hAloW1qHd+i4Wm4N7TQ1hftjvknA4Zn58wm8Nc70Uj9hxnPNb9OHOnutWY65IBftCBU4/RSe9XUZmhWfh3ugIBSTz2dWFZltErjO+e53WOtYjKf7CjxbzW6Ozhw4u+S5wMWvXBLRsrf5xTHSThfBz9lazggyG6jZ7UxXamJSsO+y5DjcQz/wSHxf25q7miiJ3otBv6XepURKcEI3pWqdsKzD6SL3iPcf4648GP3J/twWO9OlnhWaM4wdHjcK/ZatQEtqDF6hQDkTzTZ+rbcDTiuqs1rWRWUWi/B6PUs+EKG0qZMgNZFjmdKFHQI39XHC3DBhDPTuZpE9gRmmZ7WxhYVzNLz+BtDowk/EKX6grbr7coTjxaW0yMcjB/xDxkSAvJSa3pAjgHTnTf+pPXdjRXWmqeMkEFhHwe25UF/BV5WjwRUifF/1eQOWfyhPwqZMJ5vDYzhh7Ine3OK7UURGuzmG5xkaGl5sTaHxDEJHfbn7nzq/WRtrX+YSnDjJPrOL9lGo gfyMQj+C dDrcWc4XKpOOYKgKoJlISXbcDs8aYiGUcLGIPYVSB41TJZnH39JObhdA2wYtvcIaL9D4rZSX8GXunp20= X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: Almost all places declare 'cpu' as 'unsigned int' type, but early_cpu_to_nod() not. So correct it in this patch. Signed-off-by: Rongwei Wang --- drivers/base/arch_numa.c | 2 +- include/linux/percpu.h | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/drivers/base/arch_numa.c b/drivers/base/arch_numa.c index eaa31e567d1e..db0bb8b8fd67 100644 --- a/drivers/base/arch_numa.c +++ b/drivers/base/arch_numa.c @@ -144,7 +144,7 @@ void __init early_map_cpu_to_node(unsigned int cpu, int nid) unsigned long __per_cpu_offset[NR_CPUS] __read_mostly; EXPORT_SYMBOL(__per_cpu_offset); -static int __init early_cpu_to_node(int cpu) +static int __init early_cpu_to_node(unsigned int cpu) { return cpu_to_node_map[cpu]; } diff --git a/include/linux/percpu.h b/include/linux/percpu.h index 68fac2e7cbe6..4aee8400af54 100644 --- a/include/linux/percpu.h +++ b/include/linux/percpu.h @@ -100,7 +100,7 @@ extern const char * const pcpu_fc_names[PCPU_FC_NR]; extern enum pcpu_fc pcpu_chosen_fc; -typedef int (pcpu_fc_cpu_to_node_fn_t)(int cpu); +typedef int (pcpu_fc_cpu_to_node_fn_t)(unsigned int cpu); typedef int (pcpu_fc_cpu_distance_fn_t)(unsigned int from, unsigned int to); extern struct pcpu_alloc_info * __init pcpu_alloc_alloc_info(int nr_groups, From patchwork Thu Oct 12 02:48:40 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Rongwei Wang X-Patchwork-Id: 13418182 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 5F93FCDB482 for ; Thu, 12 Oct 2023 02:49:02 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id E22418D0105; Wed, 11 Oct 2023 22:48:57 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id D7F218D0002; Wed, 11 Oct 2023 22:48:57 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id B35458D0105; Wed, 11 Oct 2023 22:48:57 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0012.hostedemail.com [216.40.44.12]) by kanga.kvack.org (Postfix) with ESMTP id A42C18D0002 for ; Wed, 11 Oct 2023 22:48:57 -0400 (EDT) Received: from smtpin22.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay03.hostedemail.com (Postfix) with ESMTP id 69C2FA0360 for ; Thu, 12 Oct 2023 02:48:56 +0000 (UTC) X-FDA: 81335277072.22.E37863E Received: from out30-99.freemail.mail.aliyun.com (out30-99.freemail.mail.aliyun.com [115.124.30.99]) by imf07.hostedemail.com (Postfix) with ESMTP id 7C7EF40003 for ; Thu, 12 Oct 2023 02:48:54 +0000 (UTC) Authentication-Results: imf07.hostedemail.com; dkim=none; dmarc=pass (policy=none) header.from=alibaba.com; spf=pass (imf07.hostedemail.com: domain of rongwei.wang@linux.alibaba.com designates 115.124.30.99 as permitted sender) smtp.mailfrom=rongwei.wang@linux.alibaba.com ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1697078935; a=rsa-sha256; cv=none; b=BJcbBf+jrz6sKSUBEBxCTGSwCzAbwf1xgtaBFf3S3MkivdxqXIArCLVHonreH1ps0fHBhM nDPnzg2kcPrcTqN31+83CA6G7m85+biSP7zMcvWV2Q3N7JNYzp7qc4tQ8lVy67Y3tM8cW+ cuBaU9H0wi1m8rET/NLmr2hdCZLq0/Y= ARC-Authentication-Results: i=1; imf07.hostedemail.com; dkim=none; dmarc=pass (policy=none) header.from=alibaba.com; spf=pass (imf07.hostedemail.com: domain of rongwei.wang@linux.alibaba.com designates 115.124.30.99 as permitted sender) smtp.mailfrom=rongwei.wang@linux.alibaba.com ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1697078935; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=JUVnUsb26mlDDBqIiN1AvBz7w4jH9SeFH++N8Ys38o0=; b=lj/smOD4Xp01prcvHmBvmf3z+ECkc5CLMPu2yKx+xTZiEgxLZe2bh3xkHnL0QwoyVb2dO0 p1oyajUlVprcaBCvQC/+L0zYYwQ8FqoVj+O6+uBy89htIxuEtv2EbkRMrRnmVGil9XxO5B 1hJKggCEC/jnXDIzkykWcdkA10dvVSc= X-Alimail-AntiSpam: AC=PASS;BC=-1|-1;BR=01201311R561e4;CH=green;DM=||false|;DS=||;FP=0|-1|-1|-1|0|-1|-1|-1;HT=ay29a033018046049;MF=rongwei.wang@linux.alibaba.com;NM=1;PH=DS;RN=9;SR=0;TI=SMTPD_---0VtykMgi_1697078930; Received: from localhost.localdomain(mailfrom:rongwei.wang@linux.alibaba.com fp:SMTPD_---0VtykMgi_1697078930) by smtp.aliyun-inc.com; Thu, 12 Oct 2023 10:48:51 +0800 From: Rongwei Wang To: linux-arm-kernel@lists.infradead.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org Cc: akpm@linux-foundation.org, willy@infradead.org, catalin.marinas@arm.com, dave.hansen@linux.intel.com, tj@kernel.org, mingo@redhat.com Subject: [PATCH RFC 3/5] arch_numa: remove __init in early_cpu_to_node() Date: Thu, 12 Oct 2023 10:48:40 +0800 Message-Id: <20231012024842.99703-4-rongwei.wang@linux.alibaba.com> X-Mailer: git-send-email 2.40.0 In-Reply-To: <20231012024842.99703-1-rongwei.wang@linux.alibaba.com> References: <20231012024842.99703-1-rongwei.wang@linux.alibaba.com> MIME-Version: 1.0 X-Rspamd-Server: rspam08 X-Rspamd-Queue-Id: 7C7EF40003 X-Stat-Signature: ctqswy51ytbmyuzu7t3tixh7wbiep3t8 X-Rspam-User: X-HE-Tag: 1697078934-590831 X-HE-Meta: U2FsdGVkX19GuZ3c9ePxkNmgQtG3MarQ74/nlOqEc+mggStUcjo25p+xdw2wkbGh6BPgsdq52GjbEdQ13ea5n1WlFqVLBijz9zpwYo2LYKg5IzR6/Dq31BmzxDcBtprJ3Qvrqbxk84k0l9c5SLd4HvKiktIhWEIFUuyxcUc5Z4eLfeLySkEd668XvDdn6pJldcjy2l19NDZI0WztYCYI0A4zXZmqJDIoj6SkkXRYyarnl2opd014Z06EQ2pORIXeo8smum4uXDHIiWRk/IFgpRmni5qC/QwgWpVbGlBL+TLD5uMpi4W+C0GXCrhkXT7u9cmL7xAbEH33ObthDXefk+wywyCCR3F2XgpaIdvty6QSMiwlCl7xjciNKktDctDYQBpX3Y0+Yo8bGPDBAiHIw/47kwDuvgZ4wHlQHLUyACWq6p/qrlY3Z4p8gzPIyap1qpOGD83eBb4mjpEYtgCLe46R/cMOaIr+eU1XURjDHPltosg0PiUGyYNmsS/y+5m/sNGyQIoKxzMZi+b3OvY6lzuM38xIHsWehawc/QZKug97Ei58NS+qDYd0B5v8AC+NAyf3l9m0KjUfOevHAHxDeA4crrW3gA5NyQbYUHNsjXAUCweYOlBtP8/WCdWE8wrCIf7GDLBvifHmKOfDUfsKWjfHUTb+5/P64aI6xVc6RhmnXpPAdyJ+brAJMwrBH8+iwPhIMRq9rf0N9iA7D9t32N5RdoRjR7CNlOLfLF5ec8FJJnlvdzRUeB9RpGmPCFzNMI6kjCJ9wugjJ3JyEoZ89QD6w1Y5CPvldX1BEkFHEVx6E7ROBRdmZrZuuogp4fPxlvlotGl1a9WK9hm66yHwvuSScyFkyyZ0i7rQ2CdJVnxRsYXrt+puMcs7gahe++QxhlaPMxsUE+2e9Kwk11H5uq+DHd9j+uTaqZM6kdBitgRe5h5OYGwd2xeUTGDi2cPQLqD+hvuEHOdrIQpHG3P EGXN6QaB mJXsW3954EzByGaaE6rwI0VC1+tklK/1k9KSf039CmJHoIF+RAVgdkqgR86iPxYPCDjFVY+PRG0MHmIek1v1T5WZjw5viTw9UYdnUuVA9GeXoc/IHP+YDipf70pguS4rP9U1v X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: Most of arch does not stick '__init' for early_cpu_to_node(). And it's safe to delete this attribute here, ready for later numa emulation. Signed-off-by: Rongwei Wang --- drivers/base/arch_numa.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/base/arch_numa.c b/drivers/base/arch_numa.c index db0bb8b8fd67..5df0ad5cb09d 100644 --- a/drivers/base/arch_numa.c +++ b/drivers/base/arch_numa.c @@ -144,7 +144,7 @@ void __init early_map_cpu_to_node(unsigned int cpu, int nid) unsigned long __per_cpu_offset[NR_CPUS] __read_mostly; EXPORT_SYMBOL(__per_cpu_offset); -static int __init early_cpu_to_node(unsigned int cpu) +int early_cpu_to_node(unsigned int cpu) { return cpu_to_node_map[cpu]; } From patchwork Thu Oct 12 02:48:41 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Rongwei Wang X-Patchwork-Id: 13418183 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id DB540CDB465 for ; Thu, 12 Oct 2023 02:49:04 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 47F938D0002; Wed, 11 Oct 2023 22:48:58 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id 408BA80009; Wed, 11 Oct 2023 22:48:58 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 20F198D0002; Wed, 11 Oct 2023 22:48:58 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0017.hostedemail.com [216.40.44.17]) by kanga.kvack.org (Postfix) with ESMTP id EC2C38D0106 for ; Wed, 11 Oct 2023 22:48:57 -0400 (EDT) Received: from smtpin16.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay02.hostedemail.com (Postfix) with ESMTP id B1E06120305 for ; Thu, 12 Oct 2023 02:48:57 +0000 (UTC) X-FDA: 81335277114.16.2250672 Received: from out30-100.freemail.mail.aliyun.com (out30-100.freemail.mail.aliyun.com [115.124.30.100]) by imf18.hostedemail.com (Postfix) with ESMTP id 59C7A1C0010 for ; Thu, 12 Oct 2023 02:48:54 +0000 (UTC) Authentication-Results: imf18.hostedemail.com; dkim=none; dmarc=pass (policy=none) header.from=alibaba.com; spf=pass (imf18.hostedemail.com: domain of rongwei.wang@linux.alibaba.com designates 115.124.30.100 as permitted sender) smtp.mailfrom=rongwei.wang@linux.alibaba.com ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1697078936; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=RKdpKZmQMPNttAS8mVK68HKJPE9wGh0XKSs1FNlW5ZA=; b=sTBsTqBOcx2F4l5OqBrHnaDovVk/2w5QAnC+OrX07w9ie2KgTstadKPWSa2ri2jCVKTuNz DnnOu+x9DFC1RHQHeNXi/ZgmH0ksO+A9Ygo6pNn9jBNrBKATnn4vFi7a8xEHjei4B9DQTm lxdzjWliDtiFdwOxBYkAHjL/Pjp87eY= ARC-Authentication-Results: i=1; imf18.hostedemail.com; dkim=none; dmarc=pass (policy=none) header.from=alibaba.com; spf=pass (imf18.hostedemail.com: domain of rongwei.wang@linux.alibaba.com designates 115.124.30.100 as permitted sender) smtp.mailfrom=rongwei.wang@linux.alibaba.com ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1697078936; a=rsa-sha256; cv=none; b=kQBb+cY2ZuhjiUYVcUx7TA/KDQtKTo41jzbicnEY8lSje6ebbZKBy27bnxO6oXEUeOpfop lEmn2hs0+AJSsRafIlhG+8iP/wAfjrHj2HGKxK++E+xIvmisxSQ1aLysW2nQxbC339rEmx +AoZClJ0va5a+xrPnRuSIPtv5MILdOU= X-Alimail-AntiSpam: AC=PASS;BC=-1|-1;BR=01201311R111e4;CH=green;DM=||false|;DS=||;FP=0|-1|-1|-1|0|-1|-1|-1;HT=ay29a033018046060;MF=rongwei.wang@linux.alibaba.com;NM=1;PH=DS;RN=9;SR=0;TI=SMTPD_---0VtykMh0_1697078931; Received: from localhost.localdomain(mailfrom:rongwei.wang@linux.alibaba.com fp:SMTPD_---0VtykMh0_1697078931) by smtp.aliyun-inc.com; Thu, 12 Oct 2023 10:48:52 +0800 From: Rongwei Wang To: linux-arm-kernel@lists.infradead.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org Cc: akpm@linux-foundation.org, willy@infradead.org, catalin.marinas@arm.com, dave.hansen@linux.intel.com, tj@kernel.org, mingo@redhat.com Subject: [PATCH RFC 4/5] mm/numa: support CONFIG_NUMA_EMU for arm64 Date: Thu, 12 Oct 2023 10:48:41 +0800 Message-Id: <20231012024842.99703-5-rongwei.wang@linux.alibaba.com> X-Mailer: git-send-email 2.40.0 In-Reply-To: <20231012024842.99703-1-rongwei.wang@linux.alibaba.com> References: <20231012024842.99703-1-rongwei.wang@linux.alibaba.com> MIME-Version: 1.0 X-Rspamd-Queue-Id: 59C7A1C0010 X-Rspam-User: X-Rspamd-Server: rspam05 X-Stat-Signature: dxgnf8ou7frsztprbms7gdnbmfpb5t7b X-HE-Tag: 1697078934-815914 X-HE-Meta: U2FsdGVkX1/MmT8TOj6YG2+2ZBbZcO46XXVQ+jrDSrXkKY2/LPcRqyrdf5RPSlWZydG6lc/LBJE4GdjCrwzOW3OsZ7w3Fqdq0xPP3vNJ/4lE9exCtk3uojJl8NagjQew4o1BhJrijGuXjEfcKaB7R1Fjib4xuCw8Opki65+7IcI+4sECl/MFtOT0mvIXwIY78JxutfeBjnDqoGsHpjrOYJvuVu+ezknjEFXgqk2a1pz+inFSeuTH193LzOgv3TzwNi2K+JWw4RTKql4C2euk5kvW/LfU66sscOC+Vkv+ntJx2Jaq71z6M2UCUFTepGBJJQNi95f9InwbTTBHiRvlhQaLcI7U9lyOWmXzVsAGzT4NEUPQnhzax+TzhRr7Awi2onrAofxgPXPPuuGPS3mzZ9Bi7Pkn5Qv3a+X0THN1m17qFS9iCHPdfsKyXNoI7p2BUhguuvE+c1GlwHh5KaVY/9ZZd3mgyeLK5MkYAP+RID6GUWxXWPL8HhPSXXHZRBaCudrlz4P9/9us5w6RN6VH+p+5Nd1iCfBM8evQyMnxvdHWyltsOKCT9dxfdpsEI2IBcqMqknuAQac09wGnKM84iwEeHUwwpGcMkkEQfPjkIS2vNsAl9Q3G8FV9etFSaFBfyq/rMwBhVV7OkZ2UwposPPjbaDL271JlGGh9ssFvj0rokMkHITwWoro/Wkfs5ans0GSXFCTj0f8Tpby/5DhfApIPOhDSlmCfoNMoIP9UiYBOKb6v3eM2V2pBNbNRrUyPRqgWKxRX4TbmtL5uPV7CDXKEv7fIdIyrrH5q9DQWXLl28g7BT9c+8OTcdxtoqbBnFmR9gzeoavOzpThvc/gtTz9SMubYdPivQ779syDOIhzVNZ8yV8HSYloNMZA4kFU1nYCCDlrruut5ySr0j6dWNCwklb8Zwy0+a2H8EtGZ+BXja9ukfY/jBcm7drGJr/3uYffleACuvtwyD7yXOaX xOPP3m6d I5hmX/vV+yELyrhlcPVW6O2hfIvp4RjkEXulujQ6OA6rBHUMlv7bYbUka5/ZFC4fOHJntH1M2eLAVD52MmTvIsh2JkQc0lcHgoMOwyfeOIdlHKNLaIqmuc7wYPlpkcyJy9hpB X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: The CONFIG_NUMA_EMU migrates from x86/Kconfig to mm/Kconfig. Now x86 and arm64 support it. Signed-off-by: Rongwei Wang --- arch/x86/Kconfig | 8 - arch/x86/mm/Makefile | 1 - arch/x86/mm/numa_emulation.c | 585 ----------------------------------- drivers/base/arch_numa.c | 3 + include/asm-generic/numa.h | 12 + mm/Kconfig | 8 + mm/numa.c | 12 + 7 files changed, 35 insertions(+), 594 deletions(-) delete mode 100644 arch/x86/mm/numa_emulation.c diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index 66bfabae8814..13438bfe2ec1 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -1568,14 +1568,6 @@ config X86_64_ACPI_NUMA help Enable ACPI SRAT based node topology detection. -config NUMA_EMU - bool "NUMA emulation" - depends on NUMA - help - Enable NUMA emulation. A flat machine will be split - into virtual nodes when booted with "numa=fake=N", where N is the - number of nodes. This is only useful for debugging. - config NODES_SHIFT int "Maximum NUMA Nodes (as a power of 2)" if !MAXSMP range 1 10 diff --git a/arch/x86/mm/Makefile b/arch/x86/mm/Makefile index c80febc44cd2..1581f17e5de4 100644 --- a/arch/x86/mm/Makefile +++ b/arch/x86/mm/Makefile @@ -56,7 +56,6 @@ obj-$(CONFIG_MMIOTRACE_TEST) += testmmiotrace.o obj-$(CONFIG_NUMA) += numa.o numa_$(BITS).o obj-$(CONFIG_AMD_NUMA) += amdtopology.o obj-$(CONFIG_ACPI_NUMA) += srat.o -obj-$(CONFIG_NUMA_EMU) += numa_emulation.o obj-$(CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS) += pkeys.o obj-$(CONFIG_RANDOMIZE_MEMORY) += kaslr.o diff --git a/arch/x86/mm/numa_emulation.c b/arch/x86/mm/numa_emulation.c deleted file mode 100644 index 9a9305367fdd..000000000000 --- a/arch/x86/mm/numa_emulation.c +++ /dev/null @@ -1,585 +0,0 @@ -// SPDX-License-Identifier: GPL-2.0 -/* - * NUMA emulation - */ -#include -#include -#include -#include -#include - -#include "numa_internal.h" - -static int emu_nid_to_phys[MAX_NUMNODES]; -static char *emu_cmdline __initdata; - -int __init numa_emu_cmdline(char *str) -{ - emu_cmdline = str; - return 0; -} - -static int __init emu_find_memblk_by_nid(int nid, const struct numa_meminfo *mi) -{ - int i; - - for (i = 0; i < mi->nr_blks; i++) - if (mi->blk[i].nid == nid) - return i; - return -ENOENT; -} - -static u64 __init mem_hole_size(u64 start, u64 end) -{ - unsigned long start_pfn = PFN_UP(start); - unsigned long end_pfn = PFN_DOWN(end); - - if (start_pfn < end_pfn) - return PFN_PHYS(absent_pages_in_range(start_pfn, end_pfn)); - return 0; -} - -/* - * Sets up nid to range from @start to @end. The return value is -errno if - * something went wrong, 0 otherwise. - */ -static int __init emu_setup_memblk(struct numa_meminfo *ei, - struct numa_meminfo *pi, - int nid, int phys_blk, u64 size) -{ - struct numa_memblk *eb = &ei->blk[ei->nr_blks]; - struct numa_memblk *pb = &pi->blk[phys_blk]; - - if (ei->nr_blks >= NR_NODE_MEMBLKS) { - pr_err("NUMA: Too many emulated memblks, failing emulation\n"); - return -EINVAL; - } - - ei->nr_blks++; - eb->start = pb->start; - eb->end = pb->start + size; - eb->nid = nid; - - if (emu_nid_to_phys[nid] == NUMA_NO_NODE) - emu_nid_to_phys[nid] = pb->nid; - - pb->start += size; - if (pb->start >= pb->end) { - WARN_ON_ONCE(pb->start > pb->end); - numa_remove_memblk_from(phys_blk, pi); - } - - printk(KERN_INFO "Faking node %d at [mem %#018Lx-%#018Lx] (%LuMB)\n", - nid, eb->start, eb->end - 1, (eb->end - eb->start) >> 20); - return 0; -} - -/* - * Sets up nr_nodes fake nodes interleaved over physical nodes ranging from addr - * to max_addr. - * - * Returns zero on success or negative on error. - */ -static int __init split_nodes_interleave(struct numa_meminfo *ei, - struct numa_meminfo *pi, - u64 addr, u64 max_addr, int nr_nodes) -{ - nodemask_t physnode_mask = numa_nodes_parsed; - u64 size; - int big; - int nid = 0; - int i, ret; - - if (nr_nodes <= 0) - return -1; - if (nr_nodes > MAX_NUMNODES) { - pr_info("numa=fake=%d too large, reducing to %d\n", - nr_nodes, MAX_NUMNODES); - nr_nodes = MAX_NUMNODES; - } - - /* - * Calculate target node size. x86_32 freaks on __udivdi3() so do - * the division in ulong number of pages and convert back. - */ - size = max_addr - addr - mem_hole_size(addr, max_addr); - size = PFN_PHYS((unsigned long)(size >> PAGE_SHIFT) / nr_nodes); - - /* - * Calculate the number of big nodes that can be allocated as a result - * of consolidating the remainder. - */ - big = ((size & ~FAKE_NODE_MIN_HASH_MASK) * nr_nodes) / - FAKE_NODE_MIN_SIZE; - - size &= FAKE_NODE_MIN_HASH_MASK; - if (!size) { - pr_err("Not enough memory for each node. " - "NUMA emulation disabled.\n"); - return -1; - } - - /* - * Continue to fill physical nodes with fake nodes until there is no - * memory left on any of them. - */ - while (!nodes_empty(physnode_mask)) { - for_each_node_mask(i, physnode_mask) { - u64 dma32_end = PFN_PHYS(MAX_DMA32_PFN); - u64 start, limit, end; - int phys_blk; - - phys_blk = emu_find_memblk_by_nid(i, pi); - if (phys_blk < 0) { - node_clear(i, physnode_mask); - continue; - } - start = pi->blk[phys_blk].start; - limit = pi->blk[phys_blk].end; - end = start + size; - - if (nid < big) - end += FAKE_NODE_MIN_SIZE; - - /* - * Continue to add memory to this fake node if its - * non-reserved memory is less than the per-node size. - */ - while (end - start - mem_hole_size(start, end) < size) { - end += FAKE_NODE_MIN_SIZE; - if (end > limit) { - end = limit; - break; - } - } - - /* - * If there won't be at least FAKE_NODE_MIN_SIZE of - * non-reserved memory in ZONE_DMA32 for the next node, - * this one must extend to the boundary. - */ - if (end < dma32_end && dma32_end - end - - mem_hole_size(end, dma32_end) < FAKE_NODE_MIN_SIZE) - end = dma32_end; - - /* - * If there won't be enough non-reserved memory for the - * next node, this one must extend to the end of the - * physical node. - */ - if (limit - end - mem_hole_size(end, limit) < size) - end = limit; - - ret = emu_setup_memblk(ei, pi, nid++ % nr_nodes, - phys_blk, - min(end, limit) - start); - if (ret < 0) - return ret; - } - } - return 0; -} - -/* - * Returns the end address of a node so that there is at least `size' amount of - * non-reserved memory or `max_addr' is reached. - */ -static u64 __init find_end_of_node(u64 start, u64 max_addr, u64 size) -{ - u64 end = start + size; - - while (end - start - mem_hole_size(start, end) < size) { - end += FAKE_NODE_MIN_SIZE; - if (end > max_addr) { - end = max_addr; - break; - } - } - return end; -} - -static u64 uniform_size(u64 max_addr, u64 base, u64 hole, int nr_nodes) -{ - unsigned long max_pfn = PHYS_PFN(max_addr); - unsigned long base_pfn = PHYS_PFN(base); - unsigned long hole_pfns = PHYS_PFN(hole); - - return PFN_PHYS((max_pfn - base_pfn - hole_pfns) / nr_nodes); -} - -/* - * Sets up fake nodes of `size' interleaved over physical nodes ranging from - * `addr' to `max_addr'. - * - * Returns zero on success or negative on error. - */ -static int __init split_nodes_size_interleave_uniform(struct numa_meminfo *ei, - struct numa_meminfo *pi, - u64 addr, u64 max_addr, u64 size, - int nr_nodes, struct numa_memblk *pblk, - int nid) -{ - nodemask_t physnode_mask = numa_nodes_parsed; - int i, ret, uniform = 0; - u64 min_size; - - if ((!size && !nr_nodes) || (nr_nodes && !pblk)) - return -1; - - /* - * In the 'uniform' case split the passed in physical node by - * nr_nodes, in the non-uniform case, ignore the passed in - * physical block and try to create nodes of at least size - * @size. - * - * In the uniform case, split the nodes strictly by physical - * capacity, i.e. ignore holes. In the non-uniform case account - * for holes and treat @size as a minimum floor. - */ - if (!nr_nodes) - nr_nodes = MAX_NUMNODES; - else { - nodes_clear(physnode_mask); - node_set(pblk->nid, physnode_mask); - uniform = 1; - } - - if (uniform) { - min_size = uniform_size(max_addr, addr, 0, nr_nodes); - size = min_size; - } else { - /* - * The limit on emulated nodes is MAX_NUMNODES, so the - * size per node is increased accordingly if the - * requested size is too small. This creates a uniform - * distribution of node sizes across the entire machine - * (but not necessarily over physical nodes). - */ - min_size = uniform_size(max_addr, addr, - mem_hole_size(addr, max_addr), nr_nodes); - } - min_size = ALIGN(max(min_size, FAKE_NODE_MIN_SIZE), FAKE_NODE_MIN_SIZE); - if (size < min_size) { - pr_err("Fake node size %LuMB too small, increasing to %LuMB\n", - size >> 20, min_size >> 20); - size = min_size; - } - size = ALIGN_DOWN(size, FAKE_NODE_MIN_SIZE); - - /* - * Fill physical nodes with fake nodes of size until there is no memory - * left on any of them. - */ - while (!nodes_empty(physnode_mask)) { - for_each_node_mask(i, physnode_mask) { - u64 dma32_end = PFN_PHYS(MAX_DMA32_PFN); - u64 start, limit, end; - int phys_blk; - - phys_blk = emu_find_memblk_by_nid(i, pi); - if (phys_blk < 0) { - node_clear(i, physnode_mask); - continue; - } - - start = pi->blk[phys_blk].start; - limit = pi->blk[phys_blk].end; - - if (uniform) - end = start + size; - else - end = find_end_of_node(start, limit, size); - /* - * If there won't be at least FAKE_NODE_MIN_SIZE of - * non-reserved memory in ZONE_DMA32 for the next node, - * this one must extend to the boundary. - */ - if (end < dma32_end && dma32_end - end - - mem_hole_size(end, dma32_end) < FAKE_NODE_MIN_SIZE) - end = dma32_end; - - /* - * If there won't be enough non-reserved memory for the - * next node, this one must extend to the end of the - * physical node. - */ - if ((limit - end - mem_hole_size(end, limit) < size) - && !uniform) - end = limit; - - ret = emu_setup_memblk(ei, pi, nid++ % MAX_NUMNODES, - phys_blk, - min(end, limit) - start); - if (ret < 0) - return ret; - } - } - return nid; -} - -static int __init split_nodes_size_interleave(struct numa_meminfo *ei, - struct numa_meminfo *pi, - u64 addr, u64 max_addr, u64 size) -{ - return split_nodes_size_interleave_uniform(ei, pi, addr, max_addr, size, - 0, NULL, 0); -} - -static int __init setup_emu2phys_nid(int *dfl_phys_nid) -{ - int i, max_emu_nid = 0; - - *dfl_phys_nid = NUMA_NO_NODE; - for (i = 0; i < ARRAY_SIZE(emu_nid_to_phys); i++) { - if (emu_nid_to_phys[i] != NUMA_NO_NODE) { - max_emu_nid = i; - if (*dfl_phys_nid == NUMA_NO_NODE) - *dfl_phys_nid = emu_nid_to_phys[i]; - } - } - - return max_emu_nid; -} - -/** - * numa_emulation - Emulate NUMA nodes - * @numa_meminfo: NUMA configuration to massage - * @numa_dist_cnt: The size of the physical NUMA distance table - * - * Emulate NUMA nodes according to the numa=fake kernel parameter. - * @numa_meminfo contains the physical memory configuration and is modified - * to reflect the emulated configuration on success. @numa_dist_cnt is - * used to determine the size of the physical distance table. - * - * On success, the following modifications are made. - * - * - @numa_meminfo is updated to reflect the emulated nodes. - * - * - __apicid_to_node[] is updated such that APIC IDs are mapped to the - * emulated nodes. - * - * - NUMA distance table is rebuilt to represent distances between emulated - * nodes. The distances are determined considering how emulated nodes - * are mapped to physical nodes and match the actual distances. - * - * - emu_nid_to_phys[] reflects how emulated nodes are mapped to physical - * nodes. This is used by numa_add_cpu() and numa_remove_cpu(). - * - * If emulation is not enabled or fails, emu_nid_to_phys[] is filled with - * identity mapping and no other modification is made. - */ -void __init numa_emulation(struct numa_meminfo *numa_meminfo, int numa_dist_cnt) -{ - static struct numa_meminfo ei __initdata; - static struct numa_meminfo pi __initdata; - const u64 max_addr = PFN_PHYS(max_pfn); - u8 *phys_dist = NULL; - size_t phys_size = numa_dist_cnt * numa_dist_cnt * sizeof(phys_dist[0]); - int max_emu_nid, dfl_phys_nid; - int i, j, ret; - - if (!emu_cmdline) - goto no_emu; - - memset(&ei, 0, sizeof(ei)); - pi = *numa_meminfo; - - for (i = 0; i < MAX_NUMNODES; i++) - emu_nid_to_phys[i] = NUMA_NO_NODE; - - /* - * If the numa=fake command-line contains a 'M' or 'G', it represents - * the fixed node size. Otherwise, if it is just a single number N, - * split the system RAM into N fake nodes. - */ - if (strchr(emu_cmdline, 'U')) { - nodemask_t physnode_mask = numa_nodes_parsed; - unsigned long n; - int nid = 0; - - n = simple_strtoul(emu_cmdline, &emu_cmdline, 0); - ret = -1; - for_each_node_mask(i, physnode_mask) { - /* - * The reason we pass in blk[0] is due to - * numa_remove_memblk_from() called by - * emu_setup_memblk() will delete entry 0 - * and then move everything else up in the pi.blk - * array. Therefore we should always be looking - * at blk[0]. - */ - ret = split_nodes_size_interleave_uniform(&ei, &pi, - pi.blk[0].start, pi.blk[0].end, 0, - n, &pi.blk[0], nid); - if (ret < 0) - break; - if (ret < n) { - pr_info("%s: phys: %d only got %d of %ld nodes, failing\n", - __func__, i, ret, n); - ret = -1; - break; - } - nid = ret; - } - } else if (strchr(emu_cmdline, 'M') || strchr(emu_cmdline, 'G')) { - u64 size; - - size = memparse(emu_cmdline, &emu_cmdline); - ret = split_nodes_size_interleave(&ei, &pi, 0, max_addr, size); - } else { - unsigned long n; - - n = simple_strtoul(emu_cmdline, &emu_cmdline, 0); - ret = split_nodes_interleave(&ei, &pi, 0, max_addr, n); - } - if (*emu_cmdline == ':') - emu_cmdline++; - - if (ret < 0) - goto no_emu; - - if (numa_cleanup_meminfo(&ei) < 0) { - pr_warn("NUMA: Warning: constructed meminfo invalid, disabling emulation\n"); - goto no_emu; - } - - /* copy the physical distance table */ - if (numa_dist_cnt) { - u64 phys; - - phys = memblock_phys_alloc_range(phys_size, PAGE_SIZE, 0, - PFN_PHYS(max_pfn_mapped)); - if (!phys) { - pr_warn("NUMA: Warning: can't allocate copy of distance table, disabling emulation\n"); - goto no_emu; - } - phys_dist = __va(phys); - - for (i = 0; i < numa_dist_cnt; i++) - for (j = 0; j < numa_dist_cnt; j++) - phys_dist[i * numa_dist_cnt + j] = - node_distance(i, j); - } - - /* - * Determine the max emulated nid and the default phys nid to use - * for unmapped nodes. - */ - max_emu_nid = setup_emu2phys_nid(&dfl_phys_nid); - - /* commit */ - *numa_meminfo = ei; - - /* Make sure numa_nodes_parsed only contains emulated nodes */ - nodes_clear(numa_nodes_parsed); - for (i = 0; i < ARRAY_SIZE(ei.blk); i++) - if (ei.blk[i].start != ei.blk[i].end && - ei.blk[i].nid != NUMA_NO_NODE) - node_set(ei.blk[i].nid, numa_nodes_parsed); - - /* - * Transform __apicid_to_node table to use emulated nids by - * reverse-mapping phys_nid. The maps should always exist but fall - * back to zero just in case. - */ - for (i = 0; i < ARRAY_SIZE(__apicid_to_node); i++) { - if (__apicid_to_node[i] == NUMA_NO_NODE) - continue; - for (j = 0; j < ARRAY_SIZE(emu_nid_to_phys); j++) - if (__apicid_to_node[i] == emu_nid_to_phys[j]) - break; - __apicid_to_node[i] = j < ARRAY_SIZE(emu_nid_to_phys) ? j : 0; - } - - /* make sure all emulated nodes are mapped to a physical node */ - for (i = 0; i < ARRAY_SIZE(emu_nid_to_phys); i++) - if (emu_nid_to_phys[i] == NUMA_NO_NODE) - emu_nid_to_phys[i] = dfl_phys_nid; - - /* transform distance table */ - numa_reset_distance(); - for (i = 0; i < max_emu_nid + 1; i++) { - for (j = 0; j < max_emu_nid + 1; j++) { - int physi = emu_nid_to_phys[i]; - int physj = emu_nid_to_phys[j]; - int dist; - - if (get_option(&emu_cmdline, &dist) == 2) - ; - else if (physi >= numa_dist_cnt || physj >= numa_dist_cnt) - dist = physi == physj ? - LOCAL_DISTANCE : REMOTE_DISTANCE; - else - dist = phys_dist[physi * numa_dist_cnt + physj]; - - numa_set_distance(i, j, dist); - } - } - - /* free the copied physical distance table */ - memblock_free(phys_dist, phys_size); - return; - -no_emu: - /* No emulation. Build identity emu_nid_to_phys[] for numa_add_cpu() */ - for (i = 0; i < ARRAY_SIZE(emu_nid_to_phys); i++) - emu_nid_to_phys[i] = i; -} - -#ifndef CONFIG_DEBUG_PER_CPU_MAPS -void numa_add_cpu(int cpu) -{ - int physnid, nid; - - nid = early_cpu_to_node(cpu); - BUG_ON(nid == NUMA_NO_NODE || !node_online(nid)); - - physnid = emu_nid_to_phys[nid]; - - /* - * Map the cpu to each emulated node that is allocated on the physical - * node of the cpu's apic id. - */ - for_each_online_node(nid) - if (emu_nid_to_phys[nid] == physnid) - cpumask_set_cpu(cpu, node_to_cpumask_map[nid]); -} - -void numa_remove_cpu(int cpu) -{ - int i; - - for_each_online_node(i) - cpumask_clear_cpu(cpu, node_to_cpumask_map[i]); -} -#else /* !CONFIG_DEBUG_PER_CPU_MAPS */ -static void numa_set_cpumask(int cpu, bool enable) -{ - int nid, physnid; - - nid = early_cpu_to_node(cpu); - if (nid == NUMA_NO_NODE) { - /* early_cpu_to_node() already emits a warning and trace */ - return; - } - - physnid = emu_nid_to_phys[nid]; - - for_each_online_node(nid) { - if (emu_nid_to_phys[nid] != physnid) - continue; - - debug_cpumask_set_cpu(cpu, nid, enable); - } -} - -void numa_add_cpu(int cpu) -{ - numa_set_cpumask(cpu, true); -} - -void numa_remove_cpu(int cpu) -{ - numa_set_cpumask(cpu, false); -} -#endif /* !CONFIG_DEBUG_PER_CPU_MAPS */ diff --git a/drivers/base/arch_numa.c b/drivers/base/arch_numa.c index 5df0ad5cb09d..67bdbcd0caf9 100644 --- a/drivers/base/arch_numa.c +++ b/drivers/base/arch_numa.c @@ -13,6 +13,7 @@ #include #include +#include #include struct pglist_data *node_data[MAX_NUMNODES] __read_mostly; @@ -30,6 +31,8 @@ static __init int numa_parse_early_param(char *opt) return -EINVAL; if (str_has_prefix(opt, "off")) numa_off = true; + if (!strncmp(opt, "fake=", 5)) + return numa_emu_cmdline(opt + 5); return 0; } diff --git a/include/asm-generic/numa.h b/include/asm-generic/numa.h index 929d7c582a73..4658155a070a 100644 --- a/include/asm-generic/numa.h +++ b/include/asm-generic/numa.h @@ -50,12 +50,24 @@ struct numa_meminfo { struct numa_memblk blk[NR_NODE_MEMBLKS]; }; +#ifdef CONFIG_NUMA_EMU +#define FAKE_NODE_MIN_SIZE ((u64)32 << 20) +#define FAKE_NODE_MIN_HASH_MASK (~(FAKE_NODE_MIN_SIZE - 1UL)) + extern struct numa_meminfo numa_meminfo; +extern char *emu_cmdline __initdata; +int numa_emu_cmdline(char *str); int __init numa_register_memblks(struct numa_meminfo *mi); int __init numa_cleanup_meminfo(struct numa_meminfo *mi); void __init numa_emulation(struct numa_meminfo *numa_meminfo, int numa_dist_cnt); +#else +static inline int numa_emu_cmdline(char *str) +{ + return -EINVAL; +} +#endif #else /* CONFIG_NUMA */ diff --git a/mm/Kconfig b/mm/Kconfig index 264a2df5ecf5..22bead675ee6 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -549,6 +549,14 @@ config ARCH_ENABLE_MEMORY_HOTPLUG config ARCH_ENABLE_MEMORY_HOTREMOVE bool +config NUMA_EMU + bool "NUMA emulation (EXPERIMENTAL)" + depends on NUMA && (X86 || ARM64) + help + Enable NUMA emulation. A flat machine will be split + into virtual nodes when booted with "numa=fake=N", where N is the + number of nodes. This is only useful for debugging. + # eventually, we can have this option just 'select SPARSEMEM' menuconfig MEMORY_HOTPLUG bool "Memory hotplug" diff --git a/mm/numa.c b/mm/numa.c index 88277e8404f0..3cc01f06a2a6 100644 --- a/mm/numa.c +++ b/mm/numa.c @@ -16,6 +16,10 @@ struct numa_meminfo numa_meminfo __initdata_or_meminfo; struct numa_meminfo numa_reserved_meminfo __initdata_or_meminfo; +#ifdef CONFIG_NUMA_EMU +char *emu_cmdline __initdata; +#endif + /* * Set nodes, which have memory in @mi, in *@nodemask. */ @@ -296,3 +300,11 @@ int __weak __init numa_register_memblks(struct numa_meminfo *mi) return 0; } + +#ifdef CONFIG_NUMA_EMU +int __init numa_emu_cmdline(char *str) +{ + emu_cmdline = str; + return 0; +} +#endif From patchwork Thu Oct 12 02:48:42 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Rongwei Wang X-Patchwork-Id: 13418184 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 7F3D7CDB482 for ; Thu, 12 Oct 2023 02:49:07 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id C9A538000A; Wed, 11 Oct 2023 22:48:59 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id BF8BA80009; Wed, 11 Oct 2023 22:48:59 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id A266B8000A; Wed, 11 Oct 2023 22:48:59 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0012.hostedemail.com [216.40.44.12]) by kanga.kvack.org (Postfix) with ESMTP id 8A10380009 for ; Wed, 11 Oct 2023 22:48:59 -0400 (EDT) Received: from smtpin10.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay02.hostedemail.com (Postfix) with ESMTP id 5E8A1120305 for ; Thu, 12 Oct 2023 02:48:59 +0000 (UTC) X-FDA: 81335277198.10.B2C9748 Received: from out30-111.freemail.mail.aliyun.com (out30-111.freemail.mail.aliyun.com [115.124.30.111]) by imf22.hostedemail.com (Postfix) with ESMTP id B1D3EC0008 for ; Thu, 12 Oct 2023 02:48:56 +0000 (UTC) Authentication-Results: imf22.hostedemail.com; dkim=none; spf=pass (imf22.hostedemail.com: domain of rongwei.wang@linux.alibaba.com designates 115.124.30.111 as permitted sender) smtp.mailfrom=rongwei.wang@linux.alibaba.com; dmarc=pass (policy=none) header.from=alibaba.com ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1697078937; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=xMjRAibelh3drddaOoQFrts1mWuyeyRghMOM9q87ZDM=; b=1WgP0uuRM/Pjj69JR1Au/qsgBanNdXO1mDUYKJj+aGakNyd2bjPGd9JTDgDHePShb3gAEE wXxO2SMAYaElxZOmBakzC/CjDXGXLsiy0pjEr664Eohxg5hE9mFOazytiKdyN7VAHDraAU 8yvcHav76nrdrf7KZmsHrcf26E71UxE= ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1697078937; a=rsa-sha256; cv=none; b=XHjr9Aja6CROuVIGFJjQGL5DBjNNLm9QBo4A9AosulInvIc5+EPLAM/FgKEXLo4eGJaUZ1 6TqwFzvB2FH/xlgOCm1pSdn2ZbdvKY6Pu6EKntuAdrAWR6rqJJrcDGpiuX3wgoFrR1xUID L0qsbtICnjLAQawYXkLoWjRRLWwRAXo= ARC-Authentication-Results: i=1; imf22.hostedemail.com; dkim=none; spf=pass (imf22.hostedemail.com: domain of rongwei.wang@linux.alibaba.com designates 115.124.30.111 as permitted sender) smtp.mailfrom=rongwei.wang@linux.alibaba.com; dmarc=pass (policy=none) header.from=alibaba.com X-Alimail-AntiSpam: AC=PASS;BC=-1|-1;BR=01201311R171e4;CH=green;DM=||false|;DS=||;FP=0|-1|-1|-1|0|-1|-1|-1;HT=ay29a033018046059;MF=rongwei.wang@linux.alibaba.com;NM=1;PH=DS;RN=9;SR=0;TI=SMTPD_---0VtykMhN_1697078932; Received: from localhost.localdomain(mailfrom:rongwei.wang@linux.alibaba.com fp:SMTPD_---0VtykMhN_1697078932) by smtp.aliyun-inc.com; Thu, 12 Oct 2023 10:48:53 +0800 From: Rongwei Wang To: linux-arm-kernel@lists.infradead.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org Cc: akpm@linux-foundation.org, willy@infradead.org, catalin.marinas@arm.com, dave.hansen@linux.intel.com, tj@kernel.org, mingo@redhat.com Subject: [PATCH RFC 5/5] mm/numa: migrate leftover numa emulation into mm/numa.c Date: Thu, 12 Oct 2023 10:48:42 +0800 Message-Id: <20231012024842.99703-6-rongwei.wang@linux.alibaba.com> X-Mailer: git-send-email 2.40.0 In-Reply-To: <20231012024842.99703-1-rongwei.wang@linux.alibaba.com> References: <20231012024842.99703-1-rongwei.wang@linux.alibaba.com> MIME-Version: 1.0 X-Stat-Signature: uc4gc4gt3mecfy16f7xaxc1pjxe1erzb X-Rspamd-Server: rspam10 X-Rspamd-Queue-Id: B1D3EC0008 X-Rspam-User: X-HE-Tag: 1697078936-151816 X-HE-Meta: U2FsdGVkX1+34e05h2bLjodN2PyyBV7JZgOsNsw25zpeKPG6oMb0BM6B6y22oEZH4j3lueqn6kkUAivK37/o+bTBz3B17OuueV2gsSjhZEHu1fGGjsD8/9e4MzXRgQn3XyFw3Q4wmhiTgsFnyJrv69TlhTRsRCvIlR8oeqcN5y4DOxSqgc9nXlE5PYhescXGB5EBi192RaTnXbkffFfLkYcjfyqijX3iCdKB0/q/46ni4KV+seOsg4IEb2XSc4IjwhN+2LPduYU+wxcOfjQVZ6kNvx873Iw7N0jdvkuEezQp/WTCWwk/y0ef7TxM1+SqzcIGtQVmTqh2NbNOiQO2gU1hZX6uqQrnd3bcnsPFWcc1n4OHJAp3E75ZQYyZXNPT7OGRqrOYuMA9GGm3mBBXES3teJ2a01IFjzQzCrDMKLgitc1Zo/jEyW+u4nSsMqIJvjxo1pCzBi4Byr6bNg4wg4ReBL7Jcgc0QpKWEAPQx/DS2a/0oIJCqMI8eXVWDy3ylf2DnRt5bID3DJJL/3yJ91RX4ePVG4K8DXl18k2W9uYZFArq+skFlm2POMTcd7r/kd0x3dR1m3u6e+Jnb839FW8JUjAY6ys2BD67llZI5bXmlH2FQnwj/i0oesSPO4SOplDK9WfTSBgsgwSjydjZuFTfXNCzo7r5UnZfgeu5KS6xvQZQ1cYWQIhxlxa/Pj6O812A1B2Z8/l/+JTz/ePmM1xWACgrwjB4VDiKS1lvUYxcTX8kdBcVRg9mKQzbO/fLS0nfo+PJx9WrYqnFJcTDlcB9QF1k2pzlN58B7NgJ1W6n7lHcR148wdE8vjfKWRZw/1qBON66OI+s6wwRT06wa5+fMiBQI7GIRnQ2sU2NyUKeYq5k2s8EqofRyNiqmzXZy8dE3fXdtZ7RdGc/7oFX4OvkCT8cMegXVqNbaOGxkug/QmfxQgc9YBI64bWs0NTbgsQyOFSdgeavg/3FKwK rssv9fAw PCkGa82cgk4LEeulo65p4ns5B6CpOMQzXlOqEfi1BLbDy+XCq95N0q1T9f53cyxpeB+zDGrOa6hmFZZ8= X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: Here moving original x86/mm/numa_emulation.c into mm/numa.c. And next to enable it for arm64. Signed-off-by: Rongwei Wang --- drivers/base/arch_numa.c | 2 + include/asm-generic/numa.h | 3 + mm/numa.c | 586 ++++++++++++++++++++++++++++++++++++- 3 files changed, 587 insertions(+), 4 deletions(-) diff --git a/drivers/base/arch_numa.c b/drivers/base/arch_numa.c index 67bdbcd0caf9..c6f5ceadb9e1 100644 --- a/drivers/base/arch_numa.c +++ b/drivers/base/arch_numa.c @@ -64,6 +64,7 @@ EXPORT_SYMBOL(cpumask_of_node); #endif +#ifndef CONFIG_NUMA_EMU static void numa_update_cpu(unsigned int cpu, bool remove) { int nid = cpu_to_node(cpu); @@ -92,6 +93,7 @@ void numa_clear_node(unsigned int cpu) numa_remove_cpu(cpu); set_cpu_numa_node(cpu, NUMA_NO_NODE); } +#endif /* * Allocate node_to_cpumask_map based on number of available nodes diff --git a/include/asm-generic/numa.h b/include/asm-generic/numa.h index 4658155a070a..9969ec7f59a4 100644 --- a/include/asm-generic/numa.h +++ b/include/asm-generic/numa.h @@ -55,6 +55,7 @@ struct numa_meminfo { #define FAKE_NODE_MIN_HASH_MASK (~(FAKE_NODE_MIN_SIZE - 1UL)) extern struct numa_meminfo numa_meminfo; +extern int emu_nid_to_phys[MAX_NUMNODES]; extern char *emu_cmdline __initdata; int numa_emu_cmdline(char *str); @@ -62,6 +63,8 @@ int __init numa_register_memblks(struct numa_meminfo *mi); int __init numa_cleanup_meminfo(struct numa_meminfo *mi); void __init numa_emulation(struct numa_meminfo *numa_meminfo, int numa_dist_cnt); +int __init numa_add_memblk_to(int nid, u64 start, u64 end, + struct numa_meminfo *mi); #else static inline int numa_emu_cmdline(char *str) { diff --git a/mm/numa.c b/mm/numa.c index 3cc01f06a2a6..a6e9652498c9 100644 --- a/mm/numa.c +++ b/mm/numa.c @@ -1,4 +1,5 @@ // SPDX-License-Identifier: GPL-2.0-only +/* Most of this file comes from x86/numa_emulation.c */ #include #include #include @@ -16,10 +17,6 @@ struct numa_meminfo numa_meminfo __initdata_or_meminfo; struct numa_meminfo numa_reserved_meminfo __initdata_or_meminfo; -#ifdef CONFIG_NUMA_EMU -char *emu_cmdline __initdata; -#endif - /* * Set nodes, which have memory in @mi, in *@nodemask. */ @@ -302,9 +299,590 @@ int __weak __init numa_register_memblks(struct numa_meminfo *mi) } #ifdef CONFIG_NUMA_EMU +int emu_nid_to_phys[MAX_NUMNODES]; +char *emu_cmdline __initdata; + int __init numa_emu_cmdline(char *str) { emu_cmdline = str; return 0; } + +static int __init emu_find_memblk_by_nid(int nid, const struct numa_meminfo *mi) +{ + int i; + + for (i = 0; i < mi->nr_blks; i++) + if (mi->blk[i].nid == nid) + return i; + return -ENOENT; +} + +static u64 __init mem_hole_size(u64 start, u64 end) +{ + unsigned long start_pfn = PFN_UP(start); + unsigned long end_pfn = PFN_DOWN(end); + + if (start_pfn < end_pfn) + return PFN_PHYS(absent_pages_in_range(start_pfn, end_pfn)); + return 0; +} + +/* + * Sets up nid to range from @start to @end. The return value is -errno if + * something went wrong, 0 otherwise. + */ +static int __init emu_setup_memblk(struct numa_meminfo *ei, + struct numa_meminfo *pi, + int nid, int phys_blk, u64 size) +{ + struct numa_memblk *eb = &ei->blk[ei->nr_blks]; + struct numa_memblk *pb = &pi->blk[phys_blk]; + + if (ei->nr_blks >= NR_NODE_MEMBLKS) { + pr_err("NUMA: Too many emulated memblks, failing emulation\n"); + return -EINVAL; + } + + ei->nr_blks++; + eb->start = pb->start; + eb->end = pb->start + size; + eb->nid = nid; + + if (emu_nid_to_phys[nid] == NUMA_NO_NODE) + emu_nid_to_phys[nid] = pb->nid; + + pb->start += size; + if (pb->start >= pb->end) { + WARN_ON_ONCE(pb->start > pb->end); + numa_remove_memblk_from(phys_blk, pi); + } + + printk(KERN_INFO "Faking node %d at [mem %#018Lx-%#018Lx] (%LuMB)\n", + nid, eb->start, eb->end - 1, (eb->end - eb->start) >> 20); + return 0; +} + +/* + * Sets up nr_nodes fake nodes interleaved over physical nodes ranging from addr + * to max_addr. + * + * Returns zero on success or negative on error. + */ +static int __init split_nodes_interleave(struct numa_meminfo *ei, + struct numa_meminfo *pi, + u64 addr, u64 max_addr, int nr_nodes) +{ + nodemask_t physnode_mask = numa_nodes_parsed; + u64 size; + int big; + int nid = 0; + int i, ret; + + if (nr_nodes <= 0) + return -1; + if (nr_nodes > MAX_NUMNODES) { + pr_info("numa=fake=%d too large, reducing to %d\n", + nr_nodes, MAX_NUMNODES); + nr_nodes = MAX_NUMNODES; + } + + /* + * Calculate target node size. x86_32 freaks on __udivdi3() so do + * the division in ulong number of pages and convert back. + */ + size = max_addr - addr - mem_hole_size(addr, max_addr); + size = PFN_PHYS((unsigned long)(size >> PAGE_SHIFT) / nr_nodes); + + /* + * Calculate the number of big nodes that can be allocated as a result + * of consolidating the remainder. + */ + big = ((size & ~FAKE_NODE_MIN_HASH_MASK) * nr_nodes) / + FAKE_NODE_MIN_SIZE; + + size &= FAKE_NODE_MIN_HASH_MASK; + if (!size) { + pr_err("Not enough memory for each node. " + "NUMA emulation disabled.\n"); + return -1; + } + + /* + * Continue to fill physical nodes with fake nodes until there is no + * memory left on any of them. + */ + while (!nodes_empty(physnode_mask)) { + for_each_node_mask(i, physnode_mask) { +#ifdef CONFIG_X86 + u64 dma32_end = PFN_PHYS(MAX_DMA32_PFN); +#endif + u64 start, limit, end; + int phys_blk; + + phys_blk = emu_find_memblk_by_nid(i, pi); + if (phys_blk < 0) { + node_clear(i, physnode_mask); + continue; + } + start = pi->blk[phys_blk].start; + limit = pi->blk[phys_blk].end; + end = start + size; + + if (nid < big) + end += FAKE_NODE_MIN_SIZE; + + /* + * Continue to add memory to this fake node if its + * non-reserved memory is less than the per-node size. + */ + while (end - start - mem_hole_size(start, end) < size) { + end += FAKE_NODE_MIN_SIZE; + if (end > limit) { + end = limit; + break; + } + } + +#ifdef CONFIG_X86 + /* + * If there won't be at least FAKE_NODE_MIN_SIZE of + * non-reserved memory in ZONE_DMA32 for the next node, + * this one must extend to the boundary. + */ + if (end < dma32_end && dma32_end - end - + mem_hole_size(end, dma32_end) < FAKE_NODE_MIN_SIZE) + end = dma32_end; +#endif + + /* + * If there won't be enough non-reserved memory for the + * next node, this one must extend to the end of the + * physical node. + */ + if (limit - end - mem_hole_size(end, limit) < size) + end = limit; + + ret = emu_setup_memblk(ei, pi, nid++ % nr_nodes, + phys_blk, + min(end, limit) - start); + if (ret < 0) + return ret; + } + } + return 0; +} + +/* + * Returns the end address of a node so that there is at least `size' amount of + * non-reserved memory or `max_addr' is reached. + */ +static u64 __init find_end_of_node(u64 start, u64 max_addr, u64 size) +{ + u64 end = start + size; + + while (end - start - mem_hole_size(start, end) < size) { + end += FAKE_NODE_MIN_SIZE; + if (end > max_addr) { + end = max_addr; + break; + } + } + return end; +} + +static u64 uniform_size(u64 max_addr, u64 base, u64 hole, int nr_nodes) +{ + unsigned long max_pfn = PHYS_PFN(max_addr); + unsigned long base_pfn = PHYS_PFN(base); + unsigned long hole_pfns = PHYS_PFN(hole); + + return PFN_PHYS((max_pfn - base_pfn - hole_pfns) / nr_nodes); +} + +/* + * Sets up fake nodes of `size' interleaved over physical nodes ranging from + * `addr' to `max_addr'. + * + * Returns zero on success or negative on error. + */ +static int __init split_nodes_size_interleave_uniform(struct numa_meminfo *ei, + struct numa_meminfo *pi, + u64 addr, u64 max_addr, u64 size, + int nr_nodes, struct numa_memblk *pblk, + int nid) +{ + nodemask_t physnode_mask = numa_nodes_parsed; + int i, ret, uniform = 0; + u64 min_size; + + if ((!size && !nr_nodes) || (nr_nodes && !pblk)) + return -1; + + /* + * In the 'uniform' case split the passed in physical node by + * nr_nodes, in the non-uniform case, ignore the passed in + * physical block and try to create nodes of at least size + * @size. + * + * In the uniform case, split the nodes strictly by physical + * capacity, i.e. ignore holes. In the non-uniform case account + * for holes and treat @size as a minimum floor. + */ + if (!nr_nodes) + nr_nodes = MAX_NUMNODES; + else { + nodes_clear(physnode_mask); + node_set(pblk->nid, physnode_mask); + uniform = 1; + } + + if (uniform) { + min_size = uniform_size(max_addr, addr, 0, nr_nodes); + size = min_size; + } else { + /* + * The limit on emulated nodes is MAX_NUMNODES, so the + * size per node is increased accordingly if the + * requested size is too small. This creates a uniform + * distribution of node sizes across the entire machine + * (but not necessarily over physical nodes). + */ + min_size = uniform_size(max_addr, addr, + mem_hole_size(addr, max_addr), nr_nodes); + } + min_size = ALIGN(max(min_size, FAKE_NODE_MIN_SIZE), FAKE_NODE_MIN_SIZE); + if (size < min_size) { + pr_err("Fake node size %LuMB too small, increasing to %LuMB\n", + size >> 20, min_size >> 20); + size = min_size; + } + size = ALIGN_DOWN(size, FAKE_NODE_MIN_SIZE); + + /* + * Fill physical nodes with fake nodes of size until there is no memory + * left on any of them. + */ + while (!nodes_empty(physnode_mask)) { + for_each_node_mask(i, physnode_mask) { +#ifdef CONFIG_X86 + u64 dma32_end = PFN_PHYS(MAX_DMA32_PFN); +#endif + u64 start, limit, end; + int phys_blk; + + phys_blk = emu_find_memblk_by_nid(i, pi); + if (phys_blk < 0) { + node_clear(i, physnode_mask); + continue; + } + + start = pi->blk[phys_blk].start; + limit = pi->blk[phys_blk].end; + + if (uniform) + end = start + size; + else + end = find_end_of_node(start, limit, size); + +#ifdef CONFIG_X86 + /* + * If there won't be at least FAKE_NODE_MIN_SIZE of + * non-reserved memory in ZONE_DMA32 for the next node, + * this one must extend to the boundary. + */ + if (end < dma32_end && dma32_end - end - + mem_hole_size(end, dma32_end) < FAKE_NODE_MIN_SIZE) + end = dma32_end; +#endif + + /* + * If there won't be enough non-reserved memory for the + * next node, this one must extend to the end of the + * physical node. + */ + if ((limit - end - mem_hole_size(end, limit) < size) + && !uniform) + end = limit; + + ret = emu_setup_memblk(ei, pi, nid++ % MAX_NUMNODES, + phys_blk, + min(end, limit) - start); + if (ret < 0) + return ret; + } + } + return nid; +} + +static int __init split_nodes_size_interleave(struct numa_meminfo *ei, + struct numa_meminfo *pi, + u64 addr, u64 max_addr, u64 size) +{ + return split_nodes_size_interleave_uniform(ei, pi, addr, max_addr, size, + 0, NULL, 0); +} + +static int __init setup_emu2phys_nid(int *dfl_phys_nid) +{ + int i, max_emu_nid = 0; + + *dfl_phys_nid = NUMA_NO_NODE; + for (i = 0; i < ARRAY_SIZE(emu_nid_to_phys); i++) { + if (emu_nid_to_phys[i] != NUMA_NO_NODE) { + max_emu_nid = i; + if (*dfl_phys_nid == NUMA_NO_NODE) + *dfl_phys_nid = emu_nid_to_phys[i]; + } + } + + return max_emu_nid; +} + +/** + * numa_emulation - Emulate NUMA nodes + * @numa_meminfo: NUMA configuration to massage + * @numa_dist_cnt: The size of the physical NUMA distance table + * + * Emulate NUMA nodes according to the numa=fake kernel parameter. + * @numa_meminfo contains the physical memory configuration and is modified + * to reflect the emulated configuration on success. @numa_dist_cnt is + * used to determine the size of the physical distance table. + * + * On success, the following modifications are made. + * + * - @numa_meminfo is updated to reflect the emulated nodes. + * + * - __apicid_to_node[] is updated such that APIC IDs are mapped to the + * emulated nodes. + * + * - NUMA distance table is rebuilt to represent distances between emulated + * nodes. The distances are determined considering how emulated nodes + * are mapped to physical nodes and match the actual distances. + * + * - emu_nid_to_phys[] reflects how emulated nodes are mapped to physical + * nodes. This is used by numa_add_cpu() and numa_remove_cpu(). + * + * If emulation is not enabled or fails, emu_nid_to_phys[] is filled with + * identity mapping and no other modification is made. + */ +void __init numa_emulation(struct numa_meminfo *numa_meminfo, int numa_dist_cnt) +{ + static struct numa_meminfo ei __initdata; + static struct numa_meminfo pi __initdata; + const u64 max_addr = PFN_PHYS(max_pfn); + u8 *phys_dist = NULL; + size_t phys_size = numa_dist_cnt * numa_dist_cnt * sizeof(phys_dist[0]); + int max_emu_nid, dfl_phys_nid; + int i, j, ret; + + if (!emu_cmdline) + goto no_emu; + + memset(&ei, 0, sizeof(ei)); + pi = *numa_meminfo; + + for (i = 0; i < MAX_NUMNODES; i++) + emu_nid_to_phys[i] = NUMA_NO_NODE; + + /* + * If the numa=fake command-line contains a 'M' or 'G', it represents + * the fixed node size. Otherwise, if it is just a single number N, + * split the system RAM into N fake nodes. + */ + if (strchr(emu_cmdline, 'U')) { + nodemask_t physnode_mask = numa_nodes_parsed; + unsigned long n; + int nid = 0; + + n = simple_strtoul(emu_cmdline, &emu_cmdline, 0); + ret = -1; + for_each_node_mask(i, physnode_mask) { + /* + * The reason we pass in blk[0] is due to + * numa_remove_memblk_from() called by + * emu_setup_memblk() will delete entry 0 + * and then move everything else up in the pi.blk + * array. Therefore we should always be looking + * at blk[0]. + */ + ret = split_nodes_size_interleave_uniform(&ei, &pi, + pi.blk[0].start, pi.blk[0].end, 0, + n, &pi.blk[0], nid); + if (ret < 0) + break; + if (ret < n) { + pr_info("%s: phys: %d only got %d of %ld nodes, failing\n", + __func__, i, ret, n); + ret = -1; + break; + } + nid = ret; + } + } else if (strchr(emu_cmdline, 'M') || strchr(emu_cmdline, 'G')) { + u64 size; + + size = memparse(emu_cmdline, &emu_cmdline); + ret = split_nodes_size_interleave(&ei, &pi, 0, max_addr, size); + } else { + unsigned long n; + + n = simple_strtoul(emu_cmdline, &emu_cmdline, 0); + ret = split_nodes_interleave(&ei, &pi, 0, max_addr, n); + } + if (*emu_cmdline == ':') + emu_cmdline++; + + if (ret < 0) + goto no_emu; + + if (numa_cleanup_meminfo(&ei) < 0) { + pr_warn("NUMA: Warning: constructed meminfo invalid, disabling emulation\n"); + goto no_emu; + } + + /* copy the physical distance table */ + if (numa_dist_cnt) { + u64 phys; + + phys = memblock_phys_alloc_range(phys_size, PAGE_SIZE, 0, + MEMBLOCK_ALLOC_ACCESSIBLE); + if (!phys) { + pr_warn("NUMA: Warning: can't allocate copy of distance table, disabling emulation\n"); + goto no_emu; + } + phys_dist = __va(phys); + + for (i = 0; i < numa_dist_cnt; i++) + for (j = 0; j < numa_dist_cnt; j++) + phys_dist[i * numa_dist_cnt + j] = + node_distance(i, j); + } + + /* + * Determine the max emulated nid and the default phys nid to use + * for unmapped nodes. + */ + max_emu_nid = setup_emu2phys_nid(&dfl_phys_nid); + + /* commit */ + *numa_meminfo = ei; + + /* Make sure numa_nodes_parsed only contains emulated nodes */ + nodes_clear(numa_nodes_parsed); + for (i = 0; i < ARRAY_SIZE(ei.blk); i++) + if (ei.blk[i].start != ei.blk[i].end && + ei.blk[i].nid != NUMA_NO_NODE) + node_set(ei.blk[i].nid, numa_nodes_parsed); + +#ifdef CONFIG_X86 + /* + * Transform __apicid_to_node table to use emulated nids by + * reverse-mapping phys_nid. The maps should always exist but fall + * back to zero just in case. + */ + for (i = 0; i < ARRAY_SIZE(__apicid_to_node); i++) { + if (__apicid_to_node[i] == NUMA_NO_NODE) + continue; + for (j = 0; j < ARRAY_SIZE(emu_nid_to_phys); j++) + if (__apicid_to_node[i] == emu_nid_to_phys[j]) + break; + __apicid_to_node[i] = j < ARRAY_SIZE(emu_nid_to_phys) ? j : 0; + } +#endif + + /* make sure all emulated nodes are mapped to a physical node */ + for (i = 0; i < ARRAY_SIZE(emu_nid_to_phys); i++) + if (emu_nid_to_phys[i] == NUMA_NO_NODE) + emu_nid_to_phys[i] = dfl_phys_nid; + + /* transform distance table */ + numa_free_distance(); + for (i = 0; i < max_emu_nid + 1; i++) { + for (j = 0; j < max_emu_nid + 1; j++) { + int physi = emu_nid_to_phys[i]; + int physj = emu_nid_to_phys[j]; + int dist; + + if (get_option(&emu_cmdline, &dist) == 2) + ; + else if (physi >= numa_dist_cnt || physj >= numa_dist_cnt) + dist = physi == physj ? + LOCAL_DISTANCE : REMOTE_DISTANCE; + else + dist = phys_dist[physi * numa_dist_cnt + physj]; + + numa_set_distance(i, j, dist); + } + } + + /* free the copied physical distance table */ + memblock_free(phys_dist, phys_size); + return; + +no_emu: + /* No emulation. Build identity emu_nid_to_phys[] for numa_add_cpu() */ + for (i = 0; i < ARRAY_SIZE(emu_nid_to_phys); i++) + emu_nid_to_phys[i] = i; +} + +#ifndef CONFIG_DEBUG_PER_CPU_MAPS +extern int early_cpu_to_node(unsigned int cpu); + +void numa_add_cpu(unsigned int cpu) +{ + int physnid, nid; + + nid = early_cpu_to_node(cpu); + BUG_ON(nid == NUMA_NO_NODE || !node_online(nid)); + + physnid = emu_nid_to_phys[nid]; + + /* + * Map the cpu to each emulated node that is allocated on the physical + * node of the cpu's apic id. + */ + for_each_online_node(nid) + if (emu_nid_to_phys[nid] == physnid) + cpumask_set_cpu(cpu, node_to_cpumask_map[nid]); +} + +void numa_remove_cpu(unsigned int cpu) +{ + int i; + + for_each_online_node(i) + cpumask_clear_cpu(cpu, node_to_cpumask_map[i]); +} +#else /* !CONFIG_DEBUG_PER_CPU_MAPS */ +static void numa_set_cpumask(int cpu, bool enable) +{ + int nid, physnid; + + nid = early_cpu_to_node(cpu); + if (nid == NUMA_NO_NODE) { + /* early_cpu_to_node() already emits a warning and trace */ + return; + } + + physnid = emu_nid_to_phys[nid]; + + for_each_online_node(nid) { + if (emu_nid_to_phys[nid] != physnid) + continue; + + debug_cpumask_set_cpu(cpu, nid, enable); + } +} + +void numa_add_cpu(unsigned int cpu) +{ + numa_set_cpumask(cpu, true); +} + +void numa_remove_cpu(unsigned int cpu) +{ + numa_set_cpumask(cpu, false); +} +#endif /* !CONFIG_DEBUG_PER_CPU_MAPS */ #endif