From patchwork Fri Jul 10 12:09:50 2020
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: "Song Bao Hua (Barry Song)"
X-Patchwork-Id: 11656441
From: Barry Song
Cc: Barry Song, Roman Gushchin, Catalin Marinas, Will Deacon,
 Thomas Gleixner, Ingo Molnar, Borislav Petkov, "H. Peter Anvin",
 Mike Kravetz, Mike Rapoport, Anshuman Khandual, Jonathan Cameron
Subject: [PATCH v3] mm/hugetlb: split hugetlb_cma in nodes with memory
Date: Sat, 11 Jul 2020 00:09:50 +1200
Message-ID: <20200710120950.37716-1-song.bao.hua@hisilicon.com>
X-Mailer: git-send-email 2.21.0.windows.1

Online nodes are not necessarily memory-containing nodes. Splitting
hugetlb_cma across online nodes can lead to a hugetlb_cma size that is
inconsistent with the user setting. For example, on a system with 4 NUMA
nodes of which only one has memory, if users set hugetlb_cma to 4GB, it
will be split into four 1GB pieces. Only the node with memory will get its
1GB of CMA; the other three nodes get nothing. That means the whole system
gets only 1GB of CMA while users asked for 4GB.

Thus, it is more sensible to split hugetlb_cma across nodes with memory.
For the above case, the only node with memory will reserve 4GB of CMA,
which matches the user setting in bootargs. In order to split CMA across
nodes with memory, hugetlb_cma_reserve() should scan over the nodes with
N_MEMORY state rather than N_ONLINE state. That means the function should
be called only after arch code has finished setting the N_MEMORY state of
nodes.

The problem is always there if N_ONLINE != N_MEMORY. It is a general
problem on all platforms, but there are some minor differences among
architectures. For example, on ARM64, all nodes have already got N_ONLINE
state before hugetlb_cma_reserve() is called. So hugetlb gets an
inconsistent CMA size when some online nodes have no memory.
On x86, the problem is hidden because x86 happens to have set N_ONLINE
only on the nodes with memory by the time hugetlb_cma_reserve() is called.

This patch makes hugetlb_cma_reserve() scan N_MEMORY and lets both x86
and ARM64 call the function after the N_MEMORY state is ready. It also
documents the requirement in the definition of hugetlb_cma_reserve().

Cc: Roman Gushchin
Cc: Catalin Marinas
Cc: Will Deacon
Cc: Thomas Gleixner
Cc: Ingo Molnar
Cc: Borislav Petkov
Cc: H. Peter Anvin
Cc: Mike Kravetz
Cc: Mike Rapoport
Cc: Andrew Morton
Cc: Anshuman Khandual
Cc: Jonathan Cameron
Signed-off-by: Barry Song
Signed-off-by: Mike Kravetz
---
 v3: another try to refine changelog with respect to Anshuman Khandual's
 comments

 arch/arm64/mm/init.c    | 19 ++++++++++---------
 arch/x86/kernel/setup.c | 12 +++++++++---
 mm/hugetlb.c            | 11 +++++++++--
 3 files changed, 28 insertions(+), 14 deletions(-)

diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
index 1e93cfc7c47a..420f5e55615c 100644
--- a/arch/arm64/mm/init.c
+++ b/arch/arm64/mm/init.c
@@ -420,15 +420,6 @@ void __init bootmem_init(void)

 	arm64_numa_init();

-	/*
-	 * must be done after arm64_numa_init() which calls numa_init() to
-	 * initialize node_online_map that gets used in hugetlb_cma_reserve()
-	 * while allocating required CMA size across online nodes.
-	 */
-#ifdef CONFIG_ARM64_4K_PAGES
-	hugetlb_cma_reserve(PUD_SHIFT - PAGE_SHIFT);
-#endif
-
 	/*
 	 * Sparsemem tries to allocate bootmem in memory_present(), so must be
 	 * done after the fixed reservations.
@@ -438,6 +429,16 @@ void __init bootmem_init(void)
 	sparse_init();
 	zone_sizes_init(min, max);

+	/*
+	 * must be done after zone_sizes_init() which calls free_area_init()
+	 * that calls node_set_state() to initialize node_states[N_MEMORY]
+	 * because hugetlb_cma_reserve() will scan over nodes with N_MEMORY
+	 * state
+	 */
+#ifdef CONFIG_ARM64_4K_PAGES
+	hugetlb_cma_reserve(PUD_SHIFT - PAGE_SHIFT);
+#endif
+
 	memblock_dump_all();
 }

diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index a3767e74c758..a1a9712090ae 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -1164,9 +1164,6 @@ void __init setup_arch(char **cmdline_p)
 	initmem_init();
 	dma_contiguous_reserve(max_pfn_mapped << PAGE_SHIFT);

-	if (boot_cpu_has(X86_FEATURE_GBPAGES))
-		hugetlb_cma_reserve(PUD_SHIFT - PAGE_SHIFT);
-
 	/*
 	 * Reserve memory for crash kernel after SRAT is parsed so that it
 	 * won't consume hotpluggable memory.
@@ -1180,6 +1177,15 @@ void __init setup_arch(char **cmdline_p)

 	x86_init.paging.pagetable_init();

+	/*
+	 * must be done after zone_sizes_init() which calls free_area_init()
+	 * that calls node_set_state() to initialize node_states[N_MEMORY]
+	 * because hugetlb_cma_reserve() will scan over nodes with N_MEMORY
+	 * state
+	 */
+	if (boot_cpu_has(X86_FEATURE_GBPAGES))
+		hugetlb_cma_reserve(PUD_SHIFT - PAGE_SHIFT);
+
 	kasan_init();

 	/*
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index bc3304af40d0..32b5035ffec1 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -5665,6 +5665,13 @@ static int __init cmdline_parse_hugetlb_cma(char *p)

 early_param("hugetlb_cma", cmdline_parse_hugetlb_cma);

+/*
+ * hugetlb_cma_reserve() - reserve CMA for gigantic pages on nodes with memory
+ *
+ * must be called after free_area_init() that updates N_MEMORY via node_set_state().
+ * hugetlb_cma_reserve() scans over N_MEMORY nodemask and hence expects the platforms
+ * to have initialized N_MEMORY state.
+ */
 void __init hugetlb_cma_reserve(int order)
 {
 	unsigned long size, reserved, per_node;
@@ -5685,12 +5692,12 @@ void __init hugetlb_cma_reserve(int order)
 	 * If 3 GB area is requested on a machine with 4 numa nodes,
 	 * let's allocate 1 GB on first three nodes and ignore the last one.
 	 */
-	per_node = DIV_ROUND_UP(hugetlb_cma_size, nr_online_nodes);
+	per_node = DIV_ROUND_UP(hugetlb_cma_size, num_node_state(N_MEMORY));
 	pr_info("hugetlb_cma: reserve %lu MiB, up to %lu MiB per node\n",
 		hugetlb_cma_size / SZ_1M, per_node / SZ_1M);

 	reserved = 0;
-	for_each_node_state(nid, N_ONLINE) {
+	for_each_node_state(nid, N_MEMORY) {
 		int res;

 		size = min(per_node, hugetlb_cma_size - reserved);
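
To check the arithmetic from the changelog outside the kernel, the split
can be modelled in a few lines of user-space C. This is only a sketch and
not kernel code: struct model_node, model_reserve() and the rule that a
reservation on a node without memory simply fails and is skipped are
assumptions made for illustration. Alignment of the per-node size to the
gigantic page size is left out.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define GIB (UINT64_C(1) << 30)

/* Hypothetical stand-in for one NUMA node; not a kernel structure. */
struct model_node {
	bool online;     /* node would be set in node_states[N_ONLINE] */
	bool has_memory; /* node would be set in node_states[N_MEMORY] */
};

static uint64_t div_round_up(uint64_t n, uint64_t d)
{
	return (n + d - 1) / d;
}

/* A node is scanned if it is in the chosen mask (N_MEMORY or N_ONLINE). */
static bool node_selected(const struct model_node *node, bool scan_memory)
{
	return scan_memory ? node->has_memory : node->online;
}

/*
 * Model of the per-node split: divide the request by the number of nodes
 * in the scanned mask, then try to reserve that share on each of them.
 * Returns the total number of bytes the model manages to reserve.
 */
static uint64_t model_reserve(const struct model_node *nodes, size_t nr_nodes,
			      uint64_t request, bool scan_memory)
{
	uint64_t per_node, size, reserved = 0;
	size_t i, candidates = 0;

	for (i = 0; i < nr_nodes; i++)
		if (node_selected(&nodes[i], scan_memory))
			candidates++;
	if (candidates == 0)
		return 0;

	per_node = div_round_up(request, candidates);

	for (i = 0; i < nr_nodes; i++) {
		if (!node_selected(&nodes[i], scan_memory))
			continue;

		size = per_node < request - reserved ?
		       per_node : request - reserved;

		/* Assumption: reserving on a memoryless node fails. */
		if (!nodes[i].has_memory)
			continue;

		reserved += size;
		if (reserved >= request)
			break;
	}

	return reserved;
}
```

With four online nodes of which only the first has memory and a request
of 4 * GIB, model_reserve() returns 1 * GIB when scan_memory is false
(the old N_ONLINE scan) and 4 * GIB when scan_memory is true (the
N_MEMORY scan introduced by this patch).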