From patchwork Fri Jul 10 03:50:14 2020
X-Patchwork-Submitter: "Song Bao Hua (Barry Song)"
X-Patchwork-Id: 11655599
From: Barry Song
Subject: [PATCH v2] mm/hugetlb: split hugetlb_cma in nodes with memory
Date: Fri, 10 Jul 2020 15:50:14 +1200
Message-ID: <20200710035014.25244-1-song.bao.hua@hisilicon.com>
Rather than splitting hugetlb_cma across online nodes, it is better to
split it across nodes with memory.

For example, take an ARM64 server with four NUMA nodes of which only
node0 has memory. If I set hugetlb_cma=4G in bootargs, without this
patch I get the below printk:

hugetlb_cma: reserve 4096 MiB, up to 1024 MiB per node
hugetlb_cma: reserved 1024 MiB on node 0
hugetlb_cma: reservation failed: err -12, node 1
hugetlb_cma: reservation failed: err -12, node 2
hugetlb_cma: reservation failed: err -12, node 3

With this patch, I get the below printk:

hugetlb_cma: reserve 4096 MiB, up to 4096 MiB per node
hugetlb_cma: reserved 4096 MiB on node 0

So this patch makes the hugetlb_cma size consistent with the user's
setting on ARM64 platforms.

Jonathan Cameron tested this patch on an x86 platform and figured out
that the boot code of x86 is quite different from arm64's. On arm64,
all nodes are marked online at the same time. On x86, only nodes with
memory are initially marked as online:

initmem_init()->x86_numa_init()->numa_init()->
numa_register_memblks()->alloc_node_data()->node_set_online()

So at the time of the existing cma setup call, only the
memory-containing nodes are online; the other nodes are brought up much
later. Therefore, on x86 platforms, the hugetlb_cma size is actually
consistent with the user's setting even though the system has nodes
without memory.

The problem exists whenever N_ONLINE != N_MEMORY; in the x86 case it is
just hidden because N_ONLINE happens to match N_MEMORY at the point in
the boot process when hugetlb_cma_reserve() gets called. This patch
documents this dependency in the comment of hugetlb_cma_reserve() and
makes the hugetlb_cma size optimal.

Cc: Roman Gushchin
Cc: Catalin Marinas
Cc: Will Deacon
Cc: Thomas Gleixner
Cc: Ingo Molnar
Cc: Borislav Petkov
Cc: H. Peter Anvin
Cc: Mike Kravetz
Cc: Mike Rapoport
Cc: Andrew Morton
Cc: Anshuman Khandual
Cc: Jonathan Cameron
Signed-off-by: Barry Song
---
-v2: document better according to Anshuman Khandual's comments

 arch/arm64/mm/init.c    | 19 ++++++++++---------
 arch/x86/kernel/setup.c | 12 +++++++++---
 include/linux/hugetlb.h |  7 +++++++
 mm/hugetlb.c            |  4 ++--
 4 files changed, 28 insertions(+), 14 deletions(-)
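[Editorial note: to make the failure mode above easier to see in
isolation, here is a minimal standalone userspace sketch; it is not
kernel code — DIV_ROUND_UP and the node counts are re-declared locally
for the demo — contrasting the old split over N_ONLINE with the new
split over N_MEMORY for the 4-node, single-memory-node machine
described above.]

	#include <stdio.h>

	/* Local stand-in for the kernel's DIV_ROUND_UP(); this is a
	 * plain userspace demo, not kernel code. */
	#define DIV_ROUND_UP(n, d)	(((n) + (d) - 1) / (d))

	#define SZ_1M	(1024ULL * 1024ULL)
	#define SZ_1G	(1024ULL * SZ_1M)

	int main(void)
	{
		unsigned long long hugetlb_cma_size = 4 * SZ_1G; /* hugetlb_cma=4G */
		int nr_online_nodes = 4;  /* arm64 marks all four nodes online */
		int nr_memory_nodes = 1;  /* but only node0 actually has memory */

		/*
		 * Old behaviour: split over N_ONLINE, i.e. 1 GiB per node;
		 * the reservations on memoryless nodes 1..3 then fail with
		 * -ENOMEM (the "err -12" lines above) and 3 GiB of the
		 * requested area is lost.
		 */
		unsigned long long old_per_node =
			DIV_ROUND_UP(hugetlb_cma_size, nr_online_nodes);

		/*
		 * New behaviour: split over N_MEMORY, so the whole 4 GiB
		 * lands on node0 and the user gets exactly what was asked.
		 */
		unsigned long long new_per_node =
			DIV_ROUND_UP(hugetlb_cma_size, nr_memory_nodes);

		printf("old: up to %llu MiB per node\n", old_per_node / SZ_1M);
		printf("new: up to %llu MiB per node\n", new_per_node / SZ_1M);
		return 0;
	}

Compiled with any C compiler, this prints 1024 MiB for the old split
and 4096 MiB for the new one, matching the two printk traces quoted in
the commit message.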
diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
index 1e93cfc7c47a..420f5e55615c 100644
--- a/arch/arm64/mm/init.c
+++ b/arch/arm64/mm/init.c
@@ -420,15 +420,6 @@ void __init bootmem_init(void)
 	arm64_numa_init();
 
-	/*
-	 * must be done after arm64_numa_init() which calls numa_init() to
-	 * initialize node_online_map that gets used in hugetlb_cma_reserve()
-	 * while allocating required CMA size across online nodes.
-	 */
-#ifdef CONFIG_ARM64_4K_PAGES
-	hugetlb_cma_reserve(PUD_SHIFT - PAGE_SHIFT);
-#endif
-
 	/*
 	 * Sparsemem tries to allocate bootmem in memory_present(), so must be
 	 * done after the fixed reservations.
@@ -438,6 +429,16 @@ void __init bootmem_init(void)
 	sparse_init();
 	zone_sizes_init(min, max);
 
+	/*
+	 * must be done after zone_sizes_init() which calls free_area_init()
+	 * that calls node_set_state() to initialize node_states[N_MEMORY]
+	 * because hugetlb_cma_reserve() will scan over nodes with N_MEMORY
+	 * state
+	 */
+#ifdef CONFIG_ARM64_4K_PAGES
+	hugetlb_cma_reserve(PUD_SHIFT - PAGE_SHIFT);
+#endif
+
 	memblock_dump_all();
 }
 
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index a3767e74c758..a1a9712090ae 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -1164,9 +1164,6 @@ void __init setup_arch(char **cmdline_p)
 	initmem_init();
 	dma_contiguous_reserve(max_pfn_mapped << PAGE_SHIFT);
 
-	if (boot_cpu_has(X86_FEATURE_GBPAGES))
-		hugetlb_cma_reserve(PUD_SHIFT - PAGE_SHIFT);
-
 	/*
 	 * Reserve memory for crash kernel after SRAT is parsed so that it
 	 * won't consume hotpluggable memory.
@@ -1180,6 +1177,15 @@ void __init setup_arch(char **cmdline_p)
 
 	x86_init.paging.pagetable_init();
 
+	/*
+	 * must be done after zone_sizes_init() which calls free_area_init()
+	 * that calls node_set_state() to initialize node_states[N_MEMORY]
+	 * because hugetlb_cma_reserve() will scan over nodes with N_MEMORY
+	 * state
+	 */
+	if (boot_cpu_has(X86_FEATURE_GBPAGES))
+		hugetlb_cma_reserve(PUD_SHIFT - PAGE_SHIFT);
+
 	kasan_init();
 
 	/*
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 50650d0d01b9..6df411d91040 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -909,6 +909,13 @@ static inline spinlock_t *huge_pte_lock(struct hstate *h,
 }
 
 #if defined(CONFIG_HUGETLB_PAGE) && defined(CONFIG_CMA)
+/**
+ * hugetlb_cma_reserve() -- reserve CMA for gigantic pages on nodes with memory
+ *
+ * must be called after free_area_init() that updates N_MEMORY via node_set_state().
+ * hugetlb_cma_reserve() scans over N_MEMORY nodemask and hence expects the platforms
+ * to have initialized N_MEMORY state.
+ */
 extern void __init hugetlb_cma_reserve(int order);
 extern void __init hugetlb_cma_check(void);
 #else
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index bc3304af40d0..f2071f2d8c1f 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -5685,12 +5685,12 @@ void __init hugetlb_cma_reserve(int order)
 	 * If 3 GB area is requested on a machine with 4 numa nodes,
 	 * let's allocate 1 GB on first three nodes and ignore the last one.
 	 */
-	per_node = DIV_ROUND_UP(hugetlb_cma_size, nr_online_nodes);
+	per_node = DIV_ROUND_UP(hugetlb_cma_size, num_node_state(N_MEMORY));
 	pr_info("hugetlb_cma: reserve %lu MiB, up to %lu MiB per node\n",
 		hugetlb_cma_size / SZ_1M, per_node / SZ_1M);
 
 	reserved = 0;
-	for_each_node_state(nid, N_ONLINE) {
+	for_each_node_state(nid, N_MEMORY) {
 		int res;
 
 		size = min(per_node, hugetlb_cma_size - reserved);
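[Editorial note: the ordering dependency that the new comments in
bootmem_init() and setup_arch() describe can be sketched as a toy
userspace simulation. Plain C bitmasks stand in for the kernel's
node_states[] nodemasks; the stage names mirror the real call sites,
but nothing below is kernel API.]

	#include <stdio.h>

	/* Toy stand-ins for the kernel's node state bitmaps; in the
	 * kernel these are nodemask_t entries in node_states[],
	 * updated via node_set_state(). */
	static unsigned int n_online;  /* one bit per node in N_ONLINE */
	static unsigned int n_memory;  /* one bit per node in N_MEMORY */

	static int count_bits(unsigned int mask)
	{
		int n = 0;

		for (; mask; mask &= mask - 1)
			n++;
		return n;
	}

	static void report(const char *stage)
	{
		printf("%-28s N_ONLINE=%d N_MEMORY=%d\n",
		       stage, count_bits(n_online), count_bits(n_memory));
	}

	int main(void)
	{
		/* arm64: arm64_numa_init() marks all four nodes online
		 * up front, including the three memoryless ones. */
		n_online = 0xf;
		report("after arm64_numa_init():");

		/* Calling hugetlb_cma_reserve() here would see an empty
		 * N_MEMORY mask and split the area over the wrong count. */

		/* zone_sizes_init() -> free_area_init() sets N_MEMORY,
		 * but only for node0, the one node that has memory. */
		n_memory = 0x1;
		report("after free_area_init():");

		/* Only now does dividing the requested size by the
		 * N_MEMORY node count (== 1 here) give the right answer. */
		return 0;
	}

The point is simply that the N_MEMORY node count only becomes a
meaningful divisor once free_area_init() has run, which is why both
call sites move hugetlb_cma_reserve() after zone_sizes_init().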