Message ID | 20200710120950.37716-1-song.bao.hua@hisilicon.com (mailing list archive)
---|---
State | New, archived
Series | [v3] mm/hugetlb: split hugetlb_cma in nodes with memory
On 7/10/20 5:09 AM, Barry Song wrote: > Online nodes are not necessarily memory containing nodes. Splitting > huge_cma in online nodes can lead to inconsistent hugetlb_cma size > with user setting. For example, for one system with 4 numa nodes and > only one of them has memory, if users set hugetlb_cma to 4GB, it will > split into four 1GB. So only the node with memory will get 1GB CMA. > All other three nodes get nothing. That means the whole system gets > only 1GB CMA while users ask for 4GB. > > Thus, it is more sensible to split hugetlb_cma in nodes with memory. > For the above case, the only node with memory will reserve 4GB cma > which is same with user setting in bootargs. In order to split cma > in nodes with memory, hugetlb_cma_reserve() should scan over those > nodes with N_MEMORY state rather than N_ONLINE state. That means > the function should be called only after arch code has finished > setting the N_MEMORY state of nodes. > > The problem is always there if N_ONLINE != N_MEMORY. It is a general > problem to all platforms. But there is some trivial difference among > different architectures. > For example, for ARM64, before hugetlb_cma_reserve() is called, all > nodes have got N_ONLINE state. So hugetlb will get inconsistent cma > size when some online nodes have no memory. For x86 case, the problem > is hidden because X86 happens to just set N_ONLINE on the nodes with > memory when hugetlb_cma_reserve() is called. > > Anyway, this patch moves to scan N_MEMORY in hugetlb_cma_reserve() > and lets both x86 and ARM64 call the function after N_MEMORY state > is ready. It also documents the requirement in the definition of > hugetlb_cma_reserve(). > > Cc: Roman Gushchin <guro@fb.com> > Cc: Catalin Marinas <catalin.marinas@arm.com> > Cc: Will Deacon <will@kernel.org> > Cc: Thomas Gleixner <tglx@linutronix.de> > Cc: Ingo Molnar <mingo@redhat.com> > Cc: Borislav Petkov <bp@alien8.de> > Cc: H. Peter Anvin <hpa@zytor.com> > Cc: Mike Kravetz <mike.kravetz@oracle.com> > Cc: Mike Rapoport <rppt@linux.ibm.com> > Cc: Andrew Morton <akpm@linux-foundation.org> > Cc: Anshuman Khandual <anshuman.khandual@arm.com> > Cc: Jonathan Cameron <jonathan.cameron@huawei.com> > Signed-off-by: Barry Song <song.bao.hua@hisilicon.com> I agree we should only be concerned with N_MEMORY nodes for the CMA reservations. However, this patch got me thinking: - Do we really have to initiate the CMA reservations from arch specific code? - Can we move the call to reserve CMA a little later into hugetlb arch independent code? I know the cma_declare_contiguous_nid() routine says it should be called from arch specific code. However, unless I am missing something that seems mostly about timing. What about a change like this on top of this patch? From 72b5b9a623f8711ad7f79f1a8f910906245f5d07 Mon Sep 17 00:00:00 2001 From: Mike Kravetz <mike.kravetz@oracle.com> Date: Tue, 14 Jul 2020 15:54:46 -0700 Subject: [PATCH] hugetlb: move cma allocation call to arch independent code Instead of calling hugetlb_cma_reserve() from arch specific code, call from arch independent code when a gigantic page hstate is created. This is late enough in the init process that all numa memory information should be initialized. And, it is early enough to still use early memory allocator. 
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> --- arch/arm64/mm/init.c | 10 ---------- arch/x86/kernel/setup.c | 9 --------- mm/hugetlb.c | 8 +++++++- 3 files changed, 7 insertions(+), 20 deletions(-) diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c index 79806732f4b4..ff0ff584dde9 100644 --- a/arch/arm64/mm/init.c +++ b/arch/arm64/mm/init.c @@ -427,16 +427,6 @@ void __init bootmem_init(void) sparse_init(); zone_sizes_init(min, max); - /* - * must be done after zone_sizes_init() which calls free_area_init() - * that calls node_set_state() to initialize node_states[N_MEMORY] - * because hugetlb_cma_reserve() will scan over nodes with N_MEMORY - * state - */ -#ifdef CONFIG_ARM64_4K_PAGES - hugetlb_cma_reserve(PUD_SHIFT - PAGE_SHIFT); -#endif - memblock_dump_all(); } diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c index a1a9712090ae..111c8467fafa 100644 --- a/arch/x86/kernel/setup.c +++ b/arch/x86/kernel/setup.c @@ -1177,15 +1177,6 @@ void __init setup_arch(char **cmdline_p) x86_init.paging.pagetable_init(); - /* - * must be done after zone_sizes_init() which calls free_area_init() - * that calls node_set_state() to initialize node_states[N_MEMORY] - * because hugetlb_cma_reserve() will scan over nodes with N_MEMORY - * state - */ - if (boot_cpu_has(X86_FEATURE_GBPAGES)) - hugetlb_cma_reserve(PUD_SHIFT - PAGE_SHIFT); - kasan_init(); /* diff --git a/mm/hugetlb.c b/mm/hugetlb.c index f24acb3af741..a0007d1d12d2 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -3273,6 +3273,9 @@ void __init hugetlb_add_hstate(unsigned int order) snprintf(h->name, HSTATE_NAME_LEN, "hugepages-%lukB", huge_page_size(h)/1024); + if (order >= MAX_ORDER && hugetlb_cma_size) + hugetlb_cma_reserve(order); + parsed_hstate = h; } @@ -5647,7 +5650,10 @@ void __init hugetlb_cma_reserve(int order) unsigned long size, reserved, per_node; int nid; - cma_reserve_called = true; + if (cma_reserve_called) + return; + else + cma_reserve_called = true; if (!hugetlb_cma_size) return;
Hi Mike, On Tue, Jul 14, 2020 at 04:21:01PM -0700, Mike Kravetz wrote: > I agree we should only be concerned with N_MEMORY nodes for the CMA > reservations. However, this patch got me thinking: > - Do we really have to initiate the CMA reservations from arch specific code? > - Can we move the call to reserve CMA a little later into hugetlb arch > independent code? > > I know the cma_declare_contiguous_nid() routine says it should be called > from arch specific code. However, unless I am missing something that seems > mostly about timing. > > What about a change like this on top of this patch? > > From 72b5b9a623f8711ad7f79f1a8f910906245f5d07 Mon Sep 17 00:00:00 2001 > From: Mike Kravetz <mike.kravetz@oracle.com> > Date: Tue, 14 Jul 2020 15:54:46 -0700 > Subject: [PATCH] hugetlb: move cma allocation call to arch independent code > > Instead of calling hugetlb_cma_reserve() from arch specific code, > call from arch independent code when a gigantic page hstate is > created. This is late enough in the init process that all numa > memory information should be initialized. And, it is early enough > to still use early memory allocator. > > Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> > --- > arch/arm64/mm/init.c | 10 ---------- > arch/x86/kernel/setup.c | 9 --------- > mm/hugetlb.c | 8 +++++++- > 3 files changed, 7 insertions(+), 20 deletions(-) > > diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c > index 79806732f4b4..ff0ff584dde9 100644 > --- a/arch/arm64/mm/init.c > +++ b/arch/arm64/mm/init.c > @@ -427,16 +427,6 @@ void __init bootmem_init(void) > sparse_init(); > zone_sizes_init(min, max); > > - /* > - * must be done after zone_sizes_init() which calls free_area_init() > - * that calls node_set_state() to initialize node_states[N_MEMORY] > - * because hugetlb_cma_reserve() will scan over nodes with N_MEMORY > - * state > - */ > -#ifdef CONFIG_ARM64_4K_PAGES > - hugetlb_cma_reserve(PUD_SHIFT - PAGE_SHIFT); > -#endif > - > memblock_dump_all(); > } > > diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c > index a1a9712090ae..111c8467fafa 100644 > --- a/arch/x86/kernel/setup.c > +++ b/arch/x86/kernel/setup.c > @@ -1177,15 +1177,6 @@ void __init setup_arch(char **cmdline_p) > > x86_init.paging.pagetable_init(); > > - /* > - * must be done after zone_sizes_init() which calls free_area_init() > - * that calls node_set_state() to initialize node_states[N_MEMORY] > - * because hugetlb_cma_reserve() will scan over nodes with N_MEMORY > - * state > - */ > - if (boot_cpu_has(X86_FEATURE_GBPAGES)) > - hugetlb_cma_reserve(PUD_SHIFT - PAGE_SHIFT); > - > kasan_init(); > > /* > diff --git a/mm/hugetlb.c b/mm/hugetlb.c > index f24acb3af741..a0007d1d12d2 100644 > --- a/mm/hugetlb.c > +++ b/mm/hugetlb.c > @@ -3273,6 +3273,9 @@ void __init hugetlb_add_hstate(unsigned int order) > snprintf(h->name, HSTATE_NAME_LEN, "hugepages-%lukB", > huge_page_size(h)/1024); (nit: you can also make hugetlb_cma_reserve() static and remote its function prototypes from hugetlb.h) > + if (order >= MAX_ORDER && hugetlb_cma_size) > + hugetlb_cma_reserve(order); Although I really like the idea of moving this out of the arch code, I don't quite follow the check against MAX_ORDER here -- it looks like a bit of a hack to try to intercept the "PUD_SHIFT - PAGE_SHIFT" order which we currently pass to hugetlb_cma_reserve(). 
Maybe we could instead have something like: #ifndef HUGETLB_CMA_ORDER #define HUGETLB_CMA_ORDER (PUD_SHIFT - PAGE_SHIFT) #endif and then just do: if (order == HUGETLB_CMA_ORDER) hugetlb_cma_reserve(order); ? Is there something else I'm missing? > + > parsed_hstate = h; > } > > @@ -5647,7 +5650,10 @@ void __init hugetlb_cma_reserve(int order) > unsigned long size, reserved, per_node; > int nid; > > - cma_reserve_called = true; > + if (cma_reserve_called) > + return; > + else > + cma_reserve_called = true; (nit: don't need the 'else' here) Will
> -----Original Message----- > From: Mike Kravetz [mailto:mike.kravetz@oracle.com] > Sent: Wednesday, July 15, 2020 11:21 AM > To: Song Bao Hua (Barry Song) <song.bao.hua@hisilicon.com>; > akpm@linux-foundation.org > Cc: x86@kernel.org; linux-mm@kvack.org; linux-kernel@vger.kernel.org; > Linuxarm <linuxarm@huawei.com>; linux-arm-kernel@lists.infradead.org; > Roman Gushchin <guro@fb.com>; Catalin Marinas > <catalin.marinas@arm.com>; Will Deacon <will@kernel.org>; Thomas Gleixner > <tglx@linutronix.de>; Ingo Molnar <mingo@redhat.com>; Borislav Petkov > <bp@alien8.de>; H.Peter Anvin <hpa@zytor.com>; Mike Rapoport > <rppt@linux.ibm.com>; Anshuman Khandual > <anshuman.khandual@arm.com>; Jonathan Cameron > <jonathan.cameron@huawei.com> > Subject: Re: [PATCH v3] mm/hugetlb: split hugetlb_cma in nodes with memory > > On 7/10/20 5:09 AM, Barry Song wrote: > > Online nodes are not necessarily memory containing nodes. Splitting > > huge_cma in online nodes can lead to inconsistent hugetlb_cma size > > with user setting. For example, for one system with 4 numa nodes and > > only one of them has memory, if users set hugetlb_cma to 4GB, it will > > split into four 1GB. So only the node with memory will get 1GB CMA. > > All other three nodes get nothing. That means the whole system gets > > only 1GB CMA while users ask for 4GB. > > > > Thus, it is more sensible to split hugetlb_cma in nodes with memory. > > For the above case, the only node with memory will reserve 4GB cma > > which is same with user setting in bootargs. In order to split cma > > in nodes with memory, hugetlb_cma_reserve() should scan over those > > nodes with N_MEMORY state rather than N_ONLINE state. That means > > the function should be called only after arch code has finished > > setting the N_MEMORY state of nodes. > > > > The problem is always there if N_ONLINE != N_MEMORY. It is a general > > problem to all platforms. But there is some trivial difference among > > different architectures. > > For example, for ARM64, before hugetlb_cma_reserve() is called, all > > nodes have got N_ONLINE state. So hugetlb will get inconsistent cma > > size when some online nodes have no memory. For x86 case, the problem > > is hidden because X86 happens to just set N_ONLINE on the nodes with > > memory when hugetlb_cma_reserve() is called. > > > > Anyway, this patch moves to scan N_MEMORY in hugetlb_cma_reserve() > > and lets both x86 and ARM64 call the function after N_MEMORY state > > is ready. It also documents the requirement in the definition of > > hugetlb_cma_reserve(). > > > > Cc: Roman Gushchin <guro@fb.com> > > Cc: Catalin Marinas <catalin.marinas@arm.com> > > Cc: Will Deacon <will@kernel.org> > > Cc: Thomas Gleixner <tglx@linutronix.de> > > Cc: Ingo Molnar <mingo@redhat.com> > > Cc: Borislav Petkov <bp@alien8.de> > > Cc: H. Peter Anvin <hpa@zytor.com> > > Cc: Mike Kravetz <mike.kravetz@oracle.com> > > Cc: Mike Rapoport <rppt@linux.ibm.com> > > Cc: Andrew Morton <akpm@linux-foundation.org> > > Cc: Anshuman Khandual <anshuman.khandual@arm.com> > > Cc: Jonathan Cameron <jonathan.cameron@huawei.com> > > Signed-off-by: Barry Song <song.bao.hua@hisilicon.com> > > I agree we should only be concerned with N_MEMORY nodes for the CMA > reservations. However, this patch got me thinking: > - Do we really have to initiate the CMA reservations from arch specific code? > - Can we move the call to reserve CMA a little later into hugetlb arch > independent code? 
> > I know the cma_declare_contiguous_nid() routine says it should be called > from arch specific code. However, unless I am missing something that seems > mostly about timing. > > What about a change like this on top of this patch? > > From 72b5b9a623f8711ad7f79f1a8f910906245f5d07 Mon Sep 17 00:00:00 > 2001 > From: Mike Kravetz <mike.kravetz@oracle.com> > Date: Tue, 14 Jul 2020 15:54:46 -0700 > Subject: [PATCH] hugetlb: move cma allocation call to arch independent code > > Instead of calling hugetlb_cma_reserve() from arch specific code, > call from arch independent code when a gigantic page hstate is > created. This is late enough in the init process that all numa > memory information should be initialized. And, it is early enough > to still use early memory allocator. > > Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> > --- > arch/arm64/mm/init.c | 10 ---------- > arch/x86/kernel/setup.c | 9 --------- > mm/hugetlb.c | 8 +++++++- > 3 files changed, 7 insertions(+), 20 deletions(-) > > diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c > index 79806732f4b4..ff0ff584dde9 100644 > --- a/arch/arm64/mm/init.c > +++ b/arch/arm64/mm/init.c > @@ -427,16 +427,6 @@ void __init bootmem_init(void) > sparse_init(); > zone_sizes_init(min, max); > > - /* > - * must be done after zone_sizes_init() which calls free_area_init() > - * that calls node_set_state() to initialize node_states[N_MEMORY] > - * because hugetlb_cma_reserve() will scan over nodes with N_MEMORY > - * state > - */ > -#ifdef CONFIG_ARM64_4K_PAGES > - hugetlb_cma_reserve(PUD_SHIFT - PAGE_SHIFT); > -#endif > - > memblock_dump_all(); > } > > diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c > index a1a9712090ae..111c8467fafa 100644 > --- a/arch/x86/kernel/setup.c > +++ b/arch/x86/kernel/setup.c > @@ -1177,15 +1177,6 @@ void __init setup_arch(char **cmdline_p) > > x86_init.paging.pagetable_init(); > > - /* > - * must be done after zone_sizes_init() which calls free_area_init() > - * that calls node_set_state() to initialize node_states[N_MEMORY] > - * because hugetlb_cma_reserve() will scan over nodes with N_MEMORY > - * state > - */ > - if (boot_cpu_has(X86_FEATURE_GBPAGES)) > - hugetlb_cma_reserve(PUD_SHIFT - PAGE_SHIFT); > - > kasan_init(); > > /* > diff --git a/mm/hugetlb.c b/mm/hugetlb.c > index f24acb3af741..a0007d1d12d2 100644 > --- a/mm/hugetlb.c > +++ b/mm/hugetlb.c > @@ -3273,6 +3273,9 @@ void __init hugetlb_add_hstate(unsigned int order) > snprintf(h->name, HSTATE_NAME_LEN, "hugepages-%lukB", > huge_page_size(h)/1024); > > + if (order >= MAX_ORDER && hugetlb_cma_size) > + hugetlb_cma_reserve(order); Hello, Mike, I am not sure if it is necessarily correct as the order for gigantic pages is arch-dependent: https://lkml.org/lkml/2020/7/1/14 > + > parsed_hstate = h; > } > > @@ -5647,7 +5650,10 @@ void __init hugetlb_cma_reserve(int order) > unsigned long size, reserved, per_node; > int nid; > > - cma_reserve_called = true; > + if (cma_reserve_called) > + return; > + else > + cma_reserve_called = true; > > if (!hugetlb_cma_size) > return; Thanks Barry
On 7/15/20 1:18 AM, Will Deacon wrote: >> diff --git a/mm/hugetlb.c b/mm/hugetlb.c >> index f24acb3af741..a0007d1d12d2 100644 >> --- a/mm/hugetlb.c >> +++ b/mm/hugetlb.c >> @@ -3273,6 +3273,9 @@ void __init hugetlb_add_hstate(unsigned int order) >> snprintf(h->name, HSTATE_NAME_LEN, "hugepages-%lukB", >> huge_page_size(h)/1024); > > (nit: you can also make hugetlb_cma_reserve() static and remote its function > prototypes from hugetlb.h) Yes thanks. I threw this together pretty quickly. > >> + if (order >= MAX_ORDER && hugetlb_cma_size) >> + hugetlb_cma_reserve(order); > > Although I really like the idea of moving this out of the arch code, I don't > quite follow the check against MAX_ORDER here -- it looks like a bit of a > hack to try to intercept the "PUD_SHIFT - PAGE_SHIFT" order which we > currently pass to hugetlb_cma_reserve(). Maybe we could instead have > something like: > > #ifndef HUGETLB_CMA_ORDER > #define HUGETLB_CMA_ORDER (PUD_SHIFT - PAGE_SHIFT) > #endif > > and then just do: > > if (order == HUGETLB_CMA_ORDER) > hugetlb_cma_reserve(order); > > ? Is there something else I'm missing? > Well, the current hugetlb CMA code only kicks in for gigantic pages as defined by the hugetlb code. For example, the code to allocate a page from CMA is in the routine alloc_gigantic_page(). alloc_gigantic_page() is called from alloc_fresh_huge_page() which starts with: if (hstate_is_gigantic(h)) page = alloc_gigantic_page(h, gfp_mask, nid, nmask); else page = alloc_buddy_huge_page(h, gfp_mask, nid, nmask, node_alloc_noretry); and, hstate_is_gigantic is, static inline bool hstate_is_gigantic(struct hstate *h) { return huge_page_order(h) >= MAX_ORDER; } So, everything in the existing code really depends on the hugetlb definition of gigantic page (order >= MAX_ORDER). The code to check for 'order >= MAX_ORDER' in my proposed patch is just following the same convention. I think the current dependency on the hugetlb definition of gigantic page may be too simplistic if using CMA for hugetlb pages becomes more common. Some architectures (sparc, powerpc) have more than one gigantic page size. Currently there is no way to specify that CMA should be used for one and not the other. In addition, I could imagine someone wanting to reserve/use CMA for non-gigantic (PMD) sized pages. There is no mechanism for that today. I honestly have not heard about many use cases for this CMA functionality. When support was initially added, it was driven by a specific use case and the 'all gigantic pages use CMA if defined' implementation was deemed sufficient. If there are more use cases, or this seems too simple, we can revisit that decision. >> + >> parsed_hstate = h; >> } >> >> @@ -5647,7 +5650,10 @@ void __init hugetlb_cma_reserve(int order) >> unsigned long size, reserved, per_node; >> int nid; >> >> - cma_reserve_called = true; >> + if (cma_reserve_called) >> + return; >> + else >> + cma_reserve_called = true; > > (nit: don't need the 'else' here) Yes, duh!
On 7/15/20 4:14 AM, Song Bao Hua (Barry Song) wrote: >> From: Mike Kravetz [mailto:mike.kravetz@oracle.com] >> huge_page_size(h)/1024); >> >> + if (order >= MAX_ORDER && hugetlb_cma_size) >> + hugetlb_cma_reserve(order); > > Hello, Mike, > I am not sure if it is necessarily correct as the order for gigantic pages is arch-dependent: > https://lkml.org/lkml/2020/7/1/14 > See my reply to Will. The code to allocate gigantic pages from CMA depends on the arch independent definition of gigantic page, which is 'order >= MAX_ORDER'.
On Wed, Jul 15, 2020 at 09:59:24AM -0700, Mike Kravetz wrote: > On 7/15/20 1:18 AM, Will Deacon wrote: > >> diff --git a/mm/hugetlb.c b/mm/hugetlb.c > >> index f24acb3af741..a0007d1d12d2 100644 > >> --- a/mm/hugetlb.c > >> +++ b/mm/hugetlb.c > >> @@ -3273,6 +3273,9 @@ void __init hugetlb_add_hstate(unsigned int order) > >> snprintf(h->name, HSTATE_NAME_LEN, "hugepages-%lukB", > >> huge_page_size(h)/1024); > > > > (nit: you can also make hugetlb_cma_reserve() static and remote its function > > prototypes from hugetlb.h) > > Yes thanks. I threw this together pretty quickly. > > > > >> + if (order >= MAX_ORDER && hugetlb_cma_size) > >> + hugetlb_cma_reserve(order); > > > > Although I really like the idea of moving this out of the arch code, I don't > > quite follow the check against MAX_ORDER here -- it looks like a bit of a > > hack to try to intercept the "PUD_SHIFT - PAGE_SHIFT" order which we > > currently pass to hugetlb_cma_reserve(). Maybe we could instead have > > something like: > > > > #ifndef HUGETLB_CMA_ORDER > > #define HUGETLB_CMA_ORDER (PUD_SHIFT - PAGE_SHIFT) > > #endif > > > > and then just do: > > > > if (order == HUGETLB_CMA_ORDER) > > hugetlb_cma_reserve(order); > > > > ? Is there something else I'm missing? > > > > Well, the current hugetlb CMA code only kicks in for gigantic pages as > defined by the hugetlb code. For example, the code to allocate a page > from CMA is in the routine alloc_gigantic_page(). alloc_gigantic_page() > is called from alloc_fresh_huge_page() which starts with: > > if (hstate_is_gigantic(h)) > page = alloc_gigantic_page(h, gfp_mask, nid, nmask); > else > page = alloc_buddy_huge_page(h, gfp_mask, > nid, nmask, node_alloc_noretry); > > and, hstate_is_gigantic is, > > static inline bool hstate_is_gigantic(struct hstate *h) > { > return huge_page_order(h) >= MAX_ORDER; > } > > So, everything in the existing code really depends on the hugetlb definition > of gigantic page (order >= MAX_ORDER). The code to check for > 'order >= MAX_ORDER' in my proposed patch is just following the same > convention. Fair enough, and thanks for the explanation. Maybe just chuck a comment in, then? Alternatively, having something like: static inline bool page_order_is_gigantic(unsigned int order) { return order >= MAX_ORDER; } static inline bool hstate_is_gigantic(struct hstate *h) { return page_order_is_gigantic(huge_page_order(h)); } and then using page_order_is_gigantic() to predicate the call to hugetlb_cma_reserve? Dunno, maybe it's overkill. Up to you. > I think the current dependency on the hugetlb definition of gigantic page > may be too simplistic if using CMA for huegtlb pages becomes more common. > Some architectures (sparc, powerpc) have more than one gigantic pages size. > Currently there is no way to specify that CMA should be used for one and > not the other. In addition, I could imagine someone wanting to reserve/use > CMA for non-gigantic (PMD) sized pages. There is no mechainsm for that today. > > I honestly have not heard about many use cases for this CMA functionality. > When support was initially added, it was driven by a specific use case and > the 'all gigantic pages use CMA if defined' implementation was deemed > sufficient. If there are more use cases, or this seems too simple we can > revisit that decision. Agreed, I think your patch is an improvement regardless of that. Will
On 07/16/2020 11:55 PM, Mike Kravetz wrote: >>From 17c8f37afbf42fe7412e6eebb3619c6e0b7e1c3c Mon Sep 17 00:00:00 2001 > From: Mike Kravetz <mike.kravetz@oracle.com> > Date: Tue, 14 Jul 2020 15:54:46 -0700 > Subject: [PATCH] hugetlb: move cma reservation to code setting up gigantic > hstate > > Instead of calling hugetlb_cma_reserve() directly from arch specific > code, call from hugetlb_add_hstate when adding a gigantic hstate. > hugetlb_add_hstate is either called from arch specific huge page setup, > or as the result of hugetlb command line processing. In either case, > this is late enough in the init process that all numa memory information > should be initialized. And, it is early enough to still use early > memory allocator. This assumes that hugetlb_add_hstate() is called from the arch code at the right point in time for the generic HugeTLB to do the required CMA reservation, which is not ideal. I guess there must have been a reason why the CMA reservation should always be called by the platform code, which knows the boot sequence timing better.
On Fri, Jul 17, 2020 at 10:32:53AM +0530, Anshuman Khandual wrote: > > > On 07/16/2020 11:55 PM, Mike Kravetz wrote: > >>From 17c8f37afbf42fe7412e6eebb3619c6e0b7e1c3c Mon Sep 17 00:00:00 2001 > > From: Mike Kravetz <mike.kravetz@oracle.com> > > Date: Tue, 14 Jul 2020 15:54:46 -0700 > > Subject: [PATCH] hugetlb: move cma reservation to code setting up gigantic > > hstate > > > > Instead of calling hugetlb_cma_reserve() directly from arch specific > > code, call from hugetlb_add_hstate when adding a gigantic hstate. > > hugetlb_add_hstate is either called from arch specific huge page setup, > > or as the result of hugetlb command line processing. In either case, > > this is late enough in the init process that all numa memory information > > should be initialized. And, it is early enough to still use early > > memory allocator. > > This assumes that hugetlb_add_hstate() is called from the arch code at > the right point in time for the generic HugeTLB to do the required CMA > reservation which is not ideal. I guess it must have been a reason why > CMA reservation should always called by the platform code which knows > the boot sequence timing better. Ha, except we've moved it around two or three times already in the last month or so, so I'd say we don't have a clue when to call it in the arch code. Will
On 07/17/2020 02:06 PM, Will Deacon wrote: > On Fri, Jul 17, 2020 at 10:32:53AM +0530, Anshuman Khandual wrote: >> >> >> On 07/16/2020 11:55 PM, Mike Kravetz wrote: >>> >From 17c8f37afbf42fe7412e6eebb3619c6e0b7e1c3c Mon Sep 17 00:00:00 2001 >>> From: Mike Kravetz <mike.kravetz@oracle.com> >>> Date: Tue, 14 Jul 2020 15:54:46 -0700 >>> Subject: [PATCH] hugetlb: move cma reservation to code setting up gigantic >>> hstate >>> >>> Instead of calling hugetlb_cma_reserve() directly from arch specific >>> code, call from hugetlb_add_hstate when adding a gigantic hstate. >>> hugetlb_add_hstate is either called from arch specific huge page setup, >>> or as the result of hugetlb command line processing. In either case, >>> this is late enough in the init process that all numa memory information >>> should be initialized. And, it is early enough to still use early >>> memory allocator. >> >> This assumes that hugetlb_add_hstate() is called from the arch code at >> the right point in time for the generic HugeTLB to do the required CMA >> reservation which is not ideal. I guess it must have been a reason why >> CMA reservation should always called by the platform code which knows >> the boot sequence timing better. > > Ha, except we've moved it around two or three times already in the last > month or so, so I'd say we don't have a clue when to call it in the arch > code. The arch dependency is not going away with this change either. It is just getting transferred to hugetlb_add_hstate(), which gets called from arch_initcall() in every architecture (see the sketch below). The perfect timing here happens to come from arch_initcall() instead. This is probably fine, as long as 0. hugetlb_add_hstate() is always called at arch_initcall() 1. N_MEMORY mask is guaranteed to be initialized at arch_initcall() 2. CMA reservation is available to be called at arch_initcall()
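For context, the arch_initcall() path referred to above looks roughly like the sketch below. This is a simplified approximation of arm64's hugetlbpage_init() (the real function also registers the contiguous huge page sizes), shown only to illustrate where hugetlb_add_hstate() runs during boot:

/*
 * Rough sketch, not the exact arm64 code: each architecture registers
 * its huge page setup with arch_initcall(), and that setup calls
 * hugetlb_add_hstate() for every supported size.  With the proposed
 * change, the CMA reservation would therefore also happen at
 * arch_initcall() time.
 */
static int __init hugetlbpage_init(void)
{
#ifdef CONFIG_ARM64_4K_PAGES
	/* gigantic (PUD sized) pages on 4K page kernels */
	hugetlb_add_hstate(PUD_SHIFT - PAGE_SHIFT);
#endif
	/* PMD sized huge pages are always available */
	hugetlb_add_hstate(PMD_SHIFT - PAGE_SHIFT);

	return 0;
}
arch_initcall(hugetlbpage_init);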
On 7/16/20 10:02 PM, Anshuman Khandual wrote: > > > On 07/16/2020 11:55 PM, Mike Kravetz wrote: >> >From 17c8f37afbf42fe7412e6eebb3619c6e0b7e1c3c Mon Sep 17 00:00:00 2001 >> From: Mike Kravetz <mike.kravetz@oracle.com> >> Date: Tue, 14 Jul 2020 15:54:46 -0700 >> Subject: [PATCH] hugetlb: move cma reservation to code setting up gigantic >> hstate >> >> Instead of calling hugetlb_cma_reserve() directly from arch specific >> code, call from hugetlb_add_hstate when adding a gigantic hstate. >> hugetlb_add_hstate is either called from arch specific huge page setup, >> or as the result of hugetlb command line processing. In either case, >> this is late enough in the init process that all numa memory information >> should be initialized. And, it is early enough to still use early >> memory allocator. > > This assumes that hugetlb_add_hstate() is called from the arch code at > the right point in time for the generic HugeTLB to do the required CMA > reservation which is not ideal. I guess it must have been a reason why > CMA reservation should always called by the platform code which knows > the boot sequence timing better. Actually, the code does not make the assumption that hugetlb_add_hstate is called from arch specific huge page setup. It can even be called later at the time of hugetlb command line processing. My 'reasoning' is that gigantic pages can currently be preallocated from bootmem/memblock_alloc at the time of command line processing. Therefore, we should be able to reserve bootmem for CMA at the same time. Is there something wrong with this reasoning? I tested this on x86 by removing the call to hugetlb_add_hstate from arch specific code and instead forced the call at command line processing time. The ability to reserve CMA was the same. Yes, the CMA reservation interface says it should be called from arch specific code. However, if we currently depend on the ability to do memblock_alloc at hugetlb command line processing time for gigantic page preallocation, then I think we can do the CMA reservation here as well. Thinking about it some more, I suppose there could be some arch code that could call hugetlb_add_hstate too early in the boot process. But, I do not think we have an issue with calling it too late.
On 07/17/2020 11:07 PM, Mike Kravetz wrote: > On 7/17/20 2:51 AM, Anshuman Khandual wrote: >> >> >> On 07/17/2020 02:06 PM, Will Deacon wrote: >>> On Fri, Jul 17, 2020 at 10:32:53AM +0530, Anshuman Khandual wrote: >>>> >>>> >>>> On 07/16/2020 11:55 PM, Mike Kravetz wrote: >>>>> >From 17c8f37afbf42fe7412e6eebb3619c6e0b7e1c3c Mon Sep 17 00:00:00 2001 >>>>> From: Mike Kravetz <mike.kravetz@oracle.com> >>>>> Date: Tue, 14 Jul 2020 15:54:46 -0700 >>>>> Subject: [PATCH] hugetlb: move cma reservation to code setting up gigantic >>>>> hstate >>>>> >>>>> Instead of calling hugetlb_cma_reserve() directly from arch specific >>>>> code, call from hugetlb_add_hstate when adding a gigantic hstate. >>>>> hugetlb_add_hstate is either called from arch specific huge page setup, >>>>> or as the result of hugetlb command line processing. In either case, >>>>> this is late enough in the init process that all numa memory information >>>>> should be initialized. And, it is early enough to still use early >>>>> memory allocator. >>>> >>>> This assumes that hugetlb_add_hstate() is called from the arch code at >>>> the right point in time for the generic HugeTLB to do the required CMA >>>> reservation which is not ideal. I guess it must have been a reason why >>>> CMA reservation should always called by the platform code which knows >>>> the boot sequence timing better. >>> >>> Ha, except we've moved it around two or three times already in the last >>> month or so, so I'd say we don't have a clue when to call it in the arch >>> code. >> >> The arch dependency is not going way with this change either. Just that >> its getting transferred to hugetlb_add_hstate() which gets called from >> arch_initcall() in every architecture. >> >> The perfect timing here happens to be because of arch_initcall() instead. >> This is probably fine, as long as >> >> 0. hugetlb_add_hstate() is always called at arch_initcall() > > In another reply, I give reasoning why it would be safe to call even later > at hugetlb command line processing time. Understood, but there is a time window in which CMA reservation is available, irrespective of whether it is called from arch or generic code. Finding this right time window and ensuring that the N_MEMORY nodemask is initialized is easier done in the platform code. > >> 1. N_MEMORY mask is guaranteed to be initialized at arch_initcall() > > This is a bit more difficult to guarantee. I find the init sequence hard to > understand. Looking at the arm code, arch_initcall(hugetlbpage_init) > happens after N_MEMORY mask is setup. I can't imagine any arch code setting > up huge pages before N_MEMORY. But, I suppose it is possible and we would > need to somehow guarantee this. Ensuring that the N_MEMORY nodemask is initialized from the generic code is even more difficult. > >> 2. CMA reservation is available to be called at arch_initcall() > > Since I am pretty sure we can delay the reservation until hugetlb command > line processing time, it would be great if it was always done there. But moving the hugetlb CMA reservation entirely into command line processing raises another concern: it mixes with the existing memblock based pre-allocation. > Unfortunately, I can not immediately think of an easy way to do this. > It is reasonable to move the CMA reservation into generic HugeTLB, but there are some challenges which need to be solved comprehensively. The patch here from Barry does solve a short-term problem (N_ONLINE ---> N_MEMORY) for now, which IMHO should be considered.
Moving the CMA reservation into generic HugeTLB would require some more thought and can be attempted later.
On 07/17/2020 10:32 PM, Mike Kravetz wrote: > On 7/16/20 10:02 PM, Anshuman Khandual wrote: >> >> >> On 07/16/2020 11:55 PM, Mike Kravetz wrote: >>> >From 17c8f37afbf42fe7412e6eebb3619c6e0b7e1c3c Mon Sep 17 00:00:00 2001 >>> From: Mike Kravetz <mike.kravetz@oracle.com> >>> Date: Tue, 14 Jul 2020 15:54:46 -0700 >>> Subject: [PATCH] hugetlb: move cma reservation to code setting up gigantic >>> hstate >>> >>> Instead of calling hugetlb_cma_reserve() directly from arch specific >>> code, call from hugetlb_add_hstate when adding a gigantic hstate. >>> hugetlb_add_hstate is either called from arch specific huge page setup, >>> or as the result of hugetlb command line processing. In either case, >>> this is late enough in the init process that all numa memory information >>> should be initialized. And, it is early enough to still use early >>> memory allocator. >> >> This assumes that hugetlb_add_hstate() is called from the arch code at >> the right point in time for the generic HugeTLB to do the required CMA >> reservation which is not ideal. I guess it must have been a reason why >> CMA reservation should always called by the platform code which knows >> the boot sequence timing better. > > Actually, the code does not make the assumption that hugetlb_add_hstate > is called from arch specific huge page setup. It can even be called later > at the time of hugetlb command line processing. Yes, now that hugetlb_cma_reserve() has been moved into hugetlb_add_hstate(). But then there is an explicit warning while trying to mix both the command line options, i.e. hugepagesz= and hugetlb_cma=. The proposed code here has not changed that behavior, and hence the following warning should have been triggered here as well. 1) hugepagesz_setup() hugetlb_add_hstate() hugetlb_cma_reserve() 2) hugepages_setup() hugetlb_hstate_alloc_pages() when order >= MAX_ORDER if (hstate_is_gigantic(h)) { if (IS_ENABLED(CONFIG_CMA) && hugetlb_cma[0]) { pr_warn_once("HugeTLB: hugetlb_cma is enabled, skip boot time allocation\n"); break; } if (!alloc_bootmem_huge_page(h)) break; } Nonetheless, it does not make sense to mix both memblock and CMA based huge page pre-allocations. But looking at this again, could this warning ever be triggered until now? Unless a given platform calls hugetlb_cma_reserve() before _setup("hugepages=", hugepages_setup). Anyway, there seem to be good reasons to keep both memblock and CMA based pre-allocations in place. But mixing them together (as done in the proposed code here) does not seem to be right. > > My 'reasoning' is that gigantic pages can currently be preallocated from > bootmem/memblock_alloc at the time of command line processing. Therefore, > we should be able to reserve bootmem for CMA at the same time. Is there > something wrong with this reasoning? I tested this on x86 by removing the > call to hugetlb_add_hstate from arch specific code and instead forced the > call at command line processing time. The ability to reserve CMA was the > same. There is no problem with that reasoning. A __setup()-triggered function should be able to perform CMA reservation. But as pointed out before, it does not make sense to mix both CMA reservation and memblock based pre-allocation. > > Yes, the CMA reservation interface says it should be called from arch > specific code. However, if we currently depend on the ability to do > memblock_alloc at hugetlb command line processing time for gigantic page > preallocation, then I think we can do the CMA reservation here as well.
IIUC, CMA reservation and memblock alloc have some differences in terms of how the memory can be used later on, will have to dig deeper on this. But the comment section near cma_declare_contiguous_nid() is a concern. * This function reserves memory from early allocator. It should be * called by arch specific code once the early allocator (memblock or bootmem) * has been activated and all other subsystems have already allocated/reserved * memory. This function allows to create custom reserved areas. > > Thinking about it some more, I suppose there could be some arch code that > could call hugetlb_add_hstate too early in the boot process. But, I do > not think we have an issue with calling it too late. > Calling it too late might have got the page allocator initialized completely and then CMA reservation would not be possible afterwards. Also calling it too early would prevent other subsystems which might need memory reservation in specific physical ranges.
On 7/19/20 11:22 PM, Anshuman Khandual wrote: > > > On 07/17/2020 10:32 PM, Mike Kravetz wrote: >> On 7/16/20 10:02 PM, Anshuman Khandual wrote: >>> >>> >>> On 07/16/2020 11:55 PM, Mike Kravetz wrote: >>>> >From 17c8f37afbf42fe7412e6eebb3619c6e0b7e1c3c Mon Sep 17 00:00:00 2001 >>>> From: Mike Kravetz <mike.kravetz@oracle.com> >>>> Date: Tue, 14 Jul 2020 15:54:46 -0700 >>>> Subject: [PATCH] hugetlb: move cma reservation to code setting up gigantic >>>> hstate >>>> >>>> Instead of calling hugetlb_cma_reserve() directly from arch specific >>>> code, call from hugetlb_add_hstate when adding a gigantic hstate. >>>> hugetlb_add_hstate is either called from arch specific huge page setup, >>>> or as the result of hugetlb command line processing. In either case, >>>> this is late enough in the init process that all numa memory information >>>> should be initialized. And, it is early enough to still use early >>>> memory allocator. >>> >>> This assumes that hugetlb_add_hstate() is called from the arch code at >>> the right point in time for the generic HugeTLB to do the required CMA >>> reservation which is not ideal. I guess it must have been a reason why >>> CMA reservation should always called by the platform code which knows >>> the boot sequence timing better. >> >> Actually, the code does not make the assumption that hugetlb_add_hstate >> is called from arch specific huge page setup. It can even be called later >> at the time of hugetlb command line processing. > > Yes, now that hugetlb_cma_reserve() has been moved into hugetlb_add_hstate(). > But then there is an explicit warning while trying to mix both the command > line options i.e hugepagesz= and hugetlb_cma=. The proposed code here have > not changed that behavior and hence the following warning should have been > triggered here as well. > > 1) hugepagesz_setup() > hugetlb_add_hstate() > hugetlb_cma_reserve() > > 2) hugepages_setup() > hugetlb_hstate_alloc_pages() when order >= MAX_ORDER > > if (hstate_is_gigantic(h)) { > if (IS_ENABLED(CONFIG_CMA) && hugetlb_cma[0]) { > pr_warn_once("HugeTLB: hugetlb_cma is enabled, skip boot time allocation\n"); > break; > } > if (!alloc_bootmem_huge_page(h)) > break; > } > > Nonetheless, it does not make sense to mix both memblock and CMA based huge > page pre-allocations. But looking at this again, could this warning be ever > triggered till now ? Unless, a given platform calls hugetlb_cma_reserve() > before _setup("hugepages=", hugepages_setup). Anyways, there seems to be > good reasons to keep both memblock and CMA based pre-allocations in place. > But mixing them together (as done in the proposed code here) does not seem > to be right. I'm not sure if I follow the question. This proposal does not change the trigger for the warning printed when one tries to both reserve CMA and pre-allocate gigantic pages. If hugetlb_cma is specified on the command line, and someone tries to pre-allocate gigantic pages they will get the warning. Such a command line on x86 might look like, hugetlb_cma=4G hugepagesz=1G hugepages=4 You will then see, [ 0.065864] HugeTLB: hugetlb_cma is enabled, skip boot time allocation [ 0.065866] HugeTLB: allocating 4 of page size 1.00 GiB failed. Only allocated 0 hugepages. Ideally we could/should eliminate the second message. This behavior exists in the current code. >> My 'reasoning' is that gigantic pages can currently be preallocated from >> bootmem/memblock_alloc at the time of command line processing. 
Therefore, >> we should be able to reserve bootmem for CMA at the same time. Is there >> something wrong with this reasoning? I tested this on x86 by removing the >> call to hugetlb_add_hstate from arch specific code and instead forced the >> call at command line processing time. The ability to reserve CMA was the >> same. > > There is no problem with that reasoning. __setup() triggered function should > be able perform CMA reservation. But as pointed out before, it does not make > sense to mix both CMA reservation and memblock based pre-allocation. Agree. I am not proposing we do. Sorry, if you got that impression. >> Yes, the CMA reservation interface says it should be called from arch >> specific code. However, if we currently depend on the ability to do >> memblock_alloc at hugetlb command line processing time for gigantic page >> preallocation, then I think we can do the CMA reservation here as well. > > IIUC, CMA reservation and memblock alloc have some differences in terms of > how the memory can be used later on, will have to dig deeper on this. But > the comment section near cma_declare_contiguous_nid() is a concern. > > * This function reserves memory from early allocator. It should be > * called by arch specific code once the early allocator (memblock or bootmem) > * has been activated and all other subsystems have already allocated/reserved > * memory. This function allows to create custom reserved areas. > Yes, that is the comment I was looking at as well. However, note that hugetlb pre-allocation of gigantic pages will end up calling memblock_alloc_range_nid. This is the same routine used for CMA reservations/allocations from cma_declare_contiguous_nid. This is why there should be no issue with doing CMA reservations at this time. This may be the confusing part. I am not saying we would do CMA reservations and pre-allocations together. Rather, they both rely on the underlying code so we can call them at the same time in the init process. >> Thinking about it some more, I suppose there could be some arch code that >> could call hugetlb_add_hstate too early in the boot process. But, I do >> not think we have an issue with calling it too late. >> > > Calling it too late might have got the page allocator initialized completely > and then CMA reservation would not be possible afterwards. Also calling it > too early would prevent other subsystems which might need memory reservation > in specific physical ranges. I thought about it some more and came up with a way to do all this at command line processing time. It will take me a day or two to put together. The patch from Barry which started this thread is indeed needed and is in Andrew's tree. I'll start another thread with a patch to move CMA reservations to command line processing.
Mike Kravetz <mike.kravetz@oracle.com> writes: > On 7/19/20 11:22 PM, Anshuman Khandual wrote: >> >> >> On 07/17/2020 10:32 PM, Mike Kravetz wrote: >>> On 7/16/20 10:02 PM, Anshuman Khandual wrote: >>>> >>>> >>>> On 07/16/2020 11:55 PM, Mike Kravetz wrote: >>>>> >From 17c8f37afbf42fe7412e6eebb3619c6e0b7e1c3c Mon Sep 17 00:00:00 2001 >>>>> From: Mike Kravetz <mike.kravetz@oracle.com> >>>>> Date: Tue, 14 Jul 2020 15:54:46 -0700 >>>>> Subject: [PATCH] hugetlb: move cma reservation to code setting up gigantic >>>>> hstate >>>>> >>>>> Instead of calling hugetlb_cma_reserve() directly from arch specific >>>>> code, call from hugetlb_add_hstate when adding a gigantic hstate. >>>>> hugetlb_add_hstate is either called from arch specific huge page setup, >>>>> or as the result of hugetlb command line processing. In either case, >>>>> this is late enough in the init process that all numa memory information >>>>> should be initialized. And, it is early enough to still use early >>>>> memory allocator. >>>> >>>> This assumes that hugetlb_add_hstate() is called from the arch code at >>>> the right point in time for the generic HugeTLB to do the required CMA >>>> reservation which is not ideal. I guess it must have been a reason why >>>> CMA reservation should always called by the platform code which knows >>>> the boot sequence timing better. >>> >>> Actually, the code does not make the assumption that hugetlb_add_hstate >>> is called from arch specific huge page setup. It can even be called later >>> at the time of hugetlb command line processing. >> >> Yes, now that hugetlb_cma_reserve() has been moved into hugetlb_add_hstate(). >> But then there is an explicit warning while trying to mix both the command >> line options i.e hugepagesz= and hugetlb_cma=. The proposed code here have >> not changed that behavior and hence the following warning should have been >> triggered here as well. >> >> 1) hugepagesz_setup() >> hugetlb_add_hstate() >> hugetlb_cma_reserve() >> >> 2) hugepages_setup() >> hugetlb_hstate_alloc_pages() when order >= MAX_ORDER >> >> if (hstate_is_gigantic(h)) { >> if (IS_ENABLED(CONFIG_CMA) && hugetlb_cma[0]) { >> pr_warn_once("HugeTLB: hugetlb_cma is enabled, skip boot time allocation\n"); >> break; >> } >> if (!alloc_bootmem_huge_page(h)) >> break; >> } >> >> Nonetheless, it does not make sense to mix both memblock and CMA based huge >> page pre-allocations. But looking at this again, could this warning be ever >> triggered till now ? Unless, a given platform calls hugetlb_cma_reserve() >> before _setup("hugepages=", hugepages_setup). Anyways, there seems to be >> good reasons to keep both memblock and CMA based pre-allocations in place. >> But mixing them together (as done in the proposed code here) does not seem >> to be right. > > I'm not sure if I follow the question. > > This proposal does not change the trigger for the warning printed when one > tries to both reserve CMA and pre-allocate gigantic pages. If hugetlb_cma > is specified on the command line, and someone tries to pre-allocate gigantic > pages they will get the warning. Such a command line on x86 might look like, > hugetlb_cma=4G hugepagesz=1G hugepages=4 > > You will then see, > [ 0.065864] HugeTLB: hugetlb_cma is enabled, skip boot time allocation > [ 0.065866] HugeTLB: allocating 4 of page size 1.00 GiB failed. Only allocated 0 hugepages. > > Ideally we could/should eliminate the second message. > This behavior exists in the current code. 
There is a variant of this, pseries powerpc, where there is hypervisor assistance w.r.t. allocating gigantic pages. See the ppc64 enablement patch https://lore.kernel.org/linuxppc-dev/20200713150749.25245-1-aneesh.kumar@linux.ibm.com/ -aneesh
On 7/27/20 7:37 AM, Aneesh Kumar K.V wrote: > There is variant of this which is pseries powerpc where there is > hypervisor assistance w.r.t allocating gigantic pages. See the ppc64 > enablement patch > > https://lore.kernel.org/linuxppc-dev/20200713150749.25245-1-aneesh.kumar@linux.ibm.com/ > Thanks Aneesh, I missed the powerpc support. I knew about the powerpc hypervisor assistance, which caused me to rethink my original idea that all this could be arch independent. My next idea was that architectures would only need to provide a constant, something like: #define HUGETLB_CMA_ORDER (PUD_SHIFT - PAGE_SHIFT) However, it looks like this cannot be a compile-time constant on powerpc. Moving more of the CMA support to arch independent code has moved down on my priority list. So, it is likely not to get much work in the immediate future. Just curious, can you have multiple gigantic page sizes supported at one time (one system instance) on powerpc?
diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c index 1e93cfc7c47a..420f5e55615c 100644 --- a/arch/arm64/mm/init.c +++ b/arch/arm64/mm/init.c @@ -420,15 +420,6 @@ void __init bootmem_init(void) arm64_numa_init(); - /* - * must be done after arm64_numa_init() which calls numa_init() to - * initialize node_online_map that gets used in hugetlb_cma_reserve() - * while allocating required CMA size across online nodes. - */ -#ifdef CONFIG_ARM64_4K_PAGES - hugetlb_cma_reserve(PUD_SHIFT - PAGE_SHIFT); -#endif - /* * Sparsemem tries to allocate bootmem in memory_present(), so must be * done after the fixed reservations. @@ -438,6 +429,16 @@ void __init bootmem_init(void) sparse_init(); zone_sizes_init(min, max); + /* + * must be done after zone_sizes_init() which calls free_area_init() + * that calls node_set_state() to initialize node_states[N_MEMORY] + * because hugetlb_cma_reserve() will scan over nodes with N_MEMORY + * state + */ +#ifdef CONFIG_ARM64_4K_PAGES + hugetlb_cma_reserve(PUD_SHIFT - PAGE_SHIFT); +#endif + memblock_dump_all(); } diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c index a3767e74c758..a1a9712090ae 100644 --- a/arch/x86/kernel/setup.c +++ b/arch/x86/kernel/setup.c @@ -1164,9 +1164,6 @@ void __init setup_arch(char **cmdline_p) initmem_init(); dma_contiguous_reserve(max_pfn_mapped << PAGE_SHIFT); - if (boot_cpu_has(X86_FEATURE_GBPAGES)) - hugetlb_cma_reserve(PUD_SHIFT - PAGE_SHIFT); - /* * Reserve memory for crash kernel after SRAT is parsed so that it * won't consume hotpluggable memory. @@ -1180,6 +1177,15 @@ void __init setup_arch(char **cmdline_p) x86_init.paging.pagetable_init(); + /* + * must be done after zone_sizes_init() which calls free_area_init() + * that calls node_set_state() to initialize node_states[N_MEMORY] + * because hugetlb_cma_reserve() will scan over nodes with N_MEMORY + * state + */ + if (boot_cpu_has(X86_FEATURE_GBPAGES)) + hugetlb_cma_reserve(PUD_SHIFT - PAGE_SHIFT); + kasan_init(); /* diff --git a/mm/hugetlb.c b/mm/hugetlb.c index bc3304af40d0..32b5035ffec1 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -5665,6 +5665,13 @@ static int __init cmdline_parse_hugetlb_cma(char *p) early_param("hugetlb_cma", cmdline_parse_hugetlb_cma); +/* + * hugetlb_cma_reserve() - reserve CMA for gigantic pages on nodes with memory + * + * must be called after free_area_init() that updates N_MEMORY via node_set_state(). + * hugetlb_cma_reserve() scans over N_MEMORY nodemask and hence expects the platforms + * to have initialized N_MEMORY state. + */ void __init hugetlb_cma_reserve(int order) { unsigned long size, reserved, per_node; @@ -5685,12 +5692,12 @@ void __init hugetlb_cma_reserve(int order) * If 3 GB area is requested on a machine with 4 numa nodes, * let's allocate 1 GB on first three nodes and ignore the last one. */ - per_node = DIV_ROUND_UP(hugetlb_cma_size, nr_online_nodes); + per_node = DIV_ROUND_UP(hugetlb_cma_size, num_node_state(N_MEMORY)); pr_info("hugetlb_cma: reserve %lu MiB, up to %lu MiB per node\n", hugetlb_cma_size / SZ_1M, per_node / SZ_1M); reserved = 0; - for_each_node_state(nid, N_ONLINE) { + for_each_node_state(nid, N_MEMORY) { int res; size = min(per_node, hugetlb_cma_size - reserved);
Online nodes are not necessarily nodes that contain memory. Splitting hugetlb_cma across online nodes can lead to a hugetlb_cma size that is inconsistent with the user setting. For example, on a system with 4 NUMA nodes where only one of them has memory, if users set hugetlb_cma to 4GB, it will be split into four 1GB pieces. Only the node with memory will get 1GB of CMA; the other three nodes get nothing. That means the whole system gets only 1GB of CMA while the user asked for 4GB. Thus, it is more sensible to split hugetlb_cma across nodes with memory. For the above case, the only node with memory will reserve 4GB of CMA, which matches the user setting in bootargs. In order to split CMA across nodes with memory, hugetlb_cma_reserve() should scan over the nodes with the N_MEMORY state rather than the N_ONLINE state. That means the function should be called only after arch code has finished setting the N_MEMORY state of nodes. The problem exists whenever N_ONLINE != N_MEMORY. It is a general problem on all platforms, but there are some trivial differences among architectures. For example, on ARM64, all nodes already have the N_ONLINE state before hugetlb_cma_reserve() is called, so hugetlb gets an inconsistent CMA size when some online nodes have no memory. For x86, the problem is hidden because x86 happens to set N_ONLINE only on the nodes with memory by the time hugetlb_cma_reserve() is called. Anyway, this patch switches hugetlb_cma_reserve() to scan N_MEMORY and lets both x86 and ARM64 call the function after the N_MEMORY state is ready. It also documents the requirement in the definition of hugetlb_cma_reserve(). Cc: Roman Gushchin <guro@fb.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport <rppt@linux.ibm.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Jonathan Cameron <jonathan.cameron@huawei.com> Signed-off-by: Barry Song <song.bao.hua@hisilicon.com> --- v3: another try to refine changelog with respect to Anshuman Khandual's comments arch/arm64/mm/init.c | 19 ++++++++++--------- arch/x86/kernel/setup.c | 12 +++++++++--- mm/hugetlb.c | 11 +++++++++-- 3 files changed, 28 insertions(+), 14 deletions(-)
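For reference, the heart of this change is the node scan in hugetlb_cma_reserve(). Condensed from the diff above (the pr_info()/pr_warn() messages, the minimum-size check and the error handling are omitted), the reservation loop after this patch looks roughly like:

/* Condensed sketch of hugetlb_cma_reserve() after this patch. */
void __init hugetlb_cma_reserve(int order)
{
	unsigned long size, reserved, per_node;
	int nid;

	if (!hugetlb_cma_size)
		return;

	/* Split the request across nodes that actually have memory. */
	per_node = DIV_ROUND_UP(hugetlb_cma_size, num_node_state(N_MEMORY));

	reserved = 0;
	for_each_node_state(nid, N_MEMORY) {
		size = min(per_node, hugetlb_cma_size - reserved);
		size = round_up(size, PAGE_SIZE << order);

		cma_declare_contiguous_nid(0, size, 0, PAGE_SIZE << order,
					   0, false, "hugetlb",
					   &hugetlb_cma[nid], nid);
		reserved += size;
		if (reserved >= hugetlb_cma_size)
			break;
	}
}

For the example in the changelog (hugetlb_cma=4G on a system where only one of the four nodes has memory), num_node_state(N_MEMORY) is 1, so per_node is 4GB and the single memory node gets the full 4GB reservation, matching the user's request.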