
[v2] mm: hugetlb: optionally allocate gigantic hugepages using cma

Message ID 20200310002524.2291595-1-guro@fb.com (mailing list archive)
State New, archived
Series: [v2] mm: hugetlb: optionally allocate gigantic hugepages using cma

Commit Message

Roman Gushchin March 10, 2020, 12:25 a.m. UTC
Commit 944d9fec8d7a ("hugetlb: add support for gigantic page allocation
at runtime") has added the run-time allocation of gigantic pages. However
it works reliably only during the early stages of system boot, when
most of the memory is free. After some time the memory gets
fragmented by non-movable pages, so the chances of finding a contiguous
1 GB block get close to zero. Even dropping caches manually
doesn't help much.

At large scale, rebooting servers in order to allocate gigantic hugepages
is quite expensive and complex. At the same time, keeping some constant
percentage of memory in reserved hugepages even if the workload isn't
using it is a big waste: not all workloads can benefit from using 1 GB
pages.

The following solution can solve the problem:
1) At boot time a dedicated cma area* is reserved. The size is passed
   as a kernel argument.
2) Run-time allocations of gigantic hugepages are performed using the
   cma allocator and the dedicated cma area.

In this case gigantic hugepages can be allocated successfully with a
high probability; however, the memory isn't completely wasted if nobody
is using 1GB hugepages: it can be used for pagecache, anon memory,
THPs, etc.

* On a multi-node machine a per-node cma area is allocated on each node.
  Subsequent gigantic hugetlb allocations use the first available
  NUMA node if the mask isn't specified by the user.

Usage:
1) configure the kernel to allocate a cma area for hugetlb allocations:
   pass hugetlb_cma=10G as a kernel argument

2) allocate hugetlb pages as usual, e.g.
   echo 10 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages

If the option isn't enabled or the allocation of the cma area failed,
the current behavior of the system is preserved.

Only x86 is covered by this patch, but it's trivial to extend it to
cover other architectures as well.

v2: fixed !CONFIG_CMA build, suggested by Andrew Morton

Signed-off-by: Roman Gushchin <guro@fb.com>
---
 .../admin-guide/kernel-parameters.txt         |   7 ++
 arch/x86/kernel/setup.c                       |   3 +
 include/linux/hugetlb.h                       |   2 +
 mm/hugetlb.c                                  | 115 ++++++++++++++++++
 4 files changed, 127 insertions(+)

Comments

Roman Gushchin March 10, 2020, 12:30 a.m. UTC | #1
On Mon, Mar 09, 2020 at 05:25:24PM -0700, Roman Gushchin wrote:
> Commit 944d9fec8d7a ("hugetlb: add support for gigantic page allocation
> at runtime") has added the run-time allocation of gigantic pages. However
> it actually works only at early stages of the system loading, when
> the majority of memory is free. After some time the memory gets
> fragmented by non-movable pages, so the chances to find a contiguous
> 1 GB block are getting close to zero. Even dropping caches manually
> doesn't help a lot.
> 
> At large scale rebooting servers in order to allocate gigantic hugepages
> is quite expensive and complex. At the same time keeping some constant
> percentage of memory in reserved hugepages even if the workload isn't
> using it is a big waste: not all workloads can benefit from using 1 GB
> pages.
> 
> The following solution can solve the problem:
> 1) On boot time a dedicated cma area* is reserved. The size is passed
>    as a kernel argument.
> 2) Run-time allocations of gigantic hugepages are performed using the
>    cma allocator and the dedicated cma area
> 
> In this case gigantic hugepages can be allocated successfully with a
> high probability, however the memory isn't completely wasted if nobody
> is using 1GB hugepages: it can be used for pagecache, anon memory,
> THPs, etc.
> 
> * On a multi-node machine a per-node cma area is allocated on each node.
>   Following gigantic hugetlb allocation are using the first available
>   numa node if the mask isn't specified by a user.
> 
> Usage:
> 1) configure the kernel to allocate a cma area for hugetlb allocations:
>    pass hugetlb_cma=10G as a kernel argument
> 
> 2) allocate hugetlb pages as usual, e.g.
>    echo 10 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages

Btw, if somebody wants to give it a try, please pull the following two
fs-related fixes:

1) for btrfs: https://patchwork.kernel.org/patch/11420997/
2) for ext4:  https://lore.kernel.org/patchwork/patch/1202868/

Please make sure you pull the ext4 fix if, say, /boot is served
by ext4.

Thanks!
Michal Hocko March 10, 2020, 8:45 a.m. UTC | #2
[Cc Mike as hugetlb maintainer and keeping the full context for his
reference]

On Mon 09-03-20 17:25:24, Roman Gushchin wrote:
> Commit 944d9fec8d7a ("hugetlb: add support for gigantic page allocation
> at runtime") has added the run-time allocation of gigantic pages. However
> it actually works only at early stages of the system loading, when
> the majority of memory is free. After some time the memory gets
> fragmented by non-movable pages, so the chances to find a contiguous
> 1 GB block are getting close to zero. Even dropping caches manually
> doesn't help a lot.
> 
> At large scale rebooting servers in order to allocate gigantic hugepages
> is quite expensive and complex. At the same time keeping some constant
> percentage of memory in reserved hugepages even if the workload isn't
> using it is a big waste: not all workloads can benefit from using 1 GB
> pages.
> 
> The following solution can solve the problem:
> 1) On boot time a dedicated cma area* is reserved. The size is passed
>    as a kernel argument.
> 2) Run-time allocations of gigantic hugepages are performed using the
>    cma allocator and the dedicated cma area
> 
> In this case gigantic hugepages can be allocated successfully with a
> high probability, however the memory isn't completely wasted if nobody
> is using 1GB hugepages: it can be used for pagecache, anon memory,
> THPs, etc.
> 
> * On a multi-node machine a per-node cma area is allocated on each node.
>   Following gigantic hugetlb allocation are using the first available
>   numa node if the mask isn't specified by a user.
> 
> Usage:
> 1) configure the kernel to allocate a cma area for hugetlb allocations:
>    pass hugetlb_cma=10G as a kernel argument
> 
> 2) allocate hugetlb pages as usual, e.g.
>    echo 10 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
> 
> If the option isn't enabled or the allocation of the cma area failed,
> the current behavior of the system is preserved.
> 
> Only x86 is covered by this patch, but it's trivial to extend it to
> cover other architectures as well.

Overall the idea makes sense to me. I am worried about the configuration
side of the thing. Not only would I stick with the absolute size for now,
for simplicity and because the percentage usecase is not really explained
anywhere; I am also worried about the resulting memory layout you will
get when using the parameter.

Let's scroll down to the setup code ...

> v2: fixed !CONFIG_CMA build, suggested by Andrew Morton
> 
> Signed-off-by: Roman Gushchin <guro@fb.com>
> ---
>  .../admin-guide/kernel-parameters.txt         |   7 ++
>  arch/x86/kernel/setup.c                       |   3 +
>  include/linux/hugetlb.h                       |   2 +
>  mm/hugetlb.c                                  | 115 ++++++++++++++++++
>  4 files changed, 127 insertions(+)
> 
> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> index 0c9894247015..d3349ec1dbef 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -1452,6 +1452,13 @@
>  	hpet_mmap=	[X86, HPET_MMAP] Allow userspace to mmap HPET
>  			registers.  Default set by CONFIG_HPET_MMAP_DEFAULT.
>  
> +	hugetlb_cma=	[x86-64] The size of a cma area used for allocation
> +			of gigantic hugepages.
> +			Format: nn[GTPE] | nn%
> +
> +			If enabled, boot-time allocation of gigantic hugepages
> +			is skipped.
> +
>  	hugepages=	[HW,X86-32,IA-64] HugeTLB pages to allocate at boot.
>  	hugepagesz=	[HW,IA-64,PPC,X86-64] The size of the HugeTLB pages.
>  			On x86-64 and powerpc, this option can be specified
> diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
> index a74262c71484..ceeb06ddfd41 100644
> --- a/arch/x86/kernel/setup.c
> +++ b/arch/x86/kernel/setup.c
> @@ -16,6 +16,7 @@
>  #include <linux/pci.h>
>  #include <linux/root_dev.h>
>  #include <linux/sfi.h>
> +#include <linux/hugetlb.h>
>  #include <linux/tboot.h>
>  #include <linux/usb/xhci-dbgp.h>
>  
> @@ -1158,6 +1159,8 @@ void __init setup_arch(char **cmdline_p)
>  	initmem_init();
>  	dma_contiguous_reserve(max_pfn_mapped << PAGE_SHIFT);
>  
> +	hugetlb_cma_reserve();
> +
>  	/*
>  	 * Reserve memory for crash kernel after SRAT is parsed so that it
>  	 * won't consume hotpluggable memory.
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index 50480d16bd33..50050c981ab9 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -157,6 +157,8 @@ pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud);
>  extern int sysctl_hugetlb_shm_group;
>  extern struct list_head huge_boot_pages;
>  
> +extern void __init hugetlb_cma_reserve(void);
> +
>  /* arch callbacks */
>  
>  pte_t *huge_pte_alloc(struct mm_struct *mm,
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 7fb31750e670..c6f58bab879c 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -28,6 +28,7 @@
>  #include <linux/jhash.h>
>  #include <linux/numa.h>
>  #include <linux/llist.h>
> +#include <linux/cma.h>
>  
>  #include <asm/page.h>
>  #include <asm/pgtable.h>
> @@ -44,6 +45,9 @@
>  int hugetlb_max_hstate __read_mostly;
>  unsigned int default_hstate_idx;
>  struct hstate hstates[HUGE_MAX_HSTATE];
> +
> +static struct cma *hugetlb_cma[MAX_NUMNODES];
> +
>  /*
>   * Minimum page order among possible hugepage sizes, set to a proper value
>   * at boot time.
> @@ -1228,6 +1232,11 @@ static void destroy_compound_gigantic_page(struct page *page,
>  
>  static void free_gigantic_page(struct page *page, unsigned int order)
>  {
> +	if (IS_ENABLED(CONFIG_CMA) && hugetlb_cma[0]) {
> +		cma_release(hugetlb_cma[page_to_nid(page)], page, 1 << order);
> +		return;
> +	}
> +
>  	free_contig_range(page_to_pfn(page), 1 << order);
>  }
>  
> @@ -1237,6 +1246,23 @@ static struct page *alloc_gigantic_page(struct hstate *h, gfp_t gfp_mask,
>  {
>  	unsigned long nr_pages = 1UL << huge_page_order(h);
>  
> +	if (IS_ENABLED(CONFIG_CMA) && hugetlb_cma[0]) {
> +		struct page *page;
> +		int nid;
> +
> +		for_each_node_mask(nid, *nodemask) {
> +			if (!hugetlb_cma[nid])
> +				break;
> +
> +			page = cma_alloc(hugetlb_cma[nid], nr_pages,
> +					 huge_page_order(h), true);
> +			if (page)
> +				return page;
> +		}
> +
> +		return NULL;
> +	}
> +
>  	return alloc_contig_pages(nr_pages, gfp_mask, nid, nodemask);
>  }
>  
> @@ -2439,6 +2465,10 @@ static void __init hugetlb_hstate_alloc_pages(struct hstate *h)
>  
>  	for (i = 0; i < h->max_huge_pages; ++i) {
>  		if (hstate_is_gigantic(h)) {
> +			if (IS_ENABLED(CONFIG_CMA) && hugetlb_cma[0]) {
> +				pr_warn_once("HugeTLB: hugetlb_cma is enabled, skip boot time allocation\n");
> +				break;
> +			}
>  			if (!alloc_bootmem_huge_page(h))
>  				break;
>  		} else if (!alloc_pool_huge_page(h,
> @@ -5372,3 +5402,88 @@ void move_hugetlb_state(struct page *oldpage, struct page *newpage, int reason)
>  		spin_unlock(&hugetlb_lock);
>  	}
>  }
> +
> +#ifdef CONFIG_CMA
> +static unsigned long hugetlb_cma_size __initdata;
> +static unsigned long hugetlb_cma_percent __initdata;
> +
> +static int __init cmdline_parse_hugetlb_cma(char *p)
> +{
> +	unsigned long long val;
> +	char *endptr;
> +
> +	if (!p)
> +		return -EINVAL;
> +
> +	/* Value may be a percentage of total memory, otherwise bytes */
> +	val = simple_strtoull(p, &endptr, 0);
> +	if (*endptr == '%')
> +		hugetlb_cma_percent = clamp_t(unsigned long, val, 0, 100);
> +	else
> +		hugetlb_cma_size = memparse(p, &p);
> +
> +	return 0;
> +}
> +
> +early_param("hugetlb_cma", cmdline_parse_hugetlb_cma);
> +
> +void __init hugetlb_cma_reserve(void)
> +{
> +	unsigned long totalpages = 0;
> +	unsigned long start_pfn, end_pfn;
> +	phys_addr_t size;
> +	int nid, i, res;
> +
> +	if (!hugetlb_cma_size && !hugetlb_cma_percent)
> +		return;
> +
> +	if (hugetlb_cma_percent) {
> +		for_each_mem_pfn_range(i, MAX_NUMNODES, &start_pfn, &end_pfn,
> +				       NULL)
> +			totalpages += end_pfn - start_pfn;
> +
> +		size = PAGE_SIZE * (hugetlb_cma_percent * 100 * totalpages) /
> +			10000UL;
> +	} else {
> +		size = hugetlb_cma_size;
> +	}
> +
> +	pr_info("hugetlb_cma: reserve %llu, %llu per node\n", size,
> +		size / nr_online_nodes);
> +
> +	size /= nr_online_nodes;
> +
> +	for_each_node_state(nid, N_ONLINE) {
> +		unsigned long min_pfn = 0, max_pfn = 0;
> +
> +		for_each_mem_pfn_range(i, nid, &start_pfn, &end_pfn, NULL) {
> +			if (!min_pfn)
> +				min_pfn = start_pfn;
> +			max_pfn = end_pfn;
> +		}

Do you want to compare the range to the size? But besides that, I
believe this really needs to be much more careful. I believe you do not
want to eat a considerable part of the kernel memory because the
resulting configuration will really struggle (yeah all the low mem/high
mem problems all over again).

> +
> +		res = cma_declare_contiguous(PFN_PHYS(min_pfn), size,
> +					     PFN_PHYS(max_pfn), (1UL << 30),
> +					     0, false,
> +					     "hugetlb", &hugetlb_cma[nid]);
> +		if (res) {
> +			pr_warn("hugetlb_cma: reservation failed: err %d, node %d, [%llu, %llu)",
> +				res, nid, PFN_PHYS(min_pfn), PFN_PHYS(max_pfn));
> +
> +			for (; nid >= 0; nid--)
> +				hugetlb_cma[nid] = NULL;
> +
> +			break;
> +		}
> +
> +		pr_info("hugetlb_cma: successfully reserved %llu on node %d\n",
> +			size, nid);
> +	}
> +}
> +
> +#else /* CONFIG_CMA */
> +void __init hugetlb_cma_reserve(void)
> +{
> +}
> +
> +#endif /* CONFIG_CMA */
> -- 
> 2.24.1
Michal Hocko March 10, 2020, 9:01 a.m. UTC | #3
On Mon 09-03-20 17:25:24, Roman Gushchin wrote:
[...]
> 2) Run-time allocations of gigantic hugepages are performed using the
>    cma allocator and the dedicated cma area

[...]
> @@ -1237,6 +1246,23 @@ static struct page *alloc_gigantic_page(struct hstate *h, gfp_t gfp_mask,
>  {
>  	unsigned long nr_pages = 1UL << huge_page_order(h);
>  
> +	if (IS_ENABLED(CONFIG_CMA) && hugetlb_cma[0]) {
> +		struct page *page;
> +		int nid;
> +
> +		for_each_node_mask(nid, *nodemask) {
> +			if (!hugetlb_cma[nid])
> +				break;
> +
> +			page = cma_alloc(hugetlb_cma[nid], nr_pages,
> +					 huge_page_order(h), true);
> +			if (page)
> +				return page;
> +		}
> +
> +		return NULL;

Is there any strong reason why the allocation cannot fall back to the non-CMA
allocator when the cma area is depleted?

> +	}
> +
>  	return alloc_contig_pages(nr_pages, gfp_mask, nid, nodemask);
>  }
Roman Gushchin March 10, 2020, 5:25 p.m. UTC | #4
Hello, Michal!

On Tue, Mar 10, 2020 at 09:45:44AM +0100, Michal Hocko wrote:
> [Cc Mike as hugetlb maintainer and keeping the full context for his
> reference]

Thanks!

> 
> On Mon 09-03-20 17:25:24, Roman Gushchin wrote:
> > Commit 944d9fec8d7a ("hugetlb: add support for gigantic page allocation
> > at runtime") has added the run-time allocation of gigantic pages. However
> > it actually works only at early stages of the system loading, when
> > the majority of memory is free. After some time the memory gets
> > fragmented by non-movable pages, so the chances to find a contiguous
> > 1 GB block are getting close to zero. Even dropping caches manually
> > doesn't help a lot.
> > 
> > At large scale rebooting servers in order to allocate gigantic hugepages
> > is quite expensive and complex. At the same time keeping some constant
> > percentage of memory in reserved hugepages even if the workload isn't
> > using it is a big waste: not all workloads can benefit from using 1 GB
> > pages.
> > 
> > The following solution can solve the problem:
> > 1) On boot time a dedicated cma area* is reserved. The size is passed
> >    as a kernel argument.
> > 2) Run-time allocations of gigantic hugepages are performed using the
> >    cma allocator and the dedicated cma area
> > 
> > In this case gigantic hugepages can be allocated successfully with a
> > high probability, however the memory isn't completely wasted if nobody
> > is using 1GB hugepages: it can be used for pagecache, anon memory,
> > THPs, etc.
> > 
> > * On a multi-node machine a per-node cma area is allocated on each node.
> >   Following gigantic hugetlb allocation are using the first available
> >   numa node if the mask isn't specified by a user.
> > 
> > Usage:
> > 1) configure the kernel to allocate a cma area for hugetlb allocations:
> >    pass hugetlb_cma=10G as a kernel argument
> > 
> > 2) allocate hugetlb pages as usual, e.g.
> >    echo 10 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
> > 
> > If the option isn't enabled or the allocation of the cma area failed,
> > the current behavior of the system is preserved.
> > 
> > Only x86 is covered by this patch, but it's trivial to extend it to
> > cover other architectures as well.
> 
> Overall idea makes sense to me. I am worried about the configuration
> side of the thing. Not only I would stick with the absolute size for now
> for simplicity and because percentage usecase is not really explained
> anywhere. I am also worried about the resulting memory layout you will
> get when using the parameter.

Thanks! I agree, we can drop the percentage configuration for simplicity.

> 
> Let's scroll down to the setup code ...
> 
> > v2: fixed !CONFIG_CMA build, suggested by Andrew Morton
> > 
> > Signed-off-by: Roman Gushchin <guro@fb.com>
> > ---
> >  .../admin-guide/kernel-parameters.txt         |   7 ++
> >  arch/x86/kernel/setup.c                       |   3 +
> >  include/linux/hugetlb.h                       |   2 +
> >  mm/hugetlb.c                                  | 115 ++++++++++++++++++
> >  4 files changed, 127 insertions(+)
> > 
> > diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> > index 0c9894247015..d3349ec1dbef 100644
> > --- a/Documentation/admin-guide/kernel-parameters.txt
> > +++ b/Documentation/admin-guide/kernel-parameters.txt
> > @@ -1452,6 +1452,13 @@
> >  	hpet_mmap=	[X86, HPET_MMAP] Allow userspace to mmap HPET
> >  			registers.  Default set by CONFIG_HPET_MMAP_DEFAULT.
> >  
> > +	hugetlb_cma=	[x86-64] The size of a cma area used for allocation
> > +			of gigantic hugepages.
> > +			Format: nn[GTPE] | nn%
> > +
> > +			If enabled, boot-time allocation of gigantic hugepages
> > +			is skipped.
> > +
> >  	hugepages=	[HW,X86-32,IA-64] HugeTLB pages to allocate at boot.
> >  	hugepagesz=	[HW,IA-64,PPC,X86-64] The size of the HugeTLB pages.
> >  			On x86-64 and powerpc, this option can be specified
> > diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
> > index a74262c71484..ceeb06ddfd41 100644
> > --- a/arch/x86/kernel/setup.c
> > +++ b/arch/x86/kernel/setup.c
> > @@ -16,6 +16,7 @@
> >  #include <linux/pci.h>
> >  #include <linux/root_dev.h>
> >  #include <linux/sfi.h>
> > +#include <linux/hugetlb.h>
> >  #include <linux/tboot.h>
> >  #include <linux/usb/xhci-dbgp.h>
> >  
> > @@ -1158,6 +1159,8 @@ void __init setup_arch(char **cmdline_p)
> >  	initmem_init();
> >  	dma_contiguous_reserve(max_pfn_mapped << PAGE_SHIFT);
> >  
> > +	hugetlb_cma_reserve();
> > +
> >  	/*
> >  	 * Reserve memory for crash kernel after SRAT is parsed so that it
> >  	 * won't consume hotpluggable memory.
> > diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> > index 50480d16bd33..50050c981ab9 100644
> > --- a/include/linux/hugetlb.h
> > +++ b/include/linux/hugetlb.h
> > @@ -157,6 +157,8 @@ pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud);
> >  extern int sysctl_hugetlb_shm_group;
> >  extern struct list_head huge_boot_pages;
> >  
> > +extern void __init hugetlb_cma_reserve(void);
> > +
> >  /* arch callbacks */
> >  
> >  pte_t *huge_pte_alloc(struct mm_struct *mm,
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index 7fb31750e670..c6f58bab879c 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -28,6 +28,7 @@
> >  #include <linux/jhash.h>
> >  #include <linux/numa.h>
> >  #include <linux/llist.h>
> > +#include <linux/cma.h>
> >  
> >  #include <asm/page.h>
> >  #include <asm/pgtable.h>
> > @@ -44,6 +45,9 @@
> >  int hugetlb_max_hstate __read_mostly;
> >  unsigned int default_hstate_idx;
> >  struct hstate hstates[HUGE_MAX_HSTATE];
> > +
> > +static struct cma *hugetlb_cma[MAX_NUMNODES];
> > +
> >  /*
> >   * Minimum page order among possible hugepage sizes, set to a proper value
> >   * at boot time.
> > @@ -1228,6 +1232,11 @@ static void destroy_compound_gigantic_page(struct page *page,
> >  
> >  static void free_gigantic_page(struct page *page, unsigned int order)
> >  {
> > +	if (IS_ENABLED(CONFIG_CMA) && hugetlb_cma[0]) {
> > +		cma_release(hugetlb_cma[page_to_nid(page)], page, 1 << order);
> > +		return;
> > +	}
> > +
> >  	free_contig_range(page_to_pfn(page), 1 << order);
> >  }
> >  
> > @@ -1237,6 +1246,23 @@ static struct page *alloc_gigantic_page(struct hstate *h, gfp_t gfp_mask,
> >  {
> >  	unsigned long nr_pages = 1UL << huge_page_order(h);
> >  
> > +	if (IS_ENABLED(CONFIG_CMA) && hugetlb_cma[0]) {
> > +		struct page *page;
> > +		int nid;
> > +
> > +		for_each_node_mask(nid, *nodemask) {
> > +			if (!hugetlb_cma[nid])
> > +				break;
> > +
> > +			page = cma_alloc(hugetlb_cma[nid], nr_pages,
> > +					 huge_page_order(h), true);
> > +			if (page)
> > +				return page;
> > +		}
> > +
> > +		return NULL;
> > +	}
> > +
> >  	return alloc_contig_pages(nr_pages, gfp_mask, nid, nodemask);
> >  }
> >  
> > @@ -2439,6 +2465,10 @@ static void __init hugetlb_hstate_alloc_pages(struct hstate *h)
> >  
> >  	for (i = 0; i < h->max_huge_pages; ++i) {
> >  		if (hstate_is_gigantic(h)) {
> > +			if (IS_ENABLED(CONFIG_CMA) && hugetlb_cma[0]) {
> > +				pr_warn_once("HugeTLB: hugetlb_cma is enabled, skip boot time allocation\n");
> > +				break;
> > +			}
> >  			if (!alloc_bootmem_huge_page(h))
> >  				break;
> >  		} else if (!alloc_pool_huge_page(h,
> > @@ -5372,3 +5402,88 @@ void move_hugetlb_state(struct page *oldpage, struct page *newpage, int reason)
> >  		spin_unlock(&hugetlb_lock);
> >  	}
> >  }
> > +
> > +#ifdef CONFIG_CMA
> > +static unsigned long hugetlb_cma_size __initdata;
> > +static unsigned long hugetlb_cma_percent __initdata;
> > +
> > +static int __init cmdline_parse_hugetlb_cma(char *p)
> > +{
> > +	unsigned long long val;
> > +	char *endptr;
> > +
> > +	if (!p)
> > +		return -EINVAL;
> > +
> > +	/* Value may be a percentage of total memory, otherwise bytes */
> > +	val = simple_strtoull(p, &endptr, 0);
> > +	if (*endptr == '%')
> > +		hugetlb_cma_percent = clamp_t(unsigned long, val, 0, 100);
> > +	else
> > +		hugetlb_cma_size = memparse(p, &p);
> > +
> > +	return 0;
> > +}
> > +
> > +early_param("hugetlb_cma", cmdline_parse_hugetlb_cma);
> > +
> > +void __init hugetlb_cma_reserve(void)
> > +{
> > +	unsigned long totalpages = 0;
> > +	unsigned long start_pfn, end_pfn;
> > +	phys_addr_t size;
> > +	int nid, i, res;
> > +
> > +	if (!hugetlb_cma_size && !hugetlb_cma_percent)
> > +		return;
> > +
> > +	if (hugetlb_cma_percent) {
> > +		for_each_mem_pfn_range(i, MAX_NUMNODES, &start_pfn, &end_pfn,
> > +				       NULL)
> > +			totalpages += end_pfn - start_pfn;
> > +
> > +		size = PAGE_SIZE * (hugetlb_cma_percent * 100 * totalpages) /
> > +			10000UL;
> > +	} else {
> > +		size = hugetlb_cma_size;
> > +	}
> > +
> > +	pr_info("hugetlb_cma: reserve %llu, %llu per node\n", size,
> > +		size / nr_online_nodes);
> > +
> > +	size /= nr_online_nodes;
> > +
> > +	for_each_node_state(nid, N_ONLINE) {
> > +		unsigned long min_pfn = 0, max_pfn = 0;
> > +
> > +		for_each_mem_pfn_range(i, nid, &start_pfn, &end_pfn, NULL) {
> > +			if (!min_pfn)
> > +				min_pfn = start_pfn;
> > +			max_pfn = end_pfn;
> > +		}
> 
> Do you want to compare the range to the size?

You mean add a check that the range is big enough?

> But besides that, I
> believe this really needs to be much more careful. I believe you do not
> want to eat a considerable part of the kernel memory because the
> resulting configuration will really struggle (yeah all the low mem/high
> mem problems all over again).

Well, so far I was focused on the particular case where the target cma size
is significantly smaller than the total RAM size (~5-10%). What is the right
thing to do here? Fall back to the current behavior if the requested size is
more than x% of total memory? 1/2? What do you think?

We've discussed it with Rik in private, and he suggested starting
with ~50% always and then shrinking it on demand. That's something we might
have here long-term.


Thank you!
Mike Kravetz March 10, 2020, 5:27 p.m. UTC | #5
On 3/9/20 5:25 PM, Roman Gushchin wrote:
> Commit 944d9fec8d7a ("hugetlb: add support for gigantic page allocation
> at runtime") has added the run-time allocation of gigantic pages. However
> it actually works only at early stages of the system loading, when
> the majority of memory is free. After some time the memory gets
> fragmented by non-movable pages, so the chances to find a contiguous
> 1 GB block are getting close to zero. Even dropping caches manually
> doesn't help a lot.
> 
> At large scale rebooting servers in order to allocate gigantic hugepages
> is quite expensive and complex. At the same time keeping some constant
> percentage of memory in reserved hugepages even if the workload isn't
> using it is a big waste: not all workloads can benefit from using 1 GB
> pages.
> 
> The following solution can solve the problem:
> 1) On boot time a dedicated cma area* is reserved. The size is passed
>    as a kernel argument.
> 2) Run-time allocations of gigantic hugepages are performed using the
>    cma allocator and the dedicated cma area
> 
> In this case gigantic hugepages can be allocated successfully with a
> high probability, however the memory isn't completely wasted if nobody
> is using 1GB hugepages: it can be used for pagecache, anon memory,
> THPs, etc.
> 
> * On a multi-node machine a per-node cma area is allocated on each node.
>   Following gigantic hugetlb allocation are using the first available
>   numa node if the mask isn't specified by a user.
> 
> Usage:
> 1) configure the kernel to allocate a cma area for hugetlb allocations:
>    pass hugetlb_cma=10G as a kernel argument
> 
> 2) allocate hugetlb pages as usual, e.g.
>    echo 10 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
> 
> If the option isn't enabled or the allocation of the cma area failed,
> the current behavior of the system is preserved.
> 
> Only x86 is covered by this patch, but it's trivial to extend it to
> cover other architectures as well.
> 
> v2: fixed !CONFIG_CMA build, suggested by Andrew Morton
> 
> Signed-off-by: Roman Gushchin <guro@fb.com>

Thanks!  I really like this idea.

> ---
>  .../admin-guide/kernel-parameters.txt         |   7 ++
>  arch/x86/kernel/setup.c                       |   3 +
>  include/linux/hugetlb.h                       |   2 +
>  mm/hugetlb.c                                  | 115 ++++++++++++++++++
>  4 files changed, 127 insertions(+)
> 
<snip>
> diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
> index a74262c71484..ceeb06ddfd41 100644
> --- a/arch/x86/kernel/setup.c
> +++ b/arch/x86/kernel/setup.c
> @@ -16,6 +16,7 @@
>  #include <linux/pci.h>
>  #include <linux/root_dev.h>
>  #include <linux/sfi.h>
> +#include <linux/hugetlb.h>
>  #include <linux/tboot.h>
>  #include <linux/usb/xhci-dbgp.h>
>  
> @@ -1158,6 +1159,8 @@ void __init setup_arch(char **cmdline_p)
>  	initmem_init();
>  	dma_contiguous_reserve(max_pfn_mapped << PAGE_SHIFT);
>  
> +	hugetlb_cma_reserve();
> +

I know this is called from arch specific code here to fit in with the timing
of CMA setup/reservation calls.  However, there really is nothing architecture
specific about this functionality.  It would be great IMO if we could make
this architecture independent, but I cannot think of a straightforward
way to do this.

>  	/*
>  	 * Reserve memory for crash kernel after SRAT is parsed so that it
>  	 * won't consume hotpluggable memory.
<snip>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
<snip>
> +void __init hugetlb_cma_reserve(void)
> +{
> +	unsigned long totalpages = 0;
> +	unsigned long start_pfn, end_pfn;
> +	phys_addr_t size;
> +	int nid, i, res;
> +
> +	if (!hugetlb_cma_size && !hugetlb_cma_percent)
> +		return;
> +
> +	if (hugetlb_cma_percent) {
> +		for_each_mem_pfn_range(i, MAX_NUMNODES, &start_pfn, &end_pfn,
> +				       NULL)
> +			totalpages += end_pfn - start_pfn;
> +
> +		size = PAGE_SIZE * (hugetlb_cma_percent * 100 * totalpages) /
> +			10000UL;
> +	} else {
> +		size = hugetlb_cma_size;
> +	}
> +
> +	pr_info("hugetlb_cma: reserve %llu, %llu per node\n", size,
> +		size / nr_online_nodes);
> +
> +	size /= nr_online_nodes;
> +
> +	for_each_node_state(nid, N_ONLINE) {
> +		unsigned long min_pfn = 0, max_pfn = 0;
> +
> +		for_each_mem_pfn_range(i, nid, &start_pfn, &end_pfn, NULL) {
> +			if (!min_pfn)
> +				min_pfn = start_pfn;
> +			max_pfn = end_pfn;
> +		}
> +
> +		res = cma_declare_contiguous(PFN_PHYS(min_pfn), size,
> +					     PFN_PHYS(max_pfn), (1UL << 30),

The alignment is hard-coded for the x86 gigantic page size.  If this supports
more architectures or becomes arch independent we will need to determine
what this alignment should be.  Perhaps an arch specific callback to get
the alignment for gigantic pages.  That will require a little thought as
some architectures support multiple gigantic page sizes.
Roman Gushchin March 10, 2020, 5:30 p.m. UTC | #6
On Tue, Mar 10, 2020 at 10:01:21AM +0100, Michal Hocko wrote:
> On Mon 09-03-20 17:25:24, Roman Gushchin wrote:
> [...]
> > 2) Run-time allocations of gigantic hugepages are performed using the
> >    cma allocator and the dedicated cma area
> 
> [...]
> > @@ -1237,6 +1246,23 @@ static struct page *alloc_gigantic_page(struct hstate *h, gfp_t gfp_mask,
> >  {
> >  	unsigned long nr_pages = 1UL << huge_page_order(h);
> >  
> > +	if (IS_ENABLED(CONFIG_CMA) && hugetlb_cma[0]) {
> > +		struct page *page;
> > +		int nid;
> > +
> > +		for_each_node_mask(nid, *nodemask) {
> > +			if (!hugetlb_cma[nid])
> > +				break;
> > +
> > +			page = cma_alloc(hugetlb_cma[nid], nr_pages,
> > +					 huge_page_order(h), true);
> > +			if (page)
> > +				return page;
> > +		}
> > +
> > +		return NULL;
> 
> Is there any strong reason why the alloaction annot fallback to non-CMA
> allocator when the cma is depleted?

The reason is that gigantic pages allocated using cma require
special handling on release. It's solvable by using an additional
page flag, but because the current code usually doesn't work except
for a short time just after system start, I don't think it's worth it.

But I do not have a strong opinion here.

Thanks!
Michal Hocko March 10, 2020, 5:37 p.m. UTC | #7
On Tue 10-03-20 10:25:59, Roman Gushchin wrote:
> Hello, Michal!
> 
> On Tue, Mar 10, 2020 at 09:45:44AM +0100, Michal Hocko wrote:
[...]
> > > +	for_each_node_state(nid, N_ONLINE) {
> > > +		unsigned long min_pfn = 0, max_pfn = 0;
> > > +
> > > +		for_each_mem_pfn_range(i, nid, &start_pfn, &end_pfn, NULL) {
> > > +			if (!min_pfn)
> > > +				min_pfn = start_pfn;
> > > +			max_pfn = end_pfn;
> > > +		}
> > 
> > Do you want to compare the range to the size?
> 
> You mean add a check that the range is big enough?

Yes, the size, and I forgot to mention the alignment.

> > But besides that, I
> > believe this really needs to be much more careful. I believe you do not
> > want to eat a considerable part of the kernel memory because the
> > resulting configuration will really struggle (yeah all the low mem/high
> > mem problems all over again).
> 
> Well, so far I was focused on a particular case when the target cma size
> is significantly smaller than the total RAM size (~5-10%). What is the right
> thing to do here? Fallback to the current behavior if the requested size is
> more than x% of total memory? 1/2? How do you think?

I would start by excluding restricted kernel zones (<ZONE_NORMAL).
Cutting off 1G of ZONE_DMA32 might be a real problem.
 
> We've discussed it with Rik in private, and he expressed an idea to start
> with ~50% always and then shrink it on-demand. Something that we might
> have here long-term.

I would start simple. Simply make it a documented behavior. And if
somebody really cares enough then we can make something more clever.
Until then just avoid the restricted zones as mentioned above. This would
require a few changes: 1) allow fallback from CMA allocation failure, 2) do
not bail out of initialization on CMA reservation failure.
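
For reference, a minimal sketch of what change 1) could look like in
alloc_gigantic_page() -- illustration only, not part of the posted patch:
when the per-node cma areas are depleted (or absent on a node), the code
falls through to alloc_contig_pages() instead of returning NULL.

	if (IS_ENABLED(CONFIG_CMA) && hugetlb_cma[0]) {
		struct page *page;
		int node;

		for_each_node_mask(node, *nodemask) {
			if (!hugetlb_cma[node])
				continue;

			page = cma_alloc(hugetlb_cma[node], nr_pages,
					 huge_page_order(h), true);
			if (page)
				return page;
		}
		/* cma depleted: fall back to the regular contig allocator */
	}

	return alloc_contig_pages(nr_pages, gfp_mask, nid, nodemask);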
Mike Kravetz March 10, 2020, 5:38 p.m. UTC | #8
On 3/10/20 1:45 AM, Michal Hocko wrote:
> On Mon 09-03-20 17:25:24, Roman Gushchin wrote:
<snip>
>> +early_param("hugetlb_cma", cmdline_parse_hugetlb_cma);
>> +
>> +void __init hugetlb_cma_reserve(void)
>> +{
>> +	unsigned long totalpages = 0;
>> +	unsigned long start_pfn, end_pfn;
>> +	phys_addr_t size;
>> +	int nid, i, res;
>> +
>> +	if (!hugetlb_cma_size && !hugetlb_cma_percent)
>> +		return;
>> +
>> +	if (hugetlb_cma_percent) {
>> +		for_each_mem_pfn_range(i, MAX_NUMNODES, &start_pfn, &end_pfn,
>> +				       NULL)
>> +			totalpages += end_pfn - start_pfn;
>> +
>> +		size = PAGE_SIZE * (hugetlb_cma_percent * 100 * totalpages) /
>> +			10000UL;
>> +	} else {
>> +		size = hugetlb_cma_size;
>> +	}
>> +
>> +	pr_info("hugetlb_cma: reserve %llu, %llu per node\n", size,
>> +		size / nr_online_nodes);
>> +
>> +	size /= nr_online_nodes;
>> +
>> +	for_each_node_state(nid, N_ONLINE) {
>> +		unsigned long min_pfn = 0, max_pfn = 0;
>> +
>> +		for_each_mem_pfn_range(i, nid, &start_pfn, &end_pfn, NULL) {
>> +			if (!min_pfn)
>> +				min_pfn = start_pfn;
>> +			max_pfn = end_pfn;
>> +		}
> 
> Do you want to compare the range to the size? But besides that, I
> believe this really needs to be much more careful. I believe you do not
> want to eat a considerable part of the kernel memory because the
> resulting configuration will really struggle (yeah all the low mem/high
> mem problems all over again).

Will it struggle any worse than if we allocated the same amount of memory
for gigantic pages as is done today?  Of course, sys admins may think reserving
memory for CMA is better than pre-allocating and end up reserving a greater
amount.
Michal Hocko March 10, 2020, 5:39 p.m. UTC | #9
On Tue 10-03-20 10:30:56, Roman Gushchin wrote:
> On Tue, Mar 10, 2020 at 10:01:21AM +0100, Michal Hocko wrote:
> > On Mon 09-03-20 17:25:24, Roman Gushchin wrote:
> > [...]
> > > 2) Run-time allocations of gigantic hugepages are performed using the
> > >    cma allocator and the dedicated cma area
> > 
> > [...]
> > > @@ -1237,6 +1246,23 @@ static struct page *alloc_gigantic_page(struct hstate *h, gfp_t gfp_mask,
> > >  {
> > >  	unsigned long nr_pages = 1UL << huge_page_order(h);
> > >  
> > > +	if (IS_ENABLED(CONFIG_CMA) && hugetlb_cma[0]) {
> > > +		struct page *page;
> > > +		int nid;
> > > +
> > > +		for_each_node_mask(nid, *nodemask) {
> > > +			if (!hugetlb_cma[nid])
> > > +				break;
> > > +
> > > +			page = cma_alloc(hugetlb_cma[nid], nr_pages,
> > > +					 huge_page_order(h), true);
> > > +			if (page)
> > > +				return page;
> > > +		}
> > > +
> > > +		return NULL;
> > 
> > Is there any strong reason why the alloaction annot fallback to non-CMA
> > allocator when the cma is depleted?
> 
> The reason is that that gigantic pages allocated using cma require
> a special handling on releasing. It's solvable by using an additional
> page flag, but because the current code is usually not working except
> a short time just after the system start, I don't think it's worth it.

I am not deeply familiar with cma, TBH, but cma_release seems to
be documented to return false if the pages don't belong to the area, so
the free path can try cma_release and fall back to the regular free, no?
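
A minimal sketch of that free path -- illustration only, not part of the
posted patch: cma_release() returns false when the pages don't belong to
the given cma area (and tolerates a NULL area), so the fallback costs only
one extra check.

	static void free_gigantic_page(struct page *page, unsigned int order)
	{
		/*
		 * If the page was allocated from one of the hugetlb cma
		 * areas, cma_release() frees it and returns true; otherwise
		 * fall back to the regular contig range free.
		 */
		if (IS_ENABLED(CONFIG_CMA) &&
		    cma_release(hugetlb_cma[page_to_nid(page)], page, 1 << order))
			return;

		free_contig_range(page_to_pfn(page), 1 << order);
	}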
Michal Hocko March 10, 2020, 5:42 p.m. UTC | #10
On Tue 10-03-20 10:38:24, Mike Kravetz wrote:
> On 3/10/20 1:45 AM, Michal Hocko wrote:
> > On Mon 09-03-20 17:25:24, Roman Gushchin wrote:
> <snip>
> >> +early_param("hugetlb_cma", cmdline_parse_hugetlb_cma);
> >> +
> >> +void __init hugetlb_cma_reserve(void)
> >> +{
> >> +	unsigned long totalpages = 0;
> >> +	unsigned long start_pfn, end_pfn;
> >> +	phys_addr_t size;
> >> +	int nid, i, res;
> >> +
> >> +	if (!hugetlb_cma_size && !hugetlb_cma_percent)
> >> +		return;
> >> +
> >> +	if (hugetlb_cma_percent) {
> >> +		for_each_mem_pfn_range(i, MAX_NUMNODES, &start_pfn, &end_pfn,
> >> +				       NULL)
> >> +			totalpages += end_pfn - start_pfn;
> >> +
> >> +		size = PAGE_SIZE * (hugetlb_cma_percent * 100 * totalpages) /
> >> +			10000UL;
> >> +	} else {
> >> +		size = hugetlb_cma_size;
> >> +	}
> >> +
> >> +	pr_info("hugetlb_cma: reserve %llu, %llu per node\n", size,
> >> +		size / nr_online_nodes);
> >> +
> >> +	size /= nr_online_nodes;
> >> +
> >> +	for_each_node_state(nid, N_ONLINE) {
> >> +		unsigned long min_pfn = 0, max_pfn = 0;
> >> +
> >> +		for_each_mem_pfn_range(i, nid, &start_pfn, &end_pfn, NULL) {
> >> +			if (!min_pfn)
> >> +				min_pfn = start_pfn;
> >> +			max_pfn = end_pfn;
> >> +		}
> > 
> > Do you want to compare the range to the size? But besides that, I
> > believe this really needs to be much more careful. I believe you do not
> > want to eat a considerable part of the kernel memory because the
> > resulting configuration will really struggle (yeah all the low mem/high
> > mem problems all over again).
> 
> Will it struggle any worse than if the we allocated the same amount of memory
> for gigantic pages as is done today?  Of course, sys admins may think reserving
> memory for CMA is better than pre-allocating and end up reserving a greater
> amount.

Yes, the latter is my main concern. It requires a deep MM
understanding to realize what the lowmem problem is. Even those who
might be familiar consider it a 32b relic of the past. I have seen that
several times wrt. disproportionate ZONE_MOVABLE sizing already.
Roman Gushchin March 10, 2020, 5:58 p.m. UTC | #11
On Tue, Mar 10, 2020 at 06:39:51PM +0100, Michal Hocko wrote:
> On Tue 10-03-20 10:30:56, Roman Gushchin wrote:
> > On Tue, Mar 10, 2020 at 10:01:21AM +0100, Michal Hocko wrote:
> > > On Mon 09-03-20 17:25:24, Roman Gushchin wrote:
> > > [...]
> > > > 2) Run-time allocations of gigantic hugepages are performed using the
> > > >    cma allocator and the dedicated cma area
> > > 
> > > [...]
> > > > @@ -1237,6 +1246,23 @@ static struct page *alloc_gigantic_page(struct hstate *h, gfp_t gfp_mask,
> > > >  {
> > > >  	unsigned long nr_pages = 1UL << huge_page_order(h);
> > > >  
> > > > +	if (IS_ENABLED(CONFIG_CMA) && hugetlb_cma[0]) {
> > > > +		struct page *page;
> > > > +		int nid;
> > > > +
> > > > +		for_each_node_mask(nid, *nodemask) {
> > > > +			if (!hugetlb_cma[nid])
> > > > +				break;
> > > > +
> > > > +			page = cma_alloc(hugetlb_cma[nid], nr_pages,
> > > > +					 huge_page_order(h), true);
> > > > +			if (page)
> > > > +				return page;
> > > > +		}
> > > > +
> > > > +		return NULL;
> > > 
> > > Is there any strong reason why the alloaction annot fallback to non-CMA
> > > allocator when the cma is depleted?
> > 
> > The reason is that that gigantic pages allocated using cma require
> > a special handling on releasing. It's solvable by using an additional
> > page flag, but because the current code is usually not working except
> > a short time just after the system start, I don't think it's worth it.
> 
> I am not deeply familiar with the cma much TBH but cma_release seems to
> be documented to return false if the area doesn't belong to the area so
> the free patch can try cma_release and fallback to the regular free, no?

Good point! Then the fallback doesn't add too much complexity, so
I'll add it in the next version.

Thanks!
Roman Gushchin March 10, 2020, 6:05 p.m. UTC | #12
On Tue, Mar 10, 2020 at 10:27:01AM -0700, Mike Kravetz wrote:
> On 3/9/20 5:25 PM, Roman Gushchin wrote:
> > Commit 944d9fec8d7a ("hugetlb: add support for gigantic page allocation
> > at runtime") has added the run-time allocation of gigantic pages. However
> > it actually works only at early stages of the system loading, when
> > the majority of memory is free. After some time the memory gets
> > fragmented by non-movable pages, so the chances to find a contiguous
> > 1 GB block are getting close to zero. Even dropping caches manually
> > doesn't help a lot.
> > 
> > At large scale rebooting servers in order to allocate gigantic hugepages
> > is quite expensive and complex. At the same time keeping some constant
> > percentage of memory in reserved hugepages even if the workload isn't
> > using it is a big waste: not all workloads can benefit from using 1 GB
> > pages.
> > 
> > The following solution can solve the problem:
> > 1) On boot time a dedicated cma area* is reserved. The size is passed
> >    as a kernel argument.
> > 2) Run-time allocations of gigantic hugepages are performed using the
> >    cma allocator and the dedicated cma area
> > 
> > In this case gigantic hugepages can be allocated successfully with a
> > high probability, however the memory isn't completely wasted if nobody
> > is using 1GB hugepages: it can be used for pagecache, anon memory,
> > THPs, etc.
> > 
> > * On a multi-node machine a per-node cma area is allocated on each node.
> >   Following gigantic hugetlb allocation are using the first available
> >   numa node if the mask isn't specified by a user.
> > 
> > Usage:
> > 1) configure the kernel to allocate a cma area for hugetlb allocations:
> >    pass hugetlb_cma=10G as a kernel argument
> > 
> > 2) allocate hugetlb pages as usual, e.g.
> >    echo 10 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
> > 
> > If the option isn't enabled or the allocation of the cma area failed,
> > the current behavior of the system is preserved.
> > 
> > Only x86 is covered by this patch, but it's trivial to extend it to
> > cover other architectures as well.
> > 
> > v2: fixed !CONFIG_CMA build, suggested by Andrew Morton
> > 
> > Signed-off-by: Roman Gushchin <guro@fb.com>
> 
> Thanks!  I really like this idea.

Thank you!

> 
> > ---
> >  .../admin-guide/kernel-parameters.txt         |   7 ++
> >  arch/x86/kernel/setup.c                       |   3 +
> >  include/linux/hugetlb.h                       |   2 +
> >  mm/hugetlb.c                                  | 115 ++++++++++++++++++
> >  4 files changed, 127 insertions(+)
> > 
> <snip>
> > diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
> > index a74262c71484..ceeb06ddfd41 100644
> > --- a/arch/x86/kernel/setup.c
> > +++ b/arch/x86/kernel/setup.c
> > @@ -16,6 +16,7 @@
> >  #include <linux/pci.h>
> >  #include <linux/root_dev.h>
> >  #include <linux/sfi.h>
> > +#include <linux/hugetlb.h>
> >  #include <linux/tboot.h>
> >  #include <linux/usb/xhci-dbgp.h>
> >  
> > @@ -1158,6 +1159,8 @@ void __init setup_arch(char **cmdline_p)
> >  	initmem_init();
> >  	dma_contiguous_reserve(max_pfn_mapped << PAGE_SHIFT);
> >  
> > +	hugetlb_cma_reserve();
> > +
> 
> I know this is called from arch specific code here to fit in with the timing
> of CMA setup/reservation calls.  However, there really is nothing architecture
> specific about this functionality.  It would be great IMO if we could make
> this architecture independent.  However, I can not think of a straight forward
> way to do this.

I agree. Unfortunately I have no better idea than having an arch-dependent hook.

> 
> >  	/*
> >  	 * Reserve memory for crash kernel after SRAT is parsed so that it
> >  	 * won't consume hotpluggable memory.
> <snip>
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> <snip>
> > +void __init hugetlb_cma_reserve(void)
> > +{
> > +	unsigned long totalpages = 0;
> > +	unsigned long start_pfn, end_pfn;
> > +	phys_addr_t size;
> > +	int nid, i, res;
> > +
> > +	if (!hugetlb_cma_size && !hugetlb_cma_percent)
> > +		return;
> > +
> > +	if (hugetlb_cma_percent) {
> > +		for_each_mem_pfn_range(i, MAX_NUMNODES, &start_pfn, &end_pfn,
> > +				       NULL)
> > +			totalpages += end_pfn - start_pfn;
> > +
> > +		size = PAGE_SIZE * (hugetlb_cma_percent * 100 * totalpages) /
> > +			10000UL;
> > +	} else {
> > +		size = hugetlb_cma_size;
> > +	}
> > +
> > +	pr_info("hugetlb_cma: reserve %llu, %llu per node\n", size,
> > +		size / nr_online_nodes);
> > +
> > +	size /= nr_online_nodes;
> > +
> > +	for_each_node_state(nid, N_ONLINE) {
> > +		unsigned long min_pfn = 0, max_pfn = 0;
> > +
> > +		for_each_mem_pfn_range(i, nid, &start_pfn, &end_pfn, NULL) {
> > +			if (!min_pfn)
> > +				min_pfn = start_pfn;
> > +			max_pfn = end_pfn;
> > +		}
> > +
> > +		res = cma_declare_contiguous(PFN_PHYS(min_pfn), size,
> > +					     PFN_PHYS(max_pfn), (1UL << 30),
> 
> The alignment is hard coded for x86 gigantic page size.  If this supports
> more architectures or becomes arch independent we will need to determine
> what this alignment should be.  Perhaps an arch specific call back to get
> the alignment for gigantic pages.  That will require a little thought as
> some arch's support multiple gigantic page sizes.

Good point!
Should we take the biggest possible size as a reference?
Or the smallest (larger than MAX_ORDER)?

Thanks!
Rik van Riel March 10, 2020, 6:22 p.m. UTC | #13
On Tue, 2020-03-10 at 11:05 -0700, Roman Gushchin wrote:
> On Tue, Mar 10, 2020 at 10:27:01AM -0700, Mike Kravetz wrote:
> > 
> > > +	for_each_node_state(nid, N_ONLINE) {
> > > +		unsigned long min_pfn = 0, max_pfn = 0;
> > > +
> > > +		for_each_mem_pfn_range(i, nid, &start_pfn, &end_pfn,
> > > NULL) {
> > > +			if (!min_pfn)
> > > +				min_pfn = start_pfn;
> > > +			max_pfn = end_pfn;
> > > +		}
> > > +
> > > +		res = cma_declare_contiguous(PFN_PHYS(min_pfn), size,
> > > +					     PFN_PHYS(max_pfn), (1UL <<
> > > 30),
> > 
> > The alignment is hard coded for x86 gigantic page size.  If this
> > supports
> > more architectures or becomes arch independent we will need to
> > determine
> > what this alignment should be.  Perhaps an arch specific call back
> > to get
> > the alignment for gigantic pages.  That will require a little
> > thought as
> > some arch's support multiple gigantic page sizes.
> 
> Good point!
> Should we take the biggest possible size as a reference?
> Or the smallest (larger than MAX_ORDER)?

I would say biggest.

You can always allocate multiple smaller gigantic pages
inside the space reserved for one ginormous one.
Mike Kravetz March 10, 2020, 6:33 p.m. UTC | #14
On 3/10/20 11:05 AM, Roman Gushchin wrote:
> On Tue, Mar 10, 2020 at 10:27:01AM -0700, Mike Kravetz wrote:
>> On 3/9/20 5:25 PM, Roman Gushchin wrote:
>>> diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
>>> index a74262c71484..ceeb06ddfd41 100644
>>> --- a/arch/x86/kernel/setup.c
>>> +++ b/arch/x86/kernel/setup.c
>>> @@ -16,6 +16,7 @@
>>>  #include <linux/pci.h>
>>>  #include <linux/root_dev.h>
>>>  #include <linux/sfi.h>
>>> +#include <linux/hugetlb.h>
>>>  #include <linux/tboot.h>
>>>  #include <linux/usb/xhci-dbgp.h>
>>>  
>>> @@ -1158,6 +1159,8 @@ void __init setup_arch(char **cmdline_p)
>>>  	initmem_init();
>>>  	dma_contiguous_reserve(max_pfn_mapped << PAGE_SHIFT);
>>>  
>>> +	hugetlb_cma_reserve();
>>> +
>>
>> I know this is called from arch specific code here to fit in with the timing
>> of CMA setup/reservation calls.  However, there really is nothing architecture
>> specific about this functionality.  It would be great IMO if we could make
>> this architecture independent.  However, I can not think of a straight forward
>> way to do this.
> 
> I agree. Unfortunately I have no better idea than having an arch-dependent hook.
> 
>>
>>>  	/*
>>>  	 * Reserve memory for crash kernel after SRAT is parsed so that it
>>>  	 * won't consume hotpluggable memory.
>> <snip>
>>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>> <snip>
>>> +void __init hugetlb_cma_reserve(void)
>>> +{
>>> +	unsigned long totalpages = 0;
>>> +	unsigned long start_pfn, end_pfn;
>>> +	phys_addr_t size;
>>> +	int nid, i, res;
>>> +
>>> +	if (!hugetlb_cma_size && !hugetlb_cma_percent)
>>> +		return;
>>> +
>>> +	if (hugetlb_cma_percent) {
>>> +		for_each_mem_pfn_range(i, MAX_NUMNODES, &start_pfn, &end_pfn,
>>> +				       NULL)
>>> +			totalpages += end_pfn - start_pfn;
>>> +
>>> +		size = PAGE_SIZE * (hugetlb_cma_percent * 100 * totalpages) /
>>> +			10000UL;
>>> +	} else {
>>> +		size = hugetlb_cma_size;
>>> +	}
>>> +
>>> +	pr_info("hugetlb_cma: reserve %llu, %llu per node\n", size,
>>> +		size / nr_online_nodes);
>>> +
>>> +	size /= nr_online_nodes;
>>> +
>>> +	for_each_node_state(nid, N_ONLINE) {
>>> +		unsigned long min_pfn = 0, max_pfn = 0;
>>> +
>>> +		for_each_mem_pfn_range(i, nid, &start_pfn, &end_pfn, NULL) {
>>> +			if (!min_pfn)
>>> +				min_pfn = start_pfn;
>>> +			max_pfn = end_pfn;
>>> +		}
>>> +
>>> +		res = cma_declare_contiguous(PFN_PHYS(min_pfn), size,
>>> +					     PFN_PHYS(max_pfn), (1UL << 30),
>>
>> The alignment is hard coded for x86 gigantic page size.  If this supports
>> more architectures or becomes arch independent we will need to determine
>> what this alignment should be.  Perhaps an arch specific call back to get
>> the alignment for gigantic pages.  That will require a little thought as
>> some arch's support multiple gigantic page sizes.
> 
> Good point!
> Should we take the biggest possible size as a reference?
> Or the smallest (larger than MAX_ORDER)?

As mentioned, it is pretty simple for architectures like x86 that only
have one gigantic page size.  Just a random thought, but since
hugetlb_cma_reserve is called from arch specific code perhaps the arch
code could pass in at least alignment as an argument to this function?
That way we can somewhat push the issue to the architectures.  For example,
power supports 16GB gigantic page size but I believe they are allocated
via firmware somehow.  So, they would not pass 16G as alignment.  In this
case setup of the CMA area is somewhat architecture dependent.  So, perhaps
the call from arch specific code is not such a bad idea.

With that in mind, we may want some type of coordination between arch
specific and independent code.  Right now, cmdline_parse_hugetlb_cma
will accept a hugetlb_cma command line option without complaint even
if the architecture does not call hugetlb_cma_reserve.

Just a nit, but cma_declare_contiguous is going to round up the size by the
alignment.  So, the actual reserved size may not match what is printed with,
+		pr_info("hugetlb_cma: successfully reserved %llu on node %d\n",
+			size, nid);

I found this out by testing code and specifying hugetlb_cma=2M.  Messages
in log were:
	kernel: hugetlb_cma: reserve 2097152, 1048576 per node
	kernel: hugetlb_cma: successfully reserved 1048576 on node 0
	kernel: hugetlb_cma: successfully reserved 1048576 on node 1
But, it really reserved 1GB per node.
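
For illustration, one way the two points above could be addressed -- a rough
sketch only, where the new signature and the use of cma_get_size() are
assumptions rather than part of the posted patch: the architecture passes
the gigantic page order in, and the per-node message is printed after
cma_declare_contiguous() so it reflects the rounded-up reservation.

	/* mm/hugetlb.c: alignment supplied by the caller */
	void __init hugetlb_cma_reserve(int order)
	{
		/* ... size computation and per-node loop as in the patch ... */
		res = cma_declare_contiguous(PFN_PHYS(min_pfn), size,
					     PFN_PHYS(max_pfn),
					     PAGE_SIZE << order, 0, false,
					     "hugetlb", &hugetlb_cma[nid]);
		if (!res)
			pr_info("hugetlb_cma: reserved %lu MiB on node %d\n",
				cma_get_size(hugetlb_cma[nid]) >> 20, nid);
	}

	/* arch/x86/kernel/setup.c: 1 GB (PUD-sized) gigantic pages */
	hugetlb_cma_reserve(PUD_SHIFT - PAGE_SHIFT);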
Andreas Schaufler March 10, 2020, 6:54 p.m. UTC | #15
Hi all,

sorry for jumping into this discussion, but this is exactly the feature
I am currently looking for on ARM64. You may browse
linux-arm-msm@vger.kernel.org for a message I posted there recently.

I read somewhere in this thread that it should be easily portable to
architectures other than x86. I am ready to give it a try, but I would
need a few pointers on where to hook into the ARM64 code.

Thanks and best regards
Andreas

On 10.03.2020 19:33, Mike Kravetz wrote:
> On 3/10/20 11:05 AM, Roman Gushchin wrote:
>> On Tue, Mar 10, 2020 at 10:27:01AM -0700, Mike Kravetz wrote:
>>> On 3/9/20 5:25 PM, Roman Gushchin wrote:
>>>> diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
>>>> index a74262c71484..ceeb06ddfd41 100644
>>>> --- a/arch/x86/kernel/setup.c
>>>> +++ b/arch/x86/kernel/setup.c
>>>> @@ -16,6 +16,7 @@
>>>>  #include <linux/pci.h>
>>>>  #include <linux/root_dev.h>
>>>>  #include <linux/sfi.h>
>>>> +#include <linux/hugetlb.h>
>>>>  #include <linux/tboot.h>
>>>>  #include <linux/usb/xhci-dbgp.h>
>>>>
>>>> @@ -1158,6 +1159,8 @@ void __init setup_arch(char **cmdline_p)
>>>>  	initmem_init();
>>>>  	dma_contiguous_reserve(max_pfn_mapped << PAGE_SHIFT);
>>>>
>>>> +	hugetlb_cma_reserve();
>>>> +
>>>
>>> I know this is called from arch specific code here to fit in with the timing
>>> of CMA setup/reservation calls.  However, there really is nothing architecture
>>> specific about this functionality.  It would be great IMO if we could make
>>> this architecture independent.  However, I can not think of a straight forward
>>> way to do this.
>>
>> I agree. Unfortunately I have no better idea than having an arch-dependent hook.
>>
>>>
>>>>  	/*
>>>>  	 * Reserve memory for crash kernel after SRAT is parsed so that it
>>>>  	 * won't consume hotpluggable memory.
>>> <snip>
>>>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>>> <snip>
>>>> +void __init hugetlb_cma_reserve(void)
>>>> +{
>>>> +	unsigned long totalpages = 0;
>>>> +	unsigned long start_pfn, end_pfn;
>>>> +	phys_addr_t size;
>>>> +	int nid, i, res;
>>>> +
>>>> +	if (!hugetlb_cma_size && !hugetlb_cma_percent)
>>>> +		return;
>>>> +
>>>> +	if (hugetlb_cma_percent) {
>>>> +		for_each_mem_pfn_range(i, MAX_NUMNODES, &start_pfn, &end_pfn,
>>>> +				       NULL)
>>>> +			totalpages += end_pfn - start_pfn;
>>>> +
>>>> +		size = PAGE_SIZE * (hugetlb_cma_percent * 100 * totalpages) /
>>>> +			10000UL;
>>>> +	} else {
>>>> +		size = hugetlb_cma_size;
>>>> +	}
>>>> +
>>>> +	pr_info("hugetlb_cma: reserve %llu, %llu per node\n", size,
>>>> +		size / nr_online_nodes);
>>>> +
>>>> +	size /= nr_online_nodes;
>>>> +
>>>> +	for_each_node_state(nid, N_ONLINE) {
>>>> +		unsigned long min_pfn = 0, max_pfn = 0;
>>>> +
>>>> +		for_each_mem_pfn_range(i, nid, &start_pfn, &end_pfn, NULL) {
>>>> +			if (!min_pfn)
>>>> +				min_pfn = start_pfn;
>>>> +			max_pfn = end_pfn;
>>>> +		}
>>>> +
>>>> +		res = cma_declare_contiguous(PFN_PHYS(min_pfn), size,
>>>> +					     PFN_PHYS(max_pfn), (1UL << 30),
>>>
>>> The alignment is hard coded for x86 gigantic page size.  If this supports
>>> more architectures or becomes arch independent we will need to determine
>>> what this alignment should be.  Perhaps an arch specific call back to get
>>> the alignment for gigantic pages.  That will require a little thought as
>>> some arch's support multiple gigantic page sizes.
>>
>> Good point!
>> Should we take the biggest possible size as a reference?
>> Or the smallest (larger than MAX_ORDER)?
>
> As mentioned, it is pretty simple for architectures like x86 that only
> have one gigantic page size.  Just a random thought, but since
> hugetlb_cma_reserve is called from arch specific code perhaps the arch
> code could pass in at least alignment as an argument to this function?
> That way we can somewhat push the issue to the architectures.  For example,
> power supports 16GB gigantic page size but I believe they are allocated
> via firmware somehow.  So, they would not pass 16G as alignment.  In this
> case setup of the CMA area is somewhat architecture dependent.  So, perhaps
> the call from arch specific code is not such a bad idea.
>
> With that in mind, we may want some type of coordination between arch
> specific and independent code.  Right now, cmdline_parse_hugetlb_cma
> is will accept a hugetlb_cma command line option without complaint even
> if the architecture does not call hugetlb_cma_reserve.
>
> Just a nit, but cma_declare_contiguous if going to round up size by alignment.  So, the actual reserved size may not match what is printed with,
> +		pr_info("hugetlb_cma: successfully reserved %llu on node %d\n",
> +			size, nid);
>
> I found this out by testing code and specifying hugetlb_cma=2M.  Messages
> in log were:
> 	kernel: hugetlb_cma: reserve 2097152, 1048576 per node
> 	kernel: hugetlb_cma: successfully reserved 1048576 on node 0
> 	kernel: hugetlb_cma: successfully reserved 1048576 on node 1
> But, it really reserved 1GB per node.
>
Roman Gushchin March 10, 2020, 6:56 p.m. UTC | #16
On Tue, Mar 10, 2020 at 07:54:58PM +0100, Andreas Schaufler wrote:
> Hi all,
> 
> sorry for bumping into this discussion, but this is exactly the feature
> I am currently looking for on ARM64. You may browse under
> linux-arm-msm@vger.kernel.org for a message I posted there recently.
> 
> I read somewhere in this thread that it should be easily portable to
> other architectures than x86. I am ready to give it a try, but I would
> need a few pointers where to hook in in the ARM64 code.

I'll try to add arm support in the next version and would really appreciate
it if you could test it.
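
Roughly (untested, exact placement to be double-checked), I'd expect it to
boil down to calling hugetlb_cma_reserve() from the arm64 boot code once the
node online map is set up, e.g. after arm64_numa_init() in bootmem_init()
in arch/arm64/mm/init.c:

	arm64_numa_init();

	/* needs the node online map, hence after arm64_numa_init() */
	hugetlb_cma_reserve();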

Thanks!
Andreas Schaufler March 10, 2020, 7 p.m. UTC | #17
On 10.03.2020 19:56, Roman Gushchin wrote:
> On Tue, Mar 10, 2020 at 07:54:58PM +0100, Andreas Schaufler wrote:
>> Hi all,
>>
>> sorry for bumping into this discussion, but this is exactly the feature
>> I am currently looking for on ARM64. You may browse under
>> linux-arm-msm@vger.kernel.org for a message I posted there recently.
>>
>> I read somewhere in this thread that it should be easily portable to
>> architectures other than x86. I am ready to give it a try, but I would
>> need a few pointers where to hook in in the ARM64 code.
>
> I'll try to add arm support in the next version and would really appreciate
> it if you could test it.
>
> Thanks!
>

Excellent. Thanks. Just let me know.
Roman Gushchin March 10, 2020, 7:19 p.m. UTC | #18
On Tue, Mar 10, 2020 at 11:33:47AM -0700, Mike Kravetz wrote:
> On 3/10/20 11:05 AM, Roman Gushchin wrote:
> > On Tue, Mar 10, 2020 at 10:27:01AM -0700, Mike Kravetz wrote:
> >> On 3/9/20 5:25 PM, Roman Gushchin wrote:
> >>> diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
> >>> index a74262c71484..ceeb06ddfd41 100644
> >>> --- a/arch/x86/kernel/setup.c
> >>> +++ b/arch/x86/kernel/setup.c
> >>> @@ -16,6 +16,7 @@
> >>>  #include <linux/pci.h>
> >>>  #include <linux/root_dev.h>
> >>>  #include <linux/sfi.h>
> >>> +#include <linux/hugetlb.h>
> >>>  #include <linux/tboot.h>
> >>>  #include <linux/usb/xhci-dbgp.h>
> >>>  
> >>> @@ -1158,6 +1159,8 @@ void __init setup_arch(char **cmdline_p)
> >>>  	initmem_init();
> >>>  	dma_contiguous_reserve(max_pfn_mapped << PAGE_SHIFT);
> >>>  
> >>> +	hugetlb_cma_reserve();
> >>> +
> >>
> >> I know this is called from arch specific code here to fit in with the timing
> >> of CMA setup/reservation calls.  However, there really is nothing architecture
> >> specific about this functionality.  It would be great IMO if we could make
> >> this architecture independent.  However, I cannot think of a straightforward
> >> way to do this.
> > 
> > I agree. Unfortunately I have no better idea than having an arch-dependent hook.
> > 
> >>
> >>>  	/*
> >>>  	 * Reserve memory for crash kernel after SRAT is parsed so that it
> >>>  	 * won't consume hotpluggable memory.
> >> <snip>
> >>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> >> <snip>
> >>> +void __init hugetlb_cma_reserve(void)
> >>> +{
> >>> +	unsigned long totalpages = 0;
> >>> +	unsigned long start_pfn, end_pfn;
> >>> +	phys_addr_t size;
> >>> +	int nid, i, res;
> >>> +
> >>> +	if (!hugetlb_cma_size && !hugetlb_cma_percent)
> >>> +		return;
> >>> +
> >>> +	if (hugetlb_cma_percent) {
> >>> +		for_each_mem_pfn_range(i, MAX_NUMNODES, &start_pfn, &end_pfn,
> >>> +				       NULL)
> >>> +			totalpages += end_pfn - start_pfn;
> >>> +
> >>> +		size = PAGE_SIZE * (hugetlb_cma_percent * 100 * totalpages) /
> >>> +			10000UL;
> >>> +	} else {
> >>> +		size = hugetlb_cma_size;
> >>> +	}
> >>> +
> >>> +	pr_info("hugetlb_cma: reserve %llu, %llu per node\n", size,
> >>> +		size / nr_online_nodes);
> >>> +
> >>> +	size /= nr_online_nodes;
> >>> +
> >>> +	for_each_node_state(nid, N_ONLINE) {
> >>> +		unsigned long min_pfn = 0, max_pfn = 0;
> >>> +
> >>> +		for_each_mem_pfn_range(i, nid, &start_pfn, &end_pfn, NULL) {
> >>> +			if (!min_pfn)
> >>> +				min_pfn = start_pfn;
> >>> +			max_pfn = end_pfn;
> >>> +		}
> >>> +
> >>> +		res = cma_declare_contiguous(PFN_PHYS(min_pfn), size,
> >>> +					     PFN_PHYS(max_pfn), (1UL << 30),
> >>
> >> The alignment is hard coded for x86 gigantic page size.  If this supports
> >> more architectures or becomes arch independent we will need to determine
> >> what this alignment should be.  Perhaps an arch specific call back to get
> >> the alignment for gigantic pages.  That will require a little thought as
> >> some arch's support multiple gigantic page sizes.
> > 
> > Good point!
> > Should we take the biggest possible size as a reference?
> > Or the smallest (larger than MAX_ORDER)?
> 
> As mentioned, it is pretty simple for architectures like x86 that only
> have one gigantic page size.  Just a random thought, but since
> hugetlb_cma_reserve is called from arch specific code perhaps the arch
> code could pass in at least alignment as an argument to this function?
> That way we can somewhat push the issue to the architectures.  For example,
> power supports 16GB gigantic page size but I believe they are allocated
> via firmware somehow.  So, they would not pass 16G as alignment.  In this
> case setup of the CMA area is somewhat architecture dependent.  So, perhaps
> the call from arch specific code is not such a bad idea.
> 
> With that in mind, we may want some type of coordination between arch
> specific and independent code.  Right now, cmdline_parse_hugetlb_cma
> is will accept a hugetlb_cma command line option without complaint even
> if the architecture does not call hugetlb_cma_reserve.
> 
> Just a nit, but cma_declare_contiguous is going to round up size by
> alignment.  So, the actual reserved size may not match what is printed with,
> +		pr_info("hugetlb_cma: successfully reserved %llu on node %d\n",
> +			size, nid);
> 
> I found this out by testing code and specifying hugetlb_cma=2M.  Messages
> in log were:
> 	kernel: hugetlb_cma: reserve 2097152, 1048576 per node
> 	kernel: hugetlb_cma: successfully reserved 1048576 on node 0
> 	kernel: hugetlb_cma: successfully reserved 1048576 on node 1
> But, it really reserved 1GB per node.

Good point! If the passed size is too small to cover a single huge page,
we should probably print a warning and bail out.
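
Something along these lines, perhaps (untested sketch; 1 GB hard-coded, as
in the rest of this version):

	/* bail out if the per-node area can't hold even one gigantic page */
	if (size / nr_online_nodes < (1UL << 30)) {
		pr_warn("hugetlb_cma: %llu is too small, ignoring\n",
			(unsigned long long)size);
		return;
	}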

Will fix in the next version.

Thanks!
Michal Hocko March 10, 2020, 7:36 p.m. UTC | #19
On Tue 10-03-20 12:19:06, Roman Gushchin wrote:
[...]
> > I found this out by testing code and specifying hugetlb_cma=2M.  Messages
> > in log were:
> > 	kernel: hugetlb_cma: reserve 2097152, 1048576 per node
> > 	kernel: hugetlb_cma: successfully reserved 1048576 on node 0
> > 	kernel: hugetlb_cma: successfully reserved 1048576 on node 1
> > But, it really reserved 1GB per node.
> 
> Good point! If the passed size is too small to cover a single huge page,
> we should probably print a warning and bail out.

Or maybe you just want to make the interface the unit size rather than
overall size oriented. E.g. I want 10G pages per each numa node.
Rik van Riel March 10, 2020, 7:46 p.m. UTC | #20
On Tue, 2020-03-10 at 20:36 +0100, Michal Hocko wrote:
> On Tue 10-03-20 12:19:06, Roman Gushchin wrote:
> [...]
> > > I found this out by testing code and specifying
> > > hugetlb_cma=2M.  Messages
> > > in log were:
> > > 	kernel: hugetlb_cma: reserve 2097152, 1048576 per node
> > > 	kernel: hugetlb_cma: successfully reserved 1048576 on node 0
> > > 	kernel: hugetlb_cma: successfully reserved 1048576 on node 1
> > > But, it really reserved 1GB per node.
> > 
> > Good point! If the passed size is too small to cover a single huge
> > page,
> > we should probably print a warning and bail out.
> 
> Or maybe you just want to make the interface the unit size rather
> than
> overall size oriented. E.g. I want 10G pages per each numa node.

How would that work for architectures that have multiple
possible hugetlbfs gigantic page sizes, where the admin
can allocate different numbers of differently sized pages
after bootup?
Mike Kravetz March 10, 2020, 8:11 p.m. UTC | #21
On 3/10/20 12:46 PM, Rik van Riel wrote:
> On Tue, 2020-03-10 at 20:36 +0100, Michal Hocko wrote:
>> On Tue 10-03-20 12:19:06, Roman Gushchin wrote:
>> [...]
>>>> I found this out by testing code and specifying
>>>> hugetlb_cma=2M.  Messages
>>>> in log were:
>>>> 	kernel: hugetlb_cma: reserve 2097152, 1048576 per node
>>>> 	kernel: hugetlb_cma: successfully reserved 1048576 on node 0
>>>> 	kernel: hugetlb_cma: successfully reserved 1048576 on node 1
>>>> But, it really reserved 1GB per node.
>>>
>>> Good point! If the passed size is too small to cover a single huge
>>> page,
>>> we should probably print a warning and bail out.
>>
>> Or maybe you just want to make the interface the unit size rather
>> than
>> overall size oriented. E.g. I want 10G pages per each numa node.
> 
> How would that work for architectures that have multiple
> possible hugetlbfs gigantic page sizes, where the admin
> can allocate different numbers of differently sized pages
> after bootup?

For hugetlb page reservations at boot today, pairs specifying size and
quantity are put on the command line.  For example,
hugepagesz=2M hugepages=512 hugepagesz=1G hugepages=64

We could do something similar for CMA.
hugepagesz=512M hugepages_cma=256 hugepagesz=1G hugepages_cma=64

That would make things much more complicated (implies separate CMA
reservations per size) and may be overkill for the first implementation.

Perhaps we limit CMA reservations to one gigantic huge page size.  The
architectures would need to define the default and there could be a
command line option to override.  Something like,
default_cmapagesz=  analogous to today's default_hugepagesz=.  Then
hugepages_cma= is only associated with that default gigantic huge page
size.

The more I think about it, the more I like limiting CMA reservations to
only one gigantic huge page size (per arch).
Rik van Riel March 10, 2020, 8:15 p.m. UTC | #22
On Tue, 2020-03-10 at 13:11 -0700, Mike Kravetz wrote:
> On 3/10/20 12:46 PM, Rik van Riel wrote:
> > 
> > How would that work for architectures that have multiple
> > possible hugetlbfs gigantic page sizes, where the admin
> > can allocate different numbers of differently sized pages
> > after bootup?
> 
> For hugetlb page reservations at boot today, pairs specifying size
> and
> quantity are put on the command line.  For example,
> hugepagesz=2M hugepages=512 hugepagesz=1G hugepages=64
> 
> We could do something similar for CMA.
> hugepagesz=512M hugepages_cma=256 hugepagesz=1G hugepages_cma=64
> 
> That would make things much more complicated (implies separate CMA
> reservations per size) and may be overkill for the first
> implementation.
> 
> Perhaps we limit CMA reservations to one gigantic huge page
> size.  The
> architectures would need to define the default and there could be a
> command line option to override.  Something like,
> default_cmapagesz=  analogous to today's default_hugepagesz=.  Then
> hugepages_cma= is only associated with that default gigantic huge
> page
> size.
> 
> The more I think about it, the more I like limiting CMA reservations
> to
> only one gigantic huge page size (per arch).

Why, though?

The cma_alloc function can return allocations of different
sizes at the same time.

There is no limitation in the underlying code that would stop
a user from allocating hugepages of different sizes through
sysfs.

Allowing the system administrator to allocate a little extra
memory for the CMA pool could also allow us to work around
initial issues of compaction/migration failing to move some
of the pages, while we play whack-a-mole with the last corner
cases.
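
For illustration (order_a and order_b here are just two hypothetical gigantic
page orders), the alloc path in the patch would end up doing the equivalent
of:

	/* two different gigantic orders served from the same per-node area */
	page_a = cma_alloc(hugetlb_cma[nid], 1UL << order_a, order_a, true);
	page_b = cma_alloc(hugetlb_cma[nid], 1UL << order_b, order_b, true);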
Mike Kravetz March 10, 2020, 8:29 p.m. UTC | #23
On 3/10/20 1:15 PM, Rik van Riel wrote:
> On Tue, 2020-03-10 at 13:11 -0700, Mike Kravetz wrote:
>> On 3/10/20 12:46 PM, Rik van Riel wrote:
>>>
>>> How would that work for architectures that have multiple
>>> possible hugetlbfs gigantic page sizes, where the admin
>>> can allocate different numbers of differently sized pages
>>> after bootup?
>>
>> For hugetlb page reservations at boot today, pairs specifying size
>> and
>> quantity are put on the command line.  For example,
>> hugepagesz=2M hugepages=512 hugepagesz=1G hugepages=64
>>
>> We could do something similar for CMA.
>> hugepagesz=512M hugepages_cma=256 hugepagesz=1G hugepages_cma=64
>>
>> That would make things much more complicated (implies separate CMA
>> reservations per size) and may be overkill for the first
>> implementation.
>>
>> Perhaps we limit CMA reservations to one gigantic huge page
>> size.  The
>> architectures would need to define the default and there could be a
>> command line option to override.  Something like,
>> default_cmapagesz=  analogous to today's default_hugepagesz=.  Then
>> hugepages_cma= is only associated with that default gigantic huge
>> page
>> size.
>>
>> The more I think about it, the more I like limiting CMA reservations
>> to
>> only one gigantic huge page size (per arch).
> 
> Why, though?
> 
> The cma_alloc function can return allocations of different
> sizes at the same time.
> 
> There is no limitation in the underlying code that would stop
> a user from allocating hugepages of different sizes through
> sysfs.

True, there is no technical reason.

I was only trying to simplify the setup and answer the outstanding questions.
- What alignment to use for reservations?
- What is minimum size of reservations?

If only one gigantic page size is supported, the answer is simple.  In any
case, I think input from arch specific code will be needed.
Roman Gushchin March 10, 2020, 8:29 p.m. UTC | #24
On Tue, Mar 10, 2020 at 04:15:18PM -0400, Rik van Riel wrote:
> On Tue, 2020-03-10 at 13:11 -0700, Mike Kravetz wrote:
> > On 3/10/20 12:46 PM, Rik van Riel wrote:
> > > 
> > > How would that work for architectures that have multiple
> > > possible hugetlbfs gigantic page sizes, where the admin
> > > can allocate different numbers of differently sized pages
> > > after bootup?
> > 
> > For hugetlb page reservations at boot today, pairs specifying size
> > and
> > quantity are put on the command line.  For example,
> > hugepagesz=2M hugepages=512 hugepagesz=1G hugepages=64
> > 
> > We could do something similar for CMA.
> > hugepagesz=512M hugepages_cma=256 hugepagesz=1G hugepages_cma=64
> > 
> > That would make things much more complicated (implies separate CMA
> > reservations per size) and may be overkill for the first
> > implementation.
> > 
> > Perhaps we limit CMA reservations to one gigantic huge page
> > size.  The
> > architectures would need to define the default and there could be a
> > command line option to override.  Something like,
> > default_cmapagesz=  analogous to today's default_hugepagesz=.  Then
> > hugepages_cma= is only associated with that default gigantic huge
> > page
> > size.
> > 
> > The more I think about it, the more I like limiting CMA reservations
> > to
> > only one gigantic huge page size (per arch).
> 
> Why, though?
> 
> The cma_alloc function can return allocations of different
> sizes at the same time.
> 
> There is no limitation in the underlying code that would stop
> a user from allocating hugepages of different sizes through
> sysfs.
> 
> Allowing the system administrator to allocate a little extra
> memory for the CMA pool could also allow us to work around
> initial issues of compaction/migration failing to move some
> of the pages, while we play whack-a-mole with the last corner
> cases.

I agree. Because the cma area can be effectively used by userspace
applications even without allocating gigantic pages, there is no
need to be very greedy on setting the cma size (unlike allocating
1 GB pages on boot, where every unused page is a wasted 1 GB).
Rik van Riel March 10, 2020, 8:38 p.m. UTC | #25
On Tue, 2020-03-10 at 13:29 -0700, Mike Kravetz wrote:
> On 3/10/20 1:15 PM, Rik van Riel wrote:
> > On Tue, 2020-03-10 at 13:11 -0700, Mike Kravetz wrote:
> > > the more I think about it, the more I like limiting CMA
> > > reservations
> > > to
> > > only one gigantic huge page size (per arch).
> > 
> > Why, though?
> > 
> > The cma_alloc function can return allocations of different
> > sizes at the same time.
> > 
> > There is no limitation in the underlying code that would stop
> > a user from allocating hugepages of different sizes through
> > sysfs.
> 
> True, there is no technical reason.
> 
> I was only trying to simplify the setup and answer the outstanding
> questions.
> - What alignment to use for reservations?

Alignment can be the largest hugepage size the system
supports, assuming the amount of memory set aside is
at least this large?

> - What is minimum size of reservations?

One good thing is that it isn't really a reservation,
since the memory can still be used for things like page
cache and anonymous memory, so if too much is reserved
the memory does not go to waste.

One of my follow-up projects to Roman's project will be
to get THP & khugepaged to effectively use memory in the
hugepage CMA area, too.

That way when a system is running a workload that is not
using 1GB hugetlbfs pages, that easily defragmentable memory
pool will still bring some benefits to the workload.

> If only one gigantic page size is supported, the answer is
> simple.  In any
> case, I think input from arch specific code will be needed.

Agreed.
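
As a rough sketch of that direction (made-up prototype, not in the posted
patch), the arch could simply pass in the gigantic page order it wants the
area sized and aligned for:

	/* mm/hugetlb.c */
	void __init hugetlb_cma_reserve(int order);

	/* arch/x86/kernel/setup.c: 1 GB pages */
	hugetlb_cma_reserve(PUD_SHIFT - PAGE_SHIFT);

with the hard-coded (1UL << 30) alignment becoming PAGE_SIZE << order.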
Rik van Riel March 16, 2020, 1:08 a.m. UTC | #26
On Tue, 2020-03-10 at 18:37 +0100, Michal Hocko wrote:
> On Tue 10-03-20 10:25:59, Roman Gushchin wrote:
> > Well, so far I was focused on a particular case when the target cma
> > size
> > is significantly smaller than the total RAM size (~5-10%). What is
> > the right
> > thing to do here? Fallback to the current behavior if the requested
> > size is
> > more than x% of total memory? 1/2? How do you think?
> 
> I would start by excluding restricted kernel zones (<ZONE_NORMAL).
> Cutting off 1G of ZONE_DMA32 might be a real problem.

It looks like memblock_find_in_range_node(), which
is called from memblock_alloc_range_nid(), will already
do top-down allocation inside each node.

However, looking at that code some more, it has some
limitations that we might not want. Specifically, if
we want to allocate, say, a 16GB CMA area, but
the node in question only has a 15GB available area
in one spot and a 1GB available area in another spot,
for example due to memory holes, the allocation will fail.

I wonder if it makes sense to have separate cma_declare_contiguous
calls for each 1GB page we set up. That way it will be easier
to round-robin between the ZONE_NORMAL zones in each node, and
also to avoid the ZONE_DMA32 and other special nodes on systems
where those are a relatively small part of memory.

I'll whip up a patch to do that.
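
Very roughly, something like this (untested; pages_per_node and the
two-dimensional hugetlb_cma[][] are made up for the sketch, and the zone
filtering mentioned above is left out):

	for_each_node_state(nid, N_ONLINE) {
		/* min_pfn/max_pfn computed per node, as in the current loop */
		for (i = 0; i < pages_per_node; i++) {
			res = cma_declare_contiguous(PFN_PHYS(min_pfn),
						     1UL << 30, PFN_PHYS(max_pfn),
						     1UL << 30, 0, false,
						     "hugetlb",
						     &hugetlb_cma[nid][i]);
			if (res)
				break;
		}
	}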

Patch

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 0c9894247015..d3349ec1dbef 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -1452,6 +1452,13 @@ 
 	hpet_mmap=	[X86, HPET_MMAP] Allow userspace to mmap HPET
 			registers.  Default set by CONFIG_HPET_MMAP_DEFAULT.
 
+	hugetlb_cma=	[x86-64] The size of a cma area used for allocation
+			of gigantic hugepages.
+			Format: nn[GTPE] | nn%
+
+			If enabled, boot-time allocation of gigantic hugepages
+			is skipped.
+
 	hugepages=	[HW,X86-32,IA-64] HugeTLB pages to allocate at boot.
 	hugepagesz=	[HW,IA-64,PPC,X86-64] The size of the HugeTLB pages.
 			On x86-64 and powerpc, this option can be specified
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index a74262c71484..ceeb06ddfd41 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -16,6 +16,7 @@ 
 #include <linux/pci.h>
 #include <linux/root_dev.h>
 #include <linux/sfi.h>
+#include <linux/hugetlb.h>
 #include <linux/tboot.h>
 #include <linux/usb/xhci-dbgp.h>
 
@@ -1158,6 +1159,8 @@  void __init setup_arch(char **cmdline_p)
 	initmem_init();
 	dma_contiguous_reserve(max_pfn_mapped << PAGE_SHIFT);
 
+	hugetlb_cma_reserve();
+
 	/*
 	 * Reserve memory for crash kernel after SRAT is parsed so that it
 	 * won't consume hotpluggable memory.
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 50480d16bd33..50050c981ab9 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -157,6 +157,8 @@  pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud);
 extern int sysctl_hugetlb_shm_group;
 extern struct list_head huge_boot_pages;
 
+extern void __init hugetlb_cma_reserve(void);
+
 /* arch callbacks */
 
 pte_t *huge_pte_alloc(struct mm_struct *mm,
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 7fb31750e670..c6f58bab879c 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -28,6 +28,7 @@ 
 #include <linux/jhash.h>
 #include <linux/numa.h>
 #include <linux/llist.h>
+#include <linux/cma.h>
 
 #include <asm/page.h>
 #include <asm/pgtable.h>
@@ -44,6 +45,9 @@ 
 int hugetlb_max_hstate __read_mostly;
 unsigned int default_hstate_idx;
 struct hstate hstates[HUGE_MAX_HSTATE];
+
+static struct cma *hugetlb_cma[MAX_NUMNODES];
+
 /*
  * Minimum page order among possible hugepage sizes, set to a proper value
  * at boot time.
@@ -1228,6 +1232,11 @@  static void destroy_compound_gigantic_page(struct page *page,
 
 static void free_gigantic_page(struct page *page, unsigned int order)
 {
+	if (IS_ENABLED(CONFIG_CMA) && hugetlb_cma[0]) {
+		cma_release(hugetlb_cma[page_to_nid(page)], page, 1 << order);
+		return;
+	}
+
 	free_contig_range(page_to_pfn(page), 1 << order);
 }
 
@@ -1237,6 +1246,23 @@  static struct page *alloc_gigantic_page(struct hstate *h, gfp_t gfp_mask,
 {
 	unsigned long nr_pages = 1UL << huge_page_order(h);
 
+	if (IS_ENABLED(CONFIG_CMA) && hugetlb_cma[0]) {
+		struct page *page;
+		int nid;
+
+		for_each_node_mask(nid, *nodemask) {
+			if (!hugetlb_cma[nid])
+				break;
+
+			page = cma_alloc(hugetlb_cma[nid], nr_pages,
+					 huge_page_order(h), true);
+			if (page)
+				return page;
+		}
+
+		return NULL;
+	}
+
 	return alloc_contig_pages(nr_pages, gfp_mask, nid, nodemask);
 }
 
@@ -2439,6 +2465,10 @@  static void __init hugetlb_hstate_alloc_pages(struct hstate *h)
 
 	for (i = 0; i < h->max_huge_pages; ++i) {
 		if (hstate_is_gigantic(h)) {
+			if (IS_ENABLED(CONFIG_CMA) && hugetlb_cma[0]) {
+				pr_warn_once("HugeTLB: hugetlb_cma is enabled, skip boot time allocation\n");
+				break;
+			}
 			if (!alloc_bootmem_huge_page(h))
 				break;
 		} else if (!alloc_pool_huge_page(h,
@@ -5372,3 +5402,88 @@  void move_hugetlb_state(struct page *oldpage, struct page *newpage, int reason)
 		spin_unlock(&hugetlb_lock);
 	}
 }
+
+#ifdef CONFIG_CMA
+static unsigned long hugetlb_cma_size __initdata;
+static unsigned long hugetlb_cma_percent __initdata;
+
+static int __init cmdline_parse_hugetlb_cma(char *p)
+{
+	unsigned long long val;
+	char *endptr;
+
+	if (!p)
+		return -EINVAL;
+
+	/* Value may be a percentage of total memory, otherwise bytes */
+	val = simple_strtoull(p, &endptr, 0);
+	if (*endptr == '%')
+		hugetlb_cma_percent = clamp_t(unsigned long, val, 0, 100);
+	else
+		hugetlb_cma_size = memparse(p, &p);
+
+	return 0;
+}
+
+early_param("hugetlb_cma", cmdline_parse_hugetlb_cma);
+
+void __init hugetlb_cma_reserve(void)
+{
+	unsigned long totalpages = 0;
+	unsigned long start_pfn, end_pfn;
+	phys_addr_t size;
+	int nid, i, res;
+
+	if (!hugetlb_cma_size && !hugetlb_cma_percent)
+		return;
+
+	if (hugetlb_cma_percent) {
+		for_each_mem_pfn_range(i, MAX_NUMNODES, &start_pfn, &end_pfn,
+				       NULL)
+			totalpages += end_pfn - start_pfn;
+
+		size = PAGE_SIZE * (hugetlb_cma_percent * 100 * totalpages) /
+			10000UL;
+	} else {
+		size = hugetlb_cma_size;
+	}
+
+	pr_info("hugetlb_cma: reserve %llu, %llu per node\n", size,
+		size / nr_online_nodes);
+
+	size /= nr_online_nodes;
+
+	for_each_node_state(nid, N_ONLINE) {
+		unsigned long min_pfn = 0, max_pfn = 0;
+
+		for_each_mem_pfn_range(i, nid, &start_pfn, &end_pfn, NULL) {
+			if (!min_pfn)
+				min_pfn = start_pfn;
+			max_pfn = end_pfn;
+		}
+
+		res = cma_declare_contiguous(PFN_PHYS(min_pfn), size,
+					     PFN_PHYS(max_pfn), (1UL << 30),
+					     0, false,
+					     "hugetlb", &hugetlb_cma[nid]);
+		if (res) {
+			pr_warn("hugetlb_cma: reservation failed: err %d, node %d, [%llu, %llu)",
+				res, nid, PFN_PHYS(min_pfn), PFN_PHYS(max_pfn));
+
+			for (; nid >= 0; nid--)
+				hugetlb_cma[nid] = NULL;
+
+			break;
+		}
+
+		pr_info("hugetlb_cma: successfully reserved %llu on node %d\n",
+			size, nid);
+	}
+}
+
+#else /* CONFIG_CMA */
+void __init hugetlb_cma_reserve(void)
+{
+}
+
+#endif /* CONFIG_CMA */