Message ID | 20200305033014.1152-1-longpeng2@huawei.com (mailing list archive)
---|---
State | New, archived
Series | mm/hugetlb: avoid weird message in hugetlb_init
On 3/4/20 7:30 PM, Longpeng(Mike) wrote:
> From: Longpeng <longpeng2@huawei.com>
>
> Some architectures (e.g. x86, riscv) don't add a 2M hstate by default, so if we add 'default_hugepagesz=2M' but without 'hugepagesz=2M' in cmdline, we'll get a message as follows:
> "HugeTLB: unsupported default_hugepagesz 2097152. Reverting to 2097152"

Yes, that is a weird message that should not be printed. Thanks for pointing out the issue!

> As the architecture-specific HPAGE_SIZE hstate should be supported by default, we can avoid this weird message by adding it if we haven't yet.

As you have discovered, some of the hugetlb size setup is done in architecture specific code, and other code is architecture independent.

The code which verifies huge page sizes (hugepagesz=) is architecture specific. The code which reads default_hugepagesz= is architecture independent. In fact, no verification of the 'default_hugepagesz' value is made when the value is read from the command line. The value is verified later, after command line processing. Current code considers the value 'valid' if it corresponds to an hstate previously set up by architecture specific code. If architecture specific code did not set up a corresponding hstate, an error like the one above is reported.

Some architectures such as arm, powerpc and sparc set up hstates for all supported sizes. Other architectures such as riscv, s390 and x86 only set up hstates for those requested on the command line with hugepagesz=. Depending on config options, x86 and riscv may or may not set up PUD_SIZE hstates.

Therefore, on s390, x86 and riscv (with certain config options) it would be possible to specify a valid huge page size (PUD_SIZE) with default_hugepagesz= and have that value be rejected. This is because the architecture specific code will not set up that hstate by default.

The code you have proposed handles the case where the value specified by default_hugepagesz= corresponds to HPAGE_SIZE. That is because the architecture independent code will set up the hstate for HPAGE_SIZE. HPAGE_SIZE is the only valid huge page size known by the architecture independent code.

I am thinking we may want to have a more generic solution by allowing the default_hugepagesz= processing code to verify the passed size and set up the corresponding hstate. This would require more cooperation between architecture specific and independent code. This could be accomplished with a simple arch_hugetlb_valid_size() routine provided by the architectures. Below is an untested patch to add such support to the architecture independent code and x86. Other architectures would be similar.

In addition, with architectures providing arch_hugetlb_valid_size() it should be possible to have a common routine in architecture independent code to read/process hugepagesz= command line arguments.

Of course, another approach would be to simply require ALL architectures to set up hstates for ALL supported huge page sizes.

Thoughts?
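For reference, the validity check described above amounts to a lookup over the hstates registered so far. The routine below is a simplified sketch based on mm/hugetlb.c of this period, shown only to illustrate why a size that was never registered is reported as unsupported; it is not quoted from the patch under discussion.

struct hstate *size_to_hstate(unsigned long size)
{
	struct hstate *h;

	/* walk every hstate registered earlier via hugetlb_add_hstate() */
	for_each_hstate(h) {
		if (huge_page_size(h) == size)
			return h;
	}
	/* size was never set up => "unsupported default_hugepagesz" path */
	return NULL;
}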
On 2020/3/6 8:09, Mike Kravetz wrote:
> On 3/4/20 7:30 PM, Longpeng(Mike) wrote:
>> From: Longpeng <longpeng2@huawei.com>
>>
>> Some architectures (e.g. x86, riscv) don't add a 2M hstate by default, so if we add 'default_hugepagesz=2M' but without 'hugepagesz=2M' in cmdline, we'll get a message as follows:
>> "HugeTLB: unsupported default_hugepagesz 2097152. Reverting to 2097152"
>
> Yes, that is a weird message that should not be printed. Thanks for pointing out the issue!
>
>> As the architecture-specific HPAGE_SIZE hstate should be supported by default, we can avoid this weird message by adding it if we haven't yet.
>
> As you have discovered, some of the hugetlb size setup is done in architecture specific code, and other code is architecture independent.
>
> The code which verifies huge page sizes (hugepagesz=) is architecture specific. The code which reads default_hugepagesz= is architecture independent. In fact, no verification of the 'default_hugepagesz' value is made when the value is read from the command line. The value is verified later, after command line processing. Current code considers the value 'valid' if it corresponds to an hstate previously set up by architecture specific code. If architecture specific code did not set up a corresponding hstate, an error like the one above is reported.
>
> Some architectures such as arm, powerpc and sparc set up hstates for all supported sizes. Other architectures such as riscv, s390 and x86 only set up hstates for those requested on the command line with hugepagesz=. Depending on config options, x86 and riscv may or may not set up PUD_SIZE hstates.
>
> Therefore, on s390, x86 and riscv (with certain config options) it would be possible to specify a valid huge page size (PUD_SIZE) with default_hugepagesz= and have that value be rejected. This is because the architecture specific code will not set up that hstate by default.
>
> The code you have proposed handles the case where the value specified by default_hugepagesz= corresponds to HPAGE_SIZE. That is because the architecture independent code will set up the hstate for HPAGE_SIZE. HPAGE_SIZE is the only valid huge page size known by the architecture independent code.
>

Hi Mike,

Thanks for your detailed explanation :)

> I am thinking we may want to have a more generic solution by allowing the default_hugepagesz= processing code to verify the passed size and set up the corresponding hstate. This would require more cooperation between architecture specific and independent code. This could be accomplished with a simple arch_hugetlb_valid_size() routine provided by the architectures. Below is an untested patch to add such support to the architecture independent code and x86. Other architectures would be similar.
>
> In addition, with architectures providing arch_hugetlb_valid_size() it should be possible to have a common routine in architecture independent code to read/process hugepagesz= command line arguments.
>

I just wanted to make the minimal change needed to address this issue, so I chose the approach my patch takes.

To be honest, the approach you suggested above is much better, though it needs more changes.

> Of course, another approach would be to simply require ALL architectures to set up hstates for ALL supported huge page sizes.
>

I think this is also needed; then we can request all supported sizes of hugepages via sysfs (e.g. /sys/kernel/mm/hugepages/*) dynamically. Currently, on x86, we can only request 1G hugepages through sysfs if we boot with 'default_hugepagesz=1G', even with the first approach.

BTW, because it's not easy to discuss with you due to the time difference, I have another question about the default hugepages to consult you about here. Why does /proc/meminfo only show info about the default hugepages, but not the others? meminfo is better known than sysfs; some ordinary users know meminfo but don't know how to use sysfs to get the hugepages status (e.g. total, free).

> Thoughts?
> --
> Mike Kravetz
>
> diff --git a/arch/x86/include/asm/hugetlb.h b/arch/x86/include/asm/hugetlb.h
> index f65cfb48cfdd..dc00c3df1f22 100644
> --- a/arch/x86/include/asm/hugetlb.h
> +++ b/arch/x86/include/asm/hugetlb.h
> @@ -7,6 +7,9 @@
>
>  #define hugepages_supported() boot_cpu_has(X86_FEATURE_PSE)
>
> +#define __HAVE_ARCH_HUGETLB_VALID_SIZE
> +extern bool __init arch_hugetlb_valid_size(unsigned long size);
> +
>  static inline int is_hugepage_only_range(struct mm_struct *mm,
>  					unsigned long addr,
>  					unsigned long len) {
> diff --git a/arch/x86/mm/hugetlbpage.c b/arch/x86/mm/hugetlbpage.c
> index 5bfd5aef5378..1c4372bfe782 100644
> --- a/arch/x86/mm/hugetlbpage.c
> +++ b/arch/x86/mm/hugetlbpage.c
> @@ -181,13 +181,22 @@ hugetlb_get_unmapped_area(struct file *file, unsigned long addr,
>  #endif /* CONFIG_HUGETLB_PAGE */
>
>  #ifdef CONFIG_X86_64
> +bool __init arch_hugetlb_valid_size(unsigned long size)
> +{
> +	if (size == PMD_SIZE)
> +		return true;
> +	else if (size == PUD_SIZE && boot_cpu_has(X86_FEATURE_GBPAGES))
> +		return true;
> +	else
> +		return false;
> +}
> +
>  static __init int setup_hugepagesz(char *opt)
>  {
>  	unsigned long ps = memparse(opt, &opt);
> -	if (ps == PMD_SIZE) {
> -		hugetlb_add_hstate(PMD_SHIFT - PAGE_SHIFT);
> -	} else if (ps == PUD_SIZE && boot_cpu_has(X86_FEATURE_GBPAGES)) {
> -		hugetlb_add_hstate(PUD_SHIFT - PAGE_SHIFT);
> +
> +	if (arch_hugetlb_valid_size(ps)) {
> +		hugetlb_add_hstate(ilog2(ps) - PAGE_SHIFT);
>  	} else {
>  		hugetlb_bad_size();
>  		printk(KERN_ERR "hugepagesz: Unsupported page size %lu M\n",
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index 50480d16bd33..822d0d8559c7 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -678,6 +678,13 @@ static inline spinlock_t *huge_pte_lockptr(struct hstate *h,
>  	return &mm->page_table_lock;
>  }
>
> +#ifndef __HAVE_ARCH_HUGETLB_VALID_SIZE
> +static inline bool arch_hugetlb_valid_size(unsigned long size)
> +{
> +	return (size == HPAGE_SIZE);
> +}
> +#endif
> +
>  #ifndef hugepages_supported
>  /*
>   * Some platform decide whether they support huge pages at boot
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 7fb31750e670..fc3f0f1e3a27 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -3078,17 +3078,13 @@ static int __init hugetlb_init(void)
>  	if (!hugepages_supported())
>  		return 0;
>
> +	/* if default_hstate_size != 0, corresponding hstate was added */
>  	if (!size_to_hstate(default_hstate_size)) {
> -		if (default_hstate_size != 0) {
> -			pr_err("HugeTLB: unsupported default_hugepagesz %lu. Reverting to %lu\n",
> -				default_hstate_size, HPAGE_SIZE);
> -		}
> -
>  		default_hstate_size = HPAGE_SIZE;
> -		if (!size_to_hstate(default_hstate_size))
> -			hugetlb_add_hstate(HUGETLB_PAGE_ORDER);
> +		hugetlb_add_hstate(HUGETLB_PAGE_ORDER);
>  	}
>  	default_hstate_idx = hstate_index(size_to_hstate(default_hstate_size));
> +
>  	if (default_hstate_max_huge_pages) {
>  		if (!default_hstate.max_huge_pages)
>  			default_hstate.max_huge_pages = default_hstate_max_huge_pages;
> @@ -3195,7 +3191,15 @@ __setup("hugepages=", hugetlb_nrpages_setup);
>
>  static int __init hugetlb_default_setup(char *s)
>  {
> -	default_hstate_size = memparse(s, &s);
> +	unsigned long size = memparse(s, &s);
> +
> +	if (arch_hugetlb_valid_size(size)) {
> +		default_hstate_size = size;
> +		hugetlb_add_hstate(ilog2(size) - PAGE_SHIFT);
> +	} else {
> +		pr_err("HugeTLB: unsupported default_hugepagesz %lu.\n", size);
> +		hugetlb_bad_size();
> +	}
>  	return 1;
>  }
>  __setup("default_hugepagesz=", hugetlb_default_setup);
>
On 3/5/20 10:36 PM, Longpeng (Mike) wrote:
> On 2020/3/6 8:09, Mike Kravetz wrote:
>> On 3/4/20 7:30 PM, Longpeng(Mike) wrote:
>>> From: Longpeng <longpeng2@huawei.com>
>
>> I am thinking we may want to have a more generic solution by allowing the default_hugepagesz= processing code to verify the passed size and set up the corresponding hstate. This would require more cooperation between architecture specific and independent code. This could be accomplished with a simple arch_hugetlb_valid_size() routine provided by the architectures. Below is an untested patch to add such support to the architecture independent code and x86. Other architectures would be similar.
>>
>> In addition, with architectures providing arch_hugetlb_valid_size() it should be possible to have a common routine in architecture independent code to read/process hugepagesz= command line arguments.
>>
> I just wanted to make the minimal change needed to address this issue, so I chose the approach my patch takes.
>
> To be honest, the approach you suggested above is much better, though it needs more changes.
>
>> Of course, another approach would be to simply require ALL architectures to set up hstates for ALL supported huge page sizes.
>>
> I think this is also needed; then we can request all supported sizes of hugepages via sysfs (e.g. /sys/kernel/mm/hugepages/*) dynamically. Currently, on x86, we can only request 1G hugepages through sysfs if we boot with 'default_hugepagesz=1G', even with the first approach.

I 'think' you can use sysfs for 1G huge pages on x86 today. Just booted a system without any hugepage options on the command line.

# cat /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
0
# cat /sys/kernel/mm/hugepages/hugepages-1048576kB/^Cugepages
# echo 1 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
# cat /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
1
# cat /sys/kernel/mm/hugepages/hugepages-1048576kB/free_hugepages
1

x86 and riscv will set up PUD_SIZE hstates by default if CONFIG_CONTIG_ALLOC. This is because of a somewhat recent feature that allowed dynamic allocation of gigantic (page order >= MAX_ORDER) pages. Before that feature, it made no sense to set up an hstate for gigantic pages if they were not allocated at boot time and could not be dynamically added later.

I'll code up a proposal that does the following (a rough, hypothetical sketch of such a common handler follows this message):
- Have arch specific code provide a list of supported huge page sizes
- Arch independent code uses the list to create all hstates
- Move processing of "hugepagesz=" to arch independent code
- Validate "default_hugepagesz=" when the value is read from the command line

It may take a few days. When ready, I will pull in the architecture specific people.

> BTW, because it's not easy to discuss with you due to the time difference, I have another question about the default hugepages to consult you about here. Why does /proc/meminfo only show info about the default hugepages, but not the others? meminfo is better known than sysfs; some ordinary users know meminfo but don't know how to use sysfs to get the hugepages status (e.g. total, free).

I believe that is simply history. In the beginning there was only the default huge page size and that was added to meminfo. People then wrote scripts to parse huge page information in meminfo. When support for other huge pages was added, it was not added to meminfo as it could break user scripts parsing the file. Adding information for all potential huge page sizes may create lots of entries that are unused. I was not around when these decisions were made, but that is my understanding.

BTW - A recently added meminfo field 'Hugetlb' displays the amount of memory consumed by huge pages of ALL sizes.
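To illustrate the "move processing of hugepagesz= to arch independent code" item in the list above, here is a hypothetical sketch of what such a common handler could look like once every architecture provides arch_hugetlb_valid_size(). The function name hugepagesz_setup and the exact behavior are assumptions for illustration, not the code that was eventually merged.

static int __init hugepagesz_setup(char *s)
{
	unsigned long size;

	/* parse the size argument, e.g. "2M" or "1G" */
	size = (unsigned long)memparse(s, NULL);

	if (!arch_hugetlb_valid_size(size)) {
		/* reject sizes the architecture cannot support */
		hugetlb_bad_size();
		pr_err("HugeTLB: unsupported hugepagesz %lu\n", size);
		return 0;
	}

	/* register the hstate for this size in common code */
	hugetlb_add_hstate(ilog2(size) - PAGE_SHIFT);
	return 1;
}
__setup("hugepagesz=", hugepagesz_setup);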
On 2020/3/7 4:12, Mike Kravetz wrote:
> On 3/5/20 10:36 PM, Longpeng (Mike) wrote:
>> On 2020/3/6 8:09, Mike Kravetz wrote:
>>> On 3/4/20 7:30 PM, Longpeng(Mike) wrote:
>>>> From: Longpeng <longpeng2@huawei.com>
>>
>>> I am thinking we may want to have a more generic solution by allowing [...]
>>> Of course, another approach would be to simply require ALL architectures to set up hstates for ALL supported huge page sizes.
>>>
>> I think this is also needed; then we can request all supported sizes of hugepages via sysfs (e.g. /sys/kernel/mm/hugepages/*) dynamically. Currently, on x86, we can only request 1G hugepages through sysfs if we boot with 'default_hugepagesz=1G', even with the first approach.
>
> I 'think' you can use sysfs for 1G huge pages on x86 today. Just booted a system without any hugepage options on the command line.
>
> # cat /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
> 0
> # cat /sys/kernel/mm/hugepages/hugepages-1048576kB/^Cugepages
> # echo 1 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
> # cat /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
> 1
> # cat /sys/kernel/mm/hugepages/hugepages-1048576kB/free_hugepages
> 1
>
> x86 and riscv will set up PUD_SIZE hstates by default if CONFIG_CONTIG_ALLOC. This is because of a somewhat recent feature that allowed dynamic allocation of gigantic (page order >= MAX_ORDER) pages. Before that feature, it made no sense to set up an hstate for gigantic pages if they were not allocated at boot time and could not be dynamically added later.
>

Um... maybe my poor English caused a misunderstanding. In fact, I want to say that we should allow the user to allocate ALL supported sizes of hugepages dynamically by default, so we need to require ALL architectures to set up ALL supported huge page sizes.

> I'll code up a proposal that does the following:
> - Have arch specific code provide a list of supported huge page sizes
> - Arch independent code uses the list to create all hstates
> - Move processing of "hugepagesz=" to arch independent code
> - Validate "default_hugepagesz=" when the value is read from the command line
>
> It may take a few days. When ready, I will pull in the architecture specific people.
>

Great! I'm looking forward to your patches. I also have one or two other small improvements that I hope to discuss with you after you finish this work.

>> BTW, because it's not easy to discuss with you due to the time difference, I have another question about the default hugepages to consult you about here. Why does /proc/meminfo only show info about the default hugepages, but not the others? meminfo is better known than sysfs; some ordinary users know meminfo but don't know how to use sysfs to get the hugepages status (e.g. total, free).
>
> I believe that is simply history. In the beginning there was only the default huge page size and that was added to meminfo. People then wrote scripts to parse huge page information in meminfo. When support for other huge pages was added, it was not added to meminfo as it could break user scripts parsing the file. Adding information for all potential huge page sizes may create lots of entries that are unused. I was not around when these decisions were made, but that is my understanding.
>
> BTW - A recently added meminfo field 'Hugetlb' displays the amount of memory consumed by huge pages of ALL sizes.

I get it, thanks :)

> --
> Mike Kravetz
>
On 3/6/20 3:12 PM, Mike Kravetz wrote:
> On 3/5/20 10:36 PM, Longpeng (Mike) wrote:
>> On 2020/3/6 8:09, Mike Kravetz wrote:
>>> On 3/4/20 7:30 PM, Longpeng(Mike) wrote:
>>>> From: Longpeng <longpeng2@huawei.com>
>>> I am thinking we may want to have a more generic solution by allowing the default_hugepagesz= processing code to verify the passed size and set up the corresponding hstate. This would require more cooperation between architecture specific and independent code. This could be accomplished with a simple arch_hugetlb_valid_size() routine provided by the architectures. Below is an untested patch to add such support to the architecture independent code and x86. Other architectures would be similar.
>>>
>>> In addition, with architectures providing arch_hugetlb_valid_size() it should be possible to have a common routine in architecture independent code to read/process hugepagesz= command line arguments.
>>>
>> I just wanted to make the minimal change needed to address this issue, so I chose the approach my patch takes.
>>
>> To be honest, the approach you suggested above is much better, though it needs more changes.
>>
>>> Of course, another approach would be to simply require ALL architectures to set up hstates for ALL supported huge page sizes.
>>>
>> I think this is also needed; then we can request all supported sizes of hugepages via sysfs (e.g. /sys/kernel/mm/hugepages/*) dynamically. Currently, on x86, we can only request 1G hugepages through sysfs if we boot with 'default_hugepagesz=1G', even with the first approach.
> I 'think' you can use sysfs for 1G huge pages on x86 today. Just booted a system without any hugepage options on the command line.
>
> # cat /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
> 0
> # cat /sys/kernel/mm/hugepages/hugepages-1048576kB/^Cugepages
> # echo 1 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
> # cat /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
> 1
> # cat /sys/kernel/mm/hugepages/hugepages-1048576kB/free_hugepages
> 1
>
> x86 and riscv will set up PUD_SIZE hstates by default if CONFIG_CONTIG_ALLOC. This is because of a somewhat recent feature that allowed dynamic allocation of gigantic (page order >= MAX_ORDER) pages. Before that feature, it made no sense to set up an hstate for gigantic pages if they were not allocated at boot time and could not be dynamically added later.
>
> I'll code up a proposal that does the following:
> - Have arch specific code provide a list of supported huge page sizes
> - Arch independent code uses the list to create all hstates
> - Move processing of "hugepagesz=" to arch independent code
> - Validate "default_hugepagesz=" when the value is read from the command line
>
> It may take a few days. When ready, I will pull in the architecture specific people.

Hi Mike,

On platforms that support multiple huge page sizes, when 'hugepagesz' is not specified before 'hugepages=', hugepages are not allocated (for example, if we are requesting 1GB hugepages).

In terms of reporting, meminfo and /sys/kernel/../nr_hugepages report the expected results, but if we use sysctl vm.nr_hugepages then it reports a non-zero value, as it reads max_huge_pages from the default hstate instead of nr_huge_pages. AFAIK nr_huge_pages is the one that indicates the number of huge pages that were successfully allocated.

Is vm.nr_hugepages expected to report the maximum number of hugepages? If so, would it not make sense to rename the procname? However, if we expect nr_hugepages to report the number of successfully allocated hugepages, then we should use nr_huge_pages in hugetlb_sysctl_handler_common().

>
>> BTW, because it's not easy to discuss with you due to the time difference, I have another question about the default hugepages to consult you about here. Why does /proc/meminfo only show info about the default hugepages, but not the others? meminfo is better known than sysfs; some ordinary users know meminfo but don't know how to use sysfs to get the hugepages status (e.g. total, free).
> I believe that is simply history. In the beginning there was only the default huge page size and that was added to meminfo. People then wrote scripts to parse huge page information in meminfo. When support for other huge pages was added, it was not added to meminfo as it could break user scripts parsing the file. Adding information for all potential huge page sizes may create lots of entries that are unused. I was not around when these decisions were made, but that is my understanding.
> BTW - A recently added meminfo field 'Hugetlb' displays the amount of memory consumed by huge pages of ALL sizes.
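A quick way to observe the mismatch described above from userspace is to compare what "sysctl vm.nr_hugepages" reads with the per-size sysfs counter. The helper below is hypothetical (it is not part of any patch in this thread), and the sysfs path assumes a 1GB default huge page size, as in the later reports here; adjust it for other sizes.

#include <stdio.h>

static long read_long(const char *path)
{
	FILE *f = fopen(path, "r");
	long val = -1;

	if (f) {
		if (fscanf(f, "%ld", &val) != 1)
			val = -1;
		fclose(f);
	}
	return val;
}

int main(void)
{
	/* what "sysctl vm.nr_hugepages" reads */
	long sysctl_val = read_long("/proc/sys/vm/nr_hugepages");
	/* per-size counter for 1GB pages; path assumes default_hugepagesz=1G */
	long sysfs_val = read_long(
		"/sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages");

	/* With 'default_hugepagesz=1G hugepages=8' and no 'hugepagesz=1G',
	 * the two values can disagree: sysctl shows the requested maximum
	 * while sysfs shows the pages actually allocated. */
	printf("sysctl vm.nr_hugepages : %ld\n", sysctl_val);
	printf("sysfs  nr_hugepages    : %ld\n", sysfs_val);
	return 0;
}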
On 4/10/20 8:47 AM, Nitesh Narayan Lal wrote:
> Hi Mike,
>
> On platforms that support multiple huge page sizes, when 'hugepagesz' is not specified before 'hugepages=', hugepages are not allocated (for example, if we are requesting 1GB hugepages).

Hi Nitesh,

This should only be an issue with gigantic huge pages. This is because hugepages=X not following a hugepagesz=Y specifies the number of huge pages of the default size to allocate. It does not currently work for gigantic pages.

In the other thread, I provided this explanation as to why: it comes about because we do not definitively set the default huge page size until after command line processing (in hugetlb_init). And, we must preallocate gigantic huge pages during command line processing because that is when the bootmem allocator is available.

I will be looking into modifying this behavior to allocate the pages as expected, even for gigantic pages.

> In terms of reporting, meminfo and /sys/kernel/../nr_hugepages report the expected results, but if we use sysctl vm.nr_hugepages then it reports a non-zero value, as it reads max_huge_pages from the default hstate instead of nr_huge_pages.
> AFAIK nr_huge_pages is the one that indicates the number of huge pages that were successfully allocated.
>
> Is vm.nr_hugepages expected to report the maximum number of hugepages? If so, would it not make sense to rename the procname?
>
> However, if we expect nr_hugepages to report the number of successfully allocated hugepages, then we should use nr_huge_pages in hugetlb_sysctl_handler_common().

This looks like a bug. Neither sysctl nor the /proc file should be reporting a non-zero value if huge pages do not exist.
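The ordering constraint described above can be pictured roughly as follows. This is a condensed, hypothetical sketch, not verbatim kernel code: "hugepages=" handling has to grab gigantic pages immediately, because only the early boot allocator can provide them, and it is gone by the time hugetlb_init() later settles what the default huge page size actually is.

static int __init hugetlb_nrpages_setup(char *s)
{
	unsigned long *mhp;

	if (!hugetlb_max_hstate)
		/* no hugepagesz= seen yet: remember the count for the
		 * default hstate, which hugetlb_init() resolves later */
		mhp = &default_hstate_max_huge_pages;
	else
		mhp = &parsed_hstate->max_huge_pages;

	if (sscanf(s, "%lu", mhp) <= 0)
		*mhp = 0;

	/* gigantic pages must be preallocated now, while the boot
	 * (bootmem/memblock) allocator is still usable */
	if (hugetlb_max_hstate && hstate_is_gigantic(parsed_hstate))
		hugetlb_hstate_alloc_pages(parsed_hstate);

	return 1;
}
__setup("hugepages=", hugetlb_nrpages_setup);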
On 4/13/20 2:33 PM, Mike Kravetz wrote:
> On 4/10/20 8:47 AM, Nitesh Narayan Lal wrote:
>> Hi Mike,
>>
>> On platforms that support multiple huge page sizes, when 'hugepagesz' is not specified before 'hugepages=', hugepages are not allocated (for example, if we are requesting 1GB hugepages).
> Hi Nitesh,
>
> This should only be an issue with gigantic huge pages. This is because hugepages=X not following a hugepagesz=Y specifies the number of huge pages of the default size to allocate. It does not currently work for gigantic pages.

I see: since we changed the default hugepages to gigantic pages and we omitted 'hugepagesz=', no pages were allocated of any type.

> In the other thread, I provided this explanation as to why: it comes about because we do not definitively set the default huge page size until after command line processing (in hugetlb_init). And, we must preallocate gigantic huge pages during command line processing because that is when the bootmem allocator is available.
>
> I will be looking into modifying this behavior to allocate the pages as expected, even for gigantic pages.

Nice, looking forward to it.

>
>> In terms of reporting, meminfo and /sys/kernel/../nr_hugepages report the expected results, but if we use sysctl vm.nr_hugepages then it reports a non-zero value, as it reads max_huge_pages from the default hstate instead of nr_huge_pages.
>> AFAIK nr_huge_pages is the one that indicates the number of huge pages that were successfully allocated.
>>
>> Is vm.nr_hugepages expected to report the maximum number of hugepages? If so, would it not make sense to rename the procname?
>>
>> However, if we expect nr_hugepages to report the number of successfully allocated hugepages, then we should use nr_huge_pages in hugetlb_sysctl_handler_common().
> This looks like a bug. Neither sysctl nor the /proc file should be reporting a non-zero value if huge pages do not exist.

Yeap, as I mentioned, it reports max_huge_pages instead of nr_huge_pages.
On 4/13/20 2:21 PM, Nitesh Narayan Lal wrote:
>
> On 4/13/20 2:33 PM, Mike Kravetz wrote:
>> On 4/10/20 8:47 AM, Nitesh Narayan Lal wrote:
>>> Hi Mike,
>>>
>>> On platforms that support multiple huge page sizes, when 'hugepagesz' is not specified before 'hugepages=', hugepages are not allocated (for example, if we are requesting 1GB hugepages).
>> Hi Nitesh,
>>
>> This should only be an issue with gigantic huge pages. This is because hugepages=X not following a hugepagesz=Y specifies the number of huge pages of the default size to allocate. It does not currently work for gigantic pages.
>
> I see: since we changed the default hugepages to gigantic pages and we omitted 'hugepagesz=', no pages were allocated of any type.
>
>> In the other thread, I provided this explanation as to why: it comes about because we do not definitively set the default huge page size until after command line processing (in hugetlb_init). And, we must preallocate gigantic huge pages during command line processing because that is when the bootmem allocator is available.
>>
>> I will be looking into modifying this behavior to allocate the pages as expected, even for gigantic pages.
>
> Nice, looking forward to it.
>
>>
>>> In terms of reporting, meminfo and /sys/kernel/../nr_hugepages report the expected results, but if we use sysctl vm.nr_hugepages then it reports a non-zero value, as it reads max_huge_pages from the default hstate instead of nr_huge_pages.
>>> AFAIK nr_huge_pages is the one that indicates the number of huge pages that were successfully allocated.
>>>
>>> Is vm.nr_hugepages expected to report the maximum number of hugepages? If so, would it not make sense to rename the procname?
>>>
>>> However, if we expect nr_hugepages to report the number of successfully allocated hugepages, then we should use nr_huge_pages in hugetlb_sysctl_handler_common().
>> This looks like a bug. Neither sysctl nor the /proc file should be reporting a non-zero value if huge pages do not exist.
>
> Yeap, as I mentioned, it reports max_huge_pages instead of nr_huge_pages.

Does this only happen when you specify gigantic pages as the default huge page size and they are not allocated at boot time? Or are there other situations where this happens? If so, can you provide a sample of the boot parameters used, or how to recreate it?

I am fixing up the issue with gigantic pages, and I suspect this will take care of all the issues you are seeing. This will be part of the command line cleanup series. I just want to make sure I am not missing something.
On 4/15/20 12:03 AM, Mike Kravetz wrote:
> On 4/13/20 2:21 PM, Nitesh Narayan Lal wrote:
>> On 4/13/20 2:33 PM, Mike Kravetz wrote:
>>> On 4/10/20 8:47 AM, Nitesh Narayan Lal wrote:
>>>> Hi Mike,
>>>>
>>>> On platforms that support multiple huge page sizes, when 'hugepagesz' is not specified before 'hugepages=', hugepages are not allocated (for example, if we are requesting 1GB hugepages).
>>> Hi Nitesh,
>>>
>>> This should only be an issue with gigantic huge pages. This is because hugepages=X not following a hugepagesz=Y specifies the number of huge pages of the default size to allocate. It does not currently work for gigantic pages.
>> I see: since we changed the default hugepages to gigantic pages and we omitted 'hugepagesz=', no pages were allocated of any type.
>>
>>> In the other thread, I provided this explanation as to why: it comes about because we do not definitively set the default huge page size until after command line processing (in hugetlb_init). And, we must preallocate gigantic huge pages during command line processing because that is when the bootmem allocator is available.
>>>
>>> I will be looking into modifying this behavior to allocate the pages as expected, even for gigantic pages.
>> Nice, looking forward to it.
>>
>>>> In terms of reporting, meminfo and /sys/kernel/../nr_hugepages report the expected results, but if we use sysctl vm.nr_hugepages then it reports a non-zero value, as it reads max_huge_pages from the default hstate instead of nr_huge_pages.
>>>> AFAIK nr_huge_pages is the one that indicates the number of huge pages that were successfully allocated.
>>>>
>>>> Is vm.nr_hugepages expected to report the maximum number of hugepages? If so, would it not make sense to rename the procname?
>>>>
>>>> However, if we expect nr_hugepages to report the number of successfully allocated hugepages, then we should use nr_huge_pages in hugetlb_sysctl_handler_common().
>>> This looks like a bug. Neither sysctl nor the /proc file should be reporting a non-zero value if huge pages do not exist.
>> Yeap, as I mentioned, it reports max_huge_pages instead of nr_huge_pages.
> Does this only happen when you specify gigantic pages as the default huge page size and they are not allocated at boot time?

Yes.

> Or are there other situations where this happens? If so, can you provide a sample of the boot parameters used, or how to recreate it?

To reproduce this behavior, boot the kernel with 'default_hugepagesz=1G hugepages=8' in the kernel cmdline; hugepagesz needs to be skipped to ensure that no gigantic hugepages are allocated. After the kernel is up, check the output of 'sysctl vm.nr_hugepages'. This should be good enough to reproduce the issue.

>
> I am fixing up the issue with gigantic pages, and I suspect this will take care of all the issues you are seeing. This will be part of the command line cleanup series. I just want to make sure I am not missing something.

Makes sense. Thank you.
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index dd8737a..21f623b 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2829,6 +2829,9 @@ static int __init hugetlb_init(void)
 	if (!hugepages_supported())
 		return 0;
 
+	if (!size_to_hstate(HPAGE_SIZE))
+		hugetlb_add_hstate(HUGETLB_PAGE_ORDER);
+
 	if (!size_to_hstate(default_hstate_size)) {
 		if (default_hstate_size != 0) {
 			pr_err("HugeTLB: unsupported default_hugepagesz %lu. Reverting to %lu\n",
@@ -2836,8 +2839,6 @@
 		}
 
 		default_hstate_size = HPAGE_SIZE;
-		if (!size_to_hstate(default_hstate_size))
-			hugetlb_add_hstate(HUGETLB_PAGE_ORDER);
 	}
 	default_hstate_idx = hstate_index(size_to_hstate(default_hstate_size));
 	if (default_hstate_max_huge_pages) {