Message ID | 20201019182853.7467-1-gpiccoli@canonical.com (mailing list archive) |
---|---|
State | New, archived |
Series | mm, hugetlb: Avoid double clearing for hugetlb pages |
On Mon 19-10-20 15:28:53, Guilherme G. Piccoli wrote:
[...]
> $ time echo 32768 > /proc/sys/vm/nr_hugepages
> real    0m24.189s
> user    0m0.000s
> sys     0m24.184s
>
> $ cat /proc/meminfo |grep "MemA\|Hugetlb"
> MemAvailable:   30784732 kB
> Hugetlb:        67108864 kB
>
> * Without this patch, init_on_alloc=0
> $ cat /proc/meminfo |grep "MemA\|Hugetlb"
> MemAvailable:   97892752 kB
> Hugetlb:        0 kB
>
> $ time echo 32768 > /proc/sys/vm/nr_hugepages
> real    0m0.316s
> user    0m0.000s
> sys     0m0.316s

Yes, zeroing is quite costly, and that is to be expected when the feature
is enabled. Hugetlb, like other allocator users, performs its own
initialization rather than going through the __GFP_ZERO path. More on that
below.

Could you be more specific about why this is a problem? The hugetlb pool is
usually preallocated once during early boot. 24s for 65GB of 2MB pages is a
non-trivial amount of time, but it doesn't look like a major disaster
either. If the pool is allocated later it can take much more time due to
memory fragmentation.

I definitely do not want to downplay this, but I would like to hear about
real-life examples of the problem.

[...]
>
> Hi everybody, thanks in advance for the review/comments. I'd like to
> point 2 things related to the implementation:
>
> 1) I understand that adding GFP flags is not really welcome by the
> mm community; I've considered passing that as function parameter but
> that would be a hacky mess, so I decided to add the flag since it seems
> this is a fair use of the flag mechanism (to control actions on pages).
> If anybody has a better/simpler suggestion to implement this, I'm all
> ears - thanks!

This has been discussed already
(http://lkml.kernel.org/r/20190514143537.10435-4-glider@google.com).
Previously it has been brought up in the SLUB context, AFAIR. Your numbers
are quite clear here, but do we really need a gfp flag, with all the
problems we tend to grow into with them?

One potential way around this, specifically for hugetlb, would be to use
__GFP_ZERO when allocating from the page allocator and to mark that fact in
the struct page while it is sitting in the pool. The page fault handler
could then skip the zeroing phase. Not an act of beauty, TBH, but it fits
into the existing model of full control over initialization. Btw. it would
allow implementing init_on_free semantics as well. I haven't implemented
the two main methods, hugetlb_test_clear_pre_init_page and
hugetlb_mark_pre_init_page, because I am not entirely sure about the
current state of the hugetlb struct page while it is in the pool. But there
should be a lot of room in there (or in the tail pages). Mike will
certainly know much better. The skeleton of the patch would look something
like this (not even compile tested).
diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index b5c109703daa..031af7cdf8a7 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -724,7 +724,8 @@ static long hugetlbfs_fallocate(struct file *file, int mode, loff_t offset,
 			error = PTR_ERR(page);
 			goto out;
 		}
-		clear_huge_page(page, addr, pages_per_huge_page(h));
+		if (!hugetlb_test_clear_pre_init_page(page))
+			clear_huge_page(page, addr, pages_per_huge_page(h));
 		__SetPageUptodate(page);
 		error = huge_add_to_page_cache(page, mapping, index);
 		if (unlikely(error)) {
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 67fc6383995b..83cc8abb4d69 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1413,6 +1413,7 @@ static void __free_huge_page(struct page *page)
 	page->mapping = NULL;
 	restore_reserve = PagePrivate(page);
 	ClearPagePrivate(page);
+	hugetlb_test_clear_pre_init_page(page);
 
 	/*
 	 * If PagePrivate() was set on page, page allocation consumed a
@@ -1703,6 +1704,7 @@ static struct page *alloc_buddy_huge_page(struct hstate *h,
 	int order = huge_page_order(h);
 	struct page *page;
 	bool alloc_try_hard = true;
+	bool pre_init = false;
 
 	/*
 	 * By default we always try hard to allocate the page with
@@ -1718,10 +1720,18 @@ static struct page *alloc_buddy_huge_page(struct hstate *h,
 		gfp_mask |= __GFP_RETRY_MAYFAIL;
 	if (nid == NUMA_NO_NODE)
 		nid = numa_mem_id();
+
+	/* prevent from double initialization */
+	if (want_init_on_alloc(gfp_mask)) {
+		gfp_mask |= __GFP_ZERO;
+		pre_init = true;
+	}
+
 	page = __alloc_pages_nodemask(gfp_mask, order, nid, nmask);
-	if (page)
+	if (page) {
 		__count_vm_event(HTLB_BUDDY_PGALLOC);
-	else
+		hugetlb_mark_pre_init_page(page);
+	} else
 		__count_vm_event(HTLB_BUDDY_PGALLOC_FAIL);
 
 	/*
@@ -4221,6 +4231,7 @@ static vm_fault_t hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma,
 		goto out_release_all;
 	}
 
+	hugetlb_test_clear_pre_init_page(new_page);
 	copy_user_huge_page(new_page, old_page, address, vma,
 			    pages_per_huge_page(h));
 	__SetPageUptodate(new_page);
@@ -4411,7 +4422,8 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
 			ret = vmf_error(PTR_ERR(page));
 			goto out;
 		}
-		clear_huge_page(page, address, pages_per_huge_page(h));
+		if (!hugetlb_test_clear_pre_init_page(page))
+			clear_huge_page(page, address, pages_per_huge_page(h));
 		__SetPageUptodate(page);
 		new_page = true;
 
@@ -4709,6 +4721,7 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
 		if (IS_ERR(page))
 			goto out;
 
+		hugetlb_test_clear_pre_init_page(page);
 		ret = copy_huge_page_from_user(page,
 						(const void __user *) src_addr,
 						pages_per_huge_page(h), false);
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index eddbe4e56c73..8cc1fc9c4d13 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -525,7 +525,7 @@ static int queue_pages_pte_range(pmd_t *pmd, unsigned long addr,
 	unsigned long flags = qp->flags;
 	int ret;
 	bool has_unmovable = false;
-	pte_t *pte;
+	pte_t *pte, *mapped_pte;
 	spinlock_t *ptl;
 
 	ptl = pmd_trans_huge_lock(pmd, vma);
@@ -539,7 +539,7 @@ static int queue_pages_pte_range(pmd_t *pmd, unsigned long addr,
 	if (pmd_trans_unstable(pmd))
 		return 0;
 
-	pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
+	mapped_pte = pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
 	for (; addr != end; pte++, addr += PAGE_SIZE) {
 		if (!pte_present(*pte))
 			continue;
@@ -571,7 +571,7 @@ static int queue_pages_pte_range(pmd_t *pmd, unsigned long addr,
 		} else
 			break;
 	}
-	pte_unmap_unlock(pte - 1, ptl);
+	pte_unmap_unlock(mapped_pte, ptl);
 	cond_resched();
 
 	if (has_unmovable)
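For reference, one possible shape for the two helpers that the skeleton
leaves unimplemented - purely an illustrative sketch, not part of the
thread's patch. It assumes bit 0 of the first tail page's ->private is free
while the huge page sits in the pool, which is exactly the detail Mike
would need to confirm:

/*
 * Hypothetical helpers, for illustration only: remember that the buddy
 * allocator already zeroed this huge page while it sits in the pool.
 * The storage location (bit 0 of the first tail page's ->private) is an
 * assumption; any spare bit in the head or tail pages would do.
 */
static inline void hugetlb_mark_pre_init_page(struct page *page)
{
	set_bit(0, &page[1].private);
}

static inline bool hugetlb_test_clear_pre_init_page(struct page *page)
{
	return test_and_clear_bit(0, &page[1].private);
}

Note that the skeleton above computes pre_init but then marks the page
unconditionally; a complete version would call hugetlb_mark_pre_init_page()
only when pre_init is true.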
On 20.10.20 10:20, Michal Hocko wrote: > On Mon 19-10-20 15:28:53, Guilherme G. Piccoli wrote: > [...] >> $ time echo 32768 > /proc/sys/vm/nr_hugepages >> real 0m24.189s >> user 0m0.000s >> sys 0m24.184s >> >> $ cat /proc/meminfo |grep "MemA\|Hugetlb" >> MemAvailable: 30784732 kB >> Hugetlb: 67108864 kB >> >> * Without this patch, init_on_alloc=0 >> $ cat /proc/meminfo |grep "MemA\|Hugetlb" >> MemAvailable: 97892752 kB >> Hugetlb: 0 kB >> >> $ time echo 32768 > /proc/sys/vm/nr_hugepages >> real 0m0.316s >> user 0m0.000s >> sys 0m0.316s > > Yes zeroying is quite costly and that is to be expected when the feature > is enabled. Hugetlb like other allocator users perform their own > initialization rather than go through __GFP_ZERO path. More on that > below. > > Could you be more specific about why this is a problem. Hugetlb pool is > usualy preallocatd once during early boot. 24s for 65GB of 2MB pages > is non trivial amount of time but it doens't look like a major disaster > either. If the pool is allocated later it can take much more time due to > memory fragmentation. > > I definitely do not want to downplay this but I would like to hear about > the real life examples of the problem. > > [...] >> >> Hi everybody, thanks in advance for the review/comments. I'd like to >> point 2 things related to the implementation: >> >> 1) I understand that adding GFP flags is not really welcome by the >> mm community; I've considered passing that as function parameter but >> that would be a hacky mess, so I decided to add the flag since it seems >> this is a fair use of the flag mechanism (to control actions on pages). >> If anybody has a better/simpler suggestion to implement this, I'm all >> ears - thanks! > > This has been discussed already (http://lkml.kernel.org/r/20190514143537.10435-4-glider@google.com. > Previously it has been brought up in SLUB context AFAIR. Your numbers > are quite clear here but do we really need a gfp flag with all the > problems we tend to grow in with them? > > One potential way around this specifically for hugetlb would be to use > __GFP_ZERO when allocating from the allocator and marking the fact in > the struct page while it is sitting in the pool. Page fault handler > could then skip the zeroying phase. Not an act of beauty TBH but it > fits into the existing model of the full control over initialization. > Btw. it would allow to implement init_on_free semantic as well. I > haven't implemented the actual two main methods > hugetlb_test_clear_pre_init_page and hugetlb_mark_pre_init_page because > I am not entirely sure about the current state of hugetlb struct page in > the pool. But there should be a lot of room in there (or in tail pages). > Mike will certainly know much better. But the skeleton of the patch > would look like something like this (not even compile tested). Something like that is certainly nicer than proposed gfp flags. (__GFP_NOINIT_ON_ALLOC is just ugly, especially, to optimize such corner-case features)
On 10/20/20 1:20 AM, Michal Hocko wrote: > On Mon 19-10-20 15:28:53, Guilherme G. Piccoli wrote: > > Yes zeroying is quite costly and that is to be expected when the feature > is enabled. Hugetlb like other allocator users perform their own > initialization rather than go through __GFP_ZERO path. More on that > below. > > Could you be more specific about why this is a problem. Hugetlb pool is > usualy preallocatd once during early boot. 24s for 65GB of 2MB pages > is non trivial amount of time but it doens't look like a major disaster > either. If the pool is allocated later it can take much more time due to > memory fragmentation. > > I definitely do not want to downplay this but I would like to hear about > the real life examples of the problem. > > [...] >> >> Hi everybody, thanks in advance for the review/comments. I'd like to >> point 2 things related to the implementation: >> >> 1) I understand that adding GFP flags is not really welcome by the >> mm community; I've considered passing that as function parameter but >> that would be a hacky mess, so I decided to add the flag since it seems >> this is a fair use of the flag mechanism (to control actions on pages). >> If anybody has a better/simpler suggestion to implement this, I'm all >> ears - thanks! > > This has been discussed already (http://lkml.kernel.org/r/20190514143537.10435-4-glider@google.com. > Previously it has been brought up in SLUB context AFAIR. Your numbers > are quite clear here but do we really need a gfp flag with all the > problems we tend to grow in with them? > > One potential way around this specifically for hugetlb would be to use > __GFP_ZERO when allocating from the allocator and marking the fact in > the struct page while it is sitting in the pool. Page fault handler > could then skip the zeroying phase. Not an act of beauty TBH but it > fits into the existing model of the full control over initialization. > Btw. it would allow to implement init_on_free semantic as well. I > haven't implemented the actual two main methods > hugetlb_test_clear_pre_init_page and hugetlb_mark_pre_init_page because > I am not entirely sure about the current state of hugetlb struct page in > the pool. But there should be a lot of room in there (or in tail pages). > Mike will certainly know much better. But the skeleton of the patch > would look like something like this (not even compile tested). Thanks Michal. I was not involved in the discussions for init_on_alloc, so was waiting for someone else to comment. My first though was to also do as you propose. Skip the clear on page fault if page was already cleared at allocation time. Yes, there should be plenty of room to store this state while huge pages are in the pool. Of course, users will still see those delays at allocation time pointed out in the commit message. I guess that should be expected. We do have users which allocate over 1TB of huge pages via sysctl. Those pages are used and cleared via page faults, but not necessarily all at the same time. If such users would ever set init_on_alloc they would see a huge delay. My 'guess' is that such users are unlikely to ever use init_on_alloc or init_on_free for general performance reasons.
Hi Michal, thanks a lot for your thorough response. I'll address the comments inline, below. Thanks also David and Mike - in fact, I almost don't need to respond here after Mike, he was right to the point I'm going to discuss heh... On 20/10/2020 05:20, Michal Hocko wrote: > > Yes zeroying is quite costly and that is to be expected when the feature > is enabled. Hugetlb like other allocator users perform their own > initialization rather than go through __GFP_ZERO path. More on that > below. > > Could you be more specific about why this is a problem. Hugetlb pool is > usualy preallocatd once during early boot. 24s for 65GB of 2MB pages > is non trivial amount of time but it doens't look like a major disaster > either. If the pool is allocated later it can take much more time due to > memory fragmentation. > > I definitely do not want to downplay this but I would like to hear about > the real life examples of the problem. Indeed, 24s of delay (!) is not so harmful for boot time, but...64G was just my simple test in a guest, the real case is much worse! It aligns with Mike's comment, we have complains of minute-like delays, due to a very big pool of hugepages being allocated. Users have their own methodology for allocating pages, some would prefer do that "later" for a variety of reasons, so early boot time allocations are not always used, that shouldn't be the only focus of the discussion here. In the specific report I had, the user complains about more than 3 minutes to allocate ~542G of 2M hugetlb pages. Now, you'll ask why in the heck they are using init_on_alloc then - right? So, the Kconfig option "CONFIG_INIT_ON_ALLOC_DEFAULT_ON" is set by default in Ubuntu, for hardening reasons. So, the workaround for the users complaining of delays in allocating hugetlb pages currently is to set "init_on_alloc" to 0. It's a bit lame to ask users to disable such hardening thing just because we have a double initialization in hugetlb... > > > This has been discussed already (http://lkml.kernel.org/r/20190514143537.10435-4-glider@google.com. > Previously it has been brought up in SLUB context AFAIR. Your numbers > are quite clear here but do we really need a gfp flag with all the > problems we tend to grow in with them? > > One potential way around this specifically for hugetlb would be to use > __GFP_ZERO when allocating from the allocator and marking the fact in > the struct page while it is sitting in the pool. Page fault handler > could then skip the zeroying phase. Not an act of beauty TBH but it > fits into the existing model of the full control over initialization. > Btw. it would allow to implement init_on_free semantic as well. I > haven't implemented the actual two main methods > hugetlb_test_clear_pre_init_page and hugetlb_mark_pre_init_page because > I am not entirely sure about the current state of hugetlb struct page in > the pool. But there should be a lot of room in there (or in tail pages). > Mike will certainly know much better. But the skeleton of the patch > would look like something like this (not even compile tested). > [code...] Thanks a lot for pointing the previous discussion for me! I should have done my homework properly and read all versions of the patchset...my bad! I'm glad to see this problem was discussed and considered early in the patch submission, I guess it only missed more real-world numbers. 
Your approach seems interesting, but as per Mike's response (which seems to
have anticipated all my arguments heheh) your approach is a bit reversed,
solving a ""non-existent"" problem (of zeroing hugetlb pages in fault
time), whereas the big problem hereby tentatively fixed is the massive
delay on allocation time of the hugetlb pages.

I understand that your suggestion has no burden of introducing more GFP
flags, and I agree that those are potentially dangerous if misused (and I
totally agree with David that __GFP_NOINIT_ON_ALLOC is heinous, I'd rather
go with the originally proposed __GFP_NO_AUTOINIT), but... wouldn't it be
letting the code just drive a design decision? Like "oh, adding a flag is
so bad..better just let this bug/perf issue to stay".

I agree with the arguments here, don't get me wrong - specially since I'm
far from being any kind of mm expert, I trust your judgement that GFP flags
are the utmost villains - but at the same time I'd rather not change
something (like the hugetlb zeroing code) that is not really fixing the
issue discussed here. I'm open to other suggestions, of course, but the GFP
flag seems the least hacky way of fixing this, and ultimately, the flags
are meant for this, right? Controlling page behavior.

About misuse of a GFP flag, this is a risk for every "API" in the kernel,
and we rely on the (knowingly great) kernel review process to block that.
We could even have a more "terrifying" comment around the flag, asking new
users to CC all relevant involved people in the patch submission before
using it...

Anyway, thanks a bunch for the good points raised here Michal, David and
Mike, and I appreciate your patience with somebody trying to mess with your
GFP flags. Let me know your thoughts!
Cheers,

Guilherme
On 20.10.20 21:19, Guilherme G. Piccoli wrote: > Hi Michal, thanks a lot for your thorough response. I'll address the > comments inline, below. Thanks also David and Mike - in fact, I almost > don't need to respond here after Mike, he was right to the point I'm > going to discuss heh... > > > On 20/10/2020 05:20, Michal Hocko wrote: >> >> Yes zeroying is quite costly and that is to be expected when the feature >> is enabled. Hugetlb like other allocator users perform their own >> initialization rather than go through __GFP_ZERO path. More on that >> below. >> >> Could you be more specific about why this is a problem. Hugetlb pool is >> usualy preallocatd once during early boot. 24s for 65GB of 2MB pages >> is non trivial amount of time but it doens't look like a major disaster >> either. If the pool is allocated later it can take much more time due to >> memory fragmentation. >> >> I definitely do not want to downplay this but I would like to hear about >> the real life examples of the problem. > > Indeed, 24s of delay (!) is not so harmful for boot time, but...64G was > just my simple test in a guest, the real case is much worse! It aligns > with Mike's comment, we have complains of minute-like delays, due to a > very big pool of hugepages being allocated. > > Users have their own methodology for allocating pages, some would prefer > do that "later" for a variety of reasons, so early boot time allocations > are not always used, that shouldn't be the only focus of the discussion > here. > In the specific report I had, the user complains about more than 3 > minutes to allocate ~542G of 2M hugetlb pages. > > Now, you'll ask why in the heck they are using init_on_alloc then - > right? So, the Kconfig option "CONFIG_INIT_ON_ALLOC_DEFAULT_ON" is set > by default in Ubuntu, for hardening reasons. So, the workaround for the > users complaining of delays in allocating hugetlb pages currently is to > set "init_on_alloc" to 0. It's a bit lame to ask users to disable such > hardening thing just because we have a double initialization in hugetlb... > > >> >> >> This has been discussed already (http://lkml.kernel.org/r/20190514143537.10435-4-glider@google.com. >> Previously it has been brought up in SLUB context AFAIR. Your numbers >> are quite clear here but do we really need a gfp flag with all the >> problems we tend to grow in with them? >> >> One potential way around this specifically for hugetlb would be to use >> __GFP_ZERO when allocating from the allocator and marking the fact in >> the struct page while it is sitting in the pool. Page fault handler >> could then skip the zeroying phase. Not an act of beauty TBH but it >> fits into the existing model of the full control over initialization. >> Btw. it would allow to implement init_on_free semantic as well. I >> haven't implemented the actual two main methods >> hugetlb_test_clear_pre_init_page and hugetlb_mark_pre_init_page because >> I am not entirely sure about the current state of hugetlb struct page in >> the pool. But there should be a lot of room in there (or in tail pages). >> Mike will certainly know much better. But the skeleton of the patch >> would look like something like this (not even compile tested). >> [code...] > > Thanks a lot for pointing the previous discussion for me! I should have > done my homework properly and read all versions of the patchset...my > bad! I'm glad to see this problem was discussed and considered early in > the patch submission, I guess it only missed more real-world numbers. 
>
> Your approach seems interesting, but as per Mike's response (which seems
> to have anticipated all my arguments heheh) your approach is a bit
> reversed, solving a ""non-existent"" problem (of zeroing hugetlb pages
> in fault time), whereas the big problem hereby tentatively fixed is the
> massive delay on allocation time of the hugetlb pages.
>
> I understand that your suggestion has no burden of introducing more GFP
> flags, and I agree that those are potentially dangerous if misused (and
> I totally agree with David that __GFP_NOINIT_ON_ALLOC is heinous, I'd
> rather go with the originally proposed __GFP_NO_AUTOINIT), but...
> wouldn't it be letting the code just drive a design decision? Like "oh,
> adding a flag is so bad..better just let this bug/perf issue to stay".

The main problem I have is that the page alloc code does some internal page
allocator things ("init_on_alloc" - "Fill newly allocated pages and heap
objects with zeroes"), and we're allowing users of page alloc code *that
really shouldn't have to care* to override that behavior, exposing
unnecessary complexity. Mainly: other allocators.

"__GFP_NOINIT_ON_ALLOC" - what exactly does it do?
"__GFP_NO_AUTOINIT" - what exactly does it do?

__GFP_ZERO set:     page always zero.
__GFP_ZERO not set: page zero with init_on_alloc, page not necessarily
                    zero without init_on_alloc. Users can find out by
                    looking at init_on_alloc.

IMHO, even something like __GFP_DONT_ZERO would be clearer. But I still
somewhat don't like letting users of the buddy override configured
behavior. Yes, it could be used by other allocators (like hugetlb) to
optimize. But it could also be used by any driver wanting to optimize the
"init_on_alloc" case, eventually introducing security issues because the
code tries to be smart.
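In code, the table above amounts to the following caller-side reasoning -
an illustrative fragment only (want_init_on_alloc() is the existing helper
from include/linux/mm.h; the surrounding caller is hypothetical):

	/* Illustration: what a buddy allocator user may assume about
	 * the contents of a freshly allocated page. */
	struct page *page = alloc_pages(gfp_mask, order);

	if (gfp_mask & __GFP_ZERO) {
		/* Always zeroed, regardless of init_on_alloc. */
	} else if (want_init_on_alloc(gfp_mask)) {
		/* Zeroed only because init_on_alloc hardening is on. */
	} else {
		/* Contents undefined - the caller must initialize. */
	}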
When I first wrote that, the design was a bit different, the flag was
called __GFP_HTLB_PAGE or something like that. The design was to
signal/mark the composing pages of hugetlb as exactly this: they are pages
composing a huge page of hugetlb "type". Then, I skipped the
"init_on_alloc" thing for such pages.

If your concern is more about semantics (or about giving multiple users,
like drivers, the power to try to "optimize" their code and skip this
security feature), I think my first approach was better! That way, the flag
would be restricted to hugetlb usage only. I changed my mind about that
approach before submitting for 2 reasons:

(a) It feels like a waste of resources to have a GFP flag *only* to signal
regular pages composing hugetlb pages - it's a single user only, forever!

(b) Having 2 conditional settings on __GFP_BITS_SHIFT (LOCKDEP and HUGETLB)
started to make this define a bit tricky to code, since we'd have 2
Kconfig-conditional bits to be set.

So, I've moved to this other approach, hereby submitted.
Cheers,

Guilherme
On 20.10.20 22:07, David Hildenbrand wrote: > On 20.10.20 21:19, Guilherme G. Piccoli wrote: >> Hi Michal, thanks a lot for your thorough response. I'll address the >> comments inline, below. Thanks also David and Mike - in fact, I almost >> don't need to respond here after Mike, he was right to the point I'm >> going to discuss heh... >> >> >> On 20/10/2020 05:20, Michal Hocko wrote: >>> >>> Yes zeroying is quite costly and that is to be expected when the feature >>> is enabled. Hugetlb like other allocator users perform their own >>> initialization rather than go through __GFP_ZERO path. More on that >>> below. >>> >>> Could you be more specific about why this is a problem. Hugetlb pool is >>> usualy preallocatd once during early boot. 24s for 65GB of 2MB pages >>> is non trivial amount of time but it doens't look like a major disaster >>> either. If the pool is allocated later it can take much more time due to >>> memory fragmentation. >>> >>> I definitely do not want to downplay this but I would like to hear about >>> the real life examples of the problem. >> >> Indeed, 24s of delay (!) is not so harmful for boot time, but...64G was >> just my simple test in a guest, the real case is much worse! It aligns >> with Mike's comment, we have complains of minute-like delays, due to a >> very big pool of hugepages being allocated. >> >> Users have their own methodology for allocating pages, some would prefer >> do that "later" for a variety of reasons, so early boot time allocations >> are not always used, that shouldn't be the only focus of the discussion >> here. >> In the specific report I had, the user complains about more than 3 >> minutes to allocate ~542G of 2M hugetlb pages. >> >> Now, you'll ask why in the heck they are using init_on_alloc then - >> right? So, the Kconfig option "CONFIG_INIT_ON_ALLOC_DEFAULT_ON" is set >> by default in Ubuntu, for hardening reasons. So, the workaround for the >> users complaining of delays in allocating hugetlb pages currently is to >> set "init_on_alloc" to 0. It's a bit lame to ask users to disable such >> hardening thing just because we have a double initialization in hugetlb... >> >> >>> >>> >>> This has been discussed already (http://lkml.kernel.org/r/20190514143537.10435-4-glider@google.com. >>> Previously it has been brought up in SLUB context AFAIR. Your numbers >>> are quite clear here but do we really need a gfp flag with all the >>> problems we tend to grow in with them? >>> >>> One potential way around this specifically for hugetlb would be to use >>> __GFP_ZERO when allocating from the allocator and marking the fact in >>> the struct page while it is sitting in the pool. Page fault handler >>> could then skip the zeroying phase. Not an act of beauty TBH but it >>> fits into the existing model of the full control over initialization. >>> Btw. it would allow to implement init_on_free semantic as well. I >>> haven't implemented the actual two main methods >>> hugetlb_test_clear_pre_init_page and hugetlb_mark_pre_init_page because >>> I am not entirely sure about the current state of hugetlb struct page in >>> the pool. But there should be a lot of room in there (or in tail pages). >>> Mike will certainly know much better. But the skeleton of the patch >>> would look like something like this (not even compile tested). >>> [code...] >> >> Thanks a lot for pointing the previous discussion for me! I should have >> done my homework properly and read all versions of the patchset...my >> bad! 
I'm glad to see this problem was discussed and considered early in >> the patch submission, I guess it only missed more real-world numbers. >> >> Your approach seems interesting, but as per Mike's response (which seems >> to have anticipated all my arguments heheh) your approach is a bit >> reversed, solving a ""non-existent"" problem (of zeroing hugetlb pages >> in fault time), whereas the big problem hereby tentatively fixed is the >> massive delay on allocation time of the hugetlb pages. >> >> I understand that your suggestion has no burden of introducing more GFP >> flags, and I agree that those are potentially dangerous if misused (and >> I totally agree with David that __GFP_NOINIT_ON_ALLOC is heinous, I'd >> rather go with the originally proposed __GFP_NO_AUTOINIT), but... >> wouldn't it be letting the code just drive a design decision? Like "oh, >> adding a flag is so bad..better just let this bug/perf issue to stay". > > The main problem I have is that page alloc code does some internal page > allocator things ("init_on_alloc" - "Fill newly allocated pages and heap > objects with zeroes"), and we're allowing users of page alloc code *that > really shouldn't have to care* to override that behavior, exposing > unnecessary complexity. Mainly: other allocators. > > "__GFP_NOINIT_ON_ALLOC" - what exactly does it do? > "__GFP_NO_AUTOINIT" - what exactly does it do? > > __GFP_ZERO set: page always zero. > __GFP_ZERO not set: page zero with init_on_alloc, page not necessarily > zero without init_on_alloc. Users can find out by > looking at init_on_alloc. > > IMHO, even something like __GFP_DONT_ZERO would be clearer. But I still > somewhat don't like letting users of the buddy override configured > behavior. Yes, it could be used by other alloactors (like hugetlb) to > optimize. > > But it could also be used by any driver wanting to optimize the > "init_on_alloc" case, eventually introducing security issues because the > code tries to be smart. > BTW, there might be other users for something like __GFP_DONT_ZERO. Especially, memory ballooning drivers (and virtio-mem), whereby the hypervisor is (WHP) going to zap the page either way after allocation. You just cannot assume that when freeing such a page again, that it's actually zero. But then, somebody told the system to suffer ("alloc_on_init"), so there isn't too much motivation to optimize such corner cases.
On Tue 20-10-20 16:19:06, Guilherme G. Piccoli wrote: > On 20/10/2020 05:20, Michal Hocko wrote: > > > > Yes zeroying is quite costly and that is to be expected when the feature > > is enabled. Hugetlb like other allocator users perform their own > > initialization rather than go through __GFP_ZERO path. More on that > > below. > > > > Could you be more specific about why this is a problem. Hugetlb pool is > > usualy preallocatd once during early boot. 24s for 65GB of 2MB pages > > is non trivial amount of time but it doens't look like a major disaster > > either. If the pool is allocated later it can take much more time due to > > memory fragmentation. > > > > I definitely do not want to downplay this but I would like to hear about > > the real life examples of the problem. > > Indeed, 24s of delay (!) is not so harmful for boot time, but...64G was > just my simple test in a guest, the real case is much worse! It aligns > with Mike's comment, we have complains of minute-like delays, due to a > very big pool of hugepages being allocated. The cost of page clearing is mostly a constant overhead so it is quite natural to see the time scaling with the number of pages. That overhead has to happen at some point of time. Sure it is more visible when allocating during boot time resp. when doing pre-allocation during runtime. The page fault path would be then faster. The overhead just moves to a different place. So I am not sure this is really a strong argument to hold. [...] > Now, you'll ask why in the heck they are using init_on_alloc then - > right? So, the Kconfig option "CONFIG_INIT_ON_ALLOC_DEFAULT_ON" is set > by default in Ubuntu, for hardening reasons. This is not really that important as long as you properly explain your users what to expect from the default configuration. The hardening aspect of this default might be really valuable but it comes with a price. A non trivial one. The example you have highlighted is just one from many. The one we can actually workaround although the more I think about it the less I am convinced this is a good idea. Effectively _any_ !__GFP_ZERO allocation suffers from double initialization in one form or another. In the worst case the whole content of the page gets overwritten. Consider any page cache allocation for read for example. This is GFP_*USER* request that gets overwritten by the file content. Is this visible? Not really on a single page but consider how many pages you read from disk and you will get a considerable overhead. Should we exempt these allocations as well to reduce the overhead? I dot think so. This directly undermines the objective of the hardening. AFAIU the whole point of init_on_alloc is to prevent from previous content leaking to a new user without relying the new user is going to do right thing and initialize everything properly. Hugetlb is no different here. It just doesn't bother to implement what init_on_{alloc,free} promises. If there is a bug in hugetlb fault handler then you can still leak data from one user to another potentially. GFP_I_WANT_TO_OPT_OUT is actually allowing to increase the number of users who would like to reduce overhead and risk data leak. The power of unconditional initialization is in the fact that this is clearly trivial to check that nothing gets missed. > So, the workaround for the > users complaining of delays in allocating hugetlb pages currently is to > set "init_on_alloc" to 0. 
It's a bit lame to ask users to disable such > hardening thing just because we have a double initialization in hugetlb... It is not lame. It is expressing that people do not want to pay additional price for the hardening or they are not aware of the cost benefit here. > Your approach seems interesting, but as per Mike's response (which seems > to have anticipated all my arguments heheh) your approach is a bit > reversed, solving a ""non-existent"" problem (of zeroing hugetlb pages > in fault time), whereas the big problem hereby tentatively fixed is the > massive delay on allocation time of the hugetlb pages. Yes, my approach is not great either because it breaks the core assumption of init_on_alloc and that is that the initialization is done at a well defined spot. It had to go to all callers of the hugetlb allocator (alloc_buddy_huge_page) and fix them up. The proper thing to do would be to do the initialization (or opt out if the page is pre-zeroed either from the page allocator or from init_on_free) in alloc_buddy_huge_page. [...] > About misuse of a GFP flag, this is a risk for every "API" on kernel, > and we rely in the (knowingly great) kernel review process to block > that. Been there done that. There were and still are many examples. I have spent non-trivial time on clean ups. My experience is that the review process simply doesn't stop random drivers or subsystems to do what they think is right. > We could even have a more "terrifying" comment there around the > flag, asking new users to CC all relevant involved people in the patch > submission before using that... Been there done that. There is always room to improve though.
On Tue 20-10-20 17:19:42, Guilherme Piccoli wrote:
> When I first wrote that, the design was a bit different, the flag was
> called __GFP_HTLB_PAGE or something like that. The design was to
> signal/mark the composing pages of hugetlb as exactly this: they are
> pages composing a huge page of hugetlb "type". Then, I skipped the
> "init_on_alloc" thing for such pages.

As pointed out in the other email, this is not about hugetlb, although
there it might be more visible than for others - they just add a tiny bit
to an overall overhead. Each page cache read, CoW and many many other
!__GFP_ZERO users are in the same position when they double initialize. A
dedicated __GFP_HTLB_PAGE is really focusing on the wrong side of the
problem.

We do have __GFP_ZERO for a good reason, and that is to optimize the
initialization. init_on_alloc goes effectively against this approach with a
"potentially broken code" philosophy in mind, and that is good as a
hardening measure indeed. But it comes with an increased overhead and/or a
shifted layer where the overhead happens. Sure, there is some room to
optimize the code here and there, but the primary idea of the hardening is
to make the initialization dead trivial and make it clear that nothing can
sneak out.
On 21.10.20 08:15, Michal Hocko wrote: > On Tue 20-10-20 16:19:06, Guilherme G. Piccoli wrote: >> On 20/10/2020 05:20, Michal Hocko wrote: >>> >>> Yes zeroying is quite costly and that is to be expected when the feature >>> is enabled. Hugetlb like other allocator users perform their own >>> initialization rather than go through __GFP_ZERO path. More on that >>> below. >>> >>> Could you be more specific about why this is a problem. Hugetlb pool is >>> usualy preallocatd once during early boot. 24s for 65GB of 2MB pages >>> is non trivial amount of time but it doens't look like a major disaster >>> either. If the pool is allocated later it can take much more time due to >>> memory fragmentation. >>> >>> I definitely do not want to downplay this but I would like to hear about >>> the real life examples of the problem. >> >> Indeed, 24s of delay (!) is not so harmful for boot time, but...64G was >> just my simple test in a guest, the real case is much worse! It aligns >> with Mike's comment, we have complains of minute-like delays, due to a >> very big pool of hugepages being allocated. > > The cost of page clearing is mostly a constant overhead so it is quite > natural to see the time scaling with the number of pages. That overhead > has to happen at some point of time. Sure it is more visible when > allocating during boot time resp. when doing pre-allocation during > runtime. The page fault path would be then faster. The overhead just > moves to a different place. So I am not sure this is really a strong > argument to hold. We have people complaining that starting VMs backed by hugetlbfs takes too long, they would much rather have that initialization be done when booting the hypervisor ... so looks like there is no right or wrong.
On Wed 21-10-20 11:50:48, David Hildenbrand wrote: > On 21.10.20 08:15, Michal Hocko wrote: > > On Tue 20-10-20 16:19:06, Guilherme G. Piccoli wrote: > >> On 20/10/2020 05:20, Michal Hocko wrote: > >>> > >>> Yes zeroying is quite costly and that is to be expected when the feature > >>> is enabled. Hugetlb like other allocator users perform their own > >>> initialization rather than go through __GFP_ZERO path. More on that > >>> below. > >>> > >>> Could you be more specific about why this is a problem. Hugetlb pool is > >>> usualy preallocatd once during early boot. 24s for 65GB of 2MB pages > >>> is non trivial amount of time but it doens't look like a major disaster > >>> either. If the pool is allocated later it can take much more time due to > >>> memory fragmentation. > >>> > >>> I definitely do not want to downplay this but I would like to hear about > >>> the real life examples of the problem. > >> > >> Indeed, 24s of delay (!) is not so harmful for boot time, but...64G was > >> just my simple test in a guest, the real case is much worse! It aligns > >> with Mike's comment, we have complains of minute-like delays, due to a > >> very big pool of hugepages being allocated. > > > > The cost of page clearing is mostly a constant overhead so it is quite > > natural to see the time scaling with the number of pages. That overhead > > has to happen at some point of time. Sure it is more visible when > > allocating during boot time resp. when doing pre-allocation during > > runtime. The page fault path would be then faster. The overhead just > > moves to a different place. So I am not sure this is really a strong > > argument to hold. > > We have people complaining that starting VMs backed by hugetlbfs takes > too long, they would much rather have that initialization be done when > booting the hypervisor ... I can imagine. Everybody would love to have a free lunch ;) But more seriously, the overhead of the initialization is unavoidable. The memory has to be zeroed out by definition and somebody has to pay for that. Sure one can think of a deferred context to do that but this just spreads the overhead out to the overall system overhead. Even if the zeroying is done during the allocation time then it is the first user who can benefit from that. Any reuse of the hugetlb pool has to reinitialize again. One can still be creative - e.g. prefault hugetlb files from userspace in parallel and reuse that but I am not sure kernel should try to be clever here.
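As a concrete (userspace, illustrative) example of the "prefault in
parallel" idea: a process can touch each huge page of a mapping up front so
that the zeroing cost is paid before the latency-sensitive phase starts.
The sketch below is an assumption about how such a prefaulter might look,
not code from this thread; it uses an anonymous MAP_HUGETLB mapping and
hardcodes a 2MB huge page size (a hugetlbfs file mapping would work the
same way):

#include <stddef.h>
#include <sys/mman.h>

#define HPAGE_SIZE (2UL << 20)	/* assumed 2MB huge pages */

/*
 * Illustrative prefault helper: fault in (and therefore zero) every
 * huge page of an anonymous MAP_HUGETLB mapping up front.  Splitting
 * the loop across threads would parallelize the zeroing.
 */
static void *prefault_hugetlb(size_t nr_pages)
{
	size_t len = nr_pages * HPAGE_SIZE;
	char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);

	if (p == MAP_FAILED)
		return NULL;

	/* One write per huge page is enough to trigger the clearing. */
	for (size_t off = 0; off < len; off += HPAGE_SIZE)
		((volatile char *)p)[off] = 0;

	return p;
}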
On 10/21/20 4:31 AM, Michal Hocko wrote: > On Wed 21-10-20 11:50:48, David Hildenbrand wrote: >> On 21.10.20 08:15, Michal Hocko wrote: >>> On Tue 20-10-20 16:19:06, Guilherme G. Piccoli wrote: >>>> On 20/10/2020 05:20, Michal Hocko wrote: >>>>> >>>>> Yes zeroying is quite costly and that is to be expected when the feature >>>>> is enabled. Hugetlb like other allocator users perform their own >>>>> initialization rather than go through __GFP_ZERO path. More on that >>>>> below. >>>>> >>>>> Could you be more specific about why this is a problem. Hugetlb pool is >>>>> usualy preallocatd once during early boot. 24s for 65GB of 2MB pages >>>>> is non trivial amount of time but it doens't look like a major disaster >>>>> either. If the pool is allocated later it can take much more time due to >>>>> memory fragmentation. >>>>> >>>>> I definitely do not want to downplay this but I would like to hear about >>>>> the real life examples of the problem. >>>> >>>> Indeed, 24s of delay (!) is not so harmful for boot time, but...64G was >>>> just my simple test in a guest, the real case is much worse! It aligns >>>> with Mike's comment, we have complains of minute-like delays, due to a >>>> very big pool of hugepages being allocated. >>> >>> The cost of page clearing is mostly a constant overhead so it is quite >>> natural to see the time scaling with the number of pages. That overhead >>> has to happen at some point of time. Sure it is more visible when >>> allocating during boot time resp. when doing pre-allocation during >>> runtime. The page fault path would be then faster. The overhead just >>> moves to a different place. So I am not sure this is really a strong >>> argument to hold. >> >> We have people complaining that starting VMs backed by hugetlbfs takes >> too long, they would much rather have that initialization be done when >> booting the hypervisor ... > > I can imagine. Everybody would love to have a free lunch ;) But more > seriously, the overhead of the initialization is unavoidable. The memory > has to be zeroed out by definition and somebody has to pay for that. > Sure one can think of a deferred context to do that but this just > spreads the overhead out to the overall system overhead. > > Even if the zeroying is done during the allocation time then it is the > first user who can benefit from that. Any reuse of the hugetlb pool has > to reinitialize again. I remember a conversation with some of our database people who thought it best for their model if hugetlb pages in the pool were already clear so that no initialization was done at fault time. Of course, this requires clearing at page free time. In their model, they thought it better to pay the price at allocation (pool creation) time and free time so that faults would be as fast as possible. I wonder if the VMs backed by hugetlbfs pages would benefit from this behavior as well? If we track the initialized state (clean or not) of huge pages in the pool as suggested in Michal's skeleton of a patch, we 'could' then allow users to choose when hugetlb page clearing is done. None of that would address the original point of this thread, the global init_on_alloc parameter.
On 22.10.20 01:32, Mike Kravetz wrote: > On 10/21/20 4:31 AM, Michal Hocko wrote: >> On Wed 21-10-20 11:50:48, David Hildenbrand wrote: >>> On 21.10.20 08:15, Michal Hocko wrote: >>>> On Tue 20-10-20 16:19:06, Guilherme G. Piccoli wrote: >>>>> On 20/10/2020 05:20, Michal Hocko wrote: >>>>>> >>>>>> Yes zeroying is quite costly and that is to be expected when the feature >>>>>> is enabled. Hugetlb like other allocator users perform their own >>>>>> initialization rather than go through __GFP_ZERO path. More on that >>>>>> below. >>>>>> >>>>>> Could you be more specific about why this is a problem. Hugetlb pool is >>>>>> usualy preallocatd once during early boot. 24s for 65GB of 2MB pages >>>>>> is non trivial amount of time but it doens't look like a major disaster >>>>>> either. If the pool is allocated later it can take much more time due to >>>>>> memory fragmentation. >>>>>> >>>>>> I definitely do not want to downplay this but I would like to hear about >>>>>> the real life examples of the problem. >>>>> >>>>> Indeed, 24s of delay (!) is not so harmful for boot time, but...64G was >>>>> just my simple test in a guest, the real case is much worse! It aligns >>>>> with Mike's comment, we have complains of minute-like delays, due to a >>>>> very big pool of hugepages being allocated. >>>> >>>> The cost of page clearing is mostly a constant overhead so it is quite >>>> natural to see the time scaling with the number of pages. That overhead >>>> has to happen at some point of time. Sure it is more visible when >>>> allocating during boot time resp. when doing pre-allocation during >>>> runtime. The page fault path would be then faster. The overhead just >>>> moves to a different place. So I am not sure this is really a strong >>>> argument to hold. >>> >>> We have people complaining that starting VMs backed by hugetlbfs takes >>> too long, they would much rather have that initialization be done when >>> booting the hypervisor ... >> >> I can imagine. Everybody would love to have a free lunch ;) But more >> seriously, the overhead of the initialization is unavoidable. The memory >> has to be zeroed out by definition and somebody has to pay for that. >> Sure one can think of a deferred context to do that but this just >> spreads the overhead out to the overall system overhead. >> >> Even if the zeroying is done during the allocation time then it is the >> first user who can benefit from that. Any reuse of the hugetlb pool has >> to reinitialize again. > > I remember a conversation with some of our database people who thought > it best for their model if hugetlb pages in the pool were already clear > so that no initialization was done at fault time. Of course, this requires > clearing at page free time. In their model, they thought it better to pay > the price at allocation (pool creation) time and free time so that faults > would be as fast as possible. > > I wonder if the VMs backed by hugetlbfs pages would benefit from this > behavior as well? So what VMMs like qemu already do is prealloc/prefault all hugetlbfs memory (if told to, because it's not desired when overcommitting memory) - relevant for low-latency applications and similar. https://github.com/qemu/qemu/blob/67e8498937866b49b513e3acadef985c15f44fb5/util/oslib-posix.c#L561 That's why starting a VM backed by a lot of huge pages is slow when prefaulting: you wait until everything was zeroed before booting the VM. 
>
> If we track the initialized state (clean or not) of huge pages in the
> pool as suggested in Michal's skeleton of a patch, we 'could' then allow
> users to choose when hugetlb page clearing is done.

Right, in case of QEMU, if there are zeroed pages:
a) prealloc would be faster
b) page faults would be faster

Also we could do hugetlb page clearing from a background thread/process,
as also mentioned by Michal.

>
> None of that would address the original point of this thread, the global
> init_on_alloc parameter.

Yes, but I guess we're past that: whatever leaves the buddy shall be
zeroed out. That's the whole point of that security hardening mechanism.
On Thu 22-10-20 10:04:50, David Hildenbrand wrote:
[...]
> > None of that would address the original point of this thread, the global
> > init_on_alloc parameter.
>
> Yes, but I guess we're past that: whatever leaves the buddy shall be
> zeroed out. That's the whole point of that security hardening mechanism.

Hugetlb can control its zeroing behavior via a mount option (for
MAP_HUGETLB, controlled by a command line parameter). If the page fault
handler can recognize the pre-initialized pages, then both init_on* can be
implemented along with such a hugetlb-specific mechanism.
On 22.10.20 10:55, Michal Hocko wrote: > On Thu 22-10-20 10:04:50, David Hildenbrand wrote: > [...] >>> None of that would address the original point of this thread, the global >>> init_on_alloc parameter. >> >> Yes, but I guess we're past that: whatever leaves the buddy shall be >> zeroed out. That's the whole point of that security hardening mechanism. > > Hugetlb can control its zeroying behavior via mount option (for > MAP_HUGETLB controled by a command line parameter). If the page fault > handler can recognize the pre-initialized pages then both init_on* can Right, looking at init_on_alloc tells you if you have to zero after alloc or if it's already been done even though you didn't pass GFP_ZERO.
Thanks all for the valuable opinions and feedback! Closing the loop here:
the proposal wasn't accepted, but an interesting idea to optimize later
hugetlb allocations was discussed in the thread - it doesn't help our case
much, but it is a performance optimization.
Cheers,

Guilherme
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index c603237e006c..c03909f8e7b6 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -39,8 +39,9 @@ struct vm_area_struct;
 #define ___GFP_HARDWALL		0x100000u
 #define ___GFP_THISNODE		0x200000u
 #define ___GFP_ACCOUNT		0x400000u
+#define ___GFP_NOINIT_ON_ALLOC	0x800000u
 #ifdef CONFIG_LOCKDEP
-#define ___GFP_NOLOCKDEP	0x800000u
+#define ___GFP_NOLOCKDEP	0x1000000u
 #else
 #define ___GFP_NOLOCKDEP	0
 #endif
@@ -215,16 +216,19 @@ struct vm_area_struct;
  * %__GFP_COMP address compound page metadata.
  *
  * %__GFP_ZERO returns a zeroed page on success.
+ *
+ * %__GFP_NOINIT_ON_ALLOC avoids uspace pages to be double-cleared (like HugeTLB)
  */
-#define __GFP_NOWARN	((__force gfp_t)___GFP_NOWARN)
-#define __GFP_COMP	((__force gfp_t)___GFP_COMP)
-#define __GFP_ZERO	((__force gfp_t)___GFP_ZERO)
+#define __GFP_NOWARN		((__force gfp_t)___GFP_NOWARN)
+#define __GFP_COMP		((__force gfp_t)___GFP_COMP)
+#define __GFP_ZERO		((__force gfp_t)___GFP_ZERO)
+#define __GFP_NOINIT_ON_ALLOC	((__force gfp_t)___GFP_NOINIT_ON_ALLOC)
 
 /* Disable lockdep for GFP context tracking */
 #define __GFP_NOLOCKDEP ((__force gfp_t)___GFP_NOLOCKDEP)
 
 /* Room for N __GFP_FOO bits */
-#define __GFP_BITS_SHIFT (23 + IS_ENABLED(CONFIG_LOCKDEP))
+#define __GFP_BITS_SHIFT (24 + IS_ENABLED(CONFIG_LOCKDEP))
 #define __GFP_BITS_MASK ((__force gfp_t)((1 << __GFP_BITS_SHIFT) - 1))
 
 /**
diff --git a/include/linux/mm.h b/include/linux/mm.h
index ef360fe70aaf..7fa60d22a90a 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2882,7 +2882,7 @@ DECLARE_STATIC_KEY_FALSE(init_on_alloc);
 static inline bool want_init_on_alloc(gfp_t flags)
 {
 	if (static_branch_unlikely(&init_on_alloc) &&
-	    !page_poisoning_enabled())
+	    !page_poisoning_enabled() && !(flags & __GFP_NOINIT_ON_ALLOC))
 		return true;
 	return flags & __GFP_ZERO;
 }
diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h
index 67018d367b9f..89b0c0ddcc52 100644
--- a/include/trace/events/mmflags.h
+++ b/include/trace/events/mmflags.h
@@ -48,7 +48,8 @@
 	{(unsigned long)__GFP_WRITE,		"__GFP_WRITE"},		\
 	{(unsigned long)__GFP_RECLAIM,		"__GFP_RECLAIM"},	\
 	{(unsigned long)__GFP_DIRECT_RECLAIM,	"__GFP_DIRECT_RECLAIM"},\
-	{(unsigned long)__GFP_KSWAPD_RECLAIM,	"__GFP_KSWAPD_RECLAIM"}\
+	{(unsigned long)__GFP_KSWAPD_RECLAIM,	"__GFP_KSWAPD_RECLAIM"},\
+	{(unsigned long)__GFP_NOINIT_ON_ALLOC,	"__GFP_NOINIT_ON_ALLOC"}\
 
 #define show_gfp_flags(flags)						\
 	(flags) ? __print_flags(flags, "|",				\
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index fe76f8fd5a73..c60a6726b0be 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1742,6 +1742,13 @@ static struct page *alloc_fresh_huge_page(struct hstate *h,
 {
 	struct page *page;
 
+	/*
+	 * Signal the following page allocs to avoid them being cleared
+	 * in allocation time - since HugeTLB pages are *only* used as
+	 * userspace pages, they'll be cleared by default before usage.
+	 */
+	gfp_mask |= __GFP_NOINIT_ON_ALLOC;
+
 	if (hstate_is_gigantic(h))
 		page = alloc_gigantic_page(h, gfp_mask, nid, nmask);
 	else
diff --git a/tools/perf/builtin-kmem.c b/tools/perf/builtin-kmem.c
index a50dae2c4ae9..bef90d8bb7f6 100644
--- a/tools/perf/builtin-kmem.c
+++ b/tools/perf/builtin-kmem.c
@@ -660,6 +660,7 @@ static const struct {
 	{ "__GFP_RECLAIM",		"R" },
 	{ "__GFP_DIRECT_RECLAIM",	"DR" },
 	{ "__GFP_KSWAPD_RECLAIM",	"KR" },
+	{ "__GFP_NOINIT_ON_ALLOC",	"NIA" },
 };
 
 static size_t max_gfp_len;
Commit 6471384af2a6 ("mm: security: introduce init_on_alloc=1 and
init_on_free=1 boot options") introduced the option for
clearing/initializing kernel pages at allocation time - this can be
achieved either using a parameter or a Kconfig setting. The goal of the
change was a kernel hardening measure.

Although the performance degradation with "init_on_alloc" is considered
low, there is one case in which it can be noticed and may impact the
latency of the system - when hugetlb pages are allocated. Those pages are
meant to be used by userspace *only* (differently from THP, for example).
At allocation time for hugetlb, their component pages go through the
initialization/clearing process in prep_new_page() ->
kernel_init_free_pages() and, when used in userspace mappings, the hugetlb
pages are _again_ cleared; I've checked that in practice by running the
kernel selftest [0] for hugetlb mapping - the pages go through
clear_huge_page() on page fault [see hugetlb_no_page()].

This patch proposes a way to prevent this resource waste by skipping the
page initialization/clearing if the page is a component of a hugetlb page
(even if "init_on_alloc" or the respective Kconfig are set). The
performance improvement measured in [1] demonstrates that it is effective
and brings the hugetlb allocation time to the same level as with
"init_on_alloc" disabled. Although we've used sysctl to allocate hugetlb
pages in our tests, the same delay happens at early boot time when hugetlb
parameters are set on the kernel cmdline (and "init_on_alloc" is set).

[0] tools/testing/selftests/vm/map_hugetlb.c

[1] Test results - all tests executed on a pristine kernel 5.9+, from
2020-10-19, at commit 7cf726a594353010. A virtual machine with 96G of total
memory was used; the test consists of allocating 64G of 2M hugetlb pages.
Results below:

* Without this patch, init_on_alloc=1

$ cat /proc/meminfo |grep "MemA\|Hugetlb"
MemAvailable:   97892212 kB
Hugetlb:        0 kB

$ time echo 32768 > /proc/sys/vm/nr_hugepages
real    0m24.189s
user    0m0.000s
sys     0m24.184s

$ cat /proc/meminfo |grep "MemA\|Hugetlb"
MemAvailable:   30784732 kB
Hugetlb:        67108864 kB

* Without this patch, init_on_alloc=0

$ cat /proc/meminfo |grep "MemA\|Hugetlb"
MemAvailable:   97892752 kB
Hugetlb:        0 kB

$ time echo 32768 > /proc/sys/vm/nr_hugepages
real    0m0.316s
user    0m0.000s
sys     0m0.316s

$ cat /proc/meminfo |grep "MemA\|Hugetlb"
MemAvailable:   30783628 kB
Hugetlb:        67108864 kB

* WITH this patch, init_on_alloc=1

$ cat /proc/meminfo |grep "MemA\|Hugetlb"
MemAvailable:   97891952 kB
Hugetlb:        0 kB

$ time echo 32768 > /proc/sys/vm/nr_hugepages
real    0m0.209s
user    0m0.000s
sys     0m0.209s

$ cat /proc/meminfo |grep "MemA\|Hugetlb"
MemAvailable:   30782964 kB
Hugetlb:        67108864 kB

* WITH this patch, init_on_alloc=0

$ cat /proc/meminfo |grep "MemA\|Hugetlb"
MemAvailable:   97892620 kB
Hugetlb:        0 kB

$ time echo 32768 > /proc/sys/vm/nr_hugepages
real    0m0.206s
user    0m0.000s
sys     0m0.206s

$ cat /proc/meminfo |grep "MemA\|Hugetlb"
MemAvailable:   30783804 kB
Hugetlb:        67108864 kB

Signed-off-by: Guilherme G. Piccoli <gpiccoli@canonical.com>
Suggested-by: Thadeu Lima de Souza Cascardo <cascardo@canonical.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: James Morris <jamorris@linux.microsoft.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
---

Hi everybody, thanks in advance for the review/comments.
I'd like to point 2 things related to the implementation:

1) I understand that adding GFP flags is not really welcome by the
mm community; I've considered passing that as function parameter but
that would be a hacky mess, so I decided to add the flag since it seems
this is a fair use of the flag mechanism (to control actions on pages).
If anybody has a better/simpler suggestion to implement this, I'm all
ears - thanks!

2) The checkpatch script gave me the following error, but I decided to
ignore it in order to maintain the same format present in the file:

ERROR: space required after that close brace '}'
#171: FILE: include/trace/events/mmflags.h:52:

 include/linux/gfp.h            | 14 +++++++++-----
 include/linux/mm.h             |  2 +-
 include/trace/events/mmflags.h |  3 ++-
 mm/hugetlb.c                   |  7 +++++++
 tools/perf/builtin-kmem.c      |  1 +
 5 files changed, 20 insertions(+), 7 deletions(-)