Message ID | 20210413104747.12177-4-osalvador@suse.de (mailing list archive)
---|---
State | New, archived
Series | Make alloc_contig_range handle Hugetlb pages
On Tue 13-04-21 12:47:43, Oscar Salvador wrote:
> Currently, the clearing of the flag is done under the lock, but this
> is unnecessary as we just allocated the page and we did not give it
> away yet, so no one should be messing with it.
>
> Also, this helps make clear that here the lock is only protecting the
> counter.

While moving the flag clearing is ok, I am wondering why we need it in
the first place. I think it is just a leftover from 6c0371490140
("hugetlb: convert PageHugeFreed to HPageFreed flag"). Prior to that, a
tail page was used to keep track of the state, but now everything
happens in the head page, and the flag uses page->private, which is
always initialized when allocated by the allocator (post_alloc_hook).

Or do we need it for giga pages, which are not allocated by the page
allocator? If yes then moving it to prep_compound_gigantic_page would
be better.

So should we just drop it here?

> Signed-off-by: Oscar Salvador <osalvador@suse.de>
> ---
>  mm/hugetlb.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 54d81d5947ed..e40d5fe5c63c 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -1490,10 +1490,10 @@ static void prep_new_huge_page(struct hstate *h, struct page *page, int nid)
>  	hugetlb_set_page_subpool(page, NULL);
>  	set_hugetlb_cgroup(page, NULL);
>  	set_hugetlb_cgroup_rsvd(page, NULL);
> +	ClearHPageFreed(page);
>  	spin_lock_irq(&hugetlb_lock);
>  	h->nr_huge_pages++;
>  	h->nr_huge_pages_node[nid]++;
> -	ClearHPageFreed(page);
>  	spin_unlock_irq(&hugetlb_lock);
>  }
>
> --
> 2.16.3
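For context, the page->private initialization Michal refers to lives in
post_alloc_hook() in mm/page_alloc.c. An abridged sketch of the
v5.12-era code, with some steps elided:

	inline void post_alloc_hook(struct page *page, unsigned int order,
					gfp_t gfp_flags)
	{
		/* Every page leaving the buddy allocator gets its
		 * page->private cleared here, which also clears the
		 * hugetlb page flags stored in that field. */
		set_page_private(page, 0);
		set_page_refcounted(page);

		arch_alloc_page(page, order);
		kasan_alloc_pages(page, order);
		kernel_unpoison_pages(page, 1 << order);
		set_page_owner(page, order, gfp_flags);

		if (!want_init_on_free() && want_init_on_alloc(gfp_flags))
			kernel_init_free_pages(page, 1 << order);
	}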
On 4/13/21 6:23 AM, Michal Hocko wrote:
> On Tue 13-04-21 12:47:43, Oscar Salvador wrote:
>> Currently, the clearing of the flag is done under the lock, but this
>> is unnecessary as we just allocated the page and we did not give it
>> away yet, so no one should be messing with it.
>>
>> Also, this helps make clear that here the lock is only protecting the
>> counter.
>
> While moving the flag clearing is ok, I am wondering why we need it in
> the first place. I think it is just a leftover from 6c0371490140
> ("hugetlb: convert PageHugeFreed to HPageFreed flag"). Prior to that, a
> tail page was used to keep track of the state, but now everything
> happens in the head page, and the flag uses page->private, which is
> always initialized when allocated by the allocator (post_alloc_hook).

Yes, this was just left over from 6c0371490140. And, as you mention,
post_alloc_hook will clear page->private for all non-gigantic pages
allocated via buddy.

> Or do we need it for giga pages, which are not allocated by the page
> allocator? If yes then moving it to prep_compound_gigantic_page would
> be better.

I am pretty sure dynamically allocated giga pages have page->private
cleared as well. It is not obvious, but the alloc_contig_range code
used to put together giga pages will end up calling
isolate_freepages_range, which will call split_map_pages and then
post_alloc_hook for each MAX_ORDER block.

As mentioned, this is not obvious and I would hate to rely on this
behavior not changing.

> So should we just drop it here?

The only place where page->private may not be initialized is when we do
allocations at boot time from memblock. In this case, we will add the
pages to the free list via put_page/free_huge_page, so the appropriate
flags will be cleared before anyone notices.

I'm wondering if we should just do a set_page_private(page, 0) here in
prep_new_huge_page since we now use that field for flags. Or, is that
overkill?
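The chain Mike describes bottoms out in split_map_pages() in
mm/compaction.c; an abridged sketch of the v5.12-era function, showing
the post_alloc_hook() call that clears page->private for each isolated
free block:

	static void split_map_pages(struct list_head *list)
	{
		unsigned int i, order, nr_pages;
		struct page *page, *next;
		LIST_HEAD(tmp_list);

		list_for_each_entry_safe(page, next, list, lru) {
			list_del(&page->lru);

			order = page_private(page);
			nr_pages = 1 << order;

			/* This is where each isolated block, including the
			 * ones making up a future giga page, gets its
			 * page->private cleared. */
			post_alloc_hook(page, order, __GFP_MOVABLE);
			if (order)
				split_page(page, order);

			for (i = 0; i < nr_pages; i++) {
				list_add(&page->lru, &tmp_list);
				page++;
			}
		}

		list_splice(&tmp_list, list);
	}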
On Tue 13-04-21 14:19:03, Mike Kravetz wrote:
> On 4/13/21 6:23 AM, Michal Hocko wrote:
> > On Tue 13-04-21 12:47:43, Oscar Salvador wrote:
[...]
> > Or do we need it for giga pages, which are not allocated by the page
> > allocator? If yes then moving it to prep_compound_gigantic_page would
> > be better.
>
> I am pretty sure dynamically allocated giga pages have page->private
> cleared as well. It is not obvious, but the alloc_contig_range code
> used to put together giga pages will end up calling
> isolate_freepages_range, which will call split_map_pages and then
> post_alloc_hook for each MAX_ORDER block.

Thanks for saving me from crawling that code.

> As mentioned, this is not obvious and I would hate to rely on this
> behavior not changing.

Thinking about it some more, having some (page granularity) allocator
not clearing page private would be a serious problem for anybody
relying on its state. So I am not sure this can change.

> > So should we just drop it here?
>
> The only place where page->private may not be initialized is when we do
> allocations at boot time from memblock. In this case, we will add the
> pages to the free list via put_page/free_huge_page, so the appropriate
> flags will be cleared before anyone notices.

Pages allocated by the bootmem should be pre-initialized from the boot,
no?

> I'm wondering if we should just do a set_page_private(page, 0) here in
> prep_new_huge_page since we now use that field for flags. Or, is that
> overkill?

I would rather not duplicate the work done by underlying allocators. I
do not think other users of the allocator want to do the same, so why
should hugetlb be any different?
On Wed, Apr 14, 2021 at 08:04:21AM +0200, Michal Hocko wrote:
> On Tue 13-04-21 14:19:03, Mike Kravetz wrote:
> > On 4/13/21 6:23 AM, Michal Hocko wrote:
> > The only place where page->private may not be initialized is when we do
> > allocations at boot time from memblock. In this case, we will add the
> > pages to the free list via put_page/free_huge_page, so the appropriate
> > flags will be cleared before anyone notices.
>
> Pages allocated by the bootmem should be pre-initialized from the boot,
> no?

I guess Mike means:

hugetlb_hstate_alloc_pages
 alloc_bootmem_huge_page
  __alloc_bootmem_huge_page
   memblock_alloc_try_nid_raw

and AFAICS, memblock_alloc_try_nid_raw() does not zero the memory.

Then these pages are initialized in:

gather_bootmem_prealloc
 prep_compound_huge_page
  prep_new_huge_page

But as can be noticed, no one touches page->private when coming from
that path.
On Wed 14-04-21 09:41:32, Oscar Salvador wrote:
> On Wed, Apr 14, 2021 at 08:04:21AM +0200, Michal Hocko wrote:
> > On Tue 13-04-21 14:19:03, Mike Kravetz wrote:
> > > On 4/13/21 6:23 AM, Michal Hocko wrote:
> > > The only place where page->private may not be initialized is when we do
> > > allocations at boot time from memblock. In this case, we will add the
> > > pages to the free list via put_page/free_huge_page, so the appropriate
> > > flags will be cleared before anyone notices.
> >
> > Pages allocated by the bootmem should be pre-initialized from the boot,
> > no?
>
> I guess Mike means:
>
> hugetlb_hstate_alloc_pages
>  alloc_bootmem_huge_page
>   __alloc_bootmem_huge_page
>    memblock_alloc_try_nid_raw
>
> and AFAICS, memblock_alloc_try_nid_raw() does not zero the memory.

You are right, it doesn't do it there. But all struct pages, even those
backing memory handed out by the bootmem allocator, should be
initialized; they would be poisoned otherwise, right? I would have to
look at the exact code path, but IIRC this should happen around the
time the bootmem allocator state transitions to the page allocator.
On Wed, Apr 14, 2021 at 10:28:33AM +0200, Michal Hocko wrote:
> You are right, it doesn't do it there. But all struct pages, even those
> backing memory handed out by the bootmem allocator, should be
> initialized; they would be poisoned otherwise, right? I would have to
> look at the exact code path, but IIRC this should happen around the
> time the bootmem allocator state transitions to the page allocator.

Ok, you are right.
struct pages are initialized a bit earlier, through:

start_kernel
 setup_arch
  paging_init
   zone_sizes_init
    free_area_init
     free_area_init_node
      free_area_init_core
       memmap_init_zone
        memmap_init_range
         __init_single_page

While the allocation of bootmem hugetlb pages happens in:

start_kernel
 parse_args
  ...
   hugepages_setup
    ...
     hugetlb_hstate_alloc_pages
      __alloc_bootmem_huge_page

which is after the setup_arch() call.

So by the time we get the page from __alloc_bootmem_huge_page(), the
fields are zeroed.
I thought we might get in trouble because memblock_alloc_try_nid_raw()
calls page_init_poison(), which poisons the chunk with 0xff, e.g.:

[    1.955471] boot: ffffffffffffffff ffffffffffffffff ffffffffffffffff ffffffffffffffff
[    1.955476] boot: ffffffffffffffff ffffffffffffffff ffffffffffffffff ffffffffffffffff

but it seems that that does not affect the memmap struct pages.

I checked, and when we get there in __alloc_bootmem_huge_page,
page->private is still zeroed, so I guess it should be safe to assume
that we do not really need to clear the flag in __prep_new_huge_page()
routine?
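The last step of the first chain, __init_single_page() in
mm/page_alloc.c, is what leaves page->private zeroed for every memmap
entry; an abridged sketch of the v5.12-era code (the WANT_PAGE_VIRTUAL
branch is elided):

	static void __meminit __init_single_page(struct page *page,
				unsigned long pfn, unsigned long zone, int nid)
	{
		/* Zeroes the whole struct page, including page->private. */
		mm_zero_struct_page(page);
		set_page_links(page, zone, nid, pfn);
		init_page_count(page);
		page_mapcount_reset(page);
		page_cpupid_reset_last(page);
		page_kasan_tag_reset(page);

		INIT_LIST_HEAD(&page->lru);
	}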
On Wed 14-04-21 12:01:47, Oscar Salvador wrote:
> On Wed, Apr 14, 2021 at 10:28:33AM +0200, Michal Hocko wrote:
> > You are right, it doesn't do it there. But all struct pages, even those
> > backing memory handed out by the bootmem allocator, should be
> > initialized; they would be poisoned otherwise, right? I would have to
> > look at the exact code path, but IIRC this should happen around the
> > time the bootmem allocator state transitions to the page allocator.
>
> Ok, you are right.
> struct pages are initialized a bit earlier, through:
>
> start_kernel
>  setup_arch
>   paging_init
>    zone_sizes_init
>     free_area_init
>      free_area_init_node
>       free_area_init_core
>        memmap_init_zone
>         memmap_init_range
>          __init_single_page
>
> While the allocation of bootmem hugetlb pages happens in:
>
> start_kernel
>  parse_args
>   ...
>    hugepages_setup
>     ...
>      hugetlb_hstate_alloc_pages
>       __alloc_bootmem_huge_page
>
> which is after the setup_arch() call.

Thanks for pulling those paths out. It is always painful to crawl that
code.

> So by the time we get the page from __alloc_bootmem_huge_page(), the
> fields are zeroed.
> I thought we might get in trouble because memblock_alloc_try_nid_raw()
> calls page_init_poison(), which poisons the chunk with 0xff, e.g.:
>
> [    1.955471] boot: ffffffffffffffff ffffffffffffffff ffffffffffffffff ffffffffffffffff
> [    1.955476] boot: ffffffffffffffff ffffffffffffffff ffffffffffffffff ffffffffffffffff
>
> but it seems that that does not affect the memmap struct pages.

Well, to be precise, it does the very same thing with memmap struct
pages, but that is before the initialization code you have pointed out
above. In this context it just poisons the allocated content, which is
the GB page storage.

> I checked, and when we get there in __alloc_bootmem_huge_page,
> page->private is still zeroed, so I guess it should be safe to assume
> that we do not really need to clear the flag in __prep_new_huge_page()
> routine?

It would be quite nasty if the struct pages content would be undefined.
Maybe that is possible, but then I would rather stick the
initialization into __alloc_bootmem_huge_page.

Thanks!
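For reference, the poisoning under discussion; an abridged sketch of
the v5.12-era memblock_alloc_try_nid_raw() in mm/memblock.c. The poison
(PAGE_POISON_PATTERN, all 0xff bytes, compiled in only with
CONFIG_DEBUG_VM) hits the returned buffer, i.e. the GB page storage,
not the memmap describing it:

	void * __init memblock_alloc_try_nid_raw(phys_addr_t size,
				phys_addr_t align, phys_addr_t min_addr,
				phys_addr_t max_addr, int nid)
	{
		void *ptr;

		ptr = memblock_alloc_internal(size, align, min_addr,
					      max_addr, nid, false);
		/* Poison the allocated storage itself; the struct pages
		 * for this range are (re)initialized elsewhere. */
		if (ptr && size > 0)
			page_init_poison(ptr, size);

		return ptr;
	}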
On Wed, Apr 14, 2021 at 12:32:58PM +0200, Michal Hocko wrote:
> Well, to be precise, it does the very same thing with memmap struct
> pages, but that is before the initialization code you have pointed out
> above. In this context it just poisons the allocated content, which is
> the GB page storage.

Right.

> > I checked, and when we get there in __alloc_bootmem_huge_page,
> > page->private is still zeroed, so I guess it should be safe to assume
> > that we do not really need to clear the flag in __prep_new_huge_page()
> > routine?
>
> It would be quite nasty if the struct pages content would be undefined.
> Maybe that is possible, but then I would rather stick the
> initialization into __alloc_bootmem_huge_page.

Yes, but I do not think that is really possible unless I missed
something.
Let us see what Mike thinks of it; if there are no objections, we can
get rid of the flag clearing right there.
On Wed 14-04-21 12:49:53, Oscar Salvador wrote:
> On Wed, Apr 14, 2021 at 12:32:58PM +0200, Michal Hocko wrote:
[...]
> > > I checked, and when we get there in __alloc_bootmem_huge_page,
> > > page->private is still zeroed, so I guess it should be safe to assume
> > > that we do not really need to clear the flag in __prep_new_huge_page()
> > > routine?
> >
> > It would be quite nasty if the struct pages content would be undefined.
> > Maybe that is possible, but then I would rather stick the
> > initialization into __alloc_bootmem_huge_page.
>
> Yes, but I do not think that is really possible unless I missed
> something.

Yeah, it should be fine. I was thinking of an alloc, modify struct
pages, free back to the bootmem allocator sequence. But I do not
remember ever seeing a sequence like that. Bootmem allocator users tend
to be simple: they allocate storage and retain it for their lifetime.
Other than the PageReserved bit, they do not touch the metadata.

If we want to be paranoid, then we can add a VM_WARN_ON for an
unexpected state when allocating from the bootmem. But I am not sure
this is really worth it.
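No such check was part of any posted patch; a purely hypothetical
sketch of what the paranoia check could look like at the allocation
site in __alloc_bootmem_huge_page() (the placement and the use of
page_private() here are illustrative assumptions):

	addr = memblock_alloc_try_nid_raw(huge_page_size(h),
			huge_page_size(h), 0,
			MEMBLOCK_ALLOC_ACCESSIBLE, node);
	if (addr) {
		/* Hypothetical paranoia check: the memmap for
		 * bootmem-allocated ranges is expected to be
		 * initialized, so page->private (and with it
		 * HPageFreed) should already be zero. */
		VM_WARN_ON(page_private(virt_to_page(addr)) != 0);
		...
	}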
On 14.04.21 13:09, Michal Hocko wrote:
> On Wed 14-04-21 12:49:53, Oscar Salvador wrote:
>> On Wed, Apr 14, 2021 at 12:32:58PM +0200, Michal Hocko wrote:
> [...]
>>>> I checked, and when we get there in __alloc_bootmem_huge_page,
>>>> page->private is still zeroed, so I guess it should be safe to assume
>>>> that we do not really need to clear the flag in __prep_new_huge_page()
>>>> routine?
>>>
>>> It would be quite nasty if the struct pages content would be undefined.
>>> Maybe that is possible, but then I would rather stick the
>>> initialization into __alloc_bootmem_huge_page.
>>
>> Yes, but I do not think that is really possible unless I missed
>> something.
>
> Yeah, it should be fine. I was thinking of an alloc, modify struct
> pages, free back to the bootmem allocator sequence. But I do not
> remember ever seeing a sequence like that.

We do have code like that, though. Take a look at
arch/x86/mm/init_64.c:free_pagetable() as one example.

But in general, we assume the freeing code (buddy, but also memblock)
restores the memmap to the state it was in when the allocator handed it
out. So if it was properly initialized when coming from the allocator
the first time, it should remain properly initialized when re-entering
and re-leaving the allocator.
On 4/14/21 3:49 AM, Oscar Salvador wrote:
> On Wed, Apr 14, 2021 at 12:32:58PM +0200, Michal Hocko wrote:
>> Well, to be precise, it does the very same thing with memmap struct
>> pages, but that is before the initialization code you have pointed out
>> above. In this context it just poisons the allocated content, which is
>> the GB page storage.
>
> Right.
>
>>> I checked, and when we get there in __alloc_bootmem_huge_page,
>>> page->private is still zeroed, so I guess it should be safe to assume
>>> that we do not really need to clear the flag in __prep_new_huge_page()
>>> routine?
>>
>> It would be quite nasty if the struct pages content would be undefined.
>> Maybe that is possible, but then I would rather stick the
>> initialization into __alloc_bootmem_huge_page.
>
> Yes, but I do not think that is really possible unless I missed
> something.
> Let us see what Mike thinks of it; if there are no objections, we can
> get rid of the flag clearing right there.

Thanks for crawling through that code Oscar!

I do not think you missed anything. Let's just get rid of the flag
clearing.
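Given the conclusion above, prep_new_huge_page() would presumably end
up without the ClearHPageFreed() call altogether; a sketch built only
from the patch context below, with the function's earlier
initialization elided:

	static void prep_new_huge_page(struct hstate *h, struct page *page,
				       int nid)
	{
		/* earlier initialization (list head, destructor, ...) elided */
		hugetlb_set_page_subpool(page, NULL);
		set_hugetlb_cgroup(page, NULL);
		set_hugetlb_cgroup_rsvd(page, NULL);
		/* No ClearHPageFreed(): page->private is already zero here,
		 * courtesy of the underlying allocator (post_alloc_hook) or
		 * the boot-time memmap initialization. */
		spin_lock_irq(&hugetlb_lock);
		h->nr_huge_pages++;
		h->nr_huge_pages_node[nid]++;
		spin_unlock_irq(&hugetlb_lock);
	}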
Currently, the clearing of the flag is done under the lock, but this
is unnecessary as we just allocated the page and we did not give it
away yet, so no one should be messing with it.

Also, this helps make clear that here the lock is only protecting the
counter.

Signed-off-by: Oscar Salvador <osalvador@suse.de>
---
 mm/hugetlb.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 54d81d5947ed..e40d5fe5c63c 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1490,10 +1490,10 @@ static void prep_new_huge_page(struct hstate *h, struct page *page, int nid)
 	hugetlb_set_page_subpool(page, NULL);
 	set_hugetlb_cgroup(page, NULL);
 	set_hugetlb_cgroup_rsvd(page, NULL);
+	ClearHPageFreed(page);
 	spin_lock_irq(&hugetlb_lock);
 	h->nr_huge_pages++;
 	h->nr_huge_pages_node[nid]++;
-	ClearHPageFreed(page);
 	spin_unlock_irq(&hugetlb_lock);
 }