Message ID | 20230515170809.284680-1-tsahu@linux.ibm.com (mailing list archive) |
---|---|
State | New |
Series | [v2] mm/folio: Avoid special handling for order value 0 in folio_set_order |
Changes from v1:
- Changed the patch description. Added a comment from Mike.

~Tarun

Tarun Sahu <tsahu@linux.ibm.com> writes:

> folio_set_order(folio, 0) is used in the kernel in two places,
> __destroy_compound_gigantic_folio and __prep_compound_gigantic_folio.
> Currently, it is called to clear out folio->_folio_nr_pages and
> folio->_folio_order.
>
> For __destroy_compound_gigantic_folio:
> In the past, folio_set_order(folio, 0) was needed because page->mapping
> used to overlap with _folio_nr_pages and _folio_order. So if these fields
> were left uncleared when freeing gigantic hugepages, they caused a
> "BUG: bad page state" due to a non-zero page->mapping. Since commit
> a01f43901cfb ("hugetlb: be sure to free demoted CMA pages to CMA"),
> page->mapping has explicitly been cleared out for tail pages. Also,
> _folio_order and _folio_nr_pages no longer overlap with page->mapping.
>
> struct page {
>     ...
>     struct address_space * mapping;              /* 24  8 */
>     ...
> }
>
> struct folio {
>     ...
>     union {
>         struct {
>             long unsigned int _flags_1;          /* 64  8 */
>             long unsigned int _head_1;           /* 72  8 */
>             unsigned char _folio_dtor;           /* 80  1 */
>             unsigned char _folio_order;          /* 81  1 */
>
>             /* XXX 2 bytes hole, try to pack */
>
>             atomic_t _entire_mapcount;           /* 84  4 */
>             atomic_t _nr_pages_mapped;           /* 88  4 */
>             atomic_t _pincount;                  /* 92  4 */
>             unsigned int _folio_nr_pages;        /* 96  4 */
>         };                                       /* 64 40 */
>         struct page __page_1 __attribute__((__aligned__(8))); /* 64 64 */
>     }
>     ...
> }
>
> So, folio_set_order(folio, 0) can be removed from the gigantic folio
> freeing path (__destroy_compound_gigantic_folio).
>
> The other place folio_set_order(folio, 0) is called is the error path of
> __prep_compound_gigantic_folio. There it can also be removed if we move
> folio_set_order(folio, order) after the for loop.
>
> The patch also moves the __folio_set_head call in
> __prep_compound_gigantic_folio() so that we avoid clearing it in the
> error path.
>
> Also, as Mike pointed out:
> "It would actually be better to move the calls _folio_set_head and
> folio_set_order in __prep_compound_gigantic_folio() as suggested here.
> Why? In the current code, the ref count on the 'head page' is still 1
> (or more) while those calls are made. So, someone could take a
> speculative ref on the page BEFORE the tail pages are set up."
>
> This way, folio_set_order(folio, 0) is no longer needed. It also helps
> remove the confusion of the folio order being set to 0 (as the
> _folio_order field is part of the first tail page).
>
> Testing: I have run the LTP tests, which all pass. I have also written an
> LTP test that exercises the bug caused by compound_nr and page->mapping
> overlapping:
>
> https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/mem/hugetlb/hugemmap/hugemmap32.c
>
> On an older kernel (< 5.10-rc7) with the above bug it fails, while on
> newer kernels, and also with this patch applied, it passes.
>
> Signed-off-by: Tarun Sahu <tsahu@linux.ibm.com>
> ---
>  mm/hugetlb.c  | 9 +++------
>  mm/internal.h | 8 ++------
>  2 files changed, 5 insertions(+), 12 deletions(-)
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index f154019e6b84..607553445855 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -1489,7 +1489,6 @@ static void __destroy_compound_gigantic_folio(struct folio *folio,
>  		set_page_refcounted(p);
>  	}
>
> -	folio_set_order(folio, 0);
>  	__folio_clear_head(folio);
>  }
>
> @@ -1951,9 +1950,6 @@ static bool __prep_compound_gigantic_folio(struct folio *folio,
>  	struct page *p;
>
>  	__folio_clear_reserved(folio);
> -	__folio_set_head(folio);
> -	/* we rely on prep_new_hugetlb_folio to set the destructor */
> -	folio_set_order(folio, order);
>  	for (i = 0; i < nr_pages; i++) {
>  		p = folio_page(folio, i);
>
> @@ -1999,6 +1995,9 @@ static bool __prep_compound_gigantic_folio(struct folio *folio,
>  		if (i != 0)
>  			set_compound_head(p, &folio->page);
>  	}
> +	__folio_set_head(folio);
> +	/* we rely on prep_new_hugetlb_folio to set the destructor */
> +	folio_set_order(folio, order);
>  	atomic_set(&folio->_entire_mapcount, -1);
>  	atomic_set(&folio->_nr_pages_mapped, 0);
>  	atomic_set(&folio->_pincount, 0);
> @@ -2017,8 +2016,6 @@ static bool __prep_compound_gigantic_folio(struct folio *folio,
>  		p = folio_page(folio, j);
>  		__ClearPageReserved(p);
>  	}
> -	folio_set_order(folio, 0);
> -	__folio_clear_head(folio);
>  	return false;
>  }
>
> diff --git a/mm/internal.h b/mm/internal.h
> index 68410c6d97ac..c59fe08c5b39 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -425,16 +425,12 @@ int split_free_page(struct page *free_page,
>   */
>  static inline void folio_set_order(struct folio *folio, unsigned int order)
>  {
> -	if (WARN_ON_ONCE(!folio_test_large(folio)))
> +	if (WARN_ON_ONCE(!order || !folio_test_large(folio)))
>  		return;
>
>  	folio->_folio_order = order;
>  #ifdef CONFIG_64BIT
> -	/*
> -	 * When hugetlb dissolves a folio, we need to clear the tail
> -	 * page, rather than setting nr_pages to 1.
> -	 */
> -	folio->_folio_nr_pages = order ? 1U << order : 0;
> +	folio->_folio_nr_pages = 1U << order;
>  #endif
>  }
>
> --
> 2.31.1
On Mon, May 15, 2023 at 10:38:09PM +0530, Tarun Sahu wrote:
> @@ -1951,9 +1950,6 @@ static bool __prep_compound_gigantic_folio(struct folio *folio,
>  	struct page *p;
>
>  	__folio_clear_reserved(folio);
> -	__folio_set_head(folio);
> -	/* we rely on prep_new_hugetlb_folio to set the destructor */
> -	folio_set_order(folio, order);
>  	for (i = 0; i < nr_pages; i++) {
>  		p = folio_page(folio, i);
>
> @@ -1999,6 +1995,9 @@ static bool __prep_compound_gigantic_folio(struct folio *folio,
>  		if (i != 0)
>  			set_compound_head(p, &folio->page);
>  	}
> +	__folio_set_head(folio);
> +	/* we rely on prep_new_hugetlb_folio to set the destructor */
> +	folio_set_order(folio, order);

This makes me nervous, as I said before. This means that
compound_head(tail) can temporarily point to a page which is not marked
as a head page. That's different from prep_compound_page(). You need to
come up with some good argumentation for why this is safe, and no amount
of testing you do can replace it -- any race in this area will be subtle.
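For illustration, a minimal sketch (not from the thread; the function name
and the exact reader path are hypothetical) of the kind of lockless reader
Matthew is worried about: between set_compound_head() on a tail page and
__folio_set_head() on the head, compound_head(tail) already points at the
first page of the range while PG_head is still clear there.

#include <linux/mm.h>

/*
 * Hypothetical speculative reader. With the v2 ordering, a tail page's
 * compound_head() can already point at the first page of the range while
 * that page is not yet marked PG_head.
 */
static bool speculative_peek(struct page *page)
{
	struct page *head = compound_head(page);	/* may be page 0, not yet PG_head */

	if (!get_page_unless_zero(head))		/* fails once the refcount is frozen to 0 */
		return false;

	if (PageHead(head)) {
		/* ... treat @page as a tail of a compound page ... */
	} else {
		/* ... transiently looks like an order-0 page ... */
	}

	put_page(head);
	return true;
}

The counter-argument developed below is that every page in the range has
already had its refcount frozen to zero by the time the compound metadata
is published, so get_page_unless_zero()/folio_try_get_rcu() cannot succeed
and the inconsistent window is not observable through a speculative
reference.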
On 05/15/23 18:16, Matthew Wilcox wrote:
> On Mon, May 15, 2023 at 10:38:09PM +0530, Tarun Sahu wrote:
> > @@ -1951,9 +1950,6 @@ static bool __prep_compound_gigantic_folio(struct folio *folio,
> >  	struct page *p;
> >
> >  	__folio_clear_reserved(folio);
> > -	__folio_set_head(folio);
> > -	/* we rely on prep_new_hugetlb_folio to set the destructor */
> > -	folio_set_order(folio, order);
> >  	for (i = 0; i < nr_pages; i++) {
> >  		p = folio_page(folio, i);
> >
> > @@ -1999,6 +1995,9 @@ static bool __prep_compound_gigantic_folio(struct folio *folio,
> >  		if (i != 0)
> >  			set_compound_head(p, &folio->page);
> >  	}
> > +	__folio_set_head(folio);
> > +	/* we rely on prep_new_hugetlb_folio to set the destructor */
> > +	folio_set_order(folio, order);
>
> This makes me nervous, as I said before. This means that
> compound_head(tail) can temporarily point to a page which is not marked
> as a head page. That's different from prep_compound_page(). You need to
> come up with some good argumentation for why this is safe, and no amount
> of testing you do can replace it -- any race in this area will be subtle.

I added comments supporting this approach in the first version of the
patch. My argument was that this is actually safer than the existing
code. That is because we freeze the page (ref count 0) before setting
compound_head(tail). So, nobody should be taking any speculative refs on
those tail pages.

In the existing code, we set the compound page order in the head before
freezing the head or any tail pages. Therefore, speculative refs can be
taken on any of the pages while in this state.

If we want prep_compound_gigantic_folio to work like prep_compound_page
we would need to take two passes through the pages. In the first pass,
freeze all the pages and in the second set up the compound page.
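To make the two-pass alternative Mike mentions concrete, here is a rough,
untested sketch (not something posted in the thread; the function name and
the simplified error handling are ours, and it ignores the demote case and
destructor setup): freeze everything first, then publish the compound
metadata head-first, mirroring prep_compound_page()'s ordering.

#include <linux/mm.h>
#include <linux/page_ref.h>
/* folio_set_order() is an mm-internal helper from mm/internal.h */

static bool prep_gigantic_folio_two_pass(struct folio *folio, unsigned int order)
{
	int nr_pages = 1 << order;
	int i;

	/* Pass 1: freeze every page so no speculative reference can succeed. */
	for (i = 0; i < nr_pages; i++) {
		if (!page_ref_freeze(folio_page(folio, i), 1))
			goto out_unfreeze;
	}

	/* Pass 2: publish compound state, head first, then link the tails. */
	__folio_clear_reserved(folio);
	__folio_set_head(folio);
	folio_set_order(folio, order);
	for (i = 1; i < nr_pages; i++) {
		struct page *p = folio_page(folio, i);

		__ClearPageReserved(p);
		set_compound_head(p, &folio->page);
	}
	return true;

out_unfreeze:
	while (--i >= 0)
		page_ref_unfreeze(folio_page(folio, i), 1);
	return false;
}

As the rest of the thread concludes, the single-loop ordering in v2 already
freezes every page before any compound state is exposed, so this extra pass
was not judged necessary.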
Hi Matthew,

Matthew Wilcox <willy@infradead.org> writes:

> On Mon, May 15, 2023 at 10:38:09PM +0530, Tarun Sahu wrote:
>> @@ -1951,9 +1950,6 @@ static bool __prep_compound_gigantic_folio(struct folio *folio,
>>  	struct page *p;
>>
>>  	__folio_clear_reserved(folio);
>> -	__folio_set_head(folio);
>> -	/* we rely on prep_new_hugetlb_folio to set the destructor */
>> -	folio_set_order(folio, order);
>>  	for (i = 0; i < nr_pages; i++) {
>>  		p = folio_page(folio, i);
>>
>> @@ -1999,6 +1995,9 @@ static bool __prep_compound_gigantic_folio(struct folio *folio,
>>  		if (i != 0)
>>  			set_compound_head(p, &folio->page);
>>  	}
>> +	__folio_set_head(folio);
>> +	/* we rely on prep_new_hugetlb_folio to set the destructor */
>> +	folio_set_order(folio, order);
>
> This makes me nervous, as I said before. This means that
> compound_head(tail) can temporarily point to a page which is not marked
> as a head page. That's different from prep_compound_page(). You need to
> come up with some good argumentation for why this is safe, and no amount
> of testing you do can replace it -- any race in this area will be subtle.

IIUC, it is safe to move these calls, and I agree with what Mike said.
Here is my reasoning:

When we get pages from the CMA allocator for a gigantic folio, the page
refcount of each page is 1. page_cache_get_speculative (now
folio_try_get_rcu) can take a reference on any of these pages before
__prep_compound_gigantic_folio explicitly freezes their refcounts. With
this race there are two possible situations:

...
	if (!demote) {
		if (!page_ref_freeze(p, 1)) {
			pr_warn("HugeTLB page can not be used due to unexpected inflated ref count\n");
			goto out_error;
		}
	} else {
		VM_BUG_ON_PAGE(page_count(p), p);
	}
	if (i != 0)
		set_compound_head(p, &folio->page);
}
...

1. In the current code, before the refcount of the nth tail page is
   frozen, folio_try_get_rcu might take a reference on that nth tail
   page. The refcount of the nth tail page (not the head page) is then
   raised, because the compound head is not yet set for that tail page.
   Once this happens, the nth iteration of the loop fails and
   __prep_compound_gigantic_folio returns an error. So setting PG_head
   at the start of the for loop or after it makes no difference to this
   flow.

2. If folio_try_get_rcu takes a reference on the head page before it is
   frozen, __prep_compound_gigantic_folio will fail; but before PG_head
   and the folio order of the head page are cleared in the error path,
   the folio_try_get_rcu caller may see that this page is a head page
   and try to operate on its tail pages while those tail pages are
   invalid.

Hence, it is safer to call __folio_set_head and folio_set_order after
freezing the tail page refcounts.

~Tarun
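One way to visualise scenario 2 with the pre-patch ordering; the CPU labels
and the exact interleaving below are illustrative only, not a reproduced
trace.

/*
 *   CPU A: __prep_compound_gigantic_folio()     CPU B: speculative reader
 *   -----------------------------------------   ---------------------------
 *   __folio_set_head(folio);
 *   folio_set_order(folio, order);
 *                                               folio_try_get_rcu(folio);
 *                                               head refcount 1 -> 2
 *   page_ref_freeze(head, 1) fails
 *   goto out_error;
 *                                               PageHead() is still true,
 *                                               but no tail page has been
 *                                               set up, so treating this
 *                                               as a compound page walks
 *                                               bogus tail state
 *   folio_set_order(folio, 0);
 *   __folio_clear_head(folio);
 */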
Hi, This is a gentle reminder, please let me know, If any information or any changes are needed from my end. Thanks Tarun Tarun Sahu <tsahu@linux.ibm.com> writes: > folio_set_order(folio, 0) is used in kernel at two places > __destroy_compound_gigantic_folio and __prep_compound_gigantic_folio. > Currently, It is called to clear out the folio->_folio_nr_pages and > folio->_folio_order. > > For __destroy_compound_gigantic_folio: > In past, folio_set_order(folio, 0) was needed because page->mapping used > to overlap with _folio_nr_pages and _folio_order. So if these fields were > left uncleared during freeing gigantic hugepages, they were causing > "BUG: bad page state" due to non-zero page->mapping. Now, After > Commit a01f43901cfb ("hugetlb: be sure to free demoted CMA pages to > CMA") page->mapping has explicitly been cleared out for tail pages. Also, > _folio_order and _folio_nr_pages no longer overlaps with page->mapping. > > struct page { > ... > struct address_space * mapping; /* 24 8 */ > ... > } > > struct folio { > ... > union { > struct { > long unsigned int _flags_1; /* 64 8 */ > long unsigned int _head_1; /* 72 8 */ > unsigned char _folio_dtor; /* 80 1 */ > unsigned char _folio_order; /* 81 1 */ > > /* XXX 2 bytes hole, try to pack */ > > atomic_t _entire_mapcount; /* 84 4 */ > atomic_t _nr_pages_mapped; /* 88 4 */ > atomic_t _pincount; /* 92 4 */ > unsigned int _folio_nr_pages; /* 96 4 */ > }; /* 64 40 */ > struct page __page_1 __attribute__((__aligned__(8))); /* 64 64 */ > } > ... > } > > So, folio_set_order(folio, 0) can be removed from freeing gigantic > folio path (__destroy_compound_gigantic_folio). > > Another place, folio_set_order(folio, 0) is called inside > __prep_compound_gigantic_folio during error path. Here, > folio_set_order(folio, 0) can also be removed if we move > folio_set_order(folio, order) after for loop. > > The patch also moves _folio_set_head call in __prep_compound_gigantic_folio() > such that we avoid clearing them in the error path. > > Also, as Mike pointed out: > "It would actually be better to move the calls _folio_set_head and > folio_set_order in __prep_compound_gigantic_folio() as suggested here. Why? > In the current code, the ref count on the 'head page' is still 1 (or more) > while those calls are made. So, someone could take a speculative ref on the > page BEFORE the tail pages are set up." > > This way, folio_set_order(folio, 0) is no more needed. And it will also > helps removing the confusion of folio order being set to 0 (as _folio_order > field is part of first tail page). > > Testing: I have run LTP tests, which all passes. and also I have written > the test in LTP which tests the bug caused by compound_nr and page->mapping > overlapping. > > https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/mem/hugetlb/hugemmap/hugemmap32.c > > Running on older kernel ( < 5.10-rc7) with the above bug this fails while > on newer kernel and, also with this patch it passes. 
> > Signed-off-by: Tarun Sahu <tsahu@linux.ibm.com> > --- > mm/hugetlb.c | 9 +++------ > mm/internal.h | 8 ++------ > 2 files changed, 5 insertions(+), 12 deletions(-) > > diff --git a/mm/hugetlb.c b/mm/hugetlb.c > index f154019e6b84..607553445855 100644 > --- a/mm/hugetlb.c > +++ b/mm/hugetlb.c > @@ -1489,7 +1489,6 @@ static void __destroy_compound_gigantic_folio(struct folio *folio, > set_page_refcounted(p); > } > > - folio_set_order(folio, 0); > __folio_clear_head(folio); > } > > @@ -1951,9 +1950,6 @@ static bool __prep_compound_gigantic_folio(struct folio *folio, > struct page *p; > > __folio_clear_reserved(folio); > - __folio_set_head(folio); > - /* we rely on prep_new_hugetlb_folio to set the destructor */ > - folio_set_order(folio, order); > for (i = 0; i < nr_pages; i++) { > p = folio_page(folio, i); > > @@ -1999,6 +1995,9 @@ static bool __prep_compound_gigantic_folio(struct folio *folio, > if (i != 0) > set_compound_head(p, &folio->page); > } > + __folio_set_head(folio); > + /* we rely on prep_new_hugetlb_folio to set the destructor */ > + folio_set_order(folio, order); > atomic_set(&folio->_entire_mapcount, -1); > atomic_set(&folio->_nr_pages_mapped, 0); > atomic_set(&folio->_pincount, 0); > @@ -2017,8 +2016,6 @@ static bool __prep_compound_gigantic_folio(struct folio *folio, > p = folio_page(folio, j); > __ClearPageReserved(p); > } > - folio_set_order(folio, 0); > - __folio_clear_head(folio); > return false; > } > > diff --git a/mm/internal.h b/mm/internal.h > index 68410c6d97ac..c59fe08c5b39 100644 > --- a/mm/internal.h > +++ b/mm/internal.h > @@ -425,16 +425,12 @@ int split_free_page(struct page *free_page, > */ > static inline void folio_set_order(struct folio *folio, unsigned int order) > { > - if (WARN_ON_ONCE(!folio_test_large(folio))) > + if (WARN_ON_ONCE(!order || !folio_test_large(folio))) > return; > > folio->_folio_order = order; > #ifdef CONFIG_64BIT > - /* > - * When hugetlb dissolves a folio, we need to clear the tail > - * page, rather than setting nr_pages to 1. > - */ > - folio->_folio_nr_pages = order ? 1U << order : 0; > + folio->_folio_nr_pages = 1U << order; > #endif > } > > -- > 2.31.1
Hi Mike,

Please find my comments inline.

Mike Kravetz <mike.kravetz@oracle.com> writes:

> On 06/06/23 10:32, Tarun Sahu wrote:
>>
>> Hi Mike,
>>
>> Thanks for your inputs.
>> I wanted to know if you find it okay, Can I send it again adding your Reviewed-by?
>
> Hi Tarun,
>
> Just a few more comments/questions.
>
> On 05/15/23 22:38, Tarun Sahu wrote:
>> folio_set_order(folio, 0) is used in kernel at two places
>> __destroy_compound_gigantic_folio and __prep_compound_gigantic_folio.
>> Currently, It is called to clear out the folio->_folio_nr_pages and
>> folio->_folio_order.
>>
>> For __destroy_compound_gigantic_folio:
>> In past, folio_set_order(folio, 0) was needed because page->mapping used
>> to overlap with _folio_nr_pages and _folio_order. So if these fields were
>> left uncleared during freeing gigantic hugepages, they were causing
>> "BUG: bad page state" due to non-zero page->mapping. Now, After
>> Commit a01f43901cfb ("hugetlb: be sure to free demoted CMA pages to
>> CMA") page->mapping has explicitly been cleared out for tail pages. Also,
>> _folio_order and _folio_nr_pages no longer overlaps with page->mapping.
>
> I believe the same logic/reasoning as above also applies to
> __prep_compound_gigantic_folio.
> Why?
> In __prep_compound_gigantic_folio we only call folio_set_order(folio, 0)
> in the case of error. If __prep_compound_gigantic_folio fails, the caller
> will then call free_gigantic_folio() on the "gigantic page". However, it is
> not really a gigantic at this point in time, and we are simply calling
> cma_release() or free_contig_range().
> The end result is that I do not believe the existing call to
> folio_set_order(folio, 0) in __prep_compound_gigantic_folio is actually
> required. ???

No, there is a difference. IIUC, __destroy_compound_gigantic_folio
explicitly resets page->mapping for each page of the compound page, which
makes sure that even if some field of struct page/folio overlaps with
page->mapping again in the future, it won't cause a `BUG: bad page state`
error. But if we just remove folio_set_order(folio, 0) from
__prep_compound_gigantic_folio without moving folio_set_order(folio, order),
it adds the maintenance overhead of tracking whether _folio_order overlaps
with page->mapping every time struct page fields are changed, because in
the overlapping case page->mapping would be non-zero. IMHO, to avoid that,
folio_set_order(folio, order) should be moved after all the error checks on
the tail pages are done, so that _folio_order is set on success and never
set in the error case (which is the original proposal). But for
__folio_set_head, I agree with the way you suggested below.

WDYT?

>
> If my reasoning above is correct, then we could just have one patch to
> remove the folio_set_order(folio, 0) calls and remove special casing for
> order 0 in folio_set_order.
>
> However, I still believe your restructuring of __prep_compound_gigantic_folio
> is of value. I do not believe there is an issue as questioned by Matthew. My
> reasoning has been stated previously. We could make changes like the following
> to retain the same order of operations in __prep_compound_gigantic_folio and
> totally avoid Matthew's question. Totally untested.
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c > index ea24718db4af..a54fee663cb1 100644 > --- a/mm/hugetlb.c > +++ b/mm/hugetlb.c > @@ -1950,10 +1950,8 @@ static bool __prep_compound_gigantic_folio(struct folio *folio, > int nr_pages = 1 << order; > struct page *p; > > - __folio_clear_reserved(folio); > - __folio_set_head(folio); > /* we rely on prep_new_hugetlb_folio to set the destructor */ > - folio_set_order(folio, order); > + > for (i = 0; i < nr_pages; i++) { > p = folio_page(folio, i); > > @@ -1969,7 +1967,7 @@ static bool __prep_compound_gigantic_folio(struct folio *folio, > * on the head page when they need know if put_page() is needed > * after get_user_pages(). > */ > - if (i != 0) /* head page cleared above */ > + if (i != 0) /* head page cleared below */ > __ClearPageReserved(p); > /* > * Subtle and very unlikely > @@ -1996,8 +1994,14 @@ static bool __prep_compound_gigantic_folio(struct folio *folio, > } else { > VM_BUG_ON_PAGE(page_count(p), p); > } > - if (i != 0) > + > + if (i == 0) { > + __folio_clear_reserved(folio); > + __folio_set_head(folio); > + folio_set_order(folio, order); With folio_set_head, I agree to this, But does not feel good with folio_set_order as per my above reasoning. WDYT? > + } else { > set_compound_head(p, &folio->page); > + } > } > atomic_set(&folio->_entire_mapcount, -1); > atomic_set(&folio->_nr_pages_mapped, 0); > @@ -2017,7 +2021,6 @@ static bool __prep_compound_gigantic_folio(struct folio *folio, > p = folio_page(folio, j); > __ClearPageReserved(p); > } > - folio_set_order(folio, 0); > __folio_clear_head(folio); > return false; > } > > >> >> struct page { >> ... >> struct address_space * mapping; /* 24 8 */ >> ... >> } >> >> struct folio { >> ... >> union { >> struct { >> long unsigned int _flags_1; /* 64 8 */ >> long unsigned int _head_1; /* 72 8 */ >> unsigned char _folio_dtor; /* 80 1 */ >> unsigned char _folio_order; /* 81 1 */ >> >> /* XXX 2 bytes hole, try to pack */ >> >> atomic_t _entire_mapcount; /* 84 4 */ >> atomic_t _nr_pages_mapped; /* 88 4 */ >> atomic_t _pincount; /* 92 4 */ >> unsigned int _folio_nr_pages; /* 96 4 */ >> }; /* 64 40 */ >> struct page __page_1 __attribute__((__aligned__(8))); /* 64 64 */ >> } >> ... >> } > > I do not think the copy of page/folio definitions adds much value to the > commit message. Yeah, Will remove it. > > -- > Mike Kravetz
On 06/08/23 15:33, Tarun Sahu wrote: > Hi Mike, > > Please find my comments inline. > > Mike Kravetz <mike.kravetz@oracle.com> writes: > > > On 06/06/23 10:32, Tarun Sahu wrote: > >> > >> Hi Mike, > >> > >> Thanks for your inputs. > >> I wanted to know if you find it okay, Can I send it again adding your Reviewed-by? > > > > Hi Tarun, > > > > Just a few more comments/questions. > > > > On 05/15/23 22:38, Tarun Sahu wrote: > >> folio_set_order(folio, 0) is used in kernel at two places > >> __destroy_compound_gigantic_folio and __prep_compound_gigantic_folio. > >> Currently, It is called to clear out the folio->_folio_nr_pages and > >> folio->_folio_order. > >> > >> For __destroy_compound_gigantic_folio: > >> In past, folio_set_order(folio, 0) was needed because page->mapping used > >> to overlap with _folio_nr_pages and _folio_order. So if these fields were > >> left uncleared during freeing gigantic hugepages, they were causing > >> "BUG: bad page state" due to non-zero page->mapping. Now, After > >> Commit a01f43901cfb ("hugetlb: be sure to free demoted CMA pages to > >> CMA") page->mapping has explicitly been cleared out for tail pages. Also, > >> _folio_order and _folio_nr_pages no longer overlaps with page->mapping. > > > > I believe the same logic/reasoning as above also applies to > > __prep_compound_gigantic_folio. > > Why? > > In __prep_compound_gigantic_folio we only call folio_set_order(folio, 0) > > in the case of error. If __prep_compound_gigantic_folio fails, the caller > > will then call free_gigantic_folio() on the "gigantic page". However, it is > > not really a gigantic at this point in time, and we are simply calling > > cma_release() or free_contig_range(). > > The end result is that I do not believe the existing call to > > folio_set_order(folio, 0) in __prep_compound_gigantic_folio is actually > > required. ??? > No, there is a difference. IIUC, __destroy_compound_gigantic_folio > explicitly reset page->mapping for each page of compound page which > makes sure, even if in future some fields of struct page/folio overlaps > with page->mapping, it won't cause `BUG: bad page state` error. But If we > just remove folio_set_order(folio, 0) from __prep_compound_gigantic_folio > without moving folio_set_order(folio, order), this will cause extra > maintenance overhead to track if page->_folio_order overlaps with > page->mapping everytime struct page fields are changed. As in case of > overlapping page->mapping will be non-zero. IMHO, To avoid it, > moving the folio_set_order(folio, order) after all error checks are > done on tail pages. So, _folio_order is either set on success and not > set in case of error. (which is the original proposal). But for > folio_set_head, I agree the way you suggested below. > > WDYT? Right. It is more 'future proof' to only set folio order on success as done in your original patch. > > > > If my reasoning above is correct, then we could just have one patch to > > remove the folio_set_order(folio, 0) calls and remove special casing for > > order 0 in folio_set_order. > > > > However, I still believe your restructuring of __prep_compound_gigantic_folio, > > is of value. I do not believe there is an issue as questioned by Matthew. My > > reasoning has been stated previously. We could make changes like the following > > to retain the same order of operations in __prep_compound_gigantic_folio and > > totally avoid Matthew's question. Totally untested. 
> > > > diff --git a/mm/hugetlb.c b/mm/hugetlb.c > > index ea24718db4af..a54fee663cb1 100644 > > --- a/mm/hugetlb.c > > +++ b/mm/hugetlb.c > > @@ -1950,10 +1950,8 @@ static bool __prep_compound_gigantic_folio(struct folio *folio, > > int nr_pages = 1 << order; > > struct page *p; > > > > - __folio_clear_reserved(folio); > > - __folio_set_head(folio); > > /* we rely on prep_new_hugetlb_folio to set the destructor */ > > - folio_set_order(folio, order); > > + > > for (i = 0; i < nr_pages; i++) { > > p = folio_page(folio, i); > > > > @@ -1969,7 +1967,7 @@ static bool __prep_compound_gigantic_folio(struct folio *folio, > > * on the head page when they need know if put_page() is needed > > * after get_user_pages(). > > */ > > - if (i != 0) /* head page cleared above */ > > + if (i != 0) /* head page cleared below */ > > __ClearPageReserved(p); > > /* > > * Subtle and very unlikely > > @@ -1996,8 +1994,14 @@ static bool __prep_compound_gigantic_folio(struct folio *folio, > > } else { > > VM_BUG_ON_PAGE(page_count(p), p); > > } > > - if (i != 0) > > + > > + if (i == 0) { > > + __folio_clear_reserved(folio); > > + __folio_set_head(folio); > > + folio_set_order(folio, order); > With folio_set_head, I agree to this, But does not feel good with > folio_set_order as per my above reasoning. WDYT? Agree with your reasoning. We should just move __folio_set_head and folio_set_order after the loop as you originally suggested. > > > + } else { > > set_compound_head(p, &folio->page); > > + } > > } > > atomic_set(&folio->_entire_mapcount, -1); > > atomic_set(&folio->_nr_pages_mapped, 0); > > @@ -2017,7 +2021,6 @@ static bool __prep_compound_gigantic_folio(struct folio *folio, > > p = folio_page(folio, j); > > __ClearPageReserved(p); > > } > > - folio_set_order(folio, 0); > > __folio_clear_head(folio); > > return false; > > } > > > > > >> > >> struct page { > >> ... > >> struct address_space * mapping; /* 24 8 */ > >> ... > >> } > >> > >> struct folio { > >> ... > >> union { > >> struct { > >> long unsigned int _flags_1; /* 64 8 */ > >> long unsigned int _head_1; /* 72 8 */ > >> unsigned char _folio_dtor; /* 80 1 */ > >> unsigned char _folio_order; /* 81 1 */ > >> > >> /* XXX 2 bytes hole, try to pack */ > >> > >> atomic_t _entire_mapcount; /* 84 4 */ > >> atomic_t _nr_pages_mapped; /* 88 4 */ > >> atomic_t _pincount; /* 92 4 */ > >> unsigned int _folio_nr_pages; /* 96 4 */ > >> }; /* 64 40 */ > >> struct page __page_1 __attribute__((__aligned__(8))); /* 64 64 */ > >> } > >> ... > >> } > > > > I do not think the copy of page/folio definitions adds much value to the > > commit message. > Yeah, Will remove it. > > I think we are finally on the same page. I am good with this v2 patch. Only change is to update commit message to remove the definitions.
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index f154019e6b84..607553445855 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1489,7 +1489,6 @@ static void __destroy_compound_gigantic_folio(struct folio *folio,
 		set_page_refcounted(p);
 	}
 
-	folio_set_order(folio, 0);
 	__folio_clear_head(folio);
 }
 
@@ -1951,9 +1950,6 @@ static bool __prep_compound_gigantic_folio(struct folio *folio,
 	struct page *p;
 
 	__folio_clear_reserved(folio);
-	__folio_set_head(folio);
-	/* we rely on prep_new_hugetlb_folio to set the destructor */
-	folio_set_order(folio, order);
 	for (i = 0; i < nr_pages; i++) {
 		p = folio_page(folio, i);
 
@@ -1999,6 +1995,9 @@ static bool __prep_compound_gigantic_folio(struct folio *folio,
 		if (i != 0)
 			set_compound_head(p, &folio->page);
 	}
+	__folio_set_head(folio);
+	/* we rely on prep_new_hugetlb_folio to set the destructor */
+	folio_set_order(folio, order);
 	atomic_set(&folio->_entire_mapcount, -1);
 	atomic_set(&folio->_nr_pages_mapped, 0);
 	atomic_set(&folio->_pincount, 0);
@@ -2017,8 +2016,6 @@ static bool __prep_compound_gigantic_folio(struct folio *folio,
 		p = folio_page(folio, j);
 		__ClearPageReserved(p);
 	}
-	folio_set_order(folio, 0);
-	__folio_clear_head(folio);
 	return false;
 }
 
diff --git a/mm/internal.h b/mm/internal.h
index 68410c6d97ac..c59fe08c5b39 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -425,16 +425,12 @@ int split_free_page(struct page *free_page,
  */
 static inline void folio_set_order(struct folio *folio, unsigned int order)
 {
-	if (WARN_ON_ONCE(!folio_test_large(folio)))
+	if (WARN_ON_ONCE(!order || !folio_test_large(folio)))
 		return;
 
 	folio->_folio_order = order;
 #ifdef CONFIG_64BIT
-	/*
-	 * When hugetlb dissolves a folio, we need to clear the tail
-	 * page, rather than setting nr_pages to 1.
-	 */
-	folio->_folio_nr_pages = order ? 1U << order : 0;
+	folio->_folio_nr_pages = 1U << order;
 #endif
 }
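With the mm/internal.h hunk above, folio_set_order() no longer accepts an
order of 0. A short caller-side sketch of the resulting contract
(illustrative only, not code taken from the patch):

	/* valid: publish the order of a large folio (order >= 1) */
	folio_set_order(folio, order);

	/*
	 * now a programming error: hits WARN_ON_ONCE(!order || ...) and
	 * returns without touching _folio_order / _folio_nr_pages
	 */
	folio_set_order(folio, 0);

Callers that previously used folio_set_order(folio, 0) to "clear" the order
(the two hugetlb paths in the hunks above) simply drop the call instead.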
folio_set_order(folio, 0) is used in the kernel in two places,
__destroy_compound_gigantic_folio and __prep_compound_gigantic_folio.
Currently, it is called to clear out folio->_folio_nr_pages and
folio->_folio_order.

For __destroy_compound_gigantic_folio:
In the past, folio_set_order(folio, 0) was needed because page->mapping
used to overlap with _folio_nr_pages and _folio_order. So if these fields
were left uncleared when freeing gigantic hugepages, they caused a
"BUG: bad page state" due to a non-zero page->mapping. Since commit
a01f43901cfb ("hugetlb: be sure to free demoted CMA pages to CMA"),
page->mapping has explicitly been cleared out for tail pages. Also,
_folio_order and _folio_nr_pages no longer overlap with page->mapping.

struct page {
    ...
    struct address_space * mapping;              /* 24  8 */
    ...
}

struct folio {
    ...
    union {
        struct {
            long unsigned int _flags_1;          /* 64  8 */
            long unsigned int _head_1;           /* 72  8 */
            unsigned char _folio_dtor;           /* 80  1 */
            unsigned char _folio_order;          /* 81  1 */

            /* XXX 2 bytes hole, try to pack */

            atomic_t _entire_mapcount;           /* 84  4 */
            atomic_t _nr_pages_mapped;           /* 88  4 */
            atomic_t _pincount;                  /* 92  4 */
            unsigned int _folio_nr_pages;        /* 96  4 */
        };                                       /* 64 40 */
        struct page __page_1 __attribute__((__aligned__(8))); /* 64 64 */
    }
    ...
}

So, folio_set_order(folio, 0) can be removed from the gigantic folio
freeing path (__destroy_compound_gigantic_folio).

The other place folio_set_order(folio, 0) is called is the error path of
__prep_compound_gigantic_folio. There it can also be removed if we move
folio_set_order(folio, order) after the for loop.

The patch also moves the __folio_set_head call in
__prep_compound_gigantic_folio() so that we avoid clearing it in the
error path.

Also, as Mike pointed out:
"It would actually be better to move the calls _folio_set_head and
folio_set_order in __prep_compound_gigantic_folio() as suggested here.
Why? In the current code, the ref count on the 'head page' is still 1
(or more) while those calls are made. So, someone could take a
speculative ref on the page BEFORE the tail pages are set up."

This way, folio_set_order(folio, 0) is no longer needed. It also helps
remove the confusion of the folio order being set to 0 (as the
_folio_order field is part of the first tail page).

Testing: I have run the LTP tests, which all pass. I have also written an
LTP test that exercises the bug caused by compound_nr and page->mapping
overlapping:

https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/mem/hugetlb/hugemmap/hugemmap32.c

On an older kernel (< 5.10-rc7) with the above bug it fails, while on
newer kernels, and also with this patch applied, it passes.

Signed-off-by: Tarun Sahu <tsahu@linux.ibm.com>
---
 mm/hugetlb.c  | 9 +++------
 mm/internal.h | 8 ++------
 2 files changed, 5 insertions(+), 12 deletions(-)