| Message ID | 3f0e058aef3951b39cf6bb4259c247352d4fe736.1739931468.git.luizcap@redhat.com |
|---|---|
| State | New |
| Series | mm: page_ext: Introduce new iteration API |
On 19.02.25 03:17, Luiz Capitulino wrote:
> The page extension implementation assumes that all page extensions of
> a given page order are stored in the same memory section. The function
> page_ext_next() relies on this assumption by adding an offset to the
> current object to return the next adjacent page extension.
>
> This behavior works as expected for flatmem but fails for sparsemem when
> using 1G pages. The commit cf54f310d0d3 ("mm/hugetlb: use __GFP_COMP for
> gigantic folios") exposes this issue, making it possible to trigger a
> crash when using the page_owner or page_table_check page extensions.
>
> The problem is that for 1G pages, the page extensions may span memory
> section boundaries and be stored in different memory sections. This issue
> was not visible before commit cf54f310d0d3 ("mm/hugetlb: use __GFP_COMP
> for gigantic folios") because alloc_contig_pages() never passed more than
> MAX_PAGE_ORDER to post_alloc_hook(). However, the series introducing the
> mentioned commit changed this behavior, allowing the full 1G page order
> to be passed.
>
> Reproducer:
>
> 1. Build the kernel with CONFIG_SPARSEMEM=y and page extensions
>    support
> 2. Pass 'default_hugepagesz=1G page_owner=on' in the kernel command-line
> 3. Reserve one 1G page at run-time, this should crash (backtrace below)
>
> To address this issue, this commit introduces a new API for iterating
> through page extensions. The main iteration loops are for_each_page_ext()
> and for_each_page_ext_order(). Both must be called with the RCU read
> lock taken. Here's a usage example:
>
> """
> struct page_ext_iter iter;
> struct page_ext *page_ext;
>
> ...
>
> rcu_read_lock();
> for_each_page_ext_order(page, order, page_ext, iter) {
>         struct my_page_ext *obj = get_my_page_ext_obj(page_ext);
>         ...
> }
> rcu_read_unlock();
> """

[...]

> +struct page_ext *page_ext_iter_begin(struct page_ext_iter *iter, struct page *page);
> +struct page_ext *page_ext_iter_next(struct page_ext_iter *iter);
> +
> +/**
> + * page_ext_iter_get() - Get current page extension
> + * @iter: page extension iterator.
> + *
> + * Return: NULL if no page_ext exists for this iterator.
> + */
> +static inline struct page_ext *page_ext_iter_get(const struct page_ext_iter *iter)
> +{
> +	return iter->page_ext;
> +}
> +
> +/**
> + * for_each_page_ext(): iterate through page_ext objects.
> + * @__page: the page we're interested in
> + * @__pgcount: how many pages to iterate through
> + * @__page_ext: struct page_ext pointer where the current page_ext
> + *              object is returned
> + * @__iter: struct page_ext_iter object (defined on the stack)
> + *
> + * IMPORTANT: must be called with RCU read lock taken.
> + */
> +#define for_each_page_ext(__page, __pgcount, __page_ext, __iter) \
> +	__page_ext = page_ext_iter_begin(&__iter, __page);       \

Doing stuff out of the loop is dangerous. Assume something does

	if (xxx)
		for_each_page_ext()

Just move that inside the for().

	for (__page_ext = page_ext_iter_begin(&__iter, __page), __iter.index = 0)

> +	for (__iter.index = 0;                                    \
> +	     __page_ext && __iter.index < __pgcount;              \
> +	     __page_ext = page_ext_iter_next(&__iter),            \
> +	     __iter.index++)

Hm, if we now have an index, why not turn iter.pfn -> iter.start_pfn,
and only adjust the index in page_ext_iter_next?

Then you can set the index to 0 in page_ext_iter_begin() and have here

	for (__page_ext = page_ext_iter_begin(&__iter, __page);
	     __page_ext && __iter.index < __pgcount;
	     __page_ext = page_ext_iter_next(&__iter))

A page_ext_iter_reset() could then simply reset the index to 0 and
look up page_ext(start_pfn + index) == page_ext(start_pfn)

> +
> +/**
> + * for_each_page_ext_order(): iterate through page_ext objects
> + *                            for a given page order
> + * @__page: the page we're interested in
> + * @__order: page order to iterate through
> + * @__page_ext: struct page_ext pointer where the current page_ext
> + *              object is returned
> + * @__iter: struct page_ext_iter object (defined on the stack)
> + *
> + * IMPORTANT: must be called with RCU read lock taken.
> + */
> +#define for_each_page_ext_order(__page, __order, __page_ext, __iter) \
> +	for_each_page_ext(__page, (1UL << __order), __page_ext, __iter)
> +
>  #else /* !CONFIG_PAGE_EXTENSION */
>  struct page_ext;
>
> diff --git a/mm/page_ext.c b/mm/page_ext.c
> index 641d93f6af4c1..508deb04d5ead 100644
> --- a/mm/page_ext.c
> +++ b/mm/page_ext.c
> @@ -549,3 +549,44 @@ void page_ext_put(struct page_ext *page_ext)
>
>  	rcu_read_unlock();
>  }
> +
> +/**
> + * page_ext_iter_begin() - Prepare for iterating through page extensions.
> + * @iter: page extension iterator.
> + * @page: The page we're interested in.
> + *
> + * Must be called with RCU read lock taken.
> + *
> + * Return: NULL if no page_ext exists for this page.
> + */
> +struct page_ext *page_ext_iter_begin(struct page_ext_iter *iter, struct page *page)
> +{
> +	iter->pfn = page_to_pfn(page);
> +	iter->page_ext = lookup_page_ext(page);
> +
> +	return iter->page_ext;
> +}
> +
> +/**
> + * page_ext_iter_next() - Get next page extension
> + * @iter: page extension iterator.
> + *
> + * Must be called with RCU read lock taken.
> + *
> + * Return: NULL if no next page_ext exists.
> + */
> +struct page_ext *page_ext_iter_next(struct page_ext_iter *iter)
> +{
> +	if (WARN_ON_ONCE(!iter->page_ext))
> +		return NULL;
> +
> +	iter->pfn++;
> +
> +	if (page_ext_iter_next_fast_possible(iter->pfn)) {
> +		iter->page_ext = page_ext_next(iter->page_ext);
> +	} else {
> +		iter->page_ext = lookup_page_ext(pfn_to_page(iter->pfn));
> +	}
> +
> +	return iter->page_ext;
> +}

We now always have a function call when calling into page_ext_iter_next().
Could we move that to the header and rather expose lookup_page_ext()?
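To make the hazard David describes concrete, here is a minimal, self-contained
sketch. The obj_iter type and the iter_begin()/iter_next() helpers are invented
stand-ins for illustration; only the two macro shapes mirror the patch and the
suggested fix:

	#include <stdio.h>

	/* Hypothetical stand-ins; the real API is page_ext_iter_begin()/_next(). */
	struct obj_iter { int n; };

	static int *iter_begin(struct obj_iter *it) { it->n = 3; return &it->n; }
	static int *iter_next(struct obj_iter *it)  { return --it->n > 0 ? &it->n : NULL; }

	/* The v1 shape: a statement *before* the for(). */
	#define for_each_obj_broken(__it, __v)		\
		__v = iter_begin(&__it);		\
		for (; __v; __v = iter_next(&__it))

	/*
	 * With the v1 shape,
	 *
	 *	if (cond)
	 *		for_each_obj_broken(it, v)
	 *			use(v);
	 *
	 * expands to
	 *
	 *	if (cond)
	 *		v = iter_begin(&it);		<- only this is guarded
	 *	for (; v; v = iter_next(&it))		<- always executed
	 *		use(v);
	 *
	 * so the loop runs on whatever v held before, even when cond is false.
	 * Folding the initialization into the for() keeps the macro a single
	 * statement:
	 */
	#define for_each_obj_fixed(__it, __v)		\
		for (__v = iter_begin(&__it); __v;	\
		     __v = iter_next(&__it))

	int main(void)
	{
		struct obj_iter it;
		int *v;

		for_each_obj_fixed(it, v)
			printf("%d\n", *v);	/* prints 3, 2, 1 */
		return 0;
	}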
On 2025-02-20 05:59, David Hildenbrand wrote:
> On 19.02.25 03:17, Luiz Capitulino wrote:
>> The page extension implementation assumes that all page extensions of
>> a given page order are stored in the same memory section. The function
>> page_ext_next() relies on this assumption by adding an offset to the
>> current object to return the next adjacent page extension.

[...]

>> +#define for_each_page_ext(__page, __pgcount, __page_ext, __iter) \
>> +	__page_ext = page_ext_iter_begin(&__iter, __page);       \
>
> Doing stuff out of the loop is dangerous. Assume something does
>
> 	if (xxx)
> 		for_each_page_ext()
>
> Just move that inside the for().
>
> 	for (__page_ext = page_ext_iter_begin(&__iter, __page), __iter.index = 0)

Oh, good point. Will do the change.

>> +	for (__iter.index = 0;                                    \
>> +	     __page_ext && __iter.index < __pgcount;              \
>> +	     __page_ext = page_ext_iter_next(&__iter),            \
>> +	     __iter.index++)
>
> Hm, if we now have an index, why not turn iter.pfn -> iter.start_pfn,
> and only adjust the index in page_ext_iter_next?
>
> Then you can set the index to 0 in page_ext_iter_begin() and have here
>
> 	for (__page_ext = page_ext_iter_begin(&__iter, __page);
> 	     __page_ext && __iter.index < __pgcount;
> 	     __page_ext = page_ext_iter_next(&__iter))

I can do this if you feel strongly about it, but I prefer explicit over
implicit. I moved the index into the iter object just to avoid having to
define it in the macro's body. Also, the way I did it allows for using
page_ext_iter_begin()/page_ext_iter_next() on their own if the need
arises.

> A page_ext_iter_reset() could then simply reset the index to 0 and
> look up page_ext(start_pfn + index) == page_ext(start_pfn)

Just note we don't have page_ext_iter_reset() today (and I guess it's
not needed).

[...]

>> +struct page_ext *page_ext_iter_next(struct page_ext_iter *iter)
>> +{
>> +	if (WARN_ON_ONCE(!iter->page_ext))
>> +		return NULL;
>> +
>> +	iter->pfn++;
>> +
>> +	if (page_ext_iter_next_fast_possible(iter->pfn)) {
>> +		iter->page_ext = page_ext_next(iter->page_ext);
>> +	} else {
>> +		iter->page_ext = lookup_page_ext(pfn_to_page(iter->pfn));
>> +	}
>> +
>> +	return iter->page_ext;
>> +}
>
> We now always have a function call when calling into page_ext_iter_next().
> Could we move that to the header and rather expose lookup_page_ext()?

I personally don't like over-using inline functions, and I don't think
this code needs optimization, since the current clients make the affected
code paths slow anyway (this also applies to the likely/unlikely use in
page_owner and page_table_check; I'd drop all of them if you ask me). But
again, I can change this if it would prevent you from giving your ACK :)
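For reference, open-coding the v1 iterator pair as Luiz alludes to would look
roughly like this (a sketch only; pgcount and the loop body are placeholders,
and the caller owns the bounds check that the macro normally performs):

	struct page_ext_iter iter;
	struct page_ext *page_ext;

	rcu_read_lock();
	page_ext = page_ext_iter_begin(&iter, page);
	for (iter.index = 0; page_ext && iter.index < pgcount; iter.index++) {
		/* ... use page_ext ... */
		page_ext = page_ext_iter_next(&iter);
	}
	rcu_read_unlock();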
>>> +	for (__iter.index = 0;                                    \
>>> +	     __page_ext && __iter.index < __pgcount;              \
>>> +	     __page_ext = page_ext_iter_next(&__iter),            \
>>> +	     __iter.index++)
>>
>> Hm, if we now have an index, why not turn iter.pfn -> iter.start_pfn,
>> and only adjust the index in page_ext_iter_next?
>>
>> Then you can set the index to 0 in page_ext_iter_begin() and have here
>>
>> 	for (__page_ext = page_ext_iter_begin(&__iter, __page);
>> 	     __page_ext && __iter.index < __pgcount;
>> 	     __page_ext = page_ext_iter_next(&__iter))
>
> I can do this if you feel strongly about it, but I prefer explicit over
> implicit. I moved the index into the iter object just to avoid having to
> define it in the macro's body. Also, the way I did it allows for using
> page_ext_iter_begin()/page_ext_iter_next() on their own if the need
> arises.

Ah, I see what you mean.

	for (__page_ext = page_ext_iter_begin(&__iter, __page, __pgcount);
	     __page_ext;
	     __page_ext = page_ext_iter_next(&__iter))

Could do that I guess by moving the count in there as well and performing
the check+increment in page_ext_iter_next.

That looks very clean to me, but no strong opinion. Having the index in
there just to make a macro happy is rather weird.

>> A page_ext_iter_reset() could then simply reset the index to 0 and
>> look up page_ext(start_pfn + index) == page_ext(start_pfn)
>
> Just note we don't have page_ext_iter_reset() today (and I guess it's
> not needed).

Right, was writing this before reviewing the other patch.

[...]

>> We now always have a function call when calling into
>> page_ext_iter_next(). Could we move that to the header and rather
>> expose lookup_page_ext()?
>
> I personally don't like over-using inline functions, and I don't think
> this code needs optimization, since the current clients make the affected
> code paths slow anyway (this also applies to the likely/unlikely use in
> page_owner and page_table_check; I'd drop all of them if you ask me). But
> again, I can change this if it would prevent you from giving your ACK :)

Well, 512^512 function calls for a 1 GiB page just to traverse the page
ext? :)
>> I personally don't like over-using inline functions, and I don't think
>> this code needs optimization, since the current clients make the affected
>> code paths slow anyway (this also applies to the likely/unlikely use in
>> page_owner and page_table_check; I'd drop all of them if you ask me). But
>> again, I can change this if it would prevent you from giving your ACK :)
>
> Well, 512^512 function calls for a 1 GiB page just to traverse the page
> ext? :)

Ehm, 512^2 :)
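(For context on the corrected figure, assuming 4 KiB base pages: a 1 GiB page
covers 2^30 / 2^12 = 2^18 = 262,144 = 512^2 base pages, i.e. one
page_ext_iter_next() call per base page.)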
On 2025-02-20 15:45, David Hildenbrand wrote:
>>>> +	for (__iter.index = 0;                                    \
>>>> +	     __page_ext && __iter.index < __pgcount;              \
>>>> +	     __page_ext = page_ext_iter_next(&__iter),            \
>>>> +	     __iter.index++)
>>>
>>> Hm, if we now have an index, why not turn iter.pfn -> iter.start_pfn,
>>> and only adjust the index in page_ext_iter_next?
>>>
>>> Then you can set the index to 0 in page_ext_iter_begin() and have here
>>>
>>> 	for (__page_ext = page_ext_iter_begin(&__iter, __page);
>>> 	     __page_ext && __iter.index < __pgcount;
>>> 	     __page_ext = page_ext_iter_next(&__iter))
>>
>> I can do this if you feel strongly about it, but I prefer explicit over
>> implicit. I moved the index into the iter object just to avoid having to
>> define it in the macro's body. Also, the way I did it allows for using
>> page_ext_iter_begin()/page_ext_iter_next() on their own if the need
>> arises.
>
> Ah, I see what you mean.
>
> 	for (__page_ext = page_ext_iter_begin(&__iter, __page, __pgcount);
> 	     __page_ext;
> 	     __page_ext = page_ext_iter_next(&__iter))
>
> Could do that I guess by moving the count in there as well and performing
> the check+increment in page_ext_iter_next.
>
> That looks very clean to me, but no strong opinion. Having the index in
> there just to make a macro happy is rather weird.

OK, I'll give this a try.

>>> A page_ext_iter_reset() could then simply reset the index to 0 and
>>> look up page_ext(start_pfn + index) == page_ext(start_pfn)
>>
>> Just note we don't have page_ext_iter_reset() today (and I guess it's
>> not needed).
>
> Right, was writing this before reviewing the other patch.

[...]

>>> We now always have a function call when calling into
>>> page_ext_iter_next(). Could we move that to the header and rather
>>> expose lookup_page_ext()?
>>
>> I personally don't like over-using inline functions, and I don't think
>> this code needs optimization, since the current clients make the affected
>> code paths slow anyway (this also applies to the likely/unlikely use in
>> page_owner and page_table_check; I'd drop all of them if you ask me). But
>> again, I can change this if it would prevent you from giving your ACK :)
>
> Well, 512^512 function calls for a 1 GiB page just to traverse the page
> ext? :)

page_owner may allocate memory, do a hash lookup, and whatnot from that
code path. But you have a point that other clients (such as
page_table_check) may benefit.
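Putting the agreed direction together, the revised iterator might look roughly
as follows. This is an extrapolation from the discussion above, not the posted
v1 patch; in particular the pgcount field and the exhaustion check inside
page_ext_iter_next() are assumptions, and the merged version may differ:

	struct page_ext_iter {
		unsigned long index;
		unsigned long start_pfn;
		unsigned long pgcount;
		struct page_ext *page_ext;
	};

	struct page_ext *page_ext_iter_begin(struct page_ext_iter *iter,
					     struct page *page, unsigned long pgcount)
	{
		iter->index = 0;
		iter->start_pfn = page_to_pfn(page);
		iter->pgcount = pgcount;	/* callers pass 1UL << order, so >= 1 */
		iter->page_ext = lookup_page_ext(page);

		return iter->page_ext;
	}

	struct page_ext *page_ext_iter_next(struct page_ext_iter *iter)
	{
		unsigned long pfn;

		if (WARN_ON_ONCE(!iter->page_ext))
			return NULL;

		/* The bounds check moves here, out of the macro. */
		if (++iter->index >= iter->pgcount) {
			iter->page_ext = NULL;
			return NULL;
		}

		pfn = iter->start_pfn + iter->index;
		if (page_ext_iter_next_fast_possible(pfn))
			iter->page_ext = page_ext_next(iter->page_ext);
		else
			iter->page_ext = lookup_page_ext(pfn_to_page(pfn));

		return iter->page_ext;
	}

	/* A single for() statement: safe under an unbraced if(). */
	#define for_each_page_ext(__page, __pgcount, __page_ext, __iter)	\
		for (__page_ext = page_ext_iter_begin(&__iter, __page, __pgcount); \
		     __page_ext;						\
		     __page_ext = page_ext_iter_next(&__iter))

Whether page_ext_iter_next() should move to the header as an inline (which
would mean exposing lookup_page_ext()) to avoid roughly 262k function calls
per 1 GiB page was left open above; this sketch keeps it out of line.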
diff --git a/include/linux/page_ext.h b/include/linux/page_ext.h
index e4b48a0dda244..a99da12e59fa7 100644
--- a/include/linux/page_ext.h
+++ b/include/linux/page_ext.h
@@ -3,6 +3,7 @@
 #define __LINUX_PAGE_EXT_H
 
 #include <linux/types.h>
+#include <linux/mmzone.h>
 #include <linux/stacktrace.h>
 
 struct pglist_data;
@@ -69,12 +70,26 @@ extern void page_ext_init(void);
 static inline void page_ext_init_flatmem_late(void)
 {
 }
+
+static inline bool page_ext_iter_next_fast_possible(unsigned long next_pfn)
+{
+	/*
+	 * page_ext is allocated per memory section. Once we cross a
+	 * memory section, we have to fetch the new pointer.
+	 */
+	return next_pfn % PAGES_PER_SECTION;
+}
 #else
 extern void page_ext_init_flatmem(void);
 extern void page_ext_init_flatmem_late(void);
 static inline void page_ext_init(void)
 {
 }
+
+static inline bool page_ext_iter_next_fast_possible(unsigned long next_pfn)
+{
+	return true;
+}
 #endif
 
 extern struct page_ext *page_ext_get(const struct page *page);
@@ -93,6 +108,57 @@ static inline struct page_ext *page_ext_next(struct page_ext *curr)
 	return next;
 }
 
+struct page_ext_iter {
+	unsigned long pfn;
+	unsigned long index;
+	struct page_ext *page_ext;
+};
+
+struct page_ext *page_ext_iter_begin(struct page_ext_iter *iter, struct page *page);
+struct page_ext *page_ext_iter_next(struct page_ext_iter *iter);
+
+/**
+ * page_ext_iter_get() - Get current page extension
+ * @iter: page extension iterator.
+ *
+ * Return: NULL if no page_ext exists for this iterator.
+ */
+static inline struct page_ext *page_ext_iter_get(const struct page_ext_iter *iter)
+{
+	return iter->page_ext;
+}
+
+/**
+ * for_each_page_ext(): iterate through page_ext objects.
+ * @__page: the page we're interested in
+ * @__pgcount: how many pages to iterate through
+ * @__page_ext: struct page_ext pointer where the current page_ext
+ *              object is returned
+ * @__iter: struct page_ext_iter object (defined on the stack)
+ *
+ * IMPORTANT: must be called with RCU read lock taken.
+ */
+#define for_each_page_ext(__page, __pgcount, __page_ext, __iter) \
+	__page_ext = page_ext_iter_begin(&__iter, __page);       \
+	for (__iter.index = 0;                                    \
+	     __page_ext && __iter.index < __pgcount;              \
+	     __page_ext = page_ext_iter_next(&__iter),            \
+	     __iter.index++)
+
+/**
+ * for_each_page_ext_order(): iterate through page_ext objects
+ *                            for a given page order
+ * @__page: the page we're interested in
+ * @__order: page order to iterate through
+ * @__page_ext: struct page_ext pointer where the current page_ext
+ *              object is returned
+ * @__iter: struct page_ext_iter object (defined on the stack)
+ *
+ * IMPORTANT: must be called with RCU read lock taken.
+ */
+#define for_each_page_ext_order(__page, __order, __page_ext, __iter) \
+	for_each_page_ext(__page, (1UL << __order), __page_ext, __iter)
+
 #else /* !CONFIG_PAGE_EXTENSION */
 struct page_ext;
 
diff --git a/mm/page_ext.c b/mm/page_ext.c
index 641d93f6af4c1..508deb04d5ead 100644
--- a/mm/page_ext.c
+++ b/mm/page_ext.c
@@ -549,3 +549,44 @@ void page_ext_put(struct page_ext *page_ext)
 
 	rcu_read_unlock();
 }
+
+/**
+ * page_ext_iter_begin() - Prepare for iterating through page extensions.
+ * @iter: page extension iterator.
+ * @page: The page we're interested in.
+ *
+ * Must be called with RCU read lock taken.
+ *
+ * Return: NULL if no page_ext exists for this page.
+ */
+struct page_ext *page_ext_iter_begin(struct page_ext_iter *iter, struct page *page)
+{
+	iter->pfn = page_to_pfn(page);
+	iter->page_ext = lookup_page_ext(page);
+
+	return iter->page_ext;
+}
+
+/**
+ * page_ext_iter_next() - Get next page extension
+ * @iter: page extension iterator.
+ *
+ * Must be called with RCU read lock taken.
+ *
+ * Return: NULL if no next page_ext exists.
+ */
+struct page_ext *page_ext_iter_next(struct page_ext_iter *iter)
+{
+	if (WARN_ON_ONCE(!iter->page_ext))
+		return NULL;
+
+	iter->pfn++;
+
+	if (page_ext_iter_next_fast_possible(iter->pfn)) {
+		iter->page_ext = page_ext_next(iter->page_ext);
+	} else {
+		iter->page_ext = lookup_page_ext(pfn_to_page(iter->pfn));
+	}
+
+	return iter->page_ext;
+}
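To show the API from a client's perspective, here is a hedged before/after
sketch. The mark_range_*() functions, MY_FLAG, and the surrounding logic are
invented for illustration; only the iteration calls come from the patch:

	/*
	 * Before: page_ext_next() silently assumes the whole range lives
	 * in one memory section, which breaks for 1G pages under sparsemem.
	 */
	static void mark_range_old(struct page *page, unsigned int order)
	{
		struct page_ext *page_ext = page_ext_get(page);
		struct page_ext *cur = page_ext;
		unsigned long i;

		if (!page_ext)
			return;
		for (i = 0; i < (1UL << order); i++) {
			set_bit(MY_FLAG, &cur->flags);
			cur = page_ext_next(cur);	/* wrong across sections */
		}
		page_ext_put(page_ext);
	}

	/* After: the iterator re-resolves the pointer on section crossings. */
	static void mark_range_new(struct page *page, unsigned int order)
	{
		struct page_ext_iter iter;
		struct page_ext *page_ext;

		rcu_read_lock();
		for_each_page_ext_order(page, order, page_ext, iter)
			set_bit(MY_FLAG, &page_ext->flags);
		rcu_read_unlock();
	}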
The page extension implementation assumes that all page extensions of
a given page order are stored in the same memory section. The function
page_ext_next() relies on this assumption by adding an offset to the
current object to return the next adjacent page extension.

This behavior works as expected for flatmem but fails for sparsemem when
using 1G pages. The commit cf54f310d0d3 ("mm/hugetlb: use __GFP_COMP for
gigantic folios") exposes this issue, making it possible to trigger a
crash when using the page_owner or page_table_check page extensions.

The problem is that for 1G pages, the page extensions may span memory
section boundaries and be stored in different memory sections. This issue
was not visible before commit cf54f310d0d3 ("mm/hugetlb: use __GFP_COMP
for gigantic folios") because alloc_contig_pages() never passed more than
MAX_PAGE_ORDER to post_alloc_hook(). However, the series introducing the
mentioned commit changed this behavior, allowing the full 1G page order
to be passed.

Reproducer:

1. Build the kernel with CONFIG_SPARSEMEM=y and page extensions
   support
2. Pass 'default_hugepagesz=1G page_owner=on' in the kernel command-line
3. Reserve one 1G page at run-time, this should crash (backtrace below)

To address this issue, this commit introduces a new API for iterating
through page extensions. The main iteration loops are for_each_page_ext()
and for_each_page_ext_order(). Both must be called with the RCU read
lock taken. Here's a usage example:

"""
struct page_ext_iter iter;
struct page_ext *page_ext;

...

rcu_read_lock();
for_each_page_ext_order(page, order, page_ext, iter) {
        struct my_page_ext *obj = get_my_page_ext_obj(page_ext);
        ...
}
rcu_read_unlock();
"""

Both loop constructs use page_ext_iter_next(), which checks whether we
have crossed a section boundary during the iteration. In this case,
page_ext_iter_next() retrieves the next page_ext object from another
section.

Thanks to David Hildenbrand for helping identify the root cause and
providing suggestions on how to fix and optimize the solution (final
implementation and bugs are all mine, though).
Lastly, here's the backtrace; without KASAN you can get random crashes:

[ 76.052526] BUG: KASAN: slab-out-of-bounds in __update_page_owner_handle+0x238/0x298
[ 76.060283] Write of size 4 at addr ffff07ff96240038 by task tee/3598
[ 76.066714]
[ 76.068203] CPU: 88 UID: 0 PID: 3598 Comm: tee Kdump: loaded Not tainted 6.13.0-rep1 #3
[ 76.076202] Hardware name: WIWYNN Mt.Jade Server System B81.030Z1.0007/Mt.Jade Motherboard, BIOS 2.10.20220810 (SCP: 2.10.20220810) 2022/08/10
[ 76.088972] Call trace:
[ 76.091411]  show_stack+0x20/0x38 (C)
[ 76.095073]  dump_stack_lvl+0x80/0xf8
[ 76.098733]  print_address_description.constprop.0+0x88/0x398
[ 76.104476]  print_report+0xa8/0x278
[ 76.108041]  kasan_report+0xa8/0xf8
[ 76.111520]  __asan_report_store4_noabort+0x20/0x30
[ 76.116391]  __update_page_owner_handle+0x238/0x298
[ 76.121259]  __set_page_owner+0xdc/0x140
[ 76.125173]  post_alloc_hook+0x190/0x1d8
[ 76.129090]  alloc_contig_range_noprof+0x54c/0x890
[ 76.133874]  alloc_contig_pages_noprof+0x35c/0x4a8
[ 76.138656]  alloc_gigantic_folio.isra.0+0x2c0/0x368
[ 76.143616]  only_alloc_fresh_hugetlb_folio.isra.0+0x24/0x150
[ 76.149353]  alloc_pool_huge_folio+0x11c/0x1f8
[ 76.153787]  set_max_huge_pages+0x364/0xca8
[ 76.157961]  __nr_hugepages_store_common+0xb0/0x1a0
[ 76.162829]  nr_hugepages_store+0x108/0x118
[ 76.167003]  kobj_attr_store+0x3c/0x70
[ 76.170745]  sysfs_kf_write+0xfc/0x188
[ 76.174492]  kernfs_fop_write_iter+0x274/0x3e0
[ 76.178927]  vfs_write+0x64c/0x8e0
[ 76.182323]  ksys_write+0xf8/0x1f0
[ 76.185716]  __arm64_sys_write+0x74/0xb0
[ 76.189630]  invoke_syscall.constprop.0+0xd8/0x1e0
[ 76.194412]  do_el0_svc+0x164/0x1e0
[ 76.197891]  el0_svc+0x40/0xe0
[ 76.200939]  el0t_64_sync_handler+0x144/0x168
[ 76.205287]  el0t_64_sync+0x1ac/0x1b0

Fixes: cf54f310d0d3 ("mm/hugetlb: use __GFP_COMP for gigantic folios")
Signed-off-by: Luiz Capitulino <luizcap@redhat.com>
---
 include/linux/page_ext.h | 66 ++++++++++++++++++++++++++++++++++++++++
 mm/page_ext.c            | 41 +++++++++++++++++++++++++
 2 files changed, 107 insertions(+)