Message ID | 1657810063-28938-1-git-send-email-quic_charante@quicinc.com (mailing list archive) |
---|---|
State | New |
Series | mm: fix use-after-free of page_ext after race with memory-offline |
On Thu, 14 Jul 2022 20:17:43 +0530 Charan Teja Kalla <quic_charante@quicinc.com> wrote: > The below is one path where race between page_ext and offline of the > respective memory blocks will cause use-after-free on the access of > page_ext structure. > > ... > > --- a/include/linux/page_ext.h > +++ b/include/linux/page_ext.h > @@ -64,6 +64,25 @@ static inline struct page_ext *page_ext_next(struct page_ext *curr) > return next; > } > > +static inline struct page_ext *get_page_ext(struct page *page) > +{ > + struct page_ext *page_ext; > + > + rcu_read_lock(); If page_ext.h is to call rcu functions then it will need to include appropriate header files. > + page_ext = lookup_page_ext(page); > + if (!page_ext) { > + rcu_read_unlock(); > + return NULL; > + } > + > + return page_ext; > +} > + > +static inline void put_page_ext(void) > +{ > + rcu_read_unlock(); > +} Better names would be page_ext_get() and page_ext_put(). The rest of the page_ext API appears to have got this right, so let's not mess that up. Also, these aren't really get and put functions - page_ext doesn't have a refcount. But I can't immediately think of a name that better communicates what we're doing here so I guess get and put will do. And are we able to add some comments here explaining why these functions exist and what they do? > #else /* !CONFIG_PAGE_EXTENSION */ > struct page_ext; Are you sure we didn't need CONFIG_PAGE_EXTENSION=n stubs for these two functions?
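[Editorial note: taken together, Andrew's three points (rename to page_ext_get()/page_ext_put(), document the helpers, include the RCU header, and provide CONFIG_PAGE_EXTENSION=n stubs) might shape up roughly as in the sketch below. This only illustrates where the review points; the exact V2 shape is up to the patch author.]

/* Sketch only: include/linux/page_ext.h after the review comments. */
#include <linux/rcupdate.h>	/* rcu_read_lock()/rcu_read_unlock() */

#ifdef CONFIG_PAGE_EXTENSION
/*
 * page_ext_get() hands out the page_ext of a page under the RCU read
 * lock, so that a racing memory offline cannot free the page_ext array
 * underneath the caller. There is no refcount involved; page_ext_put()
 * merely ends the RCU read-side critical section.
 */
static inline struct page_ext *page_ext_get(struct page *page)
{
	struct page_ext *page_ext;

	rcu_read_lock();
	page_ext = lookup_page_ext(page);
	if (!page_ext) {
		rcu_read_unlock();
		return NULL;
	}

	return page_ext;
}

static inline void page_ext_put(void)
{
	rcu_read_unlock();
}
#else /* !CONFIG_PAGE_EXTENSION */
static inline struct page_ext *page_ext_get(struct page *page)
{
	return NULL;
}

static inline void page_ext_put(void)
{
}
#endif /* CONFIG_PAGE_EXTENSION */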
Thanks Andrew for the review!! On 7/15/2022 6:34 AM, Andrew Morton wrote: > On Thu, 14 Jul 2022 20:17:43 +0530 Charan Teja Kalla <quic_charante@quicinc.com> wrote: > >> The below is one path where race between page_ext and offline of the >> respective memory blocks will cause use-after-free on the access of >> page_ext structure. >> >> ... >> >> --- a/include/linux/page_ext.h >> +++ b/include/linux/page_ext.h >> @@ -64,6 +64,25 @@ static inline struct page_ext *page_ext_next(struct page_ext *curr) >> return next; >> } >> >> +static inline struct page_ext *get_page_ext(struct page *page) >> +{ >> + struct page_ext *page_ext; >> + >> + rcu_read_lock(); > If page_ext.h is to call rcu functions then it will need to include > appropriate header files. > Will add them!! >> + page_ext = lookup_page_ext(page); >> + if (!page_ext) { >> + rcu_read_unlock(); >> + return NULL; >> + } >> + >> + return page_ext; >> +} >> + >> +static inline void put_page_ext(void) >> +{ >> + rcu_read_unlock(); >> +} > Better names would be page_ext_get() and page_ext_put(). The rest of > the page_ext API appears to have got this right, so let's not mess that > up. I see the naming convention is not consistent in page_ext.c. For a couple of them I see page_ext_xxx() and for the rest it is xxx_page_ext(). Sure, I will follow the page_ext_xxx() convention in V2. > > Also, these aren't really get and put functions - page_ext doesn't have > a refcount. But I can't immediately think of a name that better > communicates what we're doing here so I guess get and put will do. > And are we able to add some comments here explaining why these > functions exist and what they do? Sure, will add them in V2. > >> #else /* !CONFIG_PAGE_EXTENSION */ >> struct page_ext; > Are you sure we didn't need CONFIG_PAGE_EXTENSION=n stubs for these two > functions? I think it does need them. Will add them in v2. > >
Hi Charan, On Thu, Jul 14, 2022 at 08:17:43PM +0530, Charan Teja Kalla wrote: > The below is one path where race between page_ext and offline of the > respective memory blocks will cause use-after-free on the access of > page_ext structure. > > process1 process2 > --------- --------- > a)doing /proc/page_owner doing memory offline > through offline_pages. > > b)PageBuddy check is failed > thus proceed to get the > page_owner information > through page_ext access. > page_ext = lookup_page_ext(page); > > migrate_pages(); > ................. > Since all pages are successfully > migrated as part of the offline > operation,send MEM_OFFLINE notification > where for page_ext it calls: > offline_page_ext()--> > __free_page_ext()--> > free_page_ext()--> > vfree(ms->page_ext) > mem_section->page_ext = NULL > > c) Check for the PAGE_EXT flags > in the page_ext->flags access > results into the use-after-free(leading > to the translation faults). > > As mentioned above, there is really no synchronization between page_ext > access and its freeing in the memory_offline. > > The memory offline steps(roughly) on a memory block is as below: > 1) Isolate all the pages > 2) while(1) > try free the pages to buddy.(->free_list[MIGRATE_ISOLATE]) > 3) delete the pages from this buddy list. > 4) Then free page_ext.(Note: The struct page is still alive as it is > freed only during hot remove of the memory which frees the memmap, which > steps the user might not perform). > > This design leads to the state where struct page is alive but the struct > page_ext is freed, where the later is ideally part of the former which > just representing the page_flags. This seems to be a wrong design where > 'struct page' as a whole is not accessible(Thanks to Minchan for > pointing this out). > > The above mentioned race is just one example __but the problem persists > in the other paths too involving page_ext->flags access(eg: > page_is_idle())__. Since offline waits till the last reference on the > page goes down i.e. any path that took the refcount on the page can make > the memory offline operation to wait. Eg: In the migrate_pages() > operation, we do take the extra refcount on the pages that are under > migration and then we do copy page_owner by accessing page_ext. For > > Fix those paths where offline races with page_ext access by maintaining > synchronization with rcu lock. > > Thanks to David Hildenbrand for his views/suggestions on the initial > discussion[1]. 
> > [1] https://lore.kernel.org/linux-mm/59edde13-4167-8550-86f0-11fc67882107@quicinc.com/ > > Signed-off-by: Charan Teja Kalla <quic_charante@quicinc.com> > --- > include/linux/page_ext.h | 19 +++++++++++++++++++ > include/linux/page_idle.h | 40 ++++++++++++++++++++++++++++++---------- > mm/page_ext.c | 3 ++- > mm/page_owner.c | 18 +++++++++++++++--- > mm/page_table_check.c | 10 +++++++--- > 5 files changed, 73 insertions(+), 17 deletions(-) > > diff --git a/include/linux/page_ext.h b/include/linux/page_ext.h > index fabb2e1..df5d353 100644 > --- a/include/linux/page_ext.h > +++ b/include/linux/page_ext.h > @@ -64,6 +64,25 @@ static inline struct page_ext *page_ext_next(struct page_ext *curr) > return next; > } > > +static inline struct page_ext *get_page_ext(struct page *page) > +{ > + struct page_ext *page_ext; > + > + rcu_read_lock(); > + page_ext = lookup_page_ext(page); > + if (!page_ext) { > + rcu_read_unlock(); > + return NULL; > + } > + > + return page_ext; > +} > + > +static inline void put_page_ext(void) > +{ > + rcu_read_unlock(); > +} > + Would it be a harm if we make lookup_page_ext() completely a private function? Or is there any codepath that have the benefit of calling lookup_page_ext() without going through get_page_ext()? If that is the case, we should add RCU lockdep check inside lookup_page_ext() to make sure that this function is called with RCUs. Thanks, Pavan
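[Editorial note: the lockdep annotation Pavan proposes could look like the sketch below, assuming lookup_page_ext() stays public; RCU_LOCKDEP_WARN() is the stock kernel helper for flagging RCU-protected accesses made outside a read-side critical section.]

struct page_ext *lookup_page_ext(const struct page *page)
{
	unsigned long pfn = page_to_pfn(page);
	struct mem_section *section = __pfn_to_section(pfn);

	/* Complain (when lockdep is enabled) about callers that
	 * bypass get_page_ext()/put_page_ext(). */
	RCU_LOCKDEP_WARN(!rcu_read_lock_held(),
			 "lookup_page_ext() called without RCU read lock");

	if (!section->page_ext)
		return NULL;

	return get_entry(section->page_ext, pfn);
}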
On Thu 14-07-22 20:17:43, Charan Teja Kalla wrote: > The below is one path where race between page_ext and offline of the > respective memory blocks will cause use-after-free on the access of > page_ext structure. > > process1 process2 > --------- --------- > a)doing /proc/page_owner doing memory offline > through offline_pages. > > b)PageBuddy check is failed > thus proceed to get the > page_owner information > through page_ext access. > page_ext = lookup_page_ext(page); > > migrate_pages(); > ................. > Since all pages are successfully > migrated as part of the offline > operation,send MEM_OFFLINE notification > where for page_ext it calls: > offline_page_ext()--> > __free_page_ext()--> > free_page_ext()--> > vfree(ms->page_ext) > mem_section->page_ext = NULL > > c) Check for the PAGE_EXT flags > in the page_ext->flags access > results into the use-after-free(leading > to the translation faults). > > As mentioned above, there is really no synchronization between page_ext > access and its freeing in the memory_offline. > > The memory offline steps(roughly) on a memory block is as below: > 1) Isolate all the pages > 2) while(1) > try free the pages to buddy.(->free_list[MIGRATE_ISOLATE]) > 3) delete the pages from this buddy list. > 4) Then free page_ext.(Note: The struct page is still alive as it is > freed only during hot remove of the memory which frees the memmap, which > steps the user might not perform). > > This design leads to the state where struct page is alive but the struct > page_ext is freed, where the later is ideally part of the former which > just representing the page_flags. This seems to be a wrong design where > 'struct page' as a whole is not accessible(Thanks to Minchan for > pointing this out). Nice description of the problem! Thanks! > The above mentioned race is just one example __but the problem persists > in the other paths too involving page_ext->flags access(eg: > page_is_idle())__. Since offline waits till the last reference on the > page goes down i.e. any path that took the refcount on the page can make > the memory offline operation to wait. Eg: In the migrate_pages() > operation, we do take the extra refcount on the pages that are under > migration and then we do copy page_owner by accessing page_ext. For > > Fix those paths where offline races with page_ext access by maintaining > synchronization with rcu lock. Please be much more specific about the synchronization. How does RCU actually synchronize the offlining and access? A higher level description of all the actors would be very helpful not only for the review but also for future readers. Also, more specifically [...] > diff --git a/mm/page_ext.c b/mm/page_ext.c > index 3dc715d..5ccd3ee 100644 > --- a/mm/page_ext.c > +++ b/mm/page_ext.c > @@ -299,8 +299,9 @@ static void __free_page_ext(unsigned long pfn) > if (!ms || !ms->page_ext) > return; > base = get_entry(ms->page_ext, pfn); > - free_page_ext(base); > ms->page_ext = NULL; > + synchronize_rcu(); > + free_page_ext(base); > } So you are imposing the RCU grace period for each page_ext! This can get really expensive. Have you tried to measure the effect? Is there any reason why page_ext is freed during offlining rather than when it is hotremoved? Thanks!
Thanks Pavan for the comments!! On 7/18/2022 11:41 AM, Pavan Kondeti wrote: >> diff --git a/include/linux/page_ext.h b/include/linux/page_ext.h >> index fabb2e1..df5d353 100644 >> --- a/include/linux/page_ext.h >> +++ b/include/linux/page_ext.h >> @@ -64,6 +64,25 @@ static inline struct page_ext *page_ext_next(struct page_ext *curr) >> return next; >> } >> >> +static inline struct page_ext *get_page_ext(struct page *page) >> +{ >> + struct page_ext *page_ext; >> + >> + rcu_read_lock(); >> + page_ext = lookup_page_ext(page); >> + if (!page_ext) { >> + rcu_read_unlock(); >> + return NULL; >> + } >> + >> + return page_ext; >> +} >> + >> +static inline void put_page_ext(void) >> +{ >> + rcu_read_unlock(); >> +} >> + > Would it be a harm if we make lookup_page_ext() completely a private function? > Or is there any codepath that have the benefit of calling lookup_page_ext() > without going through get_page_ext()? If that is the case, we should add > RCU lockdep check inside lookup_page_ext() to make sure that this function is > called with RCUs. IIUC, the synchronization is really not needed in all the paths of accessing the page_ext, thus hiding lookup_page_ext() and forcing the users to always rely on the get and put functions doesn't seem correct to me. Some example code paths where you don't need the synchronization while accessing the page_ext are: 1) In migration (where we also migrate the page_owner information), we take the extra refcount on the source and destination pages and then start the migration. This extra refcount makes the test_pages_isolated() function fail, and thus retries the offline operation. 2) In free_pages_prepare(), we do reset the page_owner (through page_ext), which again doesn't need protection to access because the page is already being freed (through only one path). Thus I don't see the need for an rcu lockdep check in lookup_page_ext(). If there are any other paths where I am missing protection around page_ext access, please let me know. Thanks, Charan
Thanks Michal for the comments!! On 7/18/2022 5:20 PM, Michal Hocko wrote: >> The above mentioned race is just one example __but the problem persists >> in the other paths too involving page_ext->flags access(eg: >> page_is_idle())__. Since offline waits till the last reference on the >> page goes down i.e. any path that took the refcount on the page can make >> the memory offline operation to wait. Eg: In the migrate_pages() >> operation, we do take the extra refcount on the pages that are under >> migration and then we do copy page_owner by accessing page_ext. For >> >> Fix those paths where offline races with page_ext access by maintaining >> synchronization with rcu lock. > Please be much more specific about the synchronization. How does RCU > actually synchronize the offlining and access? Higher level description > of all the actors would be very helpful not only for the review but also > for future readers. I will improve the commit message about this synchronization change using RCUs. > > Also, more specifically > [...] >> diff --git a/mm/page_ext.c b/mm/page_ext.c >> index 3dc715d..5ccd3ee 100644 >> --- a/mm/page_ext.c >> +++ b/mm/page_ext.c >> @@ -299,8 +299,9 @@ static void __free_page_ext(unsigned long pfn) >> if (!ms || !ms->page_ext) >> return; >> base = get_entry(ms->page_ext, pfn); >> - free_page_ext(base); >> ms->page_ext = NULL; >> + synchronize_rcu(); >> + free_page_ext(base); >> } > So you are imposing the RCU grace period for each page_ext! This can get > really expensive. Have you tried to measure the effect? > I didn't really measure the effect. Let me measure it and post these in V2. > Is there any reason why page_ext is freed during offlining rather when > it is hotremoved? This is something I am struggling to get an answer to. IMO, this is even a wrong design, where I have the page but not its page_ext. Moving the freeing of page_ext to the hotremove path actually solves the problem, but somehow this idea wasn't liked[1]. Copying the excerpt here: "> > 3) Change the design where the page_ext is valid as long as the struct > page is alive. :/ Doesn't spark joy." @Joonsoo : We see that you did commit eefa864b701d ("mm/page_ext: resurrect struct page extending code for debugging"). Any reason why the page_ext was chosen to be freed at the offline operation rather than at the remove operation of a memory block? [1] https://lore.kernel.org/linux-mm/8fefe59d-c893-39f4-3225-65343086c867@redhat.com/ > > Thanks!
On Mon 18-07-22 19:28:13, Charan Teja Kalla wrote: > Thanks Michal for the comments!! > > On 7/18/2022 5:20 PM, Michal Hocko wrote: > >> The above mentioned race is just one example __but the problem persists > >> in the other paths too involving page_ext->flags access(eg: > >> page_is_idle())__. Since offline waits till the last reference on the > >> page goes down i.e. any path that took the refcount on the page can make > >> the memory offline operation to wait. Eg: In the migrate_pages() > >> operation, we do take the extra refcount on the pages that are under > >> migration and then we do copy page_owner by accessing page_ext. For > >> > >> Fix those paths where offline races with page_ext access by maintaining > >> synchronization with rcu lock. > > Please be much more specific about the synchronization. How does RCU > > actually synchronize the offlining and access? Higher level description > > of all the actors would be very helpful not only for the review but also > > for future readers. > > I will improve the commit message about this synchronization change > using RCU's. Thanks! The most important part is how the exclusion is actually achieved because that is not really clear at first sight CPU1 CPU2 lookup_page_ext(PageA) offlining offline_page_ext __free_page_ext(addrA) get_entry(addrA) ms->page_ext = NULL synchronize_rcu() free_page_ext free_pages_exact (now addrA is unusable) rcu_read_lock() entryA = get_entry(addrA) base + page_ext_size * index # an address not invalidated by the freeing path do_something(entryA) rcu_read_unlock() CPU1 never checks ms->page_ext so it cannot bail out early when the thing is torn down. Or maybe I am missing something. I am not familiar with page_ext much. > > Also, more specifically > > [...] > >> diff --git a/mm/page_ext.c b/mm/page_ext.c > >> index 3dc715d..5ccd3ee 100644 > >> --- a/mm/page_ext.c > >> +++ b/mm/page_ext.c > >> @@ -299,8 +299,9 @@ static void __free_page_ext(unsigned long pfn) > >> if (!ms || !ms->page_ext) > >> return; > >> base = get_entry(ms->page_ext, pfn); > >> - free_page_ext(base); > >> ms->page_ext = NULL; > >> + synchronize_rcu(); > >> + free_page_ext(base); > >> } > >> So you are imposing the RCU grace period for each page_ext! This can get > >> really expensive. Have you tried to measure the effect? I was wrong here! This is for each memory section which is not as terrible as every single page_ext. There can still be quite a lot of memory sections in a single memory block (e.g. on ppc memory sections are ridiculously small). > I didn't really measure the effect. Let me measure it and post these in V2. I think it would be much more optimal to split the operation into 2 phases. Invalidate all the page_ext metadata, then synchronize_rcu, and only then free them all. I am not very familiar with page_ext so I am not sure this is easy to do. Maybe page_ext = NULL can be done in the first stage. > > Is there any reason why page_ext is freed during offlining rather when > > it is hotremoved? > > This is something I am struggling to get the answer. IMO, this is even > wrong design where I don't have page_ext but page. Moving the freeing of > page_ext to hotremove path actually solves the problem but somehow this > idea didn't liked[1]. copying the excerpt here: yes, it certainly adds subtlety to the page_ext thingy. I do agree that even the situation around struct page is not all that great wrt synchronization.
We have pfn_to_online_page which, even when racy, doesn't give you garbage, because hotremove happens very rarely, or so long after offlining, that the race window is essentially impractically long for any potential damage. We would have to change a lot to make it work "properly". I am not optimistic this is actually feasible. > > 3) Change the design where the page_ext is valid as long as the struct > page is alive. > > :/ Doesn't spark joy." I would be wondering why. It should only take moving the callback to happen at hotremove. So it shouldn't be a very involved change. I can imagine somebody would be relying on releasing resources when offlining memory but is that really the case?
Thanks Michal!! On 7/18/2022 8:24 PM, Michal Hocko wrote: >>>> The above mentioned race is just one example __but the problem persists >>>> in the other paths too involving page_ext->flags access(eg: >>>> page_is_idle())__. Since offline waits till the last reference on the >>>> page goes down i.e. any path that took the refcount on the page can make >>>> the memory offline operation to wait. Eg: In the migrate_pages() >>>> operation, we do take the extra refcount on the pages that are under >>>> migration and then we do copy page_owner by accessing page_ext. For >>>> >>>> Fix those paths where offline races with page_ext access by maintaining >>>> synchronization with rcu lock. >>> Please be much more specific about the synchronization. How does RCU >>> actually synchronize the offlining and access? Higher level description >>> of all the actors would be very helpful not only for the review but also >>> for future readers. >> I will improve the commit message about this synchronization change >> using RCU's. > Thanks! The most imporant part is how the exclusion is actual achieved > because that is not really clear at first sight > > CPU1 CPU2 > lookup_page_ext(PageA) offlining > offline_page_ext > __free_page_ext(addrA) > get_entry(addrA) > ms->page_ext = NULL > synchronize_rcu() > free_page_ext > free_pages_exact (now addrA is unusable) > > rcu_read_lock() > entryA = get_entry(addrA) > base + page_ext_size * index # an address not invalidated by the freeing path > do_something(entryA) > rcu_read_unlock() > > CPU1 never checks ms->page_ext so it cannot bail out early when the > thing is torn down. Or maybe I am missing something. I am not familiar > with page_ext much. Thanks a lot for catching this, Michal. You are correct that the proposed code from me is still racy. I will correct this along with the proper commit message in the next version of this patch. > >>> Also, more specifically >>> [...] >>>> diff --git a/mm/page_ext.c b/mm/page_ext.c >>>> index 3dc715d..5ccd3ee 100644 >>>> --- a/mm/page_ext.c >>>> +++ b/mm/page_ext.c >>>> @@ -299,8 +299,9 @@ static void __free_page_ext(unsigned long pfn) >>>> if (!ms || !ms->page_ext) >>>> return; >>>> base = get_entry(ms->page_ext, pfn); >>>> - free_page_ext(base); >>>> ms->page_ext = NULL; >>>> + synchronize_rcu(); >>>> + free_page_ext(base); >>>> } >>> So you are imposing the RCU grace period for each page_ext! This can get >>> really expensive. Have you tried to measure the effect? > I was wrong here! This is for each memory section which is not as > terrible as every single page_ext. This can be still quite a lot memory > sections in a single memory block (e.g. on ppc memory sections are > ridiculously small). > On ARM64, I see that the minimum a section size will go is 128MB. I think 16MB is the section size on ppc. Any inputs on how frequently the offline/online operation is done on the ppc arch? >> I didn't really measure the effect. Let me measure it and post these in V2. > I think it would be much more optimal to split the operation into 2 > phases. Invalidate all the page_ext metadata then synchronize_rcu and > only then free them all. I am not very familiar with page_ext so I am > not sure this is easy to be done. Maybe page_ext = NULL can be done in > the first stage. > Let me explore if this can be easily done. >>> 3) Change the design where the page_ext is valid as long as the struct >>> page is alive. >> :/ Doesn't spark joy." > I would be wondering why. It should only take to move the callback to > happen at hotremove.
So it shouldn't be very involved of a change. I can > imagine somebody would be relying on releasing resources when offlining > memory but is that really the case? I don't find any hard need for the clients to release this page_ext memory. What I can think of is that the page_ext size is proportional to the debug features we are enabling (which is what it is being used for on 64bit, as of now). Eg: Enabling the page_owner requires an additional 0x30 bytes per page, memory which is not required when the memory block is offlined. But then it should be the same case for the memory occupied by struct page too for this offlined block. One comment from the initial discussion: "It smells like page_ext should use some mechanism during MEM_OFFLINE to synchronize against any users of its metadata. Generic memory offlining code might be the wrong place for that." -- I think the page_ext creation and deletion should fit into the sparse code. I will try to provide the changes tomorrow and if it seems unfit there, I will work on improving the current patch based on the rcu logic. Thanks, Charan
On 18.07.22 16:54, Michal Hocko wrote: > On Mon 18-07-22 19:28:13, Charan Teja Kalla wrote: >> Thanks Michal for the comments!! >> >> On 7/18/2022 5:20 PM, Michal Hocko wrote: >>>> The above mentioned race is just one example __but the problem persists >>>> in the other paths too involving page_ext->flags access(eg: >>>> page_is_idle())__. Since offline waits till the last reference on the >>>> page goes down i.e. any path that took the refcount on the page can make >>>> the memory offline operation to wait. Eg: In the migrate_pages() >>>> operation, we do take the extra refcount on the pages that are under >>>> migration and then we do copy page_owner by accessing page_ext. For >>>> >>>> Fix those paths where offline races with page_ext access by maintaining >>>> synchronization with rcu lock. >>> Please be much more specific about the synchronization. How does RCU >>> actually synchronize the offlining and access? Higher level description >>> of all the actors would be very helpful not only for the review but also >>> for future readers. >> >> I will improve the commit message about this synchronization change >> using RCU's. > > Thanks! The most imporant part is how the exclusion is actual achieved > because that is not really clear at first sight > > CPU1 CPU2 > lookup_page_ext(PageA) offlining > offline_page_ext > __free_page_ext(addrA) > get_entry(addrA) > ms->page_ext = NULL > synchronize_rcu() > free_page_ext > free_pages_exact (now addrA is unusable) > > rcu_read_lock() > entryA = get_entry(addrA) > base + page_ext_size * index # an address not invalidated by the freeing path > do_something(entryA) > rcu_read_unlock() > > CPU1 never checks ms->page_ext so it cannot bail out early when the > thing is torn down. Or maybe I am missing something. I am not familiar > with page_ext much. > >>> Also, more specifically >>> [...] >>>> diff --git a/mm/page_ext.c b/mm/page_ext.c >>>> index 3dc715d..5ccd3ee 100644 >>>> --- a/mm/page_ext.c >>>> +++ b/mm/page_ext.c >>>> @@ -299,8 +299,9 @@ static void __free_page_ext(unsigned long pfn) >>>> if (!ms || !ms->page_ext) >>>> return; >>>> base = get_entry(ms->page_ext, pfn); >>>> - free_page_ext(base); >>>> ms->page_ext = NULL; >>>> + synchronize_rcu(); >>>> + free_page_ext(base); >>>> } >>> So you are imposing the RCU grace period for each page_ext! This can get >>> really expensive. Have you tried to measure the effect? > > I was wrong here! This is for each memory section which is not as > terrible as every single page_ext. This can be still quite a lot memory > sections in a single memory block (e.g. on ppc memory sections are > ridiculously small). > >> I didn't really measure the effect. Let me measure it and post these in V2. > > I think it would be much more optimal to split the operation into 2 > phases. Invalidate all the page_ext metadata then synchronize_rcu and > only then free them all. I am not very familiar with page_ext so I am > not sure this is easy to be done. Maybe page_ext = NULL can be done in > the first stage. > >>> Is there any reason why page_ext is freed during offlining rather when >>> it is hotremoved? >> >> This is something I am struggling to get the answer. IMO, this is even >> wrong design where I don't have page_ext but page. Moving the freeing of >> page_ext to hotremove path actually solves the problem but somehow this >> idea didn't liked[1]. copying the excerpt here: > > yes, it certainly adds subtlety to the page_ext thingy. 
I do agree that > even situation around struct page is not all that great wrt > synchronization. We have pfn_to_online_page which even when racy doesn't > give you a garbage because hotremove happens very rarely or so long > after offlining that the race window is essentially impractically too > long for any potential damage. We would have to change a lot to make it > work "properly". I am not optimistic this is actually feasible. > >>> 3) Change the design where the page_ext is valid as long as the struct >>> page is alive. >> >> :/ Doesn't spark joy." > > I would be wondering why. It should only take to move the callback to > happen at hotremove. So it shouldn't be very involved of a change. I can > imagine somebody would be relying on releasing resources when offlining > memory but is that really the case? Various reasons: 1) There was a discussion in the past to eventually also use rcu protection for handling pfn_to_online_page(). So doing it cleanly here is certainly an improvement. 2) I really dislike having to scatter section online checks all over the place in page ext code. Once there is a difference between active vs. stale page ext data, things get a bit messy and error-prone. This is already ugly enough in our generic memmap handling code IMHO. 3) Having on-demand allocations, such as KASAN or page ext, from the memory online notifier is at least currently cleaner, because we don't have to handle each and every subsystem that hooks into that during the core memory hotadd/remove phase, which primarily only sets up the vmemmap, direct map and memory block devices. Personally, I think what we have in this patch is quite nice and clean. But I won't object if it can be similarly done in a clean way from hot(un)plug code. That is, I ack this patch but don't object to similarly clean approaches.
On Tue 19-07-22 17:19:34, David Hildenbrand wrote: > On 18.07.22 16:54, Michal Hocko wrote: > > On Mon 18-07-22 19:28:13, Charan Teja Kalla wrote: [...] > >>> 3) Change the design where the page_ext is valid as long as the struct > >>> page is alive. > >> > >> :/ Doesn't spark joy." > > > > I would be wondering why. It should only take to move the callback to > > happen at hotremove. So it shouldn't be very involved of a change. I can > > imagine somebody would be relying on releasing resources when offlining > > memory but is that really the case? > > Various reasons: > > 1) There was a discussion in the past to eventually also use rcu > protection for handling pdn_to_online_page(). So doing it cleanly here > is certainly an improvement. Call me skeptical on that. > 2) I really dislike having to scatter section online checks all over the > place in page ext code. Once there is a difference between active vs. > stale page ext data things get a bit messy and error prone. This is > already ugly enough in our generic memmap handling code IMHO. They should represent a free page in any case, so even if they are stale they shouldn't be really dangerous, right? > 3) Having on-demand allocations, such as KASAN or page ext from the > memory online notifier is at least currently cleaner, because we don't > have to handle each and every subsystem that hooks into that during the > core memory hotadd/remove phase, which primarily only setups the > vmemmap, direct map and memory block devices. Can't this hook into __add_pages, which is the real implementation of the arch-independent way to allocate vmemmap? Or at the sparsemem level, because we do not (and very likely won't) support memory hotplug on any other memory model. > Personally, I think what we have in this patch is quite nice and clean. > But I won't object if it can be similarly done in a clean way from > hot(un)plug code. Well, if the scheme can be done without a synchronize_rcu for each section, which can backfire, and if the scheme doesn't add too much complexity to achieve that, then sure, I won't object. I just do not get why page_ext should have a different allocation lifetime expectancy than a real page. Quite confusing if you ask me.
On Tue 19-07-22 20:42:42, Charan Teja Kalla wrote: > Thanks Michal!! > > On 7/18/2022 8:24 PM, Michal Hocko wrote: [...] > >>>> diff --git a/mm/page_ext.c b/mm/page_ext.c > >>>> index 3dc715d..5ccd3ee 100644 > >>>> --- a/mm/page_ext.c > >>>> +++ b/mm/page_ext.c > >>>> @@ -299,8 +299,9 @@ static void __free_page_ext(unsigned long pfn) > >>>> if (!ms || !ms->page_ext) > >>>> return; > >>>> base = get_entry(ms->page_ext, pfn); > >>>> - free_page_ext(base); > >>>> ms->page_ext = NULL; > >>>> + synchronize_rcu(); > >>>> + free_page_ext(base); > >>>> } > >>> So you are imposing the RCU grace period for each page_ext! This can get > >>> really expensive. Have you tried to measure the effect? > > I was wrong here! This is for each memory section which is not as > > terrible as every single page_ext. This can be still quite a lot memory > > sections in a single memory block (e.g. on ppc memory sections are > > ridiculously small). > > > > On the ARM64, I see that the minimum a section size will go is 128MB. I > think 16MB is the section size on ppc. Any inputs on how frequently > offline/online operation is being done on this ppc arch? I have seen several reports where 16MB sections were used on PPC LPARs with a non-trivial size. My usual answer to that is that this is mostly a self-inflicted injury, but I am told that for some reasons I cannot understand this is not easy to change. So reasonable or not, this is not all that uncommon in PPC land. We definitely shouldn't optimize for those setups but we shouldn't make them suffer even more as well. Besides that it seems that a single synchronize_rcu per offline operation should be doable.
> >> 2) I really dislike having to scatter section online checks all over the >> place in page ext code. Once there is a difference between active vs. >> stale page ext data things get a bit messy and error prone. This is >> already ugly enough in our generic memmap handling code IMHO. > > They should represent a free page in any case so even they are stall > they shouldn't be really dangerous, right? Good question. The use-after-free tells me that there could at least be something accessing page_ext data after offlining right now. As long as it's only unsynchronized read access, we should be fine. > >> 3) Having on-demand allocations, such as KASAN or page ext from the >> memory online notifier is at least currently cleaner, because we don't >> have to handle each and every subsystem that hooks into that during the >> core memory hotadd/remove phase, which primarily only setups the >> vmemmap, direct map and memory block devices. > > Cannot this hook into __add_pages which is the real implementation of > the arch independent way to allocate vmemmap. Or at the sparsemem level > because we do not (and very likely won't) support memory hotplug on > any other memory model. As __add_pages() is also called from mm/memremap.c where we don't want that metadata, we'd have to special-case (would need a new parameter I guess). > >> Personally, I think what we have in this patch is quite nice and clean. >> But I won't object if it can be similarly done in a clean way from >> hot(un)plug code. > > Well, if the scheme can be done without synchronize_rcu for each section > which can backfire and if the scheme doesn't add too much complexity to > achieve that then sure I won't object. I just do not get why page_ext > should have a different allocation lifetime expectancy than a real page. > Quite confusing if you ask me. In contrast to memmap, people actually test for NULL pointers here. If you ask me, the memmap access is ugly enough and I don't really enjoy other metadata following that pattern of "stale and suddenly removed". This seems to be an easy way to do it cleanly. But yes, if the synchronize_rcu turns out problematic, we'd either have to optimize or move the allocation/free phase.
On 19.07.22 17:43, Michal Hocko wrote: > On Tue 19-07-22 20:42:42, Charan Teja Kalla wrote: >> Thanks Michal!! >> >> On 7/18/2022 8:24 PM, Michal Hocko wrote: > [...] >>>>>> diff --git a/mm/page_ext.c b/mm/page_ext.c >>>>>> index 3dc715d..5ccd3ee 100644 >>>>>> --- a/mm/page_ext.c >>>>>> +++ b/mm/page_ext.c >>>>>> @@ -299,8 +299,9 @@ static void __free_page_ext(unsigned long pfn) >>>>>> if (!ms || !ms->page_ext) >>>>>> return; >>>>>> base = get_entry(ms->page_ext, pfn); >>>>>> - free_page_ext(base); >>>>>> ms->page_ext = NULL; >>>>>> + synchronize_rcu(); >>>>>> + free_page_ext(base); >>>>>> } >>>>> So you are imposing the RCU grace period for each page_ext! This can get >>>>> really expensive. Have you tried to measure the effect? >>> I was wrong here! This is for each memory section which is not as >>> terrible as every single page_ext. This can be still quite a lot memory >>> sections in a single memory block (e.g. on ppc memory sections are >>> ridiculously small). >>> >> >> On the ARM64, I see that the minimum a section size will go is 128MB. I >> think 16MB is the section size on ppc. Any inputs on how frequently >> offline/online operation is being done on this ppc arch? > > I have seen several reports where 16MB sections were used on PPC LPARs > with a non trivial size. My usual answer to that is tha this is mostly a > self inflicted injury but I am told that for some reasons I cannot > udnerstand this is not easy to change. So reasonable or not this is not > all that uncommon in PPC land. > > We definitely shouldn't optimize for those setups but we shouldn't make > them suffer even more as well. Besides that it seems that a single > rcu_synchronize per offline operation should be doable. IIRC, any reasonable PPC64 installation uses LMB >= 256 MiB, which maps to the logical memory block size, and we only online/offline complete memory blocks, not individual memory sections. So on these installations, you'll see memory getting onlined/offlined in at least 256MiB granularity.
Hi Charan, On Tue, Jul 19, 2022 at 08:42:42PM +0530, Charan Teja Kalla wrote: > Thanks Michal!! > > On 7/18/2022 8:24 PM, Michal Hocko wrote: > >>>> The above mentioned race is just one example __but the problem persists > >>>> in the other paths too involving page_ext->flags access(eg: > >>>> page_is_idle())__. Since offline waits till the last reference on the > >>>> page goes down i.e. any path that took the refcount on the page can make > >>>> the memory offline operation to wait. Eg: In the migrate_pages() > >>>> operation, we do take the extra refcount on the pages that are under > >>>> migration and then we do copy page_owner by accessing page_ext. For > >>>> > >>>> Fix those paths where offline races with page_ext access by maintaining > >>>> synchronization with rcu lock. > >>> Please be much more specific about the synchronization. How does RCU > >>> actually synchronize the offlining and access? Higher level description > >>> of all the actors would be very helpful not only for the review but also > >>> for future readers. > >> I will improve the commit message about this synchronization change > >> using RCU's. > > Thanks! The most imporant part is how the exclusion is actual achieved > > because that is not really clear at first sight > > > > CPU1 CPU2 > > lookup_page_ext(PageA) offlining > > offline_page_ext > > __free_page_ext(addrA) > > get_entry(addrA) > > ms->page_ext = NULL > > synchronize_rcu() > > free_page_ext > > free_pages_exact (now addrA is unusable) > > > > rcu_read_lock() > > entryA = get_entry(addrA) > > base + page_ext_size * index # an address not invalidated by the freeing path > > do_something(entryA) > > rcu_read_unlock() > > > > CPU1 never checks ms->page_ext so it cannot bail out early when the > > thing is torn down. Or maybe I am missing something. I am not familiar > > with page_ext much. > > > Thanks a lot for catching this Michal. You are correct that the proposed > code from me is still racy. I Will correct this along with the proper > commit message in the next version of this patch. > Trying to understand your discussion with Michal. What part is still racy? We do check for mem_section::page_ext and bail out early from lookup_page_ext(), no? Also to make this scheme explicit, we can annotate page_ext member with __rcu and use rcu_assign_pointer() on the writer side. struct page_ext *lookup_page_ext(const struct page *page) { unsigned long pfn = page_to_pfn(page); struct mem_section *section = __pfn_to_section(pfn); /* * The sanity checks the page allocator does upon freeing a * page can reach here before the page_ext arrays are * allocated when feeding a range of pages to the allocator * for the first time during bootup or memory hotplug. */ if (!section->page_ext) return NULL; return get_entry(section->page_ext, pfn); } Thanks, Pavan
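[Editorial note: for illustration, the __rcu annotation Pavan suggests would make the pointer handoff explicit to sparse, roughly as sketched below (reader and writer sides shown). The thread ends up preferring READ_ONCE()/WRITE_ONCE() instead, since annotating every ms->page_ext access is a much bigger patch.]

/* Reader side, called under rcu_read_lock()/rcu_read_unlock(): */
struct page_ext *lookup_page_ext(const struct page *page)
{
	unsigned long pfn = page_to_pfn(page);
	struct mem_section *section = __pfn_to_section(pfn);
	struct page_ext *page_ext = rcu_dereference(section->page_ext);

	if (!page_ext)
		return NULL;

	return get_entry(page_ext, pfn);
}

/* Writer side, in __free_page_ext(): publish NULL, wait, then free. */
static void __free_page_ext(unsigned long pfn)
{
	struct mem_section *ms = __pfn_to_section(pfn);
	struct page_ext *base;

	if (!ms || !rcu_access_pointer(ms->page_ext))
		return;

	/* Offlining is serialized, so a plain protected dereference is fine. */
	base = get_entry(rcu_dereference_protected(ms->page_ext, true), pfn);
	RCU_INIT_POINTER(ms->page_ext, NULL);
	synchronize_rcu();
	free_page_ext(base);
}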
On Wed 20-07-22 13:51:12, Pavan Kondeti wrote: > Hi Charan, > > On Tue, Jul 19, 2022 at 08:42:42PM +0530, Charan Teja Kalla wrote: > > Thanks Michal!! > > > > On 7/18/2022 8:24 PM, Michal Hocko wrote: > > >>>> The above mentioned race is just one example __but the problem persists > > >>>> in the other paths too involving page_ext->flags access(eg: > > >>>> page_is_idle())__. Since offline waits till the last reference on the > > >>>> page goes down i.e. any path that took the refcount on the page can make > > >>>> the memory offline operation to wait. Eg: In the migrate_pages() > > >>>> operation, we do take the extra refcount on the pages that are under > > >>>> migration and then we do copy page_owner by accessing page_ext. For > > >>>> > > >>>> Fix those paths where offline races with page_ext access by maintaining > > >>>> synchronization with rcu lock. > > >>> Please be much more specific about the synchronization. How does RCU > > >>> actually synchronize the offlining and access? Higher level description > > >>> of all the actors would be very helpful not only for the review but also > > >>> for future readers. > > >> I will improve the commit message about this synchronization change > > >> using RCU's. > > > Thanks! The most imporant part is how the exclusion is actual achieved > > > because that is not really clear at first sight > > > > > > CPU1 CPU2 > > > lookup_page_ext(PageA) offlining > > > offline_page_ext > > > __free_page_ext(addrA) > > > get_entry(addrA) > > > ms->page_ext = NULL > > > synchronize_rcu() > > > free_page_ext > > > free_pages_exact (now addrA is unusable) > > > > > > rcu_read_lock() > > > entryA = get_entry(addrA) > > > base + page_ext_size * index # an address not invalidated by the freeing path > > > do_something(entryA) > > > rcu_read_unlock() > > > > > > CPU1 never checks ms->page_ext so it cannot bail out early when the > > > thing is torn down. Or maybe I am missing something. I am not familiar > > > with page_ext much. > > > > > > Thanks a lot for catching this Michal. You are correct that the proposed > > code from me is still racy. I Will correct this along with the proper > > commit message in the next version of this patch. > > > > Trying to understand your discussion with Michal. What part is still racy? We > do check for mem_section::page_ext and bail out early from lookup_page_ext(), > no? > > Also to make this scheme explicit, we can annotate page_ext member with __rcu > and use rcu_assign_pointer() on the writer side. > > struct page_ext *lookup_page_ext(const struct page *page) > { > unsigned long pfn = page_to_pfn(page); > struct mem_section *section = __pfn_to_section(pfn); > /* > * The sanity checks the page allocator does upon freeing a > * page can reach here before the page_ext arrays are > * allocated when feeding a range of pages to the allocator > * for the first time during bootup or memory hotplug. > */ > if (!section->page_ext) > return NULL; > return get_entry(section->page_ext, pfn); > } You are right. I was looking at the wrong implementation and misread ifdef vs. ifndef CONFIG_SPARSEMEM. My bad. Memory hotplug is not supported outside of CONFIG_SPARSEMEM so the scheme should really work. I would use READ_ONCE for ms->page_ext and WRITE_ONCE on the initialization side.
Thanks Michal & Pavan, On 7/20/2022 2:40 PM, Michal Hocko wrote: >>>> Thanks! The most imporant part is how the exclusion is actual achieved >>>> because that is not really clear at first sight >>>> >>>> CPU1 CPU2 >>>> lookup_page_ext(PageA) offlining >>>> offline_page_ext >>>> __free_page_ext(addrA) >>>> get_entry(addrA) >>>> ms->page_ext = NULL >>>> synchronize_rcu() >>>> free_page_ext >>>> free_pages_exact (now addrA is unusable) >>>> >>>> rcu_read_lock() >>>> entryA = get_entry(addrA) >>>> base + page_ext_size * index # an address not invalidated by the freeing path >>>> do_something(entryA) >>>> rcu_read_unlock() >>>> >>>> CPU1 never checks ms->page_ext so it cannot bail out early when the >>>> thing is torn down. Or maybe I am missing something. I am not familiar >>>> with page_ext much. >>> >>> Thanks a lot for catching this Michal. You are correct that the proposed >>> code from me is still racy. I Will correct this along with the proper >>> commit message in the next version of this patch. >>> >> Trying to understand your discussion with Michal. What part is still racy? We >> do check for mem_section::page_ext and bail out early from lookup_page_ext(), >> no? >> >> Also to make this scheme explicit, we can annotate page_ext member with __rcu >> and use rcu_assign_pointer() on the writer side. Annotating with __rcu requires all the read and writes to ms->page_ext to be under rcu_[access|assign]_pointer which is a big patch. I think READ_ONCE and WRITE_ONCE, mentioned by Michal, below should does the job. >> >> struct page_ext *lookup_page_ext(const struct page *page) >> { >> unsigned long pfn = page_to_pfn(page); >> struct mem_section *section = __pfn_to_section(pfn); >> /* >> * The sanity checks the page allocator does upon freeing a >> * page can reach here before the page_ext arrays are >> * allocated when feeding a range of pages to the allocator >> * for the first time during bootup or memory hotplug. >> */ >> if (!section->page_ext) >> return NULL; >> return get_entry(section->page_ext, pfn); >> } > You are right. I was looking at the wrong implementation and misread > ifdef vs. ifndef CONFIG_SPARSEMEM. My bad. > There is still a small race window b/n ms->page_ext setting NULL and its access even under CONFIG_SPARSEMEM. In the above mentioned example: CPU1 CPU2 rcu_read_lock() lookup_page_ext(PageA): offlining offline_page_ext __free_page_ext(addrA) get_entry(addrA) if (!section->page_ext) turns to be false. ms->page_ext = NULL addrA = get_entry(base=section->page_ext): base + page_ext_size * index; **Since base is NULL here, caller can still do the dereference on the invalid pointer address.** synchronize_rcu() free_page_ext free_pages_exact (now ) > Memory hotplug is not supported outside of CONFIG_SPARSEMEM so the > scheme should really work. I would use READ_ONCE for ms->page_ext and > WRITE_ONCE on the initialization side. Yes, I should be using the READ_ONCE() and WRITE_ONCE() here. Thanks, Charan
On Wed 20-07-22 16:13:19, Charan Teja Kalla wrote: > Thanks Michal & Pavan, > > On 7/20/2022 2:40 PM, Michal Hocko wrote: > >>>> Thanks! The most imporant part is how the exclusion is actual achieved > >>>> because that is not really clear at first sight > >>>> > >>>> CPU1 CPU2 > >>>> lookup_page_ext(PageA) offlining > >>>> offline_page_ext > >>>> __free_page_ext(addrA) > >>>> get_entry(addrA) > >>>> ms->page_ext = NULL > >>>> synchronize_rcu() > >>>> free_page_ext > >>>> free_pages_exact (now addrA is unusable) > >>>> > >>>> rcu_read_lock() > >>>> entryA = get_entry(addrA) > >>>> base + page_ext_size * index # an address not invalidated by the freeing path > >>>> do_something(entryA) > >>>> rcu_read_unlock() > >>>> > >>>> CPU1 never checks ms->page_ext so it cannot bail out early when the > >>>> thing is torn down. Or maybe I am missing something. I am not familiar > >>>> with page_ext much. > >>> > >>> Thanks a lot for catching this Michal. You are correct that the proposed > >>> code from me is still racy. I Will correct this along with the proper > >>> commit message in the next version of this patch. > >>> > >> Trying to understand your discussion with Michal. What part is still racy? We > >> do check for mem_section::page_ext and bail out early from lookup_page_ext(), > >> no? > >> > >> Also to make this scheme explicit, we can annotate page_ext member with __rcu > >> and use rcu_assign_pointer() on the writer side. > > Annotating with __rcu requires all the read and writes to ms->page_ext > to be under rcu_[access|assign]_pointer which is a big patch. I think > READ_ONCE and WRITE_ONCE, mentioned by Michal, below should does the job. > > >> > >> struct page_ext *lookup_page_ext(const struct page *page) > >> { > >> unsigned long pfn = page_to_pfn(page); > >> struct mem_section *section = __pfn_to_section(pfn); > >> /* > >> * The sanity checks the page allocator does upon freeing a > >> * page can reach here before the page_ext arrays are > >> * allocated when feeding a range of pages to the allocator > >> * for the first time during bootup or memory hotplug. > >> */ > >> if (!section->page_ext) > >> return NULL; > >> return get_entry(section->page_ext, pfn); > >> } > > You are right. I was looking at the wrong implementation and misread > > ifdef vs. ifndef CONFIG_SPARSEMEM. My bad. > > > > There is still a small race window b/n ms->page_ext setting NULL and its > access even under CONFIG_SPARSEMEM. In the above mentioned example: > > CPU1 CPU2 > rcu_read_lock() > lookup_page_ext(PageA): offlining > offline_page_ext > __free_page_ext(addrA) > get_entry(addrA) > if (!section->page_ext) > turns to be false. > ms->page_ext = NULL > > addrA = get_entry(base=section->page_ext): > base + page_ext_size * index; > **Since base is NULL here, caller > can still do the dereference on > the invalid pointer address.** only if the value is re-fetched. Not likely but definitely better to have it covered. That is why I was suggesting READ_ONCE/WRITE_ONCE for this operation. > > synchronize_rcu() > free_page_ext > free_pages_exact (now ) > > > > Memory hotplug is not supported outside of CONFIG_SPARSEMEM so the > > scheme should really work. I would use READ_ONCE for ms->page_ext and > > WRITE_ONCE on the initialization side. > > Yes, I should be using the READ_ONCE() and WRITE_ONCE() here. yes.
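[Editorial note: concretely, the READ_ONCE()/WRITE_ONCE() pairing agreed on here might look like the sketch below. The point is that the reader fetches ms->page_ext exactly once, so a compiler-generated refetch can never observe the NULL store after the check has already passed.]

/* Reader: CONFIG_SPARSEMEM flavour of lookup_page_ext(). */
struct page_ext *lookup_page_ext(const struct page *page)
{
	unsigned long pfn = page_to_pfn(page);
	struct mem_section *section = __pfn_to_section(pfn);
	struct page_ext *page_ext = READ_ONCE(section->page_ext);

	/* Single load: the NULL check and get_entry() use the same value. */
	if (!page_ext)
		return NULL;

	return get_entry(page_ext, pfn);
}

/* Writer: offline path publishes NULL before waiting out the readers. */
static void __free_page_ext(unsigned long pfn)
{
	struct mem_section *ms = __pfn_to_section(pfn);
	struct page_ext *base;

	if (!ms || !ms->page_ext)
		return;

	base = get_entry(ms->page_ext, pfn);
	WRITE_ONCE(ms->page_ext, NULL);
	synchronize_rcu();	/* every rcu_read_lock() holder is done */
	free_page_ext(base);
}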
Thanks Michal here!! On 7/19/2022 9:13 PM, Michal Hocko wrote: >>>>>> diff --git a/mm/page_ext.c b/mm/page_ext.c >>>>>> index 3dc715d..5ccd3ee 100644 >>>>>> --- a/mm/page_ext.c >>>>>> +++ b/mm/page_ext.c >>>>>> @@ -299,8 +299,9 @@ static void __free_page_ext(unsigned long pfn) >>>>>> if (!ms || !ms->page_ext) >>>>>> return; >>>>>> base = get_entry(ms->page_ext, pfn); >>>>>> - free_page_ext(base); >>>>>> ms->page_ext = NULL; >>>>>> + synchronize_rcu(); >>>>>> + free_page_ext(base); >>>>>> } >>>>> So you are imposing the RCU grace period for each page_ext! This can get >>>>> really expensive. Have you tried to measure the effect? >>> I was wrong here! This is for each memory section which is not as >>> terrible as every single page_ext. This can be still quite a lot memory >>> sections in a single memory block (e.g. on ppc memory sections are >>> ridiculously small). >>> >> On the ARM64, I see that the minimum a section size will go is 128MB. I >> think 16MB is the section size on ppc. Any inputs on how frequently >> offline/online operation is being done on this ppc arch? > I have seen several reports where 16MB sections were used on PPC LPARs > with a non trivial size. My usual answer to that is tha this is mostly a > self inflicted injury but I am told that for some reasons I cannot > udnerstand this is not easy to change. So reasonable or not this is not > all that uncommon in PPC land. > > We definitely shouldn't optimize for those setups but we shouldn't make > them suffer even more as well. Besides that it seems that a single > rcu_synchronize per offline operation should be doable. I too feel it is doable, but the code might need to traverse the sections of a memory block twice. 1) with synchronize_rcu() calling for each memory section of a memblock: --------------------------------------------------------------------- for (pfn = start; pfn < end; pfn += PAGES_PER_SECTION) __free_page_ext(pfn): ms->page_ext = NULL synchronize_rcu();// Called on every section. free_page_ext();//free the page_ext. 2) With a single synchronize_rcu() for each offlined block: ------------------------------------------------------- for (pfn = start; pfn < end; pfn += PAGES_PER_SECTION) { __free_page_ext(pfn): ms->page_ext = NULL; } synchronize_rcu(); // call synchronize_rcu for just once for (pfn = start; pfn < end; pfn += PAGES_PER_SECTION) free_page_ext(); // Free the page_ext. Do you have any better code in mind here, please? But since there are only a few sections, the overhead of traversing them will definitely be less than a synchronize_rcu() per memsection. But I really doubt if there will be a real impact from making sync_rcu per section because (as David also mentioned, and you also corrected, I think) the concern here is for ppc, where the min memblock size is 256M with a section size of 16M, and a single offline operation on that block can end up calling 16 sync_rcu() calls. Should we really optimize this case here? If yes, I can go with the approach 2) mentioned above. Sorry if I am really underestimating the problem here. Thanks, Charan
On Wed 20-07-22 20:38:58, Charan Teja Kalla wrote: > Thanks Michal here!! > > On 7/19/2022 9:13 PM, Michal Hocko wrote: > >>>>>> diff --git a/mm/page_ext.c b/mm/page_ext.c > >>>>>> index 3dc715d..5ccd3ee 100644 > >>>>>> --- a/mm/page_ext.c > >>>>>> +++ b/mm/page_ext.c > >>>>>> @@ -299,8 +299,9 @@ static void __free_page_ext(unsigned long pfn) > >>>>>> if (!ms || !ms->page_ext) > >>>>>> return; > >>>>>> base = get_entry(ms->page_ext, pfn); > >>>>>> - free_page_ext(base); > >>>>>> ms->page_ext = NULL; > >>>>>> + synchronize_rcu(); > >>>>>> + free_page_ext(base); > >>>>>> } > >>>>> So you are imposing the RCU grace period for each page_ext! This can get > >>>>> really expensive. Have you tried to measure the effect? > >>> I was wrong here! This is for each memory section which is not as > >>> terrible as every single page_ext. This can be still quite a lot memory > >>> sections in a single memory block (e.g. on ppc memory sections are > >>> ridiculously small). > >>> > >> On the ARM64, I see that the minimum a section size will go is 128MB. I > >> think 16MB is the section size on ppc. Any inputs on how frequently > >> offline/online operation is being done on this ppc arch? > > I have seen several reports where 16MB sections were used on PPC LPARs > > with a non trivial size. My usual answer to that is tha this is mostly a > > self inflicted injury but I am told that for some reasons I cannot > > udnerstand this is not easy to change. So reasonable or not this is not > > all that uncommon in PPC land. > > > > We definitely shouldn't optimize for those setups but we shouldn't make > > them suffer even more as well. Besides that it seems that a single > > rcu_synchronize per offline operation should be doable. > > I too feel it is doable but the code might look to need to traverse the > sections of a memory block twice. yes, this is to be expected unless you really want to optimize that further by some global flag, which is probably not worth it. Traversing the sequence of sections is not really expensive compared to doing an expensive operation on each iteration. > 1) with synchronize_rcu() calling for each memory section of a memblock: > --------------------------------------------------------------------- > for (pfn = start; pfn < end; pfn += PAGES_PER_SECTION) > __free_page_ext(pfn): > ms->page_ext = NULL > synchronize_rcu();// Called on every section. > free_page_ext();//free the page_ext. > > 2) With a single synchronize_rcu() for each offlined block: > ------------------------------------------------------- > for (pfn = start; pfn < end; pfn += PAGES_PER_SECTION) { > __free_page_ext(pfn): > ms->page_ext = NULL; you want WRITE_ONCE here > } > synchronize_rcu(); // call synchronize_rcu for just once > for (pfn = start; pfn < end; pfn += PAGES_PER_SECTION) > free_page_ext(); // Free the page_ext. Yes this is what I really meant but the thing is that you need the pointer to know what to free. One way to handle that would be to not clear the pointer but make it visibly invalid for later checks. E.g. set the lowest bit and check for it rather than for NULL. > Any better code you have in mind here please? > > But since there are few sections the overhead of traversing them will be > definitely less compared to synchronize_rcu() per memsection. 
> > But I really doubt if there will be a real impact making sync_rcu per > section because,(as david also mentioned and you also corrected it I > think), the concern here is for ppc where the min memblock size is 256M > with section size of 16M and there is a single offline operation on that > block but can end up in calling 16 sync_rcu() calls. Should we really > optimize this case here? If yes, I can go with the approach 2) mentioned > above. Sorry if I am really undermining the problem here. I would prefer a single rcu barrier solution, unless this turns into monstrously complicated stuff, which I hope we can avoid completely by simply not having the separate lifetime at all... Making any assumptions on the number of sections that are handled here is just a ticket to issues later. Let's not put land mines ahead of us, please.
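[Editorial note: putting Michal's two suggestions together (a single grace period per offline operation, plus a low-bit "invalid" marker so the pointer value survives for the freeing pass), the offline path might end up roughly like the sketch below. PAGE_EXT_INVALID and the helper names are assumptions for illustration, not the posted code.]

#define PAGE_EXT_INVALID	(0x1)

/* Readers treat an invalid pointer exactly like a NULL one. */
static bool page_ext_invalid(struct page_ext *page_ext)
{
	return !page_ext || ((unsigned long)page_ext & PAGE_EXT_INVALID);
}

/* Phase 1: make the section's page_ext visibly invalid for readers,
 * but keep the pointer value around so it can still be freed later. */
static void __invalidate_page_ext(unsigned long pfn)
{
	struct mem_section *ms = __pfn_to_section(pfn);
	void *val;

	if (!ms || !ms->page_ext)
		return;

	val = (void *)((unsigned long)ms->page_ext | PAGE_EXT_INVALID);
	WRITE_ONCE(ms->page_ext, val);
}

static void offline_page_ext(unsigned long start_pfn, unsigned long nr_pages)
{
	unsigned long start, end, pfn;

	start = SECTION_ALIGN_DOWN(start_pfn);
	end = SECTION_ALIGN_UP(start_pfn + nr_pages);

	for (pfn = start; pfn < end; pfn += PAGES_PER_SECTION)
		__invalidate_page_ext(pfn);

	/* Phase 2: one grace period for the whole range, then free. */
	synchronize_rcu();

	for (pfn = start; pfn < end; pfn += PAGES_PER_SECTION)
		__free_page_ext(pfn);
}

[__free_page_ext() would then mask PAGE_EXT_INVALID off again before handing the address to free_page_ext(), and the reader side bails out with something like: if (page_ext_invalid(READ_ONCE(ms->page_ext))) return NULL;]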
diff --git a/include/linux/page_ext.h b/include/linux/page_ext.h
index fabb2e1..df5d353 100644
--- a/include/linux/page_ext.h
+++ b/include/linux/page_ext.h
@@ -64,6 +64,25 @@ static inline struct page_ext *page_ext_next(struct page_ext *curr)
 	return next;
 }
 
+static inline struct page_ext *get_page_ext(struct page *page)
+{
+	struct page_ext *page_ext;
+
+	rcu_read_lock();
+	page_ext = lookup_page_ext(page);
+	if (!page_ext) {
+		rcu_read_unlock();
+		return NULL;
+	}
+
+	return page_ext;
+}
+
+static inline void put_page_ext(void)
+{
+	rcu_read_unlock();
+}
+
 #else /* !CONFIG_PAGE_EXTENSION */
 
 struct page_ext;
diff --git a/include/linux/page_idle.h b/include/linux/page_idle.h
index 4663dfe..0d80f78 100644
--- a/include/linux/page_idle.h
+++ b/include/linux/page_idle.h
@@ -13,65 +13,85 @@
  * If there is not enough space to store Idle and Young bits in page flags, use
  * page ext flags instead.
  */
-
 static inline bool folio_test_young(struct folio *folio)
 {
-	struct page_ext *page_ext = lookup_page_ext(&folio->page);
+	struct page_ext *page_ext;
+	bool page_young;
 
+	page_ext = get_page_ext(&folio->page);
 	if (unlikely(!page_ext))
 		return false;
 
-	return test_bit(PAGE_EXT_YOUNG, &page_ext->flags);
+	page_young = test_bit(PAGE_EXT_YOUNG, &page_ext->flags);
+	put_page_ext();
+
+	return page_young;
 }
 
 static inline void folio_set_young(struct folio *folio)
 {
-	struct page_ext *page_ext = lookup_page_ext(&folio->page);
+	struct page_ext *page_ext;
 
+	page_ext = get_page_ext(&folio->page);
 	if (unlikely(!page_ext))
 		return;
 
 	set_bit(PAGE_EXT_YOUNG, &page_ext->flags);
+	put_page_ext();
 }
 
 static inline bool folio_test_clear_young(struct folio *folio)
 {
-	struct page_ext *page_ext = lookup_page_ext(&folio->page);
+	struct page_ext *page_ext;
+	bool page_young;
 
+	page_ext = get_page_ext(&folio->page);
 	if (unlikely(!page_ext))
 		return false;
 
-	return test_and_clear_bit(PAGE_EXT_YOUNG, &page_ext->flags);
+	page_young = test_and_clear_bit(PAGE_EXT_YOUNG, &page_ext->flags);
+	put_page_ext();
+
+	return page_young;
 }
 
 static inline bool folio_test_idle(struct folio *folio)
 {
-	struct page_ext *page_ext = lookup_page_ext(&folio->page);
+	struct page_ext *page_ext;
+	bool page_idle;
 
+	page_ext = get_page_ext(&folio->page);
 	if (unlikely(!page_ext))
 		return false;
 
-	return test_bit(PAGE_EXT_IDLE, &page_ext->flags);
+	page_idle = test_bit(PAGE_EXT_IDLE, &page_ext->flags);
+	put_page_ext();
+
+	return page_idle;
 }
 
 static inline void folio_set_idle(struct folio *folio)
 {
-	struct page_ext *page_ext = lookup_page_ext(&folio->page);
+	struct page_ext *page_ext;
 
+	page_ext = get_page_ext(&folio->page);
 	if (unlikely(!page_ext))
 		return;
 
 	set_bit(PAGE_EXT_IDLE, &page_ext->flags);
+	put_page_ext();
 }
 
 static inline void folio_clear_idle(struct folio *folio)
 {
-	struct page_ext *page_ext = lookup_page_ext(&folio->page);
+	struct page_ext *page_ext;
 
+	page_ext = get_page_ext(&folio->page);
 	if (unlikely(!page_ext))
 		return;
 
 	clear_bit(PAGE_EXT_IDLE, &page_ext->flags);
+	put_page_ext();
 }
 #endif /* !CONFIG_64BIT */
diff --git a/mm/page_ext.c b/mm/page_ext.c
index 3dc715d..5ccd3ee 100644
--- a/mm/page_ext.c
+++ b/mm/page_ext.c
@@ -299,8 +299,9 @@ static void __free_page_ext(unsigned long pfn)
 	if (!ms || !ms->page_ext)
 		return;
 	base = get_entry(ms->page_ext, pfn);
-	free_page_ext(base);
 	ms->page_ext = NULL;
+	synchronize_rcu();
+	free_page_ext(base);
 }
 
 static int __meminit online_page_ext(unsigned long start_pfn,
diff --git a/mm/page_owner.c b/mm/page_owner.c
index e4c6f3f..d4b9417 100644
--- a/mm/page_owner.c
+++ b/mm/page_owner.c
@@ -195,14 +195,16 @@ noinline void __set_page_owner(struct page *page, unsigned short order,
 
 void __set_page_owner_migrate_reason(struct page *page, int reason)
 {
-	struct page_ext *page_ext = lookup_page_ext(page);
+	struct page_ext *page_ext;
 	struct page_owner *page_owner;
 
+	page_ext = get_page_ext(page);
 	if (unlikely(!page_ext))
 		return;
 
 	page_owner = get_page_owner(page_ext);
 	page_owner->last_migrate_reason = reason;
+	put_page_ext();
 }
 
 void __split_page_owner(struct page *page, unsigned int nr)
@@ -288,7 +290,10 @@ void pagetypeinfo_showmixedcount_print(struct seq_file *m,
 
 		pageblock_mt = get_pageblock_migratetype(page);
 
+		rcu_read_lock();
 		for (; pfn < block_end_pfn; pfn++) {
+			if (need_resched())
+				cond_resched_rcu();
 			/* The pageblock is online, no need to recheck. */
 			page = pfn_to_page(pfn);
 
@@ -327,6 +332,7 @@ void pagetypeinfo_showmixedcount_print(struct seq_file *m,
 			}
 			pfn += (1UL << page_owner->order) - 1;
 		}
+		rcu_read_unlock();
 	}
 
 	/* Print counts */
@@ -492,6 +498,7 @@ read_page_owner(struct file *file, char __user *buf, size_t count, loff_t *ppos)
 	struct page_ext *page_ext;
 	struct page_owner *page_owner;
 	depot_stack_handle_t handle;
+	ssize_t ret = 0;
 
 	if (!static_branch_unlikely(&page_owner_inited))
 		return -EINVAL;
@@ -506,7 +513,10 @@ read_page_owner(struct file *file, char __user *buf, size_t count, loff_t *ppos)
 	drain_all_pages(NULL);
 
 	/* Find an allocated page */
+	rcu_read_lock();
 	for (; pfn < max_pfn; pfn++) {
+		if (need_resched())
+			cond_resched_rcu();
 		/*
 		 * If the new page is in a new MAX_ORDER_NR_PAGES area,
 		 * validate the area as existing, skip it if not
@@ -563,11 +573,13 @@ read_page_owner(struct file *file, char __user *buf, size_t count, loff_t *ppos)
 
 		/* Record the next PFN to read in the file offset */
 		*ppos = (pfn - min_low_pfn) + 1;
-		return print_page_owner(buf, count, pfn, page,
+		ret = print_page_owner(buf, count, pfn, page,
 				page_owner, handle);
+		break;
 	}
+	rcu_read_unlock();
 
-	return 0;
+	return ret;
 }
 
 static void init_pages_in_zone(pg_data_t *pgdat, struct zone *zone)
diff --git a/mm/page_table_check.c b/mm/page_table_check.c
index e206274..66fd49a 100644
--- a/mm/page_table_check.c
+++ b/mm/page_table_check.c
@@ -68,7 +68,7 @@ static void page_table_check_clear(struct mm_struct *mm, unsigned long addr,
 		return;
 
 	page = pfn_to_page(pfn);
-	page_ext = lookup_page_ext(page);
+	page_ext = get_page_ext(page);
 	anon = PageAnon(page);
 
 	for (i = 0; i < pgcnt; i++) {
@@ -83,6 +83,7 @@ static void page_table_check_clear(struct mm_struct *mm, unsigned long addr,
 		}
 		page_ext = page_ext_next(page_ext);
 	}
+	put_page_ext();
 }
 
 /*
@@ -103,7 +104,7 @@ static void page_table_check_set(struct mm_struct *mm, unsigned long addr,
 		return;
 
 	page = pfn_to_page(pfn);
-	page_ext = lookup_page_ext(page);
+	page_ext = get_page_ext(page);
 	anon = PageAnon(page);
 
 	for (i = 0; i < pgcnt; i++) {
@@ -118,6 +119,7 @@ static void page_table_check_set(struct mm_struct *mm, unsigned long addr,
 		}
 		page_ext = page_ext_next(page_ext);
 	}
+	put_page_ext();
 }
 
 /*
@@ -126,9 +128,10 @@ static void page_table_check_set(struct mm_struct *mm, unsigned long addr,
  */
 void __page_table_check_zero(struct page *page, unsigned int order)
 {
-	struct page_ext *page_ext = lookup_page_ext(page);
+	struct page_ext *page_ext;
 	unsigned long i;
 
+	page_ext = get_page_ext(page);
 	BUG_ON(!page_ext);
 
 	for (i = 0; i < (1ul << order); i++) {
 		struct page_table_check *ptc = get_page_table_check(page_ext);
@@ -137,6 +140,7 @@ void __page_table_check_zero(struct page *page, unsigned int order)
 		BUG_ON(atomic_read(&ptc->file_map_count));
 		page_ext = page_ext_next(page_ext);
 	}
+	put_page_ext();
 }
 
 void __page_table_check_pte_clear(struct mm_struct *mm, unsigned long addr,
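Condensed from the page_idle.h hunks above, the read-side pattern this
patch establishes looks as follows; this simply restates the patched
folio_test_young() with explanatory comments added, using the
get_page_ext()/put_page_ext() helpers introduced in page_ext.h:

static inline bool folio_test_young(struct folio *folio)
{
	struct page_ext *page_ext;
	bool page_young;

	/*
	 * get_page_ext() takes rcu_read_lock() and returns NULL if the
	 * section has no page_ext (never allocated, or already offlined).
	 */
	page_ext = get_page_ext(&folio->page);
	if (unlikely(!page_ext))
		return false;

	/*
	 * Safe window: __free_page_ext()'s synchronize_rcu() cannot
	 * complete while this read-side critical section is open, so
	 * page_ext cannot be vfree'd under us.
	 */
	page_young = test_bit(PAGE_EXT_YOUNG, &page_ext->flags);

	put_page_ext();		/* drops rcu_read_lock() */

	return page_young;
}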
The below is one path where a race between page_ext access and offline
of the respective memory blocks will cause a use-after-free on the
access of the page_ext structure.

process1                              process2
---------                             ---------
a) doing /proc/page_owner             doing memory offline
                                      through offline_pages.

b) PageBuddy check fails, thus
   proceed to get the page_owner
   information through page_ext
   access.
   page_ext = lookup_page_ext(page);

                                      migrate_pages();
                                      .................
                                      Since all pages are successfully
                                      migrated as part of the offline
                                      operation, send the MEM_OFFLINE
                                      notification, where for page_ext
                                      it calls:
                                      offline_page_ext()-->
                                      __free_page_ext()-->
                                      free_page_ext()-->
                                      vfree(ms->page_ext)
                                      mem_section->page_ext = NULL

c) The check for the PAGE_EXT flags
   in the page_ext->flags access
   results in a use-after-free
   (leading to translation faults).

As mentioned above, there is really no synchronization between page_ext
access and its freeing during memory offline.

The memory offline steps (roughly) on a memory block are as below:

1) Isolate all the pages.

2) while(1), try to free the pages to buddy (->free_list[MIGRATE_ISOLATE]).

3) Delete the pages from this buddy list.

4) Then free page_ext. (Note: the struct page is still alive, as it is
   freed only during hot remove of the memory, which frees the memmap, a
   step the user might not perform.)

This design leads to a state where the struct page is alive but the
struct page_ext is freed, although the latter is ideally part of the
former, just representing extra page flags. This seems to be a wrong
design, where 'struct page' as a whole is not accessible (thanks to
Minchan for pointing this out).

The above mentioned race is just one example, __but the problem persists
in the other paths too involving page_ext->flags access (e.g.
page_is_idle())__.

Note that offline waits till the last reference on the page goes down,
i.e. any path that took a refcount on the page can make the memory
offline operation wait. E.g., in the migrate_pages() operation, we take
an extra refcount on the pages that are under migration and then copy
the page_owner by accessing page_ext.

Fix those paths where offline races with page_ext access by maintaining
synchronization with the RCU lock.

Thanks to David Hildenbrand for his views/suggestions on the initial
discussion[1].

[1] https://lore.kernel.org/linux-mm/59edde13-4167-8550-86f0-11fc67882107@quicinc.com/

Signed-off-by: Charan Teja Kalla <quic_charante@quicinc.com>
---
 include/linux/page_ext.h  | 19 +++++++++++++++++++
 include/linux/page_idle.h | 40 ++++++++++++++++++++++++++++++----------
 mm/page_ext.c             |  3 ++-
 mm/page_owner.c           | 18 +++++++++++++++---
 mm/page_table_check.c     | 10 +++++++---
 5 files changed, 73 insertions(+), 17 deletions(-)
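For reference, the write-side ordering that the read-side pattern pairs
with, condensed from the mm/page_ext.c hunk above with comments added
(surrounding code elided):

	base = get_entry(ms->page_ext, pfn);
	ms->page_ext = NULL;	/* new lookups from here on return NULL */
	synchronize_rcu();	/* wait out every reader that may have
				 * fetched the old pointer under
				 * rcu_read_lock() */
	free_page_ext(base);	/* no reader can still be using 'base' */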