
mm: fix use-after-free of page_ext after race with memory-offline

Message ID 1657810063-28938-1-git-send-email-quic_charante@quicinc.com (mailing list archive)
State New
Series: mm: fix use-after-free of page_ext after race with memory-offline

Commit Message

Charan Teja Kalla July 14, 2022, 2:47 p.m. UTC
The below is one path where a race between page_ext access and offlining
of the respective memory block causes a use-after-free on the access of
the page_ext structure.

process1		              process2
---------                             ---------
a)doing /proc/page_owner           doing memory offline
			           through offline_pages.

b) PageBuddy check fails,
thus proceed to get the
page_owner information
through page_ext access.
page_ext = lookup_page_ext(page);

				    migrate_pages();
				    .................
				Since all pages are successfully
				migrated as part of the offline
				operation,send MEM_OFFLINE notification
				where for page_ext it calls:
				offline_page_ext()-->
				__free_page_ext()-->
				   free_page_ext()-->
				     vfree(ms->page_ext)
			           mem_section->page_ext = NULL

c) Check for the PAGE_EXT flags
in the page_ext->flags access
results in a use-after-free (leading
to translation faults).

As mentioned above, there is really no synchronization between page_ext
access and its freeing in the memory offline path.

The memory offline steps (roughly) on a memory block are as below:
1) Isolate all the pages.
2) while(1)
  try to free the pages to buddy (->free_list[MIGRATE_ISOLATE]).
3) Delete the pages from this buddy list.
4) Then free page_ext. (Note: the struct page is still alive as it is
freed only during hot remove of the memory, which frees the memmap, a
step the user might not perform.)

This design leads to a state where struct page is alive but struct
page_ext is freed, where the latter is ideally part of the former,
just representing extended page flags. This seems to be a wrong design
where 'struct page' as a whole is not accessible (thanks to Minchan for
pointing this out).

The above mentioned race is just one example __but the problem persists
in the other paths too involving page_ext->flags access (eg:
page_is_idle())__. Memory offline waits till the last reference on the
page goes down, i.e. any path that took a refcount on the page can make
the memory offline operation wait. Eg: in the migrate_pages() operation,
we take an extra refcount on the pages that are under migration and then
copy page_owner by accessing page_ext, so that path is not affected.

Fix those paths where offline races with page_ext access by
synchronizing them with RCU.

Thanks to David Hildenbrand for his views/suggestions on the initial
discussion[1].

[1] https://lore.kernel.org/linux-mm/59edde13-4167-8550-86f0-11fc67882107@quicinc.com/

Signed-off-by: Charan Teja Kalla <quic_charante@quicinc.com>
---
 include/linux/page_ext.h  | 19 +++++++++++++++++++
 include/linux/page_idle.h | 40 ++++++++++++++++++++++++++++++----------
 mm/page_ext.c             |  3 ++-
 mm/page_owner.c           | 18 +++++++++++++++---
 mm/page_table_check.c     | 10 +++++++---
 5 files changed, 73 insertions(+), 17 deletions(-)
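
For reference, the reader-side pattern the fix introduces looks roughly as
below (a sketch only, based on the hunks quoted in the replies; the exact
call sites are in the patch itself):

        /* e.g. in a page_owner/page_idle read path */
        page_ext = get_page_ext(page); /* takes rcu_read_lock() internally */
        if (!page_ext)
                return;
        /* ... access page_ext->flags and the per-page metadata ... */
        put_page_ext();                /* drops rcu_read_lock() */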

Comments

Andrew Morton July 15, 2022, 1:04 a.m. UTC | #1
On Thu, 14 Jul 2022 20:17:43 +0530 Charan Teja Kalla <quic_charante@quicinc.com> wrote:

> The below is one path where race between page_ext and  offline of the
> respective memory blocks will cause use-after-free on the access of
> page_ext structure.
> 
> ...
>
> --- a/include/linux/page_ext.h
> +++ b/include/linux/page_ext.h
> @@ -64,6 +64,25 @@ static inline struct page_ext *page_ext_next(struct page_ext *curr)
>  	return next;
>  }
>  
> +static inline struct page_ext *get_page_ext(struct page *page)
> +{
> +	struct page_ext *page_ext;
> +
> +	rcu_read_lock();

If page_ext.h is to call rcu functions then it will need to include
appropriate header files.

> +	page_ext = lookup_page_ext(page);
> +	if (!page_ext) {
> +		rcu_read_unlock();
> +		return NULL;
> +	}
> +
> +	return page_ext;
> +}
> +
> +static inline void put_page_ext(void)
> +{
> +	rcu_read_unlock();
> +}

Better names would be page_ext_get() and page_ext_put().  The rest of
the page_ext API appears to have got this right, so let's not mess that
up.

Also, these aren't really get and put functions - page_ext doesn't have
a refcount.  But I can't immediately think of a name that better
communicates what we're doing here so I guess get and put will do.

And are we able to add some comments here explaining why these
functions exist and what they do?

>  #else /* !CONFIG_PAGE_EXTENSION */
>  struct page_ext;

Are you sure we didn't need CONFIG_PAGE_EXTENSION=n stubs for these two
functions?
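
For reference, stubs along the lines Andrew asks about might look as below
(a sketch only, not part of the posted patch; it follows the get/put naming
used in this version):

/*
 * !CONFIG_PAGE_EXTENSION: page_ext is compiled out, so there is nothing
 * to look up and nothing to pin with RCU.
 */
static inline struct page_ext *get_page_ext(struct page *page)
{
        return NULL;
}

static inline void put_page_ext(void)
{
}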
Charan Teja Kalla July 15, 2022, 12:32 p.m. UTC | #2
Thanks Andrew for the review!!

On 7/15/2022 6:34 AM, Andrew Morton wrote:
> On Thu, 14 Jul 2022 20:17:43 +0530 Charan Teja Kalla <quic_charante@quicinc.com> wrote:
> 
>> The below is one path where race between page_ext and  offline of the
>> respective memory blocks will cause use-after-free on the access of
>> page_ext structure.
>>
>> ...
>>
>> --- a/include/linux/page_ext.h
>> +++ b/include/linux/page_ext.h
>> @@ -64,6 +64,25 @@ static inline struct page_ext *page_ext_next(struct page_ext *curr)
>>  	return next;
>>  }
>>  
>> +static inline struct page_ext *get_page_ext(struct page *page)
>> +{
>> +	struct page_ext *page_ext;
>> +
>> +	rcu_read_lock();
> If page_ext.h is to call rcu functions then it will need to include
> appropriate header files.
> 

Will add them!!

>> +	page_ext = lookup_page_ext(page);
>> +	if (!page_ext) {
>> +		rcu_read_unlock();
>> +		return NULL;
>> +	}
>> +
>> +	return page_ext;
>> +}
>> +
>> +static inline void put_page_ext(void)
>> +{
>> +	rcu_read_unlock();
>> +}
> Better names would be page_ext_get() and page_ext_put().  The rest of
> the page_ext API appears to have got this right, so let's not mess that
> up.

I see the naming convention is not consistent in page_ext.c. For a couple
of them I see page_ext_xxx() and for the rest it is xxx_page_ext(). Sure,
I will follow the page_ext_xxx() convention in V2.

> 
> Also, these aren't really get and put functions - page_ext doesn't have
> a refcount.  But I can't immediately think of a name that better
> communicates what we're doing here so I guess get and put will do.
>> And are we able to add some comments here explaining why these
> functions exist and what they do?

Sure, will add them in V2.
> 
>>  #else /* !CONFIG_PAGE_EXTENSION */
>>  struct page_ext;
> Are you sure we didn't need CONFIG_PAGE_EXTENSION=n stubs for these two
> functions?

I think it does need them. Will add them in V2.

> 
>
Pavan Kondeti July 18, 2022, 6:11 a.m. UTC | #3
Hi Charan,

On Thu, Jul 14, 2022 at 08:17:43PM +0530, Charan Teja Kalla wrote:
> The below is one path where race between page_ext and  offline of the
> respective memory blocks will cause use-after-free on the access of
> page_ext structure.
> 
> process1		              process2
> ---------                             ---------
> a)doing /proc/page_owner           doing memory offline
> 			           through offline_pages.
> 
> b)PageBuddy check is failed
> thus proceed to get the
> page_owner information
> through page_ext access.
> page_ext = lookup_page_ext(page);
> 
> 				    migrate_pages();
> 				    .................
> 				Since all pages are successfully
> 				migrated as part of the offline
> 				operation,send MEM_OFFLINE notification
> 				where for page_ext it calls:
> 				offline_page_ext()-->
> 				__free_page_ext()-->
> 				   free_page_ext()-->
> 				     vfree(ms->page_ext)
> 			           mem_section->page_ext = NULL
> 
> c) Check for the PAGE_EXT flags
> in the page_ext->flags access
> results into the use-after-free(leading
> to the translation faults).
> 
> As mentioned above, there is really no synchronization between page_ext
> access and its freeing in the memory_offline.
> 
> The memory offline steps(roughly) on a memory block is as below:
> 1) Isolate all the pages
> 2) while(1)
>   try free the pages to buddy.(->free_list[MIGRATE_ISOLATE])
> 3) delete the pages from this buddy list.
> 4) Then free page_ext.(Note: The struct page is still alive as it is
> freed only during hot remove of the memory which frees the memmap, which
> steps the user might not perform).
> 
> This design leads to the state where struct page is alive but the struct
> page_ext is freed, where the later is ideally part of the former which
> just representing the page_flags. This seems to be a wrong design where
> 'struct page' as a whole is not accessible(Thanks to Minchan for
> pointing this out).
> 
> The above mentioned race is just one example __but the problem persists
> in the other paths too involving page_ext->flags access(eg:
> page_is_idle())__. Since offline waits till the last reference on the
> page goes down i.e. any path that took the refcount on the page can make
> the memory offline operation to wait. Eg: In the migrate_pages()
> operation, we do take the extra refcount on the pages that are under
> migration and then we do copy page_owner by accessing page_ext. For
> 
> Fix those paths where offline races with page_ext access by maintaining
> synchronization with rcu lock.
> 
> Thanks to David Hildenbrand for his views/suggestions on the initial
> discussion[1].
> 
> [1] https://lore.kernel.org/linux-mm/59edde13-4167-8550-86f0-11fc67882107@quicinc.com/
> 
> Signed-off-by: Charan Teja Kalla <quic_charante@quicinc.com>
> ---
>  include/linux/page_ext.h  | 19 +++++++++++++++++++
>  include/linux/page_idle.h | 40 ++++++++++++++++++++++++++++++----------
>  mm/page_ext.c             |  3 ++-
>  mm/page_owner.c           | 18 +++++++++++++++---
>  mm/page_table_check.c     | 10 +++++++---
>  5 files changed, 73 insertions(+), 17 deletions(-)
> 
> diff --git a/include/linux/page_ext.h b/include/linux/page_ext.h
> index fabb2e1..df5d353 100644
> --- a/include/linux/page_ext.h
> +++ b/include/linux/page_ext.h
> @@ -64,6 +64,25 @@ static inline struct page_ext *page_ext_next(struct page_ext *curr)
>  	return next;
>  }
>  
> +static inline struct page_ext *get_page_ext(struct page *page)
> +{
> +	struct page_ext *page_ext;
> +
> +	rcu_read_lock();
> +	page_ext = lookup_page_ext(page);
> +	if (!page_ext) {
> +		rcu_read_unlock();
> +		return NULL;
> +	}
> +
> +	return page_ext;
> +}
> +
> +static inline void put_page_ext(void)
> +{
> +	rcu_read_unlock();
> +}
> +
Would it be a harm if we make lookup_page_ext() completely a private function?
Or is there any codepath that has the benefit of calling lookup_page_ext()
without going through get_page_ext()? If that is the case, we should add an
RCU lockdep check inside lookup_page_ext() to make sure that this function is
called under RCU.

Thanks,
Pavan
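
A minimal sketch of the lockdep annotation Pavan suggests, assuming
lookup_page_ext() were only reachable under RCU (illustration only, not
part of the posted patch):

struct page_ext *lookup_page_ext(const struct page *page)
{
        unsigned long pfn = page_to_pfn(page);
        struct mem_section *section = __pfn_to_section(pfn);

        /* Complain if a caller reaches here without the RCU read lock. */
        RCU_LOCKDEP_WARN(!rcu_read_lock_held(),
                         "lookup_page_ext() used outside rcu_read_lock()");

        if (!section->page_ext)
                return NULL;
        return get_entry(section->page_ext, pfn);
}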
Michal Hocko July 18, 2022, 11:50 a.m. UTC | #4
On Thu 14-07-22 20:17:43, Charan Teja Kalla wrote:
> The below is one path where race between page_ext and  offline of the
> respective memory blocks will cause use-after-free on the access of
> page_ext structure.
> 
> process1		              process2
> ---------                             ---------
> a)doing /proc/page_owner           doing memory offline
> 			           through offline_pages.
> 
> b)PageBuddy check is failed
> thus proceed to get the
> page_owner information
> through page_ext access.
> page_ext = lookup_page_ext(page);
> 
> 				    migrate_pages();
> 				    .................
> 				Since all pages are successfully
> 				migrated as part of the offline
> 				operation,send MEM_OFFLINE notification
> 				where for page_ext it calls:
> 				offline_page_ext()-->
> 				__free_page_ext()-->
> 				   free_page_ext()-->
> 				     vfree(ms->page_ext)
> 			           mem_section->page_ext = NULL
> 
> c) Check for the PAGE_EXT flags
> in the page_ext->flags access
> results into the use-after-free(leading
> to the translation faults).
> 
> As mentioned above, there is really no synchronization between page_ext
> access and its freeing in the memory_offline.
> 
> The memory offline steps(roughly) on a memory block is as below:
> 1) Isolate all the pages
> 2) while(1)
>   try free the pages to buddy.(->free_list[MIGRATE_ISOLATE])
> 3) delete the pages from this buddy list.
> 4) Then free page_ext.(Note: The struct page is still alive as it is
> freed only during hot remove of the memory which frees the memmap, which
> steps the user might not perform).
> 
> This design leads to the state where struct page is alive but the struct
> page_ext is freed, where the later is ideally part of the former which
> just representing the page_flags. This seems to be a wrong design where
> 'struct page' as a whole is not accessible(Thanks to Minchan for
> pointing this out).

Nice description of the problem! Thanks!

> The above mentioned race is just one example __but the problem persists
> in the other paths too involving page_ext->flags access(eg:
> page_is_idle())__. Since offline waits till the last reference on the
> page goes down i.e. any path that took the refcount on the page can make
> the memory offline operation to wait. Eg: In the migrate_pages()
> operation, we do take the extra refcount on the pages that are under
> migration and then we do copy page_owner by accessing page_ext. For
> 
> Fix those paths where offline races with page_ext access by maintaining
> synchronization with rcu lock.

Please be much more specific about the synchronization. How does RCU
actually synchronize the offlining and access? Higher level description
of all the actors would be very helpful not only for the review but also
for future readers.

Also, more specifically
[...]
> diff --git a/mm/page_ext.c b/mm/page_ext.c
> index 3dc715d..5ccd3ee 100644
> --- a/mm/page_ext.c
> +++ b/mm/page_ext.c
> @@ -299,8 +299,9 @@ static void __free_page_ext(unsigned long pfn)
>  	if (!ms || !ms->page_ext)
>  		return;
>  	base = get_entry(ms->page_ext, pfn);
> -	free_page_ext(base);
>  	ms->page_ext = NULL;
> +	synchronize_rcu();
> +	free_page_ext(base);
>  }

So you are imposing the RCU grace period for each page_ext! This can get
really expensive. Have you tried to measure the effect?

Is there any reason why page_ext is freed during offlining rather than
when it is hotremoved?

Thanks!
Charan Teja Kalla July 18, 2022, 1:15 p.m. UTC | #5
Thanks Pavan for the comments!!

On 7/18/2022 11:41 AM, Pavan Kondeti wrote:
>> diff --git a/include/linux/page_ext.h b/include/linux/page_ext.h
>> index fabb2e1..df5d353 100644
>> --- a/include/linux/page_ext.h
>> +++ b/include/linux/page_ext.h
>> @@ -64,6 +64,25 @@ static inline struct page_ext *page_ext_next(struct page_ext *curr)
>>  	return next;
>>  }
>>  
>> +static inline struct page_ext *get_page_ext(struct page *page)
>> +{
>> +	struct page_ext *page_ext;
>> +
>> +	rcu_read_lock();
>> +	page_ext = lookup_page_ext(page);
>> +	if (!page_ext) {
>> +		rcu_read_unlock();
>> +		return NULL;
>> +	}
>> +
>> +	return page_ext;
>> +}
>> +
>> +static inline void put_page_ext(void)
>> +{
>> +	rcu_read_unlock();
>> +}
>> +
> Would it be a harm if we make lookup_page_ext() completely a private function?
> Or is there any codepath that have the benefit of calling lookup_page_ext()
> without going through get_page_ext()? If that is the case, we should add
> RCU lockdep check inside lookup_page_ext() to make sure that this function is
> called with RCUs.

IIUC, the synchronization is really not needed in all the paths that
access the page_ext, thus hiding lookup_page_ext() and forcing the
users to always rely on the get and put functions doesn't seem correct to me.

Some example code paths where you don't need the synchronization while
accessing the page_ext are:
1) In migration (where we also migrate the page_owner information), we
take an extra refcount on the source and destination pages and then
start the migration. This extra refcount makes the test_pages_isolated()
function fail, thus the offline operation is retried.

2) In free_pages_prepare(), we reset the page_owner (through page_ext),
which again doesn't need the protection because the page is already
being freed (through only one path).

Thus I don't see the need for an RCU lockdep check in lookup_page_ext().

If there are any other paths where I am missing protection for the
page_ext access, please let me know.

Thanks,
Charan
Charan Teja Kalla July 18, 2022, 1:58 p.m. UTC | #6
Thanks Michal for the comments!!

On 7/18/2022 5:20 PM, Michal Hocko wrote:
>> The above mentioned race is just one example __but the problem persists
>> in the other paths too involving page_ext->flags access(eg:
>> page_is_idle())__. Since offline waits till the last reference on the
>> page goes down i.e. any path that took the refcount on the page can make
>> the memory offline operation to wait. Eg: In the migrate_pages()
>> operation, we do take the extra refcount on the pages that are under
>> migration and then we do copy page_owner by accessing page_ext. For
>>
>> Fix those paths where offline races with page_ext access by maintaining
>> synchronization with rcu lock.
> Please be much more specific about the synchronization. How does RCU
> actually synchronize the offlining and access? Higher level description
> of all the actors would be very helpful not only for the review but also
> for future readers.

I will improve the commit message about this synchronization change
using RCU.

> 
> Also, more specifically
> [...]
>> diff --git a/mm/page_ext.c b/mm/page_ext.c
>> index 3dc715d..5ccd3ee 100644
>> --- a/mm/page_ext.c
>> +++ b/mm/page_ext.c
>> @@ -299,8 +299,9 @@ static void __free_page_ext(unsigned long pfn)
>>  	if (!ms || !ms->page_ext)
>>  		return;
>>  	base = get_entry(ms->page_ext, pfn);
>> -	free_page_ext(base);
>>  	ms->page_ext = NULL;
>> +	synchronize_rcu();
>> +	free_page_ext(base);
>>  }
> So you are imposing the RCU grace period for each page_ext! This can get
> really expensive. Have you tried to measure the effect?
> 

I didn't really measure the effect. Let me measure it and post the numbers in V2.

> Is there any reason why page_ext is freed during offlining rather when
> it is hotremoved?

This is something I am struggling to get an answer for. IMO, this is even
a wrong design where I have the page but not its page_ext. Moving the
freeing of page_ext to the hotremove path actually solves the problem but
somehow this idea wasn't liked[1]. Copying the excerpt here:

">
> 3) Change the design where the page_ext is valid as long as the struct
> page is alive.

:/ Doesn't spark joy."


@Joonsoo: We see that you did commit eefa864b701d ("mm/page_ext:
resurrect struct page extending code for debugging"). Any reason why
the page_ext is chosen to be freed at the offline operation rather than
at the remove operation of a memory block?


[1]
https://lore.kernel.org/linux-mm/8fefe59d-c893-39f4-3225-65343086c867@redhat.com/
> 
> Thanks!
Michal Hocko July 18, 2022, 2:54 p.m. UTC | #7
On Mon 18-07-22 19:28:13, Charan Teja Kalla wrote:
> Thanks Michal for the comments!!
> 
> On 7/18/2022 5:20 PM, Michal Hocko wrote:
> >> The above mentioned race is just one example __but the problem persists
> >> in the other paths too involving page_ext->flags access(eg:
> >> page_is_idle())__. Since offline waits till the last reference on the
> >> page goes down i.e. any path that took the refcount on the page can make
> >> the memory offline operation to wait. Eg: In the migrate_pages()
> >> operation, we do take the extra refcount on the pages that are under
> >> migration and then we do copy page_owner by accessing page_ext. For
> >>
> >> Fix those paths where offline races with page_ext access by maintaining
> >> synchronization with rcu lock.
> > Please be much more specific about the synchronization. How does RCU
> > actually synchronize the offlining and access? Higher level description
> > of all the actors would be very helpful not only for the review but also
> > for future readers.
> 
> I will improve the commit message about this synchronization change
> using RCU's.

Thanks! The most important part is how the exclusion is actually achieved
because that is not really clear at first sight

CPU1					CPU2
lookup_page_ext(PageA)			offlining
					  offline_page_ext
					    __free_page_ext(addrA)
					      get_entry(addrA)
					      ms->page_ext = NULL
					      synchronize_rcu()
					      free_page_ext
					        free_pages_exact (now addrA is unusable)
					
  rcu_read_lock()
  entryA = get_entry(addrA)
    base + page_ext_size * index # an address not invalidated by the freeing path
  do_something(entryA)
  rcu_read_unlock()

CPU1 never checks ms->page_ext so it cannot bail out early when the
thing is torn down. Or maybe I am missing something. I am not familiar
with page_ext much.

> > Also, more specifically
> > [...]
> >> diff --git a/mm/page_ext.c b/mm/page_ext.c
> >> index 3dc715d..5ccd3ee 100644
> >> --- a/mm/page_ext.c
> >> +++ b/mm/page_ext.c
> >> @@ -299,8 +299,9 @@ static void __free_page_ext(unsigned long pfn)
> >>  	if (!ms || !ms->page_ext)
> >>  		return;
> >>  	base = get_entry(ms->page_ext, pfn);
> >> -	free_page_ext(base);
> >>  	ms->page_ext = NULL;
> >> +	synchronize_rcu();
> >> +	free_page_ext(base);
> >>  }
> > So you are imposing the RCU grace period for each page_ext! This can get
> > really expensive. Have you tried to measure the effect?

I was wrong here! This is for each memory section, which is not as
terrible as for every single page_ext. This can still be quite a lot of
memory sections in a single memory block (e.g. on ppc memory sections are
ridiculously small).

> I didn't really measure the effect. Let me measure it and post these in V2.

I think it would be much more optimal to split the operation into 2
phases: invalidate all the page_ext metadata, then synchronize_rcu, and
only then free them all. I am not very familiar with page_ext so I am
not sure this is easy to do. Maybe page_ext = NULL can be done in
the first stage.

> > Is there any reason why page_ext is freed during offlining rather when
> > it is hotremoved?
> 
> This is something I am struggling to get the answer. IMO, this is even
> wrong design where I don't have page_ext but page. Moving the freeing of
> page_ext to hotremove path actually solves the problem but somehow this
> idea didn't liked[1].  copying the excerpt here:

yes, it certainly adds subtlety to the page_ext thingy. I do agree that
even the situation around struct page is not all that great wrt
synchronization. We have pfn_to_online_page which even when racy doesn't
give you garbage because hotremove happens very rarely, or so long
after offlining that the race window is essentially impractically long
for any potential damage. We would have to change a lot to make it
work "properly". I am not optimistic this is actually feasible.

> > 3) Change the design where the page_ext is valid as long as the struct
> > page is alive.
> 
> :/ Doesn't spark joy."

I would be wondering why. It should only take moving the callback to
happen at hotremove, so it shouldn't be a very involved change. I can
imagine somebody would be relying on releasing resources when offlining
memory, but is that really the case?
Charan Teja Kalla July 19, 2022, 3:12 p.m. UTC | #8
Thanks Michal!!

On 7/18/2022 8:24 PM, Michal Hocko wrote:
>>>> The above mentioned race is just one example __but the problem persists
>>>> in the other paths too involving page_ext->flags access(eg:
>>>> page_is_idle())__. Since offline waits till the last reference on the
>>>> page goes down i.e. any path that took the refcount on the page can make
>>>> the memory offline operation to wait. Eg: In the migrate_pages()
>>>> operation, we do take the extra refcount on the pages that are under
>>>> migration and then we do copy page_owner by accessing page_ext. For
>>>>
>>>> Fix those paths where offline races with page_ext access by maintaining
>>>> synchronization with rcu lock.
>>> Please be much more specific about the synchronization. How does RCU
>>> actually synchronize the offlining and access? Higher level description
>>> of all the actors would be very helpful not only for the review but also
>>> for future readers.
>> I will improve the commit message about this synchronization change
>> using RCU's.
> Thanks! The most imporant part is how the exclusion is actual achieved
> because that is not really clear at first sight
> 
> CPU1					CPU2
> lookup_page_ext(PageA)			offlining
> 					  offline_page_ext
> 					    __free_page_ext(addrA)
> 					      get_entry(addrA)
> 					      ms->page_ext = NULL
> 					      synchronize_rcu()
> 					      free_page_ext
> 					        free_pages_exact (now addrA is unusable)
> 					
>   rcu_read_lock()
>   entryA = get_entry(addrA)
>     base + page_ext_size * index # an address not invalidated by the freeing path
>   do_something(entryA)
>   rcu_read_unlock()
> 
> CPU1 never checks ms->page_ext so it cannot bail out early when the
> thing is torn down. Or maybe I am missing something. I am not familiar
> with page_ext much.


Thanks a lot for catching this Michal. You are correct that the code
proposed by me is still racy. I will correct this along with a proper
commit message in the next version of this patch.

> 
>>> Also, more specifically
>>> [...]
>>>> diff --git a/mm/page_ext.c b/mm/page_ext.c
>>>> index 3dc715d..5ccd3ee 100644
>>>> --- a/mm/page_ext.c
>>>> +++ b/mm/page_ext.c
>>>> @@ -299,8 +299,9 @@ static void __free_page_ext(unsigned long pfn)
>>>>  	if (!ms || !ms->page_ext)
>>>>  		return;
>>>>  	base = get_entry(ms->page_ext, pfn);
>>>> -	free_page_ext(base);
>>>>  	ms->page_ext = NULL;
>>>> +	synchronize_rcu();
>>>> +	free_page_ext(base);
>>>>  }
>>> So you are imposing the RCU grace period for each page_ext! This can get
>>> really expensive. Have you tried to measure the effect?
> I was wrong here! This is for each memory section which is not as
> terrible as every single page_ext. This can be still quite a lot memory
> sections in a single memory block (e.g. on ppc memory sections are
> ridiculously small).
> 

On ARM64, I see that the minimum a section size can go is 128MB. I
think 16MB is the section size on ppc. Any inputs on how frequently the
offline/online operation is done on the ppc arch?


>> I didn't really measure the effect. Let me measure it and post these in V2.
> I think it would be much more optimal to split the operation into 2
> phases. Invalidate all the page_ext metadata then synchronize_rcu and
> only then free them all. I am not very familiar with page_ext so I am
> not sure this is easy to be done. Maybe page_ext = NULL can be done in
> the first stage.
> 

Let me explore If this can be easily done.

>>> 3) Change the design where the page_ext is valid as long as the struct
>>> page is alive.
>> :/ Doesn't spark joy."
> I would be wondering why. It should only take to move the callback to
> happen at hotremove. So it shouldn't be very involved of a change. I can
> imagine somebody would be relying on releasing resources when offlining
> memory but is that really the case?

I don't find any hard need for the clients to release this page_ext
memory.

What I can think of is that the page_ext size is proportional to the
debug features (which is what it is used for on 64bit, as of now) we are
enabling. Eg: enabling page_owner requires an additional 0x30 bytes per
page, memory which is not required when the memory block is offlined. But
then the same should be the case for the memory occupied by struct page
for this offlined block.

One comment from the initial discussion: "It smells like page_ext
should use some mechanism during  MEM_OFFLINE to
synchronize against any users of its metadata. Generic memory offlining
code might be the wrong place for that." -- I think the page_ext
creation and deletion should fit into the sparse code. I will try to
provide the changes tomorrow and if it seems unfit there, I will work
on improving the current patch based on the RCU logic.

Thanks,
Charan
David Hildenbrand July 19, 2022, 3:19 p.m. UTC | #9
On 18.07.22 16:54, Michal Hocko wrote:
> On Mon 18-07-22 19:28:13, Charan Teja Kalla wrote:
>> Thanks Michal for the comments!!
>>
>> On 7/18/2022 5:20 PM, Michal Hocko wrote:
>>>> The above mentioned race is just one example __but the problem persists
>>>> in the other paths too involving page_ext->flags access(eg:
>>>> page_is_idle())__. Since offline waits till the last reference on the
>>>> page goes down i.e. any path that took the refcount on the page can make
>>>> the memory offline operation to wait. Eg: In the migrate_pages()
>>>> operation, we do take the extra refcount on the pages that are under
>>>> migration and then we do copy page_owner by accessing page_ext. For
>>>>
>>>> Fix those paths where offline races with page_ext access by maintaining
>>>> synchronization with rcu lock.
>>> Please be much more specific about the synchronization. How does RCU
>>> actually synchronize the offlining and access? Higher level description
>>> of all the actors would be very helpful not only for the review but also
>>> for future readers.
>>
>> I will improve the commit message about this synchronization change
>> using RCU's.
> 
> Thanks! The most imporant part is how the exclusion is actual achieved
> because that is not really clear at first sight
> 
> CPU1					CPU2
> lookup_page_ext(PageA)			offlining
> 					  offline_page_ext
> 					    __free_page_ext(addrA)
> 					      get_entry(addrA)
> 					      ms->page_ext = NULL
> 					      synchronize_rcu()
> 					      free_page_ext
> 					        free_pages_exact (now addrA is unusable)
> 					
>   rcu_read_lock()
>   entryA = get_entry(addrA)
>     base + page_ext_size * index # an address not invalidated by the freeing path
>   do_something(entryA)
>   rcu_read_unlock()
> 
> CPU1 never checks ms->page_ext so it cannot bail out early when the
> thing is torn down. Or maybe I am missing something. I am not familiar
> with page_ext much.
> 
>>> Also, more specifically
>>> [...]
>>>> diff --git a/mm/page_ext.c b/mm/page_ext.c
>>>> index 3dc715d..5ccd3ee 100644
>>>> --- a/mm/page_ext.c
>>>> +++ b/mm/page_ext.c
>>>> @@ -299,8 +299,9 @@ static void __free_page_ext(unsigned long pfn)
>>>>  	if (!ms || !ms->page_ext)
>>>>  		return;
>>>>  	base = get_entry(ms->page_ext, pfn);
>>>> -	free_page_ext(base);
>>>>  	ms->page_ext = NULL;
>>>> +	synchronize_rcu();
>>>> +	free_page_ext(base);
>>>>  }
>>> So you are imposing the RCU grace period for each page_ext! This can get
>>> really expensive. Have you tried to measure the effect?
> 
> I was wrong here! This is for each memory section which is not as
> terrible as every single page_ext. This can be still quite a lot memory
> sections in a single memory block (e.g. on ppc memory sections are
> ridiculously small).
> 
>> I didn't really measure the effect. Let me measure it and post these in V2.
> 
> I think it would be much more optimal to split the operation into 2
> phases. Invalidate all the page_ext metadata then synchronize_rcu and
> only then free them all. I am not very familiar with page_ext so I am
> not sure this is easy to be done. Maybe page_ext = NULL can be done in
> the first stage.
> 
>>> Is there any reason why page_ext is freed during offlining rather when
>>> it is hotremoved?
>>
>> This is something I am struggling to get the answer. IMO, this is even
>> wrong design where I don't have page_ext but page. Moving the freeing of
>> page_ext to hotremove path actually solves the problem but somehow this
>> idea didn't liked[1].  copying the excerpt here:
> 
> yes, it certainly adds subtlety to the page_ext thingy. I do agree that
> even situation around struct page is not all that great wrt
> synchronization. We have pfn_to_online_page which even when racy doesn't
> give you a garbage because hotremove happens very rarely or so long
> after offlining that the race window is essentially impractically too
> long for any potential damage. We would have to change a lot to make it
> work "properly". I am not optimistic this is actually feasible.
> 
>>> 3) Change the design where the page_ext is valid as long as the struct
>>> page is alive.
>>
>> :/ Doesn't spark joy."
> 
> I would be wondering why. It should only take to move the callback to
> happen at hotremove. So it shouldn't be very involved of a change. I can
> imagine somebody would be relying on releasing resources when offlining
> memory but is that really the case?

Various reasons:

1) There was a discussion in the past to eventually also use rcu
protection for handling pfn_to_online_page(). So doing it cleanly here
is certainly an improvement.

2) I really dislike having to scatter section online checks all over the
place in page ext code. Once there is a difference between active vs.
stale page ext data things get a bit messy and error prone. This is
already ugly enough in our generic memmap handling code IMHO.

3) Having on-demand allocations, such as KASAN or page_ext, from the
memory online notifier is at least currently cleaner, because we don't
have to handle each and every subsystem that hooks into that during the
core memory hotadd/remove phase, which primarily only sets up the
vmemmap, direct map and memory block devices.


Personally, I think what we have in this patch is quite nice and clean.
But I won't object if it can be similarly done in a clean way from
hot(un)plug code.

That is, I ack this patch but don't object to similarly clean approaches.
Michal Hocko July 19, 2022, 3:37 p.m. UTC | #10
On Tue 19-07-22 17:19:34, David Hildenbrand wrote:
> On 18.07.22 16:54, Michal Hocko wrote:
> > On Mon 18-07-22 19:28:13, Charan Teja Kalla wrote:
[...]
> >>> 3) Change the design where the page_ext is valid as long as the struct
> >>> page is alive.
> >>
> >> :/ Doesn't spark joy."
> > 
> > I would be wondering why. It should only take to move the callback to
> > happen at hotremove. So it shouldn't be very involved of a change. I can
> > imagine somebody would be relying on releasing resources when offlining
> > memory but is that really the case?
> 
> Various reasons:
> 
> 1) There was a discussion in the past to eventually also use rcu
> protection for handling pdn_to_online_page(). So doing it cleanly here
> is certainly an improvement.

Call me skeptical on that.

> 2) I really dislike having to scatter section online checks all over the
> place in page ext code. Once there is a difference between active vs.
> stale page ext data things get a bit messy and error prone. This is
> already ugly enough in our generic memmap handling code IMHO.

They should represent a free page in any case so even if they are stale
they shouldn't be really dangerous, right? 

> 3) Having on-demand allocations, such as KASAN or page ext from the
> memory online notifier is at least currently cleaner, because we don't
> have to handle each and every subsystem that hooks into that during the
> core memory hotadd/remove phase, which primarily only setups the
> vmemmap, direct map and memory block devices.

Cannot this hook into __add_pages, which is the real implementation of
the arch-independent way to allocate vmemmap? Or at the sparsemem level,
because we do not (and very likely won't) support memory hotplug on
any other memory model.

> Personally, I think what we have in this patch is quite nice and clean.
> But I won't object if it can be similarly done in a clean way from
> hot(un)plug code.

Well, if the scheme can be done without synchronize_rcu for each section
which can backfire and if the scheme doesn't add too much complexity to
achieve that then sure I won't object. I just do not get why page_ext
should have a different allocation lifetime expectancy than a real page.
Quite confusing if you ask me.
Michal Hocko July 19, 2022, 3:43 p.m. UTC | #11
On Tue 19-07-22 20:42:42, Charan Teja Kalla wrote:
> Thanks Michal!!
> 
> On 7/18/2022 8:24 PM, Michal Hocko wrote:
[...]
> >>>> diff --git a/mm/page_ext.c b/mm/page_ext.c
> >>>> index 3dc715d..5ccd3ee 100644
> >>>> --- a/mm/page_ext.c
> >>>> +++ b/mm/page_ext.c
> >>>> @@ -299,8 +299,9 @@ static void __free_page_ext(unsigned long pfn)
> >>>>  	if (!ms || !ms->page_ext)
> >>>>  		return;
> >>>>  	base = get_entry(ms->page_ext, pfn);
> >>>> -	free_page_ext(base);
> >>>>  	ms->page_ext = NULL;
> >>>> +	synchronize_rcu();
> >>>> +	free_page_ext(base);
> >>>>  }
> >>> So you are imposing the RCU grace period for each page_ext! This can get
> >>> really expensive. Have you tried to measure the effect?
> > I was wrong here! This is for each memory section which is not as
> > terrible as every single page_ext. This can be still quite a lot memory
> > sections in a single memory block (e.g. on ppc memory sections are
> > ridiculously small).
> > 
> 
> On the ARM64, I see that the minimum a section size will go is 128MB. I
> think 16MB is the section size on ppc. Any inputs on how frequently
> offline/online operation is being done on this ppc arch?

I have seen several reports where 16MB sections were used on PPC LPARs
with a non-trivial size. My usual answer to that is that this is mostly a
self-inflicted injury but I am told that for some reasons I cannot
understand this is not easy to change. So reasonable or not this is not
all that uncommon in PPC land.

We definitely shouldn't optimize for those setups but we shouldn't make
them suffer even more as well. Besides that it seems that a single
synchronize_rcu() per offline operation should be doable.
David Hildenbrand July 19, 2022, 3:50 p.m. UTC | #12
> 
>> 2) I really dislike having to scatter section online checks all over the
>> place in page ext code. Once there is a difference between active vs.
>> stale page ext data things get a bit messy and error prone. This is
>> already ugly enough in our generic memmap handling code IMHO.
> 
> They should represent a free page in any case so even they are stall
> they shouldn't be really dangerous, right?
Good question. The use-after-free tells me that there could at least be
something accessing page_ext data after offlining right now. As long as
it's only unsynchronized read access, we should be fine.

> 
>> 3) Having on-demand allocations, such as KASAN or page ext from the
>> memory online notifier is at least currently cleaner, because we don't
>> have to handle each and every subsystem that hooks into that during the
>> core memory hotadd/remove phase, which primarily only setups the
>> vmemmap, direct map and memory block devices.
> 
> Cannot this hook into __add_pages which is the real implementation of
> the arch independent way to allocate vmemmap. Or at the sparsemem level
> because we do not (and very likely won't) support memory hotplug on
> any other memory model.

As __add_pages() is also called from mm/memremap.c where we don't want
that metadata, we'd have to special-case (would need a new parameter I
guess).

> 
>> Personally, I think what we have in this patch is quite nice and clean.
>> But I won't object if it can be similarly done in a clean way from
>> hot(un)plug code.
> 
> Well, if the scheme can be done without synchronize_rcu for each section
> which can backfire and if the scheme doesn't add too much complexity to
> achieve that then sure I won't object. I just do not get why page_ext
> should have a different allocation lifetime expectancy than a real page.
> Quite confusing if you ask me.

In contrast to memmap, people actually test for zero pointers here.

If you ask me, the memmap access is ugly enough and I don't really enjoy
other metadata following that pattern of "stale and suddenly removed".
This seems to be an easy way to do it cleanly.

But yes, if the synchronize_rcu turns out problematic, we'd either have
to optimize or move the allocation/free phase.
David Hildenbrand July 19, 2022, 3:54 p.m. UTC | #13
On 19.07.22 17:43, Michal Hocko wrote:
> On Tue 19-07-22 20:42:42, Charan Teja Kalla wrote:
>> Thanks Michal!!
>>
>> On 7/18/2022 8:24 PM, Michal Hocko wrote:
> [...]
>>>>>> diff --git a/mm/page_ext.c b/mm/page_ext.c
>>>>>> index 3dc715d..5ccd3ee 100644
>>>>>> --- a/mm/page_ext.c
>>>>>> +++ b/mm/page_ext.c
>>>>>> @@ -299,8 +299,9 @@ static void __free_page_ext(unsigned long pfn)
>>>>>>  	if (!ms || !ms->page_ext)
>>>>>>  		return;
>>>>>>  	base = get_entry(ms->page_ext, pfn);
>>>>>> -	free_page_ext(base);
>>>>>>  	ms->page_ext = NULL;
>>>>>> +	synchronize_rcu();
>>>>>> +	free_page_ext(base);
>>>>>>  }
>>>>> So you are imposing the RCU grace period for each page_ext! This can get
>>>>> really expensive. Have you tried to measure the effect?
>>> I was wrong here! This is for each memory section which is not as
>>> terrible as every single page_ext. This can be still quite a lot memory
>>> sections in a single memory block (e.g. on ppc memory sections are
>>> ridiculously small).
>>>
>>
>> On the ARM64, I see that the minimum a section size will go is 128MB. I
>> think 16MB is the section size on ppc. Any inputs on how frequently
>> offline/online operation is being done on this ppc arch?
> 
> I have seen several reports where 16MB sections were used on PPC LPARs
> with a non trivial size. My usual answer to that is tha this is mostly a
> self inflicted injury but I am told that for some reasons I cannot
> udnerstand this is not easy to change. So reasonable or not this is not
> all that uncommon in PPC land.
> 
> We definitely shouldn't optimize for those setups but we shouldn't make
> them suffer even more as well. Besides that it seems that a single
> rcu_synchronize per offline operation should be doable.

IIRC, any reasonable PPC64 installation uses LMB >= 256 MiB, which maps
to the logical memory block size, and we only online/offline complete
memory blocks, not individual memory sections.

So on these installations, you'll see memory getting onlined/offlined
in at least 256MiB granularity.
Pavan Kondeti July 20, 2022, 8:21 a.m. UTC | #14
Hi Charan,

On Tue, Jul 19, 2022 at 08:42:42PM +0530, Charan Teja Kalla wrote:
> Thanks Michal!!
> 
> On 7/18/2022 8:24 PM, Michal Hocko wrote:
> >>>> The above mentioned race is just one example __but the problem persists
> >>>> in the other paths too involving page_ext->flags access(eg:
> >>>> page_is_idle())__. Since offline waits till the last reference on the
> >>>> page goes down i.e. any path that took the refcount on the page can make
> >>>> the memory offline operation to wait. Eg: In the migrate_pages()
> >>>> operation, we do take the extra refcount on the pages that are under
> >>>> migration and then we do copy page_owner by accessing page_ext. For
> >>>>
> >>>> Fix those paths where offline races with page_ext access by maintaining
> >>>> synchronization with rcu lock.
> >>> Please be much more specific about the synchronization. How does RCU
> >>> actually synchronize the offlining and access? Higher level description
> >>> of all the actors would be very helpful not only for the review but also
> >>> for future readers.
> >> I will improve the commit message about this synchronization change
> >> using RCU's.
> > Thanks! The most imporant part is how the exclusion is actual achieved
> > because that is not really clear at first sight
> > 
> > CPU1					CPU2
> > lookup_page_ext(PageA)			offlining
> > 					  offline_page_ext
> > 					    __free_page_ext(addrA)
> > 					      get_entry(addrA)
> > 					      ms->page_ext = NULL
> > 					      synchronize_rcu()
> > 					      free_page_ext
> > 					        free_pages_exact (now addrA is unusable)
> > 					
> >   rcu_read_lock()
> >   entryA = get_entry(addrA)
> >     base + page_ext_size * index # an address not invalidated by the freeing path
> >   do_something(entryA)
> >   rcu_read_unlock()
> > 
> > CPU1 never checks ms->page_ext so it cannot bail out early when the
> > thing is torn down. Or maybe I am missing something. I am not familiar
> > with page_ext much.
> 
> 
> Thanks a lot for catching this Michal. You are correct that the proposed
> code from me is still racy. I Will correct this along with the proper
> commit message in the next version of this patch.
> 

Trying to understand your discussion with Michal. What part is still racy? We
do check for mem_section::page_ext and bail out early from lookup_page_ext(),
no?

Also to make this scheme explicit, we can annotate page_ext member with __rcu
and use rcu_assign_pointer() on the writer side.

struct page_ext *lookup_page_ext(const struct page *page)
{
        unsigned long pfn = page_to_pfn(page);
        struct mem_section *section = __pfn_to_section(pfn);
        /*
         * The sanity checks the page allocator does upon freeing a
         * page can reach here before the page_ext arrays are
         * allocated when feeding a range of pages to the allocator
         * for the first time during bootup or memory hotplug.
         */
        if (!section->page_ext)
                return NULL;
        return get_entry(section->page_ext, pfn);
}

Thanks,
Pavan
Michal Hocko July 20, 2022, 9:10 a.m. UTC | #15
On Wed 20-07-22 13:51:12, Pavan Kondeti wrote:
> Hi Charan,
> 
> On Tue, Jul 19, 2022 at 08:42:42PM +0530, Charan Teja Kalla wrote:
> > Thanks Michal!!
> > 
> > On 7/18/2022 8:24 PM, Michal Hocko wrote:
> > >>>> The above mentioned race is just one example __but the problem persists
> > >>>> in the other paths too involving page_ext->flags access(eg:
> > >>>> page_is_idle())__. Since offline waits till the last reference on the
> > >>>> page goes down i.e. any path that took the refcount on the page can make
> > >>>> the memory offline operation to wait. Eg: In the migrate_pages()
> > >>>> operation, we do take the extra refcount on the pages that are under
> > >>>> migration and then we do copy page_owner by accessing page_ext. For
> > >>>>
> > >>>> Fix those paths where offline races with page_ext access by maintaining
> > >>>> synchronization with rcu lock.
> > >>> Please be much more specific about the synchronization. How does RCU
> > >>> actually synchronize the offlining and access? Higher level description
> > >>> of all the actors would be very helpful not only for the review but also
> > >>> for future readers.
> > >> I will improve the commit message about this synchronization change
> > >> using RCU's.
> > > Thanks! The most imporant part is how the exclusion is actual achieved
> > > because that is not really clear at first sight
> > > 
> > > CPU1					CPU2
> > > lookup_page_ext(PageA)			offlining
> > > 					  offline_page_ext
> > > 					    __free_page_ext(addrA)
> > > 					      get_entry(addrA)
> > > 					      ms->page_ext = NULL
> > > 					      synchronize_rcu()
> > > 					      free_page_ext
> > > 					        free_pages_exact (now addrA is unusable)
> > > 					
> > >   rcu_read_lock()
> > >   entryA = get_entry(addrA)
> > >     base + page_ext_size * index # an address not invalidated by the freeing path
> > >   do_something(entryA)
> > >   rcu_read_unlock()
> > > 
> > > CPU1 never checks ms->page_ext so it cannot bail out early when the
> > > thing is torn down. Or maybe I am missing something. I am not familiar
> > > with page_ext much.
> > 
> > 
> > Thanks a lot for catching this Michal. You are correct that the proposed
> > code from me is still racy. I Will correct this along with the proper
> > commit message in the next version of this patch.
> > 
> 
> Trying to understand your discussion with Michal. What part is still racy? We
> do check for mem_section::page_ext and bail out early from lookup_page_ext(),
> no?
> 
> Also to make this scheme explicit, we can annotate page_ext member with __rcu
> and use rcu_assign_pointer() on the writer side.
> 
> struct page_ext *lookup_page_ext(const struct page *page)
> {
>         unsigned long pfn = page_to_pfn(page);
>         struct mem_section *section = __pfn_to_section(pfn);
>         /*
>          * The sanity checks the page allocator does upon freeing a
>          * page can reach here before the page_ext arrays are
>          * allocated when feeding a range of pages to the allocator
>          * for the first time during bootup or memory hotplug.
>          */
>         if (!section->page_ext)
>                 return NULL;
>         return get_entry(section->page_ext, pfn);
> }

You are right. I was looking at the wrong implementation and misread
ifdef vs. ifndef CONFIG_SPARSEMEM. My bad.

Memory hotplug is not supported outside of CONFIG_SPARSEMEM so the
scheme should really work. I would use READ_ONCE for ms->page_ext and
WRITE_ONCE on the initialization side.
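
A sketch of how the READ_ONCE/WRITE_ONCE pairing might look (illustration
only, not posted code; the key point is that the reader dereferences the
locally read pointer and never re-fetches ms->page_ext):

struct page_ext *lookup_page_ext(const struct page *page)
{
        unsigned long pfn = page_to_pfn(page);
        struct mem_section *section = __pfn_to_section(pfn);
        /* Read once so a concurrent offline cannot NULL it under us. */
        struct page_ext *page_ext = READ_ONCE(section->page_ext);

        if (!page_ext)
                return NULL;
        return get_entry(page_ext, pfn);
}

static void __free_page_ext(unsigned long pfn)
{
        struct mem_section *ms = __pfn_to_section(pfn);
        struct page_ext *base;

        if (!ms || !ms->page_ext)
                return;
        base = get_entry(ms->page_ext, pfn);
        WRITE_ONCE(ms->page_ext, NULL); /* unpublish before the grace period */
        synchronize_rcu();              /* wait out all RCU readers */
        free_page_ext(base);
}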
Charan Teja Kalla July 20, 2022, 10:43 a.m. UTC | #16
Thanks Michal & Pavan,

On 7/20/2022 2:40 PM, Michal Hocko wrote:
>>>> Thanks! The most imporant part is how the exclusion is actual achieved
>>>> because that is not really clear at first sight
>>>>
>>>> CPU1					CPU2
>>>> lookup_page_ext(PageA)			offlining
>>>> 					  offline_page_ext
>>>> 					    __free_page_ext(addrA)
>>>> 					      get_entry(addrA)
>>>> 					      ms->page_ext = NULL
>>>> 					      synchronize_rcu()
>>>> 					      free_page_ext
>>>> 					        free_pages_exact (now addrA is unusable)
>>>> 					
>>>>   rcu_read_lock()
>>>>   entryA = get_entry(addrA)
>>>>     base + page_ext_size * index # an address not invalidated by the freeing path
>>>>   do_something(entryA)
>>>>   rcu_read_unlock()
>>>>
>>>> CPU1 never checks ms->page_ext so it cannot bail out early when the
>>>> thing is torn down. Or maybe I am missing something. I am not familiar
>>>> with page_ext much.
>>>
>>> Thanks a lot for catching this Michal. You are correct that the proposed
>>> code from me is still racy. I Will correct this along with the proper
>>> commit message in the next version of this patch.
>>>
>> Trying to understand your discussion with Michal. What part is still racy? We
>> do check for mem_section::page_ext and bail out early from lookup_page_ext(),
>> no?
>>
>> Also to make this scheme explicit, we can annotate page_ext member with __rcu
>> and use rcu_assign_pointer() on the writer side.

Annotating with __rcu requires all the reads and writes to ms->page_ext
to be under rcu_[access|assign]_pointer, which is a big patch. I think
READ_ONCE and WRITE_ONCE, mentioned by Michal below, should do the job.

>>
>> struct page_ext *lookup_page_ext(const struct page *page)
>> {
>>         unsigned long pfn = page_to_pfn(page);
>>         struct mem_section *section = __pfn_to_section(pfn);
>>         /*
>>          * The sanity checks the page allocator does upon freeing a
>>          * page can reach here before the page_ext arrays are
>>          * allocated when feeding a range of pages to the allocator
>>          * for the first time during bootup or memory hotplug.
>>          */
>>         if (!section->page_ext)
>>                 return NULL;
>>         return get_entry(section->page_ext, pfn);
>> }
> You are right. I was looking at the wrong implementation and misread
> ifdef vs. ifndef CONFIG_SPARSEMEM. My bad.
> 

There is still a small race window between setting ms->page_ext to NULL
and its access, even under CONFIG_SPARSEMEM. In the above mentioned example:

 CPU1					CPU2
 rcu_read_lock()
 lookup_page_ext(PageA):		offlining
 					  offline_page_ext
 					    __free_page_ext(addrA)
 					      get_entry(addrA)
    if (!section->page_ext)
       turns to be false.
 					      ms->page_ext = NULL
						
   addrA = get_entry(base=section->page_ext):
     base + page_ext_size * index;
     **Since base is NULL here, caller
     can still do the dereference on
     the invalid pointer address.**
						
 					      synchronize_rcu()
 					      free_page_ext
 					        free_pages_exact (now )


> Memory hotplug is not supported outside of CONFIG_SPARSEMEM so the
> scheme should really work. I would use READ_ONCE for ms->page_ext and
> WRITE_ONCE on the initialization side.

Yes, I should be using the READ_ONCE() and WRITE_ONCE() here.

Thanks,
Charan
Michal Hocko July 20, 2022, 11:13 a.m. UTC | #17
On Wed 20-07-22 16:13:19, Charan Teja Kalla wrote:
> Thanks Michal & Pavan,
> 
> On 7/20/2022 2:40 PM, Michal Hocko wrote:
> >>>> Thanks! The most imporant part is how the exclusion is actual achieved
> >>>> because that is not really clear at first sight
> >>>>
> >>>> CPU1					CPU2
> >>>> lookup_page_ext(PageA)			offlining
> >>>> 					  offline_page_ext
> >>>> 					    __free_page_ext(addrA)
> >>>> 					      get_entry(addrA)
> >>>> 					      ms->page_ext = NULL
> >>>> 					      synchronize_rcu()
> >>>> 					      free_page_ext
> >>>> 					        free_pages_exact (now addrA is unusable)
> >>>> 					
> >>>>   rcu_read_lock()
> >>>>   entryA = get_entry(addrA)
> >>>>     base + page_ext_size * index # an address not invalidated by the freeing path
> >>>>   do_something(entryA)
> >>>>   rcu_read_unlock()
> >>>>
> >>>> CPU1 never checks ms->page_ext so it cannot bail out early when the
> >>>> thing is torn down. Or maybe I am missing something. I am not familiar
> >>>> with page_ext much.
> >>>
> >>> Thanks a lot for catching this Michal. You are correct that the proposed
> >>> code from me is still racy. I Will correct this along with the proper
> >>> commit message in the next version of this patch.
> >>>
> >> Trying to understand your discussion with Michal. What part is still racy? We
> >> do check for mem_section::page_ext and bail out early from lookup_page_ext(),
> >> no?
> >>
> >> Also to make this scheme explicit, we can annotate page_ext member with __rcu
> >> and use rcu_assign_pointer() on the writer side.
> 
> Annotating with __rcu requires all the read and writes to ms->page_ext
> to be under rcu_[access|assign]_pointer which is a big patch. I think
> READ_ONCE and WRITE_ONCE, mentioned by Michal, below should does the job.
> 
> >>
> >> struct page_ext *lookup_page_ext(const struct page *page)
> >> {
> >>         unsigned long pfn = page_to_pfn(page);
> >>         struct mem_section *section = __pfn_to_section(pfn);
> >>         /*
> >>          * The sanity checks the page allocator does upon freeing a
> >>          * page can reach here before the page_ext arrays are
> >>          * allocated when feeding a range of pages to the allocator
> >>          * for the first time during bootup or memory hotplug.
> >>          */
> >>         if (!section->page_ext)
> >>                 return NULL;
> >>         return get_entry(section->page_ext, pfn);
> >> }
> > You are right. I was looking at the wrong implementation and misread
> > ifdef vs. ifndef CONFIG_SPARSEMEM. My bad.
> > 
> 
> There is still a small race window b/n ms->page_ext setting NULL and its
> access even under CONFIG_SPARSEMEM. In the above mentioned example:
> 
>  CPU1					CPU2
>  rcu_read_lock()
>  lookup_page_ext(PageA):		offlining
>  					  offline_page_ext
>  					    __free_page_ext(addrA)
>  					      get_entry(addrA)
>     if (!section->page_ext)
>        turns to be false.
>  					      ms->page_ext = NULL
> 						
>    addrA = get_entry(base=section->page_ext):
>      base + page_ext_size * index;
>      **Since base is NULL here, caller
>      can still do the dereference on
>      the invalid pointer address.**

only if the value is re-fetched. Not likely, but definitely better to
have it covered. That is why I was suggesting READ_ONCE/WRITE_ONCE for
this operation.
> 						
>  					      synchronize_rcu()
>  					      free_page_ext
>  					        free_pages_exact (now )
> 
> 
> > Memory hotplug is not supported outside of CONFIG_SPARSEMEM so the
> > scheme should really work. I would use READ_ONCE for ms->page_ext and
> > WRITE_ONCE on the initialization side.
> 
> Yes, I should be using the READ_ONCE() and WRITE_ONCE() here.

yes.
Charan Teja Kalla July 20, 2022, 3:08 p.m. UTC | #18
Thanks Michal here!!

On 7/19/2022 9:13 PM, Michal Hocko wrote:
>>>>>> diff --git a/mm/page_ext.c b/mm/page_ext.c
>>>>>> index 3dc715d..5ccd3ee 100644
>>>>>> --- a/mm/page_ext.c
>>>>>> +++ b/mm/page_ext.c
>>>>>> @@ -299,8 +299,9 @@ static void __free_page_ext(unsigned long pfn)
>>>>>>  	if (!ms || !ms->page_ext)
>>>>>>  		return;
>>>>>>  	base = get_entry(ms->page_ext, pfn);
>>>>>> -	free_page_ext(base);
>>>>>>  	ms->page_ext = NULL;
>>>>>> +	synchronize_rcu();
>>>>>> +	free_page_ext(base);
>>>>>>  }
>>>>> So you are imposing the RCU grace period for each page_ext! This can get
>>>>> really expensive. Have you tried to measure the effect?
>>> I was wrong here! This is for each memory section which is not as
>>> terrible as every single page_ext. This can be still quite a lot memory
>>> sections in a single memory block (e.g. on ppc memory sections are
>>> ridiculously small).
>>>
>> On the ARM64, I see that the minimum a section size will go is 128MB. I
>> think 16MB is the section size on ppc. Any inputs on how frequently
>> offline/online operation is being done on this ppc arch?
> I have seen several reports where 16MB sections were used on PPC LPARs
> with a non-trivial size. My usual answer to that is that this is mostly a
> self-inflicted injury, but I am told that for some reasons I cannot
> understand this is not easy to change. So, reasonable or not, this is not
> all that uncommon in PPC land.
> 
> We definitely shouldn't optimize for those setups, but we shouldn't make
> them suffer even more either. Besides that, it seems that a single
> synchronize_rcu() per offline operation should be doable.

I too feel it is doable, but the code would then need to traverse the
sections of a memory block twice.

1) with synchronize_rcu() calling for each memory section of a memblock:
---------------------------------------------------------------------
for (pfn = start; pfn < end; pfn += PAGES_PER_SECTION)
    __free_page_ext(pfn):
         ms->page_ext = NULL
         synchronize_rcu();// Called on every section.
	 free_page_ext();//free the page_ext.

2) With a single synchronize_rcu() for each offlined block:
-------------------------------------------------------
for (pfn = start; pfn < end; pfn += PAGES_PER_SECTION) {
	__free_page_ext(pfn):
	    ms->page_ext = NULL;
}
synchronize_rcu(); // call synchronize_rcu for just once
for (pfn = start; pfn < end; pfn += PAGES_PER_SECTION)
	free_page_ext(); // Free the page_ext.

Any better code you have in mind here please?

But since there are only a few sections, the overhead of traversing them is
definitely less than that of a synchronize_rcu() per memsection.

But I really doubt there will be a real impact from doing sync_rcu per
section because (as David also mentioned and you also corrected, I
think) the concern here is for ppc, where the min memblock size is 256M
with a section size of 16M, so a single offline operation on that
block can end up making 16 sync_rcu() calls. Should we really
optimize this case here? If yes, I can go with approach 2) mentioned
above. Sorry if I am really underestimating the problem here.


Thanks,
Charan
Michal Hocko July 20, 2022, 3:22 p.m. UTC | #19
On Wed 20-07-22 20:38:58, Charan Teja Kalla wrote:
> Thanks Michal here!!
> 
> On 7/19/2022 9:13 PM, Michal Hocko wrote:
> >>>>>> diff --git a/mm/page_ext.c b/mm/page_ext.c
> >>>>>> index 3dc715d..5ccd3ee 100644
> >>>>>> --- a/mm/page_ext.c
> >>>>>> +++ b/mm/page_ext.c
> >>>>>> @@ -299,8 +299,9 @@ static void __free_page_ext(unsigned long pfn)
> >>>>>>  	if (!ms || !ms->page_ext)
> >>>>>>  		return;
> >>>>>>  	base = get_entry(ms->page_ext, pfn);
> >>>>>> -	free_page_ext(base);
> >>>>>>  	ms->page_ext = NULL;
> >>>>>> +	synchronize_rcu();
> >>>>>> +	free_page_ext(base);
> >>>>>>  }
> >>>>> So you are imposing the RCU grace period for each page_ext! This can get
> >>>>> really expensive. Have you tried to measure the effect?
> >>> I was wrong here! This is for each memory section which is not as
> >>> terrible as every single page_ext. There can still be quite a lot of memory
> >>> sections in a single memory block (e.g. on ppc memory sections are
> >>> ridiculously small).
> >>>
> >> On ARM64, I see that the minimum a section size can go is 128MB. I
> >> think 16MB is the section size on ppc. Any inputs on how frequently
> >> the offline/online operation is done on the ppc arch?
> > I have seen several reports where 16MB sections were used on PPC LPARs
> > with a non-trivial size. My usual answer to that is that this is mostly a
> > self-inflicted injury, but I am told that for some reasons I cannot
> > understand this is not easy to change. So, reasonable or not, this is not
> > all that uncommon in PPC land.
> > 
> > We definitely shouldn't optimize for those setups, but we shouldn't make
> > them suffer even more either. Besides that, it seems that a single
> > synchronize_rcu() per offline operation should be doable.
> 
> I too feel it is doable, but the code would then need to traverse the
> sections of a memory block twice.

yes, this is to be expected unless you really want to optimize that
further with some global flag, which is probably not worth it. Traversing
the sequence of sections is not really expensive compared
to doing an expensive operation on each iteration.
 
> 1) with synchronize_rcu() calling for each memory section of a memblock:
> ---------------------------------------------------------------------
> for (pfn = start; pfn < end; pfn += PAGES_PER_SECTION)
>     __free_page_ext(pfn):
>          ms->page_ext = NULL
>          synchronize_rcu();// Called on every section.
> 	 free_page_ext();//free the page_ext.
> 
> 2) With a single synchronize_rcu() for each offlined block:
> -------------------------------------------------------
> for (pfn = start; pfn < end; pfn += PAGES_PER_SECTION) {
> 	__free_page_ext(pfn):
> 	    ms->page_ext = NULL;

you want WRITE_ONCE here

> }
> synchronize_rcu(); // call synchronize_rcu for just once
> for (pfn = start; pfn < end; pfn += PAGES_PER_SECTION)
> 	free_page_ext(); // Free the page_ext.

Yes, this is what I really meant, but the thing is that you need the
pointer to know what to free. One way to handle that would be to not
clear the pointer but to make it visibly invalid for later checks, e.g. set
the lowest bit and check for that rather than for NULL.
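
A minimal sketch of that low-bit "tombstone" idea (PAGE_EXT_INVALID and the
helper names are assumptions for illustration, nothing that exists in the
tree at this point):

#define PAGE_EXT_INVALID	(0x1)

/*
 * Hide the per-section entry from lookups but keep the pointer around
 * so the offline path can still free it after the grace period.
 */
static void __invalidate_page_ext(unsigned long pfn)
{
	struct mem_section *ms = __pfn_to_section(pfn);

	if (!ms || !ms->page_ext)
		return;
	WRITE_ONCE(ms->page_ext, (void *)ms->page_ext + PAGE_EXT_INVALID);
}

/* Lookup side: treat a tagged pointer the same as NULL. */
static bool page_ext_invalid(struct page_ext *page_ext)
{
	return !page_ext || ((unsigned long)page_ext & PAGE_EXT_INVALID);
}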
 
> Any better code you have in mind here please?
> 
> But since there are only a few sections, the overhead of traversing them is
> definitely less than that of a synchronize_rcu() per memsection.
> 
> But I really doubt there will be a real impact from doing sync_rcu per
> section because (as David also mentioned and you also corrected, I
> think) the concern here is for ppc, where the min memblock size is 256M
> with a section size of 16M, so a single offline operation on that
> block can end up making 16 sync_rcu() calls. Should we really
> optimize this case here? If yes, I can go with approach 2) mentioned
> above. Sorry if I am really underestimating the problem here.

I would prefer a single RCU grace period solution, unless this turns into
monstrously complicated stuff, which I hope we can avoid completely by
simply not doing the separate lifetime as well...

Making any assumptions about the number of sections handled here
is just a ticket to issues later. Let's not put land mines ahead of
us, please.
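
Building on the __invalidate_page_ext()/page_ext_invalid() sketch above, a
single grace period per offline operation could then look roughly like the
following two-pass loop (the function name and signature are made up for
illustration; __free_page_ext() would additionally have to strip the low bit
before handing the pointer to free_page_ext()):

static void offline_page_ext_sketch(unsigned long start_pfn,
				    unsigned long nr_pages)
{
	unsigned long start = SECTION_ALIGN_DOWN(start_pfn);
	unsigned long end = SECTION_ALIGN_UP(start_pfn + nr_pages);
	unsigned long pfn;

	/* Pass 1: make every section's page_ext invisible to new lookups. */
	for (pfn = start; pfn < end; pfn += PAGES_PER_SECTION)
		__invalidate_page_ext(pfn);

	/*
	 * One grace period for the whole block, waiting out all readers
	 * that entered rcu_read_lock() before the invalidation.
	 */
	synchronize_rcu();

	/* Pass 2: now it is safe to actually free the per-section arrays. */
	for (pfn = start; pfn < end; pfn += PAGES_PER_SECTION)
		__free_page_ext(pfn);
}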
diff mbox series

Patch

diff --git a/include/linux/page_ext.h b/include/linux/page_ext.h
index fabb2e1..df5d353 100644
--- a/include/linux/page_ext.h
+++ b/include/linux/page_ext.h
@@ -64,6 +64,25 @@  static inline struct page_ext *page_ext_next(struct page_ext *curr)
 	return next;
 }
 
+static inline struct page_ext *get_page_ext(struct page *page)
+{
+	struct page_ext *page_ext;
+
+	rcu_read_lock();
+	page_ext = lookup_page_ext(page);
+	if (!page_ext) {
+		rcu_read_unlock();
+		return NULL;
+	}
+
+	return page_ext;
+}
+
+static inline void put_page_ext(void)
+{
+	rcu_read_unlock();
+}
+
 #else /* !CONFIG_PAGE_EXTENSION */
 struct page_ext;
 
diff --git a/include/linux/page_idle.h b/include/linux/page_idle.h
index 4663dfe..0d80f78 100644
--- a/include/linux/page_idle.h
+++ b/include/linux/page_idle.h
@@ -13,65 +13,85 @@ 
  * If there is not enough space to store Idle and Young bits in page flags, use
  * page ext flags instead.
  */
-
 static inline bool folio_test_young(struct folio *folio)
 {
-	struct page_ext *page_ext = lookup_page_ext(&folio->page);
+	struct page_ext *page_ext;
+	bool page_young;
 
+	page_ext = get_page_ext(&folio->page);
 	if (unlikely(!page_ext))
 		return false;
 
-	return test_bit(PAGE_EXT_YOUNG, &page_ext->flags);
+	page_young = test_bit(PAGE_EXT_YOUNG, &page_ext->flags);
+	put_page_ext();
+
+	return page_young;
 }
 
 static inline void folio_set_young(struct folio *folio)
 {
-	struct page_ext *page_ext = lookup_page_ext(&folio->page);
+	struct page_ext *page_ext;
 
+	page_ext = get_page_ext(&folio->page);
 	if (unlikely(!page_ext))
 		return;
 
 	set_bit(PAGE_EXT_YOUNG, &page_ext->flags);
+	put_page_ext();
 }
 
 static inline bool folio_test_clear_young(struct folio *folio)
 {
-	struct page_ext *page_ext = lookup_page_ext(&folio->page);
+	struct page_ext *page_ext;
+	bool page_young;
 
+	page_ext = get_page_ext(&folio->page);
 	if (unlikely(!page_ext))
 		return false;
 
-	return test_and_clear_bit(PAGE_EXT_YOUNG, &page_ext->flags);
+	page_young = test_and_clear_bit(PAGE_EXT_YOUNG, &page_ext->flags);
+	put_page_ext();
+
+	return page_young;
 }
 
 static inline bool folio_test_idle(struct folio *folio)
 {
-	struct page_ext *page_ext = lookup_page_ext(&folio->page);
+	struct page_ext *page_ext;
+	bool page_idle;
 
+	page_ext = get_page_ext(&folio->page);
 	if (unlikely(!page_ext))
 		return false;
 
-	return test_bit(PAGE_EXT_IDLE, &page_ext->flags);
+	page_idle =  test_bit(PAGE_EXT_IDLE, &page_ext->flags);
+	put_page_ext();
+
+	return page_idle;
 }
 
 static inline void folio_set_idle(struct folio *folio)
 {
-	struct page_ext *page_ext = lookup_page_ext(&folio->page);
+	struct page_ext *page_ext;
 
+	page_ext = get_page_ext(&folio->page);
 	if (unlikely(!page_ext))
 		return;
 
 	set_bit(PAGE_EXT_IDLE, &page_ext->flags);
+	put_page_ext();
 }
 
 static inline void folio_clear_idle(struct folio *folio)
 {
-	struct page_ext *page_ext = lookup_page_ext(&folio->page);
+	struct page_ext *page_ext;
 
+	page_ext = get_page_ext(&folio->page);
 	if (unlikely(!page_ext))
 		return;
 
 	clear_bit(PAGE_EXT_IDLE, &page_ext->flags);
+	put_page_ext();
 }
 #endif /* !CONFIG_64BIT */
 
diff --git a/mm/page_ext.c b/mm/page_ext.c
index 3dc715d..5ccd3ee 100644
--- a/mm/page_ext.c
+++ b/mm/page_ext.c
@@ -299,8 +299,9 @@  static void __free_page_ext(unsigned long pfn)
 	if (!ms || !ms->page_ext)
 		return;
 	base = get_entry(ms->page_ext, pfn);
-	free_page_ext(base);
 	ms->page_ext = NULL;
+	synchronize_rcu();
+	free_page_ext(base);
 }
 
 static int __meminit online_page_ext(unsigned long start_pfn,
diff --git a/mm/page_owner.c b/mm/page_owner.c
index e4c6f3f..d4b9417 100644
--- a/mm/page_owner.c
+++ b/mm/page_owner.c
@@ -195,14 +195,16 @@  noinline void __set_page_owner(struct page *page, unsigned short order,
 
 void __set_page_owner_migrate_reason(struct page *page, int reason)
 {
-	struct page_ext *page_ext = lookup_page_ext(page);
+	struct page_ext *page_ext;
 	struct page_owner *page_owner;
 
+	page_ext = get_page_ext(page);
 	if (unlikely(!page_ext))
 		return;
 
 	page_owner = get_page_owner(page_ext);
 	page_owner->last_migrate_reason = reason;
+	put_page_ext();
 }
 
 void __split_page_owner(struct page *page, unsigned int nr)
@@ -288,7 +290,10 @@  void pagetypeinfo_showmixedcount_print(struct seq_file *m,
 
 		pageblock_mt = get_pageblock_migratetype(page);
 
+		rcu_read_lock();
 		for (; pfn < block_end_pfn; pfn++) {
+			if (need_resched())
+				cond_resched_rcu();
 			/* The pageblock is online, no need to recheck. */
 			page = pfn_to_page(pfn);
 
@@ -327,6 +332,7 @@  void pagetypeinfo_showmixedcount_print(struct seq_file *m,
 			}
 			pfn += (1UL << page_owner->order) - 1;
 		}
+		rcu_read_unlock();
 	}
 
 	/* Print counts */
@@ -492,6 +498,7 @@  read_page_owner(struct file *file, char __user *buf, size_t count, loff_t *ppos)
 	struct page_ext *page_ext;
 	struct page_owner *page_owner;
 	depot_stack_handle_t handle;
+	ssize_t ret = 0;
 
 	if (!static_branch_unlikely(&page_owner_inited))
 		return -EINVAL;
@@ -506,7 +513,10 @@  read_page_owner(struct file *file, char __user *buf, size_t count, loff_t *ppos)
 	drain_all_pages(NULL);
 
 	/* Find an allocated page */
+	rcu_read_lock();
 	for (; pfn < max_pfn; pfn++) {
+		if (need_resched())
+			cond_resched_rcu();
 		/*
 		 * If the new page is in a new MAX_ORDER_NR_PAGES area,
 		 * validate the area as existing, skip it if not
@@ -563,11 +573,13 @@  read_page_owner(struct file *file, char __user *buf, size_t count, loff_t *ppos)
 		/* Record the next PFN to read in the file offset */
 		*ppos = (pfn - min_low_pfn) + 1;
 
-		return print_page_owner(buf, count, pfn, page,
+		ret = print_page_owner(buf, count, pfn, page,
 				page_owner, handle);
+		break;
 	}
+	rcu_read_unlock();
 
-	return 0;
+	return ret;
 }
 
 static void init_pages_in_zone(pg_data_t *pgdat, struct zone *zone)
diff --git a/mm/page_table_check.c b/mm/page_table_check.c
index e206274..66fd49a 100644
--- a/mm/page_table_check.c
+++ b/mm/page_table_check.c
@@ -68,7 +68,7 @@  static void page_table_check_clear(struct mm_struct *mm, unsigned long addr,
 		return;
 
 	page = pfn_to_page(pfn);
-	page_ext = lookup_page_ext(page);
+	page_ext = get_page_ext(page);
 	anon = PageAnon(page);
 
 	for (i = 0; i < pgcnt; i++) {
@@ -83,6 +83,7 @@  static void page_table_check_clear(struct mm_struct *mm, unsigned long addr,
 		}
 		page_ext = page_ext_next(page_ext);
 	}
+	put_page_ext();
 }
 
 /*
@@ -103,7 +104,7 @@  static void page_table_check_set(struct mm_struct *mm, unsigned long addr,
 		return;
 
 	page = pfn_to_page(pfn);
-	page_ext = lookup_page_ext(page);
+	page_ext = get_page_ext(page);
 	anon = PageAnon(page);
 
 	for (i = 0; i < pgcnt; i++) {
@@ -118,6 +119,7 @@  static void page_table_check_set(struct mm_struct *mm, unsigned long addr,
 		}
 		page_ext = page_ext_next(page_ext);
 	}
+	put_page_ext();
 }
 
 /*
@@ -126,9 +128,10 @@  static void page_table_check_set(struct mm_struct *mm, unsigned long addr,
  */
 void __page_table_check_zero(struct page *page, unsigned int order)
 {
-	struct page_ext *page_ext = lookup_page_ext(page);
+	struct page_ext *page_ext;
 	unsigned long i;
 
+	page_ext = get_page_ext(page);
 	BUG_ON(!page_ext);
 	for (i = 0; i < (1ul << order); i++) {
 		struct page_table_check *ptc = get_page_table_check(page_ext);
@@ -137,6 +140,7 @@  void __page_table_check_zero(struct page *page, unsigned int order)
 		BUG_ON(atomic_read(&ptc->file_map_count));
 		page_ext = page_ext_next(page_ext);
 	}
+	put_page_ext();
 }
 
 void __page_table_check_pte_clear(struct mm_struct *mm, unsigned long addr,