
[v4] mm, compaction: Skip all non-migratable pages during scan

Message ID 20230525191507.160076-1-khalid.aziz@oracle.com (mailing list archive)
State New

Commit Message

Khalid Aziz May 25, 2023, 7:15 p.m. UTC
Pages pinned in memory through extra refcounts can not be migrated.
Currently, as isolate_migratepages_block() scans pages for
compaction, it skips only pinned anonymous pages. All non-migratable
pages should be skipped, not just the anonymous pinned pages.
This patch adds a check for extra refcounts on a page to determine
if the page can be migrated.  This was seen as a real issue on a
customer workload where a large number of pages were pinned by vfio
on the host and any attempt to allocate hugepages resulted in a
significant amount of cpu time spent either in direct compaction or
in kcompactd, scanning the same non-migratable vfio-pinned pages
over and over again. These are the changes in relevant stats with
this patch for a test run of this scenario:

				Before			After
compact_migrate_scanned		329,798,858		370,984,387
compact_free_scanned		40,478,406		25,843,262
compact_isolated		135,470,452		777,235
pgmigrate_success		544,255			507,325
pgmigrate_fail			134,616,282		47
kcompactd CPU time		5:12.81			0:12.28

Before the patch, a large number of pages were isolated but most of
them failed to migrate.

Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com>
Suggested-by: Steve Sistare <steven.sistare@oracle.com>
Cc: Khalid Aziz <khalid@kernel.org>
---
v4:
	- Use existing folio_expected_refs() function (Suggested
	  by Huang, Ying)
	- Use folio functions
	- Take into account contig allocations when checking for
	  long term pinning and skip pages in ZONE_MOVABLE and
	  MIGRATE_CMA type pages (Suggested by David Hildenbrand)
	- Use folio version of total_mapcount() instead of
	  page_mapcount() (Suggested by Baolin Wang)

v3:
	- Account for extra ref added by get_page_unless_zero() earlier
	  in isolate_migratepages_block() (Suggested by Huang, Ying)
	- Clean up computation of extra refs to be consistent 
	  (Suggested by Huang, Ying)

v2:
	- Update comments in the code (Suggested by Andrew)
	- Use PagePrivate() instead of page_has_private() (Suggested
	  by Matthew)
	- Pass mapping to page_has_extrarefs() (Suggested by Matthew)
	- Use page_ref_count() (Suggested by Matthew)
	- Rename is_pinned_page() to reflect its function more
	  accurately (Suggested by Matthew)

 include/linux/migrate.h | 16 +++++++++++++++
 mm/compaction.c         | 44 +++++++++++++++++++++++++++++++++++++----
 mm/migrate.c            | 14 -------------
 3 files changed, 56 insertions(+), 18 deletions(-)
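The check the patch introduces can be modeled in plain userspace C. The struct and helper names below are illustrative stand-ins, not the kernel's types: they mimic what struct folio, folio_expected_refs() and page_has_extra_refs() do in this patch.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Userspace model of the skip check (illustrative names only).
 * A folio's expected refcount is 1 for the isolation scanner's own
 * reference from get_page_unless_zero(), plus one per base page if
 * the folio is in the page cache, plus 1 if the filesystem attached
 * private data.  Any references beyond expected + mapcount are
 * unaccounted-for pins, so migration would fail and isolating the
 * page only wastes work.
 */
struct folio_model {
	bool has_mapping;	/* page-cache backed? */
	bool has_private;	/* PG_private data attached? */
	int nr_pages;		/* folio size in base pages */
	int ref_count;		/* total references held */
	int map_count;		/* page-table mappings */
};

static int expected_refs(const struct folio_model *f)
{
	int refs = 1;		/* the scanner's own reference */

	if (!f->has_mapping)
		return refs;
	refs += f->nr_pages;	/* one reference per page-cache slot */
	if (f->has_private)
		refs++;
	return refs;
}

static bool has_extra_refs(const struct folio_model *f)
{
	/* racy in the kernel, but good enough to skip pinned pages */
	return f->ref_count - expected_refs(f) > f->map_count;
}
```

For example, an anonymous page mapped once, with only the scanner's extra reference, shows no surplus and stays eligible for isolation; a single vfio pin adds one more reference and trips the check, so the scanner skips the page without taking lru_lock.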

Comments

Matthew Wilcox (Oracle) May 25, 2023, 7:58 p.m. UTC | #1
On Thu, May 25, 2023 at 01:15:07PM -0600, Khalid Aziz wrote:
> diff --git a/mm/compaction.c b/mm/compaction.c
> index 5a9501e0ae01..b548e05f0349 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -764,6 +764,42 @@ static bool too_many_isolated(pg_data_t *pgdat)
>  	return too_many;
>  }
>  
> +/*
> + * Check if this base page should be skipped from isolation because
> + * it has extra refcounts that will prevent it from being migrated.
> + * This code is inspired by similar code in migrate_vma_check_page(),
> + * can_split_folio() and folio_migrate_mapping()
> + */
> +static inline bool page_has_extra_refs(struct page *page,
> +					struct address_space *mapping)
> +{
> +	unsigned long extra_refs;
> +	struct folio *folio;
> +
> +	/*
> +	 * Skip this check for pages in ZONE_MOVABLE or MIGRATE_CMA
> +	 * pages that can not be long term pinned
> +	 */
> +	if (is_zone_movable_page(page) || is_migrate_cma_page(page))
> +		return false;
> +
> +	folio = page_folio(page);
> +
> +	/*
> +	 * caller holds a ref already from get_page_unless_zero()
> +	 * which is accounted for in folio_expected_refs()
> +	 */
> +	extra_refs = folio_expected_refs(mapping, folio);
> +
> +	/*
> +	 * This is an admittedly racy check but good enough to determine
> +	 * if a page is pinned and can not be migrated
> +	 */
> +	if ((folio_ref_count(folio) - extra_refs) > folio_mapcount(folio))
> +		return true;
> +	return false;
> +}
> +
>  /**
>   * isolate_migratepages_block() - isolate all migrate-able pages within
>   *				  a single pageblock
> @@ -992,12 +1028,12 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
>  			goto isolate_fail;

Just out of shot, we have ...

                if (unlikely(!get_page_unless_zero(page)))

This is the perfect opportunity to use folio_get_nontail_page() instead.
You get back the folio without having to cast the pointer yourself
or call page_folio().  Now you can use a folio throughout your new
function, saving a call to compound_head().

For a followup patch, everything in this loop below this point can use
the folio ... that's quite a lot of change.

>  		/*
> -		 * Migration will fail if an anonymous page is pinned in memory,
> -		 * so avoid taking lru_lock and isolating it unnecessarily in an
> -		 * admittedly racy check.
> +		 * Migration will fail if a page has extra refcounts
> +		 * from long term pinning preventing it from migrating,
> +		 * so avoid taking lru_lock and isolating it unnecessarily.
>  		 */

Isn't "long term pinning" the wrong description of the problem?  Long term
pins suggest to me FOLL_LONGTERM.  I think this is simple short term
pins that we care about here.
Steven Sistare May 25, 2023, 8:15 p.m. UTC | #2
On 5/25/2023 3:58 PM, Matthew Wilcox wrote:
> On Thu, May 25, 2023 at 01:15:07PM -0600, Khalid Aziz wrote:
>> diff --git a/mm/compaction.c b/mm/compaction.c
>> index 5a9501e0ae01..b548e05f0349 100644
>> --- a/mm/compaction.c
>> +++ b/mm/compaction.c
>> @@ -764,6 +764,42 @@ static bool too_many_isolated(pg_data_t *pgdat)
>>  	return too_many;
>>  }
>>  
>> +/*
>> + * Check if this base page should be skipped from isolation because
>> + * it has extra refcounts that will prevent it from being migrated.
>> + * This code is inspired by similar code in migrate_vma_check_page(),
>> + * can_split_folio() and folio_migrate_mapping()
>> + */
>> +static inline bool page_has_extra_refs(struct page *page,
>> +					struct address_space *mapping)
>> +{
>> +	unsigned long extra_refs;
>> +	struct folio *folio;
>> +
>> +	/*
>> +	 * Skip this check for pages in ZONE_MOVABLE or MIGRATE_CMA
>> +	 * pages that can not be long term pinned
>> +	 */
>> +	if (is_zone_movable_page(page) || is_migrate_cma_page(page))
>> +		return false;
>> +
>> +	folio = page_folio(page);
>> +
>> +	/*
>> +	 * caller holds a ref already from get_page_unless_zero()
>> +	 * which is accounted for in folio_expected_refs()
>> +	 */
>> +	extra_refs = folio_expected_refs(mapping, folio);
>> +
>> +	/*
>> +	 * This is an admittedly racy check but good enough to determine
>> +	 * if a page is pinned and can not be migrated
>> +	 */
>> +	if ((folio_ref_count(folio) - extra_refs) > folio_mapcount(folio))
>> +		return true;
>> +	return false;
>> +}
>> +
>>  /**
>>   * isolate_migratepages_block() - isolate all migrate-able pages within
>>   *				  a single pageblock
>> @@ -992,12 +1028,12 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
>>  			goto isolate_fail;
> 
> Just out of shot, we have ...
> 
>                 if (unlikely(!get_page_unless_zero(page)))
> 
> This is the perfect opportunity to use folio_get_nontail_page() instead.
> You get back the folio without having to cast the pointer yourself
> or call page_folio().  Now you can use a folio throughout your new
> function, saving a call to compound_head().
> 
> For a followup patch, everything in this loop below this point can use
> the folio ... that's quite a lot of change.
> 
>>  		/*
>> -		 * Migration will fail if an anonymous page is pinned in memory,
>> -		 * so avoid taking lru_lock and isolating it unnecessarily in an
>> -		 * admittedly racy check.
>> +		 * Migration will fail if a page has extra refcounts
>> +		 * from long term pinning preventing it from migrating,
>> +		 * so avoid taking lru_lock and isolating it unnecessarily.
>>  		 */
> 
> Isn't "long term pinning" the wrong description of the problem?  Long term
> pins suggest to me FOLL_LONGTERM.  I think this is simple short term
> pins that we care about here.

vfio pins are held for a long time - Steve
Khalid Aziz May 25, 2023, 8:41 p.m. UTC | #3
On 5/25/23 13:58, Matthew Wilcox wrote:
> On Thu, May 25, 2023 at 01:15:07PM -0600, Khalid Aziz wrote:
>> diff --git a/mm/compaction.c b/mm/compaction.c
>> index 5a9501e0ae01..b548e05f0349 100644
>> --- a/mm/compaction.c
>> +++ b/mm/compaction.c
>> @@ -764,6 +764,42 @@ static bool too_many_isolated(pg_data_t *pgdat)
>>   	return too_many;
>>   }
>>   
>> +/*
>> + * Check if this base page should be skipped from isolation because
>> + * it has extra refcounts that will prevent it from being migrated.
>> + * This code is inspired by similar code in migrate_vma_check_page(),
>> + * can_split_folio() and folio_migrate_mapping()
>> + */
>> +static inline bool page_has_extra_refs(struct page *page,
>> +					struct address_space *mapping)
>> +{
>> +	unsigned long extra_refs;
>> +	struct folio *folio;
>> +
>> +	/*
>> +	 * Skip this check for pages in ZONE_MOVABLE or MIGRATE_CMA
>> +	 * pages that can not be long term pinned
>> +	 */
>> +	if (is_zone_movable_page(page) || is_migrate_cma_page(page))
>> +		return false;
>> +
>> +	folio = page_folio(page);
>> +
>> +	/*
>> +	 * caller holds a ref already from get_page_unless_zero()
>> +	 * which is accounted for in folio_expected_refs()
>> +	 */
>> +	extra_refs = folio_expected_refs(mapping, folio);
>> +
>> +	/*
>> +	 * This is an admittedly racy check but good enough to determine
>> +	 * if a page is pinned and can not be migrated
>> +	 */
>> +	if ((folio_ref_count(folio) - extra_refs) > folio_mapcount(folio))
>> +		return true;
>> +	return false;
>> +}
>> +
>>   /**
>>    * isolate_migratepages_block() - isolate all migrate-able pages within
>>    *				  a single pageblock
>> @@ -992,12 +1028,12 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
>>   			goto isolate_fail;
> 
> Just out of shot, we have ...
> 
>                  if (unlikely(!get_page_unless_zero(page)))
> 
> This is the perfect opportunity to use folio_get_nontail_page() instead.
> You get back the folio without having to cast the pointer yourself
> or call page_folio().  Now you can use a folio throughout your new
> function, saving a call to compound_head().
> 
> For a followup patch, everything in this loop below this point can use
> the folio ... that's quite a lot of change.

Can that all be in a separate patch by itself? I tried to keep all folio
functions contained inside page_has_extra_refs(). If we change part of
isolate_migratepages_block() to folio, it would make sense to change the
rest of the function at the same time.

> 
>>   		/*
>> -		 * Migration will fail if an anonymous page is pinned in memory,
>> -		 * so avoid taking lru_lock and isolating it unnecessarily in an
>> -		 * admittedly racy check.
>> +		 * Migration will fail if a page has extra refcounts
>> +		 * from long term pinning preventing it from migrating,
>> +		 * so avoid taking lru_lock and isolating it unnecessarily.
>>   		 */
> 
> Isn't "long term pinning" the wrong description of the problem?  Long term
> pins suggest to me FOLL_LONGTERM.  I think this is simple short term
> pins that we care about here.
> 

As Steve pointed out, vfio pinned pages are long term and we are
concerned about long term pinned pages since no matter how many times
we go over them, they will not migrate.

Thanks,
Khalid
Matthew Wilcox (Oracle) May 25, 2023, 8:45 p.m. UTC | #4
On Thu, May 25, 2023 at 04:15:07PM -0400, Steven Sistare wrote:
> On 5/25/2023 3:58 PM, Matthew Wilcox wrote:
> > On Thu, May 25, 2023 at 01:15:07PM -0600, Khalid Aziz wrote:
> >> diff --git a/mm/compaction.c b/mm/compaction.c
> >> index 5a9501e0ae01..b548e05f0349 100644
> >> --- a/mm/compaction.c
> >> +++ b/mm/compaction.c
> >> @@ -764,6 +764,42 @@ static bool too_many_isolated(pg_data_t *pgdat)
> >>  	return too_many;
> >>  }
> >>  
> >> +/*
> >> + * Check if this base page should be skipped from isolation because
> >> + * it has extra refcounts that will prevent it from being migrated.
> >> + * This code is inspired by similar code in migrate_vma_check_page(),
> >> + * can_split_folio() and folio_migrate_mapping()
> >> + */
> >> +static inline bool page_has_extra_refs(struct page *page,
> >> +					struct address_space *mapping)
> >> +{
> >> +	unsigned long extra_refs;
> >> +	struct folio *folio;
> >> +
> >> +	/*
> >> +	 * Skip this check for pages in ZONE_MOVABLE or MIGRATE_CMA
> >> +	 * pages that can not be long term pinned
> >> +	 */
> >> +	if (is_zone_movable_page(page) || is_migrate_cma_page(page))
> >> +		return false;
> >> +
> >> +	folio = page_folio(page);
> >> +
> >> +	/*
> >> +	 * caller holds a ref already from get_page_unless_zero()
> >> +	 * which is accounted for in folio_expected_refs()
> >> +	 */
> >> +	extra_refs = folio_expected_refs(mapping, folio);
> >> +
> >> +	/*
> >> +	 * This is an admittedly racy check but good enough to determine
> >> +	 * if a page is pinned and can not be migrated
> >> +	 */
> >> +	if ((folio_ref_count(folio) - extra_refs) > folio_mapcount(folio))
> >> +		return true;
> >> +	return false;
> >> +}
> >> +
> >>  /**
> >>   * isolate_migratepages_block() - isolate all migrate-able pages within
> >>   *				  a single pageblock
> >> @@ -992,12 +1028,12 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
> >>  			goto isolate_fail;
> > 
> > Just out of shot, we have ...
> > 
> >                 if (unlikely(!get_page_unless_zero(page)))
> > 
> > This is the perfect opportunity to use folio_get_nontail_page() instead.
> > You get back the folio without having to cast the pointer yourself
> > or call page_folio().  Now you can use a folio throughout your new
> > function, saving a call to compound_head().
> > 
> > For a followup patch, everything in this loop below this point can use
> > the folio ... that's quite a lot of change.
> > 
> >>  		/*
> >> -		 * Migration will fail if an anonymous page is pinned in memory,
> >> -		 * so avoid taking lru_lock and isolating it unnecessarily in an
> >> -		 * admittedly racy check.
> >> +		 * Migration will fail if a page has extra refcounts
> >> +		 * from long term pinning preventing it from migrating,
> >> +		 * so avoid taking lru_lock and isolating it unnecessarily.
> >>  		 */
> > 
> > Isn't "long term pinning" the wrong description of the problem?  Long term
> > pins suggest to me FOLL_LONGTERM.  I think this is simple short term
> > pins that we care about here.
> 
> vfio pins are held for a long time - Steve

So this is a third sense of "pinned pages" that is neither what
filesystems nor the mm means by pinned pages, but whatever it is that
vfio means by pinned pages?  If only "pin" weren't such a desirable
word.  Can somebody explain to me in small words what a vfio pin looks
like because I've tried reading vfio_iommu_type1_pin_pages() and I
don't recognise anything there that looks like pinning in either of
the other two senses.
Matthew Wilcox (Oracle) May 25, 2023, 9:31 p.m. UTC | #5
On Thu, May 25, 2023 at 09:45:34PM +0100, Matthew Wilcox wrote:
> > > Isn't "long term pinning" the wrong description of the problem?  Long term
> > > pins suggest to me FOLL_LONGTERM.  I think this is simple short term
> > > pins that we care about here.
> > 
> > vfio pins are held for a long time - Steve
> 
> So this is a third sense of "pinned pages" that is neither what
> filesystems nor the mm means by pinned pages, but whatever it is that
> vfio means by pinned pages?  If only "pin" weren't such a desirable
> word.  Can somebody explain to me in small words what a vfio pin looks
> like because I've tried reading vfio_iommu_type1_pin_pages() and I
> don't recognise anything there that looks like pinning in either of
> the other two senses.

Oh, I think I found it!  pin_user_pages_remote() is called by
vaddr_get_pfns().  If these are the pages you're concerned about,
then the efficient way to do what you want is simply to call
folio_maybe_dma_pinned().  Far more efficient than the current mess
of total_mapcount().
Khalid Aziz May 26, 2023, 3:44 p.m. UTC | #6
On 5/25/23 15:31, Matthew Wilcox wrote:
> On Thu, May 25, 2023 at 09:45:34PM +0100, Matthew Wilcox wrote:
>>>> Isn't "long term pinning" the wrong description of the problem?  Long term
>>>> pins suggest to me FOLL_LONGTERM.  I think this is simple short term
>>>> pins that we care about here.
>>>
>>> vfio pins are held for a long time - Steve
>>
>> So this is a third sense of "pinned pages" that is neither what
>> filesystems nor the mm means by pinned pages, but whatever it is that
>> vfio means by pinned pages?  If only "pin" weren't such a desirable
>> word.  Can somebody explain to me in small words what a vfio pin looks
>> like because I've tried reading vfio_iommu_type1_pin_pages() and I
>> don't recognise anything there that looks like pinning in either of
>> the other two senses.
> 
> Oh, I think I found it!  pin_user_pages_remote() is called by
> vaddr_get_pfns().  If these are the pages you're concerned about,
> then the efficient way to do what you want is simply to call
> folio_maybe_dma_pinned().  Far more efficient than the current mess
> of total_mapcount().

vfio pinned pages triggered this change. Wouldn't checking refcounts
against mapcount provide a more generalized way of detecting
non-migratable pages?

Thanks,
Khalid
Matthew Wilcox (Oracle) May 26, 2023, 4:44 p.m. UTC | #7
On Fri, May 26, 2023 at 09:44:34AM -0600, Khalid Aziz wrote:
> > Oh, I think I found it!  pin_user_pages_remote() is called by
> > vaddr_get_pfns().  If these are the pages you're concerned about,
> > then the efficient way to do what you want is simply to call
> > folio_maybe_dma_pinned().  Far more efficient than the current mess
> > of total_mapcount().
> 
> vfio pinned pages triggered this change. Wouldn't checking refcounts against
> mapcount provide a more generalized way of detecting non-migratable pages?

Well, you changed the comment to say that we were concerned about
long-term pins.  If we are, then folio_maybe_dma_pinned() is how to test
for long-term pins.  If we want to skip pages which are short-term pinned,
then we need to not change the comment, and keep using mapcount/refcount
differences.
David Hildenbrand May 26, 2023, 4:46 p.m. UTC | #8
On 26.05.23 18:44, Matthew Wilcox wrote:
> On Fri, May 26, 2023 at 09:44:34AM -0600, Khalid Aziz wrote:
>>> Oh, I think I found it!  pin_user_pages_remote() is called by
>>> vaddr_get_pfns().  If these are the pages you're concerned about,
>>> then the efficient way to do what you want is simply to call
>>> folio_maybe_dma_pinned().  Far more efficient than the current mess
>>> of total_mapcount().
>>
>> vfio pinned pages triggered this change. Wouldn't checking refcounts against
>> mapcount provide a more generalized way of detecting non-migratable pages?
> 
> Well, you changed the comment to say that we were concerned about
> long-term pins.  If we are, than folio_maybe_dma_pinned() is how to test
> for long-term pins.  If we want to skip pages which are short-term pinned,
> then we need to not change the comment, and keep using mapcount/refcount
> differences.
> 

folio_maybe_dma_pinned() is all about FOLL_PIN, not FOLL_LONGTERM.

folio_maybe_dma_pinned() would skip migrating any page that has more 
than 1024 references. (shared libraries?)
Matthew Wilcox (Oracle) May 26, 2023, 6:46 p.m. UTC | #9
On Fri, May 26, 2023 at 06:46:15PM +0200, David Hildenbrand wrote:
> On 26.05.23 18:44, Matthew Wilcox wrote:
> > On Fri, May 26, 2023 at 09:44:34AM -0600, Khalid Aziz wrote:
> > > > Oh, I think I found it!  pin_user_pages_remote() is called by
> > > > vaddr_get_pfns().  If these are the pages you're concerned about,
> > > > then the efficient way to do what you want is simply to call
> > > > folio_maybe_dma_pinned().  Far more efficient than the current mess
> > > > of total_mapcount().
> > > 
> > > vfio pinned pages triggered this change. Wouldn't checking refcounts against
> > > mapcount provide a more generalized way of detecting non-migratable pages?
> > 
> > Well, you changed the comment to say that we were concerned about
> > long-term pins.  If we are, than folio_maybe_dma_pinned() is how to test
> > for long-term pins.  If we want to skip pages which are short-term pinned,
> > then we need to not change the comment, and keep using mapcount/refcount
> > differences.
> > 
> 
> folio_maybe_dma_pinned() is all about FOLL_PIN, not FOLL_LONGTERM.

But according to our documentation, FOLL_LONGTERM implies FOLL_PIN.
Anyway, right now, the code skips any pages which are merely FOLL_GET,
so we'll skip fewer pages if we do only skip the FOLL_PIN ones,
regardless if we'd prefer to only skip the FOLL_LONGTERM ones.

> folio_maybe_dma_pinned() would skip migrating any page that has more than
> 1024 references. (shared libraries?)

True, but maybe we should be skipping any page with that many mappings,
given how disruptive it is to the rest of the system to unmap a page
from >1024 processes.
David Hildenbrand May 26, 2023, 6:50 p.m. UTC | #10
On 26.05.23 20:46, Matthew Wilcox wrote:
> On Fri, May 26, 2023 at 06:46:15PM +0200, David Hildenbrand wrote:
>> On 26.05.23 18:44, Matthew Wilcox wrote:
>>> On Fri, May 26, 2023 at 09:44:34AM -0600, Khalid Aziz wrote:
>>>>> Oh, I think I found it!  pin_user_pages_remote() is called by
>>>>> vaddr_get_pfns().  If these are the pages you're concerned about,
>>>>> then the efficient way to do what you want is simply to call
>>>>> folio_maybe_dma_pinned().  Far more efficient than the current mess
>>>>> of total_mapcount().
>>>>
>>>> vfio pinned pages triggered this change. Wouldn't checking refcounts against
>>>> mapcount provide a more generalized way of detecting non-migratable pages?
>>>
>>> Well, you changed the comment to say that we were concerned about
>>> long-term pins.  If we are, than folio_maybe_dma_pinned() is how to test
>>> for long-term pins.  If we want to skip pages which are short-term pinned,
>>> then we need to not change the comment, and keep using mapcount/refcount
>>> differences.
>>>
>>
>> folio_maybe_dma_pinned() is all about FOLL_PIN, not FOLL_LONGTERM.
> 
> But according to our documentation, FOLL_LONGTERM implies FOLL_PIN.

Yes. But folio_maybe_dma_pinned() will indicate both long-term and
short-term pins. There really is no way to distinguish the two,
unfortunately.

> Anyway, right now, the code skips any pages which are merely FOLL_GET,
> so we'll skip fewer pages if we do only skip the FOLL_PIN ones,
> regardless if we'd prefer to only skip the FOLL_LONGTERM ones.
> 
>> folio_maybe_dma_pinned() would skip migrating any page that has more than
>> 1024 references. (shared libraries?)
> 
> True, but maybe we should be skipping any page with that many mappings,
> given how disruptive it is to the rest of the system to unmap a page
> from >1024 processes.
> 

So any user with 1024 processes can fragment physical memory? :/

Sorry, I'd like to minimize the usage of folio_maybe_dma_pinned().

It's all suboptimal but let's not try to make it worse.
John Hubbard May 27, 2023, 2:11 a.m. UTC | #11
On 5/26/23 11:50, David Hildenbrand wrote:
> On 26.05.23 20:46, Matthew Wilcox wrote:
>> On Fri, May 26, 2023 at 06:46:15PM +0200, David Hildenbrand wrote:
>>> On 26.05.23 18:44, Matthew Wilcox wrote:
>>>> On Fri, May 26, 2023 at 09:44:34AM -0600, Khalid Aziz wrote:
>>>>>> Oh, I think I found it!  pin_user_pages_remote() is called by
>>>>>> vaddr_get_pfns().  If these are the pages you're concerned about,
>>>>>> then the efficient way to do what you want is simply to call
>>>>>> folio_maybe_dma_pinned().  Far more efficient than the current mess
>>>>>> of total_mapcount().
>>>>>
>>>>> vfio pinned pages triggered this change. Wouldn't checking refcounts against
>>>>> mapcount provide a more generalized way of detecting non-migratable pages?
>>>>
>>>> Well, you changed the comment to say that we were concerned about
>>>> long-term pins.  If we are, than folio_maybe_dma_pinned() is how to test
>>>> for long-term pins.  If we want to skip pages which are short-term pinned,
>>>> then we need to not change the comment, and keep using mapcount/refcount
>>>> differences.
>>>>
>>>
>>> folio_maybe_dma_pinned() is all about FOLL_PIN, not FOLL_LONGTERM.
>>
>> But according to our documentation, FOLL_LONGTERM implies FOLL_PIN.
> 
> Yes. But folio_maybe_dma_pinned() will indicate both, long-term pins and short-term pins. There really is no way to distinguish both, unfortunately.

Not yet, anyway. :)

> 
>> Anyway, right now, the code skips any pages which are merely FOLL_GET,
>> so we'll skip fewer pages if we do only skip the FOLL_PIN ones,
>> regardless if we'd prefer to only skip the FOLL_LONGTERM ones.
>>
>>> folio_maybe_dma_pinned() would skip migrating any page that has more than
>>> 1024 references. (shared libraries?)
>>
>> True, but maybe we should be skipping any page with that many mappings,
>> given how disruptive it is to the rest of the system to unmap a page
>> from >1024 processes.
>>
> 
> So any user with 1024 processes can fragment physical memory? :/
> 
> Sorry, I'd like to minimize the usage of folio_maybe_dma_pinned().
> 

I was actually thinking that we should minimize any more cases of
fragile mapcount and refcount comparison, which then leads to
Matthew's approach here!


thanks,
Matthew Wilcox (Oracle) May 27, 2023, 3:18 a.m. UTC | #12
On Fri, May 26, 2023 at 07:11:05PM -0700, John Hubbard wrote:
> > So any user with 1024 processes can fragment physical memory? :/
> > 
> > Sorry, I'd like to minimize the usage of folio_maybe_dma_pinned().
> 
> I was actually thinking that we should minimize any more cases of
> fragile mapcount and refcount comparison, which then leads to
> Matthew's approach here!

I was wondering if we shouldn't make folio_maybe_dma_pinned() a little
more accurate.  eg:

        if (folio_test_large(folio))
                return atomic_read(&folio->_pincount) > 0;
	return (unsigned)(folio_ref_count(folio) - folio_mapcount(folio)) >=
			GUP_PIN_COUNTING_BIAS;
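For reference, the difference between the existing small-folio heuristic and the refinement sketched above can be modeled in userspace. GUP_PIN_COUNTING_BIAS is 1024 in the kernel; the helpers below are simplified sketches, not the kernel implementation.

```c
#include <assert.h>
#include <stdbool.h>

#define GUP_PIN_COUNTING_BIAS 1024	/* each FOLL_PIN adds this much */

/* current heuristic for small folios: look at the refcount alone */
static bool maybe_pinned_current(unsigned int refs)
{
	return refs >= GUP_PIN_COUNTING_BIAS;
}

/* proposed refinement: discount one reference per mapping first */
static bool maybe_pinned_proposed(unsigned int refs, unsigned int maps)
{
	return refs - maps >= GUP_PIN_COUNTING_BIAS;
}
```

A shared-library page mapped into roughly 2000 processes carries about 2001 references and trips the current check with no pin at all, which is the false positive raised earlier in the thread; the refined form discounts the mapping references and keeps such pages migratable, while a genuine FOLL_PIN still adds a full bias of references and trips both forms.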
John Hubbard May 28, 2023, 11:49 p.m. UTC | #13
On 5/26/23 20:18, Matthew Wilcox wrote:
> On Fri, May 26, 2023 at 07:11:05PM -0700, John Hubbard wrote:
>>> So any user with 1024 processes can fragment physical memory? :/
>>>
>>> Sorry, I'd like to minimize the usage of folio_maybe_dma_pinned().
>>
>> I was actually thinking that we should minimize any more cases of
>> fragile mapcount and refcount comparison, which then leads to
>> Matthew's approach here!
> 
> I was wondering if we shouldn't make folio_maybe_dma_pinned() a little
> more accurate.  eg:
> 
>          if (folio_test_large(folio))
>                  return atomic_read(&folio->_pincount) > 0;
> 	return (unsigned)(folio_ref_count(folio) - folio_mapcount(folio)) >=
> 			GUP_PIN_COUNTING_BIAS;

I'm trying to figure out what might be wrong with that, but it seems
OK. We must have talked about this earlier, but I recall vaguely that
there was not a lot of concern about the case of a page being
mapped >1024 times. Because pinned or not, it's likely to be effectively
locked into memory due to LRU effects. As mentioned here, too.

Anyway, sure.

A detail:

The unsigned cast, I'm not sure that helps or solves anything, right?
That is, other than bugs, is it possible to get refcount < mapcount?

And if it's only due to bugs, then the casting, again, isn't likely
to mitigate the fallout from whatever mess the bug caused.

thanks,
Matthew Wilcox (Oracle) May 29, 2023, 12:31 a.m. UTC | #14
On Sun, May 28, 2023 at 04:49:52PM -0700, John Hubbard wrote:
> On 5/26/23 20:18, Matthew Wilcox wrote:
> > On Fri, May 26, 2023 at 07:11:05PM -0700, John Hubbard wrote:
> > > > So any user with 1024 processes can fragment physical memory? :/
> > > > 
> > > > Sorry, I'd like to minimize the usage of folio_maybe_dma_pinned().
> > > 
> > > I was actually thinking that we should minimize any more cases of
> > > fragile mapcount and refcount comparison, which then leads to
> > > Matthew's approach here!
> > 
> > I was wondering if we shouldn't make folio_maybe_dma_pinned() a little
> > more accurate.  eg:
> > 
> >          if (folio_test_large(folio))
> >                  return atomic_read(&folio->_pincount) > 0;
> > 	return (unsigned)(folio_ref_count(folio) - folio_mapcount(folio)) >=
> > 			GUP_PIN_COUNTING_BIAS;
> 
> I'm trying to figure out what might be wrong with that, but it seems
> OK. We must have talked about this earlier, but I recall vaguely that
> there was not a lot of concern about the case of a page being
> mapped >1024 times. Because pinned or not, it's likely to be effectively
> locked into memory due to LRU effects. As mentioned here, too.

That was my point of view, but David convinced me that a hostile process
can effectively lock its own memory into place.

> Anyway, sure.
> 
> A detail:
> 
> The unsigned cast, I'm not sure that helps or solves anything, right?
> That is, other than bugs, is it possible to get refcount < mapcount?
> 
> And if it's only due to bugs, then the casting, again, isn't likely
> to mitigate the fallout from whatever mess the bug caused.

I wasn't thinking too hard about the cast.  If the caller has the folio
lock, I don't think it's possible for refcount < mapcount.  This caller
has a refcount, but doesn't hold the lock, so it is possible for them
to read mapcount first, then have both mapcount and refcount decremented
and see refcount < mapcount.

I don't think it matters too much.  We don't hold the folio lock, so
it might transition from pinned to unpinned as much as a refcount might
be decremented or a mapcount incremented.  What's important is that a
hostile process can't prevent memory from being moved indefinitely.

David, have I missed something else?
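To make the cast question concrete, here is a minimal model (a sketch, not kernel code) of the signed and unsigned forms of the comparison:

```c
#include <assert.h>
#include <stdbool.h>

#define GUP_PIN_COUNTING_BIAS 1024

/* signed comparison: a racy snapshot with refs < maps looks unpinned */
static bool maybe_pinned_signed(int refs, int maps)
{
	return refs - maps >= GUP_PIN_COUNTING_BIAS;
}

/* unsigned comparison: the same snapshot wraps to a huge value and
 * conservatively reads as pinned, which only costs a skipped page
 * rather than a migration attempt that is bound to fail */
static bool maybe_pinned_unsigned(int refs, int maps)
{
	return (unsigned int)(refs - maps) >= GUP_PIN_COUNTING_BIAS;
}
```

With a racy snapshot of refs=2, maps=3, the signed form says "unpinned" while the unsigned form wraps and says "maybe pinned"; since neither caller holds the folio lock, the conservative answer is the safer one.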
Huang, Ying May 29, 2023, 3:01 a.m. UTC | #15
Khalid Aziz <khalid.aziz@oracle.com> writes:

> Pages pinned in memory through extra refcounts can not be migrated.
> Currently as isolate_migratepages_block() scans pages for
> compaction, it skips any pinned anonymous pages. All non-migratable
> pages should be skipped and not just the anonymous pinned pages.
> This patch adds a check for extra refcounts on a page to determine
> if the page can be migrated.  This was seen as a real issue on a
> customer workload where a large number of pages were pinned by vfio
> on the host and any attempts to allocate hugepages resulted in
> significant amount of cpu time spent in either direct compaction or
> in kcompactd scanning vfio pinned pages over and over again that can
> not be migrated. These are the changes in relevant stats with this
> patch for a test run of this scenario:
>
> 				Before			After
> compact_migrate_scanned		329,798,858		370,984,387
> compact_free_scanned		40,478,406		25,843,262
> compact_isolated		135,470,452		777,235
> pgmigrate_success		544,255			507,325
> pgmigrate_fail			134,616,282		47
> kcompactd CPU time		5:12.81			0:12.28
>
> Before the patch, large number of pages were isolated but most of
> them failed to migrate.
>
> Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com>
> Suggested-by: Steve Sistare <steven.sistare@oracle.com>
> Cc: Khalid Aziz <khalid@kernel.org>
> ---
> v4:
> 	- Use existing folio_expected_refs() function (Suggested
> 	  by Huang, Ying)
> 	- Use folio functions
> 	- Take into account contig allocations when checking for
> 	  long temr pinning and skip pages in ZONE_MOVABLE and
> 	  MIGRATE_CMA type pages (Suggested by David Hildenbrand)
> 	- Use folio version of total_mapcount() instead of
> 	  page_mapcount() (Suggested by Baolin Wang)
>
> v3:
> 	- Account for extra ref added by get_page_unless_zero() earlier
> 	  in isolate_migratepages_block() (Suggested by Huang, Ying)
> 	- Clean up computation of extra refs to be consistent 
> 	  (Suggested by Huang, Ying)
>
> v2:
> 	- Update comments in the code (Suggested by Andrew)
> 	- Use PagePrivate() instead of page_has_private() (Suggested
> 	  by Matthew)
> 	- Pass mapping to page_has_extrarefs() (Suggested by Matthew)
> 	- Use page_ref_count() (Suggested by Matthew)
> 	- Rename is_pinned_page() to reflect its function more
> 	  accurately (Suggested by Matthew)
>
>  include/linux/migrate.h | 16 +++++++++++++++
>  mm/compaction.c         | 44 +++++++++++++++++++++++++++++++++++++----
>  mm/migrate.c            | 14 -------------
>  3 files changed, 56 insertions(+), 18 deletions(-)
>
> diff --git a/include/linux/migrate.h b/include/linux/migrate.h
> index 6241a1596a75..4f59e15eae99 100644
> --- a/include/linux/migrate.h
> +++ b/include/linux/migrate.h
> @@ -141,6 +141,22 @@ const struct movable_operations *page_movable_ops(struct page *page)
>  		((unsigned long)page->mapping - PAGE_MAPPING_MOVABLE);
>  }
>  
> +static inline
> +int folio_expected_refs(struct address_space *mapping,
> +		struct folio *folio)

I don't think that it's necessary to make this function inline.  It
isn't called in hot path.

> +{
> +	int refs = 1;
> +
> +	if (!mapping)
> +		return refs;
> +
> +	refs += folio_nr_pages(folio);
> +	if (folio_test_private(folio))
> +		refs++;
> +
> +	return refs;
> +}
> +
>  #ifdef CONFIG_NUMA_BALANCING
>  int migrate_misplaced_page(struct page *page, struct vm_area_struct *vma,
>  			   int node);
> diff --git a/mm/compaction.c b/mm/compaction.c
> index 5a9501e0ae01..b548e05f0349 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -764,6 +764,42 @@ static bool too_many_isolated(pg_data_t *pgdat)
>  	return too_many;
>  }
>  
> +/*
> + * Check if this base page should be skipped from isolation because
> + * it has extra refcounts that will prevent it from being migrated.
> + * This code is inspired by similar code in migrate_vma_check_page(),
> + * can_split_folio() and folio_migrate_mapping()
> + */
> +static inline bool page_has_extra_refs(struct page *page,
> +					struct address_space *mapping)
> +{
> +	unsigned long extra_refs;

s/extra_refs/expected_refs/
?

> +	struct folio *folio;
> +
> +	/*
> +	 * Skip this check for pages in ZONE_MOVABLE or MIGRATE_CMA
> +	 * pages that can not be long term pinned
> +	 */
> +	if (is_zone_movable_page(page) || is_migrate_cma_page(page))
> +		return false;

I suggest to move these 2 checks out to the place before calling the
function.  Or change the name of the function.

> +
> +	folio = page_folio(page);
> +
> +	/*
> +	 * caller holds a ref already from get_page_unless_zero()
> +	 * which is accounted for in folio_expected_refs()
> +	 */
> +	extra_refs = folio_expected_refs(mapping, folio);
> +
> +	/*
> +	 * This is an admittedly racy check but good enough to determine
> +	 * if a page is pinned and can not be migrated
> +	 */
> +	if ((folio_ref_count(folio) - extra_refs) > folio_mapcount(folio))
> +		return true;
> +	return false;
> +}
> +
>  /**
>   * isolate_migratepages_block() - isolate all migrate-able pages within
>   *				  a single pageblock
> @@ -992,12 +1028,12 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
>  			goto isolate_fail;
>  
>  		/*
> -		 * Migration will fail if an anonymous page is pinned in memory,
> -		 * so avoid taking lru_lock and isolating it unnecessarily in an
> -		 * admittedly racy check.
> +		 * Migration will fail if a page has extra refcounts
> +		 * from long term pinning preventing it from migrating,
> +		 * so avoid taking lru_lock and isolating it unnecessarily.
>  		 */
>  		mapping = page_mapping(page);
> -		if (!mapping && (page_count(page) - 1) > total_mapcount(page))
> +		if (!cc->alloc_contig && page_has_extra_refs(page, mapping))
>  			goto isolate_fail_put;
>  
>  		/*
> diff --git a/mm/migrate.c b/mm/migrate.c
> index db3f154446af..a2f3e5834996 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -385,20 +385,6 @@ void pmd_migration_entry_wait(struct mm_struct *mm, pmd_t *pmd)
>  }
>  #endif
>  
> -static int folio_expected_refs(struct address_space *mapping,
> -		struct folio *folio)
> -{
> -	int refs = 1;
> -	if (!mapping)
> -		return refs;
> -
> -	refs += folio_nr_pages(folio);
> -	if (folio_test_private(folio))
> -		refs++;
> -
> -	return refs;
> -}
> -
>  /*
>   * Replace the page in the mapping.
>   *

Best Regards,
Huang, Ying
David Hildenbrand May 29, 2023, 9:25 a.m. UTC | #16
On 29.05.23 02:31, Matthew Wilcox wrote:
> On Sun, May 28, 2023 at 04:49:52PM -0700, John Hubbard wrote:
>> On 5/26/23 20:18, Matthew Wilcox wrote:
>>> On Fri, May 26, 2023 at 07:11:05PM -0700, John Hubbard wrote:
>>>>> So any user with 1024 processes can fragment physical memory? :/
>>>>>
>>>>> Sorry, I'd like to minimize the usage of folio_maybe_dma_pinned().
>>>>
>>>> I was actually thinking that we should minimize any more cases of
>>>> fragile mapcount and refcount comparison, which then leads to
>>>> Matthew's approach here!
>>>
>>> I was wondering if we shouldn't make folio_maybe_dma_pinned() a little
>>> more accurate.  eg:
>>>
>>>           if (folio_test_large(folio))
>>>                   return atomic_read(&folio->_pincount) > 0;
>>> 	return (unsigned)(folio_ref_count(folio) - folio_mapcount(folio)) >=
>>> 			GUP_PIN_COUNTING_BIAS;
>>
>> I'm trying to figure out what might be wrong with that, but it seems
>> OK. We must have talked about this earlier, but I recall vaguely that
>> there was not a lot of concern about the case of a page being mapped
>> > 1024 times. Because pinned or not, it's likely to be effectively
>> locked into memory due to LRU effects. As mentioned here, too.
> 
> That was my point of view, but David convinced me that a hostile process
> can effectively lock its own memory into place.
> 

1) My opinion on this optimization

Before I start going into detail, let me first phrase my opinion so we 
are on the same page:

"a tiny fraction of Linux installations makes heavy use of long-term 
pinning -- the *single* mechanism that completely *destroys* the whole 
purpose of memory compaction -- and the users complain about memory 
compaction overhead. So we are talking about optimizing for that by 
eventually harming *everybody else*."

Ehm ... I'm all for reasonable optimization, but not at any price.

We don't care about a handful of long-term pinned pages in the system, 
this is really about vfio long-term pinning a significant amount of 
system RAM, and we only care about shmem here.


*maybe* there is an issue with page migration when we have many page 
mappings, but (a) it's a separate issue and to be dealt with separately, 
not buried into such checks (b) it's unclear how many page mappings are 
too many, the magic number 1024 is just a random number (c) it needs 
much finer control (hostile processes).


2) mapcount vs. pagecount

Now, doing these mapcount vs. pagecount checks is perfectly reasonable 
(see mm/ksm.c) as long as we know what we're doing. For example, we have to 
make sure that a possible compound page cannot get split concurrently 
(e.g., hold a reference). It's been used forever, I consider it stable.

I completely agree that we should be careful with such mapcount vs. 
pagecount checks, and if we can use something better, let's use 
something *better*.


3) page_maybe_dma_pinned()

Now, why do I dislike bringing up page_maybe_dma_pinned() [IOW, why is 
it not better]? Besides it ignoring FOLL_GET for now, that might be 
fixed at some point.

I think we agree that order-0 pages are the problem, because we get 
guaranteed false positives with many mappings (not just on speculative 
page pinnings). For these order-0 pages, it's perfectly reasonable to 
check page_maybe_dma_pinned() *as long as* we know the number of 
mappings is very small.

I don't consider anon pages the issue here, we barely get 1024 mappings 
(not even with KSM), and it's much harder to exploit because you cannot 
simply get new mappings via mmap(), only via fork().

In fact, we could optimize easily for order-0 anon pages if we'd need 
to: I have a patch lying around, it just wasn't really worth it for now, 
because there is only a single relevant page_maybe_dma_pinned() call in 
vmscan that could benefit:

https://github.com/davidhildenbrand/linux/commit/0575860d064694d4e2f307b2c20a880a6a7b59ab

We cannot do the same for pagecache pages, so we would possibly 
introduce harm by carelessly checking page_maybe_dma_pinned() on pages
with many mappings.


4) folio_maybe_dma_longterm_pinned() ?

I thought yesterday if we'd want something like 
folio_maybe_dma_longterm_pinned() here. Essentially using what we 
learned about long-term pinning of fs pages:

(1) ZONE_MOVABLE, MIGRATE_CMA -> "return false;"
(2) If !anon, !hugetlb, !shmem -> "return false;"
(3) "return folio_maybe_dma_pinned()"

Yes, above would easily create false-positives for short-term pinned 
pages (anon/hugetlb/shmem), but would never create false-positives for 
any other page (shared library ...).


We would use it in the following way:

bool skip_folio_in_isolation()
{
	/*
	 * Avoid skipping pages that are short-term pinned, the pin
	 * might go away any moment and we'll succeed to migrate.
	 *
	 * We get false positives for short-term pinned anon, shmem and
	 * hugetlb pages for now, but such short-term pins are transient.
	 */
	if (!folio_maybe_dma_longterm_pinned())
		return false;
	/*
	 * order-0 pages with many mappings can easily be confused
	 * for pinned pages and this could be exploited by
	 * malicious user-space to cause fragmentation. This is only
	 * an optimization, so if a page (especially shmem) is mapped
	 * many times, we'll rather try migrating it instead of
	 * accidentally skipping it all the time.
	 */
	return folio_order(folio) != 0 || total_mappings <= 32;
}

Someone long-term pins a shmem page with many mappings? Too bad, we 
don't optimize for that and still try migrating it.


BUT, I am still confused if we want to check here for "any additional 
references", which is what mapcount vs. refcount is, or 
folio_maybe_dma_longterm_pinned().

Of course, we could similarly write a variant of skip_folio_in_isolation:

bool skip_folio_in_isolation()
{
	/*
	 * If a page is not pinned, try migrating it. Note that this
	 * does not consider any FOLL_GET used for DMA yet.
	 */
	if (!folio_maybe_dma_pinned())
		return false;
	/*
	 * order-0 pages with many mappings can easily be confused
	 * for pinned pages and this could be exploited by
	 * malicious user-space to cause fragmentation. This is only
	 * an optimization, so if a page is mapped
	 * many times, we'll rather try migrating it instead of
	 * accidentally skipping it all the time.
	 */
	return folio_order(folio) != 0 || total_mappings <= 32;
}


As long as FOLL_GET is still used for DMA, the mapcount vs. pagecount 
checks might be better ... but it depends on if we care about short-term 
or long-term pinned pages here.

>> Anyway, sure.
>>
>> A detail:
>>
>> The unsigned cast, I'm not sure that helps or solves anything, right?
>> That is, other than bugs, is it possible to get refcount < mapcount?

BUG IMHO.

>>
>> And if it's only due to bugs, then the casting, again, isn't likely to
>> going to mitigate the fallout from whatever mess the bug caused.
> 
> I wasn't thinking too hard about the cast.  If the caller has the folio
> lock, I don't think it's possible for refcount < mapcount.  This caller
> has a refcount, but doesn't hold the lock, so it is possible for them
> to read mapcount first, then have both mapcount and refcount decremented
> and see refcount < mapcount.
> 
> I don't think it matters too much.  We don't hold the folio lock, so
> it might transition from pinned to unpinned as much as a refcount might
> be decremented or a mapcount incremented.  What's important is that a
> hostile process can't prevent memory from being moved indefinitely.
> 
> David, have I missed something else?


What I learned from staring at the code in mm/ksm.c:write_protect_page() 
for too long a while ago is that:

(1) Mapping a page first increments the refcount, then the mapcount
(2) Unmapping a page first decrements the mapcount, then the refcount

So the mapcount is supposed to be always larger than the refcount. 
Especially, if you take a snapshot of both (read first the mapcount, 
then the mapcount).

A hostile process wouldn't be able to block compaction here forever, 
even if we accidentally would make the wrong call once when deciding 
whether to isolate a page. It would work on the next attempt.

That's the difference to page_maybe_dma_pinned(), which can be made to 
consistently block compaction.


[sorry for the lengthy mail]
Khalid Aziz May 30, 2023, 3:42 p.m. UTC | #17
On 5/29/23 03:25, David Hildenbrand wrote:
> On 29.05.23 02:31, Matthew Wilcox wrote:
>> On Sun, May 28, 2023 at 04:49:52PM -0700, John Hubbard wrote:
>>> On 5/26/23 20:18, Matthew Wilcox wrote:
>>>> On Fri, May 26, 2023 at 07:11:05PM -0700, John Hubbard wrote:
>>>>>> So any user with 1024 processes can fragment physical memory? :/
>>>>>>
>>>>>> Sorry, I'd like to minimize the usage of folio_maybe_dma_pinned().
>>>>>
>>>>> I was actually thinking that we should minimize any more cases of
>>>>> fragile mapcount and refcount comparison, which then leads to
>>>>> Matthew's approach here!
>>>>
>>>> I was wondering if we shouldn't make folio_maybe_dma_pinned() a little
>>>> more accurate.  eg:
>>>>
>>>>           if (folio_test_large(folio))
>>>>                   return atomic_read(&folio->_pincount) > 0;
>>>>     return (unsigned)(folio_ref_count(folio) - folio_mapcount(folio)) >=
>>>>             GUP_PIN_COUNTING_BIAS;
>>>
>>> I'm trying to figure out what might be wrong with that, but it seems
>>> OK. We must have talked about this earlier, but I recall vaguely that
>>> there was not a lot of concern about the case of a page being mapped
>>> > 1024 times. Because pinned or not, it's likely to be effectively
>>> locked into memory due to LRU effects. As mentioned here, too.
>>
>> That was my point of view, but David convinced me that a hostile process
>> can effectively lock its own memory into place.
>>
> 
> 1) My opinion on this optimization
> 
> Before I start going into detail, let me first phrase my opinion so we are on the same page:
> 
> "a tiny fraction of Linux installations makes heavy use of long-term pinning -- the *single* mechanism that completely 
> *destroys* the whole purpose of memory compaction -- and the users complain about memory compaction overhead. So we are 
> talking about optimizing for that by eventually harming *everybody else*."
> 
> Ehm ... I'm all for reasonable optimization, but not at any price.
> 
> We don't care about a handful of long-term pinned pages in the system, this is really about vfio long-term pinning a 
> significant amount of system RAM, and we only care about shmem here.
> 
> 
> *maybe* there is an issue with page migration when we have many page mappings, but (a) it's a separate issue and to be 
> dealt with separately, not buried into such checks (b) it's unclear how many page mappings are too many, the magic 
> number 1024 is just a random number (c) it needs much finer control (hostile processes).
> 
> 
> 2) mapcount vs. pagecount
> 
> Now, doing these mapcount vs. pagecount checks is perfectly reasonable (see mm/ksm.c) as long as know what we're doing. 
> For example, we have to make sure that a possible compound page cannot get split concurrently (e.g., hold a reference). 
> It's been used forever, I consider it stable.
> 
> I completely agree that we should be careful with such mapcount vs. pagecount checks, and if we can use something 
> better, let's use something *better*.

When we have a reliable folio_maybe_dma_longterm_pinned() function, it will be better to call that instead of doing 
refcount vs mapcount check. Until that better function to check for pinned pages is in place, may I propose that the 
current patch fixes a customer problem though not optimally and is a good enough working solution. When a better 
function is in place, page_has_extra_refs() function can be updated to rely on this other function instead of refcount 
vs mapcount.

Thanks,
Khalid


Andrew Morton June 9, 2023, 10:11 p.m. UTC | #18
On Tue, 30 May 2023 09:42:33 -0600 Khalid Aziz <khalid.aziz@oracle.com> wrote:

> > I completely agree that we should be careful with such mapcount vs. pagecount checks, and if we can use something 
> > better, let's use something *better*.
> 
> When we have a reliable folio_maybe_dma_longterm_pinned() function, it will be better to call that instead of doing 
> refcount vs mapcount check. Until that better function to check for pinned pages is in place, may I propose that the 
> current patch fixes a customer problem though not optimally and is a good enough working solution. When a better 
> function is in place, page_has_extra_refs() function can be updated to rely on this other function instead of refcount 
> vs mapcount.

We seem rather stuck with this patch.  I think I'll drop it while we
ponder a way forward.
Khalid Aziz June 9, 2023, 11:28 p.m. UTC | #19
On 6/9/23 16:11, Andrew Morton wrote:
> On Tue, 30 May 2023 09:42:33 -0600 Khalid Aziz <khalid.aziz@oracle.com> wrote:
> 
>>> I completely agree that we should be careful with such mapcount vs. pagecount checks, and if we can use something
>>> better, let's use something *better*.
>>
>> When we have a reliable folio_maybe_dma_longterm_pinned() function, it will be better to call that instead of doing
>> refcount vs mapcount check. Until that better function to check for pinned pages is in place, may I propose that the
>> current patch fixes a customer problem though not optimally and is a good enough working solution. When a better
>> function is in place, page_has_extra_refs() function can be updated to rely on this other function instead of refcount
>> vs mapcount.
> 
> We seem rather stuck with this patch.  I think I'll drop it while we
> ponder a way forward.

Hi Andrew,

I would still recommend that we use this patch and fix the existing problem, unless there are objections from other 
people. This is not an optimal solution but it works and is similar to what kernel does elsewhere. When a better 
function to check pinned pages is available, we start using that function instead.

Thanks,
Khalid
diff mbox series

Patch

diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index 6241a1596a75..4f59e15eae99 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -141,6 +141,22 @@  const struct movable_operations *page_movable_ops(struct page *page)
 		((unsigned long)page->mapping - PAGE_MAPPING_MOVABLE);
 }
 
+static inline
+int folio_expected_refs(struct address_space *mapping,
+		struct folio *folio)
+{
+	int refs = 1;
+
+	if (!mapping)
+		return refs;
+
+	refs += folio_nr_pages(folio);
+	if (folio_test_private(folio))
+		refs++;
+
+	return refs;
+}
+
 #ifdef CONFIG_NUMA_BALANCING
 int migrate_misplaced_page(struct page *page, struct vm_area_struct *vma,
 			   int node);
diff --git a/mm/compaction.c b/mm/compaction.c
index 5a9501e0ae01..b548e05f0349 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -764,6 +764,42 @@  static bool too_many_isolated(pg_data_t *pgdat)
 	return too_many;
 }
 
+/*
+ * Check if this base page should be skipped from isolation because
+ * it has extra refcounts that will prevent it from being migrated.
+ * This code is inspired by similar code in migrate_vma_check_page(),
+ * can_split_folio() and folio_migrate_mapping()
+ */
+static inline bool page_has_extra_refs(struct page *page,
+					struct address_space *mapping)
+{
+	unsigned long extra_refs;
+	struct folio *folio;
+
+	/*
+	 * Skip this check for pages in ZONE_MOVABLE or MIGRATE_CMA
+	 * pages that can not be long term pinned
+	 */
+	if (is_zone_movable_page(page) || is_migrate_cma_page(page))
+		return false;
+
+	folio = page_folio(page);
+
+	/*
+	 * caller holds a ref already from get_page_unless_zero()
+	 * which is accounted for in folio_expected_refs()
+	 */
+	extra_refs = folio_expected_refs(mapping, folio);
+
+	/*
+	 * This is an admittedly racy check but good enough to determine
+	 * if a page is pinned and can not be migrated
+	 */
+	if ((folio_ref_count(folio) - extra_refs) > folio_mapcount(folio))
+		return true;
+	return false;
+}
+
 /**
  * isolate_migratepages_block() - isolate all migrate-able pages within
  *				  a single pageblock
@@ -992,12 +1028,12 @@  isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
 			goto isolate_fail;
 
 		/*
-		 * Migration will fail if an anonymous page is pinned in memory,
-		 * so avoid taking lru_lock and isolating it unnecessarily in an
-		 * admittedly racy check.
+		 * Migration will fail if a page has extra refcounts
+		 * from long term pinning preventing it from migrating,
+		 * so avoid taking lru_lock and isolating it unnecessarily.
 		 */
 		mapping = page_mapping(page);
-		if (!mapping && (page_count(page) - 1) > total_mapcount(page))
+		if (!cc->alloc_contig && page_has_extra_refs(page, mapping))
 			goto isolate_fail_put;
 
 		/*
diff --git a/mm/migrate.c b/mm/migrate.c
index db3f154446af..a2f3e5834996 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -385,20 +385,6 @@  void pmd_migration_entry_wait(struct mm_struct *mm, pmd_t *pmd)
 }
 #endif
 
-static int folio_expected_refs(struct address_space *mapping,
-		struct folio *folio)
-{
-	int refs = 1;
-	if (!mapping)
-		return refs;
-
-	refs += folio_nr_pages(folio);
-	if (folio_test_private(folio))
-		refs++;
-
-	return refs;
-}
-
 /*
  * Replace the page in the mapping.
  *