[v1,2/2] mm/migrate: move NUMA hinting fault folio isolation + checks under PTL

Message ID 20240620212935.656243-3-david@redhat.com (mailing list archive)
State New
Series mm/migrate: move NUMA hinting fault folio isolation + checks under PTL

Commit Message

David Hildenbrand June 20, 2024, 9:29 p.m. UTC
Currently we always take a folio reference even if migration will not
be attempted or isolation fails, requiring us to grab+drop an additional
reference.

Further, we end up calling folio_likely_mapped_shared() while the folio
might have already been unmapped, because after we dropped the PTL, that
can easily happen. We want to stop touching mapcounts and friends from
such context, and only call folio_likely_mapped_shared() while the folio
is still mapped: mapcount information is pretty much stale and unreliable
otherwise.

So let's move checks into numamigrate_isolate_folio(), rename that
function to migrate_misplaced_folio_prepare(), and call that function
from callsites where we call migrate_misplaced_folio(), but still with
the PTL held.

We can now stop taking temporary folio references, and really only take
a reference if folio isolation succeeded. Doing the
folio_likely_mapped_shared() + folio isolation under PT lock is now similar
to how we handle MADV_PAGEOUT.

While at it, combine the folio_is_file_lru() checks.

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 include/linux/migrate.h |  7 ++++
 mm/huge_memory.c        |  8 ++--
 mm/memory.c             |  9 +++--
 mm/migrate.c            | 81 +++++++++++++++++++----------------------
 4 files changed, 55 insertions(+), 50 deletions(-)
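
For quick orientation, the resulting caller pattern looks roughly like
this (a condensed sketch of the do_numa_page() hunk quoted below; PTE
revalidation, statistics and the failure path that re-takes the PTL are
omitted):

	target_nid = numa_migrate_prep(folio, vmf, vmf->address, nid, &flags);
	if (target_nid == NUMA_NO_NODE)
		goto out_map;		/* keep the folio mapped where it is */

	/* Shared/dirty checks + LRU isolation now all run under the PTL ... */
	if (migrate_misplaced_folio_prepare(folio, vma, target_nid)) {
		flags |= TNF_MIGRATE_FAIL;
		goto out_map;
	}

	/* ... and only successful isolation took a folio reference. */
	pte_unmap_unlock(vmf->pte, vmf->ptl);

	/* Migration consumes the isolation reference; no extra get/put. */
	if (migrate_misplaced_folio(folio, vma, target_nid))
		flags |= TNF_MIGRATE_FAIL;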

Comments

Zi Yan June 21, 2024, 2:05 a.m. UTC | #1
On 20 Jun 2024, at 17:29, David Hildenbrand wrote:

> Currently we always take a folio reference even if migration will not
> even be tried or isolation failed, requiring us to grab+drop an additional
> reference.
>
> Further, we end up calling folio_likely_mapped_shared() while the folio
> might have already been unmapped, because after we dropped the PTL, that
> can easily happen. We want to stop touching mapcounts and friends from
> such context, and only call folio_likely_mapped_shared() while the folio
> is still mapped: mapcount information is pretty much stale and unreliable
> otherwise.
>
> So let's move checks into numamigrate_isolate_folio(), rename that
> function to migrate_misplaced_folio_prepare(), and call that function
> from callsites where we call migrate_misplaced_folio(), but still with
> the PTL held.
>
> We can now stop taking temporary folio references, and really only take
> a reference if folio isolation succeeded. Doing the
> folio_likely_mapped_shared() + golio isolation under PT lock is now similar
> to how we handle MADV_PAGEOUT.
>
> While at it, combine the folio_is_file_lru() checks.
>
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>  include/linux/migrate.h |  7 ++++
>  mm/huge_memory.c        |  8 ++--
>  mm/memory.c             |  9 +++--
>  mm/migrate.c            | 81 +++++++++++++++++++----------------------
>  4 files changed, 55 insertions(+), 50 deletions(-)
>
> diff --git a/include/linux/migrate.h b/include/linux/migrate.h
> index f9d92482d117..644be30b69c8 100644
> --- a/include/linux/migrate.h
> +++ b/include/linux/migrate.h
> @@ -139,9 +139,16 @@ const struct movable_operations *page_movable_ops(struct page *page)
>  }
>
>  #ifdef CONFIG_NUMA_BALANCING
> +int migrate_misplaced_folio_prepare(struct folio *folio,
> +		struct vm_area_struct *vma, int node);
>  int migrate_misplaced_folio(struct folio *folio, struct vm_area_struct *vma,
>  			   int node);
>  #else
> +static inline int migrate_misplaced_folio_prepare(struct folio *folio,
> +		struct vm_area_struct *vma, int node)
> +{
> +	return -EAGAIN; /* can't migrate now */
> +}
>  static inline int migrate_misplaced_folio(struct folio *folio,
>  					 struct vm_area_struct *vma, int node)
>  {
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index fc27dabcd8e3..4b2817bb2c7d 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1688,11 +1688,13 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
>  	if (node_is_toptier(nid))
>  		last_cpupid = folio_last_cpupid(folio);
>  	target_nid = numa_migrate_prep(folio, vmf, haddr, nid, &flags);
> -	if (target_nid == NUMA_NO_NODE) {
> -		folio_put(folio);
> +	if (target_nid == NUMA_NO_NODE)
> +		goto out_map;
> +	if (migrate_misplaced_folio_prepare(folio, vma, target_nid)) {
> +		flags |= TNF_MIGRATE_FAIL;
>  		goto out_map;
>  	}
> -
> +	/* The folio is isolated and isolation code holds a folio reference. */
>  	spin_unlock(vmf->ptl);
>  	writable = false;
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 118660de5bcc..4fd1ecfced4d 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -5207,8 +5207,6 @@ int numa_migrate_prep(struct folio *folio, struct vm_fault *vmf,
>  {
>  	struct vm_area_struct *vma = vmf->vma;
>
> -	folio_get(folio);
> -
>  	/* Record the current PID acceesing VMA */
>  	vma_set_access_pid_bit(vma);

Without taking a reference here, is it safe to call mpol_misplaced() on the folio
at the return? Can this folio be freed?

--
Best Regards,
Yan, Zi
Baolin Wang June 21, 2024, 4:07 a.m. UTC | #2
On 2024/6/21 05:29, David Hildenbrand wrote:
> Currently we always take a folio reference even if migration will not
> even be tried or isolation failed, requiring us to grab+drop an additional
> reference.
> 
> Further, we end up calling folio_likely_mapped_shared() while the folio
> might have already been unmapped, because after we dropped the PTL, that
> can easily happen. We want to stop touching mapcounts and friends from
> such context, and only call folio_likely_mapped_shared() while the folio
> is still mapped: mapcount information is pretty much stale and unreliable
> otherwise.
> 
> So let's move checks into numamigrate_isolate_folio(), rename that
> function to migrate_misplaced_folio_prepare(), and call that function
> from callsites where we call migrate_misplaced_folio(), but still with
> the PTL held.
> 
> We can now stop taking temporary folio references, and really only take
> a reference if folio isolation succeeded. Doing the
> folio_likely_mapped_shared() + golio isolation under PT lock is now similar
> to how we handle MADV_PAGEOUT.

s/golio/folio

Makes sense to me. Feel free to add:
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>

> 
> While at it, combine the folio_is_file_lru() checks.
> 
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>   include/linux/migrate.h |  7 ++++
>   mm/huge_memory.c        |  8 ++--
>   mm/memory.c             |  9 +++--
>   mm/migrate.c            | 81 +++++++++++++++++++----------------------
>   4 files changed, 55 insertions(+), 50 deletions(-)
> 
> diff --git a/include/linux/migrate.h b/include/linux/migrate.h
> index f9d92482d117..644be30b69c8 100644
> --- a/include/linux/migrate.h
> +++ b/include/linux/migrate.h
> @@ -139,9 +139,16 @@ const struct movable_operations *page_movable_ops(struct page *page)
>   }
>   
>   #ifdef CONFIG_NUMA_BALANCING
> +int migrate_misplaced_folio_prepare(struct folio *folio,
> +		struct vm_area_struct *vma, int node);
>   int migrate_misplaced_folio(struct folio *folio, struct vm_area_struct *vma,
>   			   int node);
>   #else
> +static inline int migrate_misplaced_folio_prepare(struct folio *folio,
> +		struct vm_area_struct *vma, int node)
> +{
> +	return -EAGAIN; /* can't migrate now */
> +}
>   static inline int migrate_misplaced_folio(struct folio *folio,
>   					 struct vm_area_struct *vma, int node)
>   {
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index fc27dabcd8e3..4b2817bb2c7d 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1688,11 +1688,13 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
>   	if (node_is_toptier(nid))
>   		last_cpupid = folio_last_cpupid(folio);
>   	target_nid = numa_migrate_prep(folio, vmf, haddr, nid, &flags);
> -	if (target_nid == NUMA_NO_NODE) {
> -		folio_put(folio);
> +	if (target_nid == NUMA_NO_NODE)
> +		goto out_map;
> +	if (migrate_misplaced_folio_prepare(folio, vma, target_nid)) {
> +		flags |= TNF_MIGRATE_FAIL;
>   		goto out_map;
>   	}
> -
> +	/* The folio is isolated and isolation code holds a folio reference. */
>   	spin_unlock(vmf->ptl);
>   	writable = false;
>   
> diff --git a/mm/memory.c b/mm/memory.c
> index 118660de5bcc..4fd1ecfced4d 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -5207,8 +5207,6 @@ int numa_migrate_prep(struct folio *folio, struct vm_fault *vmf,
>   {
>   	struct vm_area_struct *vma = vmf->vma;
>   
> -	folio_get(folio);
> -
>   	/* Record the current PID acceesing VMA */
>   	vma_set_access_pid_bit(vma);
>   
> @@ -5345,10 +5343,13 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
>   	else
>   		last_cpupid = folio_last_cpupid(folio);
>   	target_nid = numa_migrate_prep(folio, vmf, vmf->address, nid, &flags);
> -	if (target_nid == NUMA_NO_NODE) {
> -		folio_put(folio);
> +	if (target_nid == NUMA_NO_NODE)
> +		goto out_map;
> +	if (migrate_misplaced_folio_prepare(folio, vma, target_nid)) {
> +		flags |= TNF_MIGRATE_FAIL;
>   		goto out_map;
>   	}
> +	/* The folio is isolated and isolation code holds a folio reference. */
>   	pte_unmap_unlock(vmf->pte, vmf->ptl);
>   	writable = false;
>   	ignore_writable = true;
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 0307b54879a0..27f070f64f27 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -2530,9 +2530,37 @@ static struct folio *alloc_misplaced_dst_folio(struct folio *src,
>   	return __folio_alloc_node(gfp, order, nid);
>   }
>   
> -static int numamigrate_isolate_folio(pg_data_t *pgdat, struct folio *folio)
> +/*
> + * Prepare for calling migrate_misplaced_folio() by isolating the folio if
> + * permitted. Must be called with the PTL still held.
> + */
> +int migrate_misplaced_folio_prepare(struct folio *folio,
> +		struct vm_area_struct *vma, int node)
>   {
>   	int nr_pages = folio_nr_pages(folio);
> +	pg_data_t *pgdat = NODE_DATA(node);
> +
> +	if (folio_is_file_lru(folio)) {
> +		/*
> +		 * Do not migrate file folios that are mapped in multiple
> +		 * processes with execute permissions as they are probably
> +		 * shared libraries.
> +		 *
> +		 * See folio_likely_mapped_shared() on possible imprecision
> +		 * when we cannot easily detect if a folio is shared.
> +		 */
> +		if ((vma->vm_flags & VM_EXEC) &&
> +		    folio_likely_mapped_shared(folio))
> +			return -EACCES;
> +
> +		/*
> +		 * Do not migrate dirty folios as not all filesystems can move
> +		 * dirty folios in MIGRATE_ASYNC mode which is a waste of
> +		 * cycles.
> +		 */
> +		if (folio_test_dirty(folio))
> +			return -EAGAIN;
> +	}
>   
>   	/* Avoid migrating to a node that is nearly full */
>   	if (!migrate_balanced_pgdat(pgdat, nr_pages)) {
> @@ -2550,65 +2578,37 @@ static int numamigrate_isolate_folio(pg_data_t *pgdat, struct folio *folio)
>   		 * further.
>   		 */
>   		if (z < 0)
> -			return 0;
> +			return -EAGAIN;
>   
>   		wakeup_kswapd(pgdat->node_zones + z, 0,
>   			      folio_order(folio), ZONE_MOVABLE);
> -		return 0;
> +		return -EAGAIN;
>   	}
>   
>   	if (!folio_isolate_lru(folio))
> -		return 0;
> +		return -EAGAIN;
>   
>   	node_stat_mod_folio(folio, NR_ISOLATED_ANON + folio_is_file_lru(folio),
>   			    nr_pages);
> -
> -	/*
> -	 * Isolating the folio has taken another reference, so the
> -	 * caller's reference can be safely dropped without the folio
> -	 * disappearing underneath us during migration.
> -	 */
> -	folio_put(folio);
> -	return 1;
> +	return 0;
>   }

(just a nit: returning a boolean seems more readable)

>   
>   /*
>    * Attempt to migrate a misplaced folio to the specified destination
> - * node. Caller is expected to have an elevated reference count on
> - * the folio that will be dropped by this function before returning.
> + * node. Caller is expected to have isolated the folio by calling
> + * migrate_misplaced_folio_prepare(), which will result in an
> + * elevated reference count on the folio. This function will un-isolate the
> + * folio, dereferencing the folio before returning.
>    */
>   int migrate_misplaced_folio(struct folio *folio, struct vm_area_struct *vma,
>   			    int node)
>   {
>   	pg_data_t *pgdat = NODE_DATA(node);
> -	int isolated;
>   	int nr_remaining;
>   	unsigned int nr_succeeded;
>   	LIST_HEAD(migratepages);
>   	int nr_pages = folio_nr_pages(folio);
>   
> -	/*
> -	 * Don't migrate file folios that are mapped in multiple processes
> -	 * with execute permissions as they are probably shared libraries.
> -	 *
> -	 * See folio_likely_mapped_shared() on possible imprecision when we
> -	 * cannot easily detect if a folio is shared.
> -	 */
> -	if (folio_likely_mapped_shared(folio) && folio_is_file_lru(folio) &&
> -	    (vma->vm_flags & VM_EXEC))
> -		goto out;
> -
> -	/*
> -	 * Also do not migrate dirty folios as not all filesystems can move
> -	 * dirty folios in MIGRATE_ASYNC mode which is a waste of cycles.
> -	 */
> -	if (folio_is_file_lru(folio) && folio_test_dirty(folio))
> -		goto out;
> -
> -	isolated = numamigrate_isolate_folio(pgdat, folio);
> -	if (!isolated)
> -		goto out;
> -
>   	list_add(&folio->lru, &migratepages);
>   	nr_remaining = migrate_pages(&migratepages, alloc_misplaced_dst_folio,
>   				     NULL, node, MIGRATE_ASYNC,
> @@ -2620,7 +2620,6 @@ int migrate_misplaced_folio(struct folio *folio, struct vm_area_struct *vma,
>   					folio_is_file_lru(folio), -nr_pages);
>   			folio_putback_lru(folio);
>   		}
> -		isolated = 0;
>   	}
>   	if (nr_succeeded) {
>   		count_vm_numa_events(NUMA_PAGE_MIGRATE, nr_succeeded);
> @@ -2629,11 +2628,7 @@ int migrate_misplaced_folio(struct folio *folio, struct vm_area_struct *vma,
>   					    nr_succeeded);
>   	}
>   	BUG_ON(!list_empty(&migratepages));
> -	return isolated ? 0 : -EAGAIN;
> -
> -out:
> -	folio_put(folio);
> -	return -EAGAIN;
> +	return nr_remaining ? -EAGAIN : 0;
>   }
>   #endif /* CONFIG_NUMA_BALANCING */
>   #endif /* CONFIG_NUMA */
David Hildenbrand June 21, 2024, 7:31 a.m. UTC | #3
On 21.06.24 06:07, Baolin Wang wrote:
> 
> 
> On 2024/6/21 05:29, David Hildenbrand wrote:
>> [...]
>> We can now stop taking temporary folio references, and really only take
>> a reference if folio isolation succeeded. Doing the
>> folio_likely_mapped_shared() + golio isolation under PT lock is now similar
>> to how we handle MADV_PAGEOUT.
> 
> s/golio/folio
> 
> Make sense to me. Feel free to add:
> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>


Thanks!

[...]

>>    	node_stat_mod_folio(folio, NR_ISOLATED_ANON + folio_is_file_lru(folio),
>>    			    nr_pages);
>> -
>> -	/*
>> -	 * Isolating the folio has taken another reference, so the
>> -	 * caller's reference can be safely dropped without the folio
>> -	 * disappearing underneath us during migration.
>> -	 */
>> -	folio_put(folio);
>> -	return 1;
>> +	return 0;
>>    }
> 
> (just a nit: returning boolean seems more readable)

"return true" on success really is nasty in a code base where most 
functions return "0" on success. Like most functions in mm/migrate.c -- 
like migrate_pages() -- do.

Thanks!
David Hildenbrand June 21, 2024, 7:32 a.m. UTC | #4
On 21.06.24 04:05, Zi Yan wrote:
> On 20 Jun 2024, at 17:29, David Hildenbrand wrote:
> 
>> [...]
>> diff --git a/mm/memory.c b/mm/memory.c
>> index 118660de5bcc..4fd1ecfced4d 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -5207,8 +5207,6 @@ int numa_migrate_prep(struct folio *folio, struct vm_fault *vmf,
>>   {
>>   	struct vm_area_struct *vma = vmf->vma;
>>
>> -	folio_get(folio);
>> -
>>   	/* Record the current PID acceesing VMA */
>>   	vma_set_access_pid_bit(vma);
> 
> Without taking a reference here, is it safe to call mpol_misplaced() on the folio
> at the return? Can this folio be freed?

As long as we are holding the PTL, the folio cannot get unmapped and 
consequently cannot get freed.

Note that if that were the case (the folio could get freed concurrently),
even the folio_get() would have been wrong: we would have to use
folio_try_get*() instead.
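
Roughly, the two patterns differ like this (just a sketch, not a quote
of actual kernel code; the -EAGAIN is illustrative):

	/*
	 * Stable context: the PTL keeps the folio mapped, so its refcount
	 * is known to be > 0 and a plain reference grab is fine.
	 */
	folio_get(folio);

	/*
	 * No such guarantee: the folio could get freed concurrently, so we
	 * must fail gracefully instead of resurrecting a refcount of 0.
	 */
	if (!folio_try_get(folio))
		return -EAGAIN;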

Thanks!
Zi Yan June 21, 2024, 1:44 p.m. UTC | #5
On 20 Jun 2024, at 17:29, David Hildenbrand wrote:

> [...]

LGTM. Reviewed-by: Zi Yan <ziy@nvidia.com>

One nit below:

<snip>

> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index fc27dabcd8e3..4b2817bb2c7d 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1688,11 +1688,13 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
>  	if (node_is_toptier(nid))
>  		last_cpupid = folio_last_cpupid(folio);
>  	target_nid = numa_migrate_prep(folio, vmf, haddr, nid, &flags);
> -	if (target_nid == NUMA_NO_NODE) {
> -		folio_put(folio);
> +	if (target_nid == NUMA_NO_NODE)
> +		goto out_map;
> +	if (migrate_misplaced_folio_prepare(folio, vma, target_nid)) {
> +		flags |= TNF_MIGRATE_FAIL;
>  		goto out_map;
>  	}
> -
> +	/* The folio is isolated and isolation code holds a folio reference. */
>  	spin_unlock(vmf->ptl);
>  	writable = false;
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 118660de5bcc..4fd1ecfced4d 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c

<snip>

> @@ -5345,10 +5343,13 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
>  	else
>  		last_cpupid = folio_last_cpupid(folio);
>  	target_nid = numa_migrate_prep(folio, vmf, vmf->address, nid, &flags);
> -	if (target_nid == NUMA_NO_NODE) {
> -		folio_put(folio);
> +	if (target_nid == NUMA_NO_NODE)
> +		goto out_map;
> +	if (migrate_misplaced_folio_prepare(folio, vma, target_nid)) {
> +		flags |= TNF_MIGRATE_FAIL;
>  		goto out_map;
>  	}

These two locations are repeated code, maybe just merge the ifs into
numa_migrate_prep(). Feel free to ignore if you are not going to send
another version. :)

--
Best Regards,
Yan, Zi
Donet Tom June 21, 2024, 5:47 p.m. UTC | #6
On 6/21/24 02:59, David Hildenbrand wrote:
> [...]

Hi David,

I have tested these patches on my system. My system has 2 DRAM nodes
and 1 PMEM node. I tested page migration between DRAM nodes and page
promotion from PMEM to DRAM. Both are working fine.

Below are the results.

Migration results
=============
numa_pte_updates 18977
numa_huge_pte_updates 0
numa_hint_faults 18504
numa_hint_faults_local 2116
numa_pages_migrated 16388



Promotion Results
===============
nr_sec_page_table_pages 0
nr_iommu_pages 0
nr_swapcached 0
pgpromote_success 16386
pgpromote_candidate 0


Tested-by: Donet Tom <donettom@linux.ibm.com>


Thanks
Donet
David Hildenbrand June 21, 2024, 8:14 p.m. UTC | #7
On 21.06.24 19:47, Donet Tom wrote:
> 
> On 6/21/24 02:59, David Hildenbrand wrote:
>> [...]
> 
> Hi David,
> 
> [...]
> 
> Tested-by: Donet Tom <donettom@linux.ibm.com>

Oh, what a pleasant surprise, thanks a bunch for the fast testing!
David Hildenbrand June 21, 2024, 8:18 p.m. UTC | #8
On 21.06.24 15:44, Zi Yan wrote:
> On 20 Jun 2024, at 17:29, David Hildenbrand wrote:
> 
>> [...]
> 
> LGTM. Reviewed-by: Zi Yan <ziy@nvidia.com>
> 
> One nit below:
> 
> [...]
> 
> These two locations are repeated code, maybe just merge the ifs into
> numa_migrate_prep(). Feel free to ignore if you are not going to send
> another version. :)

I went back and forth a couple of times and

a) Didn't want to move numa_migrate_prep() into
    migrate_misplaced_folio_prepare(), because having that code in
    mm/migrate.c felt a bit odd.

b) Didn't want to move migrate_misplaced_folio_prepare() because I enjoy
    seeing the migrate_misplaced_folio_prepare() and
    migrate_misplaced_folio() calls in the same caller context.

I also considered renaming numa_migrate_prep(), but wasn't really able 
to come up with a good name.

But maybe a) is not too bad?

We'd have migrate_misplaced_folio_prepare() consume &flags and 
&target_nid, and perform the "flags |= TNF_MIGRATE_FAIL;" internally.
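
E.g., something like this untested sketch (the _sketch suffix is only
for illustration; it just wraps the helpers this patch introduces):

	static int migrate_misplaced_folio_prepare_sketch(struct folio *folio,
			struct vm_fault *vmf, unsigned long addr, int nid, int *flags)
	{
		int target_nid = numa_migrate_prep(folio, vmf, addr, nid, flags);

		if (target_nid == NUMA_NO_NODE)
			return NUMA_NO_NODE;
		if (migrate_misplaced_folio_prepare(folio, vmf->vma, target_nid)) {
			*flags |= TNF_MIGRATE_FAIL;
			return NUMA_NO_NODE;
		}
		return target_nid;
	}

Both callers would then shrink to a single call plus the existing
NUMA_NO_NODE check.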

What would be your take?

> 
> --
> Best Regards,
> Yan, Zi
Zi Yan June 21, 2024, 8:48 p.m. UTC | #9
On 21 Jun 2024, at 16:18, David Hildenbrand wrote:

> On 21.06.24 15:44, Zi Yan wrote:
>> On 20 Jun 2024, at 17:29, David Hildenbrand wrote:
>>
>>> [...]
>>
>> LGTM. Reviewed-by: Zi Yan <ziy@nvidia.com>
>>
>> One nit below:
>>
>> [...]
>>
>> These two locations are repeated code, maybe just merge the ifs into
>> numa_migrate_prep(). Feel free to ignore if you are not going to send
>> another version. :)
>
> I went back and forth a couple of times and
>
> a) Didn't want to move numa_migrate_prep() into
>    migrate_misplaced_folio_prepare(), because having that code in
>    mm/migrate.c felt a bit odd.

I agree after checking the actual code, since the code is just
updating NUMA fault stats and checking where the folio should be.

>
> b) Didn't want to move migrate_misplaced_folio_prepare() because I enjoy
>    seeing the migrate_misplaced_folio_prepare() and
>    migrate_misplaced_folio() calls in the same callercontext.
>
> I also considered renaming numa_migrate_prep(), but wasn't really able to come up with a good name.

How about numa_migrate_check(), since it tells whether a folio should be
migrated or not?

>
> But maybe a) is not too bad?
>
> We'd have migrate_misplaced_folio_prepare() consume &flags and &target_nid, and perform the "flags |= TNF_MIGRATE_FAIL;" internally.
>
> What would be your take?

I would either rename numa_migrate_prep() or just do nothing. I have to admit
that the "prep" and "prepare" in both function names motivated me to propose
the merge, but now the actual code tells me they should be separate.

--
Best Regards,
Yan, Zi
David Hildenbrand June 26, 2024, 4:22 p.m. UTC | #10
On 20.06.24 23:29, David Hildenbrand wrote:
> [...]

Donet just reported an issue. I suspect this fixes it -- in any case, this is
the right thing to do.

From 0833b9896e98c8d88c521609c811a220d14a2e7e Mon Sep 17 00:00:00 2001
From: David Hildenbrand <david@redhat.com>
Date: Wed, 26 Jun 2024 18:14:44 +0200
Subject: [PATCH] Fixup: mm/migrate: move NUMA hinting fault folio isolation +
  checks under PTL

Donet reports an issue during NUMA migration we haven't seen previously:

	[   71.422804] list_del corruption, c00c00000061b3c8->next is
	LIST_POISON1 (5deadbeef0000100)
	[   71.422839] ------------[ cut here ]------------
	[   71.422843] kernel BUG at lib/list_debug.c:56!
	[   71.422850] Oops: Exception in kernel mode, sig: 5 [#1]

We forgot to convert one "return 0;" to return an error instead from
migrate_misplaced_folio_prepare() in case the target node is nearly
full. With the new calling convention, 0 means success, so callers would
try to migrate a folio that was never isolated from the LRU, corrupting
the list.

Signed-off-by: David Hildenbrand <david@redhat.com>
---
  mm/migrate.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/migrate.c b/mm/migrate.c
index 8beedbb42a93..9ed43c1eea5e 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2564,7 +2564,7 @@ int migrate_misplaced_folio_prepare(struct folio *folio,
  		int z;
  
  		if (!(sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING))
-			return 0;
+			return -EAGAIN;
  		for (z = pgdat->nr_zones - 1; z >= 0; z--) {
  			if (managed_zone(pgdat->node_zones + z))
  				break;

base-commit: 4b17ce353e02b47b00e2fe87b862f09e8b9a47e6
David Hildenbrand June 26, 2024, 4:49 p.m. UTC | #11
On 21.06.24 22:48, Zi Yan wrote:
> On 21 Jun 2024, at 16:18, David Hildenbrand wrote:
> 
>> On 21.06.24 15:44, Zi Yan wrote:
>>> On 20 Jun 2024, at 17:29, David Hildenbrand wrote:
>>>
>>>> [...]
>>>
>>> LGTM. Reviewed-by: Zi Yan <ziy@nvidia.com>
>>>
>>> One nit below:
>>>
>>> [...]
>>>
>>> These two locations are repeated code, maybe just merge the ifs into
>>> numa_migrate_prep(). Feel free to ignore if you are not going to send
>>> another version. :)
>>
>> I went back and forth a couple of times and
>>
>> a) Didn't want to move numa_migrate_prep() into
>>     migrate_misplaced_folio_prepare(), because having that code in
>>     mm/migrate.c felt a bit odd.
> 
> I agree after checking the actual code, since the code is just
> updating NUMA fault stats and checking where the folio should be.
> 
>>
>> b) Didn't want to move migrate_misplaced_folio_prepare() because I enjoy
>>     seeing the migrate_misplaced_folio_prepare() and
>>     migrate_misplaced_folio() calls in the same caller context.
>>
>> I also considered renaming numa_migrate_prep(), but wasn't really able to come up with a good name.
> 
> How about numa_migrate_check()? Since it tells whether a folio should be
> migrated or not.
> 
>>
>> But maybe a) is not too bad?
>>
>> We'd have migrate_misplaced_folio_prepare() consume &flags and &target_nid, and perform the "flags |= TNF_MIGRATE_FAIL;" internally.
>>
>> What would be your take?
> 
> I would either rename numa_migrate_prep() or just do nothing. I have to admit
> that the "prep" and "prepare" in both function names motivated me to propose
> the merge, but now the actual code tells me they should be separate.

Let's leave it like that for now. Renaming to numa_migrate_check() makes 
sense, as does moving more NUMA handling stuff in there.

But I still have to figure out why some of the memory.c vs. huge_memory.c 
code differences exist, so we can unify them.

For example, why did 33024536bafd9 introduce slightly different 
last_cpupid handling in do_huge_pmd_numa_page()? There seems to be some 
subtle difference in the handling of NUMA_BALANCING_MEMORY_TIERING. Maybe 
I am missing something obvious. :)
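
For reference, the difference in question boils down to roughly the 
following two checks (a sketch based on the code after 33024536bafd9; 
the surrounding details are elided):

	/* mm/memory.c:do_numa_page(): tiering mode invalidates the cpupid */
	if ((sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING) &&
	    !node_is_toptier(nid))
		last_cpupid = (-1 & LAST_CPUPID_MASK);
	else
		last_cpupid = folio_last_cpupid(folio);

	/* mm/huge_memory.c:do_huge_pmd_numa_page(): only the tier is checked */
	if (node_is_toptier(nid))
		last_cpupid = folio_last_cpupid(folio);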
Zi Yan June 26, 2024, 5:37 p.m. UTC | #12
On Wed Jun 26, 2024 at 12:49 PM EDT, David Hildenbrand wrote:
> On 21.06.24 22:48, Zi Yan wrote:
> > <snip>
> > I would either rename numa_migrate_prep() or just do nothing. I have to admit
> > that the "prep" and "prepare" in both function names motivated me to propose
> > the merge, but now the actual code tells me they should be separate.
>
> Let's leave it like that for now. Renaming to numa_migrate_check() makes 
> sense, as does moving more NUMA handling stuff in there.
>
> But I still have to figure out why some of the memory.c vs. huge_memory.c 
> code differences exist, so we can unify them.
>
> For example, why did 33024536bafd9 introduce slightly different 
> last_cpupid handling in do_huge_pmd_numa_page()? There seems to be some 
> subtle difference in the handling of NUMA_BALANCING_MEMORY_TIERING. Maybe 
> I am missing something obvious. :)

It seems to me that a sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING
check is missing in do_huge_pmd_numa_page(). So the

if (node_is_toptier(nid))

should be

if (!(sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING) ||
node_is_toptier(nid))

to be consistent with other checks. Add Ying to confirm.

I also think a function like

bool folio_has_cpupid(struct folio *folio)
{
    return !(sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING) ||
           node_is_toptier(folio_nid(folio));
}

would be better than the existing checks.
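
Fleshed out a little (a sketch only; the parameter is typed and a comment
added to explain why the check exists):

static inline bool folio_has_cpupid(struct folio *folio)
{
	/*
	 * With NUMA_BALANCING_MEMORY_TIERING enabled, the cpupid field of
	 * folios on non-toptier nodes is repurposed to hold the page table
	 * scan time, so cpupid information is only meaningful on toptier
	 * nodes or when tiering mode is off.
	 */
	return !(sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING) ||
	       node_is_toptier(folio_nid(folio));
}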
Donet Tom June 27, 2024, 6 a.m. UTC | #13
On 6/26/24 21:52, David Hildenbrand wrote:
> On 20.06.24 23:29, David Hildenbrand wrote:
>> <snip>
>
> Donet just reported an issue. I suspect this fixes it -- in any case, 
> this is
> the right thing to do.
>
> From 0833b9896e98c8d88c521609c811a220d14a2e7e Mon Sep 17 00:00:00 2001
> From: David Hildenbrand <david@redhat.com>
> Date: Wed, 26 Jun 2024 18:14:44 +0200
> Subject: [PATCH] Fixup: mm/migrate: move NUMA hinting fault folio isolation + checks under PTL
>
> Donet reports an issue during NUMA migration we haven't seen previously:
>
>     [   71.422804] list_del corruption, c00c00000061b3c8->next is LIST_POISON1 (5deadbeef0000100)
>     [   71.422839] ------------[ cut here ]------------
>     [   71.422843] kernel BUG at lib/list_debug.c:56!
>     [   71.422850] Oops: Exception in kernel mode, sig: 5 [#1]
>
> We forgot to convert one "return 0;" to return an error instead from
> migrate_misplaced_folio_prepare() in case the target node is nearly
> full.
>
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>  mm/migrate.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 8beedbb42a93..9ed43c1eea5e 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -2564,7 +2564,7 @@ int migrate_misplaced_folio_prepare(struct folio *folio,
>  		int z;
>  
>  		if (!(sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING))
> -			return 0;
> +			return -EAGAIN;
>  		for (z = pgdat->nr_zones - 1; z >= 0; z--) {
>  			if (managed_zone(pgdat->node_zones + z))
>  				break;
Hi David

I tested with this patch. The issue is resolved; I am no longer seeing 
the kernel panic.

I also tested page migration, and it is working fine.

numa_pte_updates 1262330
numa_huge_pte_updates 0
numa_hint_faults 925797
numa_hint_faults_local 3780
numa_pages_migrated 327930
pgmigrate_success 822530


Thanks
Donet


>
> base-commit: 4b17ce353e02b47b00e2fe87b862f09e8b9a47e6
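
The failure mode follows from the new calling convention: the series
inverts the isolation helper's return values (0 now means "isolated,
proceed with migration"), so the one "return 0;" that was not converted
made callers treat a folio that was never isolated from the LRU as
isolated. Schematically (a condensed sketch of the buggy path, not the
literal code):

	/* do_numa_page(), with the missed conversion in place: */
	if (migrate_misplaced_folio_prepare(folio, vma, target_nid)) {
		flags |= TNF_MIGRATE_FAIL;	/* never taken: 0 meant "success" */
		goto out_map;
	}
	...
	/* migrate_misplaced_folio() then queues a folio that is still on
	 * its LRU list; the subsequent list manipulation trips the list
	 * debugging checks (LIST_POISON1), matching the oops above. */
	list_add(&folio->lru, &migratepages);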
Huang, Ying July 1, 2024, 8:32 a.m. UTC | #14
"Zi Yan" <ziy@nvidia.com> writes:

> On Wed Jun 26, 2024 at 12:49 PM EDT, David Hildenbrand wrote:
>> <snip>
>>
>> Let's leave it like that for now. Renaming to numa_migrate_check() makes 
>> sense, as does moving more NUMA handling stuff in there.
>>
>> But I still have to figure out why some of the memory.c vs. huge_memory.c 
>> code differences exist, so we can unify them.
>>
>> For example, why did 33024536bafd9 introduce slightly different 
>> last_cpupid handling in do_huge_pmd_numa_page()? There seems to be some 
>> subtle difference in the handling of NUMA_BALANCING_MEMORY_TIERING. Maybe 
>> I am missing something obvious. :)
>
> It seems to me that a sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING
> check is missing in do_huge_pmd_numa_page(). So the
>
> if (node_is_toptier(nid))
>
> should be
>
> if (!(sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING) ||
> node_is_toptier(nid))
>
> to be consistent with other checks. Add Ying to confirm.

Yes.  It should be so.  Sorry for my mistake and the confusion.

> I also think a function like
>
> bool folio_has_cpupid(struct folio *folio)
> {
>     return !(sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING) ||
>            node_is_toptier(folio_nid(folio));
> }
>
> would be better than the existing checks.

Yes.  This looks better.  Even better, we can add some comments to the
function too.

--
Best Regards,
Huang, Ying
Zi Yan July 1, 2024, 1:50 p.m. UTC | #15
On 1 Jul 2024, at 4:32, Huang, Ying wrote:

> "Zi Yan" <ziy@nvidia.com> writes:
>
>> <snip>
>> It seems to me that a sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING
>> check is missing in do_huge_pmd_numa_page(). So the
>>
>> if (node_is_toptier(nid))
>>
>> should be
>>
>> if (!(sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING) ||
>> node_is_toptier(nid))
>>
>> to be consistent with other checks. Add Ying to confirm.
>
> Yes.  It should be so.  Sorry for my mistake and the confusion.

Thank you for the confirmation.

>
>> I also think a function like
>>
>> bool folio_has_cpupid(struct folio *folio)
>> {
>>     return !(sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING) ||
>>            node_is_toptier(folio_nid(folio));
>> }
>>
>> would be better than the existing checks.
>
> Yes.  This looks better.  Even better, we can add some comments to the
> function too.

I will prepare a patch about it.

Best Regards,
Yan, Zi
David Hildenbrand July 1, 2024, 2:03 p.m. UTC | #16
On 01.07.24 15:50, Zi Yan wrote:
> <snip>
> 
> I will prepare a patch about it.

Do you have capacity to further consolidate the logic, maybe moving more 
stuff into numa_migrate_prep() (and renaming it? :))?
Zi Yan July 1, 2024, 2:04 p.m. UTC | #17
On 1 Jul 2024, at 10:03, David Hildenbrand wrote:

> On 01.07.24 15:50, Zi Yan wrote:
>> <snip>
>> I will prepare a patch about it.
>
> Do you have capacity to further consolidate the logic, maybe moving more stuff into numa_migrate_prep() (and renaming it? :))?

Sure, let me give it a shot. :)

Best Regards,
Yan, Zi
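
One possible shape for that consolidation, folding the duplicated
caller-side cascade from do_numa_page()/do_huge_pmd_numa_page() into the
renamed helper (purely a sketch of the direction discussed above;
numa_migrate_check() does not exist in any tree at this point):

	/* Hypothetical: returns the target node, or NUMA_NO_NODE to remap. */
	static int numa_migrate_check(struct folio *folio, struct vm_fault *vmf,
			unsigned long addr, int *flags)
	{
		int nid = folio_nid(folio);
		int target_nid;

		/* Fault stats and target selection, as numa_migrate_prep() today. */
		target_nid = numa_migrate_prep(folio, vmf, addr, nid, flags);
		if (target_nid == NUMA_NO_NODE)
			return NUMA_NO_NODE;
		/* Isolate under the PT lock, while mapcounts are still reliable. */
		if (migrate_misplaced_folio_prepare(folio, vmf->vma, target_nid)) {
			*flags |= TNF_MIGRATE_FAIL;
			return NUMA_NO_NODE;
		}
		return target_nid;
	}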

Patch

diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index f9d92482d117..644be30b69c8 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -139,9 +139,16 @@  const struct movable_operations *page_movable_ops(struct page *page)
 }
 
 #ifdef CONFIG_NUMA_BALANCING
+int migrate_misplaced_folio_prepare(struct folio *folio,
+		struct vm_area_struct *vma, int node);
 int migrate_misplaced_folio(struct folio *folio, struct vm_area_struct *vma,
 			   int node);
 #else
+static inline int migrate_misplaced_folio_prepare(struct folio *folio,
+		struct vm_area_struct *vma, int node)
+{
+	return -EAGAIN; /* can't migrate now */
+}
 static inline int migrate_misplaced_folio(struct folio *folio,
 					 struct vm_area_struct *vma, int node)
 {
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index fc27dabcd8e3..4b2817bb2c7d 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1688,11 +1688,13 @@  vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
 	if (node_is_toptier(nid))
 		last_cpupid = folio_last_cpupid(folio);
 	target_nid = numa_migrate_prep(folio, vmf, haddr, nid, &flags);
-	if (target_nid == NUMA_NO_NODE) {
-		folio_put(folio);
+	if (target_nid == NUMA_NO_NODE)
+		goto out_map;
+	if (migrate_misplaced_folio_prepare(folio, vma, target_nid)) {
+		flags |= TNF_MIGRATE_FAIL;
 		goto out_map;
 	}
-
+	/* The folio is isolated and isolation code holds a folio reference. */
 	spin_unlock(vmf->ptl);
 	writable = false;
 
diff --git a/mm/memory.c b/mm/memory.c
index 118660de5bcc..4fd1ecfced4d 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -5207,8 +5207,6 @@  int numa_migrate_prep(struct folio *folio, struct vm_fault *vmf,
 {
 	struct vm_area_struct *vma = vmf->vma;
 
-	folio_get(folio);
-
 	/* Record the current PID acceesing VMA */
 	vma_set_access_pid_bit(vma);
 
@@ -5345,10 +5343,13 @@  static vm_fault_t do_numa_page(struct vm_fault *vmf)
 	else
 		last_cpupid = folio_last_cpupid(folio);
 	target_nid = numa_migrate_prep(folio, vmf, vmf->address, nid, &flags);
-	if (target_nid == NUMA_NO_NODE) {
-		folio_put(folio);
+	if (target_nid == NUMA_NO_NODE)
+		goto out_map;
+	if (migrate_misplaced_folio_prepare(folio, vma, target_nid)) {
+		flags |= TNF_MIGRATE_FAIL;
 		goto out_map;
 	}
+	/* The folio is isolated and isolation code holds a folio reference. */
 	pte_unmap_unlock(vmf->pte, vmf->ptl);
 	writable = false;
 	ignore_writable = true;
diff --git a/mm/migrate.c b/mm/migrate.c
index 0307b54879a0..27f070f64f27 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2530,9 +2530,37 @@  static struct folio *alloc_misplaced_dst_folio(struct folio *src,
 	return __folio_alloc_node(gfp, order, nid);
 }
 
-static int numamigrate_isolate_folio(pg_data_t *pgdat, struct folio *folio)
+/*
+ * Prepare for calling migrate_misplaced_folio() by isolating the folio if
+ * permitted. Must be called with the PTL still held.
+ */
+int migrate_misplaced_folio_prepare(struct folio *folio,
+		struct vm_area_struct *vma, int node)
 {
 	int nr_pages = folio_nr_pages(folio);
+	pg_data_t *pgdat = NODE_DATA(node);
+
+	if (folio_is_file_lru(folio)) {
+		/*
+		 * Do not migrate file folios that are mapped in multiple
+		 * processes with execute permissions as they are probably
+		 * shared libraries.
+		 *
+		 * See folio_likely_mapped_shared() on possible imprecision
+		 * when we cannot easily detect if a folio is shared.
+		 */
+		if ((vma->vm_flags & VM_EXEC) &&
+		    folio_likely_mapped_shared(folio))
+			return -EACCES;
+
+		/*
+		 * Do not migrate dirty folios as not all filesystems can move
+		 * dirty folios in MIGRATE_ASYNC mode which is a waste of
+		 * cycles.
+		 */
+		if (folio_test_dirty(folio))
+			return -EAGAIN;
+	}
 
 	/* Avoid migrating to a node that is nearly full */
 	if (!migrate_balanced_pgdat(pgdat, nr_pages)) {
@@ -2550,65 +2578,37 @@  static int numamigrate_isolate_folio(pg_data_t *pgdat, struct folio *folio)
 		 * further.
 		 */
 		if (z < 0)
-			return 0;
+			return -EAGAIN;
 
 		wakeup_kswapd(pgdat->node_zones + z, 0,
 			      folio_order(folio), ZONE_MOVABLE);
-		return 0;
+		return -EAGAIN;
 	}
 
 	if (!folio_isolate_lru(folio))
-		return 0;
+		return -EAGAIN;
 
 	node_stat_mod_folio(folio, NR_ISOLATED_ANON + folio_is_file_lru(folio),
 			    nr_pages);
-
-	/*
-	 * Isolating the folio has taken another reference, so the
-	 * caller's reference can be safely dropped without the folio
-	 * disappearing underneath us during migration.
-	 */
-	folio_put(folio);
-	return 1;
+	return 0;
 }
 
 /*
  * Attempt to migrate a misplaced folio to the specified destination
- * node. Caller is expected to have an elevated reference count on
- * the folio that will be dropped by this function before returning.
+ * node. Caller is expected to have isolated the folio by calling
+ * migrate_misplaced_folio_prepare(), which will result in an
+ * elevated reference count on the folio. This function will un-isolate the
+ * folio, dereferencing the folio before returning.
  */
 int migrate_misplaced_folio(struct folio *folio, struct vm_area_struct *vma,
 			    int node)
 {
 	pg_data_t *pgdat = NODE_DATA(node);
-	int isolated;
 	int nr_remaining;
 	unsigned int nr_succeeded;
 	LIST_HEAD(migratepages);
 	int nr_pages = folio_nr_pages(folio);
 
-	/*
-	 * Don't migrate file folios that are mapped in multiple processes
-	 * with execute permissions as they are probably shared libraries.
-	 *
-	 * See folio_likely_mapped_shared() on possible imprecision when we
-	 * cannot easily detect if a folio is shared.
-	 */
-	if (folio_likely_mapped_shared(folio) && folio_is_file_lru(folio) &&
-	    (vma->vm_flags & VM_EXEC))
-		goto out;
-
-	/*
-	 * Also do not migrate dirty folios as not all filesystems can move
-	 * dirty folios in MIGRATE_ASYNC mode which is a waste of cycles.
-	 */
-	if (folio_is_file_lru(folio) && folio_test_dirty(folio))
-		goto out;
-
-	isolated = numamigrate_isolate_folio(pgdat, folio);
-	if (!isolated)
-		goto out;
-
 	list_add(&folio->lru, &migratepages);
 	nr_remaining = migrate_pages(&migratepages, alloc_misplaced_dst_folio,
 				     NULL, node, MIGRATE_ASYNC,
@@ -2620,7 +2620,6 @@  int migrate_misplaced_folio(struct folio *folio, struct vm_area_struct *vma,
 					folio_is_file_lru(folio), -nr_pages);
 			folio_putback_lru(folio);
 		}
-		isolated = 0;
 	}
 	if (nr_succeeded) {
 		count_vm_numa_events(NUMA_PAGE_MIGRATE, nr_succeeded);
@@ -2629,11 +2628,7 @@  int migrate_misplaced_folio(struct folio *folio, struct vm_area_struct *vma,
 					    nr_succeeded);
 	}
 	BUG_ON(!list_empty(&migratepages));
-	return isolated ? 0 : -EAGAIN;
-
-out:
-	folio_put(folio);
-	return -EAGAIN;
+	return nr_remaining ? -EAGAIN : 0;
 }
 #endif /* CONFIG_NUMA_BALANCING */
 #endif /* CONFIG_NUMA */
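
For illustration, the calling convention after this patch, schematically
(condensed from the do_numa_page() hunk above; bookkeeping details
dropped):

	/* Under the PT lock, while mapcount information is still reliable: */
	target_nid = numa_migrate_prep(folio, vmf, vmf->address, nid, &flags);
	if (target_nid == NUMA_NO_NODE)
		goto out_map;			/* no extra reference was taken */
	if (migrate_misplaced_folio_prepare(folio, vma, target_nid)) {
		flags |= TNF_MIGRATE_FAIL;	/* could not, or should not, isolate */
		goto out_map;
	}
	/* Isolation succeeded and holds the only extra folio reference. */
	pte_unmap_unlock(vmf->pte, vmf->ptl);

	/* Migration consumes the isolation reference, on success or failure. */
	if (!migrate_misplaced_folio(folio, vma, target_nid))
		nid = target_nid;		/* migrated; TNF_MIGRATED and friends follow */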