diff mbox series

[v4,4/6] mm: swap: Allow storage of all mTHP orders

Message ID 20240311150058.1122862-5-ryan.roberts@arm.com (mailing list archive)
State New
Headers show
Series Swap-out mTHP without splitting | expand

Commit Message

Ryan Roberts March 11, 2024, 3 p.m. UTC
Multi-size THP enables performance improvements by allocating large,
pte-mapped folios for anonymous memory. However I've observed that on an
arm64 system running a parallel workload (e.g. kernel compilation)
across many cores, under high memory pressure, the speed regresses. This
is due to bottlenecking on the increased number of TLBIs added due to
all the extra folio splitting when the large folios are swapped out.

Therefore, solve this regression by adding support for swapping out mTHP
without needing to split the folio, just like is already done for
PMD-sized THP. This change only applies when CONFIG_THP_SWAP is enabled,
and when the swap backing store is a non-rotating block device. These
are the same constraints as for the existing PMD-sized THP swap-out
support.

Note that no attempt is made to swap-in (m)THP here - this is still done
page-by-page, like for PMD-sized THP. But swapping-out mTHP is a
prerequisite for swapping-in mTHP.

The main change here is to improve the swap entry allocator so that it
can allocate any power-of-2 number of contiguous entries between [1, (1
<< PMD_ORDER)]. This is done by allocating a cluster for each distinct
order and allocating sequentially from it until the cluster is full.
This ensures that we don't need to search the map and we get no
fragmentation due to alignment padding for different orders in the
cluster. If there is no current cluster for a given order, we attempt to
allocate a free cluster from the list. If there are no free clusters, we
fail the allocation and the caller can fall back to splitting the folio
and allocating individual entries (as per existing PMD-sized THP
fallback).

The per-order current clusters are maintained per-cpu using the existing
infrastructure. This is done to avoid interleaving pages from different
tasks, which would prevent IO being batched. This is already done for
the order-0 allocations so we follow the same pattern.

As is done for order-0 per-cpu clusters, the scanner now can steal
order-0 entries from any per-cpu-per-order reserved cluster. This
ensures that when the swap file is getting full, space doesn't get tied
up in the per-cpu reserves.

This change only modifies swap to be able to accept any order mTHP. It
doesn't change the callers to elide doing the actual split. That will be
done in separate changes.

Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
 include/linux/swap.h |   8 ++-
 mm/swapfile.c        | 167 +++++++++++++++++++++++++------------------
 2 files changed, 103 insertions(+), 72 deletions(-)

Comments

Huang, Ying March 12, 2024, 7:51 a.m. UTC | #1
Ryan Roberts <ryan.roberts@arm.com> writes:

> Multi-size THP enables performance improvements by allocating large,
> pte-mapped folios for anonymous memory. However I've observed that on an
> arm64 system running a parallel workload (e.g. kernel compilation)
> across many cores, under high memory pressure, the speed regresses. This
> is due to bottlenecking on the increased number of TLBIs added due to
> all the extra folio splitting when the large folios are swapped out.
>
> Therefore, solve this regression by adding support for swapping out mTHP
> without needing to split the folio, just like is already done for
> PMD-sized THP. This change only applies when CONFIG_THP_SWAP is enabled,
> and when the swap backing store is a non-rotating block device. These
> are the same constraints as for the existing PMD-sized THP swap-out
> support.
>
> Note that no attempt is made to swap-in (m)THP here - this is still done
> page-by-page, like for PMD-sized THP. But swapping-out mTHP is a
> prerequisite for swapping-in mTHP.
>
> The main change here is to improve the swap entry allocator so that it
> can allocate any power-of-2 number of contiguous entries between [1, (1
> << PMD_ORDER)]. This is done by allocating a cluster for each distinct
> order and allocating sequentially from it until the cluster is full.
> This ensures that we don't need to search the map and we get no
> fragmentation due to alignment padding for different orders in the
> cluster. If there is no current cluster for a given order, we attempt to
> allocate a free cluster from the list. If there are no free clusters, we
> fail the allocation and the caller can fall back to splitting the folio
> and allocating individual entries (as per existing PMD-sized THP
> fallback).
>
> The per-order current clusters are maintained per-cpu using the existing
> infrastructure. This is done to avoid interleaving pages from different
> tasks, which would prevent IO being batched. This is already done for
> the order-0 allocations so we follow the same pattern.
>
> As is done for order-0 per-cpu clusters, the scanner now can steal
> order-0 entries from any per-cpu-per-order reserved cluster. This
> ensures that when the swap file is getting full, space doesn't get tied
> up in the per-cpu reserves.
>
> This change only modifies swap to be able to accept any order mTHP. It
> doesn't change the callers to elide doing the actual split. That will be
> done in separate changes.
>
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> ---
>  include/linux/swap.h |   8 ++-
>  mm/swapfile.c        | 167 +++++++++++++++++++++++++------------------
>  2 files changed, 103 insertions(+), 72 deletions(-)
>
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 0cb082bee717..39b5c18ccc6a 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -268,13 +268,19 @@ struct swap_cluster_info {
>   */
>  #define SWAP_NEXT_INVALID	0
>  
> +#ifdef CONFIG_THP_SWAP
> +#define SWAP_NR_ORDERS		(PMD_ORDER + 1)
> +#else
> +#define SWAP_NR_ORDERS		1
> +#endif
> +
>  /*
>   * We assign a cluster to each CPU, so each CPU can allocate swap entry from
>   * its own cluster and swapout sequentially. The purpose is to optimize swapout
>   * throughput.
>   */
>  struct percpu_cluster {
> -	unsigned int next; /* Likely next allocation offset */
> +	unsigned int next[SWAP_NR_ORDERS]; /* Likely next allocation offset */
>  };
>  
>  struct swap_cluster_list {
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 3828d81aa6b8..61118a090796 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -551,10 +551,12 @@ static void free_cluster(struct swap_info_struct *si, unsigned long idx)
>  
>  /*
>   * The cluster corresponding to page_nr will be used. The cluster will be
> - * removed from free cluster list and its usage counter will be increased.
> + * removed from free cluster list and its usage counter will be increased by
> + * count.
>   */
> -static void inc_cluster_info_page(struct swap_info_struct *p,
> -	struct swap_cluster_info *cluster_info, unsigned long page_nr)
> +static void add_cluster_info_page(struct swap_info_struct *p,
> +	struct swap_cluster_info *cluster_info, unsigned long page_nr,
> +	unsigned long count)
>  {
>  	unsigned long idx = page_nr / SWAPFILE_CLUSTER;
>  
> @@ -563,9 +565,19 @@ static void inc_cluster_info_page(struct swap_info_struct *p,
>  	if (cluster_is_free(&cluster_info[idx]))
>  		alloc_cluster(p, idx);
>  
> -	VM_BUG_ON(cluster_count(&cluster_info[idx]) >= SWAPFILE_CLUSTER);
> +	VM_BUG_ON(cluster_count(&cluster_info[idx]) + count > SWAPFILE_CLUSTER);
>  	cluster_set_count(&cluster_info[idx],
> -		cluster_count(&cluster_info[idx]) + 1);
> +		cluster_count(&cluster_info[idx]) + count);
> +}
> +
> +/*
> + * The cluster corresponding to page_nr will be used. The cluster will be
> + * removed from free cluster list and its usage counter will be increased by 1.
> + */
> +static void inc_cluster_info_page(struct swap_info_struct *p,
> +	struct swap_cluster_info *cluster_info, unsigned long page_nr)
> +{
> +	add_cluster_info_page(p, cluster_info, page_nr, 1);
>  }
>  
>  /*
> @@ -595,7 +607,7 @@ static void dec_cluster_info_page(struct swap_info_struct *p,
>   */
>  static bool
>  scan_swap_map_ssd_cluster_conflict(struct swap_info_struct *si,
> -	unsigned long offset)
> +	unsigned long offset, int order)
>  {
>  	struct percpu_cluster *percpu_cluster;
>  	bool conflict;
> @@ -609,24 +621,39 @@ scan_swap_map_ssd_cluster_conflict(struct swap_info_struct *si,
>  		return false;
>  
>  	percpu_cluster = this_cpu_ptr(si->percpu_cluster);
> -	percpu_cluster->next = SWAP_NEXT_INVALID;
> +	percpu_cluster->next[order] = SWAP_NEXT_INVALID;
> +	return true;
> +}
> +
> +static inline bool swap_range_empty(char *swap_map, unsigned int start,
> +				    unsigned int nr_pages)
> +{
> +	unsigned int i;
> +
> +	for (i = 0; i < nr_pages; i++) {
> +		if (swap_map[start + i])
> +			return false;
> +	}
> +
>  	return true;
>  }
>  
>  /*
> - * Try to get a swap entry from current cpu's swap entry pool (a cluster). This
> - * might involve allocating a new cluster for current CPU too.
> + * Try to get a swap entry (or size indicated by order) from current cpu's swap

IMO, it's not necessary to make mTHP a special case other than base
page.  So, this can be changed to

 * Try to get swap entries with specified order from current cpu's swap

> + * entry pool (a cluster). This might involve allocating a new cluster for
> + * current CPU too.
>   */
>  static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
> -	unsigned long *offset, unsigned long *scan_base)
> +	unsigned long *offset, unsigned long *scan_base, int order)
>  {
> +	unsigned int nr_pages = 1 << order;
>  	struct percpu_cluster *cluster;
>  	struct swap_cluster_info *ci;
>  	unsigned int tmp, max;
>  
>  new_cluster:
>  	cluster = this_cpu_ptr(si->percpu_cluster);
> -	tmp = cluster->next;
> +	tmp = cluster->next[order];
>  	if (tmp == SWAP_NEXT_INVALID) {
>  		if (!cluster_list_empty(&si->free_clusters)) {
>  			tmp = cluster_next(&si->free_clusters.head) *
> @@ -647,26 +674,27 @@ static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
>  
>  	/*
>  	 * Other CPUs can use our cluster if they can't find a free cluster,
> -	 * check if there is still free entry in the cluster
> +	 * check if there is still free entry in the cluster, maintaining
> +	 * natural alignment.
>  	 */
>  	max = min_t(unsigned long, si->max, ALIGN(tmp + 1, SWAPFILE_CLUSTER));
>  	if (tmp < max) {
>  		ci = lock_cluster(si, tmp);
>  		while (tmp < max) {
> -			if (!si->swap_map[tmp])
> +			if (swap_range_empty(si->swap_map, tmp, nr_pages))
>  				break;
> -			tmp++;
> +			tmp += nr_pages;
>  		}
>  		unlock_cluster(ci);
>  	}
>  	if (tmp >= max) {
> -		cluster->next = SWAP_NEXT_INVALID;
> +		cluster->next[order] = SWAP_NEXT_INVALID;
>  		goto new_cluster;
>  	}
>  	*offset = tmp;
>  	*scan_base = tmp;
> -	tmp += 1;
> -	cluster->next = tmp < max ? tmp : SWAP_NEXT_INVALID;
> +	tmp += nr_pages;
> +	cluster->next[order] = tmp < max ? tmp : SWAP_NEXT_INVALID;
>  	return true;
>  }
>  
> @@ -796,13 +824,14 @@ static bool swap_offset_available_and_locked(struct swap_info_struct *si,
>  
>  static int scan_swap_map_slots(struct swap_info_struct *si,
>  			       unsigned char usage, int nr,
> -			       swp_entry_t slots[])
> +			       swp_entry_t slots[], unsigned int nr_pages)

IMHO, it's better to use order as a parameter directly.  We can change the
parameter of get_swap_pages() too.
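I.e., something like the following (just a sketch of the idea, untested,
and the parameter names are only placeholders):

	static int scan_swap_map_slots(struct swap_info_struct *si,
				       unsigned char usage, int nr,
				       swp_entry_t slots[], int order);

	int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_order);

Then nr_pages can be computed locally as "1 << order" in the places that
really need a page count.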

>  {
>  	struct swap_cluster_info *ci;
>  	unsigned long offset;
>  	unsigned long scan_base;
>  	unsigned long last_in_cluster = 0;
>  	int latency_ration = LATENCY_LIMIT;
> +	int order = ilog2(nr_pages);
>  	int n_ret = 0;
>  	bool scanned_many = false;
>  
> @@ -817,6 +846,26 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
>  	 * And we let swap pages go all over an SSD partition.  Hugh
>  	 */
>  
> +	if (nr_pages > 1) {
> +		/*
> +		 * Should not even be attempting large allocations when huge
> +		 * page swap is disabled.  Warn and fail the allocation.
> +		 */
> +		if (!IS_ENABLED(CONFIG_THP_SWAP) ||
> +		    nr_pages > SWAPFILE_CLUSTER ||
> +		    !is_power_of_2(nr_pages)) {
> +			VM_WARN_ON_ONCE(1);
> +			return 0;
> +		}
> +
> +		/*
> +		 * Swapfile is not block device or not using clusters so unable
> +		 * to allocate large entries.
> +		 */
> +		if (!(si->flags & SWP_BLKDEV) || !si->cluster_info)
> +			return 0;
> +	}
> +
>  	si->flags += SWP_SCANNING;
>  	/*
>  	 * Use percpu scan base for SSD to reduce lock contention on
> @@ -831,8 +880,11 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
>  
>  	/* SSD algorithm */
>  	if (si->cluster_info) {
> -		if (!scan_swap_map_try_ssd_cluster(si, &offset, &scan_base))
> +		if (!scan_swap_map_try_ssd_cluster(si, &offset, &scan_base, order)) {
> +			if (order > 0)
> +				goto no_page;
>  			goto scan;
> +		}
>  	} else if (unlikely(!si->cluster_nr--)) {
>  		if (si->pages - si->inuse_pages < SWAPFILE_CLUSTER) {
>  			si->cluster_nr = SWAPFILE_CLUSTER - 1;
> @@ -874,26 +926,30 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
>  
>  checks:
>  	if (si->cluster_info) {
> -		while (scan_swap_map_ssd_cluster_conflict(si, offset)) {
> +		while (scan_swap_map_ssd_cluster_conflict(si, offset, order)) {
>  		/* take a break if we already got some slots */
>  			if (n_ret)
>  				goto done;
>  			if (!scan_swap_map_try_ssd_cluster(si, &offset,
> -							&scan_base))
> +							&scan_base, order)) {
> +				if (order > 0)
> +					goto no_page;
>  				goto scan;
> +			}
>  		}
>  	}
>  	if (!(si->flags & SWP_WRITEOK))
>  		goto no_page;
>  	if (!si->highest_bit)
>  		goto no_page;
> -	if (offset > si->highest_bit)
> +	if (order == 0 && offset > si->highest_bit)

I don't think that we need to check "order == 0" here.  The original
condition will always be false for "order != 0".

>  		scan_base = offset = si->lowest_bit;
>  
>  	ci = lock_cluster(si, offset);
>  	/* reuse swap entry of cache-only swap if not busy. */
>  	if (vm_swap_full() && si->swap_map[offset] == SWAP_HAS_CACHE) {
>  		int swap_was_freed;
> +		VM_WARN_ON(order > 0);

Instead of adding the WARN here, I think that it's better to add the WARN at
the beginning of the "scan" label.  We should never scan if "order > 0", so it
can capture even more abnormal status.
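I.e., something like this (untested sketch), placing the check at the top of
the label before the existing unlock:

scan:
	/* We should never reach the scanning path for order > 0. */
	VM_WARN_ON(order > 0);
	spin_unlock(&si->lock);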

>  		unlock_cluster(ci);
>  		spin_unlock(&si->lock);
>  		swap_was_freed = __try_to_reclaim_swap(si, offset, TTRS_ANYWAY);
> @@ -905,17 +961,18 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
>  	}
>  
>  	if (si->swap_map[offset]) {
> +		VM_WARN_ON(order > 0);
>  		unlock_cluster(ci);
>  		if (!n_ret)
>  			goto scan;
>  		else
>  			goto done;
>  	}
> -	WRITE_ONCE(si->swap_map[offset], usage);
> -	inc_cluster_info_page(si, si->cluster_info, offset);
> +	memset(si->swap_map + offset, usage, nr_pages);

Add a barrier() here to correspond to the original WRITE_ONCE()?
unlock_cluster(ci) may be a NOP for some swap devices.
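I.e., something like this (untested sketch):

	memset(si->swap_map + offset, usage, nr_pages);
	/*
	 * Compiler barrier standing in for the ordering the old WRITE_ONCE()
	 * provided; unlock_cluster() may be a no-op when there is no
	 * cluster_info, so we cannot rely on it alone.
	 */
	barrier();
	add_cluster_info_page(si, si->cluster_info, offset, nr_pages);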

> +	add_cluster_info_page(si, si->cluster_info, offset, nr_pages);
>  	unlock_cluster(ci);
>  
> -	swap_range_alloc(si, offset, 1);
> +	swap_range_alloc(si, offset, nr_pages);
>  	slots[n_ret++] = swp_entry(si->type, offset);
>  
>  	/* got enough slots or reach max slots? */

If "order > 0", "nr" must be 1.  So, we will "goto done" in the
following code.

        /* got enough slots or reach max slots? */
        if ((n_ret == nr) || (offset >= si->highest_bit))
           goto done;

We can add VM_WARN_ON() here to capture some abnormal status.
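I.e., something like this (sketch only):

	/* got enough slots or reach max slots? */
	if ((n_ret == nr) || (offset >= si->highest_bit))
		goto done;

	/* A large order allocation requests a single entry, so never gets here. */
	VM_WARN_ON(order > 0);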

> @@ -936,8 +993,10 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
>  
>  	/* try to get more slots in cluster */
>  	if (si->cluster_info) {
> -		if (scan_swap_map_try_ssd_cluster(si, &offset, &scan_base))
> +		if (scan_swap_map_try_ssd_cluster(si, &offset, &scan_base, order))
>  			goto checks;
> +		if (order > 0)
> +			goto done;

Don't need to add this, if "order > 0", we will never go here.

>  	} else if (si->cluster_nr && !si->swap_map[++offset]) {
>  		/* non-ssd case, still more slots in cluster? */
>  		--si->cluster_nr;
> @@ -964,7 +1023,8 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
>  	}
>  
>  done:
> -	set_cluster_next(si, offset + 1);
> +	if (order == 0)
> +		set_cluster_next(si, offset + 1);
>  	si->flags -= SWP_SCANNING;
>  	return n_ret;
>  
> @@ -997,38 +1057,6 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
>  	return n_ret;
>  }
>  
> -static int swap_alloc_cluster(struct swap_info_struct *si, swp_entry_t *slot)
> -{
> -	unsigned long idx;
> -	struct swap_cluster_info *ci;
> -	unsigned long offset;
> -
> -	/*
> -	 * Should not even be attempting cluster allocations when huge
> -	 * page swap is disabled.  Warn and fail the allocation.
> -	 */
> -	if (!IS_ENABLED(CONFIG_THP_SWAP)) {
> -		VM_WARN_ON_ONCE(1);
> -		return 0;
> -	}
> -
> -	if (cluster_list_empty(&si->free_clusters))
> -		return 0;
> -
> -	idx = cluster_list_first(&si->free_clusters);
> -	offset = idx * SWAPFILE_CLUSTER;
> -	ci = lock_cluster(si, offset);
> -	alloc_cluster(si, idx);
> -	cluster_set_count(ci, SWAPFILE_CLUSTER);
> -
> -	memset(si->swap_map + offset, SWAP_HAS_CACHE, SWAPFILE_CLUSTER);
> -	unlock_cluster(ci);
> -	swap_range_alloc(si, offset, SWAPFILE_CLUSTER);
> -	*slot = swp_entry(si->type, offset);
> -
> -	return 1;
> -}
> -
>  static void swap_free_cluster(struct swap_info_struct *si, unsigned long idx)
>  {
>  	unsigned long offset = idx * SWAPFILE_CLUSTER;
> @@ -1050,8 +1078,8 @@ int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_size)
>  	int n_ret = 0;
>  	int node;
>  
> -	/* Only single cluster request supported */
> -	WARN_ON_ONCE(n_goal > 1 && size == SWAPFILE_CLUSTER);
> +	/* Only single THP request supported */
> +	WARN_ON_ONCE(n_goal > 1 && size > 1);
>  
>  	spin_lock(&swap_avail_lock);
>  
> @@ -1088,14 +1116,10 @@ int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_size)
>  			spin_unlock(&si->lock);
>  			goto nextsi;
>  		}
> -		if (size == SWAPFILE_CLUSTER) {
> -			if (si->flags & SWP_BLKDEV)
> -				n_ret = swap_alloc_cluster(si, swp_entries);
> -		} else
> -			n_ret = scan_swap_map_slots(si, SWAP_HAS_CACHE,
> -						    n_goal, swp_entries);
> +		n_ret = scan_swap_map_slots(si, SWAP_HAS_CACHE,
> +					    n_goal, swp_entries, size);
>  		spin_unlock(&si->lock);
> -		if (n_ret || size == SWAPFILE_CLUSTER)
> +		if (n_ret || size > 1)
>  			goto check_out;
>  		cond_resched();
>  
> @@ -1647,7 +1671,7 @@ swp_entry_t get_swap_page_of_type(int type)
>  
>  	/* This is called for allocating swap entry, not cache */
>  	spin_lock(&si->lock);
> -	if ((si->flags & SWP_WRITEOK) && scan_swap_map_slots(si, 1, 1, &entry))
> +	if ((si->flags & SWP_WRITEOK) && scan_swap_map_slots(si, 1, 1, &entry, 1))
>  		atomic_long_dec(&nr_swap_pages);
>  	spin_unlock(&si->lock);
>  fail:
> @@ -3101,7 +3125,7 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
>  		p->flags |= SWP_SYNCHRONOUS_IO;
>  
>  	if (p->bdev && bdev_nonrot(p->bdev)) {
> -		int cpu;
> +		int cpu, i;
>  		unsigned long ci, nr_cluster;
>  
>  		p->flags |= SWP_SOLIDSTATE;
> @@ -3139,7 +3163,8 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
>  			struct percpu_cluster *cluster;
>  
>  			cluster = per_cpu_ptr(p->percpu_cluster, cpu);
> -			cluster->next = SWAP_NEXT_INVALID;
> +			for (i = 0; i < SWAP_NR_ORDERS; i++)
> +				cluster->next[i] = SWAP_NEXT_INVALID;
>  		}
>  	} else {
>  		atomic_inc(&nr_rotate_swap);

You also need to check whether we should add swap_entry_size() for some
functions to optimize for small systems.  We may need to add swap_order()
too.

--
Best Regards,
Huang, Ying
Ryan Roberts March 12, 2024, 9:40 a.m. UTC | #2
On 12/03/2024 07:51, Huang, Ying wrote:
> Ryan Roberts <ryan.roberts@arm.com> writes:
> 
>> Multi-size THP enables performance improvements by allocating large,
>> pte-mapped folios for anonymous memory. However I've observed that on an
>> arm64 system running a parallel workload (e.g. kernel compilation)
>> across many cores, under high memory pressure, the speed regresses. This
>> is due to bottlenecking on the increased number of TLBIs added due to
>> all the extra folio splitting when the large folios are swapped out.
>>
>> Therefore, solve this regression by adding support for swapping out mTHP
>> without needing to split the folio, just like is already done for
>> PMD-sized THP. This change only applies when CONFIG_THP_SWAP is enabled,
>> and when the swap backing store is a non-rotating block device. These
>> are the same constraints as for the existing PMD-sized THP swap-out
>> support.
>>
>> Note that no attempt is made to swap-in (m)THP here - this is still done
>> page-by-page, like for PMD-sized THP. But swapping-out mTHP is a
>> prerequisite for swapping-in mTHP.
>>
>> The main change here is to improve the swap entry allocator so that it
>> can allocate any power-of-2 number of contiguous entries between [1, (1
>> << PMD_ORDER)]. This is done by allocating a cluster for each distinct
>> order and allocating sequentially from it until the cluster is full.
>> This ensures that we don't need to search the map and we get no
>> fragmentation due to alignment padding for different orders in the
>> cluster. If there is no current cluster for a given order, we attempt to
>> allocate a free cluster from the list. If there are no free clusters, we
>> fail the allocation and the caller can fall back to splitting the folio
>> and allocating individual entries (as per existing PMD-sized THP
>> fallback).
>>
>> The per-order current clusters are maintained per-cpu using the existing
>> infrastructure. This is done to avoid interleaving pages from different
>> tasks, which would prevent IO being batched. This is already done for
>> the order-0 allocations so we follow the same pattern.
>>
>> As is done for order-0 per-cpu clusters, the scanner now can steal
>> order-0 entries from any per-cpu-per-order reserved cluster. This
>> ensures that when the swap file is getting full, space doesn't get tied
>> up in the per-cpu reserves.
>>
>> This change only modifies swap to be able to accept any order mTHP. It
>> doesn't change the callers to elide doing the actual split. That will be
>> done in separate changes.
>>
>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>> ---
>>  include/linux/swap.h |   8 ++-
>>  mm/swapfile.c        | 167 +++++++++++++++++++++++++------------------
>>  2 files changed, 103 insertions(+), 72 deletions(-)
>>
>> diff --git a/include/linux/swap.h b/include/linux/swap.h
>> index 0cb082bee717..39b5c18ccc6a 100644
>> --- a/include/linux/swap.h
>> +++ b/include/linux/swap.h
>> @@ -268,13 +268,19 @@ struct swap_cluster_info {
>>   */
>>  #define SWAP_NEXT_INVALID	0
>>  
>> +#ifdef CONFIG_THP_SWAP
>> +#define SWAP_NR_ORDERS		(PMD_ORDER + 1)
>> +#else
>> +#define SWAP_NR_ORDERS		1
>> +#endif
>> +
>>  /*
>>   * We assign a cluster to each CPU, so each CPU can allocate swap entry from
>>   * its own cluster and swapout sequentially. The purpose is to optimize swapout
>>   * throughput.
>>   */
>>  struct percpu_cluster {
>> -	unsigned int next; /* Likely next allocation offset */
>> +	unsigned int next[SWAP_NR_ORDERS]; /* Likely next allocation offset */
>>  };
>>  
>>  struct swap_cluster_list {
>> diff --git a/mm/swapfile.c b/mm/swapfile.c
>> index 3828d81aa6b8..61118a090796 100644
>> --- a/mm/swapfile.c
>> +++ b/mm/swapfile.c
>> @@ -551,10 +551,12 @@ static void free_cluster(struct swap_info_struct *si, unsigned long idx)
>>  
>>  /*
>>   * The cluster corresponding to page_nr will be used. The cluster will be
>> - * removed from free cluster list and its usage counter will be increased.
>> + * removed from free cluster list and its usage counter will be increased by
>> + * count.
>>   */
>> -static void inc_cluster_info_page(struct swap_info_struct *p,
>> -	struct swap_cluster_info *cluster_info, unsigned long page_nr)
>> +static void add_cluster_info_page(struct swap_info_struct *p,
>> +	struct swap_cluster_info *cluster_info, unsigned long page_nr,
>> +	unsigned long count)
>>  {
>>  	unsigned long idx = page_nr / SWAPFILE_CLUSTER;
>>  
>> @@ -563,9 +565,19 @@ static void inc_cluster_info_page(struct swap_info_struct *p,
>>  	if (cluster_is_free(&cluster_info[idx]))
>>  		alloc_cluster(p, idx);
>>  
>> -	VM_BUG_ON(cluster_count(&cluster_info[idx]) >= SWAPFILE_CLUSTER);
>> +	VM_BUG_ON(cluster_count(&cluster_info[idx]) + count > SWAPFILE_CLUSTER);
>>  	cluster_set_count(&cluster_info[idx],
>> -		cluster_count(&cluster_info[idx]) + 1);
>> +		cluster_count(&cluster_info[idx]) + count);
>> +}
>> +
>> +/*
>> + * The cluster corresponding to page_nr will be used. The cluster will be
>> + * removed from free cluster list and its usage counter will be increased by 1.
>> + */
>> +static void inc_cluster_info_page(struct swap_info_struct *p,
>> +	struct swap_cluster_info *cluster_info, unsigned long page_nr)
>> +{
>> +	add_cluster_info_page(p, cluster_info, page_nr, 1);
>>  }
>>  
>>  /*
>> @@ -595,7 +607,7 @@ static void dec_cluster_info_page(struct swap_info_struct *p,
>>   */
>>  static bool
>>  scan_swap_map_ssd_cluster_conflict(struct swap_info_struct *si,
>> -	unsigned long offset)
>> +	unsigned long offset, int order)
>>  {
>>  	struct percpu_cluster *percpu_cluster;
>>  	bool conflict;
>> @@ -609,24 +621,39 @@ scan_swap_map_ssd_cluster_conflict(struct swap_info_struct *si,
>>  		return false;
>>  
>>  	percpu_cluster = this_cpu_ptr(si->percpu_cluster);
>> -	percpu_cluster->next = SWAP_NEXT_INVALID;
>> +	percpu_cluster->next[order] = SWAP_NEXT_INVALID;
>> +	return true;
>> +}
>> +
>> +static inline bool swap_range_empty(char *swap_map, unsigned int start,
>> +				    unsigned int nr_pages)
>> +{
>> +	unsigned int i;
>> +
>> +	for (i = 0; i < nr_pages; i++) {
>> +		if (swap_map[start + i])
>> +			return false;
>> +	}
>> +
>>  	return true;
>>  }
>>  
>>  /*
>> - * Try to get a swap entry from current cpu's swap entry pool (a cluster). This
>> - * might involve allocating a new cluster for current CPU too.
>> + * Try to get a swap entry (or size indicated by order) from current cpu's swap
> 
> IMO, it's not necessary to make mTHP a special case other than base
> page.  So, this can be changed to
> 
>  * Try to get swap entries with specified order from current cpu's swap

Sure, will fix in next version.

> 
>> + * entry pool (a cluster). This might involve allocating a new cluster for
>> + * current CPU too.
>>   */
>>  static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
>> -	unsigned long *offset, unsigned long *scan_base)
>> +	unsigned long *offset, unsigned long *scan_base, int order)
>>  {
>> +	unsigned int nr_pages = 1 << order;
>>  	struct percpu_cluster *cluster;
>>  	struct swap_cluster_info *ci;
>>  	unsigned int tmp, max;
>>  
>>  new_cluster:
>>  	cluster = this_cpu_ptr(si->percpu_cluster);
>> -	tmp = cluster->next;
>> +	tmp = cluster->next[order];
>>  	if (tmp == SWAP_NEXT_INVALID) {
>>  		if (!cluster_list_empty(&si->free_clusters)) {
>>  			tmp = cluster_next(&si->free_clusters.head) *
>> @@ -647,26 +674,27 @@ static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
>>  
>>  	/*
>>  	 * Other CPUs can use our cluster if they can't find a free cluster,
>> -	 * check if there is still free entry in the cluster
>> +	 * check if there is still free entry in the cluster, maintaining
>> +	 * natural alignment.
>>  	 */
>>  	max = min_t(unsigned long, si->max, ALIGN(tmp + 1, SWAPFILE_CLUSTER));
>>  	if (tmp < max) {
>>  		ci = lock_cluster(si, tmp);
>>  		while (tmp < max) {
>> -			if (!si->swap_map[tmp])
>> +			if (swap_range_empty(si->swap_map, tmp, nr_pages))
>>  				break;
>> -			tmp++;
>> +			tmp += nr_pages;
>>  		}
>>  		unlock_cluster(ci);
>>  	}
>>  	if (tmp >= max) {
>> -		cluster->next = SWAP_NEXT_INVALID;
>> +		cluster->next[order] = SWAP_NEXT_INVALID;
>>  		goto new_cluster;
>>  	}
>>  	*offset = tmp;
>>  	*scan_base = tmp;
>> -	tmp += 1;
>> -	cluster->next = tmp < max ? tmp : SWAP_NEXT_INVALID;
>> +	tmp += nr_pages;
>> +	cluster->next[order] = tmp < max ? tmp : SWAP_NEXT_INVALID;
>>  	return true;
>>  }
>>  
>> @@ -796,13 +824,14 @@ static bool swap_offset_available_and_locked(struct swap_info_struct *si,
>>  
>>  static int scan_swap_map_slots(struct swap_info_struct *si,
>>  			       unsigned char usage, int nr,
>> -			       swp_entry_t slots[])
>> +			       swp_entry_t slots[], unsigned int nr_pages)
> 
> IMHO, it's better to use order as a parameter directly.  We can change the
> parameter of get_swap_pages() too.

I agree that this will make the interface clearer/self-documenting. I'll do it
in the next version.

> 
>>  {
>>  	struct swap_cluster_info *ci;
>>  	unsigned long offset;
>>  	unsigned long scan_base;
>>  	unsigned long last_in_cluster = 0;
>>  	int latency_ration = LATENCY_LIMIT;
>> +	int order = ilog2(nr_pages);
>>  	int n_ret = 0;
>>  	bool scanned_many = false;
>>  
>> @@ -817,6 +846,26 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
>>  	 * And we let swap pages go all over an SSD partition.  Hugh
>>  	 */
>>  
>> +	if (nr_pages > 1) {
>> +		/*
>> +		 * Should not even be attempting large allocations when huge
>> +		 * page swap is disabled.  Warn and fail the allocation.
>> +		 */
>> +		if (!IS_ENABLED(CONFIG_THP_SWAP) ||
>> +		    nr_pages > SWAPFILE_CLUSTER ||
>> +		    !is_power_of_2(nr_pages)) {
>> +			VM_WARN_ON_ONCE(1);
>> +			return 0;
>> +		}
>> +
>> +		/*
>> +		 * Swapfile is not block device or not using clusters so unable
>> +		 * to allocate large entries.
>> +		 */
>> +		if (!(si->flags & SWP_BLKDEV) || !si->cluster_info)
>> +			return 0;
>> +	}
>> +
>>  	si->flags += SWP_SCANNING;
>>  	/*
>>  	 * Use percpu scan base for SSD to reduce lock contention on
>> @@ -831,8 +880,11 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
>>  
>>  	/* SSD algorithm */
>>  	if (si->cluster_info) {
>> -		if (!scan_swap_map_try_ssd_cluster(si, &offset, &scan_base))
>> +		if (!scan_swap_map_try_ssd_cluster(si, &offset, &scan_base, order)) {
>> +			if (order > 0)
>> +				goto no_page;
>>  			goto scan;
>> +		}
>>  	} else if (unlikely(!si->cluster_nr--)) {
>>  		if (si->pages - si->inuse_pages < SWAPFILE_CLUSTER) {
>>  			si->cluster_nr = SWAPFILE_CLUSTER - 1;
>> @@ -874,26 +926,30 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
>>  
>>  checks:
>>  	if (si->cluster_info) {
>> -		while (scan_swap_map_ssd_cluster_conflict(si, offset)) {
>> +		while (scan_swap_map_ssd_cluster_conflict(si, offset, order)) {
>>  		/* take a break if we already got some slots */
>>  			if (n_ret)
>>  				goto done;
>>  			if (!scan_swap_map_try_ssd_cluster(si, &offset,
>> -							&scan_base))
>> +							&scan_base, order)) {
>> +				if (order > 0)
>> +					goto no_page;
>>  				goto scan;
>> +			}
>>  		}
>>  	}
>>  	if (!(si->flags & SWP_WRITEOK))
>>  		goto no_page;
>>  	if (!si->highest_bit)
>>  		goto no_page;
>> -	if (offset > si->highest_bit)
>> +	if (order == 0 && offset > si->highest_bit)
> 
> I don't think that we need to check "order == 0" here.  The original
> condition will always be false for "order != 0".

I spent ages looking at this and couldn't quite convince myself that this is
definitely safe. Certainly it would be catastrophic if we modified the returned
offset for a non-order-0 case (the code below assumes order-0 when checking). So
I decided in the end to be safe and add this condition. Looking again, I agree
with you. Will fix in next version.

> 
>>  		scan_base = offset = si->lowest_bit;
>>  
>>  	ci = lock_cluster(si, offset);
>>  	/* reuse swap entry of cache-only swap if not busy. */
>>  	if (vm_swap_full() && si->swap_map[offset] == SWAP_HAS_CACHE) {
>>  		int swap_was_freed;
>> +		VM_WARN_ON(order > 0);
> 
> Instead of adding the WARN here, I think that it's better to add the WARN at
> the beginning of the "scan" label.  We should never scan if "order > 0", so it
> can capture even more abnormal status.

OK, will do.

> 
>>  		unlock_cluster(ci);
>>  		spin_unlock(&si->lock);
>>  		swap_was_freed = __try_to_reclaim_swap(si, offset, TTRS_ANYWAY);
>> @@ -905,17 +961,18 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
>>  	}
>>  
>>  	if (si->swap_map[offset]) {
>> +		VM_WARN_ON(order > 0);

And remove this one too? (relying on the one in scan instead)

>>  		unlock_cluster(ci);
>>  		if (!n_ret)
>>  			goto scan;
>>  		else
>>  			goto done;
>>  	}
>> -	WRITE_ONCE(si->swap_map[offset], usage);
>> -	inc_cluster_info_page(si, si->cluster_info, offset);
>> +	memset(si->swap_map + offset, usage, nr_pages);
> 
> Add a barrier() here to correspond to the original WRITE_ONCE()?
> unlock_cluster(ci) may be a NOP for some swap devices.

Yep, good spot!

> 
>> +	add_cluster_info_page(si, si->cluster_info, offset, nr_pages);
>>  	unlock_cluster(ci);
>>  
>> -	swap_range_alloc(si, offset, 1);
>> +	swap_range_alloc(si, offset, nr_pages);
>>  	slots[n_ret++] = swp_entry(si->type, offset);
>>  
>>  	/* got enough slots or reach max slots? */
> 
> If "order > 0", "nr" must be 1.  So, we will "goto done" in the
> following code.

I've deliberately implemented scan_swap_map_slots() so that it allows nr > 1 for
order > 0, and leaves it to the higher layers to decide on policy.

> 
>         /* got enough slots or reach max slots? */
>         if ((n_ret == nr) || (offset >= si->highest_bit))
>            goto done;
> 
> We can add VM_WARN_ON() here to capture some abnormal status.

That was actually how I implemented it initially. But I decided that it doesn't
cost anything to allow nr > 1 for order > 0, and IMHO it makes the function
easier to understand because we remove this unnecessary constraint.

> 
>> @@ -936,8 +993,10 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
>>  
>>  	/* try to get more slots in cluster */
>>  	if (si->cluster_info) {
>> -		if (scan_swap_map_try_ssd_cluster(si, &offset, &scan_base))
>> +		if (scan_swap_map_try_ssd_cluster(si, &offset, &scan_base, order))
>>  			goto checks;
>> +		if (order > 0)
>> +			goto done;
> 
> Don't need to add this, if "order > 0", we will never go here.

As per above.

> 
>>  	} else if (si->cluster_nr && !si->swap_map[++offset]) {
>>  		/* non-ssd case, still more slots in cluster? */
>>  		--si->cluster_nr;
>> @@ -964,7 +1023,8 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
>>  	}
>>  
>>  done:
>> -	set_cluster_next(si, offset + 1);
>> +	if (order == 0)
>> +		set_cluster_next(si, offset + 1);
>>  	si->flags -= SWP_SCANNING;
>>  	return n_ret;
>>  
>> @@ -997,38 +1057,6 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
>>  	return n_ret;
>>  }
>>  
>> -static int swap_alloc_cluster(struct swap_info_struct *si, swp_entry_t *slot)
>> -{
>> -	unsigned long idx;
>> -	struct swap_cluster_info *ci;
>> -	unsigned long offset;
>> -
>> -	/*
>> -	 * Should not even be attempting cluster allocations when huge
>> -	 * page swap is disabled.  Warn and fail the allocation.
>> -	 */
>> -	if (!IS_ENABLED(CONFIG_THP_SWAP)) {
>> -		VM_WARN_ON_ONCE(1);
>> -		return 0;
>> -	}
>> -
>> -	if (cluster_list_empty(&si->free_clusters))
>> -		return 0;
>> -
>> -	idx = cluster_list_first(&si->free_clusters);
>> -	offset = idx * SWAPFILE_CLUSTER;
>> -	ci = lock_cluster(si, offset);
>> -	alloc_cluster(si, idx);
>> -	cluster_set_count(ci, SWAPFILE_CLUSTER);
>> -
>> -	memset(si->swap_map + offset, SWAP_HAS_CACHE, SWAPFILE_CLUSTER);
>> -	unlock_cluster(ci);
>> -	swap_range_alloc(si, offset, SWAPFILE_CLUSTER);
>> -	*slot = swp_entry(si->type, offset);
>> -
>> -	return 1;
>> -}
>> -
>>  static void swap_free_cluster(struct swap_info_struct *si, unsigned long idx)
>>  {
>>  	unsigned long offset = idx * SWAPFILE_CLUSTER;
>> @@ -1050,8 +1078,8 @@ int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_size)
>>  	int n_ret = 0;
>>  	int node;
>>  
>> -	/* Only single cluster request supported */
>> -	WARN_ON_ONCE(n_goal > 1 && size == SWAPFILE_CLUSTER);
>> +	/* Only single THP request supported */
>> +	WARN_ON_ONCE(n_goal > 1 && size > 1);
>>  
>>  	spin_lock(&swap_avail_lock);
>>  
>> @@ -1088,14 +1116,10 @@ int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_size)
>>  			spin_unlock(&si->lock);
>>  			goto nextsi;
>>  		}
>> -		if (size == SWAPFILE_CLUSTER) {
>> -			if (si->flags & SWP_BLKDEV)
>> -				n_ret = swap_alloc_cluster(si, swp_entries);
>> -		} else
>> -			n_ret = scan_swap_map_slots(si, SWAP_HAS_CACHE,
>> -						    n_goal, swp_entries);
>> +		n_ret = scan_swap_map_slots(si, SWAP_HAS_CACHE,
>> +					    n_goal, swp_entries, size);
>>  		spin_unlock(&si->lock);
>> -		if (n_ret || size == SWAPFILE_CLUSTER)
>> +		if (n_ret || size > 1)
>>  			goto check_out;
>>  		cond_resched();
>>  
>> @@ -1647,7 +1671,7 @@ swp_entry_t get_swap_page_of_type(int type)
>>  
>>  	/* This is called for allocating swap entry, not cache */
>>  	spin_lock(&si->lock);
>> -	if ((si->flags & SWP_WRITEOK) && scan_swap_map_slots(si, 1, 1, &entry))
>> +	if ((si->flags & SWP_WRITEOK) && scan_swap_map_slots(si, 1, 1, &entry, 1))
>>  		atomic_long_dec(&nr_swap_pages);
>>  	spin_unlock(&si->lock);
>>  fail:
>> @@ -3101,7 +3125,7 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
>>  		p->flags |= SWP_SYNCHRONOUS_IO;
>>  
>>  	if (p->bdev && bdev_nonrot(p->bdev)) {
>> -		int cpu;
>> +		int cpu, i;
>>  		unsigned long ci, nr_cluster;
>>  
>>  		p->flags |= SWP_SOLIDSTATE;
>> @@ -3139,7 +3163,8 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
>>  			struct percpu_cluster *cluster;
>>  
>>  			cluster = per_cpu_ptr(p->percpu_cluster, cpu);
>> -			cluster->next = SWAP_NEXT_INVALID;
>> +			for (i = 0; i < SWAP_NR_ORDERS; i++)
>> +				cluster->next[i] = SWAP_NEXT_INVALID;
>>  		}
>>  	} else {
>>  		atomic_inc(&nr_rotate_swap);
> 
> You also need to check whether we should add swap_entry_size() for some
> functions to optimize for small systems.  We may need to add swap_order()
> too.

I was planning to convert swap_entry_size() to swap_entry_order() as part of
switching to pass order instead of nr_pages. There is one other site that uses
swap_entry_size() and needs a size, so I was going to just change it to 1 <<
swap_entry_order(). Does that work for you?
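For the conversion itself, I'm thinking of roughly this (sketch only; it
mirrors the existing swap_entry_size() definition):

	#ifdef CONFIG_THP_SWAP
	#define swap_entry_order(order)	(order)
	#else
	#define swap_entry_order(order)	0
	#endif

with the remaining size user becoming "1 << swap_entry_order(...)".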

I'll do an audit for places to use swap_entry_order() but a quick scan just now
suggests that the constant should propagate to all the static functions from
get_swap_pages().

Thanks,
Ryan

> 
> --
> Best Regards,
> Huang, Ying
Huang, Ying March 13, 2024, 1:33 a.m. UTC | #3
Ryan Roberts <ryan.roberts@arm.com> writes:

> On 12/03/2024 07:51, Huang, Ying wrote:
>> Ryan Roberts <ryan.roberts@arm.com> writes:
>> 
>>> Multi-size THP enables performance improvements by allocating large,
>>> pte-mapped folios for anonymous memory. However I've observed that on an
>>> arm64 system running a parallel workload (e.g. kernel compilation)
>>> across many cores, under high memory pressure, the speed regresses. This
>>> is due to bottlenecking on the increased number of TLBIs added due to
>>> all the extra folio splitting when the large folios are swapped out.
>>>
>>> Therefore, solve this regression by adding support for swapping out mTHP
>>> without needing to split the folio, just like is already done for
>>> PMD-sized THP. This change only applies when CONFIG_THP_SWAP is enabled,
>>> and when the swap backing store is a non-rotating block device. These
>>> are the same constraints as for the existing PMD-sized THP swap-out
>>> support.
>>>
>>> Note that no attempt is made to swap-in (m)THP here - this is still done
>>> page-by-page, like for PMD-sized THP. But swapping-out mTHP is a
>>> prerequisite for swapping-in mTHP.
>>>
>>> The main change here is to improve the swap entry allocator so that it
>>> can allocate any power-of-2 number of contiguous entries between [1, (1
>>> << PMD_ORDER)]. This is done by allocating a cluster for each distinct
>>> order and allocating sequentially from it until the cluster is full.
>>> This ensures that we don't need to search the map and we get no
>>> fragmentation due to alignment padding for different orders in the
>>> cluster. If there is no current cluster for a given order, we attempt to
>>> allocate a free cluster from the list. If there are no free clusters, we
>>> fail the allocation and the caller can fall back to splitting the folio
>>> and allocating individual entries (as per existing PMD-sized THP
>>> fallback).
>>>
>>> The per-order current clusters are maintained per-cpu using the existing
>>> infrastructure. This is done to avoid interleaving pages from different
>>> tasks, which would prevent IO being batched. This is already done for
>>> the order-0 allocations so we follow the same pattern.
>>>
>>> As is done for order-0 per-cpu clusters, the scanner now can steal
>>> order-0 entries from any per-cpu-per-order reserved cluster. This
>>> ensures that when the swap file is getting full, space doesn't get tied
>>> up in the per-cpu reserves.
>>>
>>> This change only modifies swap to be able to accept any order mTHP. It
>>> doesn't change the callers to elide doing the actual split. That will be
>>> done in separate changes.
>>>
>>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>>> ---
>>>  include/linux/swap.h |   8 ++-
>>>  mm/swapfile.c        | 167 +++++++++++++++++++++++++------------------
>>>  2 files changed, 103 insertions(+), 72 deletions(-)
>>>
>>> diff --git a/include/linux/swap.h b/include/linux/swap.h
>>> index 0cb082bee717..39b5c18ccc6a 100644
>>> --- a/include/linux/swap.h
>>> +++ b/include/linux/swap.h
>>> @@ -268,13 +268,19 @@ struct swap_cluster_info {
>>>   */
>>>  #define SWAP_NEXT_INVALID	0
>>>  
>>> +#ifdef CONFIG_THP_SWAP
>>> +#define SWAP_NR_ORDERS		(PMD_ORDER + 1)
>>> +#else
>>> +#define SWAP_NR_ORDERS		1
>>> +#endif
>>> +
>>>  /*
>>>   * We assign a cluster to each CPU, so each CPU can allocate swap entry from
>>>   * its own cluster and swapout sequentially. The purpose is to optimize swapout
>>>   * throughput.
>>>   */
>>>  struct percpu_cluster {
>>> -	unsigned int next; /* Likely next allocation offset */
>>> +	unsigned int next[SWAP_NR_ORDERS]; /* Likely next allocation offset */
>>>  };
>>>  
>>>  struct swap_cluster_list {
>>> diff --git a/mm/swapfile.c b/mm/swapfile.c
>>> index 3828d81aa6b8..61118a090796 100644
>>> --- a/mm/swapfile.c
>>> +++ b/mm/swapfile.c
>>> @@ -551,10 +551,12 @@ static void free_cluster(struct swap_info_struct *si, unsigned long idx)
>>>  
>>>  /*
>>>   * The cluster corresponding to page_nr will be used. The cluster will be
>>> - * removed from free cluster list and its usage counter will be increased.
>>> + * removed from free cluster list and its usage counter will be increased by
>>> + * count.
>>>   */
>>> -static void inc_cluster_info_page(struct swap_info_struct *p,
>>> -	struct swap_cluster_info *cluster_info, unsigned long page_nr)
>>> +static void add_cluster_info_page(struct swap_info_struct *p,
>>> +	struct swap_cluster_info *cluster_info, unsigned long page_nr,
>>> +	unsigned long count)
>>>  {
>>>  	unsigned long idx = page_nr / SWAPFILE_CLUSTER;
>>>  
>>> @@ -563,9 +565,19 @@ static void inc_cluster_info_page(struct swap_info_struct *p,
>>>  	if (cluster_is_free(&cluster_info[idx]))
>>>  		alloc_cluster(p, idx);
>>>  
>>> -	VM_BUG_ON(cluster_count(&cluster_info[idx]) >= SWAPFILE_CLUSTER);
>>> +	VM_BUG_ON(cluster_count(&cluster_info[idx]) + count > SWAPFILE_CLUSTER);
>>>  	cluster_set_count(&cluster_info[idx],
>>> -		cluster_count(&cluster_info[idx]) + 1);
>>> +		cluster_count(&cluster_info[idx]) + count);
>>> +}
>>> +
>>> +/*
>>> + * The cluster corresponding to page_nr will be used. The cluster will be
>>> + * removed from free cluster list and its usage counter will be increased by 1.
>>> + */
>>> +static void inc_cluster_info_page(struct swap_info_struct *p,
>>> +	struct swap_cluster_info *cluster_info, unsigned long page_nr)
>>> +{
>>> +	add_cluster_info_page(p, cluster_info, page_nr, 1);
>>>  }
>>>  
>>>  /*
>>> @@ -595,7 +607,7 @@ static void dec_cluster_info_page(struct swap_info_struct *p,
>>>   */
>>>  static bool
>>>  scan_swap_map_ssd_cluster_conflict(struct swap_info_struct *si,
>>> -	unsigned long offset)
>>> +	unsigned long offset, int order)
>>>  {
>>>  	struct percpu_cluster *percpu_cluster;
>>>  	bool conflict;
>>> @@ -609,24 +621,39 @@ scan_swap_map_ssd_cluster_conflict(struct swap_info_struct *si,
>>>  		return false;
>>>  
>>>  	percpu_cluster = this_cpu_ptr(si->percpu_cluster);
>>> -	percpu_cluster->next = SWAP_NEXT_INVALID;
>>> +	percpu_cluster->next[order] = SWAP_NEXT_INVALID;
>>> +	return true;
>>> +}
>>> +
>>> +static inline bool swap_range_empty(char *swap_map, unsigned int start,
>>> +				    unsigned int nr_pages)
>>> +{
>>> +	unsigned int i;
>>> +
>>> +	for (i = 0; i < nr_pages; i++) {
>>> +		if (swap_map[start + i])
>>> +			return false;
>>> +	}
>>> +
>>>  	return true;
>>>  }
>>>  
>>>  /*
>>> - * Try to get a swap entry from current cpu's swap entry pool (a cluster). This
>>> - * might involve allocating a new cluster for current CPU too.
>>> + * Try to get a swap entry (or size indicated by order) from current cpu's swap
>> 
>> IMO, it's not necessary to make mTHP a special case other than base
>> page.  So, this can be changed to
>> 
>>  * Try to get swap entries with specified order from current cpu's swap
>
> Sure, will fix in next version.
>
>> 
>>> + * entry pool (a cluster). This might involve allocating a new cluster for
>>> + * current CPU too.
>>>   */
>>>  static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
>>> -	unsigned long *offset, unsigned long *scan_base)
>>> +	unsigned long *offset, unsigned long *scan_base, int order)
>>>  {
>>> +	unsigned int nr_pages = 1 << order;
>>>  	struct percpu_cluster *cluster;
>>>  	struct swap_cluster_info *ci;
>>>  	unsigned int tmp, max;
>>>  
>>>  new_cluster:
>>>  	cluster = this_cpu_ptr(si->percpu_cluster);
>>> -	tmp = cluster->next;
>>> +	tmp = cluster->next[order];
>>>  	if (tmp == SWAP_NEXT_INVALID) {
>>>  		if (!cluster_list_empty(&si->free_clusters)) {
>>>  			tmp = cluster_next(&si->free_clusters.head) *
>>> @@ -647,26 +674,27 @@ static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
>>>  
>>>  	/*
>>>  	 * Other CPUs can use our cluster if they can't find a free cluster,
>>> -	 * check if there is still free entry in the cluster
>>> +	 * check if there is still free entry in the cluster, maintaining
>>> +	 * natural alignment.
>>>  	 */
>>>  	max = min_t(unsigned long, si->max, ALIGN(tmp + 1, SWAPFILE_CLUSTER));
>>>  	if (tmp < max) {
>>>  		ci = lock_cluster(si, tmp);
>>>  		while (tmp < max) {
>>> -			if (!si->swap_map[tmp])
>>> +			if (swap_range_empty(si->swap_map, tmp, nr_pages))
>>>  				break;
>>> -			tmp++;
>>> +			tmp += nr_pages;
>>>  		}
>>>  		unlock_cluster(ci);
>>>  	}
>>>  	if (tmp >= max) {
>>> -		cluster->next = SWAP_NEXT_INVALID;
>>> +		cluster->next[order] = SWAP_NEXT_INVALID;
>>>  		goto new_cluster;
>>>  	}
>>>  	*offset = tmp;
>>>  	*scan_base = tmp;
>>> -	tmp += 1;
>>> -	cluster->next = tmp < max ? tmp : SWAP_NEXT_INVALID;
>>> +	tmp += nr_pages;
>>> +	cluster->next[order] = tmp < max ? tmp : SWAP_NEXT_INVALID;
>>>  	return true;
>>>  }
>>>  
>>> @@ -796,13 +824,14 @@ static bool swap_offset_available_and_locked(struct swap_info_struct *si,
>>>  
>>>  static int scan_swap_map_slots(struct swap_info_struct *si,
>>>  			       unsigned char usage, int nr,
>>> -			       swp_entry_t slots[])
>>> +			       swp_entry_t slots[], unsigned int nr_pages)
>> 
>> IMHO, it's better to use order as a parameter directly.  We can change the
>> parameter of get_swap_pages() too.
>
> I agree that this will make the interface clearer/self-documenting. I'll do it
> in the next version.
>
>> 
>>>  {
>>>  	struct swap_cluster_info *ci;
>>>  	unsigned long offset;
>>>  	unsigned long scan_base;
>>>  	unsigned long last_in_cluster = 0;
>>>  	int latency_ration = LATENCY_LIMIT;
>>> +	int order = ilog2(nr_pages);
>>>  	int n_ret = 0;
>>>  	bool scanned_many = false;
>>>  
>>> @@ -817,6 +846,26 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
>>>  	 * And we let swap pages go all over an SSD partition.  Hugh
>>>  	 */
>>>  
>>> +	if (nr_pages > 1) {
>>> +		/*
>>> +		 * Should not even be attempting large allocations when huge
>>> +		 * page swap is disabled.  Warn and fail the allocation.
>>> +		 */
>>> +		if (!IS_ENABLED(CONFIG_THP_SWAP) ||
>>> +		    nr_pages > SWAPFILE_CLUSTER ||
>>> +		    !is_power_of_2(nr_pages)) {
>>> +			VM_WARN_ON_ONCE(1);
>>> +			return 0;
>>> +		}
>>> +
>>> +		/*
>>> +		 * Swapfile is not block device or not using clusters so unable
>>> +		 * to allocate large entries.
>>> +		 */
>>> +		if (!(si->flags & SWP_BLKDEV) || !si->cluster_info)
>>> +			return 0;
>>> +	}
>>> +
>>>  	si->flags += SWP_SCANNING;
>>>  	/*
>>>  	 * Use percpu scan base for SSD to reduce lock contention on
>>> @@ -831,8 +880,11 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
>>>  
>>>  	/* SSD algorithm */
>>>  	if (si->cluster_info) {
>>> -		if (!scan_swap_map_try_ssd_cluster(si, &offset, &scan_base))
>>> +		if (!scan_swap_map_try_ssd_cluster(si, &offset, &scan_base, order)) {
>>> +			if (order > 0)
>>> +				goto no_page;
>>>  			goto scan;
>>> +		}
>>>  	} else if (unlikely(!si->cluster_nr--)) {
>>>  		if (si->pages - si->inuse_pages < SWAPFILE_CLUSTER) {
>>>  			si->cluster_nr = SWAPFILE_CLUSTER - 1;
>>> @@ -874,26 +926,30 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
>>>  
>>>  checks:
>>>  	if (si->cluster_info) {
>>> -		while (scan_swap_map_ssd_cluster_conflict(si, offset)) {
>>> +		while (scan_swap_map_ssd_cluster_conflict(si, offset, order)) {
>>>  		/* take a break if we already got some slots */
>>>  			if (n_ret)
>>>  				goto done;
>>>  			if (!scan_swap_map_try_ssd_cluster(si, &offset,
>>> -							&scan_base))
>>> +							&scan_base, order)) {
>>> +				if (order > 0)
>>> +					goto no_page;
>>>  				goto scan;
>>> +			}
>>>  		}
>>>  	}
>>>  	if (!(si->flags & SWP_WRITEOK))
>>>  		goto no_page;
>>>  	if (!si->highest_bit)
>>>  		goto no_page;
>>> -	if (offset > si->highest_bit)
>>> +	if (order == 0 && offset > si->highest_bit)
>> 
>> I don't think that we need to check "order == 0" here.  The original
>> condition will always be false for "order != 0".
>
> I spent ages looking at this and couldn't quite convince myself that this is
> definitely safe. Certainly it would be catastrophic if we modified the returned
> offset for a non-order-0 case (the code below assumes order-0 when checking). So
> I decided in the end to be safe and add this condition. Looking again, I agree
> with you. Will fix in next version.
>
>> 
>>>  		scan_base = offset = si->lowest_bit;
>>>  
>>>  	ci = lock_cluster(si, offset);
>>>  	/* reuse swap entry of cache-only swap if not busy. */
>>>  	if (vm_swap_full() && si->swap_map[offset] == SWAP_HAS_CACHE) {
>>>  		int swap_was_freed;
>>> +		VM_WARN_ON(order > 0);
>> 
>> Instead of adding the WARN here, I think that it's better to add the WARN at
>> the beginning of the "scan" label.  We should never scan if "order > 0", so it
>> can capture even more abnormal status.
>
> OK, will do.
>
>> 
>>>  		unlock_cluster(ci);
>>>  		spin_unlock(&si->lock);
>>>  		swap_was_freed = __try_to_reclaim_swap(si, offset, TTRS_ANYWAY);
>>> @@ -905,17 +961,18 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
>>>  	}
>>>  
>>>  	if (si->swap_map[offset]) {
>>> +		VM_WARN_ON(order > 0);
>
> And remove this one too? (relying on the one in scan instead)

Yes.  I think so.

>>>  		unlock_cluster(ci);
>>>  		if (!n_ret)
>>>  			goto scan;
>>>  		else
>>>  			goto done;
>>>  	}
>>> -	WRITE_ONCE(si->swap_map[offset], usage);
>>> -	inc_cluster_info_page(si, si->cluster_info, offset);
>>> +	memset(si->swap_map + offset, usage, nr_pages);
>> 
>> Add a barrier() here to correspond to the original WRITE_ONCE()?
>> unlock_cluster(ci) may be a NOP for some swap devices.
>
> Yep, good spot!
>
>> 
>>> +	add_cluster_info_page(si, si->cluster_info, offset, nr_pages);
>>>  	unlock_cluster(ci);
>>>  
>>> -	swap_range_alloc(si, offset, 1);
>>> +	swap_range_alloc(si, offset, nr_pages);
>>>  	slots[n_ret++] = swp_entry(si->type, offset);
>>>  
>>>  	/* got enough slots or reach max slots? */
>> 
>> If "order > 0", "nr" must be 1.  So, we will "goto done" in the
>> following code.
>
> I've deliberately implemented scan_swap_map_slots() so that it allows nr > 1 for
> order > 0, and leaves it to the higher layers to decide on policy.
>
>> 
>>         /* got enough slots or reach max slots? */
>>         if ((n_ret == nr) || (offset >= si->highest_bit))
>>            goto done;
>> 
>> We can add VM_WARN_ON() here to capture some abnormal status.
>
> That was actually how I implemented it initially. But I decided that it doesn't
> cost anything to allow nr > 1 for order > 0, and IMHO it makes the function
> easier to understand because we remove this unnecessary constraint.

This sounds reasonable to me.

>> 
>>> @@ -936,8 +993,10 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
>>>  
>>>  	/* try to get more slots in cluster */
>>>  	if (si->cluster_info) {
>>> -		if (scan_swap_map_try_ssd_cluster(si, &offset, &scan_base))
>>> +		if (scan_swap_map_try_ssd_cluster(si, &offset, &scan_base, order))
>>>  			goto checks;
>>> +		if (order > 0)
>>> +			goto done;
>> 
>> Don't need to add this, if "order > 0", we will never go here.
>
> As per above.
>
>> 
>>>  	} else if (si->cluster_nr && !si->swap_map[++offset]) {
>>>  		/* non-ssd case, still more slots in cluster? */
>>>  		--si->cluster_nr;
>>> @@ -964,7 +1023,8 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
>>>  	}
>>>  
>>>  done:
>>> -	set_cluster_next(si, offset + 1);
>>> +	if (order == 0)
>>> +		set_cluster_next(si, offset + 1);
>>>  	si->flags -= SWP_SCANNING;
>>>  	return n_ret;
>>>  
>>> @@ -997,38 +1057,6 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
>>>  	return n_ret;
>>>  }
>>>  
>>> -static int swap_alloc_cluster(struct swap_info_struct *si, swp_entry_t *slot)
>>> -{
>>> -	unsigned long idx;
>>> -	struct swap_cluster_info *ci;
>>> -	unsigned long offset;
>>> -
>>> -	/*
>>> -	 * Should not even be attempting cluster allocations when huge
>>> -	 * page swap is disabled.  Warn and fail the allocation.
>>> -	 */
>>> -	if (!IS_ENABLED(CONFIG_THP_SWAP)) {
>>> -		VM_WARN_ON_ONCE(1);
>>> -		return 0;
>>> -	}
>>> -
>>> -	if (cluster_list_empty(&si->free_clusters))
>>> -		return 0;
>>> -
>>> -	idx = cluster_list_first(&si->free_clusters);
>>> -	offset = idx * SWAPFILE_CLUSTER;
>>> -	ci = lock_cluster(si, offset);
>>> -	alloc_cluster(si, idx);
>>> -	cluster_set_count(ci, SWAPFILE_CLUSTER);
>>> -
>>> -	memset(si->swap_map + offset, SWAP_HAS_CACHE, SWAPFILE_CLUSTER);
>>> -	unlock_cluster(ci);
>>> -	swap_range_alloc(si, offset, SWAPFILE_CLUSTER);
>>> -	*slot = swp_entry(si->type, offset);
>>> -
>>> -	return 1;
>>> -}
>>> -
>>>  static void swap_free_cluster(struct swap_info_struct *si, unsigned long idx)
>>>  {
>>>  	unsigned long offset = idx * SWAPFILE_CLUSTER;
>>> @@ -1050,8 +1078,8 @@ int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_size)
>>>  	int n_ret = 0;
>>>  	int node;
>>>  
>>> -	/* Only single cluster request supported */
>>> -	WARN_ON_ONCE(n_goal > 1 && size == SWAPFILE_CLUSTER);
>>> +	/* Only single THP request supported */
>>> +	WARN_ON_ONCE(n_goal > 1 && size > 1);
>>>  
>>>  	spin_lock(&swap_avail_lock);
>>>  
>>> @@ -1088,14 +1116,10 @@ int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_size)
>>>  			spin_unlock(&si->lock);
>>>  			goto nextsi;
>>>  		}
>>> -		if (size == SWAPFILE_CLUSTER) {
>>> -			if (si->flags & SWP_BLKDEV)
>>> -				n_ret = swap_alloc_cluster(si, swp_entries);
>>> -		} else
>>> -			n_ret = scan_swap_map_slots(si, SWAP_HAS_CACHE,
>>> -						    n_goal, swp_entries);
>>> +		n_ret = scan_swap_map_slots(si, SWAP_HAS_CACHE,
>>> +					    n_goal, swp_entries, size);
>>>  		spin_unlock(&si->lock);
>>> -		if (n_ret || size == SWAPFILE_CLUSTER)
>>> +		if (n_ret || size > 1)
>>>  			goto check_out;
>>>  		cond_resched();
>>>  
>>> @@ -1647,7 +1671,7 @@ swp_entry_t get_swap_page_of_type(int type)
>>>  
>>>  	/* This is called for allocating swap entry, not cache */
>>>  	spin_lock(&si->lock);
>>> -	if ((si->flags & SWP_WRITEOK) && scan_swap_map_slots(si, 1, 1, &entry))
>>> +	if ((si->flags & SWP_WRITEOK) && scan_swap_map_slots(si, 1, 1, &entry, 1))
>>>  		atomic_long_dec(&nr_swap_pages);
>>>  	spin_unlock(&si->lock);
>>>  fail:
>>> @@ -3101,7 +3125,7 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
>>>  		p->flags |= SWP_SYNCHRONOUS_IO;
>>>  
>>>  	if (p->bdev && bdev_nonrot(p->bdev)) {
>>> -		int cpu;
>>> +		int cpu, i;
>>>  		unsigned long ci, nr_cluster;
>>>  
>>>  		p->flags |= SWP_SOLIDSTATE;
>>> @@ -3139,7 +3163,8 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
>>>  			struct percpu_cluster *cluster;
>>>  
>>>  			cluster = per_cpu_ptr(p->percpu_cluster, cpu);
>>> -			cluster->next = SWAP_NEXT_INVALID;
>>> +			for (i = 0; i < SWAP_NR_ORDERS; i++)
>>> +				cluster->next[i] = SWAP_NEXT_INVALID;
>>>  		}
>>>  	} else {
>>>  		atomic_inc(&nr_rotate_swap);
>> 
>> You also need to check whether we should add swap_entry_size() for some
>> functions to optimize for small system.  We may need to add swap_order()
>> too.
>
> I was planning to convert swap_entry_size() to swap_entry_order() as part of
> switching to pass order instead of nr_pages. There is one other site that uses
> swap_entry_size() and needs a size, so was going to just change it to 1 <<
> swap_entry_order(). Does that work for you?

Yes.

> I'll do an audit for places to use swap_entry_order() but quick scan just now
> suggests that the constant should propagate to all the static functions from
> get_swap_pages().
>
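For concreteness, the conversion I would expect looks something like the
sketch below. This is illustrative only; the macro body and the small helper
are my guess at the shape, not the actual patch.

#ifdef CONFIG_THP_SWAP
#define swap_entry_order(order)	(order)
#else
#define swap_entry_order(order)	0
#endif

/* A caller that still needs a size would then derive it from the order. */
static inline unsigned int swap_entry_nr_pages(int order)
{
	return 1U << swap_entry_order(order);
}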

--
Best Regards,
Huang, Ying
Ryan Roberts March 20, 2024, 12:22 p.m. UTC | #4
Hi Huang, Ying,


On 12/03/2024 07:51, Huang, Ying wrote:
> Ryan Roberts <ryan.roberts@arm.com> writes:
> 
>> Multi-size THP enables performance improvements by allocating large,
>> pte-mapped folios for anonymous memory. However I've observed that on an
>> arm64 system running a parallel workload (e.g. kernel compilation)
>> across many cores, under high memory pressure, the speed regresses. This
>> is due to bottlenecking on the increased number of TLBIs added due to
>> all the extra folio splitting when the large folios are swapped out.
>>
>> Therefore, solve this regression by adding support for swapping out mTHP
>> without needing to split the folio, just like is already done for
>> PMD-sized THP. This change only applies when CONFIG_THP_SWAP is enabled,
>> and when the swap backing store is a non-rotating block device. These
>> are the same constraints as for the existing PMD-sized THP swap-out
>> support.
>>
>> Note that no attempt is made to swap-in (m)THP here - this is still done
>> page-by-page, like for PMD-sized THP. But swapping-out mTHP is a
>> prerequisite for swapping-in mTHP.
>>
>> The main change here is to improve the swap entry allocator so that it
>> can allocate any power-of-2 number of contiguous entries between [1, (1
>> << PMD_ORDER)]. This is done by allocating a cluster for each distinct
>> order and allocating sequentially from it until the cluster is full.
>> This ensures that we don't need to search the map and we get no
>> fragmentation due to alignment padding for different orders in the
>> cluster. If there is no current cluster for a given order, we attempt to
>> allocate a free cluster from the list. If there are no free clusters, we
>> fail the allocation and the caller can fall back to splitting the folio
>> and allocates individual entries (as per existing PMD-sized THP
>> fallback).
>>
>> The per-order current clusters are maintained per-cpu using the existing
>> infrastructure. This is done to avoid interleving pages from different
>> tasks, which would prevent IO being batched. This is already done for
>> the order-0 allocations so we follow the same pattern.
>>
>> As is done for order-0 per-cpu clusters, the scanner now can steal
>> order-0 entries from any per-cpu-per-order reserved cluster. This
>> ensures that when the swap file is getting full, space doesn't get tied
>> up in the per-cpu reserves.
>>
>> This change only modifies swap to be able to accept any order mTHP. It
>> doesn't change the callers to elide doing the actual split. That will be
>> done in separate changes.

[...]

>> @@ -905,17 +961,18 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
>>  	}
>>  
>>  	if (si->swap_map[offset]) {
>> +		VM_WARN_ON(order > 0);
>>  		unlock_cluster(ci);
>>  		if (!n_ret)
>>  			goto scan;
>>  		else
>>  			goto done;
>>  	}
>> -	WRITE_ONCE(si->swap_map[offset], usage);
>> -	inc_cluster_info_page(si, si->cluster_info, offset);
>> +	memset(si->swap_map + offset, usage, nr_pages);
> 
> Add barrier() here corresponds to original WRITE_ONCE()?
> unlock_cluster(ci) may be NOP for some swap devices.

Looking at this a bit more closely, I'm not sure this is needed. Even if there
is no cluster, the swap_info is still locked, so unlocking that will act as a
barrier. There are a number of other callsites that memset(si->swap_map) without
an explicit barrier and with the swap_info locked.

Looking at the original commit that added the WRITE_ONCE() it was worried about
a race with reading swap_map in _swap_info_get(). But that site is now annotated
with a data_race(), which will suppress the warning. And I don't believe there
are any places that read swap_map locklessly and depend upon observing ordering
between it and other state? So I think the si unlock is sufficient?
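
To spell out the pattern I'm relying on, here is a minimal userspace sketch
(pthreads and a GCC atomic builtin, not kernel code; maybe_free() and
claim_if_free() are made-up names standing in for the lockless hint and the
locked re-check):

#include <pthread.h>
#include <stdbool.h>
#include <string.h>

static unsigned char map[16];		/* stands in for si->swap_map */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void writer(unsigned char usage, size_t off, size_t nr)
{
	pthread_mutex_lock(&lock);
	memset(map + off, usage, nr);	/* plain stores, as in the patch */
	pthread_mutex_unlock(&lock);	/* release: publishes the stores */
}

/* Racy peek, analogous to the lockless read of swap_map. */
static bool maybe_free(size_t off)
{
	return !__atomic_load_n(&map[off], __ATOMIC_RELAXED);
}

static bool claim_if_free(size_t off)
{
	bool claimed = false;

	if (!maybe_free(off))		/* hint only */
		return false;
	pthread_mutex_lock(&lock);
	if (!map[off]) {		/* re-check under the lock */
		map[off] = 1;
		claimed = true;
	}
	pthread_mutex_unlock(&lock);
	return claimed;
}

The point is that the unlock is what publishes the stores; the racy peek never
needs ordering because nothing is trusted until the lock is taken.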

I'm not planning to add barrier() here. Let me know if you disagree.

Thanks,
Ryan

> 
>> +	add_cluster_info_page(si, si->cluster_info, offset, nr_pages);
>>  	unlock_cluster(ci);
Huang, Ying March 21, 2024, 4:39 a.m. UTC | #5
Ryan Roberts <ryan.roberts@arm.com> writes:

> Hi Huang, Ying,
>
>
> On 12/03/2024 07:51, Huang, Ying wrote:
>> Ryan Roberts <ryan.roberts@arm.com> writes:
>> 
>>> Multi-size THP enables performance improvements by allocating large,
>>> pte-mapped folios for anonymous memory. However I've observed that on an
>>> arm64 system running a parallel workload (e.g. kernel compilation)
>>> across many cores, under high memory pressure, the speed regresses. This
>>> is due to bottlenecking on the increased number of TLBIs added due to
>>> all the extra folio splitting when the large folios are swapped out.
>>>
>>> Therefore, solve this regression by adding support for swapping out mTHP
>>> without needing to split the folio, just like is already done for
>>> PMD-sized THP. This change only applies when CONFIG_THP_SWAP is enabled,
>>> and when the swap backing store is a non-rotating block device. These
>>> are the same constraints as for the existing PMD-sized THP swap-out
>>> support.
>>>
>>> Note that no attempt is made to swap-in (m)THP here - this is still done
>>> page-by-page, like for PMD-sized THP. But swapping-out mTHP is a
>>> prerequisite for swapping-in mTHP.
>>>
>>> The main change here is to improve the swap entry allocator so that it
>>> can allocate any power-of-2 number of contiguous entries between [1, (1
>>> << PMD_ORDER)]. This is done by allocating a cluster for each distinct
>>> order and allocating sequentially from it until the cluster is full.
>>> This ensures that we don't need to search the map and we get no
>>> fragmentation due to alignment padding for different orders in the
>>> cluster. If there is no current cluster for a given order, we attempt to
>>> allocate a free cluster from the list. If there are no free clusters, we
>>> fail the allocation and the caller can fall back to splitting the folio
>>> and allocates individual entries (as per existing PMD-sized THP
>>> fallback).
>>>
>>> The per-order current clusters are maintained per-cpu using the existing
>>> infrastructure. This is done to avoid interleving pages from different
>>> tasks, which would prevent IO being batched. This is already done for
>>> the order-0 allocations so we follow the same pattern.
>>>
>>> As is done for order-0 per-cpu clusters, the scanner now can steal
>>> order-0 entries from any per-cpu-per-order reserved cluster. This
>>> ensures that when the swap file is getting full, space doesn't get tied
>>> up in the per-cpu reserves.
>>>
>>> This change only modifies swap to be able to accept any order mTHP. It
>>> doesn't change the callers to elide doing the actual split. That will be
>>> done in separate changes.
>
> [...]
>
>>> @@ -905,17 +961,18 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
>>>  	}
>>>  
>>>  	if (si->swap_map[offset]) {
>>> +		VM_WARN_ON(order > 0);
>>>  		unlock_cluster(ci);
>>>  		if (!n_ret)
>>>  			goto scan;
>>>  		else
>>>  			goto done;
>>>  	}
>>> -	WRITE_ONCE(si->swap_map[offset], usage);
>>> -	inc_cluster_info_page(si, si->cluster_info, offset);
>>> +	memset(si->swap_map + offset, usage, nr_pages);
>> 
>> Add barrier() here corresponds to original WRITE_ONCE()?
>> unlock_cluster(ci) may be NOP for some swap devices.
>
> Looking at this a bit more closely, I'm not sure this is needed. Even if there
> is no cluster, the swap_info is still locked, so unlocking that will act as a
> barrier. There are a number of other callsites that memset(si->swap_map) without
> an explicit barrier and with the swap_info locked.
>
> Looking at the original commit that added the WRITE_ONCE() it was worried about
> a race with reading swap_map in _swap_info_get(). But that site is now annotated
> with a data_race(), which will suppress the warning. And I don't believe there
> are any places that read swap_map locklessly and depend upon observing ordering
> between it and other state? So I think the si unlock is sufficient?
>
> I'm not planning to add barrier() here. Let me know if you disagree.

swap_map[] may be read locklessly in swap_offset_available_and_locked()
in parallel.  IIUC, WRITE_ONCE() here is to make the writing take effect
as early as possible there.

>
>> 
>>> +	add_cluster_info_page(si, si->cluster_info, offset, nr_pages);
>>>  	unlock_cluster(ci);

--
Best Regards,
Huang, Ying
Ryan Roberts March 21, 2024, 12:21 p.m. UTC | #6
On 21/03/2024 04:39, Huang, Ying wrote:
> Ryan Roberts <ryan.roberts@arm.com> writes:
> 
>> Hi Huang, Ying,
>>
>>
>> On 12/03/2024 07:51, Huang, Ying wrote:
>>> Ryan Roberts <ryan.roberts@arm.com> writes:
>>>
>>>> Multi-size THP enables performance improvements by allocating large,
>>>> pte-mapped folios for anonymous memory. However I've observed that on an
>>>> arm64 system running a parallel workload (e.g. kernel compilation)
>>>> across many cores, under high memory pressure, the speed regresses. This
>>>> is due to bottlenecking on the increased number of TLBIs added due to
>>>> all the extra folio splitting when the large folios are swapped out.
>>>>
>>>> Therefore, solve this regression by adding support for swapping out mTHP
>>>> without needing to split the folio, just like is already done for
>>>> PMD-sized THP. This change only applies when CONFIG_THP_SWAP is enabled,
>>>> and when the swap backing store is a non-rotating block device. These
>>>> are the same constraints as for the existing PMD-sized THP swap-out
>>>> support.
>>>>
>>>> Note that no attempt is made to swap-in (m)THP here - this is still done
>>>> page-by-page, like for PMD-sized THP. But swapping-out mTHP is a
>>>> prerequisite for swapping-in mTHP.
>>>>
>>>> The main change here is to improve the swap entry allocator so that it
>>>> can allocate any power-of-2 number of contiguous entries between [1, (1
>>>> << PMD_ORDER)]. This is done by allocating a cluster for each distinct
>>>> order and allocating sequentially from it until the cluster is full.
>>>> This ensures that we don't need to search the map and we get no
>>>> fragmentation due to alignment padding for different orders in the
>>>> cluster. If there is no current cluster for a given order, we attempt to
>>>> allocate a free cluster from the list. If there are no free clusters, we
>>>> fail the allocation and the caller can fall back to splitting the folio
>>>> and allocates individual entries (as per existing PMD-sized THP
>>>> fallback).
>>>>
>>>> The per-order current clusters are maintained per-cpu using the existing
>>>> infrastructure. This is done to avoid interleving pages from different
>>>> tasks, which would prevent IO being batched. This is already done for
>>>> the order-0 allocations so we follow the same pattern.
>>>>
>>>> As is done for order-0 per-cpu clusters, the scanner now can steal
>>>> order-0 entries from any per-cpu-per-order reserved cluster. This
>>>> ensures that when the swap file is getting full, space doesn't get tied
>>>> up in the per-cpu reserves.
>>>>
>>>> This change only modifies swap to be able to accept any order mTHP. It
>>>> doesn't change the callers to elide doing the actual split. That will be
>>>> done in separate changes.
>>
>> [...]
>>
>>>> @@ -905,17 +961,18 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
>>>>  	}
>>>>  
>>>>  	if (si->swap_map[offset]) {
>>>> +		VM_WARN_ON(order > 0);
>>>>  		unlock_cluster(ci);
>>>>  		if (!n_ret)
>>>>  			goto scan;
>>>>  		else
>>>>  			goto done;
>>>>  	}
>>>> -	WRITE_ONCE(si->swap_map[offset], usage);
>>>> -	inc_cluster_info_page(si, si->cluster_info, offset);
>>>> +	memset(si->swap_map + offset, usage, nr_pages);
>>>
>>> Add barrier() here corresponds to original WRITE_ONCE()?
>>> unlock_cluster(ci) may be NOP for some swap devices.
>>
>> Looking at this a bit more closely, I'm not sure this is needed. Even if there
>> is no cluster, the swap_info is still locked, so unlocking that will act as a
>> barrier. There are a number of other callsites that memset(si->swap_map) without
>> an explicit barrier and with the swap_info locked.
>>
>> Looking at the original commit that added the WRITE_ONCE() it was worried about
>> a race with reading swap_map in _swap_info_get(). But that site is now annotated
>> with a data_race(), which will suppress the warning. And I don't believe there
>> are any places that read swap_map locklessly and depend upon observing ordering
>> between it and other state? So I think the si unlock is sufficient?
>>
>> I'm not planning to add barrier() here. Let me know if you disagree.
> 
> swap_map[] may be read locklessly in swap_offset_available_and_locked()
> in parallel.  IIUC, WRITE_ONCE() here is to make the writing take effect
> as early as possible there.

Afraid I'm not convinced by that argument; if it's racing, it's racing - the
lockless side needs to be robust (it is). Adding the compiler barrier limits the
compiler's options which could lead to slower code in this path. If your
argument is that you want to reduce the window where
swap_offset_available_and_locked() could observe a free swap slot but then see
that its taken after it gets the si lock, that seems like a micro-optimization
to me, which we should avoid if we can.

By removing the WRITE_ONCE() and using memset, the lockless reader could
observe tearing though. I don't think that should cause a problem (because
everything is rechecked under the lock), but if we want to avoid it, then
perhaps we just need to loop over WRITE_ONCE() here instead of using memset?
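
Something like the below is what I have in mind (sketch only;
WRITE_ONCE_CHAR() and mark_entries_used() are stand-ins I've made up so the
snippet is self-contained; in the kernel the loop body would just be
WRITE_ONCE(si->swap_map[offset + i], usage)):

/* Stand-in for the kernel's WRITE_ONCE() on a byte: a volatile store. */
#define WRITE_ONCE_CHAR(lval, val) \
	(*(volatile unsigned char *)&(lval) = (val))

static void mark_entries_used(unsigned char *swap_map, unsigned long offset,
			      unsigned char usage, unsigned int nr_pages)
{
	unsigned int i;

	/* One non-tearing store per entry instead of a single memset(). */
	for (i = 0; i < nr_pages; i++)
		WRITE_ONCE_CHAR(swap_map[offset + i], usage);
}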


> 
>>
>>>
>>>> +	add_cluster_info_page(si, si->cluster_info, offset, nr_pages);
>>>>  	unlock_cluster(ci);
> 
> --
> Best Regards,
> Huang, Ying
Huang, Ying March 22, 2024, 2:38 a.m. UTC | #7
Hi, Paul,

Can you help us on WRITE_ONCE()/READ_ONCE()/barrier() usage as follows?
For some example kernel code as follows,

"
unsigned char x[16];

void writer(void)
{
        memset(x, 1, sizeof(x));
        /* To make memset() take effect ASAP */
        barrier();
}

unsigned char reader(int n)
{
        return READ_ONCE(x[n]);
}
"

where, writer() and reader() may be called on 2 CPUs without any lock.
It's acceptable for reader() to read the written value a little later.
Our questions are,

1. because it's impossible for accessing "unsigned char" to cause
tearing.  So, WRITE_ONCE()/READ_ONCE()/barrier() isn't necessary for
correctness, right?

2. we use barrier() and READ_ONCE() in writer() and reader(), because we
want to make writing take effect ASAP.  Is it a good practice?  Or it's
a micro-optimization that should be avoided?

--
Best Regards,
Huang, Ying
Huang, Ying March 22, 2024, 2:39 a.m. UTC | #8
Ryan Roberts <ryan.roberts@arm.com> writes:

> On 21/03/2024 04:39, Huang, Ying wrote:
>> Ryan Roberts <ryan.roberts@arm.com> writes:
>> 
>>> Hi Huang, Ying,
>>>
>>>
>>> On 12/03/2024 07:51, Huang, Ying wrote:
>>>> Ryan Roberts <ryan.roberts@arm.com> writes:
>>> [...]
>>>
>>>>> @@ -905,17 +961,18 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
>>>>>  	}
>>>>>  
>>>>>  	if (si->swap_map[offset]) {
>>>>> +		VM_WARN_ON(order > 0);
>>>>>  		unlock_cluster(ci);
>>>>>  		if (!n_ret)
>>>>>  			goto scan;
>>>>>  		else
>>>>>  			goto done;
>>>>>  	}
>>>>> -	WRITE_ONCE(si->swap_map[offset], usage);
>>>>> -	inc_cluster_info_page(si, si->cluster_info, offset);
>>>>> +	memset(si->swap_map + offset, usage, nr_pages);
>>>>
>>>> Add barrier() here corresponds to original WRITE_ONCE()?
>>>> unlock_cluster(ci) may be NOP for some swap devices.
>>>
>>> Looking at this a bit more closely, I'm not sure this is needed. Even if there
>>> is no cluster, the swap_info is still locked, so unlocking that will act as a
>>> barrier. There are a number of other callsites that memset(si->swap_map) without
>>> an explicit barrier and with the swap_info locked.
>>>
>>> Looking at the original commit that added the WRITE_ONCE() it was worried about
>>> a race with reading swap_map in _swap_info_get(). But that site is now annotated
>>> with a data_race(), which will suppress the warning. And I don't believe there
>>> are any places that read swap_map locklessly and depend upon observing ordering
>>> between it and other state? So I think the si unlock is sufficient?
>>>
>>> I'm not planning to add barrier() here. Let me know if you disagree.
>> 
>> swap_map[] may be read locklessly in swap_offset_available_and_locked()
>> in parallel.  IIUC, WRITE_ONCE() here is to make the writing take effect
>> as early as possible there.
>
> Afraid I'm not convinced by that argument; if it's racing, it's racing - the

It's not a race.

> lockless side needs to be robust (it is). Adding the compiler barrier limits the
> compiler's options which could lead to slower code in this path. If your
> argument is that you want to reduce the window where
> swap_offset_available_and_locked() could observe a free swap slot but then see
> that its taken after it gets the si lock, that seems like a micro-optimization
> to me, which we should avoid if we can.

Yes.  I think that it is a micro-optimization too.  I had thought that
it is common practice to use WRITE_ONCE()/READ_ONCE() or barrier() for
intentionally racy data accesses to make the change available as soon as
possible.  But I may be wrong here.

> By removing the WRITE_ONCE() and using memset, the lockless reader could
> observe tearing though. I don't think that should cause a problem (because
> everything is rechecked under the lock), but if we want to avoid it, then
> perhaps we just need to loop over WRITE_ONCE() here instead of using memset?

IIUC, in practice that isn't necessary, because type of si->swap_map[]
is "unsigned char".  It isn't possible to tear "unsigned char".  In
theory, it may be better to use WRITE_ONCE() because we may change the
type of si->swap_map[] at some time (who knows).  I don't have a strong
opinion here.

>>>>> +	add_cluster_info_page(si, si->cluster_info, offset, nr_pages);
>>>>>  	unlock_cluster(ci);

--
Best Regards,
Huang, Ying
Ryan Roberts March 22, 2024, 9:23 a.m. UTC | #9
On 22/03/2024 02:38, Huang, Ying wrote:
> Hi, Paul,
> 
> Can you help us on WRITE_ONCE()/READ_ONCE()/barrier() usage as follows?
> For some example kernel code as follows,
> 
> "
> unsigned char x[16];
> 
> void writer(void)
> {
>         memset(x, 1, sizeof(x));
>         /* To make memset() take effect ASAP */
>         barrier();
> }
> 
> unsigned char reader(int n)
> {
>         return READ_ONCE(x[n]);
> }
> "
> 
> where, writer() and reader() may be called on 2 CPUs without any lock.

For the situation we are discussing, writer() is always called with a spin lock
held. So spin_unlock() will act as the barrier in this case; that's my argument
for not needing the explicit barrier(), anyway. Happy to be told I'm wrong.

> It's acceptable for reader() to read the written value a little later.
> Our questions are,
> 
> 1. because it's impossible for accessing "unsigned char" to cause
> tearing.  So, WRITE_ONCE()/READ_ONCE()/barrier() isn't necessary for
> correctness, right?
> 
> 2. we use barrier() and READ_ONCE() in writer() and reader(), because we
> want to make writing take effect ASAP.  Is it a good practice?  Or it's
> a micro-optimization that should be avoided?
> 
> --
> Best Regards,
> Huang, Ying
Ryan Roberts March 22, 2024, 9:39 a.m. UTC | #10
On 22/03/2024 02:39, Huang, Ying wrote:
> Ryan Roberts <ryan.roberts@arm.com> writes:
> 
>> On 21/03/2024 04:39, Huang, Ying wrote:
>>> Ryan Roberts <ryan.roberts@arm.com> writes:
>>>
>>>> Hi Huang, Ying,
>>>>
>>>>
>>>> On 12/03/2024 07:51, Huang, Ying wrote:
>>>>> Ryan Roberts <ryan.roberts@arm.com> writes:
>>>> [...]
>>>>
>>>>>> @@ -905,17 +961,18 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
>>>>>>  	}
>>>>>>  
>>>>>>  	if (si->swap_map[offset]) {
>>>>>> +		VM_WARN_ON(order > 0);
>>>>>>  		unlock_cluster(ci);
>>>>>>  		if (!n_ret)
>>>>>>  			goto scan;
>>>>>>  		else
>>>>>>  			goto done;
>>>>>>  	}
>>>>>> -	WRITE_ONCE(si->swap_map[offset], usage);
>>>>>> -	inc_cluster_info_page(si, si->cluster_info, offset);
>>>>>> +	memset(si->swap_map + offset, usage, nr_pages);
>>>>>
>>>>> Add barrier() here corresponds to original WRITE_ONCE()?
>>>>> unlock_cluster(ci) may be NOP for some swap devices.
>>>>
>>>> Looking at this a bit more closely, I'm not sure this is needed. Even if there
>>>> is no cluster, the swap_info is still locked, so unlocking that will act as a
>>>> barrier. There are a number of other callsites that memset(si->swap_map) without
>>>> an explicit barrier and with the swap_info locked.
>>>>
>>>> Looking at the original commit that added the WRITE_ONCE() it was worried about
>>>> a race with reading swap_map in _swap_info_get(). But that site is now annotated
>>>> with a data_race(), which will suppress the warning. And I don't believe there
>>>> are any places that read swap_map locklessly and depend upon observing ordering
>>>> between it and other state? So I think the si unlock is sufficient?
>>>>
>>>> I'm not planning to add barrier() here. Let me know if you disagree.
>>>
>>> swap_map[] may be read locklessly in swap_offset_available_and_locked()
>>> in parallel.  IIUC, WRITE_ONCE() here is to make the writing take effect
>>> as early as possible there.
>>
>> Afraid I'm not convinced by that argument; if it's racing, it's racing - the
> 
> It's not a race.
> 
>> lockless side needs to be robust (it is). Adding the compiler barrier limits the
>> compiler's options which could lead to slower code in this path. If your
>> argument is that you want to reduce the window where
>> swap_offset_available_and_locked() could observe a free swap slot but then see
>> that its taken after it gets the si lock, that seems like a micro-optimization
>> to me, which we should avoid if we can.
> 
> Yes.  I think that it is a micro-optimization too.  I had thought that
> it is a common practice to use WRITE_ONCE()/READ_ONCE() or barrier() in
> intentional racy data accessing to make the change available as soon as
> possible.  But I may be wrong here.

My understanding is that WRITE_ONCE() forces the compiler to emit a store at
that point in the program; it can't just rely on knowing that it has previously
written the same value to that location, it can't reorder the store to later in
the program and it must store the whole word atomically so that no tearing can
be observed. But given that swap_map is only ever written with the si lock held,
I don't believe we require the first two of those semantics. It should be enough
to know that the compiler has emitted all the stores (if it determines they are
required) prior to the spin_unlock(). I'm not sure about the anti-tearing
guarantee.
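
For reference, and ignoring the kernel's extra type-checking layer, my
understanding is that WRITE_ONCE() on a scalar reduces to a volatile store,
roughly:

/* Stand-in name; approximates what the kernel's WRITE_ONCE() expands to. */
#define MY_WRITE_ONCE(x, val) \
	do { *(volatile typeof(x) *)&(x) = (val); } while (0)

which constrains the compiler but emits no barrier instruction.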

Happy to be told I'm wrong here!

> 
>> By removing the WRITE_ONCE() and using memset, the lockless reader could
>> observe tearing though. I don't think that should cause a problem (because
>> everything is rechecked under the lock), but if we want to avoid it, then
>> perhaps we just need to loop over WRITE_ONCE() here instead of using memset?
> 
> IIUC, in practice that isn't necessary, because type of si->swap_map[]
> is "unsigned char".  It isn't possible to tear "unsigned char".  

In practice, perhaps. But I guess the compiler is free to update the char
bit-by-bit if it wants to, if the store is not marked WRITE_ONCE()?

> In
> theory, it may be better to use WRITE_ONCE() because we may change the
> type of si->swap_map[] at some time (who knows).  I don't have a strong
> opinion here.

The way I see it, the precedent is already set; there are a number of places
that already use memset to update swap_map. They are all under the si lock, and
none use barrier(). If it's wrong here, then it's wrong in those places too, I
believe.

> 
>>>>>> +	add_cluster_info_page(si, si->cluster_info, offset, nr_pages);
>>>>>>  	unlock_cluster(ci);
> 
> --
> Best Regards,
> Huang, Ying
Chris Li March 22, 2024, 1:19 p.m. UTC | #11
Hi Ying,

Very interesting question.

On Thu, Mar 21, 2024 at 7:40 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> Hi, Paul,
>
> Can you help us on WRITE_ONCE()/READ_ONCE()/barrier() usage as follows?
> For some example kernel code as follows,
>
> "
> unsigned char x[16];
>
> void writer(void)
> {
>         memset(x, 1, sizeof(x));
>         /* To make memset() take effect ASAP */
>         barrier();
> }
>
> unsigned char reader(int n)
> {
>         return READ_ONCE(x[n]);
> }
> "
>
> where, writer() and reader() may be called on 2 CPUs without any lock.
> It's acceptable for reader() to read the written value a little later.

I am trying to see if your program can be converted into a litmus test so
the Linux memory model tools can answer it for you.
Because you allow reader() to read the written value a little later, there
is nothing the test can verify against. The reader can see the value either
before or after the writer's update; both are valid observations.

To make your test example more complete, you need the reader/writer to
do more actions to expose the race. For example, "if (READ_ONCE(x[n]))
y = 1;". Then you can ask the question whether it is possible to
observe x[n] == 0 and y == 1. That might not be the test condition you
have in mind, but you get the idea.

We want a test example whose final observable state shows whether the bad
thing did happen (or that it cannot happen at all).
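
To make this concrete, the classic message-passing shape already exists as a
stock litmus test in the kernel tree (under tools/memory-model/litmus-tests/,
MP+poonceonces.litmus if I remember the name correctly), and it looks roughly
like this:

C MP+poonceonces

{}

P0(int *x, int *y)
{
	WRITE_ONCE(*x, 1);
	WRITE_ONCE(*y, 1);
}

P1(int *x, int *y)
{
	int r0;
	int r1;

	r0 = READ_ONCE(*y);
	r1 = READ_ONCE(*x);
}

exists (1:r0=1 /\ 1:r1=0)

herd7 reports that "exists" clause as sometimes satisfiable, i.e. READ_ONCE()
and WRITE_ONCE() alone do not order the two accesses; you need a release on
the writer side and an acquire on the reader side to forbid it.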

> Our questions are,
>
> 1. because it's impossible for accessing "unsigned char" to cause
> tearing.  So, WRITE_ONCE()/READ_ONCE()/barrier() isn't necessary for
> correctness, right?

We need to define what the expected behavior is for the outcome to be
"correct", possibly including the actions before and after the barrier.

Chris

>
> 2. we use barrier() and READ_ONCE() in writer() and reader(), because we
> want to make writing take effect ASAP.  Is it a good practice?  Or it's
> a micro-optimization that should be avoided?
>
> --
> Best Regards,
> Huang, Ying
>
Akira Yokosawa March 23, 2024, 2:11 a.m. UTC | #12
[Use Paul's reachable address in CC;
 trimmed CC list, keeping only those who have responded so far.]

Hello Huang,
Let me chime in.

On Fri, 22 Mar 2024 06:19:52 -0700, Huang, Ying wrote:
> Hi, Paul,
> 
> Can you help us on WRITE_ONCE()/READ_ONCE()/barrier() usage as follows?
> For some example kernel code as follows,
> 
> "
> unsigned char x[16];
> 
> void writer(void)
> {
>         memset(x, 1, sizeof(x));
>         /* To make memset() take effect ASAP */
>         barrier();
> }
> 
> unsigned char reader(int n)
> {
>         return READ_ONCE(x[n]);
> }
> "
> 
> where, writer() and reader() may be called on 2 CPUs without any lock.
> It's acceptable for reader() to read the written value a little later.
> Our questions are,
> 
> 1. because it's impossible for accessing "unsigned char" to cause
> tearing.  So, WRITE_ONCE()/READ_ONCE()/barrier() isn't necessary for
> correctness, right?
> 
> 2. we use barrier() and READ_ONCE() in writer() and reader(), because we
> want to make writing take effect ASAP.  Is it a good practice?  Or it's
> a micro-optimization that should be avoided?

Why don't you consult Documentation/memory-barriers.txt, especially
the section titled "COMPILER BARRIER"?

TL;DR:

barrier(), WRITE_ONCE(), and READ_ONCE() are compiler barriers, not
memory barriers.  They just restrict compiler optimizations and don't
have any effect with regard to "make writing take effect ASAP".
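
For reference, barrier() is nothing more than an empty asm statement with a
"memory" clobber:

/* Essentially what the kernel's generic barrier() definition expands to. */
#define barrier() __asm__ __volatile__("" : : : "memory")

It forbids the compiler from moving memory accesses across it or caching
values in registers across it, but it emits no instruction and imposes no
ordering on the CPU.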

If you have further questions, please don't hesitate to ask.

Regards,
Akira (an LKMM Reviewer).

> 
> --
> Best Regards,
> Huang, Ying
Paul E. McKenney March 25, 2024, 12:01 a.m. UTC | #13
On Sat, Mar 23, 2024 at 11:11:09AM +0900, Akira Yokosawa wrote:
> [Use Paul's reachable address in CC;
>  trimmed CC list, keeping only those who have responded so far.]
> 
> Hello Huang,
> Let me chime in.
> 
> On Fri, 22 Mar 2024 06:19:52 -0700, Huang, Ying wrote:
> > Hi, Paul,
> > 
> > Can you help us on WRITE_ONCE()/READ_ONCE()/barrier() usage as follows?
> > For some example kernel code as follows,
> > 
> > "
> > unsigned char x[16];
> > 
> > void writer(void)
> > {
> >         memset(x, 1, sizeof(x));
> >         /* To make memset() take effect ASAP */
> >         barrier();
> > }
> > 
> > unsigned char reader(int n)
> > {
> >         return READ_ONCE(x[n]);
> > }
> > "
> > 
> > where, writer() and reader() may be called on 2 CPUs without any lock.
> > It's acceptable for reader() to read the written value a little later.

What are your consistency requirements?  For but one example, if reader(3)
gives the new value, is it OK for a later call to reader(2) to give the
old value?

Until we know what your requirements are, it is hard to say whether the
above code meets those requirements.  In the meantime, I can imagine
requirements that it meets and others that it does not.

Also, Akira's points below are quite important.

							Thanx, Paul

> > Our questions are,
> > 
> > 1. because it's impossible for accessing "unsigned char" to cause
> > tearing.  So, WRITE_ONCE()/READ_ONCE()/barrier() isn't necessary for
> > correctness, right?
> > 
> > 2. we use barrier() and READ_ONCE() in writer() and reader(), because we
> > want to make writing take effect ASAP.  Is it a good practice?  Or it's
> > a micro-optimization that should be avoided?
> 
> Why don't you consult Documentation/memory-barriers.txt, especially
> the section titled "COMPILER BARRIER"?
> 
> TL;DR:
> 
> barrier(), WRITE_ONCE(), and READ_ONCE() are compiler barriers, not
> memory barriers.  They just restrict compiler optimizations and don't
> have any effect with regard to "make writing take effect ASAP".
> 
> If you have further questions, please don't hesitate to ask.
> 
> Regards,
> Akira (an LKMM Reviewer).
> 
> > 
> > --
> > Best Regards,
> > Huang, Ying
>
Huang, Ying March 25, 2024, 3 a.m. UTC | #14
Akira Yokosawa <akiyks@gmail.com> writes:

> [Use Paul's reachable address in CC;
>  trimmed CC list, keeping only those who have responded so far.]

Thanks a lot!

> Hello Huang,
> Let me chime in.
>
> On Fri, 22 Mar 2024 06:19:52 -0700, Huang, Ying wrote:
>> Hi, Paul,
>> 
>> Can you help us on WRITE_ONCE()/READ_ONCE()/barrier() usage as follows?
>> For some example kernel code as follows,
>> 
>> "
>> unsigned char x[16];
>> 
>> void writer(void)
>> {
>>         memset(x, 1, sizeof(x));
>>         /* To make memset() take effect ASAP */
>>         barrier();
>> }
>> 
>> unsigned char reader(int n)
>> {
>>         return READ_ONCE(x[n]);
>> }
>> "
>> 
>> where, writer() and reader() may be called on 2 CPUs without any lock.
>> It's acceptable for reader() to read the written value a little later.
>> Our questions are,
>> 
>> 1. because it's impossible for accessing "unsigned char" to cause
>> tearing.  So, WRITE_ONCE()/READ_ONCE()/barrier() isn't necessary for
>> correctness, right?
>> 
>> 2. we use barrier() and READ_ONCE() in writer() and reader(), because we
>> want to make writing take effect ASAP.  Is it a good practice?  Or it's
>> a micro-optimization that should be avoided?
>
> Why don't you consult Documentation/memory-barriers.txt, especially
> the section titled "COMPILER BARRIER"?
>
> TL;DR:
>
> barrier(), WRITE_ONCE(), and READ_ONCE() are compiler barriers, not
> memory barriers.  They just restrict compiler optimizations and don't
> have any effect with regard to "make writing take effect ASAP".

Yes.  In theory, this is absolutely correct.

My question is: in practice, will compiler barriers make the CPU run (or
observe) the memory read/write instructions a little earlier, by preventing
operations that come after the read/write from being reordered before it?

> If you have further questions, please don't hesitate to ask.

--
Best Regards,
Huang, Ying
Huang, Ying March 25, 2024, 3:16 a.m. UTC | #15
"Paul E. McKenney" <paulmck@kernel.org> writes:

> On Sat, Mar 23, 2024 at 11:11:09AM +0900, Akira Yokosawa wrote:
>> [Use Paul's reachable address in CC;
>>  trimmed CC list, keeping only those who have responded so far.]
>> 
>> Hello Huang,
>> Let me chime in.
>> 
>> On Fri, 22 Mar 2024 06:19:52 -0700, Huang, Ying wrote:
>> > Hi, Paul,
>> > 
>> > Can you help us on WRITE_ONCE()/READ_ONCE()/barrier() usage as follows?
>> > For some example kernel code as follows,
>> > 
>> > "
>> > unsigned char x[16];
>> > 
>> > void writer(void)
>> > {
>> >         memset(x, 1, sizeof(x));
>> >         /* To make memset() take effect ASAP */
>> >         barrier();
>> > }
>> > 
>> > unsigned char reader(int n)
>> > {
>> >         return READ_ONCE(x[n]);
>> > }
>> > "
>> > 
>> > where, writer() and reader() may be called on 2 CPUs without any lock.
>> > It's acceptable for reader() to read the written value a little later.
>
> What are your consistency requirements?  For but one example, if reader(3)
> gives the new value, is it OK for a later call to reader(2) to give the
> old value?

writer() will be called with a lock held (sorry, my previous words
aren't correct here).  After the racy checking in reader(), we will
acquire the lock and check "x[n]" again to confirm.  And, there are no
dependencies between different "n".  All in all, we can accept almost
all races between writer() and reader().

My question is, if there are some operations between writer() and the
unlock in its caller, does barrier() in writer() make any sense?  Does it
make the write instructions appear a little earlier in the compiled code?
Does it mark that the memory may be read in a racy way?  Or does it make
no sense at all?

> Until we know what your requirements are, it is hard to say whether the
> above code meets those requirements.  In the meantime, I can imagine
> requirements that it meets and others that it does not.
>
> Also, Akira's points below are quite important.

Replied for his email.

--
Best Regards,
Huang, Ying
Huang, Ying March 25, 2024, 3:20 a.m. UTC | #16
Ryan Roberts <ryan.roberts@arm.com> writes:

> On 22/03/2024 02:38, Huang, Ying wrote:
>> Hi, Paul,
>> 
>> Can you help us on WRITE_ONCE()/READ_ONCE()/barrier() usage as follows?
>> For some example kernel code as follows,
>> 
>> "
>> unsigned char x[16];
>> 
>> void writer(void)
>> {
>>         memset(x, 1, sizeof(x));
>>         /* To make memset() take effect ASAP */
>>         barrier();
>> }
>> 
>> unsigned char reader(int n)
>> {
>>         return READ_ONCE(x[n]);
>> }
>> "
>> 
>> where, writer() and reader() may be called on 2 CPUs without any lock.
>
> For the situation we are discussing, writer() is always called with a spin lock
> held. So spin_unlock() will act as the barrier in this case; that's my argument
> for not needing the explicit barrier(), anyway. Happy to be told I'm wrong.

Yes.  spin_unlock() is a barrier too.  There are some operations between
writer() and spin_unlock(), so I want to check whether it makes any sense
to add a barrier earlier.

>> It's acceptable for reader() to read the written value a little later.
>> Our questions are,
>> 
>> 1. because it's impossible for accessing "unsigned char" to cause
>> tearing.  So, WRITE_ONCE()/READ_ONCE()/barrier() isn't necessary for
>> correctness, right?
>> 
>> 2. we use barrier() and READ_ONCE() in writer() and reader(), because we
>> want to make writing take effect ASAP.  Is it a good practice?  Or it's
>> a micro-optimization that should be avoided?

--
Best Regards,
Huang, Ying
Ryan Roberts March 26, 2024, 5:08 p.m. UTC | #17
On 25/03/2024 03:16, Huang, Ying wrote:
> "Paul E. McKenney" <paulmck@kernel.org> writes:
> 
>> On Sat, Mar 23, 2024 at 11:11:09AM +0900, Akira Yokosawa wrote:
>>> [Use Paul's reachable address in CC;
>>>  trimmed CC list, keeping only those who have responded so far.]
>>>
>>> Hello Huang,
>>> Let me chime in.
>>>
>>> On Fri, 22 Mar 2024 06:19:52 -0700, Huang, Ying wrote:
>>>> Hi, Paul,
>>>>
>>>> Can you help us on WRITE_ONCE()/READ_ONCE()/barrier() usage as follows?
>>>> For some example kernel code as follows,
>>>>
>>>> "
>>>> unsigned char x[16];
>>>>
>>>> void writer(void)
>>>> {
>>>>         memset(x, 1, sizeof(x));
>>>>         /* To make memset() take effect ASAP */
>>>>         barrier();
>>>> }
>>>>
>>>> unsigned char reader(int n)
>>>> {
>>>>         return READ_ONCE(x[n]);
>>>> }
>>>> "
>>>>
>>>> where, writer() and reader() may be called on 2 CPUs without any lock.
>>>> It's acceptable for reader() to read the written value a little later.
>>
>> What are your consistency requirements?  For but one example, if reader(3)
>> gives the new value, is it OK for a later call to reader(2) to give the
>> old value?
> 
> writer() will be called with a lock held (sorry, my previous words
> aren't correct here).  After the racy checking in reader(), we will
> acquire the lock and check "x[n]" again to confirm.  And, there are no
> dependencies between different "n".  All in all, we can accept almost
> all races between writer() and reader().
> 
> My question is, if there are some operations between writer() and the
> unlock in its caller, does barrier() in writer() make any sense?  Does it
> make the write instructions appear a little earlier in the compiled code?
> Does it mark that the memory may be read in a racy way?  Or does it make
> no sense at all?

A compiler barrier is necessary but not sufficient to guarantee that the
stores become visible to the reader; you would also need a memory barrier to
stop the HW from reordering IIUC. So I really fail to see the value of adding
barrier().
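
In userspace C11 terms, the distinction is roughly the following (illustration
only, not kernel code; the kernel analogues would be barrier() versus
smp_wmb()/smp_store_release()):

#include <stdatomic.h>

/* Constrains only the compiler; the CPU may still reorder the stores. */
static inline void compiler_only_barrier(void)
{
	__asm__ __volatile__("" : : : "memory");
}

/* Additionally orders prior stores for other CPUs (a release fence). */
static inline void publish_fence(void)
{
	atomic_thread_fence(memory_order_release);
}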

As you state above, there is no correctness issue here. It's just a question of
whether the barrier() can make the store appear earlier to the reader for a
(micro!) performance optimization. You'll get both the compiler and memory
barrier from the slightly later spin_unlock(). The patch that added the original
WRITE_ONCE() was concerned with squashing kcsan warnings, not with performance
optimization. (And the addition of the WRITE_ONCE() wasn't actually needed to
achieve the aim).

So I'm planning to repost my series (hopefully tomorrow) without the barrier()
present, unless you still want to try to convince me that it is useful.

Thanks,
Ryan

> 
>> Until we know what your requirements are, it is hard to say whether the
>> above code meets those requirements.  In the meantime, I can imagine
>> requirements that it meets and others that it does not.
>>
>> Also, Akira's points below are quite important.
> 
> Replied for his email.
> 
> --
> Best Regards,
> Huang, Ying
diff mbox series

Patch

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 0cb082bee717..39b5c18ccc6a 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -268,13 +268,19 @@  struct swap_cluster_info {
  */
 #define SWAP_NEXT_INVALID	0
 
+#ifdef CONFIG_THP_SWAP
+#define SWAP_NR_ORDERS		(PMD_ORDER + 1)
+#else
+#define SWAP_NR_ORDERS		1
+#endif
+
 /*
  * We assign a cluster to each CPU, so each CPU can allocate swap entry from
  * its own cluster and swapout sequentially. The purpose is to optimize swapout
  * throughput.
  */
 struct percpu_cluster {
-	unsigned int next; /* Likely next allocation offset */
+	unsigned int next[SWAP_NR_ORDERS]; /* Likely next allocation offset */
 };
 
 struct swap_cluster_list {
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 3828d81aa6b8..61118a090796 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -551,10 +551,12 @@  static void free_cluster(struct swap_info_struct *si, unsigned long idx)
 
 /*
  * The cluster corresponding to page_nr will be used. The cluster will be
- * removed from free cluster list and its usage counter will be increased.
+ * removed from free cluster list and its usage counter will be increased by
+ * count.
  */
-static void inc_cluster_info_page(struct swap_info_struct *p,
-	struct swap_cluster_info *cluster_info, unsigned long page_nr)
+static void add_cluster_info_page(struct swap_info_struct *p,
+	struct swap_cluster_info *cluster_info, unsigned long page_nr,
+	unsigned long count)
 {
 	unsigned long idx = page_nr / SWAPFILE_CLUSTER;
 
@@ -563,9 +565,19 @@  static void inc_cluster_info_page(struct swap_info_struct *p,
 	if (cluster_is_free(&cluster_info[idx]))
 		alloc_cluster(p, idx);
 
-	VM_BUG_ON(cluster_count(&cluster_info[idx]) >= SWAPFILE_CLUSTER);
+	VM_BUG_ON(cluster_count(&cluster_info[idx]) + count > SWAPFILE_CLUSTER);
 	cluster_set_count(&cluster_info[idx],
-		cluster_count(&cluster_info[idx]) + 1);
+		cluster_count(&cluster_info[idx]) + count);
+}
+
+/*
+ * The cluster corresponding to page_nr will be used. The cluster will be
+ * removed from free cluster list and its usage counter will be increased by 1.
+ */
+static void inc_cluster_info_page(struct swap_info_struct *p,
+	struct swap_cluster_info *cluster_info, unsigned long page_nr)
+{
+	add_cluster_info_page(p, cluster_info, page_nr, 1);
 }
 
 /*
@@ -595,7 +607,7 @@  static void dec_cluster_info_page(struct swap_info_struct *p,
  */
 static bool
 scan_swap_map_ssd_cluster_conflict(struct swap_info_struct *si,
-	unsigned long offset)
+	unsigned long offset, int order)
 {
 	struct percpu_cluster *percpu_cluster;
 	bool conflict;
@@ -609,24 +621,39 @@  scan_swap_map_ssd_cluster_conflict(struct swap_info_struct *si,
 		return false;
 
 	percpu_cluster = this_cpu_ptr(si->percpu_cluster);
-	percpu_cluster->next = SWAP_NEXT_INVALID;
+	percpu_cluster->next[order] = SWAP_NEXT_INVALID;
+	return true;
+}
+
+static inline bool swap_range_empty(char *swap_map, unsigned int start,
+				    unsigned int nr_pages)
+{
+	unsigned int i;
+
+	for (i = 0; i < nr_pages; i++) {
+		if (swap_map[start + i])
+			return false;
+	}
+
 	return true;
 }
 
 /*
- * Try to get a swap entry from current cpu's swap entry pool (a cluster). This
- * might involve allocating a new cluster for current CPU too.
+ * Try to get a swap entry (or size indicated by order) from current cpu's swap
+ * entry pool (a cluster). This might involve allocating a new cluster for
+ * current CPU too.
  */
 static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
-	unsigned long *offset, unsigned long *scan_base)
+	unsigned long *offset, unsigned long *scan_base, int order)
 {
+	unsigned int nr_pages = 1 << order;
 	struct percpu_cluster *cluster;
 	struct swap_cluster_info *ci;
 	unsigned int tmp, max;
 
 new_cluster:
 	cluster = this_cpu_ptr(si->percpu_cluster);
-	tmp = cluster->next;
+	tmp = cluster->next[order];
 	if (tmp == SWAP_NEXT_INVALID) {
 		if (!cluster_list_empty(&si->free_clusters)) {
 			tmp = cluster_next(&si->free_clusters.head) *
@@ -647,26 +674,27 @@  static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
 
 	/*
 	 * Other CPUs can use our cluster if they can't find a free cluster,
-	 * check if there is still free entry in the cluster
+	 * check if there is still free entry in the cluster, maintaining
+	 * natural alignment.
 	 */
 	max = min_t(unsigned long, si->max, ALIGN(tmp + 1, SWAPFILE_CLUSTER));
 	if (tmp < max) {
 		ci = lock_cluster(si, tmp);
 		while (tmp < max) {
-			if (!si->swap_map[tmp])
+			if (swap_range_empty(si->swap_map, tmp, nr_pages))
 				break;
-			tmp++;
+			tmp += nr_pages;
 		}
 		unlock_cluster(ci);
 	}
 	if (tmp >= max) {
-		cluster->next = SWAP_NEXT_INVALID;
+		cluster->next[order] = SWAP_NEXT_INVALID;
 		goto new_cluster;
 	}
 	*offset = tmp;
 	*scan_base = tmp;
-	tmp += 1;
-	cluster->next = tmp < max ? tmp : SWAP_NEXT_INVALID;
+	tmp += nr_pages;
+	cluster->next[order] = tmp < max ? tmp : SWAP_NEXT_INVALID;
 	return true;
 }
 
@@ -796,13 +824,14 @@  static bool swap_offset_available_and_locked(struct swap_info_struct *si,
 
 static int scan_swap_map_slots(struct swap_info_struct *si,
 			       unsigned char usage, int nr,
-			       swp_entry_t slots[])
+			       swp_entry_t slots[], unsigned int nr_pages)
 {
 	struct swap_cluster_info *ci;
 	unsigned long offset;
 	unsigned long scan_base;
 	unsigned long last_in_cluster = 0;
 	int latency_ration = LATENCY_LIMIT;
+	int order = ilog2(nr_pages);
 	int n_ret = 0;
 	bool scanned_many = false;
 
@@ -817,6 +846,26 @@  static int scan_swap_map_slots(struct swap_info_struct *si,
 	 * And we let swap pages go all over an SSD partition.  Hugh
 	 */
 
+	if (nr_pages > 1) {
+		/*
+		 * Should not even be attempting large allocations when huge
+		 * page swap is disabled.  Warn and fail the allocation.
+		 */
+		if (!IS_ENABLED(CONFIG_THP_SWAP) ||
+		    nr_pages > SWAPFILE_CLUSTER ||
+		    !is_power_of_2(nr_pages)) {
+			VM_WARN_ON_ONCE(1);
+			return 0;
+		}
+
+		/*
+		 * Swapfile is not block device or not using clusters so unable
+		 * to allocate large entries.
+		 */
+		if (!(si->flags & SWP_BLKDEV) || !si->cluster_info)
+			return 0;
+	}
+
 	si->flags += SWP_SCANNING;
 	/*
 	 * Use percpu scan base for SSD to reduce lock contention on
@@ -831,8 +880,11 @@  static int scan_swap_map_slots(struct swap_info_struct *si,
 
 	/* SSD algorithm */
 	if (si->cluster_info) {
-		if (!scan_swap_map_try_ssd_cluster(si, &offset, &scan_base))
+		if (!scan_swap_map_try_ssd_cluster(si, &offset, &scan_base, order)) {
+			if (order > 0)
+				goto no_page;
 			goto scan;
+		}
 	} else if (unlikely(!si->cluster_nr--)) {
 		if (si->pages - si->inuse_pages < SWAPFILE_CLUSTER) {
 			si->cluster_nr = SWAPFILE_CLUSTER - 1;
@@ -874,26 +926,30 @@  static int scan_swap_map_slots(struct swap_info_struct *si,
 
 checks:
 	if (si->cluster_info) {
-		while (scan_swap_map_ssd_cluster_conflict(si, offset)) {
+		while (scan_swap_map_ssd_cluster_conflict(si, offset, order)) {
 		/* take a break if we already got some slots */
 			if (n_ret)
 				goto done;
 			if (!scan_swap_map_try_ssd_cluster(si, &offset,
-							&scan_base))
+							&scan_base, order)) {
+				if (order > 0)
+					goto no_page;
 				goto scan;
+			}
 		}
 	}
 	if (!(si->flags & SWP_WRITEOK))
 		goto no_page;
 	if (!si->highest_bit)
 		goto no_page;
-	if (offset > si->highest_bit)
+	if (order == 0 && offset > si->highest_bit)
 		scan_base = offset = si->lowest_bit;
 
 	ci = lock_cluster(si, offset);
 	/* reuse swap entry of cache-only swap if not busy. */
 	if (vm_swap_full() && si->swap_map[offset] == SWAP_HAS_CACHE) {
 		int swap_was_freed;
+		VM_WARN_ON(order > 0);
 		unlock_cluster(ci);
 		spin_unlock(&si->lock);
 		swap_was_freed = __try_to_reclaim_swap(si, offset, TTRS_ANYWAY);
@@ -905,17 +961,18 @@  static int scan_swap_map_slots(struct swap_info_struct *si,
 	}
 
 	if (si->swap_map[offset]) {
+		VM_WARN_ON(order > 0);
 		unlock_cluster(ci);
 		if (!n_ret)
 			goto scan;
 		else
 			goto done;
 	}
-	WRITE_ONCE(si->swap_map[offset], usage);
-	inc_cluster_info_page(si, si->cluster_info, offset);
+	memset(si->swap_map + offset, usage, nr_pages);
+	add_cluster_info_page(si, si->cluster_info, offset, nr_pages);
 	unlock_cluster(ci);
 
-	swap_range_alloc(si, offset, 1);
+	swap_range_alloc(si, offset, nr_pages);
 	slots[n_ret++] = swp_entry(si->type, offset);
 
 	/* got enough slots or reach max slots? */
@@ -936,8 +993,10 @@  static int scan_swap_map_slots(struct swap_info_struct *si,
 
 	/* try to get more slots in cluster */
 	if (si->cluster_info) {
-		if (scan_swap_map_try_ssd_cluster(si, &offset, &scan_base))
+		if (scan_swap_map_try_ssd_cluster(si, &offset, &scan_base, order))
 			goto checks;
+		if (order > 0)
+			goto done;
 	} else if (si->cluster_nr && !si->swap_map[++offset]) {
 		/* non-ssd case, still more slots in cluster? */
 		--si->cluster_nr;
@@ -964,7 +1023,8 @@  static int scan_swap_map_slots(struct swap_info_struct *si,
 	}
 
 done:
-	set_cluster_next(si, offset + 1);
+	if (order == 0)
+		set_cluster_next(si, offset + 1);
 	si->flags -= SWP_SCANNING;
 	return n_ret;
 
@@ -997,38 +1057,6 @@  static int scan_swap_map_slots(struct swap_info_struct *si,
 	return n_ret;
 }
 
-static int swap_alloc_cluster(struct swap_info_struct *si, swp_entry_t *slot)
-{
-	unsigned long idx;
-	struct swap_cluster_info *ci;
-	unsigned long offset;
-
-	/*
-	 * Should not even be attempting cluster allocations when huge
-	 * page swap is disabled.  Warn and fail the allocation.
-	 */
-	if (!IS_ENABLED(CONFIG_THP_SWAP)) {
-		VM_WARN_ON_ONCE(1);
-		return 0;
-	}
-
-	if (cluster_list_empty(&si->free_clusters))
-		return 0;
-
-	idx = cluster_list_first(&si->free_clusters);
-	offset = idx * SWAPFILE_CLUSTER;
-	ci = lock_cluster(si, offset);
-	alloc_cluster(si, idx);
-	cluster_set_count(ci, SWAPFILE_CLUSTER);
-
-	memset(si->swap_map + offset, SWAP_HAS_CACHE, SWAPFILE_CLUSTER);
-	unlock_cluster(ci);
-	swap_range_alloc(si, offset, SWAPFILE_CLUSTER);
-	*slot = swp_entry(si->type, offset);
-
-	return 1;
-}
-
 static void swap_free_cluster(struct swap_info_struct *si, unsigned long idx)
 {
 	unsigned long offset = idx * SWAPFILE_CLUSTER;
@@ -1050,8 +1078,8 @@  int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_size)
 	int n_ret = 0;
 	int node;
 
-	/* Only single cluster request supported */
-	WARN_ON_ONCE(n_goal > 1 && size == SWAPFILE_CLUSTER);
+	/* Only single THP request supported */
+	WARN_ON_ONCE(n_goal > 1 && size > 1);
 
 	spin_lock(&swap_avail_lock);
 
@@ -1088,14 +1116,10 @@  int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_size)
 			spin_unlock(&si->lock);
 			goto nextsi;
 		}
-		if (size == SWAPFILE_CLUSTER) {
-			if (si->flags & SWP_BLKDEV)
-				n_ret = swap_alloc_cluster(si, swp_entries);
-		} else
-			n_ret = scan_swap_map_slots(si, SWAP_HAS_CACHE,
-						    n_goal, swp_entries);
+		n_ret = scan_swap_map_slots(si, SWAP_HAS_CACHE,
+					    n_goal, swp_entries, size);
 		spin_unlock(&si->lock);
-		if (n_ret || size == SWAPFILE_CLUSTER)
+		if (n_ret || size > 1)
 			goto check_out;
 		cond_resched();
 
@@ -1647,7 +1671,7 @@  swp_entry_t get_swap_page_of_type(int type)
 
 	/* This is called for allocating swap entry, not cache */
 	spin_lock(&si->lock);
-	if ((si->flags & SWP_WRITEOK) && scan_swap_map_slots(si, 1, 1, &entry))
+	if ((si->flags & SWP_WRITEOK) && scan_swap_map_slots(si, 1, 1, &entry, 1))
 		atomic_long_dec(&nr_swap_pages);
 	spin_unlock(&si->lock);
 fail:
@@ -3101,7 +3125,7 @@  SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
 		p->flags |= SWP_SYNCHRONOUS_IO;
 
 	if (p->bdev && bdev_nonrot(p->bdev)) {
-		int cpu;
+		int cpu, i;
 		unsigned long ci, nr_cluster;
 
 		p->flags |= SWP_SOLIDSTATE;
@@ -3139,7 +3163,8 @@  SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
 			struct percpu_cluster *cluster;
 
 			cluster = per_cpu_ptr(p->percpu_cluster, cpu);
-			cluster->next = SWAP_NEXT_INVALID;
+			for (i = 0; i < SWAP_NR_ORDERS; i++)
+				cluster->next[i] = SWAP_NEXT_INVALID;
 		}
 	} else {
 		atomic_inc(&nr_rotate_swap);