Message ID | 20240311150058.1122862-5-ryan.roberts@arm.com (mailing list archive) |
---|---|
State | New |
Series | Swap-out mTHP without splitting |
Ryan Roberts <ryan.roberts@arm.com> writes: > Multi-size THP enables performance improvements by allocating large, > pte-mapped folios for anonymous memory. However I've observed that on an > arm64 system running a parallel workload (e.g. kernel compilation) > across many cores, under high memory pressure, the speed regresses. This > is due to bottlenecking on the increased number of TLBIs added due to > all the extra folio splitting when the large folios are swapped out. > > Therefore, solve this regression by adding support for swapping out mTHP > without needing to split the folio, just like is already done for > PMD-sized THP. This change only applies when CONFIG_THP_SWAP is enabled, > and when the swap backing store is a non-rotating block device. These > are the same constraints as for the existing PMD-sized THP swap-out > support. > > Note that no attempt is made to swap-in (m)THP here - this is still done > page-by-page, like for PMD-sized THP. But swapping-out mTHP is a > prerequisite for swapping-in mTHP. > > The main change here is to improve the swap entry allocator so that it > can allocate any power-of-2 number of contiguous entries between [1, (1 > << PMD_ORDER)]. This is done by allocating a cluster for each distinct > order and allocating sequentially from it until the cluster is full. > This ensures that we don't need to search the map and we get no > fragmentation due to alignment padding for different orders in the > cluster. If there is no current cluster for a given order, we attempt to > allocate a free cluster from the list. If there are no free clusters, we > fail the allocation and the caller can fall back to splitting the folio > and allocates individual entries (as per existing PMD-sized THP > fallback). > > The per-order current clusters are maintained per-cpu using the existing > infrastructure. This is done to avoid interleving pages from different > tasks, which would prevent IO being batched. This is already done for > the order-0 allocations so we follow the same pattern. > > As is done for order-0 per-cpu clusters, the scanner now can steal > order-0 entries from any per-cpu-per-order reserved cluster. This > ensures that when the swap file is getting full, space doesn't get tied > up in the per-cpu reserves. > > This change only modifies swap to be able to accept any order mTHP. It > doesn't change the callers to elide doing the actual split. That will be > done in separate changes. > > Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> > --- > include/linux/swap.h | 8 ++- > mm/swapfile.c | 167 +++++++++++++++++++++++++------------------ > 2 files changed, 103 insertions(+), 72 deletions(-) > > diff --git a/include/linux/swap.h b/include/linux/swap.h > index 0cb082bee717..39b5c18ccc6a 100644 > --- a/include/linux/swap.h > +++ b/include/linux/swap.h > @@ -268,13 +268,19 @@ struct swap_cluster_info { > */ > #define SWAP_NEXT_INVALID 0 > > +#ifdef CONFIG_THP_SWAP > +#define SWAP_NR_ORDERS (PMD_ORDER + 1) > +#else > +#define SWAP_NR_ORDERS 1 > +#endif > + > /* > * We assign a cluster to each CPU, so each CPU can allocate swap entry from > * its own cluster and swapout sequentially. The purpose is to optimize swapout > * throughput. 
> */ > struct percpu_cluster { > - unsigned int next; /* Likely next allocation offset */ > + unsigned int next[SWAP_NR_ORDERS]; /* Likely next allocation offset */ > }; > > struct swap_cluster_list { > diff --git a/mm/swapfile.c b/mm/swapfile.c > index 3828d81aa6b8..61118a090796 100644 > --- a/mm/swapfile.c > +++ b/mm/swapfile.c > @@ -551,10 +551,12 @@ static void free_cluster(struct swap_info_struct *si, unsigned long idx) > > /* > * The cluster corresponding to page_nr will be used. The cluster will be > - * removed from free cluster list and its usage counter will be increased. > + * removed from free cluster list and its usage counter will be increased by > + * count. > */ > -static void inc_cluster_info_page(struct swap_info_struct *p, > - struct swap_cluster_info *cluster_info, unsigned long page_nr) > +static void add_cluster_info_page(struct swap_info_struct *p, > + struct swap_cluster_info *cluster_info, unsigned long page_nr, > + unsigned long count) > { > unsigned long idx = page_nr / SWAPFILE_CLUSTER; > > @@ -563,9 +565,19 @@ static void inc_cluster_info_page(struct swap_info_struct *p, > if (cluster_is_free(&cluster_info[idx])) > alloc_cluster(p, idx); > > - VM_BUG_ON(cluster_count(&cluster_info[idx]) >= SWAPFILE_CLUSTER); > + VM_BUG_ON(cluster_count(&cluster_info[idx]) + count > SWAPFILE_CLUSTER); > cluster_set_count(&cluster_info[idx], > - cluster_count(&cluster_info[idx]) + 1); > + cluster_count(&cluster_info[idx]) + count); > +} > + > +/* > + * The cluster corresponding to page_nr will be used. The cluster will be > + * removed from free cluster list and its usage counter will be increased by 1. > + */ > +static void inc_cluster_info_page(struct swap_info_struct *p, > + struct swap_cluster_info *cluster_info, unsigned long page_nr) > +{ > + add_cluster_info_page(p, cluster_info, page_nr, 1); > } > > /* > @@ -595,7 +607,7 @@ static void dec_cluster_info_page(struct swap_info_struct *p, > */ > static bool > scan_swap_map_ssd_cluster_conflict(struct swap_info_struct *si, > - unsigned long offset) > + unsigned long offset, int order) > { > struct percpu_cluster *percpu_cluster; > bool conflict; > @@ -609,24 +621,39 @@ scan_swap_map_ssd_cluster_conflict(struct swap_info_struct *si, > return false; > > percpu_cluster = this_cpu_ptr(si->percpu_cluster); > - percpu_cluster->next = SWAP_NEXT_INVALID; > + percpu_cluster->next[order] = SWAP_NEXT_INVALID; > + return true; > +} > + > +static inline bool swap_range_empty(char *swap_map, unsigned int start, > + unsigned int nr_pages) > +{ > + unsigned int i; > + > + for (i = 0; i < nr_pages; i++) { > + if (swap_map[start + i]) > + return false; > + } > + > return true; > } > > /* > - * Try to get a swap entry from current cpu's swap entry pool (a cluster). This > - * might involve allocating a new cluster for current CPU too. > + * Try to get a swap entry (or size indicated by order) from current cpu's swap IMO, it's not necessary to make mTHP a special case other than base page. So, this can be changed to * Try to get swap entries with specified order from current cpu's swap > + * entry pool (a cluster). This might involve allocating a new cluster for > + * current CPU too. 
> */ > static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si, > - unsigned long *offset, unsigned long *scan_base) > + unsigned long *offset, unsigned long *scan_base, int order) > { > + unsigned int nr_pages = 1 << order; > struct percpu_cluster *cluster; > struct swap_cluster_info *ci; > unsigned int tmp, max; > > new_cluster: > cluster = this_cpu_ptr(si->percpu_cluster); > - tmp = cluster->next; > + tmp = cluster->next[order]; > if (tmp == SWAP_NEXT_INVALID) { > if (!cluster_list_empty(&si->free_clusters)) { > tmp = cluster_next(&si->free_clusters.head) * > @@ -647,26 +674,27 @@ static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si, > > /* > * Other CPUs can use our cluster if they can't find a free cluster, > - * check if there is still free entry in the cluster > + * check if there is still free entry in the cluster, maintaining > + * natural alignment. > */ > max = min_t(unsigned long, si->max, ALIGN(tmp + 1, SWAPFILE_CLUSTER)); > if (tmp < max) { > ci = lock_cluster(si, tmp); > while (tmp < max) { > - if (!si->swap_map[tmp]) > + if (swap_range_empty(si->swap_map, tmp, nr_pages)) > break; > - tmp++; > + tmp += nr_pages; > } > unlock_cluster(ci); > } > if (tmp >= max) { > - cluster->next = SWAP_NEXT_INVALID; > + cluster->next[order] = SWAP_NEXT_INVALID; > goto new_cluster; > } > *offset = tmp; > *scan_base = tmp; > - tmp += 1; > - cluster->next = tmp < max ? tmp : SWAP_NEXT_INVALID; > + tmp += nr_pages; > + cluster->next[order] = tmp < max ? tmp : SWAP_NEXT_INVALID; > return true; > } > > @@ -796,13 +824,14 @@ static bool swap_offset_available_and_locked(struct swap_info_struct *si, > > static int scan_swap_map_slots(struct swap_info_struct *si, > unsigned char usage, int nr, > - swp_entry_t slots[]) > + swp_entry_t slots[], unsigned int nr_pages) IMHO, it's better to use order as parameter directly. We can change the parameter of get_swap_pages() too. > { > struct swap_cluster_info *ci; > unsigned long offset; > unsigned long scan_base; > unsigned long last_in_cluster = 0; > int latency_ration = LATENCY_LIMIT; > + int order = ilog2(nr_pages); > int n_ret = 0; > bool scanned_many = false; > > @@ -817,6 +846,26 @@ static int scan_swap_map_slots(struct swap_info_struct *si, > * And we let swap pages go all over an SSD partition. Hugh > */ > > + if (nr_pages > 1) { > + /* > + * Should not even be attempting large allocations when huge > + * page swap is disabled. Warn and fail the allocation. > + */ > + if (!IS_ENABLED(CONFIG_THP_SWAP) || > + nr_pages > SWAPFILE_CLUSTER || > + !is_power_of_2(nr_pages)) { > + VM_WARN_ON_ONCE(1); > + return 0; > + } > + > + /* > + * Swapfile is not block device or not using clusters so unable > + * to allocate large entries. 
> + */ > + if (!(si->flags & SWP_BLKDEV) || !si->cluster_info) > + return 0; > + } > + > si->flags += SWP_SCANNING; > /* > * Use percpu scan base for SSD to reduce lock contention on > @@ -831,8 +880,11 @@ static int scan_swap_map_slots(struct swap_info_struct *si, > > /* SSD algorithm */ > if (si->cluster_info) { > - if (!scan_swap_map_try_ssd_cluster(si, &offset, &scan_base)) > + if (!scan_swap_map_try_ssd_cluster(si, &offset, &scan_base, order)) { > + if (order > 0) > + goto no_page; > goto scan; > + } > } else if (unlikely(!si->cluster_nr--)) { > if (si->pages - si->inuse_pages < SWAPFILE_CLUSTER) { > si->cluster_nr = SWAPFILE_CLUSTER - 1; > @@ -874,26 +926,30 @@ static int scan_swap_map_slots(struct swap_info_struct *si, > > checks: > if (si->cluster_info) { > - while (scan_swap_map_ssd_cluster_conflict(si, offset)) { > + while (scan_swap_map_ssd_cluster_conflict(si, offset, order)) { > /* take a break if we already got some slots */ > if (n_ret) > goto done; > if (!scan_swap_map_try_ssd_cluster(si, &offset, > - &scan_base)) > + &scan_base, order)) { > + if (order > 0) > + goto no_page; > goto scan; > + } > } > } > if (!(si->flags & SWP_WRITEOK)) > goto no_page; > if (!si->highest_bit) > goto no_page; > - if (offset > si->highest_bit) > + if (order == 0 && offset > si->highest_bit) I don't think that we need to check "order == 0" here. The original condition will always be false for "order != 0". > scan_base = offset = si->lowest_bit; > > ci = lock_cluster(si, offset); > /* reuse swap entry of cache-only swap if not busy. */ > if (vm_swap_full() && si->swap_map[offset] == SWAP_HAS_CACHE) { > int swap_was_freed; > + VM_WARN_ON(order > 0); Instead of add WARN here, I think that it's better to add WARN at the beginning of "scan" label. We should never scan if "order > 0", it can capture even more abnormal status. > unlock_cluster(ci); > spin_unlock(&si->lock); > swap_was_freed = __try_to_reclaim_swap(si, offset, TTRS_ANYWAY); > @@ -905,17 +961,18 @@ static int scan_swap_map_slots(struct swap_info_struct *si, > } > > if (si->swap_map[offset]) { > + VM_WARN_ON(order > 0); > unlock_cluster(ci); > if (!n_ret) > goto scan; > else > goto done; > } > - WRITE_ONCE(si->swap_map[offset], usage); > - inc_cluster_info_page(si, si->cluster_info, offset); > + memset(si->swap_map + offset, usage, nr_pages); Add barrier() here corresponds to original WRITE_ONCE()? unlock_cluster(ci) may be NOP for some swap devices. > + add_cluster_info_page(si, si->cluster_info, offset, nr_pages); > unlock_cluster(ci); > > - swap_range_alloc(si, offset, 1); > + swap_range_alloc(si, offset, nr_pages); > slots[n_ret++] = swp_entry(si->type, offset); > > /* got enough slots or reach max slots? */ If "order > 0", "nr" must be 1. So, we will "goto done" in the following code. /* got enough slots or reach max slots? */ if ((n_ret == nr) || (offset >= si->highest_bit)) goto done; We can add VM_WARN_ON() here to capture some abnormal status. > @@ -936,8 +993,10 @@ static int scan_swap_map_slots(struct swap_info_struct *si, > > /* try to get more slots in cluster */ > if (si->cluster_info) { > - if (scan_swap_map_try_ssd_cluster(si, &offset, &scan_base)) > + if (scan_swap_map_try_ssd_cluster(si, &offset, &scan_base, order)) > goto checks; > + if (order > 0) > + goto done; Don't need to add this, if "order > 0", we will never go here. > } else if (si->cluster_nr && !si->swap_map[++offset]) { > /* non-ssd case, still more slots in cluster? 
*/ > --si->cluster_nr; > @@ -964,7 +1023,8 @@ static int scan_swap_map_slots(struct swap_info_struct *si, > } > > done: > - set_cluster_next(si, offset + 1); > + if (order == 0) > + set_cluster_next(si, offset + 1); > si->flags -= SWP_SCANNING; > return n_ret; > > @@ -997,38 +1057,6 @@ static int scan_swap_map_slots(struct swap_info_struct *si, > return n_ret; > } > > -static int swap_alloc_cluster(struct swap_info_struct *si, swp_entry_t *slot) > -{ > - unsigned long idx; > - struct swap_cluster_info *ci; > - unsigned long offset; > - > - /* > - * Should not even be attempting cluster allocations when huge > - * page swap is disabled. Warn and fail the allocation. > - */ > - if (!IS_ENABLED(CONFIG_THP_SWAP)) { > - VM_WARN_ON_ONCE(1); > - return 0; > - } > - > - if (cluster_list_empty(&si->free_clusters)) > - return 0; > - > - idx = cluster_list_first(&si->free_clusters); > - offset = idx * SWAPFILE_CLUSTER; > - ci = lock_cluster(si, offset); > - alloc_cluster(si, idx); > - cluster_set_count(ci, SWAPFILE_CLUSTER); > - > - memset(si->swap_map + offset, SWAP_HAS_CACHE, SWAPFILE_CLUSTER); > - unlock_cluster(ci); > - swap_range_alloc(si, offset, SWAPFILE_CLUSTER); > - *slot = swp_entry(si->type, offset); > - > - return 1; > -} > - > static void swap_free_cluster(struct swap_info_struct *si, unsigned long idx) > { > unsigned long offset = idx * SWAPFILE_CLUSTER; > @@ -1050,8 +1078,8 @@ int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_size) > int n_ret = 0; > int node; > > - /* Only single cluster request supported */ > - WARN_ON_ONCE(n_goal > 1 && size == SWAPFILE_CLUSTER); > + /* Only single THP request supported */ > + WARN_ON_ONCE(n_goal > 1 && size > 1); > > spin_lock(&swap_avail_lock); > > @@ -1088,14 +1116,10 @@ int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_size) > spin_unlock(&si->lock); > goto nextsi; > } > - if (size == SWAPFILE_CLUSTER) { > - if (si->flags & SWP_BLKDEV) > - n_ret = swap_alloc_cluster(si, swp_entries); > - } else > - n_ret = scan_swap_map_slots(si, SWAP_HAS_CACHE, > - n_goal, swp_entries); > + n_ret = scan_swap_map_slots(si, SWAP_HAS_CACHE, > + n_goal, swp_entries, size); > spin_unlock(&si->lock); > - if (n_ret || size == SWAPFILE_CLUSTER) > + if (n_ret || size > 1) > goto check_out; > cond_resched(); > > @@ -1647,7 +1671,7 @@ swp_entry_t get_swap_page_of_type(int type) > > /* This is called for allocating swap entry, not cache */ > spin_lock(&si->lock); > - if ((si->flags & SWP_WRITEOK) && scan_swap_map_slots(si, 1, 1, &entry)) > + if ((si->flags & SWP_WRITEOK) && scan_swap_map_slots(si, 1, 1, &entry, 1)) > atomic_long_dec(&nr_swap_pages); > spin_unlock(&si->lock); > fail: > @@ -3101,7 +3125,7 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags) > p->flags |= SWP_SYNCHRONOUS_IO; > > if (p->bdev && bdev_nonrot(p->bdev)) { > - int cpu; > + int cpu, i; > unsigned long ci, nr_cluster; > > p->flags |= SWP_SOLIDSTATE; > @@ -3139,7 +3163,8 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags) > struct percpu_cluster *cluster; > > cluster = per_cpu_ptr(p->percpu_cluster, cpu); > - cluster->next = SWAP_NEXT_INVALID; > + for (i = 0; i < SWAP_NR_ORDERS; i++) > + cluster->next[i] = SWAP_NEXT_INVALID; > } > } else { > atomic_inc(&nr_rotate_swap); You also need to check whether we should add swap_entry_size() for some functions to optimize for small system. We may need to add swap_order() too. -- Best Regards, Huang, Ying
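For readers skimming the thread, the allocation scheme described in the commit message boils down to the following. This is a simplified user-space model of the idea only, not the kernel code; names such as range_empty() and alloc_from_current_cluster() are illustrative, and CLUSTER_SIZE stands in for SWAPFILE_CLUSTER:

#include <stdbool.h>
#include <string.h>

#define CLUSTER_SIZE	512	/* stands in for SWAPFILE_CLUSTER */

/* True if all nr entries starting at start are free (zero). */
static bool range_empty(const unsigned char *map, unsigned int start,
			unsigned int nr)
{
	for (unsigned int i = 0; i < nr; i++) {
		if (map[start + i])
			return false;
	}
	return true;
}

/*
 * Try to carve 1 << order contiguous, naturally aligned entries out of the
 * per-order current cluster whose scan position is next[order]. Returns the
 * allocated offset, or -1 when the cluster is exhausted and the caller must
 * take a fresh free cluster (or fall back to splitting the folio).
 */
static int alloc_from_current_cluster(unsigned char *map, unsigned int *next,
				      int order)
{
	unsigned int nr = 1u << order;
	unsigned int tmp = next[order];
	/* scan only to the end of this cluster, preserving alignment */
	unsigned int max = (tmp / CLUSTER_SIZE + 1) * CLUSTER_SIZE;

	while (tmp < max && !range_empty(map, tmp, nr))
		tmp += nr;
	if (tmp >= max)
		return -1;

	memset(map + tmp, 1, nr);	/* mark the whole run in use */
	next[order] = tmp + nr < max ? tmp + nr : 0;
	return (int)tmp;
}

Because the scan always advances in whole 1 << order steps from a cluster-aligned base, every successful allocation is naturally aligned; that is what allows the swap_map to be updated with a single memset() and avoids intra-cluster fragmentation from mixing orders.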
On 12/03/2024 07:51, Huang, Ying wrote: > Ryan Roberts <ryan.roberts@arm.com> writes: > >> Multi-size THP enables performance improvements by allocating large, >> pte-mapped folios for anonymous memory. However I've observed that on an >> arm64 system running a parallel workload (e.g. kernel compilation) >> across many cores, under high memory pressure, the speed regresses. This >> is due to bottlenecking on the increased number of TLBIs added due to >> all the extra folio splitting when the large folios are swapped out. >> >> Therefore, solve this regression by adding support for swapping out mTHP >> without needing to split the folio, just like is already done for >> PMD-sized THP. This change only applies when CONFIG_THP_SWAP is enabled, >> and when the swap backing store is a non-rotating block device. These >> are the same constraints as for the existing PMD-sized THP swap-out >> support. >> >> Note that no attempt is made to swap-in (m)THP here - this is still done >> page-by-page, like for PMD-sized THP. But swapping-out mTHP is a >> prerequisite for swapping-in mTHP. >> >> The main change here is to improve the swap entry allocator so that it >> can allocate any power-of-2 number of contiguous entries between [1, (1 >> << PMD_ORDER)]. This is done by allocating a cluster for each distinct >> order and allocating sequentially from it until the cluster is full. >> This ensures that we don't need to search the map and we get no >> fragmentation due to alignment padding for different orders in the >> cluster. If there is no current cluster for a given order, we attempt to >> allocate a free cluster from the list. If there are no free clusters, we >> fail the allocation and the caller can fall back to splitting the folio >> and allocates individual entries (as per existing PMD-sized THP >> fallback). >> >> The per-order current clusters are maintained per-cpu using the existing >> infrastructure. This is done to avoid interleving pages from different >> tasks, which would prevent IO being batched. This is already done for >> the order-0 allocations so we follow the same pattern. >> >> As is done for order-0 per-cpu clusters, the scanner now can steal >> order-0 entries from any per-cpu-per-order reserved cluster. This >> ensures that when the swap file is getting full, space doesn't get tied >> up in the per-cpu reserves. >> >> This change only modifies swap to be able to accept any order mTHP. It >> doesn't change the callers to elide doing the actual split. That will be >> done in separate changes. >> >> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> >> --- >> include/linux/swap.h | 8 ++- >> mm/swapfile.c | 167 +++++++++++++++++++++++++------------------ >> 2 files changed, 103 insertions(+), 72 deletions(-) >> >> diff --git a/include/linux/swap.h b/include/linux/swap.h >> index 0cb082bee717..39b5c18ccc6a 100644 >> --- a/include/linux/swap.h >> +++ b/include/linux/swap.h >> @@ -268,13 +268,19 @@ struct swap_cluster_info { >> */ >> #define SWAP_NEXT_INVALID 0 >> >> +#ifdef CONFIG_THP_SWAP >> +#define SWAP_NR_ORDERS (PMD_ORDER + 1) >> +#else >> +#define SWAP_NR_ORDERS 1 >> +#endif >> + >> /* >> * We assign a cluster to each CPU, so each CPU can allocate swap entry from >> * its own cluster and swapout sequentially. The purpose is to optimize swapout >> * throughput. 
>> */ >> struct percpu_cluster { >> - unsigned int next; /* Likely next allocation offset */ >> + unsigned int next[SWAP_NR_ORDERS]; /* Likely next allocation offset */ >> }; >> >> struct swap_cluster_list { >> diff --git a/mm/swapfile.c b/mm/swapfile.c >> index 3828d81aa6b8..61118a090796 100644 >> --- a/mm/swapfile.c >> +++ b/mm/swapfile.c >> @@ -551,10 +551,12 @@ static void free_cluster(struct swap_info_struct *si, unsigned long idx) >> >> /* >> * The cluster corresponding to page_nr will be used. The cluster will be >> - * removed from free cluster list and its usage counter will be increased. >> + * removed from free cluster list and its usage counter will be increased by >> + * count. >> */ >> -static void inc_cluster_info_page(struct swap_info_struct *p, >> - struct swap_cluster_info *cluster_info, unsigned long page_nr) >> +static void add_cluster_info_page(struct swap_info_struct *p, >> + struct swap_cluster_info *cluster_info, unsigned long page_nr, >> + unsigned long count) >> { >> unsigned long idx = page_nr / SWAPFILE_CLUSTER; >> >> @@ -563,9 +565,19 @@ static void inc_cluster_info_page(struct swap_info_struct *p, >> if (cluster_is_free(&cluster_info[idx])) >> alloc_cluster(p, idx); >> >> - VM_BUG_ON(cluster_count(&cluster_info[idx]) >= SWAPFILE_CLUSTER); >> + VM_BUG_ON(cluster_count(&cluster_info[idx]) + count > SWAPFILE_CLUSTER); >> cluster_set_count(&cluster_info[idx], >> - cluster_count(&cluster_info[idx]) + 1); >> + cluster_count(&cluster_info[idx]) + count); >> +} >> + >> +/* >> + * The cluster corresponding to page_nr will be used. The cluster will be >> + * removed from free cluster list and its usage counter will be increased by 1. >> + */ >> +static void inc_cluster_info_page(struct swap_info_struct *p, >> + struct swap_cluster_info *cluster_info, unsigned long page_nr) >> +{ >> + add_cluster_info_page(p, cluster_info, page_nr, 1); >> } >> >> /* >> @@ -595,7 +607,7 @@ static void dec_cluster_info_page(struct swap_info_struct *p, >> */ >> static bool >> scan_swap_map_ssd_cluster_conflict(struct swap_info_struct *si, >> - unsigned long offset) >> + unsigned long offset, int order) >> { >> struct percpu_cluster *percpu_cluster; >> bool conflict; >> @@ -609,24 +621,39 @@ scan_swap_map_ssd_cluster_conflict(struct swap_info_struct *si, >> return false; >> >> percpu_cluster = this_cpu_ptr(si->percpu_cluster); >> - percpu_cluster->next = SWAP_NEXT_INVALID; >> + percpu_cluster->next[order] = SWAP_NEXT_INVALID; >> + return true; >> +} >> + >> +static inline bool swap_range_empty(char *swap_map, unsigned int start, >> + unsigned int nr_pages) >> +{ >> + unsigned int i; >> + >> + for (i = 0; i < nr_pages; i++) { >> + if (swap_map[start + i]) >> + return false; >> + } >> + >> return true; >> } >> >> /* >> - * Try to get a swap entry from current cpu's swap entry pool (a cluster). This >> - * might involve allocating a new cluster for current CPU too. >> + * Try to get a swap entry (or size indicated by order) from current cpu's swap > > IMO, it's not necessary to make mTHP a special case other than base > page. So, this can be changed to > > * Try to get swap entries with specified order from current cpu's swap Sure, will fix in next version. > >> + * entry pool (a cluster). This might involve allocating a new cluster for >> + * current CPU too. 
>> */ >> static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si, >> - unsigned long *offset, unsigned long *scan_base) >> + unsigned long *offset, unsigned long *scan_base, int order) >> { >> + unsigned int nr_pages = 1 << order; >> struct percpu_cluster *cluster; >> struct swap_cluster_info *ci; >> unsigned int tmp, max; >> >> new_cluster: >> cluster = this_cpu_ptr(si->percpu_cluster); >> - tmp = cluster->next; >> + tmp = cluster->next[order]; >> if (tmp == SWAP_NEXT_INVALID) { >> if (!cluster_list_empty(&si->free_clusters)) { >> tmp = cluster_next(&si->free_clusters.head) * >> @@ -647,26 +674,27 @@ static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si, >> >> /* >> * Other CPUs can use our cluster if they can't find a free cluster, >> - * check if there is still free entry in the cluster >> + * check if there is still free entry in the cluster, maintaining >> + * natural alignment. >> */ >> max = min_t(unsigned long, si->max, ALIGN(tmp + 1, SWAPFILE_CLUSTER)); >> if (tmp < max) { >> ci = lock_cluster(si, tmp); >> while (tmp < max) { >> - if (!si->swap_map[tmp]) >> + if (swap_range_empty(si->swap_map, tmp, nr_pages)) >> break; >> - tmp++; >> + tmp += nr_pages; >> } >> unlock_cluster(ci); >> } >> if (tmp >= max) { >> - cluster->next = SWAP_NEXT_INVALID; >> + cluster->next[order] = SWAP_NEXT_INVALID; >> goto new_cluster; >> } >> *offset = tmp; >> *scan_base = tmp; >> - tmp += 1; >> - cluster->next = tmp < max ? tmp : SWAP_NEXT_INVALID; >> + tmp += nr_pages; >> + cluster->next[order] = tmp < max ? tmp : SWAP_NEXT_INVALID; >> return true; >> } >> >> @@ -796,13 +824,14 @@ static bool swap_offset_available_and_locked(struct swap_info_struct *si, >> >> static int scan_swap_map_slots(struct swap_info_struct *si, >> unsigned char usage, int nr, >> - swp_entry_t slots[]) >> + swp_entry_t slots[], unsigned int nr_pages) > > IMHO, it's better to use order as parameter directly. We can change the > parameter of get_swap_pages() too. I agree that this will make the interface clearer/self documenting. I'll do it in the next version. > >> { >> struct swap_cluster_info *ci; >> unsigned long offset; >> unsigned long scan_base; >> unsigned long last_in_cluster = 0; >> int latency_ration = LATENCY_LIMIT; >> + int order = ilog2(nr_pages); >> int n_ret = 0; >> bool scanned_many = false; >> >> @@ -817,6 +846,26 @@ static int scan_swap_map_slots(struct swap_info_struct *si, >> * And we let swap pages go all over an SSD partition. Hugh >> */ >> >> + if (nr_pages > 1) { >> + /* >> + * Should not even be attempting large allocations when huge >> + * page swap is disabled. Warn and fail the allocation. >> + */ >> + if (!IS_ENABLED(CONFIG_THP_SWAP) || >> + nr_pages > SWAPFILE_CLUSTER || >> + !is_power_of_2(nr_pages)) { >> + VM_WARN_ON_ONCE(1); >> + return 0; >> + } >> + >> + /* >> + * Swapfile is not block device or not using clusters so unable >> + * to allocate large entries. 
>> + */ >> + if (!(si->flags & SWP_BLKDEV) || !si->cluster_info) >> + return 0; >> + } >> + >> si->flags += SWP_SCANNING; >> /* >> * Use percpu scan base for SSD to reduce lock contention on >> @@ -831,8 +880,11 @@ static int scan_swap_map_slots(struct swap_info_struct *si, >> >> /* SSD algorithm */ >> if (si->cluster_info) { >> - if (!scan_swap_map_try_ssd_cluster(si, &offset, &scan_base)) >> + if (!scan_swap_map_try_ssd_cluster(si, &offset, &scan_base, order)) { >> + if (order > 0) >> + goto no_page; >> goto scan; >> + } >> } else if (unlikely(!si->cluster_nr--)) { >> if (si->pages - si->inuse_pages < SWAPFILE_CLUSTER) { >> si->cluster_nr = SWAPFILE_CLUSTER - 1; >> @@ -874,26 +926,30 @@ static int scan_swap_map_slots(struct swap_info_struct *si, >> >> checks: >> if (si->cluster_info) { >> - while (scan_swap_map_ssd_cluster_conflict(si, offset)) { >> + while (scan_swap_map_ssd_cluster_conflict(si, offset, order)) { >> /* take a break if we already got some slots */ >> if (n_ret) >> goto done; >> if (!scan_swap_map_try_ssd_cluster(si, &offset, >> - &scan_base)) >> + &scan_base, order)) { >> + if (order > 0) >> + goto no_page; >> goto scan; >> + } >> } >> } >> if (!(si->flags & SWP_WRITEOK)) >> goto no_page; >> if (!si->highest_bit) >> goto no_page; >> - if (offset > si->highest_bit) >> + if (order == 0 && offset > si->highest_bit) > > I don't think that we need to check "order == 0" here. The original > condition will always be false for "order != 0". I spent ages looking at this and couldn't quite convince myself that this is definitely safe. Certainly it would be catastrophic if we modified the returned offset for a non-order-0 case (the code below assumes order-0 when checking). So I decided in the end to be safe and add this condition. Looking again, I agree with you. Will fix in next version. > >> scan_base = offset = si->lowest_bit; >> >> ci = lock_cluster(si, offset); >> /* reuse swap entry of cache-only swap if not busy. */ >> if (vm_swap_full() && si->swap_map[offset] == SWAP_HAS_CACHE) { >> int swap_was_freed; >> + VM_WARN_ON(order > 0); > > Instead of add WARN here, I think that it's better to add WARN at the > beginning of "scan" label. We should never scan if "order > 0", it can > capture even more abnormal status. OK, will do. > >> unlock_cluster(ci); >> spin_unlock(&si->lock); >> swap_was_freed = __try_to_reclaim_swap(si, offset, TTRS_ANYWAY); >> @@ -905,17 +961,18 @@ static int scan_swap_map_slots(struct swap_info_struct *si, >> } >> >> if (si->swap_map[offset]) { >> + VM_WARN_ON(order > 0); And remove this one too? (relying on the one in scan instead) >> unlock_cluster(ci); >> if (!n_ret) >> goto scan; >> else >> goto done; >> } >> - WRITE_ONCE(si->swap_map[offset], usage); >> - inc_cluster_info_page(si, si->cluster_info, offset); >> + memset(si->swap_map + offset, usage, nr_pages); > > Add barrier() here corresponds to original WRITE_ONCE()? > unlock_cluster(ci) may be NOP for some swap devices. Yep, good spot! > >> + add_cluster_info_page(si, si->cluster_info, offset, nr_pages); >> unlock_cluster(ci); >> >> - swap_range_alloc(si, offset, 1); >> + swap_range_alloc(si, offset, nr_pages); >> slots[n_ret++] = swp_entry(si->type, offset); >> >> /* got enough slots or reach max slots? */ > > If "order > 0", "nr" must be 1. So, we will "goto done" in the > following code. I've deliberately implemented scan_swap_map_slots() so that it allows nr > 1 for order > 0. And leave it to the higher layers to decide on policy. > > /* got enough slots or reach max slots? 
*/ > if ((n_ret == nr) || (offset >= si->highest_bit)) > goto done; > > We can add VM_WARN_ON() here to capture some abnormal status. That was actually how I implemented initially. But decided that it doesn't cost anything to allow nr > 1 for order > 0, and IMHO makes the function easier to understand because we remove this uneccessary constraint. > >> @@ -936,8 +993,10 @@ static int scan_swap_map_slots(struct swap_info_struct *si, >> >> /* try to get more slots in cluster */ >> if (si->cluster_info) { >> - if (scan_swap_map_try_ssd_cluster(si, &offset, &scan_base)) >> + if (scan_swap_map_try_ssd_cluster(si, &offset, &scan_base, order)) >> goto checks; >> + if (order > 0) >> + goto done; > > Don't need to add this, if "order > 0", we will never go here. As per above. > >> } else if (si->cluster_nr && !si->swap_map[++offset]) { >> /* non-ssd case, still more slots in cluster? */ >> --si->cluster_nr; >> @@ -964,7 +1023,8 @@ static int scan_swap_map_slots(struct swap_info_struct *si, >> } >> >> done: >> - set_cluster_next(si, offset + 1); >> + if (order == 0) >> + set_cluster_next(si, offset + 1); >> si->flags -= SWP_SCANNING; >> return n_ret; >> >> @@ -997,38 +1057,6 @@ static int scan_swap_map_slots(struct swap_info_struct *si, >> return n_ret; >> } >> >> -static int swap_alloc_cluster(struct swap_info_struct *si, swp_entry_t *slot) >> -{ >> - unsigned long idx; >> - struct swap_cluster_info *ci; >> - unsigned long offset; >> - >> - /* >> - * Should not even be attempting cluster allocations when huge >> - * page swap is disabled. Warn and fail the allocation. >> - */ >> - if (!IS_ENABLED(CONFIG_THP_SWAP)) { >> - VM_WARN_ON_ONCE(1); >> - return 0; >> - } >> - >> - if (cluster_list_empty(&si->free_clusters)) >> - return 0; >> - >> - idx = cluster_list_first(&si->free_clusters); >> - offset = idx * SWAPFILE_CLUSTER; >> - ci = lock_cluster(si, offset); >> - alloc_cluster(si, idx); >> - cluster_set_count(ci, SWAPFILE_CLUSTER); >> - >> - memset(si->swap_map + offset, SWAP_HAS_CACHE, SWAPFILE_CLUSTER); >> - unlock_cluster(ci); >> - swap_range_alloc(si, offset, SWAPFILE_CLUSTER); >> - *slot = swp_entry(si->type, offset); >> - >> - return 1; >> -} >> - >> static void swap_free_cluster(struct swap_info_struct *si, unsigned long idx) >> { >> unsigned long offset = idx * SWAPFILE_CLUSTER; >> @@ -1050,8 +1078,8 @@ int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_size) >> int n_ret = 0; >> int node; >> >> - /* Only single cluster request supported */ >> - WARN_ON_ONCE(n_goal > 1 && size == SWAPFILE_CLUSTER); >> + /* Only single THP request supported */ >> + WARN_ON_ONCE(n_goal > 1 && size > 1); >> >> spin_lock(&swap_avail_lock); >> >> @@ -1088,14 +1116,10 @@ int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_size) >> spin_unlock(&si->lock); >> goto nextsi; >> } >> - if (size == SWAPFILE_CLUSTER) { >> - if (si->flags & SWP_BLKDEV) >> - n_ret = swap_alloc_cluster(si, swp_entries); >> - } else >> - n_ret = scan_swap_map_slots(si, SWAP_HAS_CACHE, >> - n_goal, swp_entries); >> + n_ret = scan_swap_map_slots(si, SWAP_HAS_CACHE, >> + n_goal, swp_entries, size); >> spin_unlock(&si->lock); >> - if (n_ret || size == SWAPFILE_CLUSTER) >> + if (n_ret || size > 1) >> goto check_out; >> cond_resched(); >> >> @@ -1647,7 +1671,7 @@ swp_entry_t get_swap_page_of_type(int type) >> >> /* This is called for allocating swap entry, not cache */ >> spin_lock(&si->lock); >> - if ((si->flags & SWP_WRITEOK) && scan_swap_map_slots(si, 1, 1, &entry)) >> + if ((si->flags & SWP_WRITEOK) && 
scan_swap_map_slots(si, 1, 1, &entry, 1)) >> atomic_long_dec(&nr_swap_pages); >> spin_unlock(&si->lock); >> fail: >> @@ -3101,7 +3125,7 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags) >> p->flags |= SWP_SYNCHRONOUS_IO; >> >> if (p->bdev && bdev_nonrot(p->bdev)) { >> - int cpu; >> + int cpu, i; >> unsigned long ci, nr_cluster; >> >> p->flags |= SWP_SOLIDSTATE; >> @@ -3139,7 +3163,8 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags) >> struct percpu_cluster *cluster; >> >> cluster = per_cpu_ptr(p->percpu_cluster, cpu); >> - cluster->next = SWAP_NEXT_INVALID; >> + for (i = 0; i < SWAP_NR_ORDERS; i++) >> + cluster->next[i] = SWAP_NEXT_INVALID; >> } >> } else { >> atomic_inc(&nr_rotate_swap); > > You also need to check whether we should add swap_entry_size() for some > functions to optimize for small system. We may need to add swap_order() > too. I was planning to convert swap_entry_size() to swap_entry_order() as part of switching to pass order instead of nr_pages. There is one other site that uses swap_entry_size() and needs a size, so was going to just change it to 1 << swap_entry_order(). Does that work for you? I'll do an audit for places to use swap_entry_order() but quick scan just now suggests that the constant should propagate to all the static functions from get_swap_pages(). Thanks, Ryan > > -- > Best Regards, > Huang, Ying
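For reference, the swap_entry_size() pattern Ryan refers to compiles down to a constant when CONFIG_THP_SWAP is disabled. One possible shape for the order-based replacement discussed here (a sketch only, not necessarily what the next version of the patch will contain) is:

#ifdef CONFIG_THP_SWAP
static inline int swap_entry_order(int order)
{
	return order;
}
#else
/* Without THP swap everything is order 0, and the compiler can see it. */
static inline int swap_entry_order(int order)
{
	return 0;
}
#endif

The remaining swap_entry_size() caller that genuinely needs a page count could then use 1 << swap_entry_order(order), as suggested above, and the constant 0 propagates through the order-0-only paths on small systems.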
Ryan Roberts <ryan.roberts@arm.com> writes: > On 12/03/2024 07:51, Huang, Ying wrote: >> Ryan Roberts <ryan.roberts@arm.com> writes: >> >>> Multi-size THP enables performance improvements by allocating large, >>> pte-mapped folios for anonymous memory. However I've observed that on an >>> arm64 system running a parallel workload (e.g. kernel compilation) >>> across many cores, under high memory pressure, the speed regresses. This >>> is due to bottlenecking on the increased number of TLBIs added due to >>> all the extra folio splitting when the large folios are swapped out. >>> >>> Therefore, solve this regression by adding support for swapping out mTHP >>> without needing to split the folio, just like is already done for >>> PMD-sized THP. This change only applies when CONFIG_THP_SWAP is enabled, >>> and when the swap backing store is a non-rotating block device. These >>> are the same constraints as for the existing PMD-sized THP swap-out >>> support. >>> >>> Note that no attempt is made to swap-in (m)THP here - this is still done >>> page-by-page, like for PMD-sized THP. But swapping-out mTHP is a >>> prerequisite for swapping-in mTHP. >>> >>> The main change here is to improve the swap entry allocator so that it >>> can allocate any power-of-2 number of contiguous entries between [1, (1 >>> << PMD_ORDER)]. This is done by allocating a cluster for each distinct >>> order and allocating sequentially from it until the cluster is full. >>> This ensures that we don't need to search the map and we get no >>> fragmentation due to alignment padding for different orders in the >>> cluster. If there is no current cluster for a given order, we attempt to >>> allocate a free cluster from the list. If there are no free clusters, we >>> fail the allocation and the caller can fall back to splitting the folio >>> and allocates individual entries (as per existing PMD-sized THP >>> fallback). >>> >>> The per-order current clusters are maintained per-cpu using the existing >>> infrastructure. This is done to avoid interleving pages from different >>> tasks, which would prevent IO being batched. This is already done for >>> the order-0 allocations so we follow the same pattern. >>> >>> As is done for order-0 per-cpu clusters, the scanner now can steal >>> order-0 entries from any per-cpu-per-order reserved cluster. This >>> ensures that when the swap file is getting full, space doesn't get tied >>> up in the per-cpu reserves. >>> >>> This change only modifies swap to be able to accept any order mTHP. It >>> doesn't change the callers to elide doing the actual split. That will be >>> done in separate changes. >>> >>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> >>> --- >>> include/linux/swap.h | 8 ++- >>> mm/swapfile.c | 167 +++++++++++++++++++++++++------------------ >>> 2 files changed, 103 insertions(+), 72 deletions(-) >>> >>> diff --git a/include/linux/swap.h b/include/linux/swap.h >>> index 0cb082bee717..39b5c18ccc6a 100644 >>> --- a/include/linux/swap.h >>> +++ b/include/linux/swap.h >>> @@ -268,13 +268,19 @@ struct swap_cluster_info { >>> */ >>> #define SWAP_NEXT_INVALID 0 >>> >>> +#ifdef CONFIG_THP_SWAP >>> +#define SWAP_NR_ORDERS (PMD_ORDER + 1) >>> +#else >>> +#define SWAP_NR_ORDERS 1 >>> +#endif >>> + >>> /* >>> * We assign a cluster to each CPU, so each CPU can allocate swap entry from >>> * its own cluster and swapout sequentially. The purpose is to optimize swapout >>> * throughput. 
>>> */ >>> struct percpu_cluster { >>> - unsigned int next; /* Likely next allocation offset */ >>> + unsigned int next[SWAP_NR_ORDERS]; /* Likely next allocation offset */ >>> }; >>> >>> struct swap_cluster_list { >>> diff --git a/mm/swapfile.c b/mm/swapfile.c >>> index 3828d81aa6b8..61118a090796 100644 >>> --- a/mm/swapfile.c >>> +++ b/mm/swapfile.c >>> @@ -551,10 +551,12 @@ static void free_cluster(struct swap_info_struct *si, unsigned long idx) >>> >>> /* >>> * The cluster corresponding to page_nr will be used. The cluster will be >>> - * removed from free cluster list and its usage counter will be increased. >>> + * removed from free cluster list and its usage counter will be increased by >>> + * count. >>> */ >>> -static void inc_cluster_info_page(struct swap_info_struct *p, >>> - struct swap_cluster_info *cluster_info, unsigned long page_nr) >>> +static void add_cluster_info_page(struct swap_info_struct *p, >>> + struct swap_cluster_info *cluster_info, unsigned long page_nr, >>> + unsigned long count) >>> { >>> unsigned long idx = page_nr / SWAPFILE_CLUSTER; >>> >>> @@ -563,9 +565,19 @@ static void inc_cluster_info_page(struct swap_info_struct *p, >>> if (cluster_is_free(&cluster_info[idx])) >>> alloc_cluster(p, idx); >>> >>> - VM_BUG_ON(cluster_count(&cluster_info[idx]) >= SWAPFILE_CLUSTER); >>> + VM_BUG_ON(cluster_count(&cluster_info[idx]) + count > SWAPFILE_CLUSTER); >>> cluster_set_count(&cluster_info[idx], >>> - cluster_count(&cluster_info[idx]) + 1); >>> + cluster_count(&cluster_info[idx]) + count); >>> +} >>> + >>> +/* >>> + * The cluster corresponding to page_nr will be used. The cluster will be >>> + * removed from free cluster list and its usage counter will be increased by 1. >>> + */ >>> +static void inc_cluster_info_page(struct swap_info_struct *p, >>> + struct swap_cluster_info *cluster_info, unsigned long page_nr) >>> +{ >>> + add_cluster_info_page(p, cluster_info, page_nr, 1); >>> } >>> >>> /* >>> @@ -595,7 +607,7 @@ static void dec_cluster_info_page(struct swap_info_struct *p, >>> */ >>> static bool >>> scan_swap_map_ssd_cluster_conflict(struct swap_info_struct *si, >>> - unsigned long offset) >>> + unsigned long offset, int order) >>> { >>> struct percpu_cluster *percpu_cluster; >>> bool conflict; >>> @@ -609,24 +621,39 @@ scan_swap_map_ssd_cluster_conflict(struct swap_info_struct *si, >>> return false; >>> >>> percpu_cluster = this_cpu_ptr(si->percpu_cluster); >>> - percpu_cluster->next = SWAP_NEXT_INVALID; >>> + percpu_cluster->next[order] = SWAP_NEXT_INVALID; >>> + return true; >>> +} >>> + >>> +static inline bool swap_range_empty(char *swap_map, unsigned int start, >>> + unsigned int nr_pages) >>> +{ >>> + unsigned int i; >>> + >>> + for (i = 0; i < nr_pages; i++) { >>> + if (swap_map[start + i]) >>> + return false; >>> + } >>> + >>> return true; >>> } >>> >>> /* >>> - * Try to get a swap entry from current cpu's swap entry pool (a cluster). This >>> - * might involve allocating a new cluster for current CPU too. >>> + * Try to get a swap entry (or size indicated by order) from current cpu's swap >> >> IMO, it's not necessary to make mTHP a special case other than base >> page. So, this can be changed to >> >> * Try to get swap entries with specified order from current cpu's swap > > Sure, will fix in next version. > >> >>> + * entry pool (a cluster). This might involve allocating a new cluster for >>> + * current CPU too. 
>>> */ >>> static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si, >>> - unsigned long *offset, unsigned long *scan_base) >>> + unsigned long *offset, unsigned long *scan_base, int order) >>> { >>> + unsigned int nr_pages = 1 << order; >>> struct percpu_cluster *cluster; >>> struct swap_cluster_info *ci; >>> unsigned int tmp, max; >>> >>> new_cluster: >>> cluster = this_cpu_ptr(si->percpu_cluster); >>> - tmp = cluster->next; >>> + tmp = cluster->next[order]; >>> if (tmp == SWAP_NEXT_INVALID) { >>> if (!cluster_list_empty(&si->free_clusters)) { >>> tmp = cluster_next(&si->free_clusters.head) * >>> @@ -647,26 +674,27 @@ static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si, >>> >>> /* >>> * Other CPUs can use our cluster if they can't find a free cluster, >>> - * check if there is still free entry in the cluster >>> + * check if there is still free entry in the cluster, maintaining >>> + * natural alignment. >>> */ >>> max = min_t(unsigned long, si->max, ALIGN(tmp + 1, SWAPFILE_CLUSTER)); >>> if (tmp < max) { >>> ci = lock_cluster(si, tmp); >>> while (tmp < max) { >>> - if (!si->swap_map[tmp]) >>> + if (swap_range_empty(si->swap_map, tmp, nr_pages)) >>> break; >>> - tmp++; >>> + tmp += nr_pages; >>> } >>> unlock_cluster(ci); >>> } >>> if (tmp >= max) { >>> - cluster->next = SWAP_NEXT_INVALID; >>> + cluster->next[order] = SWAP_NEXT_INVALID; >>> goto new_cluster; >>> } >>> *offset = tmp; >>> *scan_base = tmp; >>> - tmp += 1; >>> - cluster->next = tmp < max ? tmp : SWAP_NEXT_INVALID; >>> + tmp += nr_pages; >>> + cluster->next[order] = tmp < max ? tmp : SWAP_NEXT_INVALID; >>> return true; >>> } >>> >>> @@ -796,13 +824,14 @@ static bool swap_offset_available_and_locked(struct swap_info_struct *si, >>> >>> static int scan_swap_map_slots(struct swap_info_struct *si, >>> unsigned char usage, int nr, >>> - swp_entry_t slots[]) >>> + swp_entry_t slots[], unsigned int nr_pages) >> >> IMHO, it's better to use order as parameter directly. We can change the >> parameter of get_swap_pages() too. > > I agree that this will make the interface clearer/self documenting. I'll do it > in the next version. > >> >>> { >>> struct swap_cluster_info *ci; >>> unsigned long offset; >>> unsigned long scan_base; >>> unsigned long last_in_cluster = 0; >>> int latency_ration = LATENCY_LIMIT; >>> + int order = ilog2(nr_pages); >>> int n_ret = 0; >>> bool scanned_many = false; >>> >>> @@ -817,6 +846,26 @@ static int scan_swap_map_slots(struct swap_info_struct *si, >>> * And we let swap pages go all over an SSD partition. Hugh >>> */ >>> >>> + if (nr_pages > 1) { >>> + /* >>> + * Should not even be attempting large allocations when huge >>> + * page swap is disabled. Warn and fail the allocation. >>> + */ >>> + if (!IS_ENABLED(CONFIG_THP_SWAP) || >>> + nr_pages > SWAPFILE_CLUSTER || >>> + !is_power_of_2(nr_pages)) { >>> + VM_WARN_ON_ONCE(1); >>> + return 0; >>> + } >>> + >>> + /* >>> + * Swapfile is not block device or not using clusters so unable >>> + * to allocate large entries. 
>>> + */ >>> + if (!(si->flags & SWP_BLKDEV) || !si->cluster_info) >>> + return 0; >>> + } >>> + >>> si->flags += SWP_SCANNING; >>> /* >>> * Use percpu scan base for SSD to reduce lock contention on >>> @@ -831,8 +880,11 @@ static int scan_swap_map_slots(struct swap_info_struct *si, >>> >>> /* SSD algorithm */ >>> if (si->cluster_info) { >>> - if (!scan_swap_map_try_ssd_cluster(si, &offset, &scan_base)) >>> + if (!scan_swap_map_try_ssd_cluster(si, &offset, &scan_base, order)) { >>> + if (order > 0) >>> + goto no_page; >>> goto scan; >>> + } >>> } else if (unlikely(!si->cluster_nr--)) { >>> if (si->pages - si->inuse_pages < SWAPFILE_CLUSTER) { >>> si->cluster_nr = SWAPFILE_CLUSTER - 1; >>> @@ -874,26 +926,30 @@ static int scan_swap_map_slots(struct swap_info_struct *si, >>> >>> checks: >>> if (si->cluster_info) { >>> - while (scan_swap_map_ssd_cluster_conflict(si, offset)) { >>> + while (scan_swap_map_ssd_cluster_conflict(si, offset, order)) { >>> /* take a break if we already got some slots */ >>> if (n_ret) >>> goto done; >>> if (!scan_swap_map_try_ssd_cluster(si, &offset, >>> - &scan_base)) >>> + &scan_base, order)) { >>> + if (order > 0) >>> + goto no_page; >>> goto scan; >>> + } >>> } >>> } >>> if (!(si->flags & SWP_WRITEOK)) >>> goto no_page; >>> if (!si->highest_bit) >>> goto no_page; >>> - if (offset > si->highest_bit) >>> + if (order == 0 && offset > si->highest_bit) >> >> I don't think that we need to check "order == 0" here. The original >> condition will always be false for "order != 0". > > I spent ages looking at this and couldn't quite convince myself that this is > definitely safe. Certainly it would be catastrophic if we modified the returned > offset for a non-order-0 case (the code below assumes order-0 when checking). So > I decided in the end to be safe and add this condition. Looking again, I agree > with you. Will fix in next version. > >> >>> scan_base = offset = si->lowest_bit; >>> >>> ci = lock_cluster(si, offset); >>> /* reuse swap entry of cache-only swap if not busy. */ >>> if (vm_swap_full() && si->swap_map[offset] == SWAP_HAS_CACHE) { >>> int swap_was_freed; >>> + VM_WARN_ON(order > 0); >> >> Instead of add WARN here, I think that it's better to add WARN at the >> beginning of "scan" label. We should never scan if "order > 0", it can >> capture even more abnormal status. > > OK, will do. > >> >>> unlock_cluster(ci); >>> spin_unlock(&si->lock); >>> swap_was_freed = __try_to_reclaim_swap(si, offset, TTRS_ANYWAY); >>> @@ -905,17 +961,18 @@ static int scan_swap_map_slots(struct swap_info_struct *si, >>> } >>> >>> if (si->swap_map[offset]) { >>> + VM_WARN_ON(order > 0); > > And remove this one too? (relying on the one in scan instead) Yes. I think so. >>> unlock_cluster(ci); >>> if (!n_ret) >>> goto scan; >>> else >>> goto done; >>> } >>> - WRITE_ONCE(si->swap_map[offset], usage); >>> - inc_cluster_info_page(si, si->cluster_info, offset); >>> + memset(si->swap_map + offset, usage, nr_pages); >> >> Add barrier() here corresponds to original WRITE_ONCE()? >> unlock_cluster(ci) may be NOP for some swap devices. > > Yep, good spot! > >> >>> + add_cluster_info_page(si, si->cluster_info, offset, nr_pages); >>> unlock_cluster(ci); >>> >>> - swap_range_alloc(si, offset, 1); >>> + swap_range_alloc(si, offset, nr_pages); >>> slots[n_ret++] = swp_entry(si->type, offset); >>> >>> /* got enough slots or reach max slots? */ >> >> If "order > 0", "nr" must be 1. So, we will "goto done" in the >> following code. 
> > I've deliberately implemented scan_swap_map_slots() so that it allows nr > 1 for > order > 0. And leave it to the higher layers to decide on policy. > >> >> /* got enough slots or reach max slots? */ >> if ((n_ret == nr) || (offset >= si->highest_bit)) >> goto done; >> >> We can add VM_WARN_ON() here to capture some abnormal status. > > That was actually how I implemented initially. But decided that it doesn't cost > anything to allow nr > 1 for order > 0, and IMHO makes the function easier to > understand because we remove this uneccessary constraint. This sounds reasonable to me. >> >>> @@ -936,8 +993,10 @@ static int scan_swap_map_slots(struct swap_info_struct *si, >>> >>> /* try to get more slots in cluster */ >>> if (si->cluster_info) { >>> - if (scan_swap_map_try_ssd_cluster(si, &offset, &scan_base)) >>> + if (scan_swap_map_try_ssd_cluster(si, &offset, &scan_base, order)) >>> goto checks; >>> + if (order > 0) >>> + goto done; >> >> Don't need to add this, if "order > 0", we will never go here. > > As per above. > >> >>> } else if (si->cluster_nr && !si->swap_map[++offset]) { >>> /* non-ssd case, still more slots in cluster? */ >>> --si->cluster_nr; >>> @@ -964,7 +1023,8 @@ static int scan_swap_map_slots(struct swap_info_struct *si, >>> } >>> >>> done: >>> - set_cluster_next(si, offset + 1); >>> + if (order == 0) >>> + set_cluster_next(si, offset + 1); >>> si->flags -= SWP_SCANNING; >>> return n_ret; >>> >>> @@ -997,38 +1057,6 @@ static int scan_swap_map_slots(struct swap_info_struct *si, >>> return n_ret; >>> } >>> >>> -static int swap_alloc_cluster(struct swap_info_struct *si, swp_entry_t *slot) >>> -{ >>> - unsigned long idx; >>> - struct swap_cluster_info *ci; >>> - unsigned long offset; >>> - >>> - /* >>> - * Should not even be attempting cluster allocations when huge >>> - * page swap is disabled. Warn and fail the allocation. 
>>> - */ >>> - if (!IS_ENABLED(CONFIG_THP_SWAP)) { >>> - VM_WARN_ON_ONCE(1); >>> - return 0; >>> - } >>> - >>> - if (cluster_list_empty(&si->free_clusters)) >>> - return 0; >>> - >>> - idx = cluster_list_first(&si->free_clusters); >>> - offset = idx * SWAPFILE_CLUSTER; >>> - ci = lock_cluster(si, offset); >>> - alloc_cluster(si, idx); >>> - cluster_set_count(ci, SWAPFILE_CLUSTER); >>> - >>> - memset(si->swap_map + offset, SWAP_HAS_CACHE, SWAPFILE_CLUSTER); >>> - unlock_cluster(ci); >>> - swap_range_alloc(si, offset, SWAPFILE_CLUSTER); >>> - *slot = swp_entry(si->type, offset); >>> - >>> - return 1; >>> -} >>> - >>> static void swap_free_cluster(struct swap_info_struct *si, unsigned long idx) >>> { >>> unsigned long offset = idx * SWAPFILE_CLUSTER; >>> @@ -1050,8 +1078,8 @@ int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_size) >>> int n_ret = 0; >>> int node; >>> >>> - /* Only single cluster request supported */ >>> - WARN_ON_ONCE(n_goal > 1 && size == SWAPFILE_CLUSTER); >>> + /* Only single THP request supported */ >>> + WARN_ON_ONCE(n_goal > 1 && size > 1); >>> >>> spin_lock(&swap_avail_lock); >>> >>> @@ -1088,14 +1116,10 @@ int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_size) >>> spin_unlock(&si->lock); >>> goto nextsi; >>> } >>> - if (size == SWAPFILE_CLUSTER) { >>> - if (si->flags & SWP_BLKDEV) >>> - n_ret = swap_alloc_cluster(si, swp_entries); >>> - } else >>> - n_ret = scan_swap_map_slots(si, SWAP_HAS_CACHE, >>> - n_goal, swp_entries); >>> + n_ret = scan_swap_map_slots(si, SWAP_HAS_CACHE, >>> + n_goal, swp_entries, size); >>> spin_unlock(&si->lock); >>> - if (n_ret || size == SWAPFILE_CLUSTER) >>> + if (n_ret || size > 1) >>> goto check_out; >>> cond_resched(); >>> >>> @@ -1647,7 +1671,7 @@ swp_entry_t get_swap_page_of_type(int type) >>> >>> /* This is called for allocating swap entry, not cache */ >>> spin_lock(&si->lock); >>> - if ((si->flags & SWP_WRITEOK) && scan_swap_map_slots(si, 1, 1, &entry)) >>> + if ((si->flags & SWP_WRITEOK) && scan_swap_map_slots(si, 1, 1, &entry, 1)) >>> atomic_long_dec(&nr_swap_pages); >>> spin_unlock(&si->lock); >>> fail: >>> @@ -3101,7 +3125,7 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags) >>> p->flags |= SWP_SYNCHRONOUS_IO; >>> >>> if (p->bdev && bdev_nonrot(p->bdev)) { >>> - int cpu; >>> + int cpu, i; >>> unsigned long ci, nr_cluster; >>> >>> p->flags |= SWP_SOLIDSTATE; >>> @@ -3139,7 +3163,8 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags) >>> struct percpu_cluster *cluster; >>> >>> cluster = per_cpu_ptr(p->percpu_cluster, cpu); >>> - cluster->next = SWAP_NEXT_INVALID; >>> + for (i = 0; i < SWAP_NR_ORDERS; i++) >>> + cluster->next[i] = SWAP_NEXT_INVALID; >>> } >>> } else { >>> atomic_inc(&nr_rotate_swap); >> >> You also need to check whether we should add swap_entry_size() for some >> functions to optimize for small system. We may need to add swap_order() >> too. > > I was planning to convert swap_entry_size() to swap_entry_order() as part of > switching to pass order instead of nr_pages. There is one other site that uses > swap_entry_size() and needs a size, so was going to just change it to 1 << > swap_entry_order(). Does that work for you? Yes. > I'll do an audit for places to use swap_entry_order() but quick scan just now > suggests that the constant should propagate to all the static functions from > get_swap_pages(). > -- Best Regards, Huang, Ying
Hi Huang, Ying, On 12/03/2024 07:51, Huang, Ying wrote: > Ryan Roberts <ryan.roberts@arm.com> writes: > >> Multi-size THP enables performance improvements by allocating large, >> pte-mapped folios for anonymous memory. However I've observed that on an >> arm64 system running a parallel workload (e.g. kernel compilation) >> across many cores, under high memory pressure, the speed regresses. This >> is due to bottlenecking on the increased number of TLBIs added due to >> all the extra folio splitting when the large folios are swapped out. >> >> Therefore, solve this regression by adding support for swapping out mTHP >> without needing to split the folio, just like is already done for >> PMD-sized THP. This change only applies when CONFIG_THP_SWAP is enabled, >> and when the swap backing store is a non-rotating block device. These >> are the same constraints as for the existing PMD-sized THP swap-out >> support. >> >> Note that no attempt is made to swap-in (m)THP here - this is still done >> page-by-page, like for PMD-sized THP. But swapping-out mTHP is a >> prerequisite for swapping-in mTHP. >> >> The main change here is to improve the swap entry allocator so that it >> can allocate any power-of-2 number of contiguous entries between [1, (1 >> << PMD_ORDER)]. This is done by allocating a cluster for each distinct >> order and allocating sequentially from it until the cluster is full. >> This ensures that we don't need to search the map and we get no >> fragmentation due to alignment padding for different orders in the >> cluster. If there is no current cluster for a given order, we attempt to >> allocate a free cluster from the list. If there are no free clusters, we >> fail the allocation and the caller can fall back to splitting the folio >> and allocates individual entries (as per existing PMD-sized THP >> fallback). >> >> The per-order current clusters are maintained per-cpu using the existing >> infrastructure. This is done to avoid interleving pages from different >> tasks, which would prevent IO being batched. This is already done for >> the order-0 allocations so we follow the same pattern. >> >> As is done for order-0 per-cpu clusters, the scanner now can steal >> order-0 entries from any per-cpu-per-order reserved cluster. This >> ensures that when the swap file is getting full, space doesn't get tied >> up in the per-cpu reserves. >> >> This change only modifies swap to be able to accept any order mTHP. It >> doesn't change the callers to elide doing the actual split. That will be >> done in separate changes. [...] >> @@ -905,17 +961,18 @@ static int scan_swap_map_slots(struct swap_info_struct *si, >> } >> >> if (si->swap_map[offset]) { >> + VM_WARN_ON(order > 0); >> unlock_cluster(ci); >> if (!n_ret) >> goto scan; >> else >> goto done; >> } >> - WRITE_ONCE(si->swap_map[offset], usage); >> - inc_cluster_info_page(si, si->cluster_info, offset); >> + memset(si->swap_map + offset, usage, nr_pages); > > Add barrier() here corresponds to original WRITE_ONCE()? > unlock_cluster(ci) may be NOP for some swap devices. Looking at this a bit more closely, I'm not sure this is needed. Even if there is no cluster, the swap_info is still locked, so unlocking that will act as a barrier. There are a number of other callsites that memset(si->swap_map) without an explicit barrier and with the swap_info locked. Looking at the original commit that added the WRITE_ONCE() it was worried about a race with reading swap_map in _swap_info_get(). 
But that site is now annotated with a data_race(), which will suppress the warning. And I don't believe there are any places that read swap_map locklessly and depend upon observing ordering between it and other state? So I think the si unlock is sufficient? I'm not planning to add barrier() here. Let me know if you disagree. Thanks, Ryan > >> + add_cluster_info_page(si, si->cluster_info, offset, nr_pages); >> unlock_cluster(ci);
Ryan Roberts <ryan.roberts@arm.com> writes: > Hi Huang, Ying, > > > On 12/03/2024 07:51, Huang, Ying wrote: >> Ryan Roberts <ryan.roberts@arm.com> writes: >> >>> Multi-size THP enables performance improvements by allocating large, >>> pte-mapped folios for anonymous memory. However I've observed that on an >>> arm64 system running a parallel workload (e.g. kernel compilation) >>> across many cores, under high memory pressure, the speed regresses. This >>> is due to bottlenecking on the increased number of TLBIs added due to >>> all the extra folio splitting when the large folios are swapped out. >>> >>> Therefore, solve this regression by adding support for swapping out mTHP >>> without needing to split the folio, just like is already done for >>> PMD-sized THP. This change only applies when CONFIG_THP_SWAP is enabled, >>> and when the swap backing store is a non-rotating block device. These >>> are the same constraints as for the existing PMD-sized THP swap-out >>> support. >>> >>> Note that no attempt is made to swap-in (m)THP here - this is still done >>> page-by-page, like for PMD-sized THP. But swapping-out mTHP is a >>> prerequisite for swapping-in mTHP. >>> >>> The main change here is to improve the swap entry allocator so that it >>> can allocate any power-of-2 number of contiguous entries between [1, (1 >>> << PMD_ORDER)]. This is done by allocating a cluster for each distinct >>> order and allocating sequentially from it until the cluster is full. >>> This ensures that we don't need to search the map and we get no >>> fragmentation due to alignment padding for different orders in the >>> cluster. If there is no current cluster for a given order, we attempt to >>> allocate a free cluster from the list. If there are no free clusters, we >>> fail the allocation and the caller can fall back to splitting the folio >>> and allocates individual entries (as per existing PMD-sized THP >>> fallback). >>> >>> The per-order current clusters are maintained per-cpu using the existing >>> infrastructure. This is done to avoid interleving pages from different >>> tasks, which would prevent IO being batched. This is already done for >>> the order-0 allocations so we follow the same pattern. >>> >>> As is done for order-0 per-cpu clusters, the scanner now can steal >>> order-0 entries from any per-cpu-per-order reserved cluster. This >>> ensures that when the swap file is getting full, space doesn't get tied >>> up in the per-cpu reserves. >>> >>> This change only modifies swap to be able to accept any order mTHP. It >>> doesn't change the callers to elide doing the actual split. That will be >>> done in separate changes. > > [...] > >>> @@ -905,17 +961,18 @@ static int scan_swap_map_slots(struct swap_info_struct *si, >>> } >>> >>> if (si->swap_map[offset]) { >>> + VM_WARN_ON(order > 0); >>> unlock_cluster(ci); >>> if (!n_ret) >>> goto scan; >>> else >>> goto done; >>> } >>> - WRITE_ONCE(si->swap_map[offset], usage); >>> - inc_cluster_info_page(si, si->cluster_info, offset); >>> + memset(si->swap_map + offset, usage, nr_pages); >> >> Add barrier() here corresponds to original WRITE_ONCE()? >> unlock_cluster(ci) may be NOP for some swap devices. > > Looking at this a bit more closely, I'm not sure this is needed. Even if there > is no cluster, the swap_info is still locked, so unlocking that will act as a > barrier. There are a number of other callsites that memset(si->swap_map) without > an explicit barrier and with the swap_info locked. 
> > Looking at the original commit that added the WRITE_ONCE() it was worried about > a race with reading swap_map in _swap_info_get(). But that site is now annotated > with a data_race(), which will suppress the warning. And I don't believe there > are any places that read swap_map locklessly and depend upon observing ordering > between it and other state? So I think the si unlock is sufficient? > > I'm not planning to add barrier() here. Let me know if you disagree. swap_map[] may be read locklessly in swap_offset_available_and_locked() in parallel. IIUC, WRITE_ONCE() here is to make the writing take effect as early as possible there. > >> >>> + add_cluster_info_page(si, si->cluster_info, offset, nr_pages); >>> unlock_cluster(ci); -- Best Regards, Huang, Ying
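For context, the reader Ying mentions does a lockless peek at swap_map[] and then rechecks under si->lock before using the slot. A minimal sketch of that pattern follows; the helper name offset_looks_free_and_lock() is invented here, and the details of the real swap_offset_available_and_locked() are elided:

"
/*
 * Sketch only: lockless peek at swap_map[], then take si->lock so the
 * caller can recheck before actually claiming the slot.
 */
static bool offset_looks_free_and_lock(struct swap_info_struct *si,
				       unsigned long offset)
{
	/* Racy read: another CPU may be writing swap_map[offset]. */
	if (data_race(!si->swap_map[offset])) {
		spin_lock(&si->lock);
		return true;
	}

	return false;
}
"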
On 21/03/2024 04:39, Huang, Ying wrote: > Ryan Roberts <ryan.roberts@arm.com> writes: > >> Hi Huang, Ying, >> >> >> On 12/03/2024 07:51, Huang, Ying wrote: >>> Ryan Roberts <ryan.roberts@arm.com> writes: >>> >>>> Multi-size THP enables performance improvements by allocating large, >>>> pte-mapped folios for anonymous memory. However I've observed that on an >>>> arm64 system running a parallel workload (e.g. kernel compilation) >>>> across many cores, under high memory pressure, the speed regresses. This >>>> is due to bottlenecking on the increased number of TLBIs added due to >>>> all the extra folio splitting when the large folios are swapped out. >>>> >>>> Therefore, solve this regression by adding support for swapping out mTHP >>>> without needing to split the folio, just like is already done for >>>> PMD-sized THP. This change only applies when CONFIG_THP_SWAP is enabled, >>>> and when the swap backing store is a non-rotating block device. These >>>> are the same constraints as for the existing PMD-sized THP swap-out >>>> support. >>>> >>>> Note that no attempt is made to swap-in (m)THP here - this is still done >>>> page-by-page, like for PMD-sized THP. But swapping-out mTHP is a >>>> prerequisite for swapping-in mTHP. >>>> >>>> The main change here is to improve the swap entry allocator so that it >>>> can allocate any power-of-2 number of contiguous entries between [1, (1 >>>> << PMD_ORDER)]. This is done by allocating a cluster for each distinct >>>> order and allocating sequentially from it until the cluster is full. >>>> This ensures that we don't need to search the map and we get no >>>> fragmentation due to alignment padding for different orders in the >>>> cluster. If there is no current cluster for a given order, we attempt to >>>> allocate a free cluster from the list. If there are no free clusters, we >>>> fail the allocation and the caller can fall back to splitting the folio >>>> and allocates individual entries (as per existing PMD-sized THP >>>> fallback). >>>> >>>> The per-order current clusters are maintained per-cpu using the existing >>>> infrastructure. This is done to avoid interleving pages from different >>>> tasks, which would prevent IO being batched. This is already done for >>>> the order-0 allocations so we follow the same pattern. >>>> >>>> As is done for order-0 per-cpu clusters, the scanner now can steal >>>> order-0 entries from any per-cpu-per-order reserved cluster. This >>>> ensures that when the swap file is getting full, space doesn't get tied >>>> up in the per-cpu reserves. >>>> >>>> This change only modifies swap to be able to accept any order mTHP. It >>>> doesn't change the callers to elide doing the actual split. That will be >>>> done in separate changes. >> >> [...] >> >>>> @@ -905,17 +961,18 @@ static int scan_swap_map_slots(struct swap_info_struct *si, >>>> } >>>> >>>> if (si->swap_map[offset]) { >>>> + VM_WARN_ON(order > 0); >>>> unlock_cluster(ci); >>>> if (!n_ret) >>>> goto scan; >>>> else >>>> goto done; >>>> } >>>> - WRITE_ONCE(si->swap_map[offset], usage); >>>> - inc_cluster_info_page(si, si->cluster_info, offset); >>>> + memset(si->swap_map + offset, usage, nr_pages); >>> >>> Add barrier() here corresponds to original WRITE_ONCE()? >>> unlock_cluster(ci) may be NOP for some swap devices. >> >> Looking at this a bit more closely, I'm not sure this is needed. Even if there >> is no cluster, the swap_info is still locked, so unlocking that will act as a >> barrier. 
There are a number of other callsites that memset(si->swap_map) without >> an explicit barrier and with the swap_info locked. >> >> Looking at the original commit that added the WRITE_ONCE() it was worried about >> a race with reading swap_map in _swap_info_get(). But that site is now annotated >> with a data_race(), which will suppress the warning. And I don't believe there >> are any places that read swap_map locklessly and depend upon observing ordering >> between it and other state? So I think the si unlock is sufficient? >> >> I'm not planning to add barrier() here. Let me know if you disagree. > > swap_map[] may be read locklessly in swap_offset_available_and_locked() > in parallel. IIUC, WRITE_ONCE() here is to make the writing take effect > as early as possible there. Afraid I'm not convinced by that argument; if it's racing, it's racing - the lockless side needs to be robust (it is). Adding the compiler barrier limits the compiler's options which could lead to slower code in this path. If your argument is that you want to reduce the window where swap_offset_available_and_locked() could observe a free swap slot but then see that it's taken after it gets the si lock, that seems like a micro-optimization to me, which we should avoid if we can. By removing the WRITE_ONCE() and using memset, the lockless reader could observe tearing though. I don't think that should cause a problem (because everything is rechecked under the lock), but if we want to avoid it, then perhaps we just need to loop over WRITE_ONCE() here instead of using memset? > >> >>> >>>> + add_cluster_info_page(si, si->cluster_info, offset, nr_pages); >>>> unlock_cluster(ci); > > -- > Best Regards, > Huang, Ying
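For illustration, the loop-over-WRITE_ONCE() alternative Ryan floats at the end of that message might look something like the sketch below; the helper name is invented here and this is not part of the posted series:

"
/*
 * Sketch only: mark each swap_map byte with WRITE_ONCE() instead of
 * memset(), so a lockless reader can never observe a torn or
 * compiler-split store.
 */
static void swap_map_mark_range(struct swap_info_struct *si,
				unsigned long offset, unsigned char usage,
				unsigned int nr_pages)
{
	unsigned int i;

	for (i = 0; i < nr_pages; i++)
		WRITE_ONCE(si->swap_map[offset + i], usage);
}
"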
Hi, Paul, Can you help us on WRITE_ONCE()/READ_ONCE()/barrier() usage as follows? For some example kernel code as follows, " unsigned char x[16]; void writer(void) { memset(x, 1, sizeof(x)); /* To make memset() take effect ASAP */ barrier(); } unsigned char reader(int n) { return READ_ONCE(x[n]); } " where, writer() and reader() may be called on 2 CPUs without any lock. It's acceptable for reader() to read the written value a little later. Our questions are, 1. because it's impossible for accessing "unsigned char" to cause tearing. So, WRITE_ONCE()/READ_ONCE()/barrier() isn't necessary for correctness, right? 2. we use barrier() and READ_ONCE() in writer() and reader(), because we want to make writing take effect ASAP. Is it a good practice? Or it's a micro-optimization that should be avoided? -- Best Regards, Huang, Ying
Ryan Roberts <ryan.roberts@arm.com> writes: > On 21/03/2024 04:39, Huang, Ying wrote: >> Ryan Roberts <ryan.roberts@arm.com> writes: >> >>> Hi Huang, Ying, >>> >>> >>> On 12/03/2024 07:51, Huang, Ying wrote: >>>> Ryan Roberts <ryan.roberts@arm.com> writes: >>> [...] >>> >>>>> @@ -905,17 +961,18 @@ static int scan_swap_map_slots(struct swap_info_struct *si, >>>>> } >>>>> >>>>> if (si->swap_map[offset]) { >>>>> + VM_WARN_ON(order > 0); >>>>> unlock_cluster(ci); >>>>> if (!n_ret) >>>>> goto scan; >>>>> else >>>>> goto done; >>>>> } >>>>> - WRITE_ONCE(si->swap_map[offset], usage); >>>>> - inc_cluster_info_page(si, si->cluster_info, offset); >>>>> + memset(si->swap_map + offset, usage, nr_pages); >>>> >>>> Add barrier() here corresponds to original WRITE_ONCE()? >>>> unlock_cluster(ci) may be NOP for some swap devices. >>> >>> Looking at this a bit more closely, I'm not sure this is needed. Even if there >>> is no cluster, the swap_info is still locked, so unlocking that will act as a >>> barrier. There are a number of other callsites that memset(si->swap_map) without >>> an explicit barrier and with the swap_info locked. >>> >>> Looking at the original commit that added the WRITE_ONCE() it was worried about >>> a race with reading swap_map in _swap_info_get(). But that site is now annotated >>> with a data_race(), which will suppress the warning. And I don't believe there >>> are any places that read swap_map locklessly and depend upon observing ordering >>> between it and other state? So I think the si unlock is sufficient? >>> >>> I'm not planning to add barrier() here. Let me know if you disagree. >> >> swap_map[] may be read locklessly in swap_offset_available_and_locked() >> in parallel. IIUC, WRITE_ONCE() here is to make the writing take effect >> as early as possible there. > > Afraid I'm not convinced by that argument; if it's racing, it's racing - the It's not a race. > lockless side needs to be robust (it is). Adding the compiler barrier limits the > compiler's options which could lead to slower code in this path. If your > argument is that you want to reduce the window where > swap_offset_available_and_locked() could observe a free swap slot but then see > that its taken after it gets the si lock, that seems like a micro-optimization > to me, which we should avoid if we can. Yes. I think that it is a micro-optimization too. I had thought that it is a common practice to use WRITE_ONCE()/READ_ONCE() or barrier() in intentional racy data accessing to make the change available as soon as possible. But I may be wrong here. > By remnoving the WRITE_ONCE() and using memset, the lockless reader could > observe tearing though. I don't think that should cause a problem (because > everything is rechecked with under the lock), but if we want to avoid it, then > perhaps we just need to loop over WRITE_ONCE() here instead of using memset? IIUC, in practice that isn't necessary, because type of si->swap_map[] is "unsigned char". It isn't possible to tear "unsigned char". In theory, it may be better to use WRITE_ONCE() because we may change the type of si->swap_map[] at some time (who knows). I don't have a strong opinion here. >>>>> + add_cluster_info_page(si, si->cluster_info, offset, nr_pages); >>>>> unlock_cluster(ci); -- Best Regards, Huang, Ying
On 22/03/2024 02:38, Huang, Ying wrote: > Hi, Paul, > > Can you help us on WRITE_ONCE()/READ_ONCE()/barrier() usage as follows? > For some example kernel code as follows, > > " > unsigned char x[16]; > > void writer(void) > { > memset(x, 1, sizeof(x)); > /* To make memset() take effect ASAP */ > barrier(); > } > > unsigned char reader(int n) > { > return READ_ONCE(x[n]); > } > " > > where, writer() and reader() may be called on 2 CPUs without any lock. For the situation we are discussing, writer() is always called with a spin lock held. So spin_unlock() will act as the barrier in this case; that's my argument for not needing the explicit barrier(), anyway. Happy to be told I'm wrong. > It's acceptable for reader() to read the written value a little later. > Our questions are, > > 1. because it's impossible for accessing "unsigned char" to cause > tearing. So, WRITE_ONCE()/READ_ONCE()/barrier() isn't necessary for > correctness, right? > > 2. we use barrier() and READ_ONCE() in writer() and reader(), because we > want to make writing take effect ASAP. Is it a good practice? Or it's > a micro-optimization that should be avoided? > > -- > Best Regards, > Huang, Ying
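In other words, the writer side under discussion has roughly the following shape (a simplified sketch; in the real code si->lock is taken and released in the caller, get_swap_pages(), around scan_swap_map_slots()):

"
/*
 * Sketch only: the swap_map stores happen with si->lock held, and the
 * later spin_unlock() is a release that orders them for any CPU that
 * subsequently takes the lock.
 */
spin_lock(&si->lock);
	/* ... scanning and cluster bookkeeping elided ... */
	memset(si->swap_map + offset, usage, nr_pages);
	add_cluster_info_page(si, si->cluster_info, offset, nr_pages);
	/* ... */
spin_unlock(&si->lock);
"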
On 22/03/2024 02:39, Huang, Ying wrote: > Ryan Roberts <ryan.roberts@arm.com> writes: > >> On 21/03/2024 04:39, Huang, Ying wrote: >>> Ryan Roberts <ryan.roberts@arm.com> writes: >>> >>>> Hi Huang, Ying, >>>> >>>> >>>> On 12/03/2024 07:51, Huang, Ying wrote: >>>>> Ryan Roberts <ryan.roberts@arm.com> writes: >>>> [...] >>>> >>>>>> @@ -905,17 +961,18 @@ static int scan_swap_map_slots(struct swap_info_struct *si, >>>>>> } >>>>>> >>>>>> if (si->swap_map[offset]) { >>>>>> + VM_WARN_ON(order > 0); >>>>>> unlock_cluster(ci); >>>>>> if (!n_ret) >>>>>> goto scan; >>>>>> else >>>>>> goto done; >>>>>> } >>>>>> - WRITE_ONCE(si->swap_map[offset], usage); >>>>>> - inc_cluster_info_page(si, si->cluster_info, offset); >>>>>> + memset(si->swap_map + offset, usage, nr_pages); >>>>> >>>>> Add barrier() here corresponds to original WRITE_ONCE()? >>>>> unlock_cluster(ci) may be NOP for some swap devices. >>>> >>>> Looking at this a bit more closely, I'm not sure this is needed. Even if there >>>> is no cluster, the swap_info is still locked, so unlocking that will act as a >>>> barrier. There are a number of other callsites that memset(si->swap_map) without >>>> an explicit barrier and with the swap_info locked. >>>> >>>> Looking at the original commit that added the WRITE_ONCE() it was worried about >>>> a race with reading swap_map in _swap_info_get(). But that site is now annotated >>>> with a data_race(), which will suppress the warning. And I don't believe there >>>> are any places that read swap_map locklessly and depend upon observing ordering >>>> between it and other state? So I think the si unlock is sufficient? >>>> >>>> I'm not planning to add barrier() here. Let me know if you disagree. >>> >>> swap_map[] may be read locklessly in swap_offset_available_and_locked() >>> in parallel. IIUC, WRITE_ONCE() here is to make the writing take effect >>> as early as possible there. >> >> Afraid I'm not convinced by that argument; if it's racing, it's racing - the > > It's not a race. > >> lockless side needs to be robust (it is). Adding the compiler barrier limits the >> compiler's options which could lead to slower code in this path. If your >> argument is that you want to reduce the window where >> swap_offset_available_and_locked() could observe a free swap slot but then see >> that its taken after it gets the si lock, that seems like a micro-optimization >> to me, which we should avoid if we can. > > Yes. I think that it is a micro-optimization too. I had thought that > it is a common practice to use WRITE_ONCE()/READ_ONCE() or barrier() in > intentional racy data accessing to make the change available as soon as > possible. But I may be wrong here. My understanding is that WRITE_ONCE() forces the compiler to emit a store at that point in the program; it can't just rely on knowing that it has previously written the same value to that location, it can't reorder the store to later in the program and it must store the whole word atomically so that no tearing can be observed. But given that swap_map is only ever written with the si lock held, I don't believe we require the first two of those semantics. It should be enough to know that the compiler has emitted all the stores (if it determines they are required) prior to the spin_unlock(). I'm not sure about the anti-tearing guarantee. Happy to be told I'm wrong here! > >> By remnoving the WRITE_ONCE() and using memset, the lockless reader could >> observe tearing though.
I don't think that should cause a problem (because >> everything is rechecked with under the lock), but if we want to avoid it, then >> perhaps we just need to loop over WRITE_ONCE() here instead of using memset? > > IIUC, in practice that isn't necessary, because type of si->swap_map[] > is "unsigned char". It isn't possible to tear "unsigned char". In practice, perhaps. But I guess the compiler is free to update the char bit-by-bit if it wants to, if the store is not marked WRITE_ONCE()? > In > theory, it may be better to use WRITE_ONCE() because we may change the > type of si->swap_map[] at some time (who knows). I don't have a strong > opinion here. The way I see it, the precedent is already set; there are a number of places that already use memset to update swap_map. They are all under the si lock, and none use barrier(). If it's wrong here, then it's wrong in those places too, I believe. > >>>>>> + add_cluster_info_page(si, si->cluster_info, offset, nr_pages); >>>>>> unlock_cluster(ci); > > -- > Best Regards, > Huang, Ying
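One concrete instance of that precedent is in this very series: the swap_alloc_cluster() path being removed (see the diff further down) already filled a whole cluster's worth of swap_map with a plain memset(), with si->lock held by the caller and the cluster locked:

"
	ci = lock_cluster(si, offset);
	alloc_cluster(si, idx);
	cluster_set_count(ci, SWAPFILE_CLUSTER);

	memset(si->swap_map + offset, SWAP_HAS_CACHE, SWAPFILE_CLUSTER);
	unlock_cluster(ci);
	swap_range_alloc(si, offset, SWAPFILE_CLUSTER);
"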
Hi Ying, Very interesting question. On Thu, Mar 21, 2024 at 7:40 PM Huang, Ying <ying.huang@intel.com> wrote: > > Hi, Paul, > > Can you help us on WRITE_ONCE()/READ_ONCE()/barrier() usage as follows? > For some example kernel code as follows, > > " > unsigned char x[16]; > > void writer(void) > { > memset(x, 1, sizeof(x)); > /* To make memset() take effect ASAP */ > barrier(); > } > > unsigned char reader(int n) > { > return READ_ONCE(x[n]); > } > " > > where, writer() and reader() may be called on 2 CPUs without any lock. > It's acceptable for reader() to read the written value a little later. I am trying to see if your program can be converted into a litmus test so the Linux memory model tools can answer it for you. Because you allow reader() to read the written value a little later, there is nothing the test can verify against. The reader can see the value either before or after the writer's update; both are valid observations. To make your test example more complete, you need the reader/writer to do more actions to expose the race. For example, "if (READ_ONCE(x[n])) y = 1;" Then you can ask the question whether it is possible to observe x[n] == 0 and y == 1. That might not be the test condition you have in mind, but you get the idea. We want a test example whose resulting observable state indicates whether the bad thing did happen (or cannot happen). > Our questions are, > > 1. because it's impossible for accessing "unsigned char" to cause > tearing. So, WRITE_ONCE()/READ_ONCE()/barrier() isn't necessary for > correctness, right? We need to define what the expected behavior outcome is for it to be "correct", possibly including the actions before and after the barrier. Chris > > 2. we use barrier() and READ_ONCE() in writer() and reader(), because we > want to make writing take effect ASAP. Is it a good practice? Or it's > a micro-optimization that should be avoided? > > -- > Best Regards, > Huang, Ying >
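To make Chris's suggestion concrete, the example could be extended with an observable side effect along these lines (purely illustrative; the flag y is hypothetical and not part of the kernel code being discussed):

"
unsigned char x[16];
int y;

void writer(void)
{
	memset(x, 1, sizeof(x));
	barrier();
}

void reader(int n)
{
	/* Record that the new value of x[n] was observed. */
	if (READ_ONCE(x[n]))
		WRITE_ONCE(y, 1);
}

/*
 * The question for the memory model then becomes: can some observer
 * see y == 1 while an element of x still reads as 0?
 */
"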
[Use Paul's reachable address in CC; trimmed CC list, keeping only those who have responded so far.] Hello Huang, Let me chime in. On Fri, 22 Mar 2024 06:19:52 -0700, Huang, Ying wrote: > Hi, Paul, > > Can you help us on WRITE_ONCE()/READ_ONCE()/barrier() usage as follows? > For some example kernel code as follows, > > " > unsigned char x[16]; > > void writer(void) > { > memset(x, 1, sizeof(x)); > /* To make memset() take effect ASAP */ > barrier(); > } > > unsigned char reader(int n) > { > return READ_ONCE(x[n]); > } > " > > where, writer() and reader() may be called on 2 CPUs without any lock. > It's acceptable for reader() to read the written value a little later. > Our questions are, > > 1. because it's impossible for accessing "unsigned char" to cause > tearing. So, WRITE_ONCE()/READ_ONCE()/barrier() isn't necessary for > correctness, right? > > 2. we use barrier() and READ_ONCE() in writer() and reader(), because we > want to make writing take effect ASAP. Is it a good practice? Or it's > a micro-optimization that should be avoided? Why don't you consult Documentation/memory-barriers.txt, especially the section titled "COMPILER BARRIER"? TL;DR: barrier(), WRITE_ONCE(), and READ_ONCE() are compiler barriers, not memory barriers. They just restrict compiler optimizations and don't have any effect with regard to "make writing take effect ASAP". If you have further questions, please don't hesitate to ask. Regards, Akira (an LKMM Reviewer). > > -- > Best Regards, > Huang, Ying
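As a reminder of why barrier() constrains only the compiler: in the kernel it is (paraphrasing include/linux/compiler.h) just an empty asm statement with a memory clobber, so it emits no instructions and gives no ordering guarantee to the hardware:

"
/* Paraphrased from the kernel headers: a pure compiler-level fence. */
#define barrier() __asm__ __volatile__("" : : : "memory")
"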
On Sat, Mar 23, 2024 at 11:11:09AM +0900, Akira Yokosawa wrote: > [Use Paul's reachable address in CC; > trimmed CC list, keeping only those who have responded so far.] > > Hello Huang, > Let me chime in. > > On Fri, 22 Mar 2024 06:19:52 -0700, Huang, Ying wrote: > > Hi, Paul, > > > > Can you help us on WRITE_ONCE()/READ_ONCE()/barrier() usage as follows? > > For some example kernel code as follows, > > > > " > > unsigned char x[16]; > > > > void writer(void) > > { > > memset(x, 1, sizeof(x)); > > /* To make memset() take effect ASAP */ > > barrier(); > > } > > > > unsigned char reader(int n) > > { > > return READ_ONCE(x[n]); > > } > > " > > > > where, writer() and reader() may be called on 2 CPUs without any lock. > > It's acceptable for reader() to read the written value a little later. What are your consistency requirements? For but one example, if reader(3) gives the new value, is it OK for a later call to reader(2) to give the old value? Until we know what your requirements are, it is hard to say whether the above code meets those requirements. In the meantime, I can imagine requirements that it meets and others that it does not. Also, Akira's points below are quite important. Thanx, Paul > > Our questions are, > > > > 1. because it's impossible for accessing "unsigned char" to cause > > tearing. So, WRITE_ONCE()/READ_ONCE()/barrier() isn't necessary for > > correctness, right? > > > > 2. we use barrier() and READ_ONCE() in writer() and reader(), because we > > want to make writing take effect ASAP. Is it a good practice? Or it's > > a micro-optimization that should be avoided? > > Why don't you consult Documentation/memory-barriers.txt, especially > the section titled "COMPILER BARRIER"? > > TL;DR: > > barrier(), WRITE_ONCE(), and READ_ONCE() are compiler barriers, not > memory barriers. They just restrict compiler optimizations and don't > have any effect with regard to "make writing take effect ASAP". > > If you have further questions, please don't hesitate to ask. > > Regards, > Akira (a LKMM Reveiwer). > > > > > -- > > Best Regards, > > Huang, Ying >
Akira Yokosawa <akiyks@gmail.com> writes: > [Use Paul's reachable address in CC; > trimmed CC list, keeping only those who have responded so far.] Thanks a lot! > Hello Huang, > Let me chime in. > > On Fri, 22 Mar 2024 06:19:52 -0700, Huang, Ying wrote: >> Hi, Paul, >> >> Can you help us on WRITE_ONCE()/READ_ONCE()/barrier() usage as follows? >> For some example kernel code as follows, >> >> " >> unsigned char x[16]; >> >> void writer(void) >> { >> memset(x, 1, sizeof(x)); >> /* To make memset() take effect ASAP */ >> barrier(); >> } >> >> unsigned char reader(int n) >> { >> return READ_ONCE(x[n]); >> } >> " >> >> where, writer() and reader() may be called on 2 CPUs without any lock. >> It's acceptable for reader() to read the written value a little later. >> Our questions are, >> >> 1. because it's impossible for accessing "unsigned char" to cause >> tearing. So, WRITE_ONCE()/READ_ONCE()/barrier() isn't necessary for >> correctness, right? >> >> 2. we use barrier() and READ_ONCE() in writer() and reader(), because we >> want to make writing take effect ASAP. Is it a good practice? Or it's >> a micro-optimization that should be avoided? > > Why don't you consult Documentation/memory-barriers.txt, especially > the section titled "COMPILER BARRIER"? > > TL;DR: > > barrier(), WRITE_ONCE(), and READ_ONCE() are compiler barriers, not > memory barriers. They just restrict compiler optimizations and don't > have any effect with regard to "make writing take effect ASAP". Yes. In theory, this is absolutely correct. My question is: in practice, will a compiler barrier make the CPU run (or see) the memory read/write instructions a little earlier, by preventing the compiler from reordering operations that come after the read/write to before it? > If you have further questions, please don't hesitate to ask. -- Best Regards, Huang, Ying
"Paul E. McKenney" <paulmck@kernel.org> writes: > On Sat, Mar 23, 2024 at 11:11:09AM +0900, Akira Yokosawa wrote: >> [Use Paul's reachable address in CC; >> trimmed CC list, keeping only those who have responded so far.] >> >> Hello Huang, >> Let me chime in. >> >> On Fri, 22 Mar 2024 06:19:52 -0700, Huang, Ying wrote: >> > Hi, Paul, >> > >> > Can you help us on WRITE_ONCE()/READ_ONCE()/barrier() usage as follows? >> > For some example kernel code as follows, >> > >> > " >> > unsigned char x[16]; >> > >> > void writer(void) >> > { >> > memset(x, 1, sizeof(x)); >> > /* To make memset() take effect ASAP */ >> > barrier(); >> > } >> > >> > unsigned char reader(int n) >> > { >> > return READ_ONCE(x[n]); >> > } >> > " >> > >> > where, writer() and reader() may be called on 2 CPUs without any lock. >> > It's acceptable for reader() to read the written value a little later. > > What are your consistency requirements? For but one example, if reader(3) > gives the new value, is it OK for a later call to reader(2) to give the > old value? writer() will be called with a lock held (sorry, my previous words aren't correct here). After the racy checking in reader(), we will acquire the lock and check "x[n]" again to confirm. And, there are no dependencies between different "n". All in all, we can accept almost all races between writer() and reader(). My question is, if there are some operations between writer() and unlocking in its caller, whether does barrier() in writer() make any sense? Make write instructions appear a little earlier in compiled code? Mark the memory may be read racy? Or doesn't make sense at all? > Until we know what your requirements are, it is hard to say whether the > above code meets those requirements. In the meantime, I can imagine > requirements that it meets and others that it does not. > > Also, Akira's points below are quite important. Replied for his email. -- Best Regards, Huang, Ying
Ryan Roberts <ryan.roberts@arm.com> writes: > On 22/03/2024 02:38, Huang, Ying wrote: >> Hi, Paul, >> >> Can you help us on WRITE_ONCE()/READ_ONCE()/barrier() usage as follows? >> For some example kernel code as follows, >> >> " >> unsigned char x[16]; >> >> void writer(void) >> { >> memset(x, 1, sizeof(x)); >> /* To make memset() take effect ASAP */ >> barrier(); >> } >> >> unsigned char reader(int n) >> { >> return READ_ONCE(x[n]); >> } >> " >> >> where, writer() and reader() may be called on 2 CPUs without any lock. > > For the situation we are discussing, writer() is always called with a spin lock > held. So spin_unlock() will act as the barrier in this case; that's my argument > for not needing the explicit barrier(), anyway. Happy to be told I'm wrong. Yes. spin_unlock() is a barrier too. There are some operations between writer() and spin_unlock(), so I want to check whether it makes any sense to add a barrier earlier. >> It's acceptable for reader() to read the written value a little later. >> Our questions are, >> >> 1. because it's impossible for accessing "unsigned char" to cause >> tearing. So, WRITE_ONCE()/READ_ONCE()/barrier() isn't necessary for >> correctness, right? >> >> 2. we use barrier() and READ_ONCE() in writer() and reader(), because we >> want to make writing take effect ASAP. Is it a good practice? Or it's >> a micro-optimization that should be avoided? -- Best Regards, Huang, Ying
On 25/03/2024 03:16, Huang, Ying wrote: > "Paul E. McKenney" <paulmck@kernel.org> writes: > >> On Sat, Mar 23, 2024 at 11:11:09AM +0900, Akira Yokosawa wrote: >>> [Use Paul's reachable address in CC; >>> trimmed CC list, keeping only those who have responded so far.] >>> >>> Hello Huang, >>> Let me chime in. >>> >>> On Fri, 22 Mar 2024 06:19:52 -0700, Huang, Ying wrote: >>>> Hi, Paul, >>>> >>>> Can you help us on WRITE_ONCE()/READ_ONCE()/barrier() usage as follows? >>>> For some example kernel code as follows, >>>> >>>> " >>>> unsigned char x[16]; >>>> >>>> void writer(void) >>>> { >>>> memset(x, 1, sizeof(x)); >>>> /* To make memset() take effect ASAP */ >>>> barrier(); >>>> } >>>> >>>> unsigned char reader(int n) >>>> { >>>> return READ_ONCE(x[n]); >>>> } >>>> " >>>> >>>> where, writer() and reader() may be called on 2 CPUs without any lock. >>>> It's acceptable for reader() to read the written value a little later. >> >> What are your consistency requirements? For but one example, if reader(3) >> gives the new value, is it OK for a later call to reader(2) to give the >> old value? > > writer() will be called with a lock held (sorry, my previous words > aren't correct here). After the racy checking in reader(), we will > acquire the lock and check "x[n]" again to confirm. And, there are no > dependencies between different "n". All in all, we can accept almost > all races between writer() and reader(). > > My question is, if there are some operations between writer() and > unlocking in its caller, whether does barrier() in writer() make any > sense? Make write instructions appear a little earlier in compiled > code? Mark the memory may be read racy? Or doesn't make sense at all? A compiler barrier is necessary but not sufficient to guarantee that the stores become visible to the reader; you would also need a memory barrier to stop the HW from reordering IIUC. So I really fail to see the value of adding barrier(). As you state above there is no correctness issue here. It's just a question of whether the barrier() can make the store appear earlier to the reader for a (micro!) performance optimization. You'll get both the compiler and memory barrier from the slightly later spin_unlock(). The patch that added the original WRITE_ONCE() was concerned with squashing kcsan warnings, not with performance optimization. (And the addition of the WRITE_ONCE() wasn't actually needed to achieve the aim). So I'm planning to repost my series (hopefully tomorrow) without the barrier() present, unless you still want to try to convince me that it is useful. Thanks, Ryan > >> Until we know what your requirements are, it is hard to say whether the >> above code meets those requirements. In the meantime, I can imagine >> requirements that it meets and others that it does not. >> >> Also, Akira's points below are quite important. > > Replied for his email. > > -- > Best Regards, > Huang, Ying
diff --git a/include/linux/swap.h b/include/linux/swap.h index 0cb082bee717..39b5c18ccc6a 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -268,13 +268,19 @@ struct swap_cluster_info { */ #define SWAP_NEXT_INVALID 0 +#ifdef CONFIG_THP_SWAP +#define SWAP_NR_ORDERS (PMD_ORDER + 1) +#else +#define SWAP_NR_ORDERS 1 +#endif + /* * We assign a cluster to each CPU, so each CPU can allocate swap entry from * its own cluster and swapout sequentially. The purpose is to optimize swapout * throughput. */ struct percpu_cluster { - unsigned int next; /* Likely next allocation offset */ + unsigned int next[SWAP_NR_ORDERS]; /* Likely next allocation offset */ }; struct swap_cluster_list { diff --git a/mm/swapfile.c b/mm/swapfile.c index 3828d81aa6b8..61118a090796 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -551,10 +551,12 @@ static void free_cluster(struct swap_info_struct *si, unsigned long idx) /* * The cluster corresponding to page_nr will be used. The cluster will be - * removed from free cluster list and its usage counter will be increased. + * removed from free cluster list and its usage counter will be increased by + * count. */ -static void inc_cluster_info_page(struct swap_info_struct *p, - struct swap_cluster_info *cluster_info, unsigned long page_nr) +static void add_cluster_info_page(struct swap_info_struct *p, + struct swap_cluster_info *cluster_info, unsigned long page_nr, + unsigned long count) { unsigned long idx = page_nr / SWAPFILE_CLUSTER; @@ -563,9 +565,19 @@ static void inc_cluster_info_page(struct swap_info_struct *p, if (cluster_is_free(&cluster_info[idx])) alloc_cluster(p, idx); - VM_BUG_ON(cluster_count(&cluster_info[idx]) >= SWAPFILE_CLUSTER); + VM_BUG_ON(cluster_count(&cluster_info[idx]) + count > SWAPFILE_CLUSTER); cluster_set_count(&cluster_info[idx], - cluster_count(&cluster_info[idx]) + 1); + cluster_count(&cluster_info[idx]) + count); +} + +/* + * The cluster corresponding to page_nr will be used. The cluster will be + * removed from free cluster list and its usage counter will be increased by 1. + */ +static void inc_cluster_info_page(struct swap_info_struct *p, + struct swap_cluster_info *cluster_info, unsigned long page_nr) +{ + add_cluster_info_page(p, cluster_info, page_nr, 1); } /* @@ -595,7 +607,7 @@ static void dec_cluster_info_page(struct swap_info_struct *p, */ static bool scan_swap_map_ssd_cluster_conflict(struct swap_info_struct *si, - unsigned long offset) + unsigned long offset, int order) { struct percpu_cluster *percpu_cluster; bool conflict; @@ -609,24 +621,39 @@ scan_swap_map_ssd_cluster_conflict(struct swap_info_struct *si, return false; percpu_cluster = this_cpu_ptr(si->percpu_cluster); - percpu_cluster->next = SWAP_NEXT_INVALID; + percpu_cluster->next[order] = SWAP_NEXT_INVALID; + return true; +} + +static inline bool swap_range_empty(char *swap_map, unsigned int start, + unsigned int nr_pages) +{ + unsigned int i; + + for (i = 0; i < nr_pages; i++) { + if (swap_map[start + i]) + return false; + } + return true; } /* - * Try to get a swap entry from current cpu's swap entry pool (a cluster). This - * might involve allocating a new cluster for current CPU too. + * Try to get a swap entry (or size indicated by order) from current cpu's swap + * entry pool (a cluster). This might involve allocating a new cluster for + * current CPU too. 
*/ static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si, - unsigned long *offset, unsigned long *scan_base) + unsigned long *offset, unsigned long *scan_base, int order) { + unsigned int nr_pages = 1 << order; struct percpu_cluster *cluster; struct swap_cluster_info *ci; unsigned int tmp, max; new_cluster: cluster = this_cpu_ptr(si->percpu_cluster); - tmp = cluster->next; + tmp = cluster->next[order]; if (tmp == SWAP_NEXT_INVALID) { if (!cluster_list_empty(&si->free_clusters)) { tmp = cluster_next(&si->free_clusters.head) * @@ -647,26 +674,27 @@ static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si, /* * Other CPUs can use our cluster if they can't find a free cluster, - * check if there is still free entry in the cluster + * check if there is still free entry in the cluster, maintaining + * natural alignment. */ max = min_t(unsigned long, si->max, ALIGN(tmp + 1, SWAPFILE_CLUSTER)); if (tmp < max) { ci = lock_cluster(si, tmp); while (tmp < max) { - if (!si->swap_map[tmp]) + if (swap_range_empty(si->swap_map, tmp, nr_pages)) break; - tmp++; + tmp += nr_pages; } unlock_cluster(ci); } if (tmp >= max) { - cluster->next = SWAP_NEXT_INVALID; + cluster->next[order] = SWAP_NEXT_INVALID; goto new_cluster; } *offset = tmp; *scan_base = tmp; - tmp += 1; - cluster->next = tmp < max ? tmp : SWAP_NEXT_INVALID; + tmp += nr_pages; + cluster->next[order] = tmp < max ? tmp : SWAP_NEXT_INVALID; return true; } @@ -796,13 +824,14 @@ static bool swap_offset_available_and_locked(struct swap_info_struct *si, static int scan_swap_map_slots(struct swap_info_struct *si, unsigned char usage, int nr, - swp_entry_t slots[]) + swp_entry_t slots[], unsigned int nr_pages) { struct swap_cluster_info *ci; unsigned long offset; unsigned long scan_base; unsigned long last_in_cluster = 0; int latency_ration = LATENCY_LIMIT; + int order = ilog2(nr_pages); int n_ret = 0; bool scanned_many = false; @@ -817,6 +846,26 @@ static int scan_swap_map_slots(struct swap_info_struct *si, * And we let swap pages go all over an SSD partition. Hugh */ + if (nr_pages > 1) { + /* + * Should not even be attempting large allocations when huge + * page swap is disabled. Warn and fail the allocation. + */ + if (!IS_ENABLED(CONFIG_THP_SWAP) || + nr_pages > SWAPFILE_CLUSTER || + !is_power_of_2(nr_pages)) { + VM_WARN_ON_ONCE(1); + return 0; + } + + /* + * Swapfile is not block device or not using clusters so unable + * to allocate large entries. 
+ */ + if (!(si->flags & SWP_BLKDEV) || !si->cluster_info) + return 0; + } + si->flags += SWP_SCANNING; /* * Use percpu scan base for SSD to reduce lock contention on @@ -831,8 +880,11 @@ static int scan_swap_map_slots(struct swap_info_struct *si, /* SSD algorithm */ if (si->cluster_info) { - if (!scan_swap_map_try_ssd_cluster(si, &offset, &scan_base)) + if (!scan_swap_map_try_ssd_cluster(si, &offset, &scan_base, order)) { + if (order > 0) + goto no_page; goto scan; + } } else if (unlikely(!si->cluster_nr--)) { if (si->pages - si->inuse_pages < SWAPFILE_CLUSTER) { si->cluster_nr = SWAPFILE_CLUSTER - 1; @@ -874,26 +926,30 @@ static int scan_swap_map_slots(struct swap_info_struct *si, checks: if (si->cluster_info) { - while (scan_swap_map_ssd_cluster_conflict(si, offset)) { + while (scan_swap_map_ssd_cluster_conflict(si, offset, order)) { /* take a break if we already got some slots */ if (n_ret) goto done; if (!scan_swap_map_try_ssd_cluster(si, &offset, - &scan_base)) + &scan_base, order)) { + if (order > 0) + goto no_page; goto scan; + } } } if (!(si->flags & SWP_WRITEOK)) goto no_page; if (!si->highest_bit) goto no_page; - if (offset > si->highest_bit) + if (order == 0 && offset > si->highest_bit) scan_base = offset = si->lowest_bit; ci = lock_cluster(si, offset); /* reuse swap entry of cache-only swap if not busy. */ if (vm_swap_full() && si->swap_map[offset] == SWAP_HAS_CACHE) { int swap_was_freed; + VM_WARN_ON(order > 0); unlock_cluster(ci); spin_unlock(&si->lock); swap_was_freed = __try_to_reclaim_swap(si, offset, TTRS_ANYWAY); @@ -905,17 +961,18 @@ static int scan_swap_map_slots(struct swap_info_struct *si, } if (si->swap_map[offset]) { + VM_WARN_ON(order > 0); unlock_cluster(ci); if (!n_ret) goto scan; else goto done; } - WRITE_ONCE(si->swap_map[offset], usage); - inc_cluster_info_page(si, si->cluster_info, offset); + memset(si->swap_map + offset, usage, nr_pages); + add_cluster_info_page(si, si->cluster_info, offset, nr_pages); unlock_cluster(ci); - swap_range_alloc(si, offset, 1); + swap_range_alloc(si, offset, nr_pages); slots[n_ret++] = swp_entry(si->type, offset); /* got enough slots or reach max slots? */ @@ -936,8 +993,10 @@ static int scan_swap_map_slots(struct swap_info_struct *si, /* try to get more slots in cluster */ if (si->cluster_info) { - if (scan_swap_map_try_ssd_cluster(si, &offset, &scan_base)) + if (scan_swap_map_try_ssd_cluster(si, &offset, &scan_base, order)) goto checks; + if (order > 0) + goto done; } else if (si->cluster_nr && !si->swap_map[++offset]) { /* non-ssd case, still more slots in cluster? */ --si->cluster_nr; @@ -964,7 +1023,8 @@ static int scan_swap_map_slots(struct swap_info_struct *si, } done: - set_cluster_next(si, offset + 1); + if (order == 0) + set_cluster_next(si, offset + 1); si->flags -= SWP_SCANNING; return n_ret; @@ -997,38 +1057,6 @@ static int scan_swap_map_slots(struct swap_info_struct *si, return n_ret; } -static int swap_alloc_cluster(struct swap_info_struct *si, swp_entry_t *slot) -{ - unsigned long idx; - struct swap_cluster_info *ci; - unsigned long offset; - - /* - * Should not even be attempting cluster allocations when huge - * page swap is disabled. Warn and fail the allocation. 
- */ - if (!IS_ENABLED(CONFIG_THP_SWAP)) { - VM_WARN_ON_ONCE(1); - return 0; - } - - if (cluster_list_empty(&si->free_clusters)) - return 0; - - idx = cluster_list_first(&si->free_clusters); - offset = idx * SWAPFILE_CLUSTER; - ci = lock_cluster(si, offset); - alloc_cluster(si, idx); - cluster_set_count(ci, SWAPFILE_CLUSTER); - - memset(si->swap_map + offset, SWAP_HAS_CACHE, SWAPFILE_CLUSTER); - unlock_cluster(ci); - swap_range_alloc(si, offset, SWAPFILE_CLUSTER); - *slot = swp_entry(si->type, offset); - - return 1; -} - static void swap_free_cluster(struct swap_info_struct *si, unsigned long idx) { unsigned long offset = idx * SWAPFILE_CLUSTER; @@ -1050,8 +1078,8 @@ int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_size) int n_ret = 0; int node; - /* Only single cluster request supported */ - WARN_ON_ONCE(n_goal > 1 && size == SWAPFILE_CLUSTER); + /* Only single THP request supported */ + WARN_ON_ONCE(n_goal > 1 && size > 1); spin_lock(&swap_avail_lock); @@ -1088,14 +1116,10 @@ int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_size) spin_unlock(&si->lock); goto nextsi; } - if (size == SWAPFILE_CLUSTER) { - if (si->flags & SWP_BLKDEV) - n_ret = swap_alloc_cluster(si, swp_entries); - } else - n_ret = scan_swap_map_slots(si, SWAP_HAS_CACHE, - n_goal, swp_entries); + n_ret = scan_swap_map_slots(si, SWAP_HAS_CACHE, + n_goal, swp_entries, size); spin_unlock(&si->lock); - if (n_ret || size == SWAPFILE_CLUSTER) + if (n_ret || size > 1) goto check_out; cond_resched(); @@ -1647,7 +1671,7 @@ swp_entry_t get_swap_page_of_type(int type) /* This is called for allocating swap entry, not cache */ spin_lock(&si->lock); - if ((si->flags & SWP_WRITEOK) && scan_swap_map_slots(si, 1, 1, &entry)) + if ((si->flags & SWP_WRITEOK) && scan_swap_map_slots(si, 1, 1, &entry, 1)) atomic_long_dec(&nr_swap_pages); spin_unlock(&si->lock); fail: @@ -3101,7 +3125,7 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags) p->flags |= SWP_SYNCHRONOUS_IO; if (p->bdev && bdev_nonrot(p->bdev)) { - int cpu; + int cpu, i; unsigned long ci, nr_cluster; p->flags |= SWP_SOLIDSTATE; @@ -3139,7 +3163,8 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags) struct percpu_cluster *cluster; cluster = per_cpu_ptr(p->percpu_cluster, cpu); - cluster->next = SWAP_NEXT_INVALID; + for (i = 0; i < SWAP_NR_ORDERS; i++) + cluster->next[i] = SWAP_NEXT_INVALID; } } else { atomic_inc(&nr_rotate_swap);
Multi-size THP enables performance improvements by allocating large, pte-mapped folios for anonymous memory. However I've observed that on an arm64 system running a parallel workload (e.g. kernel compilation) across many cores, under high memory pressure, the speed regresses. This is due to bottlenecking on the increased number of TLBIs added due to all the extra folio splitting when the large folios are swapped out. Therefore, solve this regression by adding support for swapping out mTHP without needing to split the folio, just like is already done for PMD-sized THP. This change only applies when CONFIG_THP_SWAP is enabled, and when the swap backing store is a non-rotating block device. These are the same constraints as for the existing PMD-sized THP swap-out support. Note that no attempt is made to swap-in (m)THP here - this is still done page-by-page, like for PMD-sized THP. But swapping-out mTHP is a prerequisite for swapping-in mTHP. The main change here is to improve the swap entry allocator so that it can allocate any power-of-2 number of contiguous entries between [1, (1 << PMD_ORDER)]. This is done by allocating a cluster for each distinct order and allocating sequentially from it until the cluster is full. This ensures that we don't need to search the map and we get no fragmentation due to alignment padding for different orders in the cluster. If there is no current cluster for a given order, we attempt to allocate a free cluster from the list. If there are no free clusters, we fail the allocation and the caller can fall back to splitting the folio and allocating individual entries (as per existing PMD-sized THP fallback). The per-order current clusters are maintained per-cpu using the existing infrastructure. This is done to avoid interleaving pages from different tasks, which would prevent IO being batched. This is already done for the order-0 allocations so we follow the same pattern. As is done for order-0 per-cpu clusters, the scanner now can steal order-0 entries from any per-cpu-per-order reserved cluster. This ensures that when the swap file is getting full, space doesn't get tied up in the per-cpu reserves. This change only modifies swap to be able to accept any order mTHP. It doesn't change the callers to elide doing the actual split. That will be done in separate changes. Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> --- include/linux/swap.h | 8 ++- mm/swapfile.c | 167 +++++++++++++++++++++++++------------------ 2 files changed, 103 insertions(+), 72 deletions(-)
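For a sense of how the reworked allocator gets exercised, the caller side (loosely based on the existing folio_alloc_swap() path; the changes that actually avoid the split land in later patches of this series) would request all entries for a folio in one call, roughly like this sketch:

"
/*
 * Sketch only: with this patch, entry_size may be any power-of-2
 * number of pages up to SWAPFILE_CLUSTER, not just 1 or
 * SWAPFILE_CLUSTER.
 */
swp_entry_t entry = { 0 };

if (!get_swap_pages(1, &entry, folio_nr_pages(folio))) {
	/*
	 * No run of the requested order was available: fall back to
	 * splitting the folio and allocating order-0 entries.
	 */
}
"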