Message ID | 20231017161302.2518826-3-ryan.roberts@arm.com (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | Swap-out small-sized THP without splitting | expand |
Ryan Roberts <ryan.roberts@arm.com> writes: > The upcoming anonymous small-sized THP feature enables performance > improvements by allocating large folios for anonymous memory. However > I've observed that on an arm64 system running a parallel workload (e.g. > kernel compilation) across many cores, under high memory pressure, the > speed regresses. This is due to bottlenecking on the increased number of > TLBIs added due to all the extra folio splitting. > > Therefore, solve this regression by adding support for swapping out > small-sized THP without needing to split the folio, just like is already > done for PMD-sized THP. This change only applies when CONFIG_THP_SWAP is > enabled, and when the swap backing store is a non-rotating block device. > These are the same constraints as for the existing PMD-sized THP > swap-out support. > > Note that no attempt is made to swap-in THP here - this is still done > page-by-page, like for PMD-sized THP. > > The main change here is to improve the swap entry allocator so that it > can allocate any power-of-2 number of contiguous entries between [4, (1 > << PMD_ORDER)] (THP cannot support order-1 folios). This is done by > allocating a cluster for each distinct order and allocating sequentially > from it until the cluster is full. This ensures that we don't need to > search the map and we get no fragmentation due to alignment padding for > different orders in the cluster. If there is no current cluster for a > given order, we attempt to allocate a free cluster from the list. If > there are no free clusters, we fail the allocation and the caller falls > back to splitting the folio and allocates individual entries (as per > existing PMD-sized THP fallback). > > The per-order current clusters are maintained per-cpu using the existing > percpu_cluster infrastructure. This is done to avoid interleving pages > from different tasks, which would prevent IO being batched. This is > already done for the order-0 allocations so we follow the same pattern. > > As far as I can tell, this should not cause any extra fragmentation > concerns, given how similar it is to the existing PMD-sized THP > allocation mechanism. There could be up to (PMD_ORDER-2) * nr_cpus > clusters in concurrent use though, which in a pathalogical case (cluster > set aside for every order for every cpu and only one huge entry > allocated from it) would tie up ~12MiB of unused swap entries for these > high orders (assuming PMD_ORDER=9). In practice, the number of orders in > use will be small and the amount of swap space reserved is very small > compared to a typical swap file. > > Note that PMD_ORDER is not compile-time constant on powerpc, so we have > to allocate the large_next[] array at runtime. > > I've run the tests on Ampere Altra (arm64), set up with a 35G block ram > device as the swap device and from inside a memcg limited to 40G memory. > I've then run `usemem` from vm-scalability with 70 processes (each has > its own core), each allocating and writing 1G of memory. I've repeated > everything 5 times and taken the mean and stdev: > > Mean Performance Improvement vs 4K/baseline > > | alloc size | baseline | + this series | > | | v6.6-rc4+anonfolio | | > |:-----------|--------------------:|--------------------:| > | 4K Page | 0.0% | 1.1% | > | 64K THP | -44.1% | 0.9% | > | 2M THP | 56.0% | 56.4% | > > So with this change, the regression for 64K swap performance goes away. > Both 4K and 64K benhcmarks are now bottlenecked on TLBI performance from > try_to_unmap_flush_dirty(), on arm64 at least. When using fewer cpus in > the test, I see upto 2x performance of 64K THP swapping compared to 4K. > > Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> > --- > include/linux/swap.h | 6 ++++ > mm/swapfile.c | 74 +++++++++++++++++++++++++++++++++++--------- > mm/vmscan.c | 10 +++--- > 3 files changed, 71 insertions(+), 19 deletions(-) > > diff --git a/include/linux/swap.h b/include/linux/swap.h > index a073366a227c..35cbbe6509a9 100644 > --- a/include/linux/swap.h > +++ b/include/linux/swap.h > @@ -268,6 +268,12 @@ struct swap_cluster_info { > struct percpu_cluster { > struct swap_cluster_info index; /* Current cluster index */ > unsigned int next; /* Likely next allocation offset */ > + unsigned int large_next[]; /* > + * next free offset within current > + * allocation cluster for large folios, > + * or UINT_MAX if no current cluster. > + * Index is (order - 1). > + */ > }; > > struct swap_cluster_list { > diff --git a/mm/swapfile.c b/mm/swapfile.c > index b83ad77e04c0..625964e53c22 100644 > --- a/mm/swapfile.c > +++ b/mm/swapfile.c > @@ -987,35 +987,70 @@ static int scan_swap_map_slots(struct swap_info_struct *si, > return n_ret; > } > > -static int swap_alloc_cluster(struct swap_info_struct *si, swp_entry_t *slot) > +static int swap_alloc_large(struct swap_info_struct *si, swp_entry_t *slot, > + unsigned int nr_pages) This looks hacky. IMO, we should put the allocation logic inside percpu_cluster framework. If percpu_cluster framework doesn't work for you, just refactor it firstly. > { > + int order_idx; > unsigned long idx; > struct swap_cluster_info *ci; > + struct percpu_cluster *cluster; > unsigned long offset; > > /* > * Should not even be attempting cluster allocations when huge > * page swap is disabled. Warn and fail the allocation. > */ > - if (!IS_ENABLED(CONFIG_THP_SWAP)) { > + if (!IS_ENABLED(CONFIG_THP_SWAP) || > + nr_pages < 4 || nr_pages > SWAPFILE_CLUSTER || > + !is_power_of_2(nr_pages)) { > VM_WARN_ON_ONCE(1); > return 0; > } > > - if (cluster_list_empty(&si->free_clusters)) > + /* > + * Not using clusters so unable to allocate large entries. > + */ > + if (!si->cluster_info) > return 0; > > - idx = cluster_list_first(&si->free_clusters); > - offset = idx * SWAPFILE_CLUSTER; > - ci = lock_cluster(si, offset); > - alloc_cluster(si, idx); > - cluster_set_count(ci, SWAPFILE_CLUSTER); > + order_idx = ilog2(nr_pages) - 2; > + cluster = this_cpu_ptr(si->percpu_cluster); > + offset = cluster->large_next[order_idx]; > + > + if (offset == UINT_MAX) { > + if (cluster_list_empty(&si->free_clusters)) > + return 0; > + > + idx = cluster_list_first(&si->free_clusters); > + offset = idx * SWAPFILE_CLUSTER; > > - memset(si->swap_map + offset, SWAP_HAS_CACHE, SWAPFILE_CLUSTER); > + ci = lock_cluster(si, offset); > + alloc_cluster(si, idx); > + cluster_set_count(ci, SWAPFILE_CLUSTER); > + > + /* > + * If scan_swap_map_slots() can't find a free cluster, it will > + * check si->swap_map directly. To make sure this standby > + * cluster isn't taken by scan_swap_map_slots(), mark the swap > + * entries bad (occupied). (same approach as discard). > + */ > + memset(si->swap_map + offset + nr_pages, SWAP_MAP_BAD, > + SWAPFILE_CLUSTER - nr_pages); There's an issue with this solution. If the free space of swap device runs low, it's possible that - some cluster are put in the percpu_cluster of some CPUs the swap entries there are marked as used - no free swap entries elsewhere - nr_swap_pages isn't 0 So, we will still scan LRU, but swap allocation fails, although there's still free swap space. I think that we should follow the method we used for the original percpu_cluster. That is, if all free swap entries are in percpu_cluster, we will start to allocate from percpu_cluster. > + } else { > + idx = offset / SWAPFILE_CLUSTER; > + ci = lock_cluster(si, offset); > + } > + > + memset(si->swap_map + offset, SWAP_HAS_CACHE, nr_pages); > unlock_cluster(ci); > - swap_range_alloc(si, offset, SWAPFILE_CLUSTER); > + swap_range_alloc(si, offset, nr_pages); > *slot = swp_entry(si->type, offset); > > + offset += nr_pages; > + if (idx != offset / SWAPFILE_CLUSTER) > + offset = UINT_MAX; > + cluster->large_next[order_idx] = offset; > + > return 1; > } > [snip] -- Best Regards, Huang, Ying
On 18/10/2023 07:55, Huang, Ying wrote: > Ryan Roberts <ryan.roberts@arm.com> writes: > >> The upcoming anonymous small-sized THP feature enables performance >> improvements by allocating large folios for anonymous memory. However >> I've observed that on an arm64 system running a parallel workload (e.g. >> kernel compilation) across many cores, under high memory pressure, the >> speed regresses. This is due to bottlenecking on the increased number of >> TLBIs added due to all the extra folio splitting. >> >> Therefore, solve this regression by adding support for swapping out >> small-sized THP without needing to split the folio, just like is already >> done for PMD-sized THP. This change only applies when CONFIG_THP_SWAP is >> enabled, and when the swap backing store is a non-rotating block device. >> These are the same constraints as for the existing PMD-sized THP >> swap-out support. >> >> Note that no attempt is made to swap-in THP here - this is still done >> page-by-page, like for PMD-sized THP. >> >> The main change here is to improve the swap entry allocator so that it >> can allocate any power-of-2 number of contiguous entries between [4, (1 >> << PMD_ORDER)] (THP cannot support order-1 folios). This is done by >> allocating a cluster for each distinct order and allocating sequentially >> from it until the cluster is full. This ensures that we don't need to >> search the map and we get no fragmentation due to alignment padding for >> different orders in the cluster. If there is no current cluster for a >> given order, we attempt to allocate a free cluster from the list. If >> there are no free clusters, we fail the allocation and the caller falls >> back to splitting the folio and allocates individual entries (as per >> existing PMD-sized THP fallback). >> >> The per-order current clusters are maintained per-cpu using the existing >> percpu_cluster infrastructure. This is done to avoid interleving pages >> from different tasks, which would prevent IO being batched. This is >> already done for the order-0 allocations so we follow the same pattern. >> >> As far as I can tell, this should not cause any extra fragmentation >> concerns, given how similar it is to the existing PMD-sized THP >> allocation mechanism. There could be up to (PMD_ORDER-2) * nr_cpus >> clusters in concurrent use though, which in a pathalogical case (cluster >> set aside for every order for every cpu and only one huge entry >> allocated from it) would tie up ~12MiB of unused swap entries for these >> high orders (assuming PMD_ORDER=9). In practice, the number of orders in >> use will be small and the amount of swap space reserved is very small >> compared to a typical swap file. >> >> Note that PMD_ORDER is not compile-time constant on powerpc, so we have >> to allocate the large_next[] array at runtime. >> >> I've run the tests on Ampere Altra (arm64), set up with a 35G block ram >> device as the swap device and from inside a memcg limited to 40G memory. >> I've then run `usemem` from vm-scalability with 70 processes (each has >> its own core), each allocating and writing 1G of memory. I've repeated >> everything 5 times and taken the mean and stdev: >> >> Mean Performance Improvement vs 4K/baseline >> >> | alloc size | baseline | + this series | >> | | v6.6-rc4+anonfolio | | >> |:-----------|--------------------:|--------------------:| >> | 4K Page | 0.0% | 1.1% | >> | 64K THP | -44.1% | 0.9% | >> | 2M THP | 56.0% | 56.4% | >> >> So with this change, the regression for 64K swap performance goes away. >> Both 4K and 64K benhcmarks are now bottlenecked on TLBI performance from >> try_to_unmap_flush_dirty(), on arm64 at least. When using fewer cpus in >> the test, I see upto 2x performance of 64K THP swapping compared to 4K. >> >> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> >> --- >> include/linux/swap.h | 6 ++++ >> mm/swapfile.c | 74 +++++++++++++++++++++++++++++++++++--------- >> mm/vmscan.c | 10 +++--- >> 3 files changed, 71 insertions(+), 19 deletions(-) >> >> diff --git a/include/linux/swap.h b/include/linux/swap.h >> index a073366a227c..35cbbe6509a9 100644 >> --- a/include/linux/swap.h >> +++ b/include/linux/swap.h >> @@ -268,6 +268,12 @@ struct swap_cluster_info { >> struct percpu_cluster { >> struct swap_cluster_info index; /* Current cluster index */ >> unsigned int next; /* Likely next allocation offset */ >> + unsigned int large_next[]; /* >> + * next free offset within current >> + * allocation cluster for large folios, >> + * or UINT_MAX if no current cluster. >> + * Index is (order - 1). >> + */ >> }; >> >> struct swap_cluster_list { >> diff --git a/mm/swapfile.c b/mm/swapfile.c >> index b83ad77e04c0..625964e53c22 100644 >> --- a/mm/swapfile.c >> +++ b/mm/swapfile.c >> @@ -987,35 +987,70 @@ static int scan_swap_map_slots(struct swap_info_struct *si, >> return n_ret; >> } >> >> -static int swap_alloc_cluster(struct swap_info_struct *si, swp_entry_t *slot) >> +static int swap_alloc_large(struct swap_info_struct *si, swp_entry_t *slot, >> + unsigned int nr_pages) > > This looks hacky. IMO, we should put the allocation logic inside > percpu_cluster framework. If percpu_cluster framework doesn't work for > you, just refactor it firstly. I'm not sure I really understand what you are suggesting - could you elaborate? What "framework"? I only see a per-cpu data structure and scan_swap_map_try_ssd_cluster(), which is very much geared towards order-0 allocations. Are you suggesting you want to allocate large entries (> order-0) from the same cluster that is used for small (order-0) entries? The problem with this approach is that there may not be enough space left in the current cluster for the large entry that you are trying to allocate. Then you would need to take a new cluster from the free list and presumably leave the previous cluster with unused entries (which will now only be accessible by scanning). That unused space could be considerable. That's why I am currently reserving a "current cluster" per order; that way, all allocations are the same order, they are all naturally aligned and there is no waste. Perhaps I could implement "stealing" between cpus so that a cpu trying to allocate a specific order, but which doesn't have a current cluster for that order and the free list is empty, could allocate from another cpu's current cluster? I don't think it's a good idea to mix different orders in the same cluster though. I guess if really low, I could remove a current cluster from a cpu and allow it to be scanned for order-0 allocations at least? Any opinions gratefully received! > >> { >> + int order_idx; >> unsigned long idx; >> struct swap_cluster_info *ci; >> + struct percpu_cluster *cluster; >> unsigned long offset; >> >> /* >> * Should not even be attempting cluster allocations when huge >> * page swap is disabled. Warn and fail the allocation. >> */ >> - if (!IS_ENABLED(CONFIG_THP_SWAP)) { >> + if (!IS_ENABLED(CONFIG_THP_SWAP) || >> + nr_pages < 4 || nr_pages > SWAPFILE_CLUSTER || >> + !is_power_of_2(nr_pages)) { >> VM_WARN_ON_ONCE(1); >> return 0; >> } >> >> - if (cluster_list_empty(&si->free_clusters)) >> + /* >> + * Not using clusters so unable to allocate large entries. >> + */ >> + if (!si->cluster_info) >> return 0; >> >> - idx = cluster_list_first(&si->free_clusters); >> - offset = idx * SWAPFILE_CLUSTER; >> - ci = lock_cluster(si, offset); >> - alloc_cluster(si, idx); >> - cluster_set_count(ci, SWAPFILE_CLUSTER); >> + order_idx = ilog2(nr_pages) - 2; >> + cluster = this_cpu_ptr(si->percpu_cluster); >> + offset = cluster->large_next[order_idx]; >> + >> + if (offset == UINT_MAX) { >> + if (cluster_list_empty(&si->free_clusters)) >> + return 0; >> + >> + idx = cluster_list_first(&si->free_clusters); >> + offset = idx * SWAPFILE_CLUSTER; >> >> - memset(si->swap_map + offset, SWAP_HAS_CACHE, SWAPFILE_CLUSTER); >> + ci = lock_cluster(si, offset); >> + alloc_cluster(si, idx); >> + cluster_set_count(ci, SWAPFILE_CLUSTER); >> + >> + /* >> + * If scan_swap_map_slots() can't find a free cluster, it will >> + * check si->swap_map directly. To make sure this standby >> + * cluster isn't taken by scan_swap_map_slots(), mark the swap >> + * entries bad (occupied). (same approach as discard). >> + */ >> + memset(si->swap_map + offset + nr_pages, SWAP_MAP_BAD, >> + SWAPFILE_CLUSTER - nr_pages); > > There's an issue with this solution. If the free space of swap device > runs low, it's possible that > > - some cluster are put in the percpu_cluster of some CPUs > the swap entries there are marked as used > > - no free swap entries elsewhere > > - nr_swap_pages isn't 0 > > So, we will still scan LRU, but swap allocation fails, although there's > still free swap space. Ahh yes, good spot. > > I think that we should follow the method we used for the original > percpu_cluster. That is, if all free swap entries are in > percpu_cluster, we will start to allocate from percpu_cluster. As i suggested above, I think I could do this by removing a cpu's current cluster for a given order from the percpu_cluster and making it generally available for scanning. Does that work for you? > >> + } else { >> + idx = offset / SWAPFILE_CLUSTER; >> + ci = lock_cluster(si, offset); >> + } >> + >> + memset(si->swap_map + offset, SWAP_HAS_CACHE, nr_pages); >> unlock_cluster(ci); >> - swap_range_alloc(si, offset, SWAPFILE_CLUSTER); >> + swap_range_alloc(si, offset, nr_pages); >> *slot = swp_entry(si->type, offset); >> >> + offset += nr_pages; >> + if (idx != offset / SWAPFILE_CLUSTER) >> + offset = UINT_MAX; >> + cluster->large_next[order_idx] = offset; >> + >> return 1; >> } >> > > [snip] > > -- > Best Regards, > Huang, Ying
Ryan Roberts <ryan.roberts@arm.com> writes: > On 18/10/2023 07:55, Huang, Ying wrote: >> Ryan Roberts <ryan.roberts@arm.com> writes: >> [snip] >>> diff --git a/include/linux/swap.h b/include/linux/swap.h >>> index a073366a227c..35cbbe6509a9 100644 >>> --- a/include/linux/swap.h >>> +++ b/include/linux/swap.h >>> @@ -268,6 +268,12 @@ struct swap_cluster_info { >>> struct percpu_cluster { >>> struct swap_cluster_info index; /* Current cluster index */ >>> unsigned int next; /* Likely next allocation offset */ >>> + unsigned int large_next[]; /* >>> + * next free offset within current >>> + * allocation cluster for large folios, >>> + * or UINT_MAX if no current cluster. >>> + * Index is (order - 1). >>> + */ >>> }; >>> >>> struct swap_cluster_list { >>> diff --git a/mm/swapfile.c b/mm/swapfile.c >>> index b83ad77e04c0..625964e53c22 100644 >>> --- a/mm/swapfile.c >>> +++ b/mm/swapfile.c >>> @@ -987,35 +987,70 @@ static int scan_swap_map_slots(struct swap_info_struct *si, >>> return n_ret; >>> } >>> >>> -static int swap_alloc_cluster(struct swap_info_struct *si, swp_entry_t *slot) >>> +static int swap_alloc_large(struct swap_info_struct *si, swp_entry_t *slot, >>> + unsigned int nr_pages) >> >> This looks hacky. IMO, we should put the allocation logic inside >> percpu_cluster framework. If percpu_cluster framework doesn't work for >> you, just refactor it firstly. > > I'm not sure I really understand what you are suggesting - could you elaborate? > What "framework"? I only see a per-cpu data structure and > scan_swap_map_try_ssd_cluster(), which is very much geared towards order-0 > allocations. I suggest to share as much code as possible between order-0 and order > 0 swap entry allocation. I think that we can make scan_swap_map_try_ssd_cluster() works for order > 0 swap entry allocation. > Are you suggesting you want to allocate large entries (> order-0) from the same > cluster that is used for small (order-0) entries? The problem with this approach > is that there may not be enough space left in the current cluster for the large > entry that you are trying to allocate. Then you would need to take a new cluster > from the free list and presumably leave the previous cluster with unused entries > (which will now only be accessible by scanning). That unused space could be > considerable. > > That's why I am currently reserving a "current cluster" per order; that way, all > allocations are the same order, they are all naturally aligned and there is no > waste. I am fine to use one swap cluster per order per CPU. I just think that we should share code. > Perhaps I could implement "stealing" between cpus so that a cpu trying to > allocate a specific order, but which doesn't have a current cluster for that > order and the free list is empty, could allocate from another cpu's current > cluster? I don't think it's a good idea to mix different orders in the same > cluster though. I think we can start from a simple solution, that is, just fall back to split the large folio. Then, we can optimize it step by step. > I guess if really low, I could remove a current cluster from a cpu and allow it > to be scanned for order-0 allocations at least? Better to have same behavior between order- and order > 0. Perhaps begin with the current solution (allow swap entries in per-CPU cluster to be stolen). Then we can optimize based on this. Not directly related to this patchset. Maybe we can replace swap_slots_cache with per-CPU cluster in the future. This will reduce the code complexity. > Any opinions gratefully received! Thanks! >> >>> { >>> + int order_idx; >>> unsigned long idx; >>> struct swap_cluster_info *ci; >>> + struct percpu_cluster *cluster; >>> unsigned long offset; >>> >>> /* >>> * Should not even be attempting cluster allocations when huge >>> * page swap is disabled. Warn and fail the allocation. >>> */ >>> - if (!IS_ENABLED(CONFIG_THP_SWAP)) { >>> + if (!IS_ENABLED(CONFIG_THP_SWAP) || >>> + nr_pages < 4 || nr_pages > SWAPFILE_CLUSTER || >>> + !is_power_of_2(nr_pages)) { >>> VM_WARN_ON_ONCE(1); >>> return 0; >>> } >>> >>> - if (cluster_list_empty(&si->free_clusters)) >>> + /* >>> + * Not using clusters so unable to allocate large entries. >>> + */ >>> + if (!si->cluster_info) >>> return 0; >>> >>> - idx = cluster_list_first(&si->free_clusters); >>> - offset = idx * SWAPFILE_CLUSTER; >>> - ci = lock_cluster(si, offset); >>> - alloc_cluster(si, idx); >>> - cluster_set_count(ci, SWAPFILE_CLUSTER); >>> + order_idx = ilog2(nr_pages) - 2; >>> + cluster = this_cpu_ptr(si->percpu_cluster); >>> + offset = cluster->large_next[order_idx]; >>> + >>> + if (offset == UINT_MAX) { >>> + if (cluster_list_empty(&si->free_clusters)) >>> + return 0; >>> + >>> + idx = cluster_list_first(&si->free_clusters); >>> + offset = idx * SWAPFILE_CLUSTER; >>> >>> - memset(si->swap_map + offset, SWAP_HAS_CACHE, SWAPFILE_CLUSTER); >>> + ci = lock_cluster(si, offset); >>> + alloc_cluster(si, idx); >>> + cluster_set_count(ci, SWAPFILE_CLUSTER); >>> + >>> + /* >>> + * If scan_swap_map_slots() can't find a free cluster, it will >>> + * check si->swap_map directly. To make sure this standby >>> + * cluster isn't taken by scan_swap_map_slots(), mark the swap >>> + * entries bad (occupied). (same approach as discard). >>> + */ >>> + memset(si->swap_map + offset + nr_pages, SWAP_MAP_BAD, >>> + SWAPFILE_CLUSTER - nr_pages); >> >> There's an issue with this solution. If the free space of swap device >> runs low, it's possible that >> >> - some cluster are put in the percpu_cluster of some CPUs >> the swap entries there are marked as used >> >> - no free swap entries elsewhere >> >> - nr_swap_pages isn't 0 >> >> So, we will still scan LRU, but swap allocation fails, although there's >> still free swap space. > > Ahh yes, good spot. > >> >> I think that we should follow the method we used for the original >> percpu_cluster. That is, if all free swap entries are in >> percpu_cluster, we will start to allocate from percpu_cluster. > > As i suggested above, I think I could do this by removing a cpu's current > cluster for a given order from the percpu_cluster and making it generally > available for scanning. Does that work for you? replied above. >> >>> + } else { >>> + idx = offset / SWAPFILE_CLUSTER; >>> + ci = lock_cluster(si, offset); >>> + } >>> + >>> + memset(si->swap_map + offset, SWAP_HAS_CACHE, nr_pages); >>> unlock_cluster(ci); >>> - swap_range_alloc(si, offset, SWAPFILE_CLUSTER); >>> + swap_range_alloc(si, offset, nr_pages); >>> *slot = swp_entry(si->type, offset); >>> >>> + offset += nr_pages; >>> + if (idx != offset / SWAPFILE_CLUSTER) >>> + offset = UINT_MAX; >>> + cluster->large_next[order_idx] = offset; >>> + >>> return 1; >>> } >>> >> >> [snip] -- Best Regards, Huang, Ying
On 19/10/2023 06:49, Huang, Ying wrote: > Ryan Roberts <ryan.roberts@arm.com> writes: > >> On 18/10/2023 07:55, Huang, Ying wrote: >>> Ryan Roberts <ryan.roberts@arm.com> writes: >>> > > [snip] > >>>> diff --git a/include/linux/swap.h b/include/linux/swap.h >>>> index a073366a227c..35cbbe6509a9 100644 >>>> --- a/include/linux/swap.h >>>> +++ b/include/linux/swap.h >>>> @@ -268,6 +268,12 @@ struct swap_cluster_info { >>>> struct percpu_cluster { >>>> struct swap_cluster_info index; /* Current cluster index */ >>>> unsigned int next; /* Likely next allocation offset */ >>>> + unsigned int large_next[]; /* >>>> + * next free offset within current >>>> + * allocation cluster for large folios, >>>> + * or UINT_MAX if no current cluster. >>>> + * Index is (order - 1). >>>> + */ >>>> }; >>>> >>>> struct swap_cluster_list { >>>> diff --git a/mm/swapfile.c b/mm/swapfile.c >>>> index b83ad77e04c0..625964e53c22 100644 >>>> --- a/mm/swapfile.c >>>> +++ b/mm/swapfile.c >>>> @@ -987,35 +987,70 @@ static int scan_swap_map_slots(struct swap_info_struct *si, >>>> return n_ret; >>>> } >>>> >>>> -static int swap_alloc_cluster(struct swap_info_struct *si, swp_entry_t *slot) >>>> +static int swap_alloc_large(struct swap_info_struct *si, swp_entry_t *slot, >>>> + unsigned int nr_pages) >>> >>> This looks hacky. IMO, we should put the allocation logic inside >>> percpu_cluster framework. If percpu_cluster framework doesn't work for >>> you, just refactor it firstly. >> >> I'm not sure I really understand what you are suggesting - could you elaborate? >> What "framework"? I only see a per-cpu data structure and >> scan_swap_map_try_ssd_cluster(), which is very much geared towards order-0 >> allocations. > > I suggest to share as much code as possible between order-0 and order > > 0 swap entry allocation. I think that we can make > scan_swap_map_try_ssd_cluster() works for order > 0 swap entry allocation. > [...] >>>> + /* >>>> + * If scan_swap_map_slots() can't find a free cluster, it will >>>> + * check si->swap_map directly. To make sure this standby >>>> + * cluster isn't taken by scan_swap_map_slots(), mark the swap >>>> + * entries bad (occupied). (same approach as discard). >>>> + */ >>>> + memset(si->swap_map + offset + nr_pages, SWAP_MAP_BAD, >>>> + SWAPFILE_CLUSTER - nr_pages); >>> >>> There's an issue with this solution. If the free space of swap device >>> runs low, it's possible that >>> >>> - some cluster are put in the percpu_cluster of some CPUs >>> the swap entries there are marked as used >>> >>> - no free swap entries elsewhere >>> >>> - nr_swap_pages isn't 0 >>> >>> So, we will still scan LRU, but swap allocation fails, although there's >>> still free swap space. I'd like to decide how best to solve this problem before I can figure out how much code sharing I can do for the order-0 vs order > 0 allocators. I see a couple of potential options: 1) Manipulate nr_swap_pages to include the entries that are marked SWAP_MAP_BAD, so when reserving a new per-order/per-cpu cluster, subtract SWAPFILE_CLUSTER, and then add nr_pages for each allocation from that cluster. 2) Don't mark the entries in the reserved cluster as SWAP_MAP_BAD, which would allow the scanner to steal (order-0) entries. The scanner could set a flag in the cluster info to mark it as having been allocated from by the scanner, so the next attempt to allocate a high order from it would cause discarding it as the cpu's current cluster and trying to get a fresh cluster from the free list. While option 2 is a bit more complex, I prefer it as a solution. What do you think?
Ryan Roberts <ryan.roberts@arm.com> writes: > On 19/10/2023 06:49, Huang, Ying wrote: >> Ryan Roberts <ryan.roberts@arm.com> writes: >> >>> On 18/10/2023 07:55, Huang, Ying wrote: >>>> Ryan Roberts <ryan.roberts@arm.com> writes: >>>> >> >> [snip] >> >>>>> diff --git a/include/linux/swap.h b/include/linux/swap.h >>>>> index a073366a227c..35cbbe6509a9 100644 >>>>> --- a/include/linux/swap.h >>>>> +++ b/include/linux/swap.h >>>>> @@ -268,6 +268,12 @@ struct swap_cluster_info { >>>>> struct percpu_cluster { >>>>> struct swap_cluster_info index; /* Current cluster index */ >>>>> unsigned int next; /* Likely next allocation offset */ >>>>> + unsigned int large_next[]; /* >>>>> + * next free offset within current >>>>> + * allocation cluster for large folios, >>>>> + * or UINT_MAX if no current cluster. >>>>> + * Index is (order - 1). >>>>> + */ >>>>> }; >>>>> >>>>> struct swap_cluster_list { >>>>> diff --git a/mm/swapfile.c b/mm/swapfile.c >>>>> index b83ad77e04c0..625964e53c22 100644 >>>>> --- a/mm/swapfile.c >>>>> +++ b/mm/swapfile.c >>>>> @@ -987,35 +987,70 @@ static int scan_swap_map_slots(struct swap_info_struct *si, >>>>> return n_ret; >>>>> } >>>>> >>>>> -static int swap_alloc_cluster(struct swap_info_struct *si, swp_entry_t *slot) >>>>> +static int swap_alloc_large(struct swap_info_struct *si, swp_entry_t *slot, >>>>> + unsigned int nr_pages) >>>> >>>> This looks hacky. IMO, we should put the allocation logic inside >>>> percpu_cluster framework. If percpu_cluster framework doesn't work for >>>> you, just refactor it firstly. >>> >>> I'm not sure I really understand what you are suggesting - could you elaborate? >>> What "framework"? I only see a per-cpu data structure and >>> scan_swap_map_try_ssd_cluster(), which is very much geared towards order-0 >>> allocations. >> >> I suggest to share as much code as possible between order-0 and order > >> 0 swap entry allocation. I think that we can make >> scan_swap_map_try_ssd_cluster() works for order > 0 swap entry allocation. >> > > [...] > >>>>> + /* >>>>> + * If scan_swap_map_slots() can't find a free cluster, it will >>>>> + * check si->swap_map directly. To make sure this standby >>>>> + * cluster isn't taken by scan_swap_map_slots(), mark the swap >>>>> + * entries bad (occupied). (same approach as discard). >>>>> + */ >>>>> + memset(si->swap_map + offset + nr_pages, SWAP_MAP_BAD, >>>>> + SWAPFILE_CLUSTER - nr_pages); >>>> >>>> There's an issue with this solution. If the free space of swap device >>>> runs low, it's possible that >>>> >>>> - some cluster are put in the percpu_cluster of some CPUs >>>> the swap entries there are marked as used >>>> >>>> - no free swap entries elsewhere >>>> >>>> - nr_swap_pages isn't 0 >>>> >>>> So, we will still scan LRU, but swap allocation fails, although there's >>>> still free swap space. > > I'd like to decide how best to solve this problem before I can figure out how > much code sharing I can do for the order-0 vs order > 0 allocators. > > I see a couple of potential options: > > 1) Manipulate nr_swap_pages to include the entries that are marked SWAP_MAP_BAD, > so when reserving a new per-order/per-cpu cluster, subtract SWAPFILE_CLUSTER, > and then add nr_pages for each allocation from that cluster. > > 2) Don't mark the entries in the reserved cluster as SWAP_MAP_BAD, which would > allow the scanner to steal (order-0) entries. The scanner could set a flag in > the cluster info to mark it as having been allocated from by the scanner, so the > next attempt to allocate a high order from it would cause discarding it as the > cpu's current cluster and trying to get a fresh cluster from the free list. > > While option 2 is a bit more complex, I prefer it as a solution. What do you think? I think that this is a good choice to start with. We may build more optimization on top of it if necessary. -- Best Regards, Huang, Ying
diff --git a/include/linux/swap.h b/include/linux/swap.h index a073366a227c..35cbbe6509a9 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -268,6 +268,12 @@ struct swap_cluster_info { struct percpu_cluster { struct swap_cluster_info index; /* Current cluster index */ unsigned int next; /* Likely next allocation offset */ + unsigned int large_next[]; /* + * next free offset within current + * allocation cluster for large folios, + * or UINT_MAX if no current cluster. + * Index is (order - 1). + */ }; struct swap_cluster_list { diff --git a/mm/swapfile.c b/mm/swapfile.c index b83ad77e04c0..625964e53c22 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -987,35 +987,70 @@ static int scan_swap_map_slots(struct swap_info_struct *si, return n_ret; } -static int swap_alloc_cluster(struct swap_info_struct *si, swp_entry_t *slot) +static int swap_alloc_large(struct swap_info_struct *si, swp_entry_t *slot, + unsigned int nr_pages) { + int order_idx; unsigned long idx; struct swap_cluster_info *ci; + struct percpu_cluster *cluster; unsigned long offset; /* * Should not even be attempting cluster allocations when huge * page swap is disabled. Warn and fail the allocation. */ - if (!IS_ENABLED(CONFIG_THP_SWAP)) { + if (!IS_ENABLED(CONFIG_THP_SWAP) || + nr_pages < 4 || nr_pages > SWAPFILE_CLUSTER || + !is_power_of_2(nr_pages)) { VM_WARN_ON_ONCE(1); return 0; } - if (cluster_list_empty(&si->free_clusters)) + /* + * Not using clusters so unable to allocate large entries. + */ + if (!si->cluster_info) return 0; - idx = cluster_list_first(&si->free_clusters); - offset = idx * SWAPFILE_CLUSTER; - ci = lock_cluster(si, offset); - alloc_cluster(si, idx); - cluster_set_count(ci, SWAPFILE_CLUSTER); + order_idx = ilog2(nr_pages) - 2; + cluster = this_cpu_ptr(si->percpu_cluster); + offset = cluster->large_next[order_idx]; + + if (offset == UINT_MAX) { + if (cluster_list_empty(&si->free_clusters)) + return 0; + + idx = cluster_list_first(&si->free_clusters); + offset = idx * SWAPFILE_CLUSTER; - memset(si->swap_map + offset, SWAP_HAS_CACHE, SWAPFILE_CLUSTER); + ci = lock_cluster(si, offset); + alloc_cluster(si, idx); + cluster_set_count(ci, SWAPFILE_CLUSTER); + + /* + * If scan_swap_map_slots() can't find a free cluster, it will + * check si->swap_map directly. To make sure this standby + * cluster isn't taken by scan_swap_map_slots(), mark the swap + * entries bad (occupied). (same approach as discard). + */ + memset(si->swap_map + offset + nr_pages, SWAP_MAP_BAD, + SWAPFILE_CLUSTER - nr_pages); + } else { + idx = offset / SWAPFILE_CLUSTER; + ci = lock_cluster(si, offset); + } + + memset(si->swap_map + offset, SWAP_HAS_CACHE, nr_pages); unlock_cluster(ci); - swap_range_alloc(si, offset, SWAPFILE_CLUSTER); + swap_range_alloc(si, offset, nr_pages); *slot = swp_entry(si->type, offset); + offset += nr_pages; + if (idx != offset / SWAPFILE_CLUSTER) + offset = UINT_MAX; + cluster->large_next[order_idx] = offset; + return 1; } @@ -1041,7 +1076,7 @@ int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_size) int node; /* Only single cluster request supported */ - WARN_ON_ONCE(n_goal > 1 && size == SWAPFILE_CLUSTER); + WARN_ON_ONCE(n_goal > 1 && size > 1); spin_lock(&swap_avail_lock); @@ -1078,14 +1113,14 @@ int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_size) spin_unlock(&si->lock); goto nextsi; } - if (size == SWAPFILE_CLUSTER) { + if (size > 1) { if (si->flags & SWP_BLKDEV) - n_ret = swap_alloc_cluster(si, swp_entries); + n_ret = swap_alloc_large(si, swp_entries, size); } else n_ret = scan_swap_map_slots(si, SWAP_HAS_CACHE, n_goal, swp_entries); spin_unlock(&si->lock); - if (n_ret || size == SWAPFILE_CLUSTER) + if (n_ret || size > 1) goto check_out; cond_resched(); @@ -3046,6 +3081,8 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags) if (p->bdev && bdev_nonrot(p->bdev)) { int cpu; unsigned long ci, nr_cluster; + int nr_order; + int i; p->flags |= SWP_SOLIDSTATE; p->cluster_next_cpu = alloc_percpu(unsigned int); @@ -3073,7 +3110,12 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags) for (ci = 0; ci < nr_cluster; ci++) spin_lock_init(&((cluster_info + ci)->lock)); - p->percpu_cluster = alloc_percpu(struct percpu_cluster); + nr_order = IS_ENABLED(CONFIG_THP_SWAP) ? PMD_ORDER - 1 : 0; + p->percpu_cluster = __alloc_percpu( + struct_size(p->percpu_cluster, + large_next, + nr_order), + __alignof__(struct percpu_cluster)); if (!p->percpu_cluster) { error = -ENOMEM; goto bad_swap_unlock_inode; @@ -3082,6 +3124,8 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags) struct percpu_cluster *cluster; cluster = per_cpu_ptr(p->percpu_cluster, cpu); cluster_set_null(&cluster->index); + for (i = 0; i < nr_order; i++) + cluster->large_next[i] = UINT_MAX; } } else { atomic_inc(&nr_rotate_swap); diff --git a/mm/vmscan.c b/mm/vmscan.c index c16e2b1ea8ae..5984d2ae4547 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -1212,11 +1212,13 @@ static unsigned int shrink_folio_list(struct list_head *folio_list, if (!can_split_folio(folio, NULL)) goto activate_locked; /* - * Split folios without a PMD map right - * away. Chances are some or all of the - * tail pages can be freed without IO. + * Split PMD-mappable folios without a + * PMD map right away. Chances are some + * or all of the tail pages can be freed + * without IO. */ - if (!folio_entire_mapcount(folio) && + if (folio_test_pmd_mappable(folio) && + !folio_entire_mapcount(folio) && split_folio_to_list(folio, folio_list)) goto activate_locked;
The upcoming anonymous small-sized THP feature enables performance improvements by allocating large folios for anonymous memory. However I've observed that on an arm64 system running a parallel workload (e.g. kernel compilation) across many cores, under high memory pressure, the speed regresses. This is due to bottlenecking on the increased number of TLBIs added due to all the extra folio splitting. Therefore, solve this regression by adding support for swapping out small-sized THP without needing to split the folio, just like is already done for PMD-sized THP. This change only applies when CONFIG_THP_SWAP is enabled, and when the swap backing store is a non-rotating block device. These are the same constraints as for the existing PMD-sized THP swap-out support. Note that no attempt is made to swap-in THP here - this is still done page-by-page, like for PMD-sized THP. The main change here is to improve the swap entry allocator so that it can allocate any power-of-2 number of contiguous entries between [4, (1 << PMD_ORDER)] (THP cannot support order-1 folios). This is done by allocating a cluster for each distinct order and allocating sequentially from it until the cluster is full. This ensures that we don't need to search the map and we get no fragmentation due to alignment padding for different orders in the cluster. If there is no current cluster for a given order, we attempt to allocate a free cluster from the list. If there are no free clusters, we fail the allocation and the caller falls back to splitting the folio and allocates individual entries (as per existing PMD-sized THP fallback). The per-order current clusters are maintained per-cpu using the existing percpu_cluster infrastructure. This is done to avoid interleving pages from different tasks, which would prevent IO being batched. This is already done for the order-0 allocations so we follow the same pattern. As far as I can tell, this should not cause any extra fragmentation concerns, given how similar it is to the existing PMD-sized THP allocation mechanism. There could be up to (PMD_ORDER-2) * nr_cpus clusters in concurrent use though, which in a pathalogical case (cluster set aside for every order for every cpu and only one huge entry allocated from it) would tie up ~12MiB of unused swap entries for these high orders (assuming PMD_ORDER=9). In practice, the number of orders in use will be small and the amount of swap space reserved is very small compared to a typical swap file. Note that PMD_ORDER is not compile-time constant on powerpc, so we have to allocate the large_next[] array at runtime. I've run the tests on Ampere Altra (arm64), set up with a 35G block ram device as the swap device and from inside a memcg limited to 40G memory. I've then run `usemem` from vm-scalability with 70 processes (each has its own core), each allocating and writing 1G of memory. I've repeated everything 5 times and taken the mean and stdev: Mean Performance Improvement vs 4K/baseline | alloc size | baseline | + this series | | | v6.6-rc4+anonfolio | | |:-----------|--------------------:|--------------------:| | 4K Page | 0.0% | 1.1% | | 64K THP | -44.1% | 0.9% | | 2M THP | 56.0% | 56.4% | So with this change, the regression for 64K swap performance goes away. Both 4K and 64K benhcmarks are now bottlenecked on TLBI performance from try_to_unmap_flush_dirty(), on arm64 at least. When using fewer cpus in the test, I see upto 2x performance of 64K THP swapping compared to 4K. Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> --- include/linux/swap.h | 6 ++++ mm/swapfile.c | 74 +++++++++++++++++++++++++++++++++++--------- mm/vmscan.c | 10 +++--- 3 files changed, 71 insertions(+), 19 deletions(-) -- 2.25.1