[RFC,v1,6/7] mm: do_swap_page() calls swapin_readahead() zswap load batching interface.

Message ID 20241018064805.336490-7-kanchana.p.sridhar@intel.com (mailing list archive)
State New
Series zswap IAA decompress batching

Commit Message

Sridhar, Kanchana P Oct. 18, 2024, 6:48 a.m. UTC
This patch invokes the swapin_readahead() based batching interface to
prefetch a batch of 4K folios for zswap load with batch decompressions
in parallel using IAA hardware. swapin_readahead() prefetches folios based
on vm.page-cluster and the usefulness of prior prefetches to the
workload. As folios are created in the swapcache and the readahead code
calls swap_read_folio() with a "zswap_batch" and a "non_zswap_batch", the
respective folio_batches get populated with the folios to be read.

Finally, the swapin_readahead() procedures will call the newly added
process_ra_batch_of_same_type() which:

 1) Reads all the non_zswap_batch folios sequentially by calling
    swap_read_folio().
 2) Calls swap_read_zswap_batch_unplug() with the zswap_batch which calls
    zswap_finish_load_batch() that finally decompresses each
    SWAP_CRYPTO_SUB_BATCH_SIZE sub-batch (i.e. up to 8 pages in a prefetch
    batch of, say, 32 folios) in parallel with IAA.
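
The sub-batching arithmetic in step 2 can be sketched in userspace C as
follows. This is an illustrative model only, not code from the series: the
value 8 is taken from the example above, and the names SUB_BATCH_SIZE,
nr_sub_batches() and sub_batch_len() are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed value: the commit message mentions up to 8 pages per sub-batch. */
#define SUB_BATCH_SIZE 8

/* Number of sub-batches needed to cover a prefetch batch of nr_folios. */
size_t nr_sub_batches(size_t nr_folios)
{
	return (nr_folios + SUB_BATCH_SIZE - 1) / SUB_BATCH_SIZE;
}

/* Number of folios in sub-batch idx (0-based); the last one may be short. */
size_t sub_batch_len(size_t nr_folios, size_t idx)
{
	size_t start = idx * SUB_BATCH_SIZE;
	size_t remaining;

	/* The caller must pass an index that falls inside the batch. */
	assert(start < nr_folios);

	remaining = nr_folios - start;
	return remaining < SUB_BATCH_SIZE ? remaining : SUB_BATCH_SIZE;
}
```

With a 32-folio prefetch batch this gives four sub-batches of 8; a 10-folio
batch gives one sub-batch of 8 followed by one of 2.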

Within do_swap_page(), we try to benefit from batch decompressions in both
these scenarios:

 1) single-mapped, SWP_SYNCHRONOUS_IO:
      We call swapin_readahead() with "single_mapped_path = true". This is
      done only in the !zswap_never_enabled() case.
 2) Shared and/or non-SWP_SYNCHRONOUS_IO folios:
      We call swapin_readahead() with "single_mapped_path = false".

This will place folios in the swapcache: a design choice that handles cases
where a folio that is "single-mapped" in process 1 could be prefetched in
process 2; and handles highly contended server scenarios with stability.
There are checks added at the end of do_swap_page(), after the folio has
been successfully loaded, to detect if the single-mapped swapcache folio is
still single-mapped, and if so, folio_free_swap() is called on the folio.
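
The end-of-fault check described above can be modeled in userspace roughly
as below. The three conditions mirror should_free_singlemap_swapcache() in
this patch; struct folio_model and its fields are hypothetical stand-ins for
folio_test_swapcache(), __swap_count() and folio_ref_count(), and the limit
of 3 is the value the patch comment gives for folios allocated in
__read_swap_cache_async().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-in for the state the kernel check reads. */
struct folio_model {
	bool in_swapcache;	/* models folio_test_swapcache(folio) */
	int swap_count;		/* models __swap_count(entry) */
	int ref_count;		/* models folio_ref_count(folio) */
};

/*
 * Per the patch comment: folio_alloc_mpol(), add_to_swap_cache() and
 * folio_add_lru() each take one reference, so at most 3 are expected.
 */
#define MAX_SINGLEMAP_REFS 3

/* Returns true when the swapcache copy may be dropped after the fault. */
bool model_should_free_singlemap_swapcache(const struct folio_model *f)
{
	assert(f != NULL);

	if (!f->in_swapcache)
		return false;
	if (f->swap_count != 0)
		return false;
	/* Any extra reference suggests another task also holds the folio. */
	return f->ref_count <= MAX_SINGLEMAP_REFS;
}
```

A fourth reference, for example from another process that picked the folio
up through readahead, makes the model keep the folio in the swapcache.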

Within the swapin_readahead() functions, if single_mapped_path is true, and
either the platform does not have IAA, or, if the platform has IAA and the
user selects a software compressor for zswap (details of sysfs knob
follow), readahead/batching are skipped and the folio is loaded using
zswap_load().

A new swap parameter "singlemapped_ra_enabled" (false by default) is added
for platforms that have IAA and on which zswap_load_batching_enabled() is
true. It gives the user the option to run experiments with IAA and with
software compressors for zswap (when the swap device is SWP_SYNCHRONOUS_IO):

For IAA:
 echo true > /sys/kernel/mm/swap/singlemapped_ra_enabled

For software compressors:
 echo false > /sys/kernel/mm/swap/singlemapped_ra_enabled

If "singlemapped_ra_enabled" is set to false, swapin_readahead() will skip
prefetching folios in the "single-mapped SWP_SYNCHRONOUS_IO" do_swap_page()
path.
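
Putting the cases above together, the choice of swapin path can be sketched
as a small decision function. This is a simplified userspace model under the
assumptions stated in this commit message, not the kernel code: enum
swapin_path, struct swapin_inputs and choose_swapin_path() are hypothetical
names, and load_batching_enabled stands in for zswap_load_batching_enabled()
being true on a platform with IAA.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical labels for the three outcomes described in the text. */
enum swapin_path {
	SWAPIN_SKIP_SWAPCACHE,	/* direct read, folio never enters swapcache */
	SWAPIN_SINGLE_FOLIO,	/* swapcache, readahead/batching skipped */
	SWAPIN_READAHEAD,	/* swapcache plus swapin_readahead() prefetch */
};

/* Hypothetical bundle of the conditions the commit message refers to. */
struct swapin_inputs {
	bool synchronous_io;		/* swap device is SWP_SYNCHRONOUS_IO */
	bool single_mapped;		/* __swap_count(entry) == 1 */
	bool zswap_never_enabled;	/* zswap_never_enabled() */
	bool load_batching_enabled;	/* zswap_load_batching_enabled() */
	bool singlemapped_ra_enabled;	/* the new sysfs knob */
};

enum swapin_path choose_swapin_path(const struct swapin_inputs *in)
{
	assert(in != NULL);

	/* Scenario 2: shared and/or non-SWP_SYNCHRONOUS_IO folios. */
	if (!(in->synchronous_io && in->single_mapped))
		return SWAPIN_READAHEAD;

	/* Scenario 1 without zswap: the existing skip-swapcache path. */
	if (in->zswap_never_enabled)
		return SWAPIN_SKIP_SWAPCACHE;

	/* Scenario 1 with zswap: prefetch only with batching and the knob. */
	if (in->load_batching_enabled && in->singlemapped_ra_enabled)
		return SWAPIN_READAHEAD;

	return SWAPIN_SINGLE_FOLIO;
}
```

In this model the default knob setting (false) always yields the
single-folio path for single-mapped SWP_SYNCHRONOUS_IO faults once zswap has
been enabled.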

Thanks to Ying Huang for the really helpful brainstorming discussions on the
swap_read_folio() plug design.

Suggested-by: Ying Huang <ying.huang@intel.com>
Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
---
 mm/memory.c     | 187 +++++++++++++++++++++++++++++++++++++-----------
 mm/shmem.c      |   2 +-
 mm/swap.h       |  12 ++--
 mm/swap_state.c | 157 ++++++++++++++++++++++++++++++++++++----
 mm/swapfile.c   |   2 +-
 5 files changed, 299 insertions(+), 61 deletions(-)

Comments

David Hildenbrand Oct. 18, 2024, 7:26 a.m. UTC | #1
On 18.10.24 08:48, Kanchana P Sridhar wrote:
> diff --git a/mm/memory.c b/mm/memory.c
> index b5745b9ffdf7..9655b85fc243 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3924,6 +3924,42 @@ static vm_fault_t remove_device_exclusive_entry(struct vm_fault *vmf)
>   	return 0;
>   }
>   
> +/*
> + * swapin readahead based batching interface for zswap batched loads using IAA:
> + *
> + * Should only be called for and if the faulting swap entry in do_swap_page
> + * is single-mapped and SWP_SYNCHRONOUS_IO.
> + *
> + * Detect if the folio is in the swapcache, is still mapped to only this
> + * process, and further, there are no additional references to this folio
> + * (for e.g. if another process simultaneously readahead this swap entry
> + * while this process was handling the page-fault, and got a pointer to the
> + * folio allocated by this process in the swapcache), besides the references
> + * that were obtained within __read_swap_cache_async() by this process that is
> + * faulting in this single-mapped swap entry.
> + */

How is this supposed to work for large folios?

> +static inline bool should_free_singlemap_swapcache(swp_entry_t entry,
> +						   struct folio *folio)
> +{
> +	if (!folio_test_swapcache(folio))
> +		return false;
> +
> +	if (__swap_count(entry) != 0)
> +		return false;
> +
> +	/*
> +	 * The folio ref count for a single-mapped folio that was allocated
> +	 * in __read_swap_cache_async(), can be a maximum of 3. These are the
> +	 * incrementors of the folio ref count in __read_swap_cache_async():
> +	 * folio_alloc_mpol(), add_to_swap_cache(), folio_add_lru().
> +	 */
> +
> +	if (folio_ref_count(folio) <= 3)
> +		return true;
> +
> +	return false;
> +}
> +
>   static inline bool should_try_to_free_swap(struct folio *folio,
>   					   struct vm_area_struct *vma,
>   					   unsigned int fault_flags)
> @@ -4215,6 +4251,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>   	swp_entry_t entry;
>   	pte_t pte;
>   	vm_fault_t ret = 0;
> +	bool single_mapped_swapcache = false;
>   	void *shadow = NULL;
>   	int nr_pages;
>   	unsigned long page_idx;
> @@ -4283,51 +4320,90 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>   	if (!folio) {
>   		if (data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
>   		    __swap_count(entry) == 1) {
> -			/* skip swapcache */
> -			folio = alloc_swap_folio(vmf);
> -			if (folio) {
> -				__folio_set_locked(folio);
> -				__folio_set_swapbacked(folio);
> -
> -				nr_pages = folio_nr_pages(folio);
> -				if (folio_test_large(folio))
> -					entry.val = ALIGN_DOWN(entry.val, nr_pages);
> -				/*
> -				 * Prevent parallel swapin from proceeding with
> -				 * the cache flag. Otherwise, another thread
> -				 * may finish swapin first, free the entry, and
> -				 * swapout reusing the same entry. It's
> -				 * undetectable as pte_same() returns true due
> -				 * to entry reuse.
> -				 */
> -				if (swapcache_prepare(entry, nr_pages)) {
> +			if (zswap_never_enabled()) {
> +				/* skip swapcache */
> +				folio = alloc_swap_folio(vmf);
> +				if (folio) {
> +					__folio_set_locked(folio);
> +					__folio_set_swapbacked(folio);
> +
> +					nr_pages = folio_nr_pages(folio);
> +					if (folio_test_large(folio))
> +						entry.val = ALIGN_DOWN(entry.val, nr_pages);
>   					/*
> -					 * Relax a bit to prevent rapid
> -					 * repeated page faults.
> +					 * Prevent parallel swapin from proceeding with
> +					 * the cache flag. Otherwise, another thread
> +					 * may finish swapin first, free the entry, and
> +					 * swapout reusing the same entry. It's
> +					 * undetectable as pte_same() returns true due
> +					 * to entry reuse.
>   					 */
> -					add_wait_queue(&swapcache_wq, &wait);
> -					schedule_timeout_uninterruptible(1);
> -					remove_wait_queue(&swapcache_wq, &wait);
> -					goto out_page;
> +					if (swapcache_prepare(entry, nr_pages)) {
> +						/*
> +						 * Relax a bit to prevent rapid
> +						 * repeated page faults.
> +						 */
> +						add_wait_queue(&swapcache_wq, &wait);
> +						schedule_timeout_uninterruptible(1);
> +						remove_wait_queue(&swapcache_wq, &wait);
> +						goto out_page;
> +					}
> +					need_clear_cache = true;
> +
> +					mem_cgroup_swapin_uncharge_swap(entry, nr_pages);
> +
> +					shadow = get_shadow_from_swap_cache(entry);
> +					if (shadow)
> +						workingset_refault(folio, shadow);
> +
> +					folio_add_lru(folio);
> +
> +					/* To provide entry to swap_read_folio() */
> +					folio->swap = entry;
> +					swap_read_folio(folio, NULL, NULL, NULL);
> +					folio->private = NULL;
> +				}
> +			} else {
> +				/*
> +				 * zswap is enabled or was enabled at some point.
> +				 * Don't skip swapcache.
> +				 *
> +				 * swapin readahead based batching interface
> +				 * for zswap batched loads using IAA:
> +				 *
> +				 * Readahead is invoked in this path only if
> +				 * the sys swap "singlemapped_ra_enabled" swap
> +				 * parameter is set to true. By default,
> +				 * "singlemapped_ra_enabled" is set to false,
> +				 * the recommended setting for software compressors.
> +				 * For IAA, if "singlemapped_ra_enabled" is set
> +				 * to true, readahead will be deployed in this path
> +				 * as well.
> +				 *
> +				 * For single-mapped pages, the batching interface
> +				 * calls __read_swap_cache_async() to allocate and
> +				 * place the faulting page in the swapcache. This is
> +				 * to handle a scenario where the faulting page in
> +				 * this process happens to simultaneously be a
> +				 * readahead page in another process. By placing the
> +				 * single-mapped faulting page in the swapcache,
> +				 * we avoid race conditions and duplicate page
> +				 * allocations under these scenarios.
> +				 */
> +				folio = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE,
> +							 vmf, true);
> +				if (!folio) {
> +					ret = VM_FAULT_OOM;
> +					goto out;
>   				}
> -				need_clear_cache = true;
> -
> -				mem_cgroup_swapin_uncharge_swap(entry, nr_pages);
> -
> -				shadow = get_shadow_from_swap_cache(entry);
> -				if (shadow)
> -					workingset_refault(folio, shadow);
> -
> -				folio_add_lru(folio);
>   
> -				/* To provide entry to swap_read_folio() */
> -				folio->swap = entry;
> -				swap_read_folio(folio, NULL, NULL, NULL);
> -				folio->private = NULL;
> -			}
> +				single_mapped_swapcache = true;
> +				nr_pages = folio_nr_pages(folio);
> +				swapcache = folio;
> +			} /* swapin with zswap support. */
>   		} else {
>   			folio = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE,
> -						vmf);
> +						 vmf, false);
>   			swapcache = folio;

I'm sorry, but making this function ever more complicated and ugly is 
not going to fly. The zswap special casing is quite ugly here as well.

Is there a way forward that we can make this code actually readable and 
avoid zswap special casing?

Usama Arif Oct. 18, 2024, 11:04 a.m. UTC | #2
On 18/10/2024 08:26, David Hildenbrand wrote:
> On 18.10.24 08:48, Kanchana P Sridhar wrote:
>> diff --git a/mm/memory.c b/mm/memory.c
>> index b5745b9ffdf7..9655b85fc243 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -3924,6 +3924,42 @@ static vm_fault_t remove_device_exclusive_entry(struct vm_fault *vmf)
>>       return 0;
>>   }
>>
>> +/*
>> + * swapin readahead based batching interface for zswap batched loads using IAA:
>> + *
>> + * Should only be called for and if the faulting swap entry in do_swap_page
>> + * is single-mapped and SWP_SYNCHRONOUS_IO.
>> + *
>> + * Detect if the folio is in the swapcache, is still mapped to only this
>> + * process, and further, there are no additional references to this folio
>> + * (for e.g. if another process simultaneously readahead this swap entry
>> + * while this process was handling the page-fault, and got a pointer to the
>> + * folio allocated by this process in the swapcache), besides the references
>> + * that were obtained within __read_swap_cache_async() by this process that is
>> + * faulting in this single-mapped swap entry.
>> + */
> 
> How is this supposed to work for large folios?
> 

Hi,

I was looking at zswapin large folio support and have posted an RFC in [1].
I got bogged down with some prod stuff, so wasn't able to send it earlier.

It looks quite different, and I think simpler than this series, so might be
a good comparison.

[1] https://lore.kernel.org/all/20241018105026.2521366-1-usamaarif642@gmail.com/

Thanks,
Usama

Nhat Pham Oct. 18, 2024, 5:21 p.m. UTC | #3
On Fri, Oct 18, 2024 at 4:04 AM Usama Arif <usamaarif642@gmail.com> wrote:
>
>
> On 18/10/2024 08:26, David Hildenbrand wrote:
> > On 18.10.24 08:48, Kanchana P Sridhar wrote:
> >> diff --git a/mm/memory.c b/mm/memory.c
> >> index b5745b9ffdf7..9655b85fc243 100644
> >> --- a/mm/memory.c
> >> +++ b/mm/memory.c
> >> @@ -3924,6 +3924,42 @@ static vm_fault_t remove_device_exclusive_entry(struct vm_fault *vmf)
> >>       return 0;
> >>   }
> >>
> >> +/*
> >> + * swapin readahead based batching interface for zswap batched loads using IAA:
> >> + *
> >> + * Should only be called for and if the faulting swap entry in do_swap_page
> >> + * is single-mapped and SWP_SYNCHRONOUS_IO.
> >> + *
> >> + * Detect if the folio is in the swapcache, is still mapped to only this
> >> + * process, and further, there are no additional references to this folio
> >> + * (for e.g. if another process simultaneously readahead this swap entry
> >> + * while this process was handling the page-fault, and got a pointer to the
> >> + * folio allocated by this process in the swapcache), besides the references
> >> + * that were obtained within __read_swap_cache_async() by this process that is
> >> + * faulting in this single-mapped swap entry.
> >> + */
> >
> > How is this supposed to work for large folios?
> >
>
> Hi,
>
> I was looking at zswapin large folio support and have posted an RFC in [1].
> I got bogged down with some prod stuff, so wasn't able to send it earlier.
>
> It looks quite different, and I think simpler than this series, so might be
> a good comparison.
>
> [1] https://lore.kernel.org/all/20241018105026.2521366-1-usamaarif642@gmail.com/
>
> Thanks,
> Usama

I agree.

I think the lower hanging fruit here is to build upon Usama's patch.
Kanchana, do you think we can just use the new batch decompressing
infrastructure, and apply it to Usama's large folio zswap loading?

I'm not denying the readahead idea outright, but that seems much more
complicated. There are questions regarding the benefits of
readahead-ing when applied to zswap in the first place - IIUC, zram
circumvents that logic in several cases, and zswap shares many
characteristics with zram (fast, synchronous compression devices).

So let's reap the low hanging fruits first, get the wins as well as
stress test the new infrastructure. Then we can discuss the readahead
idea later?

Sridhar, Kanchana P Oct. 18, 2024, 6:09 p.m. UTC | #4
Hi David,

> -----Original Message-----
> From: David Hildenbrand <david@redhat.com>
> Sent: Friday, October 18, 2024 12:27 AM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>; linux-
> kernel@vger.kernel.org; linux-mm@kvack.org; hannes@cmpxchg.org;
> yosryahmed@google.com; nphamcs@gmail.com;
> chengming.zhou@linux.dev; usamaarif642@gmail.com;
> ryan.roberts@arm.com; Huang, Ying <ying.huang@intel.com>;
> 21cnbao@gmail.com; akpm@linux-foundation.org; hughd@google.com;
> willy@infradead.org; bfoster@redhat.com; dchinner@redhat.com;
> chrisl@kernel.org
> Cc: Feghali, Wajdi K <wajdi.k.feghali@intel.com>; Gopal, Vinodh
> <vinodh.gopal@intel.com>
> Subject: Re: [RFC PATCH v1 6/7] mm: do_swap_page() calls
> swapin_readahead() zswap load batching interface.
> 
> On 18.10.24 08:48, Kanchana P Sridhar wrote:
> > diff --git a/mm/memory.c b/mm/memory.c
> > index b5745b9ffdf7..9655b85fc243 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -3924,6 +3924,42 @@ static vm_fault_t remove_device_exclusive_entry(struct vm_fault *vmf)
> >   	return 0;
> >   }
> >
> > +/*
> > + * swapin readahead based batching interface for zswap batched loads using IAA:
> > + *
> > + * Should only be called for and if the faulting swap entry in do_swap_page
> > + * is single-mapped and SWP_SYNCHRONOUS_IO.
> > + *
> > + * Detect if the folio is in the swapcache, is still mapped to only this
> > + * process, and further, there are no additional references to this folio
> > + * (for e.g. if another process simultaneously readahead this swap entry
> > + * while this process was handling the page-fault, and got a pointer to the
> > + * folio allocated by this process in the swapcache), besides the references
> > + * that were obtained within __read_swap_cache_async() by this process that is
> > + * faulting in this single-mapped swap entry.
> > + */
> 
> How is this supposed to work for large folios?

Thanks for your code review comments. The main idea behind this
patch-series is to work with the existing kernel page-fault granularity of 4K
folios, that swapin_readahead() builds upon to prefetch other "useful"
4K folios. The intent is to not try to make modifications at page-fault time
to opportunistically synthesize large folios for swapin.

As we know, __read_swap_cache_async() allocates an order-0 folio, which
explains the implementation of should_free_singlemap_swapcache() in this
patch. IOW, this is not supposed to work for large folios based on the existing
page-fault behavior and without making any modifications to that.

> 
> > +static inline bool should_free_singlemap_swapcache(swp_entry_t entry,
> > +						   struct folio *folio)
> > +{
> > +	if (!folio_test_swapcache(folio))
> > +		return false;
> > +
> > +	if (__swap_count(entry) != 0)
> > +		return false;
> > +
> > +	/*
> > +	 * The folio ref count for a single-mapped folio that was allocated
> > +	 * in __read_swap_cache_async(), can be a maximum of 3. These are the
> > +	 * incrementors of the folio ref count in __read_swap_cache_async():
> > +	 * folio_alloc_mpol(), add_to_swap_cache(), folio_add_lru().
> > +	 */
> > +
> > +	if (folio_ref_count(folio) <= 3)
> > +		return true;
> > +
> > +	return false;
> > +}
> > +
> >   static inline bool should_try_to_free_swap(struct folio *folio,
> >   					   struct vm_area_struct *vma,
> >   					   unsigned int fault_flags)
> > @@ -4215,6 +4251,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >   	swp_entry_t entry;
> >   	pte_t pte;
> >   	vm_fault_t ret = 0;
> > +	bool single_mapped_swapcache = false;
> >   	void *shadow = NULL;
> >   	int nr_pages;
> >   	unsigned long page_idx;
> > @@ -4283,51 +4320,90 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >   	if (!folio) {
> >   		if (data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
> >   		    __swap_count(entry) == 1) {
> > -			/* skip swapcache */
> > -			folio = alloc_swap_folio(vmf);
> > -			if (folio) {
> > -				__folio_set_locked(folio);
> > -				__folio_set_swapbacked(folio);
> > -
> > -				nr_pages = folio_nr_pages(folio);
> > -				if (folio_test_large(folio))
> > -					entry.val = ALIGN_DOWN(entry.val,
> nr_pages);
> > -				/*
> > -				 * Prevent parallel swapin from proceeding
> with
> > -				 * the cache flag. Otherwise, another thread
> > -				 * may finish swapin first, free the entry, and
> > -				 * swapout reusing the same entry. It's
> > -				 * undetectable as pte_same() returns true
> due
> > -				 * to entry reuse.
> > -				 */
> > -				if (swapcache_prepare(entry, nr_pages)) {
> > +			if (zswap_never_enabled()) {
> > +				/* skip swapcache */
> > +				folio = alloc_swap_folio(vmf);
> > +				if (folio) {
> > +					__folio_set_locked(folio);
> > +					__folio_set_swapbacked(folio);
> > +
> > +					nr_pages = folio_nr_pages(folio);
> > +					if (folio_test_large(folio))
> > +						entry.val =
> ALIGN_DOWN(entry.val, nr_pages);
> >   					/*
> > -					 * Relax a bit to prevent rapid
> > -					 * repeated page faults.
> > +					 * Prevent parallel swapin from
> proceeding with
> > +					 * the cache flag. Otherwise, another
> thread
> > +					 * may finish swapin first, free the
> entry, and
> > +					 * swapout reusing the same entry.
> It's
> > +					 * undetectable as pte_same()
> returns true due
> > +					 * to entry reuse.
> >   					 */
> > -					add_wait_queue(&swapcache_wq,
> &wait);
> > -
> 	schedule_timeout_uninterruptible(1);
> > -
> 	remove_wait_queue(&swapcache_wq, &wait);
> > -					goto out_page;
> > +					if (swapcache_prepare(entry,
> nr_pages)) {
> > +						/*
> > +						 * Relax a bit to prevent rapid
> > +						 * repeated page faults.
> > +						 */
> > +
> 	add_wait_queue(&swapcache_wq, &wait);
> > +
> 	schedule_timeout_uninterruptible(1);
> > +
> 	remove_wait_queue(&swapcache_wq, &wait);
> > +						goto out_page;
> > +					}
> > +					need_clear_cache = true;
> > +
> > +
> 	mem_cgroup_swapin_uncharge_swap(entry, nr_pages);
> > +
> > +					shadow =
> get_shadow_from_swap_cache(entry);
> > +					if (shadow)
> > +						workingset_refault(folio,
> shadow);
> > +
> > +					folio_add_lru(folio);
> > +
> > +					/* To provide entry to
> swap_read_folio() */
> > +					folio->swap = entry;
> > +					swap_read_folio(folio, NULL, NULL,
> NULL);
> > +					folio->private = NULL;
> > +				}
> > +			} else {
> > +				/*
> > +				 * zswap is enabled or was enabled at some
> point.
> > +				 * Don't skip swapcache.
> > +				 *
> > +				 * swapin readahead based batching
> interface
> > +				 * for zswap batched loads using IAA:
> > +				 *
> > +				 * Readahead is invoked in this path only if
> > +				 * the sys swap "singlemapped_ra_enabled"
> swap
> > +				 * parameter is set to true. By default,
> > +				 * "singlemapped_ra_enabled" is set to false,
> > +				 * the recommended setting for software
> compressors.
> > +				 * For IAA, if "singlemapped_ra_enabled" is
> set
> > +				 * to true, readahead will be deployed in this
> path
> > +				 * as well.
> > +				 *
> > +				 * For single-mapped pages, the batching
> interface
> > +				 * calls __read_swap_cache_async() to
> allocate and
> > +				 * place the faulting page in the swapcache.
> This is
> > +				 * to handle a scenario where the faulting
> page in
> > +				 * this process happens to simultaneously be
> a
> > +				 * readahead page in another process. By
> placing the
> > +				 * single-mapped faulting page in the
> swapcache,
> > +				 * we avoid race conditions and duplicate
> page
> > +				 * allocations under these scenarios.
> > +				 */
> > +				folio = swapin_readahead(entry,
> GFP_HIGHUSER_MOVABLE,
> > +							 vmf, true);
> > +				if (!folio) {
> > +					ret = VM_FAULT_OOM;
> > +					goto out;
> >   				}
> > -				need_clear_cache = true;
> > -
> > -				mem_cgroup_swapin_uncharge_swap(entry,
> nr_pages);
> > -
> > -				shadow =
> get_shadow_from_swap_cache(entry);
> > -				if (shadow)
> > -					workingset_refault(folio, shadow);
> > -
> > -				folio_add_lru(folio);
> >
> > -				/* To provide entry to swap_read_folio() */
> > -				folio->swap = entry;
> > -				swap_read_folio(folio, NULL, NULL, NULL);
> > -				folio->private = NULL;
> > -			}
> > +				single_mapped_swapcache = true;
> > +				nr_pages = folio_nr_pages(folio);
> > +				swapcache = folio;
> > +			} /* swapin with zswap support. */
> >   		} else {
> >   			folio = swapin_readahead(entry,
> GFP_HIGHUSER_MOVABLE,
> > -						vmf);
> > +						 vmf, false);
> >   			swapcache = folio;
> 
> I'm sorry, but making this function ever more complicated and ugly is
> not going to fly. The zswap special casing is quite ugly here as well.
> 
> Is there a way forward that we can make this code actually readable and
> avoid zswap special casing?

Yes, I realize this is now quite cluttered. I need to think some more about
how to make this more readable, and would appreciate suggestions
towards this.

Thanks,
Kanchana

> 
> --
> Cheers,
> 
> David / dhildenb
Sridhar, Kanchana P Oct. 18, 2024, 9:59 p.m. UTC | #5
Hi Usama, Nhat,

> -----Original Message-----
> From: Nhat Pham <nphamcs@gmail.com>
> Sent: Friday, October 18, 2024 10:21 AM
> To: Usama Arif <usamaarif642@gmail.com>
> Cc: David Hildenbrand <david@redhat.com>; Sridhar, Kanchana P
> <kanchana.p.sridhar@intel.com>; linux-kernel@vger.kernel.org; linux-
> mm@kvack.org; hannes@cmpxchg.org; yosryahmed@google.com;
> chengming.zhou@linux.dev; ryan.roberts@arm.com; Huang, Ying
> <ying.huang@intel.com>; 21cnbao@gmail.com; akpm@linux-foundation.org;
> hughd@google.com; willy@infradead.org; bfoster@redhat.com;
> dchinner@redhat.com; chrisl@kernel.org; Feghali, Wajdi K
> <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>
> Subject: Re: [RFC PATCH v1 6/7] mm: do_swap_page() calls
> swapin_readahead() zswap load batching interface.
> 
> On Fri, Oct 18, 2024 at 4:04 AM Usama Arif <usamaarif642@gmail.com>
> wrote:
> >
> >
> > On 18/10/2024 08:26, David Hildenbrand wrote:
> > > On 18.10.24 08:48, Kanchana P Sridhar wrote:
> > >> This patch invokes the swapin_readahead() based batching interface to
> > >> prefetch a batch of 4K folios for zswap load with batch decompressions
> > >> in parallel using IAA hardware. swapin_readahead() prefetches folios
> based
> > >> on vm.page-cluster and the usefulness of prior prefetches to the
> > >> workload. As folios are created in the swapcache and the readahead
> code
> > >> calls swap_read_folio() with a "zswap_batch" and a "non_zswap_batch",
> the
> > >> respective folio_batches get populated with the folios to be read.
> > >>
> > >> Finally, the swapin_readahead() procedures will call the newly added
> > >> process_ra_batch_of_same_type() which:
> > >>
> > >>   1) Reads all the non_zswap_batch folios sequentially by calling
> > >>      swap_read_folio().
> > >>   2) Calls swap_read_zswap_batch_unplug() with the zswap_batch which
> calls
> > >>      zswap_finish_load_batch() that finally decompresses each
> > >>      SWAP_CRYPTO_SUB_BATCH_SIZE sub-batch (i.e. upto 8 pages in a
> prefetch
> > >>      batch of say, 32 folios) in parallel with IAA.
> > >>
> > >> Within do_swap_page(), we try to benefit from batch decompressions in
> both
> > >> these scenarios:
> > >>
> > >>   1) single-mapped, SWP_SYNCHRONOUS_IO:
> > >>        We call swapin_readahead() with "single_mapped_path = true". This
> is
> > >>        done only in the !zswap_never_enabled() case.
> > >>   2) Shared and/or non-SWP_SYNCHRONOUS_IO folios:
> > >>        We call swapin_readahead() with "single_mapped_path = false".
> > >>
> > >> This will place folios in the swapcache: a design choice that handles
> cases
> > >> where a folio that is "single-mapped" in process 1 could be prefetched in
> > >> process 2; and handles highly contended server scenarios with stability.
> > >> There are checks added at the end of do_swap_page(), after the folio has
> > >> been successfully loaded, to detect if the single-mapped swapcache folio
> is
> > >> still single-mapped, and if so, folio_free_swap() is called on the folio.
> > >>
> > >> Within the swapin_readahead() functions, if single_mapped_path is true,
> and
> > >> either the platform does not have IAA, or, if the platform has IAA and the
> > >> user selects a software compressor for zswap (details of sysfs knob
> > >> follow), readahead/batching are skipped and the folio is loaded using
> > >> zswap_load().
> > >>
> > >> A new swap parameter "singlemapped_ra_enabled" (false by default) is
> added
> > >> for platforms that have IAA, zswap_load_batching_enabled() is true, and
> we
> > >> want to give the user the option to run experiments with IAA and with
> > >> software compressors for zswap (swap device is
> SWP_SYNCHRONOUS_IO):
> > >>
> > >> For IAA:
> > >>   echo true > /sys/kernel/mm/swap/singlemapped_ra_enabled
> > >>
> > >> For software compressors:
> > >>   echo false > /sys/kernel/mm/swap/singlemapped_ra_enabled
> > >>
> > >> If "singlemapped_ra_enabled" is set to false, swapin_readahead() will
> skip
> > >> prefetching folios in the "single-mapped SWP_SYNCHRONOUS_IO"
> do_swap_page()
> > >> path.
> > >>
> > >> Thanks Ying Huang for the really helpful brainstorming discussions on the
> > >> swap_read_folio() plug design.
> > >>
> > >> Suggested-by: Ying Huang <ying.huang@intel.com>
> > >> Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
> > >> ---
> > >>   mm/memory.c     | 187 +++++++++++++++++++++++++++++++++++++--
> ---------
> > >>   mm/shmem.c      |   2 +-
> > >>   mm/swap.h       |  12 ++--
> > >>   mm/swap_state.c | 157 ++++++++++++++++++++++++++++++++++++--
> --
> > >>   mm/swapfile.c   |   2 +-
> > >>   5 files changed, 299 insertions(+), 61 deletions(-)
> > >>
> > >> diff --git a/mm/memory.c b/mm/memory.c
> > >> index b5745b9ffdf7..9655b85fc243 100644
> > >> --- a/mm/memory.c
> > >> +++ b/mm/memory.c
> > >> @@ -3924,6 +3924,42 @@ static vm_fault_t
> remove_device_exclusive_entry(struct vm_fault *vmf)
> > >>       return 0;
> > >>   }
> > >>   +/*
> > >> + * swapin readahead based batching interface for zswap batched loads
> using IAA:
> > >> + *
> > >> + * Should only be called for and if the faulting swap entry in
> do_swap_page
> > >> + * is single-mapped and SWP_SYNCHRONOUS_IO.
> > >> + *
> > >> + * Detect if the folio is in the swapcache, is still mapped to only this
> > >> + * process, and further, there are no additional references to this folio
> > >> + * (for e.g. if another process simultaneously readahead this swap entry
> > >> + * while this process was handling the page-fault, and got a pointer to
> the
> > >> + * folio allocated by this process in the swapcache), besides the
> references
> > >> + * that were obtained within __read_swap_cache_async() by this
> process that is
> > >> + * faulting in this single-mapped swap entry.
> > >> + */
> > >
> > > How is this supposed to work for large folios?
> > >
> >
> > Hi,
> >
> > I was looking at zswapin large folio support and have posted a RFC in [1].
> > I got bogged down with some prod stuff, so wasn't able to send it earlier.
> >
> > It looks quite different, and I think simpler from this series, so might be
> > a good comparison.
> >
> > [1] https://lore.kernel.org/all/20241018105026.2521366-1-
> usamaarif642@gmail.com/
> >
> > Thanks,
> > Usama
> 
> I agree.
> 
> I think the lower hanging fruit here is to build upon Usama's patch.
> Kanchana, do you think we can just use the new batch decompressing
> infrastructure, and apply it to Usama's large folio zswap loading?
> 
> I'm not denying the readahead idea outright, but that seems much more
> complicated. There are questions regarding the benefits of
> readahead-ing when apply to zswap in the first place - IIUC, zram
> circumvents that logic in several cases, and zswap shares many
> characteristics with zram (fast, synchronous compression devices).
> 
> So let's reap the low hanging fruits first, get the wins as well as
> stress test the new infrastructure. Then we can discuss the readahead
> idea later?

Thanks Usama for publishing the zswap large folios swapin series, and
thanks Nhat for your suggestions.  Sure, I can look into integrating the
new batch decompressing infrastructure with Usama's large folio zswap
loading.

However, I think we need to get clarity on a bigger question: does it
make sense to swap in large folios? Some important considerations
would be:

1) What are the tradeoffs in memory footprint cost of swapping in a
   large folio?
2) If we decide to let the user determine this by, say, an option that
   determines the swapin granularity (e.g. no more than 32k at a time),
   how does this constrain compression and zpool storage granularity?

Ultimately, I feel the bigger question is about memory utilization cost
of large folio swapin. The swapin_readahead() based approach tries to
use the prefetch-usefulness characteristics of the workload to improve
the efficiency of multiple 4k folios by using strategies like parallel
decompression, to strike some balance in memory utilization vs.
efficiency.

Usama, I downloaded your patch series to understand this
better, and wanted to share the data.

I ran the kernel compilation "allmodconfig" with zstd, page-cluster=0,
and the 16k/32k/64k large folio sizes set to "always":

16k/32k/64k folios: kernel compilation with zstd:
 =================================================

 ------------------------------------------------------------------------------
                        mm-unstable-10-16-2024    + zswap large folios swapin
                                                                       series
 ------------------------------------------------------------------------------
 zswap compressor                         zstd                           zstd
 vm.page-cluster                             0                              0
 ------------------------------------------------------------------------------
 real_sec                               772.53                         870.61
 user_sec                            15,780.29                      15,836.71
 sys_sec                              5,353.20                       6,185.02
 Max_Res_Set_Size_KB                 1,873,348                      1,873,004
                                                                             
 ------------------------------------------------------------------------------
 memcg_high                                  0                              0
 memcg_swap_fail                             0                              0
 zswpout                            93,811,916                    111,663,872
 zswpin                             27,150,029                     54,730,678
 pswpout                                    64                             59
 pswpin                                     78                             53
 thp_swpout                                  0                              0
 thp_swpout_fallback                         0                              0
 16kB-mthp_swpout_fallback                   0                              0
 32kB-mthp_swpout_fallback                   0                              0
 64kB-mthp_swpout_fallback               5,470                              0
 pgmajfault                         29,019,256                     16,615,820
 swap_ra                                     0                              0
 swap_ra_hit                             3,004                          3,614
 ZSWPOUT-16kB                        1,324,160                      2,252,747
 ZSWPOUT-32kB                          730,534                      1,356,640
 ZSWPOUT-64kB                        3,039,760                      3,955,034
 ZSWPIN-16kB                                                        1,496,916
 ZSWPIN-32kB                                                        1,131,176
 ZSWPIN-64kB                                                        1,866,884
 SWPOUT-16kB                                 0                              0
 SWPOUT-32kB                                 0                              0
 SWPOUT-64kB                                 4                              3
 ------------------------------------------------------------------------------

It does appear like there is considerably higher swapout and swapin
activity as a result of swapping in large folios, which does end up
impacting performance.

I would appreciate thoughts on understanding the usefulness of
swapping in large folios, with the considerations outlined earlier/other
factors.

Thanks,
Kanchana
Usama Arif Oct. 20, 2024, 4:50 p.m. UTC | #6
On 18/10/2024 22:59, Sridhar, Kanchana P wrote:
> Hi Usama, Nhat,
> 
>> -----Original Message-----
>> From: Nhat Pham <nphamcs@gmail.com>
>> Sent: Friday, October 18, 2024 10:21 AM
>> To: Usama Arif <usamaarif642@gmail.com>
>> Cc: David Hildenbrand <david@redhat.com>; Sridhar, Kanchana P
>> <kanchana.p.sridhar@intel.com>; linux-kernel@vger.kernel.org; linux-
>> mm@kvack.org; hannes@cmpxchg.org; yosryahmed@google.com;
>> chengming.zhou@linux.dev; ryan.roberts@arm.com; Huang, Ying
>> <ying.huang@intel.com>; 21cnbao@gmail.com; akpm@linux-foundation.org;
>> hughd@google.com; willy@infradead.org; bfoster@redhat.com;
>> dchinner@redhat.com; chrisl@kernel.org; Feghali, Wajdi K
>> <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>
>> Subject: Re: [RFC PATCH v1 6/7] mm: do_swap_page() calls
>> swapin_readahead() zswap load batching interface.
>>
>> On Fri, Oct 18, 2024 at 4:04 AM Usama Arif <usamaarif642@gmail.com>
>> wrote:
>>>
>>>
>>> On 18/10/2024 08:26, David Hildenbrand wrote:
>>>> On 18.10.24 08:48, Kanchana P Sridhar wrote:
>>>>> This patch invokes the swapin_readahead() based batching interface to
>>>>> prefetch a batch of 4K folios for zswap load with batch decompressions
>>>>> in parallel using IAA hardware. swapin_readahead() prefetches folios
>> based
>>>>> on vm.page-cluster and the usefulness of prior prefetches to the
>>>>> workload. As folios are created in the swapcache and the readahead
>> code
>>>>> calls swap_read_folio() with a "zswap_batch" and a "non_zswap_batch",
>> the
>>>>> respective folio_batches get populated with the folios to be read.
>>>>>
>>>>> Finally, the swapin_readahead() procedures will call the newly added
>>>>> process_ra_batch_of_same_type() which:
>>>>>
>>>>>   1) Reads all the non_zswap_batch folios sequentially by calling
>>>>>      swap_read_folio().
>>>>>   2) Calls swap_read_zswap_batch_unplug() with the zswap_batch which
>> calls
>>>>>      zswap_finish_load_batch() that finally decompresses each
>>>>>      SWAP_CRYPTO_SUB_BATCH_SIZE sub-batch (i.e. upto 8 pages in a
>> prefetch
>>>>>      batch of say, 32 folios) in parallel with IAA.
>>>>>
>>>>> Within do_swap_page(), we try to benefit from batch decompressions in
>> both
>>>>> these scenarios:
>>>>>
>>>>>   1) single-mapped, SWP_SYNCHRONOUS_IO:
>>>>>        We call swapin_readahead() with "single_mapped_path = true". This
>> is
>>>>>        done only in the !zswap_never_enabled() case.
>>>>>   2) Shared and/or non-SWP_SYNCHRONOUS_IO folios:
>>>>>        We call swapin_readahead() with "single_mapped_path = false".
>>>>>
>>>>> This will place folios in the swapcache: a design choice that handles
>> cases
>>>>> where a folio that is "single-mapped" in process 1 could be prefetched in
>>>>> process 2; and handles highly contended server scenarios with stability.
>>>>> There are checks added at the end of do_swap_page(), after the folio has
>>>>> been successfully loaded, to detect if the single-mapped swapcache folio
>> is
>>>>> still single-mapped, and if so, folio_free_swap() is called on the folio.
>>>>>
>>>>> Within the swapin_readahead() functions, if single_mapped_path is true,
>> and
>>>>> either the platform does not have IAA, or, if the platform has IAA and the
>>>>> user selects a software compressor for zswap (details of sysfs knob
>>>>> follow), readahead/batching are skipped and the folio is loaded using
>>>>> zswap_load().
>>>>>
>>>>> A new swap parameter "singlemapped_ra_enabled" (false by default) is
>> added
>>>>> for platforms that have IAA, zswap_load_batching_enabled() is true, and
>> we
>>>>> want to give the user the option to run experiments with IAA and with
>>>>> software compressors for zswap (swap device is
>> SWP_SYNCHRONOUS_IO):
>>>>>
>>>>> For IAA:
>>>>>   echo true > /sys/kernel/mm/swap/singlemapped_ra_enabled
>>>>>
>>>>> For software compressors:
>>>>>   echo false > /sys/kernel/mm/swap/singlemapped_ra_enabled
>>>>>
>>>>> If "singlemapped_ra_enabled" is set to false, swapin_readahead() will
>> skip
>>>>> prefetching folios in the "single-mapped SWP_SYNCHRONOUS_IO"
>> do_swap_page()
>>>>> path.
>>>>>
>>>>> Thanks Ying Huang for the really helpful brainstorming discussions on the
>>>>> swap_read_folio() plug design.
>>>>>
>>>>> Suggested-by: Ying Huang <ying.huang@intel.com>
>>>>> Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
>>>>> ---
>>>>>   mm/memory.c     | 187 +++++++++++++++++++++++++++++++++++++--
>> ---------
>>>>>   mm/shmem.c      |   2 +-
>>>>>   mm/swap.h       |  12 ++--
>>>>>   mm/swap_state.c | 157 ++++++++++++++++++++++++++++++++++++--
>> --
>>>>>   mm/swapfile.c   |   2 +-
>>>>>   5 files changed, 299 insertions(+), 61 deletions(-)
>>>>>
>>>>> diff --git a/mm/memory.c b/mm/memory.c
>>>>> index b5745b9ffdf7..9655b85fc243 100644
>>>>> --- a/mm/memory.c
>>>>> +++ b/mm/memory.c
>>>>> @@ -3924,6 +3924,42 @@ static vm_fault_t
>> remove_device_exclusive_entry(struct vm_fault *vmf)
>>>>>       return 0;
>>>>>   }
>>>>>   +/*
>>>>> + * swapin readahead based batching interface for zswap batched loads
>> using IAA:
>>>>> + *
>>>>> + * Should only be called for and if the faulting swap entry in
>> do_swap_page
>>>>> + * is single-mapped and SWP_SYNCHRONOUS_IO.
>>>>> + *
>>>>> + * Detect if the folio is in the swapcache, is still mapped to only this
>>>>> + * process, and further, there are no additional references to this folio
>>>>> + * (for e.g. if another process simultaneously readahead this swap entry
>>>>> + * while this process was handling the page-fault, and got a pointer to
>> the
>>>>> + * folio allocated by this process in the swapcache), besides the
>> references
>>>>> + * that were obtained within __read_swap_cache_async() by this
>> process that is
>>>>> + * faulting in this single-mapped swap entry.
>>>>> + */
>>>>
>>>> How is this supposed to work for large folios?
>>>>
>>>
>>> Hi,
>>>
>>> I was looking at zswapin large folio support and have posted a RFC in [1].
>>> I got bogged down with some prod stuff, so wasn't able to send it earlier.
>>>
>>> It looks quite different, and I think simpler from this series, so might be
>>> a good comparison.
>>>
>>> [1] https://lore.kernel.org/all/20241018105026.2521366-1-
>> usamaarif642@gmail.com/
>>>
>>> Thanks,
>>> Usama
>>
>> I agree.
>>
>> I think the lower hanging fruit here is to build upon Usama's patch.
>> Kanchana, do you think we can just use the new batch decompressing
>> infrastructure, and apply it to Usama's large folio zswap loading?
>>
>> I'm not denying the readahead idea outright, but that seems much more
>> complicated. There are questions regarding the benefits of
>> readahead-ing when apply to zswap in the first place - IIUC, zram
>> circumvents that logic in several cases, and zswap shares many
>> characteristics with zram (fast, synchronous compression devices).
>>
>> So let's reap the low hanging fruits first, get the wins as well as
>> stress test the new infrastructure. Then we can discuss the readahead
>> idea later?
> 
> Thanks Usama for publishing the zswap large folios swapin series, and
> thanks Nhat for your suggestions.  Sure, I can look into integrating the
> new batch decompressing infrastructure with Usama's large folio zswap
> loading.
> 
> However, I think we need to get clarity on a bigger question: does it
> make sense to swapin large folios? Some important considerations
> would be:
> 
> 1) What are the tradeoffs in memory footprint cost of swapping in a
>     large folio?

I would say the pros and cons of swapping in a large folio are the same as
the pros and cons of large folios in general.
As mentioned in my cover letter and the series that introduced large folios,
you get fewer page faults, batched PTE and rmap manipulation, a shorter LRU list,
and TLB coalescing (for arm64 and AMD), at the cost of higher memory usage and
fragmentation.

The other thing is that the series I wrote is hopefully just a start.
As shown by Barry in the case of zram in 
https://lore.kernel.org/all/20240327214816.31191-1-21cnbao@gmail.com/
there is a significant improvement in CPU utilization and compression
ratios when compressing at large granularity. Hopefully we can
try and do something similar for zswap. Not sure how that would look
for zswap as I haven't started looking at that yet.

> 2) If we decide to let the user determine this by say, an option that
>      determines the swapin granularity (e.g. no more than 32k at a time),
>      how does this constrain compression and zpool storage granularity?
> 
Right now, whether or not zswapin happens is determined using
/sys/kernel/mm/transparent_hugepage/hugepages-*kB/enabled.
I assume that when someone sets some of these to always, they know that
their workload works best with those page sizes, so they would want folios
to be swapped in and used at that size as well?

There might be some merit in adding something like
/sys/kernel/mm/transparent_hugepage/hugepages-*kB/swapin_enabled,
as you might start thrashing swap if you are, e.g., swapping
in 1M folios and there isn't enough memory for them, which causes
you to swap out another folio in their place.

> Ultimately, I feel the bigger question is about memory utilization cost
> of large folio swapin. The swapin_readahead() based approach tries to
> use the prefetch-usefulness characteristics of the workload to improve
> the efficiency of multiple 4k folios by using strategies like parallel
> decompression, to strike some balance in memory utilization vs.
> efficiency.
> 
> Usama, I downloaded your patch-series and tried to understand this
> better, and wanted to share the data.
> 
> I ran the kernel compilation "allmodconfig" with zstd, page-cluster=0,
> and 16k/32k/64k large folios enabled to "always":
> 
> 16k/32k/64k folios: kernel compilation with zstd:
>  =================================================
> 
>  ------------------------------------------------------------------------------
>                         mm-unstable-10-16-2024    + zswap large folios swapin
>                                                                        series
>  ------------------------------------------------------------------------------
>  zswap compressor                         zstd                           zstd
>  vm.page-cluster                             0                              0
>  ------------------------------------------------------------------------------
>  real_sec                               772.53                         870.61
>  user_sec                            15,780.29                      15,836.71
>  sys_sec                              5,353.20                       6,185.02
>  Max_Res_Set_Size_KB                 1,873,348                      1,873,004
>                                                                              
>  ------------------------------------------------------------------------------
>  memcg_high                                  0                              0
>  memcg_swap_fail                             0                              0
>  zswpout                            93,811,916                    111,663,872
>  zswpin                             27,150,029                     54,730,678
>  pswpout                                    64                             59
>  pswpin                                     78                             53
>  thp_swpout                                  0                              0
>  thp_swpout_fallback                         0                              0
>  16kB-mthp_swpout_fallback                   0                              0
>  32kB-mthp_swpout_fallback                   0                              0
>  64kB-mthp_swpout_fallback               5,470                              0
>  pgmajfault                         29,019,256                     16,615,820
>  swap_ra                                     0                              0
>  swap_ra_hit                             3,004                          3,614
>  ZSWPOUT-16kB                        1,324,160                      2,252,747
>  ZSWPOUT-32kB                          730,534                      1,356,640
>  ZSWPOUT-64kB                        3,039,760                      3,955,034
>  ZSWPIN-16kB                                                        1,496,916
>  ZSWPIN-32kB                                                        1,131,176
>  ZSWPIN-64kB                                                        1,866,884
>  SWPOUT-16kB                                 0                              0
>  SWPOUT-32kB                                 0                              0
>  SWPOUT-64kB                                 4                              3
>  ------------------------------------------------------------------------------
> 
> It does appear like there is considerably higher swapout and swapin
> activity as a result of swapping in large folios, which does end up
> impacting performance.

Thanks for having a look!
I had only tested with the microbenchmark for time taken to zswapin that I included in
my cover letter.
In general I expected zswap activity to go up, as you are more likely to experience
memory pressure when swapping in large folios, but in return you get fewer page faults
and the advantage of lower TLB pressure on AMD and arm64.

Thanks for the test, those numbers look quite extreme! I think there is a lot of swap
thrashing.
I am assuming you are testing on an Intel machine, where you don't get the advantage
of fewer TLB misses with large folios. I will try to get an AMD machine which has
TLB coalescing or an ARM server with CONT_PTE to see if the numbers get better.

It might be better for large folio zswapin to be considered along with
larger granularity compression to get all the benefits of large folios (and
hopefully better numbers). I think that was the approach taken for zram as well.

> 
> I would appreciate thoughts on understanding the usefulness of
> swapping in large folios, with the considerations outlined earlier/other
> factors.
> 
> Thanks,
> Kanchana
Sridhar, Kanchana P Oct. 20, 2024, 8:12 p.m. UTC | #7
Hi Usama,

> -----Original Message-----
> From: Usama Arif <usamaarif642@gmail.com>
> Sent: Sunday, October 20, 2024 9:50 AM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>; Nhat Pham
> <nphamcs@gmail.com>
> Cc: David Hildenbrand <david@redhat.com>; linux-kernel@vger.kernel.org;
> linux-mm@kvack.org; hannes@cmpxchg.org; yosryahmed@google.com;
> chengming.zhou@linux.dev; ryan.roberts@arm.com; Huang, Ying
> <ying.huang@intel.com>; 21cnbao@gmail.com; akpm@linux-foundation.org;
> hughd@google.com; willy@infradead.org; bfoster@redhat.com;
> dchinner@redhat.com; chrisl@kernel.org; Feghali, Wajdi K
> <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>
> Subject: Re: [RFC PATCH v1 6/7] mm: do_swap_page() calls
> swapin_readahead() zswap load batching interface.
> 
> 
> 
> On 18/10/2024 22:59, Sridhar, Kanchana P wrote:
> > Hi Usama, Nhat,
> >
> >> -----Original Message-----
> >> From: Nhat Pham <nphamcs@gmail.com>
> >> Sent: Friday, October 18, 2024 10:21 AM
> >> To: Usama Arif <usamaarif642@gmail.com>
> >> Cc: David Hildenbrand <david@redhat.com>; Sridhar, Kanchana P
> >> <kanchana.p.sridhar@intel.com>; linux-kernel@vger.kernel.org;
> >> linux-mm@kvack.org; hannes@cmpxchg.org; yosryahmed@google.com;
> >> chengming.zhou@linux.dev; ryan.roberts@arm.com; Huang, Ying
> >> <ying.huang@intel.com>; 21cnbao@gmail.com; akpm@linux-foundation.org;
> >> hughd@google.com; willy@infradead.org; bfoster@redhat.com;
> >> dchinner@redhat.com; chrisl@kernel.org; Feghali, Wajdi K
> >> <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>
> >> Subject: Re: [RFC PATCH v1 6/7] mm: do_swap_page() calls
> >> swapin_readahead() zswap load batching interface.
> >>
> >> On Fri, Oct 18, 2024 at 4:04 AM Usama Arif <usamaarif642@gmail.com>
> >> wrote:
> >>>
> >>>
> >>> On 18/10/2024 08:26, David Hildenbrand wrote:
> >>>> On 18.10.24 08:48, Kanchana P Sridhar wrote:
> >>>>> This patch invokes the swapin_readahead() based batching interface to
> >>>>> prefetch a batch of 4K folios for zswap load with batch decompressions
> >>>>> in parallel using IAA hardware. swapin_readahead() prefetches folios based
> >>>>> on vm.page-cluster and the usefulness of prior prefetches to the
> >>>>> workload. As folios are created in the swapcache and the readahead code
> >>>>> calls swap_read_folio() with a "zswap_batch" and a "non_zswap_batch", the
> >>>>> respective folio_batches get populated with the folios to be read.
> >>>>>
> >>>>> Finally, the swapin_readahead() procedures will call the newly added
> >>>>> process_ra_batch_of_same_type() which:
> >>>>>
> >>>>>   1) Reads all the non_zswap_batch folios sequentially by calling
> >>>>>      swap_read_folio().
> >>>>>   2) Calls swap_read_zswap_batch_unplug() with the zswap_batch which calls
> >>>>>      zswap_finish_load_batch() that finally decompresses each
> >>>>>      SWAP_CRYPTO_SUB_BATCH_SIZE sub-batch (i.e. upto 8 pages in a prefetch
> >>>>>      batch of say, 32 folios) in parallel with IAA.
> >>>>>
> >>>>> Within do_swap_page(), we try to benefit from batch decompressions in both
> >>>>> these scenarios:
> >>>>>
> >>>>>   1) single-mapped, SWP_SYNCHRONOUS_IO:
> >>>>>        We call swapin_readahead() with "single_mapped_path = true". This is
> >>>>>        done only in the !zswap_never_enabled() case.
> >>>>>   2) Shared and/or non-SWP_SYNCHRONOUS_IO folios:
> >>>>>        We call swapin_readahead() with "single_mapped_path = false".
> >>>>>
> >>>>> This will place folios in the swapcache: a design choice that handles cases
> >>>>> where a folio that is "single-mapped" in process 1 could be prefetched in
> >>>>> process 2; and handles highly contended server scenarios with stability.
> >>>>> There are checks added at the end of do_swap_page(), after the folio has
> >>>>> been successfully loaded, to detect if the single-mapped swapcache folio is
> >>>>> still single-mapped, and if so, folio_free_swap() is called on the folio.
> >>>>>
> >>>>> Within the swapin_readahead() functions, if single_mapped_path is true, and
> >>>>> either the platform does not have IAA, or, if the platform has IAA and the
> >>>>> user selects a software compressor for zswap (details of sysfs knob
> >>>>> follow), readahead/batching are skipped and the folio is loaded using
> >>>>> zswap_load().
> >>>>>
> >>>>> A new swap parameter "singlemapped_ra_enabled" (false by default) is added
> >>>>> for platforms that have IAA, zswap_load_batching_enabled() is true, and we
> >>>>> want to give the user the option to run experiments with IAA and with
> >>>>> software compressors for zswap (swap device is SWP_SYNCHRONOUS_IO):
> >>>>>
> >>>>> For IAA:
> >>>>>   echo true > /sys/kernel/mm/swap/singlemapped_ra_enabled
> >>>>>
> >>>>> For software compressors:
> >>>>>   echo false > /sys/kernel/mm/swap/singlemapped_ra_enabled
> >>>>>
> >>>>> If "singlemapped_ra_enabled" is set to false, swapin_readahead() will skip
> >>>>> prefetching folios in the "single-mapped SWP_SYNCHRONOUS_IO" do_swap_page()
> >>>>> path.
> >>>>>
> >>>>> Thanks Ying Huang for the really helpful brainstorming discussions on the
> >>>>> swap_read_folio() plug design.
> >>>>>
> >>>>> Suggested-by: Ying Huang <ying.huang@intel.com>
> >>>>> Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
> >>>>> ---
> >>>>>   mm/memory.c     | 187 +++++++++++++++++++++++++++++++++++++-----------
> >>>>>   mm/shmem.c      |   2 +-
> >>>>>   mm/swap.h       |  12 ++--
> >>>>>   mm/swap_state.c | 157 ++++++++++++++++++++++++++++++++++++----
> >>>>>   mm/swapfile.c   |   2 +-
> >>>>>   5 files changed, 299 insertions(+), 61 deletions(-)
> >>>>>
> >>>>> diff --git a/mm/memory.c b/mm/memory.c
> >>>>> index b5745b9ffdf7..9655b85fc243 100644
> >>>>> --- a/mm/memory.c
> >>>>> +++ b/mm/memory.c
> >>>>> @@ -3924,6 +3924,42 @@ static vm_fault_t remove_device_exclusive_entry(struct vm_fault *vmf)
> >>>>>       return 0;
> >>>>>   }
> >>>>> +/*
> >>>>> + * swapin readahead based batching interface for zswap batched loads using IAA:
> >>>>> + *
> >>>>> + * Should only be called for and if the faulting swap entry in do_swap_page
> >>>>> + * is single-mapped and SWP_SYNCHRONOUS_IO.
> >>>>> + *
> >>>>> + * Detect if the folio is in the swapcache, is still mapped to only this
> >>>>> + * process, and further, there are no additional references to this folio
> >>>>> + * (for e.g. if another process simultaneously readahead this swap entry
> >>>>> + * while this process was handling the page-fault, and got a pointer to the
> >>>>> + * folio allocated by this process in the swapcache), besides the references
> >>>>> + * that were obtained within __read_swap_cache_async() by this process that is
> >>>>> + * faulting in this single-mapped swap entry.
> >>>>> + */
> >>>>
> >>>> How is this supposed to work for large folios?
> >>>>
> >>>
> >>> Hi,
> >>>
> >>> I was looking at zswapin large folio support and have posted an RFC in [1].
> >>> I got bogged down with some prod stuff, so wasn't able to send it earlier.
> >>>
> >>> It looks quite different, and I think simpler than this series, so might be
> >>> a good comparison.
> >>>
> >>> [1] https://lore.kernel.org/all/20241018105026.2521366-1-usamaarif642@gmail.com/
> >>>
> >>> Thanks,
> >>> Usama
> >>
> >> I agree.
> >>
> >> I think the lower hanging fruit here is to build upon Usama's patch.
> >> Kanchana, do you think we can just use the new batch decompressing
> >> infrastructure, and apply it to Usama's large folio zswap loading?
> >>
> >> I'm not denying the readahead idea outright, but that seems much more
> >> complicated. There are questions regarding the benefits of
> >> readahead-ing when applied to zswap in the first place - IIUC, zram
> >> circumvents that logic in several cases, and zswap shares many
> >> characteristics with zram (fast, synchronous compression devices).
> >>
> >> So let's reap the low hanging fruits first, get the wins as well as
> >> stress test the new infrastructure. Then we can discuss the readahead
> >> idea later?
> >
> > Thanks Usama for publishing the zswap large folios swapin series, and
> > thanks Nhat for your suggestions.  Sure, I can look into integrating the
> > new batch decompressing infrastructure with Usama's large folio zswap
> > loading.
> >
> > However, I think we need to get clarity on a bigger question: does it
> > make sense to swapin large folios? Some important considerations
> > would be:
> >
> > 1) What are the tradeoffs in memory footprint cost of swapping in a
> >     large folio?
> 
> I would say the pros and cons of swapping in a large folio are the same as
> the pros and cons of large folios in general.
> As mentioned in my cover letter and the series that introduced large folios,
> you get fewer page faults, batched PTE and rmap manipulation, a shorter LRU
> list, and TLB coalescing (for arm64 and AMD), at the cost of higher memory
> usage and fragmentation.
> 
> The other thing is that the series I wrote is hopefully just a start.
> As shown by Barry in the case of zram in
> https://lore.kernel.org/all/20240327214816.31191-1-21cnbao@gmail.com/
> there is a significant improvement in CPU utilization and compression
> ratios when compressing at large granularity. Hopefully we can
> try and do something similar for zswap. Not sure how that would look
> for zswap as I haven't started looking at that yet.

Thanks a lot for sharing your thoughts on this! Yes, this makes sense.
It was helpful to get your thoughts on the larger compression granularity
with large folios, since we have been trying to answer similar questions
ourselves. We look forward to collaborating with you on getting this working
for zswap!
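
To make the granularity discussion concrete, here is a rough sketch of the arithmetic involved if batch decompression were applied to large folio zswapin, as Nhat suggested earlier in the thread. It assumes 4 KiB base pages and a sub-batch limit of 8 pages (the figure the commit message gives for SWAP_CRYPTO_SUB_BATCH_SIZE); the helper names are hypothetical and do not exist in the series.

```c
/*
 * Illustration only. Assumptions: 4 KiB base pages, and a sub-batch
 * limit of 8 pages, the figure the commit message quotes for
 * SWAP_CRYPTO_SUB_BATCH_SIZE. The helpers are hypothetical.
 */
#define BASE_PAGE_KB    4u
#define SUB_BATCH_PAGES 8u

/* Number of base pages backing a folio of the given size in KiB. */
static unsigned int folio_pages(unsigned int folio_kb)
{
	return folio_kb / BASE_PAGE_KB;
}

/* Number of sub-batches needed to decompress nr_pages, rounding up. */
static unsigned int nr_sub_batches(unsigned int nr_pages)
{
	return (nr_pages + SUB_BATCH_PAGES - 1) / SUB_BATCH_PAGES;
}
```

Under these assumptions a 16k folio fits in a single sub-batch, a 64k folio spans 16 pages and needs two, and a 32-folio prefetch batch of 4K folios needs four.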

> 
> > 2) If we decide to let the user determine this by say, an option that
> >      determines the swapin granularity (e.g. no more than 32k at a time),
> >      how does this constrain compression and zpool storage granularity?
> >
> Right now whether or not zswapin happens is determined using
> /sys/kernel/mm/transparent_hugepage/hugepages-*kB/enabled
> I assume when someone sets some of these to always, they know that
> their workload works best with those page sizes, so they would want folios
> to be swapped in and used at that size as well?

Sure, this rationale makes sense.
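
For readers who want to check how a given folio size is configured, the per-size "enabled" files report the active choice in brackets, for example "always inherit [madvise] never". The following is a minimal userspace sketch that extracts the bracketed option from such a line. The function names are hypothetical, and size_opted_in() encodes only the "set to always" case Usama describes; it does not resolve "inherit" against the top-level setting.

```c
#include <stddef.h>
#include <string.h>

/*
 * Copy the bracketed (currently selected) option from a sysfs-style
 * line such as "always inherit [madvise] never" into out.
 * Returns 0 on success, -1 if there is no bracketed option or it
 * does not fit in out.
 */
static int selected_option(const char *line, char *out, size_t outlen)
{
	const char *lbr = strchr(line, '[');
	const char *rbr = lbr ? strchr(lbr, ']') : NULL;
	size_t len;

	if (!lbr || !rbr)
		return -1;

	len = (size_t)(rbr - lbr - 1);
	if (len + 1 > outlen)
		return -1;

	memcpy(out, lbr + 1, len);
	out[len] = '\0';
	return 0;
}

/* Hypothetical policy helper: 1 if this size is set to "always", else 0. */
static int size_opted_in(const char *line)
{
	char opt[16];

	return selected_option(line, opt, sizeof(opt)) == 0 &&
	       strcmp(opt, "always") == 0;
}
```

A caller would read one line from the relevant hugepages-*kB/enabled file and pass it to size_opted_in().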

> 
> There might be some merit in adding something like
> /sys/kernel/mm/transparent_hugepage/hugepages-*kB/swapin_enabled,
> as you might start thrashing swap if you are for e.g. swapping
> in 1M folios and there isn't enough memory for it, which causes
> you to swapout another folio in its place.

Agreed.

> 
> > Ultimately, I feel the bigger question is about memory utilization cost
> > of large folio swapin. The swapin_readahead() based approach tries to
> > use the prefetch-usefulness characteristics of the workload to improve
> > the efficiency of multiple 4k folios by using strategies like parallel
> > decompression, to strike some balance in memory utilization vs.
> > efficiency.
> >
> > Usama, I downloaded your patch-series and tried to understand this
> > better, and wanted to share the data.
> >
> > I ran the kernel compilation "allmodconfig" with zstd, page-cluster=0,
> > and 16k/32k/64k large folios enabled to "always":
> >
> > 16k/32k/64k folios: kernel compilation with zstd:
> >  =================================================
> >
> >  ------------------------------------------------------------------------------
> >                         mm-unstable-10-16-2024    + zswap large folios swapin
> >                                                                        series
> >  ------------------------------------------------------------------------------
> >  zswap compressor                         zstd                           zstd
> >  vm.page-cluster                             0                              0
> >  ------------------------------------------------------------------------------
> >  real_sec                               772.53                         870.61
> >  user_sec                            15,780.29                      15,836.71
> >  sys_sec                              5,353.20                       6,185.02
> >  Max_Res_Set_Size_KB                 1,873,348                      1,873,004
> >
> >  ------------------------------------------------------------------------------
> >  memcg_high                                  0                              0
> >  memcg_swap_fail                             0                              0
> >  zswpout                            93,811,916                    111,663,872
> >  zswpin                             27,150,029                     54,730,678
> >  pswpout                                    64                             59
> >  pswpin                                     78                             53
> >  thp_swpout                                  0                              0
> >  thp_swpout_fallback                         0                              0
> >  16kB-mthp_swpout_fallback                   0                              0
> >  32kB-mthp_swpout_fallback                   0                              0
> >  64kB-mthp_swpout_fallback               5,470                              0
> >  pgmajfault                         29,019,256                     16,615,820
> >  swap_ra                                     0                              0
> >  swap_ra_hit                             3,004                          3,614
> >  ZSWPOUT-16kB                        1,324,160                      2,252,747
> >  ZSWPOUT-32kB                          730,534                      1,356,640
> >  ZSWPOUT-64kB                        3,039,760                      3,955,034
> >  ZSWPIN-16kB                                                        1,496,916
> >  ZSWPIN-32kB                                                        1,131,176
> >  ZSWPIN-64kB                                                        1,866,884
> >  SWPOUT-16kB                                 0                              0
> >  SWPOUT-32kB                                 0                              0
> >  SWPOUT-64kB                                 4                              3
> >  ------------------------------------------------------------------------------
> >
> > It does appear like there is considerably higher swapout and swapin
> > activity as a result of swapping in large folios, which does end up
> > impacting performance.
> 
> Thanks for having a look!
> I had only tested with the microbenchmark for time taken to zswapin that I
> included in my cover letter.
> In general I expected zswap activity to go up as you are more likely to
> experience memory pressure when swapping in large folios, but in return get
> fewer page faults and the advantage of lower TLB pressure on AMD and arm64.
> 
> Thanks for the test, those numbers look quite extreme! I think there is a lot
> of swap thrashing.
> I am assuming you are testing on an Intel machine, where you don't get the
> advantage of lower TLB misses of large folios. I will try and get an AMD
> machine which has TLB coalescing or an ARM server with CONT_PTE to see if the
> numbers get better.

You're right, these numbers were gathered on an Intel Sapphire Rapids server.
Thanks for offering to check the kernel compilation/allmodconfig behavior on
AMD/ARM systems.
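
As a quick way to read the table quoted above, the relative change between the two columns can be computed with a one-line helper. This is only a sketch for working through the posted numbers; pct_change() is a hypothetical name, not something from either series.

```c
/*
 * Relative change from the baseline column (mm-unstable) to the
 * patched column (+ zswap large folios swapin series), in percent.
 * Positive means the patched value is larger.
 */
static double pct_change(double baseline, double patched)
{
	return (patched - baseline) / baseline * 100.0;
}
```

Applied to the table, this gives roughly +12.7% for real_sec (772.53 to 870.61), +15.5% for sys_sec, +19.0% for zswpout and +101.6% for zswpin, while pgmajfault drops by roughly 42.7% (29,019,256 to 16,615,820).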

> 
> It might be better for large folio zswapin to be considered along with
> larger granularity compression to get all the benefits of large folios (and
> hopefully better numbers). I think that was the approach taken for zram as
> well.

Yes, agree with you on this as well. We are planning to run experiments with
the by_n patch-series [1], and Barry's zsmalloc multi-page patch-series [2]
posted earlier, and your zswapin large folio series. It would be great to compare
notes as we understand the overall workload behavior and the trade-offs in
memory vs. latency benefits of larger compression granularity.

[1] https://lore.kernel.org/all/cover.1714581792.git.andre.glover@linux.intel.com/
[2] https://lore.kernel.org/linux-mm/20240327214816.31191-1-21cnbao@gmail.com/

Thanks again,
Kanchana

> 
> >
> > I would appreciate thoughts on understanding the usefulness of
> > swapping in large folios, with the considerations outlined earlier/other
> > factors.
> >
> > Thanks,
> > Kanchana
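
The cleanup predicate that the patch below adds as should_free_singlemap_swapcache() can be modeled in userspace to see which states lead to folio_free_swap(). This is a simplified sketch, not the kernel code: struct folio_model and should_free_model() are hypothetical stand-ins for the folio state the real check reads, and the bound of 3 is taken from the comment in the patch.

```c
#include <stdbool.h>

/* Hypothetical userspace stand-in for the state the kernel check reads. */
struct folio_model {
	bool in_swapcache;	/* models folio_test_swapcache() */
	int swap_count;		/* models __swap_count() of the faulting entry */
	int ref_count;		/* models folio_ref_count() */
};

/*
 * Upper bound on the references taken in __read_swap_cache_async(),
 * per the comment in the patch.
 */
#define SINGLEMAP_MAX_REFS 3

/* Mirrors the order of the checks in the patch: cache, swap count, refs. */
static bool should_free_model(const struct folio_model *f)
{
	if (!f->in_swapcache)
		return false;
	if (f->swap_count != 0)
		return false;
	return f->ref_count <= SINGLEMAP_MAX_REFS;
}
```

In this model, a folio that is in the swapcache with a swap count of 0 and at most three references is freed from the swapcache; any extra reference, such as one held by a concurrent readahead in another process, keeps it there.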

Patch

diff --git a/mm/memory.c b/mm/memory.c
index b5745b9ffdf7..9655b85fc243 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3924,6 +3924,42 @@  static vm_fault_t remove_device_exclusive_entry(struct vm_fault *vmf)
 	return 0;
 }
 
+/*
+ * swapin readahead based batching interface for zswap batched loads using IAA:
+ *
+ * Should only be called for and if the faulting swap entry in do_swap_page
+ * is single-mapped and SWP_SYNCHRONOUS_IO.
+ *
+ * Detect if the folio is in the swapcache, is still mapped to only this
+ * process, and further, there are no additional references to this folio
+ * (for e.g. if another process simultaneously readahead this swap entry
+ * while this process was handling the page-fault, and got a pointer to the
+ * folio allocated by this process in the swapcache), besides the references
+ * that were obtained within __read_swap_cache_async() by this process that is
+ * faulting in this single-mapped swap entry.
+ */
+static inline bool should_free_singlemap_swapcache(swp_entry_t entry,
+						   struct folio *folio)
+{
+	if (!folio_test_swapcache(folio))
+		return false;
+
+	if (__swap_count(entry) != 0)
+		return false;
+
+	/*
+	 * The folio ref count for a single-mapped folio that was allocated
+	 * in __read_swap_cache_async(), can be a maximum of 3. These are the
+	 * incrementors of the folio ref count in __read_swap_cache_async():
+	 * folio_alloc_mpol(), add_to_swap_cache(), folio_add_lru().
+	 */
+
+	if (folio_ref_count(folio) <= 3)
+		return true;
+
+	return false;
+}
+
 static inline bool should_try_to_free_swap(struct folio *folio,
 					   struct vm_area_struct *vma,
 					   unsigned int fault_flags)
@@ -4215,6 +4251,7 @@  vm_fault_t do_swap_page(struct vm_fault *vmf)
 	swp_entry_t entry;
 	pte_t pte;
 	vm_fault_t ret = 0;
+	bool single_mapped_swapcache = false;
 	void *shadow = NULL;
 	int nr_pages;
 	unsigned long page_idx;
@@ -4283,51 +4320,90 @@  vm_fault_t do_swap_page(struct vm_fault *vmf)
 	if (!folio) {
 		if (data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
 		    __swap_count(entry) == 1) {
-			/* skip swapcache */
-			folio = alloc_swap_folio(vmf);
-			if (folio) {
-				__folio_set_locked(folio);
-				__folio_set_swapbacked(folio);
-
-				nr_pages = folio_nr_pages(folio);
-				if (folio_test_large(folio))
-					entry.val = ALIGN_DOWN(entry.val, nr_pages);
-				/*
-				 * Prevent parallel swapin from proceeding with
-				 * the cache flag. Otherwise, another thread
-				 * may finish swapin first, free the entry, and
-				 * swapout reusing the same entry. It's
-				 * undetectable as pte_same() returns true due
-				 * to entry reuse.
-				 */
-				if (swapcache_prepare(entry, nr_pages)) {
+			if (zswap_never_enabled()) {
+				/* skip swapcache */
+				folio = alloc_swap_folio(vmf);
+				if (folio) {
+					__folio_set_locked(folio);
+					__folio_set_swapbacked(folio);
+
+					nr_pages = folio_nr_pages(folio);
+					if (folio_test_large(folio))
+						entry.val = ALIGN_DOWN(entry.val, nr_pages);
 					/*
-					 * Relax a bit to prevent rapid
-					 * repeated page faults.
+					 * Prevent parallel swapin from proceeding with
+					 * the cache flag. Otherwise, another thread
+					 * may finish swapin first, free the entry, and
+					 * swapout reusing the same entry. It's
+					 * undetectable as pte_same() returns true due
+					 * to entry reuse.
 					 */
-					add_wait_queue(&swapcache_wq, &wait);
-					schedule_timeout_uninterruptible(1);
-					remove_wait_queue(&swapcache_wq, &wait);
-					goto out_page;
+					if (swapcache_prepare(entry, nr_pages)) {
+						/*
+						 * Relax a bit to prevent rapid
+						 * repeated page faults.
+						 */
+						add_wait_queue(&swapcache_wq, &wait);
+						schedule_timeout_uninterruptible(1);
+						remove_wait_queue(&swapcache_wq, &wait);
+						goto out_page;
+					}
+					need_clear_cache = true;
+
+					mem_cgroup_swapin_uncharge_swap(entry, nr_pages);
+
+					shadow = get_shadow_from_swap_cache(entry);
+					if (shadow)
+						workingset_refault(folio, shadow);
+
+					folio_add_lru(folio);
+
+					/* To provide entry to swap_read_folio() */
+					folio->swap = entry;
+					swap_read_folio(folio, NULL, NULL, NULL);
+					folio->private = NULL;
+				}
+			} else {
+				/*
+				 * zswap is enabled or was enabled at some point.
+				 * Don't skip swapcache.
+				 *
+				 * swapin readahead based batching interface
+				 * for zswap batched loads using IAA:
+				 *
+				 * Readahead is invoked in this path only if
+				 * the sys swap "singlemapped_ra_enabled" swap
+				 * parameter is set to true. By default,
+				 * "singlemapped_ra_enabled" is set to false,
+				 * the recommended setting for software compressors.
+				 * For IAA, if "singlemapped_ra_enabled" is set
+				 * to true, readahead will be deployed in this path
+				 * as well.
+				 *
+				 * For single-mapped pages, the batching interface
+				 * calls __read_swap_cache_async() to allocate and
+				 * place the faulting page in the swapcache. This is
+				 * to handle a scenario where the faulting page in
+				 * this process happens to simultaneously be a
+				 * readahead page in another process. By placing the
+				 * single-mapped faulting page in the swapcache,
+				 * we avoid race conditions and duplicate page
+				 * allocations under these scenarios.
+				 */
+				folio = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE,
+							 vmf, true);
+				if (!folio) {
+					ret = VM_FAULT_OOM;
+					goto out;
 				}
-				need_clear_cache = true;
-
-				mem_cgroup_swapin_uncharge_swap(entry, nr_pages);
-
-				shadow = get_shadow_from_swap_cache(entry);
-				if (shadow)
-					workingset_refault(folio, shadow);
-
-				folio_add_lru(folio);
 
-				/* To provide entry to swap_read_folio() */
-				folio->swap = entry;
-				swap_read_folio(folio, NULL, NULL, NULL);
-				folio->private = NULL;
-			}
+				single_mapped_swapcache = true;
+				nr_pages = folio_nr_pages(folio);
+				swapcache = folio;
+			} /* swapin with zswap support. */
 		} else {
 			folio = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE,
-						vmf);
+						 vmf, false);
 			swapcache = folio;
 		}
 
@@ -4528,8 +4604,10 @@  vm_fault_t do_swap_page(struct vm_fault *vmf)
 	 * yet.
 	 */
 	swap_free_nr(entry, nr_pages);
-	if (should_try_to_free_swap(folio, vma, vmf->flags))
+	if (should_try_to_free_swap(folio, vma, vmf->flags)) {
 		folio_free_swap(folio);
+		single_mapped_swapcache = false;
+	}
 
 	add_mm_counter(vma->vm_mm, MM_ANONPAGES, nr_pages);
 	add_mm_counter(vma->vm_mm, MM_SWAPENTS, -nr_pages);
@@ -4619,6 +4697,30 @@  vm_fault_t do_swap_page(struct vm_fault *vmf)
 		if (waitqueue_active(&swapcache_wq))
 			wake_up(&swapcache_wq);
 	}
+
+	/*
+	 * swapin readahead based batching interface
+	 * for zswap batched loads using IAA:
+	 *
+	 * Don't skip swapcache strategy for single-mapped
+	 * pages: As described above, we place the
+	 * single-mapped faulting page in the swapcache,
+	 * to avoid race conditions and duplicate page
+	 * allocations between process 1 handling a
+	 * page-fault for a single-mapped page, while
+	 * simultaneously, the same swap entry is a
+	 * readahead prefetch page in another process 2.
+	 *
+	 * One side-effect of this, is that if the race did
+	 * not occur, we need to clean up the swapcache
+	 * entry and free the zswap entry for the faulting
+	 * page, iff it is still single-mapped and is
+	 * exclusive to this process.
+	 */
+	if (single_mapped_swapcache &&
+		data_race(should_free_singlemap_swapcache(entry, folio)))
+		folio_free_swap(folio);
+
 	if (si)
 		put_swap_device(si);
 	return ret;
@@ -4638,6 +4740,11 @@  vm_fault_t do_swap_page(struct vm_fault *vmf)
 		if (waitqueue_active(&swapcache_wq))
 			wake_up(&swapcache_wq);
 	}
+
+	if (single_mapped_swapcache &&
+		data_race(should_free_singlemap_swapcache(entry, folio)))
+		folio_free_swap(folio);
+
 	if (si)
 		put_swap_device(si);
 	return ret;
diff --git a/mm/shmem.c b/mm/shmem.c
index 66eae800ffab..e4549c04f316 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1624,7 +1624,7 @@  static struct folio *shmem_swapin_cluster(swp_entry_t swap, gfp_t gfp,
 	struct folio *folio;
 
 	mpol = shmem_get_pgoff_policy(info, index, 0, &ilx);
-	folio = swap_cluster_readahead(swap, gfp, mpol, ilx);
+	folio = swap_cluster_readahead(swap, gfp, mpol, ilx, false);
 	mpol_cond_put(mpol);
 
 	return folio;
diff --git a/mm/swap.h b/mm/swap.h
index 2b82c8ed765c..2861bd8f5a96 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -199,9 +199,11 @@  struct folio *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_flags,
 		struct mempolicy *mpol, pgoff_t ilx, bool *new_page_allocated,
 		bool skip_if_exists);
 struct folio *swap_cluster_readahead(swp_entry_t entry, gfp_t flag,
-		struct mempolicy *mpol, pgoff_t ilx);
+		struct mempolicy *mpol, pgoff_t ilx,
+		bool single_mapped_path);
 struct folio *swapin_readahead(swp_entry_t entry, gfp_t flag,
-		struct vm_fault *vmf);
+		struct vm_fault *vmf,
+		bool single_mapped_path);
 
 static inline unsigned int folio_swap_flags(struct folio *folio)
 {
@@ -304,13 +306,15 @@  static inline void show_swap_cache_info(void)
 }
 
 static inline struct folio *swap_cluster_readahead(swp_entry_t entry,
-			gfp_t gfp_mask, struct mempolicy *mpol, pgoff_t ilx)
+			gfp_t gfp_mask, struct mempolicy *mpol, pgoff_t ilx,
+			bool single_mapped_path)
 {
 	return NULL;
 }
 
 static inline struct folio *swapin_readahead(swp_entry_t swp, gfp_t gfp_mask,
-			struct vm_fault *vmf)
+			struct vm_fault *vmf,
+			bool single_mapped_path)
 {
 	return NULL;
 }
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 0aa938e4c34d..66ea8f7f724c 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -44,6 +44,12 @@  struct address_space *swapper_spaces[MAX_SWAPFILES] __read_mostly;
 static unsigned int nr_swapper_spaces[MAX_SWAPFILES] __read_mostly;
 static bool enable_vma_readahead __read_mostly = true;
 
+/*
+ * Enable readahead in single-mapped do_swap_page() path.
+ * Set to "true" for IAA.
+ */
+static bool enable_singlemapped_readahead __read_mostly = false;
+
 #define SWAP_RA_WIN_SHIFT	(PAGE_SHIFT / 2)
 #define SWAP_RA_HITS_MASK	((1UL << SWAP_RA_WIN_SHIFT) - 1)
 #define SWAP_RA_HITS_MAX	SWAP_RA_HITS_MASK
@@ -340,6 +346,11 @@  static inline bool swap_use_vma_readahead(void)
 	return READ_ONCE(enable_vma_readahead) && !atomic_read(&nr_rotate_swap);
 }
 
+static inline bool swap_use_singlemapped_readahead(void)
+{
+	return READ_ONCE(enable_singlemapped_readahead);
+}
+
 /*
  * Lookup a swap entry in the swap cache. A found folio will be returned
  * unlocked and with its refcount incremented - we rely on the kernel
@@ -635,12 +646,49 @@  static unsigned long swapin_nr_pages(unsigned long offset)
 	return pages;
 }
 
+static void process_ra_batch_of_same_type(
+	struct zswap_decomp_batch *zswap_batch,
+	struct folio_batch *non_zswap_batch,
+	swp_entry_t targ_entry,
+	struct swap_iocb **splug)
+{
+	unsigned int i;
+
+	for (i = 0; i < folio_batch_count(non_zswap_batch); ++i) {
+		struct folio *folio = non_zswap_batch->folios[i];
+		swap_read_folio(folio, splug, NULL, NULL);
+		if (folio->swap.val != targ_entry.val) {
+			folio_set_readahead(folio);
+			count_vm_event(SWAP_RA);
+		}
+		folio_put(folio);
+	}
+
+	swap_read_zswap_batch_unplug(zswap_batch, splug);
+
+	for (i = 0; i < folio_batch_count(&zswap_batch->fbatch); ++i) {
+		struct folio *folio = zswap_batch->fbatch.folios[i];
+		if (folio->swap.val != targ_entry.val) {
+			folio_set_readahead(folio);
+			count_vm_event(SWAP_RA);
+		}
+		folio_put(folio);
+	}
+
+	folio_batch_reinit(non_zswap_batch);
+
+	zswap_load_batch_reinit(zswap_batch);
+}
+
 /**
  * swap_cluster_readahead - swap in pages in hope we need them soon
  * @entry: swap entry of this memory
  * @gfp_mask: memory allocation flags
  * @mpol: NUMA memory allocation policy to be applied
  * @ilx: NUMA interleave index, for use only when MPOL_INTERLEAVE
+ * @single_mapped_path: Called from do_swap_page() single-mapped path.
+ * Only readahead if the sys "singlemapped_ra_enabled" swap parameter
+ * is set to true.
  *
  * Returns the struct folio for entry and addr, after queueing swapin.
  *
@@ -654,7 +702,8 @@  static unsigned long swapin_nr_pages(unsigned long offset)
  * are fairly likely to have been swapped out from the same node.
  */
 struct folio *swap_cluster_readahead(swp_entry_t entry, gfp_t gfp_mask,
-				    struct mempolicy *mpol, pgoff_t ilx)
+				     struct mempolicy *mpol, pgoff_t ilx,
+				     bool single_mapped_path)
 {
 	struct folio *folio;
 	unsigned long entry_offset = swp_offset(entry);
@@ -664,12 +713,22 @@  struct folio *swap_cluster_readahead(swp_entry_t entry, gfp_t gfp_mask,
 	struct swap_info_struct *si = swp_swap_info(entry);
 	struct blk_plug plug;
 	struct swap_iocb *splug = NULL;
+	struct zswap_decomp_batch zswap_batch;
+	struct folio_batch non_zswap_batch;
 	bool page_allocated;
 
+	if (single_mapped_path &&
+		(!swap_use_singlemapped_readahead() ||
+		 !zswap_load_batching_enabled()))
+		goto skip;
+
 	mask = swapin_nr_pages(offset) - 1;
 	if (!mask)
 		goto skip;
 
+	zswap_load_batch_init(&zswap_batch);
+	folio_batch_init(&non_zswap_batch);
+
 	/* Read a page_cluster sized and aligned cluster around offset. */
 	start_offset = offset & ~mask;
 	end_offset = offset | mask;
@@ -678,6 +737,7 @@  struct folio *swap_cluster_readahead(swp_entry_t entry, gfp_t gfp_mask,
 	if (end_offset >= si->max)
 		end_offset = si->max - 1;
 
+	/* Note that all swap entries readahead are of the same swap type. */
 	blk_start_plug(&plug);
 	for (offset = start_offset; offset <= end_offset ; offset++) {
 		/* Ok, do the async read-ahead now */
@@ -687,14 +747,22 @@  struct folio *swap_cluster_readahead(swp_entry_t entry, gfp_t gfp_mask,
 		if (!folio)
 			continue;
 		if (page_allocated) {
-			swap_read_folio(folio, &splug, NULL, NULL);
-			if (offset != entry_offset) {
-				folio_set_readahead(folio);
-				count_vm_event(SWAP_RA);
+			if (swap_read_folio(folio, &splug,
+					    &zswap_batch, &non_zswap_batch)) {
+				if (offset != entry_offset) {
+					folio_set_readahead(folio);
+					count_vm_event(SWAP_RA);
+				}
+				folio_put(folio);
 			}
+		} else {
+			folio_put(folio);
 		}
-		folio_put(folio);
 	}
+
+	process_ra_batch_of_same_type(&zswap_batch, &non_zswap_batch,
+				      entry, &splug);
+
 	blk_finish_plug(&plug);
 	swap_read_unplug(splug);
 	lru_add_drain();	/* Push any new pages onto the LRU now */
@@ -1009,6 +1077,9 @@  static int swap_vma_ra_win(struct vm_fault *vmf, unsigned long *start,
  * @mpol: NUMA memory allocation policy to be applied
  * @targ_ilx: NUMA interleave index, for use only when MPOL_INTERLEAVE
  * @vmf: fault information
+ * @single_mapped_path: Called from do_swap_page() single-mapped path.
+ * Only readahead if the sys "singlemapped_ra_enabled" swap parameter
+ * is set to true.
  *
  * Returns the struct folio for entry and addr, after queueing swapin.
  *
@@ -1019,10 +1090,14 @@  static int swap_vma_ra_win(struct vm_fault *vmf, unsigned long *start,
  *
  */
 static struct folio *swap_vma_readahead(swp_entry_t targ_entry, gfp_t gfp_mask,
-		struct mempolicy *mpol, pgoff_t targ_ilx, struct vm_fault *vmf)
+		struct mempolicy *mpol, pgoff_t targ_ilx, struct vm_fault *vmf,
+		bool single_mapped_path)
 {
 	struct blk_plug plug;
 	struct swap_iocb *splug = NULL;
+	struct zswap_decomp_batch zswap_batch;
+	struct folio_batch non_zswap_batch;
+	int type = -1, prev_type = -1;
 	struct folio *folio;
 	pte_t *pte = NULL, pentry;
 	int win;
@@ -1031,10 +1106,18 @@  static struct folio *swap_vma_readahead(swp_entry_t targ_entry, gfp_t gfp_mask,
 	pgoff_t ilx;
 	bool page_allocated;
 
+	if (single_mapped_path &&
+		(!swap_use_singlemapped_readahead() ||
+		 !zswap_load_batching_enabled()))
+		goto skip;
+
 	win = swap_vma_ra_win(vmf, &start, &end);
 	if (win == 1)
 		goto skip;
 
+	zswap_load_batch_init(&zswap_batch);
+	folio_batch_init(&non_zswap_batch);
+
 	ilx = targ_ilx - PFN_DOWN(vmf->address - start);
 
 	blk_start_plug(&plug);
@@ -1057,16 +1140,38 @@  static struct folio *swap_vma_readahead(swp_entry_t targ_entry, gfp_t gfp_mask,
 		if (!folio)
 			continue;
 		if (page_allocated) {
-			swap_read_folio(folio, &splug, NULL, NULL);
-			if (addr != vmf->address) {
-				folio_set_readahead(folio);
-				count_vm_event(SWAP_RA);
+			type = swp_type(entry);
+
+			/*
+			 * Process this sub-batch before switching to
+			 * another swap device type.
+			 */
+			if ((prev_type >= 0) && (type != prev_type))
+				process_ra_batch_of_same_type(&zswap_batch,
+							      &non_zswap_batch,
+							      targ_entry,
+							      &splug);
+
+			if (swap_read_folio(folio, &splug,
+					    &zswap_batch, &non_zswap_batch)) {
+				if (addr != vmf->address) {
+					folio_set_readahead(folio);
+					count_vm_event(SWAP_RA);
+				}
+				folio_put(folio);
 			}
+
+			prev_type = type;
+		} else {
+			folio_put(folio);
 		}
-		folio_put(folio);
 	}
 	if (pte)
 		pte_unmap(pte);
+
+	process_ra_batch_of_same_type(&zswap_batch, &non_zswap_batch,
+				      targ_entry, &splug);
+
 	blk_finish_plug(&plug);
 	swap_read_unplug(splug);
 	lru_add_drain();
@@ -1092,7 +1197,7 @@  static struct folio *swap_vma_readahead(swp_entry_t targ_entry, gfp_t gfp_mask,
  * or vma-based(ie, virtual address based on faulty address) readahead.
  */
 struct folio *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
-				struct vm_fault *vmf)
+				struct vm_fault *vmf, bool single_mapped_path)
 {
 	struct mempolicy *mpol;
 	pgoff_t ilx;
@@ -1100,8 +1205,10 @@  struct folio *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
 
 	mpol = get_vma_policy(vmf->vma, vmf->address, 0, &ilx);
 	folio = swap_use_vma_readahead() ?
-		swap_vma_readahead(entry, gfp_mask, mpol, ilx, vmf) :
-		swap_cluster_readahead(entry, gfp_mask, mpol, ilx);
+		swap_vma_readahead(entry, gfp_mask, mpol, ilx, vmf,
+				   single_mapped_path) :
+		swap_cluster_readahead(entry, gfp_mask, mpol, ilx,
+				       single_mapped_path);
 	mpol_cond_put(mpol);
 
 	return folio;
@@ -1126,10 +1233,30 @@  static ssize_t vma_ra_enabled_store(struct kobject *kobj,
 
 	return count;
 }
+static ssize_t singlemapped_ra_enabled_show(struct kobject *kobj,
+				     struct kobj_attribute *attr, char *buf)
+{
+	return sysfs_emit(buf, "%s\n",
+			  enable_singlemapped_readahead ? "true" : "false");
+}
+static ssize_t singlemapped_ra_enabled_store(struct kobject *kobj,
+				      struct kobj_attribute *attr,
+				      const char *buf, size_t count)
+{
+	ssize_t ret;
+
+	ret = kstrtobool(buf, &enable_singlemapped_readahead);
+	if (ret)
+		return ret;
+
+	return count;
+}
 static struct kobj_attribute vma_ra_enabled_attr = __ATTR_RW(vma_ra_enabled);
+static struct kobj_attribute singlemapped_ra_enabled_attr = __ATTR_RW(singlemapped_ra_enabled);
 
 static struct attribute *swap_attrs[] = {
 	&vma_ra_enabled_attr.attr,
+	&singlemapped_ra_enabled_attr.attr,
 	NULL,
 };
 
diff --git a/mm/swapfile.c b/mm/swapfile.c
index b0915f3fab31..10367eaee1ff 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -2197,7 +2197,7 @@  static int unuse_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 			};
 
 			folio = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE,
-						&vmf);
+						&vmf, false);
 		}
 		if (!folio) {
 			swp_count = READ_ONCE(si->swap_map[offset]);