diff mbox series

[RFC,v3,5/5] mm: support large folios swapin as a whole

Message ID 20240304081348.197341-6-21cnbao@gmail.com (mailing list archive)
State New
Headers show
Series mm: support large folios swap-in | expand

Commit Message

Barry Song March 4, 2024, 8:13 a.m. UTC
From: Chuanhua Han <hanchuanhua@oppo.com>

On an embedded system like Android, more than half of anon memory is
actually in swap devices such as zRAM. For example, while an app is
switched to background, its most memory might be swapped-out.

Now we have mTHP features, unfortunately, if we don't support large folios
swap-in, once those large folios are swapped-out, we immediately lose the
performance gain we can get through large folios and hardware optimization
such as CONT-PTE.

This patch brings up mTHP swap-in support. Right now, we limit mTHP swap-in
to those contiguous swaps which were likely swapped out from mTHP as a
whole.

Meanwhile, the current implementation only covers the SWAP_SYCHRONOUS
case. It doesn't support swapin_readahead as large folios yet since this
kind of shared memory is much less than memory mapped by single process.

Right now, we are re-faulting large folios which are still in swapcache as a
whole, this can effectively decrease extra loops and early-exitings which we
have increased in arch_swap_restore() while supporting MTE restore for folios
rather than page. On the other hand, it can also decrease do_swap_page as
PTEs used to be set one by one even we hit a large folio in swapcache.

Signed-off-by: Chuanhua Han <hanchuanhua@oppo.com>
Co-developed-by: Barry Song <v-songbaohua@oppo.com>
Signed-off-by: Barry Song <v-songbaohua@oppo.com>
---
 mm/memory.c | 250 ++++++++++++++++++++++++++++++++++++++++++++--------
 1 file changed, 212 insertions(+), 38 deletions(-)

Comments

Ryan Roberts March 12, 2024, 4:33 p.m. UTC | #1
On 04/03/2024 08:13, Barry Song wrote:
> From: Chuanhua Han <hanchuanhua@oppo.com>
> 
> On an embedded system like Android, more than half of anon memory is
> actually in swap devices such as zRAM. For example, while an app is
> switched to background, its most memory might be swapped-out.
> 
> Now we have mTHP features, unfortunately, if we don't support large folios
> swap-in, once those large folios are swapped-out, we immediately lose the
> performance gain we can get through large folios and hardware optimization
> such as CONT-PTE.
> 
> This patch brings up mTHP swap-in support. Right now, we limit mTHP swap-in
> to those contiguous swaps which were likely swapped out from mTHP as a
> whole.
> 
> Meanwhile, the current implementation only covers the SWAP_SYCHRONOUS
> case. It doesn't support swapin_readahead as large folios yet since this
> kind of shared memory is much less than memory mapped by single process.
> 
> Right now, we are re-faulting large folios which are still in swapcache as a
> whole, this can effectively decrease extra loops and early-exitings which we
> have increased in arch_swap_restore() while supporting MTE restore for folios
> rather than page. On the other hand, it can also decrease do_swap_page as
> PTEs used to be set one by one even we hit a large folio in swapcache.
> 
> Signed-off-by: Chuanhua Han <hanchuanhua@oppo.com>
> Co-developed-by: Barry Song <v-songbaohua@oppo.com>
> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> ---
>  mm/memory.c | 250 ++++++++++++++++++++++++++++++++++++++++++++--------
>  1 file changed, 212 insertions(+), 38 deletions(-)
> 
> diff --git a/mm/memory.c b/mm/memory.c
> index e0d34d705e07..501ede745ef3 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3907,6 +3907,136 @@ static vm_fault_t handle_pte_marker(struct vm_fault *vmf)
>  	return VM_FAULT_SIGBUS;
>  }
>  
> +/*
> + * check a range of PTEs are completely swap entries with
> + * contiguous swap offsets and the same SWAP_HAS_CACHE.
> + * pte must be first one in the range
> + */
> +static bool is_pte_range_contig_swap(pte_t *pte, int nr_pages)
> +{
> +	int i;
> +	struct swap_info_struct *si;
> +	swp_entry_t entry;
> +	unsigned type;
> +	pgoff_t start_offset;
> +	char has_cache;
> +
> +	entry = pte_to_swp_entry(ptep_get_lockless(pte));

Given you are getting entry locklessly, I expect it could change under you? So
probably need to check that its a swap entry, etc. first?

> +	if (non_swap_entry(entry))
> +		return false;
> +	start_offset = swp_offset(entry);
> +	if (start_offset % nr_pages)
> +		return false;
> +
> +	si = swp_swap_info(entry);

What ensures si remains valid (i.e. swapoff can't happen)? If swapoff can race,
then swap_map may have been freed when you read it below. Holding the PTL can
sometimes prevent it, but I don't think you're holding that here (you're using
ptep_get_lockless(). Perhaps get_swap_device()/put_swap_device() can help?

> +	type = swp_type(entry);
> +	has_cache = si->swap_map[start_offset] & SWAP_HAS_CACHE;
> +	for (i = 1; i < nr_pages; i++) {
> +		entry = pte_to_swp_entry(ptep_get_lockless(pte + i));
> +		if (non_swap_entry(entry))
> +			return false;
> +		if (swp_offset(entry) != start_offset + i)
> +			return false;
> +		if (swp_type(entry) != type)
> +			return false;
> +		/*
> +		 * while allocating a large folio and doing swap_read_folio for the
> +		 * SWP_SYNCHRONOUS_IO path, which is the case the being faulted pte
> +		 * doesn't have swapcache. We need to ensure all PTEs have no cache
> +		 * as well, otherwise, we might go to swap devices while the content
> +		 * is in swapcache
> +		 */
> +		if ((si->swap_map[start_offset + i] & SWAP_HAS_CACHE) != has_cache)
> +			return false;
> +	}
> +
> +	return true;
> +}

I created swap_pte_batch() for the swap-out series [1]. I wonder if that could
be extended for the SWAP_HAS_CACHE checks? Possibly not because it assumes the
PTL is held, and you are lockless here. Thought it might be of interest though.

[1] https://lore.kernel.org/linux-mm/20240311150058.1122862-3-ryan.roberts@arm.com/

> +
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +/*
> + * Get a list of all the (large) orders below PMD_ORDER that are enabled
> + * for this vma. Then filter out the orders that can't be allocated over
> + * the faulting address and still be fully contained in the vma.
> + */
> +static inline unsigned long get_alloc_folio_orders(struct vm_fault *vmf)
> +{
> +	struct vm_area_struct *vma = vmf->vma;
> +	unsigned long orders;
> +
> +	orders = thp_vma_allowable_orders(vma, vma->vm_flags, false, true, true,
> +					  BIT(PMD_ORDER) - 1);
> +	orders = thp_vma_suitable_orders(vma, vmf->address, orders);
> +	return orders;
> +}
> +#endif
> +
> +static struct folio *alloc_swap_folio(struct vm_fault *vmf)
> +{
> +	struct vm_area_struct *vma = vmf->vma;
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +	unsigned long orders;
> +	struct folio *folio;
> +	unsigned long addr;
> +	pte_t *pte;
> +	gfp_t gfp;
> +	int order;
> +
> +	/*
> +	 * If uffd is active for the vma we need per-page fault fidelity to
> +	 * maintain the uffd semantics.
> +	 */
> +	if (unlikely(userfaultfd_armed(vma)))
> +		goto fallback;
> +
> +	/*
> +	 * a large folio being swapped-in could be partially in
> +	 * zswap and partially in swap devices, zswap doesn't
> +	 * support large folios yet, we might get corrupted
> +	 * zero-filled data by reading all subpages from swap
> +	 * devices while some of them are actually in zswap
> +	 */
> +	if (is_zswap_enabled())
> +		goto fallback;
> +
> +	orders = get_alloc_folio_orders(vmf);
> +	if (!orders)
> +		goto fallback;
> +
> +	pte = pte_offset_map(vmf->pmd, vmf->address & PMD_MASK);

Could also briefly take PTL here, then is_pte_range_contig_swap() could be
merged with an enhanced swap_pte_batch()?

> +	if (unlikely(!pte))
> +		goto fallback;
> +
> +	/*
> +	 * For do_swap_page, find the highest order where the aligned range is
> +	 * completely swap entries with contiguous swap offsets.
> +	 */
> +	order = highest_order(orders);
> +	while (orders) {
> +		addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
> +		if (is_pte_range_contig_swap(pte + pte_index(addr), 1 << order))
> +			break;
> +		order = next_order(&orders, order);
> +	}

So in the common case, swap-in will pull in the same size of folio as was
swapped-out. Is that definitely the right policy for all folio sizes? Certainly
it makes sense for "small" large folios (e.g. up to 64K IMHO). But I'm not sure
it makes sense for 2M THP; As the size increases the chances of actually needing
all of the folio reduces so chances are we are wasting IO. There are similar
arguments for CoW, where we currently copy 1 page per fault - it probably makes
sense to copy the whole folio up to a certain size.

Thanks,
Ryan

> +
> +	pte_unmap(pte);
> +
> +	/* Try allocating the highest of the remaining orders. */
> +	gfp = vma_thp_gfp_mask(vma);
> +	while (orders) {
> +		addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
> +		folio = vma_alloc_folio(gfp, order, vma, addr, true);
> +		if (folio)
> +			return folio;
> +		order = next_order(&orders, order);
> +	}
> +
> +fallback:
> +#endif
> +	return vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0, vma, vmf->address, false);
> +}
> +
> +
>  /*
>   * We enter with non-exclusive mmap_lock (to exclude vma changes,
>   * but allow concurrent faults), and pte mapped but not yet locked.
> @@ -3928,6 +4058,9 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>  	pte_t pte;
>  	vm_fault_t ret = 0;
>  	void *shadow = NULL;
> +	int nr_pages = 1;
> +	unsigned long start_address;
> +	pte_t *start_pte;
>  
>  	if (!pte_unmap_same(vmf))
>  		goto out;
> @@ -3991,35 +4124,41 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>  	if (!folio) {
>  		if (data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
>  		    __swap_count(entry) == 1) {
> -			/*
> -			 * Prevent parallel swapin from proceeding with
> -			 * the cache flag. Otherwise, another thread may
> -			 * finish swapin first, free the entry, and swapout
> -			 * reusing the same entry. It's undetectable as
> -			 * pte_same() returns true due to entry reuse.
> -			 */
> -			if (swapcache_prepare(entry)) {
> -				/* Relax a bit to prevent rapid repeated page faults */
> -				schedule_timeout_uninterruptible(1);
> -				goto out;
> -			}
> -			need_clear_cache = true;
> -
>  			/* skip swapcache */
> -			folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0,
> -						vma, vmf->address, false);
> +			folio = alloc_swap_folio(vmf);
>  			page = &folio->page;
>  			if (folio) {
>  				__folio_set_locked(folio);
>  				__folio_set_swapbacked(folio);
>  
> +				if (folio_test_large(folio)) {
> +					nr_pages = folio_nr_pages(folio);
> +					entry.val = ALIGN_DOWN(entry.val, nr_pages);
> +				}
> +
> +				/*
> +				 * Prevent parallel swapin from proceeding with
> +				 * the cache flag. Otherwise, another thread may
> +				 * finish swapin first, free the entry, and swapout
> +				 * reusing the same entry. It's undetectable as
> +				 * pte_same() returns true due to entry reuse.
> +				 */
> +				if (swapcache_prepare_nr(entry, nr_pages)) {
> +					/* Relax a bit to prevent rapid repeated page faults */
> +					schedule_timeout_uninterruptible(1);
> +					goto out;
> +				}
> +				need_clear_cache = true;
> +
>  				if (mem_cgroup_swapin_charge_folio(folio,
>  							vma->vm_mm, GFP_KERNEL,
>  							entry)) {
>  					ret = VM_FAULT_OOM;
>  					goto out_page;
>  				}
> -				mem_cgroup_swapin_uncharge_swap(entry);
> +
> +				for (swp_entry_t e = entry; e.val < entry.val + nr_pages; e.val++)
> +					mem_cgroup_swapin_uncharge_swap(e);
>  
>  				shadow = get_shadow_from_swap_cache(entry);
>  				if (shadow)
> @@ -4118,6 +4257,42 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>  	 */
>  	vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
>  			&vmf->ptl);
> +
> +	start_address = vmf->address;
> +	start_pte = vmf->pte;
> +	if (start_pte && folio_test_large(folio)) {
> +		unsigned long nr = folio_nr_pages(folio);
> +		unsigned long addr = ALIGN_DOWN(vmf->address, nr * PAGE_SIZE);
> +		pte_t *aligned_pte = vmf->pte - (vmf->address - addr) / PAGE_SIZE;
> +
> +		/*
> +		 * case 1: we are allocating large_folio, try to map it as a whole
> +		 * iff the swap entries are still entirely mapped;
> +		 * case 2: we hit a large folio in swapcache, and all swap entries
> +		 * are still entirely mapped, try to map a large folio as a whole.
> +		 * otherwise, map only the faulting page within the large folio
> +		 * which is swapcache
> +		 */
> +		if (!is_pte_range_contig_swap(aligned_pte, nr)) {
> +			if (nr_pages > 1) /* ptes have changed for case 1 */
> +				goto out_nomap;
> +			goto check_pte;
> +		}
> +
> +		start_address = addr;
> +		start_pte = aligned_pte;
> +		/*
> +		 * the below has been done before swap_read_folio()
> +		 * for case 1
> +		 */
> +		if (unlikely(folio == swapcache)) {
> +			nr_pages = nr;
> +			entry.val = ALIGN_DOWN(entry.val, nr_pages);
> +			page = &folio->page;
> +		}
> +	}
> +
> +check_pte:
>  	if (unlikely(!vmf->pte || !pte_same(ptep_get(vmf->pte), vmf->orig_pte)))
>  		goto out_nomap;
>  
> @@ -4185,12 +4360,14 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>  	 * We're already holding a reference on the page but haven't mapped it
>  	 * yet.
>  	 */
> -	swap_free(entry);
> +	swap_nr_free(entry, nr_pages);
>  	if (should_try_to_free_swap(folio, vma, vmf->flags))
>  		folio_free_swap(folio);
>  
> -	inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
> -	dec_mm_counter(vma->vm_mm, MM_SWAPENTS);
> +	folio_ref_add(folio, nr_pages - 1);
> +	add_mm_counter(vma->vm_mm, MM_ANONPAGES, nr_pages);
> +	add_mm_counter(vma->vm_mm, MM_SWAPENTS, -nr_pages);
> +
>  	pte = mk_pte(page, vma->vm_page_prot);
>  
>  	/*
> @@ -4200,14 +4377,14 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>  	 * exclusivity.
>  	 */
>  	if (!folio_test_ksm(folio) &&
> -	    (exclusive || folio_ref_count(folio) == 1)) {
> +	    (exclusive || folio_ref_count(folio) == nr_pages)) {
>  		if (vmf->flags & FAULT_FLAG_WRITE) {
>  			pte = maybe_mkwrite(pte_mkdirty(pte), vma);
>  			vmf->flags &= ~FAULT_FLAG_WRITE;
>  		}
>  		rmap_flags |= RMAP_EXCLUSIVE;
>  	}
> -	flush_icache_page(vma, page);
> +	flush_icache_pages(vma, page, nr_pages);
>  	if (pte_swp_soft_dirty(vmf->orig_pte))
>  		pte = pte_mksoft_dirty(pte);
>  	if (pte_swp_uffd_wp(vmf->orig_pte))
> @@ -4216,17 +4393,19 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>  
>  	/* ksm created a completely new copy */
>  	if (unlikely(folio != swapcache && swapcache)) {
> -		folio_add_new_anon_rmap(folio, vma, vmf->address);
> +		folio_add_new_anon_rmap(folio, vma, start_address);
>  		folio_add_lru_vma(folio, vma);
> +	} else if (!folio_test_anon(folio)) {
> +		folio_add_new_anon_rmap(folio, vma, start_address);
>  	} else {
> -		folio_add_anon_rmap_pte(folio, page, vma, vmf->address,
> +		folio_add_anon_rmap_ptes(folio, page, nr_pages, vma, start_address,
>  					rmap_flags);
>  	}
>  
>  	VM_BUG_ON(!folio_test_anon(folio) ||
>  			(pte_write(pte) && !PageAnonExclusive(page)));
> -	set_pte_at(vma->vm_mm, vmf->address, vmf->pte, pte);
> -	arch_do_swap_page(vma->vm_mm, vma, vmf->address, pte, vmf->orig_pte);
> +	set_ptes(vma->vm_mm, start_address, start_pte, pte, nr_pages);
> +	arch_do_swap_page(vma->vm_mm, vma, start_address, pte, vmf->orig_pte);
>  
>  	folio_unlock(folio);
>  	if (folio != swapcache && swapcache) {
> @@ -4243,6 +4422,9 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>  	}
>  
>  	if (vmf->flags & FAULT_FLAG_WRITE) {
> +		if (nr_pages > 1)
> +			vmf->orig_pte = ptep_get(vmf->pte);
> +
>  		ret |= do_wp_page(vmf);
>  		if (ret & VM_FAULT_ERROR)
>  			ret &= VM_FAULT_ERROR;
> @@ -4250,14 +4432,14 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>  	}
>  
>  	/* No need to invalidate - it was non-present before */
> -	update_mmu_cache_range(vmf, vma, vmf->address, vmf->pte, 1);
> +	update_mmu_cache_range(vmf, vma, start_address, start_pte, nr_pages);
>  unlock:
>  	if (vmf->pte)
>  		pte_unmap_unlock(vmf->pte, vmf->ptl);
>  out:
>  	/* Clear the swap cache pin for direct swapin after PTL unlock */
>  	if (need_clear_cache)
> -		swapcache_clear(si, entry);
> +		swapcache_clear_nr(si, entry, nr_pages);
>  	if (si)
>  		put_swap_device(si);
>  	return ret;
> @@ -4273,7 +4455,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>  		folio_put(swapcache);
>  	}
>  	if (need_clear_cache)
> -		swapcache_clear(si, entry);
> +		swapcache_clear_nr(si, entry, nr_pages);
>  	if (si)
>  		put_swap_device(si);
>  	return ret;
> @@ -4309,15 +4491,7 @@ static struct folio *alloc_anon_folio(struct vm_fault *vmf)
>  	if (unlikely(userfaultfd_armed(vma)))
>  		goto fallback;
>  
> -	/*
> -	 * Get a list of all the (large) orders below PMD_ORDER that are enabled
> -	 * for this vma. Then filter out the orders that can't be allocated over
> -	 * the faulting address and still be fully contained in the vma.
> -	 */
> -	orders = thp_vma_allowable_orders(vma, vma->vm_flags, false, true, true,
> -					  BIT(PMD_ORDER) - 1);
> -	orders = thp_vma_suitable_orders(vma, vmf->address, orders);
> -
> +	orders = get_alloc_folio_orders(vmf);
>  	if (!orders)
>  		goto fallback;
>
Chuanhua Han March 14, 2024, 12:56 p.m. UTC | #2
Ryan Roberts <ryan.roberts@arm.com> 于2024年3月13日周三 00:33写道:
>
> On 04/03/2024 08:13, Barry Song wrote:
> > From: Chuanhua Han <hanchuanhua@oppo.com>
> >
> > On an embedded system like Android, more than half of anon memory is
> > actually in swap devices such as zRAM. For example, while an app is
> > switched to background, its most memory might be swapped-out.
> >
> > Now we have mTHP features, unfortunately, if we don't support large folios
> > swap-in, once those large folios are swapped-out, we immediately lose the
> > performance gain we can get through large folios and hardware optimization
> > such as CONT-PTE.
> >
> > This patch brings up mTHP swap-in support. Right now, we limit mTHP swap-in
> > to those contiguous swaps which were likely swapped out from mTHP as a
> > whole.
> >
> > Meanwhile, the current implementation only covers the SWAP_SYCHRONOUS
> > case. It doesn't support swapin_readahead as large folios yet since this
> > kind of shared memory is much less than memory mapped by single process.
> >
> > Right now, we are re-faulting large folios which are still in swapcache as a
> > whole, this can effectively decrease extra loops and early-exitings which we
> > have increased in arch_swap_restore() while supporting MTE restore for folios
> > rather than page. On the other hand, it can also decrease do_swap_page as
> > PTEs used to be set one by one even we hit a large folio in swapcache.
> >
> > Signed-off-by: Chuanhua Han <hanchuanhua@oppo.com>
> > Co-developed-by: Barry Song <v-songbaohua@oppo.com>
> > Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> > ---
> >  mm/memory.c | 250 ++++++++++++++++++++++++++++++++++++++++++++--------
> >  1 file changed, 212 insertions(+), 38 deletions(-)
> >
> > diff --git a/mm/memory.c b/mm/memory.c
> > index e0d34d705e07..501ede745ef3 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -3907,6 +3907,136 @@ static vm_fault_t handle_pte_marker(struct vm_fault *vmf)
> >       return VM_FAULT_SIGBUS;
> >  }
> >
> > +/*
> > + * check a range of PTEs are completely swap entries with
> > + * contiguous swap offsets and the same SWAP_HAS_CACHE.
> > + * pte must be first one in the range
> > + */
> > +static bool is_pte_range_contig_swap(pte_t *pte, int nr_pages)
> > +{
> > +     int i;
> > +     struct swap_info_struct *si;
> > +     swp_entry_t entry;
> > +     unsigned type;
> > +     pgoff_t start_offset;
> > +     char has_cache;
> > +
> > +     entry = pte_to_swp_entry(ptep_get_lockless(pte));
>
> Given you are getting entry locklessly, I expect it could change under you? So
> probably need to check that its a swap entry, etc. first?
The following non_swap_entry checks to see if it is a swap entry.
>
> > +     if (non_swap_entry(entry))
> > +             return false;
> > +     start_offset = swp_offset(entry);
> > +     if (start_offset % nr_pages)
> > +             return false;
> > +
> > +     si = swp_swap_info(entry);
>
> What ensures si remains valid (i.e. swapoff can't happen)? If swapoff can race,
> then swap_map may have been freed when you read it below. Holding the PTL can
> sometimes prevent it, but I don't think you're holding that here (you're using
> ptep_get_lockless(). Perhaps get_swap_device()/put_swap_device() can help?
Thank you for your review,you are righit! this place reaally needs
get_swap_device()/put_swap_device().
>
> > +     type = swp_type(entry);
> > +     has_cache = si->swap_map[start_offset] & SWAP_HAS_CACHE;
> > +     for (i = 1; i < nr_pages; i++) {
> > +             entry = pte_to_swp_entry(ptep_get_lockless(pte + i));
> > +             if (non_swap_entry(entry))
> > +                     return false;
> > +             if (swp_offset(entry) != start_offset + i)
> > +                     return false;
> > +             if (swp_type(entry) != type)
> > +                     return false;
> > +             /*
> > +              * while allocating a large folio and doing swap_read_folio for the
> > +              * SWP_SYNCHRONOUS_IO path, which is the case the being faulted pte
> > +              * doesn't have swapcache. We need to ensure all PTEs have no cache
> > +              * as well, otherwise, we might go to swap devices while the content
> > +              * is in swapcache
> > +              */
> > +             if ((si->swap_map[start_offset + i] & SWAP_HAS_CACHE) != has_cache)
> > +                     return false;
> > +     }
> > +
> > +     return true;
> > +}
>
> I created swap_pte_batch() for the swap-out series [1]. I wonder if that could
> be extended for the SWAP_HAS_CACHE checks? Possibly not because it assumes the
> PTL is held, and you are lockless here. Thought it might be of interest though.
>
> [1] https://lore.kernel.org/linux-mm/20240311150058.1122862-3-ryan.roberts@arm.com/
>
Thanks. It's probably simily to ours, but as you said we are lockless
here, and we need to check has_cache.
> > +
> > +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> > +/*
> > + * Get a list of all the (large) orders below PMD_ORDER that are enabled
> > + * for this vma. Then filter out the orders that can't be allocated over
> > + * the faulting address and still be fully contained in the vma.
> > + */
> > +static inline unsigned long get_alloc_folio_orders(struct vm_fault *vmf)
> > +{
> > +     struct vm_area_struct *vma = vmf->vma;
> > +     unsigned long orders;
> > +
> > +     orders = thp_vma_allowable_orders(vma, vma->vm_flags, false, true, true,
> > +                                       BIT(PMD_ORDER) - 1);
> > +     orders = thp_vma_suitable_orders(vma, vmf->address, orders);
> > +     return orders;
> > +}
> > +#endif
> > +
> > +static struct folio *alloc_swap_folio(struct vm_fault *vmf)
> > +{
> > +     struct vm_area_struct *vma = vmf->vma;
> > +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> > +     unsigned long orders;
> > +     struct folio *folio;
> > +     unsigned long addr;
> > +     pte_t *pte;
> > +     gfp_t gfp;
> > +     int order;
> > +
> > +     /*
> > +      * If uffd is active for the vma we need per-page fault fidelity to
> > +      * maintain the uffd semantics.
> > +      */
> > +     if (unlikely(userfaultfd_armed(vma)))
> > +             goto fallback;
> > +
> > +     /*
> > +      * a large folio being swapped-in could be partially in
> > +      * zswap and partially in swap devices, zswap doesn't
> > +      * support large folios yet, we might get corrupted
> > +      * zero-filled data by reading all subpages from swap
> > +      * devices while some of them are actually in zswap
> > +      */
> > +     if (is_zswap_enabled())
> > +             goto fallback;
> > +
> > +     orders = get_alloc_folio_orders(vmf);
> > +     if (!orders)
> > +             goto fallback;
> > +
> > +     pte = pte_offset_map(vmf->pmd, vmf->address & PMD_MASK);
>
> Could also briefly take PTL here, then is_pte_range_contig_swap() could be
> merged with an enhanced swap_pte_batch()?
Yes, it's easy to use a lock here, but I'm wondering if it's
necessary, because when we actually set pte in do_swap_page, we'll
hold PTL to check if the pte changes.
>
> > +     if (unlikely(!pte))
> > +             goto fallback;
> > +
> > +     /*
> > +      * For do_swap_page, find the highest order where the aligned range is
> > +      * completely swap entries with contiguous swap offsets.
> > +      */
> > +     order = highest_order(orders);
> > +     while (orders) {
> > +             addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
> > +             if (is_pte_range_contig_swap(pte + pte_index(addr), 1 << order))
> > +                     break;
> > +             order = next_order(&orders, order);
> > +     }
>
> So in the common case, swap-in will pull in the same size of folio as was
> swapped-out. Is that definitely the right policy for all folio sizes? Certainly
> it makes sense for "small" large folios (e.g. up to 64K IMHO). But I'm not sure
> it makes sense for 2M THP; As the size increases the chances of actually needing
> all of the folio reduces so chances are we are wasting IO. There are similar
> arguments for CoW, where we currently copy 1 page per fault - it probably makes
> sense to copy the whole folio up to a certain size.
For 2M THP, IO overhead may not necessarily be large? :)
1.If 2M THP are continuously stored in the swap device, the IO
overhead may not be very large (such as submitting bio with one
bio_vec at a time).
2.If the process really needs this 2M data, one page-fault may perform
much better than multiple.
3.For swap devices like zram,using 2M THP might also improve
decompression efficiency.

On the other hand, if the process only needs a small part of the 2M
data (such as only frequent use of 4K page, the rest of the data is
never accessed), This is indeed give a lark to catch a kite!  :(
>
> Thanks,
> Ryan
>
> > +
> > +     pte_unmap(pte);
> > +
> > +     /* Try allocating the highest of the remaining orders. */
> > +     gfp = vma_thp_gfp_mask(vma);
> > +     while (orders) {
> > +             addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
> > +             folio = vma_alloc_folio(gfp, order, vma, addr, true);
> > +             if (folio)
> > +                     return folio;
> > +             order = next_order(&orders, order);
> > +     }
> > +
> > +fallback:
> > +#endif
> > +     return vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0, vma, vmf->address, false);
> > +}
> > +
> > +
> >  /*
> >   * We enter with non-exclusive mmap_lock (to exclude vma changes,
> >   * but allow concurrent faults), and pte mapped but not yet locked.
> > @@ -3928,6 +4058,9 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >       pte_t pte;
> >       vm_fault_t ret = 0;
> >       void *shadow = NULL;
> > +     int nr_pages = 1;
> > +     unsigned long start_address;
> > +     pte_t *start_pte;
> >
> >       if (!pte_unmap_same(vmf))
> >               goto out;
> > @@ -3991,35 +4124,41 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >       if (!folio) {
> >               if (data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
> >                   __swap_count(entry) == 1) {
> > -                     /*
> > -                      * Prevent parallel swapin from proceeding with
> > -                      * the cache flag. Otherwise, another thread may
> > -                      * finish swapin first, free the entry, and swapout
> > -                      * reusing the same entry. It's undetectable as
> > -                      * pte_same() returns true due to entry reuse.
> > -                      */
> > -                     if (swapcache_prepare(entry)) {
> > -                             /* Relax a bit to prevent rapid repeated page faults */
> > -                             schedule_timeout_uninterruptible(1);
> > -                             goto out;
> > -                     }
> > -                     need_clear_cache = true;
> > -
> >                       /* skip swapcache */
> > -                     folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0,
> > -                                             vma, vmf->address, false);
> > +                     folio = alloc_swap_folio(vmf);
> >                       page = &folio->page;
> >                       if (folio) {
> >                               __folio_set_locked(folio);
> >                               __folio_set_swapbacked(folio);
> >
> > +                             if (folio_test_large(folio)) {
> > +                                     nr_pages = folio_nr_pages(folio);
> > +                                     entry.val = ALIGN_DOWN(entry.val, nr_pages);
> > +                             }
> > +
> > +                             /*
> > +                              * Prevent parallel swapin from proceeding with
> > +                              * the cache flag. Otherwise, another thread may
> > +                              * finish swapin first, free the entry, and swapout
> > +                              * reusing the same entry. It's undetectable as
> > +                              * pte_same() returns true due to entry reuse.
> > +                              */
> > +                             if (swapcache_prepare_nr(entry, nr_pages)) {
> > +                                     /* Relax a bit to prevent rapid repeated page faults */
> > +                                     schedule_timeout_uninterruptible(1);
> > +                                     goto out;
> > +                             }
> > +                             need_clear_cache = true;
> > +
> >                               if (mem_cgroup_swapin_charge_folio(folio,
> >                                                       vma->vm_mm, GFP_KERNEL,
> >                                                       entry)) {
> >                                       ret = VM_FAULT_OOM;
> >                                       goto out_page;
> >                               }
> > -                             mem_cgroup_swapin_uncharge_swap(entry);
> > +
> > +                             for (swp_entry_t e = entry; e.val < entry.val + nr_pages; e.val++)
> > +                                     mem_cgroup_swapin_uncharge_swap(e);
> >
> >                               shadow = get_shadow_from_swap_cache(entry);
> >                               if (shadow)
> > @@ -4118,6 +4257,42 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >        */
> >       vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
> >                       &vmf->ptl);
> > +
> > +     start_address = vmf->address;
> > +     start_pte = vmf->pte;
> > +     if (start_pte && folio_test_large(folio)) {
> > +             unsigned long nr = folio_nr_pages(folio);
> > +             unsigned long addr = ALIGN_DOWN(vmf->address, nr * PAGE_SIZE);
> > +             pte_t *aligned_pte = vmf->pte - (vmf->address - addr) / PAGE_SIZE;
> > +
> > +             /*
> > +              * case 1: we are allocating large_folio, try to map it as a whole
> > +              * iff the swap entries are still entirely mapped;
> > +              * case 2: we hit a large folio in swapcache, and all swap entries
> > +              * are still entirely mapped, try to map a large folio as a whole.
> > +              * otherwise, map only the faulting page within the large folio
> > +              * which is swapcache
> > +              */
> > +             if (!is_pte_range_contig_swap(aligned_pte, nr)) {
> > +                     if (nr_pages > 1) /* ptes have changed for case 1 */
> > +                             goto out_nomap;
> > +                     goto check_pte;
> > +             }
> > +
> > +             start_address = addr;
> > +             start_pte = aligned_pte;
> > +             /*
> > +              * the below has been done before swap_read_folio()
> > +              * for case 1
> > +              */
> > +             if (unlikely(folio == swapcache)) {
> > +                     nr_pages = nr;
> > +                     entry.val = ALIGN_DOWN(entry.val, nr_pages);
> > +                     page = &folio->page;
> > +             }
> > +     }
> > +
> > +check_pte:
> >       if (unlikely(!vmf->pte || !pte_same(ptep_get(vmf->pte), vmf->orig_pte)))
> >               goto out_nomap;
> >
> > @@ -4185,12 +4360,14 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >        * We're already holding a reference on the page but haven't mapped it
> >        * yet.
> >        */
> > -     swap_free(entry);
> > +     swap_nr_free(entry, nr_pages);
> >       if (should_try_to_free_swap(folio, vma, vmf->flags))
> >               folio_free_swap(folio);
> >
> > -     inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
> > -     dec_mm_counter(vma->vm_mm, MM_SWAPENTS);
> > +     folio_ref_add(folio, nr_pages - 1);
> > +     add_mm_counter(vma->vm_mm, MM_ANONPAGES, nr_pages);
> > +     add_mm_counter(vma->vm_mm, MM_SWAPENTS, -nr_pages);
> > +
> >       pte = mk_pte(page, vma->vm_page_prot);
> >
> >       /*
> > @@ -4200,14 +4377,14 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >        * exclusivity.
> >        */
> >       if (!folio_test_ksm(folio) &&
> > -         (exclusive || folio_ref_count(folio) == 1)) {
> > +         (exclusive || folio_ref_count(folio) == nr_pages)) {
> >               if (vmf->flags & FAULT_FLAG_WRITE) {
> >                       pte = maybe_mkwrite(pte_mkdirty(pte), vma);
> >                       vmf->flags &= ~FAULT_FLAG_WRITE;
> >               }
> >               rmap_flags |= RMAP_EXCLUSIVE;
> >       }
> > -     flush_icache_page(vma, page);
> > +     flush_icache_pages(vma, page, nr_pages);
> >       if (pte_swp_soft_dirty(vmf->orig_pte))
> >               pte = pte_mksoft_dirty(pte);
> >       if (pte_swp_uffd_wp(vmf->orig_pte))
> > @@ -4216,17 +4393,19 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >
> >       /* ksm created a completely new copy */
> >       if (unlikely(folio != swapcache && swapcache)) {
> > -             folio_add_new_anon_rmap(folio, vma, vmf->address);
> > +             folio_add_new_anon_rmap(folio, vma, start_address);
> >               folio_add_lru_vma(folio, vma);
> > +     } else if (!folio_test_anon(folio)) {
> > +             folio_add_new_anon_rmap(folio, vma, start_address);
> >       } else {
> > -             folio_add_anon_rmap_pte(folio, page, vma, vmf->address,
> > +             folio_add_anon_rmap_ptes(folio, page, nr_pages, vma, start_address,
> >                                       rmap_flags);
> >       }
> >
> >       VM_BUG_ON(!folio_test_anon(folio) ||
> >                       (pte_write(pte) && !PageAnonExclusive(page)));
> > -     set_pte_at(vma->vm_mm, vmf->address, vmf->pte, pte);
> > -     arch_do_swap_page(vma->vm_mm, vma, vmf->address, pte, vmf->orig_pte);
> > +     set_ptes(vma->vm_mm, start_address, start_pte, pte, nr_pages);
> > +     arch_do_swap_page(vma->vm_mm, vma, start_address, pte, vmf->orig_pte);
> >
> >       folio_unlock(folio);
> >       if (folio != swapcache && swapcache) {
> > @@ -4243,6 +4422,9 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >       }
> >
> >       if (vmf->flags & FAULT_FLAG_WRITE) {
> > +             if (nr_pages > 1)
> > +                     vmf->orig_pte = ptep_get(vmf->pte);
> > +
> >               ret |= do_wp_page(vmf);
> >               if (ret & VM_FAULT_ERROR)
> >                       ret &= VM_FAULT_ERROR;
> > @@ -4250,14 +4432,14 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >       }
> >
> >       /* No need to invalidate - it was non-present before */
> > -     update_mmu_cache_range(vmf, vma, vmf->address, vmf->pte, 1);
> > +     update_mmu_cache_range(vmf, vma, start_address, start_pte, nr_pages);
> >  unlock:
> >       if (vmf->pte)
> >               pte_unmap_unlock(vmf->pte, vmf->ptl);
> >  out:
> >       /* Clear the swap cache pin for direct swapin after PTL unlock */
> >       if (need_clear_cache)
> > -             swapcache_clear(si, entry);
> > +             swapcache_clear_nr(si, entry, nr_pages);
> >       if (si)
> >               put_swap_device(si);
> >       return ret;
> > @@ -4273,7 +4455,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >               folio_put(swapcache);
> >       }
> >       if (need_clear_cache)
> > -             swapcache_clear(si, entry);
> > +             swapcache_clear_nr(si, entry, nr_pages);
> >       if (si)
> >               put_swap_device(si);
> >       return ret;
> > @@ -4309,15 +4491,7 @@ static struct folio *alloc_anon_folio(struct vm_fault *vmf)
> >       if (unlikely(userfaultfd_armed(vma)))
> >               goto fallback;
> >
> > -     /*
> > -      * Get a list of all the (large) orders below PMD_ORDER that are enabled
> > -      * for this vma. Then filter out the orders that can't be allocated over
> > -      * the faulting address and still be fully contained in the vma.
> > -      */
> > -     orders = thp_vma_allowable_orders(vma, vma->vm_flags, false, true, true,
> > -                                       BIT(PMD_ORDER) - 1);
> > -     orders = thp_vma_suitable_orders(vma, vmf->address, orders);
> > -
> > +     orders = get_alloc_folio_orders(vmf);
> >       if (!orders)
> >               goto fallback;
> >
>
>
Ryan Roberts March 14, 2024, 1:57 p.m. UTC | #3
On 14/03/2024 12:56, Chuanhua Han wrote:
> Ryan Roberts <ryan.roberts@arm.com> 于2024年3月13日周三 00:33写道:
>>
>> On 04/03/2024 08:13, Barry Song wrote:
>>> From: Chuanhua Han <hanchuanhua@oppo.com>
>>>
>>> On an embedded system like Android, more than half of anon memory is
>>> actually in swap devices such as zRAM. For example, while an app is
>>> switched to background, its most memory might be swapped-out.
>>>
>>> Now we have mTHP features, unfortunately, if we don't support large folios
>>> swap-in, once those large folios are swapped-out, we immediately lose the
>>> performance gain we can get through large folios and hardware optimization
>>> such as CONT-PTE.
>>>
>>> This patch brings up mTHP swap-in support. Right now, we limit mTHP swap-in
>>> to those contiguous swaps which were likely swapped out from mTHP as a
>>> whole.
>>>
>>> Meanwhile, the current implementation only covers the SWAP_SYCHRONOUS
>>> case. It doesn't support swapin_readahead as large folios yet since this
>>> kind of shared memory is much less than memory mapped by single process.
>>>
>>> Right now, we are re-faulting large folios which are still in swapcache as a
>>> whole, this can effectively decrease extra loops and early-exitings which we
>>> have increased in arch_swap_restore() while supporting MTE restore for folios
>>> rather than page. On the other hand, it can also decrease do_swap_page as
>>> PTEs used to be set one by one even we hit a large folio in swapcache.
>>>
>>> Signed-off-by: Chuanhua Han <hanchuanhua@oppo.com>
>>> Co-developed-by: Barry Song <v-songbaohua@oppo.com>
>>> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
>>> ---
>>>  mm/memory.c | 250 ++++++++++++++++++++++++++++++++++++++++++++--------
>>>  1 file changed, 212 insertions(+), 38 deletions(-)
>>>
>>> diff --git a/mm/memory.c b/mm/memory.c
>>> index e0d34d705e07..501ede745ef3 100644
>>> --- a/mm/memory.c
>>> +++ b/mm/memory.c
>>> @@ -3907,6 +3907,136 @@ static vm_fault_t handle_pte_marker(struct vm_fault *vmf)
>>>       return VM_FAULT_SIGBUS;
>>>  }
>>>
>>> +/*
>>> + * check a range of PTEs are completely swap entries with
>>> + * contiguous swap offsets and the same SWAP_HAS_CACHE.
>>> + * pte must be first one in the range
>>> + */
>>> +static bool is_pte_range_contig_swap(pte_t *pte, int nr_pages)
>>> +{
>>> +     int i;
>>> +     struct swap_info_struct *si;
>>> +     swp_entry_t entry;
>>> +     unsigned type;
>>> +     pgoff_t start_offset;
>>> +     char has_cache;
>>> +
>>> +     entry = pte_to_swp_entry(ptep_get_lockless(pte));
>>
>> Given you are getting entry locklessly, I expect it could change under you? So
>> probably need to check that its a swap entry, etc. first?
> The following non_swap_entry checks to see if it is a swap entry.

No, it checks if something already known to be a "swap entry" type is actually
describing a swap entry, or a non-swap entry (e.g. migration entry, hwpoison
entry, etc.) Swap entries with type >= MAX_SWAPFILES don't actually describe swap:

static inline int non_swap_entry(swp_entry_t entry)
{
	return swp_type(entry) >= MAX_SWAPFILES;
}


So you need to do something like:

pte = ptep_get_lockless(pte);
if (pte_none(pte) || !pte_present(pte))
	return false;
entry = pte_to_swp_entry(pte);
if (non_swap_entry(entry))
	return false;
...

>>
>>> +     if (non_swap_entry(entry))
>>> +             return false;
>>> +     start_offset = swp_offset(entry);
>>> +     if (start_offset % nr_pages)
>>> +             return false;
>>> +
>>> +     si = swp_swap_info(entry);
>>
>> What ensures si remains valid (i.e. swapoff can't happen)? If swapoff can race,
>> then swap_map may have been freed when you read it below. Holding the PTL can
>> sometimes prevent it, but I don't think you're holding that here (you're using
>> ptep_get_lockless(). Perhaps get_swap_device()/put_swap_device() can help?
> Thank you for your review,you are righit! this place reaally needs
> get_swap_device()/put_swap_device().
>>
>>> +     type = swp_type(entry);
>>> +     has_cache = si->swap_map[start_offset] & SWAP_HAS_CACHE;
>>> +     for (i = 1; i < nr_pages; i++) {
>>> +             entry = pte_to_swp_entry(ptep_get_lockless(pte + i));
>>> +             if (non_swap_entry(entry))
>>> +                     return false;
>>> +             if (swp_offset(entry) != start_offset + i)
>>> +                     return false;
>>> +             if (swp_type(entry) != type)
>>> +                     return false;
>>> +             /*
>>> +              * while allocating a large folio and doing swap_read_folio for the
>>> +              * SWP_SYNCHRONOUS_IO path, which is the case the being faulted pte
>>> +              * doesn't have swapcache. We need to ensure all PTEs have no cache
>>> +              * as well, otherwise, we might go to swap devices while the content
>>> +              * is in swapcache
>>> +              */
>>> +             if ((si->swap_map[start_offset + i] & SWAP_HAS_CACHE) != has_cache)
>>> +                     return false;
>>> +     }
>>> +
>>> +     return true;
>>> +}
>>
>> I created swap_pte_batch() for the swap-out series [1]. I wonder if that could
>> be extended for the SWAP_HAS_CACHE checks? Possibly not because it assumes the
>> PTL is held, and you are lockless here. Thought it might be of interest though.
>>
>> [1] https://lore.kernel.org/linux-mm/20240311150058.1122862-3-ryan.roberts@arm.com/
>>
> Thanks. It's probably simily to ours, but as you said we are lockless
> here, and we need to check has_cache.
>>> +
>>> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
>>> +/*
>>> + * Get a list of all the (large) orders below PMD_ORDER that are enabled
>>> + * for this vma. Then filter out the orders that can't be allocated over
>>> + * the faulting address and still be fully contained in the vma.
>>> + */
>>> +static inline unsigned long get_alloc_folio_orders(struct vm_fault *vmf)
>>> +{
>>> +     struct vm_area_struct *vma = vmf->vma;
>>> +     unsigned long orders;
>>> +
>>> +     orders = thp_vma_allowable_orders(vma, vma->vm_flags, false, true, true,
>>> +                                       BIT(PMD_ORDER) - 1);
>>> +     orders = thp_vma_suitable_orders(vma, vmf->address, orders);
>>> +     return orders;
>>> +}
>>> +#endif
>>> +
>>> +static struct folio *alloc_swap_folio(struct vm_fault *vmf)
>>> +{
>>> +     struct vm_area_struct *vma = vmf->vma;
>>> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
>>> +     unsigned long orders;
>>> +     struct folio *folio;
>>> +     unsigned long addr;
>>> +     pte_t *pte;
>>> +     gfp_t gfp;
>>> +     int order;
>>> +
>>> +     /*
>>> +      * If uffd is active for the vma we need per-page fault fidelity to
>>> +      * maintain the uffd semantics.
>>> +      */
>>> +     if (unlikely(userfaultfd_armed(vma)))
>>> +             goto fallback;
>>> +
>>> +     /*
>>> +      * a large folio being swapped-in could be partially in
>>> +      * zswap and partially in swap devices, zswap doesn't
>>> +      * support large folios yet, we might get corrupted
>>> +      * zero-filled data by reading all subpages from swap
>>> +      * devices while some of them are actually in zswap
>>> +      */
>>> +     if (is_zswap_enabled())
>>> +             goto fallback;
>>> +
>>> +     orders = get_alloc_folio_orders(vmf);
>>> +     if (!orders)
>>> +             goto fallback;
>>> +
>>> +     pte = pte_offset_map(vmf->pmd, vmf->address & PMD_MASK);
>>
>> Could also briefly take PTL here, then is_pte_range_contig_swap() could be
>> merged with an enhanced swap_pte_batch()?
> Yes, it's easy to use a lock here, but I'm wondering if it's
> necessary, because when we actually set pte in do_swap_page, we'll
> hold PTL to check if the pte changes.
>>
>>> +     if (unlikely(!pte))
>>> +             goto fallback;
>>> +
>>> +     /*
>>> +      * For do_swap_page, find the highest order where the aligned range is
>>> +      * completely swap entries with contiguous swap offsets.
>>> +      */
>>> +     order = highest_order(orders);
>>> +     while (orders) {
>>> +             addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
>>> +             if (is_pte_range_contig_swap(pte + pte_index(addr), 1 << order))
>>> +                     break;
>>> +             order = next_order(&orders, order);
>>> +     }
>>
>> So in the common case, swap-in will pull in the same size of folio as was
>> swapped-out. Is that definitely the right policy for all folio sizes? Certainly
>> it makes sense for "small" large folios (e.g. up to 64K IMHO). But I'm not sure
>> it makes sense for 2M THP; As the size increases the chances of actually needing
>> all of the folio reduces so chances are we are wasting IO. There are similar
>> arguments for CoW, where we currently copy 1 page per fault - it probably makes
>> sense to copy the whole folio up to a certain size.
> For 2M THP, IO overhead may not necessarily be large? :)
> 1.If 2M THP are continuously stored in the swap device, the IO
> overhead may not be very large (such as submitting bio with one
> bio_vec at a time).
> 2.If the process really needs this 2M data, one page-fault may perform
> much better than multiple.
> 3.For swap devices like zram,using 2M THP might also improve
> decompression efficiency.
> 
> On the other hand, if the process only needs a small part of the 2M
> data (such as only frequent use of 4K page, the rest of the data is
> never accessed), This is indeed give a lark to catch a kite!  :(

Yes indeed. It's not always clear-cut what the best thing to do is. It would be
good to hear from others on this.

>>
>> Thanks,
>> Ryan
>>
>>> +
>>> +     pte_unmap(pte);
>>> +
>>> +     /* Try allocating the highest of the remaining orders. */
>>> +     gfp = vma_thp_gfp_mask(vma);
>>> +     while (orders) {
>>> +             addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
>>> +             folio = vma_alloc_folio(gfp, order, vma, addr, true);
>>> +             if (folio)
>>> +                     return folio;
>>> +             order = next_order(&orders, order);
>>> +     }
>>> +
>>> +fallback:
>>> +#endif
>>> +     return vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0, vma, vmf->address, false);
>>> +}
>>> +
>>> +
>>>  /*
>>>   * We enter with non-exclusive mmap_lock (to exclude vma changes,
>>>   * but allow concurrent faults), and pte mapped but not yet locked.
>>> @@ -3928,6 +4058,9 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>>       pte_t pte;
>>>       vm_fault_t ret = 0;
>>>       void *shadow = NULL;
>>> +     int nr_pages = 1;
>>> +     unsigned long start_address;
>>> +     pte_t *start_pte;
>>>
>>>       if (!pte_unmap_same(vmf))
>>>               goto out;
>>> @@ -3991,35 +4124,41 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>>       if (!folio) {
>>>               if (data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
>>>                   __swap_count(entry) == 1) {
>>> -                     /*
>>> -                      * Prevent parallel swapin from proceeding with
>>> -                      * the cache flag. Otherwise, another thread may
>>> -                      * finish swapin first, free the entry, and swapout
>>> -                      * reusing the same entry. It's undetectable as
>>> -                      * pte_same() returns true due to entry reuse.
>>> -                      */
>>> -                     if (swapcache_prepare(entry)) {
>>> -                             /* Relax a bit to prevent rapid repeated page faults */
>>> -                             schedule_timeout_uninterruptible(1);
>>> -                             goto out;
>>> -                     }
>>> -                     need_clear_cache = true;
>>> -
>>>                       /* skip swapcache */
>>> -                     folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0,
>>> -                                             vma, vmf->address, false);
>>> +                     folio = alloc_swap_folio(vmf);
>>>                       page = &folio->page;
>>>                       if (folio) {
>>>                               __folio_set_locked(folio);
>>>                               __folio_set_swapbacked(folio);
>>>
>>> +                             if (folio_test_large(folio)) {
>>> +                                     nr_pages = folio_nr_pages(folio);
>>> +                                     entry.val = ALIGN_DOWN(entry.val, nr_pages);
>>> +                             }
>>> +
>>> +                             /*
>>> +                              * Prevent parallel swapin from proceeding with
>>> +                              * the cache flag. Otherwise, another thread may
>>> +                              * finish swapin first, free the entry, and swapout
>>> +                              * reusing the same entry. It's undetectable as
>>> +                              * pte_same() returns true due to entry reuse.
>>> +                              */
>>> +                             if (swapcache_prepare_nr(entry, nr_pages)) {
>>> +                                     /* Relax a bit to prevent rapid repeated page faults */
>>> +                                     schedule_timeout_uninterruptible(1);
>>> +                                     goto out;
>>> +                             }
>>> +                             need_clear_cache = true;
>>> +
>>>                               if (mem_cgroup_swapin_charge_folio(folio,
>>>                                                       vma->vm_mm, GFP_KERNEL,
>>>                                                       entry)) {
>>>                                       ret = VM_FAULT_OOM;
>>>                                       goto out_page;
>>>                               }
>>> -                             mem_cgroup_swapin_uncharge_swap(entry);
>>> +
>>> +                             for (swp_entry_t e = entry; e.val < entry.val + nr_pages; e.val++)
>>> +                                     mem_cgroup_swapin_uncharge_swap(e);
>>>
>>>                               shadow = get_shadow_from_swap_cache(entry);
>>>                               if (shadow)
>>> @@ -4118,6 +4257,42 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>>        */
>>>       vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
>>>                       &vmf->ptl);
>>> +
>>> +     start_address = vmf->address;
>>> +     start_pte = vmf->pte;
>>> +     if (start_pte && folio_test_large(folio)) {
>>> +             unsigned long nr = folio_nr_pages(folio);
>>> +             unsigned long addr = ALIGN_DOWN(vmf->address, nr * PAGE_SIZE);
>>> +             pte_t *aligned_pte = vmf->pte - (vmf->address - addr) / PAGE_SIZE;
>>> +
>>> +             /*
>>> +              * case 1: we are allocating large_folio, try to map it as a whole
>>> +              * iff the swap entries are still entirely mapped;
>>> +              * case 2: we hit a large folio in swapcache, and all swap entries
>>> +              * are still entirely mapped, try to map a large folio as a whole.
>>> +              * otherwise, map only the faulting page within the large folio
>>> +              * which is swapcache
>>> +              */
>>> +             if (!is_pte_range_contig_swap(aligned_pte, nr)) {
>>> +                     if (nr_pages > 1) /* ptes have changed for case 1 */
>>> +                             goto out_nomap;
>>> +                     goto check_pte;
>>> +             }
>>> +
>>> +             start_address = addr;
>>> +             start_pte = aligned_pte;
>>> +             /*
>>> +              * the below has been done before swap_read_folio()
>>> +              * for case 1
>>> +              */
>>> +             if (unlikely(folio == swapcache)) {
>>> +                     nr_pages = nr;
>>> +                     entry.val = ALIGN_DOWN(entry.val, nr_pages);
>>> +                     page = &folio->page;
>>> +             }
>>> +     }
>>> +
>>> +check_pte:
>>>       if (unlikely(!vmf->pte || !pte_same(ptep_get(vmf->pte), vmf->orig_pte)))
>>>               goto out_nomap;
>>>
>>> @@ -4185,12 +4360,14 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>>        * We're already holding a reference on the page but haven't mapped it
>>>        * yet.
>>>        */
>>> -     swap_free(entry);
>>> +     swap_nr_free(entry, nr_pages);
>>>       if (should_try_to_free_swap(folio, vma, vmf->flags))
>>>               folio_free_swap(folio);
>>>
>>> -     inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
>>> -     dec_mm_counter(vma->vm_mm, MM_SWAPENTS);
>>> +     folio_ref_add(folio, nr_pages - 1);
>>> +     add_mm_counter(vma->vm_mm, MM_ANONPAGES, nr_pages);
>>> +     add_mm_counter(vma->vm_mm, MM_SWAPENTS, -nr_pages);
>>> +
>>>       pte = mk_pte(page, vma->vm_page_prot);
>>>
>>>       /*
>>> @@ -4200,14 +4377,14 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>>        * exclusivity.
>>>        */
>>>       if (!folio_test_ksm(folio) &&
>>> -         (exclusive || folio_ref_count(folio) == 1)) {
>>> +         (exclusive || folio_ref_count(folio) == nr_pages)) {
>>>               if (vmf->flags & FAULT_FLAG_WRITE) {
>>>                       pte = maybe_mkwrite(pte_mkdirty(pte), vma);
>>>                       vmf->flags &= ~FAULT_FLAG_WRITE;
>>>               }
>>>               rmap_flags |= RMAP_EXCLUSIVE;
>>>       }
>>> -     flush_icache_page(vma, page);
>>> +     flush_icache_pages(vma, page, nr_pages);
>>>       if (pte_swp_soft_dirty(vmf->orig_pte))
>>>               pte = pte_mksoft_dirty(pte);
>>>       if (pte_swp_uffd_wp(vmf->orig_pte))
>>> @@ -4216,17 +4393,19 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>>
>>>       /* ksm created a completely new copy */
>>>       if (unlikely(folio != swapcache && swapcache)) {
>>> -             folio_add_new_anon_rmap(folio, vma, vmf->address);
>>> +             folio_add_new_anon_rmap(folio, vma, start_address);
>>>               folio_add_lru_vma(folio, vma);
>>> +     } else if (!folio_test_anon(folio)) {
>>> +             folio_add_new_anon_rmap(folio, vma, start_address);
>>>       } else {
>>> -             folio_add_anon_rmap_pte(folio, page, vma, vmf->address,
>>> +             folio_add_anon_rmap_ptes(folio, page, nr_pages, vma, start_address,
>>>                                       rmap_flags);
>>>       }
>>>
>>>       VM_BUG_ON(!folio_test_anon(folio) ||
>>>                       (pte_write(pte) && !PageAnonExclusive(page)));
>>> -     set_pte_at(vma->vm_mm, vmf->address, vmf->pte, pte);
>>> -     arch_do_swap_page(vma->vm_mm, vma, vmf->address, pte, vmf->orig_pte);
>>> +     set_ptes(vma->vm_mm, start_address, start_pte, pte, nr_pages);
>>> +     arch_do_swap_page(vma->vm_mm, vma, start_address, pte, vmf->orig_pte);
>>>
>>>       folio_unlock(folio);
>>>       if (folio != swapcache && swapcache) {
>>> @@ -4243,6 +4422,9 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>>       }
>>>
>>>       if (vmf->flags & FAULT_FLAG_WRITE) {
>>> +             if (nr_pages > 1)
>>> +                     vmf->orig_pte = ptep_get(vmf->pte);
>>> +
>>>               ret |= do_wp_page(vmf);
>>>               if (ret & VM_FAULT_ERROR)
>>>                       ret &= VM_FAULT_ERROR;
>>> @@ -4250,14 +4432,14 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>>       }
>>>
>>>       /* No need to invalidate - it was non-present before */
>>> -     update_mmu_cache_range(vmf, vma, vmf->address, vmf->pte, 1);
>>> +     update_mmu_cache_range(vmf, vma, start_address, start_pte, nr_pages);
>>>  unlock:
>>>       if (vmf->pte)
>>>               pte_unmap_unlock(vmf->pte, vmf->ptl);
>>>  out:
>>>       /* Clear the swap cache pin for direct swapin after PTL unlock */
>>>       if (need_clear_cache)
>>> -             swapcache_clear(si, entry);
>>> +             swapcache_clear_nr(si, entry, nr_pages);
>>>       if (si)
>>>               put_swap_device(si);
>>>       return ret;
>>> @@ -4273,7 +4455,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>>               folio_put(swapcache);
>>>       }
>>>       if (need_clear_cache)
>>> -             swapcache_clear(si, entry);
>>> +             swapcache_clear_nr(si, entry, nr_pages);
>>>       if (si)
>>>               put_swap_device(si);
>>>       return ret;
>>> @@ -4309,15 +4491,7 @@ static struct folio *alloc_anon_folio(struct vm_fault *vmf)
>>>       if (unlikely(userfaultfd_armed(vma)))
>>>               goto fallback;
>>>
>>> -     /*
>>> -      * Get a list of all the (large) orders below PMD_ORDER that are enabled
>>> -      * for this vma. Then filter out the orders that can't be allocated over
>>> -      * the faulting address and still be fully contained in the vma.
>>> -      */
>>> -     orders = thp_vma_allowable_orders(vma, vma->vm_flags, false, true, true,
>>> -                                       BIT(PMD_ORDER) - 1);
>>> -     orders = thp_vma_suitable_orders(vma, vmf->address, orders);
>>> -
>>> +     orders = get_alloc_folio_orders(vmf);
>>>       if (!orders)
>>>               goto fallback;
>>>
>>
>>
> 
>
Barry Song March 14, 2024, 8:43 p.m. UTC | #4
On Fri, Mar 15, 2024 at 2:57 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> On 14/03/2024 12:56, Chuanhua Han wrote:
> > Ryan Roberts <ryan.roberts@arm.com> 于2024年3月13日周三 00:33写道:
> >>
> >> On 04/03/2024 08:13, Barry Song wrote:
> >>> From: Chuanhua Han <hanchuanhua@oppo.com>
> >>>
> >>> On an embedded system like Android, more than half of anon memory is
> >>> actually in swap devices such as zRAM. For example, while an app is
> >>> switched to background, its most memory might be swapped-out.
> >>>
> >>> Now we have mTHP features, unfortunately, if we don't support large folios
> >>> swap-in, once those large folios are swapped-out, we immediately lose the
> >>> performance gain we can get through large folios and hardware optimization
> >>> such as CONT-PTE.
> >>>
> >>> This patch brings up mTHP swap-in support. Right now, we limit mTHP swap-in
> >>> to those contiguous swaps which were likely swapped out from mTHP as a
> >>> whole.
> >>>
> >>> Meanwhile, the current implementation only covers the SWAP_SYCHRONOUS
> >>> case. It doesn't support swapin_readahead as large folios yet since this
> >>> kind of shared memory is much less than memory mapped by single process.
> >>>
> >>> Right now, we are re-faulting large folios which are still in swapcache as a
> >>> whole, this can effectively decrease extra loops and early-exitings which we
> >>> have increased in arch_swap_restore() while supporting MTE restore for folios
> >>> rather than page. On the other hand, it can also decrease do_swap_page as
> >>> PTEs used to be set one by one even we hit a large folio in swapcache.
> >>>
> >>> Signed-off-by: Chuanhua Han <hanchuanhua@oppo.com>
> >>> Co-developed-by: Barry Song <v-songbaohua@oppo.com>
> >>> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> >>> ---
> >>>  mm/memory.c | 250 ++++++++++++++++++++++++++++++++++++++++++++--------
> >>>  1 file changed, 212 insertions(+), 38 deletions(-)
> >>>
> >>> diff --git a/mm/memory.c b/mm/memory.c
> >>> index e0d34d705e07..501ede745ef3 100644
> >>> --- a/mm/memory.c
> >>> +++ b/mm/memory.c
> >>> @@ -3907,6 +3907,136 @@ static vm_fault_t handle_pte_marker(struct vm_fault *vmf)
> >>>       return VM_FAULT_SIGBUS;
> >>>  }
> >>>
> >>> +/*
> >>> + * check a range of PTEs are completely swap entries with
> >>> + * contiguous swap offsets and the same SWAP_HAS_CACHE.
> >>> + * pte must be first one in the range
> >>> + */
> >>> +static bool is_pte_range_contig_swap(pte_t *pte, int nr_pages)
> >>> +{
> >>> +     int i;
> >>> +     struct swap_info_struct *si;
> >>> +     swp_entry_t entry;
> >>> +     unsigned type;
> >>> +     pgoff_t start_offset;
> >>> +     char has_cache;
> >>> +
> >>> +     entry = pte_to_swp_entry(ptep_get_lockless(pte));
> >>
> >> Given you are getting entry locklessly, I expect it could change under you? So
> >> probably need to check that its a swap entry, etc. first?
> > The following non_swap_entry checks to see if it is a swap entry.
>
> No, it checks if something already known to be a "swap entry" type is actually
> describing a swap entry, or a non-swap entry (e.g. migration entry, hwpoison
> entry, etc.) Swap entries with type >= MAX_SWAPFILES don't actually describe swap:
>
> static inline int non_swap_entry(swp_entry_t entry)
> {
>         return swp_type(entry) >= MAX_SWAPFILES;
> }
>
>
> So you need to do something like:
>
> pte = ptep_get_lockless(pte);
> if (pte_none(pte) || !pte_present(pte))
>         return false;


Indeed, I noticed that a couple of days ago, but it turned out that it
didn't cause any issues
because the condition following 'if (swp_offset(entry) != start_offset
+ i)'  cannot be true :-)

I do agree it needs a fix here. maybe by

if (!is_swap_pte(pte))
           return false?

> entry = pte_to_swp_entry(pte);
> if (non_swap_entry(entry))
>         return false;
> ...
>
> >>
> >>> +     if (non_swap_entry(entry))
> >>> +             return false;
> >>> +     start_offset = swp_offset(entry);
> >>> +     if (start_offset % nr_pages)
> >>> +             return false;
> >>> +
> >>> +     si = swp_swap_info(entry);
> >>
> >> What ensures si remains valid (i.e. swapoff can't happen)? If swapoff can race,
> >> then swap_map may have been freed when you read it below. Holding the PTL can
> >> sometimes prevent it, but I don't think you're holding that here (you're using
> >> ptep_get_lockless(). Perhaps get_swap_device()/put_swap_device() can help?
> > Thank you for your review,you are righit! this place reaally needs
> > get_swap_device()/put_swap_device().
> >>
> >>> +     type = swp_type(entry);
> >>> +     has_cache = si->swap_map[start_offset] & SWAP_HAS_CACHE;
> >>> +     for (i = 1; i < nr_pages; i++) {
> >>> +             entry = pte_to_swp_entry(ptep_get_lockless(pte + i));
> >>> +             if (non_swap_entry(entry))
> >>> +                     return false;
> >>> +             if (swp_offset(entry) != start_offset + i)
> >>> +                     return false;
> >>> +             if (swp_type(entry) != type)
> >>> +                     return false;
> >>> +             /*
> >>> +              * while allocating a large folio and doing swap_read_folio for the
> >>> +              * SWP_SYNCHRONOUS_IO path, which is the case the being faulted pte
> >>> +              * doesn't have swapcache. We need to ensure all PTEs have no cache
> >>> +              * as well, otherwise, we might go to swap devices while the content
> >>> +              * is in swapcache
> >>> +              */
> >>> +             if ((si->swap_map[start_offset + i] & SWAP_HAS_CACHE) != has_cache)
> >>> +                     return false;
> >>> +     }
> >>> +
> >>> +     return true;
> >>> +}
> >>
> >> I created swap_pte_batch() for the swap-out series [1]. I wonder if that could
> >> be extended for the SWAP_HAS_CACHE checks? Possibly not because it assumes the
> >> PTL is held, and you are lockless here. Thought it might be of interest though.
> >>
> >> [1] https://lore.kernel.org/linux-mm/20240311150058.1122862-3-ryan.roberts@arm.com/
> >>
> > Thanks. It's probably simily to ours, but as you said we are lockless
> > here, and we need to check has_cache.
> >>> +
> >>> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> >>> +/*
> >>> + * Get a list of all the (large) orders below PMD_ORDER that are enabled
> >>> + * for this vma. Then filter out the orders that can't be allocated over
> >>> + * the faulting address and still be fully contained in the vma.
> >>> + */
> >>> +static inline unsigned long get_alloc_folio_orders(struct vm_fault *vmf)
> >>> +{
> >>> +     struct vm_area_struct *vma = vmf->vma;
> >>> +     unsigned long orders;
> >>> +
> >>> +     orders = thp_vma_allowable_orders(vma, vma->vm_flags, false, true, true,
> >>> +                                       BIT(PMD_ORDER) - 1);
> >>> +     orders = thp_vma_suitable_orders(vma, vmf->address, orders);
> >>> +     return orders;
> >>> +}
> >>> +#endif
> >>> +
> >>> +static struct folio *alloc_swap_folio(struct vm_fault *vmf)
> >>> +{
> >>> +     struct vm_area_struct *vma = vmf->vma;
> >>> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> >>> +     unsigned long orders;
> >>> +     struct folio *folio;
> >>> +     unsigned long addr;
> >>> +     pte_t *pte;
> >>> +     gfp_t gfp;
> >>> +     int order;
> >>> +
> >>> +     /*
> >>> +      * If uffd is active for the vma we need per-page fault fidelity to
> >>> +      * maintain the uffd semantics.
> >>> +      */
> >>> +     if (unlikely(userfaultfd_armed(vma)))
> >>> +             goto fallback;
> >>> +
> >>> +     /*
> >>> +      * a large folio being swapped-in could be partially in
> >>> +      * zswap and partially in swap devices, zswap doesn't
> >>> +      * support large folios yet, we might get corrupted
> >>> +      * zero-filled data by reading all subpages from swap
> >>> +      * devices while some of them are actually in zswap
> >>> +      */
> >>> +     if (is_zswap_enabled())
> >>> +             goto fallback;
> >>> +
> >>> +     orders = get_alloc_folio_orders(vmf);
> >>> +     if (!orders)
> >>> +             goto fallback;
> >>> +
> >>> +     pte = pte_offset_map(vmf->pmd, vmf->address & PMD_MASK);
> >>
> >> Could also briefly take PTL here, then is_pte_range_contig_swap() could be
> >> merged with an enhanced swap_pte_batch()?
> > Yes, it's easy to use a lock here, but I'm wondering if it's
> > necessary, because when we actually set pte in do_swap_page, we'll
> > hold PTL to check if the pte changes.
> >>
> >>> +     if (unlikely(!pte))
> >>> +             goto fallback;
> >>> +
> >>> +     /*
> >>> +      * For do_swap_page, find the highest order where the aligned range is
> >>> +      * completely swap entries with contiguous swap offsets.
> >>> +      */
> >>> +     order = highest_order(orders);
> >>> +     while (orders) {
> >>> +             addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
> >>> +             if (is_pte_range_contig_swap(pte + pte_index(addr), 1 << order))
> >>> +                     break;
> >>> +             order = next_order(&orders, order);
> >>> +     }
> >>
> >> So in the common case, swap-in will pull in the same size of folio as was
> >> swapped-out. Is that definitely the right policy for all folio sizes? Certainly
> >> it makes sense for "small" large folios (e.g. up to 64K IMHO). But I'm not sure
> >> it makes sense for 2M THP; As the size increases the chances of actually needing
> >> all of the folio reduces so chances are we are wasting IO. There are similar
> >> arguments for CoW, where we currently copy 1 page per fault - it probably makes
> >> sense to copy the whole folio up to a certain size.
> > For 2M THP, IO overhead may not necessarily be large? :)
> > 1.If 2M THP are continuously stored in the swap device, the IO
> > overhead may not be very large (such as submitting bio with one
> > bio_vec at a time).
> > 2.If the process really needs this 2M data, one page-fault may perform
> > much better than multiple.
> > 3.For swap devices like zram,using 2M THP might also improve
> > decompression efficiency.
> >
> > On the other hand, if the process only needs a small part of the 2M
> > data (such as only frequent use of 4K page, the rest of the data is
> > never accessed), This is indeed give a lark to catch a kite!  :(
>
> Yes indeed. It's not always clear-cut what the best thing to do is. It would be
> good to hear from others on this.
>
> >>
> >> Thanks,
> >> Ryan
> >>
> >>> +
> >>> +     pte_unmap(pte);
> >>> +
> >>> +     /* Try allocating the highest of the remaining orders. */
> >>> +     gfp = vma_thp_gfp_mask(vma);
> >>> +     while (orders) {
> >>> +             addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
> >>> +             folio = vma_alloc_folio(gfp, order, vma, addr, true);
> >>> +             if (folio)
> >>> +                     return folio;
> >>> +             order = next_order(&orders, order);
> >>> +     }
> >>> +
> >>> +fallback:
> >>> +#endif
> >>> +     return vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0, vma, vmf->address, false);
> >>> +}
> >>> +
> >>> +
> >>>  /*
> >>>   * We enter with non-exclusive mmap_lock (to exclude vma changes,
> >>>   * but allow concurrent faults), and pte mapped but not yet locked.
> >>> @@ -3928,6 +4058,9 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >>>       pte_t pte;
> >>>       vm_fault_t ret = 0;
> >>>       void *shadow = NULL;
> >>> +     int nr_pages = 1;
> >>> +     unsigned long start_address;
> >>> +     pte_t *start_pte;
> >>>
> >>>       if (!pte_unmap_same(vmf))
> >>>               goto out;
> >>> @@ -3991,35 +4124,41 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >>>       if (!folio) {
> >>>               if (data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
> >>>                   __swap_count(entry) == 1) {
> >>> -                     /*
> >>> -                      * Prevent parallel swapin from proceeding with
> >>> -                      * the cache flag. Otherwise, another thread may
> >>> -                      * finish swapin first, free the entry, and swapout
> >>> -                      * reusing the same entry. It's undetectable as
> >>> -                      * pte_same() returns true due to entry reuse.
> >>> -                      */
> >>> -                     if (swapcache_prepare(entry)) {
> >>> -                             /* Relax a bit to prevent rapid repeated page faults */
> >>> -                             schedule_timeout_uninterruptible(1);
> >>> -                             goto out;
> >>> -                     }
> >>> -                     need_clear_cache = true;
> >>> -
> >>>                       /* skip swapcache */
> >>> -                     folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0,
> >>> -                                             vma, vmf->address, false);
> >>> +                     folio = alloc_swap_folio(vmf);
> >>>                       page = &folio->page;
> >>>                       if (folio) {
> >>>                               __folio_set_locked(folio);
> >>>                               __folio_set_swapbacked(folio);
> >>>
> >>> +                             if (folio_test_large(folio)) {
> >>> +                                     nr_pages = folio_nr_pages(folio);
> >>> +                                     entry.val = ALIGN_DOWN(entry.val, nr_pages);
> >>> +                             }
> >>> +
> >>> +                             /*
> >>> +                              * Prevent parallel swapin from proceeding with
> >>> +                              * the cache flag. Otherwise, another thread may
> >>> +                              * finish swapin first, free the entry, and swapout
> >>> +                              * reusing the same entry. It's undetectable as
> >>> +                              * pte_same() returns true due to entry reuse.
> >>> +                              */
> >>> +                             if (swapcache_prepare_nr(entry, nr_pages)) {
> >>> +                                     /* Relax a bit to prevent rapid repeated page faults */
> >>> +                                     schedule_timeout_uninterruptible(1);
> >>> +                                     goto out;
> >>> +                             }
> >>> +                             need_clear_cache = true;
> >>> +
> >>>                               if (mem_cgroup_swapin_charge_folio(folio,
> >>>                                                       vma->vm_mm, GFP_KERNEL,
> >>>                                                       entry)) {
> >>>                                       ret = VM_FAULT_OOM;
> >>>                                       goto out_page;
> >>>                               }
> >>> -                             mem_cgroup_swapin_uncharge_swap(entry);
> >>> +
> >>> +                             for (swp_entry_t e = entry; e.val < entry.val + nr_pages; e.val++)
> >>> +                                     mem_cgroup_swapin_uncharge_swap(e);
> >>>
> >>>                               shadow = get_shadow_from_swap_cache(entry);
> >>>                               if (shadow)
> >>> @@ -4118,6 +4257,42 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >>>        */
> >>>       vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
> >>>                       &vmf->ptl);
> >>> +
> >>> +     start_address = vmf->address;
> >>> +     start_pte = vmf->pte;
> >>> +     if (start_pte && folio_test_large(folio)) {
> >>> +             unsigned long nr = folio_nr_pages(folio);
> >>> +             unsigned long addr = ALIGN_DOWN(vmf->address, nr * PAGE_SIZE);
> >>> +             pte_t *aligned_pte = vmf->pte - (vmf->address - addr) / PAGE_SIZE;
> >>> +
> >>> +             /*
> >>> +              * case 1: we are allocating large_folio, try to map it as a whole
> >>> +              * iff the swap entries are still entirely mapped;
> >>> +              * case 2: we hit a large folio in swapcache, and all swap entries
> >>> +              * are still entirely mapped, try to map a large folio as a whole.
> >>> +              * otherwise, map only the faulting page within the large folio
> >>> +              * which is swapcache
> >>> +              */
> >>> +             if (!is_pte_range_contig_swap(aligned_pte, nr)) {
> >>> +                     if (nr_pages > 1) /* ptes have changed for case 1 */
> >>> +                             goto out_nomap;
> >>> +                     goto check_pte;
> >>> +             }
> >>> +
> >>> +             start_address = addr;
> >>> +             start_pte = aligned_pte;
> >>> +             /*
> >>> +              * the below has been done before swap_read_folio()
> >>> +              * for case 1
> >>> +              */
> >>> +             if (unlikely(folio == swapcache)) {
> >>> +                     nr_pages = nr;
> >>> +                     entry.val = ALIGN_DOWN(entry.val, nr_pages);
> >>> +                     page = &folio->page;
> >>> +             }
> >>> +     }
> >>> +
> >>> +check_pte:
> >>>       if (unlikely(!vmf->pte || !pte_same(ptep_get(vmf->pte), vmf->orig_pte)))
> >>>               goto out_nomap;
> >>>
> >>> @@ -4185,12 +4360,14 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >>>        * We're already holding a reference on the page but haven't mapped it
> >>>        * yet.
> >>>        */
> >>> -     swap_free(entry);
> >>> +     swap_nr_free(entry, nr_pages);
> >>>       if (should_try_to_free_swap(folio, vma, vmf->flags))
> >>>               folio_free_swap(folio);
> >>>
> >>> -     inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
> >>> -     dec_mm_counter(vma->vm_mm, MM_SWAPENTS);
> >>> +     folio_ref_add(folio, nr_pages - 1);
> >>> +     add_mm_counter(vma->vm_mm, MM_ANONPAGES, nr_pages);
> >>> +     add_mm_counter(vma->vm_mm, MM_SWAPENTS, -nr_pages);
> >>> +
> >>>       pte = mk_pte(page, vma->vm_page_prot);
> >>>
> >>>       /*
> >>> @@ -4200,14 +4377,14 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >>>        * exclusivity.
> >>>        */
> >>>       if (!folio_test_ksm(folio) &&
> >>> -         (exclusive || folio_ref_count(folio) == 1)) {
> >>> +         (exclusive || folio_ref_count(folio) == nr_pages)) {
> >>>               if (vmf->flags & FAULT_FLAG_WRITE) {
> >>>                       pte = maybe_mkwrite(pte_mkdirty(pte), vma);
> >>>                       vmf->flags &= ~FAULT_FLAG_WRITE;
> >>>               }
> >>>               rmap_flags |= RMAP_EXCLUSIVE;
> >>>       }
> >>> -     flush_icache_page(vma, page);
> >>> +     flush_icache_pages(vma, page, nr_pages);
> >>>       if (pte_swp_soft_dirty(vmf->orig_pte))
> >>>               pte = pte_mksoft_dirty(pte);
> >>>       if (pte_swp_uffd_wp(vmf->orig_pte))
> >>> @@ -4216,17 +4393,19 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >>>
> >>>       /* ksm created a completely new copy */
> >>>       if (unlikely(folio != swapcache && swapcache)) {
> >>> -             folio_add_new_anon_rmap(folio, vma, vmf->address);
> >>> +             folio_add_new_anon_rmap(folio, vma, start_address);
> >>>               folio_add_lru_vma(folio, vma);
> >>> +     } else if (!folio_test_anon(folio)) {
> >>> +             folio_add_new_anon_rmap(folio, vma, start_address);
> >>>       } else {
> >>> -             folio_add_anon_rmap_pte(folio, page, vma, vmf->address,
> >>> +             folio_add_anon_rmap_ptes(folio, page, nr_pages, vma, start_address,
> >>>                                       rmap_flags);
> >>>       }
> >>>
> >>>       VM_BUG_ON(!folio_test_anon(folio) ||
> >>>                       (pte_write(pte) && !PageAnonExclusive(page)));
> >>> -     set_pte_at(vma->vm_mm, vmf->address, vmf->pte, pte);
> >>> -     arch_do_swap_page(vma->vm_mm, vma, vmf->address, pte, vmf->orig_pte);
> >>> +     set_ptes(vma->vm_mm, start_address, start_pte, pte, nr_pages);
> >>> +     arch_do_swap_page(vma->vm_mm, vma, start_address, pte, vmf->orig_pte);
> >>>
> >>>       folio_unlock(folio);
> >>>       if (folio != swapcache && swapcache) {
> >>> @@ -4243,6 +4422,9 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >>>       }
> >>>
> >>>       if (vmf->flags & FAULT_FLAG_WRITE) {
> >>> +             if (nr_pages > 1)
> >>> +                     vmf->orig_pte = ptep_get(vmf->pte);
> >>> +
> >>>               ret |= do_wp_page(vmf);
> >>>               if (ret & VM_FAULT_ERROR)
> >>>                       ret &= VM_FAULT_ERROR;
> >>> @@ -4250,14 +4432,14 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >>>       }
> >>>
> >>>       /* No need to invalidate - it was non-present before */
> >>> -     update_mmu_cache_range(vmf, vma, vmf->address, vmf->pte, 1);
> >>> +     update_mmu_cache_range(vmf, vma, start_address, start_pte, nr_pages);
> >>>  unlock:
> >>>       if (vmf->pte)
> >>>               pte_unmap_unlock(vmf->pte, vmf->ptl);
> >>>  out:
> >>>       /* Clear the swap cache pin for direct swapin after PTL unlock */
> >>>       if (need_clear_cache)
> >>> -             swapcache_clear(si, entry);
> >>> +             swapcache_clear_nr(si, entry, nr_pages);
> >>>       if (si)
> >>>               put_swap_device(si);
> >>>       return ret;
> >>> @@ -4273,7 +4455,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >>>               folio_put(swapcache);
> >>>       }
> >>>       if (need_clear_cache)
> >>> -             swapcache_clear(si, entry);
> >>> +             swapcache_clear_nr(si, entry, nr_pages);
> >>>       if (si)
> >>>               put_swap_device(si);
> >>>       return ret;
> >>> @@ -4309,15 +4491,7 @@ static struct folio *alloc_anon_folio(struct vm_fault *vmf)
> >>>       if (unlikely(userfaultfd_armed(vma)))
> >>>               goto fallback;
> >>>
> >>> -     /*
> >>> -      * Get a list of all the (large) orders below PMD_ORDER that are enabled
> >>> -      * for this vma. Then filter out the orders that can't be allocated over
> >>> -      * the faulting address and still be fully contained in the vma.
> >>> -      */
> >>> -     orders = thp_vma_allowable_orders(vma, vma->vm_flags, false, true, true,
> >>> -                                       BIT(PMD_ORDER) - 1);
> >>> -     orders = thp_vma_suitable_orders(vma, vmf->address, orders);
> >>> -
> >>> +     orders = get_alloc_folio_orders(vmf);
> >>>       if (!orders)
> >>>               goto fallback;
> >>>

Thanks
Barry
Chuanhua Han March 15, 2024, 1:16 a.m. UTC | #5
Ryan Roberts <ryan.roberts@arm.com> 于2024年3月14日周四 21:57写道:
>
> On 14/03/2024 12:56, Chuanhua Han wrote:
> > Ryan Roberts <ryan.roberts@arm.com> 于2024年3月13日周三 00:33写道:
> >>
> >> On 04/03/2024 08:13, Barry Song wrote:
> >>> From: Chuanhua Han <hanchuanhua@oppo.com>
> >>>
> >>> On an embedded system like Android, more than half of anon memory is
> >>> actually in swap devices such as zRAM. For example, while an app is
> >>> switched to background, its most memory might be swapped-out.
> >>>
> >>> Now we have mTHP features, unfortunately, if we don't support large folios
> >>> swap-in, once those large folios are swapped-out, we immediately lose the
> >>> performance gain we can get through large folios and hardware optimization
> >>> such as CONT-PTE.
> >>>
> >>> This patch brings up mTHP swap-in support. Right now, we limit mTHP swap-in
> >>> to those contiguous swaps which were likely swapped out from mTHP as a
> >>> whole.
> >>>
> >>> Meanwhile, the current implementation only covers the SWAP_SYCHRONOUS
> >>> case. It doesn't support swapin_readahead as large folios yet since this
> >>> kind of shared memory is much less than memory mapped by single process.
> >>>
> >>> Right now, we are re-faulting large folios which are still in swapcache as a
> >>> whole, this can effectively decrease extra loops and early-exitings which we
> >>> have increased in arch_swap_restore() while supporting MTE restore for folios
> >>> rather than page. On the other hand, it can also decrease do_swap_page as
> >>> PTEs used to be set one by one even we hit a large folio in swapcache.
> >>>
> >>> Signed-off-by: Chuanhua Han <hanchuanhua@oppo.com>
> >>> Co-developed-by: Barry Song <v-songbaohua@oppo.com>
> >>> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> >>> ---
> >>>  mm/memory.c | 250 ++++++++++++++++++++++++++++++++++++++++++++--------
> >>>  1 file changed, 212 insertions(+), 38 deletions(-)
> >>>
> >>> diff --git a/mm/memory.c b/mm/memory.c
> >>> index e0d34d705e07..501ede745ef3 100644
> >>> --- a/mm/memory.c
> >>> +++ b/mm/memory.c
> >>> @@ -3907,6 +3907,136 @@ static vm_fault_t handle_pte_marker(struct vm_fault *vmf)
> >>>       return VM_FAULT_SIGBUS;
> >>>  }
> >>>
> >>> +/*
> >>> + * check a range of PTEs are completely swap entries with
> >>> + * contiguous swap offsets and the same SWAP_HAS_CACHE.
> >>> + * pte must be first one in the range
> >>> + */
> >>> +static bool is_pte_range_contig_swap(pte_t *pte, int nr_pages)
> >>> +{
> >>> +     int i;
> >>> +     struct swap_info_struct *si;
> >>> +     swp_entry_t entry;
> >>> +     unsigned type;
> >>> +     pgoff_t start_offset;
> >>> +     char has_cache;
> >>> +
> >>> +     entry = pte_to_swp_entry(ptep_get_lockless(pte));
> >>
> >> Given you are getting entry locklessly, I expect it could change under you? So
> >> probably need to check that its a swap entry, etc. first?
> > The following non_swap_entry checks to see if it is a swap entry.
>
> No, it checks if something already known to be a "swap entry" type is actually
> describing a swap entry, or a non-swap entry (e.g. migration entry, hwpoison
> entry, etc.) Swap entries with type >= MAX_SWAPFILES don't actually describe swap:
>
> static inline int non_swap_entry(swp_entry_t entry)
> {
>         return swp_type(entry) >= MAX_SWAPFILES;
> }
>
>
> So you need to do something like:
>
> pte = ptep_get_lockless(pte);
> if (pte_none(pte) || !pte_present(pte))
>         return false;
> entry = pte_to_swp_entry(pte);
> if (non_swap_entry(entry))
>         return false;
> ...
>
Indeed, this will more accurate, thank you very much for your advise!
> >>
> >>> +     if (non_swap_entry(entry))
> >>> +             return false;
> >>> +     start_offset = swp_offset(entry);
> >>> +     if (start_offset % nr_pages)
> >>> +             return false;
> >>> +
> >>> +     si = swp_swap_info(entry);
> >>
> >> What ensures si remains valid (i.e. swapoff can't happen)? If swapoff can race,
> >> then swap_map may have been freed when you read it below. Holding the PTL can
> >> sometimes prevent it, but I don't think you're holding that here (you're using
> >> ptep_get_lockless(). Perhaps get_swap_device()/put_swap_device() can help?
> > Thank you for your review,you are righit! this place reaally needs
> > get_swap_device()/put_swap_device().
> >>
> >>> +     type = swp_type(entry);
> >>> +     has_cache = si->swap_map[start_offset] & SWAP_HAS_CACHE;
> >>> +     for (i = 1; i < nr_pages; i++) {
> >>> +             entry = pte_to_swp_entry(ptep_get_lockless(pte + i));
> >>> +             if (non_swap_entry(entry))
> >>> +                     return false;
> >>> +             if (swp_offset(entry) != start_offset + i)
> >>> +                     return false;
> >>> +             if (swp_type(entry) != type)
> >>> +                     return false;
> >>> +             /*
> >>> +              * while allocating a large folio and doing swap_read_folio for the
> >>> +              * SWP_SYNCHRONOUS_IO path, which is the case the being faulted pte
> >>> +              * doesn't have swapcache. We need to ensure all PTEs have no cache
> >>> +              * as well, otherwise, we might go to swap devices while the content
> >>> +              * is in swapcache
> >>> +              */
> >>> +             if ((si->swap_map[start_offset + i] & SWAP_HAS_CACHE) != has_cache)
> >>> +                     return false;
> >>> +     }
> >>> +
> >>> +     return true;
> >>> +}
> >>
> >> I created swap_pte_batch() for the swap-out series [1]. I wonder if that could
> >> be extended for the SWAP_HAS_CACHE checks? Possibly not because it assumes the
> >> PTL is held, and you are lockless here. Thought it might be of interest though.
> >>
> >> [1] https://lore.kernel.org/linux-mm/20240311150058.1122862-3-ryan.roberts@arm.com/
> >>
> > Thanks. It's probably simily to ours, but as you said we are lockless
> > here, and we need to check has_cache.
> >>> +
> >>> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> >>> +/*
> >>> + * Get a list of all the (large) orders below PMD_ORDER that are enabled
> >>> + * for this vma. Then filter out the orders that can't be allocated over
> >>> + * the faulting address and still be fully contained in the vma.
> >>> + */
> >>> +static inline unsigned long get_alloc_folio_orders(struct vm_fault *vmf)
> >>> +{
> >>> +     struct vm_area_struct *vma = vmf->vma;
> >>> +     unsigned long orders;
> >>> +
> >>> +     orders = thp_vma_allowable_orders(vma, vma->vm_flags, false, true, true,
> >>> +                                       BIT(PMD_ORDER) - 1);
> >>> +     orders = thp_vma_suitable_orders(vma, vmf->address, orders);
> >>> +     return orders;
> >>> +}
> >>> +#endif
> >>> +
> >>> +static struct folio *alloc_swap_folio(struct vm_fault *vmf)
> >>> +{
> >>> +     struct vm_area_struct *vma = vmf->vma;
> >>> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> >>> +     unsigned long orders;
> >>> +     struct folio *folio;
> >>> +     unsigned long addr;
> >>> +     pte_t *pte;
> >>> +     gfp_t gfp;
> >>> +     int order;
> >>> +
> >>> +     /*
> >>> +      * If uffd is active for the vma we need per-page fault fidelity to
> >>> +      * maintain the uffd semantics.
> >>> +      */
> >>> +     if (unlikely(userfaultfd_armed(vma)))
> >>> +             goto fallback;
> >>> +
> >>> +     /*
> >>> +      * a large folio being swapped-in could be partially in
> >>> +      * zswap and partially in swap devices, zswap doesn't
> >>> +      * support large folios yet, we might get corrupted
> >>> +      * zero-filled data by reading all subpages from swap
> >>> +      * devices while some of them are actually in zswap
> >>> +      */
> >>> +     if (is_zswap_enabled())
> >>> +             goto fallback;
> >>> +
> >>> +     orders = get_alloc_folio_orders(vmf);
> >>> +     if (!orders)
> >>> +             goto fallback;
> >>> +
> >>> +     pte = pte_offset_map(vmf->pmd, vmf->address & PMD_MASK);
> >>
> >> Could also briefly take PTL here, then is_pte_range_contig_swap() could be
> >> merged with an enhanced swap_pte_batch()?
> > Yes, it's easy to use a lock here, but I'm wondering if it's
> > necessary, because when we actually set pte in do_swap_page, we'll
> > hold PTL to check if the pte changes.
> >>
> >>> +     if (unlikely(!pte))
> >>> +             goto fallback;
> >>> +
> >>> +     /*
> >>> +      * For do_swap_page, find the highest order where the aligned range is
> >>> +      * completely swap entries with contiguous swap offsets.
> >>> +      */
> >>> +     order = highest_order(orders);
> >>> +     while (orders) {
> >>> +             addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
> >>> +             if (is_pte_range_contig_swap(pte + pte_index(addr), 1 << order))
> >>> +                     break;
> >>> +             order = next_order(&orders, order);
> >>> +     }
> >>
> >> So in the common case, swap-in will pull in the same size of folio as was
> >> swapped-out. Is that definitely the right policy for all folio sizes? Certainly
> >> it makes sense for "small" large folios (e.g. up to 64K IMHO). But I'm not sure
> >> it makes sense for 2M THP; As the size increases the chances of actually needing
> >> all of the folio reduces so chances are we are wasting IO. There are similar
> >> arguments for CoW, where we currently copy 1 page per fault - it probably makes
> >> sense to copy the whole folio up to a certain size.
> > For 2M THP, IO overhead may not necessarily be large? :)
> > 1.If 2M THP are continuously stored in the swap device, the IO
> > overhead may not be very large (such as submitting bio with one
> > bio_vec at a time).
> > 2.If the process really needs this 2M data, one page-fault may perform
> > much better than multiple.
> > 3.For swap devices like zram,using 2M THP might also improve
> > decompression efficiency.
> >
> > On the other hand, if the process only needs a small part of the 2M
> > data (such as only frequent use of 4K page, the rest of the data is
> > never accessed), This is indeed give a lark to catch a kite!  :(
>
> Yes indeed. It's not always clear-cut what the best thing to do is. It would be
> good to hear from others on this.
>
> >>
> >> Thanks,
> >> Ryan
> >>
> >>> +
> >>> +     pte_unmap(pte);
> >>> +
> >>> +     /* Try allocating the highest of the remaining orders. */
> >>> +     gfp = vma_thp_gfp_mask(vma);
> >>> +     while (orders) {
> >>> +             addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
> >>> +             folio = vma_alloc_folio(gfp, order, vma, addr, true);
> >>> +             if (folio)
> >>> +                     return folio;
> >>> +             order = next_order(&orders, order);
> >>> +     }
> >>> +
> >>> +fallback:
> >>> +#endif
> >>> +     return vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0, vma, vmf->address, false);
> >>> +}
> >>> +
> >>> +
> >>>  /*
> >>>   * We enter with non-exclusive mmap_lock (to exclude vma changes,
> >>>   * but allow concurrent faults), and pte mapped but not yet locked.
> >>> @@ -3928,6 +4058,9 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >>>       pte_t pte;
> >>>       vm_fault_t ret = 0;
> >>>       void *shadow = NULL;
> >>> +     int nr_pages = 1;
> >>> +     unsigned long start_address;
> >>> +     pte_t *start_pte;
> >>>
> >>>       if (!pte_unmap_same(vmf))
> >>>               goto out;
> >>> @@ -3991,35 +4124,41 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >>>       if (!folio) {
> >>>               if (data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
> >>>                   __swap_count(entry) == 1) {
> >>> -                     /*
> >>> -                      * Prevent parallel swapin from proceeding with
> >>> -                      * the cache flag. Otherwise, another thread may
> >>> -                      * finish swapin first, free the entry, and swapout
> >>> -                      * reusing the same entry. It's undetectable as
> >>> -                      * pte_same() returns true due to entry reuse.
> >>> -                      */
> >>> -                     if (swapcache_prepare(entry)) {
> >>> -                             /* Relax a bit to prevent rapid repeated page faults */
> >>> -                             schedule_timeout_uninterruptible(1);
> >>> -                             goto out;
> >>> -                     }
> >>> -                     need_clear_cache = true;
> >>> -
> >>>                       /* skip swapcache */
> >>> -                     folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0,
> >>> -                                             vma, vmf->address, false);
> >>> +                     folio = alloc_swap_folio(vmf);
> >>>                       page = &folio->page;
> >>>                       if (folio) {
> >>>                               __folio_set_locked(folio);
> >>>                               __folio_set_swapbacked(folio);
> >>>
> >>> +                             if (folio_test_large(folio)) {
> >>> +                                     nr_pages = folio_nr_pages(folio);
> >>> +                                     entry.val = ALIGN_DOWN(entry.val, nr_pages);
> >>> +                             }
> >>> +
> >>> +                             /*
> >>> +                              * Prevent parallel swapin from proceeding with
> >>> +                              * the cache flag. Otherwise, another thread may
> >>> +                              * finish swapin first, free the entry, and swapout
> >>> +                              * reusing the same entry. It's undetectable as
> >>> +                              * pte_same() returns true due to entry reuse.
> >>> +                              */
> >>> +                             if (swapcache_prepare_nr(entry, nr_pages)) {
> >>> +                                     /* Relax a bit to prevent rapid repeated page faults */
> >>> +                                     schedule_timeout_uninterruptible(1);
> >>> +                                     goto out;
> >>> +                             }
> >>> +                             need_clear_cache = true;
> >>> +
> >>>                               if (mem_cgroup_swapin_charge_folio(folio,
> >>>                                                       vma->vm_mm, GFP_KERNEL,
> >>>                                                       entry)) {
> >>>                                       ret = VM_FAULT_OOM;
> >>>                                       goto out_page;
> >>>                               }
> >>> -                             mem_cgroup_swapin_uncharge_swap(entry);
> >>> +
> >>> +                             for (swp_entry_t e = entry; e.val < entry.val + nr_pages; e.val++)
> >>> +                                     mem_cgroup_swapin_uncharge_swap(e);
> >>>
> >>>                               shadow = get_shadow_from_swap_cache(entry);
> >>>                               if (shadow)
> >>> @@ -4118,6 +4257,42 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >>>        */
> >>>       vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
> >>>                       &vmf->ptl);
> >>> +
> >>> +     start_address = vmf->address;
> >>> +     start_pte = vmf->pte;
> >>> +     if (start_pte && folio_test_large(folio)) {
> >>> +             unsigned long nr = folio_nr_pages(folio);
> >>> +             unsigned long addr = ALIGN_DOWN(vmf->address, nr * PAGE_SIZE);
> >>> +             pte_t *aligned_pte = vmf->pte - (vmf->address - addr) / PAGE_SIZE;
> >>> +
> >>> +             /*
> >>> +              * case 1: we are allocating large_folio, try to map it as a whole
> >>> +              * iff the swap entries are still entirely mapped;
> >>> +              * case 2: we hit a large folio in swapcache, and all swap entries
> >>> +              * are still entirely mapped, try to map a large folio as a whole.
> >>> +              * otherwise, map only the faulting page within the large folio
> >>> +              * which is swapcache
> >>> +              */
> >>> +             if (!is_pte_range_contig_swap(aligned_pte, nr)) {
> >>> +                     if (nr_pages > 1) /* ptes have changed for case 1 */
> >>> +                             goto out_nomap;
> >>> +                     goto check_pte;
> >>> +             }
> >>> +
> >>> +             start_address = addr;
> >>> +             start_pte = aligned_pte;
> >>> +             /*
> >>> +              * the below has been done before swap_read_folio()
> >>> +              * for case 1
> >>> +              */
> >>> +             if (unlikely(folio == swapcache)) {
> >>> +                     nr_pages = nr;
> >>> +                     entry.val = ALIGN_DOWN(entry.val, nr_pages);
> >>> +                     page = &folio->page;
> >>> +             }
> >>> +     }
> >>> +
> >>> +check_pte:
> >>>       if (unlikely(!vmf->pte || !pte_same(ptep_get(vmf->pte), vmf->orig_pte)))
> >>>               goto out_nomap;
> >>>
> >>> @@ -4185,12 +4360,14 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >>>        * We're already holding a reference on the page but haven't mapped it
> >>>        * yet.
> >>>        */
> >>> -     swap_free(entry);
> >>> +     swap_nr_free(entry, nr_pages);
> >>>       if (should_try_to_free_swap(folio, vma, vmf->flags))
> >>>               folio_free_swap(folio);
> >>>
> >>> -     inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
> >>> -     dec_mm_counter(vma->vm_mm, MM_SWAPENTS);
> >>> +     folio_ref_add(folio, nr_pages - 1);
> >>> +     add_mm_counter(vma->vm_mm, MM_ANONPAGES, nr_pages);
> >>> +     add_mm_counter(vma->vm_mm, MM_SWAPENTS, -nr_pages);
> >>> +
> >>>       pte = mk_pte(page, vma->vm_page_prot);
> >>>
> >>>       /*
> >>> @@ -4200,14 +4377,14 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >>>        * exclusivity.
> >>>        */
> >>>       if (!folio_test_ksm(folio) &&
> >>> -         (exclusive || folio_ref_count(folio) == 1)) {
> >>> +         (exclusive || folio_ref_count(folio) == nr_pages)) {
> >>>               if (vmf->flags & FAULT_FLAG_WRITE) {
> >>>                       pte = maybe_mkwrite(pte_mkdirty(pte), vma);
> >>>                       vmf->flags &= ~FAULT_FLAG_WRITE;
> >>>               }
> >>>               rmap_flags |= RMAP_EXCLUSIVE;
> >>>       }
> >>> -     flush_icache_page(vma, page);
> >>> +     flush_icache_pages(vma, page, nr_pages);
> >>>       if (pte_swp_soft_dirty(vmf->orig_pte))
> >>>               pte = pte_mksoft_dirty(pte);
> >>>       if (pte_swp_uffd_wp(vmf->orig_pte))
> >>> @@ -4216,17 +4393,19 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >>>
> >>>       /* ksm created a completely new copy */
> >>>       if (unlikely(folio != swapcache && swapcache)) {
> >>> -             folio_add_new_anon_rmap(folio, vma, vmf->address);
> >>> +             folio_add_new_anon_rmap(folio, vma, start_address);
> >>>               folio_add_lru_vma(folio, vma);
> >>> +     } else if (!folio_test_anon(folio)) {
> >>> +             folio_add_new_anon_rmap(folio, vma, start_address);
> >>>       } else {
> >>> -             folio_add_anon_rmap_pte(folio, page, vma, vmf->address,
> >>> +             folio_add_anon_rmap_ptes(folio, page, nr_pages, vma, start_address,
> >>>                                       rmap_flags);
> >>>       }
> >>>
> >>>       VM_BUG_ON(!folio_test_anon(folio) ||
> >>>                       (pte_write(pte) && !PageAnonExclusive(page)));
> >>> -     set_pte_at(vma->vm_mm, vmf->address, vmf->pte, pte);
> >>> -     arch_do_swap_page(vma->vm_mm, vma, vmf->address, pte, vmf->orig_pte);
> >>> +     set_ptes(vma->vm_mm, start_address, start_pte, pte, nr_pages);
> >>> +     arch_do_swap_page(vma->vm_mm, vma, start_address, pte, vmf->orig_pte);
> >>>
> >>>       folio_unlock(folio);
> >>>       if (folio != swapcache && swapcache) {
> >>> @@ -4243,6 +4422,9 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >>>       }
> >>>
> >>>       if (vmf->flags & FAULT_FLAG_WRITE) {
> >>> +             if (nr_pages > 1)
> >>> +                     vmf->orig_pte = ptep_get(vmf->pte);
> >>> +
> >>>               ret |= do_wp_page(vmf);
> >>>               if (ret & VM_FAULT_ERROR)
> >>>                       ret &= VM_FAULT_ERROR;
> >>> @@ -4250,14 +4432,14 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >>>       }
> >>>
> >>>       /* No need to invalidate - it was non-present before */
> >>> -     update_mmu_cache_range(vmf, vma, vmf->address, vmf->pte, 1);
> >>> +     update_mmu_cache_range(vmf, vma, start_address, start_pte, nr_pages);
> >>>  unlock:
> >>>       if (vmf->pte)
> >>>               pte_unmap_unlock(vmf->pte, vmf->ptl);
> >>>  out:
> >>>       /* Clear the swap cache pin for direct swapin after PTL unlock */
> >>>       if (need_clear_cache)
> >>> -             swapcache_clear(si, entry);
> >>> +             swapcache_clear_nr(si, entry, nr_pages);
> >>>       if (si)
> >>>               put_swap_device(si);
> >>>       return ret;
> >>> @@ -4273,7 +4455,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >>>               folio_put(swapcache);
> >>>       }
> >>>       if (need_clear_cache)
> >>> -             swapcache_clear(si, entry);
> >>> +             swapcache_clear_nr(si, entry, nr_pages);
> >>>       if (si)
> >>>               put_swap_device(si);
> >>>       return ret;
> >>> @@ -4309,15 +4491,7 @@ static struct folio *alloc_anon_folio(struct vm_fault *vmf)
> >>>       if (unlikely(userfaultfd_armed(vma)))
> >>>               goto fallback;
> >>>
> >>> -     /*
> >>> -      * Get a list of all the (large) orders below PMD_ORDER that are enabled
> >>> -      * for this vma. Then filter out the orders that can't be allocated over
> >>> -      * the faulting address and still be fully contained in the vma.
> >>> -      */
> >>> -     orders = thp_vma_allowable_orders(vma, vma->vm_flags, false, true, true,
> >>> -                                       BIT(PMD_ORDER) - 1);
> >>> -     orders = thp_vma_suitable_orders(vma, vmf->address, orders);
> >>> -
> >>> +     orders = get_alloc_folio_orders(vmf);
> >>>       if (!orders)
> >>>               goto fallback;
> >>>
> >>
> >>
> >
> >
>
Huang, Ying March 15, 2024, 8:41 a.m. UTC | #6
Barry Song <21cnbao@gmail.com> writes:

> From: Chuanhua Han <hanchuanhua@oppo.com>
>
> On an embedded system like Android, more than half of anon memory is
> actually in swap devices such as zRAM. For example, while an app is
> switched to background, its most memory might be swapped-out.
>
> Now we have mTHP features, unfortunately, if we don't support large folios
> swap-in, once those large folios are swapped-out, we immediately lose the
> performance gain we can get through large folios and hardware optimization
> such as CONT-PTE.
>
> This patch brings up mTHP swap-in support. Right now, we limit mTHP swap-in
> to those contiguous swaps which were likely swapped out from mTHP as a
> whole.
>
> Meanwhile, the current implementation only covers the SWAP_SYCHRONOUS
> case. It doesn't support swapin_readahead as large folios yet since this
> kind of shared memory is much less than memory mapped by single process.

In contrast, I still think that it's better to start with normal swap-in
path, then expand to SWAP_SYCHRONOUS case.

In normal swap-in path, we can take advantage of swap readahead
information to determine the swapped-in large folio order.  That is, if
the return value of swapin_nr_pages() > 1, then we can try to allocate
and swapin a large folio.

To do that, we need to track whether the sub-pages are accessed.  I
guess we need that information for large file folio readahead too.

Hi, Matthew,

Can you help us on tracking whether the sub-pages of a readahead large
folio has been accessed?

> Right now, we are re-faulting large folios which are still in swapcache as a
> whole, this can effectively decrease extra loops and early-exitings which we
> have increased in arch_swap_restore() while supporting MTE restore for folios
> rather than page. On the other hand, it can also decrease do_swap_page as
> PTEs used to be set one by one even we hit a large folio in swapcache.
>

--
Best Regards,
Huang, Ying
Barry Song March 15, 2024, 8:54 a.m. UTC | #7
On Fri, Mar 15, 2024 at 9:43 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> Barry Song <21cnbao@gmail.com> writes:
>
> > From: Chuanhua Han <hanchuanhua@oppo.com>
> >
> > On an embedded system like Android, more than half of anon memory is
> > actually in swap devices such as zRAM. For example, while an app is
> > switched to background, its most memory might be swapped-out.
> >
> > Now we have mTHP features, unfortunately, if we don't support large folios
> > swap-in, once those large folios are swapped-out, we immediately lose the
> > performance gain we can get through large folios and hardware optimization
> > such as CONT-PTE.
> >
> > This patch brings up mTHP swap-in support. Right now, we limit mTHP swap-in
> > to those contiguous swaps which were likely swapped out from mTHP as a
> > whole.
> >
> > Meanwhile, the current implementation only covers the SWAP_SYCHRONOUS
> > case. It doesn't support swapin_readahead as large folios yet since this
> > kind of shared memory is much less than memory mapped by single process.
>
> In contrast, I still think that it's better to start with normal swap-in
> path, then expand to SWAP_SYCHRONOUS case.

I'd rather try the reverse direction as non-sync anon memory is only around
3% in a phone as my observation.

>
> In normal swap-in path, we can take advantage of swap readahead
> information to determine the swapped-in large folio order.  That is, if
> the return value of swapin_nr_pages() > 1, then we can try to allocate
> and swapin a large folio.

I am not quite sure we still need to depend on this. in do_anon_page,
we have broken the assumption and allocated a large folio directly.

On the other hand, compressing/decompressing large folios as a
whole rather than doing it one by one can save a large percent of
CPUs and provide a much lower compression ratio.  With a hardware
accelerator, this is even faster.

So I'd rather more aggressively get large folios swap-in involved
than depending on readahead.

>
> To do that, we need to track whether the sub-pages are accessed.  I
> guess we need that information for large file folio readahead too.
>
> Hi, Matthew,
>
> Can you help us on tracking whether the sub-pages of a readahead large
> folio has been accessed?
>
> > Right now, we are re-faulting large folios which are still in swapcache as a
> > whole, this can effectively decrease extra loops and early-exitings which we
> > have increased in arch_swap_restore() while supporting MTE restore for folios
> > rather than page. On the other hand, it can also decrease do_swap_page as
> > PTEs used to be set one by one even we hit a large folio in swapcache.
> >
>
> --
> Best Regards,
> Huang, Ying

Thanks
Barry
Huang, Ying March 15, 2024, 9:15 a.m. UTC | #8
Barry Song <21cnbao@gmail.com> writes:

> On Fri, Mar 15, 2024 at 9:43 PM Huang, Ying <ying.huang@intel.com> wrote:
>>
>> Barry Song <21cnbao@gmail.com> writes:
>>
>> > From: Chuanhua Han <hanchuanhua@oppo.com>
>> >
>> > On an embedded system like Android, more than half of anon memory is
>> > actually in swap devices such as zRAM. For example, while an app is
>> > switched to background, its most memory might be swapped-out.
>> >
>> > Now we have mTHP features, unfortunately, if we don't support large folios
>> > swap-in, once those large folios are swapped-out, we immediately lose the
>> > performance gain we can get through large folios and hardware optimization
>> > such as CONT-PTE.
>> >
>> > This patch brings up mTHP swap-in support. Right now, we limit mTHP swap-in
>> > to those contiguous swaps which were likely swapped out from mTHP as a
>> > whole.
>> >
>> > Meanwhile, the current implementation only covers the SWAP_SYCHRONOUS
>> > case. It doesn't support swapin_readahead as large folios yet since this
>> > kind of shared memory is much less than memory mapped by single process.
>>
>> In contrast, I still think that it's better to start with normal swap-in
>> path, then expand to SWAP_SYCHRONOUS case.
>
> I'd rather try the reverse direction as non-sync anon memory is only around
> 3% in a phone as my observation.

Phone is not the only platform that Linux is running on.

>>
>> In normal swap-in path, we can take advantage of swap readahead
>> information to determine the swapped-in large folio order.  That is, if
>> the return value of swapin_nr_pages() > 1, then we can try to allocate
>> and swapin a large folio.
>
> I am not quite sure we still need to depend on this. in do_anon_page,
> we have broken the assumption and allocated a large folio directly.

I don't think that we have a sophisticated policy to allocate large
folio.  Large folio could waste memory for some workloads, so I think
that it's a good idea to allocate large folio always.

Readahead gives us an opportunity to play with the policy.

> On the other hand, compressing/decompressing large folios as a
> whole rather than doing it one by one can save a large percent of
> CPUs and provide a much lower compression ratio.  With a hardware
> accelerator, this is even faster.

I am not against to support large folio for compressing/decompressing.

I just suggest to do that later, after we play with normal swap-in.
SWAP_SYCHRONOUS related swap-in code is an optimization based on normal
swap.  So, it seems natural to support large folio swap-in for normal
swap-in firstly.

> So I'd rather more aggressively get large folios swap-in involved
> than depending on readahead.

We can take advantage of readahead algorithm in SWAP_SYCHRONOUS
optimization too.  The sub-pages that is not accessed by page fault can
be treated as readahead.  I think that is a better policy than
allocating large folio always.

>>
>> To do that, we need to track whether the sub-pages are accessed.  I
>> guess we need that information for large file folio readahead too.
>>
>> Hi, Matthew,
>>
>> Can you help us on tracking whether the sub-pages of a readahead large
>> folio has been accessed?
>>
>> > Right now, we are re-faulting large folios which are still in swapcache as a
>> > whole, this can effectively decrease extra loops and early-exitings which we
>> > have increased in arch_swap_restore() while supporting MTE restore for folios
>> > rather than page. On the other hand, it can also decrease do_swap_page as
>> > PTEs used to be set one by one even we hit a large folio in swapcache.
>> >
>>
--
Best Regards,
Huang, Ying
Barry Song March 15, 2024, 10:01 a.m. UTC | #9
On Fri, Mar 15, 2024 at 10:17 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> Barry Song <21cnbao@gmail.com> writes:
>
> > On Fri, Mar 15, 2024 at 9:43 PM Huang, Ying <ying.huang@intel.com> wrote:
> >>
> >> Barry Song <21cnbao@gmail.com> writes:
> >>
> >> > From: Chuanhua Han <hanchuanhua@oppo.com>
> >> >
> >> > On an embedded system like Android, more than half of anon memory is
> >> > actually in swap devices such as zRAM. For example, while an app is
> >> > switched to background, its most memory might be swapped-out.
> >> >
> >> > Now we have mTHP features, unfortunately, if we don't support large folios
> >> > swap-in, once those large folios are swapped-out, we immediately lose the
> >> > performance gain we can get through large folios and hardware optimization
> >> > such as CONT-PTE.
> >> >
> >> > This patch brings up mTHP swap-in support. Right now, we limit mTHP swap-in
> >> > to those contiguous swaps which were likely swapped out from mTHP as a
> >> > whole.
> >> >
> >> > Meanwhile, the current implementation only covers the SWAP_SYCHRONOUS
> >> > case. It doesn't support swapin_readahead as large folios yet since this
> >> > kind of shared memory is much less than memory mapped by single process.
> >>
> >> In contrast, I still think that it's better to start with normal swap-in
> >> path, then expand to SWAP_SYCHRONOUS case.
> >
> > I'd rather try the reverse direction as non-sync anon memory is only around
> > 3% in a phone as my observation.
>
> Phone is not the only platform that Linux is running on.

I suppose it's generally true that forked shared anonymous pages only
constitute a
small portion of all anonymous pages. The majority of anonymous pages are within
a single process.

I agree phones are not the only platform. But Rome wasn't built in a
day. I can only get
started on a hardware which I can easily reach and have enough hardware/test
resources on it. So we may take the first step which can be applied on
a real product
and improve its performance, and step by step, we broaden it and make it
widely useful to various areas  in which I can't reach :-)

so probably we can have a sysfs "enable" entry with default "n" or
have a maximum
swap-in order as Ryan's suggestion [1] at the beginning,

"
So in the common case, swap-in will pull in the same size of folio as was
swapped-out. Is that definitely the right policy for all folio sizes? Certainly
it makes sense for "small" large folios (e.g. up to 64K IMHO). But I'm not sure
it makes sense for 2M THP; As the size increases the chances of actually needing
all of the folio reduces so chances are we are wasting IO. There are similar
arguments for CoW, where we currently copy 1 page per fault - it probably makes
sense to copy the whole folio up to a certain size.
"

>
> >>
> >> In normal swap-in path, we can take advantage of swap readahead
> >> information to determine the swapped-in large folio order.  That is, if
> >> the return value of swapin_nr_pages() > 1, then we can try to allocate
> >> and swapin a large folio.
> >
> > I am not quite sure we still need to depend on this. in do_anon_page,
> > we have broken the assumption and allocated a large folio directly.
>
> I don't think that we have a sophisticated policy to allocate large
> folio.  Large folio could waste memory for some workloads, so I think
> that it's a good idea to allocate large folio always.

i agree, but we still have the below check just like do_anon_page() has it,

        orders = thp_vma_allowable_orders(vma, vma->vm_flags, false, true, true,
                                          BIT(PMD_ORDER) - 1);
        orders = thp_vma_suitable_orders(vma, vmf->address, orders);

in do_anon_page, we don't worry about the waste so much, the same
logic also applies to do_swap_page().

>
> Readahead gives us an opportunity to play with the policy.

I feel somehow the rules of the game have changed with an upper
limit for swap-in size. for example, if the upper limit is 4 order,
it limits folio size to 64KiB which is still a proper size for ARM64
whose base page can be 64KiB.

on the other hand, while swapping out large folios, we will always
compress them as a whole(zsmalloc/zram patch will come in a
couple of days), if we choose to decompress a subpage instead of
a large folio in do_swap_page(), we might need to decompress
nr_pages times. for example,

For large folios 16*4KiB, they are saved as a large object in zsmalloc(with
the coming patch), if we swap in a small folio, we decompress the large
object; next time, we will still need to decompress a large object. so
it is more sensible to swap in a large folio if we find those
swap entries are contiguous and were allocated by a large folio swap-out.

>
> > On the other hand, compressing/decompressing large folios as a
> > whole rather than doing it one by one can save a large percent of
> > CPUs and provide a much lower compression ratio.  With a hardware
> > accelerator, this is even faster.
>
> I am not against to support large folio for compressing/decompressing.
>
> I just suggest to do that later, after we play with normal swap-in.
> SWAP_SYCHRONOUS related swap-in code is an optimization based on normal
> swap.  So, it seems natural to support large folio swap-in for normal
> swap-in firstly.

I feel like SWAP_SYCHRONOUS is a simpler case and even more "normal"
than the swapcache path since it is the majority. and on the other hand, a lot
of modification is required for the swapcache path. in OPPO's code[1], we did
bring-up both paths, but the swapcache path is much much more complicated
than the SYNC path and hasn't really noticeable improvement.

[1] https://github.com/OnePlusOSS/android_kernel_oneplus_sm8650/tree/oneplus/sm8650_u_14.0.0_oneplus12

>
> > So I'd rather more aggressively get large folios swap-in involved
> > than depending on readahead.
>
> We can take advantage of readahead algorithm in SWAP_SYCHRONOUS
> optimization too.  The sub-pages that is not accessed by page fault can
> be treated as readahead.  I think that is a better policy than
> allocating large folio always.

Considering the zsmalloc optimization, it would be a better choice to
always allocate
large folios if we find those swap entries are for a swapped-out large folio. as
decompressing just once, we get all subpages.
Some hardware accelerators are even able to decompress a large folio with
multi-hardware threads, for example, 16 hardware threads can compress
each subpage of a large folio at the same time, it is just as fast as
decompressing
one subpage.

for platforms without the above optimizations, a proper upper limit
will help them
disable the large folios swap-in or decrease the impact. For example,
if the upper
limit is 0-order, we are just removing this patchset. if the upper
limit is 2 orders, we
are just like BASE_PAGE size is 16KiB.

>
> >>
> >> To do that, we need to track whether the sub-pages are accessed.  I
> >> guess we need that information for large file folio readahead too.
> >>
> >> Hi, Matthew,
> >>
> >> Can you help us on tracking whether the sub-pages of a readahead large
> >> folio has been accessed?
> >>
> >> > Right now, we are re-faulting large folios which are still in swapcache as a
> >> > whole, this can effectively decrease extra loops and early-exitings which we
> >> > have increased in arch_swap_restore() while supporting MTE restore for folios
> >> > rather than page. On the other hand, it can also decrease do_swap_page as
> >> > PTEs used to be set one by one even we hit a large folio in swapcache.
> >> >
> >>
> --
> Best Regards,
> Huang, Ying

Thanks
Barry
Ryan Roberts March 15, 2024, 10:59 a.m. UTC | #10
On 14/03/2024 20:43, Barry Song wrote:
> On Fri, Mar 15, 2024 at 2:57 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>
>> On 14/03/2024 12:56, Chuanhua Han wrote:
>>> Ryan Roberts <ryan.roberts@arm.com> 于2024年3月13日周三 00:33写道:
>>>>
>>>> On 04/03/2024 08:13, Barry Song wrote:
>>>>> From: Chuanhua Han <hanchuanhua@oppo.com>
>>>>>
>>>>> On an embedded system like Android, more than half of anon memory is
>>>>> actually in swap devices such as zRAM. For example, while an app is
>>>>> switched to background, its most memory might be swapped-out.
>>>>>
>>>>> Now we have mTHP features, unfortunately, if we don't support large folios
>>>>> swap-in, once those large folios are swapped-out, we immediately lose the
>>>>> performance gain we can get through large folios and hardware optimization
>>>>> such as CONT-PTE.
>>>>>
>>>>> This patch brings up mTHP swap-in support. Right now, we limit mTHP swap-in
>>>>> to those contiguous swaps which were likely swapped out from mTHP as a
>>>>> whole.
>>>>>
>>>>> Meanwhile, the current implementation only covers the SWAP_SYCHRONOUS
>>>>> case. It doesn't support swapin_readahead as large folios yet since this
>>>>> kind of shared memory is much less than memory mapped by single process.
>>>>>
>>>>> Right now, we are re-faulting large folios which are still in swapcache as a
>>>>> whole, this can effectively decrease extra loops and early-exitings which we
>>>>> have increased in arch_swap_restore() while supporting MTE restore for folios
>>>>> rather than page. On the other hand, it can also decrease do_swap_page as
>>>>> PTEs used to be set one by one even we hit a large folio in swapcache.
>>>>>
>>>>> Signed-off-by: Chuanhua Han <hanchuanhua@oppo.com>
>>>>> Co-developed-by: Barry Song <v-songbaohua@oppo.com>
>>>>> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
>>>>> ---
>>>>>  mm/memory.c | 250 ++++++++++++++++++++++++++++++++++++++++++++--------
>>>>>  1 file changed, 212 insertions(+), 38 deletions(-)
>>>>>
>>>>> diff --git a/mm/memory.c b/mm/memory.c
>>>>> index e0d34d705e07..501ede745ef3 100644
>>>>> --- a/mm/memory.c
>>>>> +++ b/mm/memory.c
>>>>> @@ -3907,6 +3907,136 @@ static vm_fault_t handle_pte_marker(struct vm_fault *vmf)
>>>>>       return VM_FAULT_SIGBUS;
>>>>>  }
>>>>>
>>>>> +/*
>>>>> + * check a range of PTEs are completely swap entries with
>>>>> + * contiguous swap offsets and the same SWAP_HAS_CACHE.
>>>>> + * pte must be first one in the range
>>>>> + */
>>>>> +static bool is_pte_range_contig_swap(pte_t *pte, int nr_pages)
>>>>> +{
>>>>> +     int i;
>>>>> +     struct swap_info_struct *si;
>>>>> +     swp_entry_t entry;
>>>>> +     unsigned type;
>>>>> +     pgoff_t start_offset;
>>>>> +     char has_cache;
>>>>> +
>>>>> +     entry = pte_to_swp_entry(ptep_get_lockless(pte));
>>>>
>>>> Given you are getting entry locklessly, I expect it could change under you? So
>>>> probably need to check that its a swap entry, etc. first?
>>> The following non_swap_entry checks to see if it is a swap entry.
>>
>> No, it checks if something already known to be a "swap entry" type is actually
>> describing a swap entry, or a non-swap entry (e.g. migration entry, hwpoison
>> entry, etc.) Swap entries with type >= MAX_SWAPFILES don't actually describe swap:
>>
>> static inline int non_swap_entry(swp_entry_t entry)
>> {
>>         return swp_type(entry) >= MAX_SWAPFILES;
>> }
>>
>>
>> So you need to do something like:
>>
>> pte = ptep_get_lockless(pte);
>> if (pte_none(pte) || !pte_present(pte))
>>         return false;
> 
> 
> Indeed, I noticed that a couple of days ago, but it turned out that it
> didn't cause any issues
> because the condition following 'if (swp_offset(entry) != start_offset
> + i)'  cannot be true :-)
> 
> I do agree it needs a fix here. maybe by
> 
> if (!is_swap_pte(pte))
>            return false?

Nice! I hadn't noticed is_swap_pte().

> 
>> entry = pte_to_swp_entry(pte);
>> if (non_swap_entry(entry))
>>         return false;
>> ...
>>
>>>>
>>>>> +     if (non_swap_entry(entry))
>>>>> +             return false;
>>>>> +     start_offset = swp_offset(entry);
>>>>> +     if (start_offset % nr_pages)
>>>>> +             return false;
>>>>> +
>>>>> +     si = swp_swap_info(entry);
>>>>
>>>> What ensures si remains valid (i.e. swapoff can't happen)? If swapoff can race,
>>>> then swap_map may have been freed when you read it below. Holding the PTL can
>>>> sometimes prevent it, but I don't think you're holding that here (you're using
>>>> ptep_get_lockless(). Perhaps get_swap_device()/put_swap_device() can help?
>>> Thank you for your review,you are righit! this place reaally needs
>>> get_swap_device()/put_swap_device().
>>>>
>>>>> +     type = swp_type(entry);
>>>>> +     has_cache = si->swap_map[start_offset] & SWAP_HAS_CACHE;
>>>>> +     for (i = 1; i < nr_pages; i++) {
>>>>> +             entry = pte_to_swp_entry(ptep_get_lockless(pte + i));
>>>>> +             if (non_swap_entry(entry))
>>>>> +                     return false;
>>>>> +             if (swp_offset(entry) != start_offset + i)
>>>>> +                     return false;
>>>>> +             if (swp_type(entry) != type)
>>>>> +                     return false;
>>>>> +             /*
>>>>> +              * while allocating a large folio and doing swap_read_folio for the
>>>>> +              * SWP_SYNCHRONOUS_IO path, which is the case the being faulted pte
>>>>> +              * doesn't have swapcache. We need to ensure all PTEs have no cache
>>>>> +              * as well, otherwise, we might go to swap devices while the content
>>>>> +              * is in swapcache
>>>>> +              */
>>>>> +             if ((si->swap_map[start_offset + i] & SWAP_HAS_CACHE) != has_cache)
>>>>> +                     return false;
>>>>> +     }
>>>>> +
>>>>> +     return true;
>>>>> +}
>>>>
>>>> I created swap_pte_batch() for the swap-out series [1]. I wonder if that could
>>>> be extended for the SWAP_HAS_CACHE checks? Possibly not because it assumes the
>>>> PTL is held, and you are lockless here. Thought it might be of interest though.
>>>>
>>>> [1] https://lore.kernel.org/linux-mm/20240311150058.1122862-3-ryan.roberts@arm.com/
>>>>
>>> Thanks. It's probably simily to ours, but as you said we are lockless
>>> here, and we need to check has_cache.
>>>>> +
>>>>> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
>>>>> +/*
>>>>> + * Get a list of all the (large) orders below PMD_ORDER that are enabled
>>>>> + * for this vma. Then filter out the orders that can't be allocated over
>>>>> + * the faulting address and still be fully contained in the vma.
>>>>> + */
>>>>> +static inline unsigned long get_alloc_folio_orders(struct vm_fault *vmf)
>>>>> +{
>>>>> +     struct vm_area_struct *vma = vmf->vma;
>>>>> +     unsigned long orders;
>>>>> +
>>>>> +     orders = thp_vma_allowable_orders(vma, vma->vm_flags, false, true, true,
>>>>> +                                       BIT(PMD_ORDER) - 1);
>>>>> +     orders = thp_vma_suitable_orders(vma, vmf->address, orders);
>>>>> +     return orders;
>>>>> +}
>>>>> +#endif
>>>>> +
>>>>> +static struct folio *alloc_swap_folio(struct vm_fault *vmf)
>>>>> +{
>>>>> +     struct vm_area_struct *vma = vmf->vma;
>>>>> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
>>>>> +     unsigned long orders;
>>>>> +     struct folio *folio;
>>>>> +     unsigned long addr;
>>>>> +     pte_t *pte;
>>>>> +     gfp_t gfp;
>>>>> +     int order;
>>>>> +
>>>>> +     /*
>>>>> +      * If uffd is active for the vma we need per-page fault fidelity to
>>>>> +      * maintain the uffd semantics.
>>>>> +      */
>>>>> +     if (unlikely(userfaultfd_armed(vma)))
>>>>> +             goto fallback;
>>>>> +
>>>>> +     /*
>>>>> +      * a large folio being swapped-in could be partially in
>>>>> +      * zswap and partially in swap devices, zswap doesn't
>>>>> +      * support large folios yet, we might get corrupted
>>>>> +      * zero-filled data by reading all subpages from swap
>>>>> +      * devices while some of them are actually in zswap
>>>>> +      */
>>>>> +     if (is_zswap_enabled())
>>>>> +             goto fallback;
>>>>> +
>>>>> +     orders = get_alloc_folio_orders(vmf);
>>>>> +     if (!orders)
>>>>> +             goto fallback;
>>>>> +
>>>>> +     pte = pte_offset_map(vmf->pmd, vmf->address & PMD_MASK);
>>>>
>>>> Could also briefly take PTL here, then is_pte_range_contig_swap() could be
>>>> merged with an enhanced swap_pte_batch()?
>>> Yes, it's easy to use a lock here, but I'm wondering if it's
>>> necessary, because when we actually set pte in do_swap_page, we'll
>>> hold PTL to check if the pte changes.
>>>>
>>>>> +     if (unlikely(!pte))
>>>>> +             goto fallback;
>>>>> +
>>>>> +     /*
>>>>> +      * For do_swap_page, find the highest order where the aligned range is
>>>>> +      * completely swap entries with contiguous swap offsets.
>>>>> +      */
>>>>> +     order = highest_order(orders);
>>>>> +     while (orders) {
>>>>> +             addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
>>>>> +             if (is_pte_range_contig_swap(pte + pte_index(addr), 1 << order))
>>>>> +                     break;
>>>>> +             order = next_order(&orders, order);
>>>>> +     }
>>>>
>>>> So in the common case, swap-in will pull in the same size of folio as was
>>>> swapped-out. Is that definitely the right policy for all folio sizes? Certainly
>>>> it makes sense for "small" large folios (e.g. up to 64K IMHO). But I'm not sure
>>>> it makes sense for 2M THP; As the size increases the chances of actually needing
>>>> all of the folio reduces so chances are we are wasting IO. There are similar
>>>> arguments for CoW, where we currently copy 1 page per fault - it probably makes
>>>> sense to copy the whole folio up to a certain size.
>>> For 2M THP, IO overhead may not necessarily be large? :)
>>> 1.If 2M THP are continuously stored in the swap device, the IO
>>> overhead may not be very large (such as submitting bio with one
>>> bio_vec at a time).
>>> 2.If the process really needs this 2M data, one page-fault may perform
>>> much better than multiple.
>>> 3.For swap devices like zram,using 2M THP might also improve
>>> decompression efficiency.
>>>
>>> On the other hand, if the process only needs a small part of the 2M
>>> data (such as only frequent use of 4K page, the rest of the data is
>>> never accessed), This is indeed give a lark to catch a kite!  :(
>>
>> Yes indeed. It's not always clear-cut what the best thing to do is. It would be
>> good to hear from others on this.
>>
>>>>
>>>> Thanks,
>>>> Ryan
>>>>
>>>>> +
>>>>> +     pte_unmap(pte);
>>>>> +
>>>>> +     /* Try allocating the highest of the remaining orders. */
>>>>> +     gfp = vma_thp_gfp_mask(vma);
>>>>> +     while (orders) {
>>>>> +             addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
>>>>> +             folio = vma_alloc_folio(gfp, order, vma, addr, true);
>>>>> +             if (folio)
>>>>> +                     return folio;
>>>>> +             order = next_order(&orders, order);
>>>>> +     }
>>>>> +
>>>>> +fallback:
>>>>> +#endif
>>>>> +     return vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0, vma, vmf->address, false);
>>>>> +}
>>>>> +
>>>>> +
>>>>>  /*
>>>>>   * We enter with non-exclusive mmap_lock (to exclude vma changes,
>>>>>   * but allow concurrent faults), and pte mapped but not yet locked.
>>>>> @@ -3928,6 +4058,9 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>>>>       pte_t pte;
>>>>>       vm_fault_t ret = 0;
>>>>>       void *shadow = NULL;
>>>>> +     int nr_pages = 1;
>>>>> +     unsigned long start_address;
>>>>> +     pte_t *start_pte;
>>>>>
>>>>>       if (!pte_unmap_same(vmf))
>>>>>               goto out;
>>>>> @@ -3991,35 +4124,41 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>>>>       if (!folio) {
>>>>>               if (data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
>>>>>                   __swap_count(entry) == 1) {
>>>>> -                     /*
>>>>> -                      * Prevent parallel swapin from proceeding with
>>>>> -                      * the cache flag. Otherwise, another thread may
>>>>> -                      * finish swapin first, free the entry, and swapout
>>>>> -                      * reusing the same entry. It's undetectable as
>>>>> -                      * pte_same() returns true due to entry reuse.
>>>>> -                      */
>>>>> -                     if (swapcache_prepare(entry)) {
>>>>> -                             /* Relax a bit to prevent rapid repeated page faults */
>>>>> -                             schedule_timeout_uninterruptible(1);
>>>>> -                             goto out;
>>>>> -                     }
>>>>> -                     need_clear_cache = true;
>>>>> -
>>>>>                       /* skip swapcache */
>>>>> -                     folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0,
>>>>> -                                             vma, vmf->address, false);
>>>>> +                     folio = alloc_swap_folio(vmf);
>>>>>                       page = &folio->page;
>>>>>                       if (folio) {
>>>>>                               __folio_set_locked(folio);
>>>>>                               __folio_set_swapbacked(folio);
>>>>>
>>>>> +                             if (folio_test_large(folio)) {
>>>>> +                                     nr_pages = folio_nr_pages(folio);
>>>>> +                                     entry.val = ALIGN_DOWN(entry.val, nr_pages);
>>>>> +                             }
>>>>> +
>>>>> +                             /*
>>>>> +                              * Prevent parallel swapin from proceeding with
>>>>> +                              * the cache flag. Otherwise, another thread may
>>>>> +                              * finish swapin first, free the entry, and swapout
>>>>> +                              * reusing the same entry. It's undetectable as
>>>>> +                              * pte_same() returns true due to entry reuse.
>>>>> +                              */
>>>>> +                             if (swapcache_prepare_nr(entry, nr_pages)) {
>>>>> +                                     /* Relax a bit to prevent rapid repeated page faults */
>>>>> +                                     schedule_timeout_uninterruptible(1);
>>>>> +                                     goto out;
>>>>> +                             }
>>>>> +                             need_clear_cache = true;
>>>>> +
>>>>>                               if (mem_cgroup_swapin_charge_folio(folio,
>>>>>                                                       vma->vm_mm, GFP_KERNEL,
>>>>>                                                       entry)) {
>>>>>                                       ret = VM_FAULT_OOM;
>>>>>                                       goto out_page;
>>>>>                               }
>>>>> -                             mem_cgroup_swapin_uncharge_swap(entry);
>>>>> +
>>>>> +                             for (swp_entry_t e = entry; e.val < entry.val + nr_pages; e.val++)
>>>>> +                                     mem_cgroup_swapin_uncharge_swap(e);
>>>>>
>>>>>                               shadow = get_shadow_from_swap_cache(entry);
>>>>>                               if (shadow)
>>>>> @@ -4118,6 +4257,42 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>>>>        */
>>>>>       vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
>>>>>                       &vmf->ptl);
>>>>> +
>>>>> +     start_address = vmf->address;
>>>>> +     start_pte = vmf->pte;
>>>>> +     if (start_pte && folio_test_large(folio)) {
>>>>> +             unsigned long nr = folio_nr_pages(folio);
>>>>> +             unsigned long addr = ALIGN_DOWN(vmf->address, nr * PAGE_SIZE);
>>>>> +             pte_t *aligned_pte = vmf->pte - (vmf->address - addr) / PAGE_SIZE;
>>>>> +
>>>>> +             /*
>>>>> +              * case 1: we are allocating large_folio, try to map it as a whole
>>>>> +              * iff the swap entries are still entirely mapped;
>>>>> +              * case 2: we hit a large folio in swapcache, and all swap entries
>>>>> +              * are still entirely mapped, try to map a large folio as a whole.
>>>>> +              * otherwise, map only the faulting page within the large folio
>>>>> +              * which is swapcache
>>>>> +              */
>>>>> +             if (!is_pte_range_contig_swap(aligned_pte, nr)) {
>>>>> +                     if (nr_pages > 1) /* ptes have changed for case 1 */
>>>>> +                             goto out_nomap;
>>>>> +                     goto check_pte;
>>>>> +             }
>>>>> +
>>>>> +             start_address = addr;
>>>>> +             start_pte = aligned_pte;
>>>>> +             /*
>>>>> +              * the below has been done before swap_read_folio()
>>>>> +              * for case 1
>>>>> +              */
>>>>> +             if (unlikely(folio == swapcache)) {
>>>>> +                     nr_pages = nr;
>>>>> +                     entry.val = ALIGN_DOWN(entry.val, nr_pages);
>>>>> +                     page = &folio->page;
>>>>> +             }
>>>>> +     }
>>>>> +
>>>>> +check_pte:
>>>>>       if (unlikely(!vmf->pte || !pte_same(ptep_get(vmf->pte), vmf->orig_pte)))
>>>>>               goto out_nomap;
>>>>>
>>>>> @@ -4185,12 +4360,14 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>>>>        * We're already holding a reference on the page but haven't mapped it
>>>>>        * yet.
>>>>>        */
>>>>> -     swap_free(entry);
>>>>> +     swap_nr_free(entry, nr_pages);
>>>>>       if (should_try_to_free_swap(folio, vma, vmf->flags))
>>>>>               folio_free_swap(folio);
>>>>>
>>>>> -     inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
>>>>> -     dec_mm_counter(vma->vm_mm, MM_SWAPENTS);
>>>>> +     folio_ref_add(folio, nr_pages - 1);
>>>>> +     add_mm_counter(vma->vm_mm, MM_ANONPAGES, nr_pages);
>>>>> +     add_mm_counter(vma->vm_mm, MM_SWAPENTS, -nr_pages);
>>>>> +
>>>>>       pte = mk_pte(page, vma->vm_page_prot);
>>>>>
>>>>>       /*
>>>>> @@ -4200,14 +4377,14 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>>>>        * exclusivity.
>>>>>        */
>>>>>       if (!folio_test_ksm(folio) &&
>>>>> -         (exclusive || folio_ref_count(folio) == 1)) {
>>>>> +         (exclusive || folio_ref_count(folio) == nr_pages)) {
>>>>>               if (vmf->flags & FAULT_FLAG_WRITE) {
>>>>>                       pte = maybe_mkwrite(pte_mkdirty(pte), vma);
>>>>>                       vmf->flags &= ~FAULT_FLAG_WRITE;
>>>>>               }
>>>>>               rmap_flags |= RMAP_EXCLUSIVE;
>>>>>       }
>>>>> -     flush_icache_page(vma, page);
>>>>> +     flush_icache_pages(vma, page, nr_pages);
>>>>>       if (pte_swp_soft_dirty(vmf->orig_pte))
>>>>>               pte = pte_mksoft_dirty(pte);
>>>>>       if (pte_swp_uffd_wp(vmf->orig_pte))
>>>>> @@ -4216,17 +4393,19 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>>>>
>>>>>       /* ksm created a completely new copy */
>>>>>       if (unlikely(folio != swapcache && swapcache)) {
>>>>> -             folio_add_new_anon_rmap(folio, vma, vmf->address);
>>>>> +             folio_add_new_anon_rmap(folio, vma, start_address);
>>>>>               folio_add_lru_vma(folio, vma);
>>>>> +     } else if (!folio_test_anon(folio)) {
>>>>> +             folio_add_new_anon_rmap(folio, vma, start_address);
>>>>>       } else {
>>>>> -             folio_add_anon_rmap_pte(folio, page, vma, vmf->address,
>>>>> +             folio_add_anon_rmap_ptes(folio, page, nr_pages, vma, start_address,
>>>>>                                       rmap_flags);
>>>>>       }
>>>>>
>>>>>       VM_BUG_ON(!folio_test_anon(folio) ||
>>>>>                       (pte_write(pte) && !PageAnonExclusive(page)));
>>>>> -     set_pte_at(vma->vm_mm, vmf->address, vmf->pte, pte);
>>>>> -     arch_do_swap_page(vma->vm_mm, vma, vmf->address, pte, vmf->orig_pte);
>>>>> +     set_ptes(vma->vm_mm, start_address, start_pte, pte, nr_pages);
>>>>> +     arch_do_swap_page(vma->vm_mm, vma, start_address, pte, vmf->orig_pte);
>>>>>
>>>>>       folio_unlock(folio);
>>>>>       if (folio != swapcache && swapcache) {
>>>>> @@ -4243,6 +4422,9 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>>>>       }
>>>>>
>>>>>       if (vmf->flags & FAULT_FLAG_WRITE) {
>>>>> +             if (nr_pages > 1)
>>>>> +                     vmf->orig_pte = ptep_get(vmf->pte);
>>>>> +
>>>>>               ret |= do_wp_page(vmf);
>>>>>               if (ret & VM_FAULT_ERROR)
>>>>>                       ret &= VM_FAULT_ERROR;
>>>>> @@ -4250,14 +4432,14 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>>>>       }
>>>>>
>>>>>       /* No need to invalidate - it was non-present before */
>>>>> -     update_mmu_cache_range(vmf, vma, vmf->address, vmf->pte, 1);
>>>>> +     update_mmu_cache_range(vmf, vma, start_address, start_pte, nr_pages);
>>>>>  unlock:
>>>>>       if (vmf->pte)
>>>>>               pte_unmap_unlock(vmf->pte, vmf->ptl);
>>>>>  out:
>>>>>       /* Clear the swap cache pin for direct swapin after PTL unlock */
>>>>>       if (need_clear_cache)
>>>>> -             swapcache_clear(si, entry);
>>>>> +             swapcache_clear_nr(si, entry, nr_pages);
>>>>>       if (si)
>>>>>               put_swap_device(si);
>>>>>       return ret;
>>>>> @@ -4273,7 +4455,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>>>>               folio_put(swapcache);
>>>>>       }
>>>>>       if (need_clear_cache)
>>>>> -             swapcache_clear(si, entry);
>>>>> +             swapcache_clear_nr(si, entry, nr_pages);
>>>>>       if (si)
>>>>>               put_swap_device(si);
>>>>>       return ret;
>>>>> @@ -4309,15 +4491,7 @@ static struct folio *alloc_anon_folio(struct vm_fault *vmf)
>>>>>       if (unlikely(userfaultfd_armed(vma)))
>>>>>               goto fallback;
>>>>>
>>>>> -     /*
>>>>> -      * Get a list of all the (large) orders below PMD_ORDER that are enabled
>>>>> -      * for this vma. Then filter out the orders that can't be allocated over
>>>>> -      * the faulting address and still be fully contained in the vma.
>>>>> -      */
>>>>> -     orders = thp_vma_allowable_orders(vma, vma->vm_flags, false, true, true,
>>>>> -                                       BIT(PMD_ORDER) - 1);
>>>>> -     orders = thp_vma_suitable_orders(vma, vmf->address, orders);
>>>>> -
>>>>> +     orders = get_alloc_folio_orders(vmf);
>>>>>       if (!orders)
>>>>>               goto fallback;
>>>>>
> 
> Thanks
> Barry
Ryan Roberts March 15, 2024, 12:06 p.m. UTC | #11
On 15/03/2024 10:01, Barry Song wrote:
> On Fri, Mar 15, 2024 at 10:17 PM Huang, Ying <ying.huang@intel.com> wrote:
>>
>> Barry Song <21cnbao@gmail.com> writes:
>>
>>> On Fri, Mar 15, 2024 at 9:43 PM Huang, Ying <ying.huang@intel.com> wrote:
>>>>
>>>> Barry Song <21cnbao@gmail.com> writes:
>>>>
>>>>> From: Chuanhua Han <hanchuanhua@oppo.com>
>>>>>
>>>>> On an embedded system like Android, more than half of anon memory is
>>>>> actually in swap devices such as zRAM. For example, while an app is
>>>>> switched to background, its most memory might be swapped-out.
>>>>>
>>>>> Now we have mTHP features, unfortunately, if we don't support large folios
>>>>> swap-in, once those large folios are swapped-out, we immediately lose the
>>>>> performance gain we can get through large folios and hardware optimization
>>>>> such as CONT-PTE.
>>>>>
>>>>> This patch brings up mTHP swap-in support. Right now, we limit mTHP swap-in
>>>>> to those contiguous swaps which were likely swapped out from mTHP as a
>>>>> whole.
>>>>>
>>>>> Meanwhile, the current implementation only covers the SWAP_SYCHRONOUS
>>>>> case. It doesn't support swapin_readahead as large folios yet since this
>>>>> kind of shared memory is much less than memory mapped by single process.
>>>>
>>>> In contrast, I still think that it's better to start with normal swap-in
>>>> path, then expand to SWAP_SYCHRONOUS case.
>>>
>>> I'd rather try the reverse direction as non-sync anon memory is only around
>>> 3% in a phone as my observation.
>>
>> Phone is not the only platform that Linux is running on.
> 
> I suppose it's generally true that forked shared anonymous pages only
> constitute a
> small portion of all anonymous pages. The majority of anonymous pages are within
> a single process.
> 
> I agree phones are not the only platform. But Rome wasn't built in a
> day. I can only get
> started on a hardware which I can easily reach and have enough hardware/test
> resources on it. So we may take the first step which can be applied on
> a real product
> and improve its performance, and step by step, we broaden it and make it
> widely useful to various areas  in which I can't reach :-)
> 
> so probably we can have a sysfs "enable" entry with default "n" or
> have a maximum
> swap-in order as Ryan's suggestion [1] at the beginning,

I wasn't neccessarily suggesting that we should hard-code an upper limit. I was
just pointing out that we likely need some policy somewhere because the right
thing very likely depends on the folio size and workload. And there is likely
similar policy needed for CoW.

We already have per-thp-size directories in sysfs, so there is a natural place
to add new controls as you suggest - that would fit well. Of course if we can
avoid exposing yet more controls that would be preferable.

> 
> "
> So in the common case, swap-in will pull in the same size of folio as was
> swapped-out. Is that definitely the right policy for all folio sizes? Certainly
> it makes sense for "small" large folios (e.g. up to 64K IMHO). But I'm not sure
> it makes sense for 2M THP; As the size increases the chances of actually needing
> all of the folio reduces so chances are we are wasting IO. There are similar
> arguments for CoW, where we currently copy 1 page per fault - it probably makes
> sense to copy the whole folio up to a certain size.
> "
>
Barry Song March 17, 2024, 6:11 a.m. UTC | #12
On Sat, Mar 16, 2024 at 1:06 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> On 15/03/2024 10:01, Barry Song wrote:
> > On Fri, Mar 15, 2024 at 10:17 PM Huang, Ying <ying.huang@intel.com> wrote:
> >>
> >> Barry Song <21cnbao@gmail.com> writes:
> >>
> >>> On Fri, Mar 15, 2024 at 9:43 PM Huang, Ying <ying.huang@intel.com> wrote:
> >>>>
> >>>> Barry Song <21cnbao@gmail.com> writes:
> >>>>
> >>>>> From: Chuanhua Han <hanchuanhua@oppo.com>
> >>>>>
> >>>>> On an embedded system like Android, more than half of anon memory is
> >>>>> actually in swap devices such as zRAM. For example, while an app is
> >>>>> switched to background, its most memory might be swapped-out.
> >>>>>
> >>>>> Now we have mTHP features, unfortunately, if we don't support large folios
> >>>>> swap-in, once those large folios are swapped-out, we immediately lose the
> >>>>> performance gain we can get through large folios and hardware optimization
> >>>>> such as CONT-PTE.
> >>>>>
> >>>>> This patch brings up mTHP swap-in support. Right now, we limit mTHP swap-in
> >>>>> to those contiguous swaps which were likely swapped out from mTHP as a
> >>>>> whole.
> >>>>>
> >>>>> Meanwhile, the current implementation only covers the SWAP_SYCHRONOUS
> >>>>> case. It doesn't support swapin_readahead as large folios yet since this
> >>>>> kind of shared memory is much less than memory mapped by single process.
> >>>>
> >>>> In contrast, I still think that it's better to start with normal swap-in
> >>>> path, then expand to SWAP_SYCHRONOUS case.
> >>>
> >>> I'd rather try the reverse direction as non-sync anon memory is only around
> >>> 3% in a phone as my observation.
> >>
> >> Phone is not the only platform that Linux is running on.
> >
> > I suppose it's generally true that forked shared anonymous pages only
> > constitute a
> > small portion of all anonymous pages. The majority of anonymous pages are within
> > a single process.
> >
> > I agree phones are not the only platform. But Rome wasn't built in a
> > day. I can only get
> > started on a hardware which I can easily reach and have enough hardware/test
> > resources on it. So we may take the first step which can be applied on
> > a real product
> > and improve its performance, and step by step, we broaden it and make it
> > widely useful to various areas  in which I can't reach :-)
> >
> > so probably we can have a sysfs "enable" entry with default "n" or
> > have a maximum
> > swap-in order as Ryan's suggestion [1] at the beginning,
>
> I wasn't neccessarily suggesting that we should hard-code an upper limit. I was
> just pointing out that we likely need some policy somewhere because the right
> thing very likely depends on the folio size and workload. And there is likely
> similar policy needed for CoW.
>
> We already have per-thp-size directories in sysfs, so there is a natural place
> to add new controls as you suggest - that would fit well. Of course if we can
> avoid exposing yet more controls that would be preferable.
>
> >
> > "
> > So in the common case, swap-in will pull in the same size of folio as was
> > swapped-out. Is that definitely the right policy for all folio sizes? Certainly
> > it makes sense for "small" large folios (e.g. up to 64K IMHO). But I'm not sure
> > it makes sense for 2M THP; As the size increases the chances of actually needing
> > all of the folio reduces so chances are we are wasting IO. There are similar
> > arguments for CoW, where we currently copy 1 page per fault - it probably makes
> > sense to copy the whole folio up to a certain size.
> > "

right now we have an "enable" entry in each size, for example:
/sys/kernel/mm/transparent_hugepage/hugepages-64kB/enable

for the phone case, it would be quite simple, just enable 64KiB(or +16KiB) and
allow swap-in 64KiB(or +16KiB) folios, so it doesn't need any new controls
since do_swap_page does the same checks as do_anonymous_page()
does. And we actually have deployed 64KiB-only swap-out and swap-in on
millions of real phones.

Considering other users scenarios which might want larger folios such as 2MiB
1MiB but only want smaller swap-in folio sizes, I could have a new
swapin control
like,

/sys/kernel/mm/transparent_hugepage/hugepages-64kB/swapin
this can be 1 or 0.

With this, it seems safer for the patchset to land while I don't have
the ability
to extensively test it on Linux servers?

Thanks
Barry
Huang, Ying March 18, 2024, 1:52 a.m. UTC | #13
Barry Song <21cnbao@gmail.com> writes:

> On Fri, Mar 15, 2024 at 10:17 PM Huang, Ying <ying.huang@intel.com> wrote:
>>
>> Barry Song <21cnbao@gmail.com> writes:
>>
>> > On Fri, Mar 15, 2024 at 9:43 PM Huang, Ying <ying.huang@intel.com> wrote:
>> >>
>> >> Barry Song <21cnbao@gmail.com> writes:
>> >>
>> >> > From: Chuanhua Han <hanchuanhua@oppo.com>
>> >> >
>> >> > On an embedded system like Android, more than half of anon memory is
>> >> > actually in swap devices such as zRAM. For example, while an app is
>> >> > switched to background, its most memory might be swapped-out.
>> >> >
>> >> > Now we have mTHP features, unfortunately, if we don't support large folios
>> >> > swap-in, once those large folios are swapped-out, we immediately lose the
>> >> > performance gain we can get through large folios and hardware optimization
>> >> > such as CONT-PTE.
>> >> >
>> >> > This patch brings up mTHP swap-in support. Right now, we limit mTHP swap-in
>> >> > to those contiguous swaps which were likely swapped out from mTHP as a
>> >> > whole.
>> >> >
>> >> > Meanwhile, the current implementation only covers the SWAP_SYCHRONOUS
>> >> > case. It doesn't support swapin_readahead as large folios yet since this
>> >> > kind of shared memory is much less than memory mapped by single process.
>> >>
>> >> In contrast, I still think that it's better to start with normal swap-in
>> >> path, then expand to SWAP_SYCHRONOUS case.
>> >
>> > I'd rather try the reverse direction as non-sync anon memory is only around
>> > 3% in a phone as my observation.
>>
>> Phone is not the only platform that Linux is running on.
>
> I suppose it's generally true that forked shared anonymous pages only
> constitute a
> small portion of all anonymous pages. The majority of anonymous pages are within
> a single process.

Yes.  But IIUC, SWP_SYNCHRONOUS_IO is quite limited, they are set only
for memory backed swap devices.

> I agree phones are not the only platform. But Rome wasn't built in a
> day. I can only get
> started on a hardware which I can easily reach and have enough hardware/test
> resources on it. So we may take the first step which can be applied on
> a real product
> and improve its performance, and step by step, we broaden it and make it
> widely useful to various areas  in which I can't reach :-)

We must guarantee the normal swap path runs correctly and has no
performance regression when developing SWP_SYNCHRONOUS_IO optimization.
So we have to put some effort on the normal path test anyway.

> so probably we can have a sysfs "enable" entry with default "n" or
> have a maximum
> swap-in order as Ryan's suggestion [1] at the beginning,
>
> "
> So in the common case, swap-in will pull in the same size of folio as was
> swapped-out. Is that definitely the right policy for all folio sizes? Certainly
> it makes sense for "small" large folios (e.g. up to 64K IMHO). But I'm not sure
> it makes sense for 2M THP; As the size increases the chances of actually needing
> all of the folio reduces so chances are we are wasting IO. There are similar
> arguments for CoW, where we currently copy 1 page per fault - it probably makes
> sense to copy the whole folio up to a certain size.
> "
>
>>
>> >>
>> >> In normal swap-in path, we can take advantage of swap readahead
>> >> information to determine the swapped-in large folio order.  That is, if
>> >> the return value of swapin_nr_pages() > 1, then we can try to allocate
>> >> and swapin a large folio.
>> >
>> > I am not quite sure we still need to depend on this. in do_anon_page,
>> > we have broken the assumption and allocated a large folio directly.
>>
>> I don't think that we have a sophisticated policy to allocate large
>> folio.  Large folio could waste memory for some workloads, so I think
>> that it's a good idea to allocate large folio always.
>
> i agree, but we still have the below check just like do_anon_page() has it,
>
>         orders = thp_vma_allowable_orders(vma, vma->vm_flags, false, true, true,
>                                           BIT(PMD_ORDER) - 1);
>         orders = thp_vma_suitable_orders(vma, vmf->address, orders);
>
> in do_anon_page, we don't worry about the waste so much, the same
> logic also applies to do_swap_page().

As I said, "readahead" may help us from application/user specific
configuration in most cases.  It can be a starting point of "using mTHP
automatically when it helps and not cause many issues".

>>
>> Readahead gives us an opportunity to play with the policy.
>
> I feel somehow the rules of the game have changed with an upper
> limit for swap-in size. for example, if the upper limit is 4 order,
> it limits folio size to 64KiB which is still a proper size for ARM64
> whose base page can be 64KiB.
>
> on the other hand, while swapping out large folios, we will always
> compress them as a whole(zsmalloc/zram patch will come in a
> couple of days), if we choose to decompress a subpage instead of
> a large folio in do_swap_page(), we might need to decompress
> nr_pages times. for example,
>
> For large folios 16*4KiB, they are saved as a large object in zsmalloc(with
> the coming patch), if we swap in a small folio, we decompress the large
> object; next time, we will still need to decompress a large object. so
> it is more sensible to swap in a large folio if we find those
> swap entries are contiguous and were allocated by a large folio swap-out.

I understand that there are some special requirements for ZRAM.  But I
don't think it's a good idea to force the general code to fit the
requirements of a specific swap device too much.  This is one of the
reasons that I think that we should start with normal swap devices, then
try to optimize for some specific devices.

>>
>> > On the other hand, compressing/decompressing large folios as a
>> > whole rather than doing it one by one can save a large percent of
>> > CPUs and provide a much lower compression ratio.  With a hardware
>> > accelerator, this is even faster.
>>
>> I am not against to support large folio for compressing/decompressing.
>>
>> I just suggest to do that later, after we play with normal swap-in.
>> SWAP_SYCHRONOUS related swap-in code is an optimization based on normal
>> swap.  So, it seems natural to support large folio swap-in for normal
>> swap-in firstly.
>
> I feel like SWAP_SYCHRONOUS is a simpler case and even more "normal"
> than the swapcache path since it is the majority.

I don't think so.  Most PC and server systems uses !SWAP_SYCHRONOUS
swap devices.

> and on the other hand, a lot
> of modification is required for the swapcache path. in OPPO's code[1], we did
> bring-up both paths, but the swapcache path is much much more complicated
> than the SYNC path and hasn't really noticeable improvement.
>
> [1] https://github.com/OnePlusOSS/android_kernel_oneplus_sm8650/tree/oneplus/sm8650_u_14.0.0_oneplus12

That's great.  Please cleanup the code and post it to mailing list.  Why
doesn't it help?  IIUC, it can optimize TLB at least.

>>
>> > So I'd rather more aggressively get large folios swap-in involved
>> > than depending on readahead.
>>
>> We can take advantage of readahead algorithm in SWAP_SYCHRONOUS
>> optimization too.  The sub-pages that is not accessed by page fault can
>> be treated as readahead.  I think that is a better policy than
>> allocating large folio always.
>
> Considering the zsmalloc optimization, it would be a better choice to
> always allocate
> large folios if we find those swap entries are for a swapped-out large folio. as
> decompressing just once, we get all subpages.
> Some hardware accelerators are even able to decompress a large folio with
> multi-hardware threads, for example, 16 hardware threads can compress
> each subpage of a large folio at the same time, it is just as fast as
> decompressing
> one subpage.
>
> for platforms without the above optimizations, a proper upper limit
> will help them
> disable the large folios swap-in or decrease the impact. For example,
> if the upper
> limit is 0-order, we are just removing this patchset. if the upper
> limit is 2 orders, we
> are just like BASE_PAGE size is 16KiB.
>
>>
>> >>
>> >> To do that, we need to track whether the sub-pages are accessed.  I
>> >> guess we need that information for large file folio readahead too.
>> >>
>> >> Hi, Matthew,
>> >>
>> >> Can you help us on tracking whether the sub-pages of a readahead large
>> >> folio has been accessed?
>> >>
>> >> > Right now, we are re-faulting large folios which are still in swapcache as a
>> >> > whole, this can effectively decrease extra loops and early-exitings which we
>> >> > have increased in arch_swap_restore() while supporting MTE restore for folios
>> >> > rather than page. On the other hand, it can also decrease do_swap_page as
>> >> > PTEs used to be set one by one even we hit a large folio in swapcache.
>> >> >
>> >>

--
Best Regards,
Huang, Ying
Barry Song March 18, 2024, 2:41 a.m. UTC | #14
On Mon, Mar 18, 2024 at 2:54 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> Barry Song <21cnbao@gmail.com> writes:
>
> > On Fri, Mar 15, 2024 at 10:17 PM Huang, Ying <ying.huang@intel.com> wrote:
> >>
> >> Barry Song <21cnbao@gmail.com> writes:
> >>
> >> > On Fri, Mar 15, 2024 at 9:43 PM Huang, Ying <ying.huang@intel.com> wrote:
> >> >>
> >> >> Barry Song <21cnbao@gmail.com> writes:
> >> >>
> >> >> > From: Chuanhua Han <hanchuanhua@oppo.com>
> >> >> >
> >> >> > On an embedded system like Android, more than half of anon memory is
> >> >> > actually in swap devices such as zRAM. For example, while an app is
> >> >> > switched to background, its most memory might be swapped-out.
> >> >> >
> >> >> > Now we have mTHP features, unfortunately, if we don't support large folios
> >> >> > swap-in, once those large folios are swapped-out, we immediately lose the
> >> >> > performance gain we can get through large folios and hardware optimization
> >> >> > such as CONT-PTE.
> >> >> >
> >> >> > This patch brings up mTHP swap-in support. Right now, we limit mTHP swap-in
> >> >> > to those contiguous swaps which were likely swapped out from mTHP as a
> >> >> > whole.
> >> >> >
> >> >> > Meanwhile, the current implementation only covers the SWAP_SYCHRONOUS
> >> >> > case. It doesn't support swapin_readahead as large folios yet since this
> >> >> > kind of shared memory is much less than memory mapped by single process.
> >> >>
> >> >> In contrast, I still think that it's better to start with normal swap-in
> >> >> path, then expand to SWAP_SYCHRONOUS case.
> >> >
> >> > I'd rather try the reverse direction as non-sync anon memory is only around
> >> > 3% in a phone as my observation.
> >>
> >> Phone is not the only platform that Linux is running on.
> >
> > I suppose it's generally true that forked shared anonymous pages only
> > constitute a
> > small portion of all anonymous pages. The majority of anonymous pages are within
> > a single process.
>
> Yes.  But IIUC, SWP_SYNCHRONOUS_IO is quite limited, they are set only
> for memory backed swap devices.

SWP_SYNCHRONOUS_IO is the most common case for embedded linux.
note almost all Android/embedded devices use zRAM rather than a disk
for swap.

And we can have an upper limit order or a new control like
/sys/kernel/mm/transparent_hugepage/hugepages-256kB/swapin
and set them default to 0 first.

>
> > I agree phones are not the only platform. But Rome wasn't built in a
> > day. I can only get
> > started on a hardware which I can easily reach and have enough hardware/test
> > resources on it. So we may take the first step which can be applied on
> > a real product
> > and improve its performance, and step by step, we broaden it and make it
> > widely useful to various areas  in which I can't reach :-)
>
> We must guarantee the normal swap path runs correctly and has no
> performance regression when developing SWP_SYNCHRONOUS_IO optimization.
> So we have to put some effort on the normal path test anyway.
>
> > so probably we can have a sysfs "enable" entry with default "n" or
> > have a maximum
> > swap-in order as Ryan's suggestion [1] at the beginning,
> >
> > "
> > So in the common case, swap-in will pull in the same size of folio as was
> > swapped-out. Is that definitely the right policy for all folio sizes? Certainly
> > it makes sense for "small" large folios (e.g. up to 64K IMHO). But I'm not sure
> > it makes sense for 2M THP; As the size increases the chances of actually needing
> > all of the folio reduces so chances are we are wasting IO. There are similar
> > arguments for CoW, where we currently copy 1 page per fault - it probably makes
> > sense to copy the whole folio up to a certain size.
> > "
> >
> >>
> >> >>
> >> >> In normal swap-in path, we can take advantage of swap readahead
> >> >> information to determine the swapped-in large folio order.  That is, if
> >> >> the return value of swapin_nr_pages() > 1, then we can try to allocate
> >> >> and swapin a large folio.
> >> >
> >> > I am not quite sure we still need to depend on this. in do_anon_page,
> >> > we have broken the assumption and allocated a large folio directly.
> >>
> >> I don't think that we have a sophisticated policy to allocate large
> >> folio.  Large folio could waste memory for some workloads, so I think
> >> that it's a good idea to allocate large folio always.
> >
> > i agree, but we still have the below check just like do_anon_page() has it,
> >
> >         orders = thp_vma_allowable_orders(vma, vma->vm_flags, false, true, true,
> >                                           BIT(PMD_ORDER) - 1);
> >         orders = thp_vma_suitable_orders(vma, vmf->address, orders);
> >
> > in do_anon_page, we don't worry about the waste so much, the same
> > logic also applies to do_swap_page().
>
> As I said, "readahead" may help us from application/user specific
> configuration in most cases.  It can be a starting point of "using mTHP
> automatically when it helps and not cause many issues".

I'd rather start from the simpler code path and really improve  on
phones & embedded linux which our team can really reach :-)

>
> >>
> >> Readahead gives us an opportunity to play with the policy.
> >
> > I feel somehow the rules of the game have changed with an upper
> > limit for swap-in size. for example, if the upper limit is 4 order,
> > it limits folio size to 64KiB which is still a proper size for ARM64
> > whose base page can be 64KiB.
> >
> > on the other hand, while swapping out large folios, we will always
> > compress them as a whole(zsmalloc/zram patch will come in a
> > couple of days), if we choose to decompress a subpage instead of
> > a large folio in do_swap_page(), we might need to decompress
> > nr_pages times. for example,
> >
> > For large folios 16*4KiB, they are saved as a large object in zsmalloc(with
> > the coming patch), if we swap in a small folio, we decompress the large
> > object; next time, we will still need to decompress a large object. so
> > it is more sensible to swap in a large folio if we find those
> > swap entries are contiguous and were allocated by a large folio swap-out.
>
> I understand that there are some special requirements for ZRAM.  But I
> don't think it's a good idea to force the general code to fit the
> requirements of a specific swap device too much.  This is one of the
> reasons that I think that we should start with normal swap devices, then
> try to optimize for some specific devices.

I agree. but we are having a good start. zRAM is not a specific device,
it widely represents embedded Linux.

>
> >>
> >> > On the other hand, compressing/decompressing large folios as a
> >> > whole rather than doing it one by one can save a large percent of
> >> > CPUs and provide a much lower compression ratio.  With a hardware
> >> > accelerator, this is even faster.
> >>
> >> I am not against to support large folio for compressing/decompressing.
> >>
> >> I just suggest to do that later, after we play with normal swap-in.
> >> SWAP_SYCHRONOUS related swap-in code is an optimization based on normal
> >> swap.  So, it seems natural to support large folio swap-in for normal
> >> swap-in firstly.
> >
> > I feel like SWAP_SYCHRONOUS is a simpler case and even more "normal"
> > than the swapcache path since it is the majority.
>
> I don't think so.  Most PC and server systems uses !SWAP_SYCHRONOUS
> swap devices.

The problem is that our team is all focusing on phones, we won't have
any resource
and bandwidth on PC and server. A more realistic goal is that we at
least let the
solutions benefit phones and similar embedded Linux and extend it to more areas
such as PC and server.

I'd be quite happy if you or other people can join together on PC and server.

>
> > and on the other hand, a lot
> > of modification is required for the swapcache path. in OPPO's code[1], we did
> > bring-up both paths, but the swapcache path is much much more complicated
> > than the SYNC path and hasn't really noticeable improvement.
> >
> > [1] https://github.com/OnePlusOSS/android_kernel_oneplus_sm8650/tree/oneplus/sm8650_u_14.0.0_oneplus12
>
> That's great.  Please cleanup the code and post it to mailing list.  Why
> doesn't it help?  IIUC, it can optimize TLB at least.

I agree this can improve but most of the anon pages are single-process
mapped. only
quite a few pages go to the readahead code path on phones. That's why
there is no
noticeable improvement finally.

I understand all the benefits you mentioned on changing readahead, but
simply because
those kinds of pages are really really rare, so improving that path
doesn't really help
Android devices.

>
> >>
> >> > So I'd rather more aggressively get large folios swap-in involved
> >> > than depending on readahead.
> >>
> >> We can take advantage of readahead algorithm in SWAP_SYCHRONOUS
> >> optimization too.  The sub-pages that is not accessed by page fault can
> >> be treated as readahead.  I think that is a better policy than
> >> allocating large folio always.

This is always true even in do_anonymous_page(). but we don't worry that
too much as we have per-size control. overload has the chance to set its
preferences.
        /*
         * Get a list of all the (large) orders below PMD_ORDER that are enabled
         * for this vma. Then filter out the orders that can't be allocated over
         * the faulting address and still be fully contained in the vma.
         */
        orders = thp_vma_allowable_orders(vma, vma->vm_flags, false, true, true,
                                          BIT(PMD_ORDER) - 1);
        orders = thp_vma_suitable_orders(vma, vmf->address, orders);

On the other hand, we are not always allocating large folios. we are allocating
large folios when the swapped-out folio was large. This is quite important to
an embedded linux, as swap is happening so often. more than half memory
can be in swap, if we swap-out them as a large folio, but swap them in a
small, we immediately lose all advantages such as less page faults, CONT-PTE
etc.

> >
> > Considering the zsmalloc optimization, it would be a better choice to
> > always allocate
> > large folios if we find those swap entries are for a swapped-out large folio. as
> > decompressing just once, we get all subpages.
> > Some hardware accelerators are even able to decompress a large folio with
> > multi-hardware threads, for example, 16 hardware threads can compress
> > each subpage of a large folio at the same time, it is just as fast as
> > decompressing
> > one subpage.
> >
> > for platforms without the above optimizations, a proper upper limit
> > will help them
> > disable the large folios swap-in or decrease the impact. For example,
> > if the upper
> > limit is 0-order, we are just removing this patchset. if the upper
> > limit is 2 orders, we
> > are just like BASE_PAGE size is 16KiB.
> >
> >>
> >> >>
> >> >> To do that, we need to track whether the sub-pages are accessed.  I
> >> >> guess we need that information for large file folio readahead too.
> >> >>
> >> >> Hi, Matthew,
> >> >>
> >> >> Can you help us on tracking whether the sub-pages of a readahead large
> >> >> folio has been accessed?
> >> >>
> >> >> > Right now, we are re-faulting large folios which are still in swapcache as a
> >> >> > whole, this can effectively decrease extra loops and early-exitings which we
> >> >> > have increased in arch_swap_restore() while supporting MTE restore for folios
> >> >> > rather than page. On the other hand, it can also decrease do_swap_page as
> >> >> > PTEs used to be set one by one even we hit a large folio in swapcache.
> >> >> >
> >> >>
>
> --
> Best Regards,
> Huang, Ying

Thanks
Barry
Ryan Roberts March 18, 2024, 4:45 p.m. UTC | #15
>>> I agree phones are not the only platform. But Rome wasn't built in a
>>> day. I can only get
>>> started on a hardware which I can easily reach and have enough hardware/test
>>> resources on it. So we may take the first step which can be applied on
>>> a real product
>>> and improve its performance, and step by step, we broaden it and make it
>>> widely useful to various areas  in which I can't reach :-)
>>
>> We must guarantee the normal swap path runs correctly and has no
>> performance regression when developing SWP_SYNCHRONOUS_IO optimization.
>> So we have to put some effort on the normal path test anyway.
>>
>>> so probably we can have a sysfs "enable" entry with default "n" or
>>> have a maximum
>>> swap-in order as Ryan's suggestion [1] at the beginning,
>>>
>>> "
>>> So in the common case, swap-in will pull in the same size of folio as was
>>> swapped-out. Is that definitely the right policy for all folio sizes? Certainly
>>> it makes sense for "small" large folios (e.g. up to 64K IMHO). But I'm not sure
>>> it makes sense for 2M THP; As the size increases the chances of actually needing
>>> all of the folio reduces so chances are we are wasting IO. There are similar
>>> arguments for CoW, where we currently copy 1 page per fault - it probably makes
>>> sense to copy the whole folio up to a certain size.
>>> "

I thought about this a bit more. No clear conclusions, but hoped this might help
the discussion around policy:

The decision about the size of the THP is made at first fault, with some help
from user space and in future we might make decisions to split based on
munmap/mremap/etc hints. In an ideal world, the fact that we have had to swap
the THP out at some point in its lifetime should not impact on its size. It's
just being moved around in the system and the reason for our original decision
should still hold.

So from that PoV, it would be good to swap-in to the same size that was
swapped-out. But we only kind-of keep that information around, via the swap
entry contiguity and alignment. With that scheme it is possible that multiple
virtually adjacent but not physically contiguous folios get swapped-out to
adjacent swap slot ranges and then they would be swapped-in to a single, larger
folio. This is not ideal, and I think it would be valuable to try to maintain
the original folio size information with the swap slot. One way to do this would
be to store the original order for which the cluster was allocated in the
cluster. Then we at least know that a given swap slot is either for a folio of
that order or an order-0 folio (due to cluster exhaustion/scanning). Can we
steal a bit from swap_map to determine which case it is? Or are there better
approaches?

Next we (I?) have concerns about wasting IO by swapping-in folios that are too
large (e.g. 2M). I'm not sure if this is a real problem or not - intuitively I'd
say yes but I have no data. But on the other hand, memory is aged and
swapped-out per-folio, so why shouldn't it be swapped-in per folio? If the
original allocation size policy is good (it currently isn't) then a folio should
be sized to cover temporally close memory and if we need to access some of it,
chances are we need all of it.

If we think the IO concern is legitimate then we could define a threshold size
(sysfs?) for when we start swapping-in the folio in chunks. And how big should
those chunks be - one page, or the threshold size itself? Probably the latter?
And perhaps that threshold could also be used by zRAM to decide its upper limit
for compression chunk.

Perhaps we can learn from khugepaged here? I think it has programmable
thresholds for how many swapped-out pages can be swapped-in to aid collapse to a
THP? I guess that exists for the same concerns about increased IO pressure?


If we think we will ever be swapping-in folios in chunks less than their
original size, then we need a separate mechanism to re-foliate them. We have
discussed a khugepaged-like approach for doing this asynchronously in the
background. I know that scares the Android folks, but David has suggested that
this could well be very cheap compared with khugepaged, because it would be
entirely limited to a single pgtable, so we only need the PTL. If we need this
mechanism anyway, perhaps we should develop it and see how it performs if
swap-in remains order-0? Although I guess that would imply not being able to
benefit from compressing THPs for the zRAM case.

I see all this as orthogonal to synchronous vs asynchronous swap devices. I
think the latter just implies that you might want to do some readahead to try to
cover up the latency? If swap is moving towards being folio-orientated, then
readahead also surely needs to be folio-orientated, but I think that should be
the only major difference.

Anyway, just some thoughts!

Thanks,
Ryan
Barry Song March 19, 2024, 6:27 a.m. UTC | #16
On Tue, Mar 19, 2024 at 5:45 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> >>> I agree phones are not the only platform. But Rome wasn't built in a
> >>> day. I can only get
> >>> started on a hardware which I can easily reach and have enough hardware/test
> >>> resources on it. So we may take the first step which can be applied on
> >>> a real product
> >>> and improve its performance, and step by step, we broaden it and make it
> >>> widely useful to various areas  in which I can't reach :-)
> >>
> >> We must guarantee the normal swap path runs correctly and has no
> >> performance regression when developing SWP_SYNCHRONOUS_IO optimization.
> >> So we have to put some effort on the normal path test anyway.
> >>
> >>> so probably we can have a sysfs "enable" entry with default "n" or
> >>> have a maximum
> >>> swap-in order as Ryan's suggestion [1] at the beginning,
> >>>
> >>> "
> >>> So in the common case, swap-in will pull in the same size of folio as was
> >>> swapped-out. Is that definitely the right policy for all folio sizes? Certainly
> >>> it makes sense for "small" large folios (e.g. up to 64K IMHO). But I'm not sure
> >>> it makes sense for 2M THP; As the size increases the chances of actually needing
> >>> all of the folio reduces so chances are we are wasting IO. There are similar
> >>> arguments for CoW, where we currently copy 1 page per fault - it probably makes
> >>> sense to copy the whole folio up to a certain size.
> >>> "
>
> I thought about this a bit more. No clear conclusions, but hoped this might help
> the discussion around policy:
>
> The decision about the size of the THP is made at first fault, with some help
> from user space and in future we might make decisions to split based on
> munmap/mremap/etc hints. In an ideal world, the fact that we have had to swap
> the THP out at some point in its lifetime should not impact on its size. It's
> just being moved around in the system and the reason for our original decision
> should still hold.

Indeed, this is an ideal framework for smartphones and likely for
widely embedded
Linux systems utilizing zRAM. We set the mTHP size to 64KiB to
leverage CONT-PTE,
given that more than half of the memory on phones may frequently swap out and
swap in (for instance, when opening and switching between apps). The
ideal approach
would involve adhering to the decision made in do_anonymous_page().

>
> So from that PoV, it would be good to swap-in to the same size that was
> swapped-out. But we only kind-of keep that information around, via the swap
> entry contiguity and alignment. With that scheme it is possible that multiple
> virtually adjacent but not physically contiguous folios get swapped-out to
> adjacent swap slot ranges and then they would be swapped-in to a single, larger
> folio. This is not ideal, and I think it would be valuable to try to maintain
> the original folio size information with the swap slot. One way to do this would
> be to store the original order for which the cluster was allocated in the
> cluster. Then we at least know that a given swap slot is either for a folio of
> that order or an order-0 folio (due to cluster exhaustion/scanning). Can we
> steal a bit from swap_map to determine which case it is? Or are there better
> approaches?

In the case of non-SWP_SYNCHRONOUS_IO, users will invariably invoke
swap_readahead()
even when __swap_count(entry) equals 1.  This leads to two scenarios:
swap_vma_readahead
and swap_cluster_readahead.

In swap_vma_readahead, when blk_queue_nonrot, physical contiguity
doesn't appear to be a
critical concern. However, for swap_cluster_readahead, the focus
shifts towards the potential
impact of physical discontiguity.

struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
                                struct vm_fault *vmf)
{
        struct mempolicy *mpol;
        pgoff_t ilx;
        struct folio *folio;

        mpol = get_vma_policy(vmf->vma, vmf->address, 0, &ilx);
        folio = swap_use_vma_readahead() ?
                swap_vma_readahead(entry, gfp_mask, mpol, ilx, vmf) :
                swap_cluster_readahead(entry, gfp_mask, mpol, ilx);
        mpol_cond_put(mpol);

        if (!folio)
                return NULL;
        return folio_file_page(folio, swp_offset(entry));
}

In Android and embedded systems, SWP_SYNCHRONOUS_IO is consistently utilized,
rendering physical contiguity less of a concern. Moreover, instances where
swap_readahead() is accessed are rare, typically occurring only in scenarios
involving forked but non-CoWed memory.

So I think large folios swap-in will at least need three steps

1. on SWP_SYNCHRONOUS_IO (Android and embedded Linux), this has a very
clear model and has no complex I/O issue.
2. on nonrot block device(bdev_nonrot ==  true), it cares less about
I/O contiguity.
3. on rot block devices which care about  I/O contiguity.

This patchset primarily addresses the systems utilizing
SWP_SYNCHRONOUS_IO(type1),
such as Android and embedded Linux, a straightforward model is established,
with minimal complexity regarding I/O issues.

>
> Next we (I?) have concerns about wasting IO by swapping-in folios that are too
> large (e.g. 2M). I'm not sure if this is a real problem or not - intuitively I'd
> say yes but I have no data. But on the other hand, memory is aged and
> swapped-out per-folio, so why shouldn't it be swapped-in per folio? If the
> original allocation size policy is good (it currently isn't) then a folio should
> be sized to cover temporally close memory and if we need to access some of it,
> chances are we need all of it.
>
> If we think the IO concern is legitimate then we could define a threshold size
> (sysfs?) for when we start swapping-in the folio in chunks. And how big should
> those chunks be - one page, or the threshold size itself? Probably the latter?
> And perhaps that threshold could also be used by zRAM to decide its upper limit
> for compression chunk.


Agreed. What about introducing a parameter like
/sys/kernel/mm/transparent_hugepage/max_swapin_order
giving users the opportunity to fine-tune it according to their needs. For type1
users specifically, setting it to any value above 4 would be
beneficial. If there's
still a lack of tuning for desktop and server environments (type 2 and type 3),
the default value could be set to 0.

>
> Perhaps we can learn from khugepaged here? I think it has programmable
> thresholds for how many swapped-out pages can be swapped-in to aid collapse to a
> THP? I guess that exists for the same concerns about increased IO pressure?
>
>
> If we think we will ever be swapping-in folios in chunks less than their
> original size, then we need a separate mechanism to re-foliate them. We have
> discussed a khugepaged-like approach for doing this asynchronously in the
> background. I know that scares the Android folks, but David has suggested that
> this could well be very cheap compared with khugepaged, because it would be
> entirely limited to a single pgtable, so we only need the PTL. If we need this
> mechanism anyway, perhaps we should develop it and see how it performs if
> swap-in remains order-0? Although I guess that would imply not being able to
> benefit from compressing THPs for the zRAM case.

The effectiveness of collapse operation relies on the stability of
forming large folios
to ensure optimal performance. In embedded systems, where more than half of the
memory may be allocated to zRAM, folios might undergo swapping out before
collapsing or immediately after the collapse operation. It seems a
TAO-like optimization
to decrease fallback and latency is more effective.

>
> I see all this as orthogonal to synchronous vs asynchronous swap devices. I
> think the latter just implies that you might want to do some readahead to try to
> cover up the latency? If swap is moving towards being folio-orientated, then
> readahead also surely needs to be folio-orientated, but I think that should be
> the only major difference.
>
> Anyway, just some thoughts!

Thank you very much for your valuable and insightful deliberations.

>
> Thanks,
> Ryan
>

Thanks
Barry
Ryan Roberts March 19, 2024, 9:05 a.m. UTC | #17
On 19/03/2024 06:27, Barry Song wrote:
> On Tue, Mar 19, 2024 at 5:45 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>
>>>>> I agree phones are not the only platform. But Rome wasn't built in a
>>>>> day. I can only get
>>>>> started on a hardware which I can easily reach and have enough hardware/test
>>>>> resources on it. So we may take the first step which can be applied on
>>>>> a real product
>>>>> and improve its performance, and step by step, we broaden it and make it
>>>>> widely useful to various areas  in which I can't reach :-)
>>>>
>>>> We must guarantee the normal swap path runs correctly and has no
>>>> performance regression when developing SWP_SYNCHRONOUS_IO optimization.
>>>> So we have to put some effort on the normal path test anyway.
>>>>
>>>>> so probably we can have a sysfs "enable" entry with default "n" or
>>>>> have a maximum
>>>>> swap-in order as Ryan's suggestion [1] at the beginning,
>>>>>
>>>>> "
>>>>> So in the common case, swap-in will pull in the same size of folio as was
>>>>> swapped-out. Is that definitely the right policy for all folio sizes? Certainly
>>>>> it makes sense for "small" large folios (e.g. up to 64K IMHO). But I'm not sure
>>>>> it makes sense for 2M THP; As the size increases the chances of actually needing
>>>>> all of the folio reduces so chances are we are wasting IO. There are similar
>>>>> arguments for CoW, where we currently copy 1 page per fault - it probably makes
>>>>> sense to copy the whole folio up to a certain size.
>>>>> "
>>
>> I thought about this a bit more. No clear conclusions, but hoped this might help
>> the discussion around policy:
>>
>> The decision about the size of the THP is made at first fault, with some help
>> from user space and in future we might make decisions to split based on
>> munmap/mremap/etc hints. In an ideal world, the fact that we have had to swap
>> the THP out at some point in its lifetime should not impact on its size. It's
>> just being moved around in the system and the reason for our original decision
>> should still hold.
> 
> Indeed, this is an ideal framework for smartphones and likely for
> widely embedded
> Linux systems utilizing zRAM. We set the mTHP size to 64KiB to
> leverage CONT-PTE,
> given that more than half of the memory on phones may frequently swap out and
> swap in (for instance, when opening and switching between apps). The
> ideal approach
> would involve adhering to the decision made in do_anonymous_page().
> 
>>
>> So from that PoV, it would be good to swap-in to the same size that was
>> swapped-out. But we only kind-of keep that information around, via the swap
>> entry contiguity and alignment. With that scheme it is possible that multiple
>> virtually adjacent but not physically contiguous folios get swapped-out to
>> adjacent swap slot ranges and then they would be swapped-in to a single, larger
>> folio. This is not ideal, and I think it would be valuable to try to maintain
>> the original folio size information with the swap slot. One way to do this would
>> be to store the original order for which the cluster was allocated in the
>> cluster. Then we at least know that a given swap slot is either for a folio of
>> that order or an order-0 folio (due to cluster exhaustion/scanning). Can we
>> steal a bit from swap_map to determine which case it is? Or are there better
>> approaches?
> 
> In the case of non-SWP_SYNCHRONOUS_IO, users will invariably invoke
> swap_readahead()
> even when __swap_count(entry) equals 1.  This leads to two scenarios:
> swap_vma_readahead
> and swap_cluster_readahead.
> 
> In swap_vma_readahead, when blk_queue_nonrot, physical contiguity
> doesn't appear to be a
> critical concern. However, for swap_cluster_readahead, the focus
> shifts towards the potential
> impact of physical discontiguity.

When you talk about "physical [dis]contiguity" I think you are talking about
contiguity of the swap entries in the swap device? Both paths currently allocate
order-0 folios to swap into, so neither have a concept of physical contiguity in
memory at the moment.

As I understand it, roughly the aim is to readahead by cluster for rotating
disks to reduce seek time, and readahead by virtual address for non-rotating
devices since there is no seek time cost. Correct?

Note that today, swap-out on supports (2M) THP if the swap device is
non-rotating. If it is rotating, the THP is first split. My swap-out series
maintains this policy for mTHP. So I think we only really care about
swap_vma_readahead() here; we want to teach it to figure out the order of the
swap entries and swap them into folios of the same order (with a fallback to
order-0 if allocation fails).

> 
> struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
>                                 struct vm_fault *vmf)
> {
>         struct mempolicy *mpol;
>         pgoff_t ilx;
>         struct folio *folio;
> 
>         mpol = get_vma_policy(vmf->vma, vmf->address, 0, &ilx);
>         folio = swap_use_vma_readahead() ?
>                 swap_vma_readahead(entry, gfp_mask, mpol, ilx, vmf) :
>                 swap_cluster_readahead(entry, gfp_mask, mpol, ilx);
>         mpol_cond_put(mpol);
> 
>         if (!folio)
>                 return NULL;
>         return folio_file_page(folio, swp_offset(entry));
> }
> 
> In Android and embedded systems, SWP_SYNCHRONOUS_IO is consistently utilized,
> rendering physical contiguity less of a concern. Moreover, instances where
> swap_readahead() is accessed are rare, typically occurring only in scenarios
> involving forked but non-CoWed memory.

Yes understood. What I'm hearing is that for Android at least, stealing a bit
from swap_map to remember if a swap entry is the order marked in the cluster or
order-0 won't be noticed because almost all entries have swap count == 1. From
memory, I think swap_map is 8 bits, and 2 bits are currently stolen, leaving 6
bits (count = 64) before having to move to the swap map continuation stuff. Does
anyone know what workloads provoke this overflow? What are the consequences of
reducing that count to 32?

> 
> So I think large folios swap-in will at least need three steps
> 
> 1. on SWP_SYNCHRONOUS_IO (Android and embedded Linux), this has a very
> clear model and has no complex I/O issue.
> 2. on nonrot block device(bdev_nonrot ==  true), it cares less about
> I/O contiguity.
> 3. on rot block devices which care about  I/O contiguity.

I don't think we care about (3); if the device rotates, we will have split the
folio at swap-out, so we are only concerned with swapping-in order-0 folios.

> 
> This patchset primarily addresses the systems utilizing
> SWP_SYNCHRONOUS_IO(type1),
> such as Android and embedded Linux, a straightforward model is established,
> with minimal complexity regarding I/O issues.

Understood. But your implication is that making swap_vma_readahead() large folio
swap-in aware will be complex. I think we can remember the original order in the
swap device, then it shouldn't be too difficult - conceptually at least.

> 
>>
>> Next we (I?) have concerns about wasting IO by swapping-in folios that are too
>> large (e.g. 2M). I'm not sure if this is a real problem or not - intuitively I'd
>> say yes but I have no data. But on the other hand, memory is aged and
>> swapped-out per-folio, so why shouldn't it be swapped-in per folio? If the
>> original allocation size policy is good (it currently isn't) then a folio should
>> be sized to cover temporally close memory and if we need to access some of it,
>> chances are we need all of it.
>>
>> If we think the IO concern is legitimate then we could define a threshold size
>> (sysfs?) for when we start swapping-in the folio in chunks. And how big should
>> those chunks be - one page, or the threshold size itself? Probably the latter?
>> And perhaps that threshold could also be used by zRAM to decide its upper limit
>> for compression chunk.
> 
> 
> Agreed. What about introducing a parameter like
> /sys/kernel/mm/transparent_hugepage/max_swapin_order
> giving users the opportunity to fine-tune it according to their needs. For type1
> users specifically, setting it to any value above 4 would be
> beneficial. If there's
> still a lack of tuning for desktop and server environments (type 2 and type 3),
> the default value could be set to 0.

This sort of thing sounds sensible to me. But I have a history of proposing
crappy sysfs interfaces :) So I'd like to hear from others - I suspect it will
take a fair bit of discussion before we converge. Having data to show that this
threshold is needed would also help (i.e. demonstration that the intuition that
swapping in a 2M folio is often counter-productive to performance).

> 
>>
>> Perhaps we can learn from khugepaged here? I think it has programmable
>> thresholds for how many swapped-out pages can be swapped-in to aid collapse to a
>> THP? I guess that exists for the same concerns about increased IO pressure?
>>
>>
>> If we think we will ever be swapping-in folios in chunks less than their
>> original size, then we need a separate mechanism to re-foliate them. We have
>> discussed a khugepaged-like approach for doing this asynchronously in the
>> background. I know that scares the Android folks, but David has suggested that
>> this could well be very cheap compared with khugepaged, because it would be
>> entirely limited to a single pgtable, so we only need the PTL. If we need this
>> mechanism anyway, perhaps we should develop it and see how it performs if
>> swap-in remains order-0? Although I guess that would imply not being able to
>> benefit from compressing THPs for the zRAM case.
> 
> The effectiveness of collapse operation relies on the stability of
> forming large folios
> to ensure optimal performance. In embedded systems, where more than half of the
> memory may be allocated to zRAM, folios might undergo swapping out before
> collapsing or immediately after the collapse operation. It seems a
> TAO-like optimization
> to decrease fallback and latency is more effective.

Sorry, I'm not sure I've understood what you are saying here.

> 
>>
>> I see all this as orthogonal to synchronous vs asynchronous swap devices. I
>> think the latter just implies that you might want to do some readahead to try to
>> cover up the latency? If swap is moving towards being folio-orientated, then
>> readahead also surely needs to be folio-orientated, but I think that should be
>> the only major difference.
>>
>> Anyway, just some thoughts!
> 
> Thank you very much for your valuable and insightful deliberations.
> 
>>
>> Thanks,
>> Ryan
>>
> 
> Thanks
> Barry
Huang, Ying March 19, 2024, 9:20 a.m. UTC | #18
Ryan Roberts <ryan.roberts@arm.com> writes:

>>>> I agree phones are not the only platform. But Rome wasn't built in a
>>>> day. I can only get
>>>> started on a hardware which I can easily reach and have enough hardware/test
>>>> resources on it. So we may take the first step which can be applied on
>>>> a real product
>>>> and improve its performance, and step by step, we broaden it and make it
>>>> widely useful to various areas  in which I can't reach :-)
>>>
>>> We must guarantee the normal swap path runs correctly and has no
>>> performance regression when developing SWP_SYNCHRONOUS_IO optimization.
>>> So we have to put some effort on the normal path test anyway.
>>>
>>>> so probably we can have a sysfs "enable" entry with default "n" or
>>>> have a maximum
>>>> swap-in order as Ryan's suggestion [1] at the beginning,
>>>>
>>>> "
>>>> So in the common case, swap-in will pull in the same size of folio as was
>>>> swapped-out. Is that definitely the right policy for all folio sizes? Certainly
>>>> it makes sense for "small" large folios (e.g. up to 64K IMHO). But I'm not sure
>>>> it makes sense for 2M THP; As the size increases the chances of actually needing
>>>> all of the folio reduces so chances are we are wasting IO. There are similar
>>>> arguments for CoW, where we currently copy 1 page per fault - it probably makes
>>>> sense to copy the whole folio up to a certain size.
>>>> "
>
> I thought about this a bit more. No clear conclusions, but hoped this might help
> the discussion around policy:
>
> The decision about the size of the THP is made at first fault, with some help
> from user space and in future we might make decisions to split based on
> munmap/mremap/etc hints. In an ideal world, the fact that we have had to swap
> the THP out at some point in its lifetime should not impact on its size. It's
> just being moved around in the system and the reason for our original decision
> should still hold.
>
> So from that PoV, it would be good to swap-in to the same size that was
> swapped-out.

Sorry, I don't agree with this.  It's better to swap-in and swap-out in
smallest size if the page is only accessed seldom to avoid to waste
memory.

> But we only kind-of keep that information around, via the swap
> entry contiguity and alignment. With that scheme it is possible that multiple
> virtually adjacent but not physically contiguous folios get swapped-out to
> adjacent swap slot ranges and then they would be swapped-in to a single, larger
> folio. This is not ideal, and I think it would be valuable to try to maintain
> the original folio size information with the swap slot. One way to do this would
> be to store the original order for which the cluster was allocated in the
> cluster. Then we at least know that a given swap slot is either for a folio of
> that order or an order-0 folio (due to cluster exhaustion/scanning). Can we
> steal a bit from swap_map to determine which case it is? Or are there better
> approaches?

[snip]

--
Best Regards,
Huang, Ying
Ryan Roberts March 19, 2024, 12:19 p.m. UTC | #19
On 19/03/2024 09:20, Huang, Ying wrote:
> Ryan Roberts <ryan.roberts@arm.com> writes:
> 
>>>>> I agree phones are not the only platform. But Rome wasn't built in a
>>>>> day. I can only get
>>>>> started on a hardware which I can easily reach and have enough hardware/test
>>>>> resources on it. So we may take the first step which can be applied on
>>>>> a real product
>>>>> and improve its performance, and step by step, we broaden it and make it
>>>>> widely useful to various areas  in which I can't reach :-)
>>>>
>>>> We must guarantee the normal swap path runs correctly and has no
>>>> performance regression when developing SWP_SYNCHRONOUS_IO optimization.
>>>> So we have to put some effort on the normal path test anyway.
>>>>
>>>>> so probably we can have a sysfs "enable" entry with default "n" or
>>>>> have a maximum
>>>>> swap-in order as Ryan's suggestion [1] at the beginning,
>>>>>
>>>>> "
>>>>> So in the common case, swap-in will pull in the same size of folio as was
>>>>> swapped-out. Is that definitely the right policy for all folio sizes? Certainly
>>>>> it makes sense for "small" large folios (e.g. up to 64K IMHO). But I'm not sure
>>>>> it makes sense for 2M THP; As the size increases the chances of actually needing
>>>>> all of the folio reduces so chances are we are wasting IO. There are similar
>>>>> arguments for CoW, where we currently copy 1 page per fault - it probably makes
>>>>> sense to copy the whole folio up to a certain size.
>>>>> "
>>
>> I thought about this a bit more. No clear conclusions, but hoped this might help
>> the discussion around policy:
>>
>> The decision about the size of the THP is made at first fault, with some help
>> from user space and in future we might make decisions to split based on
>> munmap/mremap/etc hints. In an ideal world, the fact that we have had to swap
>> the THP out at some point in its lifetime should not impact on its size. It's
>> just being moved around in the system and the reason for our original decision
>> should still hold.
>>
>> So from that PoV, it would be good to swap-in to the same size that was
>> swapped-out.
> 
> Sorry, I don't agree with this.  It's better to swap-in and swap-out in
> smallest size if the page is only accessed seldom to avoid to waste
> memory.

If we want to optimize only for memory consumption, I'm sure there are many
things we would do differently. We need to find a balance between memory and
performance. The benefits of folios are well documented and the kernel is
heading in the direction of managing memory in variable-sized blocks. So I don't
think it's as simple as saying we should always swap-in the smallest possible
amount of memory.

You also said we should swap *out* in smallest size possible. Have I
misunderstood you? I thought the case for swapping-out a whole folio without
splitting was well established and non-controversial?

> 
>> But we only kind-of keep that information around, via the swap
>> entry contiguity and alignment. With that scheme it is possible that multiple
>> virtually adjacent but not physically contiguous folios get swapped-out to
>> adjacent swap slot ranges and then they would be swapped-in to a single, larger
>> folio. This is not ideal, and I think it would be valuable to try to maintain
>> the original folio size information with the swap slot. One way to do this would
>> be to store the original order for which the cluster was allocated in the
>> cluster. Then we at least know that a given swap slot is either for a folio of
>> that order or an order-0 folio (due to cluster exhaustion/scanning). Can we
>> steal a bit from swap_map to determine which case it is? Or are there better
>> approaches?
> 
> [snip]
> 
> --
> Best Regards,
> Huang, Ying
Huang, Ying March 20, 2024, 2:18 a.m. UTC | #20
Ryan Roberts <ryan.roberts@arm.com> writes:

> On 19/03/2024 09:20, Huang, Ying wrote:
>> Ryan Roberts <ryan.roberts@arm.com> writes:
>> 
>>>>>> I agree phones are not the only platform. But Rome wasn't built in a
>>>>>> day. I can only get
>>>>>> started on a hardware which I can easily reach and have enough hardware/test
>>>>>> resources on it. So we may take the first step which can be applied on
>>>>>> a real product
>>>>>> and improve its performance, and step by step, we broaden it and make it
>>>>>> widely useful to various areas  in which I can't reach :-)
>>>>>
>>>>> We must guarantee the normal swap path runs correctly and has no
>>>>> performance regression when developing SWP_SYNCHRONOUS_IO optimization.
>>>>> So we have to put some effort on the normal path test anyway.
>>>>>
>>>>>> so probably we can have a sysfs "enable" entry with default "n" or
>>>>>> have a maximum
>>>>>> swap-in order as Ryan's suggestion [1] at the beginning,
>>>>>>
>>>>>> "
>>>>>> So in the common case, swap-in will pull in the same size of folio as was
>>>>>> swapped-out. Is that definitely the right policy for all folio sizes? Certainly
>>>>>> it makes sense for "small" large folios (e.g. up to 64K IMHO). But I'm not sure
>>>>>> it makes sense for 2M THP; As the size increases the chances of actually needing
>>>>>> all of the folio reduces so chances are we are wasting IO. There are similar
>>>>>> arguments for CoW, where we currently copy 1 page per fault - it probably makes
>>>>>> sense to copy the whole folio up to a certain size.
>>>>>> "
>>>
>>> I thought about this a bit more. No clear conclusions, but hoped this might help
>>> the discussion around policy:
>>>
>>> The decision about the size of the THP is made at first fault, with some help
>>> from user space and in future we might make decisions to split based on
>>> munmap/mremap/etc hints. In an ideal world, the fact that we have had to swap
>>> the THP out at some point in its lifetime should not impact on its size. It's
>>> just being moved around in the system and the reason for our original decision
>>> should still hold.
>>>
>>> So from that PoV, it would be good to swap-in to the same size that was
>>> swapped-out.
>> 
>> Sorry, I don't agree with this.  It's better to swap-in and swap-out in
>> smallest size if the page is only accessed seldom to avoid to waste
>> memory.
>
> If we want to optimize only for memory consumption, I'm sure there are many
> things we would do differently. We need to find a balance between memory and
> performance. The benefits of folios are well documented and the kernel is
> heading in the direction of managing memory in variable-sized blocks. So I don't
> think it's as simple as saying we should always swap-in the smallest possible
> amount of memory.

It's conditional, that is,

"if the page is only accessed seldom"

Then, the page swapped-in will be swapped-out soon and adjacent pages in
the same large folio will not be accessed during this period.

So, I suggest to create an algorithm to decide swap-in order based on
swap-readahead information automatically.  It can detect the situation
above via reduced swap readahead window size.  And, if the page is
accessed for quite long time, and the adjacent pages in the same large
folio are accessed too, swap-readahead window will increase and large
swap-in order will be used.

> You also said we should swap *out* in smallest size possible. Have I
> misunderstood you? I thought the case for swapping-out a whole folio without
> splitting was well established and non-controversial?

That is conditional too.

>> 
>>> But we only kind-of keep that information around, via the swap
>>> entry contiguity and alignment. With that scheme it is possible that multiple
>>> virtually adjacent but not physically contiguous folios get swapped-out to
>>> adjacent swap slot ranges and then they would be swapped-in to a single, larger
>>> folio. This is not ideal, and I think it would be valuable to try to maintain
>>> the original folio size information with the swap slot. One way to do this would
>>> be to store the original order for which the cluster was allocated in the
>>> cluster. Then we at least know that a given swap slot is either for a folio of
>>> that order or an order-0 folio (due to cluster exhaustion/scanning). Can we
>>> steal a bit from swap_map to determine which case it is? Or are there better
>>> approaches?
>> 
>> [snip]

--
Best Regards,
Huang, Ying
Barry Song March 20, 2024, 2:47 a.m. UTC | #21
On Wed, Mar 20, 2024 at 3:20 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> Ryan Roberts <ryan.roberts@arm.com> writes:
>
> > On 19/03/2024 09:20, Huang, Ying wrote:
> >> Ryan Roberts <ryan.roberts@arm.com> writes:
> >>
> >>>>>> I agree phones are not the only platform. But Rome wasn't built in a
> >>>>>> day. I can only get
> >>>>>> started on a hardware which I can easily reach and have enough hardware/test
> >>>>>> resources on it. So we may take the first step which can be applied on
> >>>>>> a real product
> >>>>>> and improve its performance, and step by step, we broaden it and make it
> >>>>>> widely useful to various areas  in which I can't reach :-)
> >>>>>
> >>>>> We must guarantee the normal swap path runs correctly and has no
> >>>>> performance regression when developing SWP_SYNCHRONOUS_IO optimization.
> >>>>> So we have to put some effort on the normal path test anyway.
> >>>>>
> >>>>>> so probably we can have a sysfs "enable" entry with default "n" or
> >>>>>> have a maximum
> >>>>>> swap-in order as Ryan's suggestion [1] at the beginning,
> >>>>>>
> >>>>>> "
> >>>>>> So in the common case, swap-in will pull in the same size of folio as was
> >>>>>> swapped-out. Is that definitely the right policy for all folio sizes? Certainly
> >>>>>> it makes sense for "small" large folios (e.g. up to 64K IMHO). But I'm not sure
> >>>>>> it makes sense for 2M THP; As the size increases the chances of actually needing
> >>>>>> all of the folio reduces so chances are we are wasting IO. There are similar
> >>>>>> arguments for CoW, where we currently copy 1 page per fault - it probably makes
> >>>>>> sense to copy the whole folio up to a certain size.
> >>>>>> "
> >>>
> >>> I thought about this a bit more. No clear conclusions, but hoped this might help
> >>> the discussion around policy:
> >>>
> >>> The decision about the size of the THP is made at first fault, with some help
> >>> from user space and in future we might make decisions to split based on
> >>> munmap/mremap/etc hints. In an ideal world, the fact that we have had to swap
> >>> the THP out at some point in its lifetime should not impact on its size. It's
> >>> just being moved around in the system and the reason for our original decision
> >>> should still hold.
> >>>
> >>> So from that PoV, it would be good to swap-in to the same size that was
> >>> swapped-out.
> >>
> >> Sorry, I don't agree with this.  It's better to swap-in and swap-out in
> >> smallest size if the page is only accessed seldom to avoid to waste
> >> memory.
> >
> > If we want to optimize only for memory consumption, I'm sure there are many
> > things we would do differently. We need to find a balance between memory and
> > performance. The benefits of folios are well documented and the kernel is
> > heading in the direction of managing memory in variable-sized blocks. So I don't
> > think it's as simple as saying we should always swap-in the smallest possible
> > amount of memory.
>
> It's conditional, that is,
>
> "if the page is only accessed seldom"
>
> Then, the page swapped-in will be swapped-out soon and adjacent pages in
> the same large folio will not be accessed during this period.
>
> So, I suggest to create an algorithm to decide swap-in order based on
> swap-readahead information automatically.  It can detect the situation
> above via reduced swap readahead window size.  And, if the page is
> accessed for quite long time, and the adjacent pages in the same large
> folio are accessed too, swap-readahead window will increase and large
> swap-in order will be used.

The original size of do_anonymous_page() should be honored, considering it
embodies a decision influenced by not only sysfs settings and per-vma
HUGEPAGE hints but also architectural characteristics, for example
CONT-PTE.

The model you're proposing may offer memory-saving benefits or reduce I/O,
but it entirely disassociates the size of the swap in from the size prior to the
swap out. Moreover, there's no guarantee that the large folio generated by
the readahead window is contiguous in the swap and can be added to the
swap cache, as we are currently dealing with folio->swap instead of
subpage->swap.

Incidentally, do_anonymous_page() serves as the initial location for allocating
large folios. Given that memory conservation is a significant consideration in
do_swap_page(), wouldn't it be even more crucial in do_anonymous_page()?

A large folio, by its nature, represents a high-quality resource that has the
potential to leverage hardware characteristics for the benefit of the
entire system.
Conversely, I don't believe that a randomly determined size dictated by the
readahead window possesses the same advantageous qualities.

SWP_SYNCHRONOUS_IO devices are not reliant on readahead whatsoever,
their needs should also be respected.

> > You also said we should swap *out* in smallest size possible. Have I
> > misunderstood you? I thought the case for swapping-out a whole folio without
> > splitting was well established and non-controversial?
>
> That is conditional too.
>
> >>
> >>> But we only kind-of keep that information around, via the swap
> >>> entry contiguity and alignment. With that scheme it is possible that multiple
> >>> virtually adjacent but not physically contiguous folios get swapped-out to
> >>> adjacent swap slot ranges and then they would be swapped-in to a single, larger
> >>> folio. This is not ideal, and I think it would be valuable to try to maintain
> >>> the original folio size information with the swap slot. One way to do this would
> >>> be to store the original order for which the cluster was allocated in the
> >>> cluster. Then we at least know that a given swap slot is either for a folio of
> >>> that order or an order-0 folio (due to cluster exhaustion/scanning). Can we
> >>> steal a bit from swap_map to determine which case it is? Or are there better
> >>> approaches?
> >>
> >> [snip]
>
> --
> Best Regards,
> Huang, Ying

Thanks
Barry
Huang, Ying March 20, 2024, 6:20 a.m. UTC | #22
Barry Song <21cnbao@gmail.com> writes:

> On Wed, Mar 20, 2024 at 3:20 PM Huang, Ying <ying.huang@intel.com> wrote:
>>
>> Ryan Roberts <ryan.roberts@arm.com> writes:
>>
>> > On 19/03/2024 09:20, Huang, Ying wrote:
>> >> Ryan Roberts <ryan.roberts@arm.com> writes:
>> >>
>> >>>>>> I agree phones are not the only platform. But Rome wasn't built in a
>> >>>>>> day. I can only get
>> >>>>>> started on a hardware which I can easily reach and have enough hardware/test
>> >>>>>> resources on it. So we may take the first step which can be applied on
>> >>>>>> a real product
>> >>>>>> and improve its performance, and step by step, we broaden it and make it
>> >>>>>> widely useful to various areas  in which I can't reach :-)
>> >>>>>
>> >>>>> We must guarantee the normal swap path runs correctly and has no
>> >>>>> performance regression when developing SWP_SYNCHRONOUS_IO optimization.
>> >>>>> So we have to put some effort on the normal path test anyway.
>> >>>>>
>> >>>>>> so probably we can have a sysfs "enable" entry with default "n" or
>> >>>>>> have a maximum
>> >>>>>> swap-in order as Ryan's suggestion [1] at the beginning,
>> >>>>>>
>> >>>>>> "
>> >>>>>> So in the common case, swap-in will pull in the same size of folio as was
>> >>>>>> swapped-out. Is that definitely the right policy for all folio sizes? Certainly
>> >>>>>> it makes sense for "small" large folios (e.g. up to 64K IMHO). But I'm not sure
>> >>>>>> it makes sense for 2M THP; As the size increases the chances of actually needing
>> >>>>>> all of the folio reduces so chances are we are wasting IO. There are similar
>> >>>>>> arguments for CoW, where we currently copy 1 page per fault - it probably makes
>> >>>>>> sense to copy the whole folio up to a certain size.
>> >>>>>> "
>> >>>
>> >>> I thought about this a bit more. No clear conclusions, but hoped this might help
>> >>> the discussion around policy:
>> >>>
>> >>> The decision about the size of the THP is made at first fault, with some help
>> >>> from user space and in future we might make decisions to split based on
>> >>> munmap/mremap/etc hints. In an ideal world, the fact that we have had to swap
>> >>> the THP out at some point in its lifetime should not impact on its size. It's
>> >>> just being moved around in the system and the reason for our original decision
>> >>> should still hold.
>> >>>
>> >>> So from that PoV, it would be good to swap-in to the same size that was
>> >>> swapped-out.
>> >>
>> >> Sorry, I don't agree with this.  It's better to swap-in and swap-out in
>> >> smallest size if the page is only accessed seldom to avoid to waste
>> >> memory.
>> >
>> > If we want to optimize only for memory consumption, I'm sure there are many
>> > things we would do differently. We need to find a balance between memory and
>> > performance. The benefits of folios are well documented and the kernel is
>> > heading in the direction of managing memory in variable-sized blocks. So I don't
>> > think it's as simple as saying we should always swap-in the smallest possible
>> > amount of memory.
>>
>> It's conditional, that is,
>>
>> "if the page is only accessed seldom"
>>
>> Then, the page swapped-in will be swapped-out soon and adjacent pages in
>> the same large folio will not be accessed during this period.
>>
>> So, I suggest to create an algorithm to decide swap-in order based on
>> swap-readahead information automatically.  It can detect the situation
>> above via reduced swap readahead window size.  And, if the page is
>> accessed for quite long time, and the adjacent pages in the same large
>> folio are accessed too, swap-readahead window will increase and large
>> swap-in order will be used.
>
> The original size of do_anonymous_page() should be honored, considering it
> embodies a decision influenced by not only sysfs settings and per-vma
> HUGEPAGE hints but also architectural characteristics, for example
> CONT-PTE.
>
> The model you're proposing may offer memory-saving benefits or reduce I/O,
> but it entirely disassociates the size of the swap in from the size prior to the
> swap out.

Readahead isn't the only factor to determine folio order.  For example,
we must respect "never" policy to allocate order-0 folio always.
There's no requirements to use swap-out order in swap-in too.  Memory
allocation has different performance character of storage reading.

> Moreover, there's no guarantee that the large folio generated by
> the readahead window is contiguous in the swap and can be added to the
> swap cache, as we are currently dealing with folio->swap instead of
> subpage->swap.

Yes.  We can optimize only when all conditions are satisfied.  Just like
other optimization.

> Incidentally, do_anonymous_page() serves as the initial location for allocating
> large folios. Given that memory conservation is a significant consideration in
> do_swap_page(), wouldn't it be even more crucial in do_anonymous_page()?

Yes.  We should consider that too.  IIUC, that is why mTHP support is
off by default for now.  After we find a way to solve the memory usage
issue.  We may make default "on".

> A large folio, by its nature, represents a high-quality resource that has the
> potential to leverage hardware characteristics for the benefit of the
> entire system.

But not at the cost of memory wastage.

> Conversely, I don't believe that a randomly determined size dictated by the
> readahead window possesses the same advantageous qualities.

There's a readahead algorithm which is not pure random.

> SWP_SYNCHRONOUS_IO devices are not reliant on readahead whatsoever,
> their needs should also be respected.

I understand that there are special requirements for SWP_SYNCHRONOUS_IO
devices.  I just suggest to work on general code before specific
optimization.

>> > You also said we should swap *out* in smallest size possible. Have I
>> > misunderstood you? I thought the case for swapping-out a whole folio without
>> > splitting was well established and non-controversial?
>>
>> That is conditional too.
>>
>> >>
>> >>> But we only kind-of keep that information around, via the swap
>> >>> entry contiguity and alignment. With that scheme it is possible that multiple
>> >>> virtually adjacent but not physically contiguous folios get swapped-out to
>> >>> adjacent swap slot ranges and then they would be swapped-in to a single, larger
>> >>> folio. This is not ideal, and I think it would be valuable to try to maintain
>> >>> the original folio size information with the swap slot. One way to do this would
>> >>> be to store the original order for which the cluster was allocated in the
>> >>> cluster. Then we at least know that a given swap slot is either for a folio of
>> >>> that order or an order-0 folio (due to cluster exhaustion/scanning). Can we
>> >>> steal a bit from swap_map to determine which case it is? Or are there better
>> >>> approaches?
>> >>
>> >> [snip]

--
Best Regards,
Huang, Ying
Barry Song March 20, 2024, 6:38 p.m. UTC | #23
On Wed, Mar 20, 2024 at 7:22 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> Barry Song <21cnbao@gmail.com> writes:
>
> > On Wed, Mar 20, 2024 at 3:20 PM Huang, Ying <ying.huang@intel.com> wrote:
> >>
> >> Ryan Roberts <ryan.roberts@arm.com> writes:
> >>
> >> > On 19/03/2024 09:20, Huang, Ying wrote:
> >> >> Ryan Roberts <ryan.roberts@arm.com> writes:
> >> >>
> >> >>>>>> I agree phones are not the only platform. But Rome wasn't built in a
> >> >>>>>> day. I can only get
> >> >>>>>> started on a hardware which I can easily reach and have enough hardware/test
> >> >>>>>> resources on it. So we may take the first step which can be applied on
> >> >>>>>> a real product
> >> >>>>>> and improve its performance, and step by step, we broaden it and make it
> >> >>>>>> widely useful to various areas  in which I can't reach :-)
> >> >>>>>
> >> >>>>> We must guarantee the normal swap path runs correctly and has no
> >> >>>>> performance regression when developing SWP_SYNCHRONOUS_IO optimization.
> >> >>>>> So we have to put some effort on the normal path test anyway.
> >> >>>>>
> >> >>>>>> so probably we can have a sysfs "enable" entry with default "n" or
> >> >>>>>> have a maximum
> >> >>>>>> swap-in order as Ryan's suggestion [1] at the beginning,
> >> >>>>>>
> >> >>>>>> "
> >> >>>>>> So in the common case, swap-in will pull in the same size of folio as was
> >> >>>>>> swapped-out. Is that definitely the right policy for all folio sizes? Certainly
> >> >>>>>> it makes sense for "small" large folios (e.g. up to 64K IMHO). But I'm not sure
> >> >>>>>> it makes sense for 2M THP; As the size increases the chances of actually needing
> >> >>>>>> all of the folio reduces so chances are we are wasting IO. There are similar
> >> >>>>>> arguments for CoW, where we currently copy 1 page per fault - it probably makes
> >> >>>>>> sense to copy the whole folio up to a certain size.
> >> >>>>>> "
> >> >>>
> >> >>> I thought about this a bit more. No clear conclusions, but hoped this might help
> >> >>> the discussion around policy:
> >> >>>
> >> >>> The decision about the size of the THP is made at first fault, with some help
> >> >>> from user space and in future we might make decisions to split based on
> >> >>> munmap/mremap/etc hints. In an ideal world, the fact that we have had to swap
> >> >>> the THP out at some point in its lifetime should not impact on its size. It's
> >> >>> just being moved around in the system and the reason for our original decision
> >> >>> should still hold.
> >> >>>
> >> >>> So from that PoV, it would be good to swap-in to the same size that was
> >> >>> swapped-out.
> >> >>
> >> >> Sorry, I don't agree with this.  It's better to swap-in and swap-out in
> >> >> smallest size if the page is only accessed seldom to avoid to waste
> >> >> memory.
> >> >
> >> > If we want to optimize only for memory consumption, I'm sure there are many
> >> > things we would do differently. We need to find a balance between memory and
> >> > performance. The benefits of folios are well documented and the kernel is
> >> > heading in the direction of managing memory in variable-sized blocks. So I don't
> >> > think it's as simple as saying we should always swap-in the smallest possible
> >> > amount of memory.
> >>
> >> It's conditional, that is,
> >>
> >> "if the page is only accessed seldom"
> >>
> >> Then, the page swapped-in will be swapped-out soon and adjacent pages in
> >> the same large folio will not be accessed during this period.
> >>
> >> So, I suggest to create an algorithm to decide swap-in order based on
> >> swap-readahead information automatically.  It can detect the situation
> >> above via reduced swap readahead window size.  And, if the page is
> >> accessed for quite long time, and the adjacent pages in the same large
> >> folio are accessed too, swap-readahead window will increase and large
> >> swap-in order will be used.
> >
> > The original size of do_anonymous_page() should be honored, considering it
> > embodies a decision influenced by not only sysfs settings and per-vma
> > HUGEPAGE hints but also architectural characteristics, for example
> > CONT-PTE.
> >
> > The model you're proposing may offer memory-saving benefits or reduce I/O,
> > but it entirely disassociates the size of the swap in from the size prior to the
> > swap out.
>
> Readahead isn't the only factor to determine folio order.  For example,
> we must respect "never" policy to allocate order-0 folio always.
> There's no requirements to use swap-out order in swap-in too.  Memory
> allocation has different performance character of storage reading.

Still quite unclear.

If users have only enabled 64KiB (4-ORDER) large folios in sysfs, and the
readahead algorithm requires 16KiB, what should be set as the large folio size?
Setting it to 16KiB doesn't align with users' requirements, while
setting it to 64KiB
would be wasteful according to your criteria.

>
> > Moreover, there's no guarantee that the large folio generated by
> > the readahead window is contiguous in the swap and can be added to the
> > swap cache, as we are currently dealing with folio->swap instead of
> > subpage->swap.
>
> Yes.  We can optimize only when all conditions are satisfied.  Just like
> other optimization.
>
> > Incidentally, do_anonymous_page() serves as the initial location for allocating
> > large folios. Given that memory conservation is a significant consideration in
> > do_swap_page(), wouldn't it be even more crucial in do_anonymous_page()?
>
> Yes.  We should consider that too.  IIUC, that is why mTHP support is
> off by default for now.  After we find a way to solve the memory usage
> issue.  We may make default "on".

It's challenging to establish a universal solution because various systems
exhibit diverse hardware characteristics, and VMAs may require different
alignments. The current sysfs and per-vma hints allow users the opportunity
o customize settings according to their specific requirements.

>
> > A large folio, by its nature, represents a high-quality resource that has the
> > potential to leverage hardware characteristics for the benefit of the
> > entire system.
>
> But not at the cost of memory wastage.
>
> > Conversely, I don't believe that a randomly determined size dictated by the
> > readahead window possesses the same advantageous qualities.
>
> There's a readahead algorithm which is not pure random.
>
> > SWP_SYNCHRONOUS_IO devices are not reliant on readahead whatsoever,
> > their needs should also be respected.
>
> I understand that there are special requirements for SWP_SYNCHRONOUS_IO
> devices.  I just suggest to work on general code before specific
> optimization.

I disagree with your definition of "special" and "general". According
to your logic,
non-SWP_SYNCHRONOUS_IO devices could also be classified as "special".
Furthermore, the number of systems running SWP_SYNCHRONOUS_IO is
significantly greater than those running non-SWP_SYNCHRONOUS_IO,
contradicting your assertion.

SWP_SYNCHRONOUS_IO devices have a minor chance of being involved
in readahead. However, in OPPO's code, which hasn't been sent in the
LKML yet, we use the exact same size as do_anonymous_page for readahead.
Without a clear description of how you want the new readahead
algorithm to balance memory waste and users' hints from sysfs and
per-vma flags, it appears to be an ambiguous area to address.

Please provide a clear description of how you would like the new readahead
algorithm to function. I believe this clarity will facilitate others
in attempting to
implement it.

>
> >> > You also said we should swap *out* in smallest size possible. Have I
> >> > misunderstood you? I thought the case for swapping-out a whole folio without
> >> > splitting was well established and non-controversial?
> >>
> >> That is conditional too.
> >>
> >> >>
> >> >>> But we only kind-of keep that information around, via the swap
> >> >>> entry contiguity and alignment. With that scheme it is possible that multiple
> >> >>> virtually adjacent but not physically contiguous folios get swapped-out to
> >> >>> adjacent swap slot ranges and then they would be swapped-in to a single, larger
> >> >>> folio. This is not ideal, and I think it would be valuable to try to maintain
> >> >>> the original folio size information with the swap slot. One way to do this would
> >> >>> be to store the original order for which the cluster was allocated in the
> >> >>> cluster. Then we at least know that a given swap slot is either for a folio of
> >> >>> that order or an order-0 folio (due to cluster exhaustion/scanning). Can we
> >> >>> steal a bit from swap_map to determine which case it is? Or are there better
> >> >>> approaches?
> >> >>
> >> >> [snip]
>
> --
> Best Regards,
> Huang, Ying

Thanks
Barry
Huang, Ying March 21, 2024, 4:23 a.m. UTC | #24
Barry Song <21cnbao@gmail.com> writes:

> On Wed, Mar 20, 2024 at 7:22 PM Huang, Ying <ying.huang@intel.com> wrote:
>>
>> Barry Song <21cnbao@gmail.com> writes:
>>
>> > On Wed, Mar 20, 2024 at 3:20 PM Huang, Ying <ying.huang@intel.com> wrote:
>> >>
>> >> Ryan Roberts <ryan.roberts@arm.com> writes:
>> >>
>> >> > On 19/03/2024 09:20, Huang, Ying wrote:
>> >> >> Ryan Roberts <ryan.roberts@arm.com> writes:
>> >> >>
>> >> >>>>>> I agree phones are not the only platform. But Rome wasn't built in a
>> >> >>>>>> day. I can only get
>> >> >>>>>> started on a hardware which I can easily reach and have enough hardware/test
>> >> >>>>>> resources on it. So we may take the first step which can be applied on
>> >> >>>>>> a real product
>> >> >>>>>> and improve its performance, and step by step, we broaden it and make it
>> >> >>>>>> widely useful to various areas  in which I can't reach :-)
>> >> >>>>>
>> >> >>>>> We must guarantee the normal swap path runs correctly and has no
>> >> >>>>> performance regression when developing SWP_SYNCHRONOUS_IO optimization.
>> >> >>>>> So we have to put some effort on the normal path test anyway.
>> >> >>>>>
>> >> >>>>>> so probably we can have a sysfs "enable" entry with default "n" or
>> >> >>>>>> have a maximum
>> >> >>>>>> swap-in order as Ryan's suggestion [1] at the beginning,
>> >> >>>>>>
>> >> >>>>>> "
>> >> >>>>>> So in the common case, swap-in will pull in the same size of folio as was
>> >> >>>>>> swapped-out. Is that definitely the right policy for all folio sizes? Certainly
>> >> >>>>>> it makes sense for "small" large folios (e.g. up to 64K IMHO). But I'm not sure
>> >> >>>>>> it makes sense for 2M THP; As the size increases the chances of actually needing
>> >> >>>>>> all of the folio reduces so chances are we are wasting IO. There are similar
>> >> >>>>>> arguments for CoW, where we currently copy 1 page per fault - it probably makes
>> >> >>>>>> sense to copy the whole folio up to a certain size.
>> >> >>>>>> "
>> >> >>>
>> >> >>> I thought about this a bit more. No clear conclusions, but hoped this might help
>> >> >>> the discussion around policy:
>> >> >>>
>> >> >>> The decision about the size of the THP is made at first fault, with some help
>> >> >>> from user space and in future we might make decisions to split based on
>> >> >>> munmap/mremap/etc hints. In an ideal world, the fact that we have had to swap
>> >> >>> the THP out at some point in its lifetime should not impact on its size. It's
>> >> >>> just being moved around in the system and the reason for our original decision
>> >> >>> should still hold.
>> >> >>>
>> >> >>> So from that PoV, it would be good to swap-in to the same size that was
>> >> >>> swapped-out.
>> >> >>
>> >> >> Sorry, I don't agree with this.  It's better to swap-in and swap-out in
>> >> >> smallest size if the page is only accessed seldom to avoid to waste
>> >> >> memory.
>> >> >
>> >> > If we want to optimize only for memory consumption, I'm sure there are many
>> >> > things we would do differently. We need to find a balance between memory and
>> >> > performance. The benefits of folios are well documented and the kernel is
>> >> > heading in the direction of managing memory in variable-sized blocks. So I don't
>> >> > think it's as simple as saying we should always swap-in the smallest possible
>> >> > amount of memory.
>> >>
>> >> It's conditional, that is,
>> >>
>> >> "if the page is only accessed seldom"
>> >>
>> >> Then, the page swapped-in will be swapped-out soon and adjacent pages in
>> >> the same large folio will not be accessed during this period.
>> >>
>> >> So, I suggest to create an algorithm to decide swap-in order based on
>> >> swap-readahead information automatically.  It can detect the situation
>> >> above via reduced swap readahead window size.  And, if the page is
>> >> accessed for quite long time, and the adjacent pages in the same large
>> >> folio are accessed too, swap-readahead window will increase and large
>> >> swap-in order will be used.
>> >
>> > The original size of do_anonymous_page() should be honored, considering it
>> > embodies a decision influenced by not only sysfs settings and per-vma
>> > HUGEPAGE hints but also architectural characteristics, for example
>> > CONT-PTE.
>> >
>> > The model you're proposing may offer memory-saving benefits or reduce I/O,
>> > but it entirely disassociates the size of the swap in from the size prior to the
>> > swap out.
>>
>> Readahead isn't the only factor to determine folio order.  For example,
>> we must respect "never" policy to allocate order-0 folio always.
>> There's no requirements to use swap-out order in swap-in too.  Memory
>> allocation has different performance character of storage reading.
>
> Still quite unclear.
>
> If users have only enabled 64KiB (4-ORDER) large folios in sysfs, and the
> readahead algorithm requires 16KiB, what should be set as the large folio size?
> Setting it to 16KiB doesn't align with users' requirements, while
> setting it to 64KiB
> would be wasteful according to your criteria.

IIUC, enabling 64KB means you can use 64KB mTHP if appropriate, doesn't
mean that you must use 64KB mTHP.  If so, we should use 16KB mTHP in
that situation.

>> > Moreover, there's no guarantee that the large folio generated by
>> > the readahead window is contiguous in the swap and can be added to the
>> > swap cache, as we are currently dealing with folio->swap instead of
>> > subpage->swap.
>>
>> Yes.  We can optimize only when all conditions are satisfied.  Just like
>> other optimization.
>>
>> > Incidentally, do_anonymous_page() serves as the initial location for allocating
>> > large folios. Given that memory conservation is a significant consideration in
>> > do_swap_page(), wouldn't it be even more crucial in do_anonymous_page()?
>>
>> Yes.  We should consider that too.  IIUC, that is why mTHP support is
>> off by default for now.  After we find a way to solve the memory usage
>> issue.  We may make default "on".
>
> It's challenging to establish a universal solution because various systems
> exhibit diverse hardware characteristics, and VMAs may require different
> alignments. The current sysfs and per-vma hints allow users the opportunity
> o customize settings according to their specific requirements.

IIUC, Linux kernel is trying to provide a reasonable default behavior in
all situations.  We are trying to optimize default behavior in the first
place, only introduce customization if we fail to do that.  I don't
think that it's a good idea to introduce too much customization if we
haven't tried to optimize the default behavior.

>>
>> > A large folio, by its nature, represents a high-quality resource that has the
>> > potential to leverage hardware characteristics for the benefit of the
>> > entire system.
>>
>> But not at the cost of memory wastage.
>>
>> > Conversely, I don't believe that a randomly determined size dictated by the
>> > readahead window possesses the same advantageous qualities.
>>
>> There's a readahead algorithm which is not pure random.
>>
>> > SWP_SYNCHRONOUS_IO devices are not reliant on readahead whatsoever,
>> > their needs should also be respected.
>>
>> I understand that there are special requirements for SWP_SYNCHRONOUS_IO
>> devices.  I just suggest to work on general code before specific
>> optimization.
>
> I disagree with your definition of "special" and "general". According
> to your logic,
> non-SWP_SYNCHRONOUS_IO devices could also be classified as "special".

SWP_SYNCHRONOUS_IO devices also use general code path.  They just use
some special optimization in some special situation (__swap_count(entry)
== 1).  Optimization in general path benefits everyone.

> Furthermore, the number of systems running SWP_SYNCHRONOUS_IO is
> significantly greater than those running non-SWP_SYNCHRONOUS_IO,
> contradicting your assertion.
>
> SWP_SYNCHRONOUS_IO devices have a minor chance of being involved
> in readahead.

Then it loses an opportunity to determine the appropriate folio order.
We can consider how to balance between the overhead and benefit of
readahead.  IIUC, compared with original SWP_SYNCHRONOUS_IO swap-in,
mTHP is a kind of readahead too.

BTW, because we have added more and more swap cache related operations
(swapcache_prepare(), clear_shadow_from_swap_cache(), swapcache_clear(),
etc.) in SWP_SYNCHRONOUS_IO code path, I suspect whether the benefit of
SWP_SYNCHRONOUS_IO is still large enough.  We may need to re-evaluate
it.

> However, in OPPO's code, which hasn't been sent in the
> LKML yet, we use the exact same size as do_anonymous_page for readahead.
> Without a clear description of how you want the new readahead
> algorithm to balance memory waste and users' hints from sysfs and
> per-vma flags, it appears to be an ambiguous area to address.
>
> Please provide a clear description of how you would like the new readahead
> algorithm to function. I believe this clarity will facilitate others
> in attempting to
> implement it.

For example, if __swapin_nr_pages() > 4, we can try to allocate an
order-2 mTHP if other conditions are satisfied.

>>
>> >> > You also said we should swap *out* in smallest size possible. Have I
>> >> > misunderstood you? I thought the case for swapping-out a whole folio without
>> >> > splitting was well established and non-controversial?
>> >>
>> >> That is conditional too.
>> >>
>> >> >>
>> >> >>> But we only kind-of keep that information around, via the swap
>> >> >>> entry contiguity and alignment. With that scheme it is possible that multiple
>> >> >>> virtually adjacent but not physically contiguous folios get swapped-out to
>> >> >>> adjacent swap slot ranges and then they would be swapped-in to a single, larger
>> >> >>> folio. This is not ideal, and I think it would be valuable to try to maintain
>> >> >>> the original folio size information with the swap slot. One way to do this would
>> >> >>> be to store the original order for which the cluster was allocated in the
>> >> >>> cluster. Then we at least know that a given swap slot is either for a folio of
>> >> >>> that order or an order-0 folio (due to cluster exhaustion/scanning). Can we
>> >> >>> steal a bit from swap_map to determine which case it is? Or are there better
>> >> >>> approaches?
>> >> >>
>> >> >> [snip]
>>

--
Best Regards,
Huang, Ying
Barry Song March 21, 2024, 5:12 a.m. UTC | #25
On Thu, Mar 21, 2024 at 5:25 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> Barry Song <21cnbao@gmail.com> writes:
>
> > On Wed, Mar 20, 2024 at 7:22 PM Huang, Ying <ying.huang@intel.com> wrote:
> >>
> >> Barry Song <21cnbao@gmail.com> writes:
> >>
> >> > On Wed, Mar 20, 2024 at 3:20 PM Huang, Ying <ying.huang@intel.com> wrote:
> >> >>
> >> >> Ryan Roberts <ryan.roberts@arm.com> writes:
> >> >>
> >> >> > On 19/03/2024 09:20, Huang, Ying wrote:
> >> >> >> Ryan Roberts <ryan.roberts@arm.com> writes:
> >> >> >>
> >> >> >>>>>> I agree phones are not the only platform. But Rome wasn't built in a
> >> >> >>>>>> day. I can only get
> >> >> >>>>>> started on a hardware which I can easily reach and have enough hardware/test
> >> >> >>>>>> resources on it. So we may take the first step which can be applied on
> >> >> >>>>>> a real product
> >> >> >>>>>> and improve its performance, and step by step, we broaden it and make it
> >> >> >>>>>> widely useful to various areas  in which I can't reach :-)
> >> >> >>>>>
> >> >> >>>>> We must guarantee the normal swap path runs correctly and has no
> >> >> >>>>> performance regression when developing SWP_SYNCHRONOUS_IO optimization.
> >> >> >>>>> So we have to put some effort on the normal path test anyway.
> >> >> >>>>>
> >> >> >>>>>> so probably we can have a sysfs "enable" entry with default "n" or
> >> >> >>>>>> have a maximum
> >> >> >>>>>> swap-in order as Ryan's suggestion [1] at the beginning,
> >> >> >>>>>>
> >> >> >>>>>> "
> >> >> >>>>>> So in the common case, swap-in will pull in the same size of folio as was
> >> >> >>>>>> swapped-out. Is that definitely the right policy for all folio sizes? Certainly
> >> >> >>>>>> it makes sense for "small" large folios (e.g. up to 64K IMHO). But I'm not sure
> >> >> >>>>>> it makes sense for 2M THP; As the size increases the chances of actually needing
> >> >> >>>>>> all of the folio reduces so chances are we are wasting IO. There are similar
> >> >> >>>>>> arguments for CoW, where we currently copy 1 page per fault - it probably makes
> >> >> >>>>>> sense to copy the whole folio up to a certain size.
> >> >> >>>>>> "
> >> >> >>>
> >> >> >>> I thought about this a bit more. No clear conclusions, but hoped this might help
> >> >> >>> the discussion around policy:
> >> >> >>>
> >> >> >>> The decision about the size of the THP is made at first fault, with some help
> >> >> >>> from user space and in future we might make decisions to split based on
> >> >> >>> munmap/mremap/etc hints. In an ideal world, the fact that we have had to swap
> >> >> >>> the THP out at some point in its lifetime should not impact on its size. It's
> >> >> >>> just being moved around in the system and the reason for our original decision
> >> >> >>> should still hold.
> >> >> >>>
> >> >> >>> So from that PoV, it would be good to swap-in to the same size that was
> >> >> >>> swapped-out.
> >> >> >>
> >> >> >> Sorry, I don't agree with this.  It's better to swap-in and swap-out in
> >> >> >> smallest size if the page is only accessed seldom to avoid to waste
> >> >> >> memory.
> >> >> >
> >> >> > If we want to optimize only for memory consumption, I'm sure there are many
> >> >> > things we would do differently. We need to find a balance between memory and
> >> >> > performance. The benefits of folios are well documented and the kernel is
> >> >> > heading in the direction of managing memory in variable-sized blocks. So I don't
> >> >> > think it's as simple as saying we should always swap-in the smallest possible
> >> >> > amount of memory.
> >> >>
> >> >> It's conditional, that is,
> >> >>
> >> >> "if the page is only accessed seldom"
> >> >>
> >> >> Then, the page swapped-in will be swapped-out soon and adjacent pages in
> >> >> the same large folio will not be accessed during this period.
> >> >>
> >> >> So, I suggest to create an algorithm to decide swap-in order based on
> >> >> swap-readahead information automatically.  It can detect the situation
> >> >> above via reduced swap readahead window size.  And, if the page is
> >> >> accessed for quite long time, and the adjacent pages in the same large
> >> >> folio are accessed too, swap-readahead window will increase and large
> >> >> swap-in order will be used.
> >> >
> >> > The original size of do_anonymous_page() should be honored, considering it
> >> > embodies a decision influenced by not only sysfs settings and per-vma
> >> > HUGEPAGE hints but also architectural characteristics, for example
> >> > CONT-PTE.
> >> >
> >> > The model you're proposing may offer memory-saving benefits or reduce I/O,
> >> > but it entirely disassociates the size of the swap in from the size prior to the
> >> > swap out.
> >>
> >> Readahead isn't the only factor to determine folio order.  For example,
> >> we must respect "never" policy to allocate order-0 folio always.
> >> There's no requirements to use swap-out order in swap-in too.  Memory
> >> allocation has different performance character of storage reading.
> >
> > Still quite unclear.
> >
> > If users have only enabled 64KiB (4-ORDER) large folios in sysfs, and the
> > readahead algorithm requires 16KiB, what should be set as the large folio size?
> > Setting it to 16KiB doesn't align with users' requirements, while
> > setting it to 64KiB
> > would be wasteful according to your criteria.
>
> IIUC, enabling 64KB means you can use 64KB mTHP if appropriate, doesn't
> mean that you must use 64KB mTHP.  If so, we should use 16KB mTHP in
> that situation.

A specific large folio size inherently denotes a high-quality
resource. For example,
a 64KiB folio necessitates only one TLB on ARM64, just as a 2MB large folio
accommodates only one TLB. I am skeptical whether a size determined by
readahead offers any tangible advantages over simply having small folios.

>
> >> > Moreover, there's no guarantee that the large folio generated by
> >> > the readahead window is contiguous in the swap and can be added to the
> >> > swap cache, as we are currently dealing with folio->swap instead of
> >> > subpage->swap.
> >>
> >> Yes.  We can optimize only when all conditions are satisfied.  Just like
> >> other optimization.
> >>
> >> > Incidentally, do_anonymous_page() serves as the initial location for allocating
> >> > large folios. Given that memory conservation is a significant consideration in
> >> > do_swap_page(), wouldn't it be even more crucial in do_anonymous_page()?
> >>
> >> Yes.  We should consider that too.  IIUC, that is why mTHP support is
> >> off by default for now.  After we find a way to solve the memory usage
> >> issue.  We may make default "on".
> >
> > It's challenging to establish a universal solution because various systems
> > exhibit diverse hardware characteristics, and VMAs may require different
> > alignments. The current sysfs and per-vma hints allow users the opportunity
> > o customize settings according to their specific requirements.
>
> IIUC, Linux kernel is trying to provide a reasonable default behavior in
> all situations.  We are trying to optimize default behavior in the first
> place, only introduce customization if we fail to do that.  I don't
> think that it's a good idea to introduce too much customization if we
> haven't tried to optimize the default behavior.

I've never been opposed to the readahead case, but I feel it's a second step.

My point is to begin with the simplest and most practical approaches
that can generate
genuine value and contribution. The SWP_SYNCHRONOUS_IO case has been
implemented on millions of OPPO's phones and has demonstrated product success.

>
> >>
> >> > A large folio, by its nature, represents a high-quality resource that has the
> >> > potential to leverage hardware characteristics for the benefit of the
> >> > entire system.
> >>
> >> But not at the cost of memory wastage.
> >>
> >> > Conversely, I don't believe that a randomly determined size dictated by the
> >> > readahead window possesses the same advantageous qualities.
> >>
> >> There's a readahead algorithm which is not pure random.
> >>
> >> > SWP_SYNCHRONOUS_IO devices are not reliant on readahead whatsoever,
> >> > their needs should also be respected.
> >>
> >> I understand that there are special requirements for SWP_SYNCHRONOUS_IO
> >> devices.  I just suggest to work on general code before specific
> >> optimization.
> >
> > I disagree with your definition of "special" and "general". According
> > to your logic,
> > non-SWP_SYNCHRONOUS_IO devices could also be classified as "special".
>
> SWP_SYNCHRONOUS_IO devices also use general code path.  They just use
> some special optimization in some special situation (__swap_count(entry)
> == 1).  Optimization in general path benefits everyone.
>
> > Furthermore, the number of systems running SWP_SYNCHRONOUS_IO is
> > significantly greater than those running non-SWP_SYNCHRONOUS_IO,
> > contradicting your assertion.
> >
> > SWP_SYNCHRONOUS_IO devices have a minor chance of being involved
> > in readahead.
>
> Then it loses an opportunity to determine the appropriate folio order.
> We can consider how to balance between the overhead and benefit of
> readahead.  IIUC, compared with original SWP_SYNCHRONOUS_IO swap-in,
> mTHP is a kind of readahead too.
>
> BTW, because we have added more and more swap cache related operations
> (swapcache_prepare(), clear_shadow_from_swap_cache(), swapcache_clear(),
> etc.) in SWP_SYNCHRONOUS_IO code path, I suspect whether the benefit of
> SWP_SYNCHRONOUS_IO is still large enough.  We may need to re-evaluate
> it.

Obviously SWP_SYNCHRONOUS_IO is still quite valuable as Kairui has the data
in his commit  13ddaf26be324a ("mm/swap: fix race when skipping swapcache")

"Performance overhead is minimal, microbenchmark swapin 10G from 32G zram:
Before: 10934698 us
After: 11157121 us
Cached: 13155355 us (Dropping SWP_SYNCHRONOUS_IO flag) "

>
> > However, in OPPO's code, which hasn't been sent in the
> > LKML yet, we use the exact same size as do_anonymous_page for readahead.
> > Without a clear description of how you want the new readahead
> > algorithm to balance memory waste and users' hints from sysfs and
> > per-vma flags, it appears to be an ambiguous area to address.
> >
> > Please provide a clear description of how you would like the new readahead
> > algorithm to function. I believe this clarity will facilitate others
> > in attempting to
> > implement it.
>
> For example, if __swapin_nr_pages() > 4, we can try to allocate an
> order-2 mTHP if other conditions are satisfied.

There is no evidence suggesting that an order-2 or any other orders
determined by readahead are superior to having four small folios.

>
> >>
> >> >> > You also said we should swap *out* in smallest size possible. Have I
> >> >> > misunderstood you? I thought the case for swapping-out a whole folio without
> >> >> > splitting was well established and non-controversial?
> >> >>
> >> >> That is conditional too.
> >> >>
> >> >> >>
> >> >> >>> But we only kind-of keep that information around, via the swap
> >> >> >>> entry contiguity and alignment. With that scheme it is possible that multiple
> >> >> >>> virtually adjacent but not physically contiguous folios get swapped-out to
> >> >> >>> adjacent swap slot ranges and then they would be swapped-in to a single, larger
> >> >> >>> folio. This is not ideal, and I think it would be valuable to try to maintain
> >> >> >>> the original folio size information with the swap slot. One way to do this would
> >> >> >>> be to store the original order for which the cluster was allocated in the
> >> >> >>> cluster. Then we at least know that a given swap slot is either for a folio of
> >> >> >>> that order or an order-0 folio (due to cluster exhaustion/scanning). Can we
> >> >> >>> steal a bit from swap_map to determine which case it is? Or are there better
> >> >> >>> approaches?
> >> >> >>
> >> >> >> [snip]
> >>
>
> --
> Best Regards,
> Huang, Ying

Thanks
Barry
Barry Song March 21, 2024, 9:22 a.m. UTC | #26
On Tue, Mar 19, 2024 at 10:05 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> On 19/03/2024 06:27, Barry Song wrote:
> > On Tue, Mar 19, 2024 at 5:45 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
> >>
> >>>>> I agree phones are not the only platform. But Rome wasn't built in a
> >>>>> day. I can only get
> >>>>> started on a hardware which I can easily reach and have enough hardware/test
> >>>>> resources on it. So we may take the first step which can be applied on
> >>>>> a real product
> >>>>> and improve its performance, and step by step, we broaden it and make it
> >>>>> widely useful to various areas  in which I can't reach :-)
> >>>>
> >>>> We must guarantee the normal swap path runs correctly and has no
> >>>> performance regression when developing SWP_SYNCHRONOUS_IO optimization.
> >>>> So we have to put some effort on the normal path test anyway.
> >>>>
> >>>>> so probably we can have a sysfs "enable" entry with default "n" or
> >>>>> have a maximum
> >>>>> swap-in order as Ryan's suggestion [1] at the beginning,
> >>>>>
> >>>>> "
> >>>>> So in the common case, swap-in will pull in the same size of folio as was
> >>>>> swapped-out. Is that definitely the right policy for all folio sizes? Certainly
> >>>>> it makes sense for "small" large folios (e.g. up to 64K IMHO). But I'm not sure
> >>>>> it makes sense for 2M THP; As the size increases the chances of actually needing
> >>>>> all of the folio reduces so chances are we are wasting IO. There are similar
> >>>>> arguments for CoW, where we currently copy 1 page per fault - it probably makes
> >>>>> sense to copy the whole folio up to a certain size.
> >>>>> "
> >>
> >> I thought about this a bit more. No clear conclusions, but hoped this might help
> >> the discussion around policy:
> >>
> >> The decision about the size of the THP is made at first fault, with some help
> >> from user space and in future we might make decisions to split based on
> >> munmap/mremap/etc hints. In an ideal world, the fact that we have had to swap
> >> the THP out at some point in its lifetime should not impact on its size. It's
> >> just being moved around in the system and the reason for our original decision
> >> should still hold.
> >
> > Indeed, this is an ideal framework for smartphones and likely for
> > widely embedded
> > Linux systems utilizing zRAM. We set the mTHP size to 64KiB to
> > leverage CONT-PTE,
> > given that more than half of the memory on phones may frequently swap out and
> > swap in (for instance, when opening and switching between apps). The
> > ideal approach
> > would involve adhering to the decision made in do_anonymous_page().
> >
> >>
> >> So from that PoV, it would be good to swap-in to the same size that was
> >> swapped-out. But we only kind-of keep that information around, via the swap
> >> entry contiguity and alignment. With that scheme it is possible that multiple
> >> virtually adjacent but not physically contiguous folios get swapped-out to
> >> adjacent swap slot ranges and then they would be swapped-in to a single, larger
> >> folio. This is not ideal, and I think it would be valuable to try to maintain
> >> the original folio size information with the swap slot. One way to do this would
> >> be to store the original order for which the cluster was allocated in the
> >> cluster. Then we at least know that a given swap slot is either for a folio of
> >> that order or an order-0 folio (due to cluster exhaustion/scanning). Can we
> >> steal a bit from swap_map to determine which case it is? Or are there better
> >> approaches?
> >
> > In the case of non-SWP_SYNCHRONOUS_IO, users will invariably invoke
> > swap_readahead()
> > even when __swap_count(entry) equals 1.  This leads to two scenarios:
> > swap_vma_readahead
> > and swap_cluster_readahead.
> >
> > In swap_vma_readahead, when blk_queue_nonrot, physical contiguity
> > doesn't appear to be a
> > critical concern. However, for swap_cluster_readahead, the focus
> > shifts towards the potential
> > impact of physical discontiguity.
>
> When you talk about "physical [dis]contiguity" I think you are talking about
> contiguity of the swap entries in the swap device? Both paths currently allocate
> order-0 folios to swap into, so neither have a concept of physical contiguity in
> memory at the moment.
>
> As I understand it, roughly the aim is to readahead by cluster for rotating
> disks to reduce seek time, and readahead by virtual address for non-rotating
> devices since there is no seek time cost. Correct?

From the  code comment, I agree with this.

 * It's a main entry function for swap readahead. By the configuration,
 * it will read ahead blocks by cluster-based(ie, physical disk based)
 * or vma-based(ie, virtual address based on faulty address) readahead.

>
> Note that today, swap-out on supports (2M) THP if the swap device is
> non-rotating. If it is rotating, the THP is first split. My swap-out series
> maintains this policy for mTHP. So I think we only really care about
> swap_vma_readahead() here; we want to teach it to figure out the order of the
> swap entries and swap them into folios of the same order (with a fallback to
> order-0 if allocation fails).

I agree we don't need to care about devices which rotate.

>
> >
> > struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
> >                                 struct vm_fault *vmf)
> > {
> >         struct mempolicy *mpol;
> >         pgoff_t ilx;
> >         struct folio *folio;
> >
> >         mpol = get_vma_policy(vmf->vma, vmf->address, 0, &ilx);
> >         folio = swap_use_vma_readahead() ?
> >                 swap_vma_readahead(entry, gfp_mask, mpol, ilx, vmf) :
> >                 swap_cluster_readahead(entry, gfp_mask, mpol, ilx);
> >         mpol_cond_put(mpol);
> >
> >         if (!folio)
> >                 return NULL;
> >         return folio_file_page(folio, swp_offset(entry));
> > }
> >
> > In Android and embedded systems, SWP_SYNCHRONOUS_IO is consistently utilized,
> > rendering physical contiguity less of a concern. Moreover, instances where
> > swap_readahead() is accessed are rare, typically occurring only in scenarios
> > involving forked but non-CoWed memory.
>
> Yes understood. What I'm hearing is that for Android at least, stealing a bit
> from swap_map to remember if a swap entry is the order marked in the cluster or
> order-0 won't be noticed because almost all entries have swap count == 1. From
> memory, I think swap_map is 8 bits, and 2 bits are currently stolen, leaving 6
> bits (count = 64) before having to move to the swap map continuation stuff. Does
> anyone know what workloads provoke this overflow? What are the consequences of
> reducing that count to 32?

I'm not entirely clear on why you need bits to record this
information. Could you
provide more details?

>
> >
> > So I think large folios swap-in will at least need three steps
> >
> > 1. on SWP_SYNCHRONOUS_IO (Android and embedded Linux), this has a very
> > clear model and has no complex I/O issue.
> > 2. on nonrot block device(bdev_nonrot ==  true), it cares less about
> > I/O contiguity.
> > 3. on rot block devices which care about  I/O contiguity.
>
> I don't think we care about (3); if the device rotates, we will have split the
> folio at swap-out, so we are only concerned with swapping-in order-0 folios.
>
> >
> > This patchset primarily addresses the systems utilizing
> > SWP_SYNCHRONOUS_IO(type1),
> > such as Android and embedded Linux, a straightforward model is established,
> > with minimal complexity regarding I/O issues.
>
> Understood. But your implication is that making swap_vma_readahead() large folio
> swap-in aware will be complex. I think we can remember the original order in the
> swap device, then it shouldn't be too difficult - conceptually at least.

Currently, I can scan PTE entries and determine the number of
contiguous swap offsets.
The swap_vma_readahead code to support large folios already exists in
OPPO's repository.
I'm confident that it can be cleaned up and submitted to LKML.
However, the issue lies with
the readahead policy. We typically prefer using the same 64KiB size as in
do_anonymous_page(), but clearly, this isn't the preference for Ying :-)

>
> >
> >>
> >> Next we (I?) have concerns about wasting IO by swapping-in folios that are too
> >> large (e.g. 2M). I'm not sure if this is a real problem or not - intuitively I'd
> >> say yes but I have no data. But on the other hand, memory is aged and
> >> swapped-out per-folio, so why shouldn't it be swapped-in per folio? If the
> >> original allocation size policy is good (it currently isn't) then a folio should
> >> be sized to cover temporally close memory and if we need to access some of it,
> >> chances are we need all of it.
> >>
> >> If we think the IO concern is legitimate then we could define a threshold size
> >> (sysfs?) for when we start swapping-in the folio in chunks. And how big should
> >> those chunks be - one page, or the threshold size itself? Probably the latter?
> >> And perhaps that threshold could also be used by zRAM to decide its upper limit
> >> for compression chunk.
> >
> >
> > Agreed. What about introducing a parameter like
> > /sys/kernel/mm/transparent_hugepage/max_swapin_order
> > giving users the opportunity to fine-tune it according to their needs. For type1
> > users specifically, setting it to any value above 4 would be
> > beneficial. If there's
> > still a lack of tuning for desktop and server environments (type 2 and type 3),
> > the default value could be set to 0.
>
> This sort of thing sounds sensible to me. But I have a history of proposing
> crappy sysfs interfaces :) So I'd like to hear from others - I suspect it will
> take a fair bit of discussion before we converge. Having data to show that this
> threshold is needed would also help (i.e. demonstration that the intuition that
> swapping in a 2M folio is often counter-productive to performance).
>

I understand. The ideal swap-in size is obviously a contentious topic :-)
However,  for my real use case, simplicity reigns: we consistently adhere
to a single size - 64KiB.

> >
> >>
> >> Perhaps we can learn from khugepaged here? I think it has programmable
> >> thresholds for how many swapped-out pages can be swapped-in to aid collapse to a
> >> THP? I guess that exists for the same concerns about increased IO pressure?
> >>
> >>
> >> If we think we will ever be swapping-in folios in chunks less than their
> >> original size, then we need a separate mechanism to re-foliate them. We have
> >> discussed a khugepaged-like approach for doing this asynchronously in the
> >> background. I know that scares the Android folks, but David has suggested that
> >> this could well be very cheap compared with khugepaged, because it would be
> >> entirely limited to a single pgtable, so we only need the PTL. If we need this
> >> mechanism anyway, perhaps we should develop it and see how it performs if
> >> swap-in remains order-0? Although I guess that would imply not being able to
> >> benefit from compressing THPs for the zRAM case.
> >
> > The effectiveness of collapse operation relies on the stability of
> > forming large folios
> > to ensure optimal performance. In embedded systems, where more than half of the
> > memory may be allocated to zRAM, folios might undergo swapping out before
> > collapsing or immediately after the collapse operation. It seems a
> > TAO-like optimization
> > to decrease fallback and latency is more effective.
>
> Sorry, I'm not sure I've understood what you are saying here.

I'm not entirely clear on the specifics of the khugepaged-like
approach. However,a major
distinction for Android is that its folios may not remain in memory
for extended periods.
If we incur the cost of compaction and page migration to form a large
folio, it might soon
be swapped out. Therefore, a potentially more efficient approach could
involve a TAO-like
pool, where we obtain large folios at a low cost.

>
> >
> >>
> >> I see all this as orthogonal to synchronous vs asynchronous swap devices. I
> >> think the latter just implies that you might want to do some readahead to try to
> >> cover up the latency? If swap is moving towards being folio-orientated, then
> >> readahead also surely needs to be folio-orientated, but I think that should be
> >> the only major difference.
> >>
> >> Anyway, just some thoughts!
> >
> > Thank you very much for your valuable and insightful deliberations.
> >
> >>
> >> Thanks,
> >> Ryan
> >>
> >

Thanks
Barry
Barry Song March 21, 2024, 10:20 a.m. UTC | #27
On Wed, Mar 20, 2024 at 1:19 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> On 19/03/2024 09:20, Huang, Ying wrote:
> > Ryan Roberts <ryan.roberts@arm.com> writes:
> >
> >>>>> I agree phones are not the only platform. But Rome wasn't built in a
> >>>>> day. I can only get
> >>>>> started on a hardware which I can easily reach and have enough hardware/test
> >>>>> resources on it. So we may take the first step which can be applied on
> >>>>> a real product
> >>>>> and improve its performance, and step by step, we broaden it and make it
> >>>>> widely useful to various areas  in which I can't reach :-)
> >>>>
> >>>> We must guarantee the normal swap path runs correctly and has no
> >>>> performance regression when developing SWP_SYNCHRONOUS_IO optimization.
> >>>> So we have to put some effort on the normal path test anyway.
> >>>>
> >>>>> so probably we can have a sysfs "enable" entry with default "n" or
> >>>>> have a maximum
> >>>>> swap-in order as Ryan's suggestion [1] at the beginning,
> >>>>>
> >>>>> "
> >>>>> So in the common case, swap-in will pull in the same size of folio as was
> >>>>> swapped-out. Is that definitely the right policy for all folio sizes? Certainly
> >>>>> it makes sense for "small" large folios (e.g. up to 64K IMHO). But I'm not sure
> >>>>> it makes sense for 2M THP; As the size increases the chances of actually needing
> >>>>> all of the folio reduces so chances are we are wasting IO. There are similar
> >>>>> arguments for CoW, where we currently copy 1 page per fault - it probably makes
> >>>>> sense to copy the whole folio up to a certain size.
> >>>>> "
> >>
> >> I thought about this a bit more. No clear conclusions, but hoped this might help
> >> the discussion around policy:
> >>
> >> The decision about the size of the THP is made at first fault, with some help
> >> from user space and in future we might make decisions to split based on
> >> munmap/mremap/etc hints. In an ideal world, the fact that we have had to swap
> >> the THP out at some point in its lifetime should not impact on its size. It's
> >> just being moved around in the system and the reason for our original decision
> >> should still hold.
> >>
> >> So from that PoV, it would be good to swap-in to the same size that was
> >> swapped-out.
> >
> > Sorry, I don't agree with this.  It's better to swap-in and swap-out in
> > smallest size if the page is only accessed seldom to avoid to waste
> > memory.
>
> If we want to optimize only for memory consumption, I'm sure there are many
> things we would do differently. We need to find a balance between memory and
> performance. The benefits of folios are well documented and the kernel is
> heading in the direction of managing memory in variable-sized blocks. So I don't
> think it's as simple as saying we should always swap-in the smallest possible
> amount of memory.

Absolutely agreed. With 64KiB large folios implemented, there may have been
a slight uptick in memory usage due to fragmentation. Nevertheless, through the
optimization of zRAM and zsmalloc to compress entire large folios, we found that
the compressed data could be up to 1GiB smaller compared to compressing them
in 4KiB increments on a typical phone with 12~16GiB memory. Consequently, we
not only reclaimed our memory loss entirely but also gained the benefits of
CONT-PTE , reduced TLB misses etc.

>
> You also said we should swap *out* in smallest size possible. Have I
> misunderstood you? I thought the case for swapping-out a whole folio without
> splitting was well established and non-controversial?
>
> >
> >> But we only kind-of keep that information around, via the swap
> >> entry contiguity and alignment. With that scheme it is possible that multiple
> >> virtually adjacent but not physically contiguous folios get swapped-out to
> >> adjacent swap slot ranges and then they would be swapped-in to a single, larger
> >> folio. This is not ideal, and I think it would be valuable to try to maintain
> >> the original folio size information with the swap slot. One way to do this would
> >> be to store the original order for which the cluster was allocated in the
> >> cluster. Then we at least know that a given swap slot is either for a folio of
> >> that order or an order-0 folio (due to cluster exhaustion/scanning). Can we
> >> steal a bit from swap_map to determine which case it is? Or are there better
> >> approaches?
> >
> > [snip]
> >
> > --
> > Best Regards,
> > Huang, Ying

Thanks
Barry
Ryan Roberts March 21, 2024, 11:13 a.m. UTC | #28
On 21/03/2024 09:22, Barry Song wrote:
> On Tue, Mar 19, 2024 at 10:05 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>
>> On 19/03/2024 06:27, Barry Song wrote:
>>> On Tue, Mar 19, 2024 at 5:45 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>
>>>>>>> I agree phones are not the only platform. But Rome wasn't built in a
>>>>>>> day. I can only get
>>>>>>> started on a hardware which I can easily reach and have enough hardware/test
>>>>>>> resources on it. So we may take the first step which can be applied on
>>>>>>> a real product
>>>>>>> and improve its performance, and step by step, we broaden it and make it
>>>>>>> widely useful to various areas  in which I can't reach :-)
>>>>>>
>>>>>> We must guarantee the normal swap path runs correctly and has no
>>>>>> performance regression when developing SWP_SYNCHRONOUS_IO optimization.
>>>>>> So we have to put some effort on the normal path test anyway.
>>>>>>
>>>>>>> so probably we can have a sysfs "enable" entry with default "n" or
>>>>>>> have a maximum
>>>>>>> swap-in order as Ryan's suggestion [1] at the beginning,
>>>>>>>
>>>>>>> "
>>>>>>> So in the common case, swap-in will pull in the same size of folio as was
>>>>>>> swapped-out. Is that definitely the right policy for all folio sizes? Certainly
>>>>>>> it makes sense for "small" large folios (e.g. up to 64K IMHO). But I'm not sure
>>>>>>> it makes sense for 2M THP; As the size increases the chances of actually needing
>>>>>>> all of the folio reduces so chances are we are wasting IO. There are similar
>>>>>>> arguments for CoW, where we currently copy 1 page per fault - it probably makes
>>>>>>> sense to copy the whole folio up to a certain size.
>>>>>>> "
>>>>
>>>> I thought about this a bit more. No clear conclusions, but hoped this might help
>>>> the discussion around policy:
>>>>
>>>> The decision about the size of the THP is made at first fault, with some help
>>>> from user space and in future we might make decisions to split based on
>>>> munmap/mremap/etc hints. In an ideal world, the fact that we have had to swap
>>>> the THP out at some point in its lifetime should not impact on its size. It's
>>>> just being moved around in the system and the reason for our original decision
>>>> should still hold.
>>>
>>> Indeed, this is an ideal framework for smartphones and likely for
>>> widely embedded
>>> Linux systems utilizing zRAM. We set the mTHP size to 64KiB to
>>> leverage CONT-PTE,
>>> given that more than half of the memory on phones may frequently swap out and
>>> swap in (for instance, when opening and switching between apps). The
>>> ideal approach
>>> would involve adhering to the decision made in do_anonymous_page().
>>>
>>>>
>>>> So from that PoV, it would be good to swap-in to the same size that was
>>>> swapped-out. But we only kind-of keep that information around, via the swap
>>>> entry contiguity and alignment. With that scheme it is possible that multiple
>>>> virtually adjacent but not physically contiguous folios get swapped-out to
>>>> adjacent swap slot ranges and then they would be swapped-in to a single, larger
>>>> folio. This is not ideal, and I think it would be valuable to try to maintain
>>>> the original folio size information with the swap slot. One way to do this would
>>>> be to store the original order for which the cluster was allocated in the
>>>> cluster. Then we at least know that a given swap slot is either for a folio of
>>>> that order or an order-0 folio (due to cluster exhaustion/scanning). Can we
>>>> steal a bit from swap_map to determine which case it is? Or are there better
>>>> approaches?
>>>
>>> In the case of non-SWP_SYNCHRONOUS_IO, users will invariably invoke
>>> swap_readahead()
>>> even when __swap_count(entry) equals 1.  This leads to two scenarios:
>>> swap_vma_readahead
>>> and swap_cluster_readahead.
>>>
>>> In swap_vma_readahead, when blk_queue_nonrot, physical contiguity
>>> doesn't appear to be a
>>> critical concern. However, for swap_cluster_readahead, the focus
>>> shifts towards the potential
>>> impact of physical discontiguity.
>>
>> When you talk about "physical [dis]contiguity" I think you are talking about
>> contiguity of the swap entries in the swap device? Both paths currently allocate
>> order-0 folios to swap into, so neither have a concept of physical contiguity in
>> memory at the moment.
>>
>> As I understand it, roughly the aim is to readahead by cluster for rotating
>> disks to reduce seek time, and readahead by virtual address for non-rotating
>> devices since there is no seek time cost. Correct?
> 
> From the  code comment, I agree with this.
> 
>  * It's a main entry function for swap readahead. By the configuration,
>  * it will read ahead blocks by cluster-based(ie, physical disk based)
>  * or vma-based(ie, virtual address based on faulty address) readahead.
> 
>>
>> Note that today, swap-out on supports (2M) THP if the swap device is
>> non-rotating. If it is rotating, the THP is first split. My swap-out series
>> maintains this policy for mTHP. So I think we only really care about
>> swap_vma_readahead() here; we want to teach it to figure out the order of the
>> swap entries and swap them into folios of the same order (with a fallback to
>> order-0 if allocation fails).
> 
> I agree we don't need to care about devices which rotate.
> 
>>
>>>
>>> struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
>>>                                 struct vm_fault *vmf)
>>> {
>>>         struct mempolicy *mpol;
>>>         pgoff_t ilx;
>>>         struct folio *folio;
>>>
>>>         mpol = get_vma_policy(vmf->vma, vmf->address, 0, &ilx);
>>>         folio = swap_use_vma_readahead() ?
>>>                 swap_vma_readahead(entry, gfp_mask, mpol, ilx, vmf) :
>>>                 swap_cluster_readahead(entry, gfp_mask, mpol, ilx);
>>>         mpol_cond_put(mpol);
>>>
>>>         if (!folio)
>>>                 return NULL;
>>>         return folio_file_page(folio, swp_offset(entry));
>>> }
>>>
>>> In Android and embedded systems, SWP_SYNCHRONOUS_IO is consistently utilized,
>>> rendering physical contiguity less of a concern. Moreover, instances where
>>> swap_readahead() is accessed are rare, typically occurring only in scenarios
>>> involving forked but non-CoWed memory.
>>
>> Yes understood. What I'm hearing is that for Android at least, stealing a bit
>> from swap_map to remember if a swap entry is the order marked in the cluster or
>> order-0 won't be noticed because almost all entries have swap count == 1. From
>> memory, I think swap_map is 8 bits, and 2 bits are currently stolen, leaving 6
>> bits (count = 64) before having to move to the swap map continuation stuff. Does
>> anyone know what workloads provoke this overflow? What are the consequences of
>> reducing that count to 32?
> 
> I'm not entirely clear on why you need bits to record this
> information. Could you
> provide more details?

For norot media, the swap device is carved up into "clusters", each the same
size as the pmd-size. Empty clusters are kept on a free list. Each CPU maintains
a "current cluster" that it fills sequentially with order-0 entries. To swap out
a THP, a whole cluster is allocated from the free list and used. (My swap-out
series expands this so that each CPU maintains a current cluster per-order to
fill sequentially with swap entries of that order). Once a cluster has been
filled, the CPU allocates a new current cluster. If no clusters are available,
only order-0 swap out will succeed, and in that case it will scan through the
entire swap device looking for a free slot. Once swapped-out, there is no
maintained information to tell you, for a given swap slot, what the originating
folio size was. We always swap-in in order-0 units, so it doesn't matter.

A side effect of this is that a page from what was a PMD-sized THP could be
swapped back in, meaning there is a free order-0 swap slot in that cluster. That
could then be allocated by the scanner, so the cluster now holds part of what
was a THP and an order-0 page. With my swap-out series, that mixing of orders
within a cluster becomes even more likely; (e.g.) we could have a current
cluster (2M) for order-4 (64K) and have swapped-out 2 folios, using up the first
128K). Then we try to swap out an order-0 and have to fall back to the scanner
because swap is fragmented. The scanner then puts the order-0 in the swap entry
after the second order-4 folio; the cluster now has a mix of order-4 and order-0
folios.

If we want to store the original folio size in the swap device, I was suggesting
that one approach would be to store the "primary" order of the cluster in the
cluster info struct. But that is not sufficient on its own, because there can
also be order-0 entries in the same cluster (due to either of the 2 mechanisms I
described above - either a page from a large order entry was freed, meaning all
the other pages in the large swap entry become order-0, or an order-0 was added
to the cluster by the scanner). So to discriminate these 2 possibilities on a
per-swap entry basis (either the swap entry is large and of the order indicated
by the cluster, or it is small), we need a per-swap entry bit.

swap_map is the the per-swap entry state that is currently maintained, and it is
(in the common case) 8 bits. 1 bit is used to remember if the swap entry has a
folio in the swap cache. And another bit is used for something relating to shmem
IIRC). The rest of the bits are used to count references to the swap entry (its
a bit more complicated but you get the idea). I was wondering if we could steal
a bit from the reference count to use for discrimanation between "swap entry is
size indicated by cluster's primary order" and "swap entry is order-0".


> 
>>
>>>
>>> So I think large folios swap-in will at least need three steps
>>>
>>> 1. on SWP_SYNCHRONOUS_IO (Android and embedded Linux), this has a very
>>> clear model and has no complex I/O issue.
>>> 2. on nonrot block device(bdev_nonrot ==  true), it cares less about
>>> I/O contiguity.
>>> 3. on rot block devices which care about  I/O contiguity.
>>
>> I don't think we care about (3); if the device rotates, we will have split the
>> folio at swap-out, so we are only concerned with swapping-in order-0 folios.
>>
>>>
>>> This patchset primarily addresses the systems utilizing
>>> SWP_SYNCHRONOUS_IO(type1),
>>> such as Android and embedded Linux, a straightforward model is established,
>>> with minimal complexity regarding I/O issues.
>>
>> Understood. But your implication is that making swap_vma_readahead() large folio
>> swap-in aware will be complex. I think we can remember the original order in the
>> swap device, then it shouldn't be too difficult - conceptually at least.
> 
> Currently, I can scan PTE entries and determine the number of
> contiguous swap offsets.

That's an approximation; if the folio was mapped into multiple processes and
partially mapped in the one you are swapping in for. But perhaps that is
sufficiently uncommon that it doesn't matter? It certainly removes the need for
storing the precise information in the swap device as I described above. To be
honest, I hadn't considered the PTEs; I was thinking only about the swap slot
contiguity.

> The swap_vma_readahead code to support large folios already exists in
> OPPO's repository.
> I'm confident that it can be cleaned up and submitted to LKML.
> However, the issue lies with
> the readahead policy. We typically prefer using the same 64KiB size as in
> do_anonymous_page(), but clearly, this isn't the preference for Ying :-)

I haven't caught up on all the latest discssion (although I see a bunch of
unread emails in my inbox :) ). But my view at the moment is roughly;

 - continue to calculate readahead as before
 - always swap-in folios in their original size
    - always swap-in the full faulting folio
    - calculate which other folios to readahead by rounding the calculated
      readahead endpoint to the nearest folio boundary.
 - Or if the "round to nearest" policy is shown to be problematic, introduce a
   "max swap-in folio size" tunable and round to that nearest boundary instead.

If we need the tunable, a default of 4K is same as current behaviour. 64K would
likely be a good setting for a system using contpte mappings. And 2M would
provide behaviour descibed above without the tubable.

> 
>>
>>>
>>>>
>>>> Next we (I?) have concerns about wasting IO by swapping-in folios that are too
>>>> large (e.g. 2M). I'm not sure if this is a real problem or not - intuitively I'd
>>>> say yes but I have no data. But on the other hand, memory is aged and
>>>> swapped-out per-folio, so why shouldn't it be swapped-in per folio? If the
>>>> original allocation size policy is good (it currently isn't) then a folio should
>>>> be sized to cover temporally close memory and if we need to access some of it,
>>>> chances are we need all of it.
>>>>
>>>> If we think the IO concern is legitimate then we could define a threshold size
>>>> (sysfs?) for when we start swapping-in the folio in chunks. And how big should
>>>> those chunks be - one page, or the threshold size itself? Probably the latter?
>>>> And perhaps that threshold could also be used by zRAM to decide its upper limit
>>>> for compression chunk.
>>>
>>>
>>> Agreed. What about introducing a parameter like
>>> /sys/kernel/mm/transparent_hugepage/max_swapin_order
>>> giving users the opportunity to fine-tune it according to their needs. For type1
>>> users specifically, setting it to any value above 4 would be
>>> beneficial. If there's
>>> still a lack of tuning for desktop and server environments (type 2 and type 3),
>>> the default value could be set to 0.
>>
>> This sort of thing sounds sensible to me. But I have a history of proposing
>> crappy sysfs interfaces :) So I'd like to hear from others - I suspect it will
>> take a fair bit of discussion before we converge. Having data to show that this
>> threshold is needed would also help (i.e. demonstration that the intuition that
>> swapping in a 2M folio is often counter-productive to performance).
>>
> 
> I understand. The ideal swap-in size is obviously a contentious topic :-)
> However,  for my real use case, simplicity reigns: we consistently adhere
> to a single size - 64KiB.
> 
>>>
>>>>
>>>> Perhaps we can learn from khugepaged here? I think it has programmable
>>>> thresholds for how many swapped-out pages can be swapped-in to aid collapse to a
>>>> THP? I guess that exists for the same concerns about increased IO pressure?
>>>>
>>>>
>>>> If we think we will ever be swapping-in folios in chunks less than their
>>>> original size, then we need a separate mechanism to re-foliate them. We have
>>>> discussed a khugepaged-like approach for doing this asynchronously in the
>>>> background. I know that scares the Android folks, but David has suggested that
>>>> this could well be very cheap compared with khugepaged, because it would be
>>>> entirely limited to a single pgtable, so we only need the PTL. If we need this
>>>> mechanism anyway, perhaps we should develop it and see how it performs if
>>>> swap-in remains order-0? Although I guess that would imply not being able to
>>>> benefit from compressing THPs for the zRAM case.
>>>
>>> The effectiveness of collapse operation relies on the stability of
>>> forming large folios
>>> to ensure optimal performance. In embedded systems, where more than half of the
>>> memory may be allocated to zRAM, folios might undergo swapping out before
>>> collapsing or immediately after the collapse operation. It seems a
>>> TAO-like optimization
>>> to decrease fallback and latency is more effective.
>>
>> Sorry, I'm not sure I've understood what you are saying here.
> 
> I'm not entirely clear on the specifics of the khugepaged-like
> approach. However,a major
> distinction for Android is that its folios may not remain in memory
> for extended periods.
> If we incur the cost of compaction and page migration to form a large
> folio, it might soon
> be swapped out. Therefore, a potentially more efficient approach could
> involve a TAO-like
> pool, where we obtain large folios at a low cost.
> 
>>
>>>
>>>>
>>>> I see all this as orthogonal to synchronous vs asynchronous swap devices. I
>>>> think the latter just implies that you might want to do some readahead to try to
>>>> cover up the latency? If swap is moving towards being folio-orientated, then
>>>> readahead also surely needs to be folio-orientated, but I think that should be
>>>> the only major difference.
>>>>
>>>> Anyway, just some thoughts!
>>>
>>> Thank you very much for your valuable and insightful deliberations.
>>>
>>>>
>>>> Thanks,
>>>> Ryan
>>>>
>>>
> 
> Thanks
> Barry
Shakeel Butt June 10, 2024, 8:43 p.m. UTC | #29
On Thu, Mar 14, 2024 at 08:56:17PM GMT, Chuanhua Han wrote:
[...]
> >
> > So in the common case, swap-in will pull in the same size of folio as was
> > swapped-out. Is that definitely the right policy for all folio sizes? Certainly
> > it makes sense for "small" large folios (e.g. up to 64K IMHO). But I'm not sure
> > it makes sense for 2M THP; As the size increases the chances of actually needing
> > all of the folio reduces so chances are we are wasting IO. There are similar
> > arguments for CoW, where we currently copy 1 page per fault - it probably makes
> > sense to copy the whole folio up to a certain size.
> For 2M THP, IO overhead may not necessarily be large? :)
> 1.If 2M THP are continuously stored in the swap device, the IO
> overhead may not be very large (such as submitting bio with one
> bio_vec at a time).
> 2.If the process really needs this 2M data, one page-fault may perform
> much better than multiple.
> 3.For swap devices like zram,using 2M THP might also improve
> decompression efficiency.
>

Sorry for late response, do we have any performance data backing the
above claims particularly for zswap/swap-on-zram cases?
Barry Song June 11, 2024, 12:23 a.m. UTC | #30
On Tue, Jun 11, 2024 at 8:43 AM Shakeel Butt <shakeel.butt@linux.dev> wrote:
>
> On Thu, Mar 14, 2024 at 08:56:17PM GMT, Chuanhua Han wrote:
> [...]
> > >
> > > So in the common case, swap-in will pull in the same size of folio as was
> > > swapped-out. Is that definitely the right policy for all folio sizes? Certainly
> > > it makes sense for "small" large folios (e.g. up to 64K IMHO). But I'm not sure
> > > it makes sense for 2M THP; As the size increases the chances of actually needing
> > > all of the folio reduces so chances are we are wasting IO. There are similar
> > > arguments for CoW, where we currently copy 1 page per fault - it probably makes
> > > sense to copy the whole folio up to a certain size.
> > For 2M THP, IO overhead may not necessarily be large? :)
> > 1.If 2M THP are continuously stored in the swap device, the IO
> > overhead may not be very large (such as submitting bio with one
> > bio_vec at a time).
> > 2.If the process really needs this 2M data, one page-fault may perform
> > much better than multiple.
> > 3.For swap devices like zram,using 2M THP might also improve
> > decompression efficiency.
> >
>
> Sorry for late response, do we have any performance data backing the
> above claims particularly for zswap/swap-on-zram cases?

no need to say sorry. You are always welcome to give comments.

this, combining with zram modification, not only improves compression
ratio but also reduces CPU time significantly. you may find some data
here[1].

granularity   orig_data_size   compr_data_size   time(us)
4KiB-zstd      1048576000       246876055        50259962
64KiB-zstd     1048576000       199763892        18330605

On mobile devices, We tested the performance of swapin by running
100 iterations of swapping in 100MB of data ,and the results were
as follows.the swapin speed increased by about 45%.

                time consumption of swapin(ms)
lz4 4k                  45274
lz4 64k                 22942

zstdn 4k                85035
zstdn 64k               46558

[1] https://lore.kernel.org/linux-mm/20240327214816.31191-1-21cnbao@gmail.com/

Thanks
Barry
Shakeel Butt June 11, 2024, 5:24 p.m. UTC | #31
On Tue, Jun 11, 2024 at 12:23:41PM GMT, Barry Song wrote:
> On Tue, Jun 11, 2024 at 8:43 AM Shakeel Butt <shakeel.butt@linux.dev> wrote:
> >
> > On Thu, Mar 14, 2024 at 08:56:17PM GMT, Chuanhua Han wrote:
> > [...]
> > > >
> > > > So in the common case, swap-in will pull in the same size of folio as was
> > > > swapped-out. Is that definitely the right policy for all folio sizes? Certainly
> > > > it makes sense for "small" large folios (e.g. up to 64K IMHO). But I'm not sure
> > > > it makes sense for 2M THP; As the size increases the chances of actually needing
> > > > all of the folio reduces so chances are we are wasting IO. There are similar
> > > > arguments for CoW, where we currently copy 1 page per fault - it probably makes
> > > > sense to copy the whole folio up to a certain size.
> > > For 2M THP, IO overhead may not necessarily be large? :)
> > > 1.If 2M THP are continuously stored in the swap device, the IO
> > > overhead may not be very large (such as submitting bio with one
> > > bio_vec at a time).
> > > 2.If the process really needs this 2M data, one page-fault may perform
> > > much better than multiple.
> > > 3.For swap devices like zram,using 2M THP might also improve
> > > decompression efficiency.
> > >
> >
> > Sorry for late response, do we have any performance data backing the
> > above claims particularly for zswap/swap-on-zram cases?
> 
> no need to say sorry. You are always welcome to give comments.
> 
> this, combining with zram modification, not only improves compression
> ratio but also reduces CPU time significantly. you may find some data
> here[1].
> 
> granularity   orig_data_size   compr_data_size   time(us)
> 4KiB-zstd      1048576000       246876055        50259962
> 64KiB-zstd     1048576000       199763892        18330605
> 
> On mobile devices, We tested the performance of swapin by running
> 100 iterations of swapping in 100MB of data ,and the results were
> as follows.the swapin speed increased by about 45%.
> 
>                 time consumption of swapin(ms)
> lz4 4k                  45274
> lz4 64k                 22942
> 
> zstdn 4k                85035
> zstdn 64k               46558

Thanks for the response. Above numbers are actually very fascinating and
counter intuitive (at least to me). Do you also have numbers for 2MiB
THP? I am assuming 64k is the right balance between too small or too
large. Did you experiment on server machines as well?

> 
> [1] https://lore.kernel.org/linux-mm/20240327214816.31191-1-21cnbao@gmail.com/
> 
> Thanks
> Barry
Barry Song June 11, 2024, 10:13 p.m. UTC | #32
On Wed, Jun 12, 2024 at 5:24 AM Shakeel Butt <shakeel.butt@linux.dev> wrote:
>
> On Tue, Jun 11, 2024 at 12:23:41PM GMT, Barry Song wrote:
> > On Tue, Jun 11, 2024 at 8:43 AM Shakeel Butt <shakeel.butt@linux.dev> wrote:
> > >
> > > On Thu, Mar 14, 2024 at 08:56:17PM GMT, Chuanhua Han wrote:
> > > [...]
> > > > >
> > > > > So in the common case, swap-in will pull in the same size of folio as was
> > > > > swapped-out. Is that definitely the right policy for all folio sizes? Certainly
> > > > > it makes sense for "small" large folios (e.g. up to 64K IMHO). But I'm not sure
> > > > > it makes sense for 2M THP; As the size increases the chances of actually needing
> > > > > all of the folio reduces so chances are we are wasting IO. There are similar
> > > > > arguments for CoW, where we currently copy 1 page per fault - it probably makes
> > > > > sense to copy the whole folio up to a certain size.
> > > > For 2M THP, IO overhead may not necessarily be large? :)
> > > > 1.If 2M THP are continuously stored in the swap device, the IO
> > > > overhead may not be very large (such as submitting bio with one
> > > > bio_vec at a time).
> > > > 2.If the process really needs this 2M data, one page-fault may perform
> > > > much better than multiple.
> > > > 3.For swap devices like zram,using 2M THP might also improve
> > > > decompression efficiency.
> > > >
> > >
> > > Sorry for late response, do we have any performance data backing the
> > > above claims particularly for zswap/swap-on-zram cases?
> >
> > no need to say sorry. You are always welcome to give comments.
> >
> > this, combining with zram modification, not only improves compression
> > ratio but also reduces CPU time significantly. you may find some data
> > here[1].
> >
> > granularity   orig_data_size   compr_data_size   time(us)
> > 4KiB-zstd      1048576000       246876055        50259962
> > 64KiB-zstd     1048576000       199763892        18330605
> >
> > On mobile devices, We tested the performance of swapin by running
> > 100 iterations of swapping in 100MB of data ,and the results were
> > as follows.the swapin speed increased by about 45%.
> >
> >                 time consumption of swapin(ms)
> > lz4 4k                  45274
> > lz4 64k                 22942
> >
> > zstdn 4k                85035
> > zstdn 64k               46558
>
> Thanks for the response. Above numbers are actually very fascinating and
> counter intuitive (at least to me). Do you also have numbers for 2MiB
> THP? I am assuming 64k is the right balance between too small or too
> large. Did you experiment on server machines as well?

I don’t possess data on 2MiB, and regrettably, I lack a server machine
for testing. However, I believe that this type of higher compression ratio
and lower CPU consumption generally holds true for generic anonymous
memory.

64KB is a right balance. But nothing can stop THP from using 64KB to
swapin, compression and decompression. as you can see from the
zram/zsmalloc series,  we actually have a configuration
CONFIG_ZSMALLOC_MULTI_PAGES_ORDER

The default value is 4.

That means a 2MB THP can be compressed/decompressed as 32 * 64KB.
If we use 64KB as the swapin granularity, we still have the balance and
all the benefits if 2MB is a too large swap-in granularity which might cause
memory waste.

>
> >
> > [1] https://lore.kernel.org/linux-mm/20240327214816.31191-1-21cnbao@gmail.com/
> >

Thanks
Barry
diff mbox series

Patch

diff --git a/mm/memory.c b/mm/memory.c
index e0d34d705e07..501ede745ef3 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3907,6 +3907,136 @@  static vm_fault_t handle_pte_marker(struct vm_fault *vmf)
 	return VM_FAULT_SIGBUS;
 }
 
+/*
+ * check a range of PTEs are completely swap entries with
+ * contiguous swap offsets and the same SWAP_HAS_CACHE.
+ * pte must be first one in the range
+ */
+static bool is_pte_range_contig_swap(pte_t *pte, int nr_pages)
+{
+	int i;
+	struct swap_info_struct *si;
+	swp_entry_t entry;
+	unsigned type;
+	pgoff_t start_offset;
+	char has_cache;
+
+	entry = pte_to_swp_entry(ptep_get_lockless(pte));
+	if (non_swap_entry(entry))
+		return false;
+	start_offset = swp_offset(entry);
+	if (start_offset % nr_pages)
+		return false;
+
+	si = swp_swap_info(entry);
+	type = swp_type(entry);
+	has_cache = si->swap_map[start_offset] & SWAP_HAS_CACHE;
+	for (i = 1; i < nr_pages; i++) {
+		entry = pte_to_swp_entry(ptep_get_lockless(pte + i));
+		if (non_swap_entry(entry))
+			return false;
+		if (swp_offset(entry) != start_offset + i)
+			return false;
+		if (swp_type(entry) != type)
+			return false;
+		/*
+		 * while allocating a large folio and doing swap_read_folio for the
+		 * SWP_SYNCHRONOUS_IO path, which is the case the being faulted pte
+		 * doesn't have swapcache. We need to ensure all PTEs have no cache
+		 * as well, otherwise, we might go to swap devices while the content
+		 * is in swapcache
+		 */
+		if ((si->swap_map[start_offset + i] & SWAP_HAS_CACHE) != has_cache)
+			return false;
+	}
+
+	return true;
+}
+
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+/*
+ * Get a list of all the (large) orders below PMD_ORDER that are enabled
+ * for this vma. Then filter out the orders that can't be allocated over
+ * the faulting address and still be fully contained in the vma.
+ */
+static inline unsigned long get_alloc_folio_orders(struct vm_fault *vmf)
+{
+	struct vm_area_struct *vma = vmf->vma;
+	unsigned long orders;
+
+	orders = thp_vma_allowable_orders(vma, vma->vm_flags, false, true, true,
+					  BIT(PMD_ORDER) - 1);
+	orders = thp_vma_suitable_orders(vma, vmf->address, orders);
+	return orders;
+}
+#endif
+
+static struct folio *alloc_swap_folio(struct vm_fault *vmf)
+{
+	struct vm_area_struct *vma = vmf->vma;
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+	unsigned long orders;
+	struct folio *folio;
+	unsigned long addr;
+	pte_t *pte;
+	gfp_t gfp;
+	int order;
+
+	/*
+	 * If uffd is active for the vma we need per-page fault fidelity to
+	 * maintain the uffd semantics.
+	 */
+	if (unlikely(userfaultfd_armed(vma)))
+		goto fallback;
+
+	/*
+	 * a large folio being swapped-in could be partially in
+	 * zswap and partially in swap devices, zswap doesn't
+	 * support large folios yet, we might get corrupted
+	 * zero-filled data by reading all subpages from swap
+	 * devices while some of them are actually in zswap
+	 */
+	if (is_zswap_enabled())
+		goto fallback;
+
+	orders = get_alloc_folio_orders(vmf);
+	if (!orders)
+		goto fallback;
+
+	pte = pte_offset_map(vmf->pmd, vmf->address & PMD_MASK);
+	if (unlikely(!pte))
+		goto fallback;
+
+	/*
+	 * For do_swap_page, find the highest order where the aligned range is
+	 * completely swap entries with contiguous swap offsets.
+	 */
+	order = highest_order(orders);
+	while (orders) {
+		addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
+		if (is_pte_range_contig_swap(pte + pte_index(addr), 1 << order))
+			break;
+		order = next_order(&orders, order);
+	}
+
+	pte_unmap(pte);
+
+	/* Try allocating the highest of the remaining orders. */
+	gfp = vma_thp_gfp_mask(vma);
+	while (orders) {
+		addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
+		folio = vma_alloc_folio(gfp, order, vma, addr, true);
+		if (folio)
+			return folio;
+		order = next_order(&orders, order);
+	}
+
+fallback:
+#endif
+	return vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0, vma, vmf->address, false);
+}
+
+
 /*
  * We enter with non-exclusive mmap_lock (to exclude vma changes,
  * but allow concurrent faults), and pte mapped but not yet locked.
@@ -3928,6 +4058,9 @@  vm_fault_t do_swap_page(struct vm_fault *vmf)
 	pte_t pte;
 	vm_fault_t ret = 0;
 	void *shadow = NULL;
+	int nr_pages = 1;
+	unsigned long start_address;
+	pte_t *start_pte;
 
 	if (!pte_unmap_same(vmf))
 		goto out;
@@ -3991,35 +4124,41 @@  vm_fault_t do_swap_page(struct vm_fault *vmf)
 	if (!folio) {
 		if (data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
 		    __swap_count(entry) == 1) {
-			/*
-			 * Prevent parallel swapin from proceeding with
-			 * the cache flag. Otherwise, another thread may
-			 * finish swapin first, free the entry, and swapout
-			 * reusing the same entry. It's undetectable as
-			 * pte_same() returns true due to entry reuse.
-			 */
-			if (swapcache_prepare(entry)) {
-				/* Relax a bit to prevent rapid repeated page faults */
-				schedule_timeout_uninterruptible(1);
-				goto out;
-			}
-			need_clear_cache = true;
-
 			/* skip swapcache */
-			folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0,
-						vma, vmf->address, false);
+			folio = alloc_swap_folio(vmf);
 			page = &folio->page;
 			if (folio) {
 				__folio_set_locked(folio);
 				__folio_set_swapbacked(folio);
 
+				if (folio_test_large(folio)) {
+					nr_pages = folio_nr_pages(folio);
+					entry.val = ALIGN_DOWN(entry.val, nr_pages);
+				}
+
+				/*
+				 * Prevent parallel swapin from proceeding with
+				 * the cache flag. Otherwise, another thread may
+				 * finish swapin first, free the entry, and swapout
+				 * reusing the same entry. It's undetectable as
+				 * pte_same() returns true due to entry reuse.
+				 */
+				if (swapcache_prepare_nr(entry, nr_pages)) {
+					/* Relax a bit to prevent rapid repeated page faults */
+					schedule_timeout_uninterruptible(1);
+					goto out;
+				}
+				need_clear_cache = true;
+
 				if (mem_cgroup_swapin_charge_folio(folio,
 							vma->vm_mm, GFP_KERNEL,
 							entry)) {
 					ret = VM_FAULT_OOM;
 					goto out_page;
 				}
-				mem_cgroup_swapin_uncharge_swap(entry);
+
+				for (swp_entry_t e = entry; e.val < entry.val + nr_pages; e.val++)
+					mem_cgroup_swapin_uncharge_swap(e);
 
 				shadow = get_shadow_from_swap_cache(entry);
 				if (shadow)
@@ -4118,6 +4257,42 @@  vm_fault_t do_swap_page(struct vm_fault *vmf)
 	 */
 	vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
 			&vmf->ptl);
+
+	start_address = vmf->address;
+	start_pte = vmf->pte;
+	if (start_pte && folio_test_large(folio)) {
+		unsigned long nr = folio_nr_pages(folio);
+		unsigned long addr = ALIGN_DOWN(vmf->address, nr * PAGE_SIZE);
+		pte_t *aligned_pte = vmf->pte - (vmf->address - addr) / PAGE_SIZE;
+
+		/*
+		 * case 1: we are allocating large_folio, try to map it as a whole
+		 * iff the swap entries are still entirely mapped;
+		 * case 2: we hit a large folio in swapcache, and all swap entries
+		 * are still entirely mapped, try to map a large folio as a whole.
+		 * otherwise, map only the faulting page within the large folio
+		 * which is swapcache
+		 */
+		if (!is_pte_range_contig_swap(aligned_pte, nr)) {
+			if (nr_pages > 1) /* ptes have changed for case 1 */
+				goto out_nomap;
+			goto check_pte;
+		}
+
+		start_address = addr;
+		start_pte = aligned_pte;
+		/*
+		 * the below has been done before swap_read_folio()
+		 * for case 1
+		 */
+		if (unlikely(folio == swapcache)) {
+			nr_pages = nr;
+			entry.val = ALIGN_DOWN(entry.val, nr_pages);
+			page = &folio->page;
+		}
+	}
+
+check_pte:
 	if (unlikely(!vmf->pte || !pte_same(ptep_get(vmf->pte), vmf->orig_pte)))
 		goto out_nomap;
 
@@ -4185,12 +4360,14 @@  vm_fault_t do_swap_page(struct vm_fault *vmf)
 	 * We're already holding a reference on the page but haven't mapped it
 	 * yet.
 	 */
-	swap_free(entry);
+	swap_nr_free(entry, nr_pages);
 	if (should_try_to_free_swap(folio, vma, vmf->flags))
 		folio_free_swap(folio);
 
-	inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
-	dec_mm_counter(vma->vm_mm, MM_SWAPENTS);
+	folio_ref_add(folio, nr_pages - 1);
+	add_mm_counter(vma->vm_mm, MM_ANONPAGES, nr_pages);
+	add_mm_counter(vma->vm_mm, MM_SWAPENTS, -nr_pages);
+
 	pte = mk_pte(page, vma->vm_page_prot);
 
 	/*
@@ -4200,14 +4377,14 @@  vm_fault_t do_swap_page(struct vm_fault *vmf)
 	 * exclusivity.
 	 */
 	if (!folio_test_ksm(folio) &&
-	    (exclusive || folio_ref_count(folio) == 1)) {
+	    (exclusive || folio_ref_count(folio) == nr_pages)) {
 		if (vmf->flags & FAULT_FLAG_WRITE) {
 			pte = maybe_mkwrite(pte_mkdirty(pte), vma);
 			vmf->flags &= ~FAULT_FLAG_WRITE;
 		}
 		rmap_flags |= RMAP_EXCLUSIVE;
 	}
-	flush_icache_page(vma, page);
+	flush_icache_pages(vma, page, nr_pages);
 	if (pte_swp_soft_dirty(vmf->orig_pte))
 		pte = pte_mksoft_dirty(pte);
 	if (pte_swp_uffd_wp(vmf->orig_pte))
@@ -4216,17 +4393,19 @@  vm_fault_t do_swap_page(struct vm_fault *vmf)
 
 	/* ksm created a completely new copy */
 	if (unlikely(folio != swapcache && swapcache)) {
-		folio_add_new_anon_rmap(folio, vma, vmf->address);
+		folio_add_new_anon_rmap(folio, vma, start_address);
 		folio_add_lru_vma(folio, vma);
+	} else if (!folio_test_anon(folio)) {
+		folio_add_new_anon_rmap(folio, vma, start_address);
 	} else {
-		folio_add_anon_rmap_pte(folio, page, vma, vmf->address,
+		folio_add_anon_rmap_ptes(folio, page, nr_pages, vma, start_address,
 					rmap_flags);
 	}
 
 	VM_BUG_ON(!folio_test_anon(folio) ||
 			(pte_write(pte) && !PageAnonExclusive(page)));
-	set_pte_at(vma->vm_mm, vmf->address, vmf->pte, pte);
-	arch_do_swap_page(vma->vm_mm, vma, vmf->address, pte, vmf->orig_pte);
+	set_ptes(vma->vm_mm, start_address, start_pte, pte, nr_pages);
+	arch_do_swap_page(vma->vm_mm, vma, start_address, pte, vmf->orig_pte);
 
 	folio_unlock(folio);
 	if (folio != swapcache && swapcache) {
@@ -4243,6 +4422,9 @@  vm_fault_t do_swap_page(struct vm_fault *vmf)
 	}
 
 	if (vmf->flags & FAULT_FLAG_WRITE) {
+		if (nr_pages > 1)
+			vmf->orig_pte = ptep_get(vmf->pte);
+
 		ret |= do_wp_page(vmf);
 		if (ret & VM_FAULT_ERROR)
 			ret &= VM_FAULT_ERROR;
@@ -4250,14 +4432,14 @@  vm_fault_t do_swap_page(struct vm_fault *vmf)
 	}
 
 	/* No need to invalidate - it was non-present before */
-	update_mmu_cache_range(vmf, vma, vmf->address, vmf->pte, 1);
+	update_mmu_cache_range(vmf, vma, start_address, start_pte, nr_pages);
 unlock:
 	if (vmf->pte)
 		pte_unmap_unlock(vmf->pte, vmf->ptl);
 out:
 	/* Clear the swap cache pin for direct swapin after PTL unlock */
 	if (need_clear_cache)
-		swapcache_clear(si, entry);
+		swapcache_clear_nr(si, entry, nr_pages);
 	if (si)
 		put_swap_device(si);
 	return ret;
@@ -4273,7 +4455,7 @@  vm_fault_t do_swap_page(struct vm_fault *vmf)
 		folio_put(swapcache);
 	}
 	if (need_clear_cache)
-		swapcache_clear(si, entry);
+		swapcache_clear_nr(si, entry, nr_pages);
 	if (si)
 		put_swap_device(si);
 	return ret;
@@ -4309,15 +4491,7 @@  static struct folio *alloc_anon_folio(struct vm_fault *vmf)
 	if (unlikely(userfaultfd_armed(vma)))
 		goto fallback;
 
-	/*
-	 * Get a list of all the (large) orders below PMD_ORDER that are enabled
-	 * for this vma. Then filter out the orders that can't be allocated over
-	 * the faulting address and still be fully contained in the vma.
-	 */
-	orders = thp_vma_allowable_orders(vma, vma->vm_flags, false, true, true,
-					  BIT(PMD_ORDER) - 1);
-	orders = thp_vma_suitable_orders(vma, vmf->address, orders);
-
+	orders = get_alloc_folio_orders(vmf);
 	if (!orders)
 		goto fallback;