
[v2,01/14] mm: Batch-copy PTE ranges during fork()

Message ID 20231115163018.1303287-2-ryan.roberts@arm.com (mailing list archive)
State New, archived
Series Transparent Contiguous PTEs for User Mappings

Commit Message

Ryan Roberts Nov. 15, 2023, 4:30 p.m. UTC
Convert copy_pte_range() to copy a set of ptes in a batch. A given batch
maps a physically contiguous block of memory, all belonging to the same
folio, with the same permissions, and for shared mappings, the same
dirty state. This will likely improve performance by a tiny amount due
to batching the folio reference count management and calling set_ptes()
rather than making individual calls to set_pte_at().

However, the primary motivation for this change is to reduce the number
of tlb maintenance operations that the arm64 backend has to perform
during fork, as it is about to add transparent support for the
"contiguous bit" in its ptes. By write-protecting the parent using the
new ptep_set_wrprotects() (note the 's' at the end) function, the
backend can avoid having to unfold contig ranges of PTEs, which is
expensive, when all ptes in the range are being write-protected.
Similarly, by using set_ptes() rather than set_pte_at() to set up ptes
in the child, the backend does not need to fold a contiguous range once
they are all populated - they can be initially populated as a contiguous
range in the first place.

This change addresses the core-mm refactoring only, and introduces
ptep_set_wrprotects() with a default implementation that calls
ptep_set_wrprotect() for each pte in the range. A separate change will
implement ptep_set_wrprotects() in the arm64 backend to realize the
performance improvement as part of the work to enable contpte mappings.

Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
 include/linux/pgtable.h |  13 +++
 mm/memory.c             | 175 +++++++++++++++++++++++++++++++---------
 2 files changed, 150 insertions(+), 38 deletions(-)
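
For reference, the generic fallback this patch adds to include/linux/pgtable.h
is just a per-PTE loop; the full hunk appears in the diff quoted further down
the thread:

#ifndef ptep_set_wrprotects
struct mm_struct;
static inline void ptep_set_wrprotects(struct mm_struct *mm,
				unsigned long address, pte_t *ptep,
				unsigned int nr)
{
	unsigned int i;

	/* Write-protect each pte in the range individually; a later patch in
	 * the series overrides this with a batched arm64 implementation. */
	for (i = 0; i < nr; i++, address += PAGE_SIZE, ptep++)
		ptep_set_wrprotect(mm, address, ptep);
}
#endif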

Comments

kernel test robot Nov. 15, 2023, 9:26 p.m. UTC | #1
Hi Ryan,

kernel test robot noticed the following build errors:

[auto build test ERROR on akpm-mm/mm-everything]
[also build test ERROR on linus/master v6.7-rc1 next-20231115]
[cannot apply to arm64/for-next/core efi/next]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Ryan-Roberts/mm-Batch-copy-PTE-ranges-during-fork/20231116-010123
base:   https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link:    https://lore.kernel.org/r/20231115163018.1303287-2-ryan.roberts%40arm.com
patch subject: [PATCH v2 01/14] mm: Batch-copy PTE ranges during fork()
config: arm-randconfig-002-20231116 (https://download.01.org/0day-ci/archive/20231116/202311160516.kHhfmjvl-lkp@intel.com/config)
compiler: arm-linux-gnueabi-gcc (GCC) 13.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20231116/202311160516.kHhfmjvl-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202311160516.kHhfmjvl-lkp@intel.com/

All errors (new ones prefixed by >>):

   mm/memory.c: In function 'folio_nr_pages_cont_mapped':
>> mm/memory.c:969:16: error: implicit declaration of function 'pte_pgprot'; did you mean 'ptep_get'? [-Werror=implicit-function-declaration]
     969 |         prot = pte_pgprot(pte_mkold(pte_mkclean(ptent)));
         |                ^~~~~~~~~~
         |                ptep_get
   cc1: some warnings being treated as errors


vim +969 mm/memory.c

   950	
   951	static int folio_nr_pages_cont_mapped(struct folio *folio,
   952					      struct page *page, pte_t *pte,
   953					      unsigned long addr, unsigned long end,
   954					      pte_t ptent, bool *any_dirty)
   955	{
   956		int floops;
   957		int i;
   958		unsigned long pfn;
   959		pgprot_t prot;
   960		struct page *folio_end;
   961	
   962		if (!folio_test_large(folio))
   963			return 1;
   964	
   965		folio_end = &folio->page + folio_nr_pages(folio);
   966		end = min(page_cont_mapped_vaddr(folio_end, page, addr), end);
   967		floops = (end - addr) >> PAGE_SHIFT;
   968		pfn = page_to_pfn(page);
 > 969		prot = pte_pgprot(pte_mkold(pte_mkclean(ptent)));
   970	
   971		*any_dirty = pte_dirty(ptent);
   972	
   973		pfn++;
   974		pte++;
   975	
   976		for (i = 1; i < floops; i++) {
   977			ptent = ptep_get(pte);
   978			ptent = pte_mkold(pte_mkclean(ptent));
   979	
   980			if (!pte_present(ptent) || pte_pfn(ptent) != pfn ||
   981			    pgprot_val(pte_pgprot(ptent)) != pgprot_val(prot))
   982				break;
   983	
   984			if (pte_dirty(ptent))
   985				*any_dirty = true;
   986	
   987			pfn++;
   988			pte++;
   989		}
   990	
   991		return i;
   992	}
   993
Andrew Morton Nov. 15, 2023, 9:37 p.m. UTC | #2
On Wed, 15 Nov 2023 16:30:05 +0000 Ryan Roberts <ryan.roberts@arm.com> wrote:

> However, the primary motivation for this change is to reduce the number
> of tlb maintenance operations that the arm64 backend has to perform
> during fork

Do you have a feeling for how much performance improved due to this? 

Are there other architectures which might similarly benefit?  By
implementing ptep_set_wrprotects(), it appears.  If so, what sort of
gains might they see?
kernel test robot Nov. 15, 2023, 10:40 p.m. UTC | #3
Hi Ryan,

kernel test robot noticed the following build errors:

[auto build test ERROR on akpm-mm/mm-everything]
[also build test ERROR on linus/master v6.7-rc1 next-20231115]
[cannot apply to arm64/for-next/core efi/next]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Ryan-Roberts/mm-Batch-copy-PTE-ranges-during-fork/20231116-010123
base:   https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link:    https://lore.kernel.org/r/20231115163018.1303287-2-ryan.roberts%40arm.com
patch subject: [PATCH v2 01/14] mm: Batch-copy PTE ranges during fork()
config: alpha-defconfig (https://download.01.org/0day-ci/archive/20231116/202311160652.wBj0hbPP-lkp@intel.com/config)
compiler: alpha-linux-gcc (GCC) 13.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20231116/202311160652.wBj0hbPP-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202311160652.wBj0hbPP-lkp@intel.com/

All errors (new ones prefixed by >>):

   mm/memory.c: In function 'folio_nr_pages_cont_mapped':
   mm/memory.c:969:16: error: implicit declaration of function 'pte_pgprot'; did you mean 'ptep_get'? [-Werror=implicit-function-declaration]
     969 |         prot = pte_pgprot(pte_mkold(pte_mkclean(ptent)));
         |                ^~~~~~~~~~
         |                ptep_get
>> mm/memory.c:969:16: error: incompatible types when assigning to type 'pgprot_t' from type 'int'
   In file included from include/linux/shm.h:6,
                    from include/linux/sched.h:16,
                    from include/linux/hardirq.h:9,
                    from include/linux/interrupt.h:11,
                    from include/linux/kernel_stat.h:9,
                    from mm/memory.c:43:
>> arch/alpha/include/asm/page.h:38:29: error: request for member 'pgprot' in something not a structure or union
      38 | #define pgprot_val(x)   ((x).pgprot)
         |                             ^
   mm/memory.c:981:21: note: in expansion of macro 'pgprot_val'
     981 |                     pgprot_val(pte_pgprot(ptent)) != pgprot_val(prot))
         |                     ^~~~~~~~~~
   cc1: some warnings being treated as errors


vim +969 mm/memory.c

   950	
   951	static int folio_nr_pages_cont_mapped(struct folio *folio,
   952					      struct page *page, pte_t *pte,
   953					      unsigned long addr, unsigned long end,
   954					      pte_t ptent, bool *any_dirty)
   955	{
   956		int floops;
   957		int i;
   958		unsigned long pfn;
   959		pgprot_t prot;
   960		struct page *folio_end;
   961	
   962		if (!folio_test_large(folio))
   963			return 1;
   964	
   965		folio_end = &folio->page + folio_nr_pages(folio);
   966		end = min(page_cont_mapped_vaddr(folio_end, page, addr), end);
   967		floops = (end - addr) >> PAGE_SHIFT;
   968		pfn = page_to_pfn(page);
 > 969		prot = pte_pgprot(pte_mkold(pte_mkclean(ptent)));
   970	
   971		*any_dirty = pte_dirty(ptent);
   972	
   973		pfn++;
   974		pte++;
   975	
   976		for (i = 1; i < floops; i++) {
   977			ptent = ptep_get(pte);
   978			ptent = pte_mkold(pte_mkclean(ptent));
   979	
   980			if (!pte_present(ptent) || pte_pfn(ptent) != pfn ||
   981			    pgprot_val(pte_pgprot(ptent)) != pgprot_val(prot))
   982				break;
   983	
   984			if (pte_dirty(ptent))
   985				*any_dirty = true;
   986	
   987			pfn++;
   988			pte++;
   989		}
   990	
   991		return i;
   992	}
   993
Ryan Roberts Nov. 16, 2023, 9:34 a.m. UTC | #4
On 15/11/2023 21:37, Andrew Morton wrote:
> On Wed, 15 Nov 2023 16:30:05 +0000 Ryan Roberts <ryan.roberts@arm.com> wrote:
> 
>> However, the primary motivation for this change is to reduce the number
>> of tlb maintenance operations that the arm64 backend has to perform
>> during fork
> 
> Do you have a feeling for how much performance improved due to this? 

The commit log for patch 13 (the one which implements ptep_set_wrprotects() for
arm64) has performance numbers for a fork() microbenchmark with/without the
optimization:

---8<---

I see a huge performance regression when PTE_CONT support is added; the
regression is then mostly fixed by this change. The following shows the
regression relative to before PTE_CONT was enabled (a bigger negative value
is a bigger regression):

|   cpus |   before opt |   after opt |
|-------:|-------------:|------------:|
|      1 |       -10.4% |       -5.2% |
|      8 |       -15.4% |       -3.5% |
|     16 |       -38.7% |       -3.7% |
|     24 |       -57.0% |       -4.4% |
|     32 |       -65.8% |       -5.4% |

---8<---

Note that's running on Ampere Altra, where TLBI tends to have a high cost.

> 
> Are there other architectures which might similarly benefit?  By
> implementing ptep_set_wrprotects(), it appears.  If so, what sort of
> gains might they see?

The rationale for this is to reduce the expense of managing contpte mappings on
arm64. If other architectures support contpte mappings then they could benefit
from this API for the same reasons that arm64 benefits. I have a vague
understanding that riscv has a similar concept to arm64's contiguous bit, so
perhaps it is a future candidate. But I'm not familiar with the details of the
riscv feature, so I couldn't say whether it would be likely to see the same
level of perf improvement as arm64.

Thanks,
Ryan
David Hildenbrand Nov. 16, 2023, 10:03 a.m. UTC | #5
On 15.11.23 17:30, Ryan Roberts wrote:
> Convert copy_pte_range() to copy a set of ptes in a batch. A given batch
> maps a physically contiguous block of memory, all belonging to the same
> folio, with the same permissions, and for shared mappings, the same
> dirty state. This will likely improve performance by a tiny amount due
> to batching the folio reference count management and calling set_ptes()
> rather than making individual calls to set_pte_at().
> 
> However, the primary motivation for this change is to reduce the number
> of tlb maintenance operations that the arm64 backend has to perform
> during fork, as it is about to add transparent support for the
> "contiguous bit" in its ptes. By write-protecting the parent using the
> new ptep_set_wrprotects() (note the 's' at the end) function, the
> backend can avoid having to unfold contig ranges of PTEs, which is
> expensive, when all ptes in the range are being write-protected.
> Similarly, by using set_ptes() rather than set_pte_at() to set up ptes
> in the child, the backend does not need to fold a contiguous range once
> they are all populated - they can be initially populated as a contiguous
> range in the first place.
> 
> This change addresses the core-mm refactoring only, and introduces
> ptep_set_wrprotects() with a default implementation that calls
> ptep_set_wrprotect() for each pte in the range. A separate change will
> implement ptep_set_wrprotects() in the arm64 backend to realize the
> performance improvement as part of the work to enable contpte mappings.
> 
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> ---
>   include/linux/pgtable.h |  13 +++
>   mm/memory.c             | 175 +++++++++++++++++++++++++++++++---------
>   2 files changed, 150 insertions(+), 38 deletions(-)
> 
> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index af7639c3b0a3..1c50f8a0fdde 100644
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -622,6 +622,19 @@ static inline void ptep_set_wrprotect(struct mm_struct *mm, unsigned long addres
>   }
>   #endif
>   
> +#ifndef ptep_set_wrprotects
> +struct mm_struct;
> +static inline void ptep_set_wrprotects(struct mm_struct *mm,
> +				unsigned long address, pte_t *ptep,
> +				unsigned int nr)
> +{
> +	unsigned int i;
> +
> +	for (i = 0; i < nr; i++, address += PAGE_SIZE, ptep++)
> +		ptep_set_wrprotect(mm, address, ptep);
> +}
> +#endif
> +
>   /*
>    * On some architectures hardware does not set page access bit when accessing
>    * memory page, it is responsibility of software setting this bit. It brings
> diff --git a/mm/memory.c b/mm/memory.c
> index 1f18ed4a5497..b7c8228883cf 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -921,46 +921,129 @@ copy_present_page(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma
>   		/* Uffd-wp needs to be delivered to dest pte as well */
>   		pte = pte_mkuffd_wp(pte);
>   	set_pte_at(dst_vma->vm_mm, addr, dst_pte, pte);
> -	return 0;
> +	return 1;
> +}
> +
> +static inline unsigned long page_cont_mapped_vaddr(struct page *page,
> +				struct page *anchor, unsigned long anchor_vaddr)
> +{
> +	unsigned long offset;
> +	unsigned long vaddr;
> +
> +	offset = (page_to_pfn(page) - page_to_pfn(anchor)) << PAGE_SHIFT;
> +	vaddr = anchor_vaddr + offset;
> +
> +	if (anchor > page) {
> +		if (vaddr > anchor_vaddr)
> +			return 0;
> +	} else {
> +		if (vaddr < anchor_vaddr)
> +			return ULONG_MAX;
> +	}
> +
> +	return vaddr;
> +}
> +
> +static int folio_nr_pages_cont_mapped(struct folio *folio,
> +				      struct page *page, pte_t *pte,
> +				      unsigned long addr, unsigned long end,
> +				      pte_t ptent, bool *any_dirty)
> +{
> +	int floops;
> +	int i;
> +	unsigned long pfn;
> +	pgprot_t prot;
> +	struct page *folio_end;
> +
> +	if (!folio_test_large(folio))
> +		return 1;
> +
> +	folio_end = &folio->page + folio_nr_pages(folio);
> +	end = min(page_cont_mapped_vaddr(folio_end, page, addr), end);
> +	floops = (end - addr) >> PAGE_SHIFT;
> +	pfn = page_to_pfn(page);
> +	prot = pte_pgprot(pte_mkold(pte_mkclean(ptent)));
> +
> +	*any_dirty = pte_dirty(ptent);
> +
> +	pfn++;
> +	pte++;
> +
> +	for (i = 1; i < floops; i++) {
> +		ptent = ptep_get(pte);
> +		ptent = pte_mkold(pte_mkclean(ptent));
> +
> +		if (!pte_present(ptent) || pte_pfn(ptent) != pfn ||
> +		    pgprot_val(pte_pgprot(ptent)) != pgprot_val(prot))
> +			break;
> +
> +		if (pte_dirty(ptent))
> +			*any_dirty = true;
> +
> +		pfn++;
> +		pte++;
> +	}
> +
> +	return i;
>   }
>   
>   /*
> - * Copy one pte.  Returns 0 if succeeded, or -EAGAIN if one preallocated page
> - * is required to copy this pte.
> + * Copy set of contiguous ptes.  Returns number of ptes copied if succeeded
> + * (always gte 1), or -EAGAIN if one preallocated page is required to copy the
> + * first pte.
>    */
>   static inline int
> -copy_present_pte(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
> -		 pte_t *dst_pte, pte_t *src_pte, unsigned long addr, int *rss,
> -		 struct folio **prealloc)
> +copy_present_ptes(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
> +		  pte_t *dst_pte, pte_t *src_pte,
> +		  unsigned long addr, unsigned long end,
> +		  int *rss, struct folio **prealloc)
>   {
>   	struct mm_struct *src_mm = src_vma->vm_mm;
>   	unsigned long vm_flags = src_vma->vm_flags;
>   	pte_t pte = ptep_get(src_pte);
>   	struct page *page;
>   	struct folio *folio;
> +	int nr = 1;
> +	bool anon;
> +	bool any_dirty = pte_dirty(pte);
> +	int i;
>   
>   	page = vm_normal_page(src_vma, addr, pte);
> -	if (page)
> +	if (page) {
>   		folio = page_folio(page);
> -	if (page && folio_test_anon(folio)) {
> -		/*
> -		 * If this page may have been pinned by the parent process,
> -		 * copy the page immediately for the child so that we'll always
> -		 * guarantee the pinned page won't be randomly replaced in the
> -		 * future.
> -		 */
> -		folio_get(folio);
> -		if (unlikely(page_try_dup_anon_rmap(page, false, src_vma))) {
> -			/* Page may be pinned, we have to copy. */
> -			folio_put(folio);
> -			return copy_present_page(dst_vma, src_vma, dst_pte, src_pte,
> -						 addr, rss, prealloc, page);
> +		anon = folio_test_anon(folio);
> +		nr = folio_nr_pages_cont_mapped(folio, page, src_pte, addr,
> +						end, pte, &any_dirty);
> +
> +		for (i = 0; i < nr; i++, page++) {
> +			if (anon) {
> +				/*
> +				 * If this page may have been pinned by the
> +				 * parent process, copy the page immediately for
> +				 * the child so that we'll always guarantee the
> +				 * pinned page won't be randomly replaced in the
> +				 * future.
> +				 */
> +				if (unlikely(page_try_dup_anon_rmap(
> +						page, false, src_vma))) {
> +					if (i != 0)
> +						break;
> +					/* Page may be pinned, we have to copy. */
> +					return copy_present_page(
> +						dst_vma, src_vma, dst_pte,
> +						src_pte, addr, rss, prealloc,
> +						page);
> +				}
> +				rss[MM_ANONPAGES]++;
> +				VM_BUG_ON(PageAnonExclusive(page));
> +			} else {
> +				page_dup_file_rmap(page, false);
> +				rss[mm_counter_file(page)]++;
> +			}
>   		}
> -		rss[MM_ANONPAGES]++;
> -	} else if (page) {
> -		folio_get(folio);
> -		page_dup_file_rmap(page, false);
> -		rss[mm_counter_file(page)]++;
> +
> +		nr = i;
> +		folio_ref_add(folio, nr);

You're changing the order of mapcount vs. refcount increment. Don't. 
Make sure your refcount >= mapcount.

You can do that easily by doing the folio_ref_add(folio, nr) first and 
then decrementing in case of error accordingly. Errors due to pinned 
pages are the corner case.

I'll note that it will make a lot of sense to have batch variants of 
page_try_dup_anon_rmap() and page_dup_file_rmap().

Especially, the batch variant of page_try_dup_anon_rmap() would only
check once whether the folio may be pinned, and in that case, you can simply
drop all references again. So you either have all or no ptes to process,
which makes that code easier.

But that can be added on top, and I'll happily do that.
Ryan Roberts Nov. 16, 2023, 10:07 a.m. UTC | #6
Hi All,

Hoping for some guidance below!


On 15/11/2023 21:26, kernel test robot wrote:
> Hi Ryan,
> 
> kernel test robot noticed the following build errors:
> 
> [auto build test ERROR on akpm-mm/mm-everything]
> [also build test ERROR on linus/master v6.7-rc1 next-20231115]
> [cannot apply to arm64/for-next/core efi/next]
> [If your patch is applied to the wrong git tree, kindly drop us a note.
> And when submitting patch, we suggest to use '--base' as documented in
> https://git-scm.com/docs/git-format-patch#_base_tree_information]
> 
> url:    https://github.com/intel-lab-lkp/linux/commits/Ryan-Roberts/mm-Batch-copy-PTE-ranges-during-fork/20231116-010123
> base:   https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
> patch link:    https://lore.kernel.org/r/20231115163018.1303287-2-ryan.roberts%40arm.com
> patch subject: [PATCH v2 01/14] mm: Batch-copy PTE ranges during fork()
> config: arm-randconfig-002-20231116 (https://download.01.org/0day-ci/archive/20231116/202311160516.kHhfmjvl-lkp@intel.com/config)
> compiler: arm-linux-gnueabi-gcc (GCC) 13.2.0
> reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20231116/202311160516.kHhfmjvl-lkp@intel.com/reproduce)
> 
> If you fix the issue in a separate patch/commit (i.e. not just a new version of
> the same patch/commit), kindly add following tags
> | Reported-by: kernel test robot <lkp@intel.com>
> | Closes: https://lore.kernel.org/oe-kbuild-all/202311160516.kHhfmjvl-lkp@intel.com/
> 
> All errors (new ones prefixed by >>):
> 
>    mm/memory.c: In function 'folio_nr_pages_cont_mapped':
>>> mm/memory.c:969:16: error: implicit declaration of function 'pte_pgprot'; did you mean 'ptep_get'? [-Werror=implicit-function-declaration]
>      969 |         prot = pte_pgprot(pte_mkold(pte_mkclean(ptent)));
>          |                ^~~~~~~~~~
>          |                ptep_get
>    cc1: some warnings being treated as errors

It turns out that pte_pgprot() is not universal; it's only implemented by
architectures that select CONFIG_HAVE_IOREMAP_PROT (currently arc, arm64,
loongarch, mips, powerpc, s390, sh, x86).

I'm using it in core-mm to help calculate the number of "contiguously mapped"
pages within a folio (note that's not the same as arm64's notion of
contpte-mapped; I just want to know that there are N physically contiguous pages
mapped virtually contiguously with the same permissions). And I'm using
pte_pgprot() to extract the permissions of each pte so they can be compared.
It's important that we compare the permissions because pages belonging to the
same folio are not necessarily mapped with the same permissions; think
mprotect()ing a sub-range.

I don't have a great idea for how to fix this - does anyone have any thoughts?

Some ideas:

- Implement folio_nr_pages_cont_mapped() conditionally on
CONFIG_HAVE_IOREMAP_PROT being set; otherwise it just returns 1 and those
arches always get the old, non-batching behaviour (a rough sketch follows this
list). There is some precedent; mm/memory.c already uses pte_pgprot() behind
this ifdef.

- Implement a generic helper the same way arm64 does it. This would return all
the pte bits that are not part of the PFN. But I'm not sure it is a valid thing
to do for all architectures:

static inline pgprot_t pte_pgprot(pte_t pte)
{
	unsigned long pfn = pte_pfn(pte);

	/* XOR away the bits contributed by the PFN alone, leaving only the
	 * permission/attribute bits. */
	return __pgprot(pte_val(pfn_pte(pfn, __pgprot(0))) ^ pte_val(pte));
}

- Explicitly implement pte_pgprot() for all arches that don't currently have it
(sigh).
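
As a rough illustration of the first idea (a sketch only; the exact #ifdef
placement is an assumption, and the batching branch would contain the
implementation from the patch above):

#ifndef CONFIG_HAVE_IOREMAP_PROT
static int folio_nr_pages_cont_mapped(struct folio *folio,
				      struct page *page, pte_t *pte,
				      unsigned long addr, unsigned long end,
				      pte_t ptent, bool *any_dirty)
{
	/* pte_pgprot() not available on this arch: report a batch of 1,
	 * preserving the old one-pte-at-a-time behaviour. */
	return 1;
}
#else
/* ... the batching implementation from the patch, which may use pte_pgprot() ... */
#endif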

Thanks,
Ryan


> 
> 
> vim +969 mm/memory.c
> 
>    950	
>    951	static int folio_nr_pages_cont_mapped(struct folio *folio,
>    952					      struct page *page, pte_t *pte,
>    953					      unsigned long addr, unsigned long end,
>    954					      pte_t ptent, bool *any_dirty)
>    955	{
>    956		int floops;
>    957		int i;
>    958		unsigned long pfn;
>    959		pgprot_t prot;
>    960		struct page *folio_end;
>    961	
>    962		if (!folio_test_large(folio))
>    963			return 1;
>    964	
>    965		folio_end = &folio->page + folio_nr_pages(folio);
>    966		end = min(page_cont_mapped_vaddr(folio_end, page, addr), end);
>    967		floops = (end - addr) >> PAGE_SHIFT;
>    968		pfn = page_to_pfn(page);
>  > 969		prot = pte_pgprot(pte_mkold(pte_mkclean(ptent)));
>    970	
>    971		*any_dirty = pte_dirty(ptent);
>    972	
>    973		pfn++;
>    974		pte++;
>    975	
>    976		for (i = 1; i < floops; i++) {
>    977			ptent = ptep_get(pte);
>    978			ptent = pte_mkold(pte_mkclean(ptent));
>    979	
>    980			if (!pte_present(ptent) || pte_pfn(ptent) != pfn ||
>    981			    pgprot_val(pte_pgprot(ptent)) != pgprot_val(prot))
>    982				break;
>    983	
>    984			if (pte_dirty(ptent))
>    985				*any_dirty = true;
>    986	
>    987			pfn++;
>    988			pte++;
>    989		}
>    990	
>    991		return i;
>    992	}
>    993	
>
David Hildenbrand Nov. 16, 2023, 10:12 a.m. UTC | #7
On 16.11.23 11:07, Ryan Roberts wrote:
> Hi All,
> 
> Hoping for some guidance below!
> 
> 
> On 15/11/2023 21:26, kernel test robot wrote:
>> Hi Ryan,
>>
>> kernel test robot noticed the following build errors:
>>
>> [auto build test ERROR on akpm-mm/mm-everything]
>> [also build test ERROR on linus/master v6.7-rc1 next-20231115]
>> [cannot apply to arm64/for-next/core efi/next]
>> [If your patch is applied to the wrong git tree, kindly drop us a note.
>> And when submitting patch, we suggest to use '--base' as documented in
>> https://git-scm.com/docs/git-format-patch#_base_tree_information]
>>
>> url:    https://github.com/intel-lab-lkp/linux/commits/Ryan-Roberts/mm-Batch-copy-PTE-ranges-during-fork/20231116-010123
>> base:   https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
>> patch link:    https://lore.kernel.org/r/20231115163018.1303287-2-ryan.roberts%40arm.com
>> patch subject: [PATCH v2 01/14] mm: Batch-copy PTE ranges during fork()
>> config: arm-randconfig-002-20231116 (https://download.01.org/0day-ci/archive/20231116/202311160516.kHhfmjvl-lkp@intel.com/config)
>> compiler: arm-linux-gnueabi-gcc (GCC) 13.2.0
>> reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20231116/202311160516.kHhfmjvl-lkp@intel.com/reproduce)
>>
>> If you fix the issue in a separate patch/commit (i.e. not just a new version of
>> the same patch/commit), kindly add following tags
>> | Reported-by: kernel test robot <lkp@intel.com>
>> | Closes: https://lore.kernel.org/oe-kbuild-all/202311160516.kHhfmjvl-lkp@intel.com/
>>
>> All errors (new ones prefixed by >>):
>>
>>     mm/memory.c: In function 'folio_nr_pages_cont_mapped':
>>>> mm/memory.c:969:16: error: implicit declaration of function 'pte_pgprot'; did you mean 'ptep_get'? [-Werror=implicit-function-declaration]
>>       969 |         prot = pte_pgprot(pte_mkold(pte_mkclean(ptent)));
>>           |                ^~~~~~~~~~
>>           |                ptep_get
>>     cc1: some warnings being treated as errors
> 
> It turns out that pte_pgprot() is not universal; its only implemented by
> architectures that select CONFIG_HAVE_IOREMAP_PROT (currently arc, arm64,
> loongarch, mips, powerpc, s390, sh, x86).
> 
> I'm using it in core-mm to help calculate the number of "contiguously mapped"
> pages within a folio (note that's not the same as arm64's notion of
> contpte-mapped. I just want to know that there are N physically contiguous pages
> mapped virtually contiguously with the same permissions). And I'm using
> pte_pgprot() to extract the permissions for each pte to compare. It's important
> that we compare the permissions because just because the pages belongs to the
> same folio doesn't imply they are mapped with the same permissions; think
> mprotect()ing a sub-range.
> 
> I don't have a great idea for how to fix this - does anyone have any thoughts?

KIS :) fork() operates on individual VMAs if I am not daydreaming.

Just check for the obvious pte_write()/dirty/ and you'll be fine.

If your code tries to optimize "between VMAs", you really shouldn't be 
doing that at this point.

If someone did an mprotect(), there are separate VMAs, and you shouldn't 
be looking at the PTEs belonging to a different VMA.
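
For illustration, a sketch of how the loop in folio_nr_pages_cont_mapped()
might follow this suggestion and compare only the obvious per-pte bits within
a single VMA, instead of the full pgprot (purely a sketch, not the eventual v3
code; the uffd-wp check is an assumption prompted by the replies below):

static int folio_nr_pages_cont_mapped(struct folio *folio,
				      struct page *page, pte_t *pte,
				      unsigned long addr, unsigned long end,
				      pte_t ptent, bool *any_dirty)
{
	int floops;
	int i;
	unsigned long pfn;
	struct page *folio_end;

	if (!folio_test_large(folio))
		return 1;

	folio_end = &folio->page + folio_nr_pages(folio);
	end = min(page_cont_mapped_vaddr(folio_end, page, addr), end);
	floops = (end - addr) >> PAGE_SHIFT;
	pfn = page_to_pfn(page);

	*any_dirty = pte_dirty(ptent);

	pfn++;
	pte++;

	for (i = 1; i < floops; i++) {
		pte_t next = ptep_get(pte);

		/* Stop at a non-present pte or a break in the PFN run. */
		if (!pte_present(next) || pte_pfn(next) != pfn)
			break;

		/* Within one VMA, keep the batch uniform on the bits that
		 * matter for fork(): writability and (assumed) uffd-wp. */
		if (!pte_write(next) != !pte_write(ptent) ||
		    !pte_uffd_wp(next) != !pte_uffd_wp(ptent))
			break;

		if (pte_dirty(next))
			*any_dirty = true;

		pfn++;
		pte++;
	}

	return i;
}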
Ryan Roberts Nov. 16, 2023, 10:26 a.m. UTC | #8
On 16/11/2023 10:03, David Hildenbrand wrote:
> On 15.11.23 17:30, Ryan Roberts wrote:
>> Convert copy_pte_range() to copy a set of ptes in a batch. A given batch
>> maps a physically contiguous block of memory, all belonging to the same
>> folio, with the same permissions, and for shared mappings, the same
>> dirty state. This will likely improve performance by a tiny amount due
>> to batching the folio reference count management and calling set_ptes()
>> rather than making individual calls to set_pte_at().
>>
>> However, the primary motivation for this change is to reduce the number
>> of tlb maintenance operations that the arm64 backend has to perform
>> during fork, as it is about to add transparent support for the
>> "contiguous bit" in its ptes. By write-protecting the parent using the
>> new ptep_set_wrprotects() (note the 's' at the end) function, the
>> backend can avoid having to unfold contig ranges of PTEs, which is
>> expensive, when all ptes in the range are being write-protected.
>> Similarly, by using set_ptes() rather than set_pte_at() to set up ptes
>> in the child, the backend does not need to fold a contiguous range once
>> they are all populated - they can be initially populated as a contiguous
>> range in the first place.
>>
>> This change addresses the core-mm refactoring only, and introduces
>> ptep_set_wrprotects() with a default implementation that calls
>> ptep_set_wrprotect() for each pte in the range. A separate change will
>> implement ptep_set_wrprotects() in the arm64 backend to realize the
>> performance improvement as part of the work to enable contpte mappings.
>>
>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>> ---
>>   include/linux/pgtable.h |  13 +++
>>   mm/memory.c             | 175 +++++++++++++++++++++++++++++++---------
>>   2 files changed, 150 insertions(+), 38 deletions(-)
>>
>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
>> index af7639c3b0a3..1c50f8a0fdde 100644
>> --- a/include/linux/pgtable.h
>> +++ b/include/linux/pgtable.h
>> @@ -622,6 +622,19 @@ static inline void ptep_set_wrprotect(struct mm_struct
>> *mm, unsigned long addres
>>   }
>>   #endif
>>   +#ifndef ptep_set_wrprotects
>> +struct mm_struct;
>> +static inline void ptep_set_wrprotects(struct mm_struct *mm,
>> +                unsigned long address, pte_t *ptep,
>> +                unsigned int nr)
>> +{
>> +    unsigned int i;
>> +
>> +    for (i = 0; i < nr; i++, address += PAGE_SIZE, ptep++)
>> +        ptep_set_wrprotect(mm, address, ptep);
>> +}
>> +#endif
>> +
>>   /*
>>    * On some architectures hardware does not set page access bit when accessing
>>    * memory page, it is responsibility of software setting this bit. It brings
>> diff --git a/mm/memory.c b/mm/memory.c
>> index 1f18ed4a5497..b7c8228883cf 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -921,46 +921,129 @@ copy_present_page(struct vm_area_struct *dst_vma,
>> struct vm_area_struct *src_vma
>>           /* Uffd-wp needs to be delivered to dest pte as well */
>>           pte = pte_mkuffd_wp(pte);
>>       set_pte_at(dst_vma->vm_mm, addr, dst_pte, pte);
>> -    return 0;
>> +    return 1;
>> +}
>> +
>> +static inline unsigned long page_cont_mapped_vaddr(struct page *page,
>> +                struct page *anchor, unsigned long anchor_vaddr)
>> +{
>> +    unsigned long offset;
>> +    unsigned long vaddr;
>> +
>> +    offset = (page_to_pfn(page) - page_to_pfn(anchor)) << PAGE_SHIFT;
>> +    vaddr = anchor_vaddr + offset;
>> +
>> +    if (anchor > page) {
>> +        if (vaddr > anchor_vaddr)
>> +            return 0;
>> +    } else {
>> +        if (vaddr < anchor_vaddr)
>> +            return ULONG_MAX;
>> +    }
>> +
>> +    return vaddr;
>> +}
>> +
>> +static int folio_nr_pages_cont_mapped(struct folio *folio,
>> +                      struct page *page, pte_t *pte,
>> +                      unsigned long addr, unsigned long end,
>> +                      pte_t ptent, bool *any_dirty)
>> +{
>> +    int floops;
>> +    int i;
>> +    unsigned long pfn;
>> +    pgprot_t prot;
>> +    struct page *folio_end;
>> +
>> +    if (!folio_test_large(folio))
>> +        return 1;
>> +
>> +    folio_end = &folio->page + folio_nr_pages(folio);
>> +    end = min(page_cont_mapped_vaddr(folio_end, page, addr), end);
>> +    floops = (end - addr) >> PAGE_SHIFT;
>> +    pfn = page_to_pfn(page);
>> +    prot = pte_pgprot(pte_mkold(pte_mkclean(ptent)));
>> +
>> +    *any_dirty = pte_dirty(ptent);
>> +
>> +    pfn++;
>> +    pte++;
>> +
>> +    for (i = 1; i < floops; i++) {
>> +        ptent = ptep_get(pte);
>> +        ptent = pte_mkold(pte_mkclean(ptent));
>> +
>> +        if (!pte_present(ptent) || pte_pfn(ptent) != pfn ||
>> +            pgprot_val(pte_pgprot(ptent)) != pgprot_val(prot))
>> +            break;
>> +
>> +        if (pte_dirty(ptent))
>> +            *any_dirty = true;
>> +
>> +        pfn++;
>> +        pte++;
>> +    }
>> +
>> +    return i;
>>   }
>>     /*
>> - * Copy one pte.  Returns 0 if succeeded, or -EAGAIN if one preallocated page
>> - * is required to copy this pte.
>> + * Copy set of contiguous ptes.  Returns number of ptes copied if succeeded
>> + * (always gte 1), or -EAGAIN if one preallocated page is required to copy the
>> + * first pte.
>>    */
>>   static inline int
>> -copy_present_pte(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
>> -         pte_t *dst_pte, pte_t *src_pte, unsigned long addr, int *rss,
>> -         struct folio **prealloc)
>> +copy_present_ptes(struct vm_area_struct *dst_vma, struct vm_area_struct
>> *src_vma,
>> +          pte_t *dst_pte, pte_t *src_pte,
>> +          unsigned long addr, unsigned long end,
>> +          int *rss, struct folio **prealloc)
>>   {
>>       struct mm_struct *src_mm = src_vma->vm_mm;
>>       unsigned long vm_flags = src_vma->vm_flags;
>>       pte_t pte = ptep_get(src_pte);
>>       struct page *page;
>>       struct folio *folio;
>> +    int nr = 1;
>> +    bool anon;
>> +    bool any_dirty = pte_dirty(pte);
>> +    int i;
>>         page = vm_normal_page(src_vma, addr, pte);
>> -    if (page)
>> +    if (page) {
>>           folio = page_folio(page);
>> -    if (page && folio_test_anon(folio)) {
>> -        /*
>> -         * If this page may have been pinned by the parent process,
>> -         * copy the page immediately for the child so that we'll always
>> -         * guarantee the pinned page won't be randomly replaced in the
>> -         * future.
>> -         */
>> -        folio_get(folio);
>> -        if (unlikely(page_try_dup_anon_rmap(page, false, src_vma))) {
>> -            /* Page may be pinned, we have to copy. */
>> -            folio_put(folio);
>> -            return copy_present_page(dst_vma, src_vma, dst_pte, src_pte,
>> -                         addr, rss, prealloc, page);
>> +        anon = folio_test_anon(folio);
>> +        nr = folio_nr_pages_cont_mapped(folio, page, src_pte, addr,
>> +                        end, pte, &any_dirty);
>> +
>> +        for (i = 0; i < nr; i++, page++) {
>> +            if (anon) {
>> +                /*
>> +                 * If this page may have been pinned by the
>> +                 * parent process, copy the page immediately for
>> +                 * the child so that we'll always guarantee the
>> +                 * pinned page won't be randomly replaced in the
>> +                 * future.
>> +                 */
>> +                if (unlikely(page_try_dup_anon_rmap(
>> +                        page, false, src_vma))) {
>> +                    if (i != 0)
>> +                        break;
>> +                    /* Page may be pinned, we have to copy. */
>> +                    return copy_present_page(
>> +                        dst_vma, src_vma, dst_pte,
>> +                        src_pte, addr, rss, prealloc,
>> +                        page);
>> +                }
>> +                rss[MM_ANONPAGES]++;
>> +                VM_BUG_ON(PageAnonExclusive(page));
>> +            } else {
>> +                page_dup_file_rmap(page, false);
>> +                rss[mm_counter_file(page)]++;
>> +            }
>>           }
>> -        rss[MM_ANONPAGES]++;
>> -    } else if (page) {
>> -        folio_get(folio);
>> -        page_dup_file_rmap(page, false);
>> -        rss[mm_counter_file(page)]++;
>> +
>> +        nr = i;
>> +        folio_ref_add(folio, nr);
> 
> You're changing the order of mapcount vs. refcount increment. Don't. Make sure
> your refcount >= mapcount.

Ouch - good spot.

> 
> You can do that easily by doing the folio_ref_add(folio, nr) first and then
> decrementing in case of error accordingly. Errors due to pinned pages are the
> corner case.

Yep, propose this for v3:

diff --git a/mm/memory.c b/mm/memory.c
index b7c8228883cf..98373349806e 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1014,6 +1014,7 @@ copy_present_ptes(struct vm_area_struct *dst_vma, struct
vm_area_struct *src_vma
                anon = folio_test_anon(folio);
                nr = folio_nr_pages_cont_mapped(folio, page, src_pte, addr,
                                                end, pte, &any_dirty);
+               folio_ref_add(folio, nr);

                for (i = 0; i < nr; i++, page++) {
                        if (anon) {
@@ -1029,6 +1030,7 @@ copy_present_ptes(struct vm_area_struct *dst_vma, struct
vm_area_struct *src_vma
                                        if (i != 0)
                                                break;
                                        /* Page may be pinned, we have to copy. */
+                                       folio_ref_sub(folio, nr);
                                        return copy_present_page(
                                                dst_vma, src_vma, dst_pte,
                                                src_pte, addr, rss, prealloc,
@@ -1042,8 +1044,10 @@ copy_present_ptes(struct vm_area_struct *dst_vma, struct
vm_area_struct *src_vma
                        }
                }

-               nr = i;
-               folio_ref_add(folio, nr);
+               if (i < nr) {
+                       folio_ref_sub(folio, nr - i);
+                       nr = i;
+               }
        }

> 
> I'll note that it will make a lot of sense to have batch variants of
> page_try_dup_anon_rmap() and page_dup_file_rmap().
> 
> Especially, the batch variant of page_try_dup_anon_rmap() would only check once
> if the folio maybe pinned, and in that case, you can simply drop all references
> again. So you either have all or no ptes to process, which makes that code easier.
> 
> But that can be added on top, and I'll happily do that.

That's very kind - thanks for the offer! I'll leave it to you then.
Ryan Roberts Nov. 16, 2023, 10:36 a.m. UTC | #9
On 16/11/2023 10:12, David Hildenbrand wrote:
> On 16.11.23 11:07, Ryan Roberts wrote:
>> Hi All,
>>
>> Hoping for some guidance below!
>>
>>
>> On 15/11/2023 21:26, kernel test robot wrote:
>>> Hi Ryan,
>>>
>>> kernel test robot noticed the following build errors:
>>>
>>> [auto build test ERROR on akpm-mm/mm-everything]
>>> [also build test ERROR on linus/master v6.7-rc1 next-20231115]
>>> [cannot apply to arm64/for-next/core efi/next]
>>> [If your patch is applied to the wrong git tree, kindly drop us a note.
>>> And when submitting patch, we suggest to use '--base' as documented in
>>> https://git-scm.com/docs/git-format-patch#_base_tree_information]
>>>
>>> url:   
>>> https://github.com/intel-lab-lkp/linux/commits/Ryan-Roberts/mm-Batch-copy-PTE-ranges-during-fork/20231116-010123
>>> base:   https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git
>>> mm-everything
>>> patch link:   
>>> https://lore.kernel.org/r/20231115163018.1303287-2-ryan.roberts%40arm.com
>>> patch subject: [PATCH v2 01/14] mm: Batch-copy PTE ranges during fork()
>>> config: arm-randconfig-002-20231116
>>> (https://download.01.org/0day-ci/archive/20231116/202311160516.kHhfmjvl-lkp@intel.com/config)
>>> compiler: arm-linux-gnueabi-gcc (GCC) 13.2.0
>>> reproduce (this is a W=1 build):
>>> (https://download.01.org/0day-ci/archive/20231116/202311160516.kHhfmjvl-lkp@intel.com/reproduce)
>>>
>>> If you fix the issue in a separate patch/commit (i.e. not just a new version of
>>> the same patch/commit), kindly add following tags
>>> | Reported-by: kernel test robot <lkp@intel.com>
>>> | Closes:
>>> https://lore.kernel.org/oe-kbuild-all/202311160516.kHhfmjvl-lkp@intel.com/
>>>
>>> All errors (new ones prefixed by >>):
>>>
>>>     mm/memory.c: In function 'folio_nr_pages_cont_mapped':
>>>>> mm/memory.c:969:16: error: implicit declaration of function 'pte_pgprot';
>>>>> did you mean 'ptep_get'? [-Werror=implicit-function-declaration]
>>>       969 |         prot = pte_pgprot(pte_mkold(pte_mkclean(ptent)));
>>>           |                ^~~~~~~~~~
>>>           |                ptep_get
>>>     cc1: some warnings being treated as errors
>>
>> It turns out that pte_pgprot() is not universal; its only implemented by
>> architectures that select CONFIG_HAVE_IOREMAP_PROT (currently arc, arm64,
>> loongarch, mips, powerpc, s390, sh, x86).
>>
>> I'm using it in core-mm to help calculate the number of "contiguously mapped"
>> pages within a folio (note that's not the same as arm64's notion of
>> contpte-mapped. I just want to know that there are N physically contiguous pages
>> mapped virtually contiguously with the same permissions). And I'm using
>> pte_pgprot() to extract the permissions for each pte to compare. It's important
>> that we compare the permissions because just because the pages belongs to the
>> same folio doesn't imply they are mapped with the same permissions; think
>> mprotect()ing a sub-range.
>>
>> I don't have a great idea for how to fix this - does anyone have any thoughts?
> 
> KIS :) fork() operates on individual VMAs if I am not daydreaming.
> 
> Just check for the obvious pte_write()/dirty/ and you'll be fine.

Yes, that seems much simpler! I think we might have to be careful about the uffd
wp bit too? I think that's it - are there any other exotic bits that might need
to be considered?

> 
> If your code tries to optimize "between VMAs", you really shouldn't be doing
> that at this point.

No, I'm not doing that; it's one VMA at a time.

> 
> If someone did an mprotect(), there are separate VMAs, and you shouldn't be
> looking at the PTEs belonging to a different VMA.
> 

Yep understood, thanks.
David Hildenbrand Nov. 16, 2023, 11:01 a.m. UTC | #10
On 16.11.23 11:36, Ryan Roberts wrote:
> On 16/11/2023 10:12, David Hildenbrand wrote:
>> On 16.11.23 11:07, Ryan Roberts wrote:
>>> Hi All,
>>>
>>> Hoping for some guidance below!
>>>
>>>
>>> On 15/11/2023 21:26, kernel test robot wrote:
>>>> Hi Ryan,
>>>>
>>>> kernel test robot noticed the following build errors:
>>>>
>>>> [auto build test ERROR on akpm-mm/mm-everything]
>>>> [also build test ERROR on linus/master v6.7-rc1 next-20231115]
>>>> [cannot apply to arm64/for-next/core efi/next]
>>>> [If your patch is applied to the wrong git tree, kindly drop us a note.
>>>> And when submitting patch, we suggest to use '--base' as documented in
>>>> https://git-scm.com/docs/git-format-patch#_base_tree_information]
>>>>
>>>> url:
>>>> https://github.com/intel-lab-lkp/linux/commits/Ryan-Roberts/mm-Batch-copy-PTE-ranges-during-fork/20231116-010123
>>>> base:   https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git
>>>> mm-everything
>>>> patch link:
>>>> https://lore.kernel.org/r/20231115163018.1303287-2-ryan.roberts%40arm.com
>>>> patch subject: [PATCH v2 01/14] mm: Batch-copy PTE ranges during fork()
>>>> config: arm-randconfig-002-20231116
>>>> (https://download.01.org/0day-ci/archive/20231116/202311160516.kHhfmjvl-lkp@intel.com/config)
>>>> compiler: arm-linux-gnueabi-gcc (GCC) 13.2.0
>>>> reproduce (this is a W=1 build):
>>>> (https://download.01.org/0day-ci/archive/20231116/202311160516.kHhfmjvl-lkp@intel.com/reproduce)
>>>>
>>>> If you fix the issue in a separate patch/commit (i.e. not just a new version of
>>>> the same patch/commit), kindly add following tags
>>>> | Reported-by: kernel test robot <lkp@intel.com>
>>>> | Closes:
>>>> https://lore.kernel.org/oe-kbuild-all/202311160516.kHhfmjvl-lkp@intel.com/
>>>>
>>>> All errors (new ones prefixed by >>):
>>>>
>>>>      mm/memory.c: In function 'folio_nr_pages_cont_mapped':
>>>>>> mm/memory.c:969:16: error: implicit declaration of function 'pte_pgprot';
>>>>>> did you mean 'ptep_get'? [-Werror=implicit-function-declaration]
>>>>        969 |         prot = pte_pgprot(pte_mkold(pte_mkclean(ptent)));
>>>>            |                ^~~~~~~~~~
>>>>            |                ptep_get
>>>>      cc1: some warnings being treated as errors
>>>
>>> It turns out that pte_pgprot() is not universal; its only implemented by
>>> architectures that select CONFIG_HAVE_IOREMAP_PROT (currently arc, arm64,
>>> loongarch, mips, powerpc, s390, sh, x86).
>>>
>>> I'm using it in core-mm to help calculate the number of "contiguously mapped"
>>> pages within a folio (note that's not the same as arm64's notion of
>>> contpte-mapped. I just want to know that there are N physically contiguous pages
>>> mapped virtually contiguously with the same permissions). And I'm using
>>> pte_pgprot() to extract the permissions for each pte to compare. It's important
>>> that we compare the permissions because just because the pages belongs to the
>>> same folio doesn't imply they are mapped with the same permissions; think
>>> mprotect()ing a sub-range.
>>>
>>> I don't have a great idea for how to fix this - does anyone have any thoughts?
>>
>> KIS :) fork() operates on individual VMAs if I am not daydreaming.
>>
>> Just check for the obvious pte_write()/dirty/ and you'll be fine.
> 
> Yes, that seems much simpler! I think we might have to be careful about the uffd
> wp bit too? I think that's it - are there any other exotic bits that might need
> to be considered?

Good question. Mimicking what the current code already does should be
sufficient. uffd-wp should have the PTE R/O. You can set the contpte bit
independent of any SW bit (uffd-wp, softdirty, ...) I guess, so no need to
worry about that.
David Hildenbrand Nov. 16, 2023, 11:03 a.m. UTC | #11
On 15.11.23 17:30, Ryan Roberts wrote:
> Convert copy_pte_range() to copy a set of ptes in a batch. A given batch
> maps a physically contiguous block of memory, all belonging to the same
> folio, with the same permissions, and for shared mappings, the same
> dirty state. This will likely improve performance by a tiny amount due
> to batching the folio reference count management and calling set_ptes()
> rather than making individual calls to set_pte_at().
> 
> However, the primary motivation for this change is to reduce the number
> of tlb maintenance operations that the arm64 backend has to perform
> during fork, as it is about to add transparent support for the
> "contiguous bit" in its ptes. By write-protecting the parent using the
> new ptep_set_wrprotects() (note the 's' at the end) function, the
> backend can avoid having to unfold contig ranges of PTEs, which is
> expensive, when all ptes in the range are being write-protected.
> Similarly, by using set_ptes() rather than set_pte_at() to set up ptes
> in the child, the backend does not need to fold a contiguous range once
> they are all populated - they can be initially populated as a contiguous
> range in the first place.
> 
> This change addresses the core-mm refactoring only, and introduces
> ptep_set_wrprotects() with a default implementation that calls
> ptep_set_wrprotect() for each pte in the range. A separate change will
> implement ptep_set_wrprotects() in the arm64 backend to realize the
> performance improvement as part of the work to enable contpte mappings.
> 
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> ---
>   include/linux/pgtable.h |  13 +++
>   mm/memory.c             | 175 +++++++++++++++++++++++++++++++---------
>   2 files changed, 150 insertions(+), 38 deletions(-)
> 
> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index af7639c3b0a3..1c50f8a0fdde 100644
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -622,6 +622,19 @@ static inline void ptep_set_wrprotect(struct mm_struct *mm, unsigned long addres
>   }
>   #endif
>   
> +#ifndef ptep_set_wrprotects
> +struct mm_struct;
> +static inline void ptep_set_wrprotects(struct mm_struct *mm,
> +				unsigned long address, pte_t *ptep,
> +				unsigned int nr)
> +{
> +	unsigned int i;
> +
> +	for (i = 0; i < nr; i++, address += PAGE_SIZE, ptep++)
> +		ptep_set_wrprotect(mm, address, ptep);
> +}
> +#endif
> +
>   /*
>    * On some architectures hardware does not set page access bit when accessing
>    * memory page, it is responsibility of software setting this bit. It brings
> diff --git a/mm/memory.c b/mm/memory.c
> index 1f18ed4a5497..b7c8228883cf 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -921,46 +921,129 @@ copy_present_page(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma
>   		/* Uffd-wp needs to be delivered to dest pte as well */
>   		pte = pte_mkuffd_wp(pte);
>   	set_pte_at(dst_vma->vm_mm, addr, dst_pte, pte);
> -	return 0;
> +	return 1;
> +}
> +
> +static inline unsigned long page_cont_mapped_vaddr(struct page *page,
> +				struct page *anchor, unsigned long anchor_vaddr)
> +{
> +	unsigned long offset;
> +	unsigned long vaddr;
> +
> +	offset = (page_to_pfn(page) - page_to_pfn(anchor)) << PAGE_SHIFT;
> +	vaddr = anchor_vaddr + offset;
> +
> +	if (anchor > page) {
> +		if (vaddr > anchor_vaddr)
> +			return 0;
> +	} else {
> +		if (vaddr < anchor_vaddr)
> +			return ULONG_MAX;
> +	}
> +
> +	return vaddr;
> +}
> +
> +static int folio_nr_pages_cont_mapped(struct folio *folio,
> +				      struct page *page, pte_t *pte,
> +				      unsigned long addr, unsigned long end,
> +				      pte_t ptent, bool *any_dirty)
> +{
> +	int floops;
> +	int i;
> +	unsigned long pfn;
> +	pgprot_t prot;
> +	struct page *folio_end;
> +
> +	if (!folio_test_large(folio))
> +		return 1;
> +
> +	folio_end = &folio->page + folio_nr_pages(folio);
> +	end = min(page_cont_mapped_vaddr(folio_end, page, addr), end);
> +	floops = (end - addr) >> PAGE_SHIFT;
> +	pfn = page_to_pfn(page);
> +	prot = pte_pgprot(pte_mkold(pte_mkclean(ptent)));
> +
> +	*any_dirty = pte_dirty(ptent);
> +
> +	pfn++;
> +	pte++;
> +
> +	for (i = 1; i < floops; i++) {
> +		ptent = ptep_get(pte);
> +		ptent = pte_mkold(pte_mkclean(ptent));
> +
> +		if (!pte_present(ptent) || pte_pfn(ptent) != pfn ||
> +		    pgprot_val(pte_pgprot(ptent)) != pgprot_val(prot))
> +			break;
> +
> +		if (pte_dirty(ptent))
> +			*any_dirty = true;
> +
> +		pfn++;
> +		pte++;
> +	}
> +
> +	return i;
>   }
>   
>   /*
> - * Copy one pte.  Returns 0 if succeeded, or -EAGAIN if one preallocated page
> - * is required to copy this pte.
> + * Copy set of contiguous ptes.  Returns number of ptes copied if succeeded
> + * (always gte 1), or -EAGAIN if one preallocated page is required to copy the
> + * first pte.
>    */
>   static inline int
> -copy_present_pte(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
> -		 pte_t *dst_pte, pte_t *src_pte, unsigned long addr, int *rss,
> -		 struct folio **prealloc)
> +copy_present_ptes(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
> +		  pte_t *dst_pte, pte_t *src_pte,
> +		  unsigned long addr, unsigned long end,
> +		  int *rss, struct folio **prealloc)
>   {
>   	struct mm_struct *src_mm = src_vma->vm_mm;
>   	unsigned long vm_flags = src_vma->vm_flags;
>   	pte_t pte = ptep_get(src_pte);
>   	struct page *page;
>   	struct folio *folio;
> +	int nr = 1;
> +	bool anon;
> +	bool any_dirty = pte_dirty(pte);
> +	int i;
>   
>   	page = vm_normal_page(src_vma, addr, pte);
> -	if (page)
> +	if (page) {
>   		folio = page_folio(page);
> -	if (page && folio_test_anon(folio)) {
> -		/*
> -		 * If this page may have been pinned by the parent process,
> -		 * copy the page immediately for the child so that we'll always
> -		 * guarantee the pinned page won't be randomly replaced in the
> -		 * future.
> -		 */
> -		folio_get(folio);
> -		if (unlikely(page_try_dup_anon_rmap(page, false, src_vma))) {
> -			/* Page may be pinned, we have to copy. */
> -			folio_put(folio);
> -			return copy_present_page(dst_vma, src_vma, dst_pte, src_pte,
> -						 addr, rss, prealloc, page);
> +		anon = folio_test_anon(folio);
> +		nr = folio_nr_pages_cont_mapped(folio, page, src_pte, addr,
> +						end, pte, &any_dirty);
> +
> +		for (i = 0; i < nr; i++, page++) {
> +			if (anon) {
> +				/*
> +				 * If this page may have been pinned by the
> +				 * parent process, copy the page immediately for
> +				 * the child so that we'll always guarantee the
> +				 * pinned page won't be randomly replaced in the
> +				 * future.
> +				 */
> +				if (unlikely(page_try_dup_anon_rmap(
> +						page, false, src_vma))) {
> +					if (i != 0)
> +						break;
> +					/* Page may be pinned, we have to copy. */
> +					return copy_present_page(
> +						dst_vma, src_vma, dst_pte,
> +						src_pte, addr, rss, prealloc,
> +						page);
> +				}
> +				rss[MM_ANONPAGES]++;
> +				VM_BUG_ON(PageAnonExclusive(page));
> +			} else {
> +				page_dup_file_rmap(page, false);
> +				rss[mm_counter_file(page)]++;
> +			}
>   		}
> -		rss[MM_ANONPAGES]++;
> -	} else if (page) {
> -		folio_get(folio);
> -		page_dup_file_rmap(page, false);
> -		rss[mm_counter_file(page)]++;
> +
> +		nr = i;
> +		folio_ref_add(folio, nr);
>   	}
>   
>   	/*
> @@ -968,24 +1051,28 @@ copy_present_pte(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
>   	 * in the parent and the child
>   	 */
>   	if (is_cow_mapping(vm_flags) && pte_write(pte)) {
> -		ptep_set_wrprotect(src_mm, addr, src_pte);
> +		ptep_set_wrprotects(src_mm, addr, src_pte, nr);
>   		pte = pte_wrprotect(pte);

You likely want an "any_pte_writable" check here instead, no?

Any operations that target a single individual PTE while multiple PTEs
are adjusted are suspicious :)
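
To make the concern concrete, a hedged sketch of what that check could look
like (any_pte_writable is a hypothetical flag that the batching helper would
report; this only illustrates the suggestion, not the actual code):

	/*
	 * any_pte_writable: true if *any* pte in the batch is writable,
	 * rather than deciding based on the first pte only (hypothetical).
	 */
	if (is_cow_mapping(vm_flags) && any_pte_writable) {
		ptep_set_wrprotects(src_mm, addr, src_pte, nr);
		pte = pte_wrprotect(pte);
	}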
Ryan Roberts Nov. 16, 2023, 11:13 a.m. UTC | #12
On 16/11/2023 11:01, David Hildenbrand wrote:
> On 16.11.23 11:36, Ryan Roberts wrote:
>> On 16/11/2023 10:12, David Hildenbrand wrote:
>>> On 16.11.23 11:07, Ryan Roberts wrote:
>>>> Hi All,
>>>>
>>>> Hoping for some guidance below!
>>>>
>>>>
>>>> On 15/11/2023 21:26, kernel test robot wrote:
>>>>> Hi Ryan,
>>>>>
>>>>> kernel test robot noticed the following build errors:
>>>>>
>>>>> [auto build test ERROR on akpm-mm/mm-everything]
>>>>> [also build test ERROR on linus/master v6.7-rc1 next-20231115]
>>>>> [cannot apply to arm64/for-next/core efi/next]
>>>>> [If your patch is applied to the wrong git tree, kindly drop us a note.
>>>>> And when submitting patch, we suggest to use '--base' as documented in
>>>>> https://git-scm.com/docs/git-format-patch#_base_tree_information]
>>>>>
>>>>> url:
>>>>> https://github.com/intel-lab-lkp/linux/commits/Ryan-Roberts/mm-Batch-copy-PTE-ranges-during-fork/20231116-010123
>>>>> base:   https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git
>>>>> mm-everything
>>>>> patch link:
>>>>> https://lore.kernel.org/r/20231115163018.1303287-2-ryan.roberts%40arm.com
>>>>> patch subject: [PATCH v2 01/14] mm: Batch-copy PTE ranges during fork()
>>>>> config: arm-randconfig-002-20231116
>>>>> (https://download.01.org/0day-ci/archive/20231116/202311160516.kHhfmjvl-lkp@intel.com/config)
>>>>> compiler: arm-linux-gnueabi-gcc (GCC) 13.2.0
>>>>> reproduce (this is a W=1 build):
>>>>> (https://download.01.org/0day-ci/archive/20231116/202311160516.kHhfmjvl-lkp@intel.com/reproduce)
>>>>>
>>>>> If you fix the issue in a separate patch/commit (i.e. not just a new
>>>>> version of
>>>>> the same patch/commit), kindly add following tags
>>>>> | Reported-by: kernel test robot <lkp@intel.com>
>>>>> | Closes:
>>>>> https://lore.kernel.org/oe-kbuild-all/202311160516.kHhfmjvl-lkp@intel.com/
>>>>>
>>>>> All errors (new ones prefixed by >>):
>>>>>
>>>>>      mm/memory.c: In function 'folio_nr_pages_cont_mapped':
>>>>>>> mm/memory.c:969:16: error: implicit declaration of function 'pte_pgprot';
>>>>>>> did you mean 'ptep_get'? [-Werror=implicit-function-declaration]
>>>>>        969 |         prot = pte_pgprot(pte_mkold(pte_mkclean(ptent)));
>>>>>            |                ^~~~~~~~~~
>>>>>            |                ptep_get
>>>>>      cc1: some warnings being treated as errors
>>>>
>>>> It turns out that pte_pgprot() is not universal; its only implemented by
>>>> architectures that select CONFIG_HAVE_IOREMAP_PROT (currently arc, arm64,
>>>> loongarch, mips, powerpc, s390, sh, x86).
>>>>
>>>> I'm using it in core-mm to help calculate the number of "contiguously mapped"
>>>> pages within a folio (note that's not the same as arm64's notion of
>>>> contpte-mapped. I just want to know that there are N physically contiguous
>>>> pages
>>>> mapped virtually contiguously with the same permissions). And I'm using
>>>> pte_pgprot() to extract the permissions for each pte to compare. It's important
>>>> that we compare the permissions because just because the pages belongs to the
>>>> same folio doesn't imply they are mapped with the same permissions; think
>>>> mprotect()ing a sub-range.
>>>>
>>>> I don't have a great idea for how to fix this - does anyone have any thoughts?
>>>
>>> KIS :) fork() operates on individual VMAs if I am not daydreaming.
>>>
>>> Just check for the obvious pte_write()/dirty/ and you'll be fine.
>>
>> Yes, that seems much simpler! I think we might have to be careful about the uffd
>> wp bit too? I think that's it - are there any other exotic bits that might need
>> to be considered?
> 
> Good question. Mimicking what the current code already does should be sufficient.
> uffd-wp should have the PTE R/O. You can set the contpte bit independent of any
> SW bit (uffd-wp, softdirty, ...) I guess, no need to worry about that.
> 

OK thanks. I'll rework for this approach in v3.
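
For concreteness, here is a tiny standalone C model of the batch-boundary test this exchange converges on: compare a couple of explicit bits (plus pfn contiguity) rather than a full pte_pgprot() value, and ignore access/dirty. struct model_pte, can_batch() and the sample values are invented for illustration only; this is a sketch of the idea, not the kernel implementation.

/*
 * Standalone model (not kernel code) of the batch-boundary test discussed
 * above: instead of comparing full pte_pgprot() values, compare a couple of
 * explicit bits. 'struct model_pte' and can_batch() are invented names.
 */
#include <stdbool.h>
#include <stdio.h>

struct model_pte {
        unsigned long pfn;
        bool present;
        bool write;
        bool dirty;     /* not a batch criterion; tracked separately */
        bool young;     /* not a batch criterion */
};

/* May 'next' extend a batch that started with 'first' and expects 'pfn'? */
static bool can_batch(const struct model_pte *first,
                      const struct model_pte *next, unsigned long pfn)
{
        return next->present &&
               next->pfn == pfn &&              /* physically contiguous */
               next->write == first->write;     /* same writability */
}

int main(void)
{
        struct model_pte pte[] = {
                { .pfn = 100, .present = true, .write = true },
                { .pfn = 101, .present = true, .write = true, .dirty = true },
                { .pfn = 102, .present = true, .write = false }, /* ends the batch */
        };
        int nr = 1;

        while (nr < 3 && can_batch(&pte[0], &pte[nr], pte[0].pfn + nr))
                nr++;

        printf("batch covers %d ptes\n", nr);   /* prints 2 */
        return 0;
}

The dirty bit is deliberately not a criterion here; as the following replies discuss, it only needs to be accumulated for the batch.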
Ryan Roberts Nov. 16, 2023, 11:20 a.m. UTC | #13
On 16/11/2023 11:03, David Hildenbrand wrote:
> On 15.11.23 17:30, Ryan Roberts wrote:
>> Convert copy_pte_range() to copy a set of ptes in a batch. A given batch
>> maps a physically contiguous block of memory, all belonging to the same
>> folio, with the same permissions, and for shared mappings, the same
>> dirty state. This will likely improve performance by a tiny amount due
>> to batching the folio reference count management and calling set_ptes()
>> rather than making individual calls to set_pte_at().
>>
>> However, the primary motivation for this change is to reduce the number
>> of tlb maintenance operations that the arm64 backend has to perform
>> during fork, as it is about to add transparent support for the
>> "contiguous bit" in its ptes. By write-protecting the parent using the
>> new ptep_set_wrprotects() (note the 's' at the end) function, the
>> backend can avoid having to unfold contig ranges of PTEs, which is
>> expensive, when all ptes in the range are being write-protected.
>> Similarly, by using set_ptes() rather than set_pte_at() to set up ptes
>> in the child, the backend does not need to fold a contiguous range once
>> they are all populated - they can be initially populated as a contiguous
>> range in the first place.
>>
>> This change addresses the core-mm refactoring only, and introduces
>> ptep_set_wrprotects() with a default implementation that calls
>> ptep_set_wrprotect() for each pte in the range. A separate change will
>> implement ptep_set_wrprotects() in the arm64 backend to realize the
>> performance improvement as part of the work to enable contpte mappings.
>>
>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>> ---
>>   include/linux/pgtable.h |  13 +++
>>   mm/memory.c             | 175 +++++++++++++++++++++++++++++++---------
>>   2 files changed, 150 insertions(+), 38 deletions(-)
>>
>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
>> index af7639c3b0a3..1c50f8a0fdde 100644
>> --- a/include/linux/pgtable.h
>> +++ b/include/linux/pgtable.h
>> @@ -622,6 +622,19 @@ static inline void ptep_set_wrprotect(struct mm_struct
>> *mm, unsigned long addres
>>   }
>>   #endif
>>   +#ifndef ptep_set_wrprotects
>> +struct mm_struct;
>> +static inline void ptep_set_wrprotects(struct mm_struct *mm,
>> +                unsigned long address, pte_t *ptep,
>> +                unsigned int nr)
>> +{
>> +    unsigned int i;
>> +
>> +    for (i = 0; i < nr; i++, address += PAGE_SIZE, ptep++)
>> +        ptep_set_wrprotect(mm, address, ptep);
>> +}
>> +#endif
>> +
>>   /*
>>    * On some architectures hardware does not set page access bit when accessing
>>    * memory page, it is responsibility of software setting this bit. It brings
>> diff --git a/mm/memory.c b/mm/memory.c
>> index 1f18ed4a5497..b7c8228883cf 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -921,46 +921,129 @@ copy_present_page(struct vm_area_struct *dst_vma,
>> struct vm_area_struct *src_vma
>>           /* Uffd-wp needs to be delivered to dest pte as well */
>>           pte = pte_mkuffd_wp(pte);
>>       set_pte_at(dst_vma->vm_mm, addr, dst_pte, pte);
>> -    return 0;
>> +    return 1;
>> +}
>> +
>> +static inline unsigned long page_cont_mapped_vaddr(struct page *page,
>> +                struct page *anchor, unsigned long anchor_vaddr)
>> +{
>> +    unsigned long offset;
>> +    unsigned long vaddr;
>> +
>> +    offset = (page_to_pfn(page) - page_to_pfn(anchor)) << PAGE_SHIFT;
>> +    vaddr = anchor_vaddr + offset;
>> +
>> +    if (anchor > page) {
>> +        if (vaddr > anchor_vaddr)
>> +            return 0;
>> +    } else {
>> +        if (vaddr < anchor_vaddr)
>> +            return ULONG_MAX;
>> +    }
>> +
>> +    return vaddr;
>> +}
>> +
>> +static int folio_nr_pages_cont_mapped(struct folio *folio,
>> +                      struct page *page, pte_t *pte,
>> +                      unsigned long addr, unsigned long end,
>> +                      pte_t ptent, bool *any_dirty)
>> +{
>> +    int floops;
>> +    int i;
>> +    unsigned long pfn;
>> +    pgprot_t prot;
>> +    struct page *folio_end;
>> +
>> +    if (!folio_test_large(folio))
>> +        return 1;
>> +
>> +    folio_end = &folio->page + folio_nr_pages(folio);
>> +    end = min(page_cont_mapped_vaddr(folio_end, page, addr), end);
>> +    floops = (end - addr) >> PAGE_SHIFT;
>> +    pfn = page_to_pfn(page);
>> +    prot = pte_pgprot(pte_mkold(pte_mkclean(ptent)));
>> +
>> +    *any_dirty = pte_dirty(ptent);
>> +
>> +    pfn++;
>> +    pte++;
>> +
>> +    for (i = 1; i < floops; i++) {
>> +        ptent = ptep_get(pte);
>> +        ptent = pte_mkold(pte_mkclean(ptent));
>> +
>> +        if (!pte_present(ptent) || pte_pfn(ptent) != pfn ||
>> +            pgprot_val(pte_pgprot(ptent)) != pgprot_val(prot))
>> +            break;
>> +
>> +        if (pte_dirty(ptent))
>> +            *any_dirty = true;
>> +
>> +        pfn++;
>> +        pte++;
>> +    }
>> +
>> +    return i;
>>   }
>>     /*
>> - * Copy one pte.  Returns 0 if succeeded, or -EAGAIN if one preallocated page
>> - * is required to copy this pte.
>> + * Copy set of contiguous ptes.  Returns number of ptes copied if succeeded
>> + * (always gte 1), or -EAGAIN if one preallocated page is required to copy the
>> + * first pte.
>>    */
>>   static inline int
>> -copy_present_pte(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
>> -         pte_t *dst_pte, pte_t *src_pte, unsigned long addr, int *rss,
>> -         struct folio **prealloc)
>> +copy_present_ptes(struct vm_area_struct *dst_vma, struct vm_area_struct
>> *src_vma,
>> +          pte_t *dst_pte, pte_t *src_pte,
>> +          unsigned long addr, unsigned long end,
>> +          int *rss, struct folio **prealloc)
>>   {
>>       struct mm_struct *src_mm = src_vma->vm_mm;
>>       unsigned long vm_flags = src_vma->vm_flags;
>>       pte_t pte = ptep_get(src_pte);
>>       struct page *page;
>>       struct folio *folio;
>> +    int nr = 1;
>> +    bool anon;
>> +    bool any_dirty = pte_dirty(pte);
>> +    int i;
>>         page = vm_normal_page(src_vma, addr, pte);
>> -    if (page)
>> +    if (page) {
>>           folio = page_folio(page);
>> -    if (page && folio_test_anon(folio)) {
>> -        /*
>> -         * If this page may have been pinned by the parent process,
>> -         * copy the page immediately for the child so that we'll always
>> -         * guarantee the pinned page won't be randomly replaced in the
>> -         * future.
>> -         */
>> -        folio_get(folio);
>> -        if (unlikely(page_try_dup_anon_rmap(page, false, src_vma))) {
>> -            /* Page may be pinned, we have to copy. */
>> -            folio_put(folio);
>> -            return copy_present_page(dst_vma, src_vma, dst_pte, src_pte,
>> -                         addr, rss, prealloc, page);
>> +        anon = folio_test_anon(folio);
>> +        nr = folio_nr_pages_cont_mapped(folio, page, src_pte, addr,
>> +                        end, pte, &any_dirty);
>> +
>> +        for (i = 0; i < nr; i++, page++) {
>> +            if (anon) {
>> +                /*
>> +                 * If this page may have been pinned by the
>> +                 * parent process, copy the page immediately for
>> +                 * the child so that we'll always guarantee the
>> +                 * pinned page won't be randomly replaced in the
>> +                 * future.
>> +                 */
>> +                if (unlikely(page_try_dup_anon_rmap(
>> +                        page, false, src_vma))) {
>> +                    if (i != 0)
>> +                        break;
>> +                    /* Page may be pinned, we have to copy. */
>> +                    return copy_present_page(
>> +                        dst_vma, src_vma, dst_pte,
>> +                        src_pte, addr, rss, prealloc,
>> +                        page);
>> +                }
>> +                rss[MM_ANONPAGES]++;
>> +                VM_BUG_ON(PageAnonExclusive(page));
>> +            } else {
>> +                page_dup_file_rmap(page, false);
>> +                rss[mm_counter_file(page)]++;
>> +            }
>>           }
>> -        rss[MM_ANONPAGES]++;
>> -    } else if (page) {
>> -        folio_get(folio);
>> -        page_dup_file_rmap(page, false);
>> -        rss[mm_counter_file(page)]++;
>> +
>> +        nr = i;
>> +        folio_ref_add(folio, nr);
>>       }
>>         /*
>> @@ -968,24 +1051,28 @@ copy_present_pte(struct vm_area_struct *dst_vma, struct
>> vm_area_struct *src_vma,
>>        * in the parent and the child
>>        */
>>       if (is_cow_mapping(vm_flags) && pte_write(pte)) {
>> -        ptep_set_wrprotect(src_mm, addr, src_pte);
>> +        ptep_set_wrprotects(src_mm, addr, src_pte, nr);
>>           pte = pte_wrprotect(pte);
> 
> You likely want an "any_pte_writable" check here instead, no?
> 
> Any operations that target a single indiividual PTE while multiple PTEs are
> adjusted are suspicious :)

The idea is that I've already constrained the batch of pages such that the
permissions are all the same (see folio_nr_pages_cont_mapped()). So if the first
pte is writable, then they all are - something has gone badly wrong if some are
writable and others are not.

The dirty bit has an any_dirty special case because we (deliberately) don't
consider access/dirty when determining the batch. Given that the batch is all covered
by the same folio, and the kernel maintains the access/dirty info per-folio, we
don't want to unnecessarily reduce the batch size just because one of the pages
in the folio has been written to.

>
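
A minimal standalone model of the two constraints described above, with invented names and no claim to match the real code: writability must be uniform across the batch (so a single write-protect decision based on the first pte is valid for all of them), while the dirty bit never terminates the batch and is only folded into any_dirty.

/* Minimal model of the batching rules described above; not kernel code. */
#include <assert.h>
#include <stdbool.h>

struct model_pte { bool write; bool dirty; };

/* Returns the batch length; dirty only accumulates, writability must match. */
static int scan_batch(const struct model_pte *pte, int max, bool *any_dirty)
{
        int i;

        *any_dirty = pte[0].dirty;
        for (i = 1; i < max; i++) {
                if (pte[i].write != pte[0].write)
                        break;
                if (pte[i].dirty)
                        *any_dirty = true;
        }
        return i;
}

int main(void)
{
        struct model_pte pte[] = {
                { .write = true },
                { .write = true, .dirty = true }, /* dirty does not split the batch */
                { .write = true },
        };
        bool any_dirty;
        int nr = scan_batch(pte, 3, &any_dirty);

        /* all three share pte[0].write, so one wrprotect decision covers them */
        assert(nr == 3 && any_dirty);
        return 0;
}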
David Hildenbrand Nov. 16, 2023, 1:20 p.m. UTC | #14
On 16.11.23 12:20, Ryan Roberts wrote:
> On 16/11/2023 11:03, David Hildenbrand wrote:
>> On 15.11.23 17:30, Ryan Roberts wrote:
>>> [... full patch quote snipped; see the copy earlier in the thread ...]
>>> @@ -968,24 +1051,28 @@ copy_present_pte(struct vm_area_struct *dst_vma, struct
>>> vm_area_struct *src_vma,
>>>         * in the parent and the child
>>>         */
>>>        if (is_cow_mapping(vm_flags) && pte_write(pte)) {
>>> -        ptep_set_wrprotect(src_mm, addr, src_pte);
>>> +        ptep_set_wrprotects(src_mm, addr, src_pte, nr);
>>>            pte = pte_wrprotect(pte);
>>
>> You likely want an "any_pte_writable" check here instead, no?
>>
>> Any operations that target a single indiividual PTE while multiple PTEs are
>> adjusted are suspicious :)
> 
> The idea is that I've already constrained the batch of pages such that the
> permissions are all the same (see folio_nr_pages_cont_mapped()). So if the first
> pte is writable, then they all are - something has gone badly wrong if some are
> writable and others are not.

I wonder if it would be cleaner and easier to not do that, though.

Simply record if any pte is writable. Afterwards they will *all* be R/O 
and you can set the cont bit, correct?
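
A short standalone model of this suggestion, again with invented names rather than kernel types: scan the batch once for any writable pte and, for a CoW mapping, write-protect the whole batch based on that single flag, so the write bit no longer has to match across the batch.

/* Toy model of the "any writable -> wrprotect them all" idea; not kernel code. */
#include <stdbool.h>
#include <stdio.h>

struct model_pte { bool write; };

static void wrprotect_cow_batch(struct model_pte *pte, int nr)
{
        bool any_writable = false;
        int i;

        for (i = 0; i < nr; i++)
                any_writable |= pte[i].write;

        if (!any_writable)
                return;                 /* nothing to do, already all R/O */

        for (i = 0; i < nr; i++)
                pte[i].write = false;   /* parent and child both end up R/O */
}

int main(void)
{
        struct model_pte pte[] = { { true }, { false }, { true } };

        wrprotect_cow_batch(pte, 3);
        printf("%d %d %d\n", pte[0].write, pte[1].write, pte[2].write); /* 0 0 0 */
        return 0;
}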
Ryan Roberts Nov. 16, 2023, 1:49 p.m. UTC | #15
On 16/11/2023 13:20, David Hildenbrand wrote:
> On 16.11.23 12:20, Ryan Roberts wrote:
>> On 16/11/2023 11:03, David Hildenbrand wrote:
>>> On 15.11.23 17:30, Ryan Roberts wrote:
>>>> [... full patch quote snipped; see the copy earlier in the thread ...]
>>>> @@ -968,24 +1051,28 @@ copy_present_pte(struct vm_area_struct *dst_vma, struct
>>>> vm_area_struct *src_vma,
>>>>         * in the parent and the child
>>>>         */
>>>>        if (is_cow_mapping(vm_flags) && pte_write(pte)) {
>>>> -        ptep_set_wrprotect(src_mm, addr, src_pte);
>>>> +        ptep_set_wrprotects(src_mm, addr, src_pte, nr);
>>>>            pte = pte_wrprotect(pte);
>>>
>>> You likely want an "any_pte_writable" check here instead, no?
>>>
>>> Any operations that target a single indiividual PTE while multiple PTEs are
>>> adjusted are suspicious :)
>>
>> The idea is that I've already constrained the batch of pages such that the
>> permissions are all the same (see folio_nr_pages_cont_mapped()). So if the first
>> pte is writable, then they all are - something has gone badly wrong if some are
>> writable and others are not.
> 
> I wonder if it would be cleaner and easier to not do that, though.
> 
> Simply record if any pte is writable. Afterwards they will *all* be R/O and you
> can set the cont bit, correct?

Oh I see what you mean - that only works for CoW mappings though. If you have a
shared mapping, you won't be making it read-only at fork. So if we ignore the
pte_write() state when demarcating the batches, we will end up with a batch of
pages with a mix of RO and RW in the parent, but then we call set_ptes() for the
child and those pages will all get the permissions of the first page of the batch.

I guess we could special-case CoW mappings and do it the way you suggested;
it might be faster, but it's certainly not cleaner or easier IMHO.

>
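
A toy standalone model of the failure mode described above (invented types, nothing kernel-specific): if the write bit is ignored when building the batch and the child's entries are all populated from the first pte's value, a shared batch of [RW, RO, RW] silently becomes [RW, RW, RW] in the child.

/* Toy illustration of the mixed RO/RW problem for shared mappings. */
#include <stdio.h>
#include <stdbool.h>

struct model_pte { bool write; };

int main(void)
{
        struct model_pte parent[] = { { true }, { false }, { true } };
        struct model_pte child[3];
        int i;

        /* mimic a batched copy: every entry takes the first pte's permissions */
        for (i = 0; i < 3; i++)
                child[i] = parent[0];

        for (i = 0; i < 3; i++)
                printf("pte %d: parent %s, child %s\n", i,
                       parent[i].write ? "RW" : "RO",
                       child[i].write ? "RW" : "RO");
        /* pte 1 is RO in the parent but RW in the child - wrong for shared */
        return 0;
}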
David Hildenbrand Nov. 16, 2023, 2:13 p.m. UTC | #16
On 16.11.23 14:49, Ryan Roberts wrote:
> On 16/11/2023 13:20, David Hildenbrand wrote:
>> On 16.11.23 12:20, Ryan Roberts wrote:
>>> On 16/11/2023 11:03, David Hildenbrand wrote:
>>>> On 15.11.23 17:30, Ryan Roberts wrote:
>>>>> [... full patch quote snipped; see the copy earlier in the thread ...]
>>>>> @@ -968,24 +1051,28 @@ copy_present_pte(struct vm_area_struct *dst_vma, struct
>>>>> vm_area_struct *src_vma,
>>>>>          * in the parent and the child
>>>>>          */
>>>>>         if (is_cow_mapping(vm_flags) && pte_write(pte)) {
>>>>> -        ptep_set_wrprotect(src_mm, addr, src_pte);
>>>>> +        ptep_set_wrprotects(src_mm, addr, src_pte, nr);
>>>>>             pte = pte_wrprotect(pte);
>>>>
>>>> You likely want an "any_pte_writable" check here instead, no?
>>>>
>>>> Any operations that target a single indiividual PTE while multiple PTEs are
>>>> adjusted are suspicious :)
>>>
>>> The idea is that I've already constrained the batch of pages such that the
>>> permissions are all the same (see folio_nr_pages_cont_mapped()). So if the first
>>> pte is writable, then they all are - something has gone badly wrong if some are
>>> writable and others are not.
>>
>> I wonder if it would be cleaner and easier to not do that, though.
>>
>> Simply record if any pte is writable. Afterwards they will *all* be R/O and you
>> can set the cont bit, correct?
> 
> Oh I see what you mean - that only works for cow mappings though. If you have a
> shared mapping, you won't be making it read-only at fork. So if we ignore
> pte_write() state when demarking the batches, we will end up with a batch of
> pages with a mix of RO and RW in the parent, but then we set_ptes() for the
> child and those pages will all have the permissions of the first page of the batch.

I see what you mean.

After fork(), all anon pages will be R/O in the parent and the child. 
Easy. If any PTE is writable, wrprotect all in the parent and the child.

After fork(), all shared pages can be R/O or R/W in the parent. For 
simplicity, I think you can simply set them all R/O in the child. So if 
any PTE is writable, wrprotect all in the child.

Why? In the default case, fork() does not even care about MAP_SHARED 
mappings; it does not copy the page tables/ptes. See vma_needs_copy().

Only in corner cases (e.g., uffd-wp, VM_PFNMAP, VM_MIXEDMAP), or in 
MAP_PRIVATE mappings, can you even end up in that code.

In MAP_PRIVATE mappings, only anon pages can be R/W; other pages can 
never be writable, so it does not matter. In VM_PFNMAP/VM_MIXEDMAP 
mappings, all the permissions likely match either way.

So you might just wrprotect the !anon pages R/O for the child and nobody 
should really notice it; write faults will resolve it.

Famous last words :)
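
Putting that into a rough standalone model (invented names; a sketch under the assumptions above, not the eventual kernel code): for a CoW batch, any writable pte means both parent and child get write-protected; for the shared corner cases, the parent is left untouched and the child is simply made read-only, relying on write faults to fix things up later.

/* Rough model of the handling sketched above; not kernel code. */
#include <stdbool.h>

struct model_pte { bool write; };

static void copy_batch(struct model_pte *parent, struct model_pte *child,
                       int nr, bool cow)
{
        bool any_writable = false;
        int i;

        for (i = 0; i < nr; i++)
                any_writable |= parent[i].write;

        for (i = 0; i < nr; i++) {
                child[i] = parent[i];
                if (any_writable) {
                        child[i].write = false;          /* child: all R/O */
                        if (cow)
                                parent[i].write = false; /* parent R/O only for CoW */
                }
        }
}

int main(void)
{
        struct model_pte p[2] = { { true }, { false } };
        struct model_pte c[2];

        copy_batch(p, c, 2, false); /* shared: parent keeps RW, child all R/O */
        copy_batch(p, c, 2, true);  /* CoW: parent and child both end up R/O */
        return 0;
}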
David Hildenbrand Nov. 16, 2023, 2:15 p.m. UTC | #17
On 16.11.23 15:13, David Hildenbrand wrote:
> On 16.11.23 14:49, Ryan Roberts wrote:
>> On 16/11/2023 13:20, David Hildenbrand wrote:
>>> On 16.11.23 12:20, Ryan Roberts wrote:
>>>> On 16/11/2023 11:03, David Hildenbrand wrote:
>>>>> On 15.11.23 17:30, Ryan Roberts wrote:
>>>>>> [... full patch quote snipped; see the copy earlier in the thread ...]
>>>>>> @@ -968,24 +1051,28 @@ copy_present_pte(struct vm_area_struct *dst_vma, struct
>>>>>> vm_area_struct *src_vma,
>>>>>>           * in the parent and the child
>>>>>>           */
>>>>>>          if (is_cow_mapping(vm_flags) && pte_write(pte)) {
>>>>>> -        ptep_set_wrprotect(src_mm, addr, src_pte);
>>>>>> +        ptep_set_wrprotects(src_mm, addr, src_pte, nr);
>>>>>>              pte = pte_wrprotect(pte);
>>>>>
>>>>> You likely want an "any_pte_writable" check here instead, no?
>>>>>
>>>>> Any operations that target a single indiividual PTE while multiple PTEs are
>>>>> adjusted are suspicious :)
>>>>
>>>> The idea is that I've already constrained the batch of pages such that the
>>>> permissions are all the same (see folio_nr_pages_cont_mapped()). So if the first
>>>> pte is writable, then they all are - something has gone badly wrong if some are
>>>> writable and others are not.
>>>
>>> I wonder if it would be cleaner and easier to not do that, though.
>>>
>>> Simply record if any pte is writable. Afterwards they will *all* be R/O and you
>>> can set the cont bit, correct?
>>
>> Oh I see what you mean - that only works for cow mappings though. If you have a
>> shared mapping, you won't be making it read-only at fork. So if we ignore
>> pte_write() state when demarking the batches, we will end up with a batch of
>> pages with a mix of RO and RW in the parent, but then we set_ptes() for the
>> child and those pages will all have the permissions of the first page of the batch.
> 
> I see what you mean.
> 
> After fork(), all anon pages will be R/O in the parent and the child.
> Easy. If any PTE is writable, wrprotect all in the parent and the child.
> 
> After fork(), all shared pages can be R/O or R/W in the parent. For
> simplicity, I think you can simply set them all R/O in the child. So if
> any PTE is writable, wrprotect all in the child.

Or better: if any is R/O, set them all R/O. Otherwise just leave them as is.

But the devil is in the detail.
Ryan Roberts Nov. 16, 2023, 5:58 p.m. UTC | #18
On 16/11/2023 14:15, David Hildenbrand wrote:
> On 16.11.23 15:13, David Hildenbrand wrote:
>> On 16.11.23 14:49, Ryan Roberts wrote:
>>> On 16/11/2023 13:20, David Hildenbrand wrote:
>>>> On 16.11.23 12:20, Ryan Roberts wrote:
>>>>> On 16/11/2023 11:03, David Hildenbrand wrote:
>>>>>> On 15.11.23 17:30, Ryan Roberts wrote:
>>>>>>> [... full patch quote snipped; see the copy earlier in the thread ...]
>>>>>>> -         struct folio **prealloc)
>>>>>>> +copy_present_ptes(struct vm_area_struct *dst_vma, struct vm_area_struct
>>>>>>> *src_vma,
>>>>>>> +          pte_t *dst_pte, pte_t *src_pte,
>>>>>>> +          unsigned long addr, unsigned long end,
>>>>>>> +          int *rss, struct folio **prealloc)
>>>>>>>      {
>>>>>>>          struct mm_struct *src_mm = src_vma->vm_mm;
>>>>>>>          unsigned long vm_flags = src_vma->vm_flags;
>>>>>>>          pte_t pte = ptep_get(src_pte);
>>>>>>>          struct page *page;
>>>>>>>          struct folio *folio;
>>>>>>> +    int nr = 1;
>>>>>>> +    bool anon;
>>>>>>> +    bool any_dirty = pte_dirty(pte);
>>>>>>> +    int i;
>>>>>>>            page = vm_normal_page(src_vma, addr, pte);
>>>>>>> -    if (page)
>>>>>>> +    if (page) {
>>>>>>>              folio = page_folio(page);
>>>>>>> -    if (page && folio_test_anon(folio)) {
>>>>>>> -        /*
>>>>>>> -         * If this page may have been pinned by the parent process,
>>>>>>> -         * copy the page immediately for the child so that we'll always
>>>>>>> -         * guarantee the pinned page won't be randomly replaced in the
>>>>>>> -         * future.
>>>>>>> -         */
>>>>>>> -        folio_get(folio);
>>>>>>> -        if (unlikely(page_try_dup_anon_rmap(page, false, src_vma))) {
>>>>>>> -            /* Page may be pinned, we have to copy. */
>>>>>>> -            folio_put(folio);
>>>>>>> -            return copy_present_page(dst_vma, src_vma, dst_pte, src_pte,
>>>>>>> -                         addr, rss, prealloc, page);
>>>>>>> +        anon = folio_test_anon(folio);
>>>>>>> +        nr = folio_nr_pages_cont_mapped(folio, page, src_pte, addr,
>>>>>>> +                        end, pte, &any_dirty);
>>>>>>> +
>>>>>>> +        for (i = 0; i < nr; i++, page++) {
>>>>>>> +            if (anon) {
>>>>>>> +                /*
>>>>>>> +                 * If this page may have been pinned by the
>>>>>>> +                 * parent process, copy the page immediately for
>>>>>>> +                 * the child so that we'll always guarantee the
>>>>>>> +                 * pinned page won't be randomly replaced in the
>>>>>>> +                 * future.
>>>>>>> +                 */
>>>>>>> +                if (unlikely(page_try_dup_anon_rmap(
>>>>>>> +                        page, false, src_vma))) {
>>>>>>> +                    if (i != 0)
>>>>>>> +                        break;
>>>>>>> +                    /* Page may be pinned, we have to copy. */
>>>>>>> +                    return copy_present_page(
>>>>>>> +                        dst_vma, src_vma, dst_pte,
>>>>>>> +                        src_pte, addr, rss, prealloc,
>>>>>>> +                        page);
>>>>>>> +                }
>>>>>>> +                rss[MM_ANONPAGES]++;
>>>>>>> +                VM_BUG_ON(PageAnonExclusive(page));
>>>>>>> +            } else {
>>>>>>> +                page_dup_file_rmap(page, false);
>>>>>>> +                rss[mm_counter_file(page)]++;
>>>>>>> +            }
>>>>>>>              }
>>>>>>> -        rss[MM_ANONPAGES]++;
>>>>>>> -    } else if (page) {
>>>>>>> -        folio_get(folio);
>>>>>>> -        page_dup_file_rmap(page, false);
>>>>>>> -        rss[mm_counter_file(page)]++;
>>>>>>> +
>>>>>>> +        nr = i;
>>>>>>> +        folio_ref_add(folio, nr);
>>>>>>>          }
>>>>>>>            /*
>>>>>>> @@ -968,24 +1051,28 @@ copy_present_pte(struct vm_area_struct *dst_vma,
>>>>>>> struct
>>>>>>> vm_area_struct *src_vma,
>>>>>>>           * in the parent and the child
>>>>>>>           */
>>>>>>>          if (is_cow_mapping(vm_flags) && pte_write(pte)) {
>>>>>>> -        ptep_set_wrprotect(src_mm, addr, src_pte);
>>>>>>> +        ptep_set_wrprotects(src_mm, addr, src_pte, nr);
>>>>>>>              pte = pte_wrprotect(pte);
>>>>>>
>>>>>> You likely want an "any_pte_writable" check here instead, no?
>>>>>>
>>>>>> Any operations that target a single individual PTE while multiple PTEs are
>>>>>> adjusted are suspicious :)
>>>>>
>>>>> The idea is that I've already constrained the batch of pages such that the
>>>>> permissions are all the same (see folio_nr_pages_cont_mapped()). So if the
>>>>> first
>>>>> pte is writable, then they all are - something has gone badly wrong if some
>>>>> are
>>>>> writable and others are not.
>>>>
>>>> I wonder if it would be cleaner and easier to not do that, though.
>>>>
>>>> Simply record if any pte is writable. Afterwards they will *all* be R/O and you
>>>> can set the cont bit, correct?
>>>
>>> Oh I see what you mean - that only works for cow mappings though. If you have a
>>> shared mapping, you won't be making it read-only at fork. So if we ignore
>>> pte_write() state when demarking the batches, we will end up with a batch of
>>> pages with a mix of RO and RW in the parent, but then we set_ptes() for the
>>> child and those pages will all have the permissions of the first page of the
>>> batch.
>>
>> I see what you mean.
>>
>> After fork(), all anon pages will be R/O in the parent and the child.
>> Easy. If any PTE is writable, wrprotect all in the parent and the child.
>>
>> After fork(), all shared pages can be R/O or R/W in the parent. For
>> simplicity, I think you can simply set them all R/O in the child. So if
>> any PTE is writable, wrprotect all in the child.
> 
> Or better: if any is R/O, set them all R/O. Otherwise just leave them as is.
> 
> But devil is in the detail.

OK I think I follow. I'll implement this for v3. Thanks!
Alistair Popple Nov. 23, 2023, 4:26 a.m. UTC | #19
Ryan Roberts <ryan.roberts@arm.com> writes:

> Convert copy_pte_range() to copy a set of ptes in a batch. A given batch
> maps a physically contiguous block of memory, all belonging to the same
> folio, with the same permissions, and for shared mappings, the same
> dirty state. This will likely improve performance by a tiny amount due
> to batching the folio reference count management and calling set_ptes()
> rather than making individual calls to set_pte_at().
>
> However, the primary motivation for this change is to reduce the number
> of tlb maintenance operations that the arm64 backend has to perform
> during fork, as it is about to add transparent support for the
> "contiguous bit" in its ptes. By write-protecting the parent using the
> new ptep_set_wrprotects() (note the 's' at the end) function, the
> backend can avoid having to unfold contig ranges of PTEs, which is
> expensive, when all ptes in the range are being write-protected.
> Similarly, by using set_ptes() rather than set_pte_at() to set up ptes
> in the child, the backend does not need to fold a contiguous range once
> they are all populated - they can be initially populated as a contiguous
> range in the first place.
>
> This change addresses the core-mm refactoring only, and introduces
> ptep_set_wrprotects() with a default implementation that calls
> ptep_set_wrprotect() for each pte in the range. A separate change will
> implement ptep_set_wrprotects() in the arm64 backend to realize the
> performance improvement as part of the work to enable contpte mappings.
>
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> ---
>  include/linux/pgtable.h |  13 +++
>  mm/memory.c             | 175 +++++++++++++++++++++++++++++++---------
>  2 files changed, 150 insertions(+), 38 deletions(-)
>
> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index af7639c3b0a3..1c50f8a0fdde 100644
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -622,6 +622,19 @@ static inline void ptep_set_wrprotect(struct mm_struct *mm, unsigned long addres
>  }
>  #endif
>  
> +#ifndef ptep_set_wrprotects
> +struct mm_struct;
> +static inline void ptep_set_wrprotects(struct mm_struct *mm,
> +				unsigned long address, pte_t *ptep,
> +				unsigned int nr)
> +{
> +	unsigned int i;
> +
> +	for (i = 0; i < nr; i++, address += PAGE_SIZE, ptep++)
> +		ptep_set_wrprotect(mm, address, ptep);
> +}
> +#endif
> +
>  /*
>   * On some architectures hardware does not set page access bit when accessing
>   * memory page, it is responsibility of software setting this bit. It brings
> diff --git a/mm/memory.c b/mm/memory.c
> index 1f18ed4a5497..b7c8228883cf 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -921,46 +921,129 @@ copy_present_page(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma
>  		/* Uffd-wp needs to be delivered to dest pte as well */
>  		pte = pte_mkuffd_wp(pte);
>  	set_pte_at(dst_vma->vm_mm, addr, dst_pte, pte);
> -	return 0;
> +	return 1;

We should update the function comment to indicate why we return 1 here
because it will become non-obvious in future. But perhaps it's better to
leave this as is and do the error check/return code calculation in
copy_present_ptes().
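
i.e. keep copy_present_page() returning 0 / -EAGAIN and let copy_present_ptes()
translate that, roughly like this (untested; `err' is just a placeholder local):

	err = copy_present_page(dst_vma, src_vma, dst_pte, src_pte,
				addr, rss, prealloc, page);
	return err ? err : 1;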

> +}
> +
> +static inline unsigned long page_cont_mapped_vaddr(struct page *page,
> +				struct page *anchor, unsigned long anchor_vaddr)

It's likely I'm easily confused but the arguments here don't make much
sense to me. Something like this (noting that I've switched the argument
order) makes more sense to me at least:

static inline unsigned long page_cont_mapped_vaddr(struct page *page,
                            unsigned long page_vaddr, struct page *next_folio_page)

> +{
> +	unsigned long offset;
> +	unsigned long vaddr;
> +
> +	offset = (page_to_pfn(page) - page_to_pfn(anchor)) << PAGE_SHIFT;

Which IMHO makes this much more readable:

	offset = (page_to_pfn(next_folio_page) - page_to_pfn(page)) << PAGE_SHIFT;

> +	vaddr = anchor_vaddr + offset;
> +
> +	if (anchor > page) {

And also highlights that I think this condition (page > folio_page_end)
is impossible to hit. Which is good ...

> +		if (vaddr > anchor_vaddr)
> +			return 0;

... because I'm not sure returning 0 is valid as we would end up setting
floops = (0 - addr) >> PAGE_SHIFT which doesn't seem like it would end
particularly well :-)

> +	} else {
> +		if (vaddr < anchor_vaddr)

Same here - isn't the vaddr of the next folio always going to be larger
than the vaddr for the current page? It seems this function is really
just calculating the virtual address of the next folio, or am I deeply
confused?
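
In which case I think the whole helper could reduce to something like this
(untested, using the reordered arguments from above):

static inline unsigned long page_cont_mapped_vaddr(struct page *page,
			unsigned long page_vaddr, struct page *next_folio_page)
{
	unsigned long offset;

	offset = (page_to_pfn(next_folio_page) - page_to_pfn(page)) << PAGE_SHIFT;
	return page_vaddr + offset;
}

with no need for the wrap-around special cases, given the end of the folio is
always at or above the page we start from.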

> +			return ULONG_MAX;
> +	}
> +
> +	return vaddr;
> +}
> +
> +static int folio_nr_pages_cont_mapped(struct folio *folio,
> +				      struct page *page, pte_t *pte,
> +				      unsigned long addr, unsigned long end,
> +				      pte_t ptent, bool *any_dirty)
> +{
> +	int floops;
> +	int i;
> +	unsigned long pfn;
> +	pgprot_t prot;
> +	struct page *folio_end;
> +
> +	if (!folio_test_large(folio))
> +		return 1;
> +
> +	folio_end = &folio->page + folio_nr_pages(folio);

I think you can replace this with:

folio_end = folio_next(folio)

Although given this is only passed to page_cont_mapped_vaddr() perhaps
it's better to just pass the folio in and do the calculation there.
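
i.e. something like (untested; folio_next() gives back a struct folio *, so we
need the page pointer out of it):

	folio_end = &folio_next(folio)->page;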

> +	end = min(page_cont_mapped_vaddr(folio_end, page, addr), end);
> +	floops = (end - addr) >> PAGE_SHIFT;
> +	pfn = page_to_pfn(page);
> +	prot = pte_pgprot(pte_mkold(pte_mkclean(ptent)));
> +
> +	*any_dirty = pte_dirty(ptent);
> +
> +	pfn++;
> +	pte++;
> +
> +	for (i = 1; i < floops; i++) {
> +		ptent = ptep_get(pte);
> +		ptent = pte_mkold(pte_mkclean(ptent));
> +
> +		if (!pte_present(ptent) || pte_pfn(ptent) != pfn ||
> +		    pgprot_val(pte_pgprot(ptent)) != pgprot_val(prot))
> +			break;
> +
> +		if (pte_dirty(ptent))
> +			*any_dirty = true;
> +
> +		pfn++;
> +		pte++;
> +	}
> +
> +	return i;
>  }
>  
>  /*
> - * Copy one pte.  Returns 0 if succeeded, or -EAGAIN if one preallocated page
> - * is required to copy this pte.
> + * Copy set of contiguous ptes.  Returns number of ptes copied if succeeded
> + * (always gte 1), or -EAGAIN if one preallocated page is required to copy the
> + * first pte.
>   */
>  static inline int
> -copy_present_pte(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
> -		 pte_t *dst_pte, pte_t *src_pte, unsigned long addr, int *rss,
> -		 struct folio **prealloc)
> +copy_present_ptes(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
> +		  pte_t *dst_pte, pte_t *src_pte,
> +		  unsigned long addr, unsigned long end,
> +		  int *rss, struct folio **prealloc)
>  {
>  	struct mm_struct *src_mm = src_vma->vm_mm;
>  	unsigned long vm_flags = src_vma->vm_flags;
>  	pte_t pte = ptep_get(src_pte);
>  	struct page *page;
>  	struct folio *folio;
> +	int nr = 1;
> +	bool anon;
> +	bool any_dirty = pte_dirty(pte);
> +	int i;
>  
>  	page = vm_normal_page(src_vma, addr, pte);
> -	if (page)
> +	if (page) {
>  		folio = page_folio(page);
> -	if (page && folio_test_anon(folio)) {
> -		/*
> -		 * If this page may have been pinned by the parent process,
> -		 * copy the page immediately for the child so that we'll always
> -		 * guarantee the pinned page won't be randomly replaced in the
> -		 * future.
> -		 */
> -		folio_get(folio);
> -		if (unlikely(page_try_dup_anon_rmap(page, false, src_vma))) {
> -			/* Page may be pinned, we have to copy. */
> -			folio_put(folio);
> -			return copy_present_page(dst_vma, src_vma, dst_pte, src_pte,
> -						 addr, rss, prealloc, page);
> +		anon = folio_test_anon(folio);
> +		nr = folio_nr_pages_cont_mapped(folio, page, src_pte, addr,
> +						end, pte, &any_dirty);
> +
> +		for (i = 0; i < nr; i++, page++) {
> +			if (anon) {
> +				/*
> +				 * If this page may have been pinned by the
> +				 * parent process, copy the page immediately for
> +				 * the child so that we'll always guarantee the
> +				 * pinned page won't be randomly replaced in the
> +				 * future.
> +				 */
> +				if (unlikely(page_try_dup_anon_rmap(
> +						page, false, src_vma))) {
> +					if (i != 0)
> +						break;
> +					/* Page may be pinned, we have to copy. */
> +					return copy_present_page(
> +						dst_vma, src_vma, dst_pte,
> +						src_pte, addr, rss, prealloc,
> +						page);
> +				}
> +				rss[MM_ANONPAGES]++;
> +				VM_BUG_ON(PageAnonExclusive(page));
> +			} else {
> +				page_dup_file_rmap(page, false);
> +				rss[mm_counter_file(page)]++;
> +			}
>  		}
> -		rss[MM_ANONPAGES]++;
> -	} else if (page) {
> -		folio_get(folio);
> -		page_dup_file_rmap(page, false);
> -		rss[mm_counter_file(page)]++;
> +
> +		nr = i;
> +		folio_ref_add(folio, nr);
>  	}
>  
>  	/*
> @@ -968,24 +1051,28 @@ copy_present_pte(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
>  	 * in the parent and the child
>  	 */
>  	if (is_cow_mapping(vm_flags) && pte_write(pte)) {
> -		ptep_set_wrprotect(src_mm, addr, src_pte);
> +		ptep_set_wrprotects(src_mm, addr, src_pte, nr);
>  		pte = pte_wrprotect(pte);
>  	}
> -	VM_BUG_ON(page && folio_test_anon(folio) && PageAnonExclusive(page));
>  
>  	/*
> -	 * If it's a shared mapping, mark it clean in
> -	 * the child
> +	 * If it's a shared mapping, mark it clean in the child. If its a
> +	 * private mapping, mark it dirty in the child if _any_ of the parent
> +	 * mappings in the block were marked dirty. The contiguous block of
> +	 * mappings are all backed by the same folio, so if any are dirty then
> +	 * the whole folio is dirty. This allows us to determine the batch size
> +	 * without having to ever consider the dirty bit. See
> +	 * folio_nr_pages_cont_mapped().
>  	 */
> -	if (vm_flags & VM_SHARED)
> -		pte = pte_mkclean(pte);
> -	pte = pte_mkold(pte);
> +	pte = pte_mkold(pte_mkclean(pte));
> +	if (!(vm_flags & VM_SHARED) && any_dirty)
> +		pte = pte_mkdirty(pte);
>  
>  	if (!userfaultfd_wp(dst_vma))
>  		pte = pte_clear_uffd_wp(pte);
>  
> -	set_pte_at(dst_vma->vm_mm, addr, dst_pte, pte);
> -	return 0;
> +	set_ptes(dst_vma->vm_mm, addr, dst_pte, pte, nr);
> +	return nr;
>  }
>  
>  static inline struct folio *page_copy_prealloc(struct mm_struct *src_mm,
> @@ -1087,15 +1174,28 @@ copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
>  			 */
>  			WARN_ON_ONCE(ret != -ENOENT);
>  		}
> -		/* copy_present_pte() will clear `*prealloc' if consumed */
> -		ret = copy_present_pte(dst_vma, src_vma, dst_pte, src_pte,
> -				       addr, rss, &prealloc);
> +		/* copy_present_ptes() will clear `*prealloc' if consumed */
> +		ret = copy_present_ptes(dst_vma, src_vma, dst_pte, src_pte,
> +				       addr, end, rss, &prealloc);
> +
>  		/*
>  		 * If we need a pre-allocated page for this pte, drop the
>  		 * locks, allocate, and try again.
>  		 */
>  		if (unlikely(ret == -EAGAIN))
>  			break;
> +
> +		/*
> +		 * Positive return value is the number of ptes copied.
> +		 */
> +		VM_WARN_ON_ONCE(ret < 1);
> +		progress += 8 * ret;
> +		ret--;

Took me a second to figure out what was going on here. I think it would
be clearer to rename ret to nr_ptes ...

> +		dst_pte += ret;
> +		src_pte += ret;
> +		addr += ret << PAGE_SHIFT;
> +		ret = 0;
> +
>  		if (unlikely(prealloc)) {
>  			/*
>  			 * pre-alloc page cannot be reused by next time so as
> @@ -1106,7 +1206,6 @@ copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
>  			folio_put(prealloc);
>  			prealloc = NULL;
>  		}
> -		progress += 8;
>  	} while (dst_pte++, src_pte++, addr += PAGE_SIZE, addr != end);

... and do dst_pte += nr_ptes, etc. here instead (noting of course that
the continue clauses will need nr_ptes == 1, but perhaps reset that at
the start of the loop).
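
Roughly this shape is what I had in mind (untested; with `nr_ptes' declared
where `ret' is today and reset to 1 at the top of each iteration so the
continue paths still advance by one pte):

		nr_ptes = copy_present_ptes(dst_vma, src_vma, dst_pte, src_pte,
					    addr, end, rss, &prealloc);
		if (unlikely(nr_ptes == -EAGAIN))
			break;
		VM_WARN_ON_ONCE(nr_ptes < 1);
		progress += 8 * nr_ptes;

		/* [ prealloc put/reset as in the existing code ] */
	} while (dst_pte += nr_ptes, src_pte += nr_ptes,
		 addr += nr_ptes << PAGE_SHIFT, addr != end);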

>  	arch_leave_lazy_mmu_mode();
Ryan Roberts Nov. 23, 2023, 10:26 a.m. UTC | #20
On 16/11/2023 14:15, David Hildenbrand wrote:
> On 16.11.23 15:13, David Hildenbrand wrote:
>> On 16.11.23 14:49, Ryan Roberts wrote:
>>> On 16/11/2023 13:20, David Hildenbrand wrote:
>>>> On 16.11.23 12:20, Ryan Roberts wrote:
>>>>> On 16/11/2023 11:03, David Hildenbrand wrote:
>>>>>> On 15.11.23 17:30, Ryan Roberts wrote:
>>>>>>> Convert copy_pte_range() to copy a set of ptes in a batch. A given batch
>>>>>>> maps a physically contiguous block of memory, all belonging to the same
>>>>>>> folio, with the same permissions, and for shared mappings, the same
>>>>>>> dirty state. This will likely improve performance by a tiny amount due
>>>>>>> to batching the folio reference count management and calling set_ptes()
>>>>>>> rather than making individual calls to set_pte_at().
>>>>>>>
>>>>>>> However, the primary motivation for this change is to reduce the number
>>>>>>> of tlb maintenance operations that the arm64 backend has to perform
>>>>>>> during fork, as it is about to add transparent support for the
>>>>>>> "contiguous bit" in its ptes. By write-protecting the parent using the
>>>>>>> new ptep_set_wrprotects() (note the 's' at the end) function, the
>>>>>>> backend can avoid having to unfold contig ranges of PTEs, which is
>>>>>>> expensive, when all ptes in the range are being write-protected.
>>>>>>> Similarly, by using set_ptes() rather than set_pte_at() to set up ptes
>>>>>>> in the child, the backend does not need to fold a contiguous range once
>>>>>>> they are all populated - they can be initially populated as a contiguous
>>>>>>> range in the first place.
>>>>>>>
>>>>>>> This change addresses the core-mm refactoring only, and introduces
>>>>>>> ptep_set_wrprotects() with a default implementation that calls
>>>>>>> ptep_set_wrprotect() for each pte in the range. A separate change will
>>>>>>> implement ptep_set_wrprotects() in the arm64 backend to realize the
>>>>>>> performance improvement as part of the work to enable contpte mappings.
>>>>>>>
>>>>>>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>>>>>>> ---
>>>>>>>      include/linux/pgtable.h |  13 +++
>>>>>>>      mm/memory.c             | 175 +++++++++++++++++++++++++++++++---------
>>>>>>>      2 files changed, 150 insertions(+), 38 deletions(-)
>>>>>>>
>>>>>>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
>>>>>>> index af7639c3b0a3..1c50f8a0fdde 100644
>>>>>>> --- a/include/linux/pgtable.h
>>>>>>> +++ b/include/linux/pgtable.h
>>>>>>> @@ -622,6 +622,19 @@ static inline void ptep_set_wrprotect(struct mm_struct
>>>>>>> *mm, unsigned long addres
>>>>>>>      }
>>>>>>>      #endif
>>>>>>>      +#ifndef ptep_set_wrprotects
>>>>>>> +struct mm_struct;
>>>>>>> +static inline void ptep_set_wrprotects(struct mm_struct *mm,
>>>>>>> +                unsigned long address, pte_t *ptep,
>>>>>>> +                unsigned int nr)
>>>>>>> +{
>>>>>>> +    unsigned int i;
>>>>>>> +
>>>>>>> +    for (i = 0; i < nr; i++, address += PAGE_SIZE, ptep++)
>>>>>>> +        ptep_set_wrprotect(mm, address, ptep);
>>>>>>> +}
>>>>>>> +#endif
>>>>>>> +
>>>>>>>      /*
>>>>>>>       * On some architectures hardware does not set page access bit when
>>>>>>> accessing
>>>>>>>       * memory page, it is responsibility of software setting this bit.
>>>>>>> It brings
>>>>>>> diff --git a/mm/memory.c b/mm/memory.c
>>>>>>> index 1f18ed4a5497..b7c8228883cf 100644
>>>>>>> --- a/mm/memory.c
>>>>>>> +++ b/mm/memory.c
>>>>>>> @@ -921,46 +921,129 @@ copy_present_page(struct vm_area_struct *dst_vma,
>>>>>>> struct vm_area_struct *src_vma
>>>>>>>              /* Uffd-wp needs to be delivered to dest pte as well */
>>>>>>>              pte = pte_mkuffd_wp(pte);
>>>>>>>          set_pte_at(dst_vma->vm_mm, addr, dst_pte, pte);
>>>>>>> -    return 0;
>>>>>>> +    return 1;
>>>>>>> +}
>>>>>>> +
>>>>>>> +static inline unsigned long page_cont_mapped_vaddr(struct page *page,
>>>>>>> +                struct page *anchor, unsigned long anchor_vaddr)
>>>>>>> +{
>>>>>>> +    unsigned long offset;
>>>>>>> +    unsigned long vaddr;
>>>>>>> +
>>>>>>> +    offset = (page_to_pfn(page) - page_to_pfn(anchor)) << PAGE_SHIFT;
>>>>>>> +    vaddr = anchor_vaddr + offset;
>>>>>>> +
>>>>>>> +    if (anchor > page) {
>>>>>>> +        if (vaddr > anchor_vaddr)
>>>>>>> +            return 0;
>>>>>>> +    } else {
>>>>>>> +        if (vaddr < anchor_vaddr)
>>>>>>> +            return ULONG_MAX;
>>>>>>> +    }
>>>>>>> +
>>>>>>> +    return vaddr;
>>>>>>> +}
>>>>>>> +
>>>>>>> +static int folio_nr_pages_cont_mapped(struct folio *folio,
>>>>>>> +                      struct page *page, pte_t *pte,
>>>>>>> +                      unsigned long addr, unsigned long end,
>>>>>>> +                      pte_t ptent, bool *any_dirty)
>>>>>>> +{
>>>>>>> +    int floops;
>>>>>>> +    int i;
>>>>>>> +    unsigned long pfn;
>>>>>>> +    pgprot_t prot;
>>>>>>> +    struct page *folio_end;
>>>>>>> +
>>>>>>> +    if (!folio_test_large(folio))
>>>>>>> +        return 1;
>>>>>>> +
>>>>>>> +    folio_end = &folio->page + folio_nr_pages(folio);
>>>>>>> +    end = min(page_cont_mapped_vaddr(folio_end, page, addr), end);
>>>>>>> +    floops = (end - addr) >> PAGE_SHIFT;
>>>>>>> +    pfn = page_to_pfn(page);
>>>>>>> +    prot = pte_pgprot(pte_mkold(pte_mkclean(ptent)));
>>>>>>> +
>>>>>>> +    *any_dirty = pte_dirty(ptent);
>>>>>>> +
>>>>>>> +    pfn++;
>>>>>>> +    pte++;
>>>>>>> +
>>>>>>> +    for (i = 1; i < floops; i++) {
>>>>>>> +        ptent = ptep_get(pte);
>>>>>>> +        ptent = pte_mkold(pte_mkclean(ptent));
>>>>>>> +
>>>>>>> +        if (!pte_present(ptent) || pte_pfn(ptent) != pfn ||
>>>>>>> +            pgprot_val(pte_pgprot(ptent)) != pgprot_val(prot))
>>>>>>> +            break;
>>>>>>> +
>>>>>>> +        if (pte_dirty(ptent))
>>>>>>> +            *any_dirty = true;
>>>>>>> +
>>>>>>> +        pfn++;
>>>>>>> +        pte++;
>>>>>>> +    }
>>>>>>> +
>>>>>>> +    return i;
>>>>>>>      }
>>>>>>>        /*
>>>>>>> - * Copy one pte.  Returns 0 if succeeded, or -EAGAIN if one preallocated
>>>>>>> page
>>>>>>> - * is required to copy this pte.
>>>>>>> + * Copy set of contiguous ptes.  Returns number of ptes copied if succeeded
>>>>>>> + * (always gte 1), or -EAGAIN if one preallocated page is required to
>>>>>>> copy the
>>>>>>> + * first pte.
>>>>>>>       */
>>>>>>>      static inline int
>>>>>>> -copy_present_pte(struct vm_area_struct *dst_vma, struct vm_area_struct
>>>>>>> *src_vma,
>>>>>>> -         pte_t *dst_pte, pte_t *src_pte, unsigned long addr, int *rss,
>>>>>>> -         struct folio **prealloc)
>>>>>>> +copy_present_ptes(struct vm_area_struct *dst_vma, struct vm_area_struct
>>>>>>> *src_vma,
>>>>>>> +          pte_t *dst_pte, pte_t *src_pte,
>>>>>>> +          unsigned long addr, unsigned long end,
>>>>>>> +          int *rss, struct folio **prealloc)
>>>>>>>      {
>>>>>>>          struct mm_struct *src_mm = src_vma->vm_mm;
>>>>>>>          unsigned long vm_flags = src_vma->vm_flags;
>>>>>>>          pte_t pte = ptep_get(src_pte);
>>>>>>>          struct page *page;
>>>>>>>          struct folio *folio;
>>>>>>> +    int nr = 1;
>>>>>>> +    bool anon;
>>>>>>> +    bool any_dirty = pte_dirty(pte);
>>>>>>> +    int i;
>>>>>>>            page = vm_normal_page(src_vma, addr, pte);
>>>>>>> -    if (page)
>>>>>>> +    if (page) {
>>>>>>>              folio = page_folio(page);
>>>>>>> -    if (page && folio_test_anon(folio)) {
>>>>>>> -        /*
>>>>>>> -         * If this page may have been pinned by the parent process,
>>>>>>> -         * copy the page immediately for the child so that we'll always
>>>>>>> -         * guarantee the pinned page won't be randomly replaced in the
>>>>>>> -         * future.
>>>>>>> -         */
>>>>>>> -        folio_get(folio);
>>>>>>> -        if (unlikely(page_try_dup_anon_rmap(page, false, src_vma))) {
>>>>>>> -            /* Page may be pinned, we have to copy. */
>>>>>>> -            folio_put(folio);
>>>>>>> -            return copy_present_page(dst_vma, src_vma, dst_pte, src_pte,
>>>>>>> -                         addr, rss, prealloc, page);
>>>>>>> +        anon = folio_test_anon(folio);
>>>>>>> +        nr = folio_nr_pages_cont_mapped(folio, page, src_pte, addr,
>>>>>>> +                        end, pte, &any_dirty);
>>>>>>> +
>>>>>>> +        for (i = 0; i < nr; i++, page++) {
>>>>>>> +            if (anon) {
>>>>>>> +                /*
>>>>>>> +                 * If this page may have been pinned by the
>>>>>>> +                 * parent process, copy the page immediately for
>>>>>>> +                 * the child so that we'll always guarantee the
>>>>>>> +                 * pinned page won't be randomly replaced in the
>>>>>>> +                 * future.
>>>>>>> +                 */
>>>>>>> +                if (unlikely(page_try_dup_anon_rmap(
>>>>>>> +                        page, false, src_vma))) {
>>>>>>> +                    if (i != 0)
>>>>>>> +                        break;
>>>>>>> +                    /* Page may be pinned, we have to copy. */
>>>>>>> +                    return copy_present_page(
>>>>>>> +                        dst_vma, src_vma, dst_pte,
>>>>>>> +                        src_pte, addr, rss, prealloc,
>>>>>>> +                        page);
>>>>>>> +                }
>>>>>>> +                rss[MM_ANONPAGES]++;
>>>>>>> +                VM_BUG_ON(PageAnonExclusive(page));
>>>>>>> +            } else {
>>>>>>> +                page_dup_file_rmap(page, false);
>>>>>>> +                rss[mm_counter_file(page)]++;
>>>>>>> +            }
>>>>>>>              }
>>>>>>> -        rss[MM_ANONPAGES]++;
>>>>>>> -    } else if (page) {
>>>>>>> -        folio_get(folio);
>>>>>>> -        page_dup_file_rmap(page, false);
>>>>>>> -        rss[mm_counter_file(page)]++;
>>>>>>> +
>>>>>>> +        nr = i;
>>>>>>> +        folio_ref_add(folio, nr);
>>>>>>>          }
>>>>>>>            /*
>>>>>>> @@ -968,24 +1051,28 @@ copy_present_pte(struct vm_area_struct *dst_vma,
>>>>>>> struct
>>>>>>> vm_area_struct *src_vma,
>>>>>>>           * in the parent and the child
>>>>>>>           */
>>>>>>>          if (is_cow_mapping(vm_flags) && pte_write(pte)) {
>>>>>>> -        ptep_set_wrprotect(src_mm, addr, src_pte);
>>>>>>> +        ptep_set_wrprotects(src_mm, addr, src_pte, nr);
>>>>>>>              pte = pte_wrprotect(pte);
>>>>>>
>>>>>> You likely want an "any_pte_writable" check here instead, no?
>>>>>>
>>>>>> Any operations that target a single individual PTE while multiple PTEs are
>>>>>> adjusted are suspicious :)
>>>>>
>>>>> The idea is that I've already constrained the batch of pages such that the
>>>>> permissions are all the same (see folio_nr_pages_cont_mapped()). So if the
>>>>> first
>>>>> pte is writable, then they all are - something has gone badly wrong if some
>>>>> are
>>>>> writable and others are not.
>>>>
>>>> I wonder if it would be cleaner and easier to not do that, though.
>>>>
>>>> Simply record if any pte is writable. Afterwards they will *all* be R/O and you
>>>> can set the cont bit, correct?
>>>
>>> Oh I see what you mean - that only works for cow mappings though. If you have a
>>> shared mapping, you won't be making it read-only at fork. So if we ignore
>>> pte_write() state when demarking the batches, we will end up with a batch of
>>> pages with a mix of RO and RW in the parent, but then we set_ptes() for the
>>> child and those pages will all have the permissions of the first page of the
>>> batch.
>>
>> I see what you mean.
>>
>> After fork(), all anon pages will be R/O in the parent and the child.
>> Easy. If any PTE is writable, wrprotect all in the parent and the child.
>>
>> After fork(), all shared pages can be R/O or R/W in the parent. For
>> simplicity, I think you can simply set them all R/O in the child. So if
>> any PTE is writable, wrprotect all in the child.
> 
> Or better: if any is R/O, set them all R/O. Otherwise just leave them as is.

I've just come back to this to code it up, and want to clarify this last
comment; I'm already going to have to collect any_writable for the anon case, so
I will already have that info for the shared case too. I think you are
suggesting I *additionally* collect any_readonly, then in the shared case, I
only apply wrprotect if (any_writable && any_readonly). i.e. only apply
wrprotect if there is a mix of permissions for the batch, otherwise all the
permissions are the same (either all RW or all RO) and I can elide the wrprotect.
Is that what you meant?
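
In code, roughly this is what I'm thinking of (rough sketch only, untested;
any_readonly derived however works out simplest):

	if (is_cow_mapping(vm_flags)) {
		if (any_writable) {
			ptep_set_wrprotects(src_mm, addr, src_pte, nr);
			pte = pte_wrprotect(pte);
		}
	} else if (any_writable && any_readonly) {
		/* mixed permissions in the batch: make the child all R/O */
		pte = pte_wrprotect(pte);
	}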


> 
> But devil is in the detail.
>
David Hildenbrand Nov. 23, 2023, 12:12 p.m. UTC | #21
On 23.11.23 11:26, Ryan Roberts wrote:
> On 16/11/2023 14:15, David Hildenbrand wrote:
>> On 16.11.23 15:13, David Hildenbrand wrote:
>>> On 16.11.23 14:49, Ryan Roberts wrote:
>>>> On 16/11/2023 13:20, David Hildenbrand wrote:
>>>>> On 16.11.23 12:20, Ryan Roberts wrote:
>>>>>> On 16/11/2023 11:03, David Hildenbrand wrote:
>>>>>>> On 15.11.23 17:30, Ryan Roberts wrote:
>>>>>>>> Convert copy_pte_range() to copy a set of ptes in a batch. A given batch
>>>>>>>> maps a physically contiguous block of memory, all belonging to the same
>>>>>>>> folio, with the same permissions, and for shared mappings, the same
>>>>>>>> dirty state. This will likely improve performance by a tiny amount due
>>>>>>>> to batching the folio reference count management and calling set_ptes()
>>>>>>>> rather than making individual calls to set_pte_at().
>>>>>>>>
>>>>>>>> However, the primary motivation for this change is to reduce the number
>>>>>>>> of tlb maintenance operations that the arm64 backend has to perform
>>>>>>>> during fork, as it is about to add transparent support for the
>>>>>>>> "contiguous bit" in its ptes. By write-protecting the parent using the
>>>>>>>> new ptep_set_wrprotects() (note the 's' at the end) function, the
>>>>>>>> backend can avoid having to unfold contig ranges of PTEs, which is
>>>>>>>> expensive, when all ptes in the range are being write-protected.
>>>>>>>> Similarly, by using set_ptes() rather than set_pte_at() to set up ptes
>>>>>>>> in the child, the backend does not need to fold a contiguous range once
>>>>>>>> they are all populated - they can be initially populated as a contiguous
>>>>>>>> range in the first place.
>>>>>>>>
>>>>>>>> This change addresses the core-mm refactoring only, and introduces
>>>>>>>> ptep_set_wrprotects() with a default implementation that calls
>>>>>>>> ptep_set_wrprotect() for each pte in the range. A separate change will
>>>>>>>> implement ptep_set_wrprotects() in the arm64 backend to realize the
>>>>>>>> performance improvement as part of the work to enable contpte mappings.
>>>>>>>>
>>>>>>>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>>>>>>>> ---
>>>>>>>>       include/linux/pgtable.h |  13 +++
>>>>>>>>       mm/memory.c             | 175 +++++++++++++++++++++++++++++++---------
>>>>>>>>       2 files changed, 150 insertions(+), 38 deletions(-)
>>>>>>>>
>>>>>>>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
>>>>>>>> index af7639c3b0a3..1c50f8a0fdde 100644
>>>>>>>> --- a/include/linux/pgtable.h
>>>>>>>> +++ b/include/linux/pgtable.h
>>>>>>>> @@ -622,6 +622,19 @@ static inline void ptep_set_wrprotect(struct mm_struct
>>>>>>>> *mm, unsigned long addres
>>>>>>>>       }
>>>>>>>>       #endif
>>>>>>>>       +#ifndef ptep_set_wrprotects
>>>>>>>> +struct mm_struct;
>>>>>>>> +static inline void ptep_set_wrprotects(struct mm_struct *mm,
>>>>>>>> +                unsigned long address, pte_t *ptep,
>>>>>>>> +                unsigned int nr)
>>>>>>>> +{
>>>>>>>> +    unsigned int i;
>>>>>>>> +
>>>>>>>> +    for (i = 0; i < nr; i++, address += PAGE_SIZE, ptep++)
>>>>>>>> +        ptep_set_wrprotect(mm, address, ptep);
>>>>>>>> +}
>>>>>>>> +#endif
>>>>>>>> +
>>>>>>>>       /*
>>>>>>>>        * On some architectures hardware does not set page access bit when
>>>>>>>> accessing
>>>>>>>>        * memory page, it is responsibility of software setting this bit.
>>>>>>>> It brings
>>>>>>>> diff --git a/mm/memory.c b/mm/memory.c
>>>>>>>> index 1f18ed4a5497..b7c8228883cf 100644
>>>>>>>> --- a/mm/memory.c
>>>>>>>> +++ b/mm/memory.c
>>>>>>>> @@ -921,46 +921,129 @@ copy_present_page(struct vm_area_struct *dst_vma,
>>>>>>>> struct vm_area_struct *src_vma
>>>>>>>>               /* Uffd-wp needs to be delivered to dest pte as well */
>>>>>>>>               pte = pte_mkuffd_wp(pte);
>>>>>>>>           set_pte_at(dst_vma->vm_mm, addr, dst_pte, pte);
>>>>>>>> -    return 0;
>>>>>>>> +    return 1;
>>>>>>>> +}
>>>>>>>> +
>>>>>>>> +static inline unsigned long page_cont_mapped_vaddr(struct page *page,
>>>>>>>> +                struct page *anchor, unsigned long anchor_vaddr)
>>>>>>>> +{
>>>>>>>> +    unsigned long offset;
>>>>>>>> +    unsigned long vaddr;
>>>>>>>> +
>>>>>>>> +    offset = (page_to_pfn(page) - page_to_pfn(anchor)) << PAGE_SHIFT;
>>>>>>>> +    vaddr = anchor_vaddr + offset;
>>>>>>>> +
>>>>>>>> +    if (anchor > page) {
>>>>>>>> +        if (vaddr > anchor_vaddr)
>>>>>>>> +            return 0;
>>>>>>>> +    } else {
>>>>>>>> +        if (vaddr < anchor_vaddr)
>>>>>>>> +            return ULONG_MAX;
>>>>>>>> +    }
>>>>>>>> +
>>>>>>>> +    return vaddr;
>>>>>>>> +}
>>>>>>>> +
>>>>>>>> +static int folio_nr_pages_cont_mapped(struct folio *folio,
>>>>>>>> +                      struct page *page, pte_t *pte,
>>>>>>>> +                      unsigned long addr, unsigned long end,
>>>>>>>> +                      pte_t ptent, bool *any_dirty)
>>>>>>>> +{
>>>>>>>> +    int floops;
>>>>>>>> +    int i;
>>>>>>>> +    unsigned long pfn;
>>>>>>>> +    pgprot_t prot;
>>>>>>>> +    struct page *folio_end;
>>>>>>>> +
>>>>>>>> +    if (!folio_test_large(folio))
>>>>>>>> +        return 1;
>>>>>>>> +
>>>>>>>> +    folio_end = &folio->page + folio_nr_pages(folio);
>>>>>>>> +    end = min(page_cont_mapped_vaddr(folio_end, page, addr), end);
>>>>>>>> +    floops = (end - addr) >> PAGE_SHIFT;
>>>>>>>> +    pfn = page_to_pfn(page);
>>>>>>>> +    prot = pte_pgprot(pte_mkold(pte_mkclean(ptent)));
>>>>>>>> +
>>>>>>>> +    *any_dirty = pte_dirty(ptent);
>>>>>>>> +
>>>>>>>> +    pfn++;
>>>>>>>> +    pte++;
>>>>>>>> +
>>>>>>>> +    for (i = 1; i < floops; i++) {
>>>>>>>> +        ptent = ptep_get(pte);
>>>>>>>> +        ptent = pte_mkold(pte_mkclean(ptent));
>>>>>>>> +
>>>>>>>> +        if (!pte_present(ptent) || pte_pfn(ptent) != pfn ||
>>>>>>>> +            pgprot_val(pte_pgprot(ptent)) != pgprot_val(prot))
>>>>>>>> +            break;
>>>>>>>> +
>>>>>>>> +        if (pte_dirty(ptent))
>>>>>>>> +            *any_dirty = true;
>>>>>>>> +
>>>>>>>> +        pfn++;
>>>>>>>> +        pte++;
>>>>>>>> +    }
>>>>>>>> +
>>>>>>>> +    return i;
>>>>>>>>       }
>>>>>>>>         /*
>>>>>>>> - * Copy one pte.  Returns 0 if succeeded, or -EAGAIN if one preallocated
>>>>>>>> page
>>>>>>>> - * is required to copy this pte.
>>>>>>>> + * Copy set of contiguous ptes.  Returns number of ptes copied if succeeded
>>>>>>>> + * (always gte 1), or -EAGAIN if one preallocated page is required to
>>>>>>>> copy the
>>>>>>>> + * first pte.
>>>>>>>>        */
>>>>>>>>       static inline int
>>>>>>>> -copy_present_pte(struct vm_area_struct *dst_vma, struct vm_area_struct
>>>>>>>> *src_vma,
>>>>>>>> -         pte_t *dst_pte, pte_t *src_pte, unsigned long addr, int *rss,
>>>>>>>> -         struct folio **prealloc)
>>>>>>>> +copy_present_ptes(struct vm_area_struct *dst_vma, struct vm_area_struct
>>>>>>>> *src_vma,
>>>>>>>> +          pte_t *dst_pte, pte_t *src_pte,
>>>>>>>> +          unsigned long addr, unsigned long end,
>>>>>>>> +          int *rss, struct folio **prealloc)
>>>>>>>>       {
>>>>>>>>           struct mm_struct *src_mm = src_vma->vm_mm;
>>>>>>>>           unsigned long vm_flags = src_vma->vm_flags;
>>>>>>>>           pte_t pte = ptep_get(src_pte);
>>>>>>>>           struct page *page;
>>>>>>>>           struct folio *folio;
>>>>>>>> +    int nr = 1;
>>>>>>>> +    bool anon;
>>>>>>>> +    bool any_dirty = pte_dirty(pte);
>>>>>>>> +    int i;
>>>>>>>>             page = vm_normal_page(src_vma, addr, pte);
>>>>>>>> -    if (page)
>>>>>>>> +    if (page) {
>>>>>>>>               folio = page_folio(page);
>>>>>>>> -    if (page && folio_test_anon(folio)) {
>>>>>>>> -        /*
>>>>>>>> -         * If this page may have been pinned by the parent process,
>>>>>>>> -         * copy the page immediately for the child so that we'll always
>>>>>>>> -         * guarantee the pinned page won't be randomly replaced in the
>>>>>>>> -         * future.
>>>>>>>> -         */
>>>>>>>> -        folio_get(folio);
>>>>>>>> -        if (unlikely(page_try_dup_anon_rmap(page, false, src_vma))) {
>>>>>>>> -            /* Page may be pinned, we have to copy. */
>>>>>>>> -            folio_put(folio);
>>>>>>>> -            return copy_present_page(dst_vma, src_vma, dst_pte, src_pte,
>>>>>>>> -                         addr, rss, prealloc, page);
>>>>>>>> +        anon = folio_test_anon(folio);
>>>>>>>> +        nr = folio_nr_pages_cont_mapped(folio, page, src_pte, addr,
>>>>>>>> +                        end, pte, &any_dirty);
>>>>>>>> +
>>>>>>>> +        for (i = 0; i < nr; i++, page++) {
>>>>>>>> +            if (anon) {
>>>>>>>> +                /*
>>>>>>>> +                 * If this page may have been pinned by the
>>>>>>>> +                 * parent process, copy the page immediately for
>>>>>>>> +                 * the child so that we'll always guarantee the
>>>>>>>> +                 * pinned page won't be randomly replaced in the
>>>>>>>> +                 * future.
>>>>>>>> +                 */
>>>>>>>> +                if (unlikely(page_try_dup_anon_rmap(
>>>>>>>> +                        page, false, src_vma))) {
>>>>>>>> +                    if (i != 0)
>>>>>>>> +                        break;
>>>>>>>> +                    /* Page may be pinned, we have to copy. */
>>>>>>>> +                    return copy_present_page(
>>>>>>>> +                        dst_vma, src_vma, dst_pte,
>>>>>>>> +                        src_pte, addr, rss, prealloc,
>>>>>>>> +                        page);
>>>>>>>> +                }
>>>>>>>> +                rss[MM_ANONPAGES]++;
>>>>>>>> +                VM_BUG_ON(PageAnonExclusive(page));
>>>>>>>> +            } else {
>>>>>>>> +                page_dup_file_rmap(page, false);
>>>>>>>> +                rss[mm_counter_file(page)]++;
>>>>>>>> +            }
>>>>>>>>               }
>>>>>>>> -        rss[MM_ANONPAGES]++;
>>>>>>>> -    } else if (page) {
>>>>>>>> -        folio_get(folio);
>>>>>>>> -        page_dup_file_rmap(page, false);
>>>>>>>> -        rss[mm_counter_file(page)]++;
>>>>>>>> +
>>>>>>>> +        nr = i;
>>>>>>>> +        folio_ref_add(folio, nr);
>>>>>>>>           }
>>>>>>>>             /*
>>>>>>>> @@ -968,24 +1051,28 @@ copy_present_pte(struct vm_area_struct *dst_vma,
>>>>>>>> struct
>>>>>>>> vm_area_struct *src_vma,
>>>>>>>>            * in the parent and the child
>>>>>>>>            */
>>>>>>>>           if (is_cow_mapping(vm_flags) && pte_write(pte)) {
>>>>>>>> -        ptep_set_wrprotect(src_mm, addr, src_pte);
>>>>>>>> +        ptep_set_wrprotects(src_mm, addr, src_pte, nr);
>>>>>>>>               pte = pte_wrprotect(pte);
>>>>>>>
>>>>>>> You likely want an "any_pte_writable" check here instead, no?
>>>>>>>
>>>>>>> Any operations that target a single individual PTE while multiple PTEs are
>>>>>>> adjusted are suspicious :)
>>>>>>
>>>>>> The idea is that I've already constrained the batch of pages such that the
>>>>>> permissions are all the same (see folio_nr_pages_cont_mapped()). So if the
>>>>>> first
>>>>>> pte is writable, then they all are - something has gone badly wrong if some
>>>>>> are
>>>>>> writable and others are not.
>>>>>
>>>>> I wonder if it would be cleaner and easier to not do that, though.
>>>>>
>>>>> Simply record if any pte is writable. Afterwards they will *all* be R/O and you
>>>>> can set the cont bit, correct?
>>>>
>>>> Oh I see what you mean - that only works for cow mappings though. If you have a
>>>> shared mapping, you won't be making it read-only at fork. So if we ignore
>>>> pte_write() state when demarking the batches, we will end up with a batch of
>>>> pages with a mix of RO and RW in the parent, but then we set_ptes() for the
>>>> child and those pages will all have the permissions of the first page of the
>>>> batch.
>>>
>>> I see what you mean.
>>>
>>> After fork(), all anon pages will be R/O in the parent and the child.
>>> Easy. If any PTE is writable, wrprotect all in the parent and the child.
>>>
>>> After fork(), all shared pages can be R/O or R/W in the parent. For
>>> simplicity, I think you can simply set them all R/O in the child. So if
>>> any PTE is writable, wrprotect all in the child.
>>
>> Or better: if any is R/O, set them all R/O. Otherwise just leave them as is.
> 
> I've just come back to this to code it up, and want to clarify this last
> comment; I'm already going to have to collect any_writable for the anon case, so
> I will already have that info for the shared case too. I think you are
> suggesting I *additionally* collect any_readonly, then in the shared case, I
> only apply wrprotect if (any_writable && any_readonly). i.e. only apply
> wrprotect if there is a mix of permissions for the batch, otherwise all the
> permissions are the same (either all RW or all RO) and I can elide the wrprotect.
> Is that what you meant?

Yes. I suspect you might somehow be able to derive "any_readonly = nr - 
!any_writable".

Within a VMA, we really should only see:
* writable VMA: some might be R/O, some might be R/W
* VMA applicable to NUMA hinting: some might be PROT_NONE, others R/O or
   R/W

One could simply skip batching for now on pte_protnone() and focus on 
the "writable" vs. "not-writable".
Ryan Roberts Nov. 23, 2023, 12:28 p.m. UTC | #22
On 23/11/2023 12:12, David Hildenbrand wrote:
> On 23.11.23 11:26, Ryan Roberts wrote:
>> On 16/11/2023 14:15, David Hildenbrand wrote:
>>> On 16.11.23 15:13, David Hildenbrand wrote:
>>>> On 16.11.23 14:49, Ryan Roberts wrote:
>>>>> On 16/11/2023 13:20, David Hildenbrand wrote:
>>>>>> On 16.11.23 12:20, Ryan Roberts wrote:
>>>>>>> On 16/11/2023 11:03, David Hildenbrand wrote:
>>>>>>>> On 15.11.23 17:30, Ryan Roberts wrote:
>>>>>>>>> Convert copy_pte_range() to copy a set of ptes in a batch. A given batch
>>>>>>>>> maps a physically contiguous block of memory, all belonging to the same
>>>>>>>>> folio, with the same permissions, and for shared mappings, the same
>>>>>>>>> dirty state. This will likely improve performance by a tiny amount due
>>>>>>>>> to batching the folio reference count management and calling set_ptes()
>>>>>>>>> rather than making individual calls to set_pte_at().
>>>>>>>>>
>>>>>>>>> However, the primary motivation for this change is to reduce the number
>>>>>>>>> of tlb maintenance operations that the arm64 backend has to perform
>>>>>>>>> during fork, as it is about to add transparent support for the
>>>>>>>>> "contiguous bit" in its ptes. By write-protecting the parent using the
>>>>>>>>> new ptep_set_wrprotects() (note the 's' at the end) function, the
>>>>>>>>> backend can avoid having to unfold contig ranges of PTEs, which is
>>>>>>>>> expensive, when all ptes in the range are being write-protected.
>>>>>>>>> Similarly, by using set_ptes() rather than set_pte_at() to set up ptes
>>>>>>>>> in the child, the backend does not need to fold a contiguous range once
>>>>>>>>> they are all populated - they can be initially populated as a contiguous
>>>>>>>>> range in the first place.
>>>>>>>>>
>>>>>>>>> This change addresses the core-mm refactoring only, and introduces
>>>>>>>>> ptep_set_wrprotects() with a default implementation that calls
>>>>>>>>> ptep_set_wrprotect() for each pte in the range. A separate change will
>>>>>>>>> implement ptep_set_wrprotects() in the arm64 backend to realize the
>>>>>>>>> performance improvement as part of the work to enable contpte mappings.
>>>>>>>>>
>>>>>>>>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>>>>>>>>> ---
>>>>>>>>>       include/linux/pgtable.h |  13 +++
>>>>>>>>>       mm/memory.c             | 175
>>>>>>>>> +++++++++++++++++++++++++++++++---------
>>>>>>>>>       2 files changed, 150 insertions(+), 38 deletions(-)
>>>>>>>>>
>>>>>>>>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
>>>>>>>>> index af7639c3b0a3..1c50f8a0fdde 100644
>>>>>>>>> --- a/include/linux/pgtable.h
>>>>>>>>> +++ b/include/linux/pgtable.h
>>>>>>>>> @@ -622,6 +622,19 @@ static inline void ptep_set_wrprotect(struct
>>>>>>>>> mm_struct
>>>>>>>>> *mm, unsigned long addres
>>>>>>>>>       }
>>>>>>>>>       #endif
>>>>>>>>>       +#ifndef ptep_set_wrprotects
>>>>>>>>> +struct mm_struct;
>>>>>>>>> +static inline void ptep_set_wrprotects(struct mm_struct *mm,
>>>>>>>>> +                unsigned long address, pte_t *ptep,
>>>>>>>>> +                unsigned int nr)
>>>>>>>>> +{
>>>>>>>>> +    unsigned int i;
>>>>>>>>> +
>>>>>>>>> +    for (i = 0; i < nr; i++, address += PAGE_SIZE, ptep++)
>>>>>>>>> +        ptep_set_wrprotect(mm, address, ptep);
>>>>>>>>> +}
>>>>>>>>> +#endif
>>>>>>>>> +
>>>>>>>>>       /*
>>>>>>>>>        * On some architectures hardware does not set page access bit when
>>>>>>>>> accessing
>>>>>>>>>        * memory page, it is responsibility of software setting this bit.
>>>>>>>>> It brings
>>>>>>>>> diff --git a/mm/memory.c b/mm/memory.c
>>>>>>>>> index 1f18ed4a5497..b7c8228883cf 100644
>>>>>>>>> --- a/mm/memory.c
>>>>>>>>> +++ b/mm/memory.c
>>>>>>>>> @@ -921,46 +921,129 @@ copy_present_page(struct vm_area_struct *dst_vma,
>>>>>>>>> struct vm_area_struct *src_vma
>>>>>>>>>               /* Uffd-wp needs to be delivered to dest pte as well */
>>>>>>>>>               pte = pte_mkuffd_wp(pte);
>>>>>>>>>           set_pte_at(dst_vma->vm_mm, addr, dst_pte, pte);
>>>>>>>>> -    return 0;
>>>>>>>>> +    return 1;
>>>>>>>>> +}
>>>>>>>>> +
>>>>>>>>> +static inline unsigned long page_cont_mapped_vaddr(struct page *page,
>>>>>>>>> +                struct page *anchor, unsigned long anchor_vaddr)
>>>>>>>>> +{
>>>>>>>>> +    unsigned long offset;
>>>>>>>>> +    unsigned long vaddr;
>>>>>>>>> +
>>>>>>>>> +    offset = (page_to_pfn(page) - page_to_pfn(anchor)) << PAGE_SHIFT;
>>>>>>>>> +    vaddr = anchor_vaddr + offset;
>>>>>>>>> +
>>>>>>>>> +    if (anchor > page) {
>>>>>>>>> +        if (vaddr > anchor_vaddr)
>>>>>>>>> +            return 0;
>>>>>>>>> +    } else {
>>>>>>>>> +        if (vaddr < anchor_vaddr)
>>>>>>>>> +            return ULONG_MAX;
>>>>>>>>> +    }
>>>>>>>>> +
>>>>>>>>> +    return vaddr;
>>>>>>>>> +}
>>>>>>>>> +
>>>>>>>>> +static int folio_nr_pages_cont_mapped(struct folio *folio,
>>>>>>>>> +                      struct page *page, pte_t *pte,
>>>>>>>>> +                      unsigned long addr, unsigned long end,
>>>>>>>>> +                      pte_t ptent, bool *any_dirty)
>>>>>>>>> +{
>>>>>>>>> +    int floops;
>>>>>>>>> +    int i;
>>>>>>>>> +    unsigned long pfn;
>>>>>>>>> +    pgprot_t prot;
>>>>>>>>> +    struct page *folio_end;
>>>>>>>>> +
>>>>>>>>> +    if (!folio_test_large(folio))
>>>>>>>>> +        return 1;
>>>>>>>>> +
>>>>>>>>> +    folio_end = &folio->page + folio_nr_pages(folio);
>>>>>>>>> +    end = min(page_cont_mapped_vaddr(folio_end, page, addr), end);
>>>>>>>>> +    floops = (end - addr) >> PAGE_SHIFT;
>>>>>>>>> +    pfn = page_to_pfn(page);
>>>>>>>>> +    prot = pte_pgprot(pte_mkold(pte_mkclean(ptent)));
>>>>>>>>> +
>>>>>>>>> +    *any_dirty = pte_dirty(ptent);
>>>>>>>>> +
>>>>>>>>> +    pfn++;
>>>>>>>>> +    pte++;
>>>>>>>>> +
>>>>>>>>> +    for (i = 1; i < floops; i++) {
>>>>>>>>> +        ptent = ptep_get(pte);
>>>>>>>>> +        ptent = pte_mkold(pte_mkclean(ptent));
>>>>>>>>> +
>>>>>>>>> +        if (!pte_present(ptent) || pte_pfn(ptent) != pfn ||
>>>>>>>>> +            pgprot_val(pte_pgprot(ptent)) != pgprot_val(prot))
>>>>>>>>> +            break;
>>>>>>>>> +
>>>>>>>>> +        if (pte_dirty(ptent))
>>>>>>>>> +            *any_dirty = true;
>>>>>>>>> +
>>>>>>>>> +        pfn++;
>>>>>>>>> +        pte++;
>>>>>>>>> +    }
>>>>>>>>> +
>>>>>>>>> +    return i;
>>>>>>>>>       }
>>>>>>>>>         /*
>>>>>>>>> - * Copy one pte.  Returns 0 if succeeded, or -EAGAIN if one preallocated
>>>>>>>>> page
>>>>>>>>> - * is required to copy this pte.
>>>>>>>>> + * Copy set of contiguous ptes.  Returns number of ptes copied if
>>>>>>>>> succeeded
>>>>>>>>> + * (always gte 1), or -EAGAIN if one preallocated page is required to
>>>>>>>>> copy the
>>>>>>>>> + * first pte.
>>>>>>>>>        */
>>>>>>>>>       static inline int
>>>>>>>>> -copy_present_pte(struct vm_area_struct *dst_vma, struct vm_area_struct
>>>>>>>>> *src_vma,
>>>>>>>>> -         pte_t *dst_pte, pte_t *src_pte, unsigned long addr, int *rss,
>>>>>>>>> -         struct folio **prealloc)
>>>>>>>>> +copy_present_ptes(struct vm_area_struct *dst_vma, struct vm_area_struct
>>>>>>>>> *src_vma,
>>>>>>>>> +          pte_t *dst_pte, pte_t *src_pte,
>>>>>>>>> +          unsigned long addr, unsigned long end,
>>>>>>>>> +          int *rss, struct folio **prealloc)
>>>>>>>>>       {
>>>>>>>>>           struct mm_struct *src_mm = src_vma->vm_mm;
>>>>>>>>>           unsigned long vm_flags = src_vma->vm_flags;
>>>>>>>>>           pte_t pte = ptep_get(src_pte);
>>>>>>>>>           struct page *page;
>>>>>>>>>           struct folio *folio;
>>>>>>>>> +    int nr = 1;
>>>>>>>>> +    bool anon;
>>>>>>>>> +    bool any_dirty = pte_dirty(pte);
>>>>>>>>> +    int i;
>>>>>>>>>             page = vm_normal_page(src_vma, addr, pte);
>>>>>>>>> -    if (page)
>>>>>>>>> +    if (page) {
>>>>>>>>>               folio = page_folio(page);
>>>>>>>>> -    if (page && folio_test_anon(folio)) {
>>>>>>>>> -        /*
>>>>>>>>> -         * If this page may have been pinned by the parent process,
>>>>>>>>> -         * copy the page immediately for the child so that we'll always
>>>>>>>>> -         * guarantee the pinned page won't be randomly replaced in the
>>>>>>>>> -         * future.
>>>>>>>>> -         */
>>>>>>>>> -        folio_get(folio);
>>>>>>>>> -        if (unlikely(page_try_dup_anon_rmap(page, false, src_vma))) {
>>>>>>>>> -            /* Page may be pinned, we have to copy. */
>>>>>>>>> -            folio_put(folio);
>>>>>>>>> -            return copy_present_page(dst_vma, src_vma, dst_pte, src_pte,
>>>>>>>>> -                         addr, rss, prealloc, page);
>>>>>>>>> +        anon = folio_test_anon(folio);
>>>>>>>>> +        nr = folio_nr_pages_cont_mapped(folio, page, src_pte, addr,
>>>>>>>>> +                        end, pte, &any_dirty);
>>>>>>>>> +
>>>>>>>>> +        for (i = 0; i < nr; i++, page++) {
>>>>>>>>> +            if (anon) {
>>>>>>>>> +                /*
>>>>>>>>> +                 * If this page may have been pinned by the
>>>>>>>>> +                 * parent process, copy the page immediately for
>>>>>>>>> +                 * the child so that we'll always guarantee the
>>>>>>>>> +                 * pinned page won't be randomly replaced in the
>>>>>>>>> +                 * future.
>>>>>>>>> +                 */
>>>>>>>>> +                if (unlikely(page_try_dup_anon_rmap(
>>>>>>>>> +                        page, false, src_vma))) {
>>>>>>>>> +                    if (i != 0)
>>>>>>>>> +                        break;
>>>>>>>>> +                    /* Page may be pinned, we have to copy. */
>>>>>>>>> +                    return copy_present_page(
>>>>>>>>> +                        dst_vma, src_vma, dst_pte,
>>>>>>>>> +                        src_pte, addr, rss, prealloc,
>>>>>>>>> +                        page);
>>>>>>>>> +                }
>>>>>>>>> +                rss[MM_ANONPAGES]++;
>>>>>>>>> +                VM_BUG_ON(PageAnonExclusive(page));
>>>>>>>>> +            } else {
>>>>>>>>> +                page_dup_file_rmap(page, false);
>>>>>>>>> +                rss[mm_counter_file(page)]++;
>>>>>>>>> +            }
>>>>>>>>>               }
>>>>>>>>> -        rss[MM_ANONPAGES]++;
>>>>>>>>> -    } else if (page) {
>>>>>>>>> -        folio_get(folio);
>>>>>>>>> -        page_dup_file_rmap(page, false);
>>>>>>>>> -        rss[mm_counter_file(page)]++;
>>>>>>>>> +
>>>>>>>>> +        nr = i;
>>>>>>>>> +        folio_ref_add(folio, nr);
>>>>>>>>>           }
>>>>>>>>>             /*
>>>>>>>>> @@ -968,24 +1051,28 @@ copy_present_pte(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
>>>>>>>>>            * in the parent and the child
>>>>>>>>>            */
>>>>>>>>>           if (is_cow_mapping(vm_flags) && pte_write(pte)) {
>>>>>>>>> -        ptep_set_wrprotect(src_mm, addr, src_pte);
>>>>>>>>> +        ptep_set_wrprotects(src_mm, addr, src_pte, nr);
>>>>>>>>>               pte = pte_wrprotect(pte);
>>>>>>>>
>>>>>>>> You likely want an "any_pte_writable" check here instead, no?
>>>>>>>>
>>>>>>>> Any operations that target a single individual PTE while multiple PTEs are
>>>>>>>> adjusted are suspicious :)
>>>>>>>
>>>>>>> The idea is that I've already constrained the batch of pages such that the
>>>>>>> permissions are all the same (see folio_nr_pages_cont_mapped()). So if the
>>>>>>> first
>>>>>>> pte is writable, then they all are - something has gone badly wrong if some
>>>>>>> are
>>>>>>> writable and others are not.
>>>>>>
>>>>>> I wonder if it would be cleaner and easier to not do that, though.
>>>>>>
>>>>>> Simply record if any pte is writable. Afterwards they will *all* be R/O
>>>>>> and you
>>>>>> can set the cont bit, correct?
>>>>>
>>>>> Oh I see what you mean - that only works for cow mappings though. If you
>>>>> have a
>>>>> shared mapping, you won't be making it read-only at fork. So if we ignore
>>>>> pte_write() state when demarking the batches, we will end up with a batch of
>>>>> pages with a mix of RO and RW in the parent, but then we set_ptes() for the
>>>>> child and those pages will all have the permissions of the first page of the
>>>>> batch.
>>>>
>>>> I see what you mean.
>>>>
>>>> After fork(), all anon pages will be R/O in the parent and the child.
>>>> Easy. If any PTE is writable, wrprotect all in the parent and the child.
>>>>
>>>> After fork(), all shared pages can be R/O or R/W in the parent. For
>>>> simplicity, I think you can simply set them all R/O in the child. So if
>>>> any PTE is writable, wrprotect all in the child.
>>>
>>> Or better: if any is R/O, set them all R/O. Otherwise just leave them as is.
>>
>> I've just come back to this to code it up, and want to clarify this last
>> comment; I'm already going to have to collect any_writable for the anon case, so
>> I will already have that info for the shared case too. I think you are
>> suggesting I *additionally* collect any_readonly, then in the shared case, I
>> only apply wrprotect if (any_writable && any_readonly). i.e. only apply
>> wrprotect if there is a mix of permissions for the batch, otherwise all the
>> permissions are the same (either all RW or all RO) and I can elide the wrprotect.
>> Is that what you meant?
> 
> Yes. I suspect you might somehow be able to derive "any_readonly = nr -
> !any_writable".

Yep, nice.
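
i.e. something like this, very roughly (untested sketch; nr_writable is just a
stand-in name for however the write stats end up being collected while scanning
the batch of nr ptes):

	bool any_writable = nr_writable > 0;
	bool any_readonly = nr_writable < nr;

	if (is_cow_mapping(vm_flags) && any_writable) {
		/* CoW: write-protect the whole batch in the parent and child. */
		ptep_set_wrprotects(src_mm, addr, src_pte, nr);
		pte = pte_wrprotect(pte);
	} else if ((vm_flags & VM_SHARED) && any_writable && any_readonly) {
		/* Shared with mixed permissions: make the child's batch R/O. */
		pte = pte_wrprotect(pte);
	}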

> 
> Within a VMA, we really should only see:
> * writable VMA: some might be R/O, some might be R/W
> * VMA applicable to NUMA hinting: some might be PROT_NONE, others R/O or
>   R/W
> 
> One could simply skip batching for now on pte_protnone() and focus on the
> "writable" vs. "not-writable".

I'm not sure we can simply "skip" batching on pte_protnone() since we will need
to terminate the batch if we spot it. But if we have to look for it anyway, we
might as well just terminate the batch when the value of pte_protnone()
*changes*. I'm also proposing to take this approach for pte_uffd_wp() which also
needs to be carefully preserved per-pte.
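
Concretely, I'm imagining the scan loop growing checks along these lines
(untested sketch on top of the v2 loop; the write-bit accumulation discussed
above would slot in here too, and the dirty handling stays as it was):

	bool protnone = pte_protnone(ptent);
	bool uffd_wp = pte_uffd_wp(ptent);

	for (i = 1; i < floops; i++) {
		ptent = ptep_get(pte);

		if (!pte_present(ptent) || pte_pfn(ptent) != pfn ||
		    pte_protnone(ptent) != protnone ||
		    pte_uffd_wp(ptent) != uffd_wp)
			break;

		if (pte_dirty(ptent))
			*any_dirty = true;

		pfn++;
		pte++;
	}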


>
Ryan Roberts Nov. 23, 2023, 2:43 p.m. UTC | #23
On 23/11/2023 04:26, Alistair Popple wrote:
> 
> Ryan Roberts <ryan.roberts@arm.com> writes:
> 
>> Convert copy_pte_range() to copy a set of ptes in a batch. A given batch
>> maps a physically contiguous block of memory, all belonging to the same
>> folio, with the same permissions, and for shared mappings, the same
>> dirty state. This will likely improve performance by a tiny amount due
>> to batching the folio reference count management and calling set_ptes()
>> rather than making individual calls to set_pte_at().
>>
>> However, the primary motivation for this change is to reduce the number
>> of tlb maintenance operations that the arm64 backend has to perform
>> during fork, as it is about to add transparent support for the
>> "contiguous bit" in its ptes. By write-protecting the parent using the
>> new ptep_set_wrprotects() (note the 's' at the end) function, the
>> backend can avoid having to unfold contig ranges of PTEs, which is
>> expensive, when all ptes in the range are being write-protected.
>> Similarly, by using set_ptes() rather than set_pte_at() to set up ptes
>> in the child, the backend does not need to fold a contiguous range once
>> they are all populated - they can be initially populated as a contiguous
>> range in the first place.
>>
>> This change addresses the core-mm refactoring only, and introduces
>> ptep_set_wrprotects() with a default implementation that calls
>> ptep_set_wrprotect() for each pte in the range. A separate change will
>> implement ptep_set_wrprotects() in the arm64 backend to realize the
>> performance improvement as part of the work to enable contpte mappings.
>>
>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>> ---
>>  include/linux/pgtable.h |  13 +++
>>  mm/memory.c             | 175 +++++++++++++++++++++++++++++++---------
>>  2 files changed, 150 insertions(+), 38 deletions(-)
>>
>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
>> index af7639c3b0a3..1c50f8a0fdde 100644
>> --- a/include/linux/pgtable.h
>> +++ b/include/linux/pgtable.h
>> @@ -622,6 +622,19 @@ static inline void ptep_set_wrprotect(struct mm_struct *mm, unsigned long addres
>>  }
>>  #endif
>>  
>> +#ifndef ptep_set_wrprotects
>> +struct mm_struct;
>> +static inline void ptep_set_wrprotects(struct mm_struct *mm,
>> +				unsigned long address, pte_t *ptep,
>> +				unsigned int nr)
>> +{
>> +	unsigned int i;
>> +
>> +	for (i = 0; i < nr; i++, address += PAGE_SIZE, ptep++)
>> +		ptep_set_wrprotect(mm, address, ptep);
>> +}
>> +#endif
>> +
>>  /*
>>   * On some architectures hardware does not set page access bit when accessing
>>   * memory page, it is responsibility of software setting this bit. It brings
>> diff --git a/mm/memory.c b/mm/memory.c
>> index 1f18ed4a5497..b7c8228883cf 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -921,46 +921,129 @@ copy_present_page(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma
>>  		/* Uffd-wp needs to be delivered to dest pte as well */
>>  		pte = pte_mkuffd_wp(pte);
>>  	set_pte_at(dst_vma->vm_mm, addr, dst_pte, pte);
>> -	return 0;
>> +	return 1;
> 
> We should update the function comment to indicate why we return 1 here
> because it will become non-obvious in future. But perhaps it's better to
> leave this as is and do the error check/return code calculation in
> copy_present_ptes().

OK, I'll return 0 for success and fix it up to 1 in copy_present_ptes().
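
i.e. roughly this at the call site in copy_present_ptes() (sketch only):

		if (unlikely(page_try_dup_anon_rmap(page, false, src_vma))) {
			int ret;

			if (i != 0)
				break;
			/* Page may be pinned, we have to copy. */
			ret = copy_present_page(dst_vma, src_vma, dst_pte,
						src_pte, addr, rss, prealloc,
						page);
			/* copy_present_page() now returns 0 on success. */
			return ret == 0 ? 1 : ret;
		}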

> 
>> +}
>> +
>> +static inline unsigned long page_cont_mapped_vaddr(struct page *page,
>> +				struct page *anchor, unsigned long anchor_vaddr)
> 
> It's likely I'm easily confused but the arguments here don't make much
> sense to me. Something like this (noting that I've switched the argument
> order) makes more sense to me at least:
> 
> static inline unsigned long page_cont_mapped_vaddr(struct page *page,
>                             unsigned long page_vaddr, struct page *next_folio_page)

I was originally using page_cont_mapped_vaddr() in more places than here and
needed a more generic helper than just "what is the virtual address of the end
of the folio, given a random page within the folio and its virtual address"; (I
needed "what is the virtual address of a page given a different page and its
virtual address and assuming the distance between the 2 pages is the same in
physical and virtual space"). But given I don't need that generality anymore,
yes, I agree I can simplify this significantly.

I think I can remove the function entirely and replace with this in
folio_nr_pages_cont_mapped():

	/*
	 * Loop either to `end` or to end of folio if it's contiguously mapped,
	 * whichever is smaller.
	 */
	floops = (end - addr) >> PAGE_SHIFT;
	floops = min_t(int, floops,
		       folio_pfn(folio_next(folio)) - page_to_pfn(page));

where `end` and `addr` are the parameters as passed into the function. What do
you think?
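
And for completeness, a rough (untested) sketch of how the whole function might
then end up, with the helper folded in:

static int folio_nr_pages_cont_mapped(struct folio *folio,
				      struct page *page, pte_t *pte,
				      unsigned long addr, unsigned long end,
				      pte_t ptent, bool *any_dirty)
{
	pgprot_t prot;
	unsigned long pfn;
	int floops;
	int i;

	if (!folio_test_large(folio))
		return 1;

	/*
	 * Loop either to `end` or to end of folio if it's contiguously mapped,
	 * whichever is smaller.
	 */
	floops = (end - addr) >> PAGE_SHIFT;
	floops = min_t(int, floops,
		       folio_pfn(folio_next(folio)) - page_to_pfn(page));

	pfn = page_to_pfn(page);
	prot = pte_pgprot(pte_mkold(pte_mkclean(ptent)));
	*any_dirty = pte_dirty(ptent);

	pfn++;
	pte++;

	for (i = 1; i < floops; i++) {
		ptent = ptep_get(pte);
		ptent = pte_mkold(pte_mkclean(ptent));

		if (!pte_present(ptent) || pte_pfn(ptent) != pfn ||
		    pgprot_val(pte_pgprot(ptent)) != pgprot_val(prot))
			break;

		if (pte_dirty(ptent))
			*any_dirty = true;

		pfn++;
		pte++;
	}

	return i;
}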

> 
>> +{
>> +	unsigned long offset;
>> +	unsigned long vaddr;
>> +
>> +	offset = (page_to_pfn(page) - page_to_pfn(anchor)) << PAGE_SHIFT;
> 
> Which IMHO makes this much more readable:
> 
> 	offset = (page_to_pfn(next_folio_page) - page_to_pfn(page)) << PAGE_SHIFT;
> 
>> +	vaddr = anchor_vaddr + offset;
>> +
>> +	if (anchor > page) {
> 
> And also highlights that I think this condition (page > folio_page_end)
> is impossible to hit. Which is good ...
> 
>> +		if (vaddr > anchor_vaddr)
>> +			return 0;
> 
> ... because I'm not sure returning 0 is valid as we would end up setting
> floops = (0 - addr) >> PAGE_SHIFT which doesn't seem like it would end
> particularly well :-)

This was covering the more general case that I no longer need.

> 
>> +	} else {
>> +		if (vaddr < anchor_vaddr)
> 
> Same here - isn't the vaddr of the next folio always going to be larger
> than the vaddr for the current page? It seems this function is really
> just calculating the virtual address of the next folio, or am I deeply
> confused?

This aims to protect against the corner case, where a page from a folio is
mremap()ed very high in address space such that the extra pages from the anchor
page to the end of the folio would actually wrap back to zero. But with the
approach propsed above, this problem goes away, I think.
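
(The wrap itself is easy to demonstrate with plain unsigned arithmetic; this is
just a throwaway user-space toy with made-up addresses, assuming a 64-bit
unsigned long, not kernel code:)

#include <stdio.h>

int main(void)
{
	unsigned long anchor_vaddr = 0xfffffffffffff000UL; /* anchor mremap()ed very high */
	unsigned long offset = 16UL << 12;                  /* 16 pages to the folio end */
	unsigned long vaddr = anchor_vaddr + offset;

	/* vaddr wraps past zero, so a plain "vaddr > anchor_vaddr" check fails */
	printf("%#lx + %#lx = %#lx\n", anchor_vaddr, offset, vaddr);
	return 0;
}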

> 
>> +			return ULONG_MAX;
>> +	}
>> +
>> +	return vaddr;
>> +}
>> +
>> +static int folio_nr_pages_cont_mapped(struct folio *folio,
>> +				      struct page *page, pte_t *pte,
>> +				      unsigned long addr, unsigned long end,
>> +				      pte_t ptent, bool *any_dirty)
>> +{
>> +	int floops;
>> +	int i;
>> +	unsigned long pfn;
>> +	pgprot_t prot;
>> +	struct page *folio_end;
>> +
>> +	if (!folio_test_large(folio))
>> +		return 1;
>> +
>> +	folio_end = &folio->page + folio_nr_pages(folio);
> 
> I think you can replace this with:
> 
> folio_end = folio_next(folio)

yep, done - thanks.

> 
> Although given this is only passed to page_cont_mapped_vaddr() perhaps
> it's better to just pass the folio in and do the calculation there.
> 
>> +	end = min(page_cont_mapped_vaddr(folio_end, page, addr), end);
>> +	floops = (end - addr) >> PAGE_SHIFT;
>> +	pfn = page_to_pfn(page);
>> +	prot = pte_pgprot(pte_mkold(pte_mkclean(ptent)));
>> +
>> +	*any_dirty = pte_dirty(ptent);
>> +
>> +	pfn++;
>> +	pte++;
>> +
>> +	for (i = 1; i < floops; i++) {
>> +		ptent = ptep_get(pte);
>> +		ptent = pte_mkold(pte_mkclean(ptent));
>> +
>> +		if (!pte_present(ptent) || pte_pfn(ptent) != pfn ||
>> +		    pgprot_val(pte_pgprot(ptent)) != pgprot_val(prot))
>> +			break;
>> +
>> +		if (pte_dirty(ptent))
>> +			*any_dirty = true;
>> +
>> +		pfn++;
>> +		pte++;
>> +	}
>> +
>> +	return i;
>>  }
>>  
>>  /*
>> - * Copy one pte.  Returns 0 if succeeded, or -EAGAIN if one preallocated page
>> - * is required to copy this pte.
>> + * Copy set of contiguous ptes.  Returns number of ptes copied if succeeded
>> + * (always gte 1), or -EAGAIN if one preallocated page is required to copy the
>> + * first pte.
>>   */
>>  static inline int
>> -copy_present_pte(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
>> -		 pte_t *dst_pte, pte_t *src_pte, unsigned long addr, int *rss,
>> -		 struct folio **prealloc)
>> +copy_present_ptes(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
>> +		  pte_t *dst_pte, pte_t *src_pte,
>> +		  unsigned long addr, unsigned long end,
>> +		  int *rss, struct folio **prealloc)
>>  {
>>  	struct mm_struct *src_mm = src_vma->vm_mm;
>>  	unsigned long vm_flags = src_vma->vm_flags;
>>  	pte_t pte = ptep_get(src_pte);
>>  	struct page *page;
>>  	struct folio *folio;
>> +	int nr = 1;
>> +	bool anon;
>> +	bool any_dirty = pte_dirty(pte);
>> +	int i;
>>  
>>  	page = vm_normal_page(src_vma, addr, pte);
>> -	if (page)
>> +	if (page) {
>>  		folio = page_folio(page);
>> -	if (page && folio_test_anon(folio)) {
>> -		/*
>> -		 * If this page may have been pinned by the parent process,
>> -		 * copy the page immediately for the child so that we'll always
>> -		 * guarantee the pinned page won't be randomly replaced in the
>> -		 * future.
>> -		 */
>> -		folio_get(folio);
>> -		if (unlikely(page_try_dup_anon_rmap(page, false, src_vma))) {
>> -			/* Page may be pinned, we have to copy. */
>> -			folio_put(folio);
>> -			return copy_present_page(dst_vma, src_vma, dst_pte, src_pte,
>> -						 addr, rss, prealloc, page);
>> +		anon = folio_test_anon(folio);
>> +		nr = folio_nr_pages_cont_mapped(folio, page, src_pte, addr,
>> +						end, pte, &any_dirty);
>> +
>> +		for (i = 0; i < nr; i++, page++) {
>> +			if (anon) {
>> +				/*
>> +				 * If this page may have been pinned by the
>> +				 * parent process, copy the page immediately for
>> +				 * the child so that we'll always guarantee the
>> +				 * pinned page won't be randomly replaced in the
>> +				 * future.
>> +				 */
>> +				if (unlikely(page_try_dup_anon_rmap(
>> +						page, false, src_vma))) {
>> +					if (i != 0)
>> +						break;
>> +					/* Page may be pinned, we have to copy. */
>> +					return copy_present_page(
>> +						dst_vma, src_vma, dst_pte,
>> +						src_pte, addr, rss, prealloc,
>> +						page);
>> +				}
>> +				rss[MM_ANONPAGES]++;
>> +				VM_BUG_ON(PageAnonExclusive(page));
>> +			} else {
>> +				page_dup_file_rmap(page, false);
>> +				rss[mm_counter_file(page)]++;
>> +			}
>>  		}
>> -		rss[MM_ANONPAGES]++;
>> -	} else if (page) {
>> -		folio_get(folio);
>> -		page_dup_file_rmap(page, false);
>> -		rss[mm_counter_file(page)]++;
>> +
>> +		nr = i;
>> +		folio_ref_add(folio, nr);
>>  	}
>>  
>>  	/*
>> @@ -968,24 +1051,28 @@ copy_present_pte(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
>>  	 * in the parent and the child
>>  	 */
>>  	if (is_cow_mapping(vm_flags) && pte_write(pte)) {
>> -		ptep_set_wrprotect(src_mm, addr, src_pte);
>> +		ptep_set_wrprotects(src_mm, addr, src_pte, nr);
>>  		pte = pte_wrprotect(pte);
>>  	}
>> -	VM_BUG_ON(page && folio_test_anon(folio) && PageAnonExclusive(page));
>>  
>>  	/*
>> -	 * If it's a shared mapping, mark it clean in
>> -	 * the child
>> +	 * If it's a shared mapping, mark it clean in the child. If its a
>> +	 * private mapping, mark it dirty in the child if _any_ of the parent
>> +	 * mappings in the block were marked dirty. The contiguous block of
>> +	 * mappings are all backed by the same folio, so if any are dirty then
>> +	 * the whole folio is dirty. This allows us to determine the batch size
>> +	 * without having to ever consider the dirty bit. See
>> +	 * folio_nr_pages_cont_mapped().
>>  	 */
>> -	if (vm_flags & VM_SHARED)
>> -		pte = pte_mkclean(pte);
>> -	pte = pte_mkold(pte);
>> +	pte = pte_mkold(pte_mkclean(pte));
>> +	if (!(vm_flags & VM_SHARED) && any_dirty)
>> +		pte = pte_mkdirty(pte);
>>  
>>  	if (!userfaultfd_wp(dst_vma))
>>  		pte = pte_clear_uffd_wp(pte);
>>  
>> -	set_pte_at(dst_vma->vm_mm, addr, dst_pte, pte);
>> -	return 0;
>> +	set_ptes(dst_vma->vm_mm, addr, dst_pte, pte, nr);
>> +	return nr;
>>  }
>>  
>>  static inline struct folio *page_copy_prealloc(struct mm_struct *src_mm,
>> @@ -1087,15 +1174,28 @@ copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
>>  			 */
>>  			WARN_ON_ONCE(ret != -ENOENT);
>>  		}
>> -		/* copy_present_pte() will clear `*prealloc' if consumed */
>> -		ret = copy_present_pte(dst_vma, src_vma, dst_pte, src_pte,
>> -				       addr, rss, &prealloc);
>> +		/* copy_present_ptes() will clear `*prealloc' if consumed */
>> +		ret = copy_present_ptes(dst_vma, src_vma, dst_pte, src_pte,
>> +				       addr, end, rss, &prealloc);
>> +
>>  		/*
>>  		 * If we need a pre-allocated page for this pte, drop the
>>  		 * locks, allocate, and try again.
>>  		 */
>>  		if (unlikely(ret == -EAGAIN))
>>  			break;
>> +
>> +		/*
>> +		 * Positive return value is the number of ptes copied.
>> +		 */
>> +		VM_WARN_ON_ONCE(ret < 1);
>> +		progress += 8 * ret;
>> +		ret--;
> 
> Took me a second to figure out what was going on here. I think it would
> be clearer to rename ret to nr_ptes ...
> 
>> +		dst_pte += ret;
>> +		src_pte += ret;
>> +		addr += ret << PAGE_SHIFT;
>> +		ret = 0;
>> +
>>  		if (unlikely(prealloc)) {
>>  			/*
>>  			 * pre-alloc page cannot be reused by next time so as
>> @@ -1106,7 +1206,6 @@ copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
>>  			folio_put(prealloc);
>>  			prealloc = NULL;
>>  		}
>> -		progress += 8;
>>  	} while (dst_pte++, src_pte++, addr += PAGE_SIZE, addr != end);
> 
> ... and do dst_pte += nr_ptes, etc. here instead (noting of course that
> the continue clauses will need nr_ptes == 1, but perhaps reset that at
> the start of the loop).

Yes, much cleaner! Implementing for v3...
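
For what it's worth, the loop tail I'm picturing looks roughly like this
(untested, with the unchanged parts elided):

	int nr_ptes;
	...
	do {
		nr_ptes = 1;
		...
		/* copy_present_ptes() will clear `*prealloc' if consumed */
		ret = copy_present_ptes(dst_vma, src_vma, dst_pte, src_pte,
					addr, end, rss, &prealloc);
		if (unlikely(ret == -EAGAIN))
			break;

		nr_ptes = ret;
		progress += 8 * nr_ptes;
		...
	} while (dst_pte += nr_ptes, src_pte += nr_ptes,
		 addr += nr_ptes * PAGE_SIZE, addr != end);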

Thanks for the review!

Thanks,
Ryan

> 
>>  	arch_leave_lazy_mmu_mode();
>
Alistair Popple Nov. 23, 2023, 11:50 p.m. UTC | #24
Ryan Roberts <ryan.roberts@arm.com> writes:

> On 23/11/2023 04:26, Alistair Popple wrote:
>> 
>> Ryan Roberts <ryan.roberts@arm.com> writes:
>> 
>>> Convert copy_pte_range() to copy a set of ptes in a batch. A given batch
>>> maps a physically contiguous block of memory, all belonging to the same
>>> folio, with the same permissions, and for shared mappings, the same
>>> dirty state. This will likely improve performance by a tiny amount due
>>> to batching the folio reference count management and calling set_ptes()
>>> rather than making individual calls to set_pte_at().
>>>
>>> However, the primary motivation for this change is to reduce the number
>>> of tlb maintenance operations that the arm64 backend has to perform
>>> during fork, as it is about to add transparent support for the
>>> "contiguous bit" in its ptes. By write-protecting the parent using the
>>> new ptep_set_wrprotects() (note the 's' at the end) function, the
>>> backend can avoid having to unfold contig ranges of PTEs, which is
>>> expensive, when all ptes in the range are being write-protected.
>>> Similarly, by using set_ptes() rather than set_pte_at() to set up ptes
>>> in the child, the backend does not need to fold a contiguous range once
>>> they are all populated - they can be initially populated as a contiguous
>>> range in the first place.
>>>
>>> This change addresses the core-mm refactoring only, and introduces
>>> ptep_set_wrprotects() with a default implementation that calls
>>> ptep_set_wrprotect() for each pte in the range. A separate change will
>>> implement ptep_set_wrprotects() in the arm64 backend to realize the
>>> performance improvement as part of the work to enable contpte mappings.
>>>
>>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>>> ---
>>>  include/linux/pgtable.h |  13 +++
>>>  mm/memory.c             | 175 +++++++++++++++++++++++++++++++---------
>>>  2 files changed, 150 insertions(+), 38 deletions(-)
>>>
>>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
>>> index af7639c3b0a3..1c50f8a0fdde 100644
>>> --- a/include/linux/pgtable.h
>>> +++ b/include/linux/pgtable.h
>>> @@ -622,6 +622,19 @@ static inline void ptep_set_wrprotect(struct mm_struct *mm, unsigned long addres
>>>  }
>>>  #endif
>>>  
>>> +#ifndef ptep_set_wrprotects
>>> +struct mm_struct;
>>> +static inline void ptep_set_wrprotects(struct mm_struct *mm,
>>> +				unsigned long address, pte_t *ptep,
>>> +				unsigned int nr)
>>> +{
>>> +	unsigned int i;
>>> +
>>> +	for (i = 0; i < nr; i++, address += PAGE_SIZE, ptep++)
>>> +		ptep_set_wrprotect(mm, address, ptep);
>>> +}
>>> +#endif
>>> +
>>>  /*
>>>   * On some architectures hardware does not set page access bit when accessing
>>>   * memory page, it is responsibility of software setting this bit. It brings
>>> diff --git a/mm/memory.c b/mm/memory.c
>>> index 1f18ed4a5497..b7c8228883cf 100644
>>> --- a/mm/memory.c
>>> +++ b/mm/memory.c
>>> @@ -921,46 +921,129 @@ copy_present_page(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma
>>>  		/* Uffd-wp needs to be delivered to dest pte as well */
>>>  		pte = pte_mkuffd_wp(pte);
>>>  	set_pte_at(dst_vma->vm_mm, addr, dst_pte, pte);
>>> -	return 0;
>>> +	return 1;
>> 
>> We should update the function comment to indicate why we return 1 here
>> because it will become non-obvious in future. But perhaps it's better to
>> leave this as is and do the error check/return code calculation in
>> copy_present_ptes().
>
> OK, I'll return 0 for success and fix it up to 1 in copy_present_ptes().
>
>> 
>>> +}
>>> +
>>> +static inline unsigned long page_cont_mapped_vaddr(struct page *page,
>>> +				struct page *anchor, unsigned long anchor_vaddr)
>> 
>> It's likely I'm easily confused but the arguments here don't make much
>> sense to me. Something like this (noting that I've switched the argument
>> order) makes more sense to me at least:
>> 
>> static inline unsigned long page_cont_mapped_vaddr(struct page *page,
>>                             unsigned long page_vaddr, struct page *next_folio_page)
>
> I was originally using page_cont_mapped_vaddr() in more places than here and
> needed a more generic helper than just "what is the virtual address of the end
> of the folio, given a random page within the folio and its virtual address"; (I
> needed "what is the virtual address of a page given a different page and its
> virtual address and assuming the distance between the 2 pages is the same in
> physical and virtual space"). But given I don't need that generality anymore,
> yes, I agree I can simplify this significantly.

Thanks for the explanation, that explains my head scratching.

> I think I can remove the function entirely and replace with this in
> folio_nr_pages_cont_mapped():
>
> 	/*
> 	 * Loop either to `end` or to end of folio if its contiguously mapped,
> 	 * whichever is smaller.
> 	 */
> 	floops = (end - addr) >> PAGE_SHIFT;
> 	floops = min_t(int, floops,
> 		       folio_pfn(folio_next(folio)) - page_to_pfn(page));
>
> where `end` and `addr` are the parameters as passed into the function. What do
> you think?

I'll admit that by the end of the review I was wondering why we even needed
the extra function, so this looks good to me (the comment helps too!)

>> 
>>> +{
>>> +	unsigned long offset;
>>> +	unsigned long vaddr;
>>> +
>>> +	offset = (page_to_pfn(page) - page_to_pfn(anchor)) << PAGE_SHIFT;
>> 
>> Which IMHO makes this much more readable:
>> 
>> 	offset = (page_to_pfn(next_folio_page) - page_to_pfn(page)) << PAGE_SHIFT;
>> 
>>> +	vaddr = anchor_vaddr + offset;
>>> +
>>> +	if (anchor > page) {
>> 
>> And also highlights that I think this condition (page > folio_page_end)
>> is impossible to hit. Which is good ...
>> 
>>> +		if (vaddr > anchor_vaddr)
>>> +			return 0;
>> 
>> ... because I'm not sure returning 0 is valid as we would end up setting
>> floops = (0 - addr) >> PAGE_SHIFT which doesn't seem like it would end
>> particularly well :-)
>
> This was covering the more general case that I no longer need.
>
>> 
>>> +	} else {
>>> +		if (vaddr < anchor_vaddr)
>> 
>> Same here - isn't the vaddr of the next folio always going to be larger
>> than the vaddr for the current page? It seems this function is really
>> just calculating the virtual address of the next folio, or am I deeply
>> confused?
>
> This aims to protect against the corner case, where a page from a folio is
> mremap()ed very high in address space such that the extra pages from the anchor
> page to the end of the folio would actually wrap back to zero. But with the
> approach propsed above, this problem goes away, I think.
>
>> 
>>> +			return ULONG_MAX;
>>> +	}
>>> +
>>> +	return vaddr;
>>> +}
>>> +
>>> +static int folio_nr_pages_cont_mapped(struct folio *folio,
>>> +				      struct page *page, pte_t *pte,
>>> +				      unsigned long addr, unsigned long end,
>>> +				      pte_t ptent, bool *any_dirty)
>>> +{
>>> +	int floops;
>>> +	int i;
>>> +	unsigned long pfn;
>>> +	pgprot_t prot;
>>> +	struct page *folio_end;
>>> +
>>> +	if (!folio_test_large(folio))
>>> +		return 1;
>>> +
>>> +	folio_end = &folio->page + folio_nr_pages(folio);
>> 
>> I think you can replace this with:
>> 
>> folio_end = folio_next(folio)
>
> yep, done - thanks.
>
>> 
>> Although given this is only passed to page_cont_mapped_vaddr() perhaps
>> it's better to just pass the folio in and do the calculation there.
>> 
>>> +	end = min(page_cont_mapped_vaddr(folio_end, page, addr), end);
>>> +	floops = (end - addr) >> PAGE_SHIFT;
>>> +	pfn = page_to_pfn(page);
>>> +	prot = pte_pgprot(pte_mkold(pte_mkclean(ptent)));
>>> +
>>> +	*any_dirty = pte_dirty(ptent);
>>> +
>>> +	pfn++;
>>> +	pte++;
>>> +
>>> +	for (i = 1; i < floops; i++) {
>>> +		ptent = ptep_get(pte);
>>> +		ptent = pte_mkold(pte_mkclean(ptent));
>>> +
>>> +		if (!pte_present(ptent) || pte_pfn(ptent) != pfn ||
>>> +		    pgprot_val(pte_pgprot(ptent)) != pgprot_val(prot))
>>> +			break;
>>> +
>>> +		if (pte_dirty(ptent))
>>> +			*any_dirty = true;
>>> +
>>> +		pfn++;
>>> +		pte++;
>>> +	}
>>> +
>>> +	return i;
>>>  }
>>>  
>>>  /*
>>> - * Copy one pte.  Returns 0 if succeeded, or -EAGAIN if one preallocated page
>>> - * is required to copy this pte.
>>> + * Copy set of contiguous ptes.  Returns number of ptes copied if succeeded
>>> + * (always gte 1), or -EAGAIN if one preallocated page is required to copy the
>>> + * first pte.
>>>   */
>>>  static inline int
>>> -copy_present_pte(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
>>> -		 pte_t *dst_pte, pte_t *src_pte, unsigned long addr, int *rss,
>>> -		 struct folio **prealloc)
>>> +copy_present_ptes(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
>>> +		  pte_t *dst_pte, pte_t *src_pte,
>>> +		  unsigned long addr, unsigned long end,
>>> +		  int *rss, struct folio **prealloc)
>>>  {
>>>  	struct mm_struct *src_mm = src_vma->vm_mm;
>>>  	unsigned long vm_flags = src_vma->vm_flags;
>>>  	pte_t pte = ptep_get(src_pte);
>>>  	struct page *page;
>>>  	struct folio *folio;
>>> +	int nr = 1;
>>> +	bool anon;
>>> +	bool any_dirty = pte_dirty(pte);
>>> +	int i;
>>>  
>>>  	page = vm_normal_page(src_vma, addr, pte);
>>> -	if (page)
>>> +	if (page) {
>>>  		folio = page_folio(page);
>>> -	if (page && folio_test_anon(folio)) {
>>> -		/*
>>> -		 * If this page may have been pinned by the parent process,
>>> -		 * copy the page immediately for the child so that we'll always
>>> -		 * guarantee the pinned page won't be randomly replaced in the
>>> -		 * future.
>>> -		 */
>>> -		folio_get(folio);
>>> -		if (unlikely(page_try_dup_anon_rmap(page, false, src_vma))) {
>>> -			/* Page may be pinned, we have to copy. */
>>> -			folio_put(folio);
>>> -			return copy_present_page(dst_vma, src_vma, dst_pte, src_pte,
>>> -						 addr, rss, prealloc, page);
>>> +		anon = folio_test_anon(folio);
>>> +		nr = folio_nr_pages_cont_mapped(folio, page, src_pte, addr,
>>> +						end, pte, &any_dirty);
>>> +
>>> +		for (i = 0; i < nr; i++, page++) {
>>> +			if (anon) {
>>> +				/*
>>> +				 * If this page may have been pinned by the
>>> +				 * parent process, copy the page immediately for
>>> +				 * the child so that we'll always guarantee the
>>> +				 * pinned page won't be randomly replaced in the
>>> +				 * future.
>>> +				 */
>>> +				if (unlikely(page_try_dup_anon_rmap(
>>> +						page, false, src_vma))) {
>>> +					if (i != 0)
>>> +						break;
>>> +					/* Page may be pinned, we have to copy. */
>>> +					return copy_present_page(
>>> +						dst_vma, src_vma, dst_pte,
>>> +						src_pte, addr, rss, prealloc,
>>> +						page);
>>> +				}
>>> +				rss[MM_ANONPAGES]++;
>>> +				VM_BUG_ON(PageAnonExclusive(page));
>>> +			} else {
>>> +				page_dup_file_rmap(page, false);
>>> +				rss[mm_counter_file(page)]++;
>>> +			}
>>>  		}
>>> -		rss[MM_ANONPAGES]++;
>>> -	} else if (page) {
>>> -		folio_get(folio);
>>> -		page_dup_file_rmap(page, false);
>>> -		rss[mm_counter_file(page)]++;
>>> +
>>> +		nr = i;
>>> +		folio_ref_add(folio, nr);
>>>  	}
>>>  
>>>  	/*
>>> @@ -968,24 +1051,28 @@ copy_present_pte(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
>>>  	 * in the parent and the child
>>>  	 */
>>>  	if (is_cow_mapping(vm_flags) && pte_write(pte)) {
>>> -		ptep_set_wrprotect(src_mm, addr, src_pte);
>>> +		ptep_set_wrprotects(src_mm, addr, src_pte, nr);
>>>  		pte = pte_wrprotect(pte);
>>>  	}
>>> -	VM_BUG_ON(page && folio_test_anon(folio) && PageAnonExclusive(page));
>>>  
>>>  	/*
>>> -	 * If it's a shared mapping, mark it clean in
>>> -	 * the child
>>> +	 * If it's a shared mapping, mark it clean in the child. If its a
>>> +	 * private mapping, mark it dirty in the child if _any_ of the parent
>>> +	 * mappings in the block were marked dirty. The contiguous block of
>>> +	 * mappings are all backed by the same folio, so if any are dirty then
>>> +	 * the whole folio is dirty. This allows us to determine the batch size
>>> +	 * without having to ever consider the dirty bit. See
>>> +	 * folio_nr_pages_cont_mapped().
>>>  	 */
>>> -	if (vm_flags & VM_SHARED)
>>> -		pte = pte_mkclean(pte);
>>> -	pte = pte_mkold(pte);
>>> +	pte = pte_mkold(pte_mkclean(pte));
>>> +	if (!(vm_flags & VM_SHARED) && any_dirty)
>>> +		pte = pte_mkdirty(pte);
>>>  
>>>  	if (!userfaultfd_wp(dst_vma))
>>>  		pte = pte_clear_uffd_wp(pte);
>>>  
>>> -	set_pte_at(dst_vma->vm_mm, addr, dst_pte, pte);
>>> -	return 0;
>>> +	set_ptes(dst_vma->vm_mm, addr, dst_pte, pte, nr);
>>> +	return nr;
>>>  }
>>>  
>>>  static inline struct folio *page_copy_prealloc(struct mm_struct *src_mm,
>>> @@ -1087,15 +1174,28 @@ copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
>>>  			 */
>>>  			WARN_ON_ONCE(ret != -ENOENT);
>>>  		}
>>> -		/* copy_present_pte() will clear `*prealloc' if consumed */
>>> -		ret = copy_present_pte(dst_vma, src_vma, dst_pte, src_pte,
>>> -				       addr, rss, &prealloc);
>>> +		/* copy_present_ptes() will clear `*prealloc' if consumed */
>>> +		ret = copy_present_ptes(dst_vma, src_vma, dst_pte, src_pte,
>>> +				       addr, end, rss, &prealloc);
>>> +
>>>  		/*
>>>  		 * If we need a pre-allocated page for this pte, drop the
>>>  		 * locks, allocate, and try again.
>>>  		 */
>>>  		if (unlikely(ret == -EAGAIN))
>>>  			break;
>>> +
>>> +		/*
>>> +		 * Positive return value is the number of ptes copied.
>>> +		 */
>>> +		VM_WARN_ON_ONCE(ret < 1);
>>> +		progress += 8 * ret;
>>> +		ret--;
>> 
>> Took me a second to figure out what was going on here. I think it would
>> be clearer to rename ret to nr_ptes ...
>> 
>>> +		dst_pte += ret;
>>> +		src_pte += ret;
>>> +		addr += ret << PAGE_SHIFT;
>>> +		ret = 0;
>>> +
>>>  		if (unlikely(prealloc)) {
>>>  			/*
>>>  			 * pre-alloc page cannot be reused by next time so as
>>> @@ -1106,7 +1206,6 @@ copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
>>>  			folio_put(prealloc);
>>>  			prealloc = NULL;
>>>  		}
>>> -		progress += 8;
>>>  	} while (dst_pte++, src_pte++, addr += PAGE_SIZE, addr != end);
>> 
>> ... and do dst_pte += nr_ptes, etc. here instead (noting of course that
>> the continue clauses will need nr_ptes == 1, but perhaps reset that at
>> the start of the loop).
>
> Yes, much cleaner! Implementing for v3...
>
> Thanks for the review!
>
> Thanks,
> Ryan
>
>> 
>>>  	arch_leave_lazy_mmu_mode();
>>
David Hildenbrand Nov. 24, 2023, 8:53 a.m. UTC | #25
>> One could simply skip batching for now on pte_protnone() and focus on the
>> "writable" vs. "not-writable".
> 
> I'm not sure we can simply "skip" batching on pte_protnone() since we will need
> to terminate the batch if we spot it. But if we have to look for it anyway, we
> might as well just terminate the batch when the value of pte_protnone()
> *changes*. I'm also proposing to take this approach for pte_uffd_wp() which also
> needs to be carefully preserved per-pte.

Yes, that's what I meant.
Barry Song Nov. 27, 2023, 5:54 a.m. UTC | #26
> +copy_present_ptes(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
> +		  pte_t *dst_pte, pte_t *src_pte,
> +		  unsigned long addr, unsigned long end,
> +		  int *rss, struct folio **prealloc)
>  {
>  	struct mm_struct *src_mm = src_vma->vm_mm;
>  	unsigned long vm_flags = src_vma->vm_flags;
>  	pte_t pte = ptep_get(src_pte);
>  	struct page *page;
>  	struct folio *folio;
> +	int nr = 1;
> +	bool anon;
> +	bool any_dirty = pte_dirty(pte);
> +	int i;
>  
>  	page = vm_normal_page(src_vma, addr, pte);
> -	if (page)
> +	if (page) {
>  		folio = page_folio(page);
> -	if (page && folio_test_anon(folio)) {
> -		/*
> -		 * If this page may have been pinned by the parent process,
> -		 * copy the page immediately for the child so that we'll always
> -		 * guarantee the pinned page won't be randomly replaced in the
> -		 * future.
> -		 */
> -		folio_get(folio);
> -		if (unlikely(page_try_dup_anon_rmap(page, false, src_vma))) {
> -			/* Page may be pinned, we have to copy. */
> -			folio_put(folio);
> -			return copy_present_page(dst_vma, src_vma, dst_pte, src_pte,
> -						 addr, rss, prealloc, page);
> +		anon = folio_test_anon(folio);
> +		nr = folio_nr_pages_cont_mapped(folio, page, src_pte, addr,
> +						end, pte, &any_dirty);

in case we have a large folio with 16 CONTPTE basepages, and userspace
does madvise(addr + 4KB * 5, DONTNEED);

thus, the PTE of the 4th basepage becomes PTE_NONE and folio_nr_pages_cont_mapped()
will return 15. in this case, we should copy page0~page3 and page5~page15.

but the current code is copying page0~page14, right? unless we immediately
split_folio to basepages in zap_pte_range(), we will have problems?

> +
> +		for (i = 0; i < nr; i++, page++) {
> +			if (anon) {
> +				/*
> +				 * If this page may have been pinned by the
> +				 * parent process, copy the page immediately for
> +				 * the child so that we'll always guarantee the
> +				 * pinned page won't be randomly replaced in the
> +				 * future.
> +				 */
> +				if (unlikely(page_try_dup_anon_rmap(
> +						page, false, src_vma))) {
> +					if (i != 0)
> +						break;
> +					/* Page may be pinned, we have to copy. */
> +					return copy_present_page(
> +						dst_vma, src_vma, dst_pte,
> +						src_pte, addr, rss, prealloc,
> +						page);
> +				}
> +				rss[MM_ANONPAGES]++;
> +				VM_BUG_ON(PageAnonExclusive(page));
> +			} else {
> +				page_dup_file_rmap(page, false);
> +				rss[mm_counter_file(page)]++;
> +			}

Thanks
Barry
Barry Song Nov. 27, 2023, 8:42 a.m. UTC | #27
>> +		for (i = 0; i < nr; i++, page++) {
>> +			if (anon) {
>> +				/*
>> +				 * If this page may have been pinned by the
>> +				 * parent process, copy the page immediately for
>> +				 * the child so that we'll always guarantee the
>> +				 * pinned page won't be randomly replaced in the
>> +				 * future.
>> +				 */
>> +				if (unlikely(page_try_dup_anon_rmap(
>> +						page, false, src_vma))) {
>> +					if (i != 0)
>> +						break;
>> +					/* Page may be pinned, we have to copy. */
>> +					return copy_present_page(
>> +						dst_vma, src_vma, dst_pte,
>> +						src_pte, addr, rss, prealloc,
>> +						page);
>> +				}
>> +				rss[MM_ANONPAGES]++;
>> +				VM_BUG_ON(PageAnonExclusive(page));
>> +			} else {
>> +				page_dup_file_rmap(page, false);
>> +				rss[mm_counter_file(page)]++;
>> +			}
>>   		}
>> -		rss[MM_ANONPAGES]++;
>> -	} else if (page) {
>> -		folio_get(folio);
>> -		page_dup_file_rmap(page, false);
>> -		rss[mm_counter_file(page)]++;
>> +
>> +		nr = i;
>> +		folio_ref_add(folio, nr);
> 
> You're changing the order of mapcount vs. refcount increment. Don't. 
> Make sure your refcount >= mapcount.
> 
> You can do that easily by doing the folio_ref_add(folio, nr) first and 
> then decrementing in case of error accordingly. Errors due to pinned 
> pages are the corner case.
> 
> I'll note that it will make a lot of sense to have batch variants of 
> page_try_dup_anon_rmap() and page_dup_file_rmap().
> 

i still don't understand why it is not an entire map+1, but an increment
in each basepage.

as long as it is a CONTPTE large folio, there is not much difference from a
PMD-mapped large folio. it has every chance to be DoubleMapped and to need a
split.

When A and B share a CONTPTE large folio, we do madvise(DONTNEED) or any
similar things on a part of the large folio in process A,

this large folio will have a partially mapped subpage in A (all CONTPTE bits
in all subpages need to be removed though we only unmap a part of the
large folio, as HW requires consistent CONTPTEs); and it has an entire map in
process B (all PTEs are still CONTPTEs in process B).

isn't it more sensible for this large folio to have entire_map = 0 (for
process B), and for subpages which are still mapped in process A to have
map_count = 0? (start from -1).

> Especially, the batch variant of page_try_dup_anon_rmap() would only 
> check once if the folio maybe pinned, and in that case, you can simply 
> drop all references again. So you either have all or no ptes to process, 
> which makes that code easier.
> 
> But that can be added on top, and I'll happily do that.
> 
> -- 
> Cheers,
> 
> David / dhildenb

Thanks
Barry
Ryan Roberts Nov. 27, 2023, 9:24 a.m. UTC | #28
On 27/11/2023 05:54, Barry Song wrote:
>> +copy_present_ptes(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
>> +		  pte_t *dst_pte, pte_t *src_pte,
>> +		  unsigned long addr, unsigned long end,
>> +		  int *rss, struct folio **prealloc)
>>  {
>>  	struct mm_struct *src_mm = src_vma->vm_mm;
>>  	unsigned long vm_flags = src_vma->vm_flags;
>>  	pte_t pte = ptep_get(src_pte);
>>  	struct page *page;
>>  	struct folio *folio;
>> +	int nr = 1;
>> +	bool anon;
>> +	bool any_dirty = pte_dirty(pte);
>> +	int i;
>>  
>>  	page = vm_normal_page(src_vma, addr, pte);
>> -	if (page)
>> +	if (page) {
>>  		folio = page_folio(page);
>> -	if (page && folio_test_anon(folio)) {
>> -		/*
>> -		 * If this page may have been pinned by the parent process,
>> -		 * copy the page immediately for the child so that we'll always
>> -		 * guarantee the pinned page won't be randomly replaced in the
>> -		 * future.
>> -		 */
>> -		folio_get(folio);
>> -		if (unlikely(page_try_dup_anon_rmap(page, false, src_vma))) {
>> -			/* Page may be pinned, we have to copy. */
>> -			folio_put(folio);
>> -			return copy_present_page(dst_vma, src_vma, dst_pte, src_pte,
>> -						 addr, rss, prealloc, page);
>> +		anon = folio_test_anon(folio);
>> +		nr = folio_nr_pages_cont_mapped(folio, page, src_pte, addr,
>> +						end, pte, &any_dirty);
> 
> in case we have a large folio with 16 CONTPTE basepages, and userspace
> do madvise(addr + 4KB * 5, DONTNEED);

nit: if you are offsetting by 5 pages from addr, then below I think you mean
page0~page4 and page6~15?

> 
> thus, the 4th basepage of PTE becomes PTE_NONE and folio_nr_pages_cont_mapped()
> will return 15. in this case, we should copy page0~page3 and page5~page15.

No, I don't think folio_nr_pages_cont_mapped() will return 15; that's certainly
not how it's intended to work. The function is scanning forwards from the current
pte until it finds the first pte that does not fit in the batch - either because
it maps a PFN that is not contiguous, or because the permissions are different
(although this is being relaxed a bit; see conversation with DavidH against this
same patch).

So the first time through this loop, folio_nr_pages_cont_mapped() will return 5
(page0~page4), then the next time through the loop we will go through the
!present path and process the single swap marker. Then the 3rd time through the
loop folio_nr_pages_cont_mapped() will return 10.
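
If it helps, here is a toy user-space model of that scan (not kernel code; the
pfn values and the helper are made up purely for illustration), which splits
your example into exactly those two batches:

#include <stdio.h>

/* Toy model: pfns[i] == 0 means "pte not present", else the pfn that pte maps. */
static int nr_cont(const unsigned long *pfns, int start, int max)
{
	int i;

	for (i = 1; start + i < max; i++)
		if (pfns[start + i] != pfns[start] + i)
			break;
	return i;
}

int main(void)
{
	unsigned long pfns[16];
	int i, nr;

	for (i = 0; i < 16; i++)
		pfns[i] = 0x1000 + i;	/* one 16-page folio, contiguously mapped */
	pfns[5] = 0;			/* page5 zapped by MADV_DONTNEED */

	for (i = 0; i < 16; i += nr) {
		if (!pfns[i]) {		/* non-present entry handled on its own */
			nr = 1;
			continue;
		}
		nr = nr_cont(pfns, i, 16);
		printf("batch: page%d..page%d (%d ptes)\n", i, i + nr - 1, nr);
	}
	return 0;
}

This prints "page0..page4 (5 ptes)" and then "page6..page15 (10 ptes)".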

Thanks,
Ryan

> 
> but the current code is copying page0~page14, right? unless we are immediatly
> split_folio to basepages in zap_pte_range(), we will have problems?
> 
>> +
>> +		for (i = 0; i < nr; i++, page++) {
>> +			if (anon) {
>> +				/*
>> +				 * If this page may have been pinned by the
>> +				 * parent process, copy the page immediately for
>> +				 * the child so that we'll always guarantee the
>> +				 * pinned page won't be randomly replaced in the
>> +				 * future.
>> +				 */
>> +				if (unlikely(page_try_dup_anon_rmap(
>> +						page, false, src_vma))) {
>> +					if (i != 0)
>> +						break;
>> +					/* Page may be pinned, we have to copy. */
>> +					return copy_present_page(
>> +						dst_vma, src_vma, dst_pte,
>> +						src_pte, addr, rss, prealloc,
>> +						page);
>> +				}
>> +				rss[MM_ANONPAGES]++;
>> +				VM_BUG_ON(PageAnonExclusive(page));
>> +			} else {
>> +				page_dup_file_rmap(page, false);
>> +				rss[mm_counter_file(page)]++;
>> +			}
> 
> Thanks
> Barry
>
Ryan Roberts Nov. 27, 2023, 9:35 a.m. UTC | #29
On 27/11/2023 08:42, Barry Song wrote:
>>> +		for (i = 0; i < nr; i++, page++) {
>>> +			if (anon) {
>>> +				/*
>>> +				 * If this page may have been pinned by the
>>> +				 * parent process, copy the page immediately for
>>> +				 * the child so that we'll always guarantee the
>>> +				 * pinned page won't be randomly replaced in the
>>> +				 * future.
>>> +				 */
>>> +				if (unlikely(page_try_dup_anon_rmap(
>>> +						page, false, src_vma))) {
>>> +					if (i != 0)
>>> +						break;
>>> +					/* Page may be pinned, we have to copy. */
>>> +					return copy_present_page(
>>> +						dst_vma, src_vma, dst_pte,
>>> +						src_pte, addr, rss, prealloc,
>>> +						page);
>>> +				}
>>> +				rss[MM_ANONPAGES]++;
>>> +				VM_BUG_ON(PageAnonExclusive(page));
>>> +			} else {
>>> +				page_dup_file_rmap(page, false);
>>> +				rss[mm_counter_file(page)]++;
>>> +			}
>>>   		}
>>> -		rss[MM_ANONPAGES]++;
>>> -	} else if (page) {
>>> -		folio_get(folio);
>>> -		page_dup_file_rmap(page, false);
>>> -		rss[mm_counter_file(page)]++;
>>> +
>>> +		nr = i;
>>> +		folio_ref_add(folio, nr);
>>
>> You're changing the order of mapcount vs. refcount increment. Don't. 
>> Make sure your refcount >= mapcount.
>>
>> You can do that easily by doing the folio_ref_add(folio, nr) first and 
>> then decrementing in case of error accordingly. Errors due to pinned 
>> pages are the corner case.
>>
>> I'll note that it will make a lot of sense to have batch variants of 
>> page_try_dup_anon_rmap() and page_dup_file_rmap().
>>
> 
> i still don't understand why it is not a entire map+1, but an increment
> in each basepage.

Because we are PTE-mapping the folio, we have to account each individual page.
If we accounted the entire folio, where would we unaccount it? Each page can be
unmapped individually (e.g. munmap() part of the folio) so we need to account
each page. When PMD mapping, the whole thing is either mapped or unmapped, and
it's atomic, so we can account the entire thing.

> 
> as long as it is a CONTPTE large folio, there is no much difference with
> PMD-mapped large folio. it has all the chance to be DoubleMap and need
> split.
> 
> When A and B share a CONTPTE large folio, we do madvise(DONTNEED) or any
> similar things on a part of the large folio in process A,
> 
> this large folio will have partially mapped subpage in A (all CONTPE bits
> in all subpages need to be removed though we only unmap a part of the
> large folioas HW requires consistent CONTPTEs); and it has entire map in
> process B(all PTEs are still CONPTES in process B).
> 
> isn't it more sensible for this large folios to have entire_map = 0(for
> process B), and subpages which are still mapped in process A has map_count
> =0? (start from -1).
> 
>> Especially, the batch variant of page_try_dup_anon_rmap() would only 
>> check once if the folio maybe pinned, and in that case, you can simply 
>> drop all references again. So you either have all or no ptes to process, 
>> which makes that code easier.

I'm afraid this doesn't make sense to me. Perhaps I've misunderstood. But
fundamentally you can only use entire_mapcount if it's only possible to map and
unmap the whole folio atomically.

>>
>> But that can be added on top, and I'll happily do that.
>>
>> -- 
>> Cheers,
>>
>> David / dhildenb
> 
> Thanks
> Barry
>
Barry Song Nov. 27, 2023, 9:59 a.m. UTC | #30
On Mon, Nov 27, 2023 at 10:35 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> On 27/11/2023 08:42, Barry Song wrote:
> >>> +           for (i = 0; i < nr; i++, page++) {
> >>> +                   if (anon) {
> >>> +                           /*
> >>> +                            * If this page may have been pinned by the
> >>> +                            * parent process, copy the page immediately for
> >>> +                            * the child so that we'll always guarantee the
> >>> +                            * pinned page won't be randomly replaced in the
> >>> +                            * future.
> >>> +                            */
> >>> +                           if (unlikely(page_try_dup_anon_rmap(
> >>> +                                           page, false, src_vma))) {
> >>> +                                   if (i != 0)
> >>> +                                           break;
> >>> +                                   /* Page may be pinned, we have to copy. */
> >>> +                                   return copy_present_page(
> >>> +                                           dst_vma, src_vma, dst_pte,
> >>> +                                           src_pte, addr, rss, prealloc,
> >>> +                                           page);
> >>> +                           }
> >>> +                           rss[MM_ANONPAGES]++;
> >>> +                           VM_BUG_ON(PageAnonExclusive(page));
> >>> +                   } else {
> >>> +                           page_dup_file_rmap(page, false);
> >>> +                           rss[mm_counter_file(page)]++;
> >>> +                   }
> >>>             }
> >>> -           rss[MM_ANONPAGES]++;
> >>> -   } else if (page) {
> >>> -           folio_get(folio);
> >>> -           page_dup_file_rmap(page, false);
> >>> -           rss[mm_counter_file(page)]++;
> >>> +
> >>> +           nr = i;
> >>> +           folio_ref_add(folio, nr);
> >>
> >> You're changing the order of mapcount vs. refcount increment. Don't.
> >> Make sure your refcount >= mapcount.
> >>
> >> You can do that easily by doing the folio_ref_add(folio, nr) first and
> >> then decrementing in case of error accordingly. Errors due to pinned
> >> pages are the corner case.
> >>
> >> I'll note that it will make a lot of sense to have batch variants of
> >> page_try_dup_anon_rmap() and page_dup_file_rmap().
> >>
> >
> > i still don't understand why it is not a entire map+1, but an increment
> > in each basepage.
>
> Because we are PTE-mapping the folio, we have to account each individual page.
> If we accounted the entire folio, where would we unaccount it? Each page can be
> unmapped individually (e.g. munmap() part of the folio) so need to account each
> page. When PMD mapping, the whole thing is either mapped or unmapped, and its
> atomic, so we can account the entire thing.

Hi Ryan,

There is no problem. for example, a large folio is entirely mapped in
process A with CONTPTE,
and only page2 is mapped in process B.
then we will have

entire_map = 0
page0.map = -1
page1.map = -1
page2.map = 0
page3.map = -1
....

>
> >
> > as long as it is a CONTPTE large folio, there is no much difference with
> > PMD-mapped large folio. it has all the chance to be DoubleMap and need
> > split.
> >
> > When A and B share a CONTPTE large folio, we do madvise(DONTNEED) or any
> > similar things on a part of the large folio in process A,
> >
> > this large folio will have partially mapped subpage in A (all CONTPE bits
> > in all subpages need to be removed though we only unmap a part of the
> > large folioas HW requires consistent CONTPTEs); and it has entire map in
> > process B(all PTEs are still CONPTES in process B).
> >
> > isn't it more sensible for this large folios to have entire_map = 0(for
> > process B), and subpages which are still mapped in process A has map_count
> > =0? (start from -1).
> >
> >> Especially, the batch variant of page_try_dup_anon_rmap() would only
> >> check once if the folio maybe pinned, and in that case, you can simply
> >> drop all references again. So you either have all or no ptes to process,
> >> which makes that code easier.
>
> I'm afraid this doesn't make sense to me. Perhaps I've misunderstood. But
> fundamentally you can only use entire_mapcount if its only possible to map and
> unmap the whole folio atomically.



My point is that CONTPTEs should either be all set in all 16 PTEs or all be
dropped in all 16 PTEs. if all PTEs have CONT, it is entirely mapped;
otherwise, it is partially mapped. if a large folio is mapped in one process
with all CONTPTEs and meanwhile in another process with partial mapping
(w/o CONTPTE), it is DoubleMapped.

Since we always hold ptl to set or drop CONTPTE bits, set/drop is still
atomic in a spinlock area.

>
> >>
> >> But that can be added on top, and I'll happily do that.
> >>
> >> --
> >> Cheers,
> >>
> >> David / dhildenb
> >

Thanks
Barry
Ryan Roberts Nov. 27, 2023, 10:10 a.m. UTC | #31
On 27/11/2023 09:59, Barry Song wrote:
> On Mon, Nov 27, 2023 at 10:35 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>
>> On 27/11/2023 08:42, Barry Song wrote:
>>>>> +           for (i = 0; i < nr; i++, page++) {
>>>>> +                   if (anon) {
>>>>> +                           /*
>>>>> +                            * If this page may have been pinned by the
>>>>> +                            * parent process, copy the page immediately for
>>>>> +                            * the child so that we'll always guarantee the
>>>>> +                            * pinned page won't be randomly replaced in the
>>>>> +                            * future.
>>>>> +                            */
>>>>> +                           if (unlikely(page_try_dup_anon_rmap(
>>>>> +                                           page, false, src_vma))) {
>>>>> +                                   if (i != 0)
>>>>> +                                           break;
>>>>> +                                   /* Page may be pinned, we have to copy. */
>>>>> +                                   return copy_present_page(
>>>>> +                                           dst_vma, src_vma, dst_pte,
>>>>> +                                           src_pte, addr, rss, prealloc,
>>>>> +                                           page);
>>>>> +                           }
>>>>> +                           rss[MM_ANONPAGES]++;
>>>>> +                           VM_BUG_ON(PageAnonExclusive(page));
>>>>> +                   } else {
>>>>> +                           page_dup_file_rmap(page, false);
>>>>> +                           rss[mm_counter_file(page)]++;
>>>>> +                   }
>>>>>             }
>>>>> -           rss[MM_ANONPAGES]++;
>>>>> -   } else if (page) {
>>>>> -           folio_get(folio);
>>>>> -           page_dup_file_rmap(page, false);
>>>>> -           rss[mm_counter_file(page)]++;
>>>>> +
>>>>> +           nr = i;
>>>>> +           folio_ref_add(folio, nr);
>>>>
>>>> You're changing the order of mapcount vs. refcount increment. Don't.
>>>> Make sure your refcount >= mapcount.
>>>>
>>>> You can do that easily by doing the folio_ref_add(folio, nr) first and
>>>> then decrementing in case of error accordingly. Errors due to pinned
>>>> pages are the corner case.
>>>>
>>>> I'll note that it will make a lot of sense to have batch variants of
>>>> page_try_dup_anon_rmap() and page_dup_file_rmap().
>>>>
>>>
>>> i still don't understand why it is not a entire map+1, but an increment
>>> in each basepage.
>>
>> Because we are PTE-mapping the folio, we have to account each individual page.
>> If we accounted the entire folio, where would we unaccount it? Each page can be
>> unmapped individually (e.g. munmap() part of the folio) so need to account each
>> page. When PMD mapping, the whole thing is either mapped or unmapped, and its
>> atomic, so we can account the entire thing.
> 
> Hi Ryan,
> 
> There is no problem. for example, a large folio is entirely mapped in
> process A with CONPTE,
> and only page2 is mapped in process B.
> then we will have
> 
> entire_map = 0
> page0.map = -1
> page1.map = -1
> page2.map = 0
> page3.map = -1
> ....
> 
>>
>>>
>>> as long as it is a CONTPTE large folio, there is no much difference with
>>> PMD-mapped large folio. it has all the chance to be DoubleMap and need
>>> split.
>>>
>>> When A and B share a CONTPTE large folio, we do madvise(DONTNEED) or any
>>> similar things on a part of the large folio in process A,
>>>
>>> this large folio will have partially mapped subpage in A (all CONTPE bits
>>> in all subpages need to be removed though we only unmap a part of the
>>> large folioas HW requires consistent CONTPTEs); and it has entire map in
>>> process B(all PTEs are still CONPTES in process B).
>>>
>>> isn't it more sensible for this large folios to have entire_map = 0(for
>>> process B), and subpages which are still mapped in process A has map_count
>>> =0? (start from -1).
>>>
>>>> Especially, the batch variant of page_try_dup_anon_rmap() would only
>>>> check once if the folio maybe pinned, and in that case, you can simply
>>>> drop all references again. So you either have all or no ptes to process,
>>>> which makes that code easier.
>>
>> I'm afraid this doesn't make sense to me. Perhaps I've misunderstood. But
>> fundamentally you can only use entire_mapcount if its only possible to map and
>> unmap the whole folio atomically.
> 
> 
> 
> My point is that CONTPEs should either all-set in all 16 PTEs or all are dropped
> in 16 PTEs. if all PTEs have CONT, it is entirely mapped; otherwise,
> it is partially
> mapped. if a large folio is mapped in one processes with all CONTPTEs
> and meanwhile in another process with partial mapping(w/o CONTPTE), it is
> DoubleMapped.

There are 2 problems with your proposal, as I see it:

1) the core-mm is not enlightened for CONTPTE mappings. As far as it is
concerned, it's just mapping a bunch of PTEs. So it has no hook to inc/dec
entire_mapcount. The arch code is opportunistically and *transparently* managing
the CONT_PTE bit.

2) There is nothing to say a folio isn't *bigger* than the contpte block; it may
be 128K and be mapped with 2 contpte blocks. Or even a PTE-mapped THP (2M) and
be mapped with 32 contpte blocks. So you can't say it is entirely mapped
unless/until ALL of those blocks are set up. And then of course each block could
be unmapped non-atomically.

For the PMD case there are actually 2 properties that allow using the
entire_mapcount optimization: it's atomically mapped/unmapped through the PMD
and we know that the folio is exactly PMD sized (since it must be at least PMD
sized to be able to map it with the PMD, and we don't allocate THPs any bigger
than PMD size). So one PMD map or unmap operation corresponds to exactly one
*entire* map or unmap. That is not true when we are PTE mapping.
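
(Putting numbers on that, assuming 4K base pages and 16-page contpte blocks, a
throwaway user-space calculation, not kernel code:)

#include <stdio.h>

int main(void)
{
	unsigned long page_size = 4096;			/* 4K base pages */
	unsigned long contpte = 16 * page_size;		/* one contpte block = 64K */
	unsigned long folios[] = { 64UL << 10, 128UL << 10, 2UL << 20 };
	int i;

	for (i = 0; i < 3; i++)
		printf("%4luK folio -> %lu contpte block(s)\n",
		       folios[i] >> 10, folios[i] / contpte);
	return 0;
}

i.e. a 64K folio is one block, 128K is 2 blocks and a 2M PTE-mapped THP is 32
blocks.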

> 
> Since we always hold ptl to set or drop CONTPTE bits, set/drop is
> still atomic in a
> spinlock area.
> 
>>
>>>>
>>>> But that can be added on top, and I'll happily do that.
>>>>
>>>> --
>>>> Cheers,
>>>>
>>>> David / dhildenb
>>>
> 
> Thanks
> Barry
Barry Song Nov. 27, 2023, 10:28 a.m. UTC | #32
On Mon, Nov 27, 2023 at 11:11 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> On 27/11/2023 09:59, Barry Song wrote:
> > On Mon, Nov 27, 2023 at 10:35 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
> >>
> >> On 27/11/2023 08:42, Barry Song wrote:
> >>>>> +           for (i = 0; i < nr; i++, page++) {
> >>>>> +                   if (anon) {
> >>>>> +                           /*
> >>>>> +                            * If this page may have been pinned by the
> >>>>> +                            * parent process, copy the page immediately for
> >>>>> +                            * the child so that we'll always guarantee the
> >>>>> +                            * pinned page won't be randomly replaced in the
> >>>>> +                            * future.
> >>>>> +                            */
> >>>>> +                           if (unlikely(page_try_dup_anon_rmap(
> >>>>> +                                           page, false, src_vma))) {
> >>>>> +                                   if (i != 0)
> >>>>> +                                           break;
> >>>>> +                                   /* Page may be pinned, we have to copy. */
> >>>>> +                                   return copy_present_page(
> >>>>> +                                           dst_vma, src_vma, dst_pte,
> >>>>> +                                           src_pte, addr, rss, prealloc,
> >>>>> +                                           page);
> >>>>> +                           }
> >>>>> +                           rss[MM_ANONPAGES]++;
> >>>>> +                           VM_BUG_ON(PageAnonExclusive(page));
> >>>>> +                   } else {
> >>>>> +                           page_dup_file_rmap(page, false);
> >>>>> +                           rss[mm_counter_file(page)]++;
> >>>>> +                   }
> >>>>>             }
> >>>>> -           rss[MM_ANONPAGES]++;
> >>>>> -   } else if (page) {
> >>>>> -           folio_get(folio);
> >>>>> -           page_dup_file_rmap(page, false);
> >>>>> -           rss[mm_counter_file(page)]++;
> >>>>> +
> >>>>> +           nr = i;
> >>>>> +           folio_ref_add(folio, nr);
> >>>>
> >>>> You're changing the order of mapcount vs. refcount increment. Don't.
> >>>> Make sure your refcount >= mapcount.
> >>>>
> >>>> You can do that easily by doing the folio_ref_add(folio, nr) first and
> >>>> then decrementing in case of error accordingly. Errors due to pinned
> >>>> pages are the corner case.
> >>>>
> >>>> I'll note that it will make a lot of sense to have batch variants of
> >>>> page_try_dup_anon_rmap() and page_dup_file_rmap().
> >>>>
> >>>
> >>> i still don't understand why it is not a entire map+1, but an increment
> >>> in each basepage.
> >>
> >> Because we are PTE-mapping the folio, we have to account each individual page.
> >> If we accounted the entire folio, where would we unaccount it? Each page can be
> >> unmapped individually (e.g. munmap() part of the folio) so need to account each
> >> page. When PMD mapping, the whole thing is either mapped or unmapped, and its
> >> atomic, so we can account the entire thing.
> >
> > Hi Ryan,
> >
> > There is no problem. for example, a large folio is entirely mapped in
> > process A with CONPTE,
> > and only page2 is mapped in process B.
> > then we will have
> >
> > entire_map = 0
> > page0.map = -1
> > page1.map = -1
> > page2.map = 0
> > page3.map = -1
> > ....
> >
> >>
> >>>
> >>> as long as it is a CONTPTE large folio, there is no much difference with
> >>> PMD-mapped large folio. it has all the chance to be DoubleMap and need
> >>> split.
> >>>
> >>> When A and B share a CONTPTE large folio, we do madvise(DONTNEED) or any
> >>> similar things on a part of the large folio in process A,
> >>>
> >>> this large folio will have partially mapped subpage in A (all CONTPE bits
> >>> in all subpages need to be removed though we only unmap a part of the
> >>> large folioas HW requires consistent CONTPTEs); and it has entire map in
> >>> process B(all PTEs are still CONPTES in process B).
> >>>
> >>> isn't it more sensible for this large folios to have entire_map = 0(for
> >>> process B), and subpages which are still mapped in process A has map_count
> >>> =0? (start from -1).
> >>>
> >>>> Especially, the batch variant of page_try_dup_anon_rmap() would only
> >>>> check once if the folio maybe pinned, and in that case, you can simply
> >>>> drop all references again. So you either have all or no ptes to process,
> >>>> which makes that code easier.
> >>
> >> I'm afraid this doesn't make sense to me. Perhaps I've misunderstood. But
> >> fundamentally you can only use entire_mapcount if its only possible to map and
> >> unmap the whole folio atomically.
> >
> >
> >
> > My point is that CONTPEs should either all-set in all 16 PTEs or all are dropped
> > in 16 PTEs. if all PTEs have CONT, it is entirely mapped; otherwise,
> > it is partially
> > mapped. if a large folio is mapped in one processes with all CONTPTEs
> > and meanwhile in another process with partial mapping(w/o CONTPTE), it is
> > DoubleMapped.
>
> There are 2 problems with your proposal, as I see it;
>
> 1) the core-mm is not enlightened for CONTPTE mappings. As far as it is
> concerned, its just mapping a bunch of PTEs. So it has no hook to inc/dec
> entire_mapcount. The arch code is opportunistically and *transparently* managing
> the CONT_PTE bit.
>
> 2) There is nothing to say a folio isn't *bigger* than the contpte block; it may
> be 128K and be mapped with 2 contpte blocks. Or even a PTE-mapped THP (2M) and
> be mapped with 32 contpte blocks. So you can't say it is entirely mapped
> unless/until ALL of those blocks are set up. And then of course each block could
> be unmapped unatomically.
>
> For the PMD case there are actually 2 properties that allow using the
> entire_mapcount optimization; It's atomically mapped/unmapped through the PMD
> and we know that the folio is exactly PMD sized (since it must be at least PMD
> sized to be able to map it with the PMD, and we don't allocate THPs any bigger
> than PMD size). So one PMD map or unmap operation corresponds to exactly one
> *entire* map or unmap. That is not true when we are PTE mapping.

Well, thanks for the clarification. Based on the above description, I agree the
current code might make more sense by always using the mapcount in each subpage.

I gave my proposal because I thought small-THP would always be exactly CONTPTE
sized; then we could drop the loop that iterates the rmap 16 times, and instead
do the rmap dup once for all 16 PTEs by incrementing entire_map.

BTW, I have concerns about whether a variable small-THP size will really work,
as userspace is probably friendly to only one fixed size. For example, userspace
heap management might be optimized for one particular size when freeing memory
to the kernel, and it is very difficult for the heap to adapt to several sizes
at the same time. Frequent unmaps/frees whose size does not match, and in
particular is smaller than, the small-THP size will defeat all efforts to use
small-THP.

>
> >
> > Since we always hold ptl to set or drop CONTPTE bits, set/drop is
> > still atomic in a
> > spinlock area.
> >
> >>
> >>>>
> >>>> But that can be added on top, and I'll happily do that.
> >>>>
> >>>> --
> >>>> Cheers,
> >>>>
> >>>> David / dhildenb
> >>>
> >

Thanks
Barry
Ryan Roberts Nov. 27, 2023, 11:07 a.m. UTC | #33
On 27/11/2023 10:28, Barry Song wrote:
> On Mon, Nov 27, 2023 at 11:11 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>
>> On 27/11/2023 09:59, Barry Song wrote:
>>> On Mon, Nov 27, 2023 at 10:35 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>
>>>> On 27/11/2023 08:42, Barry Song wrote:
>>>>>>> +           for (i = 0; i < nr; i++, page++) {
>>>>>>> +                   if (anon) {
>>>>>>> +                           /*
>>>>>>> +                            * If this page may have been pinned by the
>>>>>>> +                            * parent process, copy the page immediately for
>>>>>>> +                            * the child so that we'll always guarantee the
>>>>>>> +                            * pinned page won't be randomly replaced in the
>>>>>>> +                            * future.
>>>>>>> +                            */
>>>>>>> +                           if (unlikely(page_try_dup_anon_rmap(
>>>>>>> +                                           page, false, src_vma))) {
>>>>>>> +                                   if (i != 0)
>>>>>>> +                                           break;
>>>>>>> +                                   /* Page may be pinned, we have to copy. */
>>>>>>> +                                   return copy_present_page(
>>>>>>> +                                           dst_vma, src_vma, dst_pte,
>>>>>>> +                                           src_pte, addr, rss, prealloc,
>>>>>>> +                                           page);
>>>>>>> +                           }
>>>>>>> +                           rss[MM_ANONPAGES]++;
>>>>>>> +                           VM_BUG_ON(PageAnonExclusive(page));
>>>>>>> +                   } else {
>>>>>>> +                           page_dup_file_rmap(page, false);
>>>>>>> +                           rss[mm_counter_file(page)]++;
>>>>>>> +                   }
>>>>>>>             }
>>>>>>> -           rss[MM_ANONPAGES]++;
>>>>>>> -   } else if (page) {
>>>>>>> -           folio_get(folio);
>>>>>>> -           page_dup_file_rmap(page, false);
>>>>>>> -           rss[mm_counter_file(page)]++;
>>>>>>> +
>>>>>>> +           nr = i;
>>>>>>> +           folio_ref_add(folio, nr);
>>>>>>
>>>>>> You're changing the order of mapcount vs. refcount increment. Don't.
>>>>>> Make sure your refcount >= mapcount.
>>>>>>
>>>>>> You can do that easily by doing the folio_ref_add(folio, nr) first and
>>>>>> then decrementing in case of error accordingly. Errors due to pinned
>>>>>> pages are the corner case.
>>>>>>
>>>>>> I'll note that it will make a lot of sense to have batch variants of
>>>>>> page_try_dup_anon_rmap() and page_dup_file_rmap().
>>>>>>
>>>>>
>>>>> i still don't understand why it is not a entire map+1, but an increment
>>>>> in each basepage.
>>>>
>>>> Because we are PTE-mapping the folio, we have to account each individual page.
>>>> If we accounted the entire folio, where would we unaccount it? Each page can be
>>>> unmapped individually (e.g. munmap() part of the folio) so need to account each
>>>> page. When PMD mapping, the whole thing is either mapped or unmapped, and its
>>>> atomic, so we can account the entire thing.
>>>
>>> Hi Ryan,
>>>
>>> There is no problem. for example, a large folio is entirely mapped in
>>> process A with CONPTE,
>>> and only page2 is mapped in process B.
>>> then we will have
>>>
>>> entire_map = 0
>>> page0.map = -1
>>> page1.map = -1
>>> page2.map = 0
>>> page3.map = -1
>>> ....
>>>
>>>>
>>>>>
>>>>> as long as it is a CONTPTE large folio, there is no much difference with
>>>>> PMD-mapped large folio. it has all the chance to be DoubleMap and need
>>>>> split.
>>>>>
>>>>> When A and B share a CONTPTE large folio, we do madvise(DONTNEED) or any
>>>>> similar things on a part of the large folio in process A,
>>>>>
>>>>> this large folio will have partially mapped subpage in A (all CONTPE bits
>>>>> in all subpages need to be removed though we only unmap a part of the
>>>>> large folioas HW requires consistent CONTPTEs); and it has entire map in
>>>>> process B(all PTEs are still CONPTES in process B).
>>>>>
>>>>> isn't it more sensible for this large folios to have entire_map = 0(for
>>>>> process B), and subpages which are still mapped in process A has map_count
>>>>> =0? (start from -1).
>>>>>
>>>>>> Especially, the batch variant of page_try_dup_anon_rmap() would only
>>>>>> check once if the folio maybe pinned, and in that case, you can simply
>>>>>> drop all references again. So you either have all or no ptes to process,
>>>>>> which makes that code easier.
>>>>
>>>> I'm afraid this doesn't make sense to me. Perhaps I've misunderstood. But
>>>> fundamentally you can only use entire_mapcount if its only possible to map and
>>>> unmap the whole folio atomically.
>>>
>>>
>>>
>>> My point is that CONTPEs should either all-set in all 16 PTEs or all are dropped
>>> in 16 PTEs. if all PTEs have CONT, it is entirely mapped; otherwise,
>>> it is partially
>>> mapped. if a large folio is mapped in one processes with all CONTPTEs
>>> and meanwhile in another process with partial mapping(w/o CONTPTE), it is
>>> DoubleMapped.
>>
>> There are 2 problems with your proposal, as I see it;
>>
>> 1) the core-mm is not enlightened for CONTPTE mappings. As far as it is
>> concerned, its just mapping a bunch of PTEs. So it has no hook to inc/dec
>> entire_mapcount. The arch code is opportunistically and *transparently* managing
>> the CONT_PTE bit.
>>
>> 2) There is nothing to say a folio isn't *bigger* than the contpte block; it may
>> be 128K and be mapped with 2 contpte blocks. Or even a PTE-mapped THP (2M) and
>> be mapped with 32 contpte blocks. So you can't say it is entirely mapped
>> unless/until ALL of those blocks are set up. And then of course each block could
>> be unmapped unatomically.
>>
>> For the PMD case there are actually 2 properties that allow using the
>> entire_mapcount optimization; It's atomically mapped/unmapped through the PMD
>> and we know that the folio is exactly PMD sized (since it must be at least PMD
>> sized to be able to map it with the PMD, and we don't allocate THPs any bigger
>> than PMD size). So one PMD map or unmap operation corresponds to exactly one
>> *entire* map or unmap. That is not true when we are PTE mapping.
> 
> well. Thanks for clarification. based on the above description, i agree the
> current code might make more sense by always using mapcount in subpage.
> 
> I gave my proposals as  I thought we were always CONTPTE size for small-THP
> then we could drop the loop to iterate 16 times rmap. if we do it
> entirely, we only
> need to do dup rmap once for all 16 PTEs by increasing entire_map.

Well, it's always good to have the discussion - so thanks for the ideas. I think
there is a bigger question lurking here; should we be exposing the concept of
contpte mappings to the core-mm rather than burying it in the arm64 arch code?
I'm confident that would be a huge amount of effort and the end result would be
similar performance to what this approach gives. One potential benefit of letting
core-mm control it is that it would also give control to core-mm over the
granularity of access/dirty reporting (my approach implicitly ties it to the
folio). Having sub-folio access tracking _could_ potentially help with future
work to make THP size selection automatic, but we are not there yet, and I think
there are other (simpler) ways to achieve the same thing. So my view is that
_not_ exposing it to core-mm is the right way for now.
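
As an illustration of why my approach implicitly ties access/dirty granularity
to the contpte block (and so, in practice, to the folio), here is a rough
sketch of the idea. It is not the arm64 implementation from this series; the
helper name is invented and 16 is just an example block size.

/*
 * Sketch: reading the effective access/dirty state of one entry in a
 * contiguous block. Because the block shares a single TLB entry, hardware may
 * have set the access/dirty bits on any member, so the OR across the block is
 * reported - i.e. the reporting granularity becomes the whole block.
 */
#define SKETCH_CONT_PTES	16

static pte_t sketch_contpte_get(pte_t *block_start)
{
	pte_t pte = ptep_get(block_start);
	int i;

	for (i = 0; i < SKETCH_CONT_PTES; i++) {
		pte_t entry = ptep_get(block_start + i);

		if (pte_young(entry))
			pte = pte_mkyoung(pte);
		if (pte_dirty(entry))
			pte = pte_mkdirty(pte);
	}

	return pte;
}

Core-mm only ever sees that aggregated value, which is why finer-grained
tracking would require pushing the concept up out of the arch code.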

> 
> BTW, I have concerns that a variable small-THP size will really work
> as userspace
> is probably friendly to only one fixed size. for example, userspace
> heap management
> might be optimized to a size for freeing memory to the kernel. it is
> very difficult
> for the heap to adapt to various sizes at the same time. frequent unmap/free
> size not equal with, and particularly smaller than small-THP size will
> defeat all
> efforts to use small-THP.

I'll admit to not knowing a huge amount about user space allocators. But I will
say that as currently defined, the small-sized THP interface to user space
allows a sysadmin to specifically enable the set of sizes that they want; so a
single size can be enabled. I'm deliberately punting that decision away from the
kernel for now.

FWIW, my experience with the Speedometer/JavaScript use case is that performance
is a little bit better when enabling 64+32+16K vs just 64K THP.

Functionally, it will not matter if the allocator is not enlightened for the THP
size; it can continue to free as before, and if part of a folio is unmapped the
folio is put on the deferred split list, then under memory pressure it is split
and the unused pages are reclaimed. I guess this is the bit you are concerned
will have a performance impact?

Regardless, it would be good to move this conversation to the small-sized THP
patch series since this is all independent of contpte mappings.

> 
>>
>>>
>>> Since we always hold ptl to set or drop CONTPTE bits, set/drop is
>>> still atomic in a
>>> spinlock area.
>>>
>>>>
>>>>>>
>>>>>> But that can be added on top, and I'll happily do that.
>>>>>>
>>>>>> --
>>>>>> Cheers,
>>>>>>
>>>>>> David / dhildenb
>>>>>
>>>
> 
> Thanks
> Barry
Barry Song Nov. 27, 2023, 8:34 p.m. UTC | #34
On Tue, Nov 28, 2023 at 12:07 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> On 27/11/2023 10:28, Barry Song wrote:
> > On Mon, Nov 27, 2023 at 11:11 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
> >>
> >> On 27/11/2023 09:59, Barry Song wrote:
> >>> On Mon, Nov 27, 2023 at 10:35 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
> >>>>
> >>>> On 27/11/2023 08:42, Barry Song wrote:
> >>>>>>> +           for (i = 0; i < nr; i++, page++) {
> >>>>>>> +                   if (anon) {
> >>>>>>> +                           /*
> >>>>>>> +                            * If this page may have been pinned by the
> >>>>>>> +                            * parent process, copy the page immediately for
> >>>>>>> +                            * the child so that we'll always guarantee the
> >>>>>>> +                            * pinned page won't be randomly replaced in the
> >>>>>>> +                            * future.
> >>>>>>> +                            */
> >>>>>>> +                           if (unlikely(page_try_dup_anon_rmap(
> >>>>>>> +                                           page, false, src_vma))) {
> >>>>>>> +                                   if (i != 0)
> >>>>>>> +                                           break;
> >>>>>>> +                                   /* Page may be pinned, we have to copy. */
> >>>>>>> +                                   return copy_present_page(
> >>>>>>> +                                           dst_vma, src_vma, dst_pte,
> >>>>>>> +                                           src_pte, addr, rss, prealloc,
> >>>>>>> +                                           page);
> >>>>>>> +                           }
> >>>>>>> +                           rss[MM_ANONPAGES]++;
> >>>>>>> +                           VM_BUG_ON(PageAnonExclusive(page));
> >>>>>>> +                   } else {
> >>>>>>> +                           page_dup_file_rmap(page, false);
> >>>>>>> +                           rss[mm_counter_file(page)]++;
> >>>>>>> +                   }
> >>>>>>>             }
> >>>>>>> -           rss[MM_ANONPAGES]++;
> >>>>>>> -   } else if (page) {
> >>>>>>> -           folio_get(folio);
> >>>>>>> -           page_dup_file_rmap(page, false);
> >>>>>>> -           rss[mm_counter_file(page)]++;
> >>>>>>> +
> >>>>>>> +           nr = i;
> >>>>>>> +           folio_ref_add(folio, nr);
> >>>>>>
> >>>>>> You're changing the order of mapcount vs. refcount increment. Don't.
> >>>>>> Make sure your refcount >= mapcount.
> >>>>>>
> >>>>>> You can do that easily by doing the folio_ref_add(folio, nr) first and
> >>>>>> then decrementing in case of error accordingly. Errors due to pinned
> >>>>>> pages are the corner case.
> >>>>>>
> >>>>>> I'll note that it will make a lot of sense to have batch variants of
> >>>>>> page_try_dup_anon_rmap() and page_dup_file_rmap().
> >>>>>>
> >>>>>
> >>>>> i still don't understand why it is not a entire map+1, but an increment
> >>>>> in each basepage.
> >>>>
> >>>> Because we are PTE-mapping the folio, we have to account each individual page.
> >>>> If we accounted the entire folio, where would we unaccount it? Each page can be
> >>>> unmapped individually (e.g. munmap() part of the folio) so need to account each
> >>>> page. When PMD mapping, the whole thing is either mapped or unmapped, and its
> >>>> atomic, so we can account the entire thing.
> >>>
> >>> Hi Ryan,
> >>>
> >>> There is no problem. for example, a large folio is entirely mapped in
> >>> process A with CONPTE,
> >>> and only page2 is mapped in process B.
> >>> then we will have
> >>>
> >>> entire_map = 0
> >>> page0.map = -1
> >>> page1.map = -1
> >>> page2.map = 0
> >>> page3.map = -1
> >>> ....
> >>>
> >>>>
> >>>>>
> >>>>> as long as it is a CONTPTE large folio, there is no much difference with
> >>>>> PMD-mapped large folio. it has all the chance to be DoubleMap and need
> >>>>> split.
> >>>>>
> >>>>> When A and B share a CONTPTE large folio, we do madvise(DONTNEED) or any
> >>>>> similar things on a part of the large folio in process A,
> >>>>>
> >>>>> this large folio will have partially mapped subpage in A (all CONTPE bits
> >>>>> in all subpages need to be removed though we only unmap a part of the
> >>>>> large folioas HW requires consistent CONTPTEs); and it has entire map in
> >>>>> process B(all PTEs are still CONPTES in process B).
> >>>>>
> >>>>> isn't it more sensible for this large folios to have entire_map = 0(for
> >>>>> process B), and subpages which are still mapped in process A has map_count
> >>>>> =0? (start from -1).
> >>>>>
> >>>>>> Especially, the batch variant of page_try_dup_anon_rmap() would only
> >>>>>> check once if the folio maybe pinned, and in that case, you can simply
> >>>>>> drop all references again. So you either have all or no ptes to process,
> >>>>>> which makes that code easier.
> >>>>
> >>>> I'm afraid this doesn't make sense to me. Perhaps I've misunderstood. But
> >>>> fundamentally you can only use entire_mapcount if its only possible to map and
> >>>> unmap the whole folio atomically.
> >>>
> >>>
> >>>
> >>> My point is that CONTPEs should either all-set in all 16 PTEs or all are dropped
> >>> in 16 PTEs. if all PTEs have CONT, it is entirely mapped; otherwise,
> >>> it is partially
> >>> mapped. if a large folio is mapped in one processes with all CONTPTEs
> >>> and meanwhile in another process with partial mapping(w/o CONTPTE), it is
> >>> DoubleMapped.
> >>
> >> There are 2 problems with your proposal, as I see it;
> >>
> >> 1) the core-mm is not enlightened for CONTPTE mappings. As far as it is
> >> concerned, its just mapping a bunch of PTEs. So it has no hook to inc/dec
> >> entire_mapcount. The arch code is opportunistically and *transparently* managing
> >> the CONT_PTE bit.
> >>
> >> 2) There is nothing to say a folio isn't *bigger* than the contpte block; it may
> >> be 128K and be mapped with 2 contpte blocks. Or even a PTE-mapped THP (2M) and
> >> be mapped with 32 contpte blocks. So you can't say it is entirely mapped
> >> unless/until ALL of those blocks are set up. And then of course each block could
> >> be unmapped unatomically.
> >>
> >> For the PMD case there are actually 2 properties that allow using the
> >> entire_mapcount optimization; It's atomically mapped/unmapped through the PMD
> >> and we know that the folio is exactly PMD sized (since it must be at least PMD
> >> sized to be able to map it with the PMD, and we don't allocate THPs any bigger
> >> than PMD size). So one PMD map or unmap operation corresponds to exactly one
> >> *entire* map or unmap. That is not true when we are PTE mapping.
> >
> > well. Thanks for clarification. based on the above description, i agree the
> > current code might make more sense by always using mapcount in subpage.
> >
> > I gave my proposals as  I thought we were always CONTPTE size for small-THP
> > then we could drop the loop to iterate 16 times rmap. if we do it
> > entirely, we only
> > need to do dup rmap once for all 16 PTEs by increasing entire_map.
>
> Well its always good to have the discussion - so thanks for the ideas. I think
> there is a bigger question lurking here; should we be exposing the concept of
> contpte mappings to the core-mm rather than burying it in the arm64 arch code?
> I'm confident that would be a huge amount of effort and the end result would be
> similar performace to what this approach gives. One potential benefit of letting
> core-mm control it is that it would also give control to core-mm over the
> granularity of access/dirty reporting (my approach implicitly ties it to the
> folio). Having sub-folio access tracking _could_ potentially help with future
> work to make THP size selection automatic, but we are not there yet, and I think
> there are other (simpler) ways to achieve the same thing. So my view is that
> _not_ exposing it to core-mm is the right way for now.

Hi Ryan,

We (OPPO) started a similar project to yours even before folios were imported
into mainline. We have deployed dynamic hugepages (that is what we call them)
on millions of mobile phones, on real products and on kernels before 5.16,
with a huge success in performance improvement. For example, you may find the
out-of-tree 5.15 source code here:

https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/tree/oneplus/sm8550_u_14.0.0_oneplus11

Our modifications might not be so clean, and they carry lots of workarounds
added just for the stability of the products.

We mainly have

1. https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/blob/oneplus/sm8550_u_14.0.0_oneplus11/mm/cont_pte_hugepage.c

some CONTPTE helpers

2.https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/blob/oneplus/sm8550_u_14.0.0_oneplus11/include/linux/mm.h

some Dynamic Hugepage APIs

3. https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/blob/oneplus/sm8550_u_14.0.0_oneplus11/mm/memory.c

modified all page faults to support
     (1). allocation of hugepage of 64KB in do_anon_page
     (2). CoW hugepage in do_wp_page
     (3). copy CONPTEs in copy_pte_range
     (4). allocate and swap-in Hugepage as a whole in do_swap_page

4. https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/blob/oneplus/sm8550_u_14.0.0_oneplus11/mm/vmscan.c
https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/blob/oneplus/sm8550_u_14.0.0_oneplus11/mm/rmap.c

reclaim hugepage as a whole and LRU optimization for 64KB dynamic hugepage.

So we are 100% interested in your patchset and hope it can find a way to land
in mainline, removing the cost we currently pay to carry the out-of-tree code
forward from one kernel version to another, which we have already done across
a couple of kernel versions before 5.16. We are firmly, 100% supportive of the
large anon folios work you are leading.

A big pain point was that we found lots of races, especially around CONTPTE
unfolding, and especially where some basepages escaped from their group of 16
CONTPTEs, since userspace always works on basepages and has no idea of
small-THP. We ran our code on millions of real phones, and we have now got
those races fixed (or maybe just "can't reproduce"); there are no outstanding
issues.

Particularly for the rmap issue we are discussing, our out-of-tree code uses
the entire_map for CONTPTE in the way I described to you. But I guess we can
learn from you and decouple CONTPTE from mm-core.

We are doing this in mm/memory.c

copy_present_cont_pte(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
		pte_t *dst_pte, pte_t *src_pte, unsigned long addr, int *rss,
		struct page **prealloc)
{
	struct mm_struct *src_mm = src_vma->vm_mm;
	unsigned long vm_flags = src_vma->vm_flags;
	pte_t pte = *src_pte;
	struct page *page;

	page = vm_normal_page(src_vma, addr, pte);
	...

	get_page(page);
	page_dup_rmap(page, true);	/* an entire dup_rmap, as you can see */
	rss[mm_counter(page)] += HPAGE_CONT_PTE_NR;
}

and we have a split helper in mm/cont_pte_hugepage.c to handle partial unmap:

static void __split_huge_cont_pte_locked(struct vm_area_struct *vma, pte_t *pte,
					 unsigned long haddr, bool freeze)
{
	...
	if (compound_mapcount(head) > 1 && !TestSetPageDoubleMap(head)) {
		for (i = 0; i < HPAGE_CONT_PTE_NR; i++)
			atomic_inc(&head[i]._mapcount);
		atomic_long_inc(&cont_pte_double_map_count);
	}

	if (atomic_add_negative(-1, compound_mapcount_ptr(head))) {
		...
}

I am not selling our solution any more, but just showing you some differences we
have :-)

>
> >
> > BTW, I have concerns that a variable small-THP size will really work
> > as userspace
> > is probably friendly to only one fixed size. for example, userspace
> > heap management
> > might be optimized to a size for freeing memory to the kernel. it is
> > very difficult
> > for the heap to adapt to various sizes at the same time. frequent unmap/free
> > size not equal with, and particularly smaller than small-THP size will
> > defeat all
> > efforts to use small-THP.
>
> I'll admit to not knowing a huge amount about user space allocators. But I will
> say that as currently defined, the small-sized THP interface to user space
> allows a sysadmin to specifically enable the set of sizes that they want; so a
> single size can be enabled. I'm diliberately punting that decision away from the
> kernel for now.

Basically, a userspace heap library has a PAGESIZE setting and allows users to
allocate/free all kinds of small objects with sizes such as 16, 32, 64, 128,
256, 512 bytes and so on. The default PAGESIZE is of course equal to the
basepage size. Once some objects are freed by free() and libc ends up holding
a free "page", the heap library might return that PAGESIZE page to the kernel
via things like MADV_DONTNEED, which then goes through zap_pte_range(). It is
quite similar to the kernel slab.

So imagine we have small-THP now, but userspace libraries have *NO* idea of it
at all, so this can frequently cause unfolding.
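
To picture the pattern, here is a hedged userspace sketch (not taken from any
real allocator) of a heap returning a single 4KB "page" that happens to sit in
the middle of what could be a 64KB folio; the sizes and the aligned_alloc()
arena are assumptions for illustration only.

#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
	size_t chunk = 64 * 1024;			/* one small-THP sized arena */
	char *buf = aligned_alloc(chunk, chunk);	/* 64KB-aligned heap chunk */

	if (!buf)
		return 1;

	memset(buf, 0, chunk);	/* fault it in; ideally this becomes one 64KB folio */

	/*
	 * The allocator only thinks in 4KB pages, so it hands one page from
	 * the middle back to the kernel. If the range really was mapped as a
	 * contpte block, this would force an unfold and leave the folio
	 * partially mapped.
	 */
	madvise(buf + 4 * 4096, 4096, MADV_DONTNEED);

	return 0;
}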

>
> FWIW, My experience with the Speedometer/JavaScript use case is that performance
> is a little bit better when enabling 64+32+16K vs just 64K THP.
>
> Functionally, it will not matter if the allocator is not enlightened for the THP
> size; it can continue to free, and if a partial folio is unmapped it is put on
> the deferred split list, then under memory pressure it is split and the unused
> pages are reclaimed. I guess this is the bit you are concerned about having a
> performance impact?

Right. If this is happening on the majority of small-THP folios, we get no
performance improvement, and probably a regression instead. This is really true
on real workloads!!

So that is why we really love a per-VMA hint to enable small-THP, but obviously
you have already supported that now with
mm: thp: Introduce per-size thp sysfs interface
https://lore.kernel.org/linux-mm/20231122162950.3854897-4-ryan.roberts@arm.com/

Could we use MADVISE rather than ALWAYS, with a fixed size like 64KB, so that
userspace can set the VMA flag when it is quite sure the VMA is working at a
64KB alignment?
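
Concretely, the opt-in described above could look something like the hedged
sketch below: the heap marks only its 64KB-managed arenas with MADV_HUGEPAGE,
so under a "madvise" policy everything else stays on basepages. Whether the
proposed per-size sysfs knobs would honour MADV_HUGEPAGE for a 64KB size in
exactly this way is an assumption here, not something the series spells out.

#include <stddef.h>
#include <sys/mman.h>

static void *alloc_arena_64k(size_t len)
{
	void *arena = mmap(NULL, len, PROT_READ | PROT_WRITE,
			   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (arena == MAP_FAILED)
		return NULL;

	/*
	 * The heap promises to manage this arena in 64KB units, so it opts
	 * the VMA in; with the global policy set to "madvise", only regions
	 * marked like this would be eligible for small-THP.
	 */
	madvise(arena, len, MADV_HUGEPAGE);

	return arena;
}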

>
> Regardless, it would be good to move this conversation to the small-sized THP
> patch series since this is all independent of contpte mappings.
>
> >
> >>
> >>>
> >>> Since we always hold ptl to set or drop CONTPTE bits, set/drop is
> >>> still atomic in a
> >>> spinlock area.
> >>>
> >>>>
> >>>>>>
> >>>>>> But that can be added on top, and I'll happily do that.
> >>>>>>
> >>>>>> --
> >>>>>> Cheers,
> >>>>>>
> >>>>>> David / dhildenb
> >>>>>
> >>>
> >

Thanks
Barry
Barry Song Nov. 28, 2023, 12:11 a.m. UTC | #35
On Mon, Nov 27, 2023 at 10:24 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> On 27/11/2023 05:54, Barry Song wrote:
> >> +copy_present_ptes(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
> >> +              pte_t *dst_pte, pte_t *src_pte,
> >> +              unsigned long addr, unsigned long end,
> >> +              int *rss, struct folio **prealloc)
> >>  {
> >>      struct mm_struct *src_mm = src_vma->vm_mm;
> >>      unsigned long vm_flags = src_vma->vm_flags;
> >>      pte_t pte = ptep_get(src_pte);
> >>      struct page *page;
> >>      struct folio *folio;
> >> +    int nr = 1;
> >> +    bool anon;
> >> +    bool any_dirty = pte_dirty(pte);
> >> +    int i;
> >>
> >>      page = vm_normal_page(src_vma, addr, pte);
> >> -    if (page)
> >> +    if (page) {
> >>              folio = page_folio(page);
> >> -    if (page && folio_test_anon(folio)) {
> >> -            /*
> >> -             * If this page may have been pinned by the parent process,
> >> -             * copy the page immediately for the child so that we'll always
> >> -             * guarantee the pinned page won't be randomly replaced in the
> >> -             * future.
> >> -             */
> >> -            folio_get(folio);
> >> -            if (unlikely(page_try_dup_anon_rmap(page, false, src_vma))) {
> >> -                    /* Page may be pinned, we have to copy. */
> >> -                    folio_put(folio);
> >> -                    return copy_present_page(dst_vma, src_vma, dst_pte, src_pte,
> >> -                                             addr, rss, prealloc, page);
> >> +            anon = folio_test_anon(folio);
> >> +            nr = folio_nr_pages_cont_mapped(folio, page, src_pte, addr,
> >> +                                            end, pte, &any_dirty);
> >
> > in case we have a large folio with 16 CONTPTE basepages, and userspace
> > do madvise(addr + 4KB * 5, DONTNEED);
>
> nit: if you are offsetting by 5 pages from addr, then below I think you mean
> page0~page4 and page6~15?
>
> >
> > thus, the 4th basepage of PTE becomes PTE_NONE and folio_nr_pages_cont_mapped()
> > will return 15. in this case, we should copy page0~page3 and page5~page15.
>
> No I don't think folio_nr_pages_cont_mapped() will return 15; that's certainly
> not how its intended to work. The function is scanning forwards from the current
> pte until it finds the first pte that does not fit in the batch - either because
> it maps a PFN that is not contiguous, or because the permissions are different
> (although this is being relaxed a bit; see conversation with DavidH against this
> same patch).
>
> So the first time through this loop, folio_nr_pages_cont_mapped() will return 5,
> (page0~page4) then the next time through the loop we will go through the
> !present path and process the single swap marker. Then the 3rd time through the
> loop folio_nr_pages_cont_mapped() will return 10.

One case we have hit by running on hundreds of real phones is as below:


static int
copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
               pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long addr,
               unsigned long end)
{
        ...
        dst_pte = pte_alloc_map_lock(dst_mm, dst_pmd, addr, &dst_ptl);
        if (!dst_pte) {
                ret = -ENOMEM;
                goto out;
        }
        src_pte = pte_offset_map_nolock(src_mm, src_pmd, addr, &src_ptl);
        if (!src_pte) {
                pte_unmap_unlock(dst_pte, dst_ptl);
                /* ret == 0 */
                goto out;
        }
        spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING);
        orig_src_pte = src_pte;
        orig_dst_pte = dst_pte;
        arch_enter_lazy_mmu_mode();

        do {
                /*
                 * We are holding two locks at this point - either of them
                 * could generate latencies in another task on another CPU.
                 */
                if (progress >= 32) {
                        progress = 0;
                        if (need_resched() ||
                            spin_needbreak(src_ptl) || spin_needbreak(dst_ptl))
                                break;
                }
                ptent = ptep_get(src_pte);
                if (pte_none(ptent)) {
                        progress++;
                        continue;
                }

The above iteration can break out when progress >= 32. For example, if all the
PTEs are none at the beginning, we break once progress >= 32, and that break
can land on the 8th PTE of a group of 16 PTEs which might become CONTPTE after
we release the PTL.

Since we are releasing the PTLs, the next time we take the PTL those pte_none()
entries might have become pte_cont(). Are you then going to copy CONTPTEs
starting from the 8th PTE, and thus immediately break the hardware's rule that
CONTPTEs must be consistent?

pte0 - pte_none
pte1 - pte_none
...
pte7 - pte_none

pte8 - pte_cont
...
pte15 - pte_cont

So we made a modification to avoid breaking in the middle of a group of PTEs
which can potentially become CONTPTE:
do {
	/*
	 * We are holding two locks at this point - either of them
	 * could generate latencies in another task on another CPU.
	 */
	if (progress >= 32) {
		progress = 0;
#ifdef CONFIG_CONT_PTE_HUGEPAGE
		/*
		 * XXX: don't release ptl at an unaligned address, as cont_pte
		 * might form while ptl is released; this causes double-map
		 */
		if (!vma_is_chp_anonymous(src_vma) ||
		    (vma_is_chp_anonymous(src_vma) &&
		     IS_ALIGNED(addr, HPAGE_CONT_PTE_SIZE)))
#endif
		if (need_resched() ||
		    spin_needbreak(src_ptl) || spin_needbreak(dst_ptl))
			break;
	}

We could only reproduce the above issue by running on thousands of phones.

Does your code survive this problem?

>
> Thanks,
> Ryan
>
> >
> > but the current code is copying page0~page14, right? unless we are immediatly
> > split_folio to basepages in zap_pte_range(), we will have problems?
> >
> >> +
> >> +            for (i = 0; i < nr; i++, page++) {
> >> +                    if (anon) {
> >> +                            /*
> >> +                             * If this page may have been pinned by the
> >> +                             * parent process, copy the page immediately for
> >> +                             * the child so that we'll always guarantee the
> >> +                             * pinned page won't be randomly replaced in the
> >> +                             * future.
> >> +                             */
> >> +                            if (unlikely(page_try_dup_anon_rmap(
> >> +                                            page, false, src_vma))) {
> >> +                                    if (i != 0)
> >> +                                            break;
> >> +                                    /* Page may be pinned, we have to copy. */
> >> +                                    return copy_present_page(
> >> +                                            dst_vma, src_vma, dst_pte,
> >> +                                            src_pte, addr, rss, prealloc,
> >> +                                            page);
> >> +                            }
> >> +                            rss[MM_ANONPAGES]++;
> >> +                            VM_BUG_ON(PageAnonExclusive(page));
> >> +                    } else {
> >> +                            page_dup_file_rmap(page, false);
> >> +                            rss[mm_counter_file(page)]++;
> >> +                    }
> >

Thanks
Barry
Ryan Roberts Nov. 28, 2023, 9:14 a.m. UTC | #36
On 27/11/2023 20:34, Barry Song wrote:
> On Tue, Nov 28, 2023 at 12:07 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>
>> On 27/11/2023 10:28, Barry Song wrote:
>>> On Mon, Nov 27, 2023 at 11:11 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>
>>>> On 27/11/2023 09:59, Barry Song wrote:
>>>>> On Mon, Nov 27, 2023 at 10:35 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>>>
>>>>>> On 27/11/2023 08:42, Barry Song wrote:
>>>>>>>>> +           for (i = 0; i < nr; i++, page++) {
>>>>>>>>> +                   if (anon) {
>>>>>>>>> +                           /*
>>>>>>>>> +                            * If this page may have been pinned by the
>>>>>>>>> +                            * parent process, copy the page immediately for
>>>>>>>>> +                            * the child so that we'll always guarantee the
>>>>>>>>> +                            * pinned page won't be randomly replaced in the
>>>>>>>>> +                            * future.
>>>>>>>>> +                            */
>>>>>>>>> +                           if (unlikely(page_try_dup_anon_rmap(
>>>>>>>>> +                                           page, false, src_vma))) {
>>>>>>>>> +                                   if (i != 0)
>>>>>>>>> +                                           break;
>>>>>>>>> +                                   /* Page may be pinned, we have to copy. */
>>>>>>>>> +                                   return copy_present_page(
>>>>>>>>> +                                           dst_vma, src_vma, dst_pte,
>>>>>>>>> +                                           src_pte, addr, rss, prealloc,
>>>>>>>>> +                                           page);
>>>>>>>>> +                           }
>>>>>>>>> +                           rss[MM_ANONPAGES]++;
>>>>>>>>> +                           VM_BUG_ON(PageAnonExclusive(page));
>>>>>>>>> +                   } else {
>>>>>>>>> +                           page_dup_file_rmap(page, false);
>>>>>>>>> +                           rss[mm_counter_file(page)]++;
>>>>>>>>> +                   }
>>>>>>>>>             }
>>>>>>>>> -           rss[MM_ANONPAGES]++;
>>>>>>>>> -   } else if (page) {
>>>>>>>>> -           folio_get(folio);
>>>>>>>>> -           page_dup_file_rmap(page, false);
>>>>>>>>> -           rss[mm_counter_file(page)]++;
>>>>>>>>> +
>>>>>>>>> +           nr = i;
>>>>>>>>> +           folio_ref_add(folio, nr);
>>>>>>>>
>>>>>>>> You're changing the order of mapcount vs. refcount increment. Don't.
>>>>>>>> Make sure your refcount >= mapcount.
>>>>>>>>
>>>>>>>> You can do that easily by doing the folio_ref_add(folio, nr) first and
>>>>>>>> then decrementing in case of error accordingly. Errors due to pinned
>>>>>>>> pages are the corner case.
>>>>>>>>
>>>>>>>> I'll note that it will make a lot of sense to have batch variants of
>>>>>>>> page_try_dup_anon_rmap() and page_dup_file_rmap().
>>>>>>>>
>>>>>>>
>>>>>>> i still don't understand why it is not a entire map+1, but an increment
>>>>>>> in each basepage.
>>>>>>
>>>>>> Because we are PTE-mapping the folio, we have to account each individual page.
>>>>>> If we accounted the entire folio, where would we unaccount it? Each page can be
>>>>>> unmapped individually (e.g. munmap() part of the folio) so need to account each
>>>>>> page. When PMD mapping, the whole thing is either mapped or unmapped, and its
>>>>>> atomic, so we can account the entire thing.
>>>>>
>>>>> Hi Ryan,
>>>>>
>>>>> There is no problem. for example, a large folio is entirely mapped in
>>>>> process A with CONPTE,
>>>>> and only page2 is mapped in process B.
>>>>> then we will have
>>>>>
>>>>> entire_map = 0
>>>>> page0.map = -1
>>>>> page1.map = -1
>>>>> page2.map = 0
>>>>> page3.map = -1
>>>>> ....
>>>>>
>>>>>>
>>>>>>>
>>>>>>> as long as it is a CONTPTE large folio, there is no much difference with
>>>>>>> PMD-mapped large folio. it has all the chance to be DoubleMap and need
>>>>>>> split.
>>>>>>>
>>>>>>> When A and B share a CONTPTE large folio, we do madvise(DONTNEED) or any
>>>>>>> similar things on a part of the large folio in process A,
>>>>>>>
>>>>>>> this large folio will have partially mapped subpage in A (all CONTPE bits
>>>>>>> in all subpages need to be removed though we only unmap a part of the
>>>>>>> large folioas HW requires consistent CONTPTEs); and it has entire map in
>>>>>>> process B(all PTEs are still CONPTES in process B).
>>>>>>>
>>>>>>> isn't it more sensible for this large folios to have entire_map = 0(for
>>>>>>> process B), and subpages which are still mapped in process A has map_count
>>>>>>> =0? (start from -1).
>>>>>>>
>>>>>>>> Especially, the batch variant of page_try_dup_anon_rmap() would only
>>>>>>>> check once if the folio maybe pinned, and in that case, you can simply
>>>>>>>> drop all references again. So you either have all or no ptes to process,
>>>>>>>> which makes that code easier.
>>>>>>
>>>>>> I'm afraid this doesn't make sense to me. Perhaps I've misunderstood. But
>>>>>> fundamentally you can only use entire_mapcount if its only possible to map and
>>>>>> unmap the whole folio atomically.
>>>>>
>>>>>
>>>>>
>>>>> My point is that CONTPEs should either all-set in all 16 PTEs or all are dropped
>>>>> in 16 PTEs. if all PTEs have CONT, it is entirely mapped; otherwise,
>>>>> it is partially
>>>>> mapped. if a large folio is mapped in one processes with all CONTPTEs
>>>>> and meanwhile in another process with partial mapping(w/o CONTPTE), it is
>>>>> DoubleMapped.
>>>>
>>>> There are 2 problems with your proposal, as I see it;
>>>>
>>>> 1) the core-mm is not enlightened for CONTPTE mappings. As far as it is
>>>> concerned, its just mapping a bunch of PTEs. So it has no hook to inc/dec
>>>> entire_mapcount. The arch code is opportunistically and *transparently* managing
>>>> the CONT_PTE bit.
>>>>
>>>> 2) There is nothing to say a folio isn't *bigger* than the contpte block; it may
>>>> be 128K and be mapped with 2 contpte blocks. Or even a PTE-mapped THP (2M) and
>>>> be mapped with 32 contpte blocks. So you can't say it is entirely mapped
>>>> unless/until ALL of those blocks are set up. And then of course each block could
>>>> be unmapped unatomically.
>>>>
>>>> For the PMD case there are actually 2 properties that allow using the
>>>> entire_mapcount optimization; It's atomically mapped/unmapped through the PMD
>>>> and we know that the folio is exactly PMD sized (since it must be at least PMD
>>>> sized to be able to map it with the PMD, and we don't allocate THPs any bigger
>>>> than PMD size). So one PMD map or unmap operation corresponds to exactly one
>>>> *entire* map or unmap. That is not true when we are PTE mapping.
>>>
>>> well. Thanks for clarification. based on the above description, i agree the
>>> current code might make more sense by always using mapcount in subpage.
>>>
>>> I gave my proposals as  I thought we were always CONTPTE size for small-THP
>>> then we could drop the loop to iterate 16 times rmap. if we do it
>>> entirely, we only
>>> need to do dup rmap once for all 16 PTEs by increasing entire_map.
>>
>> Well its always good to have the discussion - so thanks for the ideas. I think
>> there is a bigger question lurking here; should we be exposing the concept of
>> contpte mappings to the core-mm rather than burying it in the arm64 arch code?
>> I'm confident that would be a huge amount of effort and the end result would be
>> similar performace to what this approach gives. One potential benefit of letting
>> core-mm control it is that it would also give control to core-mm over the
>> granularity of access/dirty reporting (my approach implicitly ties it to the
>> folio). Having sub-folio access tracking _could_ potentially help with future
>> work to make THP size selection automatic, but we are not there yet, and I think
>> there are other (simpler) ways to achieve the same thing. So my view is that
>> _not_ exposing it to core-mm is the right way for now.
> 
> Hi Ryan,
> 
> We(OPPO) started a similar project like you even before folio was imported to
> mainline, we have deployed the dynamic hugepage(that is how we name it)
> on millions of mobile phones on real products and kernels before 5.16,  making
> a huge success on performance improvement. for example, you may
> find the out-of-tree 5.15 source code here

Oh wow, thanks for reaching out and explaining this - I have to admit I feel
embarrassed that I clearly didn't do enough research on the prior art because I
wasn't aware of your work. So sorry about that.

I sensed that you had a different model for how this should work vs what I've
implemented and now I understand why :). I'll review your stuff and I'm sure
I'll have questions. I'm sure each solution has pros and cons.


> 
> https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/tree/oneplus/sm8550_u_14.0.0_oneplus11
> 
> Our modification might not be so clean and has lots of workarounds
> just for the stability of products
> 
> We mainly have
> 
> 1. https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/blob/oneplus/sm8550_u_14.0.0_oneplus11/mm/cont_pte_hugepage.c
> 
> some CONTPTE helpers
> 
> 2.https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/blob/oneplus/sm8550_u_14.0.0_oneplus11/include/linux/mm.h
> 
> some Dynamic Hugepage APIs
> 
> 3. https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/blob/oneplus/sm8550_u_14.0.0_oneplus11/mm/memory.c
> 
> modified all page faults to support
>      (1). allocation of hugepage of 64KB in do_anon_page

My Small-Sized THP patch set is handling the equivalent of this.

>      (2). CoW hugepage in do_wp_page

This isn't handled yet in my patch set; the original RFC implemented it but I
removed it in order to strip back to the essential complexity for the initial
submission. DavidH has been working on a precise shared vs exclusive map
tracking mechanism - if that goes in, it will make CoWing large folios simpler.
Out of interest, what workloads benefit most from this?

>      (3). copy CONPTEs in copy_pte_range

As discussed, this is done as part of the contpte patch set, but it's not just a
simple copy; the arch code will notice and set the CONT_PTE bit as needed.
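
Roughly speaking (a hedged sketch, not the actual code from the series; the
helper name and the fixed block size of 16 are invented for illustration), the
backend's decision after set_ptes() boils down to a check like this:

/*
 * Sketch of the arch-side test: an aligned block can only be marked
 * contiguous if every entry is present, maps the next PFN in the run, and
 * carries identical attributes.
 */
#define SKETCH_CONT_PTES	16

static bool sketch_block_foldable(pte_t *block_start, unsigned long addr)
{
	pte_t first = ptep_get(block_start);
	unsigned long pfn = pte_pfn(first);
	int i;

	if (!IS_ALIGNED(addr, SKETCH_CONT_PTES * PAGE_SIZE))
		return false;

	for (i = 0; i < SKETCH_CONT_PTES; i++) {
		pte_t pte = ptep_get(block_start + i);

		if (!pte_present(pte) || pte_pfn(pte) != pfn + i ||
		    pgprot_val(pte_pgprot(pte)) != pgprot_val(pte_pgprot(first)))
			return false;
	}

	return true;
}

If that holds, the backend can rewrite the block with the CONT bit set; if a
later operation breaks the invariant it unfolds again, all without core-mm
being aware.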

>      (4). allocate and swap-in Hugepage as a whole in do_swap_page

This is going to be a problem but I haven't even looked at this properly yet.
The advice so far has been to continue to swap-in small pages only, but improve
khugepaged to collapse to small-sized THP. I'll take a look at your code to
understand how you did this.

> 
> 4. https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/blob/oneplus/sm8550_u_14.0.0_oneplus11/mm/vmscan.c
> https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/blob/oneplus/sm8550_u_14.0.0_oneplus11/mm/rmap.c
> 
> reclaim hugepage as a whole and LRU optimization for 64KB dynamic hugepage.

I think this is all naturally handled by the folio code that exists in modern
kernels?

> 
> So we are 100% interested in your patchset and hope it can find a way
> to land on the
> mainline, thus decreasing all the cost we have to maintain out-of-tree
> code from a
> kernel to another kernel version which we have done on a couple of
> kernel versions
> before 5.16. Firmly, we are 100% supportive of large anon folios
> things you are leading.

That's great to hear! Of course Reviewed-By's and Tested-By's will all help move
it closer :). If you had any ability to do any A/B performance testing, it would
be very interesting to see how this stacks up against your solution - if there
are gaps it would be good to know where and develop a plan to plug the gap.

> 
> A big pain was we found lots of races especially on CONTPTE unfolding
> and especially a part
> of basepages ran away from the 16 CONPTEs group since userspace is
> always working
> on basepages, having no idea of small-THP.  We ran our code on millions of
> real phones, and now we have got them fixed (or maybe "can't reproduce"),
> no outstanding issue.

I'm going to be brave and say that my solution shouldn't suffer from these
problems; but of course the proof is only in the testing. I did a lot of work
with our architecture group and micro architects to determine exactly what is
and isn't safe; we even tightened the Arm ARM spec very subtly to allow the
optimization in patch 13 (see the commit log for details). Of course this has
all been checked with partners and we are confident that all existing
implementations conform to the modified wording.

> 
> Particularly for the rmap issue we are discussing, our out-of-tree is
> using the entire_map for
> CONTPTE in the way I sent to you. But I guess we can learn from you to decouple
> CONTPTE from mm-core.
> 
> We are doing this in mm/memory.c
> 
> copy_present_cont_pte(struct vm_area_struct *dst_vma, struct
> vm_area_struct *src_vma,
> pte_t *dst_pte, pte_t *src_pte, unsigned long addr, int *rss,
> struct page **prealloc)
> {
>       struct mm_struct *src_mm = src_vma->vm_mm;
>       unsigned long vm_flags = src_vma->vm_flags;
>       pte_t pte = *src_pte;
>       struct page *page;
> 
>        page = vm_normal_page(src_vma, addr, pte);
>       ...
> 
>      get_page(page);
>      page_dup_rmap(page, true);   // an entire dup_rmap as you can
> see.............
>      rss[mm_counter(page)] += HPAGE_CONT_PTE_NR;
> }
> 
> and we have a split in mm/cont_pte_hugepage.c to handle partially unmap,
> 
> static void __split_huge_cont_pte_locked(struct vm_area_struct *vma, pte_t *pte,
> unsigned long haddr, bool freeze)
> {
> ...
>            if (compound_mapcount(head) > 1 && !TestSetPageDoubleMap(head)) {
>                   for (i = 0; i < HPAGE_CONT_PTE_NR; i++)
>                            atomic_inc(&head[i]._mapcount);
>                  atomic_long_inc(&cont_pte_double_map_count);
>            }
> 
> 
>             if (atomic_add_negative(-1, compound_mapcount_ptr(head))) {
>               ...
> }
> 
> I am not selling our solution any more, but just showing you some differences we
> have :-)

OK, I understand what you were saying now. I'm currently struggling to see how
this could fit into my model. Do you have any workloads and numbers on perf
improvement of using entire_mapcount?

> 
>>
>>>
>>> BTW, I have concerns that a variable small-THP size will really work
>>> as userspace
>>> is probably friendly to only one fixed size. for example, userspace
>>> heap management
>>> might be optimized to a size for freeing memory to the kernel. it is
>>> very difficult
>>> for the heap to adapt to various sizes at the same time. frequent unmap/free
>>> size not equal with, and particularly smaller than small-THP size will
>>> defeat all
>>> efforts to use small-THP.
>>
>> I'll admit to not knowing a huge amount about user space allocators. But I will
>> say that as currently defined, the small-sized THP interface to user space
>> allows a sysadmin to specifically enable the set of sizes that they want; so a
>> single size can be enabled. I'm diliberately punting that decision away from the
>> kernel for now.
> 
> Basically, userspace heap library has a PAGESIZE setting and allows users
> to allocate/free all kinds of small objects such as 16,32,64,128,256,512 etc.
> The default size is for sure equal to the basepage SIZE. once some objects are
> freed by free() and libc get a free "page", userspace heap libraries might free
> the PAGESIZE page to kernel by things like MADV_DONTNEED, then zap_pte_range().
> it is quite similar with kernel slab.
> 
> so imagine we have small-THP now, but userspace libraries have *NO*
> idea at all,  so it can frequently cause unfolding.
> 
>>
>> FWIW, My experience with the Speedometer/JavaScript use case is that performance
>> is a little bit better when enabling 64+32+16K vs just 64K THP.
>>
>> Functionally, it will not matter if the allocator is not enlightened for the THP
>> size; it can continue to free, and if a partial folio is unmapped it is put on
>> the deferred split list, then under memory pressure it is split and the unused
>> pages are reclaimed. I guess this is the bit you are concerned about having a
>> performance impact?
> 
> right. If this is happening on the majority of small-THP folios, we
> don't have performance
> improvement, and probably regression instead. This is really true on
> real workloads!!
> 
> So that is why we really love a per-VMA hint to enable small-THP but
> obviously you
> have already supported it now by
> mm: thp: Introduce per-size thp sysfs interface
> https://lore.kernel.org/linux-mm/20231122162950.3854897-4-ryan.roberts@arm.com/
> 
> we can use MADVISE rather than ALWAYS and set fixed size like 64KB, so userspace
> can set the VMA flag when it is quite sure this VMA is working with
> the alignment
> of 64KB?

Yes, that all exists in the series today. We have also discussed the possibility
of adding a new madvise_process() call that would take the set of THP sizes that
should be considered. Then you can set different VMAs to use different sizes;
the plan was to layer that on top if/when a workload was identified. Sounds like
you might be able to help there?

> 
>>
>> Regardless, it would be good to move this conversation to the small-sized THP
>> patch series since this is all independent of contpte mappings.
>>
>>>
>>>>
>>>>>
>>>>> Since we always hold ptl to set or drop CONTPTE bits, set/drop is
>>>>> still atomic in a
>>>>> spinlock area.
>>>>>
>>>>>>
>>>>>>>>
>>>>>>>> But that can be added on top, and I'll happily do that.
>>>>>>>>
>>>>>>>> --
>>>>>>>> Cheers,
>>>>>>>>
>>>>>>>> David / dhildenb
>>>>>>>
>>>>>
>>>
> 
> Thanks
> Barry
Barry Song Nov. 28, 2023, 9:49 a.m. UTC | #37
On Tue, Nov 28, 2023 at 10:14 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> On 27/11/2023 20:34, Barry Song wrote:
> > On Tue, Nov 28, 2023 at 12:07 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
> >>
> >> On 27/11/2023 10:28, Barry Song wrote:
> >>> On Mon, Nov 27, 2023 at 11:11 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
> >>>>
> >>>> On 27/11/2023 09:59, Barry Song wrote:
> >>>>> On Mon, Nov 27, 2023 at 10:35 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
> >>>>>>
> >>>>>> On 27/11/2023 08:42, Barry Song wrote:
> >>>>>>>>> +           for (i = 0; i < nr; i++, page++) {
> >>>>>>>>> +                   if (anon) {
> >>>>>>>>> +                           /*
> >>>>>>>>> +                            * If this page may have been pinned by the
> >>>>>>>>> +                            * parent process, copy the page immediately for
> >>>>>>>>> +                            * the child so that we'll always guarantee the
> >>>>>>>>> +                            * pinned page won't be randomly replaced in the
> >>>>>>>>> +                            * future.
> >>>>>>>>> +                            */
> >>>>>>>>> +                           if (unlikely(page_try_dup_anon_rmap(
> >>>>>>>>> +                                           page, false, src_vma))) {
> >>>>>>>>> +                                   if (i != 0)
> >>>>>>>>> +                                           break;
> >>>>>>>>> +                                   /* Page may be pinned, we have to copy. */
> >>>>>>>>> +                                   return copy_present_page(
> >>>>>>>>> +                                           dst_vma, src_vma, dst_pte,
> >>>>>>>>> +                                           src_pte, addr, rss, prealloc,
> >>>>>>>>> +                                           page);
> >>>>>>>>> +                           }
> >>>>>>>>> +                           rss[MM_ANONPAGES]++;
> >>>>>>>>> +                           VM_BUG_ON(PageAnonExclusive(page));
> >>>>>>>>> +                   } else {
> >>>>>>>>> +                           page_dup_file_rmap(page, false);
> >>>>>>>>> +                           rss[mm_counter_file(page)]++;
> >>>>>>>>> +                   }
> >>>>>>>>>             }
> >>>>>>>>> -           rss[MM_ANONPAGES]++;
> >>>>>>>>> -   } else if (page) {
> >>>>>>>>> -           folio_get(folio);
> >>>>>>>>> -           page_dup_file_rmap(page, false);
> >>>>>>>>> -           rss[mm_counter_file(page)]++;
> >>>>>>>>> +
> >>>>>>>>> +           nr = i;
> >>>>>>>>> +           folio_ref_add(folio, nr);
> >>>>>>>>
> >>>>>>>> You're changing the order of mapcount vs. refcount increment. Don't.
> >>>>>>>> Make sure your refcount >= mapcount.
> >>>>>>>>
> >>>>>>>> You can do that easily by doing the folio_ref_add(folio, nr) first and
> >>>>>>>> then decrementing in case of error accordingly. Errors due to pinned
> >>>>>>>> pages are the corner case.
> >>>>>>>>
> >>>>>>>> I'll note that it will make a lot of sense to have batch variants of
> >>>>>>>> page_try_dup_anon_rmap() and page_dup_file_rmap().
> >>>>>>>>
> >>>>>>>
> >>>>>>> i still don't understand why it is not a entire map+1, but an increment
> >>>>>>> in each basepage.
> >>>>>>
> >>>>>> Because we are PTE-mapping the folio, we have to account each individual page.
> >>>>>> If we accounted the entire folio, where would we unaccount it? Each page can be
> >>>>>> unmapped individually (e.g. munmap() part of the folio) so need to account each
> >>>>>> page. When PMD mapping, the whole thing is either mapped or unmapped, and its
> >>>>>> atomic, so we can account the entire thing.
> >>>>>
> >>>>> Hi Ryan,
> >>>>>
> >>>>> There is no problem. for example, a large folio is entirely mapped in
> >>>>> process A with CONPTE,
> >>>>> and only page2 is mapped in process B.
> >>>>> then we will have
> >>>>>
> >>>>> entire_map = 0
> >>>>> page0.map = -1
> >>>>> page1.map = -1
> >>>>> page2.map = 0
> >>>>> page3.map = -1
> >>>>> ....
> >>>>>
> >>>>>>
> >>>>>>>
> >>>>>>> as long as it is a CONTPTE large folio, there is no much difference with
> >>>>>>> PMD-mapped large folio. it has all the chance to be DoubleMap and need
> >>>>>>> split.
> >>>>>>>
> >>>>>>> When A and B share a CONTPTE large folio, we do madvise(DONTNEED) or any
> >>>>>>> similar things on a part of the large folio in process A,
> >>>>>>>
> >>>>>>> this large folio will have partially mapped subpage in A (all CONTPE bits
> >>>>>>> in all subpages need to be removed though we only unmap a part of the
> >>>>>>> large folioas HW requires consistent CONTPTEs); and it has entire map in
> >>>>>>> process B(all PTEs are still CONPTES in process B).
> >>>>>>>
> >>>>>>> isn't it more sensible for this large folios to have entire_map = 0(for
> >>>>>>> process B), and subpages which are still mapped in process A has map_count
> >>>>>>> =0? (start from -1).
> >>>>>>>
> >>>>>>>> Especially, the batch variant of page_try_dup_anon_rmap() would only
> >>>>>>>> check once if the folio maybe pinned, and in that case, you can simply
> >>>>>>>> drop all references again. So you either have all or no ptes to process,
> >>>>>>>> which makes that code easier.
> >>>>>>
> >>>>>> I'm afraid this doesn't make sense to me. Perhaps I've misunderstood. But
> >>>>>> fundamentally you can only use entire_mapcount if its only possible to map and
> >>>>>> unmap the whole folio atomically.
> >>>>>
> >>>>>
> >>>>>
> >>>>> My point is that CONTPEs should either all-set in all 16 PTEs or all are dropped
> >>>>> in 16 PTEs. if all PTEs have CONT, it is entirely mapped; otherwise,
> >>>>> it is partially
> >>>>> mapped. if a large folio is mapped in one processes with all CONTPTEs
> >>>>> and meanwhile in another process with partial mapping(w/o CONTPTE), it is
> >>>>> DoubleMapped.
> >>>>
> >>>> There are 2 problems with your proposal, as I see it;
> >>>>
> >>>> 1) the core-mm is not enlightened for CONTPTE mappings. As far as it is
> >>>> concerned, its just mapping a bunch of PTEs. So it has no hook to inc/dec
> >>>> entire_mapcount. The arch code is opportunistically and *transparently* managing
> >>>> the CONT_PTE bit.
> >>>>
> >>>> 2) There is nothing to say a folio isn't *bigger* than the contpte block; it may
> >>>> be 128K and be mapped with 2 contpte blocks. Or even a PTE-mapped THP (2M) and
> >>>> be mapped with 32 contpte blocks. So you can't say it is entirely mapped
> >>>> unless/until ALL of those blocks are set up. And then of course each block could
> >>>> be unmapped unatomically.
> >>>>
> >>>> For the PMD case there are actually 2 properties that allow using the
> >>>> entire_mapcount optimization; It's atomically mapped/unmapped through the PMD
> >>>> and we know that the folio is exactly PMD sized (since it must be at least PMD
> >>>> sized to be able to map it with the PMD, and we don't allocate THPs any bigger
> >>>> than PMD size). So one PMD map or unmap operation corresponds to exactly one
> >>>> *entire* map or unmap. That is not true when we are PTE mapping.
> >>>
> >>> well. Thanks for clarification. based on the above description, i agree the
> >>> current code might make more sense by always using mapcount in subpage.
> >>>
> >>> I gave my proposals as  I thought we were always CONTPTE size for small-THP
> >>> then we could drop the loop to iterate 16 times rmap. if we do it
> >>> entirely, we only
> >>> need to do dup rmap once for all 16 PTEs by increasing entire_map.
> >>
> >> Well its always good to have the discussion - so thanks for the ideas. I think
> >> there is a bigger question lurking here; should we be exposing the concept of
> >> contpte mappings to the core-mm rather than burying it in the arm64 arch code?
> >> I'm confident that would be a huge amount of effort and the end result would be
> >> similar performace to what this approach gives. One potential benefit of letting
> >> core-mm control it is that it would also give control to core-mm over the
> >> granularity of access/dirty reporting (my approach implicitly ties it to the
> >> folio). Having sub-folio access tracking _could_ potentially help with future
> >> work to make THP size selection automatic, but we are not there yet, and I think
> >> there are other (simpler) ways to achieve the same thing. So my view is that
> >> _not_ exposing it to core-mm is the right way for now.
> >
> > Hi Ryan,
> >
> > We(OPPO) started a similar project like you even before folio was imported to
> > mainline, we have deployed the dynamic hugepage(that is how we name it)
> > on millions of mobile phones on real products and kernels before 5.16,  making
> > a huge success on performance improvement. for example, you may
> > find the out-of-tree 5.15 source code here
>
> Oh wow, thanks for reaching out and explaining this - I have to admit I feel
> embarrassed that I clearly didn't do enough research on the prior art because I
> wasn't aware of your work. So sorry about that.
>
> I sensed that you had a different model for how this should work vs what I've
> implemented and now I understand why :). I'll review your stuff and I'm sure
> I'll have questions. I'm sure each solution has pros and cons.
>
>
> >
> > https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/tree/oneplus/sm8550_u_14.0.0_oneplus11
> >
> > Our modification might not be so clean and has lots of workarounds
> > just for the stability of products
> >
> > We mainly have
> >
> > 1. https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/blob/oneplus/sm8550_u_14.0.0_oneplus11/mm/cont_pte_hugepage.c
> >
> > some CONTPTE helpers
> >
> > 2.https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/blob/oneplus/sm8550_u_14.0.0_oneplus11/include/linux/mm.h
> >
> > some Dynamic Hugepage APIs
> >
> > 3. https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/blob/oneplus/sm8550_u_14.0.0_oneplus11/mm/memory.c
> >
> > modified all page faults to support
> >      (1). allocation of hugepage of 64KB in do_anon_page
>
> My Small-Sized THP patch set is handling the equivalent of this.

Right, the only difference is that we used a huge zero page for read faults
in do_anon_page, mapping the whole large-folio range to the zero page with
CONTPTE.

>
> >      (2). CoW hugepage in do_wp_page
>
> This isn't handled yet in my patch set; the original RFC implemented it but I
> removed it in order to strip back to the essential complexity for the initial
> submission. DavidH has been working on a precise shared vs exclusive map
> tracking mechanism - if that goes in, it will make CoWing large folios simpler.
> Out of interest, what workloads benefit most from this?

On a phone, Android is designed so that almost all processes are forked from
zygote; thus, CoW happens quite often for all apps.

>
> >      (3). copy CONPTEs in copy_pte_range
>
> As discussed this is done as part of the contpte patch set, but its not just a
> simple copy; the arch code will notice and set the CONT_PTE bit as needed.

Right, I have read all your unfold and fold code today; now that I understand
it, your approach seems quite nice!


>
> >      (4). allocate and swap-in Hugepage as a whole in do_swap_page
>
> This is going to be a problem but I haven't even looked at this properly yet.
> The advice so far has been to continue to swap-in small pages only, but improve
> khugepaged to collapse to small-sized THP. I'll take a look at your code to
> understand how you did this.

This is also crucial for an Android phone, as swap is always happening on an
embedded device. If we don't support large folios in swap-in, our large folios
will never come back after they are swapped out.

And I hated the collapse solution from the very beginning, as there is never a
guarantee it will succeed and its overhead is unacceptable for the user UI, so
we supported hugepage allocation in do_swap_page from the start.
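
To make the idea concrete, here is a minimal sketch (plain userspace C with
invented names; it is not the OPPO code nor a mainline proposal, and bounds
checks are omitted) of the kind of check such a swap-in path could make: only
treat the faulting entry as part of a swapped-out large folio if the whole
naturally aligned block holds consecutive offsets on the same swap device.

#include <stdbool.h>
#include <stddef.h>

#define SWAPIN_NR 16    /* block size in base pages, e.g. 64K / 4K */

struct toy_swap_entry {
        bool is_swap;           /* pte holds a swap entry */
        int type;               /* swap device */
        unsigned long offset;   /* offset within that device */
};

/*
 * True if the naturally aligned SWAPIN_NR block containing 'idx' maps
 * consecutive offsets on one device, so it plausibly came from a single
 * swapped-out large folio and could be swapped back in as one allocation.
 */
static bool block_swapin_ok(const struct toy_swap_entry *ptes, size_t idx)
{
        size_t start = (idx / SWAPIN_NR) * SWAPIN_NR;
        const struct toy_swap_entry *first = &ptes[start];

        if (!first->is_swap)
                return false;

        for (size_t i = 1; i < SWAPIN_NR; i++) {
                const struct toy_swap_entry *e = &ptes[start + i];

                if (!e->is_swap || e->type != first->type ||
                    e->offset != first->offset + i)
                        return false;
        }
        return true;
}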

>
> >
> > 4. https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/blob/oneplus/sm8550_u_14.0.0_oneplus11/mm/vmscan.c
> > https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/blob/oneplus/sm8550_u_14.0.0_oneplus11/mm/rmap.c
> >
> > reclaim hugepage as a whole and LRU optimization for 64KB dynamic hugepage.
>
> I think this is all naturally handled by the folio code that exists in modern
> kernels?

We had a CONTPTE hugepage pool; since the pool is very limited, we let LRU
reclaim large folios back into the pool. As phones run lots of apps and
drivers and memory is very limited, after a couple of hours it becomes very
hard to allocate large folios from the original buddy allocator. Thus, large
folios would totally disappear after running the phone for some time if we
didn't have the pool.

>
> >
> > So we are 100% interested in your patchset and hope it can find a way
> > to land on the
> > mainline, thus decreasing all the cost we have to maintain out-of-tree
> > code from a
> > kernel to another kernel version which we have done on a couple of
> > kernel versions
> > before 5.16. Firmly, we are 100% supportive of large anon folios
> > things you are leading.
>
> That's great to hear! Of course Reviewed-By's and Tested-By's will all help move
> it closer :). If you had any ability to do any A/B performance testing, it would
> be very interesting to see how this stacks up against your solution - if there
> are gaps it would be good to know where and develop a plan to plug the gap.
>

sure.

> >
> > A big pain was we found lots of races especially on CONTPTE unfolding
> > and especially a part
> > of basepages ran away from the 16 CONPTEs group since userspace is
> > always working
> > on basepages, having no idea of small-THP.  We ran our code on millions of
> > real phones, and now we have got them fixed (or maybe "can't reproduce"),
> > no outstanding issue.
>
> I'm going to be brave and say that my solution shouldn't suffer from these
> problems; but of course the proof is only in the testing. I did a lot of work
> with our architecture group and micro architects to determine exactly what is
> and isn't safe; We even tightened the Arm ARM spec very subtlely to allow the
> optimization in patch 13 (see the commit log for details). Of course this has
> all been checked with partners and we are confident that all existing
> implementations conform to the modified wording.

Cool. I like your try_unfold/fold code. It seems your code sets/drops CONT
automatically based on alignment, page count etc. In contrast, our code always
checks a set of conditions by hand before setting and dropping CONT
everywhere.

>
> >
> > Particularly for the rmap issue we are discussing, our out-of-tree is
> > using the entire_map for
> > CONTPTE in the way I sent to you. But I guess we can learn from you to decouple
> > CONTPTE from mm-core.
> >
> > We are doing this in mm/memory.c
> >
> > copy_present_cont_pte(struct vm_area_struct *dst_vma, struct
> > vm_area_struct *src_vma,
> > pte_t *dst_pte, pte_t *src_pte, unsigned long addr, int *rss,
> > struct page **prealloc)
> > {
> >       struct mm_struct *src_mm = src_vma->vm_mm;
> >       unsigned long vm_flags = src_vma->vm_flags;
> >       pte_t pte = *src_pte;
> >       struct page *page;
> >
> >        page = vm_normal_page(src_vma, addr, pte);
> >       ...
> >
> >      get_page(page);
> >      page_dup_rmap(page, true);   // an entire dup_rmap as you can
> > see.............
> >      rss[mm_counter(page)] += HPAGE_CONT_PTE_NR;
> > }
> >
> > and we have a split in mm/cont_pte_hugepage.c to handle partially unmap,
> >
> > static void __split_huge_cont_pte_locked(struct vm_area_struct *vma, pte_t *pte,
> > unsigned long haddr, bool freeze)
> > {
> > ...
> >            if (compound_mapcount(head) > 1 && !TestSetPageDoubleMap(head)) {
> >                   for (i = 0; i < HPAGE_CONT_PTE_NR; i++)
> >                            atomic_inc(&head[i]._mapcount);
> >                  atomic_long_inc(&cont_pte_double_map_count);
> >            }
> >
> >
> >             if (atomic_add_negative(-1, compound_mapcount_ptr(head))) {
> >               ...
> > }
> >
> > I am not selling our solution any more, but just showing you some differences we
> > have :-)
>
> OK, I understand what you were saying now. I'm currently struggling to see how
> this could fit into my model. Do you have any workloads and numbers on perf
> improvement of using entire_mapcount?

TBH, I don't have any data on this; we were using entire_map from the very
beginning, so I have no comparison at all.

>
> >
> >>
> >>>
> >>> BTW, I have concerns that a variable small-THP size will really work
> >>> as userspace
> >>> is probably friendly to only one fixed size. for example, userspace
> >>> heap management
> >>> might be optimized to a size for freeing memory to the kernel. it is
> >>> very difficult
> >>> for the heap to adapt to various sizes at the same time. frequent unmap/free
> >>> size not equal with, and particularly smaller than small-THP size will
> >>> defeat all
> >>> efforts to use small-THP.
> >>
> >> I'll admit to not knowing a huge amount about user space allocators. But I will
> >> say that as currently defined, the small-sized THP interface to user space
> >> allows a sysadmin to specifically enable the set of sizes that they want; so a
> >> single size can be enabled. I'm diliberately punting that decision away from the
> >> kernel for now.
> >
> > Basically, userspace heap library has a PAGESIZE setting and allows users
> > to allocate/free all kinds of small objects such as 16,32,64,128,256,512 etc.
> > The default size is for sure equal to the basepage SIZE. once some objects are
> > freed by free() and libc get a free "page", userspace heap libraries might free
> > the PAGESIZE page to kernel by things like MADV_DONTNEED, then zap_pte_range().
> > it is quite similar with kernel slab.
> >
> > so imagine we have small-THP now, but userspace libraries have *NO*
> > idea at all,  so it can frequently cause unfolding.
> >
> >>
> >> FWIW, My experience with the Speedometer/JavaScript use case is that performance
> >> is a little bit better when enabling 64+32+16K vs just 64K THP.
> >>
> >> Functionally, it will not matter if the allocator is not enlightened for the THP
> >> size; it can continue to free, and if a partial folio is unmapped it is put on
> >> the deferred split list, then under memory pressure it is split and the unused
> >> pages are reclaimed. I guess this is the bit you are concerned about having a
> >> performance impact?
> >
> > right. If this is happening on the majority of small-THP folios, we
> > don't have performance
> > improvement, and probably regression instead. This is really true on
> > real workloads!!
> >
> > So that is why we really love a per-VMA hint to enable small-THP but
> > obviously you
> > have already supported it now by
> > mm: thp: Introduce per-size thp sysfs interface
> > https://lore.kernel.org/linux-mm/20231122162950.3854897-4-ryan.roberts@arm.com/
> >
> > we can use MADVISE rather than ALWAYS and set fixed size like 64KB, so userspace
> > can set the VMA flag when it is quite sure this VMA is working with
> > the alignment
> > of 64KB?
>
> Yes, that all exists in the series today. We have also discussed the possibility
> of adding a new madvise_process() call that would take the set of THP sizes that
> should be considered. Then you can set different VMAs to use different sizes;
> the plan was to layer that on top if/when a workload was identified. Sounds like
> you might be able to help there?

I'm not quite sure, as on phones we are using fixed-size CONTPTE, so we ask
for either 64KB or 4KB. If we think a VMA is a good fit for CONTPTE, we set a
flag in that VMA and try to allocate 64KB.

But I will try to understand this requirement to madvise THP sizes on a
specific VMA.

>
> >
> >>
> >> Regardless, it would be good to move this conversation to the small-sized THP
> >> patch series since this is all independent of contpte mappings.
> >>
> >>>
> >>>>
> >>>>>
> >>>>> Since we always hold ptl to set or drop CONTPTE bits, set/drop is
> >>>>> still atomic in a
> >>>>> spinlock area.
> >>>>>
> >>>>>>
> >>>>>>>>
> >>>>>>>> But that can be added on top, and I'll happily do that.
> >>>>>>>>
> >>>>>>>> --
> >>>>>>>> Cheers,
> >>>>>>>>
> >>>>>>>> David / dhildenb
> >>>>>>>
> >>>>>
> >>>

Thanks
Barry
Ryan Roberts Nov. 28, 2023, 10:49 a.m. UTC | #38
On 28/11/2023 09:49, Barry Song wrote:
> On Tue, Nov 28, 2023 at 10:14 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>
>> On 27/11/2023 20:34, Barry Song wrote:
>>> On Tue, Nov 28, 2023 at 12:07 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>
>>>> On 27/11/2023 10:28, Barry Song wrote:
>>>>> On Mon, Nov 27, 2023 at 11:11 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>>>
>>>>>> On 27/11/2023 09:59, Barry Song wrote:
>>>>>>> On Mon, Nov 27, 2023 at 10:35 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>>>>>
>>>>>>>> On 27/11/2023 08:42, Barry Song wrote:
>>>>>>>>>>> +           for (i = 0; i < nr; i++, page++) {
>>>>>>>>>>> +                   if (anon) {
>>>>>>>>>>> +                           /*
>>>>>>>>>>> +                            * If this page may have been pinned by the
>>>>>>>>>>> +                            * parent process, copy the page immediately for
>>>>>>>>>>> +                            * the child so that we'll always guarantee the
>>>>>>>>>>> +                            * pinned page won't be randomly replaced in the
>>>>>>>>>>> +                            * future.
>>>>>>>>>>> +                            */
>>>>>>>>>>> +                           if (unlikely(page_try_dup_anon_rmap(
>>>>>>>>>>> +                                           page, false, src_vma))) {
>>>>>>>>>>> +                                   if (i != 0)
>>>>>>>>>>> +                                           break;
>>>>>>>>>>> +                                   /* Page may be pinned, we have to copy. */
>>>>>>>>>>> +                                   return copy_present_page(
>>>>>>>>>>> +                                           dst_vma, src_vma, dst_pte,
>>>>>>>>>>> +                                           src_pte, addr, rss, prealloc,
>>>>>>>>>>> +                                           page);
>>>>>>>>>>> +                           }
>>>>>>>>>>> +                           rss[MM_ANONPAGES]++;
>>>>>>>>>>> +                           VM_BUG_ON(PageAnonExclusive(page));
>>>>>>>>>>> +                   } else {
>>>>>>>>>>> +                           page_dup_file_rmap(page, false);
>>>>>>>>>>> +                           rss[mm_counter_file(page)]++;
>>>>>>>>>>> +                   }
>>>>>>>>>>>             }
>>>>>>>>>>> -           rss[MM_ANONPAGES]++;
>>>>>>>>>>> -   } else if (page) {
>>>>>>>>>>> -           folio_get(folio);
>>>>>>>>>>> -           page_dup_file_rmap(page, false);
>>>>>>>>>>> -           rss[mm_counter_file(page)]++;
>>>>>>>>>>> +
>>>>>>>>>>> +           nr = i;
>>>>>>>>>>> +           folio_ref_add(folio, nr);
>>>>>>>>>>
>>>>>>>>>> You're changing the order of mapcount vs. refcount increment. Don't.
>>>>>>>>>> Make sure your refcount >= mapcount.
>>>>>>>>>>
>>>>>>>>>> You can do that easily by doing the folio_ref_add(folio, nr) first and
>>>>>>>>>> then decrementing in case of error accordingly. Errors due to pinned
>>>>>>>>>> pages are the corner case.
>>>>>>>>>>
>>>>>>>>>> I'll note that it will make a lot of sense to have batch variants of
>>>>>>>>>> page_try_dup_anon_rmap() and page_dup_file_rmap().
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> i still don't understand why it is not a entire map+1, but an increment
>>>>>>>>> in each basepage.
>>>>>>>>
>>>>>>>> Because we are PTE-mapping the folio, we have to account each individual page.
>>>>>>>> If we accounted the entire folio, where would we unaccount it? Each page can be
>>>>>>>> unmapped individually (e.g. munmap() part of the folio) so need to account each
>>>>>>>> page. When PMD mapping, the whole thing is either mapped or unmapped, and its
>>>>>>>> atomic, so we can account the entire thing.
>>>>>>>
>>>>>>> Hi Ryan,
>>>>>>>
>>>>>>> There is no problem. for example, a large folio is entirely mapped in
>>>>>>> process A with CONPTE,
>>>>>>> and only page2 is mapped in process B.
>>>>>>> then we will have
>>>>>>>
>>>>>>> entire_map = 0
>>>>>>> page0.map = -1
>>>>>>> page1.map = -1
>>>>>>> page2.map = 0
>>>>>>> page3.map = -1
>>>>>>> ....
>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>> as long as it is a CONTPTE large folio, there is no much difference with
>>>>>>>>> PMD-mapped large folio. it has all the chance to be DoubleMap and need
>>>>>>>>> split.
>>>>>>>>>
>>>>>>>>> When A and B share a CONTPTE large folio, we do madvise(DONTNEED) or any
>>>>>>>>> similar things on a part of the large folio in process A,
>>>>>>>>>
>>>>>>>>> this large folio will have partially mapped subpage in A (all CONTPE bits
>>>>>>>>> in all subpages need to be removed though we only unmap a part of the
>>>>>>>>> large folioas HW requires consistent CONTPTEs); and it has entire map in
>>>>>>>>> process B(all PTEs are still CONPTES in process B).
>>>>>>>>>
>>>>>>>>> isn't it more sensible for this large folios to have entire_map = 0(for
>>>>>>>>> process B), and subpages which are still mapped in process A has map_count
>>>>>>>>> =0? (start from -1).
>>>>>>>>>
>>>>>>>>>> Especially, the batch variant of page_try_dup_anon_rmap() would only
>>>>>>>>>> check once if the folio maybe pinned, and in that case, you can simply
>>>>>>>>>> drop all references again. So you either have all or no ptes to process,
>>>>>>>>>> which makes that code easier.
>>>>>>>>
>>>>>>>> I'm afraid this doesn't make sense to me. Perhaps I've misunderstood. But
>>>>>>>> fundamentally you can only use entire_mapcount if its only possible to map and
>>>>>>>> unmap the whole folio atomically.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> My point is that CONTPEs should either all-set in all 16 PTEs or all are dropped
>>>>>>> in 16 PTEs. if all PTEs have CONT, it is entirely mapped; otherwise,
>>>>>>> it is partially
>>>>>>> mapped. if a large folio is mapped in one processes with all CONTPTEs
>>>>>>> and meanwhile in another process with partial mapping(w/o CONTPTE), it is
>>>>>>> DoubleMapped.
>>>>>>
>>>>>> There are 2 problems with your proposal, as I see it;
>>>>>>
>>>>>> 1) the core-mm is not enlightened for CONTPTE mappings. As far as it is
>>>>>> concerned, its just mapping a bunch of PTEs. So it has no hook to inc/dec
>>>>>> entire_mapcount. The arch code is opportunistically and *transparently* managing
>>>>>> the CONT_PTE bit.
>>>>>>
>>>>>> 2) There is nothing to say a folio isn't *bigger* than the contpte block; it may
>>>>>> be 128K and be mapped with 2 contpte blocks. Or even a PTE-mapped THP (2M) and
>>>>>> be mapped with 32 contpte blocks. So you can't say it is entirely mapped
>>>>>> unless/until ALL of those blocks are set up. And then of course each block could
>>>>>> be unmapped unatomically.
>>>>>>
>>>>>> For the PMD case there are actually 2 properties that allow using the
>>>>>> entire_mapcount optimization; It's atomically mapped/unmapped through the PMD
>>>>>> and we know that the folio is exactly PMD sized (since it must be at least PMD
>>>>>> sized to be able to map it with the PMD, and we don't allocate THPs any bigger
>>>>>> than PMD size). So one PMD map or unmap operation corresponds to exactly one
>>>>>> *entire* map or unmap. That is not true when we are PTE mapping.
>>>>>
>>>>> well. Thanks for clarification. based on the above description, i agree the
>>>>> current code might make more sense by always using mapcount in subpage.
>>>>>
>>>>> I gave my proposals as  I thought we were always CONTPTE size for small-THP
>>>>> then we could drop the loop to iterate 16 times rmap. if we do it
>>>>> entirely, we only
>>>>> need to do dup rmap once for all 16 PTEs by increasing entire_map.
>>>>
>>>> Well its always good to have the discussion - so thanks for the ideas. I think
>>>> there is a bigger question lurking here; should we be exposing the concept of
>>>> contpte mappings to the core-mm rather than burying it in the arm64 arch code?
>>>> I'm confident that would be a huge amount of effort and the end result would be
>>>> similar performace to what this approach gives. One potential benefit of letting
>>>> core-mm control it is that it would also give control to core-mm over the
>>>> granularity of access/dirty reporting (my approach implicitly ties it to the
>>>> folio). Having sub-folio access tracking _could_ potentially help with future
>>>> work to make THP size selection automatic, but we are not there yet, and I think
>>>> there are other (simpler) ways to achieve the same thing. So my view is that
>>>> _not_ exposing it to core-mm is the right way for now.
>>>
>>> Hi Ryan,
>>>
>>> We(OPPO) started a similar project like you even before folio was imported to
>>> mainline, we have deployed the dynamic hugepage(that is how we name it)
>>> on millions of mobile phones on real products and kernels before 5.16,  making
>>> a huge success on performance improvement. for example, you may
>>> find the out-of-tree 5.15 source code here
>>
>> Oh wow, thanks for reaching out and explaining this - I have to admit I feel
>> embarrassed that I clearly didn't do enough research on the prior art because I
>> wasn't aware of your work. So sorry about that.
>>
>> I sensed that you had a different model for how this should work vs what I've
>> implemented and now I understand why :). I'll review your stuff and I'm sure
>> I'll have questions. I'm sure each solution has pros and cons.
>>
>>
>>>
>>> https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/tree/oneplus/sm8550_u_14.0.0_oneplus11
>>>
>>> Our modification might not be so clean and has lots of workarounds
>>> just for the stability of products
>>>
>>> We mainly have
>>>
>>> 1. https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/blob/oneplus/sm8550_u_14.0.0_oneplus11/mm/cont_pte_hugepage.c
>>>
>>> some CONTPTE helpers
>>>
>>> 2.https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/blob/oneplus/sm8550_u_14.0.0_oneplus11/include/linux/mm.h
>>>
>>> some Dynamic Hugepage APIs
>>>
>>> 3. https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/blob/oneplus/sm8550_u_14.0.0_oneplus11/mm/memory.c
>>>
>>> modified all page faults to support
>>>      (1). allocation of hugepage of 64KB in do_anon_page
>>
>> My Small-Sized THP patch set is handling the equivalent of this.
> 
> right, the only difference is that we did a huge-zeropage for reading
> in do_anon_page.
> mapping all large folios to CONTPTE to zero page.

FWIW, I took a slightly different approach in my original RFC for the zero page
- although I ripped it all out to simplify for the initial series. I found that
it was pretty rare for user space to read multiple consecutive pages without
ever interleaving any writes, so I kept the zero page as a base page, but at CoW,
I would expand the allocation to an appropriately sized THP. But for the couple
of workloads that I've gone deep with, I found that it made barely any dent on
the amount of memory that ended up contpte-mapped; the vast majority was from
write allocation in do_anonymous_page().
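
To illustrate the shape of that policy, here is a minimal sketch (userspace
toy code with invented names; not the RFC implementation, just one plausible
way to pick the allocation size): on a write fault against the zero page,
choose the largest block whose naturally aligned range fits inside the VMA
and currently maps nothing but zero-page or empty PTEs.

#include <stdbool.h>
#include <stddef.h>

enum toy_pte_state { TOY_PTE_NONE, TOY_PTE_ZERO, TOY_PTE_OTHER };

/*
 * Return how many base pages to allocate for the CoW copy of the zero
 * page at index 'idx'; indices are in units of base pages and
 * [vma_start, vma_end) bounds the VMA.
 */
static size_t toy_cow_expand_nr(const enum toy_pte_state *ptes,
                                size_t vma_start, size_t vma_end, size_t idx)
{
        static const size_t block_nr[] = { 16, 8, 4 }; /* e.g. 64K, 32K, 16K */

        for (size_t o = 0; o < sizeof(block_nr) / sizeof(block_nr[0]); o++) {
                size_t nr = block_nr[o];
                size_t start = (idx / nr) * nr;
                bool ok = start >= vma_start && start + nr <= vma_end;

                for (size_t i = 0; ok && i < nr; i++)
                        ok = ptes[start + i] != TOY_PTE_OTHER;

                if (ok)
                        return nr;      /* allocate a folio of this size */
        }
        return 1;                       /* fall back to a single base page */
}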

> 
>>
>>>      (2). CoW hugepage in do_wp_page
>>
>> This isn't handled yet in my patch set; the original RFC implemented it but I
>> removed it in order to strip back to the essential complexity for the initial
>> submission. DavidH has been working on a precise shared vs exclusive map
>> tracking mechanism - if that goes in, it will make CoWing large folios simpler.
>> Out of interest, what workloads benefit most from this?
> 
> as a phone, Android has a design almost all processes are forked from zygote.
> thus, CoW happens quite often to all apps.

Sure. But in my analysis I concluded that most of the memory mapped in zygote is
file-backed and mostly RO, so doing THP CoW doesn't help much. Perhaps
there are cases where that conclusion is wrong.

> 
>>
>>>      (3). copy CONPTEs in copy_pte_range
>>
>> As discussed this is done as part of the contpte patch set, but its not just a
>> simple copy; the arch code will notice and set the CONT_PTE bit as needed.
> 
> right, i have read all your unfold and fold stuff today, now i understand your
> approach seems quite nice!

Great - thanks!

> 
> 
>>
>>>      (4). allocate and swap-in Hugepage as a whole in do_swap_page
>>
>> This is going to be a problem but I haven't even looked at this properly yet.
>> The advice so far has been to continue to swap-in small pages only, but improve
>> khugepaged to collapse to small-sized THP. I'll take a look at your code to
>> understand how you did this.
> 
> this is also crucial to android phone as swap is always happening
> on an embedded device. if we don't support large folios in swapin,
> our large folios will never come back after it is swapped-out.
> 
> and i hated the collapse solution from the first beginning as there is
> never a guarantee to succeed and its overhead is unacceptable to user UI,
> so we supported hugepage allocation in do_swap_page from the first beginning.

Understood. I agree it would be nice to preserve large folios across swap. I
think this can be layered on top of the current work though.

> 
>>
>>>
>>> 4. https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/blob/oneplus/sm8550_u_14.0.0_oneplus11/mm/vmscan.c
>>> https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/blob/oneplus/sm8550_u_14.0.0_oneplus11/mm/rmap.c
>>>
>>> reclaim hugepage as a whole and LRU optimization for 64KB dynamic hugepage.
>>
>> I think this is all naturally handled by the folio code that exists in modern
>> kernels?
> 
> We had a CONTPTE hugepage pool, if the pool is very limited, we let LRU
> reclaim large folios to the pool. as phones are running lots of apps
> and drivers, and the memory is very limited, after a couple of hours,
> it will become very hard to allocate large folios in the original buddy. thus,
> large folios totally disappeared after running the phone for some time
> if we didn't have the pool.
> 
>>
>>>
>>> So we are 100% interested in your patchset and hope it can find a way
>>> to land on the
>>> mainline, thus decreasing all the cost we have to maintain out-of-tree
>>> code from a
>>> kernel to another kernel version which we have done on a couple of
>>> kernel versions
>>> before 5.16. Firmly, we are 100% supportive of large anon folios
>>> things you are leading.
>>
>> That's great to hear! Of course Reviewed-By's and Tested-By's will all help move
>> it closer :). If you had any ability to do any A/B performance testing, it would
>> be very interesting to see how this stacks up against your solution - if there
>> are gaps it would be good to know where and develop a plan to plug the gap.
>>
> 
> sure.
> 
>>>
>>> A big pain was we found lots of races especially on CONTPTE unfolding
>>> and especially a part
>>> of basepages ran away from the 16 CONPTEs group since userspace is
>>> always working
>>> on basepages, having no idea of small-THP.  We ran our code on millions of
>>> real phones, and now we have got them fixed (or maybe "can't reproduce"),
>>> no outstanding issue.
>>
>> I'm going to be brave and say that my solution shouldn't suffer from these
>> problems; but of course the proof is only in the testing. I did a lot of work
>> with our architecture group and micro architects to determine exactly what is
>> and isn't safe; We even tightened the Arm ARM spec very subtlely to allow the
>> optimization in patch 13 (see the commit log for details). Of course this has
>> all been checked with partners and we are confident that all existing
>> implementations conform to the modified wording.
> 
> cool. I like your try_unfold/fold code. it seems your code is setting/dropping
> CONT automatically based on ALIGHMENT, Page number etc. Alternatively,
> our code is always stupidly checking some conditions before setting and dropping
> CONT everywhere.
> 
>>
>>>
>>> Particularly for the rmap issue we are discussing, our out-of-tree is
>>> using the entire_map for
>>> CONTPTE in the way I sent to you. But I guess we can learn from you to decouple
>>> CONTPTE from mm-core.
>>>
>>> We are doing this in mm/memory.c
>>>
>>> copy_present_cont_pte(struct vm_area_struct *dst_vma, struct
>>> vm_area_struct *src_vma,
>>> pte_t *dst_pte, pte_t *src_pte, unsigned long addr, int *rss,
>>> struct page **prealloc)
>>> {
>>>       struct mm_struct *src_mm = src_vma->vm_mm;
>>>       unsigned long vm_flags = src_vma->vm_flags;
>>>       pte_t pte = *src_pte;
>>>       struct page *page;
>>>
>>>        page = vm_normal_page(src_vma, addr, pte);
>>>       ...
>>>
>>>      get_page(page);
>>>      page_dup_rmap(page, true);   // an entire dup_rmap as you can
>>> see.............
>>>      rss[mm_counter(page)] += HPAGE_CONT_PTE_NR;
>>> }
>>>
>>> and we have a split in mm/cont_pte_hugepage.c to handle partially unmap,
>>>
>>> static void __split_huge_cont_pte_locked(struct vm_area_struct *vma, pte_t *pte,
>>> unsigned long haddr, bool freeze)
>>> {
>>> ...
>>>            if (compound_mapcount(head) > 1 && !TestSetPageDoubleMap(head)) {
>>>                   for (i = 0; i < HPAGE_CONT_PTE_NR; i++)
>>>                            atomic_inc(&head[i]._mapcount);
>>>                  atomic_long_inc(&cont_pte_double_map_count);
>>>            }
>>>
>>>
>>>             if (atomic_add_negative(-1, compound_mapcount_ptr(head))) {
>>>               ...
>>> }
>>>
>>> I am not selling our solution any more, but just showing you some differences we
>>> have :-)
>>
>> OK, I understand what you were saying now. I'm currently struggling to see how
>> this could fit into my model. Do you have any workloads and numbers on perf
>> improvement of using entire_mapcount?
> 
> TBH, I don't have any data on this as from the first beginning, we were using
> entire_map. So I have no comparison at all.
> 
>>
>>>
>>>>
>>>>>
>>>>> BTW, I have concerns that a variable small-THP size will really work
>>>>> as userspace
>>>>> is probably friendly to only one fixed size. for example, userspace
>>>>> heap management
>>>>> might be optimized to a size for freeing memory to the kernel. it is
>>>>> very difficult
>>>>> for the heap to adapt to various sizes at the same time. frequent unmap/free
>>>>> size not equal with, and particularly smaller than small-THP size will
>>>>> defeat all
>>>>> efforts to use small-THP.
>>>>
>>>> I'll admit to not knowing a huge amount about user space allocators. But I will
>>>> say that as currently defined, the small-sized THP interface to user space
>>>> allows a sysadmin to specifically enable the set of sizes that they want; so a
>>>> single size can be enabled. I'm diliberately punting that decision away from the
>>>> kernel for now.
>>>
>>> Basically, userspace heap library has a PAGESIZE setting and allows users
>>> to allocate/free all kinds of small objects such as 16,32,64,128,256,512 etc.
>>> The default size is for sure equal to the basepage SIZE. once some objects are
>>> freed by free() and libc get a free "page", userspace heap libraries might free
>>> the PAGESIZE page to kernel by things like MADV_DONTNEED, then zap_pte_range().
>>> it is quite similar with kernel slab.
>>>
>>> so imagine we have small-THP now, but userspace libraries have *NO*
>>> idea at all,  so it can frequently cause unfolding.
>>>
>>>>
>>>> FWIW, My experience with the Speedometer/JavaScript use case is that performance
>>>> is a little bit better when enabling 64+32+16K vs just 64K THP.
>>>>
>>>> Functionally, it will not matter if the allocator is not enlightened for the THP
>>>> size; it can continue to free, and if a partial folio is unmapped it is put on
>>>> the deferred split list, then under memory pressure it is split and the unused
>>>> pages are reclaimed. I guess this is the bit you are concerned about having a
>>>> performance impact?
>>>
>>> right. If this is happening on the majority of small-THP folios, we
>>> don't have performance
>>> improvement, and probably regression instead. This is really true on
>>> real workloads!!
>>>
>>> So that is why we really love a per-VMA hint to enable small-THP but
>>> obviously you
>>> have already supported it now by
>>> mm: thp: Introduce per-size thp sysfs interface
>>> https://lore.kernel.org/linux-mm/20231122162950.3854897-4-ryan.roberts@arm.com/
>>>
>>> we can use MADVISE rather than ALWAYS and set fixed size like 64KB, so userspace
>>> can set the VMA flag when it is quite sure this VMA is working with
>>> the alignment
>>> of 64KB?
>>
>> Yes, that all exists in the series today. We have also discussed the possibility
>> of adding a new madvise_process() call that would take the set of THP sizes that
>> should be considered. Then you can set different VMAs to use different sizes;
>> the plan was to layer that on top if/when a workload was identified. Sounds like
>> you might be able to help there?
> 
> i'm not quite sure as on phones, we are using fixed-size CONTPTE. so we ask
> for either 64KB or 4KB. If we think one VMA is all good to use CONTPTE, we
> set a flag in this VMA and try to allocate 64KB.

When you say "we set a flag" do you mean user space? Or is there some heuristic
in the kernel?

> 
> But I will try to understand this requirement to madvise THPs size on a specific
> VMA.
> 
>>
>>>
>>>>
>>>> Regardless, it would be good to move this conversation to the small-sized THP
>>>> patch series since this is all independent of contpte mappings.
>>>>
>>>>>
>>>>>>
>>>>>>>
>>>>>>> Since we always hold ptl to set or drop CONTPTE bits, set/drop is
>>>>>>> still atomic in a
>>>>>>> spinlock area.
>>>>>>>
>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> But that can be added on top, and I'll happily do that.
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Cheers,
>>>>>>>>>>
>>>>>>>>>> David / dhildenb
>>>>>>>>>
>>>>>>>
>>>>>
> 
> Thanks
> Barry
Ryan Roberts Nov. 28, 2023, 11 a.m. UTC | #39
On 28/11/2023 00:11, Barry Song wrote:
> On Mon, Nov 27, 2023 at 10:24 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>
>> On 27/11/2023 05:54, Barry Song wrote:
>>>> +copy_present_ptes(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
>>>> +              pte_t *dst_pte, pte_t *src_pte,
>>>> +              unsigned long addr, unsigned long end,
>>>> +              int *rss, struct folio **prealloc)
>>>>  {
>>>>      struct mm_struct *src_mm = src_vma->vm_mm;
>>>>      unsigned long vm_flags = src_vma->vm_flags;
>>>>      pte_t pte = ptep_get(src_pte);
>>>>      struct page *page;
>>>>      struct folio *folio;
>>>> +    int nr = 1;
>>>> +    bool anon;
>>>> +    bool any_dirty = pte_dirty(pte);
>>>> +    int i;
>>>>
>>>>      page = vm_normal_page(src_vma, addr, pte);
>>>> -    if (page)
>>>> +    if (page) {
>>>>              folio = page_folio(page);
>>>> -    if (page && folio_test_anon(folio)) {
>>>> -            /*
>>>> -             * If this page may have been pinned by the parent process,
>>>> -             * copy the page immediately for the child so that we'll always
>>>> -             * guarantee the pinned page won't be randomly replaced in the
>>>> -             * future.
>>>> -             */
>>>> -            folio_get(folio);
>>>> -            if (unlikely(page_try_dup_anon_rmap(page, false, src_vma))) {
>>>> -                    /* Page may be pinned, we have to copy. */
>>>> -                    folio_put(folio);
>>>> -                    return copy_present_page(dst_vma, src_vma, dst_pte, src_pte,
>>>> -                                             addr, rss, prealloc, page);
>>>> +            anon = folio_test_anon(folio);
>>>> +            nr = folio_nr_pages_cont_mapped(folio, page, src_pte, addr,
>>>> +                                            end, pte, &any_dirty);
>>>
>>> in case we have a large folio with 16 CONTPTE basepages, and userspace
>>> do madvise(addr + 4KB * 5, DONTNEED);
>>
>> nit: if you are offsetting by 5 pages from addr, then below I think you mean
>> page0~page4 and page6~15?
>>
>>>
>>> thus, the 4th basepage of PTE becomes PTE_NONE and folio_nr_pages_cont_mapped()
>>> will return 15. in this case, we should copy page0~page3 and page5~page15.
>>
>> No I don't think folio_nr_pages_cont_mapped() will return 15; that's certainly
>> not how its intended to work. The function is scanning forwards from the current
>> pte until it finds the first pte that does not fit in the batch - either because
>> it maps a PFN that is not contiguous, or because the permissions are different
>> (although this is being relaxed a bit; see conversation with DavidH against this
>> same patch).
>>
>> So the first time through this loop, folio_nr_pages_cont_mapped() will return 5,
>> (page0~page4) then the next time through the loop we will go through the
>> !present path and process the single swap marker. Then the 3rd time through the
>> loop folio_nr_pages_cont_mapped() will return 10.
> 
> one case we have met by running hundreds of real phones is as below,
> 
> 
> static int
> copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
>                pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long addr,
>                unsigned long end)
> {
>         ...
>         dst_pte = pte_alloc_map_lock(dst_mm, dst_pmd, addr, &dst_ptl);
>         if (!dst_pte) {
>                 ret = -ENOMEM;
>                 goto out;
>         }
>         src_pte = pte_offset_map_nolock(src_mm, src_pmd, addr, &src_ptl);
>         if (!src_pte) {
>                 pte_unmap_unlock(dst_pte, dst_ptl);
>                 /* ret == 0 */
>                 goto out;
>         }
>         spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING);
>         orig_src_pte = src_pte;
>         orig_dst_pte = dst_pte;
>         arch_enter_lazy_mmu_mode();
> 
>         do {
>                 /*
>                  * We are holding two locks at this point - either of them
>                  * could generate latencies in another task on another CPU.
>                  */
>                 if (progress >= 32) {
>                         progress = 0;
>                         if (need_resched() ||
>                             spin_needbreak(src_ptl) || spin_needbreak(dst_ptl))
>                                 break;
>                 }
>                 ptent = ptep_get(src_pte);
>                 if (pte_none(ptent)) {
>                         progress++;
>                         continue;
>                 }
> 
> the above iteration can break when progress > =32. for example, at the
> beginning,
> if all PTEs are none, we break when progress >=32, and we break when we
> are in the 8th pte of 16PTEs which might become CONTPTE after we release
> PTL.
> 
> since we are releasing PTLs, next time when we get PTL, those pte_none() might
> become pte_cont(), then are you going to copy CONTPTE from 8th pte,
> thus, immediately
> break the consistent CONPTEs rule of hardware?
> 
> pte0 - pte_none
> pte1 - pte_none
> ...
> pte7 - pte_none
> 
> pte8 - pte_cont
> ...
> pte15 - pte_cont
> 
> so we did some modification to avoid a break in the middle of PTEs
> which can potentially
> become CONTPE.
> do {
>                 /*
>                 * We are holding two locks at this point - either of them
>                 * could generate latencies in another task on another CPU.
>                 */
>                 if (progress >= 32) {
>                                 progress = 0;
> #ifdef CONFIG_CONT_PTE_HUGEPAGE
>                 /*
>                 * XXX: don't release ptl at an unligned address as
> cont_pte might form while
>                 * ptl is released, this causes double-map
>                 */
>                 if (!vma_is_chp_anonymous(src_vma) ||
>                    (vma_is_chp_anonymous(src_vma) && IS_ALIGNED(addr,
> HPAGE_CONT_PTE_SIZE)))
> #endif
>                 if (need_resched() ||
>                    spin_needbreak(src_ptl) || spin_needbreak(dst_ptl))
>                                 break;
> }
> 
> We could only reproduce the above issue by running thousands of phones.
> 
> Does your code survive from this problem?

Yes I'm confident my code is safe against this; as I said before, the CONT_PTE
bit is not blindly "copied" from parent to child pte. As far as the core-mm is
concerned, there is no CONT_PTE bit; they are just regular PTEs. So the code
will see some pte_none() entries followed by some pte_present() entries. And
when calling set_ptes() on the child, the arch code will evaluate the current
state of the pgtable along with the new set_ptes() request and determine where
it should insert the CONT_PTE bit.
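
For illustration only, here is a toy model (userspace C with invented names;
not the actual arm64 contpte code) of the kind of check the arch layer can
make before deciding that a naturally aligned block of 16 entries may carry
the CONT bit: the block must be fully present, physically contiguous and
uniform in its attributes.

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define CONT_PTES 16

struct toy_pte {
        bool present;
        uint64_t pfn;
        uint64_t prot;  /* all attribute bits lumped together */
};

/* 'idx' is the index of the first entry of a candidate block. */
static bool block_is_foldable(const struct toy_pte *ptes, size_t idx)
{
        if (idx % CONT_PTES)            /* must be naturally aligned */
                return false;

        for (size_t i = 0; i < CONT_PTES; i++) {
                const struct toy_pte *p = &ptes[idx + i];

                if (!p->present)
                        return false;
                /* physically contiguous and same attributes as entry 0 */
                if (p->pfn != ptes[idx].pfn + i || p->prot != ptes[idx].prot)
                        return false;
        }
        return true;
}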

> 
>>
>> Thanks,
>> Ryan
>>
>>>
>>> but the current code is copying page0~page14, right? unless we are immediatly
>>> split_folio to basepages in zap_pte_range(), we will have problems?
>>>
>>>> +
>>>> +            for (i = 0; i < nr; i++, page++) {
>>>> +                    if (anon) {
>>>> +                            /*
>>>> +                             * If this page may have been pinned by the
>>>> +                             * parent process, copy the page immediately for
>>>> +                             * the child so that we'll always guarantee the
>>>> +                             * pinned page won't be randomly replaced in the
>>>> +                             * future.
>>>> +                             */
>>>> +                            if (unlikely(page_try_dup_anon_rmap(
>>>> +                                            page, false, src_vma))) {
>>>> +                                    if (i != 0)
>>>> +                                            break;
>>>> +                                    /* Page may be pinned, we have to copy. */
>>>> +                                    return copy_present_page(
>>>> +                                            dst_vma, src_vma, dst_pte,
>>>> +                                            src_pte, addr, rss, prealloc,
>>>> +                                            page);
>>>> +                            }
>>>> +                            rss[MM_ANONPAGES]++;
>>>> +                            VM_BUG_ON(PageAnonExclusive(page));
>>>> +                    } else {
>>>> +                            page_dup_file_rmap(page, false);
>>>> +                            rss[mm_counter_file(page)]++;
>>>> +                    }
>>>
> 
> Thanks
> Barry
Barry Song Nov. 28, 2023, 7 p.m. UTC | #40
On Wed, Nov 29, 2023 at 12:00 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> On 28/11/2023 00:11, Barry Song wrote:
> > On Mon, Nov 27, 2023 at 10:24 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
> >>
> >> On 27/11/2023 05:54, Barry Song wrote:
> >>>> +copy_present_ptes(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
> >>>> +              pte_t *dst_pte, pte_t *src_pte,
> >>>> +              unsigned long addr, unsigned long end,
> >>>> +              int *rss, struct folio **prealloc)
> >>>>  {
> >>>>      struct mm_struct *src_mm = src_vma->vm_mm;
> >>>>      unsigned long vm_flags = src_vma->vm_flags;
> >>>>      pte_t pte = ptep_get(src_pte);
> >>>>      struct page *page;
> >>>>      struct folio *folio;
> >>>> +    int nr = 1;
> >>>> +    bool anon;
> >>>> +    bool any_dirty = pte_dirty(pte);
> >>>> +    int i;
> >>>>
> >>>>      page = vm_normal_page(src_vma, addr, pte);
> >>>> -    if (page)
> >>>> +    if (page) {
> >>>>              folio = page_folio(page);
> >>>> -    if (page && folio_test_anon(folio)) {
> >>>> -            /*
> >>>> -             * If this page may have been pinned by the parent process,
> >>>> -             * copy the page immediately for the child so that we'll always
> >>>> -             * guarantee the pinned page won't be randomly replaced in the
> >>>> -             * future.
> >>>> -             */
> >>>> -            folio_get(folio);
> >>>> -            if (unlikely(page_try_dup_anon_rmap(page, false, src_vma))) {
> >>>> -                    /* Page may be pinned, we have to copy. */
> >>>> -                    folio_put(folio);
> >>>> -                    return copy_present_page(dst_vma, src_vma, dst_pte, src_pte,
> >>>> -                                             addr, rss, prealloc, page);
> >>>> +            anon = folio_test_anon(folio);
> >>>> +            nr = folio_nr_pages_cont_mapped(folio, page, src_pte, addr,
> >>>> +                                            end, pte, &any_dirty);
> >>>
> >>> in case we have a large folio with 16 CONTPTE basepages, and userspace
> >>> do madvise(addr + 4KB * 5, DONTNEED);
> >>
> >> nit: if you are offsetting by 5 pages from addr, then below I think you mean
> >> page0~page4 and page6~15?
> >>
> >>>
> >>> thus, the 4th basepage of PTE becomes PTE_NONE and folio_nr_pages_cont_mapped()
> >>> will return 15. in this case, we should copy page0~page3 and page5~page15.
> >>
> >> No I don't think folio_nr_pages_cont_mapped() will return 15; that's certainly
> >> not how its intended to work. The function is scanning forwards from the current
> >> pte until it finds the first pte that does not fit in the batch - either because
> >> it maps a PFN that is not contiguous, or because the permissions are different
> >> (although this is being relaxed a bit; see conversation with DavidH against this
> >> same patch).
> >>
> >> So the first time through this loop, folio_nr_pages_cont_mapped() will return 5,
> >> (page0~page4) then the next time through the loop we will go through the
> >> !present path and process the single swap marker. Then the 3rd time through the
> >> loop folio_nr_pages_cont_mapped() will return 10.
> >
> > one case we have met by running hundreds of real phones is as below,
> >
> >
> > static int
> > copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
> >                pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long addr,
> >                unsigned long end)
> > {
> >         ...
> >         dst_pte = pte_alloc_map_lock(dst_mm, dst_pmd, addr, &dst_ptl);
> >         if (!dst_pte) {
> >                 ret = -ENOMEM;
> >                 goto out;
> >         }
> >         src_pte = pte_offset_map_nolock(src_mm, src_pmd, addr, &src_ptl);
> >         if (!src_pte) {
> >                 pte_unmap_unlock(dst_pte, dst_ptl);
> >                 /* ret == 0 */
> >                 goto out;
> >         }
> >         spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING);
> >         orig_src_pte = src_pte;
> >         orig_dst_pte = dst_pte;
> >         arch_enter_lazy_mmu_mode();
> >
> >         do {
> >                 /*
> >                  * We are holding two locks at this point - either of them
> >                  * could generate latencies in another task on another CPU.
> >                  */
> >                 if (progress >= 32) {
> >                         progress = 0;
> >                         if (need_resched() ||
> >                             spin_needbreak(src_ptl) || spin_needbreak(dst_ptl))
> >                                 break;
> >                 }
> >                 ptent = ptep_get(src_pte);
> >                 if (pte_none(ptent)) {
> >                         progress++;
> >                         continue;
> >                 }
> >
> > the above iteration can break when progress > =32. for example, at the
> > beginning,
> > if all PTEs are none, we break when progress >=32, and we break when we
> > are in the 8th pte of 16PTEs which might become CONTPTE after we release
> > PTL.
> >
> > since we are releasing PTLs, next time when we get PTL, those pte_none() might
> > become pte_cont(), then are you going to copy CONTPTE from 8th pte,
> > thus, immediately
> > break the consistent CONPTEs rule of hardware?
> >
> > pte0 - pte_none
> > pte1 - pte_none
> > ...
> > pte7 - pte_none
> >
> > pte8 - pte_cont
> > ...
> > pte15 - pte_cont
> >
> > so we did some modification to avoid a break in the middle of PTEs
> > which can potentially
> > become CONTPE.
> > do {
> >                 /*
> >                 * We are holding two locks at this point - either of them
> >                 * could generate latencies in another task on another CPU.
> >                 */
> >                 if (progress >= 32) {
> >                                 progress = 0;
> > #ifdef CONFIG_CONT_PTE_HUGEPAGE
> >                 /*
> >                 * XXX: don't release ptl at an unligned address as
> > cont_pte might form while
> >                 * ptl is released, this causes double-map
> >                 */
> >                 if (!vma_is_chp_anonymous(src_vma) ||
> >                    (vma_is_chp_anonymous(src_vma) && IS_ALIGNED(addr,
> > HPAGE_CONT_PTE_SIZE)))
> > #endif
> >                 if (need_resched() ||
> >                    spin_needbreak(src_ptl) || spin_needbreak(dst_ptl))
> >                                 break;
> > }
> >
> > We could only reproduce the above issue by running thousands of phones.
> >
> > Does your code survive from this problem?
>
> Yes I'm confident my code is safe against this; as I said before, the CONT_PTE
> bit is not blindly "copied" from parent to child pte. As far as the core-mm is
> concerned, there is no CONT_PTE bit; they are just regular PTEs. So the code
> will see some pte_none() entries followed by some pte_present() entries. And
> when calling set_ptes() on the child, the arch code will evaluate the current
> state of the pgtable along with the new set_ptes() request and determine where
> it should insert the CONT_PTE bit.

Yep, I have read it very carefully and think your code is safe here. The only
problem is that the code can randomly unfold a parent process's CONTPTE while
setting wrprotect in the middle of a large folio, when it actually should keep
the CONT bit, since all the PTEs can still be consistent if we write-protect
starting from the 1st PTE.

While A forks B, progress >= 32 might interrupt in the middle of a new CONTPTE
folio which is forming; since we have to set wrprotect on parent A, the parent
immediately loses the CONT bit. This is sad, but I can't find a good way to
resolve it unless CONT is exposed to mm-core. Any idea on this?

Our code[1] resolves this by only breaking at an aligned address:

if (progress >= 32) {
        progress = 0;
#ifdef CONFIG_CONT_PTE_HUGEPAGE
        /*
         * XXX: don't release ptl at an unaligned address as cont_pte
         * might form while ptl is released; this causes double-map
         */
        if (!vma_is_chp_anonymous(src_vma) ||
            (vma_is_chp_anonymous(src_vma) &&
             IS_ALIGNED(addr, HPAGE_CONT_PTE_SIZE)))
#endif
                if (need_resched() ||
                    spin_needbreak(src_ptl) || spin_needbreak(dst_ptl))
                        break;
}

[1] https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/blob/oneplus/sm8550_u_14.0.0_oneplus11/mm/memory.c#L1180


Thanks
Barry
Barry Song Nov. 28, 2023, 9:06 p.m. UTC | #41
On Tue, Nov 28, 2023 at 11:49 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> On 28/11/2023 09:49, Barry Song wrote:
> > On Tue, Nov 28, 2023 at 10:14 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
> >>
> >> On 27/11/2023 20:34, Barry Song wrote:
> >>> On Tue, Nov 28, 2023 at 12:07 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
> >>>>
> >>>> On 27/11/2023 10:28, Barry Song wrote:
> >>>>> On Mon, Nov 27, 2023 at 11:11 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
> >>>>>>
> >>>>>> On 27/11/2023 09:59, Barry Song wrote:
> >>>>>>> On Mon, Nov 27, 2023 at 10:35 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
> >>>>>>>>
> >>>>>>>> On 27/11/2023 08:42, Barry Song wrote:
> >>>>>>>>>>> +           for (i = 0; i < nr; i++, page++) {
> >>>>>>>>>>> +                   if (anon) {
> >>>>>>>>>>> +                           /*
> >>>>>>>>>>> +                            * If this page may have been pinned by the
> >>>>>>>>>>> +                            * parent process, copy the page immediately for
> >>>>>>>>>>> +                            * the child so that we'll always guarantee the
> >>>>>>>>>>> +                            * pinned page won't be randomly replaced in the
> >>>>>>>>>>> +                            * future.
> >>>>>>>>>>> +                            */
> >>>>>>>>>>> +                           if (unlikely(page_try_dup_anon_rmap(
> >>>>>>>>>>> +                                           page, false, src_vma))) {
> >>>>>>>>>>> +                                   if (i != 0)
> >>>>>>>>>>> +                                           break;
> >>>>>>>>>>> +                                   /* Page may be pinned, we have to copy. */
> >>>>>>>>>>> +                                   return copy_present_page(
> >>>>>>>>>>> +                                           dst_vma, src_vma, dst_pte,
> >>>>>>>>>>> +                                           src_pte, addr, rss, prealloc,
> >>>>>>>>>>> +                                           page);
> >>>>>>>>>>> +                           }
> >>>>>>>>>>> +                           rss[MM_ANONPAGES]++;
> >>>>>>>>>>> +                           VM_BUG_ON(PageAnonExclusive(page));
> >>>>>>>>>>> +                   } else {
> >>>>>>>>>>> +                           page_dup_file_rmap(page, false);
> >>>>>>>>>>> +                           rss[mm_counter_file(page)]++;
> >>>>>>>>>>> +                   }
> >>>>>>>>>>>             }
> >>>>>>>>>>> -           rss[MM_ANONPAGES]++;
> >>>>>>>>>>> -   } else if (page) {
> >>>>>>>>>>> -           folio_get(folio);
> >>>>>>>>>>> -           page_dup_file_rmap(page, false);
> >>>>>>>>>>> -           rss[mm_counter_file(page)]++;
> >>>>>>>>>>> +
> >>>>>>>>>>> +           nr = i;
> >>>>>>>>>>> +           folio_ref_add(folio, nr);
> >>>>>>>>>>
> >>>>>>>>>> You're changing the order of mapcount vs. refcount increment. Don't.
> >>>>>>>>>> Make sure your refcount >= mapcount.
> >>>>>>>>>>
> >>>>>>>>>> You can do that easily by doing the folio_ref_add(folio, nr) first and
> >>>>>>>>>> then decrementing in case of error accordingly. Errors due to pinned
> >>>>>>>>>> pages are the corner case.
> >>>>>>>>>>
> >>>>>>>>>> I'll note that it will make a lot of sense to have batch variants of
> >>>>>>>>>> page_try_dup_anon_rmap() and page_dup_file_rmap().
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> i still don't understand why it is not a entire map+1, but an increment
> >>>>>>>>> in each basepage.
> >>>>>>>>
> >>>>>>>> Because we are PTE-mapping the folio, we have to account each individual page.
> >>>>>>>> If we accounted the entire folio, where would we unaccount it? Each page can be
> >>>>>>>> unmapped individually (e.g. munmap() part of the folio) so need to account each
> >>>>>>>> page. When PMD mapping, the whole thing is either mapped or unmapped, and its
> >>>>>>>> atomic, so we can account the entire thing.
> >>>>>>>
> >>>>>>> Hi Ryan,
> >>>>>>>
> >>>>>>> There is no problem. for example, a large folio is entirely mapped in
> >>>>>>> process A with CONPTE,
> >>>>>>> and only page2 is mapped in process B.
> >>>>>>> then we will have
> >>>>>>>
> >>>>>>> entire_map = 0
> >>>>>>> page0.map = -1
> >>>>>>> page1.map = -1
> >>>>>>> page2.map = 0
> >>>>>>> page3.map = -1
> >>>>>>> ....
> >>>>>>>
> >>>>>>>>
> >>>>>>>>>
> >>>>>>>>> as long as it is a CONTPTE large folio, there is no much difference with
> >>>>>>>>> PMD-mapped large folio. it has all the chance to be DoubleMap and need
> >>>>>>>>> split.
> >>>>>>>>>
> >>>>>>>>> When A and B share a CONTPTE large folio, we do madvise(DONTNEED) or any
> >>>>>>>>> similar things on a part of the large folio in process A,
> >>>>>>>>>
> >>>>>>>>> this large folio will have partially mapped subpage in A (all CONTPE bits
> >>>>>>>>> in all subpages need to be removed though we only unmap a part of the
> >>>>>>>>> large folioas HW requires consistent CONTPTEs); and it has entire map in
> >>>>>>>>> process B(all PTEs are still CONPTES in process B).
> >>>>>>>>>
> >>>>>>>>> isn't it more sensible for this large folios to have entire_map = 0(for
> >>>>>>>>> process B), and subpages which are still mapped in process A has map_count
> >>>>>>>>> =0? (start from -1).
> >>>>>>>>>
> >>>>>>>>>> Especially, the batch variant of page_try_dup_anon_rmap() would only
> >>>>>>>>>> check once if the folio maybe pinned, and in that case, you can simply
> >>>>>>>>>> drop all references again. So you either have all or no ptes to process,
> >>>>>>>>>> which makes that code easier.
> >>>>>>>>
> >>>>>>>> I'm afraid this doesn't make sense to me. Perhaps I've misunderstood. But
> >>>>>>>> fundamentally you can only use entire_mapcount if its only possible to map and
> >>>>>>>> unmap the whole folio atomically.
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> My point is that CONTPEs should either all-set in all 16 PTEs or all are dropped
> >>>>>>> in 16 PTEs. if all PTEs have CONT, it is entirely mapped; otherwise,
> >>>>>>> it is partially
> >>>>>>> mapped. if a large folio is mapped in one processes with all CONTPTEs
> >>>>>>> and meanwhile in another process with partial mapping(w/o CONTPTE), it is
> >>>>>>> DoubleMapped.
> >>>>>>
> >>>>>> There are 2 problems with your proposal, as I see it;
> >>>>>>
> >>>>>> 1) the core-mm is not enlightened for CONTPTE mappings. As far as it is
> >>>>>> concerned, its just mapping a bunch of PTEs. So it has no hook to inc/dec
> >>>>>> entire_mapcount. The arch code is opportunistically and *transparently* managing
> >>>>>> the CONT_PTE bit.
> >>>>>>
> >>>>>> 2) There is nothing to say a folio isn't *bigger* than the contpte block; it may
> >>>>>> be 128K and be mapped with 2 contpte blocks. Or even a PTE-mapped THP (2M) and
> >>>>>> be mapped with 32 contpte blocks. So you can't say it is entirely mapped
> >>>>>> unless/until ALL of those blocks are set up. And then of course each block could
> >>>>>> be unmapped unatomically.
> >>>>>>
> >>>>>> For the PMD case there are actually 2 properties that allow using the
> >>>>>> entire_mapcount optimization; It's atomically mapped/unmapped through the PMD
> >>>>>> and we know that the folio is exactly PMD sized (since it must be at least PMD
> >>>>>> sized to be able to map it with the PMD, and we don't allocate THPs any bigger
> >>>>>> than PMD size). So one PMD map or unmap operation corresponds to exactly one
> >>>>>> *entire* map or unmap. That is not true when we are PTE mapping.
> >>>>>
> >>>>> well. Thanks for clarification. based on the above description, i agree the
> >>>>> current code might make more sense by always using mapcount in subpage.
> >>>>>
> >>>>> I gave my proposals as  I thought we were always CONTPTE size for small-THP
> >>>>> then we could drop the loop to iterate 16 times rmap. if we do it
> >>>>> entirely, we only
> >>>>> need to do dup rmap once for all 16 PTEs by increasing entire_map.
> >>>>
> >>>> Well its always good to have the discussion - so thanks for the ideas. I think
> >>>> there is a bigger question lurking here; should we be exposing the concept of
> >>>> contpte mappings to the core-mm rather than burying it in the arm64 arch code?
> >>>> I'm confident that would be a huge amount of effort and the end result would be
> >>>> similar performace to what this approach gives. One potential benefit of letting
> >>>> core-mm control it is that it would also give control to core-mm over the
> >>>> granularity of access/dirty reporting (my approach implicitly ties it to the
> >>>> folio). Having sub-folio access tracking _could_ potentially help with future
> >>>> work to make THP size selection automatic, but we are not there yet, and I think
> >>>> there are other (simpler) ways to achieve the same thing. So my view is that
> >>>> _not_ exposing it to core-mm is the right way for now.
> >>>
> >>> Hi Ryan,
> >>>
> >>> We(OPPO) started a similar project like you even before folio was imported to
> >>> mainline, we have deployed the dynamic hugepage(that is how we name it)
> >>> on millions of mobile phones on real products and kernels before 5.16,  making
> >>> a huge success on performance improvement. for example, you may
> >>> find the out-of-tree 5.15 source code here
> >>
> >> Oh wow, thanks for reaching out and explaining this - I have to admit I feel
> >> embarrassed that I clearly didn't do enough research on the prior art because I
> >> wasn't aware of your work. So sorry about that.
> >>
> >> I sensed that you had a different model for how this should work vs what I've
> >> implemented and now I understand why :). I'll review your stuff and I'm sure
> >> I'll have questions. I'm sure each solution has pros and cons.
> >>
> >>
> >>>
> >>> https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/tree/oneplus/sm8550_u_14.0.0_oneplus11
> >>>
> >>> Our modification might not be so clean and has lots of workarounds
> >>> just for the stability of products
> >>>
> >>> We mainly have
> >>>
> >>> 1. https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/blob/oneplus/sm8550_u_14.0.0_oneplus11/mm/cont_pte_hugepage.c
> >>>
> >>> some CONTPTE helpers
> >>>
> >>> 2.https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/blob/oneplus/sm8550_u_14.0.0_oneplus11/include/linux/mm.h
> >>>
> >>> some Dynamic Hugepage APIs
> >>>
> >>> 3. https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/blob/oneplus/sm8550_u_14.0.0_oneplus11/mm/memory.c
> >>>
> >>> modified all page faults to support
> >>>      (1). allocation of hugepage of 64KB in do_anon_page
> >>
> >> My Small-Sized THP patch set is handling the equivalent of this.
> >
> > right, the only difference is that we did a huge-zeropage for reading
> > in do_anon_page.
> > mapping all large folios to CONTPTE to zero page.
>
> FWIW, I took a slightly different approach in my original RFC for the zero page
> - although I ripped it all out to simplify for the initial series. I found that
> it was pretty rare for user space to read multiple consecutive pages without
> ever interleving any writes, so I kept the zero page as a base page, but at CoW,
> I would expand the allocation to an approprately sized THP. But for the couple
> of workloads that I've gone deep with, I found that it made barely any dent on
> the amount of memory that ended up contpte-mapped; the vast majority was from
> write allocation in do_anonymous_page().

The problem is that even if only one page is read within the 16 PTEs, you
will map that page to the zero basepage. Then, when you write another page
within those 16 PTEs, you lose the chance to get a large folio because
pte_range_none() becomes false.

If we map those 16 PTEs to a contpte zero page, then in do_wp_page we have
a good chance to CoW and get a large anon folio.
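
For reference, a minimal sketch of the kind of check being referred to
here, assuming a helper shaped like the pte_range_none() from the
small-sized THP series (simplified, not the verbatim code):

/*
 * Sketch only: all nr ptes starting at @pte must still be none for a
 * large folio to be mapped over this range.
 */
static inline bool pte_range_none(pte_t *pte, int nr)
{
	int i;

	for (i = 0; i < nr; i++) {
		if (!pte_none(ptep_get(pte + i)))
			return false;
	}
	return true;
}

Once a single PTE in the block already maps the zero basepage, this check
fails and the later write fault can only fall back to an order-0
allocation.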

>
> >
> >>
> >>>      (2). CoW hugepage in do_wp_page
> >>
> >> This isn't handled yet in my patch set; the original RFC implemented it but I
> >> removed it in order to strip back to the essential complexity for the initial
> >> submission. DavidH has been working on a precise shared vs exclusive map
> >> tracking mechanism - if that goes in, it will make CoWing large folios simpler.
> >> Out of interest, what workloads benefit most from this?
> >
> > as a phone, Android has a design almost all processes are forked from zygote.
> > thus, CoW happens quite often to all apps.
>
> Sure. But in my analysis I concluded that most of the memory mapped in zygote is
> file-backed and mostly RO so therefore doing THP CoW doesn't help much. Perhaps
> there are cases where that conclusion is wrong.

CoW is much less frequent than do_anon_page on my phone, which has been
running dynamic hugepage for a couple of hours:

OP52D1L1:/ # cat /proc/cont_pte_hugepage/stat
...
thp_cow 34669                           ---- CoW a large folio
thp_do_anon_pages 1032362     -----  a large folio in do_anon_page
...

so it is around 34669/1032362 = 3.35%.

>
> >
> >>
> >>>      (3). copy CONPTEs in copy_pte_range
> >>
> >> As discussed this is done as part of the contpte patch set, but its not just a
> >> simple copy; the arch code will notice and set the CONT_PTE bit as needed.
> >
> > right, i have read all your unfold and fold stuff today, now i understand your
> > approach seems quite nice!
>
> Great - thanks!
>
> >
> >
> >>
> >>>      (4). allocate and swap-in Hugepage as a whole in do_swap_page
> >>
> >> This is going to be a problem but I haven't even looked at this properly yet.
> >> The advice so far has been to continue to swap-in small pages only, but improve
> >> khugepaged to collapse to small-sized THP. I'll take a look at your code to
> >> understand how you did this.
> >
> > this is also crucial to android phone as swap is always happening
> > on an embedded device. if we don't support large folios in swapin,
> > our large folios will never come back after it is swapped-out.
> >
> > and i hated the collapse solution from the first beginning as there is
> > never a guarantee to succeed and its overhead is unacceptable to user UI,
> > so we supported hugepage allocation in do_swap_page from the first beginning.
>
> Understood. I agree it would be nice to preserve large folios across swap. I
> think this can be layered on top of the current work though.

This will be my first priority for using your large folio code on phones.
We need a patchset on top of yours :-)

Without it, we will likely fail. Typically, one phone can have 4~8GB of
zRAM to compress a lot of anon pages; if the compression ratio is 1:4,
that means the uncompressed anon pages amount to much, much more. Thus,
when a background app is switched back to the foreground, we need those
swapped-out large folios back rather than getting small basepage
replacements. Swapping in basepages is definitely not going to work well
on a phone, and neither does THP collapse.

>
> >
> >>
> >>>
> >>> 4. https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/blob/oneplus/sm8550_u_14.0.0_oneplus11/mm/vmscan.c
> >>> https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/blob/oneplus/sm8550_u_14.0.0_oneplus11/mm/rmap.c
> >>>
> >>> reclaim hugepage as a whole and LRU optimization for 64KB dynamic hugepage.
> >>
> >> I think this is all naturally handled by the folio code that exists in modern
> >> kernels?
> >
> > We had a CONTPTE hugepage pool, if the pool is very limited, we let LRU
> > reclaim large folios to the pool. as phones are running lots of apps
> > and drivers, and the memory is very limited, after a couple of hours,
> > it will become very hard to allocate large folios in the original buddy. thus,
> > large folios totally disappeared after running the phone for some time
> > if we didn't have the pool.
> >
> >>
> >>>
> >>> So we are 100% interested in your patchset and hope it can find a way
> >>> to land on the
> >>> mainline, thus decreasing all the cost we have to maintain out-of-tree
> >>> code from a
> >>> kernel to another kernel version which we have done on a couple of
> >>> kernel versions
> >>> before 5.16. Firmly, we are 100% supportive of large anon folios
> >>> things you are leading.
> >>
> >> That's great to hear! Of course Reviewed-By's and Tested-By's will all help move
> >> it closer :). If you had any ability to do any A/B performance testing, it would
> >> be very interesting to see how this stacks up against your solution - if there
> >> are gaps it would be good to know where and develop a plan to plug the gap.
> >>
> >
> > sure.
> >
> >>>
> >>> A big pain was we found lots of races especially on CONTPTE unfolding
> >>> and especially a part
> >>> of basepages ran away from the 16 CONPTEs group since userspace is
> >>> always working
> >>> on basepages, having no idea of small-THP.  We ran our code on millions of
> >>> real phones, and now we have got them fixed (or maybe "can't reproduce"),
> >>> no outstanding issue.
> >>
> >> I'm going to be brave and say that my solution shouldn't suffer from these
> >> problems; but of course the proof is only in the testing. I did a lot of work
> >> with our architecture group and micro architects to determine exactly what is
> >> and isn't safe; We even tightened the Arm ARM spec very subtlely to allow the
> >> optimization in patch 13 (see the commit log for details). Of course this has
> >> all been checked with partners and we are confident that all existing
> >> implementations conform to the modified wording.
> >
> > cool. I like your try_unfold/fold code. it seems your code is setting/dropping
> > CONT automatically based on ALIGHMENT, Page number etc. Alternatively,
> > our code is always stupidly checking some conditions before setting and dropping
> > CONT everywhere.
> >
> >>
> >>>
> >>> Particularly for the rmap issue we are discussing, our out-of-tree is
> >>> using the entire_map for
> >>> CONTPTE in the way I sent to you. But I guess we can learn from you to decouple
> >>> CONTPTE from mm-core.
> >>>
> >>> We are doing this in mm/memory.c
> >>>
> >>> copy_present_cont_pte(struct vm_area_struct *dst_vma, struct
> >>> vm_area_struct *src_vma,
> >>> pte_t *dst_pte, pte_t *src_pte, unsigned long addr, int *rss,
> >>> struct page **prealloc)
> >>> {
> >>>       struct mm_struct *src_mm = src_vma->vm_mm;
> >>>       unsigned long vm_flags = src_vma->vm_flags;
> >>>       pte_t pte = *src_pte;
> >>>       struct page *page;
> >>>
> >>>        page = vm_normal_page(src_vma, addr, pte);
> >>>       ...
> >>>
> >>>      get_page(page);
> >>>      page_dup_rmap(page, true);   // an entire dup_rmap as you can
> >>> see.............
> >>>      rss[mm_counter(page)] += HPAGE_CONT_PTE_NR;
> >>> }
> >>>
> >>> and we have a split in mm/cont_pte_hugepage.c to handle partially unmap,
> >>>
> >>> static void __split_huge_cont_pte_locked(struct vm_area_struct *vma, pte_t *pte,
> >>> unsigned long haddr, bool freeze)
> >>> {
> >>> ...
> >>>            if (compound_mapcount(head) > 1 && !TestSetPageDoubleMap(head)) {
> >>>                   for (i = 0; i < HPAGE_CONT_PTE_NR; i++)
> >>>                            atomic_inc(&head[i]._mapcount);
> >>>                  atomic_long_inc(&cont_pte_double_map_count);
> >>>            }
> >>>
> >>>
> >>>             if (atomic_add_negative(-1, compound_mapcount_ptr(head))) {
> >>>               ...
> >>> }
> >>>
> >>> I am not selling our solution any more, but just showing you some differences we
> >>> have :-)
> >>
> >> OK, I understand what you were saying now. I'm currently struggling to see how
> >> this could fit into my model. Do you have any workloads and numbers on perf
> >> improvement of using entire_mapcount?
> >
> > TBH, I don't have any data on this as from the first beginning, we were using
> > entire_map. So I have no comparison at all.
> >
> >>
> >>>
> >>>>
> >>>>>
> >>>>> BTW, I have concerns that a variable small-THP size will really work
> >>>>> as userspace
> >>>>> is probably friendly to only one fixed size. for example, userspace
> >>>>> heap management
> >>>>> might be optimized to a size for freeing memory to the kernel. it is
> >>>>> very difficult
> >>>>> for the heap to adapt to various sizes at the same time. frequent unmap/free
> >>>>> size not equal with, and particularly smaller than small-THP size will
> >>>>> defeat all
> >>>>> efforts to use small-THP.
> >>>>
> >>>> I'll admit to not knowing a huge amount about user space allocators. But I will
> >>>> say that as currently defined, the small-sized THP interface to user space
> >>>> allows a sysadmin to specifically enable the set of sizes that they want; so a
> >>>> single size can be enabled. I'm diliberately punting that decision away from the
> >>>> kernel for now.
> >>>
> >>> Basically, userspace heap library has a PAGESIZE setting and allows users
> >>> to allocate/free all kinds of small objects such as 16,32,64,128,256,512 etc.
> >>> The default size is for sure equal to the basepage SIZE. once some objects are
> >>> freed by free() and libc get a free "page", userspace heap libraries might free
> >>> the PAGESIZE page to kernel by things like MADV_DONTNEED, then zap_pte_range().
> >>> it is quite similar with kernel slab.
> >>>
> >>> so imagine we have small-THP now, but userspace libraries have *NO*
> >>> idea at all,  so it can frequently cause unfolding.
> >>>
> >>>>
> >>>> FWIW, My experience with the Speedometer/JavaScript use case is that performance
> >>>> is a little bit better when enabling 64+32+16K vs just 64K THP.
> >>>>
> >>>> Functionally, it will not matter if the allocator is not enlightened for the THP
> >>>> size; it can continue to free, and if a partial folio is unmapped it is put on
> >>>> the deferred split list, then under memory pressure it is split and the unused
> >>>> pages are reclaimed. I guess this is the bit you are concerned about having a

> >>>> performance impact?
> >>>
> >>> right. If this is happening on the majority of small-THP folios, we
> >>> don't have performance
> >>> improvement, and probably regression instead. This is really true on
> >>> real workloads!!
> >>>
> >>> So that is why we really love a per-VMA hint to enable small-THP but
> >>> obviously you
> >>> have already supported it now by
> >>> mm: thp: Introduce per-size thp sysfs interface
> >>> https://lore.kernel.org/linux-mm/20231122162950.3854897-4-ryan.roberts@arm.com/
> >>>
> >>> we can use MADVISE rather than ALWAYS and set fixed size like 64KB, so userspace
> >>> can set the VMA flag when it is quite sure this VMA is working with
> >>> the alignment
> >>> of 64KB?
> >>
> >> Yes, that all exists in the series today. We have also discussed the possibility
> >> of adding a new madvise_process() call that would take the set of THP sizes that
> >> should be considered. Then you can set different VMAs to use different sizes;
> >> the plan was to layer that on top if/when a workload was identified. Sounds like
> >> you might be able to help there?
> >
> > i'm not quite sure as on phones, we are using fixed-size CONTPTE. so we ask
> > for either 64KB or 4KB. If we think one VMA is all good to use CONTPTE, we
> > set a flag in this VMA and try to allocate 64KB.
>
> When you say "we set a flag" do you mean user space? Or is there some heuristic
> in the kernel?

We are using a field extended by the Android kernel in the vma struct to
mark that this vma is all good to use CONTPTE. With the upstream solution
you are providing, we can remove this dirty code[1]:

static inline bool vma_is_chp_anonymous(struct vm_area_struct *vma)
{
	return vma->android_kabi_reserved2 == THP_SWAP_PRIO_MAGIC;
}

[1] https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/blob/oneplus/sm8550_u_14.0.0_oneplus11/include/linux/mm.h#L4031

Thanks
Barry
Ryan Roberts Nov. 29, 2023, 12:21 p.m. UTC | #42
On 28/11/2023 21:06, Barry Song wrote:
> On Tue, Nov 28, 2023 at 11:49 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>
>> On 28/11/2023 09:49, Barry Song wrote:
>>> On Tue, Nov 28, 2023 at 10:14 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>
>>>> On 27/11/2023 20:34, Barry Song wrote:
>>>>> On Tue, Nov 28, 2023 at 12:07 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>>>
>>>>>> On 27/11/2023 10:28, Barry Song wrote:
>>>>>>> On Mon, Nov 27, 2023 at 11:11 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>>>>>
>>>>>>>> On 27/11/2023 09:59, Barry Song wrote:
>>>>>>>>> On Mon, Nov 27, 2023 at 10:35 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>>>>>>>
>>>>>>>>>> On 27/11/2023 08:42, Barry Song wrote:
>>>>>>>>>>>>> +           for (i = 0; i < nr; i++, page++) {
>>>>>>>>>>>>> +                   if (anon) {
>>>>>>>>>>>>> +                           /*
>>>>>>>>>>>>> +                            * If this page may have been pinned by the
>>>>>>>>>>>>> +                            * parent process, copy the page immediately for
>>>>>>>>>>>>> +                            * the child so that we'll always guarantee the
>>>>>>>>>>>>> +                            * pinned page won't be randomly replaced in the
>>>>>>>>>>>>> +                            * future.
>>>>>>>>>>>>> +                            */
>>>>>>>>>>>>> +                           if (unlikely(page_try_dup_anon_rmap(
>>>>>>>>>>>>> +                                           page, false, src_vma))) {
>>>>>>>>>>>>> +                                   if (i != 0)
>>>>>>>>>>>>> +                                           break;
>>>>>>>>>>>>> +                                   /* Page may be pinned, we have to copy. */
>>>>>>>>>>>>> +                                   return copy_present_page(
>>>>>>>>>>>>> +                                           dst_vma, src_vma, dst_pte,
>>>>>>>>>>>>> +                                           src_pte, addr, rss, prealloc,
>>>>>>>>>>>>> +                                           page);
>>>>>>>>>>>>> +                           }
>>>>>>>>>>>>> +                           rss[MM_ANONPAGES]++;
>>>>>>>>>>>>> +                           VM_BUG_ON(PageAnonExclusive(page));
>>>>>>>>>>>>> +                   } else {
>>>>>>>>>>>>> +                           page_dup_file_rmap(page, false);
>>>>>>>>>>>>> +                           rss[mm_counter_file(page)]++;
>>>>>>>>>>>>> +                   }
>>>>>>>>>>>>>             }
>>>>>>>>>>>>> -           rss[MM_ANONPAGES]++;
>>>>>>>>>>>>> -   } else if (page) {
>>>>>>>>>>>>> -           folio_get(folio);
>>>>>>>>>>>>> -           page_dup_file_rmap(page, false);
>>>>>>>>>>>>> -           rss[mm_counter_file(page)]++;
>>>>>>>>>>>>> +
>>>>>>>>>>>>> +           nr = i;
>>>>>>>>>>>>> +           folio_ref_add(folio, nr);
>>>>>>>>>>>>
>>>>>>>>>>>> You're changing the order of mapcount vs. refcount increment. Don't.
>>>>>>>>>>>> Make sure your refcount >= mapcount.
>>>>>>>>>>>>
>>>>>>>>>>>> You can do that easily by doing the folio_ref_add(folio, nr) first and
>>>>>>>>>>>> then decrementing in case of error accordingly. Errors due to pinned
>>>>>>>>>>>> pages are the corner case.
>>>>>>>>>>>>
>>>>>>>>>>>> I'll note that it will make a lot of sense to have batch variants of
>>>>>>>>>>>> page_try_dup_anon_rmap() and page_dup_file_rmap().
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> i still don't understand why it is not a entire map+1, but an increment
>>>>>>>>>>> in each basepage.
>>>>>>>>>>
>>>>>>>>>> Because we are PTE-mapping the folio, we have to account each individual page.
>>>>>>>>>> If we accounted the entire folio, where would we unaccount it? Each page can be
>>>>>>>>>> unmapped individually (e.g. munmap() part of the folio) so need to account each
>>>>>>>>>> page. When PMD mapping, the whole thing is either mapped or unmapped, and its
>>>>>>>>>> atomic, so we can account the entire thing.
>>>>>>>>>
>>>>>>>>> Hi Ryan,
>>>>>>>>>
>>>>>>>>> There is no problem. for example, a large folio is entirely mapped in
>>>>>>>>> process A with CONPTE,
>>>>>>>>> and only page2 is mapped in process B.
>>>>>>>>> then we will have
>>>>>>>>>
>>>>>>>>> entire_map = 0
>>>>>>>>> page0.map = -1
>>>>>>>>> page1.map = -1
>>>>>>>>> page2.map = 0
>>>>>>>>> page3.map = -1
>>>>>>>>> ....
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> as long as it is a CONTPTE large folio, there is no much difference with
>>>>>>>>>>> PMD-mapped large folio. it has all the chance to be DoubleMap and need
>>>>>>>>>>> split.
>>>>>>>>>>>
>>>>>>>>>>> When A and B share a CONTPTE large folio, we do madvise(DONTNEED) or any
>>>>>>>>>>> similar things on a part of the large folio in process A,
>>>>>>>>>>>
>>>>>>>>>>> this large folio will have partially mapped subpage in A (all CONTPE bits
>>>>>>>>>>> in all subpages need to be removed though we only unmap a part of the
>>>>>>>>>>> large folioas HW requires consistent CONTPTEs); and it has entire map in
>>>>>>>>>>> process B(all PTEs are still CONPTES in process B).
>>>>>>>>>>>
>>>>>>>>>>> isn't it more sensible for this large folios to have entire_map = 0(for
>>>>>>>>>>> process B), and subpages which are still mapped in process A has map_count
>>>>>>>>>>> =0? (start from -1).
>>>>>>>>>>>
>>>>>>>>>>>> Especially, the batch variant of page_try_dup_anon_rmap() would only
>>>>>>>>>>>> check once if the folio maybe pinned, and in that case, you can simply
>>>>>>>>>>>> drop all references again. So you either have all or no ptes to process,
>>>>>>>>>>>> which makes that code easier.
>>>>>>>>>>
>>>>>>>>>> I'm afraid this doesn't make sense to me. Perhaps I've misunderstood. But
>>>>>>>>>> fundamentally you can only use entire_mapcount if its only possible to map and
>>>>>>>>>> unmap the whole folio atomically.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> My point is that CONTPEs should either all-set in all 16 PTEs or all are dropped
>>>>>>>>> in 16 PTEs. if all PTEs have CONT, it is entirely mapped; otherwise,
>>>>>>>>> it is partially
>>>>>>>>> mapped. if a large folio is mapped in one processes with all CONTPTEs
>>>>>>>>> and meanwhile in another process with partial mapping(w/o CONTPTE), it is
>>>>>>>>> DoubleMapped.
>>>>>>>>
>>>>>>>> There are 2 problems with your proposal, as I see it;
>>>>>>>>
>>>>>>>> 1) the core-mm is not enlightened for CONTPTE mappings. As far as it is
>>>>>>>> concerned, its just mapping a bunch of PTEs. So it has no hook to inc/dec
>>>>>>>> entire_mapcount. The arch code is opportunistically and *transparently* managing
>>>>>>>> the CONT_PTE bit.
>>>>>>>>
>>>>>>>> 2) There is nothing to say a folio isn't *bigger* than the contpte block; it may
>>>>>>>> be 128K and be mapped with 2 contpte blocks. Or even a PTE-mapped THP (2M) and
>>>>>>>> be mapped with 32 contpte blocks. So you can't say it is entirely mapped
>>>>>>>> unless/until ALL of those blocks are set up. And then of course each block could
>>>>>>>> be unmapped unatomically.
>>>>>>>>
>>>>>>>> For the PMD case there are actually 2 properties that allow using the
>>>>>>>> entire_mapcount optimization; It's atomically mapped/unmapped through the PMD
>>>>>>>> and we know that the folio is exactly PMD sized (since it must be at least PMD
>>>>>>>> sized to be able to map it with the PMD, and we don't allocate THPs any bigger
>>>>>>>> than PMD size). So one PMD map or unmap operation corresponds to exactly one
>>>>>>>> *entire* map or unmap. That is not true when we are PTE mapping.
>>>>>>>
>>>>>>> well. Thanks for clarification. based on the above description, i agree the
>>>>>>> current code might make more sense by always using mapcount in subpage.
>>>>>>>
>>>>>>> I gave my proposals as  I thought we were always CONTPTE size for small-THP
>>>>>>> then we could drop the loop to iterate 16 times rmap. if we do it
>>>>>>> entirely, we only
>>>>>>> need to do dup rmap once for all 16 PTEs by increasing entire_map.
>>>>>>
>>>>>> Well its always good to have the discussion - so thanks for the ideas. I think
>>>>>> there is a bigger question lurking here; should we be exposing the concept of
>>>>>> contpte mappings to the core-mm rather than burying it in the arm64 arch code?
>>>>>> I'm confident that would be a huge amount of effort and the end result would be
>>>>>> similar performace to what this approach gives. One potential benefit of letting
>>>>>> core-mm control it is that it would also give control to core-mm over the
>>>>>> granularity of access/dirty reporting (my approach implicitly ties it to the
>>>>>> folio). Having sub-folio access tracking _could_ potentially help with future
>>>>>> work to make THP size selection automatic, but we are not there yet, and I think
>>>>>> there are other (simpler) ways to achieve the same thing. So my view is that
>>>>>> _not_ exposing it to core-mm is the right way for now.
>>>>>
>>>>> Hi Ryan,
>>>>>
>>>>> We(OPPO) started a similar project like you even before folio was imported to
>>>>> mainline, we have deployed the dynamic hugepage(that is how we name it)
>>>>> on millions of mobile phones on real products and kernels before 5.16,  making
>>>>> a huge success on performance improvement. for example, you may
>>>>> find the out-of-tree 5.15 source code here
>>>>
>>>> Oh wow, thanks for reaching out and explaining this - I have to admit I feel
>>>> embarrassed that I clearly didn't do enough research on the prior art because I
>>>> wasn't aware of your work. So sorry about that.
>>>>
>>>> I sensed that you had a different model for how this should work vs what I've
>>>> implemented and now I understand why :). I'll review your stuff and I'm sure
>>>> I'll have questions. I'm sure each solution has pros and cons.
>>>>
>>>>
>>>>>
>>>>> https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/tree/oneplus/sm8550_u_14.0.0_oneplus11
>>>>>
>>>>> Our modification might not be so clean and has lots of workarounds
>>>>> just for the stability of products
>>>>>
>>>>> We mainly have
>>>>>
>>>>> 1. https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/blob/oneplus/sm8550_u_14.0.0_oneplus11/mm/cont_pte_hugepage.c
>>>>>
>>>>> some CONTPTE helpers
>>>>>
>>>>> 2.https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/blob/oneplus/sm8550_u_14.0.0_oneplus11/include/linux/mm.h
>>>>>
>>>>> some Dynamic Hugepage APIs
>>>>>
>>>>> 3. https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/blob/oneplus/sm8550_u_14.0.0_oneplus11/mm/memory.c
>>>>>
>>>>> modified all page faults to support
>>>>>      (1). allocation of hugepage of 64KB in do_anon_page
>>>>
>>>> My Small-Sized THP patch set is handling the equivalent of this.
>>>
>>> right, the only difference is that we did a huge-zeropage for reading
>>> in do_anon_page.
>>> mapping all large folios to CONTPTE to zero page.
>>
>> FWIW, I took a slightly different approach in my original RFC for the zero page
>> - although I ripped it all out to simplify for the initial series. I found that
>> it was pretty rare for user space to read multiple consecutive pages without
>> ever interleving any writes, so I kept the zero page as a base page, but at CoW,
>> I would expand the allocation to an approprately sized THP. But for the couple
>> of workloads that I've gone deep with, I found that it made barely any dent on
>> the amount of memory that ended up contpte-mapped; the vast majority was from
>> write allocation in do_anonymous_page().
> 
> the problem is even if there is only one page read in 16 ptes, you
> will map the page to
> zero basepage. then while you write another page in these 16 ptes, you
> lose the chance
> to become large folio as pte_range_none() becomes false.
> 
> if we map these 16ptes to contpte zero page, in do_wp_page, we have a
> good chance
> to CoW and get a large anon folio.

Yes, understood. I think we are a bit off-topic for this patch set though.
Small-sized THP zero pages can be tackled as a separate series once these
initial series are in. I'd be happy to review a small-sized THP zero page
post :)

> 
>>
>>>
>>>>
>>>>>      (2). CoW hugepage in do_wp_page
>>>>
>>>> This isn't handled yet in my patch set; the original RFC implemented it but I
>>>> removed it in order to strip back to the essential complexity for the initial
>>>> submission. DavidH has been working on a precise shared vs exclusive map
>>>> tracking mechanism - if that goes in, it will make CoWing large folios simpler.
>>>> Out of interest, what workloads benefit most from this?
>>>
>>> as a phone, Android has a design almost all processes are forked from zygote.
>>> thus, CoW happens quite often to all apps.
>>
>> Sure. But in my analysis I concluded that most of the memory mapped in zygote is
>> file-backed and mostly RO so therefore doing THP CoW doesn't help much. Perhaps
>> there are cases where that conclusion is wrong.
> 
> CoW is much less than do_anon_page on my phone which is running dynamic
> hugepage for a couple of hours:
> 
> OP52D1L1:/ # cat /proc/cont_pte_hugepage/stat
> ...
> thp_cow 34669                           ---- CoW a large folio
> thp_do_anon_pages 1032362     -----  a large folio in do_anon_page
> ...
> 
> so it is around 34669/1032362 = 3.35%.

Well, it's actually 34669 / (34669 + 1032362) = 3.25%. But, yes, the point
is that very few large folios are lost due to CoW, so there is likely to
be little perf impact. Again, I'd happily review a series that enables
this!

> 
>>
>>>
>>>>
>>>>>      (3). copy CONPTEs in copy_pte_range
>>>>
>>>> As discussed this is done as part of the contpte patch set, but its not just a
>>>> simple copy; the arch code will notice and set the CONT_PTE bit as needed.
>>>
>>> right, i have read all your unfold and fold stuff today, now i understand your
>>> approach seems quite nice!
>>
>> Great - thanks!
>>
>>>
>>>
>>>>
>>>>>      (4). allocate and swap-in Hugepage as a whole in do_swap_page
>>>>
>>>> This is going to be a problem but I haven't even looked at this properly yet.
>>>> The advice so far has been to continue to swap-in small pages only, but improve
>>>> khugepaged to collapse to small-sized THP. I'll take a look at your code to
>>>> understand how you did this.
>>>
>>> this is also crucial to android phone as swap is always happening
>>> on an embedded device. if we don't support large folios in swapin,
>>> our large folios will never come back after it is swapped-out.
>>>
>>> and i hated the collapse solution from the first beginning as there is
>>> never a guarantee to succeed and its overhead is unacceptable to user UI,
>>> so we supported hugepage allocation in do_swap_page from the first beginning.
>>
>> Understood. I agree it would be nice to preserve large folios across swap. I
>> think this can be layered on top of the current work though.
> 
> This will be my first priority to use your large folio code on phones.
> We need a patchset
> on top of yours :-)
> 
> without it, we will likely fail. Typically, one phone can have a 4~8GB
> zRAM to compress
> a lot of anon pages, if the compression ratio is 1:4, that means
> uncompressed anon
> pages are much much more. Thus, while the background app is switched back
> to foreground, we need those swapped-out large folios back rather than getting
> small basepages replacement. swap-in basepage is definitely not going to
> work well on a phone, neither does THP collapse.

Yep, understood. From the other thread, it sounds like you are preparing a
series for large swap-in - looking forward to seeing it!

> 
>>
>>>
>>>>
>>>>>
>>>>> 4. https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/blob/oneplus/sm8550_u_14.0.0_oneplus11/mm/vmscan.c
>>>>> https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/blob/oneplus/sm8550_u_14.0.0_oneplus11/mm/rmap.c
>>>>>
>>>>> reclaim hugepage as a whole and LRU optimization for 64KB dynamic hugepage.
>>>>
>>>> I think this is all naturally handled by the folio code that exists in modern
>>>> kernels?
>>>
>>> We had a CONTPTE hugepage pool, if the pool is very limited, we let LRU
>>> reclaim large folios to the pool. as phones are running lots of apps
>>> and drivers, and the memory is very limited, after a couple of hours,
>>> it will become very hard to allocate large folios in the original buddy. thus,
>>> large folios totally disappeared after running the phone for some time
>>> if we didn't have the pool.
>>>
>>>>
>>>>>
>>>>> So we are 100% interested in your patchset and hope it can find a way
>>>>> to land on the
>>>>> mainline, thus decreasing all the cost we have to maintain out-of-tree
>>>>> code from a
>>>>> kernel to another kernel version which we have done on a couple of
>>>>> kernel versions
>>>>> before 5.16. Firmly, we are 100% supportive of large anon folios
>>>>> things you are leading.
>>>>
>>>> That's great to hear! Of course Reviewed-By's and Tested-By's will all help move
>>>> it closer :). If you had any ability to do any A/B performance testing, it would
>>>> be very interesting to see how this stacks up against your solution - if there
>>>> are gaps it would be good to know where and develop a plan to plug the gap.
>>>>
>>>
>>> sure.
>>>
>>>>>
>>>>> A big pain was we found lots of races especially on CONTPTE unfolding
>>>>> and especially a part
>>>>> of basepages ran away from the 16 CONPTEs group since userspace is
>>>>> always working
>>>>> on basepages, having no idea of small-THP.  We ran our code on millions of
>>>>> real phones, and now we have got them fixed (or maybe "can't reproduce"),
>>>>> no outstanding issue.
>>>>
>>>> I'm going to be brave and say that my solution shouldn't suffer from these
>>>> problems; but of course the proof is only in the testing. I did a lot of work
>>>> with our architecture group and micro architects to determine exactly what is
>>>> and isn't safe; We even tightened the Arm ARM spec very subtlely to allow the
>>>> optimization in patch 13 (see the commit log for details). Of course this has
>>>> all been checked with partners and we are confident that all existing
>>>> implementations conform to the modified wording.
>>>
>>> cool. I like your try_unfold/fold code. it seems your code is setting/dropping
>>> CONT automatically based on ALIGHMENT, Page number etc. Alternatively,
>>> our code is always stupidly checking some conditions before setting and dropping
>>> CONT everywhere.
>>>
>>>>
>>>>>
>>>>> Particularly for the rmap issue we are discussing, our out-of-tree is
>>>>> using the entire_map for
>>>>> CONTPTE in the way I sent to you. But I guess we can learn from you to decouple
>>>>> CONTPTE from mm-core.
>>>>>
>>>>> We are doing this in mm/memory.c
>>>>>
>>>>> copy_present_cont_pte(struct vm_area_struct *dst_vma, struct
>>>>> vm_area_struct *src_vma,
>>>>> pte_t *dst_pte, pte_t *src_pte, unsigned long addr, int *rss,
>>>>> struct page **prealloc)
>>>>> {
>>>>>       struct mm_struct *src_mm = src_vma->vm_mm;
>>>>>       unsigned long vm_flags = src_vma->vm_flags;
>>>>>       pte_t pte = *src_pte;
>>>>>       struct page *page;
>>>>>
>>>>>        page = vm_normal_page(src_vma, addr, pte);
>>>>>       ...
>>>>>
>>>>>      get_page(page);
>>>>>      page_dup_rmap(page, true);   // an entire dup_rmap as you can
>>>>> see.............
>>>>>      rss[mm_counter(page)] += HPAGE_CONT_PTE_NR;
>>>>> }
>>>>>
>>>>> and we have a split in mm/cont_pte_hugepage.c to handle partially unmap,
>>>>>
>>>>> static void __split_huge_cont_pte_locked(struct vm_area_struct *vma, pte_t *pte,
>>>>> unsigned long haddr, bool freeze)
>>>>> {
>>>>> ...
>>>>>            if (compound_mapcount(head) > 1 && !TestSetPageDoubleMap(head)) {
>>>>>                   for (i = 0; i < HPAGE_CONT_PTE_NR; i++)
>>>>>                            atomic_inc(&head[i]._mapcount);
>>>>>                  atomic_long_inc(&cont_pte_double_map_count);
>>>>>            }
>>>>>
>>>>>
>>>>>             if (atomic_add_negative(-1, compound_mapcount_ptr(head))) {
>>>>>               ...
>>>>> }
>>>>>
>>>>> I am not selling our solution any more, but just showing you some differences we
>>>>> have :-)
>>>>
>>>> OK, I understand what you were saying now. I'm currently struggling to see how
>>>> this could fit into my model. Do you have any workloads and numbers on perf
>>>> improvement of using entire_mapcount?
>>>
>>> TBH, I don't have any data on this as from the first beginning, we were using
>>> entire_map. So I have no comparison at all.
>>>
>>>>
>>>>>
>>>>>>
>>>>>>>
>>>>>>> BTW, I have concerns that a variable small-THP size will really work
>>>>>>> as userspace
>>>>>>> is probably friendly to only one fixed size. for example, userspace
>>>>>>> heap management
>>>>>>> might be optimized to a size for freeing memory to the kernel. it is
>>>>>>> very difficult
>>>>>>> for the heap to adapt to various sizes at the same time. frequent unmap/free
>>>>>>> size not equal with, and particularly smaller than small-THP size will
>>>>>>> defeat all
>>>>>>> efforts to use small-THP.
>>>>>>
>>>>>> I'll admit to not knowing a huge amount about user space allocators. But I will
>>>>>> say that as currently defined, the small-sized THP interface to user space
>>>>>> allows a sysadmin to specifically enable the set of sizes that they want; so a
>>>>>> single size can be enabled. I'm diliberately punting that decision away from the
>>>>>> kernel for now.
>>>>>
>>>>> Basically, userspace heap library has a PAGESIZE setting and allows users
>>>>> to allocate/free all kinds of small objects such as 16,32,64,128,256,512 etc.
>>>>> The default size is for sure equal to the basepage SIZE. once some objects are
>>>>> freed by free() and libc get a free "page", userspace heap libraries might free
>>>>> the PAGESIZE page to kernel by things like MADV_DONTNEED, then zap_pte_range().
>>>>> it is quite similar with kernel slab.
>>>>>
>>>>> so imagine we have small-THP now, but userspace libraries have *NO*
>>>>> idea at all,  so it can frequently cause unfolding.
>>>>>
>>>>>>
>>>>>> FWIW, My experience with the Speedometer/JavaScript use case is that performance
>>>>>> is a little bit better when enabling 64+32+16K vs just 64K THP.
>>>>>>
>>>>>> Functionally, it will not matter if the allocator is not enlightened for the THP
>>>>>> size; it can continue to free, and if a partial folio is unmapped it is put on
>>>>>> the deferred split list, then under memory pressure it is split and the unused
>>>>>> pages are reclaimed. I guess this is the bit you are concerned about having a
> 
>>>>>> performance impact?
>>>>>
>>>>> right. If this is happening on the majority of small-THP folios, we
>>>>> don't have performance
>>>>> improvement, and probably regression instead. This is really true on
>>>>> real workloads!!
>>>>>
>>>>> So that is why we really love a per-VMA hint to enable small-THP but
>>>>> obviously you
>>>>> have already supported it now by
>>>>> mm: thp: Introduce per-size thp sysfs interface
>>>>> https://lore.kernel.org/linux-mm/20231122162950.3854897-4-ryan.roberts@arm.com/
>>>>>
>>>>> we can use MADVISE rather than ALWAYS and set fixed size like 64KB, so userspace
>>>>> can set the VMA flag when it is quite sure this VMA is working with
>>>>> the alignment
>>>>> of 64KB?
>>>>
>>>> Yes, that all exists in the series today. We have also discussed the possibility
>>>> of adding a new madvise_process() call that would take the set of THP sizes that
>>>> should be considered. Then you can set different VMAs to use different sizes;
>>>> the plan was to layer that on top if/when a workload was identified. Sounds like
>>>> you might be able to help there?
>>>
>>> i'm not quite sure as on phones, we are using fixed-size CONTPTE. so we ask
>>> for either 64KB or 4KB. If we think one VMA is all good to use CONTPTE, we
>>> set a flag in this VMA and try to allocate 64KB.
>>
>> When you say "we set a flag" do you mean user space? Or is there some heuristic
>> in the kernel?
> 
> we are using a field extended by the android kernel in vma struct to
> mark this vma
> is all good to use CONTPTE. With the upstream solution you are providing, we can
> remove this dirty code[1].
> static inline bool vma_is_chp_anonymous(struct vm_area_struct *vma)
> {
>             return vma->android_kabi_reserved2 == THP_SWAP_PRIO_MAGIC;
> }

Sorry, I'm not sure I've understood; how does that flag get set in the
first place? Does user space tell the kernel (via e.g. madvise()), or does
the kernel set it based on its own heuristics?

> 
> [1] https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/blob/oneplus/sm8550_u_14.0.0_oneplus11/include/linux/mm.h#L4031
> 
> Thanks
> Barry
Ryan Roberts Nov. 29, 2023, 12:29 p.m. UTC | #43
On 28/11/2023 19:00, Barry Song wrote:
> On Wed, Nov 29, 2023 at 12:00 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>
>> On 28/11/2023 00:11, Barry Song wrote:
>>> On Mon, Nov 27, 2023 at 10:24 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>
>>>> On 27/11/2023 05:54, Barry Song wrote:
>>>>>> +copy_present_ptes(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
>>>>>> +              pte_t *dst_pte, pte_t *src_pte,
>>>>>> +              unsigned long addr, unsigned long end,
>>>>>> +              int *rss, struct folio **prealloc)
>>>>>>  {
>>>>>>      struct mm_struct *src_mm = src_vma->vm_mm;
>>>>>>      unsigned long vm_flags = src_vma->vm_flags;
>>>>>>      pte_t pte = ptep_get(src_pte);
>>>>>>      struct page *page;
>>>>>>      struct folio *folio;
>>>>>> +    int nr = 1;
>>>>>> +    bool anon;
>>>>>> +    bool any_dirty = pte_dirty(pte);
>>>>>> +    int i;
>>>>>>
>>>>>>      page = vm_normal_page(src_vma, addr, pte);
>>>>>> -    if (page)
>>>>>> +    if (page) {
>>>>>>              folio = page_folio(page);
>>>>>> -    if (page && folio_test_anon(folio)) {
>>>>>> -            /*
>>>>>> -             * If this page may have been pinned by the parent process,
>>>>>> -             * copy the page immediately for the child so that we'll always
>>>>>> -             * guarantee the pinned page won't be randomly replaced in the
>>>>>> -             * future.
>>>>>> -             */
>>>>>> -            folio_get(folio);
>>>>>> -            if (unlikely(page_try_dup_anon_rmap(page, false, src_vma))) {
>>>>>> -                    /* Page may be pinned, we have to copy. */
>>>>>> -                    folio_put(folio);
>>>>>> -                    return copy_present_page(dst_vma, src_vma, dst_pte, src_pte,
>>>>>> -                                             addr, rss, prealloc, page);
>>>>>> +            anon = folio_test_anon(folio);
>>>>>> +            nr = folio_nr_pages_cont_mapped(folio, page, src_pte, addr,
>>>>>> +                                            end, pte, &any_dirty);
>>>>>
>>>>> in case we have a large folio with 16 CONTPTE basepages, and userspace
>>>>> do madvise(addr + 4KB * 5, DONTNEED);
>>>>
>>>> nit: if you are offsetting by 5 pages from addr, then below I think you mean
>>>> page0~page4 and page6~15?
>>>>
>>>>>
>>>>> thus, the 4th basepage of PTE becomes PTE_NONE and folio_nr_pages_cont_mapped()
>>>>> will return 15. in this case, we should copy page0~page3 and page5~page15.
>>>>
>>>> No I don't think folio_nr_pages_cont_mapped() will return 15; that's certainly
>>>> not how its intended to work. The function is scanning forwards from the current
>>>> pte until it finds the first pte that does not fit in the batch - either because
>>>> it maps a PFN that is not contiguous, or because the permissions are different
>>>> (although this is being relaxed a bit; see conversation with DavidH against this
>>>> same patch).
>>>>
>>>> So the first time through this loop, folio_nr_pages_cont_mapped() will return 5,
>>>> (page0~page4) then the next time through the loop we will go through the
>>>> !present path and process the single swap marker. Then the 3rd time through the
>>>> loop folio_nr_pages_cont_mapped() will return 10.
>>>
>>> one case we have met by running hundreds of real phones is as below,
>>>
>>>
>>> static int
>>> copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
>>>                pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long addr,
>>>                unsigned long end)
>>> {
>>>         ...
>>>         dst_pte = pte_alloc_map_lock(dst_mm, dst_pmd, addr, &dst_ptl);
>>>         if (!dst_pte) {
>>>                 ret = -ENOMEM;
>>>                 goto out;
>>>         }
>>>         src_pte = pte_offset_map_nolock(src_mm, src_pmd, addr, &src_ptl);
>>>         if (!src_pte) {
>>>                 pte_unmap_unlock(dst_pte, dst_ptl);
>>>                 /* ret == 0 */
>>>                 goto out;
>>>         }
>>>         spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING);
>>>         orig_src_pte = src_pte;
>>>         orig_dst_pte = dst_pte;
>>>         arch_enter_lazy_mmu_mode();
>>>
>>>         do {
>>>                 /*
>>>                  * We are holding two locks at this point - either of them
>>>                  * could generate latencies in another task on another CPU.
>>>                  */
>>>                 if (progress >= 32) {
>>>                         progress = 0;
>>>                         if (need_resched() ||
>>>                             spin_needbreak(src_ptl) || spin_needbreak(dst_ptl))
>>>                                 break;
>>>                 }
>>>                 ptent = ptep_get(src_pte);
>>>                 if (pte_none(ptent)) {
>>>                         progress++;
>>>                         continue;
>>>                 }
>>>
>>> the above iteration can break when progress > =32. for example, at the
>>> beginning,
>>> if all PTEs are none, we break when progress >=32, and we break when we
>>> are in the 8th pte of 16PTEs which might become CONTPTE after we release
>>> PTL.
>>>
>>> since we are releasing PTLs, next time when we get PTL, those pte_none() might
>>> become pte_cont(), then are you going to copy CONTPTE from 8th pte,
>>> thus, immediately
>>> break the consistent CONPTEs rule of hardware?
>>>
>>> pte0 - pte_none
>>> pte1 - pte_none
>>> ...
>>> pte7 - pte_none
>>>
>>> pte8 - pte_cont
>>> ...
>>> pte15 - pte_cont
>>>
>>> so we did some modification to avoid a break in the middle of PTEs
>>> which can potentially
>>> become CONTPE.
>>> do {
>>>                 /*
>>>                 * We are holding two locks at this point - either of them
>>>                 * could generate latencies in another task on another CPU.
>>>                 */
>>>                 if (progress >= 32) {
>>>                                 progress = 0;
>>> #ifdef CONFIG_CONT_PTE_HUGEPAGE
>>>                 /*
>>>                 * XXX: don't release ptl at an unligned address as
>>> cont_pte might form while
>>>                 * ptl is released, this causes double-map
>>>                 */
>>>                 if (!vma_is_chp_anonymous(src_vma) ||
>>>                    (vma_is_chp_anonymous(src_vma) && IS_ALIGNED(addr,
>>> HPAGE_CONT_PTE_SIZE)))
>>> #endif
>>>                 if (need_resched() ||
>>>                    spin_needbreak(src_ptl) || spin_needbreak(dst_ptl))
>>>                                 break;
>>> }
>>>
>>> We could only reproduce the above issue by running thousands of phones.
>>>
>>> Does your code survive from this problem?
>>
>> Yes I'm confident my code is safe against this; as I said before, the CONT_PTE
>> bit is not blindly "copied" from parent to child pte. As far as the core-mm is
>> concerned, there is no CONT_PTE bit; they are just regular PTEs. So the code
>> will see some pte_none() entries followed by some pte_present() entries. And
>> when calling set_ptes() on the child, the arch code will evaluate the current
>> state of the pgtable along with the new set_ptes() request and determine where
>> it should insert the CONT_PTE bit.
> 
> yep, i have read very carefully and think your code is safe here. The
> only problem
> is that the code can randomly unfold parent processes' CONPTE while setting
> wrprotect in the middle of a large folio while it actually should keep CONT
> bit as all PTEs can be still consistent if we set protect from the 1st PTE.
> 
> while A forks B,  progress >= 32 might interrupt in the middle of a
> new CONTPTE folio which is forming, as we have to set wrprotect to parent A,
> this parent immediately loses CONT bit. this is  sad. but i can't find a
> good way to resolve it unless CONT is exposed to mm-core. any idea on
> this?

No, this is not the case; copy_present_ptes() will copy as many ptes as are
physically contiguous and belong to the same folio (which usually means "the
whole folio" - the only time it doesn't is when we hit the end of the vma). We
will then return to the main loop and move forwards by the number of ptes that
were serviced, including:

progress += 8 * ret;

That might go above 32, so we will flash the lock. But we haven't done that in
the middle of a large folio. So the contpte-ness should be preserved.
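
To make that concrete, a rough sketch of the loop step being described
(heavily simplified and paraphrased, not the verbatim patch; declarations,
copy_nonpresent_pte() and the preallocation handling are omitted, and the
exact loop-advance expression is an approximation):

do {
	/*
	 * We are holding two locks at this point - either of them
	 * could generate latencies in another task on another CPU.
	 */
	if (progress >= 32) {
		progress = 0;
		if (need_resched() ||
		    spin_needbreak(src_ptl) || spin_needbreak(dst_ptl))
			break;
	}
	ptent = ptep_get(src_pte);
	if (pte_none(ptent)) {
		progress++;
		ret = 1;
		continue;
	}
	/*
	 * A positive return value is the number of ptes handled as one
	 * batch (typically the whole folio), so everything advances by
	 * the full batch and the ptl is never dropped mid-folio.
	 */
	ret = copy_present_ptes(dst_vma, src_vma, dst_pte, src_pte,
				addr, end, rss, &prealloc);
	if (ret < 0)
		break;
	progress += 8 * ret;
} while (dst_pte += ret, src_pte += ret,
	 addr += PAGE_SIZE * ret, addr != end);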

> 
> Our code[1] resolves this by only breaking at the aligned address
> 
> if (progress >= 32) {
>      progress = 0;
>      #ifdef CONFIG_CONT_PTE_HUGEPAGE
>      /*
>       * XXX: don't release ptl at an unligned address as cont_pte
> might form while
>       * ptl is released, this causes double-map
>      */
>     if (!vma_is_chp_anonymous(src_vma) ||
>         (vma_is_chp_anonymous(src_vma) && IS_ALIGNED(addr,
> HPAGE_CONT_PTE_SIZE)))
>     #endif
>         if (need_resched() ||
>            spin_needbreak(src_ptl) || spin_needbreak(dst_ptl))
>              break;
> }
> 
> [1] https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/blob/oneplus/sm8550_u_14.0.0_oneplus11/mm/memory.c#L1180
> 
> 
> Thanks
> Barry
Barry Song Nov. 29, 2023, 1:09 p.m. UTC | #44
On Thu, Nov 30, 2023 at 1:29 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> On 28/11/2023 19:00, Barry Song wrote:
> > On Wed, Nov 29, 2023 at 12:00 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
> >>
> >> On 28/11/2023 00:11, Barry Song wrote:
> >>> On Mon, Nov 27, 2023 at 10:24 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
> >>>>
> >>>> On 27/11/2023 05:54, Barry Song wrote:
> >>>>>> +copy_present_ptes(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
> >>>>>> +              pte_t *dst_pte, pte_t *src_pte,
> >>>>>> +              unsigned long addr, unsigned long end,
> >>>>>> +              int *rss, struct folio **prealloc)
> >>>>>>  {
> >>>>>>      struct mm_struct *src_mm = src_vma->vm_mm;
> >>>>>>      unsigned long vm_flags = src_vma->vm_flags;
> >>>>>>      pte_t pte = ptep_get(src_pte);
> >>>>>>      struct page *page;
> >>>>>>      struct folio *folio;
> >>>>>> +    int nr = 1;
> >>>>>> +    bool anon;
> >>>>>> +    bool any_dirty = pte_dirty(pte);
> >>>>>> +    int i;
> >>>>>>
> >>>>>>      page = vm_normal_page(src_vma, addr, pte);
> >>>>>> -    if (page)
> >>>>>> +    if (page) {
> >>>>>>              folio = page_folio(page);
> >>>>>> -    if (page && folio_test_anon(folio)) {
> >>>>>> -            /*
> >>>>>> -             * If this page may have been pinned by the parent process,
> >>>>>> -             * copy the page immediately for the child so that we'll always
> >>>>>> -             * guarantee the pinned page won't be randomly replaced in the
> >>>>>> -             * future.
> >>>>>> -             */
> >>>>>> -            folio_get(folio);
> >>>>>> -            if (unlikely(page_try_dup_anon_rmap(page, false, src_vma))) {
> >>>>>> -                    /* Page may be pinned, we have to copy. */
> >>>>>> -                    folio_put(folio);
> >>>>>> -                    return copy_present_page(dst_vma, src_vma, dst_pte, src_pte,
> >>>>>> -                                             addr, rss, prealloc, page);
> >>>>>> +            anon = folio_test_anon(folio);
> >>>>>> +            nr = folio_nr_pages_cont_mapped(folio, page, src_pte, addr,
> >>>>>> +                                            end, pte, &any_dirty);
> >>>>>
> >>>>> in case we have a large folio with 16 CONTPTE basepages, and userspace
> >>>>> do madvise(addr + 4KB * 5, DONTNEED);
> >>>>
> >>>> nit: if you are offsetting by 5 pages from addr, then below I think you mean
> >>>> page0~page4 and page6~15?
> >>>>
> >>>>>
> >>>>> thus, the 4th basepage of PTE becomes PTE_NONE and folio_nr_pages_cont_mapped()
> >>>>> will return 15. in this case, we should copy page0~page3 and page5~page15.
> >>>>
> >>>> No I don't think folio_nr_pages_cont_mapped() will return 15; that's certainly
> >>>> not how its intended to work. The function is scanning forwards from the current
> >>>> pte until it finds the first pte that does not fit in the batch - either because
> >>>> it maps a PFN that is not contiguous, or because the permissions are different
> >>>> (although this is being relaxed a bit; see conversation with DavidH against this
> >>>> same patch).
> >>>>
> >>>> So the first time through this loop, folio_nr_pages_cont_mapped() will return 5,
> >>>> (page0~page4) then the next time through the loop we will go through the
> >>>> !present path and process the single swap marker. Then the 3rd time through the
> >>>> loop folio_nr_pages_cont_mapped() will return 10.
> >>>
> >>> one case we have met by running hundreds of real phones is as below,
> >>>
> >>>
> >>> static int
> >>> copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
> >>>                pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long addr,
> >>>                unsigned long end)
> >>> {
> >>>         ...
> >>>         dst_pte = pte_alloc_map_lock(dst_mm, dst_pmd, addr, &dst_ptl);
> >>>         if (!dst_pte) {
> >>>                 ret = -ENOMEM;
> >>>                 goto out;
> >>>         }
> >>>         src_pte = pte_offset_map_nolock(src_mm, src_pmd, addr, &src_ptl);
> >>>         if (!src_pte) {
> >>>                 pte_unmap_unlock(dst_pte, dst_ptl);
> >>>                 /* ret == 0 */
> >>>                 goto out;
> >>>         }
> >>>         spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING);
> >>>         orig_src_pte = src_pte;
> >>>         orig_dst_pte = dst_pte;
> >>>         arch_enter_lazy_mmu_mode();
> >>>
> >>>         do {
> >>>                 /*
> >>>                  * We are holding two locks at this point - either of them
> >>>                  * could generate latencies in another task on another CPU.
> >>>                  */
> >>>                 if (progress >= 32) {
> >>>                         progress = 0;
> >>>                         if (need_resched() ||
> >>>                             spin_needbreak(src_ptl) || spin_needbreak(dst_ptl))
> >>>                                 break;
> >>>                 }
> >>>                 ptent = ptep_get(src_pte);
> >>>                 if (pte_none(ptent)) {
> >>>                         progress++;
> >>>                         continue;
> >>>                 }
> >>>
> >>> the above iteration can break when progress > =32. for example, at the
> >>> beginning,
> >>> if all PTEs are none, we break when progress >=32, and we break when we
> >>> are in the 8th pte of 16PTEs which might become CONTPTE after we release
> >>> PTL.
> >>>
> >>> since we are releasing PTLs, next time when we get PTL, those pte_none() might
> >>> become pte_cont(), then are you going to copy CONTPTE from 8th pte,
> >>> thus, immediately
> >>> break the consistent CONPTEs rule of hardware?
> >>>
> >>> pte0 - pte_none
> >>> pte1 - pte_none
> >>> ...
> >>> pte7 - pte_none
> >>>
> >>> pte8 - pte_cont
> >>> ...
> >>> pte15 - pte_cont
> >>>
> >>> so we did some modification to avoid a break in the middle of PTEs
> >>> which can potentially
> >>> become CONTPE.
> >>> do {
> >>>                 /*
> >>>                 * We are holding two locks at this point - either of them
> >>>                 * could generate latencies in another task on another CPU.
> >>>                 */
> >>>                 if (progress >= 32) {
> >>>                                 progress = 0;
> >>> #ifdef CONFIG_CONT_PTE_HUGEPAGE
> >>>                 /*
> >>>                 * XXX: don't release ptl at an unligned address as
> >>> cont_pte might form while
> >>>                 * ptl is released, this causes double-map
> >>>                 */
> >>>                 if (!vma_is_chp_anonymous(src_vma) ||
> >>>                    (vma_is_chp_anonymous(src_vma) && IS_ALIGNED(addr,
> >>> HPAGE_CONT_PTE_SIZE)))
> >>> #endif
> >>>                 if (need_resched() ||
> >>>                    spin_needbreak(src_ptl) || spin_needbreak(dst_ptl))
> >>>                                 break;
> >>> }
> >>>
> >>> We could only reproduce the above issue by running thousands of phones.
> >>>
> >>> Does your code survive from this problem?
> >>
> >> Yes I'm confident my code is safe against this; as I said before, the CONT_PTE
> >> bit is not blindly "copied" from parent to child pte. As far as the core-mm is
> >> concerned, there is no CONT_PTE bit; they are just regular PTEs. So the code
> >> will see some pte_none() entries followed by some pte_present() entries. And
> >> when calling set_ptes() on the child, the arch code will evaluate the current
> >> state of the pgtable along with the new set_ptes() request and determine where
> >> it should insert the CONT_PTE bit.
> >
> > yep, i have read very carefully and think your code is safe here. The
> > only problem
> > is that the code can randomly unfold parent processes' CONPTE while setting
> > wrprotect in the middle of a large folio while it actually should keep CONT
> > bit as all PTEs can be still consistent if we set protect from the 1st PTE.
> >
> > while A forks B,  progress >= 32 might interrupt in the middle of a
> > new CONTPTE folio which is forming, as we have to set wrprotect to parent A,
> > this parent immediately loses CONT bit. this is  sad. but i can't find a
> > good way to resolve it unless CONT is exposed to mm-core. any idea on
> > this?
>
> No this is not the case; copy_present_ptes() will copy as many ptes as are
> physcially contiguous and belong to the same folio (which usually means "the
> whole folio" - the only time it doesn't is when we hit the end of the vma). We
> will then return to the main loop and move forwards by the number of ptes that
> were serviced, including:

I have probably failed to describe my question. I'd like to give a
concrete example:

1. process A forks B
2. at the beginning, address ~ address + 64KB has pte_none PTEs
3. we scan up to the 5th pte (address + 5 * 4KB), progress becomes 32, so we
   break and release the PTLs
4. another page fault in process A gets the PTL and sets
   address ~ address + 64KB to pte_cont
5. we get the PTL again and arrive at the 5th pte
6. we set wrprotect on ptes 5, 6, 7 ... 15; in this case, we have to unfold
   parent A, and child B also gets unfolded PTEs unless our loop can go back
   to the 0th pte.

Technically, A should be able to keep CONTPTE, but because of the implementation
of the code, it can't. That is the sadness, but it is obviously not your fault.

No worries. This does not happen very often, but I just want to make a note
here; maybe someday we can get back to address it.

>
> progress += 8 * ret;
>
> That might go above 32, so we will flash the lock. But we haven't done that in
> the middle of a large folio. So the contpte-ness should be preserved.
>
> >
> > Our code[1] resolves this by only breaking at the aligned address
> >
> > if (progress >= 32) {
> >      progress = 0;
> >      #ifdef CONFIG_CONT_PTE_HUGEPAGE
> >      /*
> >       * XXX: don't release ptl at an unligned address as cont_pte
> > might form while
> >       * ptl is released, this causes double-map
> >      */
> >     if (!vma_is_chp_anonymous(src_vma) ||
> >         (vma_is_chp_anonymous(src_vma) && IS_ALIGNED(addr,
> > HPAGE_CONT_PTE_SIZE)))
> >     #endif
> >         if (need_resched() ||
> >            spin_needbreak(src_ptl) || spin_needbreak(dst_ptl))
> >              break;
> > }
> >
> > [1] https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/blob/oneplus/sm8550_u_14.0.0_oneplus11/mm/memory.c#L1180
> >
> >
Thanks
Barry
Ryan Roberts Nov. 29, 2023, 2:07 p.m. UTC | #45
On 29/11/2023 13:09, Barry Song wrote:
> On Thu, Nov 30, 2023 at 1:29 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>
>> On 28/11/2023 19:00, Barry Song wrote:
>>> On Wed, Nov 29, 2023 at 12:00 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>
>>>> On 28/11/2023 00:11, Barry Song wrote:
>>>>> On Mon, Nov 27, 2023 at 10:24 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>>>
>>>>>> On 27/11/2023 05:54, Barry Song wrote:
>>>>>>>> +copy_present_ptes(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
>>>>>>>> +              pte_t *dst_pte, pte_t *src_pte,
>>>>>>>> +              unsigned long addr, unsigned long end,
>>>>>>>> +              int *rss, struct folio **prealloc)
>>>>>>>>  {
>>>>>>>>      struct mm_struct *src_mm = src_vma->vm_mm;
>>>>>>>>      unsigned long vm_flags = src_vma->vm_flags;
>>>>>>>>      pte_t pte = ptep_get(src_pte);
>>>>>>>>      struct page *page;
>>>>>>>>      struct folio *folio;
>>>>>>>> +    int nr = 1;
>>>>>>>> +    bool anon;
>>>>>>>> +    bool any_dirty = pte_dirty(pte);
>>>>>>>> +    int i;
>>>>>>>>
>>>>>>>>      page = vm_normal_page(src_vma, addr, pte);
>>>>>>>> -    if (page)
>>>>>>>> +    if (page) {
>>>>>>>>              folio = page_folio(page);
>>>>>>>> -    if (page && folio_test_anon(folio)) {
>>>>>>>> -            /*
>>>>>>>> -             * If this page may have been pinned by the parent process,
>>>>>>>> -             * copy the page immediately for the child so that we'll always
>>>>>>>> -             * guarantee the pinned page won't be randomly replaced in the
>>>>>>>> -             * future.
>>>>>>>> -             */
>>>>>>>> -            folio_get(folio);
>>>>>>>> -            if (unlikely(page_try_dup_anon_rmap(page, false, src_vma))) {
>>>>>>>> -                    /* Page may be pinned, we have to copy. */
>>>>>>>> -                    folio_put(folio);
>>>>>>>> -                    return copy_present_page(dst_vma, src_vma, dst_pte, src_pte,
>>>>>>>> -                                             addr, rss, prealloc, page);
>>>>>>>> +            anon = folio_test_anon(folio);
>>>>>>>> +            nr = folio_nr_pages_cont_mapped(folio, page, src_pte, addr,
>>>>>>>> +                                            end, pte, &any_dirty);
>>>>>>>
>>>>>>> in case we have a large folio with 16 CONTPTE basepages, and userspace
>>>>>>> do madvise(addr + 4KB * 5, DONTNEED);
>>>>>>
>>>>>> nit: if you are offsetting by 5 pages from addr, then below I think you mean
>>>>>> page0~page4 and page6~15?
>>>>>>
>>>>>>>
>>>>>>> thus, the 4th basepage of PTE becomes PTE_NONE and folio_nr_pages_cont_mapped()
>>>>>>> will return 15. in this case, we should copy page0~page3 and page5~page15.
>>>>>>
>>>>>> No I don't think folio_nr_pages_cont_mapped() will return 15; that's certainly
>>>>>> not how its intended to work. The function is scanning forwards from the current
>>>>>> pte until it finds the first pte that does not fit in the batch - either because
>>>>>> it maps a PFN that is not contiguous, or because the permissions are different
>>>>>> (although this is being relaxed a bit; see conversation with DavidH against this
>>>>>> same patch).
>>>>>>
>>>>>> So the first time through this loop, folio_nr_pages_cont_mapped() will return 5,
>>>>>> (page0~page4) then the next time through the loop we will go through the
>>>>>> !present path and process the single swap marker. Then the 3rd time through the
>>>>>> loop folio_nr_pages_cont_mapped() will return 10.
>>>>>
>>>>> one case we have met by running hundreds of real phones is as below,
>>>>>
>>>>>
>>>>> static int
>>>>> copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
>>>>>                pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long addr,
>>>>>                unsigned long end)
>>>>> {
>>>>>         ...
>>>>>         dst_pte = pte_alloc_map_lock(dst_mm, dst_pmd, addr, &dst_ptl);
>>>>>         if (!dst_pte) {
>>>>>                 ret = -ENOMEM;
>>>>>                 goto out;
>>>>>         }
>>>>>         src_pte = pte_offset_map_nolock(src_mm, src_pmd, addr, &src_ptl);
>>>>>         if (!src_pte) {
>>>>>                 pte_unmap_unlock(dst_pte, dst_ptl);
>>>>>                 /* ret == 0 */
>>>>>                 goto out;
>>>>>         }
>>>>>         spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING);
>>>>>         orig_src_pte = src_pte;
>>>>>         orig_dst_pte = dst_pte;
>>>>>         arch_enter_lazy_mmu_mode();
>>>>>
>>>>>         do {
>>>>>                 /*
>>>>>                  * We are holding two locks at this point - either of them
>>>>>                  * could generate latencies in another task on another CPU.
>>>>>                  */
>>>>>                 if (progress >= 32) {
>>>>>                         progress = 0;
>>>>>                         if (need_resched() ||
>>>>>                             spin_needbreak(src_ptl) || spin_needbreak(dst_ptl))
>>>>>                                 break;
>>>>>                 }
>>>>>                 ptent = ptep_get(src_pte);
>>>>>                 if (pte_none(ptent)) {
>>>>>                         progress++;
>>>>>                         continue;
>>>>>                 }
>>>>>
>>>>> the above iteration can break when progress > =32. for example, at the
>>>>> beginning,
>>>>> if all PTEs are none, we break when progress >=32, and we break when we
>>>>> are in the 8th pte of 16PTEs which might become CONTPTE after we release
>>>>> PTL.
>>>>>
>>>>> since we are releasing PTLs, next time when we get PTL, those pte_none() might
>>>>> become pte_cont(), then are you going to copy CONTPTE from 8th pte,
>>>>> thus, immediately
>>>>> break the consistent CONPTEs rule of hardware?
>>>>>
>>>>> pte0 - pte_none
>>>>> pte1 - pte_none
>>>>> ...
>>>>> pte7 - pte_none
>>>>>
>>>>> pte8 - pte_cont
>>>>> ...
>>>>> pte15 - pte_cont
>>>>>
>>>>> so we did some modification to avoid a break in the middle of PTEs
>>>>> which can potentially
>>>>> become CONTPE.
>>>>> do {
>>>>>                 /*
>>>>>                 * We are holding two locks at this point - either of them
>>>>>                 * could generate latencies in another task on another CPU.
>>>>>                 */
>>>>>                 if (progress >= 32) {
>>>>>                                 progress = 0;
>>>>> #ifdef CONFIG_CONT_PTE_HUGEPAGE
>>>>>                 /*
>>>>>                 * XXX: don't release ptl at an unligned address as
>>>>> cont_pte might form while
>>>>>                 * ptl is released, this causes double-map
>>>>>                 */
>>>>>                 if (!vma_is_chp_anonymous(src_vma) ||
>>>>>                    (vma_is_chp_anonymous(src_vma) && IS_ALIGNED(addr,
>>>>> HPAGE_CONT_PTE_SIZE)))
>>>>> #endif
>>>>>                 if (need_resched() ||
>>>>>                    spin_needbreak(src_ptl) || spin_needbreak(dst_ptl))
>>>>>                                 break;
>>>>> }
>>>>>
>>>>> We could only reproduce the above issue by running thousands of phones.
>>>>>
>>>>> Does your code survive from this problem?
>>>>
>>>> Yes I'm confident my code is safe against this; as I said before, the CONT_PTE
>>>> bit is not blindly "copied" from parent to child pte. As far as the core-mm is
>>>> concerned, there is no CONT_PTE bit; they are just regular PTEs. So the code
>>>> will see some pte_none() entries followed by some pte_present() entries. And
>>>> when calling set_ptes() on the child, the arch code will evaluate the current
>>>> state of the pgtable along with the new set_ptes() request and determine where
>>>> it should insert the CONT_PTE bit.
>>>
>>> yep, i have read very carefully and think your code is safe here. The
>>> only problem
>>> is that the code can randomly unfold parent processes' CONPTE while setting
>>> wrprotect in the middle of a large folio while it actually should keep CONT
>>> bit as all PTEs can be still consistent if we set protect from the 1st PTE.
>>>
>>> while A forks B,  progress >= 32 might interrupt in the middle of a
>>> new CONTPTE folio which is forming, as we have to set wrprotect to parent A,
>>> this parent immediately loses CONT bit. this is  sad. but i can't find a
>>> good way to resolve it unless CONT is exposed to mm-core. any idea on
>>> this?
>>
>> No this is not the case; copy_present_ptes() will copy as many ptes as are
>> physcially contiguous and belong to the same folio (which usually means "the
>> whole folio" - the only time it doesn't is when we hit the end of the vma). We
>> will then return to the main loop and move forwards by the number of ptes that
>> were serviced, including:
> 
> I probably have failed to describe my question. i'd like to give a
> concrete example
> 
> 1. process A forks B
> 2. At the beginning, address~address +64KB has pte_none PTEs
> 3. we scan the 5th pte of address + 5 * 4KB, progress becomes 32, we
> break and release PTLs
> 4. another page fault in process A gets PTL and set
> address~address+64KB to pte_cont
> 5. we get the PTL again and arrive 5th pte
> 6. we set wrprotects on 5,6,7....15 ptes, in this case, we have to
> unfold parent A
> and child B also gets unfolded PTEs unless our loop can go back the 0th pte.
> 
> technically, A should be able to keep CONTPTE, but because of the implementation
> of the code, it can't. That is the sadness. but it is obviously not your fault.
> 
> no worries. This is not happening quite often. but i just want to make a note
> here, maybe someday we can get back to address it.

Ahh, I understand the situation now, sorry for being slow!

I expect this to be a very rare situation anyway since (4) suggests process A
has another thread, and forking is not encouraged for multithreaded programs. In
fact the fork man page says:

  After a fork() in a multithreaded program, the child can safely call only
  async-signal-safe functions (see signal-safety(7)) until such time as it calls
  execve(2).

So in this case, we are about to completely repaint the child's address space
with execve() anyway.

So it's just the racing parent that loses the CONT_PTE bit. I expect this to be
extremely rare. I'm not sure there is much we can do to solve it though, because
unlike with your solution, we have to cater for multiple sizes, so there is no
obvious border until we get to PMD size, and I'm guessing that's going to be a
problem for latency spikes.
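
For illustration of that trade-off: a purely hypothetical variant of Barry's
alignment guard, widened to the PMD boundary (the only border that is
independent of which folio sizes are enabled). Neither series proposes this;
it is only a sketch:

	if (progress >= 32) {
		progress = 0;
		/* hypothetical: only break on a size-independent boundary */
		if (IS_ALIGNED(addr, PMD_SIZE) &&
		    (need_resched() ||
		     spin_needbreak(src_ptl) || spin_needbreak(dst_ptl)))
			break;
	}

Within a single copy_pte_range() call that condition is almost never true, so
both PTLs could be held for up to 512 ptes (2M / 4K) before we are willing to
reschedule - which is exactly the latency-spike concern above.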


> 
>>
>> progress += 8 * ret;
>>
>> That might go above 32, so we will flash the lock. But we haven't done that in
>> the middle of a large folio. So the contpte-ness should be preserved.
>>
>>>
>>> Our code[1] resolves this by only breaking at the aligned address
>>>
>>> if (progress >= 32) {
>>>      progress = 0;
>>>      #ifdef CONFIG_CONT_PTE_HUGEPAGE
>>>      /*
>>>       * XXX: don't release ptl at an unligned address as cont_pte
>>> might form while
>>>       * ptl is released, this causes double-map
>>>      */
>>>     if (!vma_is_chp_anonymous(src_vma) ||
>>>         (vma_is_chp_anonymous(src_vma) && IS_ALIGNED(addr,
>>> HPAGE_CONT_PTE_SIZE)))
>>>     #endif
>>>         if (need_resched() ||
>>>            spin_needbreak(src_ptl) || spin_needbreak(dst_ptl))
>>>              break;
>>> }
>>>
>>> [1] https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/blob/oneplus/sm8550_u_14.0.0_oneplus11/mm/memory.c#L1180
>>>
>>>
> Thanks
> Barry
Barry Song Nov. 30, 2023, 12:34 a.m. UTC | #46
On Thu, Nov 30, 2023 at 3:07 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> On 29/11/2023 13:09, Barry Song wrote:
> > On Thu, Nov 30, 2023 at 1:29 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
> >>
> >> On 28/11/2023 19:00, Barry Song wrote:
> >>> On Wed, Nov 29, 2023 at 12:00 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
> >>>>
> >>>> On 28/11/2023 00:11, Barry Song wrote:
> >>>>> On Mon, Nov 27, 2023 at 10:24 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
> >>>>>>
> >>>>>> On 27/11/2023 05:54, Barry Song wrote:
> >>>>>>>> +copy_present_ptes(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
> >>>>>>>> +              pte_t *dst_pte, pte_t *src_pte,
> >>>>>>>> +              unsigned long addr, unsigned long end,
> >>>>>>>> +              int *rss, struct folio **prealloc)
> >>>>>>>>  {
> >>>>>>>>      struct mm_struct *src_mm = src_vma->vm_mm;
> >>>>>>>>      unsigned long vm_flags = src_vma->vm_flags;
> >>>>>>>>      pte_t pte = ptep_get(src_pte);
> >>>>>>>>      struct page *page;
> >>>>>>>>      struct folio *folio;
> >>>>>>>> +    int nr = 1;
> >>>>>>>> +    bool anon;
> >>>>>>>> +    bool any_dirty = pte_dirty(pte);
> >>>>>>>> +    int i;
> >>>>>>>>
> >>>>>>>>      page = vm_normal_page(src_vma, addr, pte);
> >>>>>>>> -    if (page)
> >>>>>>>> +    if (page) {
> >>>>>>>>              folio = page_folio(page);
> >>>>>>>> -    if (page && folio_test_anon(folio)) {
> >>>>>>>> -            /*
> >>>>>>>> -             * If this page may have been pinned by the parent process,
> >>>>>>>> -             * copy the page immediately for the child so that we'll always
> >>>>>>>> -             * guarantee the pinned page won't be randomly replaced in the
> >>>>>>>> -             * future.
> >>>>>>>> -             */
> >>>>>>>> -            folio_get(folio);
> >>>>>>>> -            if (unlikely(page_try_dup_anon_rmap(page, false, src_vma))) {
> >>>>>>>> -                    /* Page may be pinned, we have to copy. */
> >>>>>>>> -                    folio_put(folio);
> >>>>>>>> -                    return copy_present_page(dst_vma, src_vma, dst_pte, src_pte,
> >>>>>>>> -                                             addr, rss, prealloc, page);
> >>>>>>>> +            anon = folio_test_anon(folio);
> >>>>>>>> +            nr = folio_nr_pages_cont_mapped(folio, page, src_pte, addr,
> >>>>>>>> +                                            end, pte, &any_dirty);
> >>>>>>>
> >>>>>>> in case we have a large folio with 16 CONTPTE basepages, and userspace
> >>>>>>> do madvise(addr + 4KB * 5, DONTNEED);
> >>>>>>
> >>>>>> nit: if you are offsetting by 5 pages from addr, then below I think you mean
> >>>>>> page0~page4 and page6~15?
> >>>>>>
> >>>>>>>
> >>>>>>> thus, the 4th basepage of PTE becomes PTE_NONE and folio_nr_pages_cont_mapped()
> >>>>>>> will return 15. in this case, we should copy page0~page3 and page5~page15.
> >>>>>>
> >>>>>> No I don't think folio_nr_pages_cont_mapped() will return 15; that's certainly
> >>>>>> not how its intended to work. The function is scanning forwards from the current
> >>>>>> pte until it finds the first pte that does not fit in the batch - either because
> >>>>>> it maps a PFN that is not contiguous, or because the permissions are different
> >>>>>> (although this is being relaxed a bit; see conversation with DavidH against this
> >>>>>> same patch).
> >>>>>>
> >>>>>> So the first time through this loop, folio_nr_pages_cont_mapped() will return 5,
> >>>>>> (page0~page4) then the next time through the loop we will go through the
> >>>>>> !present path and process the single swap marker. Then the 3rd time through the
> >>>>>> loop folio_nr_pages_cont_mapped() will return 10.
> >>>>>
> >>>>> one case we have met by running hundreds of real phones is as below,
> >>>>>
> >>>>>
> >>>>> static int
> >>>>> copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
> >>>>>                pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long addr,
> >>>>>                unsigned long end)
> >>>>> {
> >>>>>         ...
> >>>>>         dst_pte = pte_alloc_map_lock(dst_mm, dst_pmd, addr, &dst_ptl);
> >>>>>         if (!dst_pte) {
> >>>>>                 ret = -ENOMEM;
> >>>>>                 goto out;
> >>>>>         }
> >>>>>         src_pte = pte_offset_map_nolock(src_mm, src_pmd, addr, &src_ptl);
> >>>>>         if (!src_pte) {
> >>>>>                 pte_unmap_unlock(dst_pte, dst_ptl);
> >>>>>                 /* ret == 0 */
> >>>>>                 goto out;
> >>>>>         }
> >>>>>         spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING);
> >>>>>         orig_src_pte = src_pte;
> >>>>>         orig_dst_pte = dst_pte;
> >>>>>         arch_enter_lazy_mmu_mode();
> >>>>>
> >>>>>         do {
> >>>>>                 /*
> >>>>>                  * We are holding two locks at this point - either of them
> >>>>>                  * could generate latencies in another task on another CPU.
> >>>>>                  */
> >>>>>                 if (progress >= 32) {
> >>>>>                         progress = 0;
> >>>>>                         if (need_resched() ||
> >>>>>                             spin_needbreak(src_ptl) || spin_needbreak(dst_ptl))
> >>>>>                                 break;
> >>>>>                 }
> >>>>>                 ptent = ptep_get(src_pte);
> >>>>>                 if (pte_none(ptent)) {
> >>>>>                         progress++;
> >>>>>                         continue;
> >>>>>                 }
> >>>>>
> >>>>> the above iteration can break when progress > =32. for example, at the
> >>>>> beginning,
> >>>>> if all PTEs are none, we break when progress >=32, and we break when we
> >>>>> are in the 8th pte of 16PTEs which might become CONTPTE after we release
> >>>>> PTL.
> >>>>>
> >>>>> since we are releasing PTLs, next time when we get PTL, those pte_none() might
> >>>>> become pte_cont(), then are you going to copy CONTPTE from 8th pte,
> >>>>> thus, immediately
> >>>>> break the consistent CONPTEs rule of hardware?
> >>>>>
> >>>>> pte0 - pte_none
> >>>>> pte1 - pte_none
> >>>>> ...
> >>>>> pte7 - pte_none
> >>>>>
> >>>>> pte8 - pte_cont
> >>>>> ...
> >>>>> pte15 - pte_cont
> >>>>>
> >>>>> so we did some modification to avoid a break in the middle of PTEs
> >>>>> which can potentially
> >>>>> become CONTPE.
> >>>>> do {
> >>>>>                 /*
> >>>>>                 * We are holding two locks at this point - either of them
> >>>>>                 * could generate latencies in another task on another CPU.
> >>>>>                 */
> >>>>>                 if (progress >= 32) {
> >>>>>                                 progress = 0;
> >>>>> #ifdef CONFIG_CONT_PTE_HUGEPAGE
> >>>>>                 /*
> >>>>>                 * XXX: don't release ptl at an unligned address as
> >>>>> cont_pte might form while
> >>>>>                 * ptl is released, this causes double-map
> >>>>>                 */
> >>>>>                 if (!vma_is_chp_anonymous(src_vma) ||
> >>>>>                    (vma_is_chp_anonymous(src_vma) && IS_ALIGNED(addr,
> >>>>> HPAGE_CONT_PTE_SIZE)))
> >>>>> #endif
> >>>>>                 if (need_resched() ||
> >>>>>                    spin_needbreak(src_ptl) || spin_needbreak(dst_ptl))
> >>>>>                                 break;
> >>>>> }
> >>>>>
> >>>>> We could only reproduce the above issue by running thousands of phones.
> >>>>>
> >>>>> Does your code survive from this problem?
> >>>>
> >>>> Yes I'm confident my code is safe against this; as I said before, the CONT_PTE
> >>>> bit is not blindly "copied" from parent to child pte. As far as the core-mm is
> >>>> concerned, there is no CONT_PTE bit; they are just regular PTEs. So the code
> >>>> will see some pte_none() entries followed by some pte_present() entries. And
> >>>> when calling set_ptes() on the child, the arch code will evaluate the current
> >>>> state of the pgtable along with the new set_ptes() request and determine where
> >>>> it should insert the CONT_PTE bit.
> >>>
> >>> yep, i have read very carefully and think your code is safe here. The
> >>> only problem
> >>> is that the code can randomly unfold parent processes' CONPTE while setting
> >>> wrprotect in the middle of a large folio while it actually should keep CONT
> >>> bit as all PTEs can be still consistent if we set protect from the 1st PTE.
> >>>
> >>> while A forks B,  progress >= 32 might interrupt in the middle of a
> >>> new CONTPTE folio which is forming, as we have to set wrprotect to parent A,
> >>> this parent immediately loses CONT bit. this is  sad. but i can't find a
> >>> good way to resolve it unless CONT is exposed to mm-core. any idea on
> >>> this?
> >>
> >> No this is not the case; copy_present_ptes() will copy as many ptes as are
> >> physcially contiguous and belong to the same folio (which usually means "the
> >> whole folio" - the only time it doesn't is when we hit the end of the vma). We
> >> will then return to the main loop and move forwards by the number of ptes that
> >> were serviced, including:
> >
> > I probably have failed to describe my question. i'd like to give a
> > concrete example
> >
> > 1. process A forks B
> > 2. At the beginning, address~address +64KB has pte_none PTEs
> > 3. we scan the 5th pte of address + 5 * 4KB, progress becomes 32, we
> > break and release PTLs
> > 4. another page fault in process A gets PTL and set
> > address~address+64KB to pte_cont
> > 5. we get the PTL again and arrive 5th pte
> > 6. we set wrprotects on 5,6,7....15 ptes, in this case, we have to
> > unfold parent A
> > and child B also gets unfolded PTEs unless our loop can go back the 0th pte.
> >
> > technically, A should be able to keep CONTPTE, but because of the implementation
> > of the code, it can't. That is the sadness. but it is obviously not your fault.
> >
> > no worries. This is not happening quite often. but i just want to make a note
> > here, maybe someday we can get back to address it.
>
> Ahh, I understand the situation now, sorry for being slow!
>
> I expect this to be a very rare situation anyway since (4) suggests process A
> has another thread, and forking is not encouraged for multithreaded programs. In
> fact the fork man page says:
>
>   After a fork() in a multithreaded program, the child can safely call only
>   async-signal-safe functions (see signal-safety(7)) until such time as it calls
>   execve(2).
>
> So in this case, we are about to completely repaint the child's address space
> with execve() anyway.
>
> So its just the racing parent that loses the CONT_PTE bit. I expect this to be
> extremely rare. I'm not sure there is much we can do to solve it though, because
> unlike with your solution, we have to cater for multiple sizes so there is no
> obvious boarder until we get to PMD size and I'm guessing that's going to be a
> problem for latency spikes.

Right. I don't think this can be a big problem. The background is that we have a
way to constantly detect and report unexpected unfold events: we run hundreds of
phones in the lab and collect data to find out whether we have any potential
problems. We record unexpected unfolds and their reasons in a proc file and
monitor that data to look for potential bugs. This ">=32" break-and-unfold was
found that way, so we simply addressed it by disallowing the break at an
unaligned address.

This is not an issue that should block your patchset, but I was still glad to
share our observations with you and would like to hear if you have any ideas :-)
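
For reference, a minimal, hypothetical sketch of that style of monitoring - a
counter exported through procfs - is below. This is not the actual OPPO
instrumentation; the counter, proc file name and hook site are all made up for
illustration:

#include <linux/module.h>
#include <linux/proc_fs.h>
#include <linux/seq_file.h>
#include <linux/atomic.h>

static atomic_long_t unexpected_unfold_count = ATOMIC_LONG_INIT(0);

/* Call this from wherever an "unexpected" contpte unfold is detected. */
void chp_note_unexpected_unfold(void)
{
	atomic_long_inc(&unexpected_unfold_count);
}

static int chp_stat_show(struct seq_file *m, void *v)
{
	seq_printf(m, "unexpected_unfold %ld\n",
		   atomic_long_read(&unexpected_unfold_count));
	return 0;
}

static int __init chp_stat_init(void)
{
	if (!proc_create_single("cont_pte_hugepage_stat", 0444, NULL,
				chp_stat_show))
		return -ENOMEM;
	return 0;
}
module_init(chp_stat_init);
MODULE_LICENSE("GPL");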

>
>
> >
> >>
> >> progress += 8 * ret;
> >>
> >> That might go above 32, so we will flash the lock. But we haven't done that in
> >> the middle of a large folio. So the contpte-ness should be preserved.
> >>
> >>>
> >>> Our code[1] resolves this by only breaking at the aligned address
> >>>
> >>> if (progress >= 32) {
> >>>      progress = 0;
> >>>      #ifdef CONFIG_CONT_PTE_HUGEPAGE
> >>>      /*
> >>>       * XXX: don't release ptl at an unligned address as cont_pte
> >>> might form while
> >>>       * ptl is released, this causes double-map
> >>>      */
> >>>     if (!vma_is_chp_anonymous(src_vma) ||
> >>>         (vma_is_chp_anonymous(src_vma) && IS_ALIGNED(addr,
> >>> HPAGE_CONT_PTE_SIZE)))
> >>>     #endif
> >>>         if (need_resched() ||
> >>>            spin_needbreak(src_ptl) || spin_needbreak(dst_ptl))
> >>>              break;
> >>> }
> >>>
> >>> [1] https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/blob/oneplus/sm8550_u_14.0.0_oneplus11/mm/memory.c#L1180

Thanks
Barry
Barry Song Nov. 30, 2023, 12:51 a.m. UTC | #47
On Thu, Nov 30, 2023 at 1:21 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> On 28/11/2023 21:06, Barry Song wrote:
> > On Tue, Nov 28, 2023 at 11:49 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
> >>
> >> On 28/11/2023 09:49, Barry Song wrote:
> >>> On Tue, Nov 28, 2023 at 10:14 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
> >>>>
> >>>> On 27/11/2023 20:34, Barry Song wrote:
> >>>>> On Tue, Nov 28, 2023 at 12:07 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
> >>>>>>
> >>>>>> On 27/11/2023 10:28, Barry Song wrote:
> >>>>>>> On Mon, Nov 27, 2023 at 11:11 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
> >>>>>>>>
> >>>>>>>> On 27/11/2023 09:59, Barry Song wrote:
> >>>>>>>>> On Mon, Nov 27, 2023 at 10:35 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
> >>>>>>>>>>
> >>>>>>>>>> On 27/11/2023 08:42, Barry Song wrote:
> >>>>>>>>>>>>> +           for (i = 0; i < nr; i++, page++) {
> >>>>>>>>>>>>> +                   if (anon) {
> >>>>>>>>>>>>> +                           /*
> >>>>>>>>>>>>> +                            * If this page may have been pinned by the
> >>>>>>>>>>>>> +                            * parent process, copy the page immediately for
> >>>>>>>>>>>>> +                            * the child so that we'll always guarantee the
> >>>>>>>>>>>>> +                            * pinned page won't be randomly replaced in the
> >>>>>>>>>>>>> +                            * future.
> >>>>>>>>>>>>> +                            */
> >>>>>>>>>>>>> +                           if (unlikely(page_try_dup_anon_rmap(
> >>>>>>>>>>>>> +                                           page, false, src_vma))) {
> >>>>>>>>>>>>> +                                   if (i != 0)
> >>>>>>>>>>>>> +                                           break;
> >>>>>>>>>>>>> +                                   /* Page may be pinned, we have to copy. */
> >>>>>>>>>>>>> +                                   return copy_present_page(
> >>>>>>>>>>>>> +                                           dst_vma, src_vma, dst_pte,
> >>>>>>>>>>>>> +                                           src_pte, addr, rss, prealloc,
> >>>>>>>>>>>>> +                                           page);
> >>>>>>>>>>>>> +                           }
> >>>>>>>>>>>>> +                           rss[MM_ANONPAGES]++;
> >>>>>>>>>>>>> +                           VM_BUG_ON(PageAnonExclusive(page));
> >>>>>>>>>>>>> +                   } else {
> >>>>>>>>>>>>> +                           page_dup_file_rmap(page, false);
> >>>>>>>>>>>>> +                           rss[mm_counter_file(page)]++;
> >>>>>>>>>>>>> +                   }
> >>>>>>>>>>>>>             }
> >>>>>>>>>>>>> -           rss[MM_ANONPAGES]++;
> >>>>>>>>>>>>> -   } else if (page) {
> >>>>>>>>>>>>> -           folio_get(folio);
> >>>>>>>>>>>>> -           page_dup_file_rmap(page, false);
> >>>>>>>>>>>>> -           rss[mm_counter_file(page)]++;
> >>>>>>>>>>>>> +
> >>>>>>>>>>>>> +           nr = i;
> >>>>>>>>>>>>> +           folio_ref_add(folio, nr);
> >>>>>>>>>>>>
> >>>>>>>>>>>> You're changing the order of mapcount vs. refcount increment. Don't.
> >>>>>>>>>>>> Make sure your refcount >= mapcount.
> >>>>>>>>>>>>
> >>>>>>>>>>>> You can do that easily by doing the folio_ref_add(folio, nr) first and
> >>>>>>>>>>>> then decrementing in case of error accordingly. Errors due to pinned
> >>>>>>>>>>>> pages are the corner case.
> >>>>>>>>>>>>
> >>>>>>>>>>>> I'll note that it will make a lot of sense to have batch variants of
> >>>>>>>>>>>> page_try_dup_anon_rmap() and page_dup_file_rmap().
> >>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> i still don't understand why it is not a entire map+1, but an increment
> >>>>>>>>>>> in each basepage.
> >>>>>>>>>>
> >>>>>>>>>> Because we are PTE-mapping the folio, we have to account each individual page.
> >>>>>>>>>> If we accounted the entire folio, where would we unaccount it? Each page can be
> >>>>>>>>>> unmapped individually (e.g. munmap() part of the folio) so need to account each
> >>>>>>>>>> page. When PMD mapping, the whole thing is either mapped or unmapped, and its
> >>>>>>>>>> atomic, so we can account the entire thing.
> >>>>>>>>>
> >>>>>>>>> Hi Ryan,
> >>>>>>>>>
> >>>>>>>>> There is no problem. for example, a large folio is entirely mapped in
> >>>>>>>>> process A with CONPTE,
> >>>>>>>>> and only page2 is mapped in process B.
> >>>>>>>>> then we will have
> >>>>>>>>>
> >>>>>>>>> entire_map = 0
> >>>>>>>>> page0.map = -1
> >>>>>>>>> page1.map = -1
> >>>>>>>>> page2.map = 0
> >>>>>>>>> page3.map = -1
> >>>>>>>>> ....
> >>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> as long as it is a CONTPTE large folio, there is no much difference with
> >>>>>>>>>>> PMD-mapped large folio. it has all the chance to be DoubleMap and need
> >>>>>>>>>>> split.
> >>>>>>>>>>>
> >>>>>>>>>>> When A and B share a CONTPTE large folio, we do madvise(DONTNEED) or any
> >>>>>>>>>>> similar things on a part of the large folio in process A,
> >>>>>>>>>>>
> >>>>>>>>>>> this large folio will have partially mapped subpage in A (all CONTPE bits
> >>>>>>>>>>> in all subpages need to be removed though we only unmap a part of the
> >>>>>>>>>>> large folioas HW requires consistent CONTPTEs); and it has entire map in
> >>>>>>>>>>> process B(all PTEs are still CONPTES in process B).
> >>>>>>>>>>>
> >>>>>>>>>>> isn't it more sensible for this large folios to have entire_map = 0(for
> >>>>>>>>>>> process B), and subpages which are still mapped in process A has map_count
> >>>>>>>>>>> =0? (start from -1).
> >>>>>>>>>>>
> >>>>>>>>>>>> Especially, the batch variant of page_try_dup_anon_rmap() would only
> >>>>>>>>>>>> check once if the folio maybe pinned, and in that case, you can simply
> >>>>>>>>>>>> drop all references again. So you either have all or no ptes to process,
> >>>>>>>>>>>> which makes that code easier.
> >>>>>>>>>>
> >>>>>>>>>> I'm afraid this doesn't make sense to me. Perhaps I've misunderstood. But
> >>>>>>>>>> fundamentally you can only use entire_mapcount if its only possible to map and
> >>>>>>>>>> unmap the whole folio atomically.
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> My point is that CONTPEs should either all-set in all 16 PTEs or all are dropped
> >>>>>>>>> in 16 PTEs. if all PTEs have CONT, it is entirely mapped; otherwise,
> >>>>>>>>> it is partially
> >>>>>>>>> mapped. if a large folio is mapped in one processes with all CONTPTEs
> >>>>>>>>> and meanwhile in another process with partial mapping(w/o CONTPTE), it is
> >>>>>>>>> DoubleMapped.
> >>>>>>>>
> >>>>>>>> There are 2 problems with your proposal, as I see it;
> >>>>>>>>
> >>>>>>>> 1) the core-mm is not enlightened for CONTPTE mappings. As far as it is
> >>>>>>>> concerned, its just mapping a bunch of PTEs. So it has no hook to inc/dec
> >>>>>>>> entire_mapcount. The arch code is opportunistically and *transparently* managing
> >>>>>>>> the CONT_PTE bit.
> >>>>>>>>
> >>>>>>>> 2) There is nothing to say a folio isn't *bigger* than the contpte block; it may
> >>>>>>>> be 128K and be mapped with 2 contpte blocks. Or even a PTE-mapped THP (2M) and
> >>>>>>>> be mapped with 32 contpte blocks. So you can't say it is entirely mapped
> >>>>>>>> unless/until ALL of those blocks are set up. And then of course each block could
> >>>>>>>> be unmapped unatomically.
> >>>>>>>>
> >>>>>>>> For the PMD case there are actually 2 properties that allow using the
> >>>>>>>> entire_mapcount optimization; It's atomically mapped/unmapped through the PMD
> >>>>>>>> and we know that the folio is exactly PMD sized (since it must be at least PMD
> >>>>>>>> sized to be able to map it with the PMD, and we don't allocate THPs any bigger
> >>>>>>>> than PMD size). So one PMD map or unmap operation corresponds to exactly one
> >>>>>>>> *entire* map or unmap. That is not true when we are PTE mapping.
> >>>>>>>
> >>>>>>> well. Thanks for clarification. based on the above description, i agree the
> >>>>>>> current code might make more sense by always using mapcount in subpage.
> >>>>>>>
> >>>>>>> I gave my proposals as  I thought we were always CONTPTE size for small-THP
> >>>>>>> then we could drop the loop to iterate 16 times rmap. if we do it
> >>>>>>> entirely, we only
> >>>>>>> need to do dup rmap once for all 16 PTEs by increasing entire_map.
> >>>>>>
> >>>>>> Well its always good to have the discussion - so thanks for the ideas. I think
> >>>>>> there is a bigger question lurking here; should we be exposing the concept of
> >>>>>> contpte mappings to the core-mm rather than burying it in the arm64 arch code?
> >>>>>> I'm confident that would be a huge amount of effort and the end result would be
> >>>>>> similar performace to what this approach gives. One potential benefit of letting
> >>>>>> core-mm control it is that it would also give control to core-mm over the
> >>>>>> granularity of access/dirty reporting (my approach implicitly ties it to the
> >>>>>> folio). Having sub-folio access tracking _could_ potentially help with future
> >>>>>> work to make THP size selection automatic, but we are not there yet, and I think
> >>>>>> there are other (simpler) ways to achieve the same thing. So my view is that
> >>>>>> _not_ exposing it to core-mm is the right way for now.
> >>>>>
> >>>>> Hi Ryan,
> >>>>>
> >>>>> We(OPPO) started a similar project like you even before folio was imported to
> >>>>> mainline, we have deployed the dynamic hugepage(that is how we name it)
> >>>>> on millions of mobile phones on real products and kernels before 5.16,  making
> >>>>> a huge success on performance improvement. for example, you may
> >>>>> find the out-of-tree 5.15 source code here
> >>>>
> >>>> Oh wow, thanks for reaching out and explaining this - I have to admit I feel
> >>>> embarrassed that I clearly didn't do enough research on the prior art because I
> >>>> wasn't aware of your work. So sorry about that.
> >>>>
> >>>> I sensed that you had a different model for how this should work vs what I've
> >>>> implemented and now I understand why :). I'll review your stuff and I'm sure
> >>>> I'll have questions. I'm sure each solution has pros and cons.
> >>>>
> >>>>
> >>>>>
> >>>>> https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/tree/oneplus/sm8550_u_14.0.0_oneplus11
> >>>>>
> >>>>> Our modification might not be so clean and has lots of workarounds
> >>>>> just for the stability of products
> >>>>>
> >>>>> We mainly have
> >>>>>
> >>>>> 1. https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/blob/oneplus/sm8550_u_14.0.0_oneplus11/mm/cont_pte_hugepage.c
> >>>>>
> >>>>> some CONTPTE helpers
> >>>>>
> >>>>> 2.https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/blob/oneplus/sm8550_u_14.0.0_oneplus11/include/linux/mm.h
> >>>>>
> >>>>> some Dynamic Hugepage APIs
> >>>>>
> >>>>> 3. https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/blob/oneplus/sm8550_u_14.0.0_oneplus11/mm/memory.c
> >>>>>
> >>>>> modified all page faults to support
> >>>>>      (1). allocation of hugepage of 64KB in do_anon_page
> >>>>
> >>>> My Small-Sized THP patch set is handling the equivalent of this.
> >>>
> >>> right, the only difference is that we did a huge-zeropage for reading
> >>> in do_anon_page.
> >>> mapping all large folios to CONTPTE to zero page.
> >>
> >> FWIW, I took a slightly different approach in my original RFC for the zero page
> >> - although I ripped it all out to simplify for the initial series. I found that
> >> it was pretty rare for user space to read multiple consecutive pages without
> >> ever interleving any writes, so I kept the zero page as a base page, but at CoW,
> >> I would expand the allocation to an approprately sized THP. But for the couple
> >> of workloads that I've gone deep with, I found that it made barely any dent on
> >> the amount of memory that ended up contpte-mapped; the vast majority was from
> >> write allocation in do_anonymous_page().
> >
> > the problem is even if there is only one page read in 16 ptes, you
> > will map the page to
> > zero basepage. then while you write another page in these 16 ptes, you
> > lose the chance
> > to become large folio as pte_range_none() becomes false.
> >
> > if we map these 16ptes to contpte zero page, in do_wp_page, we have a
> > good chance
> > to CoW and get a large anon folio.
>
> Yes understood. I think we are a bit off-topic for this patch set though.
> small-sized THP zero pages can be tackled as a separate series once these
> initial series are in. I'd be happy to review a small-sized THP zero page post :)

I agree this can be deferred. Right now our first priority is the swap-in
series, so I can't give a timeline for when we can send a small-sized THP zero
page series.

>
> >
> >>
> >>>
> >>>>
> >>>>>      (2). CoW hugepage in do_wp_page
> >>>>
> >>>> This isn't handled yet in my patch set; the original RFC implemented it but I
> >>>> removed it in order to strip back to the essential complexity for the initial
> >>>> submission. DavidH has been working on a precise shared vs exclusive map
> >>>> tracking mechanism - if that goes in, it will make CoWing large folios simpler.
> >>>> Out of interest, what workloads benefit most from this?
> >>>
> >>> as a phone, Android has a design almost all processes are forked from zygote.
> >>> thus, CoW happens quite often to all apps.
> >>
> >> Sure. But in my analysis I concluded that most of the memory mapped in zygote is
> >> file-backed and mostly RO so therefore doing THP CoW doesn't help much. Perhaps
> >> there are cases where that conclusion is wrong.
> >
> > CoW is much less than do_anon_page on my phone which is running dynamic
> > hugepage for a couple of hours:
> >
> > OP52D1L1:/ # cat /proc/cont_pte_hugepage/stat
> > ...
> > thp_cow 34669                           ---- CoW a large folio
> > thp_do_anon_pages 1032362     -----  a large folio in do_anon_page
> > ...
> >
> > so it is around 34669/1032362 = 3.35%.
>
> well its actually 34669 / (34669 + 1032362) = 3.25%. But, yes, the point is that
> very few of large folios are lost due to CoW so there is likely to be little
> perf impact. Again, I'd happily review a series that enables this!

right, same as above.

>
> >
> >>
> >>>
> >>>>
> >>>>>      (3). copy CONPTEs in copy_pte_range
> >>>>
> >>>> As discussed this is done as part of the contpte patch set, but its not just a
> >>>> simple copy; the arch code will notice and set the CONT_PTE bit as needed.
> >>>
> >>> right, i have read all your unfold and fold stuff today, now i understand your
> >>> approach seems quite nice!
> >>
> >> Great - thanks!
> >>
> >>>
> >>>
> >>>>
> >>>>>      (4). allocate and swap-in Hugepage as a whole in do_swap_page
> >>>>
> >>>> This is going to be a problem but I haven't even looked at this properly yet.
> >>>> The advice so far has been to continue to swap-in small pages only, but improve
> >>>> khugepaged to collapse to small-sized THP. I'll take a look at your code to
> >>>> understand how you did this.
> >>>
> >>> this is also crucial to android phone as swap is always happening
> >>> on an embedded device. if we don't support large folios in swapin,
> >>> our large folios will never come back after it is swapped-out.
> >>>
> >>> and i hated the collapse solution from the first beginning as there is
> >>> never a guarantee to succeed and its overhead is unacceptable to user UI,
> >>> so we supported hugepage allocation in do_swap_page from the first beginning.
> >>
> >> Understood. I agree it would be nice to preserve large folios across swap. I
> >> think this can be layered on top of the current work though.
> >
> > This will be my first priority to use your large folio code on phones.
> > We need a patchset
> > on top of yours :-)
> >
> > without it, we will likely fail. Typically, one phone can have a 4~8GB
> > zRAM to compress
> > a lot of anon pages, if the compression ratio is 1:4, that means
> > uncompressed anon
> > pages are much much more. Thus, while the background app is switched back
> > to foreground, we need those swapped-out large folios back rather than getting
> > small basepages replacement. swap-in basepage is definitely not going to
> > work well on a phone, neither does THP collapse.
>
> Yep understood. From the other thread, it sounds like you are preparing a series
> for large swap-in - looking forward to seeing it!

right. as said, this is the first priority.

>
> >
> >>
> >>>
> >>>>
> >>>>>
> >>>>> 4. https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/blob/oneplus/sm8550_u_14.0.0_oneplus11/mm/vmscan.c
> >>>>> https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/blob/oneplus/sm8550_u_14.0.0_oneplus11/mm/rmap.c
> >>>>>
> >>>>> reclaim hugepage as a whole and LRU optimization for 64KB dynamic hugepage.
> >>>>
> >>>> I think this is all naturally handled by the folio code that exists in modern
> >>>> kernels?
> >>>
> >>> We had a CONTPTE hugepage pool, if the pool is very limited, we let LRU
> >>> reclaim large folios to the pool. as phones are running lots of apps
> >>> and drivers, and the memory is very limited, after a couple of hours,
> >>> it will become very hard to allocate large folios in the original buddy. thus,
> >>> large folios totally disappeared after running the phone for some time
> >>> if we didn't have the pool.
> >>>
> >>>>
> >>>>>
> >>>>> So we are 100% interested in your patchset and hope it can find a way
> >>>>> to land on the
> >>>>> mainline, thus decreasing all the cost we have to maintain out-of-tree
> >>>>> code from a
> >>>>> kernel to another kernel version which we have done on a couple of
> >>>>> kernel versions
> >>>>> before 5.16. Firmly, we are 100% supportive of large anon folios
> >>>>> things you are leading.
> >>>>
> >>>> That's great to hear! Of course Reviewed-By's and Tested-By's will all help move
> >>>> it closer :). If you had any ability to do any A/B performance testing, it would
> >>>> be very interesting to see how this stacks up against your solution - if there
> >>>> are gaps it would be good to know where and develop a plan to plug the gap.
> >>>>
> >>>
> >>> sure.
> >>>
> >>>>>
> >>>>> A big pain was we found lots of races especially on CONTPTE unfolding
> >>>>> and especially a part
> >>>>> of basepages ran away from the 16 CONPTEs group since userspace is
> >>>>> always working
> >>>>> on basepages, having no idea of small-THP.  We ran our code on millions of
> >>>>> real phones, and now we have got them fixed (or maybe "can't reproduce"),
> >>>>> no outstanding issue.
> >>>>
> >>>> I'm going to be brave and say that my solution shouldn't suffer from these
> >>>> problems; but of course the proof is only in the testing. I did a lot of work
> >>>> with our architecture group and micro architects to determine exactly what is
> >>>> and isn't safe; We even tightened the Arm ARM spec very subtlely to allow the
> >>>> optimization in patch 13 (see the commit log for details). Of course this has
> >>>> all been checked with partners and we are confident that all existing
> >>>> implementations conform to the modified wording.
> >>>
> >>> cool. I like your try_unfold/fold code. it seems your code is setting/dropping
> >>> CONT automatically based on ALIGHMENT, Page number etc. Alternatively,
> >>> our code is always stupidly checking some conditions before setting and dropping
> >>> CONT everywhere.
> >>>
> >>>>
> >>>>>
> >>>>> Particularly for the rmap issue we are discussing, our out-of-tree is
> >>>>> using the entire_map for
> >>>>> CONTPTE in the way I sent to you. But I guess we can learn from you to decouple
> >>>>> CONTPTE from mm-core.
> >>>>>
> >>>>> We are doing this in mm/memory.c
> >>>>>
> >>>>> copy_present_cont_pte(struct vm_area_struct *dst_vma, struct
> >>>>> vm_area_struct *src_vma,
> >>>>> pte_t *dst_pte, pte_t *src_pte, unsigned long addr, int *rss,
> >>>>> struct page **prealloc)
> >>>>> {
> >>>>>       struct mm_struct *src_mm = src_vma->vm_mm;
> >>>>>       unsigned long vm_flags = src_vma->vm_flags;
> >>>>>       pte_t pte = *src_pte;
> >>>>>       struct page *page;
> >>>>>
> >>>>>        page = vm_normal_page(src_vma, addr, pte);
> >>>>>       ...
> >>>>>
> >>>>>      get_page(page);
> >>>>>      page_dup_rmap(page, true);   // an entire dup_rmap as you can
> >>>>> see.............
> >>>>>      rss[mm_counter(page)] += HPAGE_CONT_PTE_NR;
> >>>>> }
> >>>>>
> >>>>> and we have a split in mm/cont_pte_hugepage.c to handle partially unmap,
> >>>>>
> >>>>> static void __split_huge_cont_pte_locked(struct vm_area_struct *vma, pte_t *pte,
> >>>>> unsigned long haddr, bool freeze)
> >>>>> {
> >>>>> ...
> >>>>>            if (compound_mapcount(head) > 1 && !TestSetPageDoubleMap(head)) {
> >>>>>                   for (i = 0; i < HPAGE_CONT_PTE_NR; i++)
> >>>>>                            atomic_inc(&head[i]._mapcount);
> >>>>>                  atomic_long_inc(&cont_pte_double_map_count);
> >>>>>            }
> >>>>>
> >>>>>
> >>>>>             if (atomic_add_negative(-1, compound_mapcount_ptr(head))) {
> >>>>>               ...
> >>>>> }
> >>>>>
> >>>>> I am not selling our solution any more, but just showing you some differences we
> >>>>> have :-)
> >>>>
> >>>> OK, I understand what you were saying now. I'm currently struggling to see how
> >>>> this could fit into my model. Do you have any workloads and numbers on perf
> >>>> improvement of using entire_mapcount?
> >>>
> >>> TBH, I don't have any data on this as from the first beginning, we were using
> >>> entire_map. So I have no comparison at all.
> >>>
> >>>>
> >>>>>
> >>>>>>
> >>>>>>>
> >>>>>>> BTW, I have concerns that a variable small-THP size will really work
> >>>>>>> as userspace
> >>>>>>> is probably friendly to only one fixed size. for example, userspace
> >>>>>>> heap management
> >>>>>>> might be optimized to a size for freeing memory to the kernel. it is
> >>>>>>> very difficult
> >>>>>>> for the heap to adapt to various sizes at the same time. frequent unmap/free
> >>>>>>> size not equal with, and particularly smaller than small-THP size will
> >>>>>>> defeat all
> >>>>>>> efforts to use small-THP.
> >>>>>>
> >>>>>> I'll admit to not knowing a huge amount about user space allocators. But I will
> >>>>>> say that as currently defined, the small-sized THP interface to user space
> >>>>>> allows a sysadmin to specifically enable the set of sizes that they want; so a
> >>>>>> single size can be enabled. I'm diliberately punting that decision away from the
> >>>>>> kernel for now.
> >>>>>
> >>>>> Basically, userspace heap library has a PAGESIZE setting and allows users
> >>>>> to allocate/free all kinds of small objects such as 16,32,64,128,256,512 etc.
> >>>>> The default size is for sure equal to the basepage SIZE. once some objects are
> >>>>> freed by free() and libc get a free "page", userspace heap libraries might free
> >>>>> the PAGESIZE page to kernel by things like MADV_DONTNEED, then zap_pte_range().
> >>>>> it is quite similar with kernel slab.
> >>>>>
> >>>>> so imagine we have small-THP now, but userspace libraries have *NO*
> >>>>> idea at all,  so it can frequently cause unfolding.
> >>>>>
> >>>>>>
> >>>>>> FWIW, my experience with the Speedometer/JavaScript use case is that performance
> >>>>>> is a little bit better when enabling 64K+32K+16K vs just 64K THP.
> >>>>>>
> >>>>>> Functionally, it will not matter if the allocator is not enlightened for the THP
> >>>>>> size; it can continue to free, and if a partial folio is unmapped it is put on
> >>>>>> the deferred split list, then under memory pressure it is split and the unused
> >>>>>> pages are reclaimed. I guess this is the bit you are concerned about having a
> >>>>>> performance impact?
> >>>>>
> >>>>> Right. If this is happening on the majority of small-THP folios, we
> >>>>> don't get a performance
> >>>>> improvement, and probably a regression instead. This is really true on
> >>>>> real workloads!!
> >>>>>
> >>>>> So that is why we would really love a per-VMA hint to enable small-THP, but
> >>>>> obviously you have already supported that now with
> >>>>> mm: thp: Introduce per-size thp sysfs interface
> >>>>> https://lore.kernel.org/linux-mm/20231122162950.3854897-4-ryan.roberts@arm.com/
> >>>>>
> >>>>> We can use MADVISE rather than ALWAYS and enable a fixed size like 64KB, so
> >>>>> userspace can set the VMA flag when it is quite sure the VMA works with 64KB
> >>>>> alignment?
> >>>>
> >>>> Yes, that all exists in the series today. We have also discussed the possibility
> >>>> of adding a new madvise_process() call that would take the set of THP sizes that
> >>>> should be considered. Then you can set different VMAs to use different sizes;
> >>>> the plan was to layer that on top if/when a workload was identified. Sounds like
> >>>> you might be able to help there?
> >>>
> >>> I'm not quite sure; on phones we are using fixed-size CONTPTE, so we ask
> >>> for either 64KB or 4KB. If we think a VMA is a good fit for CONTPTE, we
> >>> set a flag in that VMA and try to allocate 64KB.
> >>
> >> When you say "we set a flag" do you mean user space? Or is there some heuristic
> >> in the kernel?
> >
> > We are using a field the Android kernel adds to the vma struct to mark that
> > a vma is good to use CONTPTE. With the upstream solution you are providing,
> > we can remove this dirty code[1]:
> >
> > static inline bool vma_is_chp_anonymous(struct vm_area_struct *vma)
> > {
> > 	return vma->android_kabi_reserved2 == THP_SWAP_PRIO_MAGIC;
> > }
>
> Sorry I'm not sure I've understood; how does that flag get set in the first
> place? Does user space tell the kernel (via e.g. madvise()) or does the kernel
> set it based on its own heuristics?

Basically we did it in an ugly way: on Android, different vma types have
different names. For some types of vma, we have optimized them in userspace and
tried to decrease/avoid fragmentation and unaligned CONTPTE unfolds. So in the
kernel we compare the name of the vma, and if it is an optimized vma type, we
set the field in the vma. Note that in many cases we have to write dirty code
like this because we have to follow the Android kernel's KMI :-)
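
(For illustration only; this assumes the names in question are the anonymous-VMA
names userspace can attach via prctl(PR_SET_VMA_ANON_NAME), which show up as
"[anon:...]" in /proc/<pid>/maps. The thread does not spell out Android's exact
naming scheme, and the "my_heap" name below is made up.)

#include <stdio.h>
#include <sys/mman.h>
#include <sys/prctl.h>

#ifndef PR_SET_VMA
#define PR_SET_VMA		0x53564d41
#define PR_SET_VMA_ANON_NAME	0
#endif

int main(void)
{
	size_t len = 64 * 1024;
	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED)
		return 1;

	/*
	 * Name the anonymous VMA; it then appears in /proc/self/maps as
	 * "[anon:my_heap]", the kind of string a vendor kernel could match on.
	 */
	if (prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME,
		  (unsigned long)p, len, (unsigned long)"my_heap"))
		perror("prctl");

	return 0;
}
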

Based on your new sysfs interface, we can move to madvise(MADV_HUGEPAGE) and
set the 64KB size to "madvise".
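
(Again just a sketch, not something defined by this series; the per-size sysfs
path below is assumed from the linked per-size interface patch, and error
handling is kept minimal.)

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Assumed path, following the per-size sysfs patch linked above. */
#define THP_64K_CTRL \
	"/sys/kernel/mm/transparent_hugepage/hugepages-64kB/enabled"

int main(void)
{
	size_t len = 8 * 1024 * 1024;
	void *p;
	int fd;

	/*
	 * Normally done once by the sysadmin/init scripts: allow 64K
	 * folios only where userspace explicitly asks for them.
	 */
	fd = open(THP_64K_CTRL, O_WRONLY);
	if (fd >= 0) {
		if (write(fd, "madvise", strlen("madvise")) < 0)
			perror("write");
		close(fd);
	}

	p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return 1;

	/* Opt this VMA in; the kernel may now back it with 64K folios. */
	madvise(p, len, MADV_HUGEPAGE);
	return 0;
}
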

BTW, large anon folios can be a disaster for an unoptimized userspace,
especially on a memory-limited system: the memory footprint can increase
terribly. So it is really nice to have your new sysfs interface and let
userspace decide whether it wants large folios.

>
> >
> > [1] https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/blob/oneplus/sm8550_u_14.0.0_oneplus11/include/linux/mm.h#L4031
> >

Thanks
Barry
Christophe Leroy Dec. 4, 2023, 11:01 a.m. UTC | #48
On 15/11/2023 at 22:37, Andrew Morton wrote:
> On Wed, 15 Nov 2023 16:30:05 +0000 Ryan Roberts <ryan.roberts@arm.com> wrote:
> 
>> However, the primary motivation for this change is to reduce the number
>> of tlb maintenance operations that the arm64 backend has to perform
>> during fork
> 
> Do you have a feeling for how much performance improved due to this?
> 
> Are there other architectures which might similarly benefit?  By
> implementing ptep_set_wrprotects(), it appears.  If so, what sort of
> gains might they see?
> 

I think powerpc 8xx would benefit.

We have 16k hugepages which are implemented with 4 identical PTEs that have 
the _PAGE_SPS bit set. _PAGE_SPS says it is a 16k page.

We have 512k hugepages which are implemented with 128 identical PTEs that 
have the _PAGE_HUGE bit set. _PAGE_HUGE says it is a 512k page.

FWIW, PMD size is 4M and there is no 4M page size, so there is no way to 
implement leaf PMD huge pages.

Christophe
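
(Not from the series or from any 8xx patch: a rough sketch of what a batched
ptep_set_wrprotects() override could look like on an architecture whose huge
pages are built from N identical PTEs, as described above. It uses only generic
helpers, assumes the hardware keeps the accessed/dirty bits identical across
the block, and ignores the atomicity that real arch code would need to preserve
hardware-set bits.)

/*
 * Hypothetical arch override: because every PTE in the block is kept
 * identical apart from the PFN, read the first entry, clear write, and
 * rewrite the whole block with set_ptes(), which advances the PFN for
 * each successive entry. One pass, and a single point at which to issue
 * any range TLB maintenance, instead of nr ptep_set_wrprotect() calls.
 */
#define ptep_set_wrprotects ptep_set_wrprotects
static inline void ptep_set_wrprotects(struct mm_struct *mm,
				       unsigned long address, pte_t *ptep,
				       unsigned int nr)
{
	pte_t pte = pte_wrprotect(ptep_get(ptep));

	set_ptes(mm, address, ptep, pte, nr);
}
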
diff mbox series

Patch

diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index af7639c3b0a3..1c50f8a0fdde 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -622,6 +622,19 @@  static inline void ptep_set_wrprotect(struct mm_struct *mm, unsigned long addres
 }
 #endif
 
+#ifndef ptep_set_wrprotects
+struct mm_struct;
+static inline void ptep_set_wrprotects(struct mm_struct *mm,
+				unsigned long address, pte_t *ptep,
+				unsigned int nr)
+{
+	unsigned int i;
+
+	for (i = 0; i < nr; i++, address += PAGE_SIZE, ptep++)
+		ptep_set_wrprotect(mm, address, ptep);
+}
+#endif
+
 /*
  * On some architectures hardware does not set page access bit when accessing
  * memory page, it is responsibility of software setting this bit. It brings
diff --git a/mm/memory.c b/mm/memory.c
index 1f18ed4a5497..b7c8228883cf 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -921,46 +921,129 @@  copy_present_page(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma
 		/* Uffd-wp needs to be delivered to dest pte as well */
 		pte = pte_mkuffd_wp(pte);
 	set_pte_at(dst_vma->vm_mm, addr, dst_pte, pte);
-	return 0;
+	return 1;
+}
+
+static inline unsigned long page_cont_mapped_vaddr(struct page *page,
+				struct page *anchor, unsigned long anchor_vaddr)
+{
+	unsigned long offset;
+	unsigned long vaddr;
+
+	offset = (page_to_pfn(page) - page_to_pfn(anchor)) << PAGE_SHIFT;
+	vaddr = anchor_vaddr + offset;
+
+	if (anchor > page) {
+		if (vaddr > anchor_vaddr)
+			return 0;
+	} else {
+		if (vaddr < anchor_vaddr)
+			return ULONG_MAX;
+	}
+
+	return vaddr;
+}
+
+static int folio_nr_pages_cont_mapped(struct folio *folio,
+				      struct page *page, pte_t *pte,
+				      unsigned long addr, unsigned long end,
+				      pte_t ptent, bool *any_dirty)
+{
+	int floops;
+	int i;
+	unsigned long pfn;
+	pgprot_t prot;
+	struct page *folio_end;
+
+	if (!folio_test_large(folio))
+		return 1;
+
+	folio_end = &folio->page + folio_nr_pages(folio);
+	end = min(page_cont_mapped_vaddr(folio_end, page, addr), end);
+	floops = (end - addr) >> PAGE_SHIFT;
+	pfn = page_to_pfn(page);
+	prot = pte_pgprot(pte_mkold(pte_mkclean(ptent)));
+
+	*any_dirty = pte_dirty(ptent);
+
+	pfn++;
+	pte++;
+
+	for (i = 1; i < floops; i++) {
+		ptent = ptep_get(pte);
+		ptent = pte_mkold(pte_mkclean(ptent));
+
+		if (!pte_present(ptent) || pte_pfn(ptent) != pfn ||
+		    pgprot_val(pte_pgprot(ptent)) != pgprot_val(prot))
+			break;
+
+		if (pte_dirty(ptent))
+			*any_dirty = true;
+
+		pfn++;
+		pte++;
+	}
+
+	return i;
 }
 
 /*
- * Copy one pte.  Returns 0 if succeeded, or -EAGAIN if one preallocated page
- * is required to copy this pte.
+ * Copy set of contiguous ptes.  Returns number of ptes copied if succeeded
+ * (always gte 1), or -EAGAIN if one preallocated page is required to copy the
+ * first pte.
  */
 static inline int
-copy_present_pte(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
-		 pte_t *dst_pte, pte_t *src_pte, unsigned long addr, int *rss,
-		 struct folio **prealloc)
+copy_present_ptes(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
+		  pte_t *dst_pte, pte_t *src_pte,
+		  unsigned long addr, unsigned long end,
+		  int *rss, struct folio **prealloc)
 {
 	struct mm_struct *src_mm = src_vma->vm_mm;
 	unsigned long vm_flags = src_vma->vm_flags;
 	pte_t pte = ptep_get(src_pte);
 	struct page *page;
 	struct folio *folio;
+	int nr = 1;
+	bool anon;
+	bool any_dirty = pte_dirty(pte);
+	int i;
 
 	page = vm_normal_page(src_vma, addr, pte);
-	if (page)
+	if (page) {
 		folio = page_folio(page);
-	if (page && folio_test_anon(folio)) {
-		/*
-		 * If this page may have been pinned by the parent process,
-		 * copy the page immediately for the child so that we'll always
-		 * guarantee the pinned page won't be randomly replaced in the
-		 * future.
-		 */
-		folio_get(folio);
-		if (unlikely(page_try_dup_anon_rmap(page, false, src_vma))) {
-			/* Page may be pinned, we have to copy. */
-			folio_put(folio);
-			return copy_present_page(dst_vma, src_vma, dst_pte, src_pte,
-						 addr, rss, prealloc, page);
+		anon = folio_test_anon(folio);
+		nr = folio_nr_pages_cont_mapped(folio, page, src_pte, addr,
+						end, pte, &any_dirty);
+
+		for (i = 0; i < nr; i++, page++) {
+			if (anon) {
+				/*
+				 * If this page may have been pinned by the
+				 * parent process, copy the page immediately for
+				 * the child so that we'll always guarantee the
+				 * pinned page won't be randomly replaced in the
+				 * future.
+				 */
+				if (unlikely(page_try_dup_anon_rmap(
+						page, false, src_vma))) {
+					if (i != 0)
+						break;
+					/* Page may be pinned, we have to copy. */
+					return copy_present_page(
+						dst_vma, src_vma, dst_pte,
+						src_pte, addr, rss, prealloc,
+						page);
+				}
+				rss[MM_ANONPAGES]++;
+				VM_BUG_ON(PageAnonExclusive(page));
+			} else {
+				page_dup_file_rmap(page, false);
+				rss[mm_counter_file(page)]++;
+			}
 		}
-		rss[MM_ANONPAGES]++;
-	} else if (page) {
-		folio_get(folio);
-		page_dup_file_rmap(page, false);
-		rss[mm_counter_file(page)]++;
+
+		nr = i;
+		folio_ref_add(folio, nr);
 	}
 
 	/*
@@ -968,24 +1051,28 @@  copy_present_pte(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
 	 * in the parent and the child
 	 */
 	if (is_cow_mapping(vm_flags) && pte_write(pte)) {
-		ptep_set_wrprotect(src_mm, addr, src_pte);
+		ptep_set_wrprotects(src_mm, addr, src_pte, nr);
 		pte = pte_wrprotect(pte);
 	}
-	VM_BUG_ON(page && folio_test_anon(folio) && PageAnonExclusive(page));
 
 	/*
-	 * If it's a shared mapping, mark it clean in
-	 * the child
+	 * If it's a shared mapping, mark it clean in the child. If its a
+	 * private mapping, mark it dirty in the child if _any_ of the parent
+	 * mappings in the block were marked dirty. The contiguous block of
+	 * mappings are all backed by the same folio, so if any are dirty then
+	 * the whole folio is dirty. This allows us to determine the batch size
+	 * without having to ever consider the dirty bit. See
+	 * folio_nr_pages_cont_mapped().
 	 */
-	if (vm_flags & VM_SHARED)
-		pte = pte_mkclean(pte);
-	pte = pte_mkold(pte);
+	pte = pte_mkold(pte_mkclean(pte));
+	if (!(vm_flags & VM_SHARED) && any_dirty)
+		pte = pte_mkdirty(pte);
 
 	if (!userfaultfd_wp(dst_vma))
 		pte = pte_clear_uffd_wp(pte);
 
-	set_pte_at(dst_vma->vm_mm, addr, dst_pte, pte);
-	return 0;
+	set_ptes(dst_vma->vm_mm, addr, dst_pte, pte, nr);
+	return nr;
 }
 
 static inline struct folio *page_copy_prealloc(struct mm_struct *src_mm,
@@ -1087,15 +1174,28 @@  copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
 			 */
 			WARN_ON_ONCE(ret != -ENOENT);
 		}
-		/* copy_present_pte() will clear `*prealloc' if consumed */
-		ret = copy_present_pte(dst_vma, src_vma, dst_pte, src_pte,
-				       addr, rss, &prealloc);
+		/* copy_present_ptes() will clear `*prealloc' if consumed */
+		ret = copy_present_ptes(dst_vma, src_vma, dst_pte, src_pte,
+				       addr, end, rss, &prealloc);
+
 		/*
 		 * If we need a pre-allocated page for this pte, drop the
 		 * locks, allocate, and try again.
 		 */
 		if (unlikely(ret == -EAGAIN))
 			break;
+
+		/*
+		 * Positive return value is the number of ptes copied.
+		 */
+		VM_WARN_ON_ONCE(ret < 1);
+		progress += 8 * ret;
+		ret--;
+		dst_pte += ret;
+		src_pte += ret;
+		addr += ret << PAGE_SHIFT;
+		ret = 0;
+
 		if (unlikely(prealloc)) {
 			/*
 			 * pre-alloc page cannot be reused by next time so as
@@ -1106,7 +1206,6 @@  copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
 			folio_put(prealloc);
 			prealloc = NULL;
 		}
-		progress += 8;
 	} while (dst_pte++, src_pte++, addr += PAGE_SIZE, addr != end);
 
 	arch_leave_lazy_mmu_mode();