
[v3] mm: introduce reference pages

Message ID 20200814213310.42170-1-pcc@google.com (mailing list archive)
State New, archived
Series [v3] mm: introduce reference pages

Commit Message

Peter Collingbourne Aug. 14, 2020, 9:33 p.m. UTC
Introduce a new syscall, refpage_create, which returns a file
descriptor which may be mapped using mmap. Such a mapping is similar
to an anonymous mapping, but instead of clean pages being backed by the
zero page, they are instead backed by a so-called reference page, whose
contents are specified using an argument to refpage_create. Loads from
the mapping will load directly from the reference page, and initial
stores to the mapping will copy-on-write from the reference page.
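
For illustration, a minimal userspace sketch (error checking omitted;
the syscall number is the one this patch assigns on most
architectures, and there is no libc wrapper):

#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef __NR_refpage_create
#define __NR_refpage_create 440
#endif

int main(void)
{
	long psz = sysconf(_SC_PAGESIZE);

	/* refpage_create requires page-aligned contents. */
	uint8_t *pattern = mmap(NULL, psz, PROT_READ | PROT_WRITE,
				MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	memset(pattern, 0xaa, psz);

	int fd = syscall(__NR_refpage_create, pattern, 0);

	/* Clean pages of this mapping read back as 0xaa; the first
	 * write to a page copies-on-write from the reference page.
	 */
	uint8_t *p = mmap(NULL, 16 * psz, PROT_READ | PROT_WRITE,
			  MAP_PRIVATE, fd, 0);
	return p[5 * psz] == 0xaa ? 0 : 1;
}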

Reference pages are useful in circumstances where anonymous mappings
combined with manual stores to memory would impose undesirable costs,
either in terms of performance or RSS. Use cases are focused on heap
allocators and include:

- Pattern initialization for the heap. This is where malloc(3) gives
  you memory whose contents are filled with a non-zero pattern
  byte, in order to help detect and mitigate bugs involving use
  of uninitialized memory. Typically this is implemented by having
  the allocator memset the allocation with the pattern byte before
  returning it to the user, but for large allocations this can result
  in a significant increase in RSS, especially for allocations that
  are used sparsely. Even for dense allocations there is a needless
  impact to startup performance when it may be better to amortize it
  throughout the program. By creating allocations using a reference
  page filled with the pattern byte, we can avoid these costs (a
  sketch of this approach follows this list).

- Pre-tagged heap memory. Memory tagging [1] is an upcoming ARMv8.5
  feature which allows for memory to be tagged in order to detect
  certain kinds of memory errors with low overhead. In order to set
  up an allocation to allow memory errors to be detected, the entire
  allocation needs to have the same tag. The issue here is similar to
  pattern initialization in the sense that large tagged allocations
  will be expensive if the tagging is done up front. The idea is that
  the allocator would create reference pages with each of the possible
  memory tags, and use those reference pages for the large allocations.

In order to measure the performance and RSS impact of reference pages,
a version of this patch backported to kernel version 4.14 was tested on
a Pixel 4 together with a modified [2] version of the Scudo allocator
that uses reference pages to implement pattern initialization. A
PDFium test program was used to collect the measurements like so:

$ wget https://static.docs.arm.com/ddi0487/fb/DDI0487F_b_armv8_arm.pdf
$ /system/bin/time -v ./pdfium_test --pages=1-100 DDI0487F_b_armv8_arm.pdf

and the median measurement over 100 runs was taken with three variants
of the allocator:

- "anon" is the baseline (no pattern init)
- "memset" is with pattern init of allocator pages implemented by
  initializing anonymous pages with memset
- "refpage" is with pattern init of allocator pages implemented
  by creating reference pages

All three allocator variants were measured on a kernel with this
patch applied. "anon" is without the Scudo patch, "refpage" is with
the Scudo patch [2] and "memset" is with a previous version of that
patch [3] with "#if 0" in place of "#if 1" in linux.cpp. The
measurements are as follows:

          Real time (s)    Max RSS (KiB)
anon        2.237081         107088
memset      2.252241         112180
refpage     2.243786         107128

We can see that RSS for refpage is almost the same as anon, and real
time overhead is 44% that of memset.

As an alternative to introducing this syscall, I considered using
userfaultfd to implement reference pages. However, after having taken
a detailed look at the interface, it does not seem suitable to be
used in the context of a general purpose allocator. For example,
UFFD_FEATURE_FORK support would be required in order to correctly
support fork(2) in a process that uses the allocator (although POSIX
does not guarantee support for allocating after fork, many allocators
including Scudo support it, and nothing stops the forked process from
page faulting pre-existing allocations after forking anyway), but
UFFD_FEATURE_FORK has been restricted to root by commit 3c1c24d91ffd
("userfaultfd: require CAP_SYS_PTRACE for UFFD_FEATURE_EVENT_FORK"),
making it unsuitable for use in an allocator. Furthermore, even if
the interface issues are resolved, I suspect (but have not measured)
that the cost of the multiple context switches between kernel and
userspace would be too high to be used in an allocator anyway.
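
For comparison, a userfaultfd-based emulation would also need a fault
handler thread along these lines (a sketch only: userfaultfd(2)
creation, UFFDIO_API/UFFDIO_REGISTER setup, error handling and
termination are all omitted), costing at least two user/kernel
transitions per fault on top of the fault itself:

#include <linux/userfaultfd.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>

#define PAGE_SZ 4096UL

static char pattern[PAGE_SZ] __attribute__((aligned(PAGE_SZ)));

/* uffd was registered with UFFDIO_REGISTER_MODE_MISSING over the
 * allocator's mappings; this thread resolves each missing-page fault
 * by copying the pattern page in.
 */
static void *uffd_handler(void *arg)
{
	int uffd = (int)(long)arg;

	memset(pattern, 0xaa, PAGE_SZ);
	for (;;) {
		struct uffd_msg msg;
		struct uffdio_copy copy;

		/* transition 1: wait for and read the fault event */
		if (read(uffd, &msg, sizeof(msg)) <= 0)
			break;
		if (msg.event != UFFD_EVENT_PAGEFAULT)
			continue;
		copy.dst = msg.arg.pagefault.address & ~(PAGE_SZ - 1);
		copy.src = (unsigned long)pattern;
		copy.len = PAGE_SZ;
		copy.mode = 0;
		/* transition 2: resolve the fault with a page copy */
		ioctl(uffd, UFFDIO_COPY, &copy);
	}
	return NULL;
}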

[1] https://community.arm.com/developer/ip-products/processors/b/processors-ip-blog/posts/enhancing-memory-safety
[2] https://github.com/pcc/llvm-project/commit/4871b739f86a631537d1725847a27ac148a392a0
[3] https://github.com/pcc/llvm-project/commit/a05f88aaebc7daf262d6885444d9845052026f4b

Signed-off-by: Peter Collingbourne <pcc@google.com>
Reported-by: kernel test robot <lkp@intel.com>
---
v3:
- Fix build errors reported by kernel test robot

v2:
- Switch to an approach of adding a new syscall instead of modifying
  mmap(2)
- Move ownership of the reference page to the struct file to avoid
  refcount overflows

 arch/alpha/kernel/syscalls/syscall.tbl      |  1 +
 arch/arm/tools/syscall.tbl                  |  1 +
 arch/arm64/include/asm/unistd.h             |  2 +-
 arch/arm64/include/asm/unistd32.h           |  2 +
 arch/ia64/kernel/syscalls/syscall.tbl       |  1 +
 arch/m68k/kernel/syscalls/syscall.tbl       |  1 +
 arch/microblaze/kernel/syscalls/syscall.tbl |  1 +
 arch/mips/kernel/syscalls/syscall_n32.tbl   |  1 +
 arch/mips/kernel/syscalls/syscall_n64.tbl   |  1 +
 arch/mips/kernel/syscalls/syscall_o32.tbl   |  1 +
 arch/parisc/kernel/syscalls/syscall.tbl     |  1 +
 arch/powerpc/kernel/syscalls/syscall.tbl    |  1 +
 arch/s390/kernel/syscalls/syscall.tbl       |  1 +
 arch/sh/kernel/syscalls/syscall.tbl         |  1 +
 arch/sparc/kernel/syscalls/syscall.tbl      |  1 +
 arch/x86/entry/syscalls/syscall_32.tbl      |  1 +
 arch/x86/entry/syscalls/syscall_64.tbl      |  1 +
 arch/xtensa/kernel/syscalls/syscall.tbl     |  1 +
 include/linux/huge_mm.h                     |  7 +++
 include/linux/mm.h                          | 10 ++++
 include/linux/syscalls.h                    |  3 ++
 include/uapi/asm-generic/unistd.h           |  4 +-
 kernel/sys_ni.c                             |  1 +
 mm/Makefile                                 |  4 +-
 mm/gup.c                                    |  2 +-
 mm/memory.c                                 | 32 ++++++++----
 mm/migrate.c                                |  4 +-
 mm/refpage.c                                | 56 +++++++++++++++++++++
 28 files changed, 127 insertions(+), 16 deletions(-)
 create mode 100644 mm/refpage.c

Comments

John Hubbard Aug. 18, 2020, 2:31 a.m. UTC | #1
On 8/14/20 2:33 PM, Peter Collingbourne wrote:
> Introduce a new syscall, refpage_create, which returns a file
> descriptor which may be mapped using mmap. Such a mapping is similar

Hi,

For new syscalls, I think we need to put linux-api on CC, at the very
least. Adding them now. This would likely need man page support as well.
I'll put linux-doc on Cc, too.

> to an anonymous mapping, but instead of clean pages being backed by the
> zero page, they are instead backed by a so-called reference page, whose
> contents are specified using an argument to refpage_create. Loads from
> the mapping will load directly from the reference page, and initial
> stores to the mapping will copy-on-write from the reference page.
> 
> Reference pages are useful in circumstances where anonymous mappings
> combined with manual stores to memory would impose undesirable costs,
> either in terms of performance or RSS. Use cases are focused on heap
> allocators and include:
> 
> - Pattern initialization for the heap. This is where malloc(3) gives
>    you memory whose contents are filled with a non-zero pattern
>    byte, in order to help detect and mitigate bugs involving use
>    of uninitialized memory. Typically this is implemented by having
>    the allocator memset the allocation with the pattern byte before
>    returning it to the user, but for large allocations this can result
>    in a significant increase in RSS, especially for allocations that
>    are used sparsely. Even for dense allocations there is a needless
>    impact to startup performance when it may be better to amortize it
>    throughout the program. By creating allocations using a reference
>    page filled with the pattern byte, we can avoid these costs.
> 
> - Pre-tagged heap memory. Memory tagging [1] is an upcoming ARMv8.5
>    feature which allows for memory to be tagged in order to detect
>    certain kinds of memory errors with low overhead. In order to set
>    up an allocation to allow memory errors to be detected, the entire
>    allocation needs to have the same tag. The issue here is similar to
>    pattern initialization in the sense that large tagged allocations
>    will be expensive if the tagging is done up front. The idea is that
>    the allocator would create reference pages with each of the possible
>    memory tags, and use those reference pages for the large allocations.

That is good information, and it belongs in a man page, and/or Documentation/.

> 
> In order to measure the performance and RSS impact of reference pages,
> a version of this patch backported to kernel version 4.14 was tested on
> a Pixel 4 together with a modified [2] version of the Scudo allocator
> that uses reference pages to implement pattern initialization. A
> PDFium test program was used to collect the measurements like so:
> 
> $ wget https://static.docs.arm.com/ddi0487/fb/DDI0487F_b_armv8_arm.pdf
> $ /system/bin/time -v ./pdfium_test --pages=1-100 DDI0487F_b_armv8_arm.pdf
> 
> and the median measurement over 100 runs was taken with three variants
> of the allocator:
> 
> - "anon" is the baseline (no pattern init)
> - "memset" is with pattern init of allocator pages implemented by
>    initializing anonymous pages with memset
> - "refpage" is with pattern init of allocator pages implemented
>    by creating reference pages
> 
> All three allocator variants were measured on a kernel with this
> patch applied. "anon" is without the Scudo patch, "refpage" is with
> the Scudo patch [2] and "memset" is with a previous version of that
> patch [3] with "#if 0" in place of "#if 1" in linux.cpp. The
> measurements are as follows:
> 
>            Real time (s)    Max RSS (KiB)
> anon        2.237081         107088
> memset      2.252241         112180
> refpage     2.243786         107128
> 
> We can see that RSS for refpage is almost the same as anon, and real
> time overhead is 44% that of memset.
> 

Are some of the numbers stale, maybe? Try as I might, I cannot combine
anything above to come up with 44%. :)


> As an alternative to introducing this syscall, I considered using
> userfaultfd to implement reference pages. However, after having taken
> a detailed look at the interface, it does not seem suitable to be
> used in the context of a general purpose allocator. For example,
> UFFD_FEATURE_FORK support would be required in order to correctly
> support fork(2) in a process that uses the allocator (although POSIX
> does not guarantee support for allocating after fork, many allocators
> including Scudo support it, and nothing stops the forked process from
> page faulting pre-existing allocations after forking anyway), but
> UFFD_FEATURE_FORK has been restricted to root by commit 3c1c24d91ffd
> ("userfaultfd: require CAP_SYS_PTRACE for UFFD_FEATURE_EVENT_FORK"),
> making it unsuitable for use in an allocator. Furthermore, even if
> the interface issues are resolved, I suspect (but have not measured)
> that the cost of the multiple context switches between kernel and
> userspace would be too high to be used in an allocator anyway.


That whole blurb is good for a cover letter, and perhaps an "alternatives
considered" section in Documentation/. However, it should be omitted from
the patch commit description, IMHO.

...
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index 467302056e17..a1dc07ff914a 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -175,6 +175,13 @@ static inline bool transhuge_vma_suitable(struct vm_area_struct *vma,
>   
>   	if (haddr < vma->vm_start || haddr + HPAGE_PMD_SIZE > vma->vm_end)
>   		return false;
> +
> +	/*
> +	 * Transparent hugepages not currently supported for anonymous VMAs with
> +	 * reference pages
> +	 */
> +	if (unlikely(vma->vm_private_data))


This should use a helper function, such as is_reference_page_vma(). Because the
assumption that "vma->vm_private_data means a reference page vma" is much too
fragile. More below.


> +		return false;
>   	return true;
>   }
>   
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index e7602a3bcef1..ac375e398690 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -3122,5 +3122,15 @@ unsigned long wp_shared_mapping_range(struct address_space *mapping,
>   
>   extern int sysctl_nr_trim_pages;
>   
> +static inline int is_zero_or_refpage_pfn(struct vm_area_struct *vma,
> +					 unsigned long pfn)
> +{
> +	if (is_zero_pfn(pfn))
> +		return true;
> +	if (unlikely(!vma->vm_ops && vma->vm_private_data))
> +		return pfn == page_to_pfn((struct page *)vma->vm_private_data);

As foreshadowed above, this needs a helper function. And the criteria for
deciding that it's a reference page needs to be more robust than just "no vm_ops,
vm_private_data is set, and it matches my page". Needs some more decisive
information.

Maybe setting vm_ops to some new "refpage" ops would be the way to go, for that.

...
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 5053439be6ab..6e9246d09e95 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -2841,8 +2841,8 @@ static void migrate_vma_insert_page(struct migrate_vma *migrate,
>   	pmd_t *pmdp;
>   	pte_t *ptep;
>   
> -	/* Only allow populating anonymous memory */
> -	if (!vma_is_anonymous(vma))
> +	/* Only allow populating anonymous memory without a reference page */
> +	if (!vma_is_anonymous(vma) || vma->vm_private_data)

Same thing here: helper function, instead of open-coding the assumption about
what makes a refpage vma.

...

> +SYSCALL_DEFINE2(refpage_create, const void __user *, content, unsigned long,
> +		flags)
> +{
> +	unsigned long content_addr = (unsigned long)content;
> +	struct page *userpage, *refpage;
> +	int fd;
> +
> +	if (flags != 0)
> +		return -EINVAL;
> +
> +	refpage = alloc_page(GFP_KERNEL);
> +	if (!refpage)
> +		return -ENOMEM;
> +
> +	if ((content_addr & (PAGE_SIZE - 1)) != 0 ||
> +	    get_user_pages(content_addr, 1, 0, &userpage, 0) != 1) {
> +		put_page(refpage);
> +		return -EFAULT;
> +	}
> +
> +	copy_highpage(refpage, userpage);
> +	put_page(userpage);
> +
> +	fd = anon_inode_getfd("[refpage]", &refpage_file_operations, refpage,
> +			      O_RDONLY | O_CLOEXEC);

Seems like the flags argument should have an influence on these flags, rather
than hard-coding O_CLOEXEC, right?


thanks,
Matthew Wilcox Aug. 18, 2020, 3 a.m. UTC | #2
On Mon, Aug 17, 2020 at 07:31:39PM -0700, John Hubbard wrote:
> >            Real time (s)    Max RSS (KiB)
> > anon        2.237081         107088
> > memset      2.252241         112180
> > refpage     2.243786         107128
> > 
> > We can see that RSS for refpage is almost the same as anon, and real
> > time overhead is 44% that of memset.
> > 
> 
> Are some of the numbers stale, maybe? Try as I might, I cannot combine
> anything above to come up with 44%. :)

You're not trying hard enough ;-)

(2.252241 - 2.237081) / 2.237081 = .00677668801442594166
(2.243786 - 2.237081) / 2.237081 = .00299720930981041812
.00299720930981041812 / .00677668801442594166 = .44228232189973614648

tadaa!

As I said last time this was posted, I'm just not excited by this.  We go
from having a 0.68% time overhead down to an 0.30% overhead, which just
doesn't move the needle for me.  Maybe there's a better benchmark than
this to show benefits from this patchset.
Jann Horn Aug. 18, 2020, 3:42 a.m. UTC | #3
[I started writing a reply before I saw what Matthew said, so I
decided to finish going through this patch...]

On Fri, Aug 14, 2020 at 11:33 PM Peter Collingbourne <pcc@google.com> wrote:
> Introduce a new syscall, refpage_create, which returns a file
> descriptor which may be mapped using mmap. Such a mapping is similar
> to an anonymous mapping, but instead of clean pages being backed by the
> zero page, they are instead backed by a so-called reference page, whose
> contents are specified using an argument to refpage_create. Loads from
> the mapping will load directly from the reference page, and initial
> stores to the mapping will copy-on-write from the reference page.
[...]
> - Pattern initialization for the heap. This is where malloc(3) gives
>   you memory whose contents are filled with a non-zero pattern
>   byte, in order to help detect and mitigate bugs involving use
>   of uninitialized memory. Typically this is implemented by having
>   the allocator memset the allocation with the pattern byte before
>   returning it to the user, but for large allocations this can result
>   in a significant increase in RSS, especially for allocations that
>   are used sparsely. Even for dense allocations there is a needless
>   impact to startup performance when it may be better to amortize it
>   throughout the program. By creating allocations using a reference
>   page filled with the pattern byte, we can avoid these costs.
>
> - Pre-tagged heap memory. Memory tagging [1] is an upcoming ARMv8.5
>   feature which allows for memory to be tagged in order to detect
>   certain kinds of memory errors with low overhead. In order to set
>   up an allocation to allow memory errors to be detected, the entire
>   allocation needs to have the same tag. The issue here is similar to
>   pattern initialization in the sense that large tagged allocations
>   will be expensive if the tagging is done up front. The idea is that
>   the allocator would create reference pages with each of the possible
>   memory tags, and use those reference pages for the large allocations.

This means that you'll end up with one VMA per large heap object,
instead of being able to put them all into one big VMA, right?

> In order to measure the performance and RSS impact of reference pages,
> a version of this patch backported to kernel version 4.14 was tested on
> a Pixel 4 together with a modified [2] version of the Scudo allocator
> that uses reference pages to implement pattern initialization. A
> PDFium test program was used to collect the measurements like so:
>
> $ wget https://static.docs.arm.com/ddi0487/fb/DDI0487F_b_armv8_arm.pdf
> $ /system/bin/time -v ./pdfium_test --pages=1-100 DDI0487F_b_armv8_arm.pdf
>
> and the median measurement over 100 runs was taken with three variants
> of the allocator:
>
> - "anon" is the baseline (no pattern init)
> - "memset" is with pattern init of allocator pages implemented by
>   initializing anonymous pages with memset

For the memory tagging usecase, this would use something like the
STZ2G instruction, which is specialized for zeroing and re-tagging
memory at high speed, right? Would STZ2G be expected to be faster than
a current memset() implementation? I don't know much about how the
hardware for this stuff works, but I'm guessing that STZ2G _miiiiight_
be optimized to reduce the amount of data transmitted over the memory
bus, or something like that?

Also, for that memset() test, did you do that on a fresh VMA (meaning
the memset() will constantly take page faults AFAIK), or did you do it
on memory that had been written to before (which should AFAIK be a bit
faster)?

> - "refpage" is with pattern init of allocator pages implemented
>   by creating reference pages
[...]
> As an alternative to introducing this syscall, I considered using
> userfaultfd to implement reference pages. However, after having taken
> a detailed look at the interface, it does not seem suitable to be
> used in the context of a general purpose allocator. For example,
> UFFD_FEATURE_FORK support would be required in order to correctly
> support fork(2) in a process that uses the allocator (although POSIX
> does not guarantee support for allocating after fork, many allocators
> including Scudo support it, and nothing stops the forked process from
> page faulting pre-existing allocations after forking anyway), but
> UFFD_FEATURE_FORK has been restricted to root by commit 3c1c24d91ffd
> ("userfaultfd: require CAP_SYS_PTRACE for UFFD_FEATURE_EVENT_FORK"),
> making it unsuitable for use in an allocator.

That part should be fairly easy to fix by hooking an ioctl command up
to the ->read handler.

[...]
> @@ -3347,11 +3348,16 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
>         if (unlikely(pmd_trans_unstable(vmf->pmd)))
>                 return 0;
>
> -       /* Use the zero-page for reads */
> +       /* Use the zero-page, or reference page if set, for reads */
>         if (!(vmf->flags & FAULT_FLAG_WRITE) &&
>                         !mm_forbids_zeropage(vma->vm_mm)) {
> -               entry = pte_mkspecial(pfn_pte(my_zero_pfn(vmf->address),
> -                                               vma->vm_page_prot));
> +               unsigned long pfn;
> +
> +               if (unlikely(refpage))
> +                       pfn = page_to_pfn(refpage);
> +               else
> +                       pfn = my_zero_pfn(vmf->address);
> +               entry = pte_mkspecial(pfn_pte(pfn, vma->vm_page_prot));

If someone maps this thing with MAP_SHARED and PROT_READ|PROT_WRITE,
will this create a writable special PTE, or am I missing something?

[...]
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 5053439be6ab..6e9246d09e95 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -2841,8 +2841,8 @@ static void migrate_vma_insert_page(struct migrate_vma *migrate,
>         pmd_t *pmdp;
>         pte_t *ptep;
>
> -       /* Only allow populating anonymous memory */
> -       if (!vma_is_anonymous(vma))
> +       /* Only allow populating anonymous memory without a reference page */
> +       if (!vma_is_anonymous(vma) || vma->vm_private_data)
>                 goto abort;
>
>         pgdp = pgd_offset(mm, addr);
> diff --git a/mm/refpage.c b/mm/refpage.c
[...]
> +static int refpage_mmap(struct file *file, struct vm_area_struct *vma)
> +{
> +       vma_set_anonymous(vma);

I wonder whether it would make more sense to have your own
vm_operations_struct and handle faults through its hooks instead of
messing around in the generic code for this.

> +       vma->vm_private_data = vma->vm_file->private_data;
> +       return 0;
> +}
[...]
> +SYSCALL_DEFINE2(refpage_create, const void __user *, content, unsigned long,
> +               flags)
> +{
> +       unsigned long content_addr = (unsigned long)content;
> +       struct page *userpage, *refpage;
> +       int fd;
> +
> +       if (flags != 0)
> +               return -EINVAL;
> +
> +       refpage = alloc_page(GFP_KERNEL);

GFP_USER, maybe?

> +       if (!refpage)
> +               return -ENOMEM;

> +       if ((content_addr & (PAGE_SIZE - 1)) != 0 ||
> +           get_user_pages(content_addr, 1, 0, &userpage, 0) != 1) {
> +               put_page(refpage);
> +               return -EFAULT;
> +       }
> +
> +       copy_highpage(refpage, userpage);
> +       put_page(userpage);

Why not this instead?

if (copy_from_user(page_address(refpage), content, PAGE_SIZE))
  goto out_put_page;

If that is because copy_highpage() is going to include some magic
memory-tag-copying thing or so, this needs a comment.

> +       fd = anon_inode_getfd("[refpage]", &refpage_file_operations, refpage,
> +                             O_RDONLY | O_CLOEXEC);
> +       if (fd < 0)
> +               put_page(refpage);
> +
> +       return fd;
> +}
John Hubbard Aug. 18, 2020, 6:25 p.m. UTC | #4
On 8/17/20 8:00 PM, Matthew Wilcox wrote:
> On Mon, Aug 17, 2020 at 07:31:39PM -0700, John Hubbard wrote:
>>>             Real time (s)    Max RSS (KiB)
>>> anon        2.237081         107088
>>> memset      2.252241         112180
>>> refpage     2.243786         107128
>>>
>>> We can see that RSS for refpage is almost the same as anon, and real
>>> time overhead is 44% that of memset.
>>>
>>
>> Are some of the numbers stale, maybe? Try as I might, I cannot combine
>> anything above to come up with 44%. :)
> 
> You're not trying hard enough ;-)
> 
> (2.252241 - 2.237081) / 2.237081 = .00677668801442594166
> (2.243786 - 2.237081) / 2.237081 = .00299720930981041812
> .00299720930981041812 / .00677668801442594166 = .44228232189973614648
> 
> tadaa!

haha, OK then! :) Next time I may try harder, but on the other hand my
interpretation of the results is still "this is a small effect", even
if there is a way to make it sound large by comparing the 3rd significant
digits of the results...

> 
> As I said last time this was posted, I'm just not excited by this.  We go
> from having a 0.68% time overhead down to an 0.30% overhead, which just
> doesn't move the needle for me.  Maybe there's a better benchmark than
> this to show benefits from this patchset.
> 

Yes, I wonder if there is an artificial workload that just uses refpages
really extensively, maybe we can get some good solid improvements shown
with that? Otherwise, it seems like we've just learned that memset is
actually pretty good in this case. :)

thanks,
Peter Collingbourne June 19, 2021, 9:20 a.m. UTC | #5
[Apologies for the delay in getting back to you; other work ended up
taking priority and now I'm back to looking at this.]

On Tue, Aug 18, 2020 at 11:25 AM John Hubbard <jhubbard@nvidia.com> wrote:
>
> On 8/17/20 8:00 PM, Matthew Wilcox wrote:
> > On Mon, Aug 17, 2020 at 07:31:39PM -0700, John Hubbard wrote:
> >>>             Real time (s)    Max RSS (KiB)
> >>> anon        2.237081         107088
> >>> memset      2.252241         112180
> >>> refpage     2.243786         107128
> >>>
> >>> We can see that RSS for refpage is almost the same as anon, and real
> >>> time overhead is 44% that of memset.
> >>>
> >>
> >> Are some of the numbers stale, maybe? Try as I might, I cannot combine
> >> anything above to come up with 44%. :)
> >
> > You're not trying hard enough ;-)
> >
> > (2.252241 - 2.237081) / 2.237081 = .00677668801442594166
> > (2.243786 - 2.237081) / 2.237081 = .00299720930981041812
> > .00299720930981041812 / .00677668801442594166 = .44228232189973614648
> >
> > tadaa!
>
> haha, OK then! :) Next time I may try harder, but on the other hand my
> interpretation of the results is still "this is a small effect", even
> if there is a way to make it sound large by comparing the 3rd significant
> digits of the results...
>
> >
> > As I said last time this was posted, I'm just not excited by this.  We go
> > from having a 0.68% time overhead down to an 0.30% overhead, which just
> > doesn't move the needle for me.  Maybe there's a better benchmark than
> > this to show benefits from this patchset.
> >
>

Remember that this is a "realistic" benchmark, so it's doing plenty of
other work besides faulting pages. So I don't think we should expect
to see a massive improvement here.

I ran the pdfium benchmark again but I couldn't see the same
improvements that I got last time. This seems to be because pdfium has
since switched to its own allocator, bypassing the system allocator. I
think the gains should be larger with the memset optimization that
I've implemented, but I'm still in the process of finding a suitable
realistic benchmark that uses the system allocator.

But I would find a 0.4% perf improvement convincing enough,
personally, given that the workload is realistic. Consider a certain
large company which spends $billions annually on data centers. In that
environment a 0.4% performance improvement on realistic workloads can
translate to $millions of savings. And that's not taking into account
the memory savings which are important both in mobile environments and
in data centers.

> Yes, I wonder if there is an artificial workload that just uses refpages
> really extensively, maybe we can get some good solid improvements shown
> with that? Otherwise, it seems like we've just learned that memset is
> actually pretty good in this case. :)

Yes, it's possible to see the performance improvement here more
clearly with a microbenchmark. I've updated the commit message in v4
to include a microbenchmark program and some performance numbers from
it.
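
The microbenchmark is along these lines (a sketch here, not the exact
program from the v4 commit message; error checking omitted):

#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

#define SIZE (256UL << 20) /* 256 MiB */

static double now(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return ts.tv_sec + ts.tv_nsec / 1e9;
}

int main(void)
{
	long psz = sysconf(_SC_PAGESIZE);
	volatile uint8_t sink = 0;
	size_t i;
	double t;

	/* memset variant: pattern-init an anonymous mapping up front. */
	uint8_t *a = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
			  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	t = now();
	memset(a, 0xaa, SIZE);
	printf("memset:  %f s\n", now() - t);

	/* refpage variant: reads fault in the reference page read-only,
	 * so no copies are made and no RSS is dirtied.
	 */
	uint8_t *pat = mmap(NULL, psz, PROT_READ | PROT_WRITE,
			    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	memset(pat, 0xaa, psz);
	int fd = syscall(440 /* __NR_refpage_create (assumed) */, pat, 0);
	uint8_t *r = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
			  MAP_PRIVATE, fd, 0);
	t = now();
	for (i = 0; i < SIZE; i += psz)
		sink += r[i];
	printf("refpage: %f s (sum %u)\n", now() - t, (unsigned)sink);
	return 0;
}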

Peter
Peter Collingbourne June 19, 2021, 9:21 a.m. UTC | #6
On Mon, Aug 17, 2020 at 8:43 PM Jann Horn <jannh@google.com> wrote:
>
> [I started writing a reply before I saw what Matthew said, so I
> decided to finish going through this patch...]
>
> On Fri, Aug 14, 2020 at 11:33 PM Peter Collingbourne <pcc@google.com> wrote:
> > Introduce a new syscall, refpage_create, which returns a file
> > descriptor which may be mapped using mmap. Such a mapping is similar
> > to an anonymous mapping, but instead of clean pages being backed by the
> > zero page, they are instead backed by a so-called reference page, whose
> > contents are specified using an argument to refpage_create. Loads from
> > the mapping will load directly from the reference page, and initial
> > stores to the mapping will copy-on-write from the reference page.
> [...]
> > - Pattern initialization for the heap. This is where malloc(3) gives
> >   you memory whose contents are filled with a non-zero pattern
> >   byte, in order to help detect and mitigate bugs involving use
> >   of uninitialized memory. Typically this is implemented by having
> >   the allocator memset the allocation with the pattern byte before
> >   returning it to the user, but for large allocations this can result
> >   in a significant increase in RSS, especially for allocations that
> >   are used sparsely. Even for dense allocations there is a needless
> >   impact to startup performance when it may be better to amortize it
> >   throughout the program. By creating allocations using a reference
> >   page filled with the pattern byte, we can avoid these costs.
> >
> > - Pre-tagged heap memory. Memory tagging [1] is an upcoming ARMv8.5
> >   feature which allows for memory to be tagged in order to detect
> >   certain kinds of memory errors with low overhead. In order to set
> >   up an allocation to allow memory errors to be detected, the entire
> >   allocation needs to have the same tag. The issue here is similar to
> >   pattern initialization in the sense that large tagged allocations
> >   will be expensive if the tagging is done up front. The idea is that
> >   the allocator would create reference pages with each of the possible
> >   memory tags, and use those reference pages for the large allocations.
>
> This means that you'll end up with one VMA per large heap object,
> instead of being able to put them all into one big VMA, right?

Yes, although in Scudo we create guard pages around each large
allocation in order to catch OOB accesses and these correspond to
their own VMAs, so we already unavoidably have 2-3 VMAs per allocation
and a switch to reference pages wouldn't change anything.

> > In order to measure the performance and RSS impact of reference pages,
> > a version of this patch backported to kernel version 4.14 was tested on
> > a Pixel 4 together with a modified [2] version of the Scudo allocator
> > that uses reference pages to implement pattern initialization. A
> > PDFium test program was used to collect the measurements like so:
> >
> > $ wget https://static.docs.arm.com/ddi0487/fb/DDI0487F_b_armv8_arm.pdf
> > $ /system/bin/time -v ./pdfium_test --pages=1-100 DDI0487F_b_armv8_arm.pdf
> >
> > and the median measurement over 100 runs was taken with three variants
> > of the allocator:
> >
> > - "anon" is the baseline (no pattern init)
> > - "memset" is with pattern init of allocator pages implemented by
> >   initializing anonymous pages with memset
>
> For the memory tagging usecase, this would use something like the
> STZ2G instruction, which is specialized for zeroing and re-tagging
> memory at high speed, right? Would STZ2G be expected to be faster than
> a current memset() implementation? I don't know much about how the
> hardware for this stuff works, but I'm guessing that STZ2G _miiiiight_
> be optimized to reduce the amount of data transmitted over the memory
> bus, or something like that?

It's actually the DC GZVA instruction that is expected to be fastest
for this use case, since it operates on a cache line at a time. We've
switched to using that to clear PROT_MTE pages in the kernel. In v4 of
this patch, I have developed optimizations that check whether the
reference page is a uniformly tagged zero page and if that is the
case, DC GZVA is used to reset it. The same technique is also used for
pattern initialization (i.e. memset to a pattern byte if the page is
uniform) which also helps with performance.
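
The uniformity check itself is cheap. Conceptually (a sketch; the
actual v4 code may differ):

/* If every byte of the reference page equals the first one, a clean
 * page can be reproduced with memset() (or DC GZVA for a zero page)
 * instead of a full page copy.
 */
static bool refpage_is_uniform(struct page *page, u8 *pattern)
{
	const u8 *addr = page_address(page);

	*pattern = addr[0];
	return memchr_inv(addr, addr[0], PAGE_SIZE) == NULL;
}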

> Also, for that memset() test, did you do that on a fresh VMA (meaning
> the memset() will constantly take page faults AFAIK), or did you do it
> on memory that had been written to before (which should AFAIK be a bit
> faster)?

I believe that it was using the default allocator settings, which for
Scudo implies aggressive use of munmap/MADV_DONTNEED and using a fresh
VMA for any new allocation. This is the best tradeoff for memory usage
but not necessarily for performance; however, that doesn't mean we
don't care at all about performance in this case.

> > - "refpage" is with pattern init of allocator pages implemented
> >   by creating reference pages
> [...]
> > As an alternative to introducing this syscall, I considered using
> > userfaultfd to implement reference pages. However, after having taken
> > a detailed look at the interface, it does not seem suitable to be
> > used in the context of a general purpose allocator. For example,
> > UFFD_FEATURE_FORK support would be required in order to correctly
> > support fork(2) in a process that uses the allocator (although POSIX
> > does not guarantee support for allocating after fork, many allocators
> > including Scudo support it, and nothing stops the forked process from
> > page faulting pre-existing allocations after forking anyway), but
> > UFFD_FEATURE_FORK has been restricted to root by commit 3c1c24d91ffd
> > ("userfaultfd: require CAP_SYS_PTRACE for UFFD_FEATURE_EVENT_FORK"),
> > making it unsuitable for use in an allocator.
>
> That part should be fairly easy to fix by hooking an ioctl command up
> to the ->read handler.
>
> [...]
> > @@ -3347,11 +3348,16 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
> >         if (unlikely(pmd_trans_unstable(vmf->pmd)))
> >                 return 0;
> >
> > -       /* Use the zero-page for reads */
> > +       /* Use the zero-page, or reference page if set, for reads */
> >         if (!(vmf->flags & FAULT_FLAG_WRITE) &&
> >                         !mm_forbids_zeropage(vma->vm_mm)) {
> > -               entry = pte_mkspecial(pfn_pte(my_zero_pfn(vmf->address),
> > -                                               vma->vm_page_prot));
> > +               unsigned long pfn;
> > +
> > +               if (unlikely(refpage))
> > +                       pfn = page_to_pfn(refpage);
> > +               else
> > +                       pfn = my_zero_pfn(vmf->address);
> > +               entry = pte_mkspecial(pfn_pte(pfn, vma->vm_page_prot));
>
> If someone maps this thing with MAP_SHARED and PROT_READ|PROT_WRITE,
> will this create a writable special PTE, or am I missing something?

It looks like we will return early here:

        /* File mapping without ->vm_ops ? */
        if (vma->vm_flags & VM_SHARED)
                return VM_FAULT_SIGBUS;

This also seems like it would be necessary to avoid letting the zero
page be mapped read-write.

> [...]
> > diff --git a/mm/migrate.c b/mm/migrate.c
> > index 5053439be6ab..6e9246d09e95 100644
> > --- a/mm/migrate.c
> > +++ b/mm/migrate.c
> > @@ -2841,8 +2841,8 @@ static void migrate_vma_insert_page(struct migrate_vma *migrate,
> >         pmd_t *pmdp;
> >         pte_t *ptep;
> >
> > -       /* Only allow populating anonymous memory */
> > -       if (!vma_is_anonymous(vma))
> > +       /* Only allow populating anonymous memory without a reference page */
> > +       if (!vma_is_anonymous(vma) || vma->vm_private_data)
> >                 goto abort;
> >
> >         pgdp = pgd_offset(mm, addr);
> > diff --git a/mm/refpage.c b/mm/refpage.c
> [...]
> > +static int refpage_mmap(struct file *file, struct vm_area_struct *vma)
> > +{
> > +       vma_set_anonymous(vma);
>
> I wonder whether it would make more sense to have your own
> vm_operations_struct and handle faults through its hooks instead of
> messing around in the generic code for this.

I considered it, but this wouldn't be compatible with the vm_ops
interface as currently exposed. As I mentioned in an earlier email:

> I considered having reference page mappings continue to provide a
> custom vm_ops, but this would require changes to the interface to
> preserve the specialness of the reference page. For example,
> vm_normal_page() would need to know to return null for the reference
> page in order to prevent _mapcount from overflowing, which could
> probably be done by adding a new interface to vm_ops, but that seemed
> more complicated than changing the anonymous page code path.

Of course, if we were to refactor anonymous mappings in the future so
that they are implemented via vm_ops, it seems like it would be more
appropriate to have this be implemented in the same way.

> > +       vma->vm_private_data = vma->vm_file->private_data;
> > +       return 0;
> > +}
> [...]
> > +SYSCALL_DEFINE2(refpage_create, const void __user *, content, unsigned long,
> > +               flags)
> > +{
> > +       unsigned long content_addr = (unsigned long)content;
> > +       struct page *userpage, *refpage;
> > +       int fd;
> > +
> > +       if (flags != 0)
> > +               return -EINVAL;
> > +
> > +       refpage = alloc_page(GFP_KERNEL);
>
> GFP_USER, maybe?

I would say that the page we're allocating here is owned by the
kernel, even though it's directly accessible (read-only) to userspace.
In this regard, it's maybe similar to the zero page.

> > +       if (!refpage)
> > +               return -ENOMEM;
>
> > +       if ((content_addr & (PAGE_SIZE - 1)) != 0 ||
> > +           get_user_pages(content_addr, 1, 0, &userpage, 0) != 1) {
> > +               put_page(refpage);
> > +               return -EFAULT;
> > +       }
> > +
> > +       copy_highpage(refpage, userpage);
> > +       put_page(userpage);
>
> Why not this instead?
>
> if (copy_from_user(page_address(refpage), content, PAGE_SIZE))
>   goto out_put_page;
>
> If that is because copy_highpage() is going to include some magic
> memory-tag-copying thing or so, this needs a comment.

Yes, with MTE we will need to zero the tags on the page so that it can
be used in PROT_MTE mappings even if the original page was not
PROT_MTE. In v4, I ended up making this an arch hook so that all of
the arch-specific stuff can go there.
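
The hook has roughly this shape (names assumed; a sketch, not the
actual v4 code):

/* Generic fallback: a plain page copy. An architecture can override
 * this to also sanitize metadata; e.g. arm64 would clear MTE tags so
 * the result is usable in PROT_MTE mappings even when the source page
 * was not PROT_MTE.
 */
#ifndef __HAVE_ARCH_COPY_REFPAGE
static inline void arch_copy_refpage(struct page *dst, struct page *src)
{
	copy_highpage(dst, src);
}
#endif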

Peter
Peter Collingbourne June 19, 2021, 9:21 a.m. UTC | #7
[Apologies for the delay in getting back to you; other work ended up
taking priority and now I'm back to looking at this.]

On Mon, Aug 17, 2020 at 7:31 PM John Hubbard <jhubbard@nvidia.com> wrote:
>
> On 8/14/20 2:33 PM, Peter Collingbourne wrote:
> > Introduce a new syscall, refpage_create, which returns a file
> > descriptor which may be mapped using mmap. Such a mapping is similar
>
> Hi,
>
> For new syscalls, I think we need to put linux-api on CC, at the very
> least. Adding them now. This would likely need man page support as well.
> I'll put linux-doc on Cc, too.

Thanks.

> > to an anonymous mapping, but instead of clean pages being backed by the
> > zero page, they are instead backed by a so-called reference page, whose
> > contents are specified using an argument to refpage_create. Loads from
> > the mapping will load directly from the reference page, and initial
> > stores to the mapping will copy-on-write from the reference page.
> >
> > Reference pages are useful in circumstances where anonymous mappings
> > combined with manual stores to memory would impose undesirable costs,
> > either in terms of performance or RSS. Use cases are focused on heap
> > allocators and include:
> >
> > - Pattern initialization for the heap. This is where malloc(3) gives
> >    you memory whose contents are filled with a non-zero pattern
> >    byte, in order to help detect and mitigate bugs involving use
> >    of uninitialized memory. Typically this is implemented by having
> >    the allocator memset the allocation with the pattern byte before
> >    returning it to the user, but for large allocations this can result
> >    in a significant increase in RSS, especially for allocations that
> >    are used sparsely. Even for dense allocations there is a needless
> >    impact to startup performance when it may be better to amortize it
> >    throughout the program. By creating allocations using a reference
> >    page filled with the pattern byte, we can avoid these costs.
> >
> > - Pre-tagged heap memory. Memory tagging [1] is an upcoming ARMv8.5
> >    feature which allows for memory to be tagged in order to detect
> >    certain kinds of memory errors with low overhead. In order to set
> >    up an allocation to allow memory errors to be detected, the entire
> >    allocation needs to have the same tag. The issue here is similar to
> >    pattern initialization in the sense that large tagged allocations
> >    will be expensive if the tagging is done up front. The idea is that
> >    the allocator would create reference pages with each of the possible
> >    memory tags, and use those reference pages for the large allocations.
>
> That is good information, and it belongs in a man page, and/or Documentation/.

I plan to write a man page for refpage_create(2) once this is closer to landing.

> >
> > In order to measure the performance and RSS impact of reference pages,
> > a version of this patch backported to kernel version 4.14 was tested on
> > a Pixel 4 together with a modified [2] version of the Scudo allocator
> > that uses reference pages to implement pattern initialization. A
> > PDFium test program was used to collect the measurements like so:
> >
> > $ wget https://static.docs.arm.com/ddi0487/fb/DDI0487F_b_armv8_arm.pdf
> > $ /system/bin/time -v ./pdfium_test --pages=1-100 DDI0487F_b_armv8_arm.pdf
> >
> > and the median measurement over 100 runs was taken with three variants
> > of the allocator:
> >
> > - "anon" is the baseline (no pattern init)
> > - "memset" is with pattern init of allocator pages implemented by
> >    initializing anonymous pages with memset
> > - "refpage" is with pattern init of allocator pages implemented
> >    by creating reference pages
> >
> > All three allocator variants were measured on a kernel with this
> > patch applied. "anon" is without the Scudo patch, "refpage" is with
> > the Scudo patch [2] and "memset" is with a previous version of that
> > patch [3] with "#if 0" in place of "#if 1" in linux.cpp. The
> > measurements are as follows:
> >
> >            Real time (s)    Max RSS (KiB)
> > anon        2.237081         107088
> > memset      2.252241         112180
> > refpage     2.243786         107128
> >
> > We can see that RSS for refpage is almost the same as anon, and real
> > time overhead is 44% that of memset.
> >
>
> Are some of the numbers stale, maybe? Try as I might, I cannot combine
> anything above to come up with 44%. :)
>
> > As an alternative to introducing this syscall, I considered using
> > userfaultfd to implement reference pages. However, after having taken
> > a detailed look at the interface, it does not seem suitable to be
> > used in the context of a general purpose allocator. For example,
> > UFFD_FEATURE_FORK support would be required in order to correctly
> > support fork(2) in a process that uses the allocator (although POSIX
> > does not guarantee support for allocating after fork, many allocators
> > including Scudo support it, and nothing stops the forked process from
> > page faulting pre-existing allocations after forking anyway), but
> > UFFD_FEATURE_FORK has been restricted to root by commit 3c1c24d91ffd
> > ("userfaultfd: require CAP_SYS_PTRACE for UFFD_FEATURE_EVENT_FORK"),
> > making it unsuitable for use in an allocator. Furthermore, even if
> > the interface issues are resolved, I suspect (but have not measured)
> > that the cost of the multiple context switches between kernel and
> > userspace would be too high to be used in an allocator anyway.
>
>
> That whole blurb is good for a cover letter, and perhaps an "alternatives
> considered" section in Documentation/. However, it should be omitted from
> the patch commit description, IMHO.

Okay, I moved it to the notes section of the commit message.

> ...
> > diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> > index 467302056e17..a1dc07ff914a 100644
> > --- a/include/linux/huge_mm.h
> > +++ b/include/linux/huge_mm.h
> > @@ -175,6 +175,13 @@ static inline bool transhuge_vma_suitable(struct vm_area_struct *vma,
> >
> >       if (haddr < vma->vm_start || haddr + HPAGE_PMD_SIZE > vma->vm_end)
> >               return false;
> > +
> > +     /*
> > +      * Transparent hugepages not currently supported for anonymous VMAs with
> > +      * reference pages
> > +      */
> > +     if (unlikely(vma->vm_private_data))
>
>
> This should use a helper function, such as is_reference_page_vma(). Because the
> assumption that "vma->vm_private_data means a reference page vma" is much too
> fragile. More below.

That makes sense. In v4 I've introduced a helper function.

> > +             return false;
> >       return true;
> >   }
> >
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index e7602a3bcef1..ac375e398690 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -3122,5 +3122,15 @@ unsigned long wp_shared_mapping_range(struct address_space *mapping,
> >
> >   extern int sysctl_nr_trim_pages;
> >
> > +static inline int is_zero_or_refpage_pfn(struct vm_area_struct *vma,
> > +                                      unsigned long pfn)
> > +{
> > +     if (is_zero_pfn(pfn))
> > +             return true;
> > +     if (unlikely(!vma->vm_ops && vma->vm_private_data))
> > +             return pfn == page_to_pfn((struct page *)vma->vm_private_data);
>
> As foreshadowed above, this needs a helper function. And the criteria for
> deciding that it's a reference page needs to be more robust than just "no vm_ops,
> vm_private_data is set, and it matches my page". Needs some more decisive
> information.
>
> Maybe setting vm_ops to some new "refpage" ops would be the way to go, for that.

As I mentioned in my reply to Jann, we can't set vm_ops without
introducing some unwanted behavior as a result of following the
non-anonymous VMA code path. What I ended up doing instead was to
check whether vm_file->f_op refers to the refpage file_operations
struct.
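
That is, something like the following (a sketch;
refpage_file_operations is the struct defined in mm/refpage.c):

static inline bool is_refpage_vma(struct vm_area_struct *vma)
{
	return vma->vm_file &&
	       vma->vm_file->f_op == &refpage_file_operations;
}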

It might be nice to introduce a VM_REFPAGE flag to make this check
more efficient, but this would first require extending vm_flags to 64
bits on 32-bit platforms since we're out of bits in vm_flags. From
looking around it looks like many people have attempted this over the
years; it looks like the most recent attempt is from this month:
https://www.spinics.net/lists/kernel/msg3961408.html

Let's see if it actually happens this time.

> ...
> > diff --git a/mm/migrate.c b/mm/migrate.c
> > index 5053439be6ab..6e9246d09e95 100644
> > --- a/mm/migrate.c
> > +++ b/mm/migrate.c
> > @@ -2841,8 +2841,8 @@ static void migrate_vma_insert_page(struct migrate_vma *migrate,
> >       pmd_t *pmdp;
> >       pte_t *ptep;
> >
> > -     /* Only allow populating anonymous memory */
> > -     if (!vma_is_anonymous(vma))
> > +     /* Only allow populating anonymous memory without a reference page */
> > +     if (!vma_is_anonymous(vma) || vma->vm_private_data)
>
> Same thing here: helper function, instead of open-coding the assumption about
> what makes a refpage vma.

Done.

> ...
>
> > +SYSCALL_DEFINE2(refpage_create, const void __user *, content, unsigned long,
> > +             flags)
> > +{
> > +     unsigned long content_addr = (unsigned long)content;
> > +     struct page *userpage, *refpage;
> > +     int fd;
> > +
> > +     if (flags != 0)
> > +             return -EINVAL;
> > +
> > +     refpage = alloc_page(GFP_KERNEL);
> > +     if (!refpage)
> > +             return -ENOMEM;
> > +
> > +     if ((content_addr & (PAGE_SIZE - 1)) != 0 ||
> > +         get_user_pages(content_addr, 1, 0, &userpage, 0) != 1) {
> > +             put_page(refpage);
> > +             return -EFAULT;
> > +     }
> > +
> > +     copy_highpage(refpage, userpage);
> > +     put_page(userpage);
> > +
> > +     fd = anon_inode_getfd("[refpage]", &refpage_file_operations, refpage,
> > +                           O_RDONLY | O_CLOEXEC);
>
> Seems like the flags argument should have an influence on these flags, rather
> than hard-coding O_CLOEXEC, right?

I couldn't see a use case for having one of these FDs without
O_CLOEXEC. If someone really wants a non-CLOEXEC refpage FD, they can
use fcntl to clear the CLOEXEC bit.
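
For example:

	int flags = fcntl(fd, F_GETFD);

	fcntl(fd, F_SETFD, flags & ~FD_CLOEXEC);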

I only added the flags argument to support future extension as described in:
https://www.kernel.org/doc/html/v5.12/process/adding-syscalls.html#designing-the-api-planning-for-extension

Peter

Patch

diff --git a/arch/alpha/kernel/syscalls/syscall.tbl b/arch/alpha/kernel/syscalls/syscall.tbl
index a28fb211881d..efbdbceba085 100644
--- a/arch/alpha/kernel/syscalls/syscall.tbl
+++ b/arch/alpha/kernel/syscalls/syscall.tbl
@@ -479,3 +479,4 @@ 
 547	common	openat2				sys_openat2
 548	common	pidfd_getfd			sys_pidfd_getfd
 549	common	faccessat2			sys_faccessat2
+550	common	refpage_create			sys_refpage_create
diff --git a/arch/arm/tools/syscall.tbl b/arch/arm/tools/syscall.tbl
index 7e8ee4adf269..68f0a0822ed6 100644
--- a/arch/arm/tools/syscall.tbl
+++ b/arch/arm/tools/syscall.tbl
@@ -453,3 +453,4 @@ 
 437	common	openat2				sys_openat2
 438	common	pidfd_getfd			sys_pidfd_getfd
 439	common	faccessat2			sys_faccessat2
+440	common	refpage_create			sys_refpage_create
diff --git a/arch/arm64/include/asm/unistd.h b/arch/arm64/include/asm/unistd.h
index 3b859596840d..b3b2019f8d16 100644
--- a/arch/arm64/include/asm/unistd.h
+++ b/arch/arm64/include/asm/unistd.h
@@ -38,7 +38,7 @@ 
 #define __ARM_NR_compat_set_tls		(__ARM_NR_COMPAT_BASE + 5)
 #define __ARM_NR_COMPAT_END		(__ARM_NR_COMPAT_BASE + 0x800)
 
-#define __NR_compat_syscalls		440
+#define __NR_compat_syscalls		441
 #endif
 
 #define __ARCH_WANT_SYS_CLONE
diff --git a/arch/arm64/include/asm/unistd32.h b/arch/arm64/include/asm/unistd32.h
index 17e81bd9a2d3..18ff5382341c 100644
--- a/arch/arm64/include/asm/unistd32.h
+++ b/arch/arm64/include/asm/unistd32.h
@@ -887,6 +887,8 @@  __SYSCALL(__NR_openat2, sys_openat2)
 __SYSCALL(__NR_pidfd_getfd, sys_pidfd_getfd)
 #define __NR_faccessat2 439
 __SYSCALL(__NR_faccessat2, sys_faccessat2)
+#define __NR_refpage_create 440
+__SYSCALL(__NR_refpage_create, sys_refpage_create)
 
 /*
  * Please add new compat syscalls above this comment and update
diff --git a/arch/ia64/kernel/syscalls/syscall.tbl b/arch/ia64/kernel/syscalls/syscall.tbl
index ced9c83e47c9..dd58ddc63d92 100644
--- a/arch/ia64/kernel/syscalls/syscall.tbl
+++ b/arch/ia64/kernel/syscalls/syscall.tbl
@@ -360,3 +360,4 @@ 
 437	common	openat2				sys_openat2
 438	common	pidfd_getfd			sys_pidfd_getfd
 439	common	faccessat2			sys_faccessat2
+440	common	refpage_create			sys_refpage_create
diff --git a/arch/m68k/kernel/syscalls/syscall.tbl b/arch/m68k/kernel/syscalls/syscall.tbl
index 1a4822de7292..fe9c2ffcbf63 100644
--- a/arch/m68k/kernel/syscalls/syscall.tbl
+++ b/arch/m68k/kernel/syscalls/syscall.tbl
@@ -439,3 +439,4 @@ 
 437	common	openat2				sys_openat2
 438	common	pidfd_getfd			sys_pidfd_getfd
 439	common	faccessat2			sys_faccessat2
+440	common	refpage_create			sys_refpage_create
diff --git a/arch/microblaze/kernel/syscalls/syscall.tbl b/arch/microblaze/kernel/syscalls/syscall.tbl
index a3f4be8e7238..d8ef9318ac7f 100644
--- a/arch/microblaze/kernel/syscalls/syscall.tbl
+++ b/arch/microblaze/kernel/syscalls/syscall.tbl
@@ -445,3 +445,4 @@ 
 437	common	openat2				sys_openat2
 438	common	pidfd_getfd			sys_pidfd_getfd
 439	common	faccessat2			sys_faccessat2
+440	common	refpage_create			sys_refpage_create
diff --git a/arch/mips/kernel/syscalls/syscall_n32.tbl b/arch/mips/kernel/syscalls/syscall_n32.tbl
index 6b4ee92e3aed..8970f55475c4 100644
--- a/arch/mips/kernel/syscalls/syscall_n32.tbl
+++ b/arch/mips/kernel/syscalls/syscall_n32.tbl
@@ -378,3 +378,4 @@ 
 437	n32	openat2				sys_openat2
 438	n32	pidfd_getfd			sys_pidfd_getfd
 439	n32	faccessat2			sys_faccessat2
+440	n32	refpage_create			sys_refpage_create
diff --git a/arch/mips/kernel/syscalls/syscall_n64.tbl b/arch/mips/kernel/syscalls/syscall_n64.tbl
index 391acbf425a0..894645fc00a2 100644
--- a/arch/mips/kernel/syscalls/syscall_n64.tbl
+++ b/arch/mips/kernel/syscalls/syscall_n64.tbl
@@ -354,3 +354,4 @@ 
 437	n64	openat2				sys_openat2
 438	n64	pidfd_getfd			sys_pidfd_getfd
 439	n64	faccessat2			sys_faccessat2
+440	n64	refpage_create			sys_refpage_create
diff --git a/arch/mips/kernel/syscalls/syscall_o32.tbl b/arch/mips/kernel/syscalls/syscall_o32.tbl
index 5727c5187508..43957e224dbf 100644
--- a/arch/mips/kernel/syscalls/syscall_o32.tbl
+++ b/arch/mips/kernel/syscalls/syscall_o32.tbl
@@ -427,3 +427,4 @@ 
 437	o32	openat2				sys_openat2
 438	o32	pidfd_getfd			sys_pidfd_getfd
 439	o32	faccessat2			sys_faccessat2
+440	o32	refpage_create			sys_refpage_create
diff --git a/arch/parisc/kernel/syscalls/syscall.tbl b/arch/parisc/kernel/syscalls/syscall.tbl
index 292baabefade..d6d8d7c5e60a 100644
--- a/arch/parisc/kernel/syscalls/syscall.tbl
+++ b/arch/parisc/kernel/syscalls/syscall.tbl
@@ -437,3 +437,4 @@ 
 437	common	openat2				sys_openat2
 438	common	pidfd_getfd			sys_pidfd_getfd
 439	common	faccessat2			sys_faccessat2
+440	common	refpage_create			sys_refpage_create
diff --git a/arch/powerpc/kernel/syscalls/syscall.tbl b/arch/powerpc/kernel/syscalls/syscall.tbl
index be9f74546068..a73e79116f43 100644
--- a/arch/powerpc/kernel/syscalls/syscall.tbl
+++ b/arch/powerpc/kernel/syscalls/syscall.tbl
@@ -529,3 +529,4 @@ 
 437	common	openat2				sys_openat2
 438	common	pidfd_getfd			sys_pidfd_getfd
 439	common	faccessat2			sys_faccessat2
+440	common	refpage_create			sys_refpage_create
diff --git a/arch/s390/kernel/syscalls/syscall.tbl b/arch/s390/kernel/syscalls/syscall.tbl
index f1fda4375526..5ffa2aef5781 100644
--- a/arch/s390/kernel/syscalls/syscall.tbl
+++ b/arch/s390/kernel/syscalls/syscall.tbl
@@ -442,3 +442,4 @@ 
 437  common	openat2			sys_openat2			sys_openat2
 438  common	pidfd_getfd		sys_pidfd_getfd			sys_pidfd_getfd
 439  common	faccessat2		sys_faccessat2			sys_faccessat2
+440  common	refpage_create		sys_refpage_create		sys_refpage_create
diff --git a/arch/sh/kernel/syscalls/syscall.tbl b/arch/sh/kernel/syscalls/syscall.tbl
index 96848db9659e..5e3d7f569603 100644
--- a/arch/sh/kernel/syscalls/syscall.tbl
+++ b/arch/sh/kernel/syscalls/syscall.tbl
@@ -442,3 +442,4 @@ 
 437	common	openat2				sys_openat2
 438	common	pidfd_getfd			sys_pidfd_getfd
 439	common	faccessat2			sys_faccessat2
+440	common	refpage_create			sys_refpage_create
diff --git a/arch/sparc/kernel/syscalls/syscall.tbl b/arch/sparc/kernel/syscalls/syscall.tbl
index 46024e80ee86..8b21deb46ef5 100644
--- a/arch/sparc/kernel/syscalls/syscall.tbl
+++ b/arch/sparc/kernel/syscalls/syscall.tbl
@@ -485,3 +485,4 @@ 
 437	common	openat2			sys_openat2
 438	common	pidfd_getfd			sys_pidfd_getfd
 439	common	faccessat2			sys_faccessat2
+440	common	refpage_create			sys_refpage_create
diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index e31a75262c9c..c614da77e1a0 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -444,3 +444,4 @@ 
 437	i386	openat2			sys_openat2
 438	i386	pidfd_getfd		sys_pidfd_getfd
 439	i386	faccessat2		sys_faccessat2
+440	i386	refpage_create		sys_refpage_create
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index 9d82078c949a..7f7ab6bab41e 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -361,6 +361,7 @@ 
 437	common	openat2			sys_openat2
 438	common	pidfd_getfd		sys_pidfd_getfd
 439	common	faccessat2		sys_faccessat2
+440	common	refpage_create		sys_refpage_create
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/arch/xtensa/kernel/syscalls/syscall.tbl b/arch/xtensa/kernel/syscalls/syscall.tbl
index d216ccba42f7..a086512e8f06 100644
--- a/arch/xtensa/kernel/syscalls/syscall.tbl
+++ b/arch/xtensa/kernel/syscalls/syscall.tbl
@@ -410,3 +410,4 @@ 
 437	common	openat2				sys_openat2
 438	common	pidfd_getfd			sys_pidfd_getfd
 439	common	faccessat2			sys_faccessat2
+440	common	refpage_create			sys_refpage_create
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 467302056e17..a1dc07ff914a 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -175,6 +175,13 @@  static inline bool transhuge_vma_suitable(struct vm_area_struct *vma,
 
 	if (haddr < vma->vm_start || haddr + HPAGE_PMD_SIZE > vma->vm_end)
 		return false;
+
+	/*
+	 * Transparent hugepages are not currently supported for anonymous
+	 * VMAs with reference pages.
+	 */
+	if (unlikely(vma->vm_private_data))
+		return false;
 	return true;
 }
 
diff --git a/include/linux/mm.h b/include/linux/mm.h
index e7602a3bcef1..ac375e398690 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3122,5 +3122,15 @@  unsigned long wp_shared_mapping_range(struct address_space *mapping,
 
 extern int sysctl_nr_trim_pages;
 
+static inline bool is_zero_or_refpage_pfn(struct vm_area_struct *vma,
+					   unsigned long pfn)
+{
+	if (is_zero_pfn(pfn))
+		return true;
+	if (unlikely(!vma->vm_ops && vma->vm_private_data))
+		return pfn == page_to_pfn((struct page *)vma->vm_private_data);
+	return false;
+}
+
 #endif /* __KERNEL__ */
 #endif /* _LINUX_MM_H */
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index dc2b827c81e5..7ee15611729e 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -831,6 +831,9 @@  asmlinkage long sys_mremap(unsigned long addr,
 			   unsigned long old_len, unsigned long new_len,
 			   unsigned long flags, unsigned long new_addr);
 
+/* mm/refpage.c */
+asmlinkage long sys_refpage_create(const void __user *content, unsigned long flags);
+
 /* security/keys/keyctl.c */
 asmlinkage long sys_add_key(const char __user *_type,
 			    const char __user *_description,
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index 995b36c2ea7d..26d99bd30e1e 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -859,9 +859,11 @@  __SYSCALL(__NR_openat2, sys_openat2)
 __SYSCALL(__NR_pidfd_getfd, sys_pidfd_getfd)
 #define __NR_faccessat2 439
 __SYSCALL(__NR_faccessat2, sys_faccessat2)
+#define __NR_refpage_create 440
+__SYSCALL(__NR_refpage_create, sys_refpage_create)
 
 #undef __NR_syscalls
-#define __NR_syscalls 440
+#define __NR_syscalls 441
 
 /*
  * 32 bit systems traditionally used different
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 3b69a560a7ac..01af430d31da 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -291,6 +291,7 @@  COND_SYSCALL(migrate_pages);
 COND_SYSCALL_COMPAT(migrate_pages);
 COND_SYSCALL(move_pages);
 COND_SYSCALL_COMPAT(move_pages);
+COND_SYSCALL(refpage_create);
 
 COND_SYSCALL(perf_event_open);
 COND_SYSCALL(accept4);
diff --git a/mm/Makefile b/mm/Makefile
index d5649f1c12c0..b2cc6f66d4e7 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -35,10 +35,10 @@  CFLAGS_init-mm.o += $(call cc-disable-warning, override-init)
 CFLAGS_init-mm.o += $(call cc-disable-warning, initializer-overrides)
 
 mmu-y			:= nommu.o
-mmu-$(CONFIG_MMU)	:= highmem.o memory.o mincore.o \
+mmu-$(CONFIG_MMU)	:= highmem.o ioremap.o memory.o mincore.o \
 			   mlock.o mmap.o mmu_gather.o mprotect.o mremap.o \
 			   msync.o page_vma_mapped.o pagewalk.o \
-			   pgtable-generic.o rmap.o vmalloc.o ioremap.o
+			   pgtable-generic.o refpage.o rmap.o vmalloc.o
 
 
 ifdef CONFIG_CROSS_MEMORY_ATTACH
diff --git a/mm/gup.c b/mm/gup.c
index 39e58df6925d..5b4c3e3c86b9 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -463,7 +463,7 @@  static struct page *follow_page_pte(struct vm_area_struct *vma,
 			goto out;
 		}
 
-		if (is_zero_pfn(pte_pfn(pte))) {
+		if (is_zero_or_refpage_pfn(vma, pte_pfn(pte))) {
 			page = pte_page(pte);
 		} else {
 			ret = follow_pfn_pte(vma, address, ptep, flags);
diff --git a/mm/memory.c b/mm/memory.c
index 228efaca75d3..3289fceae9ca 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -602,7 +602,7 @@  struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
 			return vma->vm_ops->find_special_page(vma, addr);
 		if (vma->vm_flags & (VM_PFNMAP | VM_MIXEDMAP))
 			return NULL;
-		if (is_zero_pfn(pfn))
+		if (is_zero_or_refpage_pfn(vma, pfn))
 			return NULL;
 		if (pte_devmap(pte))
 			return NULL;
@@ -628,7 +628,7 @@  struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
 		}
 	}
 
-	if (is_zero_pfn(pfn))
+	if (is_zero_or_refpage_pfn(vma, pfn))
 		return NULL;
 
 check_pfn:
@@ -1880,7 +1880,7 @@  static bool vm_mixed_ok(struct vm_area_struct *vma, pfn_t pfn)
 		return true;
 	if (pfn_t_special(pfn))
 		return true;
-	if (is_zero_pfn(pfn_t_to_pfn(pfn)))
+	if (is_zero_or_refpage_pfn(vma, pfn_t_to_pfn(pfn)))
 		return true;
 	return false;
 }
@@ -3322,6 +3322,7 @@  vm_fault_t do_swap_page(struct vm_fault *vmf)
 static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
 {
 	struct vm_area_struct *vma = vmf->vma;
+	struct page *refpage = vma->vm_private_data;
 	struct page *page;
 	vm_fault_t ret = 0;
 	pte_t entry;
@@ -3347,11 +3348,16 @@  static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
 	if (unlikely(pmd_trans_unstable(vmf->pmd)))
 		return 0;
 
-	/* Use the zero-page for reads */
+	/* Use the zero-page, or reference page if set, for reads */
 	if (!(vmf->flags & FAULT_FLAG_WRITE) &&
 			!mm_forbids_zeropage(vma->vm_mm)) {
-		entry = pte_mkspecial(pfn_pte(my_zero_pfn(vmf->address),
-						vma->vm_page_prot));
+		unsigned long pfn;
+
+		if (unlikely(refpage))
+			pfn = page_to_pfn(refpage);
+		else
+			pfn = my_zero_pfn(vmf->address);
+		entry = pte_mkspecial(pfn_pte(pfn, vma->vm_page_prot));
 		vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
 				vmf->address, &vmf->ptl);
 		if (!pte_none(*vmf->pte)) {
@@ -3372,9 +3378,17 @@  static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
 	/* Allocate our own private page. */
 	if (unlikely(anon_vma_prepare(vma)))
 		goto oom;
-	page = alloc_zeroed_user_highpage_movable(vma, vmf->address);
-	if (!page)
-		goto oom;
+
+	if (unlikely(refpage)) {
+		page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, vmf->address);
+		if (!page)
+			goto oom;
+		copy_user_highpage(page, refpage, vmf->address, vma);
+	} else {
+		page = alloc_zeroed_user_highpage_movable(vma, vmf->address);
+		if (!page)
+			goto oom;
+	}
 
 	if (mem_cgroup_charge(page, vma->vm_mm, GFP_KERNEL))
 		goto oom_free_page;
diff --git a/mm/migrate.c b/mm/migrate.c
index 5053439be6ab..6e9246d09e95 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2841,8 +2841,8 @@  static void migrate_vma_insert_page(struct migrate_vma *migrate,
 	pmd_t *pmdp;
 	pte_t *ptep;
 
-	/* Only allow populating anonymous memory */
-	if (!vma_is_anonymous(vma))
+	/* Only allow populating anonymous memory without a reference page */
+	if (!vma_is_anonymous(vma) || vma->vm_private_data)
 		goto abort;
 
 	pgdp = pgd_offset(mm, addr);
diff --git a/mm/refpage.c b/mm/refpage.c
new file mode 100644
index 000000000000..c5fc66a38a51
--- /dev/null
+++ b/mm/refpage.c
@@ -0,0 +1,64 @@
+// SPDX-License-Identifier: GPL-2.0-only
+
+#include <linux/anon_inodes.h>
+#include <linux/fs_context.h>
+#include <linux/highmem.h>
+#include <linux/mount.h>
+#include <linux/syscalls.h>
+
+static int refpage_mmap(struct file *file, struct vm_area_struct *vma)
+{
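+	/*
+	 * A reference page mapping behaves like an anonymous mapping, except
+	 * that clean pages are backed by the page stashed in vm_private_data
+	 * instead of the zero page; see do_anonymous_page().
+	 */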
+	vma_set_anonymous(vma);
+	vma->vm_private_data = vma->vm_file->private_data;
+	return 0;
+}
+
+static int refpage_release(struct inode *inode, struct file *file)
+{
+	put_page(file->private_data);
+	return 0;
+}
+
+static const struct file_operations refpage_file_operations = {
+	.mmap = refpage_mmap,
+	.release = refpage_release,
+};
+
+SYSCALL_DEFINE2(refpage_create, const void __user *, content, unsigned long,
+		flags)
+{
+	unsigned long content_addr = (unsigned long)content;
+	struct page *userpage, *refpage;
+	int fd;
+
+	if (flags != 0)
+		return -EINVAL;
+
+	refpage = alloc_page(GFP_KERNEL);
+	if (!refpage)
+		return -ENOMEM;
+
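+	/* The content must be a page-aligned, readable user address. */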
+	if ((content_addr & (PAGE_SIZE - 1)) != 0 ||
+	    get_user_pages(content_addr, 1, 0, &userpage, NULL) != 1) {
+		put_page(refpage);
+		return -EFAULT;
+	}
+
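+	/* Snapshot the content page; later changes to it are not reflected. */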
+	copy_highpage(refpage, userpage);
+	put_page(userpage);
+
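+	/* The fd holds the refpage reference; refpage_release() drops it. */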
+	fd = anon_inode_getfd("[refpage]", &refpage_file_operations, refpage,
+			      O_RDONLY | O_CLOEXEC);
+	if (fd < 0)
+		put_page(refpage);
+
+	return fd;
+}
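
For illustration, a minimal userspace sketch of the intended usage follows
(not part of the patch). It assumes a 4 KiB page size and the syscall
number 440 assigned above; the 0xaa pattern byte and the buffer sizes are
arbitrary, and error handling is abbreviated:

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef __NR_refpage_create
#define __NR_refpage_create 440
#endif

int main(void)
{
	/* refpage_create requires a page-aligned source buffer. */
	static char pattern[4096] __attribute__((aligned(4096)));
	memset(pattern, 0xaa, sizeof(pattern));

	int fd = syscall(__NR_refpage_create, pattern, 0);
	if (fd < 0)
		return 1;

	/* Clean pages of this mapping read back 0xaa rather than zero. */
	char *p = mmap(NULL, 1 << 20, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE, fd, 0);
	if (p == MAP_FAILED)
		return 1;

	printf("%x\n", p[8192] & 0xff);	/* prints aa; no page allocated */
	p[0] = 1;	/* first store copies the page from the refpage */
	return 0;
}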