
mm: introduce reference pages

Message ID 20200731203241.50427-1-pcc@google.com (mailing list archive)
State New, archived
Series mm: introduce reference pages

Commit Message

Peter Collingbourne July 31, 2020, 8:32 p.m. UTC
Introduce a new mmap flag, MAP_REFPAGE, that creates a mapping similar
to an anonymous mapping, but instead of clean pages being backed by the
zero page, they are instead backed by a so-called reference page, whose
address is specified using the offset argument to mmap. Loads from
the mapping will load directly from the reference page, and initial
stores to the mapping will copy-on-write from the reference page.
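
As a rough illustration of the proposed interface, an allocator might
use it for pattern initialization along the following lines (a sketch
only: the MAP_REFPAGE value is the one proposed by this patch and is
not in any released uapi header; error handling is omitted):

#include <sys/mman.h>
#include <sys/types.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>

#ifndef MAP_REFPAGE
#define MAP_REFPAGE 0x200000    /* value proposed by this patch */
#endif

/* One pattern-filled page, created once and shared by all mappings. */
static void *make_pattern_refpage(unsigned char pattern)
{
    long psz = sysconf(_SC_PAGESIZE);
    void *page = mmap(NULL, psz, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    if (page != MAP_FAILED)
        memset(page, pattern, psz);
    return page;
}

/* Large allocation backed by the reference page: clean pages read the
 * pattern through it, and the first store to a page copies-on-write. */
static void *map_with_refpage(size_t size, void *refpage)
{
    return mmap(NULL, size, PROT_READ | PROT_WRITE,
                MAP_PRIVATE | MAP_ANONYMOUS | MAP_REFPAGE, -1,
                (off_t)(uintptr_t)refpage);
}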

Reference pages are useful in circumstances where anonymous mappings
combined with manual stores to memory would impose undesirable costs,
either in terms of performance or RSS. Use cases are focused on heap
allocators and include:

- Pattern initialization for the heap. This is where malloc(3) gives
  you memory whose contents are filled with a non-zero pattern
  byte, in order to help detect and mitigate bugs involving use
  of uninitialized memory. Typically this is implemented by having
  the allocator memset the allocation with the pattern byte before
  returning it to the user, but for large allocations this can result
  in a significant increase in RSS, especially for allocations that
  are used sparsely. Even for dense allocations there is a needless
  impact to startup performance when it may be better to amortize it
  throughout the program. By creating allocations using a reference
  page filled with the pattern byte, we can avoid these costs.

- Pre-tagged heap memory. Memory tagging [1] is an upcoming ARMv8.5
  feature which allows for memory to be tagged in order to detect
  certain kinds of memory errors with low overhead. In order to set
  up an allocation to allow memory errors to be detected, the entire
  allocation needs to have the same tag. The issue here is similar to
  pattern initialization in the sense that large tagged allocations
  will be expensive if the tagging is done up front. The idea is that
  the allocator would create reference pages with each of the possible
  memory tags, and use those reference pages for the large allocations.

In order to measure the performance and RSS impact of reference pages,
a version of this patch backported to kernel version 4.14 was tested on
a Pixel 4 together with a modified [2] version of the Scudo allocator
that uses reference pages to implement pattern initialization. A
PDFium test program was used to collect the measurements like so:

$ wget https://static.docs.arm.com/ddi0487/fb/DDI0487F_b_armv8_arm.pdf
$ /system/bin/time -v ./pdfium_test --pages=1-100 DDI0487F_b_armv8_arm.pdf

and the median measurement over 100 runs was taken for each of three
variants of the allocator:

- "anon" is the baseline (no pattern init)
- "memset" is with pattern init of allocator pages implemented by
  initializing anonymous pages with memset
- "refpage" is with pattern init of allocator pages implemented
  by creating reference pages

All three variants were measured on a kernel with this patch backported.
"anon" is without the allocator change [2], "refpage" is with it, and
"memset" is with it but with "#if 0" in place of "#if 1" in linux.cpp.
The measurements are as follows:

          Real time (s)    Max RSS (KiB)
anon        2.237081         107088
memset      2.252241         112180
refpage     2.251220         103504

We can see that real time for refpage is about the same or maybe
slightly faster than memset. At this point it is unclear where the
discrepancy in performance between anon and refpage comes from. The
Pixel 4 kernel has transparent hugepages disabled so that can't be it.

I wouldn't trust the RSS number for reference pages (with a test
program that uses an anonymous page as a reference page, I saw the
following output in dmesg:

[75768.572560] BUG: Bad rss-counter state mm:00000000f1cdec59 idx:1 val:-2
[75768.572577] BUG: Bad rss-counter state mm:00000000f1cdec59 idx:3 val:2

indicating that I might not have implemented RSS accounting for
reference pages correctly), but we see straight away an RSS impact
of 5% for memset versus anon. Assuming that accounting for anonymous
pages has been implemented correctly, we can expect the true RSS
number for refpages to be similar to that which I measured for anon.

As an alternative to extending mmap(2), I considered using
userfaultfd to implement reference pages. However, after having taken
a detailed look at the interface, it does not seem suitable to be
used in the context of a general purpose allocator. For example,
UFFD_FEATURE_FORK support would be required in order to correctly
support fork(2) in a process that uses the allocator (although POSIX
does not guarantee support for allocating after fork, many allocators
including Scudo support it, and nothing stops the forked process from
page faulting pre-existing allocations after forking anyway), but
UFFD_FEATURE_FORK has been restricted to root by commit 3c1c24d91ffd
("userfaultfd: require CAP_SYS_PTRACE for UFFD_FEATURE_EVENT_FORK"),
making it unsuitable for use in an allocator. Furthermore, even if
the interface issues are resolved, I suspect (but have not measured)
that the cost of the multiple context switches between kernel and
userspace would be too high to be used in an allocator anyway.
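
For illustration, this is roughly what a userfaultfd-based emulation of
a reference page would involve: register the region for missing faults
and resolve each first touch with UFFDIO_COPY from the reference page,
paying a round trip through a userspace handler thread per faulted
page. This is an untested sketch, not code from the patch or from Scudo:

#define _GNU_SOURCE
#include <linux/userfaultfd.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <sys/mman.h>
#include <pthread.h>
#include <string.h>
#include <unistd.h>
#include <fcntl.h>

static int uffd;
static unsigned char *refpage;
static long psz;

static void *handler(void *arg)
{
    for (;;) {
        struct uffd_msg msg;

        /* One blocking read per fault: kernel -> handler thread. */
        if (read(uffd, &msg, sizeof(msg)) != sizeof(msg))
            continue;
        if (msg.event != UFFD_EVENT_PAGEFAULT)
            continue;

        /* Resolve by copying the reference page into place:
         * handler thread -> kernel -> back to the faulting thread. */
        struct uffdio_copy copy = {
            .dst = msg.arg.pagefault.address & ~((__u64)psz - 1),
            .src = (unsigned long)refpage,
            .len = psz,
        };
        ioctl(uffd, UFFDIO_COPY, &copy);
    }
    return NULL;
}

int main(void)
{
    psz = sysconf(_SC_PAGESIZE);
    refpage = mmap(NULL, psz, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    memset(refpage, 0xaa, psz);

    uffd = syscall(SYS_userfaultfd, O_CLOEXEC);
    struct uffdio_api api = { .api = UFFD_API };
    ioctl(uffd, UFFDIO_API, &api);

    size_t len = 16 << 20;
    unsigned char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    struct uffdio_register reg = {
        .range = { .start = (unsigned long)p, .len = len },
        .mode = UFFDIO_REGISTER_MODE_MISSING,
    };
    ioctl(uffd, UFFDIO_REGISTER, &reg);

    pthread_t t;
    pthread_create(&t, NULL, handler, NULL);

    return p[12345] == 0xaa ? 0 : 1;    /* first touch goes through the handler */
}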

There are unresolved issues with this patch:

- We need to decide on the semantics associated with remapping or
  unmapping the reference page. As currently implemented, the page is
  looked up by address on each page fault, and a segfault ensues if the
  address is not mapped. It may be better to have the mmap(2) call take
  a reference to the page (failing if not mapped) and the underlying
  vma so that future remappings or unmappings have no effect.

- I have not yet looked at interaction with transparent hugepages.

- We probably need to restrict which kinds of pages are supported as
  reference pages (probably only anonymous and file-backed pages). This
  is somewhat tied to the remapping semantics as we would need
  to decide what happens if a supported page is replaced with an
  unsupported page.

- Finally, the accounting issues as previously mentioned.

However, I am sending this first version of the patch in order to get
early feedback on the idea and whether it is suitable to be added to
the kernel.

[1] https://community.arm.com/developer/ip-products/processors/b/processors-ip-blog/posts/enhancing-memory-safety
[2] https://github.com/pcc/llvm-project/commit/a05f88aaebc7daf262d6885444d9845052026f4b

Signed-off-by: Peter Collingbourne <pcc@google.com>
---
 arch/mips/kernel/vdso.c                |  2 +-
 include/linux/mm.h                     |  2 +-
 include/uapi/asm-generic/mman-common.h |  1 +
 mm/mmap.c                              | 46 +++++++++++++++++++++++---
 4 files changed, 45 insertions(+), 6 deletions(-)

Comments

John Hubbard Aug. 3, 2020, 3:28 a.m. UTC | #1
On 7/31/20 1:32 PM, Peter Collingbourne wrote:
...

Hi,

I can see why you want to do this. A few points to consider, below.

btw, the patch would *not* apply for me, via `git am`. I finally used
patch(1) and that worked. Probably good to mention which tree and branch
this applies to, as a first step to avoiding that, but I'm not quite sure
what else went wrong. Maybe use stock git, instead of
2.28.0.163.g6104cc2f0b6-goog? Just guessing.

> @@ -1684,9 +1695,33 @@ static inline int accountable_mapping(struct file *file, vm_flags_t vm_flags)
>   	return (vm_flags & (VM_NORESERVE | VM_SHARED | VM_WRITE)) == VM_WRITE;
>   }
>   
> +static vm_fault_t refpage_fault(struct vm_fault *vmf)
> +{
> +	struct page *page;
> +
> +	if (get_user_pages((unsigned long)vmf->vma->vm_private_data, 1, 0,
> +			   &page, 0) != 1)
> +		return VM_FAULT_SIGSEGV;
> +

This will end up overflowing the page->_refcount in some situations.

Some thoughts:

In order to implement this feature, the reference pages need to be made
at least a little bit more special, and probably a little bit more like
zero pages. At one extreme, for example, zero pages could be a special
case of reference pages, although I'm not sure of a clean way to
implement that.


The reason that more special-ness is required is that things such as
reference counting and locking can be special-cased with zero pages.
Doing so allows avoiding page->_refcount overflows, for example. Your
patch here, however, allows normal pages to be treated *almost* like a
zero page, in that it's a page full of constant value data. But because
a refpage can be any page, not just a special one that is defined at a
single location, that leads to problems with refcounts.


> +	vmf->page = page;
> +	return VM_FAULT_LOCKED;

Is the page really locked, or is this a case of "the page is special and
we can safely claim it is locked"? Maybe I'm just confused about the use
of VM_FAULT_LOCKED: I thought you only should set it after locking the
page.


> +}
> +
> +static void refpage_close(struct vm_area_struct *vma)
> +{
> +	/* This function exists only to prevent is_mergeable_vma from allowing a
> +	 * reference page mapping to be merged with an anonymous mapping.
> +	 */

While it is true that implementing a vma's .close() method will prevent
vma merging, this is an abuse of that function: it depends on how that
function is implemented. And given that refpages represent significant
new capability, I think they deserve their own "if" clause (and perhaps
a VMA flag) in is_mergeable_vma(), instead of this kind of minor hack.



thanks,
Matthew Wilcox (Oracle) Aug. 3, 2020, 3:51 a.m. UTC | #2
On Sun, Aug 02, 2020 at 08:28:08PM -0700, John Hubbard wrote:
> This will end up overflowing the page->_refcount in some situations.
> 
> Some thoughts:
> 
> In order to implement this feature, the reference pages need to be made
> at least a little bit more special, and probably little bit more like
> zero pages. At one extreme, for example, zero pages could be a special
> case of reference pages, although I'm not sure of a clean way to
> implement that.
> 
> 
> The reason that more special-ness is required, is that things such as
> reference counting and locking can be special-cased with zero pages.
> Doing so allows avoiding page->_refcount overflows, for example. Your
> patch here, however, allows normal pages to be treated *almost* like a
> zero page, in that it's a page full of constant value data. But because
> a refpage can be any page, not just a special one that is defined at a
> single location, that leads to problems with refcounts.

We could bump the refcount on mmap and only put it on munmap.  That
complexifies a few more paths which now need to check for the VMA special
page as well as the zero page on pte unmap.

Perhaps a better way around this is that the default page can only be one
of the pages in the mmap ... and that page is duplicated (not shared) on
fork().  That way the refcount is at most the number of pages in the mmap.
And if we constrain the size of these mappings to be no more than 8TB,
that constrains the refcount on this page to be no more than 2^31.
Kirill A. Shutemov Aug. 3, 2020, 9:32 a.m. UTC | #3
On Fri, Jul 31, 2020 at 01:32:41PM -0700, Peter Collingbourne wrote:
> Introduce a new mmap flag, MAP_REFPAGE, that creates a mapping similar
> to an anonymous mapping, but instead of clean pages being backed by the
> zero page, they are instead backed by a so-called reference page, whose
> address is specified using the offset argument to mmap. Loads from
> the mapping will load directly from the reference page, and initial
> stores to the mapping will copy-on-write from the reference page.
> 
> Reference pages are useful in circumstances where anonymous mappings
> combined with manual stores to memory would impose undesirable costs,
> either in terms of performance or RSS. Use cases are focused on heap
> allocators and include:
> 
> - Pattern initialization for the heap. This is where malloc(3) gives
>   you memory whose contents are filled with a non-zero pattern
>   byte, in order to help detect and mitigate bugs involving use
>   of uninitialized memory. Typically this is implemented by having
>   the allocator memset the allocation with the pattern byte before
>   returning it to the user, but for large allocations this can result
>   in a significant increase in RSS, especially for allocations that
>   are used sparsely. Even for dense allocations there is a needless
>   impact to startup performance when it may be better to amortize it
>   throughout the program. By creating allocations using a reference
>   page filled with the pattern byte, we can avoid these costs.
> 
> - Pre-tagged heap memory. Memory tagging [1] is an upcoming ARMv8.5
>   feature which allows for memory to be tagged in order to detect
>   certain kinds of memory errors with low overhead. In order to set
>   up an allocation to allow memory errors to be detected, the entire
>   allocation needs to have the same tag. The issue here is similar to
>   pattern initialization in the sense that large tagged allocations
>   will be expensive if the tagging is done up front. The idea is that
>   the allocator would create reference pages with each of the possible
>   memory tags, and use those reference pages for the large allocations.

It looks like the wrong layer to implement this functionality. Just have
a special fd that would return the same page for all vm_ops->fault calls
and map the fd with a normal mmap(MAP_PRIVATE, fd). That will get you
what you want without touching core-mm.
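
A minimal sketch of what such a special file could look like on the
kernel side (hypothetical code, not from this patch or any posted v2;
the backing page is assumed to be stashed in file->private_data, and
the refcount concerns discussed above would still need to be addressed):

static vm_fault_t refpage_fd_fault(struct vm_fault *vmf)
{
    /* Every fault in the mapping resolves to the same backing page;
     * a MAP_PRIVATE mapping then copies-on-write from it on store. */
    struct page *page = vmf->vma->vm_file->private_data;

    get_page(page);
    vmf->page = page;
    return 0;
}

static const struct vm_operations_struct refpage_fd_vm_ops = {
    .fault = refpage_fd_fault,
};

static int refpage_fd_mmap(struct file *file, struct vm_area_struct *vma)
{
    if (vma->vm_flags & VM_SHARED)
        return -EINVAL;        /* copy-on-write (private) mappings only */
    vma->vm_ops = &refpage_fd_vm_ops;
    return 0;
}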
Catalin Marinas Aug. 3, 2020, 12:01 p.m. UTC | #4
On Mon, Aug 03, 2020 at 12:32:59PM +0300, Kirill A. Shutemov wrote:
> On Fri, Jul 31, 2020 at 01:32:41PM -0700, Peter Collingbourne wrote:
> > [...]
> 
> Looks like it's wrong layer to implement the functionality. Just have a
> special fd that would return the same page for all vm_ops->fault and map
> the fd with normal mmap(MAP_PRIVATE, fd). It will get you what you want
> without touching core-mm.

I think this would work even for the arm64 MTE (though I haven't tried):
use memfd_create() to get such file descriptor, mmap() it as MAP_SHARED
to populate the initial pattern, mmap() it as MAP_PRIVATE for any
subsequent mapping that needs to be copied-on-write.
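
Concretely, that suggestion looks roughly like this (an untested
sketch; memfd_create() requires _GNU_SOURCE and a reasonably recent
glibc, and each MAP_PRIVATE mmap() here covers only a single page):

#define _GNU_SOURCE
#include <sys/mman.h>
#include <string.h>
#include <unistd.h>

void *pattern_page_via_memfd(unsigned char pattern)
{
    long psz = sysconf(_SC_PAGESIZE);
    int fd = memfd_create("refpage", 0);

    if (fd < 0 || ftruncate(fd, psz) < 0)
        return MAP_FAILED;

    /* Populate the initial pattern through a shared mapping. */
    unsigned char *shared = mmap(NULL, psz, PROT_READ | PROT_WRITE,
                                 MAP_SHARED, fd, 0);
    memset(shared, pattern, psz);
    munmap(shared, psz);

    /* Private mapping of the same page: reads see the pattern, the
     * first write copies-on-write. The fd is deliberately kept open
     * so the backing page stays alive. */
    return mmap(NULL, psz, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
}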
Peter Collingbourne Aug. 4, 2020, 12:50 a.m. UTC | #5
On Mon, Aug 3, 2020 at 5:01 AM Catalin Marinas <catalin.marinas@arm.com> wrote:
>
> On Mon, Aug 03, 2020 at 12:32:59PM +0300, Kirill A. Shutemov wrote:
> > On Fri, Jul 31, 2020 at 01:32:41PM -0700, Peter Collingbourne wrote:
> > > [...]
> >
> > Looks like it's wrong layer to implement the functionality. Just have a
> > special fd that would return the same page for all vm_ops->fault and map
> > the fd with normal mmap(MAP_PRIVATE, fd). It will get you what you want
> > without touching core-mm.

Thanks, I like this idea. I will try to implement it.

> I think this would work even for the arm64 MTE (though I haven't tried):
> use memfd_create() to get such file descriptor, mmap() it as MAP_SHARED
> to populate the initial pattern, mmap() it as MAP_PRIVATE for any
> subsequent mapping that needs to be copied-on-write.

That would require a separate mmap() (i.e. separate VMA) for each
page, no? That sounds like it could be expensive both in terms of VMAs
and the number of mmap syscalls required (i.e. N/PAGE_SIZE). You could
decrease these costs by increasing the size of the memfd files to more
than a page, but that would also increase the amount of memory
required for the reference pages.

Peter
Catalin Marinas Aug. 4, 2020, 3:27 p.m. UTC | #6
On Mon, Aug 03, 2020 at 05:50:32PM -0700, Peter Collingbourne wrote:
> On Mon, Aug 3, 2020 at 5:01 AM Catalin Marinas <catalin.marinas@arm.com> wrote:
> > I think this would work even for the arm64 MTE (though I haven't tried):
> > use memfd_create() to get such file descriptor, mmap() it as MAP_SHARED
> > to populate the initial pattern, mmap() it as MAP_PRIVATE for any
> > subsequent mapping that needs to be copied-on-write.
> 
> That would require a separate mmap() (i.e. separate VMA) for each
> page, no? That sounds like it could be expensive both in terms of VMAs
> and the number of mmap syscalls required (i.e. N/PAGE_SIZE). You could
> decrease these costs by increasing the size of the memfd files to more
> than a page, but that would also increase the amount of memory
> required for the reference pages.

I think I get it now. You'd like a multiple page mmap() to be covered by
a single reference page. The memfd trick wouldn't give you this without
multiple mmap() calls, one for each page.
Kirill A. Shutemov Aug. 4, 2020, 3:48 p.m. UTC | #7
On Tue, Aug 04, 2020 at 04:27:50PM +0100, Catalin Marinas wrote:
> On Mon, Aug 03, 2020 at 05:50:32PM -0700, Peter Collingbourne wrote:
> > On Mon, Aug 3, 2020 at 5:01 AM Catalin Marinas <catalin.marinas@arm.com> wrote:
> > > I think this would work even for the arm64 MTE (though I haven't tried):
> > > use memfd_create() to get such file descriptor, mmap() it as MAP_SHARED
> > > to populate the initial pattern, mmap() it as MAP_PRIVATE for any
> > > subsequent mapping that needs to be copied-on-write.
> > 
> > That would require a separate mmap() (i.e. separate VMA) for each
> > page, no? That sounds like it could be expensive both in terms of VMAs
> > and the number of mmap syscalls required (i.e. N/PAGE_SIZE). You could
> > decrease these costs by increasing the size of the memfd files to more
> > than a page, but that would also increase the amount of memory
> > required for the reference pages.
> 
> I think I get it now. You'd like a multiple page mmap() to be covered by
> a single reference page. The memfd trick wouldn't give you this without
> multiple mmap() calls, one for each page.

That's why I suggested a special file descriptor that would give the same
page on any access. We can piggyback on the memfd infrastructure or create
a new interface.
Peter Collingbourne Aug. 13, 2020, 10:03 p.m. UTC | #8
Hi John,

Thanks for the review and suggestions.

On Sun, Aug 2, 2020 at 8:28 PM John Hubbard <jhubbard@nvidia.com> wrote:
>
> On 7/31/20 1:32 PM, Peter Collingbourne wrote:
> ...
>
> Hi,
>
> I can see why you want to do this. A few points to consider, below.
>
> btw, the patch would *not* apply for me, via `git am`. I finally used
> patch(1) and that worked. Probably good to mention which tree and branch
> this applies to, as a first step to avoiding that, but I'm not quite sure
> what else went wrong. Maybe use stock git, instead of
> 2.28.0.163.g6104cc2f0b6-goog? Just guessing.

Sorry about that. It might have been because I had another patch
applied underneath this one when I created the patch. In the v2 that
I'm about to send I'm based directly on master.

> > @@ -1684,9 +1695,33 @@ static inline int accountable_mapping(struct file *file, vm_flags_t vm_flags)
> >       return (vm_flags & (VM_NORESERVE | VM_SHARED | VM_WRITE)) == VM_WRITE;
> >   }
> >
> > +static vm_fault_t refpage_fault(struct vm_fault *vmf)
> > +{
> > +     struct page *page;
> > +
> > +     if (get_user_pages((unsigned long)vmf->vma->vm_private_data, 1, 0,
> > +                        &page, 0) != 1)
> > +             return VM_FAULT_SIGSEGV;
> > +
>
> This will end up overflowing the page->_refcount in some situations.
>
> Some thoughts:
>
> In order to implement this feature, the reference pages need to be made
> at least a little bit more special, and probably little bit more like
> zero pages. At one extreme, for example, zero pages could be a special
> case of reference pages, although I'm not sure of a clean way to
> implement that.
>
>
> The reason that more special-ness is required, is that things such as
> reference counting and locking can be special-cased with zero pages.
> Doing so allows avoiding page->_refcount overflows, for example. Your
> patch here, however, allows normal pages to be treated *almost* like a
> zero page, in that it's a page full of constant value data. But because
> a refpage can be any page, not just a special one that is defined at a
> single location, that leads to problems with refcounts.

You're right, there is a potential reference count issue here. But it
looks like the issue is not with _refcount but with _mapcount. For
example, a program could create a reference page mapping with 2^32
pages, fault every page in the mapping and thereby overflow _mapcount.

It looks like we can avoid this issue by aligning the handling of
reference pages with that of the zero page, as you suggested. As with
the zero page, _mapcount is no longer modified on reference pages to
track PTEs (this is done by causing vm_normal_page() to return null for
these pages, as we do for the zero page). Ownership is moved to the
struct file created by the new refpage_create (bikeshed colors
welcome) syscall which returns a file descriptor, per Kirill's
suggestion. A struct file's reference count is an atomic_long_t, which
I assume cannot realistically overflow. A pointer to the reference
page is stored in the VMA's vm_private_data, but this is mostly for
convenience because the page is kept alive by the VMA's struct file
reference. The VMA's vm_ops is now set to null, which causes us to
follow the code path for anonymous pages, which has been modified to
handle reference pages. That's all implemented in the v2 that I'm
about to send.
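
As a rough paraphrase of that mechanism (not the actual v2 change), the
special-casing amounts to teaching vm_normal_page() to treat a PTE that
points at the VMA's reference page like one that points at the zero page:

static struct page *normal_page_with_refpage(struct vm_area_struct *vma,
                                             unsigned long addr, pte_t pte)
{
    unsigned long pfn = pte_pfn(pte);
    struct page *refpage = vma->vm_private_data;

    if (is_zero_pfn(pfn))
        return NULL;    /* existing zero-page special case */
    /* Hypothetical reference-page case: no _mapcount tracking for these
     * PTEs. Real code would first check that this VMA really is a
     * reference-page mapping before trusting vm_private_data. */
    if (refpage && pfn == page_to_pfn(refpage))
        return NULL;
    return pfn_to_page(pfn);
}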

I considered having reference page mappings continue to provide a
custom vm_ops, but this would require changes to the interface to
preserve the specialness of the reference page. For example,
vm_normal_page() would need to know to return null for the reference
page in order to prevent _mapcount from overflowing, which could
probably be done by adding a new interface to vm_ops, but that seemed
more complicated than changing the anonymous page code path.

> > +     vmf->page = page;
> > +     return VM_FAULT_LOCKED;
>
> Is the page really locked, or is this a case of "the page is special and
> we can safely claim it is locked"? Maybe I'm just confused about the use
> of VM_FAULT_LOCKED: I thought you only should set it after locking the
> page.

You're right, it isn't locked at this point. I had confused locking
the page with incrementing its _refcount via get_user_pages(). But
with the new implementation we no longer need this fault handler.

> > +}
> > +
> > +static void refpage_close(struct vm_area_struct *vma)
> > +{
> > +     /* This function exists only to prevent is_mergeable_vma from allowing a
> > +      * reference page mapping to be merged with an anonymous mapping.
> > +      */
>
> While it is true that implementing a vma's .close() method will prevent
> vma merging, this is an abuse of that function: it depends on how that
> function is implemented. And given that refpages represent significant
> new capability, I think they deserve their own "if" clause (and perhaps
> a VMA flag) in is_mergeable_vma(), instead of this kind of minor hack.

It turns out that with the change to use a file descriptor we do not
need a change to is_mergeable_vma() because the function bails out if
the struct file pointers in the VMAs are different.
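
For context, the relevant checks look roughly like this (paraphrased
from mm/mmap.c of this era, trimmed to the tests that matter here):

static inline int is_mergeable_vma_sketch(struct vm_area_struct *vma,
                                          struct file *file,
                                          unsigned long vm_flags)
{
    if ((vma->vm_flags ^ vm_flags) & ~VM_SOFTDIRTY)
        return 0;
    if (vma->vm_file != file)               /* different backing file: no merge */
        return 0;
    if (vma->vm_ops && vma->vm_ops->close)  /* what the v1 .close() hack relied on */
        return 0;
    return 1;
}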

Thanks,
Peter
Peter Collingbourne Aug. 13, 2020, 10:03 p.m. UTC | #9
On Sun, Aug 2, 2020 at 8:51 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Sun, Aug 02, 2020 at 08:28:08PM -0700, John Hubbard wrote:
> > This will end up overflowing the page->_refcount in some situations.
> >
> > Some thoughts:
> >
> > In order to implement this feature, the reference pages need to be made
> > at least a little bit more special, and probably little bit more like
> > zero pages. At one extreme, for example, zero pages could be a special
> > case of reference pages, although I'm not sure of a clean way to
> > implement that.
> >
> >
> > The reason that more special-ness is required, is that things such as
> > reference counting and locking can be special-cased with zero pages.
> > Doing so allows avoiding page->_refcount overflows, for example. Your
> > patch here, however, allows normal pages to be treated *almost* like a
> > zero page, in that it's a page full of constant value data. But because
> > a refpage can be any page, not just a special one that is defined at a
> > single location, that leads to problems with refcounts.
>
> We could bump the refcount on mmap and only put it on munmap.  That
> complexifies a few more paths which now need to check for the VMA special
> page as well as the zero page on pte unmap.
>
> Perhaps a better way around this is that the default page can only be one
> of the pages in the mmap ... and that page is duplicated (not shared) on
> fork().  That way the refcount is at most the number of pages in the mmap.
> And if we constrain the size of these mappings to be no more than 8TB,
> that constrains the refcount on this page to be no more than 2^31.

I'm not a fan of this idea to be honest. It means that we need to
spend a page per mapping to get this behavior, instead of a page
across the entire process. And in an allocator like scudo we can end
up making a lot of mappings. I think there would also be complexities
around VMA splitting, which would probably mean that these mappings
become special enough that we don't gain much with this approach.

Thanks,
Peter

Patch

diff --git a/arch/mips/kernel/vdso.c b/arch/mips/kernel/vdso.c
index 242dc5e83847..403c00cc1ac3 100644
--- a/arch/mips/kernel/vdso.c
+++ b/arch/mips/kernel/vdso.c
@@ -101,7 +101,7 @@  int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
 		/* Map delay slot emulation page */
 		base = mmap_region(NULL, STACK_TOP, PAGE_SIZE,
 				VM_READ | VM_EXEC |
-				VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC,
+				VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC, 0,
 				0, NULL);
 		if (IS_ERR_VALUE(base)) {
 			ret = base;
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 256e1bc83460..3b3efa2e3283 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2576,7 +2576,7 @@  extern unsigned long get_unmapped_area(struct file *, unsigned long, unsigned lo
 
 extern unsigned long mmap_region(struct file *file, unsigned long addr,
 	unsigned long len, vm_flags_t vm_flags, unsigned long pgoff,
-	struct list_head *uf);
+	unsigned long refpage, struct list_head *uf);
 extern unsigned long do_mmap(struct file *file, unsigned long addr,
 	unsigned long len, unsigned long prot, unsigned long flags,
 	unsigned long pgoff, unsigned long *populate, struct list_head *uf);
diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
index f94f65d429be..f57552dcf99a 100644
--- a/include/uapi/asm-generic/mman-common.h
+++ b/include/uapi/asm-generic/mman-common.h
@@ -29,6 +29,7 @@ 
 #define MAP_HUGETLB		0x040000	/* create a huge page mapping */
 #define MAP_SYNC		0x080000 /* perform synchronous page faults for the mapping */
 #define MAP_FIXED_NOREPLACE	0x100000	/* MAP_FIXED which doesn't unmap underlying mapping */
+#define MAP_REFPAGE		0x200000	/* use the offset argument as a pointer to a reference page */
 
 #define MAP_UNINITIALIZED 0x4000000	/* For anonymous mmap, memory could be
 					 * uninitialized */
diff --git a/mm/mmap.c b/mm/mmap.c
index d43cc3b0187c..d74d0963d460 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -47,6 +47,7 @@ 
 #include <linux/pkeys.h>
 #include <linux/oom.h>
 #include <linux/sched/mm.h>
+#include <linux/compat.h>
 
 #include <linux/uaccess.h>
 #include <asm/cacheflush.h>
@@ -1371,6 +1372,7 @@  unsigned long do_mmap(struct file *file, unsigned long addr,
 	struct mm_struct *mm = current->mm;
 	vm_flags_t vm_flags;
 	int pkey = 0;
+	unsigned long refpage = 0;
 
 	*populate = 0;
 
@@ -1441,6 +1443,16 @@  unsigned long do_mmap(struct file *file, unsigned long addr,
 	if (mlock_future_check(mm, vm_flags, len))
 		return -EAGAIN;
 
+	if (flags & MAP_REFPAGE) {
+		refpage = pgoff << PAGE_SHIFT;
+		if (in_compat_syscall()) {
+			/* The offset argument may have been sign extended at some
+			 * point, so we need to mask out the high bits.
+			 */
+			refpage &= 0xffffffff;
+		}
+	}
+
 	if (file) {
 		struct inode *inode = file_inode(file);
 		unsigned long flags_mask;
@@ -1541,8 +1553,7 @@  unsigned long do_mmap(struct file *file, unsigned long addr,
 		if (file && is_file_hugepages(file))
 			vm_flags |= VM_NORESERVE;
 	}
-
-	addr = mmap_region(file, addr, len, vm_flags, pgoff, uf);
+	addr = mmap_region(file, addr, len, vm_flags, pgoff, refpage, uf);
 	if (!IS_ERR_VALUE(addr) &&
 	    ((vm_flags & VM_LOCKED) ||
 	     (flags & (MAP_POPULATE | MAP_NONBLOCK)) == MAP_POPULATE))
@@ -1557,7 +1568,7 @@  unsigned long ksys_mmap_pgoff(unsigned long addr, unsigned long len,
 	struct file *file = NULL;
 	unsigned long retval;
 
-	if (!(flags & MAP_ANONYMOUS)) {
+	if (!(flags & (MAP_ANONYMOUS | MAP_REFPAGE))) {
 		audit_mmap_fd(fd, flags);
 		file = fget(fd);
 		if (!file)
@@ -1684,9 +1695,33 @@  static inline int accountable_mapping(struct file *file, vm_flags_t vm_flags)
 	return (vm_flags & (VM_NORESERVE | VM_SHARED | VM_WRITE)) == VM_WRITE;
 }
 
+static vm_fault_t refpage_fault(struct vm_fault *vmf)
+{
+	struct page *page;
+
+	if (get_user_pages((unsigned long)vmf->vma->vm_private_data, 1, 0,
+			   &page, 0) != 1)
+		return VM_FAULT_SIGSEGV;
+
+	vmf->page = page;
+	return VM_FAULT_LOCKED;
+}
+
+static void refpage_close(struct vm_area_struct *vma)
+{
+	/* This function exists only to prevent is_mergeable_vma from allowing a
+	 * reference page mapping to be merged with an anonymous mapping.
+	 */
+}
+
+const struct vm_operations_struct refpage_vmops = {
+	.fault = refpage_fault,
+	.close = refpage_close,
+};
+
 unsigned long mmap_region(struct file *file, unsigned long addr,
 		unsigned long len, vm_flags_t vm_flags, unsigned long pgoff,
-		struct list_head *uf)
+		unsigned long refpage, struct list_head *uf)
 {
 	struct mm_struct *mm = current->mm;
 	struct vm_area_struct *vma, *prev;
@@ -1788,6 +1823,9 @@  unsigned long mmap_region(struct file *file, unsigned long addr,
 		error = shmem_zero_setup(vma);
 		if (error)
 			goto free_vma;
+	} else if (refpage) {
+		vma->vm_ops = &refpage_vmops;
+		vma->vm_private_data = (void *)refpage;
 	} else {
 		vma_set_anonymous(vma);
 	}