[resend,v2,2/5] mm/madvise: introduce MADV_POPULATE_(READ|WRITE) to prefault page tables

Message ID 20210511081534.3507-3-david@redhat.com (mailing list archive)
State New, archived

Commit Message

David Hildenbrand May 11, 2021, 8:15 a.m. UTC
I. Background: Sparse Memory Mappings

When we manage sparse memory mappings dynamically in user space - also
sometimes involving MAP_NORESERVE - we want to dynamically populate/
discard memory inside such a sparse memory region. Example users are
hypervisors (especially implementing memory ballooning or similar
technologies like virtio-mem) and memory allocators. In addition, we want
to fail in a nice way (instead of generating SIGBUS) if populating does not
succeed because we are out of backend memory (which can happen easily with
file-based mappings, especially tmpfs and hugetlbfs).

While MADV_DONTNEED, MADV_REMOVE and FALLOC_FL_PUNCH_HOLE allow for
reliably discarding memory for most mapping types, there is no generic
approach to populate page tables and preallocate memory.

Although mmap() supports MAP_POPULATE, it is not applicable to the concept
of sparse memory mappings, where we want to populate/discard
dynamically and avoid expensive/problematic remappings. In addition,
we never actually report errors during the final populate phase - it is
best-effort only.

fallocate() can be used to preallocate file-based memory and fail in a safe
way. However, it cannot really be used for any private mappings on
anonymous files via memfd due to COW semantics. In addition, fallocate()
does not actually populate page tables, so we still always get
pagefaults on first access - which is sometimes undesired (e.g., real-time
workloads) and requires real prefaulting of page tables, not just a
preallocation of backend storage. There might be interesting use cases
for sparse memory regions along with mlockall(MCL_ONFAULT) which
fallocate() cannot satisfy as it does not prefault page tables.

II. On preallocation/prefaulting from user space

Because we don't have a proper interface, what applications
(like QEMU and databases) end up doing is touching (i.e., reading+writing
one byte to not overwrite existing data) all individual pages.

However, that approach
1) Can result in wear on storage backing, because we end up reading/writing
   each page; this is especially a problem for dax/pmem.
2) Can result in mmap_sem contention when prefaulting via multiple
   threads.
3) Requires expensive signal handling, especially to catch SIGBUS in case
   of hugetlbfs/shmem/file-backed memory. For example, this is
   problematic in hypervisors like QEMU where SIGBUS handlers might already
   be used by other subsystems concurrently to, e.g., handle hardware errors.
   "Simply" doing preallocation concurrently from another thread is not that
   easy.
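
For illustration, the manual touching approach described above can be
sketched roughly as follows. This is a minimal sketch, not code taken from
QEMU or any database; touch_pages() and demo_touch() are hypothetical names.
Note that on file-backed memory any of these accesses may raise SIGBUS, which
the sketch does not handle.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Prefault a range by touching one byte per page: read the byte, then
 * write the same value back so that existing data is preserved.
 * The volatile qualifier keeps the compiler from dropping the accesses.
 */
void touch_pages(volatile unsigned char *addr, size_t len)
{
    const size_t page_size = (size_t)sysconf(_SC_PAGESIZE);

    for (size_t off = 0; off < len; off += page_size)
        addr[off] = addr[off];  /* one read and one write per page */
}

/*
 * Hypothetical demo: map 16 anonymous pages, store a marker, touch all
 * pages and check that the marker survived. Returns 0 on success.
 */
int demo_touch(void)
{
    const size_t len = 16 * (size_t)sysconf(_SC_PAGESIZE);
    unsigned char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    if (p == MAP_FAILED)
        return -1;

    p[0] = 0x5a;
    touch_pages(p, len);

    /* Marker preserved; untouched data in the last page is still zero. */
    int ok = (p[0] == 0x5a) && (p[len - 1] == 0);

    munmap(p, len);
    return ok ? 0 : -1;
}
```

Every page costs two user-space accesses and up to two page faults, which
is where the drawbacks listed above come from.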

III. On MADV_WILLNEED

Extending MADV_WILLNEED is not an option because
1. It would change the semantics: "Expect access in the near future." and
   "might be a good idea to read some pages" vs. "Definitely populate/
   preallocate all memory and definitely fail on errors.".
2. Existing users (like virtio-balloon in QEMU when deflating the balloon)
   don't want populate/prealloc semantics. They treat this rather as a hint
   to give a little performance boost without too much overhead - and don't
   expect that a lot of memory might get consumed or a lot of time
   might be spent.

IV. MADV_POPULATE_READ and MADV_POPULATE_WRITE

Let's introduce MADV_POPULATE_READ and MADV_POPULATE_WRITE, inspired by
MAP_POPULATE, with the following semantics:
1. MADV_POPULATE_READ can be used to prefault page tables just like
   manually reading each individual page. This will not break any COW
   mappings. The shared zero page might get mapped and no backend storage
   might get preallocated -- allocation might be deferred to
   write-fault time. Especially shared file mappings require an explicit
   fallocate() upfront to actually preallocate backend memory (blocks in
   the file system) in case the file might have holes.
2. If MADV_POPULATE_READ succeeds, all page tables have been populated
   (prefaulted) readable once.
3. MADV_POPULATE_WRITE can be used to preallocate backend memory and
   prefault page tables just like manually writing (or
   reading+writing) each individual page. This will break any COW
   mappings -- e.g., the shared zeropage is never populated.
4. If MADV_POPULATE_WRITE succeeds, all page tables have been populated
   (prefaulted) writable once.
5. MADV_POPULATE_READ and MADV_POPULATE_WRITE cannot be applied to special
   mappings marked with VM_PFNMAP and VM_IO. Also, proper access
   permissions (e.g., PROT_READ, PROT_WRITE) are required. If any such
   mapping is encountered, madvise() fails with -EINVAL.
6. If MADV_POPULATE_READ or MADV_POPULATE_WRITE fails, some page tables
   might have been populated.
7. MADV_POPULATE_READ and MADV_POPULATE_WRITE will return -EHWPOISON
   when encountering a HW poisoned page in the range.
8. Similar to MAP_POPULATE, MADV_POPULATE_READ and MADV_POPULATE_WRITE
   cannot protect from the OOM (Out Of Memory) handler killing the
   process.
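
A user-space caller might use the new advice values roughly like this. This
is a sketch under assumptions: the fallback #defines use the values 22 and 23
(assumed here for the case where the libc headers do not provide them), and
populate_range() and demo_populate_write() are hypothetical helper names.
Because an older kernel also answers an unknown advice with EINVAL, the demo
reports that case separately instead of treating it as a hard error.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>

/* Assumed values; only used if the installed headers lack them. */
#ifndef MADV_POPULATE_READ
#define MADV_POPULATE_READ 22
#endif
#ifndef MADV_POPULATE_WRITE
#define MADV_POPULATE_WRITE 23
#endif

/* Returns 0 on success or a negative errno value on failure. */
int populate_range(void *addr, size_t len, int advice)
{
    if (madvise(addr, len, advice) == 0)
        return 0;
    return -errno;
}

/*
 * Hypothetical demo on a sparse private anonymous mapping.
 * Returns 0 if the range was populated writable, 1 if the kernel
 * rejected the advice with EINVAL (e.g., no support), and a negative
 * errno value for any other failure.
 */
int demo_populate_write(void)
{
    const size_t len = 2UL << 20;   /* 2 MiB */
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);

    if (p == MAP_FAILED)
        return -errno;

    int ret = populate_range(p, len, MADV_POPULATE_WRITE);

    if (ret == -EINVAL)
        ret = 1;    /* advice not accepted; caller may fall back */

    munmap(p, len);
    return ret;
}
```

On failure, some page tables might already have been populated (see 6.), so
a caller that wants to undo a partial populate would have to discard the
range itself, e.g., via MADV_DONTNEED.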

While the use case for MADV_POPULATE_WRITE is fairly obvious (i.e.,
preallocate memory and prefault page tables for VMs), one issue is that
whenever we prefault pages writable, the pages have to be marked dirty,
because the CPU could dirty them any time. While not a real problem for
hugetlbfs or dax/pmem, it can be a problem for shared file mappings: each
page will be marked dirty and has to be written back later when evicting.

MADV_POPULATE_READ allows for optimizing this scenario: Pre-read a whole
mapping from backend storage without marking it dirty, such that eviction
won't have to write it back. As discussed above, shared file mappings
might require an explicit fallocate() upfront to achieve
preallocation+prepopulation.
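
The fallocate()+MADV_POPULATE_READ combination for a shared file mapping
could look roughly like the following sketch. Assumptions: the scratch file
path is made up, posix_fallocate() is used as a portable stand-in for
fallocate(), the fallback #define uses the assumed value 22, and
demo_fallocate_populate_read() is a hypothetical name.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Assumed value; only used if the installed headers lack it. */
#ifndef MADV_POPULATE_READ
#define MADV_POPULATE_READ 22
#endif

/*
 * Preallocate backend storage for a file, map it shared and prefault
 * the page tables readable without dirtying any page.
 * Returns 0 if populated, 1 if the kernel rejected the advice with
 * EINVAL (e.g., no support), and a negative errno value otherwise.
 */
int demo_fallocate_populate_read(void)
{
    const size_t len = 2UL << 20;               /* 2 MiB */
    char path[] = "/tmp/populate-demo-XXXXXX";  /* hypothetical file */
    int fd = mkstemp(path);

    if (fd < 0)
        return -errno;
    unlink(path);   /* file disappears once fd is closed */

    /* Step 1: allocate blocks so the file has no holes. */
    int err = posix_fallocate(fd, 0, (off_t)len);
    if (err) {
        close(fd);
        return -err;    /* posix_fallocate() returns the error number */
    }

    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) {
        int saved = errno;
        close(fd);
        return -saved;
    }

    /* Step 2: prefault page tables readable. */
    int ret = 0;
    if (madvise(p, len, MADV_POPULATE_READ) != 0)
        ret = (errno == EINVAL) ? 1 : -errno;

    munmap(p, len);
    close(fd);
    return ret;
}
```

Splitting the work this way keeps allocation failures in the fallocate step,
where they are reported as an error code, and leaves the pages clean so that
eviction does not have to write them back.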

Although sparse memory mappings are the primary use case, this will
also be useful for other preallocate/prefault use cases where MAP_POPULATE
is not desired or the semantics of MAP_POPULATE are not sufficient: as one
example, QEMU users can trigger preallocation/prefaulting of guest RAM
after the mapping was created -- and don't want errors to be silently
suppressed.

Looking at the history, MADV_POPULATE was already proposed in 2013 [1],
however, the main motivation back then was performance improvements
-- which should also still be the case.

V. Single-threaded performance comparison

I did a short experiment, prefaulting page tables on completely *empty
mappings/files* and repeated the experiment 10 times. The results
correspond to the shortest execution time. In general, the performance
benefit for huge pages is negligible with small mappings.
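
The "repeat 10 times, report the shortest time" methodology can be sketched
as below. This is not the benchmark program used for the numbers in this
patch; time_once_ms(), best_of_ms() and count_call() are hypothetical names,
and a real run would have to create a fresh, empty mapping before each
iteration.

```c
#define _POSIX_C_SOURCE 200809L
#include <time.h>

/* Wall-clock duration of one call to fn(arg), in milliseconds. */
double time_once_ms(void (*fn)(void *), void *arg)
{
    struct timespec start, end;

    clock_gettime(CLOCK_MONOTONIC, &start);
    fn(arg);
    clock_gettime(CLOCK_MONOTONIC, &end);

    return (end.tv_sec - start.tv_sec) * 1e3 +
           (end.tv_nsec - start.tv_nsec) / 1e6;
}

/* Run fn(arg) 'runs' times and return the shortest duration seen. */
double best_of_ms(int runs, void (*fn)(void *), void *arg)
{
    double best = -1.0;     /* negative means "no sample yet" */

    for (int i = 0; i < runs; i++) {
        double ms = time_once_ms(fn, arg);

        if (best < 0.0 || ms < best)
            best = ms;
    }
    return best;
}

/* Trivial workload that only counts calls; stands in for a prefault. */
void count_call(void *arg)
{
    (*(int *)arg)++;
}
```

Taking the minimum rather than the mean filters out runs that were disturbed
by unrelated system activity.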

V.1: Private mappings

POPULATE_READ and POPULATE_WRITE are fastest. Note that
Reading/POPULATE_READ will populate the shared zeropage where applicable
-- which results in short population times.

The fastest way to allocate backend storage (here: swap or huge pages)
and prefault page tables is POPULATE_WRITE.

V.2: Shared mappings

fallocate() is fastest; however, it doesn't prefault
page tables. POPULATE_WRITE is faster than simple writes and read/writes.
POPULATE_READ is faster than simple reads.

Without a fd, the fastest way to allocate backend storage and prefault page
tables is POPULATE_WRITE. With an fd, the fastest way is usually
FALLOCATE+POPULATE_READ or FALLOCATE+POPULATE_WRITE respectively; one
exception is actual files: FALLOCATE+Read is slightly faster than
FALLOCATE+POPULATE_READ.

The fastest way to allocate backend storage and prefault page tables is
FALLOCATE+POPULATE_WRITE -- except when dealing with actual files; then,
FALLOCATE+POPULATE_READ is fastest and won't directly mark all pages as
dirty.

V.3: Detailed results

==================================================
2 MiB MAP_PRIVATE:
**************************************************
Anon 4 KiB     : Read                     :     0.119 ms
Anon 4 KiB     : Write                    :     0.222 ms
Anon 4 KiB     : Read/Write               :     0.380 ms
Anon 4 KiB     : POPULATE_READ            :     0.060 ms
Anon 4 KiB     : POPULATE_WRITE           :     0.158 ms
Memfd 4 KiB    : Read                     :     0.034 ms
Memfd 4 KiB    : Write                    :     0.310 ms
Memfd 4 KiB    : Read/Write               :     0.362 ms
Memfd 4 KiB    : POPULATE_READ            :     0.039 ms
Memfd 4 KiB    : POPULATE_WRITE           :     0.229 ms
Memfd 2 MiB    : Read                     :     0.030 ms
Memfd 2 MiB    : Write                    :     0.030 ms
Memfd 2 MiB    : Read/Write               :     0.030 ms
Memfd 2 MiB    : POPULATE_READ            :     0.030 ms
Memfd 2 MiB    : POPULATE_WRITE           :     0.030 ms
tmpfs          : Read                     :     0.033 ms
tmpfs          : Write                    :     0.313 ms
tmpfs          : Read/Write               :     0.406 ms
tmpfs          : POPULATE_READ            :     0.039 ms
tmpfs          : POPULATE_WRITE           :     0.285 ms
file           : Read                     :     0.033 ms
file           : Write                    :     0.351 ms
file           : Read/Write               :     0.408 ms
file           : POPULATE_READ            :     0.039 ms
file           : POPULATE_WRITE           :     0.290 ms
hugetlbfs      : Read                     :     0.030 ms
hugetlbfs      : Write                    :     0.030 ms
hugetlbfs      : Read/Write               :     0.030 ms
hugetlbfs      : POPULATE_READ            :     0.030 ms
hugetlbfs      : POPULATE_WRITE           :     0.030 ms
**************************************************
4096 MiB MAP_PRIVATE:
**************************************************
Anon 4 KiB     : Read                     :   237.940 ms
Anon 4 KiB     : Write                    :   708.409 ms
Anon 4 KiB     : Read/Write               :  1054.041 ms
Anon 4 KiB     : POPULATE_READ            :   124.310 ms
Anon 4 KiB     : POPULATE_WRITE           :   572.582 ms
Memfd 4 KiB    : Read                     :   136.928 ms
Memfd 4 KiB    : Write                    :   963.898 ms
Memfd 4 KiB    : Read/Write               :  1106.561 ms
Memfd 4 KiB    : POPULATE_READ            :    78.450 ms
Memfd 4 KiB    : POPULATE_WRITE           :   805.881 ms
Memfd 2 MiB    : Read                     :   357.116 ms
Memfd 2 MiB    : Write                    :   357.210 ms
Memfd 2 MiB    : Read/Write               :   357.606 ms
Memfd 2 MiB    : POPULATE_READ            :   356.094 ms
Memfd 2 MiB    : POPULATE_WRITE           :   356.937 ms
tmpfs          : Read                     :   137.536 ms
tmpfs          : Write                    :   954.362 ms
tmpfs          : Read/Write               :  1105.954 ms
tmpfs          : POPULATE_READ            :    80.289 ms
tmpfs          : POPULATE_WRITE           :   822.826 ms
file           : Read                     :   137.874 ms
file           : Write                    :   987.025 ms
file           : Read/Write               :  1107.439 ms
file           : POPULATE_READ            :    80.413 ms
file           : POPULATE_WRITE           :   857.622 ms
hugetlbfs      : Read                     :   355.607 ms
hugetlbfs      : Write                    :   355.729 ms
hugetlbfs      : Read/Write               :   356.127 ms
hugetlbfs      : POPULATE_READ            :   354.585 ms
hugetlbfs      : POPULATE_WRITE           :   355.138 ms
**************************************************
2 MiB MAP_SHARED:
**************************************************
Anon 4 KiB     : Read                     :     0.394 ms
Anon 4 KiB     : Write                    :     0.348 ms
Anon 4 KiB     : Read/Write               :     0.400 ms
Anon 4 KiB     : POPULATE_READ            :     0.326 ms
Anon 4 KiB     : POPULATE_WRITE           :     0.273 ms
Anon 2 MiB     : Read                     :     0.030 ms
Anon 2 MiB     : Write                    :     0.030 ms
Anon 2 MiB     : Read/Write               :     0.030 ms
Anon 2 MiB     : POPULATE_READ            :     0.030 ms
Anon 2 MiB     : POPULATE_WRITE           :     0.030 ms
Memfd 4 KiB    : Read                     :     0.412 ms
Memfd 4 KiB    : Write                    :     0.372 ms
Memfd 4 KiB    : Read/Write               :     0.419 ms
Memfd 4 KiB    : POPULATE_READ            :     0.343 ms
Memfd 4 KiB    : POPULATE_WRITE           :     0.288 ms
Memfd 4 KiB    : FALLOCATE                :     0.137 ms
Memfd 4 KiB    : FALLOCATE+Read           :     0.446 ms
Memfd 4 KiB    : FALLOCATE+Write          :     0.330 ms
Memfd 4 KiB    : FALLOCATE+Read/Write     :     0.454 ms
Memfd 4 KiB    : FALLOCATE+POPULATE_READ  :     0.379 ms
Memfd 4 KiB    : FALLOCATE+POPULATE_WRITE :     0.268 ms
Memfd 2 MiB    : Read                     :     0.030 ms
Memfd 2 MiB    : Write                    :     0.030 ms
Memfd 2 MiB    : Read/Write               :     0.030 ms
Memfd 2 MiB    : POPULATE_READ            :     0.030 ms
Memfd 2 MiB    : POPULATE_WRITE           :     0.030 ms
Memfd 2 MiB    : FALLOCATE                :     0.030 ms
Memfd 2 MiB    : FALLOCATE+Read           :     0.031 ms
Memfd 2 MiB    : FALLOCATE+Write          :     0.031 ms
Memfd 2 MiB    : FALLOCATE+Read/Write     :     0.031 ms
Memfd 2 MiB    : FALLOCATE+POPULATE_READ  :     0.030 ms
Memfd 2 MiB    : FALLOCATE+POPULATE_WRITE :     0.030 ms
tmpfs          : Read                     :     0.416 ms
tmpfs          : Write                    :     0.369 ms
tmpfs          : Read/Write               :     0.425 ms
tmpfs          : POPULATE_READ            :     0.346 ms
tmpfs          : POPULATE_WRITE           :     0.295 ms
tmpfs          : FALLOCATE                :     0.139 ms
tmpfs          : FALLOCATE+Read           :     0.447 ms
tmpfs          : FALLOCATE+Write          :     0.333 ms
tmpfs          : FALLOCATE+Read/Write     :     0.454 ms
tmpfs          : FALLOCATE+POPULATE_READ  :     0.380 ms
tmpfs          : FALLOCATE+POPULATE_WRITE :     0.272 ms
file           : Read                     :     0.191 ms
file           : Write                    :     0.511 ms
file           : Read/Write               :     0.524 ms
file           : POPULATE_READ            :     0.196 ms
file           : POPULATE_WRITE           :     0.434 ms
file           : FALLOCATE                :     0.004 ms
file           : FALLOCATE+Read           :     0.197 ms
file           : FALLOCATE+Write          :     0.554 ms
file           : FALLOCATE+Read/Write     :     0.480 ms
file           : FALLOCATE+POPULATE_READ  :     0.201 ms
file           : FALLOCATE+POPULATE_WRITE :     0.381 ms
hugetlbfs      : Read                     :     0.030 ms
hugetlbfs      : Write                    :     0.030 ms
hugetlbfs      : Read/Write               :     0.030 ms
hugetlbfs      : POPULATE_READ            :     0.030 ms
hugetlbfs      : POPULATE_WRITE           :     0.030 ms
hugetlbfs      : FALLOCATE                :     0.030 ms
hugetlbfs      : FALLOCATE+Read           :     0.031 ms
hugetlbfs      : FALLOCATE+Write          :     0.031 ms
hugetlbfs      : FALLOCATE+Read/Write     :     0.030 ms
hugetlbfs      : FALLOCATE+POPULATE_READ  :     0.030 ms
hugetlbfs      : FALLOCATE+POPULATE_WRITE :     0.030 ms
**************************************************
4096 MiB MAP_SHARED:
**************************************************
Anon 4 KiB     : Read                     :  1053.090 ms
Anon 4 KiB     : Write                    :   913.642 ms
Anon 4 KiB     : Read/Write               :  1060.350 ms
Anon 4 KiB     : POPULATE_READ            :   893.691 ms
Anon 4 KiB     : POPULATE_WRITE           :   782.885 ms
Anon 2 MiB     : Read                     :   358.553 ms
Anon 2 MiB     : Write                    :   358.419 ms
Anon 2 MiB     : Read/Write               :   357.992 ms
Anon 2 MiB     : POPULATE_READ            :   357.533 ms
Anon 2 MiB     : POPULATE_WRITE           :   357.808 ms
Memfd 4 KiB    : Read                     :  1078.144 ms
Memfd 4 KiB    : Write                    :   942.036 ms
Memfd 4 KiB    : Read/Write               :  1100.391 ms
Memfd 4 KiB    : POPULATE_READ            :   925.829 ms
Memfd 4 KiB    : POPULATE_WRITE           :   804.394 ms
Memfd 4 KiB    : FALLOCATE                :   304.632 ms
Memfd 4 KiB    : FALLOCATE+Read           :  1163.359 ms
Memfd 4 KiB    : FALLOCATE+Write          :   933.186 ms
Memfd 4 KiB    : FALLOCATE+Read/Write     :  1187.304 ms
Memfd 4 KiB    : FALLOCATE+POPULATE_READ  :  1013.660 ms
Memfd 4 KiB    : FALLOCATE+POPULATE_WRITE :   794.560 ms
Memfd 2 MiB    : Read                     :   358.131 ms
Memfd 2 MiB    : Write                    :   358.099 ms
Memfd 2 MiB    : Read/Write               :   358.250 ms
Memfd 2 MiB    : POPULATE_READ            :   357.563 ms
Memfd 2 MiB    : POPULATE_WRITE           :   357.334 ms
Memfd 2 MiB    : FALLOCATE                :   356.735 ms
Memfd 2 MiB    : FALLOCATE+Read           :   358.152 ms
Memfd 2 MiB    : FALLOCATE+Write          :   358.331 ms
Memfd 2 MiB    : FALLOCATE+Read/Write     :   358.018 ms
Memfd 2 MiB    : FALLOCATE+POPULATE_READ  :   357.286 ms
Memfd 2 MiB    : FALLOCATE+POPULATE_WRITE :   357.523 ms
tmpfs          : Read                     :  1087.265 ms
tmpfs          : Write                    :   950.840 ms
tmpfs          : Read/Write               :  1107.567 ms
tmpfs          : POPULATE_READ            :   922.605 ms
tmpfs          : POPULATE_WRITE           :   810.094 ms
tmpfs          : FALLOCATE                :   306.320 ms
tmpfs          : FALLOCATE+Read           :  1169.796 ms
tmpfs          : FALLOCATE+Write          :   933.730 ms
tmpfs          : FALLOCATE+Read/Write     :  1191.610 ms
tmpfs          : FALLOCATE+POPULATE_READ  :  1020.474 ms
tmpfs          : FALLOCATE+POPULATE_WRITE :   798.945 ms
file           : Read                     :   654.101 ms
file           : Write                    :  1259.142 ms
file           : Read/Write               :  1289.509 ms
file           : POPULATE_READ            :   661.642 ms
file           : POPULATE_WRITE           :  1106.816 ms
file           : FALLOCATE                :     1.864 ms
file           : FALLOCATE+Read           :   656.328 ms
file           : FALLOCATE+Write          :  1153.300 ms
file           : FALLOCATE+Read/Write     :  1180.613 ms
file           : FALLOCATE+POPULATE_READ  :   668.347 ms
file           : FALLOCATE+POPULATE_WRITE :   996.143 ms
hugetlbfs      : Read                     :   357.245 ms
hugetlbfs      : Write                    :   357.413 ms
hugetlbfs      : Read/Write               :   357.120 ms
hugetlbfs      : POPULATE_READ            :   356.321 ms
hugetlbfs      : POPULATE_WRITE           :   356.693 ms
hugetlbfs      : FALLOCATE                :   355.927 ms
hugetlbfs      : FALLOCATE+Read           :   357.074 ms
hugetlbfs      : FALLOCATE+Write          :   357.120 ms
hugetlbfs      : FALLOCATE+Read/Write     :   356.983 ms
hugetlbfs      : FALLOCATE+POPULATE_READ  :   356.413 ms
hugetlbfs      : FALLOCATE+POPULATE_WRITE :   356.266 ms
**************************************************

[1] https://lkml.org/lkml/2013/6/27/698

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Jann Horn <jannh@google.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
Cc: Helge Deller <deller@gmx.de>
Cc: Chris Zankel <chris@zankel.net>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Rolf Eike Beer <eike-kernel@sf-tec.de>
Cc: linux-alpha@vger.kernel.org
Cc: linux-mips@vger.kernel.org
Cc: linux-parisc@vger.kernel.org
Cc: linux-xtensa@linux-xtensa.org
Cc: linux-arch@vger.kernel.org
Cc: Linux API <linux-api@vger.kernel.org>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 arch/alpha/include/uapi/asm/mman.h     |  3 ++
 arch/mips/include/uapi/asm/mman.h      |  3 ++
 arch/parisc/include/uapi/asm/mman.h    |  3 ++
 arch/xtensa/include/uapi/asm/mman.h    |  3 ++
 include/uapi/asm-generic/mman-common.h |  3 ++
 mm/gup.c                               | 58 ++++++++++++++++++++++
 mm/internal.h                          |  3 ++
 mm/madvise.c                           | 66 ++++++++++++++++++++++++++
 8 files changed, 142 insertions(+)

Comments

Michal Hocko May 18, 2021, 10:07 a.m. UTC | #1
[sorry for a long silence on this]

On Tue 11-05-21 10:15:31, David Hildenbrand wrote:
[...]

Thanks for the extensive usecase description. That is certainly useful
background. I am sorry to bring this up again but I am still not
convinced that READ/WRITE variants are the best interface.
 
> While the use case for MADV_POPULATE_WRITE is fairly obvious (i.e.,
> preallocate memory and prefault page tables for VMs), one issue is that
> whenever we prefault pages writable, the pages have to be marked dirty,
> because the CPU could dirty them any time. While not a real problem for
> hugetlbfs or dax/pmem, it can be a problem for shared file mappings: each
> page will be marked dirty and has to be written back later when evicting.
> 
> MADV_POPULATE_READ allows for optimizing this scenario: Pre-read a whole
> mapping from backend storage without marking it dirty, such that eviction
> won't have to write it back. As discussed above, shared file mappings
> might require an explicit fallocate() upfront to achieve
> preallocation+prepopulation.

This means that you want to have two different uses depending on the
underlying mapping type. MADV_POPULATE_READ seems rather weak for
anonymous/private mappings. Memory backed by zero pages seems rather
unhelpful as the PF would need to do all the heavy lifting anyway.
Or is there any actual usecase when this is desirable?

So the split into these two modes seems more like gup interface
shortcomings bubbling up to the interface. I do expect userspace only
cares about pre-faulting the address range. No matter what the backing
storage is. 

Or do I still misunderstand all the usecases?
David Hildenbrand May 18, 2021, 10:32 a.m. UTC | #2
On 18.05.21 12:07, Michal Hocko wrote:
> [sorry for a long silence on this]
> 
> On Tue 11-05-21 10:15:31, David Hildenbrand wrote:
> [...]
> 
> Thanks for the extensive usecase description. That is certainly useful
> background. I am sorry to bring this up again but I am still not
> convinced that READ/WRITE variant are the best interface.

Thanks for having time to look into this.

>   
>> While the use case for MADV_POPULATE_WRITE is fairly obvious (i.e.,
>> preallocate memory and prefault page tables for VMs), one issue is that
>> whenever we prefault pages writable, the pages have to be marked dirty,
>> because the CPU could dirty them any time. While not a real problem for
>> hugetlbfs or dax/pmem, it can be a problem for shared file mappings: each
>> page will be marked dirty and has to be written back later when evicting.
>>
>> MADV_POPULATE_READ allows for optimizing this scenario: Pre-read a whole
>> mapping from backend storage without marking it dirty, such that eviction
>> won't have to write it back. As discussed above, shared file mappings
>> might require an explicit fallocate() upfront to achieve
>> preallocation+prepopulation.
> 
> This means that you want to have two different uses depending on the
> underlying mapping type. MADV_POPULATE_READ seems rather weak for
> anonymous/private mappings. Memory backed by zero pages seems rather
> unhelpful as the PF would need to do all the heavy lifting anyway.
> Or is there any actual usecase when this is desirable?

Currently, userfaultfd-wp, which requires "some mapping" to be able to 
arm successfully. In QEMU, we currently have to prefault the shared 
zeropage for userfaultfd-wp to work as expected. I expect that use case 
might vanish over time (eventually with new kernels and updated user 
space), but it might stick for a bit.

Apart from that, populating the shared zeropage might be relevant in 
some corner cases: I remember there are sparse matrix algorithms that 
operate heavily on the shared zeropage.

> 
> So the split into these two modes seems more like gup interface
> shortcomings bubbling up to the interface. I do expect userspace only
> cares about pre-faulting the address range. No matter what the backing
> storage is.
> 
> Or do I still misunderstand all the usecases?

Let me give you an example where we really cannot tell what would be 
best from a kernel perspective.

a) Mapping a file into a VM to be used as RAM. We might expect the guest 
writing all memory immediately (e.g., booting Windows). We would want 
MADV_POPULATE_WRITE as we expect a write access immediately.

b) Mapping a file into a VM to be used as fake-NVDIMM, for example, 
ROOTFS or just data storage. We expect mostly reading from this memory, 
thus, we would want MADV_POPULATE_READ.

Instead of trying to be smart in the kernel, I think for this case it 
makes much more sense to provide user space the options. IMHO it doesn't 
really hurt to let user space decide on what it thinks is best.
Michal Hocko May 18, 2021, 11:17 a.m. UTC | #3
On Tue 18-05-21 12:32:12, David Hildenbrand wrote:
> On 18.05.21 12:07, Michal Hocko wrote:
> > [sorry for a long silence on this]
> > 
> > On Tue 11-05-21 10:15:31, David Hildenbrand wrote:
> > [...]
> > 
> > Thanks for the extensive usecase description. That is certainly useful
> > background. I am sorry to bring this up again but I am still not
> > convinced that READ/WRITE variant are the best interface.
> 
> Thanks for having time to look into this.
> 
> > > While the use case for MADV_POPULATE_WRITE is fairly obvious (i.e.,
> > > preallocate memory and prefault page tables for VMs), one issue is that
> > > whenever we prefault pages writable, the pages have to be marked dirty,
> > > because the CPU could dirty them any time. While not a real problem for
> > > hugetlbfs or dax/pmem, it can be a problem for shared file mappings: each
> > > page will be marked dirty and has to be written back later when evicting.
> > > 
> > > MADV_POPULATE_READ allows for optimizing this scenario: Pre-read a whole
> > > mapping from backend storage without marking it dirty, such that eviction
> > > won't have to write it back. As discussed above, shared file mappings
> > > might require an explicit fallocate() upfront to achieve
> > > preallocation+prepopulation.
> > 
> > This means that you want to have two different uses depending on the
> > underlying mapping type. MADV_POPULATE_READ seems rather weak for
> > anonymous/private mappings. Memory backed by zero pages seems rather
> > unhelpful as the PF would need to do all the heavy lifting anyway.
> > Or is there any actual usecase when this is desirable?
> 
> Currently, userfaultfd-wp, which requires "some mapping" to be able to arm
> successfully. In QEMU, we currently have to prefault the shared zeropage for
> userfaultfd-wp to work as expected.

Just for clarification. The aim is to reduce the memory footprint at the
same time, right? If that is really the case then this is worth adding.

> I expect that use case might vanish over
> time (eventually with new kernels and updated user space), but it might
> stick for a bit.

Could you elaborate some more please?

> Apart from that, populating the shared zeropage might be relevant in some
> corner cases: I remember there are sparse matrix algorithms that operate
> heavily on the shared zeropage.

I am not sure I see why this would be a useful interface for those? Zero
page read fault is really low cost. Or are you worried about cumulative
overhead by entering the kernel many times?

> > So the split into these two modes seems more like gup interface
> > shortcomings bubbling up to the interface. I do expect userspace only
> > cares about pre-faulting the address range. No matter what the backing
> > storage is.
> > 
> > Or do I still misunderstand all the usecases?
> 
> Let me give you an example where we really cannot tell what would be best
> from a kernel perspective.
> 
> a) Mapping a file into a VM to be used as RAM. We might expect the guest
> writing all memory immediately (e.g., booting Windows). We would want
> MADV_POPULATE_WRITE as we expect a write access immediately.
> 
> b) Mapping a file into a VM to be used as fake-NVDIMM, for example, ROOTFS
> or just data storage. We expect mostly reading from this memory, thus, we
> would want MADV_POPULATE_READ.

I am afraid I do not follow. Could you be more explicit about advantages
of using those two modes for those example usecases? Is that to share
resources (e.g. by not breaking CoW)?

> Instead of trying to be smart in the kernel, I think for this case it makes
> much more sense to provide user space the options. IMHO it doesn't really
> hurt to let user space decide on what it thinks is best.

I am mostly worried that this will turn out to be more confusing than
helpful. People will need to grasp non trivial concepts and kernel
internal implementation details about how read/write faults are handled.

Thanks!
David Hildenbrand May 18, 2021, 12:03 p.m. UTC | #4
>>> This means that you want to have two different uses depending on the
>>> underlying mapping type. MADV_POPULATE_READ seems rather weak for
>>> anonymous/private mappings. Memory backed by zero pages seems rather
>>> unhelpful as the PF would need to do all the heavy lifting anyway.
>>> Or is there any actual usecase when this is desirable?
>>
>> Currently, userfaultfd-wp, which requires "some mapping" to be able to arm
>> successfully. In QEMU, we currently have to prefault the shared zeropage for
>> userfaultfd-wp to work as expected.
> 
> Just for clarification. The aim is to reduce the memory footprint at the
> same time, right? If that is really the case then this is worth adding.

Yes. userfaultfd-wp is right now used in QEMU for background 
snapshotting of VMs. Just because you trigger a background snapshot 
doesn't mean that you want to COW all pages. (especially, if your VM 
previously inflated the balloon, was using free page reporting etc.)

> 
>> I expect that use case might vanish over
>> time (eventually with new kernels and updated user space), but it might
>> stick for a bit.
> 
> Could you elaborate some more please?

After I raised that the current behavior of userfaultfd-wp is 
suboptimal, Peter started working on a userfaultfd-wp mode that doesn't 
require prefaulting all pages just to have it working reliably -- 
getting notified when any page changes, including ones that haven't been 
populated yet and would have been populated with the shared zeropage on 
first access. Not sure what the state of that is and when we might see it.

> 
>> Apart from that, populating the shared zeropage might be relevant in some
>> corner cases: I remember there are sparse matrix algorithms that operate
>> heavily on the shared zeropage.
> 
> I am not sure I see why this would be a useful interface for those? Zero
> page read fault is really low cost. Or are you worried about cumulative
> overhead by entering the kernel many times?

Yes, cumulative overhead when dealing with large, sparse matrices. Just 
an example where I think it could be applied in the future -- but not 
that I consider populating the shared zeropage a really important use 
case in general (besides for userfaultfd-wp right now).

> 
>>> So the split into these two modes seems more like gup interface
>>> shortcomings bubbling up to the interface. I do expect userspace only
>>> cares about pre-faulting the address range. No matter what the backing
>>> storage is.
>>>
>>> Or do I still misunderstand all the usecases?
>>
>> Let me give you an example where we really cannot tell what would be best
>> from a kernel perspective.
>>
>> a) Mapping a file into a VM to be used as RAM. We might expect the guest
>> writing all memory immediately (e.g., booting Windows). We would want
>> MADV_POPULATE_WRITE as we expect a write access immediately.
>>
>> b) Mapping a file into a VM to be used as fake-NVDIMM, for example, ROOTFS
>> or just data storage. We expect mostly reading from this memory, thus, we
>> would want MADV_POPULATE_READ.
> 
> I am afraid I do not follow. Could you be more explicit about advantages
> of using those two modes for those example usecases? Is that to share
> resources (e.g. by not breaking CoW)?

I'm only talking about shared mappings of "ordinary files" for now, 
because that's where MADV_POPULATE_READ and MADV_POPULATE_WRITE differ 
with regard to "mark something dirty and write it back"; CoW doesn't 
apply to shared mappings, so it's really just a difference in dirtying 
and having to write back. For things like PMEM/hugetlbfs/... we usually 
want MADV_POPULATE_WRITE because then we'd avoid a context switch when 
our VM actually writes to a page the first time -- and we don't care 
about dirtying, because we don't have writeback.

But again, that's just one use case I have in mind coming from the VM 
area. I consider MADV_POPULATE_READ really only useful when we are 
expecting mostly read access on a mapping. (I assume there are other use 
cases for databases etc. not explored yet where MADV_POPULATE_WRITE 
would not be desired for performance reasons)

> 
>> Instead of trying to be smart in the kernel, I think for this case it makes
>> much more sense to provide user space the options. IMHO it doesn't really
>> hurt to let user space decide on what it thinks is best.
> 
> I am mostly worried that this will turn out to be more confusing than
> helpful. People will need to grasp non trivial concepts and kernel
> internal implementation details about how read/write faults are handled.

And that's the point: in the simplest case (without any additional 
considerations about the underlying mapping), if you end up mostly 
*reading*, MADV_POPULATE_READ is the right thing. If you end up mostly 
*writing*, MADV_POPULATE_WRITE is the right thing. Care only has to be 
taken when you really want a "preallocation" as in "allocate backend 
storage" or "don't ever use the shared zeropage". I agree that these 
details require more knowledge, but so does anything that messes with 
memory mappings on that level (VMs, databases, ...).

QEMU currently implements exactly these two cases manually in user space.
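
For illustration, manual prefaulting in user space can be sketched as
below. This is a simplified sketch, not QEMU's actual code; the function
names are hypothetical. Note that a failing populate (e.g., no backend
memory left on hugetlbfs or shmem) surfaces as SIGBUS here rather than as
an error code, and the write variant is racy if another thread modifies
the same byte concurrently:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <unistd.h>

/* Read one byte per page: triggers read faults only. */
static void touch_pages_read(const void *addr, size_t len)
{
	size_t page = (size_t)sysconf(_SC_PAGESIZE);
	const volatile char *p = addr;

	assert(page > 0);
	for (size_t off = 0; off < len; off += page)
		(void)p[off];
}

/*
 * Read one byte per page and write the same value back: triggers
 * write faults without changing the content.
 */
static void touch_pages_write(void *addr, size_t len)
{
	size_t page = (size_t)sysconf(_SC_PAGESIZE);
	volatile char *p = addr;

	assert(page > 0);
	for (size_t off = 0; off < len; off += page)
		p[off] = p[off];
}
```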

Anyhow, please suggest a way to handle it via a single flag in the 
kernel -- which would be some kind of heuristic as we know from 
MAP_POPULATE. Having an alternative at hand would make it easier to 
discuss this topic further. I certainly *don't* want MAP_POPULATE 
semantics when it comes to MADV_POPULATE, especially when it comes to 
shared mappings. Not useful in QEMU now and in the future.

We could make MADV_POPULATE act depending on the readability/writability 
of a mapping. Use MADV_POPULATE_WRITE on writable mappings, use 
MADV_POPULATE_READ on readable mappings. Certainly not perfect for use 
cases where you have writable mappings that are mostly read only (as in 
the example with fake-NVDIMMs I gave ...), but if it makes people happy, 
fine with me. I mostly care about MADV_POPULATE_WRITE.
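
As a rough user-space model of that heuristic (a sketch only; the helper
name is hypothetical, and an in-kernel version would look at the VMA
flags rather than at mmap() protection bits):

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sys/mman.h>

/* Values proposed by this patch (generic ABI); older headers lack them. */
#ifndef MADV_POPULATE_READ
#define MADV_POPULATE_READ 22
#endif
#ifndef MADV_POPULATE_WRITE
#define MADV_POPULATE_WRITE 23
#endif

/*
 * Hypothetical model of a single-flag MADV_POPULATE: pick the populate
 * mode from the protection of the mapping.
 * Returns the advice value, or -1 if the mapping is not accessible.
 */
static int populate_advice_for_prot(int prot)
{
	/* Precondition: only the standard protection bits are set. */
	assert((prot & ~(PROT_READ | PROT_WRITE | PROT_EXEC)) == 0);

	if (prot & PROT_WRITE)
		return MADV_POPULATE_WRITE;
	if (prot & PROT_READ)
		return MADV_POPULATE_READ;
	return -1;
}
```

The model makes the limitation visible: a writable mapping that is mostly
read (the fake-NVDIMM example) always ends up with MADV_POPULATE_WRITE.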
Michal Hocko May 20, 2021, 1:44 p.m. UTC | #5
On Tue 18-05-21 14:03:52, David Hildenbrand wrote:
[...]
> > > I expect that use case might vanish over
> > > time (eventually with new kernels and updated user space), but it might
> > > stick for a bit.
> > 
> > Could you elaborate some more please?
> 
> After I raised that the current behavior of userfaultfd-wp is suboptimal,
> Peter started working on a userfaultfd-wp mode that doesn't require to
> prefault all pages just to have it working reliably -- getting notified when
> any page changes, including ones that haven't been populated yet and would
> have been populated with the shared zeropage on first access. Not sure what
> the state of that is and when we might see it.

OK, thanks for the clarification. This suggests that this use case
alone is not the strongest justification for inventing a new interface.
But that doesn't disqualify it either.

> > > Apart from that, populating the shared zeropage might be relevant in some
> > > corner cases: I remember there are sparse matrix algorithms that operate
> > > heavily on the shared zeropage.
> > 
> > I am not sure I see why this would be a useful interface for those? Zero
> > page read fault is really low cost. Or are you worried about cumulative
> > overhead by entering the kernel many times?
> 
> Yes, cumulative overhead when dealing with large, sparse matrices. Just an
> example where I think it could be applied in the future -- but not that I
> consider populating the shared zeropage a really important use case in
> general (besides for userfaultfd-wp right now).

OK.
 
[...]
> Anyhow, please suggest a way to handle it via a single flag in the kernel --
> which would be some kind of heuristic as we know from MAP_POPULATE. Having
> an alternative at hand would make it easier to discuss this topic further. I
> certainly *don't* want MAP_POPULATE semantics when it comes to
> MADV_POPULATE, especially when it comes to shared mappings. Not useful in
> QEMU now and in the future.

OK, this point is still not entirely clear to me. Elsewhere you are
saying that QEMU cannot use MAP_POPULATE because it ignores errors
and also doesn't support sparse mappings because it applies to the
whole mmap. These are all clear but it is less clear to me why the same
semantic is not applicable for QEMU when used through madvise interface
which can handle both of those.
Do I get it right that you really want to emulate the full-fledged write
fault to a) avoid another write fault when the content is actually
modified and b) prevent potential errors during the write fault
(e.g. mkwrite failing on the fs data)?

> We could make MADV_POPULATE act depending on the readability/writability of
> a mapping. Use MADV_POPULATE_WRITE on writable mappings, use
> MADV_POPULATE_READ on readable mappings. Certainly not perfect for use cases
> where you have writable mappings that are mostly read only (as in the
> example with fake-NVDIMMs I gave ...), but if it makes people happy, fine
> with me. I mostly care about MADV_POPULATE_WRITE.

Yes, this is where my thinking was going as well. Essentially define
MADV_POPULATE as "Populate the mapping with the memory based on the
mapping access." This looks like a straightforward semantic to me and it
doesn't really require any deep knowledge of internals.

Now, I was trying to compare which of those would be more tricky to
understand and use and TBH I am not really convinced either of the two is
much better. Separate READ/WRITE modes are explicit, which can be good,
but they will require quite advanced knowledge of the #PF behavior.
On the other hand MADV_POPULATE would require some tricks like mmap,
madvise and mprotect (to change to writable) when the data is really
written to. I am not sure how much of a deal this would be for QEMU, for
example.

So, all that being said, I am not really sure. I am not really happy
about READ/WRITE split but if a simpler interface is going to be a bad
fit for existing usecases then I believe the proper way to go is to
document the more complex interface thoroughly.
David Hildenbrand May 21, 2021, 8:48 a.m. UTC | #6
> [...]
>> Anyhow, please suggest a way to handle it via a single flag in the kernel --
>> which would be some kind of heuristic as we know from MAP_POPULATE. Having
>> an alternative at hand would make it easier to discuss this topic further. I
>> certainly *don't* want MAP_POPULATE semantics when it comes to
>> MADV_POPULATE, especially when it comes to shared mappings. Not useful in
>> QEMU now and in the future.
> 
> OK, this point is still not entirely clear to me. Elsewhere you are
> saying that QEMU cannot use MAP_POPULATE because it ignores errors
> and also it doesn't support sparse mappings because they apply to the
> whole mmap. These are all clear but it is less clear to me why the same
> semantic is not applicable for QEMU when used through madvise interface
> which can handle both of those.

It's a combination of things:

a) MAP_POPULATE never was an option simply because of deferred
    "prealloc=on" handling in QEMU, happening way after we created the
    memmap. Further it doesn't report if there was an error, which is
    another reason why it's basically useless for QEMU use cases.
b) QEMU uses manual read-write prefaulting for "preallocation", for
    example, to avoid SIGBUS on hugetlbfs or shmem at runtime. There are
    cases where we absolutely want to avoid crashing the VM later just
    because of a user error. MAP_POPULATE does *not* do what we want for
    shared mappings, because it triggers a read fault.
c) QEMU uses the same mechanism for prefaulting in RT environments,
    where we want to avoid any kind of pagefault, using mlock() etc.
d) MAP_POPULATE does not apply to sparse memory mappings that I'll be
    using more heavily in QEMU, also for the purpose of preallocation
    with virtio-mem.
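
The sparse-mapping pattern from d) can be sketched as follows, assuming an
anonymous private MAP_NORESERVE region and the advice value 23 that this
patch proposes for the generic ABI. All helper names are hypothetical;
real users such as virtio-mem would typically operate on file-backed or
hugetlbfs mappings instead:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>

/* Value proposed by this patch (generic ABI); older headers lack it. */
#ifndef MADV_POPULATE_WRITE
#define MADV_POPULATE_WRITE 23
#endif

/* Reserve address space only; no memory is populated yet. */
static void *sparse_reserve(size_t len)
{
	void *p;

	assert(len > 0);
	p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
	return p == MAP_FAILED ? NULL : p;
}

/*
 * Populate a sub-range writable. Returns 0, or -1 with errno set
 * (instead of a later SIGBUS on first access).
 */
static int sparse_populate(void *addr, size_t len)
{
	return madvise(addr, len, MADV_POPULATE_WRITE);
}

/* Discard a sub-range again; the address range stays mapped. */
static int sparse_discard(void *addr, size_t len)
{
	return madvise(addr, len, MADV_DONTNEED);
}
```

The mapping itself is created once; populating and discarding then happen
per sub-range without any remapping.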

See the current QEMU code along with a comment in

https://github.com/qemu/qemu/blob/972e848b53970d12cb2ca64687ef8ff797fb6236/util/oslib-posix.c#L496

It's especially bad for PMEM ("wear on the storage backing"), which is 
why we have to rely on users not to trigger preallocation/prefaulting 
on PMEM. Otherwise (as already expressed via bug reports) we waste a lot 
of time when backing VMs on PMEM or forwarding NVDIMMs, unnecessarily 
reading/writing (slow) DAX.

> Do I get it right that you really want to emulate the full fledged write
> fault to a) limit another write fault when the content is actually
> modified and b) prevent from potential errors during the write fault
> (e.g. mkwrite failing on the fs data)?

Yes, for the use case of "preallocation" in QEMU. See the QEMU link.

But again, the thing that makes it more complicated is that I can come 
up with some use cases that want to handle "shared mappings of ordinary 
files" a little better. Or the userfaultfd-wp example I gave, where 
prefaulting via MADV_POPULATE_READ can roughly halve the population time.

>> We could make MADV_POPULATE act depending on the readability/writability of
>> a mapping. Use MADV_POPULATE_WRITE on writable mappings, use
>> MADV_POPULATE_READ on readable mappings. Certainly not perfect for use cases
>> where you have writable mappings that are mostly read only (as in the
>> example with fake-NVDIMMs I gave ...), but if it makes people happy, fine
>> with me. I mostly care about MADV_POPULATE_WRITE.
> 
> Yes, this is where my thinking was going as well. Essentially define
> MADV_POPULATE as "Populate the mapping with the memory based on the
> mapping access." This looks like a straightforward semantic to me and it
> doesn't really require any deep knowledge of internals.
> 
> Now, I was trying to compare which of those would be more tricky to
> understand and use and TBH I am not really convinced any of the two is
> much better. Separate READ/WRITE modes are explicit which can be good
> but it will require quite an advanced knowledge of the #PF behavior.
> On the other hand MADV_POPULATE would require some tricks like mmap,
> madvise and mprotect(to change to writable) when the data is really
> written to. I am not sure how much of a deal this would be for QEMU for
> example.

IIRC, at the time we enable background snapshotting, the VM is running 
and we cannot temporarily mprotect(PROT_READ) without making the guest 
crash. But again, uffd-wp handling is somewhat a special case because 
the implementation in the kernel is really suboptimal.

The reason I chose MADV_POPULATE_READ + MADV_POPULATE_WRITE is because 
it really mimics what user space currently does to get the job done.

I guess the important part to document is that "be careful when using 
MADV_POPULATE_READ because it might just populate the shared zeropage" 
and "be careful with MADV_POPULATE_WRITE because it will do the same as 
when writing to every page: dirty the pages such that they will have to 
be written back when backed by actual files".


The current man page entry for MADV_POPULATE_READ reads:

"
Populate (prefault) page tables readable for the whole range without 
actually reading. Depending on the underlying mapping, map the shared 
zeropage, preallocate memory or read the underlying file. Do not 
generate SIGBUS when populating fails, return an error instead.

If MADV_POPULATE_READ succeeds, all page tables have been populated 
(prefaulted) readable once. If MADV_POPULATE_READ fails, some page 
tables might have been populated.

MADV_POPULATE_READ cannot be applied to mappings without read 
permissions and special mappings marked with the kernel-internal 
VM_PFNMAP and VM_IO.

Note that with MADV_POPULATE_READ, the process can still be killed at 
any moment when the system runs out of memory.
"

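A caller could interpret the error returned in place of SIGBUS roughly as
sketched below. The errno values follow the switch statement in
madvise_populate() in this patch; the helper name and the wording of the
hints are assumptions for illustration:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>

/*
 * Hypothetical helper: map an errno value from
 * madvise(MADV_POPULATE_READ/WRITE) to a short hint.
 * On any failure, part of the range might already be populated.
 */
static const char *populate_errno_hint(int err)
{
	assert(err >= 0);

	switch (err) {
	case 0:
		return "populated";
	case EINTR:
		return "interrupted by a signal";
	case EINVAL:
		/* Also what kernels without this advice return. */
		return "incompatible mapping or insufficient permissions";
	case ENOMEM:
		return "out of memory, or part of the range is not mapped";
#ifdef EHWPOISON
	case EHWPOISON:
		return "hit a hardware-poisoned page";
#endif
	default:
		return "unexpected error";
	}
}
```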

> 
> So, all that being said, I am not really sure. I am not really happy
> about READ/WRITE split but if a simpler interface is going to be a bad
> fit for existing usecases then I believe the proper way to go is to
> document the more complex interface thoroughly.

I think with the split we are better off long term, without requiring 
workarounds (mprotect()) to make some use cases work.

But again, if there is a good justification why a single MADV_POPULATE 
makes sense, I'm happy to change it. Again, for me, the most important 
thing long-term is MADV_POPULATE_WRITE because that's really what QEMU 
mainly uses right now for preallocation. But I can see use cases for 
MADV_POPULATE_READ as well.

Thanks for your input!

Patch

diff --git a/arch/alpha/include/uapi/asm/mman.h b/arch/alpha/include/uapi/asm/mman.h
index a18ec7f63888..56b4ee5a6c9e 100644
--- a/arch/alpha/include/uapi/asm/mman.h
+++ b/arch/alpha/include/uapi/asm/mman.h
@@ -71,6 +71,9 @@ 
 #define MADV_COLD	20		/* deactivate these pages */
 #define MADV_PAGEOUT	21		/* reclaim these pages */
 
+#define MADV_POPULATE_READ	22	/* populate (prefault) page tables readable */
+#define MADV_POPULATE_WRITE	23	/* populate (prefault) page tables writable */
+
 /* compatibility flags */
 #define MAP_FILE	0
 
diff --git a/arch/mips/include/uapi/asm/mman.h b/arch/mips/include/uapi/asm/mman.h
index 57dc2ac4f8bd..40b210c65a5a 100644
--- a/arch/mips/include/uapi/asm/mman.h
+++ b/arch/mips/include/uapi/asm/mman.h
@@ -98,6 +98,9 @@ 
 #define MADV_COLD	20		/* deactivate these pages */
 #define MADV_PAGEOUT	21		/* reclaim these pages */
 
+#define MADV_POPULATE_READ	22	/* populate (prefault) page tables readable */
+#define MADV_POPULATE_WRITE	23	/* populate (prefault) page tables writable */
+
 /* compatibility flags */
 #define MAP_FILE	0
 
diff --git a/arch/parisc/include/uapi/asm/mman.h b/arch/parisc/include/uapi/asm/mman.h
index ab78cba446ed..9e3c010c0f61 100644
--- a/arch/parisc/include/uapi/asm/mman.h
+++ b/arch/parisc/include/uapi/asm/mman.h
@@ -52,6 +52,9 @@ 
 #define MADV_COLD	20		/* deactivate these pages */
 #define MADV_PAGEOUT	21		/* reclaim these pages */
 
+#define MADV_POPULATE_READ	22	/* populate (prefault) page tables readable */
+#define MADV_POPULATE_WRITE	23	/* populate (prefault) page tables writable */
+
 #define MADV_MERGEABLE   65		/* KSM may merge identical pages */
 #define MADV_UNMERGEABLE 66		/* KSM may not merge identical pages */
 
diff --git a/arch/xtensa/include/uapi/asm/mman.h b/arch/xtensa/include/uapi/asm/mman.h
index e5e643752947..b3a22095371b 100644
--- a/arch/xtensa/include/uapi/asm/mman.h
+++ b/arch/xtensa/include/uapi/asm/mman.h
@@ -106,6 +106,9 @@ 
 #define MADV_COLD	20		/* deactivate these pages */
 #define MADV_PAGEOUT	21		/* reclaim these pages */
 
+#define MADV_POPULATE_READ	22	/* populate (prefault) page tables readable */
+#define MADV_POPULATE_WRITE	23	/* populate (prefault) page tables writable */
+
 /* compatibility flags */
 #define MAP_FILE	0
 
diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
index f94f65d429be..1567a3294c3d 100644
--- a/include/uapi/asm-generic/mman-common.h
+++ b/include/uapi/asm-generic/mman-common.h
@@ -72,6 +72,9 @@ 
 #define MADV_COLD	20		/* deactivate these pages */
 #define MADV_PAGEOUT	21		/* reclaim these pages */
 
+#define MADV_POPULATE_READ	22	/* populate (prefault) page tables readable */
+#define MADV_POPULATE_WRITE	23	/* populate (prefault) page tables writable */
+
 /* compatibility flags */
 #define MAP_FILE	0
 
diff --git a/mm/gup.c b/mm/gup.c
index ef7d2da9f03f..632d12469deb 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1403,6 +1403,64 @@  long populate_vma_page_range(struct vm_area_struct *vma,
 				NULL, NULL, locked);
 }
 
+/*
+ * faultin_vma_page_range() - populate (prefault) page tables inside the
+ *			      given VMA range readable/writable
+ *
+ * This takes care of mlocking the pages, too, if VM_LOCKED is set.
+ *
+ * @vma: target vma
+ * @start: start address
+ * @end: end address
+ * @write: whether to prefault readable or writable
+ * @locked: whether the mmap_lock is still held
+ *
+ * Returns either number of processed pages in the vma, or a negative error
+ * code on error (see __get_user_pages()).
+ *
+ * vma->vm_mm->mmap_lock must be held. The range must be page-aligned and
+ * covered by the VMA.
+ *
+ * If @locked is NULL, it may be held for read or write and will be unperturbed.
+ *
+ * If @locked is non-NULL, it must be held for read only and may be released.
+ * If it's released, *@locked will be set to 0.
+ */
+long faultin_vma_page_range(struct vm_area_struct *vma, unsigned long start,
+			    unsigned long end, bool write, int *locked)
+{
+	struct mm_struct *mm = vma->vm_mm;
+	unsigned long nr_pages = (end - start) / PAGE_SIZE;
+	int gup_flags;
+
+	VM_BUG_ON(!PAGE_ALIGNED(start));
+	VM_BUG_ON(!PAGE_ALIGNED(end));
+	VM_BUG_ON_VMA(start < vma->vm_start, vma);
+	VM_BUG_ON_VMA(end > vma->vm_end, vma);
+	mmap_assert_locked(mm);
+
+	/*
+	 * FOLL_TOUCH: Mark page accessed and thereby young; will also mark
+	 * 	       the page dirty with FOLL_WRITE -- which doesn't make a
+	 * 	       difference with !FOLL_FORCE, because the page is writable
+	 * 	       in the page table.
+	 * FOLL_HWPOISON: Return -EHWPOISON instead of -EFAULT when we hit
+	 *		  a poisoned page.
+	 * FOLL_POPULATE: Always populate memory with VM_LOCKONFAULT.
+	 * !FOLL_FORCE: Require proper access permissions.
+	 */
+	gup_flags = FOLL_TOUCH | FOLL_POPULATE | FOLL_MLOCK | FOLL_HWPOISON;
+	if (write)
+		gup_flags |= FOLL_WRITE;
+
+	/*
+	 * See check_vma_flags(): Will return -EFAULT on incompatible mappings
+	 * or with insufficient permissions.
+	 */
+	return __get_user_pages(mm, start, nr_pages, gup_flags,
+				NULL, NULL, locked);
+}
+
 /*
  * __mm_populate - populate and/or mlock pages within a range of address space.
  *
diff --git a/mm/internal.h b/mm/internal.h
index bbf1c1274983..41e8d41a5d1e 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -355,6 +355,9 @@  void __vma_unlink_list(struct mm_struct *mm, struct vm_area_struct *vma);
 #ifdef CONFIG_MMU
 extern long populate_vma_page_range(struct vm_area_struct *vma,
 		unsigned long start, unsigned long end, int *locked);
+extern long faultin_vma_page_range(struct vm_area_struct *vma,
+				   unsigned long start, unsigned long end,
+				   bool write, int *locked);
 extern void munlock_vma_pages_range(struct vm_area_struct *vma,
 			unsigned long start, unsigned long end);
 static inline void munlock_vma_pages_all(struct vm_area_struct *vma)
diff --git a/mm/madvise.c b/mm/madvise.c
index 01fef79ac761..a02cbda942ba 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -53,6 +53,8 @@  static int madvise_need_mmap_write(int behavior)
 	case MADV_COLD:
 	case MADV_PAGEOUT:
 	case MADV_FREE:
+	case MADV_POPULATE_READ:
+	case MADV_POPULATE_WRITE:
 		return 0;
 	default:
 		/* be safe, default to 1. list exceptions explicitly */
@@ -822,6 +824,61 @@  static long madvise_dontneed_free(struct vm_area_struct *vma,
 		return -EINVAL;
 }
 
+static long madvise_populate(struct vm_area_struct *vma,
+			     struct vm_area_struct **prev,
+			     unsigned long start, unsigned long end,
+			     int behavior)
+{
+	const bool write = behavior == MADV_POPULATE_WRITE;
+	struct mm_struct *mm = vma->vm_mm;
+	unsigned long tmp_end;
+	int locked = 1;
+	long pages;
+
+	*prev = vma;
+
+	while (start < end) {
+		/*
+		 * We might have temporarily dropped the lock. For example,
+		 * our VMA might have been split.
+		 */
+		if (!vma || start >= vma->vm_end) {
+			vma = find_vma(mm, start);
+			if (!vma || start < vma->vm_start)
+				return -ENOMEM;
+		}
+
+		tmp_end = min_t(unsigned long, end, vma->vm_end);
+		/* Populate (prefault) page tables readable/writable. */
+		pages = faultin_vma_page_range(vma, start, tmp_end, write,
+					       &locked);
+		if (!locked) {
+			mmap_read_lock(mm);
+			locked = 1;
+			*prev = NULL;
+			vma = NULL;
+		}
+		if (pages < 0) {
+			switch (pages) {
+			case -EINTR:
+				return -EINTR;
+			case -EFAULT: /* Incompatible mappings / permissions. */
+				return -EINVAL;
+			case -EHWPOISON:
+				return -EHWPOISON;
+			default:
+				pr_warn_once("%s: unhandled return value: %ld\n",
+					     __func__, pages);
+				fallthrough;
+			case -ENOMEM:
+				return -ENOMEM;
+			}
+		}
+		start += pages * PAGE_SIZE;
+	}
+	return 0;
+}
+
 /*
  * Application wants to free up the pages and associated backing store.
  * This is effectively punching a hole into the middle of a file.
@@ -935,6 +992,9 @@  madvise_vma(struct vm_area_struct *vma, struct vm_area_struct **prev,
 	case MADV_FREE:
 	case MADV_DONTNEED:
 		return madvise_dontneed_free(vma, prev, start, end, behavior);
+	case MADV_POPULATE_READ:
+	case MADV_POPULATE_WRITE:
+		return madvise_populate(vma, prev, start, end, behavior);
 	default:
 		return madvise_behavior(vma, prev, start, end, behavior);
 	}
@@ -955,6 +1015,8 @@  madvise_behavior_valid(int behavior)
 	case MADV_FREE:
 	case MADV_COLD:
 	case MADV_PAGEOUT:
+	case MADV_POPULATE_READ:
+	case MADV_POPULATE_WRITE:
 #ifdef CONFIG_KSM
 	case MADV_MERGEABLE:
 	case MADV_UNMERGEABLE:
@@ -1042,6 +1104,10 @@  process_madvise_behavior_valid(int behavior)
  *		easily if memory pressure hanppens.
  *  MADV_PAGEOUT - the application is not expected to use this memory soon,
  *		page out the pages in this range immediately.
+ *  MADV_POPULATE_READ - populate (prefault) page tables readable by
+ *		triggering read faults if required
+ *  MADV_POPULATE_WRITE - populate (prefault) page tables writable by
+ *		triggering write faults if required
  *
  * return values:
  *  zero    - success