mbox series

[v1,00/17] mm: MM owner tracking for large folios (!hugetlb) + CONFIG_NO_PAGE_MAPCOUNT

Message ID 20240829165627.2256514-1-david@redhat.com (mailing list archive)
Headers show
Series mm: MM owner tracking for large folios (!hugetlb) + CONFIG_NO_PAGE_MAPCOUNT | expand

Message

David Hildenbrand Aug. 29, 2024, 4:56 p.m. UTC
RMAP overhaul and optimizations, PTE batching, large mapcount,
folio_likely_mapped_shared() introduction and optimizations, page_mapcount
cleanups and preparations ... it's been quite some work to get to this
point.

Next up is being able to identify -- without false positives, without
page-mapcounts and without page table/rmap scanning -- whether a
large folio is "mapped exclusively" into a single MM, and using that
information to implement Copy-on-Write reuse and to improve
folio_likely_mapped_shared() for large folios.

... and based on that, finally introducing a kernel config option that
let's us not use+maintain per-page mapcounts in large folios, improving
performance of (un)map operations today, taking one step towards
supporting large folios  > PMD_SIZE, and preparing for the bright future
where we might no longer have a mapcount per page at all.

The bigger picture was presented at LSF/MM [1].

This series is effectively a follow-up on my early work from last
year [2], which proposed a precise way to identify whether a large folio is
"mapped shared" into multiple MMs or "mapped exclusively" into a single MM.

While that advanced approach has been simplified and optimized in the
meantime, let's start with something simpler first -- "certainly mapped
exclusive" vs. ""maybe mapped shared" -- so we can start learning about
the effects and TODOs that some of the implied changes of losing
per-page mapcounts has.

I have plans to exchange the simple approach used in this series at some
point by the advanced approach, but one important thing to learn if the
imprecision in the simple approach is relevant in practice.

64BIT only, and unless enabled in kconfig, this series should for now
not have any impact.


1) Patch Organization
=====================

Patch #1 -> #4: make more room on 64BIT in order-1 folios

Patch #5 -> #7: prepare for MM owner tracking of large folios

Patch #8: implement a simple MM owner tracking approach for large folios

patch #9: simple optimization

Patch #10: COW reuse for PTE-mapped anon THP

Patch #11 -> #17: introduce and implement CONFIG_NO_PAGE_MAPCOUNT


2) MM owner tracking
====================

Similar to my advanced approach [2], we assign each MM a unique 20-bit ID
("MM ID"), to be able to squeeze more information in our folios.

Each large folios can store two MM-ID+mapcount combination:
* mm0_id + mm0_mapcount
* mm1_id + mm1_mapcount

Combined with the large mapcount, we can reliably identify whether one
of these MMs is the current owner (-> owns all mappings) or even holds
all folio references (-> owns all mappings, and all references are from
mappings).

Stored MM IDs can only change if the corresponding mapcount is logically
0, and if the folio is currently "mapped exclusively".

As long as only two MMs map folio pages at a time, we can reliably identify
whether a large folio is "mapped shared" or "mapped exclusively". The
approach is precise.

Any MM mapping the folio while two other MMs are already mapping the folio,
will lead to a "mapped shared" detection even after all other MMs stopped
mapping the folio and it is actually "mapped exclusively": we can have
false positives but never false negatives when detecting "mapped shared".

So that's where the approach gets imprecise.

For now, we use a bit-spinlock to sync the large mapcount + MM IDs + MM
mapcounts, and make sure we do keep the machinery fast, to not degrade
(un)map performance too much: for example, we make sure to only use a
single atomic (when grabbing the bit-spinlock), like we would already
perform when updating the large mapcount.

In the future, we might be able to use an arch_spin_lock(), but that's
future work.


3) CONFIG_NO_PAGE_MAPCOUNT
==========================

patch #11 -> #17 spell out and document what exactly is affected when
not maintaining the per-page mapcounts in large folios anymore.

For example, as we cannot maintain folio->_nr_pages_mapped anymore when
(un)mapping pages, we'll account a complete folio as mapped if a
single page is mapped.

As another example, we might now under-estimate the USS (Unique Set Size)
of a process, but never over-estimate it.

With a more elaborate approach for MM-owner tracking like #1, some things
could be improved (e.g., USS to some degree), but somethings just cannot be
handled like we used to without these per-page mapcounts (e.g.,
folio->_nr_pages_mapped).


4) Performance
==============

The following kernel config combinations are possible:

* Base: CONFIG_PAGE_MAPCOUNT
  -> (existing) page-mapcount tracking
* MM-ID: CONFIG_MM_ID && CONFIG_PAGE_MAPCOUNT
  -> page-mapcount + MM-ID tracking
* No-Mapcount: CONFIG_MM_ID && CONFIG_NO_PAGE_MAPCOUNT
  -> MM-ID tracking


I run my PTE-mapped-THP microbenchmarks [3] and vm-scalability on a machine
with two NUMA nodes, with a 10-core Intel(R) Xeon(R) Silver 4210R CPU @
2.40GHz and 16 GiB of memory each.

4.1) PTE-mapped-THP microbenchmarks
-----------------------------------

All benchmarks allocate 1 GiB of THPs of a given size, to then fork()/
munmap/... PMD-sized THPs are mapped by PTEs first.

Numbers are increase (+) / reduction (-) in runtime. Reduction (-) is
good. "Base" is the baseline.

munmap: munmap() the allocated memory.

Folio Size |  MM-ID | No-Mapcount
--------------------------------
    16 KiB |   2 % |        -8 %
    32 KiB |   3 % |        -9 %
    64 KiB |   4 % |       -16 %
   128 KiB |   3 % |       -17 %
   256 KiB |   1 % |       -23 %
   512 KiB |   1 % |       -26 %
  1024 KiB |   0 % |       -29 %
  2048 KiB |   0 % |       -31 %

-> 32-128 with MM-ID are a bit unexpected: we would expect to see the worst
   case with the smallest size (16 KiB). But for these sizes also the STDEV
   is between 1% and 2%, in contrast to the others (< 1 %). Maybe some
   weird interaction with PCP/buddy.

fork: fork()

Folio Size |  MM-ID | No-Mapcount
--------------------------------
    16 KiB |    4 % |       -9 %
    32 KiB |    1 % |      -12 %
    64 KiB |    0 % |      -15 %
   128 KiB |    0 % |      -15 %
   256 KiB |    0 % |      -16 %
   512 KiB |    0 % |      -16 %
  1024 KiB |    0 % |      -17 %
  2048 KiB |   -1 % |      -21 %

-> Slight slowdown with MM-ID for the smallest folio size (more what we
expect in contrast to munmap()).

cow-byte: fork() and keep the child running. write one byte to each
  individual page, measuring the duration of all writes.

Folio Size |  MM-ID | No-Mapcount
--------------------------------
    16 KiB |    0 % |        0 %
    32 KiB |    0 % |        0 %
    64 KiB |    0 % |        0 %
   128 KiB |    0 % |        0 %
   256 KiB |    0 % |        0 %
   512 KiB |    0 % |        0 %
  1024 KiB |    0 % |        0 %
  2048 KiB |    0 % |        0 %

-> All other overhead dominates even when effectively unmapping
   single pages of large folios when replacing them by a copy during write
   faults. No change, which is great!

reuse-byte: fork() and wait until the child quit. write one byte to each
  individual page, measuring the duration of all writes.

Folio Size |  MM-ID | No-Mapcount
--------------------------------
    16 KiB |  -66 % |      -66 %
    32 KiB |  -65 % |      -65 %
    64 KiB |  -64 % |      -64 %
   128 KiB |  -64 % |      -64 %
   256 KiB |  -64 % |      -64 %
   512 KiB |  -64 % |      -64 %
  1024 KiB |  -64 % |      -64 %
  2048 KiB |  -64 % |      -64 %

-> No surprise, we reuse all pages instead of copying them.

child-reuse-bye: fork() and unmap the memory in the parent. write one byte
  to each individual page in the child, measuring the duration of all writes.

Folio Size |  MM-ID | No-Mapcount
--------------------------------
    16 KiB |  -66 % |      -66 %
    32 KiB |  -65 % |      -65 %
    64 KiB |  -64 % |      -64 %
   128 KiB |  -64 % |      -64 %
   256 KiB |  -64 % |      -64 %
   512 KiB |  -64 % |      -64 %
  1024 KiB |  -64 % |      -64 %
  2048 KiB |  -64 % |      -64 %

-> Same thing, we reuse all pages instead of copying them.


For 4 KiB, there is no change in any benchmark, as expected.


4.2) vm-scalability
-------------------

For now I only ran anon COW tests. I use 1 GiB per child process and use
one child per core (-> 20).

case-anon-cow-rand: random writes

There is effectively no change (<0.6% throughput difference).

case-anon-cow-seq: sequential writes

MM-ID has up to 2% *lower* throughout than Base, not really correlating to
folio size. The difference is almost as large as the STDEV (1% - 2%),
though. It looks like there is a very slight effective slowdown.

No-Mapcount has up to 3% *higher* throughput than Base, not really
correlating to the folio size. However, also here the difference is almost
as large as the STDEV (up to 2%). It looks like there is a very slight
effective speedup.

In summary, no earth-shattering slowdown with MM-ID (and we just recently
optimized folio->_nr_pages_mapped to give us some speedup :) ), and
another nice improvement with No-Mapcount.


I did a bunch of cross-compiles and the build bots turned out very helpful
over the last months. I did quite some testing with LTP and selftests,
but x86-64 only.

To keep the CC list short, adding only relevant maintainers (CCed on all
patches, sorry ;) ).

Related things on my long TODO list:
* Replace the "entire mapcount" by a "pmd mapcount" and a "pud mapcount",
  to prepare for handling folios > PMD size. Maybe we'll have to keep it for
  hugetlb, we'll see.
* Make hugetlb play along, and possibly use the same mapcount scheme. PMD
  page table sharing doesn't play along and needs some thought. There might
  be ways, but it's a bit more involved.
* Make AnonExclusive a per-folio flag. It's complicated.

[1] https://lwn.net/Articles/974223/
[2] https://lore.kernel.org/linux-mm/a9922f58-8129-4f15-b160-e0ace581bcbe@redhat.com/T/
[3] https://gitlab.com/davidhildenbrand/scratchspace/-/raw/main/pte-mapped-folio-benchmarks.c

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Zefan Li <lizefan.x@bytedance.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: "Michal Koutný" <mkoutny@suse.com>
Cc: Jonathan Corbet <corbet@lwn.net> 
Cc: Andy Lutomirski <luto@kernel.org> 
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>

David Hildenbrand (17):
  mm: factor out large folio handling from folio_order() into
    folio_large_order()
  mm: factor out large folio handling from folio_nr_pages() into
    folio_large_nr_pages()
  mm/rmap: use folio_large_nr_pages() in add/remove functions
  mm: let _folio_nr_pages overlay memcg_data in first tail page
  mm/rmap: pass dst_vma to page_try_dup_anon_rmap() and
    page_dup_file_rmap()
  mm/rmap: pass vma to __folio_add_rmap()
  mm/rmap: abstract large mapcount operations for large folios
    (!hugetlb)
  mm/rmap: initial MM owner tracking for large folios (!hugetlb)
  bit_spinlock: __always_inline (un)lock functions
  mm: COW reuse support for PTE-mapped THP with CONFIG_MM_ID
  mm: CONFIG_NO_PAGE_MAPCOUNT to prepare for not maintain per-page
    mapcounts in large folios
  mm: remove per-page mapcount dependency in
    folio_likely_mapped_shared() (CONFIG_NO_PAGE_MAPCOUNT)
  fs/proc/page: remove per-page mapcount dependency for /proc/kpagecount
    (CONFIG_NO_PAGE_MAPCOUNT)
  fs/proc/task_mmu: remove per-page mapcount dependency for
    PM_MMAP_EXCLUSIVE (CONFIG_NO_PAGE_MAPCOUNT)
  fs/proc/task_mmu: remove per-page mapcount dependency for "mapmax"
    (CONFIG_NO_PAGE_MAPCOUNT)
  fs/proc/task_mmu: remove per-page mapcount dependency for
    smaps/smaps_rollup (CONFIG_NO_PAGE_MAPCOUNT)
  mm: stop maintaining the per-page mapcount of large folios
    (CONFIG_NO_PAGE_MAPCOUNT)

 .../admin-guide/cgroup-v1/memory.rst          |   4 +
 Documentation/admin-guide/cgroup-v2.rst       |  10 +-
 Documentation/admin-guide/mm/pagemap.rst      |  16 +-
 Documentation/filesystems/proc.rst            |  28 ++-
 Documentation/mm/transhuge.rst                |  39 +++-
 arch/x86/entry/vdso/vdso32/fake_32bit_build.h |   1 +
 fs/proc/internal.h                            |  33 +++
 fs/proc/page.c                                |  18 +-
 fs/proc/task_mmu.c                            |  41 +++-
 include/linux/bit_spinlock.h                  |   8 +-
 include/linux/mm.h                            |  60 +++--
 include/linux/mm_types.h                      |  57 ++++-
 include/linux/page-flags.h                    |  41 ++++
 include/linux/rmap.h                          | 213 ++++++++++++++++--
 kernel/fork.c                                 |  36 +++
 mm/Kconfig                                    |  31 +++
 mm/huge_memory.c                              |  16 +-
 mm/internal.h                                 |  33 ++-
 mm/memory.c                                   |  97 ++++++--
 mm/page_alloc.c                               |  18 +-
 mm/page_owner.c                               |   2 +-
 mm/rmap.c                                     |  97 ++++++--
 22 files changed, 781 insertions(+), 118 deletions(-)

Comments

David Hildenbrand Oct. 23, 2024, 9:10 a.m. UTC | #1
On 29.08.24 18:56, David Hildenbrand wrote:
> RMAP overhaul and optimizations, PTE batching, large mapcount,
> folio_likely_mapped_shared() introduction and optimizations, page_mapcount
> cleanups and preparations ... it's been quite some work to get to this
> point.
> 
> Next up is being able to identify -- without false positives, without
> page-mapcounts and without page table/rmap scanning -- whether a
> large folio is "mapped exclusively" into a single MM, and using that
> information to implement Copy-on-Write reuse and to improve
> folio_likely_mapped_shared() for large folios.
> 
> ... and based on that, finally introducing a kernel config option that
> let's us not use+maintain per-page mapcounts in large folios, improving
> performance of (un)map operations today, taking one step towards
> supporting large folios  > PMD_SIZE, and preparing for the bright future
> where we might no longer have a mapcount per page at all.
> 
> The bigger picture was presented at LSF/MM [1].
> 
> This series is effectively a follow-up on my early work from last
> year [2], which proposed a precise way to identify whether a large folio is
> "mapped shared" into multiple MMs or "mapped exclusively" into a single MM.
> 
> While that advanced approach has been simplified and optimized in the
> meantime, let's start with something simpler first -- "certainly mapped
> exclusive" vs. ""maybe mapped shared" -- so we can start learning about
> the effects and TODOs that some of the implied changes of losing
> per-page mapcounts has.
> 
> I have plans to exchange the simple approach used in this series at some
> point by the advanced approach, but one important thing to learn if the
> imprecision in the simple approach is relevant in practice.
> 
> 64BIT only, and unless enabled in kconfig, this series should for now
> not have any impact.
> 
> 
> 1) Patch Organization
> =====================
> 
> Patch #1 -> #4: make more room on 64BIT in order-1 folios
> 
> Patch #5 -> #7: prepare for MM owner tracking of large folios
> 
> Patch #8: implement a simple MM owner tracking approach for large folios
> 
> patch #9: simple optimization
> 
> Patch #10: COW reuse for PTE-mapped anon THP
> 
> Patch #11 -> #17: introduce and implement CONFIG_NO_PAGE_MAPCOUNT
> 
> 
> 2) MM owner tracking
> ====================
> 
> Similar to my advanced approach [2], we assign each MM a unique 20-bit ID
> ("MM ID"), to be able to squeeze more information in our folios.
> 
> Each large folios can store two MM-ID+mapcount combination:
> * mm0_id + mm0_mapcount
> * mm1_id + mm1_mapcount
> 
> Combined with the large mapcount, we can reliably identify whether one
> of these MMs is the current owner (-> owns all mappings) or even holds
> all folio references (-> owns all mappings, and all references are from
> mappings).
> 
> Stored MM IDs can only change if the corresponding mapcount is logically
> 0, and if the folio is currently "mapped exclusively".
> 
> As long as only two MMs map folio pages at a time, we can reliably identify
> whether a large folio is "mapped shared" or "mapped exclusively". The
> approach is precise.
> 
> Any MM mapping the folio while two other MMs are already mapping the folio,
> will lead to a "mapped shared" detection even after all other MMs stopped
> mapping the folio and it is actually "mapped exclusively": we can have
> false positives but never false negatives when detecting "mapped shared".
> 
> So that's where the approach gets imprecise.
> 
> For now, we use a bit-spinlock to sync the large mapcount + MM IDs + MM
> mapcounts, and make sure we do keep the machinery fast, to not degrade
> (un)map performance too much: for example, we make sure to only use a
> single atomic (when grabbing the bit-spinlock), like we would already
> perform when updating the large mapcount.
> 
> In the future, we might be able to use an arch_spin_lock(), but that's
> future work.
> 
> 
> 3) CONFIG_NO_PAGE_MAPCOUNT
> ==========================
> 
> patch #11 -> #17 spell out and document what exactly is affected when
> not maintaining the per-page mapcounts in large folios anymore.
> 
> For example, as we cannot maintain folio->_nr_pages_mapped anymore when
> (un)mapping pages, we'll account a complete folio as mapped if a
> single page is mapped.
> 
> As another example, we might now under-estimate the USS (Unique Set Size)
> of a process, but never over-estimate it.
> 
> With a more elaborate approach for MM-owner tracking like #1, some things
> could be improved (e.g., USS to some degree), but somethings just cannot be
> handled like we used to without these per-page mapcounts (e.g.,
> folio->_nr_pages_mapped).
> 
> 
> 4) Performance
> ==============
> 
> The following kernel config combinations are possible:
> 
> * Base: CONFIG_PAGE_MAPCOUNT
>    -> (existing) page-mapcount tracking
> * MM-ID: CONFIG_MM_ID && CONFIG_PAGE_MAPCOUNT
>    -> page-mapcount + MM-ID tracking
> * No-Mapcount: CONFIG_MM_ID && CONFIG_NO_PAGE_MAPCOUNT
>    -> MM-ID tracking
> 
> 
> I run my PTE-mapped-THP microbenchmarks [3] and vm-scalability on a machine
> with two NUMA nodes, with a 10-core Intel(R) Xeon(R) Silver 4210R CPU @
> 2.40GHz and 16 GiB of memory each.
> 
> 4.1) PTE-mapped-THP microbenchmarks
> -----------------------------------
> 
> All benchmarks allocate 1 GiB of THPs of a given size, to then fork()/
> munmap/... PMD-sized THPs are mapped by PTEs first.
> 
> Numbers are increase (+) / reduction (-) in runtime. Reduction (-) is
> good. "Base" is the baseline.
> 
> munmap: munmap() the allocated memory.
> 
> Folio Size |  MM-ID | No-Mapcount
> --------------------------------
>      16 KiB |   2 % |        -8 %
>      32 KiB |   3 % |        -9 %
>      64 KiB |   4 % |       -16 %
>     128 KiB |   3 % |       -17 %
>     256 KiB |   1 % |       -23 %
>     512 KiB |   1 % |       -26 %
>    1024 KiB |   0 % |       -29 %
>    2048 KiB |   0 % |       -31 %
> 
> -> 32-128 with MM-ID are a bit unexpected: we would expect to see the worst
>     case with the smallest size (16 KiB). But for these sizes also the STDEV
>     is between 1% and 2%, in contrast to the others (< 1 %). Maybe some
>     weird interaction with PCP/buddy.
> 
> fork: fork()
> 
> Folio Size |  MM-ID | No-Mapcount
> --------------------------------
>      16 KiB |    4 % |       -9 %
>      32 KiB |    1 % |      -12 %
>      64 KiB |    0 % |      -15 %
>     128 KiB |    0 % |      -15 %
>     256 KiB |    0 % |      -16 %
>     512 KiB |    0 % |      -16 %
>    1024 KiB |    0 % |      -17 %
>    2048 KiB |   -1 % |      -21 %
> 
> -> Slight slowdown with MM-ID for the smallest folio size (more what we
> expect in contrast to munmap()).
> 
> cow-byte: fork() and keep the child running. write one byte to each
>    individual page, measuring the duration of all writes.
> 
> Folio Size |  MM-ID | No-Mapcount
> --------------------------------
>      16 KiB |    0 % |        0 %
>      32 KiB |    0 % |        0 %
>      64 KiB |    0 % |        0 %
>     128 KiB |    0 % |        0 %
>     256 KiB |    0 % |        0 %
>     512 KiB |    0 % |        0 %
>    1024 KiB |    0 % |        0 %
>    2048 KiB |    0 % |        0 %
> 
> -> All other overhead dominates even when effectively unmapping
>     single pages of large folios when replacing them by a copy during write
>     faults. No change, which is great!
> 
> reuse-byte: fork() and wait until the child quit. write one byte to each
>    individual page, measuring the duration of all writes.
> 
> Folio Size |  MM-ID | No-Mapcount
> --------------------------------
>      16 KiB |  -66 % |      -66 %
>      32 KiB |  -65 % |      -65 %
>      64 KiB |  -64 % |      -64 %
>     128 KiB |  -64 % |      -64 %
>     256 KiB |  -64 % |      -64 %
>     512 KiB |  -64 % |      -64 %
>    1024 KiB |  -64 % |      -64 %
>    2048 KiB |  -64 % |      -64 %
> 
> -> No surprise, we reuse all pages instead of copying them.
> 
> child-reuse-bye: fork() and unmap the memory in the parent. write one byte
>    to each individual page in the child, measuring the duration of all writes.
> 
> Folio Size |  MM-ID | No-Mapcount
> --------------------------------
>      16 KiB |  -66 % |      -66 %
>      32 KiB |  -65 % |      -65 %
>      64 KiB |  -64 % |      -64 %
>     128 KiB |  -64 % |      -64 %
>     256 KiB |  -64 % |      -64 %
>     512 KiB |  -64 % |      -64 %
>    1024 KiB |  -64 % |      -64 %
>    2048 KiB |  -64 % |      -64 %
> 
> -> Same thing, we reuse all pages instead of copying them.
> 
> 
> For 4 KiB, there is no change in any benchmark, as expected.
> 
> 
> 4.2) vm-scalability
> -------------------
> 
> For now I only ran anon COW tests. I use 1 GiB per child process and use
> one child per core (-> 20).
> 
> case-anon-cow-rand: random writes
> 
> There is effectively no change (<0.6% throughput difference).
> 
> case-anon-cow-seq: sequential writes
> 
> MM-ID has up to 2% *lower* throughout than Base, not really correlating to
> folio size. The difference is almost as large as the STDEV (1% - 2%),
> though. It looks like there is a very slight effective slowdown.
> 
> No-Mapcount has up to 3% *higher* throughput than Base, not really
> correlating to the folio size. However, also here the difference is almost
> as large as the STDEV (up to 2%). It looks like there is a very slight
> effective speedup.
> 
> In summary, no earth-shattering slowdown with MM-ID (and we just recently
> optimized folio->_nr_pages_mapped to give us some speedup :) ), and
> another nice improvement with No-Mapcount.
> 
> 
> I did a bunch of cross-compiles and the build bots turned out very helpful
> over the last months. I did quite some testing with LTP and selftests,
> but x86-64 only.

Gentle ping. I might soon have capacity to continue working on this. If 
there is no further feedback I'll rebase and resend.