[v3,00/20] mm: MM owner tracking for large folios (!hugetlb) + CONFIG_NO_PAGE_MAPCOUNT

Message ID 20250303163014.1128035-1-david@redhat.com

Message

David Hildenbrand March 3, 2025, 4:29 p.m. UTC
Some smaller changes based on Zi Yan's feedback (thanks!).


Let's add an "easy" way to decide -- without false positives, without
page-mapcounts and without page table/rmap scanning -- whether a large
folio is "certainly mapped exclusively" into a single MM, or whether it
"maybe mapped shared" into multiple MMs.

Use that information to implement Copy-on-Write reuse, to convert
folio_likely_mapped_shared() to folio_maybe_mapped_shared(), and to
introduce a kernel config option that lets us not use+maintain
per-page mapcounts in large folios anymore.

The bigger picture was presented at LSF/MM [1].

This series is effectively a follow-up on my early work [2], which
implemented a more precise, but also more complicated, way to identify
whether a large folio is "mapped shared" into multiple MMs or
"mapped exclusively" into a single MM.


1 Patch Organization
====================

Patch #1 -> #6: make more room in order-1 folios, so we have two
                "unsigned long" available for our purposes

Patch #7 -> #11: preparations

Patch #12: MM owner tracking for large folios

Patch #13: COW reuse for PTE-mapped anon THP

Patch #14: folio_maybe_mapped_shared()

Patch #15 -> #20: introduce and implement CONFIG_NO_PAGE_MAPCOUNT


2 MM owner tracking
===================

We assign each MM a unique ID ("MM ID") to be able to squeeze more
information into our folios. On 32bit we use 15-bit IDs, on 64bit we use
31-bit IDs.
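
As a rough illustration, with hypothetical names (the series may
allocate IDs differently), such an ID could be handed out at mm_struct
creation time from an IDA and returned on teardown:

#include <linux/idr.h>
#include <linux/gfp.h>

/*
 * Hypothetical sketch of MM ID allocation; ID 0 is assumed to be
 * reserved as a "no MM" dummy value.
 */
#ifdef CONFIG_64BIT
#define MY_MM_ID_MAX	((1U << 31) - 1)	/* 31-bit IDs */
#else
#define MY_MM_ID_MAX	((1U << 15) - 1)	/* 15-bit IDs */
#endif

static DEFINE_IDA(my_mm_ida);

static int my_mm_alloc_id(void)
{
	return ida_alloc_range(&my_mm_ida, 1, MY_MM_ID_MAX, GFP_KERNEL);
}

static void my_mm_free_id(int mm_id)
{
	ida_free(&my_mm_ida, mm_id);
}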

For each large folio, we now store two MM-ID+mapcount ("slot")
combinations:
* mm0_id + mm0_mapcount
* mm1_id + mm1_mapcount

On 32bit, we use a 16-bit per-MM mapcount, on 64bit an ordinary 32bit
mapcount. This way, we require 2x "unsigned long" on 32bit and 64bit for
both slots.
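
A minimal layout sketch, again with hypothetical names (the actual
struct in the series differs): both per-MM mapcounts fill one
"unsigned long", and both MM IDs, plus a spare bit used further below
as a bit-spinlock, fill the other:

#ifdef CONFIG_64BIT
typedef int my_mm_mapcount_t;		/* 32-bit per-MM mapcount */
#define MY_MM_ID_BITS	31
#else
typedef short my_mm_mapcount_t;		/* 16-bit per-MM mapcount */
#define MY_MM_ID_BITS	15
#endif
#define MY_MM_ID_MASK	((1UL << MY_MM_ID_BITS) - 1)

struct my_folio_mm_track {
	my_mm_mapcount_t mm_mapcount[2];	/* mm0 + mm1 mapcounts */
	unsigned long mm_ids;			/* [lock|mm1_id|mm0_id] */
};

static unsigned long my_slot_id(const struct my_folio_mm_track *t, int slot)
{
	return (t->mm_ids >> (slot * MY_MM_ID_BITS)) & MY_MM_ID_MASK;
}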

Paired with the large mapcount, we can reliably identify whether one
of these MMs is the current owner (-> owns all mappings) or even holds
all folio references (-> owns all mappings, and all references are from
mappings).
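
For example, the CoW reuse decision for a PTE-mapped anon folio can
then boil down to something like the following simplified sketch, built
on the hypothetical helpers above (not the series' actual code):

/*
 * Reuse the folio on write fault iff our MM owns all mappings and
 * every folio reference stems from those mappings.
 */
static bool my_wp_can_reuse_large_anon_folio(const struct my_folio_mm_track *t,
					     int large_mapcount, int refcount,
					     unsigned long my_mm_id)
{
	int slot;

	for (slot = 0; slot < 2; slot++) {
		if (my_slot_id(t, slot) == my_mm_id &&
		    t->mm_mapcount[slot] == large_mapcount)
			/* All mappings are ours; are all refs mappings? */
			return refcount == large_mapcount;
	}
	return false;
}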

As long as only two MMs map folio pages at a time, we can reliably and
precisely identify whether a large folio is "mapped shared" or
"mapped exclusively".

Any additional MM that starts mapping the folio while there are no free
slots becomes an "untracked MM". If one such "untracked MM" is the last
one mapping a folio exclusively, we will not detect the folio as
"mapped exclusively" but instead as "maybe mapped shared". (exception:
only a single mapping remains)

So that's where the approach gets imprecise.
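
In terms of the sketch above, the check could look as follows
(simplified, hypothetical): an untracked MM mapping the folio makes
both slot mapcounts fall short of the large mapcount, so we
conservatively report "maybe mapped shared":

static bool my_folio_maybe_mapped_shared(const struct my_folio_mm_track *t,
					 int large_mapcount)
{
	/* A single remaining mapping can only belong to a single MM. */
	if (large_mapcount <= 1)
		return false;
	/* Certainly exclusive iff one tracked MM owns all mappings. */
	if (t->mm_mapcount[0] == large_mapcount ||
	    t->mm_mapcount[1] == large_mapcount)
		return false;
	return true;
}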

For now, we use a bit-spinlock to sync the large mapcount + slots, and
make sure to keep the machinery fast so we don't degrade (un)map
performance drastically: for example, we only use a single atomic
operation when grabbing the bit-spinlock, just like the single atomic
we already perform when updating the large mapcount.
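
bit_spin_lock() on a spare bit of the IDs word keeps it at that single
atomic; roughly (the bit position is an assumption, and the series'
locking may differ in detail):

#include <linux/bit_spinlock.h>

#define MY_MM_IDS_LOCK_BIT	(BITS_PER_LONG - 1)	/* assumed spare bit */

static void my_folio_lock_mm_ids(struct my_folio_mm_track *t)
{
	/* One atomic test_and_set_bit() in the uncontended case. */
	bit_spin_lock(MY_MM_IDS_LOCK_BIT, &t->mm_ids);
}

static void my_folio_unlock_mm_ids(struct my_folio_mm_track *t)
{
	bit_spin_unlock(MY_MM_IDS_LOCK_BIT, &t->mm_ids);
}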


3 CONFIG_NO_PAGE_MAPCOUNT
=========================

Patch #15 -> #20 spell out and document what exactly is affected when
not maintaining the per-page mapcounts in large folios anymore.

Most importantly, as we cannot maintain folio->_nr_pages_mapped anymore when
(un)mapping pages, we'll account a complete folio as mapped if a
single page is mapped. In addition, we'll not detect partially mapped
anonymous folios as such in all cases yet.
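
A simplified sketch of that accounting change (hypothetical code, not
the actual rmap implementation):

/*
 * Pages to account as "mapped" when mapping part of a large folio
 * with nr_pages_total pages.
 */
static long my_nr_pages_to_account(bool first_mapping_of_folio,
				   long nr_pages_total,
				   long nr_pages_newly_mapped)
{
#ifdef CONFIG_NO_PAGE_MAPCOUNT
	/*
	 * Without folio->_nr_pages_mapped, the first mapping accounts
	 * the complete folio as mapped; later mappings account nothing.
	 */
	return first_mapping_of_folio ? nr_pages_total : 0;
#else
	/*
	 * Per-page mapcounts tell us exactly which pages transitioned
	 * from unmapped to mapped.
	 */
	return nr_pages_newly_mapped;
#endif
}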

Likely less relevant changes include that we might now under-estimate the
USS (Unique Set Size) of a process, but never over-estimate it.

The goal is to make CONFIG_NO_PAGE_MAPCOUNT the default at some point,
to then slowly make it the only option, as we learn about real-life
impacts and possible ways to mitigate them.


4 Performance
=============

Detailed performance numbers were included in v1 [3], and not that much
changed between v1 and v2.

I did plenty of measurements on different systems in the meantime, all
of which revealed slightly different results.

The pte-mapped-folio micro-benchmarks [4] are fairly sensitive to code
layout changes on some systems. The fork() benchmark in particular has
become shakier than before on recent kernels, for some reason.

In summary, with my micro-benchmarks:

* Small folios are not impacted.

* CoW performance seems to be mostly unchanged across all folio sizes.

* CoW reuse performance of large folios now matches CoW reuse performance
  of small folios, because we now actually implement the CoW reuse
  optimization. On an Intel Xeon Silver 4210R I measured a ~65% reduction
  in runtime, on an arm64 system I measured ~54% reduction.

* munmap() performance improves with CONFIG_NO_PAGE_MAPCOUNT. I saw
  double-digit % reduction (up to ~30% on an Intel Xeon Silver 4210R
  and up to ~70% on an AmpereOne A192-32X) with larger folios. The
  larger the folios, the larger the performance improvement.

* munmap() performance degrades very slightly (a couple percent) without
  CONFIG_NO_PAGE_MAPCOUNT for smaller folios. For larger folios, there
  seems to be no change at all.

* fork() performance improves with CONFIG_NO_PAGE_MAPCOUNT. I saw
  double-digit % reduction (up to ~20% on an Intel Xeon Silver 4210R
  and up to ~10% on an AmpereOne A192-32X) with larger folios. The larger
  the folios, the larger the performance improvement.

* While fork() performance without CONFIG_NO_PAGE_MAPCOUNT seems to be
  almost unchanged on some systems, I saw some degradation for
  smaller folios on the AmpereOne A192-32X. I did not investigate the
  details yet, but I suspect code layout changes or suboptimal code
  placement / inlining.

I'm not too worried about the fork() micro-benchmarks for smaller folios,
given how shaky the results are lately and by how much we improved fork()
performance recently.

I also ran case-anon-cow-rand and case-anon-cow-seq, part of vm-scalability,
to assess the scalability and the impact of the bit-spinlock.
My measurements on a 2-socket, 10-core Intel Xeon Silver 4210R system
revealed no significant changes.

Similarly, running these benchmarks with 2 MiB THPs enabled on the
AmpereOne A192-32X with 192 cores, I got < 1% difference with < 1% stdev,
which is nice.

So far, I did not get my hands on a similarly large system with multiple
sockets.

I found no other fitting scalability benchmarks that seem to really hammer
on concurrent mapping/unmapping of large folio pages like case-anon-cow-seq
does.


5 Concerns
==========

5.1 Bit spinlock
----------------

I'm not quite happy about the bit-spinlock, but so far it does not seem to
affect scalability in my measurements.

If it ever becomes a problem, we could either investigate improving the
locking, or simply stop the MM tracking once there are "too many
mappings" and assume that the folio is "mapped shared" until it is
freed.

This would be similar to (but slightly different from) the "0,1,2,stopped"
counting idea Willy had at some point. Adding such "stop tracking" logic
adds more code to the hot path, so I avoided it for now.
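
For completeness, a rough sketch of what such a saturating "stopped"
state could look like (purely hypothetical, not part of this series):

/* Hypothetical saturating tracking state, not implemented here. */
enum my_mm_track_state {
	MY_TRACK_0_MMS,		/* no MM maps the folio */
	MY_TRACK_1_MM,		/* certainly mapped exclusively */
	MY_TRACK_2_MMS,		/* two tracked MMs map it */
	MY_TRACK_STOPPED,	/* gave up: "maybe mapped shared" until freed */
};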


5.2 folio_maybe_mapped_shared()
-------------------------------

I documented the change from folio_likely_mapped_shared() to
folio_maybe_mapped_shared() quite extensively. If we run into surprises,
I have some ideas on how to resolve them. For now, I think we should
be fine.


5.3 Added code to map/unmap hot path
------------------------------------

So far, it looks like the added code on the rmap hot path does not
really seem to matter much in the bigger picture. I'd like to further
reduce it (and possibly improve fork() performance further), but I don't
easily see how right now. Well, and I am out of puff.

Comments

Andrew Morton March 3, 2025, 10:43 p.m. UTC | #1
On Mon,  3 Mar 2025 17:29:53 +0100 David Hildenbrand <david@redhat.com> wrote:

> Some smaller changes based on Zi Yan's feedback (thanks!).
> 
> 
> Let's add an "easy" way to decide -- without false positives, without
> page-mapcounts and without page table/rmap scanning -- whether a large
> folio is "certainly mapped exclusively" into a single MM, or whether it
> "maybe mapped shared" into multiple MMs.
> 
> Use that information to implement Copy-on-Write reuse, to convert
> folio_likely_mapped_shared() to folio_maybe_mapped_shared(), and to
> introduce a kernel config option that lets us not use+maintain
> per-page mapcounts in large folios anymore.
> 
> ...
>
> The goal is to make CONFIG_NO_PAGE_MAPCOUNT the default at some point,
> to then slowly make it the only option, as we learn about real-life
> impacts and possible ways to mitigate them.

I expect that we'll get very little runtime testing this way, and we
won't hear about that testing unless there's a failure.

Part of me wants to make it default on right now, but that's perhaps a
bit mean to linux-next testers.

Or perhaps default-off for now and switch to default-y for 6.15-rcX?

I suggest this just to push things along more aggressively - we may
choose to return to default-off after a few weeks of -rcX.