[WIP,v1,07/20] mm/rmap_id: track if one or multiple MMs map a partially-mappable folio

In contrast to small folios and hugetlb folios, for a partially-mappable
folio (i.e., THP), the total mapcount is often not expressive enough to identify
whether such a folio is "mapped shared" or "mapped exclusively". For small
folios and hugetlb folios that are always entirely mapped, the single
mapcount is traditionally used for that purpose: is it 1? Then the folio
is currently mapped exclusively; is it bigger than 1? Then it's mapped
at least twice, and, therefore, considered "mapped shared".

For a partially-mappable folio, each individual PTE/PMD/... mapping
requires exactly one folio reference and one folio mapcount;
folio_mapcount() > 1 does not imply that the folio is "mapped shared".

While there are some obvious cases in which we can conclude that
partially-mappable folios are "mapped shared" -- see
folio_mapped_shared() -- it is currently not always possible to
precisely tell whether a folio is "mapped exclusively".

For implementing a precise variant of folio_mapped_shared() and for
COW-reuse support of PTE-mapped anon THP, we need an efficient and precise
way to identify "mapped shared" vs. "mapped exclusively".

So how could we track if more than one MM is currently mapping a folio in
its page tables? Having a list of MMs per folio, or even a counter for
each MM for each folio is clearly not feasible.

... but what if we could play some fun math games to perform this
tracking while requiring a handful of counters per folio, the exact number
of counters depending on the size of the folio?

1. !!! Experimental Feature !!!
===============================

We'll only support CONFIG_64BIT and !CONFIG_PREEMPT_RT (implied by THP
support) for now. As we currently never get partially-mappable folios
without CONFIG_TRANSPARENT_HUGEPAGE, let's limit to that to avoid
unnecessary rmap ID allocations for setups without THP.

32bit support might be possible if there is demand, limiting it to 64k
rmap IDs and reasonable folio sizes (e.g., <= order-15).
Similarly, RT might be possible if there is ever real demand for it.

The feature will be experimental initially, and, therefore, disabled by
default.

Once the involved math is considered solid, the implementation has seen
extended testing, and the performance implications are clear and have
either been optimized (e.g., rmap batching) or mitigated (e.g., do we
really have to perform this tracking for folios that are always assumed
shared, like folios mapping executables or shared libraries? Is some
hardware problematic?), we can consider always enabling it by default.

2. Per-mm rmap IDs
==================

We'll have to assign each MM an rmap ID that is smaller than
16*1024*1024 on 64bit. Note that this is significantly more than the
maximum number of processes we can possibly have in the system. There isn't
really a difference between supporting 16M IDs and 2M/4M IDs.

Due to the ID size limitation, we cannot use the MM pointer value and need
a separate ID allocator. Maybe we want to cache some rmap IDs per CPU?
Maybe we want to improve the allocation path? We can add such improvements
when deemed necessary.

In the distant future, we might want to allocate rmap IDs for selected
VMAs: for example, imagine a system call that does something like fork
(COW-sharing of pages) within a process for a range of anonymous memory,
ending up with a new VMA that wants a separate rmap ID. For now, per-MM
is simple and sufficient.

3. Tracking Overview
====================

We derive a sequence of special sub-IDs from our MM rmap ID.

Any time we map/unmap a part (e.g., PTE, PMD) of a partially-mappable
folio to/from a MM, we:

 (1) Adjust (increment/decrement) the mapcount of the folio
 (2) Adjust (add/remove) the folio rmap values using the MM sub-IDs

So the rmap values are always linked to the folio mapcount. Consequently,
we know that a single rmap value in the folio is the sum of exactly
 #folio_mapcount() rmap sub-IDs. To identify whether a single MM is
responsible for all folio_mapcount() mappings of a folio
("mapped exclusively") or whether other MMs are involved ("mapped shared"),
we perform the following checks:

 (1) Do we have more mappings than the folio has pages? Then the folio is
     certainly shared. That is, when "folio_mapcount() > folio_nr_pages()"
 (2) For each rmap value X, does that rmap value folio->_rmap_valX
     correspond to "folio_mapcount() * sub-ID[X]" of the MM?
     Then the folio is certainly exclusive. Note that we only check that
     when "folio_mapcount() <= folio_nr_pages()".
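
To illustrate the bookkeeping and the two checks above, here is a minimal
userspace sketch. It is not the kernel implementation: struct model_mm,
struct model_folio, NR_RMAP_VALS and the model_*() helpers are hypothetical
names made up for this example, the sub-IDs are simply handed in by the
caller, and there is no locking or atomicity at all.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NR_RMAP_VALS 2	/* assumption: two rmap values per folio */

/* Hypothetical stand-in for the per-MM state: one sub-ID per rmap value. */
struct model_mm {
	uint64_t subid[NR_RMAP_VALS];
};

/* Hypothetical stand-in for the per-folio state. */
struct model_folio {
	unsigned int nr_pages;
	uint64_t mapcount;
	uint64_t rmap_val[NR_RMAP_VALS];
};

/* Map one part of the folio into @mm: mapcount and rmap values move together. */
void model_map(struct model_folio *f, const struct model_mm *mm)
{
	f->mapcount++;
	for (int i = 0; i < NR_RMAP_VALS; i++)
		f->rmap_val[i] += mm->subid[i];
}

/* Unmap one part of the folio from @mm: the exact inverse of model_map(). */
void model_unmap(struct model_folio *f, const struct model_mm *mm)
{
	assert(f->mapcount > 0);
	f->mapcount--;
	for (int i = 0; i < NR_RMAP_VALS; i++)
		f->rmap_val[i] -= mm->subid[i];
}

/* Is @mm responsible for all mappings of the folio? */
bool model_mapped_exclusively(const struct model_folio *f,
			      const struct model_mm *mm)
{
	/* An unmapped folio is not mapped by anybody. */
	if (f->mapcount == 0)
		return false;
	/* Check (1): more mappings than pages is treated as shared. */
	if (f->mapcount > f->nr_pages)
		return false;
	/* Check (2): every rmap value must equal mapcount * sub-ID. */
	for (int i = 0; i < NR_RMAP_VALS; i++)
		if (f->rmap_val[i] != f->mapcount * mm->subid[i])
			return false;
	return true;
}
```

For example, with an order-2 folio, mapping two parts into an MM with
sub-IDs {1, 6} leaves the rmap values at {2, 12}, which matches
"2 * sub-ID" for that MM only; one additional mapping from an MM with
sub-IDs {5, 25} makes the check fail for both MMs.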

4. Synchronization
==================

We're using an atomic seqcount, stored in the folio, to allow for readers
to detect concurrent (un)mapping, whereby they could obtain a wrong
snapshot of the mapcount+rmap values and make a wrong decision.

Further, the mapcount and all rmap values are updated using RMW atomics,
to allow for concurrent updates.
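
The reader/writer interplay described above could look roughly like the
following userspace sketch using C11 atomics. The layout of the sequence
counter is an assumption made for this example only (low bits count
writers in flight, upper bits count completed updates); struct seq_folio,
WRITER_MASK, SEQ_INC and the seq_*() helpers are hypothetical names, and
the default sequentially-consistent atomics are used for simplicity
rather than tuned memory barriers.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define WRITER_MASK	0xffffull		/* assumption: writers in flight */
#define SEQ_INC		(WRITER_MASK + 1)	/* assumption: completed updates */

/* Hypothetical per-folio state with a single rmap value. */
struct seq_folio {
	_Atomic uint64_t seq;
	_Atomic uint64_t mapcount;
	_Atomic uint64_t rmap_val;
};

/* Writer side: may run concurrently with other writers. */
void seq_map(struct seq_folio *f, uint64_t subid)
{
	atomic_fetch_add(&f->seq, 1);		/* announce this writer */
	atomic_fetch_add(&f->mapcount, 1);	/* RMW atomic update */
	atomic_fetch_add(&f->rmap_val, subid);	/* RMW atomic update */
	/* Retire this writer and advance the sequence in one step. */
	atomic_fetch_add(&f->seq, SEQ_INC - 1);
}

/*
 * Reader side: returns true if mapcount and rmap value were read without
 * any (un)mapping running concurrently. On false, the caller can retry
 * or fall back to a conservative answer.
 */
bool seq_try_snapshot(struct seq_folio *f, uint64_t *mapcount,
		      uint64_t *rmap_val)
{
	uint64_t start = atomic_load(&f->seq);

	if (start & WRITER_MASK)
		return false;			/* a writer is in flight */
	*mapcount = atomic_load(&f->mapcount);
	*rmap_val = atomic_load(&f->rmap_val);
	/* Any writer that started in between changed the counter. */
	return atomic_load(&f->seq) == start;
}
```

The point of the sketch is only that a reader never trusts a
mapcount+rmap value pair unless the sequence counter was stable and
writer-free across both loads.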

5. sub-IDs
==========

To achieve (2), we generate sub-IDs that have the following property,
assuming that our folio has P=folio_nr_pages() pages.
  "2 * sub-ID" cannot be represented by the sum of any other *2* sub-IDs
  "3 * sub-ID" cannot be represented by the sum of any other *3* sub-IDs
  "4 * sub-ID" cannot be represented by the sum of any other *4* sub-IDs
  ...
  "P * sub-ID" cannot be represented by the sum of any other *P* sub-IDs

The sub-IDs are generated in generations, whereby
(1) Generation #0 is the number 0
(2) Generation #N takes all numbers from generations #0..#N-1 and adds
    (P + 1)^(N - 1), effectively doubling the number of sub-IDs

Consequently, the smallest number S in gen #N is:
  S[#N] = (P + 1)^(N - 1)

The largest number L in gen #N is:
  L[#N] = (P + 1)^(N - 1) + (P + 1)^(N - 2) + ... + (P + 1)^0 + 0
  -> [geometric sum with "P + 1 != 1"]
        = (1 - (P + 1)^N) / (1 - (P + 1))
        = (1 - (P + 1)^N) / (-P)
        = ((P + 1)^N - 1) / P

Example with P=4 (order-2 folio):

Generation #0:      0
------------------------     + (4 + 1)^0 = 1
Generation #1:      1
------------------------     + (4 + 1)^1 = 5
Generation #2:      5
                    6
------------------------     + (4 + 1)^2 = 25
Generation #3:     25
                   26
                   30
                   31
------------------------     + (4 + 1)^3 = 125
[...]
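
The generation scheme can be sketched in a few lines of userspace C. This
is only an illustration of the rules above, not the code in mm/rmap_id.c:
gen_subids() and its parameters are hypothetical names, and overflow of
the 64-bit values for large P or many generations is not handled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Fill @out with the sub-IDs of generations #0..#nr_gens for a folio with
 * @nr_pages pages and return how many were written (at most @cap).
 * Each generation copies everything generated so far and adds
 * (P + 1)^(N - 1), doubling the number of sub-IDs.
 */
size_t gen_subids(unsigned int nr_pages, unsigned int nr_gens,
		  uint64_t *out, size_t cap)
{
	uint64_t step = 1;	/* (P + 1)^(N - 1), starting with N = 1 */
	size_t n = 0;

	if (cap == 0)
		return 0;
	out[n++] = 0;		/* generation #0 is the number 0 */

	for (unsigned int gen = 1; gen <= nr_gens; gen++) {
		size_t prev = n;

		if (2 * prev > cap)
			break;	/* not enough room for a full generation */
		for (size_t i = 0; i < prev; i++)
			out[n++] = out[i] + step;
		step *= (uint64_t)nr_pages + 1;
	}
	return n;
}
```

Calling gen_subids(4, 3, ...) yields 0, 1, 5, 6, 25, 26, 30, 31, matching
the table above, and the largest value agrees with
L[#3] = ((4 + 1)^3 - 1) / 4 = 31.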

Intuitively, we are working with sub-counters that cannot overflow as
long as we have <= P components. Let's consider the simple case of P=3,
whereby our sub-counters are exactly 2-bit wide.

Subid |      Bits | Sub-counters
--------------------------------
 0    | 0000 0000 |   0,0,0,0
 1    | 0000 0001 |   0,0,0,1
 4    | 0000 0100 |   0,0,1,0
 5    | 0000 0101 |   0,0,1,1
 16   | 0001 0000 |   0,1,0,0
 17   | 0001 0001 |   0,1,0,1
 20   | 0001 0100 |   0,1,1,0
 21   | 0001 0101 |   0,1,1,1
 64   | 0100 0000 |   1,0,0,0
 65   | 0100 0001 |   1,0,0,1
 68   | 0100 0100 |   1,0,1,0
 69   | 0100 0101 |   1,0,1,1
 80   | 0101 0000 |   1,1,0,0
 81   | 0101 0001 |   1,1,0,1
 84   | 0101 0100 |   1,1,1,0
 85   | 0101 0101 |   1,1,1,1

So if we, say, have:
	3 * 17 = 0,3,0,3
how could we possibly get to that number by using 3 other subids? It's
impossible, because the sub-counters won't overflow as long as we stay
<= 3.

Interesting side note that might come in handy at some point: we also
cannot get to 0,3,0,3 by using 1 or 2 other subids. But, we could get to
1 * 17 = 0,1,0,1 by using 2 subids (16 and 1) or similarly to 2 * 17 =
0,2,0,2 by using 4 subids (2x16 and 2x1). Looks like we cannot get to
X * subid using any 1..X other subids.
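
For the small P=3 case, both the property and the side note can be
checked by brute force. The following userspace sketch does that for the
16 subids from the table; other_sum_exists() and property_holds() are
hypothetical helper names for this example, and "other" is read here as
"at least one of the summands differs from the subid in question".

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NR_PAGES	3	/* P = 3, as in the table above */
#define NR_SUBIDS	16

static const uint64_t subids[NR_SUBIDS] = {
	0, 1, 4, 5, 16, 17, 20, 21, 64, 65, 68, 69, 80, 81, 84, 85,
};

/*
 * Can @target be written as the sum of exactly @k subids where at least
 * one summand differs from @self? @used_other tracks whether such a
 * summand was already picked; callers start with false.
 */
bool other_sum_exists(uint64_t target, unsigned int k, uint64_t self,
		      bool used_other)
{
	if (k == 0)
		return target == 0 && used_other;

	for (int i = 0; i < NR_SUBIDS; i++) {
		if (subids[i] > target)
			continue;
		if (other_sum_exists(target - subids[i], k - 1, self,
				     used_other || subids[i] != self))
			return true;
	}
	return false;
}

/*
 * For every subid and every X in 1..P: "X * subid" must not be reachable
 * as a sum of 1..X subids that involves any other subid.
 */
bool property_holds(void)
{
	for (int i = 0; i < NR_SUBIDS; i++)
		for (unsigned int x = 1; x <= NR_PAGES; x++)
			for (unsigned int k = 1; k <= x; k++)
				if (other_sum_exists(x * subids[i], k,
						     subids[i], false))
					return false;
	return true;
}
```

property_holds() returns true for this table, while
other_sum_exists(17, 2, 17, false) and other_sum_exists(34, 4, 17, false)
return true, matching the two counterexamples given above for sums that
use more than X subids.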

Note 1: we'll add the actual detection logic to be used by
folio_mapped_shared() and wp_can_reuse_anon_folio() separately.

Note 2: we might want to use that infrastructure for hugetlb as well in the
future: there is nothing THP-specific about rmap ID handling.

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 include/linux/mm_types.h |  58 +++++++
 include/linux/rmap.h     | 126 +++++++++++++-
 kernel/fork.c            |  26 +++
 mm/Kconfig               |  21 +++
 mm/Makefile              |   1 +
 mm/huge_memory.c         |  16 +-
 mm/init-mm.c             |   4 +
 mm/page_alloc.c          |   9 +
 mm/rmap_id.c             | 351 +++++++++++++++++++++++++++++++++++++++
 9 files changed, 604 insertions(+), 8 deletions(-)
 create mode 100644 mm/rmap_id.c

Message ID	20231124132626.235350-8-david@redhat.com (mailing list archive)
State	New
From	David Hildenbrand <david@redhat.com>
To	linux-kernel@vger.kernel.org
Cc	linux-mm@kvack.org, David Hildenbrand <david@redhat.com>, Andrew Morton <akpm@linux-foundation.org>, Linus Torvalds <torvalds@linux-foundation.org>, Ryan Roberts <ryan.roberts@arm.com>, Matthew Wilcox <willy@infradead.org>, Hugh Dickins <hughd@google.com>, Yin Fengwei <fengwei.yin@intel.com>, Yang Shi <shy828301@gmail.com>, Ying Huang <ying.huang@intel.com>, Zi Yan <ziy@nvidia.com>, Peter Zijlstra <peterz@infradead.org>, Ingo Molnar <mingo@redhat.com>, Will Deacon <will@kernel.org>, Waiman Long <longman@redhat.com>, "Paul E. McKenney" <paulmck@kernel.org>
Date	Fri, 24 Nov 2023 14:26:12 +0100
Series	mm: precise "mapped shared" vs. "mapped exclusively" detection for PTE-mapped THP / partially-mappable folios
