[v1,08/17] mm/rmap: initial MM owner tracking for large folios (!hugetlb)

Let's track for each large folio (excluding hugetlb for now) whether it is
certainly mapped exclusively (mapped by a single MM), or whether it may be
mapped shared (mapped by multiple MMs).

In an ideal world, we'd have a more precise "mapped exclusively" vs.
"mapped shared" tracking -- avoiding the "maybe" part -- but the
approaches to achieve that are a bit more involve, and we are going to
start with something simple so we can also make progress on per-page
mapcount removal. We can later easily exchange the tracking mechanism.

We'll use this information next to optimize COW reuse for PTE-mapped
anonymous THP, and implement folio_likely_mapped_shared() in kernel
configuration where the per-page mapcounts in large folios are no longer
maintained. We could start doing the MM owner tracking for anonymous
folios initially (COW reuse only applies to anon folios), but we'll keep
it simple and just do it also for pagecache folios: the new tracking
must be manually enabled via a kconfig option for now.

64bit only, because we cannot easily squeeze more stuff into the "struct
folio" of order-1 folios. 32bit might be possible in the future, for
example when limiting order-1 folios to 64bit only.

We'll remember for each large folio for two MMs that currently map this
folio, how often they are mapping folio pages (mapcount). As long as
a folio is unmapped or exclusively mapped, another MM can take a free
spot. We won't allow to take a free spot if the folio is not mapped
exclusively: primarily to avoid some corner cases where some mappings of
a MM are tracked via the slot, and others not (identified while working
on this).

In addition, we'll remember the current state (exclusive/shared) and use a
bit spinlock to sync on updates, and to require only a single atomic
operation for our updates. Using a bit spinlock is not ideal, but there
are not that many easy alternatives. We might be able to squeeze an
arch_spin_lock into the "struct folio" later, for now keep it simple. RT is
out of the picture with THP, and we can always optimize this later.

As we have to squeeze this information into the "struct folio" of even
folios of order-1 (2 pages), and we generally want to reduce the required
metadata, we'll assign each MM a unique ID that consumes less than 32 bit.
We'll limit the IDs to 20bit / 1M for now: we could allow for up to 30bit,
but getting even 1M IDs is unlikely in practice. If required, we could
raise the limit later, and the 1M limit might come in handy in the
future with other tracking approaches.

There won't be any false "mapped shared" detection as long as only two MMs
map pages of a folio at one point in time -- for example with fork() and
short-lived child processes, or with apps that hand over state from one
instance to another, like live-migrating VMs on the same host, effectively
migrating guest RAM via a mmap'ed files.

As soon as three MMs are involved at the same time, we might detect
"mapped shared" although the folio is now "mapped exclusively". Example:
(1) App1 faults in a (shmem/file-backed) folio -> Tracked as MM0
(2) App2 faults in the same folio -> Tracked as MM1
(3) App3 faults in the same folio -> Cannot be tracked separately
(4) App1 and App2 unmap the folio.
(5) We'll still detect "shared" even though only App3 maps the folio.

With multiple processes, this might have the potential to result in
unexpected owner changes, when migrating pages or when faulting them in:
assume a parent process fork()'s two short-lived child processes. We would
expect that the parent always remains tracked under MM0, but it could be
that at some point both child processes are tracked instead. For
file-backed memory, reclaim+refault can trigger something similar.

Keep compilation for the vdso32 hack working by un-defining CONFIG_MM_ID
like we for CONFIG_64BIT.

Make use of __always_inline to keep possible performance degradation
when (un)mapping large folios to a minimum.

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 Documentation/mm/transhuge.rst                |   8 ++
 arch/x86/entry/vdso/vdso32/fake_32bit_build.h |   1 +
 include/linux/mm_types.h                      |  23 ++++
 include/linux/page-flags.h                    |  41 ++++++
 include/linux/rmap.h                          | 126 ++++++++++++++++++
 kernel/fork.c                                 |  36 +++++
 mm/Kconfig                                    |  11 ++
 mm/huge_memory.c                              |   6 +
 mm/internal.h                                 |   6 +
 mm/page_alloc.c                               |  10 ++
 10 files changed, 268 insertions(+)

Message ID	20240829165627.2256514-9-david@redhat.com (mailing list archive)
State	New
Headers	show Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.133.124]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 3B9F71B5EBE for <linux-fsdevel@vger.kernel.org>; Thu, 29 Aug 2024 16:58:10 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=170.10.133.124 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1724950692; cv=none; b=j360Prq9Tuty8VH3oD0pSuPkvgVtpTVeYPpCTH4o+fL8JVqnrzs11T6yDRIaxtNIrMaCPDBcXLbo/eIixrA9CXLuFrNFYm5YKbTYrpbVU86CZkGrHoBBIsT7yihprEUTwbhRqwc3YsezGfGuKwn8R2NIZlilEWDS1GQQ1aEcpIY= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1724950692; c=relaxed/simple; bh=h3qLuDXwVIZz3/rCeEo8yr9v7u0QsQGiXf+MrwgchWo=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version; b=PC2EQXiD+kizFNxLSVfn0yHPPcz9SK/LQlKABenSA1dL23yRw8fIXt7wOrDbnKICLmIwLxreOhHOsZeG2k5WwpmWTJcgrjmiTdfKTzBZ2rtM3sO31vqeVVB9eAZME4DJGgMTknjhBIR8nDcJsBEHXYxsBGQhcVGWaCnsu9zgLd0= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=redhat.com; spf=pass smtp.mailfrom=redhat.com; dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com header.b=MA94pUop; arc=none smtp.client-ip=170.10.133.124 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=redhat.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=redhat.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com header.b="MA94pUop" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1724950689; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=iJ36TGls6X+NLaUI9I8k1fO0354o+pno0obx3qgSgO4=; b=MA94pUopSeod8x7/mua90Ziaw/8PFOBBXXCkpsyMQvBN2Ay8QbFzkzP0rpD2haghVuY+oa wL8+PaWLB24Ihya4x3nAant6y6Ed812rWsplUPt9QDgqWd2K9UT7ZWSp0nG9OWfw2wl8Jr kdPExGj1olDSSfbbxa+ltpMTOY4hL4M= Received: from mx-prod-mc-04.mail-002.prod.us-west-2.aws.redhat.com (ec2-54-186-198-63.us-west-2.compute.amazonaws.com [54.186.198.63]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id us-mta-592-MTIkZoMGPJK2qRqMLQImJQ-1; Thu, 29 Aug 2024 12:58:06 -0400 X-MC-Unique: MTIkZoMGPJK2qRqMLQImJQ-1 Received: from mx-prod-int-03.mail-002.prod.us-west-2.aws.redhat.com (mx-prod-int-03.mail-002.prod.us-west-2.aws.redhat.com [10.30.177.12]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256) (No client certificate requested) by mx-prod-mc-04.mail-002.prod.us-west-2.aws.redhat.com (Postfix) with ESMTPS id B18C7190ECCE; Thu, 29 Aug 2024 16:58:03 +0000 (UTC) Received: from t14s.redhat.com (unknown [10.39.193.245]) by mx-prod-int-03.mail-002.prod.us-west-2.aws.redhat.com (Postfix) with ESMTP id 45D921955F66; Thu, 29 Aug 2024 16:57:51 +0000 (UTC) From: David Hildenbrand <david@redhat.com> To: linux-kernel@vger.kernel.org Cc: linux-mm@kvack.org, cgroups@vger.kernel.org, x86@kernel.org, linux-fsdevel@vger.kernel.org, David Hildenbrand <david@redhat.com>, Andrew Morton <akpm@linux-foundation.org>, "Matthew Wilcox (Oracle)" <willy@infradead.org>, Tejun Heo <tj@kernel.org>, Zefan Li <lizefan.x@bytedance.com>, Johannes Weiner <hannes@cmpxchg.org>, =?utf-8?q?Michal_Koutn=C3=BD?= <mkoutny@suse.com>, Jonathan Corbet <corbet@lwn.net>, Andy Lutomirski <luto@kernel.org>, Thomas Gleixner <tglx@linutronix.de>, Ingo Molnar <mingo@redhat.com>, Borislav Petkov <bp@alien8.de>, Dave Hansen <dave.hansen@linux.intel.com> Subject: [PATCH v1 08/17] mm/rmap: initial MM owner tracking for large folios (!hugetlb) Date: Thu, 29 Aug 2024 18:56:11 +0200 Message-ID: <20240829165627.2256514-9-david@redhat.com> In-Reply-To: <20240829165627.2256514-1-david@redhat.com> References: <20240829165627.2256514-1-david@redhat.com> Precedence: bulk X-Mailing-List: linux-fsdevel@vger.kernel.org List-Id: <linux-fsdevel.vger.kernel.org> List-Subscribe: <mailto:linux-fsdevel+subscribe@vger.kernel.org> List-Unsubscribe: <mailto:linux-fsdevel+unsubscribe@vger.kernel.org> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Scanned-By: MIMEDefang 3.0 on 10.30.177.12
Series	mm: MM owner tracking for large folios (!hugetlb) + CONFIG_NO_PAGE_MAPCOUNT \| expand [v1,00/17] mm: MM owner tracking for large folios (!hugetlb) + CONFIG_NO_PAGE_MAPCOUNT [v1,01/17] mm: factor out large folio handling from folio_order() into folio_large_order() [v1,02/17] mm: factor out large folio handling from folio_nr_pages() into folio_large_nr_pages() [v1,03/17] mm/rmap: use folio_large_nr_pages() in add/remove functions [v1,04/17] mm: let _folio_nr_pages overlay memcg_data in first tail page [v1,05/17] mm/rmap: pass dst_vma to page_try_dup_anon_rmap() and page_dup_file_rmap() [v1,06/17] mm/rmap: pass vma to __folio_add_rmap() [v1,07/17] mm/rmap: abstract large mapcount operations for large folios (!hugetlb) [v1,08/17] mm/rmap: initial MM owner tracking for large folios (!hugetlb) [v1,09/17] bit_spinlock: __always_inline (un)lock functions [v1,10/17] mm: COW reuse support for PTE-mapped THP with CONFIG_MM_ID [v1,11/17] mm: CONFIG_NO_PAGE_MAPCOUNT to prepare for not maintain per-page mapcounts in large foli… [v1,12/17] mm: remove per-page mapcount dependency in folio_likely_mapped_shared() (CONFIG_NO_PAGE_… [v1,13/17] fs/proc/page: remove per-page mapcount dependency for /proc/kpagecount (CONFIG_NO_PAGE_M… [v1,14/17] fs/proc/task_mmu: remove per-page mapcount dependency for PM_MMAP_EXCLUSIVE (CONFIG_NO_P… [v1,15/17] fs/proc/task_mmu: remove per-page mapcount dependency for "mapmax" (CONFIG_NO_PAGE_MAPCO… [v1,16/17] fs/proc/task_mmu: remove per-page mapcount dependency for smaps/smaps_rollup (CONFIG_NO_… [v1,17/17] mm: stop maintaining the per-page mapcount of large folios (CONFIG_NO_PAGE_MAPCOUNT)

[v1,08/17] mm/rmap: initial MM owner tracking for large folios (!hugetlb)

Commit Message

Comments

Patch