[v1] mm/khugepaged: replace page_mapcount() check by folio_likely_mapped_shared()

We want to limit the use of page_mapcount() to places where absolutely
required, to prepare for kernel configs where we won't keep track of
per-page mapcounts in large folios.

khugepaged is one of the remaining "more challenging" page_mapcount()
users, but we might be able to move away from page_mapcount() without
resulting in a significant behavior change that would warrant
special-casing based on kernel configs.

In 2020, we first added support to khugepaged for collapsing COW-shared
pages via commit 9445689f3b61 ("khugepaged: allow to collapse a page shared
across fork"), followed by support for collapsing PTE-mapped THP in commit
5503fbf2b0b8 ("khugepaged: allow to collapse PTE-mapped compound pages")
and limiting the memory waste via the "page_count() > 1" check in commit
71a2c112a0f6 ("khugepaged: introduce 'max_ptes_shared' tunable").

By default, khugepaged allows up to half of the PTEs to map shared
pages, that is, pages where page_mapcount() > 1. MADV_COLLAPSE ignores
the khugepaged setting.
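
As a rough illustration of that limit, here is a minimal user-space sketch.
It assumes 512 PTEs per PMD range (4 KiB base pages with a 2 MiB PMD,
matching the selftest output further down). PTES_PER_PMD and
shared_limit_ok() are hypothetical names for this example, not kernel
definitions.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumption: 512 PTEs per PMD range (4 KiB pages, 2 MiB PMD). */
#define PTES_PER_PMD 512

/* Default described above: up to half of the PTEs may map shared pages. */
static const int max_ptes_shared = PTES_PER_PMD / 2;

/*
 * Hypothetical helper: may a collapse still proceed after nr_shared PTEs
 * were found to map shared pages? is_madv_collapse models MADV_COLLAPSE,
 * which ignores the khugepaged setting.
 */
static inline bool shared_limit_ok(int nr_shared, bool is_madv_collapse)
{
	assert(nr_shared >= 0 && nr_shared <= PTES_PER_PMD);

	if (is_madv_collapse)
		return true;
	return nr_shared <= max_ptes_shared;
}
```

With these assumptions, 256 shared PTEs are still acceptable for
khugepaged, 257 are not, and MADV_COLLAPSE proceeds in both cases.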

khugepaged currently does not care about swapcache page references and
does not perform the check under the folio lock, so in some corner cases
the "shared vs. exclusive" detection might be a bit off, making us detect
"exclusive" when it's actually "shared".

Most of our anonymous folios in the system are usually exclusive. We
frequently see sharing of anonymous folios for a short period of time,
after which our short-lived subprocesses either quit or exec().

There are some famous examples, though, where child processes exist for a
long time, and where memory is COW-shared with a lot of processes
(webservers, webbrowsers, sshd, ...) and COW-sharing is crucial for
reducing the memory footprint. We don't want to suddenly change the
behavior to result in a significant increase in memory waste.

Interestingly, khugepaged will only collapse an anonymous THP if at least
one PTE is writable. After fork(), that means that something (usually a
page fault) populated at least a single exclusive anonymous THP in that PMD
range.

So ... what happens when we switch to "is this folio mapped shared"
instead of "is this page mapped shared" by using
folio_likely_mapped_shared()?

For "not-COW-shared" folios, small folios and for THPs (large
folios) that are completely mapped into at least one process,
switching to folio_likely_mapped_shared() will not result in a change.

We'll only see a change for COW-shared PTE-mapped THPs that are
partially mapped into all involved processes.

There are two cases to consider:

(A) folio_likely_mapped_shared() returns "false" for a PTE-mapped THP

  If the folio is detected as exclusive, and it actually is exclusive,
  there is no change: page_mapcount() == 1. This is the common case
  without fork() or with short-lived child processes.

  folio_likely_mapped_shared() might currently still detect a folio as
  exclusive although it is shared (false negatives): if the first page is
  not mapped multiple times and if the average per-page mapcount is smaller
  than 1. That implies that (1) the folio is partially mapped, (2) if we
  are responsible for many of the mapcounts by mapping many pages, others
  can't be ("mostly exclusive"), and (3) if we are not responsible for many
  of the mapcounts because we map only few pages ("mostly shared"), it
  won't make a big impact on the end result.

  So while we might now detect a page as "exclusive" although it isn't,
  it's not expected to make a big difference in common cases.
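
The false-negative condition can be sketched with a small user-space
model. This is only an approximation based on the description above,
restricted to PTE-mapped large folios; struct model_folio and
model_likely_mapped_shared() are hypothetical names, and the exact
boundary conditions of the real helper may differ.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Simplified model of a PTE-mapped large anonymous folio. This is not
 * the kernel's struct folio; all fields are assumptions for illustration.
 */
struct model_folio {
	int nr_pages;			/* pages in the folio */
	int total_mapcount;		/* sum of all per-page mapcounts */
	int first_page_mapcount;	/* how often the first page is mapped */
};

/* Sketch of the heuristic as described in this patch description. */
static inline bool model_likely_mapped_shared(const struct model_folio *f)
{
	assert(f->nr_pages > 0);

	/* A single mapping can only belong to a single process. */
	if (f->total_mapcount <= 1)
		return false;
	/* More mappings than pages: some page must be mapped twice. */
	if (f->total_mapcount > f->nr_pages)
		return true;
	/* Otherwise, guess based on the first page only. */
	return f->first_page_mapcount > 1;
}
```

In this model, a 512-page folio where two processes each map the same 100
pages, but not the first page, has a total mapcount of 200 and is reported
as not shared: a false negative.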

(B) folio_likely_mapped_shared() returns "true" for a PTE-mapped THP

  folio_likely_mapped_shared() will never detect a large anonymous folio
  as shared although it is exclusive: there are no false positives.

  If we detect a THP as shared, at least one page of the THP is mapped by
  another process. It could well be that some pages are actually exclusive.
  For example, our child processes could have unmapped/COW'ed some pages
  such that they would now be exclusive to our process, which we now
  would treat as still-shared.

  Examples:
  (1) Parent maps all pages of a THP, child maps some pages. We detect
      all pages in the parent as shared although some are actually
      exclusive.
  (2) Parent maps all but some page of a THP, child maps the remainder.
      We detect all pages of the THP that the parent maps as shared
      although they are all exclusive.

  Even today, we wouldn't collapse a THP in (1): no PTE is writable,
  because a write fault would have resulted in COW of a single page and
  the parent would no longer map all pages of that THP.

  For (2) we would have collapsed a THP in the parent so far, now we
  wouldn't as long as the child process is still alive: unless the child
  process unmaps the remaining THP pages or we decide to split that THP.

  Possibly, the child COW'ed many pages, meaning that it's likely that
  we can populate a THP for our child first, and then for our parent.

  For (2), we are making really bad use of the THP in the first
  place (not even mapped completely in at least one process). If the
  THP would be completely partially mapped, it would be on the deferred
  split queue where we would split it lazily later.

  For short-running child processes, we don't particularly care. For
  long-running processes, the expectation is that such scenarios are
  rather rare: further, a THP might be best placed if most data in the
  PMD range is actually written, implying that we'll have to COW more
  pages first before khugepaged would collapse it.
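
To make examples (1) and (2) from case (B) concrete, the following
user-space sketch counts how many of the parent's PTEs would be treated as
shared under the old per-page check and under the new per-folio check. It
assumes a 512-page THP and that folio_likely_mapped_shared() already
returned "true" for it; struct thp_model and the helper names are
hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_PAGES 512	/* assumption: one THP spans 512 base pages */

/* Hypothetical model: which pages of one THP parent and child map by PTE. */
struct thp_model {
	bool parent_maps[NR_PAGES];
	bool child_maps[NR_PAGES];
};

/* Parent maps pages [p_start, p_end), child maps pages [c_start, c_end). */
static inline void thp_model_init(struct thp_model *t, int p_start, int p_end,
				  int c_start, int c_end)
{
	int i;

	assert(p_start >= 0 && p_end <= NR_PAGES);
	assert(c_start >= 0 && c_end <= NR_PAGES);
	for (i = 0; i < NR_PAGES; i++) {
		t->parent_maps[i] = i >= p_start && i < p_end;
		t->child_maps[i] = i >= c_start && i < c_end;
	}
}

/* Old check: a parent PTE is shared if that page's mapcount exceeds 1. */
static inline int parent_shared_per_page(const struct thp_model *t)
{
	int i, shared = 0;

	for (i = 0; i < NR_PAGES; i++)
		if (t->parent_maps[i] && t->child_maps[i])
			shared++;
	return shared;
}

/*
 * New check, assuming the folio was detected as shared: every parent PTE
 * that maps a page of this folio counts as shared.
 */
static inline int parent_shared_per_folio(const struct thp_model *t)
{
	int i, shared = 0;

	for (i = 0; i < NR_PAGES; i++)
		if (t->parent_maps[i])
			shared++;
	return shared;
}
```

For example (1) with the child mapping 100 pages, the per-page count is
100 while the per-folio count is 512. For example (2) with the parent
mapping 502 pages and the child the remaining 10, the counts are 0 and
502.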

To summarize, in the common case, this change is not expected to matter
much. The more common application of khugepaged operates on
exclusive pages, either before fork() or after a child quit.

Can we improve (A)? Yes, if we implement more precise tracking of "mapped
shared" vs. "mapped exclusively", we could get rid of the false
negatives completely.

Can we improve (B)? We could count how many pages of a large folio we map
inside the current page table and detect that we are responsible for most
of the folio mapcount and conclude "as good as exclusive", which might help
in some cases. ... but likely, some other mechanism should detect that
the THP is not put to good use in that scenario (not even mapped
completely in a single process) and try splitting that folio lazily etc.

We'll move the folio_test_anon() check before our "shared" check, so we
might get more expressive results for SCAN_EXCEED_SHARED_PTE: this order
of checks now matches the one in __collapse_huge_page_isolate(). Extend
documentation.
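
The effect of the reordered checks can be sketched as follows. This is a
user-space model, not the code in mm/khugepaged.c: the enum values, struct
model_pte and model_scan() are hypothetical names, and the scan is reduced
to the two checks discussed here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical result codes, loosely modeled after the scan results. */
enum model_scan_result {
	MODEL_SCAN_SUCCEED,
	MODEL_SCAN_NOT_ANON,
	MODEL_SCAN_EXCEED_SHARED_PTE,
};

/* Hypothetical per-PTE state: what the mapped folio looks like. */
struct model_pte {
	bool anon;		/* folio is anonymous */
	bool likely_shared;	/* folio detected as "mapped shared" */
};

/* Scan n PTEs: the "anon" check now comes before the "shared" check. */
static inline enum model_scan_result
model_scan(const struct model_pte *ptes, size_t n, int max_ptes_shared)
{
	int shared = 0;
	size_t i;

	assert(ptes != NULL || n == 0);

	for (i = 0; i < n; i++) {
		/* A non-anon folio is reported as such, shared or not. */
		if (!ptes[i].anon)
			return MODEL_SCAN_NOT_ANON;
		if (ptes[i].likely_shared && ++shared > max_ptes_shared)
			return MODEL_SCAN_EXCEED_SHARED_PTE;
	}
	return MODEL_SCAN_SUCCEED;
}
```

With the checks in this order, a shared non-anonymous folio is reported
as "not anon" instead of being counted towards the shared limit first.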

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Yang Shi <yang.shi@linux.alibaba.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---

How much time can one spend writing a patch description? Unbelievable. But
it was likely time well spent to have a clear picture of the impact.

This really needs the folio_likely_mapped_shared() optimization [1] that
resides in mm-unstable, I think, to reduce "false negatives".

The khugepaged MM selftests keep working as expected, including:

	Run test: collapse_max_ptes_shared (khugepaged:anon)
	Allocate huge page... OK
	Share huge page over fork()... OK
	Trigger CoW on page 255 of 512... OK
	Maybe collapse with max_ptes_shared exceeded.... OK
	Trigger CoW on page 256 of 512... OK
	Collapse with max_ptes_shared PTEs shared.... OK
	Check if parent still has huge page... OK

Where we check that collapsing in the parent behaves as expected after
COWing a lot of pages in the parent: a sane scenario that is essentially
unchanged and which does not depend on any action in the child process
(compared to the cases discussed in (B) above).

[1] https://lkml.kernel.org/r/20240409192301.907377-6-david@redhat.com

---
 Documentation/admin-guide/mm/transhuge.rst |  3 ++-
 mm/khugepaged.c                            | 22 +++++++++++++++-------
 2 files changed, 17 insertions(+), 8 deletions(-)

Message ID	20240424122630.495788-1-david@redhat.com (mailing list archive)
State	New
Headers	From: David Hildenbrand <david@redhat.com>
	To: linux-kernel@vger.kernel.org
	Cc: linux-mm@kvack.org, linux-doc@vger.kernel.org, David Hildenbrand <david@redhat.com>, Andrew Morton <akpm@linux-foundation.org>, Jonathan Corbet <corbet@lwn.net>, "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>, Zi Yan <ziy@nvidia.com>, Yang Shi <yang.shi@linux.alibaba.com>, John Hubbard <jhubbard@nvidia.com>, Ryan Roberts <ryan.roberts@arm.com>
	Subject: [PATCH v1] mm/khugepaged: replace page_mapcount() check by folio_likely_mapped_shared()
	Date: Wed, 24 Apr 2024 14:26:30 +0200
	Message-ID: <20240424122630.495788-1-david@redhat.com>